This document is relevant for: Inf1

Inf1 Inference Performance#

The following tables contain reference inference performance numbers for the models covered in the tutorials. Follow the links in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: April 11th, 2024

Encoder Models#

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|-------|---------|-----------|------------|-----------------------|------------------|------------------|------------------------|------------------|----------------|----------|------------|---------------|
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.13 | inf1.xlarge | 1056 | 21 | 21 | $0.060 | Batch | 2.18.0 | Data Parallel | 4 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | HuggingFace distilBERT with Tensorflow2 | Tensorflow 2.10 | inf1.6xlarge | 2125 | 30 | 32 | $0.154 | Batch | 2.18.0 | Data Parallel | 16 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.13 | inf1.6xlarge | 1968 | 6 | 6 | $0.167 | Real Time | 2.18.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1121 | 57 | 61 | $0.056 | Batch | 2.18.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1182 | 41 | 45 | $0.054 | Batch | 2.18.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1875 | 34 | 53 | $0.034 | Batch | 2.18.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1872 | 34 | 55 | $0.034 | Batch | 2.18.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1498 | 15 | 27 | $0.042 | Batch | 2.18.0 | Data Parallel | 6 | fp32, sequence-length=128 |
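The Compile + Benchmark rows in these tables come from scripts that first trace the checkpoint with torch-neuron and save the compiled artifact before measuring it. As a rough orientation only, a minimal compile sketch might look like the following; the checkpoint, input shapes, and output file name are illustrative and not taken from the actual scripts.

```python
import torch
import torch_neuron  # registers the torch.neuron namespace
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; the other encoder checkpoints in the tables
# follow the same pattern.
name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True)
model.eval()

# Dummy batch at the benchmarked shape: batch size 8, sequence length 128.
batch = tokenizer(
    ["example input"] * 8,
    padding="max_length",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
example = (batch["input_ids"], batch["attention_mask"])

# Trace for Inferentia; unsupported operators fall back to CPU.
neuron_model = torch.neuron.trace(model, example)
neuron_model.save("bert_neuron_b8.pt")
```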

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|-------|---------|-----------|------------|-----------------------|------------------|------------------|------------------------|------------------|----------------|----------|------------|---------------|
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 126 | 8 | 8 | $0.503 | Real Time | 2.18.0 | Data Parallel | 1 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 284 | 10 | 11 | $0.223 | Real Time | 2.18.0 | Data Parallel | 3 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 541 | 11 | 11 | $0.117 | Real Time | 2.18.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 499 | 10 | 11 | $0.127 | Real Time | 2.18.0 | Data Parallel | 5 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 434 | 7 | 7 | $0.146 | Real Time | 2.18.0 | Data Parallel | 3 | fp32, sequence-length=128 |

Note

Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
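For orientation, a minimal NeuronPerf sketch is shown below. It assumes a model already compiled with torch-neuron and saved to disk (for example, by a compile script like the one above); the file name and input shapes are illustrative, and the NeuronPerf documentation is the authority on the exact API.

```python
import torch
import neuronperf as npf
import neuronperf.torch  # selects the torch backend

# Illustrative artifact: a BERT model compiled at batch size 8,
# sequence length 128.
filename = "bert_neuron_b8.pt"

# Dummy inputs matching the traced signature (input_ids, attention_mask).
example = (
    torch.zeros((8, 128), dtype=torch.long),
    torch.zeros((8, 128), dtype=torch.long),
)

# Benchmark and print a summary of throughput and latency percentiles.
reports = npf.torch.benchmark(filename, [example], batch_sizes=[8])
npf.print_reports(reports)
```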

Convolutional Neural Networks (CNN) Models#

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|-------|----------|-----------|------------|-----------------------|------------------|------------------|------------------------|------------------|----------------|----------|------------|---------------|
| OpenPose | Running OpenPose on Inferentia | Tensorflow 1.15 | inf1.xlarge | 58 | 60 | 67 | $1.101 | Real Time | 2.12.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | ResNet-50 optimization example | Tensorflow 1.15 | inf1.xlarge | 2207 | 18 | 23 | $0.029 | Batch | 2.12.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | Resnet50 model for Inferentia | PyTorch 1.13 | inf1.xlarge | 928 | 22 | 22 | $0.068 | Batch | 2.18.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.13 | inf1.2xlarge | 178 | 41 | 52 | $0.566 | Real Time | 2.18.0 | Data Parallel | 1 | fp32 |

Note

Throughput and latency numbers in this table were generated using the Neuron Tutorials linked in each row.

Note

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

The Real Time application type refers to batch size 1 inference, tuned for minimal latency. The Batch application type refers to maximum throughput at minimum cost per inference.
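For reference, the cost column follows mechanically from throughput and the instance's hourly rate. A sketch of the arithmetic, using a hypothetical inf1.xlarge rate (check current EC2 On-Demand pricing for real values):

```python
# Cost per 1M inferences = hourly rate / inferences per hour * 1,000,000.
hourly_rate_usd = 0.228    # hypothetical inf1.xlarge On-Demand rate (USD/hour)
throughput_per_sec = 1056  # BERT base row in the first table above

cost_per_1m = hourly_rate_usd / (throughput_per_sec * 3600) * 1_000_000
print(f"${cost_per_1m:.3f} per 1M inferences")  # ~ $0.060
```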

This document is relevant for: Inf1