
Neuron Inference Performance

The following tables list reference inference performance for models covered in the Neuron Tutorials. Follow the links in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: April 29th, 2022

Natural Language Processing

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1142 | 42 | 43 | $0.055 | Batch | 1.19.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1146 | 42 | 42 | $0.055 | Batch | 1.19.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 2080 | 35 | 37 | $0.030 | Batch | 1.19.0 | Data Parallel | 9 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1970 | 28 | 29 | $0.032 | Batch | 1.19.0 | Data Parallel | 7 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1764 | 27 | 28 | $0.036 | Batch | 1.19.0 | Data Parallel | 6 | fp32, sequence-length=128 |

*Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.11.0 | inf1.xlarge | 966 | 22 | 22 | $0.066 | Batch | 1.19.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | HuggingFace distilBERT with Tensorflow2 | Tensorflow 2.8 | inf1.6xlarge | 3182 | 40 | 40 | $0.103 | Batch | 1.19.0 | Data Parallel | 128 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.11.0 | inf1.6xlarge | 2079 | 6 | 6 | $0.158 | Real Time | 1.19.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |

*Throughput and latency numbers in this table were generated using Neuron Tutorials.

Computer Vision

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenPose | Running OpenPose on Inferentia | Tensorflow 1.15 | inf1.xlarge | 58 | 60 | 67 | $1.101 | Real Time | 1.19.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | ResNet-50 optimization example | Tensorflow 1.15 | inf1.xlarge | 2260 | 18 | 22 | $0.028 | Batch | 1.19.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | Resnet50 model for Inferentia | PyTorch 1.11.0 | inf1.xlarge | 972 | 20 | 23 | $0.065 | Batch | 1.19.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.11.0 | inf1.2xlarge | 182 | 38 | 58 | $0.552 | Real Time | 1.19.0 | Data Parallel | 1 | fp32 |

*Throughput and latency numbers in this table were generated using Neuron Tutorials.

Note

Cost per 1M inferences is calculated using US East (N. Virginia) On-Demand hourly rate.

A Real Time application runs inference at batch size 1 to minimize latency. A Batch application uses a larger batch size to maximize throughput and minimize cost per inference.
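The cost column can be reproduced from the throughput column and the instance's On-Demand hourly rate: dollars per second divided by inferences per second, scaled to one million inferences. A minimal sketch, assuming the us-east-1 On-Demand rate for inf1.xlarge is $0.228/hour (rates change; check current EC2 pricing):

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of 1M inferences = (USD per second) / (inferences per second) * 1e6."""
    cost_per_inference = (hourly_rate_usd / 3600.0) / throughput_per_sec
    return cost_per_inference * 1e6

# BERT base (bert-base-cased) on inf1.xlarge at 1142 inferences/sec
# (0.228 USD/hour is an assumed rate, not taken from this document):
print(round(cost_per_million(0.228, 1142), 3))  # -> 0.055, matching the table
```

The same formula recovers the other rows on the same instance type, e.g. ResNet-50 at 2260 inferences/sec gives roughly $0.028.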