Neuron Inference Performance

The following tables contain the reference inference performance for models in the Neuron Tutorials. Follow the link on each row to reproduce similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: October 27th, 2021

Natural Language Processing

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.9.1 | inf1.xlarge | 966 | 22 | 23 | $0.066 | Batch | 1.16.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.9.1 | inf1.6xlarge | 2037 | 6 | 6 | $0.161 | Real Time | 1.16.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
| BERT base | HuggingFace distilBERT with Tensorflow2 | Tensorflow 2.5 | inf1.6xlarge | 1035 | 123 | 125 | $0.317 | Batch | 1.16.0 | Data Parallel | 128 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |

Note

Cost per 1M inferences is calculated using US East (N. Virginia) On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
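The cost figures above follow directly from the hourly instance rate and the measured throughput. A minimal sketch of that arithmetic is below; the On-Demand hourly rates in the dictionary are assumptions (US East, N. Virginia, around the time of this document) and may have changed since, so always check current EC2 pricing:

```python
# Sketch of the cost-per-1M-inferences calculation described in the note above.
# NOTE: these hourly rates are assumptions for illustration, not quoted from
# this document; verify against current EC2 On-Demand pricing.
HOURLY_RATE_USD = {
    "inf1.xlarge": 0.228,
    "inf1.2xlarge": 0.362,
    "inf1.6xlarge": 1.180,
}

def cost_per_1m_inferences(instance_type: str, throughput_per_sec: float) -> float:
    """Cost of 1M inferences = (hourly rate / 3600 sec) / throughput * 1e6."""
    cost_per_second = HOURLY_RATE_USD[instance_type] / 3600
    return cost_per_second / throughput_per_sec * 1_000_000

# e.g. BERT base on inf1.xlarge at 966 inferences/sec:
print(round(cost_per_1m_inferences("inf1.xlarge", 966), 3))  # 0.066
```

With these assumed rates, the formula reproduces the table's figures (e.g. $0.066 for BERT base on inf1.xlarge, $0.161 for the NeuronCore Pipeline row on inf1.6xlarge).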

Computer Vision

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | Running OpenPose on Inferentia | Tensorflow 1.15 | inf1.xlarge | 58 | 62 | 69 | $1.101 | Real Time | 1.16.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | ResNet-50 optimization example | Tensorflow 1.15 | inf1.xlarge | 2270 | 17 | 21 | $0.028 | Batch | 1.16.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | Resnet50 model for Inferentia | PyTorch 1.9.1 | inf1.xlarge | 919 | 22 | 23 | $0.069 | Batch | 1.16.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.9.1 | inf1.2xlarge | 185 | 42 | 61 | $0.544 | Real Time | 1.16.0 | Data Parallel | 1 | fp32 |
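The P50 and P99 latency columns summarize the distribution of per-request latencies collected during a benchmark run. A minimal sketch of how such percentiles can be computed from raw samples, using the nearest-rank method (the sample data and function name here are illustrative, not from this document):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

# Illustrative per-request latencies in milliseconds (not measured data).
latencies_ms = [21, 22, 22, 22, 23, 23, 22, 22, 21, 40]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

A single slow outlier (the 40 ms sample above) barely moves P50 but dominates P99, which is why both columns are reported: P50 reflects typical latency, P99 the tail.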
