Neuron Inference Performance

The following tables contain reference inference performance figures for models covered in the Neuron Tutorials. Follow the links in each row to reproduce similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: August 12th, 2021

Natural Language Processing

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.8.1 | inf1.xlarge | 824 | 25 | 29 | $0.077 | Batch | 1.15.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.8.1 | inf1.6xlarge | 1681 | 7 | 8 | $0.195 | Real Time | 1.15.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |

Note

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A Real Time application runs inference at batch size 1 to minimize latency. A Batch application uses a larger batch size to maximize throughput and minimize cost per inference.
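As an illustration of how the cost column is derived, the short sketch below recomputes cost per 1M inferences from the instance's hourly price and the measured average throughput. The $0.228/hr figure for inf1.xlarge is an assumed example rate, not taken from this document; check current EC2 On-Demand pricing for your region.

```python
def cost_per_million_inferences(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of one million inferences = (price per second) / (inferences per second) * 1e6."""
    return (hourly_rate_usd / 3600.0) / throughput_per_sec * 1_000_000

# Example: BERT base on inf1.xlarge at 824 inferences/sec,
# assuming an On-Demand rate of $0.228/hr (US East, N. Virginia).
print(f"${cost_per_million_inferences(0.228, 824):.3f} per 1M inferences")  # ~$0.077
```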

Computer Vision

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (inferences/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | Running OpenPose on Inferentia | TensorFlow 1.15 | inf1.xlarge | 53 | 64 | 71 | $1.193 | Real Time | 1.15.0 | Data Parallel | 1 | fp16 |
| ResNet-50 | ResNet-50 optimization example | TensorFlow 1.15 | inf1.xlarge | 2230 | 18 | 23 | $0.028 | Batch | 1.15.0 | Data Parallel | 10 | fp16 |
| ResNet-50 | Resnet50 model for Inferentia | PyTorch 1.8.1 | inf1.xlarge | 821 | 24 | 26 | $0.077 | Batch | 1.15.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.8.1 | inf1.2xlarge | 164 | 48 | 70 | $0.613 | Real Time | 1.15.0 | Data Parallel | 1 | fp32 |

Note

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A Real Time application runs inference at batch size 1 to minimize latency. A Batch application uses a larger batch size to maximize throughput and minimize cost per inference.
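For context on the Run Mode column in the tables above, the sketch below shows the two modes with the torch-neuron 1.x API on an Inf1 instance. This is a rough, hedged example: the toy model, the batch sizes, and the 4-core pipeline setting are illustrative assumptions, not the configurations used to produce the numbers in these tables.

```python
import torch
import torch.nn as nn
import torch_neuron  # registers the torch.neuron.* APIs (Inf1 / torch-neuron 1.x)

# Toy stand-in model for illustration only; the tutorials above use BERT, ResNet-50, etc.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2)).eval()
example = torch.rand(1, 128)

# Run Mode "Data Parallel": compile for a single NeuronCore, then replicate the
# compiled model across NeuronCores so an incoming batch is split between them.
traced = torch.neuron.trace(model, example)
data_parallel = torch.neuron.DataParallel(traced)
out = data_parallel(torch.rand(6, 128))   # e.g. batch size 6, as in the BERT base row

# Run Mode "Model Pipeline": shard a single model across several NeuronCores at
# compile time (4 cores shown here purely as an example).
pipelined = torch.neuron.trace(
    model, example, compiler_args=["--neuroncore-pipeline-cores", "4"]
)
out = pipelined(example)                  # batch size 1, as in the NeuronCore Pipeline row
```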