This document is relevant for: Inf1

# Inf1 Inference Performance

The following tables show reference inference performance for models covered in the Neuron tutorials. Follow the links in each row to replicate similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: November 23, 2022

## Natural Language Processing

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1179 | 41 | 42 | $0.054 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1180 | 41 | 41 | $0.054 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1948 | 25 | 25 | $0.033 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 1953 | 29 | 29 | $0.032 | Batch | 2.5.0 | Data Parallel | 7 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.11.0 | inf1.xlarge | 2134 | 22 | 25 | $0.030 | Batch | 2.5.0 | Data Parallel | 6 | fp32, sequence-length=128 |

*Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
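For reference, a compile-and-benchmark script for the first row might look like the following. This is a minimal sketch, assuming NeuronPerf's documented `compile`/`benchmark` interface for PyTorch (verify exact signatures against your installed version); the model name, sequence length, and batch size come from the table, while the input text and file name are illustrative.

```python
import neuronperf as npf
import neuronperf.torch  # framework-specific NeuronPerf submodule
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Settings taken from the first table row
model_name = "bert-base-cased"
seq_len = 128
batch_size = 6

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torchscript=True).eval()

# Build one example input at the benchmark batch size (text is illustrative)
enc = tokenizer(["NeuronPerf benchmark example."] * batch_size,
                max_length=seq_len, padding="max_length",
                truncation=True, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Compile for Inferentia, then benchmark the saved artifact
filename = npf.torch.compile(model, [example], batch_sizes=[batch_size],
                             filename="bert_base_cased_neuron.pt")
reports = npf.torch.benchmark(filename, [example], batch_sizes=[batch_size])

# Prints throughput and p50/p99 latency, the same metrics as the columns above
npf.print_reports(reports)
```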

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.11.0 | inf1.xlarge | 966 | 21 | 23 | $0.066 | Batch | 2.5.0 | Data Parallel | 6 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| DistilBERT base | HuggingFace distilBERT with Tensorflow2 | TensorFlow 2.8 | inf1.6xlarge | 3036 | 42 | 43 | $0.108 | Batch | 2.5.0 | Data Parallel | 128 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.11.0 | inf1.6xlarge | 1996 | 6 | 6 | $0.164 | Real Time | 2.5.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |

*Throughput and latency numbers in this table were generated using Neuron Tutorials.
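The Run Mode column distinguishes two ways of using the NeuronCores on an instance: Data Parallel loads a replica of the compiled model on each core and splits batches across them, while Model Pipeline shards one model across several cores so each core holds a slice of the layers. Below is a minimal sketch of both with torch-neuron; the model matches the tables, but the pipeline core count of 4 and the input text are illustrative assumptions.

```python
import torch
import torch_neuron
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True).eval()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Run mode example.", max_length=128, padding="max_length",
                truncation=True, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# Data Parallel: compile for a single core, then replicate across all
# visible NeuronCores; incoming batches are split across the replicas.
neuron_model = torch.neuron.trace(model, example_inputs=example)
data_parallel_model = torch.neuron.DataParallel(neuron_model)
outputs = data_parallel_model(*example)

# Model Pipeline: shard one model across several cores at compile time
# (4 cores is illustrative; the pipeline row above uses an inf1.6xlarge).
pipeline_model = torch.neuron.trace(
    model, example_inputs=example,
    compiler_args=["--neuroncore-pipeline-cores", "4"])
```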

## Computer Vision

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | Running OpenPose on Inferentia | TensorFlow 1.15 | inf1.xlarge | 57 | 60 | 67 | $1.105 | Real Time | 2.5.0 | Data Parallel | 1 | fp16 |
| ResNet-50 | ResNet-50 optimization example | TensorFlow 1.15 | inf1.xlarge | 2162 | 18 | 23 | $0.029 | Batch | 2.5.0 | Data Parallel | 10 | fp16 |
| ResNet-50 | Resnet50 model for Inferentia | PyTorch 1.11.0 | inf1.xlarge | 947 | 21 | 22 | $0.067 | Batch | 2.5.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.11.0 | inf1.2xlarge | 177 | 41 | 55 | $0.569 | Real Time | 2.5.0 | Data Parallel | 1 | fp32 |

*Throughput and latency numbers in this table were generated using Neuron Tutorials.
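For the PyTorch ResNet-50 row, the compile step reduces to a single trace call. This is a minimal sketch, assuming torchvision's pretrained weights; the batch size of 5 matches the table row, and the file name is illustrative.

```python
import torch
import torch_neuron
from torchvision import models

# Trace a pretrained FP32 ResNet-50 for Inferentia at the table's batch size
model = models.resnet50(pretrained=True).eval()
batch_size = 5  # batch size from the table row above
example = torch.rand(batch_size, 3, 224, 224)

neuron_model = torch.neuron.trace(model, example_inputs=[example])
neuron_model.save("resnet50_neuron_b5.pt")  # illustrative file name

# On an Inf1 instance, reload the compiled artifact and run without recompiling
restored = torch.jit.load("resnet50_neuron_b5.pt")
logits = restored(example)
```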

**Note**

Cost per 1M inferences is calculated using the US East (N. Virginia) On-Demand hourly rate.

A *Real Time* application runs inference at batch size 1 for minimal latency; a *Batch* application maximizes throughput at the lowest cost per inference.
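The cost column follows directly from the throughput column and the instance's hourly price: cost per 1M inferences = hourly rate / (throughput × 3600) × 10⁶. Below is a quick check against the first NLP row; the $0.228/hour inf1.xlarge price is an assumption based on historical us-east-1 On-Demand pricing, so verify it against current rates.

```python
def cost_per_1m_inferences(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Dollars per one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# inf1.xlarge at an assumed $0.228/hour, 1179 inferences/sec (first NLP row)
print(f"${cost_per_1m_inferences(0.228, 1179):.3f}")  # ~$0.054, matching the table
```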

This document is relevant for: Inf1