This document is relevant for: Inf1

# Inf1 Inference Performance

The following tables contain reference inference performance numbers for the models covered in the tutorials. Follow the links in each row to reproduce similar results in your own environment. Refer to the Setup Environment documentation to create a new environment based on the latest Neuron release.

Last update: September 16th, 2024

## Encoder Models

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base | HuggingFace Pretrained BERT | PyTorch 1.13 | inf1.xlarge | 1056 | 20 | 21 | $0.029 | Batch | 2.20.0 | Data Parallel | 4 | fp32, bert-base-cased-finetuned-mrpc, sequence-length=128 |
| BERT base | HuggingFace distilBERT with Tensorflow2 | Tensorflow 2.10 | inf1.6xlarge | 2123 | 30 | 32 | $0.074 | Batch | 2.20.0 | Data Parallel | 16 | fp32, distilbert-base-uncased-finetuned-sst-2-english, sequence-length=128 |
| BERT base | Using NeuronCore Pipeline | PyTorch 1.13 | inf1.6xlarge | 2009 | 6 | 6 | $0.078 | Real Time | 2.20.0 | Model Pipeline | 1 | fp32, bert-base-uncased, sequence-length=128 |
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1095 | 58 | 65 | $0.028 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1181 | 41 | 45 | $0.026 | Batch | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1877 | 34 | 53 | $0.016 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1875 | 34 | 54 | $0.016 | Batch | 2.20.0 | Data Parallel | 8 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 1513 | 15 | 26 | $0.020 | Batch | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
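
The table above lists throughput-oriented (Batch) configurations; the table below lists the same compile-and-benchmark models tuned for Real Time serving. The Run Mode column describes how a model is placed on NeuronCores: Data Parallel replicates one compiled model across cores and splits incoming batches among them, while Model Pipeline shards a single model across several cores (the NeuronCore Pipeline row above). The following is a minimal sketch of both modes, assuming the torch-neuron API on an Inf1 instance; the toy model and the pipeline core count are illustrative stand-ins, not the exact tutorial code.

```python
import torch
import torch.neuron  # provided by the torch-neuron package for Inf1

# Toy stand-in for the BERT encoders benchmarked above.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
example = torch.rand(1, 128)

# Run Mode "Data Parallel": compile once for a single NeuronCore, then
# replicate the compiled model across all visible cores; input batches
# are split across the replicas.
neuron_model = torch.neuron.trace(model, example_inputs=[example])
parallel_model = torch.neuron.DataParallel(neuron_model)
output = parallel_model(torch.rand(8, 128))

# Run Mode "Model Pipeline": shard one model across multiple NeuronCores
# (4 here, for illustration) so requests stream through the cores.
pipelined_model = torch.neuron.trace(
    model,
    example_inputs=[example],
    compiler_args=["--neuroncore-pipeline-cores", "4"],
)
```

Data Parallel generally maximizes throughput for batch workloads, while pipelining a model across cores keeps per-request latency low at batch size 1, which matches the Application Type split in these tables.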

| Model | Scripts | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT base (bert-base-cased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 126 | 8 | 8 | $0.243 | Real Time | 2.20.0 | Data Parallel | 1 | fp32, sequence-length=128 |
| BERT base (bert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 285 | 10 | 11 | $0.107 | Real Time | 2.20.0 | Data Parallel | 3 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 538 | 11 | 12 | $0.057 | Real Time | 2.20.0 | Data Parallel | 6 | fp32, sequence-length=128 |
| DistilBERT base (distilbert-base-uncased-finetuned-sst-2-english) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 593 | 10 | 11 | $0.051 | Real Time | 2.20.0 | Data Parallel | 5 | fp32, sequence-length=128 |
| DistilRoBERTa base (distilroberta-base) | Compile + Benchmark | PyTorch 1.13.1 | inf1.xlarge | 417 | 7 | 8 | $0.073 | Real Time | 2.20.0 | Data Parallel | 3 | fp32, sequence-length=128 |

Note

Throughput and latency numbers in this table were computed using NeuronPerf. To reproduce these results, install NeuronPerf and run the provided scripts.
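
As a starting point, here is a minimal sketch of such a run, assuming the neuronperf package is installed on an Inf1 instance; the toy model and batch size are placeholders for the actual tutorial models and settings.

```python
import torch
import neuronperf as npf
import neuronperf.torch  # framework-specific NeuronPerf bindings

# Toy stand-in for the encoder models above; the tutorials trace real
# HuggingFace checkpoints instead.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU())
inputs = torch.rand(1, 128)  # shape standing in for sequence-length=128
batch_sizes = [1]

# Compile for the target batch sizes, then benchmark the compiled artifacts.
# Reports include average throughput and P50/P99 latency, as in the tables.
model_index = npf.torch.compile(model, inputs, batch_sizes=batch_sizes)
reports = npf.torch.benchmark(model_index, inputs, batch_sizes=batch_sizes)
npf.print_reports(reports)
```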

## Convolutional Neural Networks (CNN) Models

| Model | Tutorial | Framework | Inst. Type | Avg Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenPose | Running OpenPose on Inferentia | Tensorflow 1.15 | inf1.xlarge | 58 | 60 | 67 | $0.531 | Real Time | 2.12.0 | Data Parallel | 1 | fp16 |
| Resnet-50 | ResNet-50 optimization example | Tensorflow 1.15 | inf1.xlarge | 2207 | 18 | 23 | $0.014 | Batch | 2.12.0 | Data Parallel | 10 | fp16 |
| Resnet-50 | Resnet50 model for Inferentia | PyTorch 1.13 | inf1.xlarge | 922 | 22 | 23 | $0.033 | Batch | 2.20.0 | Data Parallel | 5 | fp32 |
| YOLO v4 | Evaluate YOLO v4 on Inferentia | PyTorch 1.13 | inf1.2xlarge | 180 | 40 | 51 | $0.268 | Real Time | 2.20.0 | Data Parallel | 1 | fp32 |

Note

Throughput and latency numbers in this table were generated using the Neuron Tutorials linked in each row.

Note

Cost per 1M inferences is calculated using the US East (N. Virginia) RI-effective hourly rate.

A Real Time application runs inference at batch size 1 to minimize latency; a Batch application uses larger batches to maximize throughput and minimize cost per inference.
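
For reference, a cost figure can be rederived from a row's throughput and the instance's hourly rate. Below is a minimal check against the first encoder row, assuming an illustrative RI-effective rate of about $0.11/hour for inf1.xlarge (actual rates vary by region and commitment term).

```python
# Cost per 1M inferences = hourly rate / (inferences per hour) * 1,000,000
hourly_rate = 0.11   # USD/hour for inf1.xlarge, assumed for illustration
throughput = 1056    # inferences/second (BERT base row above)
cost_per_1m = hourly_rate / (throughput * 3600) * 1_000_000
print(f"${cost_per_1m:.3f} per 1M inferences")  # ~$0.029, matching the table
```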

This document is relevant for: Inf1