This document is relevant for: Inf2
# Inf2 Performance
Last update: Apr 12th, 2023
## Language Models Inference Performance
**Batch application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2835.99 | 10.79 | 20.65 | $0.074 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2838.14 | 5.58 | 6.15 | $0.074 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | | PyTorch 1.13.1 | Inf2.xlarge | Sequence Classification | 3185.58 | 10.82 | 15.51 | $0.066 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| bert-large-cased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 822.93 | 40.41 | 50.57 | $0.256 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| bert-large-uncased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 797.58 | 20.48 | 27.88 | $0.264 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| camembert-base | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2883.78 | 11.08 | 11.32 | $0.073 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| distilbert-base-cased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 5042.29 | 6.22 | 7.19 | $0.042 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| google/electra-base-discriminator | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2865.69 | 5.54 | 6.07 | $0.073 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| gpt2 | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1541.04 | 20.35 | 26.52 | $0.137 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| gpt2-large | | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 374.32 | 85.15 | 87.61 | $9.633 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| gpt2-medium | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 765.13 | 41.22 | 47.85 | $0.275 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
| roberta-base | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 2845.24 | 5.58 | 6.12 | $0.074 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| roberta-large | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 944.23 | 16.52 | 22.84 | $0.223 | Batch | 2.10.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 |
| xlm-roberta-base | | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 30.12 | 1035.27 | 1270.77 | $119.716 | Batch | 2.10.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 |
**Real-time application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1993.09 | 1.0 | 1.3 | $0.106 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1872.71 | 1.07 | 1.2 | $0.112 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-base-cased-finetuned-mrpc | | PyTorch 1.13.1 | Inf2.xlarge | Sequence Classification | 2018.15 | 0.98 | 1.11 | $0.104 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-large-cased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 676.43 | 2.92 | 3.29 | $0.311 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| bert-large-uncased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 692.94 | 2.84 | 3.25 | $0.304 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| camembert-base | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1851.02 | 1.07 | 1.2 | $0.114 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| distilbert-base-cased | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 3095.17 | 0.64 | 0.76 | $0.068 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| google/electra-base-discriminator | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1922.0 | 1.04 | 1.17 | $0.110 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| gpt2 | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1110.92 | 1.73 | 2.3 | $0.190 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| gpt2-large | | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 205.53 | 9.58 | 9.87 | $17.544 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| gpt2-medium | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 437.6 | 4.46 | 5.42 | $0.481 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| roberta-base | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 1851.43 | 1.08 | 1.2 | $0.114 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| roberta-large | | PyTorch 1.13.1 | Inf2.xlarge | Raw Output (AutoModel) | 707.36 | 2.82 | 2.96 | $0.298 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
| xlm-roberta-base | | PyTorch 1.13.1 | Inf2.48xlarge | Raw Output (AutoModel) | 25.39 | 78.92 | 92.39 | $142.018 | Real Time | 2.10.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 |
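All rows above share the same recipe: an FP32 checkpoint compiled with matrix-multiply operations autocast to BF16 ("Matmult-BF16"), then replicated across NeuronCores in data-parallel mode. A minimal sketch of that setup with `torch_neuronx`, using the bert-base-cased configuration from the batch table; the exact benchmark scripts are not reproduced here, so treat the details as illustrative:

```python
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# Configuration taken from the bert-base-cased batch row above.
MODEL_NAME = "bert-base-cased"
BATCH_SIZE = 8
SEQUENCE_LENGTH = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torchscript=True)
model.eval()

# Fixed-shape example input: Neuron compiles for static shapes.
example = tokenizer(["hello world"] * BATCH_SIZE, padding="max_length",
                    max_length=SEQUENCE_LENGTH, truncation=True,
                    return_tensors="pt")
example_inputs = (example["input_ids"], example["attention_mask"])

# Compile with matrix multiplications autocast to BF16 ("Matmult-BF16").
traced = torch_neuronx.trace(
    model,
    example_inputs,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# "Data Parallel" run mode: replicate the compiled model across the two
# NeuronCores of an inf2.xlarge, sharding incoming batches between them.
parallel_model = torch_neuronx.DataParallel(traced)

# Feed a double batch so each NeuronCore receives one compiled batch of 8.
big = tokenizer(["hello world"] * (2 * BATCH_SIZE), padding="max_length",
                max_length=SEQUENCE_LENGTH, truncation=True,
                return_tensors="pt")
with torch.no_grad():
    outputs = parallel_model(big["input_ids"], big["attention_mask"])
```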
## Large Language Models Inference Performance
**Batch application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opt-13b | | Transformers Neuron | Inf2.48xlarge | Text Generation | 2247.61 | 95.32 | 120.91 | $1.604 | Batch | 2.10.0 | Tensor Parallel | 8 | 1 | 256 | 128 | FP16 | Matmult-BF16 |
| opt-30b | | Transformers Neuron | Inf2.48xlarge | Text Generation | 1384.42 | 162.93 | 173.34 | $2.605 | Batch | 2.10.0 | Tensor Parallel | 8 | 1 | 256 | 128 | FP16 | Matmult-BF16 |
| opt-66b | | Transformers Neuron | Inf2.48xlarge | Text Generation | 1400.11 | 156.34 | 848.62 | $2.575 | Batch | 2.10.0 | Tensor Parallel | 24 | 1 | 64 | 128 | FP16 | Matmult-BF16 |
**Real-time application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| opt-13b | | Transformers Neuron | Inf2.48xlarge | Text Generation | 35.66 | 27.82 | 29.42 | $101.131 | Real Time | 2.10.0 | Tensor Parallel | 8 | 1 | 1 | 2048 | FP16 | Matmult-BF16 |
| opt-30b | | Transformers Neuron | Inf2.48xlarge | Text Generation | 19.43 | 51.11 | 54.71 | $185.533 | Real Time | 2.10.0 | Tensor Parallel | 8 | 1 | 1 | 2048 | FP16 | Matmult-BF16 |
| opt-66b | | Transformers Neuron | Inf2.48xlarge | Text Generation | 14.33 | 68.92 | 74.33 | $251.576 | Real Time | 2.10.0 | Tensor Parallel | 24 | 1 | 1 | 2048 | FP16 | Matmult-BF16 |
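The OPT rows use the Transformers Neuron (`transformers-neuronx`) library, which shards each model across NeuronCores at the tensor-parallel degree shown. Below is a sketch of the opt-13b real-time configuration (TP degree 8, batch size 1, FP16 weights, sequence length 2048); the checkpoint path is a placeholder, and the calls reflect the `transformers-neuronx` sampling API of the Neuron 2.10 era, so check the current release for exact signatures:

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx.opt.model import OPTForSampling

# Configuration taken from the opt-13b real-time row above.
TP_DEGREE = 8          # shard the model across 8 NeuronCores
BATCH_SIZE = 1
SEQUENCE_LENGTH = 2048

# Load a locally saved OPT checkpoint (placeholder path) with FP16
# weights and prepare it for tensor-parallel execution.
model = OPTForSampling.from_pretrained(
    "./opt-13b-checkpoint",   # placeholder: a locally saved checkpoint
    batch_size=BATCH_SIZE,
    tp_degree=TP_DEGREE,
    amp="f16",
)
model.to_neuron()  # shard the weights and load them onto the NeuronCores

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

# Autoregressive generation up to the benchmarked sequence length.
with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=SEQUENCE_LENGTH)
print(tokenizer.batch_decode(generated))
```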
## Vision Models Inference Performance
**Batch application**

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Inferences per dollar | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Segmentation | 850.53 | 18.76 | 19.33 | 4,039,442/$ | Batch | 2.10.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1403.75 | 223.53 | 258.99 | 6,666,868/$ | Batch | 2.10.0 | Data Parallel | 32 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 3052.69 | 83.81 | 83.94 | 14,498,273/$ | Batch | 2.10.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 2310.55 | 110.68 | 110.86 | 10,973,598/$ | Batch | 2.10.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 5793.64 | 5.52 | 5.6 | 27,515,973/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 4570.57 | 7.0 | 7.06 | 21,707,197/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 3896.43 | 8.21 | 8.28 | 18,505,472/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 3347.13 | 75.79 | 76.56 | 15,896,674/$ | Batch | 2.10.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1781.93 | 17.95 | 18.06 | 8,462,983/$ | Batch | 2.10.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

Stable Diffusion 2.1 was benchmarked at batch size 1 only; its rows appear in the real-time table below.
**Real-time application**

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Inferences per dollar | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 2.1 | 512x512 | | PyTorch 1.13.1 | Inf2.xlarge | Image Generation | 0.44 | 2286.0 | 2566.0 | 2,075/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | | PyTorch 1.13.1 | Inf2.xlarge | Image Generation | 0.13 | 7956.0 | 8087.0 | 598/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| UNet | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Segmentation | 334.54 | 2.96 | 3.76 | 1,588,868/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 358.36 | 2.81 | 3.72 | 1,701,983/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 830.24 | 1.19 | 1.32 | 3,943,078/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 694.81 | 1.43 | 1.53 | 3,299,903/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1147.19 | 0.88 | 1.0 | 5,448,377/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1034.72 | 0.95 | 1.3 | 4,914,233/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 970.46 | 1.01 | 1.14 | 4,609,040/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 1103.54 | 1.97 | 2.05 | 5,241,082/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 418.54 | 2.38 | 2.48 | 1,987,779/$ | Real Time | 2.10.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
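Throughput above is inferences per second at the stated batch size, and P50/P99 are the median and 99th-percentile request latencies. The published numbers come from AWS's benchmark harness; a simplified sketch of measuring the same three metrics for a single compiled resnet50 replica (real-time configuration: batch size 1, 224x224 input, Matmult-BF16) is shown below. It times one replica only, whereas the published numbers also parallelize across NeuronCores:

```python
import time

import torch
import torch_neuronx
from torchvision.models import resnet50

BATCH_SIZE = 1          # real-time rows above use batch size 1
IMAGE_SIZE = 224
N_WARMUP, N_ITERS = 100, 1000

# Compile the model with matrix multiplications autocast to BF16.
model = resnet50(weights=None).eval()
example = torch.rand(BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE)
traced = torch_neuronx.trace(
    model, example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# Warm up, then record per-call latencies in milliseconds.
for _ in range(N_WARMUP):
    traced(example)
latencies_ms = []
for _ in range(N_ITERS):
    start = time.perf_counter()
    traced(example)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

# Percentiles come from the sorted latency samples; throughput is
# total inferences divided by total wall-clock inference time.
latencies_ms.sort()
p50 = latencies_ms[int(0.50 * N_ITERS)]
p99 = latencies_ms[int(0.99 * N_ITERS)]
throughput = BATCH_SIZE * N_ITERS / (sum(latencies_ms) / 1000.0)
print(f"throughput={throughput:.2f} inf/s  P50={p50:.2f} ms  P99={p99:.2f} ms")
```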
Note

Cost per 1M inferences and inferences per dollar are calculated using the On-Demand hourly rate for the instance type.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
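Both cost metrics follow directly from sustained throughput and the instance's hourly price. A worked sketch of the arithmetic (the hourly rate below is an assumed placeholder for illustration, not a quoted AWS price):

```python
def cost_per_1m_inferences(throughput_per_sec: float, hourly_rate_usd: float) -> float:
    """USD to run 1,000,000 inferences at the given sustained throughput."""
    seconds = 1_000_000 / throughput_per_sec
    return hourly_rate_usd * seconds / 3600.0

def inferences_per_dollar(throughput_per_sec: float, hourly_rate_usd: float) -> float:
    """Inferences bought by one dollar of On-Demand instance time."""
    return throughput_per_sec * 3600.0 / hourly_rate_usd

# Example: the bert-base-cased batch row (2838.14 inf/s on Inf2.xlarge).
# The hourly rate here is an assumption for illustration only.
ASSUMED_INF2_XLARGE_HOURLY = 0.7582  # USD/hour
print(f"${cost_per_1m_inferences(2838.14, ASSUMED_INF2_XLARGE_HOURLY):.3f} per 1M")
print(f"{inferences_per_dollar(2838.14, ASSUMED_INF2_XLARGE_HOURLY):,.0f}/$")
```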
Note
See the Neuron Glossary for abbreviations and terms.