This document is relevant for: Inf2
# Inf2 Inference Performance
Last update: April 11th, 2024
## Encoder Models
| Model | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3039.05 | 5.25 | 5.34 | $0.069 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2713.61 | 2.92 | 3.41 | $0.078 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 3269.04 | 9.8 | 10.07 | $0.064 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2845.18 | 5.57 | 7.33 | $0.074 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 937.59 | 16.95 | 18.93 | $0.225 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 936.66 | 8.47 | 9.71 | $0.225 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2871.72 | 10.53 | 14.81 | $0.073 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4721.11 | 1.67 | 1.95 | $0.045 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4714.5 | 1.68 | 2.03 | $0.045 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4794.99 | 6.15 | 10.11 | $0.044 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2949.72 | 10.76 | 14.1 | $0.071 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2875.1 | 5.43 | 7.32 | $0.073 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 992.13 | 7.94 | 8.98 | $0.212 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 50.39 | 633.24 | 703.87 | $71.558 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |

| Model | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2045.85 | 0.97 | 1.1 | $0.103 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1928.26 | 1.03 | 1.23 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2116.28 | 0.93 | 1.0 | $0.099 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1929.1 | 1.03 | 1.19 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 730.18 | 2.72 | 2.82 | $0.288 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 728.39 | 2.75 | 2.84 | $0.289 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1932.63 | 1.03 | 1.13 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3200.64 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3201.45 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3199.37 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1962.09 | 1.02 | 1.11 | $0.107 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1934.61 | 1.03 | 1.16 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 731.29 | 2.72 | 2.91 | $0.288 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 42.01 | 47.62 | 61.71 | $85.832 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |

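As a quick sanity check, the Real Time rows above are internally consistent if one assumes a fixed number of data-parallel workers each completing one inference per latency period. The two-worker count below is an assumption (the tables do not state how many NeuronCore workers back the Data Parallel run mode), so treat this as a sketch rather than the measurement methodology:

```python
# Sketch: relating batch size, parallel workers, and P50 latency to the
# throughput columns. NUM_WORKERS = 2 is an assumption, not from the tables.
def estimated_throughput(batch_size: int, num_workers: int, p50_latency_ms: float) -> float:
    """Inferences/second if each worker finishes one batch per latency period."""
    return batch_size * num_workers / (p50_latency_ms / 1000.0)

# bert-base-cased, Real Time row: batch size 1, P50 = 1.03 ms
approx = estimated_throughput(1, 2, 1.03)  # ~1942 inf/s vs. 1928.26 measured
```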
## Decoder Models
| Model | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | On-Demand hourly rate | Cost per 1M tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 725.83 | 77.36 | 87.28 | 12.11 | 13.06 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $4.968 |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 577.97 | 80.12 | 89.69 | 16.39 | 17.81 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | 12.981 | $6.239 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 589.89 | 108.81 | 113.89 | 14.9 | 15.79 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $6.113 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 351.76 | 7083.73 | 7158.32 | 20.99 | 21.8 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $10.251 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 178.57 | 5141.32 | 5160.93 | 21.71 | 22.74 | Batch | 2.18.0 | Tensor Parallel | 24 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $20.193 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.06 | 76.6 | 77.12 | 32.9 | 33.42 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | 12.981 | $119.940 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.92 | 96.44 | 98.47 | 33.13 | 33.46 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | 12.981 | $120.499 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.07 | 76.33 | 86.53 | 33.15 | 34.08 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | 12.981 | $119.914 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.43 | 277.02 | 280.13 | 33.73 | 34.01 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | 12.981 | $122.539 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.91 | 275.97 | 284.77 | 34.82 | 35.44 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | 12.981 | $124.711 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.33 | 810.44 | 814.88 | 34.9 | 35.14 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | 12.981 | $127.292 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 761.89 | 77.62 | 86.63 | 11.64 | 12.5 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $4.733 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 450.38 | 4740.12 | 4783.75 | 16.55 | 17.53 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $8.006 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 411.05 | 11085.12 | 11125.86 | 18.01 | 19.96 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $8.772 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 546.51 | 115.81 | 121.5 | 15.87 | 17.21 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $6.598 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 333.24 | 7115.98 | 7231.01 | 22.27 | 23.81 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $10.821 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 178.79 | 5136.62 | 5192.59 | 21.67 | 22.73 | Batch | 2.18.0 | Tensor Parallel | 24 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $20.168 |

| Model | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | On-Demand hourly rate | Cost per 1M tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 156.13 | 27.64 | 33.77 | 6.47 | 7.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $23.095 |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 145.17 | 29.21 | 33.39 | 7.34 | 7.8 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | 12.981 | $24.839 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 112.52 | 25.85 | 26.9 | 9.17 | 9.33 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $32.046 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 97.42 | 333.78 | 340.99 | 10.17 | 10.38 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $37.015 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 73.17 | 994.18 | 999.8 | 13.5 | 13.98 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $49.282 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.06 | 76.6 | 77.12 | 32.9 | 33.42 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | 12.981 | $119.940 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.92 | 96.44 | 98.47 | 33.13 | 33.46 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | 12.981 | $120.499 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.07 | 76.33 | 86.53 | 33.15 | 34.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | 12.981 | $119.914 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.43 | 277.02 | 280.13 | 33.73 | 34.01 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | 12.981 | $122.539 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.91 | 275.97 | 284.77 | 34.82 | 35.44 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | 12.981 | $124.711 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.33 | 810.44 | 814.88 | 34.9 | 35.14 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | 12.981 | $127.292 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 149.74 | 27.34 | 29.21 | 6.86 | 7.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $24.081 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 81.7 | 557.96 | 562.86 | 7.87 | 11.65 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $44.133 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 95.99 | 539.59 | 557.1 | 10.33 | 10.61 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $37.563 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 112.71 | 27.02 | 33.25 | 9.12 | 9.38 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $31.994 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 97.52 | 338.67 | 340.46 | 10.15 | 10.55 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $36.975 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 73.68 | 989.5 | 1000.66 | 13.44 | 13.86 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $48.940 |

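The TTFT (time to first token) and TPOT (time per output token) columns compose into an approximate end-to-end request latency: the prefill of the prompt, plus one decode step for each generated token after the first. A minimal sketch of that relation, checked against the Llama-2-7b Real Time row:

```python
# Sketch: composing TTFT (prefill) and TPOT (per-token decode) into an
# approximate end-to-end generation latency.
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate request latency: prefill plus the remaining decode steps."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# Llama-2-7b Real Time row (input 128, output 3968): TTFT P50 27.64 ms, TPOT P50 6.47 ms
latency_s = e2e_latency_ms(27.64, 6.47, 3968) / 1000.0  # ~25.7 s for the full request
```

Dividing the 3968 output tokens by that ~25.7 s gives ~154 tokens/sec, close to the 156.13 tokens/sec reported in the table, which suggests the composition above matches how the throughput column was derived.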
## Encoder-Decoder Models
| Model | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/flan-t5-xl | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $15.328 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $16.108 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

**Note:** For encoder-decoder models only, Throughput (tokens/second) and Latency per Token both count input as well as output tokens.
| Model | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/flan-t5-xl | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $15.330 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $16.665 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

**Note:** Throughput (tokens/second) and Latency per Token both count input as well as output tokens.
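Because throughput counts every token (input and output), the Latency per Token column is effectively the reciprocal of throughput. A one-line sketch checked against the google/flan-t5-xl Real Time row:

```python
# Sketch: Latency per Token is roughly 1 / Throughput for these tables,
# since both metrics count input and output tokens.
def latency_per_token_ms(throughput_tokens_per_s: float) -> float:
    """Milliseconds per token at the given aggregate token throughput."""
    return 1000.0 / throughput_tokens_per_s

# google/flan-t5-xl Real Time row: 117.6 tokens/second
lpt = latency_per_token_ms(117.6)  # ~8.5 ms, matching the Latency per Token P50 column
```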
## Vision Transformer Models
| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepmind/multimodal-perceiver | 16x224x224 | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $253.681 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | None |
| deepmind/vision-perceiver-conv | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $1.664 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $3.101 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $2.114 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 1773.97 | 4.5 | 4.69 | $0.119 | Batch | 2.18.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 6099.53 | 46.31 | 66.27 | $0.035 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 304.07 | 105.9 | 110.58 | $0.692 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepmind/multimodal-perceiver | 16x224x224 | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $253.681 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | None |
| deepmind/vision-perceiver-conv | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $1.664 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $3.101 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $2.114 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 728.93 | 1.36 | 1.4 | $0.289 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 158.38 | 6.31 | 6.34 | $1.329 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 73.23 | 13.65 | 13.71 | $2.875 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

## Convolutional Neural Network (CNN) Models
| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 996.91 | 16.03 | 16.12 | $0.211 | Batch | 2.18.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 3260.01 | 78.46 | 78.57 | $0.065 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2397.56 | 106.7 | 106.83 | $0.088 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 7068.94 | 4.51 | 4.56 | $0.030 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 5116.51 | 6.24 | 6.31 | $0.041 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4476.77 | 7.13 | 7.2 | $0.047 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4671.31 | 54.75 | 54.85 | $0.045 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2144.44 | 14.91 | 14.98 | $0.098 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 439.64 | 2.26 | 2.31 | $0.479 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1027.8 | 0.96 | 1.0 | $0.205 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 861.61 | 1.15 | 1.2 | $0.244 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1743.72 | 0.55 | 0.62 | $0.121 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1368.5 | 0.73 | 0.74 | $0.154 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1264.62 | 0.78 | 0.81 | $0.166 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 544.75 | 1.82 | 1.88 | $0.387 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 450.83 | 2.21 | 2.26 | $0.467 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

## Stable Diffusion Models
| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 1.5 | 512x512 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.48 | 2089.0 | 2093.0 | $439.573 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.16 | 6045.0 | 6063.4 | $1,276.094 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.6 | 1655.0 | 1663.0 | $348.602 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.18 | 5504.0 | 5519.0 | $1,156.899 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.08 | 12200.0 | 12260.0 | $2,567.751 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | PyTorch 2.1 | Inf2.8xlarge | Image Generation | 0.09 | 10741.0 | 11006.0 | $5,884.110 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

**Note:** Cost per 1M images is calculated using the On-Demand hourly rate. The Real Time application type refers to batch size 1 inference for minimal latency; the Batch application type refers to maximum throughput at minimum cost per inference.
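The cost columns throughout this document follow directly from throughput and the instance's On-Demand hourly rate. A minimal sketch of that calculation, using the $12.981/hour rate listed for Inf2.48xlarge in the decoder tables (rates for other instance types are not listed here):

```python
# Sketch: deriving the "Cost per 1M ..." columns from throughput and the
# On-Demand hourly rate shown in the decoder tables.
def cost_per_million(throughput_per_s: float, hourly_rate_usd: float) -> float:
    """USD to produce one million inferences/tokens/images at the given throughput."""
    seconds_per_million = 1_000_000 / throughput_per_s
    return seconds_per_million / 3600 * hourly_rate_usd

# Llama-2-7b Batch row: 725.83 tokens/sec on Inf2.48xlarge at $12.981/hour
cost = cost_per_million(725.83, 12.981)  # ~$4.97 per 1M tokens, matching the table
```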
**Note:** See the Neuron Glossary for abbreviations and terms.