This document is relevant for: Inf2
Inf2 Inference Performance#
Last update: September 16th, 2024
Encoder Models#
| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2999.88 | 5.32 | 5.49 | $0.030 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2697.72 | 2.95 | 3.11 | $0.034 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2907.45 | 11.03 | 11.32 | $0.031 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2756.19 | 5.75 | 6.35 | $0.033 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 906.91 | 16.95 | 18.93 | $0.100 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 919.08 | 8.73 | 9.12 | $0.099 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2904.22 | 10.98 | 11.51 | $0.031 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4756.25 | 1.67 | 1.81 | $0.019 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4741.08 | 1.68 | 1.81 | $0.019 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 5043.38 | 6.3 | 6.88 | $0.018 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2756.87 | 11.54 | 12.19 | $0.033 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2743.63 | 5.84 | 5.95 | $0.033 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 969.49 | 8.23 | 8.68 | $0.094 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 51.29 | 628.73 | 695.07 | $30.369 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
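The rows above are throughput-optimized ("Batch" application type); the table below shows the same models optimized for latency at batch size 1 ("Real Time"). Each entry is compiled ahead of time with the Neuron compiler. As a rough illustration of how a configuration like the bert-base-cased row (batch 4, sequence length 128, FP32 weights, Matmult-BF16 autocast) is produced, here is a minimal sketch using `torch_neuronx.trace`; it is not the exact benchmark script, and the model and tokenizer choices are illustrative.

```python
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice from the table; other encoder rows work the same way.
model = AutoModel.from_pretrained("bert-base-cased", torchscript=True)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch_size, seq_len = 4, 128  # "Batch Size" and "Sequence Length" columns
example = tokenizer(
    ["example input"] * batch_size,
    padding="max_length",
    max_length=seq_len,
    return_tensors="pt",
)

# "Matmult-BF16" in the "Compilation Autocast Data Type" column corresponds to
# casting matrix-multiply operands to BF16 while keeping FP32 weights.
neuron_model = torch_neuronx.trace(
    model,
    (example["input_ids"], example["attention_mask"]),
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)
torch.jit.save(neuron_model, "bert_neuron.pt")  # reusable compiled artifact
```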
| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2034.71 | 0.98 | 1.05 | $0.045 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 41.24 | 1.02 | 1.09 | $2.209 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2117.52 | 0.93 | 1.0 | $0.043 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1974.52 | 1.01 | 1.09 | $0.046 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 718.66 | 2.78 | 2.86 | $0.127 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 728.39 | 2.78 | 2.86 | $0.125 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1931.2 | 1.03 | 1.09 | $0.047 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3255.95 | 0.61 | 0.67 | $0.028 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3255.97 | 0.61 | 0.68 | $0.028 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3248.31 | 0.61 | 0.68 | $0.028 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2019.69 | 0.98 | 1.08 | $0.045 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1926.26 | 1.03 | 1.1 | $0.047 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 721.21 | 2.76 | 2.84 | $0.126 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 41.24 | 48.27 | 58.71 | $37.773 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
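The "Run Mode" column is Data Parallel for all encoder entries: one compiled graph is replicated across the NeuronCores of the instance, and the input batch is split between the replicas. A minimal sketch using `torch_neuronx.DataParallel` follows; the saved-trace filename is the hypothetical one from the sketch above, and on an Inf2.xlarge (two NeuronCores) a batch of 8 is split into two batches of 4, matching the traced shape.

```python
import torch
import torch_neuronx

# Load the compiled encoder saved earlier (filename is hypothetical).
neuron_model = torch.jit.load("bert_neuron.pt")

# Replicate the model across all visible NeuronCores; inputs are split on the
# batch dimension (dim 0 by default) and outputs are gathered back.
dp_model = torch_neuronx.DataParallel(neuron_model)

input_ids = torch.zeros((8, 128), dtype=torch.long)  # dummy batch of 8
attention_mask = torch.ones((8, 128), dtype=torch.long)
outputs = dp_model(input_ids, attention_mask)
```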
Encoder-Decoder Models#
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $6.623 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $6.959 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

For encoder-decoder models only:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $6.623 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $7.200 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.
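The note above implies that the per-token columns are computed over input plus output tokens. A sketch of that accounting, with the table's google/flan-t5-xl row (128 input tokens, 84 output tokens, ~8.5 ms per token P50) used as a consistency check; the numbers come from the table, but the helper itself is illustrative.

```python
def per_token_metrics(e2e_latency_s: float, input_len: int, output_len: int,
                      batch_size: int = 1) -> tuple[float, float]:
    """Token throughput and per-token latency, counting input + output tokens."""
    total_tokens = (input_len + output_len) * batch_size
    tokens_per_second = total_tokens / e2e_latency_s
    latency_per_token_ms = 1000.0 * e2e_latency_s / total_tokens
    return tokens_per_second, latency_per_token_ms

# 212 tokens at ~8.5 ms/token implies ~1.8 s end to end and ~117.6 tokens/second.
tput, lat = per_token_metrics(e2e_latency_s=0.0085 * 212, input_len=128, output_len=84)
print(f"{tput:.1f} tokens/s, {lat:.2f} ms/token")
```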
Vision Transformer Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $109.772 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 1773.97 | 4.5 | 4.69 | $0.051 | Batch | 2.18.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 6099.53 | 46.31 | 66.27 | $0.015 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 304.07 | 105.9 | 110.58 | $0.300 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $109.772 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 728.93 | 1.36 | 1.4 | $0.125 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 158.38 | 6.31 | 6.34 | $0.575 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 73.23 | 13.65 | 13.71 | $1.244 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
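The vision entries follow the same trace-and-compile pattern as the encoder models, with image tensors instead of token IDs. A minimal sketch for the google/vit-base-patch16-224 "Real Time" row (batch size 1, 224x224 input); this is illustrative and not the exact benchmark script.

```python
import torch
import torch_neuronx
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", torchscript=True
).eval()

# Batch size 1 and 224x224 input match the "Real Time" row for this model.
pixel_values = torch.rand((1, 3, 224, 224))

neuron_model = torch_neuronx.trace(
    model,
    pixel_values,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)
outputs = neuron_model(pixel_values)
```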
Convolutional Neural Networks (CNN) Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 999.94 | 15.99 | 16.08 | $0.091 | Batch | 2.20.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 3178.53 | 80.43 | 80.56 | $0.029 | Batch | 2.20.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2430.45 | 105.24 | 105.36 | $0.037 | Batch | 2.20.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 6944.75 | 4.59 | 4.66 | $0.013 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 5116.51 | 6.24 | 6.31 | $0.018 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4420.47 | 7.23 | 7.31 | $0.021 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4687.13 | 54.55 | 54.65 | $0.019 | Batch | 2.20.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2095.37 | 15.25 | 15.32 | $0.043 | Batch | 2.20.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 443.54 | 2.26 | 2.28 | $0.205 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 984.12 | 1.01 | 1.03 | $0.093 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 828.62 | 1.2 | 1.23 | $0.110 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1637.7 | 0.6 | 0.63 | $0.056 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1368.5 | 0.73 | 0.74 | $0.067 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1246.26 | 0.79 | 0.85 | $0.073 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 630.69 | 1.61 | 1.63 | $0.144 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 2.1 | Inf2.xlarge | Image Classification | 441.12 | 2.26 | 2.3 | $0.207 | Real Time | 2.20.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
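The P50/P99 columns in all of these tables are latency percentiles over many timed inference calls. A sketch of that measurement for a compiled model callable; the helper is illustrative and not the benchmark harness used to generate the tables.

```python
import time
import numpy as np

def latency_percentiles(model, example_inputs, iters=1000, warmup=100):
    """P50/P99 latency in ms over repeated inference calls on one input."""
    for _ in range(warmup):  # let caches and runtime state settle first
        model(*example_inputs)
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        model(*example_inputs)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return np.percentile(samples_ms, 50), np.percentile(samples_ms, 99)
```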
Stable Diffusion Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 1.5 | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.48 | 2089.0 | 2093.0 | $190.211 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.16 | 6045.0 | 6063.4 | $552.189 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.6 | 1655.0 | 1663.0 | $150.846 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.18 | 5504.0 | 5519.0 | $500.611 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.08 | 12200.0 | 12260.0 | $1,111.111 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | | PyTorch 2.1 | Inf2.8xlarge | Image Generation | 0.09 | 10741.0 | 11006.0 | $2,538.829 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
Note
Cost per 1M images is calculated using RI-Effective hourly rate.
Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
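Concretely, the cost column divides the instance-hour price by the number of inferences achieved per hour. A sketch of the arithmetic; the ~$0.33/hour figure below is an assumed RI-effective rate for Inf2.xlarge chosen because it approximately reproduces the table, not an official price quote.

```python
def cost_per_1m(throughput_per_s: float, hourly_rate_usd: float) -> float:
    """USD cost of one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_s * 3600.0
    return 1_000_000.0 * hourly_rate_usd / inferences_per_hour

# Stable Diffusion 2.1 at 512x512: 0.6 images/s on Inf2.xlarge.
# Assumed rate of $0.326/hour gives ~$150.9 per 1M images, close to the table.
print(f"${cost_per_1m(0.6, 0.326):.3f} per 1M images")
```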
Diffusion Transformer Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixArt Alpha | 256x256 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.98 | 502.59 | 537.26 | $46.132 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Alpha | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.56 | 1769.76 | 1775.7 | $161.259 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 256x256 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.86 | 540.83 | 548.41 | $48.984 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.54 | 1841.88 | 1850.68 | $167.792 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
Note
Cost per 1M images is calculated using RI-Effective hourly rate.
Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
Note
See Neuron Glossary for abbreviations and terms