This document is relevant for: Inf2
# Inf2 Inference Performance
Last update: September 16th, 2024
## Encoder Models
**Batch:**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 3014.09 | 5.3 | 5.5 | $0.030 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2756.07 | 2.89 | 3.11 | $0.033 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.5 | Inf2.xlarge | Sequence Classification | 3200.04 | 9.91 | 10.41 | $0.028 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2859.83 | 5.56 | 5.82 | $0.032 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 960.92 | 16.62 | 17.4 | $0.095 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 950.05 | 8.41 | 8.85 | $0.096 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2937.32 | 10.84 | 11.46 | $0.031 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 4889.52 | 1.64 | 1.74 | $0.019 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 4891.96 | 1.64 | 1.74 | $0.019 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 5268.11 | 6.06 | 6.38 | $0.017 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2919.08 | 10.94 | 11.33 | $0.031 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2855.82 | 5.59 | 5.76 | $0.032 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 997.97 | 7.96 | 8.38 | $0.091 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | Inf2.48xlarge | Raw Output (AutoModel) | 48.58 | 642.15 | 840.52 | $32.069 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
**Real Time:**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2044.31 | 0.97 | 1.04 | $0.045 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2830.12 | 0.99 | 1.07 | $0.032 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.5 | Inf2.xlarge | Sequence Classification | 2144.41 | 0.93 | 0.98 | $0.042 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2010.47 | 0.98 | 1.07 | $0.045 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 726.22 | 2.76 | 2.82 | $0.125 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 723.88 | 2.76 | 2.84 | $0.126 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 1936.59 | 1.03 | 1.08 | $0.047 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 3266.05 | 0.61 | 0.67 | $0.028 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 3263.46 | 0.61 | 0.68 | $0.028 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 3264.01 | 0.61 | 0.68 | $0.028 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 2060.44 | 0.96 | 1.04 | $0.044 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 1938.6 | 1.03 | 1.07 | $0.047 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 722.94 | 2.76 | 2.83 | $0.126 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | Inf2.48xlarge | Raw Output (AutoModel) | 42.17 | 47.38 | 58.69 | $36.936 | Real Time | 2.21.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
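Each row above pins down a reproducible configuration: framework, instance type, batch size, sequence length, and the compiler autocast setting. As a rough illustration of what those columns correspond to in code, the sketch below compiles `bert-base-cased` with torch-neuronx and measures single-replica latency percentiles and throughput. It is a minimal sketch, not the harness used to produce these numbers: the model choice, dummy inputs, and iteration count are illustrative, and the published throughput additionally runs one replica per NeuronCore (Data Parallel mode).

```python
# Minimal benchmarking sketch for one encoder row (bert-base-cased, seq len 128).
# Illustrative only: the published tables come from dedicated benchmark scripts.
import time

import numpy as np
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

BATCH, SEQ_LEN = 4, 128  # "Batch Size" and "Sequence Length" columns

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", torchscript=True).eval()

enc = tokenizer(["hello world"] * BATCH, padding="max_length",
                max_length=SEQ_LEN, return_tensors="pt")
example = (enc["input_ids"], enc["attention_mask"])

# "Compilation Autocast Data Type: Matmult-BF16" = keep the FP32 model but let
# the compiler cast matrix-multiply operands to BF16.
neuron_model = torch_neuronx.trace(
    model, example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

latencies_ms = []
start = time.perf_counter()
for _ in range(1000):
    t0 = time.perf_counter()
    neuron_model(*example)
    latencies_ms.append((time.perf_counter() - t0) * 1000.0)
elapsed = time.perf_counter() - start

print(f"P50 {np.percentile(latencies_ms, 50):.2f} ms | "
      f"P99 {np.percentile(latencies_ms, 99):.2f} ms | "
      f"{BATCH * 1000 / elapsed:.1f} inf/s on one NeuronCore")
```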
## Encoder-Decoder Models
**Batch:**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $6.623 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $6.959 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
**Note:** For encoder-decoder models only, Throughput (tokens/second) and Latency per Token count both input and output tokens.
**Real Time:**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $6.623 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $7.200 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
**Note:** Throughput (tokens/second) and Latency per Token count both input and output tokens.
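Because both metrics count input and output tokens and these runs use batch size 1, throughput should be close to the reciprocal of the per-token latency. A quick sanity check against the google/flan-t5-xl row:

```python
# Sanity check: at batch size 1, tokens/second ~= 1000 / latency-per-token (ms),
# since both metrics count input and output tokens.
p50_ms = 8.51            # flan-t5-xl, Latency per Token P50 (ms)
tokens_per_sec = 117.61  # flan-t5-xl, Throughput (tokens/second)

print(1000.0 / p50_ms)                        # ~117.5 tokens/s
print(abs(1000.0 / p50_ms - tokens_per_sec))  # small residual, ~0.1
```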
## Vision Transformer Models
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | | PyTorch 2.5 | Inf2.xlarge | Multimodal Autoencoding | 0.85 | 1170.04 | 1232.06 | $106.813 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1955.41 | 4.09 | 4.12 | $0.047 | Batch | 2.21.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 6509.83 | 135.81 | 136.0 | $0.014 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 285.94 | 113.12 | 115.94 | $0.319 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | | PyTorch 2.5 | Inf2.xlarge | Multimodal Autoencoding | 0.85 | 1170.04 | 1232.06 | $106.813 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 746.14 | 1.32 | 1.38 | $0.122 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 161.05 | 6.21 | 6.25 | $0.566 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 73.26 | 13.64 | 13.68 | $1.244 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
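The Run Mode column in these single-accelerator rows is Data Parallel: one replica of the compiled model is placed on each NeuronCore (an inf2.xlarge exposes two) and the input batch is split across them. A minimal sketch using the torch-neuronx `DataParallel` wrapper follows; the saved-model path is hypothetical and assumes a model already compiled with `torch_neuronx.trace`.

```python
# Sketch of the "Data Parallel" run mode on inf2.xlarge (two NeuronCores).
import torch
import torch_neuronx

# Hypothetical path: a ViT/CLIP-style model previously compiled with
# torch_neuronx.trace(...) and saved with torch.jit.save(...).
model = torch.jit.load("vit_base_patch16_224_neuron.pt")

# One replica per NeuronCore; inputs are split along the batch dimension
# (dim 0 by default) and the outputs are concatenated back together.
dp_model = torch_neuronx.DataParallel(model)

images = torch.rand(2, 3, 224, 224)  # one 224x224 image per core
outputs = dp_model(images)
```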
## Convolutional Neural Network (CNN) Models
**Batch:**

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Segmentation | 1010.8 | 15.82 | 15.88 | $0.090 | Batch | 2.21.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 3164.99 | 80.82 | 80.94 | $0.029 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 2449.88 | 104.41 | 104.53 | $0.037 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 6949.17 | 4.59 | 4.66 | $0.013 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 5158.61 | 6.18 | 6.25 | $0.018 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 4393.3 | 7.28 | 7.33 | $0.021 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 4734.4 | 54.04 | 54.09 | $0.019 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 2161.39 | 14.77 | 14.83 | $0.042 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
**Real Time:**

| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Segmentation | 447.09 | 2.23 | 2.25 | $0.204 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 994.69 | 1.01 | 1.02 | $0.092 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 837.78 | 1.18 | 1.22 | $0.109 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1669.8 | 0.6 | 0.61 | $0.055 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1394.21 | 0.72 | 0.73 | $0.065 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1218.88 | 0.83 | 0.85 | $0.075 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 629.19 | 1.59 | 1.6 | $0.145 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 508.66 | 1.96 | 2.0 | $0.179 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
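Read together, the two tables quantify the throughput/latency trade-off that separates the Batch and Real Time rows. For resnet50, moving from batch size 1 to batch size 8 buys roughly 3.6x more throughput (and a matching 3.6x lower cost per 1M images) at the price of roughly 8.8x higher P50 latency:

```python
# resnet50 figures copied from the two tables above.
batch = {"throughput": 4393.3, "p50_ms": 7.28, "cost_per_1m": 0.021}       # Batch, BS=8
real_time = {"throughput": 1218.88, "p50_ms": 0.83, "cost_per_1m": 0.075}  # Real Time, BS=1

print(f"throughput gain: {batch['throughput'] / real_time['throughput']:.1f}x")    # ~3.6x
print(f"cost reduction:  {real_time['cost_per_1m'] / batch['cost_per_1m']:.1f}x")  # ~3.6x
print(f"P50 latency penalty: {batch['p50_ms'] / real_time['p50_ms']:.1f}x")        # ~8.8x
```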
## Stable Diffusion Models
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 1.5 | 512x512 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.49 | 2023.74 | 2031.7 | $184.435 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.13 | 7546.0 | 7550.98 | $685.046 | Real Time | 2.21.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.6 | 1679.8 | 1685.44 | $152.871 | Real Time | 2.21.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.19 | 5337.51 | 5357.36 | $487.225 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.08 | 12048.66 | 12102.43 | $1,097.724 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | | PyTorch 2.5 | Inf2.8xlarge | Image Generation | 0.1 | 10546.45 | 10704.57 | $2,485.380 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
**Note:** Cost per 1M images is calculated using the RI-effective hourly rate. A Real Time application refers to batch size 1 inference for minimal latency; a Batch application refers to maximum throughput at minimum cost per inference.
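The cost columns throughout this page follow directly from sustained throughput and an hourly instance rate: cost per 1M = hourly rate / (throughput × 3600) × 10⁶. The RI-effective rates themselves are not listed here, so the rate in the sketch below is back-solved from the Stable Diffusion 1.5 row; it is an inference from the table, not a quoted price.

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """USD to run one million inferences at the given sustained throughput."""
    return hourly_rate_usd / (throughput_per_sec * 3600.0) * 1e6

# Back-solve the implied inf2.xlarge RI-effective rate from the SD 1.5 row
# (0.49 images/sec at $184.435 per 1M images) -- an inference, not a quote.
implied_rate = 184.435 * 0.49 * 3600.0 / 1e6
print(f"implied rate: ${implied_rate:.3f}/hour")      # ~$0.33/hour
print(f"{cost_per_million(implied_rate, 0.49):.3f}")  # ~184.435, matches the table
```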
## Diffusion Transformer Models
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixArt Alpha | 256x256 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.98 | 502.59 | 537.26 | $46.132 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Alpha | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.56 | 1769.76 | 1775.7 | $161.259 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 256x256 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.86 | 540.83 | 548.41 | $48.984 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.54 | 1841.88 | 1850.68 | $167.792 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
**Note:** Cost per 1M images is calculated using the RI-effective hourly rate. A Real Time application refers to batch size 1 inference for minimal latency; a Batch application refers to maximum throughput at minimum cost per inference.
**Note:** See the Neuron Glossary for abbreviations and terms.
This document is relevant for: Inf2