This document is relevant for: Inf2
Inf2 Inference Performance#
Important
The benchmark scripts linked on this page are provided for historical reference only and are not tested with recent versions of the Neuron SDK. They have been moved to the archive folder.
Last update: Feb 26th, 2026
Encoder Models#
| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.7 | Inf2.xlarge | Raw Output (AutoModel) | 3147.1 | 5.07 | 5.28 | $0.029 | Batch | 2.25.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.9 | Inf2.xlarge | Raw Output (AutoModel) | 2674.19 | 5.97 | 6.17 | $0.034 | Batch | 2.27.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | Inf2.xlarge | Raw Output (AutoModel) | 950.05 | 8.41 | 8.85 | $0.096 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.9 | Inf2.xlarge | Raw Output (AutoModel) | 5307.88 | 6.01 | 6.23 | $0.017 | Batch | 2.27.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.7 | Inf2.xlarge | Raw Output (AutoModel) | 2889.75 | 11.02 | 11.98 | $0.032 | Batch | 2.25.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.7 | Inf2.xlarge | Raw Output (AutoModel) | 2920.38 | 5.42 | 5.83 | $0.031 | Batch | 2.25.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.7 | Inf2.xlarge | Raw Output (AutoModel) | 962.7 | 8.31 | 8.61 | $0.095 | Batch | 2.25.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | Inf2.48xlarge | Raw Output (AutoModelForMaskedLM) | 51.14 | 625.66 | 694.93 | $30.463 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
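As a rough sanity check on the batch rows above: with data-parallel execution, throughput is approximately the number of NeuronCores times batch size divided by latency (Inf2.xlarge carries one Inferentia2 chip with two NeuronCores). This is an illustrative back-of-the-envelope sketch, not part of the archived benchmark scripts:

```python
# Rough consistency check for data-parallel batch throughput:
# throughput ~= num_cores * batch_size / latency.
# Assumption: Inf2.xlarge exposes 2 NeuronCores, each running one
# replica of the model at the listed batch size.

def approx_throughput(num_cores: int, batch_size: int, p50_latency_ms: float) -> float:
    """Approximate inferences/second from batch size and P50 latency."""
    return num_cores * batch_size / (p50_latency_ms / 1000.0)

# albert-base-v2 batch row: batch size 8, P50 latency 5.07 ms.
est = approx_throughput(num_cores=2, batch_size=8, p50_latency_ms=5.07)
print(round(est, 1))  # ~3155.8, within ~0.3% of the reported 3147.1 inf/s
```

The small gap between the estimate and the reported number reflects host-side overheads that the simple formula ignores.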
| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.8 | Inf2.xlarge | Raw Output (AutoModel) | 2119.78 | 0.94 | 1.0 | $0.043 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.8 | Inf2.xlarge | Raw Output (AutoModel) | 1998.21 | 1.0 | 1.04 | $0.046 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.7 | Inf2.xlarge | Raw Output (AutoModel) | 738.65 | 2.69 | 2.78 | $0.123 | Real Time | 2.25.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.8 | Inf2.xlarge | Raw Output (AutoModel) | 3401.97 | 0.58 | 0.68 | $0.027 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.8 | Inf2.xlarge | Raw Output (AutoModel) | 2020.46 | 1.0 | 1.05 | $0.045 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.8 | Inf2.xlarge | Raw Output (AutoModel) | 1989.26 | 1.0 | 1.09 | $0.046 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.8 | Inf2.xlarge | Raw Output (AutoModel) | 738.88 | 2.69 | 2.77 | $0.123 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | Inf2.48xlarge | Raw Output (AutoModelForMaskedLM) | 48.8 | 40.67 | 51.06 | $31.920 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
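The Latency P50 and P99 columns are order statistics over many per-request latencies. A minimal, generic sketch of how such percentiles are typically derived from a benchmarking run (illustrative only; the `measure_latencies` harness and its parameters are hypothetical, not the archived benchmark scripts):

```python
import time
from statistics import quantiles

def measure_latencies(infer, inputs, warmup=10, iters=100):
    """Time repeated calls to `infer` and return per-call latencies in ms.
    Warmup iterations are discarded so one-time compilation/caching
    costs do not skew the percentiles."""
    for _ in range(warmup):
        infer(inputs)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        infer(inputs)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def p50_p99(samples):
    # quantiles(..., n=100) returns the 1st..99th percentile cut points,
    # so index 49 is P50 and index 98 is P99.
    cuts = quantiles(sorted(samples), n=100)
    return cuts[49], cuts[98]

# Usage with a stand-in workload (any callable works):
lat = measure_latencies(lambda x: sum(x), list(range(1000)), warmup=2, iters=50)
p50, p99 = p50_p99(lat)
```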
Encoder-Decoder Models#
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $6.623 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $6.959 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note
For encoder-decoder models only:
Throughput (tokens/second) counts both input and output tokens.
Latency per Token counts both input and output tokens.
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $6.623 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $7.200 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note
Throughput (tokens/second) counts both input and output tokens
Latency per Token counts both input and output tokens
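Because throughput and latency both count every token, the two columns are roughly reciprocal at batch size 1: latency per token in milliseconds is about 1000 divided by tokens per second. A small sketch of that arithmetic, checked against the real-time google/flan-t5-xl row above:

```python
# At batch size 1, latency per token (ms) is approximately the
# reciprocal of token throughput.

def latency_per_token_ms(tokens_per_second: float) -> float:
    """Convert a tokens/second figure to milliseconds per token."""
    return 1000.0 / tokens_per_second

# google/flan-t5-xl real-time row: 117.6 tokens/s; table reports 8.5 ms P50.
print(round(latency_per_token_ms(117.6), 1))  # 8.5
```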
Vision Transformers Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | | PyTorch 2.5 | Inf2.xlarge | Multimodal Autoencoding | 0.85 | 1170.04 | 1232.06 | $106.813 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1955.41 | 4.09 | 4.12 | $0.047 | Batch | 2.21.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 6509.83 | 135.81 | 136.0 | $0.014 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 285.94 | 113.12 | 115.94 | $0.319 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| deepmind/multimodal-perceiver | 16x224x224 | | PyTorch 2.5 | Inf2.xlarge | Multimodal Autoencoding | 0.85 | 1170.04 | 1232.06 | $106.813 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | |
| deepmind/vision-perceiver-conv | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $0.720 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $1.342 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $0.915 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 746.14 | 1.32 | 1.38 | $0.122 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 161.05 | 6.21 | 6.25 | $0.566 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 73.26 | 13.64 | 13.68 | $1.244 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
Convolutional Neural Networks (CNN) Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Segmentation | 1010.8 | 15.82 | 15.88 | $0.090 | Batch | 2.21.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 3164.99 | 80.82 | 80.94 | $0.029 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 2449.88 | 104.41 | 104.53 | $0.037 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 6949.17 | 4.59 | 4.66 | $0.013 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 5158.61 | 6.18 | 6.25 | $0.018 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 4393.3 | 7.28 | 7.33 | $0.021 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 4734.4 | 54.04 | 54.09 | $0.019 | Batch | 2.21.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 2161.39 | 14.77 | 14.83 | $0.042 | Batch | 2.21.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Segmentation | 447.09 | 2.23 | 2.25 | $0.204 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 994.69 | 1.01 | 1.02 | $0.092 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 837.78 | 1.18 | 1.22 | $0.109 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1669.8 | 0.6 | 0.61 | $0.055 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1394.21 | 0.72 | 0.73 | $0.065 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 1218.88 | 0.83 | 0.85 | $0.075 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 629.19 | 1.59 | 1.6 | $0.145 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | | PyTorch 2.5 | Inf2.xlarge | Image Classification | 508.66 | 1.96 | 2.0 | $0.179 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
Stable Diffusion Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 1.5 | 512x512 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.49 | 2023.74 | 2031.7 | $184.435 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.13 | 7546.0 | 7550.98 | $685.046 | Real Time | 2.21.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.6 | 1679.8 | 1685.44 | $152.871 | Real Time | 2.21.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.19 | 5337.51 | 5357.36 | $487.225 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | | PyTorch 2.5 | Inf2.xlarge | Image Generation | 0.08 | 12048.66 | 12102.43 | $1,097.724 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | | PyTorch 2.5 | Inf2.8xlarge | Image Generation | 0.1 | 10546.45 | 10704.57 | $2,485.380 | Real Time | 2.21.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
Note
Cost per 1M images is calculated using RI-Effective hourly rate.
Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
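The cost columns follow directly from throughput and the instance's hourly rate: cost per 1M inferences = hourly rate / 3600 / throughput × 1,000,000. A sketch of that arithmetic (the $0.33/hour figure below is a hypothetical placeholder, not the published RI-Effective rate for any instance):

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of 1M inferences: (rate per second / throughput) * 1e6."""
    return hourly_rate_usd / 3600.0 / throughput_per_sec * 1e6

# Hypothetical $0.33/hour rate at 0.49 images/s (Stable Diffusion 1.5's
# throughput above); the real table uses the RI-Effective rate instead.
print(round(cost_per_million(0.33, 0.49), 3))  # 187.075
```

The formula makes the trade-off explicit: doubling throughput at a fixed hourly rate halves the cost per inference.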
Diffusion Transformer Models#
| Model | Image Size | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PixArt Alpha | 256x256 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.98 | 502.59 | 537.26 | $46.132 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Alpha | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.56 | 1769.76 | 1775.7 | $161.259 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 256x256 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 1.86 | 540.83 | 548.41 | $48.984 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| PixArt Sigma | 512x512 | | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.54 | 1841.88 | 1850.68 | $167.792 | Real Time | 2.2 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
Note
Cost per 1M images is calculated using RI-Effective hourly rate.
Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
Note
See Neuron Glossary for abbreviations and terms
This document is relevant for: Inf2