This document is relevant for: Inf2
# Inf2 Inference Performance
Last update: April 11th, 2024
## Encoder Models
| Model | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3039.05 | 5.25 | 5.34 | $0.069 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2713.61 | 2.92 | 3.41 | $0.078 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 3269.04 | 9.8 | 10.07 | $0.064 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2845.18 | 5.57 | 7.33 | $0.074 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 937.59 | 16.95 | 18.93 | $0.225 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 936.66 | 8.47 | 9.71 | $0.225 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2871.72 | 10.53 | 14.81 | $0.073 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4721.11 | 1.67 | 1.95 | $0.045 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4714.5 | 1.68 | 2.03 | $0.045 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 4794.99 | 6.15 | 10.11 | $0.044 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2949.72 | 10.76 | 14.1 | $0.071 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2875.1 | 5.43 | 7.32 | $0.073 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 992.13 | 7.94 | 8.98 | $0.212 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 50.39 | 633.24 | 703.87 | $71.558 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |

| Model | Framework | Inst. Type | Task | Throughput (inference/second) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base-v2 | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 2045.85 | 0.97 | 1.1 | $0.103 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1928.26 | 1.03 | 1.23 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | PyTorch 2.1 | Inf2.xlarge | Sequence Classification | 2116.28 | 0.93 | 1.0 | $0.099 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1929.1 | 1.03 | 1.19 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 730.18 | 2.72 | 2.82 | $0.288 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 728.39 | 2.75 | 2.84 | $0.289 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1932.63 | 1.03 | 1.13 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3200.64 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3201.45 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 3199.37 | 0.61 | 0.89 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1962.09 | 1.02 | 1.11 | $0.107 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 1934.61 | 1.03 | 1.16 | $0.109 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | PyTorch 2.1 | Inf2.xlarge | Raw Output (AutoModel) | 731.29 | 2.72 | 2.91 | $0.288 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | PyTorch 2.1 | Inf2.48xlarge | Raw Output (AutoModel) | 42.01 | 47.62 | 61.71 | $85.832 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |

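As a quick sanity check, the Real Time rows above are internally consistent if one assumes a fixed number of data-parallel workers each completing one inference per latency period. The two-worker count below is an assumption (the tables do not state how many NeuronCore workers back the Data Parallel run mode), so treat this as a sketch rather than the measurement methodology:

```python
# Sketch: relating batch size, parallel workers, and P50 latency to the
# throughput columns. NUM_WORKERS = 2 is an assumption, not from the tables.
def estimated_throughput(batch_size: int, num_workers: int, p50_latency_ms: float) -> float:
    """Inferences/second if each worker finishes one batch per latency period."""
    return batch_size * num_workers / (p50_latency_ms / 1000.0)

# bert-base-cased, Real Time row: batch size 1, P50 = 1.03 ms
approx = estimated_throughput(1, 2, 1.03)  # ~1942 inf/s vs. 1928.26 measured
```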
## Decoder Models
| Model | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | On-Demand hourly rate | Cost per 1M tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 725.83 | 77.36 | 87.28 | 12.11 | 13.06 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $4.968 |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 577.97 | 80.12 | 89.69 | 16.39 | 17.81 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | 12.981 | $6.239 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 589.89 | 108.81 | 113.89 | 14.9 | 15.79 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $6.113 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 351.76 | 7083.73 | 7158.32 | 20.99 | 21.8 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $10.251 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 178.57 | 5141.32 | 5160.93 | 21.71 | 22.74 | Batch | 2.18.0 | Tensor Parallel | 24 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $20.193 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.06 | 76.6 | 77.12 | 32.9 | 33.42 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | 12.981 | $119.940 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.92 | 96.44 | 98.47 | 33.13 | 33.46 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | 12.981 | $120.499 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.07 | 76.33 | 86.53 | 33.15 | 34.08 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | 12.981 | $119.914 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.43 | 277.02 | 280.13 | 33.73 | 34.01 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | 12.981 | $122.539 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.91 | 275.97 | 284.77 | 34.82 | 35.44 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | 12.981 | $124.711 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.33 | 810.44 | 814.88 | 34.9 | 35.14 | Batch | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | 12.981 | $127.292 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 761.89 | 77.62 | 86.63 | 11.64 | 12.5 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $4.733 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 450.38 | 4740.12 | 4783.75 | 16.55 | 17.53 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $8.006 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 411.05 | 11085.12 | 11125.86 | 18.01 | 19.96 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $8.772 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 546.51 | 115.81 | 121.5 | 15.87 | 17.21 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $6.598 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 333.24 | 7115.98 | 7231.01 | 22.27 | 23.81 | Batch | 2.18.0 | Tensor Parallel | 24 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $10.821 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 178.79 | 5136.62 | 5192.59 | 21.67 | 22.73 | Batch | 2.18.0 | Tensor Parallel | 24 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $20.168 |

| Model | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | On-Demand hourly rate | Cost per 1M tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 156.13 | 27.64 | 33.77 | 6.47 | 7.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $23.095 |
| Llama-2-7b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 145.17 | 29.21 | 33.39 | 7.34 | 7.8 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | 12.981 | $24.839 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 112.52 | 25.85 | 26.9 | 9.17 | 9.33 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $32.046 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 97.42 | 333.78 | 340.99 | 10.17 | 10.38 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $37.015 |
| Llama-2-13b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 73.17 | 994.18 | 999.8 | 13.5 | 13.98 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $49.282 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.06 | 76.6 | 77.12 | 32.9 | 33.42 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | 12.981 | $119.940 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.92 | 96.44 | 98.47 | 33.13 | 33.46 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | 12.981 | $120.499 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 30.07 | 76.33 | 86.53 | 33.15 | 34.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | 12.981 | $119.914 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 29.43 | 277.02 | 280.13 | 33.73 | 34.01 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | 12.981 | $122.539 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.91 | 275.97 | 284.77 | 34.82 | 35.44 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | 12.981 | $124.711 |
| Llama-2-70b | Transformers NeuronX | Inf2.48xlarge | Text Generation | 28.33 | 810.44 | 814.88 | 34.9 | 35.14 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | 12.981 | $127.292 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 149.74 | 27.34 | 29.21 | 6.86 | 7.08 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $24.081 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 81.7 | 557.96 | 562.86 | 7.87 | 11.65 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $44.133 |
| Mistral-7B-Instruct-v0.2 | Transformers NeuronX | Inf2.48xlarge | Text Generation | 95.99 | 539.59 | 557.1 | 10.33 | 10.61 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $37.563 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 112.71 | 27.02 | 33.25 | 9.12 | 9.38 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | 12.981 | $31.994 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 97.52 | 338.67 | 340.46 | 10.15 | 10.55 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | 12.981 | $36.975 |
| CodeLlama-13b-hf | Transformers NeuronX | Inf2.48xlarge | Text Generation | 73.68 | 989.5 | 1000.66 | 13.44 | 13.86 | Real Time | 2.18.0 | Tensor Parallel | 24 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | 12.981 | $48.940 |

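The TTFT (time to first token) and TPOT (time per output token) columns compose into an approximate end-to-end request latency: the prefill of the prompt, plus one decode step for each generated token after the first. A minimal sketch of that relation, checked against the Llama-2-7b Real Time row:

```python
# Sketch: composing TTFT (prefill) and TPOT (per-token decode) into an
# approximate end-to-end generation latency.
def e2e_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate request latency: prefill plus the remaining decode steps."""
    return ttft_ms + tpot_ms * (output_tokens - 1)

# Llama-2-7b Real Time row (input 128, output 3968): TTFT P50 27.64 ms, TPOT P50 6.47 ms
latency_s = e2e_latency_ms(27.64, 6.47, 3968) / 1000.0  # ~25.7 s for the full request
```

Dividing the 3968 output tokens by that ~25.7 s gives ~154 tokens/sec, close to the 156.13 tokens/sec reported in the table, which suggests the composition above matches how the throughput column was derived.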
## Encoder-Decoder Models
| Model | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/flan-t5-xl | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.61 | 8.51 | 8.53 | $15.328 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | NeuronX Distributed | Inf2.24xlarge | Text Generation | 111.92 | 8.97 | 8.98 | $16.108 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

**Note:** For encoder-decoder models only, Throughput (tokens/second) and Latency per Token both count input as well as output tokens.
| Model | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| google/flan-t5-xl | NeuronX Distributed | Inf2.24xlarge | Text Generation | 117.6 | 8.5 | 8.53 | $15.330 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | NeuronX Distributed | Inf2.24xlarge | Text Generation | 108.18 | 9.25 | 9.26 | $16.665 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

**Note:** Throughput (tokens/second) and Latency per Token both count input as well as output tokens.
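Because throughput counts every token (input and output), the Latency per Token column is effectively the reciprocal of throughput. A one-line sketch checked against the google/flan-t5-xl Real Time row:

```python
# Sketch: Latency per Token is roughly 1 / Throughput for these tables,
# since both metrics count input and output tokens.
def latency_per_token_ms(throughput_tokens_per_s: float) -> float:
    """Milliseconds per token at the given aggregate token throughput."""
    return 1000.0 / throughput_tokens_per_s

# google/flan-t5-xl Real Time row: 117.6 tokens/second
lpt = latency_per_token_ms(117.6)  # ~8.5 ms, matching the Latency per Token P50 column
```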
## Vision Transformer Models
| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepmind/multimodal-perceiver | 16x224x224 | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $253.681 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | None |
| deepmind/vision-perceiver-conv | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $1.664 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $3.101 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $2.114 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 1773.97 | 4.5 | 4.69 | $0.119 | Batch | 2.18.0 | Data Parallel | 2 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 6099.53 | 46.31 | 66.27 | $0.035 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 304.07 | 105.9 | 110.58 | $0.692 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepmind/multimodal-perceiver | 16x224x224 | PyTorch 1.13.1 | Inf2.xlarge | Multimodal Autoencoding | 0.83 | 1250.0 | 1271.0 | $253.681 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | None |
| deepmind/vision-perceiver-conv | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 126.5 | 14.14 | 14.2 | $1.664 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-fourier | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 67.9 | 29.5 | 29.68 | $3.101 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| deepmind/vision-perceiver-learned | 224x224 | PyTorch 1.13.1 | Inf2.xlarge | Image Classification | 99.6 | 18.6 | 18.7 | $2.114 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| google/vit-base-patch16-224 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 728.93 | 1.36 | 1.4 | $0.289 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-base-patch32 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 158.38 | 6.31 | 6.34 | $1.329 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| openai/clip-vit-large-patch14 | 224x224 | PyTorch 2.1.2 | Inf2.xlarge | Image Classification | 73.23 | 13.65 | 13.71 | $2.875 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

## Convolutional Neural Network (CNN) Models
| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 996.91 | 16.03 | 16.12 | $0.211 | Batch | 2.18.0 | Data Parallel | 4 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 3260.01 | 78.46 | 78.57 | $0.065 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2397.56 | 106.7 | 106.83 | $0.088 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 7068.94 | 4.51 | 4.56 | $0.030 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 5116.51 | 6.24 | 6.31 | $0.041 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4476.77 | 7.13 | 7.2 | $0.047 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 4671.31 | 54.75 | 54.85 | $0.045 | Batch | 2.18.0 | Data Parallel | 64 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 2144.44 | 14.91 | 14.98 | $0.098 | Batch | 2.18.0 | Data Parallel | 8 | FP32 | Matmult-BF16 |

| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Segmentation | 439.64 | 2.26 | 2.31 | $0.479 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet101 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1027.8 | 0.96 | 1.0 | $0.205 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet152 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 861.61 | 1.15 | 1.2 | $0.244 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet18 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1743.72 | 0.55 | 0.62 | $0.121 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet34 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1368.5 | 0.73 | 0.74 | $0.154 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| resnet50 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 1264.62 | 0.78 | 0.81 | $0.166 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg11 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 544.75 | 1.82 | 1.88 | $0.387 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| vgg16 | 224x224 | PyTorch 2.1 | Inf2.xlarge | Image Classification | 450.83 | 2.21 | 2.26 | $0.467 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

## Stable Diffusion Models
| Model | Image Size | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M images | Application Type | Neuron Version | Run Mode | Batch Size | Model Data Type | Compilation Autocast Data Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 1.5 | 512x512 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.48 | 2089.0 | 2093.0 | $439.573 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion 2 Inpainting | 936x624 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.16 | 6045.0 | 6063.4 | $1,276.094 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 512x512 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.6 | 1655.0 | 1663.0 | $348.602 | Real Time | 2.18.0 | Data Parallel | 1 | FP32, BF16 | Matmult-BF16 |
| Stable Diffusion 2.1 | 768x768 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.18 | 5504.0 | 5519.0 | $1,156.899 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base | 1024x1024 | PyTorch 2.1 | Inf2.xlarge | Image Generation | 0.08 | 12200.0 | 12260.0 | $2,567.751 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |
| Stable Diffusion XL Base & Refiner | 1024x1024 | PyTorch 2.1 | Inf2.8xlarge | Image Generation | 0.09 | 10741.0 | 11006.0 | $5,884.110 | Real Time | 2.18.0 | Data Parallel | 1 | FP32 | Matmult-BF16 |

**Note:** Cost per 1M images is calculated using the On-Demand hourly rate. The Real Time application type refers to batch size 1 inference for minimal latency; the Batch application type refers to maximum throughput at minimum cost per inference.
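The cost columns throughout this document follow directly from throughput and the instance's On-Demand hourly rate. A minimal sketch of that calculation, using the $12.981/hour rate listed for Inf2.48xlarge in the decoder tables (rates for other instance types are not listed here):

```python
# Sketch: deriving the "Cost per 1M ..." columns from throughput and the
# On-Demand hourly rate shown in the decoder tables.
def cost_per_million(throughput_per_s: float, hourly_rate_usd: float) -> float:
    """USD to produce one million inferences/tokens/images at the given throughput."""
    seconds_per_million = 1_000_000 / throughput_per_s
    return seconds_per_million / 3600 * hourly_rate_usd

# Llama-2-7b Batch row: 725.83 tokens/sec on Inf2.48xlarge at $12.981/hour
cost = cost_per_million(725.83, 12.981)  # ~$4.97 per 1M tokens, matching the table
```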
**Note:** See the Neuron Glossary for abbreviations and terms.