This document is relevant for: Trn1
# Trn1/Trn1n Inference Performance
Last update: September 16th, 2024
## Encoder Models
| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3113.66 | 10.28 | 10.33 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3124.93 | 10.23 | 10.3 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.5 | trn1.2xlarge | Sequence Classification | 3499.81 | 9.14 | 9.16 | $0.041 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3128.66 | 10.23 | 10.28 | $0.045 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1062.71 | 15.04 | 15.12 | $0.134 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1061.39 | 7.54 | 7.59 | $0.134 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3101.04 | 10.3 | 10.36 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5704.33 | 5.6 | 5.67 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5705.78 | 5.6 | 5.68 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5710.31 | 5.6 | 5.65 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3133.47 | 10.22 | 10.27 | $0.045 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3102.21 | 10.29 | 10.44 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1059.13 | 15.09 | 15.16 | $0.134 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModel) | 46.56 | 343.58 | 347.47 | $48.901 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
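For orientation, the encoder rows above correspond to a standard torch-neuronx flow: trace the Hugging Face model at the listed batch size and sequence length, let the compiler auto-cast matrix multiplications to BF16 while keeping FP32 weights, and replicate the compiled model across NeuronCores in data-parallel mode. The following is a minimal sketch of that setup, not the benchmark script itself; the model name, sample text, and exact compiler invocation are illustrative assumptions.

```python
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# Illustrative settings taken from the bert-base-cased row:
# batch 16, sequence length 128, FP32 weights, Matmult-BF16 autocast.
model_name = "bert-base-cased"
batch_size, seq_len = 16, 128

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torchscript=True)
model.eval()

inputs = tokenizer(
    ["a short example sentence"] * batch_size,
    padding="max_length",
    max_length=seq_len,
    truncation=True,
    return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# Compile for NeuronCores; auto-cast matmult operations to BF16
# while keeping FP32 weights (the "Matmult-BF16" column).
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# "Data Parallel" run mode: replicate the compiled model across the
# instance's NeuronCores. A single traced-size batch runs on one core;
# feeding a multiple of the traced batch size keeps all cores busy.
dp_model = torch_neuronx.DataParallel(neuron_model)
with torch.no_grad():
    outputs = dp_model(*example)
```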
## Encoder-Decoder Models
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

Encoder-decoder models only:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency; Batch application refers to maximum throughput at minimum cost per inference.
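As a concrete reading of the cost column: cost per 1M inferences is the instance hourly rate divided by the number of inferences completed in one hour at the listed throughput. A minimal sketch of that arithmetic follows; the hourly rate used here is a placeholder, not the actual RI-Effective price.

```python
def cost_per_1m_inferences(throughput_per_sec: float, hourly_rate_usd: float) -> float:
    """USD cost of one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# Placeholder hourly rate for illustration only; substitute the
# RI-Effective rate for the instance type in question.
print(round(cost_per_1m_inferences(5704.33, hourly_rate_usd=0.51), 3))
```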
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

Encoder-decoder models only:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency; Batch application refers to maximum throughput at minimum cost per inference.
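Because both the throughput and latency-per-token figures count input and output tokens, the batch-1, DP-1 encoder-decoder rows can be cross-checked directly: tokens/second is approximately batch size times DP degree times 1000 divided by the per-token latency in milliseconds. A quick check against the google/flan-t5-xl Batch row above:

```python
# Cross-check of a batch-1, DP-1 encoder-decoder row: tokens/second
# implied by the P50 per-token latency (google/flan-t5-xl, Batch row).
batch_size, dp_degree, p50_latency_ms = 1, 1, 8.16
implied_tokens_per_sec = batch_size * dp_degree * 1000 / p50_latency_ms
print(round(implied_tokens_per_sec, 2))  # ~122.55, close to the reported 122.52
```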
This document is relevant for: Trn1