This document is relevant for: Trn1
Trn1/Trn1n Inference Performance
Last update: Sept 18th, 2025
Encoder Models

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 3356.44 | 9.53 | 9.66 | $0.042 | Batch | 2.26.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 3329.87 | 9.59 | 9.68 | $0.043 | Batch | 2.26.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 1101.8 | 7.25 | 7.32 | $0.129 | Batch | 2.26.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 6281.13 | 5.07 | 5.2 | $0.023 | Batch | 2.26.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 3333.39 | 9.59 | 9.69 | $0.043 | Batch | 2.26.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 3320.0 | 9.64 | 9.81 | $0.043 | Batch | 2.26.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 1111.5 | 14.4 | 14.48 | $0.128 | Batch | 2.26.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModelForMaskedLM) | 44.62 | 358.19 | 366.19 | $51.033 | Batch | 2.22.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 2264.09 | 0.88 | 0.91 | $0.063 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 2085.45 | 0.94 | 1.03 | $0.068 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 745.01 | 2.69 | 2.74 | $0.191 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 3481.16 | 0.58 | 0.62 | $0.041 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 2122.19 | 0.94 | 0.98 | $0.067 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 2033.7 | 0.99 | 1.04 | $0.070 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 744.3 | 2.69 | 2.74 | $0.191 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModelForMaskedLM) | 40.9 | 49.01 | 50.47 | $55.665 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
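
As a rough illustration of how one of these encoder configurations can be exercised, the sketch below compiles bert-base-uncased at batch size 16 and sequence length 128 with FP32 weights and Matmult-BF16 autocast using torch-neuronx. This is not the benchmark script behind the published numbers; the compiler flag spellings and the single-core timing loop are assumptions made for illustration only.

```python
# Hedged sketch: compile and time one encoder configuration with torch-neuronx
# on a Trn1 instance. Assumes the Neuron SDK, torch-neuronx, and transformers
# are installed; not the script used to generate the table above.
import time

import torch_neuronx
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
BATCH_SIZE, SEQ_LEN = 16, 128  # matches the Batch row above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torchscript=True).eval()

# Fixed-shape example inputs: Neuron compiles for a static (batch, sequence) shape.
enc = tokenizer(
    ["hello world"] * BATCH_SIZE,
    padding="max_length", max_length=SEQ_LEN, truncation=True,
    return_tensors="pt",
)
example = (enc["input_ids"], enc["attention_mask"])

# "Matmult-BF16" autocast: FP32 weights, with matrix multiplies cast to BF16 by
# the compiler (assumed neuronx-cc flag spelling).
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# The "Data Parallel" run mode in the tables additionally spreads batches across
# the NeuronCores of the instance (e.g. via torch_neuronx.DataParallel); this
# sketch times a single NeuronCore only.
neuron_model(*example)  # warm-up
iters = 100
start = time.time()
for _ in range(iters):
    neuron_model(*example)
elapsed = time.time() - start
print(f"~{iters * BATCH_SIZE / elapsed:.0f} inferences/sec on one NeuronCore")
```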

Encoder-Decoder Models

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |

Note

Only for Encoder-Decoder models:
- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:
- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency; Batch application refers to maximum throughput with minimum cost per inference.
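
To make these definitions concrete, the short sketch below walks through the arithmetic for the google/flan-t5-xl Batch row. The hourly rate it uses is a placeholder assumption back-solved from the table, not a published price.

```python
# Hedged arithmetic sketch for the metric definitions above, using the
# google/flan-t5-xl Batch row. The hourly rate is an assumed placeholder.
INPUT_TOKENS = 128        # Input Length
OUTPUT_TOKENS = 84        # Output Length
THROUGHPUT_TPS = 122.52   # tokens/second, counting input and output tokens

# "Counts both input and output tokens": one request moves 128 + 84 = 212 tokens.
request_seconds = (INPUT_TOKENS + OUTPUT_TOKENS) / THROUGHPUT_TPS  # ~1.73 s
latency_per_token_ms = 1000.0 / THROUGHPUT_TPS                     # ~8.16 ms, matches P50

# Cost per 1M follows from the hourly rate and the throughput column:
# cost = hourly_rate / (throughput * 3600) * 1e6
ASSUMED_HOURLY_RATE_USD = 8.20  # placeholder effective rate for trn1.32xlarge
cost_per_1m = ASSUMED_HOURLY_RATE_USD / (THROUGHPUT_TPS * 3600) * 1_000_000

print(f"{request_seconds:.2f} s per request, "
      f"{latency_per_token_ms:.2f} ms per token, "
      f"~${cost_per_1m:.2f} per 1M")  # roughly the $18.584 shown in the table
```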

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
This document is relevant for: Trn1