# Trn1/Trn1n Inference Performance
Last update: Dec 19th, 2025
## Encoder Models
**Batch application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 3442.53 | 9.29 | 9.35 | $0.041 | Batch | 2.27.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 3421.57 | 9.34 | 9.42 | $0.042 | Batch | 2.27.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 1104.44 | 7.24 | 7.3 | $0.129 | Batch | 2.27.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 6281.13 | 5.07 | 5.2 | $0.023 | Batch | 2.26.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 3425.28 | 9.33 | 9.47 | $0.042 | Batch | 2.27.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 3377.82 | 9.47 | 9.56 | $0.042 | Batch | 2.27.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 1123.9 | 14.23 | 14.3 | $0.127 | Batch | 2.27.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.9 | trn1.32xlarge | Raw Output (AutoModelForMaskedLM) | 46.69 | 342.51 | 350.86 | $48.768 | Batch | 2.27.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
**Real Time application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 2321.98 | 0.86 | 0.91 | $0.061 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.8 | trn1.2xlarge | Raw Output (AutoModel) | 2085.45 | 0.94 | 1.03 | $0.068 | Real Time | 2.26.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 747.48 | 2.67 | 2.73 | $0.190 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 3672.38 | 0.54 | 0.59 | $0.039 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 2127.07 | 0.93 | 1.0 | $0.067 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 2094.37 | 0.96 | 1.01 | $0.068 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.9 | trn1.2xlarge | Raw Output (AutoModel) | 747.58 | 2.67 | 2.73 | $0.190 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.9 | trn1.32xlarge | Raw Output (AutoModelForMaskedLM) | 46.9 | 42.62 | 44.12 | $48.551 | Real Time | 2.27.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
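The encoder rows above share one serving pattern: an FP32 HuggingFace model compiled with matmult autocast to BF16 and replicated across NeuronCores in data-parallel mode. The following is a rough sketch of how one batch-mode row (bert-base-uncased, batch size 16, sequence length 128) might be set up with torch-neuronx; it is not the script behind these tables, and the specific compiler flags and the use of `torch_neuronx.DataParallel` are our reading of the "Matmult-BF16" and "Data Parallel" columns.

```python
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# FP32 HuggingFace encoder, exported in TorchScript-friendly mode.
model = AutoModel.from_pretrained("bert-base-uncased", torchscript=True)
model.eval()

# One example batch matching the table row: batch size 16, sequence length 128.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(
    ["example input"] * 16,
    padding="max_length",
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
example = (batch["input_ids"], batch["attention_mask"])

# Compile for Neuron; the flags below are our assumed mapping of the
# "Matmult-BF16" compilation autocast column.
traced = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast=matmult", "--auto-cast-type=bf16"],
)

# Replicate the compiled model across the instance's NeuronCores
# ("Data Parallel" run mode); inputs are split along the batch dimension.
dp_model = torch_neuronx.DataParallel(traced)
outputs = dp_model(*example)
```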
## Encoder-Decoder Models
**Batch application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
**Note**

Only for Encoder-Decoder:
- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:
- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
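The note gives the cost basis but not the arithmetic, so here is a minimal sketch of it (the helper names are ours, not part of the published scripts): cost per 1M inferences is the instance's hourly rate divided by the inferences completed per hour, scaled to one million, and the hourly rate can be back-solved from any published (cost, throughput) pair.

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """USD to run 1M inferences at a given hourly rate and sustained throughput."""
    seconds_per_million = 1_000_000 / throughput_per_sec
    return hourly_rate_usd * seconds_per_million / 3600


def implied_hourly_rate(cost_per_1m_usd: float, throughput_per_sec: float) -> float:
    """Back-solve the hourly rate implied by a published (cost, throughput) pair."""
    return cost_per_1m_usd * throughput_per_sec * 3600 / 1_000_000


# Example using the bert-base-uncased batch row (3421.57 inf/sec, $0.042 per 1M):
rate = implied_hourly_rate(0.042, 3421.57)        # ~0.52 USD/hour implied for trn1.2xlarge
print(round(cost_per_million(rate, 3421.57), 3))  # ~0.042, matching the table
```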
**Real Time application**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
**Note**

Only for Encoder-Decoder:
- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:
- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
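As a worked check of the definition above (our interpretation, not part of the published scripts): because both throughput and per-token latency count input plus output tokens, the per-token latency of a single-stream, batch-size-1 run is roughly the reciprocal of the tokens/second figure.

```python
def latency_per_token_ms(tokens_per_second: float) -> float:
    """Approximate per-token latency (ms) implied by a tokens/second throughput."""
    return 1000.0 / tokens_per_second


# google/flan-t5-xl, Real Time row: 120.29 tokens/second
print(round(latency_per_token_ms(120.29), 2))  # ~8.31 ms, matching the P50 column
# t5-3b, Real Time row: 110.23 tokens/second
print(round(latency_per_token_ms(110.23), 2))  # ~9.07 ms
```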