This document is relevant for: Trn1
# Trn1/Trn1n Inference Performance
Last update: April 3rd, 2025
## Encoder Models
Batch application type (maximum throughput):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3349.27 | 9.54 | 9.63 | $0.042 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3352.19 | 9.54 | 9.6 | $0.042 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.5 | trn1.2xlarge | Sequence Classification | 3809.3 | 8.4 | 8.42 | $0.037 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3352.33 | 9.54 | 9.59 | $0.042 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1145.87 | 13.95 | 14.02 | $0.124 | Batch | 2.22.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1126.56 | 7.09 | 7.15 | $0.126 | Batch | 2.22.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3349.58 | 9.55 | 9.61 | $0.042 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 6128.56 | 5.21 | 5.32 | $0.023 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 6119.31 | 5.21 | 5.33 | $0.023 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 6119.44 | 5.21 | 5.32 | $0.023 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3350.69 | 9.54 | 9.63 | $0.042 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3346.77 | 9.55 | 9.63 | $0.042 | Batch | 2.22.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1142.22 | 14.0 | 14.07 | $0.125 | Batch | 2.22.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModel) | 44.62 | 358.19 | 366.19 | $51.033 | Batch | 2.22.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
Real Time application type (batch size 1, minimal latency):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 2139.63 | 0.94 | 0.96 | $0.066 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 2120.34 | 0.94 | 0.97 | $0.067 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.5 | trn1.2xlarge | Sequence Classification | 2280.63 | 0.88 | 0.89 | $0.062 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 2121.55 | 0.94 | 0.97 | $0.067 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 744.28 | 2.68 | 2.72 | $0.191 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 744.33 | 2.68 | 2.72 | $0.191 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 2120.39 | 0.94 | 0.97 | $0.067 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3427.86 | 0.58 | 0.6 | $0.041 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3433.83 | 0.58 | 0.6 | $0.041 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3417.38 | 0.58 | 0.61 | $0.042 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 2152.89 | 0.93 | 0.95 | $0.066 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 2116.51 | 0.94 | 0.97 | $0.067 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 744.01 | 2.68 | 2.74 | $0.191 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModel) | 40.9 | 49.01 | 50.47 | $55.665 | Real Time | 2.22.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
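For context on how an encoder row above maps to code: the "Raw Output (AutoModel)" task appears to correspond to running a plain Hugging Face AutoModel, and "Matmult-BF16" is the compiler autocast setting (matrix-multiply operations cast to BF16, weights kept in FP32). Below is a minimal sketch using the torch-neuronx tracing API with the bert-base-cased batch configuration (batch 16, sequence length 128); the actual benchmark scripts may differ in detail.

```python
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# Minimal sketch of one encoder configuration from the tables above:
# bert-base-cased, FP32 weights, matmult ops autocast to BF16 at compile
# time ("Matmult-BF16"), batch size 16, sequence length 128.
model = AutoModel.from_pretrained("bert-base-cased", torchscript=True)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch_size, seq_len = 16, 128
inputs = tokenizer(
    ["example input"] * batch_size,
    padding="max_length",
    max_length=seq_len,
    return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# "--auto-cast matmult --auto-cast-type bf16" is the compiler setting that
# the "Matmult-BF16" column refers to.
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)
outputs = neuron_model(*example)
```

The "Data Parallel" run mode then replicates the traced model across NeuronCores (for example via torch_neuronx.DataParallel) so that multiple cores serve requests concurrently.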
## Encoder-Decoder Models
Batch application type (maximum throughput):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

Only for Encoder-Decoder:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost per inference.
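To make the cost column concrete, it follows from throughput and the hourly instance rate alone. Below is a minimal sketch; the hourly rates are illustrative assumptions, not published prices (actual RI-Effective rates vary by region and commitment term).

```python
# Sketch: reconstructing "Cost per 1M inferences" from throughput and an
# hourly instance rate. The rates below are illustrative assumptions, not
# published prices; check current RI-Effective pricing for your region.
ASSUMED_RI_HOURLY_RATE_USD = {
    "trn1.2xlarge": 0.50,   # assumed rate for illustration
    "trn1.32xlarge": 8.20,  # assumed rate for illustration
}

def cost_per_million(throughput_per_sec: float, instance_type: str) -> float:
    """Cost of 1M inferences; when throughput is tokens/second (the
    encoder-decoder tables), this is effectively cost per 1M tokens."""
    hourly_rate = ASSUMED_RI_HOURLY_RATE_USD[instance_type]
    hours_for_one_million = 1_000_000 / throughput_per_sec / 3600
    return hourly_rate * hours_for_one_million

# bert-base-cased, Batch row: 3352.19 inferences/sec on trn1.2xlarge
print(f"${cost_per_million(3352.19, 'trn1.2xlarge'):.3f}")  # ~$0.041
# Close to the table's $0.042; the exact published rate differs slightly.
```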
Real Time application type (batch size 1, minimal latency):

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

Only for Encoder-Decoder:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost per inference.
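One way to read the encoder-decoder numbers: since both metrics count input and output tokens, at batch size 1 the throughput is approximately the reciprocal of the latency per token, and the end-to-end request latency is the per-token latency times (input + output) tokens. A quick arithmetic check against the t5-3b Real Time row:

```python
# Consistency check using the t5-3b Real Time row: with batch size 1, the
# reported tokens/second is roughly the reciprocal of latency per token,
# since both metrics count input and output tokens.
input_length, output_length = 128, 84   # tokens per request, from the table
latency_per_token_ms = 9.07             # P50 latency per token, from the table

tokens_per_request = input_length + output_length          # 212
throughput_tokens_per_sec = 1000.0 / latency_per_token_ms  # ~110.3
end_to_end_ms = tokens_per_request * latency_per_token_ms  # ~1923 ms

print(f"{throughput_tokens_per_sec:.1f} tokens/s, "
      f"{end_to_end_ms:.0f} ms end to end")
# The table reports 110.23 tokens/second for this row, which matches closely.
```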
This document is relevant for: Trn1