This document is relevant for: Trn1, Trn1n
# Trn1/Trn1n Inference Performance
Last update: September 16th, 2024
## Encoder Models
**Throughput optimized (Batch)**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3076.47 | 10.4 | 10.45 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3076.23 | 10.4 | 10.47 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.1 | trn1.2xlarge | Sequence Classification | 3435.2 | 9.31 | 9.35 | $0.041 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3075.95 | 10.4 | 10.46 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1056.92 | 15.14 | 15.18 | $0.135 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1056.49 | 15.14 | 15.18 | $0.135 | Batch | 2.20.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3024.15 | 10.57 | 10.66 | $0.047 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5328.08 | 3.0 | 3.05 | $0.027 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5544.76 | 5.77 | 5.81 | $0.026 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5552.76 | 5.77 | 5.79 | $0.026 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3081.2 | 10.39 | 10.19 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3021.18 | 10.57 | 10.68 | $0.047 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1046.17 | 15.31 | 15.34 | $0.136 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.1 | trn1.32xlarge | Raw Output (AutoModel) | 46.93 | 694.39 | 707.43 | $48.521 | Batch | 2.20.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
**Latency optimized (Real Time)**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2053.59 | 0.96 | 1.02 | $0.069 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2008.41 | 0.99 | 1.02 | $0.071 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.1 | trn1.2xlarge | Sequence Classification | 2157.4 | 0.93 | 0.94 | $0.066 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2010.59 | 0.99 | 1.02 | $0.071 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 729.87 | 2.69 | 2.75 | $0.195 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 729.7 | 2.74 | 2.77 | $0.195 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1999.37 | 0.99 | 1.03 | $0.071 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3424.54 | 0.58 | 0.63 | $0.042 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3431.56 | 0.58 | 0.62 | $0.041 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3433.7 | 0.58 | 0.63 | $0.041 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2040.3 | 0.98 | 1.01 | $0.070 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1999.42 | 0.99 | 1.03 | $0.071 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 728.64 | 2.74 | 2.78 | $0.195 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.1 | trn1.32xlarge | Raw Output (AutoModel) | 41.6 | 48.25 | 49.76 | $54.741 | Real Time | 2.20.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
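The encoder rows above all follow the same recipe: the model is traced at a fixed batch size and sequence length, matrix-multiply operations are autocast to BF16 at compile time ("Matmult-BF16"), and the compiled model is replicated across NeuronCores ("Data Parallel" run mode). The sketch below shows one plausible way to set up that configuration with `torch_neuronx`; it is a minimal illustration, not the benchmark harness used to produce these tables, and the compiler flags are an assumed mapping of the autocast column.

```python
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# Shape from the bert-base-cased "Batch" row: batch 16, sequence length 128.
BATCH, SEQ_LEN = 16, 128

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", torchscript=True).eval()

# Example inputs fix the compiled shape; padding to max_length keeps it static.
enc = tokenizer(
    ["example input"] * BATCH,
    padding="max_length",
    max_length=SEQ_LEN,
    return_tensors="pt",
)
example = (enc["input_ids"], enc["attention_mask"])

# "Compilation Autocast Data Type: Matmult-BF16" -- weights stay FP32 while
# matrix-multiply operations are autocast to BF16 by the compiler.
traced = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# "Run Mode: Data Parallel" -- replicate the compiled model across the
# instance's NeuronCores and shard inputs along the batch dimension.
dp_model = torch_neuronx.DataParallel(traced)

# Feed one traced-size batch per NeuronCore (a trn1.2xlarge has two cores).
big = tokenizer(
    ["example input"] * (BATCH * 2),
    padding="max_length",
    max_length=SEQ_LEN,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = dp_model(big["input_ids"], big["attention_mask"])
```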
## Encoder-Decoder Models
**Throughput optimized (Batch)**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
**Note**

Only for encoder-decoder models:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- A "Real Time" application refers to batch size 1 inference for minimal latency; a "Batch" application refers to maximum throughput with minimum cost per inference.
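As a concrete reading of the cost note: dividing the hourly instance rate by the number of inferences completed per hour gives a per-inference cost, scaled here to one million. This is a sketch of the presumed arithmetic; the $0.51/hour rate below is a hypothetical placeholder, since the actual RI-Effective rates are not listed in this document.

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of 1M inferences at a given hourly instance rate and throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# With a hypothetical $0.51/hour rate and the bert-base-cased batch-mode
# throughput of 3076.23 inferences/sec from the encoder table:
print(f"${cost_per_million(0.51, 3076.23):.3f}")  # ≈ $0.046 per 1M inferences
```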
**Latency optimized (Real Time)**

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
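Because throughput and per-token latency count both input and output tokens (see the note above), the end-to-end latency of one request can be estimated by multiplying the per-token latency by the total token count. A small sanity check against the google/flan-t5-xl Real Time row:

```python
# Per the note, token counts include both input and output tokens.
input_len, output_len = 128, 84               # from the table row
tokens_per_request = input_len + output_len   # 212 tokens

p50_per_token_ms = 8.31                       # google/flan-t5-xl, Real Time
throughput_tokens_per_sec = 120.29

# Estimated end-to-end P50 latency for one request:
print(tokens_per_request * p50_per_token_ms / 1000)    # ≈ 1.76 s

# Cross-check: total tokens divided by throughput gives about the same figure.
print(tokens_per_request / throughput_tokens_per_sec)  # ≈ 1.76 s
```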
This document is relevant for: Trn1, Trn1n