This document is relevant for: Trn1
# Trn1/Trn1n Inference Performance
Last update: September 16th, 2024
## Encoder Models
| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3113.66 | 10.28 | 10.33 | $0.046 | Batch | 2.20.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3124.93 | 10.23 | 10.3 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.5 | trn1.2xlarge | Sequence Classification | 3499.81 | 9.14 | 9.16 | $0.041 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3128.66 | 10.23 | 10.28 | $0.045 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1062.71 | 15.04 | 15.12 | $0.134 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1061.39 | 7.54 | 7.59 | $0.134 | Batch | 2.21.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3101.04 | 10.3 | 10.36 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5704.33 | 5.6 | 5.67 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5705.78 | 5.6 | 5.68 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 5710.31 | 5.6 | 5.65 | $0.025 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3133.47 | 10.22 | 10.27 | $0.045 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 3102.21 | 10.29 | 10.44 | $0.046 | Batch | 2.21.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.5 | trn1.2xlarge | Raw Output (AutoModel) | 1059.13 | 15.09 | 15.16 | $0.134 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.5 | trn1.32xlarge | Raw Output (AutoModel) | 46.56 | 343.58 | 347.47 | $48.901 | Batch | 2.21.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
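For orientation, the encoder rows above correspond to a standard torch-neuronx flow: trace the Hugging Face model at the listed batch size and sequence length, let the compiler auto-cast matrix multiplications to BF16 while keeping FP32 weights, and replicate the compiled model across NeuronCores in data-parallel mode. The following is a minimal sketch of that setup, not the benchmark script itself; the model name, sample text, and exact compiler invocation are illustrative assumptions.

```python
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

# Illustrative settings taken from the bert-base-cased row:
# batch 16, sequence length 128, FP32 weights, Matmult-BF16 autocast.
model_name = "bert-base-cased"
batch_size, seq_len = 16, 128

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torchscript=True)
model.eval()

inputs = tokenizer(
    ["a short example sentence"] * batch_size,
    padding="max_length",
    max_length=seq_len,
    truncation=True,
    return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# Compile for NeuronCores; auto-cast matmult operations to BF16
# while keeping FP32 weights (the "Matmult-BF16" column).
neuron_model = torch_neuronx.trace(
    model,
    example,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)

# "Data Parallel" run mode: replicate the compiled model across the
# instance's NeuronCores. A single traced-size batch runs on one core;
# feeding a multiple of the traced batch size keeps all cores busy.
dp_model = torch_neuronx.DataParallel(neuron_model)
with torch.no_grad():
    outputs = dp_model(*example)
```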
## Encoder-Decoder Models
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $18.584 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $19.580 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

Encoder-decoder models only:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency; Batch application refers to maximum throughput at minimum cost per inference.
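As a concrete reading of the cost column: cost per 1M inferences is the instance hourly rate divided by the number of inferences completed in one hour at the listed throughput. A minimal sketch of that arithmetic follows; the hourly rate used here is a placeholder, not the actual RI-Effective price.

```python
def cost_per_1m_inferences(throughput_per_sec: float, hourly_rate_usd: float) -> float:
    """USD cost of one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

# Placeholder hourly rate for illustration only; substitute the
# RI-Effective rate for the instance type in question.
print(round(cost_per_1m_inferences(5704.33, hourly_rate_usd=0.51), 3))
```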
| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $18.929 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $20.656 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

Encoder-decoder models only:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the RI-Effective hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency; Batch application refers to maximum throughput at minimum cost per inference.
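Because both the throughput and latency-per-token figures count input and output tokens, the batch-1, DP-1 encoder-decoder rows can be cross-checked directly: tokens/second is approximately batch size times DP degree times 1000 divided by the per-token latency in milliseconds. A quick check against the google/flan-t5-xl Batch row above:

```python
# Cross-check of a batch-1, DP-1 encoder-decoder row: tokens/second
# implied by the P50 per-token latency (google/flan-t5-xl, Batch row).
batch_size, dp_degree, p50_latency_ms = 1, 1, 8.16
implied_tokens_per_sec = batch_size * dp_degree * 1000 / p50_latency_ms
print(round(implied_tokens_per_sec, 2))  # ~122.55, close to the reported 122.52
```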
This document is relevant for: Trn1