Trn1/Trn1n Inference Performance

This document is relevant for: Trn1, Trn1n

Last update: April 30th, 2024
Encoder Models
Batch application:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3098.91 | 10.32 | 10.39 | $0.120 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3176.04 | 10.07 | 10.13 | $0.117 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.1 | trn1.2xlarge | Sequence Classification | 3584.4 | 8.92 | 8.96 | $0.104 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3165.0 | 10.09 | 10.21 | $0.118 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1058.66 | 15.09 | 15.21 | $0.352 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1056.08 | 7.57 | 7.65 | $0.352 | Batch | 2.18.0 | Data Parallel | 4 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3072.36 | 10.41 | 10.57 | $0.121 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5554.69 | 5.75 | 5.88 | $0.067 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5585.58 | 5.72 | 5.8 | $0.067 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 5557.51 | 5.75 | 5.87 | $0.067 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3167.84 | 10.09 | 10.19 | $0.118 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3082.21 | 10.34 | 10.69 | $0.121 | Batch | 2.18.0 | Data Parallel | 16 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1053.24 | 15.19 | 15.26 | $0.353 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.1 | trn1.32xlarge | Raw Output (AutoModel) | 40.74 | 392.23 | 396.12 | $146.608 | Batch | 2.18.0 | Data Parallel | 8 | 128 | FP32 | Matmult-BF16 | U22 |
Real Time application:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (inference/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Sequence Length | Model Data Type | Compilation Autocast Data Type | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2122.39 | 0.94 | 0.98 | $0.175 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1999.59 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-cased-finetuned-mrpc | | PyTorch 2.1 | trn1.2xlarge | Sequence Classification | 2154.24 | 0.93 | 0.95 | $0.173 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 1999.54 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 739.62 | 2.69 | 2.75 | $0.503 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| bert-large-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 740.18 | 2.69 | 2.74 | $0.503 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| camembert-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2000.81 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3169.09 | 0.63 | 0.66 | $0.117 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-cased-distilled-squad | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3164.73 | 0.64 | 0.66 | $0.118 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| distilbert-base-uncased | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 3165.42 | 0.63 | 0.66 | $0.118 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| google/electra-base-discriminator | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2018.08 | 0.99 | 1.02 | $0.184 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-base | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 2001.64 | 0.99 | 1.03 | $0.186 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| roberta-large | | PyTorch 2.1 | trn1.2xlarge | Raw Output (AutoModel) | 734.46 | 2.73 | 2.77 | $0.507 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
| xlm-roberta-base | | PyTorch 2.1 | trn1.32xlarge | Raw Output (AutoModel) | 40.11 | 49.9 | 51.54 | $148.911 | Real Time | 2.18.0 | Data Parallel | 1 | 128 | FP32 | Matmult-BF16 | U22 |
Decoder Models
Batch application:

| Model | Scripts | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | Weight Storage Data Type | On-Demand hourly rate | Cost per 1M tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 933.5 | 55.16 | 61.47 | 9.95 | 10.1 | Batch | 2.18.1 | Tensor Parallel | 32 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | int8 | 21.5 | $6.398 |
| Llama-3-8B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 770.16 | 1265.95 | 1292.94 | 10.04 | 10.33 | Batch | 2.18.1 | Tensor Parallel | 32 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | int8 | 21.5 | $7.754 |
| Llama-3-8B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 1142.7 | 49.05 | 52.79 | 7.65 | 7.94 | Batch | 2.18.1 | Tensor Parallel | 32 | 8 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | int8 | 21.5 | $5.226 |
| Llama-3-70B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 120.36 | 1661.12 | 1672.71 | 32.33 | 33.27 | Batch | 2.18.1 | Tensor Parallel | 32 | 4 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $49.619 |
| Llama-3-70B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 140.51 | 129.86 | 132.03 | 28.38 | 29.11 | Batch | 2.18.1 | Tensor Parallel | 32 | 4 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $42.504 |
| Llama-3-70B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 138.01 | 130.37 | 130.48 | 28.08 | 28.53 | Batch | 2.18.1 | Tensor Parallel | 32 | 4 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $43.273 |
| Llama-2-7b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 917.25 | 66.4 | 70.64 | 10.1 | 10.46 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | bf16 | 21.5 | $6.511 |
| Llama-2-13b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 371.7 | 6668.7 | 6689.8 | 19.86 | 21.06 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $16.067 |
| Llama-2-13b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 184.28 | 4628.45 | 4635.25 | 21.09 | 22.39 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $32.408 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 141.45 | 156.85 | 158.41 | 26.72 | 30.17 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $42.220 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 143.43 | 270.16 | 270.56 | 26.91 | 27.91 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 512 | 256 | 256 | FP16 | Matmult-BF16 | bf16 | 21.5 | $41.640 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 145.13 | 156.69 | 161.41 | 27.21 | 30.6 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $41.151 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 133.25 | 1478.64 | 1479.78 | 28.55 | 29.5 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $44.819 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 129.28 | 1478.85 | 1482.93 | 31.67 | 32.02 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $46.197 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 120.63 | 2722.03 | 2730.95 | 31.79 | 33.23 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $49.509 |
| Mistral-7B-Instruct-v0.2 | | Transformers NeuronX | trn1.32xlarge | Text Generation | 484.58 | 8614.85 | 8630.24 | 15.44 | 15.94 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $12.325 |
| CodeLlama-13b-hf | | Transformers NeuronX | trn1.32xlarge | Text Generation | 370.98 | 6625.16 | 6628.26 | 19.92 | 20.95 | Batch | 2.18.0 | Tensor Parallel | 32 | 8 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $16.099 |
| CodeLlama-13b-hf | | Transformers NeuronX | trn1.32xlarge | Text Generation | 184.18 | 4626.17 | 4630.67 | 21.1 | 22.17 | Batch | 2.18.0 | Tensor Parallel | 32 | 4 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $32.426 |
Real Time application:

| Model | Scripts | Framework | Inst. Type | Task | Output Token Throughput (tokens/sec) | TTFT Latency P50 (ms) | TTFT Latency P99 (ms) | TPOT Latency P50 (ms) | TPOT Latency P99 (ms) | Application Type | Neuron Version | Run Mode | TP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type | Weight Storage Data Type | On-Demand hourly rate | Cost per 1M tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 157.25 | 17.09 | 21.62 | 7.03 | 7.16 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | int8 | 21.5 | $37.979 |
| Llama-3-8B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 140.5 | 153.02 | 159.13 | 7.04 | 7.13 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | int8 | 21.5 | $42.507 |
| Llama-3-8B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 178.19 | 14.75 | 22.94 | 5.86 | 6.0 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 4096 | 128 | 3968 | FP16 | Matmult-BF16 | int8 | 21.5 | $33.516 |
| Llama-3-70B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 37.7 | 547.0 | 553.89 | 26.2 | 26.79 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $158.398 |
| Llama-3-70B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 40.64 | 53.2 | 59.5 | 24.48 | 26.17 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $146.961 |
| Llama-3-70B | | Transformers NeuronX | trn1.32xlarge | Text Generation | 40.81 | 52.53 | 52.79 | 26.48 | 24.22 | Real Time | 2.18.1 | Tensor Parallel | 32 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $146.342 |
| Llama-2-7b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 161.71 | 13.32 | 14.12 | 6.7 | 6.85 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 8192 | 128 | 8064 | FP16 | Matmult-BF16 | bf16 | 21.5 | $36.932 |
| Llama-2-13b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 60.43 | 864.14 | 865.91 | 9.84 | 10.15 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $98.823 |
| Llama-2-13b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 31.4 | 2367.93 | 2369.14 | 13.41 | 15.77 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $190.204 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 39.29 | 53.91 | 54.95 | 25.18 | 26.58 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 256 | 128 | 128 | FP16 | Matmult-BF16 | bf16 | 21.5 | $152.020 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 39.18 | 81.88 | 98.78 | 25.27 | 25.76 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 512 | 256 | 256 | FP16 | Matmult-BF16 | bf16 | 21.5 | $152.443 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 39.16 | 57.75 | 64.76 | 25.45 | 26.13 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 1152 | 128 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $152.493 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 38.1 | 232.48 | 239.03 | 26.04 | 26.18 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 2048 | 1024 | 1024 | FP16 | Matmult-BF16 | bf16 | 21.5 | $156.771 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 37.71 | 236.78 | 241.15 | 26.62 | 27.03 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 3072 | 1024 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $158.375 |
| Llama-2-70b | | Transformers NeuronX | trn1.32xlarge | Text Generation | 36.78 | 690.96 | 695.92 | 26.85 | 27.04 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 4096 | 2048 | 2048 | FP16 | Matmult-BF16 | bf16 | 21.5 | $162.376 |
| Mistral-7B-Instruct-v0.2 | | Transformers NeuronX | trn1.32xlarge | Text Generation | 49.56 | 1322.87 | 1325.86 | 9.89 | 10.18 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $120.508 |
| CodeLlama-13b-hf | | Transformers NeuronX | trn1.32xlarge | Text Generation | 60.22 | 868.64 | 870.98 | 9.86 | 10.24 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 8192 | 4096 | 4096 | FP16 | Matmult-BF16 | bf16 | 21.5 | $99.181 |
| CodeLlama-13b-hf | | Transformers NeuronX | trn1.32xlarge | Text Generation | 31.38 | 2372.93 | 2375.92 | 13.4 | 13.79 | Real Time | 2.18.0 | Tensor Parallel | 32 | 1 | 16384 | 8192 | 8192 | FP16 | Matmult-BF16 | bf16 | 21.5 | $190.333 |
Encoder-Decoder Models
Batch application:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 122.52 | 8.16 | 8.19 | $48.745 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 116.29 | 8.58 | 8.66 | $51.356 | Batch | 2.17.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
Note

For Encoder-Decoder models only:

- Throughput (tokens/second) counts both input and output tokens.
- Latency per Token counts both input and output tokens.

Applicable to all models:

- Cost per 1M inferences is calculated using the On-Demand hourly rate.
- Real Time application refers to batch size 1 inference for minimal latency; Batch application refers to maximum throughput with minimum cost per inference.
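The cost columns in the tables above follow directly from the stated throughput and On-Demand hourly rate. A minimal sketch of that arithmetic (the helper name is illustrative, not part of any Neuron API):

```python
def cost_per_million(hourly_rate_usd: float, throughput_per_sec: float) -> float:
    """Cost of processing 1M inferences (or tokens) at a given
    On-Demand hourly rate and sustained throughput."""
    units_per_hour = throughput_per_sec * 3600  # units completed in one billed hour
    return hourly_rate_usd / units_per_hour * 1_000_000

# Llama-3-8B batch row: 933.5 tokens/sec on trn1.32xlarge at $21.50/hour
print(round(cost_per_million(21.5, 933.5), 3))  # → 6.398, matching the table's $6.398
```

Small discrepancies against other rows come from the throughput values being rounded to two decimals before publication.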
Real Time application:

| Model | Scripts | Framework | Inst. Type | Task | Throughput (tokens/second) | Latency per Token P50 (ms) | Latency per Token P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | TP Degree | DP Degree | Batch Size | Sequence Length | Input Length | Output Length | Model Data Type | Compilation Autocast Data Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/flan-t5-xl | | NeuronX Distributed | trn1.32xlarge | Text Generation | 120.29 | 8.31 | 8.34 | $49.649 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |
| t5-3b | | NeuronX Distributed | trn1.32xlarge | Text Generation | 110.23 | 9.07 | 9.12 | $54.180 | Real Time | 2.18.0 | Tensor Parallel | 8 | 1 | 1 | 128 | 128 | 84 | FP32 | Matmult-BF16 |