Trn1/Trn1n Performance
This document is relevant for: Inf2, Trn1, Trn1n
Last update: April 12th, 2023
Training Performance (Trn1 / Trn1n)
| Model | Model Data-Type | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace BERT-Large Ph1 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 262144 | Adam | 48414 | weak scaling | 2.9.0 | | 1.13.0.1.6.0 | U20 |
| HuggingFace BERT-Large Ph2 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Adam | 7522 | weak scaling | 2.9.0 | | 1.13.0.1.6.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 262144 | Lamb | 41242 | strong scaling | 2.9.0 | | 1.13.0.1.6.0 | U20 |
| GPT3-6.7B pre-training | FP32 | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [8xNC(TP)x4(DP)] x 16Nodes(DP) | 1 | 1024 | | 129 | | 2.9.0 | | 1.13.0.1.6.0 | U20 |
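The global minibatch in the table is the product of the microbatch size, the number of data-parallel workers implied by the topology column, and a gradient-accumulation factor. The accumulation factors below are inferred from the table values rather than stated in it; a minimal sketch of the arithmetic:

```python
# Sketch of how the Global Minibatch column relates to the other columns.
# The gradient-accumulation factors are inferred from the table, not stated in it.

def global_minibatch(microbatch, data_parallel_workers, grad_accum_steps):
    """Effective global batch size consumed per optimizer step."""
    return microbatch * data_parallel_workers * grad_accum_steps

# BERT-Large Ph1: 32 NeuronCores/node (DP) x 16 nodes = 512 data-parallel workers
assert global_minibatch(16, 32 * 16, 32) == 262144   # implies 32 accumulation steps
# BERT-Large Ph2
assert global_minibatch(2, 32 * 16, 512) == 524288   # implies 512 accumulation steps
# GPT3-6.7B: 8-way tensor parallel, so 32 NC / 8 = 4 DP groups per node x 16 nodes = 64 workers
assert global_minibatch(1, 4 * 16, 16) == 1024       # implies 16 accumulation steps
```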
Inference Performance
| Model | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | PyTorch 1.13.0 | trn1.2xlarge | 2693 | 11.86 | 12.16 | $0.138 | Batch | 2.8.0 | Data Parallel | 16 | bf16 |
| bert-base-cased | PyTorch 1.13.0 | trn1.2xlarge | 2607 | 6.11 | 6.27 | $0.143 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| bert-base-cased-finetuned-mrpc | PyTorch 1.13.0 | trn1.2xlarge | 2990 | 5.33 | 5.37 | $0.124 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| bert-large-cased | PyTorch 1.13.0 | trn1.2xlarge | 958 | 8.33 | 8.51 | $0.388 | Batch | 2.8.0 | Data Parallel | 4 | bf16 |
| roberta-base | PyTorch 1.13.0 | trn1.2xlarge | 2565 | 6.21 | 6.42 | $0.145 | Batch | 2.8.0 | Data Parallel | 8 | bf16 |
| roberta-large | PyTorch 1.13.0 | trn1.2xlarge | 948 | 8.42 | 8.53 | $0.393 | Batch | 2.8.0 | Data Parallel | 4 | bf16 |
Note

Throughput optimization was performed by selecting the batch size that maximized throughput. All compiler flags, data types, and parameters are identical between model configurations. Cost per 1M inferences is calculated using the On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
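The cost column can be reproduced from the throughput figure and the instance's On-Demand price. A minimal sketch, assuming an On-Demand rate of roughly $1.34/hour for trn1.2xlarge (an illustrative figure; actual pricing varies by region):

```python
# Sketch: cost per 1M inferences from throughput and an assumed On-Demand hourly rate.
# The $1.34/hour trn1.2xlarge rate is an assumption for illustration; check current pricing.

def cost_per_million(throughput_per_sec, hourly_rate_usd):
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_rate_usd / inferences_per_hour * 1_000_000

print(round(cost_per_million(2693, 1.34), 3))  # ~0.138, in line with the albert-base-v2 row
```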
| Model | Framework | Inst. Type | Throughput (/sec) | Latency P50 (ms) | Latency P99 (ms) | Cost per 1M inferences | Application Type | Neuron Version | Run Mode | Batch Size | Model details |
|---|---|---|---|---|---|---|---|---|---|---|---|
| albert-base-v2 | PyTorch 1.13.0 | trn1.2xlarge | 1792 | 1.09 | 1.54 | $0.208 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-base-cased | PyTorch 1.13.0 | trn1.2xlarge | 1792 | 1.1 | 1.49 | $0.208 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-base-cased-finetuned-mrpc | PyTorch 1.13.0 | trn1.2xlarge | 1834 | 1.09 | 1.12 | $0.203 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| bert-large-cased | PyTorch 1.13.0 | trn1.2xlarge | 658 | 3.03 | 3.09 | $0.566 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| roberta-base | PyTorch 1.13.0 | trn1.2xlarge | 1729 | 1.15 | 1.2 | $0.215 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
| roberta-large | PyTorch 1.13.0 | trn1.2xlarge | 645 | 3.09 | 3.13 | $0.577 | Real Time | 2.8.0 | Data Parallel | 1 | bf16 |
Note

Latency optimization was performed by selecting the batch size that minimized latency. All compiler flags, data types, and parameters are identical between model configurations. Cost per 1M inferences is calculated using the On-Demand hourly rate.

Real Time application refers to batch size 1 inference for minimal latency. Batch application refers to maximum throughput with minimum cost-per-inference.
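The figures above are for models compiled with torch-neuronx and run in data-parallel mode at the listed batch size. As a rough illustration of how such a measurement can be reproduced, the sketch below traces a Hugging Face model with torch_neuronx.trace and times batch-1 inference; the model name, sequence length, and iteration counts are illustrative choices, not the exact benchmark settings:

```python
# Minimal sketch: compile a model with torch-neuronx and time batch-1 inference.
# Model name, sequence length, and loop counts are illustrative, not the benchmark settings.
import time

import torch_neuronx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

encoded = tokenizer("hello world", padding="max_length", max_length=128, return_tensors="pt")
example = (encoded["input_ids"], encoded["attention_mask"])

# Compile for the NeuronCore; trace returns a module that executes on the device.
neuron_model = torch_neuronx.trace(model, example)

# Warm up, then measure average per-request latency at batch size 1.
for _ in range(10):
    neuron_model(*example)

n = 1000
start = time.time()
for _ in range(n):
    neuron_model(*example)
elapsed = time.time() - start
print(f"avg latency: {elapsed / n * 1000:.2f} ms, throughput: {n / elapsed:.0f} /sec")
```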