This document is relevant for: Trn1
Trn1/Trn1n Training Performance#
Last update: December 20th, 2024
Encoder Models#
Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 1048576 | Lamb | 128 | 55102.1 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph2 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Lamb | 128 | 7832.4 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16/AMP | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 16384 | AdamW | 128 | 26069.4 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16Nodes(DP) | 8 | 1048576 | Lamb | 128 | 35144.6 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 3963.0 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 65536 | Lamb | 128 | 4137.9 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Decoder Models#
Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Llama2-7B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 16 | TP=8 DP=64 | 1 | 1024 | AdamW | 4096 | 95.14 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama2-13B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=8 DP=16 PP=8 | 1 | 1024 | AdamW | 4096 | 80.33 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama2-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 4096 | 14.44 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 | 1 | 1024 | AdamW | 8192 | 53.51 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.97 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Note
TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) in the Topology column refer to the degrees of 3D parallelism (how the model and data are sharded across NeuronCores).
TP and PP are specified in the run script, and DP is calculated by dividing the **world size** (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees.
For example, with TP = 4, PP = 4, and 32 instances (trn1.32xlarge), the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, so the DP degree = 1024 (world size) / (4 (TP) * 4 (PP)) = 64.
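To make the arithmetic above concrete, here is a minimal Python sketch of the DP-degree calculation; the TP, PP, and instance-count values are the illustrative numbers from the example, not a prescribed configuration or a specific benchmark row.

```python
# Minimal sketch of the DP-degree calculation described in the note above.
# The tp, pp, and num_instances values mirror the worked example and are
# illustrative assumptions only.

NEURON_CORES_PER_INSTANCE = 32  # trn1.32xlarge / trn1n.32xlarge expose 32 NeuronCores

num_instances = 32  # number of nodes/instances
tp = 4              # tensor-parallel degree (set in the run script)
pp = 4              # pipeline-parallel degree (set in the run script)

world_size = num_instances * NEURON_CORES_PER_INSTANCE  # 32 * 32 = 1024
dp = world_size // (tp * pp)                            # 1024 / (4 * 4) = 64

print(f"world size = {world_size}, DP degree = {dp}")
```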
Vision Transformer Models#
Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---
HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5907.61 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace CLIP-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 80 | 2560 | AdamW | 5465.77 | weak scaling | 2.18.0 | | 2.5.1.2.4.0 | U22
HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 589.55 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 140.14 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Note
Read more about strong vs. weak scaling in Training with Neuron - FAQ.
This document is relevant for: Trn1