This document is relevant for: Trn1
# Trn1/Trn1n Training Performance
Last update: May 20th, 2025
## Encoder Models

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Globalbatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16 Nodes (DP) | 16 | 1048576 | Lamb | 128 | 53655.8 | weak scaling | 2.23.0 | | 2.5.1.2.7.5413 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16 Nodes (DP) | 8 | 1048576 | Lamb | 128 | 33095.9 | weak scaling | 2.23.0 | | 2.5.1.2.7.5413 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 3705.1 | strong scaling | 2.23.0 | | 2.5.1.2.7.5413 | U22 |
## Decoder Models

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Globalbatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 ZeRO-1 | 1 | 1024 | AdamW | 8192 | 41.99 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22 |
| Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.51 | strong scaling | 2.22.0 | | 2.5.1.2.4.0 | U22 |
**Note**

TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) in the Topology column refer to the degrees of 3D parallelism, i.e., how the model and the data are sharded across NeuronCores.

TP and PP are specified in the run script; DP is derived by dividing the **world size** (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees.

For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances, the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, giving a DP degree of 1024 / (4 * 4) = 64. A worked example of this arithmetic follows below.

For more information on batch sizes, please refer to Neuron Batching.
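As a quick illustration of the topology arithmetic above, here is a minimal sketch in Python. The helper names `dp_degree` and `global_batch` are hypothetical, not part of any Neuron API:

```python
def dp_degree(num_instances: int, cores_per_instance: int, tp: int, pp: int) -> int:
    """DP degree = world size / (TP * PP), where world size = instances * cores per instance."""
    world_size = num_instances * cores_per_instance
    assert world_size % (tp * pp) == 0, "TP * PP must evenly divide the world size"
    return world_size // (tp * pp)


def global_batch(microbatch: int, dp: int, grad_accum_steps: int = 1) -> int:
    """Global batch size = microbatch * DP degree * gradient-accumulation steps."""
    return microbatch * dp * grad_accum_steps


# Worked example from the note: TP=4, PP=4 on 32 trn1.32xlarge instances (32 NeuronCores each).
assert dp_degree(num_instances=32, cores_per_instance=32, tp=4, pp=4) == 64

# Llama-3.1-8B row above: TP=32, PP=1 on 32 instances yields DP=32. With microbatch 1,
# the table's global batch of 1024 then implies 32 gradient-accumulation steps.
dp = dp_degree(num_instances=32, cores_per_instance=32, tp=32, pp=1)
assert dp == 32 and global_batch(microbatch=1, dp=dp, grad_accum_steps=32) == 1024
```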
## Vision Transformer Models

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Globalbatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5907.61 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22 |
| HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 589.55 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22 |
| HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 140.14 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22 |
**Note**

Read more about strong vs. weak scaling in the Training with Neuron - FAQ. The sketch below illustrates the distinction.
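As it applies to the Strong/Weak Scaling column above: under weak scaling the per-core batch is held fixed, so the global batch grows as NeuronCores are added; under strong scaling the global batch is held fixed, so each core's share shrinks. A minimal illustration, assuming those usual definitions (function names are illustrative only, not tied to any Neuron API):

```python
def weak_scaling_global_batch(per_core_batch: int, num_cores: int) -> int:
    """Weak scaling: per-core batch is fixed, so the global batch grows with core count."""
    return per_core_batch * num_cores


def strong_scaling_per_core_batch(global_batch: int, num_cores: int) -> int:
    """Strong scaling: global batch is fixed, so each core's share shrinks as cores are added."""
    return global_batch // num_cores


# BERT-Large single-node row above: 32 NeuronCores at microbatch 16 process
# 512 sequences per optimizer substep (before gradient accumulation).
assert weak_scaling_global_batch(per_core_batch=16, num_cores=32) == 512

# Under strong scaling, doubling cores from 32 to 64 at a fixed global batch of 1024
# halves each core's share from 32 to 16 sequences.
assert strong_scaling_per_core_batch(global_batch=1024, num_cores=64) == 16
```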