This document is relevant for: Trn1

Trn1/Trn1n Training Performance#

Last update: May 20th, 2025

Encoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Batch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16 Nodes(DP) | 16 | 1048576 | Lamb | 128 | 53655.8 | weak scaling | 2.23.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.7.5413 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16 Nodes(DP) | 8 | 1048576 | Lamb | 128 | 33095.9 | weak scaling | 2.23.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.7.5413 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 3705.1 | strong scaling | 2.23.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.7.5413 | U22 |
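As a rough cross-check of the batch-size columns, the global batch is typically microbatch × data-parallel workers × gradient-accumulation steps. The sketch below applies that assumed relationship to the first BERT-Large row; the gradient-accumulation count is inferred from the table, not taken from the tutorial.

```python
# Sketch only: assumes global batch = microbatch * DP workers * gradient-accumulation steps.
# Values come from the first BERT-Large row above; the accumulation count is inferred.
microbatch = 16
dp_workers = 32 * 16                 # 32 NeuronCores per node (DP) x 16 nodes (DP) = 512
global_batch = 1048576
grad_accum_steps = global_batch // (microbatch * dp_workers)
print(grad_accum_steps)              # 1048576 / (16 * 512) = 128
```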

Decoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Batch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 ZeRO-1 | 1 | 1024 | AdamW | 8192 | 41.99 | strong scaling | 2.21.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |
| Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.51 | strong scaling | 2.22.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |

Note

TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) in the Topology column refer to the degrees of 3D parallelism, i.e. how the model and data are sharded across NeuronCores.

TP and PP are specified in the run script; DP is calculated by dividing the **world size** (number of nodes/instances × number of NeuronCores per instance) by the product of the TP and PP degrees.

For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances: the world size is 32 (instances) × 32 (NeuronCores per instance) = 1024, so the DP degree is 1024 / (4 × 4) = 64, as shown in the sketch below.
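A minimal sketch of that DP calculation (the helper name is illustrative, not part of the Neuron SDK):

```python
def data_parallel_degree(num_instances: int, cores_per_instance: int,
                         tp_degree: int, pp_degree: int) -> int:
    """Illustrative helper: DP degree = world size / (TP * PP)."""
    world_size = num_instances * cores_per_instance
    assert world_size % (tp_degree * pp_degree) == 0, "world size must be divisible by TP * PP"
    return world_size // (tp_degree * pp_degree)

# Example from the note: 32 trn1.32xlarge instances (32 NeuronCores each), TP=4, PP=4
print(data_parallel_degree(32, 32, tp_degree=4, pp_degree=4))  # 1024 / 16 = 64
```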

For more information on batch sizes, please refer to Neuron Batching.

Vision Transformer Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Batch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5907.61 | weak scaling | 2.21.0 | ViT-Base Fine-tuning Example | 2.5.1.2.4.0 | U22 |
| HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 589.55 | weak scaling | 2.21.0 | Vision Perceiver Conv Fine-tuning | 2.5.1.2.4.0 | U22 |
| HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 140.14 | weak scaling | 2.21.0 | Language Perceiver Fine-tuning | 2.5.1.2.4.0 | U22 |

Note

Read more about strong vs. weak scaling in Training with Neuron - FAQ.

This document is relevant for: Trn1