This document is relevant for: Trn1

Trn1/Trn1n Training Performance#

Last update: December 20th, 2024

Encoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 1048576 | Lamb | 128 | 55102.1 | weak scaling | 2.21.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.4.0 | U22 |
| HuggingFace BERT-Large Ph2 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Lamb | 128 | 7832.4 | weak scaling | 2.21.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.4.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16/AMP | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 16384 | AdamW | 128 | 26069.4 | strong scaling | 2.21.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.4.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16Nodes(DP) | 8 | 1048576 | Lamb | 128 | 35144.6 | weak scaling | 2.21.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.4.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 3963.0 | strong scaling | 2.21.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.4.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 65536 | Lamb | 128 | 4137.9 | strong scaling | 2.21.0 | Hugging Face BERT Pretraining Tutorial (Data-Parallel) | 2.5.1.2.4.0 | U22 |
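The Microbatch and Global Minibatch columns are related through the data-parallel degree and the number of gradient-accumulation steps. The sketch below illustrates that relationship in plain Python; the gradient-accumulation step counts are inferred from the table values above and are assumptions, not figures stated in this document.

```python
# Sketch: relate microbatch size, data-parallel degree, and gradient
# accumulation to the global minibatch sizes reported in the table above.
# The gradient-accumulation step counts here are inferred from the table
# values and are assumptions, not figures stated in the source.

def global_minibatch(microbatch: int, dp_degree: int, grad_accum_steps: int) -> int:
    """Global minibatch = microbatch x data-parallel workers x accumulation steps."""
    return microbatch * dp_degree * grad_accum_steps

# Single-node BERT-Large Ph1 AdamW row: 32 NeuronCores of data parallelism.
assert global_minibatch(microbatch=16, dp_degree=32, grad_accum_steps=32) == 16384

# 16-node BERT-Large Ph1 AdamW row: 32 NeuronCores x 16 nodes = 512 DP workers.
assert global_minibatch(microbatch=16, dp_degree=512, grad_accum_steps=2) == 16384
```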

Decoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 16 | TP=8 DP=64 | 1 | 1024 | AdamW | 4096 | 95.14 | strong scaling | 2.21.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |
| Llama2-13B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=8 DP=16 PP=8 | 1 | 1024 | AdamW | 4096 | 80.33 | strong scaling | 2.21.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |
| Llama2-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 4096 | 14.44 | strong scaling | 2.21.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |
| Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 | 1 | 1024 | AdamW | 8192 | 53.51 | strong scaling | 2.21.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |
| Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.97 | strong scaling | 2.21.0 | NeuronX Distributed | 2.5.1.2.4.0 | U22 |

Note

The TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) topology configuration refers to the degrees of 3D parallelism, i.e. how the model and data are sharded across NeuronCores.

TP and PP are specified in the run script, and DP is calculated by dividing the world size (number of nodes/instances * number of NeuronCores per instance) by TP * PP.

For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances, the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, so the DP degree is 1024 / (4 * 4) = 64.
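This calculation can be written as a small helper. It is an illustrative sketch only; the function name is made up here, and it assumes 32 NeuronCores per trn1.32xlarge instance, as in the example above.

```python
# Sketch of the DP-degree calculation described in the note above.
# Assumes 32 NeuronCores per trn1.32xlarge instance, as in the example.

def dp_degree(num_instances: int, cores_per_instance: int, tp: int, pp: int) -> int:
    """DP = world size / (TP * PP), where world size = instances * cores per instance."""
    world_size = num_instances * cores_per_instance
    if world_size % (tp * pp) != 0:
        raise ValueError("TP * PP must evenly divide the world size")
    return world_size // (tp * pp)

# Example from the note: TP=4, PP=4, 32 instances of trn1.32xlarge.
assert dp_degree(num_instances=32, cores_per_instance=32, tp=4, pp=4) == 64

# Llama2-13B row above: 32 nodes, TP=8, PP=8 -> DP=16.
assert dp_degree(num_instances=32, cores_per_instance=32, tp=8, pp=8) == 16
```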

Vision Transformer Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5907.61 | weak scaling | 2.21.0 | ViT-Base Fine-tuning Example | 2.5.1.2.4.0 | U22 |
| HuggingFace CLIP-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 80 | 2560 | AdamW | 5465.77 | weak scaling | 2.18.0 | CLIP-Base Fine-tuning | 2.5.1.2.4.0 | U22 |
| HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 589.55 | weak scaling | 2.21.0 | Vision Perceiver Conv Fine-tuning | 2.5.1.2.4.0 | U22 |
| HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 140.14 | weak scaling | 2.21.0 | Language Perceiver Fine-tuning | 2.5.1.2.4.0 | U22 |

Note

Read more about strong vs. weak scaling in Training with Neuron - FAQ.
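As a quick illustration of the difference (a sketch, not taken from the FAQ): strong scaling holds the global minibatch fixed as workers are added, so the per-worker batch shrinks, while weak scaling holds the per-worker batch fixed, so the global minibatch grows with the number of workers. The values below are illustrative only.

```python
# Sketch contrasting strong and weak scaling of the global minibatch.
# Values are illustrative only; see the Training with Neuron - FAQ for details.

CORES_PER_NODE = 32  # NeuronCores per trn1.32xlarge (data-parallel workers per node)

def strong_scaling_per_worker_batch(global_batch: int, nodes: int) -> float:
    """Strong scaling: global batch stays fixed; per-worker batch shrinks with scale."""
    return global_batch / (nodes * CORES_PER_NODE)

def weak_scaling_global_batch(per_worker_batch: int, nodes: int) -> int:
    """Weak scaling: per-worker batch stays fixed; global batch grows with scale."""
    return per_worker_batch * nodes * CORES_PER_NODE

print(strong_scaling_per_worker_batch(16384, nodes=1))   # 512.0
print(strong_scaling_per_worker_batch(16384, nodes=16))  # 32.0
print(weak_scaling_global_batch(64, nodes=1))            # 2048
print(weak_scaling_global_batch(64, nodes=16))           # 32768
```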

This document is relevant for: Trn1