This document is relevant for: Trn1

Trn1/Trn1n Training Performance#

Last update: September 16th, 2024

Encoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16 Nodes(DP) | 16 | 1048576 | Lamb | 128 | 61765.4 | weak scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph2 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16 Nodes(DP) | 2 | 524288 | Lamb | 128 | 8396.1 | weak scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16/AMP | 16 | [32xNC(DP)] x 16 Nodes(DP) | 16 | 16384 | AdamW | 128 | 25853.3 | strong scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16 Nodes(DP) | 8 | 1048576 | Lamb | 128 | 31637.3 | weak scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 4379.6 | strong scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 65536 | Lamb | 128 | 4538.5 | strong scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |

Decoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 16 | TP=8 DP=64 | 1 | 1024 | AdamW | 4096 | 101.3 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U22 |
| Llama2-13B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=8 DP=16 PP=8 | 1 | 1024 | AdamW | 4096 | 84.0 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U22 |
| Llama2-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 4096 | 14.2 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U22 |
| Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 | 1 | 1024 | AdamW | 8192 | 49.47 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U20 |
| Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.59 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U20 |

Note

The TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) topology configuration refers to the degrees of 3D parallelism, i.e. how the model and data are sharded across NeuronCores.

TP and PP are specified in the run script, and DP is calculated by dividing the world size (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees.

For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances, the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, so the DP degree is 1024 (world size) / (4 (TP) * 4 (PP)) = 64. A minimal sketch of this calculation is shown below.
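The following Python snippet illustrates the DP-degree calculation described in the note above. The function name and constant are illustrative only and are not part of any Neuron library.

```python
# Illustrative sketch only: compute the data-parallel (DP) degree from the
# cluster size and the tensor-parallel (TP) / pipeline-parallel (PP) degrees.
NEURON_CORES_PER_INSTANCE = 32  # trn1.32xlarge / trn1n.32xlarge

def dp_degree(num_instances: int, tp: int, pp: int) -> int:
    """Return the DP degree for a given number of instances and TP/PP split."""
    world_size = num_instances * NEURON_CORES_PER_INSTANCE
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP * PP"
    return world_size // (tp * pp)

# Example from the note above: 32 instances, TP=4, PP=4 -> world size 1024, DP degree 64
print(dp_degree(num_instances=32, tp=4, pp=4))  # 64
```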

Vision Transformer Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5230.72 | weak scaling | 2.18.0 | ViT-Base Fine-tuning Example | 2.1.2.2.1.0 | U20 |
| HuggingFace CLIP-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 80 | 2560 | AdamW | 5465.77 | weak scaling | 2.18.0 | CLIP-Base Fine-tuning | 2.1.2.2.1.0 | U20 |
| HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 458.0907 | weak scaling | 2.18.0 | Vision Perceiver Conv Fine-tuning | 2.1.2.2.1.0 | U20 |
| HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 1407.20 | weak scaling | 2.18.0 | Language Perceiver Fine-tuning | 1.13.1.1.13.1 | U20 |

Note

Read more about strong vs. weak scaling in Training with Neuron - FAQ.

This document is relevant for: Trn1