This document is relevant for: Trn1, Trn1n

Trn1/Trn1n Training Performance

Last update: April 11, 2024

Encoder Models

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 1048576 | Lamb | 128 | 52921 | weak scaling | 2.18.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.1.0 | U20 |
| HuggingFace BERT-Large Ph2 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Lamb | 128 | 7589 | weak scaling | 2.18.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.1.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16/AMP | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 16384 | AdamW | 128 | 24518.47 | strong scaling | 2.14.0 | Hugging Face BERT Pretraining Tutorial | 1.13.1.1.11.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16Nodes(DP) | 8 | 1048576 | Lamb | 128 | 29271 | weak scaling | 2.18.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.1.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 3311.59 | strong scaling | 2.18.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.1.0 | U20 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 65536 | Lamb | 128 | 3478 | strong scaling | 2.18.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.1.0 | U20 |
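The [32xNC(DP)] x 16Nodes(DP) topology indicates data parallelism across the 32 NeuronCores of each instance and across the 16 nodes. The sketch below (not part of the tutorial) shows how the Global Minibatch column can be decomposed for such a run; the gradient-accumulation step count is inferred from the table values, not something the table states.

```python
# Illustrative only: decompose the "Global Minibatch" column for a pure data-parallel
# topology such as [32xNC(DP)] x 16Nodes(DP). The accumulation-step count is inferred
# from the table values, not stated in the table.

def inferred_grad_accum_steps(global_minibatch: int, microbatch: int,
                              cores_per_node: int, nodes: int) -> int:
    """Gradient-accumulation steps implied by microbatch * DP workers vs. global minibatch."""
    dp_workers = cores_per_node * nodes            # one data-parallel worker per NeuronCore
    samples_per_step = microbatch * dp_workers     # samples processed per forward/backward pass
    assert global_minibatch % samples_per_step == 0
    return global_minibatch // samples_per_step

# First BERT-Large Ph1 row: microbatch 16, [32xNC(DP)] x 16Nodes(DP), global minibatch 1048576
print(inferred_grad_accum_steps(1_048_576, 16, 32, 16))   # -> 128
```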

Decoder Models

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-23B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 32 | TP=8 DP=32 PP=4 | 1 | 1024 | AdamW | 2048 | 107.18 | strong scaling | 2.18.0 | Nemo Megatron | 2.1.2.2.1.0 | U20 |
| GPT3-46B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 32 | TP=8 DP=16 PP=8 | 1 | 1024 | AdamW | 2048 | 48.24 | strong scaling | 2.18.0 | Nemo Megatron | 2.1.2.2.1.0 | U20 |
| GPT3-175B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 2048 | 14.60 | strong scaling | 2.18.0 | Nemo Megatron | 2.1.2.2.1.0 | U20 |
| Llama2-7B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 16 | TP=8 DP=64 | 1 | 1024 | AdamW | 4096 | 92.93 | strong scaling | 2.18.0 | NeuronX Distributed | 2.1.2.2.1.0 | U20 |
| Llama2-13B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=8 DP=32 PP=4 | 1 | 1024 | AdamW | 4096 | 87.44 | strong scaling | 2.18.0 | NeuronX Distributed | 2.1.2.2.1.0 | U20 |
| Llama2-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=8 PP=4 | 1 | 1024 | AdamW | 4096 | 17.86 | strong scaling | 2.18.0 | NeuronX Distributed | 2.1.2.2.1.0 | U20 |

Note

The TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) entries in the Topology column are the degrees of 3D parallelism, i.e. how the model and data are sharded across NeuronCores.

TP and PP are specified in the run script; DP is calculated by dividing the world size (number of nodes/instances × number of NeuronCores per instance) by the product of the TP and PP degrees.

For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances, the world size is 32 instances × 32 NeuronCores per instance = 1024, so the DP degree is 1024 / (4 × 4) = 64.
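A minimal Python sketch of this calculation follows; the helper name is illustrative and not part of any Neuron package.

```python
# Minimal sketch of the DP-degree calculation described in the note above.
CORES_PER_INSTANCE = 32  # NeuronCores on a trn1.32xlarge / trn1n.32xlarge

def dp_degree(num_instances: int, tp: int, pp: int) -> int:
    """DP = world size / (TP * PP), with world size = instances * NeuronCores per instance."""
    world_size = num_instances * CORES_PER_INSTANCE
    assert world_size % (tp * pp) == 0, "TP * PP must divide the world size"
    return world_size // (tp * pp)

print(dp_degree(num_instances=32, tp=4, pp=4))   # -> 64, the example in the note
print(dp_degree(num_instances=32, tp=8, pp=4))   # -> 32, matching the GPT3-23B row above
```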

Vision Transformer Models

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5230.72 | weak scaling | 2.18.0 | ViT-Base Fine-tuning Example | 2.1.2.2.1.0 | U20 |
| HuggingFace CLIP-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 80 | 2560 | AdamW | 5465.77 | weak scaling | 2.18.0 | CLIP-Base Fine-tuning | 2.1.2.2.1.0 | U20 |
| HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 458.0907 | weak scaling | 2.18.0 | Vision Perceiver Conv Fine-tuning | 2.1.2.2.1.0 | U20 |
| HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 1407.20 | weak scaling | 2.18.0 | Language Perceiver Fine-tuning | 1.13.1.1.13.1 | U20 |

Note

Read more about strong vs. weak scaling in Training with Neuron - FAQ.
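As a generic arithmetic illustration (not taken from the FAQ, and using a pairing of rows the tables themselves do not draw), weak-scaling efficiency is commonly reported as the measured multi-node throughput divided by N times the single-node throughput:

```python
# Generic illustration: weak-scaling efficiency = measured throughput / (N x single-node throughput).

def weak_scaling_efficiency(multi_node_seq_per_sec: float,
                            single_node_seq_per_sec: float,
                            nodes: int) -> float:
    return multi_node_seq_per_sec / (single_node_seq_per_sec * nodes)

# BERT-Large Ph1 Lamb rows above: 3478 seq/s on 1 node and 52921 seq/s on 16 nodes,
# both at a per-node minibatch of 65536 (1048576 / 16). This pairing is our own,
# not one the tables make.
print(f"{weak_scaling_efficiency(52921, 3478, 16):.1%}")   # -> about 95%
```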

This document is relevant for: Trn1, Trn1n