This document is relevant for: Trn1

Trn1/Trn1n Training Performance#

Last update: September 16th, 2024

Encoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16 Nodes(DP) | 16 | 1048576 | Lamb | 128 | 61765.4 | weak scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph2 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16 Nodes(DP) | 2 | 524288 | Lamb | 128 | 8396.1 | weak scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16/AMP | 16 | [32xNC(DP)] x 16 Nodes(DP) | 16 | 16384 | AdamW | 128 | 25853.3 | strong scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16 Nodes(DP) | 8 | 1048576 | Lamb | 128 | 31637.3 | weak scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 4379.6 | strong scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |
| HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 65536 | Lamb | 128 | 4538.5 | strong scaling | 2.20.0 | Hugging Face BERT Pretraining Tutorial | 2.1.2.2.3.0 | U22 |

Decoder Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 16 | TP=8 DP=64 | 1 | 1024 | AdamW | 4096 | 101.3 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U22 |
| Llama2-13B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=8 DP=16 PP=8 | 1 | 1024 | AdamW | 4096 | 84.0 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U22 |
| Llama2-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 4096 | 14.2 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U22 |
| Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 | 1 | 1024 | AdamW | 8192 | 49.47 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U20 |
| Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.59 | strong scaling | 2.20.0 | NeuronX Distributed | 2.1.2.2.3.0 | U20 |

Note

The TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) topology configuration refers to the degrees of 3D parallelism, i.e. how the model and data are sharded across NeuronCores.

TP and PP are specified in the run script, and DP is calculated by dividing the world size (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees.

For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances, the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, so the DP degree is 1024 (world size) / (4 (TP) * 4 (PP)) = 64. A minimal sketch of this calculation is shown below.
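The following Python snippet illustrates the DP-degree calculation described in the note above. The function name and constant are illustrative only and are not part of any Neuron library.

```python
# Illustrative sketch only: compute the data-parallel (DP) degree from the
# cluster size and the tensor-parallel (TP) / pipeline-parallel (PP) degrees.
NEURON_CORES_PER_INSTANCE = 32  # trn1.32xlarge / trn1n.32xlarge

def dp_degree(num_instances: int, tp: int, pp: int) -> int:
    """Return the DP degree for a given number of instances and TP/PP split."""
    world_size = num_instances * NEURON_CORES_PER_INSTANCE
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP * PP"
    return world_size // (tp * pp)

# Example from the note above: 32 instances, TP=4, PP=4 -> world size 1024, DP degree 64
print(dp_degree(num_instances=32, tp=4, pp=4))  # 64
```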

Vision Transformer Models#

| Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5230.72 | weak scaling | 2.18.0 | ViT-Base Fine-tuning Example | 2.1.2.2.1.0 | U20 |
| HuggingFace CLIP-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 80 | 2560 | AdamW | 5465.77 | weak scaling | 2.18.0 | CLIP-Base Fine-tuning | 2.1.2.2.1.0 | U20 |
| HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 458.0907 | weak scaling | 2.18.0 | Vision Perceiver Conv Fine-tuning | 2.1.2.2.1.0 | U20 |
| HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 1407.20 | weak scaling | 2.18.0 | Language Perceiver Fine-tuning | 1.13.1.1.13.1 | U20 |

Note

Read more about strong vs. weak scaling in Training with Neuron - FAQ.

This document is relevant for: Trn1