This document is relevant for: Trn1
Trn1/Trn1n Training Performance#
Last update: December 20th, 2024
Encoder Models#
Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 1048576 | Lamb | 128 | 55102.1 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph2 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 16 | [32xNC(DP)] x 16Nodes(DP) | 2 | 524288 | Lamb | 128 | 7832.4 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16/AMP | 16 | [32xNC(DP)] x 16Nodes(DP) | 16 | 16384 | AdamW | 128 | 26069.4 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | FP32 | 16 | [32xNC(DP)] x 16Nodes(DP) | 8 | 1048576 | Lamb | 128 | 35144.6 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 16384 | AdamW | 128 | 3963.0 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace BERT-Large Ph1 pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 16 | 65536 | Lamb | 128 | 4137.9 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Decoder Models#
Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Sequence Length | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Llama2-7B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 16 | TP=8 DP=64 | 1 | 1024 | AdamW | 4096 | 95.14 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama2-13B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=8 DP=16 PP=8 | 1 | 1024 | AdamW | 4096 | 80.33 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama2-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 4096 | 14.44 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama-3.1-8B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=32 PP=1 | 1 | 1024 | AdamW | 8192 | 53.51 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Llama-3.1-70B pre-training | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+FP32Optimizer | 32 | TP=32 DP=4 PP=8 | 1 | 1024 | AdamW | 8192 | 7.97 | strong scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Note
TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel) in the Topology column refer to the degrees of 3D parallelism (how the model and data are sharded across NeuronCores).
TP and PP are specified in the run script, and DP is calculated by dividing the **world size** (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees.
For example, with TP = 4, PP = 4, and 32 instances (trn1.32xlarge), the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, so the DP degree = 1024 (world size) / (4 (TP) * 4 (PP)) = 64.
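To make the arithmetic above concrete, here is a minimal Python sketch of the DP-degree calculation; the TP, PP, and instance-count values are the illustrative numbers from the example, not a prescribed configuration or a specific benchmark row.

```python
# Minimal sketch of the DP-degree calculation described in the note above.
# The tp, pp, and num_instances values mirror the worked example and are
# illustrative assumptions only.

NEURON_CORES_PER_INSTANCE = 32  # trn1.32xlarge / trn1n.32xlarge expose 32 NeuronCores

num_instances = 32  # number of nodes/instances
tp = 4              # tensor-parallel degree (set in the run script)
pp = 4              # pipeline-parallel degree (set in the run script)

world_size = num_instances * NEURON_CORES_PER_INSTANCE  # 32 * 32 = 1024
dp = world_size // (tp * pp)                            # 1024 / (4 * 4) = 64

print(f"world size = {world_size}, DP degree = {dp}")
```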
Vision Transformer Models#
Model | Instance-Type | Training Data-Type | Nodes | Topology | Microbatch | Global Minibatch | Optimizer | Performance [seq/sec] | Strong/Weak Scaling | Neuron Version | Neuron Tutorial/Example | PyTorch Neuron (torch-neuronx) Version | OS Type
---|---|---|---|---|---|---|---|---|---|---|---|---|---
HuggingFace ViT-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 64 | 2048 | AdamW | 5907.61 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace CLIP-Base fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 80 | 2560 | AdamW | 5465.77 | weak scaling | 2.18.0 | | 2.5.1.2.4.0 | U22
HuggingFace Vision-Perceiver-Conv fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 4 | 128 | AdamW | 589.55 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
HuggingFace Language-Perceiver fine-tuning | trn1.32xlarge/trn1n.32xlarge | Autocast:BF16+SR | 1 | [32xNC(DP)] | 20 | 640 | AdamW | 140.14 | weak scaling | 2.21.0 | | 2.5.1.2.4.0 | U22
Note
Read more about strong vs. weak scaling in Training with Neuron - FAQ.
This document is relevant for: Trn1