AWS Trn1 Architecture
Contents
This document is relevant for: Trn1
AWS Trn1 Architecture#
In this page, we provide an architectural overview of the AWS Trn1 instances, and the corresponding Trainium NeuronDevices that power them (Trainium devices from here on).
Table of contents
Trn1 Architecture#
The EC2 Trn1 instance is powered by up to 16 Trainium devices, and allows customers to choose between 2 instance sizes:
Instance size |
# of Trainium devices |
vCPUs |
Host Memory (GiB) |
FP8/FP16/BF16/TF32 TFLOPS |
FP32 TFLOPS |
Device Memory (GiB) |
Device Memory Bandwidth (GiB/sec) |
NeuronLink-v2 device-to-device |
EFA bandwidth (Gbps) |
---|---|---|---|---|---|---|---|---|---|
Trn1.2xlarge |
1 |
8 |
32 |
210 |
52.5 |
32 |
820 |
N/A |
up-to 25 |
Trn1.32xlarge |
16 |
128 |
512 |
3,360 |
840 |
512 |
13,120 |
Yes |
800 |
The Trn1.2xlarge instance size allows customers to train their models on a single Trainium device, which is useful for small model training, as well as model experimentation. The Trn1.32xlarge instance size comes with a high-bandwidth and low-latency NeuronLink-v2 device-to-device interconnect, which utilizes a 4D-HyperCube topology. This is useful for collective-communication between the Trainium devices during scale-out training, as well as for pooling the memory capacity of all Trainium devices, making it directly addressable from each one of the devices.
In a Trn1 server, the Trainium devices are connected in a 2D Torus topology, as depicted below:

The Trn1 instances are also available in an EC2 UltraCluster, which enables customers to scale Trn1 instances to over 30,000 Trainium devices, and leverage the AWS-designed non-blocking petabit-scale EFA networking infrastructure.

Trainium Architecture#
At the heart of the Trn1 instance are Trainium devices, the second generation purpose-built Machine Learning accelerator from AWS. The Trainium device architecture is depicted below:

Each Trainium device consists of:
- Compute:
2x NeuronCore-v2 cores, delivering 420 INT8 TOPS, 210 FP16/BF16/cFP8/TF32 TFLOPS, and 52.5 FP32 TFLOPS.
- Device Memory:
32GB of device memory (for storing model state), with 820 GB/sec of bandwidth.
- Data movement:
1 TB/sec of DMA bandwidth, with inline memory compression/decompression.
- NeuronLink:
NeuronLink-v2 for device-to-device interconnect enables efficient scale-out training, as well as memory pooling between the different Trainium devices.
- Programmability:
Trainium supports dynamic shapes and control flow, via ISA extensions of NeuronCore-v2. In addition, Trainium also allows for user-programmable rounding mode (Round Nearest Even Stochastic Rounding), and custom-operators via the deeply embedded GPSIMD Engine.
This document is relevant for: Trn1