This document is relevant for: Trn1

Trainium Architecture

At the heart of the Trn1 instance are 16 Trainium chips, each of which includes two NeuronCore-v2 cores. Trainium is the second-generation, purpose-built machine learning accelerator from AWS. The Trainium chip architecture is depicted below:

[Figure: Trainium chip architecture (trainium-neurondevice.png)]

Each Trainium chip consists of:

Compute	Two NeuronCore-v2 cores delivering 380 INT8 TOPS, 190 FP16/BF16/cFP8/TF32 TFLOPS, and 47.5 FP32 TFLOPS.
Device Memory	32 GiB of device memory (for storing model state), with 820 GiB/sec of bandwidth.
Data Movement	1 TB/sec of DMA bandwidth, with inline memory compression/decompression.
NeuronLink	NeuronLink-v2 chip-to-chip interconnect, which enables efficient scale-out training as well as memory pooling between Trainium chips.
Programmability	Trainium supports dynamic shapes and control flow via ISA extensions of NeuronCore-v2. Trainium also allows user-programmable rounding modes (Round Nearest Even and Stochastic Rounding), and custom operators via the deeply embedded GPSIMD engines.
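
The difference between the two rounding modes named above can be illustrated in software. The following is a minimal sketch, not Trainium's hardware implementation and not a Neuron SDK API. The helper names (`round_bf16_rne`, `round_bf16_stochastic`, `accumulate`) are hypothetical, and the sketch assumes FP32 values are rounded to BF16 by operating on the raw bit pattern, with NaN and overflow handling omitted. It repeatedly adds a small step to a BF16 accumulator, which is the situation where the choice of rounding mode matters most.

```python
import random
import struct


def _f32_bits(x: float) -> int:
    """Return the IEEE-754 FP32 bit pattern of x (x is first rounded to FP32)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def round_bf16_rne(x: float) -> float:
    """Round to BF16 (upper 16 bits of FP32) with round-nearest-even.

    Illustrative only: NaN and overflow are not handled.
    """
    bits = _f32_bits(x)
    # Add just under half a BF16 ulp, plus one more if the kept LSB is odd,
    # so that exact ties land on the even neighbour.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return _bits_f32(bits & 0xFFFF0000)


def round_bf16_stochastic(x: float, rng: random.Random) -> float:
    """Round to BF16 with stochastic rounding.

    The value rounds away from zero with probability equal to the
    discarded fraction, so the result is unbiased in expectation.
    """
    bits = _f32_bits(x)
    bits += rng.getrandbits(16)  # uniform in [0, 0xFFFF]
    return _bits_f32(bits & 0xFFFF0000)


def accumulate(round_fn, start=1.0, step=0.001, n=1000):
    """Add `step` to a BF16 accumulator n times, rounding after each add."""
    total = start
    for _ in range(n):
        total = round_fn(total + step)
    return total


rng = random.Random(0)  # seeded so the run is repeatable
rne_total = accumulate(round_bf16_rne)
sr_total = accumulate(lambda v: round_bf16_stochastic(v, rng))

# The exact sum is 2.0. With round-nearest-even, each 0.001 step is smaller
# than half a BF16 ulp at 1.0 (2**-8), so the accumulator never moves.
print("round-nearest-even:", rne_total)
print("stochastic rounding:", sr_total)
```

In this sketch the round-nearest-even accumulator stays at its starting value because every update is rounded away, while the stochastic accumulator tracks the true sum on average. This is the usual motivation for stochastic rounding when accumulating small updates in low precision.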

For a detailed description of all the hardware engines, see NeuronCore-v2.
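
As a rough worked example, the per-chip figures listed above can be multiplied by the 16 chips in a Trn1 instance to estimate instance-level totals. This is a back-of-the-envelope sketch: the names (`CHIPS_PER_INSTANCE`, `PER_CHIP`, `instance_totals`) are hypothetical, and the results are simple sums of peak per-chip numbers, not measured or guaranteed instance performance.

```python
# Per-chip figures taken from the list above.
CHIPS_PER_INSTANCE = 16

PER_CHIP = {
    "neuron_cores": 2,
    "int8_tops": 380,
    "bf16_tflops": 190,         # also FP16 / cFP8 / TF32
    "fp32_tflops": 47.5,
    "device_memory_gib": 32,
    "memory_bw_gib_per_s": 820,
}


def instance_totals(per_chip, chips):
    """Scale each per-chip figure by the number of chips (naive peak sum)."""
    return {name: value * chips for name, value in per_chip.items()}


totals = instance_totals(PER_CHIP, CHIPS_PER_INSTANCE)
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions the instance has 32 NeuronCore-v2 cores and 512 GiB of device memory in total.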
