This document is relevant for: Trn2
Trainium2 Architecture#
Trainium2 is the third generation, purpose-built Machine Learning chip from AWS. Every Trainium2 chip contains eight NeuronCore-V3. Beginning with Trainium2, AWS Neuron adds support for Logical NeuronCore Configuration (LNC), which lets you combine the compute and memory resources of multiple physical NeuronCores into a single logical NeuronCore. The following diagram shows the architecture overview of a Trainium2 chip.

Trainium2 chip components#
Each Trainium2 chip consists of the following components:
Compute |
Eight NeuronCore-v3 that collectively deliver:
|
Device Memory |
96 GiB of device memory with 2.9 TB/sec of bandwidth. |
Data Movement |
3.5 TB/sec of DMA bandwidth, with inline memory compression and decompression. |
NeuronLink |
NeuronLink-v3 for chip-to-chip interconnect provides 1.28 TB/sec bandwidth per chip. It allows for efficient scale-out training and inference, as well as memory pooling between Trainium2 chips. |
Programmability |
Trainium2 supports dynamic shapes and control flow via NeuronCore-v3 ISA extensions. Trainium2 also allows for user-programmable rounding mode (Round Nearest Even or Stochastic Rounding), and custom operators via deeply embedded GPSIMD engines. |
Collective communication |
20 CC-Cores orchestrate collective communication among Trainium2 chips within and across instances. |
Trainium2 performance improvements#
The following set of tables offer a comparison between Trainium and Trainium2 chips.
Compute#
Trainium |
Trainium2 |
Improvement factor |
|
---|---|---|---|
FP8 (TFLOPS) |
191 |
1299 |
6.7x |
BF16/FP16/TF32 (TFLOPS) |
191 |
667 |
3.4x |
FP32 (TFLOPS) |
48 |
181 |
3.7x |
FP8/FP16/BF16/TF32 Sparse (TFLOPS) |
Not applicable |
2563 |
Not applicable |
Memory#
Trainium |
Trainium2 |
Improvement factor |
|
---|---|---|---|
HBM Capacity (GiB) |
32 |
96 |
3x |
HBM Bandwidth (TB/sec) |
0.8 |
2.9 |
3.6x |
SBUF Capacity (MiB) |
48 |
224 |
4.7x |
Memory Pool Size |
Up to 16 chips |
Up to 64 chips |
4x |
Interconnect#
Trainium |
Trainium2 |
Improvement factor |
|
---|---|---|---|
Inter-chip Interconnect (GB/sec/chip) |
384 |
1280 |
3.3x |
Data movement#
Trainium |
Trainium2 |
Improvement factor |
|
---|---|---|---|
CC Cores |
6 |
20 |
3.3x |
DMA barriers |
Write-after-write |
Strong-order-write |
>1x (Benefit DMA-size dependent) |
SBUF memory layout |
Row-major |
Row-major, Col-major-2B, Col-major-4B |
Not applicable |
Additional resources#
For a detailed description of NeuronCore-v3 hardware engines, instances powered by AWS Trainium2, and Logical NeuronCore configuration, see the following resources:
This document is relevant for: Trn2