What is the Trainium Memory Hierarchy#

This topic describes the Trainium memory hierarchy and how it applies to developing with the AWS Neuron SDK. It covers the various memories available on Trainium hardware and how they are used. Understanding the memory hierarchy is important for writing performant kernels for your Machine Learning models.

Memory hierarchy#

The diagram in Fig. 19, below, shows the four-level memory hierarchy available to a single NeuronCore. The latency ranges provided in the figure are approximate and are intended to calibrate the programmer's mental model (see the NeuronDevice Architecture Guide for exact values). Memories closer to the top of the figure are closer to the compute engines; therefore, they are designed to provide the highest bandwidth and lowest latency. However, these faster memories also have smaller capacities than the memories near the bottom. Together, this set of memories forms the memory hierarchy for Trainium devices.

Unlike the memory hierarchies of traditional processors (such as CPUs and GPUs), all of the memories available to a NeuronCore are software-managed. This means the contents of the memories are managed either directly by the programmer or by the Neuron SDK toolchain, rather than by the hardware. In other words, a NeuronCore does not have a hardware cache system that performs data movement across memories in a way that is opaque to the program. All memory movement is explicit in the program itself. These explicit memory movements may be specified by writing a NKI kernel, or they may be computed by the Neuron Graph Compiler as part of the optimization process.
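For example, here is a minimal sketch of a NKI kernel, modeled on the element-wise add example from Getting Started with NKI, in which every data movement is explicit: the inputs are loaded from device memory (HBM) into the on-chip SBUF, the addition is computed on-chip, and the result is stored back to HBM. The decorator, APIs, shape, and names are assumptions chosen for illustration rather than a reference implementation:

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    # The kernel output lives in device memory (HBM), like its inputs.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Index grid for a single [128, 512] tile (this sketch assumes the
    # inputs are exactly that shape).
    ix = nl.arange(128)[:, None]
    iy = nl.arange(512)[None, :]

    # Explicitly move the inputs from HBM into the on-chip SBUF.
    a_tile = nl.load(a_input[ix, iy])
    b_tile = nl.load(b_input[ix, iy])

    # Compute on the on-chip tiles.
    c_tile = a_tile + b_tile

    # Explicitly move the result from SBUF back to HBM.
    nl.store(c_output[ix, iy], value=c_tile)
    return c_output
```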

In the following sections, we discuss each memory in turn.

[Figure: ../../_images/pm-memory.png]

Fig. 19 NeuronCore Memory Hierarchy with Capacity and Bandwidth Ranges#

NeuronCore external memory#

The two memories at the bottom of the hierarchy, host memory and device memory, are both considered external memory for a NeuronCore. Both are linear memories, in which multi-dimensional tensors must be stored in a flattened manner.

The host memory is the CPU-attached DRAM, which is accessible by the host CPUs and all the NeuronCores attached to the instance. NKI kernels currently do not provide APIs to move data in and out of the host memory directly, but rather, rely on ML frameworks such as PyTorch or JAX to send input data from host memory to the NeuronDevice and vice versa. For an example of this, see Getting Started with NKI.
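As a hedged sketch of that framework-side flow, the snippet below uses PyTorch with torch-xla (as in Getting Started with NKI) to create tensors in host memory, move them into device memory, and invoke a @nki.jit kernel such as tensor_add_kernel sketched above; the exact setup depends on your torch-neuronx installation:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # the NeuronDevice, exposed through torch-xla

# Tensors are first created in host (CPU-attached) memory...
a = torch.rand((128, 512), dtype=torch.bfloat16)
b = torch.rand((128, 512), dtype=torch.bfloat16)

# ...and the framework copies them into device memory (HBM).
a_dev = a.to(device)
b_dev = b.to(device)

# The NKI kernel (for example, tensor_add_kernel above) operates on HBM tensors.
c_dev = tensor_add_kernel(a_dev, b_dev)

# Reading the result back copies it from device memory to host memory.
c = c_dev.cpu()
print(c)
```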

The device memory resides within a NeuronDevice and uses High Bandwidth Memory (HBM) technologies starting from NeuronDevice v2. Currently, the input and output parameters to NKI kernels must be HBM tensor references. When a NKI kernel begins execution, the first task is to load the input tensors from HBM into the internal memory. Then computation can be done on the tensors in internal memory. Once the computation is complete, the results are copied from the internal memory back to the HBM.

NeuronCore internal memory#

The two memories at the top of the hierarchy, SBUF and PSUM, are both considered internal (or on-chip) memory for a NeuronCore. Both memories are two-dimensional, organized in 128 partitions. The partition size of PSUM is typically much smaller than that of SBUF, and both partition sizes vary across NeuronCore generations.
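To illustrate, the fragment below (which would appear inside a NKI kernel body) declares one tile in each on-chip memory; the nl.par_dim, nl.ndarray, and nl.zeros usage and the free-dimension sizes are assumptions chosen for this sketch:

```python
import neuronxcc.nki.language as nl

# An SBUF tile: 128 partitions (the mandatory partition dimension) by an
# illustrative 512 elements per partition.
sbuf_tile = nl.ndarray((nl.par_dim(128), 512), nl.float32, buffer=nl.sbuf)

# A PSUM tile: also 128 partitions, but PSUM offers far less space per
# partition than SBUF, and both sizes vary with the NeuronCore generation.
psum_tile = nl.zeros((nl.par_dim(128), 512), nl.float32, buffer=nl.psum)
```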

State Buffer (SBUF) memory is the main software-managed on-chip memory. The SBUF is accessible by all the compute engines within a NeuronCore. NKI kernel input tensors must be loaded from HBM into SBUF for computation, and the computed output tensors of the kernel must be stored back into HBM from SBUF before the host can access them.

Loading from and storing to HBM can be done using the nki.isa.dma_copy API. In addition, SBUF is used for storing intermediate data generated by the compute engines within the kernel. Note that SBUF has roughly 20x higher bandwidth than HBM, but it must be managed carefully to minimize HBM accesses for better performance.
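For instance, the hedged sketch below keeps an intermediate result in SBUF so that each HBM tensor is loaded and stored exactly once; the kernel name, shape, and the square-then-add computation are made up for illustration:

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def square_add_kernel(a_input, b_input):
    # Output in HBM; inputs assumed to be single [128, 512] tiles.
    out = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    ix = nl.arange(128)[:, None]
    iy = nl.arange(512)[None, :]

    # One load per input from HBM into SBUF.
    a_tile = nl.load(a_input[ix, iy])
    b_tile = nl.load(b_input[ix, iy])

    # The intermediate value stays in SBUF and never touches HBM.
    squared = a_tile * a_tile
    result = squared + b_tile

    # One store of the final result back to HBM.
    nl.store(out[ix, iy], value=result)
    return out
```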

Lastly, Partial Sum Buffer (PSUM) memory is a small, dedicated memory designed for storing matrix multiplication (MatMult) results computed by the tensor engine. The tensor engine is able to read-add-write to every address in PSUM, which makes PSUM useful for performing large MatMult calculations across multiple tiles, where multiple MatMult instructions need to accumulate into the same output tile. As shown in Fig. 19, PSUM memory can also be read and written by the vector and scalar engines. However, due to the limited capacity of PSUM, we recommend that you reserve PSUM space for the tensor engine to write MatMult outputs, and use the vector and scalar engines to evict MatMult results back to SBUF as soon as possible.
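The hedged sketch below follows the common NKI tiled-matmul pattern: partial products from the tensor engine accumulate into a single PSUM tile across the contraction (K) loop, and the finished tile is evicted to SBUF before being stored to HBM. It assumes the left-hand operand is passed in already transposed ([K, M]), that M and N fit in one output tile, that K is a multiple of 128, and that nl.matmul, nl.affine_range, nl.mgrid, and nl.copy behave as in the NKI tutorials:

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def tiled_matmul_kernel(lhsT, rhs):
    # lhsT: [K, M] (LHS stored transposed), rhs: [K, N].
    K, M = lhsT.shape
    _, N = rhs.shape
    result = nl.ndarray((M, N), dtype=lhsT.dtype, buffer=nl.shared_hbm)

    # Index grids for one [128, M] lhsT tile, one [128, N] rhs tile,
    # and the [M, N] output tile.
    i_lhsT = nl.mgrid[0:128, 0:M]
    i_rhs = nl.mgrid[0:128, 0:N]
    i_out = nl.mgrid[0:M, 0:N]

    # Accumulation tile in PSUM: successive MatMults add into the same output tile.
    acc_psum = nl.zeros((M, N), nl.float32, buffer=nl.psum)

    for k in nl.affine_range(K // 128):
        # Load one K-slice of each operand from HBM into SBUF, with the
        # contraction dimension mapped onto the 128 partitions.
        lhsT_tile = nl.load(lhsT[k * 128 + i_lhsT.p, i_lhsT.x])
        rhs_tile = nl.load(rhs[k * 128 + i_rhs.p, i_rhs.x])

        # The tensor engine read-add-writes the partial product into PSUM.
        acc_psum += nl.matmul(lhsT_tile, rhs_tile, transpose_x=True)

    # Evict the accumulated result from PSUM to SBUF as soon as it is complete,
    # then store it back to HBM.
    acc_sbuf = nl.copy(acc_psum, dtype=result.dtype)
    nl.store(result[i_out.p, i_out.x], value=acc_sbuf)
    return result
```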

Note that, to optimize kernel performance, it is good practice for NKI programmers to be mindful of SBUF and PSUM usage through careful tiling and loop fusion. If the total size of the live data used by a NKI kernel overflows the capacity of any on-chip memory, the Neuron compiler inserts the necessary spills and refills between that memory and the next-tier memory in the hierarchy, as illustrated by the tiling sketch below.
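As a hedged illustration of that tiling practice, the sketch below processes a larger tensor in 128-row tiles inside an affine loop, so only one input tile and its scaled copy are live in SBUF at any time; the shapes and the trivial scaling computation are assumptions for this example:

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def tiled_scale_kernel(in_tensor):
    # Assume M is a multiple of 128 and each [128, N] tile fits in SBUF.
    M, N = in_tensor.shape
    out_tensor = nl.ndarray((M, N), dtype=in_tensor.dtype, buffer=nl.shared_hbm)

    # Index grid for a single [128, N] tile.
    i_tile = nl.mgrid[0:128, 0:N]

    for m in nl.affine_range(M // 128):
        # Only this tile (plus its scaled copy) is live in SBUF per iteration.
        tile = nl.load(in_tensor[m * 128 + i_tile.p, i_tile.x])
        scaled = tile * 2.0
        nl.store(out_tensor[m * 128 + i_tile.p, i_tile.x], value=scaled)

    return out_tensor
```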