This document is relevant for: Inf1, Inf2, Trn1, Trn2
Neuron Device Memory#
Learn how to understand, monitor, and optimize memory usage on AWS Neuron devices. This topic covers memory categories including tensors, model constants, scratchpad allocations, DMA rings, and profiling buffers. Discover debugging tools like neuron-top and neuron-monitor, troubleshoot out-of-memory (OOM) errors, and implement strategies to reduce memory consumption for efficient ML workload execution on Inferentia and Trainium instances.
Overview#
The Neuron Runtime’s memory usage falls into the following categories:
- tensors: input and output tensors allocated by the application
- model constants: compiled constants used by a NEFF program
- model code: the executable instructions for the Neuron Core. This also includes a micro-code overhead of 96 MB per physical Neuron Core (this overhead is subject to future improvements)
- profile buffers: buffers used to store profiling events
- scratchpad and shared scratchpad: additional space used to store intermediate spills out of the SBUF and other computations. Read the Scratchpad section for details.
- dma rings: buffers holding the data transfer instructions that describe data movements during NEFF execution
- collectives: memory overhead used to orchestrate collective communication
Here’s what users can do to adjust these forms of memory usage:
- model constants and tensors are entirely controlled by the user. Adjust them as you would on other XLA devices, via matrix dimensions, batch sizes, etc.
- scratchpad and shared scratchpad depend on model size, model type, and tiling strategy. Read the Scratchpad section.
- dma rings usage is not easily actionable. It can be reduced by using DGE where possible, or by changing the model to reduce data movements (such as transfers between HBM and SBUF).
- profile buffers are allocated when the user enables profiling. Users can influence these allocations by disabling profiling or by adjusting the buffer sizes manually. Read the Profile Buffers section.
- model code usage is not actionable. If you observe significant usage, contact AWS Neuron support.
Logical Neuron Cores#
Starting with trn2, we introduced the concept of Logical Neuron Cores (LNC), where multiple physical Neuron Cores are grouped into the same “Neuron Core”. Read this article for more details.
Note
On trn2, the default configuration is LNC2, but when using LNC1 (NEURON_LOGICAL_NC_CONFIG=1), two neighboring Neuron Cores end up sharing the same HBM. See the following diagram, where two vertically neighboring NeuronCore-V3s share an HBM.

As a result, there will be noisy neighbor problems, and you may see out-of-memory (OOM) errors earlier than expected depending on what is loaded on the neighboring core.
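A minimal sketch of selecting LNC1 for a run (my_app.py below is a placeholder for your own entry point):

# Expose each physical core individually via LNC1. Keep the HBM-sharing
# caveat above in mind: neighboring cores will contend for the same memory.
export NEURON_LOGICAL_NC_CONFIG=1
python my_app.py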
Debugging Tools#
neuron-top#
Running neuron-top gives you a view of current memory usage at the per-core level. Read this article for more details.
sysfs#
As an alternative, you can find the same information through sysfs. Read this article for more details.
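For example, here is a hedged sketch that dumps every memory-related counter the driver exposes for device 0. The sysfs root below matches current Neuron driver releases, but the leaf layout varies by driver version, so the files are discovered rather than hard-coded:

# Print every memory-related sysfs entry for Neuron device 0 with its value.
for f in $(find /sys/devices/virtual/neuron_device/neuron0 -type f -path '*mem*' 2>/dev/null); do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done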
Out-of-memory (OOM) Errors#
When an OOM occurs, the Neuron Runtime dumps a detailed breakdown of the various memory usage types for each NEFF. For example:
2025-May-15 20:58:33.895937 224822:224822 ERROR TDRV:print_lnc_hbm_details LNC size is 1. Neuron Cores using this HBM: NC 4 and NC 5
2025-May-15 20:58:33.897479 224822:224822 ERROR TDRV:log_dev_mem Failed to allocate 4.000GB (alignment: none, usage: tensors) on ND 0:NC 4
2025-May-15 20:58:33.899416 224822:224822 ERROR TDRV:log_dev_mem_usage_table Displaying Current Memory Utilization:
(NOTE: the lines are LONG, and NEFF id to name mapping is printed after)
| | Model | Model | | Shared | | |DMA Rings |DMA Rings | DMA Rings |DMA Rings | | | Profiler |
| TOTAL | Code |Constants | Tensors |Scratchpad|Scratchpad| Runtime | IO | Spill |Collectives| Runtime |Collectives| XT CC | Buffers |
ND 0 Overall | 20.188GB |192.102MB | 82.344KB | 20.000GB | 0.000B | 0.000B |350.125KB |179.000KB | 64.000KB | 0.000B | 68.000KB | 0.000B | 0.000B | 0.000B |
\_NC 4 | 20.094GB | 96.065MB | 58.344KB | 20.000GB | 0.000B | 0.000B |229.062KB |118.000KB | 48.000KB | 0.000B | 36.000KB | 0.000B | 0.000B | 0.000B |
\_NEFF 1001 |263.906KB | 28.562KB | 34.344KB | n/a | n/a | 0.000B |108.000KB | 57.000KB | 32.000KB | 0.000B | 4.000KB | 0.000B | n/a | n/a |
\_NEFF 1002 |244.875KB | 31.875KB | 24.000KB | n/a | n/a | 0.000B |108.000KB | 61.000KB | 16.000KB | 0.000B | 4.000KB | 0.000B | n/a | n/a |
\_NC 5 | 96.285MB | 96.037MB | 24.000KB | 0.000B | 0.000B | 0.000B |121.062KB | 61.000KB | 16.000KB | 0.000B | 32.000KB | 0.000B | 0.000B | 0.000B |
\_NEFF 1003 |244.875KB | 31.875KB | 24.000KB | n/a | n/a | 0.000B |108.000KB | 61.000KB | 16.000KB | 0.000B | 4.000KB | 0.000B | n/a | n/a |
NEFF id to name mapping:
1001: "1.0.41235.0+df4a714bb-/local/out-test0_meta_dense"
1002: "1.0.41235.0+df4a714bb-/local/out-test0_meta_concat3"
1003: "1.0.41235.0+df4a714bb-/local/out-test0_meta_concat3"
In case this OOM message is truncated, the same information is also available under /tmp/neuron_mem_table_device_<device_id>_hbm_<hbm_idx>.log.
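For example (the device and HBM indices below are placeholders; pick the file that matches the failing device):

# List the dumped memory tables, then view one without line wrapping
# (the rows are long, as noted above).
ls /tmp/neuron_mem_table_device_*_hbm_*.log
less -S /tmp/neuron_mem_table_device_0_hbm_0.log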
Per-NEFF INFO logs#
The memory usage of a NEFF is also available as INFO-level logs during model load. By setting NEURON_RT_LOG_LEVEL_TDRV=info, you’ll see a log like:
2025-May-15 07:41:15.014997 2198754:2198754 INFO TDRV:dml_log_dev_neff_mem
[ND 0:NC 0] Current Usage Total: 96.543MB
shared scratchpad: 0.000B
Per NEFF memory usage breakdown for [out-test0_meta_concat3]:
Total: 230.562KB
* model code: 30.562KB
* model constants: 24.000KB
* scratchpad: 0.000B
* runtime: 95.000KB
* dma rings io: 61.000KB
* dma rings spill: 16.000KB
* dma rings collectives: 0.000B
* dma rings runtime: 4.000KB
* collectives: 0.000B
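A minimal sketch of enabling these logs and filtering for the breakdown (infer.py is a placeholder for your application; the grep pattern matches the log tag shown above):

# Surface only the per-NEFF memory breakdown emitted during model load.
NEURON_RT_LOG_LEVEL_TDRV=info python infer.py 2>&1 | grep -A 12 dml_log_dev_neff_mem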
Profile Buffers#
When used with NRT’s profiling APIs or neuron-profiler capture, the Runtime allocates buffers to store the profiling events. By default these buffers are about 64 or 128 MB each, so expect roughly 2 GB of total overhead (subject to future changes).
These profiler buffer sizes can be adjusted manually by setting the flags NEURON_RT_PROFILE_BUF_<buffer type>_MB, for example NEURON_RT_PROFILE_BUF_DMA_MB=512. The buffer types that can be adjusted are: EVENT, DMA, THROTTLE, CC_CORE_INSTRUCTION, CC_CORE_EVENT.
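For instance, a hedged sketch combining a smaller DMA buffer with a capture (see the note below before adjusting any of these):

# Shrink the DMA profiling buffer to 512 MB for this capture only.
NEURON_RT_PROFILE_BUF_DMA_MB=512 neuron-profile capture -n file.neff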
Note
Adjusting the buffer sizes manually is NOT recommended, since buffers that are too small will cause the profiler to lose events. Prioritize profiling one NEFF at a time, and only consider manual adjustment when profiling a single NEFF still OOMs.
Another option for reducing memory usage further when profiling is to use the --single-io option. This option reduces the memory used by IO tensors by creating one IO tensor the size of the largest IO tensor in the model; the other IO tensors point to slices of this tensor during execution. The output will no longer be correct, but the profile will still realistically capture performance. Note that the --single-io option is only available in neuron-profile.
neuron-profile capture -n file.neff --single-io
NOTE: only device profiles require extra device memory; system profiles do not. If you are only interested in a high-level view of performance, such as kernel execution latency and time spent in Neuron Runtime APIs, consider capturing a system profile with the nrt_sys_trace_fetch_events API or the NEURON_RT_INSPECT_ENABLE environment variable.
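As a hedged sketch (assuming NEURON_RT_INSPECT_ENABLE takes a boolean-style value, with infer.py standing in for your application):

# Capture a system profile; no extra device memory is allocated for this.
NEURON_RT_INSPECT_ENABLE=1 python infer.py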
Scratchpad#
Aside from inputs and outputs, a NEFF execution requires additional space on HBM for temporary spills out of the state buffer (the on-chip cache). This is necessary because the working set of a program can be arbitrarily large and may not fit in the state buffer. We call this space the scratchpad.
The scratchpad size requirement for a NEFF is specified entirely by the compiler. Scratchpad size depends on kernel size, kernel type, and tiling strategy. For example, for a training workload, scratchpad usage is usually determined by the size of the activations passed between the forward and backward layers. For an inference kernel, scratchpad usage is usually determined by the size of the hidden states. Additionally, optimal tiling and fusion of collective and/or compute operations can reduce scratchpad usage significantly.
def.json within a NEFF contains information about how much scratchpad space the NEFF requires. Scratchpad memory is allocated on the HBM, per NeuronCore, and is only used while a NEFF execution is running. It therefore makes sense to share this memory among all loaded NEFFs to reduce the overall memory footprint: Runtime allocates a shared scratchpad that is shared by all NEFFs loaded on a particular NeuronCore. The size of the shared scratchpad equals the size of the largest scratchpad among all the loaded NEFFs; for example, if one loaded NEFF requires 1.5 GB of scratchpad and another requires 2 GB, the shared scratchpad is 2 GB. In some cases a variable cannot be placed in the shared scratchpad and is placed in a non-shared scratchpad specific to a NEFF (see Scratchpad variables below).
Scratchpad variables#
The scratchpad space is fully managed by the Compiler. A NEFF defines scratchpad variables and their size and offset within the scratchpad space. Runtime maps all these variables to the scratchpad space it allocates on the HBM. Some of the variables may overlap with others since not all variables are “live” at the same time during NEFF execution.
Runtime iterates through all scratchpad variables in def.json and computes the MAX of offset + size over all of them. That is the size of the shared scratchpad space required by the NEFF.
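A hedged way to reproduce this computation on an unpacked NEFF with jq, using the var fields shown in the appendix below:

# Shared scratchpad bytes required by this subgraph:
# the max over all "virtual" variables of (backing_variable_off + size).
jq '[.var[] | select(.type == "virtual") | .backing_variable_off + .size] | max' sg00/def.json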
Appendix: NEFF format for scratchpad variables#
If we unpack a NEFF (using neuron-packager) and inspect sg00/def.json (and sg01/def.json in the case of NEFFs generated for the Trn2 LNC size 2 configuration), we will see variable entries like these:
"var": {
"some_variable_name": {
"backing_variable_off": 17108992,
"ops": [],
"size": 131072,
"type": "virtual",
"var_id": 2349
},
...
}
A type of “virtual” indicates that the variable is a scratchpad variable. The backing_variable_off field is the offset inside the shared scratchpad space allocated by Runtime, and the size field is the size of the variable.
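For example, a hedged jq sketch that lists every scratchpad variable in a subgraph along with its placement:

# Print each "virtual" (scratchpad) variable with its offset and size.
jq -r '.var | to_entries[] | select(.value.type == "virtual")
       | "\(.key): off=\(.value.backing_variable_off) size=\(.value.size)"' sg00/def.json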
DMA Rings#
DMA rings are buffers used to store DMA descriptors (each descriptor describes a data movement that the DMA engines can execute).
DGE generates the descriptors dynamically during NEFF execution, so if a NEFF uses DGE for some DMA, no allocation is needed on the HBM for those descriptors.
For any DMAs not using DGE, Runtime must allocate the DMA rings on HBM and build the DMA descriptors before execution. The details for building the descriptors for these DMAs are encoded in the NEFF’s def.json and <engine>.json, where <engine> is the TPB engine that will trigger the DMA operation.
Overall, reducing DMA rings usage requires changes in the NEFF itself, with the most effective change being to use DGE for DMAs where supported.
In OOM reports, DMA rings are further categorized as:
- IO: these descriptors have an I/O tensor as their source or destination
- Spill: these descriptors move data between any NEFF variables/tensors, excluding I/O tensors
- Collectives: these descriptors move data for collectives operations between ranks on the same node
- Runtime: these descriptors do not correspond to any explicit DMAs in the NEFF but are needed to perform DMAs that support NEFF execution. Examples: loading DVE and activation tables, instruction fetch DMAs for TPB engines
This document is relevant for: Inf1, Inf2, Trn1, Trn2