This document is relevant for: Trn1, Trn2, Trn3
Descriptor Generation Engine (DGE) Reference#
Every DMA operation (nisa.dma_copy, nisa.dma_transpose) needs a
descriptor that tells the hardware the source address, destination address,
transfer shape, and stride pattern. We can specify when and where those
descriptors are produced—on the host before execution, on the GpSimd engine
at runtime, or on a dedicated hardware block. Each choice has different
performance characteristics and capability constraints. DGE (Descriptor
Generation Engine) is the umbrella term for the strategies that control this.
In the NKI API, there are three concrete strategies—plus an unknown mode
that lets the compiler choose—exposed through the nki.isa.dge_mode enum.
The rest of this document describes each mode, its constraints, and when to use
each one.
DGE Modes#
unknown — let the compiler decide#
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.unknown)
The default. The compiler selects the best mode based on the target hardware, tensor shapes, and surrounding instruction schedule. Use this unless you have a specific reason to force a specific mode.
none — pre-computed descriptors in HBM#
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.none)
DMA descriptors are pre-computed on the Trainium host before NEFF execution. The pre-computed descriptors are stored them in HBM. At runtime the DMA engine reads the pre-built descriptor directly—no on-device generation is needed.
When to use:
Fully static transfer patterns where source/destination addresses are known at compile time.
When you want to avoid any on-device descriptor generation overhead.
Trade-offs:
Descriptors consume HBM capacity (one per DMA instruction instance).
Cannot handle dynamic (runtime-computed) addresses or indices.
swdge — software DGE#
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.swdge)
The GpSimd Engine generates DMA descriptors during NEFF execution. This is the only mode that supports indirect (gather/scatter) operations with dynamic indices from SBUF.
When to use:
Dynamic addresses that depend on runtime values.
Gather or scatter operations using
vector_offset(indirect indexing).Indirect transpose (
dma_transposewith indirectsrc).
Trade-offs:
Consumes GpSimd Engine cycles for descriptor generation.
May compete with other GpSimd workloads.
Importantly, swdge has additional constraints for indirect transpose:
src.shape[-1] <= 128src.dtypemust be 2 bytes (float16/bfloat16)srcmust be on HBMsrc.shape[0]must be divisible by 16When
srcis 4D:src.shape[1]orsrc.shape[2]must be 1Index tensor must be 2-D, on SBUF, with dtype
uint32indices.shape[0]must be in[16, 128]and divisible by 16When
indices.shape[1] > 1:indices.shape[0]must be exactly 128Only available on NeuronCore-v3 (Trainium2) or newer only
hwdge — hardware DGE#
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.hwdge)
A dedicated hardware block on the NeuronCore generates descriptors on demand, triggered by the Scalar Engine or Sync Engine sequencer. Each TRN2 NeuronCore has two DGE instances.
When to use:
Dynamic or semi-dynamic transfer patterns on NeuronCore-v3+.
When GpSimd Engine is busy with other work (avoids
swdgecontention).Overlapping descriptor generation with compute via Scalar Engine pipelining.
Trade-offs:
Each hardware-DGE DMA instruction takes approximately 600 ns to execute.
Does not support indirect (gather/scatter) operations.
Note, for dma_copy with hwdge, the engine parameter can optionally
select which sequencer triggers the DGE block:
# Let Scalar Engine trigger DGE (can overlap with earlier compute)
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.hwdge,
engine=nisa.engine.scalar)
# Let Sync Engine trigger DGE
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.hwdge,
engine=nisa.engine.sync)
Only nisa.engine.scalar and nisa.engine.sync are valid when
dge_mode=hwdge.
Hardware DGE constraints for dma_transpose:
src.shape[0] == 16src.shape[-1] % 128 == 0src.dtypemust be 2 bytes (float16/bfloat16)
Mode Selection Summary#
Mode |
Descriptor Source |
Min HW |
Indirect Support |
Best For |
|---|---|---|---|---|
|
Host (pre-computed in HBM) |
Any |
No |
Fully static patterns, zero on-device overhead |
|
GpSimd Engine |
Any (indirect: v3+) |
Yes |
Gather/scatter, dynamic indices |
|
Hardware DGE block |
NeuronCore-v3+ |
No |
Dynamic patterns without GpSimd contention |
|
Compiler decides |
Any |
Depends |
Default—recommended unless tuning |
The name Parameter#
Both dma_copy and dma_transpose accept an optional name string:
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor, name="load_weights")
This label appears in profiling traces and compiler debug output. It does not affect execution. Assigning meaningful names makes it significantly easier to identify specific DMA operations when analyzing performance with Neuron profiling tools.
Performance Implications#
In essence, the choice comes down to where you want to spend your overhead budget:
``none`` — Lowest per-transfer latency (descriptor already in HBM), but each descriptor consumes HBM bandwidth on first fetch and HBM capacity permanently.
``swdge`` — Flexible but uses GpSimd cycles. In GpSimd-bound kernels this can become a bottleneck.
``hwdge`` — ~600 ns per instruction. When triggered from Scalar Engine, descriptor generation overlaps with earlier compute instructions in the pipeline, effectively hiding the cost. Frees GpSimd for other work.
``unknown`` — The compiler applies heuristics to pick the best mode for the target and workload. Start here and only override after profiling.
In summary, use unknown until profiling tells you otherwise, then
switch to the specific mode that addresses the bottleneck you observe.
Code Examples#
Static copy (no DGE)#
import neuronxcc.nki.isa as nisa
# Pre-computed descriptors — addresses fully known at compile time
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.none,
name="static_load")
Software DGE copy with dynamic address#
import neuronxcc.nki.isa as nisa
# GpSimd generates the descriptor at runtime
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.swdge,
name="dynamic_load")
Hardware DGE copy#
import neuronxcc.nki.isa as nisa
# Hardware DGE block generates the descriptor (NeuronCore-v3+)
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.hwdge,
name="hwdge_load")
Hardware DGE transpose#
import neuronxcc.nki.isa as nisa
# src must be [16, ...] with last dim divisible by 128, 2-byte dtype
nisa.dma_transpose(dst=sbuf_tile, src=hbm_tensor,
dge_mode=nisa.dge_mode.hwdge,
name="hwdge_transpose")
Software DGE indirect transpose (gather + transpose)#
import neuronxcc.nki.isa as nisa
# indices is a 2-D uint32 SBUF tensor; src is on HBM
# Effectively: dst = src[indices.T.flatten()[:src.shape[0]], :].T
nisa.dma_transpose(dst=sbuf_tile, src=hbm_tensor[indices],
dge_mode=nisa.dge_mode.swdge,
name="gather_transpose")
Compiler-selected mode (default)#
import neuronxcc.nki.isa as nisa
# Let the compiler pick the best DGE mode
nisa.dma_copy(dst=sbuf_tile, src=hbm_tensor,
name="auto_load")
nisa.dma_transpose(dst=sbuf_tile, src=hbm_tensor,
name="auto_transpose")
This document is relevant for: Trn1, Trn2, Trn3