This document is relevant for: Trn2, Trn3
nki.isa.dma_copy
- nki.isa.dma_copy(dst, src, oob_mode=oob_mode.error, dge_mode=dge_mode.unknown, engine=engine.unknown, name=None)
Copy data from `src` to `dst` using DMA engines.

This instruction performs data movement between memory locations (SBUF or HBM) using DMA engines. The operation copies data from the source tensor to the destination tensor: `dst = src`.

`nisa.dma_copy` supports different modes of DMA descriptor generation (DGE):

- `nisa.dge_mode.none`: Neuron Runtime generates DMA descriptors and stores them into HBM before NEFF execution.
- `nisa.dge_mode.swdge`: the GpSimd Engine generates DMA descriptors as part of the `nisa.dma_copy` instruction during NEFF execution.
- `nisa.dge_mode.hwdge`: Sync Engine or Scalar Engine sequencers invoke the DGE hardware block to generate DMA descriptors as part of the `nisa.dma_copy` instruction during NEFF execution.
See Trainium2 arch guide and Introduction to DMA with NKI for more discussion.
When either `swdge` or `hwdge` mode is used, the `src` and `dst` tensors can have a dynamic start address that depends on a variable which cannot be resolved at compile time. When `swdge` is selected, `nisa.dma_copy` can also perform a gather or scatter operation using a list of dynamic indices from SBUF. In both of these dynamic modes, out-of-bound address checking is turned on automatically during execution. By default, a runtime error is raised (`oob_mode=oob_mode.error`). Developers can disable this error and make the `nisa.dma_copy` instruction skip the DMA transfer for any out-of-bound dynamic address or index by setting `oob_mode=oob_mode.skip`.
Memory types.

Both `src` and `dst` tiles can be in HBM or SBUF. However, if both tiles are in SBUF, consider using an alternative for better performance:

- `nisa.tensor_copy` for direct copies
- `nisa.nc_n_gather` to gather elements within each partition independently
- `nisa.local_gather` to gather elements within groups of partitions
Data types.
Both `src` and `dst` tiles can be of any supported NKI data type (see Supported Data Types for more information).

The DMA engines automatically handle data type conversion when `src` and `dst` have different data types. The conversion is performed in two steps: first casting from `src.dtype` to float32, then from float32 to `dst.dtype`.
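The two-step conversion can be sketched in plain NumPy (illustrative only — the real conversion happens inside the DMA engines, and `dma_cast` is a hypothetical helper):

```python
import numpy as np

def dma_cast(src, dst_dtype):
    # Model the two-step conversion: src.dtype -> float32 -> dst.dtype
    return src.astype(np.float32).astype(dst_dtype)

src = np.array([1.5, 2.25, 3.75], dtype=np.float16)
dst = dma_cast(src, np.int32)  # float16 -> float32 -> int32 (fractions truncated)
# dst is [1, 2, 3] with dtype int32
```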
Tile size.

The total number of data elements in `src` must match that of `dst`.
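As a plain-NumPy analogy (not NKI code) for the element-count requirement, a copy between differently shaped buffers is well-defined exactly when their total sizes agree:

```python
import numpy as np

src = np.arange(6, dtype=np.float32).reshape(2, 3)  # 6 elements
dst = np.empty((3, 2), dtype=np.float32)            # also 6 elements

# Mimic dma_copy's constraint: total element counts must match
assert src.size == dst.size
dst.ravel()[:] = src.ravel()  # linear copy of all 6 elements
```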
Indirect addressing (gather/scatter).

`nisa.dma_copy` supports indirect addressing for dynamic row selection at runtime. This enables gather (read from dynamic rows) and scatter (write to dynamic rows) patterns. Indirect addressing is activated by calling `.ap()` on `src` or `dst` with a `vector_offset` or `scalar_offset` parameter. There are two types of indirect addressing:
Vector indirection provides per-partition dynamic offsets. Each of the hardware partitions gets its own index, enabling gather/scatter where different partitions access different rows. Use `.ap(pattern=..., vector_offset=idx_tensor, indirect_dim=0)`, where `idx_tensor` is an SBUF tensor of shape `(P, 1)` containing one row index per partition. The tensor being indexed (the one `.ap()` is called on) must be in HBM.

Scalar indirection provides a single dynamic offset applied uniformly to all partitions. Use `.ap(pattern=..., scalar_offset=reg_or_tensor, indirect_dim=N)`, where the offset is either a 1x1 SBUF tensor or a `VirtualRegister` from `nisa.register_alloc()`. `vector_offset` and `scalar_offset` are mutually exclusive.

Indirect gather example (`vector_offset` on `src`):

```python
import nki
import nki.isa as nisa
import nki.language as nl

@nki.jit
def indirect_gather_kernel(data, indices):
    P, F = indices.shape[0], data.shape[1]
    output = nl.ndarray((P, F), dtype=data.dtype, buffer=nl.shared_hbm)

    # Load the dynamic row indices into SBUF
    idx = nl.ndarray((P, 1), dtype=nl.uint32, buffer=nl.sbuf)
    nisa.dma_copy(dst=idx, src=indices)

    # Gather: partition p receives data[idx[p], :]
    dst = nl.ndarray((P, F), dtype=data.dtype, buffer=nl.sbuf)
    nisa.dma_copy(
        dst=dst,
        src=data.ap(
            pattern=[[F, P], [1, F]],
            vector_offset=idx,
            indirect_dim=0,
        ),
    )
    nisa.dma_copy(dst=output, src=dst)
    return output
```
Indirect scatter example (`vector_offset` on `dst`):

```python
import nki
import nki.isa as nisa
import nki.language as nl

@nki.jit
def indirect_scatter_kernel(src_data, indices, output):
    P, F = src_data.shape

    # Load the source rows into SBUF
    src = nl.ndarray((P, F), dtype=src_data.dtype, buffer=nl.sbuf)
    nisa.dma_copy(dst=src, src=src_data)

    # Load the dynamic row indices into SBUF
    idx = nl.ndarray((P, 1), dtype=nl.uint32, buffer=nl.sbuf)
    nisa.dma_copy(dst=idx, src=indices)

    # Scatter: partition p writes src[p, :] to output[idx[p], :]
    nisa.dma_copy(
        dst=output.ap(
            pattern=[[F, P], [1, F]],
            vector_offset=idx,
            indirect_dim=0,
        ),
        src=src,
    )
    return output
```
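The semantics of the two examples above can be modeled in plain NumPy (hypothetical helper names, not NKI APIs): with `vector_offset`, partition `p` reads or writes row `idx[p]` of the HBM tensor.

```python
import numpy as np

def model_gather(data, idx):
    """vector_offset on src: partition p receives data[idx[p], :]."""
    return data[idx[:, 0], :]

def model_scatter(rows, idx, output):
    """vector_offset on dst: partition p writes rows[p, :] to output[idx[p], :]."""
    output[idx[:, 0], :] = rows
    return output

data = np.arange(20, dtype=np.float32).reshape(5, 4)
idx = np.array([[3], [0], [4]], dtype=np.uint32)  # one row index per partition

gathered = model_gather(data, idx)  # rows 3, 0, 4 of data
out = np.zeros_like(data)
model_scatter(gathered, idx, out)   # writes them back to rows 3, 0, 4
```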
- Parameters:
dst – the destination tensor to copy data into
src – the source tensor to copy data from
dge_mode – (optional) which Descriptor Generation Engine (DGE) mode to use for DMA descriptor generation: `nki.isa.dge_mode.none` (turn off DGE), `nki.isa.dge_mode.swdge` (software DGE), `nki.isa.dge_mode.hwdge` (hardware DGE), or `nki.isa.dge_mode.unknown` (the default; let the compiler select the best DGE mode). Hardware-based DGE is only supported on NeuronCore-v3 or newer. See Trainium2 arch guide for more information.
oob_mode – (optional) specifies how to handle out-of-bounds (OOB) array indices during indirect access operations. Valid modes are:
- `oob_mode.error` (default): raises an error when an out-of-bounds index is encountered.
- `oob_mode.skip`: silently skips any operations involving out-of-bounds indices.
For example, when using indirect gather/scatter operations, out-of-bounds indices can occur if the index array contains values that exceed the dimensions of the target array.
engine – (optional) the engine to use for HWDGE descriptor generation: `nki.isa.engine.sync` or `nki.isa.engine.scalar`. Only valid when `dge_mode=nisa.dge_mode.hwdge`; defaults to `nki.isa.engine.unknown`.