This document is relevant for: Trn2, Trn3

nki.isa.dma_copy#

nki.isa.dma_copy(dst, src, oob_mode=oob_mode.error, dge_mode=dge_mode.unknown, engine=engine.unknown, name=None)[source]#

Copy data from src to dst using DMA engines.

This instruction performs data movement between memory locations (SBUF or HBM) using DMA engines. The operation copies data from the source tensor to the destination tensor: dst = src.

nisa.dma_copy supports different modes of DMA descriptor generation (DGE):

  • nisa.dge_mode.none: Neuron Runtime generates DMA descriptors and stores them into HBM before NEFF execution.

  • nisa.dge_mode.swdge: GPSIMD Engine generates DMA descriptors as part of the nisa.dma_copy instruction during NEFF execution.

  • nisa.dge_mode.hwdge: Sync Engine or Scalar Engine sequencers invoke DGE hardware block to generate DMA descriptors as part of the nisa.dma_copy instruction during NEFF execution.

See Trainium2 arch guide and Introduction to DMA with NKI for more discussion.

When either swdge or hwdge mode is used, the src and dst tensors can have a dynamic start address that depends on a variable not resolvable at compile time. When swdge is selected, nisa.dma_copy can also perform a gather or scatter operation using a list of dynamic indices from SBUF. In both of these dynamic modes, out-of-bound address checking is turned on automatically during execution. By default, a runtime error is raised (oob_mode=oob_mode.error). Developers can disable this error with oob_mode=oob_mode.skip, which makes nisa.dma_copy skip the DMA transfer for any dynamic address or index that is out of bound.
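The two oob modes can be sketched with a NumPy reference model of an indirect gather (hypothetical helper gather_with_oob_skip, not part of the NKI API; only the upper bound is checked here for brevity):

```python
import numpy as np

def gather_with_oob_skip(data, idx, oob_mode="error"):
    # NumPy reference model (not NKI) of oob_mode semantics:
    # each output row p is data[idx[p, 0]]. An out-of-bound index
    # either raises (oob_mode="error", the default) or leaves the
    # corresponding destination row untouched (oob_mode="skip").
    out = np.zeros((idx.shape[0], data.shape[1]), dtype=data.dtype)
    for p, row in enumerate(idx[:, 0]):
        if row >= data.shape[0]:
            if oob_mode == "error":
                raise IndexError(f"row index {row} is out of bounds")
            continue  # "skip": the DMA transfer for this row is dropped
        out[p] = data[row]
    return out
```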

Memory types.

Both src and dst tiles can be in HBM or SBUF. However, if both tiles are in SBUF, consider using an alternative instruction that runs on the compute engines, such as nisa.tensor_copy, for better performance.

Data types.

Both src and dst tiles can be any supported NKI data types (see Supported Data Types for more information).

The DMA engines automatically handle data type conversion when src and dst have different data types. The conversion is performed through a two-step process: first casting from src.dtype to float32, then from float32 to dst.dtype.
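The effect of the float32 intermediate can be illustrated with a NumPy sketch (hypothetical helper dma_cast, not NKI code): a value that is not representable in float32 is rounded during the first step, even when the destination type could hold it exactly.

```python
import numpy as np

def dma_cast(src, dst_dtype):
    # NumPy model of the two-step DMA conversion:
    # src.dtype -> float32 -> dst.dtype.
    return src.astype(np.float32).astype(dst_dtype)
```

For example, casting the float64 value 1.0 + 2**-30 back to float64 through this model yields exactly 1.0, because the intermediate float32 step rounds away the low-order mantissa bits.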

Tile size.

The total number of data elements in src must match that of dst.
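As a sketch of this constraint, a NumPy reference model (hypothetical helper dma_copy_ref; this assumes elements are transferred in row-major order) might look like:

```python
import numpy as np

def dma_copy_ref(dst, src):
    # NumPy reference model (not NKI): the copy is legal whenever the
    # element counts match, even if the shapes differ.
    if dst.size != src.size:
        raise ValueError("src and dst must have the same number of elements")
    dst[...] = src.reshape(dst.shape)
```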

Indirect addressing (gather/scatter).

nisa.dma_copy supports indirect addressing for dynamic row selection at runtime. This enables gather (read from dynamic rows) and scatter (write to dynamic rows) patterns. Indirect addressing is activated by calling .ap() on src or dst with a vector_offset or scalar_offset parameter.

There are two types of indirect addressing:

Vector indirection provides per-partition dynamic offsets. Each of the hardware partitions gets its own index, enabling gather/scatter where different partitions access different rows. Use .ap(pattern=..., vector_offset=idx_tensor, indirect_dim=0) where idx_tensor is an SBUF tensor of shape (P, 1) containing one row index per partition. The tensor being indexed (the one .ap() is called on) must be in HBM.

Scalar indirection provides a single dynamic offset applied uniformly to all partitions. Use .ap(pattern=..., scalar_offset=reg_or_tensor, indirect_dim=N) where the offset is either a 1x1 SBUF tensor or a VirtualRegister from nisa.register_alloc().

vector_offset and scalar_offset are mutually exclusive.
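For intuition, the scalar-indirection access pattern can be modeled in NumPy (hypothetical helper scalar_indirect_gather, not NKI code; the real instruction resolves the offset at runtime on-device):

```python
import numpy as np

def scalar_indirect_gather(data, offset, P, F):
    # NumPy reference model (not NKI): a single runtime offset shifts
    # the access window uniformly, so every partition p reads row
    # offset + p rather than an independent per-partition index.
    return data[offset:offset + P, :F].copy()
```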

Indirect gather example (vector_offset on src):

import nki
import nki.isa as nisa
import nki.language as nl

@nki.jit
def indirect_gather_kernel(data, indices):
    P, F = indices.shape[0], data.shape[1]
    output = nl.ndarray((P, F), dtype=data.dtype, buffer=nl.shared_hbm)

    idx = nl.ndarray((P, 1), dtype=nl.uint32, buffer=nl.sbuf)
    nisa.dma_copy(dst=idx, src=indices)

    dst = nl.ndarray((P, F), dtype=data.dtype, buffer=nl.sbuf)
    nisa.dma_copy(
        dst=dst,
        src=data.ap(
            pattern=[[F, P], [1, F]],
            vector_offset=idx,
            indirect_dim=0,
        ),
    )

    nisa.dma_copy(dst=output, src=dst)
    return output
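The gather performed by the kernel above is semantically equivalent to the following NumPy sketch (hypothetical helper, not NKI code):

```python
import numpy as np

def indirect_gather_ref(data, indices):
    # NumPy reference (not NKI) for the indirect gather: output row p
    # is the data row selected by indices[p, 0].
    return data[indices[:, 0], :]
```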

Indirect scatter example (vector_offset on dst):

import nki
import nki.isa as nisa
import nki.language as nl

@nki.jit
def indirect_scatter_kernel(src_data, indices, output):
    P, F = src_data.shape

    src = nl.ndarray((P, F), dtype=src_data.dtype, buffer=nl.sbuf)
    nisa.dma_copy(dst=src, src=src_data)

    idx = nl.ndarray((P, 1), dtype=nl.uint32, buffer=nl.sbuf)
    nisa.dma_copy(dst=idx, src=indices)

    nisa.dma_copy(
        dst=output.ap(
            pattern=[[F, P], [1, F]],
            vector_offset=idx,
            indirect_dim=0,
        ),
        src=src,
    )
    return output
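The scatter performed by the kernel above is semantically equivalent to the following NumPy sketch (hypothetical helper, not NKI code):

```python
import numpy as np

def indirect_scatter_ref(src_data, indices, output):
    # NumPy reference (not NKI) for the indirect scatter: row p of
    # src_data is written to output row indices[p, 0]; rows of output
    # that are never indexed are left unmodified.
    output[indices[:, 0], :] = src_data
    return output
```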

Parameters:
  • dst – the destination tensor to copy data into

  • src – the source tensor to copy data from

  • dge_mode – (optional) specify which Descriptor Generation Engine (DGE) mode to use for DMA descriptor generation: nki.isa.dge_mode.none (turn off DGE), nki.isa.dge_mode.swdge (software DGE), nki.isa.dge_mode.hwdge (hardware DGE), or nki.isa.dge_mode.unknown (default; let the compiler select the best DGE mode). Hardware-based DGE is only supported on NeuronCore-v3 or newer. See Trainium2 arch guide for more information.

  • oob_mode

    (optional) Specifies how to handle out-of-bounds (oob) array indices during indirect access operations. Valid modes are:

    • oob_mode.error: (Default) Raises an error when encountering out-of-bounds indices.

    • oob_mode.skip: Silently skips any operations involving out-of-bounds indices.

    For example, when using indirect gather/scatter operations, out-of-bounds indices can occur if the index array contains values that exceed the dimensions of the target array.

  • engine – (optional) the engine to use for HWDGE descriptor generation: nki.isa.engine.sync or nki.isa.engine.scalar. Only valid when dge_mode=nisa.dge_mode.hwdge. Defaults to nki.isa.engine.unknown.
