nki.isa.dma_copy

nki.isa.dma_copy(dst, src, dst_rmw_op=None, oob_mode=oob_mode.error, dge_mode=dge_mode.unknown, name=None)

Copy data from src to dst using DMA engines with optional read-modify-write operations.

This instruction performs data movement between memory locations (SBUF or HBM) using DMA engines. The basic operation copies data from the source tensor to the destination tensor: dst = src. Optionally, a read-modify-write operation can be performed where the source data is combined with existing destination data using a specified operation: dst = dst_rmw_op(dst, src).

Currently, only np.add is supported for dst_rmw_op when performing read-modify-write operations. When dst_rmw_op=None, the source data directly overwrites the destination data.
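The two behaviors described above can be modeled on the host with plain NumPy. This is only a sketch of the documented semantics, not the NKI API: `dma_copy_reference` is a hypothetical helper, and it assumes for simplicity that `src` and `dst` have identical shapes.

```python
import numpy as np

def dma_copy_reference(dst, src, dst_rmw_op=None):
    """Hypothetical host-side model of the documented semantics (not the NKI API)."""
    # Simplifying assumption for this sketch: identical shapes.
    if src.shape != dst.shape:
        raise ValueError("this sketch assumes src and dst have the same shape")
    if dst_rmw_op is None:
        # Plain copy: dst = src
        dst[...] = src
    elif dst_rmw_op is np.add:
        # Read-modify-write: dst = dst_rmw_op(dst, src)
        dst[...] = dst_rmw_op(dst, src)
    else:
        raise NotImplementedError("only np.add is documented as supported")
    return dst

src = np.arange(8, dtype=np.float32).reshape(2, 4)
dst = np.zeros((2, 4), dtype=np.float32)

dma_copy_reference(dst, src)                     # dst now equals src
dma_copy_reference(dst, src, dst_rmw_op=np.add)  # dst now equals src + src
```

After the second call each element of `dst` holds twice the corresponding element of `src`, because the destination was read, combined with the source, and written back.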

nisa.dma_copy supports different modes of DMA descriptor generation (DGE):

  • nisa.dge_mode.none: Neuron Runtime generates DMA descriptors and stores them into HBM before NEFF execution.

  • nisa.dge_mode.swdge: Gpsimd Engine generates DMA descriptors as part of the nisa.dma_copy instruction during NEFF execution.

  • nisa.dge_mode.hwdge: Sync Engine or Scalar Engine sequencers invoke the DGE hardware block to generate DMA descriptors as part of the nisa.dma_copy instruction during NEFF execution.

See Trainium2 arch guide and Introduction to DMA with NKI for more discussion.

When either swdge or hwdge mode is used, the src and dst tensors can have a dynamic start address, that is, one that depends on a variable that cannot be resolved at compile time. When swdge is selected, nisa.dma_copy can also perform a gather or scatter operation using a list of unique dynamic indices from SBUF. In both of these dynamic modes, out-of-bound address checking is turned on automatically during execution. By default, an out-of-bound access raises a runtime error (oob_mode=oob_mode.error). Developers can disable this error and have the nisa.dma_copy instruction skip the DMA transfer for any dynamic address or index that is out of bound by setting oob_mode=oob_mode.skip. If dst_rmw_op is specified for these dynamic modes, only oob_mode.error is allowed. See the Beta2 NKI kernel migration guide for the latest syntax to handle dynamic addresses or indices.

nisa.dma_copy also supports non-unique scatter indices when dge_mode=nisa.dge_mode.none and dst_rmw_op=nl.add are set. An example use case is updating embedding table entries after a training backward pass to calculate embedding table gradients.
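To illustrate why accumulation matters when scatter indices repeat, the following NumPy sketch models the embedding-gradient use case on the host. It is not NKI code: `np.add.at` stands in for a scatter with an add read-modify-write, and the names `table_grad`, `token_ids`, and `row_grads` are made up for the example.

```python
import numpy as np

vocab_size, embed_dim = 5, 3
table_grad = np.zeros((vocab_size, embed_dim), dtype=np.float32)

# Token 1 appears three times, so its gradient row receives three contributions.
token_ids = np.array([1, 3, 1, 1])
row_grads = np.ones((len(token_ids), embed_dim), dtype=np.float32)

# Unbuffered scatter-add: every duplicate index contributes to the sum.
np.add.at(table_grad, token_ids, row_grads)

# For contrast, a buffered fancy-index update applies each duplicate index once.
naive = np.zeros_like(table_grad)
naive[token_ids] += row_grads
```

Here `table_grad[1]` accumulates all three contributions, while `naive[1]` keeps only one, which is the failure mode that an accumulating scatter avoids.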

Memory types.

Both src and dst tiles can be in HBM or SBUF. However, if both tiles are in SBUF, consider using nisa.tensor_copy instead for better performance.

Data types.

Both src and dst tiles can be of any supported NKI data type (see Supported Data Types for more information).

The DMA engines automatically handle data type conversion when src and dst have different data types. The conversion is performed through a two-step process: first casting from src.dtype to float32, then from float32 to dst.dtype.

If dst_rmw_op is used, the DMA engines automatically cast input data types to float32 before performing the read-modify-write computation, and the final float32 result is cast to the output data type in a pipelined fashion.
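One practical consequence of routing conversions through float32 is that values float32 cannot represent exactly are rounded along the way. The NumPy sketch below models the documented cast order on the host; `cast_via_float32` and `rmw_add_via_float32` are hypothetical helpers, and NumPy's rounding is assumed here only for illustration and may not match the hardware bit for bit.

```python
import numpy as np

def cast_via_float32(src, dst_dtype):
    """Model the documented two-step cast: src.dtype -> float32 -> dst.dtype."""
    return src.astype(np.float32).astype(dst_dtype)

def rmw_add_via_float32(dst, src):
    """Model the documented read-modify-write: add in float32, then cast to dst.dtype."""
    total = dst.astype(np.float32) + src.astype(np.float32)
    return total.astype(dst.dtype)

# 2**24 + 1 has no exact float32 representation, so the round trip changes it.
big = np.array([2**24 + 1], dtype=np.int32)
round_trip = cast_via_float32(big, np.int32)

# float16 operands are widened to float32 for the add, then narrowed again.
acc = rmw_add_via_float32(np.array([1.5], dtype=np.float16),
                          np.array([2.25], dtype=np.float16))
```

In this model `round_trip[0]` differs from `big[0]` by one, which shows that an int32-to-int32 copy is not necessarily lossless if it goes through a float32 intermediate.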

Layout.

If dst_rmw_op is used, the computation is done element-wise between src and dst.

Tile size.

The total number of data elements in src must match that of dst.

Parameters:
  • dst – the destination tensor to copy data into

  • src – the source tensor to copy data from

  • dst_rmw_op – optional read-modify-write operation (currently only np.add is supported)

  • dge_mode – (optional) the DMA descriptor generation (DGE) mode to use: nki.isa.dge_mode.none (turn off DGE), nki.isa.dge_mode.swdge (software DGE), nki.isa.dge_mode.hwdge (hardware DGE), or nki.isa.dge_mode.unknown (the default, which lets the compiler select the best DGE mode). Hardware-based DGE is only supported on NeuronCore-v3 or newer. See the Trainium2 arch guide for more information.

  • oob_mode – (optional) specifies how to handle out-of-bounds (OOB) array indices during indirect access operations. Valid modes are:

    • oob_mode.error: (Default) Raises an error when encountering out-of-bounds indices.

    • oob_mode.skip: Silently skips any operations involving out-of-bounds indices.

    For example, when using indirect gather/scatter operations, out-of-bounds indices can occur if the index array contains values that exceed the dimensions of the target array.
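The difference between the two out-of-bounds modes can be sketched with a host-side gather in NumPy. This is not NKI code: `gather_rows_reference` is a hypothetical helper, the strings "error" and "skip" stand in for oob_mode.error and oob_mode.skip, and the sketch assumes that a skipped transfer leaves the corresponding destination row unchanged.

```python
import numpy as np

def gather_rows_reference(dst, table, indices, oob_mode="error"):
    """Hypothetical model of an indirect gather with out-of-bounds handling."""
    num_rows = table.shape[0]
    for i, idx in enumerate(indices):
        if not 0 <= idx < num_rows:
            if oob_mode == "error":
                raise IndexError(f"index {idx} is out of bounds for {num_rows} rows")
            # "skip": no transfer for this index; dst[i] keeps its prior contents
            # (an assumption of this sketch).
            continue
        dst[i] = table[idx]
    return dst

table = np.arange(12, dtype=np.float32).reshape(4, 3)
indices = [2, 7, 0]  # 7 exceeds the 4 rows of the table

dst = np.full((3, 3), -1.0, dtype=np.float32)
gather_rows_reference(dst, table, indices, oob_mode="skip")
```

With "skip", rows 0 and 2 of `dst` receive table rows 2 and 0, and row 1 keeps its initial fill value. With "error", the same call raises on index 7 instead.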