This document is relevant for: Inf2, Trn1, Trn2
nki.isa.dma_copy

nki.isa.dma_copy(*, dst, src, mask=None, dst_rmw_op=None, oob_mode=oob_mode.error, dge_mode=dge_mode.unknown)
Copy data from src to dst using the DMA engine. Both src and dst tiles can be in device memory (HBM) or SBUF. However, if both src and dst tiles are in SBUF, consider using nisa.tensor_copy instead for better performance, as sketched below.
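A minimal sketch of the SBUF-to-SBUF alternative, assuming nisa.tensor_copy returns the copied tile and that nisa.memset allocates an SBUF tile (shape and dtype here are placeholders):

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

# Hypothetical SBUF tile for illustration.
in_tile = nisa.memset(shape=(128, 512), value=0, dtype=nl.float32)
# Both tiles live in SBUF, so a compute-engine copy is preferred over DMA.
out_tile = nisa.tensor_copy(in_tile)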
Parameters:
- src – the source of the copy.
- dst – the destination of the copy.
- dst_rmw_op – the read-modify-write operation to be performed at the destination. Currently only np.add is supported, which adds the source data to the existing destination data. If None, the source data directly overwrites the destination. If dst_rmw_op is specified, only oob_mode=oob_mode.error is allowed. For best performance with the Descriptor Generation Engine (DGE), unique dynamic offsets must be used to access dst; multiple accesses to the same offset cause a data hazard. If duplicated offsets are present, the compiler automatically adds synchronization to avoid the hazard, which slows down computation.
- mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details).
- oob_mode – (optional) specifies how to handle out-of-bounds (OOB) array indices during indirect access operations. Valid modes are:
  - oob_mode.error: (default) raises an error when encountering out-of-bounds indices.
  - oob_mode.skip: silently skips any operations involving out-of-bounds indices.
  For example, when using indirect gather/scatter operations, out-of-bounds indices can occur if the index array contains values that exceed the dimensions of the target array.
- dge_mode – (optional) specifies which Descriptor Generation Engine (DGE) mode to use for the copy: nki.isa.dge_mode.none (turn off DGE), nki.isa.dge_mode.swdge (software DGE), nki.isa.dge_mode.hwdge (hardware DGE), or nki.isa.dge_mode.unknown (the default; let the compiler select the best DGE mode). HWDGE is only supported on NeuronCore-v3+.
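For instance, a caller can pin the DGE mode rather than letting the compiler choose. A minimal sketch, with placeholder tiles a and b allocated via nisa.memset purely for illustration:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

# Hypothetical tiles; shape and dtype are placeholders.
a = nisa.memset(shape=(128, 512), value=0, dtype=nl.float32)
b = nisa.memset(shape=(128, 512), value=0, dtype=nl.float32)
# Force software DGE instead of the default dge_mode.unknown,
# which lets the compiler pick.
nisa.dma_copy(dst=b, src=a, dge_mode=nisa.dge_mode.swdge)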
A cast will happen if src and dst have different dtypes.
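A minimal sketch of the implicit cast, with placeholder tiles allocated via nisa.memset (shapes and dtypes chosen only for illustration):

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

# Hypothetical tiles: float32 source, bfloat16 destination.
src_tile = nisa.memset(shape=(128, 512), value=0, dtype=nl.float32)
dst_tile = nisa.memset(shape=(128, 512), value=0, dtype=nl.bfloat16)
# dtypes differ, so dma_copy casts float32 -> bfloat16 during the copy.
nisa.dma_copy(dst=dst_tile, src=src_tile)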
Examples:

import neuronxcc.nki.isa as nisa

############################################################################
# Example 1: Copy over the tensor to another tensor
############################################################################
nisa.dma_copy(dst=b, src=a)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 2: Load elements from HBM with indirect addressing. If addressing
# results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n//2, 0:m]
expr_arange = 2*nl.arange(n//2)[:, None]
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 3: Load elements from HBM with indirect addressing. If addressing
# results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n//2, 0:m]
# indices are out of range on purpose to demonstrate the error
expr_arange = 3*nl.arange(n//2)[:, None]
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 4: Load elements from HBM with indirect addressing. If addressing
# results in out-of-bounds access, the operation will skip those indices.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n//2, 0:m]
# indices are out of range on purpose
expr_arange = 3*nl.arange(n//2)[:, None]
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.skip)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 5: Store elements to HBM with indirect addressing and with a
# read-modify-write operation.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]
expr_arange = 2*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, dst_rmw_op=np.add)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 6: Store elements to HBM with indirect addressing. If indirect
# addressing results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]
expr_arange = 2*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 7: Store elements to HBM with indirect addressing. If indirect
# addressing results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]
# indices are out of range on purpose to demonstrate the error
expr_arange = 3*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 8: Store elements to HBM with indirect addressing. If indirect
# addressing results in out-of-bounds access, the operation will skip those
# indices.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]
# indices are out of range on purpose
expr_arange = 3*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)
out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.skip)
This document is relevant for: Inf2, Trn1, Trn2