This document is relevant for: Inf2, Trn1, Trn1n

nki.language.load

nki.language.load(src, mask=None, dtype=None, **kwargs)

Load a tensor from device memory (HBM) into on-chip memory (SBUF).

See Memory hierarchy for detailed information.

Parameters:
  • src – HBM tensor to load the data from.

  • mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details).

  • dtype – (optional) data type to cast the output to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tile.

Returns:

a new tile on SBUF with values from src.
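
The dtype default described above can be illustrated with a host-side NumPy analogy. This is only a sketch: emulate_load is a hypothetical helper, not part of NKI, and it runs on the CPU rather than moving data between HBM and SBUF.

```python
import numpy as np

def emulate_load(src, dtype=None):
    """Hypothetical host-side stand-in for nl.load (not an NKI API).

    Returns a new array with the values of src. The result keeps the
    source dtype unless an explicit dtype is requested.
    """
    out_dtype = src.dtype if dtype is None else np.dtype(dtype)
    return src.astype(out_dtype, copy=True)

src = np.ones((128, 512), dtype=np.float16)
same = emulate_load(src)                    # dtype omitted: stays float16
cast = emulate_load(src, dtype=np.float32)  # dtype given: cast to float32
```

In this analogy the result is always a fresh array, mirroring the "new tile" wording in the Returns entry.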

import neuronxcc.nki.language as nl

@nki_jit
def example_kernel(in_tensor, out_tensor):
  # load from in_tensor[P, F] that is on HBM
  # copy into data_tile[P, F] that is on SBUF
  data_tile = nl.load(in_tensor)
  ...

Note

The partition dimension size cannot exceed the hardware limit nki.language.tile_size.pmax; see Tile size considerations.

The partition dimension must be the first dimension in the index tuple of a tile. Data may therefore need to be split into multiple batches to load/store, for example:

import neuronxcc.nki.language as nl

@nki_jit
def example_load_store_b(in_tensor, out_tensor):
  for i_b in nl.affine_range(4): 
    data_tile = nl.zeros((128, 512), dtype=in_tensor.dtype) 
    # load from in_tensor[4, 128, 512] one batch at a time
    # copy into data_tile[128, 512]
    i_p, i_f = nl.mgrid[0:128, 0:512]
    data_tile[i_p, i_f] = nl.load(in_tensor[i_b, i_p, i_f])
    ...
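
The batching pattern in the example above can be checked on the host with plain NumPy. The sketch below is an analogy, not NKI code: iter_batch_tiles is a hypothetical helper, and PMAX = 128 is an assumed value for nki.language.tile_size.pmax, chosen to match the tile shapes used in the example.

```python
import numpy as np

PMAX = 128  # assumption: stands in for nki.language.tile_size.pmax

def iter_batch_tiles(tensor_3d, pmax=PMAX):
    """Yield one [P, F] tile per batch of a [B, P, F] array.

    Hypothetical helper: the partition dimension P becomes the first
    dimension of every tile and must not exceed pmax.
    """
    batches, partitions, _free = tensor_3d.shape
    if partitions > pmax:
        raise ValueError(f"partition size {partitions} exceeds pmax={pmax}")
    for i_b in range(batches):
        # Copy so each tile is independent of the source, like a load.
        yield tensor_3d[i_b].copy()

in_tensor = np.arange(4 * 128 * 512, dtype=np.float32).reshape(4, 128, 512)
tiles = list(iter_batch_tiles(in_tensor))
```

Stacking the tiles back together reproduces in_tensor, which is a quick way to confirm that no batch was skipped or duplicated.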

nl.load also supports indirect DMA access with dynamic index values:

import neuronxcc.nki.language as nl
...

  ############################################################################################
  # Indirect DMA read example 1:
  # - data_tensor on HBM has shape [128 x 512].
  # - idx_tensor on HBM has shape [64] (with values [0, 2, 4, 6, ...]).
  # - idx_tensor values read from HBM and stored in SBUF idx_tile of shape [64 x 1]
  # - data_tensor values read from HBM indexed by values in idx_tile
  #   and stored into SBUF data_tile of shape [64 x 512].
  ############################################################################################
  i_p = nl.arange(64)[:, None]
  i_f = nl.arange(512)[None, :]

  idx_tile = nl.load(idx_tensor[i_p]) # indices have to be in SBUF
  data_tile = nl.load(data_tensor[idx_tile[i_p, 0], i_f]) 
  ...

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
...

  ############################################################################################
  # Indirect DMA read example 2:
  # - data_tensor on HBM has shape [128 x 512].
  # - idx_tile on SBUF has shape [64 x 1] (with values [[0], [2], [4], ...] generated by iota)
  # - data_tensor values read from HBM indexed by values in idx_tile
  #   and stored into SBUF data_tile of shape [64 x 512].
  ############################################################################################
  i_f = nl.arange(512)[None, :]
  
  idx_expr = 2*nl.arange(64)[:, None]
  idx_tile = nisa.iota(idx_expr, dtype=np.int32)
  data_tile = nl.load(data_tensor[idx_tile, i_f]) 
  ...
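
As a rough host-side analogy for the two indirect reads above, the same row gather can be written with NumPy integer-array indexing. This sketch only illustrates which rows end up in data_tile; it assumes the index values [0, 2, 4, ...] from the comments and does not model DMA behavior or SBUF placement.

```python
import numpy as np

# Stand-in for data_tensor on HBM, shape [128 x 512].
data_tensor = np.arange(128 * 512, dtype=np.float32).reshape(128, 512)

# Stand-in for idx_tile on SBUF, shape [64 x 1]: [[0], [2], [4], ...].
idx_tile = (2 * np.arange(64, dtype=np.int32))[:, None]

# Row i of data_tile is row idx_tile[i, 0] of data_tensor.
data_tile = data_tensor[idx_tile[:, 0], :]  # shape [64 x 512]
```

With these index values the gather selects every second row, so the result matches data_tensor[::2].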
