This document is relevant for: Inf2, Trn1, Trn1n

nki.isa.local_gather#

nki.isa.local_gather(src_buffer, index, num_elem_per_idx=1, num_valid_indices=None, mask=None)[source]#

Gather SBUF data in src_buffer using index on GpSimd Engine.

Each of the eight GpSimd cores in GpSimdE connects to 16 contiguous SBUF partitions (e.g., core[0] connected to partition[0:16]) and performs gather from the connected 16 SBUF partitions independently in parallel. The indices used for gather on each core should also come from the same 16 connected SBUF partitions.

During execution of the instruction, each GpSimd core reads a 16-partition slice from index, flattens all indices into a 1D array indices_1d (along the partition dimension first). By default with no num_valid_indices specified, each GpSimd core will treat all indices from its corresponding 16-partition index slice as valid indices. However, when the number of valid indices per core is not a multiple of 16, users can explicitly specify the valid index count per core in num_valid_indices. Note, num_valid_indices must not exceed the total element count in each 16-partition index slice (i.e., num_valid_indices <= index.size / (index.shape[0] / 16)).

Next, each GpSimd core uses the flattened indices_1d indices as partition offsets to gather from the connected 16-partition slice of src_buffer. Optionally, this API also allows gathering of multiple contiguous elements starting at each index to improve gather throughput, as indicated by num_elem_per_idx. Behavior of out-of-bound index access is undefined.

Even though all eight GpSimd cores can gather with completely different indices, a common use case for this API is to make all cores gather with the same set of indices (i.e., partition offsets). In this case, users can generate indices into 16 partitions, replicate them eight times to 128 partitions and then feed them into local_gather.

As an example, if src_buffer is (128, 512) in shape and index is (128, 4) in shape, where the partition dimension size is 128, local_gather effectively performs the following operation:

num_gpsimd_cores = 8
num_partitions_per_core = 16
src_buffer = np.random.random_sample([128, 512, 4]).astype(np.float32)
index_per_core = np.random.randint(low=0, high=512, size=(16, 4), dtype=np.uint16)
# replicate 8 times for 8 GpSimd cores
index = np.tile(index_per_core, (num_gpsimd_cores, 1))
num_elem_per_idx = 4
index_hw = index * num_elem_per_idx
num_valid_indices = 64
output_shape = (128, 4, 16, 4)
num_active_cores = index.shape[0] / num_partitions_per_core
num_valid_indices = num_valid_indices if num_valid_indices \
    else index.size / num_active_cores

output_np = np.ndarray(shape=(128, num_valid_indices, num_elem_per_idx),
                       dtype=src_buffer.dtype)

for i_core in range(num_gpsimd_cores):
  start_par = i_core * num_partitions_per_core
  end_par = (i_core+1) * num_partitions_per_core
  indices_1d = index[start_par:end_par].flatten(order='F')[0: num_valid_indices]

  output_np[start_par:end_par, :, :] = np.take(
    src_buffer[start_par:end_par],
    indices_1d, axis=1)

output_np = output_np.reshape(output_shape)

local_gather preserves the input data types from src_buffer in the gather output. Therefore, no data type casting is allowed in this API. The indices in index tile must be uint16 types.

This API has three tile size constraints [subject to future relaxation]:

  1. The partition dimension size of src_buffer must match that of index and must be a multiple of 16. In other words, src_buffer.shape[0] == index.shape[0] and src_buffer.shape[0] % 16 == 0.

  2. The number of contiguous elements to gather per index per partition num_elem_per_idx must be one of the following values: [1, 2, 4, 8, 16, 32].

  3. The number of indices for gather per core must be less than or equal to 4096.

Estimated instruction cost:

150 + (num_valid_indices * num_elem_per_idx)/C GpSimd Engine cycles, where C can be calculated using ((28 + t * num_elem_per_idx)/(t * num_elem_per_idx)) / min(4/dtype_size, num_elem_per_idx). dtype_size is the size of src_buffer.dtype in bytes. Currently, t is a constant 4, but subject to change in future software implementation.

Parameters:
  • src_buffer – an input tile for gathering.

  • index – an input tile with indices used for gathering.

  • num_elem_per_idx – an optional integer value to read multiple contiguous elements per index per partition; default is 1.

  • num_valid_indices – an optional integer value to specify the number of valid indices per GpSimd core; default is index.size / (index.shape[0] / 16).

  • mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:

an output tile of the gathered data

Example:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

##################################################################
# Example 1: gather src_buffer using index
# Gather input: src_buffer_tile with shape (128, 512)
# Gather indices: index_tile with shape (128, 4)
# We use num_valid_indices indices per core, and read num_elem_per_idx
# contiguous elements per partition.
##################################################################

src_buffer_tile = nl.load(src_buffer)
index_tile = nl.load(index)
output_tile = nisa.local_gather(src_buffer_tile, index_tile,
                                num_elem_per_idx, num_valid_indices)

nl.store(output, output_tile)

Click here to download the full NKI code example with equivalent numpy implementation.

This document is relevant for: Inf2, Trn1, Trn1n