This document is relevant for: Trn1, Trn2, Trn3

Find Nonzero Indices Subkernel API Reference

Finds indices of nonzero elements along the T dimension.

The kernel supports:

  • Finding nonzero indices in an input tensor of shape [T, C]

  • LNC2 sharding across columns

  • GpSimd nonzero_with_count ISA for parallel processing

  • Token counts up to 65536 and column counts up to 128

  • Optional column subsetting via col_start_id and n_cols

Background

The find_nonzero_indices subkernel computes the indices of nonzero elements along the T dimension for each column of an input tensor. It uses the GpSimd nonzero_with_count ISA instruction for parallel processing of 8 columns at a time, with LNC2 sharding across the column dimension.
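The output layout described above can be modeled on the host with NumPy. This is a minimal sketch for checking expectations, not the NKI kernel itself: the function name find_nonzero_indices_reference is hypothetical, the ascending order of indices within a column is an assumption, and the dtype of the count tensor is assumed to match the index dtype.

```python
import numpy as np


def find_nonzero_indices_reference(input_tensor, index_dtype=np.int32):
    """NumPy model of the documented output layout (hypothetical helper)."""
    T, C = input_tensor.shape
    # One row per column of the input, padded with -1 up to length T.
    indices = np.full((C, T), -1, dtype=index_dtype)
    # Assumption: counts use the same dtype as the indices.
    counts = np.zeros(C, dtype=index_dtype)
    for c in range(C):
        # T-indices of the nonzero entries in column c (ascending order assumed).
        nz = np.flatnonzero(input_tensor[:, c])
        indices[c, : nz.size] = nz
        counts[c] = nz.size
    return indices, counts


# Small [T, C] = [4, 2] input.
x = np.array([[0, 3], [5, 0], [0, 0], [7, 2]], dtype=np.float32)
indices, counts = find_nonzero_indices_reference(x)
print(indices)
print(counts)
```

Column 0 of x is nonzero at T-indices 1 and 3, and column 1 at 0 and 3, so each output row holds two indices followed by two -1 padding values.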

API Reference

Source code for this kernel API can be found at: find_nonzero_indices.py

find_nonzero_indices

nkilib.core.subkernels.find_nonzero_indices(input_tensor: nl.ndarray, col_start_id: nl.ndarray = None, n_cols: int = None, chunk_size: int = None, index_dtype: nki.dtype = nl.int32)

Find indices of nonzero elements along the T dimension.

Parameters:
  • input_tensor (nl.ndarray) – [T, C], Input tensor on HBM. Nonzero elements are found along the T dimension for each column.

  • col_start_id (nl.ndarray) – [1], Optional HBM tensor containing the starting column index in the C dimension. If specified, only n_cols columns starting from col_start_id are processed. If None, all C columns are processed.

  • n_cols (int) – Number of columns (in C dimension) to process. Required when col_start_id is specified, ignored otherwise.

  • chunk_size (int) – Size of chunks for processing T dimension. If None, defaults to T. Must divide T evenly. Smaller chunk sizes reduce memory usage.

  • index_dtype (nki.dtype) – Data type for output indices tensor. Default is nl.int32.
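How col_start_id, n_cols, and chunk_size interact can be illustrated with a host-side NumPy model. This is only a sketch of the documented parameter semantics: the function name reference_with_options is hypothetical, the per-chunk accumulation shown here is an assumption about how chunked results combine (not a description of the kernel's internals), and the hardware constraint that chunk_size be divisible by 128 is deliberately not enforced so the example can stay small.

```python
import numpy as np


def reference_with_options(input_tensor, col_start_id=None, n_cols=None,
                           chunk_size=None):
    """NumPy model of column subsetting and T-chunking (hypothetical helper)."""
    T, C_full = input_tensor.shape
    if col_start_id is None:
        start, C = 0, C_full  # n_cols is ignored when col_start_id is None
    else:
        if n_cols is None:
            raise ValueError("n_cols is required when col_start_id is given")
        start, C = int(col_start_id[0]), n_cols

    chunk_size = T if chunk_size is None else chunk_size
    if T % chunk_size != 0:
        raise ValueError("chunk_size must divide T evenly")

    indices = np.full((C, T), -1, dtype=np.int32)
    counts = np.zeros(C, dtype=np.int32)
    for t0 in range(0, T, chunk_size):
        chunk = input_tensor[t0:t0 + chunk_size, start:start + C]
        for c in range(C):
            # Local indices within the chunk, shifted back to global T-indices.
            nz = np.flatnonzero(chunk[:, c]) + t0
            n = int(counts[c])
            indices[c, n:n + nz.size] = nz
            counts[c] = n + nz.size
    return indices, counts


x = np.zeros((8, 4), dtype=np.float32)
x[1, 1] = x[6, 1] = 1.0
x[2, 2] = 1.0

# Process only columns 1 and 2, in two chunks of 4 rows each.
sub_idx, sub_cnt = reference_with_options(
    x, col_start_id=np.array([1]), n_cols=2, chunk_size=4)
# Same columns without chunking, for comparison.
full_idx, full_cnt = reference_with_options(
    x, col_start_id=np.array([1]), n_cols=2)
```

In this model the chunked and unchunked calls produce identical outputs, which matches the description of chunk_size as a memory-usage setting rather than something that changes the result.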

Returns:

Two tensors:

  • [C, T] or [n_cols, T] (nl.ndarray) – Tensor containing nonzero indices. For each column c, the first N values are the T-indices of nonzero elements, where N is the nonzero count for that column, followed by -1 padding values.

  • [C] or [n_cols] (nl.ndarray) – Count of nonzero elements per column.
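Because every row of the index tensor is padded to length T, consumers typically use the count tensor to recover the valid prefix of each row. The sketch below shows one way to do that on the host with NumPy, assuming both outputs have already been copied into NumPy arrays; trim_padding is a hypothetical helper, not part of nkilib.

```python
import numpy as np


def trim_padding(indices, counts):
    """Return one variable-length index array per column (hypothetical helper)."""
    return [indices[c, : int(counts[c])] for c in range(indices.shape[0])]


# Example outputs for three columns with T = 4.
indices = np.array(
    [[1, 3, -1, -1],
     [0, 3, -1, -1],
     [-1, -1, -1, -1]], dtype=np.int32)
counts = np.array([2, 2, 0], dtype=np.int32)

per_column = trim_padding(indices, counts)

# Sanity check: everything past the count in each row is -1 padding.
padding_ok = all(
    bool(np.all(indices[c, int(counts[c]):] == -1))
    for c in range(indices.shape[0])
)
```

Slicing by count is preferable to filtering on -1 when the index dtype or padding convention might change, since it relies only on the documented count tensor.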

Notes:

  • Requires LNC2 configuration (2 NeuronCores)

  • C must be divisible by 2 (for LNC2 sharding)

  • chunk_size must be divisible by 128 (partition size)

  • Uses GpSimd nonzero_with_count ISA which only operates on partitions [0, 16, 32, …, 112]
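The shape constraints listed above can be checked on the host before a kernel launch. The following is a minimal sketch in plain Python: check_find_nonzero_shapes and the constant names are hypothetical, and applying the documented limits (65536 tokens, 128 columns) to the processed column count C is an assumption.

```python
PARTITION_SIZE = 128          # partition size from the notes above
GPSIMD_PARTITION_STRIDE = 16  # nonzero_with_count uses every 16th partition


def check_find_nonzero_shapes(T, C, chunk_size=None):
    """Return a list of violated constraints (empty if none). Hypothetical helper."""
    chunk_size = T if chunk_size is None else chunk_size
    problems = []
    if C % 2 != 0:
        problems.append("C must be divisible by 2 (LNC2 sharding)")
    if chunk_size % PARTITION_SIZE != 0:
        problems.append("chunk_size must be divisible by 128")
    if T % chunk_size != 0:
        problems.append("chunk_size must divide T evenly")
    if T > 65536:
        problems.append("T exceeds the supported maximum of 65536")
    if C > 128:
        problems.append("C exceeds the supported maximum of 128")
    return problems


# Partitions the GpSimd instruction operates on: 0, 16, ..., 112.
active_partitions = list(range(0, PARTITION_SIZE, GPSIMD_PARTITION_STRIDE))

ok = check_find_nonzero_shapes(T=1024, C=8, chunk_size=256)
bad = check_find_nonzero_shapes(T=1000, C=7)
```

The stride of 16 over 128 partitions yields 8 active partitions, which is consistent with the kernel processing 8 columns at a time.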

Dimensions:

  • T: Sequence/token dimension (first dimension of input)

  • C: Column dimension over which nonzero indices are computed (second dimension of input)

  • C_full: Full column dimension from the input tensor shape
