nki.isa.nonzero_with_count
- nki.isa.nonzero_with_count(dst, src, index_offset=0, padding_val=-1)
Find indices of nonzero elements in an input tensor and their total count using GpSimd Engine.
Note
Available only on NeuronCore-v3 and newer.
NOTE: this instruction only reads partitions [0, 16, 32, …, 112] of the input tile and only writes partitions [0, 16, 32, …, 112] of the destination tile. Data in all other partitions of the destination tile is left unmodified, including the final extra slot that holds the count.
This behavior is due to the physical connectivity of GpSimd engine. Each of the eight GpSimd cores connects to 16 contiguous SBUF partitions (e.g., core[0] connects to partitions[0:16]). In nonzero_with_count, each GpSimd core reads from and writes to its 0-th partition only.
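To make the partition mapping concrete, here is a small host-side Python sketch (not NKI code) that lists the partitions the instruction touches. It assumes the eight cores and 16 partitions per core stated above; the constant and function names are illustrative only.

```python
# Host-side sketch of the partition mapping; names are hypothetical, not NKI API.
NUM_GPSIMD_CORES = 8        # eight GpSimd cores, per the text above
PARTITIONS_PER_CORE = 16    # each core connects to 16 contiguous partitions


def active_partitions():
    """Return the 0-th partition of each GpSimd core."""
    return [core * PARTITIONS_PER_CORE for core in range(NUM_GPSIMD_CORES)]


print(active_partitions())  # [0, 16, 32, 48, 64, 80, 96, 112]
```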
This instruction takes an input array and produces an output array containing the indices of all nonzero elements, followed by padding values, and ending with the count of nonzero elements found.
The output tensor has one more element in the free dimension than the input tensor. For an input with T elements, N of which are nonzero:
- First N elements: 0-indexed positions of the nonzero elements, offset by index_offset
- Next T-N elements: filled with padding_val
- Last element: the count N of nonzero elements found
The index_offset parameter is useful when processing arrays in tiles, allowing indices to be relative to the original array position rather than the tile.
Example for one partition of the tensor:
Input array (T=8): [0, 1, 1, 0, 0, 1, 0, 0]
index_offset = 16
padding_val = -1
Output (T+1=9): [17, 18, 21, -1, -1, -1, -1, -1, 3]

Where:
- 17, 18, 21 are the indices (1, 2, 5) plus offset 16
- -1 is the padding value for unused slots
- 3 is the count of nonzero elements
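The tiled use of index_offset can be sketched on the host in plain Python. This is only a reference model of the documented output layout for a single partition, not NKI code; the function name nonzero_with_count_ref and the choice of passing each tile's starting position as the offset are assumptions made for illustration.

```python
# Host-side reference model (plain Python, not NKI); names are hypothetical.
def nonzero_with_count_ref(row, index_offset=0, padding_val=-1):
    """Model the documented per-partition output layout for one row."""
    indices = [i + index_offset for i, v in enumerate(row) if v != 0]
    padding = [padding_val] * (len(row) - len(indices))
    return indices + padding + [len(indices)]


# The example from the text: T=8, index_offset=16.
print(nonzero_with_count_ref([0, 1, 1, 0, 0, 1, 0, 0], index_offset=16))
# [17, 18, 21, -1, -1, -1, -1, -1, 3]

# Tiled processing: split a 16-element array into two tiles of 8 and pass
# each tile's starting position as index_offset.
full = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]
tile = 8
global_indices = []
for start in range(0, len(full), tile):
    out = nonzero_with_count_ref(full[start:start + tile], index_offset=start)
    count = out[-1]                      # last element holds the count
    global_indices.extend(out[:count])   # keep only the valid indices

print(global_indices)  # [1, 2, 5, 8, 14, 15]
```

Because each tile's indices are already shifted by its offset, the per-tile results can be concatenated without further adjustment.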
Constraints
Supported arch versions: NeuronCore-v3+.
Supported engines: GpSimd.
Parameters
src and dst must have the same number of elements in the partition dimension.
Destination tensor must have exactly 1 more element than the source tensor in the free dimension.
Only accesses the 0-th partition for each GpSimd core (i.e., [0, 16, 32, …, 112]).
src must be in SBUF with dtype float32 or int32.
dst must be in SBUF with dtype int32.
index_offset and padding_val must be int32.
- Parameters:
src – Input tensor to find nonzero indices from. Only partitions [0, 16, 32, …, 112] are read from. Supported buffers: SBUF. Supported dtypes: float32, int32.
dst – Output tensor containing nonzero indices, padding, and count. Only partitions [0, 16, 32, …, 112] are written to. It must have one more element than src in the free dimension. Supported buffers: SBUF. Supported dtypes: int32.
index_offset – Offset to add to the found indices (useful for tiled processing). Supported dtypes: int32.
padding_val – Value to use for padding unused output elements. Supported dtypes: int32.
Behavior
# Find all nonzero elements in input
nonzero_indices = []
for i in range(len(input_array)):
    if input_array[i] != 0:
        nonzero_indices.append(i + index_offset)

# Build output array
output = []

# Add found indices
for idx in nonzero_indices:
    output.append(idx)

# Add padding for remaining slots
for _ in range(len(input_array) - len(nonzero_indices)):
    output.append(padding_val)

# Add count as last element
output.append(len(nonzero_indices))
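The per-row logic above, combined with the partition restriction and the shape constraints, can be modelled for a whole tile on the host. The following is a rough sketch using nested Python lists in place of SBUF tiles; simulate_tile is a hypothetical name, and the 128-partition tile and the sentinel value 99 are assumptions chosen for the demonstration.

```python
# Host-side simulation of the partition behaviour; not NKI code.
def simulate_tile(src, dst, index_offset=0, padding_val=-1):
    """Update dst in place as the text describes, for list-of-lists tiles."""
    assert len(src) == len(dst), "src and dst need the same partition count"
    for p in range(0, len(src), 16):      # 0-th partition of each GpSimd core
        row = src[p]
        assert len(dst[p]) == len(row) + 1, "dst needs one extra free element"
        found = [i + index_offset for i, v in enumerate(row) if v != 0]
        padding = [padding_val] * (len(row) - len(found))
        dst[p][:] = found + padding + [len(found)]
    return dst


P, T = 128, 4
src = [[0, 7, 0, 3] for _ in range(P)]
dst = [[99] * (T + 1) for _ in range(P)]  # 99 marks data that should survive
simulate_tile(src, dst)

print(dst[16])  # [1, 3, -1, -1, 2]
print(dst[17])  # [99, 99, 99, 99, 99]
```

Rows other than 0, 16, …, 112 keep their previous contents, which is why a caller may want to initialise the destination tile before relying on those partitions.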
Example
# Module aliases used below; the import paths are assumed from this page's
# nki.isa namespace.
import nki.isa as nisa
import nki.language as nl


def nonzero_with_count_kernel(in_tensor):
    in_shape = in_tensor.shape
    assert len(in_tensor.shape) == 2, "expected 2D tensor"

    # Stage the input in SBUF
    in_tile = nl.ndarray(in_shape, dtype=in_tensor.dtype, buffer=nl.sbuf)
    nisa.dma_copy(dst=in_tile, src=in_tensor)

    # Output has one extra element in the free dimension for the count
    out_tile = nl.ndarray((in_shape[0], in_shape[1] + 1), dtype=nl.int32, buffer=nl.sbuf)
    nisa.nonzero_with_count(dst=out_tile, src=in_tile, index_offset=0, padding_val=-1)

    # Copy the result back to HBM
    out_tensor = nl.ndarray(out_tile.shape, dtype=out_tile.dtype, buffer=nl.hbm)
    nisa.dma_copy(dst=out_tensor, src=out_tile)
    return out_tensor
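Once a result row is back on the host, the count in the last slot tells the consumer how many leading entries are valid. A minimal sketch of that decoding step, assuming the row has already been converted to a Python list (decode_row is a hypothetical helper, not part of NKI):

```python
# Host-side helper; the name and list-based input are hypothetical.
def decode_row(row):
    """Split one output row into (valid indices, count)."""
    count = row[-1]            # last element is the nonzero count
    return row[:count], count  # the first `count` entries are the indices


indices, count = decode_row([17, 18, 21, -1, -1, -1, -1, -1, 3])
print(indices, count)  # [17, 18, 21] 3
```

Only rows taken from partitions 0, 16, 32, …, 112 carry results, so this decoding is meaningful for those rows alone.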