nki.isa.nonzero_with_count

nki.isa.nonzero_with_count(dst, src, index_offset=0, padding_val=-1)

Find indices of nonzero elements in an input tensor and their total count using GpSimd Engine.

Note

Available only on NeuronCore-v3 and newer.

NOTE: this instruction reads only partitions [0, 16, 32, …, 112] of the input tile and writes only partitions [0, 16, 32, …, 112] of the destination tile. Data in all other partitions of the destination tile is left unmodified, including the last ‘extra’ slot for the count.

This behavior is due to the physical connectivity of GpSimd engine. Each of the eight GpSimd cores connects to 16 contiguous SBUF partitions (e.g., core[0] connects to partitions[0:16]). In nonzero_with_count, each GpSimd core reads from and writes to its 0-th partition only.
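
The set of active partitions follows directly from that layout. A minimal sketch in plain Python; the constant names are illustrative and are not part of the NKI API:

```python
# Illustrative constants taken from the description above; not NKI API names.
NUM_GPSIMD_CORES = 8
PARTITIONS_PER_CORE = 16

# Each core touches only the first (0-th) partition of its 16-partition group.
active_partitions = [core * PARTITIONS_PER_CORE for core in range(NUM_GPSIMD_CORES)]
print(active_partitions)  # [0, 16, 32, 48, 64, 80, 96, 112]
```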

This instruction takes an input array and produces an output array containing the indices of all nonzero elements, followed by padding values, and ending with the count of nonzero elements found.

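The output tensor has one more element in the free dimension than the input tensor. For an input with T elements in the free dimension, N of which are nonzero, each processed partition of the output holds: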

  • First N elements: 0-indexed positions of nonzero elements, offset by index_offset

  • Next T-N elements: Filled with padding_val

  • Last element: Count N of nonzero elements found

The index_offset parameter is useful when processing arrays in tiles, allowing indices to be relative to the original array position rather than the tile.
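
To illustrate the tiled use of index_offset, here is a host-side sketch in plain Python. The helper nonzero_with_count_ref and the tile width are assumptions made for illustration; the helper only models the documented per-partition output and does not call the hardware instruction.

```python
def nonzero_with_count_ref(row, index_offset=0, padding_val=-1):
    """Hypothetical host-side model of one partition's output."""
    hits = [i + index_offset for i, v in enumerate(row) if v != 0]
    return hits + [padding_val] * (len(row) - len(hits)) + [len(hits)]

data = [0, 3, 0, 0, 7, 0, 1, 0]  # original array, 8 elements
TILE = 4                         # tile width chosen for illustration

global_indices = []
for start in range(0, len(data), TILE):
    # Passing the tile's start position makes the indices global.
    out = nonzero_with_count_ref(data[start:start + TILE], index_offset=start)
    count = out[-1]
    global_indices.extend(out[:count])

print(global_indices)  # [1, 4, 6]
```

With index_offset left at 0, the second tile would report positions 0 and 2 instead of 4 and 6, and the caller would have to add the tile start itself.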

Example for one partition of the tensor:

Input array (T=8): [0, 1, 1, 0, 0, 1, 0, 0]
index_offset = 16
padding_val = -1

Output (T+1=9): [17, 18, 21, -1, -1, -1, -1, -1, 3]

Where:

- 17, 18, 21 are the indices (1, 2, 5) plus offset 16
- -1 is the padding value for unused slots
- 3 is the count of nonzero elements
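
A consumer of this layout would typically read the count from the last slot first and then keep only that many leading entries. A small sketch using the example output above (variable names are illustrative):

```python
output = [17, 18, 21, -1, -1, -1, -1, -1, 3]  # example output from above
index_offset = 16

count = output[-1]               # last slot holds the nonzero count
indices = output[:count]         # valid indices come first; the rest is padding
local = [i - index_offset for i in indices]  # positions within the tile

print(count, indices, local)  # 3 [17, 18, 21] [1, 2, 5]
```

Reading the count rather than scanning for padding_val avoids ambiguity when padding_val could also be a valid index.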

Constraints

  • Supported arch versions: NeuronCore-v3+.

  • Supported engines: GpSimd.

  • Parameters src and dst must have the same number of elements in the partition dimension.

  • Destination tensor must have exactly 1 more element than the source tensor in the free dimension.

  • Only accesses the 0-th partition for each GpSimd core (i.e., [0, 16, 32, …, 112]).

  • src must be in SBUF with dtype float32 or int32.

  • dst must be in SBUF with dtype int32.

  • index_offset and padding_val must be int32.
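
The shape and dtype constraints can be checked on the host before launching a kernel. The function below is a hypothetical helper, not part of NKI; it assumes shapes are given as (partitions, free_elements) tuples and dtypes as plain strings.

```python
def check_nonzero_with_count_args(src_shape, dst_shape, src_dtype, dst_dtype):
    """Hypothetical host-side check; shapes are (partitions, free_elements)."""
    errors = []
    if src_shape[0] != dst_shape[0]:
        errors.append("src and dst must have the same partition count")
    if dst_shape[1] != src_shape[1] + 1:
        errors.append("dst needs exactly one more free-dimension element than src")
    if src_dtype not in ("float32", "int32"):
        errors.append("src dtype must be float32 or int32")
    if dst_dtype != "int32":
        errors.append("dst dtype must be int32")
    return errors

ok = check_nonzero_with_count_args((128, 8), (128, 9), "float32", "int32")
bad = check_nonzero_with_count_args((128, 8), (128, 8), "float16", "int32")
print(ok)   # []
print(bad)  # two messages: free-dimension size and src dtype
```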

Parameters:
  • src – Input tensor to find nonzero indices from. Only partitions [0, 16, 32, …, 112] are read from. Supported buffers: SBUF. Supported dtypes: float32, int32.

  • dst – Output tensor containing nonzero indices, padding, and count. Only partitions [0, 16, 32, …, 112] are written to. It must have one more element than src in the free dimension. Supported buffers: SBUF. Supported dtypes: int32.

  • index_offset – Offset to add to the found indices (useful for tiled processing). Supported dtypes: int32.

  • padding_val – Value to use for padding unused output elements. Supported dtypes: int32.

Behavior

# Semantics for one active partition; input_array is that partition's row of src
# Find all nonzero elements in input
nonzero_indices = []
for i in range(len(input_array)):
    if input_array[i] != 0:
        nonzero_indices.append(i + index_offset)

# Build output array
output = []
# Add found indices
for idx in nonzero_indices:
    output.append(idx)
# Add padding for remaining slots
for _ in range(len(input_array) - len(nonzero_indices)):
    output.append(padding_val)
# Add count as last element
output.append(len(nonzero_indices))
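
The same semantics can be modeled for a whole tile, including the rule that only every 16th partition is touched. This is a host-side sketch in plain Python using nested lists as stand-ins for SBUF tiles; the function name is hypothetical and the value 99 simply marks data already present in the destination.

```python
def nonzero_with_count_tile_ref(dst, src, index_offset=0, padding_val=-1):
    """Hypothetical model: update dst in place on partitions 0, 16, 32, ... only."""
    for p in range(0, len(src), 16):
        row = src[p]
        hits = [i + index_offset for i, v in enumerate(row) if v != 0]
        dst[p] = hits + [padding_val] * (len(row) - len(hits)) + [len(hits)]
    return dst

src = [[0, 5, 0, 2] for _ in range(32)]   # 32 partitions, 4 elements each
dst = [[99] * 5 for _ in range(32)]       # 99 marks pre-existing data in dst
nonzero_with_count_tile_ref(dst, src)

print(dst[0])   # [1, 3, -1, -1, 2]
print(dst[1])   # [99, 99, 99, 99, 99] (left unmodified)
print(dst[16])  # [1, 3, -1, -1, 2]
```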

Example

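# Module aliases used below; adjust the import path to match your NKI installation.
import nki.isa as nisa
import nki.language as nl

def nonzero_with_count_kernel(in_tensor):
    in_shape = in_tensor.shape
    assert len(in_shape) == 2, "expected 2D tensor"

    # Load the input into an SBUF tile.
    in_tile = nl.ndarray(in_shape, dtype=in_tensor.dtype, buffer=nl.sbuf)
    nisa.dma_copy(dst=in_tile, src=in_tensor)

    # The destination has one extra free-dimension slot for the count.
    out_tile = nl.ndarray((in_shape[0], in_shape[1] + 1), dtype=nl.int32, buffer=nl.sbuf)
    nisa.nonzero_with_count(dst=out_tile, src=in_tile, index_offset=0, padding_val=-1)

    # Copy the result to HBM; only partitions 0, 16, 32, ... were written by the instruction.
    out_tensor = nl.ndarray(out_tile.shape, dtype=out_tile.dtype, buffer=nl.hbm)
    nisa.dma_copy(dst=out_tensor, src=out_tile)

    return out_tensor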