nki.isa

NKI ISA

nc_matmul

Compute dst = stationary.T @ moving matrix multiplication using Tensor Engine.
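The transposed-operand convention is easy to misread, so here is a minimal plain-Python sketch of the arithmetic only. It does not call NKI or model the Tensor Engine; the function name `nc_matmul_reference` and the nested-list layout (rows as the shared contraction dimension) are assumptions made for illustration.

```python
def nc_matmul_reference(stationary, moving):
    """Model dst = stationary.T @ moving on nested lists.

    stationary is K x M, moving is K x N; the result is M x N.
    (Hypothetical helper, not part of nki.isa.)
    """
    k = len(stationary)
    if len(moving) != k:
        raise ValueError("stationary and moving must have the same number of rows")
    m = len(stationary[0])
    n = len(moving[0])
    # dst[i][j] sums over the shared row index p of both inputs.
    return [
        [sum(stationary[p][i] * moving[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


stationary = [[1, 2], [3, 4], [5, 6]]  # K=3, M=2
moving = [[1, 0], [0, 1], [1, 1]]      # K=3, N=2
dst = nc_matmul_reference(stationary, moving)  # [[6, 8], [8, 10]]
```

Both inputs share their first dimension, which is why no explicit transpose of `stationary` appears in the loop.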

nc_matmul_mx

Compute matrix multiplication of MXFP8/MXFP4 quantized matrices with integrated dequantization using Tensor Engine.

nc_transpose

Perform a 2D transpose between the partition axis and the free axis of input data using Tensor or Vector Engine.

activation

Apply an activation function on every element of the input tile using Scalar Engine, with an optional scale/bias operation before the activation and an optional reduction operation after the activation in the same instruction.

activation_reduce

Perform the same computation as nisa.activation and also a reduction along the free dimension of the nisa.activation result using Scalar Engine.

tensor_reduce

Apply a reduction operation to the free axes of an input data tile using Vector Engine.

tensor_partition_reduce

Apply a reduction operation across partitions of an input data tile using GpSimd Engine.

tensor_tensor

Perform an element-wise operation on two input tiles using Vector Engine or GpSimd Engine.

tensor_tensor_scan

Perform a scan operation of two input tiles using Vector Engine.

scalar_tensor_tensor

Apply two math operators in sequence using Vector Engine: (data <op0> operand0) <op1> operand1.

tensor_scalar

Apply up to two math operators to the input data tile by broadcasting scalar/vector operands in the free dimension using Vector, Scalar, or GpSimd Engine: (data <op0> operand0) <op1> operand1.
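The expression (data <op0> operand0) <op1> operand1 can be illustrated with a small plain-Python sketch. It models only the arithmetic and the broadcast along the free dimension, not the NKI call itself; `tensor_scalar_reference`, its argument order, and the choice to represent a vector operand as one value per partition (row) are assumptions for illustration.

```python
import operator


def _operand_for_partition(operand, partition):
    # A scalar applies to every element; a list supplies one value per partition.
    if isinstance(operand, (list, tuple)):
        return operand[partition]
    return operand


def tensor_scalar_reference(data, op0, operand0, op1=None, operand1=None):
    """Model (data <op0> operand0) <op1> operand1 on a list of rows.

    (Hypothetical helper, not part of nki.isa.)
    """
    out = []
    for p, row in enumerate(data):
        a = _operand_for_partition(operand0, p)
        new_row = [op0(x, a) for x in row]
        if op1 is not None:
            b = _operand_for_partition(operand1, p)
            new_row = [op1(x, b) for x in new_row]
        out.append(new_row)
    return out


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Multiply everything by 2.0, then add a different bias to each partition.
result = tensor_scalar_reference(data, operator.mul, 2.0, operator.add, [10.0, 20.0])
# result == [[12.0, 14.0, 16.0], [28.0, 30.0, 32.0]]
```

Passing `op1=None` models the single-operator case.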

tensor_scalar_reduce

Perform the same computation as nisa.tensor_scalar with one math operator and also a reduction along the free dimension of the nisa.tensor_scalar result using Vector Engine.

tensor_copy

Create a copy of src tile within NeuronCore on-chip SRAMs using Vector, Scalar or GpSimd Engine.

tensor_copy_dynamic_src

Create a copy of src tile within NeuronCore on-chip SRAMs using Vector, Scalar, or GpSimd Engine, with src located at a dynamic offset within each partition.

tensor_copy_dynamic_dst

Create a copy of src tile within NeuronCore on-chip SRAMs using Vector, Scalar, or GpSimd Engine, with dst located at a dynamic offset within each partition.

tensor_copy_predicated

Conditionally copy elements from the src tile to the destination tile on SBUF / PSUM based on a predicate using Vector Engine.

reciprocal

Compute element-wise reciprocal (1.0/x) of the input data tile using Vector Engine.

quantize_mx

Quantize FP16/BF16 data to MXFP8 tensors (both data and scales) using Vector Engine.

iota

Generate a constant literal pattern into SBUF using GpSimd Engine.

dropout

Randomly replace some elements of the input data tile with zeros based on input probabilities using Vector Engine.

affine_select

Select elements between an input tile on_true_tile and a scalar value on_false_value according to a boolean predicate tile using GpSimd Engine.

range_select

Select elements from on_true_tile based on comparison with bounds using Vector Engine.

select_reduce

Selectively copy elements from either on_true or on_false to the destination tile based on a predicate using Vector Engine, with optional reduction (max).

sequence_bounds

Compute the sequence bounds for a given set of segment IDs using GpSimd Engine.

memset

Initialize a tile filled with a compile-time constant value using Vector or GpSimd Engine.

bn_stats

Compute mean- and variance-related statistics for each partition of an input data tile in parallel using Vector Engine.

bn_aggr

Aggregate one or multiple bn_stats outputs to generate a mean and variance per partition using Vector Engine.
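The two-step pattern (per-chunk statistics, then aggregation) can be sketched in plain Python for a single partition. This is only an illustration of the idea: the `(count, mean, variance)` tuple layout, the use of population variance, and the helper names `chunk_stats` and `aggregate_stats` are all assumptions, and the actual bn_stats output format may differ.

```python
def chunk_stats(values):
    """Return (count, mean, population variance) for one chunk.

    (Hypothetical stand-in for a bn_stats-style result.)
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return n, mean, var


def aggregate_stats(stats):
    """Combine per-chunk (count, mean, variance) tuples into one (mean, variance).

    (Hypothetical stand-in for a bn_aggr-style aggregation.)
    """
    total = sum(n for n, _, _ in stats)
    mean = sum(n * m for n, m, _ in stats) / total
    # Each chunk contributes its own spread plus the shift of its mean
    # relative to the combined mean.
    var = sum(n * (v + (m - mean) ** 2) for n, m, v in stats) / total
    return mean, var


partition = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
stats = [chunk_stats(partition[:4]), chunk_stats(partition[4:])]
mean, var = aggregate_stats(stats)
```

Aggregating the chunk statistics gives the same mean and variance as a single pass over the whole partition, which is what makes splitting a long free dimension into chunks safe.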

local_gather

Gather SBUF data in src_buffer using index on GpSimd Engine.

dma_copy

Copy data from src to dst using DMA engines with optional read-modify-write operations.

dma_transpose

Perform a transpose on input src using DMA Engine.

dma_compute

Perform math operations using compute logic inside DMA engines with element-wise scaling and reduction.

max8

Find the 8 largest values in each partition of the source tile.
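As a rough model of the result, the following plain-Python sketch selects the 8 largest values from each row, treating each row as one partition. The helper name `max8_reference` and the descending output order are assumptions for illustration; the sketch does not call NKI.

```python
def max8_reference(tile):
    """Return the 8 largest values of each row, largest first.

    (Hypothetical helper; descending order is an assumption.)
    """
    return [sorted(row, reverse=True)[:8] for row in tile]


tile = [
    [3, 9, 1, 7, 5, 8, 2, 6, 4, 0],
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
]
top8 = max8_reference(tile)
# top8[0] == [9, 8, 7, 6, 5, 4, 3, 2]
```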

nc_find_index8

Find indices of the 8 given vals in each partition of the data tensor.

nc_match_replace8

Replace the first occurrence of each value in vals with imm in data using Vector Engine and return the replaced tensor.
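The "first occurrence" rule can be shown with a short plain-Python sketch that processes each partition (row) independently. The helper name `match_replace_reference` and the per-row pairing of `data` with `vals` are assumptions for illustration, not the NKI signature.

```python
def match_replace_reference(data, vals, imm):
    """For each row, replace the first match of each value in vals with imm.

    (Hypothetical helper, not part of nki.isa.)
    """
    out = []
    for row, row_vals in zip(data, vals):
        new_row = list(row)  # copy so the input is left untouched
        for v in row_vals:
            for i, x in enumerate(new_row):
                if x == v:
                    new_row[i] = imm
                    break  # only the first occurrence is replaced
        out.append(new_row)
    return out


data = [[5, 3, 5, 1]]
vals = [[5, 1]]
replaced = match_replace_reference(data, vals, -1)
# replaced == [[-1, 3, 5, -1]]; the second 5 is kept
```

A common use of this pattern is to mask out values already found by a top-8 search so that a second search returns the next 8.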

nc_stream_shuffle

Apply cross-partition data movement within a quadrant of 32 partitions from source tile src to destination tile dst using Vector Engine.

register_alloc

Allocate a virtual register and optionally initialize it with an integer value x.

register_load

Load a scalar value from memory (HBM or SBUF) into a virtual register.

register_move

Move a compile-time constant integer value into a virtual register.

register_store

Store the value from a virtual register into memory (HBM/SBUF).

core_barrier

Synchronize execution across multiple NeuronCores by implementing a barrier mechanism.

sendrecv

Perform point-to-point communication between NeuronCores by sending and receiving data simultaneously using DMA engines.

NKI ISA Config Enums

engine

Neuron Device engines

reduce_cmd

Engine Register Reduce commands

dge_mode

Neuron Descriptor Generation Engine Mode

Target

nc_version

NeuronCore version

get_nc_version

Returns the nc_version of the current target context.