This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Release Notes for Neuron Component: Neuron Kernel Interface (NKI)#

These are the release notes for the Neuron Kernel Interface (NKI) component. They describe the changes, improvements, and bug fixes in each release of the AWS Neuron SDK.

Neuron Kernel Interface (NKI) (Beta 2 - 0.2.0) [2.28] (Neuron 2.28.0 Release)#

Date of Release: 02/26/2026

New Features#

Improvements#

  • Updated nki.isa APIs:

  • Compiler output improvements:

    • The compiler no longer truncates diagnostic output; users now receive the full set of warnings and errors

Breaking Changes#

  • nki.isa.nc_matmul parameter psumAccumulateFlag has been removed. This parameter had no effect on compilation or execution. Simply remove it from your kernel code.

  • nki.isa.nc_matmul parameter is_moving_zero has been renamed to is_moving_onezero to match hardware semantics, consistent with the companion is_stationary_onezero parameter. Kernels that passed is_moving_zero by name should update to is_moving_onezero.

  • nki.tensor has moved to nki.meta.tensor. Users should update their imports accordingly.
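The two nki.isa.nc_matmul changes above can be applied mechanically to keyword arguments. The following is a minimal sketch in plain Python, not part of NKI: migrate_nc_matmul_kwargs is a hypothetical helper name, and it assumes the kernel collects its nc_matmul keyword arguments in a dict before making the call.

```python
def migrate_nc_matmul_kwargs(kwargs):
    """Return a copy of nc_matmul keyword arguments updated for Neuron 2.28.

    Hypothetical helper for illustration only; it is not part of NKI.
    """
    migrated = dict(kwargs)
    # psumAccumulateFlag was removed and never had any effect, so drop it.
    migrated.pop("psumAccumulateFlag", None)
    # is_moving_zero was renamed to is_moving_onezero.
    if "is_moving_zero" in migrated:
        if "is_moving_onezero" in migrated:
            raise ValueError("pass only is_moving_onezero, not both spellings")
        migrated["is_moving_onezero"] = migrated.pop("is_moving_zero")
    return migrated


old_kwargs = {"psumAccumulateFlag": 1, "is_moving_zero": True}
print(migrate_nc_matmul_kwargs(old_kwargs))

# The nki.tensor move is an import change only. Assuming tensor is
# imported by name, the update would look like:
#   before: from nki import tensor
#   after:  from nki.meta import tensor
```

The helper returns a new dict rather than editing its input, so the original call site can be compared against the migrated one while updating a kernel.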

Note

The previously announced removal of the neuronxcc.nki.* namespace has been postponed from Neuron 2.28 to Neuron 2.29. Both the neuronxcc.nki.* and nki.* namespaces continue to be supported in this release. We encourage customers to migrate to the nki.* namespace using the NKI Migration Guide.
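While both namespaces are supported, code that must run on several SDK versions can try the new namespace first and fall back to the old one. The sketch below is one way to do that with the standard library; import_first is a hypothetical helper name, and the module names in the usage comment are the two namespaces named in the note above.

```python
import importlib


def import_first(module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(
        f"none of these modules could be imported: {list(module_names)}"
    )


# Intended use on a machine with the Neuron SDK installed (not executed here):
#   nki = import_first(("nki", "neuronxcc.nki"))
```

Listing nki first means the fallback stops being exercised as soon as the new namespace is available. Note that catching ImportError also hides import failures raised inside an installed module, so a stricter variant may be preferable in production code.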

Bug Fixes#

  • Fixed incorrect default value for on_false_value in nki.isa.range_select. The default was 0.0 instead of negative infinity (-inf). This caused range_select to write zeros for out-of-range elements instead of the expected negative-infinity sentinel, which could produce incorrect results in downstream reductions (e.g., max-pooling or top-k). See nki.isa.range_select.

  • Fixed default value parsing for keyword-only arguments in NKI kernels. When a Python function used keyword-only arguments with default values (arguments after * in the signature), the NKI compiler did not associate the defaults with their corresponding parameter names. This caused keyword-only arguments to appear as required even when they had defaults, leading to “missing argument” errors during kernel compilation.

  • Fixed wrong default for reduce_cmd in nki.isa.activation. The default was incorrectly set to ZeroAccumulate instead of Idle, causing the accumulator to be zeroed before every activation call even when no reduction was requested.

  • Fixed missing ALU operators (rsqrt, abs, power) in nki.isa.tensor_scalar and nki.isa.tensor_tensor. Passing these operators previously raised an “unsupported operator” error. See NKI Language Guide.

  • Fixed float8_e4m3fn to float8_e4m3 conversion for kernel inputs and outputs. When a tensor with dtype float8_e4m3fn was passed to the compiler, the automatic conversion to float8_e4m3 could fail with a size-check error. The conversion now validates sizes correctly before casting. See nki.language.float8_e4m3.

  • Fixed dynamic for loop incorrectly incrementing the loop induction variable. In loops with a runtime-determined trip count (sequential_range with non-constant bounds), the compiler generated incorrect increment code, causing the loop counter to never advance and the loop to run indefinitely or produce incorrect iteration values. See nki.language.sequential_range.

  • Fixed reshape of shared_hbm and private_hbm tensors failing partition size check. Reshape only recognized plain hbm memory as exempt from partition-dimension size validation. Tensors allocated in shared_hbm or private_hbm (used for cross-kernel and kernel-private storage) incorrectly triggered a “partition size mismatch” error when reshaped. See nki.language.shared_hbm and nki.language.private_hbm.

  • Fixed bias shape checking in nki.isa.activation. The bias parameter was not validated for shape correctness. A bias tensor with a free dimension other than 1 (e.g., shape (128, 64) instead of (128, 1)) was accepted without validation, which could produce incorrect results. The compiler now raises an error if the bias free dimension is not 1.

  • Fixed incorrect line numbers in stack traces and error reporting. An off-by-one error in the line offset calculation caused all reported line numbers to be shifted by one. Additionally, error location was sometimes lost when errors propagated across file boundaries.

  • Fixed invalid keyword arguments being silently ignored instead of raising an error. When calling an NKI API with a misspelled or unsupported keyword argument, the argument was ignored without warning. The compiler now validates all keyword argument names against the function signature and raises an unexpected keyword argument error for unrecognized names.

  • Fixed nki.jit in auto-detection mode returning an uncalled kernel object instead of executing the kernel. When nki.jit was used without specifying a framework mode (e.g., @nki.jit with no mode argument), the auto-detection path constructed the appropriate framework-specific kernel object but returned it without calling it. The user received a kernel object instead of the computed result, requiring an extra manual invocation. See nki.jit.

  • Fixed stale kernel object state between trace invocations. When tracing the same kernel multiple times (e.g., with different input shapes), compiler state was not fully reset between invocations, causing name collisions and incorrect results. The trace state is now fully reset before each invocation.

  • Improved ‘removed during code migration’ error messages with clear descriptions of unimplemented features. APIs not available in this release (nki.baremetal, nki.benchmark, nki.profile, nki.simulate_kernel) previously raised a generic NotImplementedError("removed during code migration"). Each now raises a specific message naming the unsupported API. Additionally, calling an nki.jit kernel with no arguments now raises a clear error. See NKI Migration Guide.

  • Fixed nested nki_jit decorators not being allowed. The NKI compiler only recognized @nki.jit-decorated functions when they were plain function objects. Nested decorators (e.g., @my_wrapper @nki.jit) wrapped the function in a non-function object, causing the compiler to skip it. The compiler now correctly unwraps decorator chains to find the underlying kernel function. See nki.jit.
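Three of the fixes above (keyword-only defaults, unrecognized keyword arguments, and nested decorators) concern how a function signature is read. The sketch below shows the expected behavior using Python's standard inspect and functools modules. It is an illustration in plain Python, not the NKI compiler's implementation; keyword_only_defaults, check_kwargs, LoggedCall, and scale are hypothetical names.

```python
import functools
import inspect


def keyword_only_defaults(fn):
    """Map each keyword-only parameter of fn that has a default to that default."""
    return {
        p.name: p.default
        for p in inspect.signature(fn).parameters.values()
        if p.kind is inspect.Parameter.KEYWORD_ONLY
        and p.default is not inspect.Parameter.empty
    }


def check_kwargs(fn, kwargs):
    """Raise TypeError if kwargs contains a name that fn does not accept."""
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return  # fn takes **kwargs, so every name is acceptable
    not_by_keyword = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.VAR_POSITIONAL,
    )
    for name in kwargs:
        param = params.get(name)
        if param is None or param.kind in not_by_keyword:
            raise TypeError(
                f"{fn.__name__}() got an unexpected keyword argument {name!r}"
            )


class LoggedCall:
    """Hypothetical outer decorator that wraps a function in a non-function object."""

    def __init__(self, fn):
        functools.update_wrapper(self, fn)  # also sets self.__wrapped__ = fn

    def __call__(self, *args, **kwargs):
        return self.__wrapped__(*args, **kwargs)


def scale(x, *, factor=2.0, bias=0.0):
    """Stand-in for a kernel with keyword-only arguments."""
    return x * factor + bias


wrapped = LoggedCall(scale)
print(keyword_only_defaults(scale))
print(inspect.isfunction(wrapped), inspect.unwrap(wrapped) is scale)
```

Here wrapped is not a plain function object, yet inspect.unwrap follows the __wrapped__ chain back to scale, which is the kind of unwrapping the nested-decorator fix describes. A misspelled name such as factr passed to check_kwargs raises TypeError instead of being dropped silently.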

Known Issues#

  • nki.isa.range_select: The on_false_value and reduce_cmd parameters are incorrectly ignored by the NKI compiler. The on_false_value is always set to (-3.4028235e+38) and reduce_cmd is always set to reduce_cmd.reset_reduce, regardless of the values passed in.
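The fill value matters for downstream max reductions, and the value -3.4028235e+38 is the lowest finite float32 rather than negative infinity. The sketch below models both points in plain Python with the standard library. It does not call nki.isa.range_select; masked_max and to_float32 are hypothetical helpers, and the masked fill is a simplified stand-in for the instruction's behavior.

```python
import math
import struct

# Lowest finite float32, built from its bit pattern
# (sign bit set, exponent 0xFE, mantissa all ones).
FLOAT32_LOWEST = struct.unpack("<f", struct.pack("<I", 0xFF7FFFFF))[0]


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def masked_max(values, in_range, fill):
    """Replace out-of-range elements with fill, then take the maximum."""
    return max(v if keep else fill for v, keep in zip(values, in_range))


values = [-5.0, -2.0, -9.0]
in_range = [True, False, True]

print(masked_max(values, in_range, 0.0))             # zero fill hides the true maximum
print(masked_max(values, in_range, FLOAT32_LOWEST))  # sentinel fill keeps it
print(math.isinf(to_float32(-3.4028235e+38)))        # the sentinel is finite
```

Because the sentinel is finite, downstream code that detects fully out-of-range rows with an infinity check will not match it; comparing against the lowest finite float32 is one possible alternative until the issue is resolved.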

Neuron Kernel Interface (NKI) (Beta 2 - 0.1.0) [2.27] (Neuron 2.27.0 Release)#

Date: 12/25/2025

Improvements#

Known Issues#

  • nki.isa.nc_matmul: is_moving_onezero was incorrectly named is_moving_zero in this release

  • NKI ISA semantic checks are not available with Beta 2; as a workaround, refer to the API docs

  • NKI Collectives are not available with Beta 2

  • nki.benchmark and nki.profile are not available with Beta 2


Neuron Kernel Interface (NKI) (Beta) [2.26] (Neuron 2.26.0 Release)#

Date: 09/18/2025

Improvements#

  • new nki.language APIs:

    • nki.language.gelu_apprx_sigmoid - Gaussian Error Linear Unit activation function with sigmoid approximation.

    • nki.language.tile_size.total_available_sbuf_size to get total available SBUF size

  • new nki.isa APIs:

    • nki.isa.select_reduce - selectively copy elements with max reduction

    • nki.isa.sequence_bounds - compute sequence bounds of segment IDs

    • nki.isa.dma_transpose

      • axes param to define 4D transpose for some supported cases

      • dge_mode to specify Descriptor Generation Engine (DGE).

    • nl.gelu_apprx_sigmoid op support on nki.isa.activation

  • fixes / improvements:

    • nki.language.store supports PSUM buffers, with an additional copy inserted.

  • docs/tutorial improvements:

    • nki.isa.dma_transpose API doc and example

    • nki.simulate_kernel example improvement

    • use nl.fp32.min in tutorial code instead of a magic number

  • better error reporting:

    • indirect indexing on transpose

    • mask expressions


Neuron Kernel Interface (NKI) (Beta) [2.24] (Neuron 2.24.0 Release)#

Date: 06/24/2025

Improvements#

  • sqrt: extended the valid data range, improving accuracy over a wider range of numerical values.

  • nki.language.gather_flattened new API

  • nki.isa.nc_match_replace8 additional param dst_idx

  • improved docs/examples on nki.isa.nc_match_replace8, nki.isa.nc_stream_shuffle

  • improved error messages


Neuron Kernel Interface (NKI) (Beta) [2.23] (Neuron 2.23.0 Release)#

Date: 05/20/2025

Improvements#

  • nki.isa.range_select (for trn2) new instruction

  • abs and power ops are now supported on nki.isa tensor instructions

  • abs op supported on nki.isa.activation instruction

  • GpSIMD engine support added for 32-bit integer add and multiply in nki.isa tensor operations

  • nki.isa.tensor_copy_predicated support for reversing predicate.

  • nki.isa.tensor_copy_dynamic_src, tensor_copy_dynamic_dst engine selection.

  • nki.isa.dma_copy additional support with dge_mode, oob_mode, and in-place add rmw_op.

  • +=, -=, /=, *= operators now work consistently across loop types, PSUM, and SBUF

  • fixed simulation for instructions: nki.language.rand, random_seed, nki.isa.dropout

  • fixed simulation masking behavior

  • Added warning when the block dimension is used for SBUF and PSUM tensors, see: NKI Block Dimension Migration Guide


Neuron Kernel Interface (NKI) (Beta) [2.22] (Neuron 2.22.0 Release)#

Date: 04/03/2025

Improvements#

  • New modules and APIs:

    • nki.profile

    • nki.isa new APIs:

      • tensor_copy_dynamic_dst

      • tensor_copy_predicated

      • max8, nc_find_index8, nc_match_replace8

      • nc_stream_shuffle

    • nki.language new APIs: mod, fmod, reciprocal, broadcast_to, empty_like

  • Improvements:

    • nki.isa.nc_matmul now supports PE tiling feature

    • nki.isa.activation updated to support reduce operation and reduce commands

    • nki.isa.engine enum

    • engine parameter added to more nki.isa APIs that support engine selection (i.e., tensor_scalar, tensor_tensor, memset)

    • Documentation for nki.kernels has been moved to GitHub: https://aws-neuron.github.io/nki-samples. The source code can be viewed at aws-neuron/nki-samples.

      • These kernels are still shipped as part of the Neuron package in the neuronxcc.nki.kernels module

  • Documentation updates:


Neuron Kernel Interface (NKI) (Beta) [2.21] (Neuron 2.21.0 Release)#

Date: 12/16/2024

Improvements#

  • New modules and APIs:

    • nki.compiler module with Allocation Control and Kernel decorators, see guide for more info.

    • nki.isa: new APIs (activation_reduce, tensor_partition_reduce, scalar_tensor_tensor, tensor_scalar_reduce, tensor_copy, tensor_copy_dynamic_src, dma_copy), new activation functions (identity, silu, silu_dx), and target query APIs (nc_version, get_nc_version).

    • nki.language: new APIs (shared_identity_matrix, tan, silu, silu_dx, left_shift, right_shift, ds, spmd_dim, nc).

    • New datatype: float8_e5m2

    • New kernels (allocated_fused_self_attn_for_SD_small_head_size, allocated_fused_rms_norm_qkv) added, kernels moved to public repository.

  • Improvements:

    • Semantic analysis checks for nki.isa APIs to validate supported ops, dtypes, and tile shapes.

    • Standardized naming conventions with keyword arguments for common optional parameters.

    • Transition from function calls to kernel decorators (jit, benchmark, baremetal, simulate_kernel).

  • Documentation updates:


Neuron Kernel Interface (NKI) (Beta) (Neuron 2.20.1 Release)#

Date: 12/03/2024

Improvements#

  • NKI support for Trainium2, including full integration with Neuron Compiler. Users can directly shard NKI kernels across multiple Neuron Cores from an SPMD launch grid. See tutorial for more info. See Trainium2 Architecture Guide for an initial version of the architecture specification (more details to come in future releases).

  • New calling convention in NKI kernels, where kernel output tensors are explicitly returned from the kernel instead of pass-by-reference. See any NKI tutorial for code examples.
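The difference between the two calling conventions can be pictured with a plain-Python analogy. This is not NKI code: the function names are hypothetical and Python lists stand in for tensors, so it only illustrates who allocates the output and how the caller receives it.

```python
def add_by_reference(a, b, out):
    """Old style: the caller allocates out and the function fills it in place."""
    for i in range(len(a)):
        out[i] = a[i] + b[i]


def add_and_return(a, b):
    """New style: the function allocates its own output and returns it."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


a, b = [1.0, 2.0], [10.0, 20.0]

old_out = [0.0, 0.0]          # the caller must size the output correctly
add_by_reference(a, b, old_out)

new_out = add_and_return(a, b)
print(old_out == new_out)
```

With the returned-output style, the output's shape is decided in one place, inside the kernel, rather than being duplicated at every call site.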


Neuron Kernel Interface (NKI) (Beta) [2.20] (Neuron 2.20.0 Release)#

Date: 09/16/2024

Improvements#

  • This release includes the beta launch of the Neuron Kernel Interface (NKI). NKI is a programming interface that enables developers to build optimized compute kernels on top of Trainium and Inferentia. NKI empowers developers to enhance deep learning models with new capabilities, performance optimizations, and scientific innovation. It integrates natively with PyTorch and JAX, and provides a Python-based programming environment with Triton-like syntax and tile-level semantics that offers a familiar programming experience. Additionally, for bare-metal access with precise control over the instructions the chip executes, this release includes a set of NKI APIs (nki.isa) that directly emit Neuron Instruction Set Architecture (ISA) instructions in NKI kernels.
