This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Release Notes for Neuron Component: Neuron Kernel Interface (NKI)#
These are the release notes for the Neuron Kernel Interface (NKI) component. They describe the changes, improvements, and bug fixes in each release of the AWS Neuron SDK.
Neuron Kernel Interface (NKI) [0.4.0] (Neuron 2.30.0 Release)#
Date of Release: 05/21/2026
New Features#
nki.language.abs_max and nki.language.abs_min: New callable APIs for element-wise absolute maximum and absolute minimum. These are trn3 only and run on the Vector Engine. Also usable as op0 with nki.isa.tensor_scalar and nki.isa.tensor_scalar_reduce. See nki.isa.tensor_scalar and nki.isa.tensor_scalar_reduce.
nki.isa.activate2: New trn3 Scalar Engine API that applies an activation function to the result of a two-stage tensor-scalar preprocessing pipeline (data op0 imm0) op1 imm1, with an optional reduction, all in a single instruction. Supports six (op0, op1) combinations (scale+bias, scale-only, bias-only, etc.) and optional operand reversal for non-commutative operations. Reduces instruction count compared to chaining nisa.tensor_scalar with nisa.activation. See nki.isa.activate2.
New opcodes for nki.isa.tensor_scalar and tensor_scalar_reduce: square and relu are now accepted as op0 on trn3. See nki.isa.tensor_scalar and nki.isa.tensor_scalar_reduce.
New activation and arithmetic opcodes in nki.language: nl.prelu (parametric ReLU, used as op=nl.prelu with nki.isa.activate2) and nl.bypass (pass-through op for nki.isa.activate2). See the supported activation functions and arithmetic operator tables in nki.api.shared.
tile_size bytes-aware constants: New properties on nki.language.tile_size expose SBUF and PSUM capacity in both elements and bytes:
tile_size.sbuf_size_bytes: total SBUF capacity across all 128 partitions, in bytes
tile_size.sbuf_fmax: per-partition usable SBUF free dimension in FP32 elements
tile_size.sbuf_fmax_bytes: per-partition usable SBUF free dimension in bytes
tile_size.psum_bank_fmax: PSUM bank capacity in FP32 elements
tile_size.psum_bank_fmax_bytes: PSUM bank capacity in bytes
See nki.language.tile_size.
nki.isa.dma_compute oob_mode parameter: dma_compute now accepts an oob_mode parameter (oob_mode.error or oob_mode.skip) to control handling of out-of-bounds indices in indirect gather/scatter operations with vector_offset, mirroring existing dma_copy behavior. Validation ensures oob_mode.skip is used only with indirect indexing. See nki.isa.dma_compute.
nc_matmul float8_e4m3fn input dtype: On trn3, nc_matmul now accepts float8_e4m3fn (OCP FP8) as an input dtype, distinct from the legacy float8_e4m3. A new validation prevents mixing legacy float8_e4m3 with OCP float8_e4m3fn operands in the same matmul. See nki.isa.nc_matmul.
JAX dtype support: JAX scalar dtype types (jnp.bfloat16, jnp.float16, jnp.float32, etc.) are now accepted as kernel arguments and keyword arguments and automatically converted to the equivalent NKI dtype. Unsupported JAX dtypes raise TypeError. See NKI data types.
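The operand order of the nki.isa.activate2 pipeline listed above, (data op0 imm0) op1 imm1 followed by an activation, can be made concrete with a plain-Python model. This is only a sketch of the arithmetic and not the NKI API: the function name activate2_reference, the reverse0 and reverse1 flags, and the use of Python operators are assumptions made for illustration, and the optional reduction is left out.

```python
import operator


def relu(x):
    """Scalar ReLU used as the example activation."""
    return x if x > 0.0 else 0.0


def activate2_reference(data, op0, imm0, op1, imm1, act=relu,
                        reverse0=False, reverse1=False):
    """Hypothetical reference model: act((data op0 imm0) op1 imm1).

    reverse0 / reverse1 swap the operands of the corresponding stage,
    which only matters for non-commutative operators such as subtract.
    """
    out = []
    for x in data:
        stage0 = op0(imm0, x) if reverse0 else op0(x, imm0)
        stage1 = op1(imm1, stage0) if reverse1 else op1(stage0, imm1)
        out.append(act(stage1))
    return out


# scale + bias: relu(x * 2.0 + 1.0)
scaled_biased = activate2_reference(
    [-2.0, 0.5, 3.0], operator.mul, 2.0, operator.add, 1.0)

# reversed first stage: relu((4.0 - x) + 0.0)
reversed_sub = activate2_reference(
    [1.0, 5.0], operator.sub, 4.0, operator.add, 0.0, reverse0=True)
```

On hardware the whole expression is a single instruction; the loop here exists only to show which operand each immediate is paired with.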
Improvements#
nki.isa.sendrecv: Removed the restriction that the src and dst partition dimension must be a multiple of 16. Note: sendrecv is an intra-LNC communication API and is only supported when running on LNC2 (trn2 or later). See nki.isa.sendrecv.
nki.isa.dma_transpose with indirect indexing: Relaxed the src innermost dimension constraint from exactly 128 to <= 128 when src uses an indirect access pattern (vector_offset). See nki.isa.dma_transpose.
nki.simulate default accuracy improved: NKI_PRECISE_FP=1 is now the default for CPU simulation. Low-precision dtypes (bfloat16, float8) are now modeled accurately instead of being approximated with float32, producing simulator results closer to hardware. Set NKI_PRECISE_FP=0 to restore the previous behavior. See nki.simulate.
NKI_SIMULATOR=1 environment variable: Setting NKI_SIMULATOR=1 now works with torch.Tensor inputs directly; no manual conversion to NumPy arrays is required. See nki.simulate.
Improved error messages for nested NKI calls: Kernel compilation errors now show the full Python call stack instead of only the innermost frame, making it easier to locate the call site that triggered the error.
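To see why approximating bfloat16 with float32 can hide precision loss, the rounding step can be reproduced with the standard library alone. The sketch below is an illustration of bfloat16 round-to-nearest-even on finite values; the helper name round_to_bfloat16 is hypothetical, and nothing here describes how the NKI simulator is implemented.

```python
import struct


def round_to_bfloat16(value):
    """Round a finite Python float to the nearest bfloat16 value.

    bfloat16 keeps the top 16 bits of the float32 encoding, so the low
    16 bits are rounded away (ties to even). NaN, infinity and overflow
    are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_float32(value):
    """Round a Python float to float32 for comparison."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


as_float32 = to_float32(1.001)          # keeps the small increment
as_bfloat16 = round_to_bfloat16(1.001)  # increment is below bfloat16 resolution
```

A kernel that adds small updates to values near 1.0 would therefore look correct when low-precision tiles are modeled as float32 and lose the updates when bfloat16 is modeled faithfully.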
Deprecated and Removed APIs#
nki.isa.tensor_copy_dynamic_src / nki.isa.tensor_copy_dynamic_dst: Removed. Use nisa.tensor_copy() with .ap() and scalar_offset instead. See nki.isa.tensor_copy.
nki.language.tile_size.total_available_sbuf_size: Deprecated. Despite the name, this attribute returns the usable SBUF free dimension per partition, not total SBUF capacity. Use tile_size.sbuf_size_bytes for total SBUF capacity across all partitions, or tile_size.sbuf_fmax_bytes for the per-partition size. The deprecated attribute continues to work and returns the same value as before. See nki.language.tile_size.
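The difference between the per-partition and total quantities is simple arithmetic, sketched below in plain Python. The 192 KiB per-partition figure is an illustrative placeholder rather than a value read from any target, the helper describe_sbuf is hypothetical, and the sketch assumes the total is exactly 128 times the usable per-partition size, which real hardware may not satisfy if some bytes are reserved.

```python
NUM_PARTITIONS = 128  # SBUF partition count stated in the release note
FP32_BYTES = 4        # size of one FP32 element


def describe_sbuf(per_partition_bytes):
    """Hypothetical helper relating the three SBUF quantities."""
    return {
        # per-partition usable free dimension, in bytes
        "sbuf_fmax_bytes": per_partition_bytes,
        # the same quantity expressed in FP32 elements
        "sbuf_fmax": per_partition_bytes // FP32_BYTES,
        # total across all partitions (assumes nothing is reserved)
        "sbuf_size_bytes": per_partition_bytes * NUM_PARTITIONS,
    }


sizes = describe_sbuf(192 * 1024)  # placeholder per-partition size
```

Code that previously sized buffers from total_available_sbuf_size was working with the per-partition number, so the replacement is sbuf_fmax_bytes unless a whole-SBUF budget is actually intended.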
Breaking Changes#
nisa.dma_transpose: Now enforces that dst.shape matches the transposed src.shape exactly, including rank. Previously, a lower-rank dst.shape was silently padded to match a higher-rank src.shape (e.g., a 3D dst against a 4D src). The compiler now raises an assertion error if the ranks differ. To migrate: either match the dst rank to the src rank (e.g., use dst.shape=(128, 1, 1, 4096) for a 4D src), or use a src and axes of the same rank as the intended dst (e.g., a 3D src with axes=(2, 1, 0) instead of a 4D src with axes=(3, 1, 2, 0)). See nki.isa.dma_transpose.
neuronxcc.nki.* namespace: Usage of the deprecated neuronxcc.nki.* namespace inside NKI kernels now raises a compilation error instead of a warning. To migrate, follow the NKI Beta 2 Migration Guide.
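The stricter dma_transpose rule can be checked ahead of time with a small shape calculation. The sketch below is plain Python, not the compiler's implementation: the helper names are hypothetical, and it assumes the NumPy-style convention in which output dimension i takes its size from src dimension axes[i]. The src shape (4096, 1, 1, 128) is chosen so that it transposes to the (128, 1, 1, 4096) shape used in the migration note.

```python
def transposed_shape(src_shape, axes):
    """Shape after permuting src_shape by axes (NumPy-style convention)."""
    if sorted(axes) != list(range(len(src_shape))):
        raise ValueError(f"axes {axes} is not a permutation for rank {len(src_shape)}")
    return tuple(src_shape[a] for a in axes)


def check_dst_shape(src_shape, axes, dst_shape):
    """Mimic the new rule: dst must equal the transposed src, rank included."""
    expected = transposed_shape(src_shape, axes)
    if tuple(dst_shape) != expected:
        raise AssertionError(f"dst shape {tuple(dst_shape)} != expected {expected}")
    return expected


# Option 1: keep the 4D src and give dst the full 4D shape.
ok_4d = check_dst_shape((4096, 1, 1, 128), (3, 1, 2, 0), (128, 1, 1, 4096))

# Option 2: use a 3D src and 3D axes for a 3D dst.
ok_3d = check_dst_shape((4096, 1, 128), (2, 1, 0), (128, 1, 4096))
```

Passing the old combination, a 4D src with a 3D dst such as (128, 1, 4096), makes check_dst_shape raise, which is the case that used to be padded silently.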
Bug Fixes#
nki.simulate correctness fixes: The CPU simulator was corrected in several areas to match hardware behavior more closely:
nki.isa.dma_copy with oob_mode=oob_mode.skip no longer casts integer tiles to float32 in the OOB skip path; integer bit patterns are now preserved.
nki.isa.nc_stream_shuffle with mask=255 now preserves existing destination data instead of zeroing it.
nki.isa.local_gather now produces correct results, including when the destination uses a sub-view or .ap() access pattern.
nki.isa.nc_matmul with 3D+ operand shapes now copies all elements (previously higher dimensions were silently dropped, producing zero-filled results).
nki.isa.quantize_mx with float8_e4m3fn_x4 output now simulates correctly.
nki.isa.iota and nki.isa.affine_select now handle dynamic register offsets correctly.
MX x4 packed dtypes (float8_e4m3fn_x4, float8_e5m2_x4) now simulate correctly when targeting trn3.
nki.isa.dma_compute no longer applies a fictional additive scale mode that the hardware does not support.
nl.logical_and in CPU simulation now produces correct results.
See nki.simulate.
Fixed NKIObject subclasses decorated with @dataclass(frozen=True) failing to instantiate with FrozenInstanceError. Frozen NKI object subclasses can now be constructed normally.
Fixed nki.jit cache misses when NKIObject subclasses are decorated with @dataclass. @dataclass removes __hash__ by default, which prevented the cache key from being computed. NKIObject dataclass subclasses are now handled with a consistent cache key regardless of hashability. See nki.jit.
Fixed bool output tensors being returned as uint8 from PyTorch Native kernels. Bool dtype is now preserved end-to-end so output tensors are returned as torch.bool instead of torch.uint8.
Fixed hardware race conditions in dynamic loop kernels when loop-body memory accesses overlapped with pre-loop accesses. Cross-scope memory dependencies are now tracked correctly across loop boundaries.
Fixed silent, undefined behavior when writing to the induction variable of an nl.dynamic_range loop. Writing to the induction variable (e.g., to try to break out of the loop early) had no effect but did not surface any error. Such writes now raise AssertionError at trace time. Additionally, the induction variable is now a VirtualRegister (previously a bare scalar), so it can be used as a scalar_offset in access patterns (e.g., nisa.dma_copy with a per-iteration dynamic offset), resolving a 0.3.0 known issue. See nki.language.dynamic_range.
Fixed nki.isa.nc_transpose silently ignoring the engine argument. An explicit engine=engine.vector or engine=engine.tensor is now honored; engine=engine.unknown (the default) continues to auto-select based on the destination buffer. See nki.isa.nc_transpose.
Fixed nki.jit kernel cache missing when a kernel is invoked with None arguments. Previously this triggered unnecessary recompilation on every such call. See nki.jit.
Fixed nl.device_print failing verification on 1-D HBM tensors (e.g., after linearization). device_print now works with any tensor rank. See nki.language.device_print.
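The @dataclass hashing behavior behind the nki.jit cache-miss fix is standard Python and can be reproduced without NKI. In the sketch below the class names are hypothetical stand-ins for NKIObject subclasses; it shows only the language behavior, not how NKI now builds its cache key.

```python
from dataclasses import dataclass


@dataclass
class MutableConfig:
    """Plain @dataclass: eq=True with frozen=False sets __hash__ to None."""
    tile: int


@dataclass(frozen=True)
class FrozenConfig:
    """Frozen dataclass: a field-based __hash__ is generated."""
    tile: int


def is_hashable(obj):
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


mutable_ok = is_hashable(MutableConfig(128))  # unhashable by default
frozen_ok = is_hashable(FrozenConfig(128))    # hashable, equal fields hash equally
```

Any cache that keys on hash(argument) therefore cannot use a plain dataclass instance directly, which is consistent with the failure described in the fix above.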
Known Issues#
Control Flow
Nested nl.dynamic_range loops with loop-carried values fail to compile with a “Could not find register” error. Workaround: restructure to avoid nested dynamic loops, or use nl.static_range / nl.affine_range for the outer loop when the trip count is known at compile time.
CPU Simulator
The CPU simulator has additional known limitations beyond those listed here. See the Simulation Limitations section of the simulator guide for the full list.
NKI Language (experimental)
The nki.language APIs are convenience wrappers around nki.isa instructions. They are experimental and have the following known limitations:
nki.language.divide is not supported: Division is not available as a hardware instruction. As a workaround, multiply by the reciprocal: nl.multiply(x, nl.reciprocal(y)).
nki.language.fmod and nki.language.mod are not supported: Modulo operations are not available as hardware instructions. These APIs work in simulation but fail when compiled for Trainium hardware.
nki.language.power does not support scalar exponents: nl.power(tile, scalar) is not supported. Use nl.power(tile, tile) instead, where both operands are tiles.
Binary operations do not support broadcasting: Operations like nl.add(a, b) require both operands to have the same shape. Broadcasting (e.g., adding a (128, 1) tile to a (128, 512) tile) is not yet supported.
nki.language.random_seed requires a tensor, not a scalar: Pass a [1, 1] tensor on SBUF instead of a Python integer. For example: nl.random_seed(nl.full((1, 1), 42, dtype=nl.int32, buffer=nl.sbuf)).
nki.language.rand and nki.language.random_seed engine behavior: On trn3, rand uses nisa.rand2 on the Vector Engine. On earlier targets, rand uses nisa.rng, which may run on a different engine than random_seed, potentially causing random_seed to have no effect on rand output.
nki.language.matmul without transpose_x=True is not supported: Calling nl.matmul(x, y) without setting transpose_x=True will fail. As a workaround, always use nl.matmul(x, y, transpose_x=True) and pre-arrange data accordingly.
nki.language.copy uses lossy FP32 casting: nl.copy uses the Scalar Engine, which internally casts through float32. This is lossy for integer types with values exceeding float32 precision (e.g., int32 values greater than 2^23). Additionally, cross-buffer copies (e.g., PSUM to SBUF) are not supported.
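The effect of casting an integer through float32, as described for nl.copy above, can be observed with the standard library. This sketch only illustrates IEEE 754 float32 rounding of integers; the helper through_float32 is hypothetical and does not model the Scalar Engine. It uses 2**24 + 1, the smallest positive integer that float32 cannot represent exactly.

```python
import struct


def through_float32(value):
    """Round-trip an integer through an IEEE 754 float32 encoding."""
    as_f32 = struct.unpack("<f", struct.pack("<f", float(value)))[0]
    return int(as_f32)


exact = through_float32(2**24)      # representable, survives the round trip
lossy = through_float32(2**24 + 1)  # rounds to the nearest float32 value
```

For integer tiles that may hold large values, a quick host-side check like this on representative data shows whether a copy through float32 would change them.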
Neuron Kernel Interface (NKI) [0.3.0] (Neuron 2.29.0 Release)#
Date of Release: 04/09/2026
AWS Neuron SDK 2.29.0 introduces NKI 0.3.0, a significant update to the Neuron Kernel Interface for General Availability. NKI 0.3.0 features NKI Standard Library (nki-stdlib), which provides developer-visible code for all NKI APIs and native language objects (e.g., NkiTensor). This release provides new exposed Trainium capabilities and features in the NKI API and introduces nki.language APIs. NKI 0.3.0 includes a CPU Simulator, which executes NKI kernels entirely on CPU using NumPy — enabling developers to validate kernel logic on laptops and CI environments without Trainium hardware. NKI 0.3.0 also includes the nki.typing module for declaring expected tensor shapes, a dedicated nki.isa.exponential instruction optimized for Softmax computation, matmul accumulation control, explicit memory address placement, and variable-length all-to-all collectives via nki.collectives.all_to_all_v. NKI 0.3.0 includes several API breaking changes that improve correctness and consistency along with updated documentation.
For the full list of changes and update examples, see the NKI 0.3.0 Update Guide.
New Features#
NKI Standard Library (nki-stdlib): NKI 0.3.0 ships with the NKI Standard Library (nki-stdlib), which provides developer-visible code for all NKI APIs and native language objects (e.g., NkiTensor).
NKI CPU Simulator (Experimental): Executes NKI kernels entirely on CPU using NumPy, enabling local development, debugging, and functional correctness testing without Trainium hardware. Set the environment variable NKI_SIMULATOR=1 to run existing kernels without code changes, or wrap the kernel call with nki.simulate(kernel). See nki.simulate API Reference.
nki.language APIs (Experimental): Introduces nki.language APIs as convenience wrappers around nki.isa APIs, including nl.load, nl.store, nl.copy, nl.matmul, nl.transpose, nl.softmax, and other high-level operations. See nki.language API Reference.
nki.typing module: New module for type-annotating kernel tensor parameters. Use nt.tensor[shape] to declare expected tensor shapes.
nki.isa.exponential: Dedicated exponential instruction with max subtraction, faster than nisa.activation(op=nl.exp) and useful for Softmax calculation. Trn3 (NeuronCore-v4) only. See nki.isa.exponential.
nki.collectives.all_to_all_v: Variable-length all-to-all collective. Unlike all_to_all, uses a metadata tensor to specify per-rank send/recv counts. See nki.collectives API Reference.
Matmul accumulation: nc_matmul and nc_matmul_mx now have an accumulate parameter that controls whether the operation overwrites or accumulates on the destination PSUM tile. The default (accumulate=None) auto-detects, matching NKI 0.2.0 behavior. See nki.isa.nc_matmul.
Address placement: The address parameter was added to nki.language.ndarray for explicit memory placement. See nki.language.ndarray.
Deprecated and Removed APIs#
nki.isa.tensor_copy_dynamic_src / nki.isa.tensor_copy_dynamic_dst: Deprecated and scheduled for removal. Use nisa.tensor_copy() with .ap() and scalar_offset instead.
nki.jit(platform_target=...): Deprecated. Set the target platform via the NEURON_PLATFORM_TARGET_OVERRIDE environment variable instead. This is a breaking change.
nki.jit(mode=...): Deprecated and ignored. The NKI Compiler now auto-detects the framework from kernel arguments. This is a breaking change.
Breaking Changes#
Note
NKI 0.3.0 requires all NKI kernels in a model to be updated to NKI 0.3.0. Mixing NKI 0.3.0 and NKI 0.2.0 kernels in the same model is not supported. For models that have not yet been updated, continue using Neuron SDK 2.28.
nisa.dma_copy: No longer supports reading directly from PSUM. Copy the PSUM tensor to SBUF first using nisa.tensor_copy.
nisa.dma_copy: Enforces matching source and destination element types when using dge_mode=dge_mode.hwdge. Use .view() to reinterpret types.
nisa.dma_copy: dst_rmw_op and unique_indices parameters removed. Use nisa.dma_compute instead.
nisa.dma_compute: scales and reduce_op parameters swapped positions. scales is now optional. unique_indices parameter added. Update call sites to use the new parameter order: nisa.dma_compute(dst, srcs, reduce_op, scales=None, unique_indices=True).
nisa.memset: Enforces strict type matching between value and destination dtype. x4 packed types enforce value=0. Kernels that pass float values to integer-typed tensors (e.g., value=2.0 instead of value=2) will now raise an error at compile time.
nisa.sendrecv: use_gpsimd_dma replaced by dma_engine enum. Update existing kernels to use the new enum.
nisa.affine_select: offset moved from 3rd positional argument to keyword argument with default 0.
nisa.register_move: imm renamed to src, now accepts VirtualRegister. Update keyword argument from imm= to src=.
nki.collectives.collective_permute_implicit_current_processing_rank_id: num_channels parameter removed. Remove num_channels from call sites and pass channel_ids list to collective_permute_implicit() instead.
Output tensors must use buffer=nl.shared_hbm. Using nl.hbm causes compilation failures.
Raw integer enum constants no longer accepted. Use named enum members.
String buffer names no longer accepted. Use buffer objects (e.g., nl.sbuf).
Keyword-only argument separator (*) in kernel signatures is not supported.
is / is not operators are not supported. Use == / !=.
list kernel arguments are not supported. Convert to tuples.
For before-and-after code examples, see the NKI 0.3.0 Update Guide.
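For the last item in the list above, one way to adapt call sites that still build list arguments is to convert them on the host just before the kernel call. The helpers below are a hypothetical sketch in plain Python and are not part of NKI; they assume that nested lists should also become tuples.

```python
def freeze_lists(value):
    """Recursively convert lists (including lists nested in tuples) to tuples."""
    if isinstance(value, (list, tuple)):
        return tuple(freeze_lists(item) for item in value)
    return value


def freeze_kernel_args(*args, **kwargs):
    """Return (args, kwargs) with every list replaced by a tuple."""
    frozen_args = tuple(freeze_lists(arg) for arg in args)
    frozen_kwargs = {name: freeze_lists(val) for name, val in kwargs.items()}
    return frozen_args, frozen_kwargs


# Hypothetical call site: a shape and an axes argument given as lists.
args, kwargs = freeze_kernel_args([128, 512], scale=2.0, axes=[0, [1, 2]])
# A kernel would then be invoked as kernel(*args, **kwargs).
```

Converting at the boundary keeps the kernel signature unchanged while the surrounding code is migrated gradually.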
Note
The previously announced removal of the neuronxcc.nki.* namespace has been postponed to a future release. Both the neuronxcc.nki.* and nki.* namespaces continue to be supported in this release.
Other Changes#
nki.isa.dma_engine alias repurposed as the dma_engine enum for DMA transfer engine selection.
nki.isa.iota: offset now optional with default 0.
nki.isa.core_barrier: engine default changed from unknown to gpsimd (no behavioral change).
nki.language.num_programs: axes default changed from None to 0.
nki.language.program_id: axis now defaults to 0.
nki.language.ndarray: buffer default changed from None to nl.sbuf.
nki.language.zeros: buffer default changed from None to nl.sbuf.
nki.language.sequential_range: stop and step now have default values (None and 1).
Bug Fixes#
Fixed incorrect axis handling in nisa.tensor_reduce. NKI 0.2.0 incorrectly allowed axis=1 to refer to the last free dimension even for 3D/4D tensors. NKI 0.3.0 corrects this so that axis values correspond to the actual tensor dimensions.
Fixed nisa.range_select silently overriding user-specified parameters. The on_false_value and reduce_cmd parameters were incorrectly ignored by the compiler: on_false_value was always set to -3.4028235e+38 and reduce_cmd was always set to reset_reduce, regardless of the values passed in. NKI 0.3.0 honors the reduce_cmd parameter and documents the FP32_MIN hardware constraint for on_false_value.
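The constant -3.4028235e+38 quoted above is the most negative finite float32 value, which the note calls FP32_MIN. The standard-library sketch below derives it from the float32 format and inspects its bit pattern; it illustrates the number only and says nothing about how the hardware applies it.

```python
import struct

# Largest finite float32 magnitude is (2 - 2**-23) * 2**127; negate it.
FP32_MIN = -(2.0 - 2.0 ** -23) * 2.0 ** 127

# Bit pattern of FP32_MIN when encoded as float32.
fp32_min_bits = struct.unpack("<I", struct.pack("<f", FP32_MIN))[0]

# The decimal literal from the release note rounds to the same float32 value.
from_literal = struct.unpack("<f", struct.pack("<f", -3.4028235e+38))[0]
```

Because every finite float32 value is greater than or equal to FP32_MIN, it behaves as a neutral element for a max reduction over finite inputs.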
Known Issues#
Math Operations
nki.language.divide is not supported: Division is not available as a hardware instruction. As a workaround, multiply by the reciprocal: nl.multiply(x, nl.reciprocal(y)).
nki.language.fmod and nki.language.mod are not supported: Modulo operations are not available as hardware instructions. These APIs work in simulation but will fail when compiled for Trainium hardware.
nki.language.power does not support scalar exponents: nl.power(tile, scalar) is not supported. Use nl.power(tile, tile) instead, where both operands are tiles.
Broadcasting
Binary operations do not support broadcasting: Operations like nl.add(a, b) require both operands to have the same shape. Broadcasting (e.g., adding a (128, 1) tile to a (128, 512) tile) is not yet supported.
Random Number Generation
nki.language.random_seed requires a tensor, not a scalar: Pass a [1, 1] tensor on SBUF instead of a Python integer. For example: nl.random_seed(nl.full((1, 1), 42, dtype=nl.int32, buffer=nl.sbuf)).
nki.language.rand and nki.language.random_seed engine behavior: On NeuronCore-v4+ (Trn3+), rand uses nisa.rand2 on the Vector Engine. On earlier NeuronCores, rand uses nisa.rng, which may run on a different engine than random_seed, potentially causing random_seed to have no effect on rand output.
Matrix Operations
nki.language.matmul without transpose_x=True is not supported: Calling nl.matmul(x, y) without setting transpose_x=True will fail. As a workaround, always use nl.matmul(x, y, transpose_x=True) and pre-arrange data accordingly.
Data Movement
nki.language.store does not support PSUM tiles directly: Storing a tile that resides in PSUM requires manually copying it to SBUF first using nisa.tensor_copy.
nki.language.copy uses lossy FP32 casting: nl.copy uses the Scalar Engine, which internally casts through FP32. This is lossy for integer types with values exceeding FP32 precision (e.g., int32 values > 2^23). Additionally, cross-buffer copies (e.g., PSUM to SBUF) are not supported.
Control Flow
nki.language.dynamic_range loop variable cannot be used in index arithmetic: The induction variable of a dynamic_range loop is a scalar, not a register. It cannot be used as a scalar_offset in access patterns or in arithmetic expressions for computing tile offsets. Use nl.affine_range or nl.static_range if you need to compute offsets from the loop variable.
Multi-Core (LNC2)
LNC2 requires identical control flow across cores — When running with Logical NeuronCore 2 (LNC2), the NKI compiler expects each physical NeuronCore to execute identical control flow. Programs with dynamic control flow that differs across cores may deadlock or produce incorrect results. This constraint is not enforced at compile time.
Caching
NKI kernel caching assumes kernels are pure functions of their input arguments. If a kernel’s output depends on external state (such as global variables or closures over mutable objects), the cache may return stale results. This is undefined behavior. Always ensure kernel outputs are determined solely by kernel arguments.
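The stale-result hazard can be reproduced with any argument-keyed cache. The sketch below uses functools.lru_cache from the standard library as a stand-in; it is an analogy for the behavior described above, not the nki.jit cache itself, and all function names are hypothetical.

```python
import functools


def demonstrate_stale_cache():
    """Show a cached function returning a stale value after hidden state changes."""
    state = {"scale": 2.0}  # mutable object captured by a closure

    @functools.lru_cache(maxsize=None)
    def impure_kernel(x):
        # Result depends on state that is not part of the cache key.
        return x * state["scale"]

    @functools.lru_cache(maxsize=None)
    def pure_kernel(x, scale):
        # Everything that affects the result is an argument.
        return x * scale

    first = impure_kernel(3.0)
    state["scale"] = 10.0
    stale = impure_kernel(3.0)                # cache hit: old scale is baked in
    fresh = pure_kernel(3.0, state["scale"])  # new scale forms a new cache key
    return first, stale, fresh


first, stale, fresh = demonstrate_stale_cache()
```

Passing the changing value as an argument, as pure_kernel does, is the pattern that keeps a cache keyed on arguments correct.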
Compiler
Address rotation cannot be disabled — Address rotation, a backend compiler optimization that rotates tensor addresses for improved memory utilization, is enabled by default and cannot be opted out of in this release.
Collectives
nki.collectives.all_to_all_v(): has_rdispls=True has no effect on NeuronSwitch-based architectures (e.g., Trainium3 UltraServer); the receive layout is the same as has_rdispls=False.
Neuron Kernel Interface (NKI) (0.2.0) [2.28] (Neuron 2.28.0 Release)#
Date of Release: 02/26/2026
New Features#
LNC (Large Neuron Core) multi-core support:
Shared buffers and canonical outputs: The compiler now tracks shared_hbm tensors declared in kernels and canonicalizes LNC kernel outputs into a consistent form. This is foundational infrastructure for multi-core kernel compilation. See LNC Overview.
Private HBM tensors: Users can declare tensors private to a single NeuronCore using the private_hbm memory type, distinct from regular and shared HBM.
Intra-LNC collectives: New ISA instruction types for multi-core collective operations such as cross-core reductions and broadcasts. See full API listing under nki.collectives below.
New nki.isa APIs:
nki.isa.nonzero_with_count: returns nonzero element indices and their count, useful for sparse computation and dynamic masking
nki.isa.exponential: computes element-wise exponential on tensors. See nki.isa.activation.
New nki.collectives module, enabling collective communication across multiple NeuronCores directly from NKI kernels:
New dtypes:
nki.language.float8_e4m3fn: for FP8 inference and training workloads
New NKI language features:
no_reorder blocks: use with no_reorder(): ... to prevent the compiler from reordering instructions within a block, for kernels where instruction ordering affects correctness
__call__ special method support: callable objects (classes with __call__) can now be used as functions within NKI kernels
tensor.view method: tensors now support .view() for reshaping
nl.shared_constant can now be passed to kernels as string arguments, not just tensor objects
Improvements#
Updated nki.isa APIs:
nki.isa.dma_transpose now supports indirect addressing
nki.isa.dma_copy now supports unique_indices parameter
nki.isa.register_alloc now accepts an optional tensor argument to pre-fill the allocated register with initial values
Compiler output improvements:
The compiler no longer truncates diagnostic output; users now receive the full set of warnings and errors
Breaking Changes#
nki.isa.nc_matmul parameter psumAccumulateFlag has been removed. This parameter had no effect on compilation or execution. Simply remove it from your kernel code.
nki.isa.nc_matmul parameter is_moving_zero has been renamed to is_moving_onezero to match hardware semantics, consistent with the companion is_stationary_onezero parameter. Kernels that passed is_moving_zero by name should update to is_moving_onezero.
nki.tensor has moved to nki.meta.tensor. Users should update their imports accordingly.
Note
The previously announced removal of the neuronxcc.nki.* namespace has
been postponed from Neuron 2.28 to Neuron 2.29. Both the neuronxcc.nki.*
and nki.* namespaces continue to be supported in this release. We
encourage customers to migrate to the nki.* namespace using the
NKI 0.2.0 Migration Guide.
Bug Fixes#
Fixed incorrect default value for on_false_value in nki.isa.range_select. The default was 0.0 instead of negative infinity (-inf). This caused range_select to write zeros for out-of-range elements instead of the expected negative-infinity sentinel, which could produce incorrect results in downstream reductions (e.g., max-pooling or top-k). See nki.isa.range_select.
Fixed default value parsing for keyword-only arguments in NKI kernels. When a Python function used keyword-only arguments with default values (arguments after * in the signature), the NKI compiler did not associate the defaults with their corresponding parameter names. This caused keyword-only arguments to appear as required even when they had defaults, leading to “missing argument” errors during kernel compilation.
Fixed wrong default for reduce_cmd in nki.isa.activation. The default was incorrectly set to ZeroAccumulate instead of Idle, causing the accumulator to be zeroed before every activation call even when no reduction was requested.
Fixed missing ALU operators (rsqrt, abs, power) in nki.isa.tensor_scalar and nki.isa.tensor_tensor. Passing these operators previously raised an “unsupported operator” error. See NKI Language Guide.
Fixed float8_e4m3fn to float8_e4m3 conversion for kernel inputs and outputs. When a tensor with dtype float8_e4m3fn was passed to the compiler, the automatic conversion to float8_e4m3 could fail with a size-check error. The conversion now validates sizes correctly before casting. See nki.language.float8_e4m3.
Fixed dynamic for loop incorrectly incrementing the loop induction variable. In loops with a runtime-determined trip count (sequential_range with non-constant bounds), the compiler generated incorrect increment code, causing the loop counter to never advance and the loop to run indefinitely or produce incorrect iteration values. See nki.language.sequential_range.
Fixed reshape of shared_hbm and private_hbm tensors failing partition size check. Reshape only recognized plain hbm memory as exempt from partition-dimension size validation. Tensors allocated in shared_hbm or private_hbm (used for cross-kernel and kernel-private storage) incorrectly triggered a “partition size mismatch” error when reshaped. See nki.language.shared_hbm and nki.language.private_hbm.
Fixed bias shape checking in nki.isa.activation. The bias parameter was not validated for shape correctness. A bias tensor with a free dimension other than 1 (e.g., shape (128, 64) instead of (128, 1)) was accepted without validation, which could produce incorrect results. The compiler now raises an error if the bias free dimension is not 1.
Fixed incorrect line numbers in stack traces and error reporting. An off-by-one error in the line offset calculation caused all reported line numbers to be shifted by one. Additionally, error location was sometimes lost when errors propagated across file boundaries.
Fixed invalid keyword arguments being silently ignored instead of raising an error. When calling an NKI API with a misspelled or unsupported keyword argument, the argument was ignored without warning. The compiler now validates all keyword argument names against the function signature and raises an unexpected keyword argument error for unrecognized names.
Fixed nki.jit in auto-detection mode returning an uncalled kernel object instead of executing the kernel. When nki.jit was used without specifying a framework mode (e.g., @nki.jit with no mode argument), the auto-detection path constructed the appropriate framework-specific kernel object but returned it without calling it. The user received a kernel object instead of the computed result, requiring an extra manual invocation. See nki.jit.
Fixed stale kernel object state between trace invocations. When tracing the same kernel multiple times (e.g., with different input shapes), compiler state was not fully reset between invocations, causing name collisions and incorrect results. The trace state is now fully reset before each invocation.
Improved ‘removed during code migration’ error messages with clear descriptions of unimplemented features. APIs not available in this release (nki.baremetal, nki.benchmark, nki.profile, nki.simulate_kernel) previously raised a generic NotImplementedError("removed during code migration") message. Each now raises a specific message naming the unsupported API. Additionally, calling an nki.jit kernel with no arguments now raises a clear error instead. See NKI 0.2.0 Migration Guide.
Fixed nested nki_jit decorators not being allowed. The NKI compiler only recognized @nki.jit-decorated functions when they were plain function objects. Nested decorators (e.g., @my_wrapper @nki.jit) wrapped the function in a non-function object, causing the compiler to skip it. The compiler now correctly unwraps decorator chains to find the underlying kernel function. See nki.jit.
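The first fix in this list concerns the sentinel written for out-of-range elements, and its effect on a downstream max reduction can be shown in a few lines of plain Python. The function masked_max below is a hypothetical stand-in for a select followed by a max; it is not the range_select API.

```python
def masked_max(values, in_range, on_false_value):
    """Replace out-of-range entries with a sentinel, then take the maximum."""
    selected = [v if keep else on_false_value
                for v, keep in zip(values, in_range)]
    return max(selected)


values = [-5.0, -2.0, -9.0]
in_range = [True, True, False]

# Old default: the 0.0 sentinel wins over every (negative) in-range value.
wrong = masked_max(values, in_range, on_false_value=0.0)

# Corrected default: -inf can never win, so the true maximum is returned.
right = masked_max(values, in_range, on_false_value=float("-inf"))
```

The error only appears when all in-range values are negative, which is why a zero sentinel can go unnoticed in tests that use non-negative data.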
Known Issues#
nki.isa.range_select: The on_false_value and reduce_cmd parameters are incorrectly ignored by the NKI compiler. The on_false_value is always set to (-3.4028235e+38) and reduce_cmd is always set to reduce_cmd.reset_reduce, regardless of the values passed in.
Neuron Kernel Interface (NKI) (0.1.0) [2.27] (Neuron 2.27.0 Release)#
Date: 12/25/2025
Improvements#
new nki.language APIs:
nki.language.device_print
new nki.isa APIs:
nki.isa.dma_compute
nki.isa.quantize_mx
nki.isa.nc_matmul
nki.isa.nc_n_gather [used to be nl.gather_flattened with free partition limited to 512]
nki.isa.rand2
nki.isa.rand_set_state
nki.isa.rand_get_state
nki.isa.set_rng_seed
nki.isa.rng
new dtypes:
nki.language.float8_e5m2_x4
nki.language.float4_e2m1fn_x4
nki.language.float8_e4m3fn_x4
changes to existing APIs:
several nki.language APIs have been removed in NKI 0.2.0
all nki.isa APIs have dst as an input param
all nki.isa APIs removed dtype and mask support
nki.isa.memset: removed shape positional arg, since we have dst
nki.isa.affine_select: instead of pred, we now take pattern and cmp_op params
nki.isa.iota: expr replaced with pattern and offset
nki.isa.nc_stream_shuffle: src and dst order changed
docs improvements:
restructured NKI Documentation to align with workflows
added Get Started with NKI
added NKI Language Guide
added About the NKI Compiler
updated Matrix Multiplication Tutorial
updated Profile a NKI Kernel
updated NKI APIs
updated NKI Library docs
removed NKI Error Guide
Known Issues#
nki.isa.nc_matmul: is_moving_onezero was incorrectly named is_moving_zero in this release
NKI ISA semantic checks are not available with NKI 0.2.0; as a workaround, consult the API docs
NKI Collectives are not available with NKI 0.2.0
nki.benchmarkandnki.profileare not available with NKI 0.2.0
Neuron Kernel Interface (NKI) (Beta) [2.26] (Neuron 2.26.0 Release)#
Date: 09/18/2025
Improvements#
new nki.language APIs:
nki.language.gelu_apprx_sigmoid - Gaussian Error Linear Unit activation function with sigmoid approximation.
nki.language.tile_size.total_available_sbuf_size to get total available SBUF size
new nki.isa APIs:
nki.isa.select_reduce - selectively copy elements with max reduction
nki.isa.sequence_bounds - compute sequence bounds of segment IDs
nki.isa.dma_transpose:
axes param to define 4D transpose for some supported cases
dge_mode to specify Descriptor Generation Engine (DGE).
nl.gelu_apprx_sigmoid op support on nki.isa.activation
fixes / improvements:
nki.language.store supports the PSUM buffer, with an additional copy inserted.
docs/tutorial improvements:
nki.isa.dma_transpose API doc and example
nki.simulate_kernel example improvement
use nl.fp32.min in tutorial code instead of a magic number
better error reporting:
indirect indexing on transpose
mask expressions
Neuron Kernel Interface (NKI) (Beta) [2.24] (Neuron 2.24.0 Release)#
Date: 06/24/2025
Improvements#
sqrt valid data range extended for accuracy improvement with wider numerical values support.
nki.language.gather_flattened new API
nki.isa.nc_match_replace8 additional param dst_idx
improved docs/examples on nki.isa.nc_match_replace8, nki.isa.nc_stream_shuffle
improved error messages
Neuron Kernel Interface (NKI) (Beta) [2.23] (Neuron 2.23.0 Release)#
Date: 05/20/2025
Improvements#
nki.isa.range_select (for trn2) new instruction
abs, power ops supported on nki.isa tensor instructions
abs op supported on nki.isa.activation instruction
GpSIMD engine support added for add and multiply on 32-bit integers in nki.isa tensor operations
nki.isa.tensor_copy_predicated support for reversing predicate.
nki.isa.tensor_copy_dynamic_src, tensor_copy_dynamic_dst engine selection.
nki.isa.dma_copy additional support with dge_mode, oob_mode, and in-place add rmw_op.
+=, -=, /=, *= operators now work consistently across loop types, PSUM, and SBUF
fixed simulation for instructions: nki.language.rand, random_seed, nki.isa.dropout
fixed simulation masking behavior
Added warning when the block dimension is used for SBUF and PSUM tensors, see: NKI Block Dimension Migration Guide
Neuron Kernel Interface (NKI) (Beta) [2.22] (Neuron 2.22.0 Release)#
Date: 04/03/2025
Improvements#
New modules and APIs:
nki.profile
nki.isa new APIs: tensor_copy_dynamic_dst, tensor_copy_predicated, max8, nc_find_index8, nc_match_replace8, nc_stream_shuffle
nki.language new APIs: mod, fmod, reciprocal, broadcast_to, empty_like
Improvements:
nki.isa.nc_matmul now supports PE tiling feature
nki.isa.activation updated to support reduce operation and reduce commands
nki.isa.engine enum
engine parameter added to more nki.isa APIs that support engine selection (i.e., tensor_scalar, tensor_tensor, memset)
Documentation for nki.kernels has been moved to GitHub: https://aws-neuron.github.io/nki-samples. The source code can be viewed at aws-neuron/nki-samples.
These kernels are still shipped as part of the Neuron package in the neuronxcc.nki.kernels module
Documentation updates:
Kernels public repository https://aws-neuron.github.io/nki-samples
Updated profiling guide to use nki.profile instead of nki.benchmark
NKI ISA Activation functions table now has valid input data ranges listed
NKI ISA Supported Math operators now have supported engine listed
Clarify += syntax support/limitation
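Since this release adds both mod and fmod to nki.language, a short plain-Python sketch of the usual distinction between the two may help. It assumes the NKI operators follow the NumPy convention (mod takes the sign of the divisor, fmod takes the sign of the dividend); the helper names below are hypothetical and nothing here calls NKI.

```python
import math


def floor_mod(a: float, b: float) -> float:
    """Remainder with the sign of the divisor b (Python %, numpy.mod style)."""
    return a % b


def trunc_mod(a: float, b: float) -> float:
    """Remainder with the sign of the dividend a (C fmod, numpy.fmod style)."""
    return math.fmod(a, b)


if __name__ == "__main__":
    # The two conventions only differ when the operands have opposite signs.
    for a, b in ((7, 3), (-7, 3), (7, -3)):
        print(a, b, floor_mod(a, b), trunc_mod(a, b))
```

For operands with the same sign the two results agree, so the choice only matters for kernels that can see negative inputs.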
Neuron Kernel Interface (NKI) (Beta) [2.21] (Neuron 2.21.0 Release)#
Date: 12/16/2024
Improvements#
New modules and APIs:
nki.compiler module with Allocation Control and Kernel decorators, see guide for more info.
nki.isa: new APIs (activation_reduce, tensor_partition_reduce, scalar_tensor_tensor, tensor_scalar_reduce, tensor_copy, tensor_copy_dynamic_src, dma_copy), new activation functions (identity, silu, silu_dx), and target query APIs (nc_version, get_nc_version).
nki.language: new APIs (shared_identity_matrix, tan, silu, silu_dx, left_shift, right_shift, ds, spmd_dim, nc).
New datatype: float8_e5m2
New kernels (allocated_fused_self_attn_for_SD_small_head_size, allocated_fused_rms_norm_qkv) added, kernels moved to public repository.
Improvements:
Semantic analysis checks for nki.isa APIs to validate supported ops, dtypes, and tile shapes.
Standardized naming conventions with keyword arguments for common optional parameters.
Transition from function calls to kernel decorators (jit, benchmark, baremetal, simulate_kernel).
Documentation updates:
Tutorial for SPMD usage with multiple Neuron Cores on Trn2
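The move from function calls to kernel decorators listed under Improvements can be pictured with a generic Python sketch. The decorator below, fake_jit, is a hypothetical stand-in that only counts calls; it is not the NKI jit decorator and does not compile anything. The sketch only contrasts wrapping a function at the call site with decorating it at its definition.

```python
import functools


def fake_jit(func):
    """Hypothetical stand-in for a kernel decorator; it only counts calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


# Function-call style: the kernel is wrapped explicitly before use.
def add_one(values):
    return [v + 1 for v in values]


add_one_wrapped = fake_jit(add_one)


# Decorator style: the wrapping is declared once, at the definition.
@fake_jit
def add_two(values):
    return [v + 2 for v in values]


if __name__ == "__main__":
    print(add_one_wrapped([1, 2, 3]))
    print(add_two([1, 2, 3]))
```

Both forms produce the same wrapped callable; the decorator form keeps the choice of execution mode next to the kernel definition.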
Neuron Kernel Interface (NKI) (Beta) (Neuron 2.20.1 Release)#
Date: 12/03/2024
Improvements#
NKI support for Trainium2, including full integration with Neuron Compiler. Users can directly shard NKI kernels across multiple Neuron Cores from an SPMD launch grid. See Trainium2 Architecture Guide for the architecture specification.
New calling convention in NKI kernels, where kernel output tensors are explicitly returned from the kernel instead of pass-by-reference. See any NKI tutorial for code examples.
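As a rough illustration of the calling-convention change, the plain-Python sketch below contrasts a function that writes into a caller-provided output with one that allocates and returns its output. The function names are hypothetical and lists stand in for tensors; refer to the NKI tutorials for real kernel code.

```python
def scale_by_reference(src, out, factor):
    """Older style: the caller passes the output buffer and it is filled in place."""
    for i, value in enumerate(src):
        out[i] = value * factor


def scale_returning(src, factor):
    """Newer style: the output is created inside and explicitly returned."""
    out = [value * factor for value in src]
    return out


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]

    buffer = [0.0] * len(data)
    scale_by_reference(data, buffer, 2.0)  # result lands in buffer
    print(buffer)

    result = scale_returning(data, 2.0)  # result is the return value
    print(result)
```

With the returning style, the outputs of a kernel are visible in its signature and call site instead of being implied by which arguments get mutated.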
Neuron Kernel Interface (NKI) (Beta) [2.20] (Neuron 2.20.0 Release)#
Date: 09/16/2024
Improvements#
This release includes the beta launch of the Neuron Kernel Interface (NKI). NKI is a programming interface that enables developers to build optimized compute kernels on top of Trainium and Inferentia. NKI empowers developers to enhance deep learning models with new capabilities, performance optimizations, and scientific innovation. It natively integrates with PyTorch and JAX, providing a Python-based programming environment with Triton-like syntax and tile-level semantics that offers a familiar programming experience for developers. Additionally, to enable bare-metal access for precisely programming the instructions used by the chip, this release includes a set of NKI APIs (nki.isa) that directly emit Neuron Instruction Set Architecture (ISA) instructions in NKI kernels.