This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Release Notes for Neuron Component: NKI Library#
Release notes for the NKI Library Neuron component, covering the changes, improvements, and bug fixes in each release of the AWS Neuron SDK.
NKI Library (NKI-Lib) (Neuron 2.29.0 Release)#
Date of Release: 04/09/2026
What’s New#
This release promotes find_nonzero_indices from experimental to a core subkernel and adds 7 new experimental kernels (Conv1D, Transformer TKG, 3 collective communication kernels, Top-K Reduce, and Dynamic Elementwise Add). Existing kernels receive sequence packing support, MXFP quantization paths, and expanded dimension limits. PyTorch reference implementations are added for 22 kernels.
New Core Additions#
find_nonzero_indices (promoted from experimental) — Finds indices of nonzero elements along the T dimension using the GpSimd `nonzero_with_count` ISA. Optimized for LNC2 sharding. Supports token counts up to 65536 and column counts up to 128.
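The new core import path below is a minimal sketch based on the move described under Breaking Changes in this release; the experimental path keeps working through the backward-compatible re-export.

```python
# New core location (per this release's Breaking Changes):
from nkilib.core.subkernels import find_nonzero_indices

# Legacy experimental location, still importable via the re-export:
from nkilib.experimental.subkernels import find_nonzero_indices
```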
New Experimental Kernels#
Conv1D — 1D convolution using tensor engine with replication strategy. Supports stride, padding, dilation, optional bias, activation fusion, and LNC sharding.
Transformer TKG — Multi-layer transformer forward pass megakernel for token generation. Executes attention block, all-reduce, MLP, and residual connections across a configurable number of layers.
Fine-Grained All-Gather — Ring-based all-gather for TRN2 using collective permute with double buffering to overlap communication and data movement.
FGCC (All-Gather + Matmul) — Fused all-gather and matrix multiplication for TRN2, overlapping communication with compute.
SBUF-to-SBUF All-Gather — Two variants: `allgather_sb2sb` for small tensors fitting in SBUF, and `allgather_sb2sb_tiled` with tiling and LNC support for larger tensors (see the sketch after this list).
Top-K Reduce — Gathers scattered rows by packed global token index and reduces along the K dimension for MoE output. Supports LNC sharding on the hidden dimension.
Dynamic Elementwise Add — Elementwise addition with runtime-variable M-dimension tiling using dynamic loop bounds.
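A hedged sketch of selecting between the two SBUF-to-SBUF all-gather variants follows. Only the kernel names come from these notes; the import path, the `fits_in_sbuf` flag, and the argument list are hypothetical placeholders.

```python
from nkilib.experimental.kernels import allgather_sb2sb, allgather_sb2sb_tiled  # assumed path

def allgather_sbuf(tensor, replica_groups, fits_in_sbuf):
    if fits_in_sbuf:
        # Small tensors that fit entirely in SBUF.
        return allgather_sb2sb(tensor, replica_groups)  # hypothetical arguments
    # Larger tensors: tiled variant with LNC support.
    return allgather_sb2sb_tiled(tensor, replica_groups)  # hypothetical arguments
```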
Improvements#
Attention CTE Kernel: Added `mm_out_dtype` parameter for controlling matmul output dtype. Added `bound_min`/`bound_max` parameters for sequence packing support (per-query KV range bounds). Increased max batch size from 32 to 512. Increased max sequence length from 36864 to 131072.
Attention BWD Kernel: Added `bound_min`/`bound_max` parameters for sequence packing support. Added support for large batch sizes.
Attention TKG Kernel: Added `start_pos_ids` parameter for explicit KV cache position control to support sliding window masking.
Attention Block TKG Kernel: Added `rmsnorm_QK_pre_rope_W_Q`/`rmsnorm_QK_pre_rope_W_K` parameters for fused QK-norm before RoPE. Added KVDP attention sharding support (`KVDP`, `KVDP_replica_group`). Added `enable_fa_s_prior_tiling` for overriding flash attention s_prior tiling.
MLP Kernel: Added `sbm` (`BufferManager`) parameter for custom SBUF memory management. Added MXFP4/MXFP8 quantization path.
MoE TKG Kernel: Added a new dynamic all-expert algorithm that uses the `block_size` and `is_all_expert_dynamic` arguments. Expanded support for small I and added support for sharding on T in the all-expert MX kernel.
Output Projection CTE Kernel: Added `output_dtype` parameter for controlling the output data type.
Output Projection TKG Kernel: Added `sbm` (`BufferManager`) parameter for custom SBUF memory management.
QKV Kernel: Added `is_h_dim_4h_transposed` and `weight_layout` parameters for flexible weight layout support.
rmsnorm_tkg / layernorm_tkg: Added `shard_on_h` parameter for sharding on the hidden dimension (see the sketch after this list).
Added PyTorch reference implementations for 22 kernels for testing and validation.
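A minimal sketch of the new `shard_on_h` option on `rmsnorm_tkg` follows, assuming the post-2.28 signature (input first, explicit output third). The import path, the middle weights argument, and the tensor setup are assumptions for illustration; in practice these calls run inside an NKI kernel.

```python
import nki.language as nl
from nkilib.core.subkernels import rmsnorm_tkg  # assumed import path

hidden = nl.ndarray((128, 8192), dtype=nl.bfloat16, buffer=nl.sbuf)
gamma = nl.ndarray((1, 8192), dtype=nl.bfloat16, buffer=nl.sbuf)  # assumed weights argument
out = nl.ndarray((128, 8192), dtype=nl.bfloat16, buffer=nl.sbuf)

# shard_on_h shards the normalization across the hidden dimension (new in this release).
rmsnorm_tkg(hidden, gamma, out, shard_on_h=True)
```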
Breaking Changes#
Router Top-K Kernel: The `output_in_sbuf`, `x_input_in_sbuf`, and `expert_affin_in_sb` parameters have been removed. The kernel now auto-detects SBUF inputs from the tensor buffer type. Callers passing these keyword arguments must remove them.
QKV Kernel: The `is_input_swizzled` parameter has been removed and replaced by `is_h_dim_4h_transposed` (same position, same default `False`) and a new `weight_layout` parameter. Callers using `is_input_swizzled` by name must rename it to `is_h_dim_4h_transposed`.
QKV Kernel (TKG variant): A new parameter `is_h_dim_4h_transposed` has been inserted after `quantization_type`. Callers using positional arguments for `qkv_w_scale` or later parameters must switch to keyword arguments.
Attention CTE Kernel: A new parameter `mm_out_dtype` has been inserted between `softmax_dtype` and `cp_offset`. Callers using positional arguments for `cp_offset`, `global_cp_deg`, or `cp_strided_q_slicing` must switch to keyword arguments (see the migration sketch after this list).
Attention TKG Kernel: A new parameter `start_pos_ids` has been inserted after `rope_pos_ids`. Callers using positional arguments beyond `rope_pos_ids` must switch to keyword arguments.
Attention BWD Kernel: New parameters `bound_min` and `bound_max` have been inserted between `sinks_ref` and `use_causal_mask`. Callers using positional arguments for `use_causal_mask` or later parameters must switch to keyword arguments.
Attention Block TKG Kernel: The keyword-only marker (`*`) has been removed and multiple parameters have been reordered. New pre-RoPE QK-norm parameters (`rmsnorm_QK_pre_rope_W_Q`, `rmsnorm_QK_pre_rope_W_K`) have been added. `softmax_scale`, `k_scale`, and `v_scale` have been moved to optional parameters with defaults. All callers must review their argument ordering.
rmsnorm_tkg / layernorm_tkg: A new parameter `shard_on_h` has been inserted before `use_heap_memory` and `sbm`. Callers using positional arguments beyond `single_core_forced` (rmsnorm) or `eps` (layernorm) must switch to keyword arguments. The helper functions `process_rmsnorm_tile`, `rmsnorm_tkg_llama_impl`, and `layernorm_tkg_llama_impl` have been made private (prefixed with `_`).
SbufManager has been renamed to BufferManager. A backward-compatible alias `SbufManager = BufferManager` is provided, so existing code using `SbufManager` continues to work.
MoE TKG: Replaced the boolean sharding flags (`shard_on_I`, `shard_on_T`) with the `LNCShardingStrategy` enum in down projection interfaces.
MoE TKG MX quantization files restructured: `down_projection_mx_shard_I.py` and `gate_up_projection_mx_shard_I.py` have been replaced with `all_expert_mx_utils.py`, `down_projection_mx.py`, and `gate_up_projection_mx.py`. Callers importing from the old file paths must update their imports.
`find_nonzero_indices` has been moved from `nkilib.experimental.subkernels` to `nkilib.core.subkernels`. A backward-compatible re-export is provided, so imports via the experimental path continue to work.
Removed usage of `nki.language.par_dim` throughout the library.
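Two of the migrations above, sketched side by side. The `attention_cte` symbol, its leading arguments, and the import path are hypothetical placeholders; only the parameter names and ordering constraints come from these notes.

```python
# Attention CTE: mm_out_dtype now sits between softmax_dtype and cp_offset,
# so cp_offset and everything after it must be passed by keyword.
#
#   Before (2.28):  attention_cte(..., softmax_dtype, cp_offset, global_cp_deg)
#   After  (2.29):  attention_cte(..., softmax_dtype,
#                                 cp_offset=cp_offset,
#                                 global_cp_deg=global_cp_deg)

# SbufManager -> BufferManager: the alias keeps existing code working.
from nkilib.core.utils import BufferManager, SbufManager  # assumed import path

assert SbufManager is BufferManager  # backward-compatible alias per this release
```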
Bug Fixes#
Fixed MLP CTE indexing in gate proj row scales.
Fixed the QKV TKG `sb2sb_wrapper_kernel` signature, which was missing QK-norm parameters.
Fixed an MLP failure for FP4 quantization with specific dimension combinations (`vnc=2`, `h=3072`, `i=384`).
Fixed `bwmm_shard_on_H` with an explicit TensorCopy from PSUM to SBUF for NKI 0.3.0 compatibility.
Known Issues#
NKI Library (NKI-Lib) (Neuron 2.28.0 Release)#
What’s New#
This release expands the NKI Library with 9 new kernels, bringing the total to 16 documented kernel APIs. New core kernels include RoPE, Router Top-K, MoE CTE, MoE TKG, and Cumsum. New experimental kernels include Attention Block TKG (fused attention block for token generation), Cross Entropy (forward and backward passes), Depthwise Conv1D, and Blockwise MM Backward for MoE training.
Existing kernels receive FP8 and MX quantization support across QKV, MLP, and both Output Projection kernels. Kernel utilities gain new TensorView methods, SbufManager logging improvements with tree-formatted allocation tracing, and new utilities including interleave_copy, LncSubscriptable, and rmsnorm_mx_quantize_tkg. Note that several breaking changes affect kernel signatures and utility APIs — see the Breaking Changes section for details.
New Core Kernels#
RoPE Kernel — Applies Rotary Position Embedding to input embeddings with optional LNC sharding and flexible layout support (contiguous and interleaved).
Router Top-K Kernel — Computes router logits and top-K expert selection for Mixture of Experts models, with support for multiple layout configurations and sharding strategies.
MoE CTE Kernel — Implements Mixture of Experts optimized for Context Encoding with multiple sharding strategies (block sharding, intermediate dimension sharding) and MxFP4/MxFP8 quantization.
MoE TKG Kernel — Implements Mixture of Experts optimized for Token Generation with all-expert and selective-expert modes, supporting FP8 and MxFP4 quantization.
Cumsum Kernel — Computes cumulative sum along the last dimension, optimized for batch sizes up to 2048.
New Experimental Kernels#
Attention Block TKG Kernel — Fused attention block for Token Generation that combines RMSNorm, QKV projection, RoPE, attention, and output projection in SBUF to minimize HBM traffic.
Cross Entropy Kernel — Memory-efficient cross entropy loss forward and backward passes for large vocabularies using an online log-sum-exp algorithm, optimized for LNC2 (see the reference sketch after this list).
Depthwise Conv1D Kernel — Depthwise 1D convolution using implicit GEMM algorithm with support for arbitrary stride and padding values, optimized for TRN2.
Blockwise MM Backward Kernel — Backward pass for blockwise matrix multiplication in MoE layers, computing gradients for all parameters with support for dropless MoE.
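For orientation, the snippet below is a plain PyTorch reference of the online log-sum-exp recurrence named in the Cross Entropy kernel description. It illustrates the numerics only and is not the kernel implementation; the chunk size and vocabulary size are arbitrary.

```python
import torch

def online_logsumexp(logits_row: torch.Tensor, chunk: int = 4096) -> torch.Tensor:
    """Streaming log-sum-exp over vocabulary chunks (reference only)."""
    m = torch.tensor(float("-inf"))  # running maximum
    s = torch.tensor(0.0)            # running sum of exp(x - m)
    for start in range(0, logits_row.numel(), chunk):
        x = logits_row[start:start + chunk]
        m_new = torch.maximum(m, x.max())
        # Rescale the accumulated sum to the new maximum, then fold in the chunk.
        s = s * torch.exp(m - m_new) + torch.exp(x - m_new).sum()
        m = m_new
    return m + torch.log(s)

logits = torch.randn(50_000)
assert torch.allclose(online_logsumexp(logits), torch.logsumexp(logits, dim=0), atol=1e-5)
# Cross entropy for a target id t is then: online_logsumexp(logits) - logits[t]
```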
Improvements#
QKV Kernel: Added FP8 quantization support (`quantization_type`, `qkv_w_scale`, `qkv_in_scale`), fused FP8 KV cache quantization (`k_cache`, `v_cache`, `k_scale`, `v_scale`, `fp8_max`, `fp8_min`, `kv_dtype`), block-based KV cache layout (`use_block_kv`, `block_size`, `slot_mapping`), and MX quantization input swizzling (`is_input_swizzled`). See the sketch after this list.
MLP Kernel: Added FP8 quantization support (`quantization_type`, `gate_w_scale`, `up_w_scale`, `down_w_scale`, `gate_up_in_scale`, `down_in_scale`, `quant_clipping_bound`), gate/up projection clamping (`gate_clamp_upper_limit`, `gate_clamp_lower_limit`, `up_clamp_upper_limit`, `up_clamp_lower_limit`), a `skip_gate_proj` option, and fp16 support for TKG mode.
Output Projection CTE Kernel: Added FP8 quantization support (`quantization_type`, `input_scales`, `weight_scales`).
Output Projection TKG Kernel: Added FP8 quantization support (`quantization_type`, `weight_scale`, `input_scale`) and removed the 512 restriction on the non-transpose path.
Attention CTE Kernel: Added strided Q slicing for context parallelism (`cp_strided_q_slicing`).
RMSNorm-Quant Kernel: Added input dequantization scale support (`input_dequant_scale`).
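A hedged sketch of the new QKV FP8 keywords follows. The parameter names come from these notes; the kernel symbol, the import paths, the `FP8` enum member, and the undefined placeholders (`x`, `qkv_weights`, `out`, `w_scale`, `in_scale`) are assumptions for illustration.

```python
from nkilib.core.kernels import qkv             # assumed symbol and path
from nkilib.core.utils import QuantizationType  # assumed import path

qkv(
    x, qkv_weights, out,                     # placeholder positional arguments
    bias=None,
    quantization_type=QuantizationType.FP8,  # FP8 member assumed
    qkv_w_scale=w_scale,                     # new FP8 weight scale parameter
    qkv_in_scale=in_scale,                   # new FP8 input scale parameter
)
```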
Kernel Utilities#
See Kernel Utilities Reference for full documentation.
TensorView: Added a `rearrange` method for flexible dimension reordering, `has_dynamic_access` for checking whether a view requires runtime-dependent addressing, and a `key_in_dict` helper. The `slice` method now clamps the end index to dimension bounds instead of asserting.
TiledRange: `TiledRangeIterator` now exposes an `end_offset` attribute, enabling kernels to determine the end position of each tile without manual calculation.
SbufManager (Allocator): Added `get_total_space` and `get_used_space` for querying SBUF utilization, `set_name_prefix`/`get_name_prefix` for scoped naming, and `flush_logs` to emit buffered allocation logs. SbufManager now uses `TreeLogger` to provide hierarchical, tree-formatted logs of SBUF allocation and deallocation events, making it easier to debug memory usage across nested scopes (see the sketch after this list).
QuantizationType: Added an `MX` enum value for microscaling quantization (MxFP4/MxFP8).
common_types: Added a `GateUpDim` enum for distinguishing gate vs. up projection dimensions.
rmsnorm_tkg / layernorm_tkg: Both subkernels now accept a `TensorView` or `nl.ndarray` for input and require an explicit `output` tensor parameter, giving callers control over output placement.
New utilities: Added the `rmsnorm_mx_quantize_tkg` subkernel for fused RMSNorm with MX quantization in token generation, `interleave_copy` for interleaved tensor copy operations, `LncSubscriptable` for LNC-aware data access patterns, and `TreeLogger` for hierarchical allocation logging.
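A hedged sketch of the new SbufManager introspection and logging hooks follows. The method names come from these notes; the constructor, the `alloc` call, and the import path are hypothetical placeholders.

```python
from nkilib.core.utils import SbufManager  # assumed import path

sbm = SbufManager()                    # constructor arguments are an assumption
sbm.set_name_prefix("attn_block")      # scope the names of subsequent allocations

buf = sbm.alloc((128, 512))            # hypothetical allocation call

# Query SBUF utilization (new in this release).
free_bytes = sbm.get_total_space() - sbm.get_used_space()

sbm.flush_logs()                       # emit the buffered, tree-formatted allocation log
```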
Breaking Changes#
The open source repository source directory has been renamed from `nkilib_standalone` to `nkilib_src`.
MLP Kernel: The function has been renamed from `mlp_kernel` to `mlp`. New parameters have been inserted in the middle of the signature; callers using positional arguments beyond `normalization_type` must switch to keyword arguments.
QKV Kernel: New parameters (`quantization_type`, `qkv_w_scale`, `qkv_in_scale`) have been inserted after `bias`; callers using positional arguments beyond `bias` must switch to keyword arguments.
Output Projection TKG Kernel: The `bias` parameter is now optional (default `None`). New parameters (`quantization_type`, `weight_scale`, `input_scale`) have been inserted before `TRANSPOSE_OUT`; callers using positional arguments beyond `bias` must switch to keyword arguments.
TiledRangeIterator: The constructor now requires a fourth positional argument, `end_offset`.
TensorView: The `sizes` attribute has been renamed to `shape`.
rmsnorm_tkg: The `inp` parameter has been renamed to `input`. A new required `output` parameter has been added as the third argument. The `output_in_sbuf` parameter has been removed. New parameters `hidden_dim_tp` and `single_core_forced` have been added (see the migration sketch after this list).
layernorm_tkg: The `inp` parameter has been renamed to `input`. A new required `output` parameter has been added as the third argument. The `output_in_sbuf` parameter has been removed.
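A migration sketch for `rmsnorm_tkg` follows. The renamed, added, and removed parameters come from these notes; the `gamma` weights argument and the undefined `hidden`/`norm_out` tensors are assumptions for illustration.

```python
# Before (2.27): input passed as `inp`, placement controlled by output_in_sbuf.
#   rmsnorm_tkg(inp=hidden, gamma, output_in_sbuf=True)
#
# After (2.28): `inp` is renamed to `input`, the explicit `output` tensor is
# the third argument, and `output_in_sbuf` is removed -- placement follows
# the buffer of the tensor passed as `output`.
rmsnorm_tkg(input=hidden, gamma=gamma, output=norm_out)
```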
Bug Fixes#
Fixed attention TKG compilation and non-determinism issues.
Fixed incorrect v_active slice indices in attention TKG block KV path.
Fixed batch sharding in gen_mask_tkg active mask loading.
Fixed `expert_affinities` masking when `mask_unselected_experts` is True in MoE TKG.
Fixed `expert_index` shape mismatch in MoE TKG for T > 128.
Fixed MoE affinity mask handling for T not divisible by 128.
Fixed MoE TKG MX weight generation x4 pack size.
Fixed MLP CTE `force_cte_mode` parameter validation.
Fixed output projection CTE mixed precision support.
Fixed output projection TKG variable name typo.
Fixed router_topk bias shape to satisfy NKI check requirements.
Fixed tail iteration bug for sequences not a multiple of 128 in MoE CTE.
Fixed reading extra partitions for last rank in MoE CTE.
Known Issues#
NKI Library (NKI-Lib) (Neuron 2.27.0 Release)#
What’s New#
This release introduces the NKI Library, which provides pre-built kernels you can use to optimize the performance of your models. The NKI Library offers ready-to-use, pre-optimized kernels that leverage the full capabilities of AWS Trainium hardware.
NKI Library kernels are published in the NKI Library GitHub repository.
In Neuron 2.27, these kernels are also shipped as part of neuronx-cc under the `nkilib.*` namespace.
Accessing NKI Library Kernels#
You can access NKI Library kernels in two ways:
Shipped version: Import from the `nkilib.*` namespace (included with neuronx-cc in Neuron 2.27).
Open source repository: Clone the GitHub repository and use kernels under the `nkilib_standalone.nkilib.*` namespace (see the import sketch below).
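The two access paths map to imports as follows. The trailing module and function names (`kernels.attention_cte`, `attention_cte`) are hypothetical examples; only the top-level namespaces come from these notes.

```python
# 1) Shipped with neuronx-cc (Neuron 2.27):
from nkilib.kernels.attention_cte import attention_cte  # assumed module path

# 2) Cloned open source repository (Neuron 2.27 layout):
from nkilib_standalone.nkilib.kernels.attention_cte import attention_cte  # assumed module path
```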
New Kernels#
This release includes the following pre-optimized kernels:
Attention CTE Kernel — Implements attention with support for multiple variants and optimizations
Attention TKG Kernel — Implements attention specifically optimized for token generation scenarios
MLP Kernel — Implements a Multi-Layer Perceptron with optional normalization fusion and various optimizations
Output Projection CTE Kernel — Computes the output projection operation optimized for Context Encoding use cases
Output Projection TKG Kernel — Computes the output projection operation optimized for Token Generation use cases
QKV Kernel — Performs Query-Key-Value projection with optional normalization fusion
RMSNorm-Quant Kernel — Performs optional RMS normalization followed by quantization to fp8
NKI Library Kernel Migration to New nki.* Namespace in Neuron 2.28#
Some NKI Library kernels currently use the legacy `neuronxcc.nki.*` namespace. Starting with Neuron 2.28, all NKI Library kernels will migrate to the new `nki.*` namespace.
The new `nki.*` namespace introduces changes to NKI APIs and language constructs. Customers using NKI Library kernels should review the migration guide for any required changes; a minimal illustration of the import change follows.
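The sketch below shows the import-level effect of the migration. The `language` and `isa` submodules are standard NKI imports; their availability under the new top-level `nki.*` namespace follows the plan described here.

```python
# Legacy namespace (through Neuron 2.27):
import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa

# New namespace (Neuron 2.28 onward):
import nki.language as nl
import nki.isa as nisa
```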
NKI Library Namespace Changes in Neuron 2.28#
Starting with Neuron 2.28, the open source repository namespace will change from `nkilib_standalone.nkilib.*` to `nkilib.*`, providing a consistent namespace between the open source repository and the shipped version. A sketch of the resulting import change follows.
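Sketched as imports, with a hypothetical `kernels.mlp` module path for illustration (only the top-level namespaces come from these notes; the `mlp` function name is per the 2.28 rename above):

```python
# Neuron 2.27 open source repository layout:
from nkilib_standalone.nkilib.kernels.mlp import mlp  # assumed module path

# Neuron 2.28 onward, matching the shipped version:
from nkilib.kernels.mlp import mlp  # assumed module path
```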
Customers who want to add or modify NKI Library kernels can build and install them to replace the default implementation without changing model imports.
This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3