NKI Library Supported Kernel Reference#

The NKI Library provides pre-built reference kernels that you can use directly in model development with the AWS Neuron SDK and NKI. This reference documents the classes, functions, and parameters for integrating these kernels into your models.

Source code for these kernel APIs is available in the aws-neuron/nki-library repository.

Core Kernels#

Normalization and Quantization Kernels#

RMSNorm-Quant

Performs optional RMS normalization followed by quantization to fp8.
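The underlying math can be sketched in NumPy. This is an illustrative reference, not the kernel's API: the function name, the per-row dynamic scaling scheme, and the `fp8_max` value are assumptions (FP8 formats differ in their maximum representable value).

```python
import numpy as np

def rmsnorm_quant_ref(x, gamma, eps=1e-6, fp8_max=240.0):
    """Illustrative reference: RMS-normalize x over the last axis, then
    quantize to an fp8-style integer grid via a per-row dynamic scale.
    fp8_max is an assumed format maximum, not a kernel parameter."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    normed = x / rms * gamma
    # Per-row scale so the largest magnitude maps to the fp8 max value.
    scale = np.max(np.abs(normed), axis=-1, keepdims=True) / fp8_max
    q = np.clip(np.round(normed / scale), -fp8_max, fp8_max)
    return q, scale  # dequantize with q * scale
```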

QKV Projection Kernels#

QKV

Performs Query-Key-Value projection with optional normalization and RoPE fusion.

Attention Kernels#

Attention CTE

Implements attention optimized for Context Encoding (prefill) use cases.

Attention TKG

Implements attention optimized for Token Generation (decode) use cases with small active sequence lengths.
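Both variants compute the same scaled dot-product attention; they differ in how the work is tiled for long prefill sequences versus short decode steps. A minimal NumPy reference for the math (single head, illustrative only, not the kernel interface):

```python
import numpy as np

def sdpa_ref(q, k, v, causal=True):
    """Reference scaled dot-product attention for one head.
    q: (Sq, d); k, v: (Sk, d). Sq == Sk models prefill; Sq == 1, Sk > 1
    models a decode step. Illustrative sketch, not the NKI kernel API."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        Sq, Sk = scores.shape
        # Token i attends to keys j <= i + (Sk - Sq); ends are aligned
        # so a decode query (Sq == 1) sees the whole key cache.
        mask = np.triu(np.ones((Sq, Sk), dtype=bool), k=Sk - Sq + 1)
        scores = np.where(mask, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```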

Rotary Position Embedding (RoPE) Kernels#

RoPE

Applies Rotary Position Embedding to input embeddings with flexible layout support.
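The rotation itself is simple to state in NumPy. The sketch below uses the common "rotate-half" layout; the kernel's flexible layout support covers other arrangements, and the function name and signature here are illustrative:

```python
import numpy as np

def rope_ref(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq, d), d even,
    using the rotate-half layout. Illustrative reference only."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair freqs
    angles = np.outer(positions, inv_freq)         # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    # 2-D rotation applied to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Since each pair is rotated, position 0 is the identity and vector norms are preserved, which is a quick sanity check for any layout variant.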

Multi-Layer Perceptron (MLP) Kernels#

MLP

Implements Multi-Layer Perceptron with optional normalization fusion and quantization support.
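One common shape for LLM MLPs is the gated (SwiGLU-style) form sketched below. This is an assumption for illustration: the NKI kernel's actual activation options, fused normalization, and quantization paths are omitted, and all names are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def gated_mlp_ref(x, w_gate, w_up, w_down):
    """Illustrative gated MLP: down-project the elementwise product of a
    SiLU-gated branch and a linear branch. Normalization fusion and
    quantization, which the kernel supports, are not shown here."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```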

Output Projection Kernels#

Output Projection CTE

Computes output projection optimized for Context Encoding use cases.

Output Projection TKG

Computes output projection optimized for Token Generation use cases.

Mixture of Experts (MoE) Kernels#

Router Top-K

Computes router logits, applies activation functions, and performs top-K selection for MoE models.
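The routing pipeline can be sketched in NumPy. Softmax is used here as the activation for illustration; the kernel supports other activation functions, and the function name and return convention are assumptions:

```python
import numpy as np

def router_topk_ref(hidden, w_router, k):
    """Illustrative MoE routing: logits -> softmax -> top-k per token.
    Returns (expert indices, routing weights), highest weight first."""
    logits = hidden @ w_router                      # (tokens, experts)
    logits -= logits.max(axis=-1, keepdims=True)    # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    idx = np.argsort(-probs, axis=-1)[:, :k]        # top-k expert ids
    weights = np.take_along_axis(probs, idx, axis=-1)
    return idx, weights
```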

MoE CTE

Implements Mixture of Experts MLP operations optimized for Context Encoding use cases.

MoE TKG

Implements Mixture of Experts MLP operations optimized for Token Generation use cases.

Cumulative Sum Kernels#

Cumsum

Computes cumulative sum along the last dimension with optimized tiling.
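The semantics match a last-axis cumulative sum, shown here with NumPy for reference (the kernel's contribution is the tiling, not the math):

```python
import numpy as np

# Cumulative sum along the last dimension: each output element is the
# running total of the row up to and including that position.
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.cumsum(x, axis=-1)
# y == [[1., 3., 6.], [4., 9., 15.]]
```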

Experimental Kernels#

Note

Experimental kernels are under active development and their APIs may change in future releases.

Attention Kernels#

Attention Block TKG

Fused attention block for Token Generation that keeps all intermediate tensors in SBUF to minimize HBM traffic.

Convolution Kernels#

Depthwise Conv1D

Implements depthwise 1D convolution using an implicit GEMM algorithm.
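In a depthwise convolution each channel is convolved with its own filter, with no cross-channel mixing. The NumPy sketch below fixes only those semantics (valid padding, no dilation or stride assumed); how the kernel maps this onto matmuls via implicit GEMM is not shown, and the signature is illustrative:

```python
import numpy as np

def depthwise_conv1d_ref(x, w):
    """Illustrative depthwise 1D convolution (cross-correlation form,
    valid padding). x: (channels, length), w: (channels, kernel)."""
    C, L = x.shape
    _, K = w.shape
    out = np.empty((C, L - K + 1))
    for t in range(L - K + 1):
        # Per-channel dot product of the sliding window with the filter.
        out[:, t] = np.sum(x[:, t : t + K] * w, axis=1)
    return out
```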

Loss Kernels#

Cross Entropy

Implements memory-efficient cross-entropy loss forward and backward passes using an online log-sum-exp algorithm.
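The online log-sum-exp trick is what makes the kernel memory-efficient: logits are consumed in chunks while a running maximum and a rescaled running sum are maintained, so the full logits row never needs to be materialized at once. A minimal sketch of just that accumulation (function name illustrative):

```python
import numpy as np

def online_logsumexp(chunks):
    """Streaming log-sum-exp over chunks of a logits row: keep a running
    max m and a running sum s of exp(logit - m), rescaling s whenever m
    grows. The cross-entropy loss for a token is then
    logsumexp(logits) - logits[target]. Illustrative sketch."""
    m, s = -np.inf, 0.0
    for chunk in chunks:
        cm = max(m, float(np.max(chunk)))
        s = s * np.exp(m - cm) + float(np.sum(np.exp(chunk - cm)))
        m = cm
    return m + np.log(s)
```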

MoE Backward Kernels#

Blockwise MM Backward

Computes backward pass for blockwise matrix multiplication in Mixture of Experts layers.
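The per-block math is the standard matrix-multiplication backward rule: for C = A @ B, the gradients are dA = dC @ Bᵀ and dB = Aᵀ @ dC. Shown below for a single block as an illustrative reference; how the kernel tiles this across expert blocks is not sketched here.

```python
import numpy as np

def matmul_backward_ref(a, b, grad_out):
    """Backward of C = A @ B for one block:
    dA = dC @ B^T, dB = A^T @ dC. Illustrative reference only."""
    return grad_out @ b.T, a.T @ grad_out
```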