This document is relevant for: Inf2, Trn1, Trn2

NKI Kernels

nki.kernels

The source code for the kernels in the neuronxcc.nki.kernels namespace is available in the nki-samples GitHub repository. These are optimized kernels written by the Neuron team that serve as samples. The repository also contains numeric tests, performance benchmarks, and scripts for using the kernels in real models.

You are welcome to customize them to fit your workloads and to contribute to the repository by opening a PR. Note that these kernels are already deployed as part of the Neuron stack. Taking flash attention as an example, compiling Llama models with transformers-neuronx automatically invokes the flash_fwd kernel listed here. Therefore, replacing the framework operators with these NKI kernels is unlikely to yield any additional performance benefit.

See the README of the nki-samples GitHub repository for more details.

fused_self_attn_for_SD_small_head_size

Fused self attention kernel for small head size Stable Diffusion workload.

allocated_fused_self_attn_for_SD_small_head_size

Allocated fused self attention kernel for small head size Stable Diffusion workload.

allocated_fused_rms_norm_qkv

Allocated kernel that computes RMSNorm(hidden) @ wQKV.
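To make the expression RMSNorm(hidden) @ wQKV concrete, here is a minimal NumPy reference of the math only. It is a sketch, not the kernel's implementation or signature: the function name, the `eps` value, the absence of a learnable gain, and the (tokens, hidden) layout are all assumptions made for illustration.

```python
import numpy as np

def rms_norm_qkv_reference(hidden, w_qkv, eps=1e-6):
    """Hypothetical NumPy reference: RMS-normalize each row, then project.

    hidden: (tokens, hidden_dim), w_qkv: (hidden_dim, qkv_dim).
    `eps` and the lack of a gain vector are assumptions, not kernel facts.
    """
    hidden = hidden.astype(np.float32)
    # Root mean square over the hidden dimension, one value per token.
    rms = np.sqrt(np.mean(hidden * hidden, axis=-1, keepdims=True) + eps)
    normed = hidden / rms
    # Single projection producing the concatenated Q, K, V activations.
    return normed @ w_qkv.astype(np.float32)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))
w_qkv = rng.standard_normal((8, 12))
out = rms_norm_qkv_reference(hidden, w_qkv)
```

A reference like this is mainly useful as a numeric baseline when checking a fused kernel's output within a tolerance.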

flash_attn_bwd

Flash attention backward kernel.

resize_nearest_fixed_dma_kernel

Resize the input image to the given size using the nearest interpolation mode.
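As an illustration of nearest-neighbor resizing, the following NumPy sketch maps each output pixel to a source pixel. The function name, the (H, W, C) layout, and the floor-based index mapping are assumptions; the actual kernel may use a different layout or rounding convention.

```python
import numpy as np

def resize_nearest_reference(image, out_h, out_w):
    """Hypothetical nearest-neighbor resize of an (H, W, C) array."""
    in_h, in_w = image.shape[:2]
    # Map every output index to a source index: floor(dst * in / out).
    rows = (np.arange(out_h) * in_h) // out_h
    cols = (np.arange(out_w) * in_w) // out_w
    # Broadcasting the two index vectors gathers an (out_h, out_w, C) result.
    return image[rows[:, None], cols[None, :]]

image = np.arange(4).reshape(2, 2, 1)   # pixel values 0, 1, 2, 3
resized = resize_nearest_reference(image, 4, 4)
```

With this mapping, upscaling the 2x2 input to 4x4 repeats each source pixel in a 2x2 block.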

flash_fwd

Flash attention forward kernel.
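The idea behind a flash attention forward pass can be sketched in NumPy: keys and values are processed block by block while a running maximum and running sum keep the softmax numerically exact. This is a single-head, unmasked illustration with hypothetical function names and block size; it does not reflect the flash_fwd kernel's signature, tiling, or masking options.

```python
import numpy as np

def attention_reference(q, k, v):
    """Plain softmax attention that materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    return (probs / probs.sum(axis=-1, keepdims=True)) @ v

def flash_attention_forward_sketch(q, k, v, block=4):
    """Hypothetical blockwise attention using an online softmax."""
    seq_q, d = q.shape
    out = np.zeros((seq_q, v.shape[-1]))
    running_max = np.full((seq_q, 1), -np.inf)
    running_sum = np.zeros((seq_q, 1))
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        scores = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale what was accumulated so far to the new running maximum.
        correction = np.exp(running_max - new_max)
        probs = np.exp(scores - new_max)
        running_sum = running_sum * correction + probs.sum(axis=-1, keepdims=True)
        out = out * correction + probs @ v_blk
        running_max = new_max
    # Final normalization by the accumulated softmax denominator.
    return out / running_sum

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 5))
blockwise = flash_attention_forward_sketch(q, k, v, block=4)
full = attention_reference(q, k, v)
```

The blockwise result should match the plain reference up to floating-point rounding, while never holding more than one block of scores at a time.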

select_and_scatter_kernel

Implementation of a select-and-scatter kernel.
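For readers unfamiliar with the operation, the sketch below shows one common reading of select-and-scatter: within each window of the operand, select the position of the maximum, then add the corresponding source value at that position in the output. The 2-D layout, max selection, additive scatter, zero initial value, and the function name are assumptions for illustration; the source does not state which variant the kernel implements.

```python
import numpy as np

def select_and_scatter_reference(operand, source, window=2, stride=2):
    """Hypothetical 2-D select-and-scatter with max-select and add-scatter."""
    out = np.zeros_like(operand, dtype=np.float64)
    for i in range(source.shape[0]):
        for j in range(source.shape[1]):
            r, c = i * stride, j * stride
            patch = operand[r:r + window, c:c + window]
            # Select: position of the maximum inside this window.
            dr, dc = np.unravel_index(np.argmax(patch), patch.shape)
            # Scatter: accumulate the source value at the selected position.
            out[r + dr, c + dc] += source[i, j]
    return out

operand = np.array([[1, 2, 0, 1],
                    [3, 4, 1, 5],
                    [0, 1, 7, 2],
                    [2, 0, 3, 1]])
source = np.array([[10, 20],
                   [30, 40]])
scattered = select_and_scatter_reference(operand, source)
```

Under these assumptions the operation matches the gradient of a 2x2 max pooling: each source value lands where its window's maximum was found.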
