Neuron Kernel Interface (NKI) Documentation
The Neuron Kernel Interface (NKI) is a bare-metal programming interface that enables direct access to AWS NeuronDevices available on Inf2, Trn1, Trn2, and Trn3 instances. NKI empowers ML developers to write high-performance kernel functions that can be integrated into PyTorch and JAX models, allowing fine-grained control over hardware resources while maintaining a familiar programming model.
NKI Beta versions
NKI is currently in beta, with Beta 2 as the current shipped version. Read more about NKI beta versions here.
With NKI, you can develop, optimize, and run custom operators directly on NeuronCores, making full use of available compute engines and memory resources. This interface bridges the gap between high-level machine learning frameworks and the specialized hardware capabilities of AWS Neuron accelerators, enabling you to self-serve and invent new ways to use the NeuronCore hardware.
NKI currently supports multiple NeuronDevice generations:
Trainium/Inferentia2, available on AWS trn1, trn1n, and inf2 instances
Trainium2, available on AWS trn2 instances and UltraServers
Trainium3, available on AWS trn3 instances and UltraServers
NKI provides a Python-based programming environment with syntax and tile-level semantics similar to Triton and NumPy, enabling you to get started quickly while still having full control of the underlying hardware. At the hardware level, NeuronCore’s tensorized memory access capability enables efficient reading and writing of multi-dimensional arrays on a per-instruction basis, making NKI’s tile-based programming highly suitable for the NeuronCore instruction set.
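To make the tile-level programming model concrete, below is a minimal sketch of an element-wise addition kernel, modeled on the NKI getting-started examples. The module paths (neuronxcc.nki, neuronxcc.nki.language) and APIs (nki.jit, nl.ndarray, nl.load, nl.store) reflect the NKI Beta interface; exact names and signatures may differ in your installed Neuron release, so treat this as illustrative rather than definitive.

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def nki_tensor_add_kernel(a_input, b_input):
    """Element-wise addition of two tensors that fit in a single tile
    (first dimension at most 128, mapping to the SBUF partition dimension)."""
    # Allocate the kernel output in device memory (HBM).
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Load both inputs from HBM into on-chip SBUF tiles.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # Compute on the tiles and write the result back to HBM.
    nl.store(c_output, value=a_tile + b_tile)
    return c_output
```

The kernel operates on whole tiles at a time, much like a NumPy expression over arrays, which is what makes the programming model a natural fit for the NeuronCore instruction set described above.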
For comparison, before NKI was introduced, the only way to program NeuronDevices was through defining high-level ML models in frameworks such as PyTorch and JAX. The Neuron Graph Compiler takes such high-level model definitions as input, performs multiple rounds of optimization, and eventually generates a NEFF (Neuron Executable File Format) that is executable on NeuronDevices. At a high level, the Graph Compiler runs the following optimization stages in order:
Hardware-agnostic graph-level optimizations. These transformations are done in the Graph Compiler’s front-end, using XLA and including optimizations like constant propagation, re-materialization and operator fusion.
Loop-level optimization. The Graph Compiler turns the optimized graph from stage 1 into a series of loop nests and performs layout, tiling, and loop fusion optimizations.
Hardware intrinsics mapping. The Graph Compiler maps the architecture-agnostic loop nests from stage 2 into architecture-specific instructions.
Hardware-specific optimizations. These optimizations are mainly done at the instruction level in the Graph Compiler’s back-end, with a key goal of reducing memory pressure and improving instruction-level parallelism. For example, memory allocation and instruction scheduling are done in this stage.
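For reference, this framework-level path can be exercised end to end from PyTorch via the Neuron PyTorch integration. The snippet below is a minimal sketch: the model and shapes are placeholders, and it assumes torch_neuronx.trace is available to hand the model to the Neuron Graph Compiler and return a module backed by the generated NEFF.

```python
import torch
import torch_neuronx

# A small model defined purely at the framework level.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
).eval()

example_input = torch.rand(1, 128)

# trace() sends the model through the Graph Compiler stages described
# above and returns a module that executes the resulting NEFF on a
# NeuronDevice.
neuron_model = torch_neuronx.trace(model, example_input)
output = neuron_model(example_input)
```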
NKI kernels bypass the first 3 stages through the specialized NKI Compiler, which translates kernel code directly into IRs (intermediate representations) that the Neuron Compiler’s back-end can immediately process. The NKI Compiler serves as a critical bridge, converting high-level NKI code into optimized low-level representations while preserving developer-specified optimizations. This direct path to lower-level compilation provides significant performance advantages and preserves fine-grained control.
Advanced features in NKI, such as direct allocation, further enable programmers to bypass specific compiler passes in stage 4, giving developers precise control over NeuronDevices down to the instruction level. The NKI Compiler’s targeted optimizations complement the Neuron Compiler’s back-end capabilities, creating a powerful toolchain for hardware-specific acceleration. For optimal kernel performance, Neuron strongly recommends studying the underlying hardware architecture before optimization.
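As a quick way to check kernel behavior before tuning it for the hardware, NKI also ships a host-side simulator. The sketch below reuses the nki_tensor_add_kernel from the earlier example and assumes nki.simulate_kernel accepts NumPy inputs, as in the NKI beta examples; verify the exact entry point against your installed Neuron release.

```python
import numpy as np
import neuronxcc.nki as nki

# nki_tensor_add_kernel is the kernel sketched earlier on this page.
# Inputs are sized so the whole tensor fits in one tile (first dim <= 128).
a = np.random.rand(128, 512).astype(np.float32)
b = np.random.rand(128, 512).astype(np.float32)

# Run the kernel in simulation on the host; on a NeuronDevice the same
# kernel is compiled directly to back-end IR, bypassing the graph-level
# stages described above.
c = nki.simulate_kernel(nki_tensor_add_kernel, a, b)
np.testing.assert_allclose(c, a + b)
```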
Explore the comprehensive guides below to learn how to optimize your kernels for AWS Neuron hardware: