This document is relevant for: Trn2, Trn3
NKI Guides
This section provides hands-on tutorials for the Neuron Kernel Interface (NKI), demonstrating how to write custom kernels for AWS Trainium and Inferentia instances. These tutorials cover fundamental operations, advanced techniques, and distributed computing patterns using NKI.
Tutorials
Learn the fundamentals of implementing matrix multiplication in your NKI kernels.
Implement efficient 2D matrix transpose operations using NKI.
Create custom 2D average pooling kernels for computer vision workloads.
Implement fused Mamba state space model kernels.
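The matrix multiplication tutorial above centers on tiling: splitting the operands into fixed-size tiles, multiplying tile pairs, and accumulating partial products. As a minimal sketch of that loop structure, here is a plain NumPy reference that runs on any CPU. The tile size of 128 is an illustrative assumption (chosen to echo the 128-partition on-chip memory described in the architecture guides); this is not NKI API code, just the same tiling pattern expressed in NumPy.

```python
import numpy as np

TILE = 128  # illustrative tile size, not an NKI constant

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """CPU reference for a tiled matmul: for each output tile,
    accumulate partial products over the contraction dimension.
    Mirrors the load-tile / multiply / accumulate / store-tile
    structure a tiled kernel would use."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, "inner dimensions must match"
    c = np.zeros((M, N), dtype=a.dtype)
    for m in range(0, M, TILE):
        for n in range(0, N, TILE):
            acc = np.zeros((min(TILE, M - m), min(TILE, N - n)),
                           dtype=a.dtype)
            for k in range(0, K, TILE):
                a_tile = a[m:m + TILE, k:k + TILE]  # "load" an A tile
                b_tile = b[k:k + TILE, n:n + TILE]  # "load" a B tile
                acc += a_tile @ b_tile              # accumulate partial product
            c[m:m + TILE, n:n + TILE] = acc         # "store" the finished tile
    return c
```

The result matches `a @ b`; the tutorial shows how the same three nested loops map onto NKI's tile load, compute, and store primitives.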
Architecture Guides
Neuron recommends that new NKI developers start with the Trainium/Inferentia2 Architecture Guide before exploring the newer NeuronDevice architectures.
Trainium/Inferentia2 Architecture Guide
Foundational architecture guide for understanding NeuronDevice basics.
Trainium2 Architecture Guide
Architecture enhancements and improvements in the Trainium2 generation.
Trainium3 Architecture Guide
Latest architecture features and capabilities in Trainium3 devices.
How-To Guides
How to use the NKI CPU Simulator
Develop and debug NKI kernels on your CPU with no hardware required.
How to Insert NKI Kernels into Models
How to insert an NKI kernel as a custom operator into a PyTorch or JAX model, with simple code examples.
How to Use the NKI Scheduling APIs
Control instruction execution order using dependency edges and no-reorder blocks for kernel performance optimization.
Profiling an NKI Kernel with Neuron Explorer
Profile NKI kernels using Neuron Explorer to analyze hardware-level performance.