This document is relevant for: Trn2, Trn3

NKI Guides#

This section provides hands-on tutorials for the Neuron Kernel Interface (NKI), demonstrating how to write custom kernels for AWS Trainium and Inferentia instances. These tutorials cover fundamental operations, advanced techniques, and distributed computing patterns using NKI.

Tutorials#

Matrix Multiplication

Learn the fundamentals of implementing matrix multiplication in your NKI kernels.
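As a plain-Python point of reference before diving into the tutorial, the tiling pattern that matmul kernels are built around can be sketched in NumPy (the function name and tile size here are illustrative; an actual NKI kernel expresses the same loop structure with on-chip tiles and explicit loads/stores rather than array slices):

```python
import numpy as np

def tiled_matmul(a, b, tile=128):
    """Reference tiled matmul: accumulate each C tile from A/B tiles,
    mirroring how a kernel stages tiles through on-chip memory."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):            # rows of C
        for j in range(0, n, tile):        # columns of C
            for p in range(0, k, tile):    # contraction dimension
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c
```

The result matches `a @ b`; only the traversal order differs, which is what the tutorial's hardware version optimizes.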

Transpose 2D

Implement efficient 2D matrix transpose operations using NKI.
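For orientation, the tile-by-tile access pattern a transpose kernel uses can be sketched in NumPy (names and the tile size are illustrative, not the tutorial's code):

```python
import numpy as np

def tiled_transpose(x, tile=128):
    """Reference 2D transpose done one tile at a time: read a
    (tile x tile) block, write its transpose to the mirrored location."""
    m, n = x.shape
    out = np.empty((n, m), dtype=x.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            out[j:j+tile, i:i+tile] = x[i:i+tile, j:j+tile].T
    return out
```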

Average Pooling 2D

Create custom 2D average pooling kernels for computer vision workloads.
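As a correctness reference for the kernel built in that tutorial, non-overlapping 2D average pooling reduces to a reshape-and-mean in NumPy (a minimal sketch for a single-channel input; the function name is illustrative):

```python
import numpy as np

def avg_pool_2d(x, pool):
    """Reference non-overlapping average pooling over a 2D array.
    Trailing rows/columns that do not fill a full window are dropped."""
    h, w = x.shape
    ph, pw = h // pool, w // pool
    # View the image as (ph, pool, pw, pool) blocks and average each block.
    return x[:ph * pool, :pw * pool].reshape(ph, pool, pw, pool).mean(axis=(1, 3))
```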

Fused Mamba

Implement fused Mamba state space model kernels.
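For context, the sequential recurrence at the heart of Mamba's selective state space model can be sketched naively in NumPy (symbols follow the usual SSM notation with a diagonal state matrix; this is a reference scan, not the fused kernel the tutorial develops):

```python
import numpy as np

def ssm_scan(a, b_u, c):
    """Naive sequential scan for a diagonal state space model:
        h_t = a_t * h_{t-1} + (B u)_t ,   y_t = <c_t, h_t>.
    a, b_u, c have shape (T, N); returns y of shape (T,)."""
    T, N = a.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + b_u[t]   # elementwise state update (diagonal A)
        y[t] = c[t] @ h         # readout
    return y
```

The fused-kernel tutorial is about computing this recurrence without materializing every intermediate state in off-chip memory.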

Architecture Guides#

Neuron recommends that new NKI developers start with the Trainium/Inferentia2 Architecture Guide before exploring the newer NeuronDevice architectures.

Trainium/Inferentia2 Architecture Guide

Foundational architecture guide for understanding NeuronDevice basics.

Trainium2 Architecture Guide

Architecture enhancements and improvements in the Trainium2 generation.

Trainium3 Architecture Guide

Latest architecture features and capabilities in Trainium3 devices.

How-To Guides#

How to use the NKI CPU Simulator

Develop and debug NKI kernels on your CPU with no hardware required.

How to Insert NKI Kernels into Models

How to insert an NKI kernel as a custom operator into a PyTorch or JAX model, with simple code examples.

How to Use the NKI Scheduling APIs

Control instruction execution order using dependency edges and no-reorder blocks for kernel performance optimization.

Profiling an NKI Kernel with Neuron Explorer

Profile NKI kernels using Neuron Explorer to analyze hardware-level performance.
