This document is relevant for: Trn2, Trn3

NKI Deep Dives

This section provides in-depth technical documentation and guides for advanced users of the Neuron Kernel Interface (NKI). These deep dives explain NKI concepts, programming patterns, and best practices in detail to help you get the most performance out of your NKI code on AWS Neuron devices.

Optimizing a NKI Kernel

NKI Performance Optimizations

Advanced NKI Programming

MXFP4/8 Matrix Multiplication Guide

Perform matrix multiplication with MXFP4 and MXFP8 data types in NKI kernels, including data layout, quantization, and tiling strategies.

NKI Compiler

Learn about the NKI Compiler.

NKI Dynamic Loops

Use dynamic loops with runtime-determined trip counts via hardware loop instructions.

Descriptor Generation Engine (DGE)

Control how DMA descriptors are generated: pre-computed, software (GpSimd), or hardware DGE.

DMA Bandwidth Guide

Guidelines for maximizing DMA bandwidth with large contiguous payloads.

NKI Access Patterns

Learn how to use Access Patterns (APs) to specify directly how the Trainium hardware accesses tensors.

Additional NKI Information