This document is relevant for: Inf2, Trn1, Trn1n
NKI Kernels
nki.kernels
The source code of the kernels in the neuronxcc.nki.kernels namespace is available in the GitHub repository nki-samples. They are optimized kernels written by the Neuron team that serve as samples. The repository also contains numeric tests and performance benchmarks, as well as scripts for using the kernels in real models.
You are welcome to customize them to fit your unique workloads, and to contribute to the repository by opening a pull request. Note that these kernels are already deployed as part of the Neuron stack. Taking flash attention as an example, compiling Llama models with transformers-neuronx automatically invokes the flash_fwd kernel listed here. Manually replacing the framework operators with these NKI kernels is therefore unlikely to yield additional performance.
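For instance, a kernel from this namespace can be wrapped for direct use from PyTorch. The sketch below is illustrative only: the nki_jit wrapper from torch_neuronx and the import path shown are assumptions based on the nki-samples repository at the time of writing, and in recent releases the sample kernels already carry the @nki.jit decorator, which makes the explicit wrap unnecessary. Check the repository README and the kernel docstrings for the exact signatures in your Neuron release.

```python
# Illustrative sketch only, not the definitive integration path.
# Assumptions: torch_neuronx provides nki_jit, and flash_fwd is importable
# from the neuronxcc.nki.kernels namespace; verify both against your release.
import torch
import torch_neuronx  # registers the Neuron XLA device with PyTorch
from torch_neuronx import nki_jit
from neuronxcc.nki.kernels import flash_fwd

# Wrap the sample kernel so it can be called on PyTorch/XLA tensors.
flash_fwd_torch = nki_jit(flash_fwd)

# The wrapped kernel is then called inside a model that is traced/compiled
# for a Neuron device; argument order, shapes, and dtypes must match the
# kernel's docstring and the repository's unit tests.
```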
Please see the README of the GitHub repository nki-samples for more details.
| Kernel | Description |
|---|---|
| fused_self_attn_for_SD_small_head_size | Fused self attention kernel for small head size Stable Diffusion workload. |
| flash_attn_bwd | Flash attention backward kernel. |
| resize_nearest_fixed_dma_kernel | Resize the input image to the given size using the nearest interpolation mode. |
| flash_fwd | Flash Attention forward kernel. |
| select_and_scatter_kernel | Implementation of a select-and-scatter kernel. |
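Since the repository ships numeric tests, one lightweight way to sanity-check any of the kernels above off-device is NKI's built-in simulator. The sketch below is a hedged example: nki.simulate_kernel is the documented simulation entry point, but the in/out argument convention and the NHWC layout assumed here for resize_nearest_fixed_dma_kernel may differ across releases, so verify against the kernel's docstring before relying on it.

```python
# Illustrative sketch: run a sample kernel under NKI's numpy-based simulator.
# The resize kernel's exact signature (in/out arguments vs. a returned tensor)
# and the NHWC layout below are assumptions; check the docstring in nki-samples.
import numpy as np
import neuronxcc.nki as nki
from neuronxcc.nki.kernels import resize_nearest_fixed_dma_kernel

in_image = np.random.rand(1, 16, 16, 3).astype(np.float32)   # assumed NHWC
out_image = np.zeros((1, 32, 32, 3), dtype=np.float32)       # 2x upsample

# Simulation executes on host numpy tensors; no Neuron device is required.
nki.simulate_kernel(resize_nearest_fixed_dma_kernel, in_image, out_image)
```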