This document is relevant for: Inf2, Trn1, Trn2
NKI Known Issues
This document outlines some of the known issues and limitations for the NKI beta release.
Unsupported Syntax:
Top-level tensors must be on HBM. The input and output tensors of the top-level NKI kernel (the kernel function decorated with nki_jit/nki.baremetal or called by JAX nki_call) must be located in HBM. We currently do not support using tensors stored in SBUF or PSUM as the input or output of the top-level kernel. Tensors must be loaded from HBM into SBUF before use, and output tensors must be stored from SBUF back into HBM. See nl.load and nl.store.
Indexing:
A tile on SBUF/PSUM must have at least 2 dimensions, as described here. Using a 1D tile on SBUF/PSUM may produce an “Insufficient rank” error. Work around this by creating a 2D tile, e.g.,
buf = nl.zeros((128, ), dtype=dtype, buffer=nl.sbuf) # this won't work
buf = nl.zeros((128, 1), dtype=dtype, buffer=nl.sbuf) # this works
Users must index their [N, 1] or [1, M] shaped 2D buffers with both indices: use my_sbuf[0:N, 0] or my_sbuf[0, 0:M] to access them, since 1D access such as my_sbuf[0:N] won’t work.
Use nl.arange for indirect load/store access indexing; nl.mgrid won’t work. See code examples in nl.load and nl.store.
If indexing with [0, 0] gets internal errors, try using [0:1, 0:1] or nl.mgrid[0:1, 0:1] instead.
If indexing with [0:1, ...] gets internal errors, try using [0, ...] instead.
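The rank rule above can be checked before a buffer is allocated. The following is a minimal sketch in plain Python, not NKI code: `as_tile_shape` is a hypothetical helper introduced here for illustration. It promotes a 1D shape to 2D by appending a free dimension of size 1, mirroring the `(128, )` to `(128, 1)` workaround.

```python
def as_tile_shape(shape):
    """Return a shape with at least 2 dimensions.

    Hypothetical helper, not part of NKI: a 1D shape (N,) becomes (N, 1),
    mirroring the workaround of allocating an [N, 1] tile instead of [N].
    Shapes that already have 2 or more dimensions are returned unchanged.
    """
    shape = tuple(shape)
    if len(shape) == 0:
        raise ValueError("a tile needs at least one dimension")
    if len(shape) == 1:
        return (shape[0], 1)
    return shape


print(as_tile_shape((128,)))      # (128, 1)
print(as_tile_shape((128, 512)))  # (128, 512)
```

The returned shape could then be passed as the shape argument of nl.zeros. A buffer allocated as [N, 1] still has to be indexed with both indices, e.g. buf[0:N, 0].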
Masks conjunction: Use & to combine masks. We do not support using and for masks. See examples in NKI API Masking.
nisa.bn_stats does not support a mask on the reduce dimension: the mask passed to bn_stats must not contain any indices from the reduction dimension.
Partition dimension broadcasting is not supported on operator overloads (i.e., +, -, *, /, <<, >>, etc.); use nki.language APIs instead (e.g., nl.add, nl.multiply, …).
When direct allocation API is used, non-IO HBM tensors are not supported.
All tensors declared with buffer=nl.shared_hbm must be returned as the result of the kernel.
Tensors declared with buffer=nl.hbm or buffer=nl.private_hbm are not allowed.
An error “[NKI005] (float32 [128, 512] %'<name of the hbm tensor>':5)0: DRAM location of kind Internal mapping failed. Only input/output/const DRAM location is supported!” will be thrown when such a tensor is encountered.
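The mask conjunction rule listed above follows from a general Python language rule: `&` can be overloaded through `__and__`, while `and` cannot be overloaded and instead asks the left operand for a single truth value. The sketch below uses a toy `Mask` class to show the difference. The class is hypothetical and is not how NKI implements masks; how NKI itself reacts to `and` is not shown here.

```python
class Mask:
    """Toy symbolic mask (hypothetical, not NKI's implementation)."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # `a & b` dispatches here, so both conditions are kept.
        return Mask(f"({self.expr}) & ({other.expr})")

    def __bool__(self):
        # `a and b` calls bool(a) first; a symbolic mask has no single truth value.
        raise TypeError("use '&' to combine masks, not 'and'")


row_ok = Mask("i_p < 64")
col_ok = Mask("i_f < 256")

combined = row_ok & col_ok
print(combined.expr)  # (i_p < 64) & (i_f < 256)

try:
    row_ok and col_ok
except TypeError as err:
    print(err)  # use '&' to combine masks, not 'and'
```

In this toy model the `and` form fails loudly. A class that returned a plain truth value from `__bool__` would instead silently drop one of the two conditions.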
Unexpected Behavior:
Simulation using nki.simulate_kernel:
Custom data types like nl.float32r, nl.bfloat16, nl.float8_e4m3, and nl.float8_e5m2 simulate in fp32 precision. Also, NumPy API calls outside of the NKI kernel, such as np.allclose, may not work with the above types.
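One practical consequence of simulating in fp32, assuming the device rounds results to the narrower type, is that a simulated value can carry more precision than the hardware would produce. The sketch below gives a feel for the size of that gap for bfloat16 using only the standard library. `to_bfloat16` is a hypothetical helper, not an NKI or NumPy function, and it ignores NaN and infinity handling.

```python
import struct


def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    Hypothetical helper for illustration only. bfloat16 keeps the upper
    16 bits of the float32 encoding; this rounds to nearest, ties to even.
    NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


print(to_bfloat16(1.001))    # 1.0
print(to_bfloat16(3.14159))  # 3.140625
```

When comparing simulated output against a reference, tolerances sized for bfloat16 spacing (about 2**-7 relative near 1.0) are a safer starting point than fp32-sized tolerances.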
Execution:
Profiler:
When using neuron-profile, use the flag --disable-dge to work around a temporary issue with DMA information. See the Profile using neuron-profile section for more details.
Optimization:
Users need to declare their NKI buffers as small as possible to avoid buffer overflow errors. An error “[GCA046] Some infinite-cost nodes remain” may mean there’s a buffer overflow; work around this by creating smaller local buffers.
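A rough size estimate can help spot oversized local buffers before compiling. The sketch below is plain Python, not an NKI API. `bytes_per_partition`, `fits`, and `ASSUMED_BUDGET_BYTES` are hypothetical names, and the budget is an assumed placeholder that should be replaced with the real on-chip capacity of the target device. The compiler's actual allocation may also differ from this simple sum.

```python
from math import prod

# ASSUMPTION: placeholder per-partition budget in bytes, not an official figure.
ASSUMED_BUDGET_BYTES = 192 * 1024


def bytes_per_partition(shape, itemsize):
    """Bytes a tile occupies in each partition.

    Assumes the first dimension is the partition dimension and the remaining
    (free) dimensions are stored within each partition.
    """
    return prod(shape[1:]) * itemsize


def fits(tiles, budget=ASSUMED_BUDGET_BYTES):
    """Return (fits_in_budget, total_bytes) for a list of (shape, itemsize)."""
    total = sum(bytes_per_partition(shape, itemsize) for shape, itemsize in tiles)
    return total <= budget, total


small = [((128, 512), 4), ((128, 512), 4)]      # two fp32 tiles
large = [((128, 32768), 4), ((128, 32768), 4)]  # two much wider fp32 tiles

print(fits(small))  # (True, 4096)
print(fits(large))  # (False, 262144)
```

If the estimate exceeds the budget, splitting the free dimension into smaller tiles and processing them in a loop is one way to shrink each local buffer.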
Compiler passes:
NKI ISA API calls may not map one-to-one to the generated hardware ISA instructions. The compiler may add instructions to support these calls.