This document is relevant for: Inf2, Trn1, Trn1n
NKI Known Issues
This document outlines some of the known issues and limitations for the NKI beta release.
Unsupported Syntax:
Top-level tensors must be on HBM. The input and output tensors of the top-level NKI kernel (the kernel function decorated with nki_jit/nki.baremetal or called by JAX nki_call) must be located in HBM. We currently do not support using tensors stored in SBUF or PSUM as the input or output of the top-level kernel. Tensors must be loaded from HBM into SBUF before use, and output tensors must be stored from SBUF back into HBM. See nl.load and nl.store.
Top-level input and output tensors must be distinct. We do not support reading from and writing to the same tensor. See the corresponding error message for more info.
Indexing:
Tiles on SBUF/PSUM must have at least 2 dimensions, as described here. Using a 1D tile on SBUF/PSUM may produce an “Insufficient rank” error. Work around this by creating a 2D tile, e.g.:
    buf = nl.zeros((128, ), dtype=dtype, buffer=nl.sbuf)   # this won't work
    buf = nl.zeros((128, 1), dtype=dtype, buffer=nl.sbuf)  # this works
Users must index their [N, 1] or [1, M] shaped 2D buffers with both indices: use my_sbuf[0:N, 0] or my_sbuf[0, 0:M] to access them, since 1D access such as my_sbuf[0:N] won’t work.
Use nl.arange for indirect load/store access indexing; nl.mgrid won’t work. See code examples in nl.load and nl.store.
If indexing with [0, 0] gets internal errors, try using [0:1, 0:1] or nl.mgrid[0:1, 0:1] instead.
If indexing with [0:1, ...] gets internal errors, try using [0, ...] instead.
Masks conjunction: Use & to combine masks. We do not support using and for masks. See examples in NKI API Masking.
nisa.bn_stats does not support a mask on the reduction dimension: the mask sent to bn_stats must not contain any indices from the reduction dimension.
Partition dimension broadcasting is not supported on operator overloads (i.e., +, -, *, /); use nki.language APIs instead (e.g., nl.add, nl.multiply, …).
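The mask conjunction rule above follows a general Python pattern that can be sketched without NKI. A class can overload & through __and__, but and cannot be overloaded: it first reduces its left operand to a single boolean. The Mask class below is a hypothetical stand-in for a symbolic mask expression, not an NKI type, and it is an assumption that NKI masks behave along these lines internally.

```python
class Mask:
    """Hypothetical symbolic mask expression (illustration only, not an NKI type)."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # `a & b` dispatches here, so both conditions are kept in the result.
        return Mask(f"({self.expr}) & ({other.expr})")

    def __bool__(self):
        # `a and b` calls bool(a) first; a symbolic mask has no single truth value.
        # Without this method, `a and b` would silently evaluate to `b` alone.
        raise TypeError("a symbolic mask has no single truth value; use & instead of 'and'")


row_ok = Mask("i_p < 64")
col_ok = Mask("i_f < 256")

combined = row_ok & col_ok
print(combined.expr)  # (i_p < 64) & (i_f < 256)

try:
    row_ok and col_ok
except TypeError as err:
    print("'and' failed:", err)
```

In this sketch, & keeps both conditions while and never reaches the second operand, which is why & is the operator to use when combining masks.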
Unexpected Behavior:
Simulation using nki.simulate_kernel:
Custom data types like nl.float32r, nl.bfloat16, and nl.float8_e4m3 simulate in fp32 precision. Also, NumPy API calls outside of the NKI kernel, such as np.allclose, may not work with the above types.
nl.rand generates the same values for subsequent calls to nl.rand().
nl.random_seed is a no-op in simulation.
nisa.dropout is a no-op in simulation.
Masks don’t work in simulation, and garbage data is generated in tensor elements that are supposed to be untouched based on API masking.
Execution:
Profiler:
When using neuron-profile, use the flag --disable-dge to work around a temporary issue with DMA information. See the Profile using neuron-profile section for more details.
Optimization:
Users need to declare their NKI buffers as small as possible to avoid buffer overflow errors. An error “[GCA046] Some infinite-cost nodes remain” may mean there’s a buffer overflow; work around this by creating smaller local buffers.
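One way to keep local buffers small is to process a large dimension in fixed-size chunks, so that each buffer holds one chunk rather than the whole dimension. The sketch below is plain Python and only computes index ranges; tile_ranges and the sizes used are hypothetical, and a suitable chunk size for a given kernel is an assumption left to the user.

```python
def tile_ranges(total, tile_size):
    """Yield (start, stop) pairs covering range(total) in chunks of at most tile_size."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for start in range(0, total, tile_size):
        # The last chunk is clipped so it never runs past `total`.
        yield start, min(start + tile_size, total)


# Hypothetical sizes: a dimension of 5000 elements processed 2048 at a time.
chunks = list(tile_ranges(5000, 2048))
print(chunks)  # [(0, 2048), (2048, 4096), (4096, 5000)]
```

Each pair bounds one slice, so a buffer only needs to hold stop - start elements of that dimension at a time.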
Compiler passes:
NKI ISA API calls may not map one-to-one to generated hardware ISA instructions. The compiler may add instructions to support these calls.
The engine param of the NKI ISA nisa.nc_transpose API may not be respected in some corner cases, such as when the transpose is merged with load/store into intermediate operations during compilation.