This document is relevant for: Inf2, Trn1, Trn2
NKI Known Issues
This document outlines some of the known issues and limitations for the NKI beta release.
Unsupported Syntax:
Top-level tensors must be on HBM. The input and output tensors of the top-level NKI kernel (the kernel function decorated with nki_jit/nki.baremetal or called by JAX nki_call) must be located in HBM. We currently do not support using tensors stored in SBUF or PSUM as the input or output of the top-level kernel. Tensors must be loaded from HBM into SBUF before use, and output tensors must be stored from SBUF back into HBM. See nl.load and nl.store.
Indexing:
  Tiles on SBUF/PSUM must have at least 2 dimensions, as described here. Using a 1D tile on SBUF/PSUM may produce an “Insufficient rank” error. Work around this by creating a 2D tile, e.g.:
      buf = nl.zeros((128, ), dtype=dtype, buffer=nl.sbuf)   # this won't work
      buf = nl.zeros((128, 1), dtype=dtype, buffer=nl.sbuf)  # this works
  Users must index their [N, 1] or [1, M] shaped 2D buffers with both indices: use my_sbuf[0:N, 0] or my_sbuf[0, 0:M] to access them, since 1D access such as my_sbuf[0:N] won’t work.
  Use nl.arange for indirect load/store access indexing; nl.mgrid won’t work. See code examples in nl.load and nl.store.
  If indexing with [0, 0] gets internal errors, try using [0:1, 0:1] or nl.mgrid[0:1, 0:1] instead.
  If indexing with [0:1, ...] gets internal errors, try using [0, ...] instead.
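The two indexing workarounds above swap between integer indices and length-1 slices. The sketch below uses a plain NumPy array as a stand-in to show how those two forms differ in the shape they produce. It is not NKI code: NKI tiles are not NumPy arrays, and the assumption here is only that the shape rules for integer versus slice indexing are analogous.

```python
import numpy as np

N = 128
# Stand-in for an [N, 1] buffer; in NKI the first axis would be the partition axis.
buf = np.zeros((N, 1), dtype=np.float32)

col = buf[0:N, 0]        # both indices given; the integer index drops the second axis
elem_2d = buf[0:1, 0:1]  # length-1 slices keep both axes
elem_0d = buf[0, 0]      # two integer indices leave no axes at all

print(col.shape, elem_2d.shape, np.shape(elem_0d))
```

The slice form keeps a rank-2 result while the integer form reduces rank, which is one plausible reason the two forms can behave differently when one of them hits an internal error.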
Masks conjunction: Use & to combine masks. We do not support using and for masks. See examples in NKI API Masking.
nisa.bn_stats does not support masks on the reduction dimension: the mask passed to bn_stats must not contain any indices from the reduction dimension.
Partition dimension broadcasting is not supported on operator overloads (i.e., +, -, *, /, <<, >>, etc.); use nki.language APIs instead (e.g., nl.add, nl.multiply, …).
When the direct allocation API is used, non-IO HBM tensors are not supported.
  All tensors declared with buffer=nl.shared_hbm must be returned as the result of the kernel.
  Tensors declared with buffer=nl.hbm or buffer=nl.private_hbm are not allowed.
  An error “[NKI005] (float32 [128, 512] %'<name of the hbm tensor>':5)0: DRAM location of kind Internal mapping failed. Only input/output/const DRAM location is supported!” will be thrown when such a tensor is encountered.
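The mask-conjunction rule in this section (combine masks with & rather than and) has a close NumPy analogue, sketched below. This is a stand-in only: in an NKI kernel the index expressions would typically be built with nl.arange and the result passed through an API's mask parameter, and the exact failure NKI reports for and may differ from the NumPy exception shown here.

```python
import numpy as np

# Index grids in the same style NKI kernels commonly use (NumPy stand-ins).
i_p = np.arange(128)[:, None]   # first-axis indices, shape (128, 1)
i_f = np.arange(512)[None, :]   # second-axis indices, shape (1, 512)

# Elementwise conjunction: broadcasts to a (128, 512) boolean mask.
mask = (i_p < 64) & (i_f < 256)

# Python's `and` asks for a single truth value of the left operand,
# which NumPy refuses for multi-element arrays.
try:
    bad = (i_p < 64) and (i_f < 256)
    and_failed = False
except ValueError:
    and_failed = True

print(mask.shape, int(mask.sum()), and_failed)
```

The & operator is evaluated per element, whereas and is a control-flow construct that cannot be overloaded, which is why array-style mask expressions generally rely on &.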
Unexpected Behavior:
Simulation using nki.simulate_kernel:
  Custom data types like nl.float32r, nl.bfloat16, nl.float8_e4m3, and nl.float8_e5m2 simulate in fp32 precision. Also, NumPy API calls outside of the NKI kernel, such as np.allclose, may not work with the above types.
Execution:
Profiler:
  When using neuron-profile, use the flag --disable-dge to work around a temporary issue with DMA information. See the Profile using neuron-profile section for more details.
Optimization:
  Users need to declare their NKI buffers as small as possible to avoid buffer overflow errors. An error “[GCA046] Some infinite-cost nodes remain” may mean there’s a buffer overflow; work around this by creating smaller local buffers.
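When shrinking local buffers, a rough per-partition footprint estimate can help decide which tiles to cut first. The helper below is a hypothetical sketch, not part of NKI. It assumes the first tile dimension is the partition dimension and that each partition holds the remaining dimensions contiguously. The 192 KiB capacity is an assumed figure that should be checked against the documentation for the target device, and the compiler's real allocation may differ from this simple sum.

```python
import math
import numpy as np

# Assumed SBUF capacity per partition; verify for your device.
SBUF_BYTES_PER_PARTITION = 192 * 1024

def tile_bytes_per_partition(shape, dtype):
    """Hypothetical helper: bytes a tile needs in each partition,
    taken as (product of non-partition dims) * (element size)."""
    free_elems = math.prod(shape[1:])
    return free_elems * np.dtype(dtype).itemsize

# Example tiles a kernel might declare (illustrative shapes only).
tiles = [((128, 512), np.float32), ((128, 2048), np.float32)]
total = sum(tile_bytes_per_partition(s, d) for s, d in tiles)

print(total, total <= SBUF_BYTES_PER_PARTITION)
```

If the estimated total approaches the assumed capacity, reducing the free-dimension size of the largest tiles is the most direct way to lower it.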
Compiler passes:
  NKI ISA API calls may not map one-to-one to generated hardware ISA instructions. The compiler may insert additional instructions to support these calls.