This document is relevant for: Trn2, Trn3

NKI Language Guide#

The Neuron Kernel Interface (NKI) language is designed for writing kernel functions to accelerate machine learning workloads on Trainium devices. This guide is an introduction to the NKI language and the key concepts you will need to know to program in NKI effectively.

Let us start by looking at a simple NKI function.

@nki.jit
def nki_tensor_add_kernel(a_input, b_input):
    """
    NKI kernel to compute element-wise addition of two input tensors.
    """

    # Check both input tensor shapes/dtypes are the same for element-wise operation.
    assert a_input.shape == b_input.shape
    assert a_input.dtype == b_input.dtype

    print(f"adding tensors of type {a_input.dtype} and shape {a_input.shape}")

    # Check the first dimension's size to ensure it does not exceed on-chip
    # memory tile size, since this simple kernel does not tile inputs.
    assert a_input.shape[0] <= nl.tile_size.pmax

    # Allocate space for the input tensors in SBUF and copy the inputs from HBM
    # to SBUF with DMA copy.
    a_tile = nl.ndarray(shape=a_input.shape, dtype=a_input.dtype, buffer=nl.sbuf)
    nisa.dma_copy(dst=a_tile, src=a_input)

    b_tile = nl.ndarray(shape=b_input.shape, dtype=b_input.dtype, buffer=nl.sbuf)
    nisa.dma_copy(dst=b_tile, src=b_input)

    # Allocate space for the result and use tensor_tensor to perform
    # element-wise addition. Note: the first argument of 'tensor_tensor'
    # is the destination tensor.
    c_tile = nl.ndarray(shape=a_input.shape, dtype=a_input.dtype, buffer=nl.sbuf)
    nisa.tensor_tensor(dst=c_tile, data1=a_tile, data2=b_tile, op=nl.add)

    # Create a tensor in HBM and copy the result into HBM.
    c_output = nl.ndarray(dtype=a_input.dtype, shape=a_input.shape, buffer=nl.shared_hbm)
    nisa.dma_copy(dst=c_output, src=c_tile)

    # Return kernel output as function output.
    return c_output

Important

The first thing you may notice about this NKI function is that it looks very much like a Python function. In fact, all NKI functions are syntactically valid Python functions. However, it is important to understand that NKI functions are not Python functions: they will be compiled by the NKI compiler and run on the Trainium accelerator. Because of this, not all Python constructs and libraries are supported within a NKI function.

The second thing to notice is that NKI has a sequential programming model. This means that the logical order of operations follows the syntactic order of the statements in the function. As you learn more about the Trainium hardware, you will see that the hardware can often do many things at the same time across the different compute engines on the Trainium devices. When we compile NKI functions, we will respect the sequential order of operations written by the programmer. The compiler may reorder operations that have no data dependencies, but this is functionally transparent to NKI programmers. Later you will see how to control which engines operations run on and even how to influence the ordering of operations with no data dependencies for better performance, but all of this is done in the context of the sequential ordering of the code.

The third thing to notice about this simple function is that it has a print statement. You may be wondering: When does this print happen? Does the Trainium hardware output a string, and if so, where does it go? What about all those different engines we just talked about and the sequential ordering? The answer to these questions reveals a very important aspect of NKI programming: the print is evaluated by the compiler at compile time, not at runtime. So, when you compile this NKI function, the NKI compiler will output a string like:

adding tensors of type float16 and shape (128, 512)

However, when we run this compiled function on Trainium devices they will not output anything. This is usually what you want. The compiler gives important debugging information during compilation, but when you deploy your function across 1000 Trainium devices, they will not waste any time generating debug output.

Note: There is a special print function, device_print, that does run on the Trainium devices and can be used if this is really what you need. See the API references for more information.

We have just seen that the print statement is evaluated at compile-time, and not at runtime. In fact, most things in NKI programs are evaluated at compile time. In general, calls to nki.isa.* functions will result in on-device operations, and (almost) all other things will be evaluated by the compiler at compile time. We will discuss some exceptions to this rule below, but for now it is generally the case that only the nki.isa.* calls result in run-time operations, and everything else is evaluated by the compiler at compile-time.

This leads us to our last observation about NKI functions. The nki.isa.* APIs are the heart of the matter. These APIs are designed to expose the underlying hardware capabilities in as direct a way as possible. If you call a nki.isa function, then the hardware will execute that operation at that point in the program. The NKI meta-programming language simply provides a convenient way to specify which ISA operations you want to run on your data.

In the rest of this guide we will focus on the NKI language, starting with the compilation model and namespaces, then the values you can manipulate in a NKI function. We will then cover tensor indexing, control flow, and end with a discussion of class support, interoperation with Python, and composable kernels.

Compilation Model#

When you decorate a function with @nki.jit and call it, the NKI compiler processes your kernel in three stages:

Specialization: The compiler takes your Python function and evaluates all meta-programming constructs. This includes resolving tensor shapes, unrolling loops, inlining function calls, and evaluating if-statements with compile-time conditions. The result is a specialized, flat sequence of nki.isa.* operations with all compile-time values resolved.
Compilation: The specialized program is lowered to Trainium machine code. This stage performs instruction scheduling, register allocation, and memory layout.
Graph-compiler linking: The compiled kernel is linked into the larger computation graph managed by the Neuron graph compiler, which handles data movement between the host and device.

The specialization stage is key to understanding NKI programming. During specialization, the compiler acts as an interpreter for the meta-programming parts of your kernel. Everything that is not a nki.isa.* call or a dynamic_range loop is evaluated and resolved at this stage. This means:

All for loops (except dynamic_range) are unrolled at specialization time. The compiler expands the loop body once for each iteration.
All function calls are inlined at specialization time. The compiler substitutes the function body at each call site.
All if statements with compile-time conditions are resolved at specialization time. Only the taken branch is included in the specialized program.
All Python expressions on compile-time values (integers, booleans, strings, shapes) are evaluated at specialization time.

The only constructs that survive specialization and become runtime operations are nki.isa.* calls and dynamic_range loops. Everything else is part of the meta-programming language that controls how the final sequence of ISA operations is generated.

Note

Throughout this documentation, we use the term NKI meta-programming language to refer to the Python subset that is evaluated at specialization time (loops, conditionals, function calls, and expressions on compile-time values), and NKI language to refer to the runtime primitives (nki.isa.* operations and dynamic_range loops) that execute on the device.

@nki.jit
def example_kernel(a_input):
    # Meta-programming: this loop is unrolled at specialization time
    for i in range(4):
        tile = nl.ndarray((128, 512), dtype=nl.float16, buffer=nl.sbuf)
        nisa.dma_copy(dst=tile, src=a_input[i * 128:(i + 1) * 128, :])
        # Meta-programming: this if is resolved at specialization time
        if i % 2 == 0:
            nisa.tensor_scalar(dst=tile, data=tile, op0=nl.add, operand0=1.0)

After specialization, this kernel becomes a flat sequence of dma_copy and tensor_scalar operations, with the loop and if-statement fully resolved.
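
One way to picture specialization is a toy tracer in plain Python: ordinary statements run immediately, while stand-ins for ISA calls only record themselves. The sketch below is not the NKI compiler or its API. The `dma_copy` and `tensor_scalar` functions here are hypothetical recorders that take strings in place of tensors, and `trace` is an assumed name for the recorded program.

```python
# Toy model of specialization: Python control flow runs right away,
# while "ISA" calls are only recorded for later execution.
trace = []

def dma_copy(dst, src):
    # Hypothetical recorder, not nisa.dma_copy.
    trace.append(("dma_copy", dst, src))

def tensor_scalar(dst, op, operand):
    # Hypothetical recorder, not nisa.tensor_scalar.
    trace.append(("tensor_scalar", dst, op, operand))

def example_kernel():
    for i in range(4):                      # unrolled while tracing
        tile = f"tile{i}"
        dma_copy(dst=tile, src=f"a_input[{i * 128}:{(i + 1) * 128}, :]")
        if i % 2 == 0:                      # resolved while tracing
            tensor_scalar(dst=tile, op="add", operand=1.0)

example_kernel()
op_names = [op[0] for op in trace]
```

In this model `op_names` holds six entries, four copies and two additions, and nothing of the loop or the if-statement remains in the recorded program.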

Kernel Caching#

NKI caches compiled kernels based on their input arguments (shapes, dtypes, and compile-time values). This means NKI kernels must be pure functions of their arguments — the kernel’s output must be determined solely by its input arguments. If a kernel’s behavior depends on external state such as global variables, closures over mutable objects, or side effects, the cache may return a stale compiled artifact and produce incorrect results.
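
The stale-artifact hazard can be illustrated with a small pure-Python cache. This is only a sketch of the failure mode, assuming a cache keyed on shape and dtype alone. The names `compile_cache`, `get_compiled`, and `SCALE` are hypothetical and do not correspond to NKI internals.

```python
compile_cache = {}
SCALE = 2.0  # external state that the cache key knows nothing about

def get_compiled(shape, dtype):
    """Return a cached 'compiled kernel' for the given shape and dtype."""
    key = (shape, dtype)
    if key not in compile_cache:
        baked = SCALE  # the global is baked in at "compile" time
        compile_cache[key] = lambda x: x * baked
    return compile_cache[key]

first = get_compiled((128, 512), "float16")
SCALE = 3.0  # changing the global does not change the cache key
second = get_compiled((128, 512), "float16")
stale_result = second(1.0)  # still multiplies by the old value
```

Because the second lookup hits the cache, `second` is the same object as `first` and still uses the old scale. Passing the scale as an explicit kernel argument would make it part of the key and avoid the problem.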

NKI Namespaces#

NKI is organized into several Python namespaces:

nki — The top-level package. Provides the @nki.jit decorator for compiling kernel functions.
nki.language (commonly imported as nl) — The high-level language API. This includes tensor creation (ndarray), data types, memory buffers, loop ranges (affine_range, dynamic_range), and high-level math operations (nl.add, nl.matmul, nl.softmax, etc.). Many of the functions in nki.language are convenience wrappers around one or more nki.isa operations.
nki.isa (commonly imported as nisa) — The low-level instruction set architecture API. Each function in this namespace maps directly to a Trainium hardware operation. These are the only calls that produce runtime operations on the device.
nki.collectives — APIs for multi-device collective communication operations such as all_reduce, all_gather, and collective_permute.

A typical NKI kernel imports these namespaces as follows:

import nki
import nki.language as nl
import nki.isa as nisa

The distinction between nki.language and nki.isa is important. When you call a nki.language function like nl.add(a, b), the compiler may lower this to one or more nki.isa operations depending on the tensor shapes and types. When you call a nki.isa function like nisa.tensor_tensor(...), you are directly specifying the hardware operation. Use nki.language for readability and portability; use nki.isa when you need precise control over which hardware engine executes an operation.

NKI Values#

The NKI language supports six types of values:

The special None value
Boolean values (True and False)
32-bit integer values
32-bit IEEE floating-point values
String literals
Tensors (on-device tensor memory)

In addition, NKI supports the following container types:

Tuples of any fixed length
Lists of arbitrary length
Dictionaries with string keys
Simple user-defined classes

NKI values and containers are very similar to their Python equivalents. For instance, you can use most of the Python standard list functions, and they work in the same way as in Python.

l = [1,2,3]    # create a list with 3 elements
l.append(4.1)  # append a value to the list
l.extend(("Hello", "List")) # extend list with multiple values
size = len(l) # return number of elements in list
third = l[2]  # get third element of list (index 2)

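# find the position of a specific value; index() returns 1 here
# (a match at position 0 would be falsy, so do not use it as a test)
pos = l.index(2)
print("2 is at index", pos)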

# remove a specific value from a list (if present)
l.remove(1)

# print out list in reverse order
l.reverse()
for x in l:
  print(x)

The NKI dictionary type is also similar to the Python version, but with the restriction that the keys must be string values.

d = dict() # create an empty dictionary
d['a'] = 1 # set a value in the dictionary

print(d.keys())  # print out keys in dictionary
print(d.items())  # print out (key, value) pairs in dictionary

# print out dictionary
for k in d.keys():
    v = d[k]
    print(k, v)

# remove 'a' from the dictionary; pop returns the removed value (1 here)
if d.pop('a'):
    print("removed 'a' from dictionary")

# fetch value of a, set to 2 if not present
a = d.setdefault('a', 2)

We will discuss user-defined classes later in the guide. For now, let’s take a close look at the most important value in NKI, the tensor.

Tensor Values#

The NkiTensor class represents an on-chip tensor. It has two parts: the storage (a buffer in memory described by shape, dtype, memory type, and memory location), and a view on top of that storage (a strided access pattern with up to 5 dimensions). Creation routines allocate fresh storage and hand back a tensor; view operations (slice, reshape, permute, view, ap, etc.) define a new access pattern on top of the same storage, without touching the underlying data. See the NkiTensor API reference for the full list of attributes and methods.

Note

Tensors represent on-device memory at runtime. At specialization time the compiler knows the shape, dtype, strides, and buffer, which is enough to reason about layout and schedule instructions, but it does not know the actual element values. Any NKI meta-programming expression that needs to touch tensor contents directly (nl.add(t, 5.0), device_print(t), …) is a kernel operation, not a specialization-time expression.

Anatomy of a tensor#

An NkiTensor is fully described by its shape, strides, offset, dtype, and buffer, plus the convenience attributes ndim / size and an optional debug name — all available at specialization time. See the NkiTensor API reference for what each attribute means.

The key idea is the strided view: given these attributes, the element addressed by an index tuple (i0, i1, …, i_{n-1}) sits at element position:

offset + sum_k (strides[k] * i_k)

inside the underlying storage.
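
The formula above can be written out as a short pure-Python sketch. The helper names `element_position` and `contiguous_strides` are illustrative and are not part of the NKI API. The second helper assumes a dense row-major layout, which matches the contiguous example in this section.

```python
def element_position(index, strides, offset=0):
    """Position of `index` in storage: offset + sum_k strides[k] * index[k]."""
    return offset + sum(s * i for s, i in zip(strides, index))

def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

strides = contiguous_strides((128, 64))
position = element_position((3, 5), strides)
```

For a (128, 64) tensor this yields strides (64, 1), so element (3, 5) sits at position 3 * 64 + 5 = 197.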

Example. A contiguous SBUF tensor:

t = nl.ndarray((128, 64), dtype=nl.float32, buffer=nl.sbuf)

assert t.shape   == (128, 64)
assert t.strides == (64, 1)
assert t.offset  == 0
assert t.dtype   == nl.float32
assert t.buffer  == nl.sbuf
assert t.ndim    == 2
assert t.size    == 128 * 64

The free-dim stride is 1 (consecutive elements in the free dimension are adjacent in memory). The partition-dim stride is 64.

Note

SBUF and PSUM tensors have a partition dimension at dim 0 that maps to the NeuronCore’s parallel partitions. The partition dimension places some restrictions on views, and moving data across partitions generally requires a physical operation rather than a free view; see the individual view methods for the exact rules and cross-partition data movement for background. HBM tensors have no partition dim and can be reshaped freely.

Querying layout#

Two predicates describe common layout questions:

NkiTensor.is_contiguous() returns True when the view covers its storage in dense row-major order. A fresh nl.ndarray is contiguous. reshape and view(dtype) require contiguity on the reshaped dimensions (see their individual docs). permute produces a non-contiguous view.
NkiTensor.is_indirect() returns True when the view uses runtime-resolved (dynamic) addressing. Indirect views are produced by select() with a tensor index, vector_select(), or ap() with scalar_offset / vector_offset. Once a view is indirect, the indirected dimension cannot be sliced or selected again. Use is_indirect() to guard against chaining these operations.

NkiTensor.get_pattern() returns the view’s layout as a list of [stride, count] pairs, in the format .ap() accepts. It is useful when building a new .ap() pattern that reuses most of the current layout.

View primitives#

The view primitives all return a new NkiTensor that shares the input’s storage. The full catalogue with signatures lives in the NkiTensor API reference; in brief:

Indexing — t[...] (integer, slice, ellipsis, and tuple keys) and the explicit NkiTensor.slice().
Reshaping — NkiTensor.reshape(), plus the targeted NkiTensor.reshape_dim() / NkiTensor.flatten_dims().
Reordering and shaping — NkiTensor.permute(), NkiTensor.broadcast(), NkiTensor.expand_dim() / NkiTensor.squeeze_dim(), and einops-style NkiTensor.rearrange().
Indexing by value — NkiTensor.select() (static or dynamic), plus the lower-level NkiTensor.vector_select() and NkiTensor.ap() for gather and hardware-native access patterns (see the NKI Access Patterns deep-dive).
Reinterpret cast — NkiTensor.view() reinterprets the storage bits as a different dtype, rescaling the last (fastest-changing) dimension.

A few rules these primitives share are worth calling out here, since they follow from the hardware rather than from any single method:

The partition dimension is special. SBUF and PSUM tensors are always at least 2-D and keep their partition dim at position 0. Integer indexing on the partition dim keeps it at size 1 rather than dropping it; HBM tensors have no partition dim and drop an indexed dim as usual:

t = nl.ndarray((128, 64, 32), dtype=nl.float32, buffer=nl.sbuf)
t[0, :, :]            # shape (1, 64, 32) — partition dim stays at size 1
t[:, :, 0]            # shape (128, 64)  — free-dim integer index drops it

h = nl.ndarray((4, 128, 32), dtype=nl.float32, buffer=nl.shared_hbm)
h[0, :, :]            # shape (128, 32)  — HBM drops the indexed dim

The partition dimension (dim 0) places some restrictions on views: operations that would rearrange or reshape data across partitions are generally not expressible as a free view and instead need a physical operation to move data between partitions. The exact restrictions are noted on the individual view methods; see cross-partition data movement for background.

Contiguity. A view is contiguous when its elements occupy a dense, row-major span of the underlying storage — no gaps and no reordering. A freshly created tensor is contiguous; views that skip elements or reorder dimensions are not. Some view operations require contiguity on the dimensions they act on; see the individual methods for where.
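
As a rough illustration of the contiguity rule, the following pure-Python sketch tests whether a shape and strides pair describes a dense row-major span, and shows how a permutation breaks that property. It is a model of the definition above, not the implementation of NkiTensor.is_contiguous(). The helper names are hypothetical.

```python
def is_row_major_contiguous(shape, strides):
    """True when strides describe a dense row-major layout of `shape`."""
    expected = 1
    for extent, stride in zip(reversed(shape), reversed(strides)):
        # Dims of size 1 never advance, so their stride is irrelevant.
        if extent != 1 and stride != expected:
            return False
        expected *= extent
    return True

def permute(shape, strides, order):
    """Reorder dims without moving data: shape and strides move together."""
    return tuple(shape[d] for d in order), tuple(strides[d] for d in order)

fresh = is_row_major_contiguous((128, 4, 16), (64, 16, 1))
p_shape, p_strides = permute((128, 4, 16), (64, 16, 1), (0, 2, 1))
permuted = is_row_major_contiguous(p_shape, p_strides)
strided = is_row_major_contiguous((64, 64, 32), (4096, 64, 2))  # like t[:, :, ::2]
```

In this model the fresh layout is contiguous, while both the permuted view and the every-other-element view are not.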

Dynamic indexing. Some view operations, such as NkiTensor.select() and NkiTensor.vector_select(), accept a runtime index supplied as an SBUF tensor or a register. See the individual methods for details.

See Tensor Indexing for the full indexing rules.

Composition#

View primitives return NkiTensor and every primitive accepts any NkiTensor, so they compose freely:

# Start from a 3-D SBUF tile, extract the partition window [0:64],
# permute the two free dims, then pick a single column:
t = nl.ndarray((128, 4, 16), dtype=nl.float32, buffer=nl.sbuf)
u = t[0:64, :, :].permute((0, 2, 1))[:, 0, :]
assert u.shape == (64, 4)

Reminder: NkiTensor view operations have no runtime cost. They are evaluated at specialization time to build the final hardware access pattern used by the ISA instruction that eventually consumes the tensor.

Low-level raw access pattern: `ap`#

When none of the higher-level view primitives expresses the layout you need, you can use NkiTensor.ap(). It takes an explicit hardware access pattern (a list of [stride, count] pairs) and returns a new view over the same storage. .ap() can express any access pattern the hardware supports.

t = nl.ndarray((128, 1024), dtype=nl.float16, buffer=nl.sbuf)
# Access every other element in the free dimension.
# Partition stride is 1024 (storage's free-dim count), NOT 1.
u = t.ap(pattern=[(1024, 128), (2, 512)])
assert u.shape == (128, 512)

For SBUF and PSUM tensors, the partition stride (the first [stride, count] pair) must equal the storage’s free-dim element count so that every partition performs the same access in parallel. HBM tensors have no partition dimension and accept any pattern. When you already have a tensor whose layout is a good starting point, NkiTensor.get_pattern() returns a pattern that can be mutated and passed to .ap():

pattern = t.get_pattern()
pattern[0][1] = 64          # read 64 partitions instead of 128
u = t.ap(pattern=pattern)   # same layout, partition count overridden

offset defaults to None, which means the new view inherits the current view’s storage offset. Pass an explicit integer to override.

Further details on nested indexing semantics, reinterpret cast with dtype=, and dynamic access with scalar_offset and vector_offset are documented in NKI Access Patterns.
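
To make the [stride, count] notation concrete, here is a pure-Python sketch that enumerates the storage positions a pattern visits. It assumes storage can be modeled as a flat array of elements, with each partition starting one partition stride after the previous one. The helpers `pattern_shape` and `pattern_positions` are hypothetical names, not NKI functions.

```python
from itertools import product

def pattern_shape(pattern):
    """The view's shape is the tuple of counts, outermost dim first."""
    return tuple(count for _stride, count in pattern)

def pattern_positions(pattern, offset=0):
    """Yield storage positions in row-major order over the pattern's counts."""
    ranges = [range(count) for _stride, count in pattern]
    for index in product(*ranges):
        yield offset + sum(stride * i for (stride, _count), i in zip(pattern, index))

# The pattern from the example above: every other element of a (128, 1024) tile.
shape = pattern_shape([(1024, 128), (2, 512)])

# A smaller pattern that is easy to inspect: 2 partitions, 4 elements each.
small = list(pattern_positions([(8, 2), (2, 4)]))
```

In this model the small pattern visits positions 0, 2, 4, 6 in the first partition and 8, 10, 12, 14 in the second, which is the every-other-element access scaled down.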

Warning

.ap() is not composable: the pattern addresses the tensor’s underlying storage directly, ignoring the shape, strides, and offset of the view it is called on. t.slice(...).ap(pattern=...) and t.ap(pattern=...) produce the same result.

Creating Tensors#

The easiest way to create tensors is using the nki.language.ndarray API. This function takes a shape, a dtype, and a memory type, and returns an NkiTensor representing a reference to a memory region in the given memory type large enough to hold the tensor.

Note

ndarray does not initialize memory. The contents of a newly allocated tensor are undefined until explicitly written to (e.g., via nisa.dma_copy or nisa.memset).

# A matrix of 128x128 16-bit float values in the SBUF memory
t = nl.ndarray((128,128), nl.float16, nl.sbuf)
assert t.shape == (128,128)
assert t.dtype == nl.float16
assert t.buffer == nl.sbuf

You can pass an optional name argument to ndarray. The name is a string label that the compiler propagates into the generated IR and debug information. It appears in compiler warnings and errors referencing the tensor, as a debug symbol used by the Neuron Explorer profiler, and in the scheduling APIs that operate on named tensors.

# Named tensor for easier identification in compiler output
t = nl.ndarray((128,128), nl.float16, nl.sbuf, name="my_weights")

You can also create a tensor from an existing tensor using the reshape method. The reshape method will create a new reference to the same memory with a different shape. The reshaped tensor must have the same total number of elements as the original.

# create an alternate view of t with shape 128x2x64
u = t.reshape((128,2,64))

# create an alternate view of t with shape 128x32x4
v = t.reshape((128,32,4))

In both cases, u and v refer to the same underlying memory as t; no data is copied.

Tensor Indexing#

Recall from Tensor Values that every NkiTensor has a shape, strides, offset, and buffer. Let’s look in detail at the most common way of producing new views of a tensor: indexing with integers and slices.

Suppose you have an SBUF tensor t with shape (64, 64, 64). By convention the first dimension is the partition dimension and the remaining dimensions lay out the free dimension of each partition. You can refer to sub-tensors with an index expression.

t = nl.ndarray((64, 64, 64), dtype=nl.float32, buffer=nl.sbuf)

# On-chip tensors stay at least 2-D: integer indexing on the partition
# dim keeps it at size 1, so t[0,0,10] is a (1,1) view, not a scalar.
u = t[0, 0, 10]
assert u.shape == (1, 1)

# Integer indexing on a free dim drops that dim, unless dropping would
# make the result < 2-D, in which case the last free dim is kept at 1.
u = t[:, 0]
assert u.shape == (64, 64)

u = t[:, 0, 10]
assert u.shape == (64, 1)     # last dim kept at size 1 to stay ≥ 2-D

For larger sub-tensors, use slice expressions of the form start:stop:step, or ... to fill in default slices across a range of dimensions.

# The first 64 elements of every partition
u = t[0:64, 0, 0:64]
assert u.shape == (64, 64)

# Same as above, using defaults
u = t[:, 0, :]
assert u.shape == (64, 64)

# Only the even elements of the third dimension
u = t[:, :, ::2]
assert u.shape == (64, 64, 32)

# The whole tensor t. `...` and `:` both fill in missing dims, so
# t[...] and t[:, ...] are equivalent here.
u = t[...]
assert u.shape == (64, 64, 64)

# Use defaults for the inner dimensions; partition index 0 is kept at
# size 1 because SBUF/PSUM tensors stay ≥ 2-D.
u = t[0, ..., :]
assert u.shape == (1, 64, 64)

Every indexing expression returns a new NkiTensor sharing storage with t. That means you can chain indexing, query the result’s shape, strides, offset, and pattern, and pass it to any NKI ISA instruction that accepts a tensor.

u = t[0, ...]
assert u.shape == (1, 64, 64)

v = u[:, 0:32, :]
assert v.shape == (1, 32, 64)

# All attributes are available at specialization time:
print(u.shape, u.strides, u.offset)
print(u.get_pattern())       # [[stride, count], ...]
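
The shape rules in this section can be summarized as a small pure-Python calculator. This is a sketch of the rules as stated in this guide, not the NKI implementation. `indexed_shape` and its `on_chip` flag are hypothetical names, with `on_chip=True` standing for SBUF/PSUM tensors and `on_chip=False` for HBM tensors.

```python
def indexed_shape(shape, key, on_chip=True):
    """Result shape of t[key] for a tensor of the given shape."""
    if not isinstance(key, tuple):
        key = (key,)
    # Expand a single `...` into enough full slices to cover every dim.
    if Ellipsis in key:
        pos = key.index(Ellipsis)
        fill = len(shape) - (len(key) - 1)
        key = key[:pos] + (slice(None),) * fill + key[pos + 1:]
    # Missing trailing dims default to full slices.
    key = key + (slice(None),) * (len(shape) - len(key))

    result = []
    for dim, (extent, k) in enumerate(zip(shape, key)):
        if isinstance(k, slice):
            result.append(len(range(*k.indices(extent))))
        elif on_chip and dim == 0:
            result.append(1)  # partition dim is kept at size 1
        # An integer index on any other dim drops that dim.
    while on_chip and len(result) < 2:
        result.append(1)      # on-chip views stay at least 2-D
    return tuple(result)

a = indexed_shape((64, 64, 64), (0, 0, 10))
b = indexed_shape((64, 64, 64), (slice(None), 0, 10))
c = indexed_shape((64, 64, 64), (slice(None), slice(None), slice(None, None, 2)))
d = indexed_shape((64, 64, 64), (0, Ellipsis, slice(None)))
e = indexed_shape((4, 128, 32), (0, slice(None), slice(None)), on_chip=False)
```

The five results correspond to `t[0, 0, 10]`, `t[:, 0, 10]`, `t[:, :, ::2]`, `t[0, ..., :]`, and the HBM example `h[0, :, :]` from earlier in the guide.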

Control Flow#

NKI supports basic control flow constructs, including if-statements, for-loops over ranges, lists, or tuples, and while loops. All of these constructs work similarly to their equivalents in Python, but with one important difference: unless they use the dynamic forms described under Dynamic Control Flow, they are evaluated at specialization time. This means the compiler unrolls every such loop and resolves every such branch before generating device code. For example, the code below uses a simple loop with a nested if statement to process the even and odd elements of a list differently.

inputs = [a, b, c]
outputs = [x, y, z]

assert len(inputs) == len(outputs)
for i in range(len(inputs)):
    if i % 2 == 0:
        nisa.nc_transpose(dst=outputs[i], data=inputs[i])
    else:
        nisa.reciprocal(dst=outputs[i], data=inputs[i])

The loop and if-statement above will ultimately be evaluated away by the NKI compiler. This means that the ISA instructions will be included in the final executable as a linear sequence:

nki.isa.nc_transpose(dst=x, data=a)
nki.isa.reciprocal(dst=y, data=b)
nki.isa.nc_transpose(dst=z, data=c)

A for-loop can also iterate over a list or tuple, similar to Python. The two loops below both print the numbers 1-3 in sequence.

l = [1,2,3]
for x in l:
  print(x)

t = (1,2,3)
for x in t:
  print(x)

Finally, NKI also supports while loops. Again these loops are similar to Python, and will be unrolled by the compiler, just like the for-loops.

# print the numbers 0-9
x = 0
while x < 10:
  print(x)
  x += 1

Dynamic Control Flow#

In the previous section we looked at control-flow constructs that are ultimately expanded at compile-time. NKI also supports dynamic control-flow, or control-flow that runs on the device. Dynamic control-flow is not expanded by the compiler, but lowered to equivalent Trainium control-flow instructions.

The most basic dynamic loop is a for-loop with static bounds. A dynamic loop with static bounds can be written using the standard for-loop with a dynamic_range hint.

# create a dynamic loop that runs "on chip"
for i in dynamic_range(10):
  process_tensor(t[i])

The for loop above will lower to a loop on the Trainium device. The loop will execute its body (process_tensor) 10 times and then continue. Because this is a dynamic loop, the loop index, i, will be stored in a hardware register during evaluation. Therefore, the type of i is register in NKI. Register values can be used to index tensors and can be passed to nki.isa APIs. We can also use registers to create dynamic loops with dynamic bounds.

count = nisa.register_alloc(0)
nisa.register_load(count, count_tensor)
for i in dynamic_range(count):
  process_tensor(t[i])

The loop above uses a register value as the upper bound. This register is allocated with the register_alloc function, and then its value is populated from a tensor using register_load. The for loop will then execute count times.
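
The difference between an unrolled loop and a dynamic loop can be pictured with a toy program representation in plain Python. This is only a mental model under assumed data structures: the tuples below are hypothetical and do not reflect the compiler's actual IR.

```python
def trace_static(n):
    """Model of `for i in range(n)`: the body is emitted once per iteration."""
    ops = []
    for i in range(n):
        ops.append(("process_tensor", f"t[{i}]"))
    return ops

def trace_dynamic(n):
    """Model of `for i in dynamic_range(n)`: one loop node, body emitted once."""
    body = [("process_tensor", "t[i]")]  # the index stays symbolic (a register)
    return [("loop", "i", n, body)]

static_ops = trace_static(10)
dynamic_ops = trace_dynamic(10)
```

In this model the static loop contributes ten operations with concrete indices, while the dynamic loop contributes a single loop node whose trip count could just as well be a register.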

Four register APIs allocate registers, set them to constants, and load and store their values. Each register is 32 bits wide and supports multiple data types: u8, u16, u32, i8, i16, i32, and fp32 (or a pair of registers for u64/i64). Signed integers are supported, so negative values (e.g., count=-5) are valid. The register APIs return and operate on VirtualRegister objects.

A VirtualRegister represents a scalar value stored in a hardware register on the Trainium device. Unlike compile-time integer values, a VirtualRegister holds a value that exists at runtime. You can use a VirtualRegister as a loop bound for dynamic_range, as a condition for a dynamic while loop, or as a scalar_offset in a tensor access pattern for dynamic indexing.

Note

The induction variable of a dynamic_range loop is also a VirtualRegister, but it is frozen: you cannot write to it with register_move or register_load. This prevents ambiguity about whether modifying the induction variable would affect loop termination.

# allocate a new register with initial value (32-bit integer)
def register_alloc(x: int) -> VirtualRegister: ...

# store a constant integer into a register
def register_move(dst: VirtualRegister, imm: int): ...

# load a value from an SBUF tensor into a register
# the source tensor must be a 1x1 SBUF tile
def register_load(dst: VirtualRegister, src: tensor): ...

# store the value of a register into an SBUF tensor
def register_store(dst: tensor, src: VirtualRegister): ...

Using the APIs above, we can also create dynamic while loops. A dynamic while loop is specified using the standard while-loop with a condition that is a single register value. The NKI compiler will preserve while loops with register conditions, and not unroll them.

# suppose cond is an SBUF tensor, perhaps declared as
cond = nl.ndarray((1, 1), buffer=nl.sbuf, dtype=nl.int32)

# allocate a register with initial value 1
reg = nisa.register_alloc(1)

# This while loop is dynamic because the condition is a register
while reg:
   # perform a calculation that updates cond
   nisa.dma_copy(dst=cond, ...)

   # update register used in while-loop condition
   nisa.register_load(reg, cond)

The code above uses a 1x1 SBUF tensor called cond to store the condition. We update this tensor in the body of the loop and then use register_load to update the register. When the register reg holds the value 0 the loop will terminate.
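
The termination logic of this loop can be emulated on the host in plain Python. The sketch below is a stand-in only: `RegisterModel` and this `register_load` are hypothetical host-side models, a nested list plays the role of the 1x1 SBUF tile, and decrementing the tile stands in for whatever computation updates cond on the device.

```python
class RegisterModel:
    """Host-side stand-in for a register: truthy while its value is nonzero."""
    def __init__(self, value):
        self.value = value

    def __bool__(self):
        return self.value != 0

def register_load(dst, src):
    # Model of loading a 1x1 tile into a register.
    dst.value = src[0][0]

cond = [[3]]             # models the 1x1 condition tile
reg = RegisterModel(1)   # initial value 1, so the loop is entered
iterations = 0

while reg:
    cond[0][0] -= 1      # stands in for the computation that updates cond
    register_load(reg, cond)
    iterations += 1
```

With the tile starting at 3, the loop body runs three times and stops once the loaded value reaches 0.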

Class Support#

NKI has basic support for user-defined classes. In NKI all classes are similar to Python data classes. When you declare a class for use in a NKI kernel, the class must inherit from NKIObject and no other classes. This restriction is to ensure the NKI compiler only brings in class definitions that are intended for NKI. A simple NKI class can be declared similar to a Python data class:

@dataclass
class C(NKIObject):
  x : int
  y : bool = False

  def toggle(self):
    self.y = not self.y

c = C(1)
c.toggle()

# prints 1 True
print(c.x, c.y)

The @dataclass decorator is optional; classes with and without the @dataclass decorator will be compiled in the same way by the NKI compiler. The compiler will create the initializer functions __init__ and __post_init__, if they are not provided by the user. For the class above, the default initializers are:

# default if not provided by the user
def __init__(self, x = None, y = False):
  self.x = x
  self.y = y
  self.__post_init__()

# default if not provided by the user
def __post_init__(self):
  pass

Classes can be declared in Python and passed as arguments to NKI functions. When a class is used as an argument to a NKI kernel, the NKI kernel will import the definition of the Python class, and convert the Python class instance to a NKI instance using the object’s dictionary. Currently, NKI does not look at slots or other object features, only the object dictionary. For example, consider the code shown below.

class A(NKIObject):
  x : int = 1
  def __init__(self, x):
    self.x = x

@nki.jit
def kernel(a : A): ...

kernel(A(1))

The class A is instantiated in Python as an argument to the kernel function. The NKI compiler will take this object and translate it to an instance of A on the NKI side. Roughly, this translation copies the object dictionary. In pseudo-code:

# pseudo-code "copy construct" A on NKI side
def kernel(python_a : A):
  # make a NKI instance of class A
  nki_a = new A
  # populate NKI instance from Python instance
  nki_a.__dict__ = python_a.__dict__
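
The pseudo-code above has a direct analogue in ordinary Python, sketched below. It only illustrates a dictionary-based copy and is not how the NKI compiler is implemented. The class here omits the NKIObject base so that the example runs without NKI installed, and `copy_construct` is a hypothetical helper name.

```python
class A:
    x: int = 1

    def __init__(self, x):
        self.x = x

def copy_construct(obj):
    """Build a new instance from the object's dictionary, skipping __init__."""
    clone = type(obj).__new__(type(obj))
    clone.__dict__.update(vars(obj))
    return clone

python_a = A(7)
nki_a = copy_construct(python_a)
```

Only what lives in the instance dictionary is carried over, which is why attributes stored in slots would be missed by this kind of translation.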

Enumerations#

In addition to the basic data classes described above, NKI also supports basic enumerations. For example, the following can be used in NKI kernel functions.

class E(Enum):
  x = 1
  y = 2
  z = 3

def f(e : E):
  if e == E.x: ...
  elif e == E.y: ...
  elif e == E.z: ...

f(E.x)

Similar to Python, the NKI compiler will translate the enumeration class E to the following:

class E(NKIObject):
  x = E("x", 1)
  y = E("y", 2)
  z = E("z", 3)

  def __init__(self, name, value):
    self.name = name
    self.value = value

Equality in NKI is structural, so no additional code is needed to replicate the behavior of == and != for objects of type E. No other binary operators on enum values are supported.
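
In ordinary Python, equality on plain objects is by identity unless `__eq__` is defined, so the structural behavior described above has to be spelled out. The sketch below shows what structural equality means for an enum-like value. `EnumValue` is a hypothetical class for illustration, not the class the NKI compiler generates.

```python
class EnumValue:
    """Enum-like value compared by its fields rather than by identity."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        # Structural equality: same type and the same field values.
        return type(self) is type(other) and vars(self) == vars(other)

    def __hash__(self):
        return hash((self.name, self.value))

x1 = EnumValue("x", 1)
x2 = EnumValue("x", 1)
y = EnumValue("y", 2)

same = x1 == x2        # distinct objects, equal fields
different = x1 != y    # != follows from == automatically
```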

Composable Kernels#

Because all functions are inlined at specialization time, NKI supports a powerful composition pattern: you can pass functions as arguments to other functions, and the compiler will inline them at each call site. This allows you to write generic kernel templates that can be specialized with different operations.

For example, consider a generic tiled processing kernel that applies a user-supplied function to each tile:

def tiled_process(input_tensor, output_tensor, tile_fn):
    """Generic kernel that applies tile_fn to each tile of the input."""
    for i in range(input_tensor.shape[0] // nl.tile_size.pmax):
        tile = nl.ndarray((128, 512), dtype=input_tensor.dtype, buffer=nl.sbuf)
        nisa.dma_copy(dst=tile, src=input_tensor[i * 128:(i + 1) * 128, :])

        result = nl.ndarray((128, 512), dtype=input_tensor.dtype, buffer=nl.sbuf)
        tile_fn(dst=result, src=tile)

        nisa.dma_copy(dst=output_tensor[i * 128:(i + 1) * 128, :], src=result)

def my_activation(dst, src):
    nisa.activation(dst=dst, data=src, op=nl.relu)

def my_scale(dst, src):
    nisa.tensor_scalar(dst=dst, data=src, op0=nl.multiply, operand0=0.5)

@nki.jit
def relu_kernel(a_input, a_output):
    tiled_process(a_input, a_output, my_activation)

@nki.jit
def scale_kernel(a_input, a_output):
    tiled_process(a_input, a_output, my_scale)

During specialization, the compiler inlines tiled_process and then inlines the specific tile_fn (either my_activation or my_scale) at each call site. The result is a fully specialized kernel with no function call overhead.
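
The effect of passing different functions to one template can be modeled with a plain-Python recorder. This is a sketch only: the functions below append tuples to a list instead of calling nki.isa, and all names (`tiled_process_model`, `relu_tile`, `scale_tile`) are hypothetical.

```python
def tiled_process_model(num_tiles, tile_fn):
    """Record the flat operation sequence a tiled template would produce."""
    ops = []
    for i in range(num_tiles):
        ops.append(("dma_copy", f"tile{i}", f"input[{i}]"))
        tile_fn(ops, i)  # the chosen function's ops land inline
        ops.append(("dma_copy", f"output[{i}]", f"result{i}"))
    return ops

def relu_tile(ops, i):
    ops.append(("activation", "relu", f"tile{i}"))

def scale_tile(ops, i):
    ops.append(("tensor_scalar", "multiply", 0.5, f"tile{i}"))

relu_ops = tiled_process_model(2, relu_tile)
scale_ops = tiled_process_model(2, scale_tile)
relu_names = [op[0] for op in relu_ops]
```

Both recorded programs are flat lists of six operations. They differ only in the operation between each pair of copies, which mirrors how the same template specializes into different kernels.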

This pattern is especially useful for building mega-kernels that compose multiple operations. You can pass function references as hyperparameters when using the kernel builder API:

from nki.compiler.kernel_builder import compile_kernel

compile_kernel(
    tiled_process,
    inputs={"input_tensor": input_array},
    outputs={"output_tensor": output_array},
    compile_opts=opts,
    tile_fn=my_activation,  # passed as a hyperparameter
)

Functions can also be stored in data structures, returned from other functions, and selected dynamically at specialization time based on compile-time conditions:

def select_activation(name):
    if name == "relu":
        return my_relu
    elif name == "gelu":
        return my_gelu

@nki.jit
def kernel(a_input, a_output):
    act_fn = select_activation("relu")
    # act_fn is resolved at specialization time; the selected
    # function is inlined directly
    act_fn(dst=a_output, src=a_input)

Because all of this resolution happens at specialization time, there is no runtime cost. The compiled kernel contains only the specific ISA operations for the chosen function.
