This document is relevant for: Inf2, Trn1, Trn1n

NKI Programming Model#

The NKI programming model enables developers to create custom kernels to program NeuronCores, where every kernel consists of three main stages:

  1. Loading of inputs from device memory (High Bandwidth Memory, or HBM) to the on-chip SRAM (State Buffer, or SBUF).

  2. Computation definition, to be executed on the NeuronCore compute engines.

  3. Storing of outputs from on-chip SRAM (SBUF) back to device memory (HBM).
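For orientation, here is a minimal sketch of how these three stages appear in a kernel body. The kernel name is hypothetical, the tensors are assumed to fit within a single tile, and a complete, runnable version of this pattern appears later in this guide:

import neuronxcc.nki.language as nl
from torch_neuronx import nki_jit

@nki_jit
def minimal_exp_kernel_(in_tensor, out_tensor):
  # 1. Load inputs from device memory (HBM) into on-chip SRAM (SBUF)
  in_tile = nl.load(in_tensor)

  # 2. Define the computation, executed on the NeuronCore compute engines
  out_tile = nl.exp(in_tile)

  # 3. Store outputs from on-chip SRAM (SBUF) back to device memory (HBM)
  nl.store(out_tensor, value=out_tile)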

Fig. 8 below is a simplified diagram of a NeuronCore along with its attached HBM device memory. NKI kernels currently target a single NeuronCore-v2.

As shown in Fig. 8, a single NeuronCore consists of two on-chip SRAMs (SBUF and PSUM) and four heterogeneous compute engines: the tensor engine, vector engine, scalar engine, and GPSIMD engine. For more information about the compute engine capabilities, see NeuronCore-v2 Architecture. Next, let’s dive into the memory hierarchy design of NeuronCore-v2, which provides the necessary architecture knowledge to understand the NKI programming model.

../../_images/pm-nc.png

Fig. 8 NeuronCore Architecture (multiple NeuronCores available per NeuronDevice)#

Memory hierarchy#

Fig. 9 below shows the four-level memory hierarchy available to a single NeuronCore. The ranges provided in the figure are intended to calibrate the programmer’s mental model. See Neuron Architecture for the exact values.

Similar to the standard memory hierarchies of other devices, memories near the top of the hierarchy are closest to the compute engines; therefore, they are designed to provide the highest bandwidth and lowest latency. However, these faster memories have smaller capacities than memories near the bottom. Unlike the memory hierarchies of traditional processors (e.g., CPUs, GPUs), all the memories available to a NeuronCore are software-managed: they are managed either directly by programmers or by the Neuron SDK. In other words, a NeuronCore does not have a hardware cache system performing data movement across memories that is opaque to the program. Next, let’s discuss the different memories bottom-up.

../../_images/pm-memory.png

Fig. 9 NeuronCore Memory Hierarchy with Capacity and Bandwidth Ranges#

NeuronCore external memory#

The two memories at the bottom of the hierarchy, host memory and device memory, are both considered external memory for a NeuronCore. Both are linear memories, in which multi-dimensional tensors must be stored in a flattened manner.

The host memory is the CPU-attached DRAM, which is accessible by the host CPUs and all the NeuronCores attached to the instance. NKI kernels currently do not provide APIs to move data in and out of the host memory directly, but we can rely on ML frameworks such as PyTorch or JAX to send input data from host memory into the NeuronDevice and vice versa. For an example of this, see Getting Started with NKI.

The device memory resides within a NeuronDevice and uses High Bandwidth Memory (HBM) technologies starting from NeuronDevice v2. This means that device memory and HBM refer to the same thing within NKI. Currently, the input and output parameters to NKI kernels must be HBM tensor references. Input tensors in HBM must be loaded into memory within a NeuronCore before any computation can take place.

NeuronCore internal memory#

The two memories at the top of the hierarchy, SBUF and PSUM, are both considered internal, on-chip memory for a NeuronCore. Both memories are two-dimensional, organized in 128 partitions. The partition size of PSUM is typically much smaller than that of SBUF, and both PSUM and SBUF partition sizes vary with NeuronCore generations.

State Buffer (SBUF) memory is the main software-managed on-chip SRAM. The SBUF is accessible by all the compute engines within a NeuronCore. NKI kernel input tensors from HBM must be loaded into the SBUF for computation using nki.language.load, and computed output tensors of the kernel must be stored back into the HBM from SBUF using nki.language.store before the host can access them. In addition, SBUF is used for storing intermediate data within the kernel, generated by the compute engines. Note, SBUF has ~20x higher bandwidth than HBM, but needs to be carefully managed to minimize HBM accesses for better performance.

Lastly, Partial Sum Buffer (PSUM) memory is a small, dedicated memory designed for storing matrix multiplication (MatMult) results computed by the tensor engine. The tensor engine is able to read-add-write to every address in PSUM. Therefore, PSUM is useful for performing large MatMult calculations using multiple tiles, where multiple MatMult instructions need to accumulate into the same output tile. As shown in Fig. 8, PSUM memory can also be read and written by the vector and scalar engines. However, due to the limited capacity of PSUM, we recommend that you reserve PSUM space for the tensor engine to write MatMult outputs, and use the vector and scalar engines to evict MatMult results back to SBUF as soon as possible.
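The following sketch illustrates this recommended PSUM usage pattern. It is a minimal example with hypothetical kernel and tensor names that previews APIs introduced later in this guide (indexing with nl.arange, nl.affine_range, and nki.isa.nc_matmul): MatMult results for two contraction tiles accumulate into a single PSUM tile, which is then promptly evicted back to SBUF:

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa
from torch_neuronx import nki_jit

@nki_jit
def accum_matmul_kernel_(a_t, b, out):
  """Hypothetical sketch: out = A.T @ B with contraction size K = 256.

  Args:
      a_t: A stored K-major (already transposed), shape [K=256, M=128]
      b: shape [K=256, N=512]
      out: shape [M=128, N=512]
  """
  # Accumulation buffer lives in PSUM; MatMult results read-add-write into it
  res_psum = nl.zeros((128, 512), dtype=nl.float32, buffer=nl.psum)

  for k in nl.affine_range(2):
    i_k = k * 128 + nl.arange(128)[:, None]
    a_tile = nl.load(a_t[i_k, nl.arange(128)[None, :]])  # [128, 128]
    b_tile = nl.load(b[i_k, nl.arange(512)[None, :]])    # [128, 512]
    res_psum += nisa.nc_matmul(a_tile, b_tile)            # tensor engine -> PSUM

  # Evict from PSUM back to SBUF as soon as possible (vector/scalar engine copy)
  res_sbuf = nl.copy(res_psum, dtype=out.dtype)
  nl.store(out, value=res_sbuf)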

Note that to optimize kernel performance, it is a good practice for NKI programmers to be mindful of SBUF and PSUM usage through careful tiling and loop fusion. However, ultimately the Neuron compiler performs memory allocation for SBUF and PSUM and assigns memory addresses to kernel intermediate data. When the cumulative size of live data defined by the NKI kernel overflows the capacity of any on-chip memory, the Neuron compiler inserts the necessary spills or refills between that memory and the next-tier memory in the hierarchy.

Representing data in NKI#

NKI represents data in NeuronCore’s memory hierarchy with built-in type Tensor and its subclasses.

A Tensor is a multi-dimensional array which contains elements with the same data type. Programmers can pass Tensor in and out of NKI kernels, and declare or initialize Tensor in any memory within the NeuronDevice (PSUM, SBUF, HBM) using APIs such as nki.language.ndarray, nki.language.zeros, and nki.language.full. Input and output tensors passed from ML frameworks to NKI kernels are reinterpreted as NKI Tensors of the hbm buffer type, sharing the same underlying memory buffer.
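For example, here is a short sketch (with arbitrary shapes chosen only for illustration) that declares a Tensor in each of the three NeuronDevice memories:

import neuronxcc.nki.language as nl

# Uninitialized tensor in device memory (HBM)
t_hbm = nl.ndarray((128, 512), dtype=nl.float32, buffer=nl.hbm)

# Constant-initialized tensor in the on-chip SBUF
t_sbuf = nl.full((128, 512), 1.0, dtype=nl.float32, buffer=nl.sbuf)

# Zero-initialized accumulation tensor in PSUM
t_psum = nl.zeros((128, 512), dtype=nl.float32, buffer=nl.psum)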

A Tensor in NeuronCore’s internal memories (SBUF and PSUM) also has a dimension mapped to the partitions of the internal memories. We call this dimension the partition dimension. By default, NKI infers the first dimension (that is, the left-most dimension) as the partition dimension of a Tensor. Users can also explicitly annotate the partition dimension with par_dim from nki.language. For example:

# NKI infers the left most dimension as the partition dimension (size 128 below)
x = nl.ndarray((128, 32, 512), dtype=nl.float32, buffer=nl.sbuf)

# Same as above but more verbose
y = nl.ndarray((nl.par_dim(128), 32, 512), dtype=nl.float32, buffer=nl.sbuf)

# We can also explicitly annotate the partition dimension if we want the partition dimension
# to be on the other dimensions. In the following code we are creating a tensor whose partition
# dimension is the second from the left most dimension
z = nl.ndarray((128, nl.par_dim(32), 512), dtype=nl.float32, buffer=nl.sbuf)

There is a special subclass of Tensor called Index. Index represents the result of the affine expression over variables produced by index-generating APIs, such as loop variables, nki.language.program_id, nki.language.affine_range, and nki.language.arange.
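For example, a small hypothetical fragment:

import neuronxcc.nki.language as nl

i_p = nl.arange(128)[:, None]           # an Index over the partition dimension
i_f = 2 * nl.arange(256)[None, :] + 1   # an affine expression over an Index is also an Index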

A Tensor whose partition dimension is the first dimension is also called a Tile in NKI. In the above code example, x and y are Tiles, while z is not. All NKI APIs take Tiles as input and return Tiles as output. We will give more explanation in Tile-based operations.

Tile-based operations#

All NKI APIs operate on Tile, which aligns with the NeuronCore instruction set architecture (NeuronCore ISA).

x = nl.ndarray((128, 32, 512), dtype=nl.float32, buffer=nl.sbuf)
xx = nl.exp(x) # works

z = nl.ndarray((128, nl.par_dim(32), 512), dtype=nl.float32, buffer=nl.sbuf)
zz = nl.exp(z) # not supported

To call NKI APIs to process data in a Tensor whose partition dimension is not the first dimension, users need to generate Tiles from the Tensor. This can be done by indexing the Tensor with a tuple of Index, following standard Python syntax Tensor[Index, Index, ...]. For example:

z = nl.ndarray((128, nl.par_dim(32), 512), dtype=nl.float32, buffer=nl.sbuf)
for i in range(128):
  zz = nl.exp(z[i, :, :]) # works

We will provide more discussion of the indexing in Tensor Indexing. Next, let’s discuss two important considerations when working with tile-based operations in NKI: data layout and tile size constraints.

Layout considerations#

When working with multi-dimensional arrays in any platform, it is important to consider the physical memory layout of the arrays, or how data is stored in memory. For example, in the context of 1D linear memory, we can store a 2D array in a row-major layout or a column-major layout. Row-major layouts place elements within each row in contiguous memory, and column-major layouts place elements within each column in contiguous memory.

As discussed in the Memory hierarchy section, the on-chip memories, SBUF and PSUM, are arranged as 2D memory arrays. The first dimension is the partition dimension P, with 128 memory partitions that can be read and written in parallel by the compute engines. The second dimension is the free dimension F, where elements are read and written sequentially within each partition. A tensor is placed in SBUF or PSUM across both P and F, with the same start offset across all P partitions used by the tensor. Fig. 10 below illustrates a default tensor layout. Note that a tile in NKI must map shape[0] to the partition dimension.

../../_images/pm-layout.png

Fig. 10 Tensor mapped to partition and free dimensions of SBUF and PSUM#

Similar to other domain-specific languages that operate on tensors, NKI defines a contraction axis of a tensor as the axis over which reduction is performed, for example the summation axis in a dot product. NKI also defines a parallel axis as an axis over which the same operation is performed on all elements. For example, if we take a [100, 200] matrix and sum each row independently to get an output of shape [100, 1], then the row-axis (axis[0], left-most) is the parallel axis, and the column-axis (axis[1], right-most) is the contraction axis.

To summarize, the partition and free dimensions of a NKI tensor dictate how the tensor is stored in the 2D on-chip memories physically, while the parallel and contraction axes of a tensor are logical axes that are determined by the computation to be done on the tensor.

The NeuronCore compute engines impose two layout constraints:

  • [LC#1] For matrix multiplication operations, the contraction axis of both input tiles must be mapped to the P dimension.

  • [LC#2] For operations that are not matrix multiplication operations, such as scalar or vector operations, the parallel axis should be mapped to the P dimension.

LC#1 means that to perform a matrix multiplication of shapes [M, K] and [K, N], Tensor Engine (the engine performing this operation) requires the K dimension to be mapped to the partition dimension in SBUF for both input matrices. Therefore, you need to pass shapes [K, M] and [K, N] into the nki.isa.nc_matmul API, as the partition dimension is always the left-most dimension for an input tile to any NKI compute API.

To help developers get started with NKI quickly, NKI also provides a high-level API nki.language.matmul that can take [M, K] and [K, N] input shapes and invoke the necessary layout shuffling on the input data before sending it to the Tensor Engine matmul instruction.

LC#2, on the other hand, is applicable to many instructions supported on Vector, Scalar and GpSimd Engines. See nki.isa.tensor_reduce API as an example.
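As a small illustration of LC#2, the sketch below sums each row of the [100, 200] tensor from the example above: the parallel axis (rows) is mapped to the P dimension, and the reduction runs along the free (contraction) axis. The kernel name is hypothetical, and it uses the higher-level nl.sum API (with an assumed keepdims argument) as a stand-in for nki.isa.tensor_reduce:

import neuronxcc.nki.language as nl
from torch_neuronx import nki_jit

@nki_jit
def row_sum_kernel_(in_tensor, out_tensor):
  """Hypothetical sketch: in_tensor is [100, 200], out_tensor is [100, 1]."""
  # Parallel axis (100 rows) maps to the partition dimension, within the 128 limit
  in_tile = nl.load(in_tensor)

  # Reduce along the free dimension (the contraction axis)
  row_sums = nl.sum(in_tile, axis=1, keepdims=True)

  nl.store(out_tensor, value=row_sums)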

Tile size considerations#

Besides layout constraints, NeuronCore hardware further imposes three tile-size constraints in NKI:

  • [TC#1] The P dimension size of a tile in both SBUF and PSUM must never exceed nki.tile_size.pmax == 128.

  • [TC#2] For tiles in PSUM, the F dimension size must not exceed nki.tile_size.psum_fmax == 512.

  • [TC#3] For matrix multiplication, the F dimension size of the input tiles must not exceed nki.tile_size.gemm_stationary_fmax == 128 for the stationary operand (left-hand side, LHS), or nki.tile_size.gemm_moving_fmax == 512 for the moving operand (right-hand side, RHS).

You are responsible for breaking your tensors into tiles that satisfy these tile-size constraints. If the constraints are not met, NKI kernel compilation throws a SyntaxError indicating which constraint is violated. For example, below we show a simple kernel that applies the exponential function to every element of an input tensor. To start, let’s write a kernel that expects a hard-coded shape of (128, 512) for both input and output tensors:

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3
 4@nki_jit
 5def tensor_exp_kernel_(in_tensor, out_tensor):
 6  """NKI kernel to compute elementwise exponential of an input tensor
 7
 8  Args:
 9      in_tensor: an input tensor of shape [128,512]
10      out_tensor: an output tensor of shape [128,512]
11  """
12  # Generate indices for the input/output tensors
13  i_p = nl.arange(128)[:, None]
14  i_f = nl.arange(512)[None, :]
15
16  # Load input data from HBM to on-chip memory
17  in_tile = nl.load(in_tensor[i_p, i_f])
18
19  # perform the computation:
20  out_tile = nl.exp(in_tile)
21
22  # store the results back to HBM
23  nl.store(out_tensor[i_p, i_f], value=out_tile)
24
25
26if __name__ == "__main__":
27  import torch
28  from torch_xla.core import xla_model as xm
29
30  device = xm.xla_device()
31
32  shape = (128, 512)
33  in_tensor = torch.ones(shape,  dtype=torch.bfloat16).to(device=device)
34  out_tensor = torch.zeros(shape, dtype=torch.bfloat16).to(device=device)
35  tensor_exp_kernel_(in_tensor, out_tensor)
36
37  print(out_tensor) # an implicit XLA barrier/mark-step

As expected, the output tensor is an element-wise exponentiation of the input tensor (a tensor of ones):

tensor([[2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        ...,
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188]],
        device='xla:1', dtype=torch.bfloat16)

Now let’s examine what happens if the input/output tensor shapes do not match the shape of the compute kernel. As an example, we can change the input and output tensor shape from [128,512] to [256,512]:

 1if __name__ == "__main__":
 2  import torch
 3  from torch_xla.core import xla_model as xm
 4
 5  device = xm.xla_device()
 6
 7  shape = (256, 512) # Previously (128, 512)
 8  in_tensor = torch.ones(shape,  dtype=torch.bfloat16).to(device=device)
 9  out_tensor = torch.zeros(shape, dtype=torch.bfloat16).to(device=device)
10  tensor_exp_kernel_(in_tensor, out_tensor)
11
12  print(out_tensor) # an implicit XLA barrier/mark-step

Since the compute kernel expects (128, 512) input/output tensors but we used (256, 512) tensors instead, the bottom half of the output tensor becomes garbage data:

tensor([[2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        [2.7188, 2.7188, 2.7188, ..., 2.7188, 2.7188, 2.7188],
        ...,
        [0.5273, 0.6055, 0.4336, ..., 0.9648, 0.9414, 0.4062],
        [0.7109, 0.2539, 0.7227, ..., 0.7344, 0.2539, 0.1211],
        [0.8867, 0.2109, 0.8789, ..., 0.8477, 0.2227, 0.1406]],
        device='xla:1', dtype=torch.bfloat16)

We could try to fix this by changing the tile size inside the compute kernel to (256, 512) as well and see what happens (note: this violates tile-size constraint TC#1!):

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3
 4@nki_jit
 5def tensor_exp_kernel_(in_tensor, out_tensor):
 6  """NKI kernel to compute elementwise exponential of an input tensor
 7
 8  Args:
 9      in_tensor: an input tensor of shape [128,512]
10      out_tensor: an output tensor of shape [128,512]
11  """
12  # Generate indices for the input/output tensors
13  i_p = nl.arange(256)[:, None] # Previously nl.arange(128)
14  i_f = nl.arange(512)[None, :]
15
16  # Load input data from HBM to on-chip memory
17  in_tile = nl.load(in_tensor[i_p, i_f])
18
19  # perform the computation:
20  out_tile = nl.exp(in_tile)
21
22  # store the results back to HBM
23  nl.store(out_tensor[i_p, i_f], value=out_tile)
24
25
26if __name__ == "__main__":
27  import torch
28  from torch_xla.core import xla_model as xm
29
30  device = xm.xla_device()
31
32  shape = (256, 512) # Previously (128, 512)
33  in_tensor = torch.ones(shape,  dtype=torch.bfloat16).to(device=device)
34  out_tensor = torch.zeros(shape, dtype=torch.bfloat16).to(device=device)
35  tensor_exp_kernel_(in_tensor, out_tensor)
36
37  print(out_tensor) # an implicit XLA barrier/mark-step

Here, the Neuron compiler identifies the tile-size constraint violation and fails compilation with the following exception:

SyntaxError: Size of partition dimension 256 exceeds architecture limitation of 128.

Now, let’s see how NKI developers can build a kernel that properly handles (256, 512) input/output tensors with a simple loop. We can use the nki.language.tile_size.pmax constant defined in NKI as the maximum partition dimension size in a tile.

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3
 4@nki_jit
 5def tensor_exp_kernel_(in_tensor, out_tensor):
 6  """NKI kernel to compute elementwise exponential of an input tensor
 7
 8  Args:
 9      in_tensor: an input tensor of shape [256,512]
10      out_tensor: an output tensor of shape [256,512]
11  """
12  i_f = nl.arange(512)[None, :]
13
14  for k in nl.affine_range(2):
15    # Generate tensor indices for the input/output tensors
16    i_p = k * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]
17
18    # Load input data from HBM to on-chip memory
19    in_tile = nl.load(in_tensor[i_p, i_f])
20
21    # perform the computation
22    out_tile = nl.exp(in_tile)
23
24    # store the results back to HBM
25    nl.store(out_tensor[i_p, i_f], value=out_tile)

The nl.affine_range(2) API call returns a list of integers [0, 1]. nl.affine_range should be the default loop iterator choice in NKI when the loop has no loop-carried dependency. Note, associative reductions are not considered loop-carried dependencies in this context. One such example is accumulating results of multiple matrix multiplication calls into the same output buffer using += (see Matmul Tutorial for an example). Otherwise, nl.sequential_range should be used to handle loop-carried dependencies. Note, the Neuron compiler transforms any usage of the Python range() API into nl.sequential_range() under the hood. See NKI iterator API for a detailed discussion of various loop iterator options in NKI.
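In contrast, the hypothetical sketch below has a true loop-carried dependency: each iteration consumes the tile produced by the previous one, so nl.sequential_range is the appropriate iterator (the kernel name is made up, and the tensors are assumed to fit within a single tile):

import neuronxcc.nki.language as nl
from torch_neuronx import nki_jit

@nki_jit
def repeated_exp_kernel_(in_tensor, out_tensor):
  x = nl.load(in_tensor)

  # Iteration i reads the tile written by iteration i-1 (not an associative
  # reduction), so affine_range is not appropriate here
  for _ in nl.sequential_range(3):
    x = nl.exp(x)

  nl.store(out_tensor, value=x)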

While the tensor_exp_kernel_ above does handle (256, 512) tensors correctly, it is rather inflexible since it only supports an input shape of (256, 512). Therefore, as a last step, we extend this kernel to handle varying input/output sizes:

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3import math
 4
 5@nki_jit
 6def tensor_exp_kernel_(in_tensor, out_tensor):
 7  """NKI kernel to compute elementwise exponential of an input tensor
 8
 9  Args:
10      in_tensor: an input tensor of ANY 2D shape (up to SBUF size)
11      out_tensor: an output tensor of ANY 2D shape (up to SBUF size)
12  """
13  sz_p, sz_f = in_tensor.shape
14
15  i_f = nl.arange(sz_f)[None, :]
16
17  for p in nl.affine_range(math.ceil(sz_p / nl.tile_size.pmax)):
18    # Generate tensor indices for the input/output tensors
19    # pad index to pmax, for simplicity
20    i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]
21
22    # Load input data from external memory to on-chip memory
23    # only read up to sz_p
24    in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p<sz_p))
25
26    # perform the computation
27    out_tile = nl.exp(in_tile)
28
29    # store the results back to external memory
30    # only write up to sz_p
31    nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p<sz_p))

The above example handles cases where in_tensor.shape[0] is not a multiple of 128 by passing a mask field into the nl.load and nl.store API calls. For more information, refer to NKI API Masking.

Later in this guide, we’ll explore another way to launch a kernel with varying input/output shapes, with a single program multiple data programming model, or SPMD. The SPMD programming model removes the need for explicit looping over different tiles with variable trip counts, which could lead to cleaner and more readable code.

Tensor Indexing#

As mentioned above, we can index Tensor with standard Python syntax to produce Tiles. There are two styles of indexing: Basic and Advanced Tensor Indexing. Note that currently NKI does not support mixing Basic and Advanced Tensor Indexing in the same Index tuple.

Basic Tensor Indexing#

When we index a Tensor with fewer indices than it has dimensions, we get a view of the original tensor as a lower-dimensional tensor. For example:

x = nl.ndarray((2, 2, 2), dtype=nl.float32, buffer=nl.hbm)

# `x[1]` returns a view of x with shape [2, 2]
# [[x[1, 0, 0], x[1, 0 ,1]], [x[1, 1, 0], x[1, 1 ,1]]]
assert x[1].shape == [2, 2]

By indexing a Tensor like this, we can generate a Tile with the partition dimension in the first dimension and feed the Tile to NKI compute APIs:

# Not a tile, cannot directly feed to a NKI compute API
x = nl.ndarray((2, nl.par_dim(2), 2), dtype=nl.float32)
# Error
y = nl.exp(x)

# `x[1]` has shape [2, 2], and its first dimension is the partition dimension of the original
# tensor. We can feed it to a NKI compute API.
y = nl.exp(x[1])

NKI also supports slicing in basic tensor indexing:

x = nl.ndarray((2, 128, 1024), dtype=nl.float32, buffer=nl.hbm)

# `x[1, :, :]` is the same as `x[1]`
assert x[1, :, :].shape == [128, 1024]

# Get a smaller view of the third dimension
assert x[1, :, 0:512].shape == [128, 512]

# `x[:, 1, 0:2]` returns a view of x with shape [2, 2]
# [[x[0, 1, 0], x[0, 1 ,1]], [x[1, 1, 0], x[1, 1 ,1]]]
assert x[:, 1, 0:2].shape == [2, 2]

Advanced Tensor Indexing#

So far we have only shown basic indexing of tensors. However, the NeuronCore offers much more flexible tensorized memory access along the free dimension of its on-chip SRAMs. You can use this flexibility to efficiently stride through SBUF/PSUM at high performance with any NKI API that accesses on-chip memory. Note that such flexible indexing is not supported along the partition dimension, and that device memory (HBM) is always more performant when accessed sequentially.

In this section, we share several use cases that benefit from advanced memory access patterns and demonstrate how to implement them in NKI.

Advanced Tensor Indexing in NKI leverages the nl.arange API.

Case #1 - Tensor split to even and odd columns#

Here we split an input tensor into two output tensors, where the first output tensor gathers all the even columns from the input tensor, and the second output tensor gathers all the odd columns from the input tensor. We assume the rows of the input tensor are mapped to SBUF partitions. Therefore, we are effectively gathering elements along the free dimension of the input tensor. Fig. 11 below visualizes the input and output tensors.

../../_images/pm-index-1.png

Fig. 11 Tensor split to even and odd columns#

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3import math
 4
 5@nki_jit
 6def tensor_split_kernel_(in_tensor, out_tensor_even, out_tensor_odd):
 7  """NKI kernel to split an input tensor into two output tensors, along the column axis.
 8
 9  The even columns of the input tensor will be gathered into the first output tensor,
10  and the odd columns of the input tensor will be gathered into the second output tensor.
11
12  Args:
13      in_tensor: an input tensor
14      out_tensor_even: a first output tensor (will hold the even columns of the input tensor)
15      out_tensor_odd: a second output tensor (will hold the odd columns of the input tensor)
16  """
17
18  # Extract tile sizes.
19  sz_p, sz_f = in_tensor.shape
20  sz_fout_even, sz_fout_odd = out_tensor_even.shape[1], out_tensor_odd.shape[1]
21
22  # We assume that all three tensors have the same partition dimension size
23  # and it does not exceed pmax
24  assert in_tensor.shape[0] == out_tensor_even.shape[0] == out_tensor_odd.shape[0]
25  assert in_tensor.shape[0] <= nl.tile_size.pmax
26
27  # Make sure even/odd output tensors have correct free dimension size
28  assert sz_fout_even == math.ceil(sz_f / 2)
29  assert sz_fout_odd == math.floor(sz_f / 2)
30
31  # Generate tensor indices for the input/output tensors
32  i_p = nl.arange(sz_p)[:, None]
33  i_f = nl.arange(sz_f)[None, :]
34  i_fout_even = nl.arange(sz_fout_even)[None, :]
35  i_fout_odd = nl.arange(sz_fout_odd)[None, :]
36
37  # Split pattern:
38  i_f_even = (2 * i_fout_even)
39  i_f_odd = (2 * i_fout_odd + 1)
40
41  # Load input data from external memory to on-chip memory
42  in_tile = nl.load(in_tensor[i_p, i_f])
43
44  # Perform the split
45  # these assignments invoke copy instructions under the hood
46  # which can execute on either Scalar or Vector Engine
47  # (decided by compiler instruction scheduler)
48  out_tile_even = in_tile[i_p, i_f_even]
49  out_tile_odd = in_tile[i_p, i_f_odd]
50
51  # Store the results back to external memory
52  nl.store(out_tensor_even[i_p, i_fout_even], value=out_tile_even)
53  nl.store(out_tensor_odd[i_p, i_fout_odd], value=out_tile_odd)
54
55
56if __name__ == "__main__":
57    import torch
58    from torch_xla.core import xla_model as xm
59
60    device = xm.xla_device()
61
62    X, Y = 4, 5
63    in_tensor = torch.arange(X * Y, dtype=torch.bfloat16).reshape(X, Y).to(device=device)
64
65    out1_tensor = torch.zeros((X, Y-Y//2), dtype=torch.bfloat16).to(device=device)
66    out2_tensor = torch.zeros((X, Y//2), dtype=torch.bfloat16).to(device=device)
67
68    tensor_split_kernel_(in_tensor, out1_tensor, out2_tensor)
69    print(in_tensor, out1_tensor, out2_tensor)

The main concept in this example is the introduction of the even (i_f_even) and odd (i_f_odd) indices. Note that both indices are affine expressions of the form start + stride * nl.arange(size), with a specific start offset (0 and 1, respectively) and a stride of 2 in both cases. This allows us to stride through the in_tile memory and copy it to both output tiles (out_tile_even and out_tile_odd), according to the desired pattern.

Case #2 - Transpose tensor along the f axis#

In this example we transpose a tensor along two of its axes. Note, there are two main types of transposition in NKI:

  1. Transpose between the partition-dimension axis and one of the free-dimension axes, which is achieved via the nki.isa.nc_transpose API (a minimal sketch of this appears after this list).

  2. Transpose between two free-dimension axes, which is achieved via a nki.language.copy API, with indexing manipulation in the transposed axes to re-arrange the data.
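As a brief illustration of the first type, the hypothetical sketch below swaps the partition and free dimensions of a small tile with nki.isa.nc_transpose; the kernel name and shapes are assumptions, and nl.copy is used to move the transposed result into SBUF and cast it to the output data type:

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa
from torch_neuronx import nki_jit

@nki_jit
def pf_transpose_kernel_(in_tensor, out_tensor):
  """Hypothetical sketch: in_tensor is [64, 128], out_tensor is [128, 64]."""
  in_tile = nl.load(in_tensor)           # [64, 128] tile in SBUF

  # Swap the partition and free dimensions (runs on the tensor engine)
  out_tile = nisa.nc_transpose(in_tile)  # [128, 64]

  out_sbuf = nl.copy(out_tile, dtype=out_tensor.dtype)
  nl.store(out_tensor, value=out_sbuf)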

In this example, we’ll focus on the second case: consider a three-dimensional input tensor [P, F1, F2], where the P axis is mapped to the different SBUF partitions and the F1 and F2 axes are flattened and placed in each partition, with F1 being the major dimension. Our goal in this example is to transpose the F1 and F2 axes, treating P as a parallel dimension, which re-arranges the data within each partition. Fig. 12 below illustrates the input and output tensor layouts.

../../_images/pm-index-2.png

Fig. 12 Tensor F1:F2 Transpose#

 1import neuronxcc.nki as nki
 2import neuronxcc.nki.language as nl
 3
 4
 5def tensor_transpose2D_kernel_(in_tensor, out_tensor, shape2D):
 6  """
 7  NKI kernel to reorder the elements on axis[1] of the input tensor.
 8
 9  Every row of the input tensor is a flattened row-major 2D matrix.
10  The shape2D argument defines the dimensions of the flattened matrices (#rows,#cols).
11  Our goal in this kernel is to transpose these flattened 2D matrices, i.e. make them (#cols,#rows).
12
13  Example:
14      in_tensor = [a0,a1,a2,a3,b0,b1,b2,b3,c0,c1,c2,c3]
15      shape2D = (3,4)
16  this means that in_tensor has 3 rows and 4 columns, i.e. can be represented as:
17      [a0,a1,a2,a3]
18      [b0,b1,b2,b3]
19      [c0,c1,c2,c3]
20  after transpose, we expect to get:
21      [a0,b0,c0]
22      [a1,b1,c1]
23      [a2,b2,c2]
24      [a3,b3,c3]
25  Thus, out_tensor is expected to be [a0,b0,c0,a1,b1,c1,a2,b2,c2,a3,b3,c3]
26
27  Args:
28    in_tensor: an input tensor
29    shape2D: tuple representing the dimensions to be transposed: (#rows, #cols)
30    out_tensor: an output (transposed) tensor
31  """
32  # Gather input shapes
33  sz_p, _ = in_tensor.shape
34
35  # Load input data from external memory to on-chip memory
36  in_tile = nl.load(in_tensor)
37
38  # Performing f1/f2 transpose
39  # ==========================
40  # The desired transpose pattern is provided as an input:
41  sz_f1, sz_f2 = shape2D
42
43  # We're going to need 3 indices to perform f1:f2 transpose.
44  # - i_p0 is the parallel index
45  # - i_f1 and i_f2 are both free-dim indices, and will be used to transpose between the f1/f2 axes
46  i_p0 = nl.arange(sz_p)[:, None, None]
47  i_f1 = nl.arange(sz_f1)[None, :, None]
48  i_f2 = nl.arange(sz_f2)[None, None, :]
49
50  # Perform the transposition via a SBUF-to-SBUF copy, with access-pattern manipulation
51  # Note that we have 2D tensors and 3 indices, since we need to represent a 2D access pattern *per partition*
52  # RHS traverses an F1 x F2 matrix in a row major manner
53  # LHS traverses an F2 x F1 (new) matrix in a row major manner
54  out_tile = nl.ndarray(shape=(sz_p, sz_f2*sz_f1), dtype=out_tensor.dtype)
55  out_tile[i_p0, i_f2*sz_f1+i_f1] = nl.copy(in_tile[i_p0, i_f1*sz_f2+i_f2])
56
57  # Finally, we store out_tile to external memory
58  nl.store(out_tensor, value=out_tile)

The main concept introduced in this example is a 2D memory access pattern per partition, via additional indices. We copy in_tile into out_tile, while traversing the memory in different access patterns between the source and destination, thus achieving the desired transposition.

You may download the full runnable script from Transpose2d tutorial.

Case #3 - 2D pooling operation#

Lastly, we examine a case of dimensionality reduction. We implement a 2D MaxPool operation, which is used in many vision neural networks. This operation takes C x [H,W] matrices and reduces each matrix along the H and W axes. To leverage free-dimension flexible indexing, we can map the C (parallel) axis to the P dimension and H/W (contraction) axes to the F dimension. Performing such a 2D pooling operation requires a 4D memory access pattern in the F dimension, with reduction along two axes. Fig. 13 below illustrates the input and output tensor layouts.

../../_images/pm-index-3.png

Fig. 13 2D-Pooling Operation (reducing on axes F2 and F4)#

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3
 4@nki_jit
 5def tensor_maxpool_kernel_(in_tensor, out_tensor, pool_size):
 6  """NKI kernel to compute a 2D max-pool operation
 7
 8  Args:
 9      in_tensor: an input tensor, of dimensions C x H x W
10      out_tensor: the resulting output tensor, of dimensions C x (H/P) x (W/P)
11      pool_size: integer P representing a (square) pool-window size
12  """
13
14  # Get input/output dimensions
15  sz_cin, sz_hin, sz_win = in_tensor.shape
16  sz_cout, sz_hout, sz_wout = out_tensor.shape
17  assert sz_cin == sz_cout
18
19  # Set relevant sizes
20  sz_p = sz_cin
21  sz_pool = pool_size
22
23  # Generate tensor h/w index patterns
24  # 3D indexing according to [C, H, W]
25  i_p = nl.arange(sz_p)[:, None, None] # 3D index for the C (partition) dimension
26  i_win = nl.arange(sz_win)[None, None, :]
27  i_hin = nl.arange(sz_hin)[None, :, None]
28
29  i_wout = nl.arange(sz_wout)[None, None, :]
30  i_hout = nl.arange(sz_hout)[None, :, None]
31
32  # Generate pool index patterns (requires two extra dimensions, for the pool window)
33  i_0 = nl.arange(sz_p)[:, None, None, None, None] # channel (partition) dim
34  i_1 = nl.arange(sz_hin//sz_pool)[None, :, None, None, None] # y_outer
35  i_2 = nl.arange(sz_pool)[None, None, :, None, None] # y_inner
36  i_3 = nl.arange(sz_win//sz_pool)[None, None, None, :, None] # x_outer
37  i_4 = nl.arange(sz_pool)[None, None, None, None, :] # x_inner
38
39  # Load input data from external memory to on-chip memory
40  # Declare ndarray to force a 3D tensor (temporary requirement)
41  in_tile = nl.ndarray([sz_p, sz_hin, sz_win], dtype=in_tensor.dtype)
42  in_tile[:,:,:] = nl.load(in_tensor[i_p, i_hin, i_win])
43
44  # Perform the pooling operation:
45  # We use NumPy-style advanced indexing to extend in_tile to 5D, and then reduce-max over two dimensions.
46  # axis[0] is the index for p_dim, and thus doesn't participate in the reduction operation.
47  # axis[1] and axis[2] together index the rows, with axis[2] responsible for inner strides
48  # (i.e. inside a pooling window), and axis[1] responsible for the outer strides. As such, we reduce over axis[2].
49  # Similarly, axis[3] and axis[4] together index the columns, and we thus reduce over axis[4].
50  out_tile = nl.max(in_tile[i_0, sz_pool*i_1+i_2, sz_pool*i_3+i_4], axis=[2,4])
51
52  # Store the results back to external memory
53  nl.store(out_tensor[i_p, i_hout, i_wout], value=out_tile)
54
55
56if __name__ == "__main__":
57    import torch
58    from torch_xla.core import xla_model as xm
59
60    device = xm.xla_device()
61
62    # Now let's run the kernel
63    POOL_SIZE = 2
64    C, HIN, WIN = 2, 6, 6
65    HOUT, WOUT = HIN//POOL_SIZE, WIN//POOL_SIZE
66
67    in_tensor = torch.arange(C * HIN * WIN, dtype=torch.bfloat16).reshape(C, HIN, WIN).to(device=device)
68    out_tensor = torch.zeros((C, HOUT, WOUT), dtype=torch.bfloat16).to(device=device)
69
70    tensor_maxpool_kernel_(in_tensor, out_tensor, POOL_SIZE)
71
72    print(in_tensor, out_tensor) # an implicit XLA barrier/mark-step

SPMD: Launching multiple instances of a kernel#

So far we have discussed how to launch a single NKI kernel instance, in which the full input tensor is processed. In this section, we discuss how to launch multiple instances of the same kernel and slice the full input tensor across kernel instances using a single program multiple data programming model (SPMD).

Note

In the current NKI release, adopting the SPMD programming model has no impact on the performance of a NKI kernel, and it is therefore considered optional. An SPMD program is compiled into an executable that targets one NeuronCore, and the different instances of the SPMD program are executed serially on a single NeuronCore. This is subject to change in future releases.

NKI allows users to launch multiple instances of a kernel, organized in a user-defined multi-dimensional grid. The grid indices are then used by the different kernel instances to select which input and output data to access. There is no restriction on the number of dimensions in an SPMD grid, nor on the size of each dimension. Each kernel instance can find its coordinates within the launch grid using the nki.language.program_id API. The Neuron compiler translates the SPMD launch grid into nested loops of compute-kernel invocations, which are then executed on the NeuronCore.

As an example, we’ll perform a C=A@B matrix multiplication, where A and B are of shape (512, 128) and (128, 1024) respectively. We partition the output tensor C of shape (512, 1024) into 4x2 tiles and assign the task of computing each output tile to a different kernel instance. A 4x2 launch-grid is chosen in this case, in order to make each compute kernel instance operate on a single tile in A and a single tile in B, while adhering to the tile-size constraints.

With a 2D 4x2 launch grid, the (i,j) kernel instance is responsible for computing the (i,j) tile of C. The computation of the (i,j) tile requires the corresponding rows of A and columns of B. This induces a four-way row-wise partitioning of A and a two-way column-wise partitioning of B, as shown in Fig. 14.

../../_images/pm-spmd.png

Fig. 14 Visualization of 512x128x1024 matrix multiplication using SPMD#

In this SPMD kernel example, we will use the high-level nki.language.matmul API, so that we can focus on the concept of SPMD without worrying about the layout requirement of the Tensor Engine (LC#1). To achieve the best performance, we suggest transposing input A and invoking a different NKI kernel that solely performs matmul operations on the Tensor Engine using nki.isa.nc_matmul, avoiding the extra overhead of changing input layouts to meet LC#1.

 1import neuronxcc.nki.language as nl
 2from torch_neuronx import nki_jit
 3
 4@nki_jit
 5def matmul_128x128x512_spmd(A, B, result):
 6  """NKI kernel to compute a 128x128x512 matrix multiplication operation.
 7     Use SPMD program IDs to index into the full A and B input tensors to get tiles
 8     for 128x128x512 matrix multiplication.
 9
10  Args:
11      A: an input tensor of shape [M=512,K=128],
12         a left hand side argument of the matrix multiplication,
13      B: an input tensor of shape [K=128,N=1024],
14         a right hand side argument of the matrix multiplication
15      result: the full output tensor of shape [M=512,N=1024]; each kernel instance computes one [128,512] tile of it
16  """
17  # Defining starting indexes for input A and B
18  i_A_row = nl.program_id(0) * 128
19  i_B_col = nl.program_id(1) * 512
20
21  # Loading the inputs (HBM->SBUF)
22  A_tile = nl.load(A[i_A_row:i_A_row+128, 0:128])
23  B_tile = nl.load(B[0:128, i_B_col:i_B_col+512])
24
25  # Perform the matrix-multiplication
26  # Note1: nl.matmul will invoke a transpose on A_tile before performing the actual matmul operation
27  # Note2: A NKI matmul instruction always writes to PSUM in float32 data-type
28  result_psum = nl.matmul(A_tile, B_tile)
29
30  # Copy the result from PSUM back to SBUF, and cast to expected output data-type
31  result_sbuf = nl.copy(result_psum, dtype=result.dtype)
32
33  # The result of a [128,128] x [128,512] matrix multiplication has a shape of [128, 512].
34  # This dictates which indices to use to address the result tile.
35  nl.store(result[i_A_row:i_A_row+128, i_B_col:i_B_col+512], value=result_sbuf)
36
37
38if __name__ == "__main__":
39  from torch_xla.core import xla_model as xm
40  import torch
41
42  device = xm.xla_device()
43
44  A = torch.ones((512, 128), dtype=torch.bfloat16).to(device=device)
45  B = torch.ones((128, 1024), dtype=torch.bfloat16).to(device=device)
46  result = torch.zeros((512, 1024), dtype=torch.bfloat16).to(device=device)
47
48  # Launch kernel with a 2D grid
49  matmul_128x128x512_spmd[4, 2](A, B, result)
50
51  print(result) # an implicit XLA barrier/mark-step

This document is relevant for: Inf2, Trn1, Trn1n