This document is relevant for: Inf2, Trn1, Trn2

nki.isa.nc_matmul

nki.isa.nc_matmul(stationary, moving, *, is_stationary_onezero=False, is_moving_onezero=False, is_transpose=False, tile_position=(), tile_size=(), mask=None, **kwargs)

Compute the matrix multiplication stationary.T @ moving on the Tensor Engine.

The nc_matmul instruction must read its inputs from SBUF and write its output to PSUM. Therefore, stationary and moving must be SBUF tiles, and the result is a PSUM tile.

The nc_matmul instruction currently supports float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32/float32 input data types as listed in Supported Data Types. The matmul accumulation and results are always in float32.
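As a rough numerical model of the semantics described above (not the NKI API itself), the following NumPy sketch upcasts both operands to float32 before multiplying, so the accumulation and the result are float32 regardless of the input data type. The helper name nc_matmul_reference is hypothetical, and the sketch does not claim bit-exact agreement with the hardware.

```python
import numpy as np

def nc_matmul_reference(stationary, moving):
    """Hypothetical NumPy model of nc_matmul: stationary.T @ moving in float32."""
    # Upcast first so that products are accumulated in float32, mirroring the
    # documented behaviour that accumulation and results are always float32.
    s = np.asarray(stationary).astype(np.float32)
    m = np.asarray(moving).astype(np.float32)
    return s.T @ m

# Small float16 inputs: partition (contraction) axis = 64 for both tiles.
stationary = np.ones((64, 32), dtype=np.float16)   # (K, M)
moving = np.ones((64, 100), dtype=np.float16)      # (K, N)
result = nc_matmul_reference(stationary, moving)
print(result.shape, result.dtype)
```

With all-ones inputs, every output element equals the contraction length (64 here), which makes the sketch easy to sanity-check by hand.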

The Tensor Engine imposes layout constraints on the input tiles. First, the partition axis sizes of the stationary and moving tiles must be identical and <= 128; the partition axis corresponds to the contraction dimension of the matrix multiplication. Second, the free axis sizes of the stationary and moving tiles must be <= 128 and <= 512, respectively. For example, with stationary.shape = (128, 126) and moving.shape = (128, 512), nc_matmul(stationary, moving) returns a tile of shape (126, 512). For more information about the matmul layout, see Tensor Engine.
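The layout constraints above can be checked before launching a kernel. The following is a minimal plain-Python sketch; the function and constant names are hypothetical and not part of NKI.

```python
MAX_PARTITION = 128        # limit on the partition (contraction) axis
MAX_STATIONARY_FREE = 128  # limit on the stationary tile's free axis
MAX_MOVING_FREE = 512      # limit on the moving tile's free axis

def nc_matmul_layout_problems(stationary_shape, moving_shape):
    """Return a list of violated layout constraints (empty if the shapes are valid)."""
    k_stationary, m = stationary_shape
    k_moving, n = moving_shape
    problems = []
    if k_stationary != k_moving:
        problems.append("partition axis sizes differ")
    if max(k_stationary, k_moving) > MAX_PARTITION:
        problems.append("partition axis exceeds 128")
    if m > MAX_STATIONARY_FREE:
        problems.append("stationary free axis exceeds 128")
    if n > MAX_MOVING_FREE:
        problems.append("moving free axis exceeds 512")
    return problems

def nc_matmul_output_shape(stationary_shape, moving_shape):
    """Return the (M, N) result shape, raising ValueError on invalid shapes."""
    problems = nc_matmul_layout_problems(stationary_shape, moving_shape)
    if problems:
        raise ValueError("; ".join(problems))
    # Output partition axis comes from stationary's free axis,
    # output free axis comes from moving's free axis.
    return (stationary_shape[1], moving_shape[1])

print(nc_matmul_output_shape((128, 126), (128, 512)))
```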

Fig. 12 MxKxN Matrix Multiplication Visualization.

If the contraction dimension of the matrix multiplication exceeds 128, you can accumulate the output tiles of multiple nc_matmul instructions into the same PSUM tile. See Example 2 below.
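The accumulation scheme can be illustrated numerically without any Neuron hardware. This NumPy sketch (an illustration only, with a float32 array standing in for the PSUM tile) splits a contraction dimension of 256 into two chunks of 128 and sums the partial products; the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 256, 128, 512          # contraction dim K exceeds the 128 limit
TILE_K = 128                     # maximum partition axis size per nc_matmul

stationary = rng.standard_normal((K, M)).astype(np.float32)
moving = rng.standard_normal((K, N)).astype(np.float32)

# Float32 accumulator standing in for the PSUM tile.
acc = np.zeros((M, N), dtype=np.float32)
for k0 in range(0, K, TILE_K):
    # Each iteration models one nc_matmul on a (<=128)-row slice of both operands.
    acc += stationary[k0:k0 + TILE_K].T @ moving[k0:k0 + TILE_K]

# The chunked result matches the single large product up to rounding.
full = stationary.T @ moving
print(np.allclose(acc, full, rtol=1e-4, atol=1e-3))
```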

Estimated instruction cost:

The Tensor Engine has complex performance characteristics given its data flow and pipeline design. The formula below gives the average nc_matmul cost, assuming many nc_matmul instructions of the same shape run back-to-back on the engine:

Cost (Tensor Engine Cycles)               | Condition
max(min(64, N_stationary), N_moving)      | input data type is one of float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32
4 * max(min(64, N_stationary), N_moving)  | input data type is float32

where,

  • N_stationary is the number of elements per partition in the stationary tile.

  • N_moving is the number of elements per partition in the moving tile.
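The cost formula above is simple enough to evaluate directly. The helper below is a hypothetical sketch, not an NKI function, that returns the estimated Tensor Engine cycles for a given pair of per-partition element counts and an input data type name.

```python
# Data types that use the 1x formula, per the cost table.
FAST_DTYPES = {"float8_e4m3", "float8_e5m2", "bfloat16", "float16", "tfloat32"}

def estimate_nc_matmul_cycles(n_stationary, n_moving, dtype):
    """Average cycles per nc_matmul, assuming back-to-back same-shape instructions."""
    base = max(min(64, n_stationary), n_moving)
    if dtype in FAST_DTYPES:
        return base
    if dtype == "float32":
        return 4 * base  # float32 inputs cost 4x
    raise ValueError(f"unsupported input data type: {dtype}")

print(estimate_nc_matmul_cycles(128, 512, "bfloat16"))  # 512
print(estimate_nc_matmul_cycles(128, 512, "float32"))   # 2048
print(estimate_nc_matmul_cycles(32, 16, "float16"))     # 32
```

Note that in the third call the stationary term dominates: when the moving tile has few elements per partition, the cost is bounded below by min(64, N_stationary).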

The Tensor Engine is a systolic array with 128 rows and 128 columns of processing elements (PEs). It can be underutilized by small nc_matmul instructions, that is, when the stationary tile has a small free axis size or a small partition axis size (e.g. 32 or 64). In such cases, the Tensor Engine supports PE tiling, in which multiple small nc_matmul instructions execute in parallel on the PE array to improve compute throughput. PE tiling is enabled by setting tile_position and tile_size. tile_position gives the starting position (row, column) in the PE array of the PE tile used by an nc_matmul instruction. tile_size gives the size (rows, columns) of the PE tile that the instruction occupies, starting from tile_position. For example, setting tile_position to (0, 0) and tile_size to (128, 128) uses the full PE array.

Requirements on tile_position and tile_size are:

  1. tile_position and tile_size must both be set to enable PE tiling.

  2. The values in tile_position and tile_size must be integers or affine expressions.

  3. The values in tile_position and tile_size must be multiples of 32.

  4. tile_size must be larger than or equal to the accessed stationary tile size.

  5. Both the row and column sizes in tile_size cannot be 32 for NeuronCore-v2.
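Requirements 1 through 4 can be sketched as a plain-Python pre-check. Everything here is illustrative: the function name is hypothetical, only plain integers are modeled (affine expressions are not), rows are assumed to map to the stationary partition axis and columns to its free axis, and the check that the tile fits inside the 128x128 array is an added assumption. Requirement 5 is deliberately not encoded because it is specific to one hardware generation.

```python
PE_ROWS = 128  # rows of the PE array
PE_COLS = 128  # columns of the PE array

def pe_tiling_problems(tile_position, tile_size, stationary_shape):
    """Return a list of violated requirements (an empty list means the check passed)."""
    # Requirement 1: both or neither must be set (an empty tuple means unset).
    if bool(tile_position) != bool(tile_size):
        return ["tile_position and tile_size must both be set"]
    if not tile_position:
        return []  # PE tiling not requested
    if len(tile_position) != 2 or len(tile_size) != 2:
        return ["tile_position and tile_size must be 2D tuples"]
    values = tuple(tile_position) + tuple(tile_size)
    # Requirement 2 (this sketch accepts plain integers only).
    if not all(isinstance(v, int) for v in values):
        return ["values must be integers"]
    problems = []
    # Requirement 3: every value is a multiple of 32.
    if any(v % 32 != 0 for v in values):
        problems.append("values must be multiples of 32")
    (row, col), (rows, cols) = tile_position, tile_size
    partition, free = stationary_shape
    # Requirement 4 (assumed mapping: rows <-> partition axis, columns <-> free axis).
    if rows < partition or cols < free:
        problems.append("tile_size is smaller than the stationary tile")
    # Assumption of this sketch: the tile must lie inside the PE array.
    if (row < 0 or col < 0 or rows <= 0 or cols <= 0
            or row + rows > PE_ROWS or col + cols > PE_COLS):
        problems.append("tile does not fit in the 128x128 PE array")
    return problems

print(pe_tiling_problems((64, 0), (64, 64), (64, 64)))   # []
print(pe_tiling_problems((16, 0), (64, 64), (64, 64)))
```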

Parameters:
  • stationary – the stationary operand on SBUF; layout: (partition axis <= 128, free axis <= 128)

  • moving – the moving operand on SBUF; layout: (partition axis <= 128, free axis <= 512)

  • mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  • is_stationary_onezero – hints to the compiler whether the stationary operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if the stationary tile is in float32; the field has no impact for a non-float32 stationary tile.

  • is_moving_onezero – hints to the compiler whether the moving operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if the moving tile is in float32; the field has no impact for a non-float32 moving tile.

  • is_transpose – hints to the compiler that this is a transpose operation with moving as an identity matrix.

  • tile_position – a 2D tuple (row, column) giving the starting PE tile position at which to run nc_matmul.

  • tile_size – a 2D tuple (row, column) giving the size of the PE tile occupied by nc_matmul, starting from tile_position.

Returns:

a tile on PSUM holding the result of the matrix multiplication of the stationary and moving tiles; layout: the partition axis comes from the free axis of stationary, and the free axis comes from the free axis of moving.

Example:

from neuronxcc import nki
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

# NOTE: the kernel wrapper, its name, and the output allocations below are
# assumptions added so that the snippet is self-contained; the original
# example shows only the kernel body.
@nki.jit
def nc_matmul_examples(a_tensor, b_tensor, d_tensor, e_tensor, g_tensor, h_tensor):
  # Output tensors in device memory (HBM), returned to the caller.
  c_tensor = nl.ndarray((128, 512), dtype=nl.float32, buffer=nl.shared_hbm)
  f_tensor = nl.ndarray((128, 512), dtype=nl.float32, buffer=nl.shared_hbm)
  i_tensor = nl.ndarray((16, 64, 512), dtype=nl.float32, buffer=nl.shared_hbm)

  ##################################################################
  # Example 1:
  # multiply matrix a of shape (128, 128) and matrix b of shape (128, 512)
  # to get matrix c in PSUM of shape (128, 512)
  ##################################################################
  a_mgrid = nl.mgrid[0:128, 0:128]
  b_mgrid = nl.mgrid[0:128, 0:512]
  c_mgrid = nl.mgrid[0:128, 0:512]

  # Load both operands from HBM into SBUF.
  a = nl.load(a_tensor[a_mgrid.p, a_mgrid.x])
  b = nl.load(b_tensor[b_mgrid.p, b_mgrid.x])

  # Computes a.T @ b; the result tile lives in PSUM.
  c_psum = nisa.nc_matmul(a[a_mgrid.p, a_mgrid.x], b[b_mgrid.p, b_mgrid.x])

  nl.store(c_tensor[c_mgrid.p, c_mgrid.x], c_psum)

  ##################################################################
  # Example 2:
  # multiply matrix d of shape (256, 128) and matrix e of shape (256, 512)
  # to get matrix f in PSUM of shape (128, 512) using psum accumulation
  ##################################################################
  d_mgrid = nl.mgrid[0:128, 0:128]
  e_mgrid = nl.mgrid[0:128, 0:512]
  f_mgrid = nl.mgrid[0:128, 0:512]

  # Zero-initialized float32 accumulator in PSUM.
  f_psum = nl.zeros((128, 512), nl.float32, buffer=nl.psum)

  # The contraction dimension (256) is split into two chunks of 128.
  for i_contract in nl.affine_range(2):
    d = nl.load(d_tensor[i_contract * 128 + d_mgrid.p, d_mgrid.x])
    e = nl.load(e_tensor[i_contract * 128 + e_mgrid.p, e_mgrid.x])
    f_psum += nisa.nc_matmul(d[d_mgrid.p, d_mgrid.x], e[e_mgrid.p, e_mgrid.x])

  nl.store(f_tensor[f_mgrid.p, f_mgrid.x], f_psum)

  ##################################################################
  # Example 3:
  # perform batched matrix multiplication on matrix g of shape (16, 64, 64)
  # and matrix h of shape (16, 64, 512) to get matrix i of (16, 64, 512)
  # using Tensor Engine PE tiling mode.
  ##################################################################
  g_mgrid = nl.mgrid[0:64, 0:64]
  h_mgrid = nl.mgrid[0:64, 0:512]
  i_mgrid = nl.mgrid[0:64, 0:512]

  for i in nl.affine_range(4):
    for j in nl.affine_range(4):
      g = nl.load(g_tensor[i * 4 + j, g_mgrid.p, g_mgrid.x])
      h = nl.load(h_tensor[i * 4 + j, h_mgrid.p, h_mgrid.x])
      # Each 64x64 matmul is placed on one quadrant of the 128x128 PE array.
      i_psum = nisa.nc_matmul(g, h, tile_position=((i % 2) * 64, (j % 2) * 64), tile_size=(64, 64))
      nl.store(i_tensor[i * 4 + j, i_mgrid.p, i_mgrid.x], i_psum)

  return c_tensor, f_tensor, i_tensor
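To see how Example 3 spreads its 16 batches over the PE array, the following plain-Python sketch (no NKI calls) replays the same loop indices and groups batch indices by the tile_position they are assigned. It only reproduces the index arithmetic from the example and makes no claim about how the hardware schedules the instructions.

```python
from collections import defaultdict

# Map each PE tile position to the batch indices that target it.
quadrants = defaultdict(list)
for i in range(4):
    for j in range(4):
        batch = i * 4 + j
        position = ((i % 2) * 64, (j % 2) * 64)  # same expression as Example 3
        quadrants[position].append(batch)

quadrants = dict(quadrants)
for position in sorted(quadrants):
    print(position, quadrants[position])
```

Each of the four 64x64 quadrants of the PE array is targeted by four of the 16 batches, so all four quadrants receive an equal share of the work.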
