nki.isa.nc_matmul

nki.isa.nc_matmul(dst, stationary, moving, is_stationary_onezero=False, is_moving_onezero=False, is_transpose=False, tile_position=(), tile_size=(), psum_accumulate_flag=3, name=None)

Compute the matrix multiplication dst = stationary.T @ moving on the Tensor Engine.

The figure below illustrates how to map a matrix multiplication from its mathematical definition to nisa.nc_matmul on the Tensor Engine. For a more detailed discussion of Tensor Engine capabilities, see the Trainium arch guide.

../../../_images/matmul.png

Fig. 116 MxKxN Matrix Multiplication Visualization.
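As a reading aid, the mathematical definition can be written out in plain Python. This is only a sketch of the semantics on nested lists, not the NKI API: the helper name nc_matmul_reference is hypothetical, and it assumes stationary has shape [K][M] and moving has shape [K][N], so that dst has shape [M][N].

```python
# Plain-Python reference model of dst = stationary.T @ moving.
# NOT the NKI API: nc_matmul_reference is a hypothetical helper that
# only mirrors the math on nested lists.

def nc_matmul_reference(stationary, moving):
    """stationary: [K][M], moving: [K][N] -> dst: [M][N]."""
    k_size = len(stationary)
    if len(moving) != k_size:
        raise ValueError("stationary and moving must share the contraction size K")
    m_size = len(stationary[0])
    n_size = len(moving[0])
    # dst[m][n] = sum over k of stationary[k][m] * moving[k][n]
    return [
        [sum(stationary[k][m] * moving[k][n] for k in range(k_size))
         for n in range(n_size)]
        for m in range(m_size)
    ]

# K=2, M=3, N=2
stationary = [[1, 2, 3],
              [4, 5, 6]]
moving = [[1, 0],
          [0, 1]]
dst = nc_matmul_reference(stationary, moving)
print(dst)  # [[1, 4], [2, 5], [3, 6]]
```

The contraction dimension K is the outer (first) index of both inputs, which is the role the partition dimension plays in the Layout section.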

Performance mode.

On NeuronCore-v2, performance mode is not supported. On NeuronCore-v3 and NeuronCore-v4, Tensor Engine supports FP8 double performance mode, enabled by setting performance mode to double_row. See Trainium2 arch guide for more details. double_row performance mode cannot be combined with Tensor Engine column tiling mode (details below).

Tiling mode.

NeuronCore Tensor Engine is built upon a systolic array with 128 rows and 128 columns of processing elements (PEs). Tensor Engine supports both row and column tiling modes, which allow multiple nc_matmul instructions with a stationary tile size smaller than [128, 128] to run in parallel to improve hardware utilization. Row tiling mode slices the 128 PE rows into 2x 64 row tiles (NeuronCore-v2 or newer), or 4x 32 row tiles (NeuronCore-v3 or newer). Column tiling mode slices the 128 PE columns in the same fashion. The row and column tile sizes can be set independently in the tile_size field as a tuple (row_size, column_size). The stationary tile size must not exceed the chosen tile_size.

In addition, a given nc_matmul can pick the exact row and column tile within the 128x128 systolic array by specifying the starting row and starting column in tile_position as a tuple (start_row, start_column). The start_row must be a multiple of the row_size specified in tile_size, and the selected tile must fit within the 128 PE rows (start_row + row_size must not exceed 128). Similarly, the start_column must be a multiple of column_size, and start_column + column_size must not exceed 128.

For example, setting tile_position to (64, 0) and tile_size to (64, 128) means using the bottom half of the systolic array.

Note that tile_position and tile_size must both be set to enable tiling mode. If neither is set, the default is to use the full systolic array, which is equivalent to tile_position=(0, 0) and tile_size=(128, 128). The values in the tile_position and tile_size tuples can be integers or affine expressions.
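The tiling rules above can be sketched as a small validity check. This is an illustrative model in plain Python, not part of NKI: resolve_tiling is a hypothetical helper, it treats 32, 64, and 128 as the allowed tile sizes regardless of NeuronCore version, and it does not enforce the tighter restrictions listed under Parameters (start_column must be 0, column_size must be 128).

```python
PE_ROWS = 128
PE_COLUMNS = 128
# Assumption: 32 is only valid on NeuronCore-v3 or newer; not checked here.
TILE_SIZES = (32, 64, 128)

def resolve_tiling(tile_position=(), tile_size=()):
    """Return (row_range, column_range) of the PEs used, or raise ValueError."""
    if not tile_position and not tile_size:
        # Default: the full systolic array.
        tile_position, tile_size = (0, 0), (PE_ROWS, PE_COLUMNS)
    elif not tile_position or not tile_size:
        raise ValueError("tile_position and tile_size must both be set")
    start_row, start_column = tile_position
    row_size, column_size = tile_size
    if row_size not in TILE_SIZES or column_size not in TILE_SIZES:
        raise ValueError("tile sizes must be 32, 64, or 128")
    if start_row % row_size or start_column % column_size:
        raise ValueError("start must be a multiple of the tile size")
    if not (0 <= start_row <= PE_ROWS - row_size
            and 0 <= start_column <= PE_COLUMNS - column_size):
        raise ValueError("tile must fit inside the 128x128 array")
    return (range(start_row, start_row + row_size),
            range(start_column, start_column + column_size))

# Bottom half of the array: PE rows 64..127, all 128 columns.
rows, columns = resolve_tiling(tile_position=(64, 0), tile_size=(64, 128))
print(rows, columns)  # range(64, 128) range(0, 128)
```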

Transpose mode.

Tensor Engine can transpose a tile in SBUF by loading it as a stationary tile and using an identity matrix as the moving tile. Starting with NeuronCore-v3, turning on transpose mode by setting is_transpose=True enables bit-accurate data transpose, which correctly transposes tensors that contain NaN/Inf values. See the Trainium2 arch guide for more details.

On NeuronCore-v2, Tensor Engine does not support transpose mode natively. However, setting is_transpose=True ensures neuron-profile identifies this instruction as a transpose for performance metric accounting purposes.
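The identity-matrix trick follows directly from dst = stationary.T @ moving. The sketch below models it in plain Python (the helper names are hypothetical, and nothing here calls NKI). It also shows why a multiply-based transpose is not bit-accurate under ordinary IEEE floating-point arithmetic: NaN * 0 is NaN, so a single NaN spreads along its output row.

```python
import math

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def matmul_stationary_t(stationary, moving):
    """Model of dst = stationary.T @ moving on nested lists ([K][M], [K][N])."""
    k_size = len(stationary)
    return [[sum(stationary[k][m] * moving[k][n] for k in range(k_size))
             for n in range(len(moving[0]))]
            for m in range(len(stationary[0]))]

# A [K=2][M=3] tile; multiplying by a 2x2 identity yields its [3][2] transpose.
tile = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
transposed = matmul_stationary_t(tile, identity(2))
print(transposed)  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

# With a NaN input, the multiply-accumulate contaminates a neighbouring
# output element: leaky[0][1] = NaN * 0.0 + 4.0 * 1.0 = NaN.
tile_with_nan = [[math.nan, 2.0, 3.0],
                 [4.0, 5.0, 6.0]]
leaky = matmul_stationary_t(tile_with_nan, identity(2))
print(math.isnan(leaky[0][1]))  # True
```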

Memory types.

The nc_matmul instruction must read inputs from SBUF and write outputs to PSUM. Therefore, stationary and moving must be SBUF tiles, and dst must be a PSUM tile.

The psum_accumulate_flag controls whether the matmul result data should overwrite or accumulate on top of the dst PSUM tile. Multiple nisa.nc_matmul instructions accumulating into the same PSUM tile can form an accumulation group before the PSUM tile content is evicted back to SBUF. The encoding of psum_accumulate_flag is as follows:

  • bit[0] of psum_accumulate_flag: if set, indicates this nisa.nc_matmul call is the first instruction in the accumulation group. The matmul result should overwrite the existing content in the dst PSUM tile.

  • bit[1] of psum_accumulate_flag: if set, indicates this nisa.nc_matmul call is the last instruction in the accumulation group. The matmul result should accumulate to the existing content in the dst PSUM tile.

  • bit[2] of psum_accumulate_flag: if set, indicates this nisa.nc_matmul call is the first instruction in the accumulation group. However, the matmul result should accumulate to the existing content in the dst PSUM tile.

nisa.nc_matmul calls that are not the first or last instruction of an accumulation group should not set any bit: psum_accumulate_flag=0.
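The bit encoding above can be sketched as a helper that produces one flag value per instruction in an accumulation group. The constant and function names are hypothetical and not part of NKI. Combining bit[2] with bit[1] for a single-instruction group is an assumption that the source does not state explicitly.

```python
# Hypothetical names for the three documented bits.
FIRST_OVERWRITE = 1 << 0    # bit[0]: first in group, overwrite dst
LAST = 1 << 1               # bit[1]: last in group, accumulate into dst
FIRST_ACCUMULATE = 1 << 2   # bit[2]: first in group, accumulate into dst

def accumulation_flags(num_matmuls, accumulate_into_existing=False):
    """Return the psum_accumulate_flag value for each call in one group."""
    if num_matmuls < 1:
        raise ValueError("an accumulation group needs at least one matmul")
    flags = [0] * num_matmuls          # middle instructions set no bit
    flags[0] |= FIRST_ACCUMULATE if accumulate_into_existing else FIRST_OVERWRITE
    flags[-1] |= LAST                  # may be the same instruction as the first
    return flags

print(accumulation_flags(1))           # [3]
print(accumulation_flags(4))           # [1, 0, 0, 2]
print(accumulation_flags(3, True))     # [4, 0, 2]
```

A group of one instruction yields 3, which matches the default psum_accumulate_flag=3 in the signature: a standalone matmul that overwrites dst.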

Data types.

The input stationary and moving tiles can be one of these supported data types: float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32/float32. The stationary and moving tiles can have different data types, with one exception: if one of the input tiles is tfloat32/float32, the other tile must also be tfloat32/float32. On NeuronCore-v3 and NeuronCore-v4, when performance mode is double_row, stationary and moving tiles must be one of float8_e4m3 or float8_e5m2, but the two input tiles can have different float8 formats.

The accumulation precision internal to Tensor Engine is float32. The dst tile must be a float32 tile on NeuronCore-v2 and NeuronCore-v3. Starting with NeuronCore-v4, dst can be either a float32 or a bfloat16 tile.
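The data type rules in this section can be summarized as a small check. This is a sketch in plain Python using dtype names as strings. The function names and the double_row argument are hypothetical and do not correspond to NKI parameters.

```python
FP8 = {"float8_e4m3", "float8_e5m2"}
FP32_FAMILY = {"tfloat32", "float32"}
SUPPORTED_INPUTS = FP8 | {"bfloat16", "float16"} | FP32_FAMILY

def check_input_dtypes(stationary, moving, double_row=False):
    """Return True if the input dtype pair is allowed, else raise ValueError."""
    for dtype in (stationary, moving):
        if dtype not in SUPPORTED_INPUTS:
            raise ValueError(f"unsupported input dtype: {dtype}")
    # If one input is tfloat32/float32, the other must be as well.
    if (stationary in FP32_FAMILY) != (moving in FP32_FAMILY):
        raise ValueError("tfloat32/float32 inputs cannot be mixed with other dtypes")
    # double_row performance mode accepts only float8 inputs (formats may differ).
    if double_row and not (stationary in FP8 and moving in FP8):
        raise ValueError("double_row requires float8 inputs")
    return True

def allowed_dst_dtypes(neuron_core_version):
    """dst dtypes per the text: float32 only before v4, plus bfloat16 from v4."""
    return {"float32", "bfloat16"} if neuron_core_version >= 4 else {"float32"}

print(check_input_dtypes("bfloat16", "float8_e4m3"))                    # True
print(check_input_dtypes("float8_e4m3", "float8_e5m2", double_row=True))  # True
print(sorted(allowed_dst_dtypes(4)))                                    # ['bfloat16', 'float32']
```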

Layout.

If performance mode is off, the contraction dimension of the matmul must be along the partition dimension in both stationary and moving tiles.

If performance mode is double_row, the contraction dimension of the matmul is split between the partition dimension and the first free dimension after the partition dimension in both stationary and moving tiles. The first free dimension must be 2. For example, to perform a matmul of [1, 256]@[256, 3]=[1, 3], the stationary tile is of shape [128, 2, 1], while the moving tile is of shape [128, 2, 3].
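The [1, 256]@[256, 3] example can be checked numerically with a plain-Python model. The source does not say how a contraction index k maps onto (partition, first free) coordinates, so the mapping k = 2 * p + j used here is an assumption made purely for illustration. Any mapping gives the same result as long as both tiles use the same one.

```python
K, M, N = 256, 1, 3
PARTITIONS, SPLIT = 128, 2   # contraction dim split as [128, 2]

# Logical operands with small integer values so results compare exactly.
a = [[(k % 5) + 1 for k in range(K)] for _ in range(M)]      # [M][K]
b = [[(k + n) % 7 for n in range(N)] for k in range(K)]      # [K][N]

# Assumed layout: contraction index k = 2 * p + j (illustrative only).
stationary = [[[a[m][2 * p + j] for m in range(M)]
               for j in range(SPLIT)] for p in range(PARTITIONS)]   # [128][2][1]
moving = [[[b[2 * p + j][n] for n in range(N)]
           for j in range(SPLIT)] for p in range(PARTITIONS)]       # [128][2][3]

# Contract over both the partition index p and the first free index j.
dst = [[sum(stationary[p][j][m] * moving[p][j][n]
            for p in range(PARTITIONS) for j in range(SPLIT))
        for n in range(N)] for m in range(M)]

# Reference: ordinary contraction over all 256 values of k.
expected = [[sum(a[m][k] * b[k][n] for k in range(K))
             for n in range(N)] for m in range(M)]

print(dst == expected)  # True
```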

Regardless of performance mode, the free dimension of the stationary tile matches the partition dimension of the output dst tile in size, while the free dimension of the moving tile matches the free dimension of the dst tile in size.

Tile size.

The partition dimension sizes of the stationary and moving tiles must be identical. They must not exceed 128 when tiling mode is off, or the row_size specified in tile_size when tiling mode is on. The free dimension size of stationary must not exceed 128 when tiling mode is off, or the column_size specified in tile_size when tiling mode is on.

On NeuronCore-v2 and NeuronCore-v3, the free dimension size of the moving tile must not exceed 512, matching the maximum number of float32 elements per PSUM bank. Starting with NeuronCore-v4, the free dimension size of the moving tile can go up to 4096 for a float32 dst or 8192 for a bfloat16 dst, matching the size of 8x PSUM banks (the entire PSUM).

Explicit tiling is required when the high-level matmul operation exceeds the tile size limits of nc_matmul.
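One way to picture explicit tiling is as a plan of nc_matmul-sized pieces. The sketch below is plain Python and does not call NKI. plan_tiles is a hypothetical helper that assumes tiling mode is off and uses the NeuronCore-v2/v3 limits stated above (128 for the contraction and stationary free dimensions, 512 for the moving free dimension). Each output tile gets one accumulation group over the K chunks, with flag values following the psum_accumulate_flag encoding.

```python
def plan_tiles(m_size, k_size, n_size, max_m=128, max_k=128, max_n=512):
    """Return a list of (m_start, n_start, k_start, psum_accumulate_flag)."""
    plan = []
    for m_start in range(0, m_size, max_m):          # stationary free dim
        for n_start in range(0, n_size, max_n):      # moving free dim
            k_starts = list(range(0, k_size, max_k)) # contraction dim
            for i, k_start in enumerate(k_starts):
                flag = 0
                if i == 0:
                    flag |= 1    # bit[0]: first in group, overwrite dst
                if i == len(k_starts) - 1:
                    flag |= 2    # bit[1]: last in group
                plan.append((m_start, n_start, k_start, flag))
    return plan

# A 256 x 384 x 1024 (M x K x N) matmul needs 2 * 2 output tiles,
# each accumulated over 3 chunks of K: 12 nc_matmul-sized pieces.
plan = plan_tiles(256, 384, 1024)
print(len(plan))   # 12
print(plan[:3])    # [(0, 0, 0, 1), (0, 0, 128, 0), (0, 0, 256, 2)]
```

A problem that already fits the limits, such as plan_tiles(128, 128, 512), produces a single entry with flag 3.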

Parameters:
  • dst – the matmul output

  • stationary – the stationary operand

  • moving – the moving operand

  • is_stationary_onezero – hints to the compiler whether the stationary operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if stationary tile is in float32; the field has no impact for non-float32 stationary

  • is_moving_onezero – hints to the compiler whether the moving operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if moving tile is in float32; the field has no impact for non-float32 moving

  • is_transpose – controls Tensor Engine transpose mode on/off starting NeuronCore-v3

  • tile_position – a 2D tuple (start_row, start_column) to control starting row in Tensor Engine tiling mode; start_column must be 0

  • tile_size – a 2D tuple (row_size, column_size) to control row tile size in Tensor Engine tiling mode; column_size must be 128

  • psum_accumulate_flag – controls PSUM near-memory accumulation in the dst tile