nki.isa.nc_matmul
- nki.isa.nc_matmul(dst, stationary, moving, is_stationary_onezero=False, is_moving_onezero=False, is_transpose=False, tile_position=(), tile_size=(), psum_accumulate_flag=3, name=None)
Compute dst = stationary.T @ moving matrix multiplication using Tensor Engine.
The figure below illustrates how to map a matrix multiplication from a mathematical definition to nisa.nc_matmul on Tensor Engine. For more detailed discussion of Tensor Engine capabilities, see Trainium arch guide.
Fig. 116 MxKxN Matrix Multiplication Visualization.
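As a reading aid for the mapping above, the following is a minimal pure-Python sketch of the math that nc_matmul computes. It is not NKI code and does not run on Tensor Engine. The helper name reference_nc_matmul and the nested-list representation are assumptions made for illustration. The shapes follow the layout rules on this page: stationary is [K, M], moving is [K, N], and dst is [M, N], with the contraction dimension K first in both inputs.

```python
def reference_nc_matmul(stationary, moving):
    """Return stationary.T @ moving for nested-list matrices.

    Hypothetical helper, not part of NKI.
    stationary is [K][M] and moving is [K][N]; the result is [M][N].
    """
    k_size = len(stationary)
    if len(moving) != k_size:
        raise ValueError("stationary and moving must share the contraction size K")
    m_size = len(stationary[0])
    n_size = len(moving[0])
    # dst[m][n] = sum over k of stationary[k][m] * moving[k][n]
    return [
        [sum(stationary[k][m] * moving[k][n] for k in range(k_size))
         for n in range(n_size)]
        for m in range(m_size)
    ]


stationary = [[1, 2], [3, 4], [5, 6]]   # K=3, M=2
moving = [[1, 0], [0, 1], [1, 1]]       # K=3, N=2
dst = reference_nc_matmul(stationary, moving)
print(dst)  # [[6, 8], [8, 10]]
```

Note that the stationary operand is indexed as stationary[k][m], which is where the implicit transpose in dst = stationary.T @ moving comes from.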
Performance mode.
On NeuronCore-v2, performance mode is not supported. On NeuronCore-v3 and NeuronCore-v4, Tensor Engine supports FP8 double performance mode, enabled by setting performance mode to double_row. See Trainium2 arch guide for more details.
double_row performance mode cannot be combined with Tensor Engine column tiling mode (details below).
Tiling mode.
NeuronCore Tensor Engine is built upon a systolic array with 128 rows and 128 columns of processing elements (PEs). Tensor Engine supports both row and column tiling modes, which allow multiple nc_matmul instructions with a stationary tile size smaller than [128, 128] to run in parallel to improve hardware utilization. Row tiling mode slices the 128 PE rows into 2x 64 row tiles (NeuronCore-v2 or newer), or 4x 32 row tiles (NeuronCore-v3 or newer). Column tiling mode slices the 128 PE columns in the same fashion. The row and column tile sizes can be set independently in the tile_size field as a tuple (row_size, column_size). The stationary tile size must not exceed the chosen tile_size.
In addition, a given nc_matmul can also pick the exact row and column tile within the 128x128 systolic array, by specifying the starting row and starting column in tile_position as a tuple (start_row, start_column). The start_row must be a multiple of row_size specified in tile_size and must not exceed 128. Similarly, the start_column must be a multiple of column_size and must not exceed 128.
For example, setting tile_position to (64, 0) and tile_size to (64, 128) means using the bottom half of the systolic array.
Note, tile_position and tile_size must both be set to enable tiling mode. If they are not set, the default is to use the full systolic array, which is equivalent to tile_position=(0, 0) and tile_size=(128, 128). The values in tile_position and tile_size tuples can be integers or affine expressions.
Transpose mode.
Tensor Engine can transpose a tile in SBUF by loading it as a stationary tile and using an identity matrix as the moving tile. Starting with NeuronCore-v3, turning on transpose mode by setting is_transpose=True enables bit-accurate data transpose, which correctly transposes tensors that contain NaN/Inf values. See Trainium2 arch guide for more details.
On NeuronCore-v2, Tensor Engine does not support transpose mode natively. However, setting is_transpose=True ensures neuron-profile identifies this instruction as a transpose for performance metric accounting purposes.
Memory types.
The nc_matmul instruction must read inputs from SBUF and write outputs to PSUM. Therefore, the stationary and moving must be SBUF tiles, and dst tile must be a PSUM tile.
The psum_accumulate_flag controls whether the matmul result data should overwrite or accumulate on top of the dst PSUM tile. Multiple nisa.nc_matmul instructions accumulating into the same PSUM tile can form an accumulation group before the PSUM tile content is evicted back to SBUF. The encoding of psum_accumulate_flag is as follows:
- bit[0] of psum_accumulate_flag: if set, indicates this nisa.nc_matmul call is the first instruction in the accumulation group. The matmul result should overwrite the existing content in the dst PSUM tile.
- bit[1] of psum_accumulate_flag: if set, indicates this nisa.nc_matmul call is the last instruction in the accumulation group. The matmul result should accumulate to the existing content in the dst PSUM tile.
- bit[2] of psum_accumulate_flag: if set, indicates this nisa.nc_matmul call is the first instruction in the accumulation group. However, the matmul result should accumulate to the existing content in the dst PSUM tile.
nisa.nc_matmul calls that are not the first or last instruction of an accumulation group should not set any bit: psum_accumulate_flag=0.
Data types.
The input stationary and moving tiles can be one of these supported data types: float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32/float32. The stationary and moving tiles can have different data types, with one exception: if one of the input tiles is tfloat32/float32, the other tile must also be tfloat32/float32. On NeuronCore-v3 and NeuronCore-v4, when performance mode is double_row, stationary and moving tiles must be one of float8_e4m3 or float8_e5m2, but the two input tiles can have different float8 formats.
The accumulation precision internal to Tensor Engine is float32. The dst tile must be a float32 tile in NeuronCore-v2 and NeuronCore-v3. Starting NeuronCore-v4, dst can either be a float32 or bfloat16 tile.
Layout.
If performance mode is off, the contraction dimension of the matmul must be along the partition dimension in both stationary and moving tiles.
If performance mode is double_row, the contraction dimension of the matmul is split between the partition dimension and the first free dimension after the partition dimension in both stationary and moving tiles. The first free dimension must be 2. For example, to perform a matmul of [1, 256] @ [256, 3] = [1, 3], the stationary tile is of shape [128, 2, 1], while the moving tile is of shape [128, 2, 3].
Regardless of performance mode, the free dimension of the stationary tile matches the partition dimension of the output dst tile in size, while the free dimension of the moving tile matches the free dimension of the dst tile in size.
Tile size.
The partition dimension sizes of the stationary and moving tiles must be identical. They must not exceed 128 when tiling mode is off, or the row_size specified in tile_size when tiling mode is on. The free dimension size of stationary must not exceed 128 when tiling mode is off, or the column_size in tile_size when tiling mode is on.
On NeuronCore-v2 and -v3, the free dimension size of the moving tile must not exceed 512, matching the maximum number of float32 elements per PSUM bank. Starting NeuronCore-v4, the free dimension size of the moving tile can go up to 4096 for float32 dst or 8192 for bfloat16 dst, matching the size of 8x PSUM banks (the entire PSUM).
Explicit tiling is required when the high-level matmul operation exceeds the tile size limits of nc_matmul.
- Parameters:
dst – the matmul output
stationary – the stationary operand
moving – the moving operand
is_stationary_onezero – hints to the compiler whether the stationary operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if stationary tile is in float32; the field has no impact for non-float32 stationary
is_moving_onezero – hints to the compiler whether the moving operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if moving tile is in float32; the field has no impact for non-float32 moving
is_transpose – controls Tensor Engine transpose mode on/off starting NeuronCore-v3
tile_position – a 2D tuple (start_row, start_column) to control starting row in Tensor Engine tiling mode; start_column must be 0
tile_size – a 2D tuple (row_size, column_size) to control row tile size in Tensor Engine tiling mode; column_size must be 128
psum_accumulate_flag – controls PSUM near-memory accumulation in the dst tile
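The psum_accumulate_flag encoding and the explicit tiling requirement described above can be illustrated together with a small pure-Python model. This is a sketch under stated assumptions, not NKI code: group_flags, model_nc_matmul and tiled_matmul are hypothetical helpers, tiles are nested lists, and only bit[0] (first, overwrite) and bit[1] (last) are modeled. The contraction dimension K is split into chunks of at most 128 rows, matching the partition dimension limit when tiling mode is off, and each chunk is one modeled call in the same accumulation group.

```python
# Pure-Python model; none of these helpers are part of the NKI API.
FIRST = 0b01  # bit[0]: first call in the group, overwrite dst
LAST = 0b10   # bit[1]: last call in the group


def group_flags(num_calls):
    """Return the psum_accumulate_flag for each call in one accumulation group."""
    flags = []
    for i in range(num_calls):
        flag = 0
        if i == 0:
            flag |= FIRST
        if i == num_calls - 1:
            flag |= LAST
        flags.append(flag)
    return flags


def model_nc_matmul(dst, stationary, moving, psum_accumulate_flag):
    """Model dst = stationary.T @ moving, overwriting or accumulating into dst.

    stationary is [K][M], moving is [K][N], dst is [M][N].
    """
    overwrite = bool(psum_accumulate_flag & FIRST)
    k_size = len(stationary)
    for m in range(len(stationary[0])):
        for n in range(len(moving[0])):
            acc = 0.0 if overwrite else dst[m][n]
            for k in range(k_size):
                acc += stationary[k][m] * moving[k][n]
            dst[m][n] = acc


def tiled_matmul(stationary, moving, k_tile=128):
    """Split the contraction dimension into chunks of at most k_tile rows."""
    m_size, n_size = len(stationary[0]), len(moving[0])
    # NaN stands in for stale PSUM content that the first call must overwrite.
    dst = [[float("nan")] * n_size for _ in range(m_size)]
    starts = list(range(0, len(stationary), k_tile))
    for start, flag in zip(starts, group_flags(len(starts))):
        model_nc_matmul(dst, stationary[start:start + k_tile],
                        moving[start:start + k_tile], flag)
    return dst


print(group_flags(1))  # [3]
print(group_flags(3))  # [1, 0, 2]

stationary = [[1.0, 1.0] for _ in range(300)]       # K=300, M=2
moving = [[1.0, 1.0, 1.0] for _ in range(300)]      # K=300, N=3
result = tiled_matmul(stationary, moving)
print(result[0])  # [300.0, 300.0, 300.0]
```

A group with a single call gets flag 3 (first and last), which is the default value of psum_accumulate_flag in the signature. In the 300-row example, the three modeled calls use flags 1, 0 and 2.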