This document is relevant for: Trn2, Trn3
nki.isa.nc_matmul_mx
nki.isa.nc_matmul_mx(dst, stationary, moving, stationary_scale, moving_scale, tile_position=None, tile_size=None, accumulate=None, name=None)
Compute matrix multiplication of MXFP8/MXFP4 quantized matrices with integrated dequantization using Tensor Engine.
Note
Available only on NeuronCore-v4 and newer.
The NeuronCore-v4 Tensor Engine supports matrix multiplication of MXFP8/MXFP4 quantized matrices as defined in the OCP Microscaling standard. This instruction performs matrix multiplication between quantized stationary and moving matrices while applying dequantization scales during computation. The micro-scaling group size is 32 elements, arranged as groups of 8 partitions × 4 elements per partition in both the stationary and moving tensors. See the Trainium3 architecture guide for a more detailed discussion.
Tiling Mode.
The NeuronCore Tensor Engine is built upon a systolic array with 128 rows and 128 columns of processing elements (PEs). For nc_matmul_mx, the Tensor Engine supports only row tiling mode, which allows multiple nc_matmul_mx instructions with a stationary partition dimension size smaller than 128 to run in parallel to improve hardware utilization. Row tiling mode slices the 128 PE rows into 2x 64 row tiles or 4x 32 row tiles.
The row tile size can be set in the tile_size field as a tuple (row_size, column_size), where column_size must be 128. The stationary tile size must not exceed the chosen tile_size.
A given nc_matmul_mx can pick the exact row tile within the 128x128 systolic array by specifying the starting row in tile_position as a tuple (start_row, start_column), where start_column must be 0. The start_row must be a multiple of the row_size specified in tile_size and must not exceed 128.
For example, setting tile_position to (64, 0) and tile_size to (64, 128) means using the bottom half of the systolic array.
Note that tile_position and tile_size must both be set to enable tiling mode. If they are not set, the default is to use the full systolic array, which is equivalent to tile_position=(0, 0) and tile_size=(128, 128). The values in the tile_position and tile_size tuples can be integers or affine expressions.
Memory types.
The nc_matmul_mx instruction must read inputs from SBUF and write outputs to PSUM. Therefore, the stationary, moving, stationary_scale, and moving_scale tiles must be SBUF tiles, and the dst tile must be a PSUM tile.
Data types.
The input stationary and moving tiles must be float8_e5m2_x4, float8_e4m3fn_x4, or float4_e2m1fn_x4 (4-packed quantized data types). The stationary_scale and moving_scale tiles must be uint8. The dst tile can be float32 or bfloat16.
Layout.
The contraction dimension of the matrix multiplication is along the partition dimension of the stationary and moving tensors and also the x4 dimension within each packed data type element (float8_e5m2_x4, float8_e4m3fn_x4, or float4_e2m1fn_x4).
The free dimension of the stationary tile matches the partition dimension of the output dst tile in size, while the free dimension of the moving tile matches the free dimension of the dst tile in size.
The scale tensors follow a special layout requirement. See more details in the nisa.quantize_mx API doc.
Tile size.
The partition dimension size of stationary and moving must be identical and be a multiple of 32, not exceeding 128.
The free dimension size of stationary must be even and not exceed 128.
The free dimension size of moving must not exceed 512 when dst is in float32 or 1024 when dst is in bfloat16.
The scale tensors have partition dimensions that depend on whether the data tensors span multiple quadrants. See more details in the nisa.quantize_mx API doc.
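The tiling and tile size rules above can be collected into a small host-side check. The following is a minimal sketch in plain Python, not part of the NKI API: the helper name check_nc_matmul_mx_shapes and its arguments are hypothetical, shapes are given as (partition_size, free_size) as counted in the text above, and two readings of the text are assumptions marked in the comments (that a row tile must fit entirely inside the 128 PE rows, and that "stationary tile size must not exceed tile_size" refers to the stationary partition size versus row_size). Scale tensor shapes are not checked.

```python
from typing import Optional, Tuple

PE_ROWS = 128
PE_COLS = 128
# Maximum free size of `moving`, keyed by the dst data type.
MAX_MOVING_FREE = {"float32": 512, "bfloat16": 1024}


def check_nc_matmul_mx_shapes(
    stationary_shape: Tuple[int, int],
    moving_shape: Tuple[int, int],
    dst_dtype: str = "float32",
    tile_position: Optional[Tuple[int, int]] = None,
    tile_size: Optional[Tuple[int, int]] = None,
) -> None:
    """Hypothetical helper: raise ValueError if a documented constraint is violated."""
    s_part, s_free = stationary_shape
    m_part, m_free = moving_shape

    if s_part != m_part:
        raise ValueError("stationary and moving partition sizes must be identical")
    if s_part % 32 != 0 or not 0 < s_part <= 128:
        raise ValueError("partition size must be a multiple of 32, at most 128")
    if s_free % 2 != 0 or not 0 < s_free <= 128:
        raise ValueError("stationary free size must be even, at most 128")
    if dst_dtype not in MAX_MOVING_FREE:
        raise ValueError("dst must be float32 or bfloat16")
    if not 0 < m_free <= MAX_MOVING_FREE[dst_dtype]:
        raise ValueError("moving free size exceeds the limit for this dst dtype")

    # Tiling mode requires both fields; neither set means the full array.
    if (tile_position is None) != (tile_size is None):
        raise ValueError("tile_position and tile_size must be set together")
    if tile_position is None or tile_size is None:
        tile_position, tile_size = (0, 0), (PE_ROWS, PE_COLS)

    start_row, start_column = tile_position
    row_size, column_size = tile_size
    if row_size not in (32, 64, 128) or column_size != PE_COLS:
        raise ValueError("tile_size must be (32|64|128, 128)")
    if start_column != 0 or start_row % row_size != 0:
        raise ValueError("start_column must be 0, start_row a multiple of row_size")
    # Assumption: the selected row tile must fit inside the 128 PE rows.
    if start_row < 0 or start_row + row_size > PE_ROWS:
        raise ValueError("row tile does not fit in the systolic array")
    # Assumption: the stationary partition size is what must fit in the row tile.
    if s_part > row_size:
        raise ValueError("stationary partition size exceeds the row tile size")
```

For instance, check_nc_matmul_mx_shapes((64, 128), (64, 1024), dst_dtype="bfloat16", tile_position=(64, 0), tile_size=(64, 128)) passes under these assumptions, matching the "bottom half of the systolic array" example above.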
Profiler view syntax.
nc_matmul_mx uses the same profiler output format as nisa.nc_matmul, except the source access pattern is interpreted as an MX-quantized tensor:
src=<dtype>@$MX[<data_addr>,<scale_addr>,<start_scale_partition>]@[<step_elem>][<num_elem>]
Parameters:
dst – the matrix multiplication output (PSUM tile)
stationary – the stationary quantized matrix (SBUF tile)
moving – the moving quantized matrix (SBUF tile)
stationary_scale – the dequantization scales for stationary matrix (SBUF tile)
moving_scale – the dequantization scales for moving matrix (SBUF tile)
tile_position – a 2D tuple (start_row, start_column) to control starting row and column in Tensor Engine tiling mode
tile_size – a 2D tuple (row_size, column_size) to control row and column tile sizes in Tensor Engine tiling mode
accumulate – if True, accumulate the matmul result into the existing dst PSUM tile content; if False, overwrite the existing content; if None (default), auto-detect based on whether this PSUM location was previously written
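As a rough numerical reference for what the instruction computes, the sketch below models the dequantize-then-multiply semantics in plain Python. It is a simplification under stated assumptions, not the hardware behavior: element values are passed as already-decoded floats rather than packed x4 types, each uint8 scale is interpreted as an OCP Microscaling E8M0 exponent (scale = 2^(e - 127)), and each scale covers 32 consecutive elements along the contraction dimension instead of the 8 partitions × 4 elements hardware layout. The function names are hypothetical.

```python
import math
from typing import List

GROUP = 32  # micro-scaling group size along the contraction dimension

Matrix = List[List[float]]
Scales = List[List[int]]


def dequantize_mx(values: Matrix, scales: Scales) -> Matrix:
    """values: [K][F] decoded element values; scales: [K // 32][F] uint8 exponents."""
    k_size = len(values)
    if k_size % GROUP != 0 or len(scales) != k_size // GROUP:
        raise ValueError("expected one scale row per group of 32 contraction rows")
    # Assumption: E8M0 scale encoding, scale = 2 ** (exponent - 127).
    return [
        [v * math.ldexp(1.0, scales[row // GROUP][col] - 127) for col, v in enumerate(vals)]
        for row, vals in enumerate(values)
    ]


def matmul_mx_reference(
    stationary: Matrix, stationary_scale: Scales, moving: Matrix, moving_scale: Scales
) -> Matrix:
    """Return dst[M][N] = dequant(stationary[K][M]).T @ dequant(moving[K][N])."""
    s = dequantize_mx(stationary, stationary_scale)
    m = dequantize_mx(moving, moving_scale)
    if len(s) != len(m):
        raise ValueError("contraction sizes must match")
    k_size, m_size, n_size = len(s), len(s[0]), len(m[0])
    return [
        [sum(s[k][i] * m[k][j] for k in range(k_size)) for j in range(n_size)]
        for i in range(m_size)
    ]
```

The output shape follows the layout rules above: the free dimension of stationary becomes the first dimension of the result, and the free dimension of moving becomes the second.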