nki.isa.quantize_mx
- nki.isa.quantize_mx(dst, src, dst_scale, name=None)
Quantize FP16/BF16 data to MXFP8 tensors (both data and scales) using Vector Engine.
Note
Available only on NeuronCore-v4 and beyond.
The resulting MXFP8 tensors, dst and dst_scale, are as defined in the OCP Microscaling standard. This instruction calculates the required scale for each group of 32 values in src, divides the values in the group by that scale, and casts the result to the target MXFP8 datatype. The output layout is suitable for direct consumption by the nisa.nc_matmul_mx API running on the Tensor Engine.
Memory types.
The input src tile and the output tiles (dst and dst_scale) must all be in SBUF.
Data types.
The input src tile must be float16 or bfloat16. The output dst tile must be float8_e5m2_x4 or float8_e4m3fn_x4 (4-packed FP8 data types). The dst_scale tile must be uint8.
The 4-packed data types (float8_e5m2_x4/float8_e4m3fn_x4) are 32-bit data types that pack four 8-bit float8_e5m2/float8_e4m3fn values.
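To make the 4-packed idea concrete, the following is a minimal NumPy sketch that reinterprets four contiguous 8-bit values as one 32-bit word. The helper name pack_x4 is hypothetical, and the byte order (first element in the lowest byte) is an assumption for illustration; the page does not state how the hardware orders the four values inside a packed element.

```python
import numpy as np

def pack_x4(fp8_bytes):
    """Reinterpret every four contiguous 8-bit values in a row as one 32-bit word.

    Hypothetical helper: the little-endian ordering is an assumption, not a
    documented property of float8_e5m2_x4 / float8_e4m3fn_x4.
    """
    fp8_bytes = np.ascontiguousarray(fp8_bytes, dtype=np.uint8)
    assert fp8_bytes.shape[-1] % 4 == 0, "free dimension must be a multiple of 4"
    # "<u4" forces a little-endian 32-bit view regardless of host byte order.
    return fp8_bytes.view("<u4")

# Two partitions with eight raw FP8 bit patterns each.
raw = np.arange(1, 17, dtype=np.uint8).reshape(2, 8)
packed = pack_x4(raw)
print(packed.shape)  # (2, 2): one quarter of the elements per partition
```

The packed array has the same number of rows and one quarter of the columns, which mirrors the relationship between src and dst element counts per partition.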
Layout.
The quantization operates on groups of 32 elements from the input src tile, where each group consists of 8 partitions × 4 elements per partition. For each 32-element group, the instruction produces:
- Quantized FP8 data in dst
- One shared scale value in dst_scale
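The grouping described above can be sketched in NumPy. This is an illustrative reference only, not the hardware implementation: it assumes the shared scale is chosen as in the OCP Microscaling specification (a power of two with exponent floor(log2(max|x|)) minus the largest exponent of the element format, stored as a biased 8-bit exponent), it treats an all-zero group as if its maximum were 1.0, and it stops before the final cast to FP8. The function name mx_group_scales is hypothetical, and the logical (group row, group column) result does not model the physical SBUF quadrant placement of dst_scale.

```python
import numpy as np

# Largest power-of-two exponent of each FP8 element format
# (E4M3 max is 448 = 1.75 * 2**8, E5M2 max is 57344 = 1.75 * 2**15).
ELEM_EMAX = {"float8_e4m3fn": 8, "float8_e5m2": 15}

def mx_group_scales(src, elem_format="float8_e4m3fn"):
    """Return (scale_bits, scaled) for 8-partition x 4-element groups.

    scale_bits: uint8 biased exponents, one per group, shape (P // 8, F // 4).
    scaled:     src divided by its group's scale, shape (P, F), not yet cast to FP8.
    """
    src = np.asarray(src, dtype=np.float32)
    p, f = src.shape
    assert p % 8 == 0 and f % 4 == 0
    # blocks[i, :, j, :] is the 8x4 group at group-row i, group-column j.
    blocks = src.reshape(p // 8, 8, f // 4, 4)
    amax = np.abs(blocks).max(axis=(1, 3))
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero groups
    exp = np.floor(np.log2(safe)).astype(np.int32) - ELEM_EMAX[elem_format]
    exp = np.clip(exp, -127, 127)
    scale_bits = (exp + 127).astype(np.uint8)  # bias of 127
    scaled = blocks / np.exp2(exp.astype(np.float32))[:, None, :, None]
    return scale_bits, scaled.reshape(p, f)

scale_bits, scaled = mx_group_scales(np.ones((32, 8), dtype=np.float32))
print(scale_bits.shape)  # (4, 2): one scale per 8x4 group
```

For a 32 × 8 input, the sketch yields 4 × 2 scale values, which matches the 1/32 logical element count of dst_scale relative to src.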
Logically, dst should have the same shape as src if dst is interpreted as a pure FP8 data type. However, in NKI, dst uses a custom 4-packed data type that packs four contiguous FP8 elements into a single float8_e5m2_x4/float8_e4m3fn_x4 element. Therefore, dst has one quarter of the element count per partition compared to src.
Logically, dst_scale should have 1/32 the element count of src due to the microscaling group size of 32. Physically, the dst_scale tensor follows a special SBUF quadrant (32 partitions) distribution pattern, where scale values are distributed across multiple SBUF quadrants while maintaining the same partition offset in each quadrant.
Within each SBUF quadrant, a 32-partition slice of the src tile produces 32//8 = 4 partitions' worth of scales, because each group spans 8 partitions × 4 elements per partition. The number of scales per partition is 1/4 of the free dimension size of the src tile.
Different SBUF quadrants of scales are produced in parallel, with the scales written to the first (or second) 8 partitions of each SBUF quadrant. In other words, dst_scale must be placed in the first 16 partitions of each SBUF quadrant. The dst_scale tile declaration must always occupy a multiple of 32 partitions, even though not all partitions can be filled with scale values by nisa.quantize_mx.
Tile size.
The partition dimension size of src must be a multiple of 32 and must not exceed 128.
The free dimension size of src must be a multiple of 4 and must not exceed the physical size of each SBUF partition.
The dst tile has the same partition dimension size as src but a free dimension size that is 1/4 of the src free dimension size, due to the special 4-packed FP8 data types.
The dst_scale tile partition dimension depends on whether src spans multiple SBUF quadrants:
- If src occupies only 32 partitions, dst_scale will occupy 4 partitions.
- Otherwise, dst_scale will occupy the same number of partitions as src.
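The tile size rules above can be collected into a small shape calculator. This is a plain-Python sketch, not part of NKI: the function name quantize_mx_shapes is hypothetical, shapes are written as (partition size, free size), the dst free size is counted in 4-packed elements, and the dst_scale free size uses the "1/4 of the src free dimension" rule from the Layout section. The upper bound on the free dimension is not checked because the physical SBUF partition size is not given here.

```python
def quantize_mx_shapes(src_partitions, src_free):
    """Return (dst_shape, dst_scale_shape) for a given src tile shape.

    Hypothetical helper that only restates the documented size rules.
    """
    if src_partitions % 32 != 0 or not 0 < src_partitions <= 128:
        raise ValueError("src partition size must be a multiple of 32, at most 128")
    if src_free % 4 != 0 or src_free <= 0:
        raise ValueError("src free size must be a positive multiple of 4")

    # dst: same partitions, one 4-packed element per four src elements.
    dst_shape = (src_partitions, src_free // 4)

    # dst_scale: 4 partitions when src fits in one SBUF quadrant,
    # otherwise the same partition count as src.
    scale_partitions = 4 if src_partitions == 32 else src_partitions
    dst_scale_shape = (scale_partitions, src_free // 4)
    return dst_shape, dst_scale_shape

print(quantize_mx_shapes(32, 64))    # ((32, 16), (4, 16))
print(quantize_mx_shapes(128, 512))  # ((128, 128), (128, 128))
```

A helper like this can be useful for checking tile declarations before calling the instruction, since mismatched dst or dst_scale sizes are easy to get wrong by a factor of 4.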
- Parameters:
dst – the quantized MXFP8 output tile
src – the input FP16/BF16 tile to be quantized
dst_scale – the output scale tile