This document is relevant for: Inf2, Trn1, Trn2
nki.isa.nc_matmul
- nki.isa.nc_matmul(stationary, moving, *, is_stationary_onezero=False, is_moving_onezero=False, is_transpose=False, tile_position=(), tile_size=(), mask=None, **kwargs)
Compute stationary.T @ moving matrix multiplication using the Tensor Engine.

The nc_matmul instruction must read inputs from SBUF and write outputs to PSUM. Therefore, stationary and moving must be SBUF tiles, and the result tile is a PSUM tile.

The nc_matmul instruction currently supports float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32/float32 input data types, as listed in Supported Data Types. The matmul accumulation and results are always in float32.

The Tensor Engine imposes special layout constraints on the input tiles. First, the partition axis sizes of the stationary and moving tiles must be identical and <= 128; the partition axis corresponds to the contraction dimension of the matrix multiplication. Second, the free axis sizes of the stationary and moving tiles must be <= 128 and <= 512, respectively. For example, with stationary.shape = (128, 126) and moving.shape = (128, 512), nc_matmul(stationary, moving) returns a tile of shape = (126, 512). For more information about the matmul layout, see Tensor Engine.
Fig. 12 MxKxN Matrix Multiplication Visualization.
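The layout constraints above can be checked before a kernel is written. The following is a minimal plain-Python sketch, not part of the NKI API: the helper name nc_matmul_output_shape and its constants are hypothetical, and shapes are assumed to be given as (partition axis, free axis) tuples.

```python
# Plain-Python sketch of the layout rules described above (not an NKI API).
MAX_PARTITION = 128        # partition axis limit for both operands
MAX_STATIONARY_FREE = 128  # free axis limit for the stationary tile
MAX_MOVING_FREE = 512      # free axis limit for the moving tile


def nc_matmul_output_shape(stationary_shape, moving_shape):
    """Return the (partition, free) shape of the PSUM result tile.

    Hypothetical helper: raises ValueError when a layout constraint is violated.
    """
    k_stationary, m = stationary_shape
    k_moving, n = moving_shape
    if k_stationary != k_moving:
        raise ValueError("partition axis sizes of stationary and moving must be identical")
    if k_stationary > MAX_PARTITION:
        raise ValueError("partition axis size must be <= 128")
    if m > MAX_STATIONARY_FREE:
        raise ValueError("stationary free axis size must be <= 128")
    if n > MAX_MOVING_FREE:
        raise ValueError("moving free axis size must be <= 512")
    # stationary.T @ moving: (m, k) @ (k, n) -> (m, n)
    return (m, n)


print(nc_matmul_output_shape((128, 126), (128, 512)))  # (126, 512)
```

The printed shape matches the example in the text: the result's partition axis comes from the free axis of stationary, and its free axis comes from the free axis of moving.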
If the contraction dimension of the matrix multiplication exceeds 128, you may accumulate multiple nc_matmul instruction output tiles into the same PSUM tile. See example code snippet below.

Estimated instruction cost:

The Tensor Engine has complex performance characteristics given its data flow and pipeline design. The formula below gives the average nc_matmul cost, assuming many nc_matmul instructions of the same shapes run back-to-back on the engine:

Cost (Tensor Engine Cycles)              | Condition
-----------------------------------------|------------------------------------------------------------------------------
max(min(64, N_stationary), N_moving)     | input data type is one of float8_e4m3/float8_e5m2/bfloat16/float16/tfloat32
4 * max(min(64, N_stationary), N_moving) | input data type is float32

where,
- N_stationary is the number of elements per partition in the stationary tile.
- N_moving is the number of elements per partition in the moving tile.
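As a worked illustration of the cost formula, the sketch below evaluates it in plain Python. The function names and the use of strings to name data types are assumptions made for this example, not NKI APIs. Multiplying the per-instruction cost by the number of accumulation steps is only a rough estimate, in line with the "many back-to-back instructions" assumption stated above.

```python
import math

# Data types that use the 1x cost row of the table above.
LOW_PRECISION_DTYPES = {"float8_e4m3", "float8_e5m2", "bfloat16", "float16", "tfloat32"}


def estimated_nc_matmul_cycles(n_stationary, n_moving, dtype):
    """Average Tensor Engine cycles for one nc_matmul, per the table above.

    n_stationary / n_moving: elements per partition in each tile.
    dtype: data type name as a string (a convention assumed for this sketch).
    """
    base = max(min(64, n_stationary), n_moving)
    if dtype == "float32":
        return 4 * base
    if dtype in LOW_PRECISION_DTYPES:
        return base
    raise ValueError(f"unsupported input data type: {dtype}")


def contraction_steps(k, max_partition=128):
    """Number of nc_matmul instructions needed to cover a contraction size k."""
    return math.ceil(k / max_partition)


per_instruction = estimated_nc_matmul_cycles(128, 512, "bfloat16")
print(per_instruction)                                   # 512
print(estimated_nc_matmul_cycles(128, 512, "float32"))   # 2048
print(contraction_steps(256) * per_instruction)          # 1024
```

For these shapes N_moving dominates the formula, so shrinking the stationary tile does not lower the estimated cost.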
The Tensor Engine, as a systolic array with 128 rows and 128 columns of processing elements (PEs), could be underutilized for small nc_matmul instructions, i.e., when the stationary tile has a small free axis size or a small partition axis size (e.g., 32, 64). In such a case, the Tensor Engine allows PE tiling, i.e., multiple small nc_matmul instructions executing in parallel on the PE array, to improve compute throughput. PE tiling is enabled by setting tile_position and tile_size. tile_position indicates the PE tile starting position (row position, column position) for an nc_matmul instruction in the PE array. tile_size indicates the PE tile size (row size, column size) held by an nc_matmul instruction, starting from tile_position. For example, setting tile_position to (0, 0) and tile_size to (128, 128) means using the full PE array.

Requirements on tile_position and tile_size are:
- tile_position and tile_size must both be set to enable PE tiling.
- The values in tile_position and tile_size must be integers or affine expressions.
- Values in tile_position and tile_size must be multiples of 32.
- tile_size must be larger than or equal to the accessed stationary tile size.
- Both the row and column sizes in tile_size cannot be 32 for NeuronCore-v2.
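The requirements above can be sketched as a small plain-Python check. This is a hypothetical helper, not an NKI API, and it rests on several assumptions: values are plain integers (affine expressions are not modeled), the stationary tile's (partition, free) sizes are assumed to map to PE (row, column) sizes, and the tile is assumed to have to fit inside the 128x128 array. The NeuronCore-v2 restriction on size 32 is not modeled.

```python
PE_ROWS = 128
PE_COLS = 128


def check_pe_tiling(tile_position, tile_size, stationary_shape):
    """Return True if the PE tiling arguments satisfy the modeled requirements.

    stationary_shape is (partition, free), assumed to map to PE (rows, columns).
    Raises ValueError describing the first violated requirement.
    """
    if len(tile_position) != 2 or len(tile_size) != 2:
        raise ValueError("tile_position and tile_size must both be set as 2D tuples")
    for value in (*tile_position, *tile_size):
        if not isinstance(value, int):
            raise ValueError("this sketch only models integer values")
        if value % 32 != 0:
            raise ValueError("values must be multiples of 32")
    if tile_size[0] < stationary_shape[0] or tile_size[1] < stationary_shape[1]:
        raise ValueError("tile_size must be >= the accessed stationary tile size")
    # Assumption: the tile must lie entirely inside the PE array.
    if tile_position[0] + tile_size[0] > PE_ROWS or tile_position[1] + tile_size[1] > PE_COLS:
        raise ValueError("tile does not fit inside the 128x128 PE array")
    return True


# Tile positions produced by a 4x4 loop nest that places 64x64 tiles
# at ((i % 2) * 64, (j % 2) * 64).
positions = set()
for i in range(4):
    for j in range(4):
        position = ((i % 2) * 64, (j % 2) * 64)
        check_pe_tiling(position, (64, 64), stationary_shape=(64, 64))
        positions.add(position)

print(sorted(positions))  # [(0, 0), (0, 64), (64, 0), (64, 64)]
```

The loop visits only four distinct positions, which together cover the four 64x64 quadrants of the PE array.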
- Parameters:
  - stationary – the stationary operand on SBUF; layout: (partition axis <= 128, free axis <= 128)
  - moving – the moving operand on SBUF; layout: (partition axis <= 128, free axis <= 512)
  - mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
  - is_stationary_onezero – hints to the compiler whether the stationary operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if the stationary tile is in float32; the field has no impact for non-float32 stationary.
  - is_moving_onezero – hints to the compiler whether the moving operand is a tile with ones/zeros only; setting this field explicitly could lead to 2x better performance if the moving tile is in float32; the field has no impact for non-float32 moving.
  - is_transpose – hints to the compiler that this is a transpose operation with moving as an identity matrix.
  - tile_position – a 2D tuple (row, column) for the start PE tile position to run nc_matmul.
  - tile_size – a 2D tuple (row, column) for the PE tile size held by nc_matmul, starting from tile_position.
- Returns:
  a tile on PSUM that has the result of matrix multiplication of stationary and moving tiles; layout: partition axis comes from free axis of stationary, while free axis comes from free axis of moving.
Example:

    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl

    # Excerpt from the body of an NKI kernel: the *_tensor names are
    # assumed to be defined by the enclosing kernel, outside this excerpt.

    ##################################################################
    # Example 1:
    # multiply matrix a of shape (128, 128) and matrix b of shape (128, 512)
    # to get matrix c in PSUM of shape (128, 512)
    ##################################################################
    a_mgrid = nl.mgrid[0:128, 0:128]
    b_mgrid = nl.mgrid[0:128, 0:512]
    c_mgrid = nl.mgrid[0:128, 0:512]

    a = nl.load(a_tensor[a_mgrid.p, a_mgrid.x])
    b = nl.load(b_tensor[b_mgrid.p, b_mgrid.x])

    c_psum = nisa.nc_matmul(a[a_mgrid.p, a_mgrid.x], b[b_mgrid.p, b_mgrid.x])

    nl.store(c_tensor[c_mgrid.p, c_mgrid.x], c_psum)

    ##################################################################
    # Example 2:
    # multiply matrix d of shape (256, 128) and matrix e of shape (256, 512)
    # to get matrix f in PSUM of shape (128, 512) using psum accumulation
    ##################################################################
    d_mgrid = nl.mgrid[0:128, 0:128]
    e_mgrid = nl.mgrid[0:128, 0:512]
    f_mgrid = nl.mgrid[0:128, 0:512]

    f_psum = nl.zeros((128, 512), nl.float32, buffer=nl.psum)

    for i_contract in nl.affine_range(2):
        d = nl.load(d_tensor[i_contract * 128 + d_mgrid.p, d_mgrid.x])
        e = nl.load(e_tensor[i_contract * 128 + e_mgrid.p, e_mgrid.x])
        f_psum += nisa.nc_matmul(d[d_mgrid.p, d_mgrid.x], e[e_mgrid.p, e_mgrid.x])

    nl.store(f_tensor[f_mgrid.p, f_mgrid.x], f_psum)

    ##################################################################
    # Example 3:
    # perform batched matrix multiplication on matrix g of shape (16, 64, 64)
    # and matrix h of shape (16, 64, 512) to get matrix i of (16, 64, 512)
    # using Tensor Engine PE tiling mode.
    ##################################################################
    g_mgrid = nl.mgrid[0:64, 0:64]
    h_mgrid = nl.mgrid[0:64, 0:512]
    i_mgrid = nl.mgrid[0:64, 0:512]

    for i in nl.affine_range(4):
        for j in nl.affine_range(4):
            g = nl.load(g_tensor[i * 4 + j, g_mgrid.p, g_mgrid.x])
            h = nl.load(h_tensor[i * 4 + j, h_mgrid.p, h_mgrid.x])
            i_psum = nisa.nc_matmul(g, h, tile_position=((i % 2) * 64, (j % 2) * 64), tile_size=(64, 64))
            nl.store(i_tensor[i * 4 + j, i_mgrid.p, i_mgrid.x], i_psum)

    return c_tensor, f_tensor, i_tensor
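The PSUM accumulation pattern in Example 2 relies on the fact that a contraction can be split into chunks whose partial products are summed. The following is a scaled-down plain-Python sketch of that arithmetic, not NKI code: the helper names are hypothetical, matrices are lists of rows, and a chunk size of 2 stands in for the 128-partition limit.

```python
def matmul_t(stationary, moving):
    """Reference stationary.T @ moving for lists of rows shaped (K, M) and (K, N)."""
    k, m, n = len(stationary), len(stationary[0]), len(moving[0])
    return [
        [sum(stationary[p][i] * moving[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def accumulated_matmul_t(stationary, moving, chunk):
    """Same product, computed by accumulating per-chunk partial results.

    Each chunk of `chunk` rows plays the role of one nc_matmul call, and
    `acc` plays the role of the PSUM tile being accumulated into.
    """
    m, n = len(stationary[0]), len(moving[0])
    acc = [[0.0] * n for _ in range(m)]
    for start in range(0, len(stationary), chunk):
        partial = matmul_t(stationary[start:start + chunk], moving[start:start + chunk])
        for i in range(m):
            for j in range(n):
                acc[i][j] += partial[i][j]
    return acc


# Contraction size 4 split into two chunks of 2 (stand-ins for 256 and 128).
d = [[1, 2], [3, 4], [5, 6], [7, 8]]              # shape (4, 2)
e = [[1, 0, 2], [0, 1, 3], [1, 1, 0], [2, 0, 1]]  # shape (4, 3)

full = matmul_t(d, e)
accumulated = accumulated_matmul_t(d, e, chunk=2)
assert accumulated == full
print(accumulated)
```

The result has shape (2, 3): its rows come from the free axis of the stationary operand and its columns from the free axis of the moving operand, matching the Returns description.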