This document is relevant for: Inf2, Trn1, Trn2
nki.isa.nc_transpose
- nki.isa.nc_transpose(data, *, mask=None, dtype=None, engine=engine.unknown, **kwargs)
Perform a 2D transpose between the partition axis and the free axis of input data, i.e., a PF-transpose, using Tensor or Vector Engine. If the data tile has more than one free axis, this API implicitly collapses all free axes into one axis and then performs a 2D PF-transpose.

In NeuronCore, both Tensor and Vector Engine can perform a PF-transpose, but they support different input shapes. Tensor Engine nc_transpose can handle an input tile of shape (128, 128) or smaller, while Vector Engine can handle shape (32, 32) or smaller. Therefore, an input tile of shape (32, 32) or smaller can run on either engine, and the choice is controlled by the engine field. If no engine is specified, Neuron Compiler will automatically select an engine based on the input shape. Note that, similar to other Tensor Engine instructions, the Tensor Engine nc_transpose must read the input tile from SBUF and write the transposed result to PSUM. On the other hand, Vector Engine nc_transpose can read/write from/to either SBUF or PSUM.

Note that PF-transpose on Tensor Engine is done by performing a matrix multiplication between data as the stationary tensor and an identity matrix as the moving tensor. See architecture guide for more information. On NeuronCore-v2, such matmul-style transpose is not bit-accurate if the input data contains NaN/Inf. You may consider replacing NaN/Inf with regular floats (float_max/float_min/zeros) in the input matrix before calling nc_transpose(engine=nki.isa.constants.engine.tensor).

Estimated instruction cost:
Cost (Engine Cycles) | Condition
-------------------- | ---------
max(MIN_II, N)       | engine set to nki.isa.constants.engine.vector
max(P, min(64, F))   | engine set to nki.isa.constants.engine.tensor and assuming many back-to-back nc_transpose of the same shape on Tensor Engine

where,
N is the number of elements per partition in data.
MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
P is partition axis size of data.
F is the number of elements per partition in data.
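The two formulas in the cost table can be written as a small plain-Python helper for rough comparisons. This is only a sketch of the table, not part of NKI: the function name and the "vector"/"tensor" string labels are hypothetical, and MIN_II is fixed at 64 because the text describes it as roughly 64 engine cycles.

```python
# Back-of-the-envelope model of the cost table; not an NKI API.
MIN_II = 64  # documented as "roughly 64 engine cycles", so an approximation


def estimate_nc_transpose_cycles(p: int, f: int, engine: str) -> int:
    """Estimate engine cycles for transposing a (p, f) tile.

    p: partition axis size of the tile.
    f: number of elements per partition (all free axes collapsed).
    engine: "vector" or "tensor" (hypothetical labels used by this sketch).
    """
    if engine == "vector":
        # max(MIN_II, N), where N is the number of elements per partition
        return max(MIN_II, f)
    if engine == "tensor":
        # max(P, min(64, F)), assuming many back-to-back transposes
        # of the same shape on Tensor Engine
        return max(p, min(64, f))
    raise ValueError(f"unknown engine label: {engine!r}")


print(estimate_nc_transpose_cycles(32, 32, "vector"))    # 64
print(estimate_nc_transpose_cycles(32, 32, "tensor"))    # 32
print(estimate_nc_transpose_cycles(128, 128, "tensor"))  # 128
```

The Tensor Engine figure only applies under the table's stated assumption of many back-to-back transposes of the same shape, so treat the comparison for a single isolated transpose with caution.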
- Parameters:
data – the input tile to be transposed
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
dtype – if specified and it’s different from the data type of input tile data, an additional nki.isa.cast instruction will be inserted to cast the transposed data into the target dtype (see Supported Data Types for more information)
engine – specify which engine to use for transpose: nki.isa.tensor_engine or nki.isa.vector_engine; by default, the best engine will be selected for the given input tile shape
- Returns:
a tile with the transposed result of the input data tile
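The shape rules from the description (free axes collapsed into one, then a 2D PF-transpose, with different shape limits per engine) can be checked before calling the API with a plain-Python sketch like the one below. All helper names are hypothetical and not part of NKI. The result shape (F, P) is an assumption that follows from the 2D transpose. The engine list only reports which documented limits a shape satisfies, not which engine the compiler would actually pick.

```python
from math import prod

# Shape limits quoted in the description: (partition size, free size).
TENSOR_MAX = (128, 128)
VECTOR_MAX = (32, 32)


def collapse_free_axes(shape):
    """Collapse all free axes into one: (P, f1, f2, ...) -> (P, f1*f2*...)."""
    p, *free = shape
    return (p, prod(free))


def transposed_shape(shape):
    """Shape of the PF-transposed result, assumed here to be (F, P)."""
    p, f = collapse_free_axes(shape)
    return (f, p)


def eligible_engines(shape):
    """List the engines whose documented shape limit covers the tile."""
    p, f = collapse_free_axes(shape)
    engines = []
    if p <= TENSOR_MAX[0] and f <= TENSOR_MAX[1]:
        engines.append("tensor")
    if p <= VECTOR_MAX[0] and f <= VECTOR_MAX[1]:
        engines.append("vector")
    return engines


print(transposed_shape((128, 64)))      # (64, 128)
print(eligible_engines((128, 64)))      # ['tensor']
print(collapse_free_axes((32, 4, 8)))   # (32, 32)
print(eligible_engines((32, 4, 8)))     # ['tensor', 'vector']
```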
Example:
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
...
##################################################################
# Example 1: transpose tile a of shape (128, 64)
##################################################################
i_p_a = nl.arange(128)[:, None]
i_f_a = nl.arange(64)[None, :]
aT = nisa.nc_transpose(a[i_p_a, i_f_a])

##################################################################
# Example 2: transpose tile b of shape (32, 2) using Vector Engine
##################################################################
i_p_b = nl.arange(32)[:, None]
i_f_b = nl.arange(2)[None, :]
bT = nisa.nc_transpose(b[i_p_b, i_f_b], engine=nisa.vector_engine)
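The note about NaN/Inf on the Tensor Engine path can be illustrated without hardware. The sketch below is a plain-Python model, not NKI code: it computes the transpose as data^T @ I and accumulates every product, so an Inf multiplied by a zero of the identity matrix becomes NaN and contaminates its neighbours. The helper names are hypothetical, and the replacement values (float32 max for +Inf, its negation for -Inf, zero for NaN) are one assumed reading of "float_max/float_min/zeros". Actual hardware results may differ in detail.

```python
import math

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value


def identity(n):
    """Build an n x n identity matrix as nested lists."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def matmul_transpose(data):
    """Model a matmul-style transpose of a P x F tile: result = data^T @ I."""
    p, f = len(data), len(data[0])
    eye = identity(p)
    # result[j][i] = sum over k of data[k][j] * eye[k][i]
    return [[sum(data[k][j] * eye[k][i] for k in range(p)) for i in range(p)]
            for j in range(f)]


def sanitize(data, big=FLOAT32_MAX):
    """Replace NaN with 0 and +/-Inf with +/-big (an assumed policy)."""
    def fix(x):
        if math.isnan(x):
            return 0.0
        if math.isinf(x):
            return big if x > 0 else -big
        return x
    return [[fix(x) for x in row] for row in data]


tile = [[1.0, float("inf")],
        [3.0, 4.0]]

# Row 1 of the exact transpose is [inf, 4.0], but the finite 4.0
# is lost because inf * 0.0 evaluates to NaN during accumulation.
print(matmul_transpose(tile))

# After sanitizing, the matmul-style result matches the exact transpose.
print(matmul_transpose(sanitize(tile)))
```

Sanitizing changes the data, so it only suits cases where clamping Inf and zeroing NaN is acceptable for the downstream computation.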