This document is relevant for: Inf2, Trn1, Trn2
nki.isa.nc_transpose
- nki.isa.nc_transpose(data, *, mask=None, dtype=None, engine=0, **kwargs)
Perform a 2D transpose between the partition axis and the free axis of the input data, i.e., a PF-transpose, using Tensor or Vector Engine. If the data tile has more than one free axis, this API implicitly collapses all free axes into one axis and then performs a 2D PF-transpose.

In NeuronCore, both Tensor and Vector Engine can perform a PF-transpose, but they support different input shapes. Tensor Engine nc_transpose can handle an input tile of shape (128, 128) or smaller, while Vector Engine can handle shape (32, 32) or smaller. Therefore, when the input tile shape is (32, 32) or smaller, the transpose can run on either engine, which is controlled by the engine field. If no engine is specified, Neuron Compiler automatically selects an engine based on the input shape. Note that, similar to other Tensor Engine instructions, the Tensor Engine nc_transpose must read the input tile from SBUF and write the transposed result to PSUM. Vector Engine nc_transpose, on the other hand, can read from and write to either SBUF or PSUM.

Note that PF-transpose on Tensor Engine is done by performing a matrix multiplication with data as the stationary tensor and an identity matrix as the moving tensor. See the architecture guide for more information. On NeuronCore-v2, such a matmul-style transpose is not bit-accurate if the input data contains NaN/Inf. Consider replacing NaN/Inf with regular floats (float_max/float_min/zeros) in the input matrix before calling nc_transpose(engine=nki.isa.tensor_engine).

Estimated instruction cost:
| Cost (Engine Cycles) | Condition |
| --- | --- |
| max(MIN_II, N) | engine set to nki.isa.vector_engine |
| max(P, min(64, F)) | engine set to nki.isa.tensor_engine, assuming many back-to-back nc_transpose of the same shape on Tensor Engine |

where,
N is the number of elements per partition in data.
MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
P is the partition axis size of data.
F is the number of elements per partition in data.
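The cost formulas above can be sketched as a small estimator in plain Python. This is only an illustration of the arithmetic: the function name `estimate_nc_transpose_cycles` and the string engine labels are hypothetical and not part of NKI, `MIN_II = 64` is the approximate value quoted above, and the Tensor Engine figure carries the same assumption of many back-to-back transposes of the same shape.

```python
MIN_II = 64  # approximate minimum initiation interval quoted in the text


def estimate_nc_transpose_cycles(p, f, engine):
    """Hypothetical helper (not an NKI API).

    p: partition axis size of the input tile
    f: number of elements per partition (all free axes collapsed)
    engine: "vector" or "tensor" (illustrative labels only)
    """
    if engine == "vector":
        # Vector Engine handles tiles of shape (32, 32) or smaller.
        if p > 32 or f > 32:
            raise ValueError("Vector Engine transpose supports at most (32, 32)")
        return max(MIN_II, f)
    if engine == "tensor":
        # Tensor Engine handles tiles of shape (128, 128) or smaller.
        if p > 128 or f > 128:
            raise ValueError("Tensor Engine transpose supports at most (128, 128)")
        return max(p, min(64, f))
    raise ValueError("engine must be 'vector' or 'tensor'")


# A (32, 32) tile fits on either engine; compare the two estimates.
vector_cycles = estimate_nc_transpose_cycles(32, 32, "vector")
tensor_cycles = estimate_nc_transpose_cycles(32, 32, "tensor")
```

Under these formulas, small tiles on Vector Engine are bounded below by MIN_II, so the estimate stays flat as the tile shrinks.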
- Parameters:
data – the input tile to be transposed
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
dtype – if specified and it’s different from the data type of the input tile data, an additional nki.isa.cast instruction will be inserted to cast the transposed data into the target dtype (see Supported Data Types for more information)
engine – specify which engine to use for transpose: nki.isa.tensor_engine or nki.isa.vector_engine; by default, the best engine will be selected for the given input tile shape
- Returns:
a tile with the transposed result of the input data tile
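As a reference for the expected result shape, the collapse-then-transpose behavior described above can be modeled on nested Python lists. This is a sketch only: `pf_transpose_reference` is a hypothetical name, and the row-major order in which the free axes are collapsed is an assumption, not something this page states.

```python
def flatten(values):
    """Collapse arbitrarily nested lists into one flat list (row-major)."""
    flat = []
    for item in values:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def pf_transpose_reference(tile):
    """Hypothetical reference model (not an NKI API).

    tile: nested list whose outermost axis is the partition axis.
    Returns a 2D list of shape (F, P), where F is the product of all
    free-axis sizes and P is the partition axis size.
    """
    rows = [flatten(partition) for partition in tile]  # shape (P, F)
    return [list(column) for column in zip(*rows)]     # shape (F, P)


# A tile with P=2 partitions and free axes of shape (2, 2), so F=4.
tile = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]
transposed = pf_transpose_reference(tile)
```

Here the (2, 2, 2) input becomes a (4, 2) result: the two free axes are merged into one axis of size 4 before the partition and free axes are swapped.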
Example:
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    ...
    ##################################################################
    # Example 1: transpose tile a of shape (128, 64)
    ##################################################################
    i_p_a = nl.arange(128)[:, None]
    i_f_a = nl.arange(64)[None, :]
    aT = nisa.nc_transpose(a[i_p_a, i_f_a])

    ##################################################################
    # Example 2: transpose tile b of shape (32, 2) using Vector Engine
    ##################################################################
    i_p_b = nl.arange(32)[:, None]
    i_f_b = nl.arange(2)[None, :]
    bT = nisa.nc_transpose(b[i_p_b, i_f_b], engine=nisa.vector_engine)
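The NaN/Inf caveat for the Tensor Engine path can be illustrated with a simplified model in plain Python. This is not how the hardware is implemented and does not reproduce its exact numerics. It only shows why a transpose computed as a matrix multiplication against an identity matrix can spread a NaN (since NaN * 0 is NaN), and how replacing non-finite values beforehand avoids that. The names `matmul_style_transpose` and `sanitize` are hypothetical, and the choice of replacement values is an assumption.

```python
import math
import sys


def matmul_style_transpose(data):
    """Simplified model: compute data^T @ I with ordinary float arithmetic."""
    p = len(data)      # partition axis size
    f = len(data[0])   # elements per partition
    identity = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    return [
        [sum(data[k][i] * identity[k][j] for k in range(p)) for j in range(p)]
        for i in range(f)
    ]


def sanitize(data, nan_value=0.0):
    """Replace NaN with nan_value and +/-Inf with +/-float max (assumed policy)."""
    def fix(x):
        if math.isnan(x):
            return nan_value
        if math.isinf(x):
            return math.copysign(sys.float_info.max, x)
        return x
    return [[fix(x) for x in row] for row in data]


data = [[1.0, 2.0],
        [float("nan"), 4.0]]

polluted = matmul_style_transpose(data)        # NaN spreads across one output row
clean = matmul_style_transpose(sanitize(data))  # finite values transpose exactly
```

In this model a single NaN in the input turns the whole corresponding output row into NaN, while the sanitized input transposes to `[[1.0, 0.0], [2.0, 4.0]]`.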