This document is relevant for: Inf2, Trn1, Trn2

nki.isa.nc_transpose

nki.isa.nc_transpose(data, *, mask=None, dtype=None, engine=engine.unknown, **kwargs)

Perform a 2D transpose between the partition axis and the free axis of the input data (a PF-transpose) using the Tensor or Vector Engine. If the data tile has more than one free axis, this API implicitly collapses all free axes into one axis and then performs a 2D PF-transpose.
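
The collapse-then-transpose behavior described above can be modeled on the host with NumPy. This is only a reference sketch for reasoning about output shapes, not NKI code: pf_transpose_reference is a hypothetical helper name, axis 0 is assumed to be the partition axis, and the free axes are assumed to collapse in row-major order.

```python
import numpy as np

def pf_transpose_reference(data: np.ndarray) -> np.ndarray:
    """NumPy model of a PF-transpose; axis 0 plays the role of the partition axis."""
    num_partitions = data.shape[0]
    # Collapse every free axis (axes 1..n) into a single free axis.
    # Assumption: row-major collapse order.
    collapsed = data.reshape(num_partitions, -1)
    # Swap the partition axis with the collapsed free axis.
    return collapsed.T

# A tile with 4 partitions and two free axes of sizes 2 and 3.
tile = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
out = pf_transpose_reference(tile)
print(out.shape)  # (6, 4)
```

Under these assumptions, element tile[p, i, j] ends up at out[i * 3 + j, p].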

In NeuronCore, both the Tensor and Vector Engines can perform a PF-transpose, but they support different input shapes. Tensor Engine nc_transpose can handle an input tile of shape (128, 128) or smaller, while Vector Engine nc_transpose can handle shape (32, 32) or smaller. Therefore, when the input tile shape is (32, 32) or smaller, either engine can be used, and the choice is controlled by the engine field. If no engine is specified, the Neuron Compiler automatically selects an engine based on the input shape. Note that, like other Tensor Engine instructions, Tensor Engine nc_transpose must read the input tile from SBUF and write the transposed result to PSUM. Vector Engine nc_transpose, on the other hand, can read from and write to either SBUF or PSUM.
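
The shape limits above can be written as a small host-side check. This is a minimal sketch in plain Python, not part of NKI: eligible_engines and the two constants are hypothetical names, free_size is assumed to be the free size after all free axes are collapsed, and the function only reports which engines could accept the shape. It does not model how the Neuron Compiler actually chooses between them.

```python
# Maximum (partition, free) tile shapes, taken from the limits stated in the text.
TENSOR_ENGINE_MAX = (128, 128)
VECTOR_ENGINE_MAX = (32, 32)

def eligible_engines(partition_size, free_size):
    """Return the names of the engines whose shape limit covers the tile."""
    engines = []
    if partition_size <= TENSOR_ENGINE_MAX[0] and free_size <= TENSOR_ENGINE_MAX[1]:
        engines.append("tensor")
    if partition_size <= VECTOR_ENGINE_MAX[0] and free_size <= VECTOR_ENGINE_MAX[1]:
        engines.append("vector")
    return engines

print(eligible_engines(128, 64))  # ['tensor']
print(eligible_engines(32, 2))    # ['tensor', 'vector']
```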

Note that PF-transpose on the Tensor Engine is performed as a matrix multiplication with data as the stationary tensor and an identity matrix as the moving tensor. See the architecture guide for more information. On NeuronCore-v2, this matmul-style transpose is not bit-accurate if the input data contains NaN/Inf. Consider replacing NaN/Inf with regular floats (float_max/float_min/zeros) in the input matrix before calling nc_transpose(engine=nki.isa.constants.engine.tensor).
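
One plausible way to see why a matmul-style transpose can mishandle Inf is that, in IEEE arithmetic, Inf * 0 is NaN, so the zeros of the identity matrix can turn an Inf input into NaN in neighboring outputs. The NumPy sketch below models only that arithmetic on the host; it is an assumption about the mechanism, not a description of the hardware. matmul_style_transpose is a hypothetical helper, and np.nan_to_num stands in for the suggested NaN/Inf replacement.

```python
import numpy as np

def matmul_style_transpose(x: np.ndarray) -> np.ndarray:
    """Model out[f, p] = sum_k x[k, f] * I[k, p] with explicit multiplies."""
    identity = np.eye(x.shape[0], dtype=x.dtype)
    # products[k, f, p] = x[k, f] * identity[k, p]
    products = x[:, :, None] * identity[:, None, :]
    return products.sum(axis=0)

x = np.array([[1.0, np.inf],
              [3.0, 4.0]], dtype=np.float32)

# Inf * 0 raises an "invalid value" warning in NumPy; silence it for the demo.
with np.errstate(invalid="ignore"):
    naive = matmul_style_transpose(x)
print(np.isnan(naive).any())  # True: the finite value 4.0 was corrupted

# Replace NaN/Inf with finite values before transposing.
finfo = np.finfo(x.dtype)
clean = np.nan_to_num(x, nan=0.0, posinf=finfo.max, neginf=finfo.min)
safe = matmul_style_transpose(clean)
print(np.array_equal(safe, clean.T))  # True
```

In this model the Inf itself survives, but the finite element that shares its output row becomes NaN, which is why the sanitizing step is applied to the whole input matrix before the transpose.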

Estimated instruction cost:

Cost (Engine Cycles)	Condition
`max(MIN_II, N)`	`engine` set to `nki.isa.constants.engine.vector`
`max(P, min(64, F))`	`engine` set to `nki.isa.constants.engine.tensor` and assuming many back-to-back `nc_transpose` of the same shape on Tensor Engine

where,

N is the number of elements per partition in data.
MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
P is the partition axis size of data.
F is the number of elements per partition in data.
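
The two cost formulas above can be evaluated with a few lines of Python. This is a rough sketch for comparing tile shapes: estimate_nc_transpose_cycles is a hypothetical helper, the engine is passed as a plain string rather than an NKI constant, MIN_II is fixed at the approximate value of 64 given above, and the Tensor Engine figure carries the same back-to-back assumption as the table.

```python
MIN_II = 64  # approximate minimum instruction initiation interval, in engine cycles

def estimate_nc_transpose_cycles(partition_size, free_size, engine):
    """Estimated engine cycles for one nc_transpose, following the table above.

    free_size is the number of elements per partition (N and F in the text).
    """
    if engine == "vector":
        return max(MIN_II, free_size)
    if engine == "tensor":
        # Assumes many back-to-back transposes of the same shape.
        return max(partition_size, min(64, free_size))
    raise ValueError(f"unknown engine: {engine!r}")

print(estimate_nc_transpose_cycles(32, 2, "vector"))    # 64
print(estimate_nc_transpose_cycles(128, 64, "tensor"))  # 128
```

Under these estimates, a (32, 32) tile costs about 64 cycles on the Vector Engine and about 32 cycles on the Tensor Engine when issued back to back.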

Parameters:

data – the input tile to be transposed
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
dtype – if specified and it’s different from the data type of input tile data, an additional nki.isa.cast instruction will be inserted to cast the transposed data into the target dtype (see Supported Data Types for more information)
engine – specifies which engine to use for the transpose: nki.isa.tensor_engine or nki.isa.vector_engine; by default, the best engine is selected for the given input tile shape

Returns:

a tile containing the transposed result of the input data tile

Example:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
...

##################################################################
# Example 1: transpose tile a of shape (128, 64)
##################################################################
i_p_a = nl.arange(128)[:, None]
i_f_a = nl.arange(64)[None, :]
aT = nisa.nc_transpose(a[i_p_a, i_f_a])


##################################################################
# Example 2: transpose tile b of shape (32, 2) using Vector Engine
##################################################################
i_p_b = nl.arange(32)[:, None]
i_f_b = nl.arange(2)[None, :]
bT = nisa.nc_transpose(b[i_p_b, i_f_b], engine=nisa.vector_engine)
