This document is relevant for: Inf2, Trn1, Trn1n

nki.isa.nc_transpose

nki.isa.nc_transpose(data, mask=None, dtype=None, engine=None, **kwargs)

Perform a 2D transpose between the partition axis and the free axis of the input data, i.e., a PF-transpose, using the Tensor Engine or the Vector Engine. If the data tile has more than one free axis, this API implicitly collapses all free axes into one axis and then performs a 2D PF-transpose.
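The collapse-then-transpose behavior described above can be pictured with a small host-side reference model. This is a minimal sketch in plain Python, not NKI code: the function name pf_transpose_reference is hypothetical, and the row-major order used to collapse the free axes is an assumption that the text above does not spell out.

```python
def pf_transpose_reference(tile):
    """Host-side model of a PF-transpose on a tile indexed [p][f1][f2].

    Assumption: the two free axes are collapsed in row-major order.
    """
    # Collapse the free axes (f1, f2) of every partition into a single free axis.
    flat = [[x for row in partition for x in row] for partition in tile]
    # Swap the partition axis with the collapsed free axis.
    return [list(column) for column in zip(*flat)]


# A tile with 4 partitions and free axes of sizes 2 and 3.
tile = [[[p * 100 + f1 * 10 + f2 for f2 in range(3)] for f1 in range(2)]
        for p in range(4)]
out = pf_transpose_reference(tile)
print(len(out), len(out[0]))  # 6 4
```

Under that assumption, element [p][f1][f2] of the input lands at [f1 * 3 + f2][p] in the result.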

On a NeuronCore, both the Tensor Engine and the Vector Engine can perform a PF-transpose, but they support different input shapes. Tensor Engine nc_transpose can handle an input tile of shape (128, 128) or smaller, while Vector Engine nc_transpose can handle shape (32, 32) or smaller. Therefore, when the input tile shape is (32, 32) or smaller, the transpose can run on either engine, as selected by the engine parameter. If no engine is specified, the Neuron Compiler automatically selects an engine based on the input shape. Note that, like other Tensor Engine instructions, Tensor Engine nc_transpose must read the input tile from SBUF and write the transposed result to PSUM. Vector Engine nc_transpose, on the other hand, can read from and write to either SBUF or PSUM.
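The shape limits above can be summarized as a small eligibility check. The sketch below is plain Python with hypothetical names (eligible_engines and the two constants). It only reports which engines could accept a tile of a given shape, and it does not model how the Neuron Compiler actually chooses between them.

```python
# Shape limits quoted in the text above: (partition size, free size).
TENSOR_ENGINE_MAX = (128, 128)
VECTOR_ENGINE_MAX = (32, 32)


def eligible_engines(partition_size, free_size):
    """Return the engines whose shape limit covers the given 2D tile shape."""
    engines = []
    if partition_size <= VECTOR_ENGINE_MAX[0] and free_size <= VECTOR_ENGINE_MAX[1]:
        engines.append("vector_engine")
    if partition_size <= TENSOR_ENGINE_MAX[0] and free_size <= TENSOR_ENGINE_MAX[1]:
        engines.append("tensor_engine")
    return engines


print(eligible_engines(128, 64))  # ['tensor_engine']
print(eligible_engines(32, 2))    # ['vector_engine', 'tensor_engine']
```

A tile that fits neither limit yields an empty list, which in a real kernel would mean the tile has to be split before it is transposed.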

Note that PF-transpose on the Tensor Engine is done by performing a matrix multiplication with data as the stationary tensor and an identity matrix as the moving tensor. See the architecture guide for more information. On NeuronCore-v2, this matmul-style transpose is not bit-accurate if the input data contains NaN/Inf. Consider replacing NaN/Inf with regular floats (float_max/float_min/zeros) in the input matrix before calling nc_transpose(engine=nki.isa.tensor_engine).
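One way to see why a matmul-style transpose can disturb NaN/Inf inputs is to model it with ordinary IEEE floating-point arithmetic on the host. The sketch below is plain Python, not NKI code, and the helper names matmul_transpose and sanitize are hypothetical. It illustrates a plausible mechanism (Inf multiplied by a zero of the identity matrix yields NaN) and does not claim to reproduce the exact values the hardware produces.

```python
import math
import sys


def matmul_transpose(data):
    """Model a transpose as data^T @ identity: out[f][j] = sum_k data[k][f] * I[k][j]."""
    p, f = len(data), len(data[0])
    identity = [[1.0 if k == j else 0.0 for j in range(p)] for k in range(p)]
    return [[sum(data[k][i] * identity[k][j] for k in range(p)) for j in range(p)]
            for i in range(f)]


def sanitize(data):
    """Replace NaN with 0.0 and +/-Inf with the largest finite float of that sign."""
    def fix(x):
        if math.isnan(x):
            return 0.0
        if math.isinf(x):
            return math.copysign(sys.float_info.max, x)
        return x
    return [[fix(x) for x in row] for row in data]


data = [[1.0, 2.0], [3.0, math.inf]]
out = matmul_transpose(data)
# An exact transpose would put 2.0 at out[1][0]; here 2.0 + inf * 0.0 gives NaN.
print(math.isnan(out[1][0]))  # True

clean = sanitize(data)
clean_out = matmul_transpose(clean)
print(clean_out == [list(col) for col in zip(*clean)])  # True
```

In this model, a single Inf contaminates the other entries of its output row, while the sanitized input transposes exactly.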

Estimated instruction cost:

  • Vector Engine: N Vector Engine cycles, where N is the number of elements per partition in data.

  • Tensor Engine (assuming many back-to-back nc_transpose instructions of the same shape on Tensor Engine): max(P, min(64, F)) Tensor Engine cycles, where P is the partition axis size of data and F is the number of elements per partition in data.
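The two cost estimates above can be evaluated with a short helper. This is a minimal sketch in plain Python. The function name and the "vector"/"tensor" string labels are hypothetical and are not part of the NKI API.

```python
def estimate_nc_transpose_cycles(partition_size, free_size, engine):
    """Estimated cycles for one nc_transpose, using the formulas listed above.

    partition_size is P; free_size is the number of elements per partition (N or F).
    """
    if engine == "vector":
        return free_size
    if engine == "tensor":
        # Assumes many back-to-back transposes of the same shape, as stated above.
        return max(partition_size, min(64, free_size))
    raise ValueError(f"unknown engine: {engine!r}")


print(estimate_nc_transpose_cycles(128, 64, "tensor"))  # 128
print(estimate_nc_transpose_cycles(32, 2, "vector"))    # 2
print(estimate_nc_transpose_cycles(32, 2, "tensor"))    # 32
```

The results are counted in cycles of the respective engine, so figures for the two engines are only directly comparable if the engines run at the same clock rate, which this sketch does not assume.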

Parameters:
  • data – the input tile to be transposed

  • mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

  • dtype – if specified and it’s different from the data type of input tile data, an additional nki.isa.cast instruction will be inserted to cast the transposed data into the target dtype (see Supported Data Types for more information)

  • engine – specifies which engine to use for the transpose: nki.isa.tensor_engine or nki.isa.vector_engine; by default, the best engine is selected for the given input tile shape

Returns:

a tile containing the transposed result of the input data tile

Example:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
...
  ##################################################################
  # Example 1: transpose tile a of shape (128, 64)
  ##################################################################
  i_p_a = nl.arange(128)[:, None]
  i_f_a = nl.arange(64)[None, :]
  aT = nisa.nc_transpose(a[i_p_a, i_f_a])

  ##################################################################
  # Example 2: transpose tile b of shape (32, 2) using Vector Engine
  ##################################################################
  i_p_b = nl.arange(32)[:, None]
  i_f_b = nl.arange(2)[None, :]
  bT = nisa.nc_transpose(b[i_p_b, i_f_b], engine=nisa.vector_engine)
