This document is relevant for: Inf2, Trn1, Trn2
nki.isa.nc_stream_shuffle
nki.isa.nc_stream_shuffle(src, dst, shuffle_mask, *, dtype=None, mask=None, **kwargs)
Apply cross-partition data movement within a quadrant of 32 partitions from source tile src to destination tile dst using the Vector Engine.

Both source and destination tiles can be in either SBUF or PSUM, and are passed in by reference as arguments. In-place shuffle is allowed, i.e., dst may be the same as src.

shuffle_mask is a 32-element list. Each mask element must be an int or an affine expression. shuffle_mask[i] indicates which input partition the output partition [i] copies from within each 32-partition quadrant. The special value shuffle_mask[i]=255 means the output tensor in partition [i] will be unmodified.

nc_stream_shuffle can be applied to multiple quadrants. When more than one quadrant is accessed, the shuffle is applied to each quadrant independently, and the same shuffle_mask is used for each quadrant.

mask applies to dst, meaning that locations masked out by mask will be unmodified. For more information about the cross-partition data movement, see Cross-partition Data Movement.

This API has three constraints on
src and dst:
- dst must have the same data type as src.
- dst must have the same number of elements per partition as src.
- The access start partition of src (src_start_partition) does not have to match or be in the same quadrant as that of dst (dst_start_partition). However, src_start_partition/dst_start_partition must follow hardware rules that depend on the number of active partitions, num_active_partitions = ceil(max(src_num_partitions, dst_num_partitions)/32) * 32, where src_num_partitions and dst_num_partitions refer to the number of partitions the src and dst tensors access, respectively. src_start_partition/dst_start_partition is constrained based on the value of num_active_partitions:
  - If num_active_partitions is 96/128, src_start_partition/dst_start_partition must be 0.
  - If num_active_partitions is 64, src_start_partition/dst_start_partition must be 0/64.
  - If num_active_partitions is 32, src_start_partition/dst_start_partition must be 0/32/64/96.
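The start-partition rules above can be checked on the host before a kernel is traced. The following is a minimal plain-Python sketch, not part of the NKI API: the helper names (num_active_partitions, allowed_start_partitions, check_start_partitions) are hypothetical, and the code assumes the listed rules are exhaustive for tiles of up to 128 partitions.

```python
import math

QUADRANT = 32  # partitions per quadrant, as stated in the text above


def num_active_partitions(src_num_partitions: int, dst_num_partitions: int) -> int:
    """Round the larger partition count up to a whole number of quadrants."""
    larger = max(src_num_partitions, dst_num_partitions)
    return math.ceil(larger / QUADRANT) * QUADRANT


def allowed_start_partitions(active: int) -> tuple:
    """Start partitions permitted for a given num_active_partitions value."""
    if active in (96, 128):
        return (0,)
    if active == 64:
        return (0, 64)
    if active == 32:
        return (0, 32, 64, 96)
    raise ValueError(f"unsupported num_active_partitions: {active}")


def check_start_partitions(src_start: int, src_num: int,
                           dst_start: int, dst_num: int) -> bool:
    """Return True if both start partitions satisfy the documented rule."""
    allowed = allowed_start_partitions(num_active_partitions(src_num, dst_num))
    return src_start in allowed and dst_start in allowed
```

For instance, broadcasting one source partition that starts at partition 96 into a 32-partition destination starting at partition 0 passes the check, because src and dst need not share a quadrant when num_active_partitions is 32.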
Estimated instruction cost:
max(MIN_II, N) Vector Engine cycles, where N is the number of elements per partition in src, and MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
Parameters:
src – the source tile
dst – the destination tile
shuffle_mask – a 32-element list that specifies the shuffle source and destination partition
dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
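As a quick illustration of the estimated instruction cost, max(MIN_II, N), the sketch below evaluates the formula in plain Python. The function name estimated_cycles is hypothetical, and MIN_II = 64 is only the approximate figure quoted in this document, so the result is a rough estimate rather than a measured cycle count.

```python
MIN_II = 64  # approximate value; the document says "roughly 64 engine cycles"


def estimated_cycles(elements_per_partition: int, min_ii: int = MIN_II) -> int:
    """Estimated Vector Engine cycles for one nc_stream_shuffle instruction."""
    return max(min_ii, elements_per_partition)
```

Under this estimate, tiles with fewer than about 64 elements per partition all cost the same, so very narrow tiles do not make the instruction cheaper.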
Example:
    import numpy as np  # needed for np.float32 in Examples 2 and 3
    import neuronxcc.nki.isa as nisa
    import neuronxcc.nki.language as nl
    from neuronxcc.nki.typing import tensor

    #####################################################################
    # Example 1:
    # Apply cross-partition data movement to a 32-partition tensor,
    # in-place shuffling the data in partition[i] to partition[(i+1)%32].
    #####################################################################
    ...
    a: tensor[32, 128] = nl.load(in_tensor)
    a_mgrid = nl.mgrid[0:32, 0:128]
    # Output partition i copies from input partition (i - 1) % 32.
    shuffle_mask = [(i - 1) % 32 for i in range(32)]
    nisa.nc_stream_shuffle(src=a[a_mgrid.p, a_mgrid.x], dst=a[a_mgrid.p, a_mgrid.x], shuffle_mask=shuffle_mask)

    nl.store(out_tensor, value=a)

    #####################################################################
    # Example 2:
    # Broadcast data in 1 partition to 32 partitions.
    #####################################################################
    ...
    a: tensor[1, 128] = nl.load(in_tensor)
    b = nl.ndarray(shape=(32, 128), dtype=np.float32)
    dst_mgrid = nl.mgrid[0:32, 0:128]
    src_mgrid = nl.mgrid[0:1, 0:128]
    shuffle_mask = [0] * 32
    nisa.nc_stream_shuffle(src=a[0, src_mgrid.x], dst=b[dst_mgrid.p, dst_mgrid.x], shuffle_mask=shuffle_mask)

    nl.store(out_tensor, value=b)

    #####################################################################
    # Example 3:
    # In the case where src and dst access more than one quadrant (32
    # partitions), the shuffle is applied to each quadrant independently,
    # and the same shuffle_mask is used for each quadrant.
    #####################################################################
    ...
    a: tensor[128, 128] = nl.load(in_tensor)
    b = nl.ndarray(shape=(128, 128), dtype=np.float32)
    mgrid = nl.mgrid[0:128, 0:128]
    shuffle_mask = [(i - 1) % 32 for i in range(32)]
    nisa.nc_stream_shuffle(src=a[mgrid.p, mgrid.x], dst=b[mgrid.p, mgrid.x], shuffle_mask=shuffle_mask)

    nl.store(out_tensor, value=b)
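To sanity-check a shuffle_mask on the host, the semantics described in this document can be modeled in plain Python. This is a hedged sketch, not NKI code: reference_shuffle is a hypothetical helper, tiles are represented as lists of rows (one row per partition), and the dtype and mask parameters of the real instruction are ignored. It assumes every mask entry is either 255 or a partition index that exists in src.

```python
QUADRANT = 32     # partitions per quadrant
UNMODIFIED = 255  # special mask value: leave the output partition untouched


def reference_shuffle(src, dst, shuffle_mask):
    """Return the contents dst would hold after the shuffle (host-side model).

    src and dst are lists of rows, one row per partition. The same
    shuffle_mask is applied independently to each 32-partition quadrant.
    """
    if len(shuffle_mask) != QUADRANT:
        raise ValueError("shuffle_mask must have exactly 32 elements")
    # Snapshot src first so that an in-place shuffle (dst is src) is modeled correctly.
    src_rows = [list(row) for row in src]
    out = [list(row) for row in dst]
    for p in range(len(out)):
        quadrant, i = divmod(p, QUADRANT)
        source_index = shuffle_mask[i]
        if source_index == UNMODIFIED:
            continue  # output partition p keeps its previous contents
        out[p] = list(src_rows[quadrant * QUADRANT + source_index])
    return out


# Mask from Examples 1 and 3: output partition i copies from partition (i - 1) % 32.
rotate_mask = [(i - 1) % 32 for i in range(32)]
rows = [[p] for p in range(32)]
rotated = reference_shuffle(rows, rows, rotate_mask)
```

With the rotate mask, the model moves the row from partition 31 into partition 0 and shifts every other row up by one partition, which matches the comment in Example 1.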