nki.isa.nc_stream_shuffle

nki.isa.nc_stream_shuffle(dst, src, shuffle_mask, name=None)

Apply cross-partition data movement within a quadrant of 32 partitions from source tile src to destination tile dst using the Vector Engine.

Both the source and destination tiles can be in either SBUF or PSUM, and are passed in by reference as arguments. In-place shuffle is allowed, i.e., dst can be the same as src.

shuffle_mask is a 32-element list. Each mask element must be an int or an affine expression. shuffle_mask[i] indicates which input partition the output partition [i] copies from within each 32-partition quadrant. The special value shuffle_mask[i]=255 means the output tensor in partition [i] will be unmodified.

nc_stream_shuffle can be applied to multiple quadrants. With more than one quadrant, the shuffle is applied to each quadrant independently, and the same shuffle_mask is used for each quadrant. For more information about cross-partition data movement, see Cross-partition Data Movement.
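The mask semantics described above can be illustrated with a small pure-Python reference model. This is only a sketch for reasoning about shuffle_mask, not the NKI implementation: simulate_stream_shuffle is a hypothetical helper, tiles are modelled as lists of per-partition rows, src and dst are assumed to span the same whole number of quadrants, and in-place calls are assumed to read the original source values.

```python
QUADRANT = 32      # partitions per quadrant, as stated in the description
UNMODIFIED = 255   # special mask value: leave the output partition untouched


def simulate_stream_shuffle(dst, src, shuffle_mask):
    """Hypothetical host-side model of the shuffle (not an NKI API).

    dst and src are lists of rows, one row per partition.
    """
    if len(shuffle_mask) != QUADRANT:
        raise ValueError("shuffle_mask must have exactly 32 elements")
    if len(src) != len(dst) or len(src) % QUADRANT != 0:
        # Simplification of this sketch: same size, whole quadrants only.
        raise ValueError("src and dst must span the same whole number of quadrants")
    for m in shuffle_mask:
        if m != UNMODIFIED and not 0 <= m < QUADRANT:
            raise ValueError(f"invalid mask element: {m}")

    # Assumption: an in-place shuffle (dst is src) reads the original values.
    snapshot = [list(row) for row in src]

    # The same mask is applied to every quadrant independently.
    for base in range(0, len(snapshot), QUADRANT):
        for i, m in enumerate(shuffle_mask):
            if m == UNMODIFIED:
                continue  # output partition base + i keeps its old contents
            dst[base + i] = list(snapshot[base + m])
    return dst


# Rotate partitions down by one within each quadrant of a 64-partition tile.
src = [[p, p * 10] for p in range(64)]
dst = [[-1, -1] for _ in range(64)]
rotate_mask = [(i - 1) % QUADRANT for i in range(QUADRANT)]
simulate_stream_shuffle(dst, src, rotate_mask)
```

In this model, output partition 0 of each quadrant receives the contents of input partition 31 of the same quadrant, and data never crosses a quadrant boundary.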

This API has 3 constraints on src and dst:

  1. dst must have same data type as src.

  2. dst must have the same number of elements per partition as src.

  3. The start partition accessed by src (src_start_partition) does not have to match, or be in the same quadrant as, the start partition accessed by dst (dst_start_partition). However, both start partitions are constrained by the number of active partitions, num_active_partitions = ceil(max(src_num_partitions, dst_num_partitions)/32) * 32, where src_num_partitions and dst_num_partitions are the numbers of partitions accessed by the src and dst tiles respectively. The allowed values of src_start_partition and dst_start_partition depend on num_active_partitions:

  • If num_active_partitions is 96 or 128, src_start_partition and dst_start_partition must both be 0.

  • If num_active_partitions is 64, src_start_partition and dst_start_partition must each be 0 or 64.

  • If num_active_partitions is 32, src_start_partition and dst_start_partition must each be 0, 32, 64, or 96.
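As a rough aid for checking the third constraint before calling the API, the rules above can be written as a small host-side helper. The names num_active_partitions, ALLOWED_START, and check_start_partitions below are hypothetical and not part of NKI; the sketch only restates the formula and the three rules listed above, reading each "/" as "or".

```python
import math

# Allowed start partitions for each value of num_active_partitions,
# taken from the three rules listed above.
ALLOWED_START = {
    32: (0, 32, 64, 96),
    64: (0, 64),
    96: (0,),
    128: (0,),
}


def num_active_partitions(src_num_partitions, dst_num_partitions):
    """Round the larger partition count up to a multiple of 32."""
    return math.ceil(max(src_num_partitions, dst_num_partitions) / 32) * 32


def check_start_partitions(src_start, dst_start,
                           src_num_partitions, dst_num_partitions):
    """Return True if both start partitions satisfy the rules above."""
    active = num_active_partitions(src_num_partitions, dst_num_partitions)
    allowed = ALLOWED_START.get(active)
    if allowed is None:
        raise ValueError(f"unsupported num_active_partitions: {active}")
    return src_start in allowed and dst_start in allowed
```

For example, a 32-partition src starting at partition 96 and a 32-partition dst starting at partition 0 pass the check, while a 64-partition access starting at partition 32 does not.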

Parameters:
  • dst – the destination tile

  • src – the source tile

  • shuffle_mask – a 32-element list; shuffle_mask[i] is the source partition, within each quadrant, that output partition [i] copies from, or 255 to leave output partition [i] unmodified