nki.isa.tensor_reduce
nki.isa.tensor_reduce(dst, op, data, axis, negate=False, keepdims=False, name=None)
Apply a reduction operation to the free axes of an input data tile using Vector Engine.
The reduction operator is specified in the op input field (see Supported Math Operators for NKI ISA for a list of supported reduction operators). nisa.tensor_reduce supports two types of reduction operators: 1) bitvec operators (e.g., bitwise_and, bitwise_or) and 2) arithmetic operators (e.g., add, subtract, multiply).
The reduction axes are specified in the axis field using a list of integer(s) to indicate axis indices. The reduction axes can contain up to four free axes and must start at the most minor free axis. Since axis 0 is the partition axis in a tile, the reduction axes must contain axis 1 (most-minor). In addition, the reduction axes must be consecutive: e.g., [1, 2, 3, 4] is a legal axis field, but [1, 3, 4] is not.
When the reduction op is an arithmetic operator, the instruction can also multiply the output reduction results by -1.0 before writing into the output tile, at no additional performance cost. This behavior is controlled by the negate input field.
Memory types.
Both the input data and dst tiles can be in SBUF or PSUM.
Data types.
For bitvec operators, the input/output data types must be integer types, and Vector Engine treats all input elements as bit patterns without any data type casting. For arithmetic operators, the input/output data types can be any supported NKI data type, but the engine automatically casts the input data to float32 and performs the reduction in float32 math. The float32 reduction results are then cast to the data type of dst.
Layout.
nisa.tensor_reduce only supports free axes reduction. Therefore, the partition dimension of the input data is considered the parallel compute dimension. To perform a partition axis reduction, we can either:
- invoke a nisa.nc_transpose instruction on the input tile and then this nisa.tensor_reduce on the transposed tile, or
- invoke nki.isa.nc_matmul instructions to multiply nl.ones([128, 1], dtype=data.dtype) as a stationary tensor with the input tile as a moving tensor. See more discussion on Tensor Engine alternative usage in the Trainium architecture guide.
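The two partition-axis workarounds above can be compared with a small NumPy emulation. This is only a sketch of the arithmetic, not NKI kernel code: the transpose stands in for nisa.nc_transpose, the matrix product stands in for nki.isa.nc_matmul under the assumption that it contracts along the partition axis (stationary.T @ moving), and the tile shape and variable names are made up for illustration. Note that the ones-matmul route only reproduces an add reduction.

```python
import numpy as np

# Hypothetical tile: 128 partitions, 64 elements per partition.
P, F = 128, 64
rng = np.random.default_rng(0)
data = rng.standard_normal((P, F)).astype(np.float32)

# Option 1: transpose so the old partition axis becomes free axis 1,
# then reduce along that free axis (what tensor_reduce would do).
transposed = data.T                      # shape (F, P), stand-in for nc_transpose
via_transpose = transposed.sum(axis=1)   # shape (F,)

# Option 2: multiply a column of ones (stationary) with the tile (moving).
# Assumption: the matmul contracts over the partition axis, i.e. ones.T @ data.
ones = np.ones((P, 1), dtype=data.dtype)
via_matmul = (ones.T @ data)[0]          # shape (F,)

# Both routes should agree with a direct sum over the partition axis.
reference = data.sum(axis=0)
```

In this emulation both results match the direct sum up to float32 rounding; on hardware the choice between the two routes depends on which engine is otherwise idle, which the sketch does not model.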
Tile size.
The partition dimension size of the input data and output dst tiles must be the same and must not exceed 128. The number of elements per partition of data must not exceed the physical size of each SBUF partition. The number of elements per partition in dst must be consistent with the axis field. For example, if axis indicates all free dimensions of data are reduced, the number of elements per partition in dst must be 1.
Parameters:
dst – output tile of the reduction result
op – the reduction operator (see Supported Math Operators for NKI ISA for supported reduction operators)
data – the input tile to be reduced
axis – int or tuple/list of ints. The axis (or axes) along which to operate; must be free dimensions, not partition dimension (0); can only be the last contiguous dim(s) of the tile: [1], [1,2], [1,2,3], [1,2,3,4]
negate – if True, reduction result is multiplied by -1.0; only applicable when op is an arithmetic operator
keepdims – if True, the reduced axes are kept in the result as dimensions of size one. With this option, the result will broadcast correctly against the input tile.
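To make the interplay of axis, negate, keepdims and the data type rules concrete, here is a minimal NumPy emulation of the semantics described on this page. It is a sketch under stated assumptions, not the NKI implementation: tensor_reduce_ref and its dst_dtype argument are hypothetical names (the real instruction writes into a dst tile instead of returning a value), NumPy ufuncs stand in for the NKI operators, and rejecting negate for bitvec operators is a choice made by this sketch.

```python
import numpy as np

BITVEC_OPS = (np.bitwise_and, np.bitwise_or)  # bitvec operators named in the text


def tensor_reduce_ref(op, data, axis, negate=False, keepdims=False, dst_dtype=None):
    """NumPy emulation of the documented semantics; hypothetical helper, not an NKI API."""
    axes = (axis,) if isinstance(axis, int) else tuple(axis)
    # Legal axis fields are [1], [1, 2], [1, 2, 3], [1, 2, 3, 4].
    if not 1 <= len(axes) <= 4 or axes != tuple(range(1, len(axes) + 1)):
        raise ValueError(f"illegal axis field: {list(axes)}")
    if axes[-1] >= data.ndim:
        raise ValueError("axis field names a dimension the tile does not have")
    if data.shape[0] > 128:
        raise ValueError("partition dimension must not exceed 128")

    if any(op is bitvec for bitvec in BITVEC_OPS):
        # Bitvec path: integer data only, no casting. Rejecting negate is a sketch choice.
        if negate:
            raise ValueError("negate only applies to arithmetic operators")
        if not np.issubdtype(data.dtype, np.integer):
            raise TypeError("bitvec operators need integer input")
        out = op.reduce(data, axis=axes, keepdims=keepdims)
    else:
        # Arithmetic path: cast to float32, reduce in float32, optionally negate.
        out = op.reduce(data.astype(np.float32), axis=axes, keepdims=keepdims)
        if negate:
            out = -out
    # The result is cast to the output tile's data type (defaults to the input's here).
    return out.astype(data.dtype if dst_dtype is None else dst_dtype)


tile = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
row_sums = tensor_reduce_ref(np.add, tile, axis=[1, 2])
neg_max = tensor_reduce_ref(np.maximum, tile, axis=[1, 2], negate=True, keepdims=True)

bits = np.array([[0b1100, 0b1010], [0b1111, 0b0101]], dtype=np.int32)
common_bits = tensor_reduce_ref(np.bitwise_and, bits, axis=1)
```

Here row_sums has one element per partition because every free axis is reduced, while neg_max keeps shape (2, 1, 1) so it can broadcast against the original tile.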