This document is relevant for: Inf2
, Trn1
, Trn2
nki.isa.activation_reduce#
- nki.isa.activation_reduce(op, data, *, reduce_op, reduce_res, bias=None, scale=1.0, mask=None, dtype=None, **kwargs)[source]#
Perform the same computation as nisa.activation and also a reduction along the free dimension of the nisa.activation result using the Scalar Engine. The result of the reduction is stored in reduce_res.

This API is equivalent to calling nisa.activation with reduce_cmd=nisa.reduce_cmd.reset_reduce and passing in reduce_res. This API is kept for backward compatibility; we recommend using nisa.activation moving forward.

Refer to nisa.activation for the semantics of op/data/bias/scale.

In addition to the nisa.activation computation, this API also performs a reduction along the free dimension(s) of the nisa.activation result, at a small additional performance cost. The reduction result is written in-place into reduce_res, which must be a SBUF/PSUM tile with the same partition axis size as the input tile data and one element per partition. On NeuronCore-v2, reduce_op can only be an addition, np.add or nl.add.

There are 128 registers on the Scalar Engine for storing reduction results, corresponding to the 128 partitions of the input. These registers are shared between activation and activation_accu calls. This instruction first resets those registers to zero, performs the reduction on the values after the activation function is applied, stores the results into the registers, then reads the reduction results out of the registers and finally writes them into reduce_res.

Note that nisa.activation can also change the state of these registers. It is the user's responsibility to ensure correct ordering; the best practice is not to mix activation_reduce and activation calls.

The reduction axis is not configurable in this API. If the input tile has multiple free axes, the API reduces across all of them.
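For example, the all-free-axes behavior can be sketched in plain NumPy (a reference sketch of the semantics only, not NKI code):

```python
import numpy as np

# A tile with partition axis 4 and two free axes of sizes (2, 3):
# the reduction collapses both free axes into one element per partition.
tile = np.ones((4, 2, 3), dtype=np.float32)
reduce_res = tile.reshape(tile.shape[0], -1).sum(axis=1, keepdims=True)
print(reduce_res.shape)  # (4, 1); every entry is 2 * 3 = 6.0
```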
Mathematically, this API performs the following computation:
\[\begin{split}output = f_{act}(data * scale + bias) \\ reduce\_res = reduce\_op(output, axis=<FreeAxis>)\end{split}\]

Estimated instruction cost:

max(MIN_II, N) + MIN_II Scalar Engine cycles, where N is the number of elements per partition in data, and MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
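The computation above can be modeled in plain NumPy (a hypothetical reference model of the semantics, not the NKI API itself; np.exp stands in for the activation function):

```python
import numpy as np

def activation_reduce_ref(op, data, reduce_op=np.add, bias=0.0, scale=1.0):
    """NumPy reference model of the activation_reduce semantics.

    data: array with the partition axis first; bias/scale broadcast
    per partition. Returns (output, reduce_res), where reduce_res has
    one element per partition, reduced across all free axes.
    """
    # Broadcast multiply by scale, broadcast add of bias, then the
    # activation function:
    output = op(data * scale + bias)
    # Reduce along every free axis, producing shape (partitions, 1):
    flat = output.reshape(output.shape[0], -1)
    reduce_res = reduce_op.reduce(flat, axis=1, keepdims=True)
    return output, reduce_res

data = np.array([[0.0, 1.0], [2.0, 3.0]], dtype=np.float32)
out, red = activation_reduce_ref(np.exp, data)
# red[p, 0] == exp(data[p, 0]) + exp(data[p, 1]) for each partition p
```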
- Parameters:
  - op – an activation function (see Supported Activation Functions for NKI ISA for supported functions)
  - data – the input tile; layout: (partition axis <= 128, free axis)
  - reduce_op – the reduce operation to perform on the free dimension of the activation result
  - reduce_res – a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile. The result of sum(ReductionResult) is written in-place into the tensor.
  - bias – a vector with the same partition axis size as data for broadcast add (after broadcast multiply with scale)
  - scale – a scalar or a vector with the same partition axis size as data for broadcast multiply
  - dtype – (optional) data type to cast the output to (see Supported Data Types for more information); if not specified, it defaults to the same data type as the input tile
  - mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
- Returns:
  output tile of the activation instruction; layout: same as the input data tile
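The reset-then-reduce register behavior described above can also be sketched in NumPy (a hypothetical model; regs stands in for the per-partition Scalar Engine registers and is not part of the NKI API):

```python
import numpy as np

def activation_reduce_model(op, data, regs, scale=1.0, bias=0.0):
    """Model of the reset_reduce behavior: the per-partition registers
    are zeroed first, so accumulated state from an earlier activation
    call does not leak into reduce_res."""
    regs[:] = 0.0                                 # step 1: reset registers
    output = op(data * scale + bias)              # activation computation
    regs += output.reshape(output.shape[0], -1).sum(axis=1)  # step 2: reduce
    reduce_res = regs.copy().reshape(-1, 1)       # step 3: read out
    return output, reduce_res

regs = np.full(2, 99.0)        # stale state left by an earlier call
data = np.zeros((2, 3))
_, red = activation_reduce_model(np.exp, data, regs)
# The stale 99.0 was cleared: each partition holds exp(0) * 3 = 3.0.
```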