nki.isa.activation
- nki.isa.activation(dst, op, data, bias=None, scale=1.0, reduce_op=None, reduce_res=None, reduce_cmd=reduce_cmd.idle, name=None)
Apply an activation function on every element of the input tile using Scalar Engine, with an optional scale/bias operation before the activation and an optional reduction operation after the activation in the same instruction.
The activation function is specified in the op input field (see Supported Activation Functions for NKI ISA for a list of supported activation functions and their valid input ranges). nisa.activation can optionally multiply the input data by a scalar or vector scale and then add another vector bias before the activation function is applied.
After the activation function is applied, the Scalar Engine can also reduce along the free dimensions of the activated data per lane, using the reduce_op operation. reduce_op must be nl.add.
The reduction result is then either stored into or reduced on top of a set of internal engine registers called reduce_regs (one 32-bit register per compute lane, 128 registers in total), controlled by the reduce_cmd field:
- nisa.reduce_cmd.reset: Reset reduce_regs to zero only.
- nisa.reduce_cmd.idle: Do not modify reduce_regs.
- nisa.reduce_cmd.reduce: Reduce activated data over existing values in reduce_regs.
- nisa.reduce_cmd.reset_reduce: Reset reduce_regs to zero and then store the reduction result of the activated data.
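The four reduce_cmd behaviours listed above can be sketched as a small NumPy reference model. This is only an illustration of the prose description, not the NKI API: activation_model, the string constants standing in for the nisa.reduce_cmd members, and the bias=0.0 default are all hypothetical names and choices made for this sketch, and the reduction is hard-coded to addition because reduce_op must be nl.add.

```python
import numpy as np

# Hypothetical stand-ins for the nisa.reduce_cmd enum members described above.
RESET, IDLE, REDUCE, RESET_REDUCE = "reset", "idle", "reduce", "reset_reduce"


def activation_model(op, data, reduce_regs, scale=1.0, bias=0.0, reduce_cmd=IDLE):
    """NumPy model of scale/bias -> activation -> optional free-axis add-reduction.

    data:        [P, F] tile (P = partition/lane axis, F = free axis)
    reduce_regs: [P, 1] float32 array modelling one register per lane
    Returns (activated output, updated reduce_regs).
    """
    out = op(data.astype(np.float32) * scale + bias)
    regs = reduce_regs.copy()
    if reduce_cmd in (RESET, RESET_REDUCE):
        regs[:] = 0.0  # both commands clear the registers first
    if reduce_cmd in (REDUCE, RESET_REDUCE):
        # reduce_op is add, so accumulate the per-lane sum over the free axis
        regs += out.sum(axis=1, keepdims=True, dtype=np.float32)
    return out, regs


# Accumulate a per-lane sum of exp(x) across two column chunks of one tile.
x = np.arange(8, dtype=np.float32).reshape(2, 4) / 8.0
regs = np.zeros((2, 1), dtype=np.float32)
_, regs = activation_model(np.exp, x[:, :2], regs, reduce_cmd=RESET_REDUCE)
_, regs = activation_model(np.exp, x[:, 2:], regs, reduce_cmd=REDUCE)
```

In this model the first call uses reset_reduce so stale register contents are discarded, and the second uses reduce so the running per-lane sum carries over between calls.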
nisa.activation can also emit another instruction to read out reduce_regs by passing an SBUF/PSUM tile in the reduce_res argument. The reduce_regs state can persist across multiple nisa.activation instructions without the need to be evicted back to SBUF/PSUM (the reduce_res tile).
The following is the pseudo code for nisa.activation:
\[
\begin{align}\begin{aligned}
&\text{output = op(data * scale + bias)}\\
&\text{if reduce\_cmd == nisa.reduce\_cmd.reset or reduce\_cmd == nisa.reduce\_cmd.reset\_reduce: reduce\_regs = 0}\\
&\text{result = reduce\_op(reduce\_regs, reduce\_op(output, axis=<FreeAxis>))}\\
&\text{if reduce\_cmd == nisa.reduce\_cmd.reduce or reduce\_cmd == nisa.reduce\_cmd.reset\_reduce: reduce\_regs += result}\\
&\text{if reduce\_res: reduce\_res = reduce\_regs}
\end{aligned}\end{align}
\]
All these optional operations incur no further performance penalty compared to only applying the activation function, except that reading out reduce_regs into reduce_res has a small overhead due to an extra instruction.
Memory types.
The input data tile can be an SBUF or PSUM tile. Similarly, the instruction can write the output dst tile into either SBUF or PSUM.
Data types.
Both input data and output dst tiles can be in any valid NKI data type (see Supported Data Types for more information). The Scalar Engine always performs the math operations in float32 precision. Therefore, the engine automatically casts the input data tile to float32 before performing the multiply/add/activate specified in the activation instruction. The engine is also capable of casting the float32 math results into another output data type in dst at no additional performance cost. The scale parameter must have a float32 data type, while the bias parameter can be float32/float16/bfloat16.
Layout.
The scale can either be a compile-time constant scalar or a [N, 1] vector from SBUF/PSUM. N must be the same as the partition dimension size of data. In NeuronCore-v2, the bias must be a [N, 1] vector, but starting with NeuronCore-v3, bias can either be a compile-time constant scalar or a [N, 1] vector, similar to scale.
When the scale (or similarly, bias) is a scalar, the scalar is broadcast to all the elements in the input data tile to perform the computation. When the scale (or bias) is a vector, the scale (or bias) value in each partition is broadcast along the free dimension of the data tile.
Tile size.
The partition dimension size of input data and output dst tiles must be the same and must not exceed 128. The number of elements per partition of data and dst tiles must be the same and must not exceed the physical size of each SBUF partition.
- Parameters:
dst – the activation output
op – an activation function (see Supported Activation Functions for NKI ISA for supported functions)
data – the input tile; layout: (partition axis <= 128, free axis)
scale – a scalar or a vector for multiplication
bias – a scalar (NeuronCore-v3 or newer) or a vector for addition
reduce_op – the reduce operation to perform on the free dimension of the activated data
reduce_res – a tile of shape (data.shape[0], 1) to hold the final state of reduce_regs.
reduce_cmd – an enum member from nisa.reduce_cmd to control the state of reduce_regs.
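The data-type, layout, and tile-size rules described above can be mimicked with a short NumPy sketch. It is an assumption-laden illustration rather than NKI code: scale_bias_activate, MAX_PARTITIONS, and out_dtype are hypothetical names, NumPy's float16 stands in for the reduced-precision types because NumPy has no native bfloat16, and the scalar bias shown here would correspond to NeuronCore-v3 or newer.

```python
import numpy as np

MAX_PARTITIONS = 128  # partition-dimension limit stated in the tile-size rules


def scale_bias_activate(op, data, scale=1.0, bias=0.0, out_dtype=np.float32):
    """NumPy sketch of the layout and data-type rules (not the NKI API)."""
    n = data.shape[0]
    if n > MAX_PARTITIONS:
        raise ValueError("partition dimension must not exceed 128")
    for name, operand in (("scale", scale), ("bias", bias)):
        # Each operand is either a scalar or an [N, 1] vector (one value per partition).
        if np.ndim(operand) != 0 and np.shape(operand) != (n, 1):
            raise ValueError(f"{name} must be a scalar or have shape ({n}, 1)")
    # The math is done in float32 regardless of the input dtype.
    x = data.astype(np.float32)
    s = np.asarray(scale, dtype=np.float32)
    b = np.asarray(bias, dtype=np.float32)
    # A scalar broadcasts to every element; an [N, 1] vector broadcasts along the free axis.
    y = op(x * s + b)
    # The float32 result is cast to the requested output dtype on the way out.
    return y.astype(out_dtype)


data = np.ones((4, 3), dtype=np.float16)
scale = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)  # [N, 1] vector
dst = scale_bias_activate(np.tanh, data, scale=scale, bias=0.5, out_dtype=np.float16)
```

Each row of dst holds a single repeated value, because the per-partition scale is broadcast along the free dimension while the scalar bias is applied to every element.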