This document is relevant for: Inf2, Trn1, Trn2
nki.isa.activation
- nki.isa.activation(op, data, *, bias=None, scale=1.0, reduce_op=None, reduce_res=None, reduce_cmd=reduce_cmd.idle, mask=None, dtype=None, **kwargs)
Apply an activation function on every element of the input tile using the Scalar Engine. The activation function is specified in the op input field (see Supported Activation Functions for NKI ISA for a list of supported activation functions and their valid input ranges).
The activation instruction can optionally multiply the input data by a scalar or vector scale and then add another vector bias before the activation function is applied, at no additional performance cost:
\[output = f_{act}(data * scale + bias)\]
When the scale is a scalar, it must be a compile-time constant. In this case, the scale is broadcast to all the elements in the input data tile. When the scale/bias is a vector, it must have the same partition axis size as the input data tile and only one element per partition. In this case, the element of scale/bias within each partition is broadcast to the elements of the input data tile in the same partition.
There are 128 registers on the Scalar Engine for storing reduction results, corresponding to the 128 partitions of the input. The Scalar Engine can reduce along the free dimensions without extra performance penalty and store the result of the reduction into these registers. The reduction is done after the activation function is applied.
\[output = f_{act}(data * scale + bias)\]
\[accu\_registers = reduce\_op(accu\_registers, reduce\_op(output, axis=<FreeAxis>))\]
These registers are shared between activation and activation_accu calls, and their state can be controlled via the reduce_cmd parameter:
- nisa.reduce_cmd.reset: resets the accumulators to zero
- nisa.reduce_cmd.idle: does not use the accumulators
- nisa.reduce_cmd.reduce: keeps accumulating on top of the current value of the accumulators
- nisa.reduce_cmd.reset_reduce: resets the accumulators, then immediately accumulates the results of the current instruction into them
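The accumulator behavior described above can be modeled in plain Python. This is a minimal sketch, not the NKI implementation: the class name AccumulatorModel, the string command names, and the use of addition as reduce_op are assumptions made for illustration, and the model uses 2 partitions instead of 128 to keep the example small.

```python
class AccumulatorModel:
    """Toy model of the per-partition reduction registers (hypothetical helper)."""

    def __init__(self, num_partitions):
        self.registers = [0.0] * num_partitions

    def activation(self, op, data, scale=1.0, bias=None, reduce_cmd="idle"):
        # output = f_act(data * scale + bias); bias holds one value per partition.
        if bias is None:
            bias = [0.0] * len(data)
        output = [[op(x * scale + b) for x in row] for row, b in zip(data, bias)]

        if reduce_cmd in ("reset", "reset_reduce"):
            self.registers = [0.0] * len(self.registers)
        if reduce_cmd in ("reduce", "reset_reduce"):
            # Reduce along the free axis, then fold into the registers
            # (addition is assumed as the reduce operation).
            for p, row in enumerate(output):
                self.registers[p] += sum(row)
        # "idle" leaves the registers untouched.
        return output


engine = AccumulatorModel(num_partitions=2)
tile = [[1.0, 2.0], [3.0, 4.0]]
identity = lambda x: x

engine.activation(identity, tile, reduce_cmd="reset_reduce")
after_first = list(engine.registers)    # row sums of the first call

engine.activation(identity, tile, reduce_cmd="reduce")
after_second = list(engine.registers)   # accumulated over both calls

engine.activation(identity, tile, reduce_cmd="idle")
after_idle = list(engine.registers)     # unchanged by the idle call

engine.activation(identity, tile, reduce_cmd="reset")
after_reset = list(engine.registers)    # cleared
```

In this model, reset_reduce followed by reduce accumulates two passes over the same tile, while idle computes the activation without touching the registers.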
We can choose to read out the current values stored in the registers by passing a tensor in the reduce_res argument. Reading out the accumulators incurs a small overhead.
Note that activation_accu can also change the state of the registers. It is the user's responsibility to ensure correct ordering. Mixing activation_accu and activation is not recommended when reduce_cmd is not set to idle.
Note that the Scalar Engine always performs the math operations in float32 precision. Therefore, the engine automatically casts the input data tile to float32 before performing the multiply/add/activate specified in the activation instruction. The engine is also capable of casting the float32 math results into another output data type specified by the dtype field at no additional performance cost. If the dtype field is not specified, the Neuron Compiler sets the output data type of the instruction to be the same as the data type of the input data tile. On the other hand, the scale parameter must have a float32 data type, while the bias parameter can be float32/float16/bfloat16.
The input data tile can be an SBUF or PSUM tile. Similarly, the instruction can write the output tile into either SBUF or PSUM, which is specified using the buffer field. If not specified, nki.language.sbuf is selected by default.
Estimated instruction cost: max(MIN_II, N) Scalar Engine cycles, where N is the number of elements per partition in data. MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
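The cost formula above can be turned into a small helper for back-of-the-envelope estimates. This is a sketch only: the function name is hypothetical, and MIN_II is taken as exactly 64 cycles even though the text describes it as approximate.

```python
MIN_II = 64  # assumed value; the text above says "roughly 64 engine cycles"


def estimate_activation_cycles(elements_per_partition, min_ii=MIN_II):
    """Estimate Scalar Engine cycles as max(MIN_II, N) for one activation instruction."""
    return max(min_ii, elements_per_partition)


small_tile = estimate_activation_cycles(16)    # bounded below by MIN_II
large_tile = estimate_activation_cycles(1024)  # grows with the free-axis size
```

Under this model, a tile with 16 elements per partition costs the same as one with 64, so very small free dimensions do not reduce the per-instruction cost.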
- Parameters:
op – an activation function (see Supported Activation Functions for NKI ISA for supported functions)
data – the input tile; layout: (partition axis <= 128, free axis)
bias – a vector with the same partition axis size as data for broadcast add (after broadcast multiply with scale)
scale – a scalar or a vector with the same partition axis size as data for broadcast multiply
reduce_op – the reduce operation to perform on the free dimension of the activation result
reduce_res – a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile. The result of sum(ReductionResult) is written in-place into the tensor.
reduce_cmd – an enum member from nisa.reduce_cmd to control the state of reduction registers
dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it will default to be the same as the data type of the input tile.
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
- Returns:
output tile of the activation instruction; layout: same as the input data tile
Example:
import numpy as np
import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa

##################################################################
# Example 1: perform exponential function on matrix a of shape (128, 1024)
##################################################################
a = nl.load(a_tensor)
activated_a = nisa.activation(op=nl.exp, data=a)
nl.store(a_act_tensor, activated_a)

##################################################################
# Example 2: perform the following operations to matrix b of shape (128, 512)
# using a single activation instruction:
# 1) compute `np.square(b * 2.0 + c)`
# 2) cast 1) results into bfloat16
##################################################################
b = nl.load(b_tensor)
# c is the bias vector: one element per partition
c = nl.load(c_tensor)
activated_b = nisa.activation(op=np.square, data=b, bias=c, scale=2.0, dtype=nl.bfloat16)
nl.store(b_act_tensor, activated_b)
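To check the result of Example 2 on the host, the same computation can be written in NumPy. This is a sketch under assumptions: the helper name activation_reference is hypothetical, the random inputs stand in for b_tensor and c_tensor, the bias is assumed to have shape (128, 1), and float16 stands in for bfloat16 because NumPy has no native bfloat16 type.

```python
import numpy as np


def activation_reference(op, data, bias=None, scale=1.0, dtype=None):
    """NumPy model of output = f_act(data * scale + bias), computed in float32."""
    x = data.astype(np.float32) * np.asarray(scale, dtype=np.float32)
    if bias is not None:
        # bias has shape (partitions, 1) and broadcasts along the free axis.
        x = x + bias.astype(np.float32)
    out = op(x)
    # Without an explicit dtype, the output keeps the input data type.
    return out.astype(dtype if dtype is not None else data.dtype)


rng = np.random.default_rng(0)
b = rng.standard_normal((128, 512)).astype(np.float32)  # stand-in for b_tensor
c = rng.standard_normal((128, 1)).astype(np.float32)    # stand-in for c_tensor

# float16 is used here only as a stand-in for bfloat16.
expected_b = activation_reference(np.square, b, bias=c, scale=2.0, dtype=np.float16)
```

Comparing device output against such a reference needs a tolerance, since the cast to a 16-bit type rounds the float32 results.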