This document is relevant for: Trn2, Trn3
nki.isa.activate2
- nki.isa.activate2(dst, op, data, imm0, imm1, op0, op1, relu_param=0.0, reverse0=False, reverse1=False, reduce_op=None, reduce_res=None, reduce_cmd=reduce_cmd.idle, name=None)
Perform tensor activation with configurable tensor-scalar operations and optional reduction using Scalar Engine.
Note
Available only on NeuronCore-v4 and newer.
This instruction provides a three-stage pipeline per partition:
- Tensor-scalar operations: (data op0 imm0) op1 imm1
- Activation function application via op
- Optional internal reduction controlled by reduce_op and reduce_cmd
The tensor-scalar stage supports six (op0, op1) combinations:
- (nl.multiply, nl.add): scale and bias
- (nl.multiply, nl.subtract): scale and negative bias
- (nl.multiply, nl.bypass): scale only
- (nl.add, nl.bypass): bias only
- (nl.subtract, nl.bypass): subtract only
- (nl.bypass, nl.bypass): no tensor-scalar operation
When reverse0=True, the first operation computes imm0 <op0> data instead of data <op0> imm0. Similarly, reverse1=True computes imm1 <op1> result instead of result <op1> imm1.
The Scalar Engine always performs math in float32 precision, automatically casting input data to float32 before computation and casting results to the output dtype at no additional performance cost.
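The tensor-scalar stage and its reverse flags can be sketched in plain Python for a single element. This is an illustrative model only, not NKI code: `tensor_scalar` and `BYPASS` are hypothetical names, Python's `operator` functions stand in for `nl.multiply`, `nl.add`, and `nl.subtract`, and the arithmetic runs in Python floats rather than float32.

```python
import operator

# Hypothetical stand-in for nl.bypass; the real NKI object is not used here.
BYPASS = None


def tensor_scalar(val, imm0, imm1, op0=BYPASS, op1=BYPASS,
                  reverse0=False, reverse1=False):
    """Model stage 1 for one element: (val op0 imm0) op1 imm1."""
    if op0 is not BYPASS:
        # reverse0 swaps the operand order of the first operation
        val = op0(imm0, val) if reverse0 else op0(val, imm0)
    if op1 is not BYPASS:
        # reverse1 swaps the operand order of the second operation
        val = op1(imm1, val) if reverse1 else op1(val, imm1)
    return val


# Scale and bias: (3.0 * 2.0) + 0.5
print(tensor_scalar(3.0, 2.0, 0.5, operator.mul, operator.add))    # 6.5
# Reversed subtract: 10.0 - 3.0 instead of 3.0 - 10.0
print(tensor_scalar(3.0, 10.0, 0.0, operator.sub, reverse0=True))  # 7.0
```

Operand order only matters for non-commutative operations such as subtract, which is where the reverse flags change the result.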
Constraints
Supported engines: Scalar.
- data and dst must have the same partition dimension size (at most 128).
- data and dst must have the same number of elements in the free dimensions.
- All immediates (imm0, imm1) must have the same dtype when both are tensors.
- op1 requires op0 to be set.
- reverse0 requires op0 to be set; reverse1 requires op1 to be set.
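The constraints above can be expressed as a small pre-flight check. The sketch below is a hypothetical helper, not part of NKI: `check_activate2_args` and its parameters are invented for illustration, shapes are plain tuples, and it assumes that an op counts as "set" when it is not nl.bypass.

```python
import math

MAX_PARTITIONS = 128  # partition-dimension limit stated in the constraints


def check_activate2_args(data_shape, dst_shape, op0_set, op1_set,
                         reverse0=False, reverse1=False,
                         imm0_dtype=None, imm1_dtype=None):
    """Return a list of violated constraints (empty when none are found).

    op0_set / op1_set: True when the op is not bypass (assumed meaning of "set").
    imm0_dtype / imm1_dtype: dtype name for tensor immediates, None for scalars.
    """
    errors = []
    if data_shape[0] != dst_shape[0] or data_shape[0] > MAX_PARTITIONS:
        errors.append("partition dimension mismatch or larger than 128")
    if math.prod(data_shape[1:]) != math.prod(dst_shape[1:]):
        errors.append("free-dimension element counts differ")
    if (imm0_dtype is not None and imm1_dtype is not None
            and imm0_dtype != imm1_dtype):
        errors.append("tensor immediates have different dtypes")
    if op1_set and not op0_set:
        errors.append("op1 requires op0")
    if (reverse0 and not op0_set) or (reverse1 and not op1_set):
        errors.append("reverse flag given for an op that is not set")
    return errors


print(check_activate2_args((128, 512), (128, 512), True, True))  # []
print(check_activate2_args((128, 512), (64, 512), False, True))
```

The second call reports two problems: the partition sizes differ, and op1 is set while op0 is not.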
- Parameters:
dst – the activation output tile. Supported buffers: SBUF, PSUM.
op – an activation function (see Supported Activation Functions for NKI ISA for supported functions).
data – the input tile; layout: (partition axis <= 128, free axis). Supported buffers: SBUF, PSUM.
imm0 – scalar or [N, 1] vector value for the first tensor-scalar operation. N must match the partition dimension size of data.
imm1 – scalar or [N, 1] vector value for the second tensor-scalar operation. N must match the partition dimension size of data.
op0 – first ALU operation in tensor-scalar pipeline. Must be an arithmetic operator (e.g., nl.multiply, nl.add, nl.subtract) or nl.bypass for no operation.
op1 – second ALU operation in tensor-scalar pipeline. Must be an arithmetic operator (e.g., nl.add, nl.subtract) or nl.bypass for no operation.
relu_param – scalar or vector parameter for parameterized activation functions (e.g., PReLU). Defaults to 0.0.
reverse0 – reverse operand order for op0. When True, computes imm0 <op0> data instead of data <op0> imm0. Requires op0 to be set.
reverse1 – reverse operand order for op1. When True, computes imm1 <op1> result instead of result <op1> imm1. Requires op1 to be set.
reduce_op – the reduce operation to perform on the free dimension of the activated data. Supported: nl.add, nl.maximum, nl.minimum, nl.abs_max, nl.abs_min.
reduce_res – a tile of shape (data.shape[0], 1) to hold the final state of the reduction registers. Supported buffers: SBUF, PSUM.
reduce_cmd – an enum member from nisa.reduce_cmd to control the state of the reduction registers.
Accumulator behavior:
The Scalar Engine maintains internal accumulator registers (one FP32 value per lane, 128 total) that can be controlled via the reduce_cmd parameter:
- reduce_cmd.reset_reduce: Reset accumulators to the identity value for reduce_op, then reduce the current activation results into the accumulators.
- reduce_cmd.reduce: Continue accumulating on top of existing accumulator values.
- reduce_cmd.reset: Reset accumulators only, without reducing current elements.
- reduce_cmd.idle: (default) Do not modify accumulator state.
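The four commands can be modeled for a single lane as follows. This is a rough sketch, not NKI code: `ReduceCmd` is a hypothetical mirror of nisa.reduce_cmd, `LaneAccumulator` is an invented class, reduce_op is fixed to addition (identity 0.0), and the register is assumed to start at the identity value.

```python
from enum import Enum


class ReduceCmd(Enum):
    """Hypothetical mirror of nisa.reduce_cmd, for illustration only."""
    idle = 0
    reset = 1
    reduce = 2
    reset_reduce = 3


class LaneAccumulator:
    """Models one lane's accumulator register with reduce_op = add."""
    IDENTITY = 0.0  # identity value for addition

    def __init__(self):
        self.value = self.IDENTITY  # assumed initial state

    def step(self, activated, cmd):
        """Apply one instruction's reduce_cmd to this lane's activated values."""
        if cmd in (ReduceCmd.reset, ReduceCmd.reset_reduce):
            self.value = self.IDENTITY
        if cmd in (ReduceCmd.reduce, ReduceCmd.reset_reduce):
            for x in activated:
                self.value += x
        return self.value


acc = LaneAccumulator()
print(acc.step([1.0, 2.0, 3.0], ReduceCmd.reset_reduce))  # 6.0
print(acc.step([4.0, 5.0], ReduceCmd.reduce))             # 15.0
print(acc.step([100.0], ReduceCmd.idle))                  # 15.0
print(acc.step([100.0], ReduceCmd.reset))                 # 0.0
```

A reduction over a free dimension that is processed in several chunks would therefore use reset_reduce for the first chunk and reduce for the remaining ones.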
When reduce_res is provided, an additional instruction is emitted to read the accumulator values into the output tile.
Note
The accumulator registers are shared across Scalar Engine accumulation instructions including nki.isa.activation and nki.isa.activate2.
Example
import nki
import nki.isa as nisa
import nki.language as nl
import numpy as np
import pytest

@nki.jit
def activate2_scale_bias_kernel(data_tensor):
    out = nl.ndarray(data_tensor.shape, dtype=nl.float32, buffer=nl.shared_hbm)

    # Load input from HBM to SBUF
    x = nl.ndarray(data_tensor.shape, dtype=nl.float32, buffer=nl.sbuf)
    nisa.dma_copy(dst=x, src=data_tensor)

    # activate2: multiply by 2.0, add 0.5, then apply GELU
    result = nl.ndarray(data_tensor.shape, dtype=nl.float32, buffer=nl.sbuf)
    nisa.activate2(
        dst=result,
        op=nl.gelu,
        data=x,
        imm0=2.0,
        imm1=0.5,
        op0=nl.multiply,
        op1=nl.add,
    )

    nisa.dma_copy(dst=out, src=result)
    return out
Behavior
for i in range(num_elements_per_partition):
    # Stage 1: tensor-scalar operations
    val = data[i]
    if op0 is not bypass:
        val = op0(val, imm0)  # or op0(imm0, val) if reverse0
    if op1 is not bypass:
        val = op1(val, imm1)  # or op1(imm1, val) if reverse1

    # Stage 2: activation function
    dst[i] = op(val, relu_param=relu_param)

    # Stage 3: optional reduction
    if reduce_cmd in (reset_reduce, reduce):
        accumulator = reduce_op(accumulator, dst[i])
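For a whole tile, the per-partition loop above repeats once per partition, with a vector immediate supplying one value per partition. The following plain-Python sketch illustrates this under several assumptions: `activate2_model`, `relu`, and `gelu` are hypothetical helpers, the (op0, op1) pair is fixed to scale and bias, the reduction is a sum with reset_reduce semantics, tiles are nested lists, an [N, 1] immediate is represented as a flat list of N values, and math runs in Python floats rather than float32. Whether nl.gelu matches the erf-based formula shown here is also an assumption.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    # erf-based GELU; assumed, not confirmed, to match nl.gelu
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def activate2_model(data, imm0, imm1, act):
    """Model act(data * imm0 + imm1) plus a per-partition sum reduction.

    data: list of partitions, each a list of free-dimension values.
    imm0, imm1: a scalar, or a list with one value per partition.
    Returns (dst, reduce_res) where reduce_res has shape (N, 1).
    """
    def pick(imm, p):
        # Vector immediates are indexed by partition; scalars are shared
        return imm[p] if isinstance(imm, (list, tuple)) else imm

    dst, reduce_res = [], []
    for p, row in enumerate(data):
        out_row = [act(v * pick(imm0, p) + pick(imm1, p)) for v in row]
        dst.append(out_row)
        reduce_res.append([sum(out_row)])  # reset_reduce with add
    return dst, reduce_res


data = [[1.0, -2.0], [3.0, 4.0]]
dst, res = activate2_model(data, imm0=[2.0, -1.0], imm1=0.5, act=relu)
print(dst)  # [[2.5, 0.0], [0.0, 0.0]]
print(res)  # [[2.5], [0.0]]

# Same configuration as the kernel example: scale 2.0, bias 0.5, GELU
dst_gelu, _ = activate2_model(data, imm0=2.0, imm1=0.5, act=gelu)
```

Each partition is independent: partition 0 uses a scale of 2.0 and partition 1 uses -1.0, while the scalar bias 0.5 is shared by both.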