This document is relevant for: Trn2, Trn3

nki.isa.activate2

nki.isa.activate2(dst, op, data, imm0, imm1, op0, op1, relu_param=0.0, reverse0=False, reverse1=False, reduce_op=None, reduce_res=None, reduce_cmd=reduce_cmd.idle, name=None)

Perform tensor activation with configurable tensor-scalar operations and an optional reduction on the Scalar Engine.

Note

Available only on NeuronCore-v4 and newer.

This instruction provides a three-stage pipeline per partition:

  1. Tensor-scalar operations: (data op0 imm0) op1 imm1

  2. Activation function application via op

  3. Optional internal reduction controlled by reduce_op and reduce_cmd

The tensor-scalar stage supports six (op0, op1) combinations:

  • (nl.multiply, nl.add) — scale and bias

  • (nl.multiply, nl.subtract) — scale and negative bias

  • (nl.multiply, nl.bypass) — scale only

  • (nl.add, nl.bypass) — bias only

  • (nl.subtract, nl.bypass) — subtract only

  • (nl.bypass, nl.bypass) — no tensor-scalar operation

When reverse0=True, the first operation computes imm0 <op0> data instead of data <op0> imm0. Similarly, reverse1=True computes imm1 <op1> result instead of result <op1> imm1, where result is the output of the first operation.
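The operand ordering described above can be sketched as a host-side reference model in plain Python. This is not NKI code: the constants MULTIPLY, ADD, SUBTRACT and BYPASS are hypothetical stand-ins for the nl.* operators, and tensor_scalar is an illustrative helper that works on a single element.

```python
import operator

# Hypothetical stand-ins for nl.multiply, nl.add, nl.subtract and nl.bypass.
MULTIPLY, ADD, SUBTRACT, BYPASS = operator.mul, operator.add, operator.sub, None

# The six (op0, op1) pairs listed in the documentation.
SUPPORTED_PAIRS = {
    (MULTIPLY, ADD),
    (MULTIPLY, SUBTRACT),
    (MULTIPLY, BYPASS),
    (ADD, BYPASS),
    (SUBTRACT, BYPASS),
    (BYPASS, BYPASS),
}


def tensor_scalar(x, imm0, imm1, op0, op1, reverse0=False, reverse1=False):
    """Apply the tensor-scalar stage to one element x."""
    if (op0, op1) not in SUPPORTED_PAIRS:
        raise ValueError("unsupported (op0, op1) combination")
    val = x
    if op0 is not BYPASS:
        # reverse0 swaps the operands: imm0 <op0> data
        val = op0(imm0, val) if reverse0 else op0(val, imm0)
    if op1 is not BYPASS:
        # reverse1 swaps the operands: imm1 <op1> result
        val = op1(imm1, val) if reverse1 else op1(val, imm1)
    return val
```

Reversal only matters for non-commutative operators: with op0 set to subtract and reverse0=True, the model computes imm0 - data rather than data - imm0.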

The Scalar Engine always computes in float32. Input data is cast to float32 before computation and results are cast to the dtype of dst, both at no additional performance cost.
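As a rough illustration of what float32 internal precision means for a lower-precision tile, the sketch below emulates a float16 input and float16 output in plain Python using the standard struct module. The helper names are hypothetical, and the model only assumes what the paragraph above states: upcast on input, float32 intermediates, and one final cast to the output dtype.

```python
import struct


def round_to_f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_f16(x: float) -> float:
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scale_bias_f16(x_f16: float, scale: float, bias: float) -> float:
    """Model (x * scale) + bias with float16 input/output and float32 math."""
    # Upcast: every float16 value is exactly representable in float32.
    val = round_to_f32(x_f16)
    # Intermediates stay in float32; nothing is rounded to float16 yet.
    val = round_to_f32(val * round_to_f32(scale))
    val = round_to_f32(val + round_to_f32(bias))
    # The only rounding to the output dtype happens once, at the end.
    return round_to_f16(val)
```

In this model the intermediate product is never rounded to float16, so the result carries a single output rounding rather than one per operation.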

Constraints

  • Supported engines: Scalar.

  • data and dst must have the same partition dimension size (at most 128).

  • data and dst must have the same number of elements in the free dimensions.

  • When imm0 and imm1 are both tensors, they must have the same dtype.

  • op1 requires op0 to be set.

  • reverse0 requires op0 to be set; reverse1 requires op1 to be set.
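The shape and operator constraints above can be checked on the host before launching a kernel. The following is a minimal sketch, not part of the NKI API: find_violations is a hypothetical helper, shapes are plain tuples with the partition dimension first, and an operator that is "not set" is represented by None (whether nl.bypass counts as set is not specified here, so that mapping is an assumption).

```python
from math import prod


def find_violations(data_shape, dst_shape, op0=None, op1=None,
                    reverse0=False, reverse1=False):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    if data_shape[0] != dst_shape[0]:
        problems.append("data and dst partition sizes differ")
    if data_shape[0] > 128:
        problems.append("partition size exceeds 128")
    # Free dimensions may be shaped differently as long as the counts match.
    if prod(data_shape[1:]) != prod(dst_shape[1:]):
        problems.append("data and dst free-dimension element counts differ")
    if op1 is not None and op0 is None:
        problems.append("op1 requires op0 to be set")
    if reverse0 and op0 is None:
        problems.append("reverse0 requires op0 to be set")
    if reverse1 and op1 is None:
        problems.append("reverse1 requires op1 to be set")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, for example find_violations((128, 512), (64, 512)) reports only the partition mismatch.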

Parameters:
  • dst – the activation output tile. Supported buffers: SBUF, PSUM.

  • op – an activation function (see Supported Activation Functions for NKI ISA for supported functions).

  • data – the input tile; layout: (partition axis <= 128, free axis). Supported buffers: SBUF, PSUM.

  • imm0 – scalar or [N, 1] vector value for the first tensor-scalar operation. N must match the partition dimension size of data.

  • imm1 – scalar or [N, 1] vector value for the second tensor-scalar operation. N must match the partition dimension size of data.

  • op0 – first ALU operation in tensor-scalar pipeline. Must be an arithmetic operator (e.g., nl.multiply, nl.add, nl.subtract) or nl.bypass for no operation.

  • op1 – second ALU operation in tensor-scalar pipeline. Must be an arithmetic operator (e.g., nl.add, nl.subtract) or nl.bypass for no operation.

  • relu_param – scalar or vector parameter for parameterized activation functions (e.g., PReLU). Defaults to 0.0.

  • reverse0 – reverse operand order for op0. When True, computes imm0 <op0> data instead of data <op0> imm0. Requires op0 to be set.

  • reverse1 – reverse operand order for op1. When True, computes imm1 <op1> result instead of result <op1> imm1. Requires op1 to be set.

  • reduce_op – the reduce operation to perform on the free dimension of the activated data. Supported: nl.add, nl.maximum, nl.minimum, nl.abs_max, nl.abs_min.

  • reduce_res – a tile of shape (data.shape[0], 1) to hold the final state of the reduction registers. Supported buffers: SBUF, PSUM.

  • reduce_cmd – an enum member from nisa.reduce_cmd to control the state of the reduction registers.

Accumulator behavior:

The Scalar Engine maintains internal accumulator registers (one FP32 value per lane, 128 total) that can be controlled via the reduce_cmd parameter:

  • reduce_cmd.reset_reduce: Reset accumulators to the identity value for reduce_op, then reduce the current activation results into the accumulators.

  • reduce_cmd.reduce: Continue accumulating on top of existing accumulator values.

  • reduce_cmd.reset: Reset accumulators only, without reducing current elements.

  • reduce_cmd.idle: (default) Do not modify accumulator state.

When reduce_res is provided, an additional instruction is emitted to read the accumulator values into the output tile.

Note

The accumulator registers are shared across Scalar Engine accumulation instructions including nki.isa.activation and nki.isa.activate2.
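The reduce_cmd semantics can be modeled for a single lane in plain Python. This is a sketch under stated assumptions rather than a description of the hardware: ReduceCmd is a hypothetical mirror of nisa.reduce_cmd, only add, maximum and minimum are modeled, the identity values (0, -inf, +inf) are the usual mathematical ones, reduce_cmd.reset is assumed to reset to the same identity as reset_reduce, and the initial register value of 0.0 is arbitrary.

```python
from enum import Enum


class ReduceCmd(Enum):
    """Hypothetical mirror of nisa.reduce_cmd."""
    idle = 0
    reset = 1
    reduce = 2
    reset_reduce = 3


# Assumed identity values and reducers for three of the supported reduce ops.
IDENTITY = {"add": 0.0, "maximum": float("-inf"), "minimum": float("inf")}
REDUCERS = {"add": lambda a, b: a + b, "maximum": max, "minimum": min}


class LaneAccumulator:
    """One accumulator register; its state persists across instructions."""

    def __init__(self):
        self.value = 0.0  # arbitrary starting state for the model

    def step(self, activated, reduce_op, cmd):
        """Process one instruction's activated values for this lane."""
        if cmd in (ReduceCmd.reset, ReduceCmd.reset_reduce):
            self.value = IDENTITY[reduce_op]
        if cmd in (ReduceCmd.reduce, ReduceCmd.reset_reduce):
            for v in activated:
                self.value = REDUCERS[reduce_op](self.value, v)
        # idle falls through both branches and leaves the state untouched
        return self.value
```

A reduction over a free dimension that is processed in chunks would then use reset_reduce for the first chunk and reduce for the rest. Because the state persists between calls, any other accumulating instruction issued in between would also affect the result, which is the sharing behavior the note above describes.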

Example

import nki
import nki.isa as nisa
import nki.language as nl

@nki.jit
def activate2_scale_bias_kernel(data_tensor):
    out = nl.ndarray(data_tensor.shape, dtype=nl.float32, buffer=nl.shared_hbm)

    # Load input from HBM to SBUF
    x = nl.ndarray(data_tensor.shape, dtype=nl.float32, buffer=nl.sbuf)
    nisa.dma_copy(dst=x, src=data_tensor)

    # activate2: multiply by 2.0, add 0.5, then apply GELU
    result = nl.ndarray(data_tensor.shape, dtype=nl.float32, buffer=nl.sbuf)
    nisa.activate2(
        dst=result,
        op=nl.gelu,
        data=x,
        imm0=2.0,
        imm1=0.5,
        op0=nl.multiply,
        op1=nl.add,
    )

    nisa.dma_copy(dst=out, src=result)
    return out
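To check the kernel above against expected values, a host-side reference can be written without any NKI dependency. The sketch below assumes that nl.gelu is the exact, erf-based GELU; if the hardware uses an approximation (for example a tanh form), the comparison needs a looser tolerance. The function names are hypothetical.

```python
import math


def gelu_exact(v: float) -> float:
    """Erf-based GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def activate2_scale_bias_reference(values, scale=2.0, bias=0.5):
    """Reference for the example: gelu(x * scale + bias) per element."""
    return [gelu_exact(x * scale + bias) for x in values]
```

Since the device computes in float32 and this reference uses Python doubles, results are best compared with a tolerance rather than for exact equality.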

Behavior

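# Applied independently to each partition.
if reduce_cmd in (reset_reduce, reset):
    accumulator = identity(reduce_op)   # reset before any element is reduced

for i in range(num_elements_per_partition):
    # Stage 1: tensor-scalar operations
    val = data[i]
    if op0 is not bypass:
        val = op0(imm0, val) if reverse0 else op0(val, imm0)
    if op1 is not bypass:
        val = op1(imm1, val) if reverse1 else op1(val, imm1)

    # Stage 2: activation function
    dst[i] = op(val, relu_param=relu_param)

    # Stage 3: optional reduction
    if reduce_cmd in (reset_reduce, reduce):
        accumulator = reduce_op(accumulator, dst[i])

if reduce_res is not None:
    reduce_res = accumulator            # read back the final accumulator state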
