This document is relevant for: Inf2, Trn1, Trn1n

nki.isa.tensor_scalar#

nki.isa.tensor_scalar(data, op0, operand0, reverse0=False, op1=None, operand1=None, reverse1=False, dtype=None, mask=None, **kwargs)[source]#

Apply up to two math operators to the input data tile by broadcasting scalar/vector operands along the free dimension, using the Vector or Scalar Engine: (data <op0> operand0) <op1> operand1.

The input data tile can be an SBUF or PSUM tile. Both operand0 and operand1 can be SBUF or PSUM tiles of shape (data.shape[0], 1), i.e., vectors, or compile-time constant scalars.

op1 and operand1 are optional, but must both be None (the default values) when unused. Note that performing one operator has the same performance cost as performing two operators in a single instruction.

When the operators are non-commutative (e.g., subtract), the ordering of the inputs to each operator can be reversed through:

  • reverse0 = True: tmp_res = operand0 <op0> data

  • reverse1 = True: final_res = operand1 <op1> tmp_res
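As a sketch of these orderings, the equivalent computation can be written in plain NumPy (hypothetical input values; on-device, the same computation is issued through nisa.tensor_scalar rather than NumPy):

```python
import numpy as np

# Hypothetical input tile, for illustration of operand ordering only.
data = np.array([[10.0, 20.0], [30.0, 40.0]], dtype=np.float32)

# Default ordering: (data <op0> operand0) <op1> operand1
res_default = (data - 5.0) - 1.0

# reverse0=True: tmp_res = operand0 <op0> data
res_reverse0 = (5.0 - data) - 1.0

# reverse1=True: final_res = operand1 <op1> tmp_res
res_reverse1 = 1.0 - (data - 5.0)
```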

The tensor_scalar instruction supports two types of operators: 1) bitvec operators (e.g., bitwise_and) and 2) arithmetic operators (e.g., add). See Supported Math Operators for the full list of supported operators. The two operators, op0 and op1, in a tensor_scalar instruction must be of the same type (both bitvec or both arithmetic). If bitvec operators are used, the tensor_scalar instruction must run on Vector Engine. Also, the input/output data types must be integer types, and input elements are treated as bit patterns without any data type casting.
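A bitvec tensor_scalar call follows the same calling pattern as the arithmetic examples further below; this sketch uses NumPy's bitwise operators on a hypothetical integer tile to illustrate the bit-pattern semantics (integer input/output types only):

```python
import numpy as np

# Hypothetical integer input tile; bitvec operators treat each element
# as a raw bit pattern, with no data type casting.
data = np.array([[0b1100, 0b0011]], dtype=np.int32)

# Equivalent of nisa.tensor_scalar(data, np.bitwise_and, 0b1001,
#                                  op1=np.bitwise_or, operand1=0b0100)
res = np.bitwise_or(np.bitwise_and(data, 0b1001), 0b0100)
```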

If arithmetic operators are used, the tensor_scalar instruction can run on Vector or Scalar Engine. However, the Scalar Engine only supports a subset of the operator combinations:

  • op0=np.multiply and op1=np.add

  • op0=np.multiply and op1=None

  • op0=np.add and op1=None

Currently, the compiler instruction scheduler selects the engine automatically based on the operator types.

Arithmetic operators impose no restriction on the input/output data types, but the engine automatically casts input data types to float32 and performs the operators in float32 math. The float32 computation results are cast to the target data type specified in the dtype field before being written to the output tile, at no additional performance cost. If the dtype field is not specified, it defaults to the data type of the input tile.
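The implied casting behavior can be sketched in plain NumPy (hypothetical float16 input; on-device, the upcast, float32 math, and downcast all happen inside the instruction):

```python
import numpy as np

# Hypothetical float16 input tile; the engine upcasts to float32
# before applying the operators.
data = np.array([[1.5, 2.5]], dtype=np.float16)

# Equivalent of nisa.tensor_scalar(data, np.multiply, 3.0,
#                                  op1=np.add, operand1=1.0)
# with dtype unspecified, so the output keeps the input's float16 type.
res_f32 = data.astype(np.float32) * 3.0 + 1.0   # math performed in float32
out = res_f32.astype(np.float16)                # cast back to target dtype
```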

Estimated instruction cost:

N Vector or Scalar Engine cycles, depending on which engine the compiler assigns the instruction to, where N is the number of elements per partition in data.

Parameters:
  • data – the input tile

  • op0 – the first math operator used with operand0 (see Supported Math Operators for supported operators)

  • operand0 – a scalar constant or a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile

  • reverse0 – reverse ordering of inputs to op0; if false, operand0 is the rhs of op0; if true, operand0 is the lhs of op0

  • op1 – the second math operator used with operand1 (see Supported Math Operators for supported operators); this operator is optional

  • operand1 – a scalar constant or a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile

  • reverse1 – reverse ordering of inputs to op1; if false, operand1 is the rhs of op1; if true, operand1 is the lhs of op1

  • dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tile.

  • mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)

Returns:

an output tile containing the result of the (data <op0> operand0) <op1> operand1 computation

Example:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
...
  ##################################################################
  # Example 1: subtract 1.0 from all elements of tile a of
  # shape (128, 512) and get the output tile in b
  ##################################################################
  i_p = nl.arange(128)[:, None]
  i_f = nl.arange(512)[None, :]
  b = nisa.tensor_scalar(a[i_p, i_f], np.subtract, 1.0)

  ##################################################################
  # Example 2: broadcast 1.0 into a shape of (128, 512) and compute
  # 1.0 - c elementwise to get output tile d
  ##################################################################
  i_p = nl.arange(128)[:, None]
  i_f = nl.arange(512)[None, :]
  d = nisa.tensor_scalar(c[i_p, i_f], np.subtract, 1.0, reverse0=True)

  ##################################################################
  # Example 3: broadcast multiply tile e with vector f and
  # then broadcast add with scalar 2.5;
  # tile e has a shape of (64, 1024) and vector f has a shape of (64, 1)
  ##################################################################
  i_p_ef = nl.arange(64)[:, None]
  i_f_e = nl.arange(1024)[None, :]
  i_f_f = nl.arange(1)[None, :]
  g = nisa.tensor_scalar(e[i_p_ef, i_f_e], op0=np.multiply, operand0=f[i_p_ef, i_f_f], op1=np.add, operand1=2.5)  
