This document is relevant for: Inf2, Trn1, Trn2
nki.isa.tensor_tensor
- nki.isa.tensor_tensor(data1, data2, op, *, dtype=None, mask=None, **kwargs)
Perform an element-wise operation on two input tiles using the Vector Engine or GpSimd Engine. The two tiles must have the same partition axis size and the same number of elements per partition.
The element-wise operator is specified using the op field and can be any binary operator supported by NKI (see Supported Math Operators for NKI ISA for details) that runs on the Vector Engine, or it can be np.power/nl.power, which runs on the GpSimd Engine. For bitvec operators, the input/output data types must be integer types, and the Vector Engine treats all input elements as bit patterns without any data type casting. For arithmetic operators, there is no restriction on the input/output data types, but the engine automatically casts input data types to float32 and performs the element-wise operation in float32 math. The float32 results are cast to the target data type specified in the dtype field before being written into the output tile. If the dtype field is not specified, it defaults to the data type of data1 or data2, whichever has the higher precision.
Since the GpSimd Engine cannot access PSUM, the input and output tiles cannot be in PSUM if op is np.power/nl.power (see NeuronCore-v2 Compute Engines for details). Otherwise, the output tile can be in either SBUF or PSUM. However, the two input tiles, data1 and data2, cannot both reside in PSUM. The three legal cases are:
- Both data1 and data2 are in SBUF.
- data1 is in SBUF, while data2 is in PSUM.
- data1 is in PSUM, while data2 is in SBUF.
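The placement rules above can be sketched as a small checker. This is a hypothetical helper for illustration only, not part of the NKI API:

```python
# Hypothetical helper (not part of the NKI API): checks whether a given
# placement of the two input tiles follows the PSUM/SBUF rules above.
def tensor_tensor_placement_ok(data1_buf: str, data2_buf: str,
                               op_on_gpsimd: bool) -> bool:
    """data1_buf/data2_buf are "sbuf" or "psum";
    op_on_gpsimd is True when op is np.power/nl.power."""
    if op_on_gpsimd:
        # The GpSimd Engine cannot access PSUM at all.
        return data1_buf == "sbuf" and data2_buf == "sbuf"
    # Vector Engine: at most one of the two inputs may be in PSUM.
    return not (data1_buf == "psum" and data2_buf == "psum")
```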
Note: if you need broadcasting capability in the free dimension for either input tile, consider using the nki.isa.tensor_scalar API instead, which generally has better performance than nki.isa.tensor_tensor.
Estimated instruction cost:
See the table below for tensor_tensor performance when it runs on the Vector Engine.

Cost (Vector Engine Cycles) / Condition:
- max(MIN_II, N): one input tile is in PSUM and the other is in SBUF
- max(MIN_II, N): all of the below:
  - both input tiles are in SBUF,
  - input/output data types are all bfloat16,
  - the operator is add, multiply or subtract,
  - input tensor data is contiguous along the free dimension (that is, stride in each partition is 1 element)
- max(MIN_II, 2N): otherwise

where:
- N is the number of elements per partition in data1/data2.
- MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
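The cost model above can be written out as a short sketch. The cycle counts are documentation estimates, not measured values, and the helper below is illustrative only:

```python
# Sketch of the cost model above. MIN_II is roughly 64 engine cycles
# per the documentation; actual values depend on the hardware.
MIN_II = 64

def tensor_tensor_cost(n_elems_per_partition: int, fast_path: bool) -> int:
    """fast_path covers the max(MIN_II, N) rows: one input in PSUM, or the
    all-SBUF bfloat16 add/multiply/subtract contiguous case."""
    n = n_elems_per_partition
    return max(MIN_II, n if fast_path else 2 * n)
```

For example, a 512-element-per-partition tile on the fast path costs roughly 512 cycles, while the general case costs roughly 1024; very small tiles are bounded below by MIN_II.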
- Parameters:
data1 – lhs input operand of the element-wise operation
data2 – rhs input operand of the element-wise operation
op – a binary math operator (see Supported Math Operators for NKI ISA for supported operators)
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
dtype – (optional) data type to cast the output to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tiles, or whichever input type has the higher precision (see NKI Type Promotion for more information)
- Returns:
an output tile of the element-wise operation
Example:
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
from neuronxcc.nki.typing import tensor

...

##################################################################
# Example 1: add two tiles, a and b, of the same
# shape (128, 512) element-wise and get
# the addition result in tile c
##################################################################
a: tensor[128, 512] = nl.load(a_tensor)
b: tensor[128, 512] = nl.load(b_tensor)
c: tensor[128, 512] = nisa.tensor_tensor(a, b, op=nl.add)
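For intuition, the arithmetic-operator semantics described above (cast inputs to float32, compute, then cast to the output dtype) can be emulated in plain NumPy. This sketch is for illustration only; it models the casting behavior, not the hardware engines or their placement rules:

```python
import numpy as np

def emulate_tensor_tensor(data1, data2, op, dtype=None):
    # Arithmetic ops: the engine computes in float32 math.
    result = op(data1.astype(np.float32), data2.astype(np.float32))
    # Default output dtype: whichever input dtype has the higher precision.
    out_dtype = dtype if dtype is not None else np.promote_types(data1.dtype,
                                                                 data2.dtype)
    return result.astype(out_dtype)
```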