This document is relevant for: Trn2, Trn3

nki.isa.tensor_tensor

nki.isa.tensor_tensor(dst, data1, data2, op, engine=engine.unknown, name=None)

Perform an element-wise operation on two input tiles using the Vector Engine or GpSimd Engine. The two tiles must have the same partition axis size and the same number of elements per partition.

The element-wise operator is specified using the op field. Valid choices for op:

Any supported binary operator that runs on the Vector Engine. (See Supported Math Operators for NKI ISA for details.)
nl.power, which runs on the GpSimd Engine.
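
As a rough mental model of the semantics described above, the following pure-Python sketch applies a binary operator element by element to two tiles stored as nested lists (one inner list per partition). It is not NKI code and does not run on a NeuronCore. The name `tensor_tensor_ref` and the list-of-lists tile layout are hypothetical, chosen only for illustration.

```python
import operator


def tensor_tensor_ref(data1, data2, op):
    """Hypothetical reference model: apply `op` element-wise to two tiles.

    Each tile is a list of partitions; each partition is a list of elements.
    """
    if len(data1) != len(data2):
        raise ValueError("tiles must have the same partition axis size")
    for row1, row2 in zip(data1, data2):
        if len(row1) != len(row2):
            raise ValueError(
                "tiles must have the same number of elements per partition"
            )
    # Pair up partitions, then pair up elements within each partition.
    return [
        [op(a, b) for a, b in zip(row1, row2)]
        for row1, row2 in zip(data1, data2)
    ]


a = [[1, 2, 3], [4, 5, 6]]          # 2 partitions, 3 elements each
b = [[10, 20, 30], [40, 50, 60]]
added = tensor_tensor_ref(a, b, operator.add)
powered = tensor_tensor_ref(a, [[2, 2, 2], [2, 2, 2]], operator.pow)
```

Here `operator.add` and `operator.pow` merely stand in for operators such as nl.add and nl.power; the real instruction writes its result into dst rather than returning a new tile.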

For bitvec operators, the input/output data types must be integer types, and the Vector Engine treats all input elements as bit patterns without any data type casting. For arithmetic operators, the behavior depends on the data types:

Float types: The engine casts input data types to float32 and performs the element-wise operation in float32 math. The float32 results are cast to dst.dtype at no additional performance cost.
int32/uint32 types: When all input/output tiles are int32 or uint32, the operation defaults to GpSimd Engine, which uses native integer arithmetic. This ensures exact results for all 32-bit integer values. You may override this by passing engine=nki.isa.engine.vector explicitly.
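
To see why native integer arithmetic matters, consider what would happen if 32-bit integers were routed through float32, which has only a 24-bit significand. The sketch below emulates float32 rounding with Python's standard `struct` module. It assumes, for illustration only, that a float32 path rounds each input and the result to float32; it does not model the actual hardware datapath, and `to_float32` and `add_via_float32` are hypothetical helper names.

```python
import struct


def to_float32(x):
    """Round a Python number to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def add_via_float32(a, b):
    """Assumed float32 path: cast inputs, add, round the result to float32."""
    return int(to_float32(to_float32(a) + to_float32(b)))


a, b = 2**24 + 1, 1              # 16777217 is not representable in float32
exact = a + b                    # native integer arithmetic: 16777218
approx = add_via_float32(a, b)   # float32 rounding loses the low bits
```

Under these assumptions the two results differ for operands above 2**24, while small integers such as 3 + 4 come out identical on both paths.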

Since GpSimd Engine cannot access PSUM, the input/output tiles cannot be in PSUM if op is nl.power. Similarly, the automatic GpSimd dispatch for int32/uint32 falls back to Vector Engine when any operand resides in PSUM. (See NeuronCore-v2 Compute Engines for details.)
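
The dispatch rules above can be summarized as a small decision function. This is only a sketch of the prose, not the compiler's actual logic: `pick_engine`, the string labels, and the "compiler-selected" return value are hypothetical, and the sketch assumes an arithmetic operator (bitvec operators are not modeled).

```python
INT32_TYPES = {"int32", "uint32"}


def pick_engine(op, dtypes, buffers, engine="unknown"):
    """Hypothetical summary of the engine dispatch rules in the text.

    op      -- operator name as a string, e.g. "add" or "power"
    dtypes  -- dtypes of (dst, data1, data2) as strings
    buffers -- locations of (dst, data1, data2): "sbuf" or "psum"
    engine  -- "vector", "gpsimd", or "unknown" (no explicit request)
    """
    any_psum = "psum" in buffers
    if op == "power":
        # nl.power only runs on GpSimd, which cannot access PSUM.
        if any_psum:
            raise ValueError("nl.power tiles cannot be in PSUM")
        return "gpsimd"
    if engine == "gpsimd" and any_psum:
        raise ValueError("GpSimd Engine cannot access PSUM")
    if engine != "unknown":
        return engine  # explicit override by the caller
    if all(d in INT32_TYPES for d in dtypes):
        # Automatic GpSimd dispatch, with fallback when PSUM is involved.
        return "vector" if any_psum else "gpsimd"
    return "compiler-selected"


ints = ("int32", "uint32", "int32")
sbufs = ("sbuf", "sbuf", "sbuf")
default_int = pick_engine("add", ints, sbufs)
psum_int = pick_engine("add", ints, ("psum", "sbuf", "sbuf"))
forced_vector = pick_engine("add", ints, sbufs, engine="vector")
```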

Otherwise, the output tile can be in either SBUF or PSUM. However, the two input tiles, data1 and data2, cannot both reside in PSUM. The three legal cases are:

Both data1 and data2 are in SBUF.
data1 is in SBUF, while data2 is in PSUM.
data1 is in PSUM, while data2 is in SBUF.
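
The three legal cases can be written as a lookup. The helper below is a hypothetical illustration (not part of NKI) that a kernel author might use as a pre-flight check; buffer locations are plain strings.

```python
# The three legal (data1, data2) placements listed above.
LEGAL_INPUT_PLACEMENTS = {
    ("sbuf", "sbuf"),
    ("sbuf", "psum"),
    ("psum", "sbuf"),
}


def inputs_placement_ok(data1_buf, data2_buf):
    """Return True if the input placement is one of the three legal cases."""
    return (data1_buf, data2_buf) in LEGAL_INPUT_PLACEMENTS


# Enumerate every combination to show that only both-in-PSUM is rejected.
results = {
    (b1, b2): inputs_placement_ok(b1, b2)
    for b1 in ("sbuf", "psum")
    for b2 in ("sbuf", "psum")
}
```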

Note: if you need broadcasting along the free dimension for either input tile, consider using the nki.isa.tensor_scalar API instead, which generally performs better than nki.isa.tensor_tensor.

Tensor indirection.

On NeuronCore-v4 and later, dst and data1 support tensor indirection (gather/scatter) by passing a view created with .indirect(index). data2 does not support tensor indirection. When tensor indirection is used, the operation must run on the Vector or GpSimd engine.

When operands are manually allocated, their base partitions must satisfy:

the index of every .indirect() view starts on a quadrant boundary (a multiple of 32);
if data1 uses .indirect(), data1 starts on the same partition as its index;
if dst uses .indirect() and data1 is in SBUF, dst’s index starts on the same partition as data1;
if dst uses .indirect() and data1 is in PSUM and uses .indirect(), dst’s index starts on the same partition as data1’s index.
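
The base-partition constraints above can be checked mechanically. The following sketch encodes them as a validator over integer partition numbers; `check_indirect_partitions` and its parameters are hypothetical names for illustration, and the function operates on plain numbers rather than on real NKI tiles.

```python
QUADRANT = 32  # partitions per quadrant, per the text


def check_indirect_partitions(data1_start, data1_buf,
                              data1_index_start=None, dst_index_start=None):
    """Return a list of violated rules (empty if the layout is acceptable).

    *_start values are base partition numbers. An index start of None means
    that operand does not use .indirect().
    """
    errors = []
    for name, idx in (("data1", data1_index_start), ("dst", dst_index_start)):
        if idx is not None and idx % QUADRANT != 0:
            errors.append(f"{name} index must start on a multiple of {QUADRANT}")
    if data1_index_start is not None and data1_start != data1_index_start:
        errors.append("data1 must start on the same partition as its index")
    if dst_index_start is not None:
        if data1_buf == "sbuf" and dst_index_start != data1_start:
            errors.append("dst index must start on the same partition as data1")
        if (data1_buf == "psum" and data1_index_start is not None
                and dst_index_start != data1_index_start):
            errors.append(
                "dst index must start on the same partition as data1's index"
            )
    return errors


ok = check_indirect_partitions(32, "sbuf", data1_index_start=32,
                               dst_index_start=32)
bad = check_indirect_partitions(32, "sbuf", data1_index_start=16)
```

In this example `ok` is empty, while `bad` reports two problems: the index does not start on a quadrant boundary, and data1 does not share its index's starting partition.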

Parameters:

dst – an output tile of the element-wise operation
data1 – lhs input operand of the element-wise operation
data2 – rhs input operand of the element-wise operation
op – a binary math operator (see Supported Math Operators for NKI ISA for supported operators)
engine – (optional) the engine to use for the operation: nki.isa.engine.vector, nki.isa.engine.gpsimd or nki.isa.engine.unknown (default, let compiler select best engine based on the input tile shape).
