This document is relevant for: Trn2, Trn3

nki.isa.tensor_copy

nki.isa.tensor_copy(dst, src, engine=engine.unknown, name=None)

Create a copy of the src tile in dst within the NeuronCore on-chip SRAMs, using the Vector, Scalar or GpSimd Engine.

The output tile has the same partition axis size and the same number of elements per partition as the input tile src.

All three compute engines (Vector, Scalar and GpSimd) can perform a tensor copy, but their copy behavior differs slightly:

  • Scalar Engine on NeuronCore-v2 performs the copy by first casting the input tile to FP32 internally and then casting from FP32 to dst.dtype. Be cautious about assigning this instruction to the Scalar Engine when the input data type cannot be represented exactly in FP32 (e.g., INT32).

  • Both the GpSimd and Vector Engines operate in one of two modes: (1) a bit-accurate copy when the input and output data types are the same, or (2) an intermediate FP32 cast when the input and output data types differ, similar to the Scalar Engine.
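The precision concern with the intermediate FP32 cast can be illustrated off-device. The following is a minimal sketch that uses NumPy, not NKI, to emulate an INT32 -> FP32 -> INT32 round trip; the helper name copy_via_fp32 is hypothetical, and the emulation only approximates what the hardware does.

```python
import numpy as np

def copy_via_fp32(src: np.ndarray, dst_dtype) -> np.ndarray:
    """Emulate a copy that casts to FP32 first, then to dst_dtype.

    Hypothetical helper for illustration only; it is not part of NKI.
    """
    return src.astype(np.float32).astype(dst_dtype)

# FP32 has a 24-bit significand, so 2**24 + 1 is the smallest positive
# integer that it cannot represent exactly.
src = np.array([1, 2**24, 2**24 + 1], dtype=np.int32)

bit_accurate = src.copy()                # same-dtype copy, no intermediate cast
via_fp32 = copy_via_fp32(src, np.int32)  # INT32 -> FP32 -> INT32

# Element-wise check of which values survived the round trip unchanged.
lossless = via_fp32 == src

print(via_fp32.tolist())  # [1, 16777216, 16777216]
print(lossless.tolist())  # [True, True, False]
```

The last element changes because 16777217 rounds to 16777216 in FP32, which is the kind of silent value change the caution above refers to.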

In addition, since GpSimd Engine cannot access PSUM in NeuronCore, Scalar or Vector Engine must be chosen when the input or output tile is in PSUM (see NeuronCore-v2 Compute Engines for details). By default, this API returns a tile in SBUF, unless the returned value is assigned to a pre-declared PSUM tile.

On NeuronCore-v2, tensor_copy is not supported on the Scalar Engine. Instead, use nisa.activation with op=nl.copy.
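The engine constraints stated above can be summarized as a small decision helper. This is a plain-Python sketch, not NKI code: the function candidate_engines, its string arguments and the neuroncore_v2 flag are all hypothetical names introduced for illustration, and the compiler's actual selection logic with engine.unknown may differ.

```python
def candidate_engines(src_buffer: str, dst_buffer: str,
                      neuroncore_v2: bool = False) -> list[str]:
    """Return the engines that the documented rules allow for a tensor copy.

    Hypothetical helper: buffers are given as "sbuf" or "psum".
    """
    engines = ["vector", "scalar", "gpsimd"]
    # GpSimd Engine cannot access PSUM, so exclude it if either tile is in PSUM.
    if "psum" in (src_buffer, dst_buffer):
        engines.remove("gpsimd")
    # On NeuronCore-v2, tensor_copy is not supported on the Scalar Engine.
    if neuroncore_v2:
        engines.remove("scalar")
    return engines

print(candidate_engines("sbuf", "sbuf"))                      # ['vector', 'scalar', 'gpsimd']
print(candidate_engines("psum", "sbuf"))                      # ['vector', 'scalar']
print(candidate_engines("psum", "sbuf", neuroncore_v2=True))  # ['vector']
```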

Parameters:
  • dst – the output tile; after the copy it holds the same content as the src tile and has the same partition axis size.

  • src – the source of copy, must be a tile in SBUF or PSUM.

  • engine – (optional) the engine to use for the operation: nki.isa.engine.vector, nki.isa.engine.scalar, nki.isa.engine.gpsimd or nki.isa.engine.unknown (default, compiler selects best engine based on engine workload).
