This document is relevant for: Inf2, Trn1, Trn2

nki.isa.bn_aggr

nki.isa.bn_aggr(data, *, mask=None, dtype=None, **kwargs)

Aggregate one or multiple bn_stats outputs to generate a mean and variance per partition using Vector Engine.

The input data tile holds, per partition, an array of (count, mean, variance*count) tuples produced by bn_stats instructions. Therefore, the number of elements per partition of data must be a multiple of three.
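To make the tuple layout concrete, here is a minimal pure-Python sketch of how the (count, mean, variance*count) triples of one partition could be merged into a single mean and variance. It is a reference model only: the helper names `chunk_stats` and `aggregate_bn_stats` are hypothetical, the pairwise merge formula (the standard parallel-variance update) and the use of population variance are assumptions, and the algorithm the Vector Engine actually runs is not described in this document.

```python
from typing import List, Sequence, Tuple


def chunk_stats(values: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for one bn_stats output: [count, mean, variance*count]."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)  # variance * count
    return [float(n), mean, m2]


def aggregate_bn_stats(row: Sequence[float]) -> Tuple[float, float]:
    """Merge the triples stored in one partition into (mean, variance).

    Assumption: population variance, combined with the standard
    pairwise (parallel) variance update.
    """
    if len(row) % 3 != 0:
        raise ValueError("elements per partition must be a multiple of three")

    total_count = 0.0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (variance * count)
    for i in range(0, len(row), 3):
        count, chunk_mean, chunk_m2 = row[i], row[i + 1], row[i + 2]
        if count == 0:
            continue  # an empty chunk contributes nothing
        new_count = total_count + count
        delta = chunk_mean - mean
        mean += delta * count / new_count
        m2 += chunk_m2 + delta * delta * total_count * count / new_count
        total_count = new_count

    if total_count == 0:
        raise ValueError("no samples to aggregate")
    return mean, m2 / total_count


# Two chunks of the same partition, written back to back (six elements).
row = chunk_stats([1, 2, 3, 4]) + chunk_stats([5, 6, 7, 8, 9, 10])
mean, var = aggregate_bn_stats(row)
print(mean, var)  # matches the mean and population variance of 1..10
```

The result agrees with computing the mean and population variance over all ten values at once, which is the property that lets several bn_stats outputs be aggregated in a single step.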

Note: to aggregate the outputs of multiple bn_stats instructions, it is recommended to declare an SBUF tensor and have each bn_stats instruction write its output into that tensor at a different offset (see Example 2 in bn_stats).

The Vector Engine performs the statistics aggregation in float32 precision. The engine therefore casts the input data tile to float32 before computing, and it can cast the float32 results to the data type specified by the dtype field at no additional performance cost. If the dtype field is not specified, the instruction casts the float32 results back to the data type of the input data tile.

Estimated instruction cost:

max(MIN_II, 13*(N/3)) Vector Engine cycles, where N is the number of elements per partition in data and MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
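As a quick illustration of the cost formula, the sketch below evaluates it for a few tile sizes. The function name `estimate_bn_aggr_cycles` is hypothetical, and the default of 64 for MIN_II is only the approximate figure quoted above, so the results are rough estimates rather than measured cycle counts.

```python
def estimate_bn_aggr_cycles(n_elements: int, min_ii: int = 64) -> int:
    """Evaluate max(MIN_II, 13 * (N / 3)) for N elements per partition.

    min_ii defaults to 64, the approximate value given in the text.
    """
    if n_elements <= 0 or n_elements % 3 != 0:
        raise ValueError("N must be a positive multiple of three")
    return max(min_ii, 13 * (n_elements // 3))


# One, five, and six bn_stats outputs per partition.
for n in (3, 15, 18):
    print(n, estimate_bn_aggr_cycles(n))
```

Under these assumptions the estimate stays at MIN_II until a partition holds five tuples (N = 15); beyond that point it grows by 13 cycles per additional tuple.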

Parameters:

data – an input tile with results of one or more bn_stats
mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
dtype – (optional) data type to cast the output to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tile.

Returns:

an output tile with two elements per partition: a mean followed by a variance
