nki.isa.range_select
- nki.isa.range_select(dst, on_true_tile, comp_op0, comp_op1, bound0, bound1, reduce_cmd=reduce_cmd.idle, reduce_res=None, reduce_op=<function amax>, range_start=0.0, on_false_value=0.0, name=None)
Select elements from on_true_tile based on comparison with bounds, using the Vector Engine.
Note: Available only on NeuronCore-v3 and newer.
For each element in on_true_tile, compares its free dimension index + range_start against bound0 and bound1 using the specified comparison operators (comp_op0 and comp_op1). If both comparisons evaluate to True, copies the element to the output; otherwise uses on_false_value.
Additionally performs a reduction operation specified by reduce_op on the results, storing the reduction result in reduce_res.
Note on numerical stability:
In self-attention, we often have this instruction sequence: range_select (VectorE) -> reduce_res -> activation (ScalarE). When range_select outputs a full row of the fill value (on_false_value), caution is needed to avoid NaN in the activation instruction, which subtracts reduce_res (the max value) from the output of range_select:
If the output dtype and reduce_res are both FP32, we should not hit any NaN issue since FP32_MIN - FP32_MIN = 0. Exponentiation on 0 is stable (1.0 exactly).
If the output dtype is FP16/BF16/FP8, the fill value in the output tile will become -INF since HW performs a downcast from FP32_MIN to a smaller dtype. In this case, you must make sure reduce_res uses the FP32 dtype to avoid NaN in activation. NaN can be avoided because activation always upcasts input tiles to FP32 to perform math operations: -INF - FP32_MIN = -INF. Exponentiation on -INF is stable (0.0 exactly).
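The arithmetic in the note above can be checked on the host with plain NumPy. This is only a sketch of the floating-point behaviour, not of the hardware: FP16 stands in for the smaller dtypes because NumPy has no native BF16, and the last case (a reduce_res that is itself downcast) is an assumed illustration of what the FP32 requirement guards against.

```python
import numpy as np

FP32_MIN = np.finfo(np.float32).min  # most negative finite FP32 value

# Suppress the overflow/invalid warnings that these edge cases trigger on purpose.
with np.errstate(over="ignore", invalid="ignore"):
    # Case 1: FP32 output and FP32 reduce_res. A fully masked row holds
    # FP32_MIN and so does its row max, so the subtraction is exactly 0.
    case_fp32 = np.exp(FP32_MIN - FP32_MIN)

    # Case 2: smaller output dtype. Downcasting FP32_MIN overflows to -inf.
    masked_fp16 = np.float32(FP32_MIN).astype(np.float16)

    # Case 2a: reduce_res kept in FP32, so -inf - FP32_MIN = -inf and exp gives 0.
    case_mixed = np.exp(np.float32(masked_fp16) - FP32_MIN)

    # Case 2b (assumed failure mode): reduce_res is also -inf,
    # so -inf - (-inf) = NaN and the exponent is NaN as well.
    case_bad = np.exp(np.float32(masked_fp16) - np.float32(masked_fp16))

print(case_fp32, masked_fp16, case_mixed, case_bad)
```

Only case 2b produces NaN, which matches the guidance to keep reduce_res in FP32 whenever the output tile uses a narrower dtype.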
Constraints:
The comparison operators must be one of:
np.equal
np.less
np.less_equal
np.greater
np.greater_equal
Partition dim sizes must match across on_true_tile, bound0, and bound1:
bound0 and bound1 must have one element per partition
on_true_tile must be one of the FP dtypes, and bound0/bound1 must be FP32 types.
The comparison with bound0, bound1, and free dimension index is done in FP32. Make sure range_start + free dimension index is within 2^24 range.
Numpy equivalent:
    indices = np.zeros_like(on_true_tile, dtype=np.float32)
    indices[:] = range_start + np.arange(on_true_tile[0].size)
    mask = comp_op0(indices, bound0) & comp_op1(indices, bound1)
    select_out_tile = np.where(mask, on_true_tile, on_false_value)
    reduce_tile = reduce_op(select_out_tile, axis=1, keepdims=True)
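To make the NumPy equivalent concrete, the sketch below wraps it in a function and runs it on a small causal-mask style input. The helper name range_select_reference, the tile shapes, and the chosen bounds are illustrative assumptions; it models only the documented semantics on the host and is not an NKI API.

```python
import numpy as np

FP32_MIN = np.finfo(np.float32).min

def range_select_reference(on_true_tile, comp_op0, comp_op1, bound0, bound1,
                           range_start=0, on_false_value=FP32_MIN,
                           reduce_op=np.max):
    """Host-side NumPy model of the documented semantics (hypothetical helper)."""
    num_partitions, free_size = on_true_tile.shape
    # Free-dimension index plus range_start, identical on every partition.
    indices = np.zeros((num_partitions, free_size), dtype=np.float32)
    indices[:] = range_start + np.arange(free_size)
    # bound0/bound1 have shape (num_partitions, 1) and broadcast along the free dim.
    mask = comp_op0(indices, bound0) & comp_op1(indices, bound1)
    selected = np.where(mask, on_true_tile, on_false_value)
    reduced = reduce_op(selected, axis=1, keepdims=True)
    return selected, reduced

# Partition p keeps free indices 0..p; everything else becomes FP32_MIN.
tile = np.arange(12, dtype=np.float32).reshape(3, 4)
lower = np.zeros((3, 1), dtype=np.float32)
upper = np.arange(3, dtype=np.float32).reshape(3, 1)
selected, row_max = range_select_reference(
    tile, np.greater_equal, np.less_equal, lower, upper)

print(selected)
print(row_max)
```

With these inputs, row 0 keeps only its first element, row 1 its first two, and row 2 its first three, so the per-row maxima are 0, 5, and 10.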
- Parameters:
dst – output tile with selected elements
on_true_tile – input tile containing elements to select from
on_false_value – constant value to use when selection condition is False. Due to HW constraints, this must be the FP32 bit pattern of FP32_MIN.
comp_op0 – first comparison operator
comp_op1 – second comparison operator
bound0 – tile with one element per partition for first comparison
bound1 – tile with one element per partition for second comparison
reduce_op – reduction operator to apply across the selected output. Currently only np.max is supported.
reduce_res – optional tile to store reduction results.
range_start – starting base offset for index array for the free dimension of on_true_tile. Defaults to 0, and must be a compile-time integer.
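Because the index comparison runs in FP32, range_start plus the largest free-dimension index has to stay where FP32 still represents integers exactly. The following host-side check is a minimal sketch: the helper name index_range_is_exact is hypothetical, and reading "within 2^24 range" as an absolute value of at most 2^24 is an assumption.

```python
import numpy as np

FP32_EXACT_INT_LIMIT = 2 ** 24  # integers up to this magnitude are exact in FP32

def index_range_is_exact(range_start, free_size):
    """Hypothetical helper: True if every range_start + index stays within 2^24."""
    if free_size <= 0:
        raise ValueError("free_size must be positive")
    first = range_start
    last = range_start + free_size - 1
    return max(abs(first), abs(last)) <= FP32_EXACT_INT_LIMIT

# Above 2^24, neighbouring integers round to the same FP32 value,
# so an index comparison can no longer tell them apart.
collides = np.float32(2 ** 24 + 1) == np.float32(2 ** 24)

print(index_range_is_exact(0, 512), index_range_is_exact(2 ** 24, 2), collides)
```

A check like this could run at trace time, before the kernel is built, since range_start is required to be a compile-time integer.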