nki.isa.range_select

nki.isa.range_select(dst, on_true_tile, comp_op0, comp_op1, bound0, bound1, reduce_cmd=reduce_cmd.idle, reduce_res=None, reduce_op=<function amax>, range_start=0.0, on_false_value=0.0, name=None)

Select elements from on_true_tile based on comparison with bounds using Vector Engine.

Note

Available only on NeuronCore-v3 and newer.

For each element in on_true_tile, compares its free dimension index + range_start against bound0 and bound1 using the specified comparison operators (comp_op0 and comp_op1). If both comparisons evaluate to True, copies the element to the output; otherwise uses on_false_value.

Additionally performs a reduction operation specified by reduce_op on the results, storing the reduction result in reduce_res.

Note on numerical stability:

In self-attention, this instruction sequence is common: range_select (VectorE) -> reduce_res -> activation (ScalarE). When range_select outputs a full row of fill_value (that is, on_false_value), take care to avoid NaN in the activation instruction, which subtracts reduce_res (the max value) from the output of range_select:

  • If the output dtype and reduce_res are both FP32, no NaN should occur, since FP32_MIN - FP32_MIN = 0. Exponentiation of 0 is stable (exactly 1.0).

  • If the output dtype is FP16/BF16/FP8, the fill_value in the output tile becomes -INF, because the HW downcasts FP32_MIN to the smaller dtype. In this case, reduce_res must use the FP32 dtype to avoid NaN in activation. This works because activation always upcasts input tiles to FP32 before doing math: -INF - FP32_MIN = -INF. Exponentiation of -INF is stable (exactly 0.0).
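The arithmetic behind the two cases above can be checked on the host with NumPy. This is only a sketch of the floating-point behavior, not of the hardware: it assumes FP32_MIN means the most negative finite FP32 value (np.finfo(np.float32).min), uses FP16 as a stand-in for the smaller dtypes, and adds a third case (reduce_res also in low precision) that is inferred from the text rather than stated in it.

```python
import numpy as np

# Assumption: FP32_MIN is the most negative finite FP32 value.
FP32_MIN = np.finfo(np.float32).min

# A fully masked row, as range_select would write it in FP32.
row_fp32 = np.full(4, FP32_MIN, dtype=np.float32)
max_fp32 = row_fp32.max()  # stands in for an FP32 reduce_res

# Case 1: FP32 output and FP32 reduce_res -> exp(0) == 1.0
case1 = np.exp(row_fp32 - max_fp32)

# Case 2: FP16 output (downcast overflows to -inf), FP32 reduce_res.
with np.errstate(over="ignore"):
    row_fp16 = row_fp32.astype(np.float16)
# Activation is modeled as upcasting the input to FP32 before subtracting.
case2 = np.exp(row_fp16.astype(np.float32) - max_fp32)  # exp(-inf) == 0.0

# Case 3 (inferred): reduce_res also held in FP16, so it is -inf as well.
max_fp16 = row_fp16.max()
with np.errstate(invalid="ignore"):
    case3 = np.exp(row_fp16.astype(np.float32) - np.float32(max_fp16))  # nan

print(case1, case2, case3)
```

Case 3 shows why reduce_res should stay in FP32: -INF - (-INF) is undefined, so the exponent input is already NaN.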

Constraints:

The comparison operators must be one of:

  • np.equal

  • np.less

  • np.less_equal

  • np.greater

  • np.greater_equal

Partition dim sizes must match across on_true_tile, bound0, and bound1:

  • bound0 and bound1 must have one element per partition

  • on_true_tile must use one of the floating-point dtypes, and bound0/bound1 must be FP32.

The comparison with bound0, bound1, and the free dimension index is done in FP32. Make sure range_start + free dimension index stays within the 2^24 range, beyond which FP32 can no longer represent every integer exactly.
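The 2^24 limit follows from the 24-bit significand of FP32. The sketch below shows two adjacent indices collapsing to the same FP32 value just past the limit. The helper index_range_is_exact is hypothetical (it is not part of NKI) and encodes one reading of the constraint: the largest index used must not exceed 2^24.

```python
import numpy as np

LIMIT = 2 ** 24  # largest N such that every integer in [0, N] is exact in FP32

# Just below the limit, neighbouring indices stay distinct in FP32.
below_distinct = np.float32(LIMIT - 1) != np.float32(LIMIT)

# Just above it, LIMIT + 1 rounds back to LIMIT, so two indices compare equal.
above_collapsed = np.float32(LIMIT) == np.float32(LIMIT + 1)


def index_range_is_exact(range_start: int, free_size: int) -> bool:
    """Hypothetical host-side check: are all indices exactly representable?"""
    last_index = range_start + free_size - 1
    return 0 <= range_start and last_index <= LIMIT


print(below_distinct, above_collapsed)
print(index_range_is_exact(0, 4096))
print(index_range_is_exact(LIMIT, 4096))
```

A check like this could run before launching a kernel, since range_start is known at compile time.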

Numpy equivalent:

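import numpy as np

# Free-dimension index of every element, offset by range_start (held in FP32)
indices = np.zeros_like(on_true_tile, dtype=np.float32)
indices[:] = range_start + np.arange(on_true_tile[0].size)

# Keep an element only when both comparisons against the bounds hold
mask = comp_op0(indices, bound0) & comp_op1(indices, bound1)
select_out_tile = np.where(mask, on_true_tile, on_false_value)
# Reduce each partition (row) along the free dimension
reduce_tile = reduce_op(select_out_tile, axis=1, keepdims=True)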
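
To make the equivalent above concrete, here is a small runnable sketch with sample inputs. The wrapper name range_select_reference and the sample values are illustrative assumptions. It models the documented semantics on the host and is not the NKI call itself.

```python
import numpy as np

# Assumption: FP32_MIN is the most negative finite FP32 value.
FP32_MIN = np.finfo(np.float32).min


def range_select_reference(on_true_tile, comp_op0, comp_op1, bound0, bound1,
                           range_start=0, on_false_value=FP32_MIN,
                           reduce_op=np.max):
    """Hypothetical NumPy model of range_select (host-side only)."""
    num_partitions, free_size = on_true_tile.shape
    # Free-dimension index per element, offset by range_start, in FP32.
    indices = np.zeros((num_partitions, free_size), dtype=np.float32)
    indices[:] = range_start + np.arange(free_size)
    # bound0/bound1 have shape (num_partitions, 1) and broadcast per row.
    mask = comp_op0(indices, bound0) & comp_op1(indices, bound1)
    out = np.where(mask, on_true_tile, np.float32(on_false_value))
    return out, reduce_op(out, axis=1, keepdims=True)


# Two partitions, four elements each: [[0, 1, 2, 3], [4, 5, 6, 7]]
tile = np.arange(8, dtype=np.float32).reshape(2, 4)
lower = np.array([[1.0], [0.0]], dtype=np.float32)  # bound0, one per partition
upper = np.array([[3.0], [2.0]], dtype=np.float32)  # bound1, one per partition

# Select indices i with lower <= i < upper in each partition.
out, row_max = range_select_reference(
    tile, np.greater_equal, np.less, lower, upper)

print(out)
print(row_max)
```

Partition 0 keeps indices 1 and 2 (values 1 and 2), partition 1 keeps indices 0 and 1 (values 4 and 5), and every other position holds FP32_MIN, so the row maxima are 2 and 5.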
Parameters:
  • dst – output tile with selected elements

  • on_true_tile – input tile containing elements to select from

  • on_false_value – constant value to use when the selection condition is False. Due to HW constraints, this must be the FP32 bit pattern of FP32_MIN.

  • comp_op0 – first comparison operator

  • comp_op1 – second comparison operator

  • bound0 – tile with one element per partition for first comparison

  • bound1 – tile with one element per partition for second comparison

  • reduce_op – reduction operator to apply across the selected output. Currently only np.max is supported.

  • reduce_res – optional tile to store reduction results.

  • range_start – starting base offset of the index array for the free dimension of on_true_tile. Defaults to 0 and must be a compile-time integer.