This document is relevant for: Inf2, Trn1, Trn1n
nki.language.affine_range

nki.language.affine_range(*args)
Create a sequence of numbers for use as parallel loop iterators in NKI.
affine_range should be the default loop iterator choice when there is no loop-carried dependency. Note that associative reductions are not considered loop-carried dependencies in this context. A concrete example of an associative reduction is multiple nl.matmul or nisa.nc_matmul calls accumulating into the same output buffer defined outside of this loop level (see code example #2 below). When the above conditions are not met, we recommend using sequential_range instead (a sketch follows the code examples below).
Notes:
- Using affine_range prevents the Neuron compiler from unrolling the loops until entering the compiler backend, which typically results in better compilation time compared to the fully unrolled iterator static_range (see the comparison sketch below).
- Using affine_range also allows the Neuron compiler to perform additional loop-level optimizations, such as loop vectorization in the current release. The exact set of loop-level optimizations applied is subject to change in future releases.
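To make the trade-off concrete, below is a minimal sketch contrasting the two iterators on the same element-wise tile loop. The tensors src and dst and the trip count of 4 are illustrative assumptions, not part of the API; only the choice of iterator differs between the two loops.

import neuronxcc.nki.language as nl

# Assumed setup (hypothetical): src and dst are [128, 2048] tensors,
# processed in four [128, 512] tiles.

# static_range: the loop is fully unrolled at trace time into 4 copies
# of the body, so compilation time grows with the trip count.
for i in nl.static_range(4):
    tile = nl.load(src[0:128, i*512:(i+1)*512])
    nl.store(dst[0:128, i*512:(i+1)*512], nl.multiply(tile, tile))

# affine_range: the loop is kept as a single parallel loop until the
# compiler backend, typically compiling faster and enabling loop-level
# optimizations such as vectorization.
for i in nl.affine_range(4):
    tile = nl.load(src[0:128, i*512:(i+1)*512])
    nl.store(dst[0:128, i*512:(i+1)*512], nl.multiply(tile, tile))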
import neuronxcc.nki.language as nl

#######################################################################
# Example 1: No loop carried dependency
# Input/Output tensor shape: [128, 2048]
# Load one tile ([128, 512]) at a time, square the tensor element-wise,
# and store it into output tile
#######################################################################

# Every loop instance works on an independent input/output tile.
# No data dependency between loop instances.
for i_input in nl.affine_range(input.shape[1] // 512):
    offset = i_input * 512
    input_sb = nl.load(input[0:input.shape[0], offset:offset+512])
    result = nl.multiply(input_sb, input_sb)
    nl.store(output[0:input.shape[0], offset:offset+512], result)

#######################################################################
# Example 2: Matmul output buffer accumulation, a type of associative
# reduction
# Input tensor shapes for nl.matmul: xT[K=2048, M=128] and y[K=2048, N=128]
# Load one tile ([128, 128]) from both xT and y at a time, matmul and
# accumulate into the same output buffer
#######################################################################

result_psum = nl.zeros((128, 128), dtype=nl.float32, buffer=nl.psum)
for i_K in nl.affine_range(xT.shape[0] // 128):
    offset = i_K * 128
    xT_sbuf = nl.load(xT[offset:offset+128, 0:xT.shape[1]])
    y_sbuf = nl.load(y[offset:offset+128, 0:y.shape[1]])

    # Every iteration accumulates into the same PSUM buffer defined
    # outside the loop: an associative reduction, safe for affine_range.
    result_psum += nl.matmul(xT_sbuf, y_sbuf, transpose_x=True)
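For contrast, here is a minimal sketch of a loop with a genuine loop-carried dependency, where sequential_range is the recommended iterator as noted above. The running prefix-sum pattern and the acc buffer are illustrative assumptions, not part of the affine_range API; the key property is that each iteration stores an intermediate value that depends on the previous iteration's result.

import neuronxcc.nki.language as nl

# Hypothetical cumulative sum over [128, 512] tiles of a [128, 2048]
# input tensor. Iteration i reads the acc value produced by iteration
# i-1 and stores it, so intermediate values matter: this is a true
# loop-carried dependency, not an associative reduction into a single
# output buffer.
acc = nl.zeros((128, 512), dtype=nl.float32)
for i in nl.sequential_range(input.shape[1] // 512):
    offset = i * 512
    tile = nl.load(input[0:128, offset:offset+512])
    acc = nl.add(acc, tile)  # depends on acc from the previous iteration
    nl.store(output[0:128, offset:offset+512], acc)

Here affine_range would be incorrect: the compiler would be free to reorder or parallelize the iterations, and the stored prefix sums would no longer reflect the sequential accumulation order.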
This document is relevant for: Inf2, Trn1, Trn1n