Source code for nki.compiler.sbuf

import numpy as np
import ml_dtypes

[docs]def alloc(func): r""" Allocate SBUF memory space for each logical block in a tensor using a customized allocation method. This is one of the NKI direction allocation APIs. We recommend reading :doc:`NKI Direct Allocation Developer Guide <../../nki_direct_allocation_guide>` before using these APIs. In NKI, a SBUF tensor (declared using :ref:`NKI tensor creation APIs <nl_creation>`) can have three kinds of dimensions, in order: logical block(B), partition(P), and free(F). The partition and free dimensions directly map to the SBUF dimensions. Both B and F can be multi-dimensional, while P must be one-dimensional per Neuron ISA constraints. The block dimension describes how many (P, F) logical tiles this tensor has, but does not reflect the number of physical tiles being allocated. ``ncc.sbuf.alloc`` should be assigned to the ``buffer`` field of a NKI tensor declaration API. For example, .. code-block:: nki_tensor = nl.ndarray((4, 8, nl.par_dim(128), 4, 32), dtype=nl.bfloat16, buffer=ncc.sbuf.alloc(...)) ``ncc.sbuf.alloc`` allows programmers to specify the physical location of each logical tile in the tensor. The API accepts a single input ``func`` parameter, which is a callable object that takes in: 1. a tuple of integers ``idx`` representing a logical block index, 2. an integer ``pdim_size`` for the number of partitions the logical tile has, and 3. an integer ``fdim_size`` for the number of bytes the logical tile has per partition. The number of integers in ``idx`` must match the number of B dimensions the SBUF tensor has. For example, for the above ``nki_tensor``, we expect the ``idx`` tuple to have two integers for a 2D block index. ``pdim_size`` should match the partition dimension size of the NKI tensor exactly. ``fdim_size`` should be the total size of F dimension shapes of each logical tile in the tensor, multiplied by the data type size in bytes. For the above ``sbuf_tensor``, ``pdim_size`` should be 128, and ``fdim_size`` should be ``4*32*sizeof(nl.bfloat16) = 256`` bytes. The ``func`` callable must return a tuple of two integers ``(start_partition, byte_addr)`` indicating the physical tile location for the input logical block index. ``start_partition`` indicates the lowest partition the physical tile allocation starts from and must follow the these ISA rules: - If ``64 < pdim_size <= 128``, ``start_partition`` must be 0 - If ``32 < pdim_size <= 64``, ``start_partition`` must be 0 or 64 - If ``0 < pdim_size <= 32``, ``start_partition`` must be one of 0/32/64/96 The ``byte_addr`` indicates the byte offset into each partition the physical tile starts from. On NeuronCore-v2, a valid ``byte_addr`` can be any integer values from 0 (inclusive) to 192KiB-16KiB=(192-16)*1024 (exclusive). 192KiB is the physical size of a SBUF partition (defined in :doc:`architecture guide <../../arch/trainium_inferentia2_arch>`) and 16KiB is allocated for compiler internal usage. In addition, the ``base_addr`` must be aligned to ``nki.language.constants.sbuf_min_align``. .. note:: In current release, programmers cannot mix NKI tensor declarations using automatic allocation (``ncc.sbuf.auto_alloc()`` or the PSUM variant) and direction allocation APIs (``ncc.sbuf.alloc()``, ``ncc.sbuf.mod_alloc()`` or the PSUM variants) in the same kernel. :param func: a callable object to specify how to place the logical block in SBUF memory. """ ...
[docs]def mod_alloc(*, base_addr, base_partition=0, num_par_tiles=(), num_free_tiles=()): r""" Allocate SBUF memory space for each logical tile in a tensor through modulo allocation. This is one of the NKI direction allocation APIs. We recommend reading :doc:`NKI Direct Allocation Developer Guide <../../nki_direct_allocation_guide>` before using these APIs. This API is equivalent to calling :doc:`nisa.compiler.alloc() <nki.compiler.sbuf.alloc>` with a callable ``psum_modulo_alloc_func`` as defined below. .. nki_example:: ../../../nki/test/test_sbuf_modulo_alloc.py :language: python :linenos: :marker: NKI_EXAMPLE_0 Here's an example usage of this API: .. code-block:: python nki_tensor = nl.ndarray((4, par_dim(128), 512), dtype=nl.bfloat16, buffer=nki.compiler.sbuf.mod_alloc(base_addr=0, num_free_tiles=(2, ))) for i_block in nl.affine_range(4): nki_tensor[i_block, :, :] = nl.load(...) ... = nl.exp(nki_tensor[i_block, :, :]) This produces the following allocation: .. list-table:: Modulo Allocation Example :header-rows: 1 * - Logical Tile Index - Physical Tile ``start_partition`` - Physical Tile ``byte_addr`` * - (0, ) - 0 - 0 + (0 % 2) * 512 * sizeof(nl.bfloat16) = 0 * - (1, ) - 0 - 0 + (1 % 2) * 512 * sizeof(nl.bfloat16) = 1024 * - (2, ) - 0 - 0 + (2 % 2) * 512 * sizeof(nl.bfloat16) = 0 * - (3, ) - 0 - 0 + (3 % 2) * 512 * sizeof(nl.bfloat16) = 1024 With above scheme, we are able to implement double buffering in ``nki_tensor``, such that ``nl.load`` in one iteration can write to one physical tile while ``nl.exp`` of the previous iteration can read from the other physical tile simultaneously. .. note:: In current release, programmers cannot mix NKI tensor declarations using automatic allocation (``ncc.sbuf.auto_alloc()`` or the PSUM variant) and direction allocation APIs (``ncc.sbuf.alloc()``, ``ncc.sbuf.mod_alloc()`` or the PSUM variants). :param base_addr: the base address in the free(F) dimension of the SBUF in bytes. :param base_partition: the partition where the physical tile starts from. Must be 0 in the current version. :param num_par_tiles: the number of physical tiles on the partition dimension of SBUF allocated for the tensor. The length of the tuple must be empty or equal to the length of block dimension for the tensor. :param num_free_tiles: the number of physical tiles on the free dimension of SBUF allocated for the tensor. The length of the tuple must be empty or equal to the length of block dimension for the tensor. """ ...
[docs]def auto_alloc(): r""" Returns a maker to indicate the tensor should be automatically allocated by compiler. All SBUF tensors in a kernel must either all be marked as ``auto_alloc()``, or all be allocated with ``alloc`` or ``mod_alloc``. Initialize a tensor with ``buffer=nl.sbuf`` is equivalent to ``buffer=ncc.sbuf.auto_alloc()``. """ ...