This document is relevant for: Trn1, Trn2, Trn3

SBUF-to-SBUF All-Gather Kernel API Reference#

Performs an SBUF-to-SBUF all-gather, collecting tensors from all participating ranks.

The kernel provides two variants:

  • allgather_sb2sb — Optimized for small tensors that fit entirely in SBUF

  • allgather_sb2sb_tiled — Adds tiling and LNC support for larger tensors

Background#

The allgather_sb2sb kernels gather input tensors from all ranks along the last dimension (K dimension). Each rank contributes its local tensor, and all ranks receive the concatenated result.
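The gather semantics can be sketched in plain Python. This is a reference model only, not the kernel implementation: the helper name `allgather_last_dim` is hypothetical, and nested lists stand in for on-device tensors.

```python
# Reference semantics of all-gather along the last (K) dimension.
# Each rank holds a local [H, W] tensor; after the collective, every
# rank holds the same [H, W * tp_degree] tensor formed by concatenating
# the per-rank slices along the last dimension.

def allgather_last_dim(per_rank_tensors):
    """per_rank_tensors: list of tp_degree tensors, each shaped [H][W]."""
    H = len(per_rank_tensors[0])
    # Output row h is rank0's row h, then rank1's row h, and so on.
    return [
        [x for t in per_rank_tensors for x in t[h]]
        for h in range(H)
    ]

# Two ranks, H = 2, W = 2 -> output is [2, 4], identical on every rank.
rank0 = [[0, 1], [2, 3]]
rank1 = [[4, 5], [6, 7]]
gathered = allgather_last_dim([rank0, rank1])
# gathered == [[0, 1, 4, 5], [2, 3, 6, 7]]
```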

API Reference#

Source code for this kernel API can be found at: sb2sb_allgather.py

allgather_sb2sb#

nkilib.experimental.collectives.allgather_sb2sb(inp: nl.ndarray, replica_groups: ReplicaGroup, tp_degree: int) → nl.ndarray#

SBUF-to-SBUF all-gather kernel for gathering tensors across ranks.

Parameters:
  • inp (nl.ndarray) – [H, W], Input tensor on HBM, where W is the local width per rank.

  • replica_groups (ReplicaGroup) – ReplicaGroup defining which ranks participate in the collective.

  • tp_degree (int) – Tensor parallelism degree (number of ranks in the group).

Returns:

[H, K], Output tensor on shared HBM containing gathered data from all ranks, where K = W * tp_degree.

Return type:

nl.ndarray

Notes:

  • Input tensor must fit in SBUF (H * W * dtype_size <= SBUF capacity)

  • Output is stored in shared_hbm for cross-rank visibility

  • All ranks receive identical output after the collective

Dimensions:

  • H: Height dimension (partition dimension, typically <= 128)

  • W: Width dimension per rank (local width before gather)
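The SBUF-fit constraint from the notes can be expressed as a preflight check. This is a hypothetical helper, not part of the library: `fits_in_sbuf` and the `SBUF_BYTES` figure are illustrative assumptions, and the actual SBUF capacity varies by device generation (consult the hardware documentation).

```python
# Hypothetical preflight check mirroring the note:
# H * W * dtype_size <= SBUF capacity.
SBUF_BYTES = 24 * 1024 * 1024  # placeholder capacity, not an official figure

def fits_in_sbuf(h, w, dtype_size, sbuf_bytes=SBUF_BYTES):
    """Return True if an [h, w] tensor of the given element size fits in SBUF."""
    return h * w * dtype_size <= sbuf_bytes

# A [128, 1024] fp16 tensor (256 KiB) fits; a [128, 200000] fp32 tensor does not.
```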

allgather_sb2sb_tiled#

nkilib.experimental.collectives.allgather_sb2sb_tiled(inp: nl.ndarray, replica_groups: ReplicaGroup, tp_degree: int) → nl.ndarray#

SBUF-to-SBUF all-gather with tiling and LNC support for larger tensors.

Parameters:
  • inp (nl.ndarray) – [M, K], Input tensor on HBM, where K is the local width per rank.

  • replica_groups (ReplicaGroup) – ReplicaGroup defining which ranks participate in the collective.

  • tp_degree (int) – Tensor parallelism degree (number of ranks in the group).

Returns:

[M, K * tp_degree], Output tensor on shared HBM containing gathered data.

Return type:

nl.ndarray

Notes:

  • TILE_M is capped at 128 (SBUF partition size limit)

  • When launched with LNC grid [lnc], tiles are distributed across LNC cores

  • Each LNC core processes TILES_PER_CORE = NUM_M_TILES // n_prgs tiles

  • Assumes M is evenly divisible by 128 when M > 128

Dimensions:

  • M: Height dimension (tiled along this dimension)

  • K: Width dimension per rank (local width before gather)

  • TILE_M: Tile size along M dimension (capped at 128)
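The tile-distribution arithmetic from the notes can be sketched as follows. The helper name `tiles_per_core` is hypothetical; the sketch assumes, as the notes state, that M is evenly divisible by TILE_M and that NUM_M_TILES divides evenly across the LNC cores (`n_prgs`).

```python
# Sketch of the tiling arithmetic described in the notes above.
TILE_M = 128  # tile height, capped at the SBUF partition size

def tiles_per_core(M, n_prgs):
    """Tiles handled by each LNC core: TILES_PER_CORE = NUM_M_TILES // n_prgs."""
    num_m_tiles = M // TILE_M
    return num_m_tiles // n_prgs

# e.g. M = 512 -> 4 tiles of 128 rows; with 2 LNC cores, each core gets 2 tiles.
```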
