This document is relevant for: Trn1, Trn2, Trn3
SBUF-to-SBUF All-Gather Kernel API Reference#
Performs SBUF-to-SBUF all-gather for gathering tensors across ranks.
The kernel provides two variants:
- allgather_sb2sb — Optimized for small tensors that fit entirely in SBUF
- allgather_sb2sb_tiled — Adds tiling and LNC support for larger tensors
Background#
The allgather_sb2sb kernels gather input tensors from all ranks along the last dimension (K dimension). Each rank contributes its local tensor, and all ranks receive the concatenated result.
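The gather semantics can be illustrated with a host-side NumPy reference model (illustrative only; the actual kernels run on-device and the function below is not part of the library):

```python
import numpy as np

def allgather_reference(local_tensors):
    """Host-side reference for all-gather semantics: concatenate each
    rank's [H, W] tensor along the last (K) dimension. Every rank
    receives the same [H, W * tp_degree] result."""
    gathered = np.concatenate(local_tensors, axis=-1)
    # All ranks see an identical copy of the gathered output.
    return [gathered.copy() for _ in local_tensors]

# Example: 4 ranks, each contributing a [2, 3] tensor filled with its rank id.
locals_ = [np.full((2, 3), rank, dtype=np.float32) for rank in range(4)]
outputs = allgather_reference(locals_)
assert outputs[0].shape == (2, 12)
assert all(np.array_equal(outputs[0], o) for o in outputs)
```

After the collective, columns 3:6 of every rank's output hold rank 1's contribution, columns 6:9 hold rank 2's, and so on.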
API Reference#
Source code for this kernel API can be found at: sb2sb_allgather.py
allgather_sb2sb#
- nkilib.experimental.collectives.allgather_sb2sb(inp: nl.ndarray, replica_groups: ReplicaGroup, tp_degree: int) nl.ndarray#
SBUF-to-SBUF all-gather kernel for gathering tensors across ranks.
- Parameters:
inp (nl.ndarray) – [H, W], Input tensor on HBM, where W is the local width per rank.
replica_groups (ReplicaGroup) – ReplicaGroup defining which ranks participate in the collective.
tp_degree (int) – Tensor parallelism degree (number of ranks in the group).
- Returns:
[H, K], Output tensor on shared HBM containing gathered data from all ranks.
- Return type:
nl.ndarray
Notes:
Input tensor must fit in SBUF (H * W * dtype_size <= SBUF capacity)
Output is stored in shared_hbm for cross-rank visibility
All ranks receive identical output after the collective
Dimensions:
H: Height dimension (partition dimension, typically <= 128)
W: Width dimension per rank (local width before gather)
K: Total gathered width (W * tp_degree)
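The SBUF-fit constraint from the notes above can be sketched as a host-side check. This is a hypothetical helper, not part of the library, and `sbuf_bytes_per_partition` is an assumed, target-dependent value; consult the hardware documentation for your target's actual SBUF capacity:

```python
import numpy as np

def fits_in_sbuf(h, w, dtype, sbuf_bytes_per_partition):
    """Sketch of the note's constraint that an [H, W] input fits in SBUF.
    H maps to the partition dimension (at most 128 partitions), so each
    partition must hold W elements of the given dtype.
    `sbuf_bytes_per_partition` is an assumed, target-dependent value."""
    if h > 128:
        return False  # exceeds the partition dimension
    row_bytes = w * np.dtype(dtype).itemsize
    return row_bytes <= sbuf_bytes_per_partition

# Example with an assumed 192 KiB per partition:
assert fits_in_sbuf(128, 4096, np.float32, 192 * 1024)        # 16 KiB per row fits
assert not fits_in_sbuf(256, 16, np.float32, 192 * 1024)      # H exceeds 128 partitions
```

Inputs that fail this check should use allgather_sb2sb_tiled instead.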
allgather_sb2sb_tiled#
- nkilib.experimental.collectives.allgather_sb2sb_tiled(inp: nl.ndarray, replica_groups: ReplicaGroup, tp_degree: int) nl.ndarray#
SBUF-to-SBUF all-gather with tiling and LNC support for larger tensors.
- Parameters:
inp (nl.ndarray) – [M, K], Input tensor on HBM, where K is the local width per rank.
replica_groups (ReplicaGroup) – ReplicaGroup defining which ranks participate in the collective.
tp_degree (int) – Tensor parallelism degree (number of ranks in the group).
- Returns:
[M, K * tp_degree], Output tensor on shared HBM containing gathered data.
- Return type:
nl.ndarray
Notes:
TILE_M is capped at 128 (SBUF partition size limit)
When launched with LNC grid [lnc], tiles are distributed across LNC cores
Each LNC core processes TILES_PER_CORE = NUM_M_TILES // n_prgs tiles
Assumes M is evenly divisible by 128 when M > 128
Dimensions:
M: Height dimension (tiled along this dimension)
K: Width dimension per rank (local width before gather)
TILE_M: Tile size along M dimension (capped at 128)
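The tiling arithmetic described in the notes can be sketched as a host-side calculation. Names mirror the notes; the helper itself is hypothetical, and it additionally assumes NUM_M_TILES divides evenly across cores (the notes only state floor division):

```python
def tile_schedule(m, n_prgs):
    """Sketch of the per-core tile distribution from the notes:
    TILE_M is capped at 128, and the NUM_M_TILES tiles along M are
    split across the n_prgs LNC cores."""
    tile_m = min(m, 128)
    if m > 128:
        # The notes assume M is evenly divisible by 128 when M > 128.
        assert m % 128 == 0, "M must be evenly divisible by 128 when M > 128"
    num_m_tiles = m // tile_m
    # Assumption (not stated in the notes): tiles divide evenly across cores.
    assert num_m_tiles % n_prgs == 0, "tiles assumed evenly divisible across cores"
    tiles_per_core = num_m_tiles // n_prgs
    return tile_m, num_m_tiles, tiles_per_core

# Example: M = 512 with an LNC grid of 2 cores -> 4 tiles of 128 rows, 2 per core.
print(tile_schedule(512, 2))  # (128, 4, 2)
```

When M <= 128 the whole input is a single tile, so only one core does useful work.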