This document is relevant for: Trn1, Trn2, Trn3

SBUF-to-SBUF All-Gather Kernel API Reference#

Performs an SBUF-to-SBUF all-gather, collecting tensors from all participating ranks.

The kernel provides two variants:

  • allgather_sb2sb — Optimized for small tensors that fit entirely in SBUF

  • allgather_sb2sb_tiled — Adds tiling and LNC support for larger tensors

Background#

The allgather_sb2sb kernels gather input tensors from all ranks along the last dimension (K dimension). Each rank contributes its local tensor, and all ranks receive the concatenated result.
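The gather semantics can be sketched in plain Python. This is a reference model only, not the kernel implementation: the helper name `allgather_last_dim` is hypothetical, and nested lists stand in for on-device tensors.

```python
# Reference semantics of all-gather along the last (K) dimension.
# Each rank holds a local [H, W] tensor; after the collective, every
# rank holds the same [H, W * tp_degree] tensor formed by concatenating
# the per-rank slices along the last dimension.

def allgather_last_dim(per_rank_tensors):
    """per_rank_tensors: list of tp_degree tensors, each shaped [H][W]."""
    H = len(per_rank_tensors[0])
    # Output row h is rank0's row h, then rank1's row h, and so on.
    return [
        [x for t in per_rank_tensors for x in t[h]]
        for h in range(H)
    ]

# Two ranks, H = 2, W = 2 -> output is [2, 4], identical on every rank.
rank0 = [[0, 1], [2, 3]]
rank1 = [[4, 5], [6, 7]]
gathered = allgather_last_dim([rank0, rank1])
# gathered == [[0, 1, 4, 5], [2, 3, 6, 7]]
```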

API Reference#

Source code for this kernel API can be found at: sb2sb_allgather.py

allgather_sb2sb#

nkilib.experimental.collectives.allgather_sb2sb(inp: nl.ndarray, replica_groups: ReplicaGroup, tp_degree: int) → nl.ndarray#

SBUF-to-SBUF all-gather kernel for gathering tensors across ranks.

Parameters:
  • inp (nl.ndarray) – [H, W], Input tensor on HBM, where W is the local width per rank.

  • replica_groups (ReplicaGroup) – ReplicaGroup defining which ranks participate in the collective.

  • tp_degree (int) – Tensor parallelism degree (number of ranks in the group).

Returns:

[H, K], Output tensor on shared HBM containing gathered data from all ranks, where K = W * tp_degree.

Return type:

nl.ndarray

Notes:

  • Input tensor must fit in SBUF (H * W * dtype_size <= SBUF capacity)

  • Output is stored in shared_hbm for cross-rank visibility

  • All ranks receive identical output after the collective

Dimensions:

  • H: Height dimension (partition dimension, typically <= 128)

  • W: Width dimension per rank (local width before gather)
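The SBUF-fit constraint from the notes can be expressed as a preflight check. This is a hypothetical helper, not part of the library: `fits_in_sbuf` and the `SBUF_BYTES` figure are illustrative assumptions, and the actual SBUF capacity varies by device generation (consult the hardware documentation).

```python
# Hypothetical preflight check mirroring the note:
# H * W * dtype_size <= SBUF capacity.
SBUF_BYTES = 24 * 1024 * 1024  # placeholder capacity, not an official figure

def fits_in_sbuf(h, w, dtype_size, sbuf_bytes=SBUF_BYTES):
    """Return True if an [h, w] tensor of the given element size fits in SBUF."""
    return h * w * dtype_size <= sbuf_bytes

# A [128, 1024] fp16 tensor (256 KiB) fits; a [128, 200000] fp32 tensor does not.
```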

allgather_sb2sb_tiled#

nkilib.experimental.collectives.allgather_sb2sb_tiled(inp: nl.ndarray, replica_groups: ReplicaGroup, tp_degree: int) → nl.ndarray#

SBUF-to-SBUF all-gather with tiling and LNC support for larger tensors.

Parameters:
  • inp (nl.ndarray) – [M, K], Input tensor on HBM, where K is the local width per rank.

  • replica_groups (ReplicaGroup) – ReplicaGroup defining which ranks participate in the collective.

  • tp_degree (int) – Tensor parallelism degree (number of ranks in the group).

Returns:

[M, K * tp_degree], Output tensor on shared HBM containing gathered data.

Return type:

nl.ndarray

Notes:

  • TILE_M is capped at 128 (SBUF partition size limit)

  • When launched with LNC grid [lnc], tiles are distributed across LNC cores

  • Each LNC core processes TILES_PER_CORE = NUM_M_TILES // n_prgs tiles

  • Assumes M is evenly divisible by 128 when M > 128

Dimensions:

  • M: Height dimension (tiled along this dimension)

  • K: Width dimension per rank (local width before gather)

  • TILE_M: Tile size along M dimension (capped at 128)
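The tile-distribution arithmetic from the notes can be sketched as follows. The helper name `tiles_per_core` is hypothetical; the sketch assumes, as the notes state, that M is evenly divisible by TILE_M and that NUM_M_TILES divides evenly across the LNC cores (`n_prgs`).

```python
# Sketch of the tiling arithmetic described in the notes above.
TILE_M = 128  # tile height, capped at the SBUF partition size

def tiles_per_core(M, n_prgs):
    """Tiles handled by each LNC core: TILES_PER_CORE = NUM_M_TILES // n_prgs."""
    num_m_tiles = M // TILE_M
    return num_m_tiles // n_prgs

# e.g. M = 512 -> 4 tiles of 128 rows; with 2 LNC cores, each core gets 2 tiles.
```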
