nki.isa.core_barrier#

nki.isa.core_barrier(data, cores, engine=engine.unknown, name=None)[source]#

Synchronize execution across multiple NeuronCores by implementing a barrier mechanism.

Note

Available only on NeuronCore-v3 or newer with LNC (Logical NeuronCore) enabled.

This instruction creates a synchronization point that all specified NeuronCores must reach before any of them can proceed. The barrier uses a semaphore-based protocol: each NeuronCore writes a semaphore to every other core (remote semaphore update) and then waits for the other cores’ semaphores before continuing execution (local semaphore wait).
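The semaphore protocol above can be illustrated with a plain-Python threading analogy. This is not NKI code; `make_core_barrier` and the `Event`-based arrival flags are hypothetical stand-ins for the hardware semaphores, used only to show the arrive-then-wait structure:

```python
import threading

def make_core_barrier(num_cores):
    # One arrival flag per core stands in for the hardware semaphores.
    arrived = [threading.Event() for _ in range(num_cores)]

    def barrier(core_id):
        arrived[core_id].set()        # analog of the remote semaphore update
        for i, flag in enumerate(arrived):
            if i != core_id:
                flag.wait()           # analog of the local semaphore wait
    return barrier
```

After `barrier(core_id)` returns, every participating "core" is guaranteed to have reached the barrier, so writes performed before the call are complete from the perspective of all callers.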

A typical use case: two NeuronCores each write to disjoint portions of a shared HBM tensor (data), and both must consume the full tensor only after both writes have completed. Each core performs its write to data in HBM using nisa.dma_copy, then signals completion to the other core with nisa.core_barrier.

This instruction is only allowed in NeuronCore-v3 or newer when LNC (Logical NeuronCore) is enabled. Currently only cores=(0, 1) is supported. This allows synchronization between exactly two NeuronCores that share the same HBM stack.

The data parameter represents the shared data that all cores need to synchronize on. This must be data in shared HBM that multiple cores are accessing.

The engine parameter allows specifying which engine inside the NeuronCores should execute the barrier instruction (that is, the remote semaphore update and local semaphore wait).

Parameters:
  • data – the shared data that all cores need to synchronize on; must be data in shared HBM

  • cores – a tuple of core indices to synchronize; only (0, 1) is supported when LNC2 is enabled

  • engine – the engine to execute the barrier instruction on; defaults to automatic selection

Example:

# Synchronize between two cores after each writes to half of a shared tensor.
# core_id, batch_size, and hidden_dim are assumed to be defined by the surrounding kernel.
shared_tensor = nl.ndarray((batch_size, hidden_dim), dtype=nl.float32, buffer=nl.shared_hbm)

# Each core writes to its half of the tensor
if core_id == 0:
    # Core 0 writes the first half (placeholder tile; in practice this holds the core's computed result)
    core0_data = nl.ndarray((batch_size // 2, hidden_dim), dtype=nl.float32, buffer=nl.sbuf)
    nisa.dma_copy(dst=shared_tensor[:batch_size // 2, :], src=core0_data)
else:
    # Core 1 writes the second half (placeholder tile; in practice this holds the core's computed result)
    core1_data = nl.ndarray((batch_size // 2, hidden_dim), dtype=nl.float32, buffer=nl.sbuf)
    nisa.dma_copy(dst=shared_tensor[batch_size // 2:, :], src=core1_data)

# Block until both cores have finished their writes
nisa.core_barrier(data=shared_tensor, cores=(0, 1))

# Now both cores can safely read the complete tensor