nki.isa.core_barrier
- nki.isa.core_barrier(data, cores, engine=engine.unknown, name=None)
Synchronize execution across multiple NeuronCores by implementing a barrier mechanism.
Note
Available only on NeuronCore-v2 or newer.
This instruction creates a synchronization point that all specified NeuronCores must reach before any of them can proceed. The barrier is implemented using a semaphore-based protocol: each NeuronCore writes a semaphore to every other core (remote semaphore update) and then waits for the other cores' semaphores before continuing execution (local semaphore wait).
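The signal-then-wait protocol described above can be illustrated with a host-side analogy. The sketch below is not NKI code and does not reflect how the hardware implements the instruction; it uses Python threads as stand-ins for NeuronCores and `threading.Semaphore` as a stand-in for the hardware semaphores. The names `make_semaphores`, `core_barrier_sim`, and `run_demo` are hypothetical and exist only for this illustration.

```python
import threading


def make_semaphores(num_cores=2):
    """One semaphore per simulated core, each starting at zero."""
    return [threading.Semaphore(0) for _ in range(num_cores)]


def core_barrier_sim(core_id, semaphores):
    """Single-use barrier: signal every other core, then wait for each of them."""
    others = [i for i in range(len(semaphores)) if i != core_id]
    for other in others:
        semaphores[other].release()      # analogous to the remote semaphore update
    for _ in others:
        semaphores[core_id].acquire()    # analogous to the local semaphore wait


def run_demo(num_cores=2):
    """Run one thread per simulated core and record when each passes the barrier."""
    semaphores = make_semaphores(num_cores)
    log = []
    log_lock = threading.Lock()

    def worker(core_id):
        with log_lock:
            log.append(("before", core_id))
        core_barrier_sim(core_id, semaphores)
        with log_lock:
            log.append(("after", core_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for phase, core_id in run_demo(2):
        print(phase, core_id)
```

In this analogy no thread can record its "after" entry until every thread has recorded its "before" entry, which is the ordering guarantee a barrier provides.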
The use case is when two NeuronCores both need to write to disjoint portions of a shared HBM tensor (data) and both need to consume the tensor after both cores have finished writing into it. In this case, both cores can perform the write to data in HBM using nisa.dma_copy, and then signal to each other when the write operation is complete using nisa.core_barrier.
This instruction is only allowed in NeuronCore-v3 or newer when LNC (Logical NeuronCore) is enabled. Currently only cores=(0, 1) is supported. This allows synchronization between exactly two NeuronCores that share the same HBM stack.
The data parameter represents the shared data that all cores need to synchronize on. This must be data in shared HBM that multiple cores are accessing.
The engine parameter allows specifying which engine inside the NeuronCores should execute the barrier instruction (that is, the remote semaphore update and local semaphore wait).
- Parameters:
data – the shared data that all cores need to synchronize on; must be data in shared HBM
cores – a tuple of core indices to synchronize; only (0, 1) is supported when LNC2 is enabled
engine – the engine to execute the barrier instruction on; defaults to automatic selection
Example:
# Synchronize between two cores after each core writes to half of a shared tensor.
# nl and nisa are the usual aliases for nki.language and nki.isa;
# batch_size, hidden_dim, and core_id are assumed to be defined by the enclosing kernel.
shared_tensor = nl.ndarray((batch_size, hidden_dim), dtype=nl.float32, buffer=nl.shared_hbm)

# Each core writes to half of the tensor
if core_id == 0:
    # Core 0 writes to first half
    core0_data = nl.ndarray((batch_size // 2, hidden_dim), dtype=nl.float32, buffer=nl.sbuf)
    nisa.dma_copy(dst=shared_tensor[:batch_size // 2, :], src=core0_data)
else:
    # Core 1 writes to second half
    core1_data = nl.ndarray((batch_size // 2, hidden_dim), dtype=nl.float32, buffer=nl.sbuf)
    nisa.dma_copy(dst=shared_tensor[batch_size // 2:, :], src=core1_data)

# Wait until both cores have finished writing
nisa.core_barrier(data=shared_tensor, cores=(0, 1))

# Now both cores can safely read the complete tensor
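For readers without access to Neuron hardware, the write-half, barrier, read-all pattern in the example can be mimicked on the host. The sketch below is only an analogy under stated assumptions: a nested Python list stands in for the shared HBM tensor, two threads stand in for cores 0 and 1, and `threading.Barrier` stands in for the barrier instruction. The function `run_two_writers` and its parameters are hypothetical and not part of NKI.

```python
import threading


def run_two_writers(rows=4, cols=3):
    """Each simulated core fills its half, waits at the barrier, then reads everything."""
    shared = [[None] * cols for _ in range(rows)]   # stand-in for the shared HBM tensor
    barrier = threading.Barrier(2)                  # stand-in for the core barrier
    sums = [None, None]                             # what each core reads afterwards

    def core(core_id):
        half = rows // 2
        start = core_id * half
        # Write only this core's disjoint half; core 0 writes 1s, core 1 writes 2s.
        for r in range(start, start + half):
            shared[r] = [core_id + 1] * cols
        # Neither core reads until both have finished writing.
        barrier.wait()
        sums[core_id] = sum(sum(row) for row in shared)

    threads = [threading.Thread(target=core, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums


if __name__ == "__main__":
    print(run_two_writers())
```

Without the `barrier.wait()` call, a thread could read rows the other thread has not yet written; in this sketch those rows would still hold `None` and the sum would fail, which is the host-side counterpart of reading a partially written tensor.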