SbufManager (Allocator) API Reference#
This topic provides the API reference for the SbufManager utility. It provides stack-based SBUF memory allocation with scope management and multi-buffering support.
When to Use#
Use SbufManager when you need:
Deterministic memory layout: Manual control over SBUF addresses for predictable memory placement
Scope-based allocation: Automatic cleanup of temporary buffers when a computation phase ends
Multi-buffering in loops: Ping-pong buffers for overlapping compute and memory operations
Memory debugging: Detailed logging of allocation patterns and usage statistics
SbufManager is particularly useful in complex kernels with multiple computation phases where different buffers are needed at different times.
API Reference#
Source code: aws-neuron/nki-library
SbufManager#
- class nkilib.core.utils.allocator.SbufManager(sb_lower_bound, sb_upper_bound, logger=None, use_auto_alloc=False, default_stack_alloc=True)#
Stack-based SBUF memory manager with scope support.
- Parameters:
sb_lower_bound (int) – Lower bound of available SBUF memory region.
sb_upper_bound (int) – Upper bound of available SBUF memory region.
logger (Logger, optional) – Optional logger instance for allocation tracking.
use_auto_alloc (bool) – If True, delegates address assignment to compiler. Default False.
default_stack_alloc (bool) – If True,
alloc()uses stack; if False, uses heap. Default True.
- open_scope(interleave_degree=1, name='')#
Opens a new allocation scope. Allocations within this scope are freed when the scope closes.
- close_scope()#
Closes the current scope and frees all stack allocations made within it.
- Return type:
None
- increment_section()#
Advances to the next buffer section within a multi-buffer scope. When all sections are used, wraps back to the first section.
- Return type:
None
- alloc_stack(shape, dtype, buffer=nl.sbuf, name=None, base_partition=0, align=None)#
Allocates a tensor on the stack (freed when scope closes).
- Parameters:
- Returns:
Allocated SBUF tensor.
- Return type:
nl.ndarray
- alloc_heap(shape, dtype, buffer=nl.sbuf, name=None, base_partition=0, align=None)#
Allocates a tensor on the heap (must be manually freed with
pop_heap()).Parameters are identical to
alloc_stack().- Return type:
nl.ndarray
- alloc(shape, dtype, buffer=nl.sbuf, name=None, base_partition=0, align=None)#
Allocates a tensor on the stack or heap, depending on the
default_stack_allocsetting.Parameters are identical to
alloc_stack().- Return type:
nl.ndarray
- pop_heap()#
Frees the most recently allocated heap tensor.
- Return type:
None
- get_used_space()#
Returns the number of bytes currently used by stack and heap allocations.
- Return type:
- get_stack_curr_addr()#
Returns the current stack address. Not supported in auto-allocation mode.
- Return type:
- get_heap_curr_addr()#
Returns the current heap address. Not supported in auto-allocation mode.
- Return type:
- align_stack_curr_addr(align=32)#
Aligns the current stack address to the given alignment. Not supported in auto-allocation mode.
- Parameters:
align (int) – Alignment in bytes. Default 32.
- Return type:
None
- set_name_prefix(prefix)#
Sets a prefix string prepended to all subsequent allocation names.
- Parameters:
prefix (str) – Prefix string.
- Return type:
None
- flush_logs()#
Prints buffered allocation logs in tree format.
- Return type:
None
create_auto_alloc_manager#
- nkilib.core.utils.allocator.create_auto_alloc_manager(logger=None)#
Creates an SbufManager that delegates address assignment to the compiler.
- Parameters:
logger (Logger, optional) – Optional logger instance.
- Returns:
Auto-allocation SbufManager instance.
- Return type:
Examples#
Without SbufManager (Manual Allocation)#
import nki.language as nl
@nki.jit
def kernel_without_sbm(input_hbm, output_hbm):
addr = 0
# Heap-like allocation at end of SBUF
heap_addr = nl.tile_size.total_available_sbuf_size - 512
weights = nl.ndarray((128, 256), dtype=nl.bfloat16, buffer=nl.sbuf,
address=(0, heap_addr))
print(f"weights.address = {weights.address}") # (0, 261632)
# Outer scope
buf1 = nl.ndarray((128, 512), dtype=nl.bfloat16, buffer=nl.sbuf,
address=(0, addr))
print(f"buf1.address = {buf1.address}") # (0, 0)
addr += 512 * 2 # 1024
# Inner scope
inner_start = addr
buf2 = nl.ndarray((128, 256), dtype=nl.bfloat16, buffer=nl.sbuf,
address=(0, addr))
print(f"buf2.address = {buf2.address}") # (0, 1024)
addr += 256 * 2 # 1536
buf3 = nl.ndarray((128, 256), dtype=nl.bfloat16, buffer=nl.sbuf,
address=(0, addr))
print(f"buf3.address = {buf3.address}") # (0, 1536)
# End inner scope - must manually reset
addr = inner_start # 1024
# Back in outer - reuse inner's memory
buf4 = nl.ndarray((128, 512), dtype=nl.bfloat16, buffer=nl.sbuf,
address=(0, addr))
print(f"buf4.address = {buf4.address}") # (0, 1024)
With SbufManager#
import nki.language as nl
from nkilib.core.utils.allocator import SbufManager
@nki.jit
def kernel_with_sbm(input_hbm, output_hbm):
sbm = SbufManager(0, nl.tile_size.total_available_sbuf_size)
weights = sbm.alloc_heap((128, 256), nl.bfloat16, name="weights")
print(f"weights.address = {weights.address}") # (0, 261632)
sbm.open_scope(name="outer")
buf1 = sbm.alloc_stack((128, 512), nl.bfloat16, name="buf1")
print(f"buf1.address = {buf1.address}") # (0, 0)
sbm.open_scope(name="inner")
buf2 = sbm.alloc_stack((128, 256), nl.bfloat16, name="buf2")
print(f"buf2.address = {buf2.address}") # (0, 1024)
buf3 = sbm.alloc_stack((128, 256), nl.bfloat16, name="buf3")
print(f"buf3.address = {buf3.address}") # (0, 1536)
sbm.close_scope()
buf4 = sbm.alloc_stack((128, 512), nl.bfloat16, name="buf4")
print(f"buf4.address = {buf4.address}") # (0, 1024)
sbm.close_scope()
sbm.pop_heap()
Both produce identical memory layouts:
weights.address = (0, 261632) # heap at top
buf1.address = (0, 0) # stack grows up
buf2.address = (0, 1024) # inner scope
buf3.address = (0, 1536) # inner scope
buf4.address = (0, 1024) # reuses inner's memory
Multi-Buffering Example#
import nki.language as nl
from nkilib.core.utils.allocator import SbufManager
@nki.jit
def kernel_multibuffer(input_hbm, output_hbm, N):
sbm = SbufManager(0, nl.tile_size.total_available_sbuf_size)
# Double-buffering: 2 sections alternate
sbm.open_scope(interleave_degree=2, name="double_buffer")
for i in nl.affine_range(N):
# Allocates to section 0, then 1, then 0, then 1...
buf = sbm.alloc_stack((128, 512), nl.bfloat16)
# Load to buf[current], compute on buf[previous]
sbm.increment_section()
sbm.close_scope()
Debug output for N=4:
[SBM] Allocations:
▶ SCOPE 'double_buffer' [interleave=2] @ 0
├── (unnamed): 1024 B @ 0 (128, 512) bfloat16
├── ↳ section: 1/2 @ 1024
├── (unnamed): 1024 B @ 1024 (128, 512) bfloat16
├── ↻ section: 0/2 @ 0
├── (unnamed): 1024 B @ 0 (128, 512) bfloat16
├── ↳ section: 1/2 @ 1024
└── (unnamed): 1024 B @ 1024 (128, 512) bfloat16
◀ END 'double_buffer' freed=2048 B
Note how allocations alternate between addresses 0 and 1024.
See Also#
TensorView - Zero-copy tensor view operations