RoPE Kernel API Reference#
Applies Rotary Position Embedding (RoPE) to input embeddings, encoding positional information by rotating embedding dimension pairs using precomputed sine/cosine frequencies.
The kernel supports:
Efficient position encoding without absolute position embeddings
Optional LNC sharding for parallelization across cores
Flexible memory layouts (contiguous or interleaved)
Layout conversion strategies (DMA strided access or SBUF matmul)
Standalone operation with HBM I/O
SBUF-only operation for megakernel fusion
Background#
The RoPE kernel implements Rotary Position Embedding, which encodes positional information by rotating pairs of embedding dimensions using precomputed sine/cosine frequencies. This approach enables position-aware attention mechanisms without requiring absolute position embeddings.
The kernel applies the following transformation:
out[even] = x[even] * cos - x[odd] * sin
out[odd]  = x[odd]  * cos + x[even] * sin
The kernel supports two memory layouts for the head dimension: contiguous (first half, second half) and interleaved (even, odd, even, odd). Layout conversion can be performed using either strided DMA access or SBUF matmul operations.
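The rotation above, applied under both pair layouts, can be sketched in NumPy (a reference illustration of the math, not the NKI implementation; function names are for this example only):

```python
import numpy as np

def rope_interleaved(x, cos, sin):
    """Apply RoPE when dimension pairs are interleaved: [even, odd, even, odd, ...]."""
    even, odd = x[0::2], x[1::2]          # split into rotation pairs
    out = np.empty_like(x)
    out[0::2] = even * cos - odd * sin    # rotated even components
    out[1::2] = odd * cos + even * sin    # rotated odd components
    return out

def rope_contiguous(x, cos, sin):
    """Apply RoPE when pairs are split into halves: [first_half, second_half]."""
    d = x.shape[0]
    first, second = x[: d // 2], x[d // 2 :]
    return np.concatenate([first * cos - second * sin,
                           second * cos + first * sin])
```

The two variants are related by a fixed permutation of the head dimension: gathering the even indices into the first half and the odd indices into the second half turns the interleaved form into the contiguous one. Because each pair is rotated by an angle with cos² + sin² = 1, the transformation preserves the norm of the input.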
API Reference#
Source code for this kernel API can be found at: rope.py
RoPE#
- nkilib.core.rope.RoPE(x_in, cos, sin, lnc_shard=False, contiguous_layout=True, relayout_in_sbuf=False)#
Apply Rotary Position Embedding (RoPE) to input embeddings. Standalone kernel with HBM I/O and optional LNC sharding.
- Parameters:
  - x_in (nl.ndarray) – Input embeddings tensor with shape [d_head, B, n_heads, S] in HBM
  - cos (nl.ndarray) – Cosine frequencies tensor with shape [d_head//2, B, S] in HBM
  - sin (nl.ndarray) – Sine frequencies tensor with shape [d_head//2, B, S] in HBM
  - lnc_shard (bool, optional) – Parallelize across LNC cores by tiling the sequence dimension. Default is False.
  - contiguous_layout (bool, optional) – Memory layout in the d_head dimension. True for [first_half, second_half] (default, more efficient), False for [even, odd, even, odd, ...] (interleaved).
  - relayout_in_sbuf (bool, optional) – Use SBUF matmul for layout conversion (only for small tensors). Default is False.
- Returns:
  RoPE-applied output tensor with shape [d_head, B, n_heads, S] in HBM
- Return type:
  nl.ndarray
Constraints:
- Head dimension (d_head) must be 64 or 128
- Batch size (B) must be in range (0, 64]
- Sequence length (S) must be in range (0, 512]
- Number of heads (n_heads) must be in range (0, 16]
- When lnc_shard=True, the sequence length must be divisible by the number of programs
- SBUF relayout (relayout_in_sbuf=True) requires B * n_heads * S <= gemm_moving_fmax
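The documented shapes and broadcasting behavior can be checked with a NumPy reference that mirrors the kernel's contiguous layout (illustrative only; this is not the NKI kernel, and the function name is for this example):

```python
import numpy as np

def rope_reference(x_in, cos, sin):
    """NumPy reference for RoPE in the contiguous layout.

    x_in has shape [d_head, B, n_heads, S]; cos and sin have shape
    [d_head//2, B, S] and are broadcast across the n_heads axis.
    """
    d_head = x_in.shape[0]
    first, second = x_in[: d_head // 2], x_in[d_head // 2 :]
    c = cos[:, :, None, :]    # insert n_heads axis for broadcasting
    s = sin[:, :, None, :]
    return np.concatenate([first * c - second * s,
                           second * c + first * s], axis=0)

# Shapes within the documented constraints: d_head=64, B=2, n_heads=4, S=128
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 2, 4, 128))
theta = rng.uniform(0, np.pi, size=(32, 2, 128))
out = rope_reference(x, np.cos(theta), np.sin(theta))
assert out.shape == x.shape
```

Broadcasting cos/sin across n_heads corresponds to the kernel's use of tensor_tensor operations with TensorView broadcasting: the same positional frequencies apply to every head.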
RoPE_sbuf#
- nkilib.core.rope.RoPE_sbuf(x_in_sb, cos_sb, sin_sb, x_out_sb, convert_from_interleaved=False)#
Apply RoPE on tensors in SBUF (for megakernel fusion). Helper function that operates entirely in SBUF without HBM I/O.
- Parameters:
  - x_in_sb (nl.ndarray) – Input embeddings tensor with shape [d_head, B, n_heads, S] in SBUF
  - cos_sb (nl.ndarray) – Cosine frequencies tensor with shape [d_head//2, B, S] in SBUF
  - sin_sb (nl.ndarray) – Sine frequencies tensor with shape [d_head//2, B, S] in SBUF
  - x_out_sb (nl.ndarray) – Output buffer tensor with shape [d_head, B, n_heads, S] in SBUF
  - convert_from_interleaved (bool, optional) – Convert from interleaved to contiguous layout (only for small tensors: B * n_heads * S <= gemm_moving_fmax). Default is False.
- Returns:
Output tensor with RoPE applied (modified in-place)
- Return type:
nl.ndarray
Constraints:
- Assumes contiguous layout unless convert_from_interleaved=True
- For large tensors with interleaved layout, use RoPE() with strided DMA
- Input and output tensors must have matching dtypes
Implementation Details#
The kernel implementation includes several key optimizations:
Layout Conversion Strategies: Supports two methods for converting between contiguous and interleaved layouts:
- DMA Strided Access: Uses strided DMA operations with step=2 to gather/scatter even and odd indices separately. Suitable for all tensor sizes.
- SBUF Matmul: Uses matrix multiplication with a permutation matrix for layout conversion. Limited to small tensors where B * n_heads * S <= gemm_moving_fmax.
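The strided-access strategy can be illustrated in NumPy, where step-2 slicing plays the role of the strided DMA (a sketch of the access pattern, not the kernel's DMA code):

```python
import numpy as np

def interleaved_to_contiguous(x):
    # Gather even and odd indices with stride 2, then stack the two halves.
    return np.concatenate([x[0::2], x[1::2]], axis=0)

def contiguous_to_interleaved(x):
    # Scatter the two halves back into even/odd positions.
    d = x.shape[0]
    out = np.empty_like(x)
    out[0::2] = x[: d // 2]
    out[1::2] = x[d // 2 :]
    return out
```

The two functions are exact inverses, so a round trip recovers the original layout.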
LNC Sharding: Supports parallelization across Logical NeuronCore (LNC) cores by tiling the sequence dimension. Each core processes a tile of size S // n_prgs.
Efficient Tensor Operations: Uses tensor_tensor operations with TensorView broadcasting to apply cos/sin coefficients efficiently across the n_heads dimension.
Memory Management: Carefully manages SBUF allocations for intermediate buffers, including separate storage for odd-half elements to satisfy tensor_tensor alignment requirements.
Permutation Matrix Generation: For SBUF layout conversion, generates a permutation matrix using strided access on an identity matrix, enabling efficient transformation via matrix multiplication.
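The permutation-matrix approach can be sketched the same way: strided slicing of an identity matrix yields a permutation matrix P such that multiplying by P performs the de-interleaving as a single matmul (NumPy illustration of the idea, not NKI code):

```python
import numpy as np

def make_deinterleave_perm(d):
    # Strided rows of the identity: even-index rows first, then odd-index rows.
    eye = np.eye(d)
    return np.concatenate([eye[0::2], eye[1::2]], axis=0)

d = 8
P = make_deinterleave_perm(d)
x = np.arange(d, dtype=float)
# P @ x gathers even indices into the first half, odd indices into the second.
assert np.allclose(P @ x, np.concatenate([x[0::2], x[1::2]]))
# P is orthogonal, so its transpose undoes the conversion.
assert np.allclose(P.T @ (P @ x), x)
```

Expressing the relayout as a matmul lets the conversion run on the tensor engine, which is why this strategy is bounded by gemm_moving_fmax on the moving-operand size.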
See Also#
RoPE HuggingFace Kernel API Reference