RoPE Kernel API Reference#

Applies Rotary Position Embedding (RoPE) to input embeddings, encoding positional information by rotating embedding dimension pairs using precomputed sine/cosine frequencies.

The kernel supports:

  • Efficient position encoding without absolute position embeddings

  • Optional LNC sharding for parallelization across cores

  • Flexible memory layouts (contiguous or interleaved)

  • Layout conversion strategies (DMA strided access or SBUF matmul)

  • Standalone operation with HBM I/O

  • SBUF-only operation for megakernel fusion

Background#

The RoPE kernel implements Rotary Position Embedding, which encodes each token's position by rotating pairs of embedding dimensions through angles derived from precomputed sine/cosine frequency tables. Because the rotation depends only on position, this yields position-aware attention without a separate absolute position embedding table.

The kernel applies the following transformation:

  • out[even] = x[even] * cos - x[odd] * sin

  • out[odd] = x[odd] * cos + x[even] * sin

The kernel supports two memory layouts for the head dimension: contiguous, where the two rotation halves occupy the first and second halves of d_head, and interleaved, where paired elements alternate (even, odd, even, odd, ...). Layout conversion can be performed using either strided DMA access or SBUF matmul operations.
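The per-pair rotation above can be sketched as a plain-Python reference (illustrative only; `rope_ref` is not part of the kernel API, and the real kernel operates on tiled tensors rather than lists):

```python
import math

def rope_ref(x, cos, sin, contiguous=True):
    """Reference RoPE for a single head vector x of length d_head.

    cos/sin each have length d_head // 2 (one angle per rotated pair).
    contiguous=True pairs (x[i], x[i + d_head//2]);
    contiguous=False pairs adjacent elements (x[2i], x[2i+1]).
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        e, o = (i, i + half) if contiguous else (2 * i, 2 * i + 1)
        # out[even] = x[even] * cos - x[odd] * sin
        out[e] = x[e] * cos[i] - x[o] * sin[i]
        # out[odd]  = x[odd] * cos + x[even] * sin
        out[o] = x[o] * cos[i] + x[e] * sin[i]
    return out

# A 90-degree rotation maps the pair (1, 0) to (approximately) (0, 1).
theta = math.pi / 2
print(rope_ref([1.0, 0.0], [math.cos(theta)], [math.sin(theta)]))
```

For d_head = 2 the two layouts coincide; they diverge only once there is more than one pair to place.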

API Reference#

Source code for this kernel API can be found at: rope.py

RoPE#

nkilib.core.rope.RoPE(x_in, cos, sin, lnc_shard=False, contiguous_layout=True, relayout_in_sbuf=False)#

Apply Rotary Position Embedding (RoPE) to input embeddings. Standalone kernel with HBM I/O and optional LNC sharding.

Parameters:
  • x_in (nl.ndarray) – Input embeddings tensor with shape [d_head, B, n_heads, S] in HBM

  • cos (nl.ndarray) – Cosine frequencies tensor with shape [d_head//2, B, S] in HBM

  • sin (nl.ndarray) – Sine frequencies tensor with shape [d_head//2, B, S] in HBM

  • lnc_shard (bool, optional) – Parallelize across LNC cores by tiling sequence dimension. Default is False.

  • contiguous_layout (bool, optional) – Memory layout in d_head dimension. True for [first_half, second_half] (default, more efficient), False for [even, odd, even, odd, ...] (interleaved).

  • relayout_in_sbuf (bool, optional) – Use SBUF matmul for layout conversion (only for small tensors). Default is False.

Returns:

Output tensor with RoPE applied, with shape [d_head, B, n_heads, S], in HBM

Return type:

nl.ndarray

Constraints:

  • Head dimension (d_head) must be 64 or 128

  • Batch size (B) must be in range (0, 64]

  • Sequence length (S) must be in range (0, 512]

  • Number of heads (n_heads) must be in range (0, 16]

  • When lnc_shard=True, sequence length must be divisible by the number of LNC programs

  • SBUF relayout (relayout_in_sbuf=True) requires B * n_heads * S <= gemm_moving_fmax
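The constraints above can be captured in a small validation helper (a hedged sketch: `check_rope_shapes` and `n_prgs` are illustrative names, and `gemm_moving_fmax` is given a placeholder default here; the real limit comes from the hardware configuration):

```python
def check_rope_shapes(d_head, B, n_heads, S, lnc_shard=False, n_prgs=1,
                      relayout_in_sbuf=False, gemm_moving_fmax=512):
    """Validate the documented RoPE kernel constraints.

    Raises AssertionError if any documented constraint is violated.
    """
    assert d_head in (64, 128), "d_head must be 64 or 128"
    assert 0 < B <= 64, "batch size B must be in (0, 64]"
    assert 0 < S <= 512, "sequence length S must be in (0, 512]"
    assert 0 < n_heads <= 16, "n_heads must be in (0, 16]"
    if lnc_shard:
        assert S % n_prgs == 0, "S must divide evenly across LNC programs"
    if relayout_in_sbuf:
        assert B * n_heads * S <= gemm_moving_fmax, \
            "SBUF relayout only supports small tensors"

# A valid configuration passes silently.
check_rope_shapes(d_head=128, B=4, n_heads=8, S=256, lnc_shard=True, n_prgs=2)
```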

RoPE_sbuf#

nkilib.core.rope.RoPE_sbuf(x_in_sb, cos_sb, sin_sb, x_out_sb, convert_from_interleaved=False)#

Apply RoPE on tensors in SBUF (for megakernel fusion). Helper function that operates entirely in SBUF without HBM I/O.

Parameters:
  • x_in_sb (nl.ndarray) – Input embeddings tensor with shape [d_head, B, n_heads, S] in SBUF

  • cos_sb (nl.ndarray) – Cosine frequencies tensor with shape [d_head//2, B, S] in SBUF

  • sin_sb (nl.ndarray) – Sine frequencies tensor with shape [d_head//2, B, S] in SBUF

  • x_out_sb (nl.ndarray) – Output buffer tensor with shape [d_head, B, n_heads, S] in SBUF

  • convert_from_interleaved (bool, optional) – Convert from interleaved to contiguous layout (only for small tensors: B * n_heads * S <= gemm_moving_fmax). Default is False.

Returns:

The x_out_sb buffer, written in place with the RoPE-applied result

Return type:

nl.ndarray

Constraints:

  • Assumes contiguous layout unless convert_from_interleaved=True

  • For large tensors with interleaved layout, use RoPE() with strided DMA

  • Input and output tensors must have matching dtypes

Implementation Details#

The kernel implementation includes several key optimizations:

  1. Layout Conversion Strategies: Supports two methods for converting between contiguous and interleaved layouts:

    • DMA Strided Access: Uses strided DMA operations with step=2 to gather/scatter even and odd indices separately. Suitable for all tensor sizes.

    • SBUF Matmul: Uses matrix multiplication with a permutation matrix for layout conversion. Limited to small tensors where B * n_heads * S <= gemm_moving_fmax.

  2. LNC Sharding: Supports parallelization across Logical NeuronCore (LNC) cores by tiling the sequence dimension. Each core processes a tile of size S // n_prgs.

  3. Efficient Tensor Operations: Uses tensor_tensor operations with TensorView broadcasting to efficiently apply cos/sin coefficients across the n_heads dimension.

  4. Memory Management: Carefully manages SBUF allocations for intermediate buffers including separate storage for odd half elements to satisfy tensor_tensor alignment requirements.

  5. Permutation Matrix Generation: For SBUF layout conversion, generates a permutation matrix using strided access on an identity matrix, enabling efficient transformation via matrix multiplication.
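The two layout-conversion strategies can be illustrated in plain Python (a conceptual sketch of the index math only; the kernel realizes these with DMA descriptors and matmul instructions, not Python lists, and the function names here are illustrative):

```python
def interleaved_to_contiguous_strided(x):
    """Strided gather: even indices (step 2) form the first half,
    odd indices the second half -- the layout the DMA path produces."""
    return x[0::2] + x[1::2]

def permutation_matrix(d):
    """Select rows of a d x d identity matrix with stride 2, yielding a
    permutation matrix P such that P @ x performs the same relayout."""
    ident = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    return ident[0::2] + ident[1::2]

def interleaved_to_contiguous_matmul(x):
    """Relayout via matrix multiplication with the permutation matrix."""
    P = permutation_matrix(len(x))
    return [sum(p * v for p, v in zip(row, x)) for row in P]

x = [0.0, 1.0, 2.0, 3.0]                      # interleaved: even, odd, even, odd
print(interleaved_to_contiguous_strided(x))   # [0.0, 2.0, 1.0, 3.0]
print(interleaved_to_contiguous_matmul(x))    # [0.0, 2.0, 1.0, 3.0]
```

Both paths produce the same permutation; the strided path works for any tensor size, while the matmul path is limited to the small-tensor regime described above.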
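The LNC sharding scheme (item 2 above) amounts to splitting the sequence dimension into equal tiles, one per program. A minimal sketch, assuming even divisibility as the documented constraint requires (`sequence_tiles` is an illustrative name):

```python
def sequence_tiles(S, n_prgs):
    """Split the sequence dimension into n_prgs equal tiles and return
    their (start, end) offsets. S must divide evenly across programs."""
    assert S % n_prgs == 0, "S must be divisible by the number of programs"
    tile = S // n_prgs
    return [(p * tile, (p + 1) * tile) for p in range(n_prgs)]

print(sequence_tiles(512, 4))  # [(0, 128), (128, 256), (256, 384), (384, 512)]
```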

See Also#

  • RoPE HuggingFace Kernel API Reference