RoPE Kernel API Reference#

Applies Rotary Position Embedding (RoPE) to input embeddings, encoding positional information by rotating embedding dimension pairs using precomputed sine/cosine frequencies.

The kernel supports:

  • Efficient position encoding without absolute position embeddings

  • Optional LNC sharding for parallelization across cores

  • Flexible memory layouts (contiguous or interleaved)

  • Layout conversion strategies (DMA strided access or SBUF matmul)

  • Standalone operation with HBM I/O

  • SBUF-only operation for megakernel fusion

Background#

The RoPE kernel implements Rotary Position Embedding, which encodes each token's position by rotating pairs of embedding dimensions through angles derived from precomputed sine/cosine frequency tables. Because the rotation depends only on position, this yields position-aware attention without a separate absolute position embedding table.

The kernel applies the following transformation:

  • out[even] = x[even] * cos - x[odd] * sin

  • out[odd] = x[odd] * cos + x[even] * sin

The kernel supports two memory layouts for the head dimension: contiguous, where the two rotation halves occupy the first and second halves of d_head, and interleaved, where paired elements alternate (even, odd, even, odd, ...). Layout conversion can be performed using either strided DMA access or SBUF matmul operations.
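The per-pair rotation above can be sketched as a plain-Python reference (illustrative only; `rope_ref` is not part of the kernel API, and the real kernel operates on tiled tensors rather than lists):

```python
import math

def rope_ref(x, cos, sin, contiguous=True):
    """Reference RoPE for a single head vector x of length d_head.

    cos/sin each have length d_head // 2 (one angle per rotated pair).
    contiguous=True pairs (x[i], x[i + d_head//2]);
    contiguous=False pairs adjacent elements (x[2i], x[2i+1]).
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        e, o = (i, i + half) if contiguous else (2 * i, 2 * i + 1)
        # out[even] = x[even] * cos - x[odd] * sin
        out[e] = x[e] * cos[i] - x[o] * sin[i]
        # out[odd]  = x[odd] * cos + x[even] * sin
        out[o] = x[o] * cos[i] + x[e] * sin[i]
    return out

# A 90-degree rotation maps the pair (1, 0) to (approximately) (0, 1).
theta = math.pi / 2
print(rope_ref([1.0, 0.0], [math.cos(theta)], [math.sin(theta)]))
```

For d_head = 2 the two layouts coincide; they diverge only once there is more than one pair to place.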

API Reference#

Source code for this kernel API can be found at: rope.py

RoPE#

nkilib.core.rope.RoPE(x_in, cos, sin, lnc_shard=False, contiguous_layout=True, relayout_in_sbuf=False)#

Apply Rotary Position Embedding (RoPE) to input embeddings. Standalone kernel with HBM I/O and optional LNC sharding.

Parameters:
  • x_in (nl.ndarray) – Input embeddings tensor with shape [d_head, B, n_heads, S] in HBM

  • cos (nl.ndarray) – Cosine frequencies tensor with shape [d_head//2, B, S] in HBM

  • sin (nl.ndarray) – Sine frequencies tensor with shape [d_head//2, B, S] in HBM

  • lnc_shard (bool, optional) – Parallelize across LNC cores by tiling sequence dimension. Default is False.

  • contiguous_layout (bool, optional) – Memory layout in d_head dimension. True for [first_half, second_half] (default, more efficient), False for [even, odd, even, odd, ...] (interleaved).

  • relayout_in_sbuf (bool, optional) – Use SBUF matmul for layout conversion (only for small tensors). Default is False.

Returns:

Output tensor with RoPE applied, with shape [d_head, B, n_heads, S], in HBM

Return type:

nl.ndarray

Constraints:

  • Head dimension (d_head) must be 64 or 128

  • Batch size (B) must be in range (0, 64]

  • Sequence length (S) must be in range (0, 512]

  • Number of heads (n_heads) must be in range (0, 16]

  • When lnc_shard=True, sequence length must be divisible by the number of LNC programs

  • SBUF relayout (relayout_in_sbuf=True) requires B * n_heads * S <= gemm_moving_fmax
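The constraints above can be captured in a small validation helper (a hedged sketch: `check_rope_shapes` and `n_prgs` are illustrative names, and `gemm_moving_fmax` is given a placeholder default here; the real limit comes from the hardware configuration):

```python
def check_rope_shapes(d_head, B, n_heads, S, lnc_shard=False, n_prgs=1,
                      relayout_in_sbuf=False, gemm_moving_fmax=512):
    """Validate the documented RoPE kernel constraints.

    Raises AssertionError if any documented constraint is violated.
    """
    assert d_head in (64, 128), "d_head must be 64 or 128"
    assert 0 < B <= 64, "batch size B must be in (0, 64]"
    assert 0 < S <= 512, "sequence length S must be in (0, 512]"
    assert 0 < n_heads <= 16, "n_heads must be in (0, 16]"
    if lnc_shard:
        assert S % n_prgs == 0, "S must divide evenly across LNC programs"
    if relayout_in_sbuf:
        assert B * n_heads * S <= gemm_moving_fmax, \
            "SBUF relayout only supports small tensors"

# A valid configuration passes silently.
check_rope_shapes(d_head=128, B=4, n_heads=8, S=256, lnc_shard=True, n_prgs=2)
```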

RoPE_sbuf#

nkilib.core.rope.RoPE_sbuf(x_in_sb, cos_sb, sin_sb, x_out_sb, convert_from_interleaved=False)#

Apply RoPE on tensors in SBUF (for megakernel fusion). Helper function that operates entirely in SBUF without HBM I/O.

Parameters:
  • x_in_sb (nl.ndarray) – Input embeddings tensor with shape [d_head, B, n_heads, S] in SBUF

  • cos_sb (nl.ndarray) – Cosine frequencies tensor with shape [d_head//2, B, S] in SBUF

  • sin_sb (nl.ndarray) – Sine frequencies tensor with shape [d_head//2, B, S] in SBUF

  • x_out_sb (nl.ndarray) – Output buffer tensor with shape [d_head, B, n_heads, S] in SBUF

  • convert_from_interleaved (bool, optional) – Convert from interleaved to contiguous layout (only for small tensors: B * n_heads * S <= gemm_moving_fmax). Default is False.

Returns:

The x_out_sb buffer, written in place with the RoPE-applied result

Return type:

nl.ndarray

Constraints:

  • Assumes contiguous layout unless convert_from_interleaved=True

  • For large tensors with interleaved layout, use RoPE() with strided DMA

  • Input and output tensors must have matching dtypes

Implementation Details#

The kernel implementation includes several key optimizations:

  1. Layout Conversion Strategies: Supports two methods for converting between contiguous and interleaved layouts:

    • DMA Strided Access: Uses strided DMA operations with step=2 to gather/scatter even and odd indices separately. Suitable for all tensor sizes.

    • SBUF Matmul: Uses matrix multiplication with a permutation matrix for layout conversion. Limited to small tensors where B * n_heads * S <= gemm_moving_fmax.

  2. LNC Sharding: Supports parallelization across Logical NeuronCore (LNC) cores by tiling the sequence dimension. Each core processes a tile of size S // n_prgs.

  3. Efficient Tensor Operations: Uses tensor_tensor operations with TensorView broadcasting to efficiently apply cos/sin coefficients across the n_heads dimension.

  4. Memory Management: Carefully manages SBUF allocations for intermediate buffers including separate storage for odd half elements to satisfy tensor_tensor alignment requirements.

  5. Permutation Matrix Generation: For SBUF layout conversion, generates a permutation matrix using strided access on an identity matrix, enabling efficient transformation via matrix multiplication.
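The two layout-conversion strategies can be illustrated in plain Python (a conceptual sketch of the index math only; the kernel realizes these with DMA descriptors and matmul instructions, not Python lists, and the function names here are illustrative):

```python
def interleaved_to_contiguous_strided(x):
    """Strided gather: even indices (step 2) form the first half,
    odd indices the second half -- the layout the DMA path produces."""
    return x[0::2] + x[1::2]

def permutation_matrix(d):
    """Select rows of a d x d identity matrix with stride 2, yielding a
    permutation matrix P such that P @ x performs the same relayout."""
    ident = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    return ident[0::2] + ident[1::2]

def interleaved_to_contiguous_matmul(x):
    """Relayout via matrix multiplication with the permutation matrix."""
    P = permutation_matrix(len(x))
    return [sum(p * v for p, v in zip(row, x)) for row in P]

x = [0.0, 1.0, 2.0, 3.0]                      # interleaved: even, odd, even, odd
print(interleaved_to_contiguous_strided(x))   # [0.0, 2.0, 1.0, 3.0]
print(interleaved_to_contiguous_matmul(x))    # [0.0, 2.0, 1.0, 3.0]
```

Both paths produce the same permutation; the strided path works for any tensor size, while the matmul path is limited to the small-tensor regime described above.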
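The LNC sharding scheme (item 2 above) amounts to splitting the sequence dimension into equal tiles, one per program. A minimal sketch, assuming even divisibility as the documented constraint requires (`sequence_tiles` is an illustrative name):

```python
def sequence_tiles(S, n_prgs):
    """Split the sequence dimension into n_prgs equal tiles and return
    their (start, end) offsets. S must divide evenly across programs."""
    assert S % n_prgs == 0, "S must be divisible by the number of programs"
    tile = S // n_prgs
    return [(p * tile, (p + 1) * tile) for p in range(n_prgs)]

print(sequence_tiles(512, 4))  # [(0, 128), (128, 256), (256, 384), (384, 512)]
```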

See Also#

  • RoPE HuggingFace Kernel API Reference