This document is relevant for: Inf2, Trn1, Trn2

nki.kernels.allocated_fused_self_attn_for_SD_small_head_size

nki.kernels.allocated_fused_self_attn_for_SD_small_head_size(q_ref, k_ref, v_ref, use_causal_mask=False, mixed_precision=True)

Allocated fused self attention kernel for small head size Stable Diffusion workload.

Computes (softmax(Q.T@K)V).T. The unusual layout is chosen to avoid transposes as much as possible, which simplifies debugging. The kernel uses the direct allocation API and implements double buffering to achieve better performance than automatic allocation. As of NeuronSDK 2.21, it achieves 18% better performance than the auto-allocated equivalent. To see the performance gap, you can use the force_auto_alloc decorator to override the manual allocation and benchmark the difference.
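
For reference, the formula can be reproduced on the host with plain NumPy. The sketch below is illustrative only (the name reference_attention is ours, not part of the kernel API); it follows the (softmax(Q.T@K)V).T formula above per batch element and, like that formula, applies no softmax scaling. The causal-mask convention shown (each query attends only to keys at or before its position) is an assumption.

    import numpy as np

    def reference_attention(q, k, v, use_causal_mask=False):
        # q, k: (d_heads, seq) -- the transposed layout described below
        # v:    (seq, d_heads)
        scores = q.T @ k                                 # (seq_q, seq_k)
        if use_causal_mask:
            mask = np.tril(np.ones(scores.shape, dtype=bool))
            scores = np.where(mask, scores, -np.inf)     # mask future keys
        scores = scores - scores.max(axis=-1, keepdims=True)  # stability
        probs = np.exp(scores)
        probs = probs / probs.sum(axis=-1, keepdims=True)     # row softmax
        return (probs @ v).T                             # (d_heads, seq_q)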

This kernel is designed for Stable Diffusion models where the head dimension (d_heads) is equal to 128. The sequence length must be divisible by 1024 and smaller than 5120. An assertion is raised if the head dimension or the sequence length does not satisfy these requirements. These restrictions exist to simplify the address calculation in the allocations.
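
As a sketch of that documented contract (not the kernel's actual assertion code), the preconditions amount to checks like the following, with shapes matching the IO layouts listed next:

    def check_preconditions(q_ref, k_ref, v_ref):
        bs, d_heads, seqlen = q_ref.shape
        assert d_heads == 128, "head dimension must be 128"
        assert seqlen % 1024 == 0 and seqlen < 5120, \
            "seqlen must be a multiple of 1024 and smaller than 5120"
        assert k_ref.shape == (bs, d_heads, seqlen)   # seq_q == seq_k
        assert v_ref.shape == (bs, seqlen, d_heads)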

IO tensor layouts:
  • q_ref: shape (bs, d_heads, seq_q)

  • k_ref: shape (bs, d_heads, seq_k)

  • v_ref: shape (bs, seq_v, d_heads)

  • returned output: shape (bs, d_heads, seq_q)

  • We use seq_q and seq_k just for clarity; this kernel requires seq_q == seq_k (see the shape sketch below)
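
A minimal host-side sketch of conforming inputs, assuming NumPy arrays; the exact import path and kernel launcher (for example, nki.jit or nki.baremetal) depend on your NeuronSDK setup, so the call itself is left as a comment:

    import numpy as np

    bs, d_heads, seqlen = 1, 128, 4096   # multiple of 1024, smaller than 5120

    q = np.random.randn(bs, d_heads, seqlen).astype(np.float32)  # (bs, d_heads, seq_q)
    k = np.random.randn(bs, d_heads, seqlen).astype(np.float32)  # (bs, d_heads, seq_k)
    v = np.random.randn(bs, seqlen, d_heads).astype(np.float32)  # (bs, seq_v, d_heads)

    # Hypothetical invocation; the output has shape (bs, d_heads, seq_q):
    # out = allocated_fused_self_attn_for_SD_small_head_size(q, k, v)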

IO tensor dtypes:
  • This kernel assumes all IO tensors have the same dtype

  • If mixed_precision is True, all Tensor Engine operations are performed in bfloat16 and accumulation is performed in float32. Otherwise, the intermediates are kept in the same dtype as the inputs.
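
To make the mixed_precision contract concrete, here is a rough NumPy emulation. float16 stands in for bfloat16, which plain NumPy lacks; on device, the Tensor Engine's low-precision dtype is bfloat16:

    import numpy as np

    def matmul_mixed_precision(a, b, low_dtype=np.float16):
        # Round both operands to the low-precision dtype, then accumulate
        # the products in float32 -- mirroring bfloat16 multiplies with
        # float32 accumulation.
        a_lp = a.astype(low_dtype).astype(np.float32)
        b_lp = b.astype(low_dtype).astype(np.float32)
        return a_lp @ b_lp   # float32 result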
