This document is relevant for: Inf2, Trn1, Trn2
nki.kernels.allocated_fused_self_attn_for_SD_small_head_size
- nki.kernels.allocated_fused_self_attn_for_SD_small_head_size(q_ref, k_ref, v_ref, use_causal_mask=False, mixed_precision=True)
Allocated fused self-attention kernel for the small-head-size Stable Diffusion workload.
Computes (softmax(Q.T@K)V).T. The unusual layout is chosen to avoid transposes as much as possible and to simplify debugging. The kernel uses the direct allocation API and implements double buffering to achieve better performance than automatic allocation. As of Neuron SDK 2.21, it achieves 18% better performance than the auto-allocated equivalent. To see the performance gap, you can use the force_auto_alloc decorator to override the manual allocation and benchmark the difference (see the benchmarking sketch after the IO description below).
This kernel is designed for Stable Diffusion models where n_heads equals 128. The sequence length must be divisible by 1024 and smaller than 5120. An assertion is raised if n_heads or the sequence length does not satisfy these requirements; the restrictions exist to simplify the address calculations in the allocations.
- IO tensor layouts:
  - q_ptr: shape (bs, d_heads, seq_q)
  - k_ptr: shape (bs, d_heads, seq_k)
  - v_ptr: shape (bs, seq_v, n_heads)
  - out_ptr: shape (bs, d_heads, seq_q)
  - We use seq_q and seq_k for clarity only; this kernel requires seq_q == seq_k.
- IO tensor dtypes:
  - This kernel assumes all IO tensors have the same dtype.
  - If mixed_precision is True, all Tensor Engine operations are performed in bfloat16 and accumulation is performed in float32. Otherwise, the intermediates use the same dtype as the inputs.
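For concreteness, here is a minimal invocation sketch. The import paths, the nki.baremetal invocation style, and the absence of an explicit launch grid are assumptions to verify against your Neuron SDK version:

```python
import numpy as np
from neuronxcc import nki
# Assumed import path; check your Neuron SDK installation.
from neuronxcc.nki.kernels.attention import (
    allocated_fused_self_attn_for_SD_small_head_size,
)

bs, d_heads, seqlen = 1, 128, 4096  # seqlen divisible by 1024 and < 5120

# Layouts per the description above (here d_heads == n_heads == 128).
q = np.random.randn(bs, d_heads, seqlen).astype(np.float16)  # (bs, d_heads, seq_q)
k = np.random.randn(bs, d_heads, seqlen).astype(np.float16)  # (bs, d_heads, seq_k)
v = np.random.randn(bs, seqlen, d_heads).astype(np.float16)  # (bs, seq_v, n_heads)

# nki.baremetal compiles the kernel and runs it on a NeuronDevice with
# NumPy inputs/outputs. Depending on how the kernel reads nl.program_id,
# a launch grid (e.g. kernel[bs](...)) may be required instead.
kernel = nki.baremetal(allocated_fused_self_attn_for_SD_small_head_size)
out = kernel(q, k, v)  # (bs, d_heads, seq_q)
```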
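To make the layout conventions concrete, a pure-NumPy reference of (softmax(Q.T@K)V).T is sketched below. The on-device kernel may additionally fold in a softmax scaling factor, so treat this as a layout illustration rather than a bit-exact oracle:

```python
import numpy as np

def reference_attention(q, k, v):
    """(softmax(Q.T@K)V).T per batch, in the documented layouts.

    q, k: (bs, d_heads, seq); v: (bs, seq_v, n_heads).
    Returns (bs, d_heads, seq_q), matching out_ptr.
    """
    out = np.empty(q.shape, dtype=np.float32)
    for b in range(q.shape[0]):
        scores = q[b].astype(np.float32).T @ k[b].astype(np.float32)  # (seq_q, seq_k)
        scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)    # softmax along seq_k
        out[b] = (probs @ v[b].astype(np.float32)).T  # back to (d_heads, seq_q)
    return out

# e.g. np.testing.assert_allclose(out, reference_attention(q, k, v), atol=1e-2)
```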
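To reproduce the comparison against automatic allocation mentioned above, something along these lines should be close; the force_auto_alloc import path, its composability with nki.benchmark, and the benchmark keyword arguments are assumptions to check against your SDK release:

```python
from neuronxcc import nki
# Assumed import path for the decorator mentioned above.
from neuronxcc.nki.compiler import force_auto_alloc
from neuronxcc.nki.kernels.attention import (
    allocated_fused_self_attn_for_SD_small_head_size as attn_kernel,
)

# Benchmark the kernel with its manual (direct) allocations intact.
manual = nki.benchmark(warmup=5, iters=10)(attn_kernel)
manual(q, k, v)  # q, k, v as in the invocation sketch above

# force_auto_alloc overrides the kernel's manual allocations so the compiler
# falls back to automatic allocation; benchmark that variant for comparison.
auto = nki.benchmark(warmup=5, iters=10)(force_auto_alloc(attn_kernel))
auto(q, k, v)
```

nki.benchmark reports latency statistics for each variant; per the note above, the manually allocated kernel should come out roughly 18% ahead as of Neuron SDK 2.21.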
This document is relevant for: Inf2, Trn1, Trn2