This document is relevant for: Inf2, Trn1, Trn1n
nki.kernels.fused_self_attn_for_SD_small_head_size
- nki.kernels.fused_self_attn_for_SD_small_head_size(q_ref, k_ref, v_ref, out_ref, use_causal_mask=False, mixed_percision=True)
Fused self-attention kernel for small-head-size Stable Diffusion workloads.
Computes softmax(QK^T)V. For decoder models, a causal mask can optionally be applied by setting use_causal_mask=True. The kernel does not include the QKV projection, output projection, dropout, residual connection, etc.
This kernel is designed for Stable Diffusion models where n_heads is less than or equal to 128. An assertion is raised if n_heads does not satisfy this requirement.
- IO tensor layouts:
q_ref: shape (bs, n_heads, seq_q)
k_ref: shape (bs, seq_k, n_heads)
v_ref: shape (bs, seq_v, n_heads)
out_ref: shape (bs, seq_q, n_heads)
The names seq_q and seq_k are used only for clarity; this kernel requires seq_q == seq_k.
- IO tensor dtypes:
This kernel assumes all IO tensors have the same dtype.
If mixed_percision is True, all Tensor Engine operations are performed in bfloat16 and accumulation is performed in float32. Otherwise, the intermediates have the same dtype as the inputs.
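The computation and tensor layouts described above can be sketched as a plain NumPy reference, which may be useful for checking kernel output on the host. This is a minimal sketch, not the kernel's implementation: the function name reference_self_attn is hypothetical, it assumes softmax(QK^T)V exactly as written (no additional scaling factor), and it does all arithmetic in float32 rather than modelling the bfloat16 path selected by mixed_percision.

```python
import numpy as np


def reference_self_attn(q, k, v, use_causal_mask=False):
    """Host-side reference for softmax(QK^T)V (hypothetical helper).

    Layouts follow the documented IO shapes:
      q:   (bs, n_heads, seq_q)   (Q is stored transposed)
      k:   (bs, seq_k, n_heads)
      v:   (bs, seq_v, n_heads)
      out: (bs, seq_q, n_heads)
    """
    bs, n_heads, seq_q = q.shape
    # Documented constraints: n_heads <= 128 and seq_q == seq_k.
    assert n_heads <= 128, "n_heads must be <= 128"
    assert k.shape == (bs, seq_q, n_heads), "k must be (bs, seq_k, n_heads) with seq_k == seq_q"
    assert v.shape == (bs, seq_q, n_heads), "v must be (bs, seq_v, n_heads) with seq_v == seq_q"

    # scores[b, i, j] = sum_d q[b, d, i] * k[b, j, d]
    scores = np.einsum("bdq,bkd->bqk", q, k).astype(np.float32)

    if use_causal_mask:
        # Mask out keys that lie after the query position (j > i).
        future = np.triu(np.ones((seq_q, seq_q), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    out = np.einsum("bqk,bkd->bqd", probs, v.astype(np.float32))
    return out.astype(q.dtype)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bs, n_heads, seq = 2, 64, 16
    q = rng.standard_normal((bs, n_heads, seq)).astype(np.float32)
    k = rng.standard_normal((bs, seq, n_heads)).astype(np.float32)
    v = rng.standard_normal((bs, seq, n_heads)).astype(np.float32)
    print(reference_self_attn(q, k, v, use_causal_mask=True).shape)
```

When comparing against the kernel with mixed_percision=True, a loose tolerance is advisable, since this sketch does not reproduce bfloat16 rounding.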