This document is relevant for: Trn2, Trn3

MLP Backward MXFP8 Kernel API Reference#

Return (num_cores, shard_id) for LNC2 sharding.

Background#

This module implements the MXFP8 SwiGLU MLP backward pass as four phases plus two entry-point kernels. The get_program_sharding_info helper returns the LNC2 sharding configuration (num_cores, shard_id), which the backward pass uses to distribute computation across logical cores.

API Reference#

Source code for this kernel API can be found at: mlp_bwd_mxfp8_kernel.py

get_program_sharding_info#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.get_program_sharding_info(run_with_lnc2: bool) → tuple#

Return (num_cores, shard_id) for LNC2 sharding.
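
As a rough illustration of how the returned pair might be consumed, the sketch below derives a per-core row offset (the kind of value passed as s_base or h_base to the phase functions) from (num_cores, shard_id). The helper shard_row_range is hypothetical and not part of nkilib. It assumes a contiguous, even split across cores and assumes that (1, 0) is returned when LNC2 is disabled; neither is stated in this reference.

```python
def shard_row_range(total_rows: int, num_cores: int, shard_id: int) -> tuple[int, int]:
    """Hypothetical helper: return (base, rows) for one core's contiguous shard."""
    if not 0 <= shard_id < num_cores:
        raise ValueError("shard_id must be in [0, num_cores)")
    if total_rows % num_cores != 0:
        raise ValueError("this sketch assumes an even split across cores")
    rows = total_rows // num_cores
    return shard_id * rows, rows


# Assumed pairs: (1, 0) without LNC2; (2, 0) and (2, 1) with LNC2.
for num_cores, shard_id in [(1, 0), (2, 0), (2, 1)]:
    base, rows = shard_row_range(4096, num_cores, shard_id)
    print(f"num_cores={num_cores} shard_id={shard_id} base={base} rows={rows}")
```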

compute_phase1_down_proj_mm_grad_mxfp8#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.compute_phase1_down_proj_mm_grad_mxfp8(output_grad_td: TensorDescriptor, gate_pre_td: TensorDescriptor, gate_act_td: TensorDescriptor, up_td: TensorDescriptor, d_gate_up_td: TensorDescriptor, scratch_td: TensorDescriptor, down_weight_td: TensorDescriptor, s_base: int, dtype: type, fp8_x4_dtype: type, config: MatmulMxfp8KernelConfig = None, spill_reload: bool = True, use_scale_packing: bool = True, run_with_lnc2: bool = True, clamp_limits: ClampLimits = None) → None#

Phase 1: Compute gradient through the down projection and SwiGLU gate.

Parameters:

output_grad_td (TensorDescriptor) – [S, H], incoming gradient (is_f_by_k=True).
gate_pre_td (TensorDescriptor) – [S, I], checkpointed gate pre-activation.
gate_act_td (TensorDescriptor) – [S, I], checkpointed gate post-activation.
up_td (TensorDescriptor) – [S, I], checkpointed up projection.
d_gate_up_td (TensorDescriptor) – [S, 2I], output: fused gate || up gradient.
scratch_td (TensorDescriptor) – [2I, S], output: transposed d_gate || d_up.
down_weight_td (TensorDescriptor) – [I, H], transposed down projection weights (is_f_by_k=True).
s_base (int) – Row offset into the full [S, …] tensors for this LNC core.
dtype (type) – Data type for computation (nl.bfloat16).
fp8_x4_dtype (type) – MXFP8 quantized data type (e.g. float8_e4m3fn_x4).
config (MatmulMxfp8KernelConfig) – Per-phase matmul tiling configuration. Replaces the previous TILES_IN_BLOCK_M/N/K arguments.
clamp_limits (ClampLimits) – Optional activation clamp limits applied during the gradient computation.
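
The math behind phase 1 can be sketched in plain NumPy. This is an unquantized float64 reference under stated assumptions, not the kernel's implementation: it ignores MXFP8 quantization, tiling, sharding, and clamp_limits, and the function names are hypothetical. It assumes the fused layout places the gate gradient in the first I columns and the up gradient in the last I columns, matching the "gate || up" notation above.

```python
import numpy as np


def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def silu_grad(x):
    """d/dx SiLU(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))


def phase1_reference(output_grad, down_weight_T, gate_pre, gate_act, up):
    """Hypothetical unquantized reference for phase 1."""
    d_intermediate = output_grad @ down_weight_T.T       # [S, H] @ [H, I] -> [S, I]
    d_gate = d_intermediate * up * silu_grad(gate_pre)   # gradient w.r.t. gate_pre
    d_up = d_intermediate * gate_act                     # gradient w.r.t. up
    d_gate_up = np.concatenate([d_gate, d_up], axis=1)   # [S, 2I]
    scratch = d_gate_up.T.copy()                         # [2I, S]
    return d_gate_up, scratch


rng = np.random.default_rng(0)
S, H, I = 8, 6, 4
gate_pre = rng.standard_normal((S, I))
d_gate_up, scratch = phase1_reference(
    rng.standard_normal((S, H)),   # output_grad
    rng.standard_normal((I, H)),   # down_weight_T
    gate_pre,
    silu(gate_pre),                # gate_act
    rng.standard_normal((S, I)),   # up
)
print(d_gate_up.shape, scratch.shape)
```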

compute_phase2_hidden_states_grad_mxfp8#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.compute_phase2_hidden_states_grad_mxfp8(hidden_states_grad_td: TensorDescriptor, gate_up_weight_td: TensorDescriptor, d_gate_up_td: TensorDescriptor, s_base: int, dtype: type, fp8_x4_dtype: type, config: MatmulMxfp8KernelConfig = None, spill_reload: bool = True, use_scale_packing: bool = True, run_with_lnc2: bool = True) → None#

Phase 2: Compute gradient w.r.t. input hidden states.

Parameters:

hidden_states_grad_td (TensorDescriptor) – [S, H], output: dL/d_hidden.
gate_up_weight_td (TensorDescriptor) – [H, 2I], transposed fused gate+up projection weights (is_f_by_k=True).
d_gate_up_td (TensorDescriptor) – [S, 2I], fused gate || up gradient (is_f_by_k=True).
s_base (int) – Row offset for this LNC core’s shard.
dtype (type) – Data type for computation (nl.bfloat16).
fp8_x4_dtype (type) – MXFP8 quantized data type (e.g. float8_e4m3fn_x4).
config (MatmulMxfp8KernelConfig) – Per-phase matmul tiling configuration. Replaces the previous TILES_IN_BLOCK_M/N/K arguments.
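
Phase 2 reduces to a single matmul over the listed shapes. The NumPy sketch below is an unquantized reference with hypothetical names. It also illustrates one plausible reading of s_base, namely that each core processes a contiguous block of rows of S; the rows argument is an assumption introduced for illustration.

```python
import numpy as np


def phase2_reference(d_gate_up, gate_up_weight_T, s_base=0, rows=None):
    """Hypothetical unquantized phase 2 for rows [s_base, s_base + rows)."""
    if rows is None:
        rows = d_gate_up.shape[0] - s_base
    shard = d_gate_up[s_base:s_base + rows]   # [rows, 2I]
    return shard @ gate_up_weight_T.T         # [rows, 2I] @ [2I, H] -> [rows, H]


rng = np.random.default_rng(0)
S, H, I = 8, 6, 4
d_gate_up = rng.standard_normal((S, 2 * I))
gate_up_weight_T = rng.standard_normal((H, 2 * I))

# Full result versus two row shards stacked back together.
full = phase2_reference(d_gate_up, gate_up_weight_T)
halves = [
    phase2_reference(d_gate_up, gate_up_weight_T, s_base=base, rows=S // 2)
    for base in (0, S // 2)
]
print(full.shape, np.allclose(np.vstack(halves), full))
```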

compute_phase3_gate_up_weight_grad_mxfp8#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.compute_phase3_gate_up_weight_grad_mxfp8(weight_grad_td: TensorDescriptor, hidden_states_T_td: TensorDescriptor, grad_T_td: TensorDescriptor, dtype: type, fp8_x4_dtype: type, config: MatmulMxfp8KernelConfig = None, spill_reload: bool = True, use_scale_packing: bool = True, run_with_lnc2: bool = True) → None#

Phase 3: Compute gradient w.r.t. gate and up weight matrices as a single matmul.

Parameters:

weight_grad_td (TensorDescriptor) – [2I, H], output: [dW_gate; dW_up].
hidden_states_T_td (TensorDescriptor) – [H, S], transposed input hidden states (is_f_by_k=True).
grad_T_td (TensorDescriptor) – [2I, S], transposed gate+up gradients (is_f_by_k=True, is_col_parallel_sharded=True for LNC2).
dtype (type) – Data type for computation (nl.bfloat16).
fp8_x4_dtype (type) – MXFP8 quantized data type.
config (MatmulMxfp8KernelConfig) – Per-phase matmul tiling configuration. Replaces the previous TILES_IN_BLOCK_M/N/K arguments.

Dimensions:

S: Sequence length.
H: Hidden dimension size.
I: Intermediate dimension size.
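
The "single matmul" in phase 3 can be sketched in NumPy as follows. This is an unquantized reference with hypothetical names, assuming the first I rows of grad_T hold the gate gradient and the last I rows hold the up gradient, so the result stacks dW_gate above dW_up.

```python
import numpy as np


def phase3_reference(hidden_states_T, grad_T):
    """Hypothetical unquantized phase 3: one matmul yields [dW_gate; dW_up]."""
    return grad_T @ hidden_states_T.T   # [2I, S] @ [S, H] -> [2I, H]


rng = np.random.default_rng(0)
S, H, I = 8, 6, 4
hidden_states_T = rng.standard_normal((H, S))
grad_T = rng.standard_normal((2 * I, S))   # rows 0..I-1: d_gate, rows I..2I-1: d_up

weight_grad = phase3_reference(hidden_states_T, grad_T)

# Computing the two halves separately gives the same stacked result.
dW_gate = grad_T[:I] @ hidden_states_T.T
dW_up = grad_T[I:] @ hidden_states_T.T
print(weight_grad.shape, np.allclose(weight_grad, np.vstack([dW_gate, dW_up])))
```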

compute_phase4_down_weight_grad_mxfp8#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.compute_phase4_down_weight_grad_mxfp8(down_weight_grad_td: TensorDescriptor, output_grad_T_td: TensorDescriptor, intermediate_T_td: TensorDescriptor, h_base: int, dtype: type, fp8_x4_dtype: type, config: MatmulMxfp8KernelConfig = None, spill_reload: bool = True, use_scale_packing: bool = True, run_with_lnc2: bool = True) → None#

Phase 4: Compute gradient w.r.t. down projection weight matrix.

Parameters:

down_weight_grad_td (TensorDescriptor) – [H, I], output: dW_down.
output_grad_T_td (TensorDescriptor) – [H, S], transposed output gradient (is_f_by_k=True).
intermediate_T_td (TensorDescriptor) – [I, S], transposed intermediate activations (is_f_by_k=True).
h_base (int) – Row offset into the H dimension for this LNC core.
dtype (type) – Data type for computation (nl.bfloat16).
fp8_x4_dtype (type) – MXFP8 quantized data type.
config (MatmulMxfp8KernelConfig) – Per-phase matmul tiling configuration. Replaces the previous TILES_IN_BLOCK_M/N/K arguments.
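
Phase 4 follows the same pattern, with the row offset applied along H instead of S. The NumPy sketch below is an unquantized reference with hypothetical names; the rows argument and the contiguous split along H are assumptions made for illustration.

```python
import numpy as np


def phase4_reference(output_grad_T, intermediate_T, h_base=0, rows=None):
    """Hypothetical unquantized phase 4 for rows [h_base, h_base + rows) of H."""
    if rows is None:
        rows = output_grad_T.shape[0] - h_base
    shard = output_grad_T[h_base:h_base + rows]   # [rows, S]
    return shard @ intermediate_T.T               # [rows, S] @ [S, I] -> [rows, I]


rng = np.random.default_rng(0)
S, H, I = 8, 6, 4
output_grad_T = rng.standard_normal((H, S))
intermediate_T = rng.standard_normal((I, S))

down_weight_grad = phase4_reference(output_grad_T, intermediate_T)
shards = [
    phase4_reference(output_grad_T, intermediate_T, h_base=base, rows=H // 2)
    for base in (0, H // 2)
]
print(down_weight_grad.shape, np.allclose(np.vstack(shards), down_weight_grad))
```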

mlp_backward_mxfp8_base_nki#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.mlp_backward_mxfp8_base_nki(output_grad_td: TensorDescriptor, gate_pre_td: TensorDescriptor, gate_act_td: TensorDescriptor, up_td: TensorDescriptor, gate_up_weight_T_td: TensorDescriptor, down_weight_T_td: TensorDescriptor, d_gate_up_td: TensorDescriptor, hidden_states_T_td: TensorDescriptor, output_grad_T_td: TensorDescriptor, intermediate_T_td: TensorDescriptor, scratch_td: TensorDescriptor, hidden_states_grad_td: TensorDescriptor, weight_grad_td: TensorDescriptor, down_weight_grad_td: TensorDescriptor, run_with_lnc2: bool = True, matmul_config: MlpBwdMatmulConfig = None, fp8_x4_dtype: type = float8_e4m3fn_x4, spill_reload: bool = True, use_scale_packing: bool = True, clamp_limits: ClampLimits = None) → tuple#

MXFP8 SwiGLU MLP backward pass (base kernel).

Parameters:

output_grad_td (TensorDescriptor) – [S, H], incoming gradient dL/d_output (is_f_by_k=True).
gate_pre_td (TensorDescriptor) – [S, I], gate pre-activation (before SiLU).
gate_act_td (TensorDescriptor) – [S, I], gate post-activation (SiLU(gate_pre)).
up_td (TensorDescriptor) – [S, I], up projection (hidden @ W_up.T).
gate_up_weight_T_td (TensorDescriptor) – [H, 2I], transposed fused gate+up projection weights.
down_weight_T_td (TensorDescriptor) – [I, H], transposed down projection weights.
d_gate_up_td (TensorDescriptor) – [S, 2I], scratch: fused gate || up gradient.
hidden_states_T_td (TensorDescriptor) – [H, S], pre-transposed input hidden states.
output_grad_T_td (TensorDescriptor) – [H, S], pre-transposed output gradient.
intermediate_T_td (TensorDescriptor) – [I, S], pre-transposed intermediate activations.
scratch_td (TensorDescriptor) – [2I, S], scratch: transposed d_gate || d_up.
hidden_states_grad_td (TensorDescriptor) – [S, H], output: dL/d_hidden.
weight_grad_td (TensorDescriptor) – [2I, H], output: fused [dW_gate; dW_up].
down_weight_grad_td (TensorDescriptor) – [H, I], output: dL/dW_down.
run_with_lnc2 (bool) – Whether to shard across 2 LNC cores.
matmul_config (MlpBwdMatmulConfig) – Per-phase matmul tiling configuration. Replaces the previous per-phase phase*_tiles_* arguments.
fp8_x4_dtype (type) – MXFP8 quantized data type.
clamp_limits (ClampLimits) – Optional activation clamp limits.

Returns:

(hidden_states_grad [S, H], gate_up_weight_grad [2I, H], down_weight_grad [H, I]).

Return type:

tuple of nl.ndarray

Dimensions:

S: Sequence length.
H: Hidden dimension size.
I: Intermediate dimension size.

mlp_backward_mxfp8_nki#

nkilib.experimental.mlp_mxfp8.mlp_bwd_mxfp8.mlp_backward_mxfp8_nki(output_grad: nl.ndarray, hidden_states: nl.ndarray, down_proj_weight: nl.ndarray = None, gate_up_weights: nl.ndarray = None, gate_up_weight_T: nl.ndarray = None, gate_up_weight_T_scales: nl.ndarray = None, gate_up_weights_scales: nl.ndarray = None, down_weight_T: nl.ndarray = None, down_weight_T_scales: nl.ndarray = None, output_grad_T: nl.ndarray = None, output_grad_T_scales: nl.ndarray = None, hidden_states_T: nl.ndarray = None, hidden_states_T_scales: nl.ndarray = None, gate_pre: nl.ndarray = None, gate_act: nl.ndarray = None, up: nl.ndarray = None, intermediate: nl.ndarray = None, run_with_lnc2: bool = True, matmul_config: MlpBwdMatmulConfig = None, fp8_x4_dtype: type = float8_e4m3fn_x4, spill_reload: bool = True, use_scale_packing: bool = True, clamp_limits: ClampLimits = None) → tuple#

MXFP8 SwiGLU MLP backward pass with activation checkpointing support.

Parameters:

output_grad (nl.ndarray) – [S, H], incoming gradient dL/d_output.
hidden_states (nl.ndarray) – [S, H], original input (for recompute + weight grad).
down_proj_weight (nl.ndarray) – [I, H], down projection weights (phase 1).
gate_up_weights (nl.ndarray) – [2I, H], fused gate+up weights (for recompute).
gate_up_weight_T (nl.ndarray) – [H, 2I], transposed fused gate+up projection weights (phase 2). Optionally pre-quantized MXFP8 via gate_up_weight_T_scales.
gate_up_weight_T_scales (nl.ndarray) – MXFP8 scales for pre-quantized gate_up_weight_T.
gate_up_weights_scales (nl.ndarray) – MXFP8 scales for pre-quantized gate_up_weights (recompute RHS).
down_weight_T (nl.ndarray) – [I, H], transposed down projection weights. Optionally pre-quantized via down_weight_T_scales.
down_weight_T_scales (nl.ndarray) – MXFP8 scales for pre-quantized down_weight_T.
output_grad_T (nl.ndarray) – [H, S], pre-transposed output gradient. Optionally pre-quantized via output_grad_T_scales.
output_grad_T_scales (nl.ndarray) – MXFP8 scales for pre-quantized output_grad_T.
hidden_states_T (nl.ndarray) – [H, S], pre-transposed input hidden states. Optionally pre-quantized via hidden_states_T_scales.
hidden_states_T_scales (nl.ndarray) – MXFP8 scales for pre-quantized hidden_states_T.
gate_pre (nl.ndarray) – [S, I], checkpointed gate pre-activation, or None.
gate_act (nl.ndarray) – [S, I], checkpointed SiLU(gate_pre), or None.
up (nl.ndarray) – [S, I], checkpointed up projection, or None.
intermediate (nl.ndarray) – [S, I], checkpointed gate_act * up, or None.
run_with_lnc2 (bool) – Whether to shard across 2 LNC cores.
matmul_config (MlpBwdMatmulConfig) – Per-phase matmul tiling configuration. Replaces the previous per-phase phase*_tiles_* / recompute_tiles_* arguments.
fp8_x4_dtype (type) – MXFP8 quantized data type.
clamp_limits (ClampLimits) – Optional activation clamp limits.

Returns:

(hidden_states_grad [S, H], gate_up_weight_grad [2I, H], down_proj_weight_grad [H, I]).

Return type:

tuple of nl.ndarray

Dimensions:

S: Sequence length.
H: Hidden dimension size.
I: Intermediate dimension size.
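
When the checkpointed activations are passed as None, they have to be rebuilt from hidden_states and gate_up_weights. The NumPy sketch below shows what such a recompute might look like in unquantized form. The names recompute_activations and resolve_checkpoints are hypothetical, the gate-then-up ordering of the fused weights is assumed from the "[dW_gate; dW_up]" notation, and whether the kernel recomputes all four tensors together or individually is not stated in this reference.

```python
import numpy as np


def recompute_activations(hidden_states, gate_up_weights):
    """Hypothetical recompute of the four checkpointed tensors."""
    gate_up = hidden_states @ gate_up_weights.T       # [S, H] @ [H, 2I] -> [S, 2I]
    inter = gate_up.shape[1] // 2
    gate_pre, up = gate_up[:, :inter], gate_up[:, inter:]
    gate_act = gate_pre / (1.0 + np.exp(-gate_pre))   # SiLU(gate_pre)
    intermediate = gate_act * up
    return gate_pre, gate_act, up, intermediate


def resolve_checkpoints(hidden_states, gate_up_weights,
                        gate_pre=None, gate_act=None, up=None, intermediate=None):
    """Use the supplied checkpoints if all are present, otherwise recompute (assumed policy)."""
    supplied = (gate_pre, gate_act, up, intermediate)
    if all(t is not None for t in supplied):
        return supplied
    return recompute_activations(hidden_states, gate_up_weights)


rng = np.random.default_rng(0)
S, H, I = 8, 6, 4
hidden_states = rng.standard_normal((S, H))
gate_up_weights = rng.standard_normal((2 * I, H))

recomputed = resolve_checkpoints(hidden_states, gate_up_weights)
print([t.shape for t in recomputed])
```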
