This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Deep dive: Validate model ports with the Equivalence skill#
Why read this guide? This guide is intended for ML engineers who need to verify that a ported NxD Inference model produces numerically correct output compared to its HuggingFace reference. It explains the Equivalence skill — an AI agent-driven workflow that progressively validates a model port through eight stages of structural analysis, component-level testing, fault localization, debugging, and end-to-end accuracy verification.
How to use this guide: If you are porting a model from scratch, start with the Autoport skill first. Use this guide after you have a completed port and need to verify its correctness. Skip to the workflow stages if you already understand the environment setup and R-ratio methodology.
This topic explores the Equivalence skill in depth, covering structural scaffolding, the 3-tensor R-ratio method, component-level testing, fault localization, patching, end-to-end comparison, and downstream evaluation. You need experience with PyTorch model development, the NxD Inference library structure, and basic numerical analysis to fully understand this content.
Prerequisites#
Before you start, you must be familiar with the following:
NxD Inference library overview: How to build and deploy models using NxD Inference. See Neuron Agentic Development.
PyTorch model architecture: Transformer building blocks (attention, MLP, embeddings) and how HuggingFace models are structured.
Neuron compilation model: How
torch-neuronxtraces Python code into HLO and compiles it to NEFF for NeuronCores. See NxD Inference Features Configuration Guide.Tensor parallelism concepts: How models are sharded across NeuronCores. See Parallelism Techniques for LLM Inference.
Model porting workflow: How models are ported to NxD Inference. See Deep dive: Port HuggingFace models to Neuron with the Autoport skill.
Overview#
The Equivalence skill validates functional and numerical equivalence between a source (reference) neural network implementation and a target (ported) implementation. It does not perform the actual porting work — it verifies that an existing port is correct through progressive stages of testing, localization, and debugging.
The skill is designed for workflows where a model has been migrated between:
Frameworks: HuggingFace to NxD Inference
Hardware targets: CPU to Neuron (Trainium)
Precision regimes: FP32 to BF16, FP32 to MXFP4/INT8
Execution modes: single TP degree to multi-TP degree sharding
It works with dense transformer models (decoder-only, encoder-decoder), Mixture of Experts (MoE) models, models with novel attention mechanisms (sliding window, grouped query attention, multi-latent attention), models requiring weight dequantization (MXFP4, INT8), and cross-framework ports with precision regime changes.
The workflow has eight stages:
Structural scaffolding — build model trees and create a component mapping between source and target architectures.
Smoke testing — quick liveness check using greedy token matching to verify the port produces coherent output.
Component-level testing — isolate each mapped component using the 3-tensor R-ratio method to identify which components diverge.
Fault localization — automatically classify root causes and rank suspect components.
Debugging and patching — fix failing components with standalone monkey patches without modifying the original port.
End-to-end comparison — verify the assembled model with real weights under teacher forcing using R-ratio, cosine similarity, and KL divergence.
Downstream evaluation — confirm production readiness using industry-standard benchmarks.
Hardware and software requirements#
Instance type:
trn1.32xlarge(32 NeuronCores, 16 GB per core) or equivalent Trainium instance. CPU-mode testing (Stages 0-4) can run on any instance.Neuron SDK: Version 2.28+ with the following system packages installed:
aws-neuronx-dkmsaws-neuronx-runtime-libaws-neuronx-collectivesaws-neuronx-tools
Python: 3.10 or later.
Neuron SDK Python packages:
neuronx-distributed-inference(0.8.x)neuronx-distributed(0.17.x)transformers(4.57+)torch(2.x+)numpymatplotlib
Model validation package: The
model_validationpackage from NeuroborosFoundations must be available onPYTHONPATH.Model weights: Downloaded from HuggingFace Hub or available locally.
Compiled model: A compiled (NEFF) version of the target model for device-mode testing (Stages 5-7).
Disk space: Sufficient for model weights, compiled artifacts, and experiment outputs (typically 2-5x the model size).
Note
Stages 0 through 4 run in CPU mode (NXD_CPU_MODE=1, TP=1) and do not require
Neuron hardware. Only Stages 5-7 require a compiled model and Neuron device access.
Inputs#
Before the agent begins the workflow, it collects these required parameters from the user:
Parameter |
Description |
|---|---|
|
Path to reference model weights in HuggingFace format. |
|
Path to the compiled/quantized target model (NEFF artifacts). |
|
Path to the target port’s modeling Python file. |
|
Inner model class name (extends |
|
|
|
|
|
Path to Python virtual environment with torch and neuronx packages. |
|
Path to the |
|
Experiment output directory for all artifacts. |
Key concepts#
The R-ratio metric#
The R-ratio is the core metric used throughout the skill to quantify divergence:
R = ||target - source_fp32||_F / (||source_bf16 - source_fp32||_F + ε)
Where:
source_fp32is the reference implementation running in FP32 (ground truth).source_bf16is the reference implementation running in BF16 (precision baseline).targetis the target port running in BF16 (under test).|| . ||_Fis the Frobenius norm (L2 norm of the flattened tensor).εis a small constant to avoid division by zero.
The denominator measures the expected precision loss from FP32 to BF16 — the irreducible error from the precision downgrade. The numerator measures the actual error of the port. An R-ratio near 1.0 means the port introduces no additional error beyond the precision baseline.
R-ratio |
Interpretation |
|---|---|
≈ 1.0 |
Port matches precision baseline. No porting bug. |
< 1.2 |
Within acceptable tolerance. Minor TP rounding or kernel differences. |
1.2 – 3.0 |
Possible porting bug. Missing multiplier or precision ordering issue. |
3.0 – 10.0 |
Likely porting bug. Missing multiplier or precision ordering issue. |
>> 10 |
Missing algorithm or wrong formula (e.g., YaRN scaling absent from RoPE). |
>> 100 |
Completely wrong computation. |
< 1.0 |
Over-precision. Extra |
The 3-tensor comparison method#
Every component test produces three outputs from the same input:
ref_fp32 — source (HuggingFace) model class, FP32 weights, FP32 input.
ref_bf16 — source model class, BF16 weights, BF16 input.
target_bf16 — target (Neuron port) model class, BF16 weights, BF16 input.
All three share the same FP32 weights (with BF16 versions created by downcasting). This isolates the porting error from precision error: the denominator of R captures only precision drift, while the numerator captures precision drift plus any porting bugs.
Method |
When to use |
Baseline |
|---|---|---|
3-tensor |
Reference can run in FP32 |
Precision error from FP32 to BF16 downgrade |
2-tensor |
Reference can only run in target precision |
Machine-epsilon perturbation baseline |
Expected structural differences#
When comparing HuggingFace and NxD Inference model trees, these differences are expected and do not indicate bugs:
HuggingFace |
Neuron Port |
Reason |
|---|---|---|
|
|
Tensor parallel sharding |
|
|
Embedding sharded across TP ranks |
Flat q/k/v projections |
Wrapped in |
NxDI attention framework |
Single |
Per-layer |
Implementation choice |
|
|
Framework normalization |
Fused |
Split |
TP requires separate sharding |
(none) |
|
Neuron-specific infrastructure |
Differences that do indicate bugs:
Missing modules (norm layer absent in port)
Extra unexpected modules with no framework explanation
Wrong nesting (MLP inside attention instead of parallel)
Mismatched layer counts (47 instead of 48)
Missing activation functions
Workflow#
The skill follows a strict 8-stage sequential workflow. Stages must not be skipped, reordered, or parallelized.
Stage 0: Structural scaffolding#
Build the alignment map between source and target model hierarchies.
Purpose: Understand both model structures and create a formal mapping between their components.
Build model trees#
source ${VENV}/bin/activate
PYTHONPATH=${SCRIPTS_DIR} python3 ${SCRIPTS_DIR}/run_stage0.py \
--source-model-path ${SOURCE_MODEL_PATH} \
--target-model-path ${SOURCE_MODEL_PATH} \
--target-module-file ${TARGET_MODELING_FILE} \
--target-inner-class ${TARGET_INNER_CLASS} \
--target-config-class ${TARGET_CONFIG_CLASS} \
--output-dir ${EXP_DIR}/model_tree
The script instantiates the target in CPU mode (NXD_CPU_MODE=1, TP=1) to produce a
structure-only comparison without device dependencies.
Outputs:
${EXP_DIR}/model_tree/
├── model_tree_source.json # Compressed source tree
├── model_tree_source_full.json # Uncompressed source tree
├── model_tree_source_pretty.txt # ASCII pretty-print
├── model_tree_source_flat_paths.txt # Flat module path list
├── model_tree_target.json # Compressed target tree
├── model_tree_target_full.json # Uncompressed target tree
├── model_tree_target_pretty.txt # ASCII pretty-print
└── model_tree_target_flat_paths.txt # Flat module path list
Create component mapping#
Manually compare the printed trees and create ${EXP_DIR}/component_mapping.json. This
file maps each source module (or group of modules) to its target equivalent(s).
The mapping uses an array format where each entry is a pair of [source_modules, target_modules]
with indexed variables ({i} for layer indices) and reasoning:
One-to-one:
["model.layers.{i}.norm"]maps to["model.language_model.layers.{i}.norm"]One-to-many (fused):
["model.q_proj", "model.k_proj", "model.v_proj"]maps to["model.qkv_proj"]No counterpart: Document the reasoning (framework scaffolding, TP-specific structure)
Detect CPU vs device class divergence#
python3 ${SCRIPTS_DIR}/detect_class_divergence.py \
--target-module-file ${TARGET_MODELING_FILE} \
--output ${EXP_DIR}/class_divergence_report.json
This scans the target modeling file for patterns where different classes are used in CPU mode versus device mode:
Factory functions (
get_rmsnorm_cls()) that branch onNXD_CPU_MODEoron_cpuConditional assignments (
self.norm = ClassA() if cpu else ClassB())NKI kernel imports (e.g.,
LlamaRMSNormon CPU,CustomRMSNormon device)
Components with class divergence require dual testing in Stage 2 — one test for the CPU class and one for the device class.
Stage 1: Smoke test#
Quick liveness check — does the port produce coherent output?
PYTHONPATH=${MODEL_VALIDATION_DIR} python3 ${SCRIPTS_DIR}/run_stage1.py \
--model-path ${SOURCE_MODEL_PATH} \
--compiled-model-path ${COMPILED_MODEL_PATH} \
--model-class ${TARGET_MODELING_FILE}:${TARGET_CAUSAL_CLASS} \
--config-class ${TARGET_MODELING_FILE}:${TARGET_CONFIG_CLASS} \
--num-tokens 32 \
--output ${EXP_DIR}/results/stage1.json
The script runs 10-prompt greedy token matching and computes per-position distribution metrics: cosine similarity, KL divergence, top-k agreement, and relative L2 error.
Interpreting results:
Match rate |
Meaning |
Action |
|---|---|---|
> 30% |
Liveness threshold met |
Continue to Stage 2 |
100% on most prompts |
Normal BF16 precision drift |
Continue to Stage 2 |
< 30% |
Catastrophic failure |
Proceed to Stage 2 for localization |
High cosine similarity (> 0.95) with low token match suggests margin-sensitive divergence — the top two token probabilities are close, and BF16 rounding flips the argmax. This is expected behavior and not a bug.
Stage 2: Component-level testing#
Test each mapped component using the 3-tensor R-ratio method to isolate which component(s) diverge.
Set up test infrastructure#
Copy the comparison utility into the test directory:
cp ${SCRIPTS_DIR}/tensor_compare.py ${EXP_DIR}/tests/
Create
conftest.pyfrom the provided template. Fill in model-specific constants:HIDDEN_SIZE,NUM_HEADS,NUM_KV_HEADS,VOCAB_SIZE,INTERMEDIATE_SIZE, and other values from the model’sconfig.json.Write one test file per component, ordered bottom-up from simplest to most complex:
test_00_rmsnorm.py,test_01_embedding.py,test_02_linear.py, etc.
Write component tests#
Each test follows the 3-tensor pattern:
def test_component_name():
torch.manual_seed(42)
weight_fp32 = torch.randn(OUT_DIM, IN_DIM)
# ref_fp32: Source class, FP32 weights
ref_fp32 = SourceClass(config)
ref_fp32.weight.data.copy_(weight_fp32)
# ref_bf16: Source class, BF16 weights
ref_bf16 = SourceClass(config)
ref_bf16.weight = nn.Parameter(weight_fp32.to(torch.bfloat16))
# target_bf16: Target port's class, BF16 weights
target_bf16 = TargetClass(neuron_config)
target_bf16.weight = nn.Parameter(weight_fp32.to(torch.bfloat16))
target_bf16.eval()
x = torch.randn(BS, SEQ_LEN, IN_DIM)
with torch.no_grad():
out1 = ref_fp32(x.float()).float()
out2 = ref_bf16(x.to(torch.bfloat16)).float()
out3_raw = target_bf16(x.to(torch.bfloat16))
out3 = out3_raw[0].float() if isinstance(out3_raw, tuple) else out3_raw.float()
result = compare_3tensors(out1, out2, out3)
assert check_3tensor_result(result, "component_name", TOLERANCE_RATIO)
Critical rules for test writing:
ref_fp32andref_bf16use the source model’s class (HuggingFace).target_bf16uses the target port’s actual class (may differ from source).All three share the same FP32 weights (BF16 versions created by downcasting).
Use
nn.Parameter()replacement forColumnParallelLinear(not.copy_()).Set
.eval()mode on Neuron modules withpad=True.Handle tuple outputs:
out = out[0] if isinstance(out, tuple) else out.Align shapes before comparison for fused components (QKV, gate/up projections).
Cast all outputs to
.float()before comparison.Check
class_divergence_report.json— write dual tests for components with CPU/device class differences.
Run component tests#
NXD_CPU_MODE=1 python3 ${SCRIPTS_DIR}/run_stage2.py \
--tests-dir ${EXP_DIR}/tests \
--tau-r 1.2 \
--output ${EXP_DIR}/results/stage2.json
Decision: If all R < 1.2, proceed to Stage 5 (E2E). If any R >= 1.2, proceed to Stage 3 for fault localization.
Stage 3: Fault localization#
Analyze Stage 2 R-ratios to identify where divergence originates and classify root causes.
python3 ${SCRIPTS_DIR}/run_stage3.py \
--stage2-output ${EXP_DIR}/results/stage2.json \
--tau-r 1.2 \
--output ${EXP_DIR}/results/stage3.json
Change-point detection#
The script identifies two divergence patterns:
Spike: High R at a single point that returns to baseline at the next component. Indicates an alignment artifact or transient error.
Step: High R that persists for all subsequent components. Indicates a functional bug whose error propagates downstream.
The earliest step-pattern point is the primary fault candidate.
Root-cause classification#
R magnitude |
Likely cause |
Examples |
|---|---|---|
R >> 10 |
Missing algorithm or wrong formula |
YaRN scaling absent from RoPE, MoE routing ignored, masking wrong |
1.2 < R < 3 |
Precision ordering or missing multiplier |
Variance computed in BF16 instead of FP32, attention scaling omitted |
R < 1.0 |
Over-precision (unintended FP32 upcast) |
Extra |
The output is a ranked list of suspect components with: component name, R-ratio, divergence pattern (spike or step), root cause label, description, and mapped module paths.
Stage 4: Debug and patch#
Fix failing components with standalone monkey patches. This is the only stage where code changes are made, and the original port is never modified directly.
Debugging workflow#
Read the Stage 3 fault localization report.
Compare the HuggingFace and Neuron implementations side-by-side:
Config parameters consumed by HuggingFace but missing in Neuron.
Operations present in one implementation but not the other.
Dtype casting differences.
Write a standalone monkey-patch file.
Re-run Stage 2 with the patch applied.
Verify the R-ratio drops to approximately 1.0.
Patch structure#
def apply_component_patch():
"""Monkey-patch TargetClass to fix the issue. Call BEFORE instantiation."""
from modeling_xxx import TargetClass
if getattr(TargetClass, "_patched", False):
return # Idempotent guard
_original_init = TargetClass.__init__
def _patched_init(self, config):
_original_init(self, config)
# Fix: compute corrected values
def _patched_forward(self, *args, **kwargs):
# Fix: use corrected computation
pass
TargetClass.__init__ = _patched_init
TargetClass.forward = _patched_forward
TargetClass._patched = True
Key rules:
Never modify the original port files. All fixes are delivered as standalone patches.
Include an idempotent guard (
_patchedflag) to prevent double-patching.Apply patches before model instantiation.
If a patch fixes one module but breaks a downstream composite, the fix is incomplete — re-run the full bottom-up test suite.
Common pitfalls#
Pitfall |
Solution |
|---|---|
Config parameter gaps |
Derive missing values from known config fields. |
Precision ordering |
Scaling must be applied before BF16 cast, not after. |
Buffer assignment |
|
Output shape conventions |
Match the target’s shape format so downstream code works. |
Dtype mismatch |
No extra |
Repeat Stage 4 until all component R-ratios are below the threshold (default 1.2).
Stage 5: End-to-end comparison#
Verify the assembled model with real weights under teacher forcing.
PYTHONPATH=${MODEL_VALIDATION_DIR}:${SCRIPTS_DIR} \
python3 ${SCRIPTS_DIR}/run_teacher_forced_comparison.py \
--model-path ${SOURCE_MODEL_PATH} \
--compiled-model-path ${COMPILED_MODEL_PATH} \
--model-class ${TARGET_MODELING_FILE}:${TARGET_CAUSAL_CLASS} \
--config-class ${TARGET_MODELING_FILE}:${TARGET_CONFIG_CLASS} \
--num-tokens 32 \
--output ${EXP_DIR}/results/teacher_forced.json
Teacher forcing explained#
At each generation position t, all three models (source FP32, source BF16, target BF16) receive the same prefix tokens — taken from the source FP32 greedy output. This ensures logits are compared under identical contexts and prevents trajectory divergence from contaminating per-position metrics.
Stage 6: Distributional and semantic validation#
This stage is combined with Stage 5 in a single script invocation. It adds two additional conditions beyond the E2E R-ratio:
Condition B (Cosine similarity):
cos(v_source, v_target) >= θ(default θ = 0.95)Condition C (KL divergence):
D_KL(P_source || P_target) <= δ(calibrated from known-good ports)
Pass criteria#
Metric |
Threshold |
|---|---|
E2E R-ratio (p95) |
< 1.2 (default τ_R) |
Cosine similarity (p5) |
>= 0.95 (default θ) |
KL divergence (p95) |
<= δ (calibrated) |
Top-1 agreement |
> 50% |
Interpreting Stage 5/6 results#
Scenario |
Likely cause |
Action |
|---|---|---|
Stage 2 all-pass + Stage 5 fail |
Compilation-induced divergence (operator fusion, kernel numerics) |
Not a porting bug; escalate to compiler team |
Stage 2 fail + Stage 5 fail |
Porting bug propagates to E2E |
Fix via Stage 4 first |
Condition B pass + Condition C fail |
Logit directions agree, probability mass differs |
Threshold calibration needed |
Stage 5 fail, all components clean |
Unmapped component or different execution path on device |
Run |
Stage 7: Downstream task evaluation#
Confirm the port remains usable for production workloads using industry-standard benchmarks.
python3 ${SCRIPTS_DIR}/run_stage7.py \
--bench-config ${EXP_DIR}/bench_config.yaml \
--output-dir ${EXP_DIR}/results/stage7 \
--tolerance 0.02
Benchmark configuration#
model:
model_class: "path/to/modeling.py:NeuronXxxForCausalLM"
config_class: "path/to/modeling.py:XxxInferenceConfig"
model_path: "/path/to/hf_model"
compiled_model_path: "/path/to/compiled_model"
benchmarks:
lm_eval:
accuracy:
tasks: ["gsm8k_cot", "mmlu_pro"]
limit: 200
use_chat: true
run_hf_baseline: true
Pass criteria: Score regression <= 2 percentage points on all tasks.
Result |
Meaning |
Action |
|---|---|---|
All tasks within tolerance |
Port is production-ready |
PASS |
Math/reasoning tasks fail, knowledge passes |
Precision-sensitive computation affected |
Return to Stage 4 |
All tasks fail |
Fundamental porting issue |
Return to Stage 2 |
File organization#
The Equivalence skill enforces a strict file organization:
${EXP_DIR}/
├── model_tree/ # Stage 0 outputs
│ ├── model_tree_source.json # Compressed source tree
│ ├── model_tree_source_full.json # Uncompressed source tree
│ ├── model_tree_source_pretty.txt # ASCII pretty-print
│ ├── model_tree_source_flat_paths.txt
│ ├── model_tree_target.json # Compressed target tree
│ ├── model_tree_target_full.json # Uncompressed target tree
│ ├── model_tree_target_pretty.txt # ASCII pretty-print
│ └── model_tree_target_flat_paths.txt
├── component_mapping.json # Manual source-to-target mapping
├── class_divergence_report.json # CPU vs device class branching
├── tests/ # Stage 2 component tests
│ ├── conftest.py # Shared infrastructure
│ ├── tensor_compare.py # Comparison utility (copied from scripts/)
│ ├── test_00_rmsnorm.py # Simplest component first
│ ├── test_01_embedding.py
│ ├── test_02_linear.py
│ ├── test_03_rotary.py
│ ├── test_04_mlp.py
│ ├── test_05_attention_qkv.py
│ ├── test_06_lm_head.py
│ └── test_07_decoder_layer.py # Most complex last
├── results/ # Test results (JSON)
│ ├── stage1.json
│ ├── stage2.json
│ ├── stage3.json
│ ├── teacher_forced.json
│ └── stage7/
└── bench_config.yaml # Stage 7 benchmark configuration
Tools reference#
Tree builder#
Module: scripts/run_stage0.py
Builds compressed and uncompressed model trees for both source and target architectures.
Instantiates the target model in CPU mode (NXD_CPU_MODE=1, TP=1) to avoid device
dependencies. Uses scripts/stage0_scaffolding.py for tree generation utilities.
Class divergence detector#
Module: scripts/detect_class_divergence.py
Scans the target modeling file for conditional class selection patterns (factory functions, conditional assignments, NKI kernel imports). Produces a JSON report listing each divergence with the CPU class, device class, and recommendation for dual testing.
Smoke test runner#
Module: scripts/run_stage1.py
Runs 10-prompt greedy token matching and computes per-position distribution metrics.
Delegates to model_validation.check_accuracy_with_hf_golden for the core comparison.
Component test runner#
Module: scripts/run_stage2.py
Discovers and executes all test_*.py files in the specified test directory. Collects
R-ratios and produces a pass/fail summary with the configured threshold (default τ_R = 1.2).
Fault localizer#
Module: scripts/run_stage3.py
Analyzes Stage 2 results using change-point detection to identify spike vs step patterns. Classifies root causes and produces a ranked list of suspect components.
Tensor comparator#
Module: scripts/tensor_compare.py
Core utility for 3-tensor comparison. Computes R-ratio, generates QQ plots and histograms
for visual analysis, and provides the compare_3tensors() and check_3tensor_result()
functions used by all component tests.
Teacher-forced comparator#
Module: scripts/run_teacher_forced_comparison.py
Runs Stages 5 and 6 in a single pass. Compares source FP32, source BF16, and target BF16 models under teacher forcing. Reports per-position R-ratio, cosine similarity, KL divergence, and top-1 agreement.
Downstream evaluator#
Module: scripts/run_stage7.py
Runs industry-standard benchmarks (via lm_eval) and compares scores against a HuggingFace
baseline. Reports per-task accuracy with a configurable regression tolerance (default 2
percentage points).
Calibration tool#
Module: scripts/run_calibration.py
Computes threshold values (τ_R, θ, δ) from known-good ports. Use this to establish project-specific thresholds rather than relying on defaults.
Common issues#
Stage 0 failures#
Error |
Solution |
|---|---|
|
Config validation requires |
|
|
|
Use |
|
Verify |
Stage 2 failures#
Error |
Solution |
|---|---|
Test import failures |
Ensure |
|
Fill in all constants in |
Shape mismatch on comparison |
Align shapes before calling |
|
Use |
Stage 4 failures#
Error |
Solution |
|---|---|
Patch applied but test still fails |
Ensure the |
Downstream composite test breaks after patch |
Incomplete fix — run full bottom-up test suite and patch both component and composite if needed. |
Buffer assignment does not persist |
|
Stage 5/6 failures#
Error |
Solution |
|---|---|
|
Add |
Device tensors have wrong shape |
Device pads to full |
|
Embeddings are lookups (no computation), so |
Project guidelines#
The Equivalence skill enforces these rules during the validation process:
Do NOT write tree generation, test runner, or validation scripts. All major scripts are bundled and must be run as-is.
The only files the user creates are:
component_mapping.json(Stage 0, manual),test_NN_*.pytest files (Stage 2, following templates), and monkey-patch files (Stage 4, debugging only).Do NOT modify original model files. All fixes are delivered as standalone patches.
Follow stage order strictly. Do not skip ahead, reorder, or parallelize stages.
Show full output from every script run. Do not summarize or truncate.
No try/except statements in test files. Let errors surface directly.
No additional pip installs. Use only the packages in the provided virtual environment.
Knowledge base#
The skill includes a curated knowledge base at references/ containing solutions gathered
from prior validation sessions.
Foundational concepts#
equiv-concept.md— foundational concepts: 3-way comparison, R-ratio derivation, QQ plot interpretation.expected_structural_diffs.md— catalog of expected HuggingFace to Neuron structural differences.mapping_example.json— full worked example of a 31-component mapping (Llama4 multimodal model).
Debugging guides#
cpu-component-debugging.md— full workflow for CPU-level component debugging with patterns and pitfalls.device-component-debugging.md— XLA-compatible patch patterns for device-mode execution.device-e2e-debugging.md— device E2E with 1-layer isolation and fix-compile-verify cycle.cpu-e2e-debugging.md— CPU E2E withmp.spawn, TP > 1, and bias restoration.dump-tensors.md— intermediate tensor capture methodology for per-layer comparison.neuronxcc-debugging.md— NeuronX compiler debugging tools and escalation procedures.
Case studies#
debugging-case-study-gptoss.md— real worked example from GPT-OSS 20B showing error ratios, root causes, and patches applied.
Visual references#
example_plots/positive_samples/— QQ plots and histograms showing passing distributions (errors follow the 45-degree line).example_plots/negative_samples/— QQ plots and histograms showing failing distributions (divergent error patterns).