This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3

Deep dive: Validate model ports with the Equivalence skill#

Why read this guide? This guide is intended for ML engineers who need to verify that a ported NxD Inference model produces numerically correct output compared to its HuggingFace reference. It explains the Equivalence skill — an AI agent-driven workflow that progressively validates a model port through eight stages of structural analysis, component-level testing, fault localization, debugging, and end-to-end accuracy verification.

How to use this guide: If you are porting a model from scratch, start with the Autoport skill. Use this guide once you have a completed port and need to verify its correctness. If you already understand the environment setup and the R-ratio methodology, skip to the workflow stages.

This topic explores the Equivalence skill in depth, covering structural scaffolding, the 3-tensor R-ratio method, component-level testing, fault localization, patching, end-to-end comparison, and downstream evaluation. You need experience with PyTorch model development, the NxD Inference library structure, and basic numerical analysis to fully understand this content.

Prerequisites#

Before you start, you must be familiar with the following:

NxD Inference library overview: How to build and deploy models using NxD Inference. See Neuron Agentic Development.
PyTorch model architecture: Transformer building blocks (attention, MLP, embeddings) and how HuggingFace models are structured.
Neuron compilation model: How torch-neuronx traces Python code into HLO and compiles it to NEFF for NeuronCores. See NxD Inference Features Configuration Guide.
Tensor parallelism concepts: How models are sharded across NeuronCores. See Parallelism Techniques for LLM Inference.
Model porting workflow: How models are ported to NxD Inference. See Deep dive: Port HuggingFace models to Neuron with the Autoport skill.

Overview#

The Equivalence skill validates functional and numerical equivalence between a source (reference) neural network implementation and a target (ported) implementation. It does not perform the actual porting work — it verifies that an existing port is correct through progressive stages of testing, localization, and debugging.

The skill is designed for workflows where a model has been migrated between:

Frameworks: HuggingFace to NxD Inference
Hardware targets: CPU to Neuron (Trainium)
Precision regimes: FP32 to BF16, FP32 to MXFP4/INT8
Execution modes: single TP degree to multi-TP degree sharding

It works with dense transformer models (decoder-only, encoder-decoder), Mixture of Experts (MoE) models, models with novel attention mechanisms (sliding window, grouped query attention, multi-latent attention), models requiring weight dequantization (MXFP4, INT8), and cross-framework ports with precision regime changes.

The workflow has eight stages, numbered 0 through 7:

Stage 0, structural scaffolding: build model trees and create a component mapping between source and target architectures.
Stage 1, smoke testing: quick liveness check using greedy token matching to verify the port produces coherent output.
Stage 2, component-level testing: isolate each mapped component using the 3-tensor R-ratio method to identify which components diverge.
Stage 3, fault localization: automatically classify root causes and rank suspect components.
Stage 4, debugging and patching: fix failing components with standalone monkey patches without modifying the original port.
Stage 5, end-to-end comparison: verify the assembled model with real weights under teacher forcing using the R-ratio.
Stage 6, distributional and semantic validation: check cosine similarity and KL divergence in the same script invocation as Stage 5.
Stage 7, downstream evaluation: confirm production readiness using industry-standard benchmarks.

Hardware and software requirements#

Instance type: trn1.32xlarge (32 NeuronCores, 16 GB per core) or equivalent Trainium instance. CPU-mode testing (Stages 0-4) can run on any instance.
Neuron SDK: Version 2.28+ with the following system packages installed:
- aws-neuronx-dkms
- aws-neuronx-runtime-lib
- aws-neuronx-collectives
- aws-neuronx-tools
Python: 3.10 or later.
Neuron SDK Python packages:
- neuronx-distributed-inference (0.8.x)
- neuronx-distributed (0.17.x)
- transformers (4.57+)
- torch (2.x+)
- numpy
- matplotlib
Model validation package: The model_validation package from NeuroborosFoundations must be available on PYTHONPATH.
Model weights: Downloaded from HuggingFace Hub or available locally.
Compiled model: A compiled (NEFF) version of the target model for device-mode testing (Stages 5-7).
Disk space: Sufficient for model weights, compiled artifacts, and experiment outputs (typically 2-5x the model size).

Note

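Stages 0 through 4 run in CPU mode (NXD_CPU_MODE=1, TP=1) and do not require Neuron hardware. Only Stages 5-7 require a compiled model and Neuron device access. One caveat: the Stage 1 command shown in this guide takes --compiled-model-path, so have the compiled artifacts available before you run the smoke test.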

Inputs#

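Before the agent begins the workflow, it collects the required parameters below from the user. The commands in this guide also reference ${SCRIPTS_DIR}, which is not a collected parameter; it refers to the directory that holds the skill's bundled scripts (listed as scripts/ in the Tools reference).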

Parameter	Description
`SOURCE_MODEL_PATH`	Path to reference model weights in HuggingFace format.
`COMPILED_MODEL_PATH`	Path to the compiled/quantized target model (NEFF artifacts).
`TARGET_MODELING_FILE`	Path to the target port’s modeling Python file.
`TARGET_INNER_CLASS`	Inner model class name (extends `NeuronBaseModel`).
`TARGET_CAUSAL_CLASS`	`ForCausalLM` wrapper class name.
`TARGET_CONFIG_CLASS`	`InferenceConfig` class name.
`VENV`	Path to Python virtual environment with torch and neuronx packages.
`MODEL_VALIDATION_DIR`	Path to the `model_validation` package directory.
`EXP_DIR`	Experiment output directory for all artifacts.

Key concepts#

The R-ratio metric#

The R-ratio is the core metric used throughout the skill to quantify divergence:

R = ||target - source_fp32||_F / (||source_bf16 - source_fp32||_F + ε)

Where:

source_fp32 is the reference implementation running in FP32 (ground truth).
source_bf16 is the reference implementation running in BF16 (precision baseline).
target is the target port running in BF16 (under test).
|| . ||_F is the Frobenius norm (L2 norm of the flattened tensor).
ε is a small constant to avoid division by zero.

The denominator measures the expected precision loss from FP32 to BF16 — the irreducible error from the precision downgrade. The numerator measures the actual error of the port. An R-ratio near 1.0 means the port introduces no additional error beyond the precision baseline.
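The formula can be sketched in a few lines of plain Python. This is an illustration only: the function names and the default `eps` value are assumptions, the toy numbers are made up, and the skill's own implementation lives in `tensor_compare.py` and works on PyTorch tensors.

```python
import math
from typing import Sequence


def frobenius_norm(values: Sequence[float]) -> float:
    """L2 norm of a flattened tensor."""
    return math.sqrt(sum(v * v for v in values))


def r_ratio(
    target: Sequence[float],
    source_fp32: Sequence[float],
    source_bf16: Sequence[float],
    eps: float = 1e-12,  # assumed value; the guide only says "a small constant"
) -> float:
    """R = ||target - source_fp32||_F / (||source_bf16 - source_fp32||_F + eps)."""
    port_error = frobenius_norm([t - s for t, s in zip(target, source_fp32)])
    precision_error = frobenius_norm([b - s for b, s in zip(source_bf16, source_fp32)])
    return port_error / (precision_error + eps)


# Made-up values: target_ok deviates from FP32 as much as the BF16 reference does,
# target_bad carries an extra factor of 2.
source_fp32 = [1.000, 2.000, 3.000]
source_bf16 = [1.004, 1.992, 3.016]
target_ok = [0.996, 2.008, 2.984]
target_bad = [2.000, 4.000, 6.000]

print(r_ratio(target_ok, source_fp32, source_bf16))   # close to 1
print(r_ratio(target_bad, source_fp32, source_bf16))  # far above 100
```

Because the denominator is the reference model's own precision loss, the same threshold can be reused across components whose outputs have very different magnitudes.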

R-ratio	Interpretation
≈ 1.0	Port matches precision baseline. No porting bug.
< 1.2	Within acceptable tolerance. Minor TP rounding or kernel differences.
1.2 – 3.0	Possible porting bug. Missing multiplier or precision ordering issue.
3.0 – 10.0	Likely porting bug. Missing multiplier or precision ordering issue.
>> 10	Missing algorithm or wrong formula (e.g., YaRN scaling absent from RoPE).
>> 100	Completely wrong computation.
< 1.0	Over-precision. Extra `.float()` calls not present in reference.

The 3-tensor comparison method#

Every component test produces three outputs from the same input:

ref_fp32 — source (HuggingFace) model class, FP32 weights, FP32 input.
ref_bf16 — source model class, BF16 weights, BF16 input.
target_bf16 — target (Neuron port) model class, BF16 weights, BF16 input.

All three share the same FP32 weights (with BF16 versions created by downcasting). This isolates the porting error from precision error: the denominator of R captures only precision drift, while the numerator captures precision drift plus any porting bugs.

Method	When to use	Baseline
3-tensor	Reference can run in FP32	Precision error from FP32 to BF16 downgrade
2-tensor	Reference can only run in target precision	Machine-epsilon perturbation baseline
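To make the 3-tensor idea concrete, here is a self-contained sketch that simulates BF16 rounding with the standard library and runs a toy linear layer three ways. Everything in it is an assumption for illustration: `to_bf16`, `linear`, and the injected `scale=0.5` bug are hypothetical, and the "FP32" reference is actually computed in Python double precision.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def linear(weights, x, cast=lambda v: v, scale=1.0):
    """y = scale * (W @ x), with weights, inputs, and outputs passed through `cast`."""
    out = []
    for row in weights:
        acc = sum(cast(w) * cast(v) for w, v in zip(row, x))
        out.append(cast(scale * acc))
    return out


def norm(values):
    return math.sqrt(sum(v * v for v in values))


def r_ratio(target, ref_fp32, ref_bf16, eps=1e-12):
    num = norm([t - r for t, r in zip(target, ref_fp32)])
    den = norm([b - r for b, r in zip(ref_bf16, ref_fp32)])
    return num / (den + eps)


rng = random.Random(42)
weights = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(32)]
x = [rng.gauss(0, 1) for _ in range(64)]

ref_fp32 = linear(weights, x)                              # ground truth
ref_bf16 = linear(weights, x, cast=to_bf16)                # precision baseline
target_same = linear(weights, x, cast=to_bf16)             # faithful port
target_bug = linear(weights, x, cast=to_bf16, scale=0.5)   # port with a wrong multiplier

r_same = r_ratio(target_same, ref_fp32, ref_bf16)
r_bug = r_ratio(target_bug, ref_fp32, ref_bf16)
print(f"faithful port R = {r_same:.3f}, buggy port R = {r_bug:.1f}")
```

All three runs start from the same high-precision weights, so the only difference between `ref_bf16` and `target_bug` is the injected bug, and that is what pushes R well away from 1.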

Expected structural differences#

When comparing HuggingFace and NxD Inference model trees, these differences are expected and do not indicate bugs:

HuggingFace	Neuron Port	Reason
`nn.Linear`	`ColumnParallelLinear` / `RowParallelLinear`	Tensor parallel sharding
`nn.Embedding`	`ParallelEmbedding`	Embedding sharded across TP ranks
Flat q/k/v projections	Wrapped in `GroupQueryAttention_QKV`	NxDI attention framework
Single `RotaryEmbedding` at model level	Per-layer `RotaryEmbedding`	Implementation choice
`XxxRMSNorm` (source class)	`LlamaRMSNorm` (CPU) or `CustomRMSNorm` (device)	Framework normalization
Fused `gate_up_proj`	Split `gate_proj` + `up_proj`	TP requires separate sharding
(none)	`SPMDRank`, `KVCacheManager`	Neuron-specific infrastructure

Differences that do indicate bugs:

Missing modules (norm layer absent in port)
Extra unexpected modules with no framework explanation
Wrong nesting (MLP inside attention instead of parallel)
Mismatched layer counts (47 instead of 48)
Missing activation functions
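One of these checks, mismatched layer counts, is easy to automate. The sketch below assumes the flat path lists contain one dotted module path per entry with a `.layers.<index>` segment; that format and the helper name `count_layers` are assumptions, not documented behavior of the bundled scripts.

```python
import re
from typing import Iterable


def count_layers(flat_paths: Iterable[str]) -> int:
    """Count distinct layer indices in paths such as 'model.layers.12.mlp'."""
    indices = set()
    for path in flat_paths:
        match = re.search(r"\.layers\.(\d+)(?:\.|$)", path)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)


# Hypothetical inputs: the target is missing its last decoder layer.
source_paths = [f"model.layers.{i}.self_attn" for i in range(48)]
target_paths = [f"model.language_model.layers.{i}.self_attn" for i in range(47)]

source_n, target_n = count_layers(source_paths), count_layers(target_paths)
if source_n != target_n:
    print(f"Layer count mismatch: source={source_n}, target={target_n}")
```

Differences in the path prefix (`model.` versus `model.language_model.`) are ignored on purpose, since wrapper nesting is one of the expected structural differences.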

Workflow#

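The skill follows a strict 8-stage sequential workflow. Stages must not be skipped, reordered, or parallelized. The one documented branch comes after Stage 2: when every component passes, the workflow moves directly to Stage 5, because Stages 3 and 4 apply only to failing components.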

Stage 0: Structural scaffolding#

Build the alignment map between source and target model hierarchies.

Purpose: Understand both model structures and create a formal mapping between their components.

Build model trees#

source ${VENV}/bin/activate
PYTHONPATH=${SCRIPTS_DIR} python3 ${SCRIPTS_DIR}/run_stage0.py \
  --source-model-path ${SOURCE_MODEL_PATH} \
  --target-model-path ${SOURCE_MODEL_PATH} \
  --target-module-file ${TARGET_MODELING_FILE} \
  --target-inner-class ${TARGET_INNER_CLASS} \
  --target-config-class ${TARGET_CONFIG_CLASS} \
  --output-dir ${EXP_DIR}/model_tree

The script instantiates the target in CPU mode (NXD_CPU_MODE=1, TP=1) to produce a structure-only comparison without device dependencies.

Outputs:

${EXP_DIR}/model_tree/
├── model_tree_source.json             # Compressed source tree
├── model_tree_source_full.json        # Uncompressed source tree
├── model_tree_source_pretty.txt       # ASCII pretty-print
├── model_tree_source_flat_paths.txt   # Flat module path list
├── model_tree_target.json             # Compressed target tree
├── model_tree_target_full.json        # Uncompressed target tree
├── model_tree_target_pretty.txt       # ASCII pretty-print
└── model_tree_target_flat_paths.txt   # Flat module path list

Create component mapping#

Manually compare the printed trees and create ${EXP_DIR}/component_mapping.json. This file maps each source module (or group of modules) to its target equivalent(s).

The mapping is an array in which each entry pairs a list of source modules with a list of target modules, [source_modules, target_modules], and records the reasoning for the pairing. Indexed variables ({i} for layer indices) cover repeated layers:

One-to-one: ["model.layers.{i}.norm"] maps to ["model.language_model.layers.{i}.norm"]
Many-to-one (fused): ["model.q_proj", "model.k_proj", "model.v_proj"] maps to ["model.qkv_proj"]
No counterpart: Document the reasoning (framework scaffolding, TP-specific structure)
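The sketch below shows one way such a mapping could look and how the {i} placeholder expands into concrete module paths. The three-element entry layout, the `model.kv_mgr` path, and the `expand` helper are hypothetical; check `mapping_example.json` in the knowledge base for the schema the skill actually expects.

```python
import json

# Hypothetical layout: [source_modules, target_modules, reasoning].
mapping = [
    [["model.layers.{i}.norm"], ["model.language_model.layers.{i}.norm"], "one-to-one"],
    [["model.q_proj", "model.k_proj", "model.v_proj"], ["model.qkv_proj"], "fused QKV projection"],
    [[], ["model.kv_mgr"], "no counterpart: Neuron-specific infrastructure"],
]


def expand(entry, num_layers):
    """Replace the {i} placeholder with each layer index; pass other entries through."""
    source, target, reasoning = entry
    if not any("{i}" in path for path in source + target):
        return [(source, target, reasoning)]
    return [
        (
            [path.replace("{i}", str(i)) for path in source],
            [path.replace("{i}", str(i)) for path in target],
            reasoning,
        )
        for i in range(num_layers)
    ]


expanded = [pair for entry in mapping for pair in expand(entry, num_layers=2)]
print(json.dumps(expanded, indent=2))
```

Keeping the reasoning next to each pair means that an entry with an empty source list is self-explanatory when someone reviews the mapping later.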

Detect CPU vs device class divergence#

python3 ${SCRIPTS_DIR}/detect_class_divergence.py \
  --target-module-file ${TARGET_MODELING_FILE} \
  --output ${EXP_DIR}/class_divergence_report.json

This scans the target modeling file for patterns where different classes are used in CPU mode versus device mode:

Factory functions (get_rmsnorm_cls()) that branch on NXD_CPU_MODE or on_cpu
Conditional assignments (self.norm = ClassA() if cpu else ClassB())
NKI kernel imports (e.g., LlamaRMSNorm on CPU, CustomRMSNorm on device)

Components with class divergence require dual testing in Stage 2 — one test for the CPU class and one for the device class.
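To illustrate the kind of pattern being detected, the following sketch uses Python's `ast` module to find conditional expressions that choose between two classes. It is not the logic of `detect_class_divergence.py`, which is not shown in this guide; the sample source, the `cpu_mode()` helper, and the function names are all hypothetical.

```python
import ast

SAMPLE_SOURCE = '''
def get_rmsnorm_cls():
    return LlamaRMSNorm if cpu_mode() else CustomRMSNorm

class Block:
    def __init__(self, config, on_cpu):
        self.norm = LlamaRMSNorm(config) if on_cpu else CustomRMSNorm(config)
        self.mlp = MLP(config)
'''


def _name(node: ast.AST) -> str:
    """Return the class name for `Cls` or `Cls(...)` expressions."""
    if isinstance(node, ast.Call):
        node = node.func
    return ast.unparse(node)


def find_conditional_classes(source: str):
    """Return (line, condition, class_if_true, class_if_false) for each `A if cond else B`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.IfExp):
            findings.append(
                (node.lineno, ast.unparse(node.test), _name(node.body), _name(node.orelse))
            )
    return sorted(findings)


findings = find_conditional_classes(SAMPLE_SOURCE)
for line, condition, if_true, if_false in findings:
    print(f"line {line}: {if_true} when {condition}, otherwise {if_false}")
```

Each finding names two classes for one logical component, which is exactly the case where Stage 2 needs one test per class.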

Stage 1: Smoke test#

Quick liveness check — does the port produce coherent output?

PYTHONPATH=${MODEL_VALIDATION_DIR} python3 ${SCRIPTS_DIR}/run_stage1.py \
  --model-path ${SOURCE_MODEL_PATH} \
  --compiled-model-path ${COMPILED_MODEL_PATH} \
  --model-class ${TARGET_MODELING_FILE}:${TARGET_CAUSAL_CLASS} \
  --config-class ${TARGET_MODELING_FILE}:${TARGET_CONFIG_CLASS} \
  --num-tokens 32 \
  --output ${EXP_DIR}/results/stage1.json

The script runs 10-prompt greedy token matching and computes per-position distribution metrics: cosine similarity, KL divergence, top-k agreement, and relative L2 error.

Interpreting results:

Match rate	Meaning	Action
> 30%	Liveness threshold met	Continue to Stage 2
100% on most prompts	Normal BF16 precision drift	Continue to Stage 2
< 30%	Catastrophic failure	Proceed to Stage 2 for localization

High cosine similarity (> 0.95) with low token match suggests margin-sensitive divergence — the top two token probabilities are close, and BF16 rounding flips the argmax. This is expected behavior and not a bug.
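The margin-sensitive case is easy to reproduce with toy numbers. In the sketch below the logits and token IDs are made up, and the helper names are assumptions rather than functions from `run_stage1.py`.

```python
import math


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_rate(source_tokens, target_tokens):
    """Fraction of positions where both greedy outputs chose the same token."""
    hits = sum(s == t for s, t in zip(source_tokens, target_tokens))
    return hits / len(source_tokens)


# A position where the top two logits are nearly tied: a tiny perturbation flips the argmax.
source_logits = [2.00, 1.99, -1.00]
target_logits = [1.99, 2.00, -1.00]

print(argmax(source_logits), argmax(target_logits))  # different tokens
print(cosine(source_logits, target_logits))          # still very close to 1.0
print(match_rate([5, 9, 2, 7], [5, 9, 3, 7]))        # 0.75
```

Once a single token flips, every later greedy token is generated from a different prefix, which is why the match rate can fall sharply even though the per-position distributions remain almost identical.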

Stage 2: Component-level testing#

Test each mapped component using the 3-tensor R-ratio method to isolate which component(s) diverge.

Set up test infrastructure#

Copy the comparison utility into the test directory:

cp ${SCRIPTS_DIR}/tensor_compare.py ${EXP_DIR}/tests/

Create conftest.py from the provided template. Fill in model-specific constants: HIDDEN_SIZE, NUM_HEADS, NUM_KV_HEADS, VOCAB_SIZE, INTERMEDIATE_SIZE, and other values from the model’s config.json.
Write one test file per component, ordered bottom-up from simplest to most complex: test_00_rmsnorm.py, test_01_embedding.py, test_02_linear.py, etc.

Write component tests#

Each test follows the 3-tensor pattern:

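import torch
import torch.nn as nn

# tensor_compare.py is copied into the tests directory during setup.
from tensor_compare import check_3tensor_result, compare_3tensors

# SourceClass, TargetClass, config, neuron_config, and the constants
# (OUT_DIM, IN_DIM, BS, SEQ_LEN, TOLERANCE_RATIO) are placeholders to fill in
# per component, typically from conftest.py.

def test_component_name():
    torch.manual_seed(42)
    weight_fp32 = torch.randn(OUT_DIM, IN_DIM)

    # ref_fp32: Source class, FP32 weights
    ref_fp32 = SourceClass(config)
    ref_fp32.weight.data.copy_(weight_fp32)
    ref_fp32.eval()

    # ref_bf16: Source class, BF16 weights (downcast from the same FP32 tensor)
    ref_bf16 = SourceClass(config)
    ref_bf16.weight = nn.Parameter(weight_fp32.to(torch.bfloat16))
    ref_bf16.eval()

    # target_bf16: Target port's class, BF16 weights
    target_bf16 = TargetClass(neuron_config)
    target_bf16.weight = nn.Parameter(weight_fp32.to(torch.bfloat16))
    target_bf16.eval()

    # One shared input, cast to each model's precision
    x = torch.randn(BS, SEQ_LEN, IN_DIM)
    with torch.no_grad():
        out1 = ref_fp32(x.float()).float()
        out2 = ref_bf16(x.to(torch.bfloat16)).float()
        out3_raw = target_bf16(x.to(torch.bfloat16))
        # Neuron modules may return a tuple; the first element is the output
        out3 = out3_raw[0].float() if isinstance(out3_raw, tuple) else out3_raw.float()

    result = compare_3tensors(out1, out2, out3)
    assert check_3tensor_result(result, "component_name", TOLERANCE_RATIO)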

Critical rules for test writing:

ref_fp32 and ref_bf16 use the source model’s class (HuggingFace).
target_bf16 uses the target port’s actual class (may differ from source).
All three share the same FP32 weights (BF16 versions created by downcasting).
Use nn.Parameter() replacement for ColumnParallelLinear (not .copy_()).
Set .eval() mode on Neuron modules with pad=True.
Handle tuple outputs: out = out[0] if isinstance(out, tuple) else out.
Align shapes before comparison for fused components (QKV, gate/up projections).
Cast all outputs to .float() before comparison.
Check class_divergence_report.json — write dual tests for components with CPU/device class differences.

Run component tests#

NXD_CPU_MODE=1 python3 ${SCRIPTS_DIR}/run_stage2.py \
  --tests-dir ${EXP_DIR}/tests \
  --tau-r 1.2 \
  --output ${EXP_DIR}/results/stage2.json

Decision: If all R < 1.2, proceed to Stage 5 (E2E). If any R >= 1.2, proceed to Stage 3 for fault localization.

Stage 3: Fault localization#

Analyze Stage 2 R-ratios to identify where divergence originates and classify root causes.

python3 ${SCRIPTS_DIR}/run_stage3.py \
  --stage2-output ${EXP_DIR}/results/stage2.json \
  --tau-r 1.2 \
  --output ${EXP_DIR}/results/stage3.json

Change-point detection#

The script identifies two divergence patterns:

Spike: High R at a single point that returns to baseline at the next component. Indicates an alignment artifact or transient error.
Step: High R that persists for all subsequent components. Indicates a functional bug whose error propagates downstream.

The earliest step-pattern point is the primary fault candidate.
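The two patterns can be told apart with a simple scan over the ordered R-ratios. The sketch below is an assumption about how such a rule could be written, not the algorithm in `run_stage3.py`; the function names, the "unclear" label, and the sample values are hypothetical.

```python
def classify_divergence(r_values, tau_r=1.2):
    """Label each above-threshold position as 'spike', 'step', or 'unclear'.

    r_values is ordered bottom-up, matching the component test order.
    """
    labels = {}
    for i, r in enumerate(r_values):
        if r < tau_r:
            continue
        later = r_values[i + 1:]
        if later and all(v >= tau_r for v in later):
            labels[i] = "step"      # stays high for every later component
        elif later and later[0] < tau_r:
            labels[i] = "spike"     # back to baseline at the next component
        else:
            labels[i] = "unclear"   # last component, or a mixed pattern
    return labels


def primary_fault(r_values, tau_r=1.2):
    """Index of the earliest step-pattern point, or None if there is none."""
    labels = classify_divergence(r_values, tau_r)
    steps = [i for i, label in labels.items() if label == "step"]
    return min(steps) if steps else None


r_values = [1.0, 4.0, 1.05, 1.1, 6.0, 5.5, 7.2]
print(classify_divergence(r_values))
print(primary_fault(r_values))
```

In this made-up series the high value at index 1 is a spike, while the run that starts at index 4 never recovers, so index 4 is the component to debug first.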

Root-cause classification#

R magnitude	Likely cause	Examples
R >> 10	Missing algorithm or wrong formula	YaRN scaling absent from RoPE, MoE routing ignored, masking wrong
1.2 < R < 3	Precision ordering or missing multiplier	Variance computed in BF16 instead of FP32, attention scaling omitted
R < 1.0	Over-precision (unintended FP32 upcast)	Extra `.float()` call not present in reference

The output is a ranked list of suspect components with: component name, R-ratio, divergence pattern (spike or step), root cause label, description, and mapped module paths.

Stage 4: Debug and patch#

Fix failing components with standalone monkey patches. This is the only stage where code changes are made, and the original port is never modified directly.

Debugging workflow#

Read the Stage 3 fault localization report.
Compare the HuggingFace and Neuron implementations side-by-side:
- Config parameters consumed by HuggingFace but missing in Neuron.
- Operations present in one implementation but not the other.
- Dtype casting differences.
Write a standalone monkey-patch file.
Re-run Stage 2 with the patch applied.
Verify the R-ratio drops to approximately 1.0.

Patch structure#

def apply_component_patch():
    """Monkey-patch TargetClass to fix the issue. Call BEFORE instantiation."""
    from modeling_xxx import TargetClass  # placeholder module name
    if getattr(TargetClass, "_patched", False):
        return  # Idempotent guard

    _original_init = TargetClass.__init__
    _original_forward = TargetClass.forward

    def _patched_init(self, *args, **kwargs):
        _original_init(self, *args, **kwargs)
        # Fix: compute corrected values and store them on self

    def _patched_forward(self, *args, **kwargs):
        # Fix: replace this delegation with the corrected computation
        return _original_forward(self, *args, **kwargs)

    TargetClass.__init__ = _patched_init
    TargetClass.forward = _patched_forward
    TargetClass._patched = True

Key rules:

Never modify the original port files. All fixes are delivered as standalone patches.
Include an idempotent guard (_patched flag) to prevent double-patching.
Apply patches before model instantiation.
If a patch fixes one module but breaks a downstream composite, the fix is incomplete — re-run the full bottom-up test suite.
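The rules above can be exercised on a toy class without any Neuron dependencies. `ToyNorm` and `apply_toy_patch` are hypothetical stand-ins; the sketch only demonstrates the idempotent guard and the patch-before-instantiation order.

```python
class ToyNorm:
    """Stand-in for a target class with a bug: it forgets the scale factor."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x  # bug: should be x * self.scale


def apply_toy_patch():
    """Replace ToyNorm.forward with a corrected version. Safe to call repeatedly."""
    if getattr(ToyNorm, "_patched", False):
        return False  # already patched; do nothing

    def _patched_forward(self, x):
        return x * self.scale

    ToyNorm._original_forward = ToyNorm.forward  # keep a handle for inspection or rollback
    ToyNorm.forward = _patched_forward
    ToyNorm._patched = True
    return True


first = apply_toy_patch()    # patches the class
second = apply_toy_patch()   # no-op thanks to the guard
layer = ToyNorm(scale=2.0)   # instantiate only after patching
print(first, second, layer.forward(3.0))  # True False 6.0
```

Without the guard, a second call would overwrite `_original_forward` with the already patched function, so the unpatched behavior could no longer be recovered for comparison.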

Common pitfalls#

Pitfall	Solution
Config parameter gaps	Derive missing values from known config fields.
Precision ordering	Scaling must be applied before BF16 cast, not after.
Buffer assignment	`register_buffer("name", None)` resists direct assignment — store on wrapper instead.
Output shape conventions	Match the target’s shape format so downstream code works.
Dtype mismatch	No extra `.float()` calls, no missing `.float()` calls — match reference exactly.
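The precision-ordering pitfall can be demonstrated numerically. The sketch below simulates BF16 rounding with the standard library and compares the two orderings on random inputs; `to_bf16`, the scale value, and the input range are illustrative assumptions.

```python
import math
import random
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BF16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


rng = random.Random(0)
scale = 1.0 / math.sqrt(128)  # arbitrary illustrative scaling factor

err_scale_then_cast = 0.0  # scale in high precision, then cast once
err_cast_then_scale = 0.0  # cast first, scale, then round again
for _ in range(10_000):
    x = rng.uniform(-4.0, 4.0)
    exact = x * scale
    err_scale_then_cast += abs(to_bf16(x * scale) - exact)
    err_cast_then_scale += abs(to_bf16(to_bf16(x) * scale) - exact)

print(f"scale then cast: {err_scale_then_cast:.4f}")
print(f"cast then scale: {err_cast_then_scale:.4f}")
```

Casting first rounds the value twice, so the accumulated error is larger. In a component test this shows up as an R-ratio modestly above 1 rather than a dramatic failure.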

Repeat Stage 4 until all component R-ratios are below the threshold (default 1.2).

Stage 5: End-to-end comparison#

Verify the assembled model with real weights under teacher forcing.

PYTHONPATH=${MODEL_VALIDATION_DIR}:${SCRIPTS_DIR} \
python3 ${SCRIPTS_DIR}/run_teacher_forced_comparison.py \
  --model-path ${SOURCE_MODEL_PATH} \
  --compiled-model-path ${COMPILED_MODEL_PATH} \
  --model-class ${TARGET_MODELING_FILE}:${TARGET_CAUSAL_CLASS} \
  --config-class ${TARGET_MODELING_FILE}:${TARGET_CONFIG_CLASS} \
  --num-tokens 32 \
  --output ${EXP_DIR}/results/teacher_forced.json

Teacher forcing explained#

At each generation position t, all three models (source FP32, source BF16, target BF16) receive the same prefix tokens — taken from the source FP32 greedy output. This ensures logits are compared under identical contexts and prevents trajectory divergence from contaminating per-position metrics.
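As a minimal sketch of the loop described above, the code below treats each model as a function from a token prefix to logits. The toy models, their names, and `teacher_forced_logits` are hypothetical; the real comparison is performed by `run_teacher_forced_comparison.py`.

```python
def greedy(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def teacher_forced_logits(models, reference_name, prompt, num_tokens):
    """Collect per-position logits from every model under a shared prefix.

    Only the reference model's greedy token extends the prefix, so all models
    see identical context at every position.
    """
    prefix = list(prompt)
    history = {name: [] for name in models}
    for _ in range(num_tokens):
        for name, forward in models.items():
            history[name].append(forward(prefix))
        prefix.append(greedy(history[reference_name][-1]))
    return history, prefix[len(prompt):]


def toy_model(bias, vocab_size=4):
    """Toy model: favors (last token + 1), with a small bias added to token 0."""
    def forward(prefix):
        favored = (prefix[-1] + 1) % vocab_size
        return [(1.0 if tok == favored else 0.0) + (bias if tok == 0 else 0.0)
                for tok in range(vocab_size)]
    return forward


models = {
    "source_fp32": toy_model(0.0),
    "source_bf16": toy_model(0.01),
    "target_bf16": toy_model(0.02),
}
history, forced_tokens = teacher_forced_logits(models, "source_fp32", prompt=[0], num_tokens=4)
print(forced_tokens)  # [1, 2, 3, 0]
```

Since no model ever consumes its own output, a disagreement at one position cannot leak into the metrics for later positions.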

Stage 6: Distributional and semantic validation#

This stage runs in the same script invocation as Stage 5. On top of the E2E R-ratio check, it adds two conditions:

Condition B (Cosine similarity): cos(v_source, v_target) >= θ (default θ = 0.95)
Condition C (KL divergence): D_KL(P_source || P_target) <= δ (calibrated from known-good ports)

Pass criteria#

Metric	Threshold
E2E R-ratio (p95)	< 1.2 (default τ_R)
Cosine similarity (p5)	>= 0.95 (default θ)
KL divergence (p95)	<= δ (calibrated)
Top-1 agreement	> 50%
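The pass criteria can be expressed as a small check over per-position metrics. The sketch below is an assumption about how the pieces fit together: the nearest-rank percentile, the function names, and the sample values are hypothetical, and `delta` has no default because the guide says it must be calibrated.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(source_logits, target_logits):
    """D_KL(P_source || P_target) in nats."""
    p, q = softmax(source_logits), softmax(target_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes(r_values, cosines, kls, top1, delta, tau_r=1.2, theta=0.95):
    """Apply the four thresholds from the pass criteria table."""
    return (
        percentile(r_values, 95) < tau_r
        and percentile(cosines, 5) >= theta
        and percentile(kls, 95) <= delta
        and top1 > 0.5
    )


# Made-up per-position metrics for 20 positions, with one R-ratio outlier.
ok = passes([1.0] * 19 + [5.0], [0.99] * 20, [0.001] * 20, top1=0.9, delta=0.01)
print(ok)
```

Using percentiles instead of the maximum lets a single outlier position through, while a systematic shift across many positions still fails.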

Interpreting Stage 5/6 results#

Scenario	Likely cause	Action
Stage 2 all-pass + Stage 5 fail	Compilation-induced divergence (operator fusion, kernel numerics)	Not a porting bug; escalate to compiler team
Stage 2 fail + Stage 5 fail	Porting bug propagates to E2E	Fix via Stage 4 first
Condition B pass + Condition C fail	Logit directions agree, probability mass differs	Threshold calibration needed
Stage 5 fail, all components clean	Unmapped component or different execution path on device	Run `detect_class_divergence.py`; add device-specific tests

Stage 7: Downstream task evaluation#

Confirm the port remains usable for production workloads using industry-standard benchmarks.

python3 ${SCRIPTS_DIR}/run_stage7.py \
  --bench-config ${EXP_DIR}/bench_config.yaml \
  --output-dir ${EXP_DIR}/results/stage7 \
  --tolerance 0.02

Benchmark configuration#

model:
  model_class: "path/to/modeling.py:NeuronXxxForCausalLM"
  config_class: "path/to/modeling.py:XxxInferenceConfig"
  model_path: "/path/to/hf_model"
  compiled_model_path: "/path/to/compiled_model"

benchmarks:
  lm_eval:
    accuracy:
      tasks: ["gsm8k_cot", "mmlu_pro"]
      limit: 200
      use_chat: true

run_hf_baseline: true

Pass criteria: Score regression <= 2 percentage points on all tasks.

Result	Meaning	Action
All tasks within tolerance	Port is production-ready	PASS
Math/reasoning tasks fail, knowledge passes	Precision-sensitive computation affected	Return to Stage 4
All tasks fail	Fundamental porting issue	Return to Stage 2
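The tolerance check itself is a per-task comparison. The sketch below assumes scores are fractions between 0 and 1, so that `--tolerance 0.02` corresponds to 2 percentage points; the function name and the scores are made up for illustration and are not benchmark results.

```python
def regressions(baseline, ported, tolerance=0.02):
    """Return tasks whose score dropped by more than `tolerance`."""
    return {
        task: round(baseline[task] - ported[task], 4)
        for task in baseline
        if baseline[task] - ported[task] > tolerance
    }


# Illustrative numbers only.
baseline_scores = {"gsm8k_cot": 0.80, "mmlu_pro": 0.60}
ported_scores = {"gsm8k_cot": 0.75, "mmlu_pro": 0.59}

failed = regressions(baseline_scores, ported_scores)
print(failed or "PASS")
```

Here only the math task regresses beyond the tolerance, which corresponds to the "return to Stage 4" row in the table above.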

File organization#

The Equivalence skill enforces a strict file organization:

${EXP_DIR}/
├── model_tree/                        # Stage 0 outputs
│   ├── model_tree_source.json         # Compressed source tree
│   ├── model_tree_source_full.json    # Uncompressed source tree
│   ├── model_tree_source_pretty.txt   # ASCII pretty-print
│   ├── model_tree_source_flat_paths.txt
│   ├── model_tree_target.json         # Compressed target tree
│   ├── model_tree_target_full.json    # Uncompressed target tree
│   ├── model_tree_target_pretty.txt   # ASCII pretty-print
│   └── model_tree_target_flat_paths.txt
├── component_mapping.json             # Manual source-to-target mapping
├── class_divergence_report.json       # CPU vs device class branching
├── tests/                             # Stage 2 component tests
│   ├── conftest.py                    # Shared infrastructure
│   ├── tensor_compare.py              # Comparison utility (copied from scripts/)
│   ├── test_00_rmsnorm.py             # Simplest component first
│   ├── test_01_embedding.py
│   ├── test_02_linear.py
│   ├── test_03_rotary.py
│   ├── test_04_mlp.py
│   ├── test_05_attention_qkv.py
│   ├── test_06_lm_head.py
│   └── test_07_decoder_layer.py       # Most complex last
├── results/                           # Test results (JSON)
│   ├── stage1.json
│   ├── stage2.json
│   ├── stage3.json
│   ├── teacher_forced.json
│   └── stage7/
└── bench_config.yaml                  # Stage 7 benchmark configuration

Tools reference#

Tree builder#

Module: scripts/run_stage0.py

Builds compressed and uncompressed model trees for both source and target architectures. Instantiates the target model in CPU mode (NXD_CPU_MODE=1, TP=1) to avoid device dependencies. Uses scripts/stage0_scaffolding.py for tree generation utilities.

Class divergence detector#

Module: scripts/detect_class_divergence.py

Scans the target modeling file for conditional class selection patterns (factory functions, conditional assignments, NKI kernel imports). Produces a JSON report listing each divergence with the CPU class, device class, and recommendation for dual testing.

Smoke test runner#

Module: scripts/run_stage1.py

Runs 10-prompt greedy token matching and computes per-position distribution metrics. Delegates to model_validation.check_accuracy_with_hf_golden for the core comparison.

Component test runner#

Module: scripts/run_stage2.py

Discovers and executes all test_*.py files in the specified test directory. Collects R-ratios and produces a pass/fail summary with the configured threshold (default τ_R = 1.2).

Fault localizer#

Module: scripts/run_stage3.py

Analyzes Stage 2 results using change-point detection to identify spike vs step patterns. Classifies root causes and produces a ranked list of suspect components.

Tensor comparator#

Module: scripts/tensor_compare.py

Core utility for 3-tensor comparison. Computes R-ratio, generates QQ plots and histograms for visual analysis, and provides the compare_3tensors() and check_3tensor_result() functions used by all component tests.

Teacher-forced comparator#

Module: scripts/run_teacher_forced_comparison.py

Runs Stages 5 and 6 in a single pass. Compares source FP32, source BF16, and target BF16 models under teacher forcing. Reports per-position R-ratio, cosine similarity, KL divergence, and top-1 agreement.

Downstream evaluator#

Module: scripts/run_stage7.py

Runs industry-standard benchmarks (via lm_eval) and compares scores against a HuggingFace baseline. Reports per-task accuracy with a configurable regression tolerance (default 2 percentage points).

Calibration tool#

Module: scripts/run_calibration.py

Computes threshold values (τ_R, θ, δ) from known-good ports. Use this to establish project-specific thresholds rather than relying on defaults.

Common issues#

Stage 0 failures#

Error	Solution
`'NoneType' has no attribute 'windowed_context_encoding_size'`	Config validation requires `neuron_config`. Pass `neuron_config=NeuronConfig(...)` to `from_pretrained()`.
`intra_layer_model parallel group is not initialized`	`run_stage0.py` handles this automatically. If running manually, call `init_process_group("gloo")` then `initialize_model_parallel(tp=1)`.
`Please initialize parallel processing via 'torchrun'`	Use `tp_degree=1, world_size=1` for structure inspection.
`No module named 'modeling_xxx'`	Verify `--target-module-file` path is correct and its parent directory is on `sys.path`.

Stage 2 failures#

Error	Solution
Test import failures	Ensure `tensor_compare.py` is copied to the tests directory and `conftest.py` is present.
`TOLERANCE_RATIO` not defined	Fill in all constants in `conftest.py` from the `conftest_template.py`.
Shape mismatch on comparison	Align shapes before calling `compare_3tensors` for fused components (QKV, gate/up projections).
`nn.Parameter()` replacement failed	Use `target.weight = nn.Parameter(weight.to(torch.bfloat16))` instead of `.copy_()`.

Stage 4 failures#

Error	Solution
Patch applied but test still fails	Ensure the `_patched` flag is set and the patch is idempotent.
Downstream composite test breaks after patch	Incomplete fix — run full bottom-up test suite and patch both component and composite if needed.
Buffer assignment does not persist	`register_buffer()` resists assignment. Store corrected value on the wrapper object instead.

Stage 5/6 failures#

Error	Solution
`OnDeviceSamplingConfig` required but missing	Add `on_device_sampling_config=OnDeviceSamplingConfig(...)` to `NeuronConfig`.
Device tensors have wrong shape	Device pads to full `seq_len`. Slice device tensor: `device_tensor[:, :hf_seq_len, :]`.
`embed_tokens` comparison shows infinite error ratio	Embeddings are lookups (no computation), so `baseline_err=0`. Check cosine similarity instead — cosine = 1.0 means PASS.

Project guidelines#

The Equivalence skill enforces these rules during the validation process:

Do NOT write tree generation, test runner, or validation scripts. All major scripts are bundled and must be run as-is.
The only files the user creates are: component_mapping.json (Stage 0, manual), test_NN_*.py test files (Stage 2, following templates), and monkey-patch files (Stage 4, debugging only).
Do NOT modify original model files. All fixes are delivered as standalone patches.
Follow stage order strictly. Do not skip ahead, reorder, or parallelize stages.
Show full output from every script run. Do not summarize or truncate.
No try/except statements in test files. Let errors surface directly.
No additional pip installs. Use only the packages in the provided virtual environment.

Knowledge base#

The skill includes a curated knowledge base at references/ containing solutions gathered from prior validation sessions.

Foundational concepts#

equiv-concept.md — foundational concepts: 3-way comparison, R-ratio derivation, QQ plot interpretation.
expected_structural_diffs.md — catalog of expected HuggingFace to Neuron structural differences.
mapping_example.json — full worked example of a 31-component mapping (Llama4 multimodal model).

Debugging guides#

cpu-component-debugging.md — full workflow for CPU-level component debugging with patterns and pitfalls.
device-component-debugging.md — XLA-compatible patch patterns for device-mode execution.
device-e2e-debugging.md — device E2E with 1-layer isolation and fix-compile-verify cycle.
cpu-e2e-debugging.md — CPU E2E with mp.spawn, TP > 1, and bias restoration.
dump-tensors.md — intermediate tensor capture methodology for per-layer comparison.
neuronxcc-debugging.md — NeuronX compiler debugging tools and escalation procedures.

Case studies#

debugging-case-study-gptoss.md — real worked example from GPT-OSS 20B showing error ratios, root causes, and patches applied.

Visual references#

example_plots/positive_samples/ — QQ plots and histograms showing passing distributions (errors follow the 45-degree line).
example_plots/negative_samples/ — QQ plots and histograms showing failing distributions (divergent error patterns).
