This document is relevant for: Inf1, Inf2, Trn1, Trn2, Trn3
Tutorial: Port a HuggingFace model using the Autoport skill#
This tutorial walks you through porting a HuggingFace transformer model to NxD Inference using the Autoport agent. You will provide model parameters, invoke the agent, and the agent will handle the rest: analysis, implementation, compilation, inference testing, and accuracy validation.
By the end of this tutorial you will have a working NxD Inference implementation of your model that passes a 95% greedy token match against the HuggingFace reference.
Prerequisites#
Set up a Trainium instance#
You need a Trainium instance (trn1 or trn2). Launch it from the Neuron Deep
Learning AMI (DLAMI) and SSH in.
Verify Neuron devices are available.
neuron-ls
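You should see your instance's NeuronCores listed. The count depends on the instance size (for example, 32 on a trn1.32xlarge). If you see 0, your instance does not have Neuron hardware attached or the driver is not loaded. Stop here and fix that first.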
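As a complementary check from Python, the sketch below counts the Neuron device nodes visible on the instance. It assumes the driver exposes one `/dev/neuron<N>` node per Neuron device (not per NeuronCore); `count_neuron_devices` is a hypothetical helper name, not part of the Neuron tooling.

```python
import glob
import os


def count_neuron_devices(dev_dir: str = "/dev") -> int:
    """Count device nodes named neuron<N> under dev_dir.

    Assumption: the Neuron driver exposes one /dev/neuron<N> node per
    Neuron device. Each device holds several NeuronCores, so this number
    is smaller than the core count that neuron-ls reports.
    """
    candidates = glob.glob(os.path.join(dev_dir, "neuron*"))
    # Keep only names of the form neuron0, neuron1, ...
    return sum(
        1
        for path in candidates
        if os.path.basename(path)[len("neuron"):].isdigit()
    )


if __name__ == "__main__":
    print(f"Neuron devices found: {count_neuron_devices()}")
```

A result of 0 here points to the same problem as an empty `neuron-ls` listing.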
Install Neuron Agentic Development#
If you have not already installed the package, follow the Getting Started guide.
Make sure the deploy step completed.
# For Claude Code
deploy-neuron-agentic-development-to-claude
# For Kiro
deploy-neuron-agentic-development-to-kiro
Activate your Python environment#
source /opt/aws_neuronx_venv_pytorch_2_9/bin/activate
Verify the required packages are installed.
pip list | grep neuronx-distributed-inference
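The same check can be done from inside Python, which also confirms that the active interpreter (not just `pip`) sees the package. This is a minimal sketch using the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    name = "neuronx-distributed-inference"
    version = installed_version(name)
    print(f"{name}: {version or 'NOT INSTALLED'}")
```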
Download your model weights#
Download the HuggingFace model you want to port. For this example we use a small model to keep compilation fast.
mkdir -p agent_artifacts/data
huggingface-cli download arcee-ai/AFM-4.5B-Base --local-dir agent_artifacts/data
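Before moving on, it can help to confirm that the download produced weight files. The sketch below lists files ending in `.safetensors` or `.bin` directly inside the weights directory; `find_weight_files` is a hypothetical helper, and it assumes the weights sit at the top level of the directory rather than in subfolders.

```python
from pathlib import Path
from typing import List

WEIGHT_SUFFIXES = (".safetensors", ".bin")


def find_weight_files(weights_dir: str) -> List[str]:
    """Return sorted names of weight files directly inside weights_dir.

    Returns an empty list if the directory does not exist.
    """
    directory = Path(weights_dir)
    if not directory.is_dir():
        return []
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    )


if __name__ == "__main__":
    files = find_weight_files("agent_artifacts/data")
    print(files if files else "No weight files found")
```

An empty result means the path is not yet suitable as pathToModelWeightsDirectory.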
Step 1. Gather your model parameters#
The Autoport agent needs six pieces of information about your model. Gather these before you start.
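| Parameter | What it is | Example value |
|---|---|---|
| ModelName | The HuggingFace model class name | ArceeForCausalLM |
| pathToModelImplementationDirectory | Path to the HuggingFace model source | transformers/src/transformers/models/arcee |
| nameOfImplementationFile | The modeling file | modeling_arcee.py |
| nameOfConfigurationFile | The config file | configuration_arcee.py |
| huggingFaceModelID | HuggingFace model identifier | arcee-ai/AFM-4.5B-Base |
| pathToModelWeightsDirectory | Where model weights live | agent_artifacts/data |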
You can also pass an optional pathToVenv if you use a non-default virtual environment.
To find the model class name, open the modeling file in the HuggingFace transformers
source and look for the main class (usually <ModelName>ForCausalLM).
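If the modeling file is long, a short script can find the candidate class names for you. The sketch below parses Python source with the standard `ast` module and returns top-level classes whose names end in `ForCausalLM`. The helper name and the stand-in source are illustrative; models that use a different naming pattern need a different filter.

```python
import ast
from typing import List


def find_causal_lm_classes(source: str) -> List[str]:
    """Return names of top-level classes ending in 'ForCausalLM'."""
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, ast.ClassDef) and node.name.endswith("ForCausalLM")
    ]


# Tiny stand-in for a modeling file; a real one is much longer.
example_source = '''
class ArceeModel:
    pass

class ArceeForCausalLM:
    pass
'''

print(find_causal_lm_classes(example_source))  # ['ArceeForCausalLM']
```

To run it on a real file, pass the file's contents instead, for example `Path("modeling_arcee.py").read_text()`.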
Step 2. Invoke the Autoport agent#
Open your agentic IDE (Claude Code or Kiro) on the Trainium instance.
Invoke the agent with your parameters.
Port with inputs as ModelName is ArceeForCausalLM,
pathToModelImplementationDirectory is transformers/src/transformers/models/arcee,
nameOfImplementationFile is modeling_arcee.py,
nameOfConfigurationFile is configuration_arcee.py,
huggingFaceModelID is arcee-ai/AFM-4.5B-Base,
pathToModelWeightsDirectory is agent_artifacts/data
The agent confirms your parameters and starts working. You do not need to do anything else. The agent runs through all six stages automatically.
If you want a dry run (analysis and code generation only, no compilation or hardware), add
dry-run to your request.
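If you port several models, you may prefer to generate the request text instead of typing it. The sketch below assembles the request in the "name is value" form shown above. `build_port_request` is a hypothetical helper, and appending "dry-run" as a final item is an assumption, since this tutorial does not specify where the phrase must appear.

```python
from typing import Dict


def build_port_request(params: Dict[str, str], dry_run: bool = False) -> str:
    """Format parameters as 'Port with inputs as <name> is <value>, ...'."""
    pairs = ",\n".join(f"{name} is {value}" for name, value in params.items())
    request = f"Port with inputs as {pairs}"
    if dry_run:
        # Assumption: placement of "dry-run" is not specified by the tutorial.
        request += ",\ndry-run"
    return request


params = {
    "ModelName": "ArceeForCausalLM",
    "pathToModelImplementationDirectory": "transformers/src/transformers/models/arcee",
    "nameOfImplementationFile": "modeling_arcee.py",
    "nameOfConfigurationFile": "configuration_arcee.py",
    "huggingFaceModelID": "arcee-ai/AFM-4.5B-Base",
    "pathToModelWeightsDirectory": "agent_artifacts/data",
}

print(build_port_request(params))
```

Paste the printed text into your agentic IDE as the request.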
Step 3. What happens during the port#
The agent works through six stages. Here is what you will see at each one.
Stage 1: Knowledge base analysis. The agent reads its internal porting guides and known issues database. It identifies patterns relevant to your model architecture. This takes a few seconds.
Stage 2: Architecture analysis. The agent reads the HuggingFace model source code and maps each component (attention, MLP, embeddings) to existing NxD Inference modules. It identifies what can be reused and what needs custom implementation.
Stage 3: Implementation. The agent writes the Neuron-compatible model code. It creates
files in a neuron_port/ directory. You can watch the code appear in real time.
Stage 4: Compilation. The agent compiles the model to NEFF format using the Neuron
compiler. This is the longest step and can take 10 to 30 minutes depending on model size.
The agent sets tp_degree=8 by default (8 NeuronCores). You will see compiler output
scroll by.
Stage 5: Inference testing. The agent loads the compiled model and generates text. It verifies the output is coherent (not garbage or repeated tokens).
Stage 6: Accuracy validation. The agent compares Neuron model output against the HuggingFace reference model loaded in FP32. It checks 64 tokens with greedy decoding. The port passes when match rate reaches 95% or higher.
If validation fails, the agent automatically iterates. It analyzes what diverged, fixes the code, recompiles, and validates again. It does not stop until it passes.
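To make the Stage 6 pass criterion concrete, the sketch below computes a match rate between two greedy-decoded token sequences and applies the 95% threshold. The tutorial does not spell out how the agent compares tokens, so the position-by-position comparison here is an assumption, and `greedy_match_rate` is a hypothetical helper.

```python
from typing import Sequence


def greedy_match_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of reference positions where candidate has the same token.

    Positions missing from a shorter candidate count as mismatches.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    matches = sum(1 for ref, cand in zip(reference, candidate) if ref == cand)
    return matches / len(reference)


# Stand-in token IDs: 64 reference tokens, with 2 diverging on the Neuron side.
reference_ids = list(range(64))
neuron_ids = list(range(64))
neuron_ids[10] = -1
neuron_ids[40] = -1

rate = greedy_match_rate(reference_ids, neuron_ids)
print(f"match rate: {rate:.2%}, passes: {rate >= 0.95}")
```

With 64 tokens, a 95% threshold tolerates at most three mismatched positions under this definition.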
Step 4. Check the results#
When the agent finishes, you will have these outputs.
project_root/
├── neuron_port/
│ └── modeling_yourmodel.py # The ported implementation
├── agent_artifacts/
│ ├── data/compiled_model/ # Compiled NEFF artifacts
│ ├── traces/port_summary.md # Summary of decisions made
│ └── results/ # Validation JSON results
The ported model in neuron_port/ is the final product. You can use it directly with
NxD Inference.
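A quick way to confirm the run produced everything is to check for the paths in the tree above. This is a minimal sketch: `check_outputs` is a hypothetical helper, and it only checks the neuron_port/ directory because the file name inside it depends on your model.

```python
from pathlib import Path
from typing import Dict

# Paths taken from the output tree, relative to the project root.
EXPECTED_OUTPUTS = (
    "neuron_port",
    "agent_artifacts/data/compiled_model",
    "agent_artifacts/traces/port_summary.md",
    "agent_artifacts/results",
)


def check_outputs(project_root: str) -> Dict[str, bool]:
    """Map each expected output path to whether it exists."""
    root = Path(project_root)
    return {rel: (root / rel).exists() for rel in EXPECTED_OUTPUTS}


if __name__ == "__main__":
    for rel, present in check_outputs(".").items():
        print(f"{'ok     ' if present else 'MISSING'}  {rel}")
```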
Step 5. Deploy with vLLM (optional)#
Once you have a validated port, you can serve it with vLLM. See the vLLM User Guide for deployment instructions.
Troubleshooting#
The agent handles most issues automatically, but here are things that might require your input.
Agent asks for missing parameters. If you forgot a parameter, the agent will ask for it. Provide the value and it continues.
Compilation takes too long. Large models (70B+) can take 30 to 60 minutes to compile.
This is normal. You can reduce compilation time by setting a smaller seq_len for testing.
Agent cannot find model weights. Make sure pathToModelWeightsDirectory points to
a directory containing the actual weight files (.safetensors or .bin).
Validation keeps failing. If the agent iterates more than 3 times on validation,
check the agent_artifacts/traces/ directory for the agent’s analysis of what is
diverging. Common causes include wrong normalization layers (LayerNorm vs RMSNorm),
incorrect RoPE implementation, or weight loading mismatches.
No NeuronCores detected. Run neuron-ls. If it shows 0 devices, your instance
either does not have Neuron hardware or the driver is not loaded. Check that
aws-neuronx-dkms is installed.
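For the driver case, one way to see whether a kernel module is loaded is to read /proc/modules. The sketch below assumes the Neuron driver registers a kernel module named `neuron`; verify that name on your system before relying on the result. `module_loaded` is a hypothetical helper.

```python
def module_loaded(modules_text: str, name: str = "neuron") -> bool:
    """Return True if name appears as a module in /proc/modules content.

    Each line of /proc/modules starts with the module name, so only the
    first whitespace-separated field is compared.
    """
    return any(
        line.split()[0] == name
        for line in modules_text.splitlines()
        if line.strip()
    )


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            # Assumption: the Neuron kernel module is named "neuron".
            print("neuron module loaded:", module_loaded(handle.read()))
    except FileNotFoundError:
        print("/proc/modules is not available on this system")
```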