Quickstart: Serve models online with vLLM on Neuron#
This quickstart shows you how to launch the vLLM OpenAI-compatible API server on AWS Neuron. You install the vllm-neuron plugin, start the server, validate it with curl, and call it from the OpenAI Python SDK.
This quickstart is for: Developers who need an interactive, low-latency serving endpoint on Neuron Time to complete: ~20 minutes
Prerequisites#
Before you begin, make sure you have:
An EC2 instance with Neuron cores and network access to Hugging Face Hub.
The Neuron SDK installed (see Setup Instructions).
Python 3.10 or later with
pip.Basic familiarity with running Python scripts in a virtual environment.
Note
For the fastest setup, consider the vLLM Neuron Deep Learning Container (DLC), which bundles the SDK, vLLM, and dependencies. See Quickstart: Configure and deploy a vLLM server using Neuron Deep Learning Container (DLC).
Step 1: Install the vllm-neuron plugin#
In this step, you install the Neuron-enabled vLLM plugin inside your Python environment.
# Activate your Neuron virtual environment
source ~/aws_neuronx_venv_pytorch_2_8_nxd_inference/bin/activate
# Clone the vLLM Neuron plugin repository
git clone https://github.com/vllm-project/vllm-neuron.git
cd vllm-neuron
# Install with the Neuron package repository
pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com -e .
Important
The --extra-index-url flag pulls the Neuron-compatible wheels from the AWS repository.
To confirm the install succeeded, run python -c "import vllm" and verify no errors display.
Step 2: Launch the API server#
In this step, you start an OpenAI-compatible endpoint with a LLaMA model.
export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
vllm serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 128 \
--max-num-seqs 4 \
--no-enable-prefix-caching \
--port 8000 \
--additional-config '{
"override_neuron_config": {
"enable_bucketing": false
}
}'
Key arguments:
--tensor-parallel-size: Matches the number of Neuron cores you want to use.--max-model-lenand--max-num-seqs: Duplicate limits from your offline workflow.--additional-config: Wrap Neuron overrides underoverride_neuron_config(enable_bucketinghere).--port: Choose the listening port for the API server.
Step 3: Verify the endpoint with curl#
In this step, you confirm the server is responding by sending a chat completion request.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "List three Neuron optimization tips."}
],
"temperature": 0.2
}'
If successful, the server returns a JSON payload containing the generated answer.
Step 4: Call the API with the OpenAI SDK#
Now that the server is live, call it using the OpenAI Python SDK.
from openai import OpenAI
# Client setup
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
)
models = client.models.list()
model_name = models.data[0].id
max_tokens = 50
temperature = 1.0
top_p = 1.0
top_k = 50
stream = False
response = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": "Hello, my name is Llama"}],
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stream=stream,
extra_body={"top_k": top_k},
)
generated_text = ""
if stream:
for chunk in response:
if chunk.choices[0].delta.content is not None:
generated_text += chunk.choices[0].delta.content
else:
generated_text = response.choices[0].message.content
print(generated_text)
Step 5: Explore advanced configuration (optional)#
The commands below show optional tuning features to adapt the server for different workloads.
Reuse compiled models to avoid recompilation:
# Create directory for compiled artifacts if it doesn't exist
mkdir -p ./neuron_compiled_models/llama3-8b
# Set the environment variable before launching the server
export NEURON_COMPILED_ARTIFACTS="./neuron_compiled_models/llama3-8b"
Enable prefix caching when prompts share a long context:
vllm serve \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--tensor-parallel-size 32 \
--max-model-len 1024 \
--max-num-seqs 8 \
--enable-prefix-caching \
--block-size 32 \
--num-gpu-blocks-override 256 \
--additional-config '{
"override_neuron_config": {
"is_prefix_caching": true,
"is_block_kv_layout": true,
"pa_num_blocks": 256,
"pa_block_size": 32
}
}' \
--port 8000
Use Eagle speculative decoding when you have an EAGLE checkpoint available:
Below is an example of how to run vLLM inference with an EAGLE V1 checkpoint
Note
Eagle draft checkpoints must be converted for NxD Inference compatibility and include the target model’s LM head. Follow the guidance at EAGLE Checkpoint Compatibility.
vllm serve \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 32 \
--max-model-len 2048 \
--max-num-seqs 4 \
--speculative-config '{"model": "./eagle_draft_converted", "method": "eagle", "num_speculative_tokens": 5, "max_model_len": 2048}' \
--port 8000
Use MultiLoRA:
Note
For multi-LoRA serving, you can optionally create a JSON configuration file that maps LoRA adapter IDs to their checkpoint paths. This enables dynamic adapter loading and swapping between HBM and host memory.
# Example JSON configuration (save as lora_config.json):
# {
# "lora-ckpt-dir": "/opt/lora/tinyllama/",
# "lora-ckpt-paths": {
# "tarot_adapter": "tarot",
# "support_adapter": "mental-health"
# },
# "lora-ckpt-paths-cpu": {
# "tarot_adapter": "tarot",
# "support_adapter": "mental-health"
# }
# }
vllm serve \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--tensor-parallel-size 32 \
--max-model-len 1024 \
--max-num-seqs 2 \
--enable-lora \
--max-loras 2 \
--max-cpu-loras 4 \
--lora-modules \
tarot_adapter=/opt/lora/tinyllama/tarot \
support_adapter=/opt/lora/tinyllama/mental-health \
--additional-config '{"override_neuron_config": {"lora_ckpt_json": "/path/to/lora_config.json"}}' \
--port 8000
Clients can select an adapter per request by setting the model field to the adapter ID in the request. The max-loras parameter controls concurrent adapters in HBM, while max-cpu-loras controls adapters in host memory with dynamic swapping support.
Tune context and token buckets for long prompts:
export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
vllm serve \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--tensor-parallel-size 32 \
--max-num-seqs 1 \
--max-model-len 1024 \
--port 8000 \
--no-enable-prefix-caching \
--additional-config '{
"override_neuron_config": {
"enable_bucketing": true,
"context_encoding_buckets": [256, 512, 1024],
"token_generation_buckets": [32, 64, 128, 256, 512, 768],
"max_context_length": 1024,
"seq_len": 1024,
"batch_size": 1,
"ctx_batch_size": 1,
"tkg_batch_size": 1,
"is_continuous_batching": true
}
}'
Set NEURON_COMPILED_ARTIFACTS before launching if you want to reuse artifacts across runs.
Confirmation#
Resend the curl request from Step 3 or rerun the OpenAI SDK snippet from Step 4. Successful responses confirm that the server is up and reachable. You can also open http://localhost:8000/health to check the health probe.
Common issues#
Server exits immediately: Confirm
--tensor-parallel-sizematches the number of available Neuron cores.Requests return 5xx errors: Lower
--max-num-seqsor--max-model-lenif the model runs out of memory.Initial requests take too long: Set
NEURON_COMPILED_ARTIFACTSso subsequent launches reuse compiled artifacts.
Clean up#
Stop the API server (Ctrl+C). Deactivate your Python environment with deactivate. Remove the cloned vllm-neuron repository if you no longer need it, and clear any cached artifacts if disk space is a concern.
Next steps#
Explore prefix caching, Eagle speculative decoding, and other options in NxD Inference Features Configuration Guide.
Review supported model architectures in Supported Model Architectures.
Try the offline batch quickstart (Quickstart: Run offline inference with vLLM on Neuron) if you need non-interactive inference.
Further reading#
vLLM User Guide for NxD Inference – Complete integration reference.
NxD Inference Tutorials – In-depth tutorials and workflow guides.
OpenAI Python SDK reference – API documentation for the client used in Step 4.