This document is relevant for: Inf1, Inf2, Trn1, Trn2
Quickstart: Configure and deploy a vLLM server using Neuron Deep Learning Container (DLC)#
This topic guides you through deploying a vLLM server on Trainium and Inferentia instances using a Deep Learning Container (DLC) preconfigured with AWS Neuron SDK artifacts. When you complete this tutorial, you will be able to run a vLLM inference server on AWS Trainium and Inferentia instances.
Overview#
You will pull a vLLM Docker image, configure it for Neuron devices, and start an inference server. This process lets you deploy large language models on AWS ML accelerators for high-performance inference workloads.
Before you start#
This tutorial assumes that you have experience in the following areas:
Docker container management
AWS EC2 instance administration
Command-line interface operations
Prerequisites#
Before you begin, ensure you have:
AWS Trainium or Inferentia instance access
Docker installed on your instance. You can set up your Docker environment by following the Tutorial: Docker environment setup
SSH access to your instance
Prepare your environment#
Launch an AWS Trainium or Inferentia instance with sufficient resources for your model requirements. We recommend launching your instance with one of the Neuron Base DLAMIs.
Step 1: Pull the vLLM Docker image#
In this step, you will download the vLLM Docker image from AWS ECR.
Pull the latest vLLM Docker image from the Neuron repository in the AWS ECR Public Gallery: pytorch-inference-vllm-neuronx <https://gallery.ecr.aws/neuron/pytorch-inference-vllm-neuronx>.
docker pull <image_uri>
Replace <image_uri>
with the specific vLLM image URI, for example: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04
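For example, to pull the image tag shown above (newer Neuron SDK releases may publish different tags, so check the gallery page for the latest):
docker pull public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04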
Step 2: Start the Docker container#
In this step, you will run the container with access to Neuron devices. For this tutorial, we are using an inf2.48xlarge instance.
Run the container interactively with access to Neuron devices:
docker run -it \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
--cap-add SYS_ADMIN \
--cap-add IPC_LOCK \
-p 8080:8080 \
--name <server_name> \
<image_uri> \
bash
Note
The inf2.48xlarge instance provides 12 Neuron devices. Adjust the number of Neuron devices (--device=/dev/neuronX
) based on your instance type and requirements.
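To check how many Neuron devices your instance exposes before writing the --device flags, you can list the device nodes on the host. This is a quick sketch; neuron-ls provides more detail if the Neuron tools are installed on the host:
ls /dev/neuron*
neuron-ls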
Step 3: Start the vLLM server#
In this step, you will launch the vLLM inference server inside the container.
Inside the container, start the vLLM inference server:
VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
--model='TinyLlama/TinyLlama-1.1B-Chat-v1.0' \
--max-num-seqs=4 \
--max-model-len=128 \
--tensor-parallel-size=8 \
--port=8080 \
--device 'neuron' \
--override-neuron-config '{"enable_bucketing":false}'
Important
Choose the appropriate model for your use case
Set --tensor-parallel-size to be less than or equal to the number of Neuron devices you specified in Step 2
Server startup typically takes 5-10 minutes
Step 4: Verify server status#
In this step, you will confirm the server starts successfully.
Wait for the server to fully initialize. You will see output showing available API routes:
INFO 08-12 00:04:47 [launcher.py:28] Available routes are:
INFO 08-12 00:04:47 [launcher.py:36] Route: /health, Methods: GET
INFO 08-12 00:04:47 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 08-12 00:04:47 [launcher.py:36] Route: /v1/completions, Methods: POST
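Optionally, you can also poll the /health route from a separate terminal on the host; it should return HTTP 200 once the server is ready to accept requests:
curl -i http://localhost:8080/health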
All complete! Now, let’s confirm everything works.
Step 5: Confirm inference works#
Test the API to confirm your setup works correctly.
Open a separate terminal and make an API call:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{
"role": "user",
"content": "What is the capital of Italy?"
}
]
}'
You should receive a response similar to:
{
"id": "chatcmpl-ac7551dd2f2a4be3bd2c1aabffa79b4c",
"object": "chat.completion",
"created": 1754958455,
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of Italy is Rome...",
"tool_calls": []
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 23,
"total_tokens": 106,
"completion_tokens": 83
}
}
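You can exercise the text completions route in the same way. The following is a minimal sketch that assumes the same TinyLlama model and port mapping used in the previous steps; adjust the prompt and max_tokens for your use case:
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "The capital of Italy is",
        "max_tokens": 32
    }'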
Congratulations! You have successfully deployed a vLLM inference server using a preconfigured Neuron DLC. If you encountered any issues, see the Common issues section below.
Available API endpoints#
The server provides various endpoints for different use cases:
Health Check:
GET /health
Chat Completions:
POST /v1/chat/completions
Text Completions:
POST /v1/completions
Embeddings:
POST /v1/embeddings
Models Info:
GET /v1/models
API Documentation:
GET /docs
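For example, to confirm which model the server is serving, you can query the models endpoint listed above (assuming the port mapping from Step 2):
curl http://localhost:8080/v1/models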
Common issues#
Did you encounter an error while working through this tutorial? Here are common issues and solutions:
Server won’t start: Check that you have sufficient Neuron devices allocated
Connection refused: Verify the container is running and port 8080 is properly mapped
Slow performance: Ensure your tensor-parallel-size matches your available Neuron devices
Memory issues: Consider using a larger instance type or reducing model size
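For connection or startup problems, inspecting the container from the host is often the quickest way to narrow things down. This is a sketch using standard Docker commands; replace <server_name> with the name you chose in Step 2:
docker ps --filter "name=<server_name>"
docker logs <server_name>
docker ps confirms the container is running and that port 8080 is mapped; docker logs shows the container output, which includes the vLLM startup messages.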
For additional help, refer to the complete vLLM User Guide for NxD Inference documentation.
Clean up#
To clean up resources after completing this tutorial:
Stop the Docker container:
docker stop <server_name>
Remove the container:
docker rm <server_name>
Terminate your EC2 instance if no longer needed.
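If you kept the instance but want to reclaim the disk space used by the image, you can also remove it; replace <image_uri> with the URI you pulled in Step 1:
docker rmi <image_uri>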
Next steps#
Now that you’ve completed this tutorial, explore these related topics:
Learn more about vLLM configuration options in the vLLM User Guide for NxD Inference
Explore model optimization techniques for better performance
Set up production deployment with load balancing and monitoring
Further reading#
vLLM User Guide for NxD Inference - Complete documentation for vLLM on Neuron
AWS Neuron SDK Documentation - Full Neuron SDK reference
This document is relevant for: Inf1, Inf2, Trn1, Trn2