This document is relevant for: Inf1, Inf2, Trn1, Trn2

Quickstart: Configure and deploy a vLLM server using Neuron Deep Learning Container (DLC)#

This topic guides you through deploying a vLLM server on Trainium and Inferentia instances using a Deep Learning Container (DLC) preconfigured with AWS Neuron SDK artifacts. When you complete this tutorial, you will be able to run a vLLM inference server on AWS Trainium and Inferentia instances.

Overview#

You will pull a vLLM Docker image, configure it for Neuron devices, and start an inference server. This process lets you deploy large language models on AWS ML accelerators for high-performance inference workloads.

Before you start#

This tutorial assumes that you have experience in the following areas:

  • Docker container management

  • AWS EC2 instance administration

  • Command-line interface operations

Prerequisites#

Before you begin, ensure you have:

  • AWS Trainium or Inferentia instance access

  • Docker installed on your instance. You can set up your Docker environment by following the Tutorial Docker environment setup

  • SSH access to your instance

Prepare your environment#

Launch an AWS Trainium or Inferentia instance with sufficient resources for your model requirements. We recommend launching your instance with one of the Neuron Base DLAMIs.
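
Once the instance is running, you can optionally confirm that the Neuron driver sees the accelerators before moving on to Docker. This is a quick sanity check, assuming the Neuron tools included in the base DLAMIs are on your PATH:

# List the Neuron devices visible to the driver (requires aws-neuronx-tools)
neuron-ls

# Alternatively, list the Neuron device files exposed by the driver
ls /dev/neuron*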

Step 1: Pull the vLLM Docker image#

In this step, you will download the vLLM Docker image from AWS ECR.

Pull the latest vLLM Docker image from the Neuron repository in the AWS ECR public gallery: pytorch-inference-vllm-neuronx <https://gallery.ecr.aws/neuron/pytorch-inference-vllm-neuronx>.

docker pull <image_uri>

Replace <image_uri> with the specific vLLM image URI, for example: public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py310-sdk2.25.0-ubuntu22.04
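
After the pull completes, you can verify that the image is available locally, for example with:

# Confirm the image was pulled and note its tag
docker images public.ecr.aws/neuron/pytorch-inference-vllm-neuronx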

Step 2: Start the Docker container#

In this step, you will run the container with access to Neuron devices. For this tutorial, we are using an inf2.48xlarge instance.

Run the container interactively with access to Neuron devices:

docker run -it \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
--cap-add SYS_ADMIN \
--cap-add IPC_LOCK \
-p 8080:8080 \
--name <server_name> \
<image_uri> \
bash

Note

The inf2.48xlarge instance provides 12 Neuron devices, all of which are mapped into the container in the command above. Adjust the number of Neuron devices (--device=/dev/neuronX) based on your instance type and requirements.
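
If you prefer not to list one --device flag per accelerator by hand, you can generate the flags from the device files present on the host. The sketch below builds an equivalent docker run command dynamically; it assumes a Bash shell and uses the same <server_name> and <image_uri> placeholders as above:

# Build one --device flag for each Neuron device file on the host
DEVICE_ARGS=""
for dev in /dev/neuron*; do
  DEVICE_ARGS="$DEVICE_ARGS --device=$dev"
done

docker run -it $DEVICE_ARGS \
--cap-add SYS_ADMIN \
--cap-add IPC_LOCK \
-p 8080:8080 \
--name <server_name> \
<image_uri> \
bash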

Step 3: Start the vLLM server#

In this step, you will launch the vLLM inference server inside the container.

Inside the container, start the vLLM inference server:

VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
--model='TinyLlama/TinyLlama-1.1B-Chat-v1.0' \
--max-num-seqs=4 \
--max-model-len=128 \
--tensor-parallel-size=8 \
--port=8080 \
--device 'neuron' \
--override-neuron-config '{"enable_bucketing":false}'

Important

  • Choose the appropriate model for your use case

  • Set --tensor-parallel-size to be less than or equal to the number of Neuron devices you specified in Step 2

  • Server startup typically takes 5-10 minutes
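
Because startup can take several minutes, you may find it convenient to run the server in the background inside the container and follow its log file instead of holding the terminal. A minimal sketch using the same command as above and an arbitrary log file name:

# Start the server in the background and capture its output in a log file
VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' nohup python -m vllm.entrypoints.openai.api_server \
--model='TinyLlama/TinyLlama-1.1B-Chat-v1.0' \
--max-num-seqs=4 \
--max-model-len=128 \
--tensor-parallel-size=8 \
--port=8080 \
--device 'neuron' \
--override-neuron-config '{"enable_bucketing":false}' > vllm_server.log 2>&1 &

# Follow the log until the available routes are printed
tail -f vllm_server.log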

Step 4: Verify server status#

In this step, you will confirm the server starts successfully.

Wait for the server to fully initialize. You will see output showing available API routes:

INFO 08-12 00:04:47 [launcher.py:28] Available routes are:
INFO 08-12 00:04:47 [launcher.py:36] Route: /health, Methods: GET
INFO 08-12 00:04:47 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 08-12 00:04:47 [launcher.py:36] Route: /v1/completions, Methods: POST
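
You can also probe the health endpoint from a second terminal on the instance to confirm the server is accepting requests. A simple readiness loop, assuming the port mapping from Step 2:

# Poll the health endpoint until the server responds successfully
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "Waiting for the vLLM server..."
  sleep 15
done
echo "Server is ready"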

All complete! Now, let’s confirm everything works.

Step 5: Confirm inference works#

Test the API to confirm your setup works correctly.

Open a separate terminal and make an API call:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of Italy?"
    }
  ]
}'

You should receive a response similar to:

{
  "id": "chatcmpl-ac7551dd2f2a4be3bd2c1aabffa79b4c",
  "object": "chat.completion",
  "created": 1754958455,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of Italy is Rome...",
        "tool_calls": []
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 106,
    "completion_tokens": 83
  }
}
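
The plain text completions endpoint accepts a similar request. The example below passes the served model name explicitly and caps the output length; adjust both to match your deployment:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "prompt": "The capital of Italy is",
  "max_tokens": 32
}'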

Congratulations! You have successfully deployed a vLLM inference server using a preconfigured Neuron DLC. If you encountered any issues, see the Common issues section below.

Available API endpoints#

The server provides various endpoints for different use cases:

  • Health Check: GET /health

  • Chat Completions: POST /v1/chat/completions

  • Text Completions: POST /v1/completions

  • Embeddings: POST /v1/embeddings

  • Models Info: GET /v1/models

  • API Documentation: GET /docs
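
For example, you can query the informational endpoints directly with curl from the instance, assuming the port mapping from Step 2:

# List the models served by this endpoint
curl http://localhost:8080/v1/models

# Liveness check; returns HTTP 200 when the server is healthy
curl -i http://localhost:8080/health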

Common issues#

Did you encounter an error while working through this tutorial? Here are common issues and solutions:

  • Server won’t start: Check that you have sufficient Neuron devices allocated

  • Connection refused: Verify the container is running and port 8080 is properly mapped

  • Slow performance: Ensure your tensor-parallel-size matches your available Neuron devices

  • Memory issues: Consider using a larger instance type or reducing model size
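
When troubleshooting these issues, a few standard commands can help narrow down the cause. The sketch below assumes the container name you chose in Step 2:

# Check whether the container is running and the port is mapped
docker ps --filter "name=<server_name>"

# Inspect the container output for vLLM or Neuron runtime errors
docker logs <server_name>

# From the host, confirm the Neuron devices and driver are healthy
neuron-ls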

For additional help, refer to the complete vLLM User Guide for NxD Inference documentation.

Clean up#

To clean up resources after completing this tutorial:

  1. Stop the Docker container:

    docker stop <server_name>
    
  2. Remove the container:

    docker rm <server_name>
    
  3. Terminate your EC2 instance if no longer needed.
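
Optionally, you can also reclaim the disk space used by the pulled image once the container has been removed, using the same image URI from Step 1:

docker rmi <image_uri>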

Next steps#

Now that you’ve completed this tutorial, explore these related topics:

  • Learn more about vLLM configuration options in the vLLM User Guide for NxD Inference

  • Explore model optimization techniques for better performance

  • Set up production deployment with load balancing and monitoring

This document is relevant for: Inf1, Inf2, Trn1, Trn2