Tutorial: Deploy Qwen3-VL 8B on Trn2 instances #

This tutorial provides a step-by-step guide to deploy Qwen/Qwen3-VL-8B-Thinking on a single trn2.48xlarge instance using vLLM V1 with the vLLM-Neuron Plugin.

Examples #

Offline Example
Online Example

Step 1: Set up your development environment #

As a prerequisite, this tutorial requires that you have a Trn2 instance created from a Deep Learning AMI that has the Neuron SDK pre-installed.

To set up a Trn2 instance using Deep Learning AMI with pre-installed Neuron SDK, see the NxDI setup guide. To use a Jupyter (.ipynb) notebook on a Neuron instance, follow this guide.

After setting up an instance, use SSH to connect to the Trn2 instance using the key pair that you chose when you launched the instance.

After you are connected, activate the Python virtual environment that includes the Neuron SDK.

pip list | grep neuron

You should see Neuron packages including neuronx-distributed-inference and neuronx-cc.

Step 2: Install the vLLM version that supports NxD Inference #

NxD Inference supports running models with vLLM. This functionality is available in the vLLM-Neuron GitHub repository. Install the latest release branch of vLLM-Neuron plugin following instructions in the vLLM User Guide for NxD Inference.

Ensure that the Neuron virtual environment is activated if you are using a new terminal instead of the one from connection step above. Then, install the Neuron vLLM into the virtual environment.

Step 3: Download the model from HuggingFace (Optional)#

To deploy Qwen/Qwen3-VL-8B-Thinking on Neuron, download the checkpoint from HuggingFace to a local path on the Trn2 instance. For more information on downloading models from HuggingFace, refer to the HuggingFace guide on downloading models).

After the download, you should see a config.json file in the output folder along with weights in model-xxxx-of-xxxx.safetensors format.

Step 4: Compile and deploy Qwen3 VL Inference #

We provide two examples to run Qwen3 VL with vLLM V1:

Offline inference: you can provide prompts in a python script and execute it.
Online inference: you will serve the model in an online server and send requests.

Model Compilation and Configuration #

Certain configurations are used to optimize the performance of the model during compilation. These configurations are described below and can be modified for your specific use case.

Qwen3 VL consists of a text model and a vision encoder. You must specify configurations explicitly through text_neuron_config and vision_neuron_config.
world_size: max number of neuron cores in the distributed environment. Text and vision model must have the same world size.
tp_degree: degree of tensor parallelism. Text and vision model can use different sharding scheme and therefore different TP degree.
dp_degree: degree of data parallelism for the vision encoder. The vision encoder supports data parallelism where tp_degree * dp_degree = world_size. This allows the vision encoder to process multiple blocks of vision tokens in parallel across cores.
batch_size: This is set to the batch size for compiling the models. For optimized latency, Prefill is always done with batch_size = 1; hence ctx_batch_size in text_neuron_config and the batch_size in vision_neuron_config are set to 1. The batch_size and tkg_batch_size in text_neuron_config are set to the desired value for handling concurrent requests (same as max-num-seqs for the vllm argument).

text_neuron_config
- seq_len: Set this to the maximum sequence length in your use case. We currently support up to 32768 in the text model. This refers to the total length of vision and text, input and output tokens.
- enable_bucketing: Bucketing allows one to optimize performance for specific sequence lengths and in this case we configure specific buckets.
- context_encoding_buckets: This refers to the prefill/context encoding phase and should be set to handle different total length of vision and text input tokens.
- token_generation_buckets: This refers to the decode/token generation phase. The bucket size should reflect the total sequence length, which is the sum of vision tokens, text input tokens, and output tokens.
- ``sequence_parallel_enabled``: Enable the sequence parallelism for text model.
- fused_qkv: QKV weight fusion concatenates a model’s query, key and value weight matrices to achieve better performance.
- qkv_kernel_enabled: Enable the use of the fused QKV kernel.
- mlp_kernel_enabled: Enable the use of the MLP kernel.
- attn_kernel_enabled: Enable the use of the Flash Attention kernel.
vision_neuron_config
- seq_len: Set this to the maximum vision sequence length in your use case. With block-wise encoding enabled via ve_block_size, larger vision sequence lengths (e.g., 131072) are supported. Vision sequence length is calculated by num_images * (image_height//patch_size) * (image_width//patch_size).
- buckets: Set this to handle different vision sequence lengths.
- attn_kernel_enabled: Enable block-packing Flash Attention kernel for the vision encoder. See below ve_block_size for explanation of block-packing.
- ve_block_size: Controls the block size for processing vision embeddings in the vision encoder, must be a multiply of 128. The entire vision sequence is split into multiple blocks. We compute attention per block, skipping unnecessary computation on masked area. Each block should contain at least one image, so set the ve_block_size >= max_num_vision_token_per_image.
- dp_degree: Degree of data parallelism for the vision encoder. The vision encoder can use a different parallelism scheme from the text model. With dp_degree=16 and tp_degree=1, each core processes a separate block of vision tokens independently, which improves throughput for long vision sequences. The constraint tp_degree * dp_degree = world_size must be satisfied.
- fused_qkv: QKV weight fusion concatenates a model’s query, key and value weight matrices to achieve better performance.

Note: Qwen3 VL vision embeddings are spatially compressed by a factor of spatial_merge_size ** 2 before being fed into the text model. This value is defined in the model’s config.json. As a result, the effective text context length is calculated as: text_context_len = vision_seq_len // (spatial_merge_size ** 2).

[ ]:

text_neuron_config = {
    # Batch Size
    "batch_size": 1,
    "ctx_batch_size": 1,
    "tkg_batch_size": 1,

    # Sequence Lengths
    "seq_len": 32768,
    "max_context_length": 32768,

    # Buckets
    "enable_bucketing": True,
    "context_encoding_buckets": [5120, 32768],
    "token_generation_buckets": [5120, 32768],

    # Parallelism
    "world_size": 16,
    "tp_degree": 16,
    "sequence_parallel_enabled": True,

    # Others
    "torch_dtype": "float16",
    "rpl_reduce_dtype": "float16",
    "attention_dtype": "float16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 2,
    "cc_pipeline_tiling_factor": 1,

    # Kernels
    "fused_qkv": True,
    "qkv_kernel_enabled": True,
    "mlp_kernel_enabled": True,
    "attn_kernel_enabled": True,
}

vision_neuron_config = {
    # Batch Size
    "batch_size": 1,

    # Sequence Lengths
    "seq_len": 131072,
    "max_context_length": 131072,

    # Buckets
    "enable_bucketing": True,
    "ve_block_size": 4096,
    "buckets": [32768, 131072],

    # Parallelism
    "world_size": 16,
    "tp_degree": 1,
    "dp_degree": 16,

    # Others
    "torch_dtype": "float16",
    "rpl_reduce_dtype": "float16",
    "cast_type": "as-declared",
    "logical_neuron_cores": 2,
    "cc_pipeline_tiling_factor": 2,

    # Kernels
    "fused_qkv": True,
    "attn_kernel_enabled": True,
    "mlp_kernel_enabled": False,
}

Offline Example #

[ ]:

import os

os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["NEURON_RT_DBG_INTRA_RDH_CHANNEL_BUFFER_SIZE"] = "146800640" # to support 32k sequence length

from vllm import LLM, SamplingParams

model_name_or_path = "~/models/Qwen3-VL-8B-Thinking/"

# Create an LLM.
llm = LLM(
   model=model_name_or_path,
   tokenizer=model_name_or_path,
   trust_remote_code=True,
   tensor_parallel_size=16,
   max_num_seqs=1,
   max_model_len=32768,
   async_scheduling=False,
   additional_config={
      "override_neuron_config": {
            "text_neuron_config": text_neuron_config,
            "vision_neuron_config": vision_neuron_config
      }
   },
   limit_mm_per_prompt={"image": 20}, # Use the max number of image in your use case
   enable_prefix_caching=False,
   enable_chunked_prefill=False,
)

# Sample prompts.
from transformers import AutoProcessor
from vllm.assets.image import ImageAsset

processor = AutoProcessor.from_pretrained(model_name_or_path)

prompt = "What do you see in these images?"
images = [
   ImageAsset("blue_flowers").pil_image,
   ImageAsset("bird").pil_image,
]

placeholders = [{"type": "image"} for _ in images]
messages = [
   {"role": "system", "content": "You are a helpful assistant."},
   {
   "role": "user",
      "content": [
               *placeholders,
               {
               "type": "text",
               "text": prompt,
               },
      ],
   },
]

prompt = processor.apply_chat_template(
   messages,
   tokenize=False,
   add_generation_prompt=True,
)
inputs = {
   "prompt": prompt,
   "multi_modal_data": {
      "image": images,
   },
}

outputs = llm.generate([inputs], SamplingParams(top_k=1, max_tokens=1024))
print(f"Prompt: {prompt!r}, Generated text: {outputs[0].outputs[0].text!r}")

Below is an example output:

Prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|>What do you see in these images?<|im_end|>\n<|im_start|>assistant\n<think>\n', Generated text: "So, let's look at both images. First image: there are blue flowers with water droplets, some pink flowers in the background, and they're in a wet, reflective surface, maybe water. There are bokeh lights (those yellow circles) in the background, so it's a shallow depth of field. Second image: a bird with bright red head and chest, blue wings and tail, perched on a branch. The background is green, blurred, so it's a forest or jungle setting. Need to describe each image clearly.\n\nFirst image details: blue flowers (maybe plumeria?), water droplets on petals, some pink flowers, wet surface (water), reflections, bokeh lights (out of focus yellow circles). Second image: bird with vibrant colors—red body, blue wings/tail, black beak, perched on a brown branch, green background (blurred foliage). Both images have high detail, vibrant colors, nature themes.\n\nSo, summarize each image's content.\n</think>\n\nIn the first image, I see **vibrant blue flowers** (likely plumeria) with water droplets glistening on their petals. These flowers are partially submerged in a reflective, wet surface (possibly water), creating subtle ripples and reflections. In the background, there are soft, out-of-focus pink flowers and warm, golden bokeh lights (blurred circular highlights), which add a dreamy, atmospheric quality to the scene. The overall mood is serene and ethereal, emphasizing the delicate beauty of the flowers and the moisture around them.  \n\nIn the second image, I observe a **colorful bird** perched on a thick, textured brown branch. The bird has a striking combination of colors: a bright red head and chest, vivid blue wings and tail, and a dark beak. Its feathers appear detailed and glossy, with the blue wings showing intricate patterns. The background is a blurred, lush green (suggesting a forest or jungle environment), which creates a soft, natural backdrop that highlights the bird’s vibrant plumage. The image captures the bird in sharp focus, emphasizing its vivid colors and the texture of its feathers and the branch it rests on.  \n\nBoth images showcase nature’s beauty with high detail, vibrant colors, and a focus on the interplay of light and texture."

Online Example #

[ ]:

import json

VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference'
additional_neuron_config=json.dumps(dict(override_neuron_config=dict(text_neuron_config=text_neuron_config, vision_neuron_config=vision_neuron_config)))
limit_mm_per_prompt_json = json.dumps({"image": 20})

start_server_cmd= f'''vllm serve \
--model="~/models/Qwen3-VL-8B-Thinking/" \
--tokenizer="~/models/Qwen3-VL-8B-Thinking/" \
--trust-remote-code \
--tensor-parallel-size=16 \
--max-num-seqs=1 \
--max-model-len=32768 \
--additional-config=\'{additional_neuron_config}\' \
--limit_mm_per_prompt=\'{limit_mm_per_prompt_json}\' \
--no-enable-chunked-prefill \
--no-enable-prefix-caching \
--no-async-scheduling \
--port=8080
'''

import os
os.system(start_server_cmd)

After deploying the model server, you can run inference by sending it requests. The below example sends a text prompt with two images -

[ ]:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8080/v1")
models = client.models.list()
model_name = models.data[0].id

messages = [
   {
      "role": "user",
      "content": [
            {
               "type": "image_url",
               "image_url": {
                  "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"}
            },
            {
               "type": "text",
               "text": "Describe this image",
            },
      ],
   },
]

response = client.chat.completions.create(
   model=model_name,
   messages=messages,
   temperature=1.0,
   top_p=1.0,
   stream=False,
   extra_body={"top_k": 1},
)

generated_text = response.choices[0].message.content
print(generated_text)

Below is an example output:

So, let's describe this image. First, the main subject is a wild cat, probably a Pallas's cat, in a snowy environment. Let's check the details. The cat has thick, fluffy fur that's a mix of brown, gray, and maybe some lighter shades. Its fur is dusted with snow, so it's in a winter setting. The cat is walking on snow, with one paw lifted, so it's in motion. The background has white birch trees with black bark patterns, typical of a snowy forest. There's also a chain-link fence on the left side, which might indicate a controlled environment like a zoo or wildlife reserve. The snow on the ground is fresh, and there are some small twigs or debris visible. The cat's face has distinctive markings, like the white area around the mouth and the striped pattern on its cheeks. The overall scene is cold, with the snow and the cat's thick fur suggesting it's adapted to cold climates. Let's structure the description: start with the main subject, then details about the cat's appearance, the environment, and the setting.
</think>

The image depicts a **Pallas's cat** (a wild feline species native to Central Asia) walking through a snowy landscape. The cat’s thick, fluffy fur is a mix of brown, gray, and cream tones, dusted with snowflakes, emphasizing its adaptation to cold climates. Its face features distinctive markings: a white patch around the mouth, dark stripes on the cheeks, and a short, rounded muzzle. The cat is captured mid-stride, with one paw lifted, conveying movement across the snow-covered ground.

In the background, **white-barked birch trees** with dark, irregular bark patterns create a stark, wintry forest scene. To the left, a **chain-link fence** suggests the setting may be a controlled environment like a zoo or wildlife reserve. The snow on the ground is fresh and undisturbed except for the cat’s path, with small twigs and debris scattered nearby. The overall atmosphere is serene and cold, highlighting the cat’s natural camouflage and resilience in a snowy habitat.

Conclusion #

Congratulations ! You now know how to deploy Qwen/Qwen3-VL-8B-Thinking on a trn2.48xlarge instance. Modify the configurations and deploy the model as per your requirements and use case.

Tutorial: Deploy Qwen3-VL 8B on Trn2 instances

Contents