Tutorial: Deploy Qwen2-VL on Trn2 instances #

This tutorial provides a step-by-step guide to deploy Qwen/Qwen2-VL-7B-Instruct using NeuronX Distributed (NxD) Inference on a single trn2.48xlarge instance.

Step 1: Set up your development environment #

As a prerequisite, this tutorial requires that you have a Trn2 instance created from a Deep Learning AMI that has the Neuron SDK pre-installed.

To set up a Trn2 instance using Deep Learning AMI with pre-installed Neuron SDK, see the NxDI setup guide. To run a Jupyter (.ipynb) notebook on a Neuron instance, follow this guide.

After setting up an instance, use SSH to connect to the Trn2 instance using the key pair that you chose when you launched the instance.

After you are connected, activate the Python virtual environment that includes the Neuron SDK.

pip list | grep neuron

You should see Neuron packages including neuronx-distributed-inference and neuronx-cc.

Step 2: Install the vLLM version that supports NxD Inference #

NxD Inference supports running models with vLLM. This functionality is available in the vLLM-Neuron GitHub repository. Install the latest release branch of vLLM-Neuron plugin following instructions in the vLLM User Guide for NxD Inference.

Ensure that the Neuron virtual environment is activated if you are using a new terminal session instead of the one from connection step above. Then, install the Neuron vLLM fork into the virtual environment.

Step 3 Download the model from HuggingFace (Optional)#

To deploy Qwen/Qwen2-VL-7B-Instruct on Neuron, first download the checkpoint from HuggingFace to a local path on the Trn2 instance. For more information on downloading models from HuggingFace, refer to HuggingFace’s guide on Downloading models).

After the download, you should see a config.json file in the output folder along with weights in model-xxxx-of-xxxx.safetensors format.

Step 4: Compile and deploy Qwen2-VL Inference #

In this step, you use the vllm command to deploy the model. The neuronx-distributed-inference model loader in vllm performs JIT compilation before deploying it with the model server. Replace the model_name_or_path with your specific path if you download the model checkpoint from HuggingFace(Step 3).

Here are two examples of running Qwen2-VL with vLLM V1:

Offline inference: you can provide prompts in a python script and execute it.
Online inference: you will serve the model in an online server and send requests.

Model Configuration Requirements & Examples #

There is a known issue with batch_size > 1 or tp_degree != 4 configurations for Qwen2-VL models. Here we suggest to use batch_size = 1 and tp_degree = 4 configuration, which deploys Qwen/Qwen2-VL-7B-Instruct model on a single trn2 chip with 4 cores. You can replicate the setting on the trn2.48xlarge instance consisting of 16 chips and 64 cores.

We support configurable image sizes for Qwen2-VL and use number_of_images as the vision buckets. For example, in the configuration below, number_of_images is the maximum vision bucket, i.e., 128. Please specify default_image_width and default_image_height in the vision_neuron_config as the input image size. The default image sizes are default_image_width: 640 and default_image_height: 320.

Note: Please make sure the number of tokens does not exceed the max_context_length in the text_neuron_config, i.e., number_of_prompt_tokens + (default_image_width // 28) * (default_image_height // 28) * number_of_images < max_context_length - max_new_tokens.

We configure these fields below to improve performance. For more details, refer to NxD Inference features configurations guide.

``enable_ve_data_parallelism``: whether to enable vision encoder data parallelism.

Note: We set the ve_dp_degree to world_size // tp_degree in the vision_neuron_config. With enable_ve_data_parallelism=True, we require the number of images (vision bucket size) to be divisible by ve_dp_degree.

sequence_parallel_enabled: whether to enable sequence parallelism.
fuse_qkv and qkv_kernel_enabled: whether to use the fused QKV kernel. qkv_kernel_enabled is not supported yet in the vision_neuron_config for Qwen2-VL.
attn_kernel_enabled: whether to use the optimized attention kernel.

Below we provide the recommended configuration with batch_size 1 and tp_degree 4.

Note: If you encounter Out-of-Memory issue during the runtime, please try to reduce the size of vision buckets as the KV cache grows linearly with batch size and sequence length.

[ ]:

qwen2_vl_neuron_config = {
    "text_neuron_config": {
        "batch_size": 1,
        "ctx_batch_size": 1,
        "tkg_batch_size": 1,
        "seq_len": 32768,
        "max_new_tokens": 64,
        "max_context_length": 32768,
        "torch_dtype": "float16",
        "skip_sharding": False,
        "save_sharded_checkpoint": True,
        "tp_degree": 4,
        "world_size": 4,
        "enable_bucketing": True,
        "context_encoding_buckets": [2048, 16384, 32768],
        "token_generation_buckets": [2048, 16384, 32768],
        "fused_qkv": True,
        "qkv_kernel_enabled": True,
        "sequence_parallel_enabled": True,
        "attn_kernel_enabled": True,
        "cc_pipeline_tiling_factor": 2,
        "attention_dtype": "float16",
        "rpl_reduce_dtype": "float16",
        "cast_type": "as-declared",
        "logical_neuron_cores": 2,
    },
    "vision_neuron_config": {
        "batch_size": 1,
        "seq_len": 131072,
        "max_context_length": 131072,
        "torch_dtype": "bfloat16",
        "skip_sharding": False,
        "save_sharded_checkpoint": True,
        "tp_degree": 1,
        "world_size": 4,
        "fused_qkv": True,
        "enable_ve_data_parallel": True,
        "qkv_kernel_enabled": False,
        "attn_kernel_enabled": True,
        "enable_bucketing": True,
        "buckets": [128],
        "cc_pipeline_tiling_factor": 2,
        "attention_dtype": "bfloat16",
        "rpl_reduce_dtype": "bfloat16",
        "cast_type": "as-declared",
        "logical_neuron_cores": 2,
        "default_image_width": 640,
        "default_image_height": 320
    }
}

Offline Example #

[ ]:

import os

os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from transformers import AutoProcessor

def qwen2_vl_offline_test():
    model_name_or_path = "Qwen/Qwen2-VL-7B-Instruct/"
    # Create an LLM.
    llm = LLM(
    model=model_name_or_path,
    tensor_parallel_size=4,
    max_num_seqs=1,
    max_model_len=32768,
    additional_config=dict(
        override_neuron_config=qwen2_vl_neuron_config  # Use the configuration defined above
    ),
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
    )

    # Sample prompts.
    prompt = "What do you see in these images?"
    # Resize to default image size
    default_image_size = (640, 320)

    images = [
        ImageAsset("blue_flowers").pil_image.resize(default_image_size),
        ImageAsset("bird").pil_image.resize(default_image_size),
    ]

    processor = AutoProcessor.from_pretrained(model_name_or_path)

    placeholders = [{"type": "image"} for _ in images]
    messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
                *placeholders,
                {
                "type": "text",
                "text": prompt,
                },
        ],
    },
    ]

    prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    )
    inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": images,
    },
    }
    outputs = llm.generate([inputs], SamplingParams(top_k=1, max_tokens=64))

    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")

if __name__ == "__main__":
    qwen2_vl_offline_test()

Below is an example output:

Generated text: 'The first image shows a close-up of a flower with blue petals and water droplets on them, set against a dark background. The second image features a vibrant red bird with blue and green wings perched on a branch.'

Online Example #

[ ]:

import json

VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference'
additional_neuron_config=json.dumps(dict(override_neuron_config=qwen2_vl_neuron_config))
start_server_cmd=cmd = f'''python3 -m vllm.entrypoints.openai.api_server \
   --model=\'{model_name_or_path}\' \
   --tensor-parallel-size=4 \
   --max-num-seqs=1 \
   --max-model-len=32768 \
   --additional-config=\'{additional_neuron_config}\' \
   --no-enable-chunked-prefill \
   --no-enable-prefix-caching \
   --port=8080
'''

import os
os.system(start_server_cmd)

Once the vLLM server is online, submit requests using the example below:

[ ]:

from openai import OpenAI


client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8080/v1")
models = client.models.list()
model_name = models.data[0].id

messages = [
   {"role": "system", "content": "You are a helpful assistant."},
   {
      "role": "user",
      "content": [
        {
            "type": "text",
            "text": "Describe this image.",
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "example_image_url" # need to resize to {default_image_width}x{default_image_height}
            }
        }
      ],
   },
]

response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    max_tokens=64,
    temperature=1.0,
    top_p=1.0,
    stream=False,
    extra_body={"top_k": 1},
)

generated_text = response.choices[0].message.content
print(generated_text)

Conclusion #

Congratulations ! You now know how to deploy Qwen/Qwen2-VL-7B-Instruct on a trn2.48xlarge instance. Modify the configurations and deploy the model as per your requirements and use case.

Tutorial: Deploy Qwen2-VL on Trn2 instances

Contents