This document is relevant for: Inf2, Trn1, Trn1n

Transformers NeuronX (transformers-neuronx) Developer Guide For Continuous Batching#

The continuous batching feature has been enabled in Transformers NeuronX. Transformers NeuronX for Trn1 and Inf2 is a software package that enables PyTorch users to perform performant inference of large language models (LLMs) on second-generation Neuron hardware (see: NeuronCore-v2). The Neuron performance page lists the expected inference performance for commonly used large language models.

Overview of continuous batching API and vLLM support#

Transformers NeuronX supports continuous batching for the LLaMA model class.

The basic flow for continuous batching support includes the following operations, performed automatically by Transformers NeuronX (a simplified sketch of this loop follows the list):

  1. Perform context encoding on multiple prompts using virtual dynamic batching.

  2. Decode each sequence until one of the sequences generates an EOS token.

  3. Evict the finished sequence and insert a new prompt encoding.

  4. Resume the decoding process, repeating steps 2-3 until all of the sequences are decoded.
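
The following is a minimal, self-contained Python sketch of this scheduling loop. It uses toy stand-ins for context encoding and token decoding and is not the Transformers NeuronX API; it only illustrates the fill/decode/evict/refill cycle described above.

import random

EOS, MAX_SLOTS = "<eos>", 4

def context_encode(prompt):      # toy stand-in for prompt (context) encoding
    return [prompt]

def decode_one_token(sequence):  # toy stand-in for a single decoding step
    return EOS if random.random() < 0.2 else f"tok{len(sequence)}"

waiting = [f"prompt-{i}" for i in range(8)]  # prompts not yet scheduled
running = {}                                 # slot id -> decoded sequence

# Step 1: fill the batch with context-encoded prompts.
while waiting and len(running) < MAX_SLOTS:
    running[len(running)] = context_encode(waiting.pop(0))

# Steps 2-4: decode until a sequence finishes, evict it, insert a new prompt.
while running:
    for slot, seq in list(running.items()):
        token = decode_one_token(seq)
        if token == EOS:
            print(f"slot {slot} finished: {' '.join(seq[1:])}")
            del running[slot]                  # evict the finished sequence
            if waiting:                        # insert a new prompt encoding
                running[slot] = context_encode(waiting.pop(0))
        else:
            seq.append(token)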

Overview of models supported#

We support continuous batching for models compatible with the following HuggingFace classes:

  • LlamaForCausalLM

  • MistralForCausalLM

New Features in 2.20 Neuron vLLM integration#

Currently, the Neuron vLLM integration is tested with Transformers NeuronX against the publicly available vLLM v0.5.0. As part of the Neuron 2.20 release, several enhancements have been made to the Neuron vLLM integration on top of vLLM v0.5.0. It is our intention to upstream these features to the vLLM repository as soon as possible after the release. Until the features are upstreamed, you can access them from the vLLM upstreaming Neuron fork in our AWS Neuron GitHub repository at aws-neuron/upstreaming-to-vllm.

Features in v0.5.0 vLLM Neuron Fork#

  • On-device generation is now enabled by default. Please note that this feature currently does not support producing log probabilities in the vLLM integration. Furthermore, sampling is restricted to at most the top 256 tokens by default; this can be changed by setting global_top_k in the GenerationConfig (see the sketch after this list).

  • Multi-node inference to support larger models. Example scripts are included in the commit.

  • Directly load HuggingFace-compatible checkpoints without creating a -split directory.

  • Fixed the AssertionError when running offline examples, i.e., there is no need to apply the patch mentioned below.
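
As a hedged sketch of the global_top_k setting mentioned above: the field names below assume the GenerationConfig and NeuronConfig classes from transformers_neuronx.config, and the way this configuration is passed to the vLLM fork may differ, so treat it as an illustration rather than the exact integration API.

# Assumed API: transformers_neuronx.config.GenerationConfig exposes the
# global_top_k field referenced above, and NeuronConfig accepts it through
# on_device_generation.
from transformers_neuronx.config import GenerationConfig, NeuronConfig

neuron_config = NeuronConfig(
    on_device_generation=GenerationConfig(
        do_sample=True,     # perform sampling on-device
        global_top_k=512,   # widen (or narrow) the default top-256 restriction
    )
)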

Some features have already been upstreamed to vLLM v0.6.0 and are not included in our v0.5.0 vLLM Neuron fork:

  • Configure bucketing and NeuronConfig through the vLLM interface instead of requiring code changes
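
For illustration, here is a hedged sketch of that upstream interface. It assumes the override_neuron_config engine argument available in vLLM v0.6.0 and later for the Neuron backend, with keys mirroring transformers-neuronx NeuronConfig fields; verify the exact argument name and accepted keys against the vLLM version you use.

from vllm import LLM

# Assumed vLLM v0.6.0+ interface: pass NeuronConfig overrides directly through
# the engine arguments instead of editing code (the key shown is illustrative).
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=2,
    override_neuron_config={"cast_logits_dtype": "bfloat16"},
)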

Installing vLLM and running a simple offline script#

Once the neuronx-cc and transformers-neuronx packages are installed, vLLM can be installed from source:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.5.0
pip install .
pip install ray

Note

Please note that the vLLM pip package from PyPI is not compatible with Neuron. To work with Neuron, install vLLM from source as outlined above.

Note

There is currently a known issue with the offline inference examples that can be solved by applying the vllm_v0.5.0_neuron.patch patch to the vLLM source after checking out the v0.5.0 tag and running git apply vllm_v0.5.0_neuron.patch (the patch contents are shown in the FAQ below). We are in the process of upstreaming the fix for a future vLLM release.

If the Neuron packages are detected correctly during installation, vllm-0.5.0+neuron214 will be installed (the Neuron version suffix depends on the installed neuronx-cc version).
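
To confirm which build was installed, you can print the package version string exposed as vllm.__version__; the Neuron suffix shown below is an example and will vary with your neuronx-cc version.

import vllm

# Expect something like "0.5.0+neuron214" for a Neuron-enabled build.
print(vllm.__version__)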

In the following example we demonstrate how to perform continuous batching with a LLaMA model.

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be the same as the
    # max sequence length when targeting the Neuron device. Currently, this is a known
    # limitation in continuous batching support in transformers-neuronx.
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when AWS Neuron SDK is installed.
    # The device argument can be either unspecified for automated detection, or explicitly assigned.
    device="neuron",
    tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Known issues and FAQs#

How to fix ‘AssertionError: Cache operations are not supported for Neuron backend.’?

Make sure the following patch is applied to vLLM v0.5.0 before installing. You can do so with the following steps: navigate into the vLLM source directory, create a vllm_v0.5.0_neuron.patch file inside it with the content shown below, run git apply vllm_v0.5.0_neuron.patch to apply the patch, and then install vLLM via pip install .

diff --git a/vllm/executor/neuron_executor.py b/vllm/executor/neuron_executor.py
index e7f0e887..87564b76 100644
--- a/vllm/executor/neuron_executor.py
+++ b/vllm/executor/neuron_executor.py
@@ -48,9 +48,9 @@ class NeuronExecutor(ExecutorBase):
    def execute_model(
            self,
            execute_model_req: ExecuteModelRequest) -> List[SamplerOutput]:
-        assert (execute_model_req.blocks_to_swap_in == {}
-                and execute_model_req.blocks_to_swap_out == {}
-                and execute_model_req.blocks_to_copy == {}), (
+        assert (not execute_model_req.blocks_to_swap_in
+                and not execute_model_req.blocks_to_swap_out
+                and not execute_model_req.blocks_to_copy), (
                    "Cache operations are not supported for Neuron backend.")
        assert execute_model_req.num_lookahead_slots == 0, (
            "lookahead not supported for Neuron backend.")

Is PagedAttention supported with vLLM integration?

No, PagedAttention is not currently supported. It will be supported in a future Neuron release.

This document is relevant for: Inf2, Trn1, Trn1n