This document is relevant for: Inf2, Trn1, Trn1n

Transformers NeuronX (transformers-neuronx) Developer Guide#

Transformers NeuronX for Trn1 and Inf2 is a software package that enables PyTorch users to perform large language model (LLM) performant inference on second-generation Neuron hardware (See: NeuronCore-v2).The Neuron performance page lists expected inference performance for commonly used Large Language Models.

Introduction#

The Transformers NeuronX repository contains the source code of the AWS Neuron Transformers integration project. As it stands now, it mainly serves the purpose of running transformer decoder inference (autoregressive sampling) workflows on the Neuron platform.

Note: This project is actively in development. The Neuron team is still heavily modifying the Neuron optimized module classes. The functionality provided in this repository will not maintain long-term API stability until version >= 1.0.0. For applications willing to reuse code from this repository, we recommend treating the Neuron optimized module implementations as samples, and pin the version of the main library package torch-neuronx to avoid breaking interface changes as new features are developed.

Checkpoint compatibility with HuggingFace Transformers#

transformers-neuronx is checkpoint-compatible with HuggingFace Transformers. While the Neuron team reimplemented some HuggingFace Transformers models from scratch for the purpose of maximizing the execution efficiency of transformer decoders on Neuron, the implementations are done with maximizing compatibility in mind, meaning one can train transformer decoder models, say GPT2, using the standard HuggingFace Transformers library, and then construct an inference-optimized decoder model using transformers-neuronx’s GPT2ForSampling class. If training was done with other libraries such as MegatronLM, then it is still possible to convert the obtained checkpoint to the standard HuggingFace Transformers checkpoint format, and then move on to transformers-neuronx’s optimized decoder implementations.

Neuron optimized transformer decoders implemented in XLA High Level Operations (HLO)#

Due to the stateful nature of the autoregressive sampling computation, an efficient implementation of autoregressive sampling using the Neuron SDK requires rewriting the model forward function into a pure-function computation running on fixed-shape tensors. Furthermore, we want the pure-function computation be implemented in a compiled language so that the Neuron compiler can perform extensive code analysis and optimization. We chose XLA High Level Operations (HLO) as the compiled language for implementing Neuron optimized transformer decoder classes. The source code of these classes contains Python functions written in a syntax called “PyHLO”, name of a Neuron internal tool for writing/compiling the HLO language in Python. As an example, a “language model head” implemented in PyHLO may look like the following.

class LmHeadHlo:

    ...

    def lm_head(self, scribe):
        dtype = self.dtype
        hidden_size = self.hidden_size
        n_active_tokens = self.n_active_tokens
        batch_size = self.batch_size
        vocab_size = self.vocab_size
        hidden = dtype[hidden_size, n_active_tokens, batch_size].Parameter(parameter_number=0)
        weight = dtype[hidden_size, vocab_size].Parameter(parameter_number=1)
        rhs_size = n_active_tokens * batch_size
        hidden = dtype[hidden_size, rhs_size].Reshape(hidden)
        dot_dims = dict(lhs_contracting_dimensions=[0], rhs_contracting_dimensions=[0])
        logits = dtype[vocab_size, rhs_size].Dot(weight, hidden, dot_dimension_numbers=dot_dims)
        return dtype[vocab_size, n_active_tokens, batch_size].Reshape(logits)

    ...

The transformers_neuronx.compiler.compile_py_func function can convert the Python lm_head function into HloModuleProto, a valid input format for the neuronx-cc compiler.

Tensor-parallelism support#

For transformer decoders used in large language models, tensor-parallelism is necessary as it provides a way to shard the models’ large weight matrices onto multiple NeuronCores, and having NeuronCores working on the same matrix multiply operation collaboratively. transformers-neuronx’s tensor-parallelism support makes heavy use of collective operations such as all-reduce, which is supported natively by the Neuron runtime.

There are some principles for setting tensor-parallelism degree (number of NeuronCores participating in sharded matrix multiply operations) for Neuron-optimized transformer decoder models.

  1. The number of attention heads needs to be divisible by the tensor-parallelism degree.

  2. The total data size of model weights and key-value caches needs to be smaller than 16 GB times the tensor-parallelism degree.

  3. Currently, the Neuron runtime supports tensor-parallelism degrees 1, 2, 8, and 32 on Trn1 and supports tensor-parallelism degrees 1, 2, 4, 8, and 24 on Inf2.

Some examples:

  1. facebook/opt-13b has 40 attention heads, and when running at batch size 1 and float16 precision the model requires ~29 GB memory, therefore a trn1.2xlarge with 32 GB device memory is sufficient.

  2. facebook/opt-30b has 56 attention heads, and at batch size 1 and float16 precision the model requires ~66 GB memory, therefore it can run on 8 NeuronCores on one trn1.32xlarge using 128 GB device memory.

  3. gpt2-xl has 25 attention heads and requires ~4 GB memory at bfloat16 precision. It runs without tensor-parallelism only.

Features#

Compile-time Configurations#

Transformers Neuron models support a variety of compile-time configurations that can be used to tune model performance. All models support the following configurations:

  • batch_size: The batch size to compile a model for. Once the batch size has been set, this is the only size that is supported at inference time. Neuron uses ahead-of-time compilation to achieve high performance which requires that the compiled artifact shapes must be known at compilation time.

  • n_positions: The maximum number of positions (or sequence length) to allow during generation. This parameter directly controls the width of the KV cache. This parameter should be set to the maximum expected sequence length for the end application.

  • tp_degree: This parameter controls the number of tensor parallel shards to split the model into. Each shard will execute on a separate NeuronCore. To minimize latency, it is recommended to set the tensor parallelism to be equal to the number of NeuronCores that are available on an instance.

  • amp: This allows a models weights and compute to be cast to a different type. The options are; 'bf16', 'f16', or 'f32'. For models trained in float32, the 16-bit mixed precision options ('bf16', 'f16') generally provide sufficient accuracy while significantly improving performance.

  • context_length_estimate: This parameter controls the maximum sequence length of the prompt/context handling compute graph. This parameter is not supported in GPTNeoXForSampling and GPTJForSampling.

from transformers_neuronx import NeuronAutoModelForCausalLM

model = NeuronAutoModelForCausalLM.from_pretrained(
    'gpt2',                      # Uses the GPT2 checkpoint from https://huggingface.co/gpt2
    batch_size=1,                # Allow inference with batch size 1 inputs
    n_positions=128,             # Allow a maximum size of 128 prompt & output tokens
    tp_degree=2,                 # Shard the model weights & compute across 2 NeuronCores
    amp='f16',                   # Downcast the weights & compute to float16
    context_length_estimate=64,  # Build an optimized context encoding network for a maximum prompt size of 64
)
model.to_neuron() # Load/compile the model

Checkpoint support and automatic model selection#

New in release 2.18

Transformers Neuron now supports a greater variety of checkpoints including older pytorch binary checkpoints and newer safetensors checkpoints. For improved load speed and reduced host memory consumption, it is recommended to always use safetensors by default. Both regular and sharded variants of checkpoints are supported. It is no longer recommended to use the save_pretrained_split function which was used in older Transformers Neuron examples.

In addition to supporting standard checkpoint formats, Transformers Neuron provides an AutoModel class NeuronAutoModelForCausalLM which can be used to load the correct model without explicitly importing the architecture-specific class.

from transformers_neuronx import NeuronAutoModelForCausalLM

# Loads: https://huggingface.co/bigscience/bloom-560m
bloom = NeuronAutoModelForCausalLM.from_pretrained('bigscience/bloom-560m')
bloom.to_neuron()

# Loads: https://huggingface.co/openlm-research/open_llama_3b_v2
llama = NeuronAutoModelForCausalLM.from_pretrained('openlm-research/open_llama_3b_v2')
llama.to_neuron()

# This is equivalent to the following:
from transformers_neuronx import BloomForSampling
model = BloomForSampling.from_pretrained('bigscience/bloom-560m')
model.to_neuron()

from transformers_neuronx import LlamaForSampling
llama = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b_v2')
llama.to_neuron()

Note

Advanced features of huggingface hub access are not supported. This includes private repositories which require access tokens and branches.

In order to support more advanced repository downloads, please download the model to a local directory and load it from there.

Hugging Face generate() API support#

Transformers Neuron models support the Hugging Face generate() API via the HuggingFaceGenerationModelAdapter adapter class. In the following example we demonstrate how to run sampling with temperature using the GPT2 model:

import torch
from transformers import AutoTokenizer, AutoConfig
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, HuggingFaceGenerationModelAdapter

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2')
model.to_neuron()

# Use the `HuggingFaceGenerationModelAdapter` to access the generate API
config = AutoConfig.from_pretrained('gpt2')
model = HuggingFaceGenerationModelAdapter(config, model)

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)

# Run inference using temperature
with torch.inference_mode():
    model.reset_generation()
    generated_sequence = model.generate(
        input_ids=encoded_input.input_ids,
        attention_mask=encoded_input.attention_mask,
        do_sample=True,
        max_length=256,
        temperature=0.7,
    )

print([tokenizer.decode(tok) for tok in generated_sequence])

Note: As the Hugging Face generation API can expand the input’s batch dimension based on different generation configurations, we need to compile the neuron model with different compile batch_size compared to the run time batch_size (batch dimension of inputs to generation API). - if do_sample=True, compile_batch_size = runtime_batch_size x num_return_sequences x beam_size - otherwise, compile_batch_size = runtime_batch_size x num_return_sequences

Neuron Persistent Cache#

The Neuron Persistent Cache is now enabled for Transformers Neuron by default. Model artifacts which have been compiled once will be cached and reused on successive runs when possible. Model artifacts will only be reused when compiling with the same compiler version (neuronx-cc), model configurations, and compiler flags. It also includes other features (i.e. using an S3 bucket as the cache backend). For more detailed information, see the Persistent cache documentation

int8 weight storage support#

Transformers Neuron supports int8 weight storage for the GPT2 model class. int8 weight storage can be used to reduce memory bandwidth usage to improve model performance. int8 weight storage support for additional model classes will be added in an upcoming release. In the following example we demonstrate how to apply int8 weight storage to the GPT2 model via the QuantizationConfig and NeuronConfig configs:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, NeuronConfig, QuantizationConfig

# Set the weight storage config use int8 quantization and bf16 dequantization
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype='s8', dequant_dtype='bf16'),
)

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained(
    'gpt2',
    amp='bf16', # NOTE: When using quantization, amp type must match dequant type
    neuron_config=neuron_config
)
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
print([tokenizer.decode(tok) for tok in generated_sequence])

Parallel Input Prompt Context Encoding#

Transformers Neuron supports parallel input prompt context encoding for the GPT2 model class. Parallel context encoding can be used to significantly reduce the latency of the input prompt context encoding before the autoregressive decoder token generation loop. Parallel context encoding support for additional model classes will be added in an upcoming release.

The GPT2ForSamplingWithContextBroadcasting class has a context_length_estimate variable that determines the number of input prompt tokens that will be processed in parallel. For optimal results, this should be set to a power of 2 that is closest to the most frequently seen input prompt length. In the following example we demonstrate how to apply parallel context encoding to the GPT2 model via the GPT2ForSamplingWithContextBroadcasting class. In this example, we set the context_length_estimate to be 128, which is the closest power of 2 the length of the input prompt (97 tokens).

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained(
    'gpt2',
    context_length_estimate=256 # Create an optimized network which handles prompts up to 256 tokens
)
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. "
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256)
print([tokenizer.decode(tok) for tok in generated_sequence])

The GPT2ForSamplingWithContextBroadcasting class can also process an input prompt that has a different batch size from the batch size of the autoregressive decoder output. For example, an input prompt with batch size = 1 can be used to produce an output of batch size = 5 to generate multiple suggestions for the same input prompt. The input prompt batch size can be specified using the prompt_batch_size argument and the autoregressive decoder output batch size can be specified using the batch_size argument. In the following example we demonstrate how to apply parallel context encoding to the GPT2 model to generate 5 outputs for a single input.

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained(
    'gpt2',
    prompt_batch_size=1, # This allows prompt and output batch to vary
    batch_size=5,
    context_length_estimate=256
)
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. "
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256)

for i, output in enumerate(generated_sequence):
    print('-' * 50)
    print(f'Batch {i} output:')
    print(tokenizer.decode(output))

Serialization support [Beta]#

Transformers Neuron supports model serialization (model saving and loading) for all models except the GPTJForSampling and GPTNeoXForSampling` model classes. In the following example we demonstrate how to save and load the GPT2 model:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, HuggingFaceGenerationModelAdapter

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2')
model.to_neuron()

# Save the compiled Neuron model
model.save('gpt2-compiled-artifacts')

# Load the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2')
# Load the compiled Neuron artifacts
model.load('gpt2-compiled-artifacts')
# Since prior artifacts are loaded, this skips compilation
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
print([tokenizer.decode(tok) for tok in generated_sequence])

Grouped-query attention (GQA) support [Beta]#

Transformers Neuron supports grouped-query attention (GQA) models for Llama and Mistral model classes. There are multiple sharding strategies for K/V cache, in order to satisfy different constraints.

  • GQA.SHARD_OVER_HEADS distributes K/V caches along head dimension. This can be only used when K/V heads is multiple of tensor-parallelism degree. This is the default configuration.

  • GQA.SHARD_OVER_BATCH distributes K/V caches along batch dimension. This can be only used when batch size is multiple of tensor-parallelism degree. This can be useful for large-batch inference.

  • GQA.REPLICATED_HEADS replicates K/V heads. This can be used when neither batch size nor K/V heads can be divisible by tensor-parallelism degree. This can be useful for low-latency small-batch inference.

  • GQA.ALL_GATHER_HEADS evenly splits the K/V heads across all NeuronCores. This is optimized for large-batch inference of GQA model without replication.

In the following example we demonstrate how to configure these distributed inference strategies and perform inference with the Mistral model:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import MistralForSampling, GQA, NeuronConfig

# Set sharding strategy for GQA to be shard over heads
neuron_config = NeuronConfig(
    group_query_attention=GQA.SHARD_OVER_HEADS
)

# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16', neuron_config=neuron_config)
model_neuron.to_neuron()

# Get a tokenizer and exaple input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
text = "[INST] What is your favourite condiment? [/INST]"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
print([tokenizer.decode(tok) for tok in generated_sequence])

Repeated Ngram Filtering#

Repeated Ngram Filtering reduces redundant ngram phrases within the generated text. It uses the same API as HuggingFace API for NoRepeatedNGram. Set the parameter no_repeat_ngram_size to the size of ngram phrases to be filtered and pass it to the sampling function as in the example model.sample(inputs_ids, no_repeat_ngram_size=3)

Top-K on-device sampling support [Beta]#

Transformers Neuron supports Top-K Sampling on-device for all models except Mixtral models. In the following example, we demonstrate how to use on-device Top-K for the Llama model via the GenerationConfig and NeuronConfig configs.

import torch
from transformers_neuronx import LlamaForSampling
from transformers_neuronx.config import NeuronConfig, GenerationConfig
from transformers import AutoTokenizer

neuron_config = NeuronConfig(
    on_device_generation=GenerationConfig(max_length=128, top_k=10, do_sample=True)
)

# Create and compile the Neuron model
model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
model_neuron.to_neuron()

# Get a tokenizer and exaple input
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128, top_k=10)
    print([tokenizer.decode(tok) for tok in generated_sequence])

Running inference with multiple models#

Multiple transformers-neuronx models can be loaded at the same time as long as the total number of consumed NeuronCores is less than or equal to the total number of NeuronCores on the instance. For example, three tp-degree=8 models can be loaded and run in parallel on an inf2.48xlarge which has 24 NeuronCores. The NEURON_RT_NUM_CORES and NEURON_RT_VISIBLE_CORES environment variables can be used to allocate the necessary number of NeuronCores to each process to run multiple transformers-neuronx models in parallel. See the NeuronCore Allocation and Model Placement for Inference (torch-neuronx) section for additional information about how to use these environment variables.

It is important to notice that when multiple models are used on a single instance, the number of threads should be reduced to avoid race condition on host side. Assume the neuron instance (i.e. trn1) has 192 CPU cores. If one of the models keeps all CPU cores busy, there would be significant performance degradation in the rest of models. As a result, the number of threads for each model should be limited to part of available cores. To do this, OMP_NUM_THREADS environment variable can be set. For example, if there are 192 CPU cores available and four tp-degree=8 models are used, one can export OMP_NUM_THREADS=48 to avoid race condition.

Streamer#

LLMs generate tokens in auto-regressive loop. A model.sample call waits till the end of full sequence generation before returning the generated response. It is possible to output an output token as soon as it is generated. To do this, a streamer object can be used. Streamer is an object which has 2 methods: put and end. There are several predefined streamer in transformers library such as TextIteratorStreamer. The following example shows how to define a streamer and use it in transformers-neuronx:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import MistralForSampling, GQA

import transformers
from time import time

# Create a custom streamer inherited from transformers.generation.streamers.BaseStreamer
class CustomStreamer(transformers.generation.streamers.BaseStreamer):
    def __init__(self) -> None:
        self.reset()

    def reset(self):
        self.token_latencies = []
        self.iter = 0
        self.now = time()

    def put(self, tokens):
        now = time()
        token_latency = now - self.now
        print(f"Iteration {self.iter:4d}: Latency [s] {token_latency:6.3f} -- Token {tokens}")
        self.now = now
        self.iter += 1
        self.token_latencies.append(token_latency)


    def end(self):
        print("First 10 token latencies:", self.token_latencies[:10])


# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16')
model_neuron.to_neuron()

# Get a tokenizer and exaple input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
text = "[INST] What is your favourite condiment? [/INST]"
encoded_input = tokenizer(text, return_tensors='pt')

streamer = CustomStreamer()
# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None, streamer=streamer)

Stopping Criteria#

We can define custom stopping criteria to stop autoregressive loop. For example, if we want to limit autoregressive loop after 0.5s, we can define and use stopping criteria class as follows:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from transformers_neuronx import MistralForSampling, GQA, NeuronConfig
from transformers_neuronx.stopping_criteria import StoppingCriteria, StoppingCriteriaList

from time import time
from typing import List, Optional, Callable


class MaxTimeCriteria(StoppingCriteria):
    """
    This class can be used to stop generation whenever the full generation exceeds some amount of time. By default, the
    time will start being counted when you initialize this function. You can override this by passing an
    `initial_time`.

    Args:
        max_time (`float`):
            The maximum allowed time in seconds for the generation.
        initial_time (`float`, *optional*, defaults to `time()`):
            The start of the generation allowed time.
    """

    def __init__(self, max_time: float, initial_timestamp: Optional[float] = None):
        self.max_time = max_time
        self.initial_timestamp = time() if initial_timestamp is None else initial_timestamp

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        dt = time() - self.initial_timestamp
        end_condition = dt > self.max_time
        if end_condition:
            print("Stopping!")
        return end_condition

# Create a streamer. This can be a custom streamer too inherited from transformers.generation.streamers.BaseStreamer
class CustomStreamer(transformers.generation.streamers.BaseStreamer):
    def __init__(self) -> None:
        self.reset()

    def reset(self):
        self.token_latencies = []
        self.iter = 0
        self.now = time()

    def put(self, tokens):
        now = time()
        token_latency = now - self.now
        print(f"Iteration {self.iter:4d}: Latency [s] {token_latency:6.3f} -- Token {tokens}")
        self.now = now
        self.iter += 1
        self.token_latencies.append(token_latency)


    def end(self):
        pass

# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16')
model_neuron.to_neuron()

# Get a tokenizer and exaple input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
text = "[INST] What is your favourite condiment? [/INST]"
encoded_input = tokenizer(text, return_tensors='pt')

# Add stopping criteria to stop after 0.5 seconds
stopping_criteria_list= StoppingCriteriaList([MaxTimeCriteria(0.5)])
streamer = CustomStreamer()

# Run inference
with torch.inference_mode():
    model_neuron.sample(input_ids=encoded_input.input_ids, sequence_length=256, stopping_criteria_list=stopping_criteria_list, streamer=streamer)

Speculative sampling [Beta]#

Transformers Neuron supports speculative sampling for the Llama and GPT2 model classes. In speculative sampling, we use use a smaller draft model to speculate future tokens. These are then sent to the larger target model, which accepts or rejects these tokens. For more detailed information, see the original proposal by DeepMind titled Accelerating Large Language Model Decoding with Speculative Sampling. Speculative sampling is currently supported for batch size 1.

In the following example, we demonstrate how to perform speculative sampling using the Llama model.

import torch
from transformers import LlamaTokenizer
from transformers_neuronx import LlamaForSampling
from transformers_neuronx.speculation import SpeculativeGenerator

# Load draft model
draft_neuron_model = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', n_positions=256, batch_size=1, tp_degree=8, amp='f32')
# Compile the model
draft_neuron_model.to_neuron()

# Load target model
target_neuron_model = LlamaForSampling.from_pretrained('openlm-research/open_llama_13b', n_positions=256, batch_size=1, tp_degree=8, amp='f32')
# Enable speculative decoder
target_neuron_model.enable_speculative_decoder(4)
# Compile the model
target_neuron_model.to_neuron()

# Initialize tokenizer and text prompt
tokenizer = LlamaTokenizer.from_pretrained("openlm-research/open_llama_13b")
prompt = "Hello, I'm a generative AI language model."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Create SpeculativeGenerator
spec_gen = SpeculativeGenerator(draft_neuron_model, target_neuron_model, 4)

# Call speculative sampling on given input
response = spec_gen.sample(
    input_ids=input_ids,
    sequence_length=30,
)

# Decode the response
generated_text = tokenizer.decode(response[0])
print(f"\nDecoded tokens: {generated_text}")

QKV Weight Fusion#

Concatenating a model’s query, key and value weight matrices often achieves better performance because larger matrices allow for more efficient data movement and compute. QKV weight fusion can be enabled by setting fuse_qkv=True in the NeuronConfig:

neuron_config = NeuronConfig(fuse_qkv=True)

Bucketing#

LLM inference is a generate process that can produce variable length sequences. This poses a problem since the Neuron compiler produces executables which expect statically shaped inputs and outputs. To make LLM work with different shapes, transformers_neuronx generates buckets and applies padding wherever it is required.

There are at least two set of buckets for each LLM inference that can be set by user: 1) Context encoding (pre-fill) buckets and 2) output token generation buckets.

Token generation buckets

In token generation, tokens are generated iteratively. At each token position, transformer need to attend to the previous tokens only. But in the naive implementation with static shapes, one may attend to all KV-cache (full sequence length). To solve this problem, we use token generation buckets. Token generation buckets determine the attention lengths. For instance, if the max sequence length is 1024 tokens and current token is at position 120, there is no need to attend to all 1024 tokens in the current step. We can use token generation buckets to attend to different portions of KV-cache. By default, token generation buckets which are powers of 2 starting from 128 tokens are used (i.e. 128, 256, 512, up to sequence length). In the example above, bucket 128 would be used for position 120 which would reduce the wasted compute significantly. User can change these buckets by setting a list for n_positions (see example below). Otherwise, if a number is given for n_positions (sequence length), instead of a list, then the powers of 2 buckets starting from 128 will be used. The last bucket would be n_positions (sequence length), even if it is not a power of 2.

Context encoding buckets

The prompt tokens can be processed in parallel. As a result, we need to set the bucket sizes for different estimated length of input prompts. We can specify these context bucket sizes using the context_length_estimate argument. In general, it is better to have all the bucket to be multiples of 256 tokens. But adding too many buckets would increase device memory consumption and add extra latency for bucket switching. Usually, the powers of 2 starting from 128 tokens are used for context encoding buckets. If the total sequence length (n_positions) is beyond 2048 tokens, it is desirable to add extra buckets with multiple of 512 or 1024 tokens. It is not recommended to add buckets of multiples of 256 tokens or smaller for context buckets beyond 2k to avoid bucket switching latency. At runtime, the smallest bucket which fits the input context will be used. By default, the context encoding buckets set to half of output-token buckets. Adding extra context buckets would reduce the wasted compute and improves performance. However, the extra executables would reduce memory space since executables require device memory space.

Notice that the default output token generation buckets work well for wide range of applications. However, ideal context encoding buckets depends on the specific use case. For instance, if all the requests have a context length of about 1500 +/- 500 tokens, adding more buckets closer to 1500 might help context encoding time. In this example, adding buckets of 1024, 1280, 1536, 1792, 2048 tokens (distance of 256 tokens) could help. Moreover, the largest context encoding bucket should be larger than the largest context length. Otherwise, the performance would degrade significantly.

To set context encoding and token generation buckets manually:

context_length_estimate = [1024, 1280, 1536, 1792, 2048]    # The best context estimate depends on the use case
n_positions = [128, 256, 512, 1024, 2048, 3072]             # Usually default buckets are appropriate

model = NeuronAutoModelForCausalLM.from_pretrained(
    'gpt2',
    batch_size=1,
    n_positions=n_positions,
    tp_degree=2,
    amp='f16',
    context_length_estimate=context_length_estimate,
)

Long Sequence length support up to 32k#

With the integration of FlashAttention kernel, developers can use longer sequence lengths for LLAMA models. The Flash Attention kernel is automatically used when the input sequence length is greater than 8k without any additional configuration. Refer to Tutorial for usage of 32k sequence length on a variation of LLAMA3-8B Model.

This document is relevant for: Inf2, Trn1, Trn1n