This document is relevant for: Inf2, Trn1

Transformers NeuronX (`transformers-neuronx`) Developer Guide#

Transformers NeuronX for Trn1 and Inf2 is a software package that enables PyTorch users to run performant large language model (LLM) inference on second-generation Neuron hardware (see: NeuronCore-v2). The Neuron performance page lists expected inference performance for commonly used large language models.

Introduction#

The Transformers NeuronX repository contains the source code of the AWS Neuron Transformers integration project. Currently, it is mainly used to run transformer decoder inference (autoregressive sampling) workflows on the Neuron platform.

Note: This project is under active development. The Neuron team is still heavily modifying the Neuron optimized module classes. The functionality provided in this repository will not maintain long-term API stability until version >= 1.0.0. For applications that reuse code from this repository, we recommend treating the Neuron optimized module implementations as samples and pinning the version of the main library package torch-neuronx to avoid breaking interface changes as new features are developed.

Checkpoint compatibility with HuggingFace Transformers#

transformers-neuronx is checkpoint-compatible with HuggingFace Transformers. Although the Neuron team reimplemented some HuggingFace Transformers models from scratch to maximize the execution efficiency of transformer decoders on Neuron, the implementations were designed for maximum compatibility. One can train a transformer decoder model, say GPT2, using the standard HuggingFace Transformers library, and then construct an inference-optimized decoder model using transformers-neuronx’s GPT2ForSampling class. If training was done with another library such as MegatronLM, it is still possible to convert the resulting checkpoint to the standard HuggingFace Transformers checkpoint format and then move on to transformers-neuronx’s optimized decoder implementations.

Neuron optimized transformer decoders implemented in XLA High Level Operations (HLO)#

Due to the stateful nature of the autoregressive sampling computation, an efficient implementation of autoregressive sampling using the Neuron SDK requires rewriting the model forward function into a pure-function computation running on fixed-shape tensors. Furthermore, we want the pure-function computation to be implemented in a compiled language so that the Neuron compiler can perform extensive code analysis and optimization. We chose XLA High Level Operations (HLO) as the compiled language for implementing Neuron optimized transformer decoder classes. The source code of these classes contains Python functions written in a syntax called “PyHLO”, the name of a Neuron internal tool for writing and compiling the HLO language in Python. As an example, a “language model head” implemented in PyHLO may look like the following.

class LmHeadHlo:

    ...

    def lm_head(self, scribe):
        dtype = self.dtype
        hidden_size = self.hidden_size
        n_active_tokens = self.n_active_tokens
        batch_size = self.batch_size
        vocab_size = self.vocab_size
        hidden = dtype[hidden_size, n_active_tokens, batch_size].Parameter(parameter_number=0)
        weight = dtype[hidden_size, vocab_size].Parameter(parameter_number=1)
        rhs_size = n_active_tokens * batch_size
        hidden = dtype[hidden_size, rhs_size].Reshape(hidden)
        dot_dims = dict(lhs_contracting_dimensions=[0], rhs_contracting_dimensions=[0])
        logits = dtype[vocab_size, rhs_size].Dot(weight, hidden, dot_dimension_numbers=dot_dims)
        return dtype[vocab_size, n_active_tokens, batch_size].Reshape(logits)

    ...

The transformers_neuronx.compiler.compile_py_func function can convert the Python lm_head function into HloModuleProto, a valid input format for the neuronx-cc compiler.

Tensor-parallelism support#

For transformer decoders used in large language models, tensor-parallelism is necessary because it provides a way to shard the models’ large weight matrices across multiple NeuronCores and to have the NeuronCores work on the same matrix multiply operation collaboratively. transformers-neuronx’s tensor-parallelism support makes heavy use of collective operations such as all-reduce, which is supported natively by the Neuron runtime.

The following principles apply when setting the tensor-parallelism degree (the number of NeuronCores participating in sharded matrix multiply operations) for Neuron-optimized transformer decoder models:

The number of attention heads needs to be divisible by the tensor-parallelism degree.
The total data size of model weights and key-value caches needs to be smaller than 16 GB times the tensor-parallelism degree.
Currently, the Neuron runtime supports tensor-parallelism degrees 1, 2, 8, and 32 on Trn1 and supports tensor-parallelism degrees 1, 2, 4, 8, and 24 on Inf2.

Some examples:

facebook/opt-13b has 40 attention heads, and when running at batch size 1 and float16 precision the model requires ~29 GB memory, therefore a trn1.2xlarge with 32 GB device memory is sufficient.
facebook/opt-30b has 56 attention heads, and at batch size 1 and float16 precision the model requires ~66 GB memory, therefore it can run on 8 NeuronCores on one trn1.32xlarge using 128 GB device memory.
gpt2-xl has 25 attention heads and requires ~4 GB memory at bfloat16 precision. Because 25 is not divisible by any supported tensor-parallelism degree greater than 1, it can only run without tensor-parallelism.
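The rules and examples above can be checked mechanically. The following is a minimal sketch and not part of transformers-neuronx: the helper name `valid_tp_degrees` and the two constants are illustrative, and their values (16 GB per NeuronCore and the supported degrees per instance family) are copied from the list above. It does not check how many NeuronCores a particular instance size actually has.

```python
# Hypothetical helper (not a transformers-neuronx API): applies the three
# tensor-parallelism rules listed above.
SUPPORTED_TP_DEGREES = {
    "trn1": (1, 2, 8, 32),
    "inf2": (1, 2, 4, 8, 24),
}
MEMORY_PER_CORE_GB = 16


def valid_tp_degrees(n_heads, required_memory_gb, instance_family):
    """Return the supported degrees that satisfy the head and memory rules."""
    return [
        tp
        for tp in SUPPORTED_TP_DEGREES[instance_family]
        if n_heads % tp == 0 and required_memory_gb < MEMORY_PER_CORE_GB * tp
    ]


print(valid_tp_degrees(40, 29, "trn1"))  # facebook/opt-13b
print(valid_tp_degrees(56, 66, "trn1"))  # facebook/opt-30b
print(valid_tp_degrees(25, 4, "trn1"))   # gpt2-xl
```

The smallest degree returned for each model matches the examples: degree 2 for facebook/opt-13b, degree 8 for facebook/opt-30b, and degree 1 for gpt2-xl.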

Features#

Compile-time Configurations#

Transformers Neuron models support a variety of compile-time configurations that can be used to tune model performance. All models support the following configurations:

batch_size: The batch size to compile a model for. Once the batch size has been set, this is the only size that is supported at inference time. Neuron uses ahead-of-time compilation to achieve high performance, which requires that the shapes of the compiled artifacts be known at compilation time.
n_positions: The maximum number of positions (or sequence length) to allow during generation. This parameter directly controls the width of the KV cache. This parameter should be set to the maximum expected sequence length for the end application.
tp_degree: This parameter controls the number of tensor parallel shards to split the model into. Each shard will execute on a separate NeuronCore. To minimize latency, it is recommended to set the tensor parallelism to be equal to the number of NeuronCores that are available on an instance.
amp: This allows a model's weights and compute to be cast to a different type. The options are 'bf16', 'f16', or 'f32'. For models trained in float32, the 16-bit mixed precision options ('bf16', 'f16') generally provide sufficient accuracy while significantly improving performance.
context_length_estimate: This parameter controls the maximum sequence length of the prompt/context handling compute graph. This parameter is not supported in GPTNeoXForSampling and GPTJForSampling.

from transformers_neuronx import NeuronAutoModelForCausalLM

model = NeuronAutoModelForCausalLM.from_pretrained(
    'gpt2',                      # Uses the GPT2 checkpoint from https://huggingface.co/gpt2
    batch_size=1,                # Allow inference with batch size 1 inputs
    n_positions=128,             # Allow a maximum size of 128 prompt & output tokens
    tp_degree=2,                 # Shard the model weights & compute across 2 NeuronCores
    amp='f16',                   # Downcast the weights & compute to float16
    context_length_estimate=64,  # Build an optimized context encoding network for a maximum prompt size of 64
)
model.to_neuron() # Load/compile the model
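To see why batch_size and n_positions are fixed at compile time and why they drive memory use, it can help to estimate the size of the KV cache. The sketch below uses the common estimate for a decoder with standard multi-head attention (one key and one value vector of width hidden_size per layer, position, and batch element). The function name and the model dimensions are illustrative assumptions, and the result is not the exact layout or padding used by Neuron.

```python
def kv_cache_bytes(n_layers, hidden_size, n_positions, batch_size, bytes_per_element=2):
    """Rough KV cache size for standard multi-head attention (an estimate only)."""
    # Two cached tensors (keys and values) per layer, each holding one
    # hidden_size-wide vector per position and per batch element.
    return 2 * n_layers * batch_size * n_positions * hidden_size * bytes_per_element


# Illustrative dimensions, 16-bit elements.
small = kv_cache_bytes(n_layers=40, hidden_size=5120, n_positions=2048, batch_size=1)
large = kv_cache_bytes(n_layers=40, hidden_size=5120, n_positions=4096, batch_size=4)
print(f"{small / 2**30:.2f} GiB -> {large / 2**30:.2f} GiB")
```

The estimate grows linearly in both batch_size and n_positions, so doubling the sequence length and quadrupling the batch size multiplies the cache size by eight.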

Checkpoint support and automatic model selection#

New in release 2.18

Transformers Neuron now supports a greater variety of checkpoints, including older PyTorch binary checkpoints and newer safetensors checkpoints. For improved load speed and reduced host memory consumption, it is recommended to use safetensors by default. Both regular and sharded variants of checkpoints are supported. The save_pretrained_split function, which was used in older Transformers Neuron examples, is no longer recommended.

In addition to supporting standard checkpoint formats, Transformers Neuron provides an AutoModel class NeuronAutoModelForCausalLM which can be used to load the correct model without explicitly importing the architecture-specific class.

from transformers_neuronx import NeuronAutoModelForCausalLM

# Loads: https://huggingface.co/bigscience/bloom-560m
bloom = NeuronAutoModelForCausalLM.from_pretrained('bigscience/bloom-560m')
bloom.to_neuron()

# Loads: https://huggingface.co/openlm-research/open_llama_3b_v2
llama = NeuronAutoModelForCausalLM.from_pretrained('openlm-research/open_llama_3b_v2')
llama.to_neuron()

# This is equivalent to the following:
from transformers_neuronx import BloomForSampling
model = BloomForSampling.from_pretrained('bigscience/bloom-560m')
model.to_neuron()

from transformers_neuronx import LlamaForSampling
llama = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b_v2')
llama.to_neuron()

Note

Advanced features of Hugging Face Hub access are not supported. This includes private repositories, which require access tokens, and branches.

In order to support more advanced repository downloads, please download the model to a local directory and load it from there.

Hugging Face generate() API support#

Transformers Neuron models support the Hugging Face generate() API via the HuggingFaceGenerationModelAdapter adapter class. In the following example we demonstrate how to run sampling with temperature using the GPT2 model:

import torch
from transformers import AutoTokenizer, AutoConfig
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, HuggingFaceGenerationModelAdapter

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2')
model.to_neuron()

# Use the `HuggingFaceGenerationModelAdapter` to access the generate API
config = AutoConfig.from_pretrained('gpt2')
model = HuggingFaceGenerationModelAdapter(config, model)

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)

# Run inference using temperature
with torch.inference_mode():
    model.reset_generation()
    generated_sequence = model.generate(
        input_ids=encoded_input.input_ids,
        attention_mask=encoded_input.attention_mask,
        do_sample=True,
        max_length=256,
        temperature=0.7,
    )

print([tokenizer.decode(tok) for tok in generated_sequence])

Note: Because the Hugging Face generation API can expand the input’s batch dimension depending on the generation configuration, the Neuron model may need to be compiled with a batch_size that differs from the run-time batch size (the batch dimension of the inputs to the generation API):

If do_sample=True, compile_batch_size = runtime_batch_size x num_return_sequences x beam_size
Otherwise, compile_batch_size = runtime_batch_size x num_return_sequences
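The rule in the note can be written as a small helper. This is a minimal sketch that implements the rule exactly as stated above. The function name `compile_batch_size` is illustrative and is not part of transformers-neuronx or the Hugging Face API.

```python
def compile_batch_size(runtime_batch_size, num_return_sequences=1, beam_size=1, do_sample=False):
    """Batch size to compile for, following the rule stated in the note above."""
    size = runtime_batch_size * num_return_sequences
    if do_sample:
        # Per the note, beam_size only enters the product when do_sample=True.
        size *= beam_size
    return size


print(compile_batch_size(2, num_return_sequences=3, beam_size=4, do_sample=True))
print(compile_batch_size(2, num_return_sequences=3, beam_size=4, do_sample=False))
```

The returned value is what would be passed as batch_size when creating the Neuron model.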

Neuron Persistent Cache#

The Neuron Persistent Cache is now enabled for Transformers Neuron by default. Model artifacts which have been compiled once will be cached and reused on successive runs when possible. Model artifacts will only be reused when compiling with the same compiler version (neuronx-cc), model configurations, and compiler flags. The cache also supports other features (e.g. using an S3 bucket as the cache backend). For more detailed information, see the Persistent cache documentation.

int8 weight storage support#

Transformers Neuron supports int8 weight storage for the GPT2 model class. int8 weight storage can be used to reduce memory bandwidth usage to improve model performance. int8 weight storage support for additional model classes will be added in an upcoming release. In the following example we demonstrate how to apply int8 weight storage to the GPT2 model via the QuantizationConfig and NeuronConfig configs:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting, NeuronConfig, QuantizationConfig

# Set the weight storage config to use int8 quantization and bf16 dequantization
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype='s8', dequant_dtype='bf16'),
)

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained(
    'gpt2',
    amp='bf16', # NOTE: When using quantization, amp type must match dequant type
    neuron_config=neuron_config
)
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
print([tokenizer.decode(tok) for tok in generated_sequence])

Parallel Input Prompt Context Encoding#

Transformers Neuron supports parallel input prompt context encoding for the GPT2 model class. Parallel context encoding can be used to significantly reduce the latency of the input prompt context encoding before the autoregressive decoder token generation loop. Parallel context encoding support for additional model classes will be added in an upcoming release.

The GPT2ForSamplingWithContextBroadcasting class has a context_length_estimate variable that determines the number of input prompt tokens that will be processed in parallel. For optimal results, this should be set to the power of 2 that is closest to the most frequently seen input prompt length. In the following example we demonstrate how to apply parallel context encoding to the GPT2 model via the GPT2ForSamplingWithContextBroadcasting class. In this example, we set the context_length_estimate to be 128, which is the closest power of 2 to the length of the input prompt (97 tokens).

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained(
    'gpt2',
    context_length_estimate=128 # Create an optimized network which handles prompts up to 128 tokens
)
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. "
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256)
print([tokenizer.decode(tok) for tok in generated_sequence])
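The "closest power of 2" rule for context_length_estimate can be computed from a prompt length. This is a minimal sketch. The helper name `closest_power_of_two` is illustrative and not part of transformers-neuronx, and rounding ties upward is a choice made here rather than something the guide specifies.

```python
def closest_power_of_two(n):
    """Return the power of 2 nearest to n; ties round up."""
    if n < 1:
        raise ValueError("prompt length must be positive")
    lower = 1 << (n.bit_length() - 1)  # largest power of 2 <= n
    upper = lower << 1                 # smallest power of 2 > n
    return lower if (n - lower) < (upper - n) else upper


# For the 97-token prompt discussed above, 128 is nearer than 64.
print(closest_power_of_two(97))
```

In practice the input would be the most frequently seen prompt length in tokens, for example the length of `encoded_input.input_ids` for a representative prompt.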

The GPT2ForSamplingWithContextBroadcasting class can also process an input prompt that has a different batch size from the batch size of the autoregressive decoder output. For example, an input prompt with batch size = 1 can be used to produce an output of batch size = 5 to generate multiple suggestions for the same input prompt. The input prompt batch size can be specified using the prompt_batch_size argument and the autoregressive decoder output batch size can be specified using the batch_size argument. In the following example we demonstrate how to apply parallel context encoding to the GPT2 model to generate 5 outputs for a single input.

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained(
    'gpt2',
    prompt_batch_size=1, # This allows prompt and output batch to vary
    batch_size=5,
    context_length_estimate=256
)
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a generative AI language model. Generative AI is a type of AI that can create new content and ideas, including conversations, stories, images, videos, and music. It is powered by large models that are pre-trained on vast amounts of data and commonly referred to as foundation models (FMs). With generative AI on AWS, you can reinvent your applications, create entirely new customer experiences, drive unprecedented levels of productivity, and transform your business. "
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256)

for i, output in enumerate(generated_sequence):
    print('-' * 50)
    print(f'Batch {i} output:')
    print(tokenizer.decode(output))

Serialization support#

Transformers NeuronX supports model serialization (model saving and loading) for all models except the GPTJForSampling and GPTNeoXForSampling model classes. In the following example we demonstrate how to save and load the compiled artifacts for the GPT2 model:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import GPT2ForSamplingWithContextBroadcasting

# Create and compile the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2')
model.to_neuron()

# Save the compiled Neuron model
model.save('gpt2-compiled-artifacts')

# Load the Neuron model
model = GPT2ForSamplingWithContextBroadcasting.from_pretrained('gpt2')
# Load the compiled Neuron artifacts
model.load('gpt2-compiled-artifacts')
# Since prior artifacts are loaded, this skips compilation
model.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
print([tokenizer.decode(tok) for tok in generated_sequence])

Transformers NeuronX also supports the serialization of presharded weights. This reduces future model load time by saving a transformed and sharded set of weights as a new safetensors checkpoint. When this checkpoint is loaded, sharding and transformations normally done by Transformers NeuronX will be skipped, reducing model load time significantly. The saving of presharded weights is only available when on_device_embedding is true. In the following example we demonstrate how to save and load presharded weights along with compiled artifacts on a Llama model:

from transformers_neuronx import LlamaForSampling
from transformers_neuronx import NeuronConfig
from transformers import AutoTokenizer

neuron_config = NeuronConfig(on_device_embedding=True)

# Create and compile the Neuron model
model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
model_neuron.to_neuron()

# save the presharded weights and compiled artifacts to a directory
model_neuron.save('llama-artifacts', sharded_weights=True)

del model_neuron

# use the presharded checkpoint to reduce model load time
model_neuron_presharded = LlamaForSampling.from_pretrained('llama-artifacts', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)

# load in the compiled artifacts to skip compilation
model_neuron_presharded.load('llama-artifacts')
model_neuron_presharded.to_neuron()

CPU Compilation Support#

Transformers NeuronX now supports compilation on CPU. CPU compilation is compatible with model serialization and presharding weights, and is available for all models except the GPTJForSampling and GPTNeoXForSampling model classes. To compile on CPU, the initial call to to_neuron() is replaced with cpu_compile(). In the following example we demonstrate how to compile on CPU for the LLaMA model:

from transformers_neuronx import LlamaForSampling
from transformers_neuronx import NeuronConfig
from transformers import AutoTokenizer

neuron_config = NeuronConfig(on_device_embedding=True)

# Create and compile the model on CPU
model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
model_neuron.cpu_compile() # instead of model_neuron.to_neuron()

# save the weights and compiled artifacts to a directory
model_neuron.save('llama-artifacts')

To use the saved artifacts generated by CPU compilation on a Neuron device:

from transformers_neuronx import LlamaForSampling
from transformers_neuronx import NeuronConfig
from transformers import AutoTokenizer

neuron_config = NeuronConfig(on_device_embedding=True)

# use the presharded checkpoint to reduce model load time
model_neuron_presharded = LlamaForSampling.from_pretrained('llama-artifacts', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)

# load in the compiled artifacts to skip compilation
model_neuron_presharded.load('llama-artifacts')

# now, use CPU compiled artifacts to run the model
model_neuron_presharded.to_neuron()

Compilation worker count support#

Transformers-neuronx supports setting a compilation worker count for all models. This setting controls how many workers execute HLO graph compilation tasks in parallel. A lower setting reduces CPU memory utilization when compiling a model but increases the compilation time. This setting is useful for preventing out-of-memory errors on the CPU host when compiling large models. By default, the number of workers is equal to the total number of HLO graphs required for compilation. The compilation worker count works with both the CPU compilation flow using cpu_compile() and the Neuron device compilation flow using to_neuron(). To set the compilation worker count, use the compilation_worker_count argument in NeuronConfig. The following sample shows how to compile the graphs one by one.

from transformers_neuronx import NeuronConfig
neuron_config = NeuronConfig(compilation_worker_count=1)  # compile the HLO graphs one at a time

Grouped-query attention (GQA) support [Beta]#

Transformers Neuron supports grouped-query attention (GQA) models for the Llama and Mistral model classes. Several sharding strategies are available for the K/V cache to satisfy different constraints:

GQA.SHARD_OVER_HEADS distributes K/V caches along the head dimension. This can only be used when the number of K/V heads is a multiple of the tensor-parallelism degree. This is the default configuration.
GQA.SHARD_OVER_BATCH distributes K/V caches along the batch dimension. This can only be used when the batch size is a multiple of the tensor-parallelism degree. This can be useful for large-batch inference.
GQA.REPLICATED_HEADS replicates K/V heads. This can be used when neither the batch size nor the number of K/V heads is divisible by the tensor-parallelism degree. This can be useful for low-latency small-batch inference.
GQA.ALL_GATHER_HEADS evenly splits the K/V heads across all NeuronCores. This is optimized for large-batch inference of GQA models without replication.
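The divisibility conditions above can be turned into a quick pre-flight check. The following is a minimal sketch and not part of transformers-neuronx. The helper name is hypothetical, strategies are returned as plain strings rather than GQA enum members so that the block is self-contained, REPLICATED_HEADS is treated purely as the fallback, and GQA.ALL_GATHER_HEADS is left out because the list above states no divisibility condition for it.

```python
def applicable_gqa_strategies(n_kv_heads, batch_size, tp_degree):
    """List the sharding strategies whose stated condition holds (illustrative only)."""
    strategies = []
    if n_kv_heads % tp_degree == 0:
        strategies.append("SHARD_OVER_HEADS")
    if batch_size % tp_degree == 0:
        strategies.append("SHARD_OVER_BATCH")
    if not strategies:
        # Neither K/V heads nor batch size is divisible by the degree.
        strategies.append("REPLICATED_HEADS")
    return strategies


# Example values: 8 K/V heads at different degrees and batch sizes.
print(applicable_gqa_strategies(n_kv_heads=8, batch_size=1, tp_degree=8))
print(applicable_gqa_strategies(n_kv_heads=8, batch_size=32, tp_degree=32))
print(applicable_gqa_strategies(n_kv_heads=8, batch_size=1, tp_degree=32))
```

The chosen name would then map to the corresponding GQA member passed as group_query_attention in NeuronConfig.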

In the following example we demonstrate how to configure these distributed inference strategies and perform inference with the Mistral model:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import MistralForSampling, GQA, NeuronConfig

# Set sharding strategy for GQA to be shard over heads
neuron_config = NeuronConfig(
    group_query_attention=GQA.SHARD_OVER_HEADS
)

# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16', neuron_config=neuron_config)
model_neuron.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
text = "[INST] What is your favourite condiment? [/INST]"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None)
print([tokenizer.decode(tok) for tok in generated_sequence])

Repeated Ngram Filtering#

Repeated Ngram Filtering reduces redundant ngram phrases within the generated text. It uses the same API as the HuggingFace API for NoRepeatedNGram. Set the parameter no_repeat_ngram_size to the size of the ngram phrases to be filtered and pass it to the sampling function, for example model.sample(input_ids, no_repeat_ngram_size=3).
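To illustrate what this filtering does conceptually, the sketch below computes which next tokens would complete an ngram that already appears in the sequence. It is a plain-Python illustration of the general technique, not the transformers-neuronx implementation, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(tokens, no_repeat_ngram_size):
    """Return the tokens that would repeat an existing ngram if generated next."""
    n = no_repeat_ngram_size
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the ngram about to be completed.
    prefix = tokens[len(tokens) - (n - 1):]
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier ngram starts with the same prefix, its last token is banned.
        if tokens[i:i + n - 1] == prefix:
            banned.add(tokens[i + n - 1])
    return banned


# The sequence ends in (1, 2); the trigram (1, 2, 3) already occurred.
print(banned_next_tokens([1, 2, 3, 1, 2], no_repeat_ngram_size=3))
```

During sampling, the scores of the banned tokens would be suppressed before the next token is chosen.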

On-device sampling support [Beta]#

Transformers-neuronx supports on-device sampling for all models except Mixtral models. The feature can be enabled by setting on_device_generation in NeuronConfig to an instance of GenerationConfig.

In the following example, we demonstrate how to use on-device generation for a Llama model using top_k, top_p, top_p_min_tokens and temperature.

Top-K on-device sampling support [Beta]#

Transformers Neuron supports Top-K Sampling on-device for all models except Mixtral models. In the following example, we demonstrate how to use on-device Top-K for the Llama model via the GenerationConfig and NeuronConfig configs.

import torch
from transformers_neuronx import LlamaForSampling
from transformers_neuronx.config import NeuronConfig, GenerationConfig
from transformers import AutoTokenizer

neuron_config = NeuronConfig(
    on_device_generation=GenerationConfig(max_length=128, top_k=10, top_p=0.9, top_p_min_tokens=1, temperature=0.9, do_sample=True)
)

# Create and compile the Neuron model
model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=1, tp_degree=8, n_positions=128, neuron_config=neuron_config)
model_neuron.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b')
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128, top_k=10)
    print([tokenizer.decode(tok) for tok in generated_sequence])

By default, transformers-neuronx uses the same fixed sampling parameters for all sequences across all invocations of the model when on-device generation is enabled. It is possible to provide new sampling parameters per model invocation by enabling the dynamic feature in the GenerationConfig. It is also possible to provide different sampling parameters for each sequence in the batch by using the per_batch_line feature. When using per_batch_line, it is recommended to limit the number of tokens considered during sampling across all sequences by setting global_top_k to a reasonably low number (e.g. 250). This prevents poor performance when computing top_p tokens over a large vocabulary without any prior filtering. When using per_batch_line, top_k, top_p, top_p_min_tokens and temperature accept lists with one value per sequence in the batch.

In the following example, we demonstrate how to use the dynamic and per_batch_line features together.

import torch
from transformers_neuronx import LlamaForSampling
from transformers_neuronx.config import NeuronConfig, GenerationConfig
from transformers import AutoTokenizer

batch_size = 2
generation_config = GenerationConfig(
        max_length=128, dynamic=True, per_batch_line=True, do_sample=True,
        top_k=[1] * batch_size,
        top_p=[1.0] * batch_size,
        top_p_min_tokens=[1] * batch_size,
        temperature=[1.0] * batch_size,
        global_top_k=256
    )

neuron_config = NeuronConfig(
    on_device_generation=generation_config
)

# Create and compile the Neuron model
model_neuron = LlamaForSampling.from_pretrained('openlm-research/open_llama_3b', batch_size=2, tp_degree=8, n_positions=128, neuron_config=neuron_config)
model_neuron.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b')
tokenizer.pad_token = tokenizer.eos_token
text = ["Hello, I'm a language model,", "Hello, I'm also a language model,"]
encoded_input = tokenizer(text, return_tensors='pt', padding=True)  # pad the two prompts to a common length

# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128)
    print([tokenizer.decode(tok) for tok in generated_sequence])

    # Use different settings for each sequence in the batch
    # Supported because we use `generation_config.per_batch_line = True`
    generation_config.top_k = [1, 20]
    generation_config.top_p = [1.0, 0.9]
    generation_config.top_p_min_tokens = [1, 1]
    generation_config.temperature = [1.0, 0.9]

    # Update the generation configuration dynamically
    # Supported because we use `generation_config.dynamic = True`
    model_neuron.update_generation_config(generation_config)

    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=128)
    print([tokenizer.decode(tok) for tok in generated_sequence])

Running inference with multiple models#

Multiple transformers-neuronx models can be loaded at the same time as long as the total number of consumed NeuronCores is less than or equal to the total number of NeuronCores on the instance. For example, three tp-degree=8 models can be loaded and run in parallel on an inf2.48xlarge which has 24 NeuronCores. The NEURON_RT_NUM_CORES and NEURON_RT_VISIBLE_CORES environment variables can be used to allocate the necessary number of NeuronCores to each process to run multiple transformers-neuronx models in parallel. See the NeuronCore Allocation and Model Placement for Inference (torch-neuronx) section for additional information about how to use these environment variables.

Note that when multiple models run on a single instance, the number of threads per model should be reduced to avoid contention on the host side. Assume the Neuron instance (e.g. trn1) has 192 CPU cores. If one model keeps all CPU cores busy, the remaining models suffer significant performance degradation. The number of threads for each model should therefore be limited to a share of the available cores by setting the OMP_NUM_THREADS environment variable. For example, if 192 CPU cores are available and four tp-degree=8 models are used, one can export OMP_NUM_THREADS=48 for each model.
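The allocation described above can be sketched as a helper that builds the environment for each model process. This is a minimal sketch: the function name `per_model_env` is hypothetical, each model is assumed to run in its own process on a contiguous block of NeuronCores, and the range form of NEURON_RT_VISIBLE_CORES (such as 0-7) should be confirmed against the Neuron runtime documentation.

```python
def per_model_env(model_index, tp_degree, n_models, n_cpu_cores):
    """Environment variables for one model process (illustrative sketch)."""
    first_core = model_index * tp_degree
    last_core = first_core + tp_degree - 1
    return {
        # Contiguous block of NeuronCores reserved for this process.
        "NEURON_RT_VISIBLE_CORES": f"{first_core}-{last_core}",
        # Equal share of host CPU threads per model.
        "OMP_NUM_THREADS": str(n_cpu_cores // n_models),
    }


# Four tp-degree=8 models on a host with 192 CPU cores.
for i in range(4):
    print(i, per_model_env(i, tp_degree=8, n_models=4, n_cpu_cores=192))
```

Each dictionary would be merged into the environment of the corresponding process before the model is loaded, for example through the env argument of subprocess.Popen.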

Streamer#

LLMs generate tokens in an autoregressive loop. A model.sample call waits until the full sequence has been generated before returning the response. It is possible to emit each output token as soon as it is generated by using a streamer object. A streamer is an object with two methods: put and end. The transformers library provides several predefined streamers, such as TextIteratorStreamer. The following example shows how to define a streamer and use it in transformers-neuronx:

import torch
from transformers import AutoTokenizer
from transformers_neuronx import MistralForSampling, GQA

import transformers
from time import time

# Create a custom streamer inherited from transformers.generation.streamers.BaseStreamer
class CustomStreamer(transformers.generation.streamers.BaseStreamer):
    def __init__(self) -> None:
        self.reset()

    def reset(self):
        self.token_latencies = []
        self.iter = 0
        self.now = time()

    def put(self, tokens):
        now = time()
        token_latency = now - self.now
        print(f"Iteration {self.iter:4d}: Latency [s] {token_latency:6.3f} -- Token {tokens}")
        self.now = now
        self.iter += 1
        self.token_latencies.append(token_latency)


    def end(self):
        print("First 10 token latencies:", self.token_latencies[:10])


# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16')
model_neuron.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
text = "[INST] What is your favourite condiment? [/INST]"
encoded_input = tokenizer(text, return_tensors='pt')

streamer = CustomStreamer()
# Run inference
with torch.inference_mode():
    generated_sequence = model_neuron.sample(encoded_input.input_ids, sequence_length=256, start_ids=None, streamer=streamer)

Stopping Criteria#

Custom stopping criteria can be defined to stop the autoregressive loop. For example, to stop generation after 0.5 seconds, define and use a stopping criteria class as follows:

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from transformers_neuronx import MistralForSampling, GQA, NeuronConfig
from transformers_neuronx.stopping_criteria import StoppingCriteria, StoppingCriteriaList

from time import time
from typing import List, Optional, Callable


class MaxTimeCriteria(StoppingCriteria):
    """
    This class can be used to stop generation whenever the full generation exceeds some amount of time. By default, the
    time will start being counted when you initialize this class. You can override this by passing an
    `initial_timestamp`.

    Args:
        max_time (`float`):
            The maximum allowed time in seconds for the generation.
        initial_timestamp (`float`, *optional*, defaults to `time()`):
            The start of the generation allowed time.
    """

    def __init__(self, max_time: float, initial_timestamp: Optional[float] = None):
        self.max_time = max_time
        self.initial_timestamp = time() if initial_timestamp is None else initial_timestamp

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        dt = time() - self.initial_timestamp
        end_condition = dt > self.max_time
        if end_condition:
            print("Stopping!")
        return end_condition

# Create a streamer. This can also be a custom streamer inherited from transformers.generation.streamers.BaseStreamer
class CustomStreamer(transformers.generation.streamers.BaseStreamer):
    def __init__(self) -> None:
        self.reset()

    def reset(self):
        self.token_latencies = []
        self.iter = 0
        self.now = time()

    def put(self, tokens):
        now = time()
        token_latency = now - self.now
        print(f"Iteration {self.iter:4d}: Latency [s] {token_latency:6.3f} -- Token {tokens}")
        self.now = now
        self.iter += 1
        self.token_latencies.append(token_latency)


    def end(self):
        pass

# Create and compile the Neuron model
model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', amp='bf16')
model_neuron.to_neuron()

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
text = "[INST] What is your favourite condiment? [/INST]"
encoded_input = tokenizer(text, return_tensors='pt')

# Add stopping criteria to stop after 0.5 seconds
stopping_criteria_list = StoppingCriteriaList([MaxTimeCriteria(0.5)])
streamer = CustomStreamer()

# Run inference
with torch.inference_mode():
    model_neuron.sample(input_ids=encoded_input.input_ids, sequence_length=256, stopping_criteria_list=stopping_criteria_list, streamer=streamer)

Speculative sampling [Beta]#

Transformers NeuronX supports speculative sampling for the Llama and GPT2 model classes. In speculative sampling, a smaller draft model speculates future tokens. These are then sent to the larger target model, which accepts or rejects them. For more detailed information, see the original proposal by DeepMind titled Accelerating Large Language Model Decoding with Speculative Sampling. Our implementation of speculative sampling is lossless. In addition to standalone draft models, we also support EAGLE draft models. Currently only EAGLE v1 is supported.
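To make the accept/reject step concrete, the following is a toy, pure-Python sketch of the rule from the speculative sampling proposal: a draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution; on rejection a replacement is drawn from the normalized residual max(0, p - q). The function name speculative_step and the list-based distributions are assumptions made for illustration. This is not the fused on-device implementation used by transformers-neuronx.

```python
import random


def speculative_step(draft_tokens, draft_probs, target_probs, rng):
    """One accept/reject pass over k draft tokens (illustrative sketch).

    draft_tokens: k token ids proposed by the draft model
    draft_probs:  k distributions q_i (lists over the vocabulary)
    target_probs: k + 1 distributions p_i from the target model
    Returns the tokens emitted in this step (between 1 and k + 1 tokens).
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(x) / q(x))
        if rng.random() < min(1.0, p[token] / q[token]):
            emitted.append(token)
            continue
        # Rejected: resample from the residual distribution max(0, p - q)
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # All draft tokens accepted: sample one bonus token from the target
    last = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted


rng = random.Random(0)
q = [0.5, 0.5, 0.0]

# Target agrees with the draft: both draft tokens are accepted, plus a bonus token
agree = speculative_step([0, 1], [q, q], [q, q, [0.0, 0.0, 1.0]], rng)
print(agree)  # [0, 1, 2]

# Target assigns zero probability to the draft token: rejected and resampled
disagree = speculative_step([0, 1], [q, q], [[0.0, 0.0, 1.0]] * 3, rng)
print(disagree)  # [2]
```

Because rejected tokens are replaced by a sample from the residual distribution, the emitted tokens follow the target distribution, which is what makes the scheme lossless.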

The following example demonstrates speculative sampling with the Llama model, using multinomial sampling.

import torch
from transformers import LlamaTokenizer
from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig, GenerationConfig
from transformers_neuronx.fused_speculation import FusedSpeculativeDecoder

# Specify path to draft and target
draft = '/home/ubuntu/Llama-2-7b-chat-hf'
target = '/home/ubuntu/Llama-2-70b-chat-hf'

# Specify generation parameters
gen_kwargs = {
    "top_k": 50,
    "top_p": 0.9,
    "do_sample": True,
    "temperature": 0.7,
}

# Load draft model
draft_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        draft,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
            ),
        )
draft_neuron_model.to_neuron()
# Load target model
target_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        target,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
            ),
        )
target_neuron_model.to_neuron()

# Compile the speculative sampling model
# Here we set the speculation length to 4
fsd = FusedSpeculativeDecoder(
        draft_neuron_model,
        target_neuron_model,
        4,
        )
fsd.to_neuron()

# Initialize tokenizer and text prompt
tokenizer = LlamaTokenizer.from_pretrained(target)
prompt = "Hello, I'm a generative AI language model."
inputs = tokenizer(prompt, return_tensors="pt")

# Call speculative sampling on given input
response = fsd.sample(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    sequence_length=30,
)

# Decode the response
generated_text = tokenizer.decode(response[0])
print(f"\nDecoded tokens: {generated_text}")

The following sample shows how to enable EAGLE speculation. To get the EAGLE draft model to work, manually copy the LM head weights from the target model to the draft model. Additionally, you need to rename the keys in the draft model’s state_dict to match those in the target model.

import torch
from transformers import LlamaTokenizer
from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig, GenerationConfig
from transformers_neuronx.fused_speculation import FusedSpeculativeDecoder

# Specify path to draft and target
# The EAGLE draft model can be downloaded from the EAGLE website
draft = '/home/ubuntu/EAGLE-llama2-chat-70B'
target = '/home/ubuntu/Llama-2-70b-chat-hf'

# Specify generation parameters
gen_kwargs = {
    "top_k": 50,
    "top_p": 0.9,
    "do_sample": True,
    "temperature": 0.7,
}

# Load draft model
draft_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        draft,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            is_eagle_draft=True,
            has_pre_attention_norm=False,
            # Need the above two configs for Eagle
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
            ),
        )
draft_neuron_model.to_neuron()
# Load target model
target_neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
        target,
        n_positions=1024,
        batch_size=1,
        tp_degree=32,
        amp='bf16',
        neuron_config=NeuronConfig(
            is_eagle_target=True,
            # Need the above config for Eagle
            padding_side="right",
            attention_layout="BSH",
            collectives_layout="BSH",
            on_device_embedding=True,
            on_device_generation=GenerationConfig(**gen_kwargs),
            ),
        )
target_neuron_model.to_neuron()

# Compile the speculative sampling model
# Here we set the speculation length to 4
fsd = FusedSpeculativeDecoder(
        draft_neuron_model,
        target_neuron_model,
        4,
        )
fsd.to_neuron()

# The rest is the same as in the previous example

QKV Weight Fusion#

Concatenating a model’s query, key and value weight matrices often achieves better performance because larger matrices allow for more efficient data movement and compute. QKV weight fusion can be enabled by setting fuse_qkv=True in the NeuronConfig:

neuron_config = NeuronConfig(fuse_qkv=True)
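
The idea behind the fusion can be illustrated with a tiny pure-Python sketch: one matrix multiplication against the column-wise concatenation of the three weight matrices yields the same query, key and value projections as three separate multiplications. The helper names and the toy weights below are made up for illustration and say nothing about how transformers-neuronx lays out the fused weights internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def concat_columns(*mats):
    """Concatenate matrices with equal row counts along the column axis."""
    return [sum(rows, []) for rows in zip(*mats)]


hidden = [[1.0, 2.0], [3.0, 4.0]]  # shape (tokens, hidden)
w_q = [[1.0, 0.0], [0.0, 1.0]]     # toy query weights
w_k = [[2.0, 0.0], [0.0, 2.0]]     # toy key weights
w_v = [[0.0, 1.0], [1.0, 0.0]]     # toy value weights

# One larger matmul instead of three smaller ones
w_qkv = concat_columns(w_q, w_k, w_v)  # shape (hidden, 3 * hidden)
fused = matmul(hidden, w_qkv)

# Split the fused output back into Q, K and V
q = [row[0:2] for row in fused]
k = [row[2:4] for row in fused]
v = [row[4:6] for row in fused]

print(q == matmul(hidden, w_q))  # True
print(k == matmul(hidden, w_k))  # True
print(v == matmul(hidden, w_v))  # True
```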

Attention Layout#

The intermediate tensor layouts in a model’s attention layer can impact the compiler’s optimization opportunities and thus can impact a model’s performance. Using (batch, sequence, hidden) (or BSH) layout for attention often achieves better performance since it can enable better overlapping of compute with collectives and can reduce transposes. We intend to enable BSH attention by default in a future release. For now, BSH attention layout can be enabled by setting attention_layout="BSH" in the NeuronConfig:

neuron_config = NeuronConfig(attention_layout="BSH")

Bucketing#

LLM inference is a generative process that can produce variable-length sequences. This poses a problem because the Neuron compiler produces executables that expect statically shaped inputs and outputs. To make LLMs work with different shapes, transformers_neuronx generates buckets and applies padding wherever it is required.

There are at least two sets of buckets for each LLM inference that can be set by the user: 1) context encoding (pre-fill) buckets and 2) output token generation buckets.

Token generation buckets

In token generation, tokens are generated iteratively. At each token position, the transformer needs to attend only to the previous tokens. In a naive implementation with static shapes, however, every step attends to the entire KV cache (the full sequence length). Token generation buckets solve this problem by determining the attention length. For instance, if the maximum sequence length is 1024 tokens and the current token is at position 120, there is no need to attend to all 1024 tokens in the current step; buckets let each step attend to only a portion of the KV cache. By default, the token generation buckets are powers of 2 starting from 128 tokens (i.e. 128, 256, 512, up to the sequence length). In the example above, bucket 128 would be used for position 120, which reduces the wasted compute significantly. Users can change these buckets by passing a list for n_positions (see the example below). If a single number is given for n_positions (the sequence length) instead of a list, the powers-of-2 buckets starting from 128 are used. The last bucket is always n_positions (the sequence length), even if it is not a power of 2.
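The default rule described above can be sketched as follows. Both function names are hypothetical and the code only mirrors the description in the text; it is not the bucket logic of transformers_neuronx itself, and the treatment of a length exactly equal to a bucket size is an assumption.

```python
from bisect import bisect_left


def default_token_buckets(n_positions, start=128):
    """Powers of 2 from `start`, with n_positions always as the last bucket."""
    buckets = []
    size = start
    while size < n_positions:
        buckets.append(size)
        size *= 2
    buckets.append(n_positions)
    return buckets


def select_bucket(buckets, length):
    """Return the smallest bucket that can hold `length` tokens.

    Assumes `buckets` is sorted in ascending order.
    """
    idx = bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket {buckets[-1]}")
    return buckets[idx]


print(default_token_buckets(1024))  # [128, 256, 512, 1024]
print(default_token_buckets(3000))  # [128, 256, 512, 1024, 2048, 3000]
print(select_bucket(default_token_buckets(1024), 120))  # 128
```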

Context encoding buckets

The prompt tokens can be processed in parallel. As a result, bucket sizes need to be set for the different expected lengths of input prompts. These context bucket sizes can be specified with the context_length_estimate argument. In general, it is better for all buckets to be multiples of 256 tokens. Adding too many buckets, however, increases device memory consumption and adds latency for bucket switching. Usually, powers of 2 starting from 128 tokens are used for context encoding buckets. If the total sequence length (n_positions) is beyond 2048 tokens, it is desirable to add extra buckets at multiples of 512 or 1024 tokens. For context lengths beyond 2k, adding buckets at multiples of 256 tokens or smaller is not recommended, to avoid bucket switching latency. At runtime, the smallest bucket that fits the input context is used. By default, the context encoding buckets are set to half of the output token buckets. Adding extra context buckets reduces wasted compute and improves performance, but each extra executable consumes device memory.

Note that the default output token generation buckets work well for a wide range of applications. The ideal context encoding buckets, however, depend on the specific use case. For instance, if all requests have a context length of about 1500 +/- 500 tokens, adding more buckets close to 1500 might reduce context encoding time. In this example, adding buckets of 1024, 1280, 1536, 1792 and 2048 tokens (256 tokens apart) could help. Moreover, the largest context encoding bucket should be able to hold the longest expected context. Otherwise, performance degrades significantly.
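A rough way to compare candidate context bucket sets is to count the padding tokens each set would add for a sample of prompt lengths. The sketch below does this for the 1500 +/- 500 example. The function names and the sample prompt lengths are invented for illustration, and the padding count is only a proxy for wasted compute, not a measurement.

```python
def pick_context_bucket(buckets, prompt_len):
    """Return the smallest bucket that fits the prompt, or None if none fits."""
    for bucket in sorted(buckets):
        if bucket >= prompt_len:
            return bucket
    return None


def total_padding(buckets, prompt_lens):
    """Sum of padding tokens added when each prompt is padded to its bucket."""
    padding = 0
    for length in prompt_lens:
        bucket = pick_context_bucket(buckets, length)
        if bucket is None:
            raise ValueError(f"no bucket fits a prompt of {length} tokens")
        padding += bucket - length
    return padding


prompts = [1000, 1300, 1500, 1700, 2000]  # made-up sample of request lengths
coarse = [1024, 2048]
fine = [1024, 1280, 1536, 1792, 2048]

print(total_padding(coarse, prompts))  # 1716
print(total_padding(fine, prompts))    # 436
```

The finer bucket set pads far fewer tokens for this sample, at the price of three additional compiled executables held in device memory.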

To set context encoding and token generation buckets manually:

from transformers_neuronx import NeuronAutoModelForCausalLM

context_length_estimate = [1024, 1280, 1536, 1792, 2048]    # The best context estimate depends on the use case
n_positions = [128, 256, 512, 1024, 2048, 3072]             # Usually default buckets are appropriate

model = NeuronAutoModelForCausalLM.from_pretrained(
    'gpt2',
    batch_size=1,
    n_positions=n_positions,
    tp_degree=2,
    amp='f16',
    context_length_estimate=context_length_estimate,
)

Multi-node inference support (TP/PP)#

Prerequisite: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup-trn1-multi-node-execution.html

When models are too large to fit on a single node, Transformers NeuronX multi-node inference (tensor parallel and pipeline parallel) can be used to shard model weights across multiple Neuron instances (only supported on Trn1 and Trn1n). Single-node inference code can easily be extended to multi-node inference.

Note that Transformers NeuronX currently does not support multi-node tensor parallel and pipeline parallel at the same time. When pipeline parallel is used, tensor parallel must stay within a node (TP<=32 on Trn1/Trn1n).

The sections below first outline sample code for single-node execution and then provide instructions for migrating it to multi-node tensor parallel or multi-node pipeline parallel. To start with, the code below is a single-node script that runs the open_llama_3b model with a tensor parallel degree of 32.

import torch
from transformers import AutoTokenizer, AutoConfig
from transformers_neuronx import  LlamaForSampling, HuggingFaceGenerationModelAdapter

# Create and compile the Neuron model
model = LlamaForSampling.from_pretrained("openlm-research/open_llama_3b", tp_degree=32)
model.to_neuron()

# Use the `HuggingFaceGenerationModelAdapter` to access the generate API
config = AutoConfig.from_pretrained("openlm-research/open_llama_3b")
model = HuggingFaceGenerationModelAdapter(config, model)

# Get a tokenizer and example input
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)



# Run inference using temperature
with torch.inference_mode():
    model.reset_generation()
    generated_sequence = model.generate(
        input_ids=encoded_input.input_ids,
        attention_mask=encoded_input.attention_mask,
        do_sample=True,
        max_length=256,
        temperature=0.7,
    )

print([tokenizer.decode(tok) for tok in generated_sequence])

Command line:

python3 multi_node_dev_example.py

Multi-Node Tensor Parallel

Compared to single-node tensor parallel, multi-node tensor parallel shards the model weights in the same way but over more cores spread across nodes. It also requires that model.forward() on every node receives exactly the same input; otherwise unexpected behavior can occur (runtime failure or wrong output).

Configurations (environment variables to be configured on each node):

NEURON_RT_ROOT_COMM_ID: the master node’s <IP address>:<port>
NEURON_RANK_ID: rank of the node, 0 means master node
NEURON_LOCAL_TP: the local tensor parallel degree on each node

Example:

Change the single-node script to use tp=64 (2 nodes). Set torch.manual_seed so that the sampling loop running on each node samples the same token as the next input.

Node 1 command line:

NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=0 NEURON_LOCAL_TP=32 python3 multi_node_dev_example.py

Node 2 command line (same as Node 1 but set NEURON_RANK_ID as 1):

NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=1 NEURON_LOCAL_TP=32 python3 multi_node_dev_example.py
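
When more than two nodes are involved, it can be convenient to generate the per-node command lines rather than typing them by hand. The sketch below builds the three environment variables listed above for each rank. The helper names multi_node_tp_env and command_line are hypothetical and exist only in this example; the IP address, port and script name are the ones used in this section.

```python
def multi_node_tp_env(master_ip, port, rank, local_tp):
    """Environment variables for one node of a multi-node tensor parallel run."""
    return {
        "NEURON_RT_ROOT_COMM_ID": f"{master_ip}:{port}",  # master node address
        "NEURON_RANK_ID": str(rank),                      # 0 is the master node
        "NEURON_LOCAL_TP": str(local_tp),                 # TP degree per node
    }


def command_line(env, script):
    """Format the variables and the script as a single shell command."""
    assignments = " ".join(f"{key}={value}" for key, value in env.items())
    return f"{assignments} python3 {script}"


# Two nodes with 32 NeuronCores each, i.e. tp=64 in total
commands = [
    command_line(multi_node_tp_env("10.1.201.64", 63423, rank, 32),
                 "multi_node_dev_example.py")
    for rank in range(2)
]
for cmd in commands:
    print(cmd)
```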

You can also refer to the Llama 3.1 405B multi-node 16k tutorial for an example that uses multi-node tensor parallel.

Multi-Node Pipeline Parallel

With the weight tensors sharded as in tensor parallel, pipeline parallel can additionally partition the layers across different nodes. The intermediate (hidden) tensor is transferred from one pipeline stage (node) to the next, and the final output is sent from the last pipeline stage back to the first.

Compared to multi-node tensor parallel, on non-zero ranks model.forward in pipeline parallel falls back to a while loop that blocks on the input broadcast from the master node.
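
As a mental model for layer partitioning, the sketch below splits a stack of decoder layers into contiguous, nearly equal groups, one per pipeline stage. It is purely illustrative: partition_layers is a hypothetical function, and the way transformers-neuronx actually assigns layers to stages may differ.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices into contiguous, nearly equal pipeline stages."""
    if num_stages < 1 or num_stages > num_layers:
        raise ValueError("need between 1 and num_layers stages")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = partition_layers(num_layers=10, num_stages=3)
print([list(s) for s in stages])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In such a scheme the hidden tensor produced by the last layer of one stage is the input to the first layer of the next stage.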

Configurations (environment variables to be configured on each node):

NEURON_RT_ROOT_COMM_ID: the master node’s <IP address>:<port>
CPU_COMM_ID: similar to NEURON_RT_ROOT_COMM_ID, but must be set with a different port
NEURON_RANK_ID: rank of the node, 0 means master node
NEURON_PP_STAGES: number of pipeline stages (nodes)

Example:

Keep the original single node script with tp=32.

Node 1 command line:

NEURON_PP_STAGES=2 CPU_COMM_ID=10.1.201.64:8989 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=0 python3 multi_node_dev_example.py

Node 2 command line (same as Node 1 but set NEURON_RANK_ID as 1):

NEURON_PP_STAGES=2 CPU_COMM_ID=10.1.201.64:8989 NEURON_RT_ROOT_COMM_ID=10.1.201.64:63423 NEURON_RANK_ID=1 python3 multi_node_dev_example.py

Long Sequence length support up to 128k#

Flash Attention

With the integration of the FlashAttention kernel, developers can use longer sequence lengths for Llama models. The FlashAttention kernel is used automatically, without any additional configuration, when the input sequence length is greater than 8k. Refer to the tutorial for usage of a 32k sequence length on a variation of the Llama 3 8B model.

Flash Decoding

Flash Decoding (FD) is a technique that significantly speeds up attention during inference, especially for long-context tasks in large language models (LLMs) with GQA.

With the integration of FD, developers can achieve faster inference with larger sequence lengths and batch sizes by reducing KV cache replication. Refer to the tutorial on flash decoding usage for 128k sequence length sampling. Flash decoding can be enabled by setting the flag shard_over_sequence=True in NeuronConfig:

neuron_config = NeuronConfig(shard_over_sequence=True)

Note that the first Allgather introduced by flash decoding can be skipped at the cost of duplicated Q weights. This is only recommended for relatively small models (e.g. 3B, 8B) and large batch sizes.

neuron_config = NeuronConfig(shard_over_sequence=True, duplicate_q_weight_sos=True)

Known limitations and FAQs

Flash decoding is expected to have performance degradation (PTL) for smaller sequence lengths and batch sizes. We recommend flash decoding when batch size x sequence length > 16k.
Flash decoding support is not enabled for the following features:
Speculative Decoding
Multi Head Attention (MHA) models
