Llama-2-7b Inference#

In this example we compile and deploy the Hugging Face meta-llama/Llama-2-7b-chat-hf model for tensor parallel inference on Neuron using the Neuronx-Distributed package.

Note: This model is not currently optimized for performance on neuronx-distributed. For optimized Llama-2 inference, use transformers-neuronx.

The example has the following main sections:

  1. Set up the Jupyter Notebook

  2. Install dependencies

  3. Download the model

  4. Trace the model

  5. Perform greedy sampling

  6. Benchmark sampling

This Jupyter Notebook can be run on a Trn1 instance (trn1.32xlarge).

Set up the Jupyter Notebook#

The following steps set up Jupyter Notebook and launch this tutorial:

  1. Clone the Neuronx-Distributed repo to your instance:

git clone https://github.com/aws-neuron/neuronx-distributed.git

  2. Navigate to the example/inference samples folder:

cd neuronx-distributed/example/inference/

  3. Download the tutorial notebook llama2_inference.ipynb into the example/inference/ directory:

wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.ipynb

  4. Follow the instructions in Jupyter Notebook QuickStart to run Jupyter Notebook on your instance.

Install dependencies#

This tutorial requires the following pip packages:

  • torch-neuronx

  • neuronx-cc

  • sentencepiece

  • transformers

  • neuronx-distributed

You can install neuronx-distributed using the setup guide. Most of the other packages are installed when you configure your environment using the torch-neuronx inference setup guide. The remaining dependencies must be installed here:

[ ]:
! pip install transformers==4.35.2 sentencepiece
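
You can verify that the required packages are installed by listing them:

[ ]:
! pip list | grep -E "neuron|torch|transformers|sentencepiece"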

Download the model#

Use of this model is governed by the Meta license. To download the model weights and tokenizer, follow the instructions in meta-llama/Llama-2-7b-chat-hf.

Once granted access, you can download the model. For the purposes of this sample, we assume the Llama-2-7b model has been saved to a directory called models/Llama-2-7b-chat-hf with the following layout:

 ├── LICENSE.txt
 ├── README.md
 ├── USE_POLICY.md
 ├── config.json
 ├── generation_config.json
 ├── model-00001-of-00002.safetensors
 ├── model-00002-of-00002.safetensors
 ├── model.safetensors.index.json
 ├── pytorch_model-00001-of-00002.bin
 ├── pytorch_model-00002-of-00002.bin
 ├── pytorch_model.bin.index.json
 ├── special_tokens_map.json
 ├── tokenizer.json
 ├── tokenizer.model
 └── tokenizer_config.json
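
If you have not yet downloaded the weights, one way to fetch them into that directory (a sketch using huggingface_hub, assuming your Hugging Face token has been granted access to the gated repository) is:

[ ]:
from huggingface_hub import snapshot_download

# Requires prior `huggingface-cli login` with a token that has been
# granted access to the gated meta-llama repository.
snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",
                  local_dir="/home/ubuntu/models/Llama-2-7b-chat-hf")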

By default, this model's config.json specifies float16 precision, which is not supported for this model at this time. Open config.json and change the torch_dtype field to bfloat16.
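
You can make this change by hand, or with a short script like the following sketch (it assumes the model directory used throughout this sample):

[ ]:
import json

config_file = "/home/ubuntu/models/Llama-2-7b-chat-hf/config.json"

with open(config_file) as f:
    config = json.load(f)

# Switch the checkpoint's default float16 to bfloat16.
config["torch_dtype"] = "bfloat16"

with open(config_file, "w") as f:
    json.dump(config, f, indent=2)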

[ ]:
# Location of the downloaded Hugging Face checkpoint and the output
# directory for the traced (compiled) model.
model_path = "/home/ubuntu/models/Llama-2-7b-chat-hf"
traced_model_path = "/home/ubuntu/models/llama-2-7b-chat-hf-trace"

Trace and load the model#

Now we can trace the model using the LlamaRunner script, which saves the traced model to traced_model_path. Tracing the 7b model can take up to 70 minutes. After tracing, the model can be loaded.

In this sample we use tensor parallelism degree 32 to optimize performance on trn1.32xlarge.

[ ]:
from llama2.llama2_runner import LlamaRunner

max_context_length = 128
max_new_tokens = 384
batch_size = 2
tp_degree = 32

runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

# Trace (compile) the model for Neuron and save it to traced_model_path.
# Keyword names follow the runner script in the neuronx-distributed
# inference examples; check llama2_runner.py if they differ in your version.
runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             context_lengths=max_context_length,
             new_token_counts=max_new_tokens)

neuron_model = runner.load_neuron_model(traced_model_path)


Perform greedy sampling#

Now let's use the model to perform autoregressive sampling.

[ ]:
prompt = ["I believe the meaning of life is", "The color of the sky is"]

generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)

for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")


Benchmark sampling#

Here we benchmark the per-token latency for greedy sampling.

[ ]:
results = runner.benchmark_sampling(neuron_model)
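
The returned results object summarizes the measured latency and throughput; its exact contents depend on the runner version, so a simple way to inspect it is:

[ ]:
# Inspect the benchmark report returned by the runner.
print(results)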