Llama-2-7b Inference#

In this example we compile and deploy the Hugging Face meta-llama/Llama-2-7b-chat-hf model for tensor parallel inference on Neuron using the Neuronx-Distributed package.

Note: This model is not currently optimized for performance on neuronx-distributed. For optimized Llama-2 inference, use transformers-neuronx.

The example has the following main sections:

  1. Set up the Jupyter Notebook

  2. Install dependencies

  3. Download the model

  4. Trace the model

  5. Perform greedy sampling

  6. Benchmark sampling

This Jupyter Notebook can be run on a Trn1 instance (trn1.32xlarge).

Set up the Jupyter Notebook#

The following steps set up Jupyter Notebook and launch this tutorial:

  1. Clone the Neuronx-Distributed repo to your instance:

git clone https://github.com/aws-neuron/neuronx-distributed.git

  2. Navigate to the example/inference samples folder:

cd neuronx-distributed/example/inference/

  3. Download the tutorial notebook llama2_inference.ipynb into the example/inference/ directory:

wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.ipynb

  4. Follow the instructions in Jupyter Notebook QuickStart to run Jupyter Notebook on your instance.

Install dependencies#

This tutorial requires the following pip packages:

  • torch-neuronx

  • neuronx-cc

  • sentencepiece

  • transformers

  • neuronx-distributed

You can install neuronx-distributed using the setup guide. Most of the other packages are installed when you configure your environment using the torch-neuronx inference setup guide. The remaining dependencies must be installed here:

[ ]:
! pip install transformers==4.35.2 sentencepiece
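
You can verify that the required packages are installed by listing them:

[ ]:
! pip list | grep -E "neuron|torch|transformers|sentencepiece"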

Download the model#

Use of this model is governed by the Meta license. To download the model weights and tokenizer, follow the instructions in meta-llama/Llama-2-7b-chat-hf.

Once granted access, you can download the model. For the purposes of this sample, we assume the Llama-2-7b model has been saved to a directory called models/Llama-2-7b-chat-hf with the following layout:

 ├── LICENSE.txt
 ├── README.md
 ├── USE_POLICY.md
 ├── config.json
 ├── generation_config.json
 ├── model-00001-of-00002.safetensors
 ├── model-00002-of-00002.safetensors
 ├── model.safetensors.index.json
 ├── pytorch_model-00001-of-00002.bin
 ├── pytorch_model-00002-of-00002.bin
 ├── pytorch_model.bin.index.json
 ├── special_tokens_map.json
 ├── tokenizer.json
 ├── tokenizer.model
 └── tokenizer_config.json
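
If you have not yet downloaded the weights, one way to fetch them into that directory (a sketch using huggingface_hub, assuming your Hugging Face token has been granted access to the gated repository) is:

[ ]:
from huggingface_hub import snapshot_download

# Requires prior `huggingface-cli login` with a token that has been
# granted access to the gated meta-llama repository.
snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",
                  local_dir="/home/ubuntu/models/Llama-2-7b-chat-hf")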

By default, this model's config.json specifies float16 precision, which is not supported for this model at this time. Open config.json and change the torch_dtype field to bfloat16.
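
You can make this change by hand, or with a short script like the following sketch (it assumes the model directory used throughout this sample):

[ ]:
import json

config_file = "/home/ubuntu/models/Llama-2-7b-chat-hf/config.json"

with open(config_file) as f:
    config = json.load(f)

# Switch the checkpoint's default float16 to bfloat16.
config["torch_dtype"] = "bfloat16"

with open(config_file, "w") as f:
    json.dump(config, f, indent=2)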

[ ]:
# Location of the downloaded Hugging Face checkpoint and the output
# directory for the traced (compiled) model.
model_path = "/home/ubuntu/models/Llama-2-7b-chat-hf"
traced_model_path = "/home/ubuntu/models/llama-2-7b-chat-hf-trace"

Trace and load the model#

Now we can trace the model using the LlamaRunner script, which saves the traced model to traced_model_path. Tracing the 7b model can take up to 70 minutes. After tracing, the model can be loaded.

In this sample we use tensor parallelism degree 32 to optimize performance on trn1.32xlarge.

[ ]:
from llama2.llama2_runner import LlamaRunner

max_context_length = 128
max_new_tokens = 384
batch_size = 2
tp_degree = 32

runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

# Trace (compile) the model for Neuron and save it to traced_model_path.
# Keyword names follow the runner script in the neuronx-distributed
# inference examples; check llama2_runner.py if they differ in your version.
runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             context_lengths=max_context_length,
             new_token_counts=max_new_tokens)

neuron_model = runner.load_neuron_model(traced_model_path)


Perform greedy sampling#

Now let's use the model to perform autoregressive sampling.

[ ]:
prompt = ["I believe the meaning of life is", "The color of the sky is"]

generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)

for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")


Benchmark sampling#

Here we benchmark the per-token latency for greedy sampling.

[ ]:
results = runner.benchmark_sampling(neuron_model)
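
The returned results object summarizes the measured latency and throughput; its exact contents depend on the runner version, so a simple way to inspect it is:

[ ]:
# Inspect the benchmark report returned by the runner.
print(results)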