Llama-2-7b Inference#
In this example we compile and deploy the Hugging Face meta-llama/Llama-2-7b-chat-hf model for tensor parallel inference on Neuron using the Neuronx-Distributed package.

Note: This model is not currently optimized for performance on neuronx-distributed. For optimized Llama-2 inference, use transformers-neuronx.
The example has the following main sections:
Set up the Jupyter Notebook
Install dependencies
Download the model
Trace the model
Perform greedy sampling
Benchmark sampling
This Jupyter Notebook can be run on a Trn1 instance (trn1.32xlarge).
Set up the Jupyter Notebook#
The following steps set up Jupyter Notebook and launch this tutorial:

1. Clone the Neuronx-Distributed repo to your instance:

git clone https://github.com/aws-neuron/neuronx-distributed.git

2. Navigate to the examples/inference samples folder:

cd neuronx-distributed/examples/inference/

3. Copy the tutorial notebook llama2_inference.ipynb to the examples/inference/ directory:

wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.ipynb

4. Follow the instructions in Jupyter Notebook QuickStart to run Jupyter Notebook on your instance.
Install Dependencies#
This tutorial requires the following pip packages:
torch-neuronx
neuronx-cc
sentencepiece
transformers
neuronx-distributed
You can install neuronx-distributed using the setup guide. Most of the other packages will be installed when configuring your environment using the torch-neuronx inference setup guide. The remaining dependencies can be installed here:
[ ]:
! pip install transformers==4.35.2 sentencepiece
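Before proceeding, you can quickly confirm that the Neuron packages are visible in your environment (a quick sanity check; the grep pattern is only illustrative and exact versions depend on your setup):

[ ]:
! pip list | grep -E "neuron|torch"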
Download the model#
Use of this model is governed by the Meta license. In order to download the model weights and tokenizer, follow the instructions in meta-llama/Llama-2-7b-chat-hf.

Once granted access, you can download the model. For the purposes of this sample, we assume you have saved the Llama-2-7b model in a directory called models/Llama-2-7b-chat-hf with the following format:
Llama-2-7b-chat-hf
├── LICENSE.txt
├── README.md
├── USE_POLICY.md
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
[ ]:
model_path = "/home/ubuntu/models/Llama-2-7b-chat-hf"
traced_model_path = "/home/ubuntu/models/llama-2-7b-chat-hf-trace"
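Before tracing, it can help to verify that the checkpoint and tokenizer load correctly from model_path (a quick sanity check using standard Hugging Face APIs, assuming the directory layout shown above):

[ ]:
from transformers import AutoConfig, AutoTokenizer

# Load the model config and tokenizer to confirm the files are in place.
config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(config.model_type, config.num_hidden_layers, config.hidden_size)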
Trace the model#
Now we can trace the model using the LlamaRunner script. This saves the model to the traced_model_path. Tracing the 7b model can take up to 70 minutes.

In this sample we use a tensor parallelism degree of 32 to optimize performance on trn1.32xlarge.
[ ]:
from llama2.llama2_runner import LlamaRunner

max_context_length = 128
max_new_tokens = 384
batch_size = 2
tp_degree = 32

runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             max_context_length=max_context_length,
             max_new_tokens=max_new_tokens)
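Once tracing finishes, you can confirm that artifacts were written to traced_model_path (the exact contents depend on the Neuronx-Distributed version; this is just a quick check):

[ ]:
import os

# List whatever the trace step produced in the output directory.
print(os.listdir(traced_model_path))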
Inference#
Now let's load the traced model and perform autoregressive sampling.
[ ]:
generate_ids, outputs = runner.generate_on_neuron(
    prompt=["I believe the meaning of life is", "The color of the sky is"],
    traced_model_path=traced_model_path)

for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")
Benchmarking#
Here we benchmark the per-token latency of greedy sampling.
[ ]:
results = runner.benchmark_sampling(traced_model_path)
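The returned results object can be inspected or saved for comparison across runs (a sketch assuming results is a dict-like summary of latency and throughput; check your Neuronx-Distributed version for the exact structure):

[ ]:
import json

# Serialize the benchmark summary; default=str covers non-JSON-native values.
print(json.dumps(results, indent=2, default=str))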