Llama-2-7b Inference#
In this example we compile and deploy the Hugging Face meta-llama/Llama-2-7b-chat-hf model for tensor parallel inference on Neuron using the Neuronx-Distributed package.
Note: This model is not currently optimized for performance on neuronx-distributed. For optimized Llama-2 inference, use transformers-neuronx.
The example has the following main sections:
Set up the Jupyter Notebook
Install dependencies
Download the model
Trace the model
Perform greedy sampling
Benchmark sampling
This Jupyter Notebook can be run on a Trn1 instance (trn1.32xlarge).
Set up the Jupyter Notebook#
The following steps set up Jupyter Notebook and launch this tutorial:
Clone the Neuronx-Distributed repo to your instance using
git clone https://github.com/aws-neuron/neuronx-distributed.git
Navigate to the examples/inference samples folder:
cd neuronx-distributed/examples/inference/
Copy the tutorial notebook llama2_inference.ipynb to the examples/inference/ directory:
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/llama2_inference.ipynb
Follow the instructions in Jupyter Notebook QuickStart to run Jupyter Notebook on your instance.
Install Dependencies#
This tutorial requires the following pip packages:
torch-neuronx
neuronx-cc
sentencepiece
transformers
neuronx-distributed
You can install neuronx-distributed using the setup guide. Most of the other packages will be installed when configuring your environment using the torch-neuronx inference setup guide. The remaining dependencies can be installed here:
[ ]:
! pip install -r requirements.txt
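As a quick sanity check (not part of the original notebook), you can confirm that the packages installed above are importable from Python. The import names below are the standard ones for the listed pip packages; adjust as needed for your environment.

import torch
import torch_neuronx            # PyTorch Neuron support (pip package torch-neuronx)
import neuronx_distributed      # tensor-parallel tracing/inference utilities
import transformers
import sentencepiece

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)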
Download the model#
Use of this model is governed by the Meta license. To download the model weights and tokenizer, follow the instructions in meta-llama/Llama-2-7b-chat-hf.
Once granted access, you can download the model. For the purposes of this sample, we assume you have saved the Llama-2-7b model in a directory called models/Llama-2-7b-chat-hf with the following format:
Llama-2-7b-chat-hf
├── LICENSE.txt
├── README.md
├── USE_POLICY.md
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
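If you prefer to fetch the weights programmatically, the sketch below uses huggingface_hub.snapshot_download. It is not part of the original tutorial, and it assumes you have already been granted access to the gated repository, have a recent huggingface_hub installed, and expose your access token through the HF_TOKEN environment variable.

import os
from huggingface_hub import snapshot_download

# Download the gated repository into the directory layout shown above.
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="/home/ubuntu/models/Llama-2-7b-chat-hf",
    token=os.environ.get("HF_TOKEN"),
)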
By default, this model uses float16 precision, which is not supported in this flow at this time. Open config.json and change the torch_dtype field to bfloat16.
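You can make this edit by hand, or with a small script such as the sketch below, which assumes the model was saved to the path used in the next cell.

import json

config_file = "/home/ubuntu/models/Llama-2-7b-chat-hf/config.json"

# Switch the checkpoint's declared dtype from float16 to bfloat16.
with open(config_file) as f:
    config = json.load(f)

config["torch_dtype"] = "bfloat16"

with open(config_file, "w") as f:
    json.dump(config, f, indent=2)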
[ ]:
model_path = "/home/ubuntu/models/Llama-2-7b-chat-hf"
traced_model_path = "/home/ubuntu/models/llama-2-7b-chat-hf-trace"
Trace and load the model#
Now we can trace the model using the LlamaRunner script. This saves the model to the traced_model_path. Tracing the 7b model can take up to 70 minutes. After tracing, the model can be loaded.
In this sample we use a tensor parallelism degree of 32 to optimize performance on trn1.32xlarge.
[ ]:
from llama2.llama2_runner import LlamaRunner

# Generation and parallelism settings used for tracing.
max_context_length = 128
max_new_tokens = 384
batch_size = 2
tp_degree = 32  # one shard per NeuronCore on trn1.32xlarge

runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             context_lengths=max_context_length,
             new_token_counts=max_new_tokens,
             on_device_sampling=True)

neuron_model = runner.load_neuron_model(traced_model_path)
Inference#
Now let's use the model to perform autoregressive greedy sampling.
[ ]:
prompt = ["I believe the meaning of life is", "The color of the sky is"]
generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)
for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")
Benchmarking#
Here we benchmark the per-token latency of greedy sampling.
[ ]:
results = runner.benchmark_sampling(neuron_model)
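The exact structure of results depends on the runner implementation; assuming it is a mapping from metric names to values (for example, latency statistics), a minimal way to inspect it is:

# Print whatever metrics the benchmark returned (assumes a dict-like result).
for name, value in results.items():
    print(f"{name}: {value}")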