CodeLlama-13b Inference with 16k sequence length#
In this example, we compile and deploy the Hugging Face codellama/CodeLlama-13b-hf model for tensor parallel inference on Neuron using the Neuronx-Distributed
package, with a 16k sequence length.
The example has the following main sections:
Set up the Jupyter Notebook
Install dependencies
Download the model
Trace the model
Perform greedy sampling
Benchmark sampling
This Jupyter Notebook can be run on a Trn1 instance (trn1.32xlarge).
Set up the Jupyter Notebook#
The following steps set up Jupyter Notebook and launch this tutorial:
Clone the Neuronx-Distributed repo to your instance using
git clone https://github.com/aws-neuron/neuronx-distributed.git
Navigate to the examples/inference samples folder:
cd neuronx-distributed/examples/inference/
Copy the tutorial notebook codellama_16k_inference.ipynb to the examples/inference/ directory:
wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/codellama_16k_inference.ipynb
You might need to set
ulimit -n 65536
depending on your OS configuration.
Follow the instructions in the Jupyter Notebook QuickStart to run Jupyter Notebook on your instance.
Install Dependencies#
This tutorial requires the following pip packages:
torch-neuronx
neuronx-cc
sentencepiece
transformers
neuronx-distributed
You can install neuronx-distributed
using the setup guide. Most of the other packages are installed when you configure your environment using the torch-neuronx inference setup guide. The remaining dependencies must be installed here:
[ ]:
! pip install transformers==4.40 sentencepiece
Download the model#
To download the model weights and tokenizer, follow the instructions in codellama/CodeLlama-13b-hf.
For the purposes of this sample, we assume you have saved the CodeLlama-13b model in a directory called models/CodeLlama-13b-hf.
[ ]:
model_path = "/home/ubuntu/models/CodeLlama-13b-hf"
traced_model_path = "/home/ubuntu/models/CodeLlama-13b-hf-trace"
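One way to fetch the weights programmatically is with huggingface_hub. This is a minimal sketch, not part of the original steps; it assumes huggingface_hub is installed and that you have accepted the model license and authenticated with Hugging Face if required.

from huggingface_hub import snapshot_download

# Download the full model repository (weights and tokenizer) into model_path.
# Assumes huggingface_hub is installed and you have access to the model.
snapshot_download(repo_id="codellama/CodeLlama-13b-hf", local_dir=model_path)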
Trace and load the model#
Now we can trace the model using the LlamaRunner script. This saves the model to traced_model_path. After tracing, the model can be loaded.
In this sample, we use a tensor parallelism degree of 32 to optimize performance on trn1.32xlarge.
[ ]:
from llama2.llama2_runner import LlamaRunner
# select maximum prompt length and total sequence length
max_prompt_length = 12288
sequence_length = 16384
batch_size = 1
tp_degree = 32
runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             max_prompt_length=max_prompt_length,
             sequence_length=sequence_length,
             on_device_sampling=True)

neuron_model = runner.load_neuron_model(traced_model_path)
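Here, tp_degree=32 matches the 32 NeuronCores available on a trn1.32xlarge, so the model weights are sharded across all cores. Setting on_device_sampling=True performs token sampling on the Neuron device rather than returning logits to the host for every generated token.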
Inference#
Now let's use the model to perform autoregressive sampling.
[ ]:
import requests
# load the prompt text (a library source file) and append an instruction to write a function
prompt = requests.get("https://raw.githubusercontent.com/kedartatwawadi/stanford_compression_library/e2fca703ac812331a277644ecc4ae5cfef160ab3/scl/compressors/lz77_sliding_window.py").text
prompt += "\n\n# Function to load binary data from user-provided file and compress it with LZ77 and write output to file\n" # over 11k tokens
prompt = [prompt] # batch size 1
generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)
for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")