CodeLlama-13b Inference with 16k sequence length#

In this example, we compile and deploy the Hugging Face codellama/CodeLlama-13b-hf model for tensor-parallel inference on Neuron using the Neuronx-Distributed package, with a 16k sequence length.

The example has the following main sections:

  1. Set up the Jupyter Notebook

  2. Install dependencies

  3. Download the model

  4. Trace the model

  5. Perform greedy sampling

  6. Benchmark sampling

This Jupyter Notebook can be run on a Trn1 instance (trn1.32xlarge).

Set up the Jupyter Notebook#

The following steps set up Jupyter Notebook and launch this tutorial:

  1. Clone the Neuronx-Distributed repo to your instance:

git clone https://github.com/aws-neuron/neuronx-distributed.git

  2. Navigate to the examples/inference samples folder:

cd neuronx-distributed/examples/inference/

  3. Copy the tutorial notebook codellama_16k_inference.ipynb to the examples/inference/ directory:

wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/neuronx_distributed/llama/codellama_16k_inference.ipynb

  4. You might need to set ulimit -n 65536 depending on your OS configuration (see the optional check sketched after this list).

  5. Follow the instructions in Jupyter Notebook QuickStart to run Jupyter Notebook on your instance.
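
Once the notebook is running, you can optionally check, and within the hard limit raise, the open-file limit from Python. This is a minimal sketch and is not part of the original tutorial:

import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Raise the soft limit toward 65536 if the hard limit allows it.
if soft < 65536:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))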

Install Dependencies#

This tutorial requires the following pip packages:

  • torch-neuronx

  • neuronx-cc

  • sentencepiece

  • transformers

  • neuronx-distributed

You can install neuronx-distributed using the setup guide. Most of the other packages are installed when you configure your environment using the torch-neuronx inference setup guide. The remaining dependencies can be installed here:

[ ]:
! pip install transformers==4.40 sentencepiece
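
The Neuron packages (torch-neuronx, neuronx-cc, neuronx-distributed) should already be present from the setup guides above. As an optional sanity check, not part of the original notebook, you can list the installed Neuron packages:

[ ]:
! pip list | grep -i neuron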

Download the model#

To download the model weights and tokenizer, follow the instructions on the codellama/CodeLlama-13b-hf Hugging Face model page.

For the purposes of this sample we assume you have saved the CodeLlama-13b model in a directory called models/CodeLlama-13b-hf.
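
If you prefer to fetch the weights programmatically, a minimal sketch using huggingface_hub is shown below (it assumes the huggingface_hub package is available and is not part of the original tutorial). The local_dir matches the model_path used in the next cell:

[ ]:
from huggingface_hub import snapshot_download

# Download the CodeLlama-13b weights and tokenizer files to the local
# directory assumed by the rest of this tutorial.
snapshot_download(
    repo_id="codellama/CodeLlama-13b-hf",
    local_dir="/home/ubuntu/models/CodeLlama-13b-hf",
)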

[ ]:
model_path = "/home/ubuntu/models/CodeLlama-13b-hf"
traced_model_path = "/home/ubuntu/models/CodeLlama-13b-hf-trace"

Trace and load the model#

Now we can trace the model using the LlamaRunner script, which saves the compiled model to traced_model_path. After tracing, the model can be loaded.

In this sample, we use a tensor parallelism degree of 32 to optimize performance on trn1.32xlarge.

[ ]:
from llama2.llama2_runner import LlamaRunner

# select maximum prompt length and total sequence length
max_prompt_length = 12288
sequence_length = 16384
batch_size = 1
tp_degree = 32

runner = LlamaRunner(model_path=model_path,
                     tokenizer_path=model_path)

runner.trace(traced_model_path=traced_model_path,
             tp_degree=tp_degree,
             batch_size=batch_size,
             max_prompt_length=max_prompt_length,
             sequence_length=sequence_length,
             on_device_sampling=True)

neuron_model = runner.load_neuron_model(traced_model_path)

Inference#

Now let's use the model to perform autoregressive sampling.

[ ]:
import requests
# build a long prompt: download a Python source file (an LZ77 compressor) and append an instruction to write a function
prompt = requests.get("https://raw.githubusercontent.com/kedartatwawadi/stanford_compression_library/e2fca703ac812331a277644ecc4ae5cfef160ab3/scl/compressors/lz77_sliding_window.py").text
prompt += "\n\n# Function to load binary data from user-provided file and compress it with LZ77 and write output to file\n" # over 11k tokens

prompt = [prompt] # batch size 1
generate_ids, outputs = runner.generate_on_neuron(prompt, neuron_model)

for idx, output in enumerate(outputs):
    print(f"output {idx}: {output}")