Tutorial: Evaluating Performance of Llama-3.3-70B on Neuron using Performance CLI#
Introduction#
This tutorial provides a step-by-step guide to measuring the performance of Llama-3.3 70B on Trn1 with easy-to-reproduce benchmarks.
You will learn how the Llama-3.3-70B-Instruct model can be tested with llm-perf.
To complete the tutorial, you must have access to the instruction-tuned checkpoint meta-llama/Llama-3.3-70B-Instruct on Hugging Face.
Environment Setup Guide#
Prerequisites#
This tutorial requires a Trn1 instance created from a Deep Learning AMI with the Neuron SDK pre-installed. It also depends on the Neuron fork of vLLM.
Before running evaluations, ensure your environment is properly configured by following these essential setup guides:
Installing dependencies#
Copy the inference-benchmarking directory to some location on your instance, and change your current working directory to your copy of inference-benchmarking.
Install the other required dependencies in the same Python environment (such as aws_neuron_venv_pytorch, if you followed the steps in Manually install NxD Inference). You can verify the installed Neuron packages with:
[1]:
%%bash
pip list | grep neuron
apt list --installed | grep neuron
aws-neuron-llm-evaluation 1.0
awsneuroneval 1.0
libneuronxla 2.2.7366.0+1faf0ddf
neuron-torch-tools 1.0.0.33853+83b6bf63a
neuronx-cc 2.20.2831.0+8bfecb25
neuronx-cc-devel 2.20.2831.0+8bfecb25
neuronx-distributed 0.14.17095+c66a8ca6
neuronx-distributed-inference 0.5.0+dev
torch-neuronx 2.7.0.2.9.8707+08e1f40d
vllm-neuronx 0.9.0.dev0+neuron225
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
aws-neuronx-collectives/now 2.27.13.0-f3bd841a2 amd64 [installed,local]
aws-neuronx-dkms/now 2.23.0.0 all [installed,local]
aws-neuronx-runtime-lib/now 2.27.7.0-765d5f599 amd64 [installed,local]
aws-neuronx-tools/now 2.25.100.0 amd64 [installed,local]
You should see Neuron packages including neuronx-distributed-inference
and its related components.
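If you prefer to check programmatically, the following is a minimal sketch using importlib.metadata; the package names are taken from the pip listing above and may vary across Neuron SDK releases:
[ ]:
# Minimal sketch: confirm the key Neuron Python packages are installed.
# Package names are based on the pip listing above and may differ between releases.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["neuronx-cc", "neuronx-distributed", "neuronx-distributed-inference", "torch-neuronx"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT FOUND - install it before continuing")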
[ ]:
Download Llama-3.3 70B#
To use this sample, you must first download the meta-llama/Llama-3.3-70B-Instruct model checkpoint from Hugging Face to /home/ubuntu/models/Llama-3.3-70B-Instruct/ on the Trn1
instance. For more information, see Downloading models in the Hugging Face documentation.
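One way to fetch the checkpoint is with the huggingface_hub Python client. The sketch below assumes you have accepted the meta-llama license on Hugging Face and that a valid access token is available (for example via huggingface-cli login or the HF_TOKEN environment variable):
[ ]:
# Minimal sketch: download the checkpoint to the local path used in this tutorial.
# Assumes the meta-llama license has been accepted and a Hugging Face token is configured.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.3-70B-Instruct",
    local_dir="/home/ubuntu/models/Llama-3.3-70B-Instruct/",
)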
To use a Jupyter Notebook on the Neuron instance, follow this guide.
Running Evaluations#
There are two methods that you can use to run your evaluation:
1. Use a YAML configuration file and the performance.py script.
2. Write your own Python script that uses several components provided in performance.py and server_config.py.
Each use case is demonstrated below:
1. Running performance with a YAML config file#
In this method, you create a YAML (.yaml) config file that specifies the server configuration and testing scenario you want to run. Create perf.yaml with the following content.
[ ]:
cd inference-benchmarking/
[4]:
# Install requirements present in the inference-benchmarking package
#!pip install -r requirements.txt
perf.yaml
server:
  name: "llama-3.3-70b-instruct"
  model_path: "/home/ubuntu/models/llama-3.3-70b-instruct"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_models/llama-3.3-70b-instruct"
  max_seq_len: 256
  context_encoding_len: 128
  tp_degree: 32
  server_port: 8000
  continuous_batch_size: 1
  custom_chat_template_path: "default"

test:
  performance:
    sonnets_small_test:
      client: "llm_perf"
      client_type: "llm_perf_github_patched"
      n_batches: 1
      max_concurrent_requests: 20
      timeout: 3600
      input_size: 128
      output_size: 124
      client_params:
        stddev_input_tokens: 0
        stddev_output_tokens: 1
The above YAML file is explained in more detail in the Performance Params guide.
To change the sequence length, adjust max_seq_len.
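For example, if you want to sweep over sequence lengths without editing the file by hand, a minimal sketch using PyYAML (assumed to be available in your environment) can rewrite the config before each run:
[ ]:
# Minimal sketch: programmatically adjust max_seq_len in perf.yaml before a run.
# Assumes PyYAML is installed and perf.yaml follows the structure shown above.
import yaml

with open("perf.yaml") as f:
    config = yaml.safe_load(f)

config["server"]["max_seq_len"] = 512           # new maximum sequence length
config["server"]["context_encoding_len"] = 256  # illustrative; adjust to your use case

with open("perf.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)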
Run python performance.py --config perf.yaml
[ ]:
!python performance.py --config perf.yaml
2. Running perf as part of your own Python code#
You might want to run the performance script as part of your own Python code. For example, you might want to change the configuration programmatically or post-process the results. This is possible using three main components provided in performance.py and server_config.py:
Server Configuration: Use ServerConfig to define the vLLM server settings
Performance Scenario: Use PerformanceScenario to specify evaluation parameters
Test Execution: Run the performance with the configured settings
Step-by-Step Implementation#
First, import the necessary components:
[ ]:
cd "/home/ubuntu/inference-benchmarking"
[59]:
from performance import PerformanceScenario, run_perf_test
from server_config import ServerConfig
1. Configure the Server#
Set up your server configuration with ServerConfig. This example uses Llama-3.3-70B-Instruct:
[60]:
name = "llama-3.3-70b-instruct"
server_config = ServerConfig(
name=name,
model_path=f"/home/ubuntu/models/{name}", # Local model path
model_s3_path=None, # S3 model path
max_seq_len=256, # Maximum sequence length
context_encoding_len=128, # Context encoding length
tp_degree=32, # Tensor parallel degree
n_vllm_threads=1, # Number of vLLM threads
server_port=8000, # Server port
continuous_batch_size=1, # Batch size for continuous batching
custom_chat_template_path="default" # Chat template
)
2. Define Performance Scenarios#
Create a PerformanceScenario to specify your perf parameters:
[61]:
scenario = PerformanceScenario(
client="llm_perf", # Evaluation client
client_type="llm_perf_github_patched",
n_batches=1,
max_concurrent_requests=20, # Maximum concurrent requests
timeout=5000, # Timeout in seconds (increased from the 3600 used in the YAML example)
input_size=128, # Input length
output_size=124, # Output length
client_params={"stddev_input_tokens": 0, "stddev_output_tokens": 1} # Client-specific parameters
)
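Optionally, you can define additional scenarios to compare settings side by side. The sketch below is illustrative and assumes that run_perf_test (used in the next step) accepts multiple entries in its named_scenarios dictionary, for example named_scenarios={"mytest": scenario, "short_seq": short_seq_scenario}:
[ ]:
# Illustrative sketch (optional): a second, shorter-sequence scenario for comparison.
# The values stay within the max_seq_len of 256 configured in the server above.
short_seq_scenario = PerformanceScenario(
    client="llm_perf",
    client_type="llm_perf_github_patched",
    n_batches=1,
    max_concurrent_requests=20,
    timeout=5000,
    input_size=64,   # shorter prompt than the first scenario
    output_size=60,  # shorter generation than the first scenario
    client_params={"stddev_input_tokens": 0, "stddev_output_tokens": 1},
)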
3. Run the Evaluation#
Execute the evaluation using run_perf_test:
[ ]:
# Run the test with a named scenario
results_collection = run_perf_test(
server_config=server_config,
named_scenarios={"mytest": scenario}
)
[ ]:
from pprint import pprint
# Display results
pprint(results_collection)
This code will execute and return detailed performance metrics for the model.
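If you want to archive the metrics for later comparison, a minimal sketch (assuming the returned collection is a plain Python container; anything not natively JSON-serializable is stringified) is to write them to disk:
[ ]:
# Minimal sketch: persist the returned metrics to a JSON file.
# default=str stringifies any values that are not natively JSON-serializable.
import json

with open("perf_results.json", "w") as f:
    json.dump(results_collection, f, indent=2, default=str)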