Tutorial: Evaluating Performance of Llama-3.3-70B on Neuron using Performance CLI#

Introduction#

This tutorial provides a step-by-step guide to measuring the performance of Llama-3.3 70B on Trn1 with easy-to-reproduce benchmarks.

In this tutorial you will learn how to benchmark the Llama-3.3-70B-Instruct model with llm-perf.

To complete it successfully, you must have access to the instruction-tuned checkpoint meta-llama/Llama-3.3-70B-Instruct on Hugging Face.

Environment Setup Guide#

Prerequisites#

This tutorial requires a Trn1 instance created from a Deep Learning AMI with the Neuron SDK pre-installed. It also depends on the Neuron fork of vLLM.

Before running evaluations, ensure your environment is properly configured by following these essential setup guides:

  1. NxD Inference Setup Guide

  2. vLLM User Guide for NxD Inference
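
Once you have completed these guides, you can optionally confirm which Python environment the notebook kernel is using. This is a quick sanity check, not a required step:

[ ]:
import sys

# Optional check: the interpreter path should point at the Neuron virtual environment
# (for example, /home/ubuntu/aws_neuron_venv).
print(sys.executable)
print(sys.version)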

Installing dependencies#

[1]:
%%bash
pip list | grep neuron
apt list --installed | grep neuron
aws-neuron-llm-evaluation                1.0
awsneuroneval                            1.0
libneuronxla                             2.2.7366.0+1faf0ddf
neuron-torch-tools                       1.0.0.33853+83b6bf63a
neuronx-cc                               2.20.2831.0+8bfecb25
neuronx-cc-devel                         2.20.2831.0+8bfecb25
neuronx-distributed                      0.14.17095+c66a8ca6
neuronx-distributed-inference            0.5.0+dev
torch-neuronx                            2.7.0.2.9.8707+08e1f40d
vllm-neuronx                             0.9.0.dev0+neuron225

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/now 2.27.13.0-f3bd841a2 amd64 [installed,local]
aws-neuronx-dkms/now 2.23.0.0 all [installed,local]
aws-neuronx-runtime-lib/now 2.27.7.0-765d5f599 amd64 [installed,local]
aws-neuronx-tools/now 2.25.100.0 amd64 [installed,local]

You should see Neuron packages including neuronx-distributed-inference and its related components.
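
As an additional optional check, you can verify that the key packages import cleanly from the active environment. This is a minimal sketch that assumes the usual package-to-module name mapping:

[ ]:
import importlib.util

# Optional check: verify the key Neuron packages are importable.
# Module names are assumed to follow the usual package-to-module mapping.
for module in ("neuronx_distributed_inference", "torch_neuronx", "vllm"):
    status = "found" if importlib.util.find_spec(module) else "MISSING"
    print(f"{module}: {status}")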

Download llama-3.3 70B#

To use this sample, you must first download the meta-llama/Llama-3.3-70B-Instruct model checkpoint from Hugging Face to /home/ubuntu/models/Llama-3.3-70B-Instruct/ on the Trn1 instance. For more information, see Downloading models in the Hugging Face documentation.
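
If you prefer to script the download, the sketch below uses the huggingface_hub library and targets the path used later in this tutorial. It assumes you have accepted the model license on Hugging Face and authenticated (for example with huggingface-cli login):

[ ]:
from huggingface_hub import snapshot_download

# Sketch only: download the checkpoint to the directory expected by this tutorial.
# Requires acceptance of the model license and a valid Hugging Face token.
snapshot_download(
    repo_id="meta-llama/Llama-3.3-70B-Instruct",
    local_dir="/home/ubuntu/models/Llama-3.3-70B-Instruct/",
)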

To use a Jupyter Notebook on the Neuron instance, follow this guide.

Running Evaluations#

There are two methods that you can use to run your evaluation.

  1. Use a YAML configuration file and the performance.py script

  2. Write your own Python script that uses several components provided in performance.py and server_config.py

Each use case is demonstrated below:

1. Running performance with a YAML config file#

In this method, you create a YAML (.yaml) config file that specifies the server configuration and the testing scenario you want to run. Create perf.yaml with the following content.

[ ]:
cd inference-benchmarking/
[4]:
# Install requirements for the inference-benchmarking package
# !pip install -r requirements.txt

perf.yaml

server:
  name: "llama-3.3-70b-instruct"
  model_path: "/home/ubuntu/models/llama-3.3-70b-instruct"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_models/llama-3.3-70b-instruct"
  max_seq_len: 256
  context_encoding_len: 128
  tp_degree: 32
  server_port: 8000
  continuous_batch_size: 1
  custom_chat_template_path: "default"

test:
  performance:
    sonnets_small_test:
      client: "llm_perf"
      client_type: "llm_perf_github_patched"
      n_batches: 1
      max_concurrent_requests: 20
      timeout: 3600
      input_size: 128
      output_size: 124
      client_params:
        stddev_input_tokens: 0
        stddev_output_tokens: 1
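
Before launching the benchmark, you can optionally confirm that the file parses and contains the expected sections. This is a minimal sketch using PyYAML, which is assumed to be available in the environment (it is not otherwise required by this tutorial):

[ ]:
import yaml

# Optional check: confirm perf.yaml parses and contains the expected sections.
with open("perf.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["server"]["name"])
print(list(cfg["test"]["performance"].keys()))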

The above YAML file is explained in more detail in the Performance Params guide.

To change the sequence length, adjust max_seq_len. If you also change input_size or output_size in the test scenario, make sure their sum still fits within max_seq_len (here, 128 + 124 = 252 <= 256).

Run python performance.py --config perf.yaml:

[ ]:
!python performance.py --config perf.yaml

2. Running perf as part of your own Python code#

You might want to run the performance script as part of your own Python code, for example to change the configuration programmatically or to post-process the results. This is possible using three main components provided in performance.py and server_config.py:

  1. Server Configuration: Use ServerConfig to define the vLLM server settings

  2. Performance Scenario: Use PerformanceScenario to specify evaluation parameters

  3. Test Execution: Run the performance test with the configured settings

Step-by-Step Implementation#

First, import the necessary components:

[ ]:
cd "/home/ubuntu/inference-benchmarking"

[59]:
from performance import PerformanceScenario, run_perf_test
from server_config import ServerConfig

1. Configure the Server#

Set up your server configuration with ServerConfig. This example uses Llama-3.3-70B-Instruct:

[60]:
name = "llama-3.3-70b-instruct"
server_config = ServerConfig(
    name=name,
    model_path=f"/home/ubuntu/models/{name}",  # Local model path
    model_s3_path=None,  # S3 model path
    max_seq_len=256,          # Maximum sequence length
    context_encoding_len=128,  # Context window size
    tp_degree=32,               # Tensor parallel degree
    n_vllm_threads=1,          # Number of vLLM threads
    server_port=8000,           # Server port
    continuous_batch_size=1,    # Batch size for continuous batching
    custom_chat_template_path="default" # Chat template
)

2. Define Performance Scenarios#

Create a PerformanceScenario to specify your performance parameters:

[61]:
scenario = PerformanceScenario(
    client="llm_perf",                      # Evaluation client
    client_type="llm_perf_github_patched",
    n_batches=1,
    max_concurrent_requests=20,             # Maximum concurrent requests
    timeout=5000,                           # Timeout in seconds (raised from the 3600 used in perf.yaml)
    input_size=128,                         # Input length
    output_size=124,                        # Output length
    client_params={"stddev_input_tokens": 0, "stddev_output_tokens": 1},  # Client-specific parameters
)

3. Run the Evaluation#

Execute the evaluation using run_perf_test:

[ ]:
# Run the test with a named scenario
results_collection = run_perf_test(
    server_config=server_config,
    named_scenarios={"mytest": scenario}
)
[ ]:
from pprint import pprint
# Display results
pprint(results_collection)

Running this code executes the benchmark and returns detailed performance metrics for the model.
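
To keep a record of a run, you can persist the returned results to disk. The exact structure of results_collection depends on the inference-benchmarking package version, so the sketch below serializes defensively by falling back to str() for anything json cannot handle natively:

[ ]:
import json

# Sketch only: save the results for later comparison across runs.
# `default=str` is a defensive fallback for objects json cannot serialize natively.
with open("perf_results.json", "w") as f:
    json.dump(results_collection, f, indent=2, default=str)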