Tutorial: Evaluating Accuracy of Llama-3.1-70B on Neuron using open source datasets#
This tutorial provides a step-by-step guide to measuring the accuracy of Llama-3.1-70B on Trn1, with evaluation on two distinct tasks: mathematical reasoning and logical analysis.
For this tutorial we use two datasets available in lm-eval, namely gsm8k_cot
(grade school math word problems) and mmlu_flan_n_shot_generative_logical_fallacies
(multiple choice questions on logical fallacies), to demonstrate accuracy evaluation on Trn1. The metrics in these tasks are two variants of the ExactMatch metric, called StrictMatch and FlexibleExtract, which differ in how strictly the final answer is extracted from the model's generated output. To see the exact task definitions used in lm-eval, please refer to the gsm8k-cot and mmlu templates.
We also need the instruction-tuned version of Llama-3.1-70B, meta-llama/Llama-3.1-70B-Instruct, available on Hugging Face.
Task Overview#
1. GSM8K with Chain-of-Thought (gsm8k_cot)#
The GSM8K dataset focuses on grade school math word problems, testing LLMs’ mathematical reasoning capabilities. Using Chain-of-Thought (CoT) prompting (illustrated in the sketch after this list), we evaluate models’ ability to:
Solve complex math word problems
Show step-by-step reasoning
Arrive at accurate numerical answers
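For illustration, a CoT-style prompt asks the model to reason step by step before stating a final answer. The snippet below is a simplified example of such a question/response pair; the real gsm8k_cot template in lm-eval uses its own few-shot examples and formatting.
[ ]:
# Simplified illustration of a Chain-of-Thought prompt and an ideal response;
# not the exact gsm8k_cot template used by lm-eval.
prompt = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?\nA: Let's think step by step."
)
ideal_response = (
    "In May she sold 48 / 2 = 24 clips. "
    "Altogether she sold 48 + 24 = 72 clips. The answer is 72."
)
print(prompt)
print(ideal_response)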
2. MMLU Logical Fallacies (mmlu_flan_n_shot_generative_logical_fallacies)#
This evaluation, a subset of the MMLU benchmark, focuses on the model’s ability to identify and explain logical fallacies. The task tests:
Understanding of common logical fallacies
Ability to analyze arguments
Explanation of reasoning flaws
Environment Setup Guide#
Prerequisites#
This tutorial requires a Trn1 instance created from a Deep Learning AMI with the Neuron SDK pre-installed. It also depends on the AWS Neuron fork of vLLM, as described in the vLLM User Guide for NxD Inference.
To use Jupyter Notebook on the Neuron instance, follow this guide.
Before running evaluations, ensure your environment is properly configured by following these essential setup guides:
- Configure the AWS Neuron environment
- Set up required dependencies
- Verify system requirements
- Set up vLLM according to the vLLM User Guide for NxD Inference
Installing dependencies#
Copy the inference-benchmarking directory to a location on your instance and change directory into your copy of inference-benchmarking. Then install the required dependencies in the same Python environment (e.g. aws_neuron_venv_pytorch if you followed the manual NxD Inference installation):
git clone --depth 1 https://github.com/aws-neuron/aws-neuron-samples.git
cd aws-neuron-samples/inference-benchmarking
pip install -r requirements.txt
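Optionally, you can sanity-check the active environment before continuing. This is a minimal sketch; it assumes lm_eval is installed via requirements.txt and that torch_neuronx is present in the Neuron virtual environment.
[ ]:
# Minimal environment sanity check (assumes lm_eval comes from requirements.txt
# and torch_neuronx ships with the Neuron SDK Python environment).
import importlib.util

for pkg in ("torch_neuronx", "lm_eval"):
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")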
Download Llama-3.1-70B#
To use this sample, you must first download the meta-llama/Llama-3.1-70B-Instruct model checkpoint from Hugging Face and store it locally. In this tutorial the checkpoint is saved at /home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf
on the Trn1 instance, matching the model_path used in the configuration below. For more information, see Downloading models in the Hugging Face documentation.
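One way to do this is with the huggingface_hub Python client, sketched below. The local directory is an assumption and should match the model_path used in the configuration later in this tutorial; access to the gated meta-llama repository also requires that you have accepted the license and authenticated with Hugging Face.
[ ]:
# Illustrative download sketch using huggingface_hub. The local_dir is an
# assumption and should match the model_path in your server configuration.
# meta-llama models are gated: accept the license on Hugging Face and
# authenticate (e.g. `huggingface-cli login`) before downloading.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",
    local_dir="/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf",
)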
Running Evaluations#
There are two ways to use the evaluation scripts to run your evaluation:
1. Using a yaml configuration file and the accuracy.py script
2. Writing your own python script that uses several components provided in accuracy.py and server_config.py
We demonstrate each use case separately here.
1. Running eval with yaml config file#
In this method, all you need to do is create a yaml config file that specifies the server configuration and the testing scenario you want to run. Create config.yaml
with the following content.
[ ]:
%%writefile config.yaml
server:
  name: "Llama-3.1-70B-Instruct"
  model_path: "/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_model_hf/llama-3.1-70B-Instruct-hf"
  max_seq_len: 16384
  context_encoding_len: 16384
  tp_degree: 32
  n_vllm_threads: 32
  server_port: 8000
  continuous_batch_size: 1

test:
  accuracy:
    mytest:
      client: "lm_eval"
      datasets: ["gsm8k_cot", "mmlu_flan_n_shot_generative_logical_fallacies"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 200
        use_chat: True
For tasks that require a higher sequence length, you need to adjust max_seq_len
. For the tasks in this tutorial, 16384 is sufficient.
Run python accuracy.py --config config.yaml
[ ]:
%%bash
python accuracy.py --config config.yaml 2>&1 | tee accuracy_evaluation.log
2. Running eval through your own python code#
You might be interested in running the evaluation from your own python code, for instance if you want to change the configuration programmatically or post-process the results. This is possible using three main components provided in accuracy.py
and server_config.py
.
Server Configuration: Using ServerConfig to define the vLLM server settings
Accuracy Scenario: Using AccuracyScenario to specify evaluation parameters
Test Execution: Running the evaluation with the configured settings
First, import the necessary components:
[ ]:
from accuracy import AccuracyScenario, run_accuracy_test
from server_config import ServerConfig
1. Configure the Server#
Set up your server configuration with ServerConfig. This example uses Llama-3.1-70B-Instruct:
[ ]:
# Configure the server settings
name = "Llama-3.1-70B-Instruct"
server_config = ServerConfig(
    name=name,
    model_path="/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf",  # Local model path
    model_s3_path=None,  # S3 model path (not used)
    compiled_model_path="/home/ubuntu/traced_model_hf/llama-3.1-70B-Instruct-hf",  # Compiled model path
    max_seq_len=16384,  # Maximum sequence length
    context_encoding_len=16384,  # Context encoding length
    tp_degree=32,  # Tensor parallel degree for Trn1
    n_vllm_threads=32,  # Number of vLLM threads
    server_port=8000,  # Server port
    continuous_batch_size=1,  # Batch size for continuous batching
)
2. Define Evaluation Scenarios#
Create an AccuracyScenario to specify your evaluation parameters:
[ ]:
scenario = AccuracyScenario(
    client="lm_eval",  # Evaluation client
    datasets=[  # Target datasets
        "gsm8k_cot",
        "mmlu_flan_n_shot_generative_logical_fallacies",
    ],
    max_concurrent_requests=1,  # Maximum concurrent requests
    timeout=5000,  # Timeout in seconds
    client_params={"limit": 200},  # Client-specific parameters
)
3. Run the Evaluation#
Execute the evaluation using run_accuracy_test:
[ ]:
# Run the test with a named scenario
results_collection = run_accuracy_test(
    server_config=server_config,
    named_scenarios={"mytest": scenario},
)
# Display results
print(results_collection)
This code will execute the evaluation on the specified datasets and return detailed performance metrics. The results include accuracy scores and other relevant metrics for each dataset.
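If you want to post-process the results programmatically, as mentioned earlier, you can start from the returned collection. The sketch below is illustrative only and makes no assumption about the exact structure of the object returned by run_accuracy_test; adapt the access pattern to what you actually receive.
[ ]:
# Illustrative post-processing sketch: pretty-print whatever comes back and,
# if desired, persist a readable copy for later comparison. The exact structure
# of results_collection is not assumed here.
import pprint

pprint.pprint(results_collection)

with open("accuracy_results.txt", "w") as f:
    f.write(pprint.pformat(results_collection))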