Tutorial: Evaluating Accuracy of Llama-3.1-70B on Neuron using open source datasets#
This tutorial provides a step-by-step guide to measuring the accuracy of Llama-3.1-70B on Trn1, with evaluation on two distinct tasks: mathematical reasoning and logical analysis.
For this tutorial we use two datasets available in lm-eval, namely gsm8k_cot
(grade school math word problems) and mmlu_flan_n_shot_generative_logical_fallacies
(multiple choice questions on logical fallacies), to demonstrate accuracy evaluation on Trn1. The metrics in these tasks are two variants of the ExactMatch metric, called StrictMatch and FlexibleExtract, which differ in how strictly the final answer is extracted from the model's generated output. To see the exact task definitions used in lm-eval, please refer to the gsm8k-cot and mmlu templates.
We also need the instruction-tuned version of Llama-3.1-70B, meta-llama/Llama-3.1-70B-Instruct, available on Hugging Face.
Task Overview#
1. GSM8K with Chain-of-Thought (gsm8k_cot)#
The GSM8K dataset focuses on grade school math word problems, testing LLMs’ mathematical reasoning capabilities. Using Chain-of-Thought (CoT) prompting (illustrated in the sketch after this list), we evaluate models’ ability to:
Solve complex math word problems
Show step-by-step reasoning
Arrive at accurate numerical answers
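For illustration, a CoT-style prompt asks the model to reason step by step before stating a final answer. The snippet below is a simplified example of such a question/response pair; the real gsm8k_cot template in lm-eval uses its own few-shot examples and formatting.
[ ]:
# Simplified illustration of a Chain-of-Thought prompt and an ideal response;
# not the exact gsm8k_cot template used by lm-eval.
prompt = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?\nA: Let's think step by step."
)
ideal_response = (
    "In May she sold 48 / 2 = 24 clips. "
    "Altogether she sold 48 + 24 = 72 clips. The answer is 72."
)
print(prompt)
print(ideal_response)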
2. MMLU Logical Fallacies (mmlu_flan_n_shot_generative_logical_fallacies)#
This evaluation, a subset of the MMLU benchmark, focuses on the model’s ability to identify and explain logical fallacies. The task tests:
Understanding of common logical fallacies
Ability to analyze arguments
Explanation of reasoning flaws
Environment Setup Guide#
Prerequisites#
This tutorial requires a Trn1 instance created from a Deep Learning AMI with the Neuron SDK pre-installed. It also depends on the AWS Neuron fork of vLLM, as described in the vLLM User Guide for NxD Inference.
To use Jupyter Notebook on the Neuron instance, follow this guide.
Before running evaluations, ensure your environment is properly configured by following these essential setup guides:
- Configure the AWS Neuron environment
- Set up required dependencies
- Verify system requirements
- Set up vLLM according to the vLLM User Guide for NxD Inference
Installing dependencies#
Copy the inference-benchmarking directory to a location on your instance and change directory into your copy of inference-benchmarking. Then install the required dependencies in the same Python environment (e.g. aws_neuron_venv_pytorch if you followed the manual NxD Inference installation):
git clone --depth 1 https://github.com/aws-neuron/aws-neuron-samples.git
cd aws-neuron-samples/inference-benchmarking
pip install -r requirements.txt
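Optionally, you can sanity-check the active environment before continuing. This is a minimal sketch; it assumes lm_eval is installed via requirements.txt and that torch_neuronx is present in the Neuron virtual environment.
[ ]:
# Minimal environment sanity check (assumes lm_eval comes from requirements.txt
# and torch_neuronx ships with the Neuron SDK Python environment).
import importlib.util

for pkg in ("torch_neuronx", "lm_eval"):
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")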
Download Llama-3.1-70B#
To use this sample, you must first download the meta-llama/Llama-3.1-70B-Instruct model checkpoint from Hugging Face and store it locally. In this tutorial the checkpoint is saved at /home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf
on the Trn1 instance, matching the model_path used in the configuration below. For more information, see Downloading models in the Hugging Face documentation.
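One way to do this is with the huggingface_hub Python client, sketched below. The local directory is an assumption and should match the model_path used in the configuration later in this tutorial; access to the gated meta-llama repository also requires that you have accepted the license and authenticated with Hugging Face.
[ ]:
# Illustrative download sketch using huggingface_hub. The local_dir is an
# assumption and should match the model_path in your server configuration.
# meta-llama models are gated: accept the license on Hugging Face and
# authenticate (e.g. `huggingface-cli login`) before downloading.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",
    local_dir="/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf",
)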
Running Evaluations#
There are two ways to use the evaluation scripts to run your evaluation:
1. Using a yaml configuration file and the accuracy.py script
2. Writing your own python script that uses several components provided in accuracy.py and server_config.py
We demonstrate each use case separately here.
1. Running eval with yaml config file#
In this method, all you need to do is create a yaml config file that specifies the server configuration and the testing scenario you want to run. Create config.yaml
with the following content.
[ ]:
%%writefile config.yaml
server:
  name: "Llama-3.1-70B-Instruct"
  model_path: "/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_model_hf/llama-3.1-70B-Instruct-hf"
  max_seq_len: 16384
  context_encoding_len: 16384
  tp_degree: 32
  n_vllm_threads: 32
  server_port: 8000
  continuous_batch_size: 1

test:
  accuracy:
    mytest:
      client: "lm_eval"
      datasets: ["gsm8k_cot", "mmlu_flan_n_shot_generative_logical_fallacies"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 200
        use_chat: True
For tasks that require a higher sequence length, you need to adjust max_seq_len
. For the tasks in this tutorial, 16384 is sufficient.
Run python accuracy.py --config config.yaml
[ ]:
%%bash
python accuracy.py --config config.yaml 2>&1 | tee accuracy_evaluation.log
2. Running eval through your own python code#
You might be interested in running the evaluation from your own python code, for instance if you want to change the configuration programmatically or post-process the results. This is possible using three main components provided in accuracy.py
and server_config.py
.
Server Configuration: Using ServerConfig to define the vLLM server settings
Accuracy Scenario: Using AccuracyScenario to specify evaluation parameters
Test Execution: Running the evaluation with the configured settings
First, import the necessary components:
[ ]:
from accuracy import AccuracyScenario, run_accuracy_test
from server_config import ServerConfig
1. Configure the Server#
Set up your server configuration with ServerConfig. This example uses Llama-3.1-70B-Instruct:
[ ]:
# Configure the server settings
name = "Llama-3.1-70B-Instruct"
server_config = ServerConfig(
    name=name,
    model_path="/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf",  # Local model path
    model_s3_path=None,  # S3 model path (not used)
    compiled_model_path="/home/ubuntu/traced_model_hf/llama-3.1-70B-Instruct-hf",  # Compiled model path
    max_seq_len=16384,  # Maximum sequence length
    context_encoding_len=16384,  # Context encoding length
    tp_degree=32,  # Tensor parallel degree for Trn1
    n_vllm_threads=32,  # Number of vLLM threads
    server_port=8000,  # Server port
    continuous_batch_size=1,  # Batch size for continuous batching
)
2. Define Evaluation Scenarios#
Create an AccuracyScenario to specify your evaluation parameters:
[ ]:
scenario = AccuracyScenario(
    client="lm_eval",  # Evaluation client
    datasets=[  # Target datasets
        "gsm8k_cot",
        "mmlu_flan_n_shot_generative_logical_fallacies",
    ],
    max_concurrent_requests=1,  # Maximum concurrent requests
    timeout=5000,  # Timeout in seconds
    client_params={"limit": 200},  # Client-specific parameters
)
3. Run the Evaluation#
Execute the evaluation using run_accuracy_test:
[ ]:
# Run the test with a named scenario
results_collection = run_accuracy_test(
    server_config=server_config,
    named_scenarios={"mytest": scenario},
)
# Display results
print(results_collection)
This code will execute the evaluation on the specified datasets and return detailed performance metrics. The results include accuracy scores and other relevant metrics for each dataset.
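If you want to post-process the results programmatically, as mentioned earlier, you can start from the returned collection. The sketch below is illustrative only and makes no assumption about the exact structure of the object returned by run_accuracy_test; adapt the access pattern to what you actually receive.
[ ]:
# Illustrative post-processing sketch: pretty-print whatever comes back and,
# if desired, persist a readable copy for later comparison. The exact structure
# of results_collection is not assumed here.
import pprint

pprint.pprint(results_collection)

with open("accuracy_results.txt", "w") as f:
    f.write(pprint.pformat(results_collection))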