# Tutorial: Evaluating Accuracy of Llama-3.1-70B on Neuron using open source datasets

This tutorial provides a step-by-step guide to measuring the accuracy of Llama-3.1-70B on Trn1, evaluated on two distinct tasks: mathematical reasoning and logical analysis.

For this tutorial we use two datasets available in lm-eval, namely `gsm8k_cot` (grade school math word problems) and `mmlu_flan_n_shot_generative_logical_fallacies` (multiple-choice questions on logical fallacies), to demonstrate accuracy evaluation on Trn1. The metrics in these tasks are two variants of the [ExactMatch](https://huggingface.co/spaces/evaluate-metric/exact_match) metric, called StrictMatch and FlexibleExtract, which differ in how strictly they extract the final answer from the model's generated output. For the exact task definitions used in lm-eval, see [gsm8k-cot](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/gsm8k-cot.yaml) and the [mmlu template](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml).
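To make the difference between the two metric variants concrete, here is a minimal sketch of strict versus flexible answer extraction. The regular expressions and function names are simplified assumptions for illustration only; the filters lm-eval actually applies are defined in the task YAML files linked above.

```python
import re

# Illustrative patterns only (assumptions), not lm-eval's exact filter definitions.
# Strict: the answer must appear in the fixed phrase "The answer is <number>."
STRICT_PATTERN = re.compile(r"The answer is (-?[0-9.,]+)\.")
# Flexible: accept any number in the text and keep the last one.
FLEXIBLE_PATTERN = re.compile(r"-?[0-9]+(?:[.,][0-9]+)*")


def strict_match(generation):
    """Return the number from the canonical answer phrase, or None if absent."""
    match = STRICT_PATTERN.search(generation)
    return match.group(1) if match else None


def flexible_extract(generation):
    """Return the last number found anywhere in the generation, or None."""
    numbers = FLEXIBLE_PATTERN.findall(generation)
    return numbers[-1] if numbers else None


well_formed = "She earns 9 * 2 = 18 dollars. The answer is 18."
free_form = "So she makes 18 dollars every day"

print(strict_match(well_formed), flexible_extract(well_formed))
print(strict_match(free_form), flexible_extract(free_form))
```

A model that reasons correctly but ignores the expected answer format scores under the flexible variant and not under the strict one, which is why the two numbers can differ for the same run.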

We also need the instruction-tuned version of Llama-3.1-70B, [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), which is available on Hugging Face.

## Task Overview

### 1. GSM8K with Chain-of-Thought (gsm8k_cot)

The GSM8K dataset focuses on grade school math word problems, testing LLMs’ mathematical reasoning capabilities. Using Chain-of-Thought (CoT) prompting, we evaluate models’ ability to:

- Solve complex math word problems

- Show step-by-step reasoning

- Arrive at accurate numerical answers

### 2. MMLU Logical Fallacies (mmlu_flan_n_shot_generative_logical_fallacies)

This evaluation focuses on the model’s ability to identify and explain logical fallacies, a subset of the MMLU benchmark. The task tests:

- Understanding of common logical fallacies

- Ability to analyze arguments

- Explanation of reasoning flaws

## Environment Setup Guide

### Prerequisites

This tutorial requires a Trn1 instance created from a Deep Learning AMI that has the Neuron SDK pre-installed. It also depends on our fork of vLLM, as described in the [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html#nxdi-vllm-user-guide).

To use Jupyter Notebook on the Neuron instance, you can use this [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/notebook/setup-jupyter-notebook-steps-troubleshooting.html).

Before running evaluations, ensure your environment is properly configured by following these essential setup guides:

1. [NxD Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html)

    - Configure AWS Neuron environment

    - Set up required dependencies

    - Verify system requirements

2. [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html)

    - Setup vLLM according to the guide


### Installing dependencies

Clone the samples repository, which contains the [inference-benchmarking](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) directory, to a location on your instance:

```bash
git clone --depth 1 https://github.com/aws-neuron/aws-neuron-samples.git
```

Change to the `inference-benchmarking` directory inside your copy, then install the remaining dependencies into the same Python environment (e.g. `aws_neuron_venv_pytorch` if you followed the [manual NxD Inference install](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html#id3)):

```bash
cd aws-neuron-samples/inference-benchmarking
pip install -r requirements.txt
```

### Download Llama-3.1-70B
To use this sample, you must first download the [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model checkpoint from Hugging Face and store it locally, for example at `/home/ubuntu/models/Llama-3.1-70B-Instruct/` on the Trn1 instance. Whichever location you choose, make sure `model_path` in the configuration examples below points to the same directory. For more information, see [Downloading models](https://huggingface.co/docs/hub/en/models-downloading) in the Hugging Face documentation.
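One way to script the download is with the `huggingface_hub` package, sketched below. This assumes `huggingface_hub` is installed and that you have accepted the model license and authenticated with Hugging Face (the repository is gated). The helper names `download_checkpoint` and `checkpoint_looks_complete` are hypothetical and not part of the tutorial's scripts.

```python
from pathlib import Path

MODEL_ID = "meta-llama/Llama-3.1-70B-Instruct"
LOCAL_DIR = "/home/ubuntu/models/Llama-3.1-70B-Instruct/"


def download_checkpoint(repo_id=MODEL_ID, local_dir=LOCAL_DIR):
    """Download every file of the model repository into local_dir."""
    # Imported lazily so the check below works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def checkpoint_looks_complete(local_dir=LOCAL_DIR):
    """Rough sanity check: a config.json plus at least one .safetensors file."""
    root = Path(local_dir)
    return (root / "config.json").is_file() and any(root.glob("*.safetensors"))


# Typical use (commented out because the download is very large):
# download_checkpoint()
# print(checkpoint_looks_complete())
```

The completeness check is deliberately shallow. It only catches an empty or wrong directory, not a partially downloaded shard.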

## Running Evaluations
You can run your evaluation with [the evaluation scripts](https://github.com/aws-neuron/aws-neuron-samples/tree/master/inference-benchmarking/) in two ways:

1. Using a YAML configuration file and the `accuracy.py` script

2. Writing your own Python script that uses several components provided in `accuracy.py` and `server_config.py`

We demonstrate each use case separately here.


### 1. Running eval with yaml config file
With this method, you only need to create a YAML config file that specifies the server configuration and the test scenario you want to run. Create `config.yaml` with the following content.

```yaml
%%writefile config.yaml
server:
  name: "Llama-3.1-70B-Instruct"
  model_path: "/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf"
  model_s3_path: null
  compiled_model_path: "/home/ubuntu/traced_model_hf/llama-3.1-70B-Instruct-hf"
  max_seq_len: 16384
  context_encoding_len: 16384
  tp_degree: 32
  n_vllm_threads: 32
  server_port: 8000
  continuous_batch_size: 1

test:
  accuracy:
    mytest:
      client: "lm_eval"
      datasets: ["gsm8k_cot", "mmlu_flan_n_shot_generative_logical_fallacies"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 200
        use_chat: True
```


For tasks that require a longer sequence length, increase `max_seq_len`. For the tasks in this tutorial, 16384 suffices.
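Before launching a long run, it can help to sanity-check the `server` settings. The sketch below works on a plain dictionary that mirrors the `server` section of the config. Both checks are assumptions rather than rules taken from `accuracy.py`: that `context_encoding_len` should not exceed `max_seq_len`, and that the checkpoint must already exist locally when no S3 path is given. The function name `check_server_config` is hypothetical.

```python
import os


def check_server_config(server):
    """Return a list of human-readable problems found in a server config mapping."""
    problems = []
    # Assumption: the prompt-encoding length cannot exceed the total sequence length.
    if server["context_encoding_len"] > server["max_seq_len"]:
        problems.append("context_encoding_len exceeds max_seq_len")
    # Assumption: with no S3 source, the checkpoint must already be on local disk.
    if server.get("model_s3_path") is None and not os.path.isdir(server["model_path"]):
        problems.append(f"model_path not found: {server['model_path']}")
    return problems


server = {
    "model_path": "/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf",
    "model_s3_path": None,
    "max_seq_len": 16384,
    "context_encoding_len": 16384,
}
for problem in check_server_config(server):
    print("config problem:", problem)
```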

Run the evaluation with `python accuracy.py --config config.yaml`. The cell below also saves the output to `accuracy_evaluation.log`:

```bash
%%bash
python accuracy.py --config config.yaml 2>&1 | tee accuracy_evaluation.log
```
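If you want the scores in a machine-readable form, one option is to parse the saved log. The sketch below assumes the log contains lm-eval's pipe-delimited summary table with nine columns (task, version, filter, n-shot, metric, direction, value, a `±` separator, and stderr), and that repeated task names are left blank on continuation rows. Both are assumptions about the output format, so adjust the column indices if your log differs. The function name `parse_lm_eval_rows` is hypothetical.

```python
def parse_lm_eval_rows(log_text):
    """Extract (task, filter, metric, value) records from a pipe-delimited table."""
    rows = []
    current_task = ""
    for line in log_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        if len(cells) != 9:  # assumed column count of the summary table
            continue
        try:
            value = float(cells[6])
        except ValueError:
            continue  # header or separator row
        # Carry the task name forward when a continuation row leaves it blank.
        current_task = cells[0] or current_task
        rows.append(
            {"task": current_task, "filter": cells[2], "metric": cells[4], "value": value}
        )
    return rows


# Typical use with the log written by the command above:
# with open("accuracy_evaluation.log", encoding="utf-8") as f:
#     for row in parse_lm_eval_rows(f.read()):
#         print(row)
```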

### 2. Running eval through your own python code
You might want to run the evaluation from your own Python code, for instance to change the configuration programmatically or to post-process the results. This is possible using three main components provided in `accuracy.py` and `server_config.py`:

1. Server Configuration: Using ServerConfig to define the vLLM server settings

2. Accuracy Scenario: Using AccuracyScenario to specify evaluation parameters

3. Test Execution: Running the evaluation with the configured settings


First, import the necessary components:

```python
from aws_neuron_eval.accuracy import AccuracyScenario, run_accuracy_test
from aws_neuron_eval.server_config import ServerConfig
```

#### 1. Configure the Server

Set up your server configuration with `ServerConfig`. This example uses Llama-3.1-70B-Instruct:

```python
# Configure the server settings
name = "Llama-3.1-70B-Instruct"

server_config = ServerConfig(
    name=name,
    model_path="/home/ubuntu/model_hf/llama-3.1-70B-Instruct-hf",  # Local model path
    model_s3_path=None,                         # S3 model path (not used)
    compiled_model_path="/home/ubuntu/traced_model_hf/llama-3.1-70B-Instruct-hf",  # Compiled model path
    max_seq_len=16384,                          # Maximum sequence length
    context_encoding_len=16384,                 # Context encoding length
    tp_degree=32,                               # Tensor parallel degree for Trn1
    n_vllm_threads=32,                          # Number of vLLM threads
    server_port=8000,                           # Server port
    continuous_batch_size=1,                    # Batch size for continuous batching
)
```


#### 2. Define Evaluation Scenarios

Create an AccuracyScenario to specify your evaluation parameters:

```python
scenario = AccuracyScenario(
    client="lm_eval",              # Evaluation client
    datasets=[                     # Target datasets
        "gsm8k_cot",
        "mmlu_flan_n_shot_generative_logical_fallacies",
    ],
    max_concurrent_requests=1,     # Maximum concurrent requests
    timeout=5000,                  # Timeout in seconds
    client_params={"limit": 200}   # Client-specific parameters
)
```
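Because scenarios are plain Python objects, you can also generate them programmatically. The sketch below builds one set of keyword arguments per dataset using only the parameters shown above. The naming scheme and the helper `build_scenario_kwargs` are hypothetical, and passing several named scenarios to a single run is an assumption rather than documented behavior.

```python
def build_scenario_kwargs(datasets, limit=200, timeout=3600):
    """Return {scenario_name: kwargs}, with one entry per dataset."""
    return {
        f"{dataset}_limit{limit}": {
            "client": "lm_eval",
            "datasets": [dataset],
            "max_concurrent_requests": 1,
            "timeout": timeout,
            "client_params": {"limit": limit},
        }
        for dataset in datasets
    }


scenario_kwargs = build_scenario_kwargs(
    ["gsm8k_cot", "mmlu_flan_n_shot_generative_logical_fallacies"]
)
print(sorted(scenario_kwargs))

# With the tutorial's imports available, the kwargs could be turned into scenarios:
# named_scenarios = {
#     name: AccuracyScenario(**kwargs) for name, kwargs in scenario_kwargs.items()
# }
```

Keeping the arguments as dictionaries until the last step makes it easy to log or diff the exact settings used for each run.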

#### 3. Run the Evaluation

Execute the evaluation using run_accuracy_test:

```python
# Run the test with a named scenario
results_collection = run_accuracy_test(
    server_config=server_config,
    named_scenarios={"mytest": scenario}
)

# Display results
print(results_collection)
```

This code will execute the evaluation on the specified datasets and return detailed performance metrics. The results include accuracy scores and other relevant metrics for each dataset.
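To keep the results for later comparison, you can write them to disk. The structure of the object returned by `run_accuracy_test` is not described here, so this sketch makes no assumption about it: anything `json` cannot serialize natively is converted with `str`. The helper names and the output file name are hypothetical.

```python
import json
from pathlib import Path


def results_to_json(results):
    """Serialize results to JSON, converting unsupported objects with str()."""
    return json.dumps(results, indent=2, default=str)


def save_results(results, path="accuracy_results.json"):
    """Write the serialized results to path and return the path."""
    Path(path).write_text(results_to_json(results), encoding="utf-8")
    return path


# Typical use after the run above:
# save_results(results_collection)
```

If the returned object turns out to be a custom class, the file will contain its string representation, which is still useful as a record but may need a dedicated converter for further analysis.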