Accuracy Evaluation of Models on Neuron Using Open Source Datasets#

This guide demonstrates how to evaluate the accuracy of models on Trainium and Inferentia instances using open source datasets. This approach expands on accuracy evaluation using logits: it lets you measure accuracy on open source datasets such as MMLU and GSM8K, covering tasks such as instruction following and mathematical reasoning.

Under the hood, the accuracy suite serves the model with a vLLM server and uses benchmarking clients such as lm-eval and LongBench to evaluate it on the datasets those clients support. Support for other benchmarking clients is planned.
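To make the server/client split concrete, here is a minimal sketch of the kind of request a benchmarking client might send to the model server. It assumes the vLLM server is running with its OpenAI-compatible HTTP API on localhost, which this guide does not state explicitly; the helper name `build_completion_request`, the default port, and the use of the model path as the model identifier are illustrative assumptions, not part of the accuracy suite.

```python
import json
import urllib.request


def build_completion_request(prompt, model, port=8000, max_tokens=64):
    """Build (but do not send) a completion request (hypothetical helper).

    Assumes an OpenAI-compatible /v1/completions endpoint on localhost.
    """
    payload = {
        "model": model,          # assumed: identifier the server registered for the model
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,      # greedy decoding keeps accuracy runs repeatable
    }
    return urllib.request.Request(
        url=f"http://localhost:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("What is 12 * 7?", model="/path/to/model")
```

Passing `request` to `urllib.request.urlopen` would send it to a running server. In practice, clients such as lm-eval handle prompt construction, request dispatch, and scoring themselves.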

The code used in this guide is located in the aws-neuron/aws-neuron-samples repository.

For a tutorial that you can follow and run on a Trainium or Inferentia instance, see Evaluating Accuracy of Llama-3.1-70B on Neuron using open source datasets.

Configuration Setup#

Creating the Configuration File#

Create a test_config.yaml file that defines your server settings and accuracy test configurations:

server:
  name: "test-model-server"
  model_path: "/path/to/model"
  model_s3_path: "s3://bucket/path/to/model"
  max_seq_len: 2048
  context_encoding_len: 1024
  tp_degree: 2
  n_vllm_threads: 16
  server_port: 8000
  continuous_batch_size: 2

test:
  accuracy:
    mmlu_test:
      client: "lm_eval"
      datasets: ["mmlu"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 100

    longbench_test:
      client: "longbench"
      datasets: ["qasper", "multifieldqa"]
      max_concurrent_requests: 1
      timeout: 7200
      client_params:
        max_length: 4096

Configuration Parameters#

Server Configuration#

Parameter	Description
`name`	Identifier for your model server
`model_path`	Local path to model files
`model_s3_path`	S3 location of model files
`max_seq_len`	Maximum sequence length
`context_encoding_len`	Length of context encoding
`tp_degree`	Tensor parallelism degree
`n_vllm_threads`	Number of vLLM threads
`server_port`	Server port number
`continuous_batch_size`	Size of continuous batches

If `model_s3_path` is specified, the model is downloaded into `model_path`; otherwise, the model must already exist in `model_path`.
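The model-location rule above can be sketched as a small check on the parsed `server` section. This is an illustration of the rule only, not the suite's actual code: `resolve_model_source` is a hypothetical helper, and the error handling is an assumption.

```python
from pathlib import Path


def resolve_model_source(server_cfg: dict) -> str:
    """Return "download" or "local" for a parsed `server` section (hypothetical helper)."""
    model_path = server_cfg.get("model_path")
    if not model_path:
        raise ValueError("server.model_path is required")

    # With an S3 path set, the files are fetched into model_path before serving.
    if server_cfg.get("model_s3_path"):
        return "download"

    # Without an S3 path, the model must already be present locally.
    if not Path(model_path).exists():
        raise FileNotFoundError(f"model_path does not exist: {model_path}")
    return "local"
```

Running a check like this before launching the server surfaces a missing model directory immediately, instead of partway through server startup.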

Accuracy Test Configuration#

Parameter	Description
`client`	Evaluation framework (e.g., “lm_eval”, “longbench”)
`datasets`	List of datasets to evaluate, chosen from those the client supports
`max_concurrent_requests`	Maximum parallel requests
`timeout`	Maximum execution time (seconds)
`client_params`	Client-specific parameters
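As a rough illustration of how these fields fit together, the sketch below turns a parsed `test.accuracy` section into typed records. The `AccuracyTest` class, the `parse_accuracy_tests` helper, the default values, and the choice to treat `client` and `datasets` as required are all assumptions made for this example; the suite's own parsing may differ.

```python
from dataclasses import dataclass, field


@dataclass
class AccuracyTest:
    name: str
    client: str
    datasets: list
    max_concurrent_requests: int = 1                    # assumed default
    timeout: int = 3600                                 # seconds, assumed default
    client_params: dict = field(default_factory=dict)


def parse_accuracy_tests(config: dict) -> list:
    """Convert the `test.accuracy` mapping into AccuracyTest records (hypothetical helper)."""
    tests = []
    for name, entry in config.get("test", {}).get("accuracy", {}).items():
        missing = [key for key in ("client", "datasets") if key not in entry]
        if missing:
            raise ValueError(f"{name}: missing required keys {missing}")
        if not entry["datasets"]:
            raise ValueError(f"{name}: datasets must not be empty")
        tests.append(
            AccuracyTest(
                name=name,
                client=entry["client"],
                datasets=list(entry["datasets"]),
                max_concurrent_requests=int(entry.get("max_concurrent_requests", 1)),
                timeout=int(entry.get("timeout", 3600)),
                client_params=dict(entry.get("client_params") or {}),
            )
        )
    return tests


# Mirrors the mmlu_test entry from test_config.yaml above.
example_config = {
    "test": {
        "accuracy": {
            "mmlu_test": {
                "client": "lm_eval",
                "datasets": ["mmlu"],
                "max_concurrent_requests": 1,
                "timeout": 3600,
                "client_params": {"limit": 100},
            }
        }
    }
}
parsed_tests = parse_accuracy_tests(example_config)
```

Validating entries this way catches a misspelled key or an empty dataset list before any long-running evaluation starts.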

Running Evaluations#

Execute accuracy tests using the CLI command:

python accuracy.py --config test_config.yaml
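If you want to launch the evaluation from another Python script, for example to run several configuration files in sequence, the command above can be wrapped with `subprocess`. This is a minimal sketch: the helper names are hypothetical, and it assumes `accuracy.py` is in the current working directory.

```python
import subprocess
import sys


def build_accuracy_command(config_path: str, script: str = "accuracy.py") -> list:
    """Build the argument list for the CLI invocation shown above (hypothetical helper)."""
    # sys.executable reuses the interpreter (and virtual environment) of the caller.
    return [sys.executable, script, "--config", config_path]


def run_accuracy(config_path: str) -> int:
    """Run the accuracy suite and return its exit code."""
    completed = subprocess.run(build_accuracy_command(config_path), check=False)
    return completed.returncode
```

Calling `run_accuracy("test_config.yaml")` is then equivalent to the command-line invocation, and the returned exit code can be used to decide whether to continue with the next configuration.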

For more detailed information and advanced configurations, refer to:

- lm-eval Documentation
- LongBench Documentation

These resources provide comprehensive guides on client-specific parameters and advanced evaluation scenarios.
