Accuracy Evaluation of Models on Neuron Using Open Source Datasets#
This guide demonstrates how to evaluate the accuracy of models on Trainium and Inferentia instances using open source datasets. This approach expands on the accuracy evaluation using logits and enables you to evaluate accuracy on open source datasets such as MMLU and GSM8K for tasks like instruction following and mathematical reasoning.
Under the hood, this accuracy suite uses a vLLM server to serve the model and can use benchmarking clients such as lm-eval and LongBench to evaluate against their supported datasets. In the future, we will add support for other benchmarking clients.
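In essence, the flow amounts to standing up the vLLM server and pointing a benchmarking client at its OpenAI-compatible endpoint. The following is a minimal sketch of that flow using the lm-evaluation-harness Python API; the endpoint, model path, task, and limit here are illustrative placeholders, not the suite's actual internals:

```python
# Minimal sketch: evaluate MMLU against a running vLLM server via its
# OpenAI-compatible completions endpoint. Assumes lm-eval is installed
# and a server is already listening on localhost:8000.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=/path/to/model,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=1"
    ),
    tasks=["mmlu"],
    limit=100,  # cap examples per task, mirroring client_params.limit below
)
print(results["results"])
```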
The code used in this guide is located in the aws-neuron/aws-neuron-samples repository.
For a tutorial that you can follow and run on a Trainium or Inferentia instance, see Evaluating Accuracy of Llama-3.1-70B on Neuron using open source datasets.
Configuration Setup#
Creating the Configuration File#
Create a test_config.yaml file that defines your server settings and accuracy test configurations:
server:
  name: "test-model-server"
  model_path: "/path/to/model"
  model_s3_path: "s3://bucket/path/to/model"
  max_seq_len: 2048
  context_encoding_len: 1024
  tp_degree: 2
  n_vllm_threads: 16
  server_port: 8000
  continuous_batch_size: 2

test:
  accuracy:
    mmlu_test:
      client: "lm_eval"
      datasets: ["mmlu"]
      max_concurrent_requests: 1
      timeout: 3600
      client_params:
        limit: 100
    longbench_test:
      client: "longbench"
      datasets: ["qasper", "multifieldqa"]
      max_concurrent_requests: 1
      timeout: 7200
      client_params:
        max_length: 4096
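Before launching a run, you can sanity-check the file by loading it with PyYAML. This is a minimal sketch; the required keys checked here are illustrative, not the suite's full schema:

```python
# Load and sanity-check test_config.yaml. The required-key list is
# illustrative; the accuracy suite's actual schema may differ.
import yaml

with open("test_config.yaml") as f:
    config = yaml.safe_load(f)

for key in ("model_path", "max_seq_len", "tp_degree", "server_port"):
    assert key in config["server"], f"missing server.{key}"

for name, test in config["test"]["accuracy"].items():
    print(f"{name}: client={test['client']}, datasets={test['datasets']}")
```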
Configuration Parameters#
Server Configuration#
| Parameter | Description |
|---|---|
| `name` | Identifier for your model server |
| `model_path` | Local path to model files |
| `model_s3_path` | S3 location of model files |
| `max_seq_len` | Maximum sequence length |
| `context_encoding_len` | Length of context encoding |
| `tp_degree` | Tensor parallelism degree |
| `n_vllm_threads` | Number of vLLM threads |
| `server_port` | Server port number |
| `continuous_batch_size` | Size of continuous batches |
If `model_s3_path` is specified, the model will be downloaded into `model_path`; otherwise, the model should already exist in `model_path`.
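The download step is equivalent to syncing the S3 prefix into the local directory. A minimal boto3 sketch of that behavior (bucket, prefix, and paths are placeholders, and the suite's own download logic may differ):

```python
# Download every object under an S3 prefix into the local model_path.
# Bucket, prefix, and destination are placeholders.
import os
import boto3

s3 = boto3.client("s3")
bucket, prefix, model_path = "bucket", "path/to/model/", "/path/to/model"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        rel = os.path.relpath(obj["Key"], prefix)
        dest = os.path.join(model_path, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, obj["Key"], dest)
```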
Accuracy Test Configuration#
| Parameter | Description |
|---|---|
| `client` | Evaluation framework (e.g., `lm_eval`, `longbench`) |
| `datasets` | List of datasets for evaluation, drawn from the set supported by the client |
| `max_concurrent_requests` | Maximum parallel requests |
| `timeout` | Maximum execution time (seconds) |
| `client_params` | Client-specific parameters |
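Conceptually, `client` selects the evaluation backend and `client_params` is forwarded to it unchanged, so any option the chosen client supports can be passed through. The sketch below is a hypothetical illustration of that dispatch, not the suite's actual internals:

```python
# Hypothetical dispatch: route each test to its client and forward
# client_params as keyword arguments. Function names are illustrative.
def run_lm_eval(datasets, limit=None, **kwargs):
    print(f"lm-eval on {datasets} (limit={limit})")

def run_longbench(datasets, max_length=None, **kwargs):
    print(f"LongBench on {datasets} (max_length={max_length})")

CLIENTS = {"lm_eval": run_lm_eval, "longbench": run_longbench}

def run_accuracy_test(test_cfg):
    CLIENTS[test_cfg["client"]](
        test_cfg["datasets"], **test_cfg.get("client_params", {})
    )
```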
Running Evaluations#
Execute accuracy tests using the CLI command:
python accuracy.py --config test_config.yaml
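If a run fails early, it is worth confirming that the vLLM server came up before the client started sending requests. A quick manual smoke test against the server's OpenAI-compatible endpoints (the port matches `server_port` in the config above; the model name is a placeholder):

```python
# Smoke-test the vLLM server: /health returns 200 once the server is
# ready, and /v1/completions should return a completion.
import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/health").status_code)

resp = requests.post(
    f"{base}/v1/completions",
    json={"model": "/path/to/model", "prompt": "Hello", "max_tokens": 8},
)
print(resp.json()["choices"][0]["text"])
```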
For more detailed information and advanced configurations, please refer to:
- lm-eval Documentation
- LongBench Documentation
These resources provide comprehensive guides on client-specific parameters and advanced evaluation scenarios.