Evaluating Performance of Models on Neuron Using LLMPerf#

This topic guides you through determining the performance of your models on Trainium and Inferentia instances using open-source clients. It expands on the basic performance analysis tools provided with Neuron by incorporating the LLMPerf client to collect additional performance information for models such as llama-3.3-70B-instruct and llama-3.1-8b.

Under the hood, this performance suite uses the vLLM server to serve the model and can use benchmarking clients such as LLMPerf to evaluate the models they support.

In the future we will add support for other benchmarking clients.

The code used in this guide is located at inference-benchmarking.

For a tutorial that you can follow and run on a Trainium or Inferentia instance, see Tutorial: Evaluating Performance of Llama-3.3-70B on Neuron using Performance CLI.

Creating the Configuration File#

Create a test_config.yaml file that defines your server settings and performance test configuration, and paste in the following code:

server:
  name: "test-model-server"
  model_path: "/path/to/model"
  model_s3_path: "s3://bucket/path/to/model"
  max_seq_len: 256
  context_encoding_len: 128
  tp_degree: 32
  server_port: 8000
  continuous_batch_size: 1
  custom_chat_template_path: "default"

test:
  performance:
    llama_test:
      client: "llm_perf"
      client_type: "llm_perf_github_patched"
      max_concurrent_requests: 20
      timeout: 3600
      input_size: 128
      output_size: 124
      client_params:
        stddev_input_tokens: 0
        stddev_output_tokens: 1
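
Because the configuration is plain YAML, you can sanity-check it before launching a run. The following is a minimal sketch that uses PyYAML to load the test_config.yaml created above and confirm a few server fields are present; the set of keys checked here is an assumption for illustration, not the tool's authoritative schema:

import yaml

# Keys checked here are illustrative; see the inference-benchmarking repository
# for the authoritative configuration schema.
REQUIRED_SERVER_KEYS = {"name", "model_path", "max_seq_len", "tp_degree", "server_port"}

with open("test_config.yaml") as f:
    config = yaml.safe_load(f)

server = config.get("server", {})
missing = REQUIRED_SERVER_KEYS - server.keys()
if missing:
    raise ValueError(f"test_config.yaml is missing server keys: {sorted(missing)}")

print(f"Server '{server['name']}' will listen on port {server['server_port']} "
      f"with tp_degree={server['tp_degree']}")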

Configuration Parameters#

Below is a reference for the configuration parameters you can use when configuring the server and tests for your model performance analysis:

Server Configuration#

Parameter                    Description
name                         Identifier for your model server
model_path                   Local path to model files
model_s3_path                S3 location of model files
max_seq_len                  Maximum sequence length
context_encoding_len         Length of context encoding
tp_degree                    Tensor parallelism degree
server_port                  Server port number
continuous_batch_size        Size of continuous batches
custom_chat_template_path    Chat template for the prompt

If model_s3_path is specified, the model is downloaded to model_path; otherwise, the model should already be available at model_path.
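
To make this resolution order concrete, here is a small sketch of the same logic. It only illustrates the documented behavior; the use of the aws s3 sync CLI is an assumption, not necessarily how the suite performs the download:

import os
import subprocess
from typing import Optional

def resolve_model(model_path: str, model_s3_path: Optional[str] = None) -> str:
    # Mirrors the documented behavior: download when an S3 source is given,
    # otherwise expect the model to already exist locally.
    if model_s3_path:
        subprocess.run(["aws", "s3", "sync", model_s3_path, model_path], check=True)
    elif not os.path.isdir(model_path):
        raise FileNotFoundError(f"No model found at {model_path} and no model_s3_path was given")
    return model_path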

Performance Test Configuration#

Parameter                    Description
client                       Performance framework (for example, llm_perf)
client_type                  Client implementation, such as llm_perf_github_patched
max_concurrent_requests      Maximum parallel requests
timeout                      Maximum execution time (seconds)
input_size                   Input context length
output_size                  Output length / MaxNewTokens
client_params                Client-specific parameters

client_params#

The client_params section includes stddev_input_tokens and stddev_output_tokens.

To prevent bucket overflow at higher batch sizes, we use the following defaults:

output_length = orig_output_length - 4 * continuous_batch_size

stddev_output_tokens = batch_size
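
As a quick check of how these defaults play out, the following sketch applies the two formulas to the example configuration above (a continuous_batch_size of 1 and an output_size of 124); here batch_size is taken to mean the continuous batch size:

# Values taken from the example test_config.yaml above.
continuous_batch_size = 1
orig_output_length = 124  # output_size

# Defaults applied to avoid bucket overflow at higher batch sizes.
output_length = orig_output_length - 4 * continuous_batch_size  # 120
stddev_output_tokens = continuous_batch_size                    # 1, matching client_params above

print(output_length, stddev_output_tokens)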

Running Evaluations#

Execute performance tests using the CLI command:

python performance.py --config test_config.yaml
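
If you want to benchmark several configurations back to back, a thin wrapper around the same command is enough. This sketch is not part of the tool itself, and the second configuration file name is hypothetical:

import subprocess

# Hypothetical list of configuration files to run sequentially.
configs = ["test_config.yaml", "test_config_bs4.yaml"]

for cfg in configs:
    print(f"Running performance suite with {cfg}")
    subprocess.run(["python", "performance.py", "--config", cfg], check=True)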

For more detailed information and advanced configurations, refer to the LLMPerf documentation. It provides comprehensive guidance on client-specific parameters and advanced evaluation scenarios.