# LLM Inference benchmarking guide
This guide gives an overview of the metrics tracked for LLM Inference and provides guidelines for using the LLMPerf library to benchmark LLM Inference performance.
## LLM Inference metrics
The following are the essential metrics for monitoring LLM Inference server performance; a short example showing how they relate follows the table.
| Metric | Description |
|---|---|
| Time To First Token (TTFT) | Average time taken for the LLM to process the prompt and return the first output token to the user. This is typically measured in milliseconds. |
| Time Per Output Token (TPOT) | Average time taken for the LLM to generate an output token for an inference request. This is typically measured in milliseconds. This metric is also referred to as Inter-Token Latency (ITL) or Per-Token Latency (PTL). |
| End-to-End Response Latency | Time taken for the LLM to generate the entire response, including all output tokens. This metric is computed as end-to-end latency = (TTFT) + (TPOT) * (number of output tokens). |
| Output Token Throughput | Number of output tokens generated per second by the inference server across all concurrent users and requests. |
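As a rough illustration of how these metrics relate, the sketch below computes them from hypothetical per-request timings. The request timings, token counts, and wall-clock duration are made-up numbers for illustration only; they are not LLMPerf output or LLMPerf code.

```python
# Illustrative only: compute the four metrics above from hypothetical
# per-request measurements (all values are made up).

requests = [
    # (ttft_seconds, end_to_end_latency_seconds, num_output_tokens)
    (0.25, 3.10, 190),
    (0.30, 3.40, 200),
    (0.28, 3.25, 195),
]

num_requests = len(requests)
total_output_tokens = sum(n for _, _, n in requests)

# Average Time To First Token (TTFT), reported in milliseconds.
avg_ttft_ms = 1000 * sum(ttft for ttft, _, _ in requests) / num_requests

# Average Time Per Output Token (TPOT): decode time divided by the number
# of output tokens, with TTFT excluded (see Step 2 below).
avg_tpot_ms = 1000 * sum(
    (e2e - ttft) / n for ttft, e2e, n in requests
) / num_requests

# End-to-end latency follows the formula in the table:
# e2e = TTFT + TPOT * (number of output tokens).
avg_e2e_ms = 1000 * sum(e2e for _, e2e, _ in requests) / num_requests

# Output token throughput across all concurrent requests: total output
# tokens divided by the benchmark's wall-clock duration. The requests are
# assumed to run concurrently, so the wall clock is roughly the longest one.
benchmark_wall_clock_seconds = 3.5
throughput_tokens_per_s = total_output_tokens / benchmark_wall_clock_seconds

print(f"TTFT: {avg_ttft_ms:.1f} ms, TPOT: {avg_tpot_ms:.1f} ms, "
      f"E2E: {avg_e2e_ms:.1f} ms, throughput: {throughput_tokens_per_s:.1f} tok/s")
```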
## Using LLMPerf to benchmark LLM Inference performance
LLMPerf is an open-source library for benchmarking LLM Inference performance. However, a few changes need to be applied to LLMPerf to accurately benchmark and reproduce the metrics published by Neuron. All of the changes outlined below are provided as patch files.
Note: Patches need to be applied in order because they might modify the same files.
### Step 1: Install LLMPerf from source
```bash
python3 -m venv llmperf-env
source llmperf-env/bin/activate
git clone https://github.com/ray-project/llmperf.git ~/llmperf
cd ~/llmperf
pip install -e .
```
### Step 2: Patch custom tokenizer and updated TPOT metric
In public LLMPerf, the hf-internal-testing tokenizer is used for all models, which leads to incorrect performance metrics because it counts more or fewer tokens than were actually processed by the model on the server. Instead, we use the tokenizer of the model being benchmarked.
LLMPerf also includes TTFT in the Time Per Output Token (or Inter-Token Latency) calculation. Because TPOT and TTFT are two different metrics, the patch changes LLMPerf to exclude TTFT from the TPOT calculation, keeping it consistent with how other industry-standard performance benchmarks are computed. The sketch below illustrates the difference.
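The following sketch shows, with made-up numbers, how excluding TTFT changes the computed TPOT. It only demonstrates the arithmetic behind the change; it is not the patch code itself.

```python
# Illustrative arithmetic only (made-up numbers); not the actual patch code.

ttft_s = 0.30              # time to first token, in seconds
e2e_latency_s = 3.30       # end-to-end response latency, in seconds
num_output_tokens = 200

# Unpatched behavior: TTFT is folded into the per-token latency.
tpot_with_ttft_ms = 1000 * e2e_latency_s / num_output_tokens

# Patched behavior: TTFT is excluded, so TPOT reflects only decode time.
tpot_without_ttft_ms = 1000 * (e2e_latency_s - ttft_s) / num_output_tokens

print(f"TPOT including TTFT: {tpot_with_ttft_ms:.2f} ms")    # 16.50 ms
print(f"TPOT excluding TTFT: {tpot_without_ttft_ms:.2f} ms")  # 15.00 ms
```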
Follow these instructions to apply the patch to the LLMPerf library.
1. Download the `neuron_perf.patch` file into the `llmperf` directory.
2. Run `git apply neuron_perf.patch`. Confirm the changes with `git diff`.
### Step 3: Patch data parallel benchmarking with multiple model endpoints
To measure performance with data parallel inference using multiple model copies, the patch allows users to provide multiple semicolon-separated endpoints via OPENAI_API_BASE (e.g. `export OPENAI_API_BASE="http://server1;http://server2;http://server3"`) for the OpenAI chat completion client. By default, the patch routes requests across the endpoints in round-robin order, as sketched below.
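A minimal sketch of the round-robin idea is shown below. It only demonstrates how semicolon-separated endpoints can be split and cycled through; the helper function and structure are illustrative assumptions and do not mirror the actual `llmperf_dp.patch` code.

```python
# Minimal sketch of round-robin routing over semicolon-separated endpoints.
# Illustrative only; it does not reproduce the llmperf_dp.patch implementation.
import itertools
import os

# e.g. OPENAI_API_BASE="http://server1;http://server2;http://server3"
endpoints = os.environ.get(
    "OPENAI_API_BASE", "http://localhost:8000"
).split(";")

# itertools.cycle yields the endpoints in round-robin order indefinitely.
endpoint_cycle = itertools.cycle(endpoints)

def next_api_base() -> str:
    """Return the endpoint to use for the next request (hypothetical helper)."""
    return next(endpoint_cycle)

# Each request picks the next endpoint in turn.
for request_id in range(6):
    print(request_id, next_api_base())
```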
1. Download the `llmperf_dp.patch` file into the `llmperf` directory.
2. Run `git apply llmperf_dp.patch`. Confirm the changes with `git diff`.
### Step 4: Patch reasoning model support
To measure LLM Inference performance of reasoning models, we need to patch LLMPerf to measure TTFT up to the first reasoning token instead of the first answer token.
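The sketch below illustrates where the TTFT clock could stop for a reasoning model. The streaming chunk structure and the `reasoning_content` field name are assumptions based on common OpenAI-compatible servers, and the function is a hypothetical helper; the actual patch may differ.

```python
# Illustrative sketch only: stamping TTFT at the first reasoning OR answer token.
# The chunk/delta structure and "reasoning_content" field are assumptions;
# this is not the llmperf_reasoning.patch implementation.
import time

def measure_ttft(stream) -> float:
    """Return TTFT in seconds for a streamed chat completion (hypothetical)."""
    start = time.monotonic()
    for chunk in stream:
        delta = chunk.choices[0].delta
        # Patched behavior: the first token of ANY kind (reasoning or answer)
        # stops the TTFT clock, so reasoning models are not penalized for
        # their thinking phase.
        if getattr(delta, "reasoning_content", None) or delta.content:
            return time.monotonic() - start
    raise RuntimeError("stream ended before any token was received")
```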
1. Download the `llmperf_reasoning.patch` file into the `llmperf` directory.
2. Run `git apply llmperf_reasoning.patch`. Confirm the changes with `git diff`.