Fix PerfMetrics collection for beam search scenario (#2943)
## Description
This PR fixes PerfMetrics collection issues in beam search scenarios by
correcting the output token size calculation and Time To First Token
(TTFT) measurement. The fix ensures that token counts reflect actual
generated tokens from the sampler rather than batch sizes, and improves
numerical precision in statistical calculations.
Ticket: CVS-175197
## Checklist:
- [x] Tests have been updated or added to cover the new code
- [x] This patch fully addresses the ticket
- [x] I have made corresponding changes to the documentation
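The core of the change — counting the tokens actually delivered by the sampler instead of per-step batch sizes — can be sketched in isolation. Everything below (the function name, the numbers) is hypothetical and only illustrates why batch-size counting skews a duration-per-token metric under beam search; it is not the library's implementation:

```python
# Hypothetical sketch, not the openvino.genai implementation: counting
# per-step batch sizes (one entry per beam) instead of effective output
# tokens distorts a duration-per-token metric under beam search.

def mean_time_per_output_token(generate_duration_ms: float,
                               ttft_ms: float,
                               num_output_tokens: int) -> float:
    """Average time per output token after the first one."""
    if num_output_tokens <= 1:
        return 0.0
    return (generate_duration_ms - ttft_ms) / (num_output_tokens - 1)

num_beams = 4
decode_steps = 20                            # tokens in the sequence the user receives
effective_tokens = decode_steps              # what the sampler actually delivers
inflated_tokens = decode_steps * num_beams   # buggy count: one per beam per step

duration_ms, ttft_ms = 600.0, 30.0
print(mean_time_per_output_token(duration_ms, ttft_ms, effective_tokens))  # correct
print(mean_time_per_output_token(duration_ms, ttft_ms, inflated_tokens))   # understated
```

With the inflated count the same wall-clock time is divided across four times as many "tokens", so the reported per-token time drops far below what a user actually experiences.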
### site/docs/guides/performance-metrics.mdx (+9 −3)
````diff
@@ -79,16 +79,22 @@ However, since mean and standard deviation values are usually sufficient, we wil
 </LanguageTabs>
 
 ```sh title="Output:"
-mean_generate_duration: 76.28
-mean_ttft: 42.58
-mean_tpot 3.80
+Generate duration: 702.85
+TTFT: 137.58 ms
+TPOT: 29.74 ms/token
+Throughput: 33.62 tokens/s
 ```
 
 :::info Note
 If the input prompt is just a string, the generate function returns only a string without perf_metrics.
 To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
 :::
 
+:::info Note
+TPOT (Time Per Output Token) represents the average time required to generate each output token in the final result.
+For beam search scenario, TPOT is calculated based on the effective output tokens delivered to users, not the tokens generated by individual beams during internal processing.
+:::
+
 ## Accumulating Metrics
 
 Several `perf_metrics` can be added to each other.
````
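The updated sample output is internally consistent: assuming the reported throughput is derived from the mean TPOT (which the numbers suggest), it is simply the reciprocal of TPOT converted from ms/token to tokens/s. A quick cross-check:

```python
# Cross-check of the sample numbers in the doc diff above:
# 29.74 ms/token corresponds to 1000 / 29.74 ≈ 33.62 tokens/s,
# matching the reported throughput.
tpot_ms = 29.74
throughput = 1000.0 / tpot_ms
print(f"Throughput: {throughput:.2f} tokens/s")
```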
### src/README.md (+1 −145)
````diff
@@ -252,151 +252,7 @@ int main(int argc, char* argv[]) {
 
 ### Performance Metrics
 
-`openvino_genai.PerfMetrics` (referred as `PerfMetrics` for simplicity) is a structure that holds performance metrics for each generate call. `PerfMetrics` holds fields with mean and standard deviations for the following metrics:
-- Time To the First Token (TTFT), ms
-- Time per Output Token (TPOT), ms/token
-- Generate total duration, ms
-- Tokenization duration, ms
-- Detokenization duration, ms
-- Throughput, tokens/s
-
-and:
-- Load time, ms
-- Number of generated tokens
-- Number of tokens in the input prompt
-
-Performance metrics are stored either in the `DecodedResults` or `EncodedResults` `perf_metric` field. Additionally to the fields mentioned above, `PerfMetrics` has a member `raw_metrics` of type `openvino_genai.RawPerfMetrics` (referred to as `RawPerfMetrics` for simplicity) that contains raw values for the durations of each batch of new token generation, tokenization durations, detokenization durations, and more. These raw metrics are accessible if you wish to calculate your own statistical values such as median or percentiles. However, since mean and standard deviation values are usually sufficient, we will focus on `PerfMetrics`.
-
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
…
->**Note**: If the input prompt is just a string, the generate function returns only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
-
-#### Accumulating metrics
-Several `perf_metrics` can be added to each other. In that case `raw_metrics` are concatenated and mean/std values are recalculated. This accumulates statistics from several `generate()` calls
-
-```cpp
-#include "openvino/genai/llm_pipeline.hpp"
-#include <iostream>
-
-int main(int argc, char* argv[]) {
-    std::string models_path = argv[1];
-    ov::genai::LLMPipeline pipe(models_path, "CPU");
-    auto result_1 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto result_2 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto perf_metrics = result_1.perf_metrics + result_2.perf_metrics
…
-In addition to mean and standard deviation values, the `perf_metrics` object has a `raw_metrics` field. This field stores raw data, including:
-
-- Timestamps for each batch of generated tokens
-- Batch sizes for each timestamp
-- Tokenization durations
-- Detokenization durations
-- Other relevant metrics
-
-These metrics can be use for more fine grained analysis, such as getting exact calculating median values, percentiles, etc. Below are a few examples of how to use raw metrics.
-
-Getting timestamps for each generated token:
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
…
-print(f'Median from token to token duration: {np.median(durations):.2f} ms')
-```
-
-For more examples of how metrics are used, please refer to the Python [benchmark_genai.py](../samples/python/text_generation/README.md) and C++ [benchmark_genai](../samples/cpp/text_generation/README.md) samples.
-
+Refer to the [Performance Metrics](https://openvinotoolkit.github.io/openvino.genai/docs/guides/performance-metrics) page for details and usage examples.
 
 ### Structured Output generation
 OpenVINO™ GenAI supports structured output generation, which allows you to generate outputs in a structured format such as JSON, regex, or according to EBNF (Extended Backus–Naur form) grammar.
````
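The removed README text described accumulation: when `perf_metrics` objects are added, `raw_metrics` are concatenated and mean/std values are recalculated. A standalone sketch of that idea in pure Python (not the library's code; `math.fsum` stands in for the improved numerical precision the PR description mentions, and all durations are made-up):

```python
import math

# Illustrative sketch, not the library's implementation: accumulate raw
# per-token durations from two generate() calls and recompute mean/std
# over the concatenated samples. math.fsum keeps the summation precise.

def mean_std(durations_ms):
    """Population mean and standard deviation of a list of durations."""
    n = len(durations_ms)
    mean = math.fsum(durations_ms) / n
    var = math.fsum((d - mean) ** 2 for d in durations_ms) / n
    return mean, math.sqrt(var)

call_1 = [30.1, 29.8, 30.3]    # hypothetical token-to-token durations, ms
call_2 = [28.9, 31.2]
combined = call_1 + call_2     # "raw_metrics are concatenated"

mean, std = mean_std(combined)
print(f"mean TPOT: {mean:.2f} ms, std: {std:.2f} ms")
```

Recomputing from the concatenated raw samples, rather than averaging the two per-call means, keeps the statistics correct even when the calls produce different numbers of tokens.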