Commit 6c481bc

Fix PerfMetrics collection for beam search scenario (#2943)
## Description

This PR fixes PerfMetrics collection issues in beam search scenarios by correcting the output token size calculation and Time To First Token (TTFT) measurement. The fix ensures that token counts reflect actual generated tokens from the sampler rather than batch sizes, and improves numerical precision in statistical calculations.

Ticket: CVS-175197

## Checklist:
- [x] Tests have been updated or added to cover the new code
- [x] This patch fully addresses the ticket
- [x] I have made corresponding changes to the documentation
1 parent 82b8330 commit 6c481bc
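The core of the fix can be sketched in a few lines of plain Python (illustrative only, not OpenVINO GenAI API code, and all numbers are made up): counting every token sampled by every beam inflates the output-token count, which in turn understates TPOT.

```python
# Illustrative sketch, not OpenVINO GenAI API code; all numbers are made up.

def tpot_ms(generate_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Mean time per output token after the first one."""
    return (generate_ms - ttft_ms) / (output_tokens - 1)

num_beams = 4
steps = 21           # decoding iterations, one effective output token each
ttft = 100.0         # ms
generate = 500.0     # ms, whole generate() call

per_beam_tokens = num_beams * steps   # 84 tokens sampled internally
effective_tokens = steps              # 21 tokens actually returned to the user

inflated = tpot_ms(generate, ttft, per_beam_tokens)    # 400 / 83, ~4.8 ms/token
corrected = tpot_ms(generate, ttft, effective_tokens)  # 400 / 20 = 20.0 ms/token
assert corrected > inflated  # per-beam counting understates TPOT
```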

File tree

6 files changed: +70 −186 lines changed

site/docs/guides/performance-metrics.mdx

Lines changed: 9 additions & 3 deletions
@@ -79,16 +79,22 @@ However, since mean and standard deviation values are usually sufficient, we wil
 </LanguageTabs>
 
 ```sh title="Output:"
-mean_generate_duration: 76.28
-mean_ttft: 42.58
-mean_tpot 3.80
+Generate duration: 702.85
+TTFT: 137.58 ms
+TPOT: 29.74 ms/token
+Throughput: 33.62 tokens/s
 ```
 
 :::info Note
 If the input prompt is just a string, the generate function returns only a string without perf_metrics.
 To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
 :::
 
+:::info Note
+TPOT (Time Per Output Token) represents the average time required to generate each output token in the final result.
+For beam search scenario, TPOT is calculated based on the effective output tokens delivered to users, not the tokens generated by individual beams during internal processing.
+:::
+
 ## Accumulating Metrics
 
 Several `perf_metrics` can be added to each other.
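As a quick sanity check, the new sample figures in the docs diff above are mutually consistent under the usual definitions. The token count `n = 20` below is our own inference from the numbers, not something stated in the diff:

```python
# Hypothetical cross-check of the documented sample output; n = 20 is an
# assumption, not a value taken from the diff.
generate_ms, ttft_ms, tpot_ms, throughput = 702.85, 137.58, 29.74, 33.62

# throughput (tokens/s) is the reciprocal of TPOT (ms/token)
assert abs(1000.0 / tpot_ms - throughput) < 0.05

# with n output tokens, TPOT ~= (generate duration - TTFT) / (n - 1)
n = 20
assert abs((generate_ms - ttft_ms) / (n - 1) - tpot_ms) < 0.05
```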

src/README.md

Lines changed: 1 addition & 145 deletions
@@ -252,151 +252,7 @@ int main(int argc, char* argv[]) {
 
 ### Performance Metrics
 
-`openvino_genai.PerfMetrics` (referred as `PerfMetrics` for simplicity) is a structure that holds performance metrics for each generate call. `PerfMetrics` holds fields with mean and standard deviations for the following metrics:
-- Time To the First Token (TTFT), ms
-- Time per Output Token (TPOT), ms/token
-- Generate total duration, ms
-- Tokenization duration, ms
-- Detokenization duration, ms
-- Throughput, tokens/s
-
-and:
-- Load time, ms
-- Number of generated tokens
-- Number of tokens in the input prompt
-
-Performance metrics are stored either in the `DecodedResults` or `EncodedResults` `perf_metric` field. Additionally to the fields mentioned above, `PerfMetrics` has a member `raw_metrics` of type `openvino_genai.RawPerfMetrics` (referred to as `RawPerfMetrics` for simplicity) that contains raw values for the durations of each batch of new token generation, tokenization durations, detokenization durations, and more. These raw metrics are accessible if you wish to calculate your own statistical values such as median or percentiles. However, since mean and standard deviation values are usually sufficient, we will focus on `PerfMetrics`.
-
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
-print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-```
-
-```cpp
-#include "openvino/genai/llm_pipeline.hpp"
-#include <iostream>
-
-int main(int argc, char* argv[]) {
-    std::string models_path = argv[1];
-    ov::genai::LLMPipeline pipe(models_path, "CPU");
-    auto result = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto perf_metrics = result.perf_metrics;
-
-    std::cout << std::fixed << std::setprecision(2);
-    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
-    std::cout << "TTFT: " << metrics.get_ttft().mean << " ms" << std::endl;
-    std::cout << "TPOT: " << metrics.get_tpot().mean << " ms/token " << std::endl;
-    std::cout << "Throughput: " << metrics.get_throughput().mean << " tokens/s" << std::endl;
-}
-```
-output:
-```sh
-mean_generate_duration: 76.28
-mean_ttft: 42.58
-mean_tpot 3.80
-```
-
->**Note**: If the input prompt is just a string, the generate function returns only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
-
-#### Accumulating metrics
-Several `perf_metrics` can be added to each other. In that case `raw_metrics` are concatenated and mean/std values are recalculated. This accumulates statistics from several `generate()` calls
-
-```cpp
-#include "openvino/genai/llm_pipeline.hpp"
-#include <iostream>
-
-int main(int argc, char* argv[]) {
-    std::string models_path = argv[1];
-    ov::genai::LLMPipeline pipe(models_path, "CPU");
-    auto result_1 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto result_2 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto perf_metrics = result_1.perf_metrics + result_2.perf_metrics
-
-    std::cout << std::fixed << std::setprecision(2);
-    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
-    std::cout << "TTFT: " << metrics.get_ttft().mean << " ms" << std::endl;
-    std::cout << "TPOT: " << metrics.get_tpot().mean << " ms/token " << std::endl;
-    std::cout << "Throughput: " << metrics.get_throughput().mean << " tokens/s" << std::endl;
-}
-```
-
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-res_1 = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-res_2 = pipe.generate(["Why Sky is blue because"], max_new_tokens=20)
-perf_metrics = res_1.perf_metrics + res_2.perf_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
-print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-```
-
-#### Using raw performance metrics
-In addition to mean and standard deviation values, the `perf_metrics` object has a `raw_metrics` field. This field stores raw data, including:
-
-- Timestamps for each batch of generated tokens
-- Batch sizes for each timestamp
-- Tokenization durations
-- Detokenization durations
-- Other relevant metrics
-
-These metrics can be use for more fine grained analysis, such as getting exact calculating median values, percentiles, etc. Below are a few examples of how to use raw metrics.
-
-Getting timestamps for each generated token:
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-raw_metrics = perf_metrics.raw_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-print(f'Timestamps: {" ms, ".join(f"{i:.2f}" for i in raw_metrics.m_new_token_times)}')
-```
-
-Getting pure inference time without tokenizatin and detokenization duration:
-```python
-import openvino_genai as ov_genai
-import numpy as np
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
-
-raw_metrics = perf_metrics.raw_metrics
-generate_duration = np.array(raw_metrics.generate_durations)
-tok_detok_duration = np.array(raw_metrics.tokenization_durations) - np.array(raw_metrics.detokenization_durations)
-pure_inference_duration = np.sum(generate_duration - tok_detok_duration) / 1000 # in milliseconds
-print(f'Pure Inference duration: {pure_inference_duration:.2f} ms')
-```
-
-Example of using raw metrics to calculate median value of generate duration:
-```python
-import openvino_genai as ov_genai
-import numpy as np
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-raw_metrics = perf_metrics.raw_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-durations = np.array(raw_metrics.m_new_token_times[1:]) - np.array(raw_metrics.m_new_token_times[:-1])
-print(f'Median from token to token duration: {np.median(durations):.2f} ms')
-```
-
-For more examples of how metrics are used, please refer to the Python [benchmark_genai.py](../samples/python/text_generation/README.md) and C++ [benchmark_genai](../samples/cpp/text_generation/README.md) samples.
-
+Refer to the [Performance Metrics](https://openvinotoolkit.github.io/openvino.genai/docs/guides/performance-metrics) page for details and usage examples.
 
 ### Structured Output generation
 OpenVINO™ GenAI supports structured output generation, which allows you to generate outputs in a structured format such as JSON, regex, or according to EBNF (Extended Backus–Naur form) grammar.

src/cpp/src/lm_encoding.cpp

Lines changed: 7 additions & 4 deletions
@@ -153,12 +153,11 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
 
     const auto infer_start = std::chrono::steady_clock::now();
     m_llm.infer();
+
     const auto infer_end = std::chrono::steady_clock::now();
     const auto infer_ms = PerfMetrics::get_microsec(infer_end - infer_start);
     raw_perf_counters.m_inference_durations[0] += MicroSeconds(infer_ms);
     raw_perf_counters.m_token_infer_durations.emplace_back(infer_ms);
-    raw_perf_counters.m_new_token_times.emplace_back(infer_end);
-    raw_perf_counters.m_batch_sizes.emplace_back(batch_size);
 
     auto logits = m_llm.get_tensor("logits");
 
@@ -175,6 +174,9 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
     SamplerOutput sampler_output = sampler.sample(sequence_groups, logits);
     free_non_running_requests(); // handle sampler output
 
+    raw_perf_counters.m_new_token_times.emplace_back(std::chrono::steady_clock::now());
+    raw_perf_counters.m_batch_sizes.emplace_back(sampler_output.num_generated_tokens);
+
     // "Generation" phase
 
     while (!active_sequence_groups.empty()) {
@@ -271,11 +273,12 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
         const auto infer_ms = PerfMetrics::get_microsec(infer_end - infer_start);
         raw_perf_counters.m_inference_durations[0] += MicroSeconds(infer_ms);
         raw_perf_counters.m_token_infer_durations.emplace_back(infer_ms);
-        raw_perf_counters.m_new_token_times.emplace_back(infer_end);
-        raw_perf_counters.m_batch_sizes.emplace_back(current_batch_size);
 
         sampler_output = sampler.sample(active_sequence_groups, m_llm.get_tensor("logits"));
         free_non_running_requests(); // handle sampler output
+
+        raw_perf_counters.m_new_token_times.emplace_back(std::chrono::steady_clock::now());
+        raw_perf_counters.m_batch_sizes.emplace_back(sampler_output.num_generated_tokens);
     }
 
     stream_generated_tokens();
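The bookkeeping change above can be mimicked in a short Python sketch (illustrative names, not the real C++ types): the (timestamp, token-count) pair for a decoding step is now recorded after sampling, using the sampler's reported `num_generated_tokens`, rather than right after `infer()` with the raw batch size.

```python
# Illustrative sketch of the lm_encoding.cpp change; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class RawCounters:
    new_token_times: list = field(default_factory=list)
    batch_sizes: list = field(default_factory=list)

def record_step(counters: RawCounters, now: float, num_generated_tokens: int) -> None:
    # one entry per decoding step, sized by what the sampler actually produced
    counters.new_token_times.append(now)
    counters.batch_sizes.append(num_generated_tokens)

counters = RawCounters()
num_beams = 4
# hypothetical beam search: each step samples num_beams candidates internally
# but delivers one effective output token to the user
for t in (0.137, 0.167, 0.197):
    record_step(counters, t, num_generated_tokens=1)

assert sum(counters.batch_sizes) == 3               # effective tokens
assert sum(counters.batch_sizes) != 3 * num_beams   # not per-beam counts
```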

src/cpp/src/perf_metrics.cpp

Lines changed: 17 additions & 12 deletions
@@ -12,31 +12,33 @@ namespace genai {
 
 ov::genai::MeanStdPair calc_mean_and_std(const std::vector<ov::genai::MicroSeconds>& durations) {
     if (durations.size() == 0) {
-        return {-1, -1};
+        return {-1.0f, -1.0f};
     }
     // Accepts time durations in microseconds and returns standard deviation and mean in milliseconds.
-    float mean = std::accumulate(durations.begin(), durations.end(), 0.0f,
-        [](const float& acc, const ov::genai::MicroSeconds& duration) -> float {
-            return acc + duration.count() / 1000.0f;
+    double mean = std::accumulate(durations.begin(), durations.end(), 0.0,
+        [](const double& acc, const ov::genai::MicroSeconds& duration) -> double {
+            return acc + duration.count() / 1000.0;
         });
     mean /= durations.size();
 
-    float sum_square_durations = std::accumulate(durations.begin(), durations.end(), 0.0f,
-        [](const float& acc, const ov::genai::MicroSeconds& duration) -> float {
-            auto d = duration.count() / 1000.0f;
+    double sum_square_durations = std::accumulate(durations.begin(), durations.end(), 0.0,
+        [](const double& acc, const ov::genai::MicroSeconds& duration) -> double {
+            auto d = duration.count() / 1000.0;
             return acc + d * d;
         });
-    float std = std::sqrt(sum_square_durations / durations.size() - mean * mean);
-    return {mean, std};
+    double std = std::sqrt(sum_square_durations / durations.size() - mean * mean);
+    return {static_cast<float>(mean), static_cast<float>(std)};
 }
 
 ov::genai::SummaryStats calc_full_stat(const std::vector<ov::genai::MicroSeconds>& durations) {
     if (durations.size() == 0) {
-        return {-1, -1, -1, -1};
+        return {-1.0f, -1.0f, -1.0f, -1.0f};
     }
-    auto minmax = std::minmax_element(durations.begin(), durations.end());
     auto meanstd = calc_mean_and_std(durations);
-    return {meanstd.mean, meanstd.std, minmax.first->count() / 1000.0f, minmax.second->count() / 1000.0f};
+    auto minmax = std::minmax_element(durations.begin(), durations.end());
+    float min = static_cast<float>(minmax.first->count() / 1000.0);
+    float max = static_cast<float>(minmax.second->count() / 1000.0);
+    return {meanstd.mean, meanstd.std, min, max};
 }
 
 float PerfMetrics::get_load_time() {
@@ -116,11 +118,14 @@ void PerfMetrics::evaluate_statistics(std::optional<TimePoint> start_time) {
         auto start_time_val = *start_time;
         auto& tok_times = raw_metrics.m_new_token_times;
         auto& batch_sizes = raw_metrics.m_batch_sizes;
+
+        raw_metrics.m_durations.clear();
         raw_metrics.m_durations.reserve(tok_times.size());
 
         auto ttft = tok_times[0] - start_time_val;
         raw_metrics.m_times_to_first_token.clear();
         raw_metrics.m_times_to_first_token.emplace_back(ttft);
+
        num_generated_tokens = batch_sizes[0];
 
        // The very first infer request (prefill stage) is slower than subsequent ones since we process a sequence of tokens.
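The switch from `float` to `double` accumulators in `calc_mean_and_std` matters because the naive variance formula E[x²] − (E[x])² cancels catastrophically in 32-bit precision when durations are large and nearly equal. A hedged NumPy demonstration (synthetic durations, not real measurements):

```python
# Demonstrating why 32-bit accumulation breaks E[x^2] - E[x]^2; the durations
# here are synthetic (~50 s each with sub-millisecond spread).
import numpy as np

durations_ms = np.full(1000, 50_000.0) + np.linspace(0.0, 0.5, 1000)

def std_naive(xs):
    # same one-pass formula as calc_mean_and_std
    mean = xs.sum() / xs.size
    mean_sq = (xs * xs).sum() / xs.size
    return np.sqrt(mean_sq - mean * mean)

f32 = std_naive(durations_ms.astype(np.float32))
f64 = std_naive(durations_ms.astype(np.float64))
true_std = durations_ms.std()  # two-pass reference, ~0.144 ms

# In float32 the two ~2.5e9 terms are quantized so coarsely that their
# difference comes out as 0, a wildly wrong value, or negative (NaN after
# sqrt); float64 stays close to the true standard deviation.
assert abs(f64 - true_std) < 0.01
assert np.isnan(f32) or abs(f32 - true_std) > 0.1
```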

src/cpp/src/sampling/sampler.cpp

Lines changed: 4 additions & 1 deletion
@@ -274,7 +274,6 @@ void Sampler::GroupBeamSearcher::select_next_tokens(const ov::Tensor& logits,
     for (Group& group : m_groups) {
         if (!group.done) {
             for (Beam& beam : group.ongoing) {
-                sampler_output.num_generated_tokens++;
                 uint64_t parent_seq_id = beam.m_sequence->get_id();
 
                 // here we need to map index of sequence in beam search group(s) and sequence group
@@ -903,6 +902,10 @@ SequenceGroupSamplingInfo Sampler::sample_from_sequence_group(SequenceGroup::Ptr
         beam_searcher = &m_beam_search_info.at(request_id);
     }
 
+    if (!sequence_group->has_finished()) {
+        sg_sampling_info.sampler_output.num_generated_tokens++;
+    }
+
     // current algorithm already adds new tokens to running sequences and
     beam_searcher->select_next_tokens(sequence_group_logits, sg_sampling_info.sampler_output, stop_strings);

0 commit comments
