Commit 6c481bc

Fix PerfMetrics collection for beam search scenario (#2943)
## Description

This PR fixes PerfMetrics collection issues in beam search scenarios by correcting the output token size calculation and Time To First Token (TTFT) measurement. The fix ensures that token counts reflect actual generated tokens from the sampler rather than batch sizes, and improves numerical precision in statistical calculations.

Ticket: CVS-175197

## Checklist:
- [x] Tests have been updated or added to cover the new code
- [x] This patch fully addresses the ticket
- [x] I have made corresponding changes to the documentation
1 parent 82b8330 commit 6c481bc
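The core of the fix can be sketched in a few lines of plain Python (illustrative only, not OpenVINO GenAI API code, and all numbers are made up): counting every token sampled by every beam inflates the output-token count, which in turn understates TPOT.

```python
# Illustrative sketch, not OpenVINO GenAI API code; all numbers are made up.

def tpot_ms(generate_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Mean time per output token after the first one."""
    return (generate_ms - ttft_ms) / (output_tokens - 1)

num_beams = 4
steps = 21           # decoding iterations, one effective output token each
ttft = 100.0         # ms
generate = 500.0     # ms, whole generate() call

per_beam_tokens = num_beams * steps   # 84 tokens sampled internally
effective_tokens = steps              # 21 tokens actually returned to the user

inflated = tpot_ms(generate, ttft, per_beam_tokens)    # 400 / 83, ~4.8 ms/token
corrected = tpot_ms(generate, ttft, effective_tokens)  # 400 / 20 = 20.0 ms/token
assert corrected > inflated  # per-beam counting understates TPOT
```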

File tree

6 files changed: +70 −186 lines changed

site/docs/guides/performance-metrics.mdx

Lines changed: 9 additions & 3 deletions
@@ -79,16 +79,22 @@ However, since mean and standard deviation values are usually sufficient, we wil
 </LanguageTabs>
 
 ```sh title="Output:"
-mean_generate_duration: 76.28
-mean_ttft: 42.58
-mean_tpot 3.80
+Generate duration: 702.85
+TTFT: 137.58 ms
+TPOT: 29.74 ms/token
+Throughput: 33.62 tokens/s
 ```
 
 :::info Note
 If the input prompt is just a string, the generate function returns only a string without perf_metrics.
 To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
 :::
 
+:::info Note
+TPOT (Time Per Output Token) represents the average time required to generate each output token in the final result.
+For beam search scenario, TPOT is calculated based on the effective output tokens delivered to users, not the tokens generated by individual beams during internal processing.
+:::
+
 ## Accumulating Metrics
 
 Several `perf_metrics` can be added to each other.
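As a quick sanity check, the new sample figures in the docs diff above are mutually consistent under the usual definitions. The token count `n = 20` below is our own inference from the numbers, not something stated in the diff:

```python
# Hypothetical cross-check of the documented sample output; n = 20 is an
# assumption, not a value taken from the diff.
generate_ms, ttft_ms, tpot_ms, throughput = 702.85, 137.58, 29.74, 33.62

# throughput (tokens/s) is the reciprocal of TPOT (ms/token)
assert abs(1000.0 / tpot_ms - throughput) < 0.05

# with n output tokens, TPOT ~= (generate duration - TTFT) / (n - 1)
n = 20
assert abs((generate_ms - ttft_ms) / (n - 1) - tpot_ms) < 0.05
```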

src/README.md

Lines changed: 1 addition & 145 deletions
@@ -252,151 +252,7 @@ int main(int argc, char* argv[]) {
 
 ### Performance Metrics
 
-`openvino_genai.PerfMetrics` (referred as `PerfMetrics` for simplicity) is a structure that holds performance metrics for each generate call. `PerfMetrics` holds fields with mean and standard deviations for the following metrics:
-- Time To the First Token (TTFT), ms
-- Time per Output Token (TPOT), ms/token
-- Generate total duration, ms
-- Tokenization duration, ms
-- Detokenization duration, ms
-- Throughput, tokens/s
-
-and:
-- Load time, ms
-- Number of generated tokens
-- Number of tokens in the input prompt
-
-Performance metrics are stored either in the `DecodedResults` or `EncodedResults` `perf_metric` field. Additionally to the fields mentioned above, `PerfMetrics` has a member `raw_metrics` of type `openvino_genai.RawPerfMetrics` (referred to as `RawPerfMetrics` for simplicity) that contains raw values for the durations of each batch of new token generation, tokenization durations, detokenization durations, and more. These raw metrics are accessible if you wish to calculate your own statistical values such as median or percentiles. However, since mean and standard deviation values are usually sufficient, we will focus on `PerfMetrics`.
-
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
-print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-```
-
-```cpp
-#include "openvino/genai/llm_pipeline.hpp"
-#include <iostream>
-
-int main(int argc, char* argv[]) {
-    std::string models_path = argv[1];
-    ov::genai::LLMPipeline pipe(models_path, "CPU");
-    auto result = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto perf_metrics = result.perf_metrics;
-
-    std::cout << std::fixed << std::setprecision(2);
-    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
-    std::cout << "TTFT: " << metrics.get_ttft().mean << " ms" << std::endl;
-    std::cout << "TPOT: " << metrics.get_tpot().mean << " ms/token " << std::endl;
-    std::cout << "Throughput: " << metrics.get_throughput().mean << " tokens/s" << std::endl;
-}
-```
-output:
-```sh
-mean_generate_duration: 76.28
-mean_ttft: 42.58
-mean_tpot 3.80
-```
-
->**Note**: If the input prompt is just a string, the generate function returns only a string without perf_metrics. To obtain perf_metrics, provide the prompt as a list with at least one element or call generate with encoded inputs.
-
-#### Accumulating metrics
-Several `perf_metrics` can be added to each other. In that case `raw_metrics` are concatenated and mean/std values are recalculated. This accumulates statistics from several `generate()` calls
-
-```cpp
-#include "openvino/genai/llm_pipeline.hpp"
-#include <iostream>
-
-int main(int argc, char* argv[]) {
-    std::string models_path = argv[1];
-    ov::genai::LLMPipeline pipe(models_path, "CPU");
-    auto result_1 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto result_2 = pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(20));
-    auto perf_metrics = result_1.perf_metrics + result_2.perf_metrics
-
-    std::cout << std::fixed << std::setprecision(2);
-    std::cout << "Generate duration: " << perf_metrics.get_generate_duration().mean << " ms" << std::endl;
-    std::cout << "TTFT: " << metrics.get_ttft().mean << " ms" << std::endl;
-    std::cout << "TPOT: " << metrics.get_tpot().mean << " ms/token " << std::endl;
-    std::cout << "Throughput: " << metrics.get_throughput().mean << " tokens/s" << std::endl;
-}
-```
-
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-res_1 = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-res_2 = pipe.generate(["Why Sky is blue because"], max_new_tokens=20)
-perf_metrics = res_1.perf_metrics + res_2.perf_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'TTFT: {perf_metrics.get_ttft().mean:.2f} ms')
-print(f'TPOT: {perf_metrics.get_tpot().mean:.2f} ms/token')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-```
-
-#### Using raw performance metrics
-In addition to mean and standard deviation values, the `perf_metrics` object has a `raw_metrics` field. This field stores raw data, including:
-
-- Timestamps for each batch of generated tokens
-- Batch sizes for each timestamp
-- Tokenization durations
-- Detokenization durations
-- Other relevant metrics
-
-These metrics can be use for more fine grained analysis, such as getting exact calculating median values, percentiles, etc. Below are a few examples of how to use raw metrics.
-
-Getting timestamps for each generated token:
-```python
-import openvino_genai as ov_genai
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-raw_metrics = perf_metrics.raw_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-print(f'Timestamps: {" ms, ".join(f"{i:.2f}" for i in raw_metrics.m_new_token_times)}')
-```
-
-Getting pure inference time without tokenizatin and detokenization duration:
-```python
-import openvino_genai as ov_genai
-import numpy as np
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f} ms')
-
-raw_metrics = perf_metrics.raw_metrics
-generate_duration = np.array(raw_metrics.generate_durations)
-tok_detok_duration = np.array(raw_metrics.tokenization_durations) - np.array(raw_metrics.detokenization_durations)
-pure_inference_duration = np.sum(generate_duration - tok_detok_duration) / 1000 # in milliseconds
-print(f'Pure Inference duration: {pure_inference_duration:.2f} ms')
-```
-
-Example of using raw metrics to calculate median value of generate duration:
-```python
-import openvino_genai as ov_genai
-import numpy as np
-pipe = ov_genai.LLMPipeline(models_path, "CPU")
-result = pipe.generate(["The Sun is yellow because"], max_new_tokens=20)
-perf_metrics = result.perf_metrics
-raw_metrics = perf_metrics.raw_metrics
-
-print(f'Generate duration: {perf_metrics.get_generate_duration().mean:.2f}')
-print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
-durations = np.array(raw_metrics.m_new_token_times[1:]) - np.array(raw_metrics.m_new_token_times[:-1])
-print(f'Median from token to token duration: {np.median(durations):.2f} ms')
-```
-
-For more examples of how metrics are used, please refer to the Python [benchmark_genai.py](../samples/python/text_generation/README.md) and C++ [benchmark_genai](../samples/cpp/text_generation/README.md) samples.
-
+Refer to the [Performance Metrics](https://openvinotoolkit.github.io/openvino.genai/docs/guides/performance-metrics) page for details and usage examples.
 
 ### Structured Output generation
 OpenVINO™ GenAI supports structured output generation, which allows you to generate outputs in a structured format such as JSON, regex, or according to EBNF (Extended Backus–Naur form) grammar.

src/cpp/src/lm_encoding.cpp

Lines changed: 7 additions & 4 deletions
@@ -153,12 +153,11 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
 
     const auto infer_start = std::chrono::steady_clock::now();
     m_llm.infer();
+
     const auto infer_end = std::chrono::steady_clock::now();
     const auto infer_ms = PerfMetrics::get_microsec(infer_end - infer_start);
     raw_perf_counters.m_inference_durations[0] += MicroSeconds(infer_ms);
     raw_perf_counters.m_token_infer_durations.emplace_back(infer_ms);
-    raw_perf_counters.m_new_token_times.emplace_back(infer_end);
-    raw_perf_counters.m_batch_sizes.emplace_back(batch_size);
 
     auto logits = m_llm.get_tensor("logits");
 
@@ -175,6 +174,9 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
     SamplerOutput sampler_output = sampler.sample(sequence_groups, logits);
     free_non_running_requests(); // handle sampler output
 
+    raw_perf_counters.m_new_token_times.emplace_back(std::chrono::steady_clock::now());
+    raw_perf_counters.m_batch_sizes.emplace_back(sampler_output.num_generated_tokens);
+
     // "Generation" phase
 
     while (!active_sequence_groups.empty()) {
@@ -271,11 +273,12 @@ ov::genai::utils::GenerationFinishInfo get_lm_encoded_results(
         const auto infer_ms = PerfMetrics::get_microsec(infer_end - infer_start);
         raw_perf_counters.m_inference_durations[0] += MicroSeconds(infer_ms);
         raw_perf_counters.m_token_infer_durations.emplace_back(infer_ms);
-        raw_perf_counters.m_new_token_times.emplace_back(infer_end);
-        raw_perf_counters.m_batch_sizes.emplace_back(current_batch_size);
 
         sampler_output = sampler.sample(active_sequence_groups, m_llm.get_tensor("logits"));
         free_non_running_requests(); // handle sampler output
+
+        raw_perf_counters.m_new_token_times.emplace_back(std::chrono::steady_clock::now());
+        raw_perf_counters.m_batch_sizes.emplace_back(sampler_output.num_generated_tokens);
     }
 
     stream_generated_tokens();
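The bookkeeping change above can be mimicked in a short Python sketch (illustrative names, not the real C++ types): the (timestamp, token-count) pair for a decoding step is now recorded after sampling, using the sampler's reported `num_generated_tokens`, rather than right after `infer()` with the raw batch size.

```python
# Illustrative sketch of the lm_encoding.cpp change; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class RawCounters:
    new_token_times: list = field(default_factory=list)
    batch_sizes: list = field(default_factory=list)

def record_step(counters: RawCounters, now: float, num_generated_tokens: int) -> None:
    # one entry per decoding step, sized by what the sampler actually produced
    counters.new_token_times.append(now)
    counters.batch_sizes.append(num_generated_tokens)

counters = RawCounters()
num_beams = 4
# hypothetical beam search: each step samples num_beams candidates internally
# but delivers one effective output token to the user
for t in (0.137, 0.167, 0.197):
    record_step(counters, t, num_generated_tokens=1)

assert sum(counters.batch_sizes) == 3               # effective tokens
assert sum(counters.batch_sizes) != 3 * num_beams   # not per-beam counts
```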

src/cpp/src/perf_metrics.cpp

Lines changed: 17 additions & 12 deletions
@@ -12,31 +12,33 @@ namespace genai {
 
 ov::genai::MeanStdPair calc_mean_and_std(const std::vector<ov::genai::MicroSeconds>& durations) {
     if (durations.size() == 0) {
-        return {-1, -1};
+        return {-1.0f, -1.0f};
     }
     // Accepts time durations in microseconds and returns standard deviation and mean in milliseconds.
-    float mean = std::accumulate(durations.begin(), durations.end(), 0.0f,
-        [](const float& acc, const ov::genai::MicroSeconds& duration) -> float {
-            return acc + duration.count() / 1000.0f;
+    double mean = std::accumulate(durations.begin(), durations.end(), 0.0,
+        [](const double& acc, const ov::genai::MicroSeconds& duration) -> double {
+            return acc + duration.count() / 1000.0;
         });
     mean /= durations.size();
 
-    float sum_square_durations = std::accumulate(durations.begin(), durations.end(), 0.0f,
-        [](const float& acc, const ov::genai::MicroSeconds& duration) -> float {
-            auto d = duration.count() / 1000.0f;
+    double sum_square_durations = std::accumulate(durations.begin(), durations.end(), 0.0,
+        [](const double& acc, const ov::genai::MicroSeconds& duration) -> double {
+            auto d = duration.count() / 1000.0;
             return acc + d * d;
         });
-    float std = std::sqrt(sum_square_durations / durations.size() - mean * mean);
-    return {mean, std};
+    double std = std::sqrt(sum_square_durations / durations.size() - mean * mean);
+    return {static_cast<float>(mean), static_cast<float>(std)};
 }
 
 ov::genai::SummaryStats calc_full_stat(const std::vector<ov::genai::MicroSeconds>& durations) {
     if (durations.size() == 0) {
-        return {-1, -1, -1, -1};
+        return {-1.0f, -1.0f, -1.0f, -1.0f};
     }
-    auto minmax = std::minmax_element(durations.begin(), durations.end());
     auto meanstd = calc_mean_and_std(durations);
-    return {meanstd.mean, meanstd.std, minmax.first->count() / 1000.0f, minmax.second->count() / 1000.0f};
+    auto minmax = std::minmax_element(durations.begin(), durations.end());
+    float min = static_cast<float>(minmax.first->count() / 1000.0);
+    float max = static_cast<float>(minmax.second->count() / 1000.0);
+    return {meanstd.mean, meanstd.std, min, max};
 }
 
 float PerfMetrics::get_load_time() {
@@ -116,11 +118,14 @@ void PerfMetrics::evaluate_statistics(std::optional<TimePoint> start_time) {
         auto start_time_val = *start_time;
         auto& tok_times = raw_metrics.m_new_token_times;
         auto& batch_sizes = raw_metrics.m_batch_sizes;
+
+        raw_metrics.m_durations.clear();
         raw_metrics.m_durations.reserve(tok_times.size());
 
         auto ttft = tok_times[0] - start_time_val;
         raw_metrics.m_times_to_first_token.clear();
         raw_metrics.m_times_to_first_token.emplace_back(ttft);
+
        num_generated_tokens = batch_sizes[0];
 
        // The very first infer request (prefill stage) is slower than subsequent ones since we process a sequence of tokens.
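The switch from `float` to `double` accumulators in `calc_mean_and_std` matters because the naive variance formula E[x²] − (E[x])² cancels catastrophically in 32-bit precision when durations are large and nearly equal. A hedged NumPy demonstration (synthetic durations, not real measurements):

```python
# Demonstrating why 32-bit accumulation breaks E[x^2] - E[x]^2; the durations
# here are synthetic (~50 s each with sub-millisecond spread).
import numpy as np

durations_ms = np.full(1000, 50_000.0) + np.linspace(0.0, 0.5, 1000)

def std_naive(xs):
    # same one-pass formula as calc_mean_and_std
    mean = xs.sum() / xs.size
    mean_sq = (xs * xs).sum() / xs.size
    return np.sqrt(mean_sq - mean * mean)

f32 = std_naive(durations_ms.astype(np.float32))
f64 = std_naive(durations_ms.astype(np.float64))
true_std = durations_ms.std()  # two-pass reference, ~0.144 ms

# In float32 the two ~2.5e9 terms are quantized so coarsely that their
# difference comes out as 0, a wildly wrong value, or negative (NaN after
# sqrt); float64 stays close to the true standard deviation.
assert abs(f64 - true_std) < 0.01
assert np.isnan(f32) or abs(f32 - true_std) > 0.1
```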

src/cpp/src/sampling/sampler.cpp

Lines changed: 4 additions & 1 deletion
@@ -274,7 +274,6 @@ void Sampler::GroupBeamSearcher::select_next_tokens(const ov::Tensor& logits,
     for (Group& group : m_groups) {
         if (!group.done) {
             for (Beam& beam : group.ongoing) {
-                sampler_output.num_generated_tokens++;
                 uint64_t parent_seq_id = beam.m_sequence->get_id();
 
                 // here we need to map index of sequence in beam search group(s) and sequence group
@@ -903,6 +902,10 @@ SequenceGroupSamplingInfo Sampler::sample_from_sequence_group(SequenceGroup::Ptr
         beam_searcher = &m_beam_search_info.at(request_id);
     }
 
+    if (!sequence_group->has_finished()) {
+        sg_sampling_info.sampler_output.num_generated_tokens++;
+    }
+
     // current algorithm already adds new tokens to running sequences and
     beam_searcher->select_next_tokens(sequence_group_logits, sg_sampling_info.sampler_output, stop_strings);

0 commit comments
