Commit 4a165c5

add desc to README about bench work
maksimov maksim committed
1 parent 888c0cc, commit 4a165c5

4 files changed: +34 -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -283,6 +283,7 @@ benchmark/.env
 benchmark/simpleqa_bench_results*.xlsx
 benchmark/~$*.xlsx
 *.xlsx
+!assets/*.xlsx
 benchmark_logs/
 
 # Temporary scripts
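
The added `!assets/*.xlsx` line works only because it comes after the blanket `*.xlsx` rule: in a `.gitignore`, the last matching pattern decides, and a leading `!` re-includes a path. The sketch below is a deliberately simplified model of that rule, not Git's actual matcher. The helper name `is_ignored` and the sample paths are hypothetical, and the model ignores directory-level exclusion, `**`, and trailing-slash patterns.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Simplified gitignore model: the last matching pattern wins."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        if "/" in body:
            # Pattern with a slash: compare component by component so that
            # '*' never crosses a directory boundary.
            p_parts, f_parts = body.split("/"), path.split("/")
            matched = len(p_parts) == len(f_parts) and all(
                fnmatchcase(f, p) for f, p in zip(f_parts, p_parts)
            )
        else:
            # Pattern without a slash: matches the file name at any depth.
            matched = fnmatchcase(path.split("/")[-1], body)
        if matched:
            ignored = not negated
    return ignored


# The four spreadsheet-related patterns visible in the hunk, in file order.
PATTERNS = [
    "benchmark/simpleqa_bench_results*.xlsx",
    "benchmark/~$*.xlsx",
    "*.xlsx",
    "!assets/*.xlsx",
]

if __name__ == "__main__":
    for sample in (
        "assets/simpleqa_result.xlsx",
        "benchmark/simpleqa_bench_results_01.xlsx",
        "notes/report.xlsx",
    ):
        print(sample, "ignored" if is_ignored(sample, PATTERNS) else "tracked")
```

In a real checkout, `git check-ignore -v <path>` reports which pattern Git itself applies to a path, which is the authoritative check.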

README.md

Lines changed: 33 additions & 0 deletions
@@ -566,6 +566,39 @@ ______________________________________________________________________
 
 ## 📊 Benchmarking with SimpleQA
 
+We conducted a comprehensive benchmark evaluation using the [SimpleQA](https://huggingface.co/datasets/basicv8vc/SimpleQA) dataset - a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
+
+### Our Benchmark Results
+
+![SimpleQA Benchmark Comparison](assets/simpleqa_benchmark_comprasion.png)
+
+**Performance Metrics:**
+- **Accuracy:** 86.08%
+- **Correct:** 3,724 answers
+- **Incorrect:** 554 answers
+- **Not Attempted:** 48 answers
+
+**Benchmark Configuration:**
+
+| Component | Parameter | Value |
+|-----------|-----------|-------|
+| **Search Engine** | Provider | Tavily Basic Search |
+| | Scraping Enabled | Yes |
+| | Max Pages | 5 |
+| | Content Limit | 33,000 characters |
+| **Agent** | Name | sgr_tool_calling_agent |
+| | Max Steps | 20 |
+| **LLM (Agent)** | Model | gpt-4o-mini |
+| | Max Tokens | 12,000 |
+| | Temperature | 0.2 |
+| **LLM (Judge)** | Model | gpt-4o |
+| | Max Tokens | Default |
+| | Temperature | Default |
+
+Detailed benchmark results are available in [this spreadsheet](assets/simpleqa_result.xlsx).
+
+---
+
 The project includes benchmarking capabilities using the **SimpleQA** dataset from DeepMind/Kaggle. The benchmark automatically runs the SGR agent on each question and uses an LLM judge to grade the answers.
 
 ### What is SimpleQA?
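
The accuracy figure in the added README section can be checked against the three answer counts it lists. The sketch below assumes accuracy means correct answers divided by all graded questions, which is consistent with the reported 86.08%. The variable names are illustrative, and the "accuracy on attempted" value is not reported in the README; it is computed here only for comparison.

```python
# Counts copied from the "Performance Metrics" list in the README diff.
correct = 3724
incorrect = 554
not_attempted = 48

total = correct + incorrect + not_attempted  # every graded question
attempted = correct + incorrect              # questions the agent answered

# Assumed definition: share of all questions answered correctly.
accuracy = correct / total

# Extra metric, not in the README: share of attempted answers that are correct.
accuracy_on_attempted = correct / attempted

print(f"total questions:       {total}")
print(f"accuracy:              {accuracy:.2%}")
print(f"accuracy on attempted: {accuracy_on_attempted:.2%}")
```

The total of 4,326 questions follows directly from summing the three counts.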

assets/simpleqa_benchmark_comprasion.png

183 KB

assets/simpleqa_result.xlsx

948 KB
Binary file not shown.
