vamplabAI
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 33 additions & 0 deletions
diff --git a/assets/simpleqa_benchmark_comprasion.png b/assets/simpleqa_benchmark_comprasion.png
183 KB
diff --git a/assets/simpleqa_result.xlsx b/assets/simpleqa_result.xlsx
948 KB
@@ -283,6 +283,7 @@ benchmark/.env
 benchmark/simpleqa_bench_results*.xlsx
 benchmark/~$*.xlsx
 *.xlsx
+!assets/*.xlsx
 benchmark_logs/
 
 # Temporary scripts
 
@@ -566,6 +566,39 @@ ______________________________________________________________________
 
 ## 📊 Benchmarking with SimpleQA
 
+We evaluated the agent on the [SimpleQA](https://huggingface.co/datasets/basicv8vc/SimpleQA) dataset, a factuality benchmark that measures how well language models answer short, fact-seeking questions.
+
+### Our Benchmark Results
+
+![SimpleQA Benchmark Comparison](assets/simpleqa_benchmark_comprasion.png)
+
+**Performance Metrics:**
+- **Accuracy:** 86.08%
+- **Correct:** 3,724 answers
+- **Incorrect:** 554 answers
+- **Not Attempted:** 48 answers
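+
+As a consistency check on the figures above (our own arithmetic, assuming accuracy is the share of correct answers among all graded questions, with not-attempted answers counted in the denominator):
+
+$$
+\text{Accuracy} = \frac{3724}{3724 + 554 + 48} = \frac{3724}{4326} \approx 0.8608 = 86.08\%
+$$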
+
+**Benchmark Configuration:**
+
+| Component | Parameter | Value |
+|-----------|-----------|-------|
+| **Search Engine** | Provider | Tavily Basic Search |
+| | Scraping Enabled | Yes |
+| | Max Pages | 5 |
+| | Content Limit | 33,000 characters |
+| **Agent** | Name | sgr_tool_calling_agent |
+| | Max Steps | 20 |
+| **LLM (Agent)** | Model | gpt-4o-mini |
+| | Max Tokens | 12,000 |
+| | Temperature | 0.2 |
+| **LLM (Judge)** | Model | gpt-4o |
+| | Max Tokens | Default |
+| | Temperature | Default |
+
+Detailed benchmark results are available in [this spreadsheet](assets/simpleqa_result.xlsx).
+
+---
+
 The project includes benchmarking capabilities using the **SimpleQA** dataset from DeepMind/Kaggle. The benchmark automatically runs the SGR agent on each question and uses an LLM judge to grade the answers.
 
 ### What is SimpleQA?