@@ -573,31 +573,32 @@ We conducted a comprehensive benchmark evaluation using the [SimpleQA](https://h
 ![SimpleQA Benchmark Comparison](assets/simpleqa_benchmark_comprasion.png)
 
 **Performance Metrics:**
+
 - **Accuracy:** 86.08%
 - **Correct:** 3,724 answers
 - **Incorrect:** 554 answers
 - **Not Attempted:** 48 answers
 
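The reported accuracy is consistent with the three counts above if accuracy is taken as correct answers divided by all questions, including those not attempted. A minimal sketch of that arithmetic follows; the function name `simpleqa_accuracy` is illustrative and not part of the project.

```python
def simpleqa_accuracy(correct: int, incorrect: int, not_attempted: int) -> float:
    """Return accuracy in percent over all questions (assumed definition)."""
    total = correct + incorrect + not_attempted
    if total == 0:
        raise ValueError("no graded answers")
    return 100.0 * correct / total


# Counts taken from the Performance Metrics list.
accuracy = simpleqa_accuracy(correct=3724, incorrect=554, not_attempted=48)
print(f"{accuracy:.2f}%")  # 86.08%
```

The three counts sum to 4,326 questions, so 3,724 / 4,326 rounds to 86.08%.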
 **Benchmark Configuration:**
 
-| Component | Parameter | Value |
-|-----------|-----------|-------|
-| **Search Engine** | Provider | Tavily Basic Search |
-| | Scraping Enabled | Yes |
-| | Max Pages | 5 |
-| | Content Limit | 33,000 characters |
-| **Agent** | Name | sgr_tool_calling_agent |
-| | Max Steps | 20 |
-| **LLM (Agent)** | Model | gpt-4o-mini |
-| | Max Tokens | 12,000 |
-| | Temperature | 0.2 |
-| **LLM (Judge)** | Model | gpt-4o |
-| | Max Tokens | Default |
-| | Temperature | Default |
+| Component         | Parameter        | Value                  |
+| ----------------- | ---------------- | ---------------------- |
+| **Search Engine** | Provider         | Tavily Basic Search    |
+|                   | Scraping Enabled | Yes                    |
+|                   | Max Pages        | 5                      |
+|                   | Content Limit    | 33,000 characters      |
+| **Agent**         | Name             | sgr_tool_calling_agent |
+|                   | Max Steps        | 20                     |
+| **LLM (Agent)**   | Model            | gpt-4o-mini            |
+|                   | Max Tokens       | 12,000                 |
+|                   | Temperature      | 0.2                    |
+| **LLM (Judge)**   | Model            | gpt-4o                 |
+|                   | Max Tokens       | Default                |
+|                   | Temperature      | Default                |
 
 Detailed benchmark results are available in [this spreadsheet](assets/simpleqa_result.xlsx).
 
----
+______________________________________________________________________
 
 The project includes benchmarking capabilities using the **SimpleQA** dataset from DeepMind/Kaggle. The benchmark automatically runs the SGR agent on each question and uses an LLM judge to grade the answers.
 
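The run-then-judge flow described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the project's implementation: `run_benchmark`, `stub_agent`, and `stub_judge` are hypothetical names, the agent and judge are passed in as plain callables, and the grade labels mirror the Correct / Incorrect / Not Attempted categories reported in the metrics.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Assumed grade labels, mirroring the three reported categories.
GRADES = ("CORRECT", "INCORRECT", "NOT_ATTEMPTED")


def run_benchmark(
    questions: Iterable[Tuple[str, str]],
    run_agent: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> Counter:
    """Run the agent on each (question, reference) pair and tally judge grades."""
    tally = Counter({grade: 0 for grade in GRADES})
    for question, reference in questions:
        predicted = run_agent(question)                # agent produces an answer
        grade = judge(question, reference, predicted)  # judge grades it
        if grade not in GRADES:
            raise ValueError(f"unexpected grade: {grade!r}")
        tally[grade] += 1
    return tally


# Hypothetical stand-ins for the real agent and LLM judge.
_canned_answers = {"Capital of France?": "Paris", "2 + 2?": ""}


def stub_agent(question: str) -> str:
    return _canned_answers[question]


def stub_judge(question: str, reference: str, predicted: str) -> str:
    if not predicted:
        return "NOT_ATTEMPTED"
    return "CORRECT" if predicted == reference else "INCORRECT"


demo_questions = [("Capital of France?", "Paris"), ("2 + 2?", "4")]
result = run_benchmark(demo_questions, stub_agent, stub_judge)
print(dict(result))
```

Passing the agent and judge as callables keeps the loop independent of any particular model client, so stubs like these can stand in for real LLM calls during a dry run.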