We conducted a comprehensive benchmark evaluation using the [SimpleQA](https://huggingface.co/datasets/basicv8vc/SimpleQA) dataset, a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
Detailed benchmark results are available in [this spreadsheet](assets/simpleqa_result.xlsx).
The project includes benchmarking capabilities using the **SimpleQA** dataset from DeepMind/Kaggle. The benchmark automatically runs the SGR agent on each question and uses an LLM judge to grade the answers.
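The run-and-grade loop described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: `run_agent` and `judge_answer` are hypothetical stand-ins for the SGR agent call and the LLM-judge call, and the `problem`/`answer` field names are assumed to match the SimpleQA record schema.

```python
def run_agent(question: str) -> str:
    # Placeholder: the real benchmark invokes the SGR agent here.
    return "Paris"

def judge_answer(question: str, gold: str, predicted: str) -> str:
    # Placeholder: the real benchmark prompts an LLM judge, which in the
    # SimpleQA grading scheme returns CORRECT, INCORRECT, or NOT_ATTEMPTED.
    # Here we approximate it with a case-insensitive exact match.
    if predicted.strip().lower() == gold.strip().lower():
        return "CORRECT"
    return "INCORRECT"

def run_benchmark(dataset) -> dict:
    # Run the agent on every question, grade each answer, and
    # report the fraction graded CORRECT.
    grades = []
    for row in dataset:
        predicted = run_agent(row["problem"])
        grades.append(judge_answer(row["problem"], row["answer"], predicted))
    return {"accuracy": grades.count("CORRECT") / len(grades), "grades": grades}

# Toy one-question dataset in the assumed SimpleQA record shape.
sample = [{"problem": "What is the capital of France?", "answer": "Paris"}]
result = run_benchmark(sample)
print(result["accuracy"])
```

In the real benchmark, the judge is itself an LLM, which handles paraphrased or partially correct answers that an exact-match check would miss.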