refactor: split BRIGHT benchmark into individual subset tasks
#3285
base: main
```diff
@@ -1198,6 +1198,42 @@
     """,
 )
 
+
+BRIGHT_SUBSETS = Benchmark(
+    name="BRIGHT (subsets)",
+    display_name="Reasoning Retrieval (subsets)",
+    tasks=get_tasks(
+        tasks=[
+            "BrightBiologyRetrieval",
+            "BrightEarthScienceRetrieval",
+            "BrightEconomicsRetrieval",
+            "BrightPsychologyRetrieval",
+            "BrightRoboticsRetrieval",
+            "BrightStackoverflowRetrieval",
+            "BrightSustainableLivingRetrieval",
+            "BrightPonyRetrieval",
+            "BrightLeetcodeRetrieval",
+            "BrightAopsRetrieval",
+            "BrightTheoremQATheoremsRetrieval",
+            "BrightTheoremQAQuestionsRetrieval",
+        ],
+    ),
+    description="""BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval (Individual Subsets).
+    This benchmark contains individual subset tasks for each domain in the BRIGHT benchmark,
+    allowing for domain-specific evaluation. The subsets include: biology, earth science, economics,
+    psychology, robotics, stackoverflow, sustainable living, pony, leetcode, aops, theoremqa_theorems,
+    and theoremqa_questions.
+    """,
+    reference="https://brightbenchmark.github.io/",
+    citation=r"""
+@article{su2024bright,
+  author = {Su, Hongjin and Yen, Howard and Xia, Mengzhou and Shi, Weijia and Muennighoff, Niklas and Wang, Han-yu and Liu, Haisu and Shi, Quan and Siegel, Zachary S and Tang, Michael and others},
+  journal = {arXiv preprint arXiv:2407.12883},
+  title = {Bright: A realistic and challenging benchmark for reasoning-intensive retrieval},
+  year = {2024},
+}
+""",
+)
+
+
 BRIGHT_LONG = Benchmark(
     name="BRIGHT (long)",
     tasks=MTEBTasks(
```
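The hunk above registers each BRIGHT domain as its own task inside a named `Benchmark` container. As a self-contained illustration of why that split is useful, here is a minimal sketch using a hypothetical stand-in dataclass and `get_tasks` helper (the real classes live in the mteb package and carry more fields and behavior):

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-in for mteb's Benchmark container, kept to the
# fields used in this diff; not the real mteb class.
@dataclass
class Benchmark:
    name: str
    tasks: List[str]
    display_name: Optional[str] = None


def get_tasks(tasks: List[str]) -> List[str]:
    # In mteb this resolves task names to task objects; here we only
    # check the naming convention the BRIGHT subset tasks follow.
    for t in tasks:
        assert t.startswith("Bright") and t.endswith("Retrieval"), t
    return list(tasks)


BRIGHT_SUBSETS = Benchmark(
    name="BRIGHT (subsets)",
    display_name="Reasoning Retrieval (subsets)",
    tasks=get_tasks(
        tasks=[
            "BrightBiologyRetrieval",
            "BrightEarthScienceRetrieval",
            "BrightEconomicsRetrieval",
            # ...nine more per-domain tasks in the actual diff
        ]
    ),
)

# With one task per domain, a single domain can be selected in isolation.
biology_only = [t for t in BRIGHT_SUBSETS.tasks if "Biology" in t]
print(biology_only)  # ['BrightBiologyRetrieval']
```

This is the property the PR is after: domain-specific evaluation falls out of simple filtering over task names instead of re-slicing one monolithic task.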
```diff
@@ -1227,6 +1263,37 @@
     """,
 )
 
+
+BRIGHT_SUBSETS_LONG = Benchmark(
+    name="BRIGHT (long subsets)",
+    display_name="Reasoning Retrieval (long subsets)",
+    tasks=get_tasks(
+        tasks=[
+            "BrightBiologyLongRetrieval",
+            "BrightEarthScienceLongRetrieval",
+            "BrightEconomicsLongRetrieval",
+            "BrightPsychologyLongRetrieval",
+            "BrightRoboticsLongRetrieval",
+            "BrightStackoverflowLongRetrieval",
+            "BrightSustainableLivingLongRetrieval",
+            "BrightPonyLongRetrieval",
+        ],
+    ),
+    description="""BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval (Long Individual Subsets).
+    This benchmark contains individual subset tasks for each domain in the BRIGHT benchmark with long documents,
+    allowing for domain-specific evaluation with longer context. The subsets include: biology, earth science,
+    economics, psychology, robotics, stackoverflow, sustainable living, and pony.
+    """,
+    reference="https://brightbenchmark.github.io/",
+    citation=r"""
+@article{su2024bright,
+  author = {Su, Hongjin and Yen, Howard and Xia, Mengzhou and Shi, Weijia and Muennighoff, Niklas and Wang, Han-yu and Liu, Haisu and Shi, Quan and Siegel, Zachary S and Tang, Michael and others},
+  journal = {arXiv preprint arXiv:2407.12883},
+  title = {Bright: A realistic and challenging benchmark for reasoning-intensive retrieval},
+  year = {2024},
+}
+""",
+)
+
+
 CODE_RAG = Benchmark(
     name="CodeRAG",
     tasks=get_tasks(
```

Review thread on the `BRIGHT_SUBSETS_LONG` definition:

Contributor: Do we want to have both a long and a short (I would probably argue that we just have one with two different columns)

Member: Agree

Contributor (Author): Do you mean we don't need to create a separate

Member: Yes
```diff
@@ -1619,8 +1686,7 @@
             "TRECCOVID-NL",
         ],
     ),
-    description="BEIR-NL is a Dutch adaptation of the publicly available BEIR benchmark, created through automated "
-    "translation.",
+    description="BEIR-NL is a Dutch adaptation of the publicly available BEIR benchmark, created through automated translation.",
     reference="https://arxiv.org/abs/2412.08329",
     contacts=["nikolay-banar"],
     citation=r"""
```
Review thread on the new benchmarks:

Member: Hmm, from a user POV it is quite unclear which one to use here (would be great to replace the old one with these).

Author: Should I remove the existing `BRIGHT` benchmark and rename `BRIGHT_SUBSET` to `BRIGHT`?

Member: I would probably replace, but if the scores do not align 1-1, then I would version it (keep the two old ones, but name this BRIGHT(v1.1) and add to the description what has changed).

Author: Probably they won't match 1-1, because we've added instructions to the tasks.

Member: So let us do BRIGHT(v1.1) and add the description: "v1.1 refactors BRIGHT and BRIGHT(long) into a single benchmark (with separate columns) and added prompts to individual tasks".
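The agreed direction can be sketched as follows. This is only an illustration of the proposed v1.1 task layout (task names taken from the diff above), not code from the PR, and the `Benchmark` wiring is omitted:

```python
# Per-domain BRIGHT tasks added in this PR (standard documents).
BRIGHT_TASKS = [
    "BrightBiologyRetrieval",
    "BrightEarthScienceRetrieval",
    "BrightEconomicsRetrieval",
    "BrightPsychologyRetrieval",
    "BrightRoboticsRetrieval",
    "BrightStackoverflowRetrieval",
    "BrightSustainableLivingRetrieval",
    "BrightPonyRetrieval",
    "BrightLeetcodeRetrieval",
    "BrightAopsRetrieval",
    "BrightTheoremQATheoremsRetrieval",
    "BrightTheoremQAQuestionsRetrieval",
]

# Long-document variants exist only for the first eight domains.
BRIGHT_LONG_TASKS = [
    name.replace("Retrieval", "LongRetrieval") for name in BRIGHT_TASKS[:8]
]

# A single BRIGHT(v1.1) benchmark would carry both lists, letting the
# leaderboard render standard and long results as separate columns.
BRIGHT_V1_1_TASKS = BRIGHT_TASKS + BRIGHT_LONG_TASKS
print(len(BRIGHT_V1_1_TASKS))  # 20
```

Folding both lists into one benchmark matches the reviewer's suggestion of one entry with two columns, while per-domain tasks remain individually selectable.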