How do I add a test dataset and benchmark that include multiple special metrics? #3305

DunZhang · 2025-10-10T08:42:40Z

Dear MTEB Team,

I would like to add the PosIR benchmark to MTEB in the near future. Details about this benchmark are available in this PR: #3147.

PosIR defines at least two main metrics:

NDCG@10: retrieval performance
PSI: a measure of positional bias

Given this, we would like the leaderboard to surface both metrics with the following columns:

Borda Rank (NDCG)
Borda Rank (PSI)
Mean by TaskType (NDCG)
Mean by TaskType (PSI)

Would MTEB support this change? Is there a straightforward way to implement it? Should I also update the MTEB leaderboard UI code?

Thank you for your guidance.

Best regards,
infgrad

Answered by KennethEnevoldsen

KennethEnevoldsen · 2025-10-10T09:56:34Z

Maintainer

With the recent addition of RTEB, we began allowing custom summary tables. Several benchmarks already use this, including MIEB, RTEB, and HUME.

You can see an example of how it is implemented here:

mteb/mteb/benchmarks/benchmark.py, line 113 in d2c704c:

class MIEBBenchmark(Benchmark):

I think it should cover your use case.

6 replies

KennethEnevoldsen Oct 10, 2025
Maintainer

I agree, keeping the Borda rank as is seems reasonable; however, we can change its main metric if needed.

DunZhang Oct 11, 2025
Author

I’ve reviewed the code at the referenced link and believe it satisfies my requirements. Additionally, because NDCG and PSI are fundamentally different metrics, both “Borda Rank (NDCG)” and “Borda Rank (PSI)” should be included in the summary table. Which metric to use for ranking should be determined by the user’s specific use case.
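To make the requested columns concrete, here is a minimal sketch of how such a table could be computed. It is plain Python and does not use the MTEB API; the rows, model and task names, and the `ROWS`, `borda_rank`, `mean_by_task_type`, and `summary_table` names are all hypothetical. It also assumes that a lower PSI means less positional bias, which should be flipped if PosIR defines it the other way.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (model, task, task_type, ndcg_at_10, psi). Made-up numbers.
ROWS = [
    ("model-a", "task-1", "Retrieval", 0.60, 0.30),
    ("model-a", "task-2", "Reranking", 0.50, 0.20),
    ("model-b", "task-1", "Retrieval", 0.70, 0.10),
    ("model-b", "task-2", "Reranking", 0.55, 0.40),
    ("model-c", "task-1", "Retrieval", 0.50, 0.20),
    ("model-c", "task-2", "Reranking", 0.30, 0.10),
]
METRIC_COLUMN = {"NDCG": 3, "PSI": 4}
# Assumption: lower PSI is better (less positional bias). Flip if that is wrong.
HIGHER_IS_BETTER = {"NDCG": True, "PSI": False}


def borda_rank(rows, metric):
    """Rank models by summed Borda points (models beaten per task) for one metric."""
    col = METRIC_COLUMN[metric]
    sign = 1.0 if HIGHER_IS_BETTER[metric] else -1.0
    per_task = defaultdict(dict)
    for row in rows:
        per_task[row[1]][row[0]] = sign * row[col]
    points = defaultdict(float)
    for task_scores in per_task.values():
        for model, score in task_scores.items():
            others = [s for m, s in task_scores.items() if m != model]
            # One point per model beaten on this task, half a point per tie.
            points[model] += sum(
                1.0 if score > s else 0.5 if score == s else 0.0 for s in others
            )
    # Competition ranking: 1 + number of models with strictly more points.
    return {
        model: 1 + sum(p > pts for p in points.values())
        for model, pts in points.items()
    }


def mean_by_task_type(rows, metric):
    """Average within each task type first, then across task types."""
    col = METRIC_COLUMN[metric]
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row[0]][row[2]].append(row[col])
    return {
        model: mean(mean(values) for values in by_type.values())
        for model, by_type in grouped.items()
    }


def summary_table(rows):
    """Build one row per model with a Borda rank and a mean for each metric."""
    table = defaultdict(dict)
    for metric in METRIC_COLUMN:
        ranks = borda_rank(rows, metric)
        means = mean_by_task_type(rows, metric)
        for model in ranks:
            table[model][f"Borda Rank ({metric})"] = ranks[model]
            table[model][f"Mean by TaskType ({metric})"] = means[model]
    return dict(table)


table = summary_table(ROWS)
for model in sorted(table):
    print(model, table[model])
```

With these made-up numbers the two rankings disagree: model-b ranks first on NDCG while model-c ranks first on PSI, which is the situation where showing both columns side by side is useful.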

DunZhang Oct 11, 2025
Author

As an update, our latest experiments indicate that the degree of positional bias varies across different document lengths. We plan to present positional bias by document length in a summary table to more comprehensively evaluate model performance.

In short, the summary table will likely include more than four metrics. We will finalize the ordering and the specific column names once the optimization of our test data is complete.

KennethEnevoldsen Oct 11, 2025
Maintainer

Sounds great, looking forward to seeing the PR. Regarding the Borda rank, I think it is a minor thing. I would be happy to have two if needed, but it does correlate reasonably well with the mean (which has the benefit of being continuous) for metrics on a similar scale.
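The "similar scale" caveat can be illustrated with a small sketch. The scores and the `order_by_mean` and `order_by_borda` helpers below are hypothetical and unrelated to MTEB's own code. When all tasks share a scale, the two orderings agree here; once one task is reported on a 0-100 scale, the mean is dominated by that task, while the Borda count still weights every task equally.

```python
from statistics import mean

# Hypothetical per-task scores for two models (made-up numbers).
SIMILAR_SCALE = {"model-a": [0.80, 0.60], "model-b": [0.70, 0.50]}
# The third task is on a 0-100 scale, and model-b wins only that task.
MIXED_SCALE = {"model-a": [0.80, 0.70, 40.0], "model-b": [0.70, 0.60, 60.0]}


def order_by_mean(scores):
    """Models sorted by their mean score, best first."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)


def order_by_borda(scores):
    """Models sorted by Borda points (models beaten, summed over tasks)."""
    n_tasks = len(next(iter(scores.values())))
    points = {m: 0 for m in scores}
    for t in range(n_tasks):
        for m in scores:
            points[m] += sum(scores[m][t] > scores[o][t] for o in scores if o != m)
    return sorted(scores, key=lambda m: points[m], reverse=True)


print(order_by_mean(SIMILAR_SCALE), order_by_borda(SIMILAR_SCALE))
print(order_by_mean(MIXED_SCALE), order_by_borda(MIXED_SCALE))
```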

DunZhang Nov 11, 2025
Author

We’re currently working on it. We’ve translated the dataset into approximately ten languages and are conducting translation quality checks as well as query augmentation😄😄😄.
