This can be caused by task results being duplicated between revisions, combined with the fact that you're selecting all splits and subsets for these tasks (FiQA has dev, test, and train splits).

import mteb

b = mteb.get_benchmark("MTEB(eng, v2)")
retrieval = [t for t in b.tasks if t.metadata.type == "Retrieval"]
results = mteb.load_results(models=["sentence-transformers/all-MiniLM-L12-v2"], tasks=retrieval).join_revisions()

scores = []
for model_results in results.model_results: # multiple revisions
    for task_result in model_results.task_results:
        for split_name, split_subsets in task_result.scores.items():
            for task_subset in split_subsets:
                if "main_score" in task_subset:  # for …
