Adding new benchmark MTEB-NL #3339
Replies: 3 comments 7 replies
Yes, you can start to integrate your benchmark now, but we're planning a release
Sounds great @nikolay-banar - nice work on the article. As @Samoed says, I would wait a week until we have v2 ready. However, just a few questions:
I have a question regarding prompts. In our experiments, we fed e5-style prompts directly to the models, but I guess that is not the best solution for MTEB, given the default prompts. Does it make sense to add the prompts now to every dataset class? For example:

```python
class ArguAnaNL(AbsTaskRetrieval):
    prompt = {"query": "Given a claim, find documents that refute the claim"}
```

The results from MTEB-NL will not be affected, because I did not submit any instruct models. However, BEIR-NL has some instruct submissions, so I will need to rerun some experiments.
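Since the question turns on how e5-style prompts reach the model, here is a minimal sketch of what prepending an instruction prompt to queries looks like. The helper names (`build_e5_query`, `format_queries`) are hypothetical illustrations, not part of the MTEB API; only the `prompt` dict mirrors the snippet above.

```python
# Hedged sketch (not the real MTEB API): shows how an e5-style
# instruction prompt could be prepended to queries before encoding.

def build_e5_query(instruction: str, query: str) -> str:
    # e5-instruct models typically expect "Instruct: {instruction}\nQuery: {query}"
    return f"Instruct: {instruction}\nQuery: {query}"

# Per-task query prompt, as in the dataset class snippet above
prompt = {"query": "Given a claim, find documents that refute the claim"}

def format_queries(queries: list[str]) -> list[str]:
    # Apply the task's query prompt to every query before embedding
    return [build_e5_query(prompt["query"], q) for q in queries]

formatted = format_queries(["Is coffee bad for your health?"])
print(formatted[0])
```

The practical point is that instruct models embed the prompted string, so changing the per-task prompt changes the query embeddings, which is why earlier instruct-model runs would need rerunning.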
Hi all,
Recently, we released the Massive Text Embedding Benchmark for Dutch (MTEB-NL). You can find the paper here.
We are planning to submit it to some conferences, so it hasn’t been peer-reviewed yet.
Would it make sense to start integrating the benchmark into the MTEB leaderboard now, or should we wait until after peer review?
Sincerely,
Nicolae