
The MTEB benchmark is dead

40 points by herecomethefuzz, 5 months ago

5 comments

cuuupid, 5 months ago
It has been for a while; we ended up building our own test set to evaluate embedding models on our domain.

What we realized after doing this is that MTEB has always been a poor indicator, as embedding model performance varies wildly in-domain compared to out-of-domain. You'll get decent performance (let's say 70%) with most models, but eking out gains beyond that is domain-dependent more than it is model-dependent.

Personally I recommend NV-Embed because it's easy to deploy and to get the other performance measurements (e.g. speed) to a high spec. You can then enrich the data itself, e.g. by using an LLM to create standardized artifacts that point back to the original text, kind of like an "embedding symlink."

Our observation has widely been that after standardizing data, the best-n models mostly perform the same.
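
A minimal sketch of the kind of in-domain evaluation described above: recall@k over a small hand-labeled set of query/document pairs, scored by cosine similarity over normalized embeddings. The model name, corpus, and labeled pairs below are placeholders for illustration, not anything from the comment; the point is only that a handful of pairs from your own domain can be compared directly across candidate models.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Any embedding model under evaluation; "all-MiniLM-L6-v2" is just a stand-in.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Tiny illustrative corpus and hand-labeled (query, relevant doc id) pairs.
    corpus = {
        "d1": "Reset a forgotten account password",
        "d2": "Configure single sign-on with Okta",
        "d3": "Export audit logs to CSV",
    }
    labeled = [
        ("how do I change my password", "d1"),
        ("download activity logs", "d3"),
    ]

    doc_ids = list(corpus)
    doc_vecs = model.encode([corpus[d] for d in doc_ids], normalize_embeddings=True)

    def recall_at_k(pairs, k=1):
        # Fraction of queries whose labeled document appears in the top-k results.
        hits = 0
        for query, gold in pairs:
            q_vec = model.encode([query], normalize_embeddings=True)
            scores = (doc_vecs @ q_vec.T).ravel()  # cosine similarity, since vectors are normalized
            top = np.argsort(-scores)[:k]
            hits += gold in {doc_ids[i] for i in top}
        return hits / len(pairs)

    print(f"recall@1 = {recall_at_k(labeled, k=1):.2f}")

Swapping the model string for each candidate (NV-Embed or otherwise) and re-running gives a direct, domain-specific comparison.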
0xab, 5 months ago
Datasets need to stop shipping with any training sets at all! And their licenses should forbid anyone from using the test set to update the parameters of any model.

We did this with ObjectNet (https://objectnet.dev/) years ago. It's only a test set; no training set is provided at all. Back then it was very controversial and we were given a hard time for it initially. Now it's more accepted. Time to make this idea mainstream.

No more training sets. Everything should be out of domain.
minimaxir, 5 months ago
The MTEB benchmark was never that great, since embeddings are used for specific, domain-dependent tasks (e.g. search/clustering) that can't really be represented well in a generalized test, even more so than LLM next-token-prediction benchmarks, which aren't great either.

As with all LLM models and their subproducts, the only way to ensure good results is to test yourself, ideally with less subjective, real-world feedback metrics.
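
One hedged sketch of what "real-world feedback metrics" could look like in practice, assuming you log which retrieved result a user actually clicked for each query: mean reciprocal rank over those logged sessions. The log format and values here are invented purely for illustration.

    from statistics import mean

    # Each hypothetical log entry records the ranked results shown for a query
    # and the result the user actually clicked (None if nothing was clicked).
    logs = [
        {"shown": ["d7", "d2", "d9"], "clicked": "d2"},
        {"shown": ["d4", "d1", "d8"], "clicked": "d4"},
        {"shown": ["d3", "d6", "d5"], "clicked": None},
    ]

    def mrr(entries):
        # Mean reciprocal rank of the clicked result; 0 when nothing was clicked.
        ranks = []
        for e in entries:
            if e["clicked"] in e["shown"]:
                ranks.append(1 / (e["shown"].index(e["clicked"]) + 1))
            else:
                ranks.append(0.0)
        return mean(ranks)

    print(f"MRR over logged sessions = {mrr(logs):.2f}")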
artine, 5 months ago
I'm not closely familiar with this benchmark, but data leakage in machine learning is far too easy to introduce accidentally, even with the best of intentions. It really does require diligence at every stage of experiment and model design to strictly firewall all test data from any and all training influence. So it's not surprising when leakage breaks highly publicized benchmarks.
RevEng, 5 months ago
I feel this is common throughout all of training, even on public data. Every time we talk about something specific at length, it becomes part of the training data, and that influences the models. For example, ask a problem about a butterfly flapping its wings causing a tornado and all modern LLMs immediately recognize the classic example of chaos theory, but change the entities and suddenly it's not so smart. The same goes for the current fixation on the number of Rs in strawberry.

There was recently a post showing how an LLM could actively try to deceive the user to hide its conflicting alignment, and a chain-of-thought-style prompt showed how it did this very deliberately. However, the thought process it produced and the wording sounded exactly like every example of this theoretical alignment problem. Given that an LLM chooses the most probable tokens based on what it has seen in training, could it be that we unintentionally trained it to respond this way?