News
Hugging Face Blog
AI
LLM
Agent
Dataset
Platform
Title
Introducing RTEB: A New Standard for Retrieval Evaluation
Frank Liu, Kenneth C. Enevoldsen, Roman Solomatin, Isaac Chung, Tom Aarsen, Zoltán Fődi
Published
2025/10/1 08:00:00
Source Type
blog
Language
en
Abstract

The performance of many AI applications, from RAG and agents to recommendation systems, is fundamentally limited by the quality of search and retrieval. As such, accurately measuring the retrieval quality of embedding models is a common pain point for developers. How do you really know how well a model will perform in the wild? This is where things get tricky. The current standard for evaluation often relies on a model's "zero-shot" performance on public benchmarks.

Body

The performance of many AI applications, from RAG and agents to recommendation systems, is fundamentally limited by the quality of search and retrieval. As such, accurately measuring the retrieval quality of embedding models is a common pain point for developers. How do you really know how well a model will perform in the wild? This is where things get tricky. The current standard for evaluation often relies on a model's "zero-shot" performance on public benchmarks. However, this is, at best, an approximation of a model's true generalization capabilities. When models are repeatedly evaluated against the same public datasets, a gap emerges between their reported scores and their actual performance on new, unseen data. Because of this, models with a low zero-shot score[1] may perform very well on a benchmark without generalizing to new problems; for this reason, models with slightly lower benchmark performance but a higher zero-shot score are often recommended instead.

To address these challenges, we are excited to introduce the Retrieval Embedding Benchmark (RTEB). Its goal is to provide a new, reliable, high-quality benchmark that measures the true retrieval accuracy of embedding models. RTEB pairs open, public datasets with private, held-out ones, and this hybrid approach encourages the development of models with broad, robust generalization. A model with a significant performance drop between the open and the private datasets would suggest overfitting, providing a clear signal to the community. This is already apparent with some models, which show a notable drop in performance on RTEB's private datasets.

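To make that overfitting signal concrete, here is a minimal sketch of how such an open-vs-private gap could be flagged. The model names, scores, and the 0.05 threshold are all hypothetical; RTEB does not publish a fixed cutoff.

```python
# Hypothetical sketch: flag models whose retrieval quality drops sharply
# between a benchmark's open (public) and private (held-out) datasets.
# The NDCG@10-style scores below are invented for illustration only.

open_scores = {"model-a": 0.71, "model-b": 0.68}     # mean score, open datasets
private_scores = {"model-a": 0.69, "model-b": 0.54}  # mean score, private datasets

GAP_THRESHOLD = 0.05  # assumed cutoff, not an official RTEB value

for name, open_score in open_scores.items():
    gap = open_score - private_scores[name]
    verdict = "possible overfitting" if gap > GAP_THRESHOLD else "generalizes well"
    print(f"{name}: open={open_score:.2f} private={private_scores[name]:.2f} "
          f"gap={gap:+.2f} -> {verdict}")
```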
RTEB is designed with a particular emphasis on enterprise use cases. Instead of a complex hierarchy, it uses simple groups for clarity, and a single dataset can belong to multiple groups (e.g., a German law dataset exists in both the "law" and "German" groups). A complete list of the datasets can be found below.

We plan to continually update both the open and the private portions with different categories of datasets, and we actively encourage participation from the community; please open an issue on the MTEB repository on GitHub if you would like to suggest other datasets.

RTEB is launching today in beta. We believe building a robust benchmark is a community effort, and we plan to evolve RTEB based on feedback from developers and researchers alike. We encourage you to share your thoughts, suggest new datasets, flag issues in existing datasets, and help us build a more reliable standard for everyone. Please feel free to join the discussion or open an issue in the MTEB repository on GitHub. To highlight areas for improvement, we also want to be transparent about RTEB's current limitations and our plans for the future.

The RTEB leaderboard is available today on Hugging Face as part of the new Retrieval section of the MTEB leaderboard. We invite you to check it out, evaluate your models, and join us in building a better, more reliable benchmark for the entire AI community.

[1] The zero-shot score is the proportion of the evaluation set on which the model provider has explicitly stated the model was not trained; "trained on" here typically refers only to a dataset's training split.
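As a rough illustration of footnote [1], the sketch below computes a zero-shot score as the fraction of evaluation datasets that a provider has not declared as training data. The dataset names and the declared set are invented for the example.

```python
# Hypothetical sketch of the zero-shot score from footnote [1]: the share of
# evaluation datasets NOT explicitly declared by the provider as training data
# (where "training data" typically counts only a dataset's training split).

evaluation_datasets = {"legal-qa-de", "finqa-style", "health-qa", "code-apps"}  # invented
declared_training = {"finqa-style"}  # provider-declared training data (invented)

zero_shot_score = 1 - len(evaluation_datasets & declared_training) / len(evaluation_datasets)
print(f"zero-shot score: {zero_shot_score:.0%}")  # -> 75%
```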
A fun observation: English-only models tend to perform worse on the private datasets, while multilingual models hold up better on them. Is that because the baseline, MTEB, is an English-only benchmark, while RTEB is multilingual? It is natural for multilingual models to do better on a multilingual benchmark than on an English one. There is also an argument that multilingual models are more robust to phenomena such as code-switching and foreign terms.
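If you would like to evaluate your own model, a minimal run with the mteb Python package looks roughly like this. Exact APIs may differ between package versions, and "NFCorpus" is simply one existing MTEB retrieval task used as an example; it is not specific to RTEB.

```python
# Minimal sketch of a retrieval evaluation with the mteb package
# (pip install mteb sentence-transformers).
import mteb
from sentence_transformers import SentenceTransformer

# Any sentence-transformers embedding model can be plugged in here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pick one or more retrieval tasks; "NFCorpus" is used purely as an example.
tasks = mteb.get_tasks(tasks=["NFCorpus"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```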

Resource Links
Careers: apply.workable.com/huggingface
Chung et al. (2025): arxiv.org/abs/2506.21182
MTEB repository on GitHub: github.com/embeddings-benchmark/mteb
MTEB issues: github.com/embeddings-benchmark/mteb/issues
Update on GitHub: github.com/huggingface/blog/blob/main/rteb.md
huggingface.co/datasets/Hello-SimpleAI/HC3
huggingface.co/datasets/Salesforce/wikisql
huggingface.co/datasets/SkelterLabsInc/JaQuAD
huggingface.co/datasets/almanach/hc3_french_ood
huggingface.co/datasets/clinia/CUREv1
huggingface.co/datasets/codeparrot/apps
huggingface.co/datasets/google-research-datasets/mbpp
huggingface.co/datasets/ibm/finqa
huggingface.co/datasets/irds/tripclick
huggingface.co/datasets/lavita/ChatDoctor-HealthCareMagic-100k
huggingface.co/datasets/mteb/AILA_casedocs
huggingface.co/datasets/mteb/AILA_statutes
huggingface.co/datasets/mteb/LegalQuAD
huggingface.co/datasets/mteb/legal_summarization
huggingface.co/datasets/mteb/miracl-hard-negatives
huggingface.co/datasets/openai/openai_humaneval
huggingface.co/datasets/virattt/financebench
huggingface.co/datasets/xlangai/DS-1000
huggingface.co/papers/2504.13128
Retrieval Embedding Benchmark (RTEB): huggingface.co/spaces/mteb/leaderboard
Original source page: huggingface.co/blog/rteb
Metadata
Source: Hugging Face Blog
Type: News
Extraction Status: raw
Keywords
AI
LLM
Agent
Dataset
Platform