QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs. If you've been tracking Arabic LLM evaluation, you've probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we're measuring?
This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.

Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points motivated this work:

- **Absent quality validation.** Even benchmarks written natively in Arabic are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources.
- **Reproducibility gaps.** Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or to build on prior work.
- **Coverage fragmentation.** Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.

Validation is the methodological heart of QIMMA. Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark. We chose two judge models with strong Arabic capability but different training data compositions, so that their combined judgment is more robust than either alone. A judge flags a sample by scoring it below 7/10: samples flagged by both judges are dropped immediately, while samples flagged by only one judge proceed to Stage 2 (human review).
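To make the routing rule concrete, here is a minimal Python sketch of the Stage 1 triage logic. The `triage` function, the `judges` callables, and the returned labels are illustrative stand-ins rather than QIMMA's actual implementation; only the 7/10 threshold and the agreement rules come from the description above. In practice, each judge would wrap an LLM call that returns a 0-10 quality score for the sample.

```python
from typing import Callable

# A judge flags a sample when it scores the sample below 7/10.
THRESHOLD = 7

def triage(sample: str, judges: list[Callable[[str], int]]) -> str:
    """Stage 1 routing: keep the sample, drop it, or escalate to Stage 2."""
    flags = [score(sample) < THRESHOLD for score in judges]
    if all(flags):
        return "drop"          # both judges flag it: removed immediately
    if any(flags):
        return "human_review"  # exactly one judge flags it: Stage 2 review
    return "keep"              # neither judge flags it: sample survives
```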
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity, and these annotators make the final call. For culturally sensitive content, multiple perspectives are considered, since "correctness" can genuinely vary across Arab regions.

The pipeline revealed recurring quality issues across benchmarks: not isolated errors, but systematic patterns reflecting gaps in how the benchmarks were originally constructed.

The results below are as of April 2026 and cover the top 10 evaluated models; visit the live leaderboard for current rankings.

Across the full leaderboard (46 models), a clear but imperfect size-performance correlation emerges. However, there are interesting exceptions: