News
Hugging Face Blog
AI
LLM
Agent
Dataset
Platform
Tool
Title
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
Ankita Naik, danish, Ben, Anupama Murthi, Praveen
Published
2026/4/15 20:07:25
Source Type
blog
Language
en
Abstract

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.

Full Text

Try VAKRA — Where Does Your Agent Break?

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

Task Description

VAKRA provides an executable environment where agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3–7-step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. As the results below show, models perform poorly on VAKRA; in this post we provide additional dataset details about the VAKRA tasks and present an analysis of the failure modes we observed on each. The benchmark comprises four tasks, each testing a different set of capabilities.

The OpenAI API specification restricts the tool-list input to a maximum of 128 tools. This restriction requires agent builders using the API to manage the tool-list length directly via a shortlisting mechanism; in our repository, a simple shortlisting capability in the baseline agents handles this challenge. The baseline agent also enforces tool-use policies through a simple addition to the prompt: "You are a helpful assistant with access to tools.\n Tool Usage Constraint: {additional_instructions}." Agent builders are, of course, free to choose any constraint-enforcement mechanism.

Evaluation Framework

VAKRA evaluates agents in tool environments where success depends both on the ability to execute coherent, multi-step workflows and on answer correctness.
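The shortlisting and prompt-constraint steps described above can be sketched as follows. This is an illustrative sketch, not the repository's implementation: `score_relevance` is an assumed, caller-supplied ranking function, and `MAX_TOOLS` simply mirrors the documented OpenAI limit.

```python
# Hypothetical sketch of a tool-shortlisting step. The relevance scorer is
# supplied by the agent builder (e.g., embedding similarity or BM25);
# names here are illustrative, not taken from the VAKRA repo.

MAX_TOOLS = 128  # OpenAI API limit on the tool-list input length


def shortlist_tools(query, tools, score_relevance, max_tools=MAX_TOOLS):
    """Keep only the max_tools tools most relevant to the query."""
    ranked = sorted(tools, key=lambda t: score_relevance(query, t), reverse=True)
    return ranked[:max_tools]


def build_system_prompt(additional_instructions):
    """Inject the natural-language tool-use constraint into the system prompt,
    following the template quoted from the baseline agent."""
    return ("You are a helpful assistant with access to tools.\n"
            f"Tool Usage Constraint: {additional_instructions}.")
```

Any filtering strategy works here; the only hard requirement is that the list passed to the API never exceeds 128 tools.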
We introduce an execution-centric evaluation framework that assesses not only final outputs but also the full tool-execution trajectory, including tool calls, inputs, and intermediate results. The VAKRA Evaluator operates over two key inputs for each sample: a predicted final response and the corresponding tool-call trajectory. The tool calls from the predicted trajectory are executed in the same environment as the ground truth to verify the intermediate tool outputs.

Tool-Sequence Comparison. Because the environment is executable, agents can explore it and sometimes arrive at the answer by invoking a different set of APIs than the ones we annotated. To support such alternative but valid tool invocations and reasoning paths, correctness is assessed by executing each predicted tool call and comparing the resulting set of tool responses against those from the ground truth, rather than enforcing strict step-level matching.

Final Response Evaluation. For trajectories that pass the previous check, the final response is evaluated by an LLM-based judge. This step ensures that the response is (i) grounded in the predicted tool outputs, and (ii) factually consistent with the ground-truth answer, allowing for reasonable variation in phrasing or structure. This design rewards agents not only for producing correct answers, but for obtaining them through valid and complete reasoning processes. To obtain a capability score, every sample within a capability is weighted equally for capabilities 1 through 3.

Error Analysis

We now present a detailed error analysis across the four VAKRA capabilities. To facilitate the analysis, we adopt a stage-wise error categorization that assigns each failure to the first point of breakdown.
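The set-level trajectory check described above can be sketched as a comparison of response multisets. This is a minimal sketch under stated assumptions: `execute_call` is an assumed hook into the executable environment, and canonical JSON serialization is one plausible way to compare structured outputs regardless of call order.

```python
# Sketch of order-insensitive trajectory matching: execute each predicted
# tool call and compare the multiset of responses against the ground-truth
# responses, rather than matching the trajectories step by step.

from collections import Counter
import json


def canonical(output):
    """Serialize a tool output so structurally equal results compare equal."""
    return json.dumps(output, sort_keys=True, default=str)


def trajectories_match(predicted_calls, gold_outputs, execute_call):
    """True if the predicted calls yield the same multiset of outputs as the
    ground-truth trace, allowing a different order or choice of APIs."""
    predicted_outputs = [execute_call(call) for call in predicted_calls]
    return (Counter(canonical(o) for o in predicted_outputs)
            == Counter(canonical(o) for o in gold_outputs))
```

A multiset (rather than a plain set) comparison also catches cases where an agent reaches the right outputs but duplicates or drops a call.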
Specifically, we evaluate, in order: (i) whether the correct tool(s) were selected; (ii) whether all required arguments were provided, without omissions or hallucinations; (iii) whether the argument values were correct; and (iv) whether the final response is both accurate and grounded in the tool outputs. Since a single sample may exhibit multiple errors across different steps, we classify each instance by its earliest failing stage (e.g., tool-selection errors take precedence over argument errors). This avoids double-counting and lets the error categories be interpreted as disjoint fractions of the dataset. More granular metrics (e.g., precision/recall over tool usage) are possible (Elder et al., 2026), but we find this formulation provides a simple and interpretable breakdown of agent failures.

Multi-hop reasoning increases the difficulty of the original task by requiring models to answer multiple implicitly coupled questions, each of which requires selecting and calling the correct API. As expected, all models performed best on questions with a single logical hop, with performance degrading on 2-hop questions and degrading further on questions with 3+ hops.

The final segment of the dataset adds document sources to the tool/API sources used in the other segments. This yields instances that require single or multiple API calls, single or multiple document searches, or some combination of the two. In general, we find that models either violate constraints or fail to retrieve sufficient information: sometimes they understood the policy but could not answer the question correctly, and sometimes they exhibited one of the previously analyzed failure modes. Overall, the policy-constrained tool-use settings suggest that while models can reason over tools and sources, they struggle to incorporate external constraints into that reasoning, a key requirement for reliable real-world deployment.
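The stage-wise categorization can be sketched as a first-failure classifier. The per-stage predicates here are assumed to be supplied by the evaluation harness; the point of the sketch is the ordering, which makes the resulting categories disjoint.

```python
# Sketch of stage-wise error attribution: each failed sample is assigned to
# the earliest failing stage, so tool-selection errors take precedence over
# argument errors, and the categories partition the failures.

STAGES = [
    "tool_selection",     # (i)  correct tool(s) selected
    "argument_presence",  # (ii) required arguments present, none hallucinated
    "argument_values",    # (iii) argument values correct
    "final_response",     # (iv) answer accurate and grounded in tool outputs
]


def first_failing_stage(sample, checks):
    """Return the earliest stage whose check fails, or None if all pass.

    `checks` maps each stage name to a predicate over the sample.
    """
    for stage in STAGES:
        if not checks[stage](sample):
            return stage
    return None
```

Running this over a result set and tallying the returned stages yields the disjoint error fractions described above.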
Conclusion

VAKRA exposes a critical gap between surface-level tool competence and robust, end-to-end agent reliability. Although modern models can increasingly select APIs and execute isolated tool calls, VAKRA shows that these abilities alone are insufficient for real-world deployment. In practice, models often break down when required to perform compositional reasoning under execution constraints spanning APIs, documents, dialog context, and policy requirements. Run your agent on VAKRA and see where it falls apart: tool selection, multi-hop reasoning, or policy constraints.

Resources
Yang et al., 2024: arxiv.org/html/2406.04744v1
Elder et al., 2026: arxiv.org/pdf/2506.11266
GitHub: github.com/IBM/vakra
VAKRA Evaluator: github.com/IBM/vakra/tree/main/evaluator
VAKRA Dataset: huggingface.co/datasets/ibm-research/VAKRA
Leaderboard: ibm-research-vakra.hf.space
Original post: huggingface.co/blog/ibm-research/vakra-benchmark-analysis
Metadata
Source: Hugging Face Blog
Type: News
Extraction status: raw
Keywords
AI
LLM
Agent
Dataset
Platform
Tool