UrbanComp Lab | Learning Resource Library

Paper

arXiv

SpatialIntelligence

Agent

Chinese Title

S-Agent：空间工具使用激发空间智能推理

English Title

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

Published

2026/6/19 01:34:55

Source Type

preprint

Language

Abstract

Chinese Translation

现实世界中的空间智能需对连续演化的三维环境进行推理，而现有视觉语言模型（VLM）及工具增强型智能体仍主要依赖于对孤立静态视觉观测的无状态推理。我们提出 S-Agent，一种面向连续多视角图像与视频理解与推理的空间工具使用型智能体范式。通过将空间推理建模为时空证据累积过程，而非孤立帧级预测，S-Agent 将空间感知从以帧为中心的识别转向以场景为中心的理解。具体而言，S-Agent 将 VLM 视为语义规划器，用以决定所需证据；同时，由空间工具与专家构成的层级结构负责在二维空间中定位物体、将其提升为三维几何证据，并将此类证据聚合为高层空间知识（例如计数、测量、朝向与相对位置）。此外，其时序记忆机制包含场景记忆（Scene Memory）与智能体记忆（Agent Memory）：前者用于维护动态演化的场景状态，后者用于累积推理上下文，从而支持跨帧与跨推理步骤的证据整合。在多视角与视频空间推理基准上的全面实验表明，S-Agent 能以无需训练的方式持续提升开源与闭源 VLM 的性能。除推理时增强外，在 S-Agent 生成的空间轨迹数据集 S-300K 上进行监督微调（SFT），可得到紧凑型空间智能体 S-Agent-8B，其性能显著超越同规模基线模型（如 Qwen3-VL-8B），并与先进闭源模型（如 GPT-5.4 和 Gemini 3）表现相当。

English Original

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories (S-300K) yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
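
To make the idea of spatio-temporal evidence accumulation concrete, here is a toy Python sketch. It is not the authors' implementation. Every name in it (`SceneMemory`, `AgentMemory`, `ground_and_lift`, `answer_distance`) is hypothetical, the planner is reduced to a hard-coded set of required objects, and each "frame" is assumed to already be a mapping from object label to 3D position, standing in for the 2D grounding and 3D lifting tools the abstract mentions.

```python
from dataclasses import dataclass, field
from math import dist

Point3D = tuple[float, float, float]


@dataclass
class SceneMemory:
    """Hypothetical scene state: object label -> latest known 3D position."""
    objects: dict[str, Point3D] = field(default_factory=dict)


@dataclass
class AgentMemory:
    """Hypothetical reasoning context: one log entry per processed frame."""
    steps: list[str] = field(default_factory=list)


def ground_and_lift(frame: dict[str, Point3D]) -> dict[str, Point3D]:
    """Stub for the 2D grounding + 3D lifting tools (assumption:
    the frame already carries 3D positions, so this is a plain copy)."""
    return dict(frame)


def answer_distance(frames, a: str, b: str):
    """Accumulate evidence across frames, then measure the a-b distance."""
    scene, agent = SceneMemory(), AgentMemory()
    needed = {a, b}  # stand-in for the planner deciding what evidence is needed
    for i, frame in enumerate(frames):
        missing = needed - set(scene.objects)
        if not missing:
            break  # enough evidence gathered, stop looking at frames
        found = {k: v for k, v in ground_and_lift(frame).items() if k in missing}
        scene.objects.update(found)
        agent.steps.append(f"frame {i}: found {sorted(found)}")
    if needed - set(scene.objects):
        return None, agent.steps  # evidence never became complete
    return dist(scene.objects[a], scene.objects[b]), agent.steps


frames = [{"chair": (0.0, 0.0, 0.0)}, {"table": (1.0, 1.0, 0.0)}, {"lamp": (3.0, 4.0, 0.0)}]
distance, log = answer_distance(frames, "chair", "lamp")
```

No single frame in the example contains both objects, so a per-frame predictor could not measure the distance. The scene memory is what lets the measurement be made once both positions have been observed.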

Resource Links

Paper PDF: arxiv.org/pdf/2606.20515v1

Original source page: arxiv.org/abs/2606.20515v1

Metadata

arXiv: 2606.20515v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

SpatialIntelligence

Agent

cs.CV