UrbanComp Lab | Learning Resource Library

Paper

arXiv

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

English Title

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang

Published

2026/5/21 22:40:03

Source type

preprint

Abstract

English Original

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .
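The abstract describes three steps: building a geometry bank, letting each visual token retrieve its own geometric evidence, and adding that evidence back as a residual before language modeling. The sketch below is only an illustration of that general idea, not the authors' implementation. The function name `ground_tokens`, the scaled dot-product softmax used for retrieval, and the `alpha` residual weight are all assumptions introduced here; the abstract does not specify how allocation or grounding is computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ground_tokens(visual_tokens, geometry_bank, alpha=1.0):
    """Hypothetical token-adaptive grounding (an assumption, not the paper's method).

    visual_tokens: list of d-dimensional vectors (lists of floats).
    geometry_bank: list of levels, each a list of d-dimensional vectors,
                   standing in for features from a frozen geometry encoder.
    alpha:         assumed weight on the residual geometric evidence.
    """
    # Flatten the multi-level bank so each token can attend to every level.
    entries = [vec for level in geometry_bank for vec in level]
    scale = math.sqrt(len(entries[0]))

    grounded = []
    for tok in visual_tokens:
        # Each token gets its own weights over the bank (token-adaptive).
        weights = softmax([dot(tok, e) / scale for e in entries])
        # Evidence is the weighted mix of bank entries for this token.
        evidence = [
            sum(w * e[i] for w, e in zip(weights, entries))
            for i in range(len(tok))
        ]
        # Residual grounding: keep the original token and add the evidence.
        grounded.append([t + alpha * ev for t, ev in zip(tok, evidence)])
    return grounded


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    bank = [[[2.0, 0.0]], [[0.0, 2.0]]]  # two levels, one entry each
    for row in ground_tokens(tokens, bank):
        print(row)
```

Because the weights are computed per token, two tokens looking at the same bank can receive different evidence, which is the contrast with a single shared geometric signal that the abstract draws. Setting `alpha=0.0` returns the tokens unchanged, which makes the residual form easy to check in isolation.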

Resource links

Paper PDF: arxiv.org/pdf/2605.22558v1

Original source page: arxiv.org/abs/2605.22558v1

Metadata

arXiv: 2605.22558v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

cs.CV