UrbanComp Lab | Learning Resource Library

Title

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

Published

2026/5/25 09:33:19

Source Type

preprint

Language

Abstract

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ), which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.
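The abstract does not spell out how the task-decoupled attention mask is laid out, so the following is only an illustrative sketch of the general idea: two query groups read the same visual context but cannot attend to each other. The token ordering (visual tokens, then metric queries, then structure queries), the function names `decoupled_query_mask` and `masked_softmax`, and the choice to let visual tokens attend only to visual tokens are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def decoupled_query_mask(n_visual: int, n_metric: int, n_struct: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Assumed (hypothetical) layout: [visual | metric queries | structure queries].
    """
    n = n_visual + n_metric + n_struct
    vis = slice(0, n_visual)
    met = slice(n_visual, n_visual + n_metric)
    stc = slice(n_visual + n_metric, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[vis, vis] = True  # visual tokens attend within the visual context
    mask[met, vis] = True  # metric queries read the shared visual context
    mask[met, met] = True  # and attend within their own group
    mask[stc, vis] = True  # structure queries read the shared visual context
    mask[stc, stc] = True  # and attend within their own group
    # mask[met, stc] and mask[stc, met] stay False, isolating the two pathways.
    return mask


def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    mask = decoupled_query_mask(n_visual=4, n_metric=3, n_struct=2)
    weights = masked_softmax(np.zeros(mask.shape), mask)
    print(mask.astype(int))
    print(weights.round(2))
```

In this sketch every token can attend to itself, so no softmax row is fully masked. A real autoregressive backbone would typically combine such a block mask with its causal mask, which is omitted here.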

Resource Links

Paper PDF: arxiv.org/pdf/2605.25334v1
Original source page: arxiv.org/abs/2605.25334v1

Metadata

arXiv: 2605.25334v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

cs.CV