Paper
arXiv
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
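Chinese Title (translated)
Learning Geometric Representations from Videos: Multimodal Large Language Models for Spatial Intelligence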
English Title
Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
Haibo Wang, Lifu Huang
Published
2026/6/4 16:11:12
Source Type
preprint
Language
en
Abstract
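Chinese Version (translated)

Multimodal large language models (MLLMs) perform well at 2D semantic understanding but lack intrinsic 3D awareness, so their representations cannot maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we propose GeoVR, a new framework that learns geometric representations using only 2D video sequences. The method effectively restructures the semantic latent space inside MLLMs, thereby unlocking spatial intelligence. Rather than adopting a surface-level feature-mixing strategy, GeoVR reshapes the MLLM's internal representations by distilling geometric knowledge from pre-trained 3D foundation models. Its implementation relies on a multi-objective learning strategy driven by four complementary geometric learning objectives: (1) estimating inter-frame camera poses to embed changing viewpoint dynamics; (2) regressing dense depth maps to anchor physical distances; (3) predicting a metric scale factor for real-world calibration; and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on multiple spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.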

English Original

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
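The abstract names four geometric targets but does not state how they are combined or which loss each one uses. As a minimal sketch only, the block below assumes a weighted sum of four terms: an L1 loss on pose parameters, an L1 loss on depth, a log-ratio loss on the metric scale factor, and a cosine distance averaged over feature scales. The function names, the dictionary keys, the choice of each loss, and the equal default weights are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def combined_geometric_loss(
    pred: dict,
    teacher: dict,
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[float, Dict[str, float]]:
    """Weighted sum of four illustrative loss terms (an assumed formulation).

    Both dicts hold: "pose" (flat list of pose parameters), "depth" (flat
    list of depth values), "scale" (positive float), and "feats" (one
    feature vector per scale). `teacher` stands in for the outputs of a
    pre-trained 3D foundation model.
    """
    weights = weights or {"pose": 1.0, "depth": 1.0, "scale": 1.0, "feat": 1.0}
    terms = {
        # (1) inter-frame camera pose
        "pose": l1(pred["pose"], teacher["pose"]),
        # (2) dense depth
        "depth": l1(pred["depth"], teacher["depth"]),
        # (3) metric scale factor, compared in log space
        "scale": abs(math.log(pred["scale"]) - math.log(teacher["scale"])),
        # (4) multi-scale feature distillation, averaged over scales
        "feat": sum(
            cosine_distance(p, t) for p, t in zip(pred["feats"], teacher["feats"])
        ) / len(pred["feats"]),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Returning the per-term values alongside the total makes it easy to log each objective separately. A real training setup would use tensor operations and a pose parameterization suited to rotations, which this sketch does not attempt.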

Metadata
arXiv: 2606.05833v1
Source: arXiv
Type: Paper
Extraction Status: raw
Keywords
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
cs.CV
cs.AI