Paper
arXiv
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
Chinese Title
SpatialSV:通过面向任务的视觉监督在多模态大语言模型中内化可解释的3D空间感知能力
English Title
SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision
Jiayu Tang, Yuchen Zhou, Chao Gou
Published
2026/6/18 16:09:32
Source Type
preprint
Language
en
Abstract
English Original

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

Metadata
arXiv: 2606.19915v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
cs.CV