Paper
English Title
CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence
Yiming Zhang, Ruoxuan Cao, Zhihang Zhong
Published
2026/6/9 12:20:08
Source Type
preprint
Language
en
Abstract
Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.
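
The abstract names three components (local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification) but does not specify data structures or protocols. The following is only a minimal sketch of how such pieces could fit together, under assumptions that are not from the paper: the map is a square grid with one object label per cell, an "atomic commit" means a batch of updates is applied entirely or not at all, and "verification" is a simple vote across local agents. The names `CognitiveMap`, `commit`, `verify`, and `min_votes` are hypothetical, and the local agents' outputs are hard-coded stand-ins for MLLM responses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


@dataclass
class CognitiveMap:
    """Hypothetical grid-based map: each cell holds at most one object label."""
    size: int = 10
    cells: Dict[Cell, str] = field(default_factory=dict)

    def commit(self, updates: Dict[Cell, str]) -> bool:
        """Apply all updates or none (assumed meaning of 'atomic commit')."""
        for (x, y), label in updates.items():
            in_bounds = 0 <= x < self.size and 0 <= y < self.size
            # A conflict is an occupied cell whose label differs from the update.
            conflict = self.cells.get((x, y), label) != label
            if not in_bounds or conflict:
                return False  # reject the whole batch; the map is unchanged
        self.cells.update(updates)
        return True


def verify(proposals: List[Dict[Cell, str]], min_votes: int = 2) -> Dict[Cell, str]:
    """Keep only placements proposed by at least `min_votes` local agents.

    If two labels for the same cell both reach the threshold, the one
    seen later wins in this sketch.
    """
    votes: Dict[Tuple[Cell, str], int] = {}
    for proposal in proposals:
        for cell, label in proposal.items():
            votes[(cell, label)] = votes.get((cell, label), 0) + 1
    return {cell: label for (cell, label), n in votes.items() if n >= min_votes}


# Stand-ins for three local agents that each saw a different chunk of frames.
proposals = [
    {(2, 3): "chair", (5, 5): "table"},
    {(2, 3): "chair", (7, 1): "lamp"},
    {(2, 3): "chair", (5, 5): "table"},
]

global_map = CognitiveMap(size=10)
agreed = verify(proposals)            # "lamp" has a single vote and is dropped
committed = global_map.commit(agreed)

# A later batch that contradicts the map is rejected as a whole,
# so the non-conflicting "door" entry is not written either.
rejected = global_map.commit({(0, 0): "door", (2, 3): "sofa"})
```

Because the global map lives outside the model, it can persist across chunks of frames that would not fit in one context window, which is the role the abstract assigns to structured spatial memory. How the actual framework represents cells, resolves conflicts, or verifies agents may differ from this sketch.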

Metadata
arXiv: 2606.10401v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
Agent
cs.CV