UrbanComp Lab | Learning Resource Library

Paper

arXiv

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

Agent

Chinese Title

CoCoSI：面向空间智能的协同认知地图构建

English Title

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

Published

2026/6/9 12:20:08

Source Type

preprint

Abstract

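Chinese Version (translated to English)

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), allowing them to reason about the physical world from visual experience. Inspired by the mechanisms of human spatial cognition, recent methods build grid-based cognitive maps from multi-frame visual input in order to maintain a coherent spatial representation over time. Limited context length still constrains spatial understanding, however, and existing approaches (such as long-context modeling and external memory) often require changing the model architecture, adding memory modules, or fine-tuning, which limits their applicability to off-the-shelf pretrained MLLMs. We therefore propose a lightweight, model-agnostic method that preserves spatial information beyond the model's native context window. Specifically, we design a plug-and-play multi-agent framework that collaboratively builds a structured spatial memory, the cognitive map, and thereby improves the spatial understanding of any pretrained MLLM without architectural modification or additional training. The framework comprises local-global agent coordination, cognitive map construction based on atomic commits, and a cross-agent verification mechanism. Extensive experiments show that the method achieves better performance on spatial understanding tasks while requiring no training at any stage. The code will be open-sourced.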

English Original

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.
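The abstract names three ingredients (a grid-based cognitive map, atomic commits, and cross-agent verification) without giving implementation details. The following is a minimal illustrative sketch of how an all-or-nothing commit guarded by verifiers could look. It is not the authors' implementation: the names `CognitiveMap`, `commit`, and `no_conflict` are hypothetical, and a plain Python function stands in for what would presumably be an MLLM-backed verifying agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]            # (row, col) on the grid
Proposal = Dict[Cell, str]        # cell -> object label


@dataclass
class CognitiveMap:
    """Hypothetical grid-based spatial memory (an assumption, not the paper's API)."""
    size: int
    cells: Dict[Cell, str] = field(default_factory=dict)

    def commit(
        self,
        proposal: Proposal,
        verifiers: List[Callable[["CognitiveMap", Proposal], bool]],
    ) -> bool:
        """Apply every placement in `proposal`, or none of them."""
        # Reject the whole proposal if any cell falls outside the grid.
        if not all(0 <= r < self.size and 0 <= c < self.size for r, c in proposal):
            return False
        # Every verifier must accept before the map is touched.
        if not all(verify(self, proposal) for verify in verifiers):
            return False
        self.cells.update(proposal)
        return True


def no_conflict(cmap: CognitiveMap, proposal: Proposal) -> bool:
    """Stand-in verifier: reject a proposal that relabels an occupied cell."""
    return all(cmap.cells.get(cell, label) == label for cell, label in proposal.items())


cmap = CognitiveMap(size=10)
first = cmap.commit({(2, 3): "sofa", (2, 7): "table"}, [no_conflict])
# The second proposal conflicts at (2, 3), so (5, 5) is not written either.
second = cmap.commit({(2, 3): "chair", (5, 5): "lamp"}, [no_conflict])
print(first, second, sorted(cmap.cells))
```

In this sketch a rejected proposal leaves the map untouched, which is the property the phrase "atomic commits" suggests; how the paper actually splits work between local and global agents is not specified in the abstract.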

Resource Links

Paper PDF: arxiv.org/pdf/2606.10401v1

Original source page: arxiv.org/abs/2606.10401v1

Metadata

arXiv: 2606.10401v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

Agent

cs.CV