Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context length still constrains spatial understanding, and existing remedies such as long-context modeling and external memory often require architectural changes, added memory modules, or fine-tuning, which limits their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. We propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. The framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.