Multimodal Foundation Models (MFMs) have made substantial progress, yet they remain fragile at spatial reasoning over the physical world. A key bottleneck is their difficulty in transforming local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, namely Allocentric-Spatial Trees (ASTs) and route maps, which support queries about object topology, geometric relations, passability, and trajectories. To use these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial further introduces a Spatial Reasoning Harness that handles tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We also internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated, trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%–18% in a training-free setting, and that ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.
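To make the egocentric-to-allocentric distinction concrete, the following is a minimal 2D sketch of the underlying coordinate change, not the World2Mind implementation. The function name `ego_to_allo`, the frame convention (egocentric x is forward, y is left; yaw is counter-clockwise from the world +x axis), and the example poses are all assumptions made for illustration.

```python
import math

def ego_to_allo(obs_xy, cam_xy, cam_yaw):
    """Map a point from a camera-centered (egocentric) frame to a shared
    world (allocentric) frame.

    Assumed convention (hypothetical): egocentric x = forward, y = left;
    cam_yaw is in radians, counter-clockwise from the world +x axis.
    """
    fx, fy = obs_xy
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    # Rotate the observation by the camera heading, then translate by its position.
    return (cam_xy[0] + c * fx - s * fy,
            cam_xy[1] + s * fx + c * fy)

# The same object observed "2 m straight ahead" from two different camera poses.
p1 = ego_to_allo((2.0, 0.0), cam_xy=(0.0, 0.0), cam_yaw=0.0)
p2 = ego_to_allo((2.0, 0.0), cam_xy=(2.0, -2.0), cam_yaw=math.pi / 2)
```

Both observations have identical egocentric coordinates, yet they resolve to the same world location only once each camera pose is taken into account. This is the kind of view-independent quantity that a structured allocentric prior would store and expose to queries.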