Vision-Language Models (VLMs) have increasingly become the dominant paradigm for indoor scene understanding, yet they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or fine-tuning on large-scale spatial question answering, inherently coupling perception with reasoning. In this paper, we investigate whether decoupling perception from reasoning improves spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds a large language model (LLM) in an explicit 3D scene graph (3DSG). Rather than ingesting video directly, the framework represents each scene as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties such as object dimensions, distances, poses, and spatial relationships. On the static split of VSI-Bench, our results establish an upper bound on spatial reasoning performance under ideal perceptual conditions, exceeding previous work by up to 16\% without task-specific fine-tuning. Compared to base VLMs, our agentic variant achieves significantly better performance, with average improvements between 33\% and 50\%. These findings indicate that explicit geometric grounding substantially improves spatial reasoning and suggest that structured representations offer a compelling alternative to purely end-to-end visual reasoning.
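To make the tool-use setup concrete, the sketch below shows one way an agent-facing geometric tool layer over a ground-truth 3DSG might be exposed. It is a minimal illustration only: the names \texttt{SceneObject}, \texttt{object\_size}, and \texttt{distance\_between}, as well as the scene contents, are hypothetical and not the paper's actual interface.

\begin{verbatim}
# Minimal sketch (hypothetical, not the paper's API) of geometric tools
# an LLM agent could call instead of reading pixels.
from dataclasses import dataclass
import math

@dataclass
class SceneObject:
    label: str
    center: tuple[float, float, float]  # metric position in scene frame (m)
    size: tuple[float, float, float]    # axis-aligned extents (w, h, d) in m

# Toy 3DSG instantiated from ground-truth annotations.
SCENE = {
    "sofa":  SceneObject("sofa",  (1.0, 0.0, 2.5), (2.0, 0.9, 0.9)),
    "table": SceneObject("table", (1.2, 0.0, 1.0), (1.2, 0.7, 0.8)),
}

def object_size(name: str) -> tuple[float, float, float]:
    """Tool: return the metric extents of a named object."""
    return SCENE[name].size

def distance_between(a: str, b: str) -> float:
    """Tool: Euclidean center-to-center distance between two objects."""
    return math.dist(SCENE[a].center, SCENE[b].center)

if __name__ == "__main__":
    # The agent would issue structured calls like these during reasoning.
    print(object_size("sofa"))                          # (2.0, 0.9, 0.9)
    print(round(distance_between("sofa", "table"), 2))  # 1.51
\end{verbatim}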