Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored because fine-grained evaluation is lacking. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings: single-view, panorama-view, and embodied-view. In the embodied-view setting, the agent can actively acquire observations through a sequence of yaw, pitch, and zoom adjustments. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, commonsense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization performance is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrating perception, spatial reasoning, and commonsense inference rather than on isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project page: https://kaixuewen.github.io/ERGeoBench/