UrbanComp Lab | Learning Resource Library

Title

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

Published

2026/5/29 20:49:17

Source type

preprint

Abstract

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/
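The embodied-view setting and the notion of metric localization can be made concrete with a small sketch. The abstract does not give action ranges, step sizes, or the distance metric used for scoring, so everything below is an assumption for illustration: `ViewState`, `step`, and `haversine_km` are hypothetical names, the pitch and zoom limits are arbitrary, and great-circle distance is simply a common choice for measuring geo-localization error, not necessarily the one ERGeoBench uses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewState:
    """Hypothetical camera state inside one panorama (degrees, zoom factor)."""
    yaw: float = 0.0
    pitch: float = 0.0
    zoom: float = 1.0

    def step(self, d_yaw=0.0, d_pitch=0.0, d_zoom=0.0):
        """Apply one action and return the new state.

        The limits here (pitch in [-90, 90], zoom in [1, 4]) are assumed,
        not taken from the benchmark.
        """
        yaw = (self.yaw + d_yaw) % 360.0  # yaw wraps around the full panorama
        pitch = max(-90.0, min(90.0, self.pitch + d_pitch))  # clamp to the sphere
        zoom = max(1.0, min(4.0, self.zoom + d_zoom))  # clamp to assumed zoom range
        return ViewState(yaw, pitch, zoom)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lam = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * radius_km * math.asin(math.sqrt(a))
```

A sequence of actions is then a fold over `step`, and a predicted coordinate can be scored by its `haversine_km` distance to the ground-truth panorama location. Keeping the state immutable makes each intermediate view easy to log when checking spatial consistency across views.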

Resource links

Paper PDF: arxiv.org/pdf/2605.31251v1

Original source page: arxiv.org/abs/2605.31251v1

Metadata

arXiv: 2605.31251v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

GeoAI

GIS

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

Agent

cs.CV

cs.AI