Chinese Title
EO-Gym:面向地球观测智能体的多模态交互式环境
English Title
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
Sai Ma, Zhuang Li, Sichao Li, Xinyue Xu, Ruibiao Zhu, Tony Boston, John A. Taylor
Published
2026/5/2 13:09:17
Source Type
preprint
Language
en
Abstract
Chinese Translation

地球观测(Earth Observation, EO)分析本质上具有交互性:消除不确定性通常需要扩展兴趣区域、检索历史观测数据,以及在光学与合成孔径雷达(Synthetic Aperture Radar, SAR)等不同传感器之间切换。然而,当前多数EO基准测试将该过程简化为固定输入、单轮次任务。为弥补这一缺口,我们提出EO-Gym——一个受控可执行框架,专为支持多模态、工具调用型EO智能体而设计;其将EO分析建模为一种Gymnasium风格的本地地理空间工作区,底层由逾66万个多模态文件支撑,这些文件按地理位置、时间及传感器类型索引,并配备35种EO专用工具,覆盖六大任务类别。基于该环境,我们构建了EO-Gym-Data基准数据集,包含9,078条轨迹与34,604个推理步骤,数据源自八个公开EO数据集,并整合Landsat与Sentinel-2影像。对10个开源及闭源视觉语言模型(VLM)的评估表明,即使性能较强的通用模型在交互式EO推理任务上仍表现欠佳,尤其在时序与跨模态工作流方面。作为参考基线,EO-Gym-4B通过在EO-Gym-Data上微调Qwen3-VL-4B-Instruct获得,在主评估设置下整体Pass@3指标由0.49提升至0.74。EO-Gym提供了一个可复现的交互式EO智能体实验环境,将EO操作化为一项需统筹地理空间、时间与传感模态的证据收集问题。

English Original

Earth Observation (EO) analysis is inherently interactive: resolving uncertainty often requires expanding the region of interest, retrieving historical observations, and switching across sensors such as optical and Synthetic Aperture Radar (SAR). However, most EO benchmarks collapse this process into fixed-input, single-turn tasks. To address this gap, we present EO-Gym, a controlled executable framework for multimodal, tool-using EO agents that formulates EO analysis as a Gymnasium-style local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor type, with 35 EO-specialized tools spanning six task families. Built on this environment, we construct EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps, grounded in eight public EO datasets together with Landsat and Sentinel-2 imagery. Evaluating 10 open and closed VLMs shows that strong general-purpose models still struggle with interactive EO reasoning, especially on temporal and cross-modal workflows. As a reference baseline, EO-Gym-4B, obtained by fine-tuning Qwen3-VL-4B-Instruct on EO-Gym-Data, improves overall Pass@3 from 0.49 to 0.74 under the main evaluation setting. EO-Gym provides a reproducible environment for interactive EO agents, operationalizing EO as an evidence-gathering problem that requires planning across geospatial, temporal, and sensing modality.
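
The abstract reports Pass@3 without defining it. A minimal sketch of one common reading follows, in which a task counts as solved if at least one of its first k attempts succeeds. This definition is an assumption, not the paper's confirmed protocol, and the function name, task ids, and outcomes below are hypothetical toy values rather than results from EO-Gym.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    `results` maps a (hypothetical) task id to per-attempt success flags.
    This is an assumed reading of Pass@k, not the paper's stated definition.
    """
    if not results:
        raise ValueError("no tasks to score")
    solved = 0
    for task_id, attempts in results.items():
        if len(attempts) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} attempts")
        # A task is solved if any of its first k attempts succeeded.
        if any(attempts[:k]):
            solved += 1
    return solved / len(results)


# Toy outcomes for illustration only; not taken from the paper.
toy = {
    "change-detection-001": [False, True, False],
    "sar-optical-002": [False, False, False],
    "temporal-003": [True, True, True],
    "roi-expand-004": [False, False, True],
}
print(pass_at_k(toy, k=3))  # 3 of 4 toy tasks solved -> 0.75
```

Under this reading, the reported improvement from 0.49 to 0.74 would mean that the share of tasks solved within three attempts rose by 25 percentage points.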

Metadata
arXiv: 2605.01250v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
RemoteSensing
EarthObservation
Multimodal
GeoMultimodal
Agent
cs.AI