The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks rely primarily on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools and encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we design the Parameter Execution Accuracy (PEA) metric, which uses a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, we propose a Vision-Language Model (VLM)-based verification method to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignment and runtime anomalies, we develop a novel agent architecture, Plan-and-React, which mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, particularly in multi-step reasoning and error recovery, achieving an optimal balance between logical rigor and execution robustness. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
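To make the "Last-Attempt Alignment" idea concrete, the sketch below shows one plausible way a PEA-style score could be computed: for each tool call, only the agent's final retry is compared against a gold reference parameter set. This is a minimal illustrative sketch, not the paper's actual definition; the function names, parameter dictionaries, and matching rule (exact key-value equality) are all assumptions introduced here for illustration.

```python
# Illustrative sketch of a "Last-Attempt Alignment" parameter score.
# Hypothetical: GABench's actual PEA definition may weight parameters,
# normalize values, or aggregate across tool calls differently.

def last_attempt(attempts):
    """Return the final attempted parameter dict for a tool call."""
    return attempts[-1]

def parameter_execution_accuracy(agent_attempts, gold_params):
    """Fraction of gold parameters matched by the agent's last attempt.

    agent_attempts: list of parameter dicts, one per retry, in order
    gold_params:    reference parameter dict for the tool call
    """
    final = last_attempt(agent_attempts)
    if not gold_params:
        return 1.0  # nothing to match
    matched = sum(1 for k, v in gold_params.items() if final.get(k) == v)
    return matched / len(gold_params)

# Example: the agent retries after a unit error; only the last attempt counts.
attempts = [
    {"buffer_dist": 100, "units": "feet"},    # first (wrong) attempt
    {"buffer_dist": 100, "units": "meters"},  # final, corrected attempt
]
gold = {"buffer_dist": 100, "units": "meters"}
print(parameter_execution_accuracy(attempts, gold))  # → 1.0
```

Scoring only the last attempt rewards agents that recover from runtime errors, which is exactly the error-recovery behavior the benchmark is designed to probe.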