Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training paradigms, architectures, and task-specific adaptation strategies are systematically assessed alongside a wealth of emerging benchmarks. Representative multimodal visual and vision-language GFMs are evaluated across ten downstream tasks, with insights into their architectures, performance, and application scenarios. Real-world case studies, spanning land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence, demonstrate the practical potential of GFMs. Finally, we outline pressing challenges in domain generalization, interpretability, efficiency, and privacy, and chart promising avenues for future research.