Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation tasks. Recently, numerous geospatial image-text pair datasets, along with VLFMs fine-tuned on them, have been proposed. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. We then systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications to various multimodal geospatial tasks. Finally, we conclude with insights into future research directions, open issues, and discussion. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We will continue to track related work at https://github.com/zytx121/Awesome-VLGFM.