Advancements in vision and language foundation models have inspired the development of geo-foundation models (GeoFMs), enhancing performance across diverse geospatial tasks. However, most existing GeoFMs focus primarily on overhead remote sensing (RS) data while neglecting other data modalities such as ground-level imagery. A key challenge in multimodal GeoFM development is explicitly modeling geospatial relationships across modalities, which enables generalization across tasks, spatial scales, and temporal contexts. To address these limitations, we propose GAIR, a novel multimodal GeoFM architecture that integrates overhead RS data, street view (SV) imagery, and their geolocation metadata. We utilize three factorized neural encoders to project an SV image, its geolocation, and an RS image into a shared embedding space. The SV image must lie within the RS image's spatial footprint but need not be at its geographic center. To geographically align the SV and RS images, we propose a novel implicit neural representation (INR) module that learns a continuous RS image representation and queries the RS embedding at the SV image's geolocation. The geographically aligned SV, RS, and location embeddings are then trained with contrastive learning objectives on unlabeled data. We evaluate GAIR across 10 geospatial tasks spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art GeoFMs and other strong baselines, highlighting its effectiveness in learning generalizable and transferable geospatial representations.
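To make the alignment mechanism concrete, the following is a minimal sketch (not GAIR's actual implementation) of the two ideas the abstract describes: querying a continuous RS feature representation at an SV image's normalized geolocation inside the RS footprint, and aligning the queried RS embedding with the SV embedding via an InfoNCE-style contrastive loss. The bilinear interpolation stands in for the learned INR module, and the random toy embeddings stand in for the factorized encoders; all names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                  # embedding dimension (assumption)
rs_feat = rng.normal(size=(8, 8, D))    # toy RS feature map over the footprint


def inr_query(feat, xy):
    """Query a continuous RS representation at location xy in [0, 1]^2.

    Bilinear interpolation over a discrete feature grid stands in for the
    learned INR module described in the abstract.
    """
    H, W, _ = feat.shape
    x, y = xy[0] * (W - 1), xy[1] * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[y0, x0]
            + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0]
            + wx * wy * feat[y1, x1])


def info_nce(anchor, positives, temperature=0.1):
    """InfoNCE loss: row i of `positives` is the positive for anchor i."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


# A toy batch: RS embeddings queried at each SV image's geolocation, paired
# with SV embeddings that are (by construction) close to their RS match.
locs = rng.uniform(size=(4, 2))
rs_emb = np.stack([inr_query(rs_feat, xy) for xy in locs])
sv_emb = rs_emb + 0.05 * rng.normal(size=rs_emb.shape)

loss = info_nce(sv_emb, rs_emb)
```

Because each SV image only needs to fall somewhere inside the RS footprint, the continuous query replaces the common assumption that paired images share a center, which is the design choice the INR module enables.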