Paper
arXiv
GeoAI
GIS
RemoteSensing
EarthObservation
SpatialIntelligence
Title
GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations
Zeping Liu, Ni Lao, Zhangyu Wang, Junfeng Jiao, Gengchen Mai
Published
2025/3/21 03:59:39
Source Type
preprint
Language
en
Abstract

Vision Transformer (ViT) has been widely used in computer vision tasks with excellent results, providing representations for a whole image or for image patches. However, ViT lacks detailed localized image representations at arbitrary positions when applied to geospatial tasks that involve multiple geospatial data modalities, such as overhead remote sensing (RS) data, ground-level imagery, and geospatial vector data. Here, high-resolution localized representations are vital for modeling geospatial relationships and alignments across modalities. We propose to solve this representation problem with an implicit neural representation (INR) module that extends ViT with Neural Implicit Local Interpolation, producing a continuous RS image representation covering arbitrary locations in the RS image. Based on the INR module, we introduce GAIR, a novel location-aware self-supervised learning (SSL) objective integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. GAIR utilizes three factorized neural encoders to project the different modalities into an embedding space, and the INR module further aligns these representations geographically; the model is trained with contrastive learning objectives on unlabeled data. We evaluate GAIR across 9 geospatial tasks and 22 datasets spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art geo-foundation models (GeoFM) and alternative SSL training objectives (e.g., MoCo V3 and MAE) that do not use fine-grained geo-aligned spatial representations. Our results highlight the effectiveness of GAIR in learning generalizable geospatial representations across tasks, spatial scales, and temporal contexts. The project code is available at https://github.com/zpl99/GAIR.
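The two mechanics the abstract describes — querying a continuous feature at an arbitrary location within a ViT patch-embedding grid, then contrastively aligning it with the embedding of a street-view image taken at that geolocation — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: plain bilinear interpolation stands in for the learned Neural Implicit Local Interpolation, and both function names are hypothetical.

```python
import numpy as np

def bilinear_local_feature(patch_grid, x, y):
    """Interpolate a feature vector at a continuous location (x, y),
    given in patch-grid coordinates, from a ViT patch-embedding grid.

    patch_grid: (H, W, D) array, one D-dim embedding per image patch.
    A stand-in for the paper's learned Neural Implicit Local Interpolation.
    """
    H, W, _ = patch_grid.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * patch_grid[y0, x0] + wx * patch_grid[y0, x1]
    bottom = (1 - wx) * patch_grid[y1, x0] + wx * patch_grid[y1, x1]
    return (1 - wy) * top + wy * bottom

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE contrastive loss: row i of `anchors` (e.g. SV embeddings)
    should match row i of `positives` (localized RS features queried at
    the same geolocation) against all other rows in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # cross-entropy, labels on diagonal
```

In this sketch, training would map each SV image's geolocation into patch-grid coordinates of the co-located RS tile, query a localized RS feature there, and minimize `info_nce` between the SV embeddings and those queried features over a batch.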

Metadata
arXiv: 2503.16683v2
Source: arXiv
Type: Paper
Extraction Status: raw
Keywords
GeoAI
GIS
RemoteSensing
EarthObservation
SpatialIntelligence
cs.CV
cs.AI