Title
Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying
Jonathan Hecht, Lukas Arzoumanidis, Ziyue Li, Youness Dehbi
Published
2026/6/18 20:35:14
Source type
preprint
Language
en
Abstract

Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. To overcome this challenge, self-supervised pre-training is a possible solution, with contrastive learning dominant for location encoders. Those approaches usually align geographic coordinates with just one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). These architectures expand this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation - the contrastive objective reaches its peak early, regardless of modality diversity or pre-training volume. MELT provides more stable training than SALT and presents a stronger foundation for future scaling.
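
The abstract does not state the exact objective, so the following is only a minimal sketch of the general idea, assuming a CLIP-style symmetric InfoNCE loss between location embeddings and each modality's embeddings, with one shared location encoder tying the modalities together. Each modality brings its own batch of locations, which is how unpaired data could enter such a setup. All function names, the temperature value, and the simple summation over modalities are illustrative assumptions, not the authors' implementation of MELT or SALT.

```python
import math

def _normalize(v):
    # Scale a vector to unit length (guarding against the zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def _diag_cross_entropy(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def info_nce(loc_emb, mod_emb, temperature=0.07):
    # Symmetric contrastive loss for one batch: row i of loc_emb and
    # row i of mod_emb describe the same place (the positive pair).
    a = [_normalize(v) for v in loc_emb]
    b = [_normalize(v) for v in mod_emb]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))

def location_tied_loss(batches, temperature=0.07):
    # Hypothetical combination rule: sum the pairwise losses over modalities.
    # `batches` maps a modality name to (location embeddings, modality
    # embeddings); the location embeddings are assumed to come from the
    # same shared location encoder for every modality.
    return sum(info_nce(loc, mod, temperature) for loc, mod in batches.values())
```

With toy two-dimensional embeddings, a batch whose location and modality vectors agree yields a much smaller loss than one whose pairs are swapped, which is the behaviour a contrastive objective of this kind is meant to have.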

Metadata
arXiv: 2606.20167v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
Multimodal
GeoMultimodal
cs.LG