Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. Self-supervised pre-training offers one way to address this, and contrastive learning is the dominant approach for location encoders. Existing approaches usually align geographic coordinates with only one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). Both extend this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation: performance under the contrastive objective peaks early, regardless of modality diversity or pre-training data volume. MELT trains more stably than SALT and presents a stronger foundation for future scaling.
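To make the notion of contrastive alignment between locations and modalities concrete, the following is a minimal sketch and not the MELT or SALT implementation. It assumes a CLIP-style symmetric InfoNCE loss, which the abstract does not specify, and operates on embeddings that are already computed. The function names, the temperature value of 0.07, and the averaging scheme in `location_tied_loss` are all hypothetical choices made for illustration.

```python
import math


def l2_normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(loc_emb, mod_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of loc_emb matches row i of mod_emb."""
    loc = [l2_normalise(v) for v in loc_emb]
    mod = [l2_normalise(v) for v in mod_emb]
    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(u, w)) / temperature for w in mod]
        for u in loc
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the location-to-modality and modality-to-location directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def location_tied_loss(batches, temperature=0.07):
    """Hypothetical combination: mean of per-modality losses.

    `batches` maps a modality name to (location embeddings, modality
    embeddings). Each modality has its own batch of locations, so no sample
    needs to be paired across modalities.
    """
    losses = [
        symmetric_info_nce(loc, mod, temperature)
        for loc, mod in batches.values()
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    loc = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[2.0, 0.0], [0.0, 3.0]]   # same directions as loc
    swapped = [[0.0, 1.0], [1.0, 0.0]]   # matches assigned to the wrong rows
    print(symmetric_info_nce(loc, aligned))
    print(symmetric_info_nce(loc, swapped))
    print(location_tied_loss({"a": (loc, aligned), "b": (loc, swapped)}))
```

In this sketch the location embeddings are the only element shared between modalities, which is one plausible reading of how unpaired geospatial data can be used; the actual architectures may combine or schedule the per-modality objectives differently.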