UrbanComp Lab | Learning Resource Library


Paper

arXiv

GeoAI

GIS

Multimodal

GeoMultimodal

Chinese Title

基于位置绑定的多模态对比学习以实现隐式地球嵌入

English Title

Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying

Jonathan Hecht, Lukas Arzoumanidis, Ziyue Li, Youness Dehbi

Published

2026/6/18 20:35:14

Source Type

preprint


Abstract

Chinese Translation

空间预测任务常受限于高质量标注真值观测数据的缺乏。为应对这一挑战，自监督预训练是一种可行方案，其中对比学习在位置编码器中占据主导地位。现有方法通常仅将地理坐标与单一额外模态对齐。本文提出两种多模态对比学习架构：基于位置绑定的多模态嵌入（MELT）与序列交替位置训练（SALT）。这两种架构通过利用非配对地理空间数据，将该框架扩展至超过两个模态。两种方法在技术上均具可行性，并在四项下游任务中达到最强双模态基线（SATCLIP）的性能水平。然而，模态数量的增加并未持续提升性能，表明所选位置编码器是主要瓶颈——对比目标函数的性能在早期即达峰值，且该峰值不受模态多样性或预训练数据量的影响。MELT 比 SALT 具有更稳定的训练过程，为未来扩展提供了更坚实的基础。

English Original

Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. To overcome this challenge, self-supervised pre-training is a possible solution, with contrastive learning dominant for location encoders. Those approaches usually align geographic coordinates with just one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). These architectures expand this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation - the contrastive objective reaches its peak early, regardless of modality diversity or pre-training volume. MELT provides more stable training than SALT and presents a stronger foundation for future scaling.
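The abstract does not spell out the training objective, so the following is only a rough sketch of what aligning one shared location encoder with several modalities through unpaired data could look like. It assumes a CLIP-style symmetric InfoNCE loss per modality, averaged across modalities. The names `location_encoder`, `info_nce`, and `tied_multimodal_loss` are hypothetical, the location encoder is a fixed sin/cos stand-in rather than a learned network, and the modality embeddings are assumed to be already projected to the same dimension. It is not the paper's MELT or SALT implementation.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `anchors` is matched with row i of `targets`."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    n = len(a)
    # logits[i][j] = cosine similarity of anchor i and target j, scaled by temperature
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # transpose for the target-to-anchor direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))


def location_encoder(lon, lat):
    """Hypothetical stand-in for a learned location encoder: sin/cos of the coordinates."""
    lam, phi = math.radians(lon), math.radians(lat)
    return [math.sin(lam), math.cos(lam), math.sin(phi), math.cos(phi)]


def tied_multimodal_loss(batches, temperature=0.07):
    """Average the per-modality loss against one shared location encoder.

    `batches` maps a modality name to (coords, embeddings). Each modality
    brings its own coordinates, so no location needs every modality (an
    assumption about how unpaired data might be used).
    """
    per_modality = {}
    for name, (coords, embeddings) in batches.items():
        loc = [location_encoder(lon, lat) for lon, lat in coords]
        per_modality[name] = info_nce(loc, embeddings, temperature)
    return sum(per_modality.values()) / len(per_modality), per_modality


# Toy usage with random 4-dimensional embeddings standing in for encoder outputs.
random.seed(0)


def random_embeddings(n, dim=4):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


batches = {
    "satellite": ([(13.4, 52.5), (2.35, 48.86), (-74.0, 40.7)], random_embeddings(3)),
    "text": ([(116.4, 39.9), (139.7, 35.7)], random_embeddings(2)),
}
total, per_modality = tied_multimodal_loss(batches)
print(total, per_modality)
```

In this sketch the modalities never interact directly. The only thing they share is the location encoder, which is one plausible reading of "location tying" and also shows why the location encoder's capacity could become the limiting factor.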

Resource Links

Paper PDF: arxiv.org/pdf/2606.20167v1

Original source page: arxiv.org/abs/2606.20167v1

Metadata

arXiv: 2606.20167v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

GeoAI

GIS

Multimodal

GeoMultimodal

cs.LG