Paper
arXiv
GeoAI
GIS
GeoLargeModel
GeoFoundationModel
Title
Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
Authors
Amandeep Kaur, Mirali Purohit, Gedeon Muhawenayo, Esther Rolf, Hannah Kerner
Published
2026/4/23 05:43:03
Source type
preprint
Language
en
Abstract
New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.
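
The analysis described in the abstract (scoring each pretraining dataset for diversity, then correlating that score with downstream performance) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state which diversity measure or correlation statistic the authors used, so this sketch substitutes normalized Shannon entropy over categorical labels and Spearman rank correlation. The function names, the toy land-cover labels, and the toy scores are hypothetical and are not taken from the paper or its repository. Continuous spectral values would first need to be binned into categories, which is not shown.

```python
import math
from collections import Counter


def shannon_diversity(labels, num_classes):
    """Normalized Shannon entropy of categorical labels.

    Returns 0.0 when every sample shares one label and 1.0 when samples are
    spread uniformly over all `num_classes` possible labels.
    Assumption: entropy stands in for the paper's (unstated) diversity measure.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(num_classes)


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Toy placeholder data, NOT results from the paper.
    toy_datasets = {
        "A": ["forest"] * 8 + ["urban"] * 2,
        "B": ["forest"] * 5 + ["urban"] * 3 + ["water"] * 2,
        "C": ["forest"] * 4 + ["urban"] * 3 + ["water"] * 3,
    }
    toy_scores = {"A": 0.61, "B": 0.58, "C": 0.66}

    names = sorted(toy_datasets)
    diversity = [shannon_diversity(toy_datasets[n], num_classes=3) for n in names]
    performance = [toy_scores[n] for n in names]
    print("diversity:", dict(zip(names, diversity)))
    print("Spearman rho:", spearman(diversity, performance))
```

Under these assumptions, repeating the calculation once per diversity dimension (continent, biome, land cover, binned spectral values) would give one correlation coefficient per dimension, which is the kind of comparison the abstract summarizes.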

Metadata
arXiv: 2604.21104v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
GeoLargeModel
GeoFoundationModel
cs.CV
cs.LG