Foundation Models (FMs) have achieved state-of-the-art performance across domains by leveraging large-scale pretraining. In Earth Observation (EO), the recent availability of petabyte-scale satellite archives has enabled the development of Geospatial Foundation Models (GFMs). Yet fundamental questions remain about how dataset size, model architecture, and model size interact to determine downstream performance. In this work, we systematically explore this design space by pretraining and fine-tuning models on three dataset scales: PhilEO Globe (0.5TB), FastTOM (2TB, introduced here), and MajorTOM (23TB). We evaluate three architectural families, Geo-Aware U-Net (CNN), ViT-UPerNet (Transformer), and Mamba (state-space model), across model sizes ranging from 44M to 300M parameters. All models are benchmarked on the PhilEO Bench, covering road density regression, building density regression, and land cover segmentation, and are compared against existing GFMs such as TerraMind and Prithvi-EO-2.0. Our results show that CNN-based models remain highly competitive in low-shot settings, with a 200M-parameter Geo-Aware U-Net outperforming larger architectures on the regression tasks. However, when scaling to multi-terabyte datasets, ViT-UPerNet achieves the best performance, particularly for semantic segmentation on MajorTOM (23TB). Finally, we provide the first extensive evaluation of Mamba models in EO, highlighting their potential efficiency advantages, though further large-scale pretraining is required to fully match the performance of CNNs and ViTs. All code, pretrained models, and the FastTOM dataset are released publicly, enabling reproducibility and further exploration of scaling laws for GFMs.