Foundation models have the potential to transform the landscape of remote sensing (RS) data analysis by enabling large computer vision models to be pre-trained on vast amounts of remote sensing data. These models can then be fine-tuned with small amounts of labeled training data and applied to a variety of tasks. Most existing foundation models are designed for high-spatial-resolution, cloud-free satellite imagery or photos, limiting their applicability in scenarios that require frequent temporal monitoring or broad spectral profiles. As a result, foundation models trained solely on cloud-free images have limited utility for applications that involve atmospheric variables or require atmospheric corrections. We introduce SatVision-TOA, a novel foundation model pre-trained on 14-band MODIS L1B Top-Of-Atmosphere (TOA) radiance imagery, addressing the need for models pre-trained to handle moderate- and coarse-resolution all-sky remote sensing data. The SatVision-TOA model is pre-trained using a masked image modeling (MIM) framework and the SwinV2 architecture, and learns detailed contextual representations through self-supervised learning without the need for labels. It is a 3-billion-parameter model trained on 100 million images. To our knowledge, this is the largest foundation model trained solely on satellite RS imagery. Results show that SatVision-TOA achieves superior performance over baseline methods on downstream tasks such as 3D cloud retrieval. Notably, the model achieves a mean intersection over union (mIOU) of 0.46, a substantial improvement over the baseline mIOU of 0.22. Additionally, the false-negative rate in the fine-tuning task was reduced by over 50% compared to the baseline. Our work advances pre-trained vision modeling for multispectral RS by learning from a variety of atmospheric and aerosol conditions to improve cloud and land surface monitoring.
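To make the pre-training objective concrete, the sketch below illustrates the core idea of masked image modeling: a random fraction of image patches is hidden, and the reconstruction loss is computed only over the masked regions. This is a minimal NumPy illustration under stated assumptions; the function names, patch size, and mask ratio are hypothetical and do not reflect the actual SatVision-TOA implementation.

```python
import numpy as np

def mask_patches(image, patch_size=4, mask_ratio=0.6, rng=None):
    """Zero out a random fraction of non-overlapping patches (MIM-style).

    image: (H, W, C) array, e.g. C=14 spectral bands.
    Returns the masked image and a (H/patch, W/patch) boolean patch mask.
    Illustrative helper only, not the SatVision-TOA code.
    """
    rng = rng or np.random.default_rng(0)
    H, W, _ = image.shape
    ph, pw = H // patch_size, W // patch_size
    n_patches = ph * pw
    n_masked = int(round(mask_ratio * n_patches))
    # Choose which patches to hide.
    idx = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx] = True
    masked = image.copy()
    for i in np.flatnonzero(mask):
        r, c = divmod(i, pw)
        masked[r * patch_size:(r + 1) * patch_size,
               c * patch_size:(c + 1) * patch_size, :] = 0.0
    return masked, mask.reshape(ph, pw)

def mim_loss(pred, target, mask, patch_size=4):
    """Mean L1 reconstruction error over masked pixels only."""
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(mask, patch_size, axis=0),
                           patch_size, axis=1)
    return np.abs(pred - target)[pixel_mask].mean()
```

During self-supervised pre-training, the encoder (SwinV2 in SatVision-TOA's case) sees only the masked input and is optimized to minimize this loss, so it must learn spatial and spectral context to fill in the hidden patches without any labels.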