Geospatial raster data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches to such geospatial data, but these approaches lack scalable model architectures, leading to inflexibility and computational inefficiency as the number of channels and modalities grows. To address these limitations, we introduce the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block, which approximates high-dimensional spatial-spectral attention through the Kronecker product of low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer, which preserves both the continuity and physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask, which exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves performance competitive with state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
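To make the Kronecker-product idea behind the LESS Attention Block concrete, the sketch below is a minimal NumPy illustration, not the paper's implementation: it assumes tokens are ordered channel-major over `Nc` spectral channels and `Ns` spatial patches, and uses random matrices in place of learned attention weights. It verifies that applying the explicit `(Nc*Ns) x (Nc*Ns)` Kronecker attention matrix is equivalent to the far cheaper factorized computation that applies the two low-dimensional attentions separately.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Ns, Nc, d = 4, 3, 8          # spatial patches, spectral channels, embed dim

# Tokens ordered channel-major: row index = c * Ns + s
X = rng.normal(size=(Nc * Ns, d))

# Illustrative low-dimensional attention maps (random stand-ins for learned ones)
A_s = softmax(rng.normal(size=(Ns, Ns)))   # spatial attention
A_c = softmax(rng.normal(size=(Nc, Nc)))   # spectral attention

# Naive route: materialize the full high-dimensional attention matrix
full = np.kron(A_c, A_s) @ X               # O((Nc*Ns)^2) memory and compute

# Factorized route: apply spectral and spatial attention on separate axes
Xr = X.reshape(Nc, Ns, d)
eff = np.einsum("ck,sl,kld->csd", A_c, A_s, Xr).reshape(Nc * Ns, d)

assert np.allclose(full, eff)  # the two routes agree
```

The factorized route never builds the `(Nc*Ns) x (Nc*Ns)` matrix, which is what makes the approximation scale gracefully as channels and modalities are added.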