Geospatial raster data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches to such geospatial data, but these approaches lack scalable model architectures, leading to inflexibility and computational inefficiency as the number of channels and modalities grows. To address these limitations, we introduce the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block, which approximates high-dimensional spatial-spectral attention through the Kronecker product of low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer, which preserves both the continuity and physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask, which exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves performance competitive with state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
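To make the Kronecker-product idea behind the LESS Attention Block concrete, the sketch below is a minimal NumPy illustration, not the paper's implementation: it assumes tokens are ordered channel-major over `Nc` spectral channels and `Ns` spatial patches, and uses random matrices in place of learned attention weights. It verifies that applying the explicit `(Nc*Ns) x (Nc*Ns)` Kronecker attention matrix is equivalent to the far cheaper factorized computation that applies the two low-dimensional attentions separately.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Ns, Nc, d = 4, 3, 8          # spatial patches, spectral channels, embed dim

# Tokens ordered channel-major: row index = c * Ns + s
X = rng.normal(size=(Nc * Ns, d))

# Illustrative low-dimensional attention maps (random stand-ins for learned ones)
A_s = softmax(rng.normal(size=(Ns, Ns)))   # spatial attention
A_c = softmax(rng.normal(size=(Nc, Nc)))   # spectral attention

# Naive route: materialize the full high-dimensional attention matrix
full = np.kron(A_c, A_s) @ X               # O((Nc*Ns)^2) memory and compute

# Factorized route: apply spectral and spatial attention on separate axes
Xr = X.reshape(Nc, Ns, d)
eff = np.einsum("ck,sl,kld->csd", A_c, A_s, Xr).reshape(Nc * Ns, d)

assert np.allclose(full, eff)  # the two routes agree
```

The factorized route never builds the `(Nc*Ns) x (Nc*Ns)` matrix, which is what makes the approximation scale gracefully as channels and modalities are added.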