Paper
arXiv
Title
GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation
Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan
Published
2026/5/14 13:46:07
Source Type
preprint
Language
en
Abstract

Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.
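
The abstract does not say how the bilateral cross-attention or its geographic guidance is implemented. As a rough illustration only, the sketch below assumes that each image patch and each census-tract token carries a 2D centroid, and that geographic guidance takes the form of an additive distance penalty on the attention scores. The function names (`geo_cross_attention`, `bilateral_step`), the `length_scale` parameter, and the omission of learned projections are all assumptions made for this example, not details from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def geo_cross_attention(queries, keys, values, q_xy, k_xy, length_scale=1.0):
    """Single-head cross-attention with an assumed distance penalty.

    score(i, j) = <q_i, k_j> / sqrt(d) - dist(i, j) / length_scale
    Hypothetical form: nearby tokens get higher weight. No learned
    projections are applied, to keep the sketch short.
    """
    dim = len(queries[0])
    out = []
    for q, (qx, qy) in zip(queries, q_xy):
        scores = []
        for k, (kx, ky) in zip(keys, k_xy):
            dot = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            dist = math.hypot(qx - kx, qy - ky)
            scores.append(dot - dist / length_scale)
        weights = softmax(scores)
        # Weighted sum of value vectors for this query.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def bilateral_step(patches, tracts, patch_xy, tract_xy, length_scale=1.0):
    """One exchange in both directions, each with a residual connection."""
    # Image patches read from tract tokens ...
    p_ctx = geo_cross_attention(patches, tracts, tracts,
                                patch_xy, tract_xy, length_scale)
    # ... and tract tokens read from image patches.
    t_ctx = geo_cross_attention(tracts, patches, patches,
                                tract_xy, patch_xy, length_scale)
    new_patches = [[a + b for a, b in zip(p, c)] for p, c in zip(patches, p_ctx)]
    new_tracts = [[a + b for a, b in zip(t, c)] for t, c in zip(tracts, t_ctx)]
    return new_patches, new_tracts
```

With a small `length_scale`, a patch draws almost all of its cross-modal context from the tract whose centroid is closest to it. A large value lets feature similarity dominate instead.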

Metadata
arXiv: 2605.14406v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
RemoteSensing
EarthObservation
GeoLargeModel
GeoFoundationModel
Multimodal
GeoMultimodal
cs.LG
cs.CV