Title
Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks
Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li
Published
2026/6/9 21:01:51
Source type
preprint
Language
en
Abstract

Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets in scale. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% [email protected] on the OPT-RSVG test set for optical visual grounding and 80.68% accuracy on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7% on each. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.

Metadata
arXiv: 2606.10819v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
RemoteSensing
EarthObservation
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
cs.CV
cs.AI