UrbanComp Lab | Learning Resource Library

Title

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

Published

2026/6/9 21:01:51

Source type

preprint

Abstract

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% [email protected] on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
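
Visual grounding results such as the OPT-RSVG figure quoted in the abstract are commonly reported as the share of predicted boxes whose intersection over union (IoU) with the ground-truth box reaches a threshold. The following is a minimal sketch of that general scoring scheme, not the paper's evaluation code: the `(x1, y1, x2, y2)` box format, the function names, and the default threshold of 0.5 are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: axis-aligned (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is at least `threshold` (hypothetical default, not from the paper)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Each prediction is paired with exactly one ground-truth box, which matches the usual single-target grounding setup; benchmarks with multiple targets per query would need a matching step that this sketch omits.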

Resource links

Paper PDF: arxiv.org/pdf/2606.10819v1

Original source page: arxiv.org/abs/2606.10819v1

Metadata

arXiv: 2606.10819v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

RemoteSensing

EarthObservation

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

cs.CV

cs.AI