Remote sensing multimodal large language models (RS-MLLMs) enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities (optical, synthetic aperture radar (SAR), infrared, multispectral, temporal, and video) together with cross-sensor fusion across nine task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, addressing first the viewpoint gap and then the imaging-physics gap. To support joint training, the MMRS-OneVision dataset is constructed with approximately 34M question-answer pairs spanning all six sensor modalities and cross-sensor fusion across the nine task categories, substantially exceeding existing remote sensing multimodal instruction datasets in scale. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across a wide range of benchmarks, consistently matching or outperforming RS-MLLMs with 4B to 72B parameters. It achieves 87.52% on the OPT-RSVG test set for optical visual grounding and 80.68% accuracy on the SAR visual question answering benchmark SARLANG-Bench, in each case exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification and 81.94% multiple-choice question (MCQ) accuracy on EarthMind-Bench for cross-modality reasoning.
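Visual grounding results such as the OPT-RSVG figure are commonly scored by counting a predicted box as correct when its intersection-over-union (IoU) with the reference box meets a threshold, often 0.5. The abstract does not define its metric, so the following is only a minimal sketch under that assumption; the function names and the `(x1, y1, x2, y2)` box format are illustrative and not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the reference box is >= threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

For example, `accuracy_at_iou(predicted_boxes, reference_boxes, threshold=0.5)` returns a value in [0, 1] that would be reported as a percentage; the threshold is a parameter because benchmarks differ in which IoU cut-offs they report.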