UrbanComp Lab | Learning Resource Library

Paper

arXiv

RemoteSensing

EarthObservation

Multimodal

GeoMultimodal

Chinese Title

MetaEarth-MM：基于场景中心联合建模的统一多模态遥感图像生成方法

English Title

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

Published

2026/5/20 00:47:02

Source Type

preprint

Abstract

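Chinese Version (English translation)

Multimodal remote sensing images are essential for Earth observation, but complete paired observations are often scarce in practice. Existing generative methods usually address this through isolated pairwise modality translation, yet their generality and scalability fall short as the number of modalities and the variety of generation tasks grow. This paper presents MetaEarth-MM, a generative foundation model for multimodal remote sensing imagery that supports paired joint generation and any-to-any modality translation across five modalities within a unified framework. Given the inherent scene consistency of multimodal observations, MetaEarth-MM introduces a scene-centered joint modeling paradigm: unlike earlier methods that rely on direct appearance-level cross-modal mapping, the model organizes generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from the available observations and then generates the target-modality images conditioned on this intermediate state. To support training, we further construct the EarthMM dataset, a large-scale dataset of 2.8 million multi-resolution global remote sensing images, including 2.2 million strictly co-registered pairs. Extensive experiments show that MetaEarth-MM not only delivers strong generative capability and robust generalization across generation tasks, but also supports downstream tasks at both the data level and the representation level, underscoring its potential as a general-purpose foundation model for cross-modal Earth observation. The code and dataset will be released at https://github.com/YZPioneer/MetaEarth-MM.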

English Original

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.
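The two-stage idea in the abstract (infer a scene latent from whatever observations exist, then generate each target from that latent) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the modality names `m0` to `m4` are placeholders because the abstract does not list the five modalities, and the scalar "gain" per modality stands in for real learned encoders and decoders. It is not the authors' architecture, and it does not model paired joint generation.

```python
from itertools import permutations

# Hypothetical placeholder names: the abstract says five modalities but does not list them.
MODALITIES = ["m0", "m1", "m2", "m3", "m4"]


def count_pairwise_translators(modalities):
    """Directed source->target models needed if every pair gets its own translator."""
    return len(list(permutations(modalities, 2)))


def count_scene_centered_modules(modalities):
    """One encoder into the scene latent plus one decoder out of it, per modality."""
    return 2 * len(modalities)


class ToySceneModel:
    """Toy stand-in: each modality views the scene through its own scalar gain."""

    def __init__(self, gains):
        # modality name -> nonzero gain (made-up values, for illustration only)
        self.gains = gains

    def infer_scene(self, observations):
        """Stage 1: fuse the available observations into a single scene latent."""
        if not observations:
            raise ValueError("at least one observed modality is required")
        # Undo each modality's gain, then average the per-modality estimates.
        per_modality = [
            [pixel / self.gains[name] for pixel in image]
            for name, image in observations.items()
        ]
        n = len(per_modality)
        return [sum(values) / n for values in zip(*per_modality)]

    def generate(self, scene, target):
        """Stage 2: render the target modality conditioned only on the scene latent."""
        return [self.gains[target] * value for value in scene]

    def translate(self, observations, targets):
        """Any-to-any translation: any subset of inputs, any list of targets."""
        scene = self.infer_scene(observations)
        return {target: self.generate(scene, target) for target in targets}
```

In this toy setting, five modalities would need 20 directed pairwise translators but only 10 scene-centered modules, which is one way to read the scalability argument in the abstract. A call such as `ToySceneModel(gains).translate({"m0": [...], "m1": [...]}, ["m2", "m4"])` accepts any subset of observed modalities because the targets never see the inputs directly, only the shared latent.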

Resource Links

Paper PDF: arxiv.org/pdf/2605.20090v1

Original source page: arxiv.org/abs/2605.20090v1

Metadata

arXiv: 2605.20090v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

RemoteSensing

EarthObservation

Multimodal

GeoMultimodal

cs.CV