Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this scarcity through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks grows. Here, we develop MetaEarth-MM, a generative foundation model for multi-modal remote sensing imagery that enables paired joint generation and any-to-any translation across five modalities within a single unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from the available observations and then generates the target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million strictly aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both the data level and the representation level, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.
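The two-stage, scene-centered design described above can be illustrated with a minimal toy sketch. It is not the MetaEarth-MM implementation. The modality names, array sizes, random linear encoders and decoders, mean pooling, and the Gaussian prior used when no observation is given are all assumptions made purely for illustration. The abstract does not list the five modalities, so placeholder names are used.

```python
import numpy as np

# Hypothetical placeholder names: the abstract says "five modalities"
# but does not list them.
MODALITIES = ["mod_a", "mod_b", "mod_c", "mod_d", "mod_e"]
PIXELS, SCENE_DIM = 64, 16  # toy sizes, chosen only for illustration

rng = np.random.default_rng(0)

# One toy linear encoder and one toy linear decoder per modality.
# These random matrices are stand-ins for learned networks.
ENCODERS = {m: rng.normal(size=(SCENE_DIM, PIXELS)) / np.sqrt(PIXELS)
            for m in MODALITIES}
DECODERS = {m: rng.normal(size=(PIXELS, SCENE_DIM)) / np.sqrt(SCENE_DIM)
            for m in MODALITIES}


def infer_scene(observations):
    """Stage 1: pool the available modalities into one scene latent."""
    if not observations:
        # Assumption: with no observation, draw the scene latent from a
        # standard normal prior (the joint-generation case).
        return rng.normal(size=SCENE_DIM)
    latents = [ENCODERS[m] @ x for m, x in observations.items()]
    return np.mean(latents, axis=0)


def generate(observations, targets):
    """Stage 2: decode each requested modality from the same scene latent."""
    scene = infer_scene(observations)
    return {m: DECODERS[m] @ scene for m in targets}


# Any-to-any translation: two observed modalities in, two others out.
obs = {"mod_a": rng.normal(size=PIXELS), "mod_c": rng.normal(size=PIXELS)}
translated = generate(obs, ["mod_b", "mod_e"])

# Joint generation: no observation, all five modalities share one scene.
joint = generate({}, MODALITIES)
```

Because every modality is encoded into and decoded from the same latent, this layout needs five encoders and five decoders, whereas direct pairwise translation across five modalities would need 5 × 4 = 20 directed mappings.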