We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations that combine token-level and pixel-level data across modalities. At the token level, TerraMind encodes high-level contextual information to learn cross-modal relationships; at the pixel level, it leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities from a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM), the capability of generating additional artificial data during finetuning and inference to improve the model output, and (iii) TerraMind surpasses the state of the art on community-standard EO benchmarks such as PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.