UrbanComp Lab | Learning Resource Library


Title

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

Published

2025/4/15 21:17:39

Source type

preprint

Abstract

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

Resource links

Paper PDF: arxiv.org/pdf/2504.11171v5

Original source page: arxiv.org/abs/2504.11171v5

Metadata

arXiv: 2504.11171v5

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

GeoAI

GIS

RemoteSensing

EarthObservation

Multimodal

GeoMultimodal

cs.CV

cs.AI