While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: (1) a Change-Enhanced Attention module that systematically isolates and amplifies visual differences; (2) a Change-SEG module that leverages Change Prior Embedding to extract discriminative difference features as inputs to the LLM; and (3) Local Causal Attention, which prevents cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA significantly outperforms leading generalist MLLMs and specialized segmentation models in complex change reasoning and high-precision boundary localization, establishing a unified framework for Earth observation intelligence.
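To make the first innovation concrete, the sketch below illustrates one plausible reading of a Change-Enhanced Attention block. The abstract only states that the module isolates and amplifies visual differences; the cross-attention alignment, gated residual, and all layer names here are assumptions for illustration, not the authors' actual design.

```python
# Hypothetical sketch of a Change-Enhanced Attention block (PyTorch).
# Assumption: differences are isolated by aligning t1 features to t2 via
# cross-attention, then amplified with a learned per-token gate.
import torch
import torch.nn as nn

class ChangeEnhancedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention aligns t1 tokens to t2 tokens before differencing.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate controls how strongly each token's difference is amplified.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1, feat_t2: (batch, tokens, dim) patch features of the two dates.
        aligned_t1, _ = self.cross_attn(query=feat_t2, key=feat_t1, value=feat_t1)
        diff = feat_t2 - aligned_t1                   # isolate the visual difference
        enhanced = feat_t2 + self.gate(diff) * diff   # amplify it, gated per token
        return self.norm(enhanced)

if __name__ == "__main__":
    x1, x2 = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
    print(ChangeEnhancedAttention(768)(x1, x2).shape)  # torch.Size([2, 196, 768])
```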
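Similarly, the third innovation can be pictured as a masking scheme. The abstract says only that Local Causal Attention prevents cross-temporal contextual leakage; the block-diagonal, causal-within-segment mask below is one assumed way to realize that, not the paper's exact formulation.

```python
# Hypothetical attention mask behind "Local Causal Attention".
# Assumption: tokens attend causally within their own temporal segment
# and are fully blocked from attending across segments.
import torch

def local_causal_mask(seg_lengths: list[int]) -> torch.Tensor:
    """Boolean mask where True = attention allowed. Within a temporal
    segment attention is causal; across segments it is blocked."""
    total = sum(seg_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seg_lengths:
        # Lower-triangular block: causal attention inside one segment only.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask

# Example: two temporal images of 4 tokens each — tokens of the second
# image never attend to the first image's tokens, so no context leaks.
print(local_causal_mask([4, 4]).int())
```

Under this construction, each time slice is encoded without contamination from the other dates, leaving explicit comparison to the dedicated difference pathway.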