While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: (1) a Change-Enhanced Attention module that systematically isolates and amplifies visual differences; (2) a Change-SEG module that leverages Change Prior Embedding to extract discriminative difference features as inputs to the LLM; and (3) Local Causal Attention, which prevents cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA significantly outperforms leading generalist MLLMs and specialized segmentation models in complex change reasoning and high-precision boundary localization, establishing a unified framework for Earth observation intelligence.
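To make the first innovation concrete, the sketch below illustrates one plausible reading of a Change-Enhanced Attention block. The abstract only states that the module isolates and amplifies visual differences; the cross-attention alignment, gated residual, and all layer names here are assumptions for illustration, not the authors' actual design.

```python
# Hypothetical sketch of a Change-Enhanced Attention block (PyTorch).
# Assumption: differences are isolated by aligning t1 features to t2 via
# cross-attention, then amplified with a learned per-token gate.
import torch
import torch.nn as nn

class ChangeEnhancedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention aligns t1 tokens to t2 tokens before differencing.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate controls how strongly each token's difference is amplified.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1, feat_t2: (batch, tokens, dim) patch features of the two dates.
        aligned_t1, _ = self.cross_attn(query=feat_t2, key=feat_t1, value=feat_t1)
        diff = feat_t2 - aligned_t1                   # isolate the visual difference
        enhanced = feat_t2 + self.gate(diff) * diff   # amplify it, gated per token
        return self.norm(enhanced)

if __name__ == "__main__":
    x1, x2 = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
    print(ChangeEnhancedAttention(768)(x1, x2).shape)  # torch.Size([2, 196, 768])
```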
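Similarly, the third innovation can be pictured as a masking scheme. The abstract says only that Local Causal Attention prevents cross-temporal contextual leakage; the block-diagonal, causal-within-segment mask below is one assumed way to realize that, not the paper's exact formulation.

```python
# Hypothetical attention mask behind "Local Causal Attention".
# Assumption: tokens attend causally within their own temporal segment
# and are fully blocked from attending across segments.
import torch

def local_causal_mask(seg_lengths: list[int]) -> torch.Tensor:
    """Boolean mask where True = attention allowed. Within a temporal
    segment attention is causal; across segments it is blocked."""
    total = sum(seg_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seg_lengths:
        # Lower-triangular block: causal attention inside one segment only.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool)
        )
        start += n
    return mask

# Example: two temporal images of 4 tokens each — tokens of the second
# image never attend to the first image's tokens, so no context leaks.
print(local_causal_mask([4, 4]).int())
```

Under this construction, each time slice is encoded without contamination from the other dates, leaving explicit comparison to the dedicated difference pathway.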