UrbanComp Lab | Learning Resource Library

Paper

arXiv

RemoteSensing

EarthObservation

LLM

Multimodal

Title

Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models

Yifan Yang, Lei Zou, Wendy Jepson

Published

2026/3/21 15:47:33

Source type

preprint

Abstract

In the immediate aftermath of natural disasters, rapid situational awareness is critical. Traditionally, satellite observations are widely used to estimate damage extent. However, they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism-fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy (0.71), whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.
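The abstract does not say which metrics sit behind the first two evaluation tiers, so the following is only a minimal sketch under stated assumptions. It uses PSNR as a stand-in for the pixel-level quality check and cosine similarity between feature vectors as a stand-in for the ResNet-based semantic consistency check. The function names `psnr` and `cosine_similarity` are illustrative and are not taken from the paper; in practice the feature vectors would come from an image encoder such as ResNet, which is not shown here.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio between two equally sized grayscale images.

    Images are lists of rows of pixel intensities. PSNR is an assumed
    stand-in for the paper's pixel-level tier, not its confirmed metric.
    """
    ref = [p for row in reference for p in row]
    gen = [p for row in generated for p in row]
    if not ref or len(ref) != len(gen):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, gen)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length.

    Assumed stand-in for the semantic-consistency tier: the vectors are
    meant to be embeddings of the real and the synthesized street view.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)
```

A synthesized view can score well on one of these and poorly on the other, which is one way the realism-fidelity trade-off described in the abstract could show up in numbers.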

Resource links

Paper PDF: arxiv.org/pdf/2603.20697v1

Original source page: arxiv.org/abs/2603.20697v1

Metadata

arXiv: 2603.20697v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

RemoteSensing

EarthObservation

LLM

Multimodal

cs.CV

cs.AI