Title
Satellite-to-Street: Synthesizing Post-Disaster Views from Satellite Imagery via Generative Vision Models
Yifan Yang, Lei Zou, Wendy Jepson
Published
2026/3/21 15:47:33
Source type
preprint
Language
en
Abstract

In the immediate aftermath of natural disasters, rapid situational awareness is critical. Satellite observations are widely used to estimate damage extent, but they lack the ground-level perspective essential for characterizing specific structural failures and impacts. Meanwhile, ground-level data (e.g., street-view imagery) remains largely inaccessible during time-sensitive events. This study investigates Satellite-to-Street View Synthesis to bridge this data gap. We introduce two generative strategies to synthesize post-disaster street views from satellite imagery: a Vision-Language Model (VLM)-guided approach and a damage-sensitive Mixture-of-Experts (MoE) method. We benchmark these against general-purpose baselines (Pix2Pix, ControlNet) using a proposed Structure-Aware Evaluation Framework. This multi-tier protocol integrates (1) pixel-level quality assessment, (2) ResNet-based semantic consistency verification, and (3) a novel VLM-as-a-Judge for perceptual alignment. Experiments on 300 disaster scenarios reveal a critical realism-fidelity trade-off: while diffusion-based approaches (e.g., ControlNet) achieve high perceptual realism, they often hallucinate structural details. Quantitative results show that standard ControlNet achieves the highest semantic accuracy (0.71), whereas VLM-enhanced and MoE models excel in textural plausibility but struggle with semantic clarity. This work establishes a baseline for trustworthy cross-view synthesis, emphasizing that visually realistic generations may still fail to preserve critical structural information required for reliable disaster assessment.
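
The first two tiers of the evaluation protocol named in the abstract can be illustrated with a minimal sketch. The abstract does not say which pixel-level metric or which similarity measure the authors use, so PSNR and cosine similarity are assumptions chosen only for illustration. The feature vectors stand in for ResNet embeddings computed elsewhere, and the function names (`psnr`, `cosine_similarity`, `evaluate_pair`) are hypothetical. The third tier (VLM-as-a-Judge) is passed in as an externally supplied score rather than implemented.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio between two flat pixel sequences.

    Assumption: PSNR stands in for the unspecified pixel-level metric.
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors.

    Assumption: the vectors are embeddings (e.g. from a ResNet) computed elsewhere.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined for zero vectors; report no similarity
    return dot / (norm_a * norm_b)


def evaluate_pair(ref_pixels, gen_pixels, ref_features, gen_features, judge_score=None):
    """Collect the three tiers into one record (hypothetical layout)."""
    return {
        "pixel_psnr": psnr(ref_pixels, gen_pixels),
        "semantic_cosine": cosine_similarity(ref_features, gen_features),
        "judge_score": judge_score,  # supplied by a separate VLM judging step
    }
```

Reporting the tiers side by side, rather than collapsing them into one number, matches the abstract's point that an image can score well on perceptual realism while failing on structural fidelity.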

Metadata
arXiv: 2603.20697v1
Source: arXiv
Type: Paper
Keywords
RemoteSensing
EarthObservation
LLM
Multimodal
cs.CV
cs.AI