Paper
arXiv
LLM
Multimodal
GeoMultimodal
Title
From Static to Dynamic: Evaluating the Perceptual Impact of Dynamic Elements in Urban Scenes via MLLM-Guided Generative Inpainting
Zhiwei Wei, Mengzi Zhang, Boyan Lu, Zhitao Deng, Nai Yang, Hua Liao
Published
2025/12/31 07:21:10
Source type
preprint
Language
en
Abstract

Understanding urban perception from street view imagery has become a central topic in urban analytics and human-centered urban design. However, most existing studies treat urban scenes as static and largely ignore the role of dynamic elements such as pedestrians and vehicles, raising concerns about potential bias in perception-based urban analysis. To address this issue, we propose a controlled framework that isolates the perceptual effects of dynamic elements by constructing paired street view images with and without pedestrians and vehicles, using semantic segmentation and MLLM-guided generative inpainting. Based on 720 paired images from Dongguan, China, a perception experiment was conducted in which participants evaluated original and edited scenes across six perceptual dimensions. The results indicate that removing dynamic elements leads to a consistent 30.97% decrease in perceived vibrancy, whereas changes in other dimensions are more moderate and heterogeneous. To further explore the underlying mechanisms, we trained 11 machine learning models on multimodal visual features and identified lighting conditions, human presence, and depth variation as key factors driving perceptual change. At the individual level, 65% of participants exhibited significant vibrancy changes, compared with 35% to 50% for other dimensions; gender further showed a marginal moderating effect on safety perception. Beyond the controlled experiment, the trained model was extended to a city-scale dataset to predict vibrancy changes after the removal of dynamic elements. The city-level results reveal that such perceptual changes are widespread and spatially structured, affecting 73.7% of locations and 32.1% of images, suggesting that urban perception assessments based solely on static imagery may substantially underestimate urban liveliness.
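
The abstract reports two kinds of summary figures: a relative change in mean rating between paired scenes, and the shares of locations and images affected at city scale. It does not state how either is computed, so the following is only a minimal sketch under stated assumptions. It assumes the relative change is the percent difference between mean ratings, and that a location counts as affected if any one of its images exceeds a threshold. The function names, the threshold, and the sample values are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_change(original, edited):
    """Percent change in mean rating from original to edited scenes.

    Assumption: ratings are paired (same scenes, same order) and the
    mean original rating is non-zero.
    """
    if len(original) != len(edited):
        raise ValueError("ratings must be paired")
    base = mean(original)
    return (mean(edited) - base) / base * 100.0


def affected_shares(changes, threshold):
    """Return (share of locations affected, share of images affected).

    changes: list of (location_id, predicted_change), one entry per image.
    Assumption: an image is affected when |change| > threshold, and a
    location is affected when at least one of its images is.
    """
    locations = {loc for loc, _ in changes}
    hit_images = [(loc, delta) for loc, delta in changes if abs(delta) > threshold]
    hit_locations = {loc for loc, _ in hit_images}
    return len(hit_locations) / len(locations), len(hit_images) / len(changes)


# Illustrative values only, not data from the study.
vibrancy_drop = relative_change([4.0, 6.0], [3.0, 4.0])
loc_share, img_share = affected_shares(
    [("a", -0.5), ("a", 0.0), ("b", 0.1), ("b", 0.0)], threshold=0.3
)
```

Under this any-image rule the location share can never fall below the image share, which is one possible reading of why the reported location figure (73.7%) exceeds the image figure (32.1%).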

Metadata
arXiv: 2512.24513v2
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
LLM
Multimodal
GeoMultimodal
cs.CY