Paper
Chinese Title
面向人的可解释多模态街道评估框架:融合视觉-语言模型的感知型城市诊断
English Title
Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
Author
HaoTian Lan
Published
2025/6/5 22:34:04
Source Type
preprint
Language
en
Abstract
Chinese Translation

尽管基于影像或GIS的客观街道指标已成为城市分析的标准工具,但其仍难以捕捉包容性城市设计所必需的主观感知。本研究提出一种新型多模态街道评估框架(MSEF),将视觉Transformer(VisualGLM-6B)与大语言模型(GPT-4)相融合,实现对街道景观的可解释双输出评估。该框架利用中国哈尔滨逾15,000张标注街景图像,采用LoRA与P-Tuning v2方法进行参数高效微调。模型在客观特征识别任务中达到0.84的F1分数,在居民主观感知一致性检验中达成89.3%的吻合率,并在分层社会经济地理区域中完成验证。除分类准确率外,MSEF还能揭示情境依赖的矛盾现象:例如,非正规商业活动虽提升感知活力,却同时降低行人舒适度;亦能识别非线性及语义依赖模式——如建筑透明度在居住区与商业区引发截然不同的感知效应,从而暴露普适性空间启发法的局限性。通过基于注意力机制生成自然语言推理依据,该框架弥合了感官数据与社会情感推断之间的鸿沟,支持符合联合国可持续发展目标SDG 11的透明化城市诊断。本研究既为城市感知建模提供了方法论创新,也为亟需协调基础设施精度与真实生活体验的规划系统提供了实践价值。

English Original

While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns -- such as the divergent perceptual effects of architectural transparency across residential and commercial zones -- revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.
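The abstract mentions fine-tuning with LoRA for parameter-efficient adaptation. As background, the core LoRA idea can be sketched in a few lines: a frozen pretrained weight matrix is augmented with a trainable low-rank update scaled by alpha / r. The dimensions, rank, and variable names below are illustrative assumptions, not the paper's actual VisualGLM-6B configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer dimensions, adapter rank, and scaling factor.
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialised

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass: frozen path plus scaled low-rank adapter path."""
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

x = rng.standard_normal((4, d_in))
# With B initialised to zero, the adapter path contributes nothing yet,
# so the output matches the frozen layer exactly at the start of training.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only A and B (rank-r factors) would receive gradients during fine-tuning, which is what makes the adaptation parameter-efficient relative to updating W itself.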

Metadata
arXiv ID: 2506.05087v1
Source: arXiv
Type: Paper
Extraction Status: raw
Keywords
GeoAI
GIS
LLM
Multimodal
GeoMultimodal
cs.CV
cs.CL