UrbanComp Lab | Learning Resource Library

Paper

arXiv

LLM

Multimodal

Chinese Title (translated)

How Do Vision-Language Models View Urban Scenes? An Urban Perception Benchmark

English Title

Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark

Rashid Mushkani

Published

2025/9/18 11:21:10

Source Type

preprint

Language

Abstract

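Chinese Parallel Text (translated)

Understanding how humans interpret urban scenes can inform design and planning. We introduce a small benchmark for testing how vision-language models (VLMs) perform on urban perception, using 100 Montreal street-view images, half photographs and half photorealistic synthetic scenes. Twelve participants from seven community groups provided 230 annotation forms covering 30 dimensions, including physical attributes and subjective impressions. French responses were normalized into English. We evaluated seven VLMs in a zero-shot setting using structured prompts and a deterministic parser. Accuracy is used for single-choice items and Jaccard overlap for multi-label items; human annotation agreement is computed with Krippendorff's alpha and pairwise Jaccard. The results show that models align more strongly on visible, objective attributes and more weakly on subjective appraisals. The best-performing system (claude-sonnet) reaches a macro-averaged accuracy of 0.31 and a mean Jaccard of 0.48 on multi-label items. Where human annotation agreement is higher, model scores are correspondingly higher. Synthetic images slightly reduce model performance. We publicly release the benchmark, prompt templates, and evaluation toolkit to support reproducible, uncertainty-aware evaluation in participatory urban analysis.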

English Original

Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than subjective appraisals. The top system (claude-sonnet) reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.
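The abstract names accuracy for single-choice items, Jaccard overlap for multi-label items, and pairwise Jaccard for human agreement. The following is a minimal sketch of those three quantities, not the paper's released harness. The function names, the example labels, and the choice to score two empty label sets as 1.0 are all assumptions made for illustration. Krippendorff's alpha is not implemented here.

```python
from itertools import combinations
from typing import Iterable, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of single-choice items where the model label equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one item")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard overlap |A & B| / |A | B| between two label sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty selections count as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


def mean_pairwise_jaccard(annotations: Sequence[Iterable[str]]) -> float:
    """Average Jaccard over all annotator pairs for one image and one dimension."""
    label_sets = [set(labels) for labels in annotations]
    pairs = list(combinations(label_sets, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels, used only to exercise the functions.
model_choice = ["calm", "busy", "calm"]
human_choice = ["calm", "calm", "calm"]
single_choice_accuracy = accuracy(model_choice, human_choice)

model_labels = {"trees", "benches"}
human_labels = {"trees", "benches", "lighting"}
multi_label_overlap = jaccard(model_labels, human_labels)

annotators = [{"trees"}, {"trees", "benches"}, {"benches"}]
human_agreement = mean_pairwise_jaccard(annotators)
```

A macro score would then average per-dimension values like these across the benchmark's dimensions, so that each dimension counts equally regardless of how many items it contains.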

Resource Links

Paper PDF: arxiv.org/pdf/2509.14574v2

Original source page: arxiv.org/abs/2509.14574v2

Metadata

arXiv: 2509.14574v2

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

LLM

Multimodal

cs.CV

cs.AI