English Title
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
Rashid Mushkani
Published
2025/9/18 11:21:10
Source type
preprint
Language
en
Abstract
English Original

Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than subjective appraisals. The top system (claude-sonnet) reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.
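
The two model-scoring metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the released evaluation harness. The function names, the example labels, and the choice to score two empty label sets as 1.0 are assumptions made for illustration, and the abstract does not say how responses from several annotators are combined into a reference.

```python
from typing import Iterable, Sequence


def jaccard(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two label sets."""
    a, b = set(predicted), set(reference)
    if not a and not b:
        # Assumption: two empty label sets count as perfect agreement.
        return 1.0
    return len(a & b) / len(a | b)


def accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of single-choice items where prediction equals reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on zero items")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical labels, not taken from the benchmark.
multi_label_score = jaccard(["trees", "benches"], ["trees", "cars"])
single_choice_score = accuracy(["calm", "busy"], ["calm", "calm"])
```

Here the multi-label example shares one label out of three distinct labels, and the single-choice example matches on one of two items.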

Metadata
arXiv: 2509.14574v2
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
LLM
Multimodal
cs.CV
cs.AI