Paper
arXiv
LLM
Multimodal
English Title
Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated
Rashid Mushkani
Published
2026/5/31 03:56:17
Source type
preprint
Language
en
Abstract

English Original

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression, models and annotators exhibit distributional mismatch, including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.

Metadata
arXiv: 2606.00871v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
LLM
Multimodal
cs.CV
cs.AI