Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches mostly model urban perception directly from street view images and largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, fusion of gaze with explicit semantic scene representations, and fusion of gaze with richer implicit visual representations. Experiments show that gaze alone already carries useful predictive signal for subjective urban perception, and that integrating gaze with scene representations further improves prediction in both the semantic and the richer visual settings. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a new direction for gaze-guided multimodal urban computing.