Urban perception is increasingly modelled from street view imagery, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three-stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed with an open-set panoptic scene graph model (OpenPSG) to extract object-predicate-object triplets. In the second stage, compact scene-level embeddings are learned with a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. The results indicate that our approach (i) improves perception prediction accuracy by an average of 26% over the baseline models and (ii) maintains strong generalization in cross-city prediction tasks. In addition, the structured representation reveals which relational patterns contribute to lower perception scores in urban scenes, such as "graffiti on wall" and "car parked on sidewalk". Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
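To make the structured representation concrete, the following minimal sketch shows one way that object-predicate-object triplets could be grouped into typed edges, the form a heterogeneous graph encoder usually expects. It is an illustration under stated assumptions, not the pipeline's actual code: the `Triplet` type and the `build_scene_graph` helper are hypothetical names, the "tree beside sidewalk" triplet is invented for the example, and each object category is collapsed into a single node for simplicity.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One relation parsed from an image (hypothetical container type)."""
    subject: str    # category of the subject, e.g. "car"
    predicate: str  # relation label, e.g. "parked on"
    obj: str        # category of the object, e.g. "sidewalk"


def build_scene_graph(triplets):
    """Group edges by (subject type, predicate, object type).

    Simplifying assumption: one node per object category, so two cars
    in the same image would share a node.
    """
    node_ids = {}               # category -> integer node id
    edges = defaultdict(list)   # edge type -> list of (source id, target id)
    for t in triplets:
        src = node_ids.setdefault(t.subject, len(node_ids))
        dst = node_ids.setdefault(t.obj, len(node_ids))
        edges[(t.subject, t.predicate, t.obj)].append((src, dst))
    return node_ids, dict(edges)


# The first two triplets are the examples named in the abstract;
# the third is made up for illustration.
triplets = [
    Triplet("graffiti", "on", "wall"),
    Triplet("car", "parked on", "sidewalk"),
    Triplet("tree", "beside", "sidewalk"),
]
node_ids, edges = build_scene_graph(triplets)
```

Keeping the predicate in the edge type is what separates this representation from co-occurrence statistics: "car parked on sidewalk" and "car beside sidewalk" involve the same two objects but form different edge types.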