UrbanComp Lab | Learning Resource Library

Title

Cross-Modal Urban Sensing: Evaluating Sound-Vision Alignment Across Street-Level and Aerial Imagery

Pengyu Chen, Xiao Huang, Teng Fei, Sicheng Wang

Published

2025/6/4 04:56:37

Source type

preprint

Language

Abstract

Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo-referenced sound recordings with both street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony-Geophony-Anthrophony (BGA) framework. These findings imply that embedding-based models offer superior semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.
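The abstract mentions evaluating cross-modal similarity between audio and image embeddings but does not say how the two embedding spaces are compared. As a rough illustration only, the sketch below uses one generic option: correlating the pairwise similarity structure of each modality across the same set of locations, which works even when the embedding dimensions differ. This is a swapped-in technique, not necessarily the authors' procedure. The function names, the 768 and 512 dimensions, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def cosine_similarity_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def structure_alignment(audio_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Pearson correlation between the two within-modality similarity
    structures. Row i of each input must describe the same location.
    (Hypothetical helper, not taken from the paper.)"""
    if audio_emb.shape[0] != image_emb.shape[0]:
        raise ValueError("both inputs need one row per location")
    n = audio_emb.shape[0]
    if n < 3:
        raise ValueError("need at least three locations")
    upper = np.triu_indices(n, k=1)  # each unordered pair of locations once
    audio_sim = cosine_similarity_matrix(audio_emb)[upper]
    image_sim = cosine_similarity_matrix(image_emb)[upper]
    return float(np.corrcoef(audio_sim, image_sim)[0, 1])


# Synthetic stand-ins for real embeddings: both modalities are driven by
# the same hidden 8-dimensional "scene" vector per location (an assumption
# used only to make the example self-checking).
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 8))
audio_emb = latent @ rng.normal(size=(8, 768))   # assumed audio embedding size
image_emb = latent @ rng.normal(size=(8, 512))   # assumed image embedding size

score_matched = structure_alignment(audio_emb, image_emb)
# Breaking the location pairing should destroy most of the alignment.
score_shuffled = structure_alignment(audio_emb, image_emb[rng.permutation(50)])
```

With real data, `audio_emb` and `image_emb` would be replaced by embeddings extracted from recordings and imagery at the same geo-referenced points. Comparing the matched score with a shuffled baseline gives a simple check that any alignment is not an artifact of the embedding geometry.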

Resource links

Paper PDF: arxiv.org/pdf/2506.03388v1

Original source page: arxiv.org/abs/2506.03388v1

Metadata

arXiv: 2506.03388v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

GeoAI

GIS

RemoteSensing

EarthObservation

Multimodal

GeoMultimodal

cs.CV