Environmental soundscapes carry substantial ecological and social information about urban environments, yet their potential remains largely untapped in large-scale geographic analysis. In this study, we investigate how closely urban sounds correspond to visual scenes by comparing how well different visual representation strategies capture acoustic semantics. We adopt a multimodal approach that pairs geo-referenced sound recordings with street-level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Using the AST model for audio, CLIP and RemoteCLIP for imagery, and CLIPSeg and Seg-Earth OV for semantic segmentation, we extract embeddings and class-level features to evaluate cross-modal similarity. The results show that street view embeddings align more strongly with environmental sounds than segmentation outputs do, whereas remote sensing segmentation is more effective at interpreting ecological categories through a Biophony--Geophony--Anthrophony (BGA) framework. These findings suggest that embedding-based models offer stronger semantic alignment, while segmentation-based methods provide interpretable links between visual structure and acoustic ecology. This work advances the emerging field of multimodal urban sensing by offering new perspectives on incorporating sound into geospatial analysis.