Title
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Subash Khanal, Srikumar Sastry, Aayush Dhakal, Adeel Ahmad, Abby Stylianou, Nathan Jacobs
Published
2025/5/20 07:36:04
Source type
preprint
Language
en
Abstract

We present Sat2Sound, a unified multimodal framework for geospatial soundscape understanding, designed to predict and map the distribution of sounds across the Earth's surface. Existing methods for this task rely on paired satellite images and geotagged audio samples, which often fail to capture the full diversity of sound at a location. Sat2Sound overcomes this limitation by augmenting datasets with semantically rich, vision-language model-generated soundscape descriptions, which broaden the range of possible ambient sounds represented at each location. Our framework jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook-aligned learning, discovering a set of "soundscape concepts" shared across modalities, enabling hyper-localized, explainable soundscape mapping. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on the GeoSound and SoundingEarth benchmarks. Finally, by retrieving detailed soundscape captions that can be rendered through text-to-audio models, Sat2Sound enables location-conditioned soundscape synthesis for immersive and educational applications, even with limited computational resources. Our code and models are available at https://github.com/mvrl/sat2sound.
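
The abstract evaluates the framework on cross-modal retrieval between satellite images and audio, learned in part through a contrastive objective. As a rough illustration of what those two ideas look like in practice, the sketch below computes a symmetric InfoNCE-style loss and Recall@K on toy embeddings. It is not the Sat2Sound implementation: the function names, the temperature of 0.07, and the three-dimensional vectors are all assumptions made for the example, and the text modalities and codebook alignment described in the abstract are left out.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, audio_embs):
    """Cosine similarity between every (image, audio) pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    auds = [l2_normalize(v) for v in audio_embs]
    return [[sum(a * b for a, b in zip(i, u)) for u in auds] for i in imgs]

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss; entry (i, i) is assumed to be the matched pair.

    The temperature value is an assumption, not taken from the paper.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denom - logits[i]
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the image-to-audio and audio-to-image directions.
    return 0.5 * (one_direction(sim) + one_direction(columns))

def recall_at_k(sim, k):
    """Fraction of images whose matched audio is among the k most similar clips."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Toy embeddings (hypothetical); real encoders would produce these vectors.
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
audio_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

sim = similarity_matrix(image_embs, audio_embs)
loss = info_nce(sim)
r1 = recall_at_k(sim, 1)
```

Here each image is closest to its own audio clip, so the loss is small and every query retrieves its match at rank 1. Reordering the audio list breaks the pairing and raises the loss, which is the signal a contrastive objective trains against.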

Metadata
arXiv: 2505.13777v2
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
RemoteSensing
EarthObservation
LLM
Multimodal
GeoMultimodal
cs.CV
cs.AI
cs.SD