UrbanComp Lab | Learning Resource Library

Paper

arXiv

GeoAI

GIS

RemoteSensing

EarthObservation

LLM

Multimodal

GeoMultimodal

Title

Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

Subash Khanal, Srikumar Sastry, Aayush Dhakal, Adeel Ahmad, Abby Stylianou, Nathan Jacobs

Published

2025/5/20 07:36:04

Source type

preprint

Abstract

We present Sat2Sound, a unified multimodal framework for geospatial soundscape understanding, designed to predict and map the distribution of sounds across the Earth's surface. Existing methods for this task rely on paired satellite images and geotagged audio samples, which often fail to capture the full diversity of sound at a location. Sat2Sound overcomes this limitation by augmenting datasets with semantically rich, vision-language model-generated soundscape descriptions, which broaden the range of possible ambient sounds represented at each location. Our framework jointly learns from audio, text descriptions of audio, satellite images, and synthetic image captions through contrastive and codebook-aligned learning, discovering a set of "soundscape concepts" shared across modalities, enabling hyper-localized, explainable soundscape mapping. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on the GeoSound and SoundingEarth benchmarks. Finally, by retrieving detailed soundscape captions that can be rendered through text-to-audio models, Sat2Sound enables location-conditioned soundscape synthesis for immersive and educational applications, even with limited computational resources. Our code and models are available at https://github.com/mvrl/sat2sound.
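The abstract reports cross-modal retrieval between satellite images and audio. As a rough illustration of how such retrieval is commonly scored, the sketch below computes Recall@K over cosine similarities between paired embeddings. It is not taken from the Sat2Sound code: the function names `cosine` and `recall_at_k` and the toy two-dimensional embeddings are assumptions made for illustration, and the paper's actual evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(image_embs, audio_embs, k):
    """Fraction of images whose paired audio (same index) ranks in the top k."""
    hits = 0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, aud) for aud in audio_embs]
        # Audio indices sorted by decreasing similarity to this image.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


# Toy, hand-made embeddings (hypothetical values, not real model outputs).
# Row i of image_embs is assumed to be paired with row i of audio_embs.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

for k in (1, 2, 3):
    print(f"Recall@{k}:", recall_at_k(image_embs, audio_embs, k))
```

In a real setting the embeddings would come from trained image and audio encoders, and the same function could be called with the arguments swapped to score audio-to-image retrieval.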

Resource links

Paper PDF: arxiv.org/pdf/2505.13777v2

Original source page: arxiv.org/abs/2505.13777v2

Metadata

arXiv: 2505.13777v2

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

GeoAI

GIS

RemoteSensing

EarthObservation

LLM

Multimodal

GeoMultimodal

cs.CV

cs.AI

cs.SD