Paper
arXiv
GeoAI
GIS
RemoteSensing
EarthObservation
Multimodal
GeoMultimodal
Title
Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images
Jinkun Dai, Yuanxin Ye, Peng Tang, Tengfeng Tang, Xianping Ma, Jing Xiao, Mi Wang
Published
2026/4/27 15:23:36
Source type
preprint
Language
en
Abstract

Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal network that synergistically integrates textual supervision with visual representations for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from diverse textual data, enabling dynamic cross-modal fusion. These text-derived features interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we construct two new multi-modal datasets and carry out extensive experiments comparing the proposed method with other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet.
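The abstract describes a dual-branch text encoder whose scene-level and object-level features interact with visual embeddings through a text-guided fusion module. The paper's actual architecture is not specified here; the following is a minimal NumPy sketch of one plausible reading, using standard scaled dot-product cross-attention from visual patches to text tokens, with a simple residual sum of the two text-derived streams. All function names, dimensions, and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    """Visual patches attend to text tokens (scaled dot-product attention).
    visual: (N, d) patch embeddings; text: (T, d) token embeddings.
    Returns (N, d) text-conditioned visual features."""
    d = visual.shape[1]
    attn = softmax(visual @ text.T / np.sqrt(d), axis=-1)  # (N, T)
    return attn @ text

def text_guided_fusion(visual, scene_text, object_text):
    """Fuse scene-level and object-level text features into the visual
    stream via two cross-attention branches and a residual sum
    (illustrative design choice, not from the paper)."""
    scene_feat = cross_attention(visual, scene_text)
    object_feat = cross_attention(visual, object_text)
    return visual + scene_feat + object_feat

rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 32))      # 16 patches, 32-dim embeddings
scene_text = rng.standard_normal((4, 32))   # scene-level description tokens
object_text = rng.standard_normal((8, 32))  # object-level label tokens
fused = text_guided_fusion(visual, scene_text, object_text)
print(fused.shape)  # (16, 32): same shape as the visual input
```

Because the fusion is residual and shape-preserving, the fused features can replace the original visual embeddings in any downstream segmentation head without architectural changes.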

Metadata
arXiv: 2604.24125v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
RemoteSensing
EarthObservation
Multimodal
GeoMultimodal
cs.CV