Paper
arXiv
RemoteSensing
EarthObservation
Multimodal
Title
Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation
Haocheng Li, Juepeng Zheng, Shuangxi Miao, Ruibo Lu, Guosheng Cai, Haohuan Fu, Jianxi Huang
Published
2026/3/18 21:23:58
Source Type
preprint
Language
en
Abstract

Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code for this work is available at https://github.com/sauryeo/MoBaNet.
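
The abstract describes the Difference-Guided Gated Fusion Module (DGFM) only at a high level. As a reading aid, the following is a minimal PyTorch sketch of one way a difference-guided gated fusion step could be realized; the class name, the 1x1-convolution gate, the sigmoid gating, and the concatenate-then-project step are illustrative assumptions and are not taken from the paper.

# Hypothetical sketch of a difference-guided gated fusion step (not the authors' code).
import torch
import torch.nn as nn

class DifferenceGuidedGatedFusion(nn.Module):
    """Fuses paired stage features from two modalities.

    The explicit cross-modal difference predicts a per-pixel, per-channel
    gate that decides how much each modality contributes to the fused
    feature. All layer choices here are assumptions for illustration.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Gate predictor driven by the cross-modal discrepancy.
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Projection of the concatenated, gate-weighted features back to C channels.
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_primary: torch.Tensor, feat_auxiliary: torch.Tensor) -> torch.Tensor:
        # Cross-modal discrepancy guides feature selection.
        diff = feat_primary - feat_auxiliary
        g = self.gate(diff)  # gate values in [0, 1]
        fused = torch.cat([g * feat_primary, (1 - g) * feat_auxiliary], dim=1)
        return self.proj(fused)

if __name__ == "__main__":
    # Smoke test with dummy stage features (batch=2, C=64, 32x32), e.g. optical vs. DSM streams.
    m = DifferenceGuidedGatedFusion(channels=64)
    x_rgb = torch.randn(2, 64, 32, 32)
    x_dsm = torch.randn(2, 64, 32, 32)
    print(m(x_rgb, x_dsm).shape)  # torch.Size([2, 64, 32, 32])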

Metadata
arXiv: 2603.17705v1
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
RemoteSensing
EarthObservation
Multimodal
cs.CV