Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalization and often yields fragmented predictions for large-extent land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial-structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive-field priors carried by the global representation into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses the prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal fusion, thereby significantly enhancing the semantic consistency and category-level discriminability of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance among cross-modal fusion approaches. By fully unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution remote sensing analysis.
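To make the two pathways concrete, the following is a minimal NumPy sketch of the decoupled design described above. Every parameterization here is an illustrative assumption, not the paper's actual architecture: `w_bias` (projecting the global embedding to a key-wise attention bias), `w_align` (aligning the global embedding to the local feature space), and the simple residual fusion are hypothetical stand-ins for the structural prior modulation and global semantic injection branches.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structural_prior_attention(x, g, w_bias):
    """Structural prior modulation (sketch): self-attention over local
    tokens x (N, d), with an additive key-wise bias derived from the
    global embedding g (d_g,). The bias form and w_bias (N, d_g) are
    hypothetical; the idea is that the global prior re-weights which
    tokens local attention attends to."""
    n, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)   # plain self-attention logits
    bias = np.tanh(w_bias @ g)        # (N,) structural prior per key
    scores = scores + bias[None, :]   # modulate attention key-wise
    return softmax(scores, axis=-1) @ x

def global_semantic_injection(x, g, w_align, alpha=0.5):
    """Global semantic injection (sketch): project g into the local
    feature space via w_align (d, d_g) and add it as a gated residual,
    a simple stand-in for the paper's cross-modal fusion."""
    g_aligned = w_align @ g           # (d,) aligned global semantics
    return x + alpha * g_aligned[None, :]
```

In this toy form, the structural branch only reshapes the attention distribution (it never overwrites local features), while the semantic branch adds aligned global context directly to deep features, mirroring the structure/semantics separation the framework is named for.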