UrbanComp Lab | Learning Resource Library

Title

UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science

Jie Zhang, Xingtong Yu, Yuan Fang, Rudi Stouffs, Zdravko Trivic

Published

2026/2/9 15:28:49

Source type

preprint

Abstract

Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. Finally, we introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks, including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built on the Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and gains of over 30% and 22%, respectively, on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
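
The abstract reports image retrieval results over fixed-dimensional embeddings but does not state the scoring function or metric. As an illustration only, the sketch below assumes cosine similarity for ranking and recall@k as the metric; the function names (cosine, rank_gallery, recall_at_k) and the toy vectors are hypothetical and are not taken from the paper or from UGBench.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    return sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )


def recall_at_k(queries, gallery, targets, k):
    """Fraction of queries whose target gallery index appears in the top-k ranking."""
    hits = sum(
        1
        for query, target in zip(queries, targets)
        if target in rank_gallery(query, gallery)[:k]
    )
    return hits / len(queries)


# Toy 2-D embeddings standing in for learned image embeddings (assumed values).
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
targets = [0, 2]  # index of the correct gallery item for each query

print(recall_at_k(queries, gallery, targets, k=1))
print(recall_at_k(queries, gallery, targets, k=2))
```

In this toy setup the second query's target is ranked second, so it counts as a hit only once k is at least 2. A real evaluation would substitute embeddings produced by the trained model and whatever metric the benchmark actually defines.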

Resource links

Paper PDF: arxiv.org/pdf/2602.08342v1
Original source page: arxiv.org/abs/2602.08342v1

Metadata

arXiv: 2602.08342v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

SpatialIntelligence

Multimodal

GeoMultimodal

cs.CV

cs.AI