Paper
arXiv
SpatialIntelligence
Multimodal
GeoMultimodal
Title
UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science
Jie Zhang, Xingtong Yu, Yuan Fang, Rudi Stouffs, Zdravko Trivic
Published
2026/2/9 15:28:49
Source type
preprint
Language
en
Abstract

Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. Finally, we introduce UGBench, a comprehensive benchmark that evaluates how spatially grounded embeddings support diverse urban understanding tasks, including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built on the Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and gains of over 30% and 22%, respectively, on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
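
The abstract names contrastive learning between images and text but does not state the loss. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE loss over paired image and text embeddings in plain Python. It is not the authors' UGE objective: the function names, the temperature of 0.07, and the toy two-dimensional vectors are all assumptions made for this example, and the graph-based spatial encoding and instruction guidance are left out entirely.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i and text i are a positive pair, all others negatives.

    The temperature value is an assumed default, not one taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy check: matched pairs should score a lower loss than mismatched pairs.
matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)
```

In a real training setup the embeddings would come from a VLM backbone and the loss would be computed on batched tensors with automatic differentiation; the pure-Python version here only makes the arithmetic explicit.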

Metadata
arXiv: 2602.08342v1
Source: arXiv
Type: paper
Extraction status: raw
Keywords
SpatialIntelligence
Multimodal
GeoMultimodal
cs.CV
cs.AI