Title
TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
Maria Despoina Siampou, Gengchen Mai, Ni Lao, Jinmeng Rao, Neha Arora, Cyrus Shahabi, Shushman Choudhury
Published
2026/5/8 06:10:41
Source type
preprint
Language
en
Abstract

Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.
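
The abstract's central geometric point is that a street-view image rarely sits exactly on a trajectory waypoint, so alignment needs a representation defined at arbitrary positions along the path. The sketch below illustrates only that problem setting and is not the TrajGANR method: it swaps in plain linear interpolation between per-waypoint embeddings where the paper learns a continuous neural representation. The function names `project_onto_path` and `interpolate_embedding`, the planar coordinates, and the toy embeddings are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def project_onto_path(waypoints: Sequence[Point], query: Point) -> Tuple[int, float]:
    """Return (segment index i, fraction t in [0, 1]) of the path point closest to query.

    Assumption: coordinates are planar (e.g. already projected to metres).
    """
    if len(waypoints) < 2:
        raise ValueError("need at least two waypoints")
    qx, qy = query
    best, best_d2 = (0, 0.0), float("inf")
    for i in range(len(waypoints) - 1):
        (ax, ay), (bx, by) = waypoints[i], waypoints[i + 1]
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0.0:
            t = 0.0  # degenerate segment: both waypoints coincide
        else:
            # Clamp the projection so the closest point stays on the segment.
            t = max(0.0, min(1.0, ((qx - ax) * dx + (qy - ay) * dy) / seg_len2))
        px, py = ax + t * dx, ay + t * dy
        d2 = (qx - px) ** 2 + (qy - py) ** 2
        if d2 < best_d2:
            best_d2, best = d2, (i, t)
    return best


def interpolate_embedding(
    embeddings: Sequence[Sequence[float]], i: int, t: float
) -> List[float]:
    """Linearly blend the embeddings of waypoints i and i + 1.

    Stand-in for a learned continuous representation (illustrative assumption).
    """
    a, b = embeddings[i], embeddings[i + 1]
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Hypothetical toy data: an L-shaped path and an image location off the path.
waypoints = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
waypoint_embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
image_location = (4.0, 3.0)

segment, fraction = project_onto_path(waypoints, image_location)
trajectory_feature = interpolate_embedding(waypoint_embeddings, segment, fraction)
```

In this toy setup, `trajectory_feature` is the vector one would pair with the embedding of the image at `image_location`; how that pairing is scored and trained is left to the paper.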

Metadata
arXiv: 2605.06990v1
Source: arXiv
Type: paper
Extraction status: raw
Keywords
GeoAI
GIS
RemoteSensing
EarthObservation
SpatialIntelligence
Trajectory
Mobility
GeoLargeModel
GeoFoundationModel
Multimodal
GeoMultimodal
cs.CV
cs.LG