Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains hindered by a lack of ground-truth datasets. Several real-world and simulated human trajectory datasets exist, but they capture only normal mobility patterns and contain no annotated anomalies. This scarcity stems from the statistical rarity of anomalous events, which makes collecting them through conventional observation impractical. In addition, the systematic acquisition of large-scale mobility data is constrained by high costs and strict privacy regulations. To overcome these limitations and build a reliable human trajectory anomaly dataset with annotated ground truth, we introduce a novel end-to-end generative framework that synthesizes realistic trajectory anomalies at scale. The framework operates directly on baseline simulated trajectories, bridging the gap between purely synthetic mobility data and real-world physical constraints. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies, such as irregular out-of-distribution check-ins and skipped routine visits. To ensure spatial validity, the system uses map-constrained route reconstruction to recompute the physical transitions between the staypoints modified by the LLM agents. Finally, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, that emulates heterogeneous GPS sensor degradation.
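As a rough illustration of what a context-aware spatial noise model could look like, the minimal sketch below perturbs each trajectory point with zero-mean Gaussian offsets whose scale depends on a context label attached to the point. The context labels, the `SIGMA_M` values, and the names `Point`, `add_gps_noise`, and `perturb_trajectory` are all hypothetical choices made for this example; the actual parameterization by environmental and location-specific variables is not specified here and may differ.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical noise scales in metres per context label.
# Illustrative values only; they are assumptions, not taken from the framework.
SIGMA_M = {"open_sky": 3.0, "urban_canyon": 15.0, "indoor": 30.0}

EARTH_RADIUS_M = 6_371_000.0


@dataclass(frozen=True)
class Point:
    lat: float      # degrees
    lon: float      # degrees
    context: str    # key into SIGMA_M (assumed labelling scheme)


def add_gps_noise(point: Point, rng: random.Random) -> Point:
    """Shift one point by Gaussian offsets scaled by its context."""
    sigma = SIGMA_M[point.context]
    # Independent zero-mean offsets, in metres, along north and east.
    d_north = rng.gauss(0.0, sigma)
    d_east = rng.gauss(0.0, sigma)
    # Small-offset conversion from metres to degrees.
    d_lat = math.degrees(d_north / EARTH_RADIUS_M)
    d_lon = math.degrees(
        d_east / (EARTH_RADIUS_M * math.cos(math.radians(point.lat)))
    )
    return Point(point.lat + d_lat, point.lon + d_lon, point.context)


def perturb_trajectory(points: list[Point], seed: int = 0) -> list[Point]:
    """Apply context-dependent noise to every point, reproducibly."""
    rng = random.Random(seed)
    return [add_gps_noise(p, rng) for p in points]
```

Under these assumptions, a point labelled `indoor` receives offsets ten times larger on average than one labelled `open_sky`, which is one simple way to express heterogeneous degradation. A fuller model would likely add temporally correlated error and signal dropouts, which this sketch omits.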