Title
ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning
MD Thamed Bin Zaman Chowdhury, Moazzem Hossain
Published
2025/11/9 18:44:26
Source type
preprint
Language
en
Abstract

In low- and middle-income countries, public safety and urban planning initiatives frequently face a critical shortage of accurate, location-specific road crash data. Extracting reliable geospatial information from unstructured text requires overcoming the limitations of traditional text-based geocoding tools, which often fail in multilingual environments with ambiguous place descriptions. This study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework designed to emulate human spatial reasoning and infer precise accident coordinates from unstructured Bangla news reports and map-based cues. A multi-stage automated pipeline was developed to process diverse textual and visual data, integrating large language models for cue extraction with vision-language models for map verification. Using an agentic architecture, we modelled an iterative reasoning loop that combines Optical Character Recognition (OCR), grid-based spatial scanning, and a 3-run geometric voting method to mathematically isolate and reduce visual hallucinations. The multimodal ALIGN framework significantly outperforms traditional text-only geoparsing baselines: on a validation dataset, it reduced the mean localization error from an unusable 10.915 km to a sub-kilometer 0.593 km. Testing the framework against official Dhaka Metropolitan Police records further confirmed its reliability, with a mean error of 0.465 km. The results provide a high-accuracy, training-free foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the integration of multimodal AI in transportation analytics.
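
The abstract reports mean localization error in kilometres and mentions a 3-run geometric voting step, but it does not spell out either computation. The sketch below is one plausible reading, not the authors' implementation. It assumes the error is the great-circle (haversine) distance between predicted and reference coordinates, and that "voting" means keeping the two of three runs that agree most closely and discarding the outlier. The function names, the `max_spread_km` threshold, and the sample coordinates are all hypothetical.

```python
import math
from itertools import combinations
from statistics import mean

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def vote_3_runs(candidates, max_spread_km=1.0):
    """Hypothetical geometric vote over three (lat, lon) candidates.

    Finds the closest pair of runs. If that pair is within max_spread_km,
    returns its midpoint (the third run is treated as a possible
    hallucination and dropped); otherwise returns None to signal that
    no two runs agree.
    """
    if len(candidates) != 3:
        raise ValueError("expected exactly three candidate coordinates")
    # Closest pair by great-circle distance.
    a, b = min(combinations(candidates, 2),
               key=lambda pair: haversine_km(*pair[0], *pair[1]))
    if haversine_km(*a, *b) > max_spread_km:
        return None
    # Arithmetic midpoint is adequate for points under ~1 km apart.
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def mean_localization_error_km(predicted, reference):
    """Mean haversine distance between paired predicted and reference points."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return mean(haversine_km(*p, *r) for p, r in zip(predicted, reference))


if __name__ == "__main__":
    # Illustrative coordinates near Dhaka; the third run is a deliberate outlier.
    runs = [(23.8103, 90.4125), (23.8105, 90.4127), (23.9000, 90.5000)]
    print(vote_3_runs(runs))
```

Under these assumptions, a run that disagrees with the other two never influences the final coordinate, and a case where all three runs disagree is flagged rather than averaged.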

Metadata
arXiv: 2511.06316v3
Source: arXiv
Type: Paper
Extraction status: raw
Keywords
GeoAI
GIS
SpatialIntelligence
LLM
Multimodal
GeoMultimodal
Agent
UrbanTraffic
cs.AI