In low- and middle-income countries, public safety and urban planning initiatives frequently face a critical shortage of accurate, location-specific road crash data. Extracting reliable geospatial information from unstructured text requires overcoming the limitations of traditional text-based geocoding tools, which often fail in multilingual environments with ambiguous place descriptions. This study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework that emulates human spatial reasoning to infer precise accident coordinates from unstructured Bangla news reports and map-based cues. A multi-stage automated pipeline was developed to process diverse textual and visual data, integrating large language models for cue extraction with vision-language models for map verification. Using an agentic architecture, we modelled an iterative reasoning loop that combines optical character recognition (OCR), grid-based spatial scanning, and a three-run geometric voting method to mathematically isolate and reduce visual hallucinations. The multimodal ALIGN framework substantially outperforms traditional text-only geoparsing baselines: on a validation dataset, it reduced the mean localization error from 10.915 km, too coarse to be usable, to a sub-kilometer 0.593 km. Tested against official Dhaka Metropolitan Police records, the framework achieved a mean error of 0.465 km, which supports its reliability. The results provide a high-accuracy, training-free foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the integration of multimodal AI in transportation analytics.
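To make the voting and error-metric ideas concrete, the following is a minimal Python sketch, not the ALIGN implementation. The abstract does not specify the voting rule, so the rule shown here (keep the closest pair among three candidate coordinates, reject the result if even that pair disagrees by more than a threshold) is an assumption, as are the function names `haversine_km`, `vote`, and `mean_error_km` and the 1 km default for `agree_km`. The mean localization error is assumed to be the average great-circle distance between predicted and reference coordinates.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def vote(candidates, agree_km=1.0):
    """Hypothetical voting rule over the candidate coordinates of several runs.

    Picks the two candidates that lie closest together and returns their
    midpoint, so a single outlying (possibly hallucinated) run is ignored.
    Returns None when even the closest pair is farther apart than agree_km.
    """
    best_pair = min(combinations(candidates, 2),
                    key=lambda pair: haversine_km(*pair))
    if haversine_km(*best_pair) > agree_km:
        return None
    (lat1, lon1), (lat2, lon2) = best_pair
    # A plain average is adequate because the two points are under agree_km apart.
    return ((lat1 + lat2) / 2, (lon1 + lon2) / 2)


def mean_error_km(predictions, truths):
    """Mean great-circle error between predicted and reference coordinates."""
    distances = [haversine_km(p, t) for p, t in zip(predictions, truths)]
    return sum(distances) / len(distances)


# Illustrative (made-up) coordinates: two runs agree, the third is an outlier.
runs = [(23.8103, 90.4125), (23.8110, 90.4130), (23.7000, 90.3500)]
consensus = vote(runs)
```

In this illustration the third run lies several kilometres from the other two, so `vote` returns the midpoint of the first two runs and discards the outlier. A production system would likely need a fallback, such as re-running the loop, when `vote` returns `None`.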