UrbanComp Lab | Learning Resource Library

Paper

arXiv

GeoAI

GIS

SpatialIntelligence

LLM

Multimodal

GeoMultimodal

Agent

UrbanTraffic

Chinese Title

ALIGN：一种基于地理空间神经推理的高精度事故定位视觉-语言框架

English Title

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

MD Thamed Bin Zaman Chowdhury, Moazzem Hossain

Published

2025/11/9 18:44:26

Source Type

preprint

Language

Abstract

English Original

In low- and middle-income countries, public safety and urban planning initiatives frequently face a critical shortage of accurate, location-specific road crash data. Extracting reliable geospatial information from unstructured text requires overcoming the limitations of traditional text-based geocoding tools, which often fail in multilingual environments with ambiguous place descriptions. This study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework designed to emulate human spatial reasoning to infer precise accident coordinates from unstructured Bangla news reports and map-based cues. A multi-stage automated pipeline was developed to process diverse textual and visual data, integrating large language models for cue extraction with vision-language models for map verification. Using an agentic architecture, we modelled an iterative reasoning loop that combines Optical Character Recognition (OCR), grid-based spatial scanning, and a 3-run geometric voting method to mathematically isolate and reduce visual hallucinations. The findings highlight that the multimodal ALIGN framework significantly outperforms traditional text-only geoparsing baselines. For example, the proposed system successfully reduced the mean localization error from an unusable 10.915 km to a sub-kilometer precision of 0.593 km on a validation dataset. Furthermore, testing the framework against official Dhaka Metropolitan Police records confirmed its reliability by achieving a mean error of 0.465 km. The results provide a high-accuracy, training-free foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the integration of multimodal AI in transportation analytics.
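The abstract names two quantitative ideas without giving their details: a mean localization error in kilometers and a 3-run geometric vote that discards hallucinated coordinates. The sketch below is one plausible reading, not the authors' implementation. It assumes the error is the great-circle (haversine) distance between predicted and reference coordinates, and that a run counts as an outlier when no other run lands within a chosen radius. The function names, the `agree_km` threshold, and the centroid rule are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def vote(runs, agree_km=1.0):
    """Hypothetical geometric vote over repeated runs (assumed rule, not from the paper).

    A run is kept only if at least one other run lies within agree_km of it.
    Returns the centroid of the kept runs, or None when no two runs agree.
    """
    kept = [
        p for i, p in enumerate(runs)
        if any(haversine_km(p, q) <= agree_km
               for j, q in enumerate(runs) if j != i)
    ]
    if not kept:
        return None
    # Averaging lat/lon directly is adequate at sub-kilometer scale.
    lat = sum(p[0] for p in kept) / len(kept)
    lon = sum(p[1] for p in kept) / len(kept)
    return (lat, lon)


def mean_error_km(predictions, truths):
    """Mean haversine distance in km between paired predictions and references."""
    errors = [haversine_km(p, t) for p, t in zip(predictions, truths)]
    return sum(errors) / len(errors)


# Illustrative coordinates only (made up, not taken from the paper's data).
runs = [(23.8103, 90.4125), (23.8105, 90.4127), (23.9500, 90.6000)]
estimate = vote(runs)  # the third run has no nearby neighbour and is dropped
```

Under these assumptions, a single hallucinated run cannot pull the estimate away, because it needs support from at least one other run to be counted. When all three runs disagree the function returns `None`, which a caller could treat as a signal to rerun or flag the report.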

Resource Links

Paper PDF: arxiv.org/pdf/2511.06316v3

Original source page: arxiv.org/abs/2511.06316v3

Metadata

arXiv: 2511.06316v3

Source: arXiv

Type: Paper

Extraction status: raw

Keywords