Interactive digital maps have revolutionized how people travel and learn about the world; however, they rely on pre-existing structured data in GIS databases (e.g., road networks, POI indices), which limits their ability to answer geo-visual questions about what the world looks like. We introduce our vision for Geo-Visual Agents: multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos), combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work.