News
Google Research Blog
AI
Industry
Dataset
Chinese Title
教会人工智能读取地图
English Title
Teaching AI to read a map
Google Research Blog
Published
2026/2/18 05:37:00
Source type
blog
Language
en
Summary
Chinese Translation

观察一张购物中心或主题公园的地图。在几秒钟内,你的大脑就能处理视觉信息,确定自身位置,并规划出到达目的地的最佳路径。你本能地理解哪些线条代表墙壁,哪些代表通道。这种基本的细粒度空间推理能力对人类而言是自然而然的。给定地图上的起点和终点,模型能够输出一条符合地图约束的有效路径。

English Original

Look at a map of a shopping mall or a theme park. Within seconds, your brain processes the visual information, identifies your location, and traces the optimal path to your destination. You instinctively understand which lines are walls and which are walkways. This fundamental skill — fine-grained spatial reasoning — is second nature. The task for a model is the same: given a start and end location on a map, output a valid path that respects map constraints.

Body

Look at a map of a shopping mall or a theme park. Within seconds, your brain processes the visual information, identifies your location, and traces the optimal path to your destination. You instinctively understand which lines are walls and which are walkways. This fundamental skill — fine-grained spatial reasoning — is second nature. The task for a model is the same: given a start and end location on a map, output a valid path that respects map constraints.

The most direct way to teach this would be to collect a massive dataset of maps with millions of paths traced by hand. But annotating a single path with pixel-level accuracy is a painstaking process, and scaling it to the level required for training a large model is practically impossible. Furthermore, many of the best examples of complex maps — like those for malls, museums, and theme parks — are proprietary and cannot be easily collected for research. This data bottleneck has held back progress: without sufficient training examples, models lack the "spatial grammar" to interpret a map correctly. They see a soup of pixels, not a structured, navigable space.

To address this data gap, we designed a fully automated, scalable pipeline that leverages the generative capabilities of Gemini models to produce diverse, high-quality maps. This process allows fine-grained control over data diversity and complexity, generating annotated paths that adhere to intended routes and avoid non-traversable regions, without the need for collecting large-scale real-world maps. The pipeline works in four automated and scalable stages, using AI models as both creators and critics to ensure quality and produce pixel-level annotations.
First, we use a large language model (LLM) to generate rich, descriptive prompts for different types of maps — everything from "a map of a zoo with interconnected habitats" to "a shopping mall with a central food court" or "a fantasy theme park with winding paths through different themed lands." These text prompts are then fed into a text-to-image model that renders them into complex map images.

Once we have a map image, we need to identify all the "walkable" areas. Our system does this by clustering the pixels by color to create candidate path masks — essentially, a black-and-white map of all the walkways.

With a clean mask of all traversable areas, we convert that 2D image into a more structured graph format. Think of this as creating a digital version of a road network, where intersections are nodes and the roads between them are edges. This "pixel graph" captures the connectivity of the map, making it easy to calculate routes computationally.

This pipeline enabled us to create a dataset of 2M annotated map images with valid paths. While the generated images occasionally exhibit typographic errors, this study focuses primarily on path fidelity; we anticipate that ongoing advancements in generative modeling will naturally mitigate these artifacts in future iterations.

Fine-tuning on our dataset substantially improved the models' abilities across the board. The fine-tuned Gemini 2.5 Flash model, for example, saw its normalized dynamic time warping (NDTW) score drop significantly (from 1.29 to 0.87), achieving the best overall performance. These gains confirm our central hypothesis: fine-grained spatial reasoning is not an innate property of multimodal large language models (MLLMs) but an acquired skill. With the right kind of explicit supervision, even if it is synthetically generated, we can teach models to understand and navigate spatial layouts.

Qualitative examples compare the fine-tuned Gemini 2.5 Flash (red) to the base model (blue).
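The mask-to-graph-to-route stages described above can be sketched end to end. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy image, the color-threshold mask, the 4-connected pixel graph, and unit-cost Dijkstra routing are all simplifications.

```python
import heapq

# Toy "map": each cell holds an (r, g, b) color. In the real pipeline the
# image comes from a text-to-image model; here it is hand-crafted.
# Hypothetical palette: light gray = walkway, dark gray = wall.
WALK, WALL = (200, 200, 200), (40, 40, 40)
image = [
    [WALL, WALL, WALL, WALL, WALL, WALL, WALL],
    [WALL, WALK, WALK, WALK, WALL, WALK, WALL],
    [WALL, WALK, WALL, WALK, WALK, WALK, WALL],
    [WALL, WALK, WALL, WALL, WALL, WALK, WALL],
    [WALL, WALL, WALL, WALL, WALL, WALL, WALL],
]

def color_mask(img, centroid, tol=30):
    """Mask stage: mark pixels whose color is near the walkway centroid."""
    near = lambda c: all(abs(a - b) <= tol for a, b in zip(c, centroid))
    return [[near(px) for px in row] for row in img]

def mask_to_graph(mask):
    """Graph stage: 4-connected pixel graph over traversable cells."""
    h, w = len(mask), len(mask[0])
    graph = {}
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            graph[(y, x)] = [
                (y + dy, x + dx)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
            ]
    return graph

def shortest_path(graph, start, goal):
    """Routing stage: Dijkstra with unit edge costs from start to goal."""
    dist, prev, pq = {start: 0}, {}, [(0, start)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nb in graph.get(node, []):
            nd = d + 1
            if nd < dist.get(nb, float("inf")):
                dist[nb], prev[nb] = nd, node
                heapq.heappush(pq, (nd, nb))
    if goal not in dist:
        return None  # goal unreachable from start
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

mask = color_mask(image, WALK)
graph = mask_to_graph(mask)
path = shortest_path(graph, (1, 1), (1, 5))
print(path)  # 7 waypoints: through the corridor at row 2, not the dead end
```

The returned waypoint sequence is exactly the kind of pixel-level path annotation the pipeline attaches to each generated map.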
The fine-tuned model adheres more closely to the intended routes and avoids non-traversable regions. The ability to reason about paths and connectivity unlocks a host of future applications.
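The NDTW figures quoted above are based on dynamic time warping (DTW) between the predicted and ground-truth point sequences. A minimal sketch follows; normalizing by the reference path length is an assumption for illustration, not necessarily the paper's exact formula.

```python
import math

def dtw(pred, ref):
    """Dynamic time warping cost between two 2-D point sequences."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])
            # Best of: advance pred, advance ref, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def ndtw(pred, ref):
    """DTW normalized by reference path length (lower is better)."""
    length = sum(math.dist(a, b) for a, b in zip(ref, ref[1:])) or 1.0
    return dtw(pred, ref) / length

ref = [(0, 0), (0, 1), (0, 2), (0, 3)]
print(ndtw(ref, ref))                               # identical paths score 0.0
print(ndtw([(1, 0), (1, 1), (1, 2), (1, 3)], ref))  # a parallel offset path scores worse
```

Unlike pointwise error, DTW tolerates paths sampled at different densities, which is why it suits comparing a model's traced route against a ground-truth route.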

Resource Links
MapTrace project page: artemisp.github.io/maptrace
MapTrace: Scalable Data Generation for Route Tracing on Maps: arxiv.org/abs/2512.19609
MapBench: arxiv.org/abs/2503.14607
HuggingFace dataset (2M question-answer pairs): huggingface.co/datasets/google/MapTrace
Imagen 4: deepmind.google/models/imagen
Gemini 2.5 Flash: docs.cloud.google.com...i/generative-ai/docs/models/gemini/2-5-flash
Gemini 2.5 Pro: docs.cloud.google.com...-ai/generative-ai/docs/models/gemini/2-5-pro
Gemma 3 27B: huggingface.co/google/gemma-3-27b-it
Dijkstra's algorithm: en.wikipedia.org/wiki/Dijkstra%27s_algorithm
Dynamic time warping: en.wikipedia.org/wiki/Dynamic_time_warping
Graph: en.wikipedia.org/wiki/Graph
Google Research on GitHub: github.com/google-research
Original source page: research.google/blog/teaching-ai-to-read-a-map
Metadata
Source: Google Research Blog
Type: News
Extraction status: raw
Keywords
Machine Perception
Open Source Models & Datasets
AI
Industry
Dataset