Street-view imagery (SVI) is widely used to quantify key indicators of the urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative or not-yet-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs from Chicago and Orlando with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and image controls. Our quantitative evaluations show that incorporating visual controls improves semantic consistency, reducing LPIPS by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency, as measured by mIoU, increases by 23.7% in Orlando and 46.4% in Chicago, with class-wise gains exceeding 100% for the building view index. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts. When textual and visual controls conflict, image controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
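To make the metrics named above concrete, the following is a minimal sketch of how view indices and mIoU could be computed from segmentation maps. It assumes a view index is the share of image pixels assigned to a class, which is a common definition but is not spelled out in this abstract. The class ids, the function names (`view_indices`, `mean_iou`, `relative_change`), and the toy label maps are all hypothetical and do not come from the authors' pipeline.

```python
import numpy as np

# Hypothetical class ids: the abstract does not specify a label scheme.
CLASSES = {"road": 0, "building": 1, "greenery": 2, "sky": 3}


def view_indices(seg, classes=CLASSES):
    """Return the pixel share of each class in an integer label map.

    Assumption: a 'view index' is the fraction of pixels labelled
    with the class (e.g. greenery pixels / all pixels).
    """
    total = seg.size
    return {name: float(np.sum(seg == cid)) / total
            for name, cid in classes.items()}


def mean_iou(ref, gen, classes=CLASSES):
    """Mean intersection-over-union between two label maps.

    Classes absent from both maps are skipped so they do not
    distort the average.
    """
    ious = []
    for cid in classes.values():
        r, g = ref == cid, gen == cid
        union = np.logical_or(r, g).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(r, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


def relative_change(baseline, value):
    """Relative change of a metric against a baseline (0.2 means +20%)."""
    return (value - baseline) / baseline


# Toy 4x4 label maps: a reference scene and a generated scene in which
# one building pixel has been replaced by greenery.
ref = np.array([[3, 3, 3, 3],
                [2, 1, 1, 2],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
gen = np.array([[3, 3, 3, 3],
                [2, 2, 1, 2],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])

print(view_indices(ref))
print(view_indices(gen))
print(mean_iou(ref, gen))
```

In this reading, a percentage gain in mIoU is a relative change between two mIoU scores (for example with and without visual controls). Whether the reported 23.7% and 46.4% figures are relative or absolute differences is not stated here, so `relative_change` reflects only one possible interpretation.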