UrbanComp Lab | Learning Resource Library

Title

CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing

Tianhui Liu, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Jie Feng, Yong Li, Pan Hui

Published

2025/5/31 20:25:33

Source Type

preprint

Abstract

Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize 3 evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. These make CityLens the most extensive socioeconomic benchmark to date in terms of geographic coverage, indicator diversity, and model scale. Our results reveal that while LVLMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LVLMs to understand and predict urban socioeconomic patterns. The code and data are available at https://github.com/tsinghua-fib-lab/CityLens.

Resource Links

Paper PDF: arxiv.org/pdf/2506.00530v2

Original source page: arxiv.org/abs/2506.00530v2

Metadata

arXiv: 2506.00530v2

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

GeoAI

GIS

RemoteSensing

EarthObservation

LLM

Multimodal

cs.AI

cs.CL