UrbanComp Lab | Learning Resource Library

Paper

arXiv

LLM

Multimodal

Chinese Title

利用多模态大语言模型（Multimodal LLMs）从街景影像评估建成环境与住宅属性

English Title

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Siyuan Yao, Siavash Ghorbany, Kuangshi Ai, Arnav Cherukuthota, Meghan Forstchen, Alexis Korotasz, Matthew Sisk, Ming Hu, Chaoli Wang

Published

2026/4/23 05:42:09

Source Type

preprint

Abstract

Chinese Translation

我们提出一种新框架，通过结合大语言模型（LLMs）与谷歌街景（Google Street View, GSV）影像，实现对美国全国范围内建筑状况的自动评估。该方法在较小规模人工标注数据集上对 Gemma 3 27B 进行微调，使其输出与人类平均意见分（Mean Opinion Score, MOS）高度一致，在与 MOS 基准对比的 Spearman 等级相关系数（SRCC）和皮尔逊线性相关系数（PLCC）指标上甚至优于单个评估者。为提升效率，我们采用知识蒸馏技术，将 Gemma 3 27B 的能力迁移至更小的 Gemma 3 4B 模型，在保持相近性能的同时实现 3 倍推理加速；进一步将知识蒸馏至 CNN 模型 EfficientNetV2-M 和 Transformer 模型 SwinV2-B，在性能接近的前提下实现 30 倍加速。此外，我们通过人-AI 对齐研究，系统考察了 LLMs 对大量建成环境与住宅属性的评估能力，并开发了一个可视化仪表盘，整合 LLM 评估结果，供房主开展下游分析。本框架为大规模建筑状况评估提供了灵活、高效的解决方案，在仅需极少人工标注的前提下实现高精度评估。

English Original

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
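The abstract evaluates alignment with human raters using SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) against the mean opinion score. The sketch below shows one way such metrics could be computed with only the Python standard library. The rating matrix, the model scores, and the helper names (`pearson`, `average_ranks`, `spearman`) are hypothetical illustrations, not data or code from the paper.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings: one row per rater, one column per image.
ratings = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 4],
    [1, 3, 2, 4, 5],
]
# Mean opinion score (MOS) per image: the average over raters.
mos = [mean(column) for column in zip(*ratings)]

# Hypothetical model predictions for the same five images.
model_scores = [1.2, 2.1, 2.9, 4.3, 4.8]

model_srcc = spearman(model_scores, mos)
model_plcc = pearson(model_scores, mos)
# Per-rater correlations against the same MOS, for comparison with the model.
rater_srcc = [spearman(r, mos) for r in ratings]
rater_plcc = [pearson(r, mos) for r in ratings]
```

In this sketch each rater's own scores contribute to the MOS they are compared against, which inflates the rater correlations; a leave-one-out MOS is a common alternative. Which variant the paper uses is not stated in the abstract.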

Resource Links

Paper PDF: arxiv.org/pdf/2604.21102v1

Original source page: arxiv.org/abs/2604.21102v1

Metadata

arXiv: 2604.21102v1

Source: arXiv

Type: Paper

Extraction status: raw

Keywords

LLM

Multimodal

cs.CV

cs.AI