We present a framework for automatically assessing building conditions across the United States using large language models (LLMs) and Google Street View (GSV) imagery. Fine-tuning Gemma 3 27B on a modest human-labeled dataset yields strong alignment with human mean opinion scores (MOS): measured against the MOS benchmark, the model outperforms even individual raters on both the Spearman rank correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC). To improve efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x inference speedup. We further distill this knowledge into a CNN-based model (EfficientNetV2-M) and a transformer-based model (SwinV2-B), which deliver close performance with a 30x speedup. In addition, through a human-AI alignment study we investigate how well LLMs assess an extensive list of built environment and housing attributes, and we develop a visualization dashboard that integrates the LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, achieving high accuracy with minimal human labeling effort.
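As an illustration of the evaluation metrics named above, the following minimal sketch computes SRCC and PLCC between a set of scores and the MOS. It is not the authors' evaluation code: the rating matrix, the model scores, and the helper names (`plcc`, `srcc`, `average_ranks`) are hypothetical, and the abstract does not say whether a rater's own scores are excluded from the MOS when that rater is evaluated, so the sketch simply uses the mean over all raters.

```python
from math import sqrt
from statistics import mean


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical data: rows are raters, columns are images (assumed 1-5 scale).
ratings = [
    [1, 2, 3, 4, 5],
    [2, 2, 4, 4, 5],
    [1, 3, 3, 5, 4],
]
# MOS per image: mean over all raters (an assumption, see the note above).
mos = [mean(col) for col in zip(*ratings)]
# Hypothetical model outputs for the same five images.
model_scores = [1.2, 2.4, 3.1, 4.5, 4.8]

print("model   SRCC/PLCC:", srcc(model_scores, mos), plcc(model_scores, mos))
for r, scores in enumerate(ratings):
    print(f"rater {r} SRCC/PLCC:", srcc(scores, mos), plcc(scores, mos))
```

SRCC rewards getting the ordering of images right, while PLCC also reflects how linearly the scores track the MOS, which is why the two are typically reported together.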