Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, a spatial support, an intervention direction, and a constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with the Mobility Infrastructure and Physical Maintenance levers producing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.
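As an illustration only, the lever structure and the retain-if-valid rule described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the pilot: the class and function names, the score keys, the 0.5 thresholds, and the example concept, region, and template text are all hypothetical. The validity scores are assumed to come from upstream checkers that are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four validity checks named in the abstract; the key names are assumptions.
CHECKS = ("same_place", "locality", "realism", "plausibility")


@dataclass(frozen=True)
class Lever:
    """One lever: concept, spatial support, direction, and edit template."""
    category: str         # lever category, e.g. "Physical Maintenance"
    concept: str          # semantic concept (hypothetical example value below)
    spatial_support: str  # region the edit is allowed to touch
    direction: str        # intervention direction (hypothetical vocabulary)
    edit_template: str    # constrained prompt template with named slots

    def prompt(self) -> str:
        # Fill the constrained template to obtain the editing prompt.
        return self.edit_template.format(
            direction=self.direction,
            concept=self.concept,
            region=self.spatial_support,
        )


@dataclass
class Candidate:
    """An edited image, represented here only by its validity scores."""
    lever: Lever
    scores: Dict[str, float]  # assumed to lie in [0, 1], computed upstream


def is_valid(candidate: Candidate, thresholds: Dict[str, float]) -> bool:
    # A candidate passes only if every check meets its threshold;
    # a missing score counts as a failure.
    return all(
        candidate.scores.get(name, 0.0) >= thresholds[name] for name in CHECKS
    )


def retain_valid(
    candidates: List[Candidate], thresholds: Dict[str, float]
) -> List[Candidate]:
    # Keep only the candidates that satisfy all four validity checks.
    return [c for c in candidates if is_valid(c, thresholds)]


# Hypothetical lever and scores, for illustration only.
lever = Lever(
    category="Physical Maintenance",
    concept="litter",
    spatial_support="sidewalk",
    direction="remove",
    edit_template="{direction} {concept} on the {region}; keep everything else unchanged",
)
thresholds = {name: 0.5 for name in CHECKS}
candidates = [
    Candidate(lever, {"same_place": 0.9, "locality": 0.8,
                      "realism": 0.7, "plausibility": 0.9}),
    Candidate(lever, {"same_place": 0.9, "locality": 0.2,
                      "realism": 0.7, "plausibility": 0.9}),  # fails locality
]
kept = retain_valid(candidates, thresholds)
print(lever.prompt())
print(len(kept))
```

In this sketch the second candidate is discarded because its locality score falls below the assumed threshold, mirroring the rule that an edit is retained only when all four checks pass.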