Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, but these often sacrifice scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation parameters directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
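The progressive pipeline described above (image features → canonical geometry → structured parts → motion-aware embeddings → articulation) can be sketched as follows. This is a minimal illustrative skeleton inferred from the abstract alone, not the actual MonoArt implementation; every function name, stage interface, and placeholder computation is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of MonoArt-style progressive structural reasoning.
# All names and stage boundaries are assumptions; computations are placeholders.

@dataclass
class PartRepresentation:
    part_id: int
    geometry: List[float]                       # canonical-space points (placeholder)
    motion_embedding: List[float] = field(default_factory=list)

def encode_canonical_geometry(image_features: List[float]) -> List[float]:
    """Stage 1 (assumed): lift image features into a canonical geometry."""
    return [f * 0.5 for f in image_features]    # placeholder transform

def decompose_into_parts(geometry: List[float], num_parts: int) -> List[PartRepresentation]:
    """Stage 2 (assumed): split canonical geometry into structured parts."""
    chunk = max(1, len(geometry) // num_parts)
    return [PartRepresentation(part_id=i, geometry=geometry[i * chunk:(i + 1) * chunk])
            for i in range(num_parts)]

def attach_motion_embeddings(parts: List[PartRepresentation]) -> List[PartRepresentation]:
    """Stage 3 (assumed): derive a motion-aware embedding per part."""
    for p in parts:
        p.motion_embedding = [sum(p.geometry)]  # placeholder summary feature
    return parts

def infer_articulation(parts: List[PartRepresentation]) -> Dict[int, Dict[str, float]]:
    """Stage 4 (assumed): predict joint parameters from part embeddings."""
    return {p.part_id: {"axis_score": p.motion_embedding[0]} for p in parts}

# Progressive flow: each stage consumes only the previous stage's output,
# so articulation is never regressed directly from raw image features.
features = [1.0, 2.0, 3.0, 4.0]
joints = infer_articulation(
    attach_motion_embeddings(
        decompose_into_parts(encode_canonical_geometry(features), num_parts=2)))
```

The design point the sketch mirrors is the staged interface: articulation inference sees only structured, motion-aware part representations, which is what the abstract credits for stability and interpretability.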