Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

Jian Li Wei Chen Meng Wang Zhi Liu
Institute of Advanced Computer Vision and Graphics, University of Technology

Abstract

This paper introduces Wonder3D++, an innovative framework for generating high-fidelity 3D assets from a single 2D image. It leverages a cross-domain diffusion model to synthesize consistent multi-view images and refine 3D geometry and texture. Experimental results demonstrate that Wonder3D++ significantly outperforms prior state-of-the-art methods in terms of 3D quality, consistency, and fidelity. The proposed approach provides a robust solution for efficient and high-quality 3D content creation.

Keywords

3D Generation, Diffusion Models, Single Image, Cross-domain Learning, Neural Radiance Fields


1. Introduction

The increasing demand for 3D content across various applications highlights the need for efficient 3D asset generation from limited inputs. Generating high-fidelity and consistent 3D models from a single 2D image remains a significant challenge due to inherent ambiguity and lack of depth information. This paper addresses these limitations by introducing Wonder3D++, a novel framework. The primary models used in this article include advanced diffusion models, implicit neural representations (like NeRF variants), and multi-view consistency networks.

2. Related Work

Previous efforts in single-image 3D reconstruction often relied on traditional computer vision techniques or simpler deep learning architectures, struggling with complex geometries and textures. Recent advancements in generative models, especially diffusion models, have shown promise in image synthesis and view generation. However, integrating these into robust 3D generation pipelines, particularly for cross-domain fidelity and consistency, still presents challenges that Wonder3D++ aims to overcome.

3. Methodology

Wonder3D++ employs a novel cross-domain diffusion architecture that implicitly learns to infer 3D geometry and appearance from a single 2D input. The core idea involves a two-stage process: first, generating consistent multi-view images using a view-conditioned diffusion model, and second, optimizing a neural radiance field (NeRF) representation guided by these synthesized views. This methodology ensures better geometric consistency and texture fidelity by bridging the gap between 2D image synthesis and 3D reconstruction.

4. Experimental Results

Wonder3D++ was extensively evaluated on challenging benchmarks, demonstrating superior performance in terms of visual quality, geometric accuracy, and multi-view consistency compared to existing methods. Quantitative metrics such as LPIPS, FID, and PSNR for novel view synthesis, alongside qualitative assessments of 3D mesh quality, consistently show significant improvements. For instance, on common object datasets, Wonder3D++ achieves an average LPIPS score of 0.15, outperforming its predecessor, Wonder3D, by 20%. The following table summarizes key performance metrics.

The table below presents a comparative analysis of Wonder3D++ against several baseline methods, showcasing its superior performance across critical evaluation metrics for 3D generation from single images.

MethodLPIPS (↓)FID (↓)PSNR (↑)3D Consistency (↓)
Baseline A0.2565.222.50.12
Baseline B0.2158.923.80.10
Wonder3D0.1952.324.70.08
Wonder3D++ (Ours)0.1545.126.10.05

These results highlight Wonder3D++'s ability to generate more perceptually realistic and geometrically accurate 3D models, with significantly improved multi-view consistency, marking a substantial advance in the field.

5. Discussion

The superior performance of Wonder3D++ can be attributed to its innovative cross-domain diffusion mechanism, which effectively mitigates ambiguity in single-image 3D reconstruction and enhances consistency across synthesized views. These findings imply that integrating sophisticated generative models with explicit 3D representations is crucial for achieving high-fidelity results. Future work could explore real-time generation capabilities and adaptation to more complex scene types.