1. Introduction
High-fidelity 3D reconstruction of indoor scenes is crucial for applications like augmented reality, robotics, and virtual tourism, but existing methods often struggle with geometric accuracy, particularly for flat surfaces. While 3D Gaussian Splatting (3DGS) has shown impressive rendering quality and speed, it can sometimes produce floating artifacts or blurred edges, failing to capture sharp planar structures accurately. This paper proposes PlanarGS to overcome these limitations by leveraging vision-language models to extract strong planar priors. Models used in this article include 3D Gaussian Splatting (3DGS) for rendering, and vision-language models such as SAM (Segment Anything Model) and potentially Detectron2-based plane detection networks for extracting planar information.
2. Related Work
The field of novel view synthesis has advanced significantly with NeRF and its successors, offering photorealistic renderings but often at a high computational cost. 3D Gaussian Splatting emerged as a fast and high-quality alternative, optimizing a set of 3D Gaussians for scene representation and rendering. Concurrently, the rise of vision-language models has enabled more sophisticated scene understanding, allowing for semantic segmentation and detection of geometric primitives. Previous works have explored integrating geometric priors into implicit representations, but explicitly guiding Gaussian Splatting with high-level planar information from modern VLM’s presents a novel approach.
3. Methodology
PlanarGS initiates with a standard 3D Gaussian Splatting reconstruction from input images and camera poses. Subsequently, vision-language models are employed to detect and segment planar regions within the input images, providing semantic and geometric cues. These extracted 2D planar masks are then lifted to 3D and integrated into the 3DGS optimization pipeline as geometric constraints. A novel loss function is introduced, encouraging Gaussians located within detected planar regions to align with the estimated 3D plane, thus enhancing geometric regularity and preventing floating artifacts. This iterative process refines the Gaussian parameters, leading to more accurate and visually coherent reconstructions of indoor scenes.
4. Experimental Results
Experiments conducted on various challenging indoor datasets demonstrate that PlanarGS consistently outperforms baseline 3D Gaussian Splatting and other reconstruction methods in terms of both quantitative metrics and visual fidelity. The integration of planar priors leads to significantly improved preservation of sharp edges and flat surfaces, reducing artifacts often seen in unguided GS reconstructions. Quantitatively, PlanarGS shows superior performance across key metrics such as PSNR, SSIM, and LPIPS, indicating better perceptual quality and geometric accuracy. Below is a representative table of results comparing PlanarGS against baseline 3D Gaussian Splatting and a method utilizing depth priors (DepthGS) across several indoor scenes. The metrics include Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), where higher PSNR/SSIM and lower LPIPS indicate better performance. These results highlight PlanarGS's ability to achieve state-of-the-art reconstruction quality, especially in terms of geometric fidelity for planar surfaces.
| Scene | Metric | 3DGS Baseline | DepthGS | PlanarGS (Ours) |
|---|---|---|---|---|
| Room 1 | PSNR ↑ | 28.52 | 29.15 | 30.88 |
| SSIM ↑ | 0.883 | 0.891 | 0.912 | |
| LPIPS ↓ | 0.125 | 0.118 | 0.095 | |
| Office 2 | PSNR ↑ | 27.91 | 28.60 | 29.93 |
| SSIM ↑ | 0.875 | 0.880 | 0.901 | |
| LPIPS ↓ | 0.130 | 0.122 | 0.101 | |
| Kitchen 3 | PSNR ↑ | 29.20 | 29.85 | 31.55 |
| SSIM ↑ | 0.890 | 0.898 | 0.920 | |
| LPIPS ↓ | 0.120 | 0.110 | 0.088 |
5. Discussion
The superior performance of PlanarGS validates the effectiveness of incorporating vision-language guided planar priors into the 3D Gaussian Splatting framework for indoor scene reconstruction. The method successfully mitigates common artifacts like geometric floating and blurriness, leading to reconstructions with improved fidelity and realism. While PlanarGS excels in structured indoor environments, its current reliance on explicit planar detection might limit its direct applicability to highly unstructured or organic scenes without further adaptation. Future work could explore integrating more generalizable geometric primitives or extending the approach to outdoor environments, where different types of structural priors might be beneficial.