Niagara: Normal-Integrated Geometric Affine Fields for Scene Reconstruction from a Single View

¹Westlake University, Hangzhou, China; ²Jiangxi University of Science and Technology, Ganzhou, China; ³Hong Kong University of Science and Technology, Hong Kong; ⁴University of Central Florida, Florida, USA; ⁵Lancaster University, Lancaster, UK; ⁶Everlyn AI, Hong Kong
*Equal contribution, †Corresponding author.

Input

Our input is just a single picture.

Render

Niagara is the first model that can effectively reconstruct challenging outdoor scenes from a single view.

Left: This paper presents Niagara, a new method for reconstructing 3D scenes from a single view. Unlike Flash3D, the previous state-of-the-art (SoTA) method in this line of work, which uses only depth maps as geometric input, Niagara additionally takes surface normals as input and introduces a novel geometric affine field (GAF). These geometric cues are processed with a proposed 3D self-attention mechanism to learn the 3D Gaussians that represent the scene. Niagara is the first model that can effectively reconstruct challenging outdoor scenes from a single view, as shown by the rendered novel views above. Right: A quantitative comparison of PSNR and LPIPS on the RealEstate10K (RE10K) dataset further confirms the advantages of our method over Flash3D.

Abstract

Recent advances in single-view 3D scene reconstruction have highlighted the challenges of capturing fine geometric details and ensuring structural consistency, particularly in high-fidelity outdoor scene modeling. This paper presents Niagara, a new single-view 3D scene reconstruction framework that, for the first time, can faithfully reconstruct challenging outdoor scenes from a single input image.

Our approach takes monocular depth and normal estimates as input, which substantially improves its ability to capture fine details and mitigates common issues such as loss of geometric detail and deformation.
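Depth and normals are complementary geometric cues: under a pinhole camera, a depth map already implies a surface orientation at every pixel. Niagara obtains its normals from a monocular normal estimator, so the helper below is not part of the method; it is a hypothetical sketch (names `normals_from_depth`, `unproject` and the intrinsics `fx, fy, cx, cy` are assumptions) that back-projects each pixel to 3D and takes the cross product of neighbouring point differences to show how the two inputs relate.

```python
import math

# Hypothetical illustration only: Niagara uses a monocular normal estimator.
# Here normals are derived from depth to show how the two cues relate.

def unproject(u, v, d, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth d to a camera-space 3D point."""
    return ((u - cx) * d / fx, (v - cy) * d / fy, d)

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    n = math.sqrt(sum(x * x for x in a))
    return tuple(x / n for x in a) if n > 0 else a

def normals_from_depth(depth, fx, fy, cx, cy):
    """Per-pixel unit normals from an H x W depth map (list of lists).

    Uses forward differences, falling back to backward differences on the
    last row/column. For surfaces facing the camera (x right, y down,
    z forward) the normal points back towards it (negative z).
    Requires H >= 2 and W >= 2.
    """
    h, w = len(depth), len(depth[0])
    pts = [[unproject(u, v, depth[v][u], fx, fy, cx, cy) for u in range(w)]
           for v in range(h)]
    normals = []
    for v in range(h):
        row = []
        for u in range(w):
            # Tangent vectors along the image x and y directions.
            du = sub(pts[v][u + 1], pts[v][u]) if u + 1 < w else sub(pts[v][u], pts[v][u - 1])
            dv = sub(pts[v + 1][u], pts[v][u]) if v + 1 < h else sub(pts[v][u], pts[v - 1][u])
            row.append(normalize(cross(dv, du)))
        normals.append(row)
    return normals
```

For a fronto-parallel wall (constant depth) every normal comes out as (0, 0, -1), i.e. facing the camera.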

Additionally, we introduce a geometric affine field (GAF) and 3D self-attention as geometric constraints; together they combine the structural properties of explicit geometry with the adaptability of implicit feature fields, striking a balance between efficient rendering and high-fidelity reconstruction.
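The page does not spell out how the GAF or the 3D self-attention are parameterised. Purely as an illustration, the sketch below assumes one plausible reading: a FiLM-style affine field in which each point's feature vector is scaled and shifted by amounts predicted linearly from its geometry (3D position and normal), followed by standard scaled dot-product self-attention across all points. The functions `geometric_affine` and `self_attention`, the weight matrices `W_gamma` and `W_beta`, and the identity Q/K/V projections are all hypothetical simplifications, not Niagara's actual design.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def geometric_affine(feat, geom, W_gamma, W_beta):
    """Hypothetical FiLM-style modulation of one point's feature.

    geom is e.g. [x, y, z, nx, ny, nz]; the scale (1 + gamma) and shift beta
    are linear in geom, so zero weights leave the feature unchanged.
    """
    gamma = matvec(W_gamma, geom)
    beta = matvec(W_beta, geom)
    return [(1.0 + g) * f + b for f, g, b in zip(feat, gamma, beta)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attention(tokens):
    """Scaled dot-product self-attention over per-point tokens.

    Q, K and V are the tokens themselves (identity projections) for brevity.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return out
```

Usage: `self_attention([geometric_affine(f, g, W_gamma, W_beta) for f, g in zip(feats, geoms)])` first injects per-point geometry into the features and then lets every point aggregate information from all others.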

Finally, our framework adopts a specialized encoder-decoder architecture in which a depth-based 3D Gaussian decoder predicts the 3D Gaussian parameters used for novel view synthesis. Extensive results and analyses suggest that Niagara surpasses prior SoTA approaches such as Flash3D in both single-view and dual-view settings, significantly improving geometric accuracy and visual fidelity, especially in outdoor scenes.
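As a rough sketch of what a depth-based Gaussian decoder outputs per pixel, the code below assumes a pinhole camera and a Flash3D-style parameterisation in which each Gaussian's centre lies on the pixel's camera ray at the predicted depth, shifted by a small predicted 3D offset. The activation choices (`exp` for scales, `sigmoid` for opacity, a normalised quaternion for rotation) are common 3D Gaussian Splatting conventions assumed here rather than details taken from Niagara; `decode_gaussian`, `Gaussian3D` and the `raw` field names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    mean: tuple      # camera-space centre (x, y, z)
    scale: tuple     # per-axis extent, kept positive via exp
    rotation: tuple  # unit quaternion (w, x, y, z)
    opacity: float   # in (0, 1) via sigmoid
    color: tuple     # RGB

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_gaussian(u, v, depth, raw, fx, fy, cx, cy):
    """Hypothetical decoding of one pixel's raw network outputs into a Gaussian.

    The mean is the camera ray K^{-1} [u, v, 1]^T scaled by the predicted
    depth, plus a predicted offset raw["offset"].
    """
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    mean = tuple(depth * r + o for r, o in zip(ray, raw["offset"]))
    scale = tuple(math.exp(s) for s in raw["log_scale"])
    q = raw["quat"]
    qn = math.sqrt(sum(c * c for c in q))
    rotation = tuple(c / qn for c in q)
    return Gaussian3D(mean, scale, rotation,
                      sigmoid(raw["opacity_logit"]), tuple(raw["rgb"]))
```

Anchoring each Gaussian to the estimated depth keeps the prediction geometrically grounded, while the learned offset lets the decoder correct errors in the depth estimate.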

Flash3D

The 3D outdoor scene reconstructed by Flash3D.

Niagara

The same 3D outdoor scene reconstructed by Niagara.

Result

Novel view synthesis comparison on the RealEstate10K dataset. Following Flash3D, we evaluate our method on the in-domain novel view synthesis task. Our model consistently outperforms all existing methods at every target-frame setting (5 frames, 10 frames, and a random offset sampled from U[-30, 30] frames) in terms of PSNR, SSIM, and LPIPS.

(Best results are in bold, second best underlined.)
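For reference, PSNR, the first metric in the comparison, is defined as 10 log10(MAX² / MSE) in decibels, so higher is better. A minimal sketch, assuming images are flattened to sequences of values in [0, `max_val`]; the function name `psnr` is illustrative, and SSIM and LPIPS are not sketched here because LPIPS relies on a pretrained network.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat sequences of pixel values in [0, max_val].
    Returns math.inf when the images are identical (MSE of zero).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 on [0, 1] images gives an MSE of 0.01 and hence a PSNR of 20 dB; benchmark tables usually report the mean of per-image scores over the test set.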
