Niagara: Normal-Integrated Geometric Affine Fields for Scene Reconstruction from a Single View

Xianzu Wu^1,3,6*, Zhenxin Ai^1,2*, Harry Yang^3,6, Ser-Nam Lim^4,6, Jun Liu^5, Huan Wang^1†

¹Westlake University, Hangzhou, China, ²Jiangxi University of Science and Technology, Ganzhou, China, ³Hong Kong University of Science and Technology, Hong Kong, ⁴University of Central Florida, Florida, USA, ⁵Lancaster University, Lancaster, UK, ⁶Everlyn AI, Hong Kong
*Equal contribution, †Corresponding author.

Video

Input

The only input is a single image.

Render

Niagara is the first model that can effectively reconstruct challenging outdoor scenes from a single view.

Left: This paper presents Niagara, a new method for 3D scene reconstruction from a single view. The previous SoTA method in this line of work, Flash3D, uses only depth maps as input. Niagara additionally exploits surface normals, together with a novel geometric affine field (GAF). Both are fed through a proposed 3D self-attention mechanism to learn the 3D Gaussians of the scene. Niagara is the first model that can effectively reconstruct challenging outdoor scenes from a single view (as shown by the rendered novel views above). Right: A quantitative comparison in PSNR and LPIPS on the RE10K dataset confirms the merits of our method over Flash3D.

Abstract

Recent advances in single-view 3D scene reconstruction have highlighted the challenges of capturing fine geometric details and ensuring structural consistency, particularly in high-fidelity outdoor scene modeling. This paper presents Niagara, a new single-view 3D scene reconstruction framework that, for the first time, can faithfully reconstruct challenging outdoor scenes from a single input image.

Our approach takes monocular depth and normal estimates as input, which substantially improves its ability to capture fine details and mitigates common issues such as geometric detail loss and deformation.

Additionally, we introduce a geometric affine field (GAF) and 3D self-attention as geometric constraints. Together they combine the structural properties of explicit geometry with the adaptability of implicit feature fields, striking a balance between efficient rendering and high-fidelity reconstruction.

Finally, our framework adopts a specialized encoder-decoder architecture in which a depth-based 3D Gaussian decoder predicts the 3D Gaussian parameters used for novel view synthesis. Extensive results and analyses suggest that Niagara surpasses prior SoTA approaches such as Flash3D in both single-view and dual-view settings, significantly enhancing geometric accuracy and visual fidelity, especially in outdoor scenes.
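To make the idea of a depth-based Gaussian decoder more concrete, the sketch below shows one common way a predicted depth map can be lifted to 3D points that could serve as initial Gaussian centers: back-projection through a pinhole camera. This page does not describe Niagara's decoder at this level of detail, so the function name `unproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth values are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to (H*W, 3) camera-space points.

    Assumes a pinhole camera; the intrinsics here are hypothetical inputs.
    """
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v runs along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    # One 3D point per pixel, flattened in row-major order.
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


# Toy example: a constant-depth 2x3 map with made-up intrinsics.
depth = np.full((2, 3), 2.0)
centers = unproject_depth(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
print(centers.shape)
```

In a full pipeline, a decoder would typically predict further per-Gaussian parameters (for example offsets, scales, rotations, opacities and colors) on top of such centers; those parts are omitted here.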