LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes

Shanlin Sun, Bingbing Zhuang, Ziyu Jiang, Buyu Liu, Xiaohui Xie, Manmohan Chandraker
UC Irvine     UC San Diego     NEC Labs America

LidaRF accurately reconstructs the original sensor data with few to no artifacts. It can also shift the self-driving vehicle's viewpoint to the left or right, or raise and lower the camera position. Built upon the well-known nerfacto model, our method makes three contributions: (1) Lidar encoding, (2) robust depth supervision, and (3) augmented view supervision.


Abstract

Photorealistic simulation plays a crucial role in applications such as autonomous driving, where advances in neural radiance fields (NeRFs) may enable better scalability through the automatic creation of digital 3D assets. However, reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand, the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper, we propose several insights that allow better utilization of Lidar data to improve NeRF quality on street scenes. First, our framework learns a geometric scene representation from Lidar, which is fused with the implicit grid-based representation for radiance decoding, thereby supplying the stronger geometric information offered by the explicit point cloud. Second, we put forth a robust occlusion-aware depth supervision scheme, which allows utilizing Lidar points densified by accumulation. Third, we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes.

Video


Lidar Encoding

Method

  • Lidar holds strong potential for geometric guidance;
  • Lidar encoding through a 3D sparse CNN has proven powerful in 3D perception frameworks;
  • Aggregate neighboring Lidar features with inverse-distance weighting;
  • Fuse the Lidar encoding with the hash-grid feature (see the sketch below).
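
Below is a minimal PyTorch sketch of the inverse-distance feature aggregation and feature fusion described above. The names (aggregate_lidar_features, fuse_features), the neighbor count k, and the brute-force torch.cdist neighbor search are illustrative assumptions, not the paper's implementation.

    import torch

    def aggregate_lidar_features(query_xyz, lidar_xyz, lidar_feats, k=8, eps=1e-6):
        """Interpolate sparse-CNN Lidar features at ray-sample positions.

        query_xyz:   (Q, 3) sample positions along camera rays.
        lidar_xyz:   (N, 3) Lidar point (or voxel-center) positions.
        lidar_feats: (N, C) features produced by the 3D sparse CNN.
        """
        # Pairwise distances (Q, N); a spatial index would replace this at scale.
        dists = torch.cdist(query_xyz, lidar_xyz)
        # k nearest Lidar neighbors per query point.
        knn_d, knn_idx = dists.topk(k, dim=1, largest=False)
        # Inverse-distance weights, normalized to sum to one per query.
        w = 1.0 / (knn_d + eps)
        w = w / w.sum(dim=1, keepdim=True)
        # Weighted sum of neighbor features -> (Q, C).
        return (w.unsqueeze(-1) * lidar_feats[knn_idx]).sum(dim=1)

    def fuse_features(lidar_enc, hash_enc):
        # Concatenate the explicit Lidar encoding with the implicit
        # hash-grid feature before radiance decoding.
        return torch.cat([lidar_enc, hash_enc], dim=-1)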

Effect

  • Lidar encoding is beneficial for modelling sharp structures, e.g., power lines. This enhancement stems from our sparse-convolution-based architecture, which is resilient to Lidar point noise and density variations;
  • Our proposed sparse-convolution-based Lidar encoding outperforms MLP- and PointNet++-based alternatives.

Robust Depth Supervision

Method

  • Issue: inter-point occlusion due to the camera-Lidar displacement;
  • Goal: adaptively discard spurious depth supervision;
  • Curriculum learning: the model initially trains with closer, more reliable depth data, which are less prone to occlusion. As training progresses, the model gradually incorporates more distant depth data;
  • Adopt the URF depth loss for samples in \( \mathcal{D}_\text{reliable}^{m} \) (a sketch follows this list).
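
The sketch below illustrates one plausible reading of this scheme. The curriculum thresholds (d_near, d_far, linear growth) and the Gaussian window width sigma are my assumptions, and urf_depth_loss is a URF-style line-of-sight term rather than the paper's exact formulation.

    import torch

    def reliable_mask(lidar_depth, step, total_steps, d_near=10.0, d_far=80.0):
        """Curriculum selection of reliable depths: supervise only with depths
        below a threshold that grows from d_near to d_far over training, since
        closer points are less prone to camera-Lidar occlusion."""
        t = min(step / total_steps, 1.0)
        return lidar_depth <= d_near + t * (d_far - d_near)

    def urf_depth_loss(weights, t_vals, gt_depth, sigma=0.5):
        """URF-style line-of-sight loss: ray-termination weights should
        concentrate in a narrow Gaussian window around the Lidar depth.

        weights:  (R, S) volume-rendering weights per ray sample.
        t_vals:   (R, S) sample depths along each ray.
        gt_depth: (R,)   Lidar depth per ray.
        """
        target = torch.exp(-0.5 * ((t_vals - gt_depth[:, None]) / sigma) ** 2)
        target = target / (target.sum(dim=-1, keepdim=True) + 1e-8)
        return ((weights - target) ** 2).sum(dim=-1).mean()

    # Usage: supervise only the rays whose depths the curriculum currently trusts.
    # mask = reliable_mask(gt_depth, step, total_steps)
    # loss = urf_depth_loss(weights[mask], t_vals[mask], gt_depth[mask])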

Effect

  • Under the supervision of \( \mathcal{L}_{ds}^1 \), which utilizes single-frame Lidar depth maps known for their sparsity and reduced occlusions, the rendered depths exhibit high accuracy in texture-rich areas (yellow boxes). However, this accuracy diminishes significantly in regions with thin structures, due to the lack of dense geometric guidance (red boxes);

  • The supervision with \( \mathcal{L}_{ds}^{10} \), which involves noisy Lidar depth maps accumulated from 10 adjacent frames, leads to rendered depths that display noticeable noise;

  • Employing our proposed \( \mathcal{L}_{ds}^{robust} \), our method effectively leverages the denser depth maps to reconstruct intricate structures (red boxes) while correctly handling occluded depths (yellow boxes).

Augmented View Supervision

Method

  • Colorize the Lidar points in each Lidar frame and accumulate the colorized points across frames;
  • Project them into the augmented views. These augmented training views are derived from existing ones by introducing stochastic perturbations of a given shift magnitude to their camera centers;
  • Apply the aforementioned robust supervision scheme to exclude occluded Lidar points online (see the sketch below).
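
The sketch below shows one way to rasterize the accumulated, colorized Lidar points into a camera with a perturbed center, using a z-buffer as a crude occlusion test. The intrinsics K, the OpenCV-style camera convention, shift_std, the image resolution, and the per-point Python loop are simplifying assumptions.

    import torch

    def render_augmented_view(pts_xyz, pts_rgb, K, c2w, shift_std=0.5, hw=(540, 960)):
        """pts_xyz: (N, 3) accumulated colorized Lidar points in world space.
        pts_rgb: (N, 3) point colors.  K: (3, 3) camera intrinsics.
        c2w: (4, 4) camera-to-world pose of an existing training view."""
        H, W = hw
        # Stochastically perturb the camera center to create the augmented view.
        c2w = c2w.clone()
        c2w[:3, 3] += torch.randn(3) * shift_std
        w2c = torch.linalg.inv(c2w)
        # World -> camera coordinates (OpenCV convention, +z forward).
        cam = (w2c[:3, :3] @ pts_xyz.T + w2c[:3, 3:4]).T
        z = cam[:, 2]
        front = z > 1e-3
        cam, z, rgb = cam[front], z[front], pts_rgb[front]
        # Perspective projection to pixel coordinates.
        uv = (K @ cam.T).T
        u = (uv[:, 0] / z).long()
        v = (uv[:, 1] / z).long()
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # Z-buffer: keep the nearest point per pixel (a crude occlusion test;
        # the robust depth-supervision scheme further filters occluded points).
        image = torch.zeros(H, W, 3)
        depth = torch.full((H, W), float("inf"))
        for ui, vi, zi, ci in zip(u[ok], v[ok], z[ok], rgb[ok]):
            if zi < depth[vi, ui]:
                depth[vi, ui] = zi
                image[vi, ui] = ci
        return image, depth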

Effect

  • Largely improves rendering quality in regions that are scarcely captured in the raw training data.

Results

PandaSet Dataset

Argoverse Dataset

nuScenes Dataset

Ablation Study

As Background Simulation

BibTeX

@article{sun2024dil,
  title={LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes},
  author={Sun, Shanlin and Zhuang, Bingbing and Jiang, Ziyu and Liu, Buyu and Xie, Xiaohui and Chandraker, Manmohan},
  journal={arXiv preprint arXiv:2405.00900},
  year={2024}
}