Authors:
(1) Han Jiang, HKUST and Equal contribution (hjiangav@connect.ust.hk);
(2) Haosen Sun, HKUST and Equal contribution (hsunas@connect.ust.hk);
(3) Ruoxuan Li, HKUST and Equal contribution (rliba@connect.ust.hk);
(4) Chi-Keung Tang, HKUST (cktang@cs.ust.hk);
(5) Yu-Wing Tai, Dartmouth College (yu-wing.tai@dartmouth.edu).
2.3. Text-Guided Visual Content Generation
The advent of generative models has prompted extensive research on guiding generation with natural language. For example, latent diffusion models, as exemplified by [23], have made significant strides in text-guided image generation, and various image modification techniques, such as [7, 9, 10, 33], have emerged as a result.
Building on these text-to-image achievements, text-to-3D generation has been introduced, as shown in [2, 12, 21, 32]. These approaches aim to bridge the gap in 3D content generation, leveraging Score Distillation Sampling (SDS) and its variants for multiview convergence. Attempts have also been made to generate 4D dynamic content from text [26], using several techniques, including a temporal consistency regularizer, to extend DreamFusion [21] to dynamic NeRFs. Despite the complexity of SDS, these methods achieve impressive 3D consistency. In contrast, our proposed method conditions on a seed view to control the generation of the other views and thereby force multiview convergence. This restricts the ill-posed text-guided generation problem to a well-posed one with strong priors, making it easier to tackle. 3D generation conditioned on a single generated view has been presented in several recent works, including Zero123 [13] and SyncDreamer [16]: given an image of an object and multiple camera poses, they infer plausible observations of the object from other views.
However, existing implementations are all limited to a single object without conditioning on the background, and they struggle to manipulate objects within large scenes. In other words, they address pure generation, which differs from our inpainting task. Our method distinguishes itself by enabling the removal, addition, and manipulation of specific objects within a given background NeRF while maintaining consistency with the unmasked background and partially masked foreground objects. In addition, the 3D inpainted results can be extended to 4D while maintaining temporal consistency.
This paper is available on arXiv under a CC 4.0 license.