Authors:
(1) Han Jiang, HKUST and Equal contribution (hjiangav@connect.ust.hk);
(2) Haosen Sun, HKUST and Equal contribution (hsunas@connect.ust.hk);
(3) Ruoxuan Li, HKUST and Equal contribution (rliba@connect.ust.hk);
(4) Chi-Keung Tang, HKUST (cktang@cs.ust.hk);
(5) Yu-Wing Tai, Dartmouth College (yu-wing.tai@dartmouth.edu).
Table of Links
2. Related Work
2.1. NeRF Editing and 2.2. Inpainting Techniques
2.3. Text-Guided Visual Content Generation
3.1. Training View Pre-processing
4. Experiments and 4.1. Experimental Setups
5. Conclusion and 6. References
5. Conclusion
We introduce Inpaint4DNeRF, a unified framework that directly generates text-guided, background-appropriate, and multi-view consistent content within an existing NeRF. To ensure convergence from the original object to a completely different one, we propose a training-view pre-processing method that projects initially inpainted seed images to the other views, with details refined by Stable Diffusion. The resulting roughly multi-view consistent set of training images, combined with depth regularization, ensures coarse convergence in geometry and appearance. Finally, the coarse NeRF is fine-tuned by iterative dataset updates with Stable Diffusion. Our baseline readily extends to dynamic NeRF inpainting by generalizing the seed-image-to-other-views strategy from the spatial domain to the temporal domain. We provide 3D and 4D examples to demonstrate the effectiveness of our method, and we investigate the role of each component of our baseline through ablations and comparisons.
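To recapitulate the pipeline at a glance, the following minimal Python-style sketch restates the stages summarized above. It is only an illustrative outline under assumed interfaces: every helper (select_seed_views, text_guided_inpaint, project_from_seeds, refine_with_diffusion, train_nerf) is a hypothetical placeholder, not our actual implementation.

```python
def inpaint4dnerf(nerf, views, masks, prompt, num_update_rounds=5):
    """Illustrative sketch of the Inpaint4DNeRF stages; all helpers are hypothetical."""
    # 1. Inpaint a few seed views with a text-guided 2D inpainter.
    seed_ids = select_seed_views(views)
    for i in seed_ids:
        views[i].image = text_guided_inpaint(views[i].image, masks[i], prompt)

    # 2. Project the inpainted seed content to the remaining views, then let
    #    Stable Diffusion refine details so the set is roughly multi-view consistent.
    for i, view in enumerate(views):
        if i not in seed_ids:
            warped = project_from_seeds(views, seed_ids, view, masks[i])
            view.image = refine_with_diffusion(warped, masks[i], prompt)

    # 3. Coarse stage: train on the pre-processed views with depth regularization.
    nerf = train_nerf(nerf, views, depth_regularization=True)

    # 4. Fine stage: iterative dataset update -- re-render each training view,
    #    refine it with Stable Diffusion, and continue training.
    for _ in range(num_update_rounds):
        for view, mask in zip(views, masks):
            rendered = nerf.render(view.camera)
            view.image = refine_with_diffusion(rendered, mask, prompt)
        nerf = train_nerf(nerf, views, depth_regularization=True)

    return nerf
```

For the 4D extension, the same seed-projection step would additionally propagate inpainted content across frames in time before per-frame refinement.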
The proposed framework expands the possibilities for realistic and coherent scene editing in 3D and 4D settings. However, our current baseline still has limitations that leave room for further improvement. In particular, our method struggles to generate complex geometry when the camera setup covers a wide range of viewing angles, and the consistency of the final NeRF can still be improved. Moreover, to extend our method fully into 4D, additional techniques are needed to further improve temporal consistency and to maintain multi-view consistency across frames. We hope that our proposed baseline inspires future research in these directions for text-guided generative NeRF inpainting.
References
[1] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
[2] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
[3] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[4] Ayaan Haque, Matthew Tancik, Alexei A Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D scenes with instructions. arXiv preprint arXiv:2303.12789, 2023.
[5] Jiakai Zhang, Xinhang Liu, Xinyi Ye, Fuqiang Zhao, Yanshun Zhang, Minye Wu, Yingliang Zhang, Lan Xu, and Jingyi Yu. Editable free-viewpoint video using a layered neural representation. In ACM SIGGRAPH, 2021.
[6] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.
[7] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-guided diffusion models for robust image manipulation, 2022.
[8] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. arXiv:2304.02643, 2023.
[9] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip H. S. Torr. ManiGAN: Text-guided image manipulation, 2020.
[10] Bowen Li, Xiaojuan Qi, Philip Torr, and Thomas Lukasiewicz. Lightweight generative adversarial networks for text-guided image manipulation. Advances in Neural Information Processing Systems, 33:22020–22031, 2020.
[11] Zhen Li, Cheng-Ze Lu, Jianhua Qin, Chun-Le Guo, and Ming-Ming Cheng. Towards an end-to-end framework for flow-guided video inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17562–17571, 2022.
[12] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[13] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object, 2023.
[14] Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, and Bryan Russell. Editing conditional radiance fields. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
[15] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[16] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Learning to generate multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
[17] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11451–11461, 2022.
[18] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[19] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Konstantinos G. Derpanis, Jonathan Kelly, Marcus A. Brubaker, Igor Gilitschenski, and Alex Levinshtein. SPIn-NeRF: Multiview segmentation and perceptual inpainting with neural radiance fields. In CVPR, 2023.
[20] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), 2021.
[21] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
[22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[23] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[24] Ruizhi Shao, Jingxiang Sun, Cheng Peng, Zerong Zheng, Boyao Zhou, Hongwen Zhang, and Yebin Liu. Control4D: Dynamic portrait editing by learning 4D GAN from 2D diffusion-based editor, 2023.
[25] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-Planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023.
[26] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, and Yaniv Taigman. Text-to-4D dynamic scene generation. arXiv:2301.11280, 2023.
[27] Hyeonseop Song, Seokhun Choi, Hoseok Do, Chul Lee, and Taehyeong Kim. Blending-NeRF: Text-driven localized editing in neural radiance fields, 2023.
[28] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 29(5):2732–2742, 2023.
[29] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[30] Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
[31] Dongqing Wang, Tong Zhang, Alaa Abboud, and Sabine Süsstrunk. InpaintNeRF360: Text-guided 3D inpainting on unbounded neural radiance fields. arXiv preprint arXiv:2305.15094, 2023.
[32] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. arXiv preprint arXiv:2305.16213, 2023.
[33] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. TediGAN: Text-guided diverse face image generation and manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2256–2265, 2021.
[34] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9416–9426, 2021.
[35] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera, 2020.
[36] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
[37] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892, 2018.
[38] Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, Yuewen Ma, Rongfei Jia, and Lin Gao. NeRF-Editing: Geometry editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18353–18364, 2022.
[39] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[40] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[41] Shangzhan Zhang, Sida Peng, Yinji ShenTu, Qing Shuai, Tianrun Chen, Kaicheng Yu, Hujun Bao, and Xiaowei Zhou. Dyn-E: Local appearance editing of dynamic neural radiance fields, 2023.
This paper is available on arxiv under CC 4.0 license.