
Deep learning, particularly generative models, has inspired controllable image synthesis methods and applications. These approaches aim to generate specific visual content using latent prompts. To explore low-level controllable image synthesis for precise rendering and editing tasks, we present a survey of recent works in this field using deep learning. We begin by discussing data sets and evaluation indicators for low-level controllable image synthesis. Then, we review the state-of-the-art research on geometrically controllable image synthesis, focusing on viewpoint/pose and structure/shape controllability. Additionally, we cover photometrically controllable image synthesis methods for 3D re-lighting studies. While our focus is on algorithms, we also provide a brief overview of related applications, products and resources for practitioners.
Citation: Shixiong Zhang, Jiao Li, Lu Yang. Survey on low-level controllable image synthesis with deep learning[J]. Electronic Research Archive, 2023, 31(12): 7385-7426. doi: 10.3934/era.2023374
Artificial Intelligence Generated Content (AIGC) refers to digital media produced by machine learning methods such as ChatGPT and Stable Diffusion [1], which are currently popular [2]. AIGC has various applications in domains such as entertainment, education, marketing and research [3]. Image synthesis is a subcategory of AIGC that involves generating realistic or stylized images from textual inputs, sketches or other images [4]. Image synthesis can also perform various tasks such as inpainting, semantic scene synthesis, super-resolution and unconditional image generation [1,5,6,7].
Currently, deep learning methods can be categorized into two types: Generative and discriminative [8]. The goal of a discriminative model is to directly predict or classify based on input data, without involving the process of data generation. Discriminative methods have various applications, such as image classification [9,10], image segmentation [11] and sequence prediction [12]. On the other hand, generative models learn the distribution characteristics of the data in order to generate new samples that are similar to the original data. Generative models also have many typical applications, such as image synthesis [13] and text synthesis [14].
Image synthesis can be classified into two types based on controllability: Unconditional and conditional [15]. Conditional image synthesis can be further divided into three levels of control: High, medium and low. High-level control refers to image content such as category; medium-level control refers to the image background and other such aspects; low-level control refers to manipulating the image based on the underlying principles of traditional computer vision, such as geometry and photometry [16,17,18].
Conventional 3D image synthesis techniques face challenges in handling intricate details and patterns that vary across different objects [19]. Deep learning methods can better model the variations in shape, texture and illumination of 3D objects [20]. The field of deep learning-based image synthesis has made remarkable progress in recent years, aided by the availability of more open source data sets [21,22,23]. Various image synthesis methods have emerged, such as generative adversarial network (GAN) [7], diffusion model (DM) [6] and neural radiance field (NeRF) [24]. These methods differ in their levels of controllability: GAN and DM are suitable for high-level or medium-level controllable image synthesis, while NeRF is suitable for low-level controllable image synthesis.
Low-level controllable image synthesis can be categorized into geometric and illumination control. Geometric control involves manipulating the pose and structure of the scene, where the pose can refer to either the camera or the object, while the structure can refer to either the global shape (using depth maps, point clouds or other 3D representations) or the local attributes (such as size, shape, color, etc.) of the object. Illumination control involves manipulating the light source and the material properties of the object.
Several surveys have attempted to cover the state-of-the-art techniques and applications in image synthesis. However, most of these surveys have become obsolete due to the rapid development of the field [15] or have focused on the high-level and medium-level aspects of image synthesis, while ignoring the low-level aspects [25]. Furthermore, most of these surveys have adopted a methodological perspective, which is useful for researchers who want to understand the underlying principles and algorithms of image synthesis, but not for practitioners who want to apply image synthesis techniques to solve specific problems in various domains [25,26]. In this paper, we provide a task-oriented review of low-level controllable image synthesis, excluding human subjects [27,28,29,30].
This review offers a comprehensive overview of the state-of-the-art deep learning methods for low-level controllable image synthesis. The overview of the surveyed low-level controllable image synthesis is shown in Figure 1. In Section 2, we begin by introducing the common data sets and evaluation indicators for this task. For the data set section, we divide it by its content. In Sections 3 to 5, we survey the control methods based on pose (see Figure 2), structure (see Figure 3) and illumination (see Figure 4) and divide each section into global and local controls. In Section 6, we discuss some current applications of low-level controllable image synthesis based on deep learning. Finally, Section 7 concludes this paper. In the following sections, we will review common data sets and evaluation indicators in detail.
One of the key challenges in low-level controllable image synthesis is to evaluate the quality and diversity of the generated images. Different data sets and metrics have been proposed to measure various aspects of low-level controllable image synthesis, such as realism, consistency, fidelity and controllability. In this section, we will introduce some commonly used data sets and metrics for low-level controllable image synthesis and discuss their advantages and limitations.
3D image synthesis is the task of generating realistic images of 3D objects from different viewpoints. This task requires a large amount of training data that can capture the shape, texture, lighting and pose variations of 3D objects. Several data sets have been proposed for this purpose, each with its own advantages and limitations. Table 1 shows all the data sets covered in this section, as well as the relationships between the data sets and each section in the survey. The details of these data sets are as follows:
| Type | Data sets | Section used |
| --- | --- | --- |
| Viewpoint | ABO [31], Clevr3D [37], ScanNet [38], RealEstate10K [39] | Section 3 |
| Point cloud | ShapeNet [40], KITTI [41], nuScenes [42], Matterport3D [43] | Section 4 |
| Depth map | Middlebury Stereo [44,45,46,47,48], NYU Depth [49], KITTI [41] | Section 4 |
| Illumination | Multi-PIE [35], Relightables [50] | Section 5 |
● ABO (Amazon Berkeley Objects) is a data set built from Amazon product listings that provides catalog images, structured metadata and several thousand artist-created 3D models with physically based materials. It is useful for tasks such as single-view 3D reconstruction, material estimation and multi-view retrieval. However, ABO is limited to rigid household products captured in isolation, contains a comparatively modest number of 3D models and lacks cluttered scene context, realistic lighting variation and occlusion [31].
● Clevr3D is a synthetic data set that contains 3D scenes composed of simple geometric shapes with various attributes such as color, size and material. It also provides natural language descriptions and questions for each scene. Clevr3D is useful for tasks such as scene understanding, reasoning and captioning. However, Clevr3D is also limited by its synthetic nature, its simple scene composition and its lack of realistic textures and backgrounds [37].
● ScanNet is an RGB-D video data set that contains 2.5 million views in more than 1500 scans of indoor scenes. It provides annotations such as camera poses, surface reconstructions and instance-level semantic segmentations. ScanNet is useful for tasks such as semantic segmentation, object detection and pose estimation. ScanNet is also limited by its incomplete coverage (due to scanning difficulties), its inconsistent labeling (due to human errors) and its lack of fine-grained details (such as object parts) [38].
● RealEstate10K is a data set for view synthesis that contains camera poses corresponding to 10 million frames derived from about 80,000 video clips gathered from YouTube videos. The data set also provides links to download the original videos. RealEstate10K is a large-scale and diverse data set that covers various types of scenes, such as houses, apartments, offices and landscapes. RealEstate10K is useful for tasks such as stereo magnification, light field rendering and novel view synthesis. However, RealEstate10K also has some challenges, such as the low quality of the videos, the inconsistency of the camera poses and the lack of depth information [39].
Point cloud data sets are collections of points that represent the shape and appearance of a 3D object or scene. They are often obtained from sensors such as lidars, radars or cameras. Some of the data sets are:
● ShapeNet is a large-scale repository of 3D CAD models that indexes over 3 million models in total; its commonly used ShapeNetCore subset provides about 51,000 clean models across 55 common object categories. It provides rich annotations such as category labels, part labels, alignments and correspondences. ShapeNet is useful for tasks such as shape classification, segmentation, retrieval and completion. Some of the limitations of ShapeNet are that it does not contain realistic textures or materials, it does not capture the variability and diversity of natural scenes and it does not provide ground truth poses or camera parameters for rendering [40].
● KITTI is a data set for autonomous driving that contains 3D point clouds captured by a Velodyne HDL-64E LIDAR sensor, along with RGB images, GPS/IMU data, object annotations and semantic labels. KITTI is one of the most popular and challenging data sets for 3D object detection and semantic segmentation, as it covers various scenarios, weather conditions and occlusions. However, KITTI also has some limitations, such as the limited number of frames per sequence (around 200), the fixed sensor configuration and the lack of dynamic objects [41].
● nuScenes is another data set for autonomous driving that contains 3D point clouds captured by a 32-beam LIDAR sensor, along with RGB images, radar data, GPS/IMU data, object annotations and semantic labels. nuScenes is more comprehensive and diverse than KITTI, as it covers 1000 scenes collected in Boston and Singapore, two cities with different traffic rules and driving behaviors. nuScenes also provides more temporal information, with 20 seconds of continuous data per scene. However, nuScenes also has some challenges, such as the lower resolution of the point clouds, the higher complexity of the scenes and the need for sensor fusion [42].
● Matterport3D is a data set for indoor scene understanding that contains 3D point clouds reconstructed from RGB-D images captured by a Matterport camera. The data set also provides surface reconstructions, camera poses and 2D and 3D semantic segmentations. Matterport3D is a large-scale and high-quality data set that covers 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Matterport3D is useful for tasks such as keypoint matching, view overlap prediction and scene completion. However, Matterport3D also has some limitations, such as the lack of dynamic objects, the dependence on RGB-D sensors and the difficulty of obtaining ground truth annotations [43].
Depth map data sets are collections of images and their corresponding depth values, which can be used for various computer vision tasks such as depth estimation, 3D reconstruction, scene understanding, etc. The commonly used depth map data sets are as follows:
● Middlebury Stereo is a data set of stereo images with ground truth disparity maps obtained using structured light or a robot arm. It contains several versions of data sets collected from 2001 to 2021, with different scenes, resolutions and levels of difficulty. The data set is widely used for evaluating stereo matching algorithms and provides online benchmarks and leaderboards. The strengths of this data set are its high accuracy, diversity and availability. The limitations are its relatively small size, indoor scenes only and lack of semantic labels [44,45,46,47,48].
● NYU Depth Data set V2 is a data set of RGB-D images captured by Microsoft Kinect in various indoor scenes. It contains 1449 densely labeled pairs of aligned RGB and depth images, as well as 407,024 unlabeled frames. The data set also provides surface normals, 3D point clouds and semantic labels for each pixel. The data set is widely used for evaluating monocular depth estimation algorithms and provides online tools for data processing and visualization. The strengths of this data set are its large size, rich annotations and realistic scenes. The limitations are its low resolution, noisy depth values and indoor scenes only [49].
● KITTI also provides depth maps, but they are derived from sparse and noisy LiDAR measurements. In addition, ground-truth depth is missing for parts of some scenes, and the data are restricted to urban driving settings [41].
Illumination data sets are collections of information about the intensity, distribution and characteristics of artificial or natural light sources. Some examples of common illumination data sets are:
● Multi-PIE is a large-scale data set that contains over 750,000 images of 337 subjects, captured from 15 view angles and under 19 illumination conditions. Each subject also performed different facial expressions, such as neutral, smile, surprise and squint. The data set is useful for studying face recognition, face alignment, face synthesis and face editing under varying conditions. However, the subject pool of Multi-PIE is predominantly Caucasian, which limits its diversity and generalization [35].
● Relightables is a collection of high-quality 3D scans of human subjects under varying lighting conditions. This data set allows for realistic rendering of human performances with any lighting and viewpoint, which can be integrated into any CG scene. Nevertheless, this data set has some drawbacks, such as the low diversity of subjects, poses and expressions and the high computational expense of processing the data [50].
In conclusion, data sets are essential for low-level controllable image synthesis based on deep learning, as they provide the necessary information for training and evaluating deep generative models. These data sets provide rich annotations and variations for different types of control, such as viewpoint, lighting, pose, point clouds and depth. However, each data set has its own strengths and weaknesses, and there is room for improvement and innovation in this field.
To evaluate the quality and diversity of the synthesized images, several performance indicators are commonly used. Some of them are:
- Peak signal-to-noise ratio (PSNR) [51]: This measures the similarity between the synthesized image and a reference image in terms of pixel values. It is defined as the ratio of the maximum possible power of a signal to the power of noise that affects the fidelity of its representation. A higher PSNR indicates a better image quality.
- Structural similarity index (SSIM) [52]: This measures the similarity between the synthesized image and a reference image in terms of luminance, contrast and structure. It is based on the assumption that the human visual system is highly adapted to extract structural information from images. A higher SSIM indicates a better image quality.
- Learned perceptual image patch similarity (LPIPS) [53]: This measures the similarity between the synthesized image and a reference image in terms of deep features. It is defined as the distance between the activations of two image patches for a pre-trained network. A lower LPIPS indicates a better image quality.
- Inception score (IS) [54]: This measures the quality and diversity of the synthesized images using a pre-trained classifier, such as Inception-v3. It is based on the idea that good images should have high class diversity (i.e., they can be classified into different categories) and low class ambiguity (i.e., they can be classified with high confidence). A higher IS indicates a better image synthesis.
- Fréchet inception distance (FID) [55]: This measures the distance between the feature distributions of the synthesized images and the real images using a pre-trained classifier, such as Inception-v3. It is based on the idea that good images should have similar feature statistics to real images. A lower FID indicates a better image synthesis.
- Kernel inception distance (KID) [56]: This measures the squared maximum mean discrepancy between the feature distributions of the synthesized images and the real images using a pre-trained classifier, such as Inception-v3. It is based on the idea that good images should have similar feature statistics to real images. A lower KID indicates a better image synthesis.
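To make the pixel-level metrics above concrete, the following minimal sketch (assuming two aligned 8-bit images stored as NumPy arrays of equal shape) computes PSNR and a simplified single-window SSIM. Practical evaluations typically rely on library implementations instead, e.g., scikit-image for windowed SSIM, the lpips package for LPIPS and the features of a pre-trained Inception-v3 network for IS, FID and KID.

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio between a synthesized image and a reference image."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(img, ref, max_val=255.0, k1=0.01, k2=0.03):
    """Simplified single-window SSIM; libraries average the same statistic over local windows."""
    x, y = img.astype(np.float64), ref.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```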
GAN [7] can generate realistic and diverse data from a latent space. GAN consists of two neural networks: A generator and a discriminator. The generator tries to produce data that can fool the discriminator, while the discriminator tries to distinguish between real and fake data. Its network structure is shown in Figure 5. The loss function of GAN measures how well the generator and the discriminator perform their tasks. The loss is usually composed of two terms: One for the generator ($L_G$) and one for the discriminator ($L_D$). $L_G$ is based on how often the discriminator classifies the generated data as real, while $L_D$ is based on how often it correctly classifies the real and fake data. The generator is trained to minimize $L_G$ while the discriminator is trained to maximize $L_D$, as shown in Eq (3.1).
$$
\begin{aligned}
L_D &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \\
L_G &= -\,\mathbb{E}_{z \sim p_z(z)}\big[\log D(G(z))\big] \\
L_{\text{GAN}} &= L_D + L_G
\end{aligned} \tag{3.1}
$$
where $x$ is a sample drawn from the real data distribution $p_{\text{data}}(x)$ and $z$ is a sample drawn from a prior distribution $p_z(z)$.
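As a concrete illustration of Eq (3.1), the sketch below performs one adversarial training step in PyTorch with the non-saturating generator loss. The generator G, the discriminator D (assumed to output a probability of shape (batch, 1)), their optimizers and the latent dimension are placeholders, not components prescribed by the text.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=128):
    """One adversarial update following Eq (3.1): D maximizes L_D, G minimizes L_G."""
    b = real.size(0)
    ones = torch.ones(b, 1, device=real.device)
    zeros = torch.zeros(b, 1, device=real.device)

    # Discriminator update: maximize log D(x) + log(1 - D(G(z))), i.e., minimize the BCE sum.
    z = torch.randn(b, z_dim, device=real.device)
    fake = G(z).detach()                      # stop gradients from flowing into G
    loss_D = F.binary_cross_entropy(D(real), ones) + F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update: minimize -log D(G(z)) (non-saturating form of L_G).
    z = torch.randn(b, z_dim, device=real.device)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```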
a) Crossview image synthesis. Viewpoint manipulation refers to the ability to manipulate the perspective or orientation of the objects or scenes in the synthetic images. Early view-synthesis methods were usually able to generate only one specific target view, such as a bird's eye view or a frontal view of a face. Huang et al. introduced TP-GAN, a method that integrates global structure and local details to generate realistic frontal views of faces [58]. Similarly, Zhao et al. proposed VariGAN, which combines variational inference and GANs for the progressive refinement of synthesized target images [59]. To address the challenge of generating scenes from different viewpoints and resolutions, Regmi and Borji developed two methods: Crossview Fork (X-Fork) and Crossview Sequential (X-Seq) [60]. These methods employ semantic segmentation graphs to aid conditional GANs (cGANs) in producing sharper images. Furthermore, Regmi and Borji utilized geometry-guided cGANs for image synthesis, converting ground images to aerial views [61]. Mokhayeri et al. proposed a cross-domain face synthesis approach using a Controllable GAN (C-GAN). This method generates realistic face images under various poses by refining simulated images from a 3D face model through an adversarial game [62]. Zhu et al. developed BridgeGAN, a technique for synthesizing bird's eye view images from single frontal view images. They employed a homography view as an intermediate representation to accomplish this task [63]. Ding et al. addressed the problem of cross-view image synthesis by utilizing GANs based on deformable convolution and attention mechanisms [64]. Lastly, Ren et al. proposed MLP-Mixer GANs for cross-view image conversion. This method comprises two stages to alleviate severe deformation when generating entirely different views [65].
b) Free viewpoint image synthesis. By adding conditional inputs, such as a camera pose or camera manifold to the GAN network, they can output images from any viewpoint. Zhu et al. introduced CycleGAN, a method capable of recovering the front face from a single profile postural facial image, even when the source domain does not match the target domain [66]. This approach is based on a conditional variational autoencoder and GAN (cVAE-GAN) framework, which does not require paired data, making it a versatile method for view translation [67]. Shen et al. proposed Pairwise-GAN, employing two parallel U-Nets as generators and PatchGAN as a discriminator to synthesize frontal face images [68]. Similarly, Chan et al. presented pi-GAN, a method utilizing periodic implicit GANs for high-quality 3D-aware image synthesis [69]. Cai et al. further extended this approach with Pix2NeRF, an unsupervised method leveraging pi-GAN to train on single images without relying on 3D or multi-view supervision [70]. Leimkuhler et al. introduced FreeStyleGAN, which integrates a pre-trained StyleGAN into standard 3D rendering pipelines, enabling stereo rendering or consistent insertion of faces in synthetic 3D environments [71]. Medin et al. proposed MOST GAN, explicitly incorporating physical facial attributes as prior knowledge to achieve realistic portrait image manipulation [72]. On the other hand, Or-El et al. developed StyleSDF, a novel method generating images based on StyleGAN2 by utilizing Signed Distance Fields (SDFs) to accurately model 3D surfaces, enabling volumetric rendering with consistent results [73]. Additionally, Zheng et al. presented SDF-StyleGAN, a deep learning method for generating 3D shapes based on StyleGAN2, employing two new shape discriminators operating on global and local levels to compare real and synthetic SDF values and gradients, significantly enhancing shape geometry and visual quality [74]. Moreover, Deng et al. proposed GRAM, a novel approach regulating point sampling and radiance field learning on 2D manifolds, embodied as a set of learned implicit surfaces in the 3D volume, leading to improved synthesis results [75]. Xiang et al. built upon this work with GRAM-HD, capable of generating high-resolution images with strict 3D consistency, up to a resolution of 1024 x 1024 [76]. In another line of research, Chan et al. developed an efficient framework for generating realistic 3D shapes from 2D images using GANs, comprising a geometry-aware module predicting the 3D shape and its projection parameters from the input image, and a refinement module enhancing shape quality and details [77]. Similarly, Zhao et al. proposed a method for generating high-quality 3D images from 2D inputs using GAN, achieving consistency across different viewpoints and offering rendering with novel lighting effects [78]. Lastly, Alhaija et al. introduced XDGAN, a method for synthesizing realistic and diverse 3D shapes from 2D images, converting 3D shapes into compact 1-channel geometry images and utilizing StyleGAN3 and image-to-image translation networks to generate 3D objects in a 2D space [79]. These advancements in image synthesis techniques have significantly enriched the field of 3D image generation from 2D inputs.
NeRF [24] is a novel representation for complex 3D scenes that can be rendered photorealistically from any viewpoint. NeRF models a scene as a continuous function that maps 5D coordinates (3D location and 2D viewing direction, expressed as (x, y, z, θ, φ)) to a 4D output (RGB color and opacity). Its schematic diagram is shown in Figure 6. This function is learned from a set of posed images of the scene using a deep neural network. Before NeRF passes the (x, y, z, θ, φ) input to the network, it maps the input to a higher dimensional space using high-frequency functions to better fit data containing high-frequency variations. The high-frequency encoding function is:
$$
\gamma(p) = \left(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right) \tag{3.2}
$$
where $p$ denotes each component of the input (x, y, z, θ, φ) and $L$ is the number of frequency bands.
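A minimal PyTorch sketch of the encoding in Eq (3.2), applied component-wise to a batch of inputs. The choice of L (10 for positions and 4 for viewing directions in the original NeRF paper) and whether the raw input is concatenated to the encoding are implementation details that vary across codebases.

```python
import torch

def positional_encoding(p: torch.Tensor, L: int = 10) -> torch.Tensor:
    """Frequency encoding of Eq (3.2), applied to each component of p.

    p: tensor of shape (..., d), values roughly normalized to [-1, 1]
    returns: tensor of shape (..., 2 * L * d)
    """
    freqs = (2.0 ** torch.arange(L, dtype=p.dtype, device=p.device)) * torch.pi  # 2^0*pi, ..., 2^(L-1)*pi
    angles = p.unsqueeze(-1) * freqs                                   # (..., d, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (..., d, 2L)
    return enc.flatten(-2)                                             # (..., 2*L*d)

# Example: encode a batch of 3D sample positions.
xyz = torch.rand(1024, 3) * 2 - 1
features = positional_encoding(xyz, L=10)   # shape (1024, 60)
```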
Zhang et al. introduced NeRF++ as a framework that enhances NeRF through adaptive sampling, hierarchical volume rendering and multiscale feature encoding techniques [80]. This approach enables high-quality rendering for both static and dynamic scenes while improving efficiency and robustness. Rebain et al. proposed a method to enhance the efficiency and quality of neural rendering by employing spatial decomposition [81]. Park et al. developed a novel technique for capturing and rendering high-quality 3D selfies using a single RGB camera. Their method utilizes a deformable NeRF model capable of representing both the geometry and appearance of dynamic scenes [82]. Li et al. introduced MINE, a method for novel view synthesis and depth estimation from a single image. This approach generalizes Multiplane Images (MPI) with continuous depth using NeRF [83]. Park et al. proposed HyperNeRF, a method for representing and rendering complex 3D scenes with varying topology using neural radiance fields (NeRFs). Unlike previous NeRF-based approaches that rely on a fixed 3D coordinate system, HyperNeRF employs a higher-dimensional continuous embedding space to capture arbitrary scene changes [84]. Chen et al. presented Aug-NeRF, a novel method for training NeRFs with physically-grounded augmentations at different levels: Scene, camera and pixel [85]. Kaneko proposed AR-NeRF, a method for learning 3D representations of natural images without supervision. The approach utilizes a NeRF model to render images with various viewpoints and aperture sizes, capturing both depth and defocus effects [86]. Li et al. introduced SymmNeRF, a framework that utilizes NeRFs to synthesize novel views of objects from a single image. This method leverages symmetry priors to recover fine appearance details, particularly in self-occluded areas [87]. Zhou et al. proposed NeRFLiX, a novel framework for improving the quality of novel view synthesis using NeRF. This approach addresses rendering artifacts such as noise and blur by employing an inter-viewpoint aggregation framework that fuses high-quality training images to generate more realistic synthetic views [88].
Besides, a number of researchers have proposed enhancements to the original NeRF model, addressing its limitations in scenarios such as no camera pose, sparse data, noisy data, large-scale image synthesis and image synthesis speed. See Table 2.
| Feature | Method | Publication | Image resolution | Data set |
| --- | --- | --- | --- | --- |
| No camera pose | NeRF−− [89] | arXiv 2022 | 756 x 1008 / 1080 x 1920 / 520 x 780 | [90] / [39] / [89] |
| | GNeRF [91] | ICCV 2021 | 400 x 400 / 500 x 400 | [24] / [92] |
| | SCNeRF [93] | ICCV 2021 | 756 x 1008 / 648 x 484 | [90] / [94] |
| | NoPe-NeRF [95] | CVPR 2023 | 960 x 540 / 648 x 484 | [94] / [38] |
| | SPARF [96] | CVPR 2023 | - | [92] / [90] / [97] |
| Sparse data | NeRS [98] | NIPS 2021 | 600 x 450 | [98] |
| | MixNeRF [99] | CVPR 2023 | - | [92] / [90] / [24] |
| | SceneRF [100] | ICCV 2023 | 1220 x 370 | [101] |
| | GM-NeRF [102] | CVPR 2023 | 224 x 224 | [103] / [104] / [105] / [106] |
| | SPARF [96] | CVPR 2023 | 960 x 540 / 648 x 484 | [94] / [38] |
| Noisy data | RawNeRF [107] | CVPR 2022 | - | [107] |
| | Deblur-NeRF [108] | CVPR 2022 | 512 x 512 | [108] |
| | HDR-NeRF [109] | CVPR 2022 | 400 x 400 / 804 x 534 | [109] |
| | NAN [110] | CVPR 2022 | - | [110] |
| Large-scale image synthesis | Mip-NeRF 360 [111] | CVPR 2022 | 960 x 540 | [94] |
| | BungeeNeRF [112] | ECCV 2022 | - | [113] |
| | Block-NeRF [114] | CVPR 2022 | - | [114] |
| | GridNeRF [115] | CVPR 2023 | 2048 x 2048 / 4096 x 4096 | [116] / [115] |
| | EgoNeRF [117] | CVPR 2023 | 600 x 600 | [117] |
| Image synthesis speed | PlenOctrees [118] | ICCV 2021 | 800 x 800 / 1920 x 1080 | [24] / [94] |
| | DirectVoxGO [119] | CVPR 2022 | 800 x 800 / 800 x 800 / 768 x 576 / 1920 x 1080 / 512 x 512 | [24] / [120] / [121] / [94] / [122] |
| | R2L [123] | ECCV 2022 | 800 x 800 | [24] / [124] |
| | SqueezeNeRF [125] | CVPR 2022 | - | [24] / [90] |
| | MobileNeRF [126] | CVPR 2023 | 800 x 800 / 756 x 1008 / 1256 x 828 | [24] / [90] / [111] |
| | L2G-NeRF [127] | CVPR 2023 | 756 x 1008 | [90] |
One of the most widely used models in deep learning is the diffusion model, which is a generative model that can produce realistic and diverse images from random noise. The diffusion model is based on the idea of reversing the process of adding Gaussian noise to an image until it becomes completely corrupted. The diffusion process starts from a data sample and gradually adds noise until it reaches a predefined noise level. If we use $x_t$ to represent the image information at time $t$, then the process $q(x_t \mid x_{t-1})$ can be expressed as Eq (3.3). The generative model then learns to reverse this process by denoising the samples at each step, i.e., $p_\theta(x_{t-1} \mid x_t)$ in Figure 7, where $\theta$ represents the parameters of the neural network.
$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right) \tag{3.3}
$$
where $\beta_t$ is a small noise-variance coefficient that follows a predefined schedule over the time steps $t$.
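As an illustration, repeatedly applying Eq (3.3) has a closed form that allows a noisy $x_t$ to be sampled directly from $x_0$. The sketch below implements both the single-step transition and this shortcut; the linear β schedule and its endpoints are illustrative choices, not prescribed by the text.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear noise schedule beta_1..beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative product: prod_{s<=t} (1 - beta_s)

def q_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Single forward step q(x_t | x_{t-1}) of Eq (3.3)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t directly from x_0: q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I)."""
    noise = torch.randn_like(x0)
    return torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * noise
```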
Sbrolli et al. introduced IC3D, a novel approach addressing various challenges in shape generation. This method is capable of reconstructing a 3D shape from a single view, synthesizing a 3D shape from multiple views and completing a 3D shape from partial inputs [128]. Another significant contribution in this area is the work by Gu et al., who developed Control3Diff, a generative model with 3D-awareness and controllability. By combining diffusion models and 3D GANs, Control3Diff can synthesize diverse and realistic images without relying on 3D ground truth data and can be trained solely on single-view image data sets [129]. Additionally, Anciukevicius et al. proposed RenderDiffusion, an innovative diffusion model for 3D generation and inference. Remarkably, this model can be trained using only monocular 2D supervision and incorporates an intermediate three-dimensional representation of the scene during each denoising step, effectively integrating a robust inductive structure into the diffusion process [130]. Xiang et al. presented a novel method for generating 3D-aware images using 2D diffusion models. Their approach involves a sequential process of generating multi-view 2D images from different perspectives, ultimately achieving the synthesis of 3D-aware images [131]. Furthermore, Liu et al. proposed a framework for changing the camera viewpoint of an object using only a single RGB image. Leveraging the geometric priors learned by large-scale diffusion models about natural images, their framework employs a synthetic data set to learn the controls for adjusting the relative camera viewpoint [132]. Lastly, Chan et al. developed a method for generating diverse and realistic novel views of a scene based on a single input image. Their approach utilizes a diffusion-based model that incorporates 3D geometry priors through a latent feature volume. This feature volume captures the distribution of potential scene representations and enables the rendering of view-consistent images [133].
Transformers are a type of neural network architecture that has been widely used in natural language processing. They are based on the idea of self-attention, which allows the network to learn the relationships between different parts of the input and output sequences. Transformers were introduced into the field of computer vision by the ViT paper [134]. Their core is the attention module shown in Figure 8, whose formula is as follows:
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3.4}
$$
where $d_k$ is the dimension of the key vectors, $Q$ is the query matrix used to compute the similarity between the current position (or token) and the other positions (or tokens) in the sequence, $K$ is the key matrix associated with each position (or token) and $V$ is the value matrix containing the information or content associated with each position (or token).
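A minimal PyTorch sketch of Eq (3.4); the optional mask argument and the example tensor shapes are illustrative additions rather than part of the formula.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Eq (3.4): softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5      # (..., n_q, n_k) similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # attention weights per query
    return weights @ V                                 # (..., n_q, d_v)

# Example: self-attention over 8 tokens with 64-dimensional embeddings.
x = torch.randn(1, 8, 64)
out = scaled_dot_product_attention(x, x, x)            # shape (1, 8, 64)
```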
Leveraging the Transformer architecture for vision applications, several studies have explored its potential for synthesizing 3D views. Nguyen-Ha and colleagues presented a pioneering approach to synthesizing new views of a scene using a given set of input views. Their method employs a transformer-based architecture that effectively captures the long-range dependencies among the input views. Using a sequential process, the method generates high-quality novel views. This research contribution is documented in [136]. Similarly, Yang and colleagues proposed an innovative method for generating viewpoint-invariant 3D shapes from a single image. Their approach is based on disentangling learning and parametric NURBS surface generation. The method employs an encoder-decoder network augmented with a disentangled transformer module. This configuration enables the independent learning of shape semantics and camera viewpoints. The output of this comprehensive network includes the geometric parameters of the NURBS surface representing the 3D shape, as well as the camera-viewpoint parameters involving rotation, translation and scaling. Further details of this method can be found in [137]. Additionally, Kulhánek and colleagues proposed ViewFormer, an impressive neural rendering method that does not rely on NeRF and instead capitalizes on the power of transformers. ViewFormer is designed to learn a latent representation of a scene using only a few images, and this learned representation enables the synthesis of novel views. Notably, ViewFormer can handle complex scenes with varying illumination and geometry without requiring any 3D information or ray marching. The specific approach and findings of ViewFormer are detailed in [138].
a) GAN-based NeRF. NeRF is a novel method for rendering images from arbitrary viewpoints, but it suffers from high computational cost due to its pixel-wise optimization. GANs can synthesize realistic images in a single forward pass, but they may not preserve the view consistency across different viewpoints. Hence, there is a growing interest in exploring the integration of NeRF and GAN for efficient and consistent image synthesis. Meng et al. presented the GNeRF framework, which combines GANs and NeRF reconstruction to generate scenes with unknown or random camera poses [91]. Similarly, Zhou et al. introduced CIPS-3D, a generative model that utilizes style transfer, shallow NeRF networks and deep INR networks to represent 3D scenes and provide precise control over camera poses [139]. Another approach by Xu et al. is GRAF, a generative model for radiance fields that enables high-resolution image synthesis while being aware of the 3D shape. GRAF disentangles camera and scene properties from unposed 2D images, allowing for the synthesis of novel views and modifications to shape and appearance [140]. Lan et al. proposed a self-supervised geometry-aware encoder for style-based 3D GAN inversion. Their encoder recovers the latent code of a given 3D shape and enables manipulation of its style and geometry attributes [141]. Li et al. developed a two-step approach for 3D-aware multi-class image-to-image translation using NeRFs. They trained a multi-class 3D-aware GAN with a conditional architecture and innovative training strategy. Based on this GAN, they constructed a 3D-aware image-to-image translation system [142]. Shahbazi et al. focused on knowledge distillation, proposing a method to transfer the knowledge of a GAN trained on NeRF representation to a convolutional neural network (CNN). This enables efficient 3D-aware image synthesis [143]. Kania et al. introduced a generative model for 3D objects based on NeRFs, which are rendered into 2D novel views using a hypernetwork. The model is trained adversarially with a 2D discriminator [144]. Lastly, Bhattarai et al. proposed TriPlaneNet, an encoder specifically designed for EG3D inversion. The task of EG3D inversion involves reconstructing 3D shapes from 2D edge images [145].
b) Diffusion model-based NeRF. Likewise, the diffusion model alone fails to produce images that are consistent across different viewpoints. Therefore, many researchers integrate it with NeRF to synthesize high-quality and view-consistent images. Muller et al. proposed DiffRF, which directly generates volumetric radiance fields from a set of posed images using a 3D denoising model and a rendering loss [146]. Similarly, Xu et al. proposed NeuralLift-360, a framework that generates a 3D object with 360° views from a single 2D photo using a depth-aware NeRF and a denoising diffusion model [147]. Chen et al. proposed a 3D-aware image synthesis framework using NeRF and diffusion models, which jointly optimizes a NeRF auto-decoder and a latent diffusion model to enable simultaneous 3D reconstruction and prior learning from multi-view images of diverse objects [148]. Lastly, Gu et al. proposed NeRFDiff, a method for generating realistic and 3D-consistent novel views from a single input image. This method distills the knowledge of the conditional diffusion model (CDM) into the NeRF by synthesizing and refining a set of virtual views at test time, using a NeRF-guided distillation algorithm [149]. These approaches demonstrate the potential of using NeRF and diffusion models for 3D scene synthesis, and further research in this area is expected to yield even more exciting results.
c) Transformer-based NeRF. Building on the previous work of integrating GANs and NeRFs, some researchers have explored the possibility of using Transformer models and NeRFs to generate 3D images that are consistent across different viewpoints. Wang et al. proposed a method that can handle complex scenes with dynamic objects and occlusions, and can generalize to unseen scenes without fine-tuning. The key idea is to use a transformer to learn a global latent representation of the scene, which is then used to condition a NeRF model that renders novel views [150]. Similarly, Lin et al. proposed a method for novel view synthesis from a single unposed image using NeRF and a vision transformer (ViT). The method leverages both global and local image features to form a 3D representation of the scene, which is then used to render novel views by a multi-layer perceptron (MLP) network [151]. Finally, Liu et al. proposed a method for visual localization using a conditional NeRF model. The method can estimate the 6-DoF pose of a query image given a sparse set of reference images and their poses [152]. These methods demonstrate the potential of NeRFs and transformers in addressing challenging problems in computer vision.
Liao et al. proposed a novel framework consisting of two components for learning generative models that can achieve this goal. The first component is a 3D generator that learns to reconstruct the 3D shape and appearance of an object from a single image, while the second component is a 2D generator that learns to render the 3D object into a 2D image. This framework can generate high-quality images with controllable factors such as pose, shape and appearance [153]. Nguyen-Phuoc et al. proposed BlockGAN, a novel image generative model that can create realistic images of scenes composed of multiple objects. BlockGAN learns to generate 3D features for each object and combine them into a 3D scene representation. The model then renders the 3D scene into a 2D image, taking into account the occlusion and interaction between objects, such as shadows and lighting. BlockGAN can manipulate the pose and identity of each object independently while preserving image quality [154]. Pan et al. proposed a novel framework that can reconstruct 3D shapes from 2D image GANs without any supervision or prior knowledge. The method can generate realistic and diverse 3D shapes for various object categories, and the reconstructed shapes are consistent with the 2D images generated by the GANs. The recovered 3D shapes allow high-quality image editing such as relighting and object rotation [155]. Tewari et al. proposed a novel 3D generative model that can learn to separate the geometry and appearance factors of objects from a data set of monocular images. The model uses a non-rigid deformable scene formulation, where each object instance is represented by a deformed canonical 3D volume. The model can also compute dense correspondences between images and embed real images into its latent space, enabling editing of real images [156].
Niemeyer and Geiger introduced GIRAFFE, a deep generative model that can synthesize realistic and controllable images of 3D scenes. The model represents scenes as compositional neural feature fields that encode the shape and appearance of individual objects as well as the background. The model can disentangle these factors from unstructured and unposed image collections without any additional supervision. With GIRAFFE, individual objects in the scene can be manipulated by translating, rotating, or changing their appearance, as well as changing the camera pose [33]. Yang et al. proposed a neural scene rendering system called OC-NeRF that learns an object-compositional NeRF for editable scene rendering. OC-NeRF consists of a scene branch and an object branch, which encode the scene and object geometry and appearance, respectively. The object branch is conditioned on learnable object activation codes that enable object-level editing such as moving, adding or rotating objects [32]. Kobayashi et al. proposed a method to enable semantic editing of 3D scenes represented by NeRFs. The authors introduced distilled feature fields (DFFs), which are 3D feature descriptors learned by transferring the knowledge of pre-trained 2D image feature extractors such as CLIP-LSeg or DINO. DFFs allow users to query and select specific regions or objects in the 3D space using text, image patches or point-and-click inputs. The selected regions can then be edited in various ways, such as rotation, translation, scaling, warping, colorization or deletion [157]. Zhang et al. introduced NeRFlets, a new approach to represent 3D scenes from 2D images using local radiance fields. Unlike prior approaches that rely on global implicit functions, NeRFlets partition the scene into a collection of local coordinate frames that encode the structure and appearance of the scene. This enables efficient rendering and editing of complex scenes with high fidelity and detail. NeRFlets can manipulate the object's orientation, position and size, among other operations [158]. Finally, Zheng et al. proposed EditableNeRF, a method that allows users to edit dynamic scenes modeled by NeRF with key points. The method can handle topological changes and generate novel views from a single camera input. The key points are detected and optimized automatically by the network, and users can drag them to modify the scene. These approaches provide various means for 3D scene synthesis and editing, including manipulating objects, changing camera pose, selecting and editing specific regions or objects and handling topological changes [159].
A depth map is a representation of the distance between a scene and a reference point, such as a camera. It can be used to create realistic effects such as depth of field, occlusion and parallax [160]. Liang et al. proposed a novel method called SPIDR for representing and manipulating 3D objects using neural point fields (NPFs) and signed distance functions (SDFs) [161]. The method combines explicit point cloud and implicit neural representations to enable high-quality mesh and surface reconstruction for object deformation and lighting estimation. With the trained SPIDR model, various geometric edits can be applied to the point cloud representation, which can be used for image editing. Zhang et al. introduced a new method for rendering point clouds with frequency modulation, which enables easy editing of shape and appearance [162]. The method converts point clouds into a set of frequency-modulated signals that can be rendered efficiently using Fourier analysis. The signals can also be manipulated in the frequency domain to achieve various editing effects, such as deformation, smoothing, sharpening and color adjustment. Chen et al. also proposed NeuralEditor, a novel method for editing NeRFs for shape editing tasks [163]. The method uses point clouds as the underlying structure to construct NeRFs and renders them with a new scheme based on K-D tree-guided voxels. NeuralEditor can perform shape deformation and scene morphing by mapping points between point clouds.
Zhu et al. introduced the Visual Object Networks (VON) framework, which enables the disentangled learning of 3D object representations from 2D images. This framework comprises three modules, namely a shape generator, an appearance generator and a rendering network. By manipulating the generators, VON can perform a range of tasks, including shape manipulation, appearance transfer and novel view synthesis [164]. Mirzaei et al. proposed a reference-guided controllable inpainting method for NeRFs, which allows for the synthesis of novel views of a scene with missing regions. The method employs a reference image to guide the inpainting process and a user interface that enables the user to adjust the degree of blending between the reference and the original NeRF [165]. Yin et al. introduced OR-NeRF, a novel pipeline that can remove objects from 3D scenes using point or text prompts on a single view. This pipeline leverages a points projection strategy, a 2D segmentation model, 2D inpainting methods and depth supervision and perceptual loss to achieve better editing quality and efficiency than previous works [166]. Kim et al. proposed a visual comfort aware-reinforcement learning (VCARL) method for depth adjustment of stereoscopic 3D images. This method aims to improve the visual quality and comfort of 3D images by learning a depth adjustment policy from human feedback [167]. These advancements offer various means of manipulating objects, adjusting depth and generating novel views, ultimately enhancing the quality and realism of 3D scene synthesis and editing.
In recent years, there have been significant advancements in the field of 3D scene inpainting and editing using GANs. Jheng et al. proposed a dual-stream GAN for free-form 3D scene inpainting. The network comprises two streams, namely a depth stream and a color stream, which are jointly trained to inpaint the missing regions of a 3D scene. The depth stream predicts the depth map of the scene, while the color stream synthesizes the color image. This approach enables the removal of objects using existing 3D editing tools [168]. Another recent development in GAN training is the introduction of LinkGAN, a regularizer proposed by Zhu et al. that links some latent axes to image regions or semantic categories. By resampling partial latent codes, this approach enables local control of GAN generation [34]. Wang et al. proposed a novel method for synthesizing realistic images of indoor scenes with explicit camera pose control and object-level editing capabilities. This method builds on BlobGAN, a 2D GAN that disentangles individual objects in the scene using 2D blobs as latent codes. To extend this approach to 3D scenes, the authors introduced 3D blobs, which capture the 3D nature of objects and allow for flexible manipulation of their location and appearance [169]. These recent advancements in GAN-based 3D scene inpainting and editing have the potential to significantly improve the quality and realism of synthesized scenes.
Liu et al. [120] introduced Neural Sparse Voxel Fields (NSVF), which combines neural implicit functions with sparse voxel octrees to enable high-quality novel view synthesis from a sparse set of input images, without requiring explicit geometry reconstruction or meshing. Gu et al. [170] introduced StyleNeRF, a method that enables camera pose manipulation for synthesizing high-resolution images with strong multi-view coherence and photo realism. Wang et al. [171] introduced CLIP-NeRF, a method for manipulating 3D objects represented by NeRF using text or image inputs. Kania et al. [172] proposed a novel method for manipulating neural 3D representations of scenes beyond novel view rendering by allowing the user to specify which part of the scene they want to control with mask annotations in the training images. Lazova et al. [173] proposed a novel method for performing flexible, 3D-aware image content manipulation while enabling high-quality novel view synthesis by combining scene-specific feature volumes with a general neural rendering network. Yuan et al. [174] proposed a method for user-controlled shape deformation of scenes represented by implicit neural rendering, especially NeRF. Sun et al. [175] proposed NeRFEditor, a learning framework for 3D scene editing that uses a pre-trained StyleGAN model and a NeRF model to generate stylized images from a 360-degree video input. Wang et al. [176] proposed a novel method for image synthesis of topology-varying objects using generative deformable radiance fields (GDRFs). Tertikas et al. [177] proposed PartNeRF, a novel part-aware generative model for editable 3D shape synthesis that does not require any explicit 3D supervision. Bao et al. [178] proposed SINE, a novel approach for editing a NeRF with a single image or text prompts. Cohen-Bar et al. [179] proposed a novel framework for synthesizing and manipulating 3D scenes from text prompts and object proxies. Finally, Mirzaei et al. [180] proposed a novel method for reconstructing 3D scenes from multi-view images by leveraging NeRF to model the geometry and appearance of the scene, and introducing a segmentation network and a perceptual inpainting network to handle occlusions and missing regions. These methods represent significant progress towards the goal of enabling high-quality, user-driven 3D scene synthesis and editing.
Avrahami et al. [181] introduced a method for local image editing based on natural language descriptions and region-of-interest masks. The method uses a pre-trained language-image model (CLIP) and a denoising diffusion probabilistic model (DDPM) to produce realistic outcomes that conform to the text input. It can perform various editing tasks, such as object addition, removal, replacement or modification, background replacement and image extrapolation. Nichol et al. [182] proposed GLIDE, a diffusion-based model for text-conditional image synthesis and editing. This method uses a guidance technique to trade off diversity for fidelity and produces photorealistic images that match the text prompts. Couairon et al. [183] proposed DiffEdit, a method that uses text-conditioned diffusion models to edit images based on text queries. It can automatically generate a mask that highlights the regions of the image that need to be changed according to the text query. It also uses latent inference to preserve the content in those regions. DiffEdit can produce realistic and diverse semantic image edits for various text prompts and image sources. Sella et al. [184] proposed Vox-E, a novel framework that uses latent diffusion models to edit 3D objects based on text prompts. It takes 2D images of a 3D object as input and learns a voxel grid representation of it. It then optimizes a score distillation loss to align the voxel grid with the text prompt while regularizing it in 3D space to preserve the global structure of the original object. Vox-E can create diverse and realistic edits. Haque et al. [185] proposed a novel method for editing 3D scenes with natural language instructions. The method leverages a NeRF representation of the scene and a transformer-based model that can parse the instructions and modify the NeRF accordingly. The method can perform various editing tasks, such as changing the color, shape, position and orientation of objects, as well as adding and removing objects, with high fidelity and realism. Lin et al. [186] proposed CompoNeRF, a novel method for text-guided multi-object compositional NeRF with editable 3D scene layout. CompoNeRF can synthesize photorealistic images of complex scenes from natural language descriptions and user-specified camera poses. It can also edit the 3D layout of the scene by manipulating the objects' positions, orientations and scales. These methods have shown promising results in advancing the field of image and 3D object editing using natural language descriptions and they have the potential to be applied in various applications.
Controllable image generation refers to the use of generative models to synthesize images while constraining and adjusting the generation process so that the generated images meet specific requirements. By conditioning on external inputs or by manipulating the latent code, it is possible to edit a particular region or attribute of the image while leaving other regions or attributes unchanged. To address the low-level image generation problem, we analyze image generation under different conditions, lighting being one of them, and summarize the algorithms proposed for each type of lighting condition.
Inverse rendering. Currently, neural rendering is applied to scene reconstruction. One approach is to capture photometric appearance variations in in-the-wild data, decomposing the scene into image-dependent and shared components [187].
Another very important type of rendering is inverse rendering. Inverse rendering of objects under completely unknown capture conditions is a fundamental challenge in computer vision and graphics. This challenge is especially acute when the input image is captured in a complex and changing environment. Without using the NeRF method, Boss et al. proposed a joint optimization framework to estimate the shape, BRDF, per-image camera pose and illumination [188].
Choi et al. proposed IBL-NeRF, which is likewise built on neural rendering. Its inverse rendering extends the original NeRF formulation to capture the spatial variation of lighting within the scene volume, in addition to surface properties. Specifically, scenes of diverse materials are decomposed into intrinsic components for image-based rendering, namely albedo, roughness, surface normal, irradiance and pre-filtered radiance. All components are inferred as neural images from MLPs, which allows the method to model large-scale general scenes [189].
However, NeRF-based methods encode shape, reflectance and illumination implicitly, which makes it challenging for users to manipulate these properties explicitly in the rendered images. To address this, a hybrid SDF-based 3D neural representation was proposed that renders scene deformations and lighting more accurately and introduces a new SDF regularization term. The disadvantage of this approach is that it sacrifices rendering quality. In inverse rendering, high rendering quality is often at odds with accurate lighting decomposition, as shadows and lighting can easily be misinterpreted as textures; therefore, rendering quality requires a concerted effort of surface reconstruction and inverse rendering [161]. Dynamic NeRF is a powerful algorithm capable of rendering photo-realistic novel-view images from a monocular RGB video of a dynamic scene, but it does not model the change of the reflected color during warping. To address this problem, Yan et al. allowed specularly reflective surfaces of different poses to maintain different reflective colors when mapped to the common canonical space, by reformulating the neural radiance field function as conditional on the position and orientation of the surface in the observation space. This method reconstructs and renders dynamic specular scenes more accurately [190].
The inverse rendering objective of IBL-NeRF [189] is as follows:
$\mathcal{L} = \mathcal{L}_{\text{render}} + \mathcal{L}_{\text{pref}} + \mathcal{L}_{\text{prior}} + \lambda_{I,\text{reg}}\,\mathcal{L}_{I,\text{reg}}$    (5.1)
$\mathcal{L}_{\text{render}}$ and $\mathcal{L}_{\text{pref}}$ are rendering losses that match the rendered images to the input images.
Next, we explain each of these terms.
$\mathcal{L}_{\text{render}} = \left\| L_o(\mathbf{r}) - \hat{L}_o(\mathbf{r}) \right\|_2^2,$    (5.2)
This loss is applied to each camera ray, where $\mathbf{r}$ denotes a single pixel, $L_o$ is the estimated radiance and $\hat{L}_o$ is the ground-truth radiance.
$\mathcal{L}_{\text{pref}} = \sum_j \left\| L^j_{\text{pref}}(\mathbf{r}) - L^j_G(\mathbf{r}) \right\|_2^2.$    (5.3)
This is the rendering loss for the pre-filtered radiance. $L^j_{\text{pref}}(\mathbf{r})$ is the inferred pre-filtered radiance at the $j$-th level and $L^j_G(\mathbf{r})$ is the radiance convolved with the $j$-th level Gaussian kernel, where $L^0_G = L$.
$\mathcal{L}_{\text{prior}} = \left\| a(\mathbf{r}) - \hat{a}(\mathbf{r}) \right\|_2^2.$    (5.4)
This term encourages the inferred albedo $a$ to match the pseudo-albedo $\hat{a}$.
$\mathcal{L}_{I,\text{reg}} = \left\| I(\mathbf{r}) - \mathbb{E}[\hat{I}] \right\|_2^2,$    (5.5)
This is the irradiance regularization loss, where $\mathbb{E}[\hat{I}]$ is the mean of the irradiance (shading) values over the training images.
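Taken together, Eqs (5.1)–(5.5) reduce to a weighted sum of per-ray squared errors. The following is a minimal PyTorch sketch of that composite loss, assuming the per-ray quantities have already been predicted and gathered into dictionaries; the batching convention and the weight value are illustrative assumptions, not the settings used in [189].

```python
import torch
import torch.nn.functional as F

def inverse_rendering_loss(pred, ref, lambda_I_reg=0.1):
    """Composite loss of Eq (5.1): L_render + L_pref + L_prior + lambda * L_I_reg.

    pred / ref are dicts of per-ray tensors:
      pred["radiance"], ref["radiance"]          : (N, 3)  estimated vs. reference radiance
      pred["pref"], ref["pref"]                  : lists of (N, 3), one per Gaussian level j
      pred["albedo"], ref["albedo"]              : (N, 3)  inferred vs. pseudo albedo
      pred["irradiance"], ref["irradiance_mean"] : (N, 1)  and a scalar E[I_hat]
    """
    # Eq (5.2): per-ray radiance reconstruction.
    l_render = F.mse_loss(pred["radiance"], ref["radiance"])

    # Eq (5.3): pre-filtered radiance at every Gaussian level j.
    l_pref = sum(F.mse_loss(p, g) for p, g in zip(pred["pref"], ref["pref"]))

    # Eq (5.4): albedo prior against the pseudo albedo.
    l_prior = F.mse_loss(pred["albedo"], ref["albedo"])

    # Eq (5.5): irradiance regularized toward the training-set mean shading.
    l_i_reg = ((pred["irradiance"] - ref["irradiance_mean"]) ** 2).mean()

    return l_render + l_pref + l_prior + lambda_I_reg * l_i_reg
```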
In practice, ideal lighting is rarely available: the objects of interest may be deflected, moving, poorly lit or subject to heavy interference, so the acquired images can suffer from under-illumination, a single irradiating light source or complex illumination, all of which degrade the quality of the final image generation. Next, we review the main ways of dealing with these issues.
Illumination-normalization GANs (GAN-IN-GAN) generalize well to images with limited illumination variation. The method combines deep convolutional neural networks and GANs to normalize the illumination of color or grayscale face images, then trains feature extractors and classifiers, and it handles the illumination of both frontal and non-frontal face images. The approach can be extended beyond face image generation to other areas. However, it cannot preserve fine texture details and has some limitations. Moreover, the model is trained under well-controlled illumination variation; it can cope with poorly controlled illumination to a certain extent, but there are limitations when other features and geometric structures appear in realistic, complex environments. Whether the model would perform better if trained under complex lighting changes remains to be investigated [191].
When the data set is insufficient, an unsupervised approach can be used. For example, for low-light scenes, the unsupervised Aleth-NeRF method learns directly from dark images. The algorithm is essentially a multi-view synthesis method that takes a low-light scene as input and renders a normally illuminated one. However, a separate model must be trained for each scene, and the method does not handle non-uniform lighting conditions well [192].
Furthermore, images taken in low-light scenes are affected by distracting factors such as blur and noise. A hybrid architecture based on Retinex theory and GANs can be used to deal with this. For image vision tasks in the dark or under low light, the image is first decomposed into an illumination image and a reflectance image, and an enhancement module then generates a high-quality, clear image while minimizing the effects of blur and noise. The method introduces a structural similarity loss to avoid blurring side effects. However, matched low-light and normal-light images are not easy to acquire in real life, so suitable input data is scarce, and a sufficiently large data set is needed to maximize the algorithm's performance. The trained model also falls short of real-time requirements. In general, the algorithm only tackles image blur and noise; other problems remain and further optimization of the network structure is needed [193]. This class of problems can also be addressed by exploring multiple diffusion spaces to estimate the light component, which is used as bright pixels to enhance the low-light image based on the maximum diffusion value; this generates high-fidelity images without significant distortion and minimizes noise amplification [194]. Later, DiFaReli used a conditional denoising diffusion implicit model (DDIM) to decode the encoding of the decomposed light. Ponglertnapakorn et al. proposed a novel conditioning technique that eases the modeling of the complex interaction between light and geometry by using a rendered shading reference to spatially modulate the DDIM. This method allows single-view face relighting in the wild. However, it has limitations in eliminating shadows cast by external objects and is susceptible to image ambiguity [195].
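The Retinex decomposition mentioned above assumes that each image is the elementwise product of a reflectance map and an illumination map, $I \approx R \odot L$. The sketch below gives a minimal, hedged version of the corresponding training loss; the decomposition network and the smoothness weighting are illustrative assumptions, not the exact design of [193].

```python
import torch
import torch.nn.functional as F

def retinex_decomposition_loss(decom_net, image, smooth_weight=0.1):
    """Retinex-style decomposition: image ~= reflectance * illumination.

    decom_net : network mapping an image (B, 3, H, W) to a reflectance map
                (B, 3, H, W) and a single-channel illumination map (B, 1, H, W)
    image     : low-light input batch with values in [0, 1]
    """
    reflectance, illumination = decom_net(image)

    # Reconstruction: the product of the two components should recover the image.
    recon = reflectance * illumination
    l_recon = F.l1_loss(recon, image)

    # Smoothness prior on the illumination map (total-variation style penalty).
    dx = (illumination[:, :, :, 1:] - illumination[:, :, :, :-1]).abs().mean()
    dy = (illumination[:, :, 1:, :] - illumination[:, :, :-1, :]).abs().mean()
    l_smooth = dx + dy

    return l_recon + smooth_weight * l_smooth
```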
In summary, the full objective of the GAN-based method above is as follows:
$\mathcal{L}(G,D) = \mathcal{L}_{\text{adversarial}}(G,D) + \lambda_1 \mathcal{L}_{\text{content}}(G) + \lambda_2 \mathcal{L}_{l_1}(G)$    (5.6)
where $\lambda_1$ and $\lambda_2$ are weighting parameters.
$\mathcal{L}_{\text{adversarial}}$, $\mathcal{L}_{\text{content}}$ and $\mathcal{L}_{l_1}$ are defined as follows:
$\begin{aligned} \mathcal{L}_{\text{adversarial}}(G,D) &= \mathbb{E}_x[\log D(x)] + \mathbb{E}_{G(x)}[\log(1-D(G(x)))] \\ \mathcal{L}_{\text{content}}(G) &= \left\| F(y) - F(G(x)) \right\|_1 \\ \mathcal{L}_{l_1}(G) &= \left\| y - G(x) \right\|_1 \end{aligned}$    (5.7)
where $x$ denotes the input image, $y$ the target image and $F$ the feature extractor.
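Eqs (5.6) and (5.7) follow the familiar conditional-GAN recipe of an adversarial term plus a feature-space (content) loss and a pixel-wise L1 loss. The sketch below expresses the generator-side objective under that reading, using the non-saturating adversarial term that is a common practical substitute for the $\log(1-D(G(x)))$ form; the feature extractor and the weights $\lambda_1$, $\lambda_2$ are placeholders rather than the cited method's exact configuration.

```python
import torch
import torch.nn.functional as F

def generator_objective(G, D, feat, x, y, lam1=1.0, lam2=10.0):
    """Generator loss in the spirit of Eqs (5.6)-(5.7): adversarial + content + L1.

    G    : generator network, maps the input image x to an enhanced image
    D    : discriminator, outputs a probability that its input is a real target image
    feat : frozen feature extractor F used for the content loss
    x, y : input image and target image batches, shape (B, C, H, W)
    """
    fake = G(x)

    # Non-saturating adversarial term: the generator wants D(G(x)) -> 1.
    l_adv = -torch.log(D(fake) + 1e-8).mean()

    # Content loss: match deep features of the output and the target.
    l_content = F.l1_loss(feat(fake), feat(y))

    # Pixel-wise L1 loss between output and target.
    l_l1 = F.l1_loss(fake, y)

    return l_adv + lam1 * l_content + lam2 * l_l1
```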
When the light source is moving, a method that generates realistic scenes from captured object images can be used. Built on NeRFs, it models the volume density of the scene and the directionally emitted radiance, and it implicitly models the light transport of each object with illumination- and view-dependent neural networks. This approach copes with moving light without retraining the model [196].
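At render time, such NeRF-based simulation boils down to the standard volume-rendering quadrature: samples along each ray contribute their emitted radiance weighted by opacity and accumulated transmittance derived from the volume density. A minimal sketch of that quadrature is given below; the density and color inputs are assumed to come from a trained field.

```python
import torch

def volume_render(sigmas, colors, deltas):
    """Standard NeRF volume-rendering quadrature for a batch of rays.

    sigmas : (R, S)    volume densities at S samples along R rays
    colors : (R, S, 3) emitted (view-dependent) radiance at each sample
    deltas : (R, S)    distances between consecutive samples
    Returns the composited pixel colors, shape (R, 3).
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i).
    alphas = 1.0 - torch.exp(-sigmas * deltas)

    # Transmittance T_i = prod_{j<i} (1 - alpha_j), via an exclusive cumulative product.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    # Per-sample compositing weights and the final color integral along each ray.
    weights = alphas * trans                            # (R, S)
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=1)   # (R, 3)
    return rgb
```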
For non-uniform lighting in the environment, the light-correction network framework UDoc-GAN can be used. The key idea is to convert the otherwise uncertain normal-to-abnormal image translation into a deterministic translation guided by different levels of ambient light. In contrast, Aleth-NeRF cannot handle non-uniform illumination or shadowed images, whereas UDoc-GAN is more computationally efficient at inference time and closer to practical requirements [197].
Ling et al. observed that, just as camera rays connect the camera and the scene across multi-view image planes, shadow rays connect the light source and the scene, which led to a new shadow-ray supervision scheme that optimizes the samples and positions along those rays. By supervising the shadow rays to achieve controllable illumination, a neural SDF network for single-view scene reconstruction under multi-illumination conditions is constructed. However, the method applies only to point and parallel light sources, places clear requirements on the position of the light source and assumes a simple environment in which the scene is not otherwise illuminated [198].
For images acquired in uncontrolled, complex environments, the NeRF-OSR algorithm enables the generation of both novel views and novel illumination, providing a solution for image generation in complex settings. Addressing its remaining blurry results is an interesting future research direction, for example, by resolving inaccuracies in geometry estimation or incorporating more prior knowledge of outdoor scenes [199]. Later, Higuera et al. addressed the complex problem of lighting variation by reducing perceptual differences in vision and using a probabilistic diffusion model to capture the lighting. The method is implemented on simulated data, which mitigates the limitations of collecting large-scale real data, but it suffers from long computation times, especially in the denoising process [200]. Reflections in complex environments, for example from glass and mirrors, are a particular difficulty. Guo et al. introduced NeRFReN for modeling scenes with reflections, mainly by dividing the scene into transmitted and reflected components and modeling the two with separate neural radiance fields. This approach has far-reaching implications for further research on scene understanding and neural editing; however, it does not consider curved reflective surfaces or multiple non-coplanar reflective surfaces [201].
Generally speaking, reflected light can be divided into three components: ambient, diffuse and specular reflection. Different materials reflecting the light exhibit different lighting cues under exposure. An omnidirectional illumination method trains deep neural networks on videos captured with automatic exposure and white balance, matching real images against the illumination predicted by relighting the image and regressing from the background [202].
The method focuses on minimizing a reconstruction loss on the illumination, together with an adversarial loss. The reconstruction and adversarial losses are as follows:
$\mathcal{L}_{\text{rec}} = \sum_{b=0}^{2} \lambda_b \left\| \hat{M} \odot \left( \Lambda(\hat{I}_b)^{\frac{1}{\gamma}} - \Lambda(I_b) \right) \right\|_1.$    (5.8)
In this formulation, the clipped linear renderings $\hat{I}_b$ are gamma-encoded with $\gamma$ to match the images $I_b$, $\hat{M}$ is a binary mask and $\lambda_b$ is an optional weight.
$\mathcal{L}_{\text{adv}} = \log D(\Lambda(I_c)) + \log\left(1 - D\left(\Lambda\Big(\sum_{\theta,\phi} R(\theta,\phi)\, e^{G(x;\theta,\phi)}\Big)^{\frac{1}{\gamma}}\right)\right)$    (5.9)
In this formulation, $D$ is an auxiliary discriminator network, $G$ is the generator and $x$ is the input image.
Therefore, combining the two yields the following overall objective:
$G^{*} = \arg\min_G \max_D \, (1-\lambda_{\text{rec}})\,\mathbb{E}[\mathcal{L}_{\text{adv}}] + \lambda_{\text{rec}}\,\mathbb{E}[\mathcal{L}_{\text{rec}}]$    (5.10)
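Eq (5.10) simply mixes the expected adversarial and reconstruction losses with a single weight $\lambda_{\text{rec}}$, while Eq (5.8) compares gamma-encoded, masked renderings with an L1 penalty. The sketch below spells out that reading; the clipping operator, the gamma value and the per-term weights are assumptions for illustration, not the exact choices of [202].

```python
import torch

def reconstruction_loss(pred_linear, target_encoded, mask, lambda_b, gamma=2.2):
    """Eq (5.8) as a sketch: gamma-encode clipped linear renderings and compare to targets.

    pred_linear    : list of 3 linear re-renderings I_hat_b, each (B, 3, H, W)
    target_encoded : list of 3 already gamma-encoded references Lambda(I_b)
    mask           : binary mask M_hat, broadcastable to the image shape
    lambda_b       : list of 3 optional per-term weights
    """
    loss = 0.0
    for w, pred, tgt in zip(lambda_b, pred_linear, target_encoded):
        # Lambda(.): clip to the displayable range (an assumption), then gamma-encode.
        encoded = torch.clamp(pred, 0.0, 1.0) ** (1.0 / gamma)
        loss = loss + w * (mask * (encoded - tgt)).abs().mean()  # masked L1 penalty
    return loss


def combined_objective(l_adv, l_rec, lambda_rec=0.5):
    """Eq (5.10): weighted mix of the adversarial and reconstruction expectations."""
    return (1.0 - lambda_rec) * l_adv + lambda_rec * l_rec
```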
Of course, real scenes often contain different instances with similar reflectance.
For illumination variation, there is also a cluster-optimization method based on a neural reflectance field that uses iterative reflectance estimation and hierarchical clustering to handle instances with similar reflectance. However, complex scenes that do not conform to the unsupervised intrinsic prior remain a challenge, and solutions to such problems still need to be proposed [203].
Different media respond to light with different radiance. One approach performs reflectance decomposition with a query-based illumination integration network; it captures changing illumination, enabling more accurate novel-view synthesis and relighting, and finally achieves fast and practical differentiable rendering. The algorithm can also estimate the shape and BRDF of the objects in the image, an advantage over other algorithms. However, it has limitations in handling interreflection; in particular, an effective treatment of the interactions between all effects could be a future research direction [204].
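For readers unfamiliar with the term, a BRDF is a function of the surface normal, view and light directions that describes how much of the light arriving from the light direction is reflected toward the viewer. The sketch below evaluates a common analytic model (Lambertian diffuse plus a GGX microfacet specular lobe with Smith shadowing and Schlick Fresnel) as a generic reference; it is not the learned representation estimated in [204].

```python
import numpy as np

def ggx_brdf(n, v, l, albedo, roughness, f0=0.04):
    """Evaluate a Lambertian + GGX microfacet BRDF for unit vectors n, v, l.

    n, v, l   : surface normal, view direction, light direction (unit 3-vectors)
    albedo    : diffuse RGB albedo, array of shape (3,)
    roughness : scalar roughness in (0, 1]
    f0        : Fresnel reflectance at normal incidence
    """
    h = (v + l) / np.linalg.norm(v + l)            # half vector
    n_dot_l = max(np.dot(n, l), 1e-4)
    n_dot_v = max(np.dot(n, v), 1e-4)
    n_dot_h = max(np.dot(n, h), 0.0)
    v_dot_h = max(np.dot(v, h), 0.0)

    a2 = roughness ** 4                            # alpha = roughness^2, a2 = alpha^2
    # GGX normal distribution D(h).
    d = a2 / (np.pi * ((n_dot_h ** 2) * (a2 - 1.0) + 1.0) ** 2)
    # Smith geometry term (Schlick-GGX approximation for direct lighting).
    k = (roughness + 1.0) ** 2 / 8.0
    g = (n_dot_v / (n_dot_v * (1 - k) + k)) * (n_dot_l / (n_dot_l * (1 - k) + k))
    # Schlick Fresnel F(v, h).
    f = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

    specular = d * g * f / (4.0 * n_dot_l * n_dot_v)
    diffuse = albedo / np.pi
    return diffuse + specular
```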
The proportion of different neural network models utilized in Sections 3–5 of this review is visualized in Figure 9. The models fall into six distinct categories: NeRF, GAN, hybrid NeRF, Transformer, DM and others. The chart shows that NeRF is the most prevalent model, accounting for 45% of the surveyed works. GAN follows with 25%. Hybrid NeRF takes third place with 13%, followed closely by DM at 12%. Transformer is fifth, appearing in 2.5% of the works, and the remaining 2.5% corresponds to other models.
This distribution highlights the dominance of NeRF, indicating its widespread usage and recognition in the field. GAN also holds a significant share, reflecting its popularity across applications. Hybrid NeRF, DM and Transformer, although not as prevalent as NeRF and GAN, show notable representation. The remaining 2.5% is distributed among other models, indicating a diverse landscape of neural network approaches in the reviewed literature.
Low-level controllable image synthesis has many potential applications in various domains, such as entertainment, industry and security.
a) Video games. 3D image synthesis can create immersive and interactive virtual worlds for gamers to explore and enjoy. It can also enhance the realism and variety of characters, objects and environments in the game [205,206].
b) Movies and TV shows. 3D image synthesis can produce stunning visual effects and animations for movies and TV shows. It can also enable the creation of digital actors, creatures and scenarios that would be impossible or impractical to film in real life [207,208].
c) Virtual reality and augmented reality. 3D image synthesis can generate realistic and immersive virtual experiences for users who wear VR or AR devices. It can also augment the real world with digital information and graphics that enhance the user's perception and interaction [209].
d) Art and design. 3D image synthesis can enable artists and designers to express their creativity and vision in new ways. It can also facilitate the creation and presentation of 3D artworks, models and prototypes [210].
a) Product design and prototyping. Using 3D image synthesis, designers can visualize and test different aspects of their products, such as shape, color, texture, functionality and performance, before manufacturing them. This can save time and money, as well as improve the quality and innovation of the products [211].
b) Training and simulation. Using 3D image synthesis, trainers can create realistic and immersive scenarios for workers to practice their skills and learn new procedures. For example, 3D image synthesis can be used to simulate hazardous environments, such as oil rigs, mines or nuclear plants, where workers can train safely and effectively.
c) Inspection and quality control. Using 3D image synthesis, inspectors can detect and analyze defects and errors in products or processes, such as cracks, leaks or misalignments. For example, 3D image synthesis can be used to inspect complex structures, such as bridges, pipelines or aircraft, where human inspection may be difficult or dangerous [212,213].
a) Biometric authentication. 3D image synthesis can be used to generate realistic face images from 3D face scans or facial landmarks, which can be used for identity verification or access control. For example, Face ID on iPhone uses 3D image synthesis to project infrared dots on the user's face and match them with the stored 3D face model [214,215].
b) Forensic analysis. 3D image synthesis can be used to reconstruct crime scenes or evidence from partial or noisy data, such as surveillance videos, witness sketches or DNA samples. For example, Snapshot DNA Phenotyping uses 3D image synthesis to predict the facial appearance of a person from their DNA [216].
c) Counter-terrorism. 3D image synthesis can be used to detect and prevent potential threats by generating realistic scenarios or simulations based on intelligence data or risk assessment. For example, the US Department of Defense uses 3D image synthesis to create virtual environments for training and testing purposes.
d) Cybersecurity. 3D image synthesis can be used to protect sensitive data or systems from unauthorized access or manipulation by generating fake or distorted images that can fool attackers or malware. For example, Adversarial Robustness Toolbox uses 3D image synthesis to generate adversarial examples that can evade or mislead deep learning models [217].
In this paper, we have given a comprehensive survey of the emerging progress on low-level controllable image synthesis. We discussed a variety of low-level controllable image synthesis aspects according to their low-level vision cues. The survey reviewed important progress on 3D data sets, geometrically controllable image synthesis, photometrically controllable image synthesis and related applications. Moreover, global and local synthesis approaches are summarized separately in each controllable mode to further distinguish diverse synthesis tasks. Our goal is to provide a useful guide for researchers and developers who are interested in synthesizing and editing images from low-level 3D prompts. We categorize the literature mainly according to controllable 3D cues, since they directly determine the synthesis tasks and abilities. However, other non-rigid 3D cues, such as body kinematic joints and elastic shape deformation, are not covered by this survey.
3D controlled image synthesis is a challenging task that aims to generate realistic and diverse images of 3D objects with user-specified attributes, such as pose, shape, appearance and viewpoint. In our view, the main difficulties facing this task are the following. Data scarcity and diversity: 3D controlled image synthesis requires large-scale, high-quality data sets of 3D objects with various attributes and annotations, yet such data sets are scarce and expensive to obtain, especially for complex scenes and fine-grained categories; moreover, the data distribution may not cover all possible attribute combinations, leading to mode collapse or unrealistic synthesis. Model complexity and efficiency: the task involves modeling both the 3D structure and the 2D appearance of the objects, which requires sophisticated and computationally intensive models. Controllability and interpretability: the task aims to give users intuitive and flexible control over the synthesis process, but existing methods often use latent codes or predefined attributes as control inputs, which may not reflect the user's intention or expectation, and the relationship between control inputs and synthesis outputs may not be clear or consistent, making the results difficult to interpret and manipulate.
In response to the above challenges, we recommend that readers utilize large-scale models judiciously, as the training of such models incorporates a vast amount of data, which can overcome certain challenges arising from data limitations. We also suggest further research on latent decomposition and inverse rendering techniques. In the future, we expect that more explainable controllable cues can be extracted from current diffusion and NeRF models by advanced latent decomposition or inverse rendering. Together with semantic-level controllable image synthesis, low-level controllable image synthesis and editing can generate ever more impressive and reliable images in our lives.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This research was supported by NSFC (No. 61871074).
The authors declare that there are no conflicts of interest.
[1] | R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 10684–10695. |
[2] | Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, et al., A comprehensive survey of AI-generated content (aigc): A history of generative AI from GAN to ChatGPT, preprint, arXiv: 2303.04226. |
[3] | R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. V. Arx, et al., On the opportunities and risks of foundation models, preprint, arXiv: 2108.07258. |
[4] | L. Zhang, A. Rao, M. Agrawala, Adding conditional control to text-to-image diffusion models, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 3836–3847. |
[5] | X. Wang, L. Xie, C. Dong, Y. Shan, Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data, in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), IEEE, (2021), 1905–1914. |
[6] | J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 33 (2020), 6840–6851. |
[7] | I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., Generative adversarial nets, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 27 (2014), 1–9. |
[8] | Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, F. Huang, A tutorial on energy-based learning, Predict. Struct. Data, 1 (2006), 1–59. |
[9] | J. Zhou, Z. Wu, Z. Jiang, K. Huang, K. Guo, S. Zhao, Background selection schema on deep learning-based classification of dermatological disease, Comput. Biol. Med., 149 (2022), 105966. https://doi.org/10.1016/j.compbiomed.2022.105966 |
[10] | Q. Su, F. Wang, D. Chen, G. Chen, C. Li, L. Wei, Deep convolutional neural networks with ensemble learning and transfer learning for automated detection of gastrointestinal diseases, Comput. Biol. Med., 150 (2022), 106054. https://doi.org/10.1016/j.compbiomed.2022.106054 |
[11] | G. Liu, Q. Ding, H. Luo, M. Sha, X. Li, M. Ju, Cx22: A new publicly available dataset for deep learning-based segmentation of cervical cytology images, Comput. Biol. Med., 150 (2022), 106194. https://doi.org/10.1016/j.compbiomed.2022.106194 |
[12] | L. Xu, R. Magar, A. B. Farimani, Forecasting COVID-19 new cases using deep learning methods, Comput. Biol. Med., 144 (2022), 105342. https://doi.org/10.1016/j.compbiomed.2022.105342 |
[13] | D. P. Kingma, M. Welling, Auto-encoding variational bayes, preprint, arXiv: 1312.6114. |
[14] | A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog, 1 (2019), 9. |
[15] | H. Huang, P. S. Yu, C. Wang, An introduction to image synthesis with generative adversarial nets, preprint, arXiv: 1803.04469. |
[16] | M. Mirza, S. Osindero, Conditional generative adversarial nets, preprint, arXiv: 1411.1784. |
[17] | L. A. Gatys, A. S. Ecker, M. Bethge, Image style transfer using convolutional neural networks, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2016), 2414–2423. https://doi.org/10.1109/CVPR.2016.265 |
[18] | S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, R. Szeliski, Building rome in a day, in 2009 IEEE 12th International Conference on Computer Vision, IEEE, (2009), 72–79. https://doi.org/10.1109/ICCV.2009.5459148 |
[19] | L. Yang, T. Yendo, M. P. Tehrani, T. Fujii, M. Tanimoto, Probabilistic reliability based view synthesis for FTV, in 2010 IEEE International Conference on Image Processing, IEEE, (2010), 1785–1788. https://doi.org/10.1109/ICIP.2010.5650222 |
[20] | Y. Zheng, G. Zeng, H. Li, Q. Cai, J. Du, Colorful 3D reconstruction at high resolution using multi-view representation, J. Visual Commun. Image Represent., 85 (2022), 103486. https://doi.org/10.1016/j.jvcir.2022.103486 |
[21] | J. Deng, W. Dong, R. Socher, L. Li, L. Kai, F. Li, ImageNet: A large-scale hierarchical image database, in 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, (2009), 248–255. https://doi.org/10.1109/CVPR.2009.5206848 |
[22] | C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, et al., LAION-5B: An open large-scale dataset for training next generation image-text models, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 35 (2022), 25278–25294. |
[23] | S. M. Mohammad, S. Kiritchenko, Wikiart emotions: An annotated dataset of emotions evoked by art, in Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), (2018), 1–14. |
[24] | B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng, NeRF: Representing scenes as Neural Radiance Fields for view synthesis, in European Conference on Computer Vision, Springer, (2020), 405–421. https://doi.org/10.1007/978-3-030-58452-8_24 |
[25] | S. Huang, Q. Li, J. Liao, L. Liu, L. Li, An overview of controllable image synthesis: Current challenges and future trends, SSRN, 2022. |
[26] | A. Tsirikoglou, G. Eilertsen, J. Unger, A survey of image synthesis methods for visual machine learning, Comput. Graphics Forum, 39 (2020), 426–451. https://doi.org/10.1111/cgf.14047 |
[27] | H. Ren, G. Stella, B. S. Sami, Controllable GAN synthesis using non-rigid structure-from-motion, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 678–687. |
[28] | J. Zhang, A. Siarohin, Y. Liu, H. Tang, N. Sebe, W. Wang, Training and tuning generative neural radiance fields for attribute-conditional 3D-aware face generation, preprint, arXiv: 2208.12550. |
[29] | J. Ko, K. Cho, D. Choi, K. Ryoo, S. Kim, 3D GAN inversion with pose optimization, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE, (2023), 2967–2976. |
[30] | S. Yang, W. Wang, B. Peng, J. Dong, Designing a 3D-aware StyleNeRF encoder for face editing, in ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, (2023), 1–5. https://doi.org/10.1109/ICASSP49357.2023.10094932 |
[31] | J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, et al., ABO: Dataset and benchmarks for real-world 3D object understanding, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2022), 21126–21136. |
[32] | B. Yang, Y. Zhang, Y. Xu, Y. Li, H. Zhou, H. Bao, et al., Learning object-compositional Neural Radiance Field for editable scene rendering, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 13759–13768. https://doi.org/10.1109/ICCV48922.2021.01352 |
[33] | M. Niemeyer, A. Geiger, GIRAFFE: Representing scenes as compositional generative neural feature fields, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2021), 11453–11464. |
[34] | J. Zhu, C. Yang, Y. Shen, Z. Shi, B. Dai, D. Zhao, et al., LinkGAN: Linking GAN latents to pixels for controllable image synthesis, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 7656–7666. |
[35] | R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, in 2008 8th IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, (2008), 1–8. https://doi.org/10.1109/AFGR.2008.4813399 |
[36] | M. Boss, R. Braun, V. Jampani, J. T. Barron, C. Liu, H. P. A. Lensch, NeRD: Neural reflectance decomposition from image collections, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 12664–12674. https://doi.org/10.1109/ICCV48922.2021.01245 |
[37] | X. Yan, Z. Yuan, Y. Du, Y. Liao, Y. Guo, Z. Li, et al., CLEVR3D: Compositional language and elementary visual reasoning for question answering in 3D real-world scenes, preprint, arXiv: 2112.11691. |
[38] | A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, M. Nießner, ScanNet: Richly-annotated 3D reconstructions of indoor scenes, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2017), 5828–5839. |
[39] | T. Zhou, R. Tucker, J. Flynn, G. Fyffe, N. Snavely, Stereo magnification: Learning view synthesis using multiplane images, ACM Trans. Graphics, 37 (2018), 1–12. https://doi.org/10.1145/3197517.3201323 |
[40] | A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, et al., ShapeNet: An information-rich 3D model repository, preprint, arXiv: 1512.03012. |
[41] | A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the KITTI vision benchmark suite, in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, (2012), 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074 |
[42] | H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, et al., NuScenes: A multimodal dataset for autonomous driving, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2020), 11621–11631. |
[43] | S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, et al., Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI, in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), (2021), 1–12. |
[44] | D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Int. J. Comput. Vision, 47 (2002), 7–42. https://doi.org/10.1023/A:1014573219977 |
[45] | D. Scharstein, R. Szeliski, High-accuracy stereo depth maps using structured light, in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, (2003), 1. https://doi.org/10.1109/CVPR.2003.1211354 |
[46] | D. Scharstein, C. Pal, Learning conditional random fields for stereo, in 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, (2007), 1–8. https://doi.org/10.1109/CVPR.2007.383191 |
[47] | H. Hirschmuller, D. Scharstein, Evaluation of cost functions for stereo matching, in 2007 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, (2007), 1–8. https://doi.org/10.1109/CVPR.2007.383248 |
[48] | D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, et al., High-resolution stereo datasets with subpixel-accurate ground truth, in 36th German Conference on Pattern Recognition, Springer, (2014), 31–42. https://doi.org/10.1007/978-3-319-11752-2_3 |
[49] | N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Springer, (2012), 746–760. https://doi.org/10.1007/978-3-642-33715-4_54 |
[50] | K. Guo, P. Lincoln, P. Davidson, J. Busch, X. Yu, M. Whalen, et al., The Relightables: Volumetric performance capture of humans with realistic relighting, ACM Trans. Graphics, 38 (2019), 1–19. https://doi.org/10.1145/3355089.3356571 |
[51] | A. Horé, D. Ziou, Image quality metrics: PSNR vs. SSIM, in 2010 20th International Conference on Pattern Recognition, IEEE, (2010), 2366–2369. https://doi.org/10.1109/ICPR.2010.579 |
[52] | Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Trans. Image Process., 13 (2004), 600–612. https://doi.org/10.1109/TIP.2003.819861 |
[53] | R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in Proceedings of the IEEE conference on computer vision and pattern recognition, IEEE, (2018), 586–595. |
[54] | T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 29 (2016), 1–9. |
[55] | M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local nash equilibrium, in Advances in Neural Information Processing Systems, Curran Associates Inc., 30 (2017), 1–12. |
[56] | M. Bińkowski, D. J. Sutherland, M. Arbel, A. Gretton, Demystifying MMD GANs, in International Conference on Learning Representations, 2018. |
[57] | Z. Shi, S. Peng, Y. Xu, Y. Liao, Y. Shen, Deep generative models on 3D representations: A survey, preprint, arXiv: 2210.15663. |
[58] | R. Huang, S. Zhang, T. Li, R. He, Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, (2017), 2439–2448. |
[59] | B. Zhao, X. Wu, Z. Cheng, H. Liu, Z. Jie, J. Feng, Multi-view image generation from a single-view, in Proceedings of the 26th ACM International Conference on Multimedia, ACM, (2018), 383–391. https://doi.org/10.1145/3240508.3240536 |
[60] | K. Regmi, A. Borji, Cross-view image synthesis using conditional GANs, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2018), 3501–3510. https://doi.org/10.1109/CVPR.2018.00369 |
[61] | K. Regmi, A. Borji, Cross-view image synthesis using geometry-guided conditional GANs, Comput. Vision Image Understanding, 187 (2019), 102788. https://doi.org/10.1016/j.cviu.2019.07.008 |
[62] | F. Mokhayeri, K. Kamali, E. Granger, Cross-domain face synthesis using a controllable GAN, in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, (2020), 241–249. https://doi.org/10.1109/WACV45572.2020.9093275 |
[63] | X. Zhu, Z. Yin, J. Shi, H. Li, D. Lin, Generative adversarial frontal view to bird view synthesis, in 2018 International Conference on 3D Vision (3DV), IEEE, (2018), 454–463. https://doi.org/10.1109/3DV.2018.00059 |
[64] | H. Ding, S. Wu, H. Tang, F. Wu, G. Gao, X. Jing, Cross-view image synthesis with deformable convolution and attention mechanism, in Pattern Recognition and Computer Vision, Springer, (2020), 386–397. https://doi.org/10.1007/978-3-030-60633-6_32 |
[65] | B. Ren, H. Tang, N. Sebe, Cascaded cross MLP-Mixer GANs for cross-view image translation, in British Machine Vision Conference, (2021), 1–14. |
[66] | J. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, (2017), 2242–2251. https://doi.org/10.1109/ICCV.2017.244 |
[67] | M. Yin, L. Sun, Q. Li, Novel view synthesis on unpaired data by conditional deformable variational auto-encoder, in Computer Vision–ECCV 2020, Springer, (2020), 87–103. https://doi.org/10.1007/978-3-030-58604-1_6 |
[68] | X. Shen, J. Plested, Y. Yao, T. Gedeon, Pairwise-GAN: Pose-based view synthesis through pair-wise training, in Neural Information Processing, Springer, (2020), 507–515. https://doi.org/10.1007/978-3-030-63820-7_58 |
[69] | E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, G. Wetzstein, pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2021), 5795–5805. https://doi.org/10.1109/CVPR46437.2021.00574 |
[70] | S. Cai, A. Obukhov, D. Dai, L. V. Gool, Pix2NeRF: Unsupervised conditional π-GAN for single image to Neural Radiance Fields translation, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 3971–3980. https://doi.org/10.1109/CVPR52688.2022.00395 |
[71] | T. Leimkühler, G. Drettakis, FreeStyleGAN, ACM Trans. Graphics, 40 (2021), 1–15. https://doi.org/10.1145/3478513.3480538 |
[72] | S. C. Medin, B. Egger, A. Cherian, Y. Wang, J. B. Tenenbaum, X. Liu, et al., MOST-GAN: 3D morphable StyleGAN for disentangled face image manipulation, in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 36 (2022), 1962–1971. https://doi.org/10.1609/aaai.v36i2.20091 |
[73] | R. Or-El, X. Luo, M. Shan, E. Shechtman, J. J. Park, I. Kemelmacher-Shlizerman, StyleSDF: High-resolution 3D-consistent image and geometry generation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 13503–13513. |
[74] | X. Zheng, Y. Liu, P. Wang, X. Tong, SDF-StyleGAN: Implicit SDF-based StyleGAN for 3D shape generation, Comput. Graphics Forum, 41 (2022), 52–63. https://doi.org/10.1111/cgf.14602 |
[75] | Y. Deng, J. Yang, J. Xiang, X. Tong, GRAM: Generative radiance manifolds for 3D-aware image generation, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2022), 10663–10673. https://doi.org/10.1109/CVPR52688.2022.01041 |
[76] | J. Xiang, J. Yang, Y. Deng, X. Tong, GRAM-HD: 3D-consistent image generation at high resolution with generative radiance manifolds, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 2195–2205. |
[77] | E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. D. Mello, et al., Efficient geometry-aware 3D generative adversarial networks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 16123–16133. |
[78] | X. Zhao, F. Ma, D. Güera, Z. Ren, A. G. Schwing, A. Colburn, Generative multiplane images: Making a 2D GAN 3D-aware, in Computer Vision–ECCV 2022, Springer, (2022), 18–35. https://doi.org/10.1007/978-3-031-20065-6_2 |
[79] | H. A. Alhaija, A. Dirik, A. Knörig, S. Fidler, M. Shugrina, XDGAN: Multi-modal 3D shape generation in 2D space, in British Machine Vision Conference, (2022), 1–14. |
[80] | K. Zhang, G. Riegler, N. Snavely, V. Koltun, NeRF++: Analyzing and improving Neural Radiance Fields, preprint, arXiv: 2010.07492. |
[81] | D. Rebain, W. Jiang, S. Yazdani, K. Li, K. M. Yi, A. Tagliasacchi, DeRF: Decomposed radiance fields, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2021), 14148–14156. https://doi.org/10.1109/CVPR46437.2021.01393 |
[82] | K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, et al., Nerfies: Deformable Neural Radiance Fields, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 5845–5854. https://doi.org/10.1109/ICCV48922.2021.00581 |
[83] | J. Li, Z. Feng, Q. She, H. Ding, C. Wang, G. H. Lee, MINE: Towards continuous depth MPI with NeRF for novel view synthesis, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 12558–12568. https://doi.org/10.1109/ICCV48922.2021.01235 |
[84] | K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, et al., HyperNeRF: A higher-dimensional representation for topologically varying Neural Radiance Fields, ACM Trans. Graphics, 40 (2021), 1–12. https://doi.org/10.1145/3478513.3480487 |
[85] | T. Chen, P. Wang, Z. Fan, Z. Wang, Aug-NeRF: Training stronger Neural Radiance Fields with triple-level physically-grounded augmentations, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 15170–15181. https://doi.org/10.1109/CVPR52688.2022.01476 |
[86] | T. Kaneko, AR-NeRF: Unsupervised learning of depth and defocus effects from natural images with aperture rendering Neural Radiance Fields, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 18387–18397. |
[87] | X. Li, C. Hong, Y. Wang, Z. Cao, K. Xian, G. Lin, SymmNeRF: Learning to explore symmetry prior for single-view view synthesis, in Proceedings of the Asian Conference on Computer Vision (ACCV), (2022), 1726–1742. |
[88] | K. Zhou, W. Li, Y. Wang, T. Hu, N. Jiang, X. Han, et al., NeRFLix: High-quality neural view synthesis by learning a degradation-driven inter-viewpoint mixer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 12363–12374. |
[89] | Z. Wang, S. Wu, W. Xie, M. Chen, V. A. Prisacariu, NeRF–: Neural Radiance Fields without known camera parameters, preprint, arXiv: 2102.07064. |
[90] | B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, et al., Local light field fusion: Practical view synthesis with prescriptive sampling guidelines, ACM Trans. Graphics, 38 (2019), 1–14. https://doi.org/10.1145/3306346.3322980 |
[91] | Q. Meng, A. Chen, H. Luo, M. Wu, H. Su, L. Xu, et al., GNeRF: GAN-based Neural Radiance Field without posed camera, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 6351–6361. |
[92] | R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, H. Aanæs, Large scale multi-view stereopsis evaluation, in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, (2014), 406–413. https://doi.org/10.1109/CVPR.2014.59 |
[93] | Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, J. Park, Self-calibrating Neural Radiance Fields, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 5826–5834. https://doi.org/10.1109/ICCV48922.2021.00579 |
[94] | A. Knapitsch, J. Park, Q. Zhou, V. Koltun, Tanks and temples: Benchmarking large-scale scene reconstruction, ACM Trans. Graphics, 36 (2017), 1–13. https://doi.org/10.1145/3072959.3073599 |
[95] | W. Bian, Z. Wang, K. Li, J. Bian, V. A. Prisacariu, NoPe-NeRF: Optimising Neural Radiance Field with no pose prior, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 4160–4169. |
[96] | P. Truong, M. Rakotosaona, F. Manhardt, F. Tombari, SPARF: Neural Radiance Fields from sparse and noisy poses, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 4190–4200. |
[97] | J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, et al., The replica dataset: A digital replica of indoor spaces, preprint, arXiv: 1906.05797. |
[98] | J. Y. Zhang, G. Yang, S. Tulsiani, D. Ramanan, NeRS: Neural reflectance surfaces for sparse-view 3D reconstruction in the wild, in Conference on Neural Information Processing Systems, Curran Associates, Inc., 34 (2021), 29835–29847. |
[99] | S. Seo, D. Han, Y. Chang, N. Kwak, MixNeRF: Modeling a ray with mixture density for novel view synthesis from sparse inputs, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 20659–20668. |
[100] | A. Cao, R. D. Charette, SceneRF: Self-supervised monocular 3D scene reconstruction with radiance fields, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 9387–9398. |
[101] | J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, et al., SemanticKITTI: A dataset for semantic scene understanding of lidar sequences, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2019), 9296–9306. https://doi.org/10.1109/ICCV.2019.00939 |
[102] | J. Chen, W. Yi, L. Ma, X. Jia, H. Lu, GM-NeRF: Learning generalizable model-based Neural Radiance Fields from multi-view images, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 20648–20658. |
[103] | T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, Y. Liu, Function4D: Real-time human volumetric capture from very sparse consumer rgbd sensors, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2021), 5742–5752. https://doi.org/10.1109/CVPR46437.2021.00569 |
[104] | B. Bhatnagar, G. Tiwari, C. Theobalt, G. Pons-Moll, Multi-Garment net: Learning to dress 3D people from images, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2019), 5419–5429. https://doi.org/10.1109/ICCV.2019.00552 |
[105] | W. Cheng, S. Xu, J. Piao, C. Qian, W. Wu, K. Lin, et al., Generalizable neural performer: Learning robust radiance fields for human novel view synthesis, preprint, arXiv: 2204.11798. |
[106] | S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, et al., Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2021), 9050–9059. https://doi.org/10.1109/CVPR46437.2021.00894 |
[107] | B. Mildenhall, P. Hedman, R. Martin-Brualla, P. P. Srinivasan, J. T. Barron, NeRF in the dark: High dynamic range view synthesis from noisy raw images, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 16190–16199. |
[108] | L. Ma, X. Li, J. Liao, Q. Zhang, X. Wang, J. Wang, et al., Deblur-NeRF: Neural Radiance Fields from blurry images, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 12861–12870. |
[109] | X. Huang, Q. Zhang, Y. Feng, H. Li, X. Wang, Q. Wang, Hdr-NeRF: High dynamic range Neural Radiance Fields, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2022), 18398–18408. |
[110] | N. Pearl, T. Treibitz, S. Korman, NAN: Noise-aware NeRFs for burst-denoising, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 12672–12681. |
[111] | J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, P. Hedman, Mip-NeRF 360: Unbounded anti-aliased Neural Radiance Fields, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 5470–5479. |
[112] | Y. Xiangli, L. Xu, X. Pan, N. Zhao, A. Rao, C. Theobalt, et al., BungeeNeRF: Progressive Neural Radiance Field for extreme multi-scale scene rendering, in Computer Vision–ECCV 2022, Springer, (2022), 106–122. https://doi.org/10.1007/978-3-031-19824-3_7 |
[113] | Google, Google earth studio, 2018. Available from: https://www.google.com/earth/studio/. |
[114] | M. Tancik, V. Casser, X. Yan, S. Pradhan, B. P. Mildenhall, P. Srinivasan, et al., Block-NeRF: Scalable large scene neural view synthesis, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 8238–8248. https://doi.org/10.1109/CVPR52688.2022.00807 |
[115] | L. Xu, Y. Xiangli, S. Peng, X. Pan, N. Zhao, C. Theobalt, et al., Grid-guided Neural Radiance Fields for large urban scenes, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 8296–8306. |
[116] | H. Turki, D. Ramanan, M. Satyanarayanan, Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 12922–12931. |
[117] | C. Choi, S. M. Kim, Y. M. Kim, Balanced spherical grid for egocentric view synthesis, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 16590–16599. |
[118] | A. Yu, R. Li, M. Tancik, H. Li, R. Ng, A. Kanazawa, PlenOctrees for real-time rendering of Neural Radiance Fields, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2021), 5732–5741. https://doi.org/10.1109/ICCV48922.2021.00570 |
[119] | C. Sun, M. Sun, H. Chen, Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 5449–5459. https://doi.org/10.1109/CVPR52688.2022.00538 |
[120] | L. Liu, J. Gu, K. Z. Lin, T. Chua, C. Theobalt, Neural sparse voxel fields, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 33 (2020), 15651–15663. |
[121] | Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, et al., BlendedMVS: A large-scale dataset for generalized multi-view stereo networks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2020), 1790–1799. |
[122] | V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, M. Zollhöfer, DeepVoxels: Learning persistent 3D feature embeddings, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2019), 2437–2446. |
[123] | H. Wang, J. Ren, Z. Huang, K. Olszewski, M. Chai, Y. Fu, et al., R2L: Distilling Neural Radiance Field to neural light field for efficient novel view synthesis, in Computer Vision–ECCV 2022, Springer, (2022), 612–629. https://doi.org/10.1007/978-3-031-19821-2_35 |
[124] | T. Neff, P. Stadlbauer, M. Parger, A. Kurz, J. H. Mueller, C. R. A. Chaitanya, et al., DONeRF: Towards real-time rendering of compact Neural Radiance Fields using depth oracle networks, Comput. Graphics Forum, 40 (2021), 45–59. https://doi.org/10.1111/cgf.14340 |
[125] | K. Wadhwani, T. Kojima, SqueezeNeRF: Further factorized FastNeRF for memory-efficient inference, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, (2022), 2716–2724. https://doi.org/10.1109/CVPRW56347.2022.00307 |
[126] | Z. Chen, T. Funkhouser, P. Hedman, A. Tagliasacchi, MobileNeRF: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 16569–16578. |
[127] | Y. Chen, X. Chen, X. Wang, Q. Zhang, Y. Guo, Y. Shan, et al., Local-to-global registration for bundle-adjusting Neural Radiance Fields, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 8264–8273. |
[128] | C. Sbrolli, P. Cudrano, M. Frosi, M. Matteucci, IC3D: Image-conditioned 3D diffusion for shape generation, preprint, arXiv: 2211.10865. |
[129] | J. Gu, Q. Gao, S. Zhai, B. Chen, L. Liu, J. Susskind, Learning controllable 3D diffusion models from single-view images, preprint, arXiv: 2304.06700. |
[130] | T. Anciukevičius, Z. Xu, M. Fisher, P. Henderson, H. Bilen, N. J. Mitra, et al., RenderDiffusion: Image diffusion for 3D reconstruction, inpainting and generation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 12608–12618. |
[131] | J. Xiang, J. Yang, B. Huang, X. Tong, 3D-aware image generation using 2D diffusion models, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 2383–2393. |
[132] | R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, C. Vondrick, Zero-1-to-3: Zero-shot one image to 3D object, preprint, arXiv: 2303.11328. |
[133] | E. R. Chan, K. Nagano, M. A. Chan, A. W. Bergman, J. J. Park, A. Levy, et al., Generative novel view synthesis with 3D-aware diffusion models, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 4217–4229. |
[134] | A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16 x 16 words: Transformers for image recognition at scale, in International Conference on Learning Representations, 2021. |
[135] | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., (2017), 6000–6010. |
[136] | P. Nguyen-Ha, L. Huynh, E. Rahtu, J. Heikkila, Sequential view synthesis with transformer, in Proceedings of the Asian Conference on Computer Vision (ACCV), 2020. |
[137] | J. Yang, Y. Li, L. Yang, Shape transformer nets: Generating viewpoint-invariant 3D shapes from a single image, J. Visual Commun. Image Represent., 81 (2021), 103345. https://doi.org/10.1016/j.jvcir.2021.103345 |
[138] | J. Kulhánek, E. Derner, T. Sattler, R. Babuška, ViewFormer: NeRF-free neural rendering from few images using transformers, in Computer Vision–ECCV 2022, Springer, (2022), 198–216. https://doi.org/10.1007/978-3-031-19784-0_12 |
[139] | P. Zhou, L. Xie, B. Ni, Q. Tian, CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis, preprint, arXiv: 2110.09788. |
[140] | X. Xu, X. Pan, D. Lin, B. Dai, Generative occupancy fields for 3D surface-aware image synthesis, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 34 (2021), 20683–20695. |
[141] | Y. Lan, X. Meng, S. Yang, C. C. Loy, B. Dai, Self-supervised geometry-aware encoder for style-based 3D GAN inversion, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 20940–20949. |
[142] | S. Li, J. van de Weijer, Y. Wang, F. S. Khan, M. Liu, J. Yang, 3D-aware multi-class image-to-image translation with NeRFs, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 12652–12662. |
[143] | M. Shahbazi, E. Ntavelis, A. Tonioni, E. Collins, D. P. Paudel, M. Danelljan, et al., NeRF-GAN distillation for efficient 3D-aware generation with convolutions, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, IEEE, (2023), 2888–2898. |
[144] | A. Kania, A. Kasymov, M. Ziba, P. Spurek, HyperNeRFGAN: Hypernetwork approach to 3D NeRF GAN, preprint, arXiv: 2301.11631. |
[145] | A. R. Bhattarai, M. Nießner, A. Sevastopolsky, TriPlaneNet: An encoder for EG3D inversion, preprint, arXiv: 2303.13497. |
[146] | N. Müller, Y. Siddiqui, L. Porzi, S. R. Bulo, P. Kontschieder, M. Nießner, Diffrf: Rendering-guided 3D radiance field diffusion, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 4328–4338. |
[147] | D. Xu, Y. Jiang, P. Wang, Z. Fan, Y. Wang, Z. Wang, NeuralLift-360: Lifting an in-the-wild 2D photo to a 3D object with 360deg views, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 4479–4489. |
[148] | H. Chen, J. Gu, A. Chen, W. Tian, Z. Tu, L. Liu, et al., Single-stage diffusion NeRF: A unified approach to 3D generation and reconstruction, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 2416–2425. |
[149] | J. Gu, A. Trevithick, K. Lin, J. Susskind, C. Theobalt, L. Liu, et al., NerfDiff: Single-image view synthesis with NeRF-guided distillation from 3D-aware diffusion, in International Conference on Machine Learning, PMLR, (2023), 11808–11826. |
[150] | D. Wang, X. Cui, S. Salcudean, Z. J. Wang, Generalizable Neural Radiance Fields for novel view synthesis with transformer, preprint, arXiv: 2206.05375. |
[151] | K. Lin, L. Yen-Chen, W. Lai, T. Lin, Y. Shih, R. Ramamoorthi, Vision transformer for NeRF-based view synthesis from a single input image, in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE, (2023), 806–815. https://doi.org/10.1109/WACV56688.2023.00087 |
[152] | J. Liu, Q. Nie, Y. Liu, C. Wang, NeRF-Loc: Visual localization with conditional Neural Radiance Field, preprint, arXiv: 2304.07979. |
[153] | Y. Liao, K. Schwarz, L. Mescheder, A. Geiger, Towards unsupervised learning of generative models for 3D controllable image synthesis, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2020), 5871–5880. |
[154] | T. Nguyen-Phuoc, C. Richardt, L. Mai, Y. Yang, N. Mitra, BlockGAN: Learning 3D object-aware scene representations from unlabelled images, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 33 (2020), 6767–6778. |
[155] | X. Pan, B. Dai, Z. Liu, C. C. Loy, P. Luo, Do 2D GANs know 3D shape? Unsupervised 3D shape reconstruction from 2D image GANs, in International Conference on Learning Representations, 2021. |
[156] | A. Tewari, M. B. R, X. Pan, O. Fried, M. Agrawala, C. Theobalt, Disentangled3D: Learning a 3D generative model with disentangled geometry and appearance from monocular images, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 1506–1515. https://doi.org/10.1109/CVPR52688.2022.00157 |
[157] | S. Kobayashi, E. Matsumoto, V. Sitzmann, Decomposing NeRF for editing via feature field distillation, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 35 (2022), 23311–23330. |
[158] | X. Zhang, A. Kundu, T. Funkhouser, L. Guibas, H. Su, K. Genova, Nerflets: Local radiance fields for efficient structure-aware 3D scene representation from 2D supervision, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 8274–8284. |
[159] | C. Zheng, W. Lin, F. Xu, EditableNeRF: Editing topologically varying Neural Radiance Fields by key points, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 8317–8327. |
[160] | J. Zhang, L. Yang, MonodepthPlus: Self-supervised monocular depth estimation using soft-attention and learnable outlier-masking, J. Electron. Imaging, 30 (2021), 023017. https://doi.org/10.1117/1.JEI.30.2.023017 |
[161] | R. Liang, J. Zhang, H. Li, C. Yang, Y. Guan, N. Vijaykumar, SPIDR: SDF-based neural point fields for illumination and deformation, preprint, arXiv: 2210.08398. |
[162] | Y. Zhang, X. Huang, B. Ni, T. Li, W. Zhang, Frequency-modulated point cloud rendering with easy editing, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 119–129. |
[163] | J. Chen, J. Lyu, Y. Wang, NeuralEditor: Editing Neural Radiance Fields via manipulating point clouds, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 12439–12448. |
[164] | J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. B. Tenenbaum, et al., Visual object networks: Image generation with disentangled 3D representations, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 31 (2018). |
[165] | A. Mirzaei, T. Aumentado-Armstrong, M. A. Brubaker, J. Kelly, A. Levinshtein, K. G. Derpanis, et al., Reference-guided controllable inpainting of Neural Radiance Fields, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 17815–17825. |
[166] | Y. Yin, Z. Fu, F. Yang, G. Lin, OR-NeRF: Object removing from 3D scenes guided by multiview segmentation with Neural Radiance Fields, preprint, arXiv: 2305.10503. |
[167] | H. G. Kim, M. Park, S. Lee, S. Kim, Y. M. Ro, Visual comfort aware-reinforcement learning for depth adjustment of stereoscopic 3D images, in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 35 (2021), 1762–1770. https://doi.org/10.1609/aaai.v35i2.16270 |
[168] | R. Jheng, T. Wu, J. Yeh, W. H. Hsu, Free-form 3D scene inpainting with dual-stream GAN, in British Machine Vision Conference, 2022. |
[169] | Q. Wang, Y. Wang, M. Birsak, P. Wonka, BlobGAN-3D: A spatially-disentangled 3D-aware generative model for indoor scenes, preprint, arXiv: 2303.14706. |
[170] | J. Gu, L. Liu, P. Wang, C. Theobalt, StyleNeRF: A style-based 3D aware generator for high-resolution image synthesis, in Tenth International Conference on Learning Representations, (2022), 1–25. |
[171] | C. Wang, M. Chai, M. He, D. Chen, J. Liao, CLIP-NeRF: Text-and-image driven manipulation of Neural Radiance Fields, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 3825–3834. https://doi.org/10.1109/CVPR52688.2022.00381 |
[172] | K. Kania, K. M. Yi, M. Kowalski, T. Trzciński, A. Tagliasacchi, CoNeRF: Controllable Neural Radiance Fields, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, (2022), 18623–18632. |
[173] | V. Lazova, V. Guzov, K. Olszewski, S. Tulyakov, G. Pons-Moll, Control-NeRF: Editable feature volumes for scene rendering and manipulation, in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE, (2023), 4329–4339. https://doi.org/10.1109/WACV56688.2023.00432 |
[174] | Y. Yuan, Y. Sun, Y. Lai, Y. Ma, R. Jia, L. Gao, NeRF-Editing: Geometry editing of Neural Radiance Fields, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 18332–18343. https://doi.org/10.1109/CVPR52688.2022.01781 |
[175] | C. Sun, Y. Liu, J. Han, S. Gould, NeRFEditor: Differentiable style decomposition for full 3D scene editing, preprint, arXiv: 2212.03848. |
[176] | Z. Wang, Y. Deng, J. Yang, J. Yu, X. Tong, Generative deformable radiance fields for disentangled image synthesis of topology-varying objects, Comput. Graphics Forum, 41 (2022), 431–442. https://doi.org/10.1111/cgf.14689 |
[177] | K. Tertikas, D. Paschalidou, B. Pan, J. J. Park, M. A. Uy, I. Emiris, et al., PartNeRF: Generating part-aware editable 3D shapes without 3D supervision, preprint, arXiv: 2303.09554. |
[178] | C. Bao, Y. Zhang, B. Yang, T. Fan, Z. Yang, H. Bao, et al., SINE: Semantic-driven image-based NeRF editing with prior-guided editing field, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 20919–20929. |
[179] | D. Cohen-Bar, E. Richardson, G. Metzer, R. Giryes, D. Cohen-Or, Set-the-Scene: Global-local training for generating controllable NeRF scenes, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, IEEE, (2023), 2920–2929. |
[180] | A. Mirzaei, T. Aumentado-Armstrong, K. G. Derpanis, J. Kelly, M. A. Brubaker, I. Gilitschenski, et al., SPIn-NeRF: Multiview segmentation and perceptual inpainting with Neural Radiance Fields, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, (2023), 20669–20679. |
[181] | O. Avrahami, D. Lischinski, O. Fried, Blended diffusion for text-driven editing of natural images, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 18208–18218. |
[182] | A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, et al., GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, in Proceedings of the 39th International Conference on Machine Learning, PMLR, (2022), 16784–16804. |
[183] | G. Couairon, J. Verbeek, H. Schwenk, M. Cord, DiffEdit: Diffusion-based semantic image editing with mask guidance, in the Eleventh International Conference on Learning Representations, 2023. |
[184] | E. Sella, G. Fiebelman, P. Hedman, H. Averbuch-Elor, Vox-E: Text-guided voxel editing of 3D objects, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 430–440. |
[185] | A. Haque, M. Tancik, A. A. Efros, A. Holynski, A. Kanazawa, Instruct-NeRF2NeRF: Editing 3D scenes with instructions, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 19740–19750. |
[186] | Y. Lin, H. Bai, S. Li, H. Lu, X. Lin, H. Xiong, et al., CompoNeRF: Text-guided multi-object compositional NeRF with editable 3D scene layout, preprint, arXiv: 2303.13843. |
[187] | R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, D. Duckworth, NeRF in the wild: Neural Radiance Fields for unconstrained photo collections, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2021), 7206–7215. https://doi.org/10.1109/CVPR46437.2021.00713 |
[188] | M. Boss, A. Engelhardt, A. Kar, Y. Li, D. Sun, J. T. Barron, et al., SAMURAI: Shape and material from unconstrained real-world arbitrary image collections, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 35 (2022), 26389–26403. |
[189] | C. Choi, J. Kim, Y. M. Kim, IBL-NeRF: Image-based lighting formulation of Neural Radiance Fields, preprint, arXiv: 2210.08202. |
[190] | Z. Yan, C. Li, G. H. Lee, NeRF-DS: Neural Radiance Fields for dynamic specular objects, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 8285–8295. |
[191] | D. Guo, L. Zhu, S. Ling, T. Li, G. Zhang, Q. Yang, et al., Face illumination normalization based on Generative Adversarial Network, Nat. Comput., 22 (2022), 105–117. https://doi.org/10.1007/s11047-022-09892-4 |
[192] | Z. Cui, L. Gu, X. Sun, Y. Qiao, T. Harada, Aleth-NeRF: Low-light condition view synthesis with concealing fields, preprint, arXiv: 2303.05807. |
[193] | A. R. Nandhini, V. P. D. Raj, Low-light image enhancement based on generative adversarial network, Front. Genet., 12 (2021), 799777. https://doi.org/10.3389/fgene.2021.799777 |
[194] | W. Kim, R. Lee, M. Park, S. Lee, Low-light image enhancement based on maximal diffusion values, IEEE Access, 7 (2019), 129150–129163. https://doi.org/10.1109/ACCESS.2019.2940452 |
[195] | P. Ponglertnapakorn, N. Tritrong, S. Suwajanakorn, DiFaReli: Diffusion face relighting, preprint, arXiv: 2304.09479. |
[196] | M. Guo, A. Fathi, J. Wu, T. Funkhouser, Object-centric neural scene rendering, preprint, arXiv: 2012.08503. |
[197] | Y. Wang, W. Zhou, Z. Lu, H. Li, UDoc-GAN: Unpaired document illumination correction with background light prior, in Proceedings of the 30th ACM International Conference on Multimedia, ACM, (2022), 5074–5082. https://doi.org/10.1145/3503161.3547916 |
[198] | J. Ling, Z. Wang, F. Xu, ShadowNeuS: Neural SDF reconstruction by shadow ray supervision, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2023), 175–185. |
[199] | V. Rudnev, M. Elgharib, W. Smith, L. Liu, V. Golyanik, C. Theobalt, NeRF for outdoor scene relighting, in Computer Vision–ECCV 2022, Springer, (2022), 615–631. https://doi.org/10.1007/978-3-031-19787-1_35 |
[200] | C. Higuera, B. Boots, M. Mukadam, Learning to read braille: Bridging the tactile reality gap with diffusion models, preprint, arXiv: 2304.01182. |
[201] | T. Guo, D. Kang, L. Bao, Y. He, S. Zhang, NeRFReN: Neural Radiance Fields with reflections, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2022), 18409–18418. |
[202] | C. LeGendre, W. Ma, G. Fyffe, J. Flynn, L. Charbonnel, J. Busch, et al., DeepLight: Learning illumination for unconstrained mobile mixed reality, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2019), 5911–5921. https://doi.org/10.1109/CVPR.2019.00607 |
[203] | W. Ye, S. Chen, C. Bao, H. Bao, M. Pollefeys, Z. Cui, et al., IntrinsicNeRF: Learning intrinsic Neural Radiance Fields for editable novel view synthesis, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, (2023), 339–351. |
[204] | M. Boss, V. Jampani, R. Braun, C. Liu, J. T. Barron, H. P. A. Lensch, Neural-PIL: Neural pre-integrated lighting for reflectance decomposition, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 34 (2021), 10691–10704. |
[205] | S. Saito, T. Simon, J. Saragih, H. Joo, PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2020), 81–90. https://doi.org/10.1109/CVPR42600.2020.00016 |
[206] | H. Tang, S. Bai, L. Zhang, P. H. Torr, N. Sebe, XingGAN for person image generation, in Computer Vision–ECCV 2020, Springer, (2020), 717–734. https://doi.org/10.1007/978-3-030-58595-2_43 |
[207] | Y. Ren, X. Yu, J. Chen, T. H. Li, G. Li, Deep image spatial transformation for person image generation, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, (2020), 7687–7696. https://doi.org/10.1109/CVPR42600.2020.00771 |
[208] | Y. Liu, Z. Qin, T. Wan, Z. Luo, Auto-painter: Cartoon image generation from sketch by using conditional Wasserstein generative adversarial networks, Neurocomputing, 311 (2018), 78–87. https://doi.org/10.1016/j.neucom.2018.05.045 |
[209] | H. Li, AI synthesis for the metaverse: From avatars to 3D scenes, Stanford University, Stanford Talks, 2022. Available from: https://talks.stanford.edu/hao-li-pinscreen-on-ai-synthesis-for-the-metaverse-from-avatars-to-3d-scenes/. |
[210] | S. Murray, A. Tallon, Mapping Gothic France, Columbia University, Media Center for Art History, 2023. Available from: https://mcid.mcah.columbia.edu/art-atlas/mapping-gothic. |
[211] | Y. Xiang, C. Lv, Q. Liu, X. Yang, B. Liu, M. Ju, A creative industry image generation dataset based on captions, preprint, arXiv: 2211.09035. |
[212] | C. Tatsch, J. A. Bredu, D. Covell, I. B. Tulu, Y. Gu, Rhino: An autonomous robot for mapping underground mine environments, in 2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), IEEE, (2023), 1166–1173. https://doi.org/10.1109/AIM46323.2023.10196202 |
[213] | Y. Tian, L. Li, A. Fumagalli, Y. Tadesse, B. Prabhakaran, Haptic-enabled mixed reality system for mixed-initiative remote robot control, preprint, arXiv: 2102.03521. |
[214] | G. Pu, Y. Men, Y. Mao, Y. Jiang, W. Ma, Z. Lian, Controllable image synthesis with attribute-decomposed GAN, IEEE Trans. Pattern Anal. Mach. Intell., 45 (2023), 1514–1532. https://doi.org/10.1109/TPAMI.2022.3161985 |
[215] | X. Wu, Y. Zhang, Q. Li, Y. Qi, J. Wang, Y. Guo, Face aging with pixel-level alignment GAN, Appl. Intell., 52 (2022), 14665–14678. https://doi.org/10.1007/s10489-022-03541-0 |
[216] | D. Sero, A. Zaidi, J. Li, J. D. White, T. B. G. Zarzar, M. L. Marazita, et al., Facial recognition from DNA using face-to-DNA classifiers, Nat. Commun., 10 (2019), 1. https://doi.org/10.1038/s41467-018-07882-8 |
[217] | M. Nicolae, M. Sinn, M. Tran, B. Buesser, A. Rawat, M. Wistuba, et al., Adversarial robustness toolbox v1.0.0, 2018. Available from: https://github.com/Trusted-AI/adversarial-robustness-toolbox. |
| Type | Data sets | Section used |
| --- | --- | --- |
| viewpoint | ABO [31], Clevr3D [37], ScanNet [38], RealEstate10K [39] | Section 3 |
| point cloud | ShapeNet [40], KITTI [41], nuScenes [42], Matterport3D [43] | Section 4 |
| depth map | Middlebury Stereo [44,45,46,47,48], NYU Depth [49], KITTI [41] | Section 4 |
| illumination | Multi-PIE [35], Relightables [50] | Section 5 |
| Feature | Method | Publication | Image resolution | Data set |
| --- | --- | --- | --- | --- |
| No camera pose | NeRF-- [89] | arXiv 2022 | 756×1008 / 1080×1920 / 520×780 | [90] / [39] / [89] |
| | GNeRF [91] | ICCV 2021 | 400×400 / 500×400 | [24] / [92] |
| | SCNeRF [93] | ICCV 2021 | 756×1008 / 648×484 | [90] / [94] |
| | NoPe-NeRF [95] | CVPR 2023 | 960×540 / 648×484 | [94] / [38] |
| | SPARF [96] | CVPR 2023 | - | [92] / [90] / [97] |
| Sparse data | NeRS [98] | NeurIPS 2021 | 600×450 | [98] |
| | MixNeRF [99] | CVPR 2023 | - | [92] / [90] / [24] |
| | SceneRF [100] | ICCV 2023 | 1220×370 | [101] |
| | GM-NeRF [102] | CVPR 2023 | 224×224 | [103] / [104] / [105] / [106] |
| | SPARF [96] | CVPR 2023 | 960×540 / 648×484 | [94] / [38] |
| Noisy data | RawNeRF [107] | CVPR 2022 | - | [107] |
| | Deblur-NeRF [108] | CVPR 2022 | 512×512 | [108] |
| | HDR-NeRF [109] | CVPR 2022 | 400×400 / 804×534 | [109] |
| | NAN [110] | CVPR 2022 | - | [110] |
| Large-scale image synthesis | Mip-NeRF 360 [111] | CVPR 2022 | 960×540 | [94] |
| | BungeeNeRF [112] | ECCV 2022 | - | [113] |
| | Block-NeRF [114] | CVPR 2022 | - | [114] |
| | GridNeRF [115] | CVPR 2023 | 2048×2048 / 4096×4096 | [116] / [115] |
| | EgoNeRF [117] | CVPR 2023 | 600×600 | [117] |
| Image synthesis speed | PlenOctrees [118] | ICCV 2021 | 800×800 / 1920×1080 | [24] / [94] |
| | DirectVoxGO [119] | CVPR 2022 | 800×800 / 800×800 / 768×576 / 1920×1080 / 512×512 | [24] / [120] / [121] / [94] / [122] |
| | R2L [123] | ECCV 2022 | 800×800 | [24] / [124] |
| | SqueezeNeRF [125] | CVPR 2022 | - | [24] / [90] |
| | MobileNeRF [126] | CVPR 2023 | 800×800 / 756×1008 / 1256×828 | [24] / [90] / [111] |
| | L2G-NeRF [127] | CVPR 2023 | 756×1008 | [90] |
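The methods compared in the table above all build on the same NeRF-style volume-rendering core; their differences concern how camera poses, sparse or noisy inputs, scene scale and rendering speed are handled around that core. As a reminder of the shared formulation, the following NumPy sketch implements the discretized rendering equation for a single ray. It is illustrative only: the function and variable names are ours and are not taken from any of the cited implementations.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite one ray with the discretized NeRF volume-rendering equation.

    sigmas: (N,) volume densities at the N samples along the ray
    colors: (N, 3) RGB values predicted at the same samples
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-segment opacity
    trans = np.cumprod(1.0 - alphas + 1e-10)         # accumulated transparency
    trans = np.concatenate(([1.0], trans[:-1]))      # T_i = prod_{j<i} (1 - alpha_j)
    weights = trans * alphas                         # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)    # composited pixel color
    return rgb, weights

# Toy usage with random field values; a real method would instead query a
# learned radiance field at points sampled along each camera ray.
rng = np.random.default_rng(0)
sigmas = rng.uniform(0.0, 2.0, size=64)
colors = rng.uniform(0.0, 1.0, size=(64, 3))
deltas = np.full(64, 0.05)
rgb, _ = render_ray(sigmas, colors, deltas)
print(rgb)  # approximate pixel color in [0, 1]
```

The table entries then differ mainly in what feeds this compositing step, e.g., jointly optimized camera parameters for the pose-free methods or cached/voxelized field values for the speed-oriented ones.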