Below is a list of ideas that I find interesting. Some of them may be suitable for academic course projects, while others may be insufficient on their own; you can build on such ideas. If you have any queries regarding these ideas, please don't hesitate to contact me. If you do try any of them, kindly update me, and I'll post your findings here (with due credit, of course). If there is already existing work on any of the ideas listed below, kindly let me know and I'll add a link to the relevant content.
SCADE [2] was proposed to train NeRF [1] with few input views. Specifically, a mono-depth model is used to obtain a depth prior. Instead of producing a single depth estimate, SCADE's mono-depth model takes a latent vector that conditions the uncertainty in the prediction. By providing multiple latent vectors sampled from a unit Gaussian, multiple depth priors are obtained per pixel, thereby yielding a distribution over depth. The depth distribution from NeRF is obtained using the coarse NeRF, and samples are drawn from it by inverse transform sampling. A space carving loss is imposed between the samples from the prior and NeRF depth distributions.
Is it necessary to obtain the depth prior from a deep neural network? Deep neural networks are known to suffer from generalization issues. Instead, can we obtain the depth prior from plane-sweep volumes? Specifically, based on the matching errors/variance at every depth plane, apply inverse transform sampling to obtain depth prior samples. How well would this work across different datasets?
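A minimal NumPy sketch of the proposed sampling step, assuming a plane-sweep module that outputs a per-pixel matching cost for each of D depth planes (the name `psv_cost`, its shape, and the softmax conversion are my assumptions, not SCADE's API):

```python
import numpy as np

def sample_depth_prior(psv_cost, depth_planes, num_samples, temperature=1.0):
    """Draw per-pixel depth-prior samples by inverse transform sampling.

    psv_cost:     (H, W, D) matching error/variance at each depth plane
                  (assumed output of a plane-sweep stereo module).
    depth_planes: (D,) depths of the sweep planes, sorted ascending.
    Returns:      (H, W, num_samples) depth samples per pixel.
    """
    depth_planes = np.asarray(depth_planes)
    # Low matching cost => high likelihood of that depth; softmax over planes.
    logits = -psv_cost / temperature
    prob = np.exp(logits - logits.max(axis=-1, keepdims=True))
    prob /= prob.sum(axis=-1, keepdims=True)

    # Inverse transform sampling: evaluate the inverse of the per-pixel CDF
    # at uniform random values.
    cdf = np.cumsum(prob, axis=-1)                          # (H, W, D)
    u = np.random.rand(*psv_cost.shape[:2], num_samples)    # (H, W, S)
    # Index of the first plane whose CDF reaches u, per pixel and sample.
    idx = (u[..., None, :] > cdf[..., :, None]).sum(axis=-2)
    idx = np.clip(idx, 0, len(depth_planes) - 1)
    return depth_planes[idx]
```

The resulting samples could then be plugged into the same space carving loss in place of SCADE's network-predicted depth hypotheses.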
[1] Ben Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020.
[2] Mikaela Angelina Uy et al. "SCADE: NeRFs from Space Carving with Ambiguity-Aware Depth Estimates", CVPR 2023.

Mip-NeRF [2] was proposed to address the aliasing limitations of NeRF [1]. Specifically, if the same scene region/object is captured at different scales, NeRF breaks down and introduces blur or aliasing artifacts. Mip-NeRF addresses this by shooting a cone into the scene instead of a single ray and sampling conical frustums instead of individual 3D points. An ideal solution to obtain the density/color of the conical frustum would involve sampling multiple points in the frustum and averaging the outputs. To keep the model computationally efficient, Mip-NeRF averages the input (positionally encoded 3D points) instead of the output, thereby querying the MLP only once per frustum. While the idea is reasonable, the end result of integrated positional encoding is that the encoded features of a point vary depending on its distance to the camera. So, what would happen if one simply provided the distance to the camera as another input to the NeRF and used the original positional encoding instead of the integrated positional encoding? Would this perform as well as Mip-NeRF?
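A minimal PyTorch sketch of the proposed input assembly, replacing integrated positional encoding with vanilla positional encoding plus the camera distance (the function names are illustrative, and encoding the distance with a small number of frequencies is my assumption):

```python
import torch

def positional_encoding(x, num_freqs):
    """Vanilla NeRF positional encoding, applied elementwise to x."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
    angles = (x[..., None] * freqs).flatten(start_dim=-2)   # (..., dim * L)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def build_mlp_input(points, ray_origins, num_freqs=10, dist_freqs=4):
    """Encode sample points plus their distance to the camera, instead of
    Mip-NeRF's integrated positional encoding."""
    dist = torch.linalg.norm(points - ray_origins, dim=-1, keepdim=True)
    return torch.cat([
        positional_encoding(points, num_freqs),  # standard PE of the point
        positional_encoding(dist, dist_freqs),   # PE of camera distance
    ], dim=-1)
```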
[1] Ben Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020.
[2] Jonathan T. Barron et al. "Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields", ICCV 2021.

Mip-NeRF [2] was proposed to address the aliasing limitations of NeRF [1]. Specifically, if the same scene region/object is captured at different scales, NeRF breaks down and introduces blur or aliasing artifacts. Mip-NeRF addresses this by providing the expected positional encoding of all the 3D points in a conical frustum along the cone as input to the MLP, instead of the positional encoding of the centroid of the cone as done in NeRF. For evaluation, the train and test sets of the original NeRF dataset are downsampled to three different scales. On this multi-scale dataset, Mip-NeRF performs significantly better than NeRF.
Thinking from fundamental graphics principles, antialiasing should be implemented by integrating the sigma/radiance (the output of the MLP) over the conical frustum. Integrating the inputs may not have the desired antialiasing effect. That is, the input to the MLP differs across scales, and the MLP can exploit this to produce outputs that merely look antialiased at the training scales. Thus, if the scale at test time/novel views changes to a value unseen during training, Mip-NeRF could potentially output junk. However, it is not clear if this will actually happen: the MLP could instead learn to genuinely integrate the sigma/radiance when the input dimensions corresponding to higher frequencies are zeroed out. So, it would be interesting to run this experiment: train Mip-NeRF on the original (single-scale) Blender dataset [1] and test it on the multi-scale Blender dataset [2].
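To make the concern concrete, here is a small NumPy sketch of Mip-NeRF-style integrated positional encoding (the diagonal-Gaussian form). It shows that coarser scales damp the high-frequency input channels toward zero, so an unseen test-time scale mostly changes which channels are active rather than producing arbitrary inputs, which is why the MLP might still generalize:

```python
import numpy as np

def integrated_pos_enc(mu, var, num_freqs):
    """Mip-NeRF-style integrated positional encoding (diagonal covariance).

    mu:  (..., 3) mean of the conical-frustum Gaussian.
    var: (..., 3) per-axis variance; grows with frustum size / pixel footprint.
    """
    freqs = 2.0 ** np.arange(num_freqs)                    # (L,)
    scaled_mu = mu[..., None, :] * freqs[:, None]          # (..., L, 3)
    scaled_var = var[..., None, :] * freqs[:, None] ** 2   # (..., L, 3)
    # Expected sin/cos under the Gaussian: frequency f is damped by
    # exp(-f^2 * var / 2), so coarse scales zero out high-frequency channels.
    damp = np.exp(-0.5 * scaled_var)
    return np.concatenate([damp * np.sin(scaled_mu),
                           damp * np.cos(scaled_mu)],
                          axis=-1).reshape(*mu.shape[:-1], -1)

# Train-scale frustum (small footprint) vs. an unseen coarser test scale:
mu = np.zeros(3) + 0.3
print(integrated_pos_enc(mu, np.full(3, 1e-4), 6)[-6:])  # high freqs alive
print(integrated_pos_enc(mu, np.full(3, 1e-1), 6)[-6:])  # high freqs ~ 0
```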
[1] Ben Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020.
[2] Jonathan T. Barron et al. "Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields", ICCV 2021.

Vanilla NeRF [1] needs to be trained per scene, a limitation overcome by pixelNeRF [2], a generalized NeRF model. The core idea is to employ a feature extractor on the input images and condition the NeRF MLP on the feature at the queried 3D point, in addition to the 3D point location and viewing direction. The feature at the queried 3D point is obtained by projecting the point onto the image plane and bilinearly interpolating the feature map at the projected location. As a result, all the 3D points along the ray corresponding to a given pixel share the same feature. How, then, can the NeRF MLP distinguish these 3D points? (See "What is the role of MLP in generalizable NeRF models?" to understand why this matters.) Is it that the feature encoder embeds mono-depth information into the feature vector, and the MLP checks whether the depth of the 3D point matches the depth embedded in the feature vector to determine the volume density?
One way to validate this is to add a few more conv layers on top of the pixelNeRF encoder and train only the additional layers to predict depth from the features. Note that the pixelNeRF feature encoder should be frozen during this training. True depth, or depth estimated by a state-of-the-art mono-depth model, can be used to supervise the training of the additional layers. If this model is able to estimate depth with high accuracy, it indicates that the pixelNeRF feature encoder is also acting as a mono-depth estimator and embedding the depth information in the feature vectors.
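A sketch of such a probe in PyTorch, assuming the pixelNeRF encoder exposes a standard image-to-feature-map interface (the channel counts and head architecture are arbitrary choices):

```python
import torch
import torch.nn as nn

class DepthProbe(nn.Module):
    """Conv layers trained to regress depth from frozen pixelNeRF features,
    to test whether the encoder embeds mono-depth information."""

    def __init__(self, encoder, feat_channels=512):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the pixelNeRF encoder
            p.requires_grad_(False)
        self.head = nn.Sequential(            # only these layers are trained
            nn.Conv2d(feat_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),              # per-pixel depth (or disparity)
        )

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)      # (B, C, H', W') feature map
        return self.head(feats)

# Training-step sketch, supervised by true depth or a SotA mono-depth model:
# probe = DepthProbe(pixelnerf_encoder)       # hypothetical encoder handle
# loss = nn.functional.l1_loss(probe(images), target_depth)
```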
[1] Ben Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020.
[2] Alex Yu et al. "pixelNeRF: Neural Radiance Fields from One or Few Images", CVPR 2021.

Vanilla NeRF [1] needs to be trained per scene, a limitation overcome by generalized NeRF models such as pixelNeRF [2], MVS-NeRF [3], and so on. The core idea is to employ a feature extractor on the input images and condition the NeRF MLP on the feature at the queried 3D point, in addition to the 3D point location and viewing direction. In the generalized setup without any scene-specific fine-tuning, this raises the question of whether the 3D point location contributes any information. One way to test this is to remove the 3D location from the inputs, feed only the feature at the 3D point and the viewing direction to the NeRF MLP, and analyze the performance. If there is no significant drop in generalized (unseen-scene) performance, it can be concluded that the MLP is simply acting as a decoder for the 3D features. How does the performance vary with scene-specific fine-tuning?
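A sketch of the ablation in PyTorch: a conditioned NeRF-style MLP with a flag that drops the encoded 3D location from the input (all dimensions are placeholders, not pixelNeRF's or MVS-NeRF's actual ones):

```python
import torch
import torch.nn as nn

class ConditionedNeRFMLP(nn.Module):
    """Generalizable-NeRF-style MLP with a switch to ablate the 3D location."""

    def __init__(self, feat_dim=512, pos_dim=63, dir_dim=27, use_xyz=True):
        super().__init__()
        self.use_xyz = use_xyz
        in_dim = feat_dim + dir_dim + (pos_dim if use_xyz else 0)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 4),                # (sigma, r, g, b)
        )

    def forward(self, feat, encoded_xyz, encoded_dir):
        inputs = [feat, encoded_dir]
        if self.use_xyz:                      # drop this branch to ablate
            inputs.append(encoded_xyz)
        return self.mlp(torch.cat(inputs, dim=-1))
```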
[1] Ben Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020.
[2] Alex Yu et al. "pixelNeRF: Neural Radiance Fields from One or Few Images", CVPR 2021.
[3] Anpei Chen et al. "MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo", ICCV 2021.

Although deep networks have become extremely powerful at object recognition, they perform poorly in the presence of adversarial attacks. In [1], the authors propose a method to train deep networks to be robust to adversarial attacks by enforcing BPFC regularization. In [2], the authors show that deep networks trained on the ImageNet-1k database are biased towards local texture, and hence their accuracy drops when tested on edge maps (where no local texture is present).
Since adversarial attacks/examples mainly modify local properties of images, it would be interesting to see whether training a deep network to be robust to adversarial attacks leads to a lower bias towards local texture. One way to verify this is to test the performance of adversarially robust deep networks on edge maps of images from the ImageNet-1k database or a similar database.
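A sketch of the evaluation, using OpenCV's Canny detector as the edge-map generator (the thresholds and the choice of Canny are my assumptions; any reasonable edge extractor could be substituted):

```python
import cv2
import numpy as np
import torch

def to_edge_map(image_bgr):
    """Convert an image to a 3-channel edge map so it can be fed to a
    standard ImageNet classifier unchanged."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)         # thresholds chosen arbitrarily
    return np.repeat(edges[..., None], 3, axis=-1)

@torch.no_grad()
def edge_map_accuracy(model, loader, device="cuda"):
    """Top-1 accuracy of a (robustly trained) classifier; `loader` is assumed
    to yield already edge-mapped, normalized image tensors and labels."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```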
[1] Sravanti Addepalli et al. "Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes", CVPR 2020.
[2] Robert Geirhos et al. "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness", ICLR 2019.

In [1], the authors show that deep networks trained on the ImageNet-1k database are biased towards local texture rather than global shape. One of the experiments conducted in the paper obtains ResNet-50 [2] predictions on the edge map of an image, and it is shown that the accuracy of ResNet-50 drops on edge maps.
It would be interesting to see if ResNet features (the features tapped just before the global pooling operation) contain information about global shape. One way to check this is to freeze the weights of all earlier layers of ResNet-50 and train only the final softmax layer on edge maps of images from the ImageNet-1k database, and then test the new model on edge maps of held-out images.
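A minimal PyTorch/torchvision sketch of the proposed probe: freeze the pretrained backbone, reinitialize only the final layer, and train it on edge maps:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Freeze the pretrained backbone; only the reinitialized final layer trains.
model = resnet50(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad_(False)
model.fc = nn.Linear(model.fc.in_features, 1000)   # new trainable classifier
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
# Train with cross-entropy on edge maps of ImageNet-1k images, then evaluate
# on edge maps of held-out images.
```

If the retrained head recovers substantial accuracy on edge maps, the pre-pooling features evidently retain global shape information.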
[1] Robert Geirhos et al. "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness", ICLR 2019.
[2] Kaiming He et al. "Deep Residual Learning for Image Recognition", CVPR 2016.

A video generative model is proposed in [1] that decomposes video into content and motion. The content latent vector remains the same across all frames, while the motion latent vector is generated by a recurrent network. A decoder then generates each frame from the content and motion latent vectors.
It would be interesting to check how well this decomposition works. One test is to change the content latent vector mid-generation: ideally, the motion should remain the same while the person (the content) changes.
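A sketch of the test, assuming access to the trained generator's two components (the `decoder` and `motion_rnn` handles and their call signatures are hypothetical, not MoCoGAN's actual code):

```python
import torch

@torch.no_grad()
def generate_with_content_swap(decoder, motion_rnn, z_content_a, z_content_b,
                               num_frames, swap_at):
    """Generate a MoCoGAN-style video, switching the content code mid-way.
    If content/motion are truly disentangled, identity should change at
    `swap_at` while the motion continues smoothly."""
    frames, h = [], None
    batch = z_content_a.shape[0]
    eps = torch.randn(num_frames, batch, motion_rnn.input_size)
    for t in range(num_frames):
        out, h = motion_rnn(eps[t:t + 1], h)  # next motion latent
        z_motion = out[0]
        z_content = z_content_a if t < swap_at else z_content_b
        frames.append(decoder(z_content, z_motion))
    return torch.stack(frames, dim=1)         # (B, T, C, H, W)
```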
[1] Sergey Tulyakov et al. "MoCoGAN: Decomposing Motion and Content for Video Generation", CVPR 2018.

Quantitatively evaluating GAN models is known to be challenging. Here, I propose a simple idea to evaluate them, based on how we would evaluate linear regression. Given a trained GAN model and an image from the test set, backpropagate gradients (keeping the generator weights fixed) to find the input that generates an image close to the test image. The error between the generated image and the test image, averaged over all images in the test set, can serve as a quantitative measure for evaluating GANs. Various error/similarity measures such as MSE, SSIM, VGG MSE, or VGG cosine similarity can be experimented with.
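A minimal PyTorch sketch of the proposed measure for one test image (the step count and learning rate are arbitrary; in practice one would restart from several random initializations and keep the best):

```python
import torch
import torch.nn.functional as F

def gan_inversion_error(generator, test_image, latent_dim, steps=500, lr=0.01):
    """Reconstruction error of a test image under a fixed, trained generator.
    MSE is used here; SSIM or VGG-feature distances are drop-in choices."""
    for p in generator.parameters():          # keep generator weights fixed
        p.requires_grad_(False)
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(z), test_image)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return F.mse_loss(generator(z), test_image).item()

# Average gan_inversion_error over the whole test set for the final score.
```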
Build a face detection and recognition system that can be applied to any new directory of images. Build a GUI such that whenever the model encounters an unknown face, it asks the user to enter the person's name. The face recognition model should be able to learn to classify new faces dynamically. The model builds and stores an index of the people appearing in the photos. Later, when a query for a particular person is issued, the model should retrieve all the images containing that person.
This idea has already been implemented in Google Photos. Nonetheless, we may not be willing to upload all our private photos to Google Photos. In that case, an offline model, trained specifically on the people in our friend circle, may be useful, even though it may not perform as well as Google Photos.
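A minimal sketch of the indexing core using the open-source `face_recognition` library (one possible choice; the GUI loop for naming unknown faces is omitted):

```python
import os
import face_recognition

def build_face_index(photo_dir, known, threshold=0.6):
    """Index which known people appear in each photo. `known` maps a name to
    a face encoding; 0.6 is the library's commonly used distance threshold."""
    index = {}                                 # name -> list of file paths
    for fname in os.listdir(photo_dir):
        path = os.path.join(photo_dir, fname)
        image = face_recognition.load_image_file(path)
        for enc in face_recognition.face_encodings(image):
            names = list(known)
            dists = face_recognition.face_distance(
                [known[n] for n in names], enc)
            if len(dists) and dists.min() < threshold:
                index.setdefault(names[int(dists.argmin())], []).append(path)
            # else: unknown face -> prompt the user via the GUI and add the
            # new (name, encoding) pair to `known`
    return index
```

Answering a query for a person is then just a lookup of `index[name]`.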