In this project I benchmarked a Deep Learning approach to mixed-illuminant white balance against analytical methods based on images from a polarization camera. In a five-person team of media technology students at TH Köln, I was responsible for deploying a pre-trained Deep Learning model into our custom camera preprocessing pipeline, which was designed specifically for our camera hardware.
This project showcases my abilities to
- work with a custom camera pipeline in Python,
- deploy a pre-trained Deep Learning model to a custom infrastructure,
- evaluate the performance of a Deep Learning model against analytical methods,
- work with image data and preprocess it efficiently for Deep Learning applications,
- think analytically about the problem and to design a custom evaluation data set,
- understand the complex math behind technical innovations and their contribution to domain-specific problems,
- work in a team and to communicate results effectively.
The Deep Learning part
For the white balance predictions, we used the Deep Neural Network from Mahmoud Afifi. I wrote a custom class that holds this model and can be used in our node-based image pipeline. This pipeline is written in Python and uses OpenCV for image processing. It preprocesses the raw images from the camera, applies a color profile, and applies the white balance correction to the images.
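To illustrate what node-based means here, a minimal sketch of the pattern (class and method names are illustrative, not our actual pipeline API):

```python
class Node:
    """One processing step in the pipeline, e.g. debayering, color profile, white balance."""

    def process(self, image):
        raise NotImplementedError


class Pipeline:
    """Chains nodes and runs an image through them in order."""

    def __init__(self, nodes):
        self.nodes = nodes

    def run(self, image):
        for node in self.nodes:
            image = node.process(image)
        return image
```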
The DNN has learned to generate mappings for 5 predefined white balance settings that are commonly used in photography. That makes it possible to use the net with a modified camera: the camera renders every image with the 5 predefined white balance settings, no matter what the scene actually demands, and the network then creates mappings to correct the white balance in post-processing. When learning about this architecture and how the authors trained it, I was really struck by how they shaped the loss function to achieve visually pleasing results. The overall loss function is defined as \(\mathcal{L}=\mathcal{L}_r + \lambda \mathcal{L}_s\). \(\mathcal{L}_r\) is the following, relatively simple, reconstruction loss: \[ \mathcal{L}_r=||P_{corr}-\sum_i \hat{W}_i \odot P_{c_i} ||_F^2 \]
In this loss function, \(||\cdot||_F^2\) computes the squared Frobenius norm and \(\odot\) is the Hadamard product. \(P_{corr}\) and \(P_{c_i}\) are training patches extracted from the ground truth sRGB images and from the input sRGB images rendered with the \(C_i\) WB setting, respectively. \(\hat{W}_i\) is the \(i\)-th blending weighting map generated by the network for \(P_{c_i}\).
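To make the reconstruction loss concrete, here is a minimal NumPy sketch of it (my own illustration, not the authors' training code); the array names follow the symbols above:

```python
import numpy as np

def reconstruction_loss(p_corr, p_c, w_hat):
    """Squared Frobenius norm between the ground-truth patch and the blend
    of the patches rendered with the predefined WB settings.

    p_corr: (H, W, 3) ground-truth sRGB patch
    p_c:    (N, H, W, 3) patches rendered with the N predefined WB settings
    w_hat:  (N, H, W) blending weighting maps predicted by the network
    """
    # Hadamard product of each weighting map with its WB-rendered patch,
    # summed over the N settings (weights broadcast over the color channels)
    blended = np.sum(w_hat[..., None] * p_c, axis=0)
    # squared Frobenius norm of the residual
    return np.sum((p_corr - blended) ** 2)
```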
To further improve the results, Afifi et al. take two additional steps. First, a cross-channel softmax operator is applied before the loss calculation in order to avoid out-of-gamut colors: the exponential function is applied to each element and the output is then normalized by dividing by the sum of all resulting values. Second, a regularization term is added to the loss function. With it, GridNet is trained to produce rather smooth weighting maps instead of perfectly accurate ones, which is presumably motivated by generalization as well as the researchers' visual observations. The regularization is applied with \[ \mathcal{L}_s = \sum_i||\hat{W}_i\ast \nabla_x||_F^2 + || \hat{W}_i \ast \nabla_y ||_F^2 \]
where \(\nabla_x\) and \(\nabla_y\) are \(3\times3\) horizontal and vertical (edge-detecting) Sobel filters and \(\ast\) is the convolution operator. When all these terms are put together in \(\mathcal{L}=\mathcal{L}_r + \lambda \mathcal{L}_s\), the contribution of the regularization/edge-preserving term is controlled by the hyperparameter \(\lambda\). I find it really fascinating how the authors applied image filtering techniques to the weighting maps to achieve better generalization and visually improved outputs of the network.
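A small sketch of both steps, the cross-channel softmax and the smoothness term, again in NumPy with OpenCV for the filtering (an illustration under my own naming, not the paper's code):

```python
import cv2
import numpy as np

def cross_channel_softmax(weights):
    """Normalize the N weighting maps per pixel so they are positive and sum to 1."""
    e = np.exp(weights - weights.max(axis=0, keepdims=True))  # max subtraction only for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def smoothness_loss(w_hat):
    """L_s: penalize strong horizontal and vertical gradients in each weighting map."""
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    sobel_y = sobel_x.T
    loss = 0.0
    for w in w_hat:  # w_hat: (N, H, W)
        gx = cv2.filter2D(w.astype(np.float32), -1, sobel_x)
        gy = cv2.filter2D(w.astype(np.float32), -1, sobel_y)
        loss += np.sum(gx ** 2) + np.sum(gy ** 2)
    return loss
```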
Benchmarking experiments
Now, to compare this Deep Learning approach with the analytical methods, we produced a unique evaluation data set that is deliberately hard to white balance. We photographed scenes with extreme mixed-illuminant conditions, in which two light sources of opposing color temperatures lit the scene and were sometimes even visible in the image. We then compared the results of the Deep Learning approach with those of the analytical methods based on the polarization camera images by employing error metrics.
Let’s see how the DNN works in one example of our benchmarking data set:
We can also inspect what the DNN did under the hood by plotting the weighting maps and the Hadamard products of the weighting maps with the pre-rendered WB images:
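A hypothetical matplotlib snippet for this kind of inspection, assuming the pre-rendered WB images and the weighting maps are available as NumPy arrays (names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_weighting_maps(wb_images, weight_maps, setting_names):
    """Show each predicted weighting map next to its Hadamard product
    with the corresponding pre-rendered WB image.

    wb_images:   (N, H, W, 3) images rendered with the predefined WB settings, in [0, 1]
    weight_maps: (N, H, W) blending weights predicted by the network
    """
    n = len(setting_names)
    fig, axes = plt.subplots(2, n, figsize=(3 * n, 6))
    for i, name in enumerate(setting_names):
        axes[0, i].imshow(weight_maps[i], cmap="viridis")
        axes[0, i].set_title(f"weights: {name}")
        axes[1, i].imshow(np.clip(weight_maps[i, ..., None] * wb_images[i], 0, 1))
        axes[1, i].set_title(f"weighted: {name}")
    for ax in axes.flat:
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```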
To get the full picture of how well the approach works on our custom evaluation data set, we calculated the \(\Delta E_{00}\) error metric for the corrected images against the ground truth images. The \(\Delta E_{00}\) metric is a color difference metric that is widely used in the industry to evaluate the color difference between two images. The lower the \(\Delta E_{00}\) value, the better the color match between the two images. The Afifi approach is the one on the left; all other approaches are not deep learning based.
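A minimal sketch of how such a \(\Delta E_{00}\) comparison can be computed with scikit-image; this is an illustration of the metric, not our exact evaluation script:

```python
import numpy as np
from skimage import color

def mean_delta_e00(corrected_srgb, ground_truth_srgb):
    """Mean CIEDE2000 color difference between two sRGB images with values in [0, 1]."""
    lab_corrected = color.rgb2lab(corrected_srgb)
    lab_ground_truth = color.rgb2lab(ground_truth_srgb)
    return float(np.mean(color.deltaE_ciede2000(lab_ground_truth, lab_corrected)))
```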
Code
Here is the code I wrote for the custom class AfifiRenderer. The first part defines two custom dictionary classes that prevent us from using the model in the wrong way.
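A simplified sketch of this idea: dictionaries that accept only a fixed set of keys, so the model can never be fed with WB settings it was not trained for (class names and keys below are illustrative, not the actual implementation):

```python
class RestrictedDict(dict):
    """Dictionary that only accepts a predefined set of keys."""

    allowed_keys = ()

    def __setitem__(self, key, value):
        if key not in self.allowed_keys:
            raise KeyError(f"'{key}' is not one of the allowed keys {self.allowed_keys}")
        super().__setitem__(key, value)


class WBSettingDict(RestrictedDict):
    """Holds the images rendered with the predefined WB settings."""

    # illustrative keys, e.g. daylight, shade, tungsten, fluorescent, cloudy
    allowed_keys = ("D", "S", "T", "F", "C")
```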
The second part defines the AfifiRenderer class, which holds the model and the camera pipeline. The render method applies the model to the images and the save method writes the results to disk.
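A condensed, hypothetical sketch of that structure; the actual class depends on our pipeline's node interface and on how the pre-trained model is loaded:

```python
import cv2
import numpy as np
import torch

class AfifiRenderer:
    """Wraps the pre-trained mixed-illuminant WB model for use in the node-based pipeline."""

    def __init__(self, model, device="cpu"):
        self.model = model.to(device).eval()
        self.device = device
        self.result = None

    def render(self, wb_images):
        """Blend the pre-rendered WB images with the weighting maps predicted by the model.

        wb_images: dict mapping a WB setting name to an (H, W, 3) float image in [0, 1]
        """
        imgs = torch.stack(
            [torch.from_numpy(img).permute(2, 0, 1).float() for img in wb_images.values()]
        ).to(self.device)                                     # (N, 3, H, W)
        net_input = imgs.reshape(1, -1, *imgs.shape[-2:])     # (1, N*3, H, W)
        with torch.no_grad():
            weights = self.model(net_input).squeeze(0)        # assumed output: (N, H, W) weighting maps
        self.result = (weights.unsqueeze(1) * imgs).sum(dim=0)  # (3, H, W) blended result
        return self.result

    def save(self, path):
        """Write the blended result to disk as an 8-bit image."""
        img = self.result.permute(1, 2, 0).cpu().numpy()
        img_bgr = cv2.cvtColor((np.clip(img, 0, 1) * 255).astype(np.uint8), cv2.COLOR_RGB2BGR)
        cv2.imwrite(path, img_bgr)
```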
Here is an example of how this would be used:
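A hypothetical usage along the lines of the sketch above (model loading, setting names, and file paths are placeholders):

```python
import cv2
import torch

# placeholder: load the pre-trained model however your setup provides it
model = torch.load("afifi_mixedill_wb.pth", map_location="cpu")

# pre-rendered images of the same scene with the predefined WB settings (placeholder paths)
wb_images = {
    name: cv2.cvtColor(cv2.imread(f"scene_{name}.png"), cv2.COLOR_BGR2RGB) / 255.0
    for name in ("D", "S", "T", "F", "C")
}

renderer = AfifiRenderer(model)
renderer.render(wb_images)
renderer.save("scene_corrected.png")
```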