Mixed-illuminant white balance with GridNet

computer-vision
PyTorch
ML
Python
Author

Fabian Rosenthal

Published

February 1, 2023

In this project I worked on benchmarking a Deep Learning approach to mixed-illuminant white balance against analytical methods based on images from a polarization camera. In a five-person team of media technology students at TH Köln, I was responsible for deploying a pre-trained Deep Learning model within our custom camera preprocessing pipeline, which was designed specifically for our camera hardware.

This project showcases my ability to integrate a pre-trained Deep Learning model into a custom image processing pipeline and to benchmark it against analytical methods.

The Deep Learning part

For the white-balance predictions, we used the deep neural network by Afifi, Brubaker, and Brown (2022). I wrote a custom class that holds this model and can be used in our node-based image pipeline. The pipeline is written in Python, uses OpenCV for image processing, and is responsible for preprocessing the raw images from the camera, applying a color profile, and applying the white-balance correction to the images.

The DNN has learned to generate mappings for five predefined white-balance settings that are commonly used in photography. This makes it possible to use the network in a modified camera: the camera renders every image with the five predefined white-balance settings, no matter what the scene actually demands, and the network then creates mappings to correct the white balance in post-processing. When learning about this architecture and how the authors trained it, I was really baffled by how they designed the loss function to achieve visually pleasing results. The overall loss function is defined as \(\mathcal{L}=\mathcal{L}_r + \lambda \mathcal{L}_s\), where \(\mathcal{L}_r\) is the following, relatively simple, reconstruction loss: \[ \mathcal{L}_r=||P_{corr}-\sum_i \hat{W}_i \odot P_{c_i} ||_F^2 \]

In this loss function, \(||\cdot||_F^2\) computes the squared Frobenius norm, and \(\odot\) is the Hadamard product. \(P_{corr}\) and \(P_{c_i}\) are training patches extracted from the ground-truth sRGB images and from the input sRGB images rendered with the \(C_i\) WB setting, respectively. \(\hat{W}_i\) is the \(i\)-th blending weighting map generated by the network for \(P_{c_i}\).
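To make this concrete, here is a minimal PyTorch sketch of the reconstruction loss; the tensor names and shapes are my own assumptions, not the authors' implementation:

```python
import torch

def reconstruction_loss(p_corr, p_cs, w_hats):
    """Squared Frobenius norm between the ground-truth patch and the blend
    of the fixed-WB patches (illustrative sketch, not the authors' code).

    p_corr: ground-truth sRGB patch, shape (B, 3, H, W)
    p_cs:   patches rendered with the N predefined WB settings, shape (B, N, 3, H, W)
    w_hats: blending weighting maps from the network, shape (B, N, 1, H, W)
    """
    blended = (w_hats * p_cs).sum(dim=1)    # sum_i of W_i (Hadamard) P_{c_i}
    return ((p_corr - blended) ** 2).sum()  # squared Frobenius norm over all elements
```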

To further improve the results, Afifi et al. took additional steps. Firstly, a cross-channel softmax operator is applied before the loss calculation in order to avoid out-of-gamut colors: the exponential function is applied to each element and the outputs are then normalized by dividing by the sum of all exponentiated values. Secondly, a regularization term is introduced to the loss function. With it, GridNet is trained to produce rather smooth weighting maps as opposed to perfectly sharp ones, likely for better generalization as well as based on visual observations by the researchers. The regularization is applied with \[ \mathcal{L}_s = \sum_i||\hat{W}_i\ast \nabla _x||_F^2 + || \hat{W}_i \ast \nabla _y ||_F^2 \]

where \(\nabla_x\) and \(\nabla_y\) are \(3\times3\) horizontal and vertical (edge-detecting) Sobel filters and \(\ast\) is the convolution operator. When all terms are combined in \(\mathcal{L}=\mathcal{L}_r + \lambda \mathcal{L}_s\), the contribution of the regularization/edge-preserving term is controlled by the hyperparameter \(\lambda\). I find it really fascinating how the authors applied image filtering techniques to the weighting maps to achieve better generalization and visually improved outputs of the network.
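Building on the reconstruction_loss sketch above, the smoothness term and the combined loss could be written roughly like this (again an illustration with assumed tensor layouts; the value of \(\lambda\) is a placeholder, not the authors' setting):

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels as fixed, non-learnable convolution filters
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def smoothness_loss(w_hats):
    """Sum of squared Frobenius norms of the horizontal and vertical
    gradients of each weighting map (sketch). w_hats: (B, N, H, W)."""
    b, n, h, w = w_hats.shape
    maps = w_hats.reshape(b * n, 1, h, w)             # treat every map as a single-channel image
    gx = F.conv2d(maps, SOBEL_X.to(maps), padding=1)
    gy = F.conv2d(maps, SOBEL_Y.to(maps), padding=1)
    return (gx ** 2).sum() + (gy ** 2).sum()

def total_loss(p_corr, p_cs, logits, lam=1.0):
    """L = L_r + lambda * L_s, with the cross-channel softmax applied to the
    raw network outputs before blending (lam is a placeholder value)."""
    w_hats = torch.softmax(logits, dim=1)             # normalize across the N WB settings
    rec = reconstruction_loss(p_corr, p_cs, w_hats.unsqueeze(2))
    return rec + lam * smoothness_loss(w_hats)
```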

Architecture of the GridNet deep neural network with six columns and four rows, as used in the MixedIll WB method by Afifi, Brubaker, and Brown (2022). Blue row units are residual units, green column units are convolutional downscaling units (reducing the dimensions of each feature received from the upper row by a factor of two while doubling the number of output channels), and orange column units are deconvolutional bilinear upscaling units (increasing the dimensions of the received features by a factor of two in the last three columns).
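The column units described in the caption are simple to express in PyTorch; the following is a sketch of the two unit types under the assumptions stated in the caption (halving/doubling of spatial dimensions and channel counts), not the original implementation:

```python
import torch.nn as nn

class DownscaleUnit(nn.Module):
    """Green column unit (sketch): halve spatial dimensions, double channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=3, stride=2, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))

class UpscaleUnit(nn.Module):
    """Orange column unit (sketch): double spatial dimensions via bilinear
    upsampling, halve channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))
```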

Benchmarking experiments

Now, to compare this Deep Learning approach with the analytical methods, we produced a unique evaluation data set that is very hard to white-balance. We photographed scenes with extreme mixed-illuminant conditions, where two light sources of opposing color temperatures lit the scene or were sometimes even visible in the image. We then compared the results of the Deep Learning approach with those of the analytical methods based on the polarization camera images by employing error metrics.

Let’s see how the DNN performs on one example from our benchmarking data set:

Comparison of ground truth (left) and final corrected image with the MixedIllWB method by Afifi, Brubaker, and Brown (2022) (right). The corrected image was generated by applying the weighting maps (bottom row) to the fixed WB images. The scene was illuminated with Skypanels set to \(5928\) K and \(2768\) K, respectively.

We can also inspect what the DNN did under the hood by plotting the weighting maps and the Hadamard products of the weighting maps with the pre-rendered WB images:

Weighting maps \(W_i\) (top row) and Hadamard products of \(W_i\) and pre-rendered WB images \(P_{c_i}\) (bottom row) for the corresponding WB settings (columns) for MIPo image A0023.
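Such an overview can be produced with a small matplotlib helper; the array shapes below are assumptions (weighting maps as HxW arrays, pre-rendered images as HxWx3 arrays in [0, 1]):

```python
import matplotlib.pyplot as plt

def plot_weighting_maps(w_hats, p_cs, wb_names):
    """Show each weighting map W_i (top row) and the Hadamard product
    W_i ⊙ P_{c_i} (bottom row) per WB setting (sketch)."""
    n = len(wb_names)
    fig, axes = plt.subplots(2, n, figsize=(3 * n, 6))
    for i, name in enumerate(wb_names):
        axes[0, i].imshow(w_hats[i], cmap="gray", vmin=0, vmax=1)
        axes[0, i].set_title(name)
        axes[1, i].imshow(w_hats[i][..., None] * p_cs[i])
        axes[0, i].axis("off")
        axes[1, i].axis("off")
    fig.tight_layout()
    plt.show()
```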

To get the full picture of how well the approach works on our custom evaluation data set, we calculated the \(\Delta E_{00}\) error metric for the corrected images against the ground-truth images. \(\Delta E_{00}\) is a color difference metric that is widely used in the industry to evaluate the color difference between two images: the lower the \(\Delta E_{00}\) value, the better the color match. In the boxplots below, the Afifi approach is the one on the left; all other approaches are not deep-learning-based.
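For reference, a per-image \(\Delta E_{00}\) score can be computed with scikit-image; this is a minimal sketch assuming sRGB float images in \([0, 1]\):

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def mean_delta_e00(ground_truth_rgb, corrected_rgb):
    """Mean CIEDE2000 color difference between two sRGB images
    (float arrays in [0, 1], shape HxWx3)."""
    lab_gt = rgb2lab(ground_truth_rgb)
    lab_corr = rgb2lab(corrected_rgb)
    return float(np.mean(deltaE_ciede2000(lab_gt, lab_corr)))
```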

Mixed illuminants – \(\Delta E_{00}\): Boxplots of the \(\Delta E_{00}\) metric for white balancing methods against ground truth for corrected images of the MIPo dataset. Boxplot properties: The horizontal line inside the box is the median value. The height of the box is the interquartile range (IQR) between lower and upper quartile. The whiskers mark the minimum and maximum data points within \(1.5 \cdot \text{IQR}\) of the nearest (lower/upper) quartile. Outliers are data points outside this range and are marked with diamond symbols.

Code

Here is the code that I wrote for the custom class AfifiRenderer. The first part defines two custom dictionary classes that prevent us from using the model in a wrong way.
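A minimal sketch of what such restricted dictionaries can look like; the class names and the WB-setting labels are illustrative assumptions, not the project's actual code:

```python
class FrozenKeysDict(dict):
    """Dictionary whose set of keys is fixed at construction time
    (illustrative sketch)."""

    def __setitem__(self, key, value):
        if key not in self:
            raise KeyError(f"Unknown key '{key}'; allowed keys: {sorted(self)}")
        super().__setitem__(key, value)

class WbSettingsDict(FrozenKeysDict):
    """Holds one image per predefined WB setting expected by the model.
    The setting labels below are assumptions."""

    SETTINGS = ("D", "S", "T", "F", "C")

    def __init__(self):
        super().__init__({setting: None for setting in self.SETTINGS})
```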

The second part defines the AfifiRenderer class, which holds the model and the camera pipeline. The render method applies the model to the images and the save method writes the results to disk.
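The following sketch shows how such a class could be structured; the model interface (a network mapping the stacked fixed-WB renderings to one weighting map per setting) and all names are assumptions on my part, not the original implementation:

```python
from pathlib import Path

import cv2
import torch

class AfifiRenderer:
    """Wraps a pre-trained mixed-illuminant WB model for use in an image
    pipeline (illustrative sketch, not the original class)."""

    def __init__(self, model: torch.nn.Module, device: str = "cpu"):
        self.device = torch.device(device)
        self.model = model.to(self.device).eval()
        self.results = {}

    @torch.no_grad()
    def render(self, wb_images: dict) -> dict:
        """Blend the pre-rendered fixed-WB images with the weighting maps
        predicted by the network. `wb_images` maps WB-setting names to
        float HxWx3 arrays in [0, 1] (assumed layout)."""
        stack = torch.stack(
            [torch.from_numpy(img).permute(2, 0, 1).float() for img in wb_images.values()]
        ).unsqueeze(0).to(self.device)                       # (1, N, 3, H, W)
        logits = self.model(stack)                           # assumed output: (1, N, H, W)
        weights = torch.softmax(logits, dim=1).unsqueeze(2)  # cross-channel softmax
        blended = (weights * stack).sum(dim=1).clamp(0, 1)   # (1, 3, H, W)
        self.results["corrected"] = blended[0].permute(1, 2, 0).cpu().numpy()
        self.results["weights"] = weights[0, :, 0].cpu().numpy()
        return self.results

    def save(self, out_dir: str) -> None:
        """Write the corrected image to disk as an 8-bit PNG."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        corrected = (self.results["corrected"] * 255).astype("uint8")
        cv2.imwrite(str(out / "corrected.png"), cv2.cvtColor(corrected, cv2.COLOR_RGB2BGR))
```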

Here is an example of how this would be used:
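A usage example under the same assumptions as the sketches above (the file names, WB-setting suffixes, and the weight loading are placeholders):

```python
import cv2
import torch

model = torch.load("mixedillwb_model.pt", map_location="cpu")  # placeholder for loading the pre-trained net

wb_images = WbSettingsDict()
for setting in list(wb_images):
    bgr = cv2.imread(f"A0023_{setting}.png")                   # image rendered with this WB setting
    wb_images[setting] = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB) / 255.0

renderer = AfifiRenderer(model)
results = renderer.render(wb_images)
renderer.save("results/A0023")
```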

References

Afifi, Mahmoud, Marcus A. Brubaker, and Michael S. Brown. 2022. “Auto White-Balance Correction for Mixed-Illuminant Scenes.” In Proceedings of the WACV, 1210–19. Waikoloa: IEEE. https://github.com/mahmoudnafifi/mixedillWB.