Reconstructing Faces from Ear Pictures using Image-to-Image Translation
By Matthew Ye
If you were ever warned to keep yourself anonymous on the internet, you now have one more thing to worry about: ear pictures. In the paper Ear2Face: Deep Biometric Modality Mapping, researchers trained a model to reconstruct full facial profiles from cropped images of ears alone. Before you go out and buy a beanie, though, let's explore image-to-image translation and how the researchers modified the original idea to better suit the task of biometric modality mapping. We will assume a basic understanding of convolutional neural networks.
Conditional Adversarial Networks
Ear2Face uses a type of generative network called a GAN (generative adversarial network). Traditional GANs learn a mapping from a random noise vector z to an output image y; more specifically, we train a generative network to process z into an output that can fool a discriminator, a binary classifier that is simultaneously being trained to distinguish authentic images from generated ones. Thus, both parties are incentivized to improve their game until the discriminator yields 50-50 probabilities for generated images. At that point, the discriminator - and usually humans, too - can no longer distinguish generated images from real ones. (If you want some more background, Google has published an excellent intuitive explanation here.)
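Formally, this game is the minimax objective from the original GAN paper, where the discriminator D is trained to maximize the value and the generator G to minimize it:

$$\min_G \max_D \; \mathbb{E}_{y \sim p_{\text{data}}}\big[\log D(y)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

At the equilibrium of this game, D outputs 0.5 everywhere, which is exactly the 50-50 situation described above.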
Traditional GANs allow us to capture the probability distribution of our images and create something similar from noise. But what if our starting point weren't noise, but something concrete? That is the premise of conditional adversarial networks (cGANs), which this 2017 paper (pix2pix) turned into a generalized framework for image-to-image translation. Conditional GANs learn a mapping G from an input image x and a random noise vector z to an output image y. In other words, we train our generator to transform input images, perturbed by noise, into outputs that reasonably align with the distribution of our target images. The randomness of the noise vector makes our model less vulnerable to a phenomenon known as mode collapse - a common issue with GANs in which the generator begins producing similar images over and over again regardless of the input provided. Meanwhile, the discriminator learns to differentiate the pairs (x, G(x, z)) and (x, y). Note that it is given the input image x in addition to the outputs. This makes the classifier more robust: it learns not only to tell a fake output G(x, z) from a real one y, but also to spot mismatches between the input and the output, which in turn incentivizes our generator to produce outputs that are both realistic and consistent with the input.
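As a rough sketch of how this conditioning works in practice (assuming generic G and D networks trained with binary cross-entropy - not the exact Ear2Face setup), one training step might look like this, with the pairing implemented by concatenating the input with the real or generated output along the channel dimension:

```python
import torch
import torch.nn.functional as F

def cgan_step(G, D, opt_G, opt_D, x, y, z_dim=100):
    """One conditional GAN training step on a batch of (input x, target y) pairs.
    G, D, opt_G, opt_D and z_dim are placeholders, not the Ear2Face models."""
    z = torch.randn(x.size(0), z_dim, device=x.device)  # random noise vector
    fake = G(x, z)

    # Discriminator: real (x, y) pairs -> label 1, fake (x, G(x, z)) pairs -> label 0.
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: try to make D label the fake (x, G(x, z)) pair as real.
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

Because the discriminator always sees the input alongside the (real or fake) output, a generator that produces realistic faces unrelated to the given ear still gets penalized.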
In this way, cGANs can learn to map any type of image to any other type of image - see the image below for some examples. Ear2Face is just one of nearly limitless applications of cGANs, mapping ears to faces.
Architecture
The Ear2Face generator uses the U-Net architecture described in the pix2pix paper. To understand U-Nets, you must first understand autoencoders. Autoencoders are neural networks with symmetrical architectures that downsample an input, pass it through a bottleneck layer, and upsample it until it matches the original input dimensions. The idea is that, during training, the downsampling layers will extract higher-level features, giving us a compact representation of our image by the time we reach the bottleneck layer. Then, in the upsampling layers, detail is reconstructed from these high-level features. For many image translation problems, however, there is a great deal of low-level information shared between the input and output. In the U-Net architecture, this low-level information is sent directly across the network using skip connections. As an intuitive example, consider the task of image colorization: the autoencoder network might handle the colorization of high-level features while the skip connections restore prominent edges that might have been lost in downsampling.
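To make this concrete, here is a minimal PyTorch sketch of a U-Net-style generator. The depth, channel counts, and layer choices are illustrative assumptions, not the configuration used in Ear2Face or pix2pix; the point is the skip connections that carry low-level detail around the bottleneck.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net: the encoder downsamples, the decoder upsamples, and skip
    connections pass low-level detail straight across the bottleneck."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.down1 = self._down(in_ch, 64)       # H   -> H/2
        self.down2 = self._down(64, 128)         # H/2 -> H/4
        self.bottleneck = self._down(128, 256)   # H/4 -> H/8 (compact representation)
        self.up2 = self._up(256, 128)            # H/8 -> H/4
        self.up1 = self._up(128 + 128, 64)       # input includes skip from down2
        self.out = nn.Sequential(                # input includes skip from down1
            nn.ConvTranspose2d(64 + 64, out_ch, 4, stride=2, padding=1),
            nn.Tanh())

    def _down(self, i, o):
        return nn.Sequential(nn.Conv2d(i, o, 4, stride=2, padding=1),
                             nn.BatchNorm2d(o), nn.LeakyReLU(0.2))

    def _up(self, i, o):
        return nn.Sequential(nn.ConvTranspose2d(i, o, 4, stride=2, padding=1),
                             nn.BatchNorm2d(o), nn.ReLU())

    def forward(self, x):
        d1 = self.down1(x)                            # low-level features
        d2 = self.down2(d1)
        b = self.bottleneck(d2)                       # high-level features
        u2 = self.up2(b)
        u1 = self.up1(torch.cat([u2, d2], dim=1))     # skip connection
        return self.out(torch.cat([u1, d1], dim=1))   # skip connection
```

For a 3x64x64 input, `TinyUNet()(torch.randn(1, 3, 64, 64))` returns a 3x64x64 output; the actual pix2pix generator is much deeper, with skip connections mirroring every downsampling step.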
Unfortunately, the authors of Ear2Face don't go into much detail about their discriminator model, other than mentioning that it is similar to the one used in the Markovian GAN (mGAN) paper. If you're interested in computer vision, that paper is definitely worth a read, and if you're not, it has really neat image samples, so check it out anyway.
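Since the details aren't given, here is a rough sketch of what a Markovian (PatchGAN-style) discriminator generally looks like, in the spirit of pix2pix; treat the depth and channel counts as assumptions rather than the authors' actual model.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Markovian/PatchGAN-style discriminator: rather than one real/fake score
    per image, it outputs a grid of logits, each judging a local patch of the
    concatenated (ear, face) pair."""
    def __init__(self, in_ch=6):  # 3 ear channels + 3 face channels, concatenated
        super().__init__()
        def block(i, o, norm=True):
            layers = [nn.Conv2d(i, o, 4, stride=2, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(o))
            layers.append(nn.LeakyReLU(0.2))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            nn.Conv2d(256, 1, 4, padding=1))  # one logit per receptive-field patch

    def forward(self, xy):
        # xy: the input ear and the (real or fake) face concatenated along
        # channels, as in the training step sketched earlier.
        return self.net(xy)
```

Judging local patches rather than whole images pushes the generator to get high-frequency texture right, which complements the pixel-level losses described next.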
Loss Functions
To train the generator, we need a loss function that we can minimize via gradient descent. Ear2Face's generator uses four loss functions, the most important of which is adversarial loss - roughly, the likelihood that the discriminator correctly distinguishes the generator's (x, G(x, z)) (ear, fake face) pairs from real (x, y) (ear, real face) pairs. The generator tries to minimize this value; the discriminator, on the other hand, attempts to maximize it. Adversarial loss can be represented mathematically as follows:
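$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]$$

Here G tries to minimize the objective while D tries to maximize it, i.e. $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$. (This is the conditional GAN objective as written in the pix2pix paper, which matches the adversarial loss described here.)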
However, the generator in Ear2Face actually combines adversarial loss with three more loss functions; a rough code sketch of the full objective follows the list below.
Pixel loss: Comparing corresponding pixels in the generator output image (the fake face) and the target image (the real face) forces the network to create outputs that closely match the target data. In the pix2pix paper, the authors found that using per-pixel L1 loss produced less blurry outputs than L2 loss. Such a loss function tends to encourage overfitting to pixel-level details, though, so the weight of pixel loss is kept relatively low compared to adversarial loss.
Feature reconstruction loss: The researchers used a clever approach to penalize high-level differences between the generated and target faces. First, they took a ResNet-50 model pre-trained to classify faces on the VGGFace2 dataset. The ResNet-50 architecture consists of many convolutional blocks (convolution + batch normalization + ReLU), followed by a pooling layer that collapses the spatial dimensions of the final feature map, followed by a fully connected layer that produces classification probabilities. The researchers then used the output of the pooling layer (call this ϕ) as a vector representation of an input face's high-level facial features. Ear2Face feeds the generated image and the target image into the ResNet model and takes the L2 loss between the 1x2048 feature vector of the generated image, ϕ(G(x, z)), and that of the target image, ϕ(y).
Style loss: Including a "style" loss function allows us to further preserve high-level qualities such as texture and other subtleties that help make our output more realistic. Artistic style may seem like something that's very difficult to quantify, but as outlined in this paper, it can be computed from Gram matrices of high-level features. Intuitively, Gram matrices encode correlations between features, so we can penalize unnatural correlations by taking the L2 loss between the Gram matrix of the generated image and that of the target image. Here, the features are the same ones used for feature reconstruction loss.
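To tie the four terms together, here is a rough PyTorch sketch of the full generator objective. The loss weights, the helper names, and the use of torchvision's ImageNet ResNet-50 (standing in for the VGGFace2-pre-trained model the paper uses) are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in backbone: an ImageNet ResNet-50 with its classification head removed.
# (Ear2Face uses a ResNet-50 pre-trained on VGGFace2; swap that in if available.)
_resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
_conv_body = torch.nn.Sequential(*list(_resnet.children())[:-2]).eval()
for p in _conv_body.parameters():
    p.requires_grad_(False)

def features(img):
    """Backbone feature map (N, 2048, H', W') and pooled 2048-d vector phi."""
    fmap = _conv_body(img)
    phi = F.adaptive_avg_pool2d(fmap, 1).flatten(1)  # (N, 2048)
    return fmap, phi

def gram_matrix(fmap):
    """Feature correlations: (N, C, H, W) activations -> (N, C, C)."""
    n, c, h, w = fmap.shape
    f = fmap.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(fake, real, d_fake_logits,
                   lambdas=(1.0, 1.0, 1.0, 1.0)):  # illustrative weights
    """Illustrative combination of the four Ear2Face loss terms.
    d_fake_logits: discriminator output for the (ear, fake face) pair."""
    l_adv, l_pix, l_feat, l_style = lambdas

    # Adversarial loss: reward fooling the discriminator.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))

    # Pixel loss: per-pixel L1 between the fake and real face.
    pix = F.l1_loss(fake, real)

    # Feature reconstruction loss: L2 between pooled 2048-d feature vectors.
    fmap_f, phi_f = features(fake)
    fmap_r, phi_r = features(real)
    feat = F.mse_loss(phi_f, phi_r)

    # Style loss: L2 between Gram matrices of the same backbone activations.
    style = F.mse_loss(gram_matrix(fmap_f), gram_matrix(fmap_r))

    return l_adv * adv + l_pix * pix + l_feat * feat + l_style * style
```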
The relative weights of these loss functions serve as hyperparameters. After training, the results on the test set are impressive, to say the least.
After reading this article, you might be tempted to go out and buy that beanie. But by now, you should also understand that no part of your body is really safe - anyone can make a Hand2Face or Mouth2Face or FacePicWithPixelatedEyes2Face. The only thing preserving our anonymity is a couple hours of GPU compute time and lack of will.
Figures from the pix2pix and Ear2Face papers; cGAN architecture figure adapted from this paper.