Style-based GANs – Generating and Tuning Realistic Artificial Faces

Generative Adversarial Networks (GAN) are a relatively new concept in Machine Learning, introduced for the first time in 2014. Their goal is to synthesize artificial samples, such as images, that are indistinguishable from authentic images. A common example of a GAN application is to generate artificial face images by learning from a dataset of celebrity faces. While GAN images became more realistic over time, one of their main challenges is controlling their output, i.e. changing specific features such as pose, face shape and hair style in an image of a face.

A new paper by NVIDIA, A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN), presents a novel model which addresses this challenge. StyleGAN generates the artificial image gradually, starting from a very low resolution and continuing up to a high resolution (1024×1024). By modifying the input of each level separately, it controls the visual features that are expressed at that level, from coarse features (pose, face shape) to fine details (hair color), without affecting other levels.

This technique not only allows for a better understanding of the generated output, but also produces state-of-the-art results – high-res images that look more authentic than previously generated images.

Background

The basic components of every GAN are two neural networks – a generator that synthesizes new samples from scratch, and a discriminator that takes samples from both the training data and the generator’s output and predicts if they are “real” or “fake”.

The generator input is a random vector (noise) and therefore its initial output is also noise. Over time, as it receives feedback from the discriminator, it learns to synthesize more “realistic” images. The discriminator also improves over time by comparing generated samples with real samples, making it harder for the generator to deceive it.
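To make the adversarial setup concrete, here is a minimal, self-contained PyTorch sketch of a GAN training loop. The toy generator, toy discriminator, latent size, and stand-in dataset are placeholders for illustration only and are not StyleGAN's architecture.

```python
import torch
import torch.nn as nn

# Minimal adversarial training loop (illustrative sketch only: the toy
# networks, latent size, and fake "dataset" below are placeholders).
latent_dim = 512
G = nn.Sequential(nn.Linear(latent_dim, 784), nn.Tanh())   # toy generator
D = nn.Sequential(nn.Linear(784, 1))                        # toy discriminator (logits)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

data_loader = [torch.randn(64, 784) for _ in range(10)]     # stand-in for real data

for real in data_loader:
    batch = real.size(0)
    z = torch.randn(batch, latent_dim)                      # random input vector (noise)

    # Discriminator step: push real samples toward "real", generated ones toward "fake".
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label its output as "real".
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```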

GANs overview

Researchers had trouble generating high-quality large images (e.g. 1024×1024) until 2018, when NVIDIA first tackled the challenge with ProGAN. The key innovation of ProGAN is its progressive training – it starts by training the generator and the discriminator on very low-resolution images (e.g. 4×4) and adds a higher-resolution layer at each stage.

This technique first creates the foundation of the image by learning the base features that appear even in a low-resolution image, and learns more and more details as the resolution increases. Training on low-resolution images is not only easier and faster, it also helps in training the higher levels; as a result, the total training time is shorter as well.
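ProGAN does not switch resolutions abruptly: each new layer is blended in gradually. The snippet below is a simplified PyTorch sketch of that fade-in idea (a detail from the ProGAN paper; the shapes and the `fade_in` helper are illustrative, not the reference code).

```python
import torch
import torch.nn.functional as F

def fade_in(low_res_out, new_block_out, alpha):
    """Blend the previous resolution's (upsampled) output with the newly
    added higher-resolution block, with alpha ramping from 0 to 1.
    (Simplified sketch of ProGAN's layer fade-in, not the reference code.)"""
    upsampled = F.interpolate(low_res_out, scale_factor=2, mode="nearest")
    return (1 - alpha) * upsampled + alpha * new_block_out

# Toy usage: an 8x8 block being introduced on top of an already-trained 4x4 stage.
low  = torch.randn(1, 3, 4, 4)   # output of the 4x4 stage
high = torch.randn(1, 3, 8, 8)   # output of the new 8x8 block
blended = fade_in(low, high, alpha=0.3)
print(blended.shape)             # torch.Size([1, 3, 8, 8])
```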

ProGAN overview

ProGAN generates high-quality images but, as in most models, its ability to control specific features of the generated image is very limited. In other words, the features are entangled and therefore attempting to tweak the input, even a bit, usually affects multiple features at the same time. A good analogy for that would be genes, in which changing a single gene might affect multiple traits.

ProGAN progressive training from low- to high-resolution layers. Source: Sarah Wolf's blog post on ProGAN.

How StyleGAN works

The StyleGAN paper offers an upgraded version of ProGAN's architecture, with the changes focused on the generator network. The authors observe that a potential benefit of the ProGAN progressive layers is their ability to control different visual features of the image, if utilized properly. The lower the layer (and the resolution), the coarser the features it affects. The paper divides the features into three types:

  1. Coarse – resolution of up to 8² – affects pose, general hair style, face shape, etc.
  2. Middle – resolution of 16² to 32² – affects finer facial features, hair style, eyes open/closed, etc.
  3. Fine – resolution of 64² to 1024² – affects color scheme (eye, hair and skin) and micro features.

The new generator includes several additions to ProGAN's generator:

Mapping Network

The Mapping Network’s goal is to encode the input vector into an intermediate vector whose different elements control different visual features. This is a non-trivial process, since the ability to control visual features with the input vector is limited: it must follow the probability density of the training data. For example, if images of people with black hair are more common in the dataset, then more input values will be mapped to that feature. As a result, the model isn’t capable of mapping parts of the input (elements in the vector) to individual features, a phenomenon called feature entanglement. However, by using another neural network, the model can generate a vector that doesn’t have to follow the training data distribution, and thereby reduce the correlation between features.
The Mapping Network consists of 8 fully connected layers and its output w is of the same size as the input layer (512×1).
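A minimal PyTorch sketch of such a mapping network is shown below; the LeakyReLU activation and the absence of input normalization are simplifications, not necessarily the paper's exact choices.

```python
import torch
import torch.nn as nn

# Sketch of the Mapping Network: 8 fully connected layers mapping the
# 512-dim latent z to a 512-dim intermediate vector w.
class MappingNetwork(nn.Module):
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)           # w, same size as z

z = torch.randn(4, 512)              # a batch of random input vectors
w = MappingNetwork()(z)              # intermediate latent vectors, shape (4, 512)
```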

The generator with the Mapping Network (in addition to the ProGAN synthesis network)


Style Modules (AdaIN)

The AdaIN (Adaptive Instance Normalization) module transfers the encoded information w, created by the Mapping Network, into the generated image. The module is added to each resolution level of the Synthesis Network and defines the visual expression of the features at that level (a simplified code sketch of the module follows the figure below):

  1. Each channel of the convolution layer output is first normalized to make sure the scaling and shifting of step 3 have the expected effect.
  2. The intermediate vector w is transformed using another fully-connected layer (marked as A) into a scale and bias for each channel.
  3. The scale and bias vectors shift each channel of the convolution output, thereby defining the importance of each filter in the convolution. This tuning translates the information from w to a visual representation.
The generator’s Adaptive Instance Normalization (AdaIN)
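Below is a rough PyTorch sketch of the three steps above: per-channel normalization followed by a style-dependent scale and bias produced by a learned affine layer (the "A" block). It illustrates the idea and is not the reference implementation.

```python
import torch
import torch.nn as nn

# Sketch of an AdaIN style module: normalize each channel of the conv output,
# then scale and shift it with per-channel values produced from w.
class AdaIN(nn.Module):
    def __init__(self, w_dim, n_channels):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * n_channels)   # "A": w -> (scale, bias)

    def forward(self, x, w):
        # 1. per-channel normalization of the conv output x: (B, C, H, W)
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        x = (x - mu) / sigma
        # 2. + 3. scale and shift every channel according to the style w
        scale, bias = self.affine(w).chunk(2, dim=1)
        return scale[:, :, None, None] * x + bias[:, :, None, None]

x = torch.randn(4, 256, 16, 16)     # feature maps at some resolution level
w = torch.randn(4, 512)             # intermediate latent from the mapping network
styled = AdaIN(512, 256)(x, w)      # same shape as x, restyled
```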

Removing traditional input

Most models, and ProGAN among them, use the random input to create the initial image of the generator (i.e. the input of the 4×4 level). The StyleGAN team found that the image features are controlled by w and the AdaIN, and therefore the initial input can be omitted and replaced by constant values. Though the paper doesn’t explain why it improves performance, a safe assumption is that it reduces feature entanglement – it’s easier for the network to learn only using w without relying on the entangled input vector.

The Synthesis Network input is replaced with a constant input
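A minimal sketch of that replacement might look like the following, assuming a learned 4×4×512 constant as the starting tensor of the synthesis network.

```python
import torch
import torch.nn as nn

# Sketch: the synthesis network starts from a learned constant tensor instead
# of a random input (4x4x512 matches the paper's 4x4 starting level).
class ConstantInput(nn.Module):
    def __init__(self, channels=512, size=4):
        super().__init__()
        self.const = nn.Parameter(torch.ones(1, channels, size, size))

    def forward(self, batch_size):
        # every image in the batch begins from the same learned constant;
        # all variation comes from w (via AdaIN) and the per-layer noise
        return self.const.expand(batch_size, -1, -1, -1)

start = ConstantInput()(batch_size=4)   # (4, 512, 4, 4)
```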

Stochastic variation

Many small aspects of people’s faces can be seen as stochastic, such as freckles, the exact placement of hairs, and wrinkles; these features make the image more realistic and increase the variety of outputs. The common method for inserting these small features into GAN images is adding random noise to the input vector. However, in many cases it’s tricky to control the effect of the noise due to the feature entanglement phenomenon described above, which leads to other features of the image being affected.

The noise in StyleGAN is added in a similar fashion to the AdaIN mechanism: a scaled noise image is added to each channel before the AdaIN module and slightly changes the visual expression of the features at the resolution level it operates on.

Adding scaled noise to each resolution level of the synthesis network
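The sketch below illustrates this kind of noise injection: a single-channel noise image scaled by a learned per-channel factor (the "B" block) and added to the feature maps. Initialization and exact placement are simplifications.

```python
import torch
import torch.nn as nn

# Sketch of per-layer noise: a single-channel noise image is scaled by a
# learned per-channel factor and added to the feature maps before AdaIN.
class NoiseInjection(nn.Module):
    def __init__(self, n_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, n_channels, 1, 1))  # learned "B"

    def forward(self, x):
        b, _, h, w = x.shape
        noise = torch.randn(b, 1, h, w)          # fresh noise at this resolution
        return x + self.scale * noise            # broadcast over channels

x = torch.randn(4, 256, 16, 16)
x_noisy = NoiseInjection(256)(x)                 # same shape, with stochastic detail
```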

Style mixing

The StyleGAN generator uses the intermediate vector at each level of the synthesis network, which might cause the network to learn that the levels are correlated. To reduce this correlation, the model randomly selects two input vectors and generates the intermediate vector w for each of them. It then trains some of the levels with the first and switches (at a random point) to the other to train the rest of the levels. The random switch ensures that the network won’t learn and rely on a correlation between levels.

Though it doesn’t improve the model performance on all datasets, this concept has a very interesting side effect – its ability to combine multiple images in a coherent way (as shown in the video below). The model generates two images A and B and then combines them by taking low-level features from A and the rest of the features from B.

Example of Style Mixing
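A possible sketch of this mixing regularization is shown below; the `mixed_styles` helper, the placeholder mapping network, and the assumption of 18 style-modulated layers (two per resolution from 4×4 to 1024×1024) are illustrative, not taken from the reference code.

```python
import torch
import torch.nn as nn

# Sketch of mixing regularization: two latents are mapped to w1 and w2, and a
# random crossover level decides which of the two feeds each layer's AdaIN.
def mixed_styles(mapping, n_layers=18, batch=4, dim=512):
    w1 = mapping(torch.randn(batch, dim))
    w2 = mapping(torch.randn(batch, dim))
    crossover = torch.randint(1, n_layers, (1,)).item()
    # layers before the crossover point are styled by w1, the rest by w2
    return [w1 if i < crossover else w2 for i in range(n_layers)]

mapping = nn.Sequential(nn.Linear(512, 512))   # placeholder for the real mapping net
per_layer_w = mixed_styles(mapping)            # list of 18 style vectors
```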

Truncation trick in W

One of the challenges in generative models is dealing with areas that are poorly represented in the training data. The generator isn’t able to learn them and create images that resemble them, and instead produces bad-looking images. To avoid generating poor images, StyleGAN truncates the intermediate vector w, forcing it to stay close to the “average” intermediate vector.

After training the model, an “average” intermediate vector w_avg is produced by sampling many random inputs, generating their intermediate vectors with the Mapping Network, and calculating their mean. When generating new images, instead of using the Mapping Network output directly, w is transformed into w_new = w_avg + ψ·(w − w_avg), where the value of ψ defines how far the image can be from the “average” image (and how diverse the output can be). Interestingly, by using a different ψ for each level, before the affine transformation block, the model can control how far from average each set of features is, as shown in the video below.

Tweaking the generated image by changing the value of ψ at different levels
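In code, the truncation itself is a one-liner; the sketch below also shows how w_avg could be estimated by averaging the mapping network's output over many random inputs (the mapping network here is a placeholder).

```python
import torch
import torch.nn as nn

# Sketch of the truncation trick: pull w toward the average intermediate
# vector w_avg; psi controls how far from the "average" image the output may
# be (psi = 1 leaves w unchanged, psi = 0 always produces the average face).
def truncate(w, w_avg, psi=0.7):
    return w_avg + psi * (w - w_avg)

mapping = nn.Sequential(nn.Linear(512, 512))   # placeholder for the real mapping net
with torch.no_grad():
    # estimate w_avg once, after training, from many random inputs
    w_avg = mapping(torch.randn(10_000, 512)).mean(dim=0, keepdim=True)
    w = mapping(torch.randn(1, 512))
w_new = truncate(w, w_avg, psi=0.7)
```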

Fine-tuning

An additional improvement of StyleGAN over ProGAN is the tuning of several network hyperparameters, such as the training duration and the loss function, and the replacement of the nearest-neighbor up/downscaling with bilinear sampling. Though this step is significant for the model’s performance, it’s less innovative and therefore won’t be described here in detail (see Appendix C of the paper).

An overview of StyleGAN

Results

The paper presents state-of-the-art results on two datasets – CelebA-HQ, which consists of images of celebrities, and a new dataset, Flickr-Faces-HQ (FFHQ), which consists of images of “regular” people and is more diversified. The chart below shows the Fréchet inception distance (FID) score of different configurations of the model.

The performance (FID score) of the model in different configurations compared to ProGAN. The lower the score, the better the model (Source: StyleGAN)

In addition to these results, the paper shows that the model isn’t tailored only to faces by presenting its results on two other datasets of bedroom images and car images.

Feature disentanglement

In order to make the discussion regarding feature separation more quantitative, the paper presents two novel ways to measure feature disentanglement:  

  1. Perceptual path length – measures the difference between consecutive images (their VGG16 embeddings) when interpolating between two random inputs. Drastic changes mean that multiple features have changed together and that they might be entangled (a simplified sketch follows after this list).
  2. Linear separability – the ability to classify inputs into binary classes, such as male and female. The better the classification, the more separable the features.
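For illustration, here is a heavily simplified sketch of the perceptual path length idea; the paper uses a perceptually-weighted VGG16 distance, spherical interpolation in z, and cropping details that are omitted here, and the `generator` below is just a placeholder.

```python
import torch
import torchvision.models as models

# Simplified sketch of perceptual path length: interpolate between two latents,
# generate two nearby images, and measure how much a VGG16 feature embedding
# changes per small step along the interpolation path.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

def embed(img):                                   # img: (B, 3, H, W) in [0, 1]
    with torch.no_grad():
        return vgg(img).flatten(1)

def path_length(generator, z1, z2, eps=1e-4):
    t = torch.rand(1)
    img_a = generator(torch.lerp(z1, z2, t))
    img_b = generator(torch.lerp(z1, z2, t + eps))
    d = (embed(img_a) - embed(img_b)).pow(2).sum(dim=1)
    return (d / eps**2).mean()                    # large values hint at entanglement

generator = lambda z: torch.rand(z.size(0), 3, 224, 224)   # placeholder generator
print(path_length(generator, torch.randn(1, 512), torch.randn(1, 512)))
```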

By comparing these metrics for the input vector z and the intermediate vector w, the authors show that features in w are significantly more separable. These metrics also show the benefit of selecting 8 layers in the Mapping Network in comparison to 1 or 2 layers.

Implementation Details

StyleGAN was trained on the CelebA-HQ and FFHQ datasets for one week using 8 Tesla V100 GPUs. It is implemented in TensorFlow, and the official source code is available in NVIDIA’s NVlabs/stylegan repository on GitHub.

Conclusion

StyleGAN is a groundbreaking paper that not only produces high-quality and realistic images but also allows for superior control and understanding of generated images, making it even easier than before to generate believable fake images. The techniques presented in StyleGAN, especially the Mapping Network and the Adaptive Normalization (AdaIN), will likely be the basis for many future innovations in GANs.



43 thoughts on “Style-based GANs – Generating and Tuning Realistic Artificial Faces”

  • Very interesting, my first read into AI. A good article. Reminds me of Thomas Edison methods but turbo-charged. Wonder if a Nikola Tesla will find a different approach?☺

  • Could this be used for identity models of criminals?
    As well as computer-generated movies, developed using the computer’s base data of particular actors (then “actors” will be doing their “own stunts,” realistically too)

    • I feel like I have only seen facial images to date, and only stills. It would be interesting to see if the same approach would work on motion and on full body. If so we could theoretically reach a point where there is no need for extras. As to leads, well, that’s still probably a ways away. Stunt work, as you said, could benefit greatly.

  • This is remarkable, but I cannot help but think of how this will be used to deceive people for political or otherwise nefarious purposes. Regardless, we are moving closer

      • Are you JOKING? Irrelevant? This is super scary. We are not far from not being able to distinguish reality from fiction. The implications are far reaching and society altering. We’re not in Kansas any more!

    • This is awesome! I wonder however who “owns” the rights for these generated faces? As they originate from real human beings, do these models share ownership? Or will these become Creative Commons pictures that can be freely used for creative processes such as UX persona modelling or the like?

      • I wondered the same thing, this seems perfect for UX persona images. Many companies I’ve worked with have paid a lot of money for royalty images of models for their persona documents. I haven’t seen any terms of service here yet, and hope that the creators do release them using something like CC Pictures licensing.

    • The github repository that explains the algorithm states that “the images were crawled from Flickr, thus inheriting all the biases of that website.” I was wondering the same thing as you. I refreshed maybe a 100 faces and very few were women of color. So Flickr must have fewer images of women of color?


  • Has anyone noticed that this method of generating images of faces seems to have a real problem with depicting teeth? I’ve noticed that the teeth are often displaced to one side – so much so that if you draw an imaginary line downwards from the midpoint of the nose between the nostrils, this line will be nearly centered on one of the front teeth, rather than the space between the two front teeth, as you would expect. Also, in cases where the two front teeth ARE properly centered, they are often noticeably asymmetrical, which of course in real life is pretty rare.

  • This is amazing! I am so intrigued by this tech, and thank you so much for making it accessible.

    I’ve noticed that it has a tough time with hats, which is, frankly, hilarious. I’ve started a collection of my favorite people-with-hats generated images at https://computersjustdontgethats.tumblr.com/ so that I can laugh myself silly from time-to-time.

    Bravo, and thanks again!

  • Fascinating program, but the algorithm is consistently not placing the teeth in the correct position on images with an open-mouth smile. Instead of the center line of the face lining up with the space between the two front teeth, it is as though one big front tooth is in the center-line. It looks like the teeth are all taken from a front view, but the faces are turned between straight front and three-quarter front, so it’s… slightly skewed and disconcerting. But the detail in the images is so realistic, it’s amazing!

  • What stuck out most to me, was the absence of really darker skinned people, as well as curly/coarse hair textures. I saw some badly done hair textures, but even those were straight. The eyes often didn’t line up well at all either.

  • And one would think we could develop some interesting “footage” of historical figures from the time after photography but before talkies – interacting as if they were alive today – imagine seeing Abraham Lincoln and US Grant having a conversation as if they were filmed in high def – based on authentic photographs from the era. Or, “perfect” re-enactments of historical events where no footage exists – Oval office conversations during the Cuban missile crisis?

  • Although the generated faces look almost realistic, they simply lack the common gestures created by human beings. With a little bit of tweaking of the method, it could be enhanced to a greater extent to deceive anyone.

  • I am new here in this topic
    My Questions:

    The discussion seems to be for programmers only..??
    I am only a normal user who wants to create faces … How can I do it?
    Software download link ? Where ?
    I want my 5 yr old daughter to “change” a bit and see how she looked 3 years ago as virtual baby and how she will (could) look like in 5-10 yrs… Is that possible ?

    Lets say I have only 1 pic of her and me.
    She has blond hair
    Green eyes
    slavic face

    What is necessary to create more pics out of one pic (but finally should look similar in different posing)?

    How can I start?
    Can someone help me?

    Regards from Germany
    Marc

  • Nice achievement. But also key to many possible threats.

    Another question. why are there predominantly white persons? I would love to see much more ethnic mixing. More “Africans”, more “Asians”, more “native Americans”, and of course “persons” in-between.

  • Fascinating stuff!

    I wonder about the copyright status of the generated images. It is not uncommon in publishing, to need a headshot of (for example) a ‘generic Mom’. Usually such a picture is found (and paid for) via Getty Images or some similar service. But this GAN software might be cheaper, easier, and offer a wider variety of choices.

    On the downside, I can see this being used by catfishers and other bottom-dwellers to generate “selfies” that won’t appear in a reverse image search.

    • People are already using these types of pics for fake profile pics. I see them all the time on Twitter. There are still obvious telltale signs if you look at the full size but the thumbnail looks real enough.

      The copyright question is an interesting one. If you license stock photos for the training set, do images generated from the model qualify as derivative works if you try to sell them?

  • No scars, facial deformities, heavy bias towards 18-50s. No teeth braces, hats (now I’ve seen one), vitiligo. No make up (I’ve now seen some lipstick). Nose studs. Jewellery. Acne. Very little variety in facial hair.

    It’s extraordinary but only reflects internet people, not humanity in general.
