Generative Adversarial Networks (GANs) are a relatively new concept in machine learning, first introduced in 2014. Their goal is to synthesize artificial samples, such as images, that are indistinguishable from authentic ones. A common GAN application is generating artificial face images by learning from a dataset of celebrity faces. While GAN-generated images have become more realistic over time, one of the main challenges is controlling the output, i.e. changing specific features such as pose, face shape, and hair style in an image of a face.
A new paper by NVIDIA, A Style-Based Generator Architecture for GANs (StyleGAN), presents a novel model that addresses this challenge. StyleGAN generates the artificial image gradually, starting from a very low resolution and continuing up to a high resolution (1024×1024). By modifying the input at each level separately, it controls the visual features expressed at that level, from coarse features (pose, face shape) to fine details (hair color), without affecting other levels.
This technique not only allows for a better understanding of the generated output, but also produces state-of-the-art results: high-resolution images that look more authentic than previously generated images.
Background
The basic components of every GAN are two neural networks – a generator that synthesizes new samples from scratch, and a discriminator that takes samples from both the training data and the generator’s output and predicts if they are “real” or “fake”.
The generator input is a random vector (noise) and therefore its initial output is also noise. Over time, as it receives feedback from the discriminator, it learns to synthesize more “realistic” images. The discriminator also improves over time by comparing generated samples with real samples, making it harder for the generator to deceive it.
Researchers had trouble generating high-quality large images (e.g. 1024×1024) until 2018, when NVIDIA first tackled the challenge with ProGAN. The key innovation of ProGAN is progressive training: it starts by training the generator and the discriminator at a very low resolution (e.g. 4×4) and progressively adds layers of increasing resolution.
This technique first creates the foundation of the image by learning the base features that appear even in a low-resolution image, and then learns finer and finer details as the resolution grows. Training the low-resolution layers first is also faster and more stable than training all layers at once.
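The progressive schedule described above can be sketched as a simple resolution-doubling loop (an illustrative sketch, not ProGAN's actual training code):

```python
import numpy as np

def progressive_resolutions(start=4, final=1024):
    """Resolutions visited by progressive training: 4x4, 8x8, ..., 1024x1024."""
    schedule = []
    res = start
    while res <= final:
        schedule.append((res, res))
        res *= 2  # each new stage doubles the resolution
    return schedule

print(progressive_resolutions())
# [(4, 4), (8, 8), (16, 16), (32, 32), (64, 64), (128, 128), (256, 256), (512, 512), (1024, 1024)]
```

Nine stages separate 4×4 from 1024×1024, which is why the generator ends up with a small number of well-separated resolution levels.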
ProGAN generates high-quality images but, as in most models, its ability to control specific features of the generated image is very limited. In other words, the features are entangled and therefore attempting to tweak the input, even a bit, usually affects multiple features at the same time. A good analogy for that would be genes, in which changing a single gene might affect multiple traits.
How StyleGAN works
The StyleGAN paper offers an upgraded version of ProGAN's image generator, with a focus on the generator network. The authors observe that a potential benefit of the ProGAN progressive layers is their ability, if utilized properly, to control different visual features of the image: the lower the layer (and the resolution), the coarser the features it affects. The paper divides the features into three types:
- Coarse – resolutions up to 8×8 – affects pose, general hair style, face shape, etc.
- Middle – resolutions of 16×16 to 32×32 – affects finer facial features, hair style, eyes open/closed, etc.
- Fine – resolutions of 64×64 to 1024×1024 – affects the color scheme (eye, hair, and skin) and micro features.
The new generator includes several additions to ProGAN's generator:
Mapping Network
The Mapping Network's goal is to encode the input vector into an intermediate vector whose different elements control different visual features. This is a non-trivial process, since the ability to control visual features with the input vector is limited: it must follow the probability density of the training data. For example, if images of people with black hair are more common in the dataset, then more input values will be mapped to that feature. As a result, the model isn't capable of mapping individual parts of the input (elements of the vector) to individual features, a phenomenon called feature entanglement. However, by using another neural network, the model can generate a vector that doesn't have to follow the training data distribution, reducing the correlation between features.
The Mapping Network consists of 8 fully connected layers and its output w is of the same size as the input layer (512×1).
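A minimal sketch of such a mapping network is shown below, assuming random (untrained) weights and a LeakyReLU activation in place of the paper's trained parameters:

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def mapping_network(z, weights, biases):
    """8 fully connected layers mapping z (512,) to the intermediate vector w (512,)."""
    h = z
    for W, b in zip(weights, biases):
        h = leaky_relu(h @ W + b)
    return h

rng = np.random.default_rng(0)
# Hypothetical random parameters standing in for the trained network.
weights = [rng.normal(0, 0.05, (512, 512)) for _ in range(8)]
biases = [np.zeros(512) for _ in range(8)]

z = rng.normal(size=512)   # random input vector (noise)
w = mapping_network(z, weights, biases)
print(w.shape)  # (512,)
```

The key point is structural: w has the same dimensionality as z, but because it is produced by a learned network rather than sampled directly, its distribution is free to deviate from the training data distribution.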
Style Modules (AdaIN)
The AdaIN (Adaptive Instance Normalization) module transfers the encoded information w, created by the Mapping Network, into the generated image. The module is added to each resolution level of the Synthesis Network and defines the visual expression of the features in that level:
1. Each channel of the convolution layer output is first normalized to make sure the scaling and shifting of step 3 have the expected effect.
2. The intermediate vector w is transformed using another fully-connected layer (marked as A) into a scale and bias for each channel.
3. The scale and bias vectors shift each channel of the convolution output, thereby defining the importance of each filter in the convolution. This tuning translates the information from w into a visual representation.
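The steps above can be sketched as follows (an illustrative sketch with a hypothetical random affine layer `A` in place of the learned one):

```python
import numpy as np

def adain(x, scale, bias, eps=1e-8):
    """Adaptive Instance Normalization on feature maps x of shape (channels, H, W)."""
    # Step 1: normalize each channel to zero mean and unit variance.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)
    # Step 3: scale and shift each channel with style values derived from w.
    return scale[:, None, None] * x_norm + bias[:, None, None]

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 16, 16))        # conv output at the 16x16 level, 64 channels
w = rng.normal(size=512)                 # intermediate vector from the mapping network
A = rng.normal(0, 0.05, (512, 128))      # hypothetical affine layer: w -> (scale, bias)

style = w @ A                            # step 2: learned affine transform of w
scale, bias = style[:64] + 1.0, style[64:]  # "+1" keeps scales near identity
y = adain(x, scale, bias)
print(y.shape)  # (64, 16, 16)
```

Because each channel is normalized first, the per-channel scale and bias coming from w fully determine that channel's statistics at this level, which is what lets w "overwrite" the style at each resolution.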
Removing traditional input
Most models, and ProGAN among them, use the random input to create the initial image of the generator (i.e. the input of the 4×4 level). The StyleGAN team found that the image features are controlled by w and the AdaIN modules, and therefore the initial input can be omitted and replaced by constant values.
Stochastic variation
There are many small aspects of people's faces that can be seen as stochastic, such as freckles, the exact placement of hairs, and wrinkles. These details make the image more realistic and increase the variety of outputs.
The noise in StyleGAN is added in a way similar to the AdaIN mechanism: a scaled noise input is added to each channel before the AdaIN module, slightly changing the visual expression of the features at the resolution level it operates on.
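A minimal sketch of this noise injection, assuming a single-channel noise image broadcast to all channels with a hypothetical learned per-channel scaling factor:

```python
import numpy as np

def add_noise(x, noise, channel_scales):
    """x: (channels, H, W) feature maps; noise: (H, W); channel_scales: (channels,)."""
    # The same noise image is broadcast to every channel, scaled per channel.
    return x + channel_scales[:, None, None] * noise[None, :, :]

rng = np.random.default_rng(2)
x = rng.normal(size=(64, 16, 16))
noise = rng.normal(size=(16, 16))          # fresh noise per layer and per image
scales = rng.normal(0, 0.1, 64)            # hypothetical learned scales (marked B in the paper)
y = add_noise(x, noise, scales)
print(y.shape)  # (64, 16, 16)
```

Since the scales are learned per channel, the network can decide which feature maps are allowed to vary stochastically and which must stay deterministic.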
Style mixing
The StyleGAN generator uses the intermediate vector in each level of the synthesis network, which might cause the network to learn that levels are correlated. To reduce the correlation, the model randomly selects two input vectors and generates the intermediate vector w for them. It then trains some of the levels with the first and switches (in a random point) to the other to train the rest of the levels. The random switch ensures that the network won’t learn and rely on a correlation between levels.
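The random switch can be sketched like this (assuming 18 style inputs, two per resolution level from 4×4 to 1024×1024):

```python
import numpy as np

def mix_styles(w1, w2, num_levels, rng):
    """Feed w1 to levels below a random crossover point, w2 to the rest."""
    crossover = rng.integers(1, num_levels)  # random switch point in [1, num_levels - 1]
    return [w1 if level < crossover else w2 for level in range(num_levels)]

rng = np.random.default_rng(3)
w1, w2 = rng.normal(size=512), rng.normal(size=512)
per_level_w = mix_styles(w1, w2, num_levels=18, rng=rng)
print(len(per_level_w))  # 18
```

The first levels always see w1 and the last always see w2, so during training no level can rely on its neighbors having received the same style vector.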
Though it doesn’t improve the model performance on all datasets, this concept has a very interesting side effect – its ability to combine multiple images in a coherent way (as shown in the video below). The model generates two images A and B and then combines them by taking low-level features from A and the rest of the features from B.
Truncation trick in W
One of the challenges in generative models is dealing with areas that are poorly represented in the training data. The generator can't learn these areas well, and when pushed toward them it produces bad-looking images. To avoid generating poor images, StyleGAN truncates the intermediate vector w, forcing it to stay close to the "average" intermediate vector.
After training the model, an "average" wavg is produced by sampling many random inputs, generating their intermediate vectors with the mapping network, and calculating the mean of these vectors. When generating new images, instead of using the Mapping Network output directly, w is transformed into wnew = wavg + ψ(w − wavg), where the value of ψ defines how far the image can be from the "average" image (and how diverse the output can be). Interestingly, by using a different ψ for each level, before the affine transformation block, the model can control how far from average each set of features is, as shown in the video below.
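A minimal sketch of the truncation trick, using a stand-in function in place of the trained mapping network:

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """Pull w toward the average intermediate vector; smaller psi -> closer to average."""
    return w_avg + psi * (w - w_avg)

rng = np.random.default_rng(4)
mapping = np.tanh  # hypothetical stand-in for the trained mapping network

# Estimate the "average" intermediate vector from many random inputs.
w_avg = np.mean([mapping(rng.normal(size=512)) for _ in range(2000)], axis=0)

w = mapping(rng.normal(size=512))
w_trunc = truncate(w, w_avg, psi=0.7)
# The truncated vector is strictly closer to the average than the original.
print(np.linalg.norm(w_trunc - w_avg) < np.linalg.norm(w - w_avg))  # True
```

Setting psi = 0 always produces the "average" face; psi = 1 disables truncation entirely, recovering the full (riskier) diversity of the model.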
Fine-tuning
An additional improvement of StyleGAN over ProGAN is the tuning of several network hyperparameters, such as training duration and loss function, and the replacement of nearest-neighbor up/downscaling with bilinear sampling. Though this step is significant for the model performance, it's less innovative and therefore won't be described here in detail (see Appendix C in the paper).
Results
The paper presents state-of-the-art results on two datasets – CelebA-HQ, which consists of images of celebrities, and a new dataset, Flickr-Faces-HQ (FFHQ), which consists of images of "regular" people and is more diverse. The chart below shows the Fréchet inception distance (FID) score of different configurations of the model.
In addition to these results, the paper shows that the model isn’t tailored only to faces by presenting its results on two other datasets of bedroom images and car images.
Feature disentanglement
In order to make the discussion regarding feature separation more quantitative, the paper presents two novel ways to measure feature disentanglement:
- Perceptual path length – measures the difference between consecutive images (via their VGG16 embeddings) when interpolating between two random inputs. Drastic changes mean that multiple features have changed together and may be entangled.
- Linear separability – the ability to classify inputs into binary classes, such as male and female. The better the classification, the more separable the features.
By comparing these metrics for the input vector z and the intermediate vector w, the authors show that features in w are significantly more separable. These metrics also show the benefit of selecting 8 layers in the Mapping Network in comparison to 1 or 2 layers.
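The perceptual path length idea can be sketched as follows. This is an illustrative simplification: the paper interpolates latents, generates images, and compares their VGG16 embeddings, while here a hypothetical `embed` function stands in for the whole generate-then-embed pipeline:

```python
import numpy as np

def path_length(z1, z2, embed, steps=10, eps=1e-4):
    """Average squared embedding distance between nearby points on the z1 -> z2 path."""
    total = 0.0
    for t in np.linspace(0, 1 - eps, steps):
        a = embed((1 - t) * z1 + t * z2)
        b = embed((1 - t - eps) * z1 + (t + eps) * z2)
        total += np.sum((a - b) ** 2) / eps ** 2  # normalize by the step size
    return total / steps

rng = np.random.default_rng(5)
z1, z2 = rng.normal(size=512), rng.normal(size=512)
# Sanity check: with an identity "embedding" the metric reduces to the
# squared latent distance, since the path is perfectly smooth.
print(np.isclose(path_length(z1, z2, lambda z: z), np.sum((z1 - z2) ** 2)))  # True
```

A smooth, disentangled latent space yields small per-step embedding changes everywhere along the path; spikes indicate points where a tiny latent move flips several visual features at once.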
Implementation Details
StyleGAN was trained on the CelebA-HQ and FFHQ datasets for about one week using 8 Tesla V100 GPUs. It was implemented in TensorFlow, and the official code has been released by NVIDIA.
Conclusion
StyleGAN is a groundbreaking paper that not only produces high-quality and realistic images but also allows for superior control and understanding of generated images, making it even easier than before to generate believable fake images. The techniques presented in StyleGAN, especially the Mapping Network and the Adaptive Normalization (AdaIN), will likely be the basis for many future innovations in GANs.