Making Anime Faces With StyleGAN

A tutorial explaining how to train and generate high-quality anime faces with StyleGAN neural networks, and tips/scripts for effective StyleGAN use.
topics: anime, NGE, NN, Python, technology
created: 4 Feb 2019; modified: 19 Mar 2019; status: in progress; confidence: highly likely; importance: 5

Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.

I show off my StyleGAN anime faces & videos, provide downloads, provide the ‘missing manual’ & explain how I trained them based on Danbooru2017/2018 with source code for the data preprocessing, describe the website “This Waifu Does Not Exist” I set up as a public demo, discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small available datasets, and touch on more advanced StyleGAN applications like encoders & controllable generation.

When Ian Goodfellow’s first GAN paper came out in 2014, with its blurry 64px grayscale faces, I said to myself, “given the rate at which GPUs & NN architectures improve, in a few years, we’ll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious.” There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/illustrations/drawings are so different from the exclusively-photographic datasets always (over)used in contemporary ML research that I was curious how it would work on anime—better, worse, faster, or different failure modes? Even more amusing—if random images become doable, then text→images would not be far behind.

A blond-haired blue-eyed anime face looking at the viewer based on the _Neon Genesis Evangelion_ character, Asuka Souryuu Langley.

So when GANs hit 128px color images on ImageNet, and could do somewhat passable CelebA face samples around 2015, along with my char-RNN experiments, I began experimenting with Soumith Chintala’s implementation of DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of Asuka Souryuu Langley from Neon Genesis Evangelion because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)

It did not work. Despite many runs on my laptop & a borrowed desktop, DCGAN never got remotely near to the level of the CelebA face samples, typically topping out at reddish blobs before diverging or outright crashing.1 Thinking perhaps the problem was too-small datasets & I needed to train on all the faces, I began creating the Danbooru2017 version of “Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”. Armed with an extremely large dataset, I subsequently began working through particularly promising members of the GAN zoo, emphasizing SOTA & open implementations.

Among others, I have tried StackGAN/StackGAN++ & PixelNN (failed to get running)2, WGAN-GP, Glow, GAN-QP, MSG-GAN, SAGAN (multiple implementations), VGAN, PokeGAN3, BigGAN4, ProGAN, & StyleGAN. Glow & BigGAN had promising results reported on CelebA & ImageNet respectively, but unfortunately their training requirements were out of the question.5 (As interesting as SPIRAL and CAN are, no source was released, so I couldn’t even attempt them.)

While some remarkable tools like PaintsTransfer/style2paints were created, and there were the occasional semi-successful anime face GANs like IllustrationGAN, the most notable attempt at anime face generation was Make Girls.Moe (MGM; Jin et al 2017). MGM could, interestingly, do in-browser 256px anime face generation using particularly small GANs, but it is a dead end. MGM accomplished that much by making the problem easier: it added some light supervision in the form of a crude tag embedding6, and then simplified the problem drastically to n=42k faces cropped from professional video game character artwork, which I regarded as not an acceptable solution—the faces were small & boring, and it was unclear if this data-cleaning approach could scale to anime faces in general, much less anime images in general. They are recognizably anime faces but the resolution is low and the quality is not great:

2017 SOTA: 16 random Make Girls.Moe face samples (4x4 grid)

Typically, a GAN would diverge after a day or two of training, or it would collapse to producing a limited range of faces (or a single face), or, if it was stable, simply converge to a low level of quality with a lot of fuzziness; perhaps the most typical failure mode was heterochromia (mismatched eye colors, each color individually plausible; common in anime, but not that common), apparently from the Generator being unable to coordinate with itself to pick consistently. With more recent architectures like VGAN or SAGAN, which carefully weaken the Discriminator or add extremely-powerful components like self-attention layers, I could reach fuzzy 128px faces.

ProGAN (source; video), while expensive and requiring >6 GPU-weeks7, did work and was even powerful enough to overfit single-character face datasets; I didn’t have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting, because a common sequence in DL/DRL is that a problem seems intractable for long periods, until someone modifies a scalable architecture slightly, produces somewhat-credible (not necessarily human or even near-human) results, and then throws a ton of compute/data at it and, since the architecture scales, it rapidly exceeds SOTA and approaches human levels (and potentially exceeds them). Given the miserable failure of all the prior NNs I had tried, I had begun to seriously wonder if there was something about non-photographs which made them intrinsically unable to be easily modeled by convolutional neural networks (the common ingredient to them all). Did convolutions render CNNs unable to generate sharp lines or flat regions of color? Did regular GANs work only because photographs were made almost entirely of blurry textures? But ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat infeasible amounts of training. Now I just needed something faster.

A history of GAN generation of anime faces: ‘do want’ to ‘oh no’ to ‘awesome’

StyleGAN was the final breakthrough in providing ProGAN-level capabilities but fast: by switching to a radically different architecture, it minimized the need for the slow progressive growing (perhaps eliminating it entirely), and learned efficiently at multiple levels of resolution, with bonuses in providing much more control of the generated images with its “style transfer” metaphor.


First, some demonstrations of what is possible with StyleGAN on anime faces:

64 of the best TWDNE anime face samples selected from social media (click to zoom).
100 random sample images from the StyleGAN anime faces on TWDNE

Even a quick look at the MGM & StyleGAN samples demonstrates the latter to be superior in resolution, fine details, and overall appearance (although the MGM faces admittedly have fewer global mistakes). It is also superior to my 2018 ProGAN faces. Perhaps the most striking fact about these faces, which should be emphasized for those fortunate enough not to have spent as much time looking at awful GAN samples as I have, is not that the individual faces are good, but rather that the faces are so diverse, particularly when I look through face samples with 𝜓≥1: it is not just the hair/eye color or head orientation or fine details that differ, but the overall style ranges from CG to cartoon sketch, and even the ‘media’ differ; I could swear many of these are trying to imitate watercolors, charcoal sketching, or oil painting rather than digital drawings, and some come off as recognizably ’90s-anime-style vs ’00s-anime-style. (I could look through samples all day despite the global errors because so many are interesting, which is not something I could say of the MGM model, whose novelty is quickly exhausted; it appears that users of my TWDNE website feel similarly, as the average length of each visit is 1m:55s.)

Interpolation video of the 11 February 2019 face StyleGAN demonstrating generalization.
StyleGAN anime face interpolation videos are Elon Musk™-approved!
Later interpolation video (8 March 2019 face StyleGAN)


StyleGAN was published in 2018 as “A Style-Based Generator Architecture for Generative Adversarial Networks”, Karras et al 2018 (source code; demo video/algorithmic review video/results & discussions video/slides; attempted reimplementation in PyTorch; explainers: Two Minute Papers video). StyleGAN takes the standard GAN architecture embodied by ProGAN (whose source code it reuses) and, like the similar GAN architecture Chen et al 2018, draws inspiration from the field of “style transfer” (essentially invented by Gatys et al 2014): it changes the Generator (G), which creates the image by repeatedly upscaling its resolution, to take, at each level of resolution from 8px→16px→32px→64px→128px etc, a random input or “style noise”, which is combined with AdaIN and tells the Generator how to ‘style’ the image at that resolution by changing the hair or the skin texture and so on. ‘Style noise’ at a low resolution like 32px affects the image relatively globally, perhaps determining the hair length or color, while style noise at a higher level like 256px might affect how frizzy individual strands of hair are. In contrast, ProGAN and almost all other GANs inject noise into the G as well, but only at the beginning, which appears to work not nearly as well (perhaps because it is difficult to propagate that randomness ‘upwards’ along with the upscaled image itself to the later layers to enable them to make consistent choices?). To put it simply, by systematically providing a bit of randomness at each step in the process of generating the image, StyleGAN can ‘choose’ variations effectively.
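The AdaIN operation at the heart of this style injection is simple enough to sketch in a few lines; here is a pure-Python toy for a single feature-map channel (the real implementation operates on 4D tensors and derives the scale & bias from the learned affine transform “A”):

```python
from random import gauss
from statistics import mean, pstdev

def adain(channel, y_scale, y_bias, eps=1e-8):
    """Adaptive instance normalization for one channel: strip the channel's
    own mean/std, then impose the style's scale & bias instead."""
    mu, sigma = mean(channel), pstdev(channel)
    return [y_scale * (v - mu) / (sigma + eps) + y_bias for v in channel]

channel = [gauss(0, 1) for _ in range(10_000)]   # fake feature-map activations
styled = adain(channel, y_scale=2.0, y_bias=5.0)
# whatever the input statistics were, the output now has mean≈5 and std≈2:
print(round(mean(styled), 2), round(pstdev(styled), 2))  # → 5.0 2.0
```

This is why the style noise controls the image so directly: the feature statistics at each resolution are overwritten by the style, rather than merely perturbed.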

Karras et al 2018, StyleGAN vs ProGAN architecture: “Figure 1. While a traditional generator [29] feeds the latent code through the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers and the synthesis network g consists of 18 layers—two for each resolution (4²–1024²). The output of the last layer is converted to RGB using a separate 1×1 convolution, similar to Karras et al. [29]. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in the traditional generator.”

StyleGAN makes a number of additional improvements, but they appear to be less important: for example, it introduces a new “FFHQ” face/portrait dataset with 1024px images in order to show that StyleGAN convincingly improves on ProGAN in final image quality; it switches to a better-behaved loss than the usual logistic-style losses; and, architecture-wise, it makes unusually heavy use of fully-connected (FC) layers to process the initial random input, no less than 8 layers of 512 neurons, where most GANs use 1 or 2 FC layers.8 More striking is that it omits techniques that other GANs have found critical for being able to train at 512px–1024px scale: it does not use newer losses like the relativistic loss, SAGAN-style self-attention layers in either G/D, VGAN-style variational bottlenecks in the D, conditioning on a tag or category embedding, BigGAN-style very-large minibatches, different noise distributions9, advanced regularization like spectral normalization, etc.10

Aside from the FCs and style noise & normalization, it is a fairly vanilla architecture. (One oddity is the use of only 3x3 convolutions & so few layers in each upscaling block; a more conventional upscaling block than StyleGAN’s 3x3→3x3 would be something like BigGAN’s 1x1→3x3→3x3→1x1. It’s not clear if this is a good idea, as it limits the spatial influence of each pixel by providing limited receptive fields11, and may be related to the “blob” artifacts.) Thus, if one has some familiarity with training a ProGAN or another GAN, one can immediately work with StyleGAN with no trouble: the training dynamics are similar, the hyperparameters have their usual meaning, and the codebase is much the same as the original ProGAN (the main exception being that the main training script has been renamed and the original file, which stores the critical configuration parameters, has been moved into training/; there is still no support for command-line options, and StyleGAN must be controlled by editing the configuration by hand).
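Retargeting StyleGAN at a new dataset therefore means hand-editing a handful of configuration fields. A sketch of the kind of edits involved (field names are from memory of the ProGAN/StyleGAN configuration style and are illustrative, not gospel; `EasyDict` below is a tiny stand-in for the attribute-style dict the repo actually uses):

```python
# Illustrative configuration edits only; check field names against your
# checkout of the repo before relying on them.
class EasyDict(dict):
    """Minimal stand-in for dnnlib's EasyDict (attribute-style dict)."""
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__

train, sched = EasyDict(), EasyDict()
desc    = 'stylegan-animefaces'                       # run description/name
dataset = EasyDict(tfrecord_dir='datasets/anime-faces', resolution=512)
train.mirror_augment = True     # free horizontal-flip data augmentation
sched.minibatch_base = 4        # small enough to fit an 11GB GPU at 512px
print(dataset['tfrecord_dir'])  # → datasets/anime-faces
```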


Because of its speed and stability, when the source code was released on 4 February 2019 (a date that will long be noted in the ANNals of GANime), the Nvidia models & sample dumps were quickly perused & new StyleGANs trained on a wide variety of image types, yielding, in addition to the original faces/cars/cats of Karras et al 2018:

Imagequilt visualization of the wide range of visual subjects StyleGAN has been applied to

Why Don’t GANs Work?

Why does StyleGAN work so well on anime images while other GANs worked not at all or slowly at best?

The lesson I took from “Are GANs Created Equal? A Large-Scale Study”, Lucic et al 2017, is that CelebA/CIFAR10 are too easy, as almost all evaluated GAN architectures were capable of occasionally achieving good FID if one simply did enough iterations & hyperparameter tuning.

Interestingly, I consistently observe in training all GANs on anime that clear lines & sharpness & cel-like smooth gradients appear only toward the end of training, after the typically-blurry initial textures have coalesced. This suggests an inherent bias of CNNs: color images work because they provide some degree of texture to start with, but lineart/monochrome material fails because the GAN optimization dynamics flail around. This is consistent with the findings of Geirhos et al 2018, who use style transfer to construct a data-augmented/transformed “Stylized-ImageNet”, showing that ImageNet CNNs are lazy and, because the tasks can be achieved to some degree with texture-only classification (as demonstrated by several of Geirhos et al 2018’s authors via “BagNets”), focus on textures unless otherwise forced. So while CNNs can learn sharp lines & shapes rather than textures, the typical GAN architecture & training algorithm do not make it easy. Since CIFAR10/CelebA can fairly be described as being just as heavy on textures as ImageNet (which is not true of anime images), it is not surprising that GANs train easily on them, starting with textures and gradually refining into good samples, but then struggle on anime.

This raises the question of whether the StyleGAN architecture is necessary at all: might many GANs work, if only one had good style transfer for anime images and could, to defeat the texture bias, generate many versions of each anime image which keep the shape while changing the color palette? (Current style transfer methods, like the one used by Geirhos et al 2018, ironically do not work well on anime images, because they are trained on photographic images, typically using the old VGG model.)


“…Its social accountability seems sort of like that of designers of military weapons: unculpable right up until they get a little too good at their job.”

David Foster Wallace, “E Unibus Pluram: Television and U.S. Fiction”

  • Overfitting: “Aren’t StyleGAN (or BigGAN) just overfitting & memorizing data?”

    Amusingly, this is not a question anyone would have really bothered to ask of earlier GAN architectures, which is a sign of progress. Overfitting is a much better problem to have than underfitting, because overfitting means you can use a smaller model or more data or more aggressive regularization techniques, while underfitting means your approach just isn’t working.

    In any case, while there is currently no way to conclusively prove that cutting-edge GANs are not 100% memorizing (because they should be memorizing to a considerable extent in order to learn image generation, and evaluating generative models is hard in general, and for GANs in particular, because they don’t provide standard metrics like likelihoods which could be used on held-out samples), there are several reasons to think that they are not just memorizing:

    1. sample/dataset overlap: a standard check for overfitting is to compare generated images to their closest matches found by nearest-neighbors lookup (where distance is defined by features like a CNN embedding); an example of this is BigGAN’s Figures 10–14, where the photorealistic samples are nevertheless completely different from the most similar ImageNet datapoints. This has not been done for StyleGAN yet, but I wouldn’t expect different results, as GANs typically pass this check.
    2. semantic understanding: GANs appear to learn meaningful concepts like individual objects, as demonstrated by research tools like GANdissection; image edits like object deletions/additions are difficult to explain without some genuine understanding of images. In the case of StyleGAN anime faces, there are encoders and controllable face generation now which demonstrate that the latent variables do map onto meaningful factors of variation & that the model must have learned them.
    3. latent space smoothness: in general, interpolation in the latent space shows smooth changes of images and logical transformations or variations of face features; if StyleGAN were merely memorizing individual datapoints, the interpolation would be expected to be low quality, yield many terrible faces, and exhibit ‘jumps’ in between points corresponding to real, memorized, datapoints. The StyleGAN anime face models do not exhibit this. (In contrast, the Holo ProGAN, which overfit badly, does show severe problems in its latent space interpolation videos.)

    Which is not to say that GANs do not have issues: “mode dropping” seems to still be an issue for BigGAN despite the expensive large-minibatch training, which is overfitting to some degree, and StyleGAN presumably suffers from it too.
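Latent-space interpolation itself is worth making concrete: interpolation videos are made by walking between random latent vectors, conventionally with spherical interpolation (“slerp”), which respects the geometry of high-dimensional Gaussian latents better than a straight line does. A pure-stdlib sketch:

```python
import math

def slerp(z1, z2, t):
    """Spherical interpolation between latent vectors z1 & z2 at fraction t."""
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    omega = math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
    if omega < 1e-6:               # nearly parallel: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(z1, z2)]
    s1 = math.sin((1 - t) * omega) / math.sin(omega)
    s2 = math.sin(t * omega) / math.sin(omega)
    return [s1 * a + s2 * b for a, b in zip(z1, z2)]

# endpoints reproduce the inputs exactly:
print(slerp([1.0, 0.0], [0.0, 1.0], 0.0))  # → [1.0, 0.0]
```

Feeding each interpolated vector to the Generator and concatenating the frames yields the videos above; smoothness of the resulting faces along the path is what the overfitting check looks for.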

  • Compute requirements: “Doesn’t StyleGAN take too long to train?”

    StyleGAN is remarkably fast-training for a GAN. With the anime faces, I got better results after 1–3 days of StyleGAN training than I’d gotten with >3 weeks of ProGAN training. The training times quoted by the StyleGAN repo may sound scary, but they are, in practice, a steep overestimate of what you actually need, for several reasons:

    • lower resolution: the largest figures are for 1024px images but you may not need them to be that large or even have a big dataset of 1024px images. For anime faces, 1024px-sized faces are relatively rare, and training at 512px & upscaling 2x to 1024 with waifu2x12 works fine & is much faster. Since upscaling is relatively simple & easy, another strategy is to change the progressive-growing schedule: instead of proceeding to the final resolution as fast as possible, instead adjust the schedule to stop at a more feasible resolution & spend the bulk of training time there instead and then do just enough training at the final resolution to learn to upscale (eg spend 10% of training growing to 512px, then 80% of training time at 512px, then 10% at 1024px).
    • diminishing returns: the largest gains in image quality are seen in the first few days or weeks of training; the remaining training is not that useful, as it focuses on improving small details (so just a few days may be more than adequate for your purposes, especially if you’re willing to select a little more aggressively from samples)
    • transfer learning from a related model can save days or weeks of training, as there is no need to train from scratch; with the anime face StyleGAN, one can train a character-specific StyleGAN in a few hours or days at most, and certainly does not need to spend multiple weeks training from scratch (assuming that wouldn’t just cause overfitting). For 1024px, you could use a super-resolution GAN to upscale; alternately, you could change the progressive-growing budget to spend most of your time at 512px and then at the tail end try 1024px.
  • Copyright infringement: “Who owns StyleGAN images?”

    1. The Nvidia source code & released models are under a CC-BY-NC license, and you cannot edit them or produce “derivative works” such as retraining their FFHQ, car, or cat StyleGAN models. If a model is trained from scratch, then that does not apply, as the source code is simply another tool used to create the model, and nothing about the CC-BY-NC license forces you to donate the copyright to Nvidia. (It would be odd if such a thing did happen—if your word processor claimed to transfer the copyrights of everything written in it to Microsoft!)
    2. Models are generally considered “transformative works”, and the copyright owners of whatever data the model was trained on have no copyright on the model. The model is copyrighted to whoever created it. Hence, Nvidia has copyright on the models it created, but I have copyright on the models I trained (which I release under CC-0).
    3. Samples are a little trickier. The usual widely-stated legal interpretation is that, under standard copyright law, only human authors can earn a copyright, and machines, animals, inanimate objects or, most famously, monkeys, cannot. A dump of random samples such as the Nvidia samples or TWDNE therefore has no copyright & is in the public domain. A new copyright can be created, however, if a human author is sufficiently ‘in the loop’, so to speak, as to exert a de minimis amount of creative effort, even if that creative effort is simply selecting a single image out of a dump of thousands of them or twiddling knobs until they get what they like (eg on Make Girls.Moe). Crypko, for example, takes this position.

    Some further reading on computer-generated art copyrights:

Training requirements


“The road of excess leads to the palace of wisdom
…If the fool would persist in his folly he would become wise
…You never know what is enough unless you know what is more than enough.”

William Blake, The Marriage of Heaven and Hell

The necessary size for a dataset depends on the complexity of the domain and whether transfer learning is being used. StyleGAN’s default settings yield a 1024px Generator with 26.2M parameters, which is a large model that can soak up potentially millions of images, so there is no such thing as too much data.

For learning decent-quality anime faces from scratch, a minimum of 5,000 images appears to be necessary in practice; for learning a specific character when starting from the anime face StyleGAN, potentially as few as ~500 (especially with data augmentation) can give good results. For domains as complicated as “any cat photo”, like Karras et al 2018’s cat StyleGAN trained on the ~1.8M13 photos of the LSUN Cats category, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that “Cats continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds.”14


To fit reasonable minibatch sizes, one will want GPUs with >11GB VRAM. At 512px, 11GB fits only a minibatch of n=4, and going below that means training will be even slower (and you may have to reduce learning rates to avoid instability). So, an Nvidia 1080ti & up would be good. (Reportedly, AMD/OpenCL works for running StyleGAN models, but I haven’t seen anyone state that they have trained StyleGAN on AMD GPUs with an OpenCL backend.)

The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & provide a nominal AWS cost estimate):

| GPUs | 1024² | 512² | 256² | AWS costs15 |
|------|-------|------|------|-------------|
| 1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115] |
| 2 | 21 days 22 hours [1052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA] |
| 4 | 11 days 8 hours [1088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA] |
| 8 | 6 days 14 hours [1264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382] |

AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound; 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary—with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/StyleGAN particularly struggle with backgrounds & edges of images because those are cut off, obscured, and highly-varied compared to the faces, whether anime or FFHQ.)

256px StyleGAN anime faces after ~46 GPU-hours vs 512px anime faces after 382 GPU-hours

Data Preparation

The single most difficult part of running StyleGAN is preparing the dataset properly. StyleGAN does not, unlike most GAN implementations (particularly PyTorch ones), support reading a directory of files as input; it can only read its unique .tfrecord format, which stores each image as raw arrays at every relevant resolution.16 Thus, input files must be perfectly uniform, must be slowly converted to the .tfrecord format by the included dataset tool, and will take up ~19x more disk space.17

A StyleGAN dataset must consist of images all formatted exactly the same way: they must be precisely 512x512px or 1024x1024px etc (512x513px will not work), they must all be in the same colorspace (you cannot mix sRGB and grayscale JPGs), the filetype must be the same as that of the model you intend to retrain (ie you cannot retrain a PNG-trained model on a JPG dataset; StyleGAN will crash with inscrutable errors), and there must be no subtle errors like CRC checksum errors, which image viewers or libraries like ImageMagick will ignore.

Faces preparation

My workflow:

  1. Download raw images from Danbooru2018 if necessary
  2. Extract from the JSON Danbooru2018 metadata all the IDs of a subset of images if a specific Danbooru tag (such as a single character) is desired, using jq and shell scripting
  3. Crop anime faces from raw images using Nagadomi’s lbpcascade_animeface (regular face-detection methods do not work on anime images)
  4. Delete empty files, monochrome or grayscale files, & exact-duplicate files
  5. Convert to JPG
  6. Upscale below the target resolution (512px) images with waifu2x
  7. Convert all images to exactly 512x512 resolution sRGB JPG images
  8. If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN’s Discriminator
  9. Convert to StyleGAN’s .tfrecord format using its dataset tool

The goal is to turn this:

100 random sample images from the 512px SFW subset of Danbooru in a 10x10 grid.

into this:

36 random sample images from the cropped Danbooru faces in a 6x6 grid.


The Danbooru2018 download can be done via BitTorrent or rsync, which provides a JSON metadata tarball which unpacks into metadata/2* & a folder structure of {original,512px}/{0-999}/$ID.{png,jpg,...}.

For training on SFW whole images, the 512px/ version of Danbooru2018 would work, but it is not a great idea for faces because by scaling images down to 512px, a lot of face detail has been lost, and getting high-quality faces is a challenge. The SFW IDs can be extracted from the filenames in 512px/ directly or from the metadata by extracting the id & rating fields (and saving to a file):
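The extraction logic is simple; a Python-stdlib equivalent of the jq one-liner (the metadata is newline-delimited JSON with `id` & `rating` fields, where rating `"s"` means safe):

```python
# Filter newline-delimited Danbooru metadata down to the SFW post IDs.
import json

def sfw_ids(lines):
    ids = []
    for line in lines:
        post = json.loads(line)
        if post.get("rating") == "s":   # "s"afe vs "q"uestionable/"e"xplicit
            ids.append(post["id"])
    return ids

sample = ['{"id": "1", "rating": "s"}', '{"id": "2", "rating": "e"}']
print(sfw_ids(sample))  # → ['1']
```

In the real workflow one would stream the `metadata/2*` files through this (or through jq) and write the resulting ID list to a file for the cropping step.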

After installing and testing Nagadomi’s lbpcascade_animeface to make sure it & OpenCV work, one can use a simple script which crops the face(s) from a single input image. The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors—non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250x250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative, more-difficult-to-run library by Nagadomi which offers a face-cropping script, animeface-2009’s face_collector.rb, which Nagadomi says is better at cropping faces, but I was not impressed when I tried it out.)

The IDs can be combined with the provided lbpcascade_animeface script using xargs; however, this will be far too slow, and it would be better to exploit parallelism with parallel. It’s also worth noting that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no apparent speedup (a slowdown, if anything, given limited VRAM), so I find it helps to explicitly disable GPU use by setting CUDA_VISIBLE_DEVICES="". (For this step, it’s quite helpful to have a many-core system like a Threadripper.)

Combining everything, parallel face-cropping of an entire Danbooru2018 subset can be done like this:
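Concretely, something along these lines (`crop.py` stands in for whatever per-image cropping script you use, and the directory names are illustrative; `xargs -P` is shown since it is universally available, though GNU `parallel` is a drop-in improvement):

```shell
# Fan the cropping out across all cores; lbpcascade_animeface gains nothing
# from the GPU, so hide it from OpenCV entirely.
export CUDA_VISIBLE_DEVICES=""
mkdir -p original faces    # 'original' holds the Danbooru subset to crop
find ./original -type f \( -name '*.jpg' -o -name '*.png' \) -print0 \
    | xargs -0 -r -P "$(nproc)" -I{} python crop.py {} faces/
```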

Cleaning & Upscaling

Miscellaneous cleanups can be done:
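For example, a coreutils-only sketch (detecting monochrome/grayscale images additionally needs ImageMagick, eg inspecting `identify -format '%[colorspace]'` output, so only empty files & exact duplicates are handled here; `faces/` is the cropped-face directory from the previous step):

```shell
mkdir -p faces
# delete zero-byte files left behind by failed crops:
find faces/ -type f -size 0 -delete
# delete exact duplicates: hash every file, keep the first of each hash group
md5sum faces/* 2>/dev/null | sort | awk 'seen[$1]++ { print $2 }' | xargs -r rm --
```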

The next major step is upscaling images using waifu2x, which does an excellent job on 2x upscaling of anime images, with results nigh-indistinguishable from a higher-resolution original, greatly increasing the usable corpus. The downsides are that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2x1080ti), and is written in a now-unmaintained DL framework, Torch, with no current plans to port to PyTorch, so it is gradually becoming harder to get running (hopefully, by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, and waifu2x is worth the trouble.

Quality Checks & Data Augmentation

At this point, one can do manual quality checks by viewing a few hundred images, running findimagedupes -t 99% to look for near-identical faces, or dabble in further modifications such as doing “data augmentation”. Working with Danbooru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN; indeed, one will have difficulty even storing the final StyleGAN dataset, because of its sheer size (due to the ~18x size multiplier).

However, if that is not enough or one is working with a small dataset like for a single character, data augmentation may be necessary. The mirror/horizontal flip is not necessary as StyleGAN has that built-in as an option18, but there are many other possible data augmentations. One can stretch, shift colors, sharpen, blur, increase/decrease contrast/brightness, crop, and so on. An example, extremely aggressive, set of data augmentations could be done like this:
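As a minimal array-level illustration of a couple of these operations (a toy sketch in plain numpy; a real pipeline would use ImageMagick or a dedicated augmentation library):

```python
import numpy as np

def augment(img, rng):
    """Return a randomly flipped / brightness- & contrast-jittered copy of an
    HxWx3 uint8 image. A toy sketch of data augmentation, not a full pipeline."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:                 # horizontal flip (StyleGAN can do this itself)
        out = out[:, ::-1, :]
    out = out * rng.uniform(0.8, 1.2)      # contrast/brightness scaling
    out = out + rng.uniform(-20, 20)       # brightness shift
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
aug = augment(img, rng)
```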

Upscaling & Conversion

Once any quality fixes or data augmentation are done, it’d be a good idea to save a lot of disk space by converting to JPG & lossily reducing quality (I find 33% saves a ton of space at no visible change):
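One way to do this with ImageMagick (mogrify converts in place; quality 33 per above):

```shell
# Convert all PNG crops to quality-33 JPGs & remove the PNG originals:
cd faces/ && mogrify -format jpg -quality 33 *.png && rm *.png
```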

WARNING: remember that StyleGAN models are only compatible with images of the type they were trained on, so if you are using a StyleGAN pretrained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN models), you will need to keep using PNGs.

Doing the final scaling to exactly 512px can be done at many points, but I generally postpone it to the end in order to work with images in their ‘native’ aspect-ratio for as long as possible. At this point, we carefully tell ImageMagick to rescale everything to 512x512, preserving the aspect ratio & filling in with a black background as necessary:
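A sketch of such a command (ImageMagick’s -resize shrinks to fit, and -extent with a black background pads out to exactly 512x512):

```shell
# Shrink each image to fit within 512x512, then pad to exactly 512x512
# with centered black borders:
mogrify -resize 512x512 -gravity center -background black -extent 512x512 faces/*.jpg
```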

Any slightly-different image could crash the import process. Therefore, we delete any image which is even slightly different from the 512x512 sRGB JPG they are supposed to be:
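One way to enforce this with ImageMagick’s identify (a sketch; anything whose reported properties don’t match is deleted):

```shell
# Delete any file which is not exactly a 512x512 sRGB image:
for f in faces/*.jpg; do
    identify -format '%w %h %[colorspace]' "$f" | grep -q '^512 512 sRGB$' || rm -- "$f"
done
```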

Having done all this, we should have a large consistent high-quality dataset.

Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format using dataset_tool.py. It is worth remembering at this point how fragile dataset_tool.py is, and what it requires: every image must be perfectly identical in resolution, colorspace, & file format, or it will fail. ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem.

Because of the extreme fragility of dataset_tool.py, I strongly advise that you edit it to print out the filenames of each file as they are being processed, so that when (not if) it crashes, you can investigate the culprit and check the rest. The edit could be as simple as this:
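The idea, shown here as a generic stand-alone sketch rather than the actual dataset_tool.py internals, is just to log each filename before processing it so the offending file is obvious:

```python
def process_all(filenames, process_one):
    """Process images one at a time, printing each filename first so that
    when (not if) something crashes, the offending file is known."""
    for fname in filenames:
        print(fname)
        try:
            process_one(fname)
        except Exception as e:
            raise RuntimeError("failed on: " + fname) from e

# Example: the second "file" triggers an error, and the error names it.
try:
    process_all(["ok.jpg", "bad.jpg"], lambda f: 1/0 if "bad" in f else None)
except RuntimeError as e:
    print(e)  # failed on: bad.jpg
```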

There should be no issues if all the images were thoroughly checked earlier, but should an image crash it, it can be checked in more detail with identify. (I advise just deleting such images and not trying to rescue them.)

Then the conversion is just (assuming StyleGAN prerequisites are installed, see next section):
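Assuming the stock dataset_tool.py CLI, with faces/ as the cleaned image directory (create_from_images builds the multi-resolution .tfrecords, hence the ~18x size multiplier):

```shell
python dataset_tool.py create_from_images datasets/faces faces/
```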

Congratulations, the hardest part is over. Most of the rest simply requires patience (and a willingness to edit Python files directly in order to configure StyleGAN).



I assume you have CUDA installed & functioning. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have successfully used the Nvidia driver version #410.104, CUDA 10.1, and TensorFlow 1.13.1.)

A Python ≥3.619 virtual environment can be set up for StyleGAN to keep dependencies tidy, and TensorFlow & the StyleGAN dependencies installed:
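A sketch of such a setup (the package list is a guess at the usual StyleGAN prerequisites; pin versions to match your CUDA install, eg. TensorFlow 1.13.1 as above):

```shell
python3 -m venv stylegan-env
. stylegan-env/bin/activate
pip install tensorflow-gpu==1.13.1 pillow numpy requests moviepy
git clone https://github.com/NVlabs/stylegan.git
cd stylegan/
```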

StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks, using this Colab notebook. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little, or who are doing something quick like extremely low-resolution training or transfer-learning where a few hours on a slow small GPU might be enough.


StyleGAN doesn’t ship with any support for CLI options; instead, one must edit train.py and training/training_loop.py:

  1. training/training_loop.py

    The core configuration is done in the argument defaults of the training_loop function, beginning at line 112.

    The key arguments are G_smoothing_kimg & D_repeats (which affect the learning dynamics), network_snapshot_ticks (how often to save the pickle snapshots—more frequent means less progress lost in crashes, but as each one weighs 300MB+, they can quickly use up gigabytes of space), resume_run_id (set it to "latest"), and resume_kimg (which governs where in the overall progressive-growing training schedule StyleGAN starts from; when doing transfer learning, it is vitally important to set it to a sufficiently high number that training begins at the highest desired resolution, like 512px).

    Note that many of these variables are overridden in train.py, like the learning rates. It’s better to set those in train.py, or else you may confuse yourself badly (like I did in wondering why ProGAN & StyleGAN seemed bizarrely robust to large changes in the learning rates…)

  2. train.py (previously config.py)

    Here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. Learning rate & minibatch should generally be left alone (except towards the end of training when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, like thus:
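    The relevant lines might look like this (field names per the stock train.py; 'faces' is the .tfrecords directory created by dataset_tool.py):

```python
desc += '-faces';    dataset = EasyDict(tfrecord_dir='faces', resolution=512);    train.mirror_augment = True
```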

    This sets up the 512px face dataset which was previously created in dataset/faces, turns on mirroring (because while there may be writing in the background, we don’t care about it for face generation), and sets a title for the checkpoints/logs, which will now appear in results/ with the ‘-faces’ string.

    Assuming you do not have 8 GPUs (as you probably do not), you must change the -preset- setting to match your number of GPUs; StyleGAN will not automatically choose the correct number of GPUs. If you fail to set the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with an opaque error message (note that CUDA uses zero-indexing, so /device:GPU:0 refers to the first GPU, /device:GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my—nonexistent—third GPU):

    tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
        G_synthesis_3/lod: {{node G_synthesis_3/lod}}was explicitly assigned to /device:GPU:2 but available \
        devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
        /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
        /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
        Make sure the device specification refers to a valid device.
         [[{{node G_synthesis_3/lod}}]]

    For my 2x1080ti I’d set:
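    ie. enabling the 2-GPU preset, something like the following (exact minibatch schedules differ between StyleGAN versions, so treat these values as illustrative):

```python
desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8; sched.minibatch_dict = {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}
```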

    So my results get saved to results/00001-sgan-faces-2gpu etc (the run ID increments, ‘sgan’ because StyleGAN rather than ProGAN, ‘-faces’ as the dataset being trained on, and ‘2gpu’ because it’s multi-GPU).


I typically run StyleGAN in a screen session which can be detached and keeps multiple shells organized: 1 terminal/shell for the StyleGAN run, 1 terminal/shell for TensorBoard, and 1 for Emacs.

With Emacs, I keep the two key Python files open (train.py and training/training_loop.py) for reference & easy editing.

With the “latest” patch, StyleGAN can be thrown into a while-loop to keep running after crashes, like:

TensorBoard is a logging utility which displays little time-series of recorded variables which one views in a web browser, eg:
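eg. (6006 is TensorBoard’s default port):

```shell
tensorboard --logdir results/ --port 6006 &
# then browse to http://localhost:6006/
```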

Note that TensorBoard can be backgrounded, but needs to be updated every time a new run is started as the results will then be in a different folder.

Training StyleGAN is much easier & more reliable than other GANs, but it is still more of an art than a science. (We put up with it because while GANs suck, everything else sucks more.) Notes on training:

Here is what a successful training progression looks like for the anime face StyleGAN:

Training montage video of the first 9k iterations of the anime face StyleGAN.

The anime face model as of 8 March 2019, trained for 21,980 iterations or ~21m images or ~38 GPU-days, is available for download. (It is still not fully-converged, but the quality is good.)


Having successfully trained a StyleGAN, now one wants samples.

Psi/“truncation trick”

The “truncation trick” (BigGAN discussion, StyleGAN discussion; apparently first introduced by Marchesi 2017) is the most important hyperparameter for all StyleGAN generation.

The truncation trick is used at sample generation time but not training time. The idea is to edit the latent vector, which is a vector of 𝒩(0,1) variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those.20 This seems to help by avoiding ‘extreme’ latent values or combinations of latent values which the G is not as good at—a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted distribution for each one, or one can truncate them at something like +1SD or +0.7SD. (Like the discussion of the best distribution for the original latent distribution, there’s no good reason to think that this is an optimal method of doing truncation; there are many alternatives, such as ones penalizing the sum of the variables, either rejecting them or scaling them down.)
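The resampling variant described above can be sketched in a few lines of numpy (a toy illustration, not the official StyleGAN implementation, which instead interpolates the w latents toward their mean):

```python
import numpy as np

def truncated_latents(n, dim, threshold=0.7, rng=None):
    """Sample n latent vectors from N(0,1)^dim, resampling any coordinate whose
    absolute value exceeds `threshold` until all lie in [-threshold, +threshold]."""
    if rng is None:
        rng = np.random.default_rng()
    z = rng.standard_normal((n, dim))
    mask = np.abs(z) > threshold
    while mask.any():                      # resample only the offending coordinates
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z

z = truncated_latents(4, 512, threshold=0.7, rng=np.random.default_rng(0))
```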

At 𝜓=0, diversity is nil and all faces are a single global average face (a brown-eyed brown-haired schoolgirl, unsurprisingly); at ±0.5 you have a broad range of faces, and by ±1.2, you’ll see tremendous diversity in faces/styles/consistency but also tremendous artifacting & distortion. Where you set your 𝜓 will heavily influence how ‘original’ outputs look. At 𝜓=1.2, they are tremendously original but extremely hit or miss. At 𝜓=0.5 they are consistent but boring. For most of my sampling, I set 𝜓=0.7 which strikes the best balance between craziness/artifacting and quality/diversity. (Personally, I prefer to look at 𝜓=1.2 samples because they are so much more interesting, but if I released those samples, it would give a misleading impression to readers.)

Random Samples

The StyleGAN repo has a simple script to download & generate a single face; in the interests of reproducibility, it hardwires the model and the RNG seed so it will only generate 1 particular face. However, it can be easily adapted to use a local model and (slowly21) generate, say, 1000 sample images with the hyperparameter 𝜓=0.6 (which gives high-quality but not highly-diverse images) which are saved to results/example-{0-999}.png:

Karras et al 2018 Figures

The figures in Karras et al 2018, demonstrating random samples and aspects of the style noise using the 1024px FFHQ face model (as well as the others), were generated by generate_figures.py. This script needs extensive modifications to work with my 512px anime faces; going through the file:

  • the code uses 𝜓=1 truncation, but faces look better with 𝜓=0.7 (several of the functions have truncation_psi= settings but, trickily, Figure 3’s draw_style_mixing_figure has its 𝜓 setting hidden away in the synthesis_kwargs global variable)
  • the loaded model needs to be switched to the anime face model, of course
  • dimensions must be reduced 1024→512 as appropriate; some ranges are hardcoded and must be reduced for 512px images as well
  • the truncation trick figure 8 doesn’t show enough faces to give insight into what the latent space is doing so it needs to be expanded to show both more random seeds/faces, and more 𝜓 values
  • the bedroom/car/cat samples should be disabled

The changes I make are as follows:

diff --git a/generate_figures.py b/generate_figures.py
index 45b68b8..f27af9d 100755
--- a/generate_figures.py
+++ b/generate_figures.py
@@ -24,16 +24,13 @@ url_bedrooms    = '
 url_cars        = '' # karras2019stylegan-cars-512x384.pkl
 url_cats        = '' # karras2019stylegan-cats-256x256.pkl

-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)

 _Gs_cache = dict()

 def load_Gs(url):
-    if url not in _Gs_cache:
-        with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
-            _G, _D, Gs = pickle.load(f)
-        _Gs_cache[url] = Gs
-    return _Gs_cache[url]
+    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+    return Gs

 # Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
     canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
     for row, seed in enumerate(seeds):
         latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
-        images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+        images = Gs.run(latents, None, **synthesis_kwargs)
         canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
         for i in range(4):
             crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
     all_images = []
     for noise_range in noise_ranges:
         tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
-        range_images = Gs.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+        range_images = Gs.run(latents, None, randomize_noise=False, **synthesis_kwargs)
         range_images[flips, :, :] = range_images[flips, :, ::-1]

@@ -144,14 +141,11 @@ def draw_truncation_trick_figure(png, Gs, w, h, seeds, psis):
 def main():
     os.makedirs(config.result_dir, exist_ok=True)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
-    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
-    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
-    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
-    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])

All this done, we get some fun anime face samples to parallel Karras et al 2018’s figures:

Anime face StyleGAN, Figure 2, uncurated samples
Anime face StyleGAN, Figure 2, uncurated samples
Figure 3, “style mixing” of source/transfer faces, demonstrating control & interpolation (top row=style, left column=target to be styled)
Figure 3, “style mixing” of source/transfer faces, demonstrating control & interpolation (top row=style, left column=target to be styled)
Figure 8, the “truncation trick” visualized: 10 random faces, with the range 𝜓 = [1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1]—demonstrating the tradeoff between diversity & quality, and the global average face.
Figure 8, the “truncation trick” visualized: 10 random faces, with the range 𝜓 = [1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1]—demonstrating the tradeoff between diversity & quality, and the global average face.


Training Montage

The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated, and at the end can be quite large (often 14MB each for the anime faces) so doing lossy compression with a tool like pngnq or converting them to JPG with lowered quality is a good idea. To turn the many snapshots into a training montage video like above, I use FFmpeg on the PNGs:
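For example (the run directory & encoding settings are illustrative):

```shell
# Stitch the fakes*.png training snapshots into a 10fps montage video:
ffmpeg -framerate 10 -pattern_type glob -i 'results/00001-sgan-faces-2gpu/fakes*.png' \
    -c:v libx264 -pix_fmt yuv420p -crf 20 training-montage.mp4
```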


The original ProGAN repo provided a config for generating interpolation videos, but that was removed in StyleGAN. Cyril Diagne implemented a replacement, providing 3 kinds of videos:

  1. random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default, it makes 4 of them arranged 2x2 in the video. Several interpolation videos are shown in the examples section.

  2. interpolate.mp4: a ‘coarse’ “style mixing” video; a single ‘source’ face is generated & held constant; a secondary interpolation video, a random walk as before is generated; at each step of the random walk, the ‘coarse’/high-level ‘style’ noise is copied from the random walk to overwrite the source face’s original style noise. For faces, this means that the original face will be modified with all sorts of orientations & facial expressions while still remaining recognizably the original character. (It is the video analog of Karras et al 2018’s Figure 3.)

    ‘Coarse’ style-transfer/interpolation video

  3. fine_503.mp4: a ‘fine’ style mixing video; in this case, the style noise is taken from later on and instead of affecting the global orientation or expression, it affects subtler details like the precise shape of hair strands or hair color or mouths.

    ‘Fine’ style-transfer/interpolation video

Circular interpolations, written by snow halcy, are another interesting kind of interpolation: instead of random-walking around the latent space freely, with large or awkward transitions, it moves around a fixed high-dimensional point, doing “binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a circle around a point in the latent space.” Cleaned up into a stand-alone program:

import dnnlib.tflib as tflib
import math
import moviepy.editor
from numpy import linalg
import numpy as np
import pickle

def main():
    tflib.init_tf()  # initialize the TF session (required before unpickling networks)

    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    def circ_generator(latents_interpolate):
        radius = 40.0

        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)

        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius

        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        # cast to float first to avoid uint8 wraparound when differencing frames
        return (np.square(x.astype(np.float64) - y.astype(np.float64))).mean()

    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0

        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = [current_image]

        video_length = 1.0
        while(current_pos < video_length):

            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0

                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0

                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            array_list.append(current_image)
            print(current_pos, current_mse)
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20

    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()
‘Circular’ interpolation video



include roadrunner01’s video since it’s famous now: fine-detail control samples random faces dump: 2019-02-14-stylegan-faces-02021-010483.tar 165.0 MB!2DRDQIjJ!JKQ_DhEXCzeYJXjliUSWRvE-_rfrvWv_cq3pgRuFadw

Transfer Learning

Fine-tuning notes: one apparently has to start at the highest resolution: begin at 512px (eg. by setting resume_kimg=7000), and fakes00000.png should not show up, because its appearance indicates beginning from scratch & ignoring the pretrained model; StyleGAN seems to erase all resolutions higher than the starting point, for unknown reasons. Obviously, a very big problem for transfer learning. The datasets also need to be reasonably similar: finetuning only works for closely related domains, like bedrooms→kitchens or all-faces→single-character-faces. Any particular character’s face is already a face the face-StyleGAN knows how to generate reasonably well, so one is simply specializing it to a specific kind of face; but knowing how to draw a face doesn’t help much with, say, landscape drawings. Transfer may still be faster than training from scratch, but you won’t get good results in mere hours (all anime faces → a specific anime face) or days (anime faces → real faces). Aim for n>500, n>5000 even better; one can use data augmentation.


StyleGAN blind-test success: hand-selected samples: /docs/ai/ ~100, 24MB Holo ProGAN failure:


hand-selected examples: /docs/ai/ ~100, 13MB StyleGAN blind-test success:



/u/Ending_Credits - Saber (Fate/Stay Night): - Louise (Zero no Tsukaima):



Code Geass n=50 (!) interps


finetuned on a full-headshot dataset?


TWDNE was a huge success and popularized the anime face StyleGAN. It was not perfect, though, and flaws were noted.

The main issues I saw for the faces:

  1. the sexually-suggestive fluids: because I had not expected StyleGAN to work or to wind up making something like TWDNE, I had not taken the effort to crop faces solely from the SFW subset, since no GAN had proven to be good enough to pick up any embarrassing details and I was more concerned with maximizing the dataset size. The explicitly-NSFW images make up only ~9% of Danbooru but between the SFW-but-suggestive images and the explicit ones, and StyleGAN’s learning capabilities, this proved to be enough to make some of the faces quite naughty-looking. Naturally, everyone insisted on joking about this.

  2. head crops: Nagadomi’s face-cropper is a face cropper, not a head-cropper or a portrait-cropper; it centers its crops on the center of a face (like the nose) and will cut off all the additional details associated with anime heads such as the ‘ahoge’ or bunny ears or twin-tails. Similarly, I had left Nagadomi’s face-cropper on the default settings instead of bothering to tweak it to produce more head-shot-like crops—since if GANs couldn’t master the faces there was no point in making the problem even harder & worrying about details of the hair.

    This was not good for characters with distinctive hats or hair or animal ears (such as Holo’s wolf ears).

  3. Background/bodies: I suspected that the tightness of the crops also made it hard for StyleGAN to learn things in the edges, like backgrounds or shoulders, because they would always be partial if the face-cropper was doing its job. With bigger crops, there would be more variation and more opportunity to see whole shoulders or large unobstructed backgrounds, and this might lead to more convincing overall images.

  4. Holo/Asuka overrepresentation: to my surprise, TWDNE viewers seemed quite annoyed by the overrepresentation of Holo/Asuka-like (but mostly Holo) samples. For the same reason as not filtering to SFW, I had thrown in 2 earlier datasets I had made of Holo & Asuka faces—I had made them at 512px, and cleaned them fairly thoroughly, and they would increase the dataset size, so why not? Being overrepresented, and well-represented in Danbooru (a major part of why I had chosen them in the first place to make prototype datasets with), of course StyleGAN was more likely to generate samples looking like them than other popular anime characters.22 Why this annoyed people, I don’t understand, but it might as well be fixed.

  5. persistent artifacts: despite the generally excellent results, there are still occasional bizarre anomalous images which are scarcely faces at all, even with 𝜓=0.7; I suspect that this may be due to the small percentage of non-faces, cut-off faces, or just poorly/weirdly drawn faces, and that more stringent data cleaning would help polish the model.

Issues #1–3 can be fixed by transfer-learning StyleGAN on a new dataset made of faces from the SFW subset and cropped with much larger margins to produce more ‘portrait’-style face crops. (There would still be many errors or suboptimal crops but I am not sure there is any full solution short of training a face-localization CNN just for anime images.)

For this, I needed to edit the lbpcascade_animeface cropping script to adjust the margins. Experimenting, I changed the cropping line to:
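The general form of such an expanded crop can be sketched as a pure function on the detector’s bounding box (the specific margin factor here is illustrative, not my exact setting):

```python
def expand_bbox(x, y, w, h, img_w, img_h, margin=0.5):
    """Expand a face bounding box by `margin` (a fraction of the box size) on
    every side, clamped to the image boundaries, turning a tight face crop
    into a looser 'portrait' crop including hair, hats, ears, etc."""
    mx, my = int(w * margin), int(h * margin)
    x0 = max(0, x - mx)
    y0 = max(0, y - my)
    x1 = min(img_w, x + w + mx)
    y1 = min(img_h, y + h + my)
    return x0, y0, x1, y1

# A 100x100 face at (200, 200) in a 512x512 image becomes a 200x200 crop:
print(expand_bbox(200, 200, 100, 100, 512, 512))  # (150, 150, 350, 350)
```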

These margins seemed to deliver acceptable results which generally show the entire head while leaving enough room for extra background or hats/ears (although there is still the occasional error like a 4-koma comic or image with multiple faces or heads still partially cropped):

100 real faces from the ‘portrait’ dataset (SFW Danbooru2018 cropped with expanded margins) in a 10x10 grid
100 real faces from the ‘portrait’ dataset (SFW Danbooru2018 cropped with expanded margins) in a 10x10 grid

After cropping all ~2.8m SFW Danbooru2018 full-resolution images (as demonstrated in the cropping section), I was left with ~700k faces.

Issue #4 is solved by just not adding the Asuka/Holo datasets.

Finally, issue #5 is harder to deal with: pruning 200k+ images by hand is infeasible, there’s no easy way to improve the face cropping script, and I don’t have the budget to Mechanical-Turk review all the faces like Karras et al 2018 did for FFHQ to remove their false positives (like statues). One way I do have to improve it is to exploit the Discriminator of a pretrained face GAN; since the D is all about evaluating the probability of a face being a face, it automatically flags outliers & can be used for data cleaning—run the D on the whole dataset to rank each image (faster than it seems since the G & backpropagation are unnecessary, even a large dataset can be ranked in an hour or two), then one can review manually the bottom X%, or perhaps just delete the bottom X% sight unseen if enough data is available. The anime face StyleGAN D would be ideal since it clearly works so well already, but the requirement to load images from the .tfrecords stymied me.

As it happens, I have an earlier PyTorch BigGAN model which I hacked to read files from folders & rank them which I could use, and while not nearly as good a D, it does manage to at least rank a lot of bad faces below-average and is usable for this purpose. It’s not a great approach but I did use it, so I will document it here. The hack to make the data-loading functions keep filenames while still reading from a folder (otherwise it would simply print out the D loss on its own, which is useless) is quite nasty:

diff --git a/data_loader.py b/data_loader.py
index 736362c..119ffed 100755
--- a/data_loader.py
+++ b/data_loader.py
@@ -29,7 +29,7 @@ class Data_Loader():
         transforms = self.transform(True, True, True, False)
         dataset = dsets.LSUN(self.path, classes=classes, transform=transforms)
         return dataset
     def load_imagenet(self):
         transforms = self.transform(True, True, True, True)
         dataset = dsets.ImageFolder(self.path+'/imagenet', transform=transforms)
@@ -42,9 +42,15 @@ class Data_Loader():

     def load_off(self):
         transforms = self.transform(True, True, True, False)
+        # dataset = ImageFolderWithPaths(self.path, transform=transforms)^M
         dataset = dsets.ImageFolder(self.path, transform=transforms)
         return dataset

+    def load_rank(self):^M
+        transforms = self.transform(True, True, True, True)^M
+        dataset = ImageFolderWithPaths(self.path, transform=transforms)^M
+        return dataset^M
     def loader(self):
         if self.dataset == 'lsun':
             dataset = self.load_lsun()
@@ -54,6 +60,8 @@ class Data_Loader():
             dataset = self.load_celeb()
         elif self.dataset == 'off':
             dataset = self.load_off()
+        elif self.dataset == 'rank':^M
+            dataset = self.load_rank()^M

         loader =,
@@ -63,3 +71,18 @@ class Data_Loader():
         return loader

+class ImageFolderWithPaths(dsets.ImageFolder):
+    """Custom dataset that includes image file paths. Extends
+    torchvision.datasets.ImageFolder
+    """
+    # override the __getitem__ method. this is the method dataloader calls
+    def __getitem__(self, index):
+        # this is what ImageFolder normally returns
+        original_tuple = super(ImageFolderWithPaths, self).__getitem__(index)
+        # the image file path
+        path = self.imgs[index][0]
+        # make a new tuple that includes original and the path
+        tuple_with_path = (original_tuple + (path,))
+        return tuple_with_path
diff --git a/ b/
index 0b59c5a..56b2128 100755
--- a/
+++ b/
@@ -20,7 +20,7 @@ def get_parameters():
     parser.add_argument('--version', type=str, default='sagan_1')

     # Training setting
-    parser.add_argument('--total_step', type=int, default=1000000, help='how many times to update the generator')
+    parser.add_argument('--total_step', type=int, default=10000000000, help='how many times to update the generator')
     parser.add_argument('--d_iters', type=float, default=5)
     parser.add_argument('--batch_size', type=int, default=64)
     parser.add_argument('--num_workers', type=int, default=12)
@@ -37,7 +37,7 @@ def get_parameters():
     parser.add_argument('--train', type=str2bool, default=True)
     parser.add_argument('--parallel', type=str2bool, default=False)
     parser.add_argument('--gpus', type=str, default='0', help='gpuids eg: 0,1,2,3  --parallel True  ')
-    parser.add_argument('--dataset', type=str, default='lsun', choices=['lsun', 'celeb','off'])
+    parser.add_argument('--dataset', type=str, default='lsun', choices=['lsun', 'celeb','off', 'rank'])
     parser.add_argument('--use_tensorboard', type=str2bool, default=False)

     # Path

This done, a trained model can be run with a simplified version of the training loop which serves purely as a ranker:

import torch
import torch.nn as nn
from model_resnet import Discriminator
from parameter import *
from data_loader import Data_Loader
from torch.backends import cudnn
import os
import sys

class Trainer(object):
    def __init__(self, data_loader, config):

        # Data loaders
        self.data_loader = data_loader

        # exact model and loss
        self.model = config.model

        # Model hyper-parameters
        self.imsize = config.imsize
        self.parallel = config.parallel
        self.gpus = config.gpus

        self.batch_size = config.batch_size
        self.num_workers = config.num_workers
        self.pretrained_model = config.pretrained_model

        self.dataset = config.dataset
        self.image_path = config.image_path
        self.version = config.version

        self.n_class = 1000 # config.n_class TODO
        self.chn = config.chn

        # Path
        self.model_save_path = os.path.join(config.model_save_path, self.version)

        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Start with trained model

    def train(self):

        # Data iterator
        data_iter = iter(self.data_loader)

        total_steps = self.data_loader.__len__()
        for step in range(0, total_steps):
            real_images, real_labels, real_paths = next(data_iter)
            real_labels = real_labels.to(self.device)
            real_images = real_images.to(self.device)
            # the D's output for a real image serves as its quality score
            d_out_real = self.D(real_images, real_labels)

            rankings = d_out_real.detach().cpu().numpy()
            for i in range(0, len(real_paths)):
                print(real_paths[i], rankings[i])

    def load_pretrained_model(self):
        self.D.load_state_dict(torch.load(os.path.join(
            self.model_save_path, '{}_D.pth'.format(self.pretrained_model))))

    def build_model(self):
        self.D = Discriminator(self.n_class, chn=self.chn).to(self.device)
        if self.parallel:
            gpus = [int(i) for i in self.gpus.split(',')]
            self.D = nn.DataParallel(self.D, device_ids=gpus)

def main(config):
    cudnn.benchmark = True
    # Data loader
    data_loader = Data_Loader(config.train, config.dataset, config.image_path, config.imsize,
                              config.batch_size, shuf=False)
    trainer = Trainer(data_loader.loader(), config)
    trainer.build_model()
    trainer.load_pretrained_model()
    trainer.train()

if __name__ == '__main__':
    config = get_parameters()
    main(config)

The BigGAN-PyTorch model I used is available for download. Combined, the faces can be ranked, the top 300,000 kept, and the rest deleted.
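A sketch of the keep/delete step (the file names are illustrative, and a tiny demo input stands in for the real ranker output of one "path score" line per image):

```shell
# demo input: one "path score" line per image, as printed by the ranker
printf 'good.jpg 0.9\nbad.jpg -0.3\nok.jpg 0.5\n' > rankings.txt

# sort by D score descending; keep the top N paths, list the rest for deletion
N=2
sort --key=2 --numeric-sort --reverse rankings.txt > sorted.txt
head -n "$N" sorted.txt | cut --delimiter=' ' --fields=1 > keep.txt
tail -n +"$((N+1))" sorted.txt | cut --delimiter=' ' --fields=1 > delete.txt
# for the real dataset: N=300000, followed by `xargs rm < delete.txt`
```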

My hope is that 300k images are enough to retrain on, and that deleting the other 400k will purge many of the worst datapoints which could potentially screw up training. Better methods of data cleaning would be useful.

Anime Bodies

[Figure: sample from SkyLion's anime-body StyleGAN]

All-faces ~> FFHQ real faces

The interpolations are terrible, yielding zombie-like faces.

All-faces + FFHQ real faces

So, I have a StyleGAN model trained simultaneously on the FFHQ real faces and the Danbooru2017 anime faces, available for download: 2019-02-16-stylegan-ffhqdanbooru-network-02036-012162.pkl (https://mega.nz/#!ia4FFQLA!BcxJMWphtty1Y39f_cKlkm-4FeB2SCSu9xpqSGT_PFc)

My hypothesis was that by forcing it to learn real & anime faces simultaneously, it would share information between the two domains, and thus the disentangled latent factors would include a 'real<->anime' axis. I hoped that this would be the major axis that the truncation trick's 𝜓 tapped into, and then it would be very easy to transform faces back & forth.

It did manage to learn both kinds of faces quite well, but the style-transfer & 𝜓 samples were disappointments. The interpolation video does show that it learned to interpolate smoothly between real & anime faces, giving half-anime/half-real faces, but it looks like this only happens sometimes—mostly with young female faces. (This shouldn't've surprised me.)
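The shared latent space can be probed directly: given one latent that renders as an anime face and one that renders as a real face, interpolating between them sweeps the real/anime axis. A minimal numpy sketch (the latents and the final `Gs.run` rendering call are illustrative; spherical interpolation is the usual choice for Gaussian latents):

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation: intermediate latents keep a typical Gaussian
    norm, unlike naive linear interpolation which passes near the origin."""
    n0, n1 = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(n0, n1), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return z0  # (anti)parallel edge case: nothing meaningful to interpolate
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.RandomState(0)
z_anime, z_real = rng.randn(512), rng.randn(512)  # stand-ins for recovered latents
frames = [slerp(z_anime, z_real, t) for t in np.linspace(0.0, 1.0, 20)]
# each frame would then be rendered with the pickled generator, eg
# Gs.run(np.stack(frames), None, truncation_psi=0.7)
```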


  • nshepperd:
  • ???:

A+F ~> Danbooru2018

Reversing StyleGAN to modify images

Link: Notebook with more examples

Future work

Further possible work:

  • meta-learning for few-shot face/character/artist imitation, eg FIGR
  • scripts to edit trained models, so 512px pretrained models can be promoted to work with 1024px images and vice-versa
  • conditional StyleGAN for text embeddings (such as tag embeddings from Danbooru2018), yielding higher-quality results, faster training, and controllability
  • borrowing architectural enhancements from BigGAN: self-attention layers, spectral norm regularization, and large-minibatch training (presumably implemented using gradient accumulation)
  • bootstrapping image generation by starting with a seed corpus, generating many random samples, selecting the best by hand, and retraining; eg expanding a corpus of a specific character, or exploring 'hybrid' corpuses which mix A/B images and then selecting for the images which look most A+B-ish

Open questions:

  • is progressive growing still necessary with StyleGAN?
  • are 8x512 FC layers necessary?
  • why do large blob artifacts regularly appear throughout training, even when almost-converged & photorealistic?
  • what are the wrinkly-line/crack noise artifacts which appear at the end of training?
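Gradient accumulation, mentioned above, exploits the fact that minibatch gradients average: k scaled backward passes followed by a single optimizer step reproduce the gradient of a k-times-larger minibatch without the memory cost. A minimal numpy sketch with a least-squares loss (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)
w = np.zeros(8)

def grad(Xb, yb, w):
    # gradient of the mean squared error ||Xb @ w - yb||^2 / len(yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# one big batch vs 4 accumulated micro-batches
full_grad = grad(X, y, w)
accum = np.zeros(8)
for Xb, yb in zip(np.split(X, 4), np.split(y, 4)):
    accum += grad(Xb, yb, w) / 4  # scale each micro-batch gradient by 1/k
# `w -= lr * accum` would now be one optimizer step for the "virtual" large batch
```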

Previous GANs


all-anime faces

Building the Danbooru2018 SFW 512px .tfrecords for StyleGAN, the color JPGs can be listed like so:

find /media/gwern/Data/danbooru2018/512px/ -type f | sort | xargs -n 10000 identify | \
    fgrep " JPEG 512x512 512x512+0+0 8-bit sRGB" | cut --delimiter ' ' -f 1 > danbooru2018-color-ids.txt; \
    wc danbooru2018-color-ids.txt; alert

  1. I have been working on anime for a while. I picked anime out of a mix of intrinsic amusement factor, availability of high-quality data, and genuine difficulty for machine learning. Most existing datasets are so boring—you can only go so far with street signs or random photographs like ImageNet—but anime is not something you see in many published papers, and is something everyone can appreciate: bad results are funny while good results are even funnier. Creating a dataset turned out to be remarkably easy, as anime fans have already, for non-ML purposes, put together extremely high-quality datasets in terms of size and labels; all I had to do was download it and package it up a little bit. (Compare creating a dataset like ImageNet, which costs hundreds of thousands of dollars and still has impoverished metadata by comparison.) Then I discovered that in some respects, anime is far more difficult for standard deep learning methods than regular photographs. For example, in GANs, people have been generating credible faces for a long time at high resolutions with simple models on regular datasets. But to generate low-resolution anime faces, MakeGirls.moe (the previous SOTA) had to resort to heroic efforts like an ultra-clean simplified set of faces, and most of the GANs I used failed miserably even as the original authors were having no trouble generating CelebA face photos. This is interesting, and to me makes the fact that StyleGAN Just Worked in generating high-quality high-resolution anime faces with just days of training much more impressive than people realize.

  2. I think the big difference between GANs and earlier evolutionary art is that evolutionary and other approaches tend to break down on increasingly complex domains. You can evolve simple small things, but while you can do a lot of cool things by carefully choosing a language or primitives to work with, it requires a lot of domain knowledge to come up with the best and most compact encodings which aren’t too big for earlier approaches to work. With GANs, at least once they work, you can simply tackle larger problems by using bigger neural networks and training longer. I would highlight the recent example of BigGAN. BigGAN, training on hundreds of TPUs, uses a shocking amount of compute and a very large neural network (somewhere around 100x the compute, and the neural network is at least twice as large as my anime StyleGAN), but it manages to learn to generate equally shockingly realistic images of pretty much everything. There is nothing remotely comparable in evolutionary or genetic algorithms. If you tried to scale them like that, they would simply stop working. Not all machine learning researchers would agree with this (although others, like many DM/OA researchers, probably would), but I would summarize it this way: the value of GANs is that, like deep learning in general, they allow us to convert ever larger amounts of data & computation into richer, more complex, more realistic images, which is not true of earlier algorithms, which reach their effective limits quickly. So, I can throw my 220k anime faces into a huge 150MB StyleGAN trained for a week on my 2 GPUs (which are the equivalent of a supercomputer not too long ago) and just keep getting better and more fascinating and bizarre anime faces out.

  3. The original GAN paper by Goodfellow was the inspiration. I read it and immediately thought, “at some point, in a few years, as GPUs keep getting faster and neural networks get better, we’ll be able to generate photorealistic images of just about anything. Wouldn’t it be hilarious if we could do artwork, or anime? Too bad there’s no dataset for that researchers could use, though. Hm…” Eventually, I created Danbooru2017 (then Danbooru2018) and began testing out every new GAN on various anime face datasets I’d compiled to see if GANs have finally improved enough. ProGAN was almost enough but too compute-demanding to really work. I was bowled over by the StyleGAN improvements and radical architecture change, and checked the website every day until they released the source. I immediately set to work with amazing results, someone put up the “This Person Does Not Exist” website, another person on Twitter joked that since everyone was doing ‘This Cat Does Not Exist’ or ‘This Airbnb Does Not Exist’, then I should make a “This Waifu Does Not Exist” too, and the rest is history.

    I am just finishing up adding anime-themed GPT-2 text snippets to TWDNE to make it even more amusing, and then I’ll continue training my face StyleGAN (it still hasn’t finished learning!) and once that’s done, next is training a StyleGAN on raw 512px anime images: some collaborators have been experimenting with whole anime images centered on a single character, and whole anime images in general, and we’ve seen promising results. We don’t think we’ve hit the limits of StyleGAN’s capacity, even though it’ll take several GPU-months to train on the entirety of Danbooru2018. The results should be interesting. Should that succeed as well as faces, the next step will be taking the text descriptions of each image compiled by Danbooru users and using those as inputs to StyleGAN, which, should it work, would mean you could create arbitrary anime images simply by typing in a string like 1_boy samurai facing_viewer red_hair clouds sword armor blood etc. That’ll be months and months from now even optimistically, though.

See also

  1. Turns out that when training goes really wrong, you can crash many GAN implementations with a segfault, integer overflow, or division-by-zero error.

  2. StackGAN/StackGAN++/PixelCNN et al are difficult to run as they require a unique image embedding which could only be computed in the unmaintained Torch framework using Reed’s prior work on a joint text+image embedding which however doesn’t run on anything but the Birds & Flowers datasets, and so no one has ever, as far as I am aware, run those implementations on anything else—certainly I never managed to despite quite a few hours trying to reverse-engineer the embedding & various implementations.

  3. Unpublished.

  4. Be sure to check out Ganbreeder.

  5. Glow’s reported results required >40 GPU-weeks; BigGAN’s total compute is unclear as it was trained on a TPUv3 Google cluster but it would appear to be at least 8 GPU-months.

  6. illustration2vec is an old & small CNN trained to predict a few -booru tags on anime images, and so provides an embedding—but not a good one. The lack of a good embedding is the major limitation for anime deep learning as of February 2019. (DeepDanbooru, while performing well apparently, is not currently released & is in an unpopular framework, so it would be hard to use.) An embedding is necessary for text→image GANs, image searches & nearest-neighbor checks of overfitting, FID errors for objectively comparing GANs, anime style transfer (both for its own sake & for creating a ‘StyleDanbooru2018’), encoding into GAN latent spaces for manipulation, data cleaning (to detect anomalous datapoints like failed face crops), etc.

  7. Technical note: I typically train NNs using my workstation with 2x1080ti GPUs. For easier comparison, I convert all my times to single-GPU equivalent (ie “6 GPU-weeks” means 3 realtime/wallclock weeks on my 2 GPUs).

  8. Curiously, the benefit of many more FC layers than usual may have been stumbled across before: IllustrationGAN found that adding some FC layers seemed to help their DCGAN generate anime faces, and when I & FeepingCreature experimented with adding 2–4 FC layers to WGAN-GP along IllustrationGAN’s lines, it did help our lackluster results. But we never dreamed of going as deep as 8!

  9. The latent embedding is usually generated in about the simplest possible way: 512 draws from the standard Normal distribution, 𝒩(0,1). A uniform distribution is sometimes used instead. There is no good justification for this, and some reason to think this can be bad (how does a GAN easily map a discrete or binary latent factor, such as the presence or absence of the left ear, onto a Normal variable?).

    The BigGAN paper explores alternatives, finding improvements in training time and/or final quality from using instead of 𝒩(0,1) (in ascending order): a Normal + binary Bernoulli (p=0.5 (personal communication)) variable, a binary (Bernoulli), and a Rectified Gaussian (sometimes called a “censored normal”, even though that sounds like a truncated normal distribution rather than the rectified one). The rectified Gaussian distribution “outperforms (in terms of IS) by 15–20% and tends to require fewer iterations.”

    The downside is that the “truncation trick”, which yields even larger average improvements in image quality (at the expense of diversity) doesn’t quite apply, and the rectified Gaussian sans truncation produced similar results as the Normal+truncation, so BigGAN reverted to the default Normal distribution+truncation (personal communication).

    The truncation trick either directly applies to some of the other distributions, particularly the Rectified Gaussian, or could easily be adapted—possibly yielding an improvement over either approach. The Rectified Gaussian can be truncated just like the default Normals can. And for the Bernoulli, one could decrease p during generation, or, what is probably equivalent, re-sample whenever the variance (ie squared sum) of all the Bernoulli latent variables exceeds a certain constant. (With p=0.5, a latent vector of 512 Bernoullis would on average sum up to simply n × p = 256, with the 2.5%–97.5% quantiles being 234–278, so a ‘truncation trick’ here might be throwing out every vector with a sum above, say, the 80% quantile of 266.)

    One also wonders about vectors which draw from multiple distributions rather than just one. Could the StyleGAN 8-FC-layer learned-latent-variable be reverse-engineered?
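The Binomial quantiles quoted above follow from the Normal approximation to Binomial(n=512, p=0.5); a quick check of the arithmetic with the usual z-scores:

```python
import math

# Normal approximation to Binomial(n=512, p=0.5)
n, p = 512, 0.5
mean = n * p                          # expected sum of the Bernoulli latents: 256
sd = math.sqrt(n * p * (1 - p))       # standard deviation, ~11.3
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd   # 2.5%-97.5% interval, ~234-278
q80 = mean + 0.8416 * sd                      # 80% quantile, ~266
```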

  10. Which raises the question: if you added any or all of those features, would StyleGAN become that much better? Unfortunately, while theorists & practitioners have had many ideas, so far theory has proven more fecund than fatidical and the large-scale GAN experiments necessary to truly test the suggestions are too expensive for most. Half of these suggestions are great ideas—but which half?

  11. For more on the choice of convolution layers/kernel sizes, see Karpathy’s 2015 notes for “CS231n: Convolutional Neural Networks for Visual Recognition”, or take a look at these Convolution animations & Yang’s interactive “Convolution Visualizer”.

  12. A possible alternative is ESRGAN (Wang et al 2018).

  13. Based on eyeballing the ‘cat’ bar graph in Figure 3 of Yu et al 2015.

  14. Cats offer an amusing instance of the dangers of data augmentation: ProGAN used horizontal flipping/mirroring for everything, because why not? This led to strange Cyrillic text captions showing up in the generated cat images. Why not Latin alphabet captions? Because every cat image was being shown mirrored as well as normally! For StyleGAN, mirroring was disabled, so now the lolcat captions are recognizably Latin alphabetical, and even almost English words. This demonstrates that even datasets where left/right doesn’t seem to matter, like cat photos, can surprise you.

  15. I estimated using AWS EC2 preemptible hourly costs on 15 March 2019 as follows:

    • 1 GPU: p2.xlarge instance in us-east-2a, Half of a K80 (12GB VRAM): $0.3235/hour
    • 2 GPUs: there is no P2 instance with 2 GPUs, only 1/8/16
    • 8 GPUs: p2.8xlarge in us-east-2a, 8 halves of K80s (12GB VRAM each): $2.160/hour

    As usual, there is sublinear scaling, and larger instances cost disproportionately more, because one is paying for faster wallclock training (time is valuable) and for not having to create a distributed infrastructure which can exploit the cheap single-GPU instances.

    This cost estimate does not count additional costs like hard drive space. In addition to the dataset size (the StyleGAN data encoding is ~18x larger than the raw data size, so a 10GB folder of images → 200GB of .tfrecords), you would need at least 100GB HDD (50GB for the OS, and 50GB for checkpoints/images/etc to avoid crashes from running out of space).
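As a worked example of these rates (my arithmetic; the 6-GPU-week figure is just the illustrative unit used earlier):

```python
# illustrative cost arithmetic for the quoted p2.xlarge spot price
hours_per_gpu_week = 7 * 24              # 168
p2_xlarge_rate = 0.3235                  # $/hour, half a K80
cost = 6 * hours_per_gpu_week * p2_xlarge_rate   # ~ $326 for a 6-GPU-week run
```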

  16. I regard this as a flaw in StyleGAN & TF in general. Computers are more than fast enough to load & process images asynchronously using a few worker threads, and working with a directory of images (rather than a special binary format 10–20x larger) avoids imposing serious burdens on the user & hard drive. PyTorch GANs almost always avoid this mistake, and are much more pleasant to work with as one can freely modify the dataset between (and even during) runs.

  17. For example, my Danbooru2018 anime portrait dataset is 16GB, and the StyleGAN dataset is 296GB.

  18. But you may not want to: remember the lolcat captions!

  19. If you are using Python 2, you will get print syntax error messages; if you are using Python 3–3.6, you will get ‘type hint’ errors.

  20. This makes it conform to a truncated normal distribution; why truncated rather than rectified/winsorized at a max like 0.5 or 1.0? Because then many, possibly most, of the latent variables would all be at the max, instead of smoothly spread out over the permitted range.
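The difference is easy to see numerically; a small numpy sketch contrasting re-sampling (truncation) with clipping (winsorizing) at a max of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 0.5
z = rng.normal(size=100_000)

# winsorizing/clipping: probability mass piles up exactly at the +-threshold boundary
clipped = np.clip(z, -threshold, threshold)
frac_at_boundary = np.mean(np.abs(clipped) == threshold)   # ~0.62 for threshold=0.5

# truncating: re-sample out-of-range draws until all lie strictly inside the range,
# so the density stays smooth with no pile-up at the boundary
truncated = z.copy()
out = np.abs(truncated) > threshold
while out.any():
    truncated[out] = rng.normal(size=out.sum())
    out = np.abs(truncated) > threshold
```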

  21. No minibatches are used, so this is much slower than necessary.

  22. Holo faces were far more common than Asuka faces. There were 12,611 Holo faces & 5,838 Asuka faces, so Holo was only 2x more common and Asuka is a more popular character in general in Danbooru, so I am a little puzzled why Holo showed up so much more than Asuka. One possibility is that Holo is inherently easier to model under the truncation trick—I noticed that the brown short-haired face at 𝜓=0 resembles Holo much more than Asuka, so perhaps when setting 𝜓, Asukas are disproportionately filtered out? Or faces closer to the origin are simply more likely to be generated to begin with.