# Making Anime Faces With StyleGAN

A tutorial explaining how to train and generate high-quality anime faces with StyleGAN neural networks, and tips/scripts for effective StyleGAN use.
topics: anime, NGE, NN, Python, technology, tutorial
created: 4 Feb 2019; modified: 17 Oct 2019; status: finished; confidence: highly likely;

Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.

I show off my StyleGAN anime faces & videos, provide downloads, provide the ‘missing manual’ & explain how I trained them on Danbooru2018, with source code for the data preprocessing, and document installation & configuration & training tricks.

For application, I document various scripts for generating images & videos, briefly describe the This Waifu Does Not Exist (TWDNE) website as a public demo, discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small datasets (eg Holo or Asuka Souryuu Langley), and touch on more advanced StyleGAN applications like encoders & controllable generation.

The appendix gives samples of my failures with earlier GANs for anime face generation, and I provide samples & a model from a relatively large-scale BigGAN training run, suggesting that BigGAN may be the next step forward for generating full-scale anime images.

A minute of reading could save an hour of debugging!

When Ian Goodfellow’s first GAN paper came out in 2014, with its blurry 64px grayscale faces, I said to myself, “given the rate at which GPUs & NN architectures improve, in a few years, we’ll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious.” There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/illustrations/drawings are so different from the exclusively-photographic datasets always (over)used in contemporary ML research that I was curious how it would work on anime—better, worse, faster, or different failure modes? Even more amusing—if random images become doable, then text→images would not be far behind.

So when GANs began producing somewhat passable faces around 2015, I began experimenting with DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of testing with Asuka Souryuu Langley of Neon Genesis Evangelion, because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)

It did not work. Despite many runs on my laptop & a borrowed desktop, DCGAN never got remotely near to the level of the CelebA face samples, typically topping out at reddish blobs before diverging or outright crashing.1 Thinking perhaps the problem was too-small datasets & I needed to train on all the faces, I began creating the Danbooru2017 version of the Danbooru dataset. Armed with an extremely large dataset, I subsequently began working through particularly promising members of the GAN zoo, emphasizing SOTA & open implementations.

Among others (a few of which I failed to get running at all2), I tried: WGAN-GP, Glow, GAN-QP, MSG-GAN, SAGAN, VGAN, PokeGAN, BigGAN3, ProGAN, & StyleGAN. These architectures vary widely in their design & core algorithms and in which of the many stabilization tricks they use, but they were more similar in their results: dismal.

Glow & BigGAN had promising results reported on CelebA & ImageNet respectively, but unfortunately their training requirements were out of the question.4 (As interesting as approaches like SPIRAL are, no source was released, so I couldn’t even attempt them.)

While some remarkable tools were created along the way, and there were occasional semi-successful anime face GANs, the most notable attempt at anime face generation was Make Girls.Moe (MGM; Jin et al 2017). MGM could, interestingly, do in-browser 256px anime face generation using particularly small GANs, but it is a dead end. MGM accomplished that much by making the problem easier: it added some light supervision in the form of a crude tag embedding5, and then simplified the problem drastically to n=42k faces cropped from professional video game character artwork, which I regarded as not an acceptable solution—the faces were small & boring, and it was unclear if this data-cleaning approach could scale to anime faces in general, much less anime images in general. They are recognizably anime faces but the resolution is low and the quality is not great:

Typically, a GAN would diverge after a day or two of training, or it would collapse to producing a limited range of faces (or a single face), or, if it was stable, simply converge to a low level of quality with a lot of fuzziness; perhaps the most typical failure mode was heterochromia (which does occur in anime but not that commonly)—mismatched eye colors (each color individually plausible), from the Generator apparently being unable to coordinate with itself to pick consistently. With more recent architectures like VGAN or SAGAN, which carefully weaken the Discriminator or add extremely-powerful components like self-attention layers, I could reach fuzzy 128px faces.

Given the miserable failure of all the prior NNs I had tried, I had begun to seriously wonder if there was something about non-photographs which made them intrinsically unable to be easily modeled by convolutional neural networks (the common ingredient to them all). Did convolutions render a GAN unable to generate sharp lines or flat regions of color? Did regular GANs work only because photographs were made almost entirely of blurry textures?

But BigGAN demonstrated that a large cutting-edge GAN architecture could scale, given enough training, to all of ImageNet at even 512px. And ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat-infeasible amounts of training: ProGAN, while expensive and requiring >6 GPU-weeks6, did work and was even powerful enough to overfit single-character face datasets; I didn’t have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting. For a common sequence in DL/DRL (unlike many areas of AI) is that a problem seems intractable for long periods, until someone modifies a scalable architecture slightly, produces somewhat-credible (not necessarily human or even near-human) results, and then throws a ton of compute/data at it, and, since the architecture scales, it rapidly exceeds SOTA and approaches human levels (and potentially exceeds human-level). Now I just needed a faster GAN architecture with which I could train a much bigger model on a much bigger dataset.

StyleGAN was the final breakthrough in providing ProGAN-level capabilities but fast: by switching to a radically different architecture, it minimized the need for the slow progressive growing (perhaps eliminating it entirely7) and learned efficiently at multiple levels of resolution, with the bonus of providing much more control over the generated images via its “style transfer” metaphor.

# Examples

First, some demonstrations of what is possible with StyleGAN on anime faces:

Even a quick look at the MGM & StyleGAN samples demonstrates the latter to be superior in resolution, fine details, and overall appearance (although the MGM faces admittedly have fewer global mistakes). It is also superior to my 2018 ProGAN faces. Perhaps the most striking fact about these faces, which should be emphasized for those fortunate enough not to have spent as much time looking at awful GAN samples as I have, is not that the individual faces are good, but rather that the faces are so diverse, particularly when I look through face samples with 𝜓≥1—it is not just the hair/eye color or head orientation or fine details that differ, but the overall style ranges from CG to cartoon sketch, and even the ‘media’ differ: I could swear many of these are trying to imitate watercolors, charcoal sketching, or oil painting rather than digital drawings, and some come off as recognizably ’90s-anime-style vs ’00s-anime-style. (I could look through samples all day despite the global errors because so many are interesting, which is not something I could say of the MGM model, whose novelty is quickly exhausted; and it appears that users of my TWDNE website feel similarly, as the average length of each visit is 1m:55s.)

# Background

StyleGAN was published in 2018 as “A Style-Based Generator Architecture for Generative Adversarial Networks”, Karras et al 2018. StyleGAN takes the standard GAN architecture embodied by ProGAN (whose source code it reuses) and draws inspiration from the field of “style transfer” (essentially invented by Gatys et al 2015), by changing the Generator (G), which creates the image by repeatedly upscaling its resolution, to take, at each level of resolution from 8px→16px→32px→64px→128px etc, a random input or “style noise”, which is used to tell the Generator how to ‘style’ the image at that resolution by changing the hair or changing the skin texture and so on. ‘Style noise’ at a low resolution like 32px affects the image relatively globally, perhaps determining the hair length or color, while style noise at a higher level like 256px might affect how frizzy individual strands of hair are. In contrast, ProGAN and almost all other GANs inject noise into the G as well, but only at the beginning, which appears to work not nearly as well (perhaps because it is difficult to propagate that randomness ‘upwards’ along with the upscaled image itself to the later layers to enable them to make consistent choices?). To put it simply, by systematically providing a bit of randomness at each step in the process of generating the image, StyleGAN can ‘choose’ variations effectively.
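To make the ‘style noise at every resolution’ idea concrete, here is a toy NumPy sketch, not StyleGAN’s actual code: an 8-layer FC ‘mapping network’ turns the latent z into a style vector w, and at each 2x upscaling step the feature maps are re-normalized and then scaled/shifted by a per-resolution affine projection of w (the AdaIN mechanism). All layer sizes and operations here are illustrative stand-ins.

import numpy as np

rng = np.random.RandomState(0)

def mapping(z, n_layers=8, width=512):
    # the 8-layer FC "mapping network": turns the latent z into a style vector w
    w = z
    for _ in range(n_layers):
        w = np.maximum(0, w @ (rng.randn(width, width) * 0.01))
    return w

def adain(x, style):
    # x: (channels, H, W); style: (2*channels,) giving a per-channel scale & bias
    c = x.shape[0]
    x = (x - x.mean(axis=(1, 2), keepdims=True)) / (x.std(axis=(1, 2), keepdims=True) + 1e-8)
    scale, bias = style[:c].reshape(c, 1, 1), style[c:].reshape(c, 1, 1)
    return x * (1 + scale) + bias

def toy_generator(z, channels=16, resolutions=(8, 16, 32, 64)):
    w = mapping(z)
    x = np.ones((channels, 4, 4))                        # real StyleGAN starts from a learned constant
    for res in resolutions:
        x = x.repeat(2, axis=1).repeat(2, axis=2)        # 2x nearest-neighbor upscale to `res`
        x = np.tanh(x + 0.1 * rng.randn(*x.shape))       # stand-in for the conv layers at this resolution
        style = w @ (rng.randn(512, 2 * channels) * 0.01)  # per-resolution affine projection of w
        x = adain(x, style)                              # inject the "style" at this resolution
    return x

img = toy_generator(rng.randn(512))
print(img.shape)   # (16, 64, 64)

Because a fresh style projection is applied at every resolution, low-resolution injections shift global attributes while high-resolution injections only perturb fine detail, which is the behavior described above.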

StyleGAN makes a number of additional improvements, but they appear to be less important: for example, it introduces a new face/portrait dataset with 1024px images in order to show that StyleGAN convincingly improves on ProGAN in final image quality; switches to a loss which is more well-behaved than the usual logistic-style losses; and, architecture-wise, it makes unusually heavy use of fully-connected (FC) layers to process an initial random input, no less than 8 layers of 512 neurons, where most GANs use 1 or 2 FC layers.10 More striking is what it omits: techniques that other GANs have found critical for being able to train at 512px–1024px scale, such as newer losses, SAGAN-style self-attention layers in either G/D, VGAN-style variational bottlenecks in the D, conditioning on a tag or category embedding11, BigGAN-style large minibatches, different noise distributions12, advanced regularization, etc.13

Aside from the FCs and the style noise & normalization, it is a fairly vanilla architecture. (One oddity is the use of only 3x3 convolutions & so few layers in each upscaling block; a more conventional upscaling block than StyleGAN’s 3x3→3x3 would be one which does 1x1→3x3→3x3→1x1. It’s not clear if this is a good idea, as it limits the spatial influence of each pixel by providing limited receptive fields14, and may be related to the “blob” artifacts.) Thus, if one has some familiarity with training a ProGAN or another GAN, one can immediately work with StyleGAN with no trouble: the training dynamics are similar and the hyperparameters have their usual meaning, and the codebase is much the same as the original ProGAN (with the main exception being that config.py has been renamed train.py and the original train.py, which stores the critical configuration parameters, has been moved to training/training_loop.py; there is still no support for command-line options and StyleGAN must be controlled by editing train.py/training_loop.py by hand).

## Applications

Because of its speed and stability, when the source code was released on 4 February 2019 (a date that will long be noted in the ANNals of GANime), the Nvidia models & sample dumps were quickly perused & new StyleGANs trained on a wide variety of image types beyond the original faces/cars/cats of Karras et al 2018.

## Why Don’t GANs Work?

Why does StyleGAN work so well on anime images while other GANs worked not at all or slowly at best?

The lesson I took from “Are GANs Created Equal? A Large-Scale Study”, Lucic et al 2017, is that CelebA/CIFAR10 are too easy, as almost all evaluated GAN architectures were capable of occasionally achieving good FID if one simply did enough iterations & hyperparameter tuning.

Interestingly, I consistently observe in training all GANs on anime that clear lines & sharpness & cel-like smooth gradients appear only toward the end of training, after the typically blurry initial textures have coalesced. This suggests an inherent bias of CNNs: color images work because they provide some degree of texture to start with, but lineart/monochrome material fails because the GAN optimization dynamics flail around. This is consistent with Geirhos et al 2018—which uses style transfer to construct a data-augmented/transformed “Stylized-ImageNet”—showing that ImageNet CNNs are lazy and, because the tasks can be achieved to some degree with texture-only classification (as demonstrated by several of Geirhos et al 2018’s authors), focus on textures unless otherwise forced. So while CNNs can learn sharp lines & shapes rather than textures, the typical GAN architecture & training algorithm do not make it easy. Since CIFAR10/CelebA can be fairly described as being just as heavy on textures as ImageNet (which is not true of anime images), it is not surprising that GANs train easily on them, starting with textures and gradually refining into good samples, but then struggle on anime.

This raises the question of whether the StyleGAN architecture is necessary at all: might many GANs work, if only one had good style transfer for anime images and could, to defeat the texture bias, generate many versions of each anime image which kept the shape while changing the color palette? (Current style transfer methods, like the one used by Geirhos et al 2018, do not work well on anime images, ironically enough, because they are trained on photographic images, typically using the old VGG model.)

# FAQ

“…Its social accountability seems sort of like that of designers of military weapons: unculpable right up until they get a little too good at their job.”

To address some common questions people have after seeing generated samples:

• Overfitting: “Aren’t StyleGAN (or BigGAN) just overfitting & memorizing data?”

Amusingly, this is not a question anyone really bothered to ask of earlier GAN architectures, which is a sign of progress. Overfitting is a better problem to have than underfitting, because overfitting means you can use a smaller model or more data or more aggressive regularization techniques, while underfitting means your approach just isn’t working.

In any case, while there is currently no way to conclusively prove that cutting-edge GANs are not 100% memorizing (because they should be memorizing to a considerable extent in order to learn image generation, and evaluating generative models is hard in general, and for GANs in particular, because they don’t provide standard metrics like likelihoods which could be used on held-out samples), there are several reasons to think that they are not just memorizing:15

1. sample/dataset overlap: a standard check for overfitting is to compare generated images to their closest matches in the training data found by nearest-neighbor lookup (where distance is defined by features like a CNN embedding); an example of this is BigGAN’s nearest-neighbor comparisons, where the photorealistic samples are nevertheless completely different from the most similar ImageNet datapoints. This has not been done for StyleGAN yet but I wouldn’t expect different results, as GANs typically pass this check. (A minimal sketch of such a check appears after this list.)

2. semantic understanding: GANs appear to learn meaningful concepts like individual objects, as demonstrated by “latent space addition” or research tools like GAN Dissection; image edits like object deletions/additions are difficult to explain without some genuine understanding of images. In the case of StyleGAN anime faces, there are now encoders and controllable face generation which demonstrate that the latent variables do map onto meaningful factors of variation which the model must have learned.

3. latent space smoothness: in general, interpolation in the latent space (z) shows smooth changes of images and logical transformations or variations of face features; if StyleGAN were merely memorizing individual datapoints, the interpolation would be expected to be low quality, yield many terrible faces, and exhibit ‘jumps’ in between points corresponding to real, memorized, datapoints. The StyleGAN anime face models do not exhibit this. (In contrast, the Holo ProGAN, which overfit badly, does show severe problems in its latent space interpolation videos.)

Which is not to say that GANs do not have issues: “mode dropping” seems to still be an issue for BigGAN despite the expensive large-minibatch training, which is overfitting to some degree, and StyleGAN presumably suffers from it too.

4. transfer learning: GANs have been used for semi-supervised learning (eg generating plausible ‘labeled’ samples to train a classifier on), imitation learning, and retraining on further datasets; if the G is merely memorizing, it is difficult to explain how any of this would work.
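As a concrete illustration of the sample/dataset-overlap check from point #1, here is a minimal sketch: it embeds every training image and every generated sample (here with a crude downscaled-pixel ‘embedding’ for simplicity; a pretrained CNN’s features would be the better choice) and reports each sample’s nearest training image. The faces/ and results/example-*.png paths are the ones used elsewhere on this page, but are otherwise just assumptions.

import glob
import numpy as np
import PIL.Image

def embed(path, size=32):
    # crude stand-in for a CNN embedding: a normalized, downscaled pixel vector
    img = PIL.Image.open(path).convert('RGB').resize((size, size))
    v = np.asarray(img, dtype=np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

reals   = sorted(glob.glob('faces/*.jpg'))             # training data
samples = sorted(glob.glob('results/example-*.png'))   # generated samples

real_embs = np.stack([embed(f) for f in reals])
for s in samples:
    d = np.linalg.norm(real_embs - embed(s), axis=1)   # distance to every training image
    print(s, '->', reals[int(d.argmin())], float(d.min()))

If the generated samples were memorized, their nearest neighbors would be near-duplicates at tiny distances; in practice, GAN samples tend to have nearest neighbors which are clearly different images.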

• Compute Requirements: “Doesn’t StyleGAN take too long to train?”

StyleGAN is remarkably fast-training for a GAN. With the anime faces, I got better results after 1–3 days of StyleGAN training than I’d gotten with >3 weeks of ProGAN training. The training times quoted by the StyleGAN repo may sound scary, but they are, in practice, a steep overestimate of what you actually need, for several reasons:

• lower resolution: the largest figures are for 1024px images but you may not need them to be that large or even have a big dataset of 1024px images. For anime faces, 1024px-sized faces are relatively rare, and training at 512px & upscaling 2x to 1024 with waifu2x16 works fine & is much faster. Since upscaling is relatively simple & easy, another strategy is to change the progressive-growing schedule: instead of proceeding to the final resolution as fast as possible, instead adjust the schedule to stop at a more feasible resolution & spend the bulk of training time there instead and then do just enough training at the final resolution to learn to upscale (eg spend 10% of training growing to 512px, then 80% of training time at 512px, then 10% at 1024px).
• diminishing returns: the largest gains in image quality are seen in the first few days or weeks of training, with the remaining training being less useful as it focuses on improving small details (so just a few days may be more than adequate for your purposes, especially if you’re willing to select a little more aggressively from samples)
• transfer learning from a related model can save days or weeks of training, as there is no need to train from scratch; with the anime face StyleGAN, one can train a character-specific StyleGAN in a few hours or days at most, and certainly does not need to spend multiple weeks training from scratch (assuming that wouldn’t just cause overfitting). Similarly, if one wants to train on some 1024px face dataset, why start from scratch, taking ~1000 GPU-hours, when you can start from Nvidia’s FFHQ face model, which is already fully trained, and converge in a fraction of the from-scratch time? For 1024px, you could also train at a lower resolution and use a super-resolution GAN to upscale. Alternately, you could change the image progression budget to spend most of your time at 512px and then at the tail end try 1024px.
• one-time costs: the upfront cost of a few hundred dollars of GPU-time (at inflated AWS prices) may seem steep, but should be kept in perspective. As with almost all NNs, training 1 StyleGAN model can be literally tens of millions of times more expensive than simply running the Generator to produce 1 image; but it also need be paid only once by only one person, and the total price need not even be paid by the same person, given transfer learning, but can be amortized across various datasets. Indeed, given how fast running the Generator is, the trained model doesn’t even need to be run on a GPU. (The rule of thumb is that a GPU is 20–30x faster than the same thing on CPU, with rare instances, when overhead dominates, of the CPU being as fast or faster; so since generating 1 image takes on the order of ~0.1s on GPU, a CPU can do it in ~3s, which is adequate for many purposes.)
• Copyright Infringement: “Who owns StyleGAN images?”

1. The Nvidia source code & released models are under a CC-BY-NC license, and you cannot edit them or produce “derivative works” such as retraining their FFHQ, car, or cat StyleGAN models. If a model is trained from scratch, then that does not apply, as the source code is simply another tool used to create the model, and nothing about the CC-BY-NC license forces you to donate the copyright to Nvidia. (It would be odd if such a thing did happen—as if your word processor claimed to transfer the copyright of everything written in it to Microsoft!)

2. Models are generally considered “transformative works”, and the copyright owners of whatever data the model was trained on have no copyright on the model. The model is copyrighted to whoever created it. Hence, Nvidia has copyright on the models it created, but I have copyright on the models I trained (which I release under CC-0).

3. Samples are trickier. The usual widely-stated legal interpretation is that the standard copyright law position is that only human beings can create copyrightable works, and that machines, animals, inanimate objects or, most famously, monkeys, cannot. The US Copyright Office states clearly that regardless of whether we regard a GAN as a machine or as something more intelligent like an animal, either way, it doesn’t count:

A work of authorship must possess “some minimal degree of creativity” to sustain a copyright claim. Feist, 499 U.S. at 358, 362 (citation omitted). “[T]he requisite level of creativity is extremely low.” Even a “slight amount” of creative expression will suffice. “The vast majority of works make the grade quite easily, as they possess some creative spark, ‘no matter how crude, humble or obvious it might be.’” Id. at 346 (citation omitted).

… To qualify as a work of “authorship” a work must be created by a human being. See Burrow-Giles Lithographic Co., 111 U.S. at 58. Works that do not satisfy this requirement are not copyrightable. The Office will not register works produced by nature, animals, or plants.

Examples:

• A photograph taken by a monkey.
• A mural painted by an elephant.

…the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.

A dump of random samples such as the Nvidia samples or TWDNE therefore has no copyright & by definition is in the public domain.

A new copyright can be created, however, if a human author is sufficiently ‘in the loop’, so to speak, as to exert a de minimis amount of creative effort, even if that ‘creative effort’ is simply selecting a single image out of a dump of thousands of them or twiddling knobs (eg on Make Girls.Moe).

# Training requirements

## Data

…If the fool would persist in his folly he would become wise
…You never know what is enough unless you know what is more than enough. …If others had not been foolish, we should be so.”

William Blake, “Proverbs of Hell”, The Marriage of Heaven and Hell

The necessary size for a dataset depends on the complexity of the domain and whether transfer learning is being used. StyleGAN’s default settings yield a 1024px Generator with 26.2M parameters, which is a large model and can soak up potentially millions of images, so there is no such thing as too much.

For learning decent-quality anime faces from scratch, a minimum of 5000 images appears to be necessary in practice; for learning a specific character when starting from the anime face StyleGAN, potentially as few as ~500 (especially with data augmentation) can give good results. For domains as complicated as “any cat photo”, like Karras et al 2018’s cat StyleGAN trained on the LSUN CATS category of ~1.8M17 cat photos, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that “CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds.”18

## Compute

To fit reasonable minibatch sizes, one will want GPUs with >11GB VRAM. At 512px, an 11GB GPU will only fit n=4, and going below that means training will be even slower (and you may have to reduce learning rates to avoid unstable training). So, Nvidia 1080ti & up would be good. (Reportedly, AMD/OpenCL works as well, and there is one report of successful training with a “Radeon VII with tensorflow-rocm 1.13.2 and rocm 2.3.14”.)

The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & supplement with a worst-case AWS-based cost estimate):

Estimated StyleGAN wallclock training times for various resolutions & GPU-clusters (source: StyleGAN repo)
| GPUs | 1024px | 512px | 256px | March 2019 AWS cost19 |
|------|--------|-------|-------|------------------------|
| 1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115] |
| 2 | 21 days 22 hours [1,052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA] |
| 4 | 11 days 8 hours [1,088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA] |
| 8 | 6 days 14 hours [1,264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382] |

AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound; 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary—with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/StyleGAN particularly struggle with the backgrounds & edges of images, because those are cut off, obscured, and highly-varied compared to the faces, whether anime or FFHQ.)

# Data Preparation

The most difficult part of running StyleGAN is preparing the dataset properly. StyleGAN does not, unlike most GAN implementations (particularly PyTorch ones), support reading a directory of files as input; it can only read its unique .tfrecord format, which stores each image as raw arrays at every relevant resolution.20 Thus, input files must be perfectly uniform, must be slowly converted to the .tfrecord format by the special dataset_tool.py tool, and will take up ~19x more disk space.21

WARNING: A StyleGAN dataset must consist of images all formatted exactly the same way: they must be precisely 512x512px or 1024x1024px etc (512x513px will not work), they must all be the same colorspace (you cannot have sRGB and Grayscale JPGs), the filetype must be the same as the model you intend to (re)train (ie you cannot retrain a PNG-trained model on a JPG dataset, StyleGAN will crash every time with inscrutable convolution/channel-related errors)22, and there must be no subtle errors like CRC checksum errors which image viewers or libraries like ImageMagick often ignore.
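Because of this fragility, it can be worth verifying the dataset before attempting conversion. Here is a minimal sketch (assuming a faces/ directory of 512px sRGB JPGs, as prepared below) which flags any file that is not a 512x512 RGB JPEG or cannot be read at all:

import glob
import os
import PIL.Image

for path in glob.glob('faces/**/*', recursive=True):
    if not os.path.isfile(path):
        continue
    try:
        with PIL.Image.open(path) as img:
            ok = (img.format == 'JPEG' and img.mode == 'RGB' and img.size == (512, 512))
    except Exception:
        ok = False                      # unreadable/corrupt files must be removed too
    if not ok:
        print('BAD:', path)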

## Faces preparation

My workflow:

1. Download the raw Danbooru2018 dataset (via BitTorrent or rsync; see “Cropping” below)
2. Extract from the JSON Danbooru2018 metadata all the IDs of a subset of images if a specific Danbooru tag (such as a single character) is desired, using jq and shell scripting
3. Crop anime faces from raw images using lbpcascade_animeface (regular face-detection methods do not work on anime images)
4. Delete empty files, monochrome or grayscale files, & exact-duplicate files
5. Convert to JPG
6. Upscale images below the target resolution (512px) with waifu2x
7. Convert all images to exactly 512x512 resolution sRGB JPG images
8. If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN’s Discriminator
9. Convert to StyleGAN format using dataset_tool.py

The goal is to turn this:

into this:

Below I use shell scripting to prepare the dataset. A possible alternative is the danbooru-utility tool, which aims to help “explore the dataset, filter by tags, rating, and score, detect faces, and resize the images”.

### Cropping

The Danbooru2018 download can be done via BitTorrent or rsync; it provides a JSON metadata tarball which unpacks into metadata/2* & a folder structure of {original,512px}/{0-999}/$ID.{png,jpg,...}. For training on SFW whole images, the 512px/ version of Danbooru2018 would work, but it is not a great idea for faces because, by scaling images down to 512px, a lot of face detail has been lost, and getting high-quality faces is a challenge. The SFW IDs can be extracted from the filenames in 512px/ directly or from the metadata by extracting the id & rating fields (and saving to a file):

find ./512px/ -type f | sed -e 's/.*\/\([[:digit:]]*\)\.jpg/\1/'
# 967769
# 1853769
# 2729769
# 704769
# 1799769
# ...

tar xf metadata.json.tar.xz
cat metadata/20180000000000* | jq '[.id, .rating]' -c | fgrep '"s"' | cut -d '"' -f 2
# ...

After installing OpenCV & lbpcascade_animeface and testing to make sure they work, one can use a small script, crop.py (below), which crops the face(s) from a single input image. The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors—non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250x250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative, more-difficult-to-run library by nagadomi, animeface-2009, whose face-cropping script is said to be better at cropping faces, but I was not impressed when I tried it out.)

crop.py:

import cv2
import sys
import os.path

def detect(cascade_file, filename, outputname):
    if not os.path.isfile(cascade_file):
        raise RuntimeError("%s: not found" % cascade_file)

    cascade = cv2.CascadeClassifier(cascade_file)
    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    ## Suggested modification: increase minSize to '(250,250)' px,
    ## increasing proportion of high-quality faces & reducing
    ## false positives. Faces which are only 50x50px are useless
    ## and often not faces at all.
    faces = cascade.detectMultiScale(gray,
                                     # detector options
                                     scaleFactor = 1.1,
                                     minNeighbors = 5,
                                     minSize = (50, 50))
    i=0
    for (x, y, w, h) in faces:
        cropped = image[y: y + h, x: x + w]
        cv2.imwrite(outputname+str(i)+".png", cropped)
        i=i+1

if len(sys.argv) != 4:
    sys.stderr.write("usage: detect.py <animeface.xml file> <input> <output prefix>\n")
    sys.exit(-1)

detect(sys.argv[1], sys.argv[2], sys.argv[3])

The IDs can be combined with the provided lbpcascade_animeface script using xargs; however, this will be far too slow, and it would be better to exploit parallelism with xargs --max-args=1 --max-procs=16 or parallel. It’s also worth noting that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no apparent speedup (a slowdown if anything, given limited VRAM), so I find it helps to explicitly disable GPU use by setting CUDA_VISIBLE_DEVICES="". (For this step, it’s quite helpful to have a many-core system like a Threadripper.) Combining everything, parallel face-cropping of an entire Danbooru2018 subset can be done like this:

cropFaces() {
    BUCKET=$(printf "%04d" $(( $@ % 1000 )) )
    ID="$@"
    CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/crop.py \
        ~/src/lbpcascade_animeface/lbpcascade_animeface.xml \
        ./original/$BUCKET/$ID.* "./faces/$ID"
}
export -f cropFaces

mkdir ./faces/
cat sfw-ids.txt | parallel --progress cropFaces

### Cleaning & Upscaling

Miscellaneous cleanups can be done:

## Delete failed/empty files
find faces/ -size 0    -type f -delete

## Delete 'too small' files which is indicative of low quality:
find faces/ -size -40k -type f -delete

## Delete exact duplicates:
fdupes --delete --omitfirst --noprompt faces/

## Delete monochrome or minimally-colored images:
### the heuristic of <257 unique colors is imperfect but better than anything else I tried
deleteBW() { if [[ $(identify -format "%k" "$@") -lt 257 ]]; then rm "$@"; fi; }
export -f deleteBW
find faces -type f | parallel --progress deleteBW

A good trick with GANs is, after training to reasonable levels of quality, reusing the Discriminator to rank the real datapoints; images the trained D assigns the lowest probability/score of being real are often the worst-quality ones and going through the bottom decile (or deleting them entirely) should remove many anomalies and may improve the GAN. The GAN is then trained on the new cleaned dataset, making this a kind of “active learning”. Since rating images is what the D already does, no new algorithms or training methods are necessary, and almost no code is necessary.

Here is a simple script (ranker.py) to open a StyleGAN .pkl and run it on a list of image filenames to print out the D score:

import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config
import sys

def main():
    tflib.init_tf()
    _G, D, _Gs = pickle.load(open(sys.argv[1], "rb"))
    image_filenames = sys.argv[2:]

    for i in range(0, len(image_filenames)):
        img = np.asarray(PIL.Image.open(image_filenames[i]))
        img = img.reshape(1, 3, 512, 512)
        score = D.run(img, None)
        print(image_filenames[i], score[0][0])

if __name__ == "__main__":
    main()

Example use:

find /media/gwern/Data/danbooru2018/characters-1k-faces/ -type f | xargs -n 9000 --max-procs=1 \
python ranker.py results/02086-sgan-portraits-2gpu/network-snapshot-058662.pkl \
| tee portraitfaces-rank.txt
fgrep /media/gwern/ 2019-04-22-portraitfaces-rank.txt | \
sort --field-separator ' ' --key 2 --numeric-sort | head -100
# .../megurine.luka/7853120.jpg -708.6835
# .../remilia.scarlet/26352470.jpg -707.39856
# .../z1.leberecht.maass..kantai.collection./26703440.jpg -702.76904
# .../suzukaze.aoba/27957490.jpg -700.5606
# .../jack.the.ripper..fate.apocrypha./31991880.jpg -700.0554
# .../senjougahara.hitagi/4947410.jpg -699.0976
# .../ayase.eli/28374650.jpg -698.7358
# .../ayase.eli/16185520.jpg -696.97845
# .../illustrious..azur.lane./31053930.jpg -696.8634
# ...

Depending on how noisy the rankings are in terms of ‘quality’ and the available sample size, one can either review the worst-ranked images by hand, or delete the bottom X%. One should check the top-ranked images as well to make sure the ordering is right; there can also be some odd images in the top X% which should be removed.
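For example, a minimal sketch of the ‘delete the bottom decile’ option, assuming the portraitfaces-rank.txt output produced by ranker.py above (each line being a filename followed by its D score):

import os

lines = [l.split() for l in open('portraitfaces-rank.txt') if l.strip()]
ranked = sorted(((path, float(score)) for path, score in lines), key=lambda t: t[1])

cutoff = len(ranked) // 10              # bottom decile = lowest D scores
for path, score in ranked[:cutoff]:
    print('would delete:', path, score)
    # os.remove(path)                   # uncomment only after reviewing by hand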

It might be possible to use ranker.py to improve the quality of generated samples as well, as a simple version of discriminator rejection sampling.
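A crude sketch of that idea, reusing the Gs/D calls shown elsewhere on this page: generate a batch of candidate samples, score them with the trained D, and keep only the top-scoring fraction. The snapshot path is the example model used above; whether raw D scores track perceptual quality well enough for this to help is an assumption.

import pickle
import numpy as np
import dnnlib
import dnnlib.tflib as tflib

tflib.init_tf()
_G, D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
scored = []
for i in range(100):
    z = np.random.randn(1, Gs.input_shape[1])
    img = Gs.run(z, None, truncation_psi=0.7, randomize_noise=True, output_transform=fmt)
    score = D.run(img.transpose(0, 3, 1, 2), None)   # D expects NCHW input, as in ranker.py
    scored.append((float(score[0][0]), z))

scored.sort(key=lambda t: t[0], reverse=True)
best = scored[:10]                                   # keep the top 10% of candidates
print([s for s, _ in best])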

The next major step is upscaling images using waifu2x, which does an excellent job of 2x upscaling anime images (the results are nigh-indistinguishable from a higher-resolution original) and greatly increases the usable corpus. The downside is that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2x1080ti), and is written in a now-unmaintained DL framework, Torch, so it is gradually becoming harder to get running (one hopes that by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, so the waifu2x upscaling is worthwhile.

. ~/src/torch/install/bin/torch-activate
upscaleWaifu2x() {
SIZE1=$(identify -format "%h" "$@")
SIZE2=$(identify -format "%w" "$@");

if (( $SIZE1 < 512 && $SIZE2 < 512 )); then
echo "$@"$SIZE
TMP=$(mktemp "/tmp/XXXXXX.png") CUDA_VISIBLE_DEVICES="$((RANDOM % 2 < 1))" nice th ~/src/waifu2x/waifu2x.lua -model_dir \
~/src/waifu2x/models/upconv_7/art -tta 1 -m scale -scale 2 \
-i "$@" -o "$TMP"
convert "$TMP" "$@"
rm "$TMP" fi; } export -f upscaleWaifu2x find faces/ -type f | parallel --progress --jobs 9 upscaleWaifu2x ### Quality Checks & Data Augmentation At this point, one can do manual quality checks by viewing a few hundred images, running findimagedupes -t 99% to look for near-identical faces, or dabble in further modifications such as doing “data augmentation”. Working with Danbooru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN and one will have difficulty storing the final StyleGAN dataset because of its sheer size (due to the ~18x size multiplier). However, if that is not enough or one is working with a small dataset like for a single character, data augmentation may be necessary. The mirror/horizontal flip is not necessary as StyleGAN has that built-in as an option23, but there are many other possible data augmentations. One can stretch, shift colors, sharpen, blur, increase/decrease contrast/brightness, crop, and so on. An example, extremely aggressive, set of data augmentations could be done like this: dataAugment () { image="$@"
target=$(basename "$@")
suffix="png"
convert -deskew 50                     "$image" "$target".deskew."$suffix"
convert -resize 110%x100%              "$image" "$target".horizstretch."$suffix"
convert -resize 100%x110%              "$image" "$target".vertstretch."$suffix"
convert -blue-shift 1.1                "$image" "$target".midnight."$suffix"
convert -fill red -colorize 5%         "$image" "$target".red."$suffix"
convert -fill orange -colorize 5%      "$image" "$target".orange."$suffix"
convert -fill yellow -colorize 5%      "$image" "$target".yellow."$suffix"
convert -fill green -colorize 5%       "$image" "$target".green."$suffix"
convert -fill blue -colorize 5%        "$image" "$target".blue."$suffix"
convert -fill purple -colorize 5%      "$image" "$target".purple."$suffix"
convert -adaptive-blur 3x2             "$image" "$target".blur."$suffix"
convert -adaptive-sharpen 4x2          "$image" "$target".sharpen."$suffix"
convert -brightness-contrast 10        "$image" "$target".brighter."$suffix"
convert -brightness-contrast 10x10     "$image" "$target".brightercontraster."$suffix"
convert -brightness-contrast -10       "$image" "$target".darker."$suffix"
convert -brightness-contrast -10x10    "$image" "$target".darkerlesscontrast."$suffix"
convert +level 5%                      "$image" "$target".contraster."$suffix"
convert -level 5%\!                    "$image" "$target".lesscontrast."$suffix"
}
export -f dataAugment
find faces/ -type f | parallel --progress dataAugment

### Upscaling & Conversion

Once any quality fixes or data augmentation are done, it’d be a good idea to save a lot of disk space by converting to JPG & lossily reducing quality (I find 33% saves a ton of space at no visible change):

convertPNGToJPG() { convert -quality 33 "$@" "$@".jpg && rm "$@"; }
export -f convertPNGToJPG
find faces/ -type f -name "*.png" | parallel --progress convertPNGToJPG

WARNING: remember that StyleGAN models are only compatible with images of the type they were trained on, so if you are using a StyleGAN pretrained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN models), you will need to keep using PNGs.

Doing the final scaling to exactly 512px can be done at many points, but I generally postpone it to the end in order to work with images in their ‘native’ resolutions & aspect-ratios for as long as possible. At this point we carefully resize everything to 512x51224, preserving the aspect ratio and filling in with a black background as necessary on either side:

find faces/ -type f | xargs --max-procs=16 -n 9000 \
    mogrify -resize 512x512\> -extent 512x512\> -gravity center -background black

Any slightly-different image could crash the import process. Therefore, we delete any image which is even slightly different from the 512x512 sRGB JPG it is supposed to be:

find faces/ -type f | xargs --max-procs=16 -n 9000 identify | \
    fgrep -v " JPEG 512x512 512x512+0+0 8-bit sRGB" | cut -d ' ' -f 1 | \
    xargs --max-procs=16 -n 10000 rm

Having done all this, we should have a large, consistent, high-quality dataset. Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format using dataset_tool.py. It is worth remembering at this point how fragile that tool is and how strict its requirements are; ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem. Because of the extreme fragility of dataset_tool.py, I strongly advise that you edit it to print out the filename of each file as it is being processed, so that when (not if) it crashes, you can investigate the culprit and check the rest. The edit could be as simple as this:

diff --git a/dataset_tool.py b/dataset_tool.py
index 4ddfe44..e64e40b 100755
--- a/dataset_tool.py
+++ b/dataset_tool.py
@@ -519,6 +519,7 @@ def create_from_images(tfrecord_dir, image_dir, shuffle):
     with TFRecordExporter(tfrecord_dir, len(image_filenames)) as tfr:
         order = tfr.choose_shuffled_order() if shuffle else np.arange(len(image_filenames))
         for idx in range(order.size):
+            print(image_filenames[order[idx]])
             img = np.asarray(PIL.Image.open(image_filenames[order[idx]]))
             if channels == 1:
                 img = img[np.newaxis, :, :] # HW => CHW

There should be no issues if all the images were thoroughly checked earlier, but should an image crash it, it can be checked in more detail with identify. (I advise just deleting such images and not trying to rescue them.) Then the conversion is just (assuming StyleGAN prerequisites are installed, see next section):

source activate MY_TENSORFLOW_ENVIRONMENT
python dataset_tool.py create_from_images datasets/faces /media/gwern/Data/danbooru2018/faces/

Congratulations, the hardest part is over. Most of the rest simply requires patience (and a willingness to edit Python files directly in order to configure StyleGAN).

# Training

## Installation

I assume you have CUDA installed & functioning. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have successfully used the Nvidia driver version #410.104, CUDA 10.1, and TensorFlow 1.13.1.)
A Python ≥3.625 virtual environment can be set up for StyleGAN to keep dependencies tidy, & the StyleGAN dependencies installed:

conda create -n stylegan pip python=3.6
source activate stylegan
## TF:
pip install tensorflow-gpu
## Test install:
python -c "import tensorflow as tf; tf.enable_eager_execution(); \
    print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
pip install tensorboard
## StyleGAN:
## Install pre-requisites:
pip install pillow numpy moviepy scipy opencv-python lmdb # requests?
## Download:
git clone 'https://github.com/NVlabs/stylegan.git' && cd ./stylegan/
## Test install:
python pretrained_example.py
## ./results/example.png should be a photograph of a middle-aged man

StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little or who are doing something quick like extremely low-resolution training or transfer-learning where a few GPU-hours on a slow small GPU might be enough.

## Configuration

StyleGAN doesn’t ship with any support for CLI options; instead, one must edit train.py and training/training_loop.py:

1. training/training_loop.py: the core configuration is done in the default arguments of the training_loop function. The key arguments are G_smoothing_kimg & D_repeats (which affect the learning dynamics), network_snapshot_ticks (how often to save the pickle snapshots—more frequent means less progress lost in crashes, but as each one weighs 300MB+, they can quickly use up gigabytes of space), resume_run_id (set it to "latest"), and resume_kimg.

   WARNING: resume_kimg governs where in the overall progressive-growing training schedule StyleGAN starts from. If it is set to 0, training begins at the beginning of the progressive-growing schedule, at the lowest resolution, regardless of how much training has been previously done. It is vitally important when doing transfer learning that it is set to a sufficiently high number (eg 10000) that training begins at the highest desired resolution like 512px, as it appears that layers are erased when added during progressive-growing.

   More experimentally, I suggest setting minibatch_repeats = 1 instead of minibatch_repeats = 5; in line with the suspiciousness of the gradient-accumulation implementation in ProGAN/StyleGAN, this appears to make training both stabler & faster.

   Note that some of these variables, like the learning rates, are overridden in train.py. It’s better to set those there, or else you may confuse yourself badly (like I did in wondering why ProGAN & StyleGAN seemed extraordinarily robust to large changes in the learning rates…).

2. train.py (previously config.py): here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. (This file includes settings intended for ProGAN—watch out that you don’t accidentally turn on ProGAN instead of StyleGAN & confuse yourself.)
Learning rate & minibatch should generally be left alone (except towards the end of training, when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, eg:

desc += '-faces'; dataset = EasyDict(tfrecord_dir='faces', resolution=512); train.mirror_augment = True

This sets up the 512px face dataset which was previously created in datasets/faces, turns on mirroring (because while there may be writing in the background, we don’t care about it for face generation), and sets a title for the checkpoints/logs, which will now appear in results/ with the ‘-faces’ string.

Assuming you do not have 8 GPUs (as you probably do not), you must change the preset to match your number of GPUs; StyleGAN will not automatically choose the correct number of GPUs. If you fail to set it correctly to the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with an opaque error message (note that CUDA uses zero-indexing, so GPU:0 refers to the first GPU, GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my—nonexistent—third GPU):

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
    G_synthesis_3/lod: {{node G_synthesis_3/lod}}was explicitly assigned to /device:GPU:2 but available \
    devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
    /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
    /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
    Make sure the device specification refers to a valid device. [[{{node G_synthesis_3/lod}}]]

For my 2x1080ti setup:

desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8; sched.minibatch_dict = \
    {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}; sched.G_lrate_dict = {512: 0.0015, 1024: 0.002}; \
    sched.D_lrate_dict = EasyDict(sched.G_lrate_dict); train.total_kimg = 99000

So my results get saved to results/00001-sgan-faces-2gpu etc (the run ID increments, ‘sgan’ because StyleGAN rather than ProGAN, ‘-faces’ as the dataset being trained on, and ‘2gpu’ because it’s multi-GPU).

## Running

I typically run StyleGAN in a detachable session (eg under screen or tmux) which keeps multiple shells organized: 1 terminal/shell for the StyleGAN run, 1 terminal/shell for TensorBoard, and 1 for Emacs. With Emacs, I keep the two key Python files open (train.py and training/training_loop.py) for reference & easy editing. With the “latest” patch (below), StyleGAN can be thrown into a while-loop to keep running after crashes, like:

while true; do nice py train.py ; date; (xmessage "alert: StyleGAN crashed" &); sleep 10s; done

TensorBoard is a logging utility which displays little time-series of recorded variables which one views in a web browser, eg:

tensorboard --logdir results/02022-sgan-faces-2gpu/
# TensorBoard 1.13.0 at http://127.0.0.1:6006 (Press CTRL+C to quit)

Note that TensorBoard can be backgrounded, but needs to be restarted (with the new logdir) every time a new run is started, as the results will then be in a different folder.

Training StyleGAN is much easier & more reliable than other GANs, but it is still more of an art than a science. (We put up with it because while GANs suck, everything else sucks more.) Notes on training:

• Crashproofing: The initial release of StyleGAN was prone to crashing when I ran it, segfaulting at random.
Updating TensorFlow appeared to reduce this, but the root cause is still unknown. Segfaulting or crashing is also reportedly common if running on mixed GPUs (eg a 1080ti + Titan V). Unfortunately, StyleGAN has no built-in setting for simply resuming from the latest snapshot after crashing/exiting (which is what one usually wants), and one must manually edit the resume_run_id setting to point it at the latest run ID. This is tedious and error-prone—at one point I realized I had wasted 6 GPU-days of training by restarting from a 3-day-old snapshot because I had not updated the resume_run_id after a segfault! If you are doing any runs longer than a few wallclock hours, I strongly advise use of the following patch to automatically restart from the latest snapshot by setting resume_run_id = "latest":

diff --git a/training/misc.py b/training/misc.py
index 50ae51c..d906a2d 100755
--- a/training/misc.py
+++ b/training/misc.py
@@ -119,6 +119,14 @@ def list_network_pkls(run_id_or_run_dir, include_final=True):
         del pkls[0]
     return pkls

+def locate_latest_pkl():
+    allpickles = sorted(glob.glob(os.path.join(config.result_dir, '0*', 'network-*.pkl')))
+    latest_pickle = allpickles[-1]
+    resume_run_id = os.path.basename(os.path.dirname(latest_pickle))
+    RE_KIMG = re.compile('network-snapshot-(\d+).pkl')
+    kimg = int(RE_KIMG.match(os.path.basename(latest_pickle)).group(1))
+    return (locate_network_pkl(resume_run_id), float(kimg))
+
 def locate_network_pkl(run_id_or_run_dir_or_network_pkl, snapshot_or_network_pkl=None):
     for candidate in [snapshot_or_network_pkl, run_id_or_run_dir_or_network_pkl]:
         if isinstance(candidate, str):
diff --git a/training/training_loop.py b/training/training_loop.py
index 78d6fe1..20966d9 100755
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -148,7 +148,10 @@ def training_loop(
     # Construct networks.
     with tf.device('/gpu:0'):
         if resume_run_id is not None:
-            network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
+            if resume_run_id == 'latest':
+                network_pkl, resume_kimg = misc.locate_latest_pkl()
+            else:
+                network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
             print('Loading networks from "%s"...' % network_pkl)
             G, D, Gs = misc.load_pkl(network_pkl)
         else:

(The diff can be edited in by hand, or copied into the repo as a file like latest.patch & then applied with git apply latest.patch.)

• Tuning Learning Rates: The LR is one of the most critical hyperparameters: too-large updates based on too-small minibatches are devastating to GAN stability & final quality. The LR also seems to interact with the intrinsic difficulty or diversity of an image domain; Karras et al 2019 use 0.003 G/D LRs on their FFHQ dataset (which has been carefully curated and the faces aligned to put landmarks like eyes/mouth in the same locations in every image) when training on 8-GPU machines with minibatches of n=32, but I find lower to be better on my anime face/portrait datasets where I can only do n=8. From looking at training videos of whole-Danbooru2018 StyleGAN runs, I suspect that the necessary LRs would be lower still. Learning rates are closely related to minibatch size (a common rule of thumb in supervised learning of CNNs is that the biggest usable LR follows a square-root curve in minibatch size), and the BigGAN research argues that minibatch size itself strongly influences how bad mode dropping is, which suggests that smaller LRs may be more necessary the more diverse/difficult a dataset is.
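To make the square-root rule of thumb concrete, here is a tiny calculation (a heuristic only, not anything StyleGAN itself implements): scaling the FFHQ reference LR of 0.003 at n=32 down to the n=8 minibatches used here.

def scaled_lr(base_lr=0.003, base_minibatch=32, minibatch=8):
    # square-root scaling rule of thumb: usable LR ∝ sqrt(minibatch size)
    return base_lr * (minibatch / base_minibatch) ** 0.5

print(scaled_lr())   # 0.0015

(Coincidentally or not, this matches the 0.0015 set for 512px in the 2x1080ti preset above.)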
• Balancing G/D: Later in training, if the G is not making good progress towards the ultimate goal of a 0.5 loss (with the D’s loss gradually decreasing towards 0.5), and instead has a loss stubbornly stuck around -1 or so, it may be necessary to change the balance of G/D. This can be done several ways, but the easiest is to adjust the learning rates, sched.G_lrate_dict & sched.D_lrate_dict. One needs to keep an eye on the G/D losses and also on the perceptual quality of the faces (since we don’t have any good FID equivalent yet for anime faces, which would require a good open-source Danbooru tagger to create embeddings), and reduce both LRs (or usually just the D’s LR) based on the face quality and whether the G/D losses are exploding or otherwise look imbalanced. What you want, I think, is for the G/D losses to be stable at a certain absolute amount for a long time while the quality visibly improves, reducing D’s LR as necessary to keep it balanced with G; and then once you’ve run out of time/patience, or artifacts are showing up, you can decrease both LRs to converge onto a local optimum. I find the default of 0.003 can be too high once quality reaches a high level with both faces & portraits, and it helps to reduce it by a third to 0.001 or a tenth to 0.0003. If there still isn’t convergence, the D may be too strong and its LR can be turned down separately, to a tenth or a fiftieth even. (Given the stochasticity of training & the relativity of the losses, one should wait several wallclock hours or days after each modification to see if it made a difference.)

• Skipping FID metrics: Some metrics are computed for logging/reporting. The FID metrics are calculated using an old ImageNet CNN; what is realistic on ImageNet may have little to do with your particular domain, and while a large FID like 100 is concerning, FIDs like 20, or even increasing FIDs, are not necessarily a problem or useful guidance compared to just looking at the generated samples or the loss curves. Given that computing FID metrics is not free & potentially irrelevant or misleading on many image domains, I suggest disabling them entirely. (They are not used in the training for anything, and disabling them is safe.) They can be edited out of the main training loop by commenting out the call to metrics.run like so:

@@ -261,7 +265,7 @@ def training_loop()
         if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
             pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
             misc.save_pkl((G, D, Gs), pkl)
             # metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)

• ‘Blob’ & ‘Crack’ Artifacts: During training, ‘blobs’ often show up or move around. These blobs appear even late in training on otherwise high-quality images and are unique to StyleGAN (at least, I’ve never seen another GAN whose training artifacts look like the blobs). That they are so large & glaring suggests a weakness in StyleGAN somewhere. The source of the blobs is as yet unknown, but nshepperd has speculated they are related to the 3x3 convolution layers; it is possible that adding additional (1x1) convolution or self-attention layers would eliminate them. If you watch training videos, these blobs seem to gradually morph into new features such as eyes or hair or glasses. I suspect they are part of how StyleGAN ‘creates’ new features, starting with a feature-less blob superimposed at approximately the right location, which is gradually refined into something useful.
If blobs are appearing too often, or one wants a final model without any new intrusive blobs, it may help to lower the LR to try to converge to a local optimum. In training anime faces, I have seen additional artifacts, which look like ‘cracks’ or ‘waves’ or elephant-skin wrinkles or the sort of fine crazing seen in old paintings or ceramics, which appear toward the end of training, primarily on skin or areas of flat color; they happen particularly fast when transfer learning on a small dataset. In contrast to the blob artifacts, I currently suspect the cracks are a sign of overfitting rather than a peculiarity of normal StyleGAN training, where the G has started trying to memorize noise in the fine detail of pixelation/lines, and the only solution I have found so far is to either stop training or get more data.

• Gradient Accumulation: ProGAN/StyleGAN’s codebase claims to support gradient accumulation, which is a way to fake large-minibatch training (eg n=2048) by not doing the backpropagation update every minibatch, but instead summing the gradients over many minibatches and applying them all at once. This is a useful trick for stabilizing training, and large-minibatch NN training can differ qualitatively from small-minibatch NN training—BigGAN performance increased with increasingly large minibatches (n=2048), and the authors speculate that this is because such large minibatches mean that the full diversity of the dataset is represented in each ‘minibatch’, so the BigGAN models cannot simply ‘forget’ rarer datapoints which would otherwise not appear for many minibatches in a row, resulting in the GAN pathology of ‘mode dropping’ where some kinds of data just get ignored by both G/D. However, the ProGAN/StyleGAN implementation of gradient accumulation does not resemble that of any other implementation I’ve seen in TensorFlow or PyTorch, and in my own experiments with up to n=4096, I didn’t observe any stabilization or qualitative differences, so I am suspicious the implementation is wrong.

Here is what a successful training progression looks like for the anime face StyleGAN:

The anime face model as of 8 March 2019, trained for 21,980 iterations or ~21m images or ~38 GPU-days, is available for download. (It is still not fully converged, but the quality is good.)

# Sampling

Having successfully trained a StyleGAN, now the fun part—generating samples!

## Psi/“truncation trick”

The 𝜓/“truncation trick” is the most important hyperparameter for all StyleGAN generation. The truncation trick is used at sample generation time but not training time. The idea is to edit the latent vector z, which is a vector of 𝒩(0,1) random variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those.26 This seems to help by avoiding ‘extreme’ latent values or combinations of latent values which the G is not as good at—a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted distribution for each one, or one can truncate them at something like +1SD or +0.7SD.
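A minimal sketch of that resample-above-a-threshold idea in z-space (note that StyleGAN’s actual truncation_psi implementation instead shrinks the intermediate w latent toward the mean w; this is only the conceptual version described above):

import numpy as np

def truncated_z(dim=512, threshold=0.7, rng=np.random):
    z = rng.randn(dim)
    mask = np.abs(z) > threshold
    while mask.any():                   # resample only the 'extreme' entries
        z[mask] = rng.randn(mask.sum())
        mask = np.abs(z) > threshold
    return z

z = truncated_z()
print(z.min(), z.max())                 # every entry now lies within ±0.7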
## Random Samples

The StyleGAN repo has a simple script to download & generate a single face; in the interests of reproducibility, it hardwires the model and the RNG seed so it will only generate 1 particular face. However, it can be easily adapted to use a local model and (slowly27) generate, say, 1000 sample images with the hyperparameter 𝜓=0.6 (which gives high-quality but not highly-diverse images), which are saved to results/example-{0-999}.png:

import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    Gs.print_layers()
    for i in range(0, 1000):
        rnd = np.random.RandomState(None)
        latents = rnd.randn(1, Gs.input_shape[1])
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.6, randomize_noise=True, output_transform=fmt)
        os.makedirs(config.result_dir, exist_ok=True)
        png_filename = os.path.join(config.result_dir, 'example-' + str(i) + '.png')
        PIL.Image.fromarray(images[0], 'RGB').save(png_filename)

if __name__ == "__main__":
    main()

## Karras et al 2018 Figures

The figures in Karras et al 2018, demonstrating random samples and aspects of the style noise using the 1024px FFHQ face model (as well as the others), were generated by generate_figures.py.
This script needs extensive modifications to work with my 512px anime face model; going through the file:

• the code uses 𝜓=1 truncation, but faces look better with 𝜓=0.7 (several of the functions have truncation_psi= settings but, trickily, Figure 3's draw_style_mixing_figure has its 𝜓 setting hidden away in the synthesis_kwargs global variable)
• the loaded model needs to be switched to the anime face model, of course
• dimensions must be reduced 1024→512 as appropriate; some ranges are hardcoded and must be reduced for 512px images as well
• the truncation trick figure 8 doesn't show enough faces to give insight into what the latent space is doing, so it needs to be expanded to show both more random seeds/faces and more 𝜓 values
• the bedroom/car/cat samples should be disabled

The changes I make are as follows:

diff --git a/generate_figures.py b/generate_figures.py
index 45b68b8..f27af9d 100755
--- a/generate_figures.py
+++ b/generate_figures.py
@@ -24,16 +24,13 @@
 url_bedrooms = 'https://drive.google.com/uc?id=1MOSKeGF0FJcivpBI7s63V9YHloUTO
 url_cars = 'https://drive.google.com/uc?id=1MJ6iCfNtMIRicihwRorsM3b7mmtmK9c3' # karras2019stylegan-cars-512x384.pkl
 url_cats = 'https://drive.google.com/uc?id=1MQywl0FNt6lHu8E_EUqnRbviagS7fbiJ' # karras2019stylegan-cats-256x256.pkl

-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)

 _Gs_cache = dict()

 def load_Gs(url):
-    if url not in _Gs_cache:
-        with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
-            _G, _D, Gs = pickle.load(f)
-        _Gs_cache[url] = Gs
-    return _Gs_cache[url]
+    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+    return Gs

 #----------------------------------------------------------------------------
 # Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
     canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
     for row, seed in enumerate(seeds):
         latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
-        images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+        images = Gs.run(latents, None, **synthesis_kwargs)
         canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
         for i in range(4):
             crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
     all_images = []
     for noise_range in noise_ranges:
         tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
-        range_images = Gsc.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+        range_images = Gsc.run(latents, None, randomize_noise=False, **synthesis_kwargs)
         range_images[flips, :, :] = range_images[flips, :, ::-1]
         all_images.append(list(range_images))
@@ -144,14 +141,11 @@ def main():
     tflib.init_tf()
     os.makedirs(config.result_dir, exist_ok=True)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
-    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
-    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
-    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
-    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])

All this done, we get some fun anime face samples to parallel Karras et al 2018's figures:

## Videos

### Training Montage

The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated, and at the end they can be quite large (often 14MB each for the anime faces), so doing lossy compression with a tool like pngnq+advpng, or converting them to JPG with lowered quality, is a good idea.
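For example, a quick PIL pass can convert the snapshots into much smaller, lower-quality JPGs (paths assumed to follow the usual results/ layout):

import glob
import os
from PIL import Image

for png in glob.glob('results/*faces*/fakes*.png'):
    jpg = os.path.splitext(png)[0] + '.jpg'
    Image.open(png).convert('RGB').save(jpg, quality=60)   # ~60 quality is plenty for a montage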
To turn the many snapshots into a training montage video like above, I use FFmpeg on the PNGs:

cat $(ls ./results/*faces*/fakes*.png | sort --numeric-sort) | ffmpeg -framerate 10 \ # show 10 inputs per second
    -i - \                              # read the images from stdin
    -r 25 \                             # output frame-rate; frames will be duplicated to pad out to 25FPS
    -c:v libx264 \                      # x264 for compatibility
    -pix_fmt yuv420p \                  # force ffmpeg to use a standard colorspace - otherwise the PNG colorspace is kept, breaking browsers (!)
    -crf 33 \                           # adequate high quality
    -vf "scale=iw/2:ih/2" \             # shrink the image by 2x; the full detail is not necessary & saves space
    -preset veryslow -tune animation \  # aim for the smallest file possible with animation-tuned settings
    ./stylegan-facestraining.mp4

### Interpolations

The original ProGAN repo provided a config for generating interpolation videos, but that was removed in StyleGAN. Diagne (@kikko_fr) implemented a replacement, video.py, providing 3 kinds of videos:

1. random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default it makes 4 of them arranged 2x2 in the video. Several interpolation videos are shown in the examples section.
2. interpolate.mp4: a ‘coarse’ “style mixing” video; a single ‘source’ face is generated & held constant; a secondary interpolation video, a random walk as before is generated; at each step of the random walk, the ‘coarse’/high-level ‘style’ noise is copied from the random walk to overwrite the source face’s original style noise. For faces, this means that the original face will be modified with all sorts of orientations & facial expressions while still remaining recognizably the original character. (It is the video analog of Karras et al 2018’s Figure 3.)

A copy of Diagne’s video.py:

import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config
import scipy.ndimage

def main():

    tflib.init_tf()

    # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
    ## NOTE: insert model here:
    _G, _D, Gs = pickle.load(open("results/02047-sgan-faces-2gpu/network-snapshot-013221.pkl", "rb"))
    # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
    # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
    # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.

    grid_size = [2, 2]
    image_shrink = 1
    image_zoom = 1
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'
    random_seed = 404
    mp4_file = 'results/random_grid_%s.mp4' % random_seed
    minibatch_size = 8

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_state = np.random.RandomState(random_seed)

    # Generate latent vectors
    shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:] # [frame, image, channel, component]
    all_latents = random_state.randn(*shape).astype(np.float32)
    all_latents = scipy.ndimage.gaussian_filter(all_latents, [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
    all_latents /= np.sqrt(np.mean(np.square(all_latents)))

    def create_image_grid(images, grid_size=None):
        assert images.ndim == 3 or images.ndim == 4
        num, img_h, img_w, channels = images.shape

        if grid_size is not None:
            grid_w, grid_h = tuple(grid_size)
        else:
            grid_w = max(int(np.ceil(np.sqrt(num))), 1)
            grid_h = max((num - 1) // grid_w + 1, 1)

        grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
        for idx in range(num):
            x = (idx % grid_w) * img_w
            y = (idx // grid_w) * img_h
            grid[y : y + img_h, x : x + img_w] = images[idx]
        return grid

    # Frame generation func for moviepy.
    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        latents = all_latents[frame_idx]
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)

        grid = create_image_grid(images, grid_size)
        if image_zoom > 1:
            grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
        if grid.shape[2] == 1:
            grid = grid.repeat(3, 2) # grayscale => RGB
        return grid

    # Generate video.
    import moviepy.editor
    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

    # coarse style-mixing video
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_seed = 500
    random_state = np.random.RandomState(random_seed)

    w = 512
    h = 512
    #src_seeds = [601]
    dst_seeds = [700]
    style_ranges = ([0] * 7 + [range(8, 16)]) * len(dst_seeds)

    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)

    shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
    src_latents = random_state.randn(*shape).astype(np.float32)
    src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                smoothing_sec * mp4_fps,
                                                mode='wrap')
    src_latents /= np.sqrt(np.mean(np.square(src_latents)))

    dst_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in dst_seeds])

    src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
    dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
    src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
    dst_images = Gs.components.synthesis.run(dst_dlatents, randomize_noise=False, **synthesis_kwargs)

    canvas = PIL.Image.new('RGB', (w * (len(dst_seeds) + 1), h * 2), 'white')

    for col, dst_image in enumerate(list(dst_images)):
        canvas.paste(PIL.Image.fromarray(dst_image, 'RGB'), ((col + 1) * h, 0))

    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        src_image = src_images[frame_idx]
        canvas.paste(PIL.Image.fromarray(src_image, 'RGB'), (0, h))

        for col, dst_image in enumerate(list(dst_images)):
            col_dlatents = np.stack([dst_dlatents[col]])
            col_dlatents[:, style_ranges[col]] = src_dlatents[frame_idx, style_ranges[col]]
            col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
            for row, image in enumerate(list(col_images)):
                canvas.paste(PIL.Image.fromarray(image, 'RGB'), ((col + 1) * h, (row + 1) * w))
        return np.array(canvas)

    # Generate video.
    mp4_file = 'results/interpolate.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'

    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

    # fine style-mixing video
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_seed = 503
    random_state = np.random.RandomState(random_seed)

    w = 512
    h = 512
    style_ranges = [range(6, 16)]

    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)

    shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
    src_latents = random_state.randn(*shape).astype(np.float32)
    src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                smoothing_sec * mp4_fps,
                                                mode='wrap')
    src_latents /= np.sqrt(np.mean(np.square(src_latents)))

    dst_latents = np.stack([random_state.randn(Gs.input_shape[1])])

    src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
    dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]

    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        col_dlatents = np.stack([dst_dlatents[0]])
        col_dlatents[:, style_ranges[0]] = src_dlatents[frame_idx, style_ranges[0]]
        col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
        return col_images[0]

    # Generate video.
    mp4_file = 'results/fine_%s.mp4' % (random_seed)
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'

    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()
3. fine_503.mp4: a 'fine' style mixing video; in this case, the style noise is taken from later on and instead of affecting the global orientation or expression, it affects subtler details like the precise shape of hair strands or hair color or mouths.

Circular interpolations are another interesting kind of interpolation, written by snowy halcy, which instead of randomly walking around the latent space freely, with large or awkward transitions, moves around a fixed high-dimensional point, doing: "binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a sphere or circle in the latent space." A later version of circular interpolation is in snowy halcy's face editor repo, but here is the original version cleaned up into a stand-alone program:

import dnnlib.tflib as tflib
import math
import moviepy.editor
from numpy import linalg
import numpy as np
import pickle

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    # Walk a circle of fixed radius in the plane spanned by 3 random latent points:
    def circ_generator(latents_interpolate):
        radius = 40.0 # NOTE: radius value assumed; tune to taste
        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)

        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius

        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    # Binary-search the step size so consecutive frames differ by a roughly constant MSE:
    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0

        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = []

        video_length = 1.0
        while current_pos < video_length:
            array_list.append(current_image)

            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0

                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0

                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print(current_pos, current_mse)
        return array_list

    array_list = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(array_list, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20

    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()

An interesting use of interpolations is Kyle McLean’s video: a singing anime video mashing up StyleGAN anime faces + lyrics + music.

# Models

## Anime Faces

The primary model I've trained, the anime face model, is described in the data processing & training section. It is a 512px StyleGAN model trained on n=218,794 faces cropped from all of Danbooru2017, cleaned, & upscaled, and trained for 21,980 iterations or ~21m images or ~38 GPU-days.

Downloads (I recommend using the more-recent portrait StyleGAN unless cropped faces are specifically desired):

### TWDNE

To show off the anime faces, and as a joke, on 14 February 2019 I set up TWDNE, a standalone static website which displays a random anime face (out of 100,000), generated with various 𝜓, and paired with GPT-2-small text snippets prompted on anime plot summaries. (The implementation details are too lengthy to go into here.)

But the site was amusing & an enormous success. It went viral overnight and by the end of March 2019, ~1 million unique visitors (most from China) had visited TWDNE, spending over 2 minutes each looking at the NN-generated faces & text; people began hunting for hilariously-deformed faces, using TWDNE as a screensaver, picking out faces as avatars, creating packs of faces for video games, painting their own collages of faces, using it as a character designer for inspiration, etc.

## Anime Bodies

One user experimented with a custom 256px anime game image dataset which has individual characters posed in whole-person images, to see how StyleGAN coped with more complex geometries. Progress required additional data cleaning and lowering the learning rate but, trained on a 4-GPU system for a week or two, the results are promising (even down to reproducing the copyright statements in the images), providing preliminary evidence that StyleGAN can scale:

# Transfer Learning

One of the most useful things to do with a trained model on a broad data corpus is to use it as a launching pad to train a better model quicker on lesser data, called “transfer learning”. For example, one might transfer learn from Nvidia’s FFHQ face StyleGAN model to a different celebrity dataset, or from . Or with the anime face model, one might retrain it on a subset of faces—all characters with red hair, or all male characters, or just a single specific character. Even if a dataset seems different, starting from a pretrained model can save time; after all, while male and female faces may look different and it may seem like a mistake to start from a mostly-female anime face model, the alternative of starting from scratch means starting with a model generating random static, and male faces look far more like female faces than they do random static. (Indeed, you can quickly train a photographic face model starting from the anime face model.) This extends the reach of good StyleGAN models from those blessed with both big data & big compute to those with little of either.

Transfer learning works particularly well for specializing the anime face model to a specific character: the images of that character would be too little to train a good StyleGAN on, too data-impoverished for the sample-inefficient StyleGAN, but having been trained on all anime faces, the StyleGAN has learned well the full space of anime faces and can easily specialize down without overfitting. Trying to do, say, faces → landscapes is probably a bridge too far.

Data-wise, for doing face specialization, the more the better, but n=500–5000 is an adequate range, and even as low as n=50 works surprisingly well. I don't know to what extent data augmentation can substitute for original datapoints, but it's probably worth a try, especially if you have n<5000.
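If one does try data augmentation, even a crude PIL-based pass along these lines multiplies a small character dataset several-fold (directory names are hypothetical; heavier pipelines add crops, sharpening, noise, & color shifts):

import glob
import os
import random
from PIL import Image, ImageEnhance

os.makedirs('holo-augmented', exist_ok=True)
out_i = 0
for path in glob.glob('holo-faces/*.png'):                       # cleaned source faces
    img = Image.open(path).convert('RGB')
    variants = [img, img.transpose(Image.FLIP_LEFT_RIGHT)]       # mirroring doubles the data for symmetric characters
    for base in list(variants):
        variants.append(base.rotate(random.uniform(-5, 5), resample=Image.BILINEAR))
        variants.append(ImageEnhance.Brightness(base).enhance(random.uniform(0.9, 1.1)))
    for v in variants:
        v.resize((512, 512), Image.LANCZOS).save('holo-augmented/%06d.png' % out_i)
        out_i += 1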

Compute-wise, specialization is rapid. Adaptation can happen within a few ticks, possibly even 1. This is surprisingly fast given that StyleGAN is not designed for few-shot/transfer learning. I speculate that this may be because the StyleGAN latent space is expressive enough that even new faces (such as new human faces for a FFHQ model, or a new anime character for an anime-face model) are still already present in the latent space. Examples of the expressivity are provided by , who find that “although the StyleGAN generator is trained on a human face dataset [FFHQ], the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars.” If even images as different as cars can be encoded successfully into a face StyleGAN, then clearly the latent space can easily model new faces and so any new face training data is in some sense already learned; so the training process is perhaps not so much about learning ‘new’ faces as about making the new faces more ‘important’ by expanding the latent space around them & contracting it around everything else, which seems like a far easier task.

How does one actually do transfer learning? Since StyleGAN is (currently) unconditional with no dataset-specific categorical or text or metadata encoding, just a flat set of images, all that has to be done is to encode the new dataset and simply start training with an existing model. One creates the new dataset as usual, and then edits training.py with a new -desc line for the new dataset, and if resume_kimg is set correctly (see next paragraph) and resume_run_id = "latest" enabled as advised, you can then run python train.py and presto, transfer learning.

The main problem seems to be that training cannot be done from scratch/0 iterations, as one might naively assume—when I tried this, it did not work well and StyleGAN appeared to be ignoring the pretrained model. My hypothesis is that as part of the progressive growing/fading in of additional resolution/layers, StyleGAN simply randomizes or wipes out each new layer and overwrites them—making it pointless. This is easy to avoid: simply jump the training schedule all the way to the desired resolution. For example, to start at 512px one might set resume_kimg=7000 in training_loop.py. This forces StyleGAN to skip all the progressive growing and load the full model as-is. To make sure you did it right, check the first sample (fakes07000.png or whatever), from before any transfer learning training has been done, and it should look like the original model did at the end of its training. Then subsequent training samples should show the original quickly morphing to the new dataset. (Anything like fakes00000.png should not show up, because that would indicate training is beginning from scratch.)
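Concretely, the handful of settings involved look something like this (a sketch only: the exact file locations & defaults vary a little between StyleGAN releases, and the dataset name is hypothetical):

from dnnlib import EasyDict

# train.py: point the run at the new dataset's TFRecords directory
desc = 'sgan'
desc += '-holo'
dataset = EasyDict(tfrecord_dir='holo')

# training/training_loop.py: resume from the pretrained model at full resolution
resume_run_id = "latest"    # pick up the newest network-snapshot-*.pkl automatically
resume_kimg = 7000.0        # jump past progressive growing so the full 512px network is loaded as-is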

## Anime Faces → Character Faces

### Holo

The first transfer learning was done with Holo of . It used a 512px Holo face dataset created with Nagadomi’s cropper from all of Danbooru2017, upscaled with waifu2x, cleaned by hand, and then data-augmented from n=3900 to n=12600; mirroring was enabled since Holo is symmetrical. I then used the anime face model as of 9 February 2019—it was not fully converged, indeed, wouldn’t converge with weeks more training, but the quality was so good I was too curious as to how well retraining would work so I switched gears.

It’s worth mentioning that this dataset was used previously with ProGAN, where after weeks of training, ProGAN overfit badly as demonstrated by the samples & interpolation videos.

Training happened remarkably quickly, with all the faces converted to recognizably Holo faces within a few hundred iterations:

The best samples were convincing without exhibiting the failures of the ProGAN:

The StyleGAN was much more successful, despite a few failure latent points carried over from the anime faces. Indeed, after a few hundred iterations, it was starting to overfit with the 'crack' artifacts & smearing in the interpolations. The latest I was willing to use was iteration #11370, and I think it is still somewhat overfit anyway. I thought that with its total n (after data augmentation), Holo would be able to train longer (being roughly a sixth the size of FFHQ), but apparently not. Perhaps the data augmentation is considerably less valuable than 1-for-1, either because the invariants encoded by the data augmentations aren't that useful (suggesting that Geirhos et al 2018-like style transfer data augmentation is what's necessary), or because they would be but the anime face StyleGAN has already learned them all as part of the previous training & needs more real data to better understand Holo-like faces. It's also possible that the results could be improved by using one of the later anime face StyleGANs since they did improve when I trained them further after my 2 Holo/Asuka transfer experiments.

Nevertheless, impressed, I couldn’t help but wonder if they had reached human-levels of verisimilitude: would an unwary viewer assume they were handmade?

So I selected ~100 of the best samples (24MB; ) from a dump of 2000, cropped about 5% from the left/right sides to hide the background artifacts a little bit, and submitted them on 11 February 2019 to the character's subreddit under an alt account. I made the mistake of sorting by filesize & thus leading with a face that was particularly suspicious (streaky hair), so one Redditor voiced the suspicion they were from MGM (absurd yet not entirely wrong), but all the other commenters took the faces in stride or praised them, and the submission received +248 votes (99% positive) by March. A Redditor then made a follow-up submission which earned +192 (100%) and many positive comments, with no further suspicions until I explained. Not bad indeed.

### Asuka

After the Holo training & link submission went so well, I knew I had to try my other character dataset, Asuka, using n=5300 data-augmented to n=58,000.28 Keeping in mind how data seemed to limit the Holo quality, I left mirroring enabled for Asuka, even though she is not symmetrical due to her eyepatch over her left eye (as purists will no doubt note).

Interestingly, while Holo trained within GPU-hours, Asuka proved much more difficult and did not seem to be finished training or showing the cracks despite training twice as long. Is this due to having ~35% more real data, having 10x rather than 3x data augmentation, or some inherent difference like Asuka being more complex (eg because of more variations in her appearance like the eyepatches or plugsuits)?

I generated 1000 random samples with 𝜓=1.2 because they were particularly interesting to look at. As with Holo, I picked out the best 100 (13MB; ) from ~2000:

And I submitted to the subreddit, where it also did well (+109, 98%); there were no speculations about the faces being NN-generated before I revealed it, merely requests for more. Between the two, it appears that with adequate data (n>3000) and moderate curation, a simple kind of art Turing test can be passed.

### Zuihou

In early February 2019, using the then-released model, a Redditor tried transfer learning to n=500 faces of the KanColle character Zuihou for ~1 tick (~60k iterations).

The samples & interpolations have many artifacts, but the sample size is tiny and I’d consider this good finetuning from a model never intended for few-shot learning:

Probably it could be made better by starting from the latest anime face StyleGAN model, and using aggressive data augmentation. Another option would be to try to find as many characters which look similar to Zuihou (matching on hair color might work) and train on a joint dataset—unconditional samples would then need to be filtered for just Zuihou faces, but perhaps that drawback could be avoided by a third stage of Zuihou-only training?

### Akizuki

Another Kancolle character, , was trained in .

### Saber

likewise did transfer to (), n=4000. The results look about as expected given the sample sizes and previous transfer results:

### Fate/Grand Order

Sugimura in May 2019 experimented with transfer learning from the 512px anime portrait GAN to faces cropped from ~6k wallpapers he downloaded via Google search queries. His results for Saber & related characters look reasonable but, more broadly, somewhat low-quality, which Sugimura suspects is due to inadequate data cleaning ("there are a number of lower quality images and also images of backgrounds, armor, non-character images left in the dataset which causes weird artifacts in generated images or just lower quality generated images.").

### Louise

Finally, Ending_Credits did transfer to Louise (Zero no Tsukaima), n=350:

Not as good as Saber due to the much smaller sample size.

### Lelouch

roadrunner01 experimented with a number of transfers, including one to the male character Lelouch (Code Geass) with n=50 (!), which is not nearly as garbage as it should be.

### Asashio

Another user experimented with transfer to n=988 (augmented to n=18,772) Asashio (KanColle) faces, creating an Asashio-specialized model.

### Marisa Kirisame & Satori/Koishi Komeiji

A Japanese user, mei_miya, posted samples of the Touhou character Marisa Kirisame generated by transfer learning. They also did the Touhou characters Satori/Koishi Komeiji.

## Anime Faces → Anime Headshots

One Twitter user did transfer learning to headshots by a specific artist, n≅1000. His images work well and the interpolation looks nice:

## Anime Faces → Portrait

TWDNE was a huge success and popularized the anime face StyleGAN. It was not perfect, though, and flaws were noted.

The main issues I saw for the faces:

1. Sexually-Suggestive Faces: because I had not expected StyleGAN to work or to wind up making something like TWDNE, I had not taken the effort to crop faces solely from the SFW subset, since no GAN had proven to be good enough to pick up any embarrassing details and I was more concerned with maximizing the dataset size. The explicitly-NSFW images make up only ~9% of Danbooru but between the SFW-but-suggestive images and the explicit ones, and StyleGAN’s learning capabilities, this proved to be enough to make some of the faces quite naughty-looking. Naturally, everyone insisted on joking about this.

2. Head Crops: Nagadomi’s face-cropper is a face cropper, not a head-cropper or a portrait-cropper; it centers its crops on the center of a face (like the nose) and will cut off all the additional details associated with anime heads such as the ‘ahoge’ or bunny ears or twin-tails. Similarly, I had left Nagadomi’s face-cropper on the default settings instead of bothering to tweak it to produce more head-shot-like crops—since if GANs couldn’t master the faces there was no point in making the problem even harder & worrying about details of the hair.

This was not good for characters with distinctive hats or hair or animal ears (such as Holo's wolf ears).

3. Messy Background/Bodies: I suspected that the tightness of the crops also made it hard for StyleGAN to learn things in the edges, like backgrounds or shoulders, because they would always be partial if the face-cropper was doing its job. With bigger crops, there would be more variation and more opportunity to see whole shoulders or large unobstructed backgrounds, and this might lead to more convincing overall images.

4. Holo/Asuka Overrepresentation: to my surprise, TWDNE viewers seemed quite annoyed by the overrepresentation of Holo/Asuka-like (but mostly Holo) samples. For the same reason as not filtering to SFW, I had thrown in 2 earlier datasets I had made of Holo & Asuka faces—I had made them at 512px, and cleaned them fairly thoroughly, and they would increase the dataset size, so why not? Being overrepresented, and well-represented in Danbooru (a major part of why I had chosen them in the first place to make prototype datasets with), of course StyleGAN was more likely to generate samples looking like them than other popular anime characters.29 Why this annoyed people, I don't understand, but it might as well be fixed.

5. Persistent Global Artifacts: despite the generally excellent results, there are still occasional bizarre anomalous images which are scarcely faces at all, even with 𝜓=0.7; I suspect that this may be due to the small percentage of non-faces, cut-off faces, or just poorly/weirdly drawn faces, and that more stringent data cleaning would help polish the model.

### Portrait Improvements

Issues #1–3 can be fixed by transfer-learning StyleGAN on a new dataset made of faces from the SFW subset and cropped with much larger margins to produce more ‘portrait’-style face crops. (There would still be many errors or suboptimal crops but I am not sure there is any full solution short of training a face-localization CNN just for anime images.)

For this, I needed to edit lbpcascade_animeface’s crop.py and adjust the margins. Experimenting, I changed the cropping line to:

    for (x, y, w, h) in faces:
        cropped = image[int(y*0.25): y + h, int(x*0.90): x + int(w*1.25)]

These margins seemed to deliver acceptable results which generally show the entire head while leaving enough room for extra background or hats/ears (although there is still the occasional error like a or image with multiple faces or heads still partially cropped):

After cropping all ~2.8m SFW Danbooru2018 full-resolution images (as demonstrated in the cropping section), I was left with ~700k faces. This was a large dataset, but the disadvantage was that many heads/faces overlapped, so after a few weeks of training, I had decent portraits marred by strange hydra-like heads jutting in from the side. So I redid the cropping process using the solo tag to eliminate images which might have multiple faces in them.

Issue #4 is solved by just not adding the Asuka/Holo datasets.

Finally, issue #5 is harder to deal with: pruning 200k+ images by hand is infeasible, there’s no easy way to improve the face cropping script, and I don’t have the budget to Mechanical-Turk review all the faces like Karras et al 2018 did for FFHQ to remove their false positives (like statues).

One way I do have to improve it is to exploit the Discriminator of a pretrained face GAN: since the D is all about evaluating the probability of a face being a face, it automatically flags outliers & can be used for data cleaning—run the D on the whole dataset to rank each image (faster than it seems, since the G & backpropagation are unnecessary; even a large dataset can be ranked in a wallclock hour or two), then one can manually review the bottom & top X%,30 or perhaps just delete the bottom X% sight unseen if enough data is available. The anime face StyleGAN D would be ideal since it clearly works so well already, so I wrote a ranker.py script to use a StyleGAN checkpoint and rank specified images on disk, and then rebuilt the .tfrecords with troublesome images removed. (This process can be reiterated as the StyleGAN model improves and the D improves its ability to spot anomalies.) I engaged in 5 cycles of ranker.py cleaning over April 2019, deleting 14k images; it seemed to reduce some of the artifacting related to hands.
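The ranker.py script itself is not reproduced here, but the core idea is simple enough to sketch; the caveats are that the exact input preprocessing the D expects (NCHW layout, [-1,1] scaling) and the shape of its output may need adjusting for a given checkpoint, and the image directory is hypothetical:

import glob
import pickle
import numpy as np
import PIL.Image
import dnnlib.tflib as tflib

def main():
    tflib.init_tf()
    _G, D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    scores = []
    for path in glob.glob("portraits/*.png"):                    # candidate faces to clean
        img = np.asarray(PIL.Image.open(path).convert('RGB').resize((512, 512)), dtype=np.float32)
        img = img.transpose(2, 0, 1)[np.newaxis] / 127.5 - 1.0   # HWC uint8 -> NCHW in [-1,1] (assumed)
        out = D.run(img, None)                                   # assumed: first output = realism score
        scores.append((float(np.asarray(out[0]).flatten()[0]), path))
    for score, path in sorted(scores):                           # lowest-scored images first: likeliest junk
        print(score, path)

if __name__ == "__main__":
    main()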

### Portrait Results

After retraining the final face StyleGAN 8 March 2019–30 April 2019 on the new portrait datasets:

I used this model at 𝜓=0.5 to generate 100,000 new portraits for TWDNE (#100,000–199,999), balancing the previous faces.

I was surprised how difficult upgrading to portraits seemed to be; I spent almost two months training it before giving up on further improvements, while I had been expecting more like a week or two. The portrait results are better than the faces (I was right that not cropping off the top of the head adds verisimilitude), but don’t impress me as much as the faces did. And our other experimental runs on whole-Danbooru2018 images never progressed beyond suggestive blobs during this period.

I suspect that StyleGAN—at least, on its default architecture & hyperparameters, without a great deal more compute—is reaching its limits here, and that changes may be necessary to scale to richer images. (Self-attention is probably the easiest to add since it should be easy to plug in additional layers to the convolution code.)

## Anime Faces → Male Faces

A few people have observed that it would be nice to have an anime face GAN for male characters instead of always generating female ones. The anime face StyleGAN does in fact have male faces in its dataset as I did no filtering—it’s merely that female faces are overwhelmingly frequent (and it may also be that male anime faces are relatively androgynous/feminized anyway so it’s hard to tell any difference between a female with short hair & a guy31).

Training a male-only anime face StyleGAN would be another good application of transfer learning.

The faces can be easily extracted out of Danbooru2018 by querying for "male_focus", which will pick up ~150k images. More narrowly, one could search "1boy" & "solo", to ensure that the only face in the image is a male face (as opposed to, say, 1boy 1girl, where a female face might be cropped out as well). This provides n=99k raw hits. It would be good to also filter out ‘trap’ or overly-female-looking faces (else what’s the point?), by filtering on tags like cat ears or particularly popular ‘trap’ characters like Fate/Grand Order’s Astolfo. A more complicated query to pick up scenes with multiple males could be to search for both "1boy" & "multiple_boys" and then filter out "1girl" & "multiple_girls", in order to select all images with 1 or more males and then remove all images with 1 or more females; this doubles the raw hits to n=198k. (A downside is that the face-cropping will often unavoidably yield crops with two faces, a primary face and an overlapping face, which is bad and introduces artifacting when I tried this with all faces.)
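A sketch of the metadata query (assuming the Danbooru2018 JSON metadata layout of one post per line, each with an 'id' field and a 'tags' list of objects carrying a 'name' field; exact field names & shard filenames may differ):

import glob
import json

def tag_names(post):
    return {t['name'] for t in post.get('tags', [])}

keep = []
for shard in glob.glob('metadata/2018*'):                 # hypothetical metadata shard filenames
    with open(shard) as f:
        for line in f:
            post = json.loads(line)
            tags = tag_names(post)
            if ('1boy' in tags or 'multiple_boys' in tags) and \
               not ('1girl' in tags or 'multiple_girls' in tags):
                keep.append(post['id'])

print(len(keep), 'candidate male-only images')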

Combined with transfer learning from the general anime face StyleGAN, the results should be as good as the general (female) faces.

I settled for "1boy" & "solo", and did considerable cleaning by hand. The raw count of images turned out to be highly misleading, and many faces are unusable for a male anime face StyleGAN: many are so highly stylized (such as action scenes) as to be damaging to a GAN, or they are almost indistinguishable from female faces (because they are bishonen or trap or just androgynous), which would be pointless to include (the regular portrait StyleGAN covers those already). After hand cleaning & use of ranker.py, I was left with n~3k, so I used heavy data augmentation to bring it up to n~57k, and I initialized from the final portrait StyleGAN for the highest quality.

It did not overfit after ~4 days of training, but the results were not noticeably improving, so I stopped (in order to start training the GPT-2-345M which OpenAI had just released). There are hints in the interpolation videos, I think, that it is indeed slightly overfitting, in the form of 'glitches' where the image abruptly jumps slightly, presumably to another mode/face/character of the original data; nevertheless, the male face StyleGAN mostly works.

The male face StyleGAN model is available for download, as are 1,000 random faces generated with 𝜓=0.7 (mirror).

## Anime Faces → Danbooru2018

nshepperd began using an early anime face StyleGAN model on the 512px SFW Danbooru2018 subset; after ~3–5 weeks (with many interruptions) on 1 GPU, as of 22 March 2019, the training samples look like this:

The StyleGAN is able to pick up global structure and there are recognizably anime figures, despite the sheer diversity of images, which is promising. The fine details are seriously lacking, and training, to my eye, is wandering around without any steady improvement or sharp details (except perhaps the faces which are inherited from the previous model). I suspect that the learning rate is still too high and, especially with only 1 GPU/n=4, such small minibatches don’t cover enough modes to enable steady improvement. If so, the LR will need to be set much lower (or gradient accumulation used in order to fake having large minibatches where large LRs are stable) & training time extended to multiple months. Another possibility would be to restart with added self-attention layers, which I have noticed seem to particularly help with complicated details & sharpness; the style noise approach may be adequate for the job but just a few vanilla convolution layers may be too few (pace the BigGAN results on the benefits of increasing depth while decreasing parameter count).

## FFHQ Variations

### Anime Faces → FFHQ Faces

If StyleGAN can smoothly warp anime faces among each other and express global transforms like hair length+color with 𝜓, could 𝜓 be a quick way to gain control over a single large-scale variable? For example, male vs female faces, or… anime vs real faces? (Given a particular image/latent vector, one would simply flip the sign to convert it to the opposite; this would give the opposite version of each random face, and if one had an encoder, one could automatically anime-fy or real-fy an arbitrary face by encoding it into the latent vector which creates it, and then flipping.)

Since Karras et al 2018 provide a nice FFHQ download script (albeit slower than I'd like once Google Drive rate-limits you a wallclock hour into the full download) for the full-resolution PNGs, it would be easy to downscale to 512px and create a 512px FFHQ dataset to train on, or even create a combined anime+FFHQ dataset.

The first and fastest thing was to do transfer learning from the anime faces to FFHQ real faces. It was unlikely that the model would retain much anime knowledge & be able to do morphing, but it was worth a try.

The initial results early in training are hilarious and look like zombies:

After 97 ticks, the model has converged to a boringly normal appearance, with the only hint of its origins being perhaps some excessively-fabulous hair in the training samples:

### Anime Faces → Anime Faces + FFHQ Faces

So, that was a bust. The next step is to try training on anime & FFHQ faces simultaneously; given the stark difference between the datasets, would positive vs negative 𝜓 wind up splitting into real vs anime and provide a cheap & easy way of converting arbitrary faces?

This simply merged the 512px FFHQ faces with the 512px anime faces and resumed training from the previous FFHQ model (I reasoned that some of the anime-ness should still be in the model, so it would be slightly faster than restarting from the original anime face model). I trained it for 812 iterations, #11,359–12,171 (somewhat over 2 GPU-days), at which point it was mostly done.

It did manage to learn both kinds of faces quite well, separating them clearly in random samples:

However, the style transfer & 𝜓 samples were disappointments. The style mixing shows limited ability to modify faces cross-domain or convert them, and the truncation trick chart shows no clear disentanglement of the desired factor (indeed, the various halves of 𝜓 correspond to nothing clear):

The interpolation video does show that it learned to interpolate slightly between real & anime faces, giving half-anime/half-real faces, but it looks like it only happens sometimes—mostly with young female faces32:

They’re hard to spot in the interpolation video because the transition happens abruptly, so I generated samples & selected some of the more interesting anime-ish faces:

Similarly, another user trained a StyleGAN on FFHQ+Western portrait illustrations, and the interpolation video is much smoother & more mixed, suggesting that more realistic & more varied illustrations are easier for StyleGAN to interpolate between.

### Anime Faces + FFHQ → Danbooru2018

While I didn’t have the compute to properly train a Danbooru2018 StyleGAN, after nshepperd’s results, I was curious and spent some time (817 iterations, so ~2 GPU-days?) retraining the anime face+FFHQ model on Danbooru2018 SFW 512px images.

The training montage is interesting for showing how faces get repurposed into figures:

One might think that it is a bridge too far for transfer learning, but it seems not.

# Reversing StyleGAN To Control & Modify Images

Modifying images is harder than generating them. An unconditional GAN architecture is, by default, ‘one-way’: the latent vector z gets generated from a bunch of variables, fed through the GAN, and out pops an image. There is no way to run the unconditional GAN ‘backwards’ to feed in an image and pop out the z instead.33

If one could, one could take an arbitrary image and encode it into the z and, by jittering z, generate many new versions of it; or one could feed it back into StyleGAN and play with the style noises at various levels in order to transform the image; or do things like 'average' two images or create interpolations between two arbitrary faces; or one could (assuming one knew what each variable in z 'means') edit the image to change attributes like whether they are smiling.

The most straightforward way would be to switch to a conditional GAN architecture based on a text or tag embedding. Then to generate a specific character wearing glasses, one simply says as much as the conditional input: "character glasses". Or if they should be smiling, add "smile". And so on. This would create images of said character with the desired modifications. This option is not available at the moment as creating a tag embedding & training StyleGAN requires quite a bit of modification. It also is not a complete solution as it wouldn’t work for the cases of editing an existing image.

For an unconditional GAN, there are two complementary approaches to inverting the G:

1. what one NN can learn to do, another can learn to undo (eg , ):

If StyleGAN has learned z→image, then train a second encoder NN on the supervised learning problem of image→z! The sample size is infinite (just keep running G) and the mapping is fixed (given a fixed G), so it's ugly but not that hard. (A toy sketch of this approach follows after this list.)

2. backpropagate a pixel or feature-level loss to ‘optimize’ a latent code (eg ):

While StyleGAN is not inherently reversible, it's not a blackbox: being a NN trained by gradient descent, it must admit of gradients. In training neural networks, there are 3 components: inputs, model parameters, and outputs/losses, and thus there are 3 ways to use backpropagation, even if we usually only use 1. One can hold the inputs fixed and vary the model parameters in order to reduce the loss on the fixed outputs, which is ordinary training of a NN; one can hold the parameters fixed and vary the inputs in order to change (often increase) the activity of internal parts of the model such as particular layers or neurons, which corresponds to neural network visualization & exploration; and finally, one can hold the parameters & the desired output fixed, and use the gradients to iteratively find a set of inputs which creates that specific output with a low loss.34

This can be used to create images which are ‘optimized’ in some sense. For example, in , the gradient descent35 on the individual pixels of an image is done to minimize/maximize a NSFW classifier’s prediction; but this can also be done on a higher level by trying to maximize similarity to a NN embedding of an image to make it as ‘similar’ as possible, as was done originally in Gatys et al 2014 for style transfer, or for more complicated kinds of style transfer like in .

In this case, given an arbitrary desired image, one can initialize a random z, run it forward through the GAN to get an image, compare it at the pixel level with the desired (fixed) image, and the total difference is the 'loss'; holding the GAN fixed, the backpropagation goes back through the model and adjusts the inputs (the unfixed z) to make it slightly more like the desired image. Done many times, the final z will now yield something like the desired image, and that can be treated as its true z. Comparing at the pixel-level can be improved by instead looking at the higher layers in a NN trained to do classification (often an ImageNet VGG), which will focus more on the semantic similarity (more of a "perceptual loss") rather than misleading details of static & individual pixels. The latent code can be the original z, or z after it's passed through the stack of 8 FC layers and has been transformed, or it can even be the various per-layer style noises inside the CNN part of StyleGAN; the last is what some StyleGAN encoders use, and they argue36 that it works better to target the layer-wise encodings than the original z.

This may not work too well as the local optima might be bad or the GAN may have trouble generating precisely the desired image no matter how carefully it is optimized, the pixel-level loss may not be a good loss to use, and the whole process may be quite slow, especially if one runs it many times with many different initial random z to try to avoid bad local optima. But it does mostly work.

3. encoder+backpropagation is a useful hybrid strategy: the encoder makes its best guess at the z, which will usually be close to the true z, and then backpropagation is done for a few iterations to finetune the z. This can be much faster (one forward pass vs many forward+backward passes) and much less prone to getting stuck in bad local optima (since it starts at a good initial z thanks to the encoder).
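As a toy illustration of approach #1 (and only that; a serious encoder would be far larger, train on millions of pairs, & predict the per-layer w latents instead), one can sample (image, z) pairs from the trained G and fit a small CNN to invert it; mixing tf.keras with the repo's TF1 session is assumed to work here, and the model path is the usual anime face checkpoint:

import numpy as np
import pickle
import tensorflow as tf
import dnnlib.tflib as tflib

tflib.init_tf()
_G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
z_dim = Gs.input_shape[1]

def sample_pairs(n=8):
    z = np.random.randn(n, z_dim)
    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    images = Gs.run(z, None, truncation_psi=1.0, randomize_noise=True, output_transform=fmt)
    return images.astype(np.float32) / 255.0, z.astype(np.float32)

encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 4, strides=4, activation='relu', input_shape=(512, 512, 3)),
    tf.keras.layers.Conv2D(64, 4, strides=4, activation='relu'),
    tf.keras.layers.Conv2D(128, 4, strides=4, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(z_dim),                 # regress the 512-D latent
])
encoder.compile(optimizer='adam', loss='mse')

for step in range(10000):                          # 'infinite' supervised data: just keep sampling from G
    images, latents = sample_pairs()
    loss = encoder.train_on_batch(images, latents)
    if step % 100 == 0:
        print(step, loss)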

Comparison with editing in flow-based models: On a tangent, editing/reversing is one of the great advantages37 of 'flow'-based NN models such as Glow, which is one of the families of NN models competitive with GANs for high-quality image generation (along with autoregressive pixel prediction models like PixelRNN, and VAEs). Flow models have the same shape as GANs in pushing a random latent vector z through a series of upscaling convolution or other layers to produce final pixel values, but flow models use a carefully-limited set of primitives which make the model runnable both forwards and backwards exactly. This means every set of pixels corresponds to a unique z and vice-versa, and so an arbitrary set of pixels can be put in and the model run backwards to yield the exact corresponding z. There is no need to fight with the model to create an encoder to reverse it or use backpropagation optimization to try to find something almost right, as the flow model can already do this. This makes editing easy: plug the image in, get out the exact z with the equivalent of a single forward pass, figure out which part of z controls a desired attribute like 'glasses', change that, and run it forward. The downside of flow models, which is why I do not (yet) use them, is that the restriction to reversible layers means that they are typically much larger and slower to train than a more-or-less perceptually equivalent GAN model, by easily an order of magnitude (for Glow). When I tried Glow, I could barely run an interesting model despite aggressive memory-saving techniques, and I didn't get anywhere interesting with the several GPU-days I spent (which was unsurprising when I realized how many GPU-months OA had spent). Since high-quality photorealistic GANs are at the limit of 2019 trainability for most researchers or hobbyists, flow models are clearly out of the question despite their many practical & theoretical advantages—they're just too expensive! However, there is no known reason flow models couldn't be competitive with GANs (they will probably always be larger, but because they are more correct & do more), and future improvements or hardware scaling may make them more viable, so flow-based models are an approach to keep an eye on.

One of those 3 approaches will encode an image into a latent z. So far so good, that enables things like generating randomly-different versions of a specific image or interpolating between 2 images, but how does one control the z in a more intelligent fashion to make specific edits?

If one knew what each variable in the z meant, one could simply slide them in the -1/+1 range, change the z, and generate the corresponding edited image. But there are 512 variables in z (for StyleGAN), which is a lot to examine manually, and their meaning is opaque as StyleGAN doesn’t necessarily map each variable onto a human-recognizable factor like ‘smiling’. A recognizable factor like ‘eyeglasses’ might even be governed by multiple variables simultaneously which are nonlinearly interacting.

As always, the solution to one model’s problems is yet more models; to control the z, like with the encoder, we can simply train yet another model (perhaps just a linear classifier or random forests this time) to take the z of many images which are all labeled ‘smiling’ or ‘not smiling’, and learn what parts of z cause ‘smiling’ (eg ). These additional models can then be used to control a z. The necessary labels (a few hundred to a few thousand will be adequate since the z is only 512 variables) can be obtained by hand or by using a pre-existing classifier.
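For example, one could fit an ordinary linear classifier on a few hundred labeled latents and use its weight vector as an edit direction (a sketch only; the .npy filenames are hypothetical, and whether to work on z or on the mapped w latents is a design choice):

import numpy as np
from sklearn.linear_model import LogisticRegression

latents = np.load('sample_latents.npy')      # (n, 512) latents of generated faces
labels = np.load('smile_labels.npy')         # (n,) 1 = smiling, 0 = not, from hand labels or a tagger

clf = LogisticRegression(max_iter=1000).fit(latents, labels)
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])    # unit 'smile' axis in latent space

z = np.random.randn(1, 512)
z_smiling = z + 2.0 * direction              # push a latent along the axis & re-generate to check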

So, the pieces of the puzzle & putting it all together:

The final result is interactive editing of anime faces along many different factors:

## Editing Rare Attributes

A strategy of hand-editing or using a tagger to classify attributes works for common ones which will be well-represented in a sample of a few thousand since the classifier needs a few hundred cases to work with, but what about rarer attributes which might appear only on one in a thousand random samples, or attributes too rare in the dataset for StyleGAN to have learned, or attributes which may not be in the dataset at all? Editing “red eyes” should be easy, but what about something like “bunny ears”? It would be amusing to be able to edit portraits to add bunny ears, but there aren’t that many bunny ear samples (although cat ears might be much more common); is one doomed to generate & classify hundreds of thousands of samples to enable bunny ear editing? That would be infeasible for hand labeling, and difficult even with a tagger.

One suggestion I have for this use-case would be to briefly train another StyleGAN model on an enriched or boosted dataset, like a dataset of 50-50 bunny ear images & normal images. If one can obtain a few thousand bunny ear images, then this is adequate for transfer learning (combined with a few thousand random normal images from the original dataset), and one can retrain the StyleGAN on an equal balance of images. The high presence of bunny ears will ensure that the StyleGAN quickly learns all about those, while the normal images prevent it from overfitting or catastrophic forgetting of the full range of images.

This new bunny-ear StyleGAN will then produce bunny-ear samples half the time, circumventing the rare base-rate issue (or failure to learn, or nonexistence in the dataset), and enabling efficient training of a classifier. And since normal faces were used to preserve its general face knowledge despite the transfer learning potentially degrading it, it will remain able to encode & optimize normal faces. (The original classifiers may even be reusable on this new model, depending on how extreme the new attribute is, since the latent space z might not be affected much by the new attribute, and the various other attributes may approximately maintain the same relationship with z as before the retraining.)
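
A minimal sketch of assembling such an enriched 50-50 dataset before fine-tuning (the folder names and use of PNG files are made up for illustration):

```python
import random
import shutil
from pathlib import Path

rare   = list(Path("bunny-ears/").glob("*.png"))   # a few thousand scraped bunny-ear faces
normal = list(Path("portraits/").glob("*.png"))    # the full normal portrait corpus

out = Path("bunny-balanced/")
out.mkdir(exist_ok=True)

# 50-50 mix: every rare image, plus an equal-sized random sample of normal images
for i, src in enumerate(rare + random.sample(normal, len(rare))):
    shutil.copy(src, out / f"{i:06d}.png")

# then fine-tune the existing StyleGAN model on out/ as usual (transfer learning)
```
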

# Future work

Some open questions about StyleGAN’s architecture & training dynamics:

• is progressive growing still necessary with StyleGAN?
• are 8x512 FC layers necessary?
• why do large blob artifacts regularly appear throughout training, even when almost-converged & photorealistic? Can they be fixed?
• what are the wrinkly-line/cracks noise artifacts which appear at the end of training?
• how does StyleGAN compare to BigGAN in final quality?

Further possible work:

• exploration of “curriculum learning”: can training be sped up by training to convergence on small n and then periodically expanding the dataset?

• bootstrapping image generation by starting with a seed corpus, generating many random samples, selecting the best by hand, and retraining; eg expand a corpus of a specific character, or explore ‘hybrid’ corpuses which mix A/B images & one then selects for images which look most A+B-ish

• improved transfer learning scripts to edit trained models so 512px pretrained models can be promoted to work with 1024px images and vice versa

• better Danbooru tagger CNN for providing embeddings for various purposes, particularly FID loss monitoring, minibatch discrimination/auxiliary loss, and style transfer for creating a ‘StyleDanbooru’

• with a StyleDanbooru, I am curious if that can be used as a particularly powerful form of data augmentation for small n character datasets, and whether it leads to a reversal of training dynamics with edges coming before colors/textures—it’s possible that a StyleDanbooru could make many GAN architectures, not just StyleGAN, stable to train on anime/illustration datasets
• borrowing architectural enhancements from BigGAN: self-attention layers, spectral norm regularization, large-minibatch training, and a rectified Gaussian distribution for the latent vector z

• text→image conditional GAN architecture (à la StackGAN):

This would take the text tag descriptions of each image compiled by Danbooru users and use those as inputs to StyleGAN, which, should it work, would mean you could create arbitrary anime images simply by typing in a string like `1_boy samurai facing_viewer red_hair clouds sword armor blood` etc. (see the sketch after this list).

This should also, by providing rich semantic descriptions of each image, make training faster & stabler and converge to higher final quality.

• meta-learning for few-shot face or character or artist imitation (eg or or perhaps , or —the last of which achieves few-shot learning using samples of n=25 TWDNE StyleGAN anime faces)
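
A minimal sketch of the tag-conditioning idea mentioned above (a generic multi-hot tag embedding, not StackGAN’s actual text encoder; the tag vocabulary and the way the vector is fed to the Generator are hypothetical):

```python
import numpy as np

# hypothetical tag vocabulary built from Danbooru metadata
vocab = {tag: i for i, tag in enumerate(
    ["1_boy", "samurai", "facing_viewer", "red_hair", "clouds", "sword", "armor", "blood"])}

def encode_tags(tag_string):
    """Turn a space-separated Danbooru tag string into a multi-hot conditioning vector."""
    v = np.zeros(len(vocab), dtype=np.float32)
    for tag in tag_string.split():
        if tag in vocab:
            v[vocab[tag]] = 1.0
    return v

cond = encode_tags("1_boy samurai red_hair sword")
# this vector would then be concatenated with (or mapped into) the latent z / style inputs,
# so the Generator sees both the noise and the tag description of the desired image
```
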

# BigGAN

I explore BigGAN, another recent GAN with SOTA results on the most complex image domain tackled by GANs so far, ImageNet. BigGAN’s capabilities come at a steep compute cost, however. I experiment with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8x2080ti machine for a month (mixed results). My BigGAN results are good but compromised by practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may turn out to be necessary to scale to whole anime images.

The primary rival GAN to StyleGAN for large-scale image synthesis as of mid-2019 is BigGAN (; ).

BigGAN successfully trains on up to 512px images from ImageNet, from all 1000 categories (conditioned on category), with near-photorealistic results on the best-represented categories (dogs), and apparently can even handle the far larger internal Google JFT dataset. In contrast, StyleGAN, while far less computationally demanding, shows poorer results on more complex categories (Karras et al 2018’s LSUN Cats StyleGAN; our whole-Danbooru2018 pilots) and has not been demonstrated to scale to ImageNet, much less beyond.

BigGAN does this by combining a few improvements on standard DCGANs (most of which are not used in StyleGAN):

The downside is that, as the name indicates, BigGAN is both a big model and requires big compute (particularly, big minibatches)—somewhere around $20,000, we estimate, based on public TPU pricing. This presents a dilemma: larger-scale portrait modeling or whole-anime image modeling may be beyond StyleGAN’s current capabilities; but while BigGAN may be able to handle those tasks, we can’t afford to train it!

Must it cost that much? Probably not. In particular, BigGAN’s use of a fixed large minibatch throughout training is probably inefficient: it is highly unlikely that the benefits of an n=2048 minibatch are necessary at the beginning of training, when the Generator is generating static which looks nothing at all like real data, and at the end of training, that may still be too small a minibatch (Brock et al 2018 note that the benefits of larger minibatches had not saturated at n=2048 but time/compute was not available to test still-larger minibatches, which is consistent with the observation that the harder & more RL-like a problem, the larger the minibatch it needs). Typically, minibatches and/or learning rates are scheduled: imprecise gradients are acceptable early on, while as the model approaches perfection, more exact gradients are necessary. So it should be possible to start out with minibatches a tiny fraction of the size and gradually scale them up during training, saving an enormous amount of compute compared to BigGAN’s reported numbers. The gradient noise scale could possibly be used to automatically set the total minibatch scale, although I didn’t find any examples of anyone using it in PyTorch this way. And using TPU pods is fast, but not necessarily the cheapest form of compute.

## BigGAN Transfer Learning

Another optimization is to exploit transfer learning from the released models, and reuse the enormous amount of compute invested in them. The practical details there are fiddly. The original BigGAN 2018 release included the 128px/256px/512px Generator Tensorflow models but not their Discriminators, nor a training codebase; the compare_gan Tensorflow codebase released in early 2019 includes an independent implementation of BigGAN that can potentially train them, and I believe that the Generator may still be usable for transfer learning on its own; if not—given the arguments that Discriminators simply memorize data and do not learn much beyond that—the Discriminators can be trained from scratch by simply freezing a G while training its D on G outputs for as long as necessary. The 2019 PyTorch release includes a different model, a full 128px model with G/D (at 2 points in its training), and code to convert the original Tensorflow models into PyTorch format; the catch there is that the pretrained model must be loaded into exactly the same architecture, and while the PyTorch codebase defines the architecture for 32/64/128/256px BigGANs, it does not (as of 4 June 2019) define the architecture for a 512px BigGAN or BigGAN-deep (I tried but couldn’t get it quite right). It would also be possible to do model surgery and promote the 128px model to a 512px model, since the two upscaling blocks (128px→256px and 256px→512px) should be easy to learn (similar to my use of waifu2x to fake a 1024px StyleGAN anime face model). Anyway, the upshot is that one can only use the 128px/256px pretrained models; the 512px will be possible with a small update to the PyTorch codebase.

All in all, it is possible that BigGAN with some tweaks could be affordable to train. (At least, with some crowdfunding…)
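
As a minimal sketch of that minibatch-scheduling idea (my own speculation, not something the BigGAN codebase implements): ramp the number of gradient-accumulation steps, and hence the effective minibatch, over the course of training instead of paying for n=2048 from the very first iteration:

```python
def accumulation_schedule(iteration, total_iterations, batch_size=32,
                          start_batch=64, end_batch=2048):
    """Linearly ramp the effective minibatch from start_batch to end_batch,
    returning how many gradient-accumulation steps to run at this iteration."""
    frac = iteration / total_iterations
    effective = start_batch + frac * (end_batch - start_batch)
    return max(1, round(effective / batch_size))

# eg with batch_size=32: ~2 accumulations at the start, 64 near the end (n=2048)
```
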
## BigGAN: Danbooru2018-1K Experiments

To test the waters, I ran three BigGAN experiments:

1. I first experimented with retraining the ImageNet 128px model38. That resulted in almost total mode collapse when I re-enabled G after 2 days; investigating, I realized that I had misunderstood: it was a brand-new BigGAN model, trained independently, and came with its fully-trained D already. Oops.
2. transfer learning the 128px ImageNet PyTorch BigGAN model to the 1k anime portraits; successful with ~6 GPU-days
3. training from scratch a 256px BigGAN-deep on the 1k portraits; partially successful after ~240 GPU-days: it reached comparable quality to StyleGAN before suffering serious mode collapse, possibly due to being forced to run with small minibatch sizes by BigGAN bugs

### Danbooru2018-1K Dataset

Constructing a new Danbooru-1k dataset: as BigGAN requires conditioning information, I constructed a new portrait dataset by taking the 1000 most popular Danbooru2018 characters, with characters as categories, and cropped out portraits as usual:

```bash
cat metadata/20180000000000* | fgrep -e '"name":"solo"' | fgrep -v '"rating":"e"' | \
    jq -c '.tags | .[] | select(.category == "4") | .name' | sort | uniq --count | \
    sort --numeric-sort > characters.txt

mkdir ./characters-1k/ ; cd ./characters-1k/

cpCharacterFace () {
    CHARACTER="$@"
    CHARACTER_SAFE=$(echo "$CHARACTER" | tr '[:punct:]' '.')
    mkdir "$CHARACTER_SAFE"
    echo "$CHARACTER" "$CHARACTER_SAFE"
    IDS=$(cat ../metadata/* | fgrep '"name":"'$CHARACTER\" | fgrep -e '"name":"solo"' | \
          fgrep -v '"rating":"e"' | jq .id | tr -d '"')
    for ID in $IDS; do
        BUCKET=$(printf "%04d" $(( $ID % 1000 )) )
        TARGET=$(ls ../original/$BUCKET/$ID.*)
```
~/src/lbpcascade_animeface/lbpcascade_animeface.xml "$TARGET" "./$CHARACTER_SAFE/$ID" done } export -f cpCharacterFace tail -1000 ../characters.txt | cut -d '"' -f 2 | parallel --progress cpCharacterFace I merged a number of redundant folders by hand39, cleaned as usual, and did further cropping as necessary to reach 1000. This resulted in 212,359 portrait faces, with the largest class (Hatsune Miku) having 6,624 images and the smallest classes having ~0 or 1 images. (I don’t know if the class imbalance constitutes a real problem for BigGAN, as ImageNet itself is imbalanced on many levels.) The data-loading code attempts to make the class index/ID number line up with the folder count, so the th alphabetical folder (character) should have class ID n, which is important to know for generating conditional samples. The final set/IDs (as defined for my Danbooru 1K dataset by find_classes): 2k.tan: 0 abe.nana: 1 abigail.williams..fate.grand.order.: 2 abukuma..kantai.collection.: 3 admiral..kantai.collection.: 4 aegis..persona.: 5 aerith.gainsborough: 6 afuro.terumi: 7 agano..kantai.collection.: 8 agrias.oaks: 9 ahri: 10 aida.mana: 11 aino.minako: 12 aisaka.taiga: 13 aisha..elsword.: 14 akagi..kantai.collection.: 15 akagi.miria: 16 akashi..kantai.collection.: 17 akatsuki..kantai.collection.: 18 akaza.akari: 19 akebono..kantai.collection.: 20 akemi.homura: 21 aki.minoriko: 22 aki.shizuha: 23 akigumo..kantai.collection.: 24 akitsu.maru..kantai.collection.: 25 akitsushima..kantai.collection.: 26 akiyama.mio: 27 akiyama.yukari: 28 akizuki..kantai.collection.: 29 akizuki.ritsuko: 30 akizuki.ryou: 31 akuma.homura: 32 albedo: 33 alice..wonderland.: 34 alice.margatroid: 35 alice.margatroid..pc.98.: 36 alisa.ilinichina.amiella: 37 altera..fate.: 38 amagi..kantai.collection.: 39 amagi.yukiko: 40 amami.haruka: 41 amanogawa.kirara: 42 amasawa.yuuko: 43 amatsukaze..kantai.collection.: 44 amazon..dragon.s.crown.: 45 anastasia..idolmaster.: 46 anchovy: 47 android.18: 48 android.21: 49 anegasaki.nene: 50 angel..kof.: 51 angela.balzac: 52 anjou.naruko: 53 aoba..kantai.collection.: 54 aoki.reika: 55 aori..splatoon.: 56 aozaki.aoko: 57 aqua..konosuba.: 58 ara.han: 59 aragaki.ayase: 60 araragi.karen: 61 arashi..kantai.collection.: 62 arashio..kantai.collection.: 63 archer: 64 arcueid.brunestud: 65 arima.senne: 66 artoria.pendragon..all.: 67 artoria.pendragon..lancer.: 68 artoria.pendragon..lancer.alter.: 69 artoria.pendragon..swimsuit.rider.alter.: 70 asahina.mikuru: 71 asakura.ryouko: 72 asashimo..kantai.collection.: 73 asashio..kantai.collection.: 74 ashigara..kantai.collection.: 75 asia.argento: 76 astolfo..fate.: 77 asui.tsuyu: 78 asuna..sao.: 79 atago..azur.lane.: 80 atago..kantai.collection.: 81 atalanta..fate.: 82 au.ra: 83 ayanami..azur.lane.: 84 ayanami..kantai.collection.: 85 ayanami.rei: 86 ayane..doa.: 87 ayase.eli: 88 baiken: 89 bardiche: 90 barnaby.brooks.jr: 91 battleship.hime: 92 bayonetta..character.: 93 bb..fate...all.: 94 bb..fate.extra.ccc.: 95 bb..swimsuit.mooncancer...fate.: 96 beatrice: 97 belfast..azur.lane.: 98 bismarck..kantai.collection.: 99 black.hanekawa: 100 black.rock.shooter..character.: 101 blake.belladonna: 102 blanc: 103 boko..girls.und.panzer.: 104 bottle.miku: 105 boudica..fate.grand.order.: 106 bowsette: 107 bridget..guilty.gear.: 108 busujima.saeko: 109 c.c.: 110 c.c..lemon..character.: 111 caesar.anthonio.zeppeli: 112 cagliostro..granblue.fantasy.: 113 camilla..fire.emblem.if.: 114 cammy.white: 115 caren.hortensia: 116 caster: 117 cecilia.alcott: 118 celes.chere: 119 
charlotte..madoka.magica.: 120 charlotte.dunois: 121 charlotte.e.yeager: 122 chen: 123 chibi.usa: 124 chiki: 125 chitanda.eru: 126 chloe.von.einzbern: 127 choukai..kantai.collection.: 128 chun.li: 129 ciel: 130 cirno: 131 clarisse..granblue.fantasy.: 132 clownpiece: 133 consort.yu..fate.: 134 cure.beauty: 135 cure.happy: 136 cure.march: 137 cure.marine: 138 cure.moonlight: 139 cure.peace: 140 cure.sunny: 141 cure.sunshine: 142 cure.twinkle: 143 d.va..overwatch.: 144 daiyousei: 145 danua: 146 darjeeling: 147 dark.magician.girl: 148 dio.brando: 149 dizzy: 150 djeeta..granblue.fantasy.: 151 doremy.sweet: 152 eas: 153 eila.ilmatar.juutilainen: 154 elesis..elsword.: 155 elin..tera.: 156 elizabeth.bathory..brave...fate.: 157 elizabeth.bathory..fate.: 158 elizabeth.bathory..fate...all.: 159 ellen.baker: 160 elphelt.valentine: 161 elsa..frozen.: 162 emilia..re.zero.: 163 emiya.kiritsugu: 164 emiya.shirou: 165 emperor.penguin..kemono.friends.: 166 enma.ai: 167 enoshima.junko: 168 enterprise..azur.lane.: 169 ereshkigal..fate.grand.order.: 170 erica.hartmann: 171 etna: 172 eureka: 173 eve..elsword.: 174 ex.keine: 175 failure.penguin: 176 fate.testarossa: 177 felicia: 178 female.admiral..kantai.collection.: 179 female.my.unit..fire.emblem.if.: 180 female.protagonist..pokemon.go.: 181 fennec..kemono.friends.: 182 ferry..granblue.fantasy.: 183 flandre.scarlet: 184 florence.nightingale..fate.grand.order.: 185 fou..fate.grand.order.: 186 francesca.lucchini: 187 frankenstein.s.monster..fate.: 188 fubuki..kantai.collection.: 189 fujibayashi.kyou: 190 fujimaru.ritsuka..female.: 191 fujiwara.no.mokou: 192 furude.rika: 193 furudo.erika: 194 furukawa.nagisa: 195 fusou..kantai.collection.: 196 futaba.anzu: 197 futami.mami: 198 futatsuiwa.mamizou: 199 fuuro..pokemon.: 200 galko: 201 gambier.bay..kantai.collection.: 202 ganaha.hibiki: 203 gangut..kantai.collection.: 204 gardevoir: 205 gasai.yuno: 206 gertrud.barkhorn: 207 gilgamesh: 208 ginga.nakajima: 209 giorno.giovanna: 210 gokou.ruri: 211 graf.eisen: 212 graf.zeppelin..kantai.collection.: 213 grey.wolf..kemono.friends.: 214 gumi: 215 hachikuji.mayoi: 216 hagikaze..kantai.collection.: 217 hagiwara.yukiho: 218 haguro..kantai.collection.: 219 hakurei.reimu: 220 hamakaze..kantai.collection.: 221 hammann..azur.lane.: 222 han.juri: 223 hanasaki.tsubomi: 224 hanekawa.tsubasa: 225 hanyuu: 226 haramura.nodoka: 227 harime.nui: 228 haro: 229 haruka..pokemon.: 230 haruna..kantai.collection.: 231 haruno.sakura: 232 harusame..kantai.collection.: 233 hasegawa.kobato: 234 hassan.of.serenity..fate.: 235 hata.no.kokoro: 236 hatoba.tsugu..character.: 237 hatsune.miku: 238 hatsune.miku..append.: 239 hatsuyuki..kantai.collection.: 240 hatsuzuki..kantai.collection.: 241 hayami.kanade: 242 hayashimo..kantai.collection.: 243 hayasui..kantai.collection.: 244 hecatia.lapislazuli: 245 helena.blavatsky..fate.grand.order.: 246 heles: 247 hestia..danmachi.: 248 hex.maniac..pokemon.: 249 hibari..senran.kagura.: 250 hibiki..kantai.collection.: 251 hieda.no.akyuu: 252 hiei..kantai.collection.: 253 higashi.setsuna: 254 higashikata.jousuke: 255 high.priest: 256 hiiragi.kagami: 257 hiiragi.tsukasa: 258 hijiri.byakuren: 259 hikari..pokemon.: 260 himejima.akeno: 261 himekaidou.hatate: 262 hinanawi.tenshi: 263 hinatsuru.ai: 264 hino.akane..idolmaster.: 265 hino.akane..smile.precure..: 266 hino.rei: 267 hirasawa.ui: 268 hirasawa.yui: 269 hiryuu..kantai.collection.: 270 hishikawa.rikka: 271 hk416..girls.frontline.: 272 holo: 273 homura..xenoblade.2.: 274 honda.mio: 275 hong.meiling: 276 honma.meiko: 
277 honolulu..azur.lane.: 278 horikawa.raiko: 279 hoshi.shouko: 280 hoshiguma.yuugi: 281 hoshii.miki: 282 hoshimiya.ichigo: 283 hoshimiya.kate: 284 hoshino.fumina: 285 hoshino.ruri: 286 hoshizora.miyuki: 287 hoshizora.rin: 288 hotarumaru: 289 hoto.cocoa: 290 houjou.hibiki: 291 houjou.karen: 292 houjou.satoko: 293 houjuu.nue: 294 houraisan.kaguya: 295 houshou..kantai.collection.: 296 huang.baoling: 297 hyuuga.hinata: 298 i.168..kantai.collection.: 299 i.19..kantai.collection.: 300 i.26..kantai.collection.: 301 i.401..kantai.collection.: 302 i.58..kantai.collection.: 303 i.8..kantai.collection.: 304 ia..vocaloid.: 305 ibaraki.douji..fate.grand.order.: 306 ibaraki.kasen: 307 ibuki.fuuko: 308 ibuki.suika: 309 ichigo..darling.in.the.franxx.: 310 ichinose.kotomi: 311 ichinose.shiki: 312 ikamusume: 313 ikazuchi..kantai.collection.: 314 illustrious..azur.lane.: 315 illyasviel.von.einzbern: 316 imaizumi.kagerou: 317 inaba.tewi: 318 inami.mahiru: 319 inazuma..kantai.collection.: 320 index: 321 ingrid: 322 inkling: 323 inubashiri.momiji: 324 inuyama.aoi: 325 iori.rinko: 326 iowa..kantai.collection.: 327 irisviel.von.einzbern: 328 iroha..samurai.spirits.: 329 ishtar..fate.grand.order.: 330 isokaze..kantai.collection.: 331 isonami..kantai.collection.: 332 isuzu..kantai.collection.: 333 itsumi.erika: 334 ivan.karelin: 335 izayoi.sakuya: 336 izumi.konata: 337 izumi.sagiri: 338 jack.the.ripper..fate.apocrypha.: 339 jakuzure.nonon: 340 japanese.crested.ibis..kemono.friends.: 341 jeanne.d.arc..alter...fate.: 342 jeanne.d.arc..alter.swimsuit.berserker.: 343 jeanne.d.arc..fate.: 344 jeanne.d.arc..fate...all.: 345 jeanne.d.arc..granblue.fantasy.: 346 jeanne.d.arc..swimsuit.archer.: 347 jeanne.d.arc.alter.santa.lily: 348 jintsuu..kantai.collection.: 349 jinx..league.of.legends.: 350 johnny.joestar: 351 jonathan.joestar: 352 joseph.joestar..young.: 353 jougasaki.mika: 354 jougasaki.rika: 355 jun.you..kantai.collection.: 356 junketsu: 357 junko..touhou.: 358 kaban..kemono.friends.: 359 kaburagi.t.kotetsu: 360 kaenbyou.rin: 361 kaenbyou.rin..cat.: 362 kafuu.chino: 363 kaga..kantai.collection.: 364 kagamine.len: 365 kagamine.rin: 366 kagerou..kantai.collection.: 367 kagiyama.hina: 368 kagura..gintama.: 369 kaguya.luna..character.: 370 kaito: 371 kaku.seiga: 372 kakyouin.noriaki: 373 kallen.stadtfeld: 374 kamikaze..kantai.collection.: 375 kamikita.komari: 376 kamio.misuzu: 377 kamishirasawa.keine: 378 kamiya.nao: 379 kamoi..kantai.collection.: 380 kaname.madoka: 381 kanbaru.suruga: 382 kanna.kamui: 383 kanzaki.ranko: 384 karina.lyle: 385 kasane.teto: 386 kashima..kantai.collection.: 387 kashiwazaki.sena: 388 kasodani.kyouko: 389 kasugano.sakura: 390 kasugano.sora: 391 kasumi..doa.: 392 kasumi..kantai.collection.: 393 kasumi..pokemon.: 394 kasumigaoka.utaha: 395 katori..kantai.collection.: 396 katou.megumi: 397 katsura.hinagiku: 398 katsuragi..kantai.collection.: 399 katsushika.hokusai..fate.grand.order.: 400 katyusha: 401 kawakami.mai: 402 kawakaze..kantai.collection.: 403 kawashiro.nitori: 404 kay..girls.und.panzer.: 405 kazama.asuka: 406 kazami.yuuka: 407 kenzaki.makoto: 408 kijin.seija: 409 kikuchi.makoto: 410 kino: 411 kino.makoto: 412 kinomoto.sakura: 413 kinugasa..kantai.collection.: 414 kirigaya.suguha: 415 kirigiri.kyouko: 416 kirijou.mitsuru: 417 kirima.sharo: 418 kirin..armor.: 419 kirino.ranmaru: 420 kirisame.marisa: 421 kirishima..kantai.collection.: 422 kirito: 423 kiryuuin.satsuki: 424 kisaragi..kantai.collection.: 425 kisaragi.chihaya: 426 kise.yayoi: 427 kishibe.rohan: 428 kishin.sagume: 429 
kiso..kantai.collection.: 430 kiss.shot.acerola.orion.heart.under.blade: 431 kisume: 432 kitakami..kantai.collection.: 433 kiyohime..fate.grand.order.: 434 kiyoshimo..kantai.collection.: 435 kizuna.ai: 436 koakuma: 437 kobayakawa.rinko: 438 kobayakawa.sae: 439 kochiya.sanae: 440 kohinata.miho: 441 koizumi.hanayo: 442 komaki.manaka: 443 komeiji.koishi: 444 komeiji.satori: 445 kongou..kantai.collection.: 446 konjiki.no.yami: 447 konpaku.youmu: 448 konpaku.youmu..ghost.: 449 kooh: 450 kos.mos: 451 koshimizu.sachiko: 452 kotobuki.tsumugi: 453 kotomine.kirei: 454 kotonomiya.yuki: 455 kousaka.honoka: 456 kousaka.kirino: 457 kousaka.tamaki: 458 kozakura.marry: 459 kuchiki.rukia: 460 kujikawa.rise: 461 kujou.karen: 462 kula.diamond: 463 kuma..kantai.collection.: 464 kumano..kantai.collection.: 465 kumoi.ichirin: 466 kunikida.hanamaru: 467 kuradoberi.jam: 468 kuriyama.mirai: 469 kurodani.yamame: 470 kuroka..high.school.dxd.: 471 kurokawa.eren: 472 kuroki.tomoko: 473 kurosawa.dia: 474 kurosawa.ruby: 475 kuroshio..kantai.collection.: 476 kuroyukihime: 477 kurumi.erika: 478 kusanagi.motoko: 479 kusugawa.sasara: 480 kuujou.jolyne: 481 kuujou.joutarou: 482 kyon: 483 kyonko: 484 kyubey: 485 laffey..azur.lane.: 486 lala.satalin.deviluke: 487 lancer: 488 lancer..fate.zero.: 489 laura.bodewig: 490 leafa: 491 lei.lei: 492 lelouch.lamperouge: 493 len: 494 letty.whiterock: 495 levi..shingeki.no.kyojin.: 496 libeccio..kantai.collection.: 497 lightning.farron: 498 lili..tekken.: 499 lilith.aensland: 500 lillie..pokemon.: 501 lily.white: 502 link: 503 little.red.riding.hood..grimm.: 504 louise.francoise.le.blanc.de.la.valliere: 505 lucina: 506 lum: 507 luna.child: 508 lunamaria.hawke: 509 lunasa.prismriver: 510 lusamine..pokemon.: 511 lyn..blade...soul.: 512 lyndis..fire.emblem.: 513 lynette.bishop: 514 m1903.springfield..girls.frontline.: 515 madotsuki: 516 maekawa.miku: 517 maka.albarn: 518 makigumo..kantai.collection.: 519 makinami.mari.illustrious: 520 makise.kurisu: 521 makoto..street.fighter.: 522 makoto.nanaya: 523 mankanshoku.mako: 524 mao..pokemon.: 525 maou..maoyuu.: 526 maribel.hearn: 527 marie.antoinette..fate.grand.order.: 528 mash.kyrielight: 529 matoi..pso2.: 530 matoi.ryuuko: 531 matou.sakura: 532 matsuura.kanan: 533 maya..kantai.collection.: 534 me.tan: 535 medicine.melancholy: 536 medjed: 537 meer.campbell: 538 megumin: 539 megurine.luka: 540 mei..overwatch.: 541 mei..pokemon.: 542 meiko: 543 meltlilith: 544 mercy..overwatch.: 545 merlin.prismriver: 546 michishio..kantai.collection.: 547 midare.toushirou: 548 midna: 549 midorikawa.nao: 550 mika..girls.und.panzer.: 551 mikasa.ackerman: 552 mikazuki.munechika: 553 miki.sayaka: 554 millia.rage: 555 mima: 556 mimura.kanako: 557 minami.kotori: 558 minamoto.no.raikou..fate.grand.order.: 559 minamoto.no.raikou..swimsuit.lancer...fate.: 560 minase.akiko: 561 minase.iori: 562 miqo.te: 563 misaka.mikoto: 564 mishaguji: 565 misumi.nagisa: 566 mithra: 567 miura.azusa: 568 miyafuji.yoshika: 569 miyako.yoshika: 570 miyamoto.frederica: 571 miyamoto.musashi..fate.grand.order.: 572 miyaura.sanshio: 573 mizuhashi.parsee: 574 mizuki..pokemon.: 575 mizunashi.akari: 576 mizuno.ami: 577 mogami..kantai.collection.: 578 momo.velia.deviluke: 579 momozono.love: 580 mononobe.no.futo: 581 mordred..fate.: 582 mordred..fate...all.: 583 morgiana: 584 morichika.rinnosuke: 585 morikubo.nono: 586 moriya.suwako: 587 moroboshi.kirari: 588 morrigan.aensland: 589 motoori.kosuzu: 590 mumei..kabaneri.: 591 murakumo..kantai.collection.: 592 murasa.minamitsu: 593 
murasame..kantai.collection.: 594 musashi..kantai.collection.: 595 mutsu..kantai.collection.: 596 mutsuki..kantai.collection.: 597 my.unit..fire.emblem..kakusei.: 598 my.unit..fire.emblem.if.: 599 myoudouin.itsuki: 600 mysterious.heroine.x: 601 mysterious.heroine.x..alter.: 602 mystia.lorelei: 603 nadia: 604 nagae.iku: 605 naganami..kantai.collection.: 606 nagato..kantai.collection.: 607 nagato.yuki: 608 nagatsuki..kantai.collection.: 609 nagi: 610 nagisa.kaworu: 611 naka..kantai.collection.: 612 nakano.azusa: 613 nami..one.piece.: 614 nanami.chiaki: 615 nanasaki.ai: 616 nao..mabinogi.: 617 narmaya..granblue.fantasy.: 618 narukami.yuu: 619 narusawa.ryouka: 620 natalia..idolmaster.: 621 natori.sana: 622 natsume..pokemon.: 623 natsume.rin: 624 nazrin: 625 nekomiya.hinata: 626 nekomusume: 627 nekomusume..gegege.no.kitarou.6.: 628 nepgear: 629 neptune..neptune.series.: 630 nero.claudius..bride...fate.: 631 nero.claudius..fate.: 632 nero.claudius..fate...all.: 633 nero.claudius..swimsuit.caster...fate.: 634 nia.teppelin: 635 nibutani.shinka: 636 nico.robin: 637 ninomiya.asuka: 638 nishikino.maki: 639 nishizumi.maho: 640 nishizumi.miho: 641 nitocris..fate.grand.order.: 642 nitocris..swimsuit.assassin...fate.: 643 nitta.minami: 644 noel.vermillion: 645 noire: 646 northern.ocean.hime: 647 noshiro..kantai.collection.: 648 noumi.kudryavka: 649 nu.13: 650 nyarlathotep..nyaruko.san.: 651 oboro..kantai.collection.: 652 oda.nobunaga..fate.: 653 ogata.chieri: 654 ohara.mari: 655 oikawa.shizuku: 656 okazaki.yumemi: 657 okita.souji..alter...fate.: 658 okita.souji..fate.: 659 okita.souji..fate...all.: 660 onozuka.komachi: 661 ooi..kantai.collection.: 662 oomori.yuuko: 663 ootsuki.yui: 664 ooyodo..kantai.collection.: 665 osakabe.hime..fate.grand.order.: 666 oshino.shinobu: 667 otonashi.kotori: 668 panty..psg.: 669 passion.lip: 670 patchouli.knowledge: 671 pepperoni..girls.und.panzer.: 672 perrine.h.clostermann: 673 pharah..overwatch.: 674 phosphophyllite: 675 pikachu: 676 pixiv.tan: 677 platelet..hataraku.saibou.: 678 platinum.the.trinity: 679 pod..nier.automata.: 680 pola..kantai.collection.: 681 priest..ragnarok.online.: 682 princess.king.boo: 683 princess.peach: 684 princess.serenity: 685 princess.zelda: 686 prinz.eugen..azur.lane.: 687 prinz.eugen..kantai.collection.: 688 prisma.illya: 689 purple.heart: 690 puru.see: 691 pyonta: 692 qbz.95..girls.frontline.: 693 rachel.alucard: 694 racing.miku: 695 raising.heart: 696 ramlethal.valentine: 697 ranka.lee: 698 ranma.chan: 699 re.class.battleship: 700 reinforce: 701 reinforce.zwei: 702 reisen.udongein.inaba: 703 reiuji.utsuho: 704 reizei.mako: 705 rem..re.zero.: 706 remilia.scarlet: 707 rensouhou.chan: 708 rensouhou.kun: 709 rias.gremory: 710 rider: 711 riesz: 712 ringo..touhou.: 713 ro.500..kantai.collection.: 714 roll: 715 rosehip: 716 rossweisse: 717 ruby.rose: 718 rumia: 719 rydia: 720 ryougi.shiki: 721 ryuuguu.rena: 722 ryuujou..kantai.collection.: 723 saber: 724 saber.alter: 725 saber.lily: 726 sagisawa.fumika: 727 saigyouji.yuyuko: 728 sailor.mars: 729 sailor.mercury: 730 sailor.moon: 731 sailor.saturn: 732 sailor.venus: 733 saint.martha: 734 sakagami.tomoyo: 735 sakamoto.mio: 736 sakata.gintoki: 737 sakuma.mayu: 738 sakura.chiyo: 739 sakura.futaba: 740 sakura.kyouko: 741 sakura.miku: 742 sakurai.momoka: 743 sakurauchi.riko: 744 samidare..kantai.collection.: 745 samus.aran: 746 sanya.v.litvyak: 747 sanzen.in.nagi: 748 saotome.ranma: 749 saratoga..kantai.collection.: 750 sasaki.chiho: 751 saten.ruiko: 752 satonaka.chie: 753 satsuki..kantai.collection.: 
754 sawamura.spencer.eriri: 755 saya: 756 sazaki.kaoruko: 757 sazanami..kantai.collection.: 758 scathach..fate...all.: 759 scathach..fate.grand.order.: 760 scathach..swimsuit.assassin...fate.: 761 seaport.hime: 762 seeu: 763 seiran..touhou.: 764 seiren..suite.precure.: 765 sekibanki: 766 selvaria.bles: 767 sendai..kantai.collection.: 768 sendai.hakurei.no.miko: 769 sengoku.nadeko: 770 senjougahara.hitagi: 771 senketsu: 772 sento.isuzu: 773 serena..pokemon.: 774 serval..kemono.friends.: 775 sf.a2.miki: 776 shameimaru.aya: 777 shana: 778 shanghai.doll: 779 shantae..character.: 780 sheryl.nome: 781 shibuya.rin: 782 shidare.hotaru: 783 shigure..kantai.collection.: 784 shijou.takane: 785 shiki.eiki: 786 shikinami..kantai.collection.: 787 shikinami.asuka.langley: 788 shimada.arisu: 789 shimakaze..kantai.collection.: 790 shimamura.uzuki: 791 shinjou.akane: 792 shinki: 793 shinku: 794 shiomi.shuuko: 795 shirabe.ako: 796 shirai.kuroko: 797 shirakiin.ririchiyo: 798 shiranui..kantai.collection.: 799 shiranui.mai: 800 shirasaka.koume: 801 shirase.sakuya: 802 shiratsuyu..kantai.collection.: 803 shirayuki.hime: 804 shirogane.naoto: 805 shirona..pokemon.: 806 shoebill..kemono.friends.: 807 shokuhou.misaki: 808 shouhou..kantai.collection.: 809 shoukaku..kantai.collection.: 810 shuten.douji..fate.grand.order.: 811 signum: 812 silica: 813 simon: 814 sinon: 815 soga.no.tojiko: 816 sona.buvelle: 817 sonoda.umi: 818 sonohara.anri: 819 sonozaki.mion: 820 sonozaki.shion: 821 sora.ginko: 822 sorceress..dragon.s.crown.: 823 souryuu..kantai.collection.: 824 souryuu.asuka.langley: 825 souseiseki: 826 star.sapphire: 827 stocking..psg.: 828 su.san: 829 subaru.nakajima: 830 suigintou: 831 suiren..pokemon.: 832 suiseiseki: 833 sukuna.shinmyoumaru: 834 sunny.milk: 835 suomi.kp31..girls.frontline.: 836 super.pochaco: 837 super.sonico: 838 suzukaze.aoba: 839 suzumiya.haruhi: 840 suzutsuki..kantai.collection.: 841 suzuya..kantai.collection.: 842 tachibana.arisu: 843 tachibana.hibiki..symphogear.: 844 tada.riina: 845 taigei..kantai.collection.: 846 taihou..azur.lane.: 847 taihou..kantai.collection.: 848 tainaka.ritsu: 849 takagaki.kaede: 850 takakura.himari: 851 takamachi.nanoha: 852 takami.chika: 853 takanashi.rikka: 854 takao..azur.lane.: 855 takao..kantai.collection.: 856 takara.miyuki: 857 takarada.rikka: 858 takatsuki.yayoi: 859 takebe.saori: 860 tama..kantai.collection.: 861 tamamo..fate...all.: 862 tamamo.cat..fate.: 863 tamamo.no.mae..fate.: 864 tamamo.no.mae..swimsuit.lancer...fate.: 865 tanamachi.kaoru: 866 taneshima.popura: 867 tanned.cirno: 868 taokaka: 869 tatara.kogasa: 870 tateyama.ayano: 871 tatsumaki: 872 tatsuta..kantai.collection.: 873 tedeza.rize: 874 tenryuu..kantai.collection.: 875 tenshi..angel.beats..: 876 teruzuki..kantai.collection.: 877 tharja: 878 tifa.lockhart: 879 tina.branford: 880 tippy..gochiusa.: 881 tokiko..touhou.: 882 tokisaki.kurumi: 883 tokitsukaze..kantai.collection.: 884 tomoe.gozen..fate.grand.order.: 885 tomoe.hotaru: 886 tomoe.mami: 887 tone..kantai.collection.: 888 toono.akiha: 889 tooru..maidragon.: 890 toosaka.rin: 891 toramaru.shou: 892 toshinou.kyouko: 893 totoki.airi: 894 toudou.shimako: 895 toudou.yurika: 896 toujou.koneko: 897 toujou.nozomi: 898 touko..pokemon.: 899 touwa.erio: 900 toyosatomimi.no.miko: 901 tracer..overwatch.: 902 tsukikage.yuri: 903 tsukimiya.ayu: 904 tsukino.mito: 905 tsukino.usagi: 906 tsukumo.benben: 907 tsurumaru.kuninaga: 908 tsuruya: 909 tsushima.yoshiko: 910 u.511..kantai.collection.: 911 ujimatsu.chiya: 912 ultimate.madoka: 913 
umikaze..kantai.collection.: 914 unicorn..azur.lane.: 915 unryuu..kantai.collection.: 916 urakaze..kantai.collection.: 917 uraraka.ochako: 918 usada.hikaru: 919 usami.renko: 920 usami.sumireko: 921 ushio..kantai.collection.: 922 ushiromiya.ange: 923 ushiwakamaru..fate.grand.order.: 924 uzuki..kantai.collection.: 925 vampire..azur.lane.: 926 vampy: 927 venera.sama: 928 verniy..kantai.collection.: 929 victorica.de.blois: 930 violet.evergarden..character.: 931 vira.lilie: 932 vita: 933 vivio: 934 wa2000..girls.frontline.: 935 wakasagihime: 936 wang.liu.mei: 937 warspite..kantai.collection.: 938 watanabe.you: 939 watarase.jun: 940 watatsuki.no.yorihime: 941 waver.velvet: 942 weiss.schnee: 943 white.mage: 944 widowmaker..overwatch.: 945 wo.class.aircraft.carrier: 946 wriggle.nightbug: 947 xenovia.quarta: 948 xp.tan: 949 xuanzang..fate.grand.order.: 950 yagami.hayate: 951 yagokoro.eirin: 952 yahagi..kantai.collection.: 953 yakumo.ran: 954 yakumo.yukari: 955 yamada.aoi: 956 yamada.elf: 957 yamakaze..kantai.collection.: 958 yamashiro..azur.lane.: 959 yamashiro..kantai.collection.: 960 yamato..kantai.collection.: 961 yamato.no.kami.yasusada: 962 yang.xiao.long: 963 yasaka.kanako: 964 yayoi..kantai.collection.: 965 yazawa.nico: 966 yin: 967 yoko.littner: 968 yorha.no..2.type.b: 969 yorigami.shion: 970 yowane.haku: 971 yuffie.kisaragi: 972 yui..angel.beats..: 973 yuigahama.yui: 974 yuki.miku: 975 yukikaze..kantai.collection.: 976 yukine.chris: 977 yukinoshita.yukino: 978 yukishiro.honoka: 979 yumi..senran.kagura.: 980 yuna..ff10.: 981 yuno: 982 yura..kantai.collection.: 983 yuubari..kantai.collection.: 984 yuudachi..kantai.collection.: 985 yuugumo..kantai.collection.: 986 yuuki..sao.: 987 yuuki.makoto: 988 yuuki.mikan: 989 yuzuhara.konomi: 990 yuzuki.yukari: 991 yuzuriha.inori: 992 z1.leberecht.maass..kantai.collection.: 993 z3.max.schultz..kantai.collection.: 994 zero.two..darling.in.the.franxx.: 995 zeta..granblue.fantasy.: 996 zooey..granblue.fantasy.: 997 zuihou..kantai.collection.: 998 zuikaku..kantai.collection.: 999 (Aside from being potentially useful to stabilize training by providing supervision/metadata, use of classes/categories reduces the need for character-specific transfer learning for specialized StyleGAN models, since you can just generate samples from a specific class. For the 256px model, I provide downloadable samples for each of the 1000 classes.) BigGAN requires the dataset metadata to be defined in utils.py, and then it must be processed into a HDF5 archive, along with Inception statistics for the periodic testing (although I minimize testing, the preprocessed statistics are still necessary). 
The utils.py must be edited to add metadata per dataset (no CLI), which looks like this to define a 128px Danbooru-1k portrait dataset:

```diff
 # Convenience dicts
-dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
+dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
              'I128': dset.ImageFolder, 'I256': dset.ImageFolder,
              'I32_hdf5': dset.ILSVRC_HDF5, 'I64_hdf5': dset.ILSVRC_HDF5,
              'I128_hdf5': dset.ILSVRC_HDF5, 'I256_hdf5': dset.ILSVRC_HDF5,
-             'C10': dset.CIFAR10, 'C100': dset.CIFAR100}
+             'C10': dset.CIFAR10, 'C100': dset.CIFAR100,
+             'D1K': dset.ImageFolder, 'D1K_hdf5': dset.ILSVRC_HDF5 }
 imsize_dict = {'I32': 32, 'I32_hdf5': 32, 'I64': 64, 'I64_hdf5': 64,
                'I128': 128, 'I128_hdf5': 128, 'I256': 256, 'I256_hdf5': 256,
-               'C10': 32, 'C100': 32}
+               'C10': 32, 'C100': 32,
+               'D1K': 128, 'D1K_hdf5': 128 }
 root_dict = {'I32': 'ImageNet', 'I32_hdf5': 'ILSVRC32.hdf5',
              'I64': 'ImageNet', 'I64_hdf5': 'ILSVRC64.hdf5',
              'I128': 'ImageNet', 'I128_hdf5': 'ILSVRC128.hdf5',
              'I256': 'ImageNet', 'I256_hdf5': 'ILSVRC256.hdf5',
-             'C10': 'cifar', 'C100': 'cifar'}
+             'C10': 'cifar', 'C100': 'cifar',
+             'D1K': 'characters-1k-faces', 'D1K_hdf5': 'D1K.hdf5' }
 nclass_dict = {'I32': 1000, 'I32_hdf5': 1000, 'I64': 1000, 'I64_hdf5': 1000,
                'I128': 1000, 'I128_hdf5': 1000, 'I256': 1000, 'I256_hdf5': 1000,
-               'C10': 10, 'C100': 100}
-# Number of classes to put per sample sheet
+               'C10': 10, 'C100': 100,
+               'D1K': 1000, 'D1K_hdf5': 1000 }
+# Number of classes to put per sample sheet
 classes_per_sheet_dict = {'I32': 50, 'I32_hdf5': 50, 'I64': 50, 'I64_hdf5': 50,
                           'I128': 20, 'I128_hdf5': 20, 'I256': 20, 'I256_hdf5': 20,
-                          'C10': 10, 'C100': 100}
+                          'C10': 10, 'C100': 100,
+                          'D1K': 1, 'D1K_hdf5': 1 }
```

Each dataset exists in 2 forms, as the original image folder and then as the processed HDF5:

```bash
python make_hdf5.py --dataset D1K512 --data_root /media/gwern/Data2/danbooru2018
python calculate_inception_moments.py --dataset D1K_hdf5 --batch_size 32 \
    --data_root /media/gwern/Data2/danbooru2018

## Or ImageNet example:
python make_hdf5.py --dataset I128 --data_root /media/gwern/Data/imagenet/
python calculate_inception_moments.py --dataset I128_hdf5 --batch_size 64 \
    --data_root /media/gwern/Data/imagenet/
```

make_hdf5.py will write the HDF5 to an ILSVRC*.hdf5 file, so rename it to whatever (eg D1K.hdf5).

## BigGAN Training

With the HDF5 & Inception statistics calculated, it should be possible to run like so:

```bash
python train.py --dataset D1K --parallel --shuffle --num_workers 4 --batch_size 32 \
    --num_G_accumulations 8 --num_D_accumulations 8 \
    --num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 \
    --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 --adam_eps 1e-6 \
    --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 120 --shared_dim 128 \
    --G_eval_mode --G_ch 96 --D_ch 96 \
    --ema --use_ema --ema_start 20000 --test_every 2000 --save_every 1000 --num_best_copies 5 \
    --num_save_copies 2 --seed 0 --use_multiepoch_sampler --which_best FID \
    --data_root /media/gwern/Data2/danbooru2018
```

The architecture is specified on the commandline and must be correct; examples are in the scripts/ directory. In the above example, --num_D_steps…--D_ch should be left strictly alone, and the key parameters are before/after that architecture block. In this example, my 2x1080ti can support a batch size of n=32 & the gradient accumulation overhead without OOMing.
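
For reference, the effective minibatch here is --batch_size × --num_{G/D}_accumulations (32 × 8 = 256 in the example above); a minimal sketch of what gradient accumulation amounts to in PyTorch (a generic illustration, not BigGAN-PyTorch’s actual training loop):

```python
import torch

def accumulated_step(model, optimizer, loss_fn, batches, num_accumulations=8):
    """One optimizer step whose gradient is averaged over several small batches,
    simulating a minibatch num_accumulations times larger than fits in VRAM."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y) / num_accumulations  # scale so the gradients average
        loss.backward()                                  # gradients accumulate across batches
        if (i + 1) == num_accumulations:
            break
    optimizer.step()
```
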
In addition to that, it’s important to enable EMA, which makes a truly remarkable difference in the generated sample quality (which is interesting because EMA sounds redundant with momentum/learning rates, but isn’t). Without EMA, samples are low quality and change drastically at each iteration; but EMA, which averages each iteration offline (but one doesn’t train using the averaged model!40), shows that collectively these iterations are similar because they are ‘orbiting’ around a central point, and the image quality is clearly gradually improving when EMA is turned on.

The big batches of BigGAN are implemented by --batch_size times --num_{G/D}_accumulations; I would need an accumulation of 64 to match n=2048.

Transfer learning is not supported natively, but a similar trick as with StyleGAN is feasible: just drop the pretrained models into the checkpoint folder and resume (which will work as long as the architecture is identical to the CLI parameters).

The sample sheet functionality can easily overload a GPU and OOM. In utils.py, it may be necessary to simply comment out all of the sampling functionality starting with utils.sample_sheet.

The main problem running BigGAN is odd bugs in BigGAN’s handling of epochs/iterations and changing gradient accumulations. With --use_multiepoch_sampler, it does complicated calculations to try to keep sampling consistent across epochs, with precisely the same ordering of samples regardless of how often the BigGAN job is started/stopped (eg on a cluster), but as one increases the total minibatch size and it progresses through an epoch, it tries to index data which doesn’t exist and crashes; I was unable to figure out how the calculations were going wrong, exactly.41 With that option disabled and larger total minibatches used, a different bug gets triggered, leading to inscrutable crashes:

```
...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "train.py", line 228, in <module>
    main()
  File "train.py", line 225, in main
    run(config)
  File "train.py", line 172, in run
    for i, (x, y) in enumerate(pbar):
  File "/root/BigGAN-PyTorch-mooch/utils.py", line 842, in progress
    for n, item in enumerate(items):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 601, in _get_batch
    return self.data_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 274, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21103) is killed by signal: Bus error.
```

There is no good workaround here: starting with small fast minibatches compromises final quality, while starting with big slow minibatches may work but then costs far more compute. I did find that the G/D accumulations can be imbalanced to allow increasing the G’s total minibatch (which appears to be the key for better quality), but this risks destabilizing training. These bugs need to be fixed before trying BigGAN for real.

## BigGAN: ImageNet→Danbooru2018-1K

In any case, I ran the 128px ImageNet→Danbooru2018-1K for ~6 GPU-days (or ~3 days on my 2x1080ti workstation) and the training montage indicates it was working fine:

Sometime after that, while continuing to play with imbalanced minibatches to avoid triggering the iteration/crash bugs, it diverged badly and mode-collapsed into static, so I killed the run, as the point appears to have been made: transfer learning is indeed possible, and the speed of the adaptation suggests benefits to training time by starting with a highly-trained model already.

## BigGAN: 256px Danbooru2018-1K

More seriously, I began training a 256px model on the Danbooru2018-1K portraits. This required rebuilding the HDF5 with 256px settings, and since I wasn’t doing transfer learning, I used the BigGAN-deep architecture settings, since BigGAN-deep has better results & is smaller than the original BigGAN.

My own 2x1080ti were inadequate for reasonable turnaround on training a 256px BigGAN from scratch—they would take something like 4+ months wallclock—so I decided to shell out for a big cloud instance. AWS/GCP are too expensive, so I used this as an opportunity to investigate Vast.ai as an alternative: they typically have much lower prices. Vast.ai setup was straightforward, and I found a nice instance: an 8x2080ti machine available for just $1.7/hour (AWS, for comparison, would charge closer to $2.16/hour for just 8 K80 halves). So I ran their 8x2080ti instance 2 May 2019–3 June 2019 ($1.7/hour; total: $1373.64). That is ~250 GPU-days of training, although this is a misleading way to put it, since the Vast.ai bill includes bandwidth/hard-drive in that total and the GPU utilization was poor, so each ‘GPU-day’ is worth about a third less than with the 128px BigGAN (which had good GPU utilization), and the 2080tis were overkill. It should be possible to do much better with the same budget in the future.

The training command:

```bash
python train.py --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
    --batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 --G_lr 1e-4 \
    --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 --D_ch 128 \
    --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 \
    --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 64 \
    --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 --sv_log_interval 1000 \
    --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 --no_fid \
    --num_inception_images 1 --augment --data_root ~/tmp --resume --experiment_name \
    BigGAN_D1K_hdf5_BigGANdeep_seed0_Gch128_Dch128_Gd2_Dd2_bs64_Glr1.0e-04_Dlr4.0e-04_Gnlinplace_relu_Dnlinplace_relu_Ginitortho_Dinitortho_Gattn64_Dattn64_Gshared_hier_ema
```

The system worked well, but BigGAN turns out to have serious bottlenecks and did not make good use of the 8 GPUs, averaging GPU utilization ~30% according to nvidia-smi. (On my 2x1080tis with the 128px model, GPU utilization was closer to 95%.) In retrospect, I probably should’ve switched to a less expensive instance like an 8x1080ti, where it likely would’ve had similar throughput but cost less.

Training progressed well up until iterations #80–90k, when I began seeing signs of mode collapse: I was unable to increase the minibatch to more than ~500 because of the bugs, limiting what I could do against mode collapse, and I suspect the small minibatch was why mode collapse was happening in the first place. (Gokaslan tried the last checkpoint I saved—#95,160—with the same settings, and ran it to #100,000 iterations and experienced near-total mode collapse.) The last checkpoint I saved from before mode collapse was #83,520, saved on 28 May 2019 after ~24 wallclock days (accounting for various crashes & time setting up & tweaking).

Random samples, interpolation grids (not videos), and class-conditional samples can be generated using sample.py; like train.py, it requires the exact architecture to be specified. I used the following command (many of the options are probably not necessary, but I didn’t know which):

```bash
python sample.py --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
    --batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 \
    --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 \
    --D_ch 128 --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 \
    --BN_eps 1e-5 --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier \
    --dim_z 64 --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 \
    --sv_log_interval 1000 --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 \
    --no_fid --num_inception_images 1 --skip_init --G_batch_size 32 --use_ema --G_eval_mode \
    --sample_random --sample_sheets --sample_interps --resume --experiment_name 256px
```

Random samples are already well-represented by the training montage. The interpolations look similar to StyleGAN interpolations. The class-conditional samples are the most fun to look at, because one can look at specific characters without the need to retrain the entire model (which, while only taking a few hours at most, is a hassle).
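
A minimal sketch of what class-conditional sampling boils down to once a trained G is loaded (generic pseudo-usage with a hypothetical generator object, not sample.py’s actual interface); the class IDs are the find_classes indices listed above:

```python
import torch

def sample_character(G, class_id, n=16, z_dim=64, device="cuda"):
    """Draw n samples of a single character class from a class-conditional generator."""
    z = torch.randn(n, z_dim, device=device)                          # latent noise (dim_z=64 above)
    y = torch.full((n,), class_id, dtype=torch.long, device=device)   # class indices
    with torch.no_grad():
        return G(z, G.shared(y))   # BigGAN-style: class embedding fed alongside the noise

# images = sample_character(G, class_id=273)   # 273 = holo; 824 = souryuu.asuka.langley
```
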
### 256px Danbooru2018-1K Samples

Interpolation images and 5 character-specific random samples (Asuka, Holo, Rin, Chen, Ruri):

### 256px BigGAN Downloads

Model & sample downloads:

## Evaluation

The best results from the 128px BigGAN model look about as good as could be expected from 128px samples; the 256px model is fairly good, but suffers from much more noticeable artifacting than 512px StyleGAN, and cost $1373 (a 256px StyleGAN would have been closer to $400 on AWS). In BigGAN’s defense, it had clearly not converged yet and could have benefited from much more training and much larger minibatches, had that been possible. Qualitatively, looking at the more complex elements of samples, like hair ornaments/hats, I feel like BigGAN was doing a much better job of coping with complexity & fine detail than StyleGAN would have at a similar point.

However, training 512px portraits or whole-Danbooru images is infeasible at this point: while the cost might be only a few thousand dollars, the various bugs mean that it may not be possible to stably train to a useful quality. It’s a dilemma: at small or easy domains, StyleGAN is much faster (if not better); but at large or hard domains, mode collapse is too risky and endangers the big investment necessary to surpass StyleGAN. To make BigGAN viable, it needs at least:

• minibatch size bugs fixed to enable up to n=2048 (or larger, as gradient noise scale indicates)
• 512px architectures defined, to allow transfer learning from the released Tensorflow 512px ImageNet model
• optimization work to reduce overhead and allow reasonable GPU utilization on >2-GPU systems

With those done, it should be possible to train 512px portraits for <$1,000 and whole-Danbooru images for <$10,000. (Given the release of DeepDanbooru as a TensorFlow model, enabling an anime-specific perceptual loss, it would also be interesting to investigate applying pretraining to BigGAN.)

# See also

# Appendix

For comparison, here are some of my older GAN or other NN attempts; as the quality is worse than StyleGAN, I won’t bother going into details—creating the datasets & training the ProGAN & tuning & transfer-learning were all much the same as already outlined at length for the StyleGAN results. Included are:

• ProGAN
• Glow
• MSG-GAN
• PokeGAN
• Self-Attention-GAN-TensorFlow
• VGAN
• BigGAN unofficial (official BigGAN is covered above)
• BigGAN-TensorFlow
• BigGAN-PyTorch
• GAN-QP
• WGAN
• IntroVAE

## ProGAN

1. 8 September 2018, 512–1024px whole-Asuka images ProGAN samples:

2. 18 September 2018, 512px Asuka faces, ProGAN samples:

3. 29 October 2018, 512px Holo faces, ProGAN:

   After generating ~1k Holo faces, I selected the top decile (n=103) of the faces ():

   The top decile images are, nevertheless, showing distinct signs of both artifacting & overfitting/memorization of data points. Another 2 weeks proved this out further:

4. 17 January 2019, Danbooru2017 512px SFW images, ProGAN:

5. 5 February 2019 (stopped in order to train with the new StyleGAN codebase), the 512px anime face dataset used elsewhere, ProGAN:

Downloads:

## Glow

Due to the enormous model size (4.2GB), I had to modify Glow’s settings to get training working reasonably well, after extensive tinkering to figure out what any of the settings meant:

```json
{"verbose": true, "restore_path": "logs/model_4.ckpt", "inference": false, "logdir": "./logs",
 "problem": "asuka", "category": "", "data_dir": "../glow/data/asuka/", "dal": 2,
 "fmap": 1, "pmap": 16, "n_train": 20000, "n_test": 1000, "n_batch_train": 16,
 "n_batch_test": 50, "n_batch_init": 16, "optimizer": "adamax", "lr": 0.0005, "beta1": 0.9,
 "polyak_epochs": 1, "weight_decay": 1.0, "epochs": 1000000, "epochs_warmup": 10,
 "epochs_full_valid": 3, "gradient_checkpointing": 1, "image_size": 512, "anchor_size": 128,
 "width": 512, "depth": 13, "weight_y": 0.0, "n_bits_x": 8, "n_levels": 7, "n_sample": 16,
 "epochs_full_sample": 5, "learntop": false, "ycond": false, "seed": 0, "flow_permutation": 2,
 "flow_coupling": 1, "n_y": 1, "rnd_crop": false, "local_batch_train": 1, "local_batch_test": 1,
 "local_batch_init": 1, "direct_iterator": true, "train_its": 1250, "test_its": 63,
 "full_test_its": 1000, "n_bins": 256.0, "top_shape": [4, 4, 768]}
...
{"epoch": 5, "n_processed": 100000, "n_images": 6250, "train_time": 14496, "loss": "2.0090",
 "bits_x": "2.0090", "bits_y": "0.0000", "pred_loss": "1.0000"}
```

An additional challenge was numerical instability in the reversing of matrices, giving rise to many ‘invertibility’ crashes. Final sample before I looked up the compute requirements more carefully & gave up on Glow:

## MSG-GAN

## PokeGAN

nshepperd’s (unpublished) multi-scale GAN with self-attention layers, spectral normalization, and a few other tweaks:

## Self-Attention-GAN-TensorFlow

SAGAN did not have an official implementation released at the time, so I used the ; 128px SAGAN, WGAN-LP loss, on Asuka faces & whole Asuka images:

## VGAN

The official implementation had not been released when I began trying VGAN, so I used . The variational bottleneck, along with self-attention layers and progressive growing, is one of the few strategies which permit 512px images, and I was intrigued to see that it worked relatively well, although I ran into persistent issues with instability & mode collapse. I suspect that VGAN could’ve worked better than it did with some more work.

## BigGAN unofficial

The official BigGAN implementations were not released until late March 2019 (nor the semi-official compare_gan implementation until February 2019), and I experimented with 2 unofficial implementations in late 2018–early 2019.

### BigGAN-TensorFlow

; 128px spectral norm hinge loss, anime faces:

This one never worked well at all, and I am still puzzled what went wrong.

### BigGAN-PyTorch

Aaron Leong’s implementation (not the official BigGAN implementation). As it’s class-conditional, I faked having 1000 classes by constructing a variant anime face dataset: taking the top 1000 characters by tag count in the Danbooru2017 metadata, I then filtered for those character tags 1 by 1, and copied them & cropped faces into matching subdirectories 1–1000. This let me try out both faces & whole images. I also attempted to hack in gradient accumulation for big minibatches to make it a true BigGAN implementation, but it didn’t help too much; the problem here might simply have been that I couldn’t run it long enough. Results upon abandoning:

## GAN-QP

Training oscillated enormously, with all the samples closely linked and changing simultaneously.
This was despite the checkpoint model being enormous (551MB), and I am suspicious that something was seriously wrong—either the model architecture was wrong (too many layers or filters?) or the learning rate was many orders of magnitude too large. Because of the small minibatch, progress was difficult to make in a reasonable amount of wallclock time, so I moved on.

## WGAN

; I did most of the early anime face work with WGAN on a different machine and didn’t keep copies. However, a sample from a short run gives an idea of what WGAN tended to look like on anime runs:

## IntroVAE

A hybrid GAN-VAE architecture introduced in mid-2018 by , Huang et al 2018, with the , IntroVAE attempts to reuse the encoder-decoder for an adversarial loss as well, to combine the best of both worlds: the principled stable training & reversible encoder of the VAE with the sharpness & high quality of a GAN. Quality-wise, they show IntroVAE works on CelebA & LSUN Bedroom at up to 1024px resolution with results they claim are comparable to ProGAN. Performance-wise, for 512px, they give a runtime of 7 days with a minibatch n=12, or presumably 4 GPUs (since their 1024px run script implies they used 4 GPUs and I can fit a minibatch of n=4 onto 1x1080ti, so 4 GPUs would be consistent with n=12), and so 28 GPU-days.

I adapted the 256px suggested settings for my 512px anime portraits dataset:

```bash
python main.py --hdim=512 --output_height=512 --channels='32, 64, 128, 256, 512, 512, 512' --m_plus=120 \
    --weight_rec=0.05 --weight_kl=1.0 --weight_neg=0.5 --num_vae=0 \
    --dataroot=/media/gwern/Data2/danbooru2018/portrait/1/ --trainsize=302652 --test_iter=1000 --save_iter=1 \
    --start_epoch=0 --batchSize=4 --nrow=8 --lr_e=0.0001 --lr_g=0.0001 --cuda --nEpochs=500
# ...====> Cur_iter: [187060]: Epoch[3](5467/60531): time: 142675: Rec: 19569, Kl_E: 162, 151, 121, Kl_G: 151, 121,
```

There was a minor bug in the codebase where it would crash on trying to print out the log data, perhaps because it assumes multi-GPU and I was running on 1 GPU, and was trying to index into an array which was actually a simple scalar, which I fixed by removing the indexing:

```diff
-        info += 'Rec: {:.4f}, '.format(loss_rec.data[0])
-        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data[0],
-                    lossE_rec_kl.data[0], lossE_fake_kl.data[0])
-        info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data[0], lossG_fake_kl.data[0])
-
+
+        info += 'Rec: {:.4f}, '.format(loss_rec.data)
+        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data,
+                    lossE_rec_kl.data, lossE_fake_kl.data)
+        info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data, lossG_fake_kl.data)
```

Sample results after ~1.7 GPU-days:

By this point, StyleGAN would have been generating recognizable faces from scratch, while the IntroVAE random samples are not even face-like, and the IntroVAE training curve was not improving at a notable rate. IntroVAE has some hyperparameters which could probably be tuned better for the anime portrait faces (they briefly discuss the use of the --num_vae option to run in classic VAE mode to let you tune the VAE-related hyperparameters before enabling the GAN-like part), but it should be fairly insensitive overall to hyperparameters, and tuning is unlikely to help all that much. So IntroVAE probably can’t replace StyleGAN (yet?) for general-purpose image synthesis. This demonstrates again that it seems like everything works on CelebA these days, and just because something works on a photographic dataset does not mean it’ll work on other datasets.
Image generation papers should probably branch out some more and consider non-photographic tests.

1. Turns out that when training goes really wrong, you can crash many GAN implementations with either a segfault, integer overflow, or division-by-zero error.↩︎

2. StackGAN/StackGAN++/PixelCNN et al are difficult to run as they require a unique image embedding which could only be computed in the unmaintained Torch framework using Reed's prior work on a joint text+image embedding, which however doesn't run on anything but the Birds & Flowers datasets; so no one has ever, as far as I am aware, run those implementations on anything else—certainly I never managed to, despite quite a few hours trying to reverse-engineer the embedding & various implementations.↩︎

3. Be sure to check out .↩︎

4. Glow's reported results ; BigGAN's total compute is unclear as it was trained on a TPUv3 Google cluster, but it would appear that a 128px BigGAN might be ~4 GPU-months assuming hardware like an 8-GPU machine, 256px ~8 GPU-months, and 512px ≫8 GPU-months, with VRAM being the main limiting factor for larger models (although progressive growing might be able to cut those estimates).↩︎

5. is an old & small CNN trained to predict a few -booru tags on anime images, and so provides an embedding—but not a good one. The lack of a good embedding is the major limitation for anime deep learning as of February 2019. (, while apparently performing well, has not yet been used for embeddings.) An embedding is necessary for text→image GANs, image searches & nearest-neighbor checks of overfitting, FID errors for objectively comparing GANs, minibatch discrimination to help the D/provide an auxiliary loss to stabilize learning, anime style transfer (both for its own sake & for creating a 'StyleDanbooru2018' to reduce texture cheating), encoding into GAN latent spaces for manipulation, data cleaning (to detect anomalous datapoints like failed face crops), perceptual losses for encoders or as an additional auxiliary loss/pretraining (like , which trains a Generator on a perceptual loss and does GAN training only for finetuning), etc. A good tagger is also a good starting point for doing pixel-level semantic segmentation (via "weak supervision"), which metadata is key for training something like Nvidia's successor to pix2pix (; ).↩︎

6. Technical note: I typically train NNs using my workstation with 2x1080ti GPUs. For easier comparison, I convert all my times to single-GPU equivalents (ie "6 GPU-weeks" means 3 realtime/wallclock weeks on my 2 GPUs).↩︎

7. observes (§4 "Using precision and recall to analyze and improve StyleGAN") that StyleGAN with progressive growing disabled does work, but at some cost to precision/recall quality metrics; whether this reflects inferior performance on a given training budget or an inherent limit—BigGAN and other self-attention-using GANs do not use progressive growing at all, suggesting it is not truly necessary—is not investigated.↩︎

8. This has confused some people, so to clarify the sequence of events: I trained my anime face StyleGAN and posted notes on Twitter, releasing an early model; generated an interpolation video using said model (but a different random seed, of course); this interpolation video was retweeted by the Japanese Twitter user , upon which it went viral and was 'liked' by Elon Musk, further driving virality (19k reshares, 65k likes, 1.29m watches as of 22 March 2019).↩︎

9. Google Colab is a free service which includes free GPU time (up to 12 hours on a small GPU).
Especially for people who do not have a reasonably capable GPU on their personal computers (such as all Apple users) or do not want to engage in the admitted hassle of renting a real cloud GPU instance, Colab can be a great way to play with a pretrained model, like generating GPT-2-small text completions or StyleGAN interpolation videos, or to prototype on tiny problems. However, it is a bad idea to try to train real models, like 512–1024px StyleGANs, on a Colab instance: the GPUs have little VRAM and are far slower (6 hours per StyleGAN tick!), the instances are unwieldy to work with (one must save snapshots constantly in order to restart when the session runs out), there is no real command-line, etc. Colab is just barely adequate for perhaps 1 or 2 ticks of transfer learning, but not more. If you harbor greater ambitions but still refuse to spend any money (rather than time), Kaggle has a similar service with P100 GPU slices rather than K80s. Otherwise, one needs to get access to real GPUs.↩︎

10. Curiously, the benefit of many more FC layers than usual may have been stumbled across before: IllustrationGAN found that adding some FC layers seemed to help their DCGAN generate anime faces, and when I & experimented with adding 2–4 FC layers to WGAN-GP along IllustrationGAN's lines, it did help our lackluster results. But we never dreamed of going as deep as 8!↩︎

11. The ProGAN/StyleGAN codebase supports conditioning, but none of the papers report on this functionality and I have not used it myself.↩︎

12. The latent embedding z is usually generated in about the simplest possible way: draws from the Normal distribution, . A is sometimes used instead. There is no good justification for this, and some reason to think it can be bad (how does a GAN easily map a discrete or binary latent factor, such as the presence or absence of the left ear, onto a Normal variable?). The , finding improvements in training time and/or final quality from using instead (in ascending order): a Normal + binary Bernoulli (p=0.5; personal communication, Brock) variable, a binary (Bernoulli), and a rectified Gaussian (sometimes called a "censored normal" even though that sounds like a rather than the rectified one). The rectified Gaussian distribution "outperforms (in terms of IS) by 15–20% and tends to require fewer iterations." The downside is that the truncation trick, which yields even larger average improvements in image quality (at the expense of diversity), doesn't quite apply, and the rectified Gaussian sans truncation produced similar results as the Normal+truncation, so BigGAN reverted to the default Normal distribution+truncation (personal communication). The truncation trick either directly applies to some of the other distributions, particularly the rectified Gaussian, or could easily be adapted—possibly yielding an improvement over either approach. The rectified Gaussian can be truncated just like the default Normals can. And for the Bernoulli, one could decrease p during generation, or, what is probably equivalent, re-sample whenever the variance (ie squared sum) of all the Bernoulli latent variables exceeds a certain constant. (With p=0.5, a latent vector of 512 Bernoullis would on average sum to 256, with the 2.5%–97.5% quantiles being 234–278, so a 'truncation trick' here might be throwing out every vector with a sum above, say, the 80% quantile of 266.) One also wonders about vectors which draw from multiple distributions rather than just one. Could the StyleGAN 8-FC-layer learned latent variable be reverse-engineered? Perhaps the first layer or two merely converts the normal input into a more useful distribution, & parameters/training could be saved or insight gained by imitating that.↩︎
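
For illustration (this is not BigGAN's or StyleGAN's actual sampling code), here is roughly how the alternative latent distributions above could be drawn, including a rejection-sampling 'truncation' for Bernoulli latents using the ~266 cutoff mentioned:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # latent dimensionality

def normal_latent():
    return rng.standard_normal(n)                    # the usual N(0,1) draw

def bernoulli_latent(p=0.5):
    return rng.binomial(1, p, size=n).astype(float)  # binary latents

def rectified_gaussian_latent():
    return np.maximum(rng.standard_normal(n), 0.0)   # max(0, z): rectified, not truncated

def truncated_bernoulli_latent(p=0.5, max_sum=266):
    # Crude Bernoulli analogue of the truncation trick: rejection-sample
    # until the vector's sum falls below the chosen quantile cutoff.
    while True:
        z = bernoulli_latent(p)
        if z.sum() <= max_sum:
            return z
```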
13. Which raises the question: if you added any or all of those features, would StyleGAN become that much better? Unfortunately, while theorists & practitioners have had many ideas, so far theory has proven more fecund than fatidical, and the large-scale GAN experiments necessary to truly test the suggestions are too expensive for most. Half of these suggestions are great ideas—but which half?↩︎

14. For more on the choice of convolution layers/kernel sizes, see Karpathy's 2015 notes for , or take a look at these & Yang's interactive .↩︎

15. These observations apply only to the Generator in GANs (which is what we primarily care about); curiously, there's some reason to think that GAN Discriminators are in fact mostly memorizing (see later).↩︎

16. A possible alternative is ().↩︎

17. Based on eyeballing the 'cat' bar graph in Figure 3 of .↩︎

18. Cats offer an amusing instance of the dangers of data augmentation: ProGAN used horizontal flipping/mirroring for everything, because why not? This led to strange Cyrillic text captions showing up in the generated cat images. Why not Latin-alphabet captions? Because every cat image was being shown mirrored as well as normally! For StyleGAN, mirroring was disabled, so now the lolcat captions are recognizably Latin-alphabetical, and even almost English words. This demonstrates that even datasets where left/right doesn't seem to matter, like cat photos, can surprise you.↩︎

19. I estimated the total cost using AWS EC2 preemptible hourly costs on 15 March 2019 as follows:

• 1 GPU: p2.xlarge instance in us-east-2a, half of a K80 (12GB VRAM): $0.3235/hour
• 2 GPUs: NA—there is no instance type with 2 GPUs, only 1/8/16
• 8 GPUs: p2.8xlarge in us-east-2a, 8 halves of K80s (12GB VRAM each): $2.160/hour
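
For a rough sense of scale (illustrative arithmetic only, ignoring storage, bandwidth & preemption overhead): a hypothetical 4 GPU-week training run on the single-GPU instance would be 4 × 7 × 24 = 672 hours, or 672 × $0.3235 ≈ $217.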

As usual, there is sublinear scaling, and larger instances cost disproportionately more, because one is paying for faster wallclock training (time is valuable) and for not having to create a distributed infrastructure which can exploit the cheap single-GPU instances.

This cost estimate does not count additional costs like hard drive space. In addition to the dataset size (the StyleGAN data encoding is ~18x larger than the raw data size, so a 10GB folder of images → 200GB of .tfrecords), you would need at least 100GB HDD (50GB for the OS, and 50GB for checkpoints/images/etc to avoid crashes from running out of space).↩︎

20. I regard this as a flaw in StyleGAN & TF in general. Computers are more than fast enough to load & process images asynchronously using a few worker threads, and working with a directory of images (rather than a special binary format 10–20x larger) avoids imposing serious burdens on the user & hard drive. PyTorch GANs almost always avoid this mistake, and are much more pleasant to work with as one can freely modify the dataset between (and even during) runs.↩︎
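
For contrast, the standard PyTorch idiom (a sketch, not any specific GAN repository's loader) needs nothing but a plain directory of images, decoded on the fly by background worker processes:

```python
import torch
from torchvision import datasets, transforms

# ImageFolder expects one subdirectory per class; for an unconditional GAN,
# a single dummy subdirectory (e.g. portraits/all/*.png, a hypothetical path) suffices.
dataset = datasets.ImageFolder(
    "portraits/",
    transform=transforms.Compose([
        transforms.Resize(512),
        transforms.CenterCrop(512),
        transforms.ToTensor(),
    ]))
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True,
                                     num_workers=4, pin_memory=True)

for images, _ in loader:  # images are decoded & batched asynchronously by the workers
    ...                   # training step goes here
```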

21. For example, my Danbooru2018 anime portrait dataset is 16GB, and the StyleGAN dataset is 296GB.↩︎

22. This may be why some people report that StyleGAN just crashes for them & they can't figure out why. They should try changing their dataset JPG→PNG.↩︎

23. But you may not want to–remember the lolcat captions!↩︎

24. Note: If you use a different command to resize, check it thoroughly. With ImageMagick, if you use the ^ operator like -resize 512x512^, you will not get exactly 512x512px images as you need; while if you use the ! operator like -resize 512x512!, the images will be exactly 512x512px but the aspect ratios will be distorted to make the images fit, and this may confuse anything you are training by introducing unnecessary meaningless distortions & will make any generated images look bad.↩︎
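
If you would rather sidestep ImageMagick's operator subtleties entirely, an equivalent resize-then-center-crop can be done with Pillow in a few lines (a sketch, not the preprocessing actually used here):

```python
from PIL import Image

def resize_center_crop(path_in, path_out, size=512):
    img = Image.open(path_in).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)                              # make the short side exactly `size`
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img.crop((left, top, left + size, top + size)).save(path_out)  # exact size, no distortion
```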

25. If you are using Python 2, you will get print syntax error messages; if you are using Python 3–3.6, you will get ‘type hint’ errors.↩︎

26. This makes it conform to a truncated normal distribution; why truncated rather than rectified/winsorized at a max like 0.5 or 1.0? Because then many, possibly most, of the latent variables would all be at the max, instead of smoothly spread out over the permitted range.↩︎
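
A toy numerical check of the difference (illustrative only, not code from the repo): clipping a standard Normal at ±0.5 piles ~62% of the draws exactly onto the boundary, while keeping only draws within ±0.5 (equivalent to re-sampling) leaves them smoothly spread out:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

clipped   = np.clip(z, -0.5, 0.5)   # winsorize at ±0.5
truncated = z[np.abs(z) <= 0.5]     # keep only in-range draws (same as resampling)

print((np.abs(clipped) == 0.5).mean())    # ≈ 0.617: mass piled up at the boundary
print((np.abs(truncated) == 0.5).mean())  # 0.0: smoothly spread over (-0.5, 0.5)
```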

27. No minibatches are used, so this is much slower than necessary.↩︎

28. There are more real Asuka images than Holo to begin with, but there is no particular reason for the 10x data augmentation compared to Holo's 3x—the data augmentations were just done at different times and happened to have fewer or more augmentations enabled.↩︎

29. Holo faces were far more common than Asuka faces. There were 12,611 Holo faces & 5,838 Asuka faces, so Holo was only 2x more common and Asuka is a more popular character in general in Danbooru, so I am a little puzzled why Holo showed up so much more than Asuka. One possibility is that Holo is inherently easier to model under the truncation trick—I noticed that the brown short-haired face at 𝜓=0 resembles Holo much more than Asuka, so perhaps when setting 𝜓, Asukas are disproportionately filtered out? Or faces closer to the origin (because of brown hair?) are simply more likely to be generated to begin with.↩︎

30. I find that the highest ranked images often contain many anomalies or low-quality images which need to be deleted. Why? The notes that a well-trained D which achieves 98% real-vs-fake classification performance on the ImageNet training dataset falls to 50–55% accuracy when run on the validation dataset, suggesting the D's role is about memorizing the training data rather than some measure of 'realism'. Perhaps because the D ranking is not necessarily a 'quality' score but simply a sort of confidence rating that an image is from the real dataset; if the real images contain certain easily-detectable images which the G can't replicate, then the D might memorize or learn them quickly. For example, in face crops, whole-figure crops are common mistaken crops, making up a tiny percentage of images; how could a face-only G learn to generate whole realistic bodies without the intermediate steps being instantly detected & defeated as errors by D, while D is easily able to detect realistic bodies as definitely real? This would explain the polarized rankings. And given the close connections between GANs & DRL, I have to wonder if there is more memorization going on than suspected in things like ? Incidentally, this may also explain the problem with using Discriminators for semi-supervised representation learning: if the D is memorizing datapoints to force the G to generalize, then its internal representations would be expected to be useless. (One would instead want to extract knowledge from the G, perhaps by encoding an image into z and using the z as the representation.)↩︎

31. A famous example is character designer turning () into (Evangelion).↩︎

32. In retrospect, this shouldn’t’ve surprised me.↩︎

33. There is for other architectures like flow-based ones such as Glow, and this is one of their benefits–while the requirement to be made out of building blocks which can be run backwards & forwards equally well, to be 'invertible', is currently extremely expensive and the results not competitive either in final image quality or compute requirements, the invertibility means that encoding an arbitrary real image to get its inferred latents Just Works™ and one can easily morph between 2 arbitrary images, or encode an arbitrary image & edit it in the latent space to do things like add/remove glasses from a face or create an opposite-sex version.↩︎

34. This final approach is, interestingly, the historical reason backpropagation was invented: it corresponds to planning in a model. For example, in planning the flight path of an airplane (/): the destination or ‘output’ is fixed, the aerodynamics+geography or ‘model parameters’ are also fixed, and the question is what actions determining a flight path will reduce the loss function of time or fuel spent. One starts with a random set of actions picking a random flight path, runs it forward through the environment model, gets a final time/fuel spent, and then backpropagates through the model to get the gradients for the flight path, adjusting the flight path towards a new set of actions which will slightly reduce the time/fuel spent; the new actions are used to plan out the flight to get a new loss, and so on, until a local minimum of the actions has been found. This works with non-stochastic problems; for stochastic ones where the path can’t be guaranteed to be executed, “model-predictive control” can be used to replan at every step and execute adjustments as necessary. Another interesting use of backpropagation for outputs is which tackles the long-standing problem of how to get NNs to output sets rather than list outputs by generating a possible set output & refining it via backpropagation.↩︎
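
A minimal sketch of this pattern in PyTorch (the 'model' is a stand-in two-layer network, not an actual flight-dynamics model): the model's weights are frozen, and only the input actions receive gradient updates:

```python
import torch

# Stand-in 'environment model': maps a sequence of 10 actions to a predicted scalar cost.
model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
for p in model.parameters():
    p.requires_grad_(False)                    # the model is fixed; only the actions are optimized

actions = torch.randn(10, requires_grad=True)  # start from a random 'flight path'
opt = torch.optim.SGD([actions], lr=0.1)

for step in range(200):
    cost = model(actions).sum()                # run the plan forward through the model
    opt.zero_grad()
    cost.backward()                            # backpropagate to get gradients w.r.t. the actions...
    opt.step()                                 # ...and nudge the plan to reduce the predicted cost
```

The same recipe, with a trained Generator as the frozen model and a pixel or perceptual loss as the cost, is how gradient-based encoders project real images into a GAN's latent space.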

35. SGD is common, but a second-order algorithm like is often used in these applications in order to run as few iterations as possible.↩︎

36. shows that BigGAN/StyleGAN latent embeddings can also go beyond what one might expect, to include zooms, translations, and other transforms.↩︎

37. Flow models have other advantages, mostly stemming from the maximum likelihood training objective. Since the image can be propagated backwards and forwards losslessly, instead of being limited to generating random samples like a GAN, it's possible to calculate the exact probability of an image, enabling maximum likelihood as a loss to optimize and dropping the Discriminator entirely. With no adversarial game, there's no worry about weird training dynamics, and the likelihood loss also forbids 'mode dropping': the flow model can't simply conspire with a Discriminator to forget possible images.↩︎
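
Concretely, the 'exact probability' comes from the change-of-variables formula: for an invertible mapping f from an image x to its latent z = f(x) with latent prior p_Z,

$$\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|$$

which can be evaluated and maximized directly as a training loss, with no Discriminator anywhere.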

38. ImageNet requires you to sign up & be approved to download from them, but 2 months later I have still heard nothing back. So I used the data from ILSVRC2012_img_train.tar (MD5: 1d675b47d978889d74fa0da5fadfb00e; 138GB) which I downloaded from the torrent.↩︎

39. Danbooru can classify the same character under multiple tags: for example, Sailor Moon characters are tagged under their “Sailor X” name for images of their transformed version, and their real names for ‘civilian’ images (eg ‘Sailor Venus’ or ‘Cure Moonlight’, the former of which I merged with ‘Aino Minako’). Some popular franchises have many variants of each character: the Fate franchise, especially with the success of , is a particular offender, with quite a few variants of characters like Saber.↩︎

40. One would think it would, but I asked Brock and apparently it doesn’t help to occasionally initialize from the EMA snapshots.↩︎

41. As far as I can tell, it has something to do with the dataloader code in utils.py: the calculation of length and the iterator do something weird to adjust for previous training, so the net effect is that you can run with a fixed minibatch accumulation and it’ll be fine, and you can reduce the number of accumulations, and it’ll simply underrun the dataloader, but if you increase the number of accumulations, if you’ve trained enough percentage-wise, it’ll immediately flip over into a negative length and indexing into it becomes completely impossible, leading to crashes. Unfortunately, I only ever want to increase the minibatch accumulation… I tried to fix it but the logic is too convoluted for me to follow it.↩︎

42. Mirror: rsync --verbose rsync://78.46.86.149:873/biggan/2019-05-28-biggan-danbooru2018-snapshot-83520.tar.xz ./↩︎

43. Mirror: rsync --verbose rsync://78.46.86.149:873/biggan/2019-06-04-biggan-256px-danbooru20181k-83520-randomsamples.tar ./↩︎