A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/2 neural networks, and tips/scripts for effective StyleGAN use.
2019-02-04–2021-01-03
finished
certainty: highly likely
importance: 5
- Examples
- Background
- FAQ
- Training requirements
- Data Preparation
- Training
- Sampling
- Models
- Transfer Learning
- Reversing StyleGAN To Control & Modify Images
- StyleGAN 2
- Future Work
- BigGAN
- See Also
- External Links
- Appendix
- Link Bibliography
Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.
I show off my StyleGAN 1/2 CC-0-licensed anime faces & videos, provide downloads for the final models & anime portrait face dataset, provide the ‘missing manual’ & explain how I trained them based on Danbooru2017/2018 with source code for the data preprocessing, and document installation & configuration & training tricks. For application, I document various scripts for generating images & videos, briefly describe the website “This Waifu Does Not Exist” I set up as a public demo (see also Artbreeder), discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small datasets (eg Holo or Asuka Souryuu Langley), and touch on more advanced StyleGAN applications like encoders & controllable generation.
The appendix gives samples of my failures with earlier GANs for anime face generation, and I provide samples & model from a relatively large-scale BigGAN training run suggesting that BigGAN may be the next step forward to generating full-scale anime images.
A minute of reading could save an hour of debugging!
When Ian Goodfellow’s first GAN paper came out in 2014, with its blurry 64px grayscale faces, I said to myself, “given the rate at which GPUs & NN architectures improve, in a few years, we’ll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious.” There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/
So when GANs hit 128px color images on ImageNet, and could do somewhat passable CelebA face samples around 2015, along with my char-RNN experiments, I began experimenting with Soumith Chintala’s implementation of DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of Asuka Souryuu Langley from Neon Genesis Evangelion because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)
It did not work. Despite many runs on my laptop & a borrowed desktop, DCGAN never got remotely near to the level of the CelebA face samples, typically topping out at reddish blobs before diverging or outright crashing.1 Thinking perhaps the problem was too-small datasets & I needed to train on all the faces, I began creating the Danbooru2017 version of “Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”. Armed with a large dataset, I subsequently began working through particularly promising members of the GAN zoo, emphasizing SOTA & open implementations.
Among others, I have tried StackGAN/
Glow & BigGAN had promising results reported on CelebA & ImageNet respectively, but unfortunately their training requirements were out of the question.4 (As interesting as SPIRAL and CAN are, no source was released and I couldn’t even attempt them.)
While some remarkable tools like PaintsTransfer/

Typically, a GAN would diverge after a day or two of training, or it would collapse to producing a limited range of faces (or a single face), or if it was stable, simply converge to a low level of quality with a lot of fuzziness; perhaps the most typical failure mode was heterochromia (which is common in anime but not that common)—mismatched eye colors (each color individually plausible), from the Generator apparently being unable to coordinate with itself to pick consistently. With more recent architectures like VGAN or SAGAN, which carefully weaken the Discriminator or which add extremely-powerful components like self-attention layers, I could reach fuzzy 128px faces.
Given the miserable failure of all the prior NNs I had tried, I had begun to seriously wonder if there was something about non-photographs which made them intrinsically unable to be easily modeled by convolutional neural networks (the common ingredient to them all). Did convolutions render it unable to generate sharp lines or flat regions of color? Did regular GANs work only because photographs were made almost entirely of blurry textures?
But BigGAN demonstrated that a large cutting-edge GAN architecture could scale, given enough training, to all of ImageNet at even 512px. And ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat infeasible amounts of training. ProGAN (source; video), while expensive and requiring >6 GPU-weeks6, did work and was even powerful enough to overfit single-character face datasets; I didn’t have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting. Because, a common sequence in DL/

StyleGAN was the final breakthrough in providing ProGAN-level capabilities but fast: by switching to a radically different architecture, it minimized the need for the slow progressive growing (perhaps eliminating it entirely7), and learned efficiently at multiple levels of resolution, with bonuses in providing much more control of the generated images with its “style transfer” metaphor.
Examples
First, some demonstrations of what is possible with StyleGAN on anime faces:



Even a quick look at the MGM & StyleGAN samples demonstrates the latter to be superior in resolution, fine details, and overall appearance (although the MGM faces admittedly have fewer global mistakes). It is also superior to my 2018 ProGAN faces. Perhaps the most striking fact about these faces, which should be emphasized for those fortunate enough not to have spent as much time looking at awful GAN samples as I have, is not that the individual faces are good, but rather that the faces are so diverse, particularly when I look through face samples with 𝜓≥1—it is not just the hair/

Background

StyleGAN was published in 2018 as “A Style-Based Generator Architecture for Generative Adversarial Networks”, Karras et al 2018 (source code; demo video/

StyleGAN makes a number of additional improvements, but they appear to be less important: for example, it introduces a new “FFHQ” face/
Aside from the FCs and style noise & normalization, it is a vanilla architecture. (One oddity is the use of only 3×3 convolutions & so few layers in each upscaling block; a more conventional upscaling block than StyleGAN’s 3×3→3×3 would be something like BigGAN’s, which does 1×1 → 3×3 → 3×3 → 1×1. It’s not clear if this is a good idea, as it limits the spatial influence of each pixel by providing limited receptive fields14.) Thus, if one has some familiarity with training a ProGAN or another GAN, one can immediately work with StyleGAN with no trouble: the training dynamics are similar and the hyperparameters have their usual meaning, and the codebase is much the same as the original ProGAN (with the main exception being that config.py has been renamed train.py (or run_training.py in S2) and the original train.py, which stores the critical configuration parameters, has been moved to training/training_loop.py; there is still no support for command-line options and StyleGAN must be controlled by editing train.py/training_loop.py by hand).
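To make the block-structure comparison concrete, here is a purely illustrative tf.keras sketch (my own, not the actual StyleGAN or BigGAN code, and omitting their normalization/noise/style layers) of the two upscaling-block shapes: StyleGAN’s plain 3×3→3×3 versus a BigGAN-style 1×1→3×3→3×3→1×1 bottleneck with a residual skip:

import tensorflow as tf
from tensorflow.keras import layers

def stylegan_like_block(x, ch):
    # upsample, then two 3x3 convolutions (no skip connection)
    x = layers.UpSampling2D()(x)
    x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(ch, 3, padding='same', activation='relu')(x)
    return x

def biggan_like_block(x, ch):
    # 1x1 -> upsample -> 3x3 -> 3x3 -> 1x1, added to an upsampled 1x1 skip path
    skip = layers.Conv2D(ch, 1, padding='same')(layers.UpSampling2D()(x))
    h = layers.Conv2D(ch, 1, padding='same', activation='relu')(x)
    h = layers.UpSampling2D()(h)
    h = layers.Conv2D(ch, 3, padding='same', activation='relu')(h)
    h = layers.Conv2D(ch, 3, padding='same', activation='relu')(h)
    h = layers.Conv2D(ch, 1, padding='same')(h)
    return layers.Add()([h, skip])

inp = layers.Input((64, 64, 128))
print(stylegan_like_block(inp, 64).shape)   # both blocks map (64,64,128) -> (128,128,64),
print(biggan_like_block(inp, 64).shape)     # but the BigGAN-style block is deeper per upscale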
Applications
Because of its speed and stability, when the source code was released on 2019-02-04 (a date that will long be noted in the ANNals of GANime), the Nvidia models & sample dumps were quickly perused & new StyleGANs trained on a wide variety of image types, yielding, in addition to the original faces/
- “This Person Does Not Exist” (random samples from the 1024px FFHQ face StyleGAN)
- quizzes: “Which Face is Real”/“Real or Fake?”
- voting: “Judge Fake People”
- Instagram portraits
- cats: “These Cats Do Not Exist”, “This Cat Does Not Exist” (cat failure modes; interpolation/style-transfer)/corgis
- hotel rooms (with char-RNN-generated text descriptions): “This Rental Does Not Exist”
- kitchen/dining room/living room/bedroom (using transfer learning)
- satellite imagery; Gothic cathedrals; Frank Gehry buildings; cityscapes; floor plans
- “This Waifu Does Not Exist” (background/implementation)
- a large upgrade over TWDNE: random generation, exploration, image attribute editing, saving to a gallery, and crossbreeding portraits
- an interactive waifu generator (improved results, inspired by TWDNE & using Danbooru2018 as a dataset)
- fonts: 1/2; Unicode characters; kanji; Large Logo Dataset; Cedric Oeldorf’s Conditional StyleGAN
- eyes (ProGAN)

Why Don’t GANs Work?
Why does StyleGAN work so well on anime images while other GANs worked not at all or slowly at best?
The lesson I took from “Are GANs Created Equal? A Large-Scale Study”, Lucic et al 2017, is that CelebA/
Interestingly, I consistently observe in training all GANs on anime that clear lines & sharpness & cel-like smooth gradients appear only toward the end of training, after typically initially blurry textures have coalesced. This suggests an inherent bias of CNNs: color images work because they provide some degree of textures to start with, but lineart/
This raises a question of whether the StyleGAN architecture is necessary and whether many GANs might work, if only one had good style transfer for anime images and could, to defeat the texture bias, generate many versions of each anime image which kept the shape while changing the color palette? (Current style transfer methods like the AdaIN PyTorch implementation used by Geirhos et al 2018, do not work well on anime images, ironically enough, because they are trained on photographic images, typically using the old VGG model.)
FAQ
“…Its social accountability seems sort of like that of designers of military weapons: unculpable right up until they get a little too good at their job.”
David Foster Wallace, “E unibus pluram: Television and U.S. Fiction”
To address some common questions people have after seeing generated samples:
Overfitting: “Aren’t StyleGAN (or BigGAN) just overfitting & memorizing data?”
Amusingly, this is not a question anyone really bothered to ask of earlier GAN architectures, which is a sign of progress. Overfitting is a better problem to have than underfitting, because overfitting means you can use a smaller model or more data or more aggressive regularization techniques, while underfitting means your approach just isn’t working.
In any case, while there is currently no way to conclusively prove that cutting-edge GANs are not 100% memorizing (because they should be memorizing to a considerable extent in order to learn image generation, and evaluating generative models is hard in general, and for GANs in particular, because they don’t provide standard metrics like likelihoods which could be used on held-out samples), there are several reasons to think that they are not just memorizing:15
Sample/Dataset Overlap: a standard check for overfitting is to compare generated images to their closest matches using nearest-neighbors (where distance is defined by features like a CNN embedding) lookup; examples of this are StackGAN’s Figure 6 & BigGAN’s Figures 10–14, where the photorealistic samples are nevertheless completely different from the most similar ImageNet datapoints. This has not been done for StyleGAN yet, but I wouldn’t expect different results, as GANs typically pass this check. (It’s worth noting that Clearview AI’s facial recognition reportedly does not return Flickr matches for random FFHQ StyleGAN faces, suggesting the generated faces genuinely look like new faces rather than any of the original Flickr faces.) One intriguing observation about GANs made by the BigGAN paper is that the criticisms of Generators memorizing datapoints may be precisely the opposite of reality: GANs may work primarily by the Discriminator (adaptively) overfitting to datapoints, thereby repelling the Generator away from real datapoints and forcing it to learn nearby possible images which collectively span the image distribution. (With enough data, this creates generalization because “neural nets are lazy” and only learn to generalize when easier strategies fail.)
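For concreteness, here is a minimal sketch of that nearest-neighbors check, using a generic ImageNet CNN embedding (tf.keras InceptionV3) as the distance metric and placeholder directory names; a Danbooru tagger embedding would be more appropriate for anime faces, but the procedure is the same:

import glob
import numpy as np
import tensorflow as tf

model = tf.keras.applications.InceptionV3(include_top=False, pooling='avg')

def embed(paths):
    imgs = []
    for p in paths:
        img = tf.keras.preprocessing.image.load_img(p, target_size=(299, 299))
        imgs.append(tf.keras.preprocessing.image.img_to_array(img))
    x = tf.keras.applications.inception_v3.preprocess_input(np.stack(imgs))
    return model.predict(x, batch_size=16)

fakes = sorted(glob.glob('samples/*.jpg'))   # generated samples (placeholder paths)
reals = sorted(glob.glob('faces/*.jpg'))     # training data; subsample if it is huge
E_fake, E_real = embed(fakes), embed(reals)

# cosine similarity: for each generated sample, find its closest training image
E_fake /= np.linalg.norm(E_fake, axis=1, keepdims=True)
E_real /= np.linalg.norm(E_real, axis=1, keepdims=True)
nearest = np.argmax(E_fake @ E_real.T, axis=1)
for fake, idx in zip(fakes, nearest):
    print(fake, '->', reals[idx])            # eyeball these pairs for verbatim copies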
Semantic Understanding: GANs appear to learn meaningful concepts like individual objects, as demonstrated by “latent space addition” or research tools like GANdissection/Suzuki et al 2018; image edits like object deletions/additions (Bau et al 2020) or segmenting objects like dogs from their backgrounds (Voynov & Babenko 2020/Voynov et al 2020) are difficult to explain without some genuine understanding of images.
In the case of StyleGAN anime faces, there are encoders and controllable face generation now which demonstrate that the latent variables do map onto meaningful factors of variation & the model must have genuinely learned about creating images rather than merely memorizing real images or image patches. Similarly, when we use the “truncation trick”/ψ to sample from relatively extreme unlikely images and we look at the distortions, they show how generated images break down in semantically-relevant ways, which would not be the case if it was just plagiarism. (A particularly extreme example of the power of the learned StyleGAN primitives is Abdal et al 2019’s demonstration that Karras et al’s FFHQ faces StyleGAN can be used to generate fairly realistic images of cats/dogs/cars.)
Latent Space Smoothness: in general, interpolation in the latent space (z) shows smooth changes of images and logical transformations or variations of face features; if StyleGAN were merely memorizing individual datapoints, the interpolation would be expected to be low quality, yield many terrible faces, and exhibit ‘jumps’ in between points corresponding to real, memorized datapoints. The StyleGAN anime face models do not exhibit this. (In contrast, the Holo ProGAN, which overfit badly, does show severe problems in its latent space interpolation videos.)
Which is not to say that GANs do not have issues: “mode dropping” seems to still be an issue for BigGAN despite the expensive large-minibatch training, which is overfitting to some degree, and StyleGAN presumably suffers from it too.
Transfer Learning: GANs have been used for semi-supervised learning (eg generating plausible ‘labeled’ samples to train a classifier on), imitation learning like GAIL, and retraining on further datasets; if the G is merely memorizing, it is difficult to explain how any of this would work.
Compute Requirements: “Doesn’t StyleGAN take too long to train?”
StyleGAN is remarkably fast-training for a GAN. With the anime faces, I got better results after 1–3 days of StyleGAN training than I’d gotten with >3 weeks of ProGAN training. The training times quoted by the StyleGAN repo may sound scary, but they are, in practice, a steep overestimate of what you actually need, for several reasons:
- Lower Resolution: the largest figures are for 1024px images, but you may not need them to be that large or even have a big dataset of 1024px images. For anime faces, 1024px-sized faces are relatively rare, and training at 512px & upscaling 2× to 1024px with waifu2x16 works fine & is much faster. Since upscaling is relatively simple & easy, another strategy is to change the progressive-growing schedule: instead of proceeding to the final resolution as fast as possible, adjust the schedule to stop at a more feasible resolution, spend the bulk of training time there, and then do just enough training at the final resolution to learn to upscale (eg spend 10% of training growing to 512px, then 80% of training time at 512px, then 10% at 1024px).
- Diminishing Returns: the largest gains in image quality are seen in the first few days or weeks of training, with the remaining training being less useful as it focuses on improving small details (so just a few days may be more than adequate for your purposes, especially if you’re willing to select a little more aggressively from samples).
- Transfer Learning from a related model can save days or weeks of training, as there is no need to train from scratch; with the anime face StyleGAN, one can train a character-specific StyleGAN in a few hours or days at most, and certainly does not need to spend multiple weeks training from scratch (assuming that wouldn’t just cause overfitting). Similarly, if one wants to train on some 1024px face dataset, why start from scratch, taking ~1000 GPU-hours, when you can start from Nvidia’s FFHQ face model, which is already fully trained and can converge in a fraction of the from-scratch time? For 1024px, you could also use a super-resolution GAN to upscale, or change the progressive-growing budget to spend most of your time at 512px and then try 1024px at the tail end.
- One-Time Costs: the upfront cost of a few hundred dollars of GPU-time (at inflated AWS prices) may seem steep, but should be kept in perspective. As with almost all NNs, training 1 StyleGAN model can be literally tens of millions of times more expensive than simply running the Generator to produce 1 image; but it also need be paid only once by only one person, and the total price need not even be paid by the same person, given transfer learning, but can be amortized across various datasets. Indeed, given how fast running the Generator is, the trained model doesn’t even need to be run on a GPU. (The rule of thumb is that a GPU is 20–30× faster than the same thing on CPU, with rare instances when overhead dominates of the CPU being as fast or faster, so since generating 1 image takes on the order of ~0.1s on GPU, a CPU can do it in ~3s, which is adequate for many purposes; see the CPU sampling sketch after this list.)
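As an illustration of how cheap sampling is, here is a minimal CPU-only sampling sketch, adapted from the StyleGAN repo’s pretrained_example.py and meant to be run from inside the repo; network-snapshot.pkl is a placeholder for whatever trained model pickle you have:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''   # hide the GPUs so TensorFlow falls back to the CPU
import pickle
import numpy as np
import PIL.Image
import dnnlib.tflib as tflib              # from the StyleGAN repo

tflib.init_tf()
with open('network-snapshot.pkl', 'rb') as f:   # placeholder: path to a trained model pickle
    _G, _D, Gs = pickle.load(f)

latents = np.random.RandomState(0).randn(1, Gs.input_shape[1])
fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
images = Gs.run(latents, None, truncation_psi=0.7, randomize_noise=True, output_transform=fmt)
PIL.Image.fromarray(images[0], 'RGB').save('example.png')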
Copyright Infringement: “Who owns StyleGAN images?”
The Nvidia Source Code & Released Models for StyleGAN 1 are under a CC-BY-NC license, and you cannot edit them or produce “derivative works” such as retraining their FFHQ, cat, or car StyleGAN models. (StyleGAN 2 is under a new “Nvidia Source Code License-NC”, which appears to be effectively the same as the CC-BY-NC with the addition of a patent retaliation clause.)
If a model is trained from scratch, then that does not apply as the source code is simply another tool used to create the model and nothing about the CC-BY-NC license forces you to donate the copyright to Nvidia. (It would be odd if such a thing did happen—if your word processor claimed to transfer the copyrights of everything written in it to Microsoft!)
For those concerned by the CC-BY-NC license, a 512px FFHQ config-f StyleGAN 2 has been trained & released into the public domain by Aydao, and is available for download from Mega and my rsync mirror:
rsync --verbose rsync://78.46.86.149:873/biggan/2020-06-07-aydao-stylegan2-configf-ffhq-512-avg-tpurun1.pkl.xz ./
Models are generally considered “transformative works”, and the copyright owners of whatever data the model was trained on have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to artists or search engines; see the further reading.) The model is copyrighted to whoever created it. Hence, Nvidia has copyright on the models it created, but I have copyright on the models I trained (which I release under CC-0).
Samples are trickier. The widely-stated legal interpretation is that, under standard copyright law, only human authors can earn a copyright, and that machines, animals, inanimate objects or, most famously, monkeys cannot. The US Copyright Office states clearly that regardless of whether we regard a GAN as a machine or as something more intelligent like an animal, either way, it doesn’t count:
A work of authorship must possess “some minimal degree of creativity” to sustain a copyright claim. Feist, 499 U.S. at 358, 362 (citation omitted). “[T]he requisite level of creativity is extremely low.” Even a “slight amount” of creative expression will suffice. “The vast majority of works make the grade quite easily, as they possess some creative spark, ‘no matter how crude, humble or obvious it might be.’” Id. at 346 (citation omitted).
… To qualify as a work of “authorship” a work must be created by a human being. See Burrow-Giles Lithographic Co., 111 U.S. at 58. Works that do not satisfy this requirement are not copyrightable. The Office will not register works produced by nature, animals, or plants.
Examples:
- A photograph taken by a monkey.
- A mural painted by an elephant.
…the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.
A dump of random samples such as the Nvidia samples or TWDNE therefore has no copyright & by definition is in the public domain.
A new copyright can be created, however, if a human author is sufficiently ‘in the loop’, so to speak, as to exert a de minimis amount of creative effort, even if that ‘creative effort’ is simply selecting a single image out of a dump of thousands of them or twiddling knobs (eg on Make Girls.Moe). Crypko, for example, takes this position.
Further reading on computer-generated art copyrights:
- “Copyrights In Computer-Generated Works: Whom, If Anyone, Do We Reward?”, Glasser 2001
- “Ex Machina: Copyright Protection For Computer-Generated Works”, Denicola 2016
- “Computer Generated Works and Copyright: Selfies, Traps, Robots, AI and Machine Learning”, Lambert 2017
- “Who holds the Copyright in AI Created Art?”, Steve Schlackman (2018)
- “The Machine as Author”, Gervais 2019
- “Why Is AI Art Copyright So Complicated?”, Jason Bailey (2019)
- “We’ve been warned about AI and music for over 50 years, but no one’s prepared: ‘This road is literally being paved as we’re walking on it’” (The Verge 2019)
Copyright
Per the copyright point above, all my generated videos and samples and models are released under the CC-0 (public domain equivalent) license. Source code listed may be derivative works of Nvidia’s CC-BY-NC-licensed StyleGAN code, and may be CC-BY-NC.
Training requirements
Data
“The road of excess leads to the palace of wisdom
…If the fool would persist in his folly he would become wise
…You never know what is enough unless you know what is more than enough. …If others had not been foolish, we should be so.”
—William Blake, “Proverbs of Hell”, The Marriage of Heaven and Hell
The necessary size for a dataset depends on the complexity of the domain and whether transfer learning is being used. StyleGAN’s default settings yield a 1024px Generator with 26.2M parameters, which is a large model and can soak up potentially millions of images, so there is no such thing as too much.
For learning decent-quality anime faces from scratch, a minimum of 5000 appears to be necessary in practice; for learning a specific character when using the anime face StyleGAN, potentially as little as ~500 (especially with data augmentation) can give good results. For domains as complicated as “any cat photo” like Karras et al 2018’s cat StyleGAN which is trained on the LSUN CATS category of ~1.8M17 cat photos, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that “CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds.”18
Compute
To fit reasonable minibatch sizes, one will want GPUs with >11GB VRAM. At 512px, that will only train n = 4, and going below that means it’ll be even slower (and you may have to reduce learning rates to avoid unstable training). So, Nvidia 1080ti & up would be good. (Reportedly, AMD GPUs work with tensorflow-rocm 1.13.2 and rocm 2.3.14.)
The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & provide a worst-case AWS-based cost estimate):
Estimated StyleGAN wallclock training times for various resolutions & GPU-clusters (source: StyleGAN repo)

| GPUs | 1024² | 512² | 256² | [March 2019 AWS costs19] |
|------|-------|------|------|--------------------------|
| 1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115] |
| 2 | 21 days 22 hours [1,052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA] |
| 4 | 11 days 8 hours [1,088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA] |
| 8 | 6 days 14 hours [1,264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382] |
AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound (compare Vast.ai); 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary—with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/

Data Preparation
The most difficult part of running StyleGAN is preparing the dataset properly. StyleGAN does not, unlike most GAN implementations (particularly PyTorch ones), support reading a directory of files as input; it can only read its unique .tfrecord format, which stores each image as raw arrays at every relevant resolution.20 Thus, input files must be perfectly uniform, (slowly) converted to the .tfrecord format by the special dataset_tool.py tool, and will take up ~19× more disk space.21
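The ~19× figure is easy to sanity-check with some back-of-the-envelope arithmetic (assuming an average face crop compresses to roughly 55KB as JPG):

# dataset_tool.py stores raw uint8 arrays at every power-of-2 resolution from 512px down,
# so each 512px RGB image costs roughly 4/3 x 512*512*3 bytes uncompressed.
raw_bytes = sum(3 * (512 // 2**i)**2 for i in range(8))   # 512px, 256px, ..., 4px copies
jpg_bytes = 55 * 1024                                     # assumed average compressed JPG size
print(raw_bytes / 1e6)          # ~1.05 MB per image inside the .tfrecords
print(raw_bytes / jpg_bytes)    # ~19x larger than the JPG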
A StyleGAN dataset must consist of images all formatted exactly the same way.
Images must be precisely 512×512px or 1024×1024px etc (any eg 512×513px images will kill the entire run), they must all be the same colorspace (you cannot have sRGB and Grayscale JPGs—and I doubt other color spaces work at all), they must not be transparent, the filetype must be the same as the model you intend to (re)train (ie you cannot retrain a PNG-trained model on a JPG dataset, StyleGAN will crash every time with inscrutable convolution/
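A small pre-flight check can catch most of these problems before dataset_tool.py chokes on them. The following is just a sketch (the identify-based shell filter later on this page accomplishes the same thing), with faces/ as the assumed dataset directory:

import glob
import os
import PIL.Image

for path in glob.glob('faces/**/*', recursive=True):
    if not os.path.isfile(path):
        continue
    try:
        with PIL.Image.open(path) as img:
            if img.size != (512, 512):
                print(path, 'bad size:', img.size)       # must be exactly 512x512
            elif img.mode != 'RGB':
                print(path, 'bad mode:', img.mode)        # catches grayscale ('L') & alpha ('RGBA')
            elif img.format != 'JPEG':
                print(path, 'bad format:', img.format)    # mixed PNG/JPG datasets will crash
    except Exception as e:
        print(path, 'unreadable:', e)                     # empty or corrupt files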
Faces preparation
My workflow:
- Download raw images from Danbooru2018 if necessary
- Extract from the JSON Danbooru2018 metadata all the IDs of a subset of images if a specific Danbooru tag (such as a single character) is desired, using jq and shell scripting
- Crop square anime faces from raw images using Nagadomi’s lbpcascade_animeface (regular face-detection methods do not work on anime images)
- Delete empty files, monochrome or grayscale files, & exact-duplicate files
- Convert to JPG
- Upscale below-target-resolution (512px) images with waifu2x
- Convert all images to exactly 512×512 resolution sRGB JPG images
- If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN’s Discriminator
- Convert to StyleGAN format using dataset_tool.py
The goal is to turn this:

into this:

Below I use shell scripting to prepare the dataset. A possible alternative is danbooru-utility, which aims to help “explore the dataset, filter by tags, rating, and score, detect faces, and resize the images”.
Cropping
The Danbooru2018 download can be done via BitTorrent or rsync, which provides a JSON metadata tarball which unpacks into metadata/2* & a folder structure of {original,512px}/{0-999}/$ID.{png,jpg,...}.
For training on SFW whole images, the 512px/ version of Danbooru2018 would work, but it is not a great idea for faces because by scaling images down to 512px, a lot of face detail has been lost, and getting high-quality faces is a challenge. The SFW IDs can be extracted from the filenames in 512px/ directly or from the metadata by extracting the id & rating fields (and saving to a file):
find ./512px/ -type f | sed -e 's/.*\/\([[:digit:]]*\)\.jpg/\1/'
# 967769
# 1853769
# 2729769
# 704769
# 1799769
# ...
tar xf metadata.json.tar.xz
cat metadata/20180000000000* | jq '[.id, .rating]' -c | fgrep '"s"' | cut -d '"' -f 2 # "
# ...
After installing and testing Nagadomi’s lbpcascade_animeface to make sure it & OpenCV work, one can use a simple script which crops the face(s) from a single input image. The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors—non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250×250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative, more-difficult-to-run library by Nagadomi which offers a face-cropping script, animeface-2009’s face_collector.rb, which Nagadomi says is better at cropping faces, but I was not impressed when I tried it out.) crop.py:
import cv2
import sys
import os.path

def detect(cascade_file, filename, outputname):
    if not os.path.isfile(cascade_file):
        raise RuntimeError("%s: not found" % cascade_file)

    cascade = cv2.CascadeClassifier(cascade_file)
    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    ## NOTE: Suggested modification: increase minSize to '(250,250)' px,
    ## increasing proportion of high-quality faces & reducing
    ## false positives. Faces which are only 50×50px are useless
    ## and often not faces at all.
    ## For my StyleGANs, I use 250 or 300px boxes.
    faces = cascade.detectMultiScale(gray,
                                     # detector options
                                     scaleFactor = 1.1,
                                     minNeighbors = 5,
                                     minSize = (50, 50))
    i = 0
    for (x, y, w, h) in faces:
        cropped = image[y: y + h, x: x + w]
        cv2.imwrite(outputname + str(i) + ".png", cropped)
        i = i + 1

if len(sys.argv) != 4:
    sys.stderr.write("usage: detect.py <animeface.xml file> <input> <output prefix>\n")
    sys.exit(-1)

detect(sys.argv[1], sys.argv[2], sys.argv[3])
The IDs can be combined with the provided lbpcascade_animeface script using xargs; however, this will be far too slow and it would be better to exploit parallelism with xargs --max-args=1 --max-procs=16 or parallel. It’s also worth noting that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no apparent speedup (a slowdown if anything, given limited VRAM), so I find it helps to explicitly disable GPU use by setting CUDA_VISIBLE_DEVICES="". (For this step, it’s quite helpful to have a many-core system like a Threadripper.)
Combining everything, parallel face-cropping of an entire Danbooru2018 subset can be done like this:
cropFaces() {
BUCKET=$(printf "%04d" $(( $@ % 1000 )) )
ID="$@"
CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/crop.py \
~/src/lbpcascade_animeface/lbpcascade_animeface.xml \
./original/$BUCKET/$ID.* "./faces/$ID"
}
export -f cropFaces
mkdir ./faces/
cat sfw-ids.txt | parallel --progress cropFaces
# NOTE: because of the possibility of multiple crops from an image, the script appends a N counter;
# remove that to get back the original ID & filepath: eg
#
## original/0196/933196.jpg → portrait/9331961.jpg
## original/0669/1712669.png → portrait/17126690.jpg
## original/0997/3093997.jpg → portrait/30939970.jpg
Nvidia StyleGAN, by default and like most image-related tools, expects square images like 512×512px, but there is nothing inherent to neural nets or convolutions that requires square inputs or outputs, and rectangular convolutions are possible. In the case of faces, they tend to be more rectangular than square, and we’d prefer to use a rectangular convolution if possible to focus the image on the relevant dimension rather than either pay the severe performance penalty of increasing total dimensions to 1024×1024px or stick with 512×512px & waste image outputs on emitting black bars/
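As a quick illustration of the point that nothing in a convolution cares about squareness, the following tf.keras snippet (purely for demonstration, not part of StyleGAN) applies a 3×3 convolution to a 512×256 rectangular input without any special handling:

import numpy as np
import tensorflow as tf

x = np.zeros((1, 512, 256, 3), dtype=np.float32)          # NHWC: one 512x256 RGB 'portrait' image
conv = tf.keras.layers.Conv2D(16, (3, 3), padding='same')
y = conv(tf.constant(x))
print(y.shape)   # (1, 512, 256, 16): the rectangular spatial dimensions pass through unchanged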
Cleaning & Upscaling
Miscellaneous cleanups can be done:
## Delete failed/empty files
find faces/ -size 0 -type f -delete
## Delete 'too small' files which is indicative of low quality:
find faces/ -size -40k -type f -delete
## Delete exact duplicates:
fdupes --delete --omitfirst --noprompt faces/
## Delete monochrome or minimally-colored images:
### the heuristic of <257 unique colors is imperfect but better than anything else I tried
deleteBW() { if [[ `identify -format "%k" "$@"` -lt 257 ]];
then rm "$@"; fi; }
export -f deleteBW
find faces -type f | parallel --progress deleteBW
I remove black-white or grayscale images from all my GAN experiments because in my earliest experiments, their inclusion appeared to increase instability: mixed datasets were extremely unstable, monochrome datasets failed to learn at all, but color-only runs made some progress. It is likely that StyleGAN is now powerful enough to be able to learn on mixed datasets (and some later experiments by other people suggest that StyleGAN can handle both monochrome & color anime-style faces without a problem), but I have not risked a full month-long run to investigate, and so I continue doing color-only.
Discriminator ranking
A good trick with GANs is, after training to reasonable levels of quality, reusing the Discriminator to rank the real datapoints; images the trained D assigns the lowest probability/
Since rating images is what the D already does, no new algorithms or training methods are necessary, and almost no code is necessary: run the D on the whole dataset to rank each image (faster than it seems since the G & backpropagation are unnecessary, even a large dataset can be ranked in a wallclock hour or two), then one can review manually the bottom & top X%, or perhaps just delete the bottom X% sight unseen if enough data is available.
What is a D doing? I find that the highest ranked images often contain many anomalies or low-quality images which need to be deleted. Why? The BigGAN paper notes a well-trained D which achieves 98% real vs fake classification performance on the ImageNet training dataset falls to 50–55% accuracy when run on the validation dataset, suggesting the D’s role is about memorizing the training data rather than some measure of ‘realism’.
Perhaps because the D ranking is not necessarily a ‘quality’ score but simply a sort of confidence rating that an image is from the real dataset; if the real images contain certain easily-detectable images which the G can’t replicate, then the D might memorize or learn them quickly. For example, in face crops, whole figure crops are common mistaken crops, making up a tiny percentage of images; how could a face-only G learn to generate whole realistic bodies without the intermediate steps being instantly detected & defeated as errors by D, while D is easily able to detect realistic bodies as definitely real? This would explain the polarized rankings. And given the close connections between GANs & DRL, I have to wonder if there is more memorization going on than suspected in things like “Deep reinforcement learning from human preferences”? Incidentally, this may also explain the problem with using Discriminators for semi-supervised representation learning: if the D is memorizing datapoints to force the G to generalize, then its internal representations would be expected to be useless. (One would instead want to extract knowledge from the G, perhaps by encoding an image into z and using the z as the representation.)
An alternative perspective is offered by a crop of 2020 papers (Zhao et al 2020b; Tran et al 2020; Karras et al 2020; Zhao et al 2020c) examining data augmentation for GANs, which find that useful GAN data augmentation requires it to be done during training, and that one must augment all images.23 Zhao et al 2020c & Karras et al 2020 observe that, with regular GAN training, there is a striking steady decline of D performance on heldout data, and increase on training data, throughout the course of training, confirming the BigGAN observation but also showing it is a dynamic phenomenon, and probably a bad one. Adding in correct data augmentation reduces this overfitting—and markedly improves sample-efficiency & final quality. This suggests that the D does indeed memorize, but that this is not a good thing. Karras et al 2020 describe what happens:
Convergence is now achieved [with ADA/data augmentation] regardless of the training set size and overfitting no longer occurs. Without augmentations, the gradients the generator receives from the discriminator become very simplistic over time—the discriminator starts to pay attention to only a handful of features, and the generator is free to create otherwise nonsensical images. With ADA, the gradient field stays much more detailed which prevents such deterioration.
In other words, just as the G can ‘mode collapse’ by focusing on generating images with only a few features, the D can also ‘feature collapse’ by focusing on a few features which happen to correctly split the training data’s reals from fakes, such as by memorizing them outright. This technically works, but not well. This also explains why BigGAN training stabilized when training on JFT-300M: divergence/
If so, this suggests that for D ranking, it may not be too useful to take the D from the end of a run, if not using data augmentation, because that D may be the version with the greatest degree of memorization!
Here is a simple StyleGAN2 script (ranker.py) to open a StyleGAN .pkl and run it on a list of image filenames to print out the D score, courtesy of Shao Xuning:
import pickle
import numpy as np
import cv2
import dnnlib.tflib as tflib
import random
import argparse
import PIL.Image
from training.misc import adjust_dynamic_range

def preprocess(file_path):
    # print(file_path)
    img = np.asarray(PIL.Image.open(file_path))
    # Preprocessing from dataset_tool.create_from_images
    img = img.transpose([2, 0, 1]) # HWC => CHW
    # img = np.expand_dims(img, axis=0)
    img = img.reshape((1, 3, 512, 512))
    # Preprocessing from training_loop.process_reals
    img = adjust_dynamic_range(data=img, drange_in=[0, 255], drange_out=[-1.0, 1.0])
    return img

def main(args):
    random.seed(args.random_seed)
    minibatch_size = args.minibatch_size
    input_shape = (minibatch_size, 3, 512, 512)
    # print(args.images)
    images = args.images
    images.sort()
    tflib.init_tf()
    _G, D, _Gs = pickle.load(open(args.model, "rb"))
    # D.print_layers()
    image_score_all = [(image, []) for image in images]
    # Shuffle the images and process each image in multiple minibatches.
    # Note: networks.stylegan2.minibatch_stddev_layer
    # calculates the standard deviation of a minibatch group as a feature channel,
    # which means that the output of the discriminator actually depends
    # on the companion images in the same minibatch.
    for i_shuffle in range(args.num_shuffles):
        # print('shuffle: {}'.format(i_shuffle))
        random.shuffle(image_score_all)
        for idx_1st_img in range(0, len(image_score_all), minibatch_size):
            idx_img_minibatch = []
            images_minibatch = []
            input_minibatch = np.zeros(input_shape)
            for i in range(minibatch_size):
                idx_img = (idx_1st_img + i) % len(image_score_all)
                idx_img_minibatch.append(idx_img)
                image = image_score_all[idx_img][0]
                images_minibatch.append(image)
                img = preprocess(image)
                input_minibatch[i, :] = img
            output = D.run(input_minibatch, None, resolution=512)
            print('shuffle: {}, indices: {}, images: {}'
                  .format(i_shuffle, idx_img_minibatch, images_minibatch))
            print('Output: {}'.format(output))
            for i in range(minibatch_size):
                idx_img = idx_img_minibatch[i]
                image_score_all[idx_img][1].append(output[i][0])
    with open(args.output, 'a') as fout:
        for image, score_list in image_score_all:
            print('Image: {}, score_list: {}'.format(image, score_list))
            avg_score = sum(score_list)/len(score_list)
            fout.write(image + ' ' + str(avg_score) + '\n')

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, required=True,
                        help='.pkl model')
    parser.add_argument('--images', nargs='+')
    parser.add_argument('--output', type=str, default='rank.txt')
    parser.add_argument('--minibatch_size', type=int, default=4)
    parser.add_argument('--num_shuffles', type=int, default=5)
    parser.add_argument('--random_seed', type=int, default=0)
    return parser.parse_args()

if __name__ == '__main__':
    main(parse_arguments())
Depending on how noisy the rankings are in terms of ‘quality’ and available sample size, one can either review the worst-ranked images by hand, or delete the bottom X%. One should check the top-ranked images as well to make sure the ordering is right; there can also be some odd images in the top X% as well which should be removed.
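Since ranker.py writes one “filename score” pair per line to rank.txt, acting on the ranking takes only a few lines. Here is a sketch (assuming no spaces in filenames) which moves the bottom 5% into a directory for manual review rather than deleting them outright:

import os
import shutil

with open('rank.txt') as f:
    scored = [(line.split()[0], float(line.split()[1])) for line in f if line.strip()]

scored.sort(key=lambda t: t[1])              # ascending: lowest D scores first
cutoff = int(len(scored) * 0.05)             # bottom 5%; adjust to taste
os.makedirs('review/', exist_ok=True)
for path, score in scored[:cutoff]:
    shutil.move(path, 'review/')             # inspect review/ by hand before deleting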
It might be possible to use ranker.py to improve the quality of generated samples as well, as a simple version of discriminator rejection sampling.
Upscaling
The next major step is upscaling images using waifu2x, which does an excellent job on 2× upscaling of anime images (the results are nigh-indistinguishable from a higher-resolution original) and greatly increases the usable corpus. The downside is that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2×1080ti), and is written in a now-unmaintained DL framework, Torch, with no current plans to port to PyTorch, and is gradually becoming harder to get running (one hopes that by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, and waifu2x is worth the effort.
. ~/src/torch/install/bin/torch-activate
upscaleWaifu2x() {
SIZE1=$(identify -format "%h" "$@")
SIZE2=$(identify -format "%w" "$@");
if (( $SIZE1 < 512 && $SIZE2 < 512 )); then
echo "$@" $SIZE
TMP=$(mktemp "/tmp/XXXXXX.png")
CUDA_VISIBLE_DEVICES="$((RANDOM % 2 < 1))" nice th ~/src/waifu2x/waifu2x.lua -model_dir \
~/src/waifu2x/models/upconv_7/art -tta 1 -m scale -scale 2 \
-i "$@" -o "$TMP"
convert "$TMP" "$@"
rm "$TMP"
fi; }
export -f upscaleWaifu2x
find faces/ -type f | parallel --progress --jobs 9 upscaleWaifu2x
Quality Checks & Data Augmentation
The single most effective strategy to improve a GAN is to clean the data. StyleGAN cannot handle too-diverse datasets composed of multiple objects or single objects shifted around, and rare or odd images cannot be learned well. Karras et al get such good results with StyleGAN on faces in part because they constructed FFHQ to be an extremely clean consistent dataset of just centered well-lit clear human faces without any obstructions or other variation. Similarly, Arfa’s “This Fursona Does Not Exist” (TFDNE) S2 generates much better furry portraits than my own “This Waifu Does Not Exist” (TWDNE) S2 anime portraits, due partly to training longer to convergence on a TPU pod but mostly due to his investment in data cleaning: aligning the faces and heavy filtering of samples—this left him with only n = 50k but TFDNE nevertheless outperforms TWDNE’s n = 300k. (Data cleaning/
At this point, one can do manual quality checks by viewing a few hundred images, running findimagedupes -t 99% to look for near-identical faces, or dabble in further modifications such as doing “data augmentation”. Working with Danbooru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN, and one will have difficulty storing the final StyleGAN dataset because of its sheer size (due to the ~18× size multiplier). After cleaning etc., my final face dataset is the portrait dataset with n = 300k.
However, if that is not enough or one is working with a small dataset like for a single character, data augmentation may be necessary. The mirror/
dataAugment () {
image="$@"
target=$(basename "$@")
suffix="png"
convert -deskew 50 "$image" "$target".deskew."$suffix"
convert -resize 110%x100% "$image" "$target".horizstretch."$suffix"
convert -resize 100%x110% "$image" "$target".vertstretch."$suffix"
convert -blue-shift 1.1 "$image" "$target".midnight."$suffix"
convert -fill red -colorize 5% "$image" "$target".red."$suffix"
convert -fill orange -colorize 5% "$image" "$target".orange."$suffix"
convert -fill yellow -colorize 5% "$image" "$target".yellow."$suffix"
convert -fill green -colorize 5% "$image" "$target".green."$suffix"
convert -fill blue -colorize 5% "$image" "$target".blue."$suffix"
convert -fill purple -colorize 5% "$image" "$target".purple."$suffix"
convert -adaptive-blur 3x2 "$image" "$target".blur."$suffix"
convert -adaptive-sharpen 4x2 "$image" "$target".sharpen."$suffix"
convert -brightness-contrast 10 "$image" "$target".brighter."$suffix"
convert -brightness-contrast 10x10 "$image" "$target".brightercontraster."$suffix"
convert -brightness-contrast -10 "$image" "$target".darker."$suffix"
convert -brightness-contrast -10x10 "$image" "$target".darkerlesscontrast."$suffix"
convert +level 5% "$image" "$target".contraster."$suffix"
convert -level 5%\! "$image" "$target".lesscontrast."$suffix"
}
export -f dataAugment
find faces/ -type f | parallel --progress dataAugment
Upscaling & Conversion
Once any quality fixes or data augmentation are done, it’d be a good idea to save a lot of disk space by converting to JPG & lossily reducing quality (I find 33% saves a ton of space at no visible change):
convertPNGToJPG() { convert -quality 33 "$@" "$@".jpg && rm "$@"; }
export -f convertPNGToJPG
find faces/ -type f -name "*.png" | parallel --progress convertPNGToJPG
Remember that StyleGAN models are only compatible with images of the type they were trained on, so if you are using a StyleGAN pretrained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN models), you will need to keep using PNGs.
Doing the final scaling to exactly 512px can be done at many points, but I generally postpone it to the end in order to work with images in their ‘native’ resolutions & aspect-ratios for as long as possible. At this point we carefully tell ImageMagick to rescale everything to 512×51226, preserving the aspect ratio and filling in with a black background as necessary on either side:
find faces/ -type f | xargs --max-procs=16 -n 9000 \
mogrify -resize 512x512\> -extent 512x512\> -gravity center -background black
Any slightly-different image could crash the import process. Therefore, we delete any image which is even slightly different from the 512×512 sRGB JPG they are supposed to be:
find faces/ -type f | xargs --max-procs=16 -n 9000 identify | \
# remember the warning: images must be identical, square, and sRGB/grayscale:
fgrep -v " JPEG 512x512 512x512+0+0 8-bit sRGB"| cut -d ' ' -f 1 | \
xargs --max-procs=16 -n 10000 rm
Having done all this, we should have a large consistent high-quality dataset.
Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format using dataset_tool.py. It is worth remembering at this point how fragile that tool is and what the requirements are; ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem.
Because of the extreme fragility of dataset_tool.py, I strongly advise that you edit it to print out the filenames of each file as they are being processed so that when (not if) it crashes, you can investigate the culprit and check the rest. The edit could be as simple as this:
diff --git a/dataset_tool.py b/dataset_tool.py
index 4ddfe44..e64e40b 100755
--- a/dataset_tool.py
+++ b/dataset_tool.py
@@ -519,6 +519,7 @@ def create_from_images(tfrecord_dir, image_dir, shuffle):
with TFRecordExporter(tfrecord_dir, len(image_filenames)) as tfr:
order = tfr.choose_shuffled_order() if shuffle else np.arange(len(image_filenames))
for idx in range(order.size):
+ print(image_filenames[order[idx]])
img = np.asarray(PIL.Image.open(image_filenames[order[idx]]))
if channels == 1:
img = img[np.newaxis, :, :] # HW => CHW
There should be no issues if all the images were thoroughly checked earlier, but should any images crash it, they can be checked in more detail by identify. (I advise just deleting them and not trying to rescue them.)
Then the conversion is just (assuming StyleGAN prerequisites are installed, see next section):
source activate MY_TENSORFLOW_ENVIRONMENT
python dataset_tool.py create_from_images datasets/faces /media/gwern/Data/danbooru2018/faces/
Congratulations, the hardest part is over. Most of the rest simply requires patience (and a willingness to edit Python files directly in order to configure StyleGAN).
Training
Installation
I assume you have CUDA installed & functioning. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have successfully used the Nvidia driver version #410.104, CUDA 10.1, and TensorFlow 1.13.1.)
A Python ≥3.627 virtual environment can be set up to keep StyleGAN’s dependencies tidy, with TensorFlow & the StyleGAN prerequisites installed into it:
conda create -n stylegan pip python=3.6
source activate stylegan
## TF:
pip install tensorflow-gpu
## Test install:
python -c "import tensorflow as tf; tf.enable_eager_execution(); \
print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
pip install tensorboard
## StyleGAN:
## Install pre-requisites:
pip install pillow numpy moviepy scipy opencv-python lmdb # requests?
## Download:
git clone 'https://github.com/NVlabs/stylegan.git' && cd ./stylegan/
## Test install:
python pretrained_example.py
## ./results/example.png should be a photograph of a middle-aged man
StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks, using this Colab notebook. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little or who are doing something quick like extremely low-resolution training or transfer-learning where a few GPU-hours on a slow small GPU might be enough.
Configuration
StyleGAN doesn’t ship with any support for CLI options; instead, one must edit train.py and train/training_loop.py:
train/training_loop.py: the core configuration is done in the defaults to the training_loop function beginning at line 112. The key arguments are G_smoothing_kimg & D_repeats (affects the learning dynamics), network_snapshot_ticks (how often to save the pickle snapshots—more frequent means less progress lost in crashes, but as each one weighs 300MB+, can quickly use up gigabytes of space), resume_run_id (set to "latest"), and resume_kimg.
Don’t Erase Your Model: resume_kimg governs where in the overall progressive-growing training schedule StyleGAN starts from. If it is set to 0, training begins at the beginning of the progressive-growing schedule, at the lowest resolution, regardless of how much training has been previously done. It is vitally important when doing transfer learning that it is set to a sufficiently high number (eg 10000) that training begins at the highest desired resolution like 512px, as it appears that layers are erased when added during progressive growing. (resume_kimg may also need to be set to a high value to make it skip straight to training at the highest resolution if you are training on small datasets of small images, where there’s risk of it overfitting under the normal training schedule and never reaching the highest resolution.) This trick is unnecessary in StyleGAN 2, which is simpler in not using progressive growing.
More experimentally, I suggest setting minibatch_repeats = 1 instead of minibatch_repeats = 5; in line with the suspiciousness of the gradient-accumulation implementation in ProGAN/StyleGAN, this appears to make training both stabler & faster.
Note that some of these variables, like learning rates, are overridden in train.py. It’s better to set those there or else you may confuse yourself badly (like I did in wondering why ProGAN & StyleGAN seemed extraordinarily robust to large changes in the learning rates…).
train.py (previously config.py in ProGAN; renamed run_training.py in StyleGAN 2): here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. (This file includes settings intended for ProGAN—watch out that you don’t accidentally turn on ProGAN instead of StyleGAN & confuse yourself.) Learning rate & minibatch should generally be left alone (except towards the end of training, when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, like thus:
desc += '-faces'; dataset = EasyDict(tfrecord_dir='faces', resolution=512); train.mirror_augment = True
This sets up the 512px face dataset which was previously created in datasets/faces, turns on mirroring (because while there may be writing in the background, we don’t care about it for face generation), and sets a title for the checkpoints/logs, which will now appear in results/ with the ‘-faces’ string.
Assuming you do not have 8 GPUs (as you probably do not), you must change the -preset to match your number of GPUs; StyleGAN will not automatically choose the correct number of GPUs. If you fail to set it correctly to the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with the opaque error message below (note that CUDA uses zero-indexing, so GPU:0 refers to the first GPU, GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my—nonexistent—third GPU):
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
    G_synthesis_3/lod: {{node G_synthesis_3/lod}} was explicitly assigned to /device:GPU:2 but available \
    devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
    /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
    /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
    Make sure the device specification refers to a valid device. [[{{node G_synthesis_3/lod}}]]
For my 2×1080ti I’d set:
desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8; sched.minibatch_dict = \
    {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}; sched.G_lrate_dict = {512: 0.0015, 1024: 0.002}; \
    sched.D_lrate_dict = EasyDict(sched.G_lrate_dict); train.total_kimg = 99000
So my results get saved to results/00001-sgan-faces-2gpu etc (the run ID increments, ‘sgan’ because StyleGAN rather than ProGAN, ‘-faces’ as the dataset being trained on, and ‘2gpu’ because it’s multi-GPU).
Running
I typically run StyleGAN in a screen session which can be detached and keeps multiple shells organized: 1 terminal/
With Emacs, I keep the two key Python files open (train.py and train/training_loop.py) for reference & easy editing.
With the “latest” patch, StyleGAN can be thrown into a while-loop to keep running after crashes, like:
while true; do nice py train.py ; date; (xmessage "alert: StyleGAN crashed" &); sleep 10s; done
TensorBoard is a logging utility which displays little time-series of recorded variables which one views in a web browser, eg:
tensorboard --logdir results/02022-sgan-faces-2gpu/
# TensorBoard 1.13.0 at http://127.0.0.1:6006 (Press CTRL+C to quit)
Note that TensorBoard can be backgrounded, but needs to be updated every time a new run is started as the results will then be in a different folder.
Training StyleGAN is much easier & more reliable than other GANs, but it is still more of an art than a science. (We put up with it because while GANs suck, everything else sucks more.) Notes on training:
Crashproofing:
The initial release of StyleGAN was prone to crashing when I ran it, segfaulting at random. Updating TensorFlow appeared to reduce this but the root cause is still unknown. Segfaulting or crashing is also reportedly common if running on mixed GPUs (eg a 1080ti + Titan V).
Unfortunately, StyleGAN has no setting for simply resuming from the latest snapshot after crashing/exiting (which is what one usually wants), and one must manually edit the resume_run_id line in training_loop.py to set it to the latest run ID. This is tedious and error-prone—at one point I realized I had wasted 6 GPU-days of training by restarting from a 3-day-old snapshot because I had not updated the resume_run_id after a segfault!
If you are doing any runs longer than a few wallclock hours, I strongly advise use of nshepperd’s patch to automatically restart from the latest snapshot by setting resume_run_id = "latest":
:diff --git a/training/misc.py b/training/misc.py index 50ae51c..d906a2d 100755 --- a/training/misc.py +++ b/training/misc.py @@ -119,6 +119,14 @@ def list_network_pkls(run_id_or_run_dir, include_final=True): del pkls[0] return pkls +def locate_latest_pkl(): + allpickles = sorted(glob.glob(os.path.join(config.result_dir, '0*', 'network-*.pkl'))) + latest_pickle = allpickles[-1] + resume_run_id = os.path.basename(os.path.dirname(latest_pickle)) + RE_KIMG = re.compile('network-snapshot-(\d+).pkl') + kimg = int(RE_KIMG.match(os.path.basename(latest_pickle)).group(1)) + return (locate_network_pkl(resume_run_id), float(kimg)) + def locate_network_pkl(run_id_or_run_dir_or_network_pkl, snapshot_or_network_pkl=None): for candidate in [snapshot_or_network_pkl, run_id_or_run_dir_or_network_pkl]: if isinstance(candidate, str): diff --git a/training/training_loop.py b/training/training_loop.py index 78d6fe1..20966d9 100755 --- a/training/training_loop.py +++ b/training/training_loop.py @@ -148,7 +148,10 @@ def training_loop( # Construct networks. with tf.device('/gpu:0'): if resume_run_id is not None: - network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot) + if resume_run_id == 'latest': + network_pkl, resume_kimg = misc.locate_latest_pkl() + else: + network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot) print('Loading networks from "%s"...' % network_pkl) G, D, Gs = misc.load_pkl(network_pkl) else:
(The diff can be edited by hand, or copied into the repo as a file like latest.patch & then applied with git apply latest.patch.)

Tuning Learning Rates:
The LR is one of the most critical hyperparameters: too-large updates based on too-small minibatches are devastating to GAN stability & final quality. The LR also seems to interact with the intrinsic difficulty or diversity of an image domain; Karras et al 2019 use 0.003 G/D LRs on their FFHQ dataset (which has been carefully curated and the faces aligned to put landmarks like eyes/mouth in the same locations in every image) when training on 8-GPU machines with minibatches of n = 32, but I find lower to be better on my anime face/portrait datasets where I can only do n = 8. From looking at training videos of whole-Danbooru2018 StyleGAN runs, I suspect that the necessary LRs would be lower still. Learning rates are closely related to minibatch size (a common rule of thumb in supervised learning of CNNs is that the biggest usable LR follows a square-root curve in minibatch size), and the BigGAN research argues that minibatch size itself strongly influences how bad mode dropping is, which suggests that smaller LRs may be more necessary the more diverse/difficult a dataset is.

Balancing G/D:

Figure: TensorBoard G/D losses for an anime face StyleGAN making progress towards convergence.

Later in training, if the G is not making good progress towards the ultimate goal of a 0.5 loss (and the D's loss gradually decreasing towards 0.5), and has a loss stubbornly stuck around −1 or something, it may be necessary to change the balance of G/D. This can be done several ways, but the easiest is to adjust the LRs in train.py, sched.G_lrate_dict & sched.D_lrate_dict.

One needs to keep an eye on the G/D losses and also the perceptual quality of the faces (since we don't have any good FID equivalent yet for anime faces, which would require a good open-source Danbooru tagger to create embeddings), and reduce both LRs (or usually just the D's LR) based on the face quality and whether the G/D losses are exploding or otherwise look imbalanced. What you want, I think, is for the G/D losses to be stable at a certain absolute amount for a long time while the quality visibly improves, reducing D's LR as necessary to keep it balanced with G; and then once you've run out of time/patience or artifacts are showing up, you can decrease both LRs to converge onto a local optimum. I find the default of 0.003 can be too high once quality reaches a high level with both faces & portraits, and it helps to reduce it by a third to 0.001 or a tenth to 0.0003. If there still isn't convergence, the D may be too strong and it can be turned down separately, to a tenth or even a fiftieth. (Given the stochasticity of training & the relativity of the losses, one should wait several wallclock hours or days after each modification to see if it made a difference.)
Skipping FID metrics:
Some metrics are computed for logging/reporting. The FID metrics are calculated using an old ImageNet CNN; what is realistic on ImageNet may have little to do with your particular domain, and while a large FID like 100 is concerning, FIDs like 20 (or even increasing FIDs) are not necessarily a problem or useful guidance compared to just looking at the generated samples or the loss curves. Given that computing FID metrics is not free & potentially irrelevant or misleading on many image domains, I suggest disabling them entirely. (They are not used in the training for anything, and disabling them is safe.) They can be edited out of the main training loop by commenting out the call to metrics.run like so:

@@ -261,7 +265,7 @@ def training_loop()
        if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
            pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
            misc.save_pkl((G, D, Gs), pkl)
-           metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)
+           # metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)
‘Blob’ & ‘Crack’ Artifacts:
During training, ‘blobs’ often show up or move around. These blobs appear even late in training on otherwise high-quality images and are unique to StyleGAN (at least, I’ve never seen another GAN whose training artifacts look like the blobs). That they are so large & glaring suggests a weakness in StyleGAN somewhere. The source of the blobs was unclear. If you watch training videos, these blobs seem to gradually morph into new features such as eyes or hair or glasses. I suspect they are part of how StyleGAN ‘creates’ new features, starting with a feature-less blob superimposed at approximately the right location, and gradually refined into something useful. The StyleGAN 2 paper investigated the blob artifacts & found it to be due to the Generator working around a flaw in StyleGAN’s use of AdaIN normalization. Karras et al 2019 note that images without a blob somewhere are severely corrupted; because the blobs are in fact doing something useful, it is unsurprising that the Discriminator doesn’t fix the Generator. StyleGAN 2 changes the AdaIN normalization to eliminate this problem, improving overall quality.28
If blobs are appearing too often or one wants a final model without any new intrusive blobs, it may help to lower the LR to try to converge to a local optimum where the necessary blob is hidden away somewhere unobtrusive.

In training anime faces, I have seen additional artifacts which look like 'cracks' or 'waves' or elephant-skin wrinkles or the sort of fine crazing seen in old paintings or ceramics; they appear toward the end of training, primarily on skin or areas of flat color, and they happen particularly fast when transfer learning on a small dataset. The only solution I have found so far is to either stop training or get more data. In contrast to the blob artifacts (identified as an architectural problem & fixed in StyleGAN 2), I currently suspect the cracks are a sign of overfitting rather than a peculiarity of normal StyleGAN training: the G has started trying to memorize noise in the fine detail of pixelation/lines, and so these are a kind of overfitting/mode collapse. (More speculatively: another possible explanation is that the cracks are caused by the StyleGAN D being single-scale rather than multi-scale—as in MSG-GAN and a number of others—and the 'cracks' are actually high-frequency noise created by the G in specific patches as adversarial examples to fool the D. They reportedly do not appear in MSG-GAN or StyleGAN 2, which both use multi-scale Ds.)

Gradient Accumulation:

ProGAN/StyleGAN's codebase claims to support gradient accumulation, which is a way to fake large-minibatch training (eg n = 2048) by not doing the backpropagation update every minibatch, but instead summing the gradients over many minibatches and applying them all at once. This is a useful trick for stabilizing training, and large-minibatch NN training can differ qualitatively from small-minibatch NN training—BigGAN performance increased with increasingly large minibatches (n = 2048), and the authors speculate that this is because such large minibatches mean that the full diversity of the dataset is represented in each 'minibatch', so the BigGAN models cannot simply 'forget' rarer datapoints which would otherwise not appear for many minibatches in a row, resulting in the GAN pathology of 'mode dropping' where some kinds of data just get ignored by both G & D. However, the ProGAN/StyleGAN implementation of gradient accumulation does not resemble that of any other implementation I've seen in TensorFlow or PyTorch, and in my own experiments with up to n = 4096, I didn't observe any stabilization or qualitative differences, so I am suspicious the implementation is wrong.
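For contrast, this is what conventional gradient accumulation looks like elsewhere, as a minimal self-contained PyTorch sketch (purely illustrative; this is not StyleGAN's code): gradients from k small minibatches are summed before a single optimizer step, approximating one update on a k×-larger minibatch.

import torch
import torch.nn as nn

model = nn.Linear(512, 1)                          # stand-in for a network head
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 16                                   # 16 × n=8 minibatches ≈ an effective n=128

opt.zero_grad()
for step in range(160):
    x = torch.randn(8, 512)                        # a small n=8 minibatch
    loss = model(x).pow(2).mean() / accum_steps    # scale so the accumulated sum is an average
    loss.backward()                                # .grad fields accumulate across minibatches
    if (step + 1) % accum_steps == 0:
        opt.step()                                 # one parameter update per 16 minibatches
        opt.zero_grad()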
Here is what a successful training progression looks like for the anime face StyleGAN:
The anime face model is obsoleted by the StyleGAN 2 portrait model.
The anime face model as of 2019-03-08, trained for 21,980 iterations or ~21m images or ~38 GPU-days, is available for download. (It is still not fully-converged, but the quality is good.)
Sampling
Having successfully trained a StyleGAN, now the fun part—generating samples!
Psi/“truncation trick”
The 𝜓/truncation trick is used at sample generation time but not training time. The idea is to edit the latent vector z, which is a vector of 512 normally-distributed variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those.29 This seems to help by avoiding 'extreme' latent values or combinations of latent values which the G is not as good at—a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted distribution for each one, or one can truncate them at something like +1SD or +0.7SD. (Like the discussion of the best distribution for the original latent distribution, there's no good reason to think that this is an optimal method of doing truncation; there are many alternatives, such as ones penalizing the sum of the variables, either rejecting them or scaling them down, and some appear to work much better than the current truncation trick.)
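A short sketch of the resample-the-outliers idea just described (illustrative only; it is not StyleGAN's internal truncation_psi implementation, which instead operates on the intermediate w codes):

import numpy as np

def truncate_z(z, threshold=1.0, rng=np.random):
    """Resample any latent entries larger than `threshold` until all are within ±threshold."""
    z = z.copy()
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.randn(int(mask.sum()))   # redraw only the 'extreme' entries
        mask = np.abs(z) > threshold
    return z

z = np.random.randn(1, 512)                    # a StyleGAN-sized latent vector
z_truncated = truncate_z(z, threshold=0.7)     # roughly analogous in spirit to sampling with 𝜓 = 0.7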
At 𝜓 = 0, diversity is nil and all faces are a single global average face (a brown-eyed brown-haired schoolgirl, unsurprisingly); at ±0.5 you have a broad range of faces, and by ±1.2, you'll see tremendous diversity in faces but also more frequent artifacts.
Random Samples
The StyleGAN repo has a simple script pretrained_example.py to download & generate a single face; in the interests of reproducibility, it hardwires the model and the RNG seed, so it will only generate 1 particular face. However, it can be easily adapted to use a local model and (slowly30) generate, say, 1000 sample images with the hyperparameter 𝜓 = 0.6 (which gives high-quality but not highly-diverse images), which are saved to results/example-{0-999}.png:
import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    Gs.print_layers()
    for i in range(0,1000):
        rnd = np.random.RandomState(None)
        latents = rnd.randn(1, Gs.input_shape[1])
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.6, randomize_noise=True, output_transform=fmt)
        os.makedirs(config.result_dir, exist_ok=True)
        png_filename = os.path.join(config.result_dir, 'example-'+str(i)+'.png')
        PIL.Image.fromarray(images[0], 'RGB').save(png_filename)

if __name__ == "__main__":
    main()
Karras et al 2018 Figures
The figures in Karras et al 2018, demonstrating random samples and aspects of the style noise using the 1024px FFHQ face model (as well as the others), were generated by generate_figures.py. This script needs extensive modifications to work with my 512px anime face model; going through the file:
- the code uses 𝜓 = 1 truncation, but faces look better with 𝜓 = 0.7 (several of the functions have truncation_psi= settings but, trickily, the Figure 3 draw_style_mixing_figure has its 𝜓 setting hidden away in the synthesis_kwargs global variable)
- the loaded model needs to be switched to the anime face model, of course
- dimensions must be reduced 1024→512 as appropriate; some ranges are hardcoded and must be reduced for 512px images as well
- the truncation-trick figure 8 doesn't show enough faces to give insight into what the latent space is doing, so it needs to be expanded to show both more random seeds/faces and more 𝜓 values
- the bedroom/car/cat samples should be disabled
The changes I make are as follows:
diff --git a/generate_figures.py b/generate_figures.py
index 45b68b8..f27af9d 100755
--- a/generate_figures.py
+++ b/generate_figures.py
@@ -24,16 +24,13 @@ url_bedrooms = 'https://drive.google.com/uc?id=1MOSKeGF0FJcivpBI7s63V9YHloUTO
url_cars = 'https://drive.google.com/uc?id=1MJ6iCfNtMIRicihwRorsM3b7mmtmK9c3' # karras2019stylegan-cars-512x384.pkl
url_cats = 'https://drive.google.com/uc?id=1MQywl0FNt6lHu8E_EUqnRbviagS7fbiJ' # karras2019stylegan-cats-256x256.pkl
-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)
_Gs_cache = dict()
def load_Gs(url):
- if url not in _Gs_cache:
- with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
- _G, _D, Gs = pickle.load(f)
- _Gs_cache[url] = Gs
- return _Gs_cache[url]
+ _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+ return Gs
#----------------------------------------------------------------------------
# Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
for row, seed in enumerate(seeds):
latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
- images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+ images = Gs.run(latents, None, **synthesis_kwargs)
canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
for i in range(4):
crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
all_images = []
for noise_range in noise_ranges:
tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
- range_images = Gsc.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+ range_images = Gsc.run(latents, None, randomize_noise=False, **synthesis_kwargs)
range_images[flips, :, :] = range_images[flips, :, ::-1]
all_images.append(list(range_images))
@@ -144,14 +141,11 @@ def draw_truncation_trick_figure(png, Gs, w, h, seeds, psis):
def main():
tflib.init_tf()
os.makedirs(config.result_dir, exist_ok=True)
- draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
- draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
- draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
- draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
- draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
- draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
- draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
- draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+ draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+ draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+ draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+ draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+ draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])
All this done, we get some fun anime face samples to parallel Karras et al 2018’s figures:



Videos
Training Montage
The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated, and at the end they can be quite large (often 14MB each for the anime faces), so doing lossy compression with a tool like pngnq+advpng or converting them to JPG with lowered quality is a good idea. To turn the many snapshots into a training montage video like above, I use FFmpeg on the PNGs:
# Flags, annotated:
#   -framerate 10            show 10 input PNGs per second
#   -i -                     read the images from stdin
#   -r 25                    output frame-rate; frames will be duplicated to pad out to 25FPS
#   -c:v libx264             x264 for compatibility
#   -pix_fmt yuv420p         force a standard colorspace: otherwise the PNG colorspace is kept, breaking browsers (!)
#   -crf 33                  adequately high quality
#   -vf "scale=iw/2:ih/2"    shrink the image by 2×; the full detail is not necessary & this saves space
#   -preset veryslow -tune animation    aim for the smallest file possible with animation-tuned settings
cat $(ls ./results/*faces*/fakes*.png | sort --numeric-sort) | \
    ffmpeg -framerate 10 -i - -r 25 -c:v libx264 -pix_fmt yuv420p -crf 33 \
        -vf "scale=iw/2:ih/2" -preset veryslow -tune animation \
        ./stylegan-facestraining.mp4
Interpolations
The original ProGAN repo provided a config for generating interpolation videos, but that was removed in StyleGAN. Cyril Diagne (@kikko_fr) implemented a replacement, providing 3 kinds of videos:

- random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default it makes 4 of them arranged 2×2 in the video. Several interpolation videos are shown in the examples section.
- interpolate.mp4: a 'coarse' "style mixing" video; a single 'source' face is generated & held constant; a secondary interpolation video, a random walk as before, is generated; at each step of the random walk, the 'coarse'/high-level 'style' noise is copied from the random walk to overwrite the source face's original style noise. For faces, this means that the original face will be modified with all sorts of orientations & facial expressions while still remaining recognizably the original character. (It is the video analog of Karras et al 2018's Figure 3.)

A copy of Diagne's video.py:
import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config
import scipy.ndimage
import moviepy.editor

def main():
    tflib.init_tf()

    # Load pre-trained network.
    # url = 'https://drive.google.com/uc?id=1MEGjdvVpUsu1jB4zrXZN7Y4kBBOzizDQ'
    # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
    ## NOTE: insert model here:
    _G, _D, Gs = pickle.load(open("results/02047-sgan-faces-2gpu/network-snapshot-013221.pkl", "rb"))
    # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
    # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
    # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.

    grid_size = [2,2]
    image_shrink = 1
    image_zoom = 1
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'
    random_seed = 404
    mp4_file = 'results/random_grid_%s.mp4' % random_seed
    minibatch_size = 8

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_state = np.random.RandomState(random_seed)

    # Generate latent vectors
    shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:] # [frame, image, channel, component]
    all_latents = random_state.randn(*shape).astype(np.float32)
    all_latents = scipy.ndimage.gaussian_filter(all_latents, [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
    all_latents /= np.sqrt(np.mean(np.square(all_latents)))

    def create_image_grid(images, grid_size=None):
        assert images.ndim == 3 or images.ndim == 4
        num, img_h, img_w, channels = images.shape

        if grid_size is not None:
            grid_w, grid_h = tuple(grid_size)
        else:
            grid_w = max(int(np.ceil(np.sqrt(num))), 1)
            grid_h = max((num - 1) // grid_w + 1, 1)

        grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
        for idx in range(num):
            x = (idx % grid_w) * img_w
            y = (idx // grid_w) * img_h
            grid[y : y + img_h, x : x + img_w] = images[idx]
        return grid

    # Frame generation func for moviepy.
    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        latents = all_latents[frame_idx]
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)

        grid = create_image_grid(images, grid_size)
        if image_zoom > 1:
            grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
        if grid.shape[2] == 1:
            grid = grid.repeat(3, 2) # grayscale => RGB
        return grid

    # Generate video.
    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

    # coarse
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_seed = 500
    random_state = np.random.RandomState(random_seed)

    w = 512
    h = 512
    #src_seeds = [601]
    dst_seeds = [700]
    style_ranges = ([0] * 7 + [range(8,16)]) * len(dst_seeds)

    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)

    shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
    src_latents = random_state.randn(*shape).astype(np.float32)
    src_latents = scipy.ndimage.gaussian_filter(src_latents, smoothing_sec * mp4_fps, mode='wrap')
    src_latents /= np.sqrt(np.mean(np.square(src_latents)))

    dst_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in dst_seeds])

    src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
    dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
    src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
    dst_images = Gs.components.synthesis.run(dst_dlatents, randomize_noise=False, **synthesis_kwargs)

    canvas = PIL.Image.new('RGB', (w * (len(dst_seeds) + 1), h * 2), 'white')

    for col, dst_image in enumerate(list(dst_images)):
        canvas.paste(PIL.Image.fromarray(dst_image, 'RGB'), ((col + 1) * h, 0))

    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        src_image = src_images[frame_idx]
        canvas.paste(PIL.Image.fromarray(src_image, 'RGB'), (0, h))

        for col, dst_image in enumerate(list(dst_images)):
            col_dlatents = np.stack([dst_dlatents[col]])
            col_dlatents[:, style_ranges[col]] = src_dlatents[frame_idx, style_ranges[col]]
            col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
            for row, image in enumerate(list(col_images)):
                canvas.paste(PIL.Image.fromarray(image, 'RGB'), ((col + 1) * h, (row + 1) * w))
        return np.array(canvas)

    # Generate video.
    mp4_file = 'results/interpolate.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'

    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

    # fine
    duration_sec = 60.0
    smoothing_sec = 1.0
    mp4_fps = 20

    num_frames = int(np.rint(duration_sec * mp4_fps))
    random_seed = 503
    random_state = np.random.RandomState(random_seed)

    w = 512
    h = 512
    style_ranges = [range(6,16)]

    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)

    shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
    src_latents = random_state.randn(*shape).astype(np.float32)
    src_latents = scipy.ndimage.gaussian_filter(src_latents, smoothing_sec * mp4_fps, mode='wrap')
    src_latents /= np.sqrt(np.mean(np.square(src_latents)))

    dst_latents = np.stack([random_state.randn(Gs.input_shape[1])])

    src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
    dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]

    def make_frame(t):
        frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
        col_dlatents = np.stack([dst_dlatents[0]])
        col_dlatents[:, style_ranges[0]] = src_dlatents[frame_idx, style_ranges[0]]
        col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
        return col_images[0]

    # Generate video.
    mp4_file = 'results/fine_%s.mp4' % (random_seed)
    mp4_codec = 'libx264'
    mp4_bitrate = '5M'

    video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
    video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()
- fine_503.mp4: a 'fine' style mixing video; in this case, the style noise is taken from later layers, and instead of affecting the global orientation or expression, it affects subtler details like the precise shape of hair strands or hair color or mouths.
Circular interpolations are another interesting kind of interpolation, written by snowy halcy: instead of random-walking around the latent space freely, with large or awkward transitions, it moves around a fixed high-dimensional point, doing "binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a sphere or circle in the latent space." A later version of circular interpolation is in snowy halcy's face editor repo, but here is the original version cleaned up into a stand-alone program:
import dnnlib.tflib as tflib
import math
import moviepy.editor
from numpy import linalg
import numpy as np
import pickle
def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    def circ_generator(latents_interpolate):
        radius = 40.0
        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)
        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius
        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0
        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = []

        video_length = 1.0
        while(current_pos < video_length):
            array_list.append(current_image)

            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0
                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0
                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print(current_pos, current_mse)
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20

    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()
An interesting use of interpolations is Kyle McLean’s “Waifu Synthesis” video: a singing anime video mashing up StyleGAN anime faces + GPT-2 lyrics + Project Magenta music.
Models
Anime Faces
The primary model I’ve trained, the anime face model is described in the data processing & training section. It is a 512px StyleGAN model trained on n = 218,794 faces cropped from all of Danbooru2017, cleaned, & upscaled, and trained for 21,980 iterations or ~21m images or ~38 GPU-days.
Downloads (I recommend using the more-recent portrait StyleGAN unless cropped faces are specifically desired):

- random samples generated on 2019-02-14 with an extreme 𝜓 = 1.2 (165MB, JPG)
- the StyleGAN model used for TWDNEv1 samples as of 2019-02-26 (294MB, .pkl)
- all TWDNE faces via rsync
- the anime face StyleGAN model, further trained, as of 2019-03-08
The anime face model is obsoleted by the StyleGAN 2 portrait model.
TWDNE
To show off the anime faces, and as a joke, on 2019-02-14, I set up "This Waifu Does Not Exist", a standalone static website which displays a random anime face (out of 100,000), generated with various 𝜓, and paired with GPT-2-117M text snippets prompted on anime plot summaries. The details of the site implementation & generating the faces are covered separately.
But the site was amusing & an enormous success. It went viral overnight and by the end of March 2019, ~1 million unique visitors (most from China) had visited TWDNE, spending over 2 minutes each looking at the NN-generated faces & text; people began hunting for hilariously-deformed faces, using TWDNE as a screensaver, picking out faces as avatars, creating packs of faces for video games, painting their own collages of faces, using it as a character designer for inspiration, etc.
Anime Bodies
Aaron Gokaslan experimented with a custom 256px anime game image dataset which has individual characters posed in whole-person images to see how StyleGAN coped with more complex geometries. Progress required additional data cleaning and lowering the learning rate but, trained on a 4-GPU system for a week or two, the results are promising (even down to reproducing the copyright statements in the images), providing preliminary evidence that StyleGAN can scale:


Transfer Learning
"In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
“What are you doing?”, asked Minsky. “I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied. “Why is the net wired randomly?”, asked Minsky. “I do not want it to have any preconceptions of how to play”, Sussman said.
Minsky then shut his eyes. “Why do you close your eyes?”, Sussman asked his teacher. “So that the room will be empty.”
At that moment, Sussman was enlightened."
“Sussman attains enlightenment”, “AI Koans”, Jargon File
One of the most useful things to do with a trained model on a broad data corpus is to use it as a launching pad to train a better model quicker on lesser data, called “transfer learning”. For example, one might transfer learn from Nvidia’s FFHQ face StyleGAN model to a different celebrity dataset, or from bedrooms→kitchens. Or with the anime face model, one might retrain it on a subset of faces—all characters with red hair, or all male characters, or just a single specific character. Even if a dataset seems different, starting from a pretrained model can save time; after all, while male and female faces may look different and it may seem like a mistake to start from a mostly-female anime face model, the alternative of starting from scratch means starting with a model generating random rainbow-colored static, and surely male faces look far more like female faces than they do random static?31 Indeed, you can quickly train a photographic face model starting from the anime face model.
This extends the reach of good StyleGAN models from those blessed with both big data & big compute to those with little of either. Transfer learning works particularly well for specializing the anime face model to a specific character: the images of that character would be too little to train a good StyleGAN on, too data-impoverished for the sample-inefficient StyleGAN 1–2,32 but having been trained on all anime faces, the StyleGAN has learned well the full space of anime faces and can easily specialize down without overfitting. Trying to do, say, faces ↔︎ landscapes is probably a bridge too far.

Data-wise, for doing face specialization, the more the better, but n = 500–5000 is an adequate range, and even as low as n = 50 works surprisingly well. I don't know to what extent data augmentation can substitute for original datapoints, but it's probably worth a try, especially if you have n < 5000.

Compute-wise, specialization is rapid. Adaptation can happen within a few ticks, possibly even 1. This is surprisingly fast given that StyleGAN is not designed for few-shot learning.
How does one actually do transfer learning? Since StyleGAN is (currently) unconditional, with no dataset-specific categorical or text or metadata encoding, just a flat set of images, all that has to be done is to encode the new dataset and simply start training with an existing model. One creates the new dataset as usual, and then edits train.py with a new -desc line for the new dataset; if resume_kimg is set correctly (see next paragraph) and resume_run_id = "latest" is enabled as advised, you can then run python train.py and presto, transfer learning.

The main gotcha is that training must not be done from scratch: set resume_kimg to a high value like resume_kimg=7000 in training_loop.py. This forces StyleGAN to skip all the progressive growing and load the full model as-is. To make sure you did it right, check the first sample (fakes07000.png or whatever), from before any transfer-learning training has been done; it should look like the original model did at the end of its training. Then subsequent training samples should show the original quickly morphing to the new dataset. (Anything like fakes00000.png should not show up, because that indicates beginning from scratch.)
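Condensed, the moving parts look something like this (the dataset name here is hypothetical, and exact locations vary slightly between StyleGAN releases):

from dnnlib import EasyDict

# train.py: add a description & point at the newly-encoded tfrecords (made with dataset_tool.py)
desc = 'sgan'
desc += '-holofaces'                            # hypothetical dataset description
dataset = EasyDict(tfrecord_dir='holofaces')    # hypothetical tfrecord directory

# training/training_loop.py (with the "latest" patch from earlier applied):
resume_run_id = 'latest'    # resume from whatever network-snapshot-*.pkl is newest
resume_kimg   = 7000.0      # pretend 7000 kimg are already done, skipping progressive growing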
Anime Faces → Character Faces
Holo
The first transfer learning was done with Holo of Spice & Wolf. It used a 512px Holo face dataset created with Nagadomi's cropper from all of Danbooru2017, upscaled with waifu2x, cleaned by hand, and then data-augmented from n = 3900 to n = 12600; mirroring was enabled since Holo is symmetrical. I then used the anime face model as of 2019-02-09—it was not fully converged, indeed, wouldn't converge with weeks more training, but the quality was so good I was too curious as to how well retraining would work, so I switched gears.
It’s worth mentioning that this dataset was used previously with ProGAN, where after weeks of training, ProGAN overfit badly as demonstrated by the samples & interpolation videos.
Training happened remarkably quickly, with all the faces converted to recognizably Holo faces within a few hundred iterations:
The best samples were convincing without exhibiting the failures of the ProGAN:

The StyleGAN was much more successful, despite a few failure latent points carried over from the anime faces. Indeed, after a few hundred iterations, it was starting to overfit, with the 'crack' artifacts & smearing in the interpolations. The latest snapshot I was willing to use was iteration #11370, and I think it is still somewhat overfit anyway. I thought that with its total n (after data augmentation), Holo would be able to train longer (being 1⁄7th the size of FFHQ), but apparently not. Perhaps the data augmentation is considerably less valuable than 1-for-1, either because the invariants encoded in the augmentations aren't that useful (suggesting that Geirhos et al 2018-like style transfer data augmentation is what's necessary) or that they would be but the anime face StyleGAN has already learned them all as part of the previous training & needs more real data to better understand Holo-like faces. It's also possible that the results could be improved by using one of the later anime face StyleGANs, since they did improve when I trained them further after my Holo/Asuka transfer runs.
Nevertheless, impressed, I couldn’t help but wonder if they had reached human-levels of verisimilitude: would an unwary viewer assume they were handmade?
So I selected ~100 of the best samples (24MB; Imgur mirror) from a dump of 2000, cropped about 5% from the left/right edges.
The #11370 Holo StyleGAN model is available for download.
Asuka
After the Holo training & link submission went so well, I knew I had to try my other character dataset, Asuka, using n = 5300 data-augmented to n = 58,000.33 Keeping in mind how data seemed to limit the Holo quality, I left mirroring enabled for Asuka, even though she is not symmetrical due to her 3.0 eyepatch over her left eye (as purists will no doubt note).
Interestingly, while Holo trained within 4 GPU-hours, Asuka proved much more difficult and did not seem to be finished training or showing the cracks despite training twice as long. Is this due to having ~35% more real data, having 10× rather than 3× data augmentation, or some inherent difference like Asuka being more complex (eg because of more variations in her appearance like the eyepatches or plugsuits)?
I generated 1000 random samples with 𝜓 = 1.2 because they were particularly interesting to look at. As with Holo, I picked out the best 100 (13MB; Imgur mirror) from ~2000:

And, as with Holo, I submitted a selection of the best for others to judge.
The #7903 Asuka StyleGAN model is available for download.
Zuihou
In early February 2019, using the then-released model, Redditor Ending_Credits tried transfer learning to n = 500 faces of the Kantai Collection Zuihou for ~1 tick (~60k iterations).
The samples & interpolations have many artifacts, but the sample size is tiny and I’d consider this good finetuning from a model never intended for few-shot learning:

Probably it could be made better by starting from the latest anime face StyleGAN model, and using aggressive data augmentation. Another option would be to try to find as many characters which look similar to Zuihou (matching on hair color might work) and train on a joint dataset—unconditional samples would then need to be filtered for just Zuihou faces, but perhaps that drawback could be avoided by a third stage of Zuihou-only training?
Ganso
Akizuki
Another Kancolle character, Akizuki, was trained in April 2019 by Ganso.
Ptilopsis
In January 2020, Ganso trained a StyleGAN 2 model from the S2 portrait model on a tiny corpus of Ptilopsis images, a character from Arknights, a 2017 Chinese tower defense RPG mobile game.

Ptilopsis is named after a genus of owls, and her character design shows prominent ears; despite the few images to work with (just 21 on Danbooru as of 2020-01-19), the interpolation shows smooth adjustments of the ears in all positions & alignments, demonstrating the power of transfer learning:
Fate
Saber
Ending_Credits likewise did transfer learning to Saber (Fate/stay night).
Fate/Grand Order
Michael Sugimura in May 2019 experimented with transfer learning from the 512px anime portrait GAN to faces cropped from ~6k Fate/Grand Order images.
Louise
Finally, Ending_Credits did transfer to Louise (Zero no Tsukaima), n = 350:
Not as good as Saber due to the much smaller sample size.
Lelouch
roadrunner01 experimented with a number of transfers, including a transfer of the male character Lelouch Lamperouge (Code Geass) with n = 50 (!), which is not nearly as garbage as it should be.
Asashio
FlatisDogchi experimented with transfer to n = 988 (augmented to n = 18772) Asashio (KanColle) faces, creating “This Asashio Does Not Exist”.
Marisa Kirisame & the Komeijis
A Japanese user mei_miya posted an interpolation video of the Touhou character Marisa Kirisame, made by transfer learning on 5000 faces. They also did the Touhou characters Satori/Koishi Komeiji.
The Reddit user Jepacor also has done Marisa, using Danbooru samples.
Lexington
A Chinese user 3D_DLW (S2 writeup) transfer-learned to the character Lexington, cropping faces with lbpcascade_animeface, upscaling with waifu2x, and cleaning with ranker.py (using the original S2 model's Discriminator & producing datasets of varying cleanliness at n = 302–1659). Samples:

Hayasaka Ai
Tazik Shahjahan finetuned S2 on Kaguya-sama: Love Is War's Hayasaka Ai, providing a Colab notebook demonstrating how he scraped Pixiv and filtered out invalid images to create the training corpus.
Ahegao
CaJI9I created an “ahegao” StyleGAN; unspecified corpus or method:

Emilia (Re:Zero)
Anime Faces → Anime Headshots
Twitter user Sunk did transfer learning to an image corpus of a specific artist, Kurehito Misaki (深崎暮人), n≅1000. His images work well and the interpolation looks nice:
Anime Faces → Portrait
TWDNE was a huge success and popularized the anime face StyleGAN. It was not perfect, though, and flaws were noted.
Portrait Improvements
The portraits could be improved by more carefully selecting SFW images to avoid overly-suggestive faces, expanding the crops to avoid cutting off edges of heads like hairstyles, and so on.

**For details and downloads, please see Danbooru2019 Portraits.**
Portrait Results
After retraining the final face StyleGAN 2019-03-08–2019-04-30 on the new improved portraits dataset, the results improved:

This S1 anime portrait model is obsoleted by the StyleGAN 2 portrait model.
The final model from 2019-04-30 is available for download.
I used this model at 𝛙=0.5 to generate 100,000 new portraits for TWDNE (#100,000–199,999), balancing the previous faces.
I was surprised how difficult upgrading to portraits seemed to be; I spent almost two months training it before giving up on further improvements, while I had been expecting more like a week or two. The portrait results are indeed better than the faces (I was right that not cropping off the top of the head adds verisimilitude), but the upgrade didn’t impress me as much as the original faces did compared to earlier GANs. And our other experimental runs on whole-Danbooru2018 images never progressed beyond suggestive blobs during this period.
I suspect that StyleGAN—at least, on its default architecture & hyperparameters, without a great deal more compute—is reaching its limits here, and that changes may be necessary to scale to richer images. (Self-attention is probably the easiest to add since it should be easy to plug in additional layers to the convolution code.)
Anime Faces → Male Faces
A few people have observed that it would be nice to have an anime face GAN for male characters instead of always generating female ones. The anime face StyleGAN does in fact have male faces in its dataset, as I did no filtering—it's merely that female faces are overwhelmingly frequent (and it may also be that male anime faces are relatively androgynous anyway).
Training a male-only anime face StyleGAN would be another good application of transfer learning.
The faces can be easily extracted out of Danbooru2018 by querying for "male_focus", which will pick up ~150k images. More narrowly, one could search "1boy" & "solo", to ensure that the only face in the image is a male face (as opposed to, say, 1boy 1girl, where a female face might be cropped out as well). This provides n = 99k raw hits. It would be good to also filter out 'trap' or overly-female-looking faces (else what's the point?), by filtering on tags like cat ears or particularly popular 'trap' characters from franchises like Fate. Alternatively, one could search "1boy" & "multiple_boys" and then filter out "1girl" & "multiple_girls", in order to select all images with 1 or more males and then remove all images with 1 or more females; this doubles the raw hits to n = 198k. (A downside is that the face-cropping will often unavoidably yield crops with two faces, a primary face and an overlapping face, which is bad and introduces artifacting when I tried this with all faces.)
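As a sketch of that kind of tag filtering (illustrative, not my exact pipeline; it assumes the Danbooru metadata is the usual JSONL of one post per line with id/rating/tags fields, and the metadata path is hypothetical):

import glob
import json

def male_solo_ids(metadata_glob="metadata/2018*.json"):   # hypothetical metadata location
    """Collect IDs of SFW posts tagged both "1boy" & "solo" for later face-cropping."""
    ids = []
    for path in glob.glob(metadata_glob):
        with open(path, encoding="utf-8") as f:
            for line in f:
                post = json.loads(line)
                tags = {t["name"] for t in post.get("tags", [])}
                if post.get("rating") == "s" and {"1boy", "solo"} <= tags:
                    ids.append(post["id"])
    return ids

if __name__ == "__main__":
    print(len(male_solo_ids()))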
Combined with transfer learning from the general anime face StyleGAN, the results should be as good as the general (female) faces.
I settled for "1boy" & "solo", and did considerable cleaning by hand. The raw count of images turned out to be highly misleading, and many faces are unusable for a male anime face StyleGAN: many are so highly stylized (such as action scenes) as to be damaging to a GAN, or they are almost indistinguishable from female faces (because they are bishonen or trap or just androgynous), which would be pointless to include (the regular portrait StyleGAN covers those already). After hand-cleaning & use of ranker.py, I was left with n~3k, so I used heavy data augmentation to bring it up to n~57k, and I initialized from the final portrait StyleGAN for the highest quality.
It did not overfit after ~4 days of training, but the results were not noticeably improving, so I stopped (in order to start training the GPT-2-345M, which OpenAI had just released, on poetry). There are hints in the interpolation videos, I think, that it is indeed slightly overfitting, in the form of 'glitches' where the image abruptly jumps slightly, presumably to another mode.

The male face StyleGAN model is available for download, as are 1000 random faces generated with 𝜓=0.7 (mirror; partial Imgur album).
Anime Faces → Ukiyo-e Faces
In January 2020, Justin (@Buntworthy) used 5000 ukiyo-e faces cropped with Amazon Rekognition from Ukiyo-e Search to do transfer learning. After ~24h training:

Anime Faces → Western Portrait Faces
In 2019, aydao experimented with transfer learning to European portrait faces drawn from WikiArt; the transfer learning was done via Nathan Shipley’s abuse of SWA where two models are simply averaged together, parameter by parameter and layer by layer, to yield a new model. (Surprisingly, this works—as long as the models aren’t too different; if they are, the averaged model will generate only colorful blobs.) The results were amusing. From early in training:

Later:

Anime Faces → Danbooru2018
nshepperd began a training run using an early anime face StyleGAN model on the 512px SFW Danbooru2018 subset; after ~3–5 weeks (with many interruptions) on 1 GPU, as of 2019-03-22, the training samples look like this:


The StyleGAN is able to pick up global structure and there are recognizably anime figures, despite the sheer diversity of images, which is promising. The fine details are seriously lacking, and training, to my eye, is wandering around without any steady improvement or sharp details (except perhaps the faces, which are inherited from the previous model). I suspect that the learning rate is still too high, especially with only 1 GPU's worth of minibatch.
FFHQ Variations
Anime Faces → FFHQ Faces
If StyleGAN can smoothly warp anime faces among each other and express global transforms like hair length+color with 𝜓, could 𝜓 be a quick way to gain control over a single large-scale variable? For example, male vs female faces, or… anime ↔︎ real faces? (Given a particular image, could one then convert it from one domain to the other?)
Since Karras et al 2018 provide a nice FFHQ download script (albeit slower than I'd like once Google Drive rate-limits you a wallclock hour into the full download) for the full-resolution PNGs, it would be easy to downscale to 512px and create a 512px FFHQ dataset to train on, or even create a combined anime+FFHQ dataset.
The first and fastest thing was to do transfer learning from the anime faces to FFHQ real faces. It was unlikely that the model would retain much anime knowledge & be able to do morphing, but it was worth a try.
The initial results early in training are hilarious and look like zombies:

After 97 ticks, the model has converged to a boringly normal appearance, with the only hint of its origins being perhaps some excessively-fabulous hair in the training samples:

Anime Faces → Anime Faces + FFHQ Faces
So, that was a bust. The next step is to try training on anime & FFHQ faces simultaneously; given the stark difference between the datasets, would positive vs negative 𝜓 wind up splitting into real vs anime and provide a cheap & easy way of converting arbitrary faces?
This simply merged the 512px FFHQ faces with the 512px anime faces and resumed training from the previous FFHQ model (I reasoned that some of the anime-ness should still be in the model, so it would be slightly faster than restarting from the original anime face model). I trained it for 812 iterations, #11,359–12,171 (somewhat over 2 GPU-days), at which point it was mostly done.
It did manage to learn both kinds of faces quite well, separating them clearly in random samples:

However, the style transfer & 𝜓 samples were disappointments. The style mixing shows limited ability to modify faces cross-domain or convert them, and the truncation trick chart shows no clear disentanglement of the desired factor (indeed, the various halves of 𝜓 correspond to nothing clear):


The interpolation video does show that it learned to interpolate slightly between real & anime faces, giving half-anime/half-real faces.
They’re hard to spot in the interpolation video because the transition happens abruptly, so I generated samples & selected some of the more interesting anime-ish faces:

Similarly, Alexander Reben trained a StyleGAN on FFHQ+Western portrait illustrations, and the interpolation video is much smoother & more mixed, suggesting that more realistic & more varied illustrations are easier for StyleGAN to interpolate between.
Anime Faces + FFHQ → Danbooru2018
While I didn’t have the compute to properly train a Danbooru2018 StyleGAN, after nshepperd’s results, I was curious and spent some time (817 iterations, so ~2 GPU-days?) retraining the anime face+FFHQ model on Danbooru2018 SFW 512px images.
The training montage is interesting for showing how faces get repurposed into figures:
One might think that it is a bridge too far for transfer learning, but it seems not.
Reversing StyleGAN To Control & Modify Images
Modifying images is harder than generating them. An unconditional GAN architecture is, by default, ‘one-way’: the latent vector z gets generated from a bunch of variables, fed through the GAN, and out pops an image. There is no way to run the unconditional GAN ‘backwards’ to feed in an image and pop out the z instead.37
If one could, one could take an arbitrary image and encode it into the z and, by jittering z, generate many new versions of it; or one could feed it back into StyleGAN and play with the style noises at various levels in order to transform the image; or do things like 'average' two images or create interpolations between two arbitrary faces; or one could (assuming one knew what each variable in z 'means') edit the image to change things like which direction their head tilts or whether they are smiling.
There are some attempts at learning control in an unsupervised fashion (eg Voynov & Babenko 2020, GANSpace) but while excellent starting points, they have limits and may not find a specific control that one wants.
The most straightforward way would be to switch to a conditional GAN architecture based on a text or tag embedding. Then to generate a specific character wearing glasses, one simply says as much in the conditional input: "character glasses". Or if they should be smiling, add "smile". And so on. This would create images of said character with the desired modifications. This option is not available at the moment, as creating a tag embedding & training StyleGAN on it would require quite a bit of modification. It also is not a complete solution, as it wouldn't work for the case of editing an existing image.
For an unconditional GAN, there are two complementary approaches to inverting the G:
what one NN can learn to decode, another can learn to encode (eg Donahue et al 2016, Donahue & Simonyan 2019):
If StyleGAN has learned z→image, then train a second encoder NN on the supervised learning problem of image→z! The sample size is infinite (just keep running G) and the mapping is fixed (given a fixed G), so it’s ugly but not that hard.
backpropagate a pixel or feature-level loss to ‘optimize’ a latent code (eg Creswell & Bharath 2018):
While StyleGAN is not inherently reversible, it’s not a blackbox as, being a NN trained by backpropagation, it must admit of gradients. In training neural networks, there are 3 components: inputs, model parameters, and outputs/
losses, and thus there are 3 ways to use backpropagation, even if we usually only use 1. One can hold the inputs fixed, and vary the model parameters in order to change (usually reduce) the fixed outputs in order to reduce a loss, which is training a NN; one can hold the inputs fixed and vary the outputs in order to change (often increase) internal parameters such as layers, which corresponds to neural network visualizations & exploration; and finally, one can hold the parameters & outputs fixed, and use the gradients to iteratively find an set of inputs which creates a specific output with a low loss (eg optimize a wheel-shape input for rolling-efficiency output).38 This can be used to create images which are ‘optimized’ in some sense. For example, Nguyen et al 2016 uses activation maximization, demonstrating how images of ImageNet classes can be pulled out of a standard CNN classifier by backprop over the classifier to maximize a particular output class; or redesign a fighter jet’s camouflage for easier classification by a model; more amusingly, in “Image Synthesis from Yahoo’s
open_nsfw
”, the gradient ascent39 on the individual pixels of an image is done to minimize/maximize a NSFW classifier’s prediction. This can also be done on a higher level by trying to maximize similarity to a NN embedding of an image to make it as ‘similar’ as possible, as was done originally in Gatys et al 2014 for style transfer, or for more complicated kinds of style transfer like in “Differentiable Image Parameterizations: A powerful, under-explored tool for neural network visualizations and art”. In this case, given an arbitrary desired image’s z, one can initialize a random z, run it forward through the GAN to get an image, compare it at the pixel level with the desired (fixed) image, and the total difference is the ‘loss’; holding the GAN fixed, the backpropagation goes back through the model and adjusts the inputs (the unfixed z) to make it slightly more like the desired image. Done many times, the final z will now yield something like the desired image, and that can be treated as its true z. Comparing at the pixel-level can be improved by instead looking at the higher layers in a NN trained to do classification (often an ImageNet VGG), which will focus more on the semantic similarity (more of a “perceptual loss”) rather than misleading details of static & individual pixels. The latent code can be the original z, or z after it’s passed through the stack of 8 FC layers and has been transformed, or it can even be the various per-layer style noises inside the CNN part of StyleGAN; the last is what
style-image-prior
uses & Gabbay & Hoshen 201940 argue that it works better to target the layer-wise encodings than the original z. This may not work too well as the local optima might be bad or the GAN may have trouble generating precisely the desired image no matter how carefully it is optimized, the pixel-level loss may not be a good loss to use, and the whole process may be quite slow, especially if one runs it many times with many different initial random z to try to avoid bad local optima. But it does mostly work.
Encode+Backpropagate is a useful hybrid strategy: the encoder makes its best guess at the z, which will usually be close to the true z, and then backpropagation is done for a few iterations to finetune the z. This can be much faster (one forward pass vs many forward+backward passes) and much less prone to getting stuck in bad local optima (since it starts at a good initial z thanks to the encoder).
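A minimal PyTorch-style sketch of this hybrid, assuming a trained generator G (z→image), an untrained encoder CNN E (image→z), a VGG-style feature extractor vgg, and a target image batch (all illustrative names, not StyleGAN’s actual API):

import torch
import torch.nn.functional as F

# 1. Train the encoder on 'free' synthetic (image, z) pairs produced by the fixed G:
opt_E = torch.optim.Adam(E.parameters(), lr=1e-4)
for _ in range(100_000):
    z = torch.randn(16, 512, device='cuda')
    with torch.no_grad():
        img = G(z)                        # the ground-truth latent of each image is known
    opt_E.zero_grad()
    F.mse_loss(E(img), z).backward()      # supervised regression image -> z
    opt_E.step()

# 2. At edit time, start from the encoder's guess and refine by backpropagation
#    through the frozen G against a perceptual + pixel loss:
z_hat = E(target).detach().requires_grad_(True)
opt_z = torch.optim.Adam([z_hat], lr=0.01)
for _ in range(200):
    recon = G(z_hat)
    loss = F.mse_loss(vgg(recon), vgg(target)) + 0.1 * F.mse_loss(recon, target)
    opt_z.zero_grad()
    loss.backward()
    opt_z.step()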
Comparison with editing in flow-based models
On a tangent, editing/
reversing is one of the great advantages41 of ‘flow’-based NN models such as Glow, which is one of the families of NN models competitive with GANs for high-quality image generation (along with autoregressive pixel prediction models like PixelRNN, and VAEs). Flow models have the same shape as GANs in pushing a random latent vector z through a series of upscaling convolution or other layers to produce final pixel values, but flow models use a carefully-limited set of primitives which make the model runnable both forwards and backwards exactly. This means every set of pixels corresponds to a unique z and vice-versa, and so an arbitrary set of pixels can be put in and the model run backwards to yield the exact corresponding z. There is no need to fight with the model to create an encoder to reverse it or use backpropagation optimization to try to find something almost right, as the flow model can already do this. This makes editing easy: plug the image in, get out the exact z with the equivalent of a single forward pass, figure out which part of z controls a desired attribute like ‘glasses’, change that, and run it forward. The downside of flow models, which is why I do not (yet) use them, is that the restriction to reversible layers means that they are typically much larger and slower to train than a more-or-less perceptually equivalent GAN model, by easily an order of magnitude (for Glow). When I tried Glow, I could barely run an interesting model despite aggressive memory-saving techniques, and I didn’t get anywhere interesting with the several GPU-days I spent (which was unsurprising when I realized how many GPU-months OA had spent). Since high-quality photorealistic GANs are at the limit of 2019 trainability for most researchers or hobbyists, flow models are clearly out of the question despite their many practical & theoretical advantages—they’re just too expensive! However, there is no known reason flow models couldn’t be competitive with GANs (they will probably always be larger, but because they are more correct & do more), and future improvements or hardware scaling may make them more viable, so flow-based models are an approach to keep an eye on.
One of those 3 approaches will encode an image into a latent z. So far so good, that enables things like generating randomly-different versions of a specific image or interpolating between 2 images, but how does one control the z in a more intelligent fashion to make specific edits?
If one knew what each variable in the z meant, one could simply slide them in the −1/+1 range to make the desired edits; unfortunately, the z variables are not neatly disentangled into individually-meaningful factors, so it is not that simple.
As always, the solution to one model’s problems is yet more models; to control the z, like with the encoder, we can simply train yet another model (perhaps just a linear classifier or random forests this time) to take the z of many images which are all labeled ‘smiling’ or ‘not smiling’, and learn what parts of z cause ‘smiling’ (eg Shen et al 2019). These additional models can then be used to control a z. The necessary labels (a few hundred to a few thousand will be adequate since the z is only 512 variables) can be obtained by hand or by using a pre-existing classifier.
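A minimal sketch of that idea (à la Shen et al 2019’s InterFaceGAN), assuming latents is an N×512 array of encoded latent codes and labels the matching 0/1 tagger or hand labels for one attribute such as ‘smiling’ (illustrative names only):

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000).fit(latents, labels)
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # unit normal of the decision boundary

def edit(z, strength=2.0):
    """Move a latent along the attribute direction; positive strength adds the attribute."""
    return z + strength * direction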
So, the pieces of the puzzle & putting it all together:
- For anime faces as of March 2019, KichangKim’s DeepDanbooru is available as a service and as a downloadable Keras model, which provides tags for many traits.
- Note that explicit classification/tagging may be overkill; if there is a mechanical way of controlling an attribute, direct control of the latents can be skipped. For example, an interpolation of mirroring a face can be done by taking a face+latent, mirroring it (using an encoder to get the mirror’s latents), and then simply linearly interpolating between the two sets of latents; since they differ only in orientation, their latents must also differ only in orientation, and interpolation=control. An example of interactively controlling a CelebA face GAN in a convenient GUI is SummitKwan’s TL-GAN (Kaggle interactive demo, discussion).
- Dmitry Nikitko has written a StyleGAN encoder (discussion; alternative encoder with additional losses, using resnets) using the backpropagation approach on ImageNet VGG features (but not a direct encoder). He has trained 3 classification models for age/gender/smiling, and so can do things like edit Donald Trump or Hillary Clinton photos to smile. snowy halcy has reused the encoder with the VGG loss+Discriminator loss, plus linear models trained on DeepDanbooru tags from n = 6.5k, allowing control of generated anime faces. (A later independent implementation of the backpropagation-only approach was done by Abdal et al 2019.) Some of the encodings work well on solo faces while others (with multiple faces) don’t, so sticking close to anime face StyleGAN samples is advised. Links:
- interactive notebook
- Google Colab version
- videos: transforming into red-eyed/black-haired versions (image; brightness editing), TL-GAN GUI demonstration of general editing
- mouth flap control: Kyle McLean’s “Waifu Synthesis” video & Aiterasu repo (demos: 1, 2), which use the Halcy tagger to open/close mouths, allowing for lip-syncing
- Artbreeder also implements face editing for my anime StyleGAN (demo video) along with a number of other models such as Western art portraits
- the This Fursona Does Not Exist model can be used to edit furry faces through GANSpace (as presumably can “This Pony Does Not Exist” for pony faces)
GANSpace (Härkönen et al 2020) is a semi-automated approach to discovering useful latent vector controls: it tries to find ‘large’ changes in images, under the assumption those correspond to interesting disentangled factors. A human tweaking which layers it uses and which components are selected can find interesting directions (eg a “stoned” face vector in FFHQ StyleGAN), and it can be used in Colab.
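A rough sketch of the GANSpace recipe, assuming mapping is however your trained FC stack (z→w) is wrapped: PCA over a large sample of intermediate latents gives candidate edit directions which a human then inspects layer by layer.

import numpy as np

z = np.random.randn(200_000, 512).astype(np.float32)
w = mapping(z)                                   # (200000, 512) intermediate latents
w_centered = w - w.mean(axis=0, keepdims=True)
# principal directions = right singular vectors of the centered latent matrix
_, _, components = np.linalg.svd(w_centered, full_matrices=False)

def edit(w_single, i, strength=3.0):
    """Push one latent along the i-th principal component."""
    return w_single + strength * components[i]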
The final result is interactive editing of anime faces along many different factors:
Editing Rare Attributes
A strategy of hand-editing or using a tagger to classify attributes works for common ones which will be well-represented in a sample of a few thousand since the classifier needs a few hundred cases to work with, but what about rarer attributes which might appear only on one in a thousand random samples, or attributes too rare in the dataset for StyleGAN to have learned, or attributes which may not be in the dataset at all? Editing “red eyes” should be easy, but what about something like “bunny ears”? It would be amusing to be able to edit portraits to add bunny ears, but there aren’t that many bunny ear samples (although cat ears might be much more common); is one doomed to generate & classify hundreds of thousands of samples to enable bunny ear editing? That would be infeasible for hand labeling, and difficult even with a tagger.
One suggestion I have for this use-case would be to briefly train another StyleGAN model on an enriched or boosted dataset, like a dataset of 50:50 bunny ear images & normal images. If one can obtain a few thousand bunny ear images, then this is adequate for transfer learning (combined with a few thousand random normal images from the original dataset), and one can retrain the StyleGAN on an equal balance of images. The high presence of bunny ears will ensure that the StyleGAN quickly learns all about those, while the normal images prevent it from overfitting or catastrophic forgetting of the full range of images.
This new bunny-ear StyleGAN will then produce bunny-ear samples half the time, circumventing the rare base rate issue (or failure to learn, or nonexistence in dataset), and enabling efficient training of a classifier. And since normal faces were used to preserve its general face knowledge despite the transfer learning potentially degrading it, it will remain able to encode & optimize normal faces. (The original classifiers may even be reusable on this, depending on how extreme the new attribute is, as the latent space z might not be too affected by the new attribute and the various other attributes approximately maintain the original relationship with z as before the retraining.)
StyleGAN 2
StyleGAN 2 (source, video) eliminates blob artifacts, adds a native encoding ‘projection’ feature for editing, simplifies the runtime by scrapping progressive growing in favor of a MSG-GAN-like multi-scale architecture, & has higher overall quality—but similar total training time/compute cost.
I used a 512px anime portrait S2 model trained by Aaron Gokaslan to create ThisWaifuDoesNotExist v3:

Training samples:

The model was trained to iteration #24,664 for >2 weeks on 4 Nvidia 2080ti GPUs at 35–70s per 1k images. The Tensorflow S2 model is available for download (320MB).43 (PyTorch & Onnx versions have been made by Anton using a custom repo. Note that both my face & portrait models can be run via the GenForce PyTorch repo as well.) This model can be used in Google Colab (demonstration notebook, although it seems it may pull in an older S2 model) & the model can also be used with the S2 codebase for encoding anime faces.
Running S2
Because of the optimizations, which require custom local compilation of CUDA code for maximum efficiency, getting S2 running can be more challenging than getting S1 running.
- No TensorFlow 2 compatibility: the TF version must be 1.14/1.15. Trying to run with TF 2 will give errors like: TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'
- cuDNN compatibility problems: I ran into these with TF 1.15 (which requires cuDNN ≥7.6.0, 2019-05-20, for CUDA 10.0), which gave errors like this:

    ...[2020-01-11 23:10:35.234784: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.4.2 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration...

  But then with 1.14, the tpu-estimator library was not found! (I ultimately took the risk of upgrading my installation with libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb, and thankfully, that worked and did not seem to break anything else.)
- Conda compilation: getting the entire pipeline to compile the custom ops in a Conda environment was annoying, so Gokaslan tweaked it to use 1.14 on Linux, used cudatoolkit-dev from Conda Forge, and changed the build script to use gcc-7 (since gcc-8 was unsupported)
- allow_growth: one issue with TensorFlow 1.14 is you need to force allow_growth or it will error out on Nvidia 2080tis (see the sketch after this list)
- config name change: train.py has been renamed (again) to run_training.py
- buggy learning rates: S2 (but not S1) accidentally uses the same LR for both G & D; either fix this or keep it in mind when doing LR tuning—changes to D_lrate do nothing!
- n = 1 minibatch problems: S2 is not a large NN so it can be trained on low-end GPUs; however, the S2 code makes an unnecessary assumption that n ≥ 2; to fix this in training/loss.py (fixed in Shawn Presser’s TPU/self-attention oriented fork):

    @@ -157,9 +157,8 @@ def G_logistic_ns_pathreg(G, D, opt, training_set, minibatch_size, pl_minibatch_
         with tf.name_scope('PathReg'):

             # Evaluate the regularization term using a smaller minibatch to conserve memory.
    -        if pl_minibatch_shrink > 1 and minibatch_size > 1:
    -            assert minibatch_size % pl_minibatch_shrink == 0
    -            pl_minibatch = minibatch_size // pl_minibatch_shrink
    +        if pl_minibatch_shrink > 1:
    +            pl_minibatch = tf.maximum(1, minibatch_size // pl_minibatch_shrink)
             pl_latents = tf.random_normal([pl_minibatch] + G.input_shapes[0][1:])
             pl_labels = training_set.get_random_labels_tf(pl_minibatch)
             fake_images_out, fake_dlatents_out = G.get_output_for(pl_latents, pl_labels, is_training=True, return_dlatents=True)

- S2 has some sort of memory leak, possibly related to the FID evaluations, requiring regular restarts, like putting it into a loop
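For the allow_growth item above, a minimal sketch of what forcing it looks like under TF 1.14/1.15 via a generic TF 1.x session config (I believe StyleGAN 2’s dnnlib.tflib.init_tf also accepts an equivalent 'gpu_options.allow_growth' key in its config dict, and newer TF builds respect the TF_FORCE_GPU_ALLOW_GROWTH=true environment variable):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grab GPU memory incrementally, not all upfront
session = tf.Session(config=config)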
Once S2 was running, Gokaslan trained the S2 portrait model with generally default hyperparameters.
Future Work
Some open questions about StyleGAN’s architecture & training dynamics:
- is progressive growing still necessary with StyleGAN? (StyleGAN 2 implies that it is not, as it uses a MSG-GAN-like approach)
- are 8×512 FC layers necessary? (Preliminary BigGAN work suggests that they are not necessary for BigGAN.)
- what are the wrinkly-line/cracks noise artifacts which appear at the end of training?
- how does StyleGAN compare to BigGAN in final quality?
Further possible work:
- exploration of “curriculum learning”: can training be sped up by training to convergence on small n and then periodically expanding the dataset?
- bootstrapping image generation by starting with a seed corpus, generating many random samples, selecting the best by hand, and retraining; eg expand a corpus of a specific character, or explore ‘hybrid’ corpuses which mix A/B images & then select for images which look most A+B-ish
- improved transfer learning scripts to edit trained models so 512px pretrained models can be promoted to work with 1024px images and vice versa
- better Danbooru tagger CNN for providing classification embeddings for various purposes, particularly FID loss monitoring, minibatch discrimination/auxiliary loss, and style transfer for creating a ‘StyleDanbooru’
- with a StyleDanbooru, I am curious whether it can be used as a particularly powerful form of data augmentation for small n character datasets, and whether it leads to a reversal of training dynamics with edges coming before colors/textures—it’s possible that a StyleDanbooru could make many GAN architectures, not just StyleGAN, stable to train on anime/illustration datasets
- borrowing architectural enhancements from BigGAN: self-attention layers, spectral norm regularization, large-minibatch training, and a rectified Gaussian distribution for the latent vector z
- text→image conditional GAN architecture (à la StackGAN): this would take the text tag descriptions of each image compiled by Danbooru users and use those as inputs to StyleGAN, which, should it work, would mean you could create arbitrary anime images simply by typing in a string like 1_boy samurai facing_viewer red_hair clouds sword armor blood etc. This should also, by providing rich semantic descriptions of each image, make training faster & stabler and converge to higher final quality.
- meta-learning for few-shot face or character or artist imitation (eg Set-CGAN or FIGR or perhaps FUNIT, or Noguchi & Harada 2019—the last of which achieves few-shot learning with samples of n = 25 TWDNE StyleGAN anime faces)
ImageNet StyleGAN
As part of experiments in scaling up StyleGAN 2, using TFRC research credits, we ran StyleGAN on large-scale datasets including Danbooru2019, ImageNet, and subsets of the Flickr YFCC100M dataset. Despite running for millions of images, no S2 run ever achieved remotely the realism of S2 on FFHQ or BigGAN on ImageNet: while the textures could be surprisingly good, the semantic global structure never came together, with glaring flaws—there would be too many heads, or they would be detached from bodies, etc.
Aaron Gokaslan took the time to compute the FID on ImageNet, estimating a terrible score of FID ~120. (Higher=worse; for comparison, BigGAN with EvoNorm can be as good as FID ~7, and regular BigGAN typically surpasses FID 120 within a few thousand iterations.) Even experiments in increasing the S2 model size up to ~1GB (by increasing the feature map multiplier) improved quality relatively modestly, and showed no signs of ever approaching BigGAN-level quality. We concluded that StyleGAN is in fact fundamentally limited as a GAN, trading off stability for power, and switched over to BigGAN work.
For those interested, we provide our 512px ImageNet S2 (step 1,394,688):
rsync --verbose rsync://78.46.86.149:873/biggan/2020-04-07-shawwn-stylegan-imagenet-512px-run52-1394688.pkl.xz ./
Danbooru2019+e621 256px BigGAN
As part of testing our modifications to compare_gan
, including sampling from multiple datasets to increase n, using flood loss to stabilize training, and adding an additional (crude, limited) kind of self-supervised SimCLR loss to the D, we trained several 256px BigGANs, initially on Danbooru2019 SFW but then adding in the TWDNE portraits & e621/e621-portrait datasets.
We ran it for 607,250 iterations on a TPUv3-256 pod until 2020-05-15. Config:
{"dataset.name": "images_256", "resnet_biggan.Discriminator.blocks_with_attention": "B2",
"resnet_biggan.Discriminator.ch": 96, "resnet_biggan.Generator.blocks_with_attention": "B5",
"resnet_biggan.Generator.ch": 96, "resnet_biggan.Generator.plain_tanh": false, "ModularGAN.d_lr": 0.0005,
"ModularGAN.d_lr_mul": 3.0, "ModularGAN.ema_start_step": 4000, "ModularGAN.g_lr": 6.66e-05,
"ModularGAN.g_lr_mul": 1.0, "options.batch_size": 2048, "options.d_flood": 0.2,
"options.datasets": "gs://XYZ-euw4a/datasets/danbooru2019-s/danbooru2019-s-0*,gs://XYZ-euw4a/datasets/e621-s/e621-s-0*,
gs://XYZ-euw4a/datasets/portraits/portraits-0*,gs://XYZ-euw4a/datasets/e621-portraits-s-512/e621-portraits-s-512-0*",
"options.g_flood": 0.05, "options.labels": "", "options.random_labels": true, "options.z_dim": 140,
"run_config.experimental_host_call_every_n_steps": 50, "run_config.keep_checkpoint_every_n_hours": 0.5,
"standardize_batch.use_cross_replica_mean": true, "TpuSummaries.save_image_steps": 50, "TpuSummaries.save_summary_steps": 1}

The model is available for download:
rsync --verbose rsync://78.46.86.149:873/biggan/2020-05-18-spresser-biggan-256px-danbooruplus-run39-607250.tar.xz ./
compare_gan
config:
$ cat bigrun39b/operative_config-603500.gin
# Parameters for AdamOptimizer:
# ==============================================================================
AdamOptimizer.beta1 = 0.0
AdamOptimizer.beta2 = 0.999
AdamOptimizer.epsilon = 1e-08
AdamOptimizer.use_locking = False
# Parameters for batch_norm:
# ==============================================================================
# None.
# Parameters for BigGanResNetBlock:
# ==============================================================================
BigGanResNetBlock.add_shortcut = True
# Parameters for conditional_batch_norm:
# ==============================================================================
conditional_batch_norm.use_bias = False
# Parameters for cross_replica_moments:
# ==============================================================================
cross_replica_moments.group_size = None
cross_replica_moments.parallel = True
# Parameters for D:
# ==============================================================================
D.batch_norm_fn = None
D.layer_norm = False
D.spectral_norm = True
# Parameters for dataset:
# ==============================================================================
dataset.name = 'images_256'
dataset.seed = 547
# Parameters for resnet_biggan.Discriminator:
# ==============================================================================
resnet_biggan.Discriminator.blocks_with_attention = 'B2'
resnet_biggan.Discriminator.ch = 96
resnet_biggan.Discriminator.channel_multipliers = None
resnet_biggan.Discriminator.project_y = True
# Parameters for G:
# ==============================================================================
G.batch_norm_fn = @conditional_batch_norm
G.spectral_norm = True
# Parameters for resnet_biggan.Generator:
# ==============================================================================
resnet_biggan.Generator.blocks_with_attention = 'B5'
resnet_biggan.Generator.ch = 96
resnet_biggan.Generator.channel_multipliers = None
resnet_biggan.Generator.embed_bias = False
resnet_biggan.Generator.embed_y = True
resnet_biggan.Generator.embed_y_dim = 128
resnet_biggan.Generator.embed_z = False
resnet_biggan.Generator.hierarchical_z = True
resnet_biggan.Generator.plain_tanh = False
# Parameters for hinge:
# ==============================================================================
# None.
# Parameters for loss:
# ==============================================================================
loss.fn = @hinge
# Parameters for ModularGAN:
# ==============================================================================
ModularGAN.conditional = True
ModularGAN.d_lr = 0.0005
ModularGAN.d_lr_mul = 3.0
ModularGAN.d_optimizer_fn = @tf.train.AdamOptimizer
ModularGAN.deprecated_split_disc_calls = False
ModularGAN.ema_decay = 0.9999
ModularGAN.ema_start_step = 4000
ModularGAN.experimental_force_graph_unroll = False
ModularGAN.experimental_joint_gen_for_disc = False
ModularGAN.fit_label_distribution = False
ModularGAN.g_lr = 6.66e-05
ModularGAN.g_lr_mul = 1.0
ModularGAN.g_optimizer_fn = @tf.train.AdamOptimizer
ModularGAN.g_use_ema = True
# Parameters for no_penalty:
# ==============================================================================
# None.
# Parameters for normal:
# ==============================================================================
normal.mean = 0.0
normal.seed = None
# Parameters for options:
# ==============================================================================
options.architecture = 'resnet_biggan_arch'
options.batch_size = 2048
options.d_flood = 0.2
options.datasets = \
'gs://darnbooru-euw4a/datasets/danbooru2019-s/danbooru2019-s-0*,gs://darnbooru-euw4a/datasets/e621-s/e621-s-0*,gs://darnbooru-euw4a/datasets/portraits/portraits-0*,gs://darnbooru-euw4a/datasets/e621-portraits-s-512/e621-portraits-s-512-0*'
options.description = \
'Describe your GIN config. (This appears in the tensorboard text tab.)'
options.disc_iters = 2
options.discriminator_normalization = None
options.g_flood = 0.05
options.gan_class = @ModularGAN
options.image_grid_height = 3
options.image_grid_resolution = 1024
options.image_grid_width = 3
options.labels = ''
options.lamba = 1
options.model_dir = 'gs://darnbooru-euw4a/runs/bigrun39b/'
options.num_classes = 1000
options.random_labels = True
options.training_steps = 250000
options.transpose_input = False
options.z_dim = 140
# Parameters for penalty:
# ==============================================================================
penalty.fn = @no_penalty
# Parameters for replace_labels:
# ==============================================================================
replace_labels.file_pattern = None
# Parameters for run_config:
# ==============================================================================
run_config.experimental_host_call_every_n_steps = 50
run_config.iterations_per_loop = 250
run_config.keep_checkpoint_every_n_hours = 0.5
run_config.keep_checkpoint_max = 10
run_config.save_checkpoints_steps = 250
run_config.single_core = False
run_config.tf_random_seed = None
# Parameters for spectral_norm:
# ==============================================================================
spectral_norm.epsilon = 1e-12
spectral_norm.singular_value = 'auto'
# Parameters for standardize_batch:
# ==============================================================================
standardize_batch.decay = 0.9
standardize_batch.epsilon = 1e-05
standardize_batch.use_cross_replica_mean = True
standardize_batch.use_moving_averages = False
# Parameters for TpuSummaries:
# ==============================================================================
TpuSummaries.save_image_steps = 50
TpuSummaries.save_summary_steps = 1
# Parameters for train_imagenet_transform:
# ==============================================================================
train_imagenet_transform.crop_method = 'random'
# Parameters for weights:
# ==============================================================================
weights.initializer = 'orthogonal'
# Parameters for z:
# ==============================================================================
z.distribution_fn = @tf.random.normal
z.maxval = 1.0
z.minval = -1.0
z.stddev = 1.0
BigGAN
I explore BigGAN, another recent GAN with SOTA results on the most complex image domain tackled by GANs so far, ImageNet. BigGAN’s capabilities come at a steep compute cost, however. I experiment with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by the compute expense & practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.
The primary rival GAN to StyleGAN for large-scale image synthesis as of mid-2019 is BigGAN (Brock et al 2018; official BigGAN-PyTorch implementation & models).
BigGAN successfully trains on up to 512px images from ImageNet, from all 1000 categories (conditioned on category), with near-photorealistic results on the best-represented categories (dogs), and apparently can even handle the far larger internal Google JFT dataset. In contrast, StyleGAN, while far less computationally demanding, shows poorer results on more complex categories (Karras et al 2018’s LSUN CATS StyleGAN; our whole-Danbooru2018 pilots) and has not been demonstrated to scale to ImageNet, much less beyond.
BigGAN does this by combining a few improvements on standard DCGANs (most of which are not used in StyleGAN):
architectural:
- residual layers
- self-attention layers (as in SAGAN)
- many more layers (which are wider/more channels, and deeper)
- class conditioning (1000-category metadata)
training:
- large minibatches (up to n = 2048; either via using TPU clusters or by gradient accumulation over smaller minibatches)
- orthogonal regularization
- thorough hyperparameter sweeps to find good settings
sample time:
- exponential moving average (EMA) on the Generator model (see also Gidel et al 2018, and compare with stochastic weight averaging (SWA))
- the truncation trick
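A sketch of the truncation trick from the last item: at sampling time, draw z from a truncated normal (implemented here by per-element rejection resampling); smaller thresholds give higher-fidelity but less diverse samples, and the trick is applied only at generation, never during training.

import torch

def truncated_z(batch_size: int, dim: int = 128, threshold: float = 0.5) -> torch.Tensor:
    z = torch.randn(batch_size, dim)
    while True:
        outliers = z.abs() > threshold
        if not outliers.any():
            return z
        z[outliers] = torch.randn(int(outliers.sum()))   # resample only the out-of-range entries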

The downside is that, as the name indicates, BigGAN is both a big model and requires big compute (particularly, big minibatches)—somewhere around $20,000, we estimate, based on public TPU pricing.
This presents a dilemma: larger-scale portrait modeling or whole-anime image modeling may be beyond StyleGAN’s current capabilities; but while BigGAN may be able to handle those tasks, we can’t afford to train it!
Must it cost that much? Probably not. In particular, BigGAN’s use of a fixed large minibatch throughout training is probably inefficient: it is highly unlikely that the benefits of an n = 2048 minibatch are necessary at the beginning of training when the Generator is generating static which looks nothing at all like real data, and at the end of training, that may still be too small a minibatch (Brock et al 2018 note that the benefits of larger minibatches had not saturated at n = 2048, but time/compute constraints precluded exploring further).
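For those without TPU pods, the gradient-accumulation route mentioned earlier is the usual workaround: sum gradients over several small forward/backward passes before each optimizer step, so one GPU emulates the large effective minibatch (accum × micro-batch). A minimal sketch, assuming D, opt_D, real_iter, and sample_fakes exist (this is not BigGAN-PyTorch’s actual API, which has its own num_D_accumulations mechanism):

import torch

def d_step(D, opt_D, real_iter, sample_fakes, accum=16):
    opt_D.zero_grad()
    for _ in range(accum):
        real = next(real_iter)                       # small micro-batch of real images
        fake = sample_fakes(real.size(0))            # matching batch of G samples
        loss = (torch.relu(1 - D(real)).mean() +     # hinge loss, as in BigGAN
                torch.relu(1 + D(fake)).mean()) / accum
        loss.backward()                              # gradients accumulate in .grad
    opt_D.step()                                     # one update with the full effective batch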
BigGAN Transfer Learning
Another optimization is to exploit transfer learning from the released models, and reuse the enormous amount of compute invested in them. The practical details there are fiddly. The original BigGAN 2018 release included the 128px/256px/512px Generators but not the Discriminators, so they cannot simply be trained further as-is; the compare_gan Tensorflow codebase released in early 2019 includes an independent implementation of BigGAN that can potentially train them, and I believe that the Generator may still be usable for transfer learning on its own and if not—given the arguments that Discriminators simply memorize data and do not learn much beyond that—the Discriminators can be trained from scratch by simply freezing a G while training its D on G outputs for as long as necessary. The 2019 PyTorch release includes a different model, a full 128px model with both G & D.
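A minimal sketch of the ‘freeze G, train a D from scratch’ idea: the released Generator only supplies fakes and is never updated, while a fresh Discriminator is trained on real vs G samples with the usual hinge loss (G, D, and real_loader are assumed names, not the actual codebase’s API):

import torch

for p in G.parameters():
    p.requires_grad_(False)                      # G stays fixed throughout

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.999))
for real, labels in real_loader:
    with torch.no_grad():
        z = torch.randn(real.size(0), 120, device=real.device)
        fake = G(z, labels)                      # class-conditional samples from the frozen G
    d_loss = (torch.relu(1 - D(real, labels)).mean() +
              torch.relu(1 + D(fake, labels)).mean())
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()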
All in all, it is possible that BigGAN with some tweaks could be affordable to train. (At least, with some crowdfunding…)
BigGAN: Danbooru2018-1K Experiments
To test the waters, I ran three BigGAN experiments:
1. I first experimented with retraining the ImageNet 128px model44. That resulted in almost total mode collapse when I re-enabled G after 2 days; investigating, I realized that I had misunderstood: it was a brand-new BigGAN model, trained independently, and came with its fully-trained D already. Oops.
2. transfer learning the 128px ImageNet PyTorch BigGAN model to the 1k anime portraits: successful with ~6 GPU-days
3. training from scratch a 256px BigGAN-deep on the 1k portraits: partially successful after ~240 GPU-days—it reached comparable quality to StyleGAN before suffering serious mode collapse due, possibly, to being forced to run with small minibatch sizes by BigGAN bugs
Danbooru2018-1K Dataset
Constructing D1k
Constructing a new Danbooru-1k dataset: as BigGAN requires conditioning information, I constructed new 512px whole-image & portrait datasets by taking the 1000 most popular Danbooru2018 characters, with characters as categories, and cropped out portraits as usual:
cat metadata/20180000000000* | fgrep -e '"name":"solo"' | fgrep -v '"rating":"e"' | \
jq -c '.tags | .[] | select(.category == "4") | .name' | sort | uniq --count | \
sort --numeric-sort > characters.txt
mkdir ./characters-1k/ ; cd ./characters-1k/
cpCharacterFace () {
    # Crop faces for one character into a folder named after it (punctuation mapped to '.')
    CHARACTER="$@"
    CHARACTER_SAFE=$(echo "$CHARACTER" | tr '[:punct:]' '.')
    mkdir "$CHARACTER_SAFE"
    echo "$CHARACTER" "$CHARACTER_SAFE"
    # IDs of all SFW solo images tagged with this character:
    IDS=$(cat ../metadata/* | fgrep '"name":"'"$CHARACTER"'"' | fgrep -e '"name":"solo"' \
          | fgrep -v '"rating":"e"' | jq .id | tr -d '"')
for ID in $IDS; do
BUCKET=$(printf "%04d" $(( $ID % 1000 )) );
TARGET=$(ls ../original/$BUCKET/$ID.*)
CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/crop.py \
~/src/lbpcascade_animeface/lbpcascade_animeface.xml "$TARGET" "./$CHARACTER_SAFE/$ID"
done
}
export -f cpCharacterFace
tail -1200 ../characters.txt | cut -d '"' -f 2 | parallel --progress cpCharacterFace
I merged a number of redundant folders by hand45, cleaned as usual, and did further cropping as necessary to reach 1000. This resulted in 212,359 portrait faces, with the largest class (Hatsune Miku) having 6,624 images and the smallest classes having ~0 or 1 images. (I don’t know if the class imbalance constitutes a real problem for BigGAN, as ImageNet itself is imbalanced on many levels.)
The data-loading code attempts to make the class index assignment deterministic by sorting the character folder names (torchvision ImageFolder’s find_classes):
2k.tan: 0
abe.nana: 1
abigail.williams..fate.grand.order.: 2
abukuma..kantai.collection.: 3
admiral..kantai.collection.: 4
aegis..persona.: 5
aerith.gainsborough: 6
afuro.terumi: 7
agano..kantai.collection.: 8
agrias.oaks: 9
ahri: 10
aida.mana: 11
aino.minako: 12
aisaka.taiga: 13
aisha..elsword.: 14
akagi..kantai.collection.: 15
akagi.miria: 16
akashi..kantai.collection.: 17
akatsuki..kantai.collection.: 18
akaza.akari: 19
akebono..kantai.collection.: 20
akemi.homura: 21
aki.minoriko: 22
aki.shizuha: 23
akigumo..kantai.collection.: 24
akitsu.maru..kantai.collection.: 25
akitsushima..kantai.collection.: 26
akiyama.mio: 27
akiyama.yukari: 28
akizuki..kantai.collection.: 29
akizuki.ritsuko: 30
akizuki.ryou: 31
akuma.homura: 32
albedo: 33
alice..wonderland.: 34
alice.margatroid: 35
alice.margatroid..pc.98.: 36
alisa.ilinichina.amiella: 37
altera..fate.: 38
amagi..kantai.collection.: 39
amagi.yukiko: 40
amami.haruka: 41
amanogawa.kirara: 42
amasawa.yuuko: 43
amatsukaze..kantai.collection.: 44
amazon..dragon.s.crown.: 45
anastasia..idolmaster.: 46
anchovy: 47
android.18: 48
android.21: 49
anegasaki.nene: 50
angel..kof.: 51
angela.balzac: 52
anjou.naruko: 53
aoba..kantai.collection.: 54
aoki.reika: 55
aori..splatoon.: 56
aozaki.aoko: 57
aqua..konosuba.: 58
ara.han: 59
aragaki.ayase: 60
araragi.karen: 61
arashi..kantai.collection.: 62
arashio..kantai.collection.: 63
archer: 64
arcueid.brunestud: 65
arima.senne: 66
artoria.pendragon..all.: 67
artoria.pendragon..lancer.: 68
artoria.pendragon..lancer.alter.: 69
artoria.pendragon..swimsuit.rider.alter.: 70
asahina.mikuru: 71
asakura.ryouko: 72
asashimo..kantai.collection.: 73
asashio..kantai.collection.: 74
ashigara..kantai.collection.: 75
asia.argento: 76
astolfo..fate.: 77
asui.tsuyu: 78
asuna..sao.: 79
atago..azur.lane.: 80
atago..kantai.collection.: 81
atalanta..fate.: 82
au.ra: 83
ayanami..azur.lane.: 84
ayanami..kantai.collection.: 85
ayanami.rei: 86
ayane..doa.: 87
ayase.eli: 88
baiken: 89
bardiche: 90
barnaby.brooks.jr: 91
battleship.hime: 92
bayonetta..character.: 93
bb..fate...all.: 94
bb..fate.extra.ccc.: 95
bb..swimsuit.mooncancer...fate.: 96
beatrice: 97
belfast..azur.lane.: 98
bismarck..kantai.collection.: 99
black.hanekawa: 100
black.rock.shooter..character.: 101
blake.belladonna: 102
blanc: 103
boko..girls.und.panzer.: 104
bottle.miku: 105
boudica..fate.grand.order.: 106
bowsette: 107
bridget..guilty.gear.: 108
busujima.saeko: 109
c.c.: 110
c.c..lemon..character.: 111
caesar.anthonio.zeppeli: 112
cagliostro..granblue.fantasy.: 113
camilla..fire.emblem.if.: 114
cammy.white: 115
caren.hortensia: 116
caster: 117
cecilia.alcott: 118
celes.chere: 119
charlotte..madoka.magica.: 120
charlotte.dunois: 121
charlotte.e.yeager: 122
chen: 123
chibi.usa: 124
chiki: 125
chitanda.eru: 126
chloe.von.einzbern: 127
choukai..kantai.collection.: 128
chun.li: 129
ciel: 130
cirno: 131
clarisse..granblue.fantasy.: 132
clownpiece: 133
consort.yu..fate.: 134
cure.beauty: 135
cure.happy: 136
cure.march: 137
cure.marine: 138
cure.moonlight: 139
cure.peace: 140
cure.sunny: 141
cure.sunshine: 142
cure.twinkle: 143
d.va..overwatch.: 144
daiyousei: 145
danua: 146
darjeeling: 147
dark.magician.girl: 148
dio.brando: 149
dizzy: 150
djeeta..granblue.fantasy.: 151
doremy.sweet: 152
eas: 153
eila.ilmatar.juutilainen: 154
elesis..elsword.: 155
elin..tera.: 156
elizabeth.bathory..brave...fate.: 157
elizabeth.bathory..fate.: 158
elizabeth.bathory..fate...all.: 159
ellen.baker: 160
elphelt.valentine: 161
elsa..frozen.: 162
emilia..re.zero.: 163
emiya.kiritsugu: 164
emiya.shirou: 165
emperor.penguin..kemono.friends.: 166
enma.ai: 167
enoshima.junko: 168
enterprise..azur.lane.: 169
ereshkigal..fate.grand.order.: 170
erica.hartmann: 171
etna: 172
eureka: 173
eve..elsword.: 174
ex.keine: 175
failure.penguin: 176
fate.testarossa: 177
felicia: 178
female.admiral..kantai.collection.: 179
female.my.unit..fire.emblem.if.: 180
female.protagonist..pokemon.go.: 181
fennec..kemono.friends.: 182
ferry..granblue.fantasy.: 183
flandre.scarlet: 184
florence.nightingale..fate.grand.order.: 185
fou..fate.grand.order.: 186
francesca.lucchini: 187
frankenstein.s.monster..fate.: 188
fubuki..kantai.collection.: 189
fujibayashi.kyou: 190
fujimaru.ritsuka..female.: 191
fujiwara.no.mokou: 192
furude.rika: 193
furudo.erika: 194
furukawa.nagisa: 195
fusou..kantai.collection.: 196
futaba.anzu: 197
futami.mami: 198
futatsuiwa.mamizou: 199
fuuro..pokemon.: 200
galko: 201
gambier.bay..kantai.collection.: 202
ganaha.hibiki: 203
gangut..kantai.collection.: 204
gardevoir: 205
gasai.yuno: 206
gertrud.barkhorn: 207
gilgamesh: 208
ginga.nakajima: 209
giorno.giovanna: 210
gokou.ruri: 211
graf.eisen: 212
graf.zeppelin..kantai.collection.: 213
grey.wolf..kemono.friends.: 214
gumi: 215
hachikuji.mayoi: 216
hagikaze..kantai.collection.: 217
hagiwara.yukiho: 218
haguro..kantai.collection.: 219
hakurei.reimu: 220
hamakaze..kantai.collection.: 221
hammann..azur.lane.: 222
han.juri: 223
hanasaki.tsubomi: 224
hanekawa.tsubasa: 225
hanyuu: 226
haramura.nodoka: 227
harime.nui: 228
haro: 229
haruka..pokemon.: 230
haruna..kantai.collection.: 231
haruno.sakura: 232
harusame..kantai.collection.: 233
hasegawa.kobato: 234
hassan.of.serenity..fate.: 235
hata.no.kokoro: 236
hatoba.tsugu..character.: 237
hatsune.miku: 238
hatsune.miku..append.: 239
hatsuyuki..kantai.collection.: 240
hatsuzuki..kantai.collection.: 241
hayami.kanade: 242
hayashimo..kantai.collection.: 243
hayasui..kantai.collection.: 244
hecatia.lapislazuli: 245
helena.blavatsky..fate.grand.order.: 246
heles: 247
hestia..danmachi.: 248
hex.maniac..pokemon.: 249
hibari..senran.kagura.: 250
hibiki..kantai.collection.: 251
hieda.no.akyuu: 252
hiei..kantai.collection.: 253
higashi.setsuna: 254
higashikata.jousuke: 255
high.priest: 256
hiiragi.kagami: 257
hiiragi.tsukasa: 258
hijiri.byakuren: 259
hikari..pokemon.: 260
himejima.akeno: 261
himekaidou.hatate: 262
hinanawi.tenshi: 263
hinatsuru.ai: 264
hino.akane..idolmaster.: 265
hino.akane..smile.precure..: 266
hino.rei: 267
hirasawa.ui: 268
hirasawa.yui: 269
hiryuu..kantai.collection.: 270
hishikawa.rikka: 271
hk416..girls.frontline.: 272
holo: 273
homura..xenoblade.2.: 274
honda.mio: 275
hong.meiling: 276
honma.meiko: 277
honolulu..azur.lane.: 278
horikawa.raiko: 279
hoshi.shouko: 280
hoshiguma.yuugi: 281
hoshii.miki: 282
hoshimiya.ichigo: 283
hoshimiya.kate: 284
hoshino.fumina: 285
hoshino.ruri: 286
hoshizora.miyuki: 287
hoshizora.rin: 288
hotarumaru: 289
hoto.cocoa: 290
houjou.hibiki: 291
houjou.karen: 292
houjou.satoko: 293
houjuu.nue: 294
houraisan.kaguya: 295
houshou..kantai.collection.: 296
huang.baoling: 297
hyuuga.hinata: 298
i.168..kantai.collection.: 299
i.19..kantai.collection.: 300
i.26..kantai.collection.: 301
i.401..kantai.collection.: 302
i.58..kantai.collection.: 303
i.8..kantai.collection.: 304
ia..vocaloid.: 305
ibaraki.douji..fate.grand.order.: 306
ibaraki.kasen: 307
ibuki.fuuko: 308
ibuki.suika: 309
ichigo..darling.in.the.franxx.: 310
ichinose.kotomi: 311
ichinose.shiki: 312
ikamusume: 313
ikazuchi..kantai.collection.: 314
illustrious..azur.lane.: 315
illyasviel.von.einzbern: 316
imaizumi.kagerou: 317
inaba.tewi: 318
inami.mahiru: 319
inazuma..kantai.collection.: 320
index: 321
ingrid: 322
inkling: 323
inubashiri.momiji: 324
inuyama.aoi: 325
iori.rinko: 326
iowa..kantai.collection.: 327
irisviel.von.einzbern: 328
iroha..samurai.spirits.: 329
ishtar..fate.grand.order.: 330
isokaze..kantai.collection.: 331
isonami..kantai.collection.: 332
isuzu..kantai.collection.: 333
itsumi.erika: 334
ivan.karelin: 335
izayoi.sakuya: 336
izumi.konata: 337
izumi.sagiri: 338
jack.the.ripper..fate.apocrypha.: 339
jakuzure.nonon: 340
japanese.crested.ibis..kemono.friends.: 341
jeanne.d.arc..alter...fate.: 342
jeanne.d.arc..alter.swimsuit.berserker.: 343
jeanne.d.arc..fate.: 344
jeanne.d.arc..fate...all.: 345
jeanne.d.arc..granblue.fantasy.: 346
jeanne.d.arc..swimsuit.archer.: 347
jeanne.d.arc.alter.santa.lily: 348
jintsuu..kantai.collection.: 349
jinx..league.of.legends.: 350
johnny.joestar: 351
jonathan.joestar: 352
joseph.joestar..young.: 353
jougasaki.mika: 354
jougasaki.rika: 355
jun.you..kantai.collection.: 356
junketsu: 357
junko..touhou.: 358
kaban..kemono.friends.: 359
kaburagi.t.kotetsu: 360
kaenbyou.rin: 361
kaenbyou.rin..cat.: 362
kafuu.chino: 363
kaga..kantai.collection.: 364
kagamine.len: 365
kagamine.rin: 366
kagerou..kantai.collection.: 367
kagiyama.hina: 368
kagura..gintama.: 369
kaguya.luna..character.: 370
kaito: 371
kaku.seiga: 372
kakyouin.noriaki: 373
kallen.stadtfeld: 374
kamikaze..kantai.collection.: 375
kamikita.komari: 376
kamio.misuzu: 377
kamishirasawa.keine: 378
kamiya.nao: 379
kamoi..kantai.collection.: 380
kaname.madoka: 381
kanbaru.suruga: 382
kanna.kamui: 383
kanzaki.ranko: 384
karina.lyle: 385
kasane.teto: 386
kashima..kantai.collection.: 387
kashiwazaki.sena: 388
kasodani.kyouko: 389
kasugano.sakura: 390
kasugano.sora: 391
kasumi..doa.: 392
kasumi..kantai.collection.: 393
kasumi..pokemon.: 394
kasumigaoka.utaha: 395
katori..kantai.collection.: 396
katou.megumi: 397
katsura.hinagiku: 398
katsuragi..kantai.collection.: 399
katsushika.hokusai..fate.grand.order.: 400
katyusha: 401
kawakami.mai: 402
kawakaze..kantai.collection.: 403
kawashiro.nitori: 404
kay..girls.und.panzer.: 405
kazama.asuka: 406
kazami.yuuka: 407
kenzaki.makoto: 408
kijin.seija: 409
kikuchi.makoto: 410
kino: 411
kino.makoto: 412
kinomoto.sakura: 413
kinugasa..kantai.collection.: 414
kirigaya.suguha: 415
kirigiri.kyouko: 416
kirijou.mitsuru: 417
kirima.sharo: 418
kirin..armor.: 419
kirino.ranmaru: 420
kirisame.marisa: 421
kirishima..kantai.collection.: 422
kirito: 423
kiryuuin.satsuki: 424
kisaragi..kantai.collection.: 425
kisaragi.chihaya: 426
kise.yayoi: 427
kishibe.rohan: 428
kishin.sagume: 429
kiso..kantai.collection.: 430
kiss.shot.acerola.orion.heart.under.blade: 431
kisume: 432
kitakami..kantai.collection.: 433
kiyohime..fate.grand.order.: 434
kiyoshimo..kantai.collection.: 435
kizuna.ai: 436
koakuma: 437
kobayakawa.rinko: 438
kobayakawa.sae: 439
kochiya.sanae: 440
kohinata.miho: 441
koizumi.hanayo: 442
komaki.manaka: 443
komeiji.koishi: 444
komeiji.satori: 445
kongou..kantai.collection.: 446
konjiki.no.yami: 447
konpaku.youmu: 448
konpaku.youmu..ghost.: 449
kooh: 450
kos.mos: 451
koshimizu.sachiko: 452
kotobuki.tsumugi: 453
kotomine.kirei: 454
kotonomiya.yuki: 455
kousaka.honoka: 456
kousaka.kirino: 457
kousaka.tamaki: 458
kozakura.marry: 459
kuchiki.rukia: 460
kujikawa.rise: 461
kujou.karen: 462
kula.diamond: 463
kuma..kantai.collection.: 464
kumano..kantai.collection.: 465
kumoi.ichirin: 466
kunikida.hanamaru: 467
kuradoberi.jam: 468
kuriyama.mirai: 469
kurodani.yamame: 470
kuroka..high.school.dxd.: 471
kurokawa.eren: 472
kuroki.tomoko: 473
kurosawa.dia: 474
kurosawa.ruby: 475
kuroshio..kantai.collection.: 476
kuroyukihime: 477
kurumi.erika: 478
kusanagi.motoko: 479
kusugawa.sasara: 480
kuujou.jolyne: 481
kuujou.joutarou: 482
kyon: 483
kyonko: 484
kyubey: 485
laffey..azur.lane.: 486
lala.satalin.deviluke: 487
lancer: 488
lancer..fate.zero.: 489
laura.bodewig: 490
leafa: 491
lei.lei: 492
lelouch.lamperouge: 493
len: 494
letty.whiterock: 495
levi..shingeki.no.kyojin.: 496
libeccio..kantai.collection.: 497
lightning.farron: 498
lili..tekken.: 499
lilith.aensland: 500
lillie..pokemon.: 501
lily.white: 502
link: 503
little.red.riding.hood..grimm.: 504
louise.francoise.le.blanc.de.la.valliere: 505
lucina: 506
lum: 507
luna.child: 508
lunamaria.hawke: 509
lunasa.prismriver: 510
lusamine..pokemon.: 511
lyn..blade...soul.: 512
lyndis..fire.emblem.: 513
lynette.bishop: 514
m1903.springfield..girls.frontline.: 515
madotsuki: 516
maekawa.miku: 517
maka.albarn: 518
makigumo..kantai.collection.: 519
makinami.mari.illustrious: 520
makise.kurisu: 521
makoto..street.fighter.: 522
makoto.nanaya: 523
mankanshoku.mako: 524
mao..pokemon.: 525
maou..maoyuu.: 526
maribel.hearn: 527
marie.antoinette..fate.grand.order.: 528
mash.kyrielight: 529
matoi..pso2.: 530
matoi.ryuuko: 531
matou.sakura: 532
matsuura.kanan: 533
maya..kantai.collection.: 534
me.tan: 535
medicine.melancholy: 536
medjed: 537
meer.campbell: 538
megumin: 539
megurine.luka: 540
mei..overwatch.: 541
mei..pokemon.: 542
meiko: 543
meltlilith: 544
mercy..overwatch.: 545
merlin.prismriver: 546
michishio..kantai.collection.: 547
midare.toushirou: 548
midna: 549
midorikawa.nao: 550
mika..girls.und.panzer.: 551
mikasa.ackerman: 552
mikazuki.munechika: 553
miki.sayaka: 554
millia.rage: 555
mima: 556
mimura.kanako: 557
minami.kotori: 558
minamoto.no.raikou..fate.grand.order.: 559
minamoto.no.raikou..swimsuit.lancer...fate.: 560
minase.akiko: 561
minase.iori: 562
miqo.te: 563
misaka.mikoto: 564
mishaguji: 565
misumi.nagisa: 566
mithra: 567
miura.azusa: 568
miyafuji.yoshika: 569
miyako.yoshika: 570
miyamoto.frederica: 571
miyamoto.musashi..fate.grand.order.: 572
miyaura.sanshio: 573
mizuhashi.parsee: 574
mizuki..pokemon.: 575
mizunashi.akari: 576
mizuno.ami: 577
mogami..kantai.collection.: 578
momo.velia.deviluke: 579
momozono.love: 580
mononobe.no.futo: 581
mordred..fate.: 582
mordred..fate...all.: 583
morgiana: 584
morichika.rinnosuke: 585
morikubo.nono: 586
moriya.suwako: 587
moroboshi.kirari: 588
morrigan.aensland: 589
motoori.kosuzu: 590
mumei..kabaneri.: 591
murakumo..kantai.collection.: 592
murasa.minamitsu: 593
murasame..kantai.collection.: 594
musashi..kantai.collection.: 595
mutsu..kantai.collection.: 596
mutsuki..kantai.collection.: 597
my.unit..fire.emblem..kakusei.: 598
my.unit..fire.emblem.if.: 599
myoudouin.itsuki: 600
mysterious.heroine.x: 601
mysterious.heroine.x..alter.: 602
mystia.lorelei: 603
nadia: 604
nagae.iku: 605
naganami..kantai.collection.: 606
nagato..kantai.collection.: 607
nagato.yuki: 608
nagatsuki..kantai.collection.: 609
nagi: 610
nagisa.kaworu: 611
naka..kantai.collection.: 612
nakano.azusa: 613
nami..one.piece.: 614
nanami.chiaki: 615
nanasaki.ai: 616
nao..mabinogi.: 617
narmaya..granblue.fantasy.: 618
narukami.yuu: 619
narusawa.ryouka: 620
natalia..idolmaster.: 621
natori.sana: 622
natsume..pokemon.: 623
natsume.rin: 624
nazrin: 625
nekomiya.hinata: 626
nekomusume: 627
nekomusume..gegege.no.kitarou.6.: 628
nepgear: 629
neptune..neptune.series.: 630
nero.claudius..bride...fate.: 631
nero.claudius..fate.: 632
nero.claudius..fate...all.: 633
nero.claudius..swimsuit.caster...fate.: 634
nia.teppelin: 635
nibutani.shinka: 636
nico.robin: 637
ninomiya.asuka: 638
nishikino.maki: 639
nishizumi.maho: 640
nishizumi.miho: 641
nitocris..fate.grand.order.: 642
nitocris..swimsuit.assassin...fate.: 643
nitta.minami: 644
noel.vermillion: 645
noire: 646
northern.ocean.hime: 647
noshiro..kantai.collection.: 648
noumi.kudryavka: 649
nu.13: 650
nyarlathotep..nyaruko.san.: 651
oboro..kantai.collection.: 652
oda.nobunaga..fate.: 653
ogata.chieri: 654
ohara.mari: 655
oikawa.shizuku: 656
okazaki.yumemi: 657
okita.souji..alter...fate.: 658
okita.souji..fate.: 659
okita.souji..fate...all.: 660
onozuka.komachi: 661
ooi..kantai.collection.: 662
oomori.yuuko: 663
ootsuki.yui: 664
ooyodo..kantai.collection.: 665
osakabe.hime..fate.grand.order.: 666
oshino.shinobu: 667
otonashi.kotori: 668
panty..psg.: 669
passion.lip: 670
patchouli.knowledge: 671
pepperoni..girls.und.panzer.: 672
perrine.h.clostermann: 673
pharah..overwatch.: 674
phosphophyllite: 675
pikachu: 676
pixiv.tan: 677
platelet..hataraku.saibou.: 678
platinum.the.trinity: 679
pod..nier.automata.: 680
pola..kantai.collection.: 681
priest..ragnarok.online.: 682
princess.king.boo: 683
princess.peach: 684
princess.serenity: 685
princess.zelda: 686
prinz.eugen..azur.lane.: 687
prinz.eugen..kantai.collection.: 688
prisma.illya: 689
purple.heart: 690
puru.see: 691
pyonta: 692
qbz.95..girls.frontline.: 693
rachel.alucard: 694
racing.miku: 695
raising.heart: 696
ramlethal.valentine: 697
ranka.lee: 698
ranma.chan: 699
re.class.battleship: 700
reinforce: 701
reinforce.zwei: 702
reisen.udongein.inaba: 703
reiuji.utsuho: 704
reizei.mako: 705
rem..re.zero.: 706
remilia.scarlet: 707
rensouhou.chan: 708
rensouhou.kun: 709
rias.gremory: 710
rider: 711
riesz: 712
ringo..touhou.: 713
ro.500..kantai.collection.: 714
roll: 715
rosehip: 716
rossweisse: 717
ruby.rose: 718
rumia: 719
rydia: 720
ryougi.shiki: 721
ryuuguu.rena: 722
ryuujou..kantai.collection.: 723
saber: 724
saber.alter: 725
saber.lily: 726
sagisawa.fumika: 727
saigyouji.yuyuko: 728
sailor.mars: 729
sailor.mercury: 730
sailor.moon: 731
sailor.saturn: 732
sailor.venus: 733
saint.martha: 734
sakagami.tomoyo: 735
sakamoto.mio: 736
sakata.gintoki: 737
sakuma.mayu: 738
sakura.chiyo: 739
sakura.futaba: 740
sakura.kyouko: 741
sakura.miku: 742
sakurai.momoka: 743
sakurauchi.riko: 744
samidare..kantai.collection.: 745
samus.aran: 746
sanya.v.litvyak: 747
sanzen.in.nagi: 748
saotome.ranma: 749
saratoga..kantai.collection.: 750
sasaki.chiho: 751
saten.ruiko: 752
satonaka.chie: 753
satsuki..kantai.collection.: 754
sawamura.spencer.eriri: 755
saya: 756
sazaki.kaoruko: 757
sazanami..kantai.collection.: 758
scathach..fate...all.: 759
scathach..fate.grand.order.: 760
scathach..swimsuit.assassin...fate.: 761
seaport.hime: 762
seeu: 763
seiran..touhou.: 764
seiren..suite.precure.: 765
sekibanki: 766
selvaria.bles: 767
sendai..kantai.collection.: 768
sendai.hakurei.no.miko: 769
sengoku.nadeko: 770
senjougahara.hitagi: 771
senketsu: 772
sento.isuzu: 773
serena..pokemon.: 774
serval..kemono.friends.: 775
sf.a2.miki: 776
shameimaru.aya: 777
shana: 778
shanghai.doll: 779
shantae..character.: 780
sheryl.nome: 781
shibuya.rin: 782
shidare.hotaru: 783
shigure..kantai.collection.: 784
shijou.takane: 785
shiki.eiki: 786
shikinami..kantai.collection.: 787
shikinami.asuka.langley: 788
shimada.arisu: 789
shimakaze..kantai.collection.: 790
shimamura.uzuki: 791
shinjou.akane: 792
shinki: 793
shinku: 794
shiomi.shuuko: 795
shirabe.ako: 796
shirai.kuroko: 797
shirakiin.ririchiyo: 798
shiranui..kantai.collection.: 799
shiranui.mai: 800
shirasaka.koume: 801
shirase.sakuya: 802
shiratsuyu..kantai.collection.: 803
shirayuki.hime: 804
shirogane.naoto: 805
shirona..pokemon.: 806
shoebill..kemono.friends.: 807
shokuhou.misaki: 808
shouhou..kantai.collection.: 809
shoukaku..kantai.collection.: 810
shuten.douji..fate.grand.order.: 811
signum: 812
silica: 813
simon: 814
sinon: 815
soga.no.tojiko: 816
sona.buvelle: 817
sonoda.umi: 818
sonohara.anri: 819
sonozaki.mion: 820
sonozaki.shion: 821
sora.ginko: 822
sorceress..dragon.s.crown.: 823
souryuu..kantai.collection.: 824
souryuu.asuka.langley: 825
souseiseki: 826
star.sapphire: 827
stocking..psg.: 828
su.san: 829
subaru.nakajima: 830
suigintou: 831
suiren..pokemon.: 832
suiseiseki: 833
sukuna.shinmyoumaru: 834
sunny.milk: 835
suomi.kp31..girls.frontline.: 836
super.pochaco: 837
super.sonico: 838
suzukaze.aoba: 839
suzumiya.haruhi: 840
suzutsuki..kantai.collection.: 841
suzuya..kantai.collection.: 842
tachibana.arisu: 843
tachibana.hibiki..symphogear.: 844
tada.riina: 845
taigei..kantai.collection.: 846
taihou..azur.lane.: 847
taihou..kantai.collection.: 848
tainaka.ritsu: 849
takagaki.kaede: 850
takakura.himari: 851
takamachi.nanoha: 852
takami.chika: 853
takanashi.rikka: 854
takao..azur.lane.: 855
takao..kantai.collection.: 856
takara.miyuki: 857
takarada.rikka: 858
takatsuki.yayoi: 859
takebe.saori: 860
tama..kantai.collection.: 861
tamamo..fate...all.: 862
tamamo.cat..fate.: 863
tamamo.no.mae..fate.: 864
tamamo.no.mae..swimsuit.lancer...fate.: 865
tanamachi.kaoru: 866
taneshima.popura: 867
tanned.cirno: 868
taokaka: 869
tatara.kogasa: 870
tateyama.ayano: 871
tatsumaki: 872
tatsuta..kantai.collection.: 873
tedeza.rize: 874
tenryuu..kantai.collection.: 875
tenshi..angel.beats..: 876
teruzuki..kantai.collection.: 877
tharja: 878
tifa.lockhart: 879
tina.branford: 880
tippy..gochiusa.: 881
tokiko..touhou.: 882
tokisaki.kurumi: 883
tokitsukaze..kantai.collection.: 884
tomoe.gozen..fate.grand.order.: 885
tomoe.hotaru: 886
tomoe.mami: 887
tone..kantai.collection.: 888
toono.akiha: 889
tooru..maidragon.: 890
toosaka.rin: 891
toramaru.shou: 892
toshinou.kyouko: 893
totoki.airi: 894
toudou.shimako: 895
toudou.yurika: 896
toujou.koneko: 897
toujou.nozomi: 898
touko..pokemon.: 899
touwa.erio: 900
toyosatomimi.no.miko: 901
tracer..overwatch.: 902
tsukikage.yuri: 903
tsukimiya.ayu: 904
tsukino.mito: 905
tsukino.usagi: 906
tsukumo.benben: 907
tsurumaru.kuninaga: 908
tsuruya: 909
tsushima.yoshiko: 910
u.511..kantai.collection.: 911
ujimatsu.chiya: 912
ultimate.madoka: 913
umikaze..kantai.collection.: 914
unicorn..azur.lane.: 915
unryuu..kantai.collection.: 916
urakaze..kantai.collection.: 917
uraraka.ochako: 918
usada.hikaru: 919
usami.renko: 920
usami.sumireko: 921
ushio..kantai.collection.: 922
ushiromiya.ange: 923
ushiwakamaru..fate.grand.order.: 924
uzuki..kantai.collection.: 925
vampire..azur.lane.: 926
vampy: 927
venera.sama: 928
verniy..kantai.collection.: 929
victorica.de.blois: 930
violet.evergarden..character.: 931
vira.lilie: 932
vita: 933
vivio: 934
wa2000..girls.frontline.: 935
wakasagihime: 936
wang.liu.mei: 937
warspite..kantai.collection.: 938
watanabe.you: 939
watarase.jun: 940
watatsuki.no.yorihime: 941
waver.velvet: 942
weiss.schnee: 943
white.mage: 944
widowmaker..overwatch.: 945
wo.class.aircraft.carrier: 946
wriggle.nightbug: 947
xenovia.quarta: 948
xp.tan: 949
xuanzang..fate.grand.order.: 950
yagami.hayate: 951
yagokoro.eirin: 952
yahagi..kantai.collection.: 953
yakumo.ran: 954
yakumo.yukari: 955
yamada.aoi: 956
yamada.elf: 957
yamakaze..kantai.collection.: 958
yamashiro..azur.lane.: 959
yamashiro..kantai.collection.: 960
yamato..kantai.collection.: 961
yamato.no.kami.yasusada: 962
yang.xiao.long: 963
yasaka.kanako: 964
yayoi..kantai.collection.: 965
yazawa.nico: 966
yin: 967
yoko.littner: 968
yorha.no..2.type.b: 969
yorigami.shion: 970
yowane.haku: 971
yuffie.kisaragi: 972
yui..angel.beats..: 973
yuigahama.yui: 974
yuki.miku: 975
yukikaze..kantai.collection.: 976
yukine.chris: 977
yukinoshita.yukino: 978
yukishiro.honoka: 979
yumi..senran.kagura.: 980
yuna..ff10.: 981
yuno: 982
yura..kantai.collection.: 983
yuubari..kantai.collection.: 984
yuudachi..kantai.collection.: 985
yuugumo..kantai.collection.: 986
yuuki..sao.: 987
yuuki.makoto: 988
yuuki.mikan: 989
yuzuhara.konomi: 990
yuzuki.yukari: 991
yuzuriha.inori: 992
z1.leberecht.maass..kantai.collection.: 993
z3.max.schultz..kantai.collection.: 994
zero.two..darling.in.the.franxx.: 995
zeta..granblue.fantasy.: 996
zooey..granblue.fantasy.: 997
zuihou..kantai.collection.: 998
zuikaku..kantai.collection.: 999
(Aside from being potentially useful to stabilize training by providing supervision/conditioning, the character labels also make it possible to generate samples of a specific character on demand.)
D1K Download
D1K (20GB; n = 822,842 512px JPEGs) and the portrait-crop version, D1K-portraits (18GB; n = 212,359) are available for download:
rsync --verbose --recursive rsync://78.46.86.149:873/biggan/d1k ./d1k/
The JPG compression turned out to be too aggressive and resulted in noticeable artifacting, so in early 2020 I regenerated D1k from Danbooru2019 for future projects, creating D1K-2019-512px: a fresh set of top-1k solo character images, s/
Merges of overlapping characters were again necessary; the full set of tag merges:
merge() { mv ./"$1"/* ./"$2"/ && rmdir ./"$1"; } # fold the first character folder into the second
merge alice.margatroid..pc.98. alice.margatroid
merge artoria.pendragon..all. saber
merge artoria.pendragon..lancer. saber
merge artoria.pendragon..lancer.alter. saber
merge artoria.pendragon..swimsuit.rider.alter. saber
merge artoria.pendragon..swimsuit.ruler...fate. saber
merge atago..midsummer.march...azur.lane. atago..azur.lane.
merge bardiche fate.testarossa
merge bb..fate...all. matou.sakura
merge bb..fate.extra.ccc. matou.sakura
merge bb..swimsuit.mooncancer...fate. matou.sakura
merge bottle.miku hatsune.miku
merge cure.beauty aoki.reika
merge cure.happy hoshizora.miyuki
merge cure.march midorikawa.nao
merge cure.marine kurumi.erika
merge cure.melody houjou.hibiki
merge cure.moonlight tsukikage.yuri
merge cure.peace kise.yayoi
merge cure.peach momozono.love
merge cure.sunny hino.akane..smile.precure..
merge cure.sunshine myoudouin.itsuki
merge cure.sword kenzaki.makoto
merge cure.twinkle amanogawa.kirara
merge eas higashi.setsuna
merge elizabeth.bathory..brave...fate. elizabeth.bathory..fate.
merge elizabeth.bathory..fate...all. elizabeth.bathory..fate.
merge ex.keine kamishirasawa.keine
merge frankenstein.s.monster..swimsuit.saber...fate. frankenstein.s.monster..fate.
merge frederica.bernkastel furude.rika
merge furudo.erika furude.rika
merge graf.eisen vita
merge hatsune.miku..append. hatsune.miku
merge ishtar..fate.grand.order. ishtar..fate...all.
merge jeanne.d.arc..alter...fate. jeanne.d.arc..fate.
merge jeanne.d.arc..alter.swimsuit.berserker. jeanne.d.arc..fate.
merge jeanne.d.arc..fate...all. jeanne.d.arc..fate.
merge jeanne.d.arc..swimsuit.archer. jeanne.d.arc..fate.
merge jeanne.d.arc.alter.santa.lily jeanne.d.arc..fate.
merge kaenbyou.rin..cat. kaenbyou.rin
merge kiyohime..swimsuit.lancer...fate. kiyohime..fate.grand.order.
merge konpaku.youmu..ghost. konpaku.youmu
merge kyonko kyon
merge lancer cu.chulainn..fate...all.
merge medb..fate.grand.order. medb..fate...all.
merge medjed nitocris..fate.grand.order.
merge meltryllis..swimsuit.lancer...fate. meltryllis
merge minamoto.no.raikou..swimsuit.lancer...fate. minamoto.no.raikou..fate.grand.order.
merge miyamoto.musashi..swimsuit.berserker...fate. miyamoto.musashi..fate.grand.order.
merge mordred..fate...all. mordred..fate.
merge mysterious.heroine.x saber
merge mysterious.heroine.x..alter. saber
merge mysterious.heroine.xx..foreigner. saber
merge nero.claudius..bride...fate. nero.claudius..fate.
merge nero.claudius..fate...all. nero.claudius..fate.
merge nero.claudius..swimsuit.caster...fate. nero.claudius..fate.
merge nitocris..swimsuit.assassin...fate. nitocris..fate.grand.order.
merge oda.nobunaga..fate...all. oda.nobunaga..fate.
merge okita.souji..alter...fate. okita.souji..fate.
merge okita.souji..fate...all. okita.souji..fate.
merge princess.of.the.crystal takakura.himari
merge princess.serenity tsukino.usagi
merge prinz.eugen..unfading.smile...azur.lane. prinz.eugen..azur.lane.
merge prisma.illya illyasviel.von.einzbern
merge purple.heart neptune..neptune.series.
merge pyonta moriya.suwako
merge racing.miku hatsune.miku
merge raising.heart takamachi.nanoha
merge reinforce.zwei reinforce
merge rensouhou.chan shimakaze..kantai.collection.
merge roll.caskett roll
merge saber.alter saber
merge saber.lily saber
merge sailor.jupiter kino.makoto
merge sailor.mars hino.rei
merge sailor.mercury mizuno.ami
merge sailor.moon tsukino.usagi
merge sailor.saturn tomoe.hotaru
merge sailor.venus aino.minako
merge sakura.miku hatsune.miku
merge scathach..fate.grand.order. scathach..fate...all.
merge scathach..swimsuit.assassin...fate. scathach..fate...all.
merge scathach.skadi..fate.grand.order. scathach..fate...all.
merge schwertkreuz yagami.hayate
merge seiren..suite.precure. kurokawa.eren
merge shanghai.doll alice.margatroid
merge shikinami.asuka.langley souryuu.asuka.langley
merge su.san medicine.melancholy
merge taihou..forbidden.feast...azur.lane. taihou..azur.lane.
merge tamamo..fate...all. tamamo.cat..fate.
merge tamamo.no.mae..fate. tamamo.cat..fate.
merge tamamo.no.mae..swimsuit.lancer...fate. tamamo.cat..fate.
merge tanned.cirno cirno
merge ultimate.madoka kaname.madoka
merge yuki.miku hatsune.miku
Download:
rsync --verbose --recursive rsync://78.46.86.149:873/biggan/d1k-2019-512px/ ./d1k-2019-512px/
D1K BigGAN Conversion
BigGAN requires the dataset metadata to be defined in utils.py, and then, if using HDF5 archives, the dataset must be processed into an HDF5 archive along with Inception statistics for the periodic testing (although I minimize testing, the preprocessed statistics are still necessary).
HDF5 is not strictly necessary and can be omitted (BigGAN-PyTorch can read image folders directly) if you prefer to avoid the hassle.
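If skipping HDF5, the ImageFolder path reads the directory named in root_dict below, using the usual torchvision one-subdirectory-per-class layout; schematically (filenames illustrative, class names taken from the list above):
characters-1k-faces/
    souryuu.asuka.langley/
        0001.jpg
        0002.jpg
        ...
    zero.two..darling.in.the.franxx./
        ...
    zuikaku..kantai.collection./
        ...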
utils.py must be edited to add metadata for each dataset (there is no CLI option for this); defining a 128px Danbooru-1K portrait dataset looks like this:
# Convenience dicts
-dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
+dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
'I128': dset.ImageFolder, 'I256': dset.ImageFolder,
'I32_hdf5': dset.ILSVRC_HDF5, 'I64_hdf5': dset.ILSVRC_HDF5,
'I128_hdf5': dset.ILSVRC_HDF5, 'I256_hdf5': dset.ILSVRC_HDF5,
- 'C10': dset.CIFAR10, 'C100': dset.CIFAR100}
+ 'C10': dset.CIFAR10, 'C100': dset.CIFAR100,
+ 'D1K': dset.ImageFolder, 'D1K_hdf5': dset.ILSVRC_HDF5 }
imsize_dict = {'I32': 32, 'I32_hdf5': 32,
'I64': 64, 'I64_hdf5': 64,
'I128': 128, 'I128_hdf5': 128,
'I256': 256, 'I256_hdf5': 256,
- 'C10': 32, 'C100': 32}
+ 'C10': 32, 'C100': 32,
+ 'D1K': 128, 'D1K_hdf5': 128 }
root_dict = {'I32': 'ImageNet', 'I32_hdf5': 'ILSVRC32.hdf5',
'I64': 'ImageNet', 'I64_hdf5': 'ILSVRC64.hdf5',
'I128': 'ImageNet', 'I128_hdf5': 'ILSVRC128.hdf5',
'I256': 'ImageNet', 'I256_hdf5': 'ILSVRC256.hdf5',
- 'C10': 'cifar', 'C100': 'cifar'}
+ 'C10': 'cifar', 'C100': 'cifar',
+ 'D1K': 'characters-1k-faces', 'D1K_hdf5': 'D1K.hdf5' }
nclass_dict = {'I32': 1000, 'I32_hdf5': 1000,
'I64': 1000, 'I64_hdf5': 1000,
'I128': 1000, 'I128_hdf5': 1000,
'I256': 1000, 'I256_hdf5': 1000,
- 'C10': 10, 'C100': 100}
-# Number of classes to put per sample sheet
+ 'C10': 10, 'C100': 100,
+ 'D1K': 1000, 'D1K_hdf5': 1000 }
+# Number of classes to put per sample sheet
classes_per_sheet_dict = {'I32': 50, 'I32_hdf5': 50,
'I64': 50, 'I64_hdf5': 50,
'I128': 20, 'I128_hdf5': 20,
'I256': 20, 'I256_hdf5': 20,
- 'C10': 10, 'C100': 100}
+ 'C10': 10, 'C100': 100,
+ 'D1K': 1, 'D1K_hdf5': 1 }
Each dataset exists in 2 forms, as the original image folder and then as the processed HDF5:
python make_hdf5.py --dataset D1K512 --data_root /media/gwern/Data2/danbooru2018
python calculate_inception_moments.py --parallel --dataset D1K_hdf5 --batch_size 64 \
--data_root /media/gwern/Data2/danbooru2018
## Or ImageNet example:
python make_hdf5.py --dataset I128 --data_root /media/gwern/Data/imagenet/
python calculate_inception_moments.py --dataset I128_hdf5 --batch_size 64 \
--data_root /media/gwern/Data/imagenet/
make_hdf5.py will write the HDF5 to an ILSVRC*.hdf5 file, so rename it to whatever you registered (eg D1K.hdf5).
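For example (a trivial sketch; adjust the path to wherever make_hdf5.py wrote the archive, and the name to whatever you registered in root_dict):
mv ILSVRC128.hdf5 D1K.hdf5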
BigGAN Training
With the HDF5 & Inception statistics calculated, it should be possible to run like so:
python train.py --dataset D1K --parallel --shuffle --num_workers 4 --batch_size 32 \
--num_G_accumulations 8 --num_D_accumulations 8 \
--num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 \
--G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 --adam_eps 1e-6 \
--G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 120 --shared_dim 128 \
--G_eval_mode --G_ch 96 --D_ch 96 \
--ema --use_ema --ema_start 20000 --test_every 2000 --save_every 1000 --num_best_copies 5 \
--num_save_copies 2 --seed 0 --use_multiepoch_sampler --which_best FID \
--data_root /media/gwern/Data2/danbooru2018
The architecture is specified on the command line and must be correct; examples are in the scripts/ directory. In the above example, the options from --num_D_steps through --D_ch should be left strictly alone; the key parameters come earlier: the effective minibatch is --batch_size times --num_{G/D}_accumulations, so with --batch_size 32 I would need an accumulation of 64 to match the paper's n = 2048. Without EMA, samples are low quality and change drastically from iteration to iteration; after a certain number of iterations, sampling is instead done with EMA, which averages the weights across iterations offline (but one doesn't train using the averaged model!46). That this works shows that, collectively, these iterations are similar, 'orbiting' around a central point, and the image quality clearly and gradually improves once EMA is turned on.
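To illustrate what the --ema/--use_ema flags are doing, here is a minimal sketch of the general technique (not BigGAN-PyTorch's actual ema class), assuming G is the generator module being trained:
import copy
import torch

G_ema = copy.deepcopy(G).eval()   # frozen shadow copy used only for sampling

@torch.no_grad()
def update_ema(G, G_ema, decay=0.9999):
    # shadow weights drift slowly toward the current training weights; a full
    # implementation would also copy non-parameter buffers (eg batchnorm statistics)
    for p, p_ema in zip(G.parameters(), G_ema.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# call update_ema(G, G_ema) after every optimizer step on G, once past --ema_start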
Transfer learning is not supported natively, but a trick similar to the StyleGAN one is feasible: just drop the pretrained models into the checkpoint folder and resume (which will work as long as the architecture is identical to that specified by the CLI parameters).
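A rough sketch of that trick, assuming BigGAN-PyTorch's default weights/<experiment_name>/ checkpoint layout and filenames (verify both against your copy of the repo; the paths here are placeholders):
EXP="BigGAN_D1K_hdf5_seed0_..."    # the auto-generated experiment name; must match the CLI flags
mkdir -p weights/"$EXP"
cp /path/to/pretrained/{G,G_ema,D,state_dict}.pth weights/"$EXP"/
python train.py ... --resume --experiment_name "$EXP"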
The sample sheet functionality can easily overload a GPU and OOM; in utils.py, it may be necessary to simply comment out all of the sampling functionality, starting with utils.sample_sheet.
The main problem in running BigGAN is odd bugs in BigGAN's handling of epochs and resuming. With --use_multiepoch_sampler, it does complicated calculations to try to keep sampling consistent across epochs, with precisely the same ordering of samples regardless of how often the BigGAN job is started/stopped, and these calculations appear to be buggy; with that option disabled and larger total minibatches used, a different bug gets triggered, leading to inscrutable crashes:
...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "train.py", line 228, in <module>
main()
File "train.py", line 225, in main
run(config)
File "train.py", line 172, in run
for i, (x, y) in enumerate(pbar):
File "/root/BigGAN-PyTorch-mooch/utils.py", line 842, in progress
for n, item in enumerate(items):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
idx, batch = self._get_batch()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 601, in _get_batch
return self.data_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/opt/conda/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/opt/conda/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 274, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21103) is killed by signal: Bus error.
There is no good workaround here: starting with small fast minibatches compromises final quality, while starting with big slow minibatches may work but then costs far more compute. I did find that the G/
BigGAN: ImageNet→Danbooru2018-1K
In any case, I ran the 128px ImageNet→Danbooru2018-1K for ~6 GPU-days (or ~3 days on my 2×1080ti workstation) and the training montage indicates it was working fine:
Sometime after that, while continuing to play with imbalanced minibatches to avoid triggering the iteration/
BigGAN: 256px Danbooru2018-1K
More seriously, I began training a 256px model on Danbooru2018-1K portraits. This required rebuilding the HDF5 with 256px settings, and since I wasn't doing transfer learning, I used the BigGAN-deep architecture settings, which give better results & a smaller model than the original BigGAN.
My own 2×1080ti were inadequate for reasonable turnaround on training a 256px BigGAN from scratch—they would take something like 4+ months wallclock— so I decided to shell out for a big cloud instance. AWS/
Vast.ai setup was straightforward, and I found a nice instance: an 8×2080ti machine available for just $1.7/
That is ~250 GPU-days of training, although this is a misleading way to put it since the Vast.ai bill includes bandwidth/
The training command:
python train.py --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
--batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 --G_lr 1e-4 \
--D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 --D_ch 128 \
--G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 \
--adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 64 \
--shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 --sv_log_interval 1000 \
--save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 --no_fid \
--num_inception_images 1 --augment --data_root ~/tmp --resume --experiment_name \
BigGAN_D1K_hdf5_BigGANdeep_seed0_Gch128_Dch128_Gd2_Dd2_bs64_Glr1.0e-04_Dlr4.0e-04_Gnlinplace_relu_Dnlinplace_relu_Ginitortho_Dinitortho_Gattn64_Dattn64_Gshared_hier_ema
The system worked well, but BigGAN turns out to have serious performance bottlenecks (apparently in synchronizing batchnorm across GPUs) and did not make good use of the 8 GPUs, averaging ~30% GPU utilization according to nvidia-smi. (On my 2×1080tis with the 128px model, GPU utilization was closer to 95%.) In retrospect, I probably should have switched to a less expensive instance like an 8×1080ti, where it likely would have had similar throughput but cost less.
Training progressed well up until iterations #80–90k, when I began seeing signs of mode collapse:
I was unable to increase the minibatch to more than ~500 because of the bugs, limiting what I could do against mode collapse, and I suspect the small minibatch was why mode collapse was happening in the first place. (Gokaslan tried the last checkpoint I saved—#95,160—with the same settings, and ran it to #100,000 iterations and experienced near-total mode collapse.)
The last checkpoint I saved from before mode collapse was #83,520, saved on 2019-05-28 after ~24 wallclock days (accounting for various crashes & time setting up & tweaking).
Random samples, interpolation grids (not videos), and class-conditional samples can be generated using sample.py
; like train.py
, it requires the exact architecture to be specified. I used the following command (many of the options are probably not necessary, but I didn’t know which):
python sample.py --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
--batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 \
--G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 \
--D_ch 128 --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 \
--BN_eps 1e-5 --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier \
--dim_z 64 --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 \
--sv_log_interval 1000 --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 \
--no_fid --num_inception_images 1 --skip_init --G_batch_size 32 --use_ema --G_eval_mode \
--sample_random --sample_sheets --sample_interps --resume --experiment_name 256px
Random samples are already well-represented by the training montage. The interpolations look similar to StyleGAN interpolations. The class-conditional samples are the most fun to look at, because one can examine specific characters without needing to retrain the entire model, which, while only taking a few hours at most, is a hassle.
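For quick character-specific sampling without going through sample.py, a hedged sketch of doing it directly in Python, assuming the 256px BigGAN-deep generator above (--dim_z 64, --G_shared) has already been loaded as G with the EMA weights and put in eval mode; class indices follow the alphabetical tag ordering listed earlier (eg zuikaku..kantai.collection. = 999):
import torch
import torchvision

with torch.no_grad():
    z = torch.randn(16, 64, device='cuda')                       # dim_z = 64 for this run
    y = torch.full((16,), 999, dtype=torch.long, device='cuda')  # 16 samples of one class
    imgs = G(z, G.shared(y))           # G.shared maps class indices to the shared embedding
    torchvision.utils.save_image(imgs, 'zuikaku-samples.png', normalize=True, nrow=4)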
256px Danbooru2018-1K Samples
Interpolation images and 5 character-specific random samples (Asuka, Holo, Rin, Chen, Ruri):






256px BigGAN Downloads
Model & sample downloads:
- trained model: 2019-05-28-biggan-danbooru2018-snapshot-83520.tar.xz48 (291MB)
- random samples for all 1000 character classes49 (321MB)
Evaluation

The best results from the 128px BigGAN model look about as good as could be expected from 128px samples; the 256px model is fairly good, but suffers from much more noticeable artifacting than 512px StyleGAN, and cost $1373 (a 256px StyleGAN would have been closer to $400 on AWS). In BigGAN’s defense, it had clearly not converged yet and could have benefited from much more training and much larger minibatches, had that been possible. Qualitatively, looking at the more complex elements of samples, like hair ornaments/
However, training 512px portraits or whole-Danbooru images is infeasible at this point: while the cost might be only a few thousand dollars, the various bugs mean that it may not be possible to stably train to a useful quality. It’s a dilemma: at small or easy domains, StyleGAN is much faster (if not better); but at large or hard domains, mode collapse is too risky and endangers the big investment necessary to surpass StyleGAN.
To make BigGAN viable, it needs at least:
- minibatch size bugs fixed to enable up to n = 2048 (or larger, as gradient noise scale indicates)
- 512px architectures defined, to allow transfer learning from the released Tensorflow 512px ImageNet model
- optimization work to reduce overhead and allow reasonable GPU utilization on >2-GPU systems
With those done, it should be possible to train 512px portraits for <$1,000 and whole-Danbooru images for <$10,000. (Given the release of DeepDanbooru as a TensorFlow model, enabling an anime-specific perceptual loss, it would also be interesting to investigate applying “NoGAN” pretraining to BigGAN.)
See Also
External Links
“Pretrained Anime StyleGAN 2—convert to Pytorch and editing images by encoder”, Allen Ng
“Video shows off hundreds of beautiful AI-created anime girls in less than a minute【Video】”
“Talking Head Anime from a Single Image”, Pramook Khungurn
“StyleGAN in WebAssembly using tensor4”, Stanislav Pidhorskyi (code)
“StyleGAN2: Align, Project, Animate, Mix Styles and Train”, Nikhil Tirumala (“easy-to-use utility functions to generate animations and mix styles”)
“StyleGAN v2: notes on training and latent space exploration”, Alex Martinelli
“Practical aspects of StyleGAN2 training”, l4rz (improving topless image generation by hyperparameter tuning, model size increase, & face-cropping data augmentation)
Discussion: Reddit
Appendix
For comparison, here are some of my older GAN or other NN attempts; as the quality is worse than StyleGAN, I won’t bother going into details—creating the datasets & training the ProGAN & tuning & transfer-learning were all much the same as already outlined at length for the StyleGAN results.
Included are:
ProGAN
Glow
MSG-GAN
PokeGAN
Self-Attention-GAN-TensorFlow
VGAN
BigGAN unofficial (official BigGAN is covered above)
- BigGAN-TensorFlow
- BigGAN-PyTorch
GAN-QP
WGAN
IntroVAE
ProGAN
Using Karras et al 2017’s official implementation:
2018-09-08, 512–1024px whole-Asuka images, ProGAN samples:
1024px whole-Asuka images, ProGAN
512px whole-Asuka images, ProGAN
2018-09-18, 512px Asuka faces, ProGAN samples:
512px Asuka faces, ProGAN
2018-10-29, 512px Holo faces, ProGAN:
Random samples of 512px ProGAN Holo faces
After generating ~1k Holo faces, I selected the top decile (n = 103) of the faces (Imgur mirror):
512px ProGAN Holo faces, random samples from top decile (6×6)
The top decile images are, nevertheless, showing distinct signs of both artifacting & overfitting/memorization of data points. Another 2 weeks proved this out further:
ProGAN samples of 512px Holo faces, after badly overfitting (iteration #10,325)
Interpolation video of the October 2018 512px Holo face ProGAN; note the gross overfitting indicated by the abruptness of the interpolations jumping from face (mode) to face (mode) and the lack of meaningful intermediate faces, in addition to the overall blurriness & low visual quality.
2019-01-17, Danbooru2017 512px SFW images, ProGAN:
512px SFW Danbooru2017, ProGAN
2019-02-05 (stopped in order to train with the new StyleGAN codebase), the 512px anime face dataset used elsewhere, ProGAN:
512px anime faces, ProGAN
Interpolation video of the 2019-02-05 512px anime face ProGAN; while the image quality is low, the diversity is good & shows no overfitting/memorization or blatant mode collapse.
Downloads:
Glow
I used the official implementation of Glow (Kingma & Dhariwal 2018).
Due to the enormous model size (4.2GB), I had to modify Glow's settings to get training working reasonably well, after extensive tinkering to figure out what any of the settings meant:
{"verbose": true, "restore_path": "logs/model_4.ckpt", "inference": false, "logdir": "./logs", "problem": "asuka",
"category": "", "data_dir": "../glow/data/asuka/", "dal": 2, "fmap": 1, "pmap": 16, "n_train": 20000, "n_test": 1000,
"n_batch_train": 16, "n_batch_test": 50, "n_batch_init": 16, "optimizer": "adamax", "lr": 0.0005, "beta1": 0.9,
"polyak_epochs": 1, "weight_decay": 1.0, "epochs": 1000000, "epochs_warmup": 10, "epochs_full_valid": 3,
"gradient_checkpointing": 1, "image_size": 512, "anchor_size": 128, "width": 512, "depth": 13, "weight_y": 0.0,
"n_bits_x": 8, "n_levels": 7, "n_sample": 16, "epochs_full_sample": 5, "learntop": false, "ycond": false, "seed": 0,
"flow_permutation": 2, "flow_coupling": 1, "n_y": 1, "rnd_crop": false, "local_batch_train": 1, "local_batch_test": 1,
"local_batch_init": 1, "direct_iterator": true, "train_its": 1250, "test_its": 63, "full_test_its": 1000, "n_bins": 256.0, "top_shape": [4, 4, 768]}
...
{"epoch": 5, "n_processed": 100000, "n_images": 6250, "train_time": 14496, "loss": "2.0090", "bits_x": "2.0090", "bits_y": "0.0000", "pred_loss": "1.0000"}
An additional challenge was numerical instability in the reversing of matrices, giving rise to many ‘invertibility’ crashes.
Final sample before I looked up the compute requirements more carefully & gave up on Glow:

MSG-GAN
MSG-GAN official implementation:

PokeGAN
nshepperd’s (unpublished) multi-scale GAN with self-attention layers, spectral normalization, and a few other tweaks:

Self-Attention-GAN-TensorFlow
SAGAN did not have an official implementation released at the time so I used the Junho Kim implementation; 128px SAGAN, WGAN-LP loss, on Asuka faces & whole Asuka images:


VGAN
The official VGAN code for Peng et al 2018 had not been released when I began trying VGAN, so I used akanimax’s implementation.
The variational discriminator bottleneck, along with self-attention layers and progressive growing, is one of the few strategies which permit 512px images, and I was intrigued to see that it worked relatively well, although I ran into persistent issues with instability & mode collapse. I suspect that VGAN could’ve worked better than it did with some more work.

BigGAN unofficial
Brock et al 2018's official implementation & models were not released until late March 2019 (nor the semi-official compare_gan implementation until February 2019), and I experimented with 2 unofficial implementations in late 2018–early 2019.
BigGAN-TensorFlow
Junho Kim implementation; 128px spectral norm hinge loss, anime faces:

This one never worked well at all, and I am still puzzled what went wrong.
BigGAN-PyTorch
Aaron Leong's PyTorch BigGAN implementation (not the official BigGAN implementation). As it's class-conditional, I faked having 1000 classes by constructing a variant anime face dataset: taking the top 1000 characters by tag count in the Danbooru2017 metadata, I filtered for those character tags one by one and copied & cropped the faces into matching subdirectories 1–1000 (a rough sketch of this procedure follows below). This let me try out both faces & whole images. I also attempted to hack in gradient accumulation for big minibatches to make it a true BigGAN implementation, but it didn't help much; the problem here might simply have been that I couldn't run it long enough.
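A rough sketch of that construction in shell (the jq filter assumes the Danbooru metadata's tag objects carry a category field where "4" marks character tags; paths & filenames are illustrative):
# 1. rank character tags by frequency and keep the top 1000:
cat metadata/2017* \
  | jq -r '.tags[] | select(.category == "4") | .name' \
  | sort | uniq -c | sort -rn | head -n 1000 | awk '{print $2}' > top1000-characters.txt

# 2. give each character a numbered class subdirectory and fill it with its cropped faces:
i=0
while read -r char; do
    i=$((i+1))
    mkdir -p characters-1k/"$i"
    # ...copy the images tagged with "$char" into characters-1k/"$i" and crop faces with
    # the face-cropping script described earlier...
done < top1000-characters.txt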
Results upon abandoning:


GAN-QP
Implementation of Su 2018:

Training oscillated enormously, with all the samples closely linked and changing simultaneously. This was despite the checkpoint model being enormous (551MB), and I suspect something was seriously wrong—either the model architecture was wrong (too many layers or filters?) or the learning rate was many orders of magnitude too large. Because of the small minibatch, progress was difficult to make in a reasonable amount of wallclock time, so I moved on.
WGAN
WGAN-GP official implementation; I did most of the early anime face work with WGAN on a different machine and didn’t keep copies. However, a sample from a short run gives an idea of what WGAN tended to look like on anime runs:

IntroVAE
IntroVAE is a hybrid GAN-VAE architecture introduced in mid-2018 by “IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis”, Huang et al 2018, with the official PyTorch implementation released in April 2019. It attempts to reuse the encoder-decoder for an adversarial loss as well, to combine the best of both worlds: the principled stable training & reversible encoder of the VAE with the sharpness & high quality of a GAN.
Quality-wise, they show IntroVAE works on CelebA & LSUN BEDROOM at up to 1024px resolution with results they claim are comparable to ProGAN. Performance-wise, for 512px, they give a runtime of 7 days with a minibatch n = 12, or presumably 4 GPUs (since their 1024px run script implies they used 4 GPUs and I can fit a minibatch of n = 4 onto 1×1080ti, so 4 GPUs would be consistent with n = 12), and so 28 GPU-days.
I adapted the 256px suggested settings for my 512px anime portraits dataset:
python main.py --hdim=512 --output_height=512 --channels='32, 64, 128, 256, 512, 512, 512' --m_plus=120 \
--weight_rec=0.05 --weight_kl=1.0 --weight_neg=0.5 --num_vae=0 \
--dataroot=/media/gwern/Data2/danbooru2018/portrait/1/ --trainsize=302652 --test_iter=1000 --save_iter=1 \
--start_epoch=0 --batchSize=4 --nrow=8 --lr_e=0.0001 --lr_g=0.0001 --cuda --nEpochs=500
# ...====> Cur_iter: [187060]: Epoch [3] (5467/60531): time: 142675: Rec: 19569, Kl_E: 162, 151, 121, Kl_G: 151, 121,
There was a minor bug in the codebase where it would crash when trying to print out the log data: perhaps because it assumes multi-GPU and I was running on 1 GPU, it tried to index into an array which was actually a simple scalar. I fixed it by removing the indexing:
- info += 'Rec: {:.4f}, '.format(loss_rec.data[0])
- info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data[0],
- lossE_rec_kl.data[0], lossE_fake_kl.data[0])
- info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data[0], lossG_fake_kl.data[0])
-
+
+ info += 'Rec: {:.4f}, '.format(loss_rec.data)
+ info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data,
+ lossE_rec_kl.data, lossE_fake_kl.data)
+ info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data, lossG_fake_kl.data)
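On more recent PyTorch versions, assuming these losses are 0-dimensional tensors, the more idiomatic fix would be .item() rather than bare .data, eg:
info += 'Rec: {:.4f}, '.format(loss_rec.item())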
Sample results after ~1.7 GPU-days:

By this point, StyleGAN would have been generating recognizable faces from scratch, while the IntroVAE random samples are not even face-like, and the IntroVAE training curve was not improving at a notable rate. IntroVAE has some hyperparameters which could probably be tuned better for the anime portrait faces (they briefly discuss the use of the --num_vae
option to run in classic VAE mode to let you tune the VAE-related hyperparameters before enabling the GAN-like part), but it should be fairly insensitive overall to hyperparameters and unlikely to help all that much. So IntroVAE probably can’t replace StyleGAN (yet?) for general-purpose image synthesis. This demonstrates again that it seems like everything works on CelebA these days and just because something works on a photographic dataset does not mean it’ll work on other datasets. Image generation papers should probably branch out some more and consider non-photographic tests.
Link Bibliography
Bibliography of page links in reading order (with annotations when available):
“A Style-Based Generator Architecture for Generative Adversarial Networks”, (2018-12-12):
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
“Danbooru2019 Portraits”, (2019):
Danbooru2019 Portraits is a dataset of n = 302,652 (16GB) 512px anime faces cropped from ‘solo’ SFW Danbooru2019 images in a relatively broad ‘portrait’ style encompassing necklines/ears/hats/etc rather than tightly focused on the face, upscaled to 512px as necessary, and low-quality images deleted by manual review using Discriminator ranking. This dataset has been used for creating TWDNE.
“ThisWaifuDoesNotExist.net”, (2019-02-19):
ThisWaifuDoesNotExist.net (TWDNE) is a static website which uses JS to display random anime faces generated by StyleGAN neural networks, along with GPT-3-generated anime plot summaries.
A screenshot of “This Waifu Does Not Exist” (TWDNE) showing a random StyleGAN-generated anime face and a random GPT-3 text sample conditioned on anime keywords/phrases.
“This Waifu Does Not Exist”, (2019-02-19):
Generating high-quality anime faces has long been a task neural networks struggled with. The invention of StyleGAN in 2018 has effectively solved this task and I have trained a StyleGAN model which can generate high-quality anime faces at 512px resolution. To show off the recent progress, I made a website, “This Waifu Does Not Exist” for displaying random StyleGAN 2 faces. TWDNE displays a different neural-net-generated face & plot summary every 15s. The site was popular and went viral online, especially in China. The model can also be used interactively for exploration & editing in the Artbreeder online service.
TWDNE faces have been used as screensavers, user avatars, character art for game packs or online games, uploaded to Pixiv, given away in streams, and used in a research paper (Noguchi & Harada 2019). TWDNE results also helped inspire Sizigi Studio’s online interactive waifu GAN, Waifu Labs, which generates even better anime faces than my StyleGAN results.
“Generative adversarial network”, (2020-12-22):
A generative adversarial network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game.
“Generative Adversarial Networks”, (2014-06-10):
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
“Improved Techniques for Training GANs”, (2016-06-10):
We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. Finally, we present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.
“RNN metadata for mimicking individual author style”, (2015-09-12):
Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author. I further try & fail to train a char-RNN on Geocities HTML for unclear reasons.
More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs, the Transformer NN architecture, by finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.
“Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, (2015-11-19):
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks—demonstrating their applicability as general image representations.
“Asuka Langley Soryu”, (2021-01-02):
Asuka Langley Soryu is a fictional character in the Neon Genesis Evangelion franchise. Within the series, she is designated as the Second Child and the pilot of the Evangelion Unit 02. Her surname is romanized as Soryu in the English manga and Sohryu in the English version of the TV series, the English version of the film, and on Gainax's website. Asuka is voiced by Yūko Miyamura in Japanese in all animated appearances and merchandise. In English, Asuka is voiced by Tiffany Grant in the ADV Films dub and by Stephanie McKeon in the Netflix dub. In the Rebuild of Evangelion films, her Japanese surname is changed to Shikinami (式波). In a Newtype poll from March 2010, Asuka was voted as the third most popular female anime character from the 1990s.
“Neon Genesis Evangelion”, (2020-12-28):
Neon Genesis Evangelion is a Japanese mecha anime television series produced by Gainax and Tatsunoko Production, directed by Hideaki Anno and broadcast on TV Tokyo from October 1995 to March 1996. The cast included Megumi Ogata as Shinji Ikari, Kotono Mitsuishi as Misato Katsuragi, Megumi Hayashibara as Rei Ayanami, and Yūko Miyamura as Asuka Langley Soryu. Music for the series was composed by Shirō Sagisu.
“Danbooru2020: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, (2015-12-15):
Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image or tag count or restricted distribution. I suggest that the “image-boorus” be used. The image boorus are longstanding web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations. The best known booru, with a focus on quality, is Danbooru. We provide a torrent/rsync mirror which contains ~3.4tb of 4.22m images with 130m tag instances (of 434k defined tags, ~30/image) covering Danbooru from 2005-05-24–2020-12-31 (final ID: #4,279,845), providing the image files & a JSON export of the metadata. We also provide a smaller torrent of SFW images downscaled to 512×512px JPGs (0.37tb; 3,227,715 images) for convenience. Our hope is that the Danbooru2020 dataset can be used for rich large-scale classification/tagging & learned embeddings, test out the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, provide an archival backup for the Danbooru community, feed back metadata improvements & corrections, and serve as a testbed for advanced techniques such as conditional image generation or style transfer.
“StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”, (2016-12-10):
Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.
“StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks”, (2017-10-19):
Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still face challenges in generating high quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) aiming at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture, StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of the object based on given text description, yielding low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture, StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generators and discriminators in a tree-like structure; images at multiple scales corresponding to the same scene are generated from different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximating multiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantly outperform other state-of-the-art methods in generating photo-realistic images.
https://towardsdatascience.com/auto-regressive-generative-models-pixelrnn-pixelcnn-32d192911173
“Artbreeder”, (2019-09-09):
[Artbreeder is an interactive GAN generator website. Originally named “Ganbreeder” and providing only the 256px BigGAN generator, it now provides a variety of BigGAN & StyleGAN models, including the anime portrait StyleGAN model. (It is more general than the similar Waifu Labs, but my anime model is not as good.) Users can generate random samples and explore slight variants of them to gradually explore the “latent space” and find interesting images, but they can also edit images more directly, upload existing images to find the most similar image produced by the model, etc. A popular website, it has generated >56m images from September 2019 to January 2020.]
“Stabilizing Generative Adversarial Networks: A Survey”, (2019-09-30):
Generative Adversarial Networks (GANs) are a type of generative model which have received much attention due to their ability to model complex real-world data. Despite their recent successes, the process of training GANs remains challenging, suffering from instability problems such as non-convergence, vanishing or exploding gradients, and mode collapse. In recent years, a diverse set of approaches have been proposed which focus on stabilizing the GAN training procedure. The purpose of this survey is to provide a comprehensive overview of the GAN training stabilization methods which can be found in the literature. We discuss the advantages and disadvantages of each approach, offer a comparative summary, and conclude with a discussion of open problems.
“Synthesizing Programs for Images using Reinforced Adversarial Learning”, (2018-04-03):
Advances in deep generative networks have led to impressive results in recent years. Nevertheless, such models can often waste their capacity on the minutiae of datasets, presumably due to weak inductive biases in their decoders. This is where graphics engines may come in handy since they abstract away low-level details and represent images as high-level programs. Current methods that combine deep learning and renderers are limited by hand-crafted likelihood or distance functions, a need for large amounts of supervision, or difficulties in scaling their inference algorithms to richer datasets. To mitigate these issues, we present SPIRAL, an adversarially trained agent that generates a program which is executed by a graphics engine to interpret and sample images. The goal of this agent is to fool a discriminator network that distinguishes between real and rendered data, trained with a distributed reinforcement learning setup without any supervision. A surprising finding is that using the discriminator’s output as a reward signal is the key to allow the agent to make meaningful progress at matching the desired output rendering. To the best of our knowledge, this is the first demonstration of an end-to-end, unsupervised and adversarial inverse graphics agent on challenging real world (MNIST, OMNIGLOT, CELEBA) and synthetic 3D datasets. A video of the agent can be found at YouTube.
“CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms”, (2017-06-21):
We propose a new system for generating art. The system generates art by looking at art and learning about style; and becomes creative by increasing the arousal potential of the generated art by deviating from the learned styles. We build over Generative Adversarial Networks (GAN), which have shown the ability to learn to generate novel images simulating a given distribution. We argue that such networks are limited in their ability to generate creative products in their original design. We propose modifications to its objective to make it capable of generating creative art by maximizing deviation from established styles and minimizing deviation from art distribution. We conducted experiments to compare the response of human subjects to the generated art with their response to art created by artists. The results show that human subjects could not distinguish art generated by the proposed system from art generated by contemporary artists and shown in top art fairs. Human subjects even rated the generated images higher on various scales.
“Style2Paints GitHub repository”, (2018-05-04):
Github repo with screenshot samples of style2paints, a neural network for colorizing anime-style illustrations (trained on Danbooru2018), with or without user color hints, which was available as an online service in 2018. style2paints produces high-quality colorizations often on par with human colorizations. Many examples can be seen on Twitter or the Github repo:
Example style2paints colorization of a character from Prison School
style2paints has been described in more detail in “Two-Stage Sketch Colorization”, Zhang et al 2018:
Sketch or line art colorization is a research field with significant market demand. Different from photo colorization which strongly relies on texture information, sketch colorization is more challenging as sketches may not have texture. Even worse, color, texture, and gradient have to be generated from the abstract sketch lines. In this paper, we propose a semi-automatic learning-based framework to colorize sketches with proper color, texture as well as gradient. Our framework consists of two stages. In the first drafting stage, our model guesses color regions and splashes a rich variety of colors over the sketch to obtain a color draft. In the second refinement stage, it detects the unnatural colors and artifacts, and try to fix and refine the result. Comparing to existing approaches, this two-stage design effectively divides the complex colorization task into two simpler and goal-clearer subtasks. This eases the learning and raises the quality of colorization. Our model resolves the artifacts such as water-color blurring, color distortion, and dull textures.
We build an interactive software based on our model for evaluation. Users can iteratively edit and refine the colorization. We evaluate our learning model and the interactive system through an extensive user study. Statistics shows that our method outperforms the state-of-art techniques and industrial applications in several aspects including the visual quality, the ability of user control, user experience, and other metrics.
“Towards the Automatic Anime Characters Creation with Generative Adversarial Networks”, (2017-08-18):
Automatic generation of facial images has been well studied after the Generative Adversarial Network (GAN) came out. There exists some attempts applying the GAN model to the problem of generating facial images of anime characters, but none of the existing work gives a promising result. In this work, we explore the training of GAN models specialized on an anime facial image dataset. We address the issue from both the data and the model aspect, by collecting a more clean, well-suited dataset and leverage proper, empirical application of DRAGAN. With quantitative analysis and case studies we demonstrate that our efforts lead to a stable and high-quality model. Moreover, to assist people with anime character design, we build a website (http://make.girls.moe) with our pre-trained model available online, which makes the model easily accessible to general public.
“Illustration2Vec: a semantic vector representation of illustrations”, (2015-11-02):
Referring to existing illustrations helps novice drawers to realize their ideas. To find such helpful references from a large image collection, we first build a semantic vector representation of illustrations by training convolutional neural networks. As the proposed vector space correctly reflects the semantic meanings of illustrations, users can efficiently search for references with similar attributes. Besides the search with a single query, a semantic morphing algorithm that searches the intermediate illustrations that gradually connect two queries is proposed. Several experiments were conducted to demonstrate the effectiveness of our methods. [Keywords: illustration, CNNs, visual similarity, search]
https://old.reddit.com/r/MachineLearning/comments/akbc11/p_tag_estimation_for_animestyle_girl_image/
“NoGAN: Decrappification, DeOldification, and Super Resolution”, (2019-05-03):
Generative models are models that generate music, images, text, and other complex data types. In recent years generative models have advanced at an astonishing rate, largely due to deep learning, and particularly due to generative adversarial models (GANs). However, GANs are notoriously difficult to train, due to requiring a large amount of data, needing many GPUs and a lot of time to train, and being highly sensitive to minor hyperparameter changes.
fast.ai has been working in recent years towards making a range of models easier and faster to train, with a particular focus on using transfer learning. Transfer learning refers to pre-training a model using readily available data and quick and easy to calculate loss functions, and then fine-tuning that model for a task that may have fewer labels, or be more expensive to compute. This seemed like a potential solution to the GAN training problem, so in late 2018 fast.ai worked on a transfer learning technique for generative modeling.
The pre-trained model that fast.ai selected was this: Start with an image dataset and “crappify” the images, such as reducing the resolution, adding jpeg artifacts, and obscuring parts with random text. Then train a model to “decrappify” those images to return them to their original state. fast.ai started with a model that was pre-trained for ImageNet classification, and added a U-Net upsampling network, adding various modern tweaks to the regular U-Net. A simple fast loss function was initially used: mean squared pixel error. This U-Net could be trained in just a few minutes. Then, the loss function was replaced with a combination of other loss functions used in the generative modeling literature (more details in the f8 video) and trained for another couple of hours. The plan was then to finally add a GAN for the last few epochs—however it turned out that the results were so good that fast.ai ended up not using a GAN for the final models.…
NoGAN Training
NoGAN is a new and exciting technique in GAN training that we developed, in pursuit of higher quality and more stable renders. How, and how well, it works is a bit surprising.
Here is the NoGAN training process:
Pretrain the Generator. The generator is first trained in a more conventional and easier to control manner—with Perceptual Loss (aka Feature Loss) by itself. GAN training is not introduced yet. At this point you’re training the generator as best as you can in the easiest way possible. This takes up most of the time in NoGAN training. Keep in mind: this pretraining by itself will get the generator model far. Colorization will be well-trained as a task, albeit the colors will tend toward dull tones. Self-Attention will also be well-trained at this stage, which is very important.
Save Generated Images From Pretrained Generator.
Pretrain the Critic as a Binary Classifier. Much like in pretraining the generator, what we aim to achieve in this step is to get as much training as possible for the critic in a more “conventional” manner which is easier to control. And there’s nothing easier than a binary classifier! Here we’re training the critic as a binary classifier of real and fake images, with the fake images being those saved in the previous step. A helpful thing to keep in mind here is that you can simply use a pre-trained critic used for another image-to-image task and refine it. This has already been done for super-resolution, where the critic’s pretrained weights were loaded from that of a critic trained for colorization. All that is needed to make use of the pre-trained critic in this case is a little fine-tuning.
Train Generator and Critic in (Almost) Normal GAN Setting. Quickly! This is the surprising part. It turns out that in this pretraining scenario, the critic will rapidly drive adjustments in the generator during GAN training. This happens during a narrow window of time before an “inflection point” of sorts is hit. After this point, there seems to be little to no benefit in training any further in this manner. In fact, if training is continued after this point, you’ll start seeing artifacts and glitches introduced in renderings.
In the case of DeOldify, training to this point requires iterating through only about 1% to 3% of ImageNet data (or roughly 2600 to 7800 iterations on a batch size of five). This amounts to just around 30–90 minutes of GAN training, which is in stark contrast to the three to five days of progressively-sized GAN training that was done previously. Surprisingly, during that short amount of training, the change in the quality of the renderings is dramatic. In fact, this makes up the entirety of GAN training for the video model. The “artistic” and “stable” models go one step further and repeat the NoGAN training process steps 2–4 until there’s no more apparent benefit (around five repeats).
Note: a small but significant change to this GAN training that deviates from conventional GANs is the use of a loss threshold that must be met by the critic before generator training commences. Until then, the critic continues training to “catch up” in order to be able to provide the generator with constructive gradients. This catch up chiefly takes place at the beginning of GAN training which immediately follows generator and critic pretraining.
https://blogs.nvidia.com/blog/2019/03/18/gaugan-photorealistic-landscapes-nvidia-research/
“Semantic Image Synthesis with Spatially-Adaptive Normalization”, (2019-03-18):
We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to “wash away” semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantic and style. Code is available at https://github.com/NVlabs/SPADE .
https://safebooru.org/index.php?page=post&s=list&tags=heterochromia
“Progressive Growing of GANs for Improved Quality, Stability, and Variation”, (2017-10-27):
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
“Improved Precision and Recall Metric for Assessing Generative Models”, (2019-04-15):
The ability to automatically estimate the quality and coverage of the samples produced by a generative model is a vital requirement for driving algorithm research. We present an evaluation metric that can separately and reliably measure both of these aspects in image generation tasks by forming explicit, non-parametric representations of the manifolds of real and generated data. We demonstrate the effectiveness of our metric in StyleGAN and BigGAN by providing several illustrative examples where existing metrics yield uninformative or contradictory results. Furthermore, we analyze multiple design variants of StyleGAN to better understand the relationships between the model architecture, training methods, and the properties of the resulting sample distribution. In the process, we identify new variants that improve the state-of-the-art. We also perform the first principled analysis of truncation methods and identify an improved method. Finally, we extend our metric to estimate the perceptual quality of individual samples, and use this to study latent space interpolations.
https://github.com/ak9250/stylegan-art/blob/master/styleganportraits.ipynb
“On Self Modulation for Generative Adversarial Networks”, (2018-10-02):
Training Generative Adversarial Networks (GANs) is notoriously challenging. We propose and study an architectural modification, self-modulation, which improves GAN performance across different data sets, architectures, losses, regularizers, and hyperparameter settings. Intuitively, self-modulation allows the intermediate feature maps of a generator to change as a function of the input noise vector. While reminiscent of other conditioning techniques, it requires no labeled data. In a large-scale empirical study we observe a relative decrease in FID. Furthermore, all else being equal, adding this modification to the generator leads to improved performance in the large majority of the studied settings. Self-modulation is a simple architectural change that requires no additional parameter tuning, which suggests that it can be applied readily to any GAN.
“A Neural Algorithm of Artistic Style”, (2015-08-26):
In fine art, especially painting, humans have mastered the skill to create unique visual experiences through composing a complex interplay between the content and style of an image. Thus far the algorithmic basis of this process is unknown and there exists no artificial system with similar capabilities. However, in other key areas of visual perception such as object and face recognition near-human performance was recently demonstrated by a class of biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, our work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.
“Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization”, (2017-03-20):
Gatys et al. recently introduced a neural algorithm that renders a content image in the style of another image, achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which limits its practical application. Fast approximations with feed-forward neural networks have been proposed to speed up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed set of styles and cannot adapt to arbitrary new styles. In this paper, we present a simple yet effective approach that for the first time enables arbitrary style transfer in real-time. At the heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features. Our method achieves speed comparable to the fastest existing approach, without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as content-style trade-off, style interpolation, color & spatial controls, all using a single feed-forward neural network.
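The AdaIN operation itself is only a few lines: normalize the content features per channel, then rescale and shift them with the style features' channel statistics. A hedged PyTorch-style sketch (NCHW tensor layout assumed; this is not the authors' code):

```python
# Adaptive instance normalization (AdaIN) sketch: align channel-wise mean/std of
# content features to those of style features.
import torch

def adain(content, style, eps=1e-5):
    # content, style: (N, C, H, W) feature maps
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```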
https://github.com/martinarjovsky/WassersteinGAN/issues/2#issuecomment-278710552
"The relativistic discriminator: a key element missing from standard GAN", (2018-07-02):
In standard generative adversarial network (SGAN), the discriminator estimates the probability that the input data is real. The generator is trained to increase the probability that fake data is real. We argue that it should also simultaneously decrease the probability that real data is real because 1) this would account for a priori knowledge that half of the data in the mini-batch is fake, 2) this would be observed with divergence minimization, and 3) in optimal settings, SGAN would be equivalent to integral probability metric (IPM) GANs.
We show that this property can be induced by using a relativistic discriminator which estimates the probability that a given real sample is more realistic than a randomly sampled fake one. We also present a variant in which the discriminator estimates the probability that a given real sample is more realistic than fake data on average. We generalize both approaches to non-standard GAN loss functions and refer to them respectively as Relativistic GANs (RGANs) and Relativistic average GANs (RaGANs). We show that IPM-based GANs are a subset of RGANs which use the identity function.
Empirically, we observe that 1) RGANs and RaGANs are significantly more stable and generate higher-quality data samples than their non-relativistic counterparts, 2) standard RaGAN with gradient penalty generates data of better quality than WGAN-GP while requiring only a single discriminator update per generator update (reducing the time needed to reach the state of the art by roughly 400%), and 3) RaGANs are able to generate plausible high-resolution images (256×256) from a very small sample (n = 2011), while GAN and LSGAN cannot; these images are of significantly better quality than the ones generated by WGAN-GP and SGAN with spectral normalization.
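Concretely, the relativistic average loss is just a cross-entropy on "relative" logits (each score compared to the mean score of the other class). The sketch below is a simplification assuming raw, pre-sigmoid discriminator outputs; it is not the paper's reference implementation:

```python
# Relativistic average GAN (RaGAN) discriminator loss sketch.
import torch
import torch.nn.functional as F

def ragan_d_loss(d_real, d_fake):
    # How much more "real" each real sample looks than the average fake, and vice versa.
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    return (F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel)) +
            F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel)))
```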
“Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, (2018-10-01):
Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator’s internal representation, we can effectively modulate the discriminator’s accuracy and maintain useful and informative gradients. We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from raw video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods.
“Rectified Gaussian distribution”, (2020-12-22):
In probability theory, the rectified Gaussian distribution is a modification of the Gaussian distribution when its negative elements are reset to 0. It is essentially a mixture of a discrete distribution and a continuous distribution as a result of censoring.
“Truncated normal distribution”, (2021-01-02):
In probability and statistics, the truncated normal distribution is the probability distribution derived from that of a normally distributed random variable by bounding the random variable from either below or above. The truncated normal distribution has wide applications in statistics and econometrics. For example, it is used to model the probabilities of the binary outcomes in the probit model and to model censored data in the Tobit model.
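In the GAN context, a truncated normal is effectively what the BigGAN/StyleGAN "truncation trick" samples from: latent components whose magnitude exceeds a threshold are redrawn, trading diversity for sample quality. A minimal rejection-resampling sketch (the threshold value and NumPy usage are illustrative assumptions):

```python
# Rejection-resampling sketch of a truncated normal, as used in the 'truncation trick':
# redraw any latent component whose magnitude exceeds the threshold.
import numpy as np

def truncated_normal(size, threshold=1.0, rng=np.random):
    z = rng.standard_normal(size)
    while True:
        mask = np.abs(z) > threshold
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())  # resample only the out-of-range entries
```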
“Spectral Norm Regularization for Improving the Generalizability of Deep Learning”, (2017-05-31):
We investigate the generalizability of deep learning based on the sensitivity to input perturbation. We hypothesize that the high sensitivity to the perturbation of data degrades the performance on it. To reduce the sensitivity to perturbation, we propose a simple and effective regularization method, referred to as spectral norm regularization, which penalizes the high spectral norm of weight matrices in neural networks. We provide supportive evidence for the abovementioned hypothesis by experimentally confirming that the models trained using spectral norm regularization exhibit better generalizability than other baseline methods.
“Spectral Normalization for Generative Adversarial Networks”, (2018-02-16):
One of the challenges in the study of generative adversarial networks is the instability of its training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on CIFAR10, STL-10, and ILSVRC2012 dataset, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) is capable of generating images of better or equal quality relative to the previous training stabilization techniques.
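The mechanism is easy to sketch: estimate the weight matrix's largest singular value with a step or two of power iteration, then divide the weights by that estimate. The following NumPy sketch is illustrative, not the authors' implementation (which folds this into each layer's forward pass and persists the vector `u` across training steps):

```python
# Spectral normalization via power iteration: scale W so its spectral norm is ~1.
import numpy as np

def spectrally_normalize(W, u, n_iter=1, eps=1e-12):
    """W: (out, in) weight matrix; u: persistent estimate of the left singular vector."""
    for _ in range(n_iter):
        v = W.T @ u
        v /= (np.linalg.norm(v) + eps)
        u = W @ v
        u /= (np.linalg.norm(u) + eps)
    sigma = u @ W @ v          # estimate of the largest singular value
    return W / sigma, u        # return normalized weights and updated u for reuse
```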
“This Person Does Not Exist”, (2019-02-12):
[This Person Does Not Exist is a StyleGAN-based noninteractive website, which uses the Nvidia-trained FFHQ face StyleGAN model to generate realistic 1024px faces (upgraded to StyleGAN2 in 2020). One face is displayed at a time, and a new face automatically generated every few seconds. It inspired a rash of copycats, including This Waifu Does Not Exist.]
Recently a talented group of researchers at Nvidia released the current state of the art generative adversarial network, StyleGAN, over at https://github.com/NVlabs/stylegan I have decided to dig into my own pockets and raise some public awareness for this technology. Faces are most salient to our cognition, so I’ve decided to put that specific pretrained model up. Their research group have also included pretrained models for cats, cars, and bedrooms in their repository that you can immediately use. Each time you refresh the site, the network will generate a new facial image from scratch from a 512 dimensional vector.
https://thecleverest.com/judgefakepeople/main.php?sort=highest
https://nitter.net/MichaelFriese10/status/1151236302559305728
"Waifu Labs", (2019-07-23):
[Waifu Labs is an interactive website for generating (1024px?) anime faces using a customized StyleGAN trained on Danbooru2018. Similar to Artbreeder, it supports face exploration and face editing, and at the end, a user can purchase prints of a particular face.]
We taught a world-class artificial intelligence how to draw anime. All the drawings you see were made by a non-human artist! Wild, right? It turns out machines love waifus almost as much as humans do. We proudly present the next chapter of human history: lit waifu commissions from the world's smartest AI artist. In less than 5 minutes, the artist learns your preferences to make the perfect waifu just for you.
https://github.com/a312863063/seeprettyface-ganerator-dongman
https://nitter.net/highqualitysh1t/status/1095699293011435520
https://nitter.net/MichaelFriese10/status/1127614400750346240
https://iguanamouth.tumblr.com/post/158982472537/pokemon-generated-by-neural-network
https://medium.com/@robert.munro/creating-new-scripts-with-stylegan-c16473a50fd0
https://nitter.net/mattjarviswall/status/1110548997729452035
https://old.reddit.com/r/computervision/comments/bfcnbj/p_stylegan_on_oxford_visual_geometry_group/
https://old.reddit.com/r/MachineLearning/comments/bkrn3i/p_stylegan_trained_on_album_covers/
https://nitter.net/realmeatyhuman/status/1233084317032681483
https://nitter.net/MichaelFriese10/status/1130604229372997632
https://nitter.net/MichaelFriese10/status/1132777932802236417
http://digital-thinking.de/watchgan-advancing-generated-watch-images-with-stylegans/
https://evigio.com/post/generating-new-watch-designs-with-stylegan
https://old.reddit.com/r/MediaSynthesis/comments/ea5qoy/butterflies_generated_with_stylegan/
"Trypophobia", (2021-01-02):
Trypophobia is an aversion to the sight of irregular patterns or clusters of small holes or bumps. It is not officially recognized as a mental disorder, but may be diagnosed as a specific phobia if excessive fear and distress occur. People may express only disgust to trypophobic imagery.
“End-to-End Chinese Landscape Painting Creation Using Generative Adversarial Networks”, (2020-11-11):
Current GAN-based art generation methods produce unoriginal artwork due to their dependence on conditional input. Here, we propose Sketch-And-Paint GAN (SAPGAN), the first model which generates Chinese landscape paintings from end to end, without conditional input. SAPGAN is composed of two GANs: SketchGAN for generation of edge maps, and PaintGAN for subsequent edge-to-painting translation. Our model is trained on a new dataset of traditional Chinese landscape paintings never before used for generative research. A 242-person Visual Turing Test study reveals that SAPGAN paintings are mistaken as human artwork with 55% frequency, significantly outperforming paintings from baseline GANs. Our work lays a groundwork for truly machine-original art generation.
https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003wk#bboard_content
"Are GANs Created Equal? A Large-Scale Study", (2017-11-28):
Generative adversarial networks (GAN) are a powerful subclass of generative models. Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced in Goodfellow et al 2014.
“ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness”, (2018-11-29):
Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on "Stylized-ImageNet", a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.
“Robustness properties of Facebook's ResNeXt WSL models”, (2019-07-17):
We investigate the robustness properties of ResNeXt class image recognition models trained with billion scale weakly supervised data (ResNeXt WSL models). These models, recently made public by Facebook AI, were trained with 1B images from Instagram and fine-tuned on ImageNet. We show that these models display an unprecedented degree of robustness against common image corruptions and perturbations, as measured by the ImageNet-C and ImageNet-P benchmarks. They also achieve substantially improved accuracies on the recently introduced "natural adversarial examples" benchmark (ImageNet-A). The largest of the released models, in particular, achieves state-of-the-art results on ImageNet-C, ImageNet-P, and ImageNet-A by a large margin. The gains on ImageNet-C, ImageNet-P, and ImageNet-A far outpace the gains on ImageNet validation accuracy, suggesting the former as more useful benchmarks to measure further progress in image recognition. Remarkably, the ResNeXt WSL models even achieve a limited degree of adversarial robustness against state-of-the-art white-box attacks (10-step PGD attacks). However, in contrast to adversarially trained models, the robustness of the ResNeXt WSL models rapidly declines with the number of PGD steps, suggesting that these models do not achieve genuine adversarial robustness. Visualization of the learned features also confirms this conclusion. Finally, we show that although the ResNeXt WSL models are more shape-biased than comparable ImageNet-trained models in a shape-texture cue conflict experiment, they still remain much more texture-biased than humans, suggesting that they share some of the underlying characteristics of ImageNet-trained models that make this benchmark challenging.
“The Origins and Prevalence of Texture Bias in Convolutional Neural Networks”, (2019-11-20):
Recent work has indicated that, unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than by shape. How pervasive is this bias, and where does it come from? We find that, when trained on datasets of images with conflicting shape and texture, CNNs learn to classify by shape at least as easily as by texture. What factors, then, produce the texture bias in CNNs trained on ImageNet? Different unsupervised training objectives and different architectures have small but significant and largely independent effects on the level of texture bias. However, all objectives and architectures still lead to models that make texture-based classification decisions a majority of the time, even if shape information is decodable from their hidden representations. The effect of data augmentation is much larger. By taking less aggressive random crops at training time and applying simple, naturalistic augmentation (color distortion, noise, and blur), we train models that classify ambiguous images by shape a majority of the time, and outperform baselines on out-of-distribution test sets. Our results indicate that apparent differences in the way humans and ImageNet-trained CNNs process images may arise not primarily from differences in their internal workings, but from differences in the data that they see.
“Large Scale Adversarial Representation Learning”, (2019-07-04):
Adversarially trained generative models (GANs) have recently achieved compelling image synthesis results. But despite early successes in using GANs for unsupervised representation learning, they have since been superseded by approaches based on self-supervision. In this work we show that progress in image generation quality translates to substantially improved representation learning performance. Our approach, BigBiGAN, builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. We extensively evaluate the representation learning and generation capabilities of these BigBiGAN models, demonstrating that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation. Pretrained BigBiGAN models – including image generators and encoders – are available on TensorFlow Hub (https://tfhub.dev/s?publisher=deepmind&q=bigbigan).
“David Foster Wallace”, (2020-12-28):
David Foster Wallace was an American author of novels, short stories and essays, as well as a university professor of English and creative writing. Wallace is widely known for his 1996 novel Infinite Jest, which Time magazine cited as one of the 100 best English-language novels from 1923 to 2005. His posthumous novel, The Pale King (2011), was a finalist for the Pulitzer Prize for Fiction in 2012.
https://www.thefreelibrary.com/E+unibus+pluram%3A+television+and+U.S.+fiction.-a013952319
"Nearest neighbor search", (2020-12-22):
Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.
“Clearview AI”, (2021-01-02):
Clearview AI is an American technology company that provides facial recognition software, which is used by private companies, law enforcement agencies, universities and individuals. The company has developed technology that can match faces to a database of more than three billion images indexed from the Internet, including social media applications. Founded by Hoan Ton-That and Richard Schwartz, the company maintained a low profile until late 2019, when its usage by law enforcement was reported on. Multiple reports identified Clearview's association with far-right personas dating back to 2016, when the company claimed to sever ties with two employees.
“Spatially Controllable Image Synthesis with Internal Representation Collaging”, (2018-11-26):
We present a novel CNN-based image editing strategy that allows the user to change the semantic information of an image over an arbitrary region by manipulating the feature-space representation of the image in a trained GAN model. We will present two variants of our strategy: (1) spatial conditional batch normalization (sCBN), a type of conditional batch normalization with user-specifiable spatial weight maps, and (2) feature-blending, a method of directly modifying the intermediate features. Our methods can be used to edit both artificial image and real image, and they both can be used together with any GAN with conditional normalization layers. We will demonstrate the power of our method through experiments on various types of GANs trained on different datasets. Code will be available at https://github.com/pfnet-research/neural-collage.
“Rewriting a Deep Generative Model”, (2020-07-30):
A deep generative model such as a GAN learns to model a rich set of semantic and physical rules about the target distribution, but up to now, it has been obscure how such rules are encoded in the network, or how a rule could be changed. In this paper, we introduce a new problem setting: manipulation of specific rules encoded by a deep generative model. To address the problem, we propose a formulation in which the desired rule is changed by manipulating a layer of a deep network as a linear associative memory. We derive an algorithm for modifying one entry of the associative memory, and we demonstrate that several interesting structural rules can be located and modified within the layers of state-of-the-art generative models. We present a user interface to enable users to interactively change the rules of a generative model to achieve desired effects, and we show several proof-of-concept applications. Finally, results on multiple datasets demonstrate the advantage of our method against standard fine-tuning methods and edit transfer algorithms.
“Unsupervised Discovery of Interpretable Directions in the GAN Latent Space”, (2020-02-10):
The latent spaces of GAN models often have semantically meaningful directions. Moving in these directions corresponds to human-interpretable image transformations, such as zooming or recoloring, enabling a more controllable generation process. However, the discovery of such directions is currently performed in a supervised manner, requiring human labels, pretrained models, or some form of self-supervision. These requirements severely restrict a range of directions existing approaches can discover. In this paper, we introduce an unsupervised method to identify interpretable directions in the latent space of a pretrained GAN model. By a simple model-agnostic procedure, we find directions corresponding to sensible semantic manipulations without any form of (self-)supervision. Furthermore, we reveal several non-trivial findings, which would be difficult to obtain by existing methods, e.g., a direction corresponding to background removal. As an immediate practical benefit of our work, we show how to exploit this finding to achieve competitive performance for weakly-supervised saliency detection.
“Big GANs Are Watching You: Towards Unsupervised Object Segmentation with Off-the-Shelf Generative Models”, (2020-06-08):
Since collecting pixel-level groundtruth data is expensive, unsupervised visual understanding problems are currently an active research topic. In particular, several recent methods based on generative models have achieved promising results for object segmentation and saliency detection. However, since generative models are known to be unstable and sensitive to hyperparameters, the training of these methods can be challenging and time-consuming.
In this work, we introduce an alternative, much simpler way to exploit generative models for unsupervised object segmentation. First, we explore the latent space of BigBiGAN, the state-of-the-art unsupervised GAN whose parameters are publicly available. We demonstrate that object saliency masks for GAN-produced images can be obtained automatically with BigBiGAN. These masks are then used to train a discriminative segmentation model. Being very simple and easy to reproduce, our approach provides competitive performance on common benchmarks in the unsupervised scenario.
“Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?”, (2019-04-05):
We propose an efficient algorithm to embed a given image into the latent space of StyleGAN. This embedding enables semantic image editing operations that can be applied to existing photographs. Taking the StyleGAN trained on the FFHQ dataset as an example, we show results for image morphing, style transfer, and expression transfer. Studying the results of the embedding algorithm provides valuable insights into the structure of the StyleGAN latent space. We propose a set of experiments to test what class of images can be embedded, how they are embedded, what latent space is suitable for embedding, and if the embedding is semantically meaningful.
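The general recipe for such embeddings is gradient descent on a latent vector against a reconstruction loss. A hedged sketch of that outer loop in PyTorch (`G` and `perceptual` are placeholders for a pretrained generator and a perceptual-loss function; the zero initialization and hyperparameters are illustrative, not the paper's settings):

```python
# Sketch of projecting an image into a GAN's latent space by optimizing a latent vector
# to minimize a pixel + perceptual reconstruction loss.
import torch

def embed_image(G, target, perceptual, steps=1000, lr=0.01, latent_dim=512):
    w = torch.zeros(1, latent_dim, requires_grad=True)   # latent to be optimized
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = G(w)                                        # generated image from current latent
        loss = torch.nn.functional.mse_loss(img, target) + perceptual(img, target)
        loss.backward()
        opt.step()
    return w.detach()
```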
“Generative Adversarial Imitation Learning”, (2016-06-10):
Consider learning a policy from example expert behavior, without interaction with the expert or access to reinforcement signal. One approach is to recover the expert’s cost function with inverse reinforcement learning, then extract a policy from that cost function with reinforcement learning. This approach is indirect and can be slow. We propose a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning. We show that a certain instantiation of our framework draws an analogy between imitation learning and generative adversarial networks, from which we derive a model-free imitation learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments.
“ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks”, (2018-09-01):
The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN—network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge. The code is available at https://github.com/xinntao/ESRGAN .
“Creative Commons license”, (2021-01-02):
A Creative Commons (CC) license is one of several public copyright licenses that enable the free distribution of an otherwise copyrighted "work". A CC license is used when an author wants to give other people the right to share, use, and build upon a work that they have created. CC provides an author flexibility and protects the people who use or redistribute an author's work from concerns of copyright infringement as long as they abide by the conditions that are specified in the license by which the author distributes the work.
“Software patents and free software”, (2020-12-22):
Opposition to software patents is widespread in the free software community. In response, various mechanisms have been tried to defuse the perceived problem.
https://mega.nz/file/eQdHkShY#8wyNKs343L7YUjwXlEg3cWjqK2g2EAIdYz5xbkPy3ng
"Transformation (law)", (2020-12-22):
In United States copyright law, transformation is a possible justification that use of a copyrighted work may qualify as fair use, i.e., that a certain use of a work does not infringe its holder's copyright due to the public interest in the usage. Transformation is an important issue in deciding whether a use meets the first factor of the fair-use test, and is generally critical for determining whether a use is in fact fair, although no one factor is dispositive. Transformativeness is a characteristic of such derivative works that makes them transcend, or place in a new light, the underlying works on which they are based. In computer- and Internet-related works, the transformative characteristic of the later work is often that it provides the public with a benefit not previously available to it, which would otherwise remain unavailable. Such transformativeness weighs heavily in a fair use analysis and may excuse what seems a clear copyright infringement from liability.
“Monkey selfie copyright dispute”, (2021-01-02):
The monkey selfie copyright dispute is a series of disputes about the copyright status of selfies taken by Celebes crested macaques using equipment belonging to the British nature photographer David Slater. The disputes involve Wikimedia Commons and the blog Techdirt, which have hosted the images following their publication in newspapers in July 2011 over Slater's objections that he holds the copyright, and People for the Ethical Treatment of Animals (PETA), who have argued that the macaque should be assigned the copyright.
https://www.itmedia.co.jp/news/articles/1711/28/news020.html
https://scholarship.law.duke.edu/cgi/viewcontent.cgi?article=1023&context=dltr#.pdf
https://alj.artrepreneur.com/the-next-rembrandt-who-holds-the-copyright-in-computer-generated-art/
"The Machine As Author", (2019-03-24):
The use of Artificial Intelligence (AI) machines using deep learning neural networks to create material that facially looks like it should be protected by copyright is growing exponentially. From articles in national news media to music, film, poetry and painting, AI machines create material that has economic value and that competes with productions of human authors. The Article reviews both normative and doctrinal arguments for and against the protection by copyright of literary and artistic productions made by AI machines. The Article finds that the arguments in favor of protection are flawed and unconvincing and that a proper analysis of the history, purpose, and major doctrines of copyright law all lead to the conclusion that productions that do not result from human creative choices belong to the public domain. The Article proposes a test to determine which productions should be protected, including in case of collaboration between human and machine. Finally, the Article applies the proposed test to three specific fact patterns to illustrate its application. [Keywords: copyright, author, artificial intelligence, machine learning]
https://www.artnome.com/news/2019/3/27/why-is-ai-art-copyright-so-complicated
https://www.theverge.com/2019/4/17/18299563/ai-algorithm-music-law-copyright-human
https://creativecommons.org/share-your-work/public-domain/cc0/
"The Marriage of Heaven and Hell", (2021-01-02):
The Marriage of Heaven and Hell is a book by the English poet and printmaker William Blake. It is a series of texts written in imitation of biblical prophecy but expressing Blake's own intensely personal Romantic and revolutionary beliefs. Like his other books, it was published as printed sheets from etched plates containing prose, poetry and illustrations. The plates were then coloured by Blake and his wife Catherine.
“LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop”, (2015-06-10):
While there has been remarkable progress in the performance of visual recognition algorithms, the state-of-the-art models tend to be exceptionally data-hungry. Large labeled training datasets, expensive and tedious to produce, are required to optimize millions of parameters in deep network models. Lagging behind the growth in model capacity, the available datasets are quickly becoming outdated in terms of size and density. To circumvent this bottleneck, we propose to amplify human effort through a partially automated labeling scheme, leveraging deep learning with humans in the loop. Starting from a large set of candidate images for each category, we iteratively sample a subset, ask people to label them, classify the others with a trained model, split the set into positives, negatives, and unlabeled based on the classification confidence, and then iterate with the unlabeled set. To assess the effectiveness of this cascading procedure and enable further progress in visual recognition research, we construct a new image dataset, LSUN. It contains around one million labeled images for each of 10 scene categories and 20 object categories. We experiment with training popular convolutional networks and find that they achieve substantial performance gains when trained on this dataset.
“OpenCV”, (2021-01-02):
OpenCV is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage then Itseez. The library is cross-platform and free for use under the open-source Apache 2 License. Starting with 2011, OpenCV features GPU acceleration for real-time operations.
https://github.com/nagadomi/lbpcascade_animeface/issues/1#issue-205363706
https://github.com/nagadomi/animeface-2009/blob/master/animeface-ruby/face_collector.rb
https://towardsdatascience.com/animating-ganime-with-stylegan-part-1-4cf764578e
"BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis: 4.2 Characterizing Instability: The Discriminator", (2019-08-26):
We also observe that D’s loss approaches zero during training, but undergoes a sharp upward jump at collapse (Appendix F). One possible explanation for this behavior is that D is overfitting to the training set, memorizing training examples rather than learning some meaningful boundary between real and generated images. As a simple test for D’s memorization (related to Gulrajani et al. (2017)), we evaluate uncollapsed discriminators on the ImageNet training and validation sets, and measure what percentage of samples are classified as real or generated. While the training accuracy is consistently above 98%, the validation accuracy falls in the range of 50–55%, no better than random guessing (regardless of regularization strategy). This confirms that D is indeed memorizing the training set; we deem this in line with D’s role, which is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G. Additional experiments and discussion are provided in Appendix G.
“Deep reinforcement learning from human preferences”, (2017-06-12):
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
“Image Augmentations for GAN Training”, (2020-06-04):
Data augmentations have been widely studied to improve the accuracy and robustness of classifiers. However, the potential of image augmentation in improving GAN models for image synthesis has not been thoroughly investigated in previous studies. In this work, we systematically study the effectiveness of various existing augmentation techniques for GAN training in a variety of settings. We provide insights and guidelines on how to augment images for both vanilla GANs and GANs with regularizations, improving the fidelity of the generated images substantially. Surprisingly, we find that vanilla GANs attain generation quality on par with recent state-of-the-art results if we use augmentations on both real and generated images. When this GAN training is combined with other augmentation-based regularization techniques, such as contrastive loss and consistency regularization, the augmentations further improve the quality of generated images. We provide new state-of-the-art results for conditional generation on CIFAR-10 with both consistency loss and contrastive loss as additional regularizations.
“On Data Augmentation for GAN Training”, (2020-06-09):
Recent successes in Generative Adversarial Networks (GAN) have affirmed the importance of using more data in GAN training. Yet it is expensive to collect data in many domains such as medical applications. Data Augmentation (DA) has been applied in these applications. In this work, we first argue that the classical DA approach could mislead the generator to learn the distribution of the augmented data, which could be different from that of the original data. We then propose a principled framework, termed Data Augmentation Optimized for GAN (DAG), to enable the use of augmented data in GAN training to improve the learning of the original distribution. We provide theoretical analysis to show that using our proposed DAG aligns with the original GAN in minimizing the JS divergence w.r.t. the original distribution and it leverages the augmented data to improve the learnings of discriminator and generator. The experiments show that DAG improves various GAN models. Furthermore, when DAG is used in some GAN models, the system establishes state-of-the-art Fréchet Inception Distance (FID) scores.
“ADA/StyleGAN3: Training Generative Adversarial Networks with Limited Data”, (2020-06-11):
Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.67.
“Differentiable Augmentation for Data-Efficient GAN Training”, (2020-06-18):
The performance of generative adversarial networks (GANs) heavily deteriorates given a limited amount of training data. This is mainly because the discriminator is memorizing the exact training set. To combat it, we propose Differentiable Augmentation (DiffAugment), a simple method that improves the data efficiency of GANs by imposing various types of differentiable augmentations on both real and fake samples. Previous attempts to directly augment the training data manipulate the distribution of real images, yielding little benefit; DiffAugment enables us to adopt the differentiable augmentation for the generated samples, effectively stabilizes training, and leads to better convergence. Experiments demonstrate consistent gains of our method over a variety of GAN architectures and loss functions for both unconditional and class-conditional generation. With DiffAugment, we achieve a state-of-the-art FID of 6.80 with an IS of 100.8 on ImageNet 128×128 and 2-4x reductions of FID given 1,000 images on FFHQ and LSUN. Furthermore, with only 20% training data, we can match the top performance on CIFAR-10 and CIFAR-100. Finally, our method can generate high-fidelity images using only 100 images without pre-training, while being on par with existing transfer learning algorithms. Code is available at https://github.com/mit-han-lab/data-efficient-gans.
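The key point is that the same differentiable augmentation is applied to both real and generated images before the discriminator sees them, so generator gradients still flow back through the augmentation. A toy PyTorch sketch (this brightness/translation policy is an illustrative stand-in for the paper's color/translation/cutout policy, not its actual code):

```python
# Toy differentiable-augmentation sketch: both operations (additive brightness shift,
# circular translation via roll) are differentiable with respect to the input images.
import torch

def diff_augment(x, brightness=0.2, translate=0.125):
    # Random per-image brightness shift.
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * brightness
    # Random integer translation by up to `translate` * image size (wrap-around for simplicity).
    shift = int(x.size(2) * translate)
    tx, ty = torch.randint(-shift, shift + 1, (2,))
    return torch.roll(x, shifts=(int(tx), int(ty)), dims=(2, 3))

# Usage sketch: the discriminator is always fed augmented images,
# e.g. D(diff_augment(real_batch)) and D(diff_augment(G(z))).
```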
“BigGAN: Large Scale GAN Training For High Fidelity Natural Image Synthesis: 5.2 Additional Evaluation On JFT-300M”, (2018-09-28):
…To confirm that our design choices are effective for even larger and more complex and diverse datasets, we also present results of our system on a subset of JFT-300M (Sun et al., 2017). The full JFT-300M dataset contains 300M real-world images labeled with 18K categories. Since the category distribution is heavily long-tailed, we subsample the dataset to keep only images with the 8.5K most common labels. The resulting dataset contains 292M images—two orders of magnitude larger than ImageNet.
…Our results show that these techniques substantially improve performance even in the setting of this much larger dataset at the same model capacity (64 base channels). We further show that for a dataset of this scale, we see significant additional improvements from expanding the capacity of our models to 128 base channels, while for ImageNet GANs that additional capacity was not beneficial. In Figure 19 (Appendix D), we present truncation plots for models trained on this dataset…Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization ([Section 4](https://arxiv.org/pdf/1809.11096.pdf&org=deepmind#section.4 "Characterizing Instability: The Generator/Discriminator/Conclusions")), the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues.
https://github.com/xunings/styleganime2/blob/master/misc/ranker.py
"Discriminator Rejection Sampling", (2018-10-16):
We propose a rejection sampling scheme using the discriminator of a GAN to approximately correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm—called Discriminator Rejection Sampling (DRS)—that can be used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the SAGAN model, state-of-the-art in the image generation task at the time of developing this work. On ImageNet, we train an improved baseline that increases the Inception Score from 52.52 to 62.36 and reduces the Frechet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75.
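In practice, a much cruder relative of this idea (roughly what ranking scripts like the `ranker.py` linked above do) is simply to score generated samples with the discriminator and keep the top-ranked fraction; a sketch of that filtering step (the function and variable names are mine, not from the paper, and this omits DRS's probabilistic acceptance rule):

```python
# Discriminator-based sample filtering: keep only the most 'real-looking' generated samples.
import numpy as np

def select_best(samples, d_scores, keep_fraction=0.5):
    """samples: array of generated images; d_scores: discriminator logits, one per sample."""
    k = max(1, int(len(samples) * keep_fraction))
    top = np.argsort(d_scores)[-k:]   # indices of the highest-scoring samples
    return samples[top]
```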
https://github.com/nagadomi/waifu2x/issues/231#issuecomment-381164157
"This Fursona Does Not Exist (TFDNE)", (2020-05-07):
A StyleGAN2 showcase: high-quality GAN-generated furry (anthropomorphic animals) faces, trained on n = 55k faces cropped from the e621 furry image booru. For higher quality, the creator heavily filtered faces and aligned them, and upscaled using waifu2×. For display, it reuses Obormot’s “These Waifus Do Not Exist” scrolling grid code to display an indefinite number of faces rather than one at a time. (TFDNE is also available on Artbreeder for interactive editing/crossbreeding, and a Google Colab notebook for Ganspace-based editing.)
[Figure: 9 random TFDNE furry face samples in a grid]
Model download mirrors:
Rsync:
rsync --verbose rsync://78.46.86.149:873/biggan/2020-05-06-arfafax-stylegan2-tfdne-e621-r-512-3194880.pkl.xz ./
See also the My Little Pony-themed followup “This Pony Does Not Exist” (TPDNE).
“Furry fandom”, (2020-12-28):
The furry fandom is a subculture interested in anthropomorphic animal characters with human personalities and characteristics. Examples of anthropomorphic attributes include exhibiting human intelligence and facial expressions, speaking, walking on two legs, and wearing clothes. The term "furry fandom" is also used to refer to the community of people who gather on the internet and at furry conventions.
“GPT-3: Prompts As Programming”, (2020-06-23):
The GPT-3 neural network is so large a model in terms of power and dataset that it exhibits qualitatively different behavior: you do not apply it to a fixed set of tasks which were in the training dataset, requiring retraining on additional data if one wants to handle a new task (as one would have to retrain GPT-2); instead, you interact with it, expressing any task in terms of natural language descriptions, requests, and examples, tweaking the prompt until it “understands” & it meta-learns the new task based on the high-level abstractions it learned from the pretraining.
This is a rather different way of using a DL model, and it’s better to think of it as a new kind of programming, where the prompt is now a “program” which programs GPT-3 to do new things.
https://github.com/NVlabs/stylegan/blob/master/training/training_loop.py#L112
"GNU Screen", (2021-01-02):
GNU Screen is a terminal multiplexer, a software application that can be used to multiplex several virtual consoles, allowing a user to access multiple separate login sessions inside a single terminal window, or detach and reattach sessions from a terminal. It is useful for dealing with multiple programs from a command line interface, and for separating programs from the session of the Unix shell that started the program, particularly so a remote process continues running even when the user is disconnected.
https://github.com/NVlabs/stylegan/blob/master/training/training_loop.py#L136
https://paste.laravel.io/f2419e15-ea7d-408a-8ff2-b8ee6d00ddd1/raw
"Analyzing and Improving the Image Quality of StyleGAN", (2019-12-03):
The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably attribute a generated image to a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality.
https://mega.nz/#!vawjXISI!F7s13yRicxDA3QYqYDL2kjnc2K7Zk3DwCIYETREmBP4
"Megapixel Size Image Creation using Generative Adversarial Networks", (2017-05-31):
Since its appearance, Generative Adversarial Networks (GANs) have received a lot of interest in the AI community. In image generation several projects showed how GANs are able to generate photorealistic images but the results so far did not look adequate for the quality standard of visual media production industry. We present an optimized image generation process based on a Deep Convolutional Generative Adversarial Networks (DCGANs), in order to create photorealistic high-resolution images (up to 1024×1024 pixels). Furthermore, the system was fed with a limited dataset of images, less than two thousand images. All these results give more clue about future exploitation of GANs in Computer Graphics and Visual Effects.
https://github.com/NVlabs/stylegan/blob/master/pretrained_example.py
https://github.com/NVlabs/stylegan/blob/master/generate_figures.py
"FFmpeg", (2021-01-02):
FFmpeg is a free and open-source software project consisting of a large suite of libraries and programs for handling video, audio, and other multimedia files and streams. At its core is the FFmpeg program itself, designed for command-line-based processing of video and audio files. It is widely used for format transcoding, basic editing, video scaling, video post-production effects and standards compliance.
http://everyoneishappy.com/portfolio/waifu-synthesis-real-time-generative-anime/
"GPT-2 Neural Network Poetry", (2019-03-03):
In February 2019, following up on my 2015–2016 text-generation experiments with char-RNNs, I experiment with the cutting-edge Transformer NN architecture for language modeling & text generation. Using OpenAI’s GPT-2-117M (117M) model pre-trained on a large Internet corpus and nshepperd’s finetuning code, I retrain GPT-2-117M on a large (117MB) Project Gutenberg poetry corpus. I demonstrate how to train 2 variants: “GPT-2-poetry”, trained on the poems as a continuous stream of text, and “GPT-2-poetry-prefix”, with each line prefixed with the metadata of the PG book it came from. In May 2019, I trained the next-largest GPT-2, GPT-2-345M, similarly, for a further quality boost in generated poems. In October 2019, I retrained GPT-2-117M on a Project Gutenberg corpus with improved formatting, and combined it with a contemporary poem dataset based on Poetry Foundation’s website. With just a few GPU-days on 1080ti GPUs, GPT-2-117M finetuning can produce high-quality poetry which is more thematically consistent than my char-RNN poems, capable of modeling subtle features like rhyming, and sometimes even a pleasure to read. I list the many possible ways to improve poem generation and further approach human-level poems. For the highest-quality AI poetry to date, see my followup page, “GPT-3 Creative Writing”.
For anime plot summaries, see TWDNE; for generating ABC-formatted folk music, see “GPT-2 Folk Music” & “GPT-2 Preference Learning for Music and Poetry Generation”; for playing chess, see “A Very Unlikely Chess Game”; for the Reddit comment generator, see SubSimulatorGPT-2; for fanfiction, the Ao3; and for video games, the walkthrough model. For OpenAI’s GPT-3 followup, see “GPT-3: Language Models are Few-Shot Learners”.
https://mega.nz/#!2DRDQIjJ!JKQ_DhEXCzeYJXjliUSWRvE-_rfrvWv_cq3pgRuFadw
https://mega.nz/#!aPRFDKaC!FDpQi_FEPK443JoRBEOEDOmlLmJSblKFlqZ1A1XPt2Y
"Joseph M. Sussman", (2020-12-22):
Joseph Martin Sussman was an American engineer who was the JR East Professor at Massachusetts Institute of Technology. In 2007, he was elected fellow of the American Association for the Advancement of Science.
“Marvin Minsky”, (2021-01-02):
Marvin Lee Minsky was an American cognitive and computer scientist concerned largely with research of artificial intelligence (AI), co-founder of the Massachusetts Institute of Technology's AI laboratory, and author of several texts concerning AI and philosophy.
“PDP-6”, (2021-01-02):
The PDP-6 is a computer model developed by Digital Equipment Corporation (DEC) in 1964. It was influential primarily as the prototype (effectively) for the later PDP-10; the instruction sets of the two machines are almost identical.
“Jargon File”, (2020-12-22):
The Jargon File is a glossary and usage dictionary of slang used by computer programmers. The original Jargon File was a collection of terms from technical cultures such as the MIT AI Lab, the Stanford AI Lab (SAIL) and others of the old ARPANET AI/LISP/PDP-10 communities, including Bolt, Beranek and Newman, Carnegie Mellon University, and Worcester Polytechnic Institute. It was published in paperback form in 1983 as The Hacker's Dictionary, revised in 1991 as The New Hacker's Dictionary.
“Prior probability”, (2020-12-22):
In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.
“Spice and Wolf”, (2021-01-02):
Spice and Wolf is a Japanese light novel series written by Isuna Hasekura, with illustrations by Jū Ayakura. ASCII Media Works has published 22 novels since February 2006 under their Dengeki Bunko imprint. ASCII Media Works reported that as of October 2008, over 2.2 million copies of the first nine novels have been sold in Japan. The series has been called a "unique fantasy" by Mainichi Shimbun due to the plot focusing on economics, trade, and peddling rather than the typical staples of fantasy such as swords and magic. Yen Press licensed the light novels and is releasing them in English in North America. ASCII Media Works has published three volumes of a spin-off light novel series titled Wolf and Parchment since September 2016.
/images/gan/stylegan/2019-02-10-stylegan-holotransfer-trainingmontage.mp4
/images/gan/stylegan/2019-03-20-stylegan-holo-interpolation.mp4
/docs/ai/anime/2019-02-10-stylegan-holo-handselectedsamples.zip
https://old.reddit.com/r/SpiceandWolf/comments/apazs0/my_holo_face_collection/
https://mega.nz/#!afIjAAoJ!ATuVaw-9k5I5cL_URTuK2zI9mybdgFGYMJKUUHUfbk8
"Evangelion: 3.0 You Can (Not) Redo", (2021-01-02):
Evangelion: 3.0 You Can (Not) Redo. is a 2012 Japanese animated science fiction action film written and chief directed by Hideaki Anno and the third of four films released in the Rebuild of Evangelion tetralogy, based on the original anime series Neon Genesis Evangelion. It was produced and co-distributed by Anno's Studio Khara and released in Japanese theaters on November 17, 2012.
/images/gan/stylegan/2019-02-21-stylegan-asukatransfer-interpolation.mp4
https://mega.nz/#!gEFVwAoK!qYrejFI0w1g1BzeuI-st5ajaQLoVFOZsj5j_OTREp1c
/docs/ai/anime/2019-02-11-stylegan-asuka-handselectedsamples.zip
https://mega.nz/#!0JVxHQCD!C7ijBpRWNpcL_gubWFR-GTBDJTW1jXI6ThzSxwaw2aE
https://old.reddit.com/r/MachineLearning/comments/apq4xu/p_stylegan_on_anime_faces/egf8pvt/
"Kantai Collection", (2021-01-02):
Kantai Collection, abbreviated as KanColle, is a Japanese free-to-play web browser game developed by Kadokawa Games and published by DMM.com.
/images/gan/stylegan/2019-02-14-endingcredits-stylegan-zuihoutransfer-interpolation-4x4.mp4
/images/gan/stylegan/2019-02-14-endingcredits-stylegan-zuihoutransfer-interpolation-1x.mp4
"Tower defense", (2020-12-22):
Tower defense (TD) is a subgenre of strategy video game where the goal is to defend a player's territories or possessions by obstructing the enemy attackers, usually achieved by placing defensive structures on or along their path of attack. This typically means building a variety of different structures that serve to automatically block, impede, attack or destroy enemies. Tower defense is seen as a subgenre of real-time strategy video games, due to its real-time origins, though many modern tower defense games include aspects of turn-based strategy. Strategic choice and positioning of defensive elements is an essential strategy of the genre.
“Ptilopsis”, (2021-01-02):
Ptilopsis is a genus of typical owls, or true owls, in the family Strigidae, that inhabits Africa.
/images/gan/stylegan/2020-01-19-ganso-stylegan2-arknights-ptilopsis-interpolation.mp4
https://old.reddit.com/r/MachineLearning/comments/apq4xu/p_stylegan_on_anime_faces/egmyf60/
"Saber (Fate/stay night)", (2021-01-02):
Saber, whose real name is Artoria Pendragon, is a fictional character from the Japanese 2004 visual novel Fate/stay night by Type-Moon. Saber is a heroic warrior who is summoned by a teenager named Shirou Emiya to participate in a war between masters and servants who are fighting to accomplish their dreams using the mythical Holy Grail. Saber's relationship with the story's other characters depends on the player's decisions; she becomes a love interest to Shirou in the novel's first route and also serves as that route's servant protagonist, a supporting character in the second and a villain called "Saber Alter" in the third route.
“Fate/stay night”, (2021-01-02):
Fate/stay night is a Japanese adult visual novel developed by Type-Moon and originally released for Windows on January 30, 2004. A version of Fate/stay night rated for ages 15 and up titled Fate/stay night Réalta Nua, which features the Japanese voice actors from the anime series, was released in 2007 for the PlayStation 2 and later for download on Windows as a trilogy covering the three main story lines. Réalta Nua was also ported to the PlayStation Vita, iOS and Android. The plot focuses on a young mage named Shirou Emiya who becomes a warrior in a battle between "Servants" known as the Holy Grail War. Through each route, Shirou bonds with a heroine and confronts different mages who participate in the war.
/images/gan/stylegan/2019-02-14-endingcredits-stylegan-sabertransfer-interpolation-4x4.mp4
https://towardsdatascience.com/fgo-stylegan-this-heroic-spirit-doesnt-exist-23d62fbb680e
“Fate/Grand Order”, (2021-01-02):
Fate/Grand Order is a free-to-play Japanese mobile game, developed by Delightworks using Unity, and published by Aniplex, a subsidiary of Sony Music Entertainment Japan. The game is based on Type-Moon's Fate/stay night franchise, and was released in Japan on July 29, 2015 for Android, and on August 12, 2015, for iOS. English-language versions followed on June 25, 2017 in the United States and Canada, and a Korean version was released on November 21, 2017. An arcade version titled Fate/Grand Order Arcade was released by Sega in Japan on 26 July 2018.
“List of The Familiar of Zero characters”, (2020-12-22):
This is a list of characters from the light novel, anime, and manga series The Familiar of Zero.
“The Familiar of Zero”, (2021-01-02):
The Familiar of Zero is a Japanese fantasy light novel series written by Noboru Yamaguchi, with illustrations by Eiji Usatsuka. Media Factory published 20 volumes between June 2004 and February 2011. The series was left unfinished due to the author's death in 2013, but was later concluded in two volumes released in February 2016 and February 2017 with a different author, making use of notes left behind by Yamaguchi. The story features several characters from the second year class of a magic academy in a fictional magical world with the main characters being the inept mage Louise and her familiar from Earth, Saito Hiraga.
/images/gan/stylegan/2019-02-14-endingcredits-stylegan-louisetransfer-interpolation-4x4.mp4
“Lelouch Lamperouge”, (2020-12-22):
Lelouch vi Britannia, whose alias is Lelouch Lamperouge, is the title character and leading antihero of the Sunrise anime series Code Geass: Lelouch of the Rebellion. In the series, Lelouch is a former prince from the superpower Britannia who is given the power of the "Geass" by a witch known as C.C. Using the Geass and his genius-level intellect, Lelouch becomes the leader of the resistance movement known as The Black Knights under his alter ego Zero (ゼロ) to destroy the Holy Britannian Empire, an imperial monarchy that has been conquering various countries under control from his father.
“Code Geass”, (2021-01-02):
Code Geass: Lelouch of the Rebellion, often referred to simply as Code Geass, is a Japanese anime series produced by Sunrise. It was directed by Gorō Taniguchi and written by Ichirō Ōkouchi, with original character designs by Clamp. Set in an alternate timeline, the series follows the exiled prince Lelouch vi Britannia, who obtains the "power of absolute obedience" from a mysterious woman named C.C. Using this supernatural power, known as Geass, he leads a rebellion against the rule of the Holy Britannian Empire, commanding a series of mecha battles.
https://old.reddit.com/r/touhou/comments/gl180j/here_have_a_few_marisa_portraits/
https://tvtropes.org/pmwiki/pmwiki.php/VideoGame/WarshipGirls
https://nitter.net/TazikShahjahan/status/1315441277236899842
“Kaguya-sama: Love Is War”, (2021-01-02):
Kaguya-sama: Love Is War is a Japanese romantic comedy manga series by Aka Akasaka. It began serialization in Shueisha's seinen manga magazine Miracle Jump in May 2015 and was transferred to Weekly Young Jump in March 2016.
https://github.com/ZKTKZ/thdne/blob/master/StyleGAN2_Tazik_25GB_RAM.ipynb
https://www.deviantart.com/caji9i/art/stylegan-neural-ahegao-842847987
“Ahegao”, (2021-01-02):
Ahegao (アヘ顔) is a term in Japanese pornography for an exaggerated facial expression of characters during sex, typically with rolling or crossed eyes, protruding tongue, and slightly reddened face, to show enjoyment or ecstasy. The style is often used in erotic manga, anime, and video games.
/images/gan/stylegan/2020-12-30-shipblazer420-stylegan-rezeroemilia-interpolations.mp4
“Re:Zero − Starting Life in Another World”, (2021-01-04):
Re:Zero − Starting Life in Another World is a Japanese light novel series written by Tappei Nagatsuki and illustrated by Shin'ichirō Ōtsuka. The story centres on Subaru Natsuki, a hikikomori who suddenly finds himself transported to another world on his way home from the convenience store. The series was initially serialized on the website Shōsetsuka ni Narō from 2012 onwards. Twenty-one light novels, as well as four side story volumes and five short story collections have been published by Media Factory under their MF Bunko J imprint.
https://old.reddit.com/r/Re_Zero/comments/kn26lc/mediaaigenerated_emilia_heads/ghhw1ej/
/images/gan/stylegan/2019-02-27-sunk-stylegan-misakikurehitotransfer-4x4.mp4
/images/gan/stylegan/2019-04-30-stylegan-portraits-interpolation-4x4.mp4
https://mega.nz/#!CRtiDI7S!xo4zm3n7pkq1Lsfmuio1O8QPpUwHrtFTHjNJ8_XxSJs
https://wiki.evageeks.org/images/3/36/Sadamoto_nadia-shinji.jpg
“List of Nadia: The Secret of Blue Water characters”, (2020-12-22):
This is a list of fictional characters from Nadia: The Secret of Blue Water.
“Nadia: The Secret of Blue Water”, (2021-01-02):
Nadia: The Secret of Blue Water is a Japanese animated television series inspired by the works of Jules Verne, particularly Twenty Thousand Leagues Under the Sea and the exploits of Captain Nemo. The series was created by NHK, Toho and Korad, from a concept of Hayao Miyazaki, and directed by Hideaki Anno of Gainax.
“Shinji Ikari”, (2021-01-02):
Shinji Ikari is a fictional character in the Neon Genesis Evangelion franchise created by Gainax. He is the franchise's poster boy and protagonist. In the anime series of the same name, Shinji is a young man who was abandoned by his father (Gendo), who asks him to fly a mecha called Evangelion Unit 01 to protect the city of Tokyo-3 from Angels: creatures which threaten to destroy humanity. Shinji appears in the franchise's animated feature films and related media, video games, the original net animation Petit Eva: Evangelion@School, the Rebuild of Evangelion films, and the manga adaptation by Yoshiyuki Sadamoto.
/images/gan/stylegan/2019-05-06-stylegan-malefaces-interpolation-4x4.mp4
https://mega.nz/#!fMNDkYwS!X-7_nBtsC6P_09CINIJAoVqR3V8Ffbv5On74rVoUbik
https://mega.nz/#!OEFjWKAS!QIqbb38fR5PnIZbdr7kx5K-koEMtOQ_XQXRqppAyv-k
“Ukiyo-e”, (2021-01-02):
Ukiyo-e is a genre of woodblock prints and paintings which flourished in Japanese art from the late 17th to late 19th century. Aimed at the prosperous merchant class in the urbanizing Edo period (1603–1868), its subjects included female beauties; kabuki actors and sumo wrestlers; scenes from history and folktales; travel scenes and landscapes; flora and fauna; and erotica. The term ukiyo-e (浮世絵) means "floating world picture[s]".
“Amazon Rekognition”, (2020-12-22):
Amazon Rekognition is a cloud-based Software as a service (SaaS) computer vision platform that was launched in 2016. It has been sold and used by a number of United States government agencies, including U.S. Immigration and Customs Enforcement (ICE) and Orlando, Florida police, as well as private entities.
“Ukiyo-e Search”, (2013):
"Japanese Woodblock Print Search: Ukiyo-e Search provides an incredible resource: The ability to both search for Japanese woodblock prints by simply taking a picture of an existing print AND the ability to see similar prints across multiple collections of prints.
…The Ukiyo-e.org database and image similarity analysis engine, created by John Resig to aide researchers in the study of Japanese woodblock prints, was launched in December 2012. The database currently contains over 213,000 prints from 24 institutions and, as of September 2013, has received 3.4 million page views from 150,000 people.
The database has the following major features:
- A database of Japanese woodblock print images and metadata aggregated from a variety of museums, universities, libraries, auction houses, and dealers around the world.
- An indexed text search engine of all the metadata provided by the institutions about the prints.
- An image search engine of all the images in the database, searchable by uploading an image of a print.
- Each print image is analyzed and compared against all other print images in the database. Similar prints are displayed together for comparison and analysis.
- Multiple copies of the same print are automatically lined up with each other and made viewable in a gallery for easy comparison.
- The entire web site, and all artist information contained within it, is available in both English and Japanese, aiding international researchers.
These features, available in the Ukiyo-e.org database, are already providing researchers with substantial benefit. New copies of prints have been located by scholars at museums. Museums have been able to correct unattributed prints, finding the correct artist. Prints have been identified by lay people who cannot read Japanese and/or are unable to interpret the imagery depicted in a print.
Reconciling information from numerous databases, many of them in different languages, is especially challenging, as is finding an effective image-similarity search engine capable of working with images that differ in size, color, or are only available in black-and-white.
The Ukiyo-e.org database is already significantly impacting Japanese woodblock print studies and may have implications for visual art research and digital humanities at large.
https://github.com/cs-chan/ArtGAN/tree/master/WikiArt%20Dataset
“Averaging Weights Leads to Wider Optima and Better Generalization”, (2018-03-14):
Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
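To make the SWA recipe concrete, here is a minimal sketch, assuming checkpoints are available as plain dictionaries of parameter name → NumPy array (the checkpoint format and helper name are assumptions, not the paper's code); it simply maintains a uniform running average of weights sampled along the SGD trajectory:

```python
import numpy as np

def stochastic_weight_average(checkpoints):
    """Average a list of checkpoints (each a dict of parameter name -> np.ndarray).

    This is the core of SWA: a uniform running average of weights sampled along
    the SGD trajectory (e.g. once per epoch under a cyclical or constant LR)."""
    avg = {k: v.astype(np.float64) for k, v in checkpoints[0].items()}
    for n, ckpt in enumerate(checkpoints[1:], start=2):
        for k, v in ckpt.items():
            avg[k] += (v - avg[k]) / n   # incremental running mean
    return avg
```

(For networks with batch normalization, the paper notes that the BN statistics must be recomputed with a forward pass over the training data after averaging.)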
https://zlkj.in/tmp/stylegan/00051-sgan-danbooru-512px-1gpu-progan/
/images/gan/stylegan/2019-03-22-stylegan-danbooru2018-nshepperd-trainingmontage.mp4
“StyleGAN network blending”, (2020-08-25):
In my previous post about attempting to create an ukiyo-e portrait generator I introduced a concept I called “layer swapping” in order to mix two StyleGAN models. The aim was to blend a base model and another created from it by transfer learning, the fine-tuned model. The method was different to simply interpolating the weights of the two models, as it allows you to control independently which model you got low and high resolution features from; in my example I wanted to get the pose from normal photographs, and the texture/style from ukiyo-e prints…After a recent Twitter thread on model interpolation popped up again, I realised that I had missed a really obvious variation on my earlier experiments. Rather than taking the low resolution layers (pose) from normal photos and high res layers (texture) from ukiyo-e, I figured it would surely be interesting to try the other way round.
…I’ve shared an initial version of some code to blend two networks in this layer swapping manner (with some interpolation thrown into the mix) in my StyleGAN2 fork (see the blend_models.py file). There’s also an example Colab notebook to show how to blend some StyleGAN models; in the example I use a small faces model and one I trained on satellite images of the earth above…It was originally Arfa who asked me to share some of the layer swapping code I had been working on. He followed up by combining both the weight interpolation and layer swapping ideas, combining a bunch of different models (with some neat visualisations). The results are pretty amazing: this sort of “resolution dependent model interpolation” is the logical generalisation of both the interpolation and swapping ideas. It looks like it gives a completely new axis of control over a generative model (assuming you have some fine-tuned models which can be combined). Take these example frames from one of the above videos:
anime ↔︎ MLP
On the left is the output of the anime model, on the right the My Little Pony model, and in the middle the mid-resolution layers have been transplanted from My Little Pony into anime. This essentially introduces middle resolution features such as the eyes and nose from My Little Pony into anime characters!
Going further: I think there’s lots of potential to look at these blending strategies further, in particular not only interpolating between models dependent on the resolution, but also differently for different channels. If you can identify the subset of neurons which correspond (for example) to the My Little Pony eyes you could swap those specifically into the anime model, and be able to modify the eyes without affecting other features, such as the nose. Simple clustering of the internal activations has already been shown to be an effective way of identifying neurons which correspond to attributes in the image in the Editing in Style paper so this seems pretty straightforward to try!
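A rough illustration of the layer-swapping idea, independent of the actual blend_models.py implementation: treat each generator checkpoint as a dictionary of named parameter arrays whose names encode their resolution (StyleGAN names its synthesis layers like '64x64/Conv0_up/weight'), and take low-resolution layers from one model and high-resolution layers from the other. The name parsing and threshold here are illustrative assumptions only:

```python
import re

def swap_layers(base, finetuned, swap_above=64):
    """Blend two generator checkpoints (dicts of name -> array): keep the base
    model's parameters for low resolutions (pose/layout) and take the fine-tuned
    model's parameters for resolutions >= `swap_above` (texture/style).
    Parameter names are assumed to contain their resolution as 'RxR'."""
    blended = {}
    for name, value in base.items():
        m = re.search(r'(\d+)x\1', name)
        res = int(m.group(1)) if m else None
        if res is not None and res >= swap_above and name in finetuned:
            blended[name] = finetuned[name]   # high-res layers: fine-tuned model
        else:
            blended[name] = value             # low-res layers & other params: base model
    return blended
```

Resolution-dependent interpolation is the continuous version of the same idea: instead of a hard swap, compute blended[name] = (1 - alpha(res)) * base[name] + alpha(res) * finetuned[name] with a mixing weight alpha that ramps up with resolution.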
“Resolution Dependent GAN Interpolation for Controllable Image Synthesis Between Domains”, (2020-10-11):
GANs can generate photo-realistic images from the domain of their training data. However, those wanting to use them for creative purposes often want to generate imagery from a truly novel domain, a task which GANs are inherently unable to do. It is also desirable to have a level of control so that there is a degree of artistic direction rather than purely curation of random results. Here we present a method for interpolating between generative models of the StyleGAN architecture in a resolution dependent manner. This allows us to generate images from an entirely novel domain and do this with a degree of control over the nature of the output.
https://colab.research.google.com/drive/1tputbmA9EaXs9HL9iO21g7xN7jz_Xrko
/images/gan/stylegan/2019-02-22-stylegan-ffhqdanbooru-interpolation-4x4.mp4
/images/gan/stylegan/2019-03-23-stylegan-danbooru2018transfer.mp4
“Adversarial Feature Learning”, (2016-05-31):
The ability of the Generative Adversarial Networks (GANs) framework to learn generative models mapping from simple latent distributions to arbitrarily complex data distributions has been demonstrated empirically, with compelling results showing that the latent space of such generators captures semantic variation in the data distribution. Intuitively, models trained to predict these semantic latent representations given data may serve as useful feature representations for auxiliary problems where semantics are relevant. However, in their existing form, GANs have no means of learning the inverse mapping – projecting data back into the latent space. We propose Bidirectional Generative Adversarial Networks (BiGANs) as a means of learning this inverse mapping, and demonstrate that the resulting learned feature representation is useful for auxiliary supervised discrimination tasks, competitive with contemporary approaches to unsupervised and self-supervised feature learning.
“Inverting The Generator Of A Generative Adversarial Network (II)”, (2018-02-15):
Generative adversarial networks (GANs) learn a deep generative model that is able to synthesise novel, high-dimensional data samples. New data samples are synthesised by passing latent samples, drawn from a chosen prior distribution, through the generative model. Once trained, the latent space exhibits interesting properties, that may be useful for down stream tasks such as classification or retrieval. Unfortunately, GANs do not offer an "inverse model", a mapping from data space back to latent space, making it difficult to infer a latent representation for a given data sample. In this paper, we introduce a technique, inversion, to project data samples, specifically images, to the latent space using a pre-trained GAN. Using our proposed inversion technique, we are able to identify which attributes of a dataset a trained GAN is able to model and quantify GAN performance, based on a reconstruction loss. We demonstrate how our proposed inversion technique may be used to quantitatively compare performance of various GAN models trained on three image datasets. We provide code for all of our experiments, https://github.com/ToniCreswell/InvertingGAN.
“Backpropagation”, (2020-12-22):
In machine learning, backpropagation is a widely used algorithm for training feedforward neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions generally. These classes of algorithms are all referred to generically as "backpropagation". In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent, are commonly used. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.
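Both inversion papers above boil down to the same recipe, which the backpropagation entry makes mechanical: freeze the generator and run gradient descent on the latent code instead of the weights. A hedged PyTorch sketch, where the generator module, target tensor, and loss choice are placeholders rather than any particular repository's code:

```python
import torch

def invert(generator, target, latent_dim=512, steps=1000, lr=0.05):
    """Project a target image into a trained generator's latent space by
    gradient descent on the latent code; only `z` is updated, the generator's
    weights are left untouched. In practice a perceptual (VGG/LPIPS) loss
    recovers far better latents than raw pixel MSE."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(generator(z), target)
        loss.backward()
        opt.step()
    return z.detach()
```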
https://blog.benwiener.com/programming/2019/04/29/reinventing-the-wheel.html
“Gradient Theory of Optimal Flight Paths”, (1960-10-01):
An analytical development of flight performance optimization according to the method of gradients or 'method of steepest descent' is presented. Construction of a minimizing sequence of flight paths by a stepwise process of descent along the local gradient direction is described as a computational scheme. Numerical application of the technique is illustrated in a simple example of orbital transfer via solar sail propulsion. Successive approximations to minimum time planar flight paths from Earth's orbit to the orbit of Mars are presented for cases corresponding to free and fixed boundary conditions on terminal velocity components.
“A Steepest-Ascent Method for Solving Optimum Programming Problems”, (1962-06-01):
A systematic and rapid steepest-ascent numerical procedure is described for solving two-point boundary-value problems in the calculus of variations for systems governed by a set of nonlinear ordinary differential equations. Numerical examples are presented for minimum time-to-climb and maximum altitude paths for a supersonic interceptor and maximum-range paths for an orbital glider. [Keywords: Boundary-value problems, Computer programming, Differential equations, Variational techniques]
…A systematic and rapid steepest-ascent numerical procedure is described for determining optimum programs for nonlinear systems with terminal constraints. The procedure uses the concept of local linearization around a nominal (non-optimum) path. The effect on the terminal conditions of a small change in the control variable program is determined by numerical integration of the adjoint differential equations for small perturbations about the nominal path. Having these adjoint (or influence) functions, it is then possible to determine the change in the control variable program that gives maximum increase in the pay-off function for a given mean-square perturbation of the control variable program while simultaneously changing the terminal quantities by desired amounts. By repeating this process in small steps, a control variable program that minimizes one quantity and yields specified values of other terminal quantities can be approached as closely as desired. Three numerical examples are presented: (a) The angle-of-attack program for a typical supersonic interceptor to climb to altitude in minimum time is determined with and without specified terminal velocity and heading. (b) The angle-of-attack program for the same interceptor to climb to maximum altitude is determined, (c) The angle-of-attack program is determined for a hypersonic orbital glider to obtain maximum surface range starting from satellite speed at 300,000 ft altitude.
“Deep Set Prediction Networks”, (2019-06-15):
Current approaches for predicting sets from feature vectors ignore the unordered nature of sets and suffer from discontinuity issues as a result. We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. With a single feature vector as input, we show that our model is able to auto-encode point sets, predict the set of bounding boxes of objects in an image, and predict the set of attributes of these objects.
“Synthesizing the preferred inputs for neurons in neural networks via deep generator networks”, (2016-05-30):
Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right—similar to why we study the human brain—and will enable researchers to further improve DNNs. One path to understanding how a neural network functions internally is to study what each of its neurons has learned to detect. One such method is called activation maximization (AM), which synthesizes an input (e.g. an image) that highly activates a neuron. Here we dramatically improve the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN). The algorithm (1) generates qualitatively state-of-the-art synthetic images that look almost real, (2) reveals the features learned by each neuron in an interpretable way, (3) generalizes well to new datasets and somewhat well to different network architectures without requiring the prior to be relearned, and (4) can be considered as a high-quality generative method (in this case, by generating novel, creative, interesting, recognizable images).
“Unadversarial Examples: Designing Objects for Robust Vision”, (2020-12-22):
We study a class of realistic computer vision settings wherein one can influence the design of the objects being recognized. We develop a framework that leverages this capability to significantly improve vision models’ performance and robustness. This framework exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects," i.e., objects that are explicitly optimized to be confidently detected or classified. We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks, to (in-simulation) robotics, to real-world experiments. Our code can be found at https://git.io/unadversarial .
“Image Synthesis from Yahoo's open_nsfw”, (2016): Yahoo’s recently open sourced neural network, open_nsfw, is a fine tuned Residual Network which scores images on a scale of 0 to 1 on its suitability for use in the workplace…What makes an image NSFW, according to Yahoo? I explore this question with a clever new visualization technique by Nguyen et al…Like Google’s Deep Dream, this visualization trick works by maximally activating certain neurons of the classifier. Unlike deep dream, we optimize these activations by performing descent on a parameterization of the manifold of natural images. [Demonstration of an unusual use of backpropagation to ‘optimize’ a neural network: instead of taking a piece of data to input to a neural network and then updating the neural network to change its output slightly towards some desired output (such as a correct classification), one can instead update the input so as to make the neural net output slightly more towards the desired output. When using an image classification neural network, this reversed form of optimization will ‘hallucinate’ or ‘edit’ the ‘input’ to make it more like a particular class of images. In this case, a porn/NSFW-detecting NN is reversed so as to make images more (or less) “porn-like”. Goh runs this process on various images like landscapes, musical bands, or empty images; the maximally/minimally porn-like images are disturbing, hilarious, and undeniably pornographic in some sense.]
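The same reversed use of backpropagation in miniature: hold the trained classifier fixed and take gradient steps on the pixels to raise one output score. The classifier, class index, and image shape below are placeholders; without a learned image prior such as the deep generator network above, the result tends to look adversarial rather than natural.

```python
import torch

def maximize_class(classifier, class_idx, shape=(1, 3, 224, 224), steps=200, lr=0.1):
    """Gradient ascent on the input image to increase one class score."""
    img = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = classifier(img)[0, class_idx]
        (-score).backward()       # maximize the score by minimizing its negation
        opt.step()
        img.data.clamp_(0, 1)     # keep pixels in a valid range
    return img.detach()
```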
“Limited-memory BFGS”, (2020-12-22):
Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize f(x) over unconstrained values of the real vector x, where f is a differentiable scalar function.
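L-BFGS is what several GAN-inversion scripts reach for when the latent vector is small; SciPy exposes it directly. A self-contained toy example on a least-squares objective (the objective here is a stand-in, not anything from the StyleGAN code):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: f(x) = ||Ax - b||^2, with analytic gradient 2 A^T (Ax - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
b = rng.normal(size=100)

def f(x):
    r = A @ x - b
    return r @ r

def grad(x):
    return 2 * A.T @ (A @ x - b)

result = minimize(f, x0=np.zeros(20), jac=grad, method='L-BFGS-B')
print(result.fun)   # minimized squared residual
```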
“Style Generator Inversion for Image Enhancement and Animation”, (2019-06-05):
One of the main motivations for training high quality image generative models is their potential use as tools for image manipulation. Recently, generative adversarial networks (GANs) have been able to generate images of remarkable quality. Unfortunately, adversarially-trained unconditional generator networks have not been successful as image priors. One of the main requirements for a network to act as a generative image prior, is being able to generate every possible image from the target distribution. Adversarial learning often experiences mode-collapse, which manifests in generators that cannot generate some modes of the target distribution. Another requirement often not satisfied is invertibility i.e. having an efficient way of finding a valid input latent code given a required output image. In this work, we show that differently from earlier GANs, the very recently proposed style-generators are quite easy to invert. We use this important observation to propose style generators as general purpose image priors. We show that style generators outperform other GANs as well as Deep Image Prior as priors for image enhancement tasks. The latent space spanned by style-generators satisfies linear identity-pose relations. The latent space linearity, combined with invertibility, allows us to animate still facial images without supervision. Extensive experiments are performed to support the main contributions of this paper.
“On the "steerability" of generative adversarial networks”, (2019-07-16):
An open secret in contemporary machine learning is that many models work beautifully on standard benchmarks but fail to generalize outside the lab. This has been attributed to biased training data, which provide poor coverage over real world events. Generative models are no exception, but recent advances in generative adversarial networks (GANs) suggest otherwise—these models can now synthesize strikingly realistic and diverse images. Is generative modeling of photos a solved problem? We show that although current GANs can fit standard datasets very well, they still fall short of being comprehensive models of the visual manifold. In particular, we study their ability to fit simple transformations such as camera movements and color changes. We find that the models reflect the biases of the datasets on which they are trained (e.g., centered objects), but that they also exhibit some capacity for generalization: by “steering” in latent space, we can shift the distribution while still creating realistic images. We hypothesize that the degree of distributional shift is related to the breadth of the training data distribution. Thus, we conduct experiments to quantify the limits of GAN transformations and introduce techniques to mitigate the problem. Code is released on our project page: this URL.
“Interpreting the Latent Space of GANs for Semantic Face Editing”, (2019-07-25):
Despite the recent advance of Generative Adversarial Networks (GANs) in high-fidelity image synthesis, there lacks enough understanding of how GANs are able to map a latent code sampled from a random distribution to a photo-realistic image. Previous work assumes the latent space learned by GANs follows a distributed representation but observes the vector arithmetic phenomenon. In this work, we propose a novel framework, called InterFaceGAN, for semantic face editing by interpreting the latent semantics learned by GANs. In this framework, we conduct a detailed study on how different semantics are encoded in the latent space of GANs for face synthesis. We find that the latent code of well-trained generative models actually learns a disentangled representation after linear transformations. We explore the disentanglement between various semantics and manage to decouple some entangled semantics with subspace projection, leading to more precise control of facial attributes. Besides manipulating gender, age, expression, and the presence of eyeglasses, we can even vary the face pose as well as fix the artifacts accidentally generated by GAN models. The proposed method is further applied to achieve real image manipulation when combined with GAN inversion methods or some encoder-involved models. Extensive results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable facial attribute representation.
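The core recipe of this line of work can be sketched with ordinary tools: generate many latent codes, label their images with an off-the-shelf attribute classifier, fit a linear separator in latent space, and then move along its normal vector to edit the attribute. The latent shape and labels below are placeholders, not the InterFaceGAN code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def attribute_direction(latents, labels):
    """latents: (N, 512) array of z or w codes; labels: (N,) binary attribute
    (e.g. glasses / no glasses) predicted by some classifier on the images.
    Returns a unit vector in latent space approximately controlling the attribute."""
    clf = LogisticRegression(max_iter=1000).fit(latents, labels)
    direction = clf.coef_[0]
    return direction / np.linalg.norm(direction)

# Editing: move a latent along the direction and re-generate, e.g.
# edited = z + 3.0 * attribute_direction(latents, labels)
```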
https://blog.insightdatascience.com/generating-custom-photo-realistic-faces-using-ai-d170b1b59255
https://github.com/Puzer/stylegan-encoder/blob/master/Play_with_latent_directions.ipynb
https://github.com/halcy/stylegan/blob/master/Stylegan-Generate-Encode.ipynb
https://old.reddit.com/r/MediaSynthesis/comments/c6axmr/close_the_world_txen_eht_nepo/
“This Pony Does Not Exist”, (2020-07):
“This Pony Does Not Exist” (TPDNE) is the followup to “This Fursona Does Not Exist”, also by Arfafax. He scraped the Derpibooru My Little Pony: Friendship is Magic image booru, hand-annotated images and trained a pony face YOLOv3 cropper to create a pony face crop dataset, and trained the TFDNE StyleGAN 2 model to convergence on TensorFork TPU pods, with an upgrade to 1024px resolution via transfer learning/model surgery. The interface reuses Said Achmiz’s These Waifus Do Not Exist grid UI.
10 random pony samples from TPDNE; see also Derpibooru uploads from TPDNE. The S2 model snapshot is available for download and I have mirrored it (rsync rsync://78.46.86.149:873/biggan/2020-07-15-arfafax-stylegan2-thisponydoesnotexist-1024px-iter151552.pkl ./).
“GANSpace: Discovering Interpretable GAN Controls”, (2020-04-06):
This paper describes a simple technique to analyze Generative Adversarial Networks (GANs) and create interpretable controls for image synthesis, such as change of viewpoint, aging, lighting, and time of day. We identify important latent directions based on Principal Components Analysis (PCA) applied either in latent space or feature space. Then, we show that a large number of interpretable controls can be defined by layer-wise perturbation along the principal directions. Moreover, we show that BigGAN can be controlled with layer-wise inputs in a StyleGAN-like manner. We show results on different GANs trained on various datasets, and demonstrate good qualitative matches to edit directions found through earlier supervised approaches.
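A minimal version of the GANSpace recipe for a StyleGAN-like model: sample many intermediate latents, run PCA on them, and perturb along the top components. The `mapping` function is a placeholder for the model's z-to-w mapping network (assumed to return a NumPy array), not part of any released code:

```python
import numpy as np
from sklearn.decomposition import PCA

def latent_principal_directions(mapping, n_samples=10000, latent_dim=512, n_components=20):
    """Estimate candidate edit directions by PCA over sampled w-space latents."""
    z = np.random.randn(n_samples, latent_dim).astype(np.float32)
    w = mapping(z)                       # (n_samples, latent_dim) intermediate latents
    pca = PCA(n_components=n_components).fit(w)
    return pca.mean_, pca.components_    # components_[i] is the i-th direction

# Editing: w_edited = w + sigma * components_[i], optionally applied only to a
# subset of the synthesis layers ("layer-wise perturbation").
```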
https://nitter.net/realmeatyhuman/status/1255570195319590913
https://colab.research.google.com/drive/1g-ShMzkRWDMHPyjom_p-5kqkn2f-GwBi
“Analyzing and Improving the Image Quality of StyleGAN”, (2019-12-03):
The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign the generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent codes to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably attribute a generated image to a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality.
“MSG-GAN: Multi-Scale Gradients for Generative Adversarial Networks”, (2019-03-14):
While Generative Adversarial Networks (GANs) have seen huge successes in image synthesis tasks, they are notoriously difficult to adapt to different datasets, in part due to instability during training and sensitivity to hyperparameters. One commonly accepted reason for this instability is that gradients passing from the discriminator to the generator become uninformative when there isn’t enough overlap in the supports of the real and fake distributions. In this work, we propose the Multi-Scale Gradient Generative Adversarial Network (MSG-GAN), a simple but effective technique for addressing this by allowing the flow of gradients from the discriminator to the generator at multiple scales. This technique provides a stable approach for high resolution image synthesis, and serves as an alternative to the commonly used progressive growing technique. We show that MSG-GAN converges stably on a variety of image datasets of different sizes, resolutions and domains, as well as different types of loss functions and architectures, all with the same set of fixed hyperparameters. When compared to state-of-the-art GANs, our approach matches or exceeds the performance in most of the cases we tried.
“ThisWaifuDoesNotExist, version 3”, (2020-01-20):
Discussion of TWDNEv3, launched January 2020. TWDNEv3 upgrades TWDNEv2 to use 100k anime portraits from an anime portrait StyleGAN2, an improvement to StyleGAN released in December 2019, which removes the blob artifacts and is generally of somewhat higher visual quality. TWDNEv3 provides images in 3 ranges of diversity, showing off both narrow but high quality samples and more wild samples. It replaces the StyleGAN 1 faces and portrait samples.
https://mega.nz/#!PeIi2ayb!xoRtjTXyXuvgDxSsSMn-cOh-Zux9493zqdxwVMaAzp4
https://hivemind-repo.s3-us-west-2.amazonaws.com/twdne3/twdne3.pt
https://hivemind-repo.s3-us-west-2.amazonaws.com/twdne3/twdne3.onnx
https://colab.research.google.com/drive/1Pv8OIFlonha4KeYyY2oEFaK4mG-alaWF
“FIGR: Few-shot Image Generation with Reptile”, (2019-01-08):
Generative Adversarial Networks (GAN) boast impressive capacity to generate realistic images. However, like much of the field of deep learning, they require an inordinate amount of data to produce results, thereby limiting their usefulness in generating novelty. In the same vein, recent advances in meta-learning have opened the door to many few-shot learning applications. In the present work, we propose Few-shot Image Generation using Reptile (FIGR), a GAN meta-trained with Reptile. Our model successfully generates novel images on both MNIST and Omniglot with as little as 4 images from an unseen class. We further contribute FIGR-8, a new dataset for few-shot image generation, which contains 1,548,944 icons categorized in over 18,409 classes. Trained on FIGR-8, initial results show that our model can generalize to more advanced concepts (such as "bird" and "knife") from as few as 8 samples from a previously unseen class of images and as little as 10 training steps through those 8 images. This work demonstrates the potential of training a GAN for few-shot image generation and aims to set a new benchmark for future work in the domain.
“Few-Shot Unsupervised Image-to-Image Translation”, (2019-05-05):
Unsupervised image-to-image translation methods learn to map images in a given class to an analogous image in a different class, drawing on unstructured (non-registered) datasets of images. While remarkably successful, current methods require access to many images in both source and destination classes at training time. We argue this greatly limits their use. Drawing inspiration from the human capability of picking up the essence of a novel object from a small number of examples and generalizing from there, we seek a few-shot, unsupervised image-to-image translation algorithm that works on previously unseen target classes that are specified, at test time, only by a few example images. Our model achieves this few-shot generation capability by coupling an adversarial training scheme with a novel network design. Through extensive experimental validation and comparisons to several baseline methods on benchmark datasets, we verify the effectiveness of the proposed framework. Our implementation and datasets are available at https://github.com/NVlabs/FUNIT .
“Image Generation From Small Datasets via Batch Statistics Adaptation”, (2019-04-03):
Thanks to the recent development of deep generative models, it is becoming possible to generate high-quality images with both fidelity and diversity. However, the training of such generative models requires a large dataset. To reduce the amount of data required, we propose a new method for transferring prior knowledge of the pre-trained generator, which is trained with a large dataset, to a small dataset in a different domain. Using such prior knowledge, the model can generate images leveraging some common sense that cannot be acquired from a small dataset. In this work, we propose a novel method focusing on the parameters for batch statistics, scale and shift, of the hidden layers in the generator. By training only these parameters in a supervised manner, we achieved stable training of the generator, and our method can generate higher quality images compared to previous methods without collapsing, even when the dataset is small (~100). Our results show that the diversity of the filters acquired in the pre-trained generator is important for the performance on the target domain. Our method makes it possible to add a new class or domain to a pre-trained generator without disturbing the performance on the original domain.
The TensorFlow Research Cloud (TFRC) program enables researchers to apply for access to a cluster of more than 1,000 Cloud TPUs. In total, this cluster delivers a total of more than 180 petaflops of raw compute power! Researchers accepted into the TFRC program can use these Cloud TPUs at no charge to accelerate the next wave of open research breakthroughs. Participants in the TFRC program will be expected to share their TFRC-supported research with the world through peer-reviewed publications, open source code, blog posts, or other means. They should also be willing to share detailed feedback with Google to help us improve the TFRC program and the underlying Cloud TPU platform over time. In addition, participants accept Google’s Terms and Conditions, acknowledge that their information will be used in accordance with our Privacy Policy, and agree to conduct their research in accordance with the Google AI principles. Machine learning researchers around the world have done amazing things with the limited computational resources they currently have available. We’d like to empower researchers from many different backgrounds to think even bigger and tackle exciting new challenges that would be inaccessible otherwise.
[TFRC is an easy-to-apply cloud credit program which grants free access to up to hundreds of GCP TPUs and sometimes whole TPU pods to researchers & hobbyists like me; I relied on TFRC credits to train a variety of GPT-2-1.5b models which are infeasible on consumer GPUs. It took seconds to apply, they replied in hours with credits, and were highly responsive thereafter as we encountered various TPU issues.]
“YFCC100M: The New Data in Multimedia Research”, (2015-03-05):
We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, e.g. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.
“Evolving Normalization-Activation Layers”, (2020-04-06):
Normalization layers and activation functions are fundamental components in deep networks and typically co-locate with each other. Here we propose to design them using an automated approach. Instead of designing them separately, we unify them into a single tensor-to-tensor computation graph, and evolve its structure starting from basic mathematical functions. Examples of such mathematical functions are addition, multiplication and statistical moments. The use of low-level mathematical functions, in contrast to the use of high-level modules in mainstream NAS, leads to a highly sparse and large search space which can be challenging for search methods. To address the challenge, we develop efficient rejection protocols to quickly filter out candidate layers that do not work well. We also use multi-objective evolution to optimize each layer’s performance across many architectures to prevent overfitting. Our method leads to the discovery of EvoNorms, a set of new normalization-activation layers with novel, and sometimes surprising structures that go beyond existing design patterns. For example, some EvoNorms do not assume that normalization and activation functions must be applied sequentially, nor need to center the feature maps, nor require explicit activation functions. Our experiments show that EvoNorms work well on image classification models including ResNets, MobileNets and EfficientNets but also transfer well to Mask R-CNN with FPN/SpineNet for instance segmentation and to BigGAN for image synthesis, outperforming BatchNorm and GroupNorm based layers in many cases.
“Bias–variance tradeoff”, (2021-01-02):
In statistics and machine learning, the bias–variance tradeoff is the property of a model that the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters. The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
- The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- The variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
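The standard decomposition behind those two bullet points, stated for squared error on a noisy target y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², where the expectation is taken over training sets (and noise):

$$\mathbb{E}\big[(y-\hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)]-f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x)-\mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}$$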
“Do We Need Zero Training Loss After Achieving Zero Training Error?”, (2020-02-20):
Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and the test performance degraded. Since existing regularizers do not directly aim to avoid zero training loss, they often fail to maintain a moderate level of training loss, ending up with a too small or too large loss. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss when it reaches a reasonably small value, which we call the flooding level. Our approach makes the loss float around the flooding level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code, and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and as a byproduct, induces a double descent curve of the test loss.
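The “one line of code” the abstract refers to is usually written as below (PyTorch-style; b is the flooding-level hyperparameter): when the mini-batch loss falls below b, the absolute value flips the sign of the gradient, turning descent into ascent until the loss floats back above the flooding level.

```python
# loss: ordinary mini-batch training loss (scalar tensor); b: flooding level.
flooded_loss = (loss - b).abs() + b   # identical gradient above b, reversed below b
flooded_loss.backward()
```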
“A Simple Framework for Contrastive Learning of Visual Representations”, (2020-02-13):
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.
“Large Scale GAN Training for High Fidelity Natural Image Synthesis”, (2018-09-28):
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator’s input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.
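The “truncation trick” itself is simple to sketch: sample the generator's latent from a truncated normal instead of a full normal, trading sample variety for fidelity. The threshold value below is arbitrary (BigGAN sweeps it to trace out the fidelity/variety trade-off):

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_z(batch_size, dim, threshold=0.5, seed=None):
    """Sample latent vectors from N(0,1) truncated to [-threshold, threshold].
    Smaller thresholds give higher-fidelity but less diverse samples."""
    state = np.random.RandomState(seed)
    return truncnorm.rvs(-threshold, threshold, size=(batch_size, dim), random_state=state)
```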
“Self-Attention Generative Adversarial Networks”, (2018-05-21):
In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN) which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
“Neural Photo Editing with Introspective Adversarial Networks”, (2016-09-22):
The increasingly photorealistic sample quality of generative image models suggests their feasibility in applications beyond image generation. We present the Neural Photo Editor, an interface that leverages the power of generative neural networks to make large, semantically coherent changes to existing images. To tackle the challenge of achieving accurate reconstructions without loss of feature quality, we introduce the Introspective Adversarial Network, a novel hybridization of the VAE and GAN. Our model efficiently captures long-range dependencies through use of a computational block based on weight-shared dilated convolutions, and improves generalization performance with Orthogonal Regularization, a novel weight regularization method. We validate our contributions on CelebA, SVHN, and CIFAR-100, and produce samples and reconstructions with high visual fidelity.
“The Unusual Effectiveness of Averaging in GAN Training”, (2018-06-12):
We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the—to our knowledge—first theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one for simple bilinear games and also enhances the stability of general GAN training. We establish experimentally that both techniques are strikingly effective in the non-convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets—mixture of Gaussians, CIFAR-10, STL-10, CelebA and ImageNet—to demonstrate its effectiveness. We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images. The code is available at https://github.com/yasinyazici/EMA_GAN.
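A sketch of the EMA variant, which StyleGAN and BigGAN both use for the generator weights they actually sample from; the name-keyed parameter-dictionary format is an assumption, not any framework's API:

```python
def ema_update(ema_params, current_params, decay=0.999):
    """In-place exponential moving average of generator parameters.
    Call after every training step; sample images from `ema_params`."""
    for name, value in current_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

With decay = 0.999 the averaged weights effectively track roughly the last ~1,000 updates, smoothing out the limit cycles described above.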
“A Variational Inequality Perspective on Generative Adversarial Networks”, (2018-02-28):
Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.
“How AI Training Scales”, (2018-12-14):
We’ve discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training on a wide range of tasks. Since complex tasks tend to have noisier gradients, increasingly large batch sizes are likely to become useful in the future, removing one potential limit to further growth of AI systems. More broadly, these results show that neural network training need not be considered a mysterious art, but can be rigorized and systematized.
In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.
The gradient noise scale (appropriately averaged over training) explains the vast majority (R2 = 80%) of the variation in critical batch size over a range of tasks spanning six orders of magnitude. Batch sizes are measured in either number of images, tokens (for language models), or observations (for games). …We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients, we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data…We’ve found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn’t lead to faster training. There is a “bend” in the curve in the middle, and the gradient noise scale predicts where that bend occurs.
Increasing parallelism makes it possible to train more complex models in a reasonable amount of time. We find that a Pareto frontier chart is the most intuitive way to visualize comparisons between algorithms and scales. …more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there’s some evidence that the increasing noise scale over training isn’t just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable. Second, tasks that are subjectively more difficult are also more amenable to parallelization…we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute.
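As a rough illustration of how the “simple” noise scale can be measured in practice (my reading of the paper's appendix; the inputs are assumed to be squared gradient norms averaged over several mini-batches at each of two batch sizes):

```python
def simple_noise_scale(g_small_sq, g_big_sq, b_small, b_big):
    """Rough estimate of the simple gradient noise scale B = tr(Sigma) / |G|^2
    from squared gradient norms measured at two batch sizes b_small < b_big."""
    grad_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)  # estimate of |G|^2
    noise = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)          # estimate of tr(Sigma)
    return noise / grad_sq
```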
https://academictorrents.com/details/a306397ccf9c2ead27155983c254227c0fd938e2
https://mega.nz/#!oOpxGKKK!DvAJ1lrQeckFA9YuM-duW2f1CLZkF3waFIjVXMx_amQ
https://mega.nz/#!VXohCKoK!uslQhkYPqdioH79WsIhopX1gav0_eDENEvasdtmBWi4
https://old.reddit.com/r/SpiceandWolf/comments/bx764z/im_sorry_i_had_to/
“Talking Head Anime from a Single Image”, (2019-11-25):
Fascinated by virtual YouTubers, I put together a deep neural network system that makes becoming one much easier. More specifically, the network takes as input an image of an anime character’s face and a desired pose, and it outputs another image of the same character in the given pose.
“Practical aspects of StyleGAN2 training”, (2020-04-28):
I have trained StyleGAN2 from scratch with a dataset of female portraits at 1024px resolution. The sample quality was further improved by tuning the parameters and augmenting the dataset with zoomed-in images, allowing the network to learn more details and to achieve FID metrics that are comparable to the results of the original work…I was curious how it would work on the human anatomy, so I decided to try to train SG2 with a dataset of head and shoulders portraits. To alleviate capacity issues mentioned in the SG2 paper I preferred to use portraits without clothes (a significant contributing factor to dataset variance); furthermore, the dataset was limited to just one gender in order to further reduce the dataset’s complexity.
…I haven’t quite been able to achieve the quality of SG2 trained with the FFHQ dataset. After more than 30,000 kimg, the samples are not yet as detailed as is desirable. For example, teeth look blurry and pupils are not perfectly round. Considering the size of my dataset as opposed to the FFHQ one, the cause is unlikely to be the lack of training data. Continuing the training does not appear to help, as is evident from the plateau in FIDs.
Overall, my experience with SG2 is well in line with what others are observing. Limiting the dataset to a single domain leads to major quality improvements. SG2 is able to model textures and transitions quite well. At the same time it is struggling as the complexity of the object increases with, for instance, greater diversity in poses. It should be noted that SG2 is much more efficient for single domain tasks compared to other architectures, resulting in acceptable results much faster.
Curated samples, Ψ = 0.70
https://old.reddit.com/r/MachineLearning/comments/apq4xu/p_stylegan_on_anime_faces/
/docs/ai/anime/2019-02-06-progan-danbooru2017-faces-randomsamples.tar
https://mega.nz/#!ZRUDjQiS!yMMBkq1CH7ohkU2kmL8a-jc-xJZCyKbkz_oAsE5hobw
“Glow: Generative Flow with Invertible 1×1 Convolutions”, (2018-07-09):
Flow-based generative models (Dinh et al., 2014) are conceptually attractive due to tractability of the exact log-likelihood, tractability of exact latent-variable inference, and parallelizability of both training and synthesis. In this paper we propose Glow, a simple type of generative flow using an invertible 1×1 convolution. Using our method we demonstrate a significant improvement in log-likelihood on standard benchmarks. Perhaps most strikingly, we demonstrate that a generative model optimized towards the plain log-likelihood objective is capable of efficient realistic-looking synthesis and manipulation of large images. The code for our model is available at https://github.com/openai/glow
https://github.com/akanimax/Variational_Discriminator_Bottleneck
“GAN-QP: A Novel GAN Framework without Gradient Vanishing and Lipschitz Constraint”, (2018-11-18):
We know SGAN may have a risk of gradient vanishing. A significant improvement is WGAN, with the help of 1-Lipschitz constraint on discriminator to prevent from gradient vanishing. Is there any GAN having no gradient vanishing and no 1-Lipschitz constraint on discriminator? We do find one, called GAN-QP.
Constructing a new framework of Generative Adversarial Network (GAN) usually includes three steps: 1. choose a probability divergence; 2. convert it into a dual form; 3. play a min-max game. In this article, we demonstrate that the first step is not necessary. We can analyse the property of divergence and even construct new divergence in dual space directly. As a reward, we obtain a simpler alternative of WGAN: GAN-QP. We demonstrate that GAN-QP has a better performance than WGAN in theory and practice.
“Wasserstein GAN”, (2017-01-26):
We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.
“IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis”, (2018-07-17):
We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model as normal VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples while the generator tries to fool it as GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and nice latent manifold. Unlike most other hybrid models of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CELEBA images at 1024×1024), which are comparable to or better than the state-of-the-art GANs.
“Generating Anime Faces with StyleGAN: Using a trained Discriminator to Rank and Clean Data”, (2019-04-22):
The Discriminator of a GAN is trained to detect outliers or bad datapoints. So it can be used for cleaning the original dataset of aberrant samples. This works reasonably well and I obtained BigGAN/StyleGAN quality improvements by manually deleting the worst samples (typically badly-cropped or low-quality faces), but has peculiar behavior which indicates that the Discriminator is not learning anything equivalent to a “quality” score but may be doing some form of memorization of specific real datapoints. What does this mean for how GANs work?
“Making Anime Faces With StyleGAN”, (2019-02-04):
Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.
I show off my StyleGAN 1/
2 CC-0-licensed anime faces & videos, provide downloads for the final models & anime portrait face dataset, provide the ‘missing manual’ & explain how I trained them based on Danbooru2017/ 2018 with source code for the data preprocessing, document installation & configuration & training tricks.For application, I document various scripts for generating images & videos, briefly describe the website “This Waifu Does Not Exist” I set up as a public demo (see also Artbreeder), discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small datasets (eg Holo or Asuka Souryuu Langley), and touch on more advanced StyleGAN applications like encoders & controllable generation.
The appendix gives samples of my failures with earlier GANs for anime face generation, and I provide samples & model from a relatively large-scale BigGAN training run suggesting that BigGAN may be the next step forward to generating full-scale anime images.
A minute of reading could save an hour of debugging!
“GPT-3 Creative Fiction”, (2020-06-19):
I continue my AI poetry generation experiments with OpenAI’s 2020 GPT-3, which is 116× larger, and much more powerful, than the 2019 GPT-2. GPT-3, however, is not merely a quantitative tweak yielding “GPT-2 but better”—it is qualitatively different, exhibiting eerie runtime learning capabilities allowing even the raw model, with zero finetuning, to “meta-learn” many textual tasks purely by example or instruction. One does not train or program GPT-3 in a normal way, but one engages in dialogue and writes prompts to teach GPT-3 what one wants.
Experimenting through the OpenAI Beta API in June 2020, I find that GPT-3 does not just match my finetuned GPT-2-1.5b-poetry for poem-writing quality, but exceeds it, while being versatile in handling poetry, Tom Swifty puns, science fiction, dialogue like Turing’s Turing-test dialogue, literary style parodies… As the pièce de résistance, I recreate Stanislaw Lem’s Cyberiad’s “Trurl’s Electronic Bard” poetry using GPT-3. (Along the way, I document instances of how the BPE text encoding unnecessarily damages GPT-3’s performance on a variety of tasks, how to best elicit the highest-quality responses, common errors people make in using GPT-3, and test out GPT-3’s improvements in NN weak points like logic or commonsense knowledge.)
GPT-3’s samples are not just close to human level: they are creative, witty, deep, meta, and often beautiful. They demonstrate an ability to handle abstractions, like style parodies, I have not seen in GPT-2 at all. Chatting with GPT-3 feels uncannily like chatting with a human. I was impressed by the results reported in the GPT-3 paper, and after spending a week trying it out, I remain impressed.
This page records GPT-3 samples I generated in my explorations, and thoughts on how to use GPT-3 and its remaining weaknesses. I hope you enjoy them even a tenth as much as I enjoyed testing GPT-3 and watching the completions scroll across my screen.
“A Style-Based Generator Architecture for Generative Adversarial Networks”, (2018-12-12):
We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
“Language Models are Few-Shot Learners”, (2020-05-28):
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.
Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
“Large Scale GAN Training for High Fidelity Natural Image Synthesis”, (2018-09-28):
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator’s input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.
“Two-stage Sketch Colorization”, (2018):
Sketch or line art colorization is a research field with significant market demand. Different from photo colorization which strongly relies on texture information, sketch colorization is more challenging as sketches may not have texture. Even worse, color, texture, and gradient have to be generated from the abstract sketch lines. In this paper, we propose a semi-automatic learning-based framework to colorize sketches with proper color, texture as well as gradient. Our framework consists of two stages. In the first drafting stage, our model guesses color regions and splashes a rich variety of colors over the sketch to obtain a color draft. In the second refinement stage, it detects the unnatural colors and artifacts, and try to fix and refine the result. Comparing to existing approaches, this two-stage design effectively divides the complex colorization task into two simpler and goal-clearer subtasks. This eases the learning and raises the quality of colorization. Our model resolves the artifacts such as water-color blurring, color distortion, and dull textures. We build an interactive software based on our model for evaluation. Users can iteratively edit and refine the colorization. We evaluate our learning model and the interactive system through an extensive user study. Statistics shows that our method outperforms the state-of-art techniques and industrial applications in several aspects including, the visual quality, the ability of user control, user experience, and other metrics.
“Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, (2017-07-10):
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between ‘enormous data’ and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.
https://arxiv.org/pdf/1809.11096.pdf&org=deepmind#figure.caption.30
“E621 Face Dataset”, (2020-02-18):
Tool for getting the dataset of cropped faces from [furry booru] e621 (NSFW; WikiFur description). It was created by training a YOLOv3 network on annotated facial features from about 1500 faces.
The total dataset includes ~186k faces. Rather than provide the cropped images, this repo contains CSV files with the bounding boxes of the detected features from my trained network, and a script to download the images from e621 and crop them based on these CSVs.
The CSVs also contain a subset of tags, which could potentially be used as labels to train a conditional GAN.
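To make the pipeline concrete, here is a hypothetical sketch of the crop-from-CSV step (this is not the repo’s actual get_faces.py, which is described in the file listing below; the column names path/x1/y1/x2/y2 are placeholders, not the real CSV schema, and the images are assumed to have already been downloaded locally):

```python
# Hypothetical sketch only: the column names ("path", "x1", "y1", "x2", "y2")
# are placeholders, not the actual schema of faces_s.csv.
import csv
import os
from PIL import Image

def crop_faces(csv_path, out_dir="crops"):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            img = Image.open(row["path"])  # already-downloaded source image
            box = tuple(int(row[k]) for k in ("x1", "y1", "x2", "y2"))  # face bounding box
            img.crop(box).convert("RGB").save(os.path.join(out_dir, f"{i:06d}.jpg"), quality=95)

crop_faces("faces_s.csv")
```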
- get_faces.py: Script for downloading base e621 files and cropping them based on the coordinates in the CSVs.
- faces_s.csv: CSV containing URLs, bounding boxes, and a subset of the tags for 90k cropped faces with rating=safe from e621.
- features_s.csv: CSV containing the bounding boxes for 389k facial features with rating=safe from e621.
- faces_q.csv: CSV containing URLs, bounding boxes, and a subset of the tags for 96k cropped faces with rating=questionable from e621.
- features_q.csv: CSV containing the bounding boxes for 400k facial features with rating=questionable from e621.

Preview grid
https://www.obormot.net/demos/these-waifus-do-not-exist-v2-alt
“This Fursona Does Not Exist—Fursona Editor (Tensorflow Version)”, (2020-06-01):
[Google Colab notebook for interactive editing faces generated by the TFDNE.com furry face StyleGAN2 model, using Ganspace to reverse-engineer the latent encoding and allow control of specific visual attributes of faces.]
https://drive.google.com/file/d/1t7E8NEqK_gVJwxrWEihR1IcPfekaBc1d/view
https://mega.nz/file/Wa4EFQRA#XL9X5tGNrlp1bTdafPWK_Kg65RW3J5-CR9biGEfFm_g
“Poetry Foundation”, (2020-12-27):
The Poetry Foundation is a Chicago-based American foundation created to promote poetry in the wider culture. It was formed from Poetry magazine, which it continues to publish, with a 2003 gift of $200 million from philanthropist Ruth Lilly.
“GPT-2 Folk Music”, (2019-11-01):
In November 2019, I experimented with training a GPT-2 neural net model to generate folk music in the high-level ABC music text format, following previous work in 2016 which used a char-RNN trained on a ‘The Session’ dataset. A GPT-2 hypothetically can improve on an RNN by better global coherence & copying of patterns, without problems with the hidden-state bottleneck.
I encountered problems with the standard GPT-2 model’s encoding of text which damaged results, but after fixing that, I successfully trained it on n = 205,304 ABC music pieces taken from The Session & ABCnotation.com. The resulting music samples are in my opinion quite pleasant. (A similar model was later retrained by Geerlings & Meroño-Peñuela 2020.)
The ABC folk model & dataset are available for download, and I provide for listening selected music samples as well as medleys of random samples from throughout training.
We followed the ABC folk model with an ABC-MIDI model: a dataset of 453k ABC pieces decompiled from MIDI pieces, which fit into GPT-2-117M with an expanded context window when trained on TPUs. The MIDI pieces are far more diverse and challenging, and GPT-2 underfits and struggles to produce valid samples but when sampling succeeds, it can generate even better musical samples.
“GPT-2 Preference Learning for Music Generation”, (2019-12-16):
Standard language generation neural network models, like GPT-2, are trained via likelihood training to imitate human text corpuses. Generated text suffers from persistent flaws like repetition, due to myopic generation word-by-word, and cannot improve on the training data because they are trained to predict ‘realistic’ completions of the training data.
A proposed alternative is to use reinforcement learning to train the NNs, to encourage global properties like coherence & lack of repetition, and potentially improve over the original corpus’s average quality. Preference learning trains a reward function on human ratings, and uses that as the ‘environment’ for a blackbox DRL algorithm like PPO.
OpenAI released a codebase implementing this dual-model preference learning approach for textual generation, based on GPT-2. Having previously used GPT-2 for poetry & music generation, I experimented with GPT-2 preference learning for unconditional music and poetry generation.
I found that preference learning seemed to work better for music than poetry, and seemed to reduce the presence of repetition artifacts, but the results, at n≅7,400 ratings compiled over 23 iterations of training+sampling November 2019–January 2020, are not dramatically better than alternative improvements like scaling up models or more thorough data-cleaning or more stringent sample curation. My blind ratings using n≅200 comparisons showed no large advantage for the RL-tuned samples (winning only 93 of 210 comparisons, or 46%).
This may be due to insufficient ratings, bad hyperparameters, or not using samples generated with common prefixes, but I suspect it’s the former, as some NLP tasks in Ziegler et al 2019 required up to 60k ratings for good performance, and the reward model appeared to achieve poor performance & succumb to adversarial examples easily.
Working with it, I suspect that preference learning is unnecessarily sample-inefficient & data-inefficient, and that the blackbox reinforcement learning approach is inferior to directly using the reward model to optimize text samples, and propose two major architectural overhauls: have the reward model directly model the implied ranking of every datapoint, and drop the agent model entirely in favor of backprop-powered gradient ascent which optimizes sequences to maximize the reward model’s output.
“A Very Unlikely Chess Game”, (2020-01-06):
…Black is GPT-2. Its excuse [for this chess blunder] is that it’s a text prediction program with no concept of chess. As far as it knows, it’s trying to predict short alphanumeric strings like “e2e4” or “Nb7”. Nobody told it this represents a board game. It doesn’t even have a concept of 2D space that it could use to understand such a claim. But it still captured my rook! Embarrassing!…Last month, I asked him if he thought GPT-2 could play chess. I wondered if he could train it on a corpus of chess games written in standard notation (where, for example, e2e4 means “move the pawn at square e2 to square e4”). There are literally millions of games written up like this. GPT-2 would learn to predict the next string of text, which would correspond to the next move in the chess game. Then you would prompt it with a chessboard up to a certain point, and it would predict how the chess masters who had produced its training data would continue the game – ie make its next move using the same heuristics they would. Gwern handed the idea to his collaborator Shawn Presser, who had a working GPT-2 chess engine running within a week:…You can play against GPT-2 yourself by following the directions in the last tweet, though it won’t be much of a challenge for anyone better than I am.
…What does this imply? I’m not sure (and maybe it will imply more if someone manages to make it actually good). It was already weird to see something with no auditory qualia learn passable poetic meter. It’s even weirder to see something with no concept of space learn to play chess. Is any of this meaningful? How impressed should we be that the same AI can write poems, compose music, and play chess, without having been designed for any of those tasks? I still don’t know.
[See also the much later Noever et al 2020a/Noever et al 2020b who do the exact same thing in applying GPT-2 to Go SGF/chess PGN games.]
/docs/www/old.reddit.com/7eaaa81a26404ef60df4279ee1f1b0c829d73be5.html
https://github.com/justinpinkney/stylegan2/blob/master/blend_models.py
https://colab.research.google.com/drive/1tputbmA9EaXs9HL9iO21g7xN7jz_Xrko?usp=sharing
https://derpibooru.org/tags/artist-colon-thisponydoesnotexist
https://thisponydoesnotexist.net/model/network-ponies-1024-151552.pkl
“Anime Portraits with StyleGAN 2”, (2020-01-20):
How to use StyleGAN2, an improvement to StyleGAN released in December 2019, which removes the blob artifacts and is generally of somewhat higher visual quality. StyleGAN 2 is tricky to use because it requires custom local compilation of optimized code. Aaron Gokaslan provided tips on getting StyleGAN 2 running and trained a StyleGAN 2 on my anime portraits, which is available for download and which I use to create TWDNEv3.
“Language Models are Unsupervised Multitask Learners”, (2019-02-14):
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets.
We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset—matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text.
These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
“An Empirical Model of Large-Batch Training”, (2018-12-14):
In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.
“Virtual YouTuber”, (2020-12-27):
A virtual YouTuber or VTuber is an online entertainer that uses a digital avatar generated using computer graphics. A growing trend that originated in Japan in the mid-2010s, a majority of VTubers are Japanese-speaking YouTubers or live streamers who use anime-inspired avatar designs. By 2020, there were more than 10,000 active VTubers.
“Danbooru2019: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, (2015-12-15):
Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image or tag count or restricted distribution. I suggest that the “image-boorus” be used. The image boorus are longstanding web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations. The best known booru, with a focus on quality, is Danbooru.
We provide a torrent/rsync mirror which contains ~3tb of 3.69m images with 108m tag instances (of 392k defined tags, ~29/image) covering Danbooru from 2005-05-24–2019-12-31 (final ID: #3,734,659), providing the image files & a JSON export of the metadata. We also provide a smaller torrent of SFW images downscaled to 512×512px JPGs (295GB; 2,828,400 images) for convenience.
Our hope is that a Danbooru2019 dataset can be used for rich large-scale classification/tagging & learned embeddings, test out the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, provide an archival backup for the Danbooru community, feed back metadata improvements & corrections, and serve as a testbed for advanced techniques such as conditional image generation or style transfer.
“Making Anime Faces With StyleGAN: Reversing StyleGAN To Control & Modify Images”, (2019-03-24):
Discussion of how to modify existing images with GANs. There are several possibilities: train another NN to turn an image back into the original encoding; run blackbox search on encodings, repeatedly tweaking it to approximate a target face; or the whitebox approach, directly backpropagating through the model from the image to the encoding while holding the model fixed. All of these have been implemented for StyleGAN, and a combination works best. There are even GUIs for editing StyleGAN anime faces!
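As a concrete illustration of the whitebox approach, here is a minimal PyTorch-style sketch (a generic stand-in with a placeholder generator G, not the actual TensorFlow StyleGAN code): hold the trained Generator fixed, treat the latent as the only trainable parameter, and run gradient descent on a reconstruction loss. Practical encoders typically add a perceptual loss and optimize StyleGAN’s intermediate w vectors rather than z, but the skeleton is the same.

```python
import torch
import torch.nn.functional as F

def invert(G, target, latent_dim=512, steps=1000, lr=0.01):
    """Whitebox GAN inversion sketch: optimize a latent z so that G(z) reconstructs
    `target`, holding the (placeholder) generator G's weights fixed."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)   # only z is optimized; G is never updated
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(G(z), target)  # pixel loss; real encoders add a perceptual loss
        loss.backward()                  # gradients flow through the frozen G into z
        opt.step()
    return z.detach()

# Toy usage with a linear stand-in "generator" mapping a 512-d latent to a 256-d "image":
G = torch.nn.Linear(512, 256)
target = G(torch.randn(1, 512)).detach()
z_hat = invert(G, target, steps=200)
```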
“Generating Anime Faces with BigGAN”, (2019-06-04):
I explore BigGAN, another recent GAN with SOTA results on the most complex image domain tackled by GANs so far, ImageNet. BigGAN’s capabilities come at a steep compute cost, however. I experiment with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on a 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may turn out to be necessary to scale to whole anime images.
“GPT-3 Weaknesses: Byte-Pair Encodings (BPEs)”, (2020-06-23):
Compared to GPT-2, GPT-3 improves performance on character-level tasks like rhyming, alliteration, punning, anagrams or permutations, acrostic poems, and arithmetic less than expected, despite being very good at many other closely-related kinds of writings like satire.
Why? A plausible explanation is an obscure technical detail: as a performance optimization, GPT does not see characters but sub-word-chunks called “byte-pair encodings” (BPEs). Because GPTs never see characters but opaque partial-words, which vary chaotically based on the specific word and even the surrounding context, they are unable to easily learn about character-level aspects of language, like similar spellings or sounds, and are forced to learn relationships much more indirectly, like by brute-force memorizing of pairs of words.
Some experiments with reformatting GPT-3’s poorest-performing tasks to avoid inconsistent BPE encodings of strings shows small to large performance gains, consistent with this theory.
“YOLOv3: An Incremental Improvement”, (2018-04-08):
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry. At 320×320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/
“GPT-2 Folk Music: Training a Spaceless Model”, (2019-12-12):
While training a GPT-2-117M on a folk music corpus written in ABC format, persistent syntax errors kept being generated by an otherwise-high-quality model: random spaces would be generated, rendering a music piece either erroneous or lower-quality. Why? It seems to be some issue with the GPT BPE encoder handling of spaces which makes it difficult to emit the right space-separated characters. We found that ABC does not actually require spaces, and we simply removed all spaces from the corpus—noticeably improving quality of generated pieces.
“Interacting with GPT–2 to Generate Controlled and Believable Musical Sequences in ABC Notation”, (2020-10-16):
Generating symbolic music with language models is a promising research area, with potential applications in automated music composition. Recent work shows that Transformer architectures can learn to generate compelling four-instrument scores from large MIDI datasets. In this paper, we re-train the small (117M) GPT-2 model with a large dataset in ABC notation, and generate samples of single-instrument folk music. Our BLEU and ROUGE based quantitative, and survey based qualitative, evaluations suggest that ABC notation is learned with syntactical and semantic correctness, and that samples contain robust and believable n-grams.
“Generating MIDI Music With GPT-2: Generating MIDI by converting to ABC and expanding the GPT-2 context window—works, if only just”, (2020-04-25):
To expand the ABC GPT-2 model to cover a wider variety of musical genres, I turn to the next-most compact widespread music encoding format: MIDI. There are hundreds of thousands of MIDIs which can be decompiled to ABC format, averaging ~10k BPEs—within GPT-2-117M’s feasible context window when trained on TPUs (which permit training of context windows up to 30k wide).
We compile the ABC from before and 2 large MIDI datasets, and convert to ABC, yielding ~453k usable ABC-MIDI musical files (~5.1GB of text). We trained January–April 2020 on our TPU swarm (with many interruptions), achieving a final loss of ~0.2 (underfit).
Sampling from the final model is hit-or-miss as it is prone to the likelihood repetition trap and it generates instruments one-by-one so it is common for instruments to be cut off or otherwise broken during sampling (indicating that sampling is increasingly a bigger problem than training for long-range sequence modeling). However, successful pieces are possible, and are musically far more diverse than the folk ABC corpus, with many pleasingly complex samples.
/GPT-2-preference-learning#bradley-terry-preference-learning
/GPT-2-preference-learning#optimization-by-backprop-not-blackbox
“The Go Transformer: Natural Language Modeling for Game Play”, (2020-07-07):
This work applies natural language modeling to generate plausible strategic moves in the ancient game of Go. We train the Generative Pretrained Transformer (GPT-2) to mimic the style of Go champions as archived in Smart Game Format (SGF), which offers a text description of move sequences. The trained model further generates valid but previously unseen strategies for Go. Because GPT-2 preserves punctuation and spacing, the raw output of the text generator provides inputs to game visualization and creative patterns, such as the Sabaki project’s game engine using auto-replays. Results demonstrate that language modeling can capture both the sequencing format of championship Go games and their strategic formations. Compared to random game boards, the GPT-2 fine-tuning shows efficient opening move sequences favoring corner play over less advantageous center and side play. Game generation as a language modeling task offers novel approaches to more than 40 other board games where historical text annotation provides training data (e.g., Amazons & Connect 4/6).
“The Chess Transformer: Mastering Play using Generative Language Models”, (2020-08-02):
This work demonstrates that natural language transformers can support more generic strategic modeling, particularly for text-archived games. In addition to learning natural language skills, the abstract transformer architecture can generate meaningful moves on a chessboard. With further fine-tuning, the transformer learns complex gameplay by training on 2.8 million chess games in Portable Game Notation. After 30,000 training steps, OpenAI’s Generative Pre-trained Transformer (GPT-2) optimizes weights for 774 million parameters. This fine-tuned Chess Transformer generates plausible strategies and displays game formations identifiable as classic openings, such as English or the Slav Exchange. Finally, in live play, the novel model demonstrates a human-to-transformer interface that correctly filters illegal moves and provides a novel method to challenge the transformer’s chess strategies. We anticipate future work will build on this transformer’s promise, particularly in other strategy games where features can capture the underlying complex rule syntax from simple but expressive player annotations.
“Decompiler”, (2020-12-22):
A decompiler is a computer program that takes an executable file as input, and attempts to create a high level source file which can be recompiled successfully. It is therefore the opposite of a compiler, which takes a source file and makes an executable. Decompilers are usually unable to perfectly reconstruct the original source code, and as such, will frequently produce obfuscated code. Nonetheless, decompilers remain an important tool in the reverse engineering of computer software.
Turns out that when training goes really wrong, you can crash many GAN implementations with either a segfault, integer overflow, or division by zero error.↩︎
StackGAN/StackGAN++/PixelCNN et al are difficult to run as they require a unique image embedding which could only be computed in the unmaintained Torch framework using Reed’s prior work on a joint text+image embedding which however doesn’t run on anything but the Birds & Flowers datasets, and so no one has ever, as far as I am aware, run those implementations on anything else—certainly I never managed to despite quite a few hours trying to reverse-engineer the embedding & various implementations.↩︎
Be sure to check out Artbreeder.↩︎
Glow’s reported results required >40 GPU-weeks; BigGAN’s total compute is unclear as it was trained on a TPUv3 Google cluster but it would appear that a 128px BigGAN might be ~4 GPU-months assuming hardware like an 8-GPU machine, 256px ~8 GPU-months, and 512px ≫8 GPU-months, with VRAM being the main limiting factor for larger models (although progressive growing might be able to cut those estimates).↩︎
illustration2vec is an old & small CNN trained to predict a few -booru tags on anime images, and so provides an embedding—but not a good one. The lack of a good embedding is the major limitation for anime deep learning as of February 2019. (DeepDanbooru, while performing well apparently, has not yet been used for embeddings.) An embedding is necessary for text→image GANs, image searches & nearest-neighbor checks of overfitting, FID errors for objectively comparing GANs, minibatch discrimination to help the D/provide an auxiliary loss to stabilize learning, anime style transfer (both for its own sake & for creating a ‘StyleDanbooru2018’ to reduce texture cheating), encoding into GAN latent spaces for manipulation, data cleaning (to detect anomalous datapoints like failed face crops), perceptual losses for encoders or as an additional auxiliary loss/pretraining (like “NoGAN”, which trains a Generator on a perceptual loss and does GAN training only for finetuning), etc. A good tagger is also a good starting point for doing pixel-level semantic segmentation (via “weak supervision”), which metadata is key for training something like Nvidia’s GauGAN successor to pix2pix (Park et al 2019; source).↩︎
Technical note: I typically train NNs using my workstation with 2×1080ti GPUs. For easier comparison, I convert all my times to single-GPU equivalent (ie “6 GPU-weeks” means 3 realtime/wallclock weeks on my 2 GPUs).↩︎
Kynkäänniemi et al 2019 observes (§4 “Using precision and recall to analyze and improve StyleGAN”) that StyleGAN with progressive growing disabled does work but at some cost to precision/recall quality metrics; whether this reflects inferior performance on a given training budget or an inherent limit—BigGAN and other self-attention-using GANs do not use progressive growing at all, suggesting it is not truly necessary—is not investigated. In December 2019, StyleGAN 2 successfully dropped progressive growing entirely at modest performance cost.↩︎
This has confused some people, so to clarify the sequence of events: I trained my anime face StyleGAN and posted notes on Twitter, releasing an early model; roadrunner01 generated an interpolation video using said model (but a different random seed, of course); this interpolation video was retweeted by the Japanese Twitter user _Ryobot, upon which it went viral and was ‘liked’ by Elon Musk, further driving virality (19k reshares, 65k likes, 1.29m watches as of 2019-03-22).↩︎
Google Colab is a free service which includes free GPU time (up to 12 hours on a small GPU). Especially for people who do not have a reasonably capable GPU on their personal computers (such as all Apple users) or do not want to engage in the admitted hassle of renting a real cloud GPU instance, Colab can be a great way to play with a pretrained model, like generating GPT-2-117M text completions or StyleGAN interpolation videos, or prototype on tiny problems.
However, it is a bad idea to try to train real models, like 512–1024px StyleGANs, on a Colab instance: the GPUs are low-VRAM, far slower (6 hours per StyleGAN tick!), and unwieldy to work with (as one must save snapshots constantly to restart when the session runs out), and there is no real command-line. Colab is just barely adequate for perhaps 1 or 2 ticks of transfer learning, but not more. If you harbor greater ambitions but still refuse to spend any money (rather than time), Kaggle has a similar service with P100 GPU slices rather than K80s. Otherwise, one needs to get access to real GPUs.↩︎
Curiously, the benefit of many more FC layers than usual may have been stumbled across before: IllustrationGAN found that adding some FC layers seemed to help their DCGAN generate anime faces, and when I & FeepingCreature experimented with adding 2–4 FC layers to WGAN-GP along IllustrationGAN’s lines, it did help our lackluster results, and at the time I speculated that “the fully-connected layers are transforming the latent-z/noise into a sort of global template which the subsequent convolution layers can then fill in more locally.” But we never dreamed of going as deep as 8!↩︎
The ProGAN/StyleGAN codebase reportedly does work with conditioning, but none of the papers report on this functionality and I have not used it myself.↩︎
The latent embedding z is usually generated in about the simplest possible way: draws from the Normal distribution, 𝒩(0,1). A uniform distribution is sometimes used instead. There is no good justification for this and some reason to think this can be bad (how does a GAN easily map a discrete or binary latent factor, such as the presence or absence of the left ear, onto a Normal variable?).
The BigGAN paper explores alternatives, finding improvements in training time and/or final quality from using instead (in ascending order): a Normal + binary Bernoulli (p = 0.5; personal communication, Brock) variable, a binary (Bernoulli), and a Rectified Gaussian (sometimes called a “censored normal” even though that sounds like a truncated normal distribution rather than the rectified one). The rectified Gaussian distribution “outperforms (in terms of IS) by 15–20% and tends to require fewer iterations.” The downside is that the “truncation trick”, which yields even larger average improvements in image quality (at the expense of diversity), doesn’t quite apply, and the rectified Gaussian sans truncation produced similar results as the Normal+truncation, so BigGAN reverted to the default Normal distribution+truncation (personal communication).
The truncation trick either directly applies to some of the other distributions, particularly the Rectified Gaussian, or could easily be adapted—possibly yielding an improvement over either approach. The Rectified Gaussian can be truncated just like the default Normals can. And for the Bernoulli, one could decrease p during the generation, or what is probably equivalent, re-sample whenever the variance (ie squared sum) of all the Bernoulli latent variables exceeds a certain constant. (With p = 0.5, a latent vector of 512 Bernoullis would on average sum up to simply 0.5 × 512 = 256, with the 2.5%–97.5% quantiles being 234–278, so a ‘truncation trick’ here might be throwing out every vector with a sum above, say, the 80% quantile of 266.)
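A minimal numpy sketch of that hypothetical Bernoulli ‘truncation trick’, rejection-sampling a 512-dimensional Bernoulli(0.5) latent and discarding any draw whose sum exceeds the ~80% quantile of 266:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_bernoulli_latent(dim=512, p=0.5, max_sum=266):
    """Rejection-sample a Bernoulli latent vector, discarding any draw whose
    sum exceeds max_sum (≈ the 80% quantile of Binomial(512, 0.5))."""
    while True:
        z = rng.binomial(1, p, size=dim)
        if z.sum() <= max_sum:
            return z

z = truncated_bernoulli_latent()
print(z.sum())  # always ≤ 266; without truncation the mean sum is 0.5 × 512 = 256
```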
One also wonders about vectors which draw from multiple distributions rather than just one. Could the StyleGAN 8-FC-layer learned-latent-variable be reverse-engineered? Perhaps the first layer or two merely converts the normal input into a more useful distribution & parameters/training could be saved or insight gained by imitating that.↩︎
Which raises the question: if you added any or all of those features, would StyleGAN become that much better? Unfortunately, while theorists & practitioners have had many ideas, so far theory has proven more fecund than fatidical and the large-scale GAN experiments necessary to truly test the suggestions are too expensive for most. Half of these suggestions are great ideas—but which half?↩︎
For more on the choice of convolution layers/kernel sizes, see Karpathy’s 2015 notes for “CS231n: Convolutional Neural Networks for Visual Recognition”, or take a look at these Convolution animations & Yang’s interactive “Convolution Visualizer”.↩︎
These observations apply only to the Generator in GANs (which is what we primarily care about); curiously, there’s some reason to think that GAN Discriminators are in fact mostly memorizing (see later).↩︎
A possible alternative is ESRGAN (Wang et al 2018).↩︎
Based on eyeballing the ‘cat’ bar graph in Figure 3 of Yu et al 2015.↩︎
CATS offer an amusing instance of the dangers of data augmentation: ProGAN used horizontal flipping/mirroring for everything, because why not? This led to strange Cyrillic text captions showing up in the generated cat images. Why not Latin alphabet captions? Because every cat image was being shown mirrored as well as normally! For StyleGAN, mirroring was disabled, so now the lolcat captions are recognizably Latin alphabetical, and even almost English words. This demonstrates that even datasets where left/right doesn’t seem to matter, like cat photos, can surprise you.↩︎
I estimated the total cost using AWS EC2 preemptible hourly costs on 2019-03-15 as follows:
- 1 GPU: p2.xlarge instance in us-east-2a, half of a K80 (12GB VRAM): $0.3235/hour
- 2 GPUs: NA—there is no P2 instance with 2 GPUs, only 1/8/16
- 8 GPUs: p2.8xlarge in us-east-2a, 8 halves of K80s (12GB VRAM each): $2.160/hour
As usual, there is sublinear scaling, and larger instances cost disproportionately more, because one is paying for faster wallclock training (time is valuable) and for not having to create a distributed infrastructure which can exploit the cheap single-GPU instances.
This cost estimate does not count additional costs like hard drive space. In addition to the dataset size (the StyleGAN data encoding is ~18× larger than the raw data size, so a 10GB folder of images → 200GB of .tfrecords), you would need at least 100GB HDD (50GB for the OS, and 50GB for checkpoints/images/etc to avoid crashes from running out of space).↩︎
I regard this as a flaw in StyleGAN & TF in general. Computers are more than fast enough to load & process images asynchronously using a few worker threads, and working with a directory of images (rather than a special binary format 10–20× larger) avoids imposing serious burdens on the user & hard drive. PyTorch GANs almost always avoid this mistake, and are much more pleasant to work with as one can freely modify the dataset between (and even during) runs.↩︎
For example, my Danbooru2018 anime portrait dataset is 16GB, but the StyleGAN encoded dataset is 296GB.↩︎
This may be why some people report that StyleGAN just crashes for them & they can’t figure out why. They should try changing their dataset JPG ↔︎ PNG.↩︎
That is, in training G, the G’s fake images must be augmented before being passed to the D for rating; and in training D, both real & fake images must be augmented the same way before being passed to D. Previously, all GAN researchers appear to have assumed that one should only augment real images before passing to D during D training, which conveniently can be done at dataset creation; unfortunately, this hidden assumption turns out to be about the most harmful way possible!↩︎
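A skeletal sketch of where the augmentations have to go (toy PyTorch stand-ins for G/D, a placeholder differentiable augmentation, and a generic non-saturating loss; not the actual StyleGAN2-ADA implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch runs; the real G/D are StyleGAN networks.
G = nn.Linear(16, 64)     # "generator": 16-d latent -> 64-d "image"
D = nn.Linear(64, 1)      # "discriminator": "image" -> realism score
g_opt = torch.optim.Adam(G.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-3)

def augment(x):           # placeholder *differentiable* augmentation
    return x + 0.05 * torch.randn_like(x)

reals, z = torch.randn(8, 64), torch.randn(8, 16)

# D step: augment BOTH the real and the fake images before D sees them.
fakes = G(z).detach()
d_loss = F.softplus(D(augment(fakes))).mean() + F.softplus(-D(augment(reals))).mean()
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# G step: the G's fakes are likewise augmented before being scored by D;
# because the augmentation is differentiable, gradients still reach G.
g_loss = F.softplus(-D(augment(G(z)))).mean()
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```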
I would describe the distinctions as: Software 0.0 was imperative programming for hammering out clockwork mechanism; Software 1.0 was declarative programming with specification of policy; and Software 2.0 is deep learning by gardening loss functions (with everything else, from model arch to which datapoints to label ideally learned end-to-end). Continuing the theme, we might say that dialogue with models, like “prompt programming”, are “Software 3.0”…↩︎
But you may not want to–remember the lolcat captions!↩︎
Note: If you use a different command to resize, check it thoroughly. With ImageMagick, if you use the ^ operator like -resize 512x512^, you will not get exactly 512×512px images as you need; while if you use the ! operator like -resize 512x512!, the images will be exactly 512×512px but the aspect ratios will be distorted to make images fit, and this may confuse anything you are training by introducing unnecessary meaningless distortions & will make any generated images look bad.↩︎
If you are using Python 2, you will get print syntax error messages; if you are using Python 3–3.6, you will get ‘type hint’ errors.↩︎
Stas Podgorskiy has demonstrated that the StyleGAN 2 correction can be reverse-engineered and applied back to StyleGAN 1 generators if necessary.↩︎
This makes it conform to a truncated normal distribution; why truncated rather than rectified/winsorized at a max like 0.5 or 1.0 instead? Because then many, possibly most, of the latent variables would all be at the max, instead of smoothly spread out over the permitted range.↩︎
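A quick numpy check of that claim: clipping/winsorizing a standard Normal at ±0.5 pins roughly 62% of the latent variables exactly to the boundary, whereas truncation (resampling) leaves none there:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)

clipped = np.clip(z, -0.5, 0.5)           # winsorizing/rectifying at ±0.5
print((np.abs(clipped) == 0.5).mean())    # ≈ 0.62: mass piles up exactly at the boundary

truncated = z[np.abs(z) < 0.5]            # truncation = rejection sampling
print((np.abs(truncated) == 0.5).mean())  # 0.0: values stay smoothly spread out
```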
No minibatches are used, so this is much slower than necessary.↩︎
The question is not whether one is to start with an initialization at all, but whether to start with one which does everything poorly, or one which does a few similar things well. Similarly, from a Bayesian statistics perspective, the question of what prior to use is one that everyone faces; however, many approaches sweep it under the rug and effectively assume a default flat prior that is consistently bad and optimal for no meaningful problem ever.↩︎
ADA/StyleGAN3 is reportedly much more sample-efficient and reduces the need for transfer learning: Karras et al 2020. But if a relevant model is available, it should still be used. Backporting the ADA data augmentation trick to StyleGAN1–2 will be a major upgrade.↩︎
There are more real Asuka images than Holo to begin with, but there is no particular reason for the 10× data augmentation compared to the Holo’s 3×—the data augmentations were just done at different times and happened to have less or more augmentations enabled.↩︎
A famous example is character designer Yoshiyuki Sadamoto demonstrating how to turn Nadia (Nadia: The Secret of Blue Water) into Shinji Ikari (Evangelion).↩︎
It turns out that this latent vector trick does work. Intriguingly, it works even better to do ‘model averaging’ or ‘model blending’ (“StyleGAN network blending”/“Resolution Dependent GAN Interpolation for Controllable Image Synthesis Between Domains”, Pinkney & Adler 2020): retrain model A on dataset B, and then take a weighted average of the 2 models (you average them, parameter by parameter, and remarkably, that Just Works, or you can swap out layers between models), and then you can create faces which are arbitrarily in between A and B. So for example, you can blend FFHQ/Western-animation faces (Colab notebook), ukiyo-e/FFHQ faces, furries/FFHQ faces, foxes/FFHQ faces, or even furries/foxes/FFHQ/anime/ponies.↩︎
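To make that parameter-by-parameter blend concrete, here is a minimal framework-agnostic sketch operating on checkpoints loaded as name→array dicts (this is not Pinkney & Adler’s actual blend_models.py, which does the blending per resolution level):

```python
import numpy as np

def blend_models(params_a, params_b, alpha=0.5):
    """Weighted average of two checkpoints with identical architectures.
    params_a/params_b map parameter names to numpy arrays (eg model A and the
    same model after fine-tuning on dataset B); alpha=0 gives A, alpha=1 gives B."""
    assert params_a.keys() == params_b.keys()
    return {name: (1 - alpha) * params_a[name] + alpha * params_b[name]
            for name in params_a}

# Toy usage:
a = {"conv1/weight": np.zeros((3, 3)), "conv1/bias": np.zeros(3)}
b = {"conv1/weight": np.ones((3, 3)),  "conv1/bias": np.ones(3)}
halfway = blend_models(a, b, alpha=0.5)   # every parameter is 0.5: 'in between' A & B
```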
In retrospect, this shouldn’t’ve surprised me.↩︎
There is for other architectures like flow-based ones such as Glow, and this is one of their benefits–while the requirement to be made out of building blocks which can be run backwards & forwards equally well, to be ‘invertible’, is currently extremely expensive and the results not competitive either in final image quality or compute requirements, the invertibility means that encoding an arbitrary real image to get its inferred latents Just Works™ and one can easily morph between 2 arbitrary images, or encode an arbitrary image & edit it in the latent space to do things like add/remove glasses from a face or create an opposite-sex version.↩︎
This final approach is, interestingly, the historical reason backpropagation was invented: it corresponds to planning in a model. For example, in planning the flight path of an airplane (Kelley 1960/Bryson & Denham 1962): the destination or ‘output’ is fixed, the aerodynamics+geography or ‘model parameters’ are also fixed, and the question is what actions determining a flight path will reduce the loss function of time or fuel spent. One starts with a random set of actions picking a random flight path, runs it forward through the environment model, gets a final time/fuel spent, and then backpropagates through the model to get the gradients for the flight path, adjusting the flight path towards a new set of actions which will slightly reduce the time/fuel spent; the new actions are used to plan out the flight to get a new loss, and so on, until a local minimum of the actions has been found. This works with non-stochastic problems; for stochastic ones where the path can’t be guaranteed to be executed, “model-predictive control” can be used to replan at every step and execute adjustments as necessary. Another interesting use of backpropagation for outputs is Zhang et al 2019 which tackles the long-standing problem of how to get NNs to output sets rather than list outputs by generating a possible set output & refining it via backpropagation.↩︎
SGD is common, but a second-order algorithm like Limited-memory BFGS is often used in these applications in order to run as few iterations as possible.↩︎
Jahanian et al 2019 shows that BigGAN/StyleGAN latent embeddings can also go beyond what one might expect, to include zooms, translations, and other transforms.↩︎
Flow models have other advantages, mostly stemming from the maximum likelihood training objective. Since the image can be propagated backwards and forwards losslessly, instead of being limited to generating random samples like a GAN, it’s possible to calculate the exact probability of an image, enabling maximum likelihood as a loss to optimize, and dropping the Discriminator entirely. With no GAN dynamics, there’s no worry about weird training dynamics, and the likelihood loss also forbids ‘mode dropping’: the flow model can’t simply conspire with a Discriminator to forget possible images.↩︎
StyleGAN 2 is more computationally expensive but Karras et al optimized the codebase to make up for it, keeping total compute constant.↩︎
Backup-backup mirror: rsync rsync://78.46.86.149:873/biggan/2020-01-11-skylion-stylegan2-animeportraits-networksnapshot-024664.pkl.xz ./↩︎
ImageNet requires you to sign up & be approved to download from them, but 2 months later I have still heard nothing back. So I used the data from ILSVRC2012_img_train.tar (MD5: 1d675b47d978889d74fa0da5fadfb00e; 138GB) which I downloaded from the ImageNet LSVRC 2012 Training Set (Object Detection) torrent.↩︎
Danbooru can classify the same character under multiple tags: for example, Sailor Moon characters are tagged under their “Sailor X” name for images of their transformed version, and their real names for ‘civilian’ images (eg ‘Sailor Venus’ or ‘Cure Moonlight’, the former of which I merged with ‘Aino Minako’). Some popular franchises have many variants of each character: the Fate franchise, especially with the success of Fate/Grand Order, is a particular offender, with quite a few variants of characters like Saber.↩︎
One would think it would, but I asked Brock and apparently it doesn’t help to occasionally initialize from the EMA snapshots. EMA is a mysterious thing.↩︎
As far as I can tell, it has something to do with the dataloader code in utils.py: the calculation of length and the iterator do something weird to adjust for previous training, so the net effect is that you can run with a fixed minibatch accumulation and it’ll be fine, and you can reduce the number of accumulations, and it’ll simply underrun the dataloader, but if you increase the number of accumulations, if you’ve trained enough percentage-wise, it’ll immediately flip over into a negative length and indexing into it becomes completely impossible, leading to crashes. Unfortunately, I only ever want to increase the minibatch accumulation… I tried to fix it but the logic is too convoluted for me to follow it.↩︎
Mirror: rsync --verbose rsync://78.46.86.149:873/biggan/2019-05-28-biggan-danbooru2018-snapshot-83520.tar.xz ./↩︎
Mirror: rsync --verbose rsync://78.46.86.149:873/biggan/2019-06-04-biggan-256px-danbooru20181k-83520-randomsamples.tar ./↩︎