Anime Neural Net Graveyard

Compilation of failed neural network experiments in generating anime images, pre-StyleGAN/BigGAN.
anime, NGE, NN
2019-02-04–2021-01-29 · finished · certainty: highly likely · importance: 2

My experiments in generating anime faces, tried periodically since 2015, succeeded in 2019 with the release of StyleGAN. But for comparison, here are the failures from some of my older GAN or other NN attempts; as the quality is worse than StyleGAN's, I won't bother going into details—creating the datasets & training the ProGAN & tuning & transfer-learning were all much the same as already outlined at length elsewhere.

Included are:

  • ProGAN

  • Glow


  • PokeGAN

  • Self-Attention-GAN-TensorFlow

  • VGAN

  • BigGAN unofficial (official is covered elsewhere)

    • BigGAN-TensorFlow
    • BigGAN-PyTorch
  • GAN-QP

  • WGAN

  • IntroVAE

Examples of anime faces generated by neural networks have wowed weebs across the world. Who could not be impressed by samples like these:

64 TWDNE face samples selected from social media, in an 8×8 grid.

They are colorful, well-drawn, near-flawless, and attractive to look at. This page is not about them. This page is about… the others—the failed experiments that came before.



ProGAN

Using the official implementation:

  1. 2018-09-08, 512–1024px whole-Asuka images ProGAN samples:

    1024px, whole-Asuka images, ProGAN
    512px whole-Asuka images, ProGAN
  2. 2018-09-18, 512px Asuka faces, ProGAN samples:

    512px Asuka faces, ProGAN
  3. 2018-10-29, 512px Holo faces, ProGAN:

    Random samples of 512px ProGAN Holo faces

    After generating ~1k Holo faces, I selected the top decile (n = 103) of the faces (Imgur mirror):

    512px ProGAN Holo faces, random samples from top decile (6×6)

    The top decile images are, nevertheless, showing distinct signs of both artifacting & overfitting/memorization of data points. Another 2 weeks proved this out further:

    ProGAN samples of 512px Holo faces, after badly overfitting (iteration #10,325)

    Interpolation video of the October 2018 512px Holo face ProGAN; note the gross overfitting indicated by the abruptness of the interpolations jumping from face (mode) to face (mode) and the lack of meaningful intermediate faces, in addition to the overall blurriness & low visual quality.

  4. 2019-01-17, Danbooru2017 512px SFW images, ProGAN:

    512px SFW Danbooru2017, ProGAN
  5. 2019-02-05 (stopped in order to train with the new StyleGAN codebase), the 512px anime face dataset used elsewhere, ProGAN:

    512px anime faces, ProGAN

    Interpolation video of the 2019-02-05 512px anime face ProGAN; while the image quality is low, the diversity is good & shows no overfitting/memorization or blatant mode collapse.
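These interpolation videos are the overfitting diagnostic: sample pairs of latent vectors, render evenly spaced frames between them, and watch whether faces morph smoothly (generalization) or jump abruptly between memorized modes. A minimal sketch of the usual spherical interpolation between latents, in plain Python (the generator call in the final comment is hypothetical; only the slerp itself is shown):

```python
import math

def slerp(t, v0, v1):
    """Spherical interpolation between latent vectors v0, v1 at fraction t in [0, 1].

    Preferred over straight lerp for Gaussian latents, since intermediate
    points keep a typical norm instead of passing near the origin."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    dot = max(-1.0, min(1.0, dot))   # clamp against float rounding error
    theta = math.acos(dot)
    if theta < 1e-6:                 # nearly parallel: fall back to plain lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s = math.sin(theta)
    c0, c1 = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    return [c0 * a + c1 * b for a, b in zip(v0, v1)]

# Hypothetical usage: frames = [generate(slerp(i / 23, z0, z1)) for i in range(24)]
```

If consecutive frames differ drastically while mid-range frames are not face-like, the generator has likely memorized modes rather than learned a smooth manifold.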



Official implementation:

2018-12-15, 512px Asuka faces, failure case


PokeGAN

nshepperd's (unpublished) multi-scale GAN with self-attention layers, spectral normalization, and a few other tweaks:

PokeGAN, Asuka faces, 2018-11-16


Self-Attention-GAN-TensorFlow

SAGAN did not have an official implementation released at the time, so I used the Junho Kim implementation; 128px SAGAN, WGAN-LP loss, on Asuka faces & whole Asuka images:

Self-Attention-GAN-TensorFlow, whole Asuka, 2018-08-18
Training montage of the 2018-08-18 128px whole-Asuka SAGAN; possibly too-high LR
Self-Attention-GAN-TensorFlow, Asuka faces, 2018-09-13


VGAN

The official VGAN code for Peng et al 2018 had not been released when I began trying VGAN, so I used akanimax's implementation.

The variational discriminator bottleneck, along with self-attention layers and progressive growing, is one of the few strategies which permit 512px images, and I was intrigued to see that it worked relatively well, although I ran into persistent issues with instability & mode collapse. I suspect that VGAN could've worked better than it did with some more work.
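The variational discriminator bottleneck of Peng et al 2018 constrains the discriminator's encoder to leak at most a budget of I_c nats about its input, enforcing E[KL(E(z|x)‖r(z))] ≤ I_c through a Lagrange multiplier β adapted by dual gradient ascent. A minimal sketch of that β update (plain Python; the step size and names are my own illustration, not taken from the akanimax codebase):

```python
def update_beta(beta, kl_batch_avg, i_c, step=1e-5):
    """Dual gradient ascent on the bottleneck multiplier beta:
    raise beta when the encoder exceeds the information budget i_c,
    decay it (floored at 0) when the constraint is slack."""
    return max(0.0, beta + step * (kl_batch_avg - i_c))

# The discriminator loss then adds beta * (KL - i_c) as a penalty,
# so a growing beta squeezes the bottleneck progressively harder.
```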

akanimax VGAN, anime faces, 2018-12-25

BigGAN unofficial

BigGAN's official implementation & models were not released until late March 2019 (nor the semi-official compare_gan implementation until February 2019), and I experimented with 2 unofficial implementations in late 2018–early 2019.


BigGAN-TensorFlow

Junho Kim implementation; 128px spectral norm hinge loss, anime faces:

Kim BigGAN-TensorFlow, anime faces, 2019-01-17

This one never worked well at all, and I am still puzzled what went wrong.


BigGAN-PyTorch

Aaron Leong's PyTorch BigGAN implementation (not the official BigGAN implementation). As it's class-conditional, I faked having 1000 classes by constructing a variant anime face dataset: taking the top 1000 characters by tag count in the Danbooru2017 metadata, I then filtered for those character tags 1 by 1, and copied them & cropped faces into matching subdirectories 1–1000. This let me try out both faces & whole images. I also attempted to hack in gradient accumulation for big minibatches to make it a true BigGAN implementation, but it didn't help too much; the problem here might simply have been that I couldn't run it long enough.
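The class construction can be sketched as follows (plain Python, stdlib only; the metadata layout and field names here are simplified hypothetical stand-ins for the real Danbooru JSON schema):

```python
from collections import Counter

def build_class_index(metadata, n_classes=1000):
    """Map the n most frequent character tags to class IDs 0..n-1.

    `metadata` is an iterable of records like {"id": ..., "tags": [...]} --
    a simplified stand-in for the actual Danbooru metadata format."""
    counts = Counter(tag for rec in metadata for tag in rec["tags"])
    top = [tag for tag, _ in counts.most_common(n_classes)]
    return {tag: i for i, tag in enumerate(top)}

def assign_classes(metadata, class_index):
    """Yield (image-id, class-id) pairs for images tagged with a top character."""
    for rec in metadata:
        for tag in rec["tags"]:
            if tag in class_index:
                yield rec["id"], class_index[tag]
                break  # one class per image: first matching character tag wins
```

Each resulting (id, class) pair would then drive copying the cropped face into the matching numbered subdirectory.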

Results upon abandoning:

Leong BigGAN-PyTorch, 1000-class anime character dataset, 2018-11-30 (#314,000)
Leong BigGAN-PyTorch, 1000-class anime face dataset, 2018-12-24 (#1,006,320)
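The gradient-accumulation hack mentioned above simulates a BigGAN-sized minibatch on a single GPU by averaging gradients over several small micro-batches before each optimizer step. A toy stdlib sketch of why equal-sized micro-batches reproduce the full-batch gradient exactly (my own illustration, not code from the Leong repository):

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the scalar model y ~= w*x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro):
    """Average gradients over equal-sized micro-batches; with mean losses,
    this equals the gradient computed over the whole batch at once."""
    grads = [grad_mse(w, xs[i:i + micro], ys[i:i + micro])
             for i in range(0, len(xs), micro)]
    return sum(grads) / len(grads)
```

In a framework like PyTorch the same idea is usually expressed by scaling each micro-batch loss by 1/k and calling the optimizer step only every k backward passes.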


GAN-QP

Implementation of GAN-QP:

GAN-QP, 512px Asuka faces, 2018-11-21

Training oscillated enormously, with all the samples closely linked and changing simultaneously. This was despite the checkpoint model being enormous (551MB), and I am suspicious that something was seriously wrong—either the model architecture was wrong (too many layers or filters?) or the learning rate was many orders of magnitude too large. Because of the small minibatch, progress was difficult to make in a reasonable amount of wallclock time, so I moved on.


WGAN

Official implementation; I did most of the early anime face work with WGAN on a different machine and didn't keep copies. However, a sample from a short run gives an idea of what WGAN tended to look like on anime runs:

WGAN, 256px Asuka faces, iteration 2100

Normalizing Flow


Used the official Glow implementation.

Due to the enormous model size (4.2GB), I had to modify Glow's settings to get training working reasonably well, after extensive tinkering to figure out what any of the settings meant:

{"verbose": true, "restore_path": "logs/model_4.ckpt", "inference": false, "logdir": "./logs", "problem": "asuka",
"category": "", "data_dir": "../glow/data/asuka/", "dal": 2, "fmap": 1, "pmap": 16, "n_train": 20000, "n_test": 1000,
"n_batch_train": 16, "n_batch_test": 50, "n_batch_init": 16, "optimizer": "adamax", "lr": 0.0005, "beta1": 0.9,
"polyak_epochs": 1, "weight_decay": 1.0, "epochs": 1000000, "epochs_warmup": 10, "epochs_full_valid": 3,
"gradient_checkpointing": 1, "image_size": 512, "anchor_size": 128, "width": 512, "depth": 13, "weight_y": 0.0,
"n_bits_x": 8, "n_levels": 7, "n_sample": 16, "epochs_full_sample": 5, "learntop": false, "ycond": false, "seed": 0,
"flow_permutation": 2, "flow_coupling": 1, "n_y": 1, "rnd_crop": false, "local_batch_train": 1, "local_batch_test": 1,
"local_batch_init": 1, "direct_iterator": true, "train_its": 1250, "test_its": 63, "full_test_its": 1000, "n_bins": 256.0, "top_shape": [4, 4, 768]}
{"epoch": 5, "n_processed": 100000, "n_images": 6250, "train_time": 14496, "loss": "2.0090", "bits_x": "2.0090", "bits_y": "0.0000", "pred_loss": "1.0000"}
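As a sanity check on these settings, the logged top_shape of [4, 4, 768] follows from image_size and n_levels: each Glow level squeezes 2×2 spatial blocks into channels (×4), and every level but the last splits off half its channels. A sketch of that shape arithmetic (my own reconstruction, assuming the standard Glow squeeze/split scheme, not code from the repository):

```python
def glow_top_shape(image_size, n_levels, channels=3):
    """Final latent shape in Glow's multi-scale architecture: n_levels halvings
    of the spatial side, with channels multiplied 4x per squeeze and halved by
    the split at every level except the last."""
    side = image_size // (2 ** n_levels)
    ch = channels * 4 ** n_levels // 2 ** (n_levels - 1)
    return [side, side, ch]

# 512px input with 7 levels: 512 / 2^7 = 4 spatial, 3 * 4^7 / 2^6 = 768 channels
```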

An additional challenge was numerical instability when inverting matrices, giving rise to many 'invertibility' crashes.

Final sample before I looked up the compute requirements more carefully & gave up on Glow:

Glow, Asuka faces, 5 epochs (2018-08-02)



IntroVAE

A hybrid GAN-VAE architecture introduced in mid-2018 by Huang et al 2018, with the official PyTorch implementation released in April 2019, IntroVAE attempts to reuse the encoder-decoder for an adversarial loss as well, to combine the best of both worlds: the principled stable training & reversible encoder of the VAE with the sharpness & high quality of a GAN.

Quality-wise, they show IntroVAE works on CelebA & LSUN BEDROOM at up to 1024px resolution with results they claim are comparable to ProGAN. Performance-wise, for 512px, they give a runtime of 7 days with a minibatch n = 12, presumably on 4 GPUs (since their 1024px run script implies they used 4 GPUs, and I can fit a minibatch of n = 4 onto 1×1080ti, so 4 GPUs would be consistent with n = 12), and so 28 GPU-days.

I adapted the 256px suggested settings for my 512px anime portraits dataset:

python --hdim=512 --output_height=512 --channels='32, 64, 128, 256, 512, 512, 512' --m_plus=120 \
    --weight_rec=0.05 --weight_kl=1.0 --weight_neg=0.5 --num_vae=0 \
    --dataroot=/media/gwern/Data2/danbooru2018/portrait/1/ --trainsize=302652 --test_iter=1000 --save_iter=1 \
    --start_epoch=0 --batchSize=4 --nrow=8 --lr_e=0.0001 --lr_g=0.0001 --cuda --nEpochs=500
# ...====> Cur_iter: [187060]: Epoch [3] (5467⁄60531): time: 142675: Rec: 19569, Kl_E: 162, 151, 121, Kl_G: 151, 121,

There was a minor bug in the codebase where it would crash on trying to print out the log data, perhaps because it assumes multi-GPU while I was running on 1 GPU, and was trying to index into an array which was actually a simple scalar, which I fixed by removing the indexing:

-        info += 'Rec: {:.4f}, '.format(loss_rec.data[0])
-        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data[0],
-                      lossE_rec_kl.data[0], lossE_fake_kl.data[0])
-        info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data[0], lossG_fake_kl.data[0])
+        info += 'Rec: {:.4f}, '.format(loss_rec.data)
+        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data,
+                      lossE_rec_kl.data, lossE_fake_kl.data)
+        info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data, lossG_fake_kl.data)

Sample results after ~1.7 GPU-days:

IntroVAE, 512px anime portrait (n = 4, 3 sets: real datapoints, encoded → decoded versions of the real datapoints, and random generated samples)

By this point, StyleGAN would have been generating recognizable faces from scratch, while the IntroVAE random samples are not even face-like, and the IntroVAE training curve was not improving at a notable rate. IntroVAE has some hyperparameters which could probably be tuned better for the anime portrait faces (they briefly discuss the use of the --num_vae option to run in classic VAE mode, to let you tune the VAE-related hyperparameters before enabling the GAN-like part), but it should be fairly insensitive overall to hyperparameters, and tuning is unlikely to help all that much. So IntroVAE probably can't replace StyleGAN (yet?) for general-purpose image synthesis. This demonstrates again that seemingly everything works on CelebA these days, and just because something works on a photographic dataset does not mean it'll work on other datasets. Image-generation papers should probably branch out some more and consider non-photographic tests.