Making Anime Faces With StyleGAN

A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/2 neural networks, and tips/scripts for effective StyleGAN use.
anime, NGE, NN, Python, technology, tutorial
2019-02-04–2020-07-26; finished; certainty: highly likely; importance: 5

Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.

I show off my StyleGAN 1/2 CC-0-li­censed anime faces & videos, pro­vide down­loads for the final mod­els & , pro­vide the ‘miss­ing man­ual’ & explain how I trained them based on with source code for the data pre­pro­cess­ing, doc­u­ment instal­la­tion & con­fig­u­ra­tion & train­ing tricks.

For appli­ca­tion, I doc­u­ment var­i­ous scripts for gen­er­at­ing images & videos, briefly describe the web­site “This Waifu Does Not Exist” as a pub­lic demo (see also Art­breeder), dis­cuss how the trained mod­els can be used for trans­fer learn­ing such as gen­er­at­ing high­-qual­ity faces of anime char­ac­ters with small datasets (eg Holo or Asuka Souryuu Lan­g­ley), and touch on like encoders & con­trol­lable gen­er­a­tion.

The appen­dix gives sam­ples of my fail­ures with ear­lier GANs for anime face gen­er­a­tion, and I pro­vide sam­ples & model from a rel­a­tively large-s­cale train­ing run sug­gest­ing that BigGAN may be the next step for­ward to gen­er­at­ing ful­l-s­cale anime images.

A minute of read­ing could save an hour of debug­ging!

When Ian Goodfellow’s first GAN paper came out, with its blurry 64px grayscale faces, I said to myself, “given the rate at which GPUs & NN architectures improve, in a few years, we’ll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious.” There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/illustrations/drawings are so different from the exclusively-photographic datasets always (over)used in contemporary ML research that I was curious how it would work on anime: better, worse, faster, or different failure modes? Even more amusing: if random images become doable, then text→images would not be far behind.

So when GANs hit , and could do somewhat passable CelebA face samples around 2015, along with my , I began experimenting with Soumith Chintala’s implementation of DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of Asuka Souryuu Langley from Neon Genesis Evangelion because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)

It did not work. Despite many runs on my lap­top & a bor­rowed desk­top, DCGAN never got remotely near to the level of the CelebA face sam­ples, typ­i­cally top­ping out at red­dish blobs before diverg­ing or out­right crash­ing.1 Think­ing per­haps the prob­lem was too-s­mall datasets & I needed to train on all the faces, I began cre­at­ing the Dan­booru2017 ver­sion of . Armed with a large dataset, I sub­se­quently began work­ing through par­tic­u­larly promis­ing mem­bers of the GAN zoo, empha­siz­ing SOTA & open imple­men­ta­tions.

Among oth­ers, I have tried / & Pix­el*NN* (failed to get run­ning)2, WGAN-GP, Glow, GAN-QP, MSG-GAN, SAGAN, VGAN, PokeGAN, BigGAN3, ProGAN, & StyleGAN. These archi­tec­tures vary widely in their design & core algo­rithms and which of the many sta­bi­liza­tion tricks () they use, but they were more sim­i­lar in their results: dis­mal.

Glow & BigGAN had promis­ing results reported on CelebA & Ima­geNet respec­tive­ly, but unfor­tu­nately their train­ing require­ments were out of the ques­tion.4 (As inter­est­ing as and are, no source was released and I could­n’t even attempt them.)

While some remarkable tools like were created, and there were the occasional semi-successful anime face GANs like IllustrationGAN, the most notable attempt at anime face generation was Make Girls.moe. MGM could, interestingly, do in-browser 256px anime face generation using tiny GANs, but that is a dead end. MGM accomplished that much by making the problem easier: they added some light supervision in the form of a crude tag embedding5, and then simplified the problem drastically to n = 42k faces cropped from professional video game character artwork, which I regarded as not an acceptable solution: the faces were small & boring, and it was unclear if this data-cleaning approach could scale to anime faces in general, much less anime images in general. They are recognizably anime faces but the resolution is low and the quality is not great:

2017 SOTA: 16 ran­dom Make Girl­s.­Moe face sam­ples (4×4 grid)

Typ­i­cal­ly, a GAN would diverge after a day or two of train­ing, or it would col­lapse to pro­duc­ing a lim­ited range of faces (or a sin­gle face), or if it was sta­ble, sim­ply con­verge to a low level of qual­ity with a lot of fuzzi­ness; per­haps the most typ­i­cal fail­ure mode was het­e­rochro­mia (which is com­mon in anime but not that com­mon)—mis­matched eye col­ors (each color indi­vid­u­ally plau­si­ble), from the Gen­er­a­tor appar­ently being unable to coor­di­nate with itself to pick con­sis­tent­ly. With more recent archi­tec­tures like VGAN or SAGAN, which care­fully weaken the Dis­crim­i­na­tor or which add extreme­ly-pow­er­ful com­po­nents like self­-at­ten­tion lay­ers, I could reach fuzzy 128px faces.

Given the mis­er­able fail­ure of all the prior NNs I had tried, I had begun to seri­ously won­der if there was some­thing about non-pho­tographs which made them intrin­si­cally unable to be eas­ily mod­eled by con­vo­lu­tional neural net­works (the com­mon ingre­di­ent to them all). Did con­vo­lu­tions ren­der it unable to gen­er­ate sharp lines or flat regions of col­or? Did reg­u­lar GANs work only because pho­tographs were made almost entirely of blurry tex­tures?

But BigGAN demonstrated that a large cutting-edge GAN architecture could scale, given enough training, to all of ImageNet at even 512px. And ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat infeasible amounts of training. ProGAN (source; video), while expensive and requiring >6 GPU-weeks6, did work and was even powerful enough to overfit single-character face datasets; I didn’t have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting. This matters because a common sequence in DL/DRL (unlike many areas of AI) is that a problem seems intractable for long periods, until someone modifies a scalable architecture slightly, produces somewhat-credible (not necessarily human or even near-human) results, and then throws a ton of compute/data at it and, since the architecture scales, it rapidly exceeds SOTA and approaches human levels (and potentially exceeds human-level). Now I just needed a faster GAN architecture with which I could train a much bigger model on a much bigger dataset.

A his­tory of GAN gen­er­a­tion of anime faces: ‘do want’ to ‘oh no’ to ‘awe­some’

StyleGAN was the final break­through in pro­vid­ing ProGAN-level capa­bil­i­ties but fast: by switch­ing to a rad­i­cally differ­ent archi­tec­ture, it min­i­mized the need for the slow pro­gres­sive grow­ing (per­haps elim­i­nat­ing it entirely7), and learned effi­ciently at mul­ti­ple lev­els of res­o­lu­tion, with bonuses in pro­vid­ing much more con­trol of the gen­er­ated images with its “style trans­fer” metaphor.


First, some demon­stra­tions of what is pos­si­ble with StyleGAN on anime faces:

When it works: a hand-s­e­lected StyleGAN sam­ple from my Asuka Souryuu Lan­g­ley-fine­tuned StyleGAN
64 of the best TWDNE anime face sam­ples selected from social media (click to zoom).
100 ran­dom sam­ple images from the StyleGAN anime faces on TWDNE

Even a quick look at the MGM & StyleGAN sam­ples demon­strates the lat­ter to be supe­rior in res­o­lu­tion, fine details, and over­all appear­ance (although the MGM faces admit­tedly have fewer global mis­takes). It is also supe­rior to my 2018 ProGAN faces. Per­haps the most strik­ing fact about these faces, which should be empha­sized for those for­tu­nate enough not to have spent as much time look­ing at awful GAN sam­ples as I have, is not that the indi­vid­ual faces are good, but rather that the faces are so diverse, par­tic­u­larly when I look through face sam­ples with 𝜓≥1—it is not just the hair/eye color or head ori­en­ta­tion or fine details that differ, but the over­all style ranges from CG to car­toon sketch, and even the ‘media’ differ, I could swear many of these are try­ing to imi­tate water­col­ors, char­coal sketch­ing, or oil paint­ing rather than dig­i­tal draw­ings, and some come off as rec­og­niz­ably ’90s-anime-style vs ’00s-anime-style. (I could look through sam­ples all day despite the global errors because so many are inter­est­ing, which is not some­thing I could say of the MGM model whose nov­elty is quickly exhaust­ed, and it appears that users of my TWDNE web­site feel sim­i­larly as the aver­age length of each visit is 1m:55s.)

Inter­po­la­tion video of the 2019-02-11 face StyleGAN demon­strat­ing gen­er­al­iza­tion.
StyleGAN anime face inter­po­la­tion videos are Elon Musk™-ap­proved8!
Later inter­po­la­tion video (2019-03-08 face StyleGAN)


Exam­ple of the StyleGAN upscal­ing image pyra­mid archi­tec­ture: smal­l­→large (vi­su­al­iza­tion by Shawn Presser)

StyleGAN was pub­lished in 2018 as (source code; demo video/algo­rith­mic review video/results & dis­cus­sions video; Colab note­book9; Gen­Force PyTorch reim­ple­men­ta­tion with model zoo/Keras; explain­ers: Sky­ Minute Papers video). StyleGAN takes the stan­dard GAN archi­tec­ture embod­ied by ProGAN (whose source code it reuses) and, like the sim­i­lar GAN archi­tec­ture , draws inspi­ra­tion from the field of “style trans­fer” (essen­tially invented by ), by chang­ing the Gen­er­a­tor (G) which cre­ates the image by repeat­edly upscal­ing its res­o­lu­tion to take, at each level of res­o­lu­tion from 8px→16px→32px→64px→128px etc a ran­dom input or “style noise”, which is com­bined with and is used to tell the Gen­er­a­tor how to ‘style’ the image at that res­o­lu­tion by chang­ing the hair or chang­ing the skin tex­ture and so on. ‘Style noise’ at a low res­o­lu­tion like 32px affects the image rel­a­tively glob­al­ly, per­haps deter­min­ing the hair length or col­or, while style noise at a higher level like 256px might affect how frizzy indi­vid­ual strands of hair are. In con­trast, ProGAN and almost all other GANs inject noise into the G as well, but only at the begin­ning, which appears to work not nearly as well (per­haps because it is diffi­cult to prop­a­gate that ran­dom­ness ‘upwards’ along with the upscaled image itself to the later lay­ers to enable them to make con­sis­tent choic­es?). To put it sim­ply, by sys­tem­at­i­cally pro­vid­ing a bit of ran­dom­ness at each step in the process of gen­er­at­ing the image, StyleGAN can ‘choose’ vari­a­tions effec­tive­ly.

Karras et al 2018, StyleGAN vs ProGAN architecture: “Figure 1. While a traditional generator feeds the latent code [z] though the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers and the synthesis network g consists of 18 layers, two for each resolution (4²–1024²). The output of the last layer is converted to RGB using a separate 1×1 convolution, similar to Karras et al. [29]. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in the traditional generator.”
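
The two ingredients named in that caption, per-channel-scaled noise (“B”) and AdaIN driven by an affine transform of the intermediate latent (“A”), can be illustrated with a minimal NumPy sketch. This is not Nvidia’s code: the function names `adain` & `styled_layer`, the toy array shapes, and the random stand-in weights are all assumptions for illustration, and the convolution & nonlinearity are left out entirely.

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization for one image.

    x: feature maps of shape (C, H, W); style_scale, style_bias: shape (C,).
    Each channel is normalized to zero mean & unit std over its own pixels,
    then rescaled & shifted by the style.
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]

def styled_layer(x, w, affine_weight, affine_bias, noise_strength, rng):
    """One hypothetical 'style' step at a single resolution (no conv, no nonlinearity)."""
    channels = x.shape[0]
    # "B": one single-channel Gaussian noise image, scaled separately per channel.
    noise = rng.standard_normal((1,) + x.shape[1:])
    x = x + noise_strength[:, None, None] * noise
    # "A": affine map from the intermediate latent w to a (scale, bias) pair per channel.
    style = affine_weight @ w + affine_bias  # shape (2 * channels,)
    return adain(x, style[:channels], style[channels:])

rng = np.random.default_rng(0)
channels, size, w_dim = 4, 8, 16  # toy sizes, not StyleGAN's
x = rng.standard_normal((channels, size, size))
w = rng.standard_normal(w_dim)
affine_weight = rng.standard_normal((2 * channels, w_dim))  # stand-in for learned weights
affine_bias = np.zeros(2 * channels)
noise_strength = np.full(channels, 0.1)

out = styled_layer(x, w, affine_weight, affine_bias, noise_strength, rng)
print(out.shape)  # (4, 8, 8)
```

Because the normalization wipes out each channel’s incoming mean & variance, whatever statistics the output has come from the style alone; repeating a step like this at every resolution is what lets a style act globally at low resolutions and on fine detail at high ones.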

StyleGAN makes a num­ber of addi­tional improve­ments, but they appear to be less impor­tant: for exam­ple, it intro­duces a new FFHQ face/portrait dataset with 1024px images in order to show that StyleGAN con­vinc­ingly improves on ProGAN in final image qual­i­ty; switches to a loss which is more well-be­haved than the usual logis­tic-style loss­es; and archi­tec­ture-wise, it makes unusu­ally heavy use of ful­ly-con­nected (FC) lay­ers to process an ini­tial ran­dom input, no less than 8 lay­ers of 512 neu­rons, where most GANs use 1 or 2 FC lay­ers.10 More strik­ing is that it omits tech­niques that other GANs have found crit­i­cal for being able to train at 512px–1024px scale: it does not use newer losses like the , SAGAN-style self­-at­ten­tion lay­ers in either G/D, vari­a­tional Dis­crim­i­na­tor bot­tle­necks, con­di­tion­ing on a tag or cat­e­gory embed­ding11, BigGAN-style large mini­batch­es, differ­ent noise dis­tri­b­u­tions12, advanced reg­u­lar­iza­tion like , etc.13 One pos­si­ble rea­son for StyleGAN’s suc­cess is the way it com­bines out­puts from the mul­ti­ple lay­ers into a sin­gle final image rather than repeat­edly upscal­ing; when we visu­al­ize the out­put of each layer as an RGB image in anime StyleGANs, there is a strik­ing divi­sion of labor between lay­er­s—­some lay­ers focus on mono­chrome out­li­nes, while oth­ers fill in tex­tured regions of col­or, and they sum up into an image with sharp lines and good color gra­di­ents while main­tain­ing details like eyes.

Aside from the FCs and style noise & nor­mal­iza­tion, it is a vanilla archi­tec­ture. (One odd­ity is the use of only 3×3 con­vo­lu­tions & so few lay­ers in each upscal­ing block; a more con­ven­tional upscal­ing block than StyleGAN’s 3×3→3×3 would be some­thing like BigGAN which does 1×1 → 3×3 → 3×3 → 1×1. It’s not clear if this is a good idea as it lim­its the spa­tial influ­ence of each pixel by pro­vid­ing lim­ited recep­tive fields14.) Thus, if one has some famil­iar­ity with train­ing a ProGAN or another GAN, one can imme­di­ately work with StyleGAN with no trou­ble: the train­ing dynam­ics are sim­i­lar and the hyper­pa­ra­me­ters have their usual mean­ing, and the code­base is much the same as the orig­i­nal ProGAN (with the main excep­tion being that has been renamed (or in S2) and the orig­i­nal, which stores the crit­i­cal con­fig­u­ra­tion para­me­ters, has been moved to training/; there is still no sup­port for com­mand-line options and StyleGAN must be con­trolled by edit­ing by hand).


Because of its speed and stability, when the source code was released on 2019-02-04 (a date that will long be noted in the ANNals of GANime), the Nvidia models & sample dumps were quickly perused & new StyleGANs trained on a wide variety of image types, yielding, in addition to the original faces/cars/cats of Karras et al 2018:

Image­quilt visu­al­iza­tion of the wide range of visual sub­jects StyleGAN has been applied to

Why Don’t GANs Work?

Why does StyleGAN work so well on anime images while other GANs worked not at all or slowly at best?

The lesson I took from “Are GANs Created Equal? A Large-Scale Study”, Lucic et al 2017, is that CelebA/CIFAR10 are too easy, as almost all evaluated GAN architectures were capable of occasionally achieving good FID if one simply did enough iterations & hyperparameter tuning.

Inter­est­ing­ly, I con­sis­tently observe in train­ing all GANs on anime that clear lines & sharp­ness & cel-like smooth gra­di­ents appear only toward the end of train­ing, after typ­i­cally ini­tially blurry tex­tures have coa­lesced. This sug­gests an inher­ent bias of CNNs: color images work because they pro­vide some degree of tex­tures to start with, but lineart/monochrome stuff fails because the GAN opti­miza­tion dynam­ics flail around. This is con­sis­tent with —which uses style trans­fer to con­struct a data-augmented/transformed “Styl­ized-Im­a­geNet”—show­ing that Ima­geNet CNNs are lazy and, because the tasks can be achieved to some degree with tex­ture-only clas­si­fi­ca­tion (as demon­strated by sev­eral of Geirhos et al 2018’s authors via “Bag­Nets”), focus on tex­tures unless oth­er­wise forced; and by & , who find that although CNNs are per­fectly capa­ble of empha­siz­ing shape over tex­ture, low­er-per­form­ing mod­els tend to rely more heav­ily on tex­ture and that many kinds of train­ing (in­clud­ing ) will induce a tex­ture focus, sug­gest­ing tex­ture tends to be low­er-hang­ing fruit. So while CNNs can learn sharp lines & shapes rather than tex­tures, the typ­i­cal GAN archi­tec­ture & train­ing algo­rithm do not make it easy. Since CIFAR10/CelebA can be fairly described as being just as heavy on tex­tures as Ima­geNet (which is not true of anime images), it is not sur­pris­ing that GANs train eas­ily on them start­ing with tex­tures and grad­u­ally refin­ing into good sam­ples but then strug­gle on ani­me.

This raises a ques­tion of whether the StyleGAN archi­tec­ture is nec­es­sary and whether many GANs might work, if only one had good style trans­fer for anime images and could, to defeat the tex­ture bias, gen­er­ate many ver­sions of each anime image which kept the shape while chang­ing the color palet­te? (Cur­rent style trans­fer meth­ods like the AdaIN PyTorch imple­men­ta­tion used by Geirhos et al 2018, do not work well on anime images, iron­i­cally enough, because they are trained on pho­to­graphic images, typ­i­cally using the old VGG mod­el.)


“…Its social account­abil­ity seems sort of like that of design­ers of mil­i­tary weapons: uncul­pa­ble right up until they get a lit­tle too good at their job.”

David Foster Wallace, “E unibus pluram: Television and U.S. Fiction”

To address some com­mon ques­tions peo­ple have after see­ing gen­er­ated sam­ples:

  • Over­fit­ting: “Aren’t StyleGAN (or BigGAN) just over­fit­ting & mem­o­riz­ing data?”

    Amus­ing­ly, this is not a ques­tion any­one really both­ered to ask of ear­lier GAN archi­tec­tures, which is a sign of progress. Over­fit­ting is a bet­ter prob­lem to have than under­fit­ting, because over­fit­ting means you can use a smaller model or more data or more aggres­sive reg­u­lar­iza­tion tech­niques, while under­fit­ting means your approach just isn’t work­ing.

    In any case, while there is cur­rently no way to con­clu­sively prove that cut­ting-edge GANs are not 100% mem­o­riz­ing (be­cause they should be mem­o­riz­ing to a con­sid­er­able extent in order to learn image gen­er­a­tion, and eval­u­at­ing gen­er­a­tive mod­els is hard in gen­er­al, and for GANs in par­tic­u­lar, because they don’t pro­vide stan­dard met­rics like like­li­hoods which could be used on held-out sam­ples), there are sev­eral rea­sons to think that they are not just mem­o­riz­ing:15

    1. Sample/Dataset Overlap: a standard check for overfitting is to compare generated images to their closest matches using nearest-neighbors (where distance is defined by features like a CNN embedding) lookup; examples of this are StackGAN’s Figure 6 & BigGAN’s Figures 10–14, where the photorealistic samples are nevertheless completely different from the most similar ImageNet datapoints. This has not been done for StyleGAN yet but I wouldn’t expect different results as GANs typically pass this check. (It’s worth noting that facial recognition reportedly does not return Flickr matches for random FFHQ StyleGAN faces, suggesting the generated faces genuinely look like new faces rather than any of the original Flickr faces.)

      One intrigu­ing obser­va­tion about GANs made by the BigGAN paper is that the crit­i­cisms of Gen­er­a­tors mem­o­riz­ing dat­a­points may be pre­cisely the oppo­site of real­i­ty: GANs may work pri­mar­ily by the Dis­crim­i­na­tor (adap­tive­ly) over­fit­ting to dat­a­points, thereby repelling the Gen­er­a­tor away from real dat­a­points and forc­ing it to learn nearby pos­si­ble images which col­lec­tively span the image dis­tri­b­u­tion. (With enough data, this cre­ates gen­er­al­iza­tion because “neural nets are lazy” and only learn to gen­er­al­ize when eas­ier strate­gies fail.)

    2. Seman­tic Under­stand­ing: GANs appear to learn mean­ing­ful con­cepts like indi­vid­ual objects, as demon­strated by “latent space addi­tion” or research tools like GANdissection/; image edits like object deletions/additions () or seg­ment­ing objects like dogs from their back­grounds (/) are diffi­cult to explain with­out some gen­uine under­stand­ing of images.

    In the case of StyleGAN anime faces, there are encoders and con­trol­lable face gen­er­a­tion now which demon­strate that the latent vari­ables do map onto mean­ing­ful fac­tors of vari­a­tion & the model must have gen­uinely learned about cre­at­ing images rather than merely mem­o­riz­ing real images or image patch­es. Sim­i­lar­ly, when we use the “trun­ca­tion trick”/ψ to sam­ple from rel­a­tively extreme unlikely images and we look at the dis­tor­tions, they show how gen­er­ated images break down in seman­ti­cal­ly-rel­e­vant ways, which would not be the case if it was just pla­gia­rism. (A par­tic­u­larly extreme exam­ple of the power of the learned StyleGAN prim­i­tives is ’s demon­stra­tion that Kar­ras et al’s FFHQ faces StyleGAN can be used to gen­er­ate fairly real­is­tic images of cats/dogs/cars.)

    3. Latent Space Smoothness: in general, interpolation in the latent space (z) shows smooth changes of images and logical transformations or variations of face features; if StyleGAN were merely memorizing individual datapoints, the interpolation would be expected to be low quality, yield many terrible faces, and exhibit ‘jumps’ in between points corresponding to real, memorized, datapoints. The StyleGAN anime face models do not exhibit this. (In contrast, the Holo ProGAN, which overfit badly, does show severe problems in its latent space interpolation videos.)

      Which is not to say that GANs do not have issues: “mode drop­ping” seems to still be an issue for BigGAN despite the expen­sive large-mini­batch train­ing, which is over­fit­ting to some degree, and StyleGAN pre­sum­ably suffers from it too.

    4. Transfer Learning: GANs have been used for semi-supervised learning (eg generating plausible ‘labeled’ samples to train a classifier on), imitation learning like , and retraining on further datasets; if the G is merely memorizing, it is difficult to explain how any of this would work.

  • Com­pute Require­ments: “Does­n’t StyleGAN take too long to train?”

    StyleGAN is remark­ably fast-train­ing for a GAN. With the anime faces, I got bet­ter results after 1–3 days of StyleGAN train­ing than I’d got­ten with >3 weeks of ProGAN train­ing. The train­ing times quoted by the StyleGAN repo may sound scary, but they are, in prac­tice, a steep over­es­ti­mate of what you actu­ally need, for sev­eral rea­sons:

    • Lower Resolution: the largest figures are for 1024px images but you may not need them to be that large or even have a big dataset of 1024px images. For anime faces, 1024px-sized faces are relatively rare, and training at 512px & upscaling 2× to 1024px with waifu2x[16] works fine & is much faster. Since upscaling is relatively simple & easy, another strategy is to change the progressive-growing schedule: rather than proceeding to the final resolution as fast as possible, adjust the schedule to stop at a more feasible resolution, spend the bulk of training time there, and then do just enough training at the final resolution to learn to upscale (eg spend 10% of training growing to 512px, then 80% of training time at 512px, then 10% at 1024px).
    • Dimin­ish­ing Returns: the largest gains in image qual­ity are seen in the first few days or weeks of train­ing with the remain­ing train­ing being not that use­ful as they focus on improv­ing small details (so just a few days may be more than ade­quate for your pur­pos­es, espe­cially if you’re will­ing to select a lit­tle more aggres­sively from sam­ples)
    • Trans­fer Learn­ing from a related model can save days or weeks of train­ing, as there is no need to train from scratch; with the anime face StyleGAN, one can train a char­ac­ter-spe­cific StyleGAN with a few hours or days at most, and cer­tainly do not need to spend mul­ti­ple weeks train­ing from scratch! (as­sum­ing that would­n’t just cause over­fit­ting) Sim­i­lar­ly, if one wants to train on some 1024px face dataset, why start from scratch, tak­ing ~1000 GPU-hours, when you can start from Nvidi­a’s FFHQ face model which is already fully trained, and can con­verge in a frac­tion of the from-scratch time? For 1024px, you could use a super-res­o­lu­tion GAN like to upscale? Alter­nate­ly, you could change the image pro­gres­sion bud­get to spend most of your time at 512px and then at the tail end try 1024px.
    • One-Time Costs: the upfront cost of a few hun­dred dol­lars of GPU-time (at inflated AWS prices) may seem steep, but should be kept in per­spec­tive. As with almost all NNs, train­ing 1 StyleGAN model can be lit­er­ally tens of mil­lions of times more expen­sive than sim­ply run­ning the Gen­er­a­tor to pro­duce 1 image; but it also need be paid only once by only one per­son, and the total price need not even be paid by the same per­son, given trans­fer learn­ing, but can be amor­tized across var­i­ous datasets. Indeed, given how fast run­ning the Gen­er­a­tor is, the trained model does­n’t even need to be run on a GPU. (The rule of thumb is that a GPU is 20–30× faster than the same thing on CPU, with rare instances when over­head dom­i­nates of the CPU being as fast or faster, so since gen­er­at­ing 1 image takes on the order of ~0.1s on GPU, a CPU can do it in ~3s, which is ade­quate for many pur­pos­es.)
  • Copy­right Infringe­ment: “Who owns StyleGAN images?”

    1. The Nvidia Source Code & Released Models for StyleGAN 1 are under a CC-BY-NC license, and you cannot edit them or produce “derivative works” such as retraining their FFHQ, car, or cat StyleGAN models. (StyleGAN 2 is under a new “Nvidia Source Code License-NC”, which appears to be effectively the same as the CC-BY-NC with the addition of a .)

    If a model is trained from scratch, then that does not apply as the source code is sim­ply another tool used to cre­ate the model and noth­ing about the CC-BY-NC license forces you to donate the copy­right to Nvidia. (It would be odd if such a thing did hap­pen—if your word proces­sor claimed to trans­fer the copy­rights of every­thing writ­ten in it to Microsoft!)

    For those con­cerned by the CC-BY-NC license, a 512px FFHQ con­fig-f StyleGAN 2 has been trained & released into the pub­lic domain by Aydao, and is avail­able for down­load from Mega and my rsync mir­ror:

    rsync --verbose rsync:// ./
    2. Models are generally considered transformative works, and the copyright owners of whatever data the model was trained on have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to artists or search engines; see the further reading.) The model is copyrighted to whoever created it. Hence, Nvidia has copyright on the models it created but I have copyright on the models I trained (which I release under CC-0).

    3. Samples are trickier. The usual widely-stated legal interpretation is that only human authors can earn a copyright and that machines, animals, inanimate objects or, most famously, monkeys cannot. The US Copyright Office states clearly that regardless of whether we regard a GAN as a machine or as something more intelligent like an animal, either way, it doesn’t count:

      A work of author­ship must pos­sess “some min­i­mal degree of cre­ativ­ity” to sus­tain a copy­right claim. Feist, 499 U.S. at 358, 362 (ci­ta­tion omit­ted). “[T]he req­ui­site level of cre­ativ­ity is extremely low.” Even a “slight amount” of cre­ative expres­sion will suffice. “The vast major­ity of works make the grade quite eas­i­ly, as they pos­sess some cre­ative spark, ‘no mat­ter how crude, hum­ble or obvi­ous it might be.’” Id. at 346 (ci­ta­tion omit­ted).

      … To qual­ify as a work of “author­ship” a work must be cre­ated by a human being. See Bur­row-Giles Lith­o­graphic Co., 111 U.S. at 58. Works that do not sat­isfy this require­ment are not copy­rightable. The Office will not reg­is­ter works pro­duced by nature, ani­mals, or plants.


      • A pho­to­graph taken by a mon­key.
      • A mural painted by an ele­phant.

      …the Office will not reg­is­ter works pro­duced by a machine or mere mechan­i­cal process that oper­ates ran­domly or auto­mat­i­cally with­out any cre­ative input or inter­ven­tion from a human author.

      A dump of ran­dom sam­ples such as the Nvidia sam­ples or TWDNE there­fore has no copy­right & by defi­n­i­tion is in the pub­lic domain.

      A new copy­right can be cre­at­ed, how­ev­er, if a human author is suffi­ciently ‘in the loop’, so to speak, as to exert a de min­imis amount of cre­ative effort, even if that ‘cre­ative effort’ is sim­ply select­ing a sin­gle image out of a dump of thou­sands of them or twid­dling knobs (eg on Make Girl­s.­Moe). Crypko, for exam­ple, take this posi­tion.

    Fur­ther read­ing on com­put­er-gen­er­ated art copy­rights:

Training requirements


“The road of excess leads to the palace of wis­dom
…If the fool would per­sist in his folly he would become wise
…You never know what is enough unless you know what is more than enough. …If oth­ers had not been fool­ish, we should be so.”

William Blake, “Proverbs of Hell”, The Marriage of Heaven and Hell

The nec­es­sary size for a dataset depends on the com­plex­ity of the domain and whether trans­fer learn­ing is being used. StyleGAN’s default set­tings yield a 1024px Gen­er­a­tor with 26.2M para­me­ters, which is a large model and can soak up poten­tially mil­lions of images, so there is no such thing as too much.

For learning decent-quality anime faces from scratch, a minimum of 5000 images appears to be necessary in practice; for learning a specific character when using the anime face StyleGAN, potentially as few as ~500 (especially with data augmentation) can give good results. For domains as complicated as “any cat photo” like Karras et al 2018’s cat StyleGAN which is trained on the LSUN CATS category of ~1.8M[17] cat photos, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that “CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds.”[18]


To fit rea­son­able mini­batch sizes, one will want GPUs with >11GB VRAM. At 512px, that will only train n = 4, and going below that means it’ll be even slower (and you may have to reduce learn­ing rates to avoid unsta­ble train­ing). So, Nvidia 1080ti & up would be good. (Re­port­ed­ly, AMD/OpenCL works for run­ning StyleGAN mod­els, and there is one report of suc­cess­ful train­ing with “Radeon VII with tensorflow-rocm 1.13.2 and rocm 2.3.14”.)

The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & provide a worst-case AWS-based cost estimate):

Estimated StyleGAN wallclock training times for various resolutions & GPU-clusters (source: StyleGAN repo)
GPUs | 1024² | 512² | 256² | [March 2019 AWS Costs19]
1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115]
2 | 21 days 22 hours [1,052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA]
4 | 11 days 8 hours [1,088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA]
8 | 6 days 14 hours [1,264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382]
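
The bracketed GPU-hour figures are simply wallclock hours multiplied by the number of GPUs. A minimal sketch of that conversion (the function name `gpu_hours` is made up for illustration):

```python
def gpu_hours(days: int, hours: int, n_gpus: int) -> int:
    """Total GPU-hours = wallclock hours x number of GPUs."""
    return (days * 24 + hours) * n_gpus

# Wallclock estimates for 1024px training, as (GPUs, days, hours).
for n_gpus, days, hours in [(1, 41, 4), (2, 21, 22), (4, 11, 8), (8, 6, 14)]:
    print(n_gpus, "GPU(s):", gpu_hours(days, hours, n_gpus), "GPU-hours")
```

The totals grow with the GPU count (988 on 1 GPU vs 1,264 on 8), which is the usual price of imperfect multi-GPU scaling: the run finishes sooner but consumes more total compute.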

AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound; 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary: with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/StyleGAN particularly struggle with backgrounds & edges of images because those are cut off, obscured, and highly-varied compared to the faces, whether anime or FFHQ. I hypothesize that the telltale blurry backgrounds are due to the impoverishment of the backgrounds/edges in cropped face photos, and they could be fixed by transfer-learning or pretraining on a more generic dataset like ImageNet, so it learns what the backgrounds even are in the first place; then in face training, it merely has to remember them & defocus a bit to generate correct blurry backgrounds.)

Train­ing improve­ments: 256px StyleGAN anime faces after ~46 GPU-hours vs 512px anime faces after 382 GPU-hours; see also the video mon­tage of first 9k iter­a­tions

Data Preparation

The most diffi­cult part of run­ning StyleGAN is prepar­ing the dataset prop­er­ly. StyleGAN does not, unlike most GAN imple­men­ta­tions (par­tic­u­larly PyTorch ones), sup­port read­ing a direc­tory of files as input; it can only read its unique .tfrecord for­mat which stores each image as raw arrays at every rel­e­vant res­o­lu­tion.20 Thus, input files must be per­fectly uni­form, (slow­ly) con­verted to the .tfrecord for­mat by the spe­cial tool, and will take up ~19× more disk space.21
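
The disk-space blowup follows from storing uncompressed pixels at every resolution. A rough sketch of the arithmetic, assuming 3 channels of 1 byte each and a pyramid that halves down to 4px (both are assumptions for illustration, not read out of the conversion tool):

```python
def raw_pyramid_bytes(resolution: int, channels: int = 3, min_resolution: int = 4) -> int:
    """Bytes needed to store one image uncompressed at every power-of-two resolution."""
    total = 0
    r = resolution
    while r >= min_resolution:
        total += r * r * channels  # one byte per channel per pixel
        r //= 2
    return total

print(raw_pyramid_bytes(512))   # 512px image plus all smaller copies
print(raw_pyramid_bytes(1024))
```

Under these assumptions a 512px image costs about 1MB, so a hypothetical ~55KB JPG would expand roughly 19-fold; the exact multiplier depends on how well the source images compress.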

A StyleGAN dataset must con­sist of images all for­mat­ted exactly the same way.

Images must be pre­cisely 512×512px or 1024×1024px etc (any eg 512×513px images will kill the entire run), they must all be the same col­or­space (you can­not have sRGB and Grayscale JPGs—and I doubt other color spaces work at all), they must not be trans­par­ent, the file­type must be the same as the model you intend to (re)­train (ie you can­not retrain a PNG-trained model on a JPG dataset, StyleGAN will crash every time with inscrutable convolution/channel-related errors)22, and there must be no sub­tle errors like CRC check­sum errors which image view­ers or libraries like ImageMag­ick often ignore.

Faces preparation

My work­flow:

  1. Down­load raw images from Dan­booru2018 if nec­es­sary
  2. Extract from the JSON Dan­booru2018 meta­data all the IDs of a sub­set of images if a spe­cific Dan­booru tag (such as a sin­gle char­ac­ter) is desired, using jq and shell script­ing
  3. Crop square anime faces from raw images using Nagadomi’s lbpcascade_animeface (reg­u­lar face-de­tec­tion meth­ods do not work on anime images)
  4. Delete empty files, mono­chrome or grayscale files, & exac­t-du­pli­cate files
  5. Con­vert to JPG
  6. Upscale images below the target resolution (512px) with waifu2x
  7. Con­vert all images to exactly 512×512 res­o­lu­tion sRGB JPG images
  8. If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN’s Discriminator
  9. Con­vert to StyleGAN for­mat using

The goal is to turn this:

100 ran­dom sam­ple images from the 512px SFW sub­set of Dan­booru in a 10×10 grid.

into this:

36 ran­dom sam­ple images from the cropped Dan­booru faces in a 6×6 grid.

Below I use shell script­ing to pre­pare the dataset. A pos­si­ble alter­na­tive is danbooru-utility, which aims to help “explore the dataset, fil­ter by tags, rat­ing, and score, detect faces, and resize the images”.


The Dan­booru2018 down­load can be done via Bit­Tor­rent or rsync, which pro­vides a JSON meta­data tar­ball which unpacks into metadata/2* & a folder struc­ture of {original,512px}/{0-999}/$ID.{png,jpg,...}.

For train­ing on SFW whole images, the 512px/ ver­sion of Dan­booru2018 would work, but it is not a great idea for faces because by scal­ing images down to 512px, a lot of face detail has been lost, and get­ting high­-qual­ity faces is a chal­lenge. The SFW IDs can be extracted from the file­names in 512px/ directly or from the meta­data by extract­ing the id & rating fields (and sav­ing to a file):

find ./512px/ -type f | sed -e 's/.*\/\([[:digit:]]*\)\.jpg/\1/'
# 967769
# 1853769
# 2729769
# 704769
# 1799769
# ...
tar xf metadata.json.tar.xz
cat metadata/20180000000000* | jq '[.id, .rating]' -c | fgrep '"s"' | cut -d '"' -f 2 # "
# ...
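
The same metadata filter can be sketched in Python, assuming (as the `jq` pipeline implies) that each metadata file holds one JSON object per line with `id` and `rating` fields. The function name `sfw_ids` and the sample records are invented for illustration:

```python
import json

def sfw_ids(lines):
    """Yield the id of every record whose rating is "s" (safe)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("rating") == "s":
            yield str(record["id"])

# Illustrative records only; real metadata lines carry many more fields.
sample = [
    '{"id": "967769", "rating": "s"}',
    '{"id": "1234", "rating": "e"}',
    '{"id": "1853769", "rating": "s"}',
]
print(list(sfw_ids(sample)))
```

In real use one would pass an open file handle for each metadata file instead of the `sample` list and write the IDs to a file such as `sfw-ids.txt`.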

After installing and testing Nagadomi’s lbpcascade_animeface to make sure it works, one can use a simple script which crops the face(s) from a single input image. The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors: non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250×250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative more-difficult-to-run library by Nagadomi which offers a face-cropping script, animeface-2009’s face_collector.rb, which Nagadomi says is better at cropping faces, but I was not impressed when I tried it out.)

import cv2
import sys
import os.path

def detect(cascade_file, filename, outputname):
    if not os.path.isfile(cascade_file):
        raise RuntimeError("%s: not found" % cascade_file)

    cascade = cv2.CascadeClassifier(cascade_file)
    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    ## NOTE: Suggested modification: increase minSize to '(250,250)' px,
    ## increasing proportion of high-quality faces & reducing
    ## false positives. Faces which are only 50×50px are useless
    ## and often not faces at all.
    ## For my StyleGANs, I use 250 or 300px boxes

    faces = cascade.detectMultiScale(gray,
                                     # detector options
                                     scaleFactor = 1.1,
                                     minNeighbors = 5,
                                     minSize = (50, 50))
    # Number each crop so that multiple faces from one image get distinct filenames
    for i, (x, y, w, h) in enumerate(faces):
        cropped = image[y: y + h, x: x + w]
        cv2.imwrite(outputname+str(i)+".png", cropped)

if len(sys.argv) != 4:
    sys.stderr.write("usage: <animeface.xml file>  <input> <output prefix>\n")
    sys.exit(-1)

detect(sys.argv[1], sys.argv[2], sys.argv[3])

The IDs can be com­bined with the pro­vided lbpcascade_animeface script using xargs, how­ever this will be far too slow and it would be bet­ter to exploit par­al­lelism with xargs --max-args=1 --max-procs=16 or parallel. It’s also worth not­ing that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no appar­ent speedup (a slow­down if any­thing, given lim­ited VRAM), so I find it helps to explic­itly dis­able GPU use by set­ting CUDA_VISIBLE_DEVICES="". (For this step, it’s quite help­ful to have a many-core sys­tem like a Thread­rip­per.)

Com­bin­ing every­thing, par­al­lel face-crop­ping of an entire Dan­booru2018 sub­set can be done like this:

cropFaces() {
    BUCKET=$(printf "%04d" $(( $@ % 1000 )) )
    ID="$@"
    CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/  \
     ~/src/lbpcascade_animeface/lbpcascade_animeface.xml \
     ./original/$BUCKET/$ID.* "./faces/$ID"
}
export -f cropFaces

mkdir ./faces/
cat sfw-ids.txt | parallel --progress cropFaces

# NOTE: because of the possibility of multiple crops from an image, the script appends a N counter;
# remove that to get back the original ID & filepath: eg
## original/0196/933196.jpg  → portrait/9331961.jpg
## original/0669/1712669.png → portrait/17126690.jpg
## original/0997/3093997.jpg → portrait/30939970.jpg
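
The bucket directory is just the ID modulo 1000, zero-padded to 4 digits, as the `printf "%04d"` line computes. A small sketch of the same mapping (the helper names are made up for illustration):

```python
def bucket_for(image_id: int) -> str:
    """Danbooru2018 bucket directory: ID modulo 1000, zero-padded to 4 digits."""
    return "%04d" % (image_id % 1000)

def original_glob(image_id: int) -> str:
    """Glob pattern matching the original file regardless of its extension."""
    return "./original/%s/%d.*" % (bucket_for(image_id), image_id)

for image_id in (933196, 1712669, 3093997):
    print(image_id, "->", original_glob(image_id))
```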

Nvidia StyleGAN, by default and like most image-related tools, expects square images like 512×512px, but there is nothing inherent to neural nets or convolutions that requires square inputs or outputs, and rectangular convolutions are possible. In the case of faces, they tend to be more rectangular than square, and we’d prefer to use a rectangular convolution if possible to focus the image on the relevant dimension rather than either pay the severe performance penalty of increasing total dimensions to 1024×1024px or stick with 512×512px & waste image outputs on emitting black bars/backgrounds. A properly-sized rectangular convolution can offer a nice speedup (eg training ImageNet in 18m for $40 using them among other tricks). Nolan Kent’s StyleGAN re-implementation (released October 2019) does support rectangular convolutions, and as he demonstrates in his blog post, it works nicely.

Cleaning & Upscaling

Mis­cel­la­neous cleanups can be done:

## Delete failed/empty files
find faces/ -size 0    -type f -delete

## Delete 'too small' files which is indicative of low quality:
find faces/ -size -40k -type f -delete

## Delete exact duplicates:
fdupes --delete --omitfirst --noprompt faces/

## Delete monochrome or minimally-colored images:
### the heuristic of <257 unique colors is imperfect but better than anything else I tried
deleteBW() { if [[ `identify -format "%k" "$@"` -lt 257 ]];
             then rm "$@"; fi; }
export -f deleteBW
find faces -type f | parallel --progress deleteBW
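
The `<257` threshold works because an 8-bit grayscale image can contain at most 256 distinct values. A minimal sketch of the same heuristic on an in-memory pixel list (the function name is hypothetical; loading real files, eg with Pillow's `Image.open(path).convert("RGB").getdata()`, is left out and would assume Pillow is installed):

```python
def is_minimally_colored(pixels, min_unique_colors=257):
    """True if the image has fewer than `min_unique_colors` distinct colors."""
    seen = set()
    for px in pixels:
        seen.add(px)
        if len(seen) >= min_unique_colors:
            return False  # stop early: enough distinct colors already
    return True

gray_ramp = [(v, v, v) for v in range(256)]                   # 256 distinct grays
colorful = [(r, g, 0) for r in range(32) for g in range(32)]  # 1024 distinct colors
print(is_minimally_colored(gray_ramp), is_minimally_colored(colorful))
```

Like the shell version, this is only a heuristic: a color image with a very limited palette would also be flagged.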

I remove black­-white or grayscale images from all my GAN exper­i­ments because in my ear­li­est exper­i­ments, their inclu­sion appeared to increase insta­bil­i­ty: mixed datasets were extremely unsta­ble, mono­chrome datasets failed to learn at all, but col­or-only runs made some progress. It is likely that StyleGAN is now pow­er­ful enough to be able to learn on mixed datasets (and some later exper­i­ments by other peo­ple sug­gest that StyleGAN can han­dle both mono­chrome & color ani­me-style faces with­out a prob­lem), but I have not risked a full mon­th-long run to inves­ti­gate, and so I con­tinue doing col­or-on­ly.

Discriminator ranking

A good trick with GANs is, after train­ing to rea­son­able lev­els of qual­i­ty, reusing the Dis­crim­i­na­tor to rank the real dat­a­points; images the trained D assigns the low­est probability/score of being real are often the worst-qual­ity ones and going through the bot­tom decile (or delet­ing them entire­ly) should remove many anom­alies and may improve the GAN. The GAN is then trained on the new cleaned dataset, mak­ing this a kind of “active learn­ing”.

Since rat­ing images is what the D already does, no new algo­rithms or train­ing meth­ods are nec­es­sary, and almost no code is nec­es­sary: run the D on the whole dataset to rank each image (faster than it seems since the G & back­prop­a­ga­tion are unnec­es­sary, even a large dataset can be ranked in a wall­clock hour or two), then one can review man­u­ally the bot­tom & top X%, or per­haps just delete the bot­tom X% sight unseen if enough data is avail­able.

What is a D doing? I find that the highest ranked images often contain many anomalies or low-quality images which need to be deleted. Why? The BigGAN paper notes that a well-trained D which achieves 98% real vs fake classification performance on the ImageNet training dataset falls to 50–55% accuracy when run on the validation dataset, suggesting the D’s role is about memorizing the training data rather than some measure of ‘realism’.

Per­haps because the D rank­ing is not nec­es­sar­ily a ‘qual­ity’ score but sim­ply a sort of con­fi­dence rat­ing that an image is from the real dataset; if the real images con­tain cer­tain eas­i­ly-de­tectable images which the G can’t repli­cate, then the D might mem­o­rize or learn them quick­ly. For exam­ple, in face crops, whole fig­ure crops are com­mon mis­taken crops, mak­ing up a tiny per­cent­age of images; how could a face-only G learn to gen­er­ate whole real­is­tic bod­ies with­out the inter­me­di­ate steps being instantly detected & defeated as errors by D, while D is eas­ily able to detect real­is­tic bod­ies as defi­nitely real? This would explain the polar­ized rank­ings. And given the close con­nec­tions between GANs & DRL, I have to won­der if there is more mem­o­riza­tion going on than sus­pected in things like ? Inci­den­tal­ly, this may also explain the prob­lem with using Dis­crim­i­na­tors for semi­-su­per­vised rep­re­sen­ta­tion learn­ing: if the D is mem­o­riz­ing dat­a­points to force the G to gen­er­al­ize, then its inter­nal rep­re­sen­ta­tions would be expected to be use­less. (One would instead want to extract knowl­edge from the G, per­haps by encod­ing an image into z and using the z as the rep­re­sen­ta­tion.)

An alternative perspective is offered by a crop of 2020 papers examining how useful GAN data augmentation requires it to be done during training, and one must augment all images.23 Zhao et al 2020c & Karras et al 2020 observe that, with regular GAN training, there is a striking steady decline of D performance on heldout data, and increase on training data, throughout the course of training, confirming the BigGAN observation but also showing it is a dynamic phenomenon, and probably a bad one. Adding in correct data augmentation reduces this overfitting and markedly improves sample-efficiency & final quality. This suggests that the D does indeed memorize, but that this is not a good thing. Karras et al 2020 describes what happens as:

Con­ver­gence is now achieved [with ADA/data aug­men­ta­tion] regard­less of the train­ing set size and over­fit­ting no longer occurs. With­out aug­men­ta­tions, the gra­di­ents the gen­er­a­tor receives from the dis­crim­i­na­tor become very sim­plis­tic over time—the dis­crim­i­na­tor starts to pay atten­tion to only a hand­ful of fea­tures, and the gen­er­a­tor is free to cre­ate oth­er­wise non­sen­si­cal images. With ADA, the gra­di­ent field stays much more detailed which pre­vents such dete­ri­o­ra­tion.

In other words, just as the G can ‘mode col­lapse’ by focus­ing on gen­er­at­ing images with only a few fea­tures, the D can also ‘fea­ture col­lapse’ by focus­ing on a few fea­tures which hap­pen to cor­rectly split the train­ing data’s reals from fakes, such as by mem­o­riz­ing them out­right. This tech­ni­cally works, but not well. This also explains why when train­ing on JFT-300M: divergence/collapse usu­ally starts with D win­ning; if D wins because it mem­o­rizes, then a suffi­ciently large dataset should make mem­o­riza­tion infea­si­ble; and JFT-300M turns out to be suffi­ciently large. (This would pre­dict that if Brock et al had checked the JFT-300M BigGAN D’s clas­si­fi­ca­tion per­for­mance on a held-out JFT-300M, rather than just on their Ima­geNet BigGAN, then they would have found that it clas­si­fied reals vs fake well above chance.)

If so, this suggests that for D ranking, it may not be too useful to take the D from the end of a run, if not using data augmentation, because that D would be the version with the greatest degree of memorization!

Here is a simple StyleGAN2 script to open a StyleGAN .pkl and run it on a list of image filenames to print out the D score, courtesy of Shao Xuning:

import pickle
import numpy as np
import cv2
import dnnlib.tflib as tflib
import random
import argparse
import PIL.Image
from training.misc import adjust_dynamic_range

def preprocess(file_path):
    # print(file_path)
    img = np.asarray(PIL.Image.open(file_path))

    # Preprocessing from dataset_tool.create_from_images
    img = img.transpose([2, 0, 1])  # HWC => CHW
    # img = np.expand_dims(img, axis=0)
    img = img.reshape((1, 3, 512, 512))

    # Preprocessing from training_loop.process_reals
    img = adjust_dynamic_range(data=img, drange_in=[0, 255], drange_out=[-1.0, 1.0])
    return img

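def main(args):
    random.seed(args.random_seed)  # make the shuffles below reproducible
    minibatch_size = args.minibatch_size
    input_shape = (minibatch_size, 3, 512, 512)
    # print(args.images)
    images = args.images

    tflib.init_tf()  # a TensorFlow session must exist before unpickling the networks
    _G, D, _Gs = pickle.load(open(args.model, "rb"))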
    # D.print_layers()

    image_score_all = [(image, []) for image in images]

    # Shuffle the images and process each image in multiple minibatches.
    # Note: networks.stylegan2.minibatch_stddev_layer
    # calculates the standard deviation of a minibatch group as a feature channel,
    # which means that the output of the discriminator actually depends
    # on the companion images in the same minibatch.
    for i_shuffle in range(args.num_shuffles):
        # print('shuffle: {}'.format(i_shuffle))
        random.shuffle(image_score_all)  # vary each image's minibatch companions
        for idx_1st_img in range(0, len(image_score_all), minibatch_size):
            idx_img_minibatch = []
            images_minibatch = []
            input_minibatch = np.zeros(input_shape)
            for i in range(minibatch_size):
                # wrap around so that the final minibatch is always full
                idx_img = (idx_1st_img + i) % len(image_score_all)
                idx_img_minibatch.append(idx_img)
                image = image_score_all[idx_img][0]
                images_minibatch.append(image)
                img = preprocess(image)
                input_minibatch[i, :] = img
            output = D.run(input_minibatch, None, resolution=512)
            print('shuffle: {}, indices: {}, images: {}'
                  .format(i_shuffle, idx_img_minibatch, images_minibatch))
            print('Output: {}'.format(output))
            for i in range(minibatch_size):
                idx_img = idx_img_minibatch[i]
                # record this shuffle's score for the image
                image_score_all[idx_img][1].append(output[i][0])

    with open(args.output, 'a') as fout:
        for image, score_list in image_score_all:
            print('Image: {}, score_list: {}'.format(image, score_list))
            avg_score = sum(score_list)/len(score_list)
            fout.write(image + ' ' + str(avg_score) + '\n')

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, required=True,
                        help='.pkl model')
    parser.add_argument('--images', nargs='+')
    parser.add_argument('--output', type=str, default='rank.txt')
    parser.add_argument('--minibatch_size', type=int, default=4)
    parser.add_argument('--num_shuffles', type=int, default=5)
    parser.add_argument('--random_seed', type=int, default=0)
    return parser.parse_args()

if __name__ == '__main__':
    main(parse_arguments())

Depend­ing on how noisy the rank­ings are in terms of ‘qual­ity’ and avail­able sam­ple size, one can either review the worst-ranked images by hand, or delete the bot­tom X%. One should check the top-ranked images as well to make sure the order­ing is right; there can also be some odd images in the top X% as well which should be removed.
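
Picking out the bottom X% from the ranking file takes only a few lines. A minimal sketch, assuming the `filename score` line format that the script above writes; the function name and the sample scores are invented for illustration:

```python
def bottom_fraction(rank_lines, fraction=0.10):
    """Return the filenames with the lowest D scores (the bottom `fraction`)."""
    scored = []
    for line in rank_lines:
        line = line.strip()
        if not line:
            continue
        name, score = line.rsplit(" ", 1)  # the filename itself may contain spaces
        scored.append((float(score), name))
    scored.sort()                          # ascending: worst-ranked first
    n = int(len(scored) * fraction)
    return [name for _, name in scored[:n]]

sample = ["a.jpg 1.5", "b.jpg -3.0", "c.jpg 0.2", "d.jpg -0.7", "e.jpg 2.1"]
print(bottom_fraction(sample, fraction=0.4))
```

Reversing the sort gives the top-ranked images for the sanity check described above. Since the ranking script opens its output in append mode, a file reused across runs may list an image more than once.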

It might be pos­si­ble to use to improve the qual­ity of gen­er­ated sam­ples as well, as a sim­ple ver­sion of .


The next major step is upscaling images using waifu2x, which does an excellent job on 2× upscaling of anime images, which are nigh-indistinguishable from a higher-resolution original and greatly increase the usable corpus. The downside is that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2×1080ti), and is written in a now-unmaintained DL framework, Torch, with no current plans to port to PyTorch, and is gradually becoming harder to get running (one hopes that by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, so the waifu2x step is worthwhile.

. ~/src/torch/install/bin/torch-activate
upscaleWaifu2x() {
    SIZE1=$(identify -format "%h" "$@")
    SIZE2=$(identify -format "%w" "$@");

    if (( $SIZE1 < 512 && $SIZE2 < 512  )); then
        echo "$@" $SIZE1 $SIZE2
        TMP=$(mktemp "/tmp/XXXXXX.png")
        CUDA_VISIBLE_DEVICES="$((RANDOM % 2 < 1))" nice th ~/src/waifu2x/waifu2x.lua -model_dir \
            ~/src/waifu2x/models/upconv_7/art -tta 1 -m scale -scale 2 \
            -i "$@" -o "$TMP"
        convert "$TMP" "$@"
        rm "$TMP"
    fi;  }

export -f upscaleWaifu2x
find faces/ -type f | parallel --progress --jobs 9 upscaleWaifu2x

Quality Checks & Data Augmentation

The sin­gle most effec­tive strat­egy to improve a GAN is to clean the data. StyleGAN can­not han­dle too-di­verse datasets com­posed of mul­ti­ple objects or sin­gle objects shifted around, and rare or odd images can­not be learned well. Kar­ras et al get such good results with StyleGAN on faces in part because they con­structed FFHQ to be an extremely clean con­sis­tent dataset of just cen­tered well-lit clear human faces with­out any obstruc­tions or other vari­a­tion. Sim­i­lar­ly, Arfa’s (TFDNE) S2 gen­er­ates much bet­ter por­traits than my own “This Waifu Does Not Exist” (TWDNE) S2 anime por­traits, due partly to train­ing longer to con­ver­gence on a TPU pod but mostly due to his invest­ment in data clean­ing: align­ing the faces and heavy fil­ter­ing of sam­ples—this left him with only n = 50k but TFDNE nev­er­the­less out­per­forms TWDNE’s n = 300k. (Data cleaning/augmentation is one of the more pow­er­ful ways to improve results; if we imag­ine deep learn­ing as ‘pro­gram­ming’ or ‘Soft­ware 2.0’24 in Andrej Karpa­thy’s terms, data cleaning/augmentation is one of the eas­i­est ways to fine­tune the loss func­tion towards what we really want by gar­den­ing our data to remove what we don’t want and increase what we do.)

At this point, one can do man­ual qual­ity checks by view­ing a few hun­dred images, run­ning findimagedupes -t 99% to look for near-i­den­ti­cal faces, or dab­ble in fur­ther mod­i­fi­ca­tions such as doing “data aug­men­ta­tion”. Work­ing with Dan­booru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN and one will have diffi­culty stor­ing the final StyleGAN dataset because of its sheer size (due to the ~18× size mul­ti­pli­er). After clean­ing etc, my final face dataset is the with n = 300k.

How­ev­er, if that is not enough or one is work­ing with a small dataset like for a sin­gle char­ac­ter, data aug­men­ta­tion may be nec­es­sary. The mirror/horizontal flip is not nec­es­sary as StyleGAN has that built-in as an option25, but there are many other pos­si­ble data aug­men­ta­tions. One can stretch, shift col­ors, sharp­en, blur, increase/decrease contrast/brightness, crop, and so on. An exam­ple, extremely aggres­sive, set of data aug­men­ta­tions could be done like this:

dataAugment () {
    image="$@"
    target=$(basename "$@")
    suffix="png" # the crops are still PNGs at this stage
    convert -deskew 50                     "$image" "$target".deskew."$suffix"
    convert -resize 110%x100%              "$image" "$target".horizstretch."$suffix"
    convert -resize 100%x110%              "$image" "$target".vertstretch."$suffix"
    convert -blue-shift 1.1                "$image" "$target".midnight."$suffix"
    convert -fill red -colorize 5%         "$image" "$target".red."$suffix"
    convert -fill orange -colorize 5%      "$image" "$target".orange."$suffix"
    convert -fill yellow -colorize 5%      "$image" "$target".yellow."$suffix"
    convert -fill green -colorize 5%       "$image" "$target".green."$suffix"
    convert -fill blue -colorize 5%        "$image" "$target".blue."$suffix"
    convert -fill purple -colorize 5%      "$image" "$target".purple."$suffix"
    convert -adaptive-blur 3x2             "$image" "$target".blur."$suffix"
    convert -adaptive-sharpen 4x2          "$image" "$target".sharpen."$suffix"
    convert -brightness-contrast 10        "$image" "$target".brighter."$suffix"
    convert -brightness-contrast 10x10     "$image" "$target".brightercontraster."$suffix"
    convert -brightness-contrast -10       "$image" "$target".darker."$suffix"
    convert -brightness-contrast -10x10    "$image" "$target".darkerlesscontrast."$suffix"
    convert +level 5%                      "$image" "$target".contraster."$suffix"
    convert -level 5%\!                    "$image" "$target".lesscontrast."$suffix"
}
export -f dataAugment
find faces/ -type f | parallel --progress dataAugment

Upscaling & Conversion

Once any qual­ity fixes or data aug­men­ta­tion are done, it’d be a good idea to save a lot of disk space by con­vert­ing to JPG & loss­ily reduc­ing qual­ity (I find 33% saves a ton of space at no vis­i­ble change):

convertPNGToJPG() { convert -quality 33 "$@" "$@".jpg && rm "$@"; }
export -f convertPNGToJPG
find faces/ -type f -name "*.png" | parallel --progress convertPNGToJPG

Remem­ber that StyleGAN mod­els are only com­pat­i­ble with images of the type they were trained on, so if you are using a StyleGAN pre­trained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN mod­el­s), you will need to keep using PNGs.

Doing the final scaling to exactly 512px can be done at many points but I generally postpone it to the end in order to work with images in their ‘native’ resolutions & aspect-ratios for as long as possible. At this point we carefully tell ImageMagick to rescale everything to 512×512 [26]; the image content keeps its aspect ratio, and the square is filled in with a black background as necessary on either side:

find faces/ -type f | xargs --max-procs=16 -n 9000 \
    mogrify -resize 512x512\> -extent 512x512\> -gravity center -background black
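
To make the geometry concrete, here is a sketch of what shrink-to-fit plus centered padding means for a given input size. It is pure arithmetic under the assumption that dimensions are rounded to the nearest pixel; ImageMagick's own rounding may differ by a pixel, and the function name is made up:

```python
def letterbox_geometry(width: int, height: int, target: int = 512):
    """Return (new_w, new_h, pad_x, pad_y) for shrink-to-fit plus centered padding."""
    # Like the '>' flag: only ever shrink, never enlarge.
    scale = min(1.0, target / width, target / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # The remaining space is split evenly on either side (black bars).
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return new_w, new_h, pad_x, pad_y

print(letterbox_geometry(1024, 768))  # wide image: bars above & below
print(letterbox_geometry(400, 300))   # small image: padded on all sides
```

Images still smaller than the target after upscaling are not enlarged by this step; they are simply surrounded by black.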

Any slight­ly-d­iffer­ent image could crash the import process. There­fore, we delete any image which is even slightly differ­ent from the 512×512 sRGB JPG they are sup­posed to be:

find faces/ -type f | xargs --max-procs=16 -n 9000 identify | \
    # remember the warning: images must be identical, square, and sRGB/grayscale:
    fgrep -v " JPEG 512x512 512x512+0+0 8-bit sRGB"| cut -d ' ' -f 1 | \
    xargs --max-procs=16 -n 10000 rm
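
The same filter can be expressed in Python over the text output of `identify`. This is a sketch: the sample lines only imitate the general shape of `identify` output, and the function name is hypothetical:

```python
EXPECTED = " JPEG 512x512 512x512+0+0 8-bit sRGB"

def nonconforming(identify_lines, expected=EXPECTED):
    """Return the filenames whose identify line lacks the expected signature."""
    bad = []
    for line in identify_lines:
        if expected not in line:
            bad.append(line.split(" ", 1)[0])  # first field is the filename
    return bad

# Illustrative lines in the general shape of identify's output.
lines = [
    "faces/1.jpg JPEG 512x512 512x512+0+0 8-bit sRGB 45KB 0.000u 0:00.000",
    "faces/2.jpg JPEG 512x512 512x512+0+0 8-bit Gray 256c 30KB 0.000u 0:00.000",
    "faces/3.jpg JPEG 512x513 512x513+0+0 8-bit sRGB 46KB 0.000u 0:00.000",
]
print(nonconforming(lines))
```

As with `cut -d ' ' -f 1` in the shell version, filenames containing spaces would be truncated, so it is worth reviewing the list before deleting anything.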

Hav­ing done all this, we should have a large con­sis­tent high­-qual­ity dataset.

Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format. It is worth remembering at this point how fragile that conversion is and how strict its requirements are; ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem.

Because of the extreme fragility of the conversion tool, I strongly advise that you edit it to print out the filenames of each file as they are being processed so that when (not if) it crashes, you can investigate the culprit and check the rest. The edit could be as simple as this:

diff --git a/ b/
index 4ddfe44..e64e40b 100755
--- a/
+++ b/
@@ -519,6 +519,7 @@ def create_from_images(tfrecord_dir, image_dir, shuffle):
     with TFRecordExporter(tfrecord_dir, len(image_filenames)) as tfr:
         order = tfr.choose_shuffled_order() if shuffle else np.arange(len(image_filenames))
         for idx in range(order.size):
+            print(image_filenames[order[idx]])
             img = np.asarray([order[idx]]))
             if channels == 1:
                 img = img[np.newaxis, :, :] # HW => CHW

There should be no issues if all the images were thor­oughly checked ear­lier, but should any images crash it, they can be checked in more detail by identify. (I advise just delet­ing them and not try­ing to res­cue them.)

Then the con­ver­sion is just (as­sum­ing StyleGAN pre­req­ui­sites are installed, see next sec­tion):

python create_from_images datasets/faces /media/gwern/Data/danbooru2018/faces/

Con­grat­u­la­tions, the hard­est part is over. Most of the rest sim­ply requires patience (and a will­ing­ness to edit Python files directly in order to con­fig­ure StyleGAN).



I assume you have CUDA installed & func­tion­ing. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have suc­cess­fully used the Nvidia dri­ver ver­sion #410.104, CUDA 10.1, and Ten­sor­Flow 1.13.1.)

A Python ≥3.6 [27] virtual environment can be set up for StyleGAN to keep dependencies tidy, with the TensorFlow & StyleGAN dependencies installed:

conda create -n stylegan pip python=3.6
source activate stylegan

## TF:
pip install tensorflow-gpu
## Test install:
python -c "import tensorflow as tf; tf.enable_eager_execution(); \
    print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
pip install tensorboard

## StyleGAN:
## Install pre-requisites:
pip install pillow numpy moviepy scipy opencv-python lmdb # requests?
## Download:
git clone '' && cd ./stylegan/
## Test install:
## ./results/example.png should be a photograph of a middle-aged man

StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks, using this Colab notebook. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little or who are doing something quick like extremely low-resolution training or transfer-learning where a few GPU-hours on a slow small GPU might be enough.


StyleGAN does­n’t ship with any sup­port for CLI options; instead, one must edit and train/

  1. train/

    The core configuration is done in the default arguments of the training_loop function, beginning at line 112.

    The key argu­ments are G_smoothing_kimg & D_repeats (affects the learn­ing dynam­ic­s), network_snapshot_ticks (how often to save the pickle snap­shot­s—­more fre­quent means less progress lost in crash­es, but as each one weighs 300M­B+, can quickly use up giga­bytes of space), resume_run_id (set to "latest"), and resume_kimg.

    resume_kimg gov­erns where in the over­all pro­gres­sive-grow­ing train­ing sched­ule StyleGAN starts from. If it is set to 0, train­ing begins at the begin­ning of the pro­gres­sive-grow­ing sched­ule, at the low­est res­o­lu­tion, regard­less of how much train­ing has been pre­vi­ously done. It is vitally impor­tant when doing trans­fer learn­ing that it is set to a suffi­ciently high num­ber (eg 10000) that train­ing begins at the high­est desired res­o­lu­tion like 512px, as it appears that lay­ers are erased when added dur­ing pro­gres­sive-grow­ing. (resume_kimg may also need to be set to a high value to make it skip straight to train­ing at the high­est res­o­lu­tion if you are train­ing on small datasets of small images, where there’s risk of it over­fit­ting under the nor­mal train­ing sched­ule and never reach­ing the high­est res­o­lu­tion.) This trick is unnec­es­sary in StyleGAN 2, which is sim­pler in not using pro­gres­sive grow­ing.

    More exper­i­men­tal­ly, I sug­gest set­ting minibatch_repeats = 1 instead of minibatch_repeats = 5; in line with the sus­pi­cious­ness of the gra­di­en­t-ac­cu­mu­la­tion imple­men­ta­tion in ProGAN/StyleGAN, this appears to make train­ing both sta­bler & faster.

    Note that some of these variables, like learning rates, are overridden in the second file (described next). It’s better to set those there or else you may confuse yourself badly (like I did in wondering why ProGAN & StyleGAN seemed extraordinarily robust to large changes in the learning rates…).

  2. (pre­vi­ously in ProGAN; renamed in StyleGAN 2)

    Here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. (This file includes settings intended for ProGAN; watch out that you don’t accidentally turn on ProGAN instead of StyleGAN & confuse yourself.) Learning rate & minibatch should generally be left alone (except towards the end of training when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, like thus:

    desc += '-faces';     dataset = EasyDict(tfrecord_dir='faces', resolution=512);              train.mirror_augment = True

    This sets up the 512px face dataset which was pre­vi­ously cre­ated in dataset/faces, turns on mir­ror­ing (be­cause while there may be writ­ing in the back­ground, we don’t care about it for face gen­er­a­tion), and sets a title for the checkpoints/logs, which will now appear in results/ with the ‘-faces’ string.

    Assuming you do not have 8 GPUs (as you probably do not), you must change the -preset to match your number of GPUs, because StyleGAN will not automatically choose the correct number of GPUs. If you fail to set it to the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with an opaque error message like the following (note that CUDA uses zero-indexing, so GPU:0 refers to the first GPU, GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my nonexistent third GPU):

    tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
        G_synthesis_3/lod: {{node G_synthesis_3/lod}}was explicitly assigned to /device:GPU:2 but available \
        devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
        /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
        /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
        Make sure the device specification refers to a valid device.
         [[{{node G_synthesis_3/lod}}]]

    For my 2×1080ti I’d set:

    desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8; sched.minibatch_dict = \
        {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}; sched.G_lrate_dict = {512: 0.0015, 1024: 0.002}; \
        sched.D_lrate_dict = EasyDict(sched.G_lrate_dict); train.total_kimg = 99000

    So my results get saved to results/00001-sgan-faces-2gpu etc (the run ID incre­ments, ‘sgan’ because StyleGAN rather than ProGAN, ‘-faces’ as the dataset being trained on, and ‘2gpu’ because it’s multi-GPU).
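
To make the interaction between resume_kimg, the resolution, and the per-resolution minibatch/learning-rate dictionaries more concrete, here is a simplified standalone sketch of a progressive-growing schedule. It is only loosely modeled on how I understand the StyleGAN schedule to behave and is not copied from it: the function name `schedule_at`, the 600+600 kimg phase lengths, the 8px starting resolution, and the 0.001 base learning rate are all assumptions for illustration, and the fade-in between resolutions is ignored.

```python
import math

def schedule_at(kimg, max_resolution=512, initial_resolution=8,
                train_kimg=600, transition_kimg=600,
                minibatch_base=8, minibatch_dict=None,
                lrate_base=0.001, lrate_dict=None):
    """Return (resolution, minibatch, lrate) in effect after `kimg` thousand images.

    Simplified: every resolution gets train_kimg + transition_kimg before the
    next doubling. All default values are illustrative assumptions.
    """
    minibatch_dict = minibatch_dict or {}
    lrate_dict = lrate_dict or {}
    phase_kimg = train_kimg + transition_kimg
    # Doublings needed to get from the initial to the final resolution.
    doublings_needed = int(math.log2(max_resolution)) - int(math.log2(initial_resolution))
    # Doublings completed so far, capped once the final resolution is reached.
    doublings_done = min(int(kimg // phase_kimg), doublings_needed)
    resolution = initial_resolution * 2 ** doublings_done
    # Per-resolution overrides fall back to the base value when a key is absent.
    minibatch = minibatch_dict.get(resolution, minibatch_base)
    lrate = lrate_dict.get(resolution, lrate_base)
    return resolution, minibatch, lrate

# Values copied from the 2-GPU preset above.
MINIBATCHES = {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}
LRATES = {512: 0.0015, 1024: 0.002}

for kimg in (0, 3000, 10000):
    print(kimg, schedule_at(kimg, minibatch_dict=MINIBATCHES, lrate_dict=LRATES))
```

Under these assumed phase lengths, a resume_kimg of 0 restarts at 8px, while a resume_kimg of 10000 lands directly at 512px, where the 512px learning-rate entry and the base minibatch of 8 apply.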


I typ­i­cally run StyleGAN in a ses­sion which can be detached and keeps mul­ti­ple shells orga­nized: 1 terminal/shell for the StyleGAN run, 1 terminal/shell for Ten­sor­Board, and 1 for Emacs.

With Emacs, I keep the two key Python files described above open for reference & easy editing.

With the “lat­est” patch, StyleGAN can be thrown into a while-loop to keep run­ning after crash­es, like:

while true; do nice py ; date; (xmessage "alert: StyleGAN crashed" &); sleep 10s; done

Ten­sor­Board is a log­ging util­ity which dis­plays lit­tle time-series of recorded vari­ables which one views in a web browser, eg:

tensorboard --logdir results/02022-sgan-faces-2gpu/
# TensorBoard 1.13.0 at (Press CTRL+C to quit)

Note that Ten­sor­Board can be back­ground­ed, but needs to be updated every time a new run is started as the results will then be in a differ­ent fold­er.

Train­ing StyleGAN is much eas­ier & more reli­able than other GANs, but it is still more of an art than a sci­ence. (We put up with it because while GANs suck, every­thing else sucks more.) Notes on train­ing:

  • Crash­proofing:

    The ini­tial release of StyleGAN was prone to crash­ing when I ran it, seg­fault­ing at ran­dom. Updat­ing Ten­sor­Flow appeared to reduce this but the root cause is still unknown. Seg­fault­ing or crash­ing is also report­edly com­mon if run­ning on mixed GPUs (eg a 1080ti + Titan V).

    Unfortunately, StyleGAN has no setting for simply resuming from the latest snapshot after crashing/exiting (which is what one usually wants), and one must manually edit the resume_run_id line to set it to the latest run ID. This is tedious and error-prone: at one point I realized I had wasted 6 GPU-days of training by restarting from a 3-day-old snapshot because I had not updated the resume_run_id after a segfault!

    If you are doing any runs longer than a few wall­clock hours, I strongly advise use of nshep­perd’s patch to auto­mat­i­cally restart from the lat­est snap­shot by set­ting resume_run_id = "latest":

    diff --git a/training/ b/training/
    index 50ae51c..d906a2d 100755
    --- a/training/
    +++ b/training/
    @@ -119,6 +119,14 @@ def list_network_pkls(run_id_or_run_dir, include_final=True):
             del pkls[0]
         return pkls
    +def locate_latest_pkl():
    +    allpickles = sorted(glob.glob(os.path.join(config.result_dir, '0*', 'network-*.pkl')))
    +    latest_pickle = allpickles[-1]
    +    resume_run_id = os.path.basename(os.path.dirname(latest_pickle))
    +    RE_KIMG = re.compile('network-snapshot-(\d+).pkl')
    +    kimg = int(RE_KIMG.match(os.path.basename(latest_pickle)).group(1))
    +    return (locate_network_pkl(resume_run_id), float(kimg))
     def locate_network_pkl(run_id_or_run_dir_or_network_pkl, snapshot_or_network_pkl=None):
         for candidate in [snapshot_or_network_pkl, run_id_or_run_dir_or_network_pkl]:
             if isinstance(candidate, str):
    diff --git a/training/ b/training/
    index 78d6fe1..20966d9 100755
    --- a/training/
    +++ b/training/
    @@ -148,7 +148,10 @@ def training_loop(
         # Construct networks.
         with tf.device('/gpu:0'):
             if resume_run_id is not None:
    -            network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
    +            if resume_run_id == 'latest':
    +                network_pkl, resume_kimg = misc.locate_latest_pkl()
    +            else:
    +                network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
                 print('Loading networks from "%s"...' % network_pkl)
                 G, D, Gs = misc.load_pkl(network_pkl)

    (The diff can be edited by hand, or copied into the repo as a file like latest.patch & then applied with git apply latest.patch.)

  • Tuning Learning Rates:

    The LR is one of the most critical hyperparameters: too-large updates based on too-small minibatches are devastating to GAN stability & final quality. The LR also seems to interact with the intrinsic difficulty or diversity of an image domain; Karras et al 2019 use 0.003 G/D LRs on their FFHQ dataset (which has been carefully curated and the faces aligned to put landmarks like eyes/mouth in the same locations in every image) when training on 8-GPU machines with minibatches of n = 32, but I find lower LRs to be better on my anime face/portrait datasets where I can only do n = 8. From looking at training videos of whole-Danbooru2018 StyleGAN runs, I suspect that the necessary LRs would be lower still. Learning rates are closely related to minibatch size (a common rule of thumb in supervised learning of CNNs is that the largest usable LR scales with the square root of the minibatch size), and the BigGAN research argues that minibatch size itself strongly influences how bad mode dropping is, which suggests that smaller LRs may be more necessary the more diverse/difficult a dataset is.

  • Bal­anc­ing G/D:

    Screen­shot of Ten­sor­Board G/D losses for an anime face StyleGAN mak­ing progress towards con­ver­gence

    Later in training, if the G is not making good progress towards the ultimate goal of a 0.5 loss (and the D’s loss gradually decreasing towards 0.5), and has a loss stubbornly stuck around −1 or something, it may be necessary to change the balance of G/D. This can be done several ways but the easiest is to adjust the LRs, sched.G_lrate_dict & sched.D_lrate_dict.

    One needs to keep an eye on the G/D losses and also the perceptual quality of the faces (since we don’t have any good FID equivalent yet for anime faces, which would require a good open-source Danbooru tagger to create embeddings), and reduce both LRs (or usually just the D’s LR) based on the face quality and whether the G/D losses are exploding or otherwise look imbalanced. What you want, I think, is for the G/D losses to be stable at a certain absolute amount for a long time while the quality visibly improves, reducing D’s LR as necessary to keep it balanced with G; then, once you’ve run out of time/patience or artifacts are showing up, you can decrease both LRs to converge onto a local optimum.

    I find the default of 0.003 can be too high once quality reaches a high level with both faces & portraits, and it helps to reduce it to a third (0.001) or a tenth (0.0003). If there still isn’t convergence, the D may be too strong and it can be turned down separately, to a tenth or even a fiftieth. (Given the stochasticity of training & the relativity of the losses, one should wait several wallclock hours or days after each modification to see if it made a difference.)

  • Skip­ping FID met­rics:

    Some metrics are computed for logging/reporting. The FID metrics are calculated using an old ImageNet CNN; what is realistic on ImageNet may have little to do with your particular domain, and while a large FID like 100 is concerning, FIDs like 20, or even FIDs that are increasing, are not necessarily a problem or more useful guidance than just looking at the generated samples or the loss curves. Given that computing FID metrics is not free & potentially irrelevant or misleading on many image domains, I suggest disabling them entirely. (They are not used in the training for anything, and disabling them is safe.)

    They can be edited out of the main training loop by commenting out the metrics call like so:

    @@ -261,7 +265,7 @@ def training_loop()
            if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
                pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
                misc.save_pkl((G, D, Gs), pkl)
                #metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)

  • ‘Blob’ & ‘Crack’ Arti­facts:

    During training, ‘blobs’ often show up or move around. These blobs appear even late in training on otherwise high-quality images and are unique to StyleGAN (at least, I’ve never seen another GAN whose training artifacts look like the blobs). That they are so large & glaring suggests a weakness in StyleGAN somewhere. The source of the blobs was unclear. If you watch training videos, these blobs seem to gradually morph into new features such as eyes or hair or glasses. I suspect they are part of how StyleGAN ‘creates’ new features, starting with a featureless blob superimposed at approximately the right location, and gradually refined into something useful. The StyleGAN 2 paper investigated the blob artifacts & found them to be due to the Generator working around a flaw in StyleGAN’s use of AdaIN normalization. Karras et al 2019 note that images without a blob somewhere are severely corrupted; because the blobs are in fact doing something useful, it is unsurprising that the Discriminator doesn’t fix the Generator. StyleGAN 2 changes the AdaIN normalization to eliminate this problem, improving overall quality.28

    If blobs are appearing too often or one wants a final model without any new intrusive blobs, it may help to lower the LR to try to converge to a local optimum where the necessary blob is hidden away somewhere unobtrusive.

    In train­ing anime faces, I have seen addi­tional arti­facts, which look like ‘cracks’ or ‘waves’ or ele­phant skin wrin­kles or the sort of fine craz­ing seen in old paint­ings or ceram­ics, which appear toward the end of train­ing on pri­mar­ily skin or areas of flat col­or; they hap­pen par­tic­u­larly fast when trans­fer learn­ing on a small dataset. The only solu­tion I have found so far is to either stop train­ing or get more data. In con­trast to the blob arti­facts (iden­ti­fied as an archi­tec­tural prob­lem & fixed in StyleGAN 2), I cur­rently sus­pect the cracks are a sign of over­fit­ting rather than a pecu­liar­ity of nor­mal StyleGAN train­ing, where the G has started try­ing to mem­o­rize noise in the fine detail of pixelation/lines, and so these are a kind of overfitting/mode col­lapse. (More spec­u­la­tive­ly: another pos­si­ble expla­na­tion is that the cracks are caused by the StyleGAN D being sin­gle-s­cale rather than mul­ti­-s­cale—as in MSG-GAN and a num­ber of oth­er­s—and the ‘cracks’ are actu­ally high­-fre­quency noise cre­ated by the G in spe­cific patches as adver­sar­ial exam­ples to fool the D. They report­edly do not appear in MSG-GAN or StyleGAN 2, which both use mul­ti­-s­cale Ds.)

  • Gra­di­ent Accu­mu­la­tion:

    ProGAN/StyleGAN’s code­base claims to sup­port gra­di­ent accu­mu­la­tion, which is a way to fake large mini­batch train­ing (eg n = 2048) by not doing the back­prop­a­ga­tion update every mini­batch, but instead sum­ming the gra­di­ents over many mini­batches and apply­ing them all at once. This is a use­ful trick for sta­bi­liz­ing train­ing, and large mini­batch NN train­ing can differ qual­i­ta­tively from small mini­batch NN training—BigGAN per­for­mance increased with increas­ingly large mini­batches (n = 2048) and the authors spec­u­late that this is because such large mini­batches mean that the full diver­sity of the dataset is rep­re­sented in each ‘mini­batch’ so the BigGAN mod­els can­not sim­ply ‘for­get’ rarer dat­a­points which would oth­er­wise not appear for many mini­batches in a row, result­ing in the GAN pathol­ogy of ‘mode drop­ping’ where some kinds of data just get ignored by both G/D.

    How­ev­er, the ProGAN/StyleGAN imple­men­ta­tion of gra­di­ent accu­mu­la­tion does not resem­ble that of any other imple­men­ta­tion I’ve seen in Ten­sor­Flow or PyTorch, and in my own exper­i­ments with up to n = 4096, I did­n’t observe any sta­bi­liza­tion or qual­i­ta­tive differ­ences, so I am sus­pi­cious the imple­men­ta­tion is wrong.
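
For reference, this is what conventional gradient accumulation computes, as a toy NumPy sketch on a linear regression. Nothing here is taken from the ProGAN/StyleGAN codebase; `grad_mse` and the random data are made up for illustration. Averaging the gradients of 8 micro-batches of 8, with no weight update in between, should reproduce the gradient of a single minibatch of 64:

```python
import numpy as np

def grad_mse(w, x, y):
    """Gradient of the mean squared error of the linear model x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(x)

rng = np.random.RandomState(0)
x = rng.randn(64, 5)
y = rng.randn(64)
w = np.zeros(5)

# One large minibatch of 64 samples.
full_grad = grad_mse(w, x, y)

# Gradient accumulation: 8 micro-batches of 8, with no weight update in between.
accumulated = np.zeros_like(w)
micro_batches = np.split(np.arange(64), 8)
for idx in micro_batches:
    accumulated += grad_mse(w, x[idx], y[idx])
accumulated /= len(micro_batches)  # average so the scale matches the large batch

print(np.allclose(full_grad, accumulated))
```

The equivalence is exact only when the loss is a plain average over samples; any layer which computes statistics across the minibatch sees just the micro-batch, so accumulation in a GAN is at best an approximation of true large-minibatch training.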

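The square-root rule of thumb from the learning-rate note above can be written out as a one-line helper. This is only a heuristic starting point: `sqrt_scaled_lr` is a hypothetical helper rather than part of StyleGAN, and plain dicts stand in for the sched.G_lrate_dict & sched.D_lrate_dict settings.

```python
import math

def sqrt_scaled_lr(reference_lr, reference_batch, new_batch):
    """Scale a learning rate by the square root of the minibatch-size ratio."""
    return reference_lr * math.sqrt(new_batch / reference_batch)

# Reference point from the text: LR 0.003 at n = 32, versus n = 8 on 2 GPUs.
lr = sqrt_scaled_lr(0.003, 32, 8)
print(round(lr, 6))

# Turning the D down separately, eg to a tenth of the G's LR.
g_lrate = {512: lr}
d_lrate = {res: value / 10 for res, value in g_lrate.items()}
print(g_lrate, d_lrate)
```

Going from n = 32 to n = 8 halves the LR under this rule, from 0.003 to 0.0015; given how touchy GANs are, the usable LR may well be lower still.
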
Here is what a suc­cess­ful train­ing pro­gres­sion looks like for the anime face StyleGAN:

Train­ing mon­tage video of the first 9k iter­a­tions of the anime face StyleGAN.
The anime face model is obso­leted by the StyleGAN 2 por­trait model.

The anime face model as of 2019-03-08, trained for 21,980 iter­a­tions or ~21m images or ~38 GPU-days, is avail­able for down­load. (It is still not ful­ly-con­verged, but the qual­ity is good.)


Hav­ing suc­cess­fully trained a StyleGAN, now the fun part—­gen­er­at­ing sam­ples!

Psi/“truncation trick”

The 𝜓/“truncation trick” (BigGAN discussion, StyleGAN discussion; apparently first introduced in earlier work) is the most important hyperparameter for all StyleGAN generation.

The truncation trick is used at sample generation time but not training time. The idea is to edit the latent vector z, which is a vector of standard-normal random variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those.29 This seems to help by avoiding ‘extreme’ latent values or combinations of latent values which the G is not as good at: a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted distribution for each one, or one can truncate them at something like +1SD or +0.7SD. (Like the discussion of the best distribution for the original latent distribution, there’s no good reason to think that this is an optimal method of doing truncation; there are many alternatives, such as ones penalizing the sum of the variables, either rejecting them or scaling them down, and some may work better than the current truncation trick.)

At 𝜓=0, diver­sity is nil and all faces are a sin­gle global aver­age face (a brown-eyed brown-haired school­girl, unsur­pris­ing­ly); at ±0.5 you have a broad range of faces, and by ±1.2, you’ll see tremen­dous diver­sity in faces/styles/consistency but also tremen­dous arti­fact­ing & dis­tor­tion. Where you set your 𝜓 will heav­ily influ­ence how ‘orig­i­nal’ out­puts look. At 𝜓=1.2, they are tremen­dously orig­i­nal but extremely hit or miss. At 𝜓=0.5 they are con­sis­tent but bor­ing. For most of my sam­pling, I set 𝜓=0.7 which strikes the best bal­ance between craziness/artifacting and quality/diversity. (Per­son­al­ly, I pre­fer to look at 𝜓=1.2 sam­ples because they are so much more inter­est­ing, but if I released those sam­ples, it would give a mis­lead­ing impres­sion to read­er­s.)
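
As a rough illustration, here are two ways of truncating, sketched in NumPy: the resampling variant described above, and interpolation toward an average latent, w′ = w̄ + 𝜓(w − w̄), which as I understand it is the form StyleGAN applies in its intermediate latent space and is what makes 𝜓=0 collapse to a single average face and lets 𝜓 go negative. The function names and the all-zeros `w_avg` are placeholders of mine; a real model would use its own learned average and apply this inside the generator.

```python
import numpy as np

def truncate_by_resampling(z, threshold, rng):
    """Redraw every component of z whose magnitude exceeds `threshold`."""
    z = z.copy()
    too_big = np.abs(z) > threshold
    while too_big.any():
        z[too_big] = rng.randn(int(too_big.sum()))
        too_big = np.abs(z) > threshold
    return z

def truncate_toward_mean(w, w_avg, psi):
    """Shrink w toward the average latent: psi=1 leaves w alone, psi=0 gives w_avg."""
    return w_avg + psi * (w - w_avg)

rng = np.random.RandomState(0)
z = rng.randn(512)
z_trunc = truncate_by_resampling(z, threshold=1.0, rng=rng)
print(np.abs(z).max(), np.abs(z_trunc).max())

w_avg = np.zeros(512)  # stand-in for a learned average latent
for psi in (1.0, 0.7, 0.0, -0.5):
    w = truncate_toward_mean(z, w_avg, psi)
    # Distance from the average shrinks with |psi|; a negative psi flips direction.
    print(psi, float(np.linalg.norm(w - w_avg)))
```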

Random Samples

The StyleGAN repo has a sim­ple script to down­load & gen­er­ate a sin­gle face; in the inter­ests of repro­ducibil­i­ty, it hard­wires the model and the RNG seed so it will only gen­er­ate 1 par­tic­u­lar face. How­ev­er, it can be eas­ily adapted to use a local model and (slowly30) gen­er­ate, say, 1000 sam­ple images with the hyper­pa­ra­me­ter 𝜓=0.6 (which gives high­-qual­ity but not high­ly-di­verse images) which are saved to results/example-{0-999}.png:

import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config

def main():
    tflib.init_tf() # initialize TensorFlow before unpickling the networks
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    os.makedirs(config.result_dir, exist_ok=True)
    rnd = np.random.RandomState(None) # unseeded, so every run produces different faces

    for i in range(0,1000):
        latents = rnd.randn(1, Gs.input_shape[1])
        images = Gs.run(latents, None, truncation_psi=0.6, randomize_noise=True, output_transform=fmt)
        png_filename = os.path.join(config.result_dir, 'example-'+str(i)+'.png')
        PIL.Image.fromarray(images[0], 'RGB').save(png_filename)

if __name__ == "__main__":
    main()

Karras et al 2018 Figures

The figures in Karras et al 2018, demonstrating random samples and aspects of the style noise using the 1024px FFHQ face model (as well as the others), were generated by a script included in the StyleGAN repo. This script needs extensive modifications to work with my 512px anime face model; going through the file:

  • the code uses 𝜓=1 trun­ca­tion, but faces look bet­ter with 𝜓=0.7 (sev­eral of the func­tions have truncation_psi= set­tings but, trick­i­ly, Fig­ure 3’s draw_style_mixing_figure has its 𝜓 set­ting hid­den away in the synthesis_kwargs global vari­able)
  • the loaded model needs to be switched to the anime face mod­el, of course
  • dimen­sions must be reduced 1024→512 as appro­pri­ate; some ranges are hard­coded and must be reduced for 512px images as well
  • the trun­ca­tion trick fig­ure 8 does­n’t show enough faces to give insight into what the latent space is doing so it needs to be expanded to show both more ran­dom seeds/faces, and more 𝜓 val­ues
  • the bedroom/car/cat sam­ples should be dis­abled

The changes I make are as fol­lows:

diff --git a/ b/
index 45b68b8..f27af9d 100755
--- a/
+++ b/
@@ -24,16 +24,13 @@ url_bedrooms    = '
 url_cars        = '' # karras2019stylegan-cars-512x384.pkl
 url_cats        = '' # karras2019stylegan-cats-256x256.pkl

-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)

 _Gs_cache = dict()

 def load_Gs(url):
-    if url not in _Gs_cache:
-        with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
-            _G, _D, Gs = pickle.load(f)
-        _Gs_cache[url] = Gs
-    return _Gs_cache[url]
+    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+    return Gs

 # Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
     canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
     for row, seed in enumerate(seeds):
         latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
-        images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+        images = Gs.run(latents, None, **synthesis_kwargs)
         canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
         for i in range(4):
             crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
     all_images = []
     for noise_range in noise_ranges:
         tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
-        range_images = Gsc.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+        range_images = Gsc.run(latents, None, randomize_noise=False, **synthesis_kwargs)
         range_images[flips, :, :] = range_images[flips, :, ::-1]

@@ -144,14 +141,11 @@ def draw_truncation_trick_figure(png, Gs, w, h, seeds, psis):
 def main():
     os.makedirs(config.result_dir, exist_ok=True)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
-    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
-    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
-    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
-    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])

All this done, we get some fun anime face sam­ples to par­al­lel Kar­ras et al 2018’s fig­ures:

Anime face StyleGAN, Fig­ure 2, uncu­rated sam­ples
Fig­ure 3, “style mix­ing” of source/transfer faces, demon­strat­ing con­trol & inter­po­la­tion (top row=style, left colum­n=­tar­get to be styled)
Fig­ure 8, the “trun­ca­tion trick” visu­al­ized: 10 ran­dom faces, with the range 𝜓 = [1, 0.7, 0.5, 0.25, 0, −0.25, −0.5, −1]—demon­strat­ing the trade­off between diver­sity & qual­i­ty, and the global aver­age face.


Training Montage

The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated, and at the end they can be quite large (often 14MB each for the anime faces), so doing lossy compression with a tool like pngnq+advpng or converting them to JPG with lowered quality is a good idea. To turn the many snapshots into a training montage video like above, I use FFmpeg on the PNGs:

# Flags: -framerate 10 shows 10 input images per second; '-i -' reads them from stdin;
# -r 25 sets the output frame-rate (frames are duplicated to pad out to 25FPS);
# libx264 is used for compatibility; -pix_fmt yuv420p forces a standard colorspace
# (otherwise the PNG colorspace is kept, breaking browsers!); -crf 33 is adequate quality;
# the scale filter shrinks the image 2×, as full detail is unnecessary & this saves space;
# -preset veryslow -tune animation aims for the smallest file with animation-tuned settings.
cat $(ls ./results/*faces*/fakes*.png | sort --numeric-sort) | ffmpeg -framerate 10 \
    -i - \
    -r 25 \
    -c:v libx264 \
    -pix_fmt yuv420p \
    -crf 33 \
    -vf "scale=iw/2:ih/2" \
    -preset veryslow -tune animation \
    ./training-montage.mp4 # output filename is a placeholder


The orig­i­nal ProGAN repo pro­vided a con­fig for gen­er­at­ing inter­po­la­tion videos, but that was removed in StyleGAN. Cyril Diagne (@kikko_fr) imple­mented a replace­ment, pro­vid­ing 3 kinds of videos:

  1. random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default it makes 4 of them arranged 2×2 in the video. Several interpolation videos are shown in the examples section.

  2. interpolate.mp4: a ‘coarse’ “style mixing” video; a single ‘source’ face is generated & held constant; a secondary interpolation video, a random walk as before, is generated; at each step of the random walk, the ‘coarse’/high-level ‘style’ noise is copied from the random walk to overwrite the source face’s original style noise. For faces, this means that the original face will be modified with all sorts of orientations & facial expressions while still remaining recognizably the original character. (It is the video analog of Karras et al 2018’s Figure 3.)

    A copy of Diagne’s script:

    import os
    import pickle
    import numpy as np
    import PIL.Image
    import dnnlib
    import dnnlib.tflib as tflib
    import config
    import scipy.ndimage
    def main():
        tflib.init_tf() # initialize TensorFlow before unpickling the networks
        # Load pre-trained network.
        # url = ''
        # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
        ## NOTE: insert model here:
        _G, _D, Gs = pickle.load(open("results/02047-sgan-faces-2gpu/network-snapshot-013221.pkl", "rb"))
        # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
        # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
        # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.
        grid_size = [2,2]
        image_shrink = 1
        image_zoom = 1
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        random_seed = 404
        mp4_file = 'results/random_grid_%s.mp4' % random_seed
        minibatch_size = 8
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_state = np.random.RandomState(random_seed)
        # Generate latent vectors
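        # One latent per grid cell per frame, so the 2×2 grid shows 4 independent walks.
        shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:] # [frame, image, component]
        all_latents = random_state.randn(*shape).astype(np.float32)
        import scipy.ndimage
        # Smooth along the frame axis only (sigma is 0 for the other axes).
        all_latents = scipy.ndimage.gaussian_filter(all_latents,
                       [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
        all_latents /= np.sqrt(np.mean(np.square(all_latents)))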
        def create_image_grid(images, grid_size=None):
            assert images.ndim == 3 or images.ndim == 4
            num, img_h, img_w, channels = images.shape
            if grid_size is not None:
                grid_w, grid_h = tuple(grid_size)
            else:
                grid_w = max(int(np.ceil(np.sqrt(num))), 1)
                grid_h = max((num - 1) // grid_w + 1, 1)
            grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
            for idx in range(num):
                x = (idx % grid_w) * img_w
                y = (idx // grid_w) * img_h
                grid[y : y + img_h, x : x + img_w] = images[idx]
            return grid
        # Frame generation func for moviepy.
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            latents = all_latents[frame_idx]
            fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
            images = Gs.run(latents, None, truncation_psi=0.7,
                                  randomize_noise=False, output_transform=fmt)
            grid = create_image_grid(images, grid_size)
            if image_zoom > 1:
                grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
            if grid.shape[2] == 1:
                grid = grid.repeat(3, 2) # grayscale => RGB
            return grid
        # Generate video.
        import moviepy.editor
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
        # import scipy
        # coarse
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_seed = 500
        random_state = np.random.RandomState(random_seed)
        w = 512
        h = 512
        #src_seeds = [601]
        dst_seeds = [700]
        style_ranges = ([0] * 7 + [range(8,16)]) * len(dst_seeds)
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)
        shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
        src_latents = random_state.randn(*shape).astype(np.float32)
        src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                    smoothing_sec * mp4_fps,
                                                    mode='wrap')
        src_latents /= np.sqrt(np.mean(np.square(src_latents)))
        dst_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in dst_seeds])
        src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
        dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
        src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
        dst_images = Gs.components.synthesis.run(dst_dlatents, randomize_noise=False, **synthesis_kwargs)
        canvas = PIL.Image.new('RGB', (w * (len(dst_seeds) + 1), h * 2), 'white')
        for col, dst_image in enumerate(list(dst_images)):
            canvas.paste(PIL.Image.fromarray(dst_image, 'RGB'), ((col + 1) * h, 0))
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            src_image = src_images[frame_idx]
            canvas.paste(PIL.Image.fromarray(src_image, 'RGB'), (0, h))
            for col, dst_image in enumerate(list(dst_images)):
                col_dlatents = np.stack([dst_dlatents[col]])
                col_dlatents[:, style_ranges[col]] = src_dlatents[frame_idx, style_ranges[col]]
                col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
                for row, image in enumerate(list(col_images)):
                    canvas.paste(PIL.Image.fromarray(image, 'RGB'), ((col + 1) * h, (row + 1) * w))
            return np.array(canvas)
        # Generate video.
        import moviepy.editor
        mp4_file = 'results/interpolate.mp4'
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
        import scipy.ndimage # gaussian_filter lives in the ndimage submodule
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_seed = 503
        random_state = np.random.RandomState(random_seed)
        w = 512
        h = 512
        style_ranges = [range(6,16)]
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)
        shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
        src_latents = random_state.randn(*shape).astype(np.float32)
        src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                    smoothing_sec * mp4_fps,
                                                    mode='wrap')
        src_latents /= np.sqrt(np.mean(np.square(src_latents)))
        dst_latents = np.stack([random_state.randn(Gs.input_shape[1])])
        src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
        dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            col_dlatents = np.stack([dst_dlatents[0]])
            col_dlatents[:, style_ranges[0]] = src_dlatents[frame_idx, style_ranges[0]]
            col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
            return col_images[0]
        # Generate video.
        import moviepy.editor
        mp4_file = 'results/fine_%s.mp4' % (random_seed)
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
    if __name__ == "__main__":
        main()

    ‘Coarse’ style-transfer/interpolation video

  3. fine_503.mp4: a ‘fine’ style mix­ing video; in this case, the style noise is taken from later on and instead of affect­ing the global ori­en­ta­tion or expres­sion, it affects sub­tler details like the pre­cise shape of hair strands or hair color or mouths.

    ‘Fine’ style-transfer/interpolation video
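The difference between the 'coarse' and 'fine' videos comes down to which rows of the intermediate latent array are overwritten before synthesis. A minimal NumPy-only sketch of that step, with no StyleGAN required: `mix_styles` is an illustrative helper rather than part of the StyleGAN codebase, and the 16×512 shape is an assumption matching a 512px model.

```python
import numpy as np

def mix_styles(dst_dlatent, src_dlatent, layers):
    """Return a copy of dst_dlatent whose listed layers are taken from src_dlatent."""
    mixed = dst_dlatent.copy()
    mixed[layers] = src_dlatent[layers]
    return mixed

# Assumed shape for a 512px StyleGAN: [layer, component] = [16, 512].
num_layers, dim = 16, 512
rng = np.random.RandomState(0)
dst = rng.randn(num_layers, dim)   # the face being held fixed
src = rng.randn(num_layers, dim)   # one frame of the moving 'style' source

# 'Fine' mixing as in the script above: only layers 6..15 follow the source.
fine = mix_styles(dst, src, list(range(6, 16)))
```

The earlier layers of `fine` still belong to the fixed face, which is why pose and overall identity stay put while hair strands, colors, and mouths drift.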

Circular interpolations are another interesting kind of interpolation, written by snowy halcy. Instead of random-walking freely around the latent space, with large or awkward transitions, a circular interpolation tries to move around a fixed high-dimensional point, doing: “binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a sphere or circle in the latent space.” A later version of circular interpolation is in snowy halcy’s face editor repo, but here is the original version cleaned up into a stand-alone program:

import dnnlib.tflib as tflib
import math
import moviepy.editor
from numpy import linalg
import numpy as np
import pickle

def main():
    # A TensorFlow session must exist before the pickled networks can be loaded.
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    def circ_generator(latents_interpolate):
        radius = 40.0

        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)

        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius

        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0

        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = []

        video_length = 1.0
        while(current_pos < video_length):
            # Record the frame accepted on the previous pass before searching for the next one.
            array_list.append(current_image)

            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            # Binary search on the step size until the frame-to-frame MSE lands in [change_min, change_max].
            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0

                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0

                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print(current_pos, current_mse)
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20

    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()

‘Cir­cu­lar’ inter­po­la­tion video
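
The quoted description hedges that the path is only "probably close to" a circle, and that matches the script: its two axes, `a - b` and `a - c`, are each normalized but are not orthogonal to one another, so the path is in general an ellipse. A minimal NumPy-only sketch of a variant that orthogonalizes the second axis with one Gram-Schmidt step, so that every point sits at exactly `radius` from the center; `circular_latents` and its parameters are illustrative names, not from snowy halcy's code, and the latent size of 512 is an assumption.

```python
import math
import numpy as np

def circular_latents(center, axis_x, axis_y, radius, n_points):
    """Return n_points latents on a circle of the given radius around center."""
    # Unit vector along the first axis.
    u = axis_x / np.linalg.norm(axis_x)
    # Gram-Schmidt: remove the component of axis_y along u, then normalize.
    v = axis_y - np.dot(axis_y, u) * u
    v = v / np.linalg.norm(v)
    points = []
    for i in range(n_points):
        theta = 2.0 * math.pi * i / n_points
        points.append(center + radius * (math.sin(theta) * u + math.cos(theta) * v))
    return np.stack(points)

rng = np.random.RandomState(0)
a, b, c = rng.randn(3, 512)  # 512 is assumed to match Gs.input_shape[1]
path = circular_latents(a, a - b, a - c, radius=40.0, n_points=8)
distances = np.linalg.norm(path - a, axis=1)  # distance of each point from the center
```

Whether a true circle looks any better than the original's ellipse is an open question; the adaptive MSE stepping probably matters more for smoothness than the exact shape of the path.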

An inter­est­ing use of inter­po­la­tions is Kyle McLean’s “Waifu Syn­the­sis” video: a singing anime video mash­ing up StyleGAN anime faces + lyrics + Project Magenta music.


Anime Faces

The primary model I’ve trained, the anime face model, is described in the data processing & training section. It is a 512px StyleGAN model trained on n = 218,794 faces cropped from all of Danbooru2017, cleaned, & upscaled, and trained for 21,980 iterations or ~21m images or ~38 GPU-days.

Downloads (I recommend using the more-recent portrait StyleGAN unless cropped faces are specifically desired):


To show off the anime faces, and as a joke, on 2019-02-14, I set up “This Waifu Does Not Exist”, a standalone static website which displays a random anime face (out of 100,000), generated with various 𝜓, and paired with GPT-2-117M text snippets prompted on anime plot summaries. The details are too lengthy to go into here.

But the site was amus­ing & an enor­mous suc­cess. It went viral overnight and by the end of March 2019, ~1 mil­lion unique vis­i­tors (most from Chi­na) had vis­ited TWDNE, spend­ing over 2 min­utes each look­ing at the NN-gen­er­ated faces & text; peo­ple began hunt­ing for hilar­i­ous­ly-de­formed faces, using TWDNE as a screen­saver, pick­ing out faces as avatars, cre­at­ing packs of faces for video games, paint­ing their own col­lages of faces, using it as a char­ac­ter designer for inspi­ra­tion, etc.

Anime Bodies

Aaron Gokaslan experimented with a custom 256px anime game image dataset which has individual characters posed in whole-person images to see how StyleGAN coped with more complex geometries. Progress required additional data cleaning and lowering the learning rate but, trained on a 4-GPU system for a week or two, the results are promising (even down to reproducing the copyright statements in the images), providing preliminary evidence that StyleGAN can scale:

Whole-body anime images, ran­dom sam­ples, Aaron Gokaslan
Whole-body anime images, style trans­fer among sam­ples, Aaron Gokaslan

Transfer Learning

"In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

“What are you doing?”, asked Min­sky. “I am train­ing a ran­domly wired neural net to play Tic-Tac-Toe” Suss­man replied. “Why is the net wired ran­dom­ly?”, asked Min­sky. “I do not want it to have any pre­con­cep­tions of how to play”, Suss­man said.

Min­sky then shut his eyes. “Why do you close your eyes?”, Suss­man asked his teacher. “So that the room will be emp­ty.”

At that moment, Suss­man was enlight­ened."

“Sussman attains enlightenment”, “AI Koans”, Jargon File

One of the most use­ful things to do with a trained model on a broad data cor­pus is to use it as a launch­ing pad to train a bet­ter model quicker on lesser data, called “trans­fer learn­ing”. For exam­ple, one might trans­fer learn from Nvidi­a’s FFHQ face StyleGAN model to a differ­ent celebrity dataset, or from bed­room­s→k­itchens. Or with the anime face mod­el, one might retrain it on a sub­set of faces—all char­ac­ters with red hair, or all male char­ac­ters, or just a sin­gle spe­cific char­ac­ter. Even if a dataset seems differ­ent, start­ing from a pre­trained model can save time; after all, while male and female faces may look differ­ent and it may seem like a mis­take to start from a most­ly-fe­male anime face mod­el, the alter­na­tive of start­ing from scratch means start­ing with a model gen­er­at­ing ran­dom rain­bow-col­ored sta­t­ic, and surely male faces look far more like female faces than they do ran­dom sta­t­ic?31 Indeed, you can quickly train a pho­to­graphic face model start­ing from the anime face mod­el.

This extends the reach of good StyleGAN mod­els from those blessed with both big data & big com­pute to those with lit­tle of either. Trans­fer learn­ing works par­tic­u­larly well for spe­cial­iz­ing the anime face model to a spe­cific char­ac­ter: the images of that char­ac­ter would be too lit­tle to train a good StyleGAN on, too data-im­pov­er­ished for the sam­ple-in­effi­cient StyleGAN1–232, but hav­ing been trained on all anime faces, the StyleGAN has learned well the full space of anime faces and can eas­ily spe­cial­ize down with­out over­fit­ting. Try­ing to do, say, faces ↔︎ land­scapes is prob­a­bly a bridge too far.

Data-wise, for doing face specialization, the more the better: n = 500–5000 is an adequate range, but even as low as n = 50 works surprisingly well. I don’t know to what extent data augmentation can substitute for original datapoints, but it’s probably worth a try, especially if you have n < 5000.

Compute-wise, specialization is rapid. Adaptation can happen within a few ticks, possibly even 1. This is surprisingly fast given that StyleGAN is not designed for few-shot/transfer learning. I speculate that this may be because the StyleGAN latent space is expressive enough that even new faces (such as new human faces for a FFHQ model, or a new anime character for an anime-face model) are still already present in the latent space. Examples of the expressivity are provided by Abdal et al 2019, who find that “although the StyleGAN generator is trained on a human face dataset [FFHQ], the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars.” If even images as different as cars can be encoded successfully into a face StyleGAN, then clearly the latent space can easily model new faces and so any new face training data is in some sense already learned; so the training process is perhaps not so much about learning ‘new’ faces as about making the new faces more ‘important’ by expanding the latent space around them & contracting it around everything else, which seems like a far easier task.

How does one actually do transfer learning? Since StyleGAN is (currently) unconditional with no dataset-specific categorical or text or metadata encoding, just a flat set of images, all that has to be done is to encode the new dataset and simply start training with an existing model. One creates the new dataset as usual, and then edits train.py with a new -desc line for the new dataset, and if resume_kimg is set correctly (see next paragraph) and resume_run_id = "latest" enabled as advised, you can then run python train.py and presto, transfer learning.

The main problem seems to be that training cannot be done from scratch/0 iterations, as one might naively assume: when I tried this, it did not work well and StyleGAN appeared to be ignoring the pretrained model. My hypothesis is that as part of the progressive growing/fading in of additional resolution/layers, StyleGAN simply randomizes or wipes out each new layer and overwrites them, making it pointless. This is easy to avoid: simply jump the training schedule all the way to the desired resolution. For example, to start at one’s maximum size (here 512px) one might set resume_kimg=7000 in training/training_loop.py. This forces StyleGAN to skip all the progressive growing and load the full model as-is. To make sure you did it right, check the first sample (fakes07000.png or whatever), from before any transfer learning training has been done, and it should look like the original model did at the end of its training. Then subsequent training samples should show the original quickly morphing to the new dataset. (Anything like fakes00000.png should not show up because that indicates beginning from scratch.)
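
To see why a value like `resume_kimg=7000` lands at full resolution, it can help to reproduce the progressive-growing schedule on paper. The following is a simplified sketch of how I understand the schedule to work, not a copy of the StyleGAN source: the function name is made up, and the defaults (8px starting resolution, 600 kimg of training plus 600 kimg of fade-in per resolution) are assumptions that should be checked against your own checkout.

```python
import math

def scheduled_resolution(kimg, max_resolution=512, initial_resolution=8,
                         training_kimg=600, transition_kimg=600):
    """Approximate resolution the schedule has reached after `kimg` thousand images."""
    phase_dur = training_kimg + transition_kimg      # one phase per resolution doubling
    phase_idx = int(kimg // phase_dur)
    phase_kimg = kimg - phase_idx * phase_dur
    # Level of detail: how many doublings remain before full resolution.
    lod = math.log2(max_resolution) - math.floor(math.log2(initial_resolution)) - phase_idx
    # During the second half of a phase, the next resolution is being faded in.
    lod -= max(phase_kimg - training_kimg, 0.0) / transition_kimg
    lod = max(lod, 0.0)
    return 2 ** (int(math.log2(max_resolution)) - int(math.floor(lod)))

for kimg in (0, 1200, 7000):
    print(kimg, scheduled_resolution(kimg))
```

Under these assumed defaults, resuming at 0 kimg starts back at 8px, which would be consistent with the pretrained higher-resolution layers being ignored, while 7000 kimg is already at 512px.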

Anime Faces → Character Faces


The first transfer learning was done with Holo of Spice and Wolf. It used a 512px Holo face dataset created with Nagadomi’s cropper from all of Danbooru2017, upscaled with waifu2x, cleaned by hand, and then data-augmented from n = 3900 to n = 12600; mirroring was enabled since Holo is symmetrical. I then used the anime face model as of 2019-02-09. It was not fully converged (indeed, it wouldn’t converge with weeks more training), but the quality was so good, and I was so curious as to how well retraining would work, that I switched gears.

It’s worth men­tion­ing that this dataset was used pre­vi­ously with ProGAN, where after weeks of train­ing, ProGAN over­fit badly as demon­strated by the sam­ples & inter­po­la­tion videos.

Train­ing hap­pened remark­ably quick­ly, with all the faces con­verted to rec­og­niz­ably Holo faces within a few hun­dred iter­a­tions:

Train­ing mon­tage of a Holo face model ini­tial­ized from the anime face StyleGAN (blink & you’ll miss it)
Inter­po­la­tion video of the Holo face model ini­tial­ized from the anime face StyleGAN

The best sam­ples were con­vinc­ing with­out exhibit­ing the fail­ures of the ProGAN:

64 hand-s­e­lected Holo face sam­ples

The StyleGAN was much more successful, despite a few failure latent points carried over from the anime faces. Indeed, after a few hundred iterations, it was starting to overfit with the ‘crack’ artifacts & smearing in the interpolations. The latest I was willing to use was iteration #11370, and I think it is still somewhat overfit anyway. I thought that with its total n (after data augmentation), Holo would be able to train longer (being 1⁄7th the size of FFHQ), but apparently not. Perhaps the data augmentation is considerably less valuable than 1-for-1, either because the invariants it encodes aren’t that useful (suggesting that Geirhos et al 2018-like style transfer data augmentation is what’s necessary) or because they would be useful but the anime face StyleGAN has already learned them all as part of the previous training & needs more real data to better understand Holo-like faces. It’s also possible that the results could be improved by using one of the later anime face StyleGANs since they did improve when I trained them further after my 2 Holo/Asuka transfer experiments.

Nev­er­the­less, impressed, I could­n’t help but won­der if they had reached human-levels of verisimil­i­tude: would an unwary viewer assume they were hand­made?

So I selected ~100 of the best samples (24MB; Imgur mirror) from a dump of 2000, cropped about 5% from the left/right sides to hide the background artifacts a little bit, and submitted them on 2019-02-11 to /r/SpiceandWolf under an alt account. I made the mistake of sorting by filesize & thus leading with a face that was particularly suspicious (streaky hair), so one Redditor voiced the suspicion they were from MGM (absurd yet not entirely wrong), but all the other commenters took the faces in stride or praised them, and the submission received +248 votes (99% positive) by March. A Redditor then turned them all into a GIF video which earned +192 (100%) and many positive comments with no further suspicions until I explained. Not bad indeed.

The #11370 Holo StyleGAN model is avail­able for down­load.


After the Holo train­ing & link sub­mis­sion went so well, I knew I had to try my other char­ac­ter dataset, Asuka, using n = 5300 data-aug­mented to n = 58,000.33 Keep­ing in mind how data seemed to limit the Holo qual­i­ty, I left mir­ror­ing enabled for Asuka, even though she is not sym­met­ri­cal due to her eye­patch over her left eye (as purists will no doubt note).

Train­ing mon­tage of an Asuka face model ini­tial­ized from the anime face StyleGAN
Inter­po­la­tion video of the Asuka face model ini­tial­ized from the anime face StyleGAN

Inter­est­ing­ly, while Holo trained within 4 GPU-hours, Asuka proved much more diffi­cult and did not seem to be fin­ished train­ing or show­ing the cracks despite train­ing twice as long. Is this due to hav­ing ~35% more real data, hav­ing 10× rather than 3× data aug­men­ta­tion, or some inher­ent differ­ence like Asuka being more com­plex (eg because of more vari­a­tions in her appear­ance like the eye­patches or plug­suit­s)?

I gen­er­ated 1000 ran­dom sam­ples with 𝜓=1.2 because they were par­tic­u­larly inter­est­ing to look at. As with Holo, I picked out the best 100 (13MB; Imgur mir­ror) from ~2000:

64 hand-s­e­lected Asuka face sam­ples

And I sub­mit­ted to the /r/Evangelion sub­red­dit, where it also did well (+109, 98%); there were no spec­u­la­tions about the faces being NN-gen­er­ated before I revealed it, merely requests for more. Between the two, it appears that with ade­quate data (n > 3000) and mod­er­ate cura­tion, a sim­ple kind of art Tur­ing test can be passed.

The #7903 Asuka StyleGAN model is avail­able for down­load.


In early February 2019, using the then-released model, Redditor Ending_Credits tried transfer learning to n = 500 faces of the Kantai Collection character Zuihou for ~1 tick (~60k iterations).

The sam­ples & inter­po­la­tions have many arti­facts, but the sam­ple size is tiny and I’d con­sider this good fine­tun­ing from a model never intended for few-shot learn­ing:

StyleGAN trans­fer learn­ing from anime face StyleGAN to Kan­Colle Zui­hou by End­ing_­Cred­its, 8×15 ran­dom sam­ple grid
Inter­po­la­tion video (4×4) of the Zui­hou face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its
Inter­po­la­tion video (1×1) of the Zui­hou face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Prob­a­bly it could be made bet­ter by start­ing from the lat­est anime face StyleGAN mod­el, and using aggres­sive data aug­men­ta­tion. Another option would be to try to find as many char­ac­ters which look sim­i­lar to Zui­hou (match­ing on hair color might work) and train on a joint dataset—un­con­di­tional sam­ples would then need to be fil­tered for just Zui­hou faces, but per­haps that draw­back could be avoided by a third stage of Zui­hou-only train­ing?



Another Kan­colle char­ac­ter, Akizuki, was trained in April 2019 by Gan­so.


In Jan­u­ary 2020, Ganso trained a StyleGAN 2 model from the S2 por­trait model on a tiny cor­pus of Ptilop­sis images, a char­ac­ter from Arknights, a 2017 Chi­nese RPG mobile game.

Train­ing sam­ples of Ptilop­sis, Arknights (StyleGAN 2 por­traits trans­fer, by Gan­so)

Ptilopsis are owls, and her character design shows prominent ears; despite the few images to work with (just 21 on Danbooru as of 2020-01-19), the interpolation shows smooth adjustments of the ears in all positions & alignments, demonstrating the power of transfer learning:

Inter­po­la­tion video (4×4) of the Ptilop­sis face model ini­tial­ized from the anime face StyleGAN 2, trained by Ganso



Ending_Credits likewise did transfer to Saber (Fate/stay night), n = 4000. The results look about as expected given the sample sizes and previous transfer results:

Inter­po­la­tion video (4×4) of the Saber face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Fate/Grand Order

Michael Sug­imura in May 2019 exper­i­mented with trans­fer learn­ing from the 512px anime por­trait GAN to faces cropped from ~6k wall­pa­pers he down­loaded via Google search queries. His results for Saber & related char­ac­ters look rea­son­able but more broad­ly, some­what low-qual­i­ty, which Sug­imura sus­pects is due to inad­e­quate data clean­ing (“there are a num­ber of lower qual­ity images and also images of back­grounds, armor, non-char­ac­ter images left in the dataset which causes weird arti­facts in gen­er­ated images or just lower qual­ity gen­er­ated images.”).


Finally, Ending_Credits did transfer to Louise (Zero no Tsukaima), n = 350:

Inter­po­la­tion video (4×4) of the Louise face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Not as good as Saber due to the much smaller sam­ple size.


road­run­ner01 exper­i­mented with a num­ber of trans­fers, includ­ing a trans­fer of the male char­ac­ter () with n = 50 (!), which is not nearly as garbage as it should be.


Flatis­Dogchi exper­i­mented with trans­fer to n = 988 (aug­mented to n = 18772) Asashio (Kan­Colle) faces, cre­at­ing “This Asashio Does Not Exist”.

Marisa Kirisame & the Komeijis

A Japan­ese user mei_miya posted an inter­po­la­tion video of the Touhou char­ac­ter Marisa Kirisame by trans­fer learn­ing on 5000 faces. They also did the Touhou char­ac­ters Satori/Koishi Komeiji with n = 6000.

The Red­dit user Jepa­cor also has done Marisa, using Dan­booru sam­ples.


A Chinese user 3D_DLW (S2 writeup/tutorial: 1/2) in February 2020 did transfer-learning from the S2 portrait model to Pixiv artwork of the character Lexington from Warship Girls. He used a similar workflow: cropping faces with lbpcascade_animeface, upscaling with waifu2x, and cleaning by Discriminator ranking (using the original S2 model’s Discriminator & producing datasets of varying cleanliness at n = 302–1659). Samples:

Ran­dom sam­ples for anime por­trait S2 → War­ship Girls char­ac­ter Lex­ing­ton.

Hayasaka Ai

Tazik Shahjahan finetuned S2 on Kaguya-sama: Love Is War’s Hayasaka Ai, providing a Colab notebook demonstrating how he scraped Pixiv and filtered out invalid images to create the training corpus.


CaJI9I created an “ahegao” StyleGAN; unspecified corpus or method:

6×6 sam­ple of ahe­gao StyleGAN faces

Anime Faces → Anime Headshots

Twit­ter user Sunk did trans­fer learn­ing to an image cor­pus of a spe­cific artist, Kure­hito Mis­aki (深崎暮人), n≅1000. His images work well and the inter­po­la­tion looks nice:

Interpolation video (4×4) of the Kurehito Misaki model initialized from the anime face StyleGAN, trained by sunk

Anime Faces → Portrait

TWDNE was a huge suc­cess and pop­u­lar­ized the anime face StyleGAN. It was not per­fect, though, and flaws were not­ed.

Portrait Improvements

The por­traits could be improved by more care­fully select­ing SFW images to avoid over­ly-sug­ges­tive faces, expand­ing the crops to avoid cut­ting off edges of heads like hair­styles,

***­For details and

, please see .***

Portrait Results

After retrain­ing the final face StyleGAN 2019-03-08–2019-04-30 on the new improved por­traits dataset, the results improved:

Train­ing sam­ple for Por­trait StyleGAN: 2019-04-30/iteration #66,083
Inter­po­la­tion video (4×4) of the Dan­booru2018 por­trait model ini­tial­ized from the Dan­booru2017 face StyleGAN
This S1 anime por­trait model is obso­leted by the StyleGAN 2 por­trait model.

The final model from 2019-04-30 is avail­able for down­load.

I used this model at 𝛙=0.5 to gen­er­ate 100,000 new por­traits for TWDNE (#100,000–199,999), bal­anc­ing the pre­vi­ous faces.

I was sur­prised how diffi­cult upgrad­ing to por­traits seemed to be; I spent almost two months train­ing it before giv­ing up on fur­ther improve­ments, while I had been expect­ing more like a week or two. The por­trait results are indeed bet­ter than the faces (I was right that not crop­ping off the top of the head adds verisimil­i­tude), but the upgrade did­n’t impress me as much as the orig­i­nal faces did com­pared to ear­lier GANs. And our other exper­i­men­tal runs on whole-Dan­booru2018 images never pro­gressed beyond sug­ges­tive blobs dur­ing this peri­od.

I sus­pect that StyleGAN—at least, on its default archi­tec­ture & hyper­pa­ra­me­ters, with­out a great deal more com­pute—is reach­ing its lim­its here, and that changes may be nec­es­sary to scale to richer images. (Self-at­ten­tion is prob­a­bly the eas­i­est to add since it should be easy to plug in addi­tional lay­ers to the con­vo­lu­tion code.)

Anime Faces → Male Faces

A few peo­ple have observed that it would be nice to have an anime face GAN for male char­ac­ters instead of always gen­er­at­ing female ones. The anime face StyleGAN does in fact have male faces in its dataset as I did no fil­ter­ing—it’s merely that female faces are over­whelm­ingly fre­quent (and it may also be that male anime faces are rel­a­tively androgynous/feminized any­way so it’s hard to tell any differ­ence between a female with short hair & a guy34).

Train­ing a male-only anime face StyleGAN would be another good appli­ca­tion of trans­fer learn­ing.

The faces can be eas­ily extracted out of Dan­booru2018 by query­ing for "male_focus", which will pick up ~150k images. More nar­row­ly, one could search "1boy" & "solo", to ensure that the only face in the image is a male face (as opposed to, say, 1boy 1girl, where a female face might be cropped out as well). This pro­vides n = 99k raw hits. It would be good to also fil­ter out ‘trap’ or over­ly-fe­male-look­ing faces (else what’s the point?), by fil­ter­ing on tags like cat ears or par­tic­u­larly pop­u­lar ‘trap’ char­ac­ters like Fate/Grand Order’s Astol­fo. A more com­pli­cated query to pick up scenes with mul­ti­ple males could be to search for both "1boy" & "multiple_boys" and then fil­ter out "1girl" & "multiple_girls", in order to select all images with 1 or more males and then remove all images with 1 or more females; this dou­bles the raw hits to n = 198k. (A down­side is that the face-crop­ping will often unavoid­ably yield crops with two faces, a pri­mary face and an over­lap­ping face, which is bad and intro­duces arti­fact­ing when I tried this with all faces.)
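
The two queries described above amount to simple set operations over each image's tags. A minimal sketch, assuming each post has already been loaded as a dict with a `tags` list; that record shape, the helper names, and the `cat_ears` exclusion are illustrative assumptions, and the real Danbooru2018 metadata would need to be reshaped first.

```python
MALE_TAGS = {"1boy", "multiple_boys"}
FEMALE_TAGS = {"1girl", "multiple_girls"}

def solo_male(posts, exclude=frozenset()):
    """Posts tagged both "1boy" and "solo", minus any carrying an excluded tag."""
    return [p for p in posts
            if {"1boy", "solo"} <= set(p["tags"]) and not set(p["tags"]) & set(exclude)]

def males_only(posts):
    """Posts with 1 or more males and no females (the broader query)."""
    return [p for p in posts
            if set(p["tags"]) & MALE_TAGS and not set(p["tags"]) & FEMALE_TAGS]

# Toy records standing in for real metadata.
posts = [
    {"id": 1, "tags": ["1boy", "solo", "short_hair"]},
    {"id": 2, "tags": ["1boy", "1girl"]},
    {"id": 3, "tags": ["multiple_boys"]},
    {"id": 4, "tags": ["1boy", "solo", "cat_ears"]},
]
solo_ids = [p["id"] for p in solo_male(posts, exclude={"cat_ears"})]
male_ids = [p["id"] for p in males_only(posts)]
```

The narrow query trades recall for crops that contain exactly one face, which sidesteps the overlapping-face artifacts mentioned above.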

Com­bined with trans­fer learn­ing from the gen­eral anime face StyleGAN, the results should be as good as the gen­eral (fe­male) faces.

I set­tled for "1boy" & "solo", and did con­sid­er­able clean­ing by hand. The raw count of images turned out to be highly mis­lead­ing, and many faces are unus­able for a male anime face StyleGAN: many are so highly styl­ized (such as action sce­nes) as to be dam­ag­ing to a GAN, or they are almost indis­tin­guish­able from female faces (be­cause they are bis­honen or trap or just androg­y­nous), which would be point­less to include (the reg­u­lar por­trait StyleGAN cov­ers those already). After hand clean­ing & use of, I was left with n~3k, so I used heavy data aug­men­ta­tion to bring it up to n~57k, and I ini­tial­ized from the final por­trait StyleGAN for the high­est qual­i­ty.

It did not overfit after ~4 days of training, but the results were not noticeably improving, so I stopped (in order to start training the GPT-2-345M, which OpenAI had just released). There are hints in the interpolation videos, I think, that it is indeed slightly overfitting, in the form of ‘glitches’ where the image abruptly jumps slightly, presumably to another mode/face/character of the original data; nevertheless, the male face StyleGAN mostly works.

Train­ing sam­ples for the male por­trait StyleGAN (2019-05-03); com­pare with the same laten­t-space points in the orig­i­nal por­trait StyleGAN.
Inter­po­la­tion video (4×4) of the Dan­booru2018 male faces model ini­tial­ized from the Dan­booru2018 por­trait StyleGAN

The male face StyleGAN model is available for download, as are 1000 random faces with 𝛙=0.7 (mirror; partial Imgur album).

Anime Faces → Ukiyo-e Faces

In Jan­u­ary 2020, Justin (@Bunt­wor­thy) used 5000 faces cropped with from to do trans­fer learn­ing. After ~24h train­ing:

Justin’s ukiy­o-e StyleGAN sam­ples, 2020-01-04.

Anime Faces → Western Portrait Faces

In 2019, aydao exper­i­mented with trans­fer learn­ing to Euro­pean por­trait faces drawn from WikiArt; the trans­fer learn­ing was done via Nathan Ship­ley’s abuse of where two mod­els are sim­ply aver­aged togeth­er, para­me­ter by para­me­ter and layer by lay­er, to yield a new mod­el. (Sur­pris­ing­ly, this work­s—as long as the mod­els aren’t too differ­ent; if they are, the aver­aged model will gen­er­ate only col­or­ful blob­s.) The results were amus­ing. From early in train­ing:

aydao 2019, anime faces → west­ern por­trait train­ing sam­ples (ear­ly)


aydao 2019, anime faces → west­ern por­trait train­ing sam­ples (later)
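
The averaging trick described in this section is simple enough to state in a few lines. A hedged sketch, assuming each model's weights have been exported as a dict mapping parameter names to NumPy arrays; that representation and the function name are illustrative, and this is not the code Nathan Shipley or aydao actually used.

```python
import numpy as np

def average_models(params_a, params_b, weight=0.5):
    """Blend two models with identical architectures, parameter by parameter."""
    if params_a.keys() != params_b.keys():
        raise ValueError("models must have the same parameter names")
    blended = {}
    for name, a in params_a.items():
        b = params_b[name]
        if a.shape != b.shape:
            raise ValueError(f"shape mismatch for {name}: {a.shape} vs {b.shape}")
        # weight=0 returns model A, weight=1 returns model B.
        blended[name] = (1.0 - weight) * a + weight * b
    return blended

# Toy 'models' with two layers each.
anime = {"conv1/weight": np.zeros((3, 3)), "conv1/bias": np.zeros(3)}
western = {"conv1/weight": np.ones((3, 3)), "conv1/bias": np.full(3, 2.0)}
halfway = average_models(anime, western)
```

Averaging only has a chance of working when one model was fine-tuned from the other, so that corresponding parameters still play corresponding roles; two independently initialized models would be expected to yield the colorful blobs mentioned above.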

Anime Faces → Danbooru2018

nshep­perd began a train­ing run using an early anime face StyleGAN model on the 512px SFW Dan­booru2018 sub­set; after ~3–5 weeks (with many inter­rup­tions) on 1 GPU, as of 2019-03-22, the train­ing sam­ples look like this:

StyleGAN train­ing sam­ples on Dan­booru2018 SFW 512px; iter­a­tion #14204 (nshep­perd)
Real 512px SFW Dan­booru2018 train­ing dat­a­points, for com­par­i­son
Train­ing mon­tage video of the Dan­booru2018 model (up to #14204, 2019-03-22), trained by nshep­perd

The StyleGAN is able to pick up global struc­ture and there are rec­og­niz­ably anime fig­ures, despite the sheer diver­sity of images, which is promis­ing. The fine details are seri­ously lack­ing, and train­ing, to my eye, is wan­der­ing around with­out any steady improve­ment or sharp details (ex­cept per­haps the faces which are inher­ited from the pre­vi­ous mod­el). I sus­pect that the learn­ing rate is still too high and, espe­cially with only 1 GPU/n = 4, such small mini­batches don’t cover enough modes to enable steady improve­ment. If so, the LR will need to be set much lower (or gra­di­ent accu­mu­la­tion used in order to fake hav­ing large mini­batches where large LRs are sta­ble) & train­ing time extended to mul­ti­ple months. Another pos­si­bil­ity would be to restart with added self­-at­ten­tion lay­ers, which I have noticed seem to par­tic­u­larly help with com­pli­cated details & sharp­ness; the style noise approach may be ade­quate for the job but just a few vanilla con­vo­lu­tion lay­ers may be too few (pace the BigGAN results on the ben­e­fits of increas­ing depth while decreas­ing para­me­ter coun­t).
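
Gradient accumulation, mentioned above as a way to fake large minibatches, relies on the fact that for a mean loss the average of the micro-batch gradients equals the gradient of the combined batch. A small NumPy illustration on a toy linear regression rather than a GAN; all names here are made up for the example.

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of mean((x @ w - y)**2) with respect to w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)

rng = np.random.RandomState(0)
x = rng.randn(32, 5)
y = rng.randn(32)
w = rng.randn(5)

# One large minibatch of 32 examples.
full_grad = mse_gradient(w, x, y)

# Eight micro-batches of 4 (as in the 1-GPU case), averaged before a single update.
micro_batches = 8
accumulated = np.zeros_like(w)
for xb, yb in zip(np.split(x, micro_batches), np.split(y, micro_batches)):
    accumulated += mse_gradient(w, xb, yb) / micro_batches
```

The equivalence is exact only when each example's loss is independent of the others in its batch; any layer that computes statistics across the minibatch would see the small micro-batch rather than the large one, so for a GAN this should be treated as an approximation.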

FFHQ Variations

Anime Faces → FFHQ Faces

If StyleGAN can smoothly warp anime faces among each other and express global transforms like hair length+color with 𝜓, could 𝜓 be a quick way to gain control over a single large-scale variable? For example, male vs female faces, or… anime ↔︎ real faces? (Given a particular image/latent vector, one would simply flip the sign to convert it to the opposite; this would give the opposite version of each random face, and if one had an encoder, one could automatically anime-fy or real-fy an arbitrary face by encoding it into the latent vector which creates it, and then flipping.35)
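
The sign flip can be made concrete with the truncation trick, which rescales a latent's offset from the average latent: w' = w_avg + 𝜓·(w − w_avg). A minimal NumPy sketch; the function and variable names are illustrative, and the 512-dimensional latent is an assumption.

```python
import numpy as np

def truncate(w, w_avg, psi):
    """Truncation trick: scale the offset of w from the average latent by psi."""
    return w_avg + psi * (w - w_avg)

rng = np.random.RandomState(0)
w_avg = rng.randn(512)   # stand-in for the generator's running average latent
w = rng.randn(512)       # the latent of one particular face

same = truncate(w, w_avg, 1.0)       # psi = 1 leaves the face unchanged
average = truncate(w, w_avg, 0.0)    # psi = 0 collapses to the average face
opposite = truncate(w, w_avg, -1.0)  # psi = -1 reflects the face through the average
```

If the anime and real faces did end up on opposite sides of the average, `opposite` would be the converted face; the experiments below test whether the latent space actually organizes itself that way.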

Since Karras et al 2018 provide a nice FFHQ download script (albeit slower than I’d like once Google Drive rate-limits you a wallclock hour into the full download) for the full-resolution PNGs, it would be easy to downscale to 512px and create a 512px FFHQ dataset to train on, or even create a combined anime+FFHQ dataset.

The first and fastest thing was to do trans­fer learn­ing from the anime faces to FFHQ real faces. It was unlikely that the model would retain much anime knowl­edge & be able to do mor­ph­ing, but it was worth a try.

The ini­tial results early in train­ing are hilar­i­ous and look like zom­bies:

Ran­dom train­ing sam­ples of anime face→FFHQ-only StyleGAN trans­fer learn­ing, show­ing bizarrely-arte­fac­tual inter­me­di­ate faces
Inter­po­la­tion video (4×4) of the FFHQ face model ini­tial­ized from the anime face StyleGAN, a few ticks into train­ing, show­ing bizarre arti­facts

After 97 ticks, the model has con­verged to a bor­ingly nor­mal appear­ance, with the only hint of its ori­gins being per­haps some exces­sive­ly-fab­u­lous hair in the train­ing sam­ples:

Anime faces→FFHQ-only StyleGAN train­ing sam­ples after much con­ver­gence, show­ing ani­me-ness largely washed out

Anime Faces → Anime Faces + FFHQ Faces

So, that was a bust. The next step is to try train­ing on anime & FFHQ faces simul­ta­ne­ous­ly; given the stark differ­ence between the datasets, would pos­i­tive vs neg­a­tive 𝜓 wind up split­ting into real vs anime and pro­vide a cheap & easy way of con­vert­ing arbi­trary faces?

This sim­ply merged the 512px FFHQ faces with the 512px anime faces and resumed train­ing from the pre­vi­ous FFHQ model (I rea­soned that some of the ani­me-ness should still be in the mod­el, so it would be slightly faster than restart­ing from the orig­i­nal anime face mod­el). I trained it for 812 iter­a­tions, #11,359–12,171 (some­what over 2 GPU-days), at which point it was mostly done.

It did man­age to learn both kinds of faces quite well, sep­a­rat­ing them clearly in ran­dom sam­ples:

Ran­dom train­ing sam­ples, anime+FFHQ StyleGAN

How­ev­er, the style trans­fer & 𝜓 sam­ples were dis­ap­point­ments. The style mix­ing shows lim­ited abil­ity to mod­ify faces cross-do­main or con­vert them, and the trun­ca­tion trick chart shows no clear dis­en­tan­gle­ment of the desired fac­tor (in­deed, the var­i­ous halves of 𝜓 cor­re­spond to noth­ing clear):

Style mix­ing results for the anime+FFHQ StyleGAN
Trun­ca­tion trick results for the anime+FFHQ StyleGAN

The inter­po­la­tion video does show that it learned to inter­po­late slightly between real & anime faces, giv­ing half-anime/half-real faces, but it looks like it only hap­pens some­times—­mostly with young female faces36:

Inter­po��la­tion video (4×4) of the FFHQ+anime face mod­el, after con­ver­gence.

They’re hard to spot in the inter­po­la­tion video because the tran­si­tion hap­pens abrupt­ly, so I gen­er­ated sam­ples & selected some of the more inter­est­ing ani­me-ish faces:

Selected sam­ples from the anime+FFHQ StyleGAN, show­ing curi­ous ‘inter­me­di­ate’ faces (4×4 grid)

Sim­i­lar­ly, Alexan­der Reben trained a StyleGAN on FFHQ+Western por­trait illus­tra­tions, and the inter­po­la­tion video is much smoother & more mixed, sug­gest­ing that more real­is­tic & more var­ied illus­tra­tions are eas­ier for StyleGAN to inter­po­late between.

Anime Faces + FFHQ → Danbooru2018

While I did­n’t have the com­pute to prop­erly train a Dan­booru2018 StyleGAN, after nshep­perd’s results, I was curi­ous and spent some time (817 iter­a­tions, so ~2 GPU-days?) retrain­ing the anime face+FFHQ model on Dan­booru2018 SFW 512px images.

The train­ing mon­tage is inter­est­ing for show­ing how faces get repur­posed into fig­ures:

Train­ing mon­tage video of a Dan­booru2018 StyleGAN ini­tial­ized on an anime faces+FFHQ StyleGAN.

One might think that it is a bridge too far for trans­fer learn­ing, but it seems not.

Reversing StyleGAN To Control & Modify Images

Mod­i­fy­ing images is harder than gen­er­at­ing them. An uncon­di­tional GAN archi­tec­ture is, by default, ‘one-way’: the latent vec­tor z gets gen­er­ated from a bunch of vari­ables, fed through the GAN, and out pops an image. There is no way to run the uncon­di­tional GAN ‘back­wards’ to feed in an image and pop out the z instead.37

If one could, one could take an arbitrary image and encode it into the z and by jittering z, generate many new versions of it; or one could feed it back into StyleGAN and play with the style noises at various levels in order to transform the image; or do things like ‘average’ two images or create interpolations between two arbitrary faces; or one could (assuming one knew what each variable in z ‘means’) edit the image to change things like which direction their head tilts or whether they are smiling.

There are some attempts at learn­ing con­trol in an unsu­per­vised fash­ion (eg , GANSpace) but while excel­lent start­ing points, they have lim­its and may not find a spe­cific con­trol that one wants.

The most straight­for­ward way would be to switch to a con­di­tional GAN archi­tec­ture based on a text or tag embed­ding. Then to gen­er­ate a spe­cific char­ac­ter wear­ing glass­es, one sim­ply says as much as the con­di­tional input: "character glasses". Or if they should be smil­ing, add "smile". And so on. This would cre­ate images of said char­ac­ter with the desired mod­i­fi­ca­tions. This option is not avail­able at the moment as cre­at­ing a tag embed­ding & train­ing StyleGAN requires quite a bit of mod­i­fi­ca­tion. It also is not a com­plete solu­tion as it would­n’t work for the cases of edit­ing an exist­ing image.

For an uncon­di­tional GAN, there are two com­ple­men­tary approaches to invert­ing the G:

  1. what one NN can learn to decode, another can learn to encode (eg , ):

    If StyleGAN has learned z→im­age, then train a sec­ond encoder NN on the super­vised learn­ing prob­lem of image→z! The sam­ple size is infi­nite (just keep run­ning G) and the map­ping is fixed (given a fixed G), so it’s ugly but not that hard.

  2. back­prop­a­gate a pixel or fea­ture-level loss to ‘opti­mize’ a latent code (eg ):

    While StyleGAN is not inherently reversible, it’s not a blackbox as, being a NN trained by backpropagation, it must admit of gradients. In training neural networks, there are 3 components: inputs, model parameters, and outputs/losses, and thus there are 3 ways to use backpropagation, even if we usually only use 1. One can hold the inputs fixed, and vary the model parameters in order to change the outputs so as to reduce a loss, which is training a NN; one can hold the inputs fixed and vary the outputs in order to change (often increase) internal parameters such as layers, which corresponds to neural network visualizations & exploration; and finally, one can hold the parameters & outputs fixed, and use the gradients to iteratively find a set of inputs which creates a specific output with a low loss (eg optimize a wheel-shape input for rolling-efficiency output).38

    This can be used to cre­ate images which are ‘opti­mized’ in some sense. For exam­ple, uses acti­va­tion max­i­miza­tion, demon­strat­ing how images of Ima­geNet classes can be pulled out of a stan­dard CNN clas­si­fier by back­prop over the clas­si­fier to max­i­mize a par­tic­u­lar out­put class; more amus­ing­ly, in , the gra­di­ent ascent39 on the indi­vid­ual pix­els of an image is done to minimize/maximize a NSFW clas­si­fier’s pre­dic­tion. This can also be done on a higher level by try­ing to max­i­mize sim­i­lar­ity to a NN embed­ding of an image to make it as ‘sim­i­lar’ as pos­si­ble, as was done orig­i­nally in Gatys et al 2014 for style trans­fer, or for more com­pli­cated kinds of style trans­fer like in “Differ­en­tiable Image Para­me­ter­i­za­tions: A pow­er­ful, under­-ex­plored tool for neural net­work visu­al­iza­tions and art”.

    In this case, given an arbi­trary desired image’s z, one can ini­tial­ize a ran­dom z, run it for­ward through the GAN to get an image, com­pare it at the pixel level with the desired (fixed) image, and the total differ­ence is the ‘loss’; hold­ing the GAN fixed, the back­prop­a­ga­tion goes back through the model and adjusts the inputs (the unfixed z) to make it slightly more like the desired image. Done many times, the final z will now yield some­thing like the desired image, and that can be treated as its true z. Com­par­ing at the pix­el-level can be improved by instead look­ing at the higher lay­ers in a NN trained to do clas­si­fi­ca­tion (often an Ima­geNet VGG), which will focus more on the seman­tic sim­i­lar­ity (more of a “per­cep­tual loss”) rather than mis­lead­ing details of sta­tic & indi­vid­ual pix­els. The latent code can be the orig­i­nal z, or z after it’s passed through the stack of 8 FC lay­ers and has been trans­formed, or it can even be the var­i­ous per-layer style noises inside the CNN part of StyleGAN; the last is what style-image-prior uses & 40 argue that it works bet­ter to tar­get the lay­er-wise encod­ings than the orig­i­nal z.

    This may not work too well as the local optima might be bad or the GAN may have trou­ble gen­er­at­ing pre­cisely the desired image no mat­ter how care­fully it is opti­mized, the pix­el-level loss may not be a good loss to use, and the whole process may be quite slow, espe­cially if one runs it many times with many differ­ent ini­tial ran­dom z to try to avoid bad local opti­ma. But it does mostly work.

  3. Encode+Back­prop­a­gate is a use­ful hybrid strat­e­gy: the encoder makes its best guess at the z, which will usu­ally be close to the true z, and then back­prop­a­ga­tion is done for a few iter­a­tions to fine­tune the z. This can be much faster (one for­ward pass vs many for­ward+back­ward pass­es) and much less prone to get­ting stuck in bad local optima (since it starts at a good ini­tial z thanks to the encoder).

    Comparison with editing in flow-based models

    On a tangent, editing/reversing is one of the great advantages41 of ‘flow’-based NN models such as Glow, which is one of the families of NN models competitive with GANs for high-quality image generation (along with autoregressive pixel prediction models like PixelRNN, and VAEs). Flow models have the same shape as GANs in pushing a random latent vector z through a series of upscaling convolution or other layers to produce final pixel values, but flow models use a carefully-limited set of primitives which make the model runnable both forwards and backwards exactly. This means every set of pixels corresponds to a unique z and vice-versa, and so an arbitrary set of pixels can be put in and the model run backwards to yield the exact corresponding z. There is no need to fight with the model to create an encoder to reverse it or use backpropagation optimization to try to find something almost right, as the flow model can already do this. This makes editing easy: plug the image in, get out the exact z with the equivalent of a single forward pass, figure out which part of z controls a desired attribute like ‘glasses’, change that, and run it forward. The downside of flow models, which is why I do not (yet) use them, is that the restriction to reversible layers means that they are typically much larger and slower to train than a more-or-less perceptually equivalent GAN model, by easily an order of magnitude (for Glow). When I tried Glow, I could barely run an interesting model despite aggressive memory-saving techniques, and I didn’t get anywhere interesting with the several GPU-days I spent (which was unsurprising when I realized how many GPU-months OA had spent). Since high-quality photorealistic GANs are at the limit of 2019 trainability for most researchers or hobbyists, flow models are clearly out of the question despite their many practical & theoretical advantages: they’re just too expensive! However, there is no known reason flow models couldn’t be competitive with GANs (they will probably always be larger, but because they are more correct & do more), and future improvements or hardware scaling may make them more viable, so flow-based models are an approach to keep an eye on.
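The backpropagation approach (and the encode+backpropagate hybrid) can be sketched on a toy problem. This is only an illustration, assuming a stand-in linear 'generator' G(z) = A·z with a hand-written gradient in place of StyleGAN and autodiff; the matrix `A`, the function names, the learning rate, and the 'encoder guess' are all invented for the example. The target image and G are held fixed, and only z is updated to shrink the pixel-level loss.

```python
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy fixed generator: 2-dim z -> 3 "pixels"


def generate(z):
    """Forward pass of the toy generator."""
    return [sum(a * zi for a, zi in zip(row, z)) for row in A]


def pixel_loss(z, target):
    """Sum of squared pixel differences between G(z) and the target image."""
    return sum((p - t) ** 2 for p, t in zip(generate(z), target))


def invert(target, z_init, lr=0.1, steps=200):
    """Gradient descent on z only; the generator's weights are never touched."""
    z = list(z_init)
    for _ in range(steps):
        residual = [p - t for p, t in zip(generate(z), target)]
        # d(loss)/dz_j = 2 * sum_i A[i][j] * residual[i]
        grad = [2.0 * sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


z_true = [0.7, -1.2]
target = generate(z_true)                          # the image we want to encode
z_from_scratch = invert(target, [0.0, 0.0])        # approach 2: arbitrary starting z
z_refined = invert(target, [0.6, -1.0], steps=20)  # approach 3: start from an encoder-style guess
print(z_from_scratch, pixel_loss(z_from_scratch, target))
```

The toy loss is convex, so any starting point works here; with a real GAN the loss surface has bad local optima, which is where a good initial guess from an encoder and a perceptual (feature-level) loss would matter.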

One of those 3 approaches will encode an image into a latent z. So far so good, that enables things like gen­er­at­ing ran­dom­ly-d­iffer­ent ver­sions of a spe­cific image or inter­po­lat­ing between 2 images, but how does one con­trol the z in a more intel­li­gent fash­ion to make spe­cific edits?

If one knew what each vari­able in the z meant, one could sim­ply slide them in the −1/+1 range, change the z, and gen­er­ate the cor­re­spond­ing edited image. But there are 512 vari­ables in z (for StyleGAN), which is a lot to exam­ine man­u­al­ly, and their mean­ing is opaque as StyleGAN does­n’t nec­es­sar­ily map each vari­able onto a human-rec­og­niz­able fac­tor like ‘smil­ing’. A rec­og­niz­able fac­tor like ‘eye­glasses’ might even be gov­erned by mul­ti­ple vari­ables simul­ta­ne­ously which are non­lin­early inter­act­ing.

As always, the solu­tion to one mod­el’s prob­lems is yet more mod­els; to con­trol the z, like with the encoder, we can sim­ply train yet another model (per­haps just a lin­ear clas­si­fier or ran­dom forests this time) to take the z of many images which are all labeled ‘smil­ing’ or ‘not smil­ing’, and learn what parts of z cause ‘smil­ing’ (eg ). These addi­tional mod­els can then be used to con­trol a z. The nec­es­sary labels (a few hun­dred to a few thou­sand will be ade­quate since the z is only 512 vari­ables) can be obtained by hand or by using a pre-ex­ist­ing clas­si­fi­er.
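As a rough illustration of the 'yet another model' step, here is a minimal sketch that fits a logistic-regression classifier on labeled latents and then uses its weight vector as an edit direction. Everything in it is an assumption made for the example: the latents are 4-dimensional random toys rather than 512-dimensional StyleGAN codes, the 'smiling' label is synthetically tied to one coordinate, and `train_logistic` and `edit` are hypothetical helper names.

```python
import math
import random


def train_logistic(zs, labels, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent; returns (weights, bias)."""
    dim, n = len(zs[0]), len(zs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for z, y in zip(zs, labels):
            logit = sum(wi * zi for wi, zi in zip(w, z)) + b
            err = 1.0 / (1.0 + math.exp(-logit)) - y  # predicted probability minus label
            for i in range(dim):
                gw[i] += err * z[i] / n
            gb += err / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b


def edit(z, w, strength):
    """Shift latent `z` along the unit-normalized classifier direction."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [zi + strength * wi / norm for zi, wi in zip(z, w)]


rng = random.Random(0)
zs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(400)]
labels = [1 if z[2] > 0 else 0 for z in zs]  # synthetic 'smiling' attribute lives in coordinate 2
w, b = train_logistic(zs, labels)
edited = edit([0.0, 0.0, 0.0, 0.0], w, strength=2.0)
print(w, edited)
```

The learned weights concentrate on the coordinate that actually drives the label, so pushing a latent along `w` changes that attribute while leaving the others mostly alone; a purely linear direction like this would not capture the nonlinear interactions mentioned above.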

So, the pieces of the puz­zle & putting it all togeth­er:

The final result is inter­ac­tive edit­ing of anime faces along many differ­ent fac­tors:

snowy halcy (MP4) demon­strat­ing inter­ac­tive edit­ing of StyleGAN anime faces using anime-face-StyleGAN+DeepDanbooru+StyleGAN-encoder+TL-GAN

Editing Rare Attributes

A strat­egy of hand-edit­ing or using a tag­ger to clas­sify attrib­utes works for com­mon ones which will be well-rep­re­sented in a sam­ple of a few thou­sand since the clas­si­fier needs a few hun­dred cases to work with, but what about rarer attrib­utes which might appear only on one in a thou­sand ran­dom sam­ples, or attrib­utes too rare in the dataset for StyleGAN to have learned, or attrib­utes which may not be in the dataset at all? Edit­ing “red eyes” should be easy, but what about some­thing like “bunny ears”? It would be amus­ing to be able to edit por­traits to add bunny ears, but there aren’t that many bunny ear sam­ples (although cat ears might be much more com­mon); is one doomed to gen­er­ate & clas­sify hun­dreds of thou­sands of sam­ples to enable bunny ear edit­ing? That would be infea­si­ble for hand label­ing, and diffi­cult even with a tag­ger.

One sug­ges­tion I have for this use-case would be to briefly train another StyleGAN model on an enriched or boosted dataset, like a dataset of 50:50 bunny ear images & nor­mal images. If one can obtain a few thou­sand bunny ear images, then this is ade­quate for trans­fer learn­ing (com­bined with a few thou­sand ran­dom nor­mal images from the orig­i­nal dataset), and one can retrain the StyleGAN on an equal bal­ance of images. The high pres­ence of bunny ears will ensure that the StyleGAN quickly learns all about those, while the nor­mal images pre­vent it from over­fit­ting or cat­a­strophic for­get­ting of the full range of images.

This new bun­ny-ear StyleGAN will then pro­duce bun­ny-ear sam­ples half the time, cir­cum­vent­ing the rare base rate issue (or fail­ure to learn, or nonex­is­tence in dataset), and enabling effi­cient train­ing of a clas­si­fi­er. And since nor­mal faces were used to pre­serve its gen­eral face knowl­edge despite the trans­fer learn­ing poten­tially degrad­ing it, it will remain able to encode & opti­mize nor­mal faces. (The orig­i­nal clas­si­fiers may even be reusable on this, depend­ing on how extreme the new attribute is, as the latent space z might not be too affected by the new attribute and the var­i­ous other attrib­utes approx­i­mately main­tain the orig­i­nal rela­tion­ship with z as before the retrain­ing.)
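Assembling the enriched dataset is mostly bookkeeping. A minimal sketch, assuming one already has two lists of image paths (the directory names, counts, and the function name `build_boosted_dataset` are hypothetical):

```python
import random


def build_boosted_dataset(rare_paths, normal_paths, seed=0):
    """Return a shuffled 50:50 mix of rare-attribute images and randomly sampled normal ones."""
    rng = random.Random(seed)
    n = min(len(rare_paths), len(normal_paths))
    mixed = rng.sample(rare_paths, n) + rng.sample(normal_paths, n)
    rng.shuffle(mixed)
    return mixed


# Hypothetical file lists; in practice these would come from a tag search and the original dataset.
rare = ["bunny_ears/%04d.png" % i for i in range(3000)]
normal = ["portraits/%06d.png" % i for i in range(300000)]
dataset = build_boosted_dataset(rare, normal)
print(len(dataset), sum(p.startswith("bunny_ears/") for p in dataset))
```

The resulting list would then be converted into the training format as usual before resuming training from the existing model.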

StyleGAN 2

StyleGAN 2 (source, video) eliminates blob artifacts, adds a native encoding ‘projection’ feature for editing, simplifies the runtime by scrapping progressive growing in favor of an MSG-GAN-like multi-scale architecture, & has higher overall quality, but similar total training time/requirements.42

I used a 512px anime portrait S2 model trained by Aaron Gokaslan to create TWDNEv3:

100 ran­dom sam­ple images from the StyleGAN 2 anime por­trait faces in TWDNEv3, arranged in a 10×10 grid.

Train­ing sam­ples:

Iter­a­tion #24,303 of Gokaslan’s train­ing of an anime por­trait StyleGAN 2 model (train­ing sam­ples)

The model was trained to iteration #24,664 for >2 weeks on 4 Nvidia 2080ti GPUs at 35–70s per 1k images. The TensorFlow S2 model is available for download (320MB).43 (PyTorch & ONNX versions have been made by Anton using a custom repo. Note that both my face & portrait models can be run via the GenForce PyTorch repo as well.) This model can be used in Google Colab (demonstration notebook, although it seems it may pull in an older S2 model) & the model can also be used with the S2 codebase for encoding anime faces.

Running S2

Because of the optimizations, which require custom local compilation of CUDA code for maximum efficiency, getting S2 running can be more challenging than getting S1 running.

  • No Ten­sor­Flow 2 com­pat­i­bil­i­ty: the TF ver­sion must be 1.14/1.15. Try­ing to run with TF 2 will give errors like: TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'.

    I ran into cuDNN compatibility problems with TF 1.15 (which requires cuDNN ≥7.6.0, 2019-05-20, for CUDA 10.0), which gave errors like this:

    ...[2020-01-11 23:10:35.234784: E tensorflow/stream_executor/cuda/] Loaded runtime CuDNN library:
       7.4.2 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher
       minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.
       If building from sources, make sure the library loaded at runtime is compatible with the version specified
       during compile configuration...

    But then with 1.14, the tpu-estimator library was not found! (I ulti­mately took the risk of upgrad­ing my instal­la­tion with libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb, and thank­ful­ly, that worked and did not seem to break any­thing else.)

  • Get­ting the entire pipeline to com­pile the cus­tom ops in a Conda envi­ron­ment was annoy­ing so Gokaslan tweaked it to use 1.14 on Lin­ux, used cudatoolkit-dev from Conda Forge, and changed the build script to use gcc-7 (since gcc-8 was unsup­port­ed)

  • one issue with Ten­sor­Flow 1.14 is you need to force allow_growth or it will error out on Nvidia 2080tis

  • con­fig name change: has been renamed (again) to

  • buggy learn­ing rates: S2 (but not S1) acci­den­tally uses the same LR for both G & D; either fix this or keep it in mind when doing LR tun­ing—changes to D_lrate do noth­ing!

  • n = 1 minibatch problems: S2 is not a large NN so it can be trained on low-end GPUs; however, the S2 code makes an unnecessary assumption that n≥2; to fix this in training/ (fixed in Shawn Presser’s TPU/self-attention oriented fork):

    @@ -157,9 +157,8 @@ def G_logistic_ns_pathreg(G, D, opt, training_set, minibatch_size, pl_minibatch_
         with tf.name_scope('PathReg'):
             # Evaluate the regularization term using a smaller minibatch to conserve memory.
    -       if pl_minibatch_shrink > 1 and minibatch_size > 1:
    -           assert minibatch_size % pl_minibatch_shrink == 0
    -           pl_minibatch = minibatch_size // pl_minibatch_shrink
    +       if pl_minibatch_shrink > 1:
    +           pl_minibatch = tf.maximum(1, minibatch_size // pl_minibatch_shrink)
                pl_latents = tf.random_normal([pl_minibatch] + G.input_shapes[0][1:])
                pl_labels = training_set.get_random_labels_tf(pl_minibatch)
                fake_images_out, fake_dlatents_out = G.get_output_for(pl_latents, pl_labels, is_training=True, return_dlatents=True)

  • S2 has some sort of mem­ory leak, pos­si­bly related to the FID eval­u­a­tions, requir­ing reg­u­lar restarts, like putting it into a loop
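The restart loop for the memory leak can be as simple as the following sketch. It assumes that the leak eventually makes the training process exit with a non-zero status and that the training script resumes from its latest snapshot when re-launched; the function name `run_with_restarts` and the command line in the usage comment are placeholders, not real S2 options.

```python
import subprocess
import time


def run_with_restarts(cmd, max_restarts=1000, delay_s=30.0,
                      run=subprocess.call, sleep=time.sleep):
    """Re-launch `cmd` whenever it exits with a non-zero status; return the number of launches."""
    launches = 0
    while True:
        status = run(cmd)
        launches += 1
        if status == 0:
            return launches          # training finished normally
        if launches > max_restarts:
            raise RuntimeError("giving up after %d launches" % launches)
        sleep(delay_s)               # brief pause before restarting


# Hypothetical usage (the script name is a placeholder for one's own training entry point):
# run_with_restarts(["python", "my_training_script.py"])
```

The `run` and `sleep` parameters exist only so the loop can be exercised without launching a real process.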

Once S2 was run­ning, Gokaslan trained the S2 por­trait model with gen­er­ally default hyper­pa­ra­me­ters.

Future Work

Some open ques­tions about StyleGAN’s archi­tec­ture & train­ing dynam­ics:

  • is pro­gres­sive grow­ing still nec­es­sary with StyleGAN? (StyleGAN 2 implies that it is not, as it uses a MSG-GAN-like approach)
  • are 8×512 FC lay­ers nec­es­sary? (Pre­lim­i­nary BigGAN work sug­gests that they are not nec­es­sary for BigGAN.)
  • what are the wrinkly-line/cracks noise artifacts which appear at the end of training?
  • how does StyleGAN com­pare to BigGAN in final qual­i­ty?

Fur­ther pos­si­ble work:

  • explo­ration of “cur­ricu­lum learn­ing”: can train­ing be sped up by train­ing to con­ver­gence on small n and then peri­od­i­cally expand­ing the dataset?

  • boot­strap­ping image gen­er­a­tion by start­ing with a seed cor­pus, gen­er­at­ing many ran­dom sam­ples, select­ing the best by hand, and retrain­ing; eg expand a cor­pus of a spe­cific char­ac­ter, or explore ‘hybrid’ cor­puses which mix A/B images & one then selects for images which look most A+B-ish

  • improved trans­fer learn­ing scripts to edit trained mod­els so 512px pre­trained mod­els can be pro­moted to work with 1024px images and vice versa

  • bet­ter Dan­booru tag­ger CNN for pro­vid­ing clas­si­fi­ca­tion embed­dings for var­i­ous pur­pos­es, par­tic­u­larly FID loss mon­i­tor­ing, mini­batch discrimination/auxiliary loss, and style trans­fer for cre­at­ing a ‘StyleDan­booru’

    • with a StyleDan­booru, I am curi­ous if that can be used as a par­tic­u­larly Pow­er­ful Form Of Data Aug­men­ta­tion for small n char­ac­ter datasets, and whether it leads to a rever­sal of train­ing dynam­ics with edges com­ing before colors/textures—it’s pos­si­ble that a StyleDan­booru could make many GAN archi­tec­tures, not just StyleGAN, sta­ble to train on anime/illustration datasets
  • bor­row­ing archi­tec­tural enhance­ments from BigGAN: self­-at­ten­tion lay­ers, spec­tral norm reg­u­lar­iza­tion, large-mini­batch train­ing, and a rec­ti­fied Gauss­ian dis­tri­b­u­tion for the latent vec­tor z

  • tex­t→im­age con­di­tional GAN archi­tec­ture (à la StackGAN):

    This would take the text tag descrip­tions of each image com­piled by Dan­booru users and use those as inputs to StyleGAN, which, should it work, would mean you could cre­ate arbi­trary anime images sim­ply by typ­ing in a string like 1_boy samurai facing_viewer red_hair clouds sword armor blood etc.

    This should also, by pro­vid­ing rich seman­tic descrip­tions of each image, make train­ing faster & sta­bler and con­verge to higher final qual­i­ty.

  • meta-learn­ing for few-shot face or char­ac­ter or artist imi­ta­tion (eg Set-CGAN or or per­haps , or —the last of which achieves few-shot learn­ing with sam­ples of n = 25 TWDNE StyleGAN anime faces)

ImageNet StyleGAN

As part of exper­i­ments in scal­ing up StyleGAN 2, using , we ran StyleGAN on large-s­cale datasets includ­ing Dan­booru2019, Ima­geNet, and sub­sets of the . Despite run­ning for mil­lions of images, no S2 run ever achieved remotely the real­ism of S2 on FFHQ or BigGAN on Ima­geNet: while the tex­tures could be sur­pris­ingly good, the seman­tic global struc­ture never came togeth­er, with glar­ing flaws—there would be too many heads, or they would be detached from bod­ies, etc.

Aaron Gokaslan took the time to com­pute the FID on Ima­geNet, esti­mat­ing a ter­ri­ble score of FID ~120. (High­er=­worse; for com­par­ison, BigGAN with can be as good as FID ~7, and reg­u­lar BigGAN typ­i­cally sur­passes FID 120 within a few thou­sand iter­a­tions.) Even exper­i­ments in increas­ing the S2 model size up to ~1GB (by increas­ing the fea­ture map mul­ti­pli­er) improved qual­ity rel­a­tively mod­est­ly, and showed no signs of ever approach­ing BigGAN-level qual­i­ty. We con­cluded that StyleGAN is in fact fun­da­men­tally lim­ited as a GAN, , and switched over to BigGAN work.
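For reference, the standard definition of FID shows why higher is worse: it is a distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols are notation introduced here: μ_r, Σ_r are the feature mean and covariance of real images, and μ_g, Σ_g those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A perfect generator would match both moments and score 0.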

For those inter­est­ed, we pro­vide our 512px Ima­geNet S2 (step 1,394,688):

rsync --verbose rsync:// ./
Shawn Presser, S2 Ima­geNet inter­po­la­tion video from part­way through train­ing (~45 hours on a TPUv3-512, 3k images/s)

Danbooru2019+e621 256px BigGAN

As part of testing our modifications to compare_gan, including sampling from multiple datasets to increase n, using flood loss to stabilize it, and adding an additional (crude, limited) kind of self-supervised loss to the D, we trained several 256px BigGANs, initially on Danbooru2019 SFW but then adding in the TWDNE portraits & e621/e621-portraits partway through training. This destabilized the models greatly, but the flood loss appears to have stopped divergence and they gradually recovered. Run #39 did somewhat better than run #40; the self-supervised variants never recovered. This indicated to us that our self-supervised loss needed heavy revision (as indeed it did), and that flood loss was more valuable than expected, and we investigated it further; the important part appears (for GANs, anyway) to be the stop-loss part, halting training of G/D when it gets ‘too good’. Freezing models is an old GAN trick which is mostly not used post-WGAN, but appears useful for BigGAN, perhaps because of the spiky loss curve, especially early in training.
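A minimal sketch of the two ideas in play, assuming the usual 'flooding' formulation |loss − b| + b and a simple threshold rule for the stop-loss variant. How our compare_gan modifications implement these is not shown here, so the function names and the use of the `d_flood` value as a threshold are illustrative assumptions.

```python
def flooded(loss, flood_level):
    """Flooding: below the flood level the objective is mirrored, so its gradient sign flips."""
    return abs(loss - flood_level) + flood_level


def should_update(loss, flood_level):
    """Stop-loss variant: skip the update for a network whose loss is already 'too good'."""
    return loss > flood_level


d_flood = 0.2  # same value as options.d_flood in the config below
for d_loss in (0.9, 0.25, 0.1):
    print(d_loss, flooded(d_loss, d_flood), should_update(d_loss, d_flood))
```

Above the flood level both variants behave like ordinary training; they differ only once the loss dips below it, where flooding actively pushes the loss back up while the stop-loss rule merely pauses that network.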

We ran it for 607,250 iter­a­tions on a TPUv3-256 pod until 2020-05-15. Con­fig:

{"": "images_256", "resnet_biggan.Discriminator.blocks_with_attention": "B2",
"": 96, "resnet_biggan.Generator.blocks_with_attention": "B5",
"": 96, "resnet_biggan.Generator.plain_tanh": false, "ModularGAN.d_lr": 0.0005,
"ModularGAN.d_lr_mul": 3.0, "ModularGAN.ema_start_step": 4000, "ModularGAN.g_lr": 6.66e-05,
"ModularGAN.g_lr_mul": 1.0, "options.batch_size": 2048, "options.d_flood": 0.2,
"options.datasets": "gs://XYZ-euw4a/datasets/danbooru2019-s/danbooru2019-s-0*,gs://XYZ-euw4a/datasets/e621-s/e621-s-0*,
"options.g_flood": 0.05, "options.labels": "", "options.random_labels": true, "options.z_dim": 140,
"run_config.experimental_host_call_every_n_steps": 50, "run_config.keep_checkpoint_every_n_hours": 0.5,
"standardize_batch.use_cross_replica_mean": true, "TpuSummaries.save_image_steps": 50, "TpuSummaries.save_summary_steps": 1}
90 ran­dom EMA sam­ples (un­trun­cat­ed) from the 256px BigGAN trained on Danbooru2019/anime-portraits/e621/e621-portraits.
Inter­po­la­tion using High­CWu Pad­dlePad­dle Google Colab note­book

The model is avail­able for down­load:

rsync --verbose rsync:// ./

compare_gan con­fig:

$ cat bigrun39b/operative_config-603500.gin
# Parameters for AdamOptimizer:
# ==============================================================================
AdamOptimizer.beta1 = 0.0
AdamOptimizer.beta2 = 0.999
AdamOptimizer.epsilon = 1e-08
AdamOptimizer.use_locking = False

# Parameters for batch_norm:
# ==============================================================================
# None.

# Parameters for BigGanResNetBlock:
# ==============================================================================
BigGanResNetBlock.add_shortcut = True

# Parameters for conditional_batch_norm:
# ==============================================================================
conditional_batch_norm.use_bias = False

# Parameters for cross_replica_moments:
# ==============================================================================
cross_replica_moments.group_size = None
cross_replica_moments.parallel = True

# Parameters for D:
# ==============================================================================
D.batch_norm_fn = None
D.layer_norm = False
D.spectral_norm = True

# Parameters for dataset:
# ============================================================================== = 'images_256'
dataset.seed = 547

# Parameters for resnet_biggan.Discriminator:
# ==============================================================================
resnet_biggan.Discriminator.blocks_with_attention = 'B2' = 96
resnet_biggan.Discriminator.channel_multipliers = None
resnet_biggan.Discriminator.project_y = True

# Parameters for G:
# ==============================================================================
G.batch_norm_fn = @conditional_batch_norm
G.spectral_norm = True

# Parameters for resnet_biggan.Generator:
# ==============================================================================
resnet_biggan.Generator.blocks_with_attention = 'B5' = 96
resnet_biggan.Generator.channel_multipliers = None
resnet_biggan.Generator.embed_bias = False
resnet_biggan.Generator.embed_y = True
resnet_biggan.Generator.embed_y_dim = 128
resnet_biggan.Generator.embed_z = False
resnet_biggan.Generator.hierarchical_z = True
resnet_biggan.Generator.plain_tanh = False

# Parameters for hinge:
# ==============================================================================
# None.

# Parameters for loss:
# ==============================================================================
loss.fn = @hinge

# Parameters for ModularGAN:
# ==============================================================================
ModularGAN.conditional = True
ModularGAN.d_lr = 0.0005
ModularGAN.d_lr_mul = 3.0
ModularGAN.d_optimizer_fn = @tf.train.AdamOptimizer
ModularGAN.deprecated_split_disc_calls = False
ModularGAN.ema_decay = 0.9999
ModularGAN.ema_start_step = 4000
ModularGAN.experimental_force_graph_unroll = False
ModularGAN.experimental_joint_gen_for_disc = False
ModularGAN.fit_label_distribution = False
ModularGAN.g_lr = 6.66e-05
ModularGAN.g_lr_mul = 1.0
ModularGAN.g_optimizer_fn = @tf.train.AdamOptimizer
ModularGAN.g_use_ema = True

# Parameters for no_penalty:
# ==============================================================================
# None.

# Parameters for normal:
# ==============================================================================
normal.mean = 0.0
normal.seed = None

# Parameters for options:
# ==============================================================================
options.architecture = 'resnet_biggan_arch'
options.batch_size = 2048
options.d_flood = 0.2
options.datasets = \
options.description = \
    'Describe your GIN config. (This appears in the tensorboard text tab.)'
options.disc_iters = 2
options.discriminator_normalization = None
options.g_flood = 0.05
options.gan_class = @ModularGAN
options.image_grid_height = 3
options.image_grid_resolution = 1024
options.image_grid_width = 3
options.labels = ''
options.lamba = 1
options.model_dir = 'gs://darnbooru-euw4a/runs/bigrun39b/'
options.num_classes = 1000
options.random_labels = True
options.training_steps = 250000
options.transpose_input = False
options.z_dim = 140

# Parameters for penalty:
# ==============================================================================
penalty.fn = @no_penalty

# Parameters for replace_labels:
# ==============================================================================
replace_labels.file_pattern = None

# Parameters for run_config:
# ==============================================================================
run_config.experimental_host_call_every_n_steps = 50
run_config.iterations_per_loop = 250
run_config.keep_checkpoint_every_n_hours = 0.5
run_config.keep_checkpoint_max = 10
run_config.save_checkpoints_steps = 250
run_config.single_core = False
run_config.tf_random_seed = None

# Parameters for spectral_norm:
# ==============================================================================
spectral_norm.epsilon = 1e-12
spectral_norm.singular_value = 'auto'

# Parameters for standardize_batch:
# ==============================================================================
standardize_batch.decay = 0.9
standardize_batch.epsilon = 1e-05
standardize_batch.use_cross_replica_mean = True
standardize_batch.use_moving_averages = False

# Parameters for TpuSummaries:
# ==============================================================================
TpuSummaries.save_image_steps = 50
TpuSummaries.save_summary_steps = 1

# Parameters for train_imagenet_transform:
# ==============================================================================
train_imagenet_transform.crop_method = 'random'

# Parameters for weights:
# ==============================================================================
weights.initializer = 'orthogonal'

# Parameters for z:
# ==============================================================================
z.distribution_fn = @tf.random.normal
z.maxval = 1.0
z.minval = -1.0
z.stddev = 1.0


I explore BigGAN, another recent GAN with SOTA results on the most complex image domain tackled by GANs so far, ImageNet. BigGAN’s capabilities come at a steep compute cost, however. I experiment with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by the compute expense & practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.

The primary rival GAN to StyleGAN for large-scale image synthesis as of mid-2019 is BigGAN (Brock et al 2018; official BigGAN-PyTorch implementation & models).

BigGAN suc­cess­fully trains on up to 512px images from Ima­geNet, from all 1000 cat­e­gories (con­di­tioned on cat­e­go­ry), with near-pho­to­re­al­is­tic results on the best-rep­re­sented cat­e­gories (dogs), and appar­ently can even han­dle the far larger inter­nal Google JFT dataset. In con­trast, StyleGAN, while far less com­pu­ta­tion­ally demand­ing, shows poorer results on more com­plex cat­e­gories (Kar­ras et al 2018’s LSUN CATS StyleGAN; our whole-Dan­booru2018 pilots) and has not been demon­strated to scale to Ima­geNet, much less beyond.

BigGAN does this by com­bin­ing a few improve­ments on stan­dard DCGANs (most of which are not used in StyleGAN):

Brock et al 2018: BigGAN-deep archi­tec­ture (Fig­ure 16, Table 5)

The down­side is that, as the name indi­cates, BigGAN is both a big model and requires big com­pute (par­tic­u­lar­ly, big mini­batch­es)—­some­where around $20,000, we esti­mate, based on pub­lic TPU pric­ing.

This presents a dilemma: larger-scale portrait modeling or whole-anime image modeling may be beyond StyleGAN’s current capabilities; but while BigGAN may be able to handle those tasks, we can’t afford to train it!

Must it cost that much? Prob­a­bly not. In par­tic­u­lar, BigGAN’s use of a fixed large mini­batch through­out train­ing is prob­a­bly ineffi­cient: it is highly unlikely that the ben­e­fits of a n = 2048 mini­batch are nec­es­sary at the begin­ning of train­ing when the Gen­er­a­tor is gen­er­at­ing sta­tic which looks noth­ing at all like real data, and at the end of train­ing, that may still be too small a mini­batch (Brock et al 2018 note that the ben­e­fits of larger mini­batches had not sat­u­rated at n = 2048 but time/compute was not avail­able to test larger still mini­batch­es, which is con­sis­tent with the obser­va­tion that the harder & more RL-like a prob­lem, the larger the mini­batch it need­s). Typ­i­cal­ly, mini­batches and/or learn­ing rates are sched­uled: impre­cise gra­di­ents are accept­able early on, while as the model approaches per­fec­tion, more exact gra­di­ents are nec­es­sary. So it should be pos­si­ble to start out with mini­batches a tiny frac­tion of the size and grad­u­ally scale them up dur­ing train­ing, sav­ing an enor­mous amount of com­pute com­pared to BigGAN’s reported num­bers. The gra­di­ent noise scale could pos­si­bly be used to auto­mat­i­cally set the total mini­batch scale, although I did­n’t find any exam­ples of any­one using it in PyTorch this way. And using TPU pods pro­vides large amounts of VRAM, but is not nec­es­sar­ily the cheap­est form of com­pute.
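As a rough illustration of the scheduling idea above, here is a minimal Python sketch of a doubling minibatch schedule and the number of images it would process compared to a fixed n = 2048 minibatch. The function names, the starting size of 32, and the evenly spaced doubling milestones are all hypothetical choices for illustration, not something BigGAN or compare_gan implements; the sketch counts images only and says nothing about whether such a schedule would reach the same quality.

```python
def batch_size_at(step, total_steps, start=32, end=2048):
    """Hypothetical doubling schedule: begin at `start` and double at evenly
    spaced milestones so that the final phase runs at `end`.
    Assumes end/start is a power of two."""
    doublings = (end // start).bit_length() - 1  # 32 -> 2048 is 6 doublings
    phase = min(doublings, step * (doublings + 1) // total_steps)
    return start * 2 ** phase


def images_processed(total_steps, start=32, end=2048):
    """Total number of images seen over a run that follows the schedule."""
    return sum(batch_size_at(s, total_steps, start, end) for s in range(total_steps))


if __name__ == "__main__":
    steps = 7000  # toy step count: 7 phases of 1,000 steps each
    scheduled = images_processed(steps)
    fixed = steps * 2048  # same number of steps at a fixed minibatch of 2048
    print(scheduled, fixed, round(scheduled / fixed, 3))
```

Under these toy assumptions the scheduled run touches a bit under 30% of the images that the fixed-minibatch run does; in practice the early small-minibatch phases might need more steps or a different learning rate, so the real saving could be smaller.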

BigGAN Transfer Learning

Another opti­miza­tion is to exploit trans­fer learn­ing from the released mod­els, and reuse the enor­mous amount of com­pute invested in them. The prac­ti­cal details there are fid­dly. The orig­i­nal BigGAN 2018 release included the 128px/256px/512px Gen­er­a­tor Ten­sor­flow mod­els but not their Dis­crim­i­na­tors, nor a train­ing code­base; the compare_gan Ten­sor­flow code­base released in early 2019 includes an inde­pen­dent imple­men­ta­tion of BigGAN that can poten­tially train them, and I believe that the Gen­er­a­tor may still be usable for trans­fer learn­ing on its own and if not—­given the argu­ments that Dis­crim­i­na­tors sim­ply mem­o­rize data and do not learn much beyond that—the Dis­crim­i­na­tors can be trained from scratch by sim­ply freez­ing a G while train­ing its D on G out­puts for as long as nec­es­sary. The 2019 PyTorch release includes a differ­ent mod­el, a full 128px model with G/D (at 2 points in its train­ing), and code to con­vert the orig­i­nal Ten­sor­flow mod­els into PyTorch for­mat; the catch there is that the pre­trained model must be loaded into exactly the same archi­tec­ture, and while the PyTorch code­base defines the archi­tec­ture for 32/64/128/256px BigGANs, it does not (as of 2019-06-04) define the archi­tec­ture for a 512px BigGAN or BigGAN-deep (I tried but could­n’t get it quite right). It would also be pos­si­ble to do model surgery and pro­mote the 128px model to a 512px mod­el, since the two upscal­ing blocks (128px→256px and 256px→512px) should be easy to learn (sim­i­lar to my use of wai­fu2x to fake a 1024px StyleGAN anime face mod­el). Any­way, the upshot is that one can only use the 128px/256px pre­trained mod­els; the 512px will be pos­si­ble with a small update to the PyTorch code­base.
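The requirement that a pretrained model be loaded into exactly the same architecture can be checked before any weights are copied. The following is a minimal sketch, assuming one has already extracted a name-to-shape mapping from the checkpoint and from the freshly constructed model; `compare_shapes` and all parameter names and shapes in the example are made up for illustration and do not correspond to the real BigGAN layers.

```python
def compare_shapes(checkpoint_shapes, model_shapes):
    """Compare two {parameter_name: shape} dicts and report every way in
    which the checkpoint fails to match the model architecture."""
    ckpt, model = set(checkpoint_shapes), set(model_shapes)
    return {
        # parameters the model defines but the checkpoint lacks
        "missing": sorted(model - ckpt),
        # parameters in the checkpoint that the model has no slot for
        "unexpected": sorted(ckpt - model),
        # parameters present in both but with different tensor shapes
        "wrong_shape": sorted(
            name for name in ckpt & model
            if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
        ),
    }


if __name__ == "__main__":
    # Hypothetical shapes: a smaller pretrained G vs a larger target G
    # that has one extra upscaling block and a narrower output layer.
    pretrained = {"G.block0.weight": (16, 8), "G.output.weight": (3, 8)}
    target = {
        "G.block0.weight": (16, 8),
        "G.block1.weight": (8, 4),
        "G.output.weight": (3, 4),
    }
    print(compare_shapes(pretrained, target))
```

An empty report in all three categories is what a direct load needs; anything listed under "missing" is what model surgery would have to initialize from scratch.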

All in all, it is pos­si­ble that BigGAN with some tweaks could be afford­able to train. (At least, with some crowd­fund­ing…)

BigGAN: Danbooru2018-1K Experiments

To test the waters, I ran three BigGAN experiments:

  1. I first experimented with retraining the ImageNet 128px model.

    That resulted in almost total mode collapse when I re-enabled G after 2 days; investigating, I realized that I had misunderstood: it was a brand-new BigGAN model, trained independently, and came with its fully-trained D already. Oops.

  2. trans­fer learn­ing the 128px Ima­geNet PyTorch BigGAN model to the 1k anime por­traits; suc­cess­ful with ~6 GPU-days

  3. train­ing from scratch a 256px BigGAN-deep on the 1k por­traits;

    Partially successful after ~240 GPU-days: it reached comparable quality to StyleGAN before suffering serious mode collapse, possibly due to being forced to run with small minibatch sizes by BigGAN bugs

Danbooru2018-1K Dataset

Constructing D1k

Con­struct­ing a new Dan­booru-1k dataset: as BigGAN requires con­di­tion­ing infor­ma­tion, I con­structed new 512px whole-im­age & por­trait datasets by tak­ing the 1000 most pop­u­lar Dan­booru2018 char­ac­ters, with char­ac­ters as cat­e­gories, and cropped out por­traits as usu­al:

cat metadata/20180000000000* | fgrep -e '"name":"solo"' | fgrep -v '"rating":"e"' | \
    jq -c '.tags | .[] | select(.category == "4") | .name' | sort | uniq --count | \
    sort --numeric-sort > characters.txt
mkdir ./characters-1k/ ; cd ./characters-1k/
# Crop the faces of one character's solo, non-explicit images into its own folder.
cpCharacterFace () {
    CHARACTER="$1"
    # Replace punctuation with '.' to get a filesystem-safe folder name:
    CHARACTER_SAFE=$(echo "$CHARACTER" | tr '[:punct:]' '.')
    mkdir "$CHARACTER_SAFE"
    IDS=$(cat ../metadata/* | fgrep '"name":"'"$CHARACTER"\" | fgrep -e '"name":"solo"' \
          | fgrep -v '"rating":"e"' | jq .id | tr -d '"')
    for ID in $IDS; do
        # Originals sit in subfolders named by ID modulo 1000, zero-padded to 4 digits:
        BUCKET=$(printf "%04d" $(( $ID % 1000 )) );
        TARGET=$(ls ../original/$BUCKET/$ID.*)
        CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/ \
            ~/src/lbpcascade_animeface/lbpcascade_animeface.xml "$TARGET" "./$CHARACTER_SAFE/$ID"
    done
}
export -f cpCharacterFace
# Take the 1,200 most frequent characters (the tail of the ascending sort):
tail -1200 ../characters.txt | cut -d '"' -f 2 | parallel --progress cpCharacterFace
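For readers who find the shell pipeline hard to follow, the counting step can be sketched in Python. This is an illustrative equivalent, not the script actually used: it assumes each metadata line is one JSON object with a `rating` string and a `tags` list of `{"name", "category"}` objects (as the `jq` filter above implies), and the helper names are hypothetical.

```python
import json
from collections import Counter

CHARACTER_CATEGORY = "4"  # tag category that the jq filter above selects


def count_solo_characters(lines):
    """Count character tags over posts that are tagged 'solo' and are not
    rated explicit ('e'), mirroring the fgrep/jq/uniq pipeline."""
    counts = Counter()
    for line in lines:
        post = json.loads(line)
        tags = post.get("tags", [])
        if post.get("rating") == "e":
            continue
        if not any(tag.get("name") == "solo" for tag in tags):
            continue
        for tag in tags:
            if tag.get("category") == CHARACTER_CATEGORY:
                counts[tag["name"]] += 1
    return counts


def bucket_dir(post_id):
    """Subfolder for an image: the ID modulo 1000, zero-padded to 4 digits."""
    return "%04d" % (int(post_id) % 1000)
```

`count_solo_characters(open(path))` followed by `.most_common(1200)` would then play the role of `sort | uniq --count | sort --numeric-sort` plus `tail -1200`.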

I merged a number of redundant folders by hand, cleaned as usual, and did further cropping as necessary to reach 1000. This resulted in 212,359 portrait faces, with the largest class (Hatsune Miku) having 6,624 images and the smallest classes having ~0 or 1 images. (I don’t know if the class imbalance constitutes a real problem for BigGAN, as ImageNet itself is imbalanced on many levels.)

The data-loading code attempts to make the class index/ID number line up with the folder count, so the nth alphabetical folder (character), counting from 0, should have class ID n, which is important to know for generating conditional samples. The final set/IDs (as defined for my Danbooru 1K dataset by find_classes):

2k.tan: 0
abe.nana: 1
abigail.williams..fate.grand.order.: 2
abukuma..kantai.collection.: 3
admiral..kantai.collection.: 4
aegis..persona.: 5
aerith.gainsborough: 6
afuro.terumi: 7
agano..kantai.collection.: 8
agrias.oaks: 9
ahri: 10
aida.mana: 11
aino.minako: 12
aisaka.taiga: 13
aisha..elsword.: 14
akagi..kantai.collection.: 15
akagi.miria: 16
akashi..kantai.collection.: 17
akatsuki..kantai.collection.: 18
akaza.akari: 19
akebono..kantai.collection.: 20
akemi.homura: 21
aki.minoriko: 22
aki.shizuha: 23
akigumo..kantai.collection.: 24
akitsu.maru..kantai.collection.: 25
akitsushima..kantai.collection.: 26
akiyama.mio: 27
akiyama.yukari: 28
akizuki..kantai.collection.: 29
akizuki.ritsuko: 30
akizuki.ryou: 31
akuma.homura: 32
albedo: 33
alice..wonderland.: 34
alice.margatroid: 35
alice.margatroid..pc.98.: 36
alisa.ilinichina.amiella: 37
altera..fate.: 38
amagi..kantai.collection.: 39
amagi.yukiko: 40
amami.haruka: 41
amanogawa.kirara: 42
amasawa.yuuko: 43
amatsukaze..kantai.collection.: 44 45
anastasia..idolmaster.: 46
anchovy: 47
android.18: 48
android.21: 49
anegasaki.nene: 50
angel..kof.: 51
angela.balzac: 52
anjou.naruko: 53
aoba..kantai.collection.: 54
aoki.reika: 55
aori..splatoon.: 56
aozaki.aoko: 57
aqua..konosuba.: 58
ara.han: 59
aragaki.ayase: 60
araragi.karen: 61
arashi..kantai.collection.: 62
arashio..kantai.collection.: 63
archer: 64
arcueid.brunestud: 65
arima.senne: 66
artoria.pendragon..all.: 67
artoria.pendragon..lancer.: 68
artoria.pendragon..lancer.alter.: 69
artoria.pendragon..swimsuit.rider.alter.: 70
asahina.mikuru: 71
asakura.ryouko: 72
asashimo..kantai.collection.: 73
asashio..kantai.collection.: 74
ashigara..kantai.collection.: 75
asia.argento: 76
astolfo..fate.: 77
asui.tsuyu: 78
asuna..sao.: 79
atago..azur.lane.: 80
atago..kantai.collection.: 81
atalanta..fate.: 82
au.ra: 83
ayanami..azur.lane.: 84
ayanami..kantai.collection.: 85
ayanami.rei: 86
ayane..doa.: 87
ayase.eli: 88
baiken: 89
bardiche: 90
barnaby.brooks.jr: 91
battleship.hime: 92
bayonetta..character.: 93
bb..fate...all.: 94
bb..fate.extra.ccc.: 95
bb..swimsuit.mooncancer...fate.: 96
beatrice: 97
belfast..azur.lane.: 98
bismarck..kantai.collection.: 99
black.hanekawa: 100
black.rock.shooter..character.: 101
blake.belladonna: 102
blanc: 103
boko..girls.und.panzer.: 104
bottle.miku: 105
boudica..fate.grand.order.: 106
bowsette: 107
bridget..guilty.gear.: 108
busujima.saeko: 109
c.c.: 110
c.c..lemon..character.: 111
caesar.anthonio.zeppeli: 112
cagliostro..granblue.fantasy.: 113 114
cammy.white: 115
caren.hortensia: 116
caster: 117
cecilia.alcott: 118
celes.chere: 119
charlotte..madoka.magica.: 120
charlotte.dunois: 121
charlotte.e.yeager: 122
chen: 123
chibi.usa: 124
chiki: 125
chitanda.eru: 126
chloe.von.einzbern: 127
choukai..kantai.collection.: 128 129
ciel: 130
cirno: 131
clarisse..granblue.fantasy.: 132
clownpiece: 133
consort.yu..fate.: 134 135
cure.happy: 136
cure.march: 137
cure.marine: 138
cure.moonlight: 139
cure.peace: 140
cure.sunny: 141
cure.sunshine: 142
cure.twinkle: 143 144
daiyousei: 145
danua: 146
darjeeling: 147
dark.magician.girl: 148
dio.brando: 149
dizzy: 150
djeeta..granblue.fantasy.: 151
doremy.sweet: 152
eas: 153
eila.ilmatar.juutilainen: 154
elesis..elsword.: 155
elin..tera.: 156
elizabeth.bathory..brave...fate.: 157
elizabeth.bathory..fate.: 158
elizabeth.bathory..fate...all.: 159
ellen.baker: 160
elphelt.valentine: 161
elsa..frozen.: 162 163
emiya.kiritsugu: 164
emiya.shirou: 165
emperor.penguin..kemono.friends.: 166 167
enoshima.junko: 168
enterprise..azur.lane.: 169
ereshkigal..fate.grand.order.: 170
erica.hartmann: 171
etna: 172
eureka: 173
eve..elsword.: 174
ex.keine: 175
failure.penguin: 176
fate.testarossa: 177
felicia: 178
female.admiral..kantai.collection.: 179 180
female.protagonist..pokemon.go.: 181
fennec..kemono.friends.: 182
ferry..granblue.fantasy.: 183
flandre.scarlet: 184
florence.nightingale..fate.grand.order.: 185
fou..fate.grand.order.: 186
francesca.lucchini: 187 188
fubuki..kantai.collection.: 189
fujibayashi.kyou: 190
fujimaru.ritsuka..female.: 191 192
furude.rika: 193
furudo.erika: 194
furukawa.nagisa: 195
fusou..kantai.collection.: 196
futaba.anzu: 197
futami.mami: 198
futatsuiwa.mamizou: 199
fuuro..pokemon.: 200
galko: 201
gambier.bay..kantai.collection.: 202
ganaha.hibiki: 203
gangut..kantai.collection.: 204
gardevoir: 205
gasai.yuno: 206
gertrud.barkhorn: 207
gilgamesh: 208
ginga.nakajima: 209
giorno.giovanna: 210
gokou.ruri: 211
graf.eisen: 212
graf.zeppelin..kantai.collection.: 213
grey.wolf..kemono.friends.: 214
gumi: 215
hachikuji.mayoi: 216
hagikaze..kantai.collection.: 217
hagiwara.yukiho: 218
haguro..kantai.collection.: 219
hakurei.reimu: 220
hamakaze..kantai.collection.: 221
hammann..azur.lane.: 222
han.juri: 223
hanasaki.tsubomi: 224
hanekawa.tsubasa: 225
hanyuu: 226
haramura.nodoka: 227
harime.nui: 228
haro: 229
haruka..pokemon.: 230
haruna..kantai.collection.: 231 232
harusame..kantai.collection.: 233
hasegawa.kobato: 234
hassan.of.serenity..fate.: 235 236
hatoba.tsugu..character.: 237
hatsune.miku: 238
hatsune.miku..append.: 239
hatsuyuki..kantai.collection.: 240
hatsuzuki..kantai.collection.: 241
hayami.kanade: 242
hayashimo..kantai.collection.: 243
hayasui..kantai.collection.: 244
hecatia.lapislazuli: 245
helena.blavatsky..fate.grand.order.: 246
heles: 247
hestia..danmachi.: 248
hex.maniac..pokemon.: 249
hibari..senran.kagura.: 250
hibiki..kantai.collection.: 251 252
hiei..kantai.collection.: 253
higashi.setsuna: 254
higashikata.jousuke: 255
high.priest: 256
hiiragi.kagami: 257
hiiragi.tsukasa: 258
hijiri.byakuren: 259
hikari..pokemon.: 260
himejima.akeno: 261
himekaidou.hatate: 262
hinanawi.tenshi: 263 264
hino.akane..idolmaster.: 265 266
hino.rei: 267
hirasawa.ui: 268
hirasawa.yui: 269
hiryuu..kantai.collection.: 270
hishikawa.rikka: 271
hk416..girls.frontline.: 272
holo: 273
homura..xenoblade.2.: 274
honda.mio: 275
hong.meiling: 276
honma.meiko: 277
honolulu..azur.lane.: 278
horikawa.raiko: 279
hoshi.shouko: 280
hoshiguma.yuugi: 281
hoshii.miki: 282
hoshimiya.ichigo: 283
hoshimiya.kate: 284
hoshino.fumina: 285
hoshino.ruri: 286
hoshizora.miyuki: 287
hoshizora.rin: 288
hotarumaru: 289
hoto.cocoa: 290
houjou.hibiki: 291
houjou.karen: 292
houjou.satoko: 293
houjuu.nue: 294
houraisan.kaguya: 295
houshou..kantai.collection.: 296
huang.baoling: 297
hyuuga.hinata: 298
i.168..kantai.collection.: 299
i.19..kantai.collection.: 300
i.26..kantai.collection.: 301
i.401..kantai.collection.: 302
i.58..kantai.collection.: 303
i.8..kantai.collection.: 304
ia..vocaloid.: 305
ibaraki.douji..fate.grand.order.: 306
ibaraki.kasen: 307
ibuki.fuuko: 308
ibuki.suika: 309 310
ichinose.kotomi: 311
ichinose.shiki: 312
ikamusume: 313
ikazuchi..kantai.collection.: 314
illustrious..azur.lane.: 315
illyasviel.von.einzbern: 316
imaizumi.kagerou: 317
inaba.tewi: 318
inami.mahiru: 319
inazuma..kantai.collection.: 320
index: 321
ingrid: 322
inkling: 323
inubashiri.momiji: 324
inuyama.aoi: 325
iori.rinko: 326
iowa..kantai.collection.: 327
irisviel.von.einzbern: 328
iroha..samurai.spirits.: 329
ishtar..fate.grand.order.: 330
isokaze..kantai.collection.: 331
isonami..kantai.collection.: 332
isuzu..kantai.collection.: 333
itsumi.erika: 334
ivan.karelin: 335
izayoi.sakuya: 336
izumi.konata: 337
izumi.sagiri: 338
jack.the.ripper..fate.apocrypha.: 339
jakuzure.nonon: 340
japanese.crested.ibis..kemono.friends.: 341
jeanne.d.arc..alter...fate.: 342
jeanne.d.arc..alter.swimsuit.berserker.: 343
jeanne.d.arc..fate.: 344
jeanne.d.arc..fate...all.: 345
jeanne.d.arc..granblue.fantasy.: 346
jeanne.d.arc..swimsuit.archer.: 347
jeanne.d.arc.alter.santa.lily: 348
jintsuu..kantai.collection.: 349
jinx..league.of.legends.: 350
johnny.joestar: 351
jonathan.joestar: 352
joseph.joestar..young.: 353
jougasaki.mika: 354
jougasaki.rika: 355 356
junketsu: 357
junko..touhou.: 358
kaban..kemono.friends.: 359
kaburagi.t.kotetsu: 360
kaenbyou.rin: 361 362
kafuu.chino: 363
kaga..kantai.collection.: 364
kagamine.len: 365
kagamine.rin: 366
kagerou..kantai.collection.: 367
kagiyama.hina: 368
kagura..gintama.: 369
kaguya.luna..character.: 370
kaito: 371
kaku.seiga: 372
kakyouin.noriaki: 373
kallen.stadtfeld: 374
kamikaze..kantai.collection.: 375
kamikita.komari: 376
kamio.misuzu: 377
kamishirasawa.keine: 378
kamiya.nao: 379
kamoi..kantai.collection.: 380
kaname.madoka: 381
kanbaru.suruga: 382
kanna.kamui: 383
kanzaki.ranko: 384
karina.lyle: 385
kasane.teto: 386
kashima..kantai.collection.: 387
kashiwazaki.sena: 388
kasodani.kyouko: 389 390
kasugano.sora: 391
kasumi..doa.: 392
kasumi..kantai.collection.: 393
kasumi..pokemon.: 394
kasumigaoka.utaha: 395
katori..kantai.collection.: 396
katou.megumi: 397
katsura.hinagiku: 398
katsuragi..kantai.collection.: 399
katsushika.hokusai..fate.grand.order.: 400
katyusha: 401
kawakami.mai: 402
kawakaze..kantai.collection.: 403
kawashiro.nitori: 404
kay..girls.und.panzer.: 405
kazama.asuka: 406
kazami.yuuka: 407
kenzaki.makoto: 408
kijin.seija: 409
kikuchi.makoto: 410
kino: 411
kino.makoto: 412 413
kinugasa..kantai.collection.: 414
kirigaya.suguha: 415
kirigiri.kyouko: 416
kirijou.mitsuru: 417
kirima.sharo: 418
kirin..armor.: 419
kirino.ranmaru: 420
kirisame.marisa: 421
kirishima..kantai.collection.: 422
kirito: 423
kiryuuin.satsuki: 424
kisaragi..kantai.collection.: 425
kisaragi.chihaya: 426
kise.yayoi: 427
kishibe.rohan: 428
kishin.sagume: 429
kiso..kantai.collection.: 430
kiss.shot.acerola.orion.heart.under.blade: 431
kisume: 432
kitakami..kantai.collection.: 433
kiyohime..fate.grand.order.: 434
kiyoshimo..kantai.collection.: 435 436
koakuma: 437
kobayakawa.rinko: 438
kobayakawa.sae: 439
kochiya.sanae: 440
kohinata.miho: 441
koizumi.hanayo: 442
komaki.manaka: 443
komeiji.koishi: 444
komeiji.satori: 445
kongou..kantai.collection.: 446 447
konpaku.youmu: 448
konpaku.youmu..ghost.: 449
kooh: 450
kos.mos: 451
koshimizu.sachiko: 452
kotobuki.tsumugi: 453
kotomine.kirei: 454
kotonomiya.yuki: 455
kousaka.honoka: 456
kousaka.kirino: 457
kousaka.tamaki: 458
kozakura.marry: 459
kuchiki.rukia: 460
kujikawa.rise: 461
kujou.karen: 462
kula.diamond: 463
kuma..kantai.collection.: 464
kumano..kantai.collection.: 465
kumoi.ichirin: 466
kunikida.hanamaru: 467
kuradoberi.jam: 468
kuriyama.mirai: 469
kurodani.yamame: 470 471
kurokawa.eren: 472
kuroki.tomoko: 473
kurosawa.dia: 474
kurosawa.ruby: 475
kuroshio..kantai.collection.: 476
kuroyukihime: 477
kurumi.erika: 478
kusanagi.motoko: 479
kusugawa.sasara: 480
kuujou.jolyne: 481
kuujou.joutarou: 482
kyon: 483
kyonko: 484
kyubey: 485
laffey..azur.lane.: 486
lala.satalin.deviluke: 487
lancer: 488 489
laura.bodewig: 490
leafa: 491
lei.lei: 492
lelouch.lamperouge: 493
len: 494
letty.whiterock: 495 496
libeccio..kantai.collection.: 497
lightning.farron: 498
lili..tekken.: 499
lilith.aensland: 500
lillie..pokemon.: 501
lily.white: 502
link: 503 504 505
lucina: 506
lum: 507
luna.child: 508
lunamaria.hawke: 509
lunasa.prismriver: 510
lusamine..pokemon.: 511
lyn..blade...soul.: 512 513
lynette.bishop: 514
m1903.springfield..girls.frontline.: 515
madotsuki: 516
maekawa.miku: 517
maka.albarn: 518
makigumo..kantai.collection.: 519
makinami.mari.illustrious: 520
makise.kurisu: 521
makoto..street.fighter.: 522
makoto.nanaya: 523
mankanshoku.mako: 524
mao..pokemon.: 525
maou..maoyuu.: 526
maribel.hearn: 527
marie.antoinette..fate.grand.order.: 528
mash.kyrielight: 529
matoi..pso2.: 530
matoi.ryuuko: 531 532
matsuura.kanan: 533
maya..kantai.collection.: 534
me.tan: 535
medicine.melancholy: 536
medjed: 537
meer.campbell: 538
megumin: 539
megurine.luka: 540
mei..overwatch.: 541
mei..pokemon.: 542
meiko: 543
meltlilith: 544
mercy..overwatch.: 545
merlin.prismriver: 546
michishio..kantai.collection.: 547
midare.toushirou: 548
midna: 549
midorikawa.nao: 550
mika..girls.und.panzer.: 551
mikasa.ackerman: 552
mikazuki.munechika: 553
miki.sayaka: 554
millia.rage: 555
mima: 556
mimura.kanako: 557
minami.kotori: 558 559 560
minase.akiko: 561
minase.iori: 562
miqo.te: 563
misaka.mikoto: 564
mishaguji: 565
misumi.nagisa: 566
mithra: 567
miura.azusa: 568
miyafuji.yoshika: 569
miyako.yoshika: 570
miyamoto.frederica: 571
miyamoto.musashi..fate.grand.order.: 572
miyaura.sanshio: 573
mizuhashi.parsee: 574
mizuki..pokemon.: 575
mizunashi.akari: 576
mizuno.ami: 577
mogami..kantai.collection.: 578
momo.velia.deviluke: 579 580 581
mordred..fate.: 582
mordred..fate...all.: 583
morgiana: 584
morichika.rinnosuke: 585
morikubo.nono: 586
moriya.suwako: 587
moroboshi.kirari: 588
morrigan.aensland: 589
motoori.kosuzu: 590
mumei..kabaneri.: 591
murakumo..kantai.collection.: 592
murasa.minamitsu: 593
murasame..kantai.collection.: 594
musashi..kantai.collection.: 595
mutsu..kantai.collection.: 596
mutsuki..kantai.collection.: 597 598 599
myoudouin.itsuki: 600
mysterious.heroine.x: 601
mysterious.heroine.x..alter.: 602
mystia.lorelei: 603
nadia: 604
nagae.iku: 605
naganami..kantai.collection.: 606
nagato..kantai.collection.: 607
nagato.yuki: 608
nagatsuki..kantai.collection.: 609
nagi: 610
nagisa.kaworu: 611
naka..kantai.collection.: 612
nakano.azusa: 613 614
nanami.chiaki: 615 616
nao..mabinogi.: 617
narmaya..granblue.fantasy.: 618
narukami.yuu: 619
narusawa.ryouka: 620
natalia..idolmaster.: 621
natori.sana: 622
natsume..pokemon.: 623
natsume.rin: 624
nazrin: 625
nekomiya.hinata: 626
nekomusume: 627 628
nepgear: 629
neptune..neptune.series.: 630
nero.claudius..bride...fate.: 631
nero.claudius..fate.: 632
nero.claudius..fate...all.: 633
nero.claudius..swimsuit.caster...fate.: 634
nia.teppelin: 635
nibutani.shinka: 636
nico.robin: 637
ninomiya.asuka: 638
nishikino.maki: 639
nishizumi.maho: 640
nishizumi.miho: 641
nitocris..fate.grand.order.: 642
nitocris..swimsuit.assassin...fate.: 643
nitta.minami: 644
noel.vermillion: 645
noire: 646
northern.ocean.hime: 647
noshiro..kantai.collection.: 648
noumi.kudryavka: 649
nu.13: 650
nyarlathotep..nyaruko.san.: 651
oboro..kantai.collection.: 652
oda.nobunaga..fate.: 653
ogata.chieri: 654
ohara.mari: 655
oikawa.shizuku: 656
okazaki.yumemi: 657
okita.souji..alter...fate.: 658
okita.souji..fate.: 659
okita.souji..fate...all.: 660
onozuka.komachi: 661
ooi..kantai.collection.: 662
oomori.yuuko: 663
ootsuki.yui: 664
ooyodo..kantai.collection.: 665
osakabe.hime..fate.grand.order.: 666
oshino.shinobu: 667
otonashi.kotori: 668
panty..psg.: 669
passion.lip: 670
patchouli.knowledge: 671
pepperoni..girls.und.panzer.: 672
perrine.h.clostermann: 673
pharah..overwatch.: 674
phosphophyllite: 675
pikachu: 676
pixiv.tan: 677
platelet..hataraku.saibou.: 678
platinum.the.trinity: 679
pod..nier.automata.: 680
pola..kantai.collection.: 681 682 683
princess.peach: 684
princess.serenity: 685
princess.zelda: 686
prinz.eugen..azur.lane.: 687
prinz.eugen..kantai.collection.: 688
prisma.illya: 689
purple.heart: 690
puru.see: 691
pyonta: 692
qbz.95..girls.frontline.: 693
rachel.alucard: 694
racing.miku: 695
raising.heart: 696
ramlethal.valentine: 697
ranka.lee: 698
ranma.chan: 699
re.class.battleship: 700
reinforce: 701
reinforce.zwei: 702
reisen.udongein.inaba: 703
reiuji.utsuho: 704
reizei.mako: 705 706
remilia.scarlet: 707
rensouhou.chan: 708
rensouhou.kun: 709
rias.gremory: 710
rider: 711
riesz: 712
ringo..touhou.: 713
ro.500..kantai.collection.: 714
roll: 715
rosehip: 716
rossweisse: 717
ruby.rose: 718
rumia: 719
rydia: 720
ryougi.shiki: 721
ryuuguu.rena: 722
ryuujou..kantai.collection.: 723
saber: 724
saber.alter: 725
saber.lily: 726
sagisawa.fumika: 727
saigyouji.yuyuko: 728
sailor.mars: 729
sailor.mercury: 730
sailor.moon: 731
sailor.saturn: 732
sailor.venus: 733
saint.martha: 734
sakagami.tomoyo: 735
sakamoto.mio: 736
sakata.gintoki: 737
sakuma.mayu: 738
sakura.chiyo: 739
sakura.futaba: 740
sakura.kyouko: 741
sakura.miku: 742
sakurai.momoka: 743
sakurauchi.riko: 744
samidare..kantai.collection.: 745
samus.aran: 746
sanya.v.litvyak: 747 748
saotome.ranma: 749
saratoga..kantai.collection.: 750
sasaki.chiho: 751
saten.ruiko: 752
satonaka.chie: 753
satsuki..kantai.collection.: 754
sawamura.spencer.eriri: 755
saya: 756
sazaki.kaoruko: 757
sazanami..kantai.collection.: 758
scathach..fate...all.: 759
scathach..fate.grand.order.: 760
scathach..swimsuit.assassin...fate.: 761
seaport.hime: 762
seeu: 763
seiran..touhou.: 764
seiren..suite.precure.: 765
sekibanki: 766
selvaria.bles: 767
sendai..kantai.collection.: 768 769
sengoku.nadeko: 770
senjougahara.hitagi: 771
senketsu: 772
sento.isuzu: 773
serena..pokemon.: 774
serval..kemono.friends.: 775
sf.a2.miki: 776
shameimaru.aya: 777
shana: 778
shanghai.doll: 779
shantae..character.: 780
sheryl.nome: 781
shibuya.rin: 782
shidare.hotaru: 783
shigure..kantai.collection.: 784
shijou.takane: 785
shiki.eiki: 786
shikinami..kantai.collection.: 787
shikinami.asuka.langley: 788
shimada.arisu: 789
shimakaze..kantai.collection.: 790
shimamura.uzuki: 791
shinjou.akane: 792
shinki: 793
shinku: 794
shiomi.shuuko: 795
shirabe.ako: 796
shirai.kuroko: 797
shirakiin.ririchiyo: 798
shiranui..kantai.collection.: 799
shiranui.mai: 800
shirasaka.koume: 801
shirase.sakuya: 802
shiratsuyu..kantai.collection.: 803
shirayuki.hime: 804
shirogane.naoto: 805
shirona..pokemon.: 806
shoebill..kemono.friends.: 807
shokuhou.misaki: 808
shouhou..kantai.collection.: 809
shoukaku..kantai.collection.: 810
shuten.douji..fate.grand.order.: 811
signum: 812
silica: 813
simon: 814
sinon: 815 816
sona.buvelle: 817
sonoda.umi: 818
sonohara.anri: 819
sonozaki.mion: 820
sonozaki.shion: 821
sora.ginko: 822 823
souryuu..kantai.collection.: 824
souryuu.asuka.langley: 825
souseiseki: 826
star.sapphire: 827
stocking..psg.: 828
su.san: 829
subaru.nakajima: 830
suigintou: 831
suiren..pokemon.: 832
suiseiseki: 833
sukuna.shinmyoumaru: 834
sunny.milk: 835
suomi.kp31..girls.frontline.: 836
super.pochaco: 837
super.sonico: 838
suzukaze.aoba: 839
suzumiya.haruhi: 840
suzutsuki..kantai.collection.: 841
suzuya..kantai.collection.: 842
tachibana.arisu: 843
tachibana.hibiki..symphogear.: 844
tada.riina: 845
taigei..kantai.collection.: 846
taihou..azur.lane.: 847
taihou..kantai.collection.: 848
tainaka.ritsu: 849
takagaki.kaede: 850
takakura.himari: 851
takamachi.nanoha: 852
takami.chika: 853
takanashi.rikka: 854
takao..azur.lane.: 855
takao..kantai.collection.: 856
takara.miyuki: 857
takarada.rikka: 858
takatsuki.yayoi: 859
takebe.saori: 860
tama..kantai.collection.: 861
tamamo..fate...all.: 862 863 864 865
tanamachi.kaoru: 866
taneshima.popura: 867
tanned.cirno: 868
taokaka: 869
tatara.kogasa: 870
tateyama.ayano: 871
tatsumaki: 872
tatsuta..kantai.collection.: 873
tedeza.rize: 874
tenryuu..kantai.collection.: 875 876
teruzuki..kantai.collection.: 877
tharja: 878
tifa.lockhart: 879
tina.branford: 880
tippy..gochiusa.: 881
tokiko..touhou.: 882
tokisaki.kurumi: 883
tokitsukaze..kantai.collection.: 884
tomoe.gozen..fate.grand.order.: 885
tomoe.hotaru: 886
tomoe.mami: 887
tone..kantai.collection.: 888
toono.akiha: 889
tooru..maidragon.: 890
toosaka.rin: 891
toramaru.shou: 892
toshinou.kyouko: 893
totoki.airi: 894
toudou.shimako: 895
toudou.yurika: 896
toujou.koneko: 897
toujou.nozomi: 898
touko..pokemon.: 899
touwa.erio: 900 901
tracer..overwatch.: 902
tsukikage.yuri: 903
tsukimiya.ayu: 904
tsukino.mito: 905
tsukino.usagi: 906
tsukumo.benben: 907
tsurumaru.kuninaga: 908
tsuruya: 909
tsushima.yoshiko: 910
u.511..kantai.collection.: 911
ujimatsu.chiya: 912
ultimate.madoka: 913
umikaze..kantai.collection.: 914
unicorn..azur.lane.: 915
unryuu..kantai.collection.: 916
urakaze..kantai.collection.: 917
uraraka.ochako: 918
usada.hikaru: 919
usami.renko: 920
usami.sumireko: 921
ushio..kantai.collection.: 922
ushiromiya.ange: 923
ushiwakamaru..fate.grand.order.: 924
uzuki..kantai.collection.: 925
vampire..azur.lane.: 926
vampy: 927
venera.sama: 928
verniy..kantai.collection.: 929 930
violet.evergarden..character.: 931
vira.lilie: 932
vita: 933
vivio: 934
wa2000..girls.frontline.: 935
wakasagihime: 936
wang.liu.mei: 937
warspite..kantai.collection.: 938 939
watarase.jun: 940 941
waver.velvet: 942
weiss.schnee: 943
white.mage: 944
widowmaker..overwatch.: 945
wo.class.aircraft.carrier: 946
wriggle.nightbug: 947
xenovia.quarta: 948
xp.tan: 949
xuanzang..fate.grand.order.: 950
yagami.hayate: 951
yagokoro.eirin: 952
yahagi..kantai.collection.: 953
yakumo.ran: 954
yakumo.yukari: 955
yamada.aoi: 956
yamada.elf: 957
yamakaze..kantai.collection.: 958
yamashiro..azur.lane.: 959
yamashiro..kantai.collection.: 960
yamato..kantai.collection.: 961 962
yang.xiao.long: 963
yasaka.kanako: 964
yayoi..kantai.collection.: 965 966
yin: 967
yoko.littner: 968 969
yorigami.shion: 970
yowane.haku: 971
yuffie.kisaragi: 972 973
yuigahama.yui: 974
yuki.miku: 975
yukikaze..kantai.collection.: 976
yukine.chris: 977
yukinoshita.yukino: 978
yukishiro.honoka: 979
yumi..senran.kagura.: 980
yuna..ff10.: 981
yuno: 982
yura..kantai.collection.: 983
yuubari..kantai.collection.: 984
yuudachi..kantai.collection.: 985
yuugumo..kantai.collection.: 986
yuuki..sao.: 987
yuuki.makoto: 988
yuuki.mikan: 989
yuzuhara.konomi: 990
yuzuki.yukari: 991
yuzuriha.inori: 992
z1.leberecht.maass..kantai.collection.: 993
z3.max.schultz..kantai.collection.: 994 995
zeta..granblue.fantasy.: 996
zooey..granblue.fantasy.: 997
zuihou..kantai.collection.: 998
zuikaku..kantai.collection.: 999
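The naming and numbering in the listing above can be reproduced with a short sketch. It assumes, as the listing suggests, that find_classes sorts the folder names and numbers them from 0, and that folder names come from replacing every punctuation character in a Danbooru tag with a period, as the `tr '[:punct:]' '.'` step does. `safe_name` and the raw tag strings in the usage example are illustrative assumptions, not taken from any codebase.

```python
import string

# Map every ASCII punctuation character to '.', like `tr '[:punct:]' '.'`.
_PUNCT_TO_DOT = str.maketrans(string.punctuation, "." * len(string.punctuation))


def safe_name(tag):
    """Turn a raw tag into the filesystem-safe folder name."""
    return tag.translate(_PUNCT_TO_DOT)


def find_classes(folder_names):
    """Sort folder names and assign each one a zero-based class ID."""
    classes = sorted(folder_names)
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    return classes, class_to_idx


if __name__ == "__main__":
    # Assumed raw tags for the first four folders in the listing.
    tags = ["abe_nana", "abukuma_(kantai_collection)", "2k-tan",
            "abigail_williams_(fate/grand_order)"]
    classes, class_to_idx = find_classes(safe_name(t) for t in tags)
    for name in classes:
        print(f"{name}: {class_to_idx[name]}")
```

Because the ID depends only on sort order, adding, removing, or merging a folder shifts the ID of every later class, so a lookup table like this needs regenerating whenever the folder set changes.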

(Aside from being poten­tially use­ful to sta­bi­lize train­ing by pro­vid­ing supervision/metadata, use of classes/categories reduces the need for char­ac­ter-spe­cific trans­fer learn­ing for spe­cial­ized StyleGAN mod­els, since you can just gen­er­ate sam­ples from a spe­cific class. For the 256px mod­el, I pro­vide down­load­able sam­ples for each of the 1000 class­es.)

D1K Download

D1K (20GB; n = 822,842 512px JPEGs) and the por­trait-crop ver­sion, D1K-por­traits (18GB; n = 212,359) are avail­able for down­load:

rsync --verbose --recursive rsync:// ./d1k/

The JPG compression turned out to be too aggressive and resulted in noticeable artifacting, so in early 2020 I regenerated D1k from Danbooru2019 for future projects, creating D1K-2019-512px: a fresh set of top-1k solo character images, rated s/q (safe/questionable) in Danbooru2019, with no JPEG compression.

Merges of over­lap­ping char­ac­ters were again nec­es­sary; the full set of tag merges:

merge() { mv "./$1"/* "./$2/" && rmdir "./$1"; }
merge alice.margatroid..pc.98. alice.margatroid
merge artoria.pendragon..all. saber
merge artoria.pendragon..lancer. saber
merge artoria.pendragon..lancer.alter. saber
merge artoria.pendragon..swimsuit.rider.alter. saber
merge artoria.pendragon..swimsuit.ruler...fate. saber
merge atago..midsummer.march...azur.lane. atago..azur.lane.
merge bardiche fate.testarossa
merge bb..fate...all.
merge bb..fate.extra.ccc.
merge bb..swimsuit.mooncancer...fate.
merge bottle.miku hatsune.miku
merge aoki.reika
merge cure.happy hoshizora.miyuki
merge cure.march midorikawa.nao
merge cure.marine kurumi.erika
merge cure.melody houjou.hibiki
merge cure.moonlight tsukikage.yuri
merge cure.peace kise.yayoi
merge cure.peach
merge cure.sunny
merge cure.sunshine myoudouin.itsuki
merge cure.sword kenzaki.makoto
merge cure.twinkle amanogawa.kirara
merge eas higashi.setsuna
merge elizabeth.bathory..brave...fate. elizabeth.bathory..fate.
merge elizabeth.bathory..fate...all. elizabeth.bathory..fate.
merge ex.keine kamishirasawa.keine
merge frederica.bernkastel  furude.rika
merge furudo.erika furude.rika
merge graf.eisen vita
merge hatsune.miku..append. hatsune.miku
merge ishtar..fate.grand.order. ishtar..fate...all.
merge jeanne.d.arc..alter...fate. jeanne.d.arc..fate.
merge jeanne.d.arc..alter.swimsuit.berserker. jeanne.d.arc..fate.
merge jeanne.d.arc..fate...all. jeanne.d.arc..fate.
merge jeanne.d.arc..swimsuit.archer. jeanne.d.arc..fate.
merge jeanne.d.arc.alter.santa.lily jeanne.d.arc..fate.
merge kaenbyou.rin
merge kiyohime..swimsuit.lancer...fate. kiyohime..fate.grand.order.
merge konpaku.youmu..ghost. konpaku.youmu
merge kyonko kyon
merge lancer cu.chulainn..fate...all.
merge medb..fate.grand.order. medb..fate...all.
merge medjed nitocris..fate.grand.order.
merge meltryllis..swimsuit.lancer...fate. meltryllis
merge miyamoto.musashi..swimsuit.berserker...fate. miyamoto.musashi..fate.grand.order.
merge mordred..fate...all. mordred..fate.
merge mysterious.heroine.x saber
merge mysterious.heroine.x..alter. saber
merge mysterious.heroine.xx..foreigner. saber
merge nero.claudius..bride...fate. nero.claudius..fate.
merge nero.claudius..fate...all. nero.claudius..fate.
merge nero.claudius..swimsuit.caster...fate. nero.claudius..fate.
merge nitocris..swimsuit.assassin...fate. nitocris..fate.grand.order.
merge oda.nobunaga..fate...all. oda.nobunaga..fate.
merge okita.souji..alter...fate. okita.souji..fate.
merge okita.souji..fate...all. okita.souji..fate.
merge princess.of.the.crystal takakura.himari
merge princess.serenity tsukino.usagi
merge prinz.eugen..azur.lane.
merge prisma.illya illyasviel.von.einzbern
merge purple.heart neptune..neptune.series.
merge pyonta moriya.suwako
merge racing.miku hatsune.miku
merge raising.heart  takamachi.nanoha
merge reinforce.zwei reinforce
merge rensouhou.chan shimakaze..kantai.collection.
merge roll.caskett roll
merge saber.alter saber
merge saber.lily saber
merge sailor.jupiter kino.makoto
merge sailor.mars hino.rei
merge sailor.mercury mizuno.ami
merge sailor.moon tsukino.usagi
merge sailor.saturn tomoe.hotaru
merge sailor.venus aino.minako
merge sakura.miku hatsune.miku
merge scathach..fate.grand.order. scathach..fate...all.
merge scathach..swimsuit.assassin...fate. scathach..fate...all.
merge scathach.skadi..fate.grand.order. scathach..fate...all.
merge schwertkreuz yagami.hayate
merge seiren..suite.precure. kurokawa.eren
merge shanghai.doll alice.margatroid
merge shikinami.asuka.langley souryuu.asuka.langley
merge su.san medicine.melancholy
merge taihou..forbidden.feast...azur.lane. taihou..azur.lane.
merge tamamo..fate...all.
merge tanned.cirno cirno
merge ultimate.madoka kaname.madoka
merge yuki.miku hatsune.miku
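
For anyone who prefers to script the merges outside the shell, here is a rough Python equivalent of the `merge` helper. It is a sketch under stated assumptions rather than the procedure actually used: it takes an explicit `root` argument, creates the destination folder if it is missing, and refuses to overwrite on filename collisions (the shell one-liner does none of these). The demo filenames are made up.

```python
import shutil
import tempfile
from pathlib import Path

def merge(root, src, dst):
    """Move everything in root/src into root/dst, then delete the emptied root/src."""
    src_dir, dst_dir = Path(root) / src, Path(root) / dst
    if not src_dir.is_dir():
        return 0                      # no folder for this variant tag; skip quietly
    dst_dir.mkdir(exist_ok=True)      # unlike the shell one-liner, create dst if missing
    moved = 0
    for item in sorted(src_dir.iterdir()):
        target = dst_dir / item.name
        if target.exists():           # refuse to overwrite on a filename collision
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(item), str(target))
        moved += 1
    src_dir.rmdir()
    return moved

# Demo on a throwaway directory with made-up filenames.
with tempfile.TemporaryDirectory() as root:
    for folder, filename in [("saber.lily", "1.jpg"), ("saber.lily", "2.jpg"), ("saber", "3.jpg")]:
        (Path(root) / folder).mkdir(exist_ok=True)
        (Path(root) / folder / filename).write_bytes(b"")
    moved = merge(root, "saber.lily", "saber")
    remaining = sorted(p.name for p in Path(root).iterdir())
    merged_files = sorted(p.name for p in (Path(root) / "saber").iterdir())

print(moved, remaining, merged_files)
```

Raising on collisions is a deliberate choice here: image filenames in a Danbooru-derived folder are normally unique IDs, so a collision more likely signals a mistake than a file that is safe to overwrite.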


rsync --verbose --recursive rsync:// ./d1k-2019-512px/

D1K BigGAN Conversion

BigGAN requires the dataset metadata to be defined in the source code, and then, if using HDF5 archives, the dataset must be processed into an HDF5 archive, along with Inception statistics for the periodic testing (although I minimize testing, the preprocessed statistics are still necessary).

HDF5 is not necessary and can be omitted if you prefer to avoid the hassle: BigGAN-PyTorch can read image folders directly.

The source must be edited to add metadata per dataset (there is no CLI option), which looks like this to define a 128px Danbooru-1K portrait dataset:

 # Convenience dicts
-dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
+dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
              'I128': dset.ImageFolder, 'I256': dset.ImageFolder,
              'I32_hdf5': dset.ILSVRC_HDF5, 'I64_hdf5': dset.ILSVRC_HDF5,
              'I128_hdf5': dset.ILSVRC_HDF5, 'I256_hdf5': dset.ILSVRC_HDF5,
-             'C10': dset.CIFAR10, 'C100': dset.CIFAR100}
+             'C10': dset.CIFAR10, 'C100': dset.CIFAR100,
+             'D1K': dset.ImageFolder, 'D1K_hdf5': dset.ILSVRC_HDF5 }
 imsize_dict = {'I32': 32, 'I32_hdf5': 32,
                'I64': 64, 'I64_hdf5': 64,
                'I128': 128, 'I128_hdf5': 128,
                'I256': 256, 'I256_hdf5': 256,
-               'C10': 32, 'C100': 32}
+               'C10': 32, 'C100': 32,
+               'D1K': 128, 'D1K_hdf5': 128 }
 root_dict = {'I32': 'ImageNet', 'I32_hdf5': 'ILSVRC32.hdf5',
              'I64': 'ImageNet', 'I64_hdf5': 'ILSVRC64.hdf5',
              'I128': 'ImageNet', 'I128_hdf5': 'ILSVRC128.hdf5',
              'I256': 'ImageNet', 'I256_hdf5': 'ILSVRC256.hdf5',
-             'C10': 'cifar', 'C100': 'cifar'}
+             'C10': 'cifar', 'C100': 'cifar',
+             'D1K': 'characters-1k-faces', 'D1K_hdf5': 'D1K.hdf5' }
 nclass_dict = {'I32': 1000, 'I32_hdf5': 1000,
                'I64': 1000, 'I64_hdf5': 1000,
                'I128': 1000, 'I128_hdf5': 1000,
                'I256': 1000, 'I256_hdf5': 1000,
-               'C10': 10, 'C100': 100}
-# Number of classes to put per sample sheet
+               'C10': 10, 'C100': 100,
+               'D1K': 1000, 'D1K_hdf5': 1000 }
+# Number of classes to put per sample sheet
 classes_per_sheet_dict = {'I32': 50, 'I32_hdf5': 50,
                           'I64': 50, 'I64_hdf5': 50,
                           'I128': 20, 'I128_hdf5': 20,
                           'I256': 20, 'I256_hdf5': 20,
-                          'C10': 10, 'C100': 100}
+                          'C10': 10, 'C100': 100,
+                          'D1K': 1, 'D1K_hdf5': 1 }
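
Because a new dataset has to be added to all five dictionaries by hand, it is easy to miss one. The following is a minimal sketch of a consistency check, using plain stand-in dictionaries (strings in place of the real `dset.*` classes) so that it runs on its own; the helper name `missing_entries` is hypothetical and not part of BigGAN-PyTorch.

```python
REQUIRED_DICTS = ("dset_dict", "imsize_dict", "root_dict",
                  "nclass_dict", "classes_per_sheet_dict")

def missing_entries(dicts, dataset_names):
    """Return {dict_name: [dataset keys missing from that dict]} for incomplete registrations."""
    problems = {}
    for dict_name in REQUIRED_DICTS:
        registered = dicts.get(dict_name, {})
        missing = [name for name in dataset_names if name not in registered]
        if missing:
            problems[dict_name] = missing
    return problems

# Stand-ins mirroring the D1K additions in the diff above.
dicts = {
    "dset_dict": {"D1K": "ImageFolder", "D1K_hdf5": "ILSVRC_HDF5"},
    "imsize_dict": {"D1K": 128, "D1K_hdf5": 128},
    "root_dict": {"D1K": "characters-1k-faces", "D1K_hdf5": "D1K.hdf5"},
    "nclass_dict": {"D1K": 1000, "D1K_hdf5": 1000},
    "classes_per_sheet_dict": {"D1K": 1},  # 'D1K_hdf5' deliberately left out
}
problems = missing_entries(dicts, ["D1K", "D1K_hdf5"])
print(problems)  # {'classes_per_sheet_dict': ['D1K_hdf5']}
```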

Each dataset exists in 2 forms, as the orig­i­nal image folder and then as the processed HDF5:

python --dataset D1K512 --data_root /media/gwern/Data2/danbooru2018
python --parallel --dataset D1K_hdf5 --batch_size 64 \
    --data_root /media/gwern/Data2/danbooru2018
## Or ImageNet example:
python --dataset I128 --data_root /media/gwern/Data/imagenet/
python --dataset I128_hdf5 --batch_size 64 \
    --data_root /media/gwern/Data/imagenet/

The conversion step will write the HDF5 to an ILSVRC*.hdf5 file, so rename it to whatever (eg D1K.hdf5).

BigGAN Training

With the HDF5 & Incep­tion sta­tis­tics cal­cu­lat­ed, it should be pos­si­ble to run like so:

python --dataset D1K --parallel --shuffle --num_workers 4 --batch_size 32 \
    --num_G_accumulations 8 --num_D_accumulations 8  \
    --num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 \
    --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 --adam_eps 1e-6 \
    --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 120 --shared_dim 128 \
    --G_eval_mode --G_ch 96 --D_ch 96  \
    --ema --use_ema --ema_start 20000 --test_every 2000 --save_every 1000 --num_best_copies 5 \
    --num_save_copies 2 --seed 0 --use_multiepoch_sampler --which_best FID \
    --data_root /media/gwern/Data2/danbooru2018

The architecture is specified on the command line and must be correct; examples are in the scripts/ directory. In the above example, --num_D_steps...--D_ch should be left strictly alone and the key parameters are before/after that architecture block. In this example, my 2×1080ti can support a batch size of n = 32 & the gradient accumulation overhead without OOMing. The big batches of BigGAN are implemented by --batch_size times --num_{G/D}_accumulations; I would need an accumulation of 64 to match n = 2048.

In addition to that, it’s important to enable EMA, which makes a truly remarkable difference in the generated sample quality (which is interesting because EMA sounds redundant with momentum/learning rates, but isn’t). Without EMA, samples are low quality and change drastically at each iteration; but after a certain number of iterations, sampling is done with EMA, which averages the iterations offline (but one doesn’t train using the averaged model!46). The averaged samples show that collectively these iterations are similar because they are ‘orbiting’ around a central point, and the image quality clearly improves gradually once EMA is turned on.
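
The ‘orbiting’ intuition can be illustrated with a toy example. This is a minimal sketch in plain Python, not BigGAN-PyTorch code: the 2-number ‘weights’, the circular orbit, the decay of 0.9, and the rule of simply copying the live weights before the EMA start step are all assumptions chosen to keep the demo short.

```python
import math

CENTER = (1.0, -2.0)  # the point the toy 'training' circles without ever settling on

def live_weights(step, radius=1.0, angle_per_step=1.0):
    """Toy generator weights at a given step: a point orbiting CENTER."""
    theta = step * angle_per_step
    return [CENTER[0] + radius * math.cos(theta),
            CENTER[1] + radius * math.sin(theta)]

def ema_update(ema, live, decay):
    """One EMA step: nudge each averaged weight toward the current live weight."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, live)]

def distance_to_center(weights):
    return math.hypot(weights[0] - CENTER[0], weights[1] - CENTER[1])

def run(num_steps=200, ema_start=20, decay=0.9):
    ema = None
    for step in range(num_steps):
        live = live_weights(step)          # training only ever updates the live weights
        if step < ema_start:
            ema = list(live)               # before ema_start: just track the live weights
        else:
            ema = ema_update(ema, live, decay)
    return distance_to_center(live), distance_to_center(ema)

live_dist, ema_dist = run()
print(f"live weights: {live_dist:.2f} from the center")   # stays at the orbit radius
print(f"EMA weights:  {ema_dist:.2f} from the center")    # about 0.11 with these settings
```

The averaged copy is only used for sampling; the live weights continue to be the ones trained, which matches the point that one does not train using the averaged model.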

Trans­fer learn­ing is not sup­ported native­ly, but a sim­i­lar trick as with StyleGAN is fea­si­ble: just drop the pre­trained mod­els into the check­point folder and resume (which will work as long as the archi­tec­ture is iden­ti­cal to the CLI para­me­ter­s).

The sample sheet functionality can easily overload a GPU and OOM. In the training code, it may be necessary to simply comment out all of the sampling functionality starting with utils.sample_sheet.

The main problem running BigGAN is odd bugs in BigGAN’s handling of epochs/iterations and changing gradient accumulations. With --use_multiepoch_sampler, it does complicated calculations to try to keep sampling consistent across epochs with precisely the same ordering of samples regardless of how often the BigGAN job is started/stopped (eg on a cluster), but as one increases the total minibatch size and it progresses through an epoch, it tries to index data which doesn’t exist and crashes; I was unable to figure out how the calculations were going wrong, exactly.47

With that option disabled and larger total minibatches used, a different bug is triggered, leading to inscrutable crashes:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "", line 228, in <module>
  File "", line 225, in main
  File "", line 172, in run
    for i, (x, y) in enumerate(pbar):
  File "/root/BigGAN-PyTorch-mooch/", line 842, in progress
    for n, item in enumerate(items):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 631, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 601, in _get_batch
    return self.data_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/lib/python3.7/", line 179, in get
  File "/opt/conda/lib/python3.7/", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 274, in handler
RuntimeError: DataLoader worker (pid 21103) is killed by signal: Bus error.

There is no good workaround here: start­ing with small fast mini­batches com­pro­mises final qual­i­ty, while start­ing with big slow mini­batches may work but then costs far more com­pute. I did find that the G/D accu­mu­la­tions can be imbal­anced to allow increas­ing the G’s total mini­batch (which appears to be the key for bet­ter qual­i­ty) but then this risks desta­bi­liz­ing train­ing. These bugs need to be fixed before try­ing BigGAN for real.
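
Since the error message itself points at shared memory, one cheap diagnostic (not a fix for the bugs described above) is to check how much space the shared-memory filesystem actually has before launching a run. This is a minimal sketch assuming a Linux machine where shared memory is mounted at `/dev/shm`; the helper name `free_gib` is hypothetical.

```python
import shutil
from pathlib import Path

def free_gib(path="/dev/shm"):
    """Free space at `path` in GiB, or None if the path does not exist (eg non-Linux)."""
    if not Path(path).exists():
        return None
    return shutil.disk_usage(path).free / 2**30

shm_free = free_gib()                                   # None on systems without /dev/shm
here_free = free_gib(".")                               # same check on the current directory
missing = free_gib("/no/such/path/for/this/sketch")     # -> None

if shm_free is None:
    print("no /dev/shm on this system")
else:
    print(f"/dev/shm has {shm_free:.2f} GiB free")
```

Containers in particular often default to a very small `/dev/shm`, so a low number here would be worth ruling out before blaming the data loader.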

BigGAN: ImageNet→Danbooru2018-1K

In any case, I ran the 128px Ima­geNet→­Dan­booru2018-1K for ~6 GPU-days (or ~3 days on my 2×1080ti work­sta­tion) and the train­ing mon­tage indi­cates it was work­ing fine:

Train­ing mon­tage of the 128px Ima­geNet→­Dan­booru2018-1K; suc­cess­ful

Some­time after that, while con­tin­u­ing to play with imbal­anced mini­batches to avoid trig­ger­ing the iteration/crash bugs, it diverged badly and mod­e-col­lapsed into sta­t­ic, so I killed the run, as the point appears to have been made: trans­fer learn­ing is indeed pos­si­ble, and the speed of the adap­ta­tion sug­gests ben­e­fits to train­ing time by start­ing with a high­ly-trained model already.

BigGAN: 256px Danbooru2018-1K

More seri­ous­ly, I began train­ing a 256px model on Dan­booru2018-1K por­traits. This required rebuild­ing the HDF5 with 256px set­tings, and since I was­n’t doing trans­fer learn­ing, I used the BigGAN-deep archi­tec­ture set­tings since that has bet­ter results & is smaller than the orig­i­nal BigGAN.

My own 2×1080ti were inadequate for reasonable turnaround on training a 256px BigGAN from scratch (they would take something like 4+ months wallclock), so I decided to shell out for a big cloud instance. AWS/GCP are too expensive, so I used this as an opportunity to investigate an alternative provider, which typically has much lower prices. Setup was straightforward, and I found a nice instance: an 8×2080ti machine available for just $1.7/hour (AWS, for comparison, would charge closer to $2.16/hour for just 8 K80 halves). So I ran on their 8×2080ti instance from 2019-05-02 to 2019-06-03 ($1.7/hour; total: $1373.64).

That is ~250 GPU-days of training, although this is a misleading way to put it: the bill includes bandwidth/hard-drive in that total, and the GPU utilization was poor, so each ‘GPU-day’ is worth about a third less than with the 128px BigGAN (which had good GPU utilization), and the 2080tis were overkill. It should be possible to do much better with the same budget in the future.
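
The rough arithmetic behind these figures can be reproduced from the dates and prices quoted above. This is a back-of-the-envelope sketch which assumes the instance was rented continuously for the whole date range; the variable names are illustrative.

```python
from datetime import date

HOURLY_RATE = 1.70        # $/hour for the 8×2080ti instance
TOTAL_BILL = 1373.64      # $ total, including bandwidth/hard-drive charges
NUM_GPUS = 8

start, end = date(2019, 5, 2), date(2019, 6, 3)
wallclock_days = (end - start).days                   # 32 days
gpu_days = wallclock_days * NUM_GPUS                  # 256, ie the '~250 GPU-days'

rental_only = wallclock_days * 24 * HOURLY_RATE       # GPU rental alone, if run non-stop
other_charges = TOTAL_BILL - rental_only              # remainder: bandwidth, storage, etc.

print(wallclock_days, gpu_days)
print(f"rental alone: ${rental_only:.2f}; other charges: ${other_charges:.2f}")
print(f"overall cost per GPU-day: ${TOTAL_BILL / gpu_days:.2f}")
```

The gap between the rental-only figure and the total bill is consistent with the bill including non-GPU charges.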

The train­ing com­mand:

python --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
    --batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 --G_lr 1e-4 \
    --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 --D_ch 128 \
    --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 \
    --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 64 \
    --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 --sv_log_interval 1000 \
    --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 --no_fid \
    --num_inception_images 1 --augment --data_root ~/tmp --resume --experiment_name \

The sys­tem worked well but BigGAN turns out to have seri­ous per­for­mance bot­tle­necks (ap­par­ently in syn­chro­niz­ing batch­norm across GPUs) and did not make good use of the 8 GPUs, aver­ag­ing GPU uti­liza­tion ~30% accord­ing to nvidia-smi. (On my 2×1080tis with the 128px, GPU-utilization was closer to 95%.) In ret­ro­spect, I prob­a­bly should’ve switched to a less expen­sive instance like a 8×1080ti where it likely would’ve had sim­i­lar through­put but cost less.
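
A utilization figure like this can be estimated by polling `nvidia-smi` and averaging across GPUs. The sketch below parses the plain-number output of `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`; the sample readings are made up for illustration, and `poll_once` is only defined, not called, so the block runs on a machine without GPUs.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]

def parse_utilization(text):
    """Parse one utilization percentage per line, one line per GPU."""
    return [float(line) for line in text.splitlines() if line.strip()]

def mean_utilization(text):
    """Average utilization across GPUs, or None if there were no readings."""
    values = parse_utilization(text)
    return sum(values) / len(values) if values else None

def poll_once():
    """Query nvidia-smi once (requires an NVIDIA driver to be installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return mean_utilization(result.stdout)

# Made-up readings for an 8-GPU machine.
sample_output = "31\n28\n33\n29\n30\n27\n32\n30\n"
average = mean_utilization(sample_output)
effective_gpus = 8 * average / 100

print(average)          # 30.0
print(effective_gpus)   # 2.4
```

A single poll is noisy; averaging many polls over a few minutes of training gives a fairer picture.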

Train­ing pro­gressed well up until iter­a­tions #80–90k, when I began see­ing signs of mode col­lapse:

Train­ing mon­tage of the 256px Dan­booru2018-1K; semi­-suc­cess­ful (note when EMA begins to be used for sam­pling images at ~8s, and the mode col­lapse at the end)

I was unable to increase the mini­batch to more than ~500 because of the bugs, lim­it­ing what I could do against mode col­lapse, and I sus­pect the small mini­batch was why mode col­lapse was hap­pen­ing in the first place. (Gokaslan tried the last check­point I saved—#95,160—with the same set­tings, and ran it to #100,000 iter­a­tions and expe­ri­enced near-to­tal mode col­lapse.)
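
For reference, the total minibatch is just the per-step batch size multiplied by the number of gradient accumulations, so the settings in the training command above work out as follows. This is a small arithmetic sketch; the function names are illustrative.

```python
import math

def total_minibatch(batch_size, accumulations):
    """Total minibatch = per-step batch size × number of accumulated steps."""
    return batch_size * accumulations

def accumulations_needed(batch_size, target_total):
    """Smallest number of accumulations reaching at least the target total minibatch."""
    return math.ceil(target_total / batch_size)

run_256px = total_minibatch(56, 8)                  # the 256px command: 56 × 8
needed_256px = accumulations_needed(56, 2048)       # to reach n = 2048 at batch size 56
needed_128px = accumulations_needed(32, 2048)       # to reach n = 2048 at batch size 32

print(run_256px)      # 448
print(needed_256px)   # 37
print(needed_128px)   # 64
```

So the 256px run was at less than a quarter of the n = 2048 total minibatch.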

The last check­point I saved from before mode col­lapse was #83,520, saved on 2019-05-28 after ~24 wall­clock days (ac­count­ing for var­i­ous crashes & time set­ting up & tweak­ing).

Random samples, interpolation grids (not videos), and class-conditional samples can be generated using the sampling script; like the training script, it requires the exact architecture to be specified. I used the following command (many of the options are probably not necessary, but I didn’t know which):

python --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
    --batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 \
    --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 \
    --D_ch 128 --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 \
    --BN_eps 1e-5 --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier \
    --dim_z 64 --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 \
    --sv_log_interval 1000 --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 \
    --no_fid --num_inception_images 1 --skip_init --G_batch_size 32  --use_ema --G_eval_mode \
    --sample_random --sample_sheets --sample_interps --resume --experiment_name 256px

Ran­dom sam­ples are already well-rep­re­sented by the train­ing mon­tage. The inter­po­la­tions look sim­i­lar to StyleGAN inter­po­la­tions. The class-con­di­tional sam­ples are the most fun to look at because one can look at spe­cific char­ac­ters with­out the need to retrain the entire mod­el, which while only tak­ing a few hours at most, is a has­sle.

256px Danbooru2018-1K Samples

Inter­po­la­tion images and 5 char­ac­ter-spe­cific ran­dom sam­ples (Asuka, Holo, Rin, Chen, Ruri):

Ran­dom inter­po­la­tion sam­ples (256px BigGAN trained on 1000 Dan­booru2018 char­ac­ter por­traits)
Souryuu Asuka Lan­g­ley (Neon Gen­e­sis Evan­ge­lion), class #825 ran­dom sam­ples
Holo (Spice and Wolf), class #273 ran­dom sam­ples
Rin Tohsaka (Fate/Stay Night), class #891
Yakumo Chen (Touhou), class #123 ran­dom sam­ples
Ruri Hoshino (Mar­t­ian Suc­ces­sor Nades­ico), class #286 ran­dom sam­ples

256px BigGAN Downloads

Model & sam­ple down­loads:


Sar­cas­tic com­men­tary on BigGAN qual­ity by /u/Klockbox

The best results from the 128px BigGAN model look about as good as could be expected from 128px sam­ples; the 256px model is fairly good, but suffers from much more notice­able arti­fact­ing than 512px StyleGAN, and cost $1373 (a 256px StyleGAN would have been closer to $400 on AWS). In BigGAN’s defense, it had clearly not con­verged yet and could have ben­e­fited from much more train­ing and much larger mini­batch­es, had that been pos­si­ble. Qual­i­ta­tive­ly, look­ing at the more com­plex ele­ments of sam­ples, like hair ornaments/hats, I feel like BigGAN was doing a much bet­ter job of cop­ing with com­plex­ity & fine detail than StyleGAN would have at a sim­i­lar point.

How­ev­er, train­ing 512px por­traits or whole-Dan­booru images is infea­si­ble at this point: while the cost might be only a few thou­sand dol­lars, the var­i­ous bugs mean that it may not be pos­si­ble to sta­bly train to a use­ful qual­i­ty. It’s a dilem­ma: at small or easy domains, StyleGAN is much faster (if not bet­ter); but at large or hard domains, mode col­lapse is too risky and endan­gers the big invest­ment nec­es­sary to sur­pass StyleGAN.

To make BigGAN viable, it needs at least:

  • mini­batch size bugs fixed to enable up to n = 2048 (or larg­er, as gra­di­ent noise scale indi­cates)
  • 512px archi­tec­tures defined, to allow trans­fer learn­ing from the released Ten­sor­flow 512px Ima­geNet model
  • opti­miza­tion work to reduce over­head and allow rea­son­able GPU uti­liza­tion on >2-GPU sys­tems

With those done, it should be pos­si­ble to train 512px por­traits for <$1,000 and whole-Dan­booru images for <$10,000. (Given the release of Deep­Dan­booru as a Ten­sor­Flow mod­el, enabling an ani­me-spe­cific per­cep­tual loss, it would also be inter­est­ing to inves­ti­gate apply­ing pre­train­ing to BigGAN.)

See Also


For com­par­i­son, here are some of my older GAN or other NN attempts; as the qual­ity is worse than StyleGAN, I won’t bother going into detail­s—cre­at­ing the datasets & train­ing the ProGAN & tun­ing & trans­fer­-learn­ing were all much the same as already out­lined at length for the StyleGAN results.

Included are:

  • ProGAN

  • Glow


  • PokeGAN

  • Self-Attention-GAN-TensorFlow

  • VGAN

  • BigGAN unoffi­cial (offi­cial BigGAN is cov­ered above)

    • BigGAN-TensorFlow
    • BigGAN-PyTorch
  • GAN-QP

  • WGAN

  • IntroVAE


Using offi­cial imple­men­ta­tion:

  1. 2018-09-08, 512–1024px whole-A­suka images ProGAN sam­ples:

    1024px, whole-A­suka images, ProGAN
    512px whole-A­suka images, ProGAN
  2. 2018-09-18, 512px Asuka faces, ProGAN sam­ples:

    512px Asuka faces, ProGAN
  3. 2018-10-29, 512px Holo faces, ProGAN:

    Ran­dom sam­ples of 512px ProGAN Holo faces

    After gen­er­at­ing ~1k Holo faces, I selected the top decile (n = 103) of the faces (Imgur mir­ror):

    512px ProGAN Holo faces, ran­dom sam­ples from top decile (6×6)

    The top decile images are, nev­er­the­less, show­ing dis­tinct signs of both arti­fact­ing & overfitting/memorization of data points. Another 2 weeks proved this out fur­ther:

    ProGAN sam­ples of 512px Holo faces, after badly over­fit­ting (it­er­a­tion #10,325)

    Inter­po­la­tion video of the Octo­ber 2018 512px Holo face ProGAN; note the gross over­fit­ting indi­cated by the abrupt­ness of the inter­po­la­tions jump­ing from face (mode) to face (mode) and lack of mean­ing­ful inter­me­di­ate faces in addi­tion to the over­all blur­ri­ness & low visual qual­i­ty.

  4. 2019-01-17, Dan­booru2017 512px SFW images, ProGAN:

    512px SFW Dan­booru2017, ProGAN
  5. 2019-02-05 (stopped in order to train with the new StyleGAN code­base), the 512px anime face dataset used else­where, ProGAN:

    512px anime faces, ProGAN

    Interpolation video of the 2019-02-05 512px anime face ProGAN; while the image quality is low, the diversity is good & shows no overfitting/memorization or blatant mode collapse



Used the official Glow implementation.

Due to the enor­mous model size (4.2G­B), I had to mod­ify Glow’s set­tings to get train­ing work­ing rea­son­ably well, after exten­sive tin­ker­ing to fig­ure out what any meant:

{"verbose": true, "restore_path": "logs/model_4.ckpt", "inference": false, "logdir": "./logs", "problem": "asuka",
"category": "", "data_dir": "../glow/data/asuka/", "dal": 2, "fmap": 1, "pmap": 16, "n_train": 20000, "n_test": 1000,
"n_batch_train": 16, "n_batch_test": 50, "n_batch_init": 16, "optimizer": "adamax", "lr": 0.0005, "beta1": 0.9,
"polyak_epochs": 1, "weight_decay": 1.0, "epochs": 1000000, "epochs_warmup": 10, "epochs_full_valid": 3,
"gradient_checkpointing": 1, "image_size": 512, "anchor_size": 128, "width": 512, "depth": 13, "weight_y": 0.0,
"n_bits_x": 8, "n_levels": 7, "n_sample": 16, "epochs_full_sample": 5, "learntop": false, "ycond": false, "seed": 0,
"flow_permutation": 2, "flow_coupling": 1, "n_y": 1, "rnd_crop": false, "local_batch_train": 1, "local_batch_test": 1,
"local_batch_init": 1, "direct_iterator": true, "train_its": 1250, "test_its": 63, "full_test_its": 1000, "n_bins": 256.0, "top_shape": [4, 4, 768]}
{"epoch": 5, "n_processed": 100000, "n_images": 6250, "train_time": 14496, "loss": "2.0090", "bits_x": "2.0090", "bits_y": "0.0000", "pred_loss": "1.0000"}
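
As a sanity check when editing settings like these, the iteration counts in the log can be compared against the dataset and batch sizes. The sketch below parses abridged copies of the two JSON lines above (only the fields used) and checks relationships which the logged values happen to satisfy; that Glow derives the counts exactly this way is an assumption.

```python
import json
import math

# Abridged copies of the two log lines above (only the fields used here).
config = json.loads('{"n_train": 20000, "n_test": 1000, "n_batch_train": 16, '
                    '"local_batch_test": 1, "train_its": 1250, "test_its": 63, '
                    '"full_test_its": 1000}')
progress = json.loads('{"epoch": 5, "n_processed": 100000}')

checks = {
    # iterations per epoch = training images / training minibatch
    "train_its": math.ceil(config["n_train"] / config["n_batch_train"]) == config["train_its"],
    # test iterations appear to use the training minibatch size, rounded up
    "test_its": math.ceil(config["n_test"] / config["n_batch_train"]) == config["test_its"],
    # full test pass = test images / per-device test batch
    "full_test_its": math.ceil(config["n_test"] / config["local_batch_test"]) == config["full_test_its"],
    # images processed so far = epochs × training images
    "n_processed": progress["epoch"] * config["n_train"] == progress["n_processed"],
}
for name, ok in checks.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```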

An additional challenge was numerical instability in the inverting of matrices, giving rise to many ‘invertibility’ crashes.

Final sam­ple before I looked up the com­pute require­ments more care­fully & gave up on Glow:

Glow, Asuka faces, 5 epochs (2018-08-02)


offi­cial imple­men­ta­tion:

2018-12-15, 512px Asuka faces, fail­ure case


nshep­perd’s (un­pub­lished) mul­ti­-s­cale GAN with self­-at­ten­tion lay­ers, spec­tral nor­mal­iza­tion, and a few other tweaks:

PokeGAN, Asuka faces, 2018-11-16


SAGAN did not have an official implementation released at the time, so I used the Junho Kim implementation; 128px SAGAN, WGAN-LP loss, on Asuka faces & whole Asuka images:

Self-Attention-GAN-TensorFlow, whole Asuka, 2018-08-18
Training montage of the 2018-08-18 128px whole-Asuka SAGAN; possibly too-high LR
Self-Attention-GAN-TensorFlow, Asuka faces, 2018-09-13


The offi­cial VGAN code for Peng et al 2018 had not been released when I began try­ing VGAN, so I used akan­i­max’s imple­men­ta­tion.

The vari­a­tional dis­crim­i­na­tor bot­tle­neck, along with self­-at­ten­tion lay­ers and pro­gres­sive grow­ing, is one of the few strate­gies which per­mit 512px images, and I was intrigued to see that it worked rel­a­tively well, although I ran into per­sis­tent issues with insta­bil­ity & mode col­lapse. I sus­pect that VGAN could’ve worked bet­ter than it did with some more work.

akan­i­max VGAN, anime faces, 2018-12-25

BigGAN unofficial

BigGAN’s official implementation & models were not released until late March 2019 (nor the semi-official compare_gan implementation until February 2019), so I experimented with 2 unofficial implementations in late 2018–early 2019.


Junho Kim imple­men­ta­tion; 128px spec­tral norm hinge loss, anime faces:

Kim BigGAN-PyTorch, anime faces, 2019-01-17

This one never worked well at all, and I am still puz­zled what went wrong.


Aaron Leong’s PyTorch BigGAN implementation (not the official BigGAN implementation). As it’s class-conditional, I faked having 1000 classes by constructing a variant anime face dataset: taking the top 1000 characters by tag count in the Danbooru2017 metadata, I then filtered for those character tags 1 by 1, and copied them & cropped faces into matching subdirectories 1–1000. This let me try out both faces & whole images. I also attempted to hack in gradient accumulation for big minibatches to make it a true BigGAN implementation, but it didn’t help much; the problem here might simply have been that I couldn’t run it long enough.

Results upon aban­don­ing:

Leong BigGAN-PyTorch, 1000-class anime char­ac­ter dataset, 2018-11-30 (#314,000)
Leong BigGAN-PyTorch, 1000-class anime face dataset, 2018-12-24 (#1,006,320)


Implementation of GAN-QP:

GAN-QP, 512px Asuka faces, 2018-11-21

Train­ing oscil­lated enor­mous­ly, with all the sam­ples closely linked and chang­ing simul­ta­ne­ous­ly. This was despite the check­point model being enor­mous (551MB) and I am sus­pi­cious that some­thing was seri­ously wrong—ei­ther the model archi­tec­ture was wrong (too many lay­ers or fil­ter­s?) or the learn­ing rate was many orders of mag­ni­tude too large. Because of the small mini­batch, progress was diffi­cult to make in a rea­son­able amount of wall­clock time, so I moved on.


WGAN official implementation; I did most of the early anime face work with WGAN on a different machine and didn’t keep copies. However, a sample from a short run gives an idea of what WGAN tended to look like on anime runs:

WGAN, 256px Asuka faces, iter­a­tion 2100


IntroVAE is a hybrid GAN-VAE architecture introduced in mid-2018 by Huang et al 2018, with the official PyTorch implementation released in April 2019. It attempts to reuse the encoder-decoder for an adversarial loss as well, to combine the best of both worlds: the principled stable training & reversible encoder of the VAE with the sharpness & high quality of a GAN.

Quality-wise, they show IntroVAE works on CelebA & LSUN BEDROOM at up to 1024px resolution with results they claim are comparable to ProGAN. Performance-wise, for 512px, they give a runtime of 7 days with a minibatch n = 12, or presumably 3 GPUs (since their 1024px run script implies they used 4 GPUs and I can fit a minibatch of n = 4 onto 1×1080ti, so 3 GPUs would be consistent with n = 12), and so 21 GPU-days.

I adapted the 256px sug­gested set­tings for my 512px anime por­traits dataset:

python --hdim=512 --output_height=512 --channels='32, 64, 128, 256, 512, 512, 512' --m_plus=120 \
    --weight_rec=0.05 --weight_kl=1.0 --weight_neg=0.5 --num_vae=0 \
    --dataroot=/media/gwern/Data2/danbooru2018/portrait/1/ --trainsize=302652 --test_iter=1000 --save_iter=1 \
    --start_epoch=0 --batchSize=4 --nrow=8 --lr_e=0.0001 --lr_g=0.0001 --cuda --nEpochs=500
# ...====> Cur_iter: [187060]: Epoch[3](5467/60531): time: 142675: Rec: 19569, Kl_E: 162, 151, 121, Kl_G: 151, 121,

There was a minor bug in the code­base where it would crash on try­ing to print out the log data, per­haps because it assumes multi-GPU and I was run­ning on 1 GPU, and was try­ing to index into an array which was actu­ally a sim­ple scalar, which I fixed by remov­ing the index­ing:

-        info += 'Rec: {:.4f}, '.format([0])
-        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format([0],
-                      [0],[0])
-        info += 'Kl_G: {:.4f}, {:.4f}, '.format([0],[0])
+        info += 'Rec: {:.4f}, '.format(
+        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(,
+                      ,
+        info += 'Kl_G: {:.4f}, {:.4f}, '.format(,

Sam­ple results after ~1.7 GPU-days:

IntroVAE, 512px anime por­trait (n = 4, 3 sets: real dat­a­points, encod­ed→de­coded ver­sions of the real dat­a­points, and ran­dom gen­er­ated sam­ples)

By this point, StyleGAN would have been generating recognizable faces from scratch, while the IntroVAE random samples are not even face-like, and the IntroVAE training curve was not improving at a notable rate. IntroVAE has some hyperparameters which could probably be tuned better for the anime portrait faces (they briefly discuss the use of the --num_vae option to run in classic VAE mode to let you tune the VAE-related hyperparameters before enabling the GAN-like part), but it should be fairly insensitive overall to hyperparameters, so tuning is unlikely to help all that much. So IntroVAE probably can’t replace StyleGAN (yet?) for general-purpose image synthesis. This demonstrates again that everything seems to work on CelebA these days, and that just because something works on a photographic dataset does not mean it’ll work on other datasets. Image generation papers should probably branch out some more and consider non-photographic tests.

  1. Turns out that when train­ing goes really wrong, you can crash many GAN imple­men­ta­tions with either a seg­fault, inte­ger over­flow, or divi­sion by zero error.↩︎

  2. StackGAN/StackGAN++/PixelCNN et al are diffi­cult to run as they require a unique image embed­ding which could only be com­puted in the unmain­tained Torch frame­work using Reed’s prior work on a joint tex­t+im­age embed­ding which how­ever does­n’t run on any­thing but the Birds & Flow­ers datasets, and so no one has ever, as far as I am aware, run those imple­men­ta­tions on any­thing else—cer­tainly I never man­aged to despite quite a few hours try­ing to reverse-engi­neer the embed­ding & var­i­ous imple­men­ta­tions.↩︎

  3. Be sure to check out .↩︎

  4. Glow’s reported results required >40 GPU-weeks; BigGAN’s total com­pute is unclear as it was trained on a TPUv3 Google clus­ter but it would appear that a 128px BigGAN might be ~4 GPU-months assum­ing hard­ware like an 8-GPU machine, 256px ~8 GPU-months, and 512px ≫8 GPU-months, with VRAM being the main lim­it­ing fac­tor for larger mod­els (although pro­gres­sive grow­ing might be able to cut those esti­mates).↩︎

  5. is an old & small CNN trained to pre­dict a few -booru tags on anime images, and so pro­vides an embed­ding—but not a good one. The lack of a good embed­ding is the major lim­i­ta­tion for anime deep learn­ing as of Feb­ru­ary 2019. (Deep­Dan­booru, while per­form­ing well appar­ent­ly, has not yet been used for embed­dings.) An embed­ding is nec­es­sary for tex­t→im­age GANs, image searches & near­est-neigh­bor checks of over­fit­ting, FID errors for objec­tively com­par­ing GANs, mini­batch dis­crim­i­na­tion to help the D/provide an aux­il­iary loss to sta­bi­lize learn­ing, anime style trans­fer (both for its own sake & for cre­at­ing a ‘StyleDan­booru2018’ to reduce tex­ture cheat­ing), encod­ing into GAN latent spaces for manip­u­la­tion, data clean­ing (to detect anom­alous dat­a­points like failed face crop­s), per­cep­tual losses for encoders or as an addi­tional aux­il­iary loss/pretraining (like , which trains a Gen­er­a­tor on a per­cep­tual loss and does GAN train­ing only for fine­tun­ing) etc. A good tag­ger is also a good start­ing point for doing pix­el-level seman­tic seg­men­ta­tion (via “weak super­vi­sion”), which meta­data is key for train­ing some­thing like Nvidi­a’s GauGAN suc­ces­sor to pix2pix (; source).↩︎

  6. Tech­ni­cal note: I typ­i­cally train NNs using my work­sta­tion with 2×1080ti GPUs. For eas­ier com­par­ison, I con­vert all my times to single-GPU equiv­a­lent (ie “6 GPU-weeks” means 3 realtime/wallclock weeks on my 2 GPUs).↩︎

  7. observes (§4 “Using pre­ci­sion and recall to ana­lyze and improve StyleGAN”) that StyleGAN with pro­gres­sive grow­ing dis­abled does work but at some cost to precision/recall qual­ity met­rics; whether this reflects infe­rior per­for­mance on a given train­ing bud­get or an inher­ent limit—BigGAN and other self­-at­ten­tion-us­ing GANs do not use pro­gres­sive grow­ing at all, sug­gest­ing it is not truly nec­es­sary—is not inves­ti­gat­ed. In Decem­ber 2019, StyleGAN 2 suc­cess­fully dropped pro­gres­sive grow­ing entirely at mod­est per­for­mance cost.↩︎

  8. This has con­fused some peo­ple, so to clar­ify the sequence of events: I trained my anime face StyleGAN and posted notes on Twit­ter, releas­ing an early mod­el; road­run­ner01 gen­er­ated an inter­po­la­tion video using said model (but a differ­ent ran­dom seed, of course); this inter­po­la­tion video was retweeted by the Japan­ese Twit­ter user _Ry­obot, upon which it went viral and was ‘liked’ by Elon Musk, fur­ther dri­ving viral­ity (19k reshares, 65k likes, 1.29m watches as of 2019-03-22).↩︎

  9. Google Colab is a free service which includes free GPU time (up to 12 hours on a small GPU). Especially for people who do not have a reasonably capable GPU on their personal computers (such as all Apple users) or do not want to engage in the admitted hassle of renting a real cloud GPU instance, Colab can be a great way to play with a pretrained model, like generating GPT-2-117M text completions or StyleGAN interpolation videos, or prototype on tiny problems.

    However, it is a bad idea to try to train real models, like 512–1024px StyleGANs, on a Colab instance: the GPUs are low-VRAM & far slower (6 hours per StyleGAN tick!), and the instance is unwieldy to work with (as one must save snapshots constantly to restart when the session runs out), doesn’t have a real command-line, etc. Colab is just barely adequate for perhaps 1 or 2 ticks of transfer learning, but not more. If you harbor greater ambitions but still refuse to spend any money (rather than time), Kaggle has a similar service with P100 GPU slices rather than K80s. Otherwise, one needs to get access to real GPUs.↩︎

  10. Curi­ous­ly, the ben­e­fit of many more FC lay­ers than usual may have been stum­bled across before: IllustrationGAN found that adding some FC lay­ers seemed to help their DCGAN gen­er­ate anime faces, and when I & Feep­ingCrea­ture exper­i­mented with adding 2–4 FC lay­ers to WGAN-GP along IllustrationGAN’s lines, it did help our lack­lus­ter results, and at the time I spec­u­lated that “the ful­ly-con­nected lay­ers are trans­form­ing the latent-z/noise into a sort of global tem­plate which the sub­se­quent con­vo­lu­tion lay­ers can then fill in more local­ly.” But we never dreamed of going as deep as 8!↩︎

  11. The ProGAN/StyleGAN code­base report­edly does work with con­di­tion­ing, but none of the papers report on this func­tion­al­ity and I have not used it myself.↩︎

  12. The latent embed­ding z is usu­ally gen­er­ated in about the sim­plest pos­si­ble way: draws from the Nor­mal dis­tri­b­u­tion, . A is some­times used instead. There is no good jus­ti­fi­ca­tion for this and some rea­son to think this can be bad (how does a GAN eas­ily map a dis­crete or binary latent fac­tor, such as the pres­ence or absence of the left ear, onto a Nor­mal vari­able?).

    The BigGAN paper explores alternatives, finding improvements in training time and/or final quality from using instead (in ascending order): a Normal + binary Bernoulli (p = 0.5; personal communication, Brock) variable, a binary (Bernoulli), and a rectified Gaussian (sometimes called a “censored normal” even though that sounds like a truncated normal rather than the rectified one). The rectified Gaussian distribution “outperforms (in terms of IS) by 15–20% and tends to require fewer iterations.”

    The downside is that the “truncation trick”, which yields even larger average improvements in image quality (at the expense of diversity), doesn’t quite apply, and the rectified Gaussian sans truncation produced similar results to the Normal+truncation, so BigGAN reverted to the default Normal distribution+truncation (personal communication).

    The truncation trick either directly applies to some of the other distributions, particularly the rectified Gaussian, or could easily be adapted, possibly yielding an improvement over either approach. The rectified Gaussian can be truncated just like the default Normals can. And for the Bernoulli, one could decrease p during the generation, or what is probably equivalent, re-sample whenever the variance (ie squared sum) of all the Bernoulli latent variables exceeds a certain constant. (With p = 0.5, a latent vector of 512 Bernoullis would on average sum up to simply 512 × 0.5 = 256, with the 2.5%–97.5% quantiles being 234–278, so a ‘truncation trick’ here might be throwing out every vector with a sum above, say, the 80% quantile of 266.)
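
    The Bernoulli arithmetic in this paragraph can be checked with a few lines of standard-library Python. This is only a sketch: the function names are hypothetical, and reading the ‘truncation trick’ as simple rejection sampling on the vector sum is my assumption for illustration, not anything taken from the BigGAN code.

```python
import math
import random

def binomial_quantile(n, p, q):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n

def sample_truncated_bernoulli(n, p, max_sum, rng):
    """Assumed 'truncation': resample a 0/1 latent vector until its sum <= max_sum."""
    while True:
        z = [1 if rng.random() < p else 0 for _ in range(n)]
        if sum(z) <= max_sum:
            return z

n, p = 512, 0.5
mean_sum = n * p                          # expected number of 1s in the vector
low = binomial_quantile(n, p, 0.025)      # lower end of the central 95% range
high = binomial_quantile(n, p, 0.975)     # upper end of the central 95% range
cutoff = binomial_quantile(n, p, 0.80)    # example truncation threshold
z = sample_truncated_bernoulli(n, p, cutoff, random.Random(0))
print(mean_sum, low, high, cutoff, sum(z))
```

    Rejection sampling is the simplest option here; lowering p during generation, the other option mentioned, would need a different sampler.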

    One also won­ders about vec­tors which draw from mul­ti­ple dis­tri­b­u­tions rather than just one. Could the StyleGAN 8-FC-layer learned-la­ten­t-vari­able be reverse-engi­neered? Per­haps the first layer or two merely con­verts the nor­mal input into a more use­ful dis­tri­b­u­tion & parameters/training could be saved or insight gained by imi­tat­ing that.↩︎

  13. Which raises the ques­tion: if you added any or all of those fea­tures, would StyleGAN become that much bet­ter? Unfor­tu­nate­ly, while the­o­rists & prac­ti­tion­ers have had many ideas, so far the­ory has proven more fecund than fatidi­cal and the large-s­cale GAN exper­i­ments nec­es­sary to truly test the sug­ges­tions are too expen­sive for most. Half of these sug­ges­tions are great ideas—but which half?↩︎

  14. For more on the choice of con­vo­lu­tion layers/kernel sizes, see Karpa­thy’s 2015 notes for “CS231n: Con­vo­lu­tional Neural Net­works for Visual Recog­ni­tion”, or take a look at these Con­vo­lu­tion ani­ma­tions & Yang’s inter­ac­tive “Con­vo­lu­tion Visu­al­izer”.↩︎

  15. These observations apply only to the Generator in GANs (which is what we primarily care about); curiously, there’s some reason to think that GAN Discriminators are in fact mostly memorizing (see later).↩︎

  16. A pos­si­ble alter­na­tive is ESRGAN ().↩︎

  17. Based on eye­balling the ‘cat’ bar graph in Fig­ure 3 of .↩︎

  18. CATS offer an amus­ing instance of the dan­gers of data aug­men­ta­tion: ProGAN used hor­i­zon­tal flipping/mirroring for every­thing, because why not? This led to strange Cyril­lic text cap­tions show­ing up in the gen­er­ated cat images. Why not Latin alpha­bet cap­tions? Because every cat image was being shown mir­rored as well as nor­mal­ly! For StyleGAN, mir­ror­ing was dis­abled, so now the lol­cat cap­tions are rec­og­niz­ably Latin alpha­bet­i­cal, and even almost Eng­lish words. This demon­strates that even datasets where left/right does­n’t seem to mat­ter, like cat pho­tos, can sur­prise you.↩︎

  19. I esti­mated the total cost using AWS EC2 pre­emptible hourly costs on 2019-03-15 as fol­lows:

    • 1 GPU: p2.xlarge instance in us-east-2a, Half of a K80 (12GB VRAM): $0.3235/hour
    • 2 GPUs: NA—there is no P2 instance with 2 GPUs, only 1/8/16
    • 8 GPUs: p2.8xlarge in us-east-2a, 8 halves of K80s (12GB VRAM each): $2.160/hour

    As usu­al, there is sub­lin­ear scal­ing, and larger instances cost dis­pro­por­tion­ately more, because one is pay­ing for faster wall­clock train­ing (time is valu­able) and for not hav­ing to cre­ate a dis­trib­uted infra­struc­ture which can exploit the cheap single-GPU instances.

    This cost esti­mate does not count addi­tional costs like hard drive space. In addi­tion to the dataset size (the StyleGAN data encod­ing is ~18× larger than the raw data size, so a 10GB folder of images → 200GB of .tfrecords), you would need at least 100GB HDD (50GB for the OS, and 50GB for checkpoints/images/etc to avoid crashes from run­ning out of space).↩︎
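
    For budgeting, the figures in this footnote can be plugged into a small calculator. This is a rough sketch under stated assumptions: the function names are hypothetical, the prices are the 2019-03-15 spot prices quoted above (which will have changed), and the 18× encoding ratio and 50GB+50GB overhead are the approximate figures from the text.

```python
# Hourly preemptible prices quoted in the footnote (us-east-2a, 2019-03-15).
PRICE_PER_HOUR = {1: 0.3235, 8: 2.160}  # number of GPUs -> USD/hour

def training_cost_usd(wallclock_hours, n_gpus):
    """Instance cost for a run of the given wallclock length."""
    return wallclock_hours * PRICE_PER_HOUR[n_gpus]

def disk_needed_gb(raw_dataset_gb, encoding_ratio=18, os_gb=50, scratch_gb=50):
    """Encoded .tfrecords size plus OS and checkpoint/sample headroom."""
    return raw_dataset_gb * encoding_ratio + os_gb + scratch_gb

one_week = 7 * 24  # hypothetical run length in wallclock hours
print(training_cost_usd(one_week, 1))  # one week on the 1-GPU instance
print(training_cost_usd(one_week, 8))  # one week on the 8-GPU instance
print(disk_needed_gb(10))              # disk for a 10GB folder of images
```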

  20. I regard this as a flaw in StyleGAN & TF in gen­er­al. Com­put­ers are more than fast enough to load & process images asyn­chro­nously using a few worker threads, and work­ing with a direc­tory of images (rather than a spe­cial binary for­mat 10–20× larg­er) avoids impos­ing seri­ous bur­dens on the user & hard dri­ve. PyTorch GANs almost always avoid this mis­take, and are much more pleas­ant to work with as one can freely mod­ify the dataset between (and even dur­ing) runs.↩︎

  21. For exam­ple, my Dan­booru2018 anime por­trait dataset is 16GB, but the StyleGAN encoded dataset is 296GB.↩︎

  22. This may be why some peo­ple report that StyleGAN just crashes for them & they can’t fig­ure out why. They should try chang­ing their dataset JPG ↔︎ PNG.↩︎

  23. That is, in train­ing G, the G’s fake images must be aug­mented before being passed to the D for rat­ing; and in train­ing D, both real & fake images must be aug­mented the same way before being passed to D. Pre­vi­ous­ly, all GAN researchers appear to have assumed that one should only aug­ment real images before pass­ing to D dur­ing D train­ing, which con­ve­niently can be done at dataset cre­ation; unfor­tu­nate­ly, this hid­den assump­tion turns out to be about the most harm­ful way pos­si­ble!↩︎
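
    The difference between the two regimes is mostly a matter of where the augmentation call sits. A minimal sketch, with toy list-of-rows ‘images’ and a random horizontal flip standing in for a real augmentation; all function names are hypothetical, and in a real framework the augmentation would also have to be differentiable so that gradients can flow back to G.

```python
import random

def augment(image, rng):
    """Toy augmentation: random horizontal flip of a list-of-rows image."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

def discriminator_inputs(real_images, fake_images, rng):
    """D training step: BOTH real and fake images are augmented before D sees them."""
    augmented_reals = [augment(x, rng) for x in real_images]
    augmented_fakes = [augment(x, rng) for x in fake_images]
    return augmented_reals, augmented_fakes

def generator_inputs(fake_images, rng):
    """G training step: G's fakes are augmented before D rates them."""
    return [augment(x, rng) for x in fake_images]

def old_style_discriminator_inputs(real_images, fake_images, rng):
    """The earlier assumption: only reals are augmented (eg at dataset creation)."""
    return [augment(x, rng) for x in real_images], list(fake_images)
```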

  24. I would describe the dis­tinc­tions as: Soft­ware 0.0 was imper­a­tive pro­gram­ming for ham­mer­ing out clock­work mech­a­nism; Soft­ware 1.0 was declar­a­tive pro­gram­ming with spec­i­fi­ca­tion of pol­i­cy; and Soft­ware 2.0 is deep learn­ing by gar­den­ing loss func­tions (with every­thing else, from model arch to which dat­a­points to label ide­ally learned end-to-end). Con­tin­u­ing the the­me, we might say that dia­logue with mod­els, like , are “Soft­ware 3.0”…↩︎

  25. But you may not want to–re­mem­ber the lol­cat cap­tions!↩︎

  26. Note: If you use a different command to resize, check it thoroughly. With ImageMagick, if you use the ^ operator like -resize 512x512^, you will not get exactly 512×512px images as you need; while if you use the ! operator like -resize 512x512!, the images will be exactly 512×512px but the aspect ratios will be distorted to make the images fit, and this may confuse anything you are training by introducing unnecessary meaningless distortions & will make any generated images look bad.↩︎
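
    To see why neither operator alone gives what is needed, the output geometry can be approximated by hand. This sketch only models the arithmetic of the two flags (fill the box vs ignore the aspect ratio); the function names are hypothetical and ImageMagick’s exact rounding may differ by a pixel.

```python
def resize_fill(width, height, target_w, target_h):
    """Approximate '-resize WxH^': scale until the image covers the target box."""
    scale = max(target_w / width, target_h / height)
    return round(width * scale), round(height * scale)

def resize_force(width, height, target_w, target_h):
    """Approximate '-resize WxH!': exact target size, aspect ratio ignored."""
    return target_w, target_h

def aspect_distortion(width, height, new_w, new_h):
    """New aspect ratio divided by the old one; 1.0 means undistorted."""
    return (new_w / new_h) / (width / height)

# Hypothetical 600x800px portrait resized towards 512x512px:
print(resize_fill(600, 800, 512, 512))    # one side is 512, the other is larger
print(resize_force(600, 800, 512, 512))   # exactly 512x512, but stretched
print(aspect_distortion(600, 800, 512, 512))
```

    Whatever command is used, checking the dimensions of the output files before encoding the dataset is cheap insurance.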

  27. If you are using Python 2, you will get print syn­tax error mes­sages; if you are using Python 3–3.6, you will get ‘type hint’ errors.↩︎

  28. Stas Pod­gorskiy has demon­strated that the StyleGAN 2 cor­rec­tion can be reverse-engi­neered and applied back to StyleGAN 1 gen­er­a­tors if nec­es­sary.↩︎

  29. This makes it con­form to a trun­cated nor­mal dis­tri­b­u­tion; why trun­cated rather than rectified/winsorized at a max like 0.5 or 1.0 instead? Because then many, pos­si­bly most, of the latent vari­ables would all be at the max, instead of smoothly spread out over the per­mit­ted range.↩︎
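
    The pile-up at the maximum is easy to demonstrate. A minimal standard-library sketch, with hypothetical function names and an arbitrary limit of 0.5; it illustrates only the distinction drawn here, not how any particular StyleGAN script implements truncation.

```python
import random

def truncated_normal(rng, limit):
    """Truncation: resample until the draw falls inside [-limit, limit]."""
    while True:
        z = rng.gauss(0.0, 1.0)
        if abs(z) <= limit:
            return z

def clipped_normal(rng, limit):
    """Rectifying/winsorizing: clamp the draw into [-limit, limit]."""
    return max(-limit, min(limit, rng.gauss(0.0, 1.0)))

rng = random.Random(0)
limit, n = 0.5, 10_000
truncated = [truncated_normal(rng, limit) for _ in range(n)]
clipped = [clipped_normal(rng, limit) for _ in range(n)]

# Fraction of latent values sitting exactly at the boundary.
at_max_truncated = sum(abs(z) == limit for z in truncated) / n
at_max_clipped = sum(abs(z) == limit for z in clipped) / n
print(at_max_truncated, at_max_clipped)
```

    With a limit this tight, more than half of the clipped values land exactly on ±0.5, while the truncated values stay spread over the permitted range.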

  30. No mini­batches are used, so this is much slower than nec­es­sary.↩︎

  31. The question is not whether one is to start with an initialization at all, but whether to start with one which does everything poorly, or one which does a few similar things well. Similarly, from a Bayesian statistics perspective, the question of what prior to use is one that everyone faces; however, many approaches sweep it under the rug and effectively assume a default flat prior that is consistently bad and optimal for no meaningful problem ever.↩︎

  32. ADA/StyleGAN3 is report­edly much more sam­ple-effi­cient and reduces the need for trans­fer learn­ing: . But if a rel­e­vant model is avail­able, it should still be used. Back­port­ing the ADA data aug­men­ta­tion trick to StyleGAN1–2 will be a major upgrade.↩︎

  33. There are more real Asuka images than Holo to begin with, but there is no par­tic­u­lar rea­son for the 10× data aug­men­ta­tion com­pared to the Holo’s 3×—the data aug­men­ta­tions were just done at differ­ent times and hap­pened to have less or more aug­men­ta­tions enabled.↩︎

  34. A famous exam­ple is char­ac­ter designer Yoshiyuki Sadamoto demon­strat­ing how to turn () into (Evan­ge­lion).↩︎

  35. It turns out that this latent vec­tor trick does work. Intrigu­ing­ly, it works even bet­ter to do ‘model aver­ag­ing’ or ‘model blend­ing’ (/, Pinkney & Adler 2020): retrain model A on dataset B, and then take a weighted aver­age of the 2 mod­els (you aver­age them, para­me­ter by para­me­ter, and remark­ably, that Just Works, or you can swap out lay­ers between mod­el­s), and then you can cre­ate faces which are arbi­trar­ily in between A and B. So for exam­ple, you can blend FFHQ/Western-animation faces (Colab note­book), ukiy­o-e/FFHQ faces, furries/foxes/FFHQ faces, or even furries/foxes/FFHQ/anime/ponies.↩︎
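
    The averaging itself is almost trivially simple, which is part of what makes it surprising. A sketch assuming each model is a dict mapping layer names to flat lists of weights with identical shapes; the layer names and helper functions are hypothetical, not taken from any of the linked notebooks.

```python
def blend_models(params_a, params_b, weight_b):
    """Parameter-by-parameter weighted average of two same-shaped models."""
    assert params_a.keys() == params_b.keys()
    return {
        name: [(1 - weight_b) * a + weight_b * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }

def swap_layers(params_a, params_b, layers_from_b):
    """Alternative: take whole layers from B and everything else from A."""
    return {name: (params_b if name in layers_from_b else params_a)[name]
            for name in params_a}

# Toy 'models': A is the original, B is A retrained on dataset B.
model_a = {"coarse": [0.0, 2.0], "fine": [1.0, 1.0]}
model_b = {"coarse": [2.0, 4.0], "fine": [3.0, 5.0]}
halfway = blend_models(model_a, model_b, 0.5)
mixed = swap_layers(model_a, model_b, {"fine"})
print(halfway, mixed)
```

    Sweeping weight_b from 0 to 1 gives the ‘arbitrarily in between’ behavior described above.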

  36. In ret­ro­spect, this should­n’t’ve sur­prised me.↩︎

  37. There is for other architectures like flow-based ones such as Glow, and this is one of their benefits. While the requirement to be made out of building blocks which can be run backwards & forwards equally well, to be ‘invertible’, is currently extremely expensive and the results not competitive either in final image quality or compute requirements, the invertibility means that encoding an arbitrary real image to get its inferred latents Just Works™ and one can easily morph between 2 arbitrary images, or encode an arbitrary image & edit it in the latent space to do things like add/remove glasses from a face or create an opposite-sex version.↩︎

  38. This final approach is, inter­est­ing­ly, the his­tor­i­cal rea­son back­prop­a­ga­tion was invent­ed: it cor­re­sponds to plan­ning in a model. For exam­ple, in plan­ning the flight path of an air­plane (/): the des­ti­na­tion or ‘out­put’ is fixed, the aero­dy­nam­ic­s+­geog­ra­phy or ‘model para­me­ters’ are also fixed, and the ques­tion is what actions deter­min­ing a flight path will reduce the loss func­tion of time or fuel spent. One starts with a ran­dom set of actions pick­ing a ran­dom flight path, runs it for­ward through the envi­ron­ment mod­el, gets a final time/fuel spent, and then back­prop­a­gates through the model to get the gra­di­ents for the flight path, adjust­ing the flight path towards a new set of actions which will slightly reduce the time/fuel spent; the new actions are used to plan out the flight to get a new loss, and so on, until a local min­i­mum of the actions has been found. This works with non-s­to­chas­tic prob­lems; for sto­chas­tic ones where the path can’t be guar­an­teed to be exe­cut­ed, “mod­el-pre­dic­tive con­trol” can be used to replan at every step and exe­cute adjust­ments as nec­es­sary. Another inter­est­ing use of back­prop­a­ga­tion for out­puts is which tack­les the long-s­tand­ing prob­lem of how to get NNs to out­put sets rather than list out­puts by gen­er­at­ing a pos­si­ble set out­put & refin­ing it via back­prop­a­ga­tion.↩︎
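
    The planning loop can be sketched on a toy problem. Everything here is an assumption made for illustration: a 1-dimensional ‘flight’ where each action adds to the position, a loss of squared distance from the destination plus a fuel penalty, and gradients written out by hand (they are what backpropagating through this trivial model would give). The model itself is never changed; only the actions are.

```python
import random

def rollout(x0, actions):
    """Run the fixed environment model forward: each action moves the position."""
    x = x0
    for a in actions:
        x = x + a
    return x

def loss(x0, actions, target, fuel_weight):
    """Squared miss distance at the end plus a fuel cost on every action."""
    miss = rollout(x0, actions) - target
    return miss ** 2 + fuel_weight * sum(a * a for a in actions)

def loss_gradient(x0, actions, target, fuel_weight):
    """Gradient of the loss with respect to each action."""
    miss = rollout(x0, actions) - target
    return [2 * miss + 2 * fuel_weight * a for a in actions]

def plan(x0, target, steps, fuel_weight=0.1, lr=0.01, iterations=500, seed=0):
    """Start from random actions and repeatedly nudge them down the gradient."""
    rng = random.Random(seed)
    actions = [rng.uniform(-1, 1) for _ in range(steps)]
    initial_loss = loss(x0, actions, target, fuel_weight)
    for _ in range(iterations):
        grad = loss_gradient(x0, actions, target, fuel_weight)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions, initial_loss

actions, initial_loss = plan(x0=0.0, target=100.0, steps=10)
final_loss = loss(0.0, actions, 100.0, 0.1)
print(initial_loss, final_loss, rollout(0.0, actions))
```

    The fuel penalty means the plan deliberately stops slightly short of the destination, which is the local minimum of the actions for this loss.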

  39. SGD is com­mon, but a sec­ond-order algo­rithm like is often used in these appli­ca­tions in order to run as few iter­a­tions as pos­si­ble.↩︎

  40. shows that BigGAN/StyleGAN latent embed­dings can also go beyond what one might expect, to include zooms, trans­la­tions, and other trans­forms.↩︎

  41. Flow mod­els have other advan­tages, mostly stem­ming from the max­i­mum like­li­hood train­ing objec­tive. Since the image can be prop­a­gated back­wards and for­wards loss­less­ly, instead of being lim­ited to gen­er­at­ing ran­dom sam­ples like a GAN, it’s pos­si­ble to cal­cu­late the exact prob­a­bil­ity of an image, enabling max­i­mum like­li­hood as a loss to opti­mize, and drop­ping the Dis­crim­i­na­tor entire­ly. With no GAN dynam­ics, there’s no worry about weird train­ing dynam­ics, and the like­li­hood loss also for­bids ‘mode drop­ping’: the flow model can’t sim­ply con­spire with a Dis­crim­i­na­tor to for­get pos­si­ble images.↩︎

  42. StyleGAN 2 is more com­pu­ta­tion­ally expen­sive but Kar­ras et al opti­mized the code­base to make up for it, keep­ing total com­pute con­stant.↩︎

  43. Back­up-backup mir­ror: rsync rsync:// ./↩︎

  44. Ima­geNet requires you to sign up & be approved to down­load from them, but 2 months later I have still heard noth­ing back. So I used the data from ILSVRC2012_img_train.tar (MD5: 1d675b47d978889d74fa0da5fadfb00e; 138GB) which I down­loaded from the Ima­geNet LSVRC 2012 Train­ing Set (Ob­ject Detec­tion) tor­rent.↩︎
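
    Since the archive comes from a torrent rather than the official site, it is worth checking the download against the MD5 above before unpacking 138GB. A small sketch using Python’s hashlib; the helper names are hypothetical, and the file is read in chunks so it does not need to fit in RAM.

```python
import hashlib

EXPECTED_MD5 = "1d675b47d978889d74fa0da5fadfb00e"  # MD5 quoted in this footnote

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file incrementally, 1MB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected=EXPECTED_MD5):
    """True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()

# Usage (path is wherever the torrent client saved the archive):
#   verify("ILSVRC2012_img_train.tar")
```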

  45. Dan­booru can clas­sify the same char­ac­ter under mul­ti­ple tags: for exam­ple, Sailor Moon char­ac­ters are tagged under their “Sailor X” name for images of their trans­formed ver­sion, and their real names for ‘civil­ian’ images (eg ‘Sailor Venus’ or ‘Cure Moon­light’, the for­mer of which I merged with ‘Aino Minako’). Some pop­u­lar fran­chises have many vari­ants of each char­ac­ter: the Fate fran­chise, espe­cially with the suc­cess of , is a par­tic­u­lar offend­er, with quite a few vari­ants of char­ac­ters like Saber.↩︎

  46. One would think it would, but I asked Brock and appar­ently it does­n’t help to occa­sion­ally ini­tial­ize from the EMA snap­shots. EMA is a mys­te­ri­ous thing.↩︎

  47. As far as I can tell, it has something to do with the dataloader code: the calculation of length and the iterator do something weird to adjust for previous training. The net effect is that you can run with a fixed minibatch accumulation and it’ll be fine, and you can reduce the number of accumulations and it’ll simply underrun the dataloader; but if you increase the number of accumulations after you’ve trained enough percentage-wise, it’ll immediately flip over into a negative length and indexing into it becomes completely impossible, leading to crashes. Unfortunately, I only ever want to increase the minibatch accumulation… I tried to fix it but the logic is too convoluted for me to follow.↩︎

  48. Mir­ror: rsync --verbose rsync:// ./↩︎

  49. Mir­ror: rsync --verbose rsync:// ./↩︎