Making Anime Faces With StyleGAN

A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/2 neural networks, and tips/scripts for effective StyleGAN use.
anime, NGE, NN, Python, technology, tutorial
2019-02-04–2021-01-03 · finished · certainty: highly likely · importance: 5

Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.

I show off my StyleGAN 1/2 CC-0-licensed anime faces & videos, provide downloads for the final models & anime portrait face dataset, provide the ‘missing manual’ & explain how I trained them based on Danbooru2017/2018 with source code for the data preprocessing, and document installation, configuration, & training tricks.

For application, I document various scripts for generating images & videos, briefly describe the TWDNE website as a public demo (see also Artbreeder), discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small datasets (eg Holo or Asuka Souryuu Langley), and touch on more advanced StyleGAN applications like encoders & controllable generation.

The ap­pen­dix gives sam­ples of my fail­ures with ear­lier GANs for anime face gen­er­a­tion, and I pro­vide sam­ples & model from a rel­a­tively large-s­cale BigGAN train­ing run sug­gest­ing that BigGAN may be the next step for­ward to gen­er­at­ing ful­l-s­cale anime im­ages.

A minute of read­ing could save an hour of de­bug­ging!

When Ian Goodfellow’s first GAN paper came out in 2014, with its blurry 64px grayscale faces, I said to myself, “given the rate at which GPUs & NN architectures improve, in a few years, we’ll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious.” There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/illustrations/drawings are so different from the exclusively-photographic datasets always (over)used in contemporary ML research that I was curious how it would work on anime: better, worse, faster, or different failure modes? Even more amusing: if random images become doable, then text→images would not be far behind.

So when GANs hit , and could do somewhat passable CelebA face samples around 2015, along with my , I began experimenting with Soumith Chintala’s implementation of DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of Asuka Souryuu Langley from Neon Genesis Evangelion because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)

It did not work. Despite many runs on my laptop & a borrowed desktop, DCGAN never got remotely near to the level of the CelebA face samples, typically topping out at reddish blobs before diverging or outright crashing.¹ Thinking perhaps the problem was too-small datasets & I needed to train on all the faces, I began creating the Danbooru2017 dataset. Armed with a large dataset, I subsequently began working through particularly promising members of the GAN zoo, emphasizing SOTA & open implementations.

Among others, I have tried / & Pixel*NN* (failed to get running)², WGAN-GP, Glow, GAN-QP, MSG-GAN, SAGAN, VGAN, PokeGAN, BigGAN³, ProGAN, & StyleGAN. These architectures vary widely in their design & core algorithms and in which of the many stabilization tricks they use, but they were more similar in their results: dismal.

Glow & BigGAN had promis­ing re­sults re­ported on CelebA & Im­a­geNet re­spec­tive­ly, but un­for­tu­nately their train­ing re­quire­ments were out of the ques­tion.4 (As in­ter­est­ing as and are, no source was re­leased and I could­n’t even at­tempt them.)

While some remarkable tools were created, and there were the occasional semi-successful anime face GANs like IllustrationGAN, the most notable attempt at anime face generation was Make Girls.moe. MGM could, interestingly, do in-browser 256px anime face generation using tiny GANs, but that is a dead end. MGM accomplished that much by making the problem easier: they added some light supervision in the form of a crude tag embedding⁵, and then simplified the problem drastically to n = 42k faces cropped from professional video game character artwork, which I regarded as not an acceptable solution: the faces were small & boring, and it was unclear if this data-cleaning approach could scale to anime faces in general, much less anime images in general. They are recognizably anime faces but the resolution is low and the quality is not great:

2017 SOTA: 16 ran­dom Make Girl­s.­Moe face sam­ples (4×4 grid)

Typ­i­cal­ly, a GAN would di­verge after a day or two of train­ing, or it would col­lapse to pro­duc­ing a lim­ited range of faces (or a sin­gle face), or if it was sta­ble, sim­ply con­verge to a low level of qual­ity with a lot of fuzzi­ness; per­haps the most typ­i­cal fail­ure mode was het­e­rochro­mia (which is com­mon in anime but not that com­mon)—mis­matched eye col­ors (each color in­di­vid­u­ally plau­si­ble), from the Gen­er­a­tor ap­par­ently be­ing un­able to co­or­di­nate with it­self to pick con­sis­tent­ly. With more re­cent ar­chi­tec­tures like VGAN or SAGAN, which care­fully weaken the Dis­crim­i­na­tor or which add ex­treme­ly-pow­er­ful com­po­nents like self­-at­ten­tion lay­ers, I could reach fuzzy 128px faces.

Given the mis­er­able fail­ure of all the prior NNs I had tried, I had be­gun to se­ri­ously won­der if there was some­thing about non-pho­tographs which made them in­trin­si­cally un­able to be eas­ily mod­eled by con­vo­lu­tional neural net­works (the com­mon in­gre­di­ent to them al­l). Did con­vo­lu­tions ren­der it un­able to gen­er­ate sharp lines or flat re­gions of col­or? Did reg­u­lar GANs work only be­cause pho­tographs were made al­most en­tirely of blurry tex­tures?

But BigGAN demonstrated that a large cutting-edge GAN architecture could scale, given enough training, to all of ImageNet at even 512px. And ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat infeasible amounts of training. ProGAN (source; video), while expensive and requiring >6 GPU-weeks⁶, did work and was even powerful enough to overfit single-character face datasets; I didn’t have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting. That is because a common sequence in DL/DRL (unlike many areas of AI) is that a problem seems intractable for long periods, until someone modifies a scalable architecture slightly, produces somewhat-credible (not necessarily human or even near-human) results, and then throws a ton of compute/data at it and, since the architecture scales, it rapidly exceeds SOTA and approaches human levels (and potentially exceeds human-level). Now I just needed a faster GAN architecture with which I could train a much bigger model on a much bigger dataset.

A his­tory of GAN gen­er­a­tion of anime faces: ‘do want’ to ‘oh no’ to ‘awe­some’

StyleGAN was the final breakthrough in providing ProGAN-level capabilities but fast: by switching to a radically different architecture, it minimized the need for the slow progressive growing (perhaps eliminating it entirely⁷), and learned efficiently at multiple levels of resolution, with bonuses in providing much more control of the generated images with its “style transfer” metaphor.


First, some demon­stra­tions of what is pos­si­ble with StyleGAN on anime faces:

When it works: a hand-s­e­lected StyleGAN sam­ple from my Asuka Souryuu Lan­g­ley-fine­tuned StyleGAN
64 of the best TWDNE anime face sam­ples se­lected from so­cial me­dia (click to zoom).
100 ran­dom sam­ple im­ages from the StyleGAN anime faces on TWDNE

Even a quick look at the MGM & StyleGAN sam­ples demon­strates the lat­ter to be su­pe­rior in res­o­lu­tion, fine de­tails, and over­all ap­pear­ance (although the MGM faces ad­mit­tedly have fewer global mis­takes). It is also su­pe­rior to my 2018 ProGAN faces. Per­haps the most strik­ing fact about these faces, which should be em­pha­sized for those for­tu­nate enough not to have spent as much time look­ing at aw­ful GAN sam­ples as I have, is not that the in­di­vid­ual faces are good, but rather that the faces are so di­verse, par­tic­u­larly when I look through face sam­ples with 𝜓≥1—it is not just the hair/eye color or head ori­en­ta­tion or fine de­tails that differ, but the over­all style ranges from CG to car­toon sketch, and even the ‘me­dia’ differ, I could swear many of these are try­ing to im­i­tate wa­ter­col­ors, char­coal sketch­ing, or oil paint­ing rather than dig­i­tal draw­ings, and some come off as rec­og­niz­ably ’90s-anime-style vs ’00s-anime-style. (I could look through sam­ples all day de­spite the global er­rors be­cause so many are in­ter­est­ing, which is not some­thing I could say of the MGM model whose nov­elty is quickly ex­haust­ed, and it ap­pears that users of my TWDNE web­site feel sim­i­larly as the av­er­age length of each visit is 1m:55s.)

In­ter­po­la­tion video of the 2019-02-11 face StyleGAN demon­strat­ing gen­er­al­iza­tion.
StyleGAN anime face interpolation videos are Elon Musk™-approved⁸!
Later in­ter­po­la­tion video (2019-03-08 face StyleGAN)


Ex­am­ple of the StyleGAN up­scal­ing im­age pyra­mid ar­chi­tec­ture: smal­l­→large (vi­su­al­iza­tion by Shawn Presser)

StyleGAN was published in 2018 as “A Style-Based Generator Architecture for Generative Adversarial Networks” (source code; demo video/algorithmic review video/results & discussions video; Colab notebook⁹; GenForce PyTorch reimplementation with model zoo/Keras; explainers: Two Minute Papers video). StyleGAN takes the standard GAN architecture embodied by ProGAN (whose source code it reuses) and, like the similar GAN architecture , draws inspiration from the field of “style transfer” (essentially invented by ), by changing the Generator (G), which creates the image by repeatedly upscaling its resolution, to take, at each level of resolution from 8px→16px→32px→64px→128px etc, a random input or “style noise”, which is combined with AdaIN and is used to tell the Generator how to ‘style’ the image at that resolution by changing the hair or changing the skin texture and so on. ‘Style noise’ at a low resolution like 32px affects the image relatively globally, perhaps determining the hair length or color, while style noise at a higher level like 256px might affect how frizzy individual strands of hair are. In contrast, ProGAN and almost all other GANs inject noise into the G as well, but only at the beginning, which appears to work not nearly as well (perhaps because it is difficult to propagate that randomness ‘upwards’ along with the upscaled image itself to the later layers to enable them to make consistent choices?). To put it simply, by systematically providing a bit of randomness at each step in the process of generating the image, StyleGAN can ‘choose’ variations effectively.

Karras et al 2018, StyleGAN vs ProGAN architecture: “Figure 1. While a traditional generator feeds the latent code [z] though the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here “A” stands for a learned affine transform, and “B” applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers and the synthesis network g consists of 18 layers, two for each resolution (4²–1024²). The output of the last layer is converted to RGB using a separate 1×1 convolution, similar to Karras et al. [29]. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in the traditional generator.”
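
To make the caption concrete, here is a minimal pure-Python sketch of the two per-layer operations it names (noise injection followed by AdaIN) plus the layer count it quotes. It is a toy illustration, not Nvidia's implementation: the function names, the flattened 4×4 "channel", and the particular scale/bias/noise-strength values are all assumptions chosen for the example, and a real network would compute the scale and bias from the intermediate latent w via the learned affine transform "A".

```python
import math
import random

def synthesis_resolutions(start=4, end=1024):
    """Resolutions handled by the synthesis network, doubling at each step."""
    res, out = start, []
    while res <= end:
        out.append(res)
        res *= 2
    return out

def add_noise(channel, strength, rng):
    """Add per-pixel Gaussian noise, scaled by one per-channel factor ("B")."""
    return [x + strength * rng.gauss(0.0, 1.0) for x in channel]

def adain(channel, scale, bias, eps=1e-8):
    """Adaptive instance normalization of one channel (a flat list of floats):
    normalize to zero mean / unit variance, then apply the style's scale & bias."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var + eps)
    return [scale * (x - mean) / std + bias for x in channel]

resolutions = synthesis_resolutions()
num_layers = 2 * len(resolutions)  # two layers per resolution, as in the caption

rng = random.Random(0)
feature = [rng.uniform(-1, 1) for _ in range(16)]  # toy 4x4 channel, flattened
noisy = add_noise(feature, strength=0.1, rng=rng)  # hypothetical noise strength
styled = adain(noisy, scale=2.0, bias=0.5)         # hypothetical style values

print(resolutions, num_layers)
```

Whatever statistics the incoming feature map had, after AdaIN its mean and standard deviation are set by the style alone, which is one way to see how a style can override what earlier layers produced at that resolution.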

StyleGAN makes a number of additional improvements, but they appear to be less important: for example, it introduces a new FFHQ face/portrait dataset with 1024px images in order to show that StyleGAN convincingly improves on ProGAN in final image quality; switches to a loss which is more well-behaved than the usual logistic-style losses; and architecture-wise, it makes unusually heavy use of fully-connected (FC) layers to process an initial random input, no less than 8 layers of 512 neurons, where most GANs use 1 or 2 FC layers.¹⁰ More striking is that it omits techniques that other GANs have found critical for being able to train at 512px–1024px scale: it does not use newer losses, SAGAN-style self-attention layers in either G/D, variational Discriminator bottlenecks, conditioning on a tag or category embedding¹¹, BigGAN-style large minibatches, different noise distributions¹², advanced regularization, etc.¹³ One possible reason for StyleGAN’s success is the way it combines outputs from the multiple layers into a single final image rather than repeatedly upscaling; when we visualize the output of each layer as an RGB image in anime StyleGANs, there is a striking division of labor between layers: some layers focus on monochrome outlines, while others fill in textured regions of color, and they sum up into an image with sharp lines and good color gradients while maintaining details like eyes.

Aside from the FCs and style noise & nor­mal­iza­tion, it is a vanilla ar­chi­tec­ture. (One odd­ity is the use of only 3×3 con­vo­lu­tions & so few lay­ers in each up­scal­ing block; a more con­ven­tional up­scal­ing block than StyleGAN’s 3×3→3×3 would be some­thing like BigGAN which does 1×1 → 3×3 → 3×3 → 1×1. It’s not clear if this is a good idea as it lim­its the spa­tial in­flu­ence of each pixel by pro­vid­ing lim­ited re­cep­tive fields14.) Thus, if one has some fa­mil­iar­ity with train­ing a ProGAN or an­other GAN, one can im­me­di­ately work with StyleGAN with no trou­ble: the train­ing dy­nam­ics are sim­i­lar and the hy­per­pa­ra­me­ters have their usual mean­ing, and the code­base is much the same as the orig­i­nal ProGAN (with the main ex­cep­tion be­ing that has been re­named (or in S2) and the orig­i­nal, which stores the crit­i­cal con­fig­u­ra­tion pa­ra­me­ters, has been moved to training/; there is still no sup­port for com­mand-line op­tions and StyleGAN must be con­trolled by edit­ing by hand).


Because of its speed and stability, when the source code was released on 2019-02-04 (a date that will long be noted in the ANNals of GANime), the Nvidia models & sample dumps were quickly perused & new StyleGANs trained on a wide variety of image types, yielding, in addition to the original faces/cars/cats of Karras et al 2018:

Im­age­quilt vi­su­al­iza­tion of the wide range of vi­sual sub­jects StyleGAN has been ap­plied to

Why Don’t GANs Work?

Why does StyleGAN work so well on anime im­ages while other GANs worked not at all or slowly at best?

The lesson I took from “Are GANs Created Equal? A Large-Scale Study”, Lucic et al 2017, is that CelebA/CIFAR10 are too easy, as almost all evaluated GAN architectures were capable of occasionally achieving good FID if one simply did enough iterations & hyperparameter tuning.

Interestingly, I consistently observe in training all GANs on anime that clear lines & sharpness & cel-like smooth gradients appear only toward the end of training, after typically initially blurry textures have coalesced. This suggests an inherent bias of CNNs: color images work because they provide some degree of textures to start with, but lineart/monochrome stuff fails because the GAN optimization dynamics flail around. This is consistent with Geirhos et al 2018 (which uses style transfer to construct a data-augmented/transformed “Stylized-ImageNet”) showing that ImageNet CNNs are lazy and, because the tasks can be achieved to some degree with texture-only classification (as demonstrated by several of Geirhos et al 2018’s authors via “BagNets”), focus on textures unless otherwise forced; and by & , who find that although CNNs are perfectly capable of emphasizing shape over texture, lower-performing models tend to rely more heavily on texture and that many kinds of training (including ) will induce a texture focus, suggesting texture tends to be lower-hanging fruit. So while CNNs can learn sharp lines & shapes rather than textures, the typical GAN architecture & training algorithm do not make it easy. Since CIFAR10/CelebA can be fairly described as being just as heavy on textures as ImageNet (which is not true of anime images), it is not surprising that GANs train easily on them, starting with textures and gradually refining into good samples, but then struggle on anime.

This raises a ques­tion of whether the StyleGAN ar­chi­tec­ture is nec­es­sary and whether many GANs might work, if only one had good style trans­fer for anime im­ages and could, to de­feat the tex­ture bi­as, gen­er­ate many ver­sions of each anime im­age which kept the shape while chang­ing the color palet­te? (Cur­rent style trans­fer meth­ods like the AdaIN Py­Torch im­ple­men­ta­tion used by Geirhos et al 2018, do not work well on anime im­ages, iron­i­cally enough, be­cause they are trained on pho­to­graphic im­ages, typ­i­cally us­ing the old VGG mod­el.)


“…Its so­cial ac­count­abil­ity seems sort of like that of de­sign­ers of mil­i­tary weapons: un­cul­pa­ble right up un­til they get a lit­tle too good at their job.”

David Foster Wallace, “E unibus pluram: Television and U.S. Fiction”

To ad­dress some com­mon ques­tions peo­ple have after see­ing gen­er­ated sam­ples:

  • Over­fit­ting: “Aren’t StyleGAN (or BigGAN) just over­fit­ting & mem­o­riz­ing data?”

    Amus­ing­ly, this is not a ques­tion any­one re­ally both­ered to ask of ear­lier GAN ar­chi­tec­tures, which is a sign of progress. Over­fit­ting is a bet­ter prob­lem to have than un­der­fit­ting, be­cause over­fit­ting means you can use a smaller model or more data or more ag­gres­sive reg­u­lar­iza­tion tech­niques, while un­der­fit­ting means your ap­proach just is­n’t work­ing.

    In any case, while there is currently no way to conclusively prove that cutting-edge GANs are not 100% memorizing (because they should be memorizing to a considerable extent in order to learn image generation, and evaluating generative models is hard in general, and for GANs in particular, because they don’t provide standard metrics like likelihoods which could be used on held-out samples), there are several reasons to think that they are not just memorizing:¹⁵

    1. Sample/Dataset Overlap: a standard check for overfitting is to compare generated images to their closest matches using nearest-neighbor lookup (where distance is defined by features like a CNN embedding); examples of this are StackGAN’s Figure 6 & BigGAN’s Figures 10–14, where the photorealistic samples are nevertheless completely different from the most similar ImageNet datapoints. This has not been done for StyleGAN yet but I wouldn’t expect different results as GANs typically pass this check. (It’s worth noting that facial recognition reportedly does not return Flickr matches for random FFHQ StyleGAN faces, suggesting the generated faces genuinely look like new faces rather than any of the original Flickr faces.)

      One in­trigu­ing ob­ser­va­tion about GANs made by the BigGAN pa­per is that the crit­i­cisms of Gen­er­a­tors mem­o­riz­ing dat­a­points may be pre­cisely the op­po­site of re­al­i­ty: GANs may work pri­mar­ily by the Dis­crim­i­na­tor (adap­tive­ly) over­fit­ting to dat­a­points, thereby re­pelling the Gen­er­a­tor away from real dat­a­points and forc­ing it to learn nearby pos­si­ble im­ages which col­lec­tively span the im­age dis­tri­b­u­tion. (With enough data, this cre­ates gen­er­al­iza­tion be­cause “neural nets are lazy” and only learn to gen­er­al­ize when eas­ier strate­gies fail.)

    2. Semantic Understanding: GANs appear to learn meaningful concepts like individual objects, as demonstrated by “latent space addition” or research tools like GANdissection; image edits like object deletions/additions or segmenting objects like dogs from their backgrounds are difficult to explain without some genuine understanding of images.

    In the case of StyleGAN anime faces, there are en­coders and con­trol­lable face gen­er­a­tion now which demon­strate that the la­tent vari­ables do map onto mean­ing­ful fac­tors of vari­a­tion & the model must have gen­uinely learned about cre­at­ing im­ages rather than merely mem­o­riz­ing real im­ages or im­age patch­es. Sim­i­lar­ly, when we use the “trun­ca­tion trick”/ψ to sam­ple from rel­a­tively ex­treme un­likely im­ages and we look at the dis­tor­tions, they show how gen­er­ated im­ages break down in se­man­ti­cal­ly-rel­e­vant ways, which would not be the case if it was just pla­gia­rism. (A par­tic­u­larly ex­treme ex­am­ple of the power of the learned StyleGAN prim­i­tives is ’s demon­stra­tion that Kar­ras et al’s FFHQ faces StyleGAN can be used to gen­er­ate fairly re­al­is­tic im­ages of cats/dogs/cars.)

    3. Latent Space Smoothness: in general, interpolation in the latent space (z) shows smooth changes of images and logical transformations or variations of face features; if StyleGAN were merely memorizing individual datapoints, the interpolation would be expected to be low quality, yield many terrible faces, and exhibit ‘jumps’ in between points corresponding to real, memorized, datapoints. The StyleGAN anime face models do not exhibit this. (In contrast, the Holo ProGAN, which overfit badly, does show severe problems in its latent space interpolation videos.)

      Which is not to say that GANs do not have issues: “mode dropping” seems to still be an issue for BigGAN despite the expensive large-minibatch training, which is overfitting to some degree, and StyleGAN presumably suffers from it too.

    4. Transfer Learning: GANs have been used for semi-supervised learning (eg generating plausible ‘labeled’ samples to train a classifier on), imitation learning, and retraining on further datasets; if the G is merely memorizing, it is difficult to explain how any of this would work.
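
Two of the latent-space operations mentioned in the list above are simple enough to sketch directly: the “truncation trick” (scaling a latent's distance from the average latent by ψ) and linear interpolation between two latents. This is a hedged toy version on plain Python lists; the function names are made up for the example, and in a real StyleGAN the truncation would be applied to the intermediate latent w produced by the mapping network rather than to hand-written vectors.

```python
def truncate(w, w_avg, psi):
    """Truncation trick: move w toward (psi < 1) or away from (psi > 1)
    the average latent w_avg. psi = 1 leaves w unchanged."""
    return [avg + psi * (x - avg) for x, avg in zip(w, w_avg)]

def lerp(z0, z1, t):
    """Linear interpolation between two latent vectors, t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]

def interpolation_path(z0, z1, steps):
    """Evenly spaced latents from z0 to z1 inclusive, eg the frames of a video."""
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]

# Hypothetical 2-dimensional latents, purely for illustration.
w_avg = [0.0, 0.0]
w = [1.0, -2.0]

conservative = truncate(w, w_avg, psi=0.5)  # closer to the 'average' image
extreme = truncate(w, w_avg, psi=1.5)       # further out, more distorted
frames = interpolation_path([0.0, 0.0], [1.0, 2.0], steps=5)
```

Feeding each latent in `frames` to the Generator is what produces an interpolation video; a memorizing model would be expected to show abrupt jumps between neighboring frames, whereas a smooth latent space gives gradual changes.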

  • Com­pute Re­quire­ments: “Does­n’t StyleGAN take too long to train?”

    StyleGAN is re­mark­ably fast-train­ing for a GAN. With the anime faces, I got bet­ter re­sults after 1–3 days of StyleGAN train­ing than I’d got­ten with >3 weeks of ProGAN train­ing. The train­ing times quoted by the StyleGAN repo may sound scary, but they are, in prac­tice, a steep over­es­ti­mate of what you ac­tu­ally need, for sev­eral rea­sons:

    • Lower Resolution: the largest figures are for 1024px images but you may not need them to be that large or even have a big dataset of 1024px images. For anime faces, 1024px-sized faces are relatively rare, and training at 512px & upscaling 2× to 1024 with waifu2x¹⁶ works fine & is much faster. Since upscaling is relatively simple & easy, another strategy is to change the progressive-growing schedule: instead of proceeding to the final resolution as fast as possible, adjust the schedule to stop at a more feasible resolution & spend the bulk of training time there, and then do just enough training at the final resolution to learn to upscale (eg spend 10% of training growing to 512px, then 80% of training time at 512px, then 10% at 1024px).
    • Diminishing Returns: the largest gains in image quality are seen in the first few days or weeks of training, with the remaining training being not that useful as it focuses on improving small details (so just a few days may be more than adequate for your purposes, especially if you’re willing to select a little more aggressively from samples).
    • Transfer Learning from a related model can save days or weeks of training, as there is no need to train from scratch; with the anime face StyleGAN, one can train a character-specific StyleGAN in a few hours or days at most, and certainly does not need to spend multiple weeks training from scratch! (assuming that wouldn’t just cause overfitting) Similarly, if one wants to train on some 1024px face dataset, why start from scratch, taking ~1000 GPU-hours, when you can start from Nvidia’s FFHQ face model which is already fully trained, and can converge in a fraction of the from-scratch time? For 1024px, you could use a super-resolution GAN to upscale. Alternately, you could change the image progression budget to spend most of your time at 512px and then at the tail end try 1024px.
    • One-Time Costs: the upfront cost of a few hundred dollars of GPU-time (at inflated AWS prices) may seem steep, but should be kept in perspective. As with almost all NNs, training 1 StyleGAN model can be literally tens of millions of times more expensive than simply running the Generator to produce 1 image; but it also need be paid only once by only one person, and the total price need not even be paid by the same person, given transfer learning, but can be amortized across various datasets. Indeed, given how fast running the Generator is, the trained model doesn’t even need to be run on a GPU. (The rule of thumb is that a GPU is 20–30× faster than the same thing on CPU, with rare instances, when overhead dominates, of the CPU being as fast or faster; so since generating 1 image takes on the order of ~0.1s on GPU, a CPU can do it in ~3s, which is adequate for many purposes.)
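
The two pieces of arithmetic in the bullets above (the 10%/80%/10% schedule split and the GPU→CPU slowdown rule of thumb) can be sketched as follows. The 100-hour total budget is a hypothetical figure picked only to make the numbers easy to read, and the function names are illustrative; nothing here corresponds to an actual StyleGAN configuration option.

```python
def split_budget(total_hours, fractions):
    """Split a training budget across named phases; fractions must sum to 1."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("phase fractions must sum to 1")
    return {phase: total_hours * frac for phase, frac in fractions.items()}

def cpu_seconds_per_image(gpu_seconds, slowdown):
    """Estimate CPU generation time from GPU time and a GPU/CPU speed ratio."""
    return gpu_seconds * slowdown

# Hypothetical 100 GPU-hour budget, split as suggested in the text.
schedule = split_budget(100.0, {
    "grow to 512px": 0.10,
    "train at 512px": 0.80,
    "finish at 1024px": 0.10,
})

# ~0.1s per image on GPU, with the 20-30x rule of thumb from the text.
cpu_low = cpu_seconds_per_image(0.1, 20)
cpu_high = cpu_seconds_per_image(0.1, 30)

for phase, hours in schedule.items():
    print(f"{phase}: {hours:.0f} GPU-hours")
print(f"CPU estimate: {cpu_low:.0f}-{cpu_high:.0f}s per image")
```

The point of the split is that the expensive final resolution only receives a small slice of the budget, since by then the model merely has to learn to upscale.
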
  • Copy­right In­fringe­ment: “Who owns StyleGAN im­ages?”

    1. The Nvidia Source Code & Released Models for StyleGAN 1 are under a CC-BY-NC license, and you cannot edit them or produce “derivative works” such as retraining their FFHQ, car, or cat StyleGAN models. (StyleGAN 2 is under a new “Nvidia Source Code License-NC”, which appears to be effectively the same as the CC-BY-NC with the addition of a .)

      If a model is trained from scratch, then that does not ap­ply as the source code is sim­ply an­other tool used to cre­ate the model and noth­ing about the CC-BY-NC li­cense forces you to do­nate the copy­right to Nvidia. (It would be odd if such a thing did hap­pen—if your word proces­sor claimed to trans­fer the copy­rights of every­thing writ­ten in it to Mi­crosoft!)

      For those con­cerned by the CC-BY-NC li­cense, a 512px FFHQ con­fig-f StyleGAN 2 has been trained & re­leased into the pub­lic do­main by Ay­dao, and is avail­able for down­load from Mega and my rsync mir­ror:

      rsync --verbose rsync:// ./
    2. Models are generally considered transformative works, and the copyright owners of whatever data the model was trained on have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to artists or search engines; see the further reading.) The model is copyrighted to whoever created it. Hence, Nvidia has copyright on the models it created but I have copyright on the models I trained (which I release under CC-0).

    3. Samples are trickier. The usual widely-stated legal interpretation is that only human authors can earn a copyright and that machines, animals, inanimate objects or, most famously, monkeys, cannot. The US Copyright Office states clearly that regardless of whether we regard a GAN as a machine or as something more intelligent like an animal, either way, it doesn’t count:

      A work of au­thor­ship must pos­sess “some min­i­mal de­gree of cre­ativ­ity” to sus­tain a copy­right claim. Feist, 499 U.S. at 358, 362 (c­i­ta­tion omit­ted). “[T]he req­ui­site level of cre­ativ­ity is ex­tremely low.” Even a “slight amount” of cre­ative ex­pres­sion will suffice. “The vast ma­jor­ity of works make the grade quite eas­i­ly, as they pos­sess some cre­ative spark, ‘no mat­ter how crude, hum­ble or ob­vi­ous it might be.’” Id. at 346 (c­i­ta­tion omit­ted).

      … To qual­ify as a work of “au­thor­ship” a work must be cre­ated by a hu­man be­ing. See Bur­row-Giles Lith­o­graphic Co., 111 U.S. at 58. Works that do not sat­isfy this re­quire­ment are not copy­rightable. The Office will not reg­is­ter works pro­duced by na­ture, an­i­mals, or plants.


      • A pho­to­graph taken by a mon­key.
      • A mural painted by an ele­phant.

      …the Office will not reg­is­ter works pro­duced by a ma­chine or mere me­chan­i­cal process that op­er­ates ran­domly or au­to­mat­i­cally with­out any cre­ative in­put or in­ter­ven­tion from a hu­man au­thor.

      A dump of ran­dom sam­ples such as the Nvidia sam­ples or TWDNE there­fore has no copy­right & by de­fi­n­i­tion is in the pub­lic do­main.

      A new copyright can be created, however, if a human author is sufficiently ‘in the loop’, so to speak, as to exert a de minimis amount of creative effort, even if that ‘creative effort’ is simply selecting a single image out of a dump of thousands of them or twiddling knobs (eg on Make Girls.Moe). Crypko, for example, takes this position.

    Fur­ther read­ing on com­put­er-gen­er­ated art copy­rights:

Training requirements


“The road of ex­cess leads to the palace of wis­dom
…If the fool would per­sist in his folly he would be­come wise
…You never know what is enough un­less you know what is more than enough. …If oth­ers had not been fool­ish, we should be so.”

William Blake, “Proverbs of Hell”, The Marriage of Heaven and Hell

The nec­es­sary size for a dataset de­pends on the com­plex­ity of the do­main and whether trans­fer learn­ing is be­ing used. StyleGAN’s de­fault set­tings yield a 1024px Gen­er­a­tor with 26.2M pa­ra­me­ters, which is a large model and can soak up po­ten­tially mil­lions of im­ages, so there is no such thing as too much.

For learning decent-quality anime faces from scratch, a minimum of 5000 images appears to be necessary in practice; for learning a specific character when using the anime face StyleGAN, potentially as few as ~500 (especially with data augmentation) can give good results. For domains as complicated as “any cat photo” like Karras et al 2018’s cat StyleGAN which is trained on the LSUN CATS category of ~1.8M¹⁷ cat photos, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that “CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds.”¹⁸


To fit reasonable minibatch sizes, one will want GPUs with ≥11GB VRAM. At 512px, that will only train n = 4, and going below that means it’ll be even slower (and you may have to reduce learning rates to avoid unstable training). So, Nvidia 1080ti & up would be good. (Reportedly, AMD/OpenCL works for running StyleGAN models, and there is one report of successful training with “Radeon VII with tensorflow-rocm 1.13.2 and rocm 2.3.14”.)

The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & provide a worst-case AWS-based cost estimate):

Estimated StyleGAN wallclock training times for various resolutions & GPU-clusters (source: StyleGAN repo)

| GPUs | 1024² | 512² | 256² | [March 2019 AWS Costs19] |
|------|-------|------|------|--------------------------|
| 1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115] |
| 2 | 21 days 22 hours [1,052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA] |
| 4 | 11 days 8 hours [1,088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA] |
| 8 | 6 days 14 hours [1,264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382] |
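The bracketed GPU-hour figures are simply wall-clock hours multiplied by the number of GPUs. A minimal sketch of that conversion follows; the names `gpu_hours`, `ESTIMATES`, and `TOTALS` are mine, chosen for illustration, and the inputs are just the wall-clock times quoted in the table:

```python
def gpu_hours(days, hours, n_gpus):
    """Total GPU-hours = wall-clock hours x number of GPUs."""
    return (days * 24 + hours) * n_gpus

# (days, hours) wall-clock estimates from the table, keyed by GPU count,
# listed in the order 1024px, 512px, 256px.
ESTIMATES = {
    1: [(41, 4), (24, 21), (14, 22)],
    2: [(21, 22), (13, 7), (9, 5)],
    4: [(11, 8), (7, 0), (4, 21)],
    8: [(6, 14), (4, 10), (3, 8)],
}

# GPU-hours per GPU count, in the same resolution order.
TOTALS = {n: [gpu_hours(d, h, n) for d, h in rows]
          for n, rows in ESTIMATES.items()}
```

Comparing rows shows the cost of parallelism: at 1024px, 8 GPUs finish in under a week but burn 1,264 GPU-hours against 988 for a single GPU, about 28% more.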

AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound; 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary: with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/StyleGAN particularly struggle with backgrounds & edges of images because those are cut off, obscured, and highly-varied compared to the faces, whether anime or FFHQ. I hypothesize that the telltale blurry backgrounds are due to the impoverishment of the backgrounds/edges in cropped face photos, and they could be fixed by transfer-learning or pretraining on a more generic dataset like ImageNet, so it learns what the backgrounds even are in the first place; then in face training, it merely has to remember them & defocus a bit to generate correct blurry backgrounds.)

Train­ing im­prove­ments: 256px StyleGAN anime faces after ~46 GPU-hours (top) vs 512px anime faces after 382 GPU-hours (bot­tom); see also the video mon­tage of first 9k it­er­a­tions

Data Preparation

The most diffi­cult part of run­ning StyleGAN is prepar­ing the dataset prop­er­ly. StyleGAN does not, un­like most GAN im­ple­men­ta­tions (par­tic­u­larly Py­Torch ones), sup­port read­ing a di­rec­tory of files as in­put; it can only read its unique .tfrecord for­mat which stores each im­age as raw ar­rays at every rel­e­vant res­o­lu­tion.20 Thus, in­put files must be per­fectly uni­form, (s­low­ly) con­verted to the .tfrecord for­mat by the spe­cial tool, and will take up ~19× more disk space.21

A StyleGAN dataset must con­sist of im­ages all for­mat­ted ex­actly the same way.

Images must be precisely 512×512px or 1024×1024px etc (even a single 512×513px image will kill the entire run), they must all be the same colorspace (you cannot have sRGB and Grayscale JPGs, and I doubt other color spaces work at all), they must not be transparent, the filetype must be the same as that of the model you intend to (re)train (ie you cannot retrain a PNG-trained model on a JPG dataset; StyleGAN will crash every time with inscrutable convolution/channel-related errors)22, and there must be no subtle errors like CRC checksum errors which image viewers or libraries like ImageMagick often ignore.
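These uniformity requirements can be checked mechanically before attempting a conversion. The following is a rough sketch using Pillow, not part of StyleGAN; `check_image` and `find_bad_images` are hypothetical helper names, and the defaults (512×512, RGB, JPEG) are assumptions matching the face dataset described here. Forcing a full decode catches many truncated or corrupt files, though I would not assume it catches every subtle error.

```python
from pathlib import Path
from PIL import Image

def check_image(path, size=(512, 512), mode="RGB", fmt="JPEG"):
    """Return a list of problems with one image file (empty list = OK)."""
    problems = []
    try:
        with Image.open(path) as img:
            if img.format != fmt:
                problems.append("format %s" % img.format)
            if img.size != size:
                problems.append("size %dx%d" % img.size)
            if img.mode != mode:
                problems.append("mode %s" % img.mode)
            img.load()  # force a full decode to surface truncated/corrupt data
    except Exception as err:
        problems.append("unreadable: %s" % err)
    return problems

def find_bad_images(directory, **kwargs):
    """Map each offending file under `directory` to its list of problems."""
    bad = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            problems = check_image(path, **kwargs)
            if problems:
                bad[str(path)] = problems
    return bad
```

Running `find_bad_images("faces/")` and reviewing (or deleting) whatever it reports is much cheaper than discovering a stray grayscale or off-by-one-pixel image partway through a long conversion.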

Faces preparation

My work­flow:

  1. Down­load raw im­ages from Dan­booru2018 if nec­es­sary
  2. Ex­tract from the JSON Dan­booru2018 meta­data all the IDs of a sub­set of im­ages if a spe­cific Dan­booru tag (such as a sin­gle char­ac­ter) is de­sired, us­ing jq and shell script­ing
  3. Crop square anime faces from raw im­ages us­ing Na­gadomi’s lbpcascade_animeface (reg­u­lar face-de­tec­tion meth­ods do not work on anime im­ages)
  4. Delete empty files, mono­chrome or grayscale files, & ex­ac­t-du­pli­cate files
  5. Con­vert to JPG
  6. Upscale images below the target resolution (512px) with waifu2x
  7. Convert all images to exactly 512×512 resolution sRGB JPG images
  8. If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN’s Discriminator
  9. Convert to StyleGAN format using dataset_tool.py

The goal is to turn this:

100 ran­dom real sam­ple im­ages from the 512px SFW sub­set of Dan­booru in a 10×10 grid.

into this:

36 ran­dom real sam­ple im­ages from the cropped Dan­booru faces in a 6×6 grid.

Be­low I use shell script­ing to pre­pare the dataset. A pos­si­ble al­ter­na­tive is danbooru-utility, which aims to help “ex­plore the dataset, fil­ter by tags, rat­ing, and score, de­tect faces, and re­size the im­ages”.


The Dan­booru2018 down­load can be done via Bit­Tor­rent or rsync, which pro­vides a JSON meta­data tar­ball which un­packs into metadata/2* & a folder struc­ture of {original,512px}/{0-999}/$ID.{png,jpg,...}.

For train­ing on SFW whole im­ages, the 512px/ ver­sion of Dan­booru2018 would work, but it is not a great idea for faces be­cause by scal­ing im­ages down to 512px, a lot of face de­tail has been lost, and get­ting high­-qual­ity faces is a chal­lenge. The SFW IDs can be ex­tracted from the file­names in 512px/ di­rectly or from the meta­data by ex­tract­ing the id & rating fields (and sav­ing to a file):

find ./512px/ -type f | sed -e 's/.*\/\([[:digit:]]*\)\.jpg/\1/'
# 967769
# 1853769
# 2729769
# 704769
# 1799769
# ...
tar xf metadata.json.tar.xz
cat metadata/20180000000000* | jq '[.id, .rating]' -c | fgrep '"s"' | cut -d '"' -f 2 # "
# ...
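For those who prefer to avoid jq, the same filter can be sketched in Python. This assumes the metadata files hold one JSON object per line with `id` and `rating` fields (which is what the jq pipeline above relies on); `sfw_ids` is a hypothetical helper name.

```python
import json

def sfw_ids(lines):
    """Yield the ID of every record whose rating is 's' (safe).

    `lines` is any iterable of JSON strings, one Danbooru record per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("rating") == "s":
            yield str(record["id"])

# Example use (paths are illustrative):
#   import glob
#   with open("sfw-ids.txt", "w") as out:
#       for path in glob.glob("metadata/20180000000000*"):
#           with open(path) as f:
#               for image_id in sfw_ids(f):
#                   out.write(image_id + "\n")
```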

After installing and testing Nagadomi’s lbpcascade_animeface to make sure it works, one can use a simple script which crops the face(s) from a single input image. The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors: non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250×250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative more-difficult-to-run library by Nagadomi which offers a face-cropping script, animeface-2009’s face_collector.rb, which Nagadomi says is better at cropping faces, but I was not impressed when I tried it out.)

import cv2
import sys
import os.path

def detect(cascade_file, filename, outputname):
    if not os.path.isfile(cascade_file):
        raise RuntimeError("%s: not found" % cascade_file)

    cascade = cv2.CascadeClassifier(cascade_file)
    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    ## NOTE: Suggested modification: increase minSize to '(250,250)' px,
    ## increasing proportion of high-quality faces & reducing
    ## false positives. Faces which are only 50×50px are useless
    ## and often not faces at all.
    ## For my StyleGANs, I use 250 or 300px boxes

    faces = cascade.detectMultiScale(gray,
                                     # detector options
                                     scaleFactor = 1.1,
                                     minNeighbors = 5,
                                     minSize = (50, 50))
    # number each crop, since one image may contain several faces
    for i, (x, y, w, h) in enumerate(faces):
        cropped = image[y: y + h, x: x + w]
        cv2.imwrite(outputname+str(i)+".png", cropped)

if len(sys.argv) != 4:
    sys.stderr.write("usage: <animeface.xml file>  <input> <output prefix>\n")
    sys.exit(-1)

detect(sys.argv[1], sys.argv[2], sys.argv[3])

The IDs can be com­bined with the pro­vided lbpcascade_animeface script us­ing xargs, how­ever this will be far too slow and it would be bet­ter to ex­ploit par­al­lelism with xargs --max-args=1 --max-procs=16 or parallel. It’s also worth not­ing that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no ap­par­ent speedup (a slow­down if any­thing, given lim­ited VRAM), so I find it helps to ex­plic­itly dis­able GPU use by set­ting CUDA_VISIBLE_DEVICES="". (For this step, it’s quite help­ful to have a many-core sys­tem like a Thread­rip­per.)

Com­bin­ing every­thing, par­al­lel face-crop­ping of an en­tire Dan­booru2018 sub­set can be done like this:

cropFaces() {
    ID="$1"
    BUCKET=$(printf "%04d" $(( ID % 1000 )) )
    CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/  \
     ~/src/lbpcascade_animeface/lbpcascade_animeface.xml \
     ./original/$BUCKET/$ID.* "./faces/$ID"
}
export -f cropFaces

mkdir ./faces/
cat sfw-ids.txt | parallel --progress cropFaces

# NOTE: because of the possibility of multiple crops from an image, the script appends a N counter;
# remove that to get back the original ID & filepath: eg
## original/0196/933196.jpg  → portrait/9331961.jpg
## original/0669/1712669.png → portrait/17126690.jpg
## original/0997/3093997.jpg → portrait/30939970.jpg
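The ID-to-folder mapping used by the shell function above (the ID modulo 1000, zero-padded to 4 digits) is easy to get subtly wrong, so here is a small sketch of it in Python. Both function names are hypothetical, and `source_id` additionally assumes the appended crop counter is a single digit (ie fewer than 10 faces per image), which is my assumption rather than something the cropping script guarantees.

```python
def bucket_path(image_id, root="original"):
    """Directory holding a Danbooru image: ID modulo 1000, zero-padded to 4 digits."""
    return "%s/%04d" % (root, int(image_id) % 1000)

def source_id(crop_stem):
    """Recover the original ID from a crop's filename stem.

    Assumes the crop counter appended by the cropping script is one digit.
    """
    return crop_stem[:-1]
```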

Nvidia StyleGAN, by default and like most image-related tools, expects square images like 512×512px, but there is nothing inherent to neural nets or convolutions that requires square inputs or outputs, and rectangular convolutions are possible. In the case of faces, they tend to be more rectangular than square, and we’d prefer to use a rectangular convolution if possible to focus the image on the relevant dimension rather than either pay the severe performance penalty of increasing total dimensions to 1024×1024px or stick with 512×512px & waste image outputs on emitting black bars/backgrounds. A properly-sized rectangular convolution can offer a nice speedup (eg training ImageNet in 18m for $40 using them among other tricks). Nolan Kent’s StyleGAN re-implementation (released October 2019) does support rectangular convolutions, and as he demonstrates in his blog post, it works nicely.

Cleaning & Upscaling

Mis­cel­la­neous cleanups can be done:

## Delete failed/empty files
find faces/ -size 0    -type f -delete

## Delete 'too small' files which is indicative of low quality:
find faces/ -size -40k -type f -delete

## Delete exact duplicates:
fdupes --delete --omitfirst --noprompt faces/

## Delete monochrome or minimally-colored images:
### the heuristic of <257 unique colors is imperfect but better than anything else I tried
deleteBW() { if [[ `identify -format "%k" "$@"` -lt 257 ]];
             then rm "$@"; fi; }
export -f deleteBW
find faces -type f | parallel --progress deleteBW
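The same unique-color heuristic can be sketched with Pillow instead of `identify`, which may be convenient if the rest of one’s pipeline is already in Python. `is_minimally_colored` is a hypothetical name; the sketch relies on `getcolors()` returning `None` once an image exceeds `maxcolors` distinct colors, so “at most 256 colors” corresponds to the `-lt 257` test above.

```python
from PIL import Image

def is_minimally_colored(img, max_colors=256):
    """True if the image has at most `max_colors` distinct RGB colors."""
    # getcolors() returns None when the image has more than `maxcolors` colors.
    return img.convert("RGB").getcolors(maxcolors=max_colors) is not None

# Example use (path is illustrative):
#   with Image.open("faces/example.png") as img:
#       if is_minimally_colored(img):
#           print("monochrome or minimally colored; candidate for deletion")
```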

I re­move black­-white or grayscale im­ages from all my GAN ex­per­i­ments be­cause in my ear­li­est ex­per­i­ments, their in­clu­sion ap­peared to in­crease in­sta­bil­i­ty: mixed datasets were ex­tremely un­sta­ble, mono­chrome datasets failed to learn at all, but col­or-only runs made some progress. It is likely that StyleGAN is now pow­er­ful enough to be able to learn on mixed datasets (and some later ex­per­i­ments by other peo­ple sug­gest that StyleGAN can han­dle both mono­chrome & color ani­me-style faces with­out a prob­lem), but I have not risked a full mon­th-long run to in­ves­ti­gate, and so I con­tinue do­ing col­or-on­ly.

Discriminator ranking

A good trick with GANs is, after train­ing to rea­son­able lev­els of qual­i­ty, reusing the Dis­crim­i­na­tor to rank the real dat­a­points; im­ages the trained D as­signs the low­est probability/score of be­ing real are often the worst-qual­ity ones and go­ing through the bot­tom decile (or delet­ing them en­tire­ly) should re­move many anom­alies and may im­prove the GAN. The GAN is then trained on the new cleaned dataset, mak­ing this a kind of “ac­tive learn­ing”.

Since rat­ing im­ages is what the D al­ready does, no new al­go­rithms or train­ing meth­ods are nec­es­sary, and al­most no code is nec­es­sary: run the D on the whole dataset to rank each im­age (faster than it seems since the G & back­prop­a­ga­tion are un­nec­es­sary, even a large dataset can be ranked in a wall­clock hour or two), then one can re­view man­u­ally the bot­tom & top X%, or per­haps just delete the bot­tom X% sight un­seen if enough data is avail­able.

What is a D doing? I find that the highest ranked images often contain many anomalies or low-quality images which need to be deleted. Why? The BigGAN paper notes that a well-trained D which achieves 98% real vs fake classification performance on the ImageNet training dataset falls to 50–55% accuracy when run on the validation dataset, suggesting the D’s role is about memorizing the training data rather than some measure of ‘realism’.

Per­haps be­cause the D rank­ing is not nec­es­sar­ily a ‘qual­ity’ score but sim­ply a sort of con­fi­dence rat­ing that an im­age is from the real dataset; if the real im­ages con­tain cer­tain eas­i­ly-de­tectable im­ages which the G can’t repli­cate, then the D might mem­o­rize or learn them quick­ly. For ex­am­ple, in face crops, whole fig­ure crops are com­mon mis­taken crops, mak­ing up a tiny per­cent­age of im­ages; how could a face-only G learn to gen­er­ate whole re­al­is­tic bod­ies with­out the in­ter­me­di­ate steps be­ing in­stantly de­tected & de­feated as er­rors by D, while D is eas­ily able to de­tect re­al­is­tic bod­ies as defi­nitely re­al? This would ex­plain the po­lar­ized rank­ings. And given the close con­nec­tions be­tween GANs & DRL, I have to won­der if there is more mem­o­riza­tion go­ing on than sus­pected in things like ? In­ci­den­tal­ly, this may also ex­plain the prob­lem with us­ing Dis­crim­i­na­tors for semi­-su­per­vised rep­re­sen­ta­tion learn­ing: if the D is mem­o­riz­ing dat­a­points to force the G to gen­er­al­ize, then its in­ter­nal rep­re­sen­ta­tions would be ex­pected to be use­less. (One would in­stead want to ex­tract knowl­edge from the G, per­haps by en­cod­ing an im­age into z and us­ing the z as the rep­re­sen­ta­tion.)

An alternative perspective is offered by a crop of 2020 papers examining how useful GAN data augmentation requires it to be done during training, and one must augment all images.23 Zhao et al 2020c & Karras et al 2020 observe that, with regular GAN training, there is a striking steady decline of D performance on heldout data, and increase on training data, throughout the course of training, confirming the BigGAN observation but also showing it is a dynamic phenomenon, and probably a bad one. Adding in correct data augmentation reduces this overfitting, and markedly improves sample-efficiency & final quality. This suggests that the D does indeed memorize, but that this is not a good thing. Karras et al 2020 describes what happens as:

Con­ver­gence is now achieved [with ADA/data aug­men­ta­tion] re­gard­less of the train­ing set size and over­fit­ting no longer oc­curs. With­out aug­men­ta­tions, the gra­di­ents the gen­er­a­tor re­ceives from the dis­crim­i­na­tor be­come very sim­plis­tic over time—the dis­crim­i­na­tor starts to pay at­ten­tion to only a hand­ful of fea­tures, and the gen­er­a­tor is free to cre­ate oth­er­wise non­sen­si­cal im­ages. With ADA, the gra­di­ent field stays much more de­tailed which pre­vents such de­te­ri­o­ra­tion.

In other words, just as the G can ‘mode collapse’ by focusing on generating images with only a few features, the D can also ‘feature collapse’ by focusing on a few features which happen to correctly split the training data’s reals from fakes, such as by memorizing them outright. This technically works, but not well. This also explains why BigGAN training stabilized when training on JFT-300M: divergence/collapse usually starts with D winning; if D wins because it memorizes, then a sufficiently large dataset should make memorization infeasible; and JFT-300M turns out to be sufficiently large. (This would predict that if Brock et al had checked the JFT-300M BigGAN D’s classification performance on a held-out JFT-300M, rather than just on their ImageNet BigGAN, then they would have found that it classified reals vs fake well above chance.)

If so, this suggests that for D ranking, it may not be too useful to take the D from the end of a run, if not using data augmentation, because that D will be the version with the greatest degree of memorization!

Here is a simple StyleGAN2 script to open a StyleGAN .pkl and run it on a list of image filenames to print out the D score, courtesy of Shao Xuning:

import pickle
import numpy as np
import cv2
import dnnlib.tflib as tflib
import random
import argparse
import PIL.Image
from training.misc import adjust_dynamic_range

def preprocess(file_path):
    # print(file_path)
    img = np.asarray(PIL.Image.open(file_path))

    # Preprocessing from dataset_tool.create_from_images
    img = img.transpose([2, 0, 1])  # HWC => CHW
    # img = np.expand_dims(img, axis=0)
    img = img.reshape((1, 3, 512, 512))

    # Preprocessing from training_loop.process_reals
    img = adjust_dynamic_range(data=img, drange_in=[0, 255], drange_out=[-1.0, 1.0])
    return img

def main(args):
    random.seed(args.random_seed)
    minibatch_size = args.minibatch_size
    input_shape = (minibatch_size, 3, 512, 512)
    # print(args.images)
    images = args.images

    tflib.init_tf()  # a TensorFlow session must exist before unpickling the networks
    _G, D, _Gs = pickle.load(open(args.model, "rb"))
    # D.print_layers()

    image_score_all = [(image, []) for image in images]

    # Shuffle the images and process each image in multiple minibatches.
    # Note: networks.stylegan2.minibatch_stddev_layer
    # calculates the standard deviation of a minibatch group as a feature channel,
    # which means that the output of the discriminator actually depends
    # on the companion images in the same minibatch.
    for i_shuffle in range(args.num_shuffles):
        # print('shuffle: {}'.format(i_shuffle))
        random.shuffle(image_score_all)
        for idx_1st_img in range(0, len(image_score_all), minibatch_size):
            idx_img_minibatch = []
            images_minibatch = []
            input_minibatch = np.zeros(input_shape)
            for i in range(minibatch_size):
                idx_img = (idx_1st_img + i) % len(image_score_all)
                idx_img_minibatch.append(idx_img)
                image = image_score_all[idx_img][0]
                images_minibatch.append(image)
                img = preprocess(image)
                input_minibatch[i, :] = img
            output = D.run(input_minibatch, None, resolution=512)
            print('shuffle: {}, indices: {}, images: {}'
                  .format(i_shuffle, idx_img_minibatch, images_minibatch))
            print('Output: {}'.format(output))
            # record each image's score from this minibatch
            for i in range(minibatch_size):
                idx_img = idx_img_minibatch[i]
                image_score_all[idx_img][1].append(output[i][0])

    with open(args.output, 'a') as fout:
        for image, score_list in image_score_all:
            print('Image: {}, score_list: {}'.format(image, score_list))
            avg_score = sum(score_list)/len(score_list)
            fout.write(image + ' ' + str(avg_score) + '\n')

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, required=True,
                        help='.pkl model')
    parser.add_argument('--images', nargs='+')
    parser.add_argument('--output', type=str, default='rank.txt')
    parser.add_argument('--minibatch_size', type=int, default=4)
    parser.add_argument('--num_shuffles', type=int, default=5)
    parser.add_argument('--random_seed', type=int, default=0)
    return parser.parse_args()

if __name__ == '__main__':
    main(parse_arguments())

De­pend­ing on how noisy the rank­ings are in terms of ‘qual­ity’ and avail­able sam­ple size, one can ei­ther re­view the worst-ranked im­ages by hand, or delete the bot­tom X%. One should check the top-ranked im­ages as well to make sure the or­der­ing is right; there can also be some odd im­ages in the top X% as well which should be re­moved.
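The ranking script writes one `filename score` pair per line, so pulling out the two tails for review takes only a few lines. A minimal sketch, assuming that output format; `load_ranking` and `tails` are hypothetical names, and the 10% default is just the “bottom decile” suggested earlier:

```python
def load_ranking(lines):
    """Parse 'filename score' lines into a list sorted from lowest to highest D score."""
    ranking = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # split on the last space only, since filenames may contain spaces
        name, score = line.rsplit(" ", 1)
        ranking.append((name, float(score)))
    return sorted(ranking, key=lambda pair: pair[1])

def tails(ranking, fraction=0.1):
    """Return the (bottom, top) `fraction` of an ascending ranking for manual review."""
    k = max(1, int(len(ranking) * fraction))
    return ranking[:k], ranking[-k:]

# Example use (file name matches the script's default --output):
#   with open("rank.txt") as f:
#       bottom, top = tails(load_ranking(f), fraction=0.1)
```

Because the script opens its output file in append mode, re-running it adds duplicate lines; it is worth starting from a fresh `rank.txt` before ranking.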

It might be possible to use the same D ranking to improve the quality of generated samples as well.


The next major step is upscaling images using waifu2x, which does an excellent job on 2× upscaling of anime images; the results are nigh-indistinguishable from a higher-resolution original and greatly increase the usable corpus. The downside is that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2×1080ti), and is written in a now-unmaintained DL framework, Torch, with no current plans to port to PyTorch, and is gradually becoming harder to get running (one hopes that by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, so the waifu2x step is worthwhile.

. ~/src/torch/install/bin/torch-activate
upscaleWaifu2x() {
    SIZE1=$(identify -format "%h" "$@")
    SIZE2=$(identify -format "%w" "$@");

    if (( $SIZE1 < 512 && $SIZE2 < 512  )); then
        echo "$@" $SIZE1 $SIZE2
        TMP=$(mktemp "/tmp/XXXXXX.png")
        CUDA_VISIBLE_DEVICES="$((RANDOM % 2 < 1))" nice th ~/src/waifu2x/waifu2x.lua -model_dir \
            ~/src/waifu2x/models/upconv_7/art -tta 1 -m scale -scale 2 \
            -i "$@" -o "$TMP"
        convert "$TMP" "$@"
        rm "$TMP"
    fi;  }

export -f upscaleWaifu2x
find faces/ -type f | parallel --progress --jobs 9 upscaleWaifu2x

Quality Checks & Data Augmentation

The single most effective strategy to improve a GAN is to clean the data. StyleGAN cannot handle too-diverse datasets composed of multiple objects or single objects shifted around, and rare or odd images cannot be learned well. Karras et al get such good results with StyleGAN on faces in part because they constructed FFHQ to be an extremely clean consistent dataset of just centered well-lit clear human faces without any obstructions or other variation. Similarly, Arfa’s TFDNE S2 generates much better portraits than my own “This Waifu Does Not Exist” (TWDNE) S2 anime portraits, due partly to training longer to convergence on a TPU pod but mostly due to his investment in data cleaning: aligning the faces and heavy filtering of samples. This left him with only n = 50k, but TFDNE nevertheless outperforms TWDNE’s n = 300k. (Data cleaning/augmentation is one of the more powerful ways to improve results; if we imagine deep learning as ‘programming’ or ‘Software 2.0’24 in Andrej Karpathy’s terms, data cleaning/augmentation is one of the easiest ways to finetune the loss function towards what we really want by gardening our data to remove what we don’t want and increase what we do.)

At this point, one can do manual quality checks by viewing a few hundred images, running findimagedupes -t 99% to look for near-identical faces, or dabble in further modifications such as doing “data augmentation”. Working with Danbooru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN, and one will have difficulty storing the final StyleGAN dataset because of its sheer size (due to the ~18× size multiplier). After cleaning etc, my final face dataset has n = 300k.
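To give a feel for how near-duplicate detection works, here is a sketch of the simple ‘average hash’ technique using Pillow. This is a stand-in for illustration, not findimagedupes’s own algorithm, and `average_hash` and `hamming` are hypothetical names: each image is shrunk to an 8×8 grayscale thumbnail, each pixel is compared to the mean, and two images are near-duplicates if their 64-bit fingerprints differ in only a few bits.

```python
from PIL import Image

def average_hash(img, hash_size=8):
    """Return a (hash_size squared)-bit perceptual fingerprint as an int."""
    small = img.convert("L").resize((hash_size, hash_size))
    pixels = list(small.tobytes())  # one byte per grayscale pixel
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Example use (paths and the threshold of 4 bits are illustrative):
#   h1 = average_hash(Image.open("faces/a.jpg"))
#   h2 = average_hash(Image.open("faces/b.jpg"))
#   if hamming(h1, h2) <= 4:
#       print("probable near-duplicates")
```

Comparing every pair of fingerprints is quadratic in the dataset size, so for hundreds of thousands of faces one would want to bucket or index the hashes first rather than brute-force it.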

How­ev­er, if that is not enough or one is work­ing with a small dataset like for a sin­gle char­ac­ter, data aug­men­ta­tion may be nec­es­sary. The mirror/horizontal flip is not nec­es­sary as StyleGAN has that built-in as an op­tion25, but there are many other pos­si­ble data aug­men­ta­tions. One can stretch, shift col­ors, sharp­en, blur, increase/decrease contrast/brightness, crop, and so on. An ex­am­ple, ex­tremely ag­gres­sive, set of data aug­men­ta­tions could be done like this:

dataAugment () {
    image="$1"
    target=$(basename "$image")
    suffix="${image##*.}"   # reuse the input file's extension
    convert -deskew 50                     "$image" "$target".deskew."$suffix"
    convert -resize 110%x100%              "$image" "$target".horizstretch."$suffix"
    convert -resize 100%x110%              "$image" "$target".vertstretch."$suffix"
    convert -blue-shift 1.1                "$image" "$target".midnight."$suffix"
    convert -fill red -colorize 5%         "$image" "$target".red."$suffix"
    convert -fill orange -colorize 5%      "$image" "$target".orange."$suffix"
    convert -fill yellow -colorize 5%      "$image" "$target".yellow."$suffix"
    convert -fill green -colorize 5%       "$image" "$target".green."$suffix"
    convert -fill blue -colorize 5%        "$image" "$target".blue."$suffix"
    convert -fill purple -colorize 5%      "$image" "$target".purple."$suffix"
    convert -adaptive-blur 3x2             "$image" "$target".blur."$suffix"
    convert -adaptive-sharpen 4x2          "$image" "$target".sharpen."$suffix"
    convert -brightness-contrast 10        "$image" "$target".brighter."$suffix"
    convert -brightness-contrast 10x10     "$image" "$target".brightercontraster."$suffix"
    convert -brightness-contrast -10       "$image" "$target".darker."$suffix"
    convert -brightness-contrast -10x10    "$image" "$target".darkerlesscontrast."$suffix"
    convert +level 5%                      "$image" "$target".contraster."$suffix"
    convert -level 5%\!                    "$image" "$target".lesscontrast."$suffix"
}
export -f dataAugment
find faces/ -type f | parallel --progress dataAugment
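A few of these augmentations have rough Pillow analogues, sketched below for anyone doing the augmentation in Python instead. These are not pixel-identical to the ImageMagick operations above; the function name `augment`, the chosen factors (±10%), and the blur radius are all assumptions for illustration.

```python
from PIL import Image, ImageEnhance, ImageFilter

def augment(img):
    """Return {variant_name: image} for a handful of mild augmentations of `img`."""
    w, h = img.size
    return {
        "horizstretch": img.resize((round(w * 1.1), h)),
        "vertstretch": img.resize((w, round(h * 1.1))),
        "blur": img.filter(ImageFilter.GaussianBlur(radius=1)),
        "sharpen": img.filter(ImageFilter.SHARPEN),
        "brighter": ImageEnhance.Brightness(img).enhance(1.1),
        "darker": ImageEnhance.Brightness(img).enhance(0.9),
        "contraster": ImageEnhance.Contrast(img).enhance(1.1),
        "lesscontrast": ImageEnhance.Contrast(img).enhance(0.9),
    }

# Example use (paths are illustrative):
#   img = Image.open("faces/example.png").convert("RGB")
#   for name, variant in augment(img).items():
#       variant.save("faces/example.%s.png" % name)
```

Note that the stretched variants change the image dimensions, so they must still go through the final rescale to exactly 512×512 like everything else.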

Upscaling & Conversion

Once any qual­ity fixes or data aug­men­ta­tion are done, it’d be a good idea to save a lot of disk space by con­vert­ing to JPG & loss­ily re­duc­ing qual­ity (I find 33% saves a ton of space at no vis­i­ble change):

convertPNGToJPG() { convert -quality 33 "$@" "$@".jpg && rm "$@"; }
export -f convertPNGToJPG
find faces/ -type f -name "*.png" | parallel --progress convertPNGToJPG

Re­mem­ber that StyleGAN mod­els are only com­pat­i­ble with im­ages of the type they were trained on, so if you are us­ing a StyleGAN pre­trained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN mod­el­s), you will need to keep us­ing PNGs.

Doing the final scaling to exactly 512px can be done at many points but I generally postpone it to the end in order to work with images in their ‘native’ resolutions & aspect-ratios for as long as possible. At this point we carefully tell ImageMagick to rescale everything to 512×512, preserving the aspect ratio by filling in with a black background as necessary on either side:

find faces/ -type f | xargs --max-procs=16 -n 9000 \
    mogrify -gravity center -background black -resize 512x512\> -extent 512x512\>
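The same shrink-then-pad logic can be sketched in Pillow, which makes the intended behavior explicit: images larger than the target are shrunk to fit (keeping their aspect ratio), smaller images are left at their size, and either way the result is centered on a black 512×512 canvas. `fit_to_square` is a hypothetical name, and this is an approximation of the ImageMagick command above rather than a byte-identical replacement.

```python
from PIL import Image

def fit_to_square(img, side=512, fill=(0, 0, 0)):
    """Shrink `img` to fit a side x side box (never enlarging), centered on a black canvas."""
    img = img.convert("RGB")
    scale = min(1.0, side / img.width, side / img.height)
    if scale < 1.0:
        new_w = min(side, max(1, round(img.width * scale)))
        new_h = min(side, max(1, round(img.height * scale)))
        img = img.resize((new_w, new_h))
    canvas = Image.new("RGB", (side, side), fill)
    # center the (possibly shrunk) image; uncovered areas stay black
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

# Example use (paths and quality setting are illustrative):
#   fit_to_square(Image.open("faces/example.jpg")).save("faces/example.jpg", quality=95)
```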

Any slight­ly-d­iffer­ent im­age could crash the im­port process. There­fore, we delete any im­age which is even slightly differ­ent from the 512×512 sRGB JPG they are sup­posed to be:

find faces/ -type f | xargs --max-procs=16 -n 9000 identify | \
    # remember the warning: images must be identical, square, and sRGB/grayscale:
    fgrep -v " JPEG 512x512 512x512+0+0 8-bit sRGB"| cut -d ' ' -f 1 | \
    xargs --max-procs=16 -n 10000 rm

Hav­ing done all this, we should have a large con­sis­tent high­-qual­ity dataset.

Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format using dataset_tool.py. It is worth remembering at this point how fragile that is and the requirements; ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem.

Because of the extreme fragility of dataset_tool.py, I strongly advise that you edit it to print out the filenames of each file as they are being processed so that when (not if) it crashes, you can investigate the culprit and check the rest. The edit could be as simple as this:

diff --git a/dataset_tool.py b/dataset_tool.py
index 4ddfe44..e64e40b 100755
--- a/dataset_tool.py
+++ b/dataset_tool.py
@@ -519,6 +519,7 @@ def create_from_images(tfrecord_dir, image_dir, shuffle):
     with TFRecordExporter(tfrecord_dir, len(image_filenames)) as tfr:
         order = tfr.choose_shuffled_order() if shuffle else np.arange(len(image_filenames))
         for idx in range(order.size):
+            print(image_filenames[order[idx]])
             img = np.asarray(PIL.Image.open(image_filenames[order[idx]]))
             if channels == 1:
                 img = img[np.newaxis, :, :] # HW => CHW

There should be no is­sues if all the im­ages were thor­oughly checked ear­lier, but should any im­ages crash it, they can be checked in more de­tail by identify. (I ad­vise just delet­ing them and not try­ing to res­cue them.)

Then the con­ver­sion is just (as­sum­ing StyleGAN pre­req­ui­sites are in­stalled, see next sec­tion):

python dataset_tool.py create_from_images datasets/faces /media/gwern/Data/danbooru2018/faces/

Con­grat­u­la­tions, the hard­est part is over. Most of the rest sim­ply re­quires pa­tience (and a will­ing­ness to edit Python files di­rectly in or­der to con­fig­ure StyleGAN).



I as­sume you have CUDA in­stalled & func­tion­ing. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have suc­cess­fully used the Nvidia dri­ver ver­sion #410.104, CUDA 10.1, and Ten­sor­Flow 1.13.1.)

A Python ≥3.6 virtual environment can be set up for StyleGAN to keep dependencies tidy, with TensorFlow & StyleGAN dependencies installed:

conda create -n stylegan pip python=3.6
source activate stylegan

## TF:
pip install tensorflow-gpu
## Test install:
python -c "import tensorflow as tf; tf.enable_eager_execution(); \
    print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
pip install tensorboard

## StyleGAN:
## Install pre-requisites:
pip install pillow numpy moviepy scipy opencv-python lmdb # requests?
## Download:
git clone '' && cd ./stylegan/
## Test install:
## ./results/example.png should be a photograph of a middle-aged man

StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks, using this Colab notebook. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little or who are doing something quick like extremely low-resolution training or transfer-learning where a few GPU-hours on a slow small GPU might be enough.


StyleGAN does­n’t ship with any sup­port for CLI op­tions; in­stead, one must edit and train/

  1. train/

    The core configuration is done in the default arguments of the training_loop function, beginning at line 112.

    The key ar­gu­ments are G_smoothing_kimg & D_repeats (affects the learn­ing dy­nam­ic­s), network_snapshot_ticks (how often to save the pickle snap­shot­s—­more fre­quent means less progress lost in crash­es, but as each one weighs 300M­B+, can quickly use up gi­ga­bytes of space), resume_run_id (set to "latest"), and resume_kimg.

    Don’t Erase Your Model
    resume_kimg gov­erns where in the over­all pro­gres­sive-grow­ing train­ing sched­ule StyleGAN starts from. If it is set to 0, train­ing be­gins at the be­gin­ning of the pro­gres­sive-grow­ing sched­ule, at the low­est res­o­lu­tion, re­gard­less of how much train­ing has been pre­vi­ously done. It is vi­tally im­por­tant when do­ing trans­fer learn­ing that it is set to a suffi­ciently high num­ber (eg 10000) that train­ing be­gins at the high­est de­sired res­o­lu­tion like 512px, as it ap­pears that lay­ers are erased when added dur­ing pro­gres­sive-grow­ing. (resume_kimg may also need to be set to a high value to make it skip straight to train­ing at the high­est res­o­lu­tion if you are train­ing on small datasets of small im­ages, where there’s risk of it over­fit­ting un­der the nor­mal train­ing sched­ule and never reach­ing the high­est res­o­lu­tion.) This trick is un­nec­es­sary in StyleGAN 2, which is sim­pler in not us­ing pro­gres­sive grow­ing.

    More ex­per­i­men­tal­ly, I sug­gest set­ting minibatch_repeats = 1 in­stead of minibatch_repeats = 5; in line with the sus­pi­cious­ness of the gra­di­en­t-ac­cu­mu­la­tion im­ple­men­ta­tion in ProGAN/StyleGAN, this ap­pears to make train­ing both sta­bler & faster.

    Note that some of these vari­ables, like learn­ing rates, are over­rid­den in It’s bet­ter to set those there or else you may con­fuse your­self badly (like I did in won­der­ing why ProGAN & StyleGAN seemed ex­tra­or­di­nar­ily ro­bust to large changes in the learn­ing rates…).

  2. (pre­vi­ously in ProGAN; re­named in StyleGAN 2)

    Here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. (This file includes settings intended for ProGAN; watch out that you don’t accidentally turn on ProGAN instead of StyleGAN & confuse yourself.) Learning rate & minibatch should generally be left alone (except towards the end of training when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, like thus:

    desc += '-faces';     dataset = EasyDict(tfrecord_dir='faces', resolution=512);              train.mirror_augment = True

    This sets up the 512px face dataset which was pre­vi­ously cre­ated in dataset/faces, turns on mir­ror­ing (be­cause while there may be writ­ing in the back­ground, we don’t care about it for face gen­er­a­tion), and sets a ti­tle for the checkpoints/logs, which will now ap­pear in results/ with the ‘-faces’ string.

    Assuming you do not have 8 GPUs (as you probably do not), you must change the -preset to match your number of GPUs; StyleGAN will not automatically choose the correct number of GPUs. If you fail to set it to the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with an opaque error message (note that CUDA uses zero-indexing, so GPU:0 refers to the first GPU, GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my nonexistent third GPU):

    tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
        G_synthesis_3/lod: {{node G_synthesis_3/lod}}was explicitly assigned to /device:GPU:2 but available \
        devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
        /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
        /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
        Make sure the device specification refers to a valid device.
         [[{{node G_synthesis_3/lod}}]]

    For my 2×1080ti I’d set:

    desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8; sched.minibatch_dict = \
        {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}; sched.G_lrate_dict = {512: 0.0015, 1024: 0.002}; \
        sched.D_lrate_dict = EasyDict(sched.G_lrate_dict); train.total_kimg = 99000

    So my re­sults get saved to results/00001-sgan-faces-2gpu etc (the run ID in­cre­ments, ‘sgan’ be­cause StyleGAN rather than ProGAN, ‘-faces’ as the dataset be­ing trained on, and ‘2gpu’ be­cause it’s multi-GPU).
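
To make the resume_kimg warning above concrete, here is a minimal sketch of how a progressive-growing schedule maps the kimg counter to the resolution being trained. The function name `resolution_at` and the default phase lengths (600 kimg of stable training plus 600 kimg of fade-in per resolution, starting from 8px) are assumptions for illustration; check the schedule in your own checkout before relying on the exact numbers.

```python
import math

def resolution_at(kimg, final_res=512, initial_res=8,
                  train_kimg=600, transition_kimg=600):
    """Resolution being trained after `kimg` thousand images (hypothetical schedule)."""
    phase_len = train_kimg + transition_kimg
    num_doublings = int(math.log2(final_res // initial_res))
    phase = int(kimg // phase_len)
    # In the fade-in part of a phase, the next resolution is already being blended in.
    fading_in = (kimg - phase * phase_len) > train_kimg
    doublings = min(phase + (1 if fading_in else 0), num_doublings)
    return initial_res * 2 ** doublings

for kimg in (0, 700, 7000, 10000):
    print(kimg, resolution_at(kimg))
```

Under these assumed phase lengths, a counter of 0 starts over at the lowest resolution, while a value like 10000 is well past the last fade-in and so trains only at the full 512px.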


I typ­i­cally run StyleGAN in a ses­sion which can be de­tached and keeps mul­ti­ple shells or­ga­nized: 1 terminal/shell for the StyleGAN run, 1 terminal/shell for Ten­sor­Board, and 1 for Emacs.

With Emacs, I keep the two key Python files open ( and train/ for ref­er­ence & easy edit­ing.

With the “lat­est” patch, StyleGAN can be thrown into a while-loop to keep run­ning after crash­es, like:

while true; do nice py ; date; (xmessage "alert: StyleGAN crashed" &); sleep 10s; done

Ten­sor­Board is a log­ging util­ity which dis­plays lit­tle time-series of recorded vari­ables which one views in a web browser, eg:

tensorboard --logdir results/02022-sgan-faces-2gpu/
# TensorBoard 1.13.0 at (Press CTRL+C to quit)

Note that Ten­sor­Board can be back­ground­ed, but needs to be up­dated every time a new run is started as the re­sults will then be in a differ­ent fold­er.
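
Because every run writes to a new results/NNNNN-description directory, one way to avoid retyping the path is to look up the highest-numbered run directory and pass that to tensorboard --logdir. A minimal sketch; the helper name `pick_latest_run` is hypothetical, and it assumes only the numeric-prefix naming visible in the paths above.

```python
import re

def pick_latest_run(names):
    """Return the directory name with the highest numeric run-ID prefix, or None."""
    runs = []
    for name in names:
        m = re.match(r"(\d+)-", name)
        if m:
            runs.append((int(m.group(1)), name))
    return max(runs)[1] if runs else None

# Typical use (assumes a results/ directory exists):
#   import os
#   logdir = os.path.join("results", pick_latest_run(os.listdir("results")))
```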

Train­ing StyleGAN is much eas­ier & more re­li­able than other GANs, but it is still more of an art than a sci­ence. (We put up with it be­cause while GANs suck, every­thing else sucks more.) Notes on train­ing:

  • Crash­proofing:

    The ini­tial re­lease of StyleGAN was prone to crash­ing when I ran it, seg­fault­ing at ran­dom. Up­dat­ing Ten­sor­Flow ap­peared to re­duce this but the root cause is still un­known. Seg­fault­ing or crash­ing is also re­port­edly com­mon if run­ning on mixed GPUs (eg a 1080ti + Ti­tan V).

    Un­for­tu­nate­ly, StyleGAN has no set­ting for sim­ply re­sum­ing from the lat­est snap­shot after crashing/exiting (which is what one usu­ally wants), and one must man­u­ally edit the resume_run_id line in to set it to the lat­est run ID. This is te­dious and er­ror-prone—at one point I re­al­ized I had wasted 6 GPU-days of train­ing by restart­ing from a 3-day-old snap­shot be­cause I had not up­dated the resume_run_id after a seg­fault!

    If you are do­ing any runs longer than a few wall­clock hours, I strongly ad­vise use of nshep­perd’s patch to au­to­mat­i­cally restart from the lat­est snap­shot by set­ting resume_run_id = "latest":

    diff --git a/training/ b/training/
    index 50ae51c..d906a2d 100755
    --- a/training/
    +++ b/training/
    @@ -119,6 +119,14 @@ def list_network_pkls(run_id_or_run_dir, include_final=True):
             del pkls[0]
         return pkls
    +def locate_latest_pkl():
    +    allpickles = sorted(glob.glob(os.path.join(config.result_dir, '0*', 'network-*.pkl')))
    +    latest_pickle = allpickles[-1]
    +    resume_run_id = os.path.basename(os.path.dirname(latest_pickle))
    +    RE_KIMG = re.compile('network-snapshot-(\d+).pkl')
    +    kimg = int(RE_KIMG.match(os.path.basename(latest_pickle)).group(1))
    +    return (locate_network_pkl(resume_run_id), float(kimg))
     def locate_network_pkl(run_id_or_run_dir_or_network_pkl, snapshot_or_network_pkl=None):
         for candidate in [snapshot_or_network_pkl, run_id_or_run_dir_or_network_pkl]:
             if isinstance(candidate, str):
    diff --git a/training/ b/training/
    index 78d6fe1..20966d9 100755
    --- a/training/
    +++ b/training/
    @@ -148,7 +148,10 @@ def training_loop(
         # Construct networks.
         with tf.device('/gpu:0'):
             if resume_run_id is not None:
    -            network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
    +            if resume_run_id == 'latest':
    +                network_pkl, resume_kimg = misc.locate_latest_pkl()
    +            else:
    +                network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
                 print('Loading networks from "%s"...' % network_pkl)
                 G, D, Gs = misc.load_pkl(network_pkl)

    (The diff can be edited by hand, or copied into the repo as a file like latest.patch & then ap­plied with git apply latest.patch.)

  • Tun­ing Learn­ing Rates

    The LR is one of the most crit­i­cal hy­per­pa­ra­me­ters: too-large up­dates based on too-s­mall mini­batches are dev­as­tat­ing to GAN sta­bil­ity & fi­nal qual­i­ty. The LR also seems to in­ter­act with the in­trin­sic diffi­culty or di­ver­sity of an im­age do­main; Kar­ras et al 2019 use 0.003 G/D LRs on their FFHQ dataset (which has been care­fully cu­rated and the faces aligned to put land­marks like eyes/mouth in the same lo­ca­tions in every im­age) when train­ing on 8-GPU ma­chines with mini­batches of n = 32, but I find lower to be bet­ter on my anime face/portrait datasets where I can only do n = 8. From look­ing at train­ing videos of whole-Dan­booru2018 StyleGAN runs, I sus­pect that the nec­es­sary LRs would be lower still. Learn­ing rates are closely re­lated to mini­batch size (a com­mon rule of thumb in su­per­vised learn­ing of CNNs is that the re­la­tion­ship of biggest us­able LR fol­lows a square-root curve in mini­batch size) and the BigGAN re­search ar­gues that mini­batch size it­self strongly in­flu­ences how bad mode drop­ping is, which sug­gests that smaller LRs may be more nec­es­sary the more diverse/difficult a dataset is.

  • Bal­anc­ing G/D:

    Screen­shot of Ten­sor­Board G/D losses for an anime face StyleGAN mak­ing progress to­wards con­ver­gence

    Later in train­ing, if the G is not mak­ing good progress to­wards the ul­ti­mate goal of a 0.5 loss (and the D’s loss grad­u­ally de­creas­ing to­wards 0.5), and has a loss stub­bornly stuck around −1 or some­thing, it may be nec­es­sary to change the bal­ance of G/D. This can be done sev­eral ways but the eas­i­est is to ad­just the LRs in, sched.G_lrate_dict & sched.D_lrate_dict.

    One needs to keep an eye on the G/D losses and also the perceptual quality of the faces (since we don’t have any good FID equivalent yet for anime faces, which requires a good open-source Danbooru tagger to create embeddings), and reduce both LRs (or usually just the D’s LR) based on the face quality and whether the G/D losses are exploding or otherwise look imbalanced. What you want, I think, is for the G/D losses to be stable at a certain absolute amount for a long time while the quality visibly improves, reducing D’s LR as necessary to keep it balanced with G; and then once you’ve run out of time/patience or artifacts are showing up, you can decrease both LRs to converge onto a local optimum.

    I find the default of 0.003 can be too high once quality reaches a high level with both faces & portraits, and it helps to reduce it to a third (0.001) or a tenth (0.0003). If there still isn’t convergence, the D may be too strong and it can be turned down separately, to a tenth or a fiftieth even. (Given the stochasticity of training & the relativity of the losses, one should wait several wallclock hours or days after each modification to see if it made a difference.)

  • Skip­ping FID met­rics:

    Some met­rics are com­puted for logging/reporting. The FID met­rics are cal­cu­lated us­ing an old Im­a­geNet CNN; what is re­al­is­tic on Im­a­geNet may have lit­tle to do with your par­tic­u­lar do­main and while a large FID like 100 is con­cern­ing, FIDs like 20 or even in­creas­ing are not nec­es­sar­ily a prob­lem or use­ful guid­ance com­pared to just look­ing at the gen­er­ated sam­ples or the loss curves. Given that com­put­ing FID met­rics is not free & po­ten­tially ir­rel­e­vant or mis­lead­ing on many im­age do­mains, I sug­gest dis­abling them en­tire­ly. (They are not used in the train­ing for any­thing, and dis­abling them is safe.)

    They can be edited out of the main train­ing loop by com­ment­ing out the call to like so:

    @@ -261,7 +265,7 @@ def training_loop()
            if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
                pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
                misc.save_pkl((G, D, Gs), pkl)
                #, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)
  • ‘Blob’ & ‘Crack’ Ar­ti­facts:

    During training, ‘blobs’ often show up or move around. These blobs appear even late in training on otherwise high-quality images and are unique to StyleGAN (at least, I’ve never seen another GAN whose training artifacts look like the blobs). That they are so large & glaring suggests a weakness in StyleGAN somewhere. The source of the blobs was unclear. If you watch training videos, these blobs seem to gradually morph into new features such as eyes or hair or glasses. I suspect they are part of how StyleGAN ‘creates’ new features, starting with a feature-less blob superimposed at approximately the right location, and gradually refined into something useful. The StyleGAN 2 paper investigated the blob artifacts & found them to be due to the Generator working around a flaw in StyleGAN’s use of AdaIN normalization. Karras et al 2019 note that images without a blob somewhere are severely corrupted; because the blobs are in fact doing something useful, it is unsurprising that the Discriminator doesn’t fix the Generator. StyleGAN 2 changes the AdaIN normalization to eliminate this problem, improving overall quality.

    If blobs are appearing too often or one wants a final model without any new intrusive blobs, it may help to lower the LR to try to converge to a local optimum where the necessary blob is hidden away somewhere unobtrusive.

    In train­ing anime faces, I have seen ad­di­tional ar­ti­facts, which look like ‘cracks’ or ‘waves’ or ele­phant skin wrin­kles or the sort of fine craz­ing seen in old paint­ings or ce­ram­ics, which ap­pear to­ward the end of train­ing on pri­mar­ily skin or ar­eas of flat col­or; they hap­pen par­tic­u­larly fast when trans­fer learn­ing on a small dataset. The only so­lu­tion I have found so far is to ei­ther stop train­ing or get more da­ta. In con­trast to the blob ar­ti­facts (i­den­ti­fied as an ar­chi­tec­tural prob­lem & fixed in StyleGAN 2), I cur­rently sus­pect the cracks are a sign of over­fit­ting rather than a pe­cu­liar­ity of nor­mal StyleGAN train­ing, where the G has started try­ing to mem­o­rize noise in the fine de­tail of pixelation/lines, and so these are a kind of overfitting/mode col­lapse. (More spec­u­la­tive­ly: an­other pos­si­ble ex­pla­na­tion is that the cracks are caused by the StyleGAN D be­ing sin­gle-s­cale rather than mul­ti­-s­cale—as in MSG-GAN and a num­ber of oth­er­s—and the ‘cracks’ are ac­tu­ally high­-fre­quency noise cre­ated by the G in spe­cific patches as ad­ver­sar­ial ex­am­ples to fool the D. They re­port­edly do not ap­pear in MSG-GAN or StyleGAN 2, which both use mul­ti­-s­cale Ds.)

  • Gra­di­ent Ac­cu­mu­la­tion:

    ProGAN/StyleGAN’s code­base claims to sup­port gra­di­ent ac­cu­mu­la­tion, which is a way to fake large mini­batch train­ing (eg n = 2048) by not do­ing the back­prop­a­ga­tion up­date every mini­batch, but in­stead sum­ming the gra­di­ents over many mini­batches and ap­ply­ing them all at once. This is a use­ful trick for sta­bi­liz­ing train­ing, and large mini­batch NN train­ing can differ qual­i­ta­tively from small mini­batch NN training—BigGAN per­for­mance in­creased with in­creas­ingly large mini­batches (n = 2048) and the au­thors spec­u­late that this is be­cause such large mini­batches mean that the full di­ver­sity of the dataset is rep­re­sented in each ‘mini­batch’ so the BigGAN mod­els can­not sim­ply ‘for­get’ rarer dat­a­points which would oth­er­wise not ap­pear for many mini­batches in a row, re­sult­ing in the GAN pathol­ogy of ‘mode drop­ping’ where some kinds of data just get ig­nored by both G/D.

    How­ev­er, the ProGAN/StyleGAN im­ple­men­ta­tion of gra­di­ent ac­cu­mu­la­tion does not re­sem­ble that of any other im­ple­men­ta­tion I’ve seen in Ten­sor­Flow or Py­Torch, and in my own ex­per­i­ments with up to n = 4096, I did­n’t ob­serve any sta­bi­liza­tion or qual­i­ta­tive differ­ences, so I am sus­pi­cious the im­ple­men­ta­tion is wrong.
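
For comparison, this is what gradient accumulation conventionally means: gradients from several small minibatches are summed without touching the weights, and a single update is applied at the end, which for a loss averaged over examples matches one step on the large minibatch. A minimal framework-free sketch on a toy one-parameter least-squares model; the function names and data are made up for illustration, and this is not the ProGAN/StyleGAN code.

```python
def grad_sum(w, batch):
    """Sum over the batch of d/dw (w*x - y)^2 = 2*(w*x - y)*x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def step_large_batch(w, data, lr):
    """One SGD step computed on the whole batch at once."""
    return w - lr * grad_sum(w, data) / len(data)

def step_accumulated(w, data, lr, micro_size):
    """The same step, with the gradient accumulated over small minibatches."""
    total = 0.0
    for i in range(0, len(data), micro_size):
        total += grad_sum(w, data[i:i + micro_size])  # accumulate only, no update
    return w - lr * total / len(data)                 # single update at the end

data = [(float(x), 3.0 * x + 1.0) for x in range(16)]
w_big = step_large_batch(0.5, data, lr=0.001)
w_acc = step_accumulated(0.5, data, lr=0.001, micro_size=4)
print(w_big, w_acc)  # agree up to floating-point rounding
```

An implementation that instead updates the weights after every small minibatch is just ordinary small-minibatch training, which would be consistent with seeing no difference at nominally huge n.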

Here is what a suc­cess­ful train­ing pro­gres­sion looks like for the anime face StyleGAN:

Train­ing mon­tage video of the first 9k it­er­a­tions of the anime face StyleGAN.

The anime face model is ob­so­leted by the StyleGAN 2 por­trait model.

The anime face model as of 2019-03-08, trained for 21,980 it­er­a­tions or ~21m im­ages or ~38 GPU-days, is avail­able for down­load. (It is still not ful­ly-con­verged, but the qual­ity is good.)


Hav­ing suc­cess­fully trained a StyleGAN, now the fun part—­gen­er­at­ing sam­ples!

Psi/“truncation trick”

The 𝜓/“trun­ca­tion trick”(BigGAN dis­cus­sion, StyleGAN dis­cus­sion; ap­par­ently first in­tro­duced by ) is the most im­por­tant hy­per­pa­ra­me­ter for all StyleGAN gen­er­a­tion.

The truncation trick is used at sample generation time but not training time. The idea is to edit the latent vector z, which is a vector of N(0,1) variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those. This seems to help by avoiding ‘extreme’ latent values or combinations of latent values which the G is not as good at: a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted distribution for each one, or one can truncate them at something like +1SD or +0.7SD. (Like the discussion of the best distribution for the original latent distribution, there’s no good reason to think that this is an optimal method of doing truncation; there are many alternatives, such as ones penalizing the sum of the variables, either rejecting them or scaling them down, and some may work better than the current truncation trick.)

At 𝜓 = 0, di­ver­sity is nil and all faces are a sin­gle global av­er­age face (a brown-eyed brown-haired school­girl, un­sur­pris­ing­ly); at ±0.5 you have a broad range of faces, and by ±1.2, you’ll see tremen­dous di­ver­sity in faces/styles/consistency but also tremen­dous ar­ti­fact­ing & dis­tor­tion. Where you set your 𝜓 will heav­ily in­flu­ence how ‘orig­i­nal’ out­puts look. At 𝜓 = 1.2, they are tremen­dously orig­i­nal but ex­tremely hit or miss. At 𝜓 = 0.5 they are con­sis­tent but bor­ing. For most of my sam­pling, I set 𝜓 = 0.7 which strikes the best bal­ance be­tween craziness/artifacting and quality/diversity. (Per­son­al­ly, I pre­fer to look at 𝜓 = 1.2 sam­ples be­cause they are so much more in­ter­est­ing, but if I re­leased those sam­ples, it would give a mis­lead­ing im­pres­sion to read­er­s.)
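
Two flavors of truncation can be sketched in NumPy. `truncate_by_resampling` follows the resampling description (redraw any latent component whose magnitude exceeds a threshold); `truncate_towards_mean` is the interpolation form described in the StyleGAN paper, where 𝜓 scales the distance from an average latent, which is why 𝜓 = 0 yields a single average face and negative 𝜓 is meaningful. Both function names, the 512-dimensional latent, and the all-zeros `w_avg` stand-in are assumptions for illustration; with a real model, 𝜓 is simply passed as truncation_psi when running the generator.

```python
import numpy as np

def truncate_by_resampling(z, threshold, rng):
    """Redraw every component of z with |value| > threshold until none remain."""
    z = z.copy()
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.randn(int(mask.sum()))
        mask = np.abs(z) > threshold
    return z

def truncate_towards_mean(w, w_avg, psi):
    """Pull a latent towards the average: psi=1 leaves it alone, psi=0 returns w_avg."""
    return w_avg + psi * (w - w_avg)

rng = np.random.RandomState(0)
z = rng.randn(512)
z_trunc = truncate_by_resampling(z, threshold=1.0, rng=rng)

w_avg = np.zeros(512)  # stand-in; a trained model tracks its own average latent
w_half = truncate_towards_mean(z, w_avg, psi=0.5)
```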

Random Samples

The StyleGAN repo has a simple script to download & generate a single face; in the interests of reproducibility, it hardwires the model and the RNG seed so it will only generate 1 particular face. However, it can be easily adapted to use a local model and (slowly) generate, say, 1000 sample images with the hyperparameter 𝜓 = 0.6 (which gives high-quality but not highly-diverse images) which are saved to results/example-{0-999}.png:

import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config

def main():
    tflib.init_tf() # TensorFlow must be initialized before the networks are unpickled
    # Gs is the long-term average of the generator, which gives the best samples
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    os.makedirs(config.result_dir, exist_ok=True)

    for i in range(0,1000):
        rnd = np.random.RandomState(None) # seed=None: a fresh random face on every iteration
        latents = rnd.randn(1, Gs.input_shape[1])
        images = Gs.run(latents, None, truncation_psi=0.6, randomize_noise=True, output_transform=fmt)
        png_filename = os.path.join(config.result_dir, 'example-'+str(i)+'.png')
        PIL.Image.fromarray(images[0], 'RGB').save(png_filename)

if __name__ == "__main__":
    main()

Karras et al 2018 Figures

The fig­ures in Kar­ras et al 2018, demon­strat­ing ran­dom sam­ples and as­pects of the style noise us­ing the 1024px FFHQ face model (as well as the oth­er­s), were gen­er­ated by This script needs ex­ten­sive mod­i­fi­ca­tions to work with my 512px anime face; go­ing through the file:

  • the code uses 𝜓 = 1 trun­ca­tion, but faces look bet­ter with 𝜓 = 0.7 (sev­eral of the func­tions have truncation_psi= set­tings but, trick­i­ly, the Fig­ure 3 draw_style_mixing_figure has its 𝜓 set­ting hid­den away in the synthesis_kwargs global vari­able)
  • the loaded model needs to be switched to the anime face mod­el, of course
  • di­men­sions must be re­duced 1024→512 as ap­pro­pri­ate; some ranges are hard­coded and must be re­duced for 512px im­ages as well
  • the trun­ca­tion trick fig­ure 8 does­n’t show enough faces to give in­sight into what the la­tent space is do­ing so it needs to be ex­panded to show both more ran­dom seeds/faces, and more 𝜓 val­ues
  • the bedroom/car/cat sam­ples should be dis­abled

The changes I make are as fol­lows:

diff --git a/ b/
index 45b68b8..f27af9d 100755
--- a/
+++ b/
@@ -24,16 +24,13 @@ url_bedrooms    = '
 url_cars        = '' # karras2019stylegan-cars-512x384.pkl
 url_cats        = '' # karras2019stylegan-cats-256x256.pkl

-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)

 _Gs_cache = dict()

 def load_Gs(url):
-    if url not in _Gs_cache:
-        with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
-            _G, _D, Gs = pickle.load(f)
-        _Gs_cache[url] = Gs
-    return _Gs_cache[url]
+    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+    return Gs

 # Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
     canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
     for row, seed in enumerate(seeds):
         latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
-        images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+        images = Gs.run(latents, None, **synthesis_kwargs)
         canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
         for i in range(4):
             crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
     all_images = []
     for noise_range in noise_ranges:
         tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
-        range_images = Gsc.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+        range_images = Gsc.run(latents, None, randomize_noise=False, **synthesis_kwargs)
         range_images[flips, :, :] = range_images[flips, :, ::-1]

@@ -144,14 +141,11 @@ def draw_truncation_trick_figure(png, Gs, w, h, seeds, psis):
 def main():
     os.makedirs(config.result_dir, exist_ok=True)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
-    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
-    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
-    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
-    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])

All this done, we get some fun anime face sam­ples to par­al­lel Kar­ras et al 2018’s fig­ures:

Anime face StyleGAN, Fig­ure 2, un­cu­rated sam­ples
Fig­ure 3, “style mix­ing” of source/transfer faces, demon­strat­ing con­trol & in­ter­po­la­tion (top row=style, left colum­n=­tar­get to be styled)
Fig­ure 8, the “trun­ca­tion trick” vi­su­al­ized: 10 ran­dom faces, with the range 𝜓 = [1, 0.7, 0.5, 0.25, 0, −0.25, −0.5, −1]—demon­strat­ing the trade­off be­tween di­ver­sity & qual­i­ty, and the global av­er­age face.


Training Montage

The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated, and at the end can be quite large (often 14MB each for the anime faces) so doing lossy compression with a tool like pngnq+advpng or converting them to JPG with lowered quality is a good idea. To turn the many snapshots into a training montage video like above, I use FFmpeg on the PNGs:

# -framerate 10: show 10 inputs per second; -i -: read the PNGs from stdin
# -r 25: output frame-rate; frames will be duplicated to pad out to 25FPS
# -c:v libx264: x264 for compatibility
# -pix_fmt yuv420p: force ffmpeg to use a standard colorspace - otherwise PNG colorspace is kept, breaking browsers (!)
# -crf 33: adequate high quality
# -vf "scale=iw/2:ih/2": shrink the image by 2×, the full detail is not necessary & saves space
# -preset veryslow -tune animation: aim for smallest binary possible with animation-tuned settings
cat $(ls ./results/*faces*/fakes*.png | sort --numeric-sort) | ffmpeg -framerate 10 -i - -r 25 \
    -c:v libx264 -pix_fmt yuv420p -crf 33 -vf "scale=iw/2:ih/2" \
    -preset veryslow -tune animation \
    montage.mp4 # output file (placeholder name)


The orig­i­nal ProGAN repo pro­vided a con­fig for gen­er­at­ing in­ter­po­la­tion videos, but that was re­moved in StyleGAN. Cyril Di­agne (@kikko_fr) im­ple­mented a re­place­ment, pro­vid­ing 3 kinds of videos:

  1. random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default it makes 4 of them arranged 2×2 in the video. Several interpolation videos are shown in the examples section.

  2. interpolate.mp4: a ‘coarse’ “style mix­ing” video; a sin­gle ‘source’ face is gen­er­ated & held con­stant; a sec­ondary in­ter­po­la­tion video, a ran­dom walk as be­fore is gen­er­at­ed; at each step of the ran­dom walk, the ‘coarse’/high-level ‘style’ noise is copied from the ran­dom walk to over­write the source face’s orig­i­nal style noise. For faces, this means that the orig­i­nal face will be mod­i­fied with all sorts of ori­en­ta­tions & fa­cial ex­pres­sions while still re­main­ing rec­og­niz­ably the orig­i­nal char­ac­ter. (It is the video ana­log of Kar­ras et al 2018’s Fig­ure 3.)

    A copy of Di­ag­ne’s

    import os
    import pickle
    import numpy as np
    import PIL.Image
    import dnnlib
    import dnnlib.tflib as tflib
    import config
    import scipy.ndimage
    def main():
        tflib.init_tf() # TensorFlow must be initialized before the networks are unpickled
        # Load pre-trained network.
        # url = ''
        # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
        ## NOTE: insert model here:
        _G, _D, Gs = pickle.load(open("results/02047-sgan-faces-2gpu/network-snapshot-013221.pkl", "rb"))
        # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
        # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
        # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.
        grid_size = [2,2]
        image_shrink = 1
        image_zoom = 1
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        random_seed = 404
        mp4_file = 'results/random_grid_%s.mp4' % random_seed
        minibatch_size = 8
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_state = np.random.RandomState(random_seed)
        # Generate latent vectors
        shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:] # [frame, image, channel, component]
        all_latents = random_state.randn(*shape).astype(np.float32)
        import scipy.ndimage
        all_latents = scipy.ndimage.gaussian_filter(all_latents,
                       [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
        all_latents /= np.sqrt(np.mean(np.square(all_latents)))
        def create_image_grid(images, grid_size=None):
            assert images.ndim == 3 or images.ndim == 4
            num, img_h, img_w, channels = images.shape
            if grid_size is not None:
                grid_w, grid_h = tuple(grid_size)
            else:
                # no grid requested: fall back to a roughly square layout
                grid_w = max(int(np.ceil(np.sqrt(num))), 1)
                grid_h = max((num - 1) // grid_w + 1, 1)
            grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
            for idx in range(num):
                x = (idx % grid_w) * img_w
                y = (idx // grid_w) * img_h
                grid[y : y + img_h, x : x + img_w] = images[idx]
            return grid
        # Frame generation func for moviepy.
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            latents = all_latents[frame_idx]
            fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
            images = Gs.run(latents, None, truncation_psi=0.7,
                                  randomize_noise=False, output_transform=fmt)
            grid = create_image_grid(images, grid_size)
            if image_zoom > 1:
                grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
            if grid.shape[2] == 1:
                grid = grid.repeat(3, 2) # grayscale => RGB
            return grid
        # Generate video.
        import moviepy.editor
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
        # import scipy
        # coarse
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_seed = 500
        random_state = np.random.RandomState(random_seed)
        w = 512
        h = 512
        #src_seeds = [601]
        dst_seeds = [700]
        style_ranges = ([0] * 7 + [range(8,16)]) * len(dst_seeds)
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)
        shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
        src_latents = random_state.randn(*shape).astype(np.float32)
        src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                    smoothing_sec * mp4_fps)
        src_latents /= np.sqrt(np.mean(np.square(src_latents)))
        dst_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in dst_seeds])
        src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
        dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
        src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
        dst_images = Gs.components.synthesis.run(dst_dlatents, randomize_noise=False, **synthesis_kwargs)
        canvas = PIL.Image.new('RGB', (w * (len(dst_seeds) + 1), h * 2), 'white')
        for col, dst_image in enumerate(list(dst_images)):
            canvas.paste(PIL.Image.fromarray(dst_image, 'RGB'), ((col + 1) * h, 0))
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            src_image = src_images[frame_idx]
            canvas.paste(PIL.Image.fromarray(src_image, 'RGB'), (0, h))
            for col, dst_image in enumerate(list(dst_images)):
                col_dlatents = np.stack([dst_dlatents[col]])
                col_dlatents[:, style_ranges[col]] = src_dlatents[frame_idx, style_ranges[col]]
                col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
                for row, image in enumerate(list(col_images)):
                    canvas.paste(PIL.Image.fromarray(image, 'RGB'), ((col + 1) * h, (row + 1) * w))
            return np.array(canvas)
        # Generate video.
        import moviepy.editor
        mp4_file = 'results/interpolate.mp4'
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
        import scipy.ndimage
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_seed = 503
        random_state = np.random.RandomState(random_seed)
        w = 512
        h = 512
        style_ranges = [range(6,16)]
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)
        shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
        src_latents = random_state.randn(*shape).astype(np.float32)
        src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                    smoothing_sec * mp4_fps)
        src_latents /= np.sqrt(np.mean(np.square(src_latents)))
        dst_latents = np.stack([random_state.randn(Gs.input_shape[1])])
        src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
        dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            col_dlatents = np.stack([dst_dlatents[0]])
            col_dlatents[:, style_ranges[0]] = src_dlatents[frame_idx, style_ranges[0]]
            col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
            return col_images[0]
        # Generate video.
        import moviepy.editor
        mp4_file = 'results/fine_%s.mp4' % (random_seed)
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
    if __name__ == "__main__":
        main()

    ‘Coarse’ style-transfer/interpolation video

  3. fine_503.mp4: a ‘fine’ style mixing video; in this case, the style noise is taken from later layers (style_ranges = range(6,16) in the script above) and, instead of affecting the global orientation or expression, it affects subtler details like the precise shape of hair strands, hair color, or mouths.

    ‘Fine’ style-transfer/interpolation video
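
The coarse/fine distinction comes down to which rows of the per-layer dlatents array get overwritten. A minimal NumPy-only sketch, assuming a 512px generator with 16 style inputs of 512 dimensions each; the arrays are random stand-ins for the mapping network's output, and `mix_styles` is a hypothetical helper rather than part of StyleGAN:

```python
import numpy as np

NUM_LAYERS = 16   # assumption: 512px StyleGAN (2 style inputs per resolution, 4px to 512px)
STYLE_DIM = 512   # assumption: default dlatent size

def mix_styles(dst_dlatents, src_dlatents, layers):
    """Return a copy of dst_dlatents with the given layers replaced by src_dlatents."""
    mixed = dst_dlatents.copy()
    idx = list(layers)
    mixed[idx] = src_dlatents[idx]
    return mixed

rng = np.random.RandomState(0)
dst = rng.randn(NUM_LAYERS, STYLE_DIM).astype(np.float32)  # stand-in for one fixed face
src = rng.randn(NUM_LAYERS, STYLE_DIM).astype(np.float32)  # stand-in for one video frame

fine = mix_styles(dst, src, range(6, 16))   # same layer range as the 'fine' script
coarse = mix_styles(dst, src, range(0, 4))  # hypothetical early-layer range, for contrast
```

Only the overwritten layers follow the moving source latent; the remaining layers stay pinned to the fixed face, which is why the 'fine' video keeps pose and identity stable.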

Circular interpolations are another interesting kind of interpolation, written by snowy halcy: instead of random-walking freely around the latent space, with large or awkward transitions, it tries to move around a fixed high-dimensional point by doing: “binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a sphere or circle in the latent space.” A later version of circular interpolation is in snowy halcy’s face editor repo, but here is the original version cleaned up into a stand-alone program:

import dnnlib.tflib as tflib
import math
import moviepy.editor
from numpy import linalg
import numpy as np
import pickle

def main():
    tflib.init_tf()  # a TensorFlow session must exist before the pickle is loaded
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    def circ_generator(latents_interpolate):
        radius = 40.0

        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)

        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius

        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0

        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = []

        video_length = 1.0
        while current_pos < video_length:
            array_list.append(current_image)  # store the last accepted frame
            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0

                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0

                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print(current_pos, current_mse)
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20

    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()

‘Cir­cu­lar’ in­ter­po­la­tion video
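
The path itself can be examined without a GPU. In the program above, the two axes (a − b and a − c) are not orthogonal, so the curve traced is an ellipse rather than a true circle, which fits the "probably close to a sphere or circle" caveat. A sketch, assuming 512-dimensional latents, of a variant that orthogonalizes the second axis (Gram-Schmidt) so that every frame lies at exactly `radius` from the center; `circular_path` is a hypothetical helper, not snowy halcy's code:

```python
import numpy as np

def circular_path(center, p1, p2, radius, num_frames):
    """Latents on a circle of `radius` around `center`, in the plane spanned by
    (center - p1) and (center - p2). Returns an array of shape [num_frames, dim]."""
    u = center - p1
    u = u / np.linalg.norm(u)
    v = center - p2
    v = v - np.dot(v, u) * u          # Gram-Schmidt: remove the component along u
    v = v / np.linalg.norm(v)
    theta = 2.0 * np.pi * np.arange(num_frames) / num_frames
    return center + radius * (np.sin(theta)[:, None] * u + np.cos(theta)[:, None] * v)

rng = np.random.RandomState(0)
a, b, c = rng.randn(3, 512)           # assumption: 512-dimensional latent space
path = circular_path(a, b, c, radius=40.0, num_frames=120)
distances = np.linalg.norm(path - a, axis=1)   # constant by construction
```

Uniform steps in angle do not guarantee uniform visual change between frames, which is what the binary search on MSE in the program above addresses.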

An in­ter­est­ing use of in­ter­po­la­tions is Kyle McLean’s “Waifu Syn­the­sis” video: a singing anime video mash­ing up StyleGAN anime faces + lyrics + Project Ma­genta mu­sic.


Anime Faces

The primary model I’ve trained, the anime face model, is described in the data processing & training section. It is a 512px StyleGAN model trained on n = 218,794 faces cropped from all of Danbooru2017, cleaned, & upscaled, and trained for 21,980 iterations or ~21m images or ~38 GPU-days.

Downloads (I recommend using the more-recent portrait StyleGAN unless cropped faces are specifically desired):


To show off the anime faces, and as a joke, on 2019-02-14, I set up “This Waifu Does Not Exist” (TWDNE), a standalone static website which displays a random anime face (out of 100,000), generated with various 𝜓, and paired with GPT-2-117M text snippets prompted on anime plot summaries. The details are too lengthy to go into here.

But the site was amus­ing & an enor­mous suc­cess. It went vi­ral overnight and by the end of March 2019, ~1 mil­lion unique vis­i­tors (most from Chi­na) had vis­ited TWDNE, spend­ing over 2 min­utes each look­ing at the NN-gen­er­ated faces & text; peo­ple be­gan hunt­ing for hi­lar­i­ous­ly-de­formed faces, us­ing TWDNE as a screen­saver, pick­ing out faces as avatars, cre­at­ing packs of faces for video games, paint­ing their own col­lages of faces, us­ing it as a char­ac­ter de­signer for in­spi­ra­tion, etc.

Anime Bodies

Aaron Gokaslan experimented with a custom 256px anime game image dataset which has individual characters posed in whole-person images to see how StyleGAN coped with more complex geometries. Progress required additional data cleaning and lowering the learning rate but, trained on a 4-GPU system for a week or two, the results are promising (even down to reproducing the copyright statements in the images), providing preliminary evidence that StyleGAN can scale:

Whole-body anime im­ages, ran­dom sam­ples, Aaron Gokaslan
Whole-body anime im­ages, style trans­fer among sam­ples, Aaron Gokaslan

Transfer Learning

"In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

“What are you do­ing?”, asked Min­sky. “I am train­ing a ran­domly wired neural net to play Tic-Tac-Toe” Suss­man replied. “Why is the net wired ran­dom­ly?”, asked Min­sky. “I do not want it to have any pre­con­cep­tions of how to play”, Suss­man said.

Min­sky then shut his eyes. “Why do you close your eyes?”, Suss­man asked his teacher. “So that the room will be emp­ty.”

At that mo­ment, Suss­man was en­light­ened."

“Sussman attains enlightenment”, “AI Koans”, Jargon File

One of the most useful things to do with a model trained on a broad data corpus is to use it as a launching pad to train a better model more quickly on less data, called “transfer learning”. For example, one might transfer learn from Nvidia’s FFHQ face StyleGAN model to a different celebrity dataset, or from bedrooms→kitchens. Or with the anime face model, one might retrain it on a subset of faces: all characters with red hair, or all male characters, or just a single specific character. Even if a dataset seems different, starting from a pretrained model can save time; after all, while male and female faces may look different and it may seem like a mistake to start from a mostly-female anime face model, the alternative of starting from scratch means starting with a model generating random rainbow-colored static, and surely male faces look far more like female faces than they do random static?[31] Indeed, you can quickly train a photographic face model starting from the anime face model.

This extends the reach of good StyleGAN models from those blessed with both big data & big compute to those with little of either. Transfer learning works particularly well for specializing the anime face model to a specific character: the images of that character would be too few to train a good StyleGAN on, too data-impoverished for the sample-inefficient StyleGAN 1–2[32], but having been trained on all anime faces, the StyleGAN has learned well the full space of anime faces and can easily specialize down without overfitting. Trying to do, say, faces ↔︎ landscapes is probably a bridge too far.

Data-wise, for doing face specialization, the more the better: n = 500–5000 is an adequate range, but even as low as n = 50 works surprisingly well. I don’t know to what extent data augmentation can substitute for original datapoints but it’s probably worth a try, especially if you have n < 5000.
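
For reference, a minimal sketch of the kind of cheap augmentation meant here, in NumPy only. The specific transforms (mirroring, brightness jitter, small shifts) and their ranges are illustrative assumptions, not the pipeline used for the datasets below, and `augment` is a hypothetical helper:

```python
import numpy as np

def augment(image, rng, mirror=True, max_shift=8, brightness=0.1):
    """Return one randomly perturbed copy of an HxWx3 uint8 image."""
    out = image.astype(np.float32)
    if mirror and rng.rand() < 0.5:
        out = out[:, ::-1]                                     # horizontal flip
    out = out * (1.0 + rng.uniform(-brightness, brightness))   # global brightness jitter
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))            # small shift (wraps at the edges)
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.RandomState(0)
face = rng.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)  # stand-in for a real crop
copies = [augment(face, rng) for _ in range(4)]
```

Mirroring is only appropriate when the character's design is symmetrical, so `mirror=False` would be the cautious setting otherwise.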

Compute-wise, specialization is rapid. Adaptation can happen within a few ticks, possibly even 1. This is surprisingly fast given that StyleGAN is not designed for few-shot/transfer learning. I speculate that this may be because the StyleGAN latent space is expressive enough that even new faces (such as new human faces for a FFHQ model, or a new anime character for an anime-face model) are still already present in the latent space. Examples of the expressivity are provided by Abdal et al 2019, who find that “although the StyleGAN generator is trained on a human face dataset [FFHQ], the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars.” If even images as different as cars can be encoded successfully into a face StyleGAN, then clearly the latent space can easily model new faces and so any new face training data is in some sense already learned; so the training process is perhaps not so much about learning ‘new’ faces as about making the new faces more ‘important’ by expanding the latent space around them & contracting it around everything else, which seems like a far easier task.

How does one actually do transfer learning? Since StyleGAN is (currently) unconditional with no dataset-specific categorical or text or metadata encoding, just a flat set of images, all that has to be done is to encode the new dataset and simply start training with an existing model. One creates the new dataset as usual, and then edits the training configuration with a new -desc line for the new dataset; if resume_kimg is set correctly (see next paragraph) and resume_run_id = "latest" enabled as advised, you can then rerun the training script and presto, transfer learning.

The main problem seems to be that training cannot be done from scratch/0 iterations, as one might naively assume: when I tried this, it did not work well and StyleGAN appeared to be ignoring the pretrained model. My hypothesis is that as part of the progressive growing/fading in of additional resolution/layers, StyleGAN simply randomizes or wipes out each new layer and overwrites them, making it pointless. This is easy to avoid: simply jump the training schedule all the way to the desired resolution. For example, to start at one’s maximum size (here 512px) one might set resume_kimg=7000 in the training loop’s settings. This forces StyleGAN to skip all the progressive growing and load the full model as-is. To make sure you did it right, check the first sample (fakes07000.png or whatever), from before any transfer learning training has been done, and it should look like the original model did at the end of its training. Then subsequent training samples should show the original quickly morphing to the new dataset. (Anything like fakes00000.png should not show up because that indicates beginning from scratch.)
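
The sanity check described above can be automated. A rough sketch, assuming the run directory contains sample grids named `fakes<kimg>.png` as mentioned above (the exact zero-padding may differ between versions, so the pattern accepts any number of digits); `first_sample_kimg`, `resumed_correctly`, and the demonstration file names are hypothetical:

```python
import os
import re
import tempfile

def first_sample_kimg(run_dir):
    """Return the smallest kimg among the fakes*.png sample grids in run_dir, or None."""
    kimgs = []
    for name in os.listdir(run_dir):
        match = re.fullmatch(r"fakes(\d+)\.png", name)
        if match:
            kimgs.append(int(match.group(1)))
    return min(kimgs) if kimgs else None

def resumed_correctly(run_dir, resume_kimg):
    """True if the earliest sample grid is at or after resume_kimg (not from scratch)."""
    first = first_sample_kimg(run_dir)
    return first is not None and first >= resume_kimg

# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as run_dir:
    for name in ["fakes07000.png", "fakes07140.png", "log.txt"]:
        open(os.path.join(run_dir, name), "w").close()
    ok = resumed_correctly(run_dir, resume_kimg=7000)
```

This only checks the file names; the first grid should still be looked at by eye to confirm it resembles the original model's final samples.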

Anime Faces → Character Faces


The first transfer learning was done with Holo of Spice and Wolf. It used a 512px Holo face dataset created with Nagadomi’s cropper from all of Danbooru2017, upscaled with waifu2x, cleaned by hand, and then data-augmented from n = 3900 to n = 12600; mirroring was enabled since Holo is symmetrical. I then used the anime face model as of 2019-02-09. It was not fully converged (indeed, it wouldn’t converge even with weeks more training), but the quality was so good that I was too curious about how well retraining would work, so I switched gears.

It’s worth men­tion­ing that this dataset was used pre­vi­ously with ProGAN, where after weeks of train­ing, ProGAN over­fit badly as demon­strated by the sam­ples & in­ter­po­la­tion videos.

Train­ing hap­pened re­mark­ably quick­ly, with all the faces con­verted to rec­og­niz­ably Holo faces within a few hun­dred it­er­a­tions:

Train­ing mon­tage of a Holo face model ini­tial­ized from the anime face StyleGAN (blink & you’ll miss it)
In­ter­po­la­tion video of the Holo face model ini­tial­ized from the anime face StyleGAN

The best sam­ples were con­vinc­ing with­out ex­hibit­ing the fail­ures of the ProGAN:

64 hand-s­e­lected Holo face sam­ples

The StyleGAN was much more successful, despite a few failure latent points carried over from the anime faces. Indeed, after a few hundred iterations, it was starting to overfit with the ‘crack’ artifacts & smearing in the interpolations. The latest I was willing to use was iteration #11370, and I think it is still somewhat overfit anyway. I thought that with its total n (after data augmentation), Holo would be able to train longer (being 1⁄7th the size of FFHQ), but apparently not. Perhaps the data augmentation is considerably less valuable than 1-for-1, either because the invariants it encodes aren’t that useful (suggesting that Geirhos et al 2018-like style transfer data augmentation is what’s necessary) or because they would be but the anime face StyleGAN has already learned them all as part of the previous training & needs more real data to better understand Holo-like faces. It’s also possible that the results could be improved by using one of the later anime face StyleGANs since they did improve when I trained them further after my 2 Holo/Asuka transfer experiments.

Nev­er­the­less, im­pressed, I could­n’t help but won­der if they had reached hu­man-levels of verisimil­i­tude: would an un­wary viewer as­sume they were hand­made?

So I selected ~100 of the best samples (24MB; Imgur mirror) from a dump of 2000, cropped about 5% from the left/right sides to hide the background artifacts a little bit, and submitted them on 2019-02-11 to /r/SpiceandWolf under an alt account. I made the mistake of sorting by filesize & thus leading with a face that was particularly suspicious (streaky hair), so one Redditor voiced the suspicion they were from MGM (absurd yet not entirely wrong), but all the other commenters took the faces in stride or praised them, and the submission received +248 votes (99% positive) by March. A Redditor then turned them all into a GIF video which earned +192 (100%) and many positive comments with no further suspicions until I explained. Not bad indeed.

The #11370 Holo StyleGAN model is avail­able for down­load.


After the Holo training & link submission went so well, I knew I had to try my other character dataset, Asuka, using n = 5300 data-augmented to n = 58,000.[33] Keeping in mind how data seemed to limit the Holo quality, I left mirroring enabled for Asuka, even though she is not symmetrical due to her eyepatch over her left eye (as purists will no doubt note).

Train­ing mon­tage of an Asuka face model ini­tial­ized from the anime face StyleGAN
In­ter­po­la­tion video of the Asuka face model ini­tial­ized from the anime face StyleGAN

In­ter­est­ing­ly, while Holo trained within 4 GPU-hours, Asuka proved much more diffi­cult and did not seem to be fin­ished train­ing or show­ing the cracks de­spite train­ing twice as long. Is this due to hav­ing ~35% more real data, hav­ing 10× rather than 3× data aug­men­ta­tion, or some in­her­ent differ­ence like Asuka be­ing more com­plex (eg be­cause of more vari­a­tions in her ap­pear­ance like the eye­patches or plug­suit­s)?

I gen­er­ated 1000 ran­dom sam­ples with 𝜓 = 1.2 be­cause they were par­tic­u­larly in­ter­est­ing to look at. As with Holo, I picked out the best 100 (13MB; Imgur mir­ror) from ~2000:

64 hand-s­e­lected Asuka face sam­ples

And I submitted them to the /r/Evangelion subreddit, where they also did well (+109, 98%); there were no speculations about the faces being NN-generated before I revealed it, merely requests for more. Between the two, it appears that with adequate data (n > 3000) and moderate curation, a simple kind of art Turing test can be passed.

The #7903 Asuka StyleGAN model is avail­able for down­load.


In early February 2019, using the then-released model, Redditor Ending_Credits tried transfer learning to n = 500 faces of the KanColle character Zuihou for ~1 tick (~60k iterations).

The sam­ples & in­ter­po­la­tions have many ar­ti­facts, but the sam­ple size is tiny and I’d con­sider this good fine­tun­ing from a model never in­tended for few-shot learn­ing:

StyleGAN trans­fer learn­ing from anime face StyleGAN to Kan­Colle Zui­hou by End­ing_­Cred­its, 8×15 ran­dom sam­ple grid
In­ter­po­la­tion video (4×4) of the Zui­hou face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its
In­ter­po­la­tion video (1×1) of the Zui­hou face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Prob­a­bly it could be made bet­ter by start­ing from the lat­est anime face StyleGAN mod­el, and us­ing ag­gres­sive data aug­men­ta­tion. An­other op­tion would be to try to find as many char­ac­ters which look sim­i­lar to Zui­hou (match­ing on hair color might work) and train on a joint dataset—un­con­di­tional sam­ples would then need to be fil­tered for just Zui­hou faces, but per­haps that draw­back could be avoided by a third stage of Zui­hou-only train­ing?



Another KanColle character, Akizuki, was trained in April 2019 by Ganso.


In Jan­u­ary 2020, Ganso trained a StyleGAN 2 model from the S2 por­trait model on a tiny cor­pus of Ptilop­sis im­ages, a char­ac­ter from Arknights, a 2017 Chi­nese RPG mo­bile game.

Train­ing sam­ples of Ptilop­sis, Arknights (StyleGAN 2 por­traits trans­fer, by Gan­so)

Ptilopsis are owls, and her character design shows prominent ears; despite the few images to work with (just 21 on Danbooru as of 2020-01-19), the interpolation shows smooth adjustments of the ears in all positions & alignments, demonstrating the power of transfer learning:

In­ter­po­la­tion video (4×4) of the Ptilop­sis face model ini­tial­ized from the anime face StyleGAN 2, trained by Ganso



Ending_Credits likewise did transfer to Saber, n = 4000. The results look about as expected given the sample sizes and previous transfer results:

In­ter­po­la­tion video (4×4) of the Saber face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Fate/Grand Order

Michael Sug­imura in May 2019 ex­per­i­mented with trans­fer learn­ing from the 512px anime por­trait GAN to faces cropped from ~6k wall­pa­pers he down­loaded via Google search queries. His re­sults for Saber & re­lated char­ac­ters look rea­son­able but more broad­ly, some­what low-qual­i­ty, which Sug­imura sus­pects is due to in­ad­e­quate data clean­ing (“there are a num­ber of lower qual­ity im­ages and also im­ages of back­grounds, ar­mor, non-char­ac­ter im­ages left in the dataset which causes weird ar­ti­facts in gen­er­ated im­ages or just lower qual­ity gen­er­ated im­ages.”).


Finally, Ending_Credits did transfer to Louise, n = 350:

In­ter­po­la­tion video (4×4) of the Louise face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Not as good as Saber due to the much smaller sam­ple size.


roadrunner01 experimented with a number of transfers, including a transfer to a male character with n = 50 (!), which is not nearly as garbage as it should be.


Flatis­Dogchi ex­per­i­mented with trans­fer to n = 988 (aug­mented to n = 18772) Asashio (Kan­Colle) faces, cre­at­ing “This Asashio Does Not Ex­ist”.

Marisa Kirisame & the Komeijis

A Japan­ese user mei_miya posted an in­ter­po­la­tion video of the Touhou char­ac­ter Marisa Kirisame by trans­fer learn­ing on 5000 faces. They also did the Touhou char­ac­ters Satori/Koishi Komeiji with n = 6000.

The Red­dit user Jepa­cor also has done Marisa, us­ing Dan­booru sam­ples.


A Chinese user 3D_DLW (S2 writeup/tutorial: 1/2) in February 2020 did transfer-learning from the S2 portrait model to Pixiv artwork of the character Lexington from Warship Girls. He used a similar workflow: cropping faces with lbpcascade_animeface, upscaling with waifu2x, and cleaning by ranking images with the original S2 model’s Discriminator (producing datasets of varying cleanliness at n = 302–1659). Samples:

Ran­dom sam­ples for anime por­trait S2 → War­ship Girls char­ac­ter Lex­ing­ton.
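
The cleaning step amounts to scoring every image with the Discriminator and discarding the lowest-scoring ones. A minimal sketch of the filtering logic only, assuming the scores have already been computed (higher meaning "more realistic"); the file names, scores, and `keep_top_fraction` helper are all hypothetical:

```python
import numpy as np

def keep_top_fraction(paths, scores, fraction):
    """Return the paths whose scores are in the top `fraction`, best first."""
    order = np.argsort(scores)[::-1]              # indices from highest to lowest score
    n_keep = max(1, int(round(len(paths) * fraction)))
    return [paths[i] for i in order[:n_keep]]

paths = ["a.png", "b.png", "c.png", "d.png"]      # made-up file names
scores = np.array([0.3, -1.2, 2.5, 0.9])          # made-up Discriminator outputs
kept = keep_top_fraction(paths, scores, fraction=0.5)
```

Varying `fraction` trades dataset size against cleanliness, which is one way to end up with several datasets of differing n from the same raw crops.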

Hayasaka Ai

Tazik Shahjahan finetuned S2 on the character Hayasaka Ai, providing a Colab notebook demonstrating how he scraped Pixiv and filtered out invalid images to create the training corpus.


CaJI9I created an “ahegao” StyleGAN; unspecified corpus or method:

6×6 sam­ple of ahe­gao StyleGAN faces

Emilia (Re:Zero)

In­ter­po­la­tion video (4×4) of the Emilia face model ini­tial­ized from the Por­trait StyleGAN, trained by ship­blaz­er420

Anime Faces → Anime Headshots

Twit­ter user Sunk did trans­fer learn­ing to an im­age cor­pus of a spe­cific artist, Kure­hito Mis­aki (深崎暮人), n≅1000. His im­ages work well and the in­ter­po­la­tion looks nice:

Interpolation video (4×4) of the Kurehito Misaki model initialized from the anime face StyleGAN, trained by sunk

Anime Faces → Portrait

TWDNE was a huge suc­cess and pop­u­lar­ized the anime face StyleGAN. It was not per­fect, though, and flaws were not­ed.

Portrait Improvements

The portraits could be improved by more carefully selecting SFW images to avoid overly-suggestive faces, and by expanding the crops to avoid cutting off edges of heads like hairstyles.

**For details, please see the write-up of the improved portraits dataset.**

Portrait Results

After re­train­ing the fi­nal face StyleGAN 2019-03-08–2019-04-30 on the new im­proved por­traits dataset, the re­sults im­proved:

Train­ing sam­ple for Por­trait StyleGAN: 2019-04-30/iteration #66,083
In­ter­po­la­tion video (4×4) of the Dan­booru2018 por­trait model ini­tial­ized from the Dan­booru2017 face StyleGAN

This S1 anime por­trait model is ob­so­leted by the StyleGAN 2 por­trait model.

The fi­nal model from 2019-04-30 is avail­able for down­load.

I used this model at 𝛙=0.5 to gen­er­ate 100,000 new por­traits for TWDNE (#100,000–199,999), bal­anc­ing the pre­vi­ous faces.

I was sur­prised how diffi­cult up­grad­ing to por­traits seemed to be; I spent al­most two months train­ing it be­fore giv­ing up on fur­ther im­prove­ments, while I had been ex­pect­ing more like a week or two. The por­trait re­sults are in­deed bet­ter than the faces (I was right that not crop­ping off the top of the head adds verisimil­i­tude), but the up­grade did­n’t im­press me as much as the orig­i­nal faces did com­pared to ear­lier GANs. And our other ex­per­i­men­tal runs on whole-Dan­booru2018 im­ages never pro­gressed be­yond sug­ges­tive blobs dur­ing this pe­ri­od.

I sus­pect that StyleGAN—at least, on its de­fault ar­chi­tec­ture & hy­per­pa­ra­me­ters, with­out a great deal more com­pute—is reach­ing its lim­its here, and that changes may be nec­es­sary to scale to richer im­ages. (Self-at­ten­tion is prob­a­bly the eas­i­est to add since it should be easy to plug in ad­di­tional lay­ers to the con­vo­lu­tion code.)

Anime Faces → Male Faces

A few people have observed that it would be nice to have an anime face GAN for male characters instead of always generating female ones. The anime face StyleGAN does in fact have male faces in its dataset as I did no filtering; it’s merely that female faces are overwhelmingly frequent (and it may also be that male anime faces are relatively androgynous/feminized anyway so it’s hard to tell any difference between a female with short hair & a guy[34]).

Train­ing a male-only anime face StyleGAN would be an­other good ap­pli­ca­tion of trans­fer learn­ing.

The faces can be eas­ily ex­tracted out of Dan­booru2018 by query­ing for "male_focus", which will pick up ~150k im­ages. More nar­row­ly, one could search "1boy" & "solo", to en­sure that the only face in the im­age is a male face (as op­posed to, say, 1boy 1girl, where a fe­male face might be cropped out as well). This pro­vides n = 99k raw hits. It would be good to also fil­ter out ‘trap’ or over­ly-fe­male-look­ing faces (else what’s the point?), by fil­ter­ing on tags like cat ears or par­tic­u­larly pop­u­lar ‘trap’ char­ac­ters like Fate/Grand Or­der’s As­tol­fo. A more com­pli­cated query to pick up scenes with mul­ti­ple males could be to search for both "1boy" & "multiple_boys" and then fil­ter out "1girl" & "multiple_girls", in or­der to se­lect all im­ages with 1 or more males and then re­move all im­ages with 1 or more fe­males; this dou­bles the raw hits to n = 198k. (A down­side is that the face-crop­ping will often un­avoid­ably yield crops with two faces, a pri­mary face and an over­lap­ping face, which is bad and in­tro­duces ar­ti­fact­ing when I tried this with all faces.)
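
The two narrower queries can be expressed as set operations on each image's tags. A sketch, assuming the metadata has already been loaded into dicts with an `id` and a set of `tags`; this record format is a simplification for illustration, not the actual Danbooru2018 metadata schema, and the records are made up:

```python
def is_solo_male(tags):
    """'1boy' & 'solo': the only character in the image is male."""
    return {"1boy", "solo"} <= tags

def is_male_only(tags):
    """One or more males and no females (crops may still contain several faces)."""
    has_male = bool(tags & {"1boy", "multiple_boys"})
    has_female = bool(tags & {"1girl", "multiple_girls"})
    return has_male and not has_female

records = [  # made-up examples
    {"id": 1, "tags": {"1boy", "solo", "male_focus"}},
    {"id": 2, "tags": {"1boy", "1girl"}},
    {"id": 3, "tags": {"multiple_boys"}},
    {"id": 4, "tags": {"1girl", "solo"}},
]
solo_ids = [r["id"] for r in records if is_solo_male(r["tags"])]
male_only_ids = [r["id"] for r in records if is_male_only(r["tags"])]
```

The broader query admits image 3, which is exactly the multi-face case that tends to produce overlapping crops.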

Com­bined with trans­fer learn­ing from the gen­eral anime face StyleGAN, the re­sults should be as good as the gen­eral (fe­male) faces.

I settled for "1boy" & "solo", and did considerable cleaning by hand. The raw count of images turned out to be highly misleading, and many faces are unusable for a male anime face StyleGAN: many are so highly stylized (such as action scenes) as to be damaging to a GAN, or they are almost indistinguishable from female faces (because they are bishonen or trap or just androgynous), which would be pointless to include (the regular portrait StyleGAN covers those already). After hand cleaning & further filtering, I was left with n~3k, so I used heavy data augmentation to bring it up to n~57k, and I initialized from the final portrait StyleGAN for the highest quality.

It did not overfit after ~4 days of training, but the results were not noticeably improving, so I stopped (in order to start training the GPT-2-345M, which OpenAI had just released). There are hints in the interpolation videos, I think, that it is indeed slightly overfitting, in the form of ‘glitches’ where the image abruptly jumps slightly, presumably to another mode/face/character of the original data; nevertheless, the male face StyleGAN mostly works.

Train­ing sam­ples for the male por­trait StyleGAN (2019-05-03); com­pare with the same la­ten­t-space points in the orig­i­nal por­trait StyleGAN.
In­ter­po­la­tion video (4×4) of the Dan­booru2018 male faces model ini­tial­ized from the Dan­booru2018 por­trait StyleGAN

The male face StyleGAN model is available for download, as are 1000 random faces with 𝛙=0.7 (mirror; partial Imgur album).

Anime Faces → Ukiyo-e Faces

In January 2020, Justin (@Buntworthy) used 5000 cropped ukiyo-e faces to do transfer learning. After ~24h training:

Justin’s ukiy­o-e StyleGAN sam­ples, 2020-01-04.

Anime Faces → Western Portrait Faces

In 2019, aydao experimented with transfer learning to European portrait faces drawn from WikiArt; the transfer learning was done via Nathan Shipley’s abuse of model averaging, where two models are simply averaged together, parameter by parameter and layer by layer, to yield a new model. (Surprisingly, this works, as long as the models aren’t too different; if they are, the averaged model will generate only colorful blobs.) The results were amusing. From early in training:

ay­dao 2019, anime faces → west­ern por­trait train­ing sam­ples (ear­ly)


ay­dao 2019, anime faces → west­ern por­trait train­ing sam­ples (later)
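
Parameter-wise averaging of two models is simple to state in code. A sketch, assuming each model's weights are available as a dict mapping variable names to NumPy arrays of matching shapes; how the weights are extracted from and loaded back into a StyleGAN pickle is omitted, and `average_models` is a hypothetical helper:

```python
import numpy as np

def average_models(weights_a, weights_b, alpha=0.5):
    """Blend two models parameter by parameter: alpha * a + (1 - alpha) * b."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must have identical variable names")
    blended = {}
    for name, a in weights_a.items():
        b = weights_b[name]
        if a.shape != b.shape:
            raise ValueError("shape mismatch for %s" % name)
        blended[name] = alpha * a + (1.0 - alpha) * b
    return blended

# Toy stand-ins for two fine-tuned copies of the same architecture.
model_a = {"conv/weight": np.ones((3, 3)), "conv/bias": np.zeros(3)}
model_b = {"conv/weight": np.zeros((3, 3)), "conv/bias": np.full(3, 2.0)}
blended = average_models(model_a, model_b)
```

The `alpha` parameter is an addition for illustration; a plain average corresponds to `alpha=0.5`.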

Anime Faces → Danbooru2018

nshep­perd be­gan a train­ing run us­ing an early anime face StyleGAN model on the 512px SFW Dan­booru2018 sub­set; after ~3–5 weeks (with many in­ter­rup­tions) on 1 GPU, as of 2019-03-22, the train­ing sam­ples look like this:

StyleGAN train­ing sam­ples on Dan­booru2018 SFW 512px; it­er­a­tion #14204 (n­shep­perd)
Real 512px SFW Dan­booru2018 train­ing dat­a­points, for com­par­i­son
Train­ing mon­tage video of the Dan­booru2018 model (up to #14204, 2019-03-22), trained by nshep­perd

The StyleGAN is able to pick up global struc­ture and there are rec­og­niz­ably anime fig­ures, de­spite the sheer di­ver­sity of im­ages, which is promis­ing. The fine de­tails are se­ri­ously lack­ing, and train­ing, to my eye, is wan­der­ing around with­out any steady im­prove­ment or sharp de­tails (ex­cept per­haps the faces which are in­her­ited from the pre­vi­ous mod­el). I sus­pect that the learn­ing rate is still too high and, es­pe­cially with only 1 GPU/n = 4, such small mini­batches don’t cover enough modes to en­able steady im­prove­ment. If so, the LR will need to be set much lower (or gra­di­ent ac­cu­mu­la­tion used in or­der to fake hav­ing large mini­batches where large LRs are sta­ble) & train­ing time ex­tended to mul­ti­ple months. An­other pos­si­bil­ity would be to restart with added self­-at­ten­tion lay­ers, which I have no­ticed seem to par­tic­u­larly help with com­pli­cated de­tails & sharp­ness; the style noise ap­proach may be ad­e­quate for the job but just a few vanilla con­vo­lu­tion lay­ers may be too few (pace the BigGAN re­sults on the ben­e­fits of in­creas­ing depth while de­creas­ing pa­ra­me­ter coun­t).
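As a minimal sketch of what gradient accumulation buys: on a toy one-parameter regression (not StyleGAN; all names and data here are made up for illustration), averaging the gradients of four micro-batches of n = 4 before taking a step gives the same gradient as a single minibatch of n = 16, which is why it can stand in for a larger minibatch when memory is limited.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w*x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch):
    """Average the gradients of equal-sized micro-batches before taking any step."""
    chunks = [data[i:i + micro_batch] for i in range(0, len(data), micro_batch)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)

data = [(float(x), 3.0 * x) for x in range(1, 17)]  # 16 toy datapoints, true w = 3
w = 0.0
big = grad_mse(w, data)                # one minibatch of n = 16
small = accumulated_grad(w, data, 4)   # four micro-batches of n = 4, accumulated
print(big, small)
```

The equivalence holds exactly only when the micro-batches are equal-sized and the loss is a per-example mean; layers that compute batch statistics would still see the small batch.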

FFHQ Variations

Anime Faces → FFHQ Faces

If StyleGAN can smoothly warp anime faces among each other and express global transforms like hair length+color with 𝜓, could 𝜓 be a quick way to gain control over a single large-scale variable? For example, male vs female faces, or… anime ↔︎ real faces? (Given a particular image/latent vector, one would simply flip the sign to convert it to the opposite; this would give the opposite version of each random face, and if one had an encoder, one could automatically anime-fy or real-fy an arbitrary face by encoding it into the latent vector which creates it, and then flipping.35)
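To make the sign-flip idea concrete, here is a small sketch assuming the usual StyleGAN truncation formula, in which an intermediate latent w is pulled toward (or pushed through) the average latent: w′ = w_avg + 𝜓·(w − w_avg). The 2-dimensional vectors and the function name `truncate` are illustrative only; real StyleGAN latents have 512 dimensions.

```python
def truncate(w, w_avg, psi):
    """Truncation trick: interpolate (or extrapolate) between w_avg and w.

    psi = 1 leaves w unchanged, psi = 0 collapses to the average face,
    and a negative psi mirrors w through the average (the 'opposite' face).
    """
    return [avg + psi * (x - avg) for x, avg in zip(w, w_avg)]

w_avg = [0.5, -1.0]   # hypothetical average latent
w = [1.5, 1.0]        # hypothetical latent of one random face

same = truncate(w, w_avg, 1.0)
average = truncate(w, w_avg, 0.0)
opposite = truncate(w, w_avg, -1.0)
print(same, average, opposite)
```

Whether the mirrored latent corresponds to a meaningful ‘opposite’ (eg anime vs real) depends entirely on how the model happened to arrange its latent space, which is what the experiments below test.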

Since Karras et al 2018 provide a nice FFHQ download script (albeit slower than I’d like once Google Drive rate-limits you a wallclock hour into the full download) for the full-resolution PNGs, it would be easy to downscale to 512px and create a 512px FFHQ dataset to train on, or even create a combined anime+FFHQ dataset.

The first and fastest thing was to do trans­fer learn­ing from the anime faces to FFHQ real faces. It was un­likely that the model would re­tain much anime knowl­edge & be able to do mor­ph­ing, but it was worth a try.

The ini­tial re­sults early in train­ing are hi­lar­i­ous and look like zom­bies:

Ran­dom train­ing sam­ples of anime face→FFHQ-only StyleGAN trans­fer learn­ing, show­ing bizarrely-arte­fac­tual in­ter­me­di­ate faces
In­ter­po­la­tion video (4×4) of the FFHQ face model ini­tial­ized from the anime face StyleGAN, a few ticks into train­ing, show­ing bizarre ar­ti­facts

After 97 ticks, the model has con­verged to a bor­ingly nor­mal ap­pear­ance, with the only hint of its ori­gins be­ing per­haps some ex­ces­sive­ly-fab­u­lous hair in the train­ing sam­ples:

Anime faces→FFHQ-only StyleGAN train­ing sam­ples after much con­ver­gence, show­ing ani­me-ness largely washed out

Anime Faces → Anime Faces + FFHQ Faces

So, that was a bust. The next step is to try train­ing on anime & FFHQ faces si­mul­ta­ne­ous­ly; given the stark differ­ence be­tween the datasets, would pos­i­tive vs neg­a­tive 𝜓 wind up split­ting into real vs anime and pro­vide a cheap & easy way of con­vert­ing ar­bi­trary faces?

This sim­ply merged the 512px FFHQ faces with the 512px anime faces and re­sumed train­ing from the pre­vi­ous FFHQ model (I rea­soned that some of the ani­me-ness should still be in the mod­el, so it would be slightly faster than restart­ing from the orig­i­nal anime face mod­el). I trained it for 812 it­er­a­tions, #11,359–12,171 (some­what over 2 GPU-days), at which point it was mostly done.

It did man­age to learn both kinds of faces quite well, sep­a­rat­ing them clearly in ran­dom sam­ples:

Ran­dom train­ing sam­ples, anime+FFHQ StyleGAN

How­ev­er, the style trans­fer & 𝜓 sam­ples were dis­ap­point­ments. The style mix­ing shows lim­ited abil­ity to mod­ify faces cross-do­main or con­vert them, and the trun­ca­tion trick chart shows no clear dis­en­tan­gle­ment of the de­sired fac­tor (in­deed, the var­i­ous halves of 𝜓 cor­re­spond to noth­ing clear):

Style mix­ing re­sults for the anime+FFHQ StyleGAN
Trun­ca­tion trick re­sults for the anime+FFHQ StyleGAN

The in­ter­po­la­tion video does show that it learned to in­ter­po­late slightly be­tween real & anime faces, giv­ing half-anime/half-real faces, but it looks like it only hap­pens some­times—­mostly with young fe­male faces36:

In­ter­po­la­tion video (4×4) of the FFHQ+anime face mod­el, after con­ver­gence.

They’re hard to spot in the in­ter­po­la­tion video be­cause the tran­si­tion hap­pens abrupt­ly, so I gen­er­ated sam­ples & se­lected some of the more in­ter­est­ing ani­me-ish faces:

Se­lected sam­ples from the anime+FFHQ StyleGAN, show­ing cu­ri­ous ‘in­ter­me­di­ate’ faces (4×4 grid)

Sim­i­lar­ly, Alexan­der Reben trained a StyleGAN on FFHQ+Western por­trait il­lus­tra­tions, and the in­ter­po­la­tion video is much smoother & more mixed, sug­gest­ing that more re­al­is­tic & more var­ied il­lus­tra­tions are eas­ier for StyleGAN to in­ter­po­late be­tween.

Anime Faces + FFHQ → Danbooru2018

While I did­n’t have the com­pute to prop­erly train a Dan­booru2018 StyleGAN, after nshep­perd’s re­sults, I was cu­ri­ous and spent some time (817 it­er­a­tions, so ~2 GPU-days?) re­train­ing the anime face+FFHQ model on Dan­booru2018 SFW 512px im­ages.

The train­ing mon­tage is in­ter­est­ing for show­ing how faces get re­pur­posed into fig­ures:

Train­ing mon­tage video of a Dan­booru2018 StyleGAN ini­tial­ized on an anime faces+FFHQ StyleGAN.

One might think that it is a bridge too far for trans­fer learn­ing, but it seems not.

Reversing StyleGAN To Control & Modify Images

Mod­i­fy­ing im­ages is harder than gen­er­at­ing them. An un­con­di­tional GAN ar­chi­tec­ture is, by de­fault, ‘one-way’: the la­tent vec­tor z gets gen­er­ated from a bunch of vari­ables, fed through the GAN, and out pops an im­age. There is no way to run the un­con­di­tional GAN ‘back­wards’ to feed in an im­age and pop out the z in­stead.37

If one could, one could take an arbitrary image and encode it into the z and, by jittering z, generate many new versions of it; or one could feed it back into StyleGAN and play with the style noises at various levels in order to transform the image; or do things like ‘average’ two images or create interpolations between two arbitrary faces; or one could (assuming one knew what each variable in z ‘means’) edit the image to change things like which direction their head tilts or whether they are smiling.

There are some attempts at learning control in an unsupervised fashion (eg GANSpace) but, while excellent starting points, they have limits and may not find a specific control that one wants.

The most straight­for­ward way would be to switch to a con­di­tional GAN ar­chi­tec­ture based on a text or tag em­bed­ding. Then to gen­er­ate a spe­cific char­ac­ter wear­ing glass­es, one sim­ply says as much as the con­di­tional in­put: "character glasses". Or if they should be smil­ing, add "smile". And so on. This would cre­ate im­ages of said char­ac­ter with the de­sired mod­i­fi­ca­tions. This op­tion is not avail­able at the mo­ment as cre­at­ing a tag em­bed­ding & train­ing StyleGAN re­quires quite a bit of mod­i­fi­ca­tion. It also is not a com­plete so­lu­tion as it would­n’t work for the cases of edit­ing an ex­ist­ing im­age.

For an un­con­di­tional GAN, there are two com­ple­men­tary ap­proaches to in­vert­ing the G:

  1. what one NN can learn to decode, another can learn to encode:

    If StyleGAN has learned z→im­age, then train a sec­ond en­coder NN on the su­per­vised learn­ing prob­lem of im­age→z! The sam­ple size is in­fi­nite (just keep run­ning G) and the map­ping is fixed (given a fixed G), so it’s ugly but not that hard.

  2. backpropagate a pixel or feature-level loss to ‘optimize’ a latent code:

    While StyleGAN is not inherently reversible, it is not a blackbox: being a NN trained by backpropagation, it must admit of gradients. In training neural networks, there are 3 components: inputs, model parameters, and outputs/losses, and thus there are 3 ways to use backpropagation, even if we usually only use 1. One can hold the inputs fixed and vary the model parameters in order to change the outputs so as to reduce a loss, which is training a NN; one can hold the inputs fixed and vary the outputs in order to change (often increase) internal parameters such as layers, which corresponds to neural network visualizations & exploration; and finally, one can hold the parameters & outputs fixed, and use the gradients to iteratively find a set of inputs which creates a specific output with a low loss (eg optimize a wheel-shape input for rolling-efficiency output).38

    This can be used to cre­ate im­ages which are ‘op­ti­mized’ in some sense. For ex­am­ple, uses ac­ti­va­tion max­i­miza­tion, demon­strat­ing how im­ages of Im­a­geNet classes can be pulled out of a stan­dard CNN clas­si­fier by back­prop over the clas­si­fier to max­i­mize a par­tic­u­lar out­put class; or re­design a for eas­ier clas­si­fi­ca­tion by a mod­el; more amus­ing­ly, in , the gra­di­ent as­cent39 on the in­di­vid­ual pix­els of an im­age is done to minimize/maximize a NSFW clas­si­fier’s pre­dic­tion. This can also be done on a higher level by try­ing to max­i­mize sim­i­lar­ity to a NN em­bed­ding of an im­age to make it as ‘sim­i­lar’ as pos­si­ble, as was done orig­i­nally in Gatys et al 2014 for style trans­fer, or for more com­pli­cated kinds of style trans­fer like in “Differ­en­tiable Im­age Pa­ra­me­ter­i­za­tions: A pow­er­ful, un­der­-ex­plored tool for neural net­work vi­su­al­iza­tions and art”.

    In this case, to recover the z of an arbitrary desired image, one can initialize a random z, run it forward through the GAN to get an image, compare it at the pixel level with the desired (fixed) image, and the total difference is the ‘loss’; holding the GAN fixed, the backpropagation goes back through the model and adjusts the inputs (the unfixed z) to make it slightly more like the desired image. Done many times, the final z will now yield something like the desired image, and that can be treated as its true z. Comparing at the pixel-level can be improved by instead looking at the higher layers in a NN trained to do classification (often an ImageNet VGG), which will focus more on the semantic similarity (more of a “perceptual loss”) rather than misleading details of static & individual pixels. The latent code can be the original z, or z after it’s passed through the stack of 8 FC layers and has been transformed, or it can even be the various per-layer style noises inside the CNN part of StyleGAN; the last is what style-image-prior uses & 40 argue that it works better to target the layer-wise encodings than the original z.

    This may not work too well as the lo­cal op­tima might be bad or the GAN may have trou­ble gen­er­at­ing pre­cisely the de­sired im­age no mat­ter how care­fully it is op­ti­mized, the pix­el-level loss may not be a good loss to use, and the whole process may be quite slow, es­pe­cially if one runs it many times with many differ­ent ini­tial ran­dom z to try to avoid bad lo­cal op­ti­ma. But it does mostly work.

  3. En­code+Back­prop­a­gate is a use­ful hy­brid strat­e­gy: the en­coder makes its best guess at the z, which will usu­ally be close to the true z, and then back­prop­a­ga­tion is done for a few it­er­a­tions to fine­tune the z. This can be much faster (one for­ward pass vs many for­ward+back­ward pass­es) and much less prone to get­ting stuck in bad lo­cal op­tima (s­ince it starts at a good ini­tial z thanks to the en­coder).

    Comparison with editing in flow-based models: On a tangent, editing/reversing is one of the great advantages41 of ‘flow’-based NN models such as Glow, which is one of the families of NN models competitive with GANs for high-quality image generation (along with autoregressive pixel prediction models like PixelRNN, and VAEs). Flow models have the same shape as GANs in pushing a random latent vector z through a series of upscaling convolution or other layers to produce final pixel values, but flow models use a carefully-limited set of primitives which make the model runnable both forwards and backwards exactly. This means every set of pixels corresponds to a unique z and vice-versa, and so an arbitrary set of pixels can be put in and the model run backwards to yield the exact corresponding z. There is no need to fight with the model to create an encoder to reverse it or use backpropagation optimization to try to find something almost right, as the flow model can already do this. This makes editing easy: plug the image in, get out the exact z with the equivalent of a single forward pass, figure out which part of z controls a desired attribute like ‘glasses’, change that, and run it forward. The downside of flow models, which is why I do not (yet) use them, is that the restriction to reversible layers means that they are typically much larger and slower to train than a more-or-less perceptually equivalent GAN model, by easily an order of magnitude (for Glow). When I tried Glow, I could barely run an interesting model despite aggressive memory-saving techniques, and I didn’t get anywhere interesting with the several GPU-days I spent (which was unsurprising when I realized how many GPU-months OA had spent). Since high-quality photorealistic GANs are at the limit of 2019 trainability for most researchers or hobbyists, flow models are clearly out of the question despite their many practical & theoretical advantages: they’re just too expensive! However, there is no known reason flow models couldn’t be competitive with GANs (they will probably always be larger, but because they are more correct & do more), and future improvements or hardware scaling may make them more viable, so flow-based models are an approach to keep an eye on.
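The backpropagation approach (#2 above) can be sketched on a toy problem. Everything below is an assumption made for illustration: the ‘generator’ is a frozen 2-latent → 4-pixel map `tanh(W z)` with made-up weights rather than StyleGAN, the loss is a plain pixel-level squared error rather than a perceptual loss, and the gradient is written out by hand instead of coming from an autodiff framework. The point is only that the weights stay fixed while the gradient updates z.

```python
import math

W = [[1.0, 0.5], [-0.5, 1.0], [0.3, -0.8], [0.9, 0.2]]  # frozen toy 'generator' weights

def generate(z):
    """Toy generator: 2-D latent -> 4 'pixels' via tanh(W z)."""
    return [math.tanh(sum(w * zj for w, zj in zip(row, z))) for row in W]

def loss_and_grad(z, target):
    """Pixel-level squared error, and its gradient with respect to z (not W)."""
    out = generate(z)
    loss = sum((o - t) ** 2 for o, t in zip(out, target))
    grad = [0.0] * len(z)
    for o, t, row in zip(out, target, W):
        common = 2 * (o - t) * (1 - o * o)  # chain rule through tanh
        for j, w in enumerate(row):
            grad[j] += common * w
    return loss, grad

def project(target, z_init, steps=2000, lr=0.05):
    """Gradient descent on the latent only; the generator is never modified."""
    z = list(z_init)
    for _ in range(steps):
        _, grad = loss_and_grad(z, target)
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z

z_true = [0.5, -0.3]          # the latent we pretend not to know
target = generate(z_true)     # the 'desired image'
z_found = project(target, z_init=[0.0, 0.0])
final_loss, _ = loss_and_grad(z_found, target)
print(z_found, final_loss)
```

The hybrid strategy (#3) corresponds to replacing `z_init=[0.0, 0.0]` with an encoder’s guess and using far fewer `steps`.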

One of those 3 ap­proaches will en­code an im­age into a la­tent z. So far so good, that en­ables things like gen­er­at­ing ran­dom­ly-d­iffer­ent ver­sions of a spe­cific im­age or in­ter­po­lat­ing be­tween 2 im­ages, but how does one con­trol the z in a more in­tel­li­gent fash­ion to make spe­cific ed­its?

If one knew what each vari­able in the z meant, one could sim­ply slide them in the −1/+1 range, change the z, and gen­er­ate the cor­re­spond­ing edited im­age. But there are 512 vari­ables in z (for StyleGAN), which is a lot to ex­am­ine man­u­al­ly, and their mean­ing is opaque as StyleGAN does­n’t nec­es­sar­ily map each vari­able onto a hu­man-rec­og­niz­able fac­tor like ‘smil­ing’. A rec­og­niz­able fac­tor like ‘eye­glasses’ might even be gov­erned by mul­ti­ple vari­ables si­mul­ta­ne­ously which are non­lin­early in­ter­act­ing.

As always, the solution to one model’s problems is yet more models; to control the z, like with the encoder, we can simply train yet another model (perhaps just a linear classifier or random forests this time) to take the z of many images which are all labeled ‘smiling’ or ‘not smiling’, and learn what parts of z cause ‘smiling’. These additional models can then be used to control a z. The necessary labels (a few hundred to a few thousand will be adequate since the z is only 512 variables) can be obtained by hand or by using a pre-existing classifier.
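A minimal sketch of that idea, under loudly simplifying assumptions: the latent space is 8-dimensional rather than 512, the ‘smiling’ label comes from a stand-in `tagger` which secretly thresholds one hidden direction, and the ‘model’ is the crudest possible linear one (the difference between the two class means) rather than a trained classifier or random forest. All names are hypothetical.

```python
import math
import random

random.seed(0)
DIM = 8  # toy latent size; StyleGAN's z has 512 variables

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]

# Hidden ground truth for this toy only: 'smiling' is governed by one direction in z.
true_dir = [1 / math.sqrt(2), -1 / math.sqrt(2)] + [0.0] * (DIM - 2)

def tagger(z):
    """Stand-in for a hand label or a pre-existing classifier."""
    return dot(z, true_dir) > 0

latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2000)]
smiling = [z for z in latents if tagger(z)]
neutral = [z for z in latents if not tagger(z)]

# Difference of class means: a crude linear estimate of the 'smiling' direction.
raw = [s - n for s, n in zip(mean(smiling), mean(neutral))]
norm = math.sqrt(dot(raw, raw))
smile_dir = [r / norm for r in raw]

def edit(z, direction, strength):
    """Move a latent along an attribute direction; the result would be fed back to G."""
    return [zi + strength * d for zi, d in zip(z, direction)]

z = latents[0]
z_edited = edit(z, smile_dir, strength=2.0)
print(dot(smile_dir, true_dir))                    # cosine similarity to the hidden direction
print(dot(z, true_dir), dot(z_edited, true_dir))   # attribute score before & after editing
```

In a real pipeline the attribute would rarely be this cleanly linear, so a fitted classifier’s weights or decision boundary would take the place of `smile_dir`.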

So, the pieces of the puz­zle & putting it all to­geth­er:

The fi­nal re­sult is in­ter­ac­tive edit­ing of anime faces along many differ­ent fac­tors:

snowy halcy (MP4) demon­strat­ing in­ter­ac­tive edit­ing of StyleGAN anime faces us­ing anime-face-StyleGAN+DeepDanbooru+StyleGAN-encoder+TL-GAN

Editing Rare Attributes

A strat­egy of hand-edit­ing or us­ing a tag­ger to clas­sify at­trib­utes works for com­mon ones which will be well-rep­re­sented in a sam­ple of a few thou­sand since the clas­si­fier needs a few hun­dred cases to work with, but what about rarer at­trib­utes which might ap­pear only on one in a thou­sand ran­dom sam­ples, or at­trib­utes too rare in the dataset for StyleGAN to have learned, or at­trib­utes which may not be in the dataset at all? Edit­ing “red eyes” should be easy, but what about some­thing like “bunny ears”? It would be amus­ing to be able to edit por­traits to add bunny ears, but there aren’t that many bunny ear sam­ples (although cat ears might be much more com­mon); is one doomed to gen­er­ate & clas­sify hun­dreds of thou­sands of sam­ples to en­able bunny ear edit­ing? That would be in­fea­si­ble for hand la­bel­ing, and diffi­cult even with a tag­ger.

One sug­ges­tion I have for this use-case would be to briefly train an­other StyleGAN model on an en­riched or boosted dataset, like a dataset of 50:50 bunny ear im­ages & nor­mal im­ages. If one can ob­tain a few thou­sand bunny ear im­ages, then this is ad­e­quate for trans­fer learn­ing (com­bined with a few thou­sand ran­dom nor­mal im­ages from the orig­i­nal dataset), and one can re­train the StyleGAN on an equal bal­ance of im­ages. The high pres­ence of bunny ears will en­sure that the StyleGAN quickly learns all about those, while the nor­mal im­ages pre­vent it from over­fit­ting or cat­a­strophic for­get­ting of the full range of im­ages.

This new bun­ny-ear StyleGAN will then pro­duce bun­ny-ear sam­ples half the time, cir­cum­vent­ing the rare base rate is­sue (or fail­ure to learn, or nonex­is­tence in dataset), and en­abling effi­cient train­ing of a clas­si­fi­er. And since nor­mal faces were used to pre­serve its gen­eral face knowl­edge de­spite the trans­fer learn­ing po­ten­tially de­grad­ing it, it will re­main able to en­code & op­ti­mize nor­mal faces. (The orig­i­nal clas­si­fiers may even be reusable on this, de­pend­ing on how ex­treme the new at­tribute is, as the la­tent space z might not be too affected by the new at­tribute and the var­i­ous other at­trib­utes ap­prox­i­mately main­tain the orig­i­nal re­la­tion­ship with z as be­fore the re­train­ing.)
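Assembling the 50:50 enriched dataset is simple; a sketch, assuming one already has two lists of image paths (the filenames and the function name `balanced_dataset` below are made up):

```python
import random

def balanced_dataset(rare_images, normal_images, seed=0):
    """Return a shuffled list with equal numbers of rare-attribute & normal images."""
    rng = random.Random(seed)
    n = min(len(rare_images), len(normal_images))
    mixed = rng.sample(rare_images, n) + rng.sample(normal_images, n)
    rng.shuffle(mixed)
    return mixed

# Hypothetical file lists: a few thousand bunny-ear images vs the full face dataset.
bunny = [f"bunny_{i}.png" for i in range(3000)]
normal = [f"face_{i}.png" for i in range(200000)]
dataset = balanced_dataset(bunny, normal)
print(len(dataset), sum(name.startswith("bunny_") for name in dataset))
```

The resulting list would then be converted to the training format as usual before resuming training from the original model.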

StyleGAN 2

StyleGAN 2 (source, video) eliminates blob artifacts, adds a native encoding ‘projection’ feature for editing, simplifies the runtime by scrapping progressive growing in favor of an MSG-GAN-like multi-scale architecture, & has higher overall quality, but similar total training time/requirements.42

I used a 512px anime portrait S2 model trained by Aaron Gokaslan to create TWDNEv3:

100 ran­dom sam­ple im­ages from the StyleGAN 2 anime por­trait faces in TWDNEv3, arranged in a 10×10 grid.

Train­ing sam­ples:

It­er­a­tion #24,303 of Gokaslan’s train­ing of an anime por­trait StyleGAN 2 model (train­ing sam­ples)

The model was trained to iteration #24,664 for >2 weeks on 4 Nvidia 2080ti GPUs at 35–70s per 1k images. The Tensorflow S2 model is available for download (320MB).43 (PyTorch & Onnx versions have been made by Anton using a custom repo. Note that both my face & portrait models can be run via the GenForce PyTorch repo as well.) This model can be used in Google Colab (demonstration notebook, although it seems it may pull in an older S2 model) & the model can also be used with the S2 codebase for encoding anime faces.

Running S2

Because of the optimizations, which require custom local compilation of CUDA code for maximum efficiency, getting S2 running can be more challenging than getting S1 running.

  • No Ten­sor­Flow 2 com­pat­i­bil­i­ty: the TF ver­sion must be 1.14/1.15. Try­ing to run with TF 2 will give er­rors like: TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'.

    I ran into cuDNN compatibility problems with TF 1.15 (which requires cuDNN ≥7.6.0, 2019-05-20, for CUDA 10.0), which gave errors like this:

    ...[2020-01-11 23:10:35.234784: E tensorflow/stream_executor/cuda/] Loaded runtime CuDNN library:
       7.4.2 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher
       minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.
       If building from sources, make sure the library loaded at runtime is compatible with the version specified
       during compile configuration...

    But then with 1.14, the tpu-estimator li­brary was not found! (I ul­ti­mately took the risk of up­grad­ing my in­stal­la­tion with libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb, and thank­ful­ly, that worked and did not seem to break any­thing else.)

  • Get­ting the en­tire pipeline to com­pile the cus­tom ops in a Conda en­vi­ron­ment was an­noy­ing so Gokaslan tweaked it to use 1.14 on Lin­ux, used cudatoolkit-dev from Conda Forge, and changed the build script to use gcc-7 (s­ince gcc-8 was un­sup­port­ed)

  • one is­sue with Ten­sor­Flow 1.14 is you need to force allow_growth or it will er­ror out on Nvidia 2080tis

  • con­fig name change: has been re­named (a­gain) to

  • buggy learn­ing rates: S2 (but not S1) ac­ci­den­tally uses the same LR for both G & D; ei­ther fix this or keep it in mind when do­ing LR tun­ing—changes to D_lrate do noth­ing!

  • n = 1 minibatch problems: S2 is not a large NN so it can be trained on low-end GPUs; however, the S2 code makes an unnecessary assumption that n≥2; to fix this in training/ (fixed in Shawn Presser’s TPU/self-attention oriented fork):

    @@ -157,9 +157,8 @@ def G_logistic_ns_pathreg(G, D, opt, training_set, minibatch_size, pl_minibatch_
        with tf.name_scope('PathReg'):
            # Evaluate the regularization term using a smaller minibatch to conserve memory.
    -       if pl_minibatch_shrink > 1 and minibatch_size > 1:
    -           assert minibatch_size % pl_minibatch_shrink == 0
    -           pl_minibatch = minibatch_size // pl_minibatch_shrink
    +       if pl_minibatch_shrink > 1:
    +           pl_minibatch = tf.maximum(1, minibatch_size // pl_minibatch_shrink)
                pl_latents = tf.random_normal([pl_minibatch] + G.input_shapes[0][1:])
                pl_labels = training_set.get_random_labels_tf(pl_minibatch)
                fake_images_out, fake_dlatents_out = G.get_output_for(pl_latents, pl_labels, is_training=True, return_dlatents=True)

  • S2 has some sort of mem­ory leak, pos­si­bly re­lated to the FID eval­u­a­tions, re­quir­ing reg­u­lar restarts, like putting it into a loop
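The ‘put it into a loop’ workaround for the memory leak can be sketched as a small wrapper which relaunches the training process whenever it dies. This is an assumption-laden illustration: `run_with_restarts` is a name invented here, the demo commands are trivial stand-ins rather than the real S2 training invocation, and it presumes the training script is configured to resume from its latest snapshot on launch.

```python
import subprocess
import sys

def run_with_restarts(command, max_restarts=3):
    """Re-launch `command` until it exits cleanly or the restart budget runs out.

    Returns the number of launches that were made.
    """
    launches = 0
    while launches <= max_restarts:
        launches += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            break  # clean exit: training finished, stop looping
    return launches

# Stand-in commands; a real run would invoke the training script instead.
ok_command = [sys.executable, "-c", "raise SystemExit(0)"]
crashing_command = [sys.executable, "-c", "raise SystemExit(1)"]

clean_launches = run_with_restarts(ok_command)
crash_launches = run_with_restarts(crashing_command, max_restarts=2)
print(clean_launches, crash_launches)
```

Capping the number of restarts avoids spinning forever if the failure is something other than the leak (eg a bad config).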

Once S2 was run­ning, Gokaslan trained the S2 por­trait model with gen­er­ally de­fault hy­per­pa­ra­me­ters.

Future Work

Some open ques­tions about StyleGAN’s ar­chi­tec­ture & train­ing dy­nam­ics:

  • is pro­gres­sive grow­ing still nec­es­sary with StyleGAN? (StyleGAN 2 im­plies that it is not, as it uses a MSG-GAN-like ap­proach)
  • are 8×512 FC lay­ers nec­es­sary? (Pre­lim­i­nary BigGAN work sug­gests that they are not nec­es­sary for BigGAN.)
  • what are the wrinkly-line/cracks noise ar­ti­facts which ap­pear at the end of train­ing?
  • how does StyleGAN com­pare to BigGAN in fi­nal qual­i­ty?

Fur­ther pos­si­ble work:

  • ex­plo­ration of “cur­ricu­lum learn­ing”: can train­ing be sped up by train­ing to con­ver­gence on small n and then pe­ri­od­i­cally ex­pand­ing the dataset?

  • boot­strap­ping im­age gen­er­a­tion by start­ing with a seed cor­pus, gen­er­at­ing many ran­dom sam­ples, se­lect­ing the best by hand, and re­train­ing; eg ex­pand a cor­pus of a spe­cific char­ac­ter, or ex­plore ‘hy­brid’ cor­puses which mix A/B im­ages & one then se­lects for im­ages which look most A+B-ish

  • im­proved trans­fer learn­ing scripts to edit trained mod­els so 512px pre­trained mod­els can be pro­moted to work with 1024px im­ages and vice versa

  • bet­ter Dan­booru tag­ger CNN for pro­vid­ing clas­si­fi­ca­tion em­bed­dings for var­i­ous pur­pos­es, par­tic­u­larly FID loss mon­i­tor­ing, mini­batch discrimination/auxiliary loss, and style trans­fer for cre­at­ing a ‘StyleDan­booru’

    • with a StyleDan­booru, I am cu­ri­ous if that can be used as a par­tic­u­larly Pow­er­ful Form Of Data Aug­men­ta­tion for small n char­ac­ter datasets, and whether it leads to a re­ver­sal of train­ing dy­nam­ics with edges com­ing be­fore colors/textures—it’s pos­si­ble that a StyleDan­booru could make many GAN ar­chi­tec­tures, not just StyleGAN, sta­ble to train on anime/illustration datasets
  • bor­row­ing ar­chi­tec­tural en­hance­ments from BigGAN: self­-at­ten­tion lay­ers, spec­tral norm reg­u­lar­iza­tion, large-mini­batch train­ing, and a rec­ti­fied Gauss­ian dis­tri­b­u­tion for the la­tent vec­tor z

  • tex­t→im­age con­di­tional GAN ar­chi­tec­ture (à la StackGAN):

    This would take the text tag de­scrip­tions of each im­age com­piled by Dan­booru users and use those as in­puts to StyleGAN, which, should it work, would mean you could cre­ate ar­bi­trary anime im­ages sim­ply by typ­ing in a string like 1_boy samurai facing_viewer red_hair clouds sword armor blood etc.

    This should al­so, by pro­vid­ing rich se­man­tic de­scrip­tions of each im­age, make train­ing faster & sta­bler and con­verge to higher fi­nal qual­i­ty.

  • meta-learn­ing for few-shot face or char­ac­ter or artist im­i­ta­tion (eg Set-CGAN or or per­haps , or —the last of which achieves few-shot learn­ing with sam­ples of n = 25 TWDNE StyleGAN anime faces)

ImageNet StyleGAN

As part of ex­per­i­ments in scal­ing up StyleGAN 2, us­ing , we ran StyleGAN on large-s­cale datasets in­clud­ing Dan­booru2019, Im­a­geNet, and sub­sets of the . De­spite run­ning for mil­lions of im­ages, no S2 run ever achieved re­motely the re­al­ism of S2 on FFHQ or BigGAN on Im­a­geNet: while the tex­tures could be sur­pris­ingly good, the se­man­tic global struc­ture never came to­geth­er, with glar­ing flaws—there would be too many heads, or they would be de­tached from bod­ies, etc.

Aaron Gokaslan took the time to compute the FID on ImageNet, estimating a terrible score of FID ~120. (Higher=worse; for comparison, BigGAN with can be as good as FID ~7, and regular BigGAN typically surpasses FID 120 within a few thousand iterations.) Even experiments in increasing the S2 model size up to ~1GB (by increasing the feature map multiplier) improved quality relatively modestly, and showed no signs of ever approaching BigGAN-level quality. We concluded that StyleGAN is in fact fundamentally limited as a GAN, and switched over to BigGAN work.

For those in­ter­est­ed, we pro­vide our 512px Im­a­geNet S2 (step 1,394,688):

rsync --verbose rsync:// ./
Shawn Presser, S2 Im­a­geNet in­ter­po­la­tion video from part­way through train­ing (~45 hours on a TPUv3-512, 3k images/s)

Danbooru2019+e621 256px BigGAN

As part of testing our modifications to compare_gan, including sampling from multiple datasets to increase n, using flood loss to stabilize it, and adding an additional (crude, limited) kind of self-supervised loss to the D, we trained several 256px BigGANs, initially on Danbooru2019 SFW but then adding in the TWDNE portraits & e621/e621-portraits partway through training. This destabilized the models greatly, but the flood loss appears to have stopped divergence and they gradually recovered. Run #39 did somewhat better than run #40; the self-supervised variants never recovered. This indicated to us that our self-supervised loss needed heavy revision (as indeed it did), and that flood loss was more valuable than expected, and we investigated it further; the important part appears (for GANs, anyway) to be the stop-loss part, halting training of G/D when it gets ‘too good’. Freezing models is an old GAN trick which is mostly not used post-WGAN, but appears useful for BigGAN, perhaps because of the spiky loss curve, especially early in training.
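For readers unfamiliar with the term, here is a sketch of the two ideas, assuming ‘flood loss’ means the flooding regularizer |loss − b| + b with flood level b, and that the ‘stop-loss’ variant simply skips a network’s update while its loss is below b. The function names are invented here, this is not the compare_gan implementation, and the levels are the `d_flood`/`g_flood` values from the config below.

```python
def flood(loss, level):
    """Flooding: above `level` this is the ordinary loss; below it, the loss is
    mirrored upward, so gradient descent pushes the raw loss back toward `level`."""
    return abs(loss - level) + level

def should_update(loss, level):
    """Stop-loss variant: freeze a network whenever its loss is already below `level`."""
    return loss > level

D_FLOOD, G_FLOOD = 0.2, 0.05  # options.d_flood & options.g_flood from the run's config

for d_loss in (0.5, 0.1):
    print(d_loss, flood(d_loss, D_FLOOD), should_update(d_loss, D_FLOOD))
```

Either way, the effect is to keep D (or G) from driving its loss toward zero and overpowering the other network.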

We ran it for 607,250 it­er­a­tions on a TPUv3-256 pod un­til 2020-05-15. Con­fig:

{"": "images_256", "resnet_biggan.Discriminator.blocks_with_attention": "B2",
"": 96, "resnet_biggan.Generator.blocks_with_attention": "B5",
"": 96, "resnet_biggan.Generator.plain_tanh": false, "ModularGAN.d_lr": 0.0005,
"ModularGAN.d_lr_mul": 3.0, "ModularGAN.ema_start_step": 4000, "ModularGAN.g_lr": 6.66e-05,
"ModularGAN.g_lr_mul": 1.0, "options.batch_size": 2048, "options.d_flood": 0.2,
"options.datasets": "gs://XYZ-euw4a/datasets/danbooru2019-s/danbooru2019-s-0*,gs://XYZ-euw4a/datasets/e621-s/e621-s-0*,
"options.g_flood": 0.05, "options.labels": "", "options.random_labels": true, "options.z_dim": 140,
"run_config.experimental_host_call_every_n_steps": 50, "run_config.keep_checkpoint_every_n_hours": 0.5,
"standardize_batch.use_cross_replica_mean": true, "TpuSummaries.save_image_steps": 50, "TpuSummaries.save_summary_steps": 1}
90 ran­dom EMA sam­ples (un­trun­cat­ed) from the 256px BigGAN trained on Danbooru2019/anime-portraits/e621/e621-portraits.
In­ter­po­la­tion us­ing High­CWu Pad­dlePad­dle Google Co­lab note­book

The model is avail­able for down­load:

rsync --verbose rsync:// ./

compare_gan con­fig:

$ cat bigrun39b/operative_config-603500.gin
# Parameters for AdamOptimizer:
# ==============================================================================
AdamOptimizer.beta1 = 0.0
AdamOptimizer.beta2 = 0.999
AdamOptimizer.epsilon = 1e-08
AdamOptimizer.use_locking = False

# Parameters for batch_norm:
# ==============================================================================
# None.

# Parameters for BigGanResNetBlock:
# ==============================================================================
BigGanResNetBlock.add_shortcut = True

# Parameters for conditional_batch_norm:
# ==============================================================================
conditional_batch_norm.use_bias = False

# Parameters for cross_replica_moments:
# ==============================================================================
cross_replica_moments.group_size = None
cross_replica_moments.parallel = True

# Parameters for D:
# ==============================================================================
D.batch_norm_fn = None
D.layer_norm = False
D.spectral_norm = True

# Parameters for dataset:
# ============================================================================== = 'images_256'
dataset.seed = 547

# Parameters for resnet_biggan.Discriminator:
# ==============================================================================
resnet_biggan.Discriminator.blocks_with_attention = 'B2' = 96
resnet_biggan.Discriminator.channel_multipliers = None
resnet_biggan.Discriminator.project_y = True

# Parameters for G:
# ==============================================================================
G.batch_norm_fn = @conditional_batch_norm
G.spectral_norm = True

# Parameters for resnet_biggan.Generator:
# ==============================================================================
resnet_biggan.Generator.blocks_with_attention = 'B5' = 96
resnet_biggan.Generator.channel_multipliers = None
resnet_biggan.Generator.embed_bias = False
resnet_biggan.Generator.embed_y = True
resnet_biggan.Generator.embed_y_dim = 128
resnet_biggan.Generator.embed_z = False
resnet_biggan.Generator.hierarchical_z = True
resnet_biggan.Generator.plain_tanh = False

# Parameters for hinge:
# ==============================================================================
# None.

# Parameters for loss:
# ==============================================================================
loss.fn = @hinge

# Parameters for ModularGAN:
# ==============================================================================
ModularGAN.conditional = True
ModularGAN.d_lr = 0.0005
ModularGAN.d_lr_mul = 3.0
ModularGAN.d_optimizer_fn = @tf.train.AdamOptimizer
ModularGAN.deprecated_split_disc_calls = False
ModularGAN.ema_decay = 0.9999
ModularGAN.ema_start_step = 4000
ModularGAN.experimental_force_graph_unroll = False
ModularGAN.experimental_joint_gen_for_disc = False
ModularGAN.fit_label_distribution = False
ModularGAN.g_lr = 6.66e-05
ModularGAN.g_lr_mul = 1.0
ModularGAN.g_optimizer_fn = @tf.train.AdamOptimizer
ModularGAN.g_use_ema = True

# Parameters for no_penalty:
# ==============================================================================
# None.

# Parameters for normal:
# ==============================================================================
normal.mean = 0.0
normal.seed = None

# Parameters for options:
# ==============================================================================
options.architecture = 'resnet_biggan_arch'
options.batch_size = 2048
options.d_flood = 0.2
options.datasets = \
options.description = \
    'Describe your GIN config. (This appears in the tensorboard text tab.)'
options.disc_iters = 2
options.discriminator_normalization = None
options.g_flood = 0.05
options.gan_class = @ModularGAN
options.image_grid_height = 3
options.image_grid_resolution = 1024
options.image_grid_width = 3
options.labels = ''
options.lamba = 1
options.model_dir = 'gs://darnbooru-euw4a/runs/bigrun39b/'
options.num_classes = 1000
options.random_labels = True
options.training_steps = 250000
options.transpose_input = False
options.z_dim = 140

# Parameters for penalty:
# ==============================================================================
penalty.fn = @no_penalty

# Parameters for replace_labels:
# ==============================================================================
replace_labels.file_pattern = None

# Parameters for run_config:
# ==============================================================================
run_config.experimental_host_call_every_n_steps = 50
run_config.iterations_per_loop = 250
run_config.keep_checkpoint_every_n_hours = 0.5
run_config.keep_checkpoint_max = 10
run_config.save_checkpoints_steps = 250
run_config.single_core = False
run_config.tf_random_seed = None

# Parameters for spectral_norm:
# ==============================================================================
spectral_norm.epsilon = 1e-12
spectral_norm.singular_value = 'auto'

# Parameters for standardize_batch:
# ==============================================================================
standardize_batch.decay = 0.9
standardize_batch.epsilon = 1e-05
standardize_batch.use_cross_replica_mean = True
standardize_batch.use_moving_averages = False

# Parameters for TpuSummaries:
# ==============================================================================
TpuSummaries.save_image_steps = 50
TpuSummaries.save_summary_steps = 1

# Parameters for train_imagenet_transform:
# ==============================================================================
train_imagenet_transform.crop_method = 'random'

# Parameters for weights:
# ==============================================================================
weights.initializer = 'orthogonal'

# Parameters for z:
# ==============================================================================
z.distribution_fn = @tf.random.normal
z.maxval = 1.0
z.minval = -1.0
z.stddev = 1.0


I explore BigGAN, another recent GAN with SOTA results on the most complex image domain tackled by GANs so far, ImageNet. BigGAN’s capabilities come at a steep compute cost, however. I experiment with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by the compute expense & practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.

The primary rival GAN to StyleGAN for large-scale image synthesis as of mid-2019 is BigGAN (Brock et al 2018; official BigGAN-PyTorch implementation & models).

BigGAN suc­cess­fully trains on up to 512px im­ages from Im­a­geNet, from all 1000 cat­e­gories (con­di­tioned on cat­e­go­ry), with near-pho­to­re­al­is­tic re­sults on the best-rep­re­sented cat­e­gories (dogs), and ap­par­ently can even han­dle the far larger in­ter­nal Google JFT dataset. In con­trast, StyleGAN, while far less com­pu­ta­tion­ally de­mand­ing, shows poorer re­sults on more com­plex cat­e­gories (Kar­ras et al 2018’s LSUN CATS StyleGAN; our whole-Dan­booru2018 pi­lots) and has not been demon­strated to scale to Im­a­geNet, much less be­yond.

BigGAN does this by com­bin­ing a few im­prove­ments on stan­dard DCGANs (most of which are not used in StyleGAN):

Brock et al 2018: BigGAN-deep ar­chi­tec­ture (Fig­ure 16, Ta­ble 5)

The down­side is that, as the name in­di­cates, BigGAN is both a big model and re­quires big com­pute (par­tic­u­lar­ly, big mini­batch­es)—­some­where around $20,000, we es­ti­mate, based on pub­lic TPU pric­ing.

This presents a dilemma: larger-scale portrait modeling or whole-anime image modeling may be beyond StyleGAN’s current capabilities; but while BigGAN may be able to handle those tasks, we can’t afford to train it!

Must it cost that much? Prob­a­bly not. In par­tic­u­lar, BigGAN’s use of a fixed large mini­batch through­out train­ing is prob­a­bly in­effi­cient: it is highly un­likely that the ben­e­fits of a n = 2048 mini­batch are nec­es­sary at the be­gin­ning of train­ing when the Gen­er­a­tor is gen­er­at­ing sta­tic which looks noth­ing at all like real data, and at the end of train­ing, that may still be too small a mini­batch (Brock et al 2018 note that the ben­e­fits of larger mini­batches had not sat­u­rated at n = 2048 but time/compute was not avail­able to test larger still mini­batch­es, which is con­sis­tent with the ob­ser­va­tion that the harder & more RL-like a prob­lem, the larger the mini­batch it need­s). Typ­i­cal­ly, mini­batches and/or learn­ing rates are sched­uled: im­pre­cise gra­di­ents are ac­cept­able early on, while as the model ap­proaches per­fec­tion, more ex­act gra­di­ents are nec­es­sary. So it should be pos­si­ble to start out with mini­batches a tiny frac­tion of the size and grad­u­ally scale them up dur­ing train­ing, sav­ing an enor­mous amount of com­pute com­pared to BigGAN’s re­ported num­bers. The gra­di­ent noise scale could pos­si­bly be used to au­to­mat­i­cally set the to­tal mini­batch scale, al­though I did­n’t find any ex­am­ples of any­one us­ing it in Py­Torch this way. And us­ing TPU pods pro­vides large amounts of VRAM, but is not nec­es­sar­ily the cheap­est form of com­pute.
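
The scheduling idea above can be made concrete with a small sketch. This is not how BigGAN was trained (it used a fixed minibatch); the function names, the starting size of 64, the geometric ramp, and the run length are all illustrative assumptions, chosen only to show how many fewer images a ramped schedule would process than a fixed n = 2048 minibatch.

```python
def batch_size_at(step, total_steps, start=64, end=2048):
    """Hypothetical schedule: grow the minibatch geometrically from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # training progress, clamped to [0, 1]
    size = start * (end / start) ** frac           # geometric interpolation
    return max(start, int(size // start) * start)  # round down to a multiple of `start`


def images_seen(total_steps, **kwargs):
    """Total images processed over a run under the ramped schedule."""
    return sum(batch_size_at(s, total_steps, **kwargs) for s in range(total_steps))


if __name__ == "__main__":
    steps = 100_000  # illustrative run length, not a figure from any real run
    ramped = images_seen(steps)
    fixed = 2048 * steps
    print(f"ramped schedule processes {ramped / fixed:.0%} of the fixed-minibatch images")
```

Fewer images processed is only a proxy for compute saved, and whether GAN training actually tolerates such a ramp is not something this sketch can show.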

BigGAN Transfer Learning

An­other op­ti­miza­tion is to ex­ploit trans­fer learn­ing from the re­leased mod­els, and reuse the enor­mous amount of com­pute in­vested in them. The prac­ti­cal de­tails there are fid­dly. The orig­i­nal BigGAN 2018 re­lease in­cluded the 128px/256px/512px Gen­er­a­tor Ten­sor­flow mod­els but not their Dis­crim­i­na­tors, nor a train­ing code­base; the compare_gan Ten­sor­flow code­base re­leased in early 2019 in­cludes an in­de­pen­dent im­ple­men­ta­tion of BigGAN that can po­ten­tially train them, and I be­lieve that the Gen­er­a­tor may still be us­able for trans­fer learn­ing on its own and if not—­given the ar­gu­ments that Dis­crim­i­na­tors sim­ply mem­o­rize data and do not learn much be­yond that—the Dis­crim­i­na­tors can be trained from scratch by sim­ply freez­ing a G while train­ing its D on G out­puts for as long as nec­es­sary. The 2019 Py­Torch re­lease in­cludes a differ­ent mod­el, a full 128px model with G/D (at 2 points in its train­ing), and code to con­vert the orig­i­nal Ten­sor­flow mod­els into Py­Torch for­mat; the catch there is that the pre­trained model must be loaded into ex­actly the same ar­chi­tec­ture, and while the Py­Torch code­base de­fines the ar­chi­tec­ture for 32/64/128/256px BigGANs, it does not (as of 2019-06-04) de­fine the ar­chi­tec­ture for a 512px BigGAN or BigGAN-deep (I tried but could­n’t get it quite right). It would also be pos­si­ble to do model surgery and pro­mote the 128px model to a 512px mod­el, since the two up­scal­ing blocks (128px→256px and 256px→512px) should be easy to learn (sim­i­lar to my use of wai­fu2x to fake a 1024px StyleGAN anime face mod­el). Any­way, the up­shot is that one can only use the 128px/256px pre­trained mod­els; the 512px will be pos­si­ble with a small up­date to the Py­Torch code­base.
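
The "exactly the same architecture" constraint is what makes loading pretrained weights fiddly: every parameter name and shape in the checkpoint has to match the model being built. A minimal sketch of that check follows, using plain dictionaries of name → shape in place of real PyTorch state dicts; the function name, parameter names, and shapes are made up for illustration and are not BigGAN's.

```python
def architecture_mismatches(checkpoint_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings and list every disagreement."""
    problems = []
    for name in sorted(set(checkpoint_shapes) | set(model_shapes)):
        if name not in model_shapes:
            problems.append(f"unexpected in checkpoint: {name}")
        elif name not in checkpoint_shapes:
            problems.append(f"missing from checkpoint: {name}")
        elif tuple(checkpoint_shapes[name]) != tuple(model_shapes[name]):
            problems.append(f"shape mismatch for {name}: "
                            f"{tuple(checkpoint_shapes[name])} vs {tuple(model_shapes[name])}")
    return problems


# Made-up shapes: the checkpoint has one more block than the model defines.
checkpoint = {"blocks.0.conv.weight": (1536, 1536, 3, 3), "blocks.5.conv.weight": (96, 192, 3, 3)}
model = {"blocks.0.conv.weight": (1536, 1536, 3, 3)}

for problem in architecture_mismatches(checkpoint, model):
    print(problem)
```

With real PyTorch models the same comparison can be made on the output of `state_dict()`; `load_state_dict` in its default strict mode raises an error on mismatches of this kind.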

All in all, it is pos­si­ble that BigGAN with some tweaks could be afford­able to train. (At least, with some crowd­fund­ing…)

BigGAN: Danbooru2018-1K Experiments

To test the waters, I ran three BigGAN experiments:

  1. I first experimented with retraining the ImageNet 128px model⁴⁴.

    That resulted in almost total mode collapse when I re-enabled G after 2 days; investigating, I realized that I had misunderstood: it was a brand-new BigGAN model, trained independently, and came with its fully-trained D already. Oops.

  2. trans­fer learn­ing the 128px Im­a­geNet Py­Torch BigGAN model to the 1k anime por­traits; suc­cess­ful with ~6 GPU-days

  3. train­ing from scratch a 256px BigGAN-deep on the 1k por­traits;

    Partially successful after ~240 GPU-days: it reached comparable quality to StyleGAN before suffering serious mode collapse, possibly due to being forced to run with small minibatch sizes by BigGAN bugs.

Danbooru2018-1K Dataset

Constructing D1k

Con­struct­ing a new Dan­booru-1k dataset: as BigGAN re­quires con­di­tion­ing in­for­ma­tion, I con­structed new 512px whole-im­age & por­trait datasets by tak­ing the 1000 most pop­u­lar Dan­booru2018 char­ac­ters, with char­ac­ters as cat­e­gories, and cropped out por­traits as usu­al:

cat metadata/20180000000000* | fgrep -e '"name":"solo"' | fgrep -v '"rating":"e"' | \
    jq -c '.tags | .[] | select(.category == "4") | .name' | sort | uniq --count | \
    sort --numeric-sort > characters.txt
mkdir ./characters-1k/ ; cd ./characters-1k/
# Crop the faces out of every solo, non-explicit image of one character into its own folder.
cpCharacterFace () {
    CHARACTER="$1" # character tag, passed in by `parallel` below
    CHARACTER_SAFE=$(echo "$CHARACTER" | tr '[:punct:]' '.') # folder name with punctuation replaced by dots
    mkdir "$CHARACTER_SAFE"
    IDS=$(cat ../metadata/* | fgrep '"name":"'"$CHARACTER"'"' | fgrep -e '"name":"solo"' \
          | fgrep -v '"rating":"e"' | jq .id | tr -d '"')
    for ID in $IDS; do
        BUCKET=$(printf "%04d" $(( $ID % 1000 )) );
        TARGET=$(ls ../original/$BUCKET/$ID.*)
        CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/ \
            ~/src/lbpcascade_animeface/lbpcascade_animeface.xml "$TARGET" "./$CHARACTER_SAFE/$ID"
    done
}
export -f cpCharacterFace
tail -1200 ../characters.txt | cut -d '"' -f 2 | parallel --progress cpCharacterFace
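
For readers more comfortable in Python, the two bits of string handling in the shell function above can be sketched as follows: the `tr '[:punct:]' '.'` step that turns a character tag into a folder name, and the `printf "%04d" $(( ID % 1000 ))` step that picks an image's bucket directory. The helper names are hypothetical, and the example tag is an assumption about what the underlying Danbooru tag looks like.

```python
import string

# ASCII punctuation: the same 32 characters as POSIX [:punct:] in the C locale.
_PUNCT_TO_DOT = str.maketrans({ch: "." for ch in string.punctuation})


def safe_folder_name(character_tag):
    """Mirror `tr '[:punct:]' '.'`: replace each punctuation character with a dot."""
    return character_tag.translate(_PUNCT_TO_DOT)


def bucket_for(image_id):
    """Mirror `printf "%04d" $(( ID % 1000 ))`: the ID modulo 1000, zero-padded to 4 digits."""
    return f"{image_id % 1000:04d}"


print(safe_folder_name("abigail_williams_(fate/grand_order)"))
print(bucket_for(1234567))
```

This is why the class names in the final list contain runs of dots: underscores, parentheses, and slashes all collapse to `.`.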

I merged a number of redundant folders by hand⁴⁵, cleaned as usual, and did further cropping as necessary to reach 1000. This resulted in 212,359 portrait faces, with the largest class (Hatsune Miku) having 6,624 images and the smallest classes having ~0 or 1 images. (I don’t know if the class imbalance constitutes a real problem for BigGAN, as ImageNet itself is imbalanced on many levels.)

The data-load­ing code at­tempts to make the class index/ID num­ber line up with the folder count, so the nth al­pha­bet­i­cal folder (char­ac­ter) should have class ID n, which is im­por­tant to know for gen­er­at­ing con­di­tional sam­ples. The fi­nal set/IDs (as de­fined for my Dan­booru 1K dataset by find_classes):

2k.tan: 0
abe.nana: 1
abigail.williams..fate.grand.order.: 2
abukuma..kantai.collection.: 3
admiral..kantai.collection.: 4
aegis..persona.: 5
aerith.gainsborough: 6
afuro.terumi: 7
agano..kantai.collection.: 8
agrias.oaks: 9
ahri: 10
aida.mana: 11
aino.minako: 12
aisaka.taiga: 13
aisha..elsword.: 14
akagi..kantai.collection.: 15
akagi.miria: 16
akashi..kantai.collection.: 17
akatsuki..kantai.collection.: 18
akaza.akari: 19
akebono..kantai.collection.: 20
akemi.homura: 21
aki.minoriko: 22
aki.shizuha: 23
akigumo..kantai.collection.: 24
akitsu.maru..kantai.collection.: 25
akitsushima..kantai.collection.: 26
akiyama.mio: 27
akiyama.yukari: 28
akizuki..kantai.collection.: 29
akizuki.ritsuko: 30
akizuki.ryou: 31
akuma.homura: 32
albedo: 33
alice..wonderland.: 34
alice.margatroid: 35
alice.margatroid..pc.98.: 36
alisa.ilinichina.amiella: 37
altera..fate.: 38
amagi..kantai.collection.: 39
amagi.yukiko: 40
amami.haruka: 41
amanogawa.kirara: 42
amasawa.yuuko: 43
amatsukaze..kantai.collection.: 44 45
anastasia..idolmaster.: 46
anchovy: 47
android.18: 48
android.21: 49
anegasaki.nene: 50
angel..kof.: 51
angela.balzac: 52
anjou.naruko: 53
aoba..kantai.collection.: 54
aoki.reika: 55
aori..splatoon.: 56
aozaki.aoko: 57
aqua..konosuba.: 58
ara.han: 59
aragaki.ayase: 60
araragi.karen: 61
arashi..kantai.collection.: 62
arashio..kantai.collection.: 63
archer: 64
arcueid.brunestud: 65
arima.senne: 66
artoria.pendragon..all.: 67
artoria.pendragon..lancer.: 68
artoria.pendragon..lancer.alter.: 69
artoria.pendragon..swimsuit.rider.alter.: 70
asahina.mikuru: 71
asakura.ryouko: 72
asashimo..kantai.collection.: 73
asashio..kantai.collection.: 74
ashigara..kantai.collection.: 75
asia.argento: 76
astolfo..fate.: 77
asui.tsuyu: 78
asuna..sao.: 79
atago..azur.lane.: 80
atago..kantai.collection.: 81
atalanta..fate.: 82
au.ra: 83
ayanami..azur.lane.: 84
ayanami..kantai.collection.: 85
ayanami.rei: 86
ayane..doa.: 87
ayase.eli: 88
baiken: 89
bardiche: 90
barnaby.brooks.jr: 91
battleship.hime: 92
bayonetta..character.: 93
bb..fate...all.: 94
bb..fate.extra.ccc.: 95
bb..swimsuit.mooncancer...fate.: 96
beatrice: 97
belfast..azur.lane.: 98
bismarck..kantai.collection.: 99
black.hanekawa: 100
black.rock.shooter..character.: 101
blake.belladonna: 102
blanc: 103
boko..girls.und.panzer.: 104
bottle.miku: 105
boudica..fate.grand.order.: 106
bowsette: 107
bridget..guilty.gear.: 108
busujima.saeko: 109
c.c.: 110
c.c..lemon..character.: 111
caesar.anthonio.zeppeli: 112
cagliostro..granblue.fantasy.: 113 114
cammy.white: 115
caren.hortensia: 116
caster: 117
cecilia.alcott: 118
celes.chere: 119
charlotte..madoka.magica.: 120
charlotte.dunois: 121
charlotte.e.yeager: 122
chen: 123
chibi.usa: 124
chiki: 125
chitanda.eru: 126
chloe.von.einzbern: 127
choukai..kantai.collection.: 128 129
ciel: 130
cirno: 131
clarisse..granblue.fantasy.: 132
clownpiece: 133
consort.yu..fate.: 134 135
cure.happy: 136
cure.march: 137
cure.marine: 138
cure.moonlight: 139
cure.peace: 140
cure.sunny: 141
cure.sunshine: 142
cure.twinkle: 143 144
daiyousei: 145
danua: 146
darjeeling: 147
dark.magician.girl: 148
dio.brando: 149
dizzy: 150
djeeta..granblue.fantasy.: 151
doremy.sweet: 152
eas: 153
eila.ilmatar.juutilainen: 154
elesis..elsword.: 155
elin..tera.: 156
elizabeth.bathory..brave...fate.: 157
elizabeth.bathory..fate.: 158
elizabeth.bathory..fate...all.: 159
ellen.baker: 160
elphelt.valentine: 161
elsa..frozen.: 162 163
emiya.kiritsugu: 164
emiya.shirou: 165
emperor.penguin..kemono.friends.: 166 167
enoshima.junko: 168
enterprise..azur.lane.: 169
ereshkigal..fate.grand.order.: 170
erica.hartmann: 171
etna: 172
eureka: 173
eve..elsword.: 174
ex.keine: 175
failure.penguin: 176
fate.testarossa: 177
felicia: 178
female.admiral..kantai.collection.: 179 180
female.protagonist..pokemon.go.: 181
fennec..kemono.friends.: 182
ferry..granblue.fantasy.: 183
flandre.scarlet: 184
florence.nightingale..fate.grand.order.: 185
fou..fate.grand.order.: 186
francesca.lucchini: 187 188
fubuki..kantai.collection.: 189
fujibayashi.kyou: 190
fujimaru.ritsuka..female.: 191 192
furude.rika: 193
furudo.erika: 194
furukawa.nagisa: 195
fusou..kantai.collection.: 196
futaba.anzu: 197
futami.mami: 198
futatsuiwa.mamizou: 199
fuuro..pokemon.: 200
galko: 201
gambier.bay..kantai.collection.: 202
ganaha.hibiki: 203
gangut..kantai.collection.: 204
gardevoir: 205
gasai.yuno: 206
gertrud.barkhorn: 207
gilgamesh: 208
ginga.nakajima: 209
giorno.giovanna: 210
gokou.ruri: 211
graf.eisen: 212
graf.zeppelin..kantai.collection.: 213
grey.wolf..kemono.friends.: 214
gumi: 215
hachikuji.mayoi: 216
hagikaze..kantai.collection.: 217
hagiwara.yukiho: 218
haguro..kantai.collection.: 219
hakurei.reimu: 220
hamakaze..kantai.collection.: 221
hammann..azur.lane.: 222
han.juri: 223
hanasaki.tsubomi: 224
hanekawa.tsubasa: 225
hanyuu: 226
haramura.nodoka: 227
harime.nui: 228
haro: 229
haruka..pokemon.: 230
haruna..kantai.collection.: 231 232
harusame..kantai.collection.: 233
hasegawa.kobato: 234
hassan.of.serenity..fate.: 235 236
hatoba.tsugu..character.: 237
hatsune.miku: 238
hatsune.miku..append.: 239
hatsuyuki..kantai.collection.: 240
hatsuzuki..kantai.collection.: 241
hayami.kanade: 242
hayashimo..kantai.collection.: 243
hayasui..kantai.collection.: 244
hecatia.lapislazuli: 245
helena.blavatsky..fate.grand.order.: 246
heles: 247
hestia..danmachi.: 248
hex.maniac..pokemon.: 249
hibari..senran.kagura.: 250
hibiki..kantai.collection.: 251 252
hiei..kantai.collection.: 253
higashi.setsuna: 254
higashikata.jousuke: 255
high.priest: 256
hiiragi.kagami: 257
hiiragi.tsukasa: 258
hijiri.byakuren: 259
hikari..pokemon.: 260
himejima.akeno: 261
himekaidou.hatate: 262
hinanawi.tenshi: 263 264
hino.akane..idolmaster.: 265 266
hino.rei: 267
hirasawa.ui: 268
hirasawa.yui: 269
hiryuu..kantai.collection.: 270
hishikawa.rikka: 271
hk416..girls.frontline.: 272
holo: 273
homura..xenoblade.2.: 274
honda.mio: 275
hong.meiling: 276
honma.meiko: 277
honolulu..azur.lane.: 278
horikawa.raiko: 279
hoshi.shouko: 280
hoshiguma.yuugi: 281
hoshii.miki: 282
hoshimiya.ichigo: 283
hoshimiya.kate: 284
hoshino.fumina: 285
hoshino.ruri: 286
hoshizora.miyuki: 287
hoshizora.rin: 288
hotarumaru: 289
hoto.cocoa: 290
houjou.hibiki: 291
houjou.karen: 292
houjou.satoko: 293
houjuu.nue: 294
houraisan.kaguya: 295
houshou..kantai.collection.: 296
huang.baoling: 297
hyuuga.hinata: 298
i.168..kantai.collection.: 299
i.19..kantai.collection.: 300
i.26..kantai.collection.: 301
i.401..kantai.collection.: 302
i.58..kantai.collection.: 303
i.8..kantai.collection.: 304
ia..vocaloid.: 305
ibaraki.douji..fate.grand.order.: 306
ibaraki.kasen: 307
ibuki.fuuko: 308
ibuki.suika: 309 310
ichinose.kotomi: 311
ichinose.shiki: 312
ikamusume: 313
ikazuchi..kantai.collection.: 314
illustrious..azur.lane.: 315
illyasviel.von.einzbern: 316
imaizumi.kagerou: 317
inaba.tewi: 318
inami.mahiru: 319
inazuma..kantai.collection.: 320
index: 321
ingrid: 322
inkling: 323
inubashiri.momiji: 324
inuyama.aoi: 325
iori.rinko: 326
iowa..kantai.collection.: 327
irisviel.von.einzbern: 328
iroha..samurai.spirits.: 329
ishtar..fate.grand.order.: 330
isokaze..kantai.collection.: 331
isonami..kantai.collection.: 332
isuzu..kantai.collection.: 333
itsumi.erika: 334
ivan.karelin: 335
izayoi.sakuya: 336
izumi.konata: 337
izumi.sagiri: 338
jack.the.ripper..fate.apocrypha.: 339
jakuzure.nonon: 340
japanese.crested.ibis..kemono.friends.: 341
jeanne.d.arc..alter...fate.: 342
jeanne.d.arc..alter.swimsuit.berserker.: 343
jeanne.d.arc..fate.: 344
jeanne.d.arc..fate...all.: 345
jeanne.d.arc..granblue.fantasy.: 346
jeanne.d.arc..swimsuit.archer.: 347
jeanne.d.arc.alter.santa.lily: 348
jintsuu..kantai.collection.: 349
jinx..league.of.legends.: 350
johnny.joestar: 351
jonathan.joestar: 352
joseph.joestar..young.: 353
jougasaki.mika: 354
jougasaki.rika: 355 356
junketsu: 357
junko..touhou.: 358
kaban..kemono.friends.: 359
kaburagi.t.kotetsu: 360
kaenbyou.rin: 361 362
kafuu.chino: 363
kaga..kantai.collection.: 364
kagamine.len: 365
kagamine.rin: 366
kagerou..kantai.collection.: 367
kagiyama.hina: 368
kagura..gintama.: 369
kaguya.luna..character.: 370
kaito: 371
kaku.seiga: 372
kakyouin.noriaki: 373
kallen.stadtfeld: 374
kamikaze..kantai.collection.: 375
kamikita.komari: 376
kamio.misuzu: 377
kamishirasawa.keine: 378
kamiya.nao: 379
kamoi..kantai.collection.: 380
kaname.madoka: 381
kanbaru.suruga: 382
kanna.kamui: 383
kanzaki.ranko: 384
karina.lyle: 385
kasane.teto: 386
kashima..kantai.collection.: 387
kashiwazaki.sena: 388
kasodani.kyouko: 389 390
kasugano.sora: 391
kasumi..doa.: 392
kasumi..kantai.collection.: 393
kasumi..pokemon.: 394
kasumigaoka.utaha: 395
katori..kantai.collection.: 396
katou.megumi: 397
katsura.hinagiku: 398
katsuragi..kantai.collection.: 399
katsushika.hokusai..fate.grand.order.: 400
katyusha: 401
kawakami.mai: 402
kawakaze..kantai.collection.: 403
kawashiro.nitori: 404
kay..girls.und.panzer.: 405
kazama.asuka: 406
kazami.yuuka: 407
kenzaki.makoto: 408
kijin.seija: 409
kikuchi.makoto: 410
kino: 411
kino.makoto: 412 413
kinugasa..kantai.collection.: 414
kirigaya.suguha: 415
kirigiri.kyouko: 416
kirijou.mitsuru: 417
kirima.sharo: 418
kirin..armor.: 419
kirino.ranmaru: 420
kirisame.marisa: 421
kirishima..kantai.collection.: 422
kirito: 423
kiryuuin.satsuki: 424
kisaragi..kantai.collection.: 425
kisaragi.chihaya: 426
kise.yayoi: 427
kishibe.rohan: 428
kishin.sagume: 429
kiso..kantai.collection.: 430
kiss.shot.acerola.orion.heart.under.blade: 431
kisume: 432
kitakami..kantai.collection.: 433
kiyohime..fate.grand.order.: 434
kiyoshimo..kantai.collection.: 435 436
koakuma: 437
kobayakawa.rinko: 438
kobayakawa.sae: 439
kochiya.sanae: 440
kohinata.miho: 441
koizumi.hanayo: 442
komaki.manaka: 443
komeiji.koishi: 444
komeiji.satori: 445
kongou..kantai.collection.: 446 447
konpaku.youmu: 448
konpaku.youmu..ghost.: 449
kooh: 450
kos.mos: 451
koshimizu.sachiko: 452
kotobuki.tsumugi: 453
kotomine.kirei: 454
kotonomiya.yuki: 455
kousaka.honoka: 456
kousaka.kirino: 457
kousaka.tamaki: 458
kozakura.marry: 459
kuchiki.rukia: 460
kujikawa.rise: 461
kujou.karen: 462
kula.diamond: 463
kuma..kantai.collection.: 464
kumano..kantai.collection.: 465
kumoi.ichirin: 466
kunikida.hanamaru: 467
kuradoberi.jam: 468
kuriyama.mirai: 469
kurodani.yamame: 470 471
kurokawa.eren: 472
kuroki.tomoko: 473
kurosawa.dia: 474
kurosawa.ruby: 475
kuroshio..kantai.collection.: 476
kuroyukihime: 477
kurumi.erika: 478
kusanagi.motoko: 479
kusugawa.sasara: 480
kuujou.jolyne: 481
kuujou.joutarou: 482
kyon: 483
kyonko: 484
kyubey: 485
laffey..azur.lane.: 486
lala.satalin.deviluke: 487
lancer: 488 489
laura.bodewig: 490
leafa: 491
lei.lei: 492
lelouch.lamperouge: 493
len: 494
letty.whiterock: 495 496
libeccio..kantai.collection.: 497
lightning.farron: 498
lili..tekken.: 499
lilith.aensland: 500
lillie..pokemon.: 501
lily.white: 502
link: 503 504 505
lucina: 506
lum: 507
luna.child: 508
lunamaria.hawke: 509
lunasa.prismriver: 510
lusamine..pokemon.: 511
lyn..blade...soul.: 512 513
lynette.bishop: 514
m1903.springfield..girls.frontline.: 515
madotsuki: 516
maekawa.miku: 517
maka.albarn: 518
makigumo..kantai.collection.: 519
makinami.mari.illustrious: 520
makise.kurisu: 521
makoto..street.fighter.: 522
makoto.nanaya: 523
mankanshoku.mako: 524
mao..pokemon.: 525
maou..maoyuu.: 526
maribel.hearn: 527
marie.antoinette..fate.grand.order.: 528
mash.kyrielight: 529
matoi..pso2.: 530
matoi.ryuuko: 531 532
matsuura.kanan: 533
maya..kantai.collection.: 534
me.tan: 535
medicine.melancholy: 536
medjed: 537
meer.campbell: 538
megumin: 539
megurine.luka: 540
mei..overwatch.: 541
mei..pokemon.: 542
meiko: 543
meltlilith: 544
mercy..overwatch.: 545
merlin.prismriver: 546
michishio..kantai.collection.: 547
midare.toushirou: 548
midna: 549
midorikawa.nao: 550
mika..girls.und.panzer.: 551
mikasa.ackerman: 552
mikazuki.munechika: 553
miki.sayaka: 554
millia.rage: 555
mima: 556
mimura.kanako: 557
minami.kotori: 558 559 560
minase.akiko: 561
minase.iori: 562
miqo.te: 563
misaka.mikoto: 564
mishaguji: 565
misumi.nagisa: 566
mithra: 567
miura.azusa: 568
miyafuji.yoshika: 569
miyako.yoshika: 570
miyamoto.frederica: 571
miyamoto.musashi..fate.grand.order.: 572
miyaura.sanshio: 573
mizuhashi.parsee: 574
mizuki..pokemon.: 575
mizunashi.akari: 576
mizuno.ami: 577
mogami..kantai.collection.: 578
momo.velia.deviluke: 579 580 581
mordred..fate.: 582
mordred..fate...all.: 583
morgiana: 584
morichika.rinnosuke: 585
morikubo.nono: 586
moriya.suwako: 587
moroboshi.kirari: 588
morrigan.aensland: 589
motoori.kosuzu: 590
mumei..kabaneri.: 591
murakumo..kantai.collection.: 592
murasa.minamitsu: 593
murasame..kantai.collection.: 594
musashi..kantai.collection.: 595
mutsu..kantai.collection.: 596
mutsuki..kantai.collection.: 597 598 599
myoudouin.itsuki: 600
mysterious.heroine.x: 601
mysterious.heroine.x..alter.: 602
mystia.lorelei: 603
nadia: 604
nagae.iku: 605
naganami..kantai.collection.: 606
nagato..kantai.collection.: 607
nagato.yuki: 608
nagatsuki..kantai.collection.: 609
nagi: 610
nagisa.kaworu: 611
naka..kantai.collection.: 612
nakano.azusa: 613 614
nanami.chiaki: 615 616
nao..mabinogi.: 617
narmaya..granblue.fantasy.: 618
narukami.yuu: 619
narusawa.ryouka: 620
natalia..idolmaster.: 621
natori.sana: 622
natsume..pokemon.: 623
natsume.rin: 624
nazrin: 625
nekomiya.hinata: 626
nekomusume: 627 628
nepgear: 629
neptune..neptune.series.: 630
nero.claudius..bride...fate.: 631
nero.claudius..fate.: 632
nero.claudius..fate...all.: 633
nero.claudius..swimsuit.caster...fate.: 634
nia.teppelin: 635
nibutani.shinka: 636
nico.robin: 637
ninomiya.asuka: 638
nishikino.maki: 639
nishizumi.maho: 640
nishizumi.miho: 641
nitocris..fate.grand.order.: 642
nitocris..swimsuit.assassin...fate.: 643
nitta.minami: 644
noel.vermillion: 645
noire: 646
northern.ocean.hime: 647
noshiro..kantai.collection.: 648
noumi.kudryavka: 649
nu.13: 650
nyarlathotep..nyaruko.san.: 651
oboro..kantai.collection.: 652
oda.nobunaga..fate.: 653
ogata.chieri: 654
ohara.mari: 655
oikawa.shizuku: 656
okazaki.yumemi: 657
okita.souji..alter...fate.: 658
okita.souji..fate.: 659
okita.souji..fate...all.: 660
onozuka.komachi: 661
ooi..kantai.collection.: 662
oomori.yuuko: 663
ootsuki.yui: 664
ooyodo..kantai.collection.: 665
osakabe.hime..fate.grand.order.: 666
oshino.shinobu: 667
otonashi.kotori: 668
panty..psg.: 669
passion.lip: 670
patchouli.knowledge: 671
pepperoni..girls.und.panzer.: 672
perrine.h.clostermann: 673
pharah..overwatch.: 674
phosphophyllite: 675
pikachu: 676
pixiv.tan: 677
platelet..hataraku.saibou.: 678
platinum.the.trinity: 679
pod..nier.automata.: 680
pola..kantai.collection.: 681 682 683
princess.peach: 684
princess.serenity: 685
princess.zelda: 686
prinz.eugen..azur.lane.: 687
prinz.eugen..kantai.collection.: 688
prisma.illya: 689
purple.heart: 690
puru.see: 691
pyonta: 692
qbz.95..girls.frontline.: 693
rachel.alucard: 694
racing.miku: 695
raising.heart: 696
ramlethal.valentine: 697
ranka.lee: 698
ranma.chan: 699
re.class.battleship: 700
reinforce: 701
reinforce.zwei: 702
reisen.udongein.inaba: 703
reiuji.utsuho: 704
reizei.mako: 705 706
remilia.scarlet: 707
rensouhou.chan: 708
rensouhou.kun: 709
rias.gremory: 710
rider: 711
riesz: 712
ringo..touhou.: 713
ro.500..kantai.collection.: 714
roll: 715
rosehip: 716
rossweisse: 717
ruby.rose: 718
rumia: 719
rydia: 720
ryougi.shiki: 721
ryuuguu.rena: 722
ryuujou..kantai.collection.: 723
saber: 724
saber.alter: 725
saber.lily: 726
sagisawa.fumika: 727
saigyouji.yuyuko: 728
sailor.mars: 729
sailor.mercury: 730
sailor.moon: 731
sailor.saturn: 732
sailor.venus: 733
saint.martha: 734
sakagami.tomoyo: 735
sakamoto.mio: 736
sakata.gintoki: 737
sakuma.mayu: 738
sakura.chiyo: 739
sakura.futaba: 740
sakura.kyouko: 741
sakura.miku: 742
sakurai.momoka: 743
sakurauchi.riko: 744
samidare..kantai.collection.: 745
samus.aran: 746
sanya.v.litvyak: 747 748
saotome.ranma: 749
saratoga..kantai.collection.: 750
sasaki.chiho: 751
saten.ruiko: 752
satonaka.chie: 753
satsuki..kantai.collection.: 754
sawamura.spencer.eriri: 755
saya: 756
sazaki.kaoruko: 757
sazanami..kantai.collection.: 758
scathach..fate...all.: 759
scathach..fate.grand.order.: 760
scathach..swimsuit.assassin...fate.: 761
seaport.hime: 762
seeu: 763
seiran..touhou.: 764
seiren..suite.precure.: 765
sekibanki: 766
selvaria.bles: 767
sendai..kantai.collection.: 768 769
sengoku.nadeko: 770
senjougahara.hitagi: 771
senketsu: 772
sento.isuzu: 773
serena..pokemon.: 774
serval..kemono.friends.: 775
sf.a2.miki: 776
shameimaru.aya: 777
shana: 778
shanghai.doll: 779
shantae..character.: 780
sheryl.nome: 781
shibuya.rin: 782
shidare.hotaru: 783
shigure..kantai.collection.: 784
shijou.takane: 785
shiki.eiki: 786
shikinami..kantai.collection.: 787
shikinami.asuka.langley: 788
shimada.arisu: 789
shimakaze..kantai.collection.: 790
shimamura.uzuki: 791
shinjou.akane: 792
shinki: 793
shinku: 794
shiomi.shuuko: 795
shirabe.ako: 796
shirai.kuroko: 797
shirakiin.ririchiyo: 798
shiranui..kantai.collection.: 799
shiranui.mai: 800
shirasaka.koume: 801
shirase.sakuya: 802
shiratsuyu..kantai.collection.: 803
shirayuki.hime: 804
shirogane.naoto: 805
shirona..pokemon.: 806
shoebill..kemono.friends.: 807
shokuhou.misaki: 808
shouhou..kantai.collection.: 809
shoukaku..kantai.collection.: 810
shuten.douji..fate.grand.order.: 811
signum: 812
silica: 813
simon: 814
sinon: 815 816
sona.buvelle: 817
sonoda.umi: 818
sonohara.anri: 819
sonozaki.mion: 820
sonozaki.shion: 821
sora.ginko: 822 823
souryuu..kantai.collection.: 824
souryuu.asuka.langley: 825
souseiseki: 826
star.sapphire: 827
stocking..psg.: 828
su.san: 829
subaru.nakajima: 830
suigintou: 831
suiren..pokemon.: 832
suiseiseki: 833
sukuna.shinmyoumaru: 834
sunny.milk: 835
suomi.kp31..girls.frontline.: 836
super.pochaco: 837
super.sonico: 838
suzukaze.aoba: 839
suzumiya.haruhi: 840
suzutsuki..kantai.collection.: 841
suzuya..kantai.collection.: 842
tachibana.arisu: 843
tachibana.hibiki..symphogear.: 844
tada.riina: 845
taigei..kantai.collection.: 846
taihou..azur.lane.: 847
taihou..kantai.collection.: 848
tainaka.ritsu: 849
takagaki.kaede: 850
takakura.himari: 851
takamachi.nanoha: 852
takami.chika: 853
takanashi.rikka: 854
takao..azur.lane.: 855
takao..kantai.collection.: 856
takara.miyuki: 857
takarada.rikka: 858
takatsuki.yayoi: 859
takebe.saori: 860
tama..kantai.collection.: 861
tamamo..fate...all.: 862 863 864 865
tanamachi.kaoru: 866
taneshima.popura: 867
tanned.cirno: 868
taokaka: 869
tatara.kogasa: 870
tateyama.ayano: 871
tatsumaki: 872
tatsuta..kantai.collection.: 873
tedeza.rize: 874
tenryuu..kantai.collection.: 875 876
teruzuki..kantai.collection.: 877
tharja: 878
tifa.lockhart: 879
tina.branford: 880
tippy..gochiusa.: 881
tokiko..touhou.: 882
tokisaki.kurumi: 883
tokitsukaze..kantai.collection.: 884
tomoe.gozen..fate.grand.order.: 885
tomoe.hotaru: 886
tomoe.mami: 887
tone..kantai.collection.: 888
toono.akiha: 889
tooru..maidragon.: 890
toosaka.rin: 891
toramaru.shou: 892
toshinou.kyouko: 893
totoki.airi: 894
toudou.shimako: 895
toudou.yurika: 896
toujou.koneko: 897
toujou.nozomi: 898
touko..pokemon.: 899
touwa.erio: 900 901
tracer..overwatch.: 902
tsukikage.yuri: 903
tsukimiya.ayu: 904
tsukino.mito: 905
tsukino.usagi: 906
tsukumo.benben: 907
tsurumaru.kuninaga: 908
tsuruya: 909
tsushima.yoshiko: 910
u.511..kantai.collection.: 911
ujimatsu.chiya: 912
ultimate.madoka: 913
umikaze..kantai.collection.: 914
unicorn..azur.lane.: 915
unryuu..kantai.collection.: 916
urakaze..kantai.collection.: 917
uraraka.ochako: 918
usada.hikaru: 919
usami.renko: 920
usami.sumireko: 921
ushio..kantai.collection.: 922
ushiromiya.ange: 923
ushiwakamaru..fate.grand.order.: 924
uzuki..kantai.collection.: 925
vampire..azur.lane.: 926
vampy: 927
venera.sama: 928
verniy..kantai.collection.: 929 930
violet.evergarden..character.: 931
vira.lilie: 932
vita: 933
vivio: 934
wa2000..girls.frontline.: 935
wakasagihime: 936
wang.liu.mei: 937
warspite..kantai.collection.: 938 939
watarase.jun: 940 941
waver.velvet: 942
weiss.schnee: 943
white.mage: 944
widowmaker..overwatch.: 945
wo.class.aircraft.carrier: 946
wriggle.nightbug: 947
xenovia.quarta: 948
xp.tan: 949
xuanzang..fate.grand.order.: 950
yagami.hayate: 951
yagokoro.eirin: 952
yahagi..kantai.collection.: 953
yakumo.ran: 954
yakumo.yukari: 955
yamada.aoi: 956
yamada.elf: 957
yamakaze..kantai.collection.: 958
yamashiro..azur.lane.: 959
yamashiro..kantai.collection.: 960
yamato..kantai.collection.: 961 962
yang.xiao.long: 963
yasaka.kanako: 964
yayoi..kantai.collection.: 965 966
yin: 967
yoko.littner: 968 969
yorigami.shion: 970
yowane.haku: 971
yuffie.kisaragi: 972 973
yuigahama.yui: 974
yuki.miku: 975
yukikaze..kantai.collection.: 976
yukine.chris: 977
yukinoshita.yukino: 978
yukishiro.honoka: 979
yumi..senran.kagura.: 980
yuna..ff10.: 981
yuno: 982
yura..kantai.collection.: 983
yuubari..kantai.collection.: 984
yuudachi..kantai.collection.: 985
yuugumo..kantai.collection.: 986
yuuki..sao.: 987
yuuki.makoto: 988
yuuki.mikan: 989
yuzuhara.konomi: 990
yuzuki.yukari: 991
yuzuriha.inori: 992
z1.leberecht.maass..kantai.collection.: 993
z3.max.schultz..kantai.collection.: 994 995
zeta..granblue.fantasy.: 996
zooey..granblue.fantasy.: 997
zuihou..kantai.collection.: 998
zuikaku..kantai.collection.: 999

(A­side from be­ing po­ten­tially use­ful to sta­bi­lize train­ing by pro­vid­ing supervision/metadata, use of classes/categories re­duces the need for char­ac­ter-spe­cific trans­fer learn­ing for spe­cial­ized StyleGAN mod­els, since you can just gen­er­ate sam­ples from a spe­cific class. For the 256px mod­el, I pro­vide down­load­able sam­ples for each of the 1000 class­es.)
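
To sample a specific character, one needs its class ID(s) from the listing above. A minimal sketch of a lookup, assuming the listing has been saved as plain text with one `name: id [id ...]` entry per line; the function name `parse_class_index` and the inline sample are illustrative and not part of any released script:

```python
def parse_class_index(text):
    """Parse lines like 'toosaka.rin: 891' into {name: [class ids]}."""
    index = {}
    for line in text.splitlines():
        line = line.strip()
        # Split on the last ': ' so that dots inside names are left untouched.
        name, sep, ids = line.rpartition(": ")
        if not sep:
            continue  # blank or malformed line
        index[name] = [int(i) for i in ids.split()]
    return index

# Three entries copied from the listing above; some names map to several IDs.
sample = """\
tamamo..fate...all.: 862 863 864 865
toosaka.rin: 891
yoko.littner: 968 969
"""
classes = parse_class_index(sample)
print(classes["toosaka.rin"])
```

Keeping the IDs as a list per name preserves the entries that span more than one class.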

D1K Download

D1K (20GB; n = 822,842 512px JPEGs) and the por­trait-crop ver­sion, D1K-por­traits (18GB; n = 212,359) are avail­able for down­load:

rsync --verbose --recursive rsync:// ./d1k/

The JPG compression turned out to be too aggressive and resulted in noticeable artifacting, so in early 2020 I regenerated D1K from Danbooru2019 for future projects, creating D1K-2019-512px: a fresh set of top-1k solo character images, s/q Danbooru2019, no JPEG compression.

Merges of over­lap­ping char­ac­ters were again nec­es­sary; the full set of tag merges:

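# merge SRC DST: move everything from ./SRC/ into ./DST/, then remove the emptied ./SRC/ (refuses to run without exactly 2 arguments)
merge() { [ "$#" -eq 2 ] || return 1; mv ./"$1"/* ./"$2"/ && rmdir ./"$1"; }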
merge alice.margatroid..pc.98. alice.margatroid
merge artoria.pendragon..all. saber
merge artoria.pendragon..lancer. saber
merge artoria.pendragon..lancer.alter. saber
merge artoria.pendragon..swimsuit.rider.alter. saber
merge artoria.pendragon..swimsuit.ruler...fate. saber
merge atago..midsummer.march...azur.lane. atago..azur.lane.
merge bardiche fate.testarossa
merge bb..fate...all.
merge bb..fate.extra.ccc.
merge bb..swimsuit.mooncancer...fate.
merge bottle.miku hatsune.miku
merge aoki.reika
merge cure.happy hoshizora.miyuki
merge cure.march midorikawa.nao
merge cure.marine kurumi.erika
merge cure.melody houjou.hibiki
merge cure.moonlight tsukikage.yuri
merge cure.peace kise.yayoi
merge cure.peach
merge cure.sunny
merge cure.sunshine myoudouin.itsuki
merge cure.sword kenzaki.makoto
merge cure.twinkle amanogawa.kirara
merge eas higashi.setsuna
merge elizabeth.bathory..brave...fate. elizabeth.bathory..fate.
merge elizabeth.bathory..fate...all. elizabeth.bathory..fate.
merge ex.keine kamishirasawa.keine
merge frederica.bernkastel  furude.rika
merge furudo.erika furude.rika
merge graf.eisen vita
merge hatsune.miku..append. hatsune.miku
merge ishtar..fate.grand.order. ishtar..fate...all.
merge jeanne.d.arc..alter...fate. jeanne.d.arc..fate.
merge jeanne.d.arc..alter.swimsuit.berserker. jeanne.d.arc..fate.
merge jeanne.d.arc..fate...all. jeanne.d.arc..fate.
merge jeanne.d.arc..swimsuit.archer. jeanne.d.arc..fate.
merge jeanne.d.arc.alter.santa.lily jeanne.d.arc..fate.
merge kaenbyou.rin
merge kiyohime..swimsuit.lancer...fate. kiyohime..fate.grand.order.
merge konpaku.youmu..ghost. konpaku.youmu
merge kyonko kyon
merge lancer cu.chulainn..fate...all.
merge medb..fate.grand.order. medb..fate...all.
merge medjed nitocris..fate.grand.order.
merge meltryllis..swimsuit.lancer...fate. meltryllis
merge miyamoto.musashi..swimsuit.berserker...fate. miyamoto.musashi..fate.grand.order.
merge mordred..fate...all. mordred..fate.
merge mysterious.heroine.x saber
merge mysterious.heroine.x..alter. saber
merge mysterious.heroine.xx..foreigner. saber
merge nero.claudius..bride...fate. nero.claudius..fate.
merge nero.claudius..fate...all. nero.claudius..fate.
merge nero.claudius..swimsuit.caster...fate. nero.claudius..fate.
merge nitocris..swimsuit.assassin...fate. nitocris..fate.grand.order.
merge oda.nobunaga..fate...all. oda.nobunaga..fate.
merge okita.souji..alter...fate. okita.souji..fate.
merge okita.souji..fate...all. okita.souji..fate.
merge princess.of.the.crystal takakura.himari
merge princess.serenity tsukino.usagi
merge prinz.eugen..azur.lane.
merge prisma.illya illyasviel.von.einzbern
merge purple.heart neptune..neptune.series.
merge pyonta moriya.suwako
merge racing.miku hatsune.miku
merge raising.heart  takamachi.nanoha
merge reinforce.zwei reinforce
merge rensouhou.chan shimakaze..kantai.collection.
merge roll.caskett roll
merge saber.alter saber
merge saber.lily saber
merge sailor.jupiter kino.makoto
merge sailor.mars hino.rei
merge sailor.mercury mizuno.ami
merge sailor.moon tsukino.usagi
merge sailor.saturn tomoe.hotaru
merge sailor.venus aino.minako
merge sakura.miku hatsune.miku
merge scathach..fate.grand.order. scathach..fate...all.
merge scathach..swimsuit.assassin...fate. scathach..fate...all.
merge scathach.skadi..fate.grand.order. scathach..fate...all.
merge schwertkreuz yagami.hayate
merge seiren..suite.precure. kurokawa.eren
merge shanghai.doll alice.margatroid
merge shikinami.asuka.langley souryuu.asuka.langley
merge su.san medicine.melancholy
merge taihou..forbidden.feast...azur.lane. taihou..azur.lane.
merge tamamo..fate...all.
merge tanned.cirno cirno
merge ultimate.madoka kaname.madoka
merge yuki.miku hatsune.miku


rsync --verbose --recursive rsync:// ./d1k-2019-512px/

D1K BigGAN Conversion

BigGAN re­quires the dataset meta­data to be de­fined in, and then, if us­ing HDF5 archives it must be processed into a HDF5 archive, along with In­cep­tion sta­tis­tics for the pe­ri­odic test­ing (although I min­i­mize test­ing, the pre­processed sta­tis­tics are still nec­es­sary).

HDF5 is not nec­es­sary and can be omit­ted, BigGAN-Pytorch can read im­age fold­ers, if you pre­fer to avoid the has­sle.

The must be edited to add meta­data per dataset (no CLI), which looks like this to de­fine a 128px Dan­booru-1k por­trait dataset:

 # Convenience dicts
-dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
+dset_dict = {'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
              'I128': dset.ImageFolder, 'I256': dset.ImageFolder,
              'I32_hdf5': dset.ILSVRC_HDF5, 'I64_hdf5': dset.ILSVRC_HDF5,
              'I128_hdf5': dset.ILSVRC_HDF5, 'I256_hdf5': dset.ILSVRC_HDF5,
-             'C10': dset.CIFAR10, 'C100': dset.CIFAR100}
+             'C10': dset.CIFAR10, 'C100': dset.CIFAR100,
+             'D1K': dset.ImageFolder, 'D1K_hdf5': dset.ILSVRC_HDF5 }
 imsize_dict = {'I32': 32, 'I32_hdf5': 32,
                'I64': 64, 'I64_hdf5': 64,
                'I128': 128, 'I128_hdf5': 128,
                'I256': 256, 'I256_hdf5': 256,
-               'C10': 32, 'C100': 32}
+               'C10': 32, 'C100': 32,
+               'D1K': 128, 'D1K_hdf5': 128 }
 root_dict = {'I32': 'ImageNet', 'I32_hdf5': 'ILSVRC32.hdf5',
              'I64': 'ImageNet', 'I64_hdf5': 'ILSVRC64.hdf5',
              'I128': 'ImageNet', 'I128_hdf5': 'ILSVRC128.hdf5',
              'I256': 'ImageNet', 'I256_hdf5': 'ILSVRC256.hdf5',
-             'C10': 'cifar', 'C100': 'cifar'}
+             'C10': 'cifar', 'C100': 'cifar',
+             'D1K': 'characters-1k-faces', 'D1K_hdf5': 'D1K.hdf5' }
 nclass_dict = {'I32': 1000, 'I32_hdf5': 1000,
                'I64': 1000, 'I64_hdf5': 1000,
                'I128': 1000, 'I128_hdf5': 1000,
                'I256': 1000, 'I256_hdf5': 1000,
-               'C10': 10, 'C100': 100}
-# Number of classes to put per sample sheet
+               'C10': 10, 'C100': 100,
+               'D1K': 1000, 'D1K_hdf5': 1000 }
+# Number of classes to put per sample sheet
 classes_per_sheet_dict = {'I32': 50, 'I32_hdf5': 50,
                           'I64': 50, 'I64_hdf5': 50,
                           'I128': 20, 'I128_hdf5': 20,
                           'I256': 20, 'I256_hdf5': 20,
-                          'C10': 10, 'C100': 100}
+                          'C10': 10, 'C100': 100,
+                          'D1K': 1, 'D1K_hdf5': 1 }
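
Since the same dataset key has to be added to five separate dicts, it is easy to miss one. The following is a hypothetical helper (not part of BigGAN-PyTorch) sketching a consistency check; the dict contents are abbreviated stand-ins taken from the diff above, with the loader classes replaced by strings so the snippet runs on its own:

```python
# Abbreviated stand-ins for the metadata dicts edited in the diff above.
dset_dict = {'I128': 'ImageFolder', 'D1K': 'ImageFolder', 'D1K_hdf5': 'ILSVRC_HDF5'}
imsize_dict = {'I128': 128, 'D1K': 128, 'D1K_hdf5': 128}
root_dict = {'I128': 'ImageNet', 'D1K': 'characters-1k-faces', 'D1K_hdf5': 'D1K.hdf5'}
nclass_dict = {'I128': 1000, 'D1K': 1000, 'D1K_hdf5': 1000}
classes_per_sheet_dict = {'I128': 20, 'D1K': 1, 'D1K_hdf5': 1}

all_dicts = {
    'dset_dict': dset_dict,
    'imsize_dict': imsize_dict,
    'root_dict': root_dict,
    'nclass_dict': nclass_dict,
    'classes_per_sheet_dict': classes_per_sheet_dict,
}

def missing_entries(dataset, dicts):
    """Return the names of the dicts that have no entry for `dataset`."""
    return sorted(label for label, d in dicts.items() if dataset not in d)

# An empty list means the dataset key is registered everywhere.
print(missing_entries('D1K_hdf5', all_dicts))
```

Running such a check on every `--dataset` value before starting a long job would catch a forgotten entry early, rather than as a `KeyError` partway through setup.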

Each dataset ex­ists in 2 forms, as the orig­i­nal im­age folder and then as the processed HDF5:

python --dataset D1K512 --data_root /media/gwern/Data2/danbooru2018
python --parallel --dataset D1K_hdf5 --batch_size 64 \
    --data_root /media/gwern/Data2/danbooru2018
## Or ImageNet example:
python --dataset I128 --data_root /media/gwern/Data/imagenet/
python --dataset I128_hdf5 --batch_size 64 \
    --data_root /media/gwern/Data/imagenet/

The conversion will write the HDF5 to an ILSVRC*.hdf5 file, so rename it to whatever (eg D1K.hdf5).

BigGAN Training

With the HDF5 & In­cep­tion sta­tis­tics cal­cu­lat­ed, it should be pos­si­ble to run like so:

python --dataset D1K --parallel --shuffle --num_workers 4 --batch_size 32 \
    --num_G_accumulations 8 --num_D_accumulations 8  \
    --num_D_steps 1 --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 \
    --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 --adam_eps 1e-6 \
    --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 120 --shared_dim 128 \
    --G_eval_mode --G_ch 96 --D_ch 96  \
    --ema --use_ema --ema_start 20000 --test_every 2000 --save_every 1000 --num_best_copies 5 \
    --num_save_copies 2 --seed 0 --use_multiepoch_sampler --which_best FID \
    --data_root /media/gwern/Data2/danbooru2018

The architecture is specified on the command line and must be correct; examples are in the scripts/ directory. In the above example, --num_D_steps...--D_ch should be left strictly alone and the key parameters are before/after that architecture block. In this example, my 2×1080ti can support a batch size of n = 32 & the gradient accumulation overhead without OOMing. The big batches of BigGAN are implemented by --batch_size times --num_{G/D}_accumulations; I would need an accumulation of 64 to match n = 2048. In addition to that, it's important to enable EMA, which makes a truly remarkable difference in the generated sample quality (which is interesting because EMA sounds redundant with momentum/learning rates, but isn't). Without EMA, samples are low quality and change drastically at each iteration; but after a certain number of iterations, sampling is done with EMA, which averages each iteration offline (but one doesn't train using the averaged model!46). The averaging shows that collectively these iterations are similar because they are 'orbiting' around a central point, and the image quality is clearly gradually improving once EMA is turned on.
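
To make the 'orbiting' intuition concrete, here is a toy sketch of a weight EMA in plain Python. It is not BigGAN-PyTorch's implementation: the decay value, the `ema_update` name, and the simulated 'weights' (a fixed central point plus Gaussian noise) are all assumptions chosen purely for illustration.

```python
import random

def ema_update(avg, new, decay=0.999):
    """One EMA step: move each averaged weight slightly toward the new weight."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, new)]

random.seed(0)
center = [1.0, -2.0]                        # the point the iterates 'orbit'

def noisy_iterate():
    """Simulated training iterate: noisy, but centred on a fixed point."""
    return [c + random.gauss(0.0, 0.5) for c in center]

avg = noisy_iterate()                       # shadow copy, used only for sampling
raw_err, ema_err = 0.0, 0.0
steps = 5000
for _ in range(steps):
    weights = noisy_iterate()
    avg = ema_update(avg, weights)          # the training step never reads `avg`
    raw_err += sum((w - c) ** 2 for w, c in zip(weights, center))
    ema_err += sum((a - c) ** 2 for a, c in zip(avg, center))

print("mean squared distance from centre, raw iterates:", raw_err / steps)
print("mean squared distance from centre, EMA weights: ", ema_err / steps)
```

In this toy setting the averaged copy sits much closer to the central point than any individual iterate, which is the effect described above; the raw weights, not the averaged ones, are what the simulated 'training' keeps producing.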

Trans­fer learn­ing is not sup­ported na­tive­ly, but a sim­i­lar trick as with StyleGAN is fea­si­ble: just drop the pre­trained mod­els into the check­point folder and re­sume (which will work as long as the ar­chi­tec­ture is iden­ti­cal to the CLI pa­ra­me­ter­s).

The sam­ple sheet func­tion­al­ity can eas­ily over­load a GPU and OOM. In, it may be nec­es­sary to sim­ply com­ment out all of the sam­pling func­tion­al­ity start­ing with utils.sample_sheet.

The main problem running BigGAN is odd bugs in BigGAN's handling of epochs/iterations and changing gradient accumulations. With --use_multiepoch_sampler, it does complicated calculations to try to keep sampling consistent across epochs with precisely the same ordering of samples regardless of how often the BigGAN job is started/stopped (eg on a cluster), but as one increases the total minibatch size and it progresses through an epoch, it tries to index data which doesn't exist and crashes; I was unable to figure out how the calculations were going wrong, exactly.47
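
For readers unfamiliar with the goal of such a sampler, the following toy sketch shows one way to get a restart-independent sample ordering. It only illustrates the general idea; it is not the repository's code, the names `epoch_order` and `batch_at` are made up, and it is not claimed to reproduce the crash described above.

```python
import random

def epoch_order(n_items, epoch, seed=0):
    """Deterministic shuffle of range(n_items) for one epoch."""
    order = list(range(n_items))
    # Seeding from (seed, epoch) makes the order reproducible after a restart.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order

def batch_at(iteration, batch_size, n_items, seed=0):
    """Recompute, from scratch, the sample indices used at `iteration`."""
    start = iteration * batch_size              # samples consumed before this step
    orders = {}                                 # cache: epoch -> shuffled order
    indices = []
    for pos in range(start, start + batch_size):
        epoch, offset = divmod(pos, n_items)    # wrap around at epoch boundaries
        if epoch not in orders:
            orders[epoch] = epoch_order(n_items, epoch, seed)
        indices.append(orders[epoch][offset])
    return indices

# A restarted job asking for iteration 3 gets the same batch as the original job.
print(batch_at(3, batch_size=4, n_items=10))
```

Because the position is derived from iteration × batch size, changing the batch size between restarts changes which samples a given iteration number maps to, so any bookkeeping along these lines has to account for that explicitly.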

With that option disabled and larger total minibatches used, a different bug gets triggered, leading to inscrutable crashes:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "", line 228, in <module>
  File "", line 225, in main
  File "", line 172, in run
    for i, (x, y) in enumerate(pbar):
  File "/root/BigGAN-PyTorch-mooch/", line 842, in progress
    for n, item in enumerate(items):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 631, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 601, in _get_batch
    return self.data_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/lib/python3.7/", line 179, in get
  File "/opt/conda/lib/python3.7/", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/", line 274, in handler
RuntimeError: DataLoader worker (pid 21103) is killed by signal: Bus error.
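
The error text itself points at shared memory. Whether that was the actual cause here is not established, but checking `/dev/shm` is cheap. A minimal diagnostic sketch using only the standard library (`shm_report` is an illustrative name, and `/dev/shm` is assumed to be where the host mounts shared memory, as on typical Linux systems):

```python
import shutil

def shm_report(path="/dev/shm"):
    """Return (total_GiB, free_GiB) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib

try:
    total, free = shm_report()
    print(f"/dev/shm: {free:.1f} GiB free of {total:.1f} GiB")
except FileNotFoundError:
    print("/dev/shm not found (probably not a Linux host)")
```

If the job runs inside a container, the shared-memory size is usually fixed when the container is created, so a small value here would be worth ruling out before digging further; lowering `--num_workers` is another inexpensive experiment, since each DataLoader worker passes batches through shared memory.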

There is no good workaround here: start­ing with small fast mini­batches com­pro­mises fi­nal qual­i­ty, while start­ing with big slow mini­batches may work but then costs far more com­pute. I did find that the G/D ac­cu­mu­la­tions can be im­bal­anced to al­low in­creas­ing the G’s to­tal mini­batch (which ap­pears to be the key for bet­ter qual­i­ty) but then this risks desta­bi­liz­ing train­ing. These bugs need to be fixed be­fore try­ing BigGAN for re­al.

BigGAN: ImageNet→Danbooru2018-1K

In any case, I ran the 128px Im­a­geNet→­Dan­booru2018-1K for ~6 GPU-days (or ~3 days on my 2×1080ti work­sta­tion) and the train­ing mon­tage in­di­cates it was work­ing fine:

Train­ing mon­tage of the 128px Im­a­geNet→­Dan­booru2018-1K; suc­cess­ful

Some­time after that, while con­tin­u­ing to play with im­bal­anced mini­batches to avoid trig­ger­ing the iteration/crash bugs, it di­verged badly and mod­e-col­lapsed into sta­t­ic, so I killed the run, as the point ap­pears to have been made: trans­fer learn­ing is in­deed pos­si­ble, and the speed of the adap­ta­tion sug­gests ben­e­fits to train­ing time by start­ing with a high­ly-trained model al­ready.

BigGAN: 256px Danbooru2018-1K

More se­ri­ous­ly, I be­gan train­ing a 256px model on Dan­booru2018-1K por­traits. This re­quired re­build­ing the HDF5 with 256px set­tings, and since I was­n’t do­ing trans­fer learn­ing, I used the BigGAN-deep ar­chi­tec­ture set­tings since that has bet­ter re­sults & is smaller than the orig­i­nal BigGAN.

My own 2×1080ti were inadequate for reasonable turnaround on training a 256px BigGAN from scratch (they would take something like 4+ months wallclock), so I decided to shell out for a big cloud instance. AWS/GCP are too expensive, so I used this to investigate as an alternative: they typically have much lower prices. Setup was straightforward, and I found a nice instance: an 8×2080ti machine available for just $1.7/hour (AWS, for comparison, would charge closer to $2.16/hour for just 8 K80 halves). So I ran their 8×2080ti instance from 2019-05-02 to 2019-06-03 ($1.7/hour; total: $1373.64).

That is ~250 GPU-days of training, although this is a misleading way to put it: the bill includes bandwidth/hard-drive in that total, and the GPU utilization was poor, so each 'GPU-day' is worth about a third less than with the 128px BigGAN, which had good GPU utilization; the 2080tis were overkill. It should be possible to do much better with the same budget in the future.
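
The GPU-day figure can be sanity-checked in two rough ways from the numbers given above. Both are only bounds on the real figure, since the bill includes non-GPU charges and the run was interrupted by crashes; the variable names are illustrative:

```python
from datetime import date

hourly_rate = 1.7          # $/hour for the 8×2080ti instance (from the text)
total_bill = 1373.64       # $ total, which also includes bandwidth/storage
n_gpus = 8

# Upper bound from the bill: treat every dollar as GPU rental time.
hours_billed = total_bill / hourly_rate
gpu_days_from_bill = hours_billed / 24 * n_gpus

# Calendar span of the rental, 2019-05-02 through 2019-06-03.
calendar_days = (date(2019, 6, 3) - date(2019, 5, 2)).days
gpu_days_from_calendar = calendar_days * n_gpus

print(f"from bill:     {gpu_days_from_bill:.0f} GPU-days")
print(f"from calendar: {gpu_days_from_calendar} GPU-days")
```

Both estimates land somewhat above 250, which is consistent with the rounded figure once storage/bandwidth charges and downtime are subtracted.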

The train­ing com­mand:

python --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
    --batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 --G_lr 1e-4 \
    --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 --D_ch 128 \
    --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 --BN_eps 1e-5 \
    --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier --dim_z 64 \
    --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 --sv_log_interval 1000 \
    --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 --no_fid \
    --num_inception_images 1 --augment --data_root ~/tmp --resume --experiment_name \

The system worked well but BigGAN turns out to have serious performance bottlenecks (apparently in synchronizing batchnorm across GPUs) and did not make good use of the 8 GPUs, averaging GPU utilization ~30% according to nvidia-smi. (On my 2×1080tis with the 128px, GPU-utilization was closer to 95%.) In retrospect, I probably should've switched to a less expensive instance like an 8×1080ti where it likely would've had similar throughput but cost less.

Train­ing pro­gressed well up un­til it­er­a­tions #80–90k, when I be­gan see­ing signs of mode col­lapse:

Train­ing mon­tage of the 256px Dan­booru2018-1K; semi­-suc­cess­ful (note when EMA be­gins to be used for sam­pling im­ages at ~8s, and the mode col­lapse at the end)

I was un­able to in­crease the mini­batch to more than ~500 be­cause of the bugs, lim­it­ing what I could do against mode col­lapse, and I sus­pect the small mini­batch was why mode col­lapse was hap­pen­ing in the first place. (Gokaslan tried the last check­point I saved—#95,160—with the same set­tings, and ran it to #100,000 it­er­a­tions and ex­pe­ri­enced near-to­tal mode col­lapse.)
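
The minibatch arithmetic behind these limits is simple, per the earlier description of total minibatch as `--batch_size` times `--num_{G/D}_accumulations`. A small sketch with the values from the two training commands (the helper names are illustrative):

```python
import math

def total_minibatch(batch_size, accumulations):
    """Effective minibatch when gradients are accumulated before each step."""
    return batch_size * accumulations

def accumulations_needed(batch_size, target):
    """Smallest accumulation count whose effective minibatch reaches `target`."""
    return math.ceil(target / batch_size)

print(total_minibatch(32, 8))             # 128px run: --batch_size 32, 8 accumulations
print(total_minibatch(56, 8))             # 256px run: --batch_size 56, 8 accumulations
print(accumulations_needed(32, 2048))     # accumulations to reach n = 2048 at batch 32
print(accumulations_needed(56, 2048))     # accumulations to reach n = 2048 at batch 56
```

The 256px settings give a total of 448, in line with the '~500' ceiling mentioned above, and reaching n = 2048 would take roughly 4 to 5 times more accumulation than was used.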

The last check­point I saved from be­fore mode col­lapse was #83,520, saved on 2019-05-28 after ~24 wall­clock days (ac­count­ing for var­i­ous crashes & time set­ting up & tweak­ing).

Ran­dom sam­ples, in­ter­po­la­tion grids (not videos), and class-con­di­tional sam­ples can be gen­er­ated us­ing; like, it re­quires the ex­act ar­chi­tec­ture to be spec­i­fied. I used the fol­low­ing com­mand (many of the op­tions are prob­a­bly not nec­es­sary, but I did­n’t know which):

python --model BigGANdeep --dataset D1K_hdf5 --parallel --shuffle --num_workers 16 \
    --batch_size 56 --num_G_accumulations 8 --num_D_accumulations 8 --num_D_steps 1 \
    --G_lr 1e-4 --D_lr 4e-4 --D_B2 0.999 --G_B2 0.999 --G_attn 64 --D_attn 64 --G_ch 128 \
    --D_ch 128 --G_depth 2 --D_depth 2 --G_nl inplace_relu --D_nl inplace_relu --SN_eps 1e-6 \
    --BN_eps 1e-5 --adam_eps 1e-6 --G_ortho 0.0 --G_shared --G_init ortho --D_init ortho --hier \
    --dim_z 64 --shared_dim 64 --ema --use_ema --G_eval_mode --test_every 200000 \
    --sv_log_interval 1000 --save_every 90 --num_best_copies 1 --num_save_copies 1 --seed 0 \
    --no_fid --num_inception_images 1 --skip_init --G_batch_size 32  --use_ema --G_eval_mode \
    --sample_random --sample_sheets --sample_interps --resume --experiment_name 256px

Ran­dom sam­ples are al­ready well-rep­re­sented by the train­ing mon­tage. The in­ter­po­la­tions look sim­i­lar to StyleGAN in­ter­po­la­tions. The class-con­di­tional sam­ples are the most fun to look at be­cause one can look at spe­cific char­ac­ters with­out the need to re­train the en­tire mod­el, which while only tak­ing a few hours at most, is a has­sle.

256px Danbooru2018-1K Samples

In­ter­po­la­tion im­ages and 5 char­ac­ter-spe­cific ran­dom sam­ples (A­suka, Holo, Rin, Chen, Ruri):

Ran­dom in­ter­po­la­tion sam­ples (256px BigGAN trained on 1000 Dan­booru2018 char­ac­ter por­traits)
Souryuu Asuka Lan­g­ley (Neon Gen­e­sis Evan­ge­lion), class #825 ran­dom sam­ples
Holo (Spice and Wolf), class #273 ran­dom sam­ples
Rin Tohsaka (Fate/Stay Night), class #891 random samples
Yakumo Chen (Touhou), class #123 ran­dom sam­ples
Ruri Hoshino (Mar­t­ian Suc­ces­sor Nades­ico), class #286 ran­dom sam­ples

256px BigGAN Downloads

Model & sam­ple down­loads:


Sar­cas­tic com­men­tary on BigGAN qual­ity by /u/Klockbox

The best re­sults from the 128px BigGAN model look about as good as could be ex­pected from 128px sam­ples; the 256px model is fairly good, but suffers from much more no­tice­able ar­ti­fact­ing than 512px StyleGAN, and cost $1373 (a 256px StyleGAN would have been closer to $400 on AWS). In BigGAN’s de­fense, it had clearly not con­verged yet and could have ben­e­fited from much more train­ing and much larger mini­batch­es, had that been pos­si­ble. Qual­i­ta­tive­ly, look­ing at the more com­plex el­e­ments of sam­ples, like hair ornaments/hats, I feel like BigGAN was do­ing a much bet­ter job of cop­ing with com­plex­ity & fine de­tail than StyleGAN would have at a sim­i­lar point.

How­ev­er, train­ing 512px por­traits or whole-Dan­booru im­ages is in­fea­si­ble at this point: while the cost might be only a few thou­sand dol­lars, the var­i­ous bugs mean that it may not be pos­si­ble to sta­bly train to a use­ful qual­i­ty. It’s a dilem­ma: at small or easy do­mains, StyleGAN is much faster (if not bet­ter); but at large or hard do­mains, mode col­lapse is too risky and en­dan­gers the big in­vest­ment nec­es­sary to sur­pass StyleGAN.

To make BigGAN vi­able, it needs at least:

  • mini­batch size bugs fixed to en­able up to n = 2048 (or larg­er, as gra­di­ent noise scale in­di­cates)
  • 512px architectures defined, to allow transfer learning from the released TensorFlow 512px ImageNet model
  • op­ti­miza­tion work to re­duce over­head and al­low rea­son­able GPU uti­liza­tion on >2-GPU sys­tems

With those done, it should be pos­si­ble to train 512px por­traits for <$1,000 and whole-Dan­booru im­ages for <$10,000. (Given the re­lease of Deep­Dan­booru as a Ten­sor­Flow mod­el, en­abling an ani­me-spe­cific per­cep­tual loss, it would also be in­ter­est­ing to in­ves­ti­gate ap­ply­ing pre­train­ing to BigGAN.)

See Also


For com­par­i­son, here are some of my older GAN or other NN at­tempts; as the qual­ity is worse than StyleGAN, I won’t bother go­ing into de­tail­s—cre­at­ing the datasets & train­ing the ProGAN & tun­ing & trans­fer­-learn­ing were all much the same as al­ready out­lined at length for the StyleGAN re­sults.

In­cluded are:

  • ProGAN

  • Glow


  • PokeGAN

  • Self-Attention-GAN-TensorFlow

  • VGAN

  • BigGAN un­offi­cial (offi­cial BigGAN is cov­ered above)

    • BigGAN-TensorFlow
    • BigGAN-PyTorch
  • GAN-QP

  • WGAN

  • IntroVAE


ProGAN

Using official implementation:

  1. 2018-09-08, 512–1024px whole-A­suka im­ages ProGAN sam­ples:

    1024px, whole-A­suka im­ages, ProGAN
    512px whole-A­suka im­ages, ProGAN
  2. 2018-09-18, 512px Asuka faces, ProGAN sam­ples:

    512px Asuka faces, ProGAN
  3. 2018-10-29, 512px Holo faces, ProGAN:

    Ran­dom sam­ples of 512px ProGAN Holo faces

    After gen­er­at­ing ~1k Holo faces, I se­lected the top decile (n = 103) of the faces (Imgur mir­ror):

    512px ProGAN Holo faces, ran­dom sam­ples from top decile (6×6)

    The top decile im­ages are, nev­er­the­less, show­ing dis­tinct signs of both ar­ti­fact­ing & overfitting/memorization of data points. An­other 2 weeks proved this out fur­ther:

    ProGAN sam­ples of 512px Holo faces, after badly over­fit­ting (it­er­a­tion #10,325)

    In­ter­po­la­tion video of the Oc­to­ber 2018 512px Holo face ProGAN; note the gross over­fit­ting in­di­cated by the abrupt­ness of the in­ter­po­la­tions jump­ing from face (mode) to face (mode) and lack of mean­ing­ful in­ter­me­di­ate faces in ad­di­tion to the over­all blur­ri­ness & low vi­sual qual­i­ty.

  4. 2019-01-17, Dan­booru2017 512px SFW im­ages, ProGAN:

    512px SFW Dan­booru2017, ProGAN
  5. 2019-02-05 (stopped in or­der to train with the new StyleGAN code­base), the 512px anime face dataset used else­where, ProGAN:

    512px anime faces, ProGAN

    Interpolation video of the 2019-02-05 512px anime face ProGAN; while the image quality is low, the diversity is good & shows no overfitting/memorization or blatant mode collapse



Glow

Used Glow's official implementation.

Due to the enormous model size (4.2GB), I had to modify Glow's settings to get training working reasonably well, after extensive tinkering to figure out what any of them meant:

{"verbose": true, "restore_path": "logs/model_4.ckpt", "inference": false, "logdir": "./logs", "problem": "asuka",
"category": "", "data_dir": "../glow/data/asuka/", "dal": 2, "fmap": 1, "pmap": 16, "n_train": 20000, "n_test": 1000,
"n_batch_train": 16, "n_batch_test": 50, "n_batch_init": 16, "optimizer": "adamax", "lr": 0.0005, "beta1": 0.9,
"polyak_epochs": 1, "weight_decay": 1.0, "epochs": 1000000, "epochs_warmup": 10, "epochs_full_valid": 3,
"gradient_checkpointing": 1, "image_size": 512, "anchor_size": 128, "width": 512, "depth": 13, "weight_y": 0.0,
"n_bits_x": 8, "n_levels": 7, "n_sample": 16, "epochs_full_sample": 5, "learntop": false, "ycond": false, "seed": 0,
"flow_permutation": 2, "flow_coupling": 1, "n_y": 1, "rnd_crop": false, "local_batch_train": 1, "local_batch_test": 1,
"local_batch_init": 1, "direct_iterator": true, "train_its": 1250, "test_its": 63, "full_test_its": 1000, "n_bins": 256.0, "top_shape": [4, 4, 768]}
{"epoch": 5, "n_processed": 100000, "n_images": 6250, "train_time": 14496, "loss": "2.0090", "bits_x": "2.0090", "bits_y": "0.0000", "pred_loss": "1.0000"}

An additional challenge was numerical instability in the inversion of matrices, giving rise to many 'invertibility' crashes.

Fi­nal sam­ple be­fore I looked up the com­pute re­quire­ments more care­fully & gave up on Glow:

Glow, Asuka faces, 5 epochs (2018-08-02)


offi­cial im­ple­men­ta­tion:

2018-12-15, 512px Asuka faces, fail­ure case


PokeGAN

nshepperd's (unpublished) multi-scale GAN with self-attention layers, spectral normalization, and a few other tweaks:

PokeGAN, Asuka faces, 2018-11-16


Self-Attention-GAN-TensorFlow

SAGAN did not have an official implementation released at the time so I used the Junho Kim implementation; 128px SAGAN, WGAN-LP loss, on Asuka faces & whole Asuka images:

Self-Attention-GAN-TensorFlow, whole Asuka, 2018-08-18
Training montage of the 2018-08-18 128px whole-Asuka SAGAN; possibly too-high LR
Self-Attention-GAN-TensorFlow, Asuka faces, 2018-09-13


VGAN

The official VGAN code for Peng et al 2018 had not been released when I began trying VGAN, so I used akanimax's implementation.

The vari­a­tional dis­crim­i­na­tor bot­tle­neck, along with self­-at­ten­tion lay­ers and pro­gres­sive grow­ing, is one of the few strate­gies which per­mit 512px im­ages, and I was in­trigued to see that it worked rel­a­tively well, al­though I ran into per­sis­tent is­sues with in­sta­bil­ity & mode col­lapse. I sus­pect that VGAN could’ve worked bet­ter than it did with some more work.

akan­i­max VGAN, anime faces, 2018-12-25

BigGAN unofficial

BigGAN's official implementation & models were not released until late March 2019 (nor the semi-official compare_gan implementation until February 2019), and I experimented with 2 unofficial implementations in late 2018–early 2019.


BigGAN-TensorFlow

Junho Kim implementation; 128px spectral norm hinge loss, anime faces:

Kim BigGAN-TensorFlow, anime faces, 2019-01-17

This one never worked well at all, and I am still puz­zled what went wrong.


BigGAN-PyTorch

Aaron Leong's PyTorch BigGAN implementation (not the official BigGAN implementation). As it's class-conditional, I faked having 1000 classes by constructing a variant anime face dataset: taking the top 1000 characters by tag count in the Danbooru2017 metadata, I then filtered for those character tags 1 by 1, and copied them & cropped faces into matching subdirectories 1–1000. This let me try out both faces & whole images. I also attempted to hack in gradient accumulation for big minibatches to make it a true BigGAN implementation, but it didn't help too much; the problem here might simply have been that I couldn't run it long enough.
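
The class-directory construction can be sketched as follows. This is an illustration under assumptions, not the script actually used: the metadata is reduced to a toy dict of image ID → character tags, the function names are made up, and the copying & face-cropping steps are left out.

```python
from collections import Counter

def assign_class_dirs(image_tags, n_classes=1000):
    """Map the most frequent character tags to subdirectory names '1'..'n'."""
    counts = Counter(tag for tags in image_tags.values() for tag in tags)
    top = [tag for tag, _ in counts.most_common(n_classes)]
    return {tag: str(i) for i, tag in enumerate(top, start=1)}

def plan_copies(image_tags, class_dirs):
    """Yield (image_id, subdirectory) pairs for images with a top-N character."""
    for image_id, tags in sorted(image_tags.items()):
        for tag in tags:
            if tag in class_dirs:
                yield image_id, class_dirs[tag]

# Toy metadata: image id -> character tags (real metadata has far more fields).
toy = {
    "1.jpg": ["holo"],
    "2.jpg": ["holo"],
    "3.jpg": ["souryuu.asuka.langley"],
    "4.jpg": ["holo", "souryuu.asuka.langley"],
    "5.jpg": ["chen"],
}
dirs = assign_class_dirs(toy, n_classes=2)
plan = list(plan_copies(toy, dirs))
print(dirs)
print(plan)
```

Note that an image tagged with two top-N characters lands in two subdirectories in this sketch; restricting to solo images, as done for D1K, avoids that ambiguity.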

Re­sults upon aban­don­ing:

Leong BigGAN-PyTorch, 1000-class anime char­ac­ter dataset, 2018-11-30 (#314,000)
Leong BigGAN-PyTorch, 1000-class anime face dataset, 2018-12-24 (#1,006,320)


GAN-QP

Implementation of :

GAN-QP, 512px Asuka faces, 2018-11-21

Train­ing os­cil­lated enor­mous­ly, with all the sam­ples closely linked and chang­ing si­mul­ta­ne­ous­ly. This was de­spite the check­point model be­ing enor­mous (551MB) and I am sus­pi­cious that some­thing was se­ri­ously wrong—ei­ther the model ar­chi­tec­ture was wrong (too many lay­ers or fil­ter­s?) or the learn­ing rate was many or­ders of mag­ni­tude too large. Be­cause of the small mini­batch, progress was diffi­cult to make in a rea­son­able amount of wall­clock time, so I moved on.


WGAN

official implementation; I did most of the early anime face work with WGAN on a different machine and didn't keep copies. However, a sample from a short run gives an idea of what WGAN tended to look like on anime runs:

WGAN, 256px Asuka faces, it­er­a­tion 2100


IntroVAE

A hybrid GAN-VAE architecture introduced in mid-2018 by Huang et al 2018, with the official PyTorch implementation released in April 2019, IntroVAE attempts to reuse the encoder-decoder for an adversarial loss as well, to combine the best of both worlds: the principled stable training & reversible encoder of the VAE with the sharpness & high quality of a GAN.

Qual­i­ty-wise, they show IntroVAE works on CelebA & LSUN BEDROOM at up to 1024px res­o­lu­tion with re­sults they claim are com­pa­ra­ble to ProGAN. Per­for­mance-wise, for 512px, they give a run­time of 7 days with a mini­batch n = 12, or pre­sum­ably 4 GPUs (s­ince their 1024px run script im­plies they used 4 GPUs and I can fit a mini­batch of n = 4 onto 1×1080ti, so 4 GPUs would be con­sis­tent with n = 12), and so 28 GPU-days.

I adapted the 256px sug­gested set­tings for my 512px anime por­traits dataset:

python --hdim=512 --output_height=512 --channels='32, 64, 128, 256, 512, 512, 512' --m_plus=120 \
    --weight_rec=0.05 --weight_kl=1.0 --weight_neg=0.5 --num_vae=0 \
    --dataroot=/media/gwern/Data2/danbooru2018/portrait/1/ --trainsize=302652 --test_iter=1000 --save_iter=1 \
    --start_epoch=0 --batchSize=4 --nrow=8 --lr_e=0.0001 --lr_g=0.0001 --cuda --nEpochs=500
# ...====> Cur_iter: [187060]: Epoch [3] (5467/60531): time: 142675: Rec: 19569, Kl_E: 162, 151, 121, Kl_G: 151, 121,

There was a mi­nor bug in the code­base where it would crash on try­ing to print out the log data, per­haps be­cause it as­sumes multi-GPU and I was run­ning on 1 GPU, and was try­ing to in­dex into an ar­ray which was ac­tu­ally a sim­ple scalar, which I fixed by re­mov­ing the in­dex­ing:

-        info += 'Rec: {:.4f}, '.format([0])
-        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format([0],
-                      [0],[0])
-        info += 'Kl_G: {:.4f}, {:.4f}, '.format([0],[0])
+        info += 'Rec: {:.4f}, '.format(
+        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(,
+                      ,
+        info += 'Kl_G: {:.4f}, {:.4f}, '.format(,
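
The general shape of that kind of fix, tolerating both a 1-element array and a bare scalar, can be sketched with a small hypothetical helper. It is written in plain Python so that it runs without PyTorch; `as_float` is an illustrative name, and its behaviour on actual tensors is assumed rather than tested here.

```python
def as_float(value):
    """Return a Python float from either a scalar or a 1-element sequence."""
    try:
        return float(value[0])      # array-like: take the first element
    except (TypeError, IndexError):
        return float(value)         # already a scalar, so indexing is not possible

# Both forms format identically in the log line.
print('Rec: {:.4f}, '.format(as_float(19569.0)))
print('Rec: {:.4f}, '.format(as_float([19569.0])))
```

Simply removing the indexing, as in the diff above, is the smaller change when only the single-GPU case matters.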

Sam­ple re­sults after ~1.7 GPU-days:

IntroVAE, 512px anime por­trait (n = 4, 3 sets: real dat­a­points, en­cod­ed→de­coded ver­sions of the real dat­a­points, and ran­dom gen­er­ated sam­ples)

By this point, StyleGAN would have been generating recognizable faces from scratch, while the IntroVAE random samples are not even face-like, and the IntroVAE training curve was not improving at a notable rate. IntroVAE has some hyperparameters which could probably be tuned better for the anime portrait faces (they briefly discuss the use of the --num_vae option to run in classic VAE mode to let you tune the VAE-related hyperparameters before enabling the GAN-like part), but it should be fairly insensitive overall to hyperparameters, so tuning is unlikely to help all that much. So IntroVAE probably can’t replace StyleGAN (yet?) for general-purpose image synthesis. This demonstrates again that seemingly everything works on CelebA these days, and that just because something works on a photographic dataset does not mean it’ll work on other datasets. Image generation papers should probably branch out some more and consider non-photographic tests.

  1. Turns out that when train­ing goes re­ally wrong, you can crash many GAN im­ple­men­ta­tions with ei­ther a seg­fault, in­te­ger over­flow, or di­vi­sion by zero er­ror.↩︎

  2. StackGAN/StackGAN++/PixelCNN et al are diffi­cult to run as they re­quire a unique im­age em­bed­ding which could only be com­puted in the un­main­tained Torch frame­work us­ing Reed’s prior work on a joint tex­t+im­age em­bed­ding which how­ever does­n’t run on any­thing but the Birds & Flow­ers datasets, and so no one has ever, as far as I am aware, run those im­ple­men­ta­tions on any­thing else—cer­tainly I never man­aged to de­spite quite a few hours try­ing to re­verse-engi­neer the em­bed­ding & var­i­ous im­ple­men­ta­tions.↩︎

  3. Be sure to check out .↩︎

  4. Glow’s re­ported re­sults re­quired >40 GPU-weeks; BigGAN’s to­tal com­pute is un­clear as it was trained on a TPUv3 Google clus­ter but it would ap­pear that a 128px BigGAN might be ~4 GPU-months as­sum­ing hard­ware like an 8-GPU ma­chine, 256px ~8 GPU-months, and 512px ≫8 GPU-months, with VRAM be­ing the main lim­it­ing fac­tor for larger mod­els (although pro­gres­sive grow­ing might be able to cut those es­ti­mates).↩︎

  5. is an old & small CNN trained to pre­dict a few -booru tags on anime im­ages, and so pro­vides an em­bed­ding—but not a good one. The lack of a good em­bed­ding is the ma­jor lim­i­ta­tion for anime deep learn­ing as of Feb­ru­ary 2019. (Deep­Dan­booru, while per­form­ing well ap­par­ent­ly, has not yet been used for em­bed­dings.) An em­bed­ding is nec­es­sary for tex­t→im­age GANs, im­age searches & near­est-neigh­bor checks of over­fit­ting, FID er­rors for ob­jec­tively com­par­ing GANs, mini­batch dis­crim­i­na­tion to help the D/provide an aux­il­iary loss to sta­bi­lize learn­ing, anime style trans­fer (both for its own sake & for cre­at­ing a ‘StyleDan­booru2018’ to re­duce tex­ture cheat­ing), en­cod­ing into GAN la­tent spaces for ma­nip­u­la­tion, data clean­ing (to de­tect anom­alous dat­a­points like failed face crop­s), per­cep­tual losses for en­coders or as an ad­di­tional aux­il­iary loss/pretraining (like , which trains a Gen­er­a­tor on a per­cep­tual loss and does GAN train­ing only for fine­tun­ing) etc. A good tag­ger is also a good start­ing point for do­ing pix­el-level se­man­tic seg­men­ta­tion (via “weak su­per­vi­sion”), which meta­data is key for train­ing some­thing like Nvidi­a’s GauGAN suc­ces­sor to pix2pix (; source).↩︎

  6. Tech­ni­cal note: I typ­i­cally train NNs us­ing my work­sta­tion with 2×1080ti GPUs. For eas­ier com­par­ison, I con­vert all my times to single-GPU equiv­a­lent (ie “6 GPU-weeks” means 3 realtime/wallclock weeks on my 2 GPUs).↩︎

  7. ob­serves (§4 “Us­ing pre­ci­sion and re­call to an­a­lyze and im­prove StyleGAN”) that StyleGAN with pro­gres­sive grow­ing dis­abled does work but at some cost to precision/recall qual­ity met­rics; whether this re­flects in­fe­rior per­for­mance on a given train­ing bud­get or an in­her­ent limit—BigGAN and other self­-at­ten­tion-us­ing GANs do not use pro­gres­sive grow­ing at all, sug­gest­ing it is not truly nec­es­sary—is not in­ves­ti­gat­ed. In De­cem­ber 2019, StyleGAN 2 suc­cess­fully dropped pro­gres­sive grow­ing en­tirely at mod­est per­for­mance cost.↩︎

  8. This has con­fused some peo­ple, so to clar­ify the se­quence of events: I trained my anime face StyleGAN and posted notes on Twit­ter, re­leas­ing an early mod­el; road­run­ner01 gen­er­ated an in­ter­po­la­tion video us­ing said model (but a differ­ent ran­dom seed, of course); this in­ter­po­la­tion video was retweeted by the Japan­ese Twit­ter user _Ry­obot, upon which it went vi­ral and was ‘liked’ by Elon Musk, fur­ther dri­ving vi­ral­ity (19k re­shares, 65k likes, 1.29m watches as of 2019-03-22).↩︎

  9. Google Colab is a free service which includes free GPU time (up to 12 hours on a small GPU). Especially for people who do not have a reasonably capable GPU on their personal computers (such as all Apple users) or do not want to engage in the admitted hassle of renting a real cloud GPU instance, Colab can be a great way to play with a pretrained model, like generating GPT-2-117M text completions or StyleGAN interpolation videos, or prototype on tiny problems.

    However, it is a bad idea to try to train real models, like 512–1024px StyleGANs, on a Colab instance: the GPUs are low-VRAM & far slower (6 hours per StyleGAN tick!), and the instance is unwieldy to work with (as one must save snapshots constantly to restart when the session runs out), doesn’t have a real command-line, etc. Colab is just barely adequate for perhaps 1 or 2 ticks of transfer learning, but not more. If you harbor greater ambitions but still refuse to spend any money (rather than time), Kaggle has a similar service with P100 GPU slices rather than K80s. Otherwise, one needs to get access to real GPUs.↩︎
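Because a Colab session can end at any moment, restarting means first locating the most recent saved snapshot. A minimal sketch, assuming snapshots are written as `network-snapshot-NNNNNN.pkl` files somewhere under a results directory (which appears to be the StyleGAN default naming); the function name `latest_snapshot` and the directory layout are hypothetical:

```python
import re
from pathlib import Path


def latest_snapshot(results_dir):
    """Return the snapshot file with the highest counter under results_dir, or None."""
    pattern = re.compile(r'network-snapshot-(\d+)\.pkl$')
    best_path, best_count = None, -1
    # Search all run subdirectories; the layout is an assumption.
    for path in Path(results_dir).rglob('network-snapshot-*.pkl'):
        match = pattern.search(path.name)
        if match and int(match.group(1)) > best_count:
            best_path, best_count = path, int(match.group(1))
    return best_path
```

The returned path can then be handed to whatever resume option the training script exposes; copying snapshots to persistent storage is still necessary, since the session's local disk is not kept.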

  10. Cu­ri­ous­ly, the ben­e­fit of many more FC lay­ers than usual may have been stum­bled across be­fore: IllustrationGAN found that adding some FC lay­ers seemed to help their DCGAN gen­er­ate anime faces, and when I & Feep­ingCrea­ture ex­per­i­mented with adding 2–4 FC lay­ers to WGAN-GP along IllustrationGAN’s lines, it did help our lack­lus­ter re­sults, and at the time I spec­u­lated that “the ful­ly-con­nected lay­ers are trans­form­ing the la­tent-z/noise into a sort of global tem­plate which the sub­se­quent con­vo­lu­tion lay­ers can then fill in more lo­cal­ly.” But we never dreamed of go­ing as deep as 8!↩︎

  11. The ProGAN/StyleGAN code­base re­port­edly does work with con­di­tion­ing, but none of the pa­pers re­port on this func­tion­al­ity and I have not used it my­self.↩︎

  12. The latent embedding z is usually generated in about the simplest possible way: draws from the Normal distribution, N(0, 1). A uniform distribution is sometimes used instead. There is no good justification for this and some reason to think this can be bad (how does a GAN easily map a discrete or binary latent factor, such as the presence or absence of the left ear, onto a Normal variable?).

    The BigGAN paper explores alternatives, finding improvements in training time and/or final quality from using instead (in ascending order): a Normal + binary Bernoulli (p = 0.5; personal communication, Brock) variable, a binary (Bernoulli), and a rectified Gaussian (sometimes called a “censored normal” even though that sounds like a truncated normal rather than the rectified one). The rectified Gaussian distribution “outperforms [the Normal] (in terms of IS) by 15–20% and tends to require fewer iterations.”

    The downside is that the “truncation trick”, which yields even larger average improvements in image quality (at the expense of diversity), doesn’t quite apply, and the rectified Gaussian sans truncation produced similar results to the Normal+truncation, so BigGAN reverted to the default Normal distribution+truncation (personal communication).

    The truncation trick either directly applies to some of the other distributions, particularly the rectified Gaussian, or could easily be adapted, possibly yielding an improvement over either approach. The rectified Gaussian can be truncated just like the default Normals can. And for the Bernoulli, one could decrease p during the generation, or what is probably equivalent, re-sample whenever the variance (ie squared sum) of all the Bernoulli latent variables exceeds a certain constant. (With p = 0.5, a latent vector of 512 Bernoullis would on average sum up to simply 0.5 × 512 = 256, with the 2.5%–97.5% quantiles being 234–278, so a ‘truncation trick’ here might be throwing out every vector with a sum above, say, the 80% quantile of 266.)

    One also won­ders about vec­tors which draw from mul­ti­ple dis­tri­b­u­tions rather than just one. Could the StyleGAN 8-FC-layer learned-la­ten­t-vari­able be re­verse-engi­neered? Per­haps the first layer or two merely con­verts the nor­mal in­put into a more use­ful dis­tri­b­u­tion & parameters/training could be saved or in­sight gained by im­i­tat­ing that.↩︎
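The Bernoulli ‘truncation trick’ proposed in footnote 12 can be sketched in a few lines: compute the quantiles of the sum of 512 fair Bernoulli variables exactly from the binomial distribution, then reject any sampled latent vector whose sum exceeds the chosen cutoff. This is only an illustration of the idea as described, not something taken from BigGAN or StyleGAN; the function names are hypothetical:

```python
import math
import random


def binomial_quantile(n, p, q):
    """Smallest k with P(X <= k) >= q, for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.comb(n, k) * p**k * (1 - p)**(n - k)
        if cumulative >= q:
            return k
    return n


def truncated_bernoulli_latent(dim, p, max_sum, rng):
    """Rejection-sample a 0/1 latent vector whose sum does not exceed max_sum."""
    while True:
        z = [1 if rng.random() < p else 0 for _ in range(dim)]
        if sum(z) <= max_sum:
            return z


dim, p = 512, 0.5
mean = dim * p                                   # expected sum of the latent vector
low = binomial_quantile(dim, p, 0.025)           # lower end of the central 95% range
high = binomial_quantile(dim, p, 0.975)          # upper end of the central 95% range
cutoff = binomial_quantile(dim, p, 0.80)         # example truncation threshold
z = truncated_bernoulli_latent(dim, p, cutoff, random.Random(0))
print(mean, low, high, cutoff, sum(z))
```

With an 80% cutoff, roughly one draw in five is rejected, so the resampling cost is small; a stricter cutoff trades more rejections for less diversity, as with the usual truncation trick.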

  13. Which raises the ques­tion: if you added any or all of those fea­tures, would StyleGAN be­come that much bet­ter? Un­for­tu­nate­ly, while the­o­rists & prac­ti­tion­ers have had many ideas, so far the­ory has proven more fe­cund than fa­tidi­cal and the large-s­cale GAN ex­per­i­ments nec­es­sary to truly test the sug­ges­tions are too ex­pen­sive for most. Half of these sug­ges­tions are great ideas—but which half?↩︎

  14. For more on the choice of con­vo­lu­tion layers/kernel sizes, see Karpa­thy’s 2015 notes for “CS231n: Con­vo­lu­tional Neural Net­works for Vi­sual Recog­ni­tion”, or take a look at these Con­vo­lu­tion an­i­ma­tions & Yang’s in­ter­ac­tive “Con­vo­lu­tion Vi­su­al­izer”.↩︎

  15. These observations apply only to the Generator in GANs (which is what we primarily care about); curiously, there’s some reason to think that GAN Discriminators are in fact mostly memorizing (see later).↩︎

  16. A pos­si­ble al­ter­na­tive is ESRGAN ().↩︎

  17. Based on eye­balling the ‘cat’ bar graph in Fig­ure 3 of .↩︎

  18. CATS offer an amus­ing in­stance of the dan­gers of data aug­men­ta­tion: ProGAN used hor­i­zon­tal flipping/mirroring for every­thing, be­cause why not? This led to strange Cyril­lic text cap­tions show­ing up in the gen­er­ated cat im­ages. Why not Latin al­pha­bet cap­tions? Be­cause every cat im­age was be­ing shown mir­rored as well as nor­mal­ly! For StyleGAN, mir­ror­ing was dis­abled, so now the lol­cat cap­tions are rec­og­niz­ably Latin al­pha­bet­i­cal, and even al­most Eng­lish words. This demon­strates that even datasets where left/right does­n’t seem to mat­ter, like cat pho­tos, can sur­prise you.↩︎

  19. I es­ti­mated the to­tal cost us­ing AWS EC2 pre­emptible hourly costs on 2019-03-15 as fol­lows:

    • 1 GPU: p2.xlarge in­stance in us-east-2a, Half of a K80 (12GB VRAM): $0.3235/hour
    • 2 GPUs: NA—there is no P2 in­stance with 2 GPUs, only 1/8/16
    • 8 GPUs: p2.8xlarge in us-east-2a, 8 halves of K80s (12GB VRAM each): $2.160/hour

    As usu­al, there is sub­lin­ear scal­ing, and larger in­stances cost dis­pro­por­tion­ately more, be­cause one is pay­ing for faster wall­clock train­ing (time is valu­able) and for not hav­ing to cre­ate a dis­trib­uted in­fra­struc­ture which can ex­ploit the cheap single-GPU in­stances.

    This cost es­ti­mate does not count ad­di­tional costs like hard drive space. In ad­di­tion to the dataset size (the StyleGAN data en­cod­ing is ~18× larger than the raw data size, so a 10GB folder of im­ages → 200GB of .tfrecords), you would need at least 100GB HDD (50GB for the OS, and 50GB for checkpoints/images/etc to avoid crashes from run­ning out of space).↩︎
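The disk-space reasoning in footnote 19 can be written out as a small calculation. This is a rough sketch using the footnote's own figures (an ~18× expansion for the encoded dataset, plus 50GB for the OS and 50GB of working space); the function name is hypothetical, and the estimate deliberately ignores the space for the raw images themselves:

```python
def estimate_disk_gb(raw_dataset_gb, expansion=18.0, os_gb=50.0, scratch_gb=50.0):
    """Rough disk budget: encoded dataset plus OS and checkpoint/sample space."""
    encoded_gb = raw_dataset_gb * expansion   # size of the .tfrecords encoding
    return encoded_gb + os_gb + scratch_gb


# A 10GB image folder, using the default assumptions above.
print(estimate_disk_gb(10))
```

Since the expansion factor varies with the dataset and image format, it is safer to round the result up generously than to provision exactly this much.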

  20. I re­gard this as a flaw in StyleGAN & TF in gen­er­al. Com­put­ers are more than fast enough to load & process im­ages asyn­chro­nously us­ing a few worker threads, and work­ing with a di­rec­tory of im­ages (rather than a spe­cial bi­nary for­mat 10–20× larg­er) avoids im­pos­ing se­ri­ous bur­dens on the user & hard dri­ve. Py­Torch GANs al­most al­ways avoid this mis­take, and are much more pleas­ant to work with as one can freely mod­ify the dataset be­tween (and even dur­ing) runs.↩︎

  21. For ex­am­ple, my Dan­booru2018 anime por­trait dataset is 16GB, but the StyleGAN en­coded dataset is 296GB.↩︎

  22. This may be why some peo­ple re­port that StyleGAN just crashes for them & they can’t fig­ure out why. They should try chang­ing their dataset JPG ↔︎ PNG.↩︎

  23. That is, in train­ing G, the G’s fake im­ages must be aug­mented be­fore be­ing passed to the D for rat­ing; and in train­ing D, both real & fake im­ages must be aug­mented the same way be­fore be­ing passed to D. Pre­vi­ous­ly, all GAN re­searchers ap­pear to have as­sumed that one should only aug­ment real im­ages be­fore pass­ing to D dur­ing D train­ing, which con­ve­niently can be done at dataset cre­ation; un­for­tu­nate­ly, this hid­den as­sump­tion turns out to be about the most harm­ful way pos­si­ble!↩︎
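The rule in footnote 23 is about where the augmentation is applied, which a framework-free sketch can make concrete. Images are represented here as plain lists of pixel rows, and a random horizontal flip stands in for whatever augmentation is actually used; all names are hypothetical, and a real implementation would need differentiable augmentations so that gradients can flow back to G:

```python
import random


def hflip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def make_augment(rng, p=0.5):
    """Return a function applying a random horizontal flip with probability p."""
    def augment(image):
        return hflip(image) if rng.random() < p else image
    return augment


def discriminator_inputs(reals, fakes, augment):
    """D step: reals AND fakes both pass through the same augmentation pipeline."""
    return [augment(x) for x in reals], [augment(x) for x in fakes]


def generator_inputs(fakes, augment):
    """G step: G's fakes are augmented before D rates them."""
    return [augment(x) for x in fakes]
```

The older practice criticized in the footnote corresponds to augmenting only `reals` once at dataset creation, so that D sees augmented reals but unaugmented fakes.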

  24. I would de­scribe the dis­tinc­tions as: Soft­ware 0.0 was im­per­a­tive pro­gram­ming for ham­mer­ing out clock­work mech­a­nism; Soft­ware 1.0 was de­clar­a­tive pro­gram­ming with spec­i­fi­ca­tion of pol­i­cy; and Soft­ware 2.0 is deep learn­ing by gar­den­ing loss func­tions (with every­thing else, from model arch to which dat­a­points to la­bel ide­ally learned end-to-end). Con­tin­u­ing the the­me, we might say that di­a­logue with mod­els, like , are “Soft­ware 3.0”…↩︎

  25. But you may not want to–re­mem­ber the lol­cat cap­tions!↩︎

  26. Note: If you use a different command to resize, check it thoroughly. With ImageMagick, if you use the ^ operator like -resize 512x512^, you will not get exactly 512×512px images as you need; while if you use the ! operator like -resize 512x512!, the images will be exactly 512×512px but the aspect ratios will be distorted to make images fit, and this may confuse anything you are training by introducing unnecessary meaningless distortions & will make any generated images look bad.↩︎
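To see why the two ImageMagick operators in footnote 26 behave so differently, it helps to compute the output size each would give. The sketch below models the geometry flags as I understand them (no flag fits inside the box, `^` covers the box, `!` forces the box); the function name is hypothetical and the rounding is an assumption, so results may differ from ImageMagick by a pixel:

```python
def resized_dimensions(width, height, box_w, box_h, flag=''):
    """Approximate output size of `-resize {box_w}x{box_h}{flag}`.

    flag '' : fit inside the box, keeping the aspect ratio
    flag '^': cover the box (the smaller side matches), keeping the aspect ratio
    flag '!': force the exact box size, ignoring the aspect ratio
    """
    if flag == '!':
        return box_w, box_h
    scale_w, scale_h = box_w / width, box_h / height
    scale = max(scale_w, scale_h) if flag == '^' else min(scale_w, scale_h)
    # Rounding is an assumption; ImageMagick's exact rounding may differ by a pixel.
    return max(1, int(width * scale + 0.5)), max(1, int(height * scale + 0.5))


# A 1000x500 landscape image under each of the three variants.
for flag in ('', '^', '!'):
    print(repr(flag), resized_dimensions(1000, 500, 512, 512, flag))
```

Only `!` lands on the exact target size for a non-square input, and it does so by stretching the image; one common way to get an exact size without distortion is to resize with `^` and then center-crop to the box.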

  27. If you are us­ing Python 2, you will get print syn­tax er­ror mes­sages; if you are us­ing Python 3–3.6, you will get ‘type hint’ er­rors.↩︎

  28. Stas Pod­gorskiy has demon­strated that the StyleGAN 2 cor­rec­tion can be re­verse-engi­neered and ap­plied back to StyleGAN 1 gen­er­a­tors if nec­es­sary.↩︎

  29. This makes it con­form to a trun­cated nor­mal dis­tri­b­u­tion; why trun­cated rather than rectified/winsorized at a max like 0.5 or 1.0 in­stead? Be­cause then many, pos­si­bly most, of the la­tent vari­ables would all be at the max, in­stead of smoothly spread out over the per­mit­ted range.↩︎
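The difference between truncating and clamping described in footnote 29 is easy to check by simulation. The sketch below assumes truncation is implemented by redrawing out-of-range values, which is one common way to do it and not necessarily what any particular script does; the function names are hypothetical:

```python
import random


def truncated_normal(limit, rng):
    """Redraw until the sample falls inside [-limit, limit]."""
    while True:
        z = rng.gauss(0.0, 1.0)
        if abs(z) <= limit:
            return z


def clamped_normal(limit, rng):
    """Draw once and clamp (winsorize) the value to [-limit, limit]."""
    return max(-limit, min(limit, rng.gauss(0.0, 1.0)))


def fraction_at_limit(samples, limit):
    """Share of samples sitting exactly on the boundary."""
    return sum(1 for z in samples if abs(z) == limit) / len(samples)


rng = random.Random(0)
limit, n = 0.5, 10_000
truncated = [truncated_normal(limit, rng) for _ in range(n)]
clamped = [clamped_normal(limit, rng) for _ in range(n)]
print(fraction_at_limit(truncated, limit), fraction_at_limit(clamped, limit))
```

In expectation, clamping a standard Normal at 0.5 leaves 2 × (1 − Φ(0.5)) ≈ 62% of the values piled up on the boundary (about 32% at a limit of 1.0), whereas the truncated samples stay spread across the permitted range.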

  30. No mini­batches are used, so this is much slower than nec­es­sary.↩︎

  31. The question is not whether one is to start with an initialization at all, but whether to start with one which does everything poorly, or one which does a few similar things well. Similarly, from a Bayesian statistics perspective, the question of what prior to use is one that everyone faces; however, many approaches sweep it under the rug and effectively assume a default flat prior that is consistently bad and optimal for no meaningful problem ever.↩︎

  32. ADA/StyleGAN3 is re­port­edly much more sam­ple-effi­cient and re­duces the need for trans­fer learn­ing: . But if a rel­e­vant model is avail­able, it should still be used. Back­port­ing the ADA data aug­men­ta­tion trick to StyleGAN1–2 will be a ma­jor up­grade.↩︎

  33. There are more real Asuka images than Holo to begin with, but there is no particular reason for the 10× data augmentation compared to Holo’s 3×: the data augmentations were just done at different times and happened to have fewer or more augmentations enabled.↩︎

  34. A fa­mous ex­am­ple is char­ac­ter de­signer Yoshiyuki Sadamoto demon­strat­ing how to turn () into (Evan­ge­lion).↩︎

  35. It turns out that this la­tent vec­tor trick does work. In­trigu­ing­ly, it works even bet­ter to do ‘model av­er­ag­ing’ or ‘model blend­ing’ (/, Pinkney & Adler 2020): re­train model A on dataset B, and then take a weighted av­er­age of the 2 mod­els (you av­er­age them, pa­ra­me­ter by pa­ra­me­ter, and re­mark­ably, that Just Works, or you can swap out lay­ers be­tween mod­el­s), and then you can cre­ate faces which are ar­bi­trar­ily in be­tween A and B. So for ex­am­ple, you can blend FFHQ/Western-animation faces (Co­lab note­book), ukiy­o-e/FFHQ faces, furries/foxes/FFHQ faces, or even furries/foxes/FFHQ/anime/ponies.↩︎
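The two blending operations mentioned in footnote 35, averaging parameter by parameter and swapping whole layers, can be sketched on toy data. Models are represented here as dictionaries mapping layer names to flat lists of floats; this representation and the function names are illustrative assumptions, not the format of any real StyleGAN checkpoint:

```python
def blend_models(params_a, params_b, alpha):
    """Parameter-by-parameter weighted average: (1 - alpha) * A + alpha * B."""
    if params_a.keys() != params_b.keys():
        raise ValueError('models must share the same architecture')
    return {
        name: [(1 - alpha) * a + alpha * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }


def swap_layers(params_a, params_b, layers_from_b):
    """Take the named layers from B and everything else from A."""
    return {name: list(params_b[name] if name in layers_from_b else params_a[name])
            for name in params_a}


# Toy 'models' with two layers each (values are arbitrary).
model_a = {'coarse': [0.0, 2.0], 'fine': [4.0]}
model_b = {'coarse': [2.0, 2.0], 'fine': [0.0]}
print(blend_models(model_a, model_b, 0.5))
print(swap_layers(model_a, model_b, {'fine'}))
```

Sweeping `alpha` from 0 to 1 moves smoothly from model A to model B; this presumably only works because B was retrained from A, so the two sets of parameters remain closely aligned.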

  36. In ret­ro­spect, this should­n’t’ve sur­prised me.↩︎

  37. There is for other architectures like flow-based ones such as Glow, and this is one of their benefits: while the requirement to be made out of building blocks which can be run backwards & forwards equally well, to be ‘invertible’, is currently extremely expensive and the results not competitive either in final image quality or compute requirements, the invertibility means that encoding an arbitrary real image to get its inferred latents Just Works™ and one can easily morph between 2 arbitrary images, or encode an arbitrary image & edit it in the latent space to do things like add/remove glasses from a face or create an opposite-sex version.↩︎

  38. This fi­nal ap­proach is, in­ter­est­ing­ly, the his­tor­i­cal rea­son back­prop­a­ga­tion was in­vent­ed: it cor­re­sponds to plan­ning in a model. For ex­am­ple, in plan­ning the flight path of an air­plane (/): the des­ti­na­tion or ‘out­put’ is fixed, the aero­dy­nam­ic­s+­geog­ra­phy or ‘model pa­ra­me­ters’ are also fixed, and the ques­tion is what ac­tions de­ter­min­ing a flight path will re­duce the loss func­tion of time or fuel spent. One starts with a ran­dom set of ac­tions pick­ing a ran­dom flight path, runs it for­ward through the en­vi­ron­ment mod­el, gets a fi­nal time/fuel spent, and then back­prop­a­gates through the model to get the gra­di­ents for the flight path, ad­just­ing the flight path to­wards a new set of ac­tions which will slightly re­duce the time/fuel spent; the new ac­tions are used to plan out the flight to get a new loss, and so on, un­til a lo­cal min­i­mum of the ac­tions has been found. This works with non-s­to­chas­tic prob­lems; for sto­chas­tic ones where the path can’t be guar­an­teed to be ex­e­cut­ed, “mod­el-pre­dic­tive con­trol” can be used to re­plan at every step and ex­e­cute ad­just­ments as nec­es­sary. An­other in­ter­est­ing use of back­prop­a­ga­tion for out­puts is which tack­les the long-s­tand­ing prob­lem of how to get NNs to out­put sets rather than list out­puts by gen­er­at­ing a pos­si­ble set out­put & re­fin­ing it via back­prop­a­ga­tion.↩︎
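The planning loop in footnote 38 (fixed output, fixed model, gradient descent on the inputs) can be shown on a toy problem. In this sketch the ‘environment model’ is just a fixed linear map with a hand-derived gradient; the weights, target, learning rate, and function names are all made up for illustration:

```python
def forward(weights, actions):
    """Fixed linear 'environment model': outputs = weights @ actions."""
    return [sum(w * a for w, a in zip(row, actions)) for row in weights]


def loss_and_gradient(weights, actions, target):
    """Squared error and its gradient with respect to the actions (not the weights)."""
    errors = [o - t for o, t in zip(forward(weights, actions), target)]
    loss = sum(e * e for e in errors)
    grad = [2 * sum(weights[i][j] * errors[i] for i in range(len(errors)))
            for j in range(len(actions))]
    return loss, grad


def optimize_inputs(weights, target, actions, lr=0.1, steps=200):
    """Gradient descent on the inputs while the model parameters stay frozen."""
    for _ in range(steps):
        _, grad = loss_and_gradient(weights, actions, target)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions


weights = [[2.0, 0.0], [0.0, 1.0]]   # frozen 'model parameters' (toy values)
target = [4.0, 3.0]                  # fixed desired output
actions = optimize_inputs(weights, target, actions=[0.0, 0.0])
print(actions, loss_and_gradient(weights, actions, target)[0])
```

Encoding an image into a GAN's latent space follows the same pattern: the Generator plays the frozen model, the target image is the fixed output, and the latent vector is the input being adjusted.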

  39. SGD is com­mon, but a sec­ond-order al­go­rithm like is often used in these ap­pli­ca­tions in or­der to run as few it­er­a­tions as pos­si­ble.↩︎

  40. shows that BigGAN/StyleGAN la­tent em­bed­dings can also go be­yond what one might ex­pect, to in­clude zooms, trans­la­tions, and other trans­forms.↩︎

  41. Flow mod­els have other ad­van­tages, mostly stem­ming from the max­i­mum like­li­hood train­ing ob­jec­tive. Since the im­age can be prop­a­gated back­wards and for­wards loss­less­ly, in­stead of be­ing lim­ited to gen­er­at­ing ran­dom sam­ples like a GAN, it’s pos­si­ble to cal­cu­late the ex­act prob­a­bil­ity of an im­age, en­abling max­i­mum like­li­hood as a loss to op­ti­mize, and drop­ping the Dis­crim­i­na­tor en­tire­ly. With no GAN dy­nam­ics, there’s no worry about weird train­ing dy­nam­ics, and the like­li­hood loss also for­bids ‘mode drop­ping’: the flow model can’t sim­ply con­spire with a Dis­crim­i­na­tor to for­get pos­si­ble im­ages.↩︎
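The ‘exact probability’ point in footnote 41 comes from the change-of-variables formula, which the simplest possible flow illustrates. The sketch below uses a single 1-dimensional affine transform of a standard Normal; real flow models like Glow stack many invertible layers over images, so this is only a toy instance, and the function names are hypothetical:

```python
import math


def standard_normal_logpdf(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2 * math.pi))


def affine_flow_logpdf(x, scale, shift):
    """Exact log-density of x under the flow x = scale * z + shift, z ~ N(0, 1)."""
    z = (x - shift) / scale                                   # run the flow backwards
    return standard_normal_logpdf(z) - math.log(abs(scale))   # change of variables


def affine_flow_sample(z, scale, shift):
    """Run the flow forwards to turn a latent z into a sample x."""
    return scale * z + shift
```

Because the log-density is available in closed form, it can be maximized directly over the training data, with `scale` and `shift` as the learnable parameters and no Discriminator involved.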

  42. StyleGAN 2 is more com­pu­ta­tion­ally ex­pen­sive but Kar­ras et al op­ti­mized the code­base to make up for it, keep­ing to­tal com­pute con­stant.↩︎

  43. Back­up-backup mir­ror: rsync rsync:// ./↩︎

  44. Im­a­geNet re­quires you to sign up & be ap­proved to down­load from them, but 2 months later I have still heard noth­ing back. So I used the data from ILSVRC2012_img_train.tar (MD5: 1d675b47d978889d74fa0da5fadfb00e; 138GB) which I down­loaded from the Im­a­geNet LSVRC 2012 Train­ing Set (Ob­ject De­tec­tion) tor­rent.↩︎

  45. Dan­booru can clas­sify the same char­ac­ter un­der mul­ti­ple tags: for ex­am­ple, Sailor Moon char­ac­ters are tagged un­der their “Sailor X” name for im­ages of their trans­formed ver­sion, and their real names for ‘civil­ian’ im­ages (eg ‘Sailor Venus’ or ‘Cure Moon­light’, the for­mer of which I merged with ‘Aino Mi­nako’). Some pop­u­lar fran­chises have many vari­ants of each char­ac­ter: the Fate fran­chise, es­pe­cially with the suc­cess of , is a par­tic­u­lar offend­er, with quite a few vari­ants of char­ac­ters like Saber.↩︎

  46. One would think it would, but I asked Brock and ap­par­ently it does­n’t help to oc­ca­sion­ally ini­tial­ize from the EMA snap­shots. EMA is a mys­te­ri­ous thing.↩︎

  47. As far as I can tell, it has something to do with the dataloader code: the calculation of length and the iterator do something weird to adjust for previous training. The net effect is that you can run with a fixed minibatch accumulation and it’ll be fine, and you can reduce the number of accumulations, and it’ll simply underrun the dataloader, but if you increase the number of accumulations, if you’ve trained enough percentage-wise, it’ll immediately flip over into a negative length and indexing into it becomes completely impossible, leading to crashes. Unfortunately, I only ever want to increase the minibatch accumulation… I tried to fix it but the logic is too convoluted for me to follow it.↩︎

  48. Mir­ror: rsync --verbose rsync:// ./↩︎

  49. Mir­ror: rsync --verbose rsync:// ./↩︎