Making Anime Faces With StyleGAN

A tutorial explaining how to train and generate high-quality anime faces with StyleGAN 1/2 neural networks, and tips/scripts for effective StyleGAN use.
anime, NGE, NN, Python, technology, tutorial
2019-02-04–2021-01-30 finished certainty: highly likely importance: 5


Generative neural networks, such as GANs, have struggled for years to generate decent-quality anime faces, despite their great success with photographic imagery such as real human faces. The task has now been effectively solved, for anime faces as well as many other domains, by the development of a new generative adversarial network, StyleGAN, whose source code was released in February 2019.

I show off my StyleGAN 1/2 CC-0-licensed anime faces & videos, provide downloads for the final models & datasets, provide the 'missing manual', & explain how I trained them based on Danbooru2017/2018 with source code for the data preprocessing, and document installation & configuration & training tricks.

For application, I document various scripts for generating images & videos, briefly describe the "This Waifu Does Not Exist" website as a public demo & its followup, discuss how the trained models can be used for transfer learning such as generating high-quality faces of anime characters with small datasets (eg Holo or Asuka Souryuu Langley), and touch on more advanced StyleGAN applications like encoders & controllable generation.

The appendix gives samples of my failures with earlier GANs for anime face generation, and I provide samples & a model from a relatively large-scale BigGAN training run suggesting that BigGAN may be the next step forward to generating full-scale anime images.

A minute of reading could save an hour of debugging!

When Ian Goodfellow's first GAN paper came out, with its blurry 64px grayscale faces, I said to myself, "given the rate at which GPUs & NN architectures improve, in a few years, we'll probably be able to throw a few GPUs at some anime collection like Danbooru and the results will be hilarious." There is something intrinsically amusing about trying to make computers draw anime, and it would be much more fun than working with yet more celebrity headshots or ImageNet samples; further, anime/illustrations/drawings are so different from the exclusively-photographic datasets always (over)used in contemporary ML research that I was curious how it would work on anime—better, worse, faster, or with different failure modes? Even more amusing—if random images become doable, then text → images would not be far behind.

So when GANs became capable of somewhat passable CelebA face samples around 2015, I began experimenting with Soumith Chintala's implementation of DCGAN, restricting myself to faces of single anime characters where I could easily scrape up ~5–10k faces. (I did a lot of my early experimenting with faces of Asuka Souryuu Langley because she has a color-centric design which made it easy to tell if a GAN run was making any progress: blonde-red hair, blue eyes, and red hair ornaments.)

It did not work. Despite many runs on my laptop & a borrowed desktop, DCGAN never got remotely near to the level of the CelebA face samples, typically topping out at reddish blobs before diverging or outright crashing.1 Thinking perhaps the problem was too-small datasets & I needed to train on all the faces, I began creating the Danbooru2017 dataset. Armed with a large dataset, I subsequently began working through particularly promising members of the GAN zoo, emphasizing SOTA & open implementations.

Among others, I have tried PixelNN (failed to get running)2, WGAN-GP, Glow, GAN-QP, MSG-GAN, SAGAN, VGAN, PokeGAN, BigGAN3, ProGAN, & StyleGAN. These architectures vary widely in their design & core algorithms and in which of the many stabilization tricks they use, but they were more similar in their results: dismal.

Glow & BigGAN had promising results reported on CelebA & ImageNet respectively, but unfortunately their training requirements were out of the question.4 (And as interesting as some other reported GANs were, no source was released and I couldn't even attempt them.)

While some remarkable tools were created, and there were the occasional semi-successful anime face GANs like IllustrationGAN, the most notable attempt at anime face generation was Make Girls.moe (MGM). MGM could, interestingly, do in-browser 256px anime face generation using tiny GANs, but that is a dead end. MGM accomplished that much by making the problem easier: they added some light supervision in the form of a crude tag embedding5, and then simplified the problem drastically to n = 42k faces cropped from professional video game character artwork, which I regarded as not an acceptable solution—the faces were small & boring, and it was unclear if this data-cleaning approach could scale to anime faces in general, much less anime images in general. They are recognizably anime faces but the resolution is low and the quality is not great:

2017 SOTA: 16 random Make Girls.Moe face samples (4×4 grid)

Typically, a GAN would diverge after a day or two of training, or it would collapse to producing a limited range of faces (or a single face), or if it was stable, simply converge to a low level of quality with a lot of fuzziness; perhaps the most typical failure mode was heterochromia (which does occur in anime but is not that common)—mismatched eye colors (each color individually plausible), from the Generator apparently being unable to coordinate with itself to pick consistently. With more recent architectures like VGAN or SAGAN, which carefully weaken the Discriminator or which add extremely-powerful components like self-attention layers, I could reach fuzzy 128px faces.

Given the miserable failure of all the prior NNs I had tried, I had begun to seriously wonder if there was something about non-photographs which made them intrinsically unable to be easily modeled by convolutional neural networks (the common ingredient to them all). Did convolutions render them unable to generate sharp lines or flat regions of color? Did regular GANs work only because photographs were made almost entirely of blurry textures?

But BigGAN demonstrated that a large cutting-edge GAN architecture could scale, given enough training, to all of ImageNet at even 512px. And ProGAN demonstrated that regular CNNs could learn to generate sharp clear anime images with only somewhat infeasible amounts of training. ProGAN (source; video), while expensive and requiring >6 GPU-weeks6, did work and was even powerful enough to overfit single-character face datasets; I didn't have enough GPU time to train on unrestricted face datasets, much less anime images in general, but merely getting this far was exciting. Because a common sequence in DL/DRL (unlike many areas of AI) is that a problem seems intractable for long periods, until someone modifies a scalable architecture slightly, produces somewhat-credible (not necessarily human or even near-human) results, and then throws a ton of compute/data at it and, since the architecture scales, it rapidly exceeds SOTA and approaches human levels (and potentially exceeds human-level). Now I just needed a faster GAN architecture with which I could train a much bigger model on a much bigger dataset.

A history of GAN generation of anime faces: 'do want' to 'oh no' to 'awesome'

StyleGAN was the final breakthrough in providing ProGAN-level capabilities but fast: by switching to a radically different architecture, it minimized the need for the slow progressive growing (perhaps eliminating it entirely7), and learned efficiently at multiple levels of resolution, with bonuses in providing much more control of the generated images with its "style transfer" metaphor.

Examples

First, some demonstrations of what is possible with StyleGAN on anime faces:

When it works: a hand-selected StyleGAN sample from my Asuka Souryuu Langley-finetuned StyleGAN
64 of the best TWDNE anime face samples selected from social media (click to zoom).
100 random sample images from the StyleGAN anime faces on TWDNE

Even a quick look at the MGM & StyleGAN samples demonstrates the latter to be superior in resolution, fine details, and overall appearance (although the MGM faces admittedly have fewer global mistakes). It is also superior to my 2018 ProGAN faces. Perhaps the most striking fact about these faces, which should be emphasized for those fortunate enough not to have spent as much time looking at awful GAN samples as I have, is not that the individual faces are good, but rather that the faces are so diverse, particularly when I look through face samples with ψ ≥ 1—it is not just the hair/eye color or head orientation or fine details that differ, but the overall style ranges from CG to cartoon sketch, and even the 'media' differ: I could swear many of these are trying to imitate watercolors, charcoal sketching, or oil painting rather than digital drawings, and some come off as recognizably '90s-anime-style vs '00s-anime-style. (I could look through samples all day despite the global errors because so many are interesting, which is not something I could say of the MGM model, whose novelty is quickly exhausted, and it appears that users of my TWDNE website feel similarly, as the average length of each visit is 1m:55s.)

Interpolation video of the 2019-02-11 face StyleGAN demonstrating generalization.
StyleGAN anime face interpolation videos are Elon Musk™-approved8!
Later interpolation video (2019-03-08 face StyleGAN)

Background

Example of the StyleGAN upscaling image pyramid architecture: small → large (visualization by Shawn Presser)

StyleGAN was published in 2018 as "A Style-Based Generator Architecture for Generative Adversarial Networks", Karras et al 2018 (source code; demo video/algorithmic review video/results & discussions video; Colab notebook9; GenForce PyTorch reimplementation with model zoo/Keras; explainers: Skymind.ai/Lyrn.ai/Two Minute Papers video). StyleGAN takes the standard GAN architecture embodied by ProGAN (whose source code it reuses) and, like some similar GAN architectures, draws inspiration from the field of "style transfer" (essentially invented by Gatys et al 2015), by changing the Generator (G), which creates the image by repeatedly upscaling its resolution, to take, at each level of resolution from 8px→16px→32px→64px→128px etc, a random input or "style noise", which is used to tell the Generator how to 'style' the image at that resolution by changing the hair or changing the skin texture and so on. 'Style noise' at a low resolution like 32px affects the image relatively globally, perhaps determining the hair length or color, while style noise at a higher level like 256px might affect how frizzy individual strands of hair are. In contrast, ProGAN and almost all other GANs inject noise into the G as well, but only at the beginning, which appears to work not nearly as well (perhaps because it is difficult to propagate that randomness 'upwards' along with the upscaled image itself to the later layers to enable them to make consistent choices?). To put it simply, by systematically providing a bit of randomness at each step in the process of generating the image, StyleGAN can 'choose' variations effectively.
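
To make the per-resolution "styling" concrete, here is a minimal NumPy sketch (not the actual StyleGAN code; the shapes and the identity-style initialization are illustrative assumptions) of adaptive instance normalization (AdaIN): a learned affine transform turns the latent w into a per-channel scale & bias which modulate the instance-normalized feature maps at one resolution level.

import numpy as np

def adain(features, w, affine_weight, affine_bias, eps=1e-8):
    """features: (channels, height, width); w: (latent_dim,)
    affine_weight: (2*channels, latent_dim); affine_bias: (2*channels,)"""
    c = features.shape[0]
    # Learned affine transform ("A" in Figure 1): w -> per-channel (scale, bias).
    style = affine_weight @ w + affine_bias          # (2*channels,)
    scale, bias = style[:c], style[c:]
    # Instance normalization: zero mean / unit variance per channel.
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    # The style decides how this resolution level 'looks'.
    return scale[:, None, None] * normalized + bias[:, None, None]

# Example: one 32x32 level with 512 channels, styled by a 512-dim w.
rng = np.random.default_rng(0)
feats = rng.standard_normal((512, 32, 32))
w = rng.standard_normal(512)
A_weight = rng.standard_normal((1024, 512)) * 0.01
A_bias = np.concatenate([np.ones(512), np.zeros(512)])   # start near identity
styled = adain(feats, w, A_weight, A_bias)
print(styled.shape)   # (512, 32, 32)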

Karras et al 2018, StyleGAN vs ProGAN architecture: "Figure 1. While a traditional generator feeds the latent code [z] through the input layer only, we first map the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer. Gaussian noise is added after each convolution, before evaluating the nonlinearity. Here "A" stands for a learned affine transform, and "B" applies learned per-channel scaling factors to the noise input. The mapping network f consists of 8 layers and the synthesis network g consists of 18 layers—two for each resolution (4²–1024²). The output of the last layer is converted to RGB using a separate 1×1 convolution, similar to Karras et al. [29]. Our generator has a total of 26.2M trainable parameters, compared to 23.1M in the traditional generator."

StyleGAN makes a number of additional improvements, but they appear to be less important: for example, it introduces a new FFHQ face/portrait dataset with 1024px images in order to show that StyleGAN convincingly improves on ProGAN in final image quality; switches to a loss which is more well-behaved than the usual logistic-style losses; and architecture-wise, it makes unusually heavy use of fully-connected (FC) layers to process an initial random input, no less than 8 layers of 512 neurons, where most GANs use 1 or 2 FC layers.10 More striking is that it omits techniques that other GANs have found critical for being able to train at 512px–1024px scale: it does not use newer losses, SAGAN-style self-attention layers in either G/D, variational Discriminator bottlenecks, conditioning on a tag or category embedding11, BigGAN-style large minibatches, different noise distributions12, advanced regularization, etc.13 One possible reason for StyleGAN's success is the way it combines outputs from the multiple layers into a single final image rather than repeatedly upscaling; when we visualize the output of each layer as an RGB image in anime StyleGANs, there is a striking division of labor between layers—some layers focus on monochrome outlines, while others fill in textured regions of color, and they sum up into an image with sharp lines and good color gradients while maintaining details like eyes.

Aside from the FCs and style noise & normalization, it is a vanilla architecture. (One oddity is the use of only 3×3 convolutions & so few layers in each upscaling block; a more conventional upscaling block than StyleGAN's 3×3→3×3 would be something like BigGAN's, which does 1×1 → 3×3 → 3×3 → 1×1. It's not clear if this is a good idea, as it limits the spatial influence of each pixel by providing limited receptive fields14.) Thus, if one has some familiarity with training a ProGAN or another GAN, one can immediately work with StyleGAN with no trouble: the training dynamics are similar and the hyperparameters have their usual meaning, and the codebase is much the same as the original ProGAN (with the main exception being that config.py has been renamed train.py (or run_training.py in S2) and the original train.py, which stores the critical configuration parameters, has been moved to training/training_loop.py; there is still no support for command-line options and StyleGAN must be controlled by editing train.py/training_loop.py by hand).

Applications

Because of its speed and stability, when the source code was released on 2019-02-04 (a date that will long be noted in the ANNals of GANime), the Nvidia models & sample dumps were quickly perused & new StyleGANs trained on a wide variety of image types, yielding, in addition to the original faces/cars/cats of Karras et al 2018:

Imagequilt visualization of the wide range of visual subjects StyleGAN has been applied to

Why Don’t GANs Work?

Why does StyleGAN work so well on anime images while other GANs worked not at all or slowly at best?

The lesson I took from "Are GANs Created Equal?", Lucic et al 2017, is that CelebA/CIFAR10 are too easy, as almost all evaluated GAN architectures were capable of occasionally achieving good FID if one simply did enough iterations & hyperparameter tuning.

Interestingly, I consistently observe in training all GANs on anime that clear lines & sharpness & cel-like smooth gradients appear only toward the end of training, after the typically blurry initial textures have coalesced. This suggests an inherent bias of CNNs: color images work because they provide some degree of textures to start with, but lineart/monochrome material fails because the GAN optimization dynamics flail around. This is consistent with Geirhos et al 2018—which uses style transfer to construct a data-augmented/transformed "Stylized-ImageNet"—showing that ImageNet CNNs are lazy and, because the tasks can be achieved to some degree with texture-only classification (as demonstrated by several of Geirhos et al 2018's authors via "BagNets"), focus on textures unless otherwise forced; and by later work finding that although CNNs are perfectly capable of emphasizing shape over texture, lower-performing models tend to rely more heavily on texture and that many kinds of training will induce a texture focus, suggesting texture tends to be lower-hanging fruit. So while CNNs can learn sharp lines & shapes rather than textures, the typical GAN architecture & training algorithm do not make it easy. Since CIFAR10/CelebA can be fairly described as being just as heavy on textures as ImageNet (which is not true of anime images), it is not surprising that GANs train easily on them, starting with textures and gradually refining into good samples, but then struggle on anime.

This raises the question of whether the StyleGAN architecture is necessary, and whether many GANs might work if only one had good style transfer for anime images and could, to defeat the texture bias, generate many versions of each anime image which kept the shape while changing the color palette. (Current style transfer methods, like the AdaIN PyTorch implementation used by Geirhos et al 2018, do not work well on anime images, ironically enough, because they are trained on photographic images, typically using the old VGG model.)

FAQ

"…Its social accountability seems sort of like that of designers of military weapons: unculpable right up until they get a little too good at their job."

David Foster Wallace, "E Unibus Pluram: Television and U.S. Fiction"

To address some common questions people have after seeing generated samples:

  • Overfitting: "Aren't StyleGAN (or BigGAN) just overfitting & memorizing data?"

    Amusingly, this is not a question anyone really bothered to ask of earlier GAN architectures, which is a sign of progress. Overfitting is a better problem to have than underfitting, because overfitting means you can use a smaller model or more data or more aggressive regularization techniques, while underfitting means your approach just isn't working.

    In any case, while there is currently no way to conclusively prove that cutting-edge GANs are not 100% memorizing (because they should be memorizing to a considerable extent in order to learn image generation, and evaluating generative models is hard in general, and for GANs in particular, because they don't provide standard metrics like likelihoods which could be used on held-out samples), there are several reasons to think that they are not just memorizing:15

    1. Sample/Dataset Overlap: a standard check for overfitting is to compare generated images to their closest matches using nearest-neighbor lookup (where distance is defined by features like a CNN embedding); examples of this are StackGAN's Figure 6 & BigGAN's Figures 10–14, where the photorealistic samples are nevertheless completely different from the most similar ImageNet datapoints. This has not been done for StyleGAN yet, but I wouldn't expect different results, as GANs typically pass this check. (It's worth noting that facial recognition reportedly does not return Flickr matches for random FFHQ StyleGAN faces, suggesting the generated faces genuinely look like new faces rather than any of the original Flickr faces.)

      One intriguing observation about GANs made by the BigGAN paper is that the criticisms of Generators memorizing datapoints may be precisely the opposite of reality: GANs may work primarily by the Discriminator (adaptively) overfitting to datapoints, thereby repelling the Generator away from real datapoints and forcing it to learn nearby possible images which collectively span the image distribution. (With enough data, this creates generalization because "neural nets are lazy" and only learn to generalize when easier strategies fail.)

    2. Semantic Understanding: GANs appear to learn meaningful concepts like individual objects, as demonstrated by "latent space addition" or research tools like GAN Dissection; image edits like object deletions/additions or segmenting objects like dogs from their backgrounds are difficult to explain without some genuine understanding of images.

    In the case of StyleGAN anime faces, there are encoders and controllable face generation now which demonstrate that the latent variables do map onto meaningful factors of variation & the model must have genuinely learned about creating images rather than merely memorizing real images or image patches. Similarly, when we use the "truncation trick"/ψ to sample relatively extreme, unlikely images and we look at the distortions, they show how generated images break down in semantically-relevant ways, which would not be the case if it were just plagiarism. (A particularly extreme example of the power of the learned StyleGAN primitives is the demonstration that Karras et al's FFHQ faces StyleGAN can be used to generate fairly realistic images of cats/dogs/cars.)

    3. Latent Space Smoothness: in general, interpolation in the latent space (z) shows smooth changes of images and logical transformations or variations of face features; if StyleGAN were merely memorizing individual datapoints, the interpolation would be expected to be low quality, yield many terrible faces, and exhibit 'jumps' in between points corresponding to real, memorized, datapoints. The StyleGAN anime face models do not exhibit this. (In contrast, the Holo ProGAN, which overfit badly, does show severe problems in its latent space interpolation videos.)

      Which is not to say that GANs do not have issues: "mode dropping" seems to still be an issue for BigGAN despite the expensive large-minibatch training, which is overfitting to some degree, and StyleGAN presumably suffers from it too.

    4. Transfer Learning: GANs have been used for semi-supervised learning (eg generating plausible 'labeled' samples to train a classifier on), imitation learning, and retraining on further datasets; if the G is merely memorizing, it is difficult to explain how any of this would work.

  • Compute Requirements: "Doesn't StyleGAN take too long to train?"

    StyleGAN is remarkably fast-training for a GAN. With the anime faces, I got better results after 1–3 days of StyleGAN training than I'd gotten with >3 weeks of ProGAN training. The training times quoted by the StyleGAN repo may sound scary, but they are, in practice, a steep overestimate of what you actually need, for several reasons:

    • Lower Resolution: the largest figures are for 1024px images, but you may not need them to be that large or even have a big dataset of 1024px images. For anime faces, 1024px-sized faces are relatively rare, and training at 512px & upscaling 2× to 1024px with waifu2x16 works fine & is much faster. Since upscaling is relatively simple & easy, another strategy is to change the progressive-growing schedule: instead of proceeding to the final resolution as fast as possible, adjust the schedule to stop at a more feasible resolution, spend the bulk of training time there, and then do just enough training at the final resolution to learn to upscale (eg spend 10% of training growing to 512px, then 80% of training time at 512px, then 10% at 1024px).
    • Diminishing Returns: the largest gains in image quality are seen in the first few days or weeks of training, with the remaining training being not that useful as it focuses on improving small details (so just a few days may be more than adequate for your purposes, especially if you're willing to select a little more aggressively from samples).
    • Transfer Learning from a related model can save days or weeks of training, as there is no need to train from scratch; with the anime face StyleGAN, one can train a character-specific StyleGAN within a few hours or days at most, and certainly does not need to spend multiple weeks training from scratch! (assuming that wouldn't just cause overfitting) Similarly, if one wants to train on some 1024px face dataset, why start from scratch, taking ~1000 GPU-hours, when you can start from Nvidia's FFHQ face model, which is already fully trained, and converge in a fraction of the from-scratch time? Or you could train at 512px and use a super-resolution GAN to upscale to 1024px. Alternately, you could change the image progression budget to spend most of your time at 512px and then at the tail end try 1024px.
    • One-Time Costs: the upfront cost of a few hundred dollars of GPU-time (at inflated AWS prices) may seem steep, but should be kept in perspective. As with almost all NNs, training 1 StyleGAN model can be literally tens of millions of times more expensive than simply running the Generator to produce 1 image; but it also need be paid only once by only one person, and the total price need not even be paid by the same person, given transfer learning, but can be amortized across various datasets. Indeed, given how fast running the Generator is, the trained model doesn't even need to be run on a GPU. (The rule of thumb is that a GPU is 20–30× faster than the same thing on CPU, with rare instances of the CPU being as fast or faster when overhead dominates, so since generating 1 image takes on the order of ~0.1s on GPU, a CPU can do it in ~3s, which is adequate for many purposes.)
  • Copyright Infringement: "Who owns StyleGAN images?"

    1. The Nvidia Source Code & Released Models for StyleGAN 1 are under a CC-BY-NC license, and you cannot edit them or produce "derivative works" such as retraining their FFHQ, car, or cat StyleGAN models. (StyleGAN 2 is under a new "Nvidia Source Code License-NC", which appears to be effectively the same as the CC-BY-NC with the addition of a patent retaliation clause.)

      If a model is trained from scratch, then that does not apply, as the source code is simply another tool used to create the model, and nothing about the CC-BY-NC license forces you to donate the copyright to Nvidia. (It would be odd if such a thing did happen—as if your word processor claimed to transfer the copyright of everything written in it to Microsoft!)

      For those concerned by the CC-BY-NC license, a 512px FFHQ config-f StyleGAN 2 has been trained & released into the public domain by Aydao, and is available for download from Mega and my rsync mirror:

      rsync --verbose rsync://78.46.86.149:873/biggan/2020-06-07-aydao-stylegan2-configf-ffhq-512-avg-tpurun1.pkl.xz ./
    2. Models in general: the copyright owners of whatever data a model was trained on are generally considered to have no copyright on the model. (The fact that the datasets or inputs are copyrighted is irrelevant, as training on them is universally considered fair use and transformative, similar to human artists or search engines; see the further reading.) The model is copyrighted to whomever created it. Hence, Nvidia has copyright on the models it created, but I have copyright on the models I trained (which I release under CC-0).

    3. Samples are trickier. The usual widely-stated legal interpretation is that the standard copyright law position is that only human authors can earn a copyright and that machines, animals, inanimate objects, or, most famously, monkeys, cannot. The US Copyright Office states clearly that regardless of whether we regard a GAN as a machine or as something more intelligent like an animal, either way, it doesn't count:

      A work of authorship must possess "some minimal degree of creativity" to sustain a copyright claim. Feist, 499 U.S. at 358, 362 (citation omitted). "[T]he requisite level of creativity is extremely low." Even a "slight amount" of creative expression will suffice. "The vast majority of works make the grade quite easily, as they possess some creative spark, 'no matter how crude, humble or obvious it might be.'" Id. at 346 (citation omitted).

      … To qualify as a work of "authorship" a work must be created by a human being. See Burrow-Giles Lithographic Co., 111 U.S. at 58. Works that do not satisfy this requirement are not copyrightable. The Office will not register works produced by nature, animals, or plants.

      Examples:

      • A photograph taken by a monkey.
      • A mural painted by an elephant.

      …the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author.

      A dump of random samples such as the Nvidia samples or TWDNE therefore has no copyright & by definition is in the public domain.

      A new copyright can be created, however, if a human author is sufficiently 'in the loop', so to speak, as to exert a de minimis amount of creative effort, even if that 'creative effort' is simply selecting a single image out of a dump of thousands of them or twiddling knobs (eg on Make Girls.Moe). Crypko, for example, takes this position.

    Further reading on computer-generated art copyrights:

Training requirements

Data

"The road of excess leads to the palace of wisdom
…If the fool would persist in his folly he would become wise
…You never know what is enough unless you know what is more than enough. …If others had not been foolish, we should be so."

William Blake, "Proverbs of Hell", The Marriage of Heaven and Hell

The necessary size for a dataset depends on the complexity of the domain and whether transfer learning is being used. StyleGAN's default settings yield a 1024px Generator with 26.2M parameters, which is a large model and can soak up potentially millions of images, so there is no such thing as too much.

For learning decent-quality anime faces from scratch, a minimum of 5000 images appears to be necessary in practice; for learning a specific character when using the anime face StyleGAN, potentially as little as ~500 (especially with data augmentation) can give good results. For domains as complicated as "any cat photo", like Karras et al 2018's cat StyleGAN, which is trained on the LSUN CATS category of ~1.8M17 cat photos, that appears to either not be enough or StyleGAN was not trained to convergence; Karras et al 2018 note that "CATS continues to be a difficult dataset due to the high intrinsic variation in poses, zoom levels, and backgrounds."18

Compute

To fit reasonable minibatch sizes, one will want GPUs with >11GB VRAM. At 512px, an 11GB GPU will only fit a minibatch of n = 4, and going below that means training will be even slower (and you may have to reduce learning rates to avoid unstable training). So, an Nvidia 1080 Ti & up would be good. (Reportedly, AMD/OpenCL works for running StyleGAN models, and there is one report of successful training with a "Radeon VII with tensorflow-rocm 1.13.2 and rocm 2.3.14".)

The StyleGAN repo provides the following estimated training times for 1–8 GPU systems (which I convert to total GPU-hours & provide a worst-case AWS-based cost estimate):

Estimated StyleGAN wallclock training times for various resolutions & GPU-clusters (source: StyleGAN repo):

| GPUs | 1024² | 512² | 256² | [March 2019 AWS Costs19] |
|------|-------|------|------|--------------------------|
| 1 | 41 days 4 hours [988 GPU-hours] | 24 days 21 hours [597 GPU-hours] | 14 days 22 hours [358 GPU-hours] | [$320, $194, $115] |
| 2 | 21 days 22 hours [1,052] | 13 days 7 hours [638] | 9 days 5 hours [442] | [NA] |
| 4 | 11 days 8 hours [1,088] | 7 days 0 hours [672] | 4 days 21 hours [468] | [NA] |
| 8 | 6 days 14 hours [1,264] | 4 days 10 hours [848] | 3 days 8 hours [640] | [$2,730, $1,831, $1,382] |
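
The bracketed figures are simple arithmetic; a small sketch of the conversion (the ~$0.32/GPU-hour rate is an assumption backed out of the quoted 1-GPU costs, not an official AWS price):

# Wallclock time x number of GPUs = total GPU-hours; multiplying by an assumed
# hourly rate reproduces the worst-case cost column for the 1-GPU row.
def gpu_hours(days, hours, n_gpus):
    return (days * 24 + hours) * n_gpus

for res, (days, hours) in {"1024px": (41, 4), "512px": (24, 21), "256px": (14, 22)}.items():
    total = gpu_hours(days, hours, n_gpus=1)
    print(res, total, "GPU-hours", f"~${total * 0.32:.0f}")
# 1024px 988 GPU-hours ~$316
# 512px 597 GPU-hours ~$191
# 256px 358 GPU-hours ~$115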

AWS GPU instances are some of the most expensive ways to train a NN and provide an upper bound (compare Vast.ai); 512px is often an acceptable (or necessary) resolution; and in practice, the full quoted training time is not really necessary—with my anime face StyleGAN, the faces themselves were high quality within 48 GPU-hours, and what training it for ~1000 additional GPU-hours accomplished was primarily to improve details like the shoulders & backgrounds. (ProGAN/StyleGAN particularly struggle with backgrounds & edges of images because those are cut off, obscured, and highly-varied compared to the faces, whether anime or FFHQ. I hypothesize that the telltale blurry backgrounds are due to the impoverishment of the backgrounds/edges in cropped face photos, and they could be fixed by transfer-learning or pretraining on a more generic dataset like ImageNet, so the model learns what the backgrounds even are in the first place; then in face training, it merely has to remember them & defocus a bit to generate correct blurry backgrounds.)

Training improvements: 256px StyleGAN anime faces after ~46 GPU-hours (top) vs 512px anime faces after 382 GPU-hours (bottom); see also the video montage of the first 9k iterations

Data Preparation

The most difficult part of running StyleGAN is preparing the dataset properly. StyleGAN does not, unlike most GAN implementations (particularly PyTorch ones), support reading a directory of files as input; it can only read its unique .tfrecord format, which stores each image as raw arrays at every relevant resolution.20 Thus, input files must be perfectly uniform, (slowly) converted to the .tfrecord format by the special dataset_tool.py tool, and will take up ~19× more disk space.21

A StyleGAN dataset must consist of images all formatted exactly the same way.

Images must be precisely 512×512px or 1024×1024px etc (any eg 512×513px image will kill the entire run), they must all be the same colorspace (you cannot have sRGB and grayscale JPGs—and I doubt other color spaces work at all), they must not be transparent, the filetype must be the same as the model you intend to (re)train (ie you cannot retrain a PNG-trained model on a JPG dataset; StyleGAN will crash every time with inscrutable convolution/channel-related errors)22, and there must be no subtle errors like CRC checksum errors, which image viewers or libraries like ImageMagick often ignore.
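
Given how unforgiving the conversion is, a quick pre-flight check can save a wasted run; a hedged Pillow sketch (the faces/ directory, 512×512 target, and JPEG/RGB requirements mirror the workflow below; adjust for your own dataset):

import sys
from pathlib import Path
from PIL import Image

def check(path, size=(512, 512)):
    # Return an error string if the file violates any of the constraints above, else None.
    try:
        with Image.open(path) as img:
            img.load()                        # force a full decode to catch corruption
            if img.size != size:     return f"bad size {img.size}"
            if img.format != "JPEG": return f"bad format {img.format}"
            if img.mode != "RGB":    return f"bad mode {img.mode} (grayscale/alpha?)"
    except Exception as e:
        return f"unreadable: {e}"
    return None

bad = [(p, err) for p in Path("faces/").rglob("*.jpg") if (err := check(p))]
for p, err in bad:
    print(p, err, file=sys.stderr)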

Faces preparation

My work­flow:

  1. Download raw images from Danbooru2018 if necessary
  2. Extract from the JSON Danbooru2018 metadata all the IDs of a subset of images if a specific Danbooru tag (such as a single character) is desired, using jq and shell scripting
  3. Crop square anime faces from raw images using Nagadomi's lbpcascade_animeface (regular face-detection methods do not work on anime images)
  4. Delete empty files, monochrome or grayscale files, & exact-duplicate files
  5. Convert to JPG
  6. Upscale below-target-resolution (512px) images with waifu2x
  7. Convert all images to exactly 512×512 resolution sRGB JPG images
  8. If feasible, improve data quality by checking for low-quality images by hand, removing near-duplicate images found by findimagedupes, and filtering with a pretrained GAN's Discriminator
  9. Convert to StyleGAN format using dataset_tool.py

The goal is to turn this:

100 random real sample images from the 512px SFW subset of Danbooru in a 10×10 grid.

into this:

36 random real sample images from the cropped Danbooru faces in a 6×6 grid.

Below I use shell scripting to prepare the dataset. A possible alternative is danbooru-utility, which aims to help "explore the dataset, filter by tags, rating, and score, detect faces, and resize the images".

Cropping

The Danbooru2018 download can be done via BitTorrent or rsync; it provides a JSON metadata tarball which unpacks into metadata/2* & a folder structure of {original,512px}/{0-999}/$ID.{png,jpg,...}.

For training on SFW whole images, the 512px/ version of Danbooru2018 would work, but it is not a great idea for faces, because by scaling images down to 512px, a lot of face detail has been lost, and getting high-quality faces is a challenge. The SFW IDs can be extracted from the filenames in 512px/ directly or from the metadata by extracting the id & rating fields (and saving to a file):

find ./512px/ -type f | sed -e 's/.*\/\([[:digit:]]*\)\.jpg/\1/'
# 967769
# 1853769
# 2729769
# 704769
# 1799769
# ...
tar xf metadata.json.tar.xz
cat metadata/20180000000000* | jq '[.id, .rating]' -c | fgrep '"s"' | cut -d '"' -f 2 # "
# ...

After installing and testing Nagadomi's lbpcascade_animeface to make sure it works, one can use a simple script which crops the face(s) from a single input image. The accuracy on Danbooru images is fairly good, perhaps 90% excellent faces, 5% low-quality faces (genuine but either awful art or tiny little faces on the order of 64px which are useless), and 5% outright errors—non-faces like armpits or elbows (oddly enough). It can be improved by making the script more restrictive, such as requiring 250×250px regions, which eliminates most of the low-quality faces & mistakes. (There is an alternative, more-difficult-to-run library by Nagadomi which offers a face-cropping script, animeface-2009's face_collector.rb, which Nagadomi says is better at cropping faces, but I was not impressed when I tried it out.) crop.py:

import cv2
import sys
import os.path

def detect(cascade_file, filename, outputname):
    if not os.path.isfile(cascade_file):
        raise RuntimeError("%s: not found" % cascade_file)

    cascade = cv2.CascadeClassifier(cascade_file)
    image = cv2.imread(filename)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    ## NOTE: Suggested modification: increase minSize to '(250,250)' px,
    ## increasing the proportion of high-quality faces & reducing
    ## false positives. Faces which are only 50×50px are useless
    ## and often not faces at all.
    ## For my StyleGANs, I use 250 or 300px boxes.

    faces = cascade.detectMultiScale(gray,
                                     # detector options
                                     scaleFactor = 1.1,
                                     minNeighbors = 5,
                                     minSize = (50, 50))
    i=0
    for (x, y, w, h) in faces:
        cropped = image[y: y + h, x: x + w]
        cv2.imwrite(outputname+str(i)+".png", cropped)
        i=i+1

if len(sys.argv) != 4:
    sys.stderr.write("usage: crop.py <animeface.xml file> <input> <output prefix>\n")
    sys.exit(-1)

detect(sys.argv[1], sys.argv[2], sys.argv[3])

The IDs can be combined with the provided lbpcascade_animeface script using xargs; however, this will be far too slow, and it would be better to exploit parallelism with xargs --max-args=1 --max-procs=16 or parallel. It's also worth noting that lbpcascade_animeface seems to use up GPU VRAM even though GPU use offers no apparent speedup (a slowdown if anything, given limited VRAM), so I find it helps to explicitly disable GPU use by setting CUDA_VISIBLE_DEVICES="". (For this step, it's quite helpful to have a many-core system like a Threadripper.)

Combining everything, parallel face-cropping of an entire Danbooru2018 subset can be done like this:

cropFaces() {
    BUCKET=$(printf "%04d" $(( $@ % 1000 )) )
    ID="$@"
    CUDA_VISIBLE_DEVICES="" nice python ~/src/lbpcascade_animeface/examples/crop.py  \
     ~/src/lbpcascade_animeface/lbpcascade_animeface.xml \
     ./original/$BUCKET/$ID.* "./faces/$ID"
}
export -f cropFaces

mkdir ./faces/
cat sfw-ids.txt | parallel --progress cropFaces

# NOTE: because of the possibility of multiple crops from an image, the script appends a N counter;
# remove that to get back the original ID & filepath: eg
#
## original/0196/933196.jpg  → portrait/9331961.jpg
## original/0669/1712669.png → portrait/17126690.jpg
## original/0997/3093997.jpg → portrait/30939970.jpg

Nvidia's StyleGAN, by default and like most image-related tools, expects square images like 512×512px, but there is nothing inherent to neural nets or convolutions that requires square inputs or outputs, and rectangular convolutions are possible. In the case of faces, they tend to be more rectangular than square, and we'd prefer to use a rectangular convolution if possible, to focus the image on the relevant dimension rather than either pay the severe performance penalty of increasing total dimensions to 1024×1024px or stick with 512×512px & waste image outputs on emitting black bars/backgrounds. A properly-sized rectangular convolution can offer a nice speedup (eg Fast.ai's training ImageNet in 18m for $40 using them among other tricks). Nolan Kent's StyleGAN re-implementation (released October 2019) does support rectangular convolutions, and as he demonstrates in his blog post, it works nicely.
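
To see that the square-image requirement comes from the surrounding StyleGAN code rather than from the convolutions themselves, a tiny PyTorch sketch (assuming PyTorch is installed; the shapes are arbitrary):

import torch
import torch.nn as nn

# A stock Conv2d happily processes a non-square (portrait-shaped) input,
# and the kernel itself can even be rectangular.
x = torch.randn(1, 3, 512, 384)                 # batch, channels, H, W
square_kernel = nn.Conv2d(3, 16, kernel_size=3, padding=1)
rect_kernel   = nn.Conv2d(3, 16, kernel_size=(5, 3), padding=(2, 1))

print(square_kernel(x).shape)   # torch.Size([1, 16, 512, 384])
print(rect_kernel(x).shape)     # torch.Size([1, 16, 512, 384])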

Cleaning & Upscaling

Miscellaneous cleanups can be done:

## Delete failed/empty files
find faces/ -size 0    -type f -delete

## Delete 'too small' files which is indicative of low quality:
find faces/ -size -40k -type f -delete

## Delete exact duplicates:
fdupes --delete --omitfirst --noprompt faces/

## Delete monochrome or minimally-colored images:
### the heuristic of <257 unique colors is imperfect but better than anything else I tried
deleteBW() { if [[ `identify -format "%k" "$@"` -lt 257 ]];
             then rm "$@"; fi; }
export -f deleteBW
find faces -type f | parallel --progress deleteBW

I remove black-white or grayscale images from all my GAN experiments because in my earliest experiments, their inclusion appeared to increase instability: mixed datasets were extremely unstable, monochrome datasets failed to learn at all, but color-only runs made some progress. It is likely that StyleGAN is now powerful enough to be able to learn on mixed datasets (and some later experiments by other people suggest that StyleGAN can handle both monochrome & color anime-style faces without a problem), but I have not risked a full month-long run to investigate, and so I continue doing color-only.

Discriminator ranking

A good trick with GANs is, after training to reasonable levels of quality, reusing the Discriminator to rank the real datapoints; images the trained D assigns the lowest probability/score of being real are often the worst-quality ones, and going through the bottom decile (or deleting it entirely) should remove many anomalies and may improve the GAN. The GAN is then trained on the new cleaned dataset, making this a kind of "active learning".

Since rating images is what the D already does, no new algorithms or training methods are necessary, and almost no code is necessary: run the D on the whole dataset to rank each image (faster than it seems, since the G & backpropagation are unnecessary; even a large dataset can be ranked in a wallclock hour or two), then one can review manually the bottom & top X%, or perhaps just delete the bottom X% sight unseen if enough data is available.

What is a D doing? I find that even the highest-ranked images often contain many anomalies or low-quality images which need to be deleted. Why? The BigGAN paper notes that a well-trained D which achieves 98% real-vs-fake classification performance on the ImageNet training dataset falls to 50–55% accuracy when run on the validation dataset, suggesting the D's role is more about memorizing the training data than computing some measure of 'realism'.

Perhaps this is because the D ranking is not necessarily a 'quality' score but simply a sort of confidence rating that an image is from the real dataset; if the real images contain certain easily-detectable images which the G can't replicate, then the D might memorize or learn them quickly. For example, in face crops, whole-figure crops are common mistaken crops, making up a tiny percentage of images; how could a face-only G learn to generate whole realistic bodies without the intermediate steps being instantly detected & defeated as errors by the D, while the D is easily able to detect realistic bodies as definitely real? This would explain the polarized rankings. And given the close connections between GANs & DRL, I have to wonder if there is more memorization going on than suspected in things like adversarial imitation learning? Incidentally, this may also explain the problem with using Discriminators for semi-supervised representation learning: if the D is memorizing datapoints to force the G to generalize, then its internal representations would be expected to be useless. (One would instead want to extract knowledge from the G, perhaps by encoding an image into z and using the z as the representation.)

An alternative perspective is offered by a crop of 2020 papers examining data augmentation for GANs: they find that for augmentation to be useful, it must be done during training (applied to both real & generated images), and one must augment all images.23 Zhao et al 2020c & Karras et al 2020 observe that, with regular GAN training, there is a striking steady decline of D performance on heldout data, and increase on training data, throughout the course of training, confirming the BigGAN observation but also showing it is a dynamic phenomenon, and probably a bad one. Adding in correct data augmentation reduces this overfitting—and markedly improves sample-efficiency & final quality. This suggests that the D does indeed memorize, but that this is not a good thing. Karras et al 2020 describe what happens as:

Convergence is now achieved [with ADA/data augmentation] regardless of the training set size and overfitting no longer occurs. Without augmentations, the gradients the generator receives from the discriminator become very simplistic over time—the discriminator starts to pay attention to only a handful of features, and the generator is free to create otherwise nonsensical images. With ADA, the gradient field stays much more detailed which prevents such deterioration.

In other words, just as the G can 'mode collapse' by focusing on generating images with only a few features, the D can also 'feature collapse' by focusing on a few features which happen to correctly split the training data's reals from fakes, such as by memorizing them outright. This technically works, but not well. This also explains why BigGAN trained stably on JFT-300M: divergence/collapse usually starts with D winning; if D wins because it memorizes, then a sufficiently large dataset should make memorization infeasible; and JFT-300M turns out to be sufficiently large. (This would predict that if Brock et al had checked the JFT-300M BigGAN D's classification performance on a held-out subset of JFT-300M, rather than just on their ImageNet BigGAN, they would have found that it classified reals vs fakes well above chance.)

If so, this suggests that for D ranking, it may not be too useful to take the D from the end of a run if not using data augmentation, because that D may be the version with the greatest degree of memorization!

Here is a simple StyleGAN2 script (ranker.py) to open a StyleGAN .pkl and run it on a list of image filenames to print out the D score, courtesy of Shao Xuning:

import pickle
import numpy as np
import cv2
import dnnlib.tflib as tflib
import random
import argparse
import PIL.Image
from training.misc import adjust_dynamic_range


def preprocess(file_path):
    # print(file_path)
    img = np.asarray(PIL.Image.open(file_path))

    # Preprocessing from dataset_tool.create_from_images
    img = img.transpose([2, 0, 1])  # HWC => CHW
    # img = np.expand_dims(img, axis=0)
    img = img.reshape((1, 3, 512, 512))

    # Preprocessing from training_loop.process_reals
    img = adjust_dynamic_range(data=img, drange_in=[0, 255], drange_out=[-1.0, 1.0])
    return img


def main(args):
    random.seed(args.random_seed)
    minibatch_size = args.minibatch_size
    input_shape = (minibatch_size, 3, 512, 512)
    # print(args.images)
    images = args.images
    images.sort()

    tflib.init_tf()
    _G, D, _Gs = pickle.load(open(args.model, "rb"))
    # D.print_layers()

    image_score_all = [(image, []) for image in images]

    # Shuffle the images and process each image in multiple minibatches.
    # Note: networks.stylegan2.minibatch_stddev_layer
    # calculates the standard deviation of a minibatch group as a feature channel,
    # which means that the output of the discriminator actually depends
    # on the companion images in the same minibatch.
    for i_shuffle in range(args.num_shuffles):
        # print('shuffle: {}'.format(i_shuffle))
        random.shuffle(image_score_all)
        for idx_1st_img in range(0, len(image_score_all), minibatch_size):
            idx_img_minibatch = []
            images_minibatch = []
            input_minibatch = np.zeros(input_shape)
            for i in range(minibatch_size):
                idx_img = (idx_1st_img + i) % len(image_score_all)
                idx_img_minibatch.append(idx_img)
                image = image_score_all[idx_img][0]
                images_minibatch.append(image)
                img = preprocess(image)
                input_minibatch[i, :] = img
            output = D.run(input_minibatch, None, resolution=512)
            print('shuffle: {}, indices: {}, images: {}'
                  .format(i_shuffle, idx_img_minibatch, images_minibatch))
            print('Output: {}'.format(output))
            for i in range(minibatch_size):
                idx_img = idx_img_minibatch[i]
                image_score_all[idx_img][1].append(output[i][0])

    with open(args.output, 'a') as fout:
        for image, score_list in image_score_all:
            print('Image: {}, score_list: {}'.format(image, score_list))
            avg_score = sum(score_list)/len(score_list)
            fout.write(image + ' ' + str(avg_score) + '\n')


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model', type=str, required=True,
                        help='.pkl model')
    parser.add_argument('--images', nargs='+')
    parser.add_argument('--output', type=str, default='rank.txt')
    parser.add_argument('--minibatch_size', type=int, default=4)
    parser.add_argument('--num_shuffles', type=int, default=5)
    parser.add_argument('--random_seed', type=int, default=0)
    return parser.parse_args()


if __name__ == '__main__':
    main(parse_arguments())

Depending on how noisy the rankings are in terms of 'quality' and the available sample size, one can either review the worst-ranked images by hand, or delete the bottom X%. One should check the top-ranked images as well to make sure the ordering is right; there can also be some odd images in the top X% which should be removed.

It might be possible to use ranker.py to improve the quality of generated samples as well, as a simple version of discriminator rejection sampling.
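
Since ranker.py writes one "path score" pair per line, acting on the rankings is only a few more lines; a hedged sketch which moves (rather than deletes) the lowest-scoring decile into a directory for manual review (the rank.txt filename matches the script's default; the 10% cutoff and review-worst/ directory are arbitrary choices):

import shutil
from pathlib import Path

# Parse the "path score" lines produced by ranker.py.
scores = []
with open("rank.txt") as f:
    for line in f:
        path, score = line.rsplit(maxsplit=1)
        scores.append((float(score), path))
scores.sort()                                    # lowest D score (least 'real') first

# Move the bottom decile aside for hand review instead of deleting outright.
cutoff = len(scores) // 10
Path("review-worst").mkdir(exist_ok=True)
for _, path in scores[:cutoff]:
    shutil.move(path, "review-worst/")
print(f"moved {cutoff} of {len(scores)} images for review")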

Upscaling

The next major step is upscaling images using waifu2x, which does an excellent job of 2× upscaling anime images: the results are nigh-indistinguishable from a higher-resolution original and greatly increase the usable corpus. The downside is that it can take 1–10s per image, must run on the GPU (I can reliably fit ~9 instances on my 2×1080ti), and is written in a now-unmaintained DL framework, Torch, with no current plans to port to PyTorch, and it is gradually becoming harder to get running (one hopes that by the time CUDA updates break it entirely, there will be another super-resolution GAN I or someone else can train on Danbooru to replace it). If pressed for time, one can just upscale the faces normally with ImageMagick, but I believe there will be some quality loss, and waifu2x is worthwhile.

. ~/src/torch/install/bin/torch-activate
upscaleWaifu2x() {
    SIZE1=$(identify -format "%h" "$@")
    SIZE2=$(identify -format "%w" "$@");

    if (( $SIZE1 < 512 && $SIZE2 < 512  )); then
        echo "$@" "${SIZE1}x${SIZE2}"
        TMP=$(mktemp "/tmp/XXXXXX.png")
        CUDA_VISIBLE_DEVICES="$((RANDOM % 2 < 1))" nice th ~/src/waifu2x/waifu2x.lua -model_dir \
            ~/src/waifu2x/models/upconv_7/art -tta 1 -m scale -scale 2 \
            -i "$@" -o "$TMP"
        convert "$TMP" "$@"
        rm "$TMP"
    fi;  }

export -f upscaleWaifu2x
find faces/ -type f | parallel --progress --jobs 9 upscaleWaifu2x

Quality Checks & Data Augmentation

The single most effective strategy to improve a GAN is to clean the data. StyleGAN cannot handle too-diverse datasets composed of multiple objects or single objects shifted around, and rare or odd images cannot be learned well. Karras et al get such good results with StyleGAN on faces in part because they constructed FFHQ to be an extremely clean, consistent dataset of just centered, well-lit, clear human faces without any obstructions or other variation. Similarly, Arfa's "This Fursona Does Not Exist" (TFDNE) S2 generates much better portraits than my own "This Waifu Does Not Exist" (TWDNE) S2 anime portraits, due partly to training longer to convergence on a TPU pod but mostly due to his investment in data cleaning: aligning the faces and heavy filtering of samples—this left him with only n = 50k, but TFDNE nevertheless outperforms TWDNE's n = 300k. (Data cleaning/augmentation is one of the more powerful ways to improve results; if we imagine deep learning as 'programming' or 'Software 2.0'24 in Andrej Karpathy's terms, data cleaning/augmentation is one of the easiest ways to finetune the loss function towards what we really want, by gardening our data to remove what we don't want and increase what we do.)

At this point, one can do manual quality checks by viewing a few hundred images, running findimagedupes -t 99% to look for near-identical faces, or dabbling in further modifications such as doing "data augmentation". Working with Danbooru2018, at this point one would have ~600–700,000 faces, which is more than enough to train StyleGAN, and one will have difficulty storing the final StyleGAN dataset because of its sheer size (due to the ~18× size multiplier). After cleaning etc, my final face dataset is the portrait dataset with n = 300k.

However, if that is not enough or one is working with a small dataset, like for a single character, data augmentation may be necessary. The mirror/horizontal flip is not necessary, as StyleGAN has that built in as an option25, but there are many other possible data augmentations. One can stretch, shift colors, sharpen, blur, increase/decrease contrast/brightness, crop, and so on. An example, extremely aggressive, set of data augmentations could be done like this:

dataAugment () {
    image="$@"
    target=$(basename "$@")
    suffix="png"
    convert -deskew 50                     "$image" "$target".deskew."$suffix"
    convert -resize 110%x100%              "$image" "$target".horizstretch."$suffix"
    convert -resize 100%x110%              "$image" "$target".vertstretch."$suffix"
    convert -blue-shift 1.1                "$image" "$target".midnight."$suffix"
    convert -fill red -colorize 5%         "$image" "$target".red."$suffix"
    convert -fill orange -colorize 5%      "$image" "$target".orange."$suffix"
    convert -fill yellow -colorize 5%      "$image" "$target".yellow."$suffix"
    convert -fill green -colorize 5%       "$image" "$target".green."$suffix"
    convert -fill blue -colorize 5%        "$image" "$target".blue."$suffix"
    convert -fill purple -colorize 5%      "$image" "$target".purple."$suffix"
    convert -adaptive-blur 3x2             "$image" "$target".blur."$suffix"
    convert -adaptive-sharpen 4x2          "$image" "$target".sharpen."$suffix"
    convert -brightness-contrast 10        "$image" "$target".brighter."$suffix"
    convert -brightness-contrast 10x10     "$image" "$target".brightercontraster."$suffix"
    convert -brightness-contrast -10       "$image" "$target".darker."$suffix"
    convert -brightness-contrast -10x10    "$image" "$target".darkerlesscontrast."$suffix"
    convert +level 5%                      "$image" "$target".contraster."$suffix"
    convert -level 5%\!                    "$image" "$target".lesscontrast."$suffix"
  }
export -f dataAugment
find faces/ -type f | parallel --progress dataAugment

Upscaling & Conversion

Once any quality fixes or data augmentation are done, it’d be a good idea to save a lot of disk space by converting to JPG & lossily reducing quality (I find -quality 33 saves a ton of space with no visible change):

convertPNGToJPG() { convert -quality 33 "$@" "$@".jpg && rm "$@"; }
export -f convertPNGToJPG
find faces/ -type f -name "*.png" | parallel --progress convertPNGToJPG

Re­mem­ber that StyleGAN mod­els are only com­pat­i­ble with im­ages of the type they were trained on, so if you are us­ing a StyleGAN pre­trained model which was trained on PNGs (like, IIRC, the FFHQ StyleGAN mod­el­s), you will need to keep us­ing PNGs.

Doing the final scaling to exactly 512px can be done at many points, but I generally postpone it to the end in order to work with images in their ‘native’ resolutions & aspect-ratios for as long as possible. At this point we carefully tell ImageMagick to rescale everything to 512×512²⁶, preserving the aspect ratio and padding with a black background as necessary on either side:

find faces/ -type f | xargs --max-procs=16 -n 9000 \
    mogrify -resize 512x512\> -background black -gravity center -extent 512x512

Any slight­ly-d­iffer­ent im­age could crash the im­port process. There­fore, we delete any im­age which is even slightly differ­ent from the 512×512 sRGB JPG they are sup­posed to be:

find faces/ -type f | xargs --max-procs=16 -n 9000 identify | \
    # remember the warning: images must be identical, square, and sRGB/grayscale:
    fgrep -v " JPEG 512x512 512x512+0+0 8-bit sRGB"| cut -d ' ' -f 1 | \
    xargs --max-procs=16 -n 10000 rm

Hav­ing done all this, we should have a large con­sis­tent high­-qual­ity dataset.

Finally, the faces can now be converted to the ProGAN or StyleGAN dataset format using dataset_tool.py. It is worth remembering at this point how fragile that script is and what requirements it imposes on its inputs; ImageMagick’s identify command is handy for looking at files in more detail, particularly their resolution & colorspace, which are often the problem.

Be­cause of the ex­treme fragility of dataset_tool.py, I strongly ad­vise that you edit it to print out the file­names of each file as they are be­ing processed so that when (not if) it crash­es, you can in­ves­ti­gate the cul­prit and check the rest. The edit could be as sim­ple as this:

diff --git a/dataset_tool.py b/dataset_tool.py
index 4ddfe44..e64e40b 100755
--- a/dataset_tool.py
+++ b/dataset_tool.py
@@ -519,6 +519,7 @@ def create_from_images(tfrecord_dir, image_dir, shuffle):
     with TFRecordExporter(tfrecord_dir, len(image_filenames)) as tfr:
         order = tfr.choose_shuffled_order() if shuffle else np.arange(len(image_filenames))
         for idx in range(order.size):
+            print(image_filenames[order[idx]])
             img = np.asarray(PIL.Image.open(image_filenames[order[idx]]))
             if channels == 1:
                 img = img[np.newaxis, :, :] # HW => CHW

There should be no is­sues if all the im­ages were thor­oughly checked ear­lier, but should any im­ages crash it, they can be checked in more de­tail by identify. (I ad­vise just delet­ing them and not try­ing to res­cue them.)
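For example, a quick way to inspect a suspect file (the filename here being a placeholder) for the usual resolution/colorspace problems:

identify -verbose suspect-face.jpg | grep -E 'Geometry|Colorspace|Type'
# a problem file will typically show a non-512x512 Geometry or a Colorspace other than sRGB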

Then the con­ver­sion is just (as­sum­ing StyleGAN pre­req­ui­sites are in­stalled, see next sec­tion):

source activate MY_TENSORFLOW_ENVIRONMENT
python dataset_tool.py create_from_images datasets/faces /media/gwern/Data/danbooru2018/faces/

Con­grat­u­la­tions, the hard­est part is over. Most of the rest sim­ply re­quires pa­tience (and a will­ing­ness to edit Python files di­rectly in or­der to con­fig­ure StyleGAN).

Training

Installation

I as­sume you have CUDA in­stalled & func­tion­ing. If not, good luck. (On my Ubuntu Bionic 18.04.2 LTS OS, I have suc­cess­fully used the Nvidia dri­ver ver­sion #410.104, CUDA 10.1, and Ten­sor­Flow 1.13.1.)
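As a quick sanity check that the driver & CUDA toolkit are actually visible before going any further (standard NVIDIA utilities, nothing StyleGAN-specific; output will vary by system):

nvidia-smi       # lists the driver version & every GPU the driver can see
nvcc --version   # reports the installed CUDA toolkit version, if the toolkit is on the PATH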

A Python ≥3.6²⁷ virtual environment can be set up to keep StyleGAN’s dependencies tidy, and TensorFlow & the StyleGAN prerequisites installed into it:

conda create -n stylegan pip python=3.6
source activate stylegan

## TF:
pip install tensorflow-gpu
## Test install:
python -c "import tensorflow as tf; tf.enable_eager_execution(); \
    print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
pip install tensorboard

## StyleGAN:
## Install pre-requisites:
pip install pillow numpy moviepy scipy opencv-python lmdb # requests?
## Download:
git clone 'https://github.com/NVlabs/stylegan.git' && cd ./stylegan/
## Test install:
python pretrained_example.py
## ./results/example.png should be a photograph of a middle-aged man

StyleGAN can also be trained on the interactive Google Colab service, which provides free slices of K80 GPUs in 12-GPU-hour chunks, using this Colab notebook. Colab is much slower than training on a local machine & the free instances are not enough to train the best StyleGANs, but this might be a useful option for people who simply want to try it a little, or who are doing something quick like extremely low-resolution training or transfer-learning where a few GPU-hours on a slow small GPU might be enough.

Configuration

StyleGAN doesn’t ship with any support for CLI options; instead, one must edit train.py and training/training_loop.py:

  1. training/training_loop.py

    The core configuration is done in the keyword defaults of the training_loop() function, beginning at line 112.

    The key arguments are G_smoothing_kimg & D_repeats (which affect the learning dynamics), network_snapshot_ticks (how often to save the pickle snapshots—more frequent means less progress lost in crashes, but as each one weighs 300MB+, they can quickly use up gigabytes of space), resume_run_id (set to "latest"), and resume_kimg.

    Don’t Erase Your Model
    resume_kimg gov­erns where in the over­all pro­gres­sive-grow­ing train­ing sched­ule StyleGAN starts from. If it is set to 0, train­ing be­gins at the be­gin­ning of the pro­gres­sive-grow­ing sched­ule, at the low­est res­o­lu­tion, re­gard­less of how much train­ing has been pre­vi­ously done. It is vi­tally im­por­tant when do­ing trans­fer learn­ing that it is set to a suffi­ciently high num­ber (eg 10000) that train­ing be­gins at the high­est de­sired res­o­lu­tion like 512px, as it ap­pears that lay­ers are erased when added dur­ing pro­gres­sive-grow­ing. (resume_kimg may also need to be set to a high value to make it skip straight to train­ing at the high­est res­o­lu­tion if you are train­ing on small datasets of small im­ages, where there’s risk of it over­fit­ting un­der the nor­mal train­ing sched­ule and never reach­ing the high­est res­o­lu­tion.) This trick is un­nec­es­sary in StyleGAN 2, which is sim­pler in not us­ing pro­gres­sive grow­ing.

    More ex­per­i­men­tal­ly, I sug­gest set­ting minibatch_repeats = 1 in­stead of minibatch_repeats = 5; in line with the sus­pi­cious­ness of the gra­di­en­t-ac­cu­mu­la­tion im­ple­men­ta­tion in ProGAN/StyleGAN, this ap­pears to make train­ing both sta­bler & faster.

    Note that some of these vari­ables, like learn­ing rates, are over­rid­den in train.py. It’s bet­ter to set those there or else you may con­fuse your­self badly (like I did in won­der­ing why ProGAN & StyleGAN seemed ex­tra­or­di­nar­ily ro­bust to large changes in the learn­ing rates…).

  2. train.py (pre­vi­ously config.py in ProGAN; re­named run_training.py in StyleGAN 2)

    Here we set the number of GPUs, image resolution, dataset, learning rates, horizontal flipping/mirroring data augmentation, and minibatch sizes. (This file includes settings intended for ProGAN—watch out that you don’t accidentally turn on ProGAN instead of StyleGAN & confuse yourself.) Learning rate & minibatch should generally be left alone (except towards the end of training when one wants to lower the learning rate to promote convergence or rebalance the G/D), but the image resolution/dataset/mirroring do need to be set, like thus:

    desc += '-faces';     dataset = EasyDict(tfrecord_dir='faces', resolution=512);              train.mirror_augment = True

    This sets up the 512px face dataset which was previously created in datasets/faces, turns on mirroring (because while there may be writing in the background, we don’t care about it for face generation), and sets a title for the checkpoints/logs, which will now appear in results/ with the ‘-faces’ string.

    Assuming you do not have 8 GPUs (as you probably do not), you must change the -preset to match your number of GPUs, as StyleGAN will not automatically choose the correct number. If you fail to select the appropriate preset, StyleGAN will attempt to use GPUs which do not exist and will crash with the opaque error message below (note that CUDA uses zero-indexing, so GPU:0 refers to the first GPU, GPU:1 refers to my second GPU, and thus /device:GPU:2 refers to my (nonexistent) third GPU):

    tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation \
        G_synthesis_3/lod: {{node G_synthesis_3/lod}}was explicitly assigned to /device:GPU:2 but available \
        devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0, \
        /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:XLA_CPU:0, \
        /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. \
        Make sure the device specification refers to a valid device.
         [[{{node G_synthesis_3/lod}}]]

    For my 2×1080ti I’d set:

    desc += '-preset-v2-2gpus'; submit_config.num_gpus = 2; sched.minibatch_base = 8; sched.minibatch_dict = \
        {4: 256, 8: 256, 16: 128, 32: 64, 64: 32, 128: 16, 256: 8}; sched.G_lrate_dict = {512: 0.0015, 1024: 0.002}; \
        sched.D_lrate_dict = EasyDict(sched.G_lrate_dict); train.total_kimg = 99000

    So my re­sults get saved to results/00001-sgan-faces-2gpu etc (the run ID in­cre­ments, ‘sgan’ be­cause StyleGAN rather than ProGAN, ‘-faces’ as the dataset be­ing trained on, and ‘2gpu’ be­cause it’s multi-GPU).

Running

I typically run StyleGAN in a detachable session (eg under screen or tmux) which keeps multiple shells organized: 1 terminal/shell for the StyleGAN run, 1 terminal/shell for TensorBoard, and 1 for Emacs.

With Emacs, I keep the two key Python files open (train.py and training/training_loop.py) for reference & easy editing.

With the “lat­est” patch, StyleGAN can be thrown into a while-loop to keep run­ning after crash­es, like:

while true; do nice python train.py; date; (xmessage "alert: StyleGAN crashed" &); sleep 10s; done

Ten­sor­Board is a log­ging util­ity which dis­plays lit­tle time-series of recorded vari­ables which one views in a web browser, eg:

tensorboard --logdir results/02022-sgan-faces-2gpu/
# TensorBoard 1.13.0 at http://127.0.0.1:6006 (Press CTRL+C to quit)

Note that Ten­sor­Board can be back­ground­ed, but needs to be up­dated every time a new run is started as the re­sults will then be in a differ­ent fold­er.
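One minor convenience: since TensorBoard treats each subdirectory under --logdir as a separate run and periodically rescans for new ones, pointing it once at the parent results/ directory avoids restarting it for every new run:

tensorboard --logdir ./results/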

Train­ing StyleGAN is much eas­ier & more re­li­able than other GANs, but it is still more of an art than a sci­ence. (We put up with it be­cause while GANs suck, every­thing else sucks more.) Notes on train­ing:

  • Crash­proofing:

    The ini­tial re­lease of StyleGAN was prone to crash­ing when I ran it, seg­fault­ing at ran­dom. Up­dat­ing Ten­sor­Flow ap­peared to re­duce this but the root cause is still un­known. Seg­fault­ing or crash­ing is also re­port­edly com­mon if run­ning on mixed GPUs (eg a 1080ti + Ti­tan V).

    Un­for­tu­nate­ly, StyleGAN has no set­ting for sim­ply re­sum­ing from the lat­est snap­shot after crash­ing/ex­it­ing (which is what one usu­ally wants), and one must man­u­ally edit the resume_run_id line in training_loop.py to set it to the lat­est run ID. This is te­dious and er­ror-prone—at one point I re­al­ized I had wasted 6 GPU-days of train­ing by restart­ing from a 3-day-old snap­shot be­cause I had not up­dated the resume_run_id after a seg­fault!

    If you are do­ing any runs longer than a few wall­clock hours, I strongly ad­vise use of nshep­perd’s patch to au­to­mat­i­cally restart from the lat­est snap­shot by set­ting resume_run_id = "latest":

    diff --git a/training/misc.py b/training/misc.py
    index 50ae51c..d906a2d 100755
    --- a/training/misc.py
    +++ b/training/misc.py
    @@ -119,6 +119,14 @@ def list_network_pkls(run_id_or_run_dir, include_final=True):
             del pkls[0]
         return pkls
    
    +def locate_latest_pkl():
    +    allpickles = sorted(glob.glob(os.path.join(config.result_dir, '0*', 'network-*.pkl')))
    +    latest_pickle = allpickles[-1]
    +    resume_run_id = os.path.basename(os.path.dirname(latest_pickle))
    +    RE_KIMG = re.compile('network-snapshot-(\d+).pkl')
    +    kimg = int(RE_KIMG.match(os.path.basename(latest_pickle)).group(1))
    +    return (locate_network_pkl(resume_run_id), float(kimg))
    +
     def locate_network_pkl(run_id_or_run_dir_or_network_pkl, snapshot_or_network_pkl=None):
         for candidate in [snapshot_or_network_pkl, run_id_or_run_dir_or_network_pkl]:
             if isinstance(candidate, str):
    diff --git a/training/training_loop.py b/training/training_loop.py
    index 78d6fe1..20966d9 100755
    --- a/training/training_loop.py
    +++ b/training/training_loop.py
    @@ -148,7 +148,10 @@ def training_loop(
         # Construct networks.
         with tf.device('/gpu:0'):
             if resume_run_id is not None:
    -            network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
    +            if resume_run_id == 'latest':
    +                network_pkl, resume_kimg = misc.locate_latest_pkl()
    +            else:
    +                network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
                 print('Loading networks from "%s"...' % network_pkl)
                 G, D, Gs = misc.load_pkl(network_pkl)
             else:

    (The diff can be edited by hand, or copied into the repo as a file like latest.patch & then ap­plied with git apply latest.patch.)

  • Tun­ing Learn­ing Rates

    The LR is one of the most crit­i­cal hy­per­pa­ra­me­ters: too-large up­dates based on too-s­mall mini­batches are dev­as­tat­ing to GAN sta­bil­ity & fi­nal qual­i­ty. The LR also seems to in­ter­act with the in­trin­sic diffi­culty or di­ver­sity of an im­age do­main; Kar­ras et al 2019 use 0.003 G/D LRs on their FFHQ dataset (which has been care­fully cu­rated and the faces aligned to put land­marks like eye­s/­mouth in the same lo­ca­tions in every im­age) when train­ing on 8-GPU ma­chines with mini­batches of n = 32, but I find lower to be bet­ter on my anime face/­por­trait datasets where I can only do n = 8. From look­ing at train­ing videos of whole-Dan­booru2018 StyleGAN runs, I sus­pect that the nec­es­sary LRs would be lower still. Learn­ing rates are closely re­lated to mini­batch size (a com­mon rule of thumb in su­per­vised learn­ing of CNNs is that the re­la­tion­ship of biggest us­able LR fol­lows a square-root curve in mini­batch size) and the BigGAN re­search ar­gues that mini­batch size it­self strongly in­flu­ences how bad mode drop­ping is, which sug­gests that smaller LRs may be more nec­es­sary the more di­verse/d­iffi­cult a dataset is.

  • Bal­anc­ing G/D:

    Screen­shot of Ten­sor­Board G/D losses for an anime face StyleGAN mak­ing progress to­wards con­ver­gence

    Later in train­ing, if the G is not mak­ing good progress to­wards the ul­ti­mate goal of a 0.5 loss (and the D’s loss grad­u­ally de­creas­ing to­wards 0.5), and has a loss stub­bornly stuck around −1 or some­thing, it may be nec­es­sary to change the bal­ance of G/D. This can be done sev­eral ways but the eas­i­est is to ad­just the LRs in train.py, sched.G_lrate_dict & sched.D_lrate_dict.

    One needs to keep an eye on the G/D losses and also the perceptual quality of the faces (since we don’t have any good FID equivalent yet for anime faces, which would require a good open-source Danbooru tagger to create embeddings), and reduce both LRs (or usually just the D’s LR) based on the face quality and whether the G/D losses are exploding or otherwise look imbalanced. What you want, I think, is for the G/D losses to be stable at a certain absolute amount for a long time while the quality visibly improves, reducing D’s LR as necessary to keep it balanced with G; and then once you’ve run out of time/patience or artifacts are showing up, you can decrease both LRs to converge onto a local optimum.

    I find the default of 0.003 can be too high once quality reaches a high level with both faces & portraits, and it helps to reduce it to a third (0.001) or a tenth (0.0003). If there still isn’t convergence, the D may be too strong, and it can be turned down separately, to a tenth or even a fiftieth. (Given the stochasticity of training & the relativity of the losses, one should wait several wallclock hours or days after each modification to see if it made a difference.)

  • Skip­ping FID met­rics:

    Some met­rics are com­puted for log­ging/re­port­ing. The FID met­rics are cal­cu­lated us­ing an old Im­a­geNet CNN; what is re­al­is­tic on Im­a­geNet may have lit­tle to do with your par­tic­u­lar do­main and while a large FID like 100 is con­cern­ing, FIDs like 20 or even in­creas­ing are not nec­es­sar­ily a prob­lem or use­ful guid­ance com­pared to just look­ing at the gen­er­ated sam­ples or the loss curves. Given that com­put­ing FID met­rics is not free & po­ten­tially ir­rel­e­vant or mis­lead­ing on many im­age do­mains, I sug­gest dis­abling them en­tire­ly. (They are not used in the train­ing for any­thing, and dis­abling them is safe.)

    They can be edited out of the main train­ing loop by com­ment­ing out the call to metrics.run like so:

    @@ -261,7 +265,7 @@ def training_loop()
            if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
                pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
                misc.save_pkl((G, D, Gs), pkl)
                # metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)

  • ‘Blob’ & ‘Crack’ Artifacts:

    During training, ‘blobs’ often show up or move around. These blobs appear even late in training on otherwise high-quality images and are unique to StyleGAN (at least, I’ve never seen another GAN whose training artifacts look like the blobs). That they are so large & glaring suggests a weakness in StyleGAN somewhere. The source of the blobs was unclear. If you watch training videos, these blobs seem to gradually morph into new features such as eyes or hair or glasses; I suspect they are part of how StyleGAN ‘creates’ new features, starting with a featureless blob superimposed at approximately the right location, and gradually refined into something useful. The StyleGAN 2 paper investigated the blob artifacts & found them to be due to the Generator working around a flaw in StyleGAN’s use of AdaIN normalization. Karras et al 2019 note that images without a blob somewhere are severely corrupted; because the blobs are in fact doing something useful, it is unsurprising that the Discriminator doesn’t fix the Generator. StyleGAN 2 changes the AdaIN normalization to eliminate this problem, improving overall quality.28

    If blobs are appearing too often, or one wants a final model without any new intrusive blobs, it may help to lower the LR to try to converge to a local optimum where the necessary blob is hidden away somewhere unobtrusive.

    In train­ing anime faces, I have seen ad­di­tional ar­ti­facts, which look like ‘cracks’ or ‘waves’ or ele­phant skin wrin­kles or the sort of fine craz­ing seen in old paint­ings or ce­ram­ics, which ap­pear to­ward the end of train­ing on pri­mar­ily skin or ar­eas of flat col­or; they hap­pen par­tic­u­larly fast when trans­fer learn­ing on a small dataset. The only so­lu­tion I have found so far is to ei­ther stop train­ing or get more da­ta. In con­trast to the blob ar­ti­facts (i­den­ti­fied as an ar­chi­tec­tural prob­lem & fixed in StyleGAN 2), I cur­rently sus­pect the cracks are a sign of over­fit­ting rather than a pe­cu­liar­ity of nor­mal StyleGAN train­ing, where the G has started try­ing to mem­o­rize noise in the fine de­tail of pix­e­la­tion/­li­nes, and so these are a kind of over­fit­ting/­mode col­lapse. (More spec­u­la­tive­ly: an­other pos­si­ble ex­pla­na­tion is that the cracks are caused by the StyleGAN D be­ing sin­gle-s­cale rather than mul­ti­-s­cale—as in MSG-GAN and a num­ber of oth­er­s—and the ‘cracks’ are ac­tu­ally high­-fre­quency noise cre­ated by the G in spe­cific patches as ad­ver­sar­ial ex­am­ples to fool the D. They re­port­edly do not ap­pear in MSG-GAN or StyleGAN 2, which both use mul­ti­-s­cale Ds.)

  • Gra­di­ent Ac­cu­mu­la­tion:

    ProGAN/StyleGAN’s code­base claims to sup­port gra­di­ent ac­cu­mu­la­tion, which is a way to fake large mini­batch train­ing (eg n = 2048) by not do­ing the back­prop­a­ga­tion up­date every mini­batch, but in­stead sum­ming the gra­di­ents over many mini­batches and ap­ply­ing them all at once. This is a use­ful trick for sta­bi­liz­ing train­ing, and large mini­batch NN train­ing can differ qual­i­ta­tively from small mini­batch NN training—BigGAN per­for­mance in­creased with in­creas­ingly large mini­batches (n = 2048) and the au­thors spec­u­late that this is be­cause such large mini­batches mean that the full di­ver­sity of the dataset is rep­re­sented in each ‘mini­batch’ so the BigGAN mod­els can­not sim­ply ‘for­get’ rarer dat­a­points which would oth­er­wise not ap­pear for many mini­batches in a row, re­sult­ing in the GAN pathol­ogy of ‘mode drop­ping’ where some kinds of data just get ig­nored by both G/D.

    How­ev­er, the ProGAN/StyleGAN im­ple­men­ta­tion of gra­di­ent ac­cu­mu­la­tion does not re­sem­ble that of any other im­ple­men­ta­tion I’ve seen in Ten­sor­Flow or Py­Torch, and in my own ex­per­i­ments with up to n = 4096, I did­n’t ob­serve any sta­bi­liza­tion or qual­i­ta­tive differ­ences, so I am sus­pi­cious the im­ple­men­ta­tion is wrong.
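For reference, this is what gradient accumulation is supposed to do; a minimal framework-agnostic sketch (purely illustrative, with a hypothetical grad_fn, and not the ProGAN/StyleGAN implementation): sum the gradients over k small minibatches and apply them as one update, emulating a minibatch k times larger.

import numpy as np

def accumulated_update(params, minibatches, grad_fn, lr=0.003, k=8):
    """Average gradients over k small minibatches, then apply a single SGD step."""
    accum = [np.zeros_like(p) for p in params]
    for mb in minibatches[:k]:
        grads = grad_fn(params, mb)                   # gradients for one small minibatch
        accum = [a + g for a, g in zip(accum, grads)]
    return [p - lr * (a / k) for p, a in zip(params, accum)]  # one update ~ one large-minibatch step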

Here is what a suc­cess­ful train­ing pro­gres­sion looks like for the anime face StyleGAN:

Train­ing mon­tage video of the first 9k it­er­a­tions of the anime face StyleGAN.

The anime face model is ob­so­leted by the StyleGAN 2 por­trait model.

The anime face model as of 2019-03-08, trained for 21,980 it­er­a­tions or ~21m im­ages or ~38 GPU-days, is avail­able for down­load. (It is still not ful­ly-con­verged, but the qual­ity is good.)

Sampling

Hav­ing suc­cess­fully trained a StyleGAN, now the fun part—­gen­er­at­ing sam­ples!

Psi/“truncation trick”

The Ψ/“trun­ca­tion trick” (BigGAN dis­cus­sion, StyleGAN dis­cus­sion; ap­par­ently first in­tro­duced by ) is the most im­por­tant hy­per­pa­ra­me­ter for all StyleGAN gen­er­a­tion.

The truncation trick is used at sample generation time but not training time. The idea is to edit the latent vector z, which is a vector of 𝒩(0,1) variables, to remove any variables which are above a certain size like 0.5 or 1.0, and resample those.29 This seems to help by avoiding ‘extreme’ latent values or combinations of latent values which the G is not as good at—a G will not have generated many data points with each latent variable at, say, +1.5SD. The tradeoff is that those are still legitimate areas of the overall latent space which were being used during training to cover parts of the data distribution; so while the latent variables close to the mean of 0 may be the most accurately modeled, they are also only a small part of the space of all possible images. So one can generate latent variables from the full unrestricted 𝒩(0,1) distribution for each one, or one can truncate them at something like +1SD or +0.7SD. (As with the discussion of the best distribution for the original latent distribution, there’s no good reason to think that this is an optimal method of doing truncation; there are many alternatives, such as ones penalizing the sum of the variables, or either rejecting or rescaling whole latent vectors, which might work better than the current truncation trick.)
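As a minimal NumPy sketch of that ‘resample anything too extreme’ idea (illustrative only: StyleGAN’s own implementation instead truncates by interpolating the intermediate w latents toward their running average, controlled by truncation_psi):

import numpy as np

def truncated_z(dim=512, threshold=0.7, rng=np.random):
    """Sample z ~ N(0,1)^dim, resampling any component whose magnitude exceeds threshold."""
    z = rng.randn(dim)
    extreme = np.abs(z) > threshold
    while extreme.any():
        z[extreme] = rng.randn(extreme.sum())   # redraw only the too-extreme components
        extreme = np.abs(z) > threshold
    return z

z = truncated_z()
print(z.min(), z.max())   # all components now lie within ±0.7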

At Ψ = 0, di­ver­sity is nil and all faces are a sin­gle global av­er­age face (a brown-eyed brown-haired school­girl, un­sur­pris­ing­ly); at ±0.5 you have a broad range of faces, and by ±1.2, you’ll see tremen­dous di­ver­sity in faces/styles/­con­sis­tency but also tremen­dous ar­ti­fact­ing & dis­tor­tion. Where you set your Ψ will heav­ily in­flu­ence how ‘orig­i­nal’ out­puts look. At Ψ = 1.2, they are tremen­dously orig­i­nal but ex­tremely hit or miss. At Ψ = 0.5 they are con­sis­tent but bor­ing. For most of my sam­pling, I set Ψ = 0.7 which strikes the best bal­ance be­tween crazi­ness/ar­ti­fact­ing and qual­i­ty/­di­ver­si­ty. (Per­son­al­ly, I pre­fer to look at Ψ = 1.2 sam­ples be­cause they are so much more in­ter­est­ing, but if I re­leased those sam­ples, it would give a mis­lead­ing im­pres­sion to read­er­s.)

Random Samples

The StyleGAN repo has a sim­ple script pretrained_example.py to down­load & gen­er­ate a sin­gle face; in the in­ter­ests of re­pro­ducibil­i­ty, it hard­wires the model and the RNG seed so it will only gen­er­ate 1 par­tic­u­lar face. How­ev­er, it can be eas­ily adapted to use a lo­cal model and (s­lowly30) gen­er­ate, say, 1000 sam­ple im­ages with the hy­per­pa­ra­me­ter Ψ = 0.6 (which gives high­-qual­ity but not high­ly-di­verse im­ages) which are saved to results/example-{0-999}.png:

import os
import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib
import config

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
    Gs.print_layers()

    for i in range(0,1000):
        rnd = np.random.RandomState(None)
        latents = rnd.randn(1, Gs.input_shape[1])
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        images = Gs.run(latents, None, truncation_psi=0.6, randomize_noise=True, output_transform=fmt)
        os.makedirs(config.result_dir, exist_ok=True)
        png_filename = os.path.join(config.result_dir, 'example-'+str(i)+'.png')
        PIL.Image.fromarray(images[0], 'RGB').save(png_filename)

if __name__ == "__main__":
    main()

Karras et al 2018 Figures

The figures in Karras et al 2018, demonstrating random samples and aspects of the style noise using the 1024px FFHQ face model (as well as the others), were generated by generate_figures.py. This script needs extensive modifications to work with my 512px anime face model; going through the file:

  • the code uses Ψ = 1 trun­ca­tion, but faces look bet­ter with Ψ = 0.7 (sev­eral of the func­tions have truncation_psi= set­tings but, trick­i­ly, the Fig­ure 3 draw_style_mixing_figure has its Ψ set­ting hid­den away in the synthesis_kwargs global vari­able)
  • the loaded model needs to be switched to the anime face mod­el, of course
  • di­men­sions must be re­duced 1024→512 as ap­pro­pri­ate; some ranges are hard­coded and must be re­duced for 512px im­ages as well
  • the trun­ca­tion trick fig­ure 8 does­n’t show enough faces to give in­sight into what the la­tent space is do­ing so it needs to be ex­panded to show both more ran­dom seed­s/­faces, and more Ψ val­ues
  • the bed­room/­car/­cat sam­ples should be dis­abled

The changes I make are as fol­lows:

diff --git a/generate_figures.py b/generate_figures.py
index 45b68b8..f27af9d 100755
--- a/generate_figures.py
+++ b/generate_figures.py
@@ -24,16 +24,13 @@ url_bedrooms    = 'https://drive.google.com/uc?id=1MOSKeGF0FJcivpBI7s63V9YHloUTO
 url_cars        = 'https://drive.google.com/uc?id=1MJ6iCfNtMIRicihwRorsM3b7mmtmK9c3' # karras2019stylegan-cars-512x384.pkl
 url_cats        = 'https://drive.google.com/uc?id=1MQywl0FNt6lHu8E_EUqnRbviagS7fbiJ' # karras2019stylegan-cats-256x256.pkl

-synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8)
+synthesis_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True), minibatch_size=8, truncation_psi=0.7)

 _Gs_cache = dict()

 def load_Gs(url):
-    if url not in _Gs_cache:
-        with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
-            _G, _D, Gs = pickle.load(f)
-        _Gs_cache[url] = Gs
-    return _Gs_cache[url]
+    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))
+    return Gs

 #----------------------------------------------------------------------------
 # Figures 2, 3, 10, 11, 12: Multi-resolution grid of uncurated result images.
@@ -85,7 +82,7 @@ def draw_noise_detail_figure(png, Gs, w, h, num_samples, seeds):
     canvas = PIL.Image.new('RGB', (w * 3, h * len(seeds)), 'white')
     for row, seed in enumerate(seeds):
         latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1])] * num_samples)
-        images = Gs.run(latents, None, truncation_psi=1, **synthesis_kwargs)
+        images = Gs.run(latents, None, **synthesis_kwargs)
         canvas.paste(PIL.Image.fromarray(images[0], 'RGB'), (0, row * h))
         for i in range(4):
             crop = PIL.Image.fromarray(images[i + 1], 'RGB')
@@ -109,7 +106,7 @@ def draw_noise_components_figure(png, Gs, w, h, seeds, noise_ranges, flips):
     all_images = []
     for noise_range in noise_ranges:
         tflib.set_vars({var: val * (1 if i in noise_range else 0) for i, (var, val) in enumerate(noise_pairs)})
-        range_images = Gsc.run(latents, None, truncation_psi=1, randomize_noise=False, **synthesis_kwargs)
+        range_images = Gsc.run(latents, None, randomize_noise=False, **synthesis_kwargs)
         range_images[flips, :, :] = range_images[flips, :, ::-1]
         all_images.append(list(range_images))

@@ -144,14 +141,11 @@ def draw_truncation_trick_figure(png, Gs, w, h, seeds, psis):
 def main():
     tflib.init_tf()
     os.makedirs(config.result_dir, exist_ok=True)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=1024, ch=1024, rows=3, lods=[0,1,2,2,3,3], seed=5)
-    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=1024, h=1024, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,18)])
-    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=1024, h=1024, num_samples=100, seeds=[1157,1012])
-    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
-    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=1024, h=1024, seeds=[91,388], psis=[1, 0.7, 0.5, 0, -0.5, -1])
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure10-uncurated-bedrooms.png'), load_Gs(url_bedrooms), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=0)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure11-uncurated-cars.png'), load_Gs(url_cars), cx=0, cy=64, cw=512, ch=384, rows=4, lods=[0,1,2,2,3,3], seed=2)
-    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure12-uncurated-cats.png'), load_Gs(url_cats), cx=0, cy=0, cw=256, ch=256, rows=5, lods=[0,0,1,1,2,2,2], seed=1)
+    draw_uncurated_result_figure(os.path.join(config.result_dir, 'figure02-uncurated-ffhq.png'), load_Gs(url_ffhq), cx=0, cy=0, cw=512, ch=512, rows=3, lods=[0,1,2,2,3,3], seed=5)
+    draw_style_mixing_figure(os.path.join(config.result_dir, 'figure03-style-mixing.png'), load_Gs(url_ffhq), w=512, h=512, src_seeds=[639,701,687,615,2268], dst_seeds=[888,829,1898,1733,1614,845], style_ranges=[range(0,4)]*3+[range(4,8)]*2+[range(8,16)])
+    draw_noise_detail_figure(os.path.join(config.result_dir, 'figure04-noise-detail.png'), load_Gs(url_ffhq), w=512, h=512, num_samples=100, seeds=[1157,1012])
+    draw_noise_components_figure(os.path.join(config.result_dir, 'figure05-noise-components.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[1967,1555], noise_ranges=[range(0, 18), range(0, 0), range(8, 18), range(0, 8)], flips=[1])
+    draw_truncation_trick_figure(os.path.join(config.result_dir, 'figure08-truncation-trick.png'), load_Gs(url_ffhq), w=512, h=512, seeds=[91,388, 389, 390, 391, 392, 393, 394, 395, 396], psis=[1, 0.7, 0.5, 0.25, 0, -0.25, -0.5, -1])

All this done, we get some fun anime face sam­ples to par­al­lel Kar­ras et al 2018’s fig­ures:

Anime face StyleGAN, Fig­ure 2, un­cu­rated sam­ples
Fig­ure 3, “style mix­ing” of source/­trans­fer faces, demon­strat­ing con­trol & in­ter­po­la­tion (top row=style, left colum­n=­tar­get to be styled)
Fig­ure 8, the “trun­ca­tion trick” vi­su­al­ized: 10 ran­dom faces, with the range Ψ = [1, 0.7, 0.5, 0.25, 0, −0.25, −0.5, −1]—demon­strat­ing the trade­off be­tween di­ver­sity & qual­i­ty, and the global av­er­age face.

Videos

Training Montage

The easiest samples are the progress snapshots generated during training. Over the course of training, their size increases as the effective resolution increases & finer details are generated, and at the end they can be quite large (often 14MB each for the anime faces), so doing lossy compression with a tool like pngnq+advpng or converting them to JPG with lowered quality is a good idea. To turn the many snapshots into a training montage video like the above, I use FFmpeg on the PNGs:

# -framerate 10: show 10 input PNGs per second
# -i -: read the concatenated PNGs from stdin
# -r 25: output frame-rate; input frames will be duplicated to pad out to 25FPS
# -c:v libx264: x264 for compatibility
# -pix_fmt yuv420p: force ffmpeg to use a standard colorspace - otherwise PNG colorspace is kept, breaking browsers (!)
# -crf 33: adequate high quality
# -vf "scale=iw/2:ih/2": shrink the image by 2×; the full detail is not necessary & this saves space
# -preset veryslow -tune animation: aim for the smallest binary possible with animation-tuned settings
cat $(ls ./results/*faces*/fakes*.png | sort --numeric-sort) | \
    ffmpeg -framerate 10 -i - -r 25 -c:v libx264 -pix_fmt yuv420p -crf 33 \
        -vf "scale=iw/2:ih/2" -preset veryslow -tune animation \
        ./stylegan-facestraining.mp4

Interpolations

The orig­i­nal ProGAN repo pro­vided a con­fig for gen­er­at­ing in­ter­po­la­tion videos, but that was re­moved in StyleGAN. Cyril Di­agne (@kikko_fr) im­ple­mented a re­place­ment, pro­vid­ing 3 kinds of videos:

  1. random_grid_404.mp4: a standard interpolation video, which is simply a random walk through the latent space, modifying all the variables smoothly and animating it; by default it makes 4 of them, arranged 2×2 in the video. Several interpolation videos are shown in the examples section.

  2. interpolate.mp4: a ‘coarse’ “style mixing” video; a single ‘source’ face is generated & held constant; a secondary interpolation video (a random walk as before) is generated; at each step of the random walk, the ‘coarse’/high-level ‘style’ noise is copied from the random walk to overwrite the source face’s original style noise. For faces, this means that the original face will be modified with all sorts of orientations & facial expressions while still remaining recognizably the original character. (It is the video analog of Karras et al 2018’s Figure 3.)

    A copy of Di­ag­ne’s video.py:

    import os
    import pickle
    import numpy as np
    import PIL.Image
    import dnnlib
    import dnnlib.tflib as tflib
    import config
    import scipy
    
    def main():
    
        tflib.init_tf()
    
        # Load pre-trained network.
        # url = 'https://drive.google.com/uc?id=1MEGjdvVpUsu1jB4zrXZN7Y4kBBOzizDQ'
        # with dnnlib.util.open_url(url, cache_dir=config.cache_dir) as f:
        ## NOTE: insert model here:
        _G, _D, Gs = pickle.load(open("results/02047-sgan-faces-2gpu/network-snapshot-013221.pkl", "rb"))
        # _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
        # _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
        # Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.
    
        grid_size = [2,2]
        image_shrink = 1
        image_zoom = 1
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
        random_seed = 404
        mp4_file = 'results/random_grid_%s.mp4' % random_seed
        minibatch_size = 8
    
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_state = np.random.RandomState(random_seed)
    
        # Generate latent vectors
        shape = [num_frames, np.prod(grid_size)] + Gs.input_shape[1:] # [frame, image, channel, component]
        all_latents = random_state.randn(*shape).astype(np.float32)
        import scipy
        all_latents = scipy.ndimage.gaussian_filter(all_latents,
                       [smoothing_sec * mp4_fps] + [0] * len(Gs.input_shape), mode='wrap')
        all_latents /= np.sqrt(np.mean(np.square(all_latents)))
    
    
        def create_image_grid(images, grid_size=None):
            assert images.ndim == 3 or images.ndim == 4
            num, img_h, img_w, channels = images.shape
    
            if grid_size is not None:
                grid_w, grid_h = tuple(grid_size)
            else:
                grid_w = max(int(np.ceil(np.sqrt(num))), 1)
                grid_h = max((num - 1) // grid_w + 1, 1)
    
            grid = np.zeros([grid_h * img_h, grid_w * img_w, channels], dtype=images.dtype)
            for idx in range(num):
                x = (idx % grid_w) * img_w
                y = (idx // grid_w) * img_h
                grid[y : y + img_h, x : x + img_w] = images[idx]
            return grid
    
        # Frame generation func for moviepy.
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            latents = all_latents[frame_idx]
            fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
            images = Gs.run(latents, None, truncation_psi=0.7,
                                  randomize_noise=False, output_transform=fmt)
    
            grid = create_image_grid(images, grid_size)
            if image_zoom > 1:
                grid = scipy.ndimage.zoom(grid, [image_zoom, image_zoom, 1], order=0)
            if grid.shape[2] == 1:
                grid = grid.repeat(3, 2) # grayscale => RGB
            return grid
    
        # Generate video.
        import moviepy.editor
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
    
        # import scipy
        # coarse
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
    
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_seed = 500
        random_state = np.random.RandomState(random_seed)
    
    
        w = 512
        h = 512
        #src_seeds = [601]
        dst_seeds = [700]
        style_ranges = ([0] * 7 + [range(8,16)]) * len(dst_seeds)
    
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)
    
        shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
        src_latents = random_state.randn(*shape).astype(np.float32)
        src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                    smoothing_sec * mp4_fps,
                                                    mode='wrap')
        src_latents /= np.sqrt(np.mean(np.square(src_latents)))
    
        dst_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in dst_seeds])
    
    
        src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
        dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
        src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)
        dst_images = Gs.components.synthesis.run(dst_dlatents, randomize_noise=False, **synthesis_kwargs)
    
    
        canvas = PIL.Image.new('RGB', (w * (len(dst_seeds) + 1), h * 2), 'white')
    
        for col, dst_image in enumerate(list(dst_images)):
            canvas.paste(PIL.Image.fromarray(dst_image, 'RGB'), ((col + 1) * h, 0))
    
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            src_image = src_images[frame_idx]
            canvas.paste(PIL.Image.fromarray(src_image, 'RGB'), (0, h))
    
            for col, dst_image in enumerate(list(dst_images)):
                col_dlatents = np.stack([dst_dlatents[col]])
                col_dlatents[:, style_ranges[col]] = src_dlatents[frame_idx, style_ranges[col]]
                col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
                for row, image in enumerate(list(col_images)):
                    canvas.paste(PIL.Image.fromarray(image, 'RGB'), ((col + 1) * h, (row + 1) * w))
            return np.array(canvas)
    
        # Generate video.
        import moviepy.editor
        mp4_file = 'results/interpolate.mp4'
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
    
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
    
        import scipy
    
        duration_sec = 60.0
        smoothing_sec = 1.0
        mp4_fps = 20
    
        num_frames = int(np.rint(duration_sec * mp4_fps))
        random_seed = 503
        random_state = np.random.RandomState(random_seed)
    
    
        w = 512
        h = 512
        style_ranges = [range(6,16)]
    
        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
        synthesis_kwargs = dict(output_transform=fmt, truncation_psi=0.7, minibatch_size=8)
    
        shape = [num_frames] + Gs.input_shape[1:] # [frame, image, channel, component]
        src_latents = random_state.randn(*shape).astype(np.float32)
        src_latents = scipy.ndimage.gaussian_filter(src_latents,
                                                    smoothing_sec * mp4_fps,
                                                    mode='wrap')
        src_latents /= np.sqrt(np.mean(np.square(src_latents)))
    
        dst_latents = np.stack([random_state.randn(Gs.input_shape[1])])
    
    
        src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
        dst_dlatents = Gs.components.mapping.run(dst_latents, None) # [seed, layer, component]
    
    
        def make_frame(t):
            frame_idx = int(np.clip(np.round(t * mp4_fps), 0, num_frames - 1))
            col_dlatents = np.stack([dst_dlatents[0]])
            col_dlatents[:, style_ranges[0]] = src_dlatents[frame_idx, style_ranges[0]]
            col_images = Gs.components.synthesis.run(col_dlatents, randomize_noise=False, **synthesis_kwargs)
            return col_images[0]
    
        # Generate video.
        import moviepy.editor
        mp4_file = 'results/fine_%s.mp4' % (random_seed)
        mp4_codec = 'libx264'
        mp4_bitrate = '5M'
    
        video_clip = moviepy.editor.VideoClip(make_frame, duration=duration_sec)
        video_clip.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)
    
    if __name__ == "__main__":
        main()

    ‘Coarse’ style-trans­fer­/in­ter­po­la­tion video

  3. fine_503.mp4: a ‘fine’ style mix­ing video; in this case, the style noise is taken from later on and in­stead of affect­ing the global ori­en­ta­tion or ex­pres­sion, it affects sub­tler de­tails like the pre­cise shape of hair strands or hair color or mouths.

    ‘Fine’ style-trans­fer­/in­ter­po­la­tion video

Circular interpolations are another interesting kind of interpolation, written by snowy halcy, which instead of randomly walking around the latent space freely (with large or awkward transitions), moves around a fixed high-dimensional point, doing: “binary search to get the MSE to be roughly the same between frames (slightly brute force, but it looks nicer), and then did that for what is probably close to a sphere or circle in the latent space.” A later version of circular interpolation is in snowy halcy’s face editor repo, but here is the original version cleaned up into a stand-alone program:

import dnnlib.tflib as tflib
import math
import moviepy.editor
from numpy import linalg
import numpy as np
import pickle

def main():
    tflib.init_tf()
    _G, _D, Gs = pickle.load(open("results/02051-sgan-faces-2gpu/network-snapshot-021980.pkl", "rb"))

    rnd = np.random
    latents_a = rnd.randn(1, Gs.input_shape[1])
    latents_b = rnd.randn(1, Gs.input_shape[1])
    latents_c = rnd.randn(1, Gs.input_shape[1])

    def circ_generator(latents_interpolate):
        radius = 40.0

        latents_axis_x = (latents_a - latents_b).flatten() / linalg.norm(latents_a - latents_b)
        latents_axis_y = (latents_a - latents_c).flatten() / linalg.norm(latents_a - latents_c)

        latents_x = math.sin(math.pi * 2.0 * latents_interpolate) * radius
        latents_y = math.cos(math.pi * 2.0 * latents_interpolate) * radius

        latents = latents_a + latents_x * latents_axis_x + latents_y * latents_axis_y
        return latents

    def mse(x, y):
        return (np.square(x - y)).mean()

    def generate_from_generator_adaptive(gen_func):
        max_step = 1.0
        current_pos = 0.0

        change_min = 10.0
        change_max = 11.0

        fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)

        current_latent = gen_func(current_pos)
        current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
        array_list = []

        video_length = 1.0
        while(current_pos < video_length):
            array_list.append(current_image)

            lower = current_pos
            upper = current_pos + max_step
            current_pos = (upper + lower) / 2.0

            current_latent = gen_func(current_pos)
            current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
            current_mse = mse(array_list[-1], current_image)

            while current_mse < change_min or current_mse > change_max:
                if current_mse < change_min:
                    lower = current_pos
                    current_pos = (upper + lower) / 2.0

                if current_mse > change_max:
                    upper = current_pos
                    current_pos = (upper + lower) / 2.0


                current_latent = gen_func(current_pos)
                current_image = Gs.run(current_latent, None, truncation_psi=0.7, randomize_noise=False, output_transform=fmt)[0]
                current_mse = mse(array_list[-1], current_image)
            print(current_pos, current_mse)
        return array_list

    frames = generate_from_generator_adaptive(circ_generator)
    frames = moviepy.editor.ImageSequenceClip(frames, fps=30)

    # Generate video.
    mp4_file = 'results/circular.mp4'
    mp4_codec = 'libx264'
    mp4_bitrate = '3M'
    mp4_fps = 20

    frames.write_videofile(mp4_file, fps=mp4_fps, codec=mp4_codec, bitrate=mp4_bitrate)

if __name__ == "__main__":
    main()

‘Circular’ interpolation video

An in­ter­est­ing use of in­ter­po­la­tions is Kyle McLean’s “Waifu Syn­the­sis” video: a singing anime video mash­ing up StyleGAN anime faces + lyrics + Project Ma­genta mu­sic.

Models

Anime Faces

The pri­mary model I’ve trained, the anime face model is de­scribed in the data pro­cess­ing & train­ing sec­tion. It is a 512px StyleGAN model trained on n = 218,794 faces cropped from all of Dan­booru2017, cleaned, & up­scaled, and trained for 21,980 it­er­a­tions or ~21m im­ages or ~38 GPU-days.

Downloads (I recommend using the more-recent StyleGAN 2 portrait model unless cropped faces are specifically desired):

TWDNE

To show off the anime faces, and as a joke, on 2019-02-14, I set up “This Waifu Does Not Exist” (TWDNE), a standalone static website which displays a random anime face (out of 100,000), generated with various Ψ, and paired with GPT-2-117M text snippets prompted on anime plot summaries. The details are too lengthy to go into here.

But the site was amus­ing & an enor­mous suc­cess. It went vi­ral overnight and by the end of March 2019, ~1 mil­lion unique vis­i­tors (most from Chi­na) had vis­ited TWDNE, spend­ing over 2 min­utes each look­ing at the NN-gen­er­ated faces & text; peo­ple be­gan hunt­ing for hi­lar­i­ous­ly-de­formed faces, us­ing TWDNE as a screen­saver, pick­ing out faces as avatars, cre­at­ing packs of faces for video games, paint­ing their own col­lages of faces, us­ing it as a char­ac­ter de­signer for in­spi­ra­tion, etc.

Anime Bodies

Aaron Gokaslan experimented with a custom 256px anime game image dataset which has individual characters posed in whole-person images, to see how StyleGAN coped with more complex geometries. Progress required additional data cleaning and lowering the learning rate but, trained on a 4-GPU system for a week or two, the results are promising (even down to reproducing the copyright statements in the images), providing preliminary evidence that StyleGAN can scale:

Whole-body anime im­ages, ran­dom sam­ples, Aaron Gokaslan
Whole-body anime im­ages, style trans­fer among sam­ples, Aaron Gokaslan

Conditional Anime Faces, Arfafax

In March 2020, Arfafax trained a conditional anime face StyleGAN: it takes a list of tags (a subset of Danbooru2019 tags relevant to faces), processes them via doc2vec into a fixed-size input, and feeds them into a conditional StyleGAN.31 (While almost all uses of StyleGAN are unconditional—you just dump in a big pile of images and it learns to generate random samples—the code actually supports conditional use, where a category or set of variables is turned into an output image, and a few other people have experimented with it.)

Sam­ple gen­er­ated for the tags: ['0:white_hair', '0:blue_eyes', '0:long_hair']
Tags: ['0:blonde_hair', '0:blue_eyes', '0:long_hair']
Tags: ['4:hirasawa_yui']
Interpolation video: from ['blonde_hair red_eyes long_hair blush'] → ['blue_hair green_eyes long_hair blush'].
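A hypothetical sketch of the tag-embedding step described above (not Arfafax’s actual pipeline; gensim’s Doc2Vec and the 64-dimensional vector size are merely illustrative stand-ins for ‘turn a variable-length tag list into a fixed-size conditioning vector’):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tag_lists = [["blonde_hair", "blue_eyes", "long_hair"],
             ["white_hair", "blue_eyes", "long_hair"]]   # one tag list per face crop
docs  = [TaggedDocument(words=tags, tags=[i]) for i, tags in enumerate(tag_lists)]
model = Doc2Vec(docs, vector_size=64, min_count=1, epochs=50)

# embed an arbitrary tag combination for use as the conditioning ('label') input:
condition = model.infer_vector(["white_hair", "blue_eyes", "long_hair"])
print(condition.shape)   # (64,)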

Conditional GAN Problems

In the­o­ry, a con­di­tional anime face GAN would have two ma­jor ben­e­fits over the reg­u­lar kind: be­cause ad­di­tional in­for­ma­tion is sup­plied by the hu­man-writ­ten tags de­scrib­ing each dat­a­point, the model should be able to learn high­er-qual­ity faces; and be­cause the faces are gen­er­ated based on a spe­cific de­scrip­tion, one can di­rectly con­trol the out­put with­out any com­plex en­cod­ing/edit­ing tricks. The fi­nal model was eas­ier to con­trol, but the qual­ity was only OK.

What went wrong? In practice, Arfafax ran into challenges with overfitting, quality, and the model seeming to ignore many tags (focusing instead on a few dimensions like hair color); my suspicion is that he rediscovered the same conditional-GAN issues that the StackGAN authors encountered, where the tag embedding is so high-dimensional that each face is effectively unique (unlike categorical conditioning, where there will be hundreds or thousands of other images with the same category label), leading to Discriminator memorization & training collapse. (StackGAN’s remedy was “conditioning augmentation”: regularizing D’s use of the embedding by adding random Gaussian noise to each use of an embedding, and this is a trick which has resurfaced repeatedly in conditional GANs since, such as textStyleGAN, so it seems no one has a better idea of how to fix it. The data augmentation GANs appear to do this sort of regularization implicitly by modifying the images instead, and sometimes by adding a ‘consistency loss’ on z, which requires noising it. Also of interest is CLIP’s observation that training by predicting the exact text of a caption from the image didn’t work well, and it was much better to instead do contrastive matching within a minibatch: in effect, instead of trying to predict the caption text for an image, trying to predict which of the images inside the minibatch is most likely to match which of the caption texts compared to the others.)
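A minimal sketch of that ‘conditioning augmentation’ idea: every time an embedding is used, it is freshly noised, so D cannot memorize exact embedding/image pairings (the fixed sigma is an arbitrary simplification; StackGAN itself learns a Gaussian over the embedding rather than using a constant noise scale):

import numpy as np

def condition_augment(embedding, sigma=0.1, rng=np.random):
    """Return a freshly-noised copy of a conditioning embedding."""
    return embedding + rng.normal(scale=sigma, size=embedding.shape)

emb   = np.random.randn(64)        # eg a doc2vec tag embedding
noisy = condition_augment(emb)     # a different noised version on every call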

To test this theory, Arfafax ran an additional StyleGAN experiment (Colab notebook), using the Cartoon Set 100K dataset: a collection of simple emoji-like cartoon faces which are defined by 18 variables (10 vs 4 vs 4 categories = ~10^13 possible unique faces), of which a large subset was generated. Because the dataset is synthetic, the variables can be encoded both as categorical variables and as doc2vec embeddings, the set of variables should uniquely specify each image, and it is easy to determine the amount of mode-dropping by looking. Arfafax compared the standard unconditional StyleGAN, a multi-categorical embedding StyleGAN, and a doc2vec StyleGAN. The unconditional did fine as usual, the categorical one did somewhat poorly, and the doc2vec one collapsed badly—failing to generate entire swaths of Cartoon Set face-space. So the problem does appear to be the embedding.

Tag → Face Usage

Ar­fafax has pro­vided a Google Co­lab note­book (code) for gen­er­at­ing anime faces from tag in­puts and for gen­er­at­ing in­ter­po­la­tions.

Com­plete sup­ported tag list (ex­clud­ing in­di­vid­ual char­ac­ters, which are tagged like miyu_(vampire_princess_miyu) or morrigan_aensland):

>>> face_tags
['angry', 'anger_vein', 'annoyed', 'clenched_teeth', 'blush', 'blush_stickers', 'embarrassed', 'bored',
    'closed_eyes', 'confused', 'crazy', 'disdain', 'disgust', 'drunk', 'envy', 'expressionless', 'evil', 'facepalm',
    'flustered', 'frustrated', 'grimace', 'guilt', 'happy', 'kubrick_stare', 'lonely', 'nervous', 'nosebleed',
    'one_eye_closed', 'open_mouth', 'closed_mouth', 'parted_lips', 'pain', 'pout', 'raised_eyebrow', 'rape_face',
    'rolling_eyes', 'sad', 'depressed', 'frown', 'gloom_(expression)', 'tears', 'horns', 'scared', 'panicking',
    'worried', 'serious', 'sigh', 'sleepy', 'tired', 'sulking', 'thinking', 'pensive', 'wince', 'afterglow',
    'ahegao', 'fucked_silly', 'naughty_face', 'torogao', 'smile', 'crazy_smile', 'evil_smile', 'fingersmile',
    'forced_smile', 'glasgow_smile', 'grin', 'fang', 'evil_grin', 'light_smile', 'sad_smile', 'seductive_smile',
    'stifled_laugh', 'smug', 'doyagao', 'smirk', 'smug', 'troll_face', 'surprised', 'scared', '/\\/\\/\\',
    'color_drain', 'horror_(expression)', 'screaming', 'turn_pale', 'trembling', 'wavy_mouth', ';)', ':d',
    ';d', 'xd', 'd:', ':}', ':{', ':3', ';3', 'x3', '3:', 'uwu', '=.w.=', ':p', ';p', ':q', ';q', '>:)', '>:(',
    ':t', ':i', ':/', ':x', ':c', 'c:', ':<', ';<', ':<>', ':>', ':>=', ';>=', ':o', ';o', '=', '=)', '=d',
    '=o', '=v', '|3', '|d', '|o', 'o3o', '(-3-)', '>3<', 'o_o', '0_0', '._.', '•_•', 'solid_circle_eyes',
    '♥_♥', 'heart_eyes', '^_^', '^o^', '\\(^o^)/', '└(^o^)┐≡', '^q^', '>_<', 'xd', 'x3', '>o<', '<_>', ';_;',
    '@_@', '>_@', '+_+', '+_-', '-_-', '\\_/', '=_=', '=^=', '=v=', '<o>_<o>', 'constricted_pupils', 'cross_eyed',
    'rectangular_mouth', 'sideways_mouth', 'no_nose', 'no_mouth', 'wavy_mouth', 'wide-eyed', 'mouth_drool',
    'awesome_face', 'foodgasm', 'henohenomoheji', 'nonowa', 'portrait', 'profile', 'smiley_face', 'uso_da',
    'food_awe', 'breast_awe', 'penis_awe']
>>> eye_tags
['aqua_eyes', 'black_eyes', 'blue_eyes', 'brown_eyes', 'green_eyes', 'grey_eyes', 'orange_eyes', 'lavender_eyes',
    'pink_eyes', 'purple_eyes', 'red_eyes', 'silver_eyes', 'white_eyes', 'yellow_eyes', 'heterochromia', 'multicolored_eyes',
    'al_bhed_eyes', 'pac-man_eyes', 'ringed_eyes', 'constricted_pupils', 'dilated_pupils', 'horizontal_pupils',
    'no_pupils', 'slit_pupils', 'symbol-shaped_pupils', '+_+', 'heart-shaped_pupils', 'star-shaped_pupils',
    'blue_sclera', 'black_sclera', 'blank_eyes', 'bloodshot_eyes', 'green_sclera', 'mismatched_sclera', 'orange_sclera',
    'red_sclera', 'yellow_sclera', 'bags_under_eyes', 'bruised_eye', 'flaming_eyes', 'glowing_eyes', 'glowing_eye',
    'mako_eyes', 'amphibian_eyes', 'button_eyes', 'cephalopod_eyes', 'compound_eyes', 'frog_eyes', 'crazy_eyes',
    'empty_eyes', 'heart_eyes', 'nonowa', 'solid_circle_eyes', 'o_o', '0_0', 'jitome', 'tareme', 'tsurime',
    'sanpaku', 'sharingan', 'mangekyou_sharingan', 'eye_reflection', 'text_in_eyes', 'missing_eye', 'one-eyed',
    'third_eye', 'extra_eyes', 'no_eyes']
>>> eye_expressions
['>_<', 'x3', 'xd', 'o_o', '0_0', '3_3', '6_9', '@_@', '^_^', '^o^', '9848', '26237', '=_=', '+_+', '._.',
    '<o>_<o>', 'blinking', 'closed_eyes', 'wince', 'one_eye_closed', ';<', ';>', ';p']
>>> eye_other
['covering_eyes', 'hair_over_eyes', 'hair_over_one_eye', 'bandage_over_one_eye', 'blindfold', 'hat_over_eyes',
    'eyepatch', 'eyelashes', 'colored_eyelashes', 'fake_eyelashes', 'eyes_visible_through_hair', 'glasses',
    'makeup', 'eyeliner', 'eyeshadow', 'mascara', 'eye_contact', 'looking_afar', 'looking_at_another', 'looking_at_breasts',
    'looking_at_hand', 'looking_at_mirror', 'looking_at_phone', 'looking_at_viewer', 'looking_away', 'looking_back',
    'looking_down', 'looking_out_window', 'looking_over_glasses', 'looking_through_legs', 'looking_to_the_side',
    'looking_up', 'akanbe', 'blind', 'cross-eyed', 'drawn_on_eyes', 'eyeball', 'eye_beam', 'eye_poke', 'eye_pop',
    'persona_eyes', 'shading_eyes', 'squinting', 'staring', 'uneven_eyes', 'upturned_eyes', 'wall-eyed', 'wide-eyed', 'wince']
>>> ears_tags
['animal_ears', 'bear_ears', 'bunny_ears', 'cat_ears', 'dog_ears', 'fake_animal_ears', 'fox_ears', 'horse_ears',
    'kemonomimi_mode', 'lion_ears', 'monkey_ears', 'mouse_ears', 'raccoon_ears', 'sheep_ears', 'tiger_ears',
    'wolf_ears', 'pointy_ears', 'robot_ears', 'extra_ears', 'ear_piercing', 'ear_protection', 'earrings',
    'single_earring', 'headphones', 'covering_ears', 'ear_biting', 'ear_licking', 'ear_grab']
>>> hair_tags
['heartbreak_haircut', 'hand_in_hair', 'adjusting_hair', 'bunching_hair', 'hair_flip', 'hair_grab', 'hair_pull',
    'hair_tucking', 'hair_tousle', 'hair_twirling', 'hair_sex', 'hair_brush', 'hair_dryer', 'shampoo', 'bun_cover',
    'hairpods', 'chopsticks', 'comb', 'hair_ornament', 'hair_bell', 'hair_bobbles', 'hair_bow', 'hair_ribbon',
    'hairclip', 'hairpin', 'hair_flower', 'hair_tubes', 'kanzashi', 'hair_tie', 'hairband', 'hair_weapon',
    'headband', 'scrunchie', 'wig', 'facial_hair', 'beard', 'bearded_girl', 'goatee', 'mustache', 'fake_mustache',
    'stubble', 'fiery_hair', 'prehensile_hair', 'helicopter_hair', 'tentacle_hair', 'living_hair', 'detached_hair',
    'severed_hair', 'floating_hair', 'hair_spread_out', 'wet_hair']
>>> hair_color_tags
['aqua_hair', 'black_hair', 'blonde_hair', 'blue_hair', 'light_blue_hair', 'brown_hair', 'light_brown_hair',
    'green_hair', 'grey_hair', 'magenta_hair', 'orange_hair', 'pink_hair', 'purple_hair', 'lavender_hair',
    'red_hair', 'auburn_hair', 'maroon_hair', 'silver_hair', 'white_hair', 'multicolored_hair', 'colored_inner_hair',
    'gradient_hair', 'rainbow_hair', 'streaked_hair', 'two-tone_hair', 'highlights', 'colored_tips', 'alternate_hair_color']
>>> hair_style_tags
['very_short_hair', 'short_hair', 'medium_hair', 'long_hair', 'very_long_hair', 'absurdly_long_hair',
    'big_hair', 'bald', 'bald_girl', 'alternate_hairstyle', 'hair_down', 'hair_up', 'curly_hair', 'drill_hair',
    'twin_drills', 'flipped_hair', 'hair_flaps', 'messy_hair', 'pointy_hair', 'ringlets', 'spiked_hair', 'wavy_hair',
    'bangs', 'asymmetrical_bangs', 'blunt_bangs', 'hair_over_eyes', 'hair_over_one_eye', 'parted_bangs', 'swept_bangs',
    'hair_between_eyes', 'hair_intakes', 'sidelocks', "widow's_peak", 'ahoge', 'heart_ahoge', 'huge_ahoge',
    'antenna_hair', 'comb_over', 'hair_pulled_back', 'hair_slicked_back', 'mohawk', 'hair_bikini', 'hair_censor',
    'hair_in_mouth', 'hair_over_breasts', 'hair_over_one_breast', 'hair_over_crotch', 'hair_over_shoulder',
    'hair_scarf', 'bow_by_hair', 'braid', 'braided_bangs', 'front_braid', 'side_braid', 'french_braid', 'crown_braid',
    'single_braid', 'multiple_braids', 'twin_braids', 'tri_braids', 'quad_braids', 'hair_bun', 'braided_bun',
    'double_bun', 'triple_bun', 'hair_rings', 'half_updo', 'one_side_up', 'two_side_up', 'low-braided_long_hair',
    'low-tied_long_hair', 'mizura', 'multi-tied_hair', 'nihongami', 'ponytail', 'folded_ponytail', 'front_ponytail',
    'high_ponytail', 'short_ponytail', 'side_ponytail', 'split_ponytail', 'topknot', 'twintails', 'low_twintails',
    'short_twintails', 'uneven_twintails', 'tri_tails', 'quad_tails', 'quin_tails', 'bob_cut', 'bowl_cut',
    'buzz_cut', 'chonmage', 'crew_cut', 'flattop', 'pixie_cut', 'undercut', 'cornrows', 'hairlocs', 'hime_cut',
    'mullet', 'afro', 'huge_afro', 'beehive_hairdo', 'pompadour', 'quiff', 'shouten_pegasus_mix_mori']
>>> skin_color_tags
['dark_skin', 'pale_skin', 'tan', 'tanlines', 'sun_tattoo', 'black_skin', 'blue_skin', 'green_skin', 'grey_skin',
    'orange_skin', 'pink_skin', 'purple_skin', 'red_skin', 'white_skin', 'yellow_skin', 'shiny_skin']
>>> headwear_tags
['crown', 'hat', 'helmet', 'black_headwear', 'blue_headwear', 'brown_headwear', 'green_headwear', 'grey_headwear',
    'orange_headwear', 'pink_headwear', 'purple_headwear', 'red_headwear', 'white_headwear', 'yellow_headwear',
    'ajirogasa', 'animal_hat', 'cat_hat', 'penguin_hat', 'baseball_cap', 'beanie', 'beret', 'bicorne', 'boater_hat',
    'bowl_hat', 'bowler_hat', 'bucket_hat', 'cabbie_hat', 'chef_hat', 'toque_blanche', 'flat_top_chef_hat',
    'cloche_hat', 'cowboy_hat', 'deerstalker', 'deviruchi_hat', 'dixie_cup_hat', 'eggshell_hat', 'fedora',
    'female_service_cap', 'flat_cap', 'fur_hat', 'garrison_cap', 'jester_cap', 'kepi', 'mian_guan', 'mitre',
    'mob_cap', 'mortarboard', 'nightcap', 'nurse_cap', 'party_hat', 'peaked_cap', 'pillow_hat', 'pirate_hat',
    'porkpie_hat', 'pumpkin_hat', 'rice_hat', 'robe_and_wizard_hat', 'sailor_hat', 'santa_hat', 'mini_santa_hat',
    'shako_cap', 'shampoo_hat', 'sombrero', 'sun_hat', "tam_o'_shanter", 'tate_eboshi', 'tokin_hat', 'top_hat',
    'mini_top_hat', 'tricorne', 'ushanka', 'witch_hat', 'mini_witch_hat', 'wizard_hat', 'veil', 'zun_hat',
    'baseball_helmet', 'bicycle_helmet', 'brodie_helmet', 'diving_helmet', 'football_helmet', 'hardhat', 'horned_helmet',
    'helm', 'kabuto', 'motorcycle_helmet', 'pickelhaube', 'pith_helmet', 'stahlhelm', 'tank_helmet', 'winged_helmet',
    'circlet', 'diadem', 'mini_crown', 'saishi', 'tiara', 'aviator_cap', 'bandana', 'bonnet', 'dalachi_(headdress)',
    'habit', 'hijab', 'keffiyeh', 'shower_cap', 'visor_cap', 'checkered_hat', 'frilled_hat', 'military_hat',
    'mini_hat', 'multicolored_hat', 'police_hat', 'print_hat', 'school_hat', 'straw_hat', 'adjusting_hat',
    'hand_on_headwear', 'hands_on_headwear', 'hat_basket', 'hat_loss', 'hat_on_chest', 'hat_over_eyes', 'hat_over_one_eye',
    'hat_removed', 'hat_tip', 'holding_hat', 'torn_hat', 'no_hat', 'hat_bow', 'hat_feather', 'hat_flower',
    'hat_ribbon', 'hat_with_ears', 'adjusting_hat', 'backwards_hat', 'hat_removed', 'holding_hat', 'torn_hat',
    'hair_bow', 'hair_ribbon', 'hairband', 'headband', 'forehead_protector', 'sweatband', 'hachimaki', 'nejiri_hachimaki',
    'mongkhon', 'headdress', 'maid_headdress', 'veil', 'hood']
>>> eyewear_tags
['glasses', 'monocle', 'sunglasses', 'aqua-framed_eyewear', 'black-framed_eyewear', 'blue-framed_eyewear',
    'brown-framed_eyewear', 'green-framed_eyewear', 'grey-framed_eyewear', 'orange-framed_eyewear', 'pink-framed_eyewear',
    'purple-framed_eyewear', 'red-framed_eyewear', 'white-framed_eyewear', 'yellow-framed_eyewear', 'blue-tinted_eyewear',
    'brown-tinted_eyewear', 'green-tinted_eyewear', 'orange-tinted_eyewear', 'pink-tinted_eyewear', 'purple-tinted_eyewear',
    'red-tinted_eyewear', 'yellow-tinted_eyewear', 'heart-shaped_eyewear', 'round_eyewear', 'over-rim_eyewear',
    'rimless_eyewear', 'semi-rimless_eyewear', 'under-rim_eyewear', 'adjusting_eyewear', 'eyewear_on_head',
    'eyewear_removed', 'eyewear_hang', 'eyewear_in_mouth', 'holding_eyewear', 'eyewear_strap', 'eyewear_switch',
    'looking_over_eyewear', 'no_eyewear', '3d_glasses', 'coke-bottle_glasses', 'diving_mask', 'fancy_glasses',
    'heart-shaped_eyewear', 'funny_glasses', 'goggles', 'nodoka_glasses', 'opaque_glasses', 'pince-nez', 'safety_glasses',
    'shooting_glasses', 'ski_goggles', 'x-ray_glasses', 'bespectacled', 'kamina_shades', 'star_shades']
>>> piercings_tags
['ear_piercing', 'eyebrow_piercing', 'anti-eyebrow_piercing', 'eyelid_piercing', 'lip_piercing', 'labret_piercing',
    'nose_piercing', 'bridge_piercing', 'tongue_piercing']
>>> format_tags
['3d', 'animated', 'animated_png', 'flash', 'music_video', 'song', 'video', 'animated_gif', 'non-looping_animation',
    'archived_file', 'artbook', 'bmp', 'calendar_(medium)', 'card_(medium)', 'comic', '2koma', '3koma', '4koma',
    'multiple_4koma', '5koma', 'borderless_panels', 'doujinshi', 'eromanga', 'left-to-right_manga', 'right-to-left_comic',
    'silent_comic', 'corrupted_image', 'cover', 'album_cover', 'character_single', 'cover_page', 'doujin_cover',
    'dvd_cover', 'fake_cover', 'game_cover', 'magazine_cover', 'manga_cover', 'fake_screenshot', 'game_cg',
    'gyotaku_(medium)', 'highres', 'absurdres', 'incredibly_absurdres', 'lowres', 'thumbnail', 'huge_filesize',
    'icon', 'logo', 'kirigami', 'lineart', 'no_lineart', 'outline', 'long_image', 'tall_image', 'wide_image',
    'mosaic_art', 'photomosaic', 'oekaki', 'official_art', 'phonecard', 'photo', 'papercraft', 'paper_child',
    'paper_cutout', 'pixel_art', 'postcard', 'poster', 'revision', 'bad_revision', 'artifacted_revision',
    'censored_revision', 'corrupted_revision', 'lossy_revision', 'watermarked_revision', 'scan', 'screencap',
    'shitajiki', 'tegaki', 'transparent_background', 'triptych_(art)', 'vector_trace', 'wallpaper', 'dual_monitor',
    'ios_wallpaper', 'official_wallpaper', 'phone_wallpaper', 'psp_wallpaper', 'tileable', 'wallpaper_forced', 'widescreen']
>>> style_tags
['abstract', 'art_deco', 'art_nouveau', 'fine_art_parody', 'flame_painter', 'impressionism', 'nihonga',
    'sumi-e', 'ukiyo-e', 'minimalism', 'realistic', 'photorealistic', 'sketch', 'style_parody', 'list_of_style_parodies',
    'surreal', 'traditional_media', 'faux_traditional_media', 'work_in_progress', 'backlighting', 'blending',
    'bloom', 'bokeh', 'caustics', 'chiaroscuro', 'chromatic_aberration', 'chromatic_aberration_abuse', 'diffraction_spikes',
    'depth_of_field', 'dithering', 'drop_shadow', 'emphasis_lines', 'foreshortening', 'gradient', 'halftone',
    'lens_flare', 'lens_flare_abuse', 'motion_blur', 'motion_lines', 'multiple_monochrome', 'optical_illusion',
    'anaglyph', 'exif_thumbnail_surprise', 'open_in_internet_explorer', 'open_in_winamp', 'stereogram', 'scanlines',
    'silhouette', 'speed_lines', 'vignetting']

Down­load:

  • rsync --verbose --recursive rsync://78.46.86.149:873/biggan/2021-01-09-arfafax-stylegan-danbooruportraits-tagconditional.tar.xz ./ (all files: doc2vec, StyleGAN mod­el, note­book, videos)

Extended StyleGAN2 Danbooru2019, Aydao

Aydao (Twitter), in parallel with the BigGAN experiments, worked on gradually extending StyleGAN2’s modeling powers to cover Danbooru2019 SFW. His results were excellent and power nearcyan’s (Twitter) “This Anime Does Not Exist” (TADNE) website. Samples:

30 neu­ral-net-gen­er­ated anime sam­ples from Ay­dao’s Dan­booru2019 StyleGAN2-ext model (ad­di­tional sets: 1, 2, 3); sam­ples are hand-s­e­lected for be­ing pret­ty, in­ter­est­ing, or demon­strat­ing some­thing.
TADNE in­cludes ad­di­tional fea­tures for vi­su­al­iz­ing the ψ/trun­ca­tion effects on each ran­dom seed
3×3 TADNE in­ter­po­la­tion video (MP4); ad­di­tional in­ter­po­la­tion videos: 1, 2, 3, 4, 5, 6, 7

StyleGAN2-ext Modifications

Broadly, Aydao’s StyleGAN2-ext increases the model size and disables regularizations which are useful for restricted domains like faces but fail badly on more complex and multimodal domains. His config, available in his StyleGAN2-ext repo (model checkpoints as .ckpt/.pkl/.pt; Colab notebooks for generating images & interpolation videos, GANSpace exploration, text-input-based search (eg “Asuka”), & a slider-GUI editing notebook), makes the following changes (a hypothetical config sketch follows the list):

  • dis­ables per­cep­tual path length reg­u­lar­iza­tion, re­vert­ing to the sim­ple G lo­gis­tic loss

  • dis­ables style mix­ing

  • dis­ables nor­mal­iz­ing la­tents

  • dis­ables mir­ror­ing/flip­ping data aug­men­ta­tion

    • does not use StyleGAN2-ADA data augmentation training: we have struggled to implement them on TPUs. (It’s unclear how much this matters, as most of the 2020 flurry of data-augmentation GAN research, and the StyleGAN2-ADA paper’s results, show the best results in the small-n regime: we and other StyleGAN2-ADA users have seen it make a huge difference when working with n < 5000, but no one has reported more than subtle improvements on ImageNet & beyond.)
  • weak­ens the R1 gra­di­ent gamma pa­ra­me­ter from 10 to 5

  • model size in­crease (more than dou­bles, to 1GB to­tal):

    • in­creases G fea­ture-maps’ ini­tial base value from 10 to 32
    • in­creases the max fea­ture-map value from 512 to 1024
    • increases the latent w mapping/embedding module from 8×512 FC layers to 4×1024 FC layers32
    • in­creases mini­batch stan­dard de­vi­a­tion to 32, and num­ber of fea­tures to 4
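A minimal sketch of the changes above, written with the option names used by NVIDIA’s reference StyleGAN2 implementation (run_training.py/networks_stylegan2.py); this is illustrative only, as aydao’s TPU fork may organize or name these options differently:

```python
from dnnlib import EasyDict   # config containers used throughout the StyleGAN2 codebase

G_args, D_args, G_loss, D_loss, train = (EasyDict() for _ in range(5))

G_loss.func_name = 'training.loss.G_logistic_ns'  # plain logistic G loss: no path-length regularization
G_args.style_mixing_prob = None                   # disable style mixing
G_args.normalize_latents = False                  # disable latent normalization
G_args.mapping_layers = 4                         # mapping network: 8x512 FC layers -> 4x1024
G_args.mapping_fmaps  = 1024
G_args.fmap_max = D_args.fmap_max = 1024          # raise the per-layer feature-map cap from 512
# (fmap_base is also increased, per the 'initial base value from 10 to 32' change above)
D_args.mbstd_group_size   = 32                    # minibatch standard deviation: group size 32...
D_args.mbstd_num_features = 4                     # ...and 4 stddev features
D_loss.gamma = 5                                  # weaken the R1 gradient penalty from 10 to 5
train.mirror_augment = False                      # disable mirroring/flipping data augmentation
```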

TADNE Training

Train­ing-wise:

  • the model was run for 5.26m iterations on a pre-emptible TPUv3-32 pod for a month or two (with interruptions) up to 2020-11-27, with the TPUs provided by Google; quality was poor up to 2m iterations, and then increased steadily to 4m
  • on Danbooru2019 original/full-sized SFW images (n = 2.82m), combined with the Portraits dataset (n = 0.3m), Figures (n = 0.85m), & PALM hands (n = 0.05m) for data augmentation (total n ~ 4m)
  • random cropping was initially used for the original images, but we learned midway through training that random cropping severely damaged ImageNet performance of BigGAN and switched to top-cropping everywhere; StyleGAN2-ext benefited
  • de­cayed learn­ing rate late in train­ing
  • train­ing en­abled late in train­ing (min­i­mal ben­e­fit, per­haps be­cause en­abled too late)

Ob­ser­va­tions:

  • Train­ing re­mained much more sta­ble than BigGAN runs, with no di­ver­gences

  • We ob­serve that StyleGAN2-ext vastly out­per­forms base­line StyleGAN2 runs on Dan­booru2019: the prob­lems with StyleGAN2 ap­pear to be less in­her­ent weak­ness of the AdaIN ar­chi­tec­ture (as com­pared to BigGAN etc) than the ex­tremely heavy reg­u­lar­iza­tion built into it by de­fault (ca­ter­ing to small n/sim­ple do­mains like hu­man faces).

    The key is dis­abling path length reg­u­lar­iza­tion & style mix­ing, at which point StyleGAN2 is able to scale & make good use of model size in­creases (rather than un­der­fit­ting & churn­ing end­less­ly).

    Changes that did not help (much): adding self­-at­ten­tion to StyleGAN33, re­mov­ing the “4×4 const” in­put, re­mov­ing the w deep FC stack (adding it to BigGAN also proved desta­bi­liz­ing), re­mov­ing the Lapla­cian pyra­mid/ad­di­tive im­age ar­chi­tec­ture (did­n’t help BigGAN much when added), large batch train­ing.

  • Also of note is the sur­pris­ingly high qual­ity writ­ing—while dis­abling mir­ror­ing makes it pos­si­ble to learn non-gib­ber­ish let­ters, we were still sur­prised how well StyleGAN2-ext can learn to write.

  • Hands have long been a weak point, but its hands are bet­ter than usu­al, and the ad­di­tion of PALM to train­ing caused an im­me­di­ate qual­ity im­prove­ment, demon­strat­ing the ben­e­fits of care­ful­ly-tar­geted data aug­men­ta­tion.

    • PALM’s draw­back: affect­ing G. The draw­back of us­ing PALM is that the fi­nal model gen­er­ates closeup hands oc­ca­sion­al­ly. Our orig­i­nal plan was to re­move the data aug­men­ta­tion datasets at the end of train­ing and fine­tune for a short pe­ri­od, to teach it to not gen­er­ate the PALM-style hand­s—but when Ay­dao tried that, the hands in the reg­u­lar im­ages be­gan to de­grade as StyleGAN2-ext ap­par­ently be­gan to for­get how to do hands! So it would be bet­ter to leave the data aug­men­ta­tion datasets in, and in­stead, screen out gen­er­ated hands us­ing the orig­i­nal PALM YOLOv3 hand-de­tec­tion mod­el.
  • But de­spite the qual­i­ty, the usual ar­ti­facts re­main: ar­ti­facts that I’ve ob­served in ear­lier StyleGAN2 runs (like This Fur­sona Does Not Ex­ist/This Pony Does Not Ex­ist), con­tinue to ap­pear, such as strangely smooth tran­si­tions of dis­crete ob­jects such as neck­laces or sleeves, and ‘cel­lu­lar di­vi­sion’ of 1 🡺 2 head­s/char­ac­ter­s/­bod­ies.

    I speculate these artifacts, which remain present despite great increases in quality, are caused by the smooth z/w latent space which is forced to transform Gaussian variables into binary indicators, and does so poorly (eg in TPDNE, many samples of ponies have stubby pseudo-horns, even though ponies in the original artwork either have a horn or do not, and in TWDNE, there are faces with ambiguous half-glasses smeared on them); the BigGAN paper notes that there were gains from using non-normal variables like binary variables, and I would predict that salting binary or other non-normal variables into the latent space (a change I call ‘mixture latents’) would fix this.

    • In­ter­est­ing­ly, the la­tent space is qual­i­ta­tively differ­ent and ‘larger’ than our ear­lier StyleGANs: whereas the face/­por­trait StyleGANs fa­vor a sweet spot of ψ ~ 0.7, and sam­ples se­ri­ously de­te­ri­o­rate by ψ ~ 1.2, the Dan­booru2019 StyleGAN2-ext is best at ψ ~ 1.1 and only be­gins no­tice­ably de­grad­ing by ψ ~ 1.4, with good sam­ples find­able all the way up to 2.0 (!).
  • Fine de­tail­s/­tex­tures are poor; this sug­gests l4rz’s scal­ing strat­egy of re­al­lo­cat­ing pa­ra­me­ters from the glob­al/low-res­o­lu­tion lay­ers to the high­er-res­o­lu­tion lay­ers would be use­ful

  • Over­all, we can do even bet­ter!—we ex­pect that a fresh train­ing run in­cor­po­rat­ing the fi­nal hy­per­pa­ra­me­ters, com­bined with ad­di­tional im­ages from Dan­booru2020 & pos­si­bly a new ar­m/­tor­so-fo­cused data aug­men­ta­tion, would be no­tice­ably bet­ter. (How­ev­er, as sur­pris­ingly well as StyleGAN2-ext scaled, I per­son­ally still think BigGAN will scale bet­ter.)

    • We have not tried trans­fer learn­ing (a­side from Ar­fa’s pre­lim­i­nary furry StyleGAN2-ext), but given how well trans­fer learn­ing has worked for n as small as n ~ 100 for the Face/­Por­trait mod­els, we ex­pect it to work well for this model too—and po­ten­tially give even bet­ter re­sults on faces, as it was trained on a su­per­set of the Face/­Por­trait data & knows more about bod­ies/back­ground­s/­head­wear/etc.

TADNE Download

Download (as with all our models, the models/samples are licensed under CC-0 or public domain):

Transfer Learning

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

“What are you do­ing?”, asked Min­sky. “I am train­ing a ran­domly wired neural net to play Tic-Tac-Toe” Suss­man replied. “Why is the net wired ran­dom­ly?”, asked Min­sky. “I do not want it to have any pre­con­cep­tions of how to play”, Suss­man said.

Min­sky then shut his eyes. “Why do you close your eyes?”, Suss­man asked his teacher. “So that the room will be emp­ty.”

At that mo­ment, Suss­man was en­light­ened.

“Sussman attains enlightenment”, “AI Koans”, The Jargon File

One of the most use­ful things to do with a trained model on a broad data cor­pus is to use it as a launch­ing pad to train a bet­ter model quicker on lesser data, called “trans­fer learn­ing”. For ex­am­ple, one might trans­fer learn from Nvidi­a’s FFHQ face StyleGAN model to a differ­ent celebrity dataset, or from bed­rooms → kitchens. Or with the anime face mod­el, one might re­train it on a sub­set of faces—all char­ac­ters with red hair, or all male char­ac­ters, or just a sin­gle spe­cific char­ac­ter. Even if a dataset seems differ­ent, start­ing from a pre­trained model can save time; after all, while male and fe­male faces may look differ­ent and it may seem like a mis­take to start from a most­ly-fe­male anime face mod­el, the al­ter­na­tive of start­ing from scratch means start­ing with a model gen­er­at­ing ran­dom rain­bow-col­ored sta­t­ic, and surely male faces look far more like fe­male faces than they do ran­dom sta­t­ic?34 In­deed, you can quickly train a pho­to­graphic face model start­ing from the anime face mod­el.

This ex­tends the reach of good StyleGAN mod­els from those blessed with both big data & big com­pute to those with lit­tle of ei­ther. Trans­fer learn­ing works par­tic­u­larly well for spe­cial­iz­ing the anime face model to a spe­cific char­ac­ter: the im­ages of that char­ac­ter would be too lit­tle to train a good StyleGAN on, too data-im­pov­er­ished for the sam­ple-in­effi­cient StyleGAN1–235, but hav­ing been trained on all anime faces, the StyleGAN has learned well the full space of anime faces and can eas­ily spe­cial­ize down with­out over­fit­ting. Try­ing to do, say, faces ↔︎ land­scapes is prob­a­bly a bridge too far.

Data-wise, for doing face specialization, the more the better, but n = 500–5000 is an adequate range, and even as low as n = 50 works surprisingly well. I don’t know to what extent data augmentation can substitute for original datapoints, but it’s probably worth a try, especially if you have n < 5000.

Compute-wise, specialization is rapid. Adaptation can happen within a few ticks, possibly even 1. This is surprisingly fast given that StyleGAN is not designed for few-shot/transfer learning. I speculate that this may be because the StyleGAN latent space is expressive enough that even new faces (such as new human faces for a FFHQ model, or a new anime character for an anime-face model) are still already present in the latent space. Examples of the expressivity are provided by Abdal et al 2019’s Image2StyleGAN, who find that “although the StyleGAN generator is trained on a human face dataset [FFHQ], the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars.” If even images as different as cars can be encoded successfully into a face StyleGAN, then clearly the latent space can easily model new faces and so any new face training data is in some sense already learned; so the training process is perhaps not so much about learning ‘new’ faces as about making the new faces more ‘important’ by expanding the latent space around them & contracting it around everything else, which seems like a far easier task.

How does one ac­tu­ally do trans­fer learn­ing? Since StyleGAN is (cur­rent­ly) un­con­di­tional with no dataset-spe­cific cat­e­gor­i­cal or text or meta­data en­cod­ing, just a flat set of im­ages, all that has to be done is to en­code the new dataset and sim­ply start train­ing with an ex­ist­ing mod­el. One cre­ates the new dataset as usu­al, and then ed­its training.py with a new -desc line for the new dataset, and if resume_kimg is set cor­rectly (see next para­graph) and resume_run_id = "latest" en­abled as ad­vised, you can then run python train.py and presto, trans­fer learn­ing.

The main prob­lem seems to be that train­ing can­not be done from scratch/0 it­er­a­tions, as one might naively as­sume—when I tried this, it did not work well and StyleGAN ap­peared to be ig­nor­ing the pre­trained mod­el. My hy­poth­e­sis is that as part of the pro­gres­sive grow­ing/­fad­ing in of ad­di­tional res­o­lu­tion/lay­ers, StyleGAN sim­ply ran­dom­izes or wipes out each new layer and over­writes them—­mak­ing it point­less. This is easy to avoid: sim­ply jump the train­ing sched­ule all the way to the de­sired res­o­lu­tion. For ex­am­ple, to start at one’s max­i­mum size (here 512px) one might set resume_kimg=7000 in training_loop.py. This forces StyleGAN to skip all the pro­gres­sive grow­ing and load the full model as-is. To make sure you did it right, check the first sam­ple (fakes07000.png or what­ev­er), from be­fore any trans­fer learn­ing train­ing has been done, and it should look like the orig­i­nal model did at the end of its train­ing. Then sub­se­quent train­ing sam­ples should show the orig­i­nal quickly mor­ph­ing to the new dataset. (Any­thing like fakes00000.png should not show up be­cause that in­di­cates be­gin­ning from scratch.)
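Concretely, the two edits look something like this, assuming the stock StyleGAN (v1) train.py/training_loop.py layout and a hypothetical 512px dataset named ‘holo-512’:

```python
# train.py -- add a desc line pointing at the new dataset:
desc += '-holo';  dataset = EasyDict(tfrecord_dir='holo-512');  train.mirror_augment = True  # mirroring optional

# training_loop.py -- load the pretrained model whole & skip progressive growing:
resume_run_id = 'latest'    # auto-resume from the newest snapshot, as advised earlier
resume_kimg   = 7000.0      # pretend training is already at the final 512px stage, so no layers are re-randomized
```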

Anime Faces → Character Faces

Holo

The first trans­fer learn­ing was done with Holo of . It used a 512px Holo face dataset cre­ated with Na­gadomi’s crop­per from all of Dan­booru2017, up­scaled with waifu2x, cleaned by hand, and then data-aug­mented from n = 3900 to n = 12600; mir­ror­ing was en­abled since Holo is sym­met­ri­cal. I then used the anime face model as of 2019-02-09—it was not fully con­verged, in­deed, would­n’t con­verge with weeks more train­ing, but the qual­ity was so good I was too cu­ri­ous as to how well re­train­ing would work so I switched gears.

It’s worth men­tion­ing that this dataset was used pre­vi­ously with ProGAN, where after weeks of train­ing, ProGAN over­fit badly as demon­strated by the sam­ples & in­ter­po­la­tion videos.

Train­ing hap­pened re­mark­ably quick­ly, with all the faces con­verted to rec­og­niz­ably Holo faces within a few hun­dred it­er­a­tions:

Train­ing mon­tage of a Holo face model ini­tial­ized from the anime face StyleGAN (blink & you’ll miss it)
In­ter­po­la­tion video of the Holo face model ini­tial­ized from the anime face StyleGAN

The best sam­ples were con­vinc­ing with­out ex­hibit­ing the fail­ures of the ProGAN:

64 hand-s­e­lected Holo face sam­ples

The StyleGAN was much more successful, despite a few failure latent points carried over from the anime faces. Indeed, after a few hundred iterations, it was starting to overfit with the ‘crack’ artifacts & smearing in the interpolations. The latest I was willing to use was iteration #11370, and I think it is still somewhat overfit anyway. I thought that with its total n (after data augmentation), Holo would be able to train longer (being 1⁄7th the size of FFHQ), but apparently not. Perhaps the data augmentation is considerably less valuable than 1-for-1, either because the invariants encoded in the augmentations aren’t that useful (suggesting that Geirhos et al 2018-like style transfer data augmentation is what’s necessary) or that they would be but the anime face StyleGAN has already learned them all as part of the previous training & needs more real data to better understand Holo-like faces. It’s also possible that the results could be improved by using one of the later anime face StyleGANs, since they did improve when I trained them further after my 2 Holo/Asuka transfer experiments.

Nev­er­the­less, im­pressed, I could­n’t help but won­der if they had reached hu­man-levels of verisimil­i­tude: would an un­wary viewer as­sume they were hand­made?

So I selected ~100 of the best samples (24MB; Imgur mirror) from a dump of 2000, cropped about 5% from the left/right sides to hide the background artifacts a little bit, and submitted them on 2019-02-11 to /r/SpiceandWolf under an alt account. I made the mistake of sorting by filesize & thus leading with a face that was particularly suspicious (streaky hair), so one Redditor voiced the suspicion they were from MGM (absurd yet not entirely wrong), but all the other commenters took the faces in stride or praised them, and the submission received +248 votes (99% positive) by March. A Redditor then turned them all into a GIF video which earned +192 (100%) and many positive comments with no further suspicions until I explained. Not bad indeed.

The #11370 Holo StyleGAN model is avail­able for down­load.

Asuka

After the Holo train­ing & link sub­mis­sion went so well, I knew I had to try my other char­ac­ter dataset, Asuka, us­ing n = 5300 data-aug­mented to n = 58,000.36 Keep­ing in mind how data seemed to limit the Holo qual­i­ty, I left mir­ror­ing en­abled for Asuka, even though she is not sym­met­ri­cal due to her eye­patch over her left eye (as purists will no doubt note).

Train­ing mon­tage of an Asuka face model ini­tial­ized from the anime face StyleGAN
In­ter­po­la­tion video of the Asuka face model ini­tial­ized from the anime face StyleGAN

In­ter­est­ing­ly, while Holo trained within 4 GPU-hours, Asuka proved much more diffi­cult and did not seem to be fin­ished train­ing or show­ing the cracks de­spite train­ing twice as long. Is this due to hav­ing ~35% more real data, hav­ing 10× rather than 3× data aug­men­ta­tion, or some in­her­ent differ­ence like Asuka be­ing more com­plex (eg be­cause of more vari­a­tions in her ap­pear­ance like the eye­patches or plug­suit­s)?

I gen­er­ated 1000 ran­dom sam­ples with Ψ = 1.2 be­cause they were par­tic­u­larly in­ter­est­ing to look at. As with Holo, I picked out the best 100 (13MB; Imgur mir­ror) from ~2000:

64 hand-s­e­lected Asuka face sam­ples

And I sub­mit­ted to the /r/E­van­ge­lion sub­red­dit, where it also did well (+109, 98%); there were no spec­u­la­tions about the faces be­ing NN-gen­er­ated be­fore I re­vealed it, merely re­quests for more. Be­tween the two, it ap­pears that with ad­e­quate data (n > 3000) and mod­er­ate cu­ra­tion, a sim­ple kind of art Tur­ing test can be passed.

The #7903 Asuka StyleGAN model is avail­able for down­load.

Zuihou

In early February 2019, using the then-released model, Redditor Ending_Credits tried transfer learning to n = 500 faces of the Kantai Collection (KanColle) character Zuihou for ~1 tick (~60k iterations).

The sam­ples & in­ter­po­la­tions have many ar­ti­facts, but the sam­ple size is tiny and I’d con­sider this good fine­tun­ing from a model never in­tended for few-shot learn­ing:

StyleGAN trans­fer learn­ing from anime face StyleGAN to Kan­Colle Zui­hou by End­ing_­Cred­its, 8×15 ran­dom sam­ple grid
In­ter­po­la­tion video (4×4) of the Zui­hou face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its
In­ter­po­la­tion video (1×1) of the Zui­hou face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Prob­a­bly it could be made bet­ter by start­ing from the lat­est anime face StyleGAN mod­el, and us­ing ag­gres­sive data aug­men­ta­tion. An­other op­tion would be to try to find as many char­ac­ters which look sim­i­lar to Zui­hou (match­ing on hair color might work) and train on a joint dataset—un­con­di­tional sam­ples would then need to be fil­tered for just Zui­hou faces, but per­haps that draw­back could be avoided by a third stage of Zui­hou-only train­ing?

Ganso

Akizuki

An­other Kan­colle char­ac­ter, Ak­izuki, was trained in April 2019 by Gan­so.

Ptilopsis

In Jan­u­ary 2020, Ganso trained a StyleGAN 2 model from the S2 por­trait model on a tiny cor­pus of Ptilop­sis im­ages, a char­ac­ter from Arknights, a 2017 Chi­nese RPG mo­bile game.

Train­ing sam­ples of Ptilop­sis, Arknights (StyleGAN 2 por­traits trans­fer, by Gan­so)

Ptilopsis are owls (the character is named after the owl genus), and her character design shows prominent ears; despite the few images to work with (just 21 on Danbooru as of 2020-01-19), the interpolation shows smooth adjustments of the ears in all positions & alignments, demonstrating the power of transfer learning:

In­ter­po­la­tion video (4×4) of the Ptilop­sis face model ini­tial­ized from the anime face StyleGAN 2, trained by Ganso

Fate

Saber

Ending_Credits likewise did transfer to Saber (Fate/stay night), n = 4000. The results look about as expected given the sample sizes and previous transfer results:

In­ter­po­la­tion video (4×4) of the Saber face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Fate/Grand Order

Michael Sug­imura in May 2019 ex­per­i­mented with trans­fer learn­ing from the 512px anime por­trait GAN to faces cropped from ~6k wall­pa­pers he down­loaded via Google search queries. His re­sults for Saber & re­lated char­ac­ters look rea­son­able but more broad­ly, some­what low-qual­i­ty, which Sug­imura sus­pects is due to in­ad­e­quate data clean­ing (“there are a num­ber of lower qual­ity im­ages and also im­ages of back­grounds, ar­mor, non-char­ac­ter im­ages left in the dataset which causes weird ar­ti­facts in gen­er­ated im­ages or just lower qual­ity gen­er­ated im­ages.”).

Louise

Finally, Ending_Credits did transfer to Louise (The Familiar of Zero), n = 350:

In­ter­po­la­tion video (4×4) of the Louise face model ini­tial­ized from the anime face StyleGAN, trained by End­ing_­Cred­its

Not as good as Saber due to the much smaller sam­ple size.

Lelouch

roadrunner01 experimented with a number of transfers, including a transfer of the male character Lelouch (Code Geass) with n = 50 (!), which is not nearly as garbage as it should be.

Asashio

Flatis­Dogchi ex­per­i­mented with trans­fer to n = 988 (aug­mented to n = 18772) Asashio (Kan­Colle) faces, cre­at­ing “This Asashio Does Not Ex­ist”.

Marisa Kirisame & the Komeijis

A Japan­ese user mei_miya posted an in­ter­po­la­tion video of the Touhou char­ac­ter Marisa Kirisame by trans­fer learn­ing on 5000 faces. They also did the Touhou char­ac­ters Satori/Koishi Komeiji with n = 6000.

The Red­dit user Jepa­cor also has done Marisa, us­ing Dan­booru sam­ples.

Lexington

A Chi­nese user 3D_DLW (S2 write­up/­tu­to­ri­al: 1/2) in Feb­ru­ary 2020 did trans­fer­-learn­ing from the S2 por­trait model to Pixiv art­work of the char­ac­ter Lex­ing­ton from War­ship Girls. He used a sim­i­lar work­flow: crop­ping faces with lbpcascade_animeface, up­scal­ing with wai­fu2x, and clean­ing with ranker.py (us­ing the orig­i­nal S2 mod­el’s Dis­crim­i­na­tor & pro­duc­ing datasets of vary­ing clean­li­ness at n = 302–1659). Sam­ples:

Ran­dom sam­ples for anime por­trait S2 → War­ship Girls char­ac­ter Lex­ing­ton.

Hayasaka Ai

Tazik Shahjahan finetuned S2 on Kaguya-sama: Love Is War’s Hayasaka Ai, providing a Colab notebook demonstrating how he scraped Pixiv and filtered out invalid images to create the training corpus.

Ahegao

CaJI9I created an “ahegao” StyleGAN; unspecified corpus or method:

6×6 sam­ple of ahe­gao StyleGAN faces

Emilia (Re:Zero)

In­ter­po­la­tion video (4×4) of the Emilia face model ini­tial­ized from the Por­trait StyleGAN, trained by ship­blaz­er420

Anime Faces → Anime Headshots

Twit­ter user Sunk did trans­fer learn­ing to an im­age cor­pus of a spe­cific artist, Kure­hito Mis­aki (深崎暮人), n≅1000. His im­ages work well and the in­ter­po­la­tion looks nice:

Interpolation video (4×4) of the Kurehito Misaki face model initialized from the anime face StyleGAN, trained by Sunk

Anime Faces → Portrait

TWDNE was a huge suc­cess and pop­u­lar­ized the anime face StyleGAN. It was not per­fect, though, and flaws were not­ed.

Portrait Improvements

The portraits could be improved by more carefully selecting SFW images to avoid overly-suggestive faces, and expanding the crops to avoid cutting off the edges of heads (like hairstyles).

**For details and downloads, please see the portraits dataset page.**

Portrait Results

After re­train­ing the fi­nal face StyleGAN 2019-03-08–2019-04-30 on the new im­proved por­traits dataset, the re­sults im­proved:

Train­ing sam­ple for Por­trait StyleGAN: 2019-04-30/it­er­a­tion #66,083
In­ter­po­la­tion video (4×4) of the Dan­booru2018 por­trait model ini­tial­ized from the Dan­booru2017 face StyleGAN

This S1 anime por­trait model is ob­so­leted by the StyleGAN 2 por­trait model.

The fi­nal model from 2019-04-30 is avail­able for down­load.

I used this model at 𝛙=0.5 to gen­er­ate 100,000 new por­traits for TWDNE (#100,000–199,999), bal­anc­ing the pre­vi­ous faces.

I was sur­prised how diffi­cult up­grad­ing to por­traits seemed to be; I spent al­most two months train­ing it be­fore giv­ing up on fur­ther im­prove­ments, while I had been ex­pect­ing more like a week or two. The por­trait re­sults are in­deed bet­ter than the faces (I was right that not crop­ping off the top of the head adds verisimil­i­tude), but the up­grade did­n’t im­press me as much as the orig­i­nal faces did com­pared to ear­lier GANs. And our other ex­per­i­men­tal runs on whole-Dan­booru2018 im­ages never pro­gressed be­yond sug­ges­tive blobs dur­ing this pe­ri­od.

I sus­pect that StyleGAN—at least, on its de­fault ar­chi­tec­ture & hy­per­pa­ra­me­ters, with­out a great deal more com­pute—is reach­ing its lim­its here, and that changes may be nec­es­sary to scale to richer im­ages. (Self-at­ten­tion is prob­a­bly the eas­i­est to add since it should be easy to plug in ad­di­tional lay­ers to the con­vo­lu­tion code.)

Anime Faces → Male Faces

A few peo­ple have ob­served that it would be nice to have an anime face GAN for male char­ac­ters in­stead of al­ways gen­er­at­ing fe­male ones. The anime face StyleGAN does in fact have male faces in its dataset as I did no fil­ter­ing—it’s merely that fe­male faces are over­whelm­ingly fre­quent (and it may also be that male anime faces are rel­a­tively an­drog­y­nous/fem­i­nized any­way so it’s hard to tell any differ­ence be­tween a fe­male with short hair & a guy37).

Train­ing a male-only anime face StyleGAN would be an­other good ap­pli­ca­tion of trans­fer learn­ing.

The faces can be eas­ily ex­tracted out of Dan­booru2018 by query­ing for "male_focus", which will pick up ~150k im­ages. More nar­row­ly, one could search "1boy" & "solo", to en­sure that the only face in the im­age is a male face (as op­posed to, say, 1boy 1girl, where a fe­male face might be cropped out as well). This pro­vides n = 99k raw hits. It would be good to also fil­ter out ‘trap’ or over­ly-fe­male-look­ing faces (else what’s the point?), by fil­ter­ing on tags like cat ears or par­tic­u­larly pop­u­lar ‘trap’ char­ac­ters like Fate/­Grand Or­der’s As­tol­fo. A more com­pli­cated query to pick up scenes with mul­ti­ple males could be to search for both "1boy" & "multiple_boys" and then fil­ter out "1girl" & "multiple_girls", in or­der to se­lect all im­ages with 1 or more males and then re­move all im­ages with 1 or more fe­males; this dou­bles the raw hits to n = 198k. (A down­side is that the face-crop­ping will often un­avoid­ably yield crops with two faces, a pri­mary face and an over­lap­ping face, which is bad and in­tro­duces ar­ti­fact­ing when I tried this with all faces.)
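A sketch of that sort of metadata filtering in Python, assuming the Danbooru metadata is available as JSON Lines with a `tags` list of `{"name": …}` objects (field names should be checked against the actual metadata dump; `metadata.jsonl` is a placeholder filename):

```python
import json

def tag_names(post):
    """Set of tag names on one Danbooru metadata record (assumed JSONL format)."""
    return {t['name'] for t in post.get('tags', [])}

def male_solo(post):                                  # "1boy" & "solo": ~99k raw hits
    tags = tag_names(post)
    return '1boy' in tags and 'solo' in tags

def males_no_females(post):                           # 1+ males and 0 females: ~198k raw hits
    tags = tag_names(post)
    return (('1boy' in tags or 'multiple_boys' in tags)
            and '1girl' not in tags and 'multiple_girls' not in tags)

with open('metadata.jsonl') as f:
    male_ids = [post['id'] for post in map(json.loads, f) if male_solo(post)]
print(len(male_ids))
```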

Com­bined with trans­fer learn­ing from the gen­eral anime face StyleGAN, the re­sults should be as good as the gen­eral (fe­male) faces.

I set­tled for "1boy" & "solo", and did con­sid­er­able clean­ing by hand. The raw count of im­ages turned out to be highly mis­lead­ing, and many faces are un­us­able for a male anime face StyleGAN: many are so highly styl­ized (such as ac­tion sce­nes) as to be dam­ag­ing to a GAN, or they are al­most in­dis­tin­guish­able from fe­male faces (be­cause they are bis­honen or trap or just an­drog­y­nous), which would be point­less to in­clude (the reg­u­lar por­trait StyleGAN cov­ers those al­ready). After hand clean­ing & use of ranker.py, I was left with n~3k, so I used heavy data aug­men­ta­tion to bring it up to n~57k, and I ini­tial­ized from the fi­nal por­trait StyleGAN for the high­est qual­i­ty.

It did not overfit after ~4 days of training, but the results were not noticeably improving, so I stopped (in order to start training the GPT-2-345M which OpenAI had just released). There are hints in the interpolation videos, I think, that it is indeed slightly overfitting, in the form of ‘glitches’ where the image abruptly jumps slightly, presumably to another mode/face/character of the original data; nevertheless, the male face StyleGAN mostly works.

Train­ing sam­ples for the male por­trait StyleGAN (2019-05-03); com­pare with the same la­ten­t-space points in the orig­i­nal por­trait StyleGAN.
In­ter­po­la­tion video (4×4) of the Dan­booru2018 male faces model ini­tial­ized from the Dan­booru2018 por­trait StyleGAN

The male face StyleGAN model is avail­able for down­load, as is 1000 ran­dom faces with 𝛙=0.7 (mir­ror; par­tial Imgur al­bum).

Anime Faces → Ukiyo-e Faces

In January 2020, Justin (@Buntworthy) used 5000 ukiyo-e faces, cropped from a ukiyo-e artwork collection, to do transfer learning. After ~24h training:

Justin’s ukiy­o-e StyleGAN sam­ples, 2020-01-04.

Anime Faces → Western Portrait Faces

In 2019, aydao experimented with transfer learning to European portrait faces drawn from WikiArt; the transfer learning was done via Nathan Shipley’s trick of abusing model weight-averaging, where two models are simply averaged together, parameter by parameter and layer by layer, to yield a new model. (Surprisingly, this works—as long as the models aren’t too different; if they are, the averaged model will generate only colorful blobs.) The results were amusing. From early in training:

ay­dao 2019, anime faces → west­ern por­trait train­ing sam­ples (ear­ly)

Lat­er:

ay­dao 2019, anime faces → west­ern por­trait train­ing sam­ples (later)

Anime Faces → Danbooru2018

nshep­perd be­gan a train­ing run us­ing an early anime face StyleGAN model on the 512px SFW Dan­booru2018 sub­set; after ~3–5 weeks (with many in­ter­rup­tions) on 1 GPU, as of 2019-03-22, the train­ing sam­ples look like this:

StyleGAN train­ing sam­ples on Dan­booru2018 SFW 512px; it­er­a­tion #14204 (n­shep­perd)
Real 512px SFW Dan­booru2018 train­ing dat­a­points, for com­par­i­son
Train­ing mon­tage video of the Dan­booru2018 model (up to #14204, 2019-03-22), trained by nshep­perd

The StyleGAN is able to pick up global struc­ture and there are rec­og­niz­ably anime fig­ures, de­spite the sheer di­ver­sity of im­ages, which is promis­ing. The fine de­tails are se­ri­ously lack­ing, and train­ing, to my eye, is wan­der­ing around with­out any steady im­prove­ment or sharp de­tails (ex­cept per­haps the faces which are in­her­ited from the pre­vi­ous mod­el). I sus­pect that the learn­ing rate is still too high and, es­pe­cially with only 1 GPU/n = 4, such small mini­batches don’t cover enough modes to en­able steady im­prove­ment. If so, the LR will need to be set much lower (or gra­di­ent ac­cu­mu­la­tion used in or­der to fake hav­ing large mini­batches where large LRs are sta­ble) & train­ing time ex­tended to mul­ti­ple months. An­other pos­si­bil­ity would be to restart with added self­-at­ten­tion lay­ers, which I have no­ticed seem to par­tic­u­larly help with com­pli­cated de­tails & sharp­ness; the style noise ap­proach may be ad­e­quate for the job but just a few vanilla con­vo­lu­tion lay­ers may be too few (pace the BigGAN re­sults on the ben­e­fits of in­creas­ing depth while de­creas­ing pa­ra­me­ter coun­t).

FFHQ Variations

Anime Faces → FFHQ Faces

If StyleGAN can smoothly warp anime faces among each other and express global transforms like hair length+color with Ψ, could Ψ be a quick way to gain control over a single large-scale variable? For example, male vs female faces, or… anime ↔︎ real faces? (Given a particular image/latent vector, one would simply flip the sign to convert it to the opposite; this would give the opposite version of each random face, and if one had an encoder, one could automatically anime-fy or real-fy an arbitrary face by encoding it into the latent vector which creates it, and then flipping.38)

Since Karras et al 2018 provide a nice FFHQ download script (albeit slower than I’d like once Google Drive rate-limits you a wallclock hour into the full download) for the full-resolution PNGs, it would be easy to downscale to 512px and create a 512px FFHQ dataset to train on, or even create a combined anime+FFHQ dataset.
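For instance, downscaling the 1024px FFHQ PNGs to 512px is a few lines of Python (directory names here are hypothetical; the resulting folder can then be packed into TFRecords with the StyleGAN repo’s dataset_tool.py as usual):

```python
from pathlib import Path
from PIL import Image

src, dst = Path('ffhq-1024'), Path('ffhq-512')   # hypothetical input/output directories
dst.mkdir(exist_ok=True)
for png in src.glob('*.png'):
    Image.open(png).resize((512, 512), Image.LANCZOS).save(dst / png.name)

# then, eg:  python dataset_tool.py create_from_images datasets/ffhq-512 ffhq-512/
```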

The first and fastest thing was to do trans­fer learn­ing from the anime faces to FFHQ real faces. It was un­likely that the model would re­tain much anime knowl­edge & be able to do mor­ph­ing, but it was worth a try.

The ini­tial re­sults early in train­ing are hi­lar­i­ous and look like zom­bies:

Ran­dom train­ing sam­ples of anime face → FFHQ-only StyleGAN trans­fer learn­ing, show­ing bizarrely-arte­fac­tual in­ter­me­di­ate faces
In­ter­po­la­tion video (4×4) of the FFHQ face model ini­tial­ized from the anime face StyleGAN, a few ticks into train­ing, show­ing bizarre ar­ti­facts

After 97 ticks, the model has con­verged to a bor­ingly nor­mal ap­pear­ance, with the only hint of its ori­gins be­ing per­haps some ex­ces­sive­ly-fab­u­lous hair in the train­ing sam­ples:

Anime faces → FFHQ-only StyleGAN train­ing sam­ples after much con­ver­gence, show­ing ani­me-ness largely washed out

Anime Faces → Anime Faces + FFHQ Faces

So, that was a bust. The next step is to try train­ing on anime & FFHQ faces si­mul­ta­ne­ous­ly; given the stark differ­ence be­tween the datasets, would pos­i­tive vs neg­a­tive Ψ wind up split­ting into real vs anime and pro­vide a cheap & easy way of con­vert­ing ar­bi­trary faces?

This sim­ply merged the 512px FFHQ faces with the 512px anime faces and re­sumed train­ing from the pre­vi­ous FFHQ model (I rea­soned that some of the ani­me-ness should still be in the mod­el, so it would be slightly faster than restart­ing from the orig­i­nal anime face mod­el). I trained it for 812 it­er­a­tions, #11,359–12,171 (some­what over 2 GPU-days), at which point it was mostly done.

It did man­age to learn both kinds of faces quite well, sep­a­rat­ing them clearly in ran­dom sam­ples:

Ran­dom train­ing sam­ples, anime+FFHQ StyleGAN

How­ev­er, the style trans­fer & Ψ sam­ples were dis­ap­point­ments. The style mix­ing shows lim­ited abil­ity to mod­ify faces cross-do­main or con­vert them, and the trun­ca­tion trick chart shows no clear dis­en­tan­gle­ment of the de­sired fac­tor (in­deed, the var­i­ous halves of Ψ cor­re­spond to noth­ing clear):

Style mix­ing re­sults for the anime+FFHQ StyleGAN
Trun­ca­tion trick re­sults for the anime+FFHQ StyleGAN

The in­ter­po­la­tion video does show that it learned to in­ter­po­late slightly be­tween real & anime faces, giv­ing half-anime/half-real faces, but it looks like it only hap­pens some­times—­mostly with young fe­male faces39:

In­ter­po­la­tion video (4×4) of the FFHQ+anime face mod­el, after con­ver­gence.

They’re hard to spot in the in­ter­po­la­tion video be­cause the tran­si­tion hap­pens abrupt­ly, so I gen­er­ated sam­ples & se­lected some of the more in­ter­est­ing ani­me-ish faces:

Se­lected sam­ples from the anime+FFHQ StyleGAN, show­ing cu­ri­ous ‘in­ter­me­di­ate’ faces (4×4 grid)

Sim­i­lar­ly, Alexan­der Reben trained a StyleGAN on FFHQ+Western por­trait il­lus­tra­tions, and the in­ter­po­la­tion video is much smoother & more mixed, sug­gest­ing that more re­al­is­tic & more var­ied il­lus­tra­tions are eas­ier for StyleGAN to in­ter­po­late be­tween.

Anime Faces + FFHQ → Danbooru2018

While I did­n’t have the com­pute to prop­erly train a Dan­booru2018 StyleGAN, after nshep­perd’s re­sults, I was cu­ri­ous and spent some time (817 it­er­a­tions, so ~2 GPU-days?) re­train­ing the anime face+FFHQ model on Dan­booru2018 SFW 512px im­ages.

The train­ing mon­tage is in­ter­est­ing for show­ing how faces get re­pur­posed into fig­ures:

Train­ing mon­tage video of a Dan­booru2018 StyleGAN ini­tial­ized on an anime faces+FFHQ StyleGAN.

One might think that it is a bridge too far for trans­fer learn­ing, but it seems not.

Reversing StyleGAN To Control & Modify Images

Modifying images is harder than generating them. If we had a conditional anime face GAN like Arfafax’s, then we would be fine, but if we have an unconditional architecture of some sort, then what? An unconditional GAN architecture is, by default, ‘one-way’: the latent vector z gets generated from a bunch of 𝒩(0,1) variables, fed through the GAN, and out pops an image. There is no way to run the unconditional GAN ‘backwards’ to feed in an image and pop out the z instead.40

If one could, one could take an arbitrary image and encode it into the z and, by jittering z, generate many new versions of it; or one could feed it back into StyleGAN and play with the style noises at various levels in order to transform the image; or do things like ‘average’ two images or create interpolations between two arbitrary faces; or one could (assuming one knew what each variable in z ‘means’) edit the image to change things like which direction their head tilts or whether they are smiling.

There are some at­tempts at learn­ing con­trol in an un­su­per­vised fash­ion (eg , GANSpace) but while ex­cel­lent start­ing points, they have lim­its and may not find a spe­cific con­trol that one wants.

The most straight­for­ward way would be to switch to a con­di­tional GAN ar­chi­tec­ture based on a text or tag em­bed­ding. Then to gen­er­ate a spe­cific char­ac­ter wear­ing glass­es, one sim­ply says as much as the con­di­tional in­put: "character glasses". Or if they should be smil­ing, add "smile". And so on. This would cre­ate im­ages of said char­ac­ter with the de­sired mod­i­fi­ca­tions. This op­tion is not avail­able at the mo­ment as cre­at­ing a tag em­bed­ding & train­ing StyleGAN re­quires quite a bit of mod­i­fi­ca­tion. It also is not a com­plete so­lu­tion as it would­n’t work for the cases of edit­ing an ex­ist­ing im­age.

For an un­con­di­tional GAN, there are two com­ple­men­tary ap­proaches to in­vert­ing the G:

  1. what one NN can learn to decode, another can learn to encode:

    If StyleGAN has learned z→im­age, then train a sec­ond en­coder NN on the su­per­vised learn­ing prob­lem of im­age→z! The sam­ple size is in­fi­nite (just keep run­ning G) and the map­ping is fixed (given a fixed G), so it’s ugly but not that hard.

  2. backpropagate a pixel or feature-level loss to ‘optimize’ a latent code:

    While StyleGAN is not inherently reversible, it’s not a blackbox: being a NN trained by gradient descent, it must admit of gradients. In training neural networks, there are 3 components: inputs, model parameters, and outputs/losses, and thus there are 3 ways to use backpropagation, even if we usually only use 1. One can hold the inputs fixed and vary the model parameters in order to change (usually reduce) a loss on the outputs, which is training a NN; one can hold the parameters fixed and vary the inputs in order to change (often maximize) outputs or internal activations such as particular layers or neurons, which corresponds to neural network visualizations & exploration; and finally, one can hold the parameters & a desired output fixed, and use the gradients to iteratively find a set of inputs which creates that specific output with a low loss (eg optimize a wheel-shape input for rolling-efficiency output).41

    This can be used to create images which are ‘optimized’ in some sense. For example, activation maximization demonstrates how images of ImageNet classes can be pulled out of a standard CNN classifier by backprop over the classifier to maximize a particular output class; or one can redesign an input for easier classification by a model; more amusingly, gradient ascent42 on the individual pixels of an image can be done to minimize/maximize a NSFW classifier’s prediction. This can also be done on a higher level by trying to maximize similarity to a NN embedding of an image to make it as ‘similar’ as possible, as was done originally in Gatys et al 2014 for style transfer, or for more complicated kinds of style transfer like in “Differentiable Image Parameterizations: A powerful, under-explored tool for neural network visualizations and art”.

    In this case, to recover a desired image’s z, one can initialize a random z, run it forward through the GAN to get an image, compare it at the pixel level with the desired (fixed) image, and treat the total difference as the ‘loss’; holding the GAN fixed, the backpropagation goes back through the model and adjusts the inputs (the unfixed z) to make it slightly more like the desired image. Done many times, the final z will now yield something like the desired image, and that can be treated as its true z. Comparing at the pixel-level can be improved by instead looking at the higher layers in a NN trained to do classification (often an ImageNet VGG), which will focus more on the semantic similarity (more of a “perceptual loss”) rather than misleading details of static & individual pixels. The latent code can be the original z, or z after it’s passed through the stack of 8 FC layers and has been transformed, or it can even be the various per-layer style noises inside the CNN part of StyleGAN; the last is what style-image-prior uses, and others argue43 that it works better to target the layer-wise encodings than the original z.

    This may not work too well, as the local optima might be bad or the GAN may have trouble generating precisely the desired image no matter how carefully it is optimized, the pixel-level loss may not be a good loss to use, and the whole process may be quite slow, especially if one runs it many times with many different initial random z to try to avoid bad local optima. But it does mostly work. (A minimal sketch of this latent-optimization loop follows this list.)

  3. En­code+Back­prop­a­gate is a use­ful hy­brid strat­e­gy: the en­coder makes its best guess at the z, which will usu­ally be close to the true z, and then back­prop­a­ga­tion is done for a few it­er­a­tions to fine­tune the z. This can be much faster (one for­ward pass vs many for­ward+back­ward pass­es) and much less prone to get­ting stuck in bad lo­cal op­tima (s­ince it starts at a good ini­tial z thanks to the en­coder).

    **Comparison with editing in flow-based models**: On a tangent, editing/reversing is one of the great advantages44 of ‘flow’-based NN models such as Glow, which is one of the families of NN models competitive with GANs for high-quality image generation (along with autoregressive pixel prediction models like PixelRNN, and VAEs). Flow models have the same shape as GANs in pushing a random latent vector z through a series of upscaling convolution or other layers to produce final pixel values, but flow models use a carefully-limited set of primitives which make the model runnable both forwards and backwards exactly. This means every set of pixels corresponds to a unique z and vice-versa, and so an arbitrary set of pixels can be put in and the model run backwards to yield the exact corresponding z. There is no need to fight with the model to create an encoder to reverse it or use backpropagation optimization to try to find something almost right, as the flow model can already do this. This makes editing easy: plug the image in, get out the exact z with the equivalent of a single forward pass, figure out which part of z controls a desired attribute like ‘glasses’, change that, and run it forward. The downside of flow models, which is why I do not (yet) use them, is that the restriction to reversible layers means that they are typically much larger and slower to train than a more-or-less perceptually equivalent GAN model, by easily an order of magnitude (for Glow). When I tried Glow, I could barely run an interesting model despite aggressive memory-saving techniques, and I didn’t get anywhere interesting with the several GPU-days I spent (which was unsurprising when I realized how many GPU-months OA had spent). Since high-quality photorealistic GANs are at the limit of 2019 trainability for most researchers or hobbyists, flow models are clearly out of the question despite their many practical & theoretical advantages—they’re just too expensive! However, there is no known reason flow models couldn’t be competitive with GANs (they will probably always be larger, but because they are more correct & do more), and future improvements or hardware scaling may make them more viable, so flow-based models are an approach to keep an eye on.
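As promised above, a minimal sketch of the optimization approach (#2, with the optional encoder initialization of #3), written PyTorch-style with a VGG16 ‘perceptual’ loss; `G` is assumed to be a differentiable generator mapping a (1, 512) latent to an image, which is not how the actual TensorFlow StyleGAN code is organized:

```python
import torch, torchvision

def invert(G, target_img, encoder=None, steps=500, lr=0.05):
    # 1. initialize z: randomly, or (hybrid approach #3) from an encoder's best guess
    z = encoder(target_img) if encoder is not None else torch.randn(1, 512)
    z = z.clone().detach().requires_grad_(True)

    # perceptual loss: compare mid-level VGG16 features rather than raw pixels
    vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
    for p in vgg.parameters():
        p.requires_grad_(False)
    target_feats = vgg(target_img)

    opt = torch.optim.Adam([z], lr=lr)     # only z is optimized; G & VGG stay fixed
    for _ in range(steps):
        opt.zero_grad()
        img = G(z)
        loss = (torch.nn.functional.l1_loss(vgg(img), target_feats)
                + 0.1 * torch.nn.functional.mse_loss(img, target_img))  # small pixel-level term
        loss.backward()                    # gradients flow through G back into z
        opt.step()
    return z.detach()                      # approximate latent for target_img
```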

One of those 3 approaches will encode an image into a latent z. So far so good; that enables things like generating randomly-different versions of a specific image or interpolating between 2 images, but how does one control the z in a more intelligent fashion to make specific edits?

If one knew what each vari­able in the z meant, one could sim­ply slide them in the −1/+1 range, change the z, and gen­er­ate the cor­re­spond­ing edited im­age. But there are 512 vari­ables in z (for StyleGAN), which is a lot to ex­am­ine man­u­al­ly, and their mean­ing is opaque as StyleGAN does­n’t nec­es­sar­ily map each vari­able onto a hu­man-rec­og­niz­able fac­tor like ‘smil­ing’. A rec­og­niz­able fac­tor like ‘eye­glasses’ might even be gov­erned by mul­ti­ple vari­ables si­mul­ta­ne­ously which are non­lin­early in­ter­act­ing.

As al­ways, the so­lu­tion to one mod­el’s prob­lems is yet more mod­els; to con­trol the z, like with the en­coder, we can sim­ply train yet an­other model (per­haps just a lin­ear clas­si­fier or ran­dom forests this time) to take the z of many im­ages which are all la­beled ‘smil­ing’ or ‘not smil­ing’, and learn what parts of z cause ‘smil­ing’ (eg ). These ad­di­tional mod­els can then be used to con­trol a z. The nec­es­sary la­bels (a few hun­dred to a few thou­sand will be ad­e­quate since the z is only 512 vari­ables) can be ob­tained by hand or by us­ing a pre-ex­ist­ing clas­si­fi­er.
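As a concrete sketch of that recipe (the file names and the 512-dimensional z here are illustrative assumptions, not part of any released tool): fit a simple linear classifier on labeled latents and use its weight vector as the edit direction.

    # Learn a 'smiling' direction in latent space from labeled (z, label) pairs.
    # latents.npy / smiling_labels.npy are hypothetical files of one's own samples.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    latents = np.load("latents.npy")          # shape (n, 512): z (or w) of generated faces
    labels = np.load("smiling_labels.npy")    # shape (n,): 1 = smiling, 0 = not smiling

    clf = LogisticRegression(max_iter=1000).fit(latents, labels)
    direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])   # unit normal of the decision boundary

    def edit_smile(z, alpha):
        """Nudge a latent along the smiling axis; alpha > 0 adds a smile, alpha < 0 removes it."""
        return z + alpha * direction

A random forest or small MLP would work too, but the linear version has the convenient property that its weight vector is directly the edit direction.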

So, the pieces of the puzzle & putting it all together: a trained anime face StyleGAN to generate samples; a tagger (such as DeepDanbooru) to label those samples; a StyleGAN encoder to map images back into the latent space; and a TL-GAN-style controller which turns the labeled latents into edit directions.

The fi­nal re­sult is in­ter­ac­tive edit­ing of anime faces along many differ­ent fac­tors:

snowy halcy (MP4) demon­strat­ing in­ter­ac­tive edit­ing of StyleGAN anime faces us­ing anime-face-StyleGAN+DeepDanbooru+StyleGAN-encoder+TL-GAN

Editing Rare Attributes

A strat­egy of hand-edit­ing or us­ing a tag­ger to clas­sify at­trib­utes works for com­mon ones which will be well-rep­re­sented in a sam­ple of a few thou­sand since the clas­si­fier needs a few hun­dred cases to work with, but what about rarer at­trib­utes which might ap­pear only on one in a thou­sand ran­dom sam­ples, or at­trib­utes too rare in the dataset for StyleGAN to have learned, or at­trib­utes which may not be in the dataset at all? Edit­ing “red eyes” should be easy, but what about some­thing like “bunny ears”? It would be amus­ing to be able to edit por­traits to add bunny ears, but there aren’t that many bunny ear sam­ples (although cat ears might be much more com­mon); is one doomed to gen­er­ate & clas­sify hun­dreds of thou­sands of sam­ples to en­able bunny ear edit­ing? That would be in­fea­si­ble for hand la­bel­ing, and diffi­cult even with a tag­ger.

One sug­ges­tion I have for this use-case would be to briefly train an­other StyleGAN model on an en­riched or boosted dataset, like a dataset of 50:50 bunny ear im­ages & nor­mal im­ages. If one can ob­tain a few thou­sand bunny ear im­ages, then this is ad­e­quate for trans­fer learn­ing (com­bined with a few thou­sand ran­dom nor­mal im­ages from the orig­i­nal dataset), and one can re­train the StyleGAN on an equal bal­ance of im­ages. The high pres­ence of bunny ears will en­sure that the StyleGAN quickly learns all about those, while the nor­mal im­ages pre­vent it from over­fit­ting or cat­a­strophic for­get­ting of the full range of im­ages.
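A hedged sketch of assembling such a 50:50 'boosted' dataset (the directory names here are made up for illustration); the resulting folder can then be converted to .tfrecords and used for a brief transfer-learning run as usual:

    # Build a balanced transfer-learning dataset: every bunny-ear image plus an
    # equal number of randomly-chosen normal faces, symlinked into one directory.
    import os
    import random

    rare = [os.path.join("bunny_ears", f) for f in os.listdir("bunny_ears")]
    normal = [os.path.join("normal_faces", f) for f in os.listdir("normal_faces")]
    random.shuffle(normal)

    os.makedirs("boosted_dataset", exist_ok=True)
    for i, path in enumerate(rare + normal[:len(rare)]):   # 50:50 mix of rare & normal
        ext = os.path.splitext(path)[1]
        os.symlink(os.path.abspath(path), os.path.join("boosted_dataset", f"{i:06d}{ext}"))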

This new bun­ny-ear StyleGAN will then pro­duce bun­ny-ear sam­ples half the time, cir­cum­vent­ing the rare base rate is­sue (or fail­ure to learn, or nonex­is­tence in dataset), and en­abling effi­cient train­ing of a clas­si­fi­er. And since nor­mal faces were used to pre­serve its gen­eral face knowl­edge de­spite the trans­fer learn­ing po­ten­tially de­grad­ing it, it will re­main able to en­code & op­ti­mize nor­mal faces. (The orig­i­nal clas­si­fiers may even be reusable on this, de­pend­ing on how ex­treme the new at­tribute is, as the la­tent space z might not be too affected by the new at­tribute and the var­i­ous other at­trib­utes ap­prox­i­mately main­tain the orig­i­nal re­la­tion­ship with z as be­fore the re­train­ing.)

StyleGAN 2

StyleGAN 2 (source (PyTorch), video) eliminates blob artifacts, adds a native encoding 'projection' feature for editing, simplifies the runtime by scrapping progressive growing in favor of a MSG-GAN-like multi-scale architecture, & has higher overall quality—but similar total training time/requirements.45

I used a 512px anime portrait S2 model trained by Aaron Gokaslan to create TWDNEv3:

100 ran­dom sam­ple im­ages from the StyleGAN 2 anime por­trait faces in TWDNEv3, arranged in a 10×10 grid.

Train­ing sam­ples:

It­er­a­tion #24,303 of Gokaslan’s train­ing of an anime por­trait StyleGAN 2 model (train­ing sam­ples)

The model was trained to iteration #24,664 for >2 weeks on 4 Nvidia 2080ti GPUs at 35–70s per 1k images. The Tensorflow S2 model is available for download (320MB).46 (PyTorch & Onnx versions have been made by Anton using a custom repo. Note that both my face & portrait models can be run via the GenForce PyTorch repo as well.) This model can be used in Google Colab (demonstration notebook, although it seems it may pull in an older S2 model), & the model can also be used with the S2 codebase for encoding anime faces.
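For reference, a minimal sampling sketch in the style of the S2 codebase's pretrained_example.py, assuming it is run from inside the stylegan2 repo (so dnnlib is importable) under TF 1.14/1.15, with the portrait .pkl downloaded above (decompressed first if you fetched the .xz mirror):

    # Sample one portrait from the downloaded S2 model (pretrained_example.py style).
    import pickle
    import numpy as np
    import PIL.Image
    import dnnlib.tflib as tflib

    tflib.init_tf()
    with open('2020-01-11-skylion-stylegan2-animeportraits-networksnapshot-024664.pkl', 'rb') as f:
        _G, _D, Gs = pickle.load(f)          # Gs = the averaged Generator used for sampling

    rnd = np.random.RandomState(42)
    z = rnd.randn(1, *Gs.input_shape[1:])    # (1, 512) latent
    fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
    images = Gs.run(z, None, truncation_psi=0.7, randomize_noise=True, output_transform=fmt)
    PIL.Image.fromarray(images[0], 'RGB').save('portrait-sample.png')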

Running S2

Because of the optimizations, which require custom local compilation of CUDA code for maximum efficiency, getting S2 running can be more challenging than getting S1 running.

  • No Ten­sor­Flow 2 com­pat­i­bil­i­ty: the TF ver­sion must be 1.14/1.15. Try­ing to run with TF 2 will give er­rors like: TypeError: int() argument must be a string, a bytes-like object or a number, not 'Tensor'.

    I ran into cuDNN com­pat­i­bil­ity prob­lems with TF 1.15 (which re­quires cuDNN >7.6.0, 2019-05-20, for CUDA 10.0), which gave er­rors like this:

    ...[2020-01-11 23:10:35.234784: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library:
       7.4.2 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher
       minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.
       If building from sources, make sure the library loaded at runtime is compatible with the version specified
       during compile configuration...

    But then with 1.14, the tpu-estimator li­brary was not found! (I ul­ti­mately took the risk of up­grad­ing my in­stal­la­tion with libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb, and thank­ful­ly, that worked and did not seem to break any­thing else.)

  • Get­ting the en­tire pipeline to com­pile the cus­tom ops in a Conda en­vi­ron­ment was an­noy­ing so Gokaslan tweaked it to use 1.14 on Lin­ux, used cudatoolkit-dev from Conda Forge, and changed the build script to use gcc-7 (s­ince gcc-8 was un­sup­port­ed)

  • one issue with TensorFlow 1.14 is that you need to force allow_growth or it will error out on Nvidia 2080tis (see the snippet after this list)

  • con­fig name change: train.py has been re­named (a­gain) to run_training.py

  • buggy learn­ing rates: S2 (but not S1) ac­ci­den­tally uses the same LR for both G & D; ei­ther fix this or keep it in mind when do­ing LR tun­ing—changes to D_lrate do noth­ing!

  • n = 1 minibatch problems: S2 is not a large NN so it can be trained on low-end GPUs; however, the S2 code makes an unnecessary assumption that n≥2; to fix this, patch training/loss.py as follows (already fixed in Shawn Presser's TPU/self-attention oriented fork):

    @@ -157,9 +157,8 @@ def G_logistic_ns_pathreg(G, D, opt, training_set, minibatch_size, pl_minibatch_
         with tf.name_scope('PathReg'):
     
             # Evaluate the regularization term using a smaller minibatch to conserve memory.
    -        if pl_minibatch_shrink > 1 and minibatch_size > 1:
    -            assert minibatch_size % pl_minibatch_shrink == 0
    -            pl_minibatch = minibatch_size // pl_minibatch_shrink
    +        if pl_minibatch_shrink > 1:
    +            pl_minibatch = tf.maximum(1, minibatch_size // pl_minibatch_shrink)
             pl_latents = tf.random_normal([pl_minibatch] + G.input_shapes[0][1:])
             pl_labels = training_set.get_random_labels_tf(pl_minibatch)
             fake_images_out, fake_dlatents_out = G.get_output_for(pl_latents, pl_labels, is_training=True, return_dlatents=True)

  • S2 has some sort of mem­ory leak, pos­si­bly re­lated to the FID eval­u­a­tions, re­quir­ing reg­u­lar restarts, like putting it into a loop
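For the allow_growth issue mentioned above, the plain TF 1.x equivalent looks like the following; in the S2 codebase the same option has to be enabled through its own TF config handling rather than by constructing a session directly, so treat this as an illustration of the setting rather than a drop-in patch:

    # Plain TF 1.x illustration of the allow_growth workaround: by default TF
    # pre-allocates nearly all VRAM at startup, which is what triggers the
    # 2080ti errors noted above.
    import tensorflow as tf   # TF 1.14/1.15 only

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True    # allocate GPU memory on demand
    session = tf.Session(config=config)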

Once S2 was run­ning, Gokaslan trained the S2 por­trait model with gen­er­ally de­fault hy­per­pa­ra­me­ters.

Future Work

Some open ques­tions about StyleGAN’s ar­chi­tec­ture & train­ing dy­nam­ics:

  • is pro­gres­sive grow­ing still nec­es­sary with StyleGAN? (StyleGAN 2 im­plies that it is not, as it uses a MSG-GAN-like ap­proach)
  • are 8×512 FC lay­ers nec­es­sary? (Pre­lim­i­nary BigGAN work sug­gests that they are not nec­es­sary for BigGAN.)
  • what are the wrinkly-line/cracks noise ar­ti­facts which ap­pear at the end of train­ing?
  • how does StyleGAN com­pare to BigGAN in fi­nal qual­i­ty?

Fur­ther pos­si­ble work:

  • ex­plo­ration of “cur­ricu­lum learn­ing”: can train­ing be sped up by train­ing to con­ver­gence on small n and then pe­ri­od­i­cally ex­pand­ing the dataset?

  • boot­strap­ping im­age gen­er­a­tion by start­ing with a seed cor­pus, gen­er­at­ing many ran­dom sam­ples, se­lect­ing the best by hand, and re­train­ing; eg ex­pand a cor­pus of a spe­cific char­ac­ter, or ex­plore ‘hy­brid’ cor­puses which mix A/B im­ages & one then se­lects for im­ages which look most A+B-ish

  • im­proved trans­fer learn­ing scripts to edit trained mod­els so 512px pre­trained mod­els can be pro­moted to work with 1024px im­ages and vice versa

  • bet­ter Dan­booru tag­ger CNN for pro­vid­ing clas­si­fi­ca­tion em­bed­dings for var­i­ous pur­pos­es, par­tic­u­larly FID loss mon­i­tor­ing, mini­batch dis­crim­i­na­tion/aux­il­iary loss, and style trans­fer for cre­at­ing a ‘StyleDan­booru’

    • with a StyleDan­booru, I am cu­ri­ous if that can be used as a par­tic­u­larly Pow­er­ful Form Of Data Aug­men­ta­tion for small n char­ac­ter datasets, and whether it leads to a re­ver­sal of train­ing dy­nam­ics with edges com­ing be­fore col­ors/­tex­tures—it’s pos­si­ble that a StyleDan­booru could make many GAN ar­chi­tec­tures, not just StyleGAN, sta­ble to train on ani­me/il­lus­tra­tion datasets
  • bor­row­ing ar­chi­tec­tural en­hance­ments from BigGAN: self­-at­ten­tion lay­ers, spec­tral norm reg­u­lar­iza­tion, large-mini­batch train­ing, and a rec­ti­fied Gauss­ian dis­tri­b­u­tion for the la­tent vec­tor z

  • text → im­age con­di­tional GAN ar­chi­tec­ture (à la StackGAN):

    This would take the text tag de­scrip­tions of each im­age com­piled by Dan­booru users and use those as in­puts to StyleGAN, which, should it work, would mean you could cre­ate ar­bi­trary anime im­ages sim­ply by typ­ing in a string like 1_boy samurai facing_viewer red_hair clouds sword armor blood etc.

    This should al­so, by pro­vid­ing rich se­man­tic de­scrip­tions of each im­age, make train­ing faster & sta­bler and con­verge to higher fi­nal qual­i­ty.

  • meta-learn­ing for few-shot face or char­ac­ter or artist im­i­ta­tion (eg Set-CGAN or or per­haps , or —the last of which achieves few-shot learn­ing with sam­ples of n = 25 TWDNE StyleGAN anime faces)

ImageNet StyleGAN

As part of ex­per­i­ments in scal­ing up StyleGAN 2, us­ing , we ran StyleGAN on large-s­cale datasets in­clud­ing Dan­booru2019, Im­a­geNet, and sub­sets of the . De­spite run­ning for mil­lions of im­ages, no S2 run ever achieved re­motely the re­al­ism of S2 on FFHQ or BigGAN on Im­a­geNet: while the tex­tures could be sur­pris­ingly good, the se­man­tic global struc­ture never came to­geth­er, with glar­ing flaws—there would be too many heads, or they would be de­tached from bod­ies, etc.

Aaron Gokaslan took the time to com­pute the FID on Im­a­geNet, es­ti­mat­ing a ter­ri­ble score of FID ~120. (High­er=­worse; for com­par­ison, BigGAN with can be as good as FID ~7, and reg­u­lar BigGAN typ­i­cally sur­passes FID 120 within a few thou­sand it­er­a­tions.) Even ex­per­i­ments in in­creas­ing the S2 model size up to ~1GB (by in­creas­ing the fea­ture map mul­ti­pli­er) im­proved qual­ity rel­a­tively mod­est­ly, and showed no signs of ever ap­proach­ing BigGAN-level qual­i­ty. We con­cluded that StyleGAN is in fact fun­da­men­tally lim­ited as a GAN, trad­ing off sta­bil­ity for power, and switched over to BigGAN work.

For those in­ter­est­ed, we pro­vide our 512px Im­a­geNet S2 (step 1,394,688):

rsync --verbose rsync://78.46.86.149:873/biggan/2020-04-07-shawwn-stylegan-imagenet-512px-run52-1394688.pkl.xz ./
Shawn Presser, S2 Im­a­geNet in­ter­po­la­tion video from part­way through train­ing (~45 hours on a TPUv3-512, 3k im­ages/s)

BigGAN

Fol­low­ing my StyleGAN anime face ex­per­i­ments, I ex­plore BigGAN, an­other re­cent GAN with SOTA re­sults on one of the most com­plex im­age do­mains tack­led by GANs so far (Im­a­geNet). BigGAN’s ca­pa­bil­i­ties come at a steep com­pute cost, how­ev­er.

Using the unofficial BigGAN-PyTorch reimplementation, I experimented in 2019 with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and from-scratch 256px anime portraits of 1000 characters on an 8×2080ti machine for a month (mixed results). My BigGAN results are good but compromised by the compute expense & practical problems with the released BigGAN code base. While BigGAN is not yet superior to StyleGAN for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.

For fol­lowup ex­per­i­ments, Shawn Presser, I and oth­ers (col­lec­tive­ly, “Ten­sor­fork”) have used Ten­sor­flow Re­search Cloud TPU cred­its & the com­pare_­gan BigGAN reim­ple­men­ta­tion. Run­ning this at scale on the full Dan­booru2019 dataset in May 2020, we have reached the best anime GAN re­sults to date.

See .

See Also

Appendix

For failed anime ex­per­i­ments us­ing a va­ri­ety of NN ar­chi­tec­tures, see the .


  1. Turns out that when train­ing goes re­ally wrong, you can crash many GAN im­ple­men­ta­tions with ei­ther a seg­fault, in­te­ger over­flow, or di­vi­sion by zero er­ror.↩︎

  2. StackGAN/StackGAN++/PixelCNN et al are diffi­cult to run as they re­quire a unique im­age em­bed­ding which could only be com­puted in the un­main­tained Torch frame­work us­ing Reed’s prior work on a joint tex­t+im­age em­bed­ding which how­ever does­n’t run on any­thing but the Birds & Flow­ers datasets, and so no one has ever, as far as I am aware, run those im­ple­men­ta­tions on any­thing else—cer­tainly I never man­aged to de­spite quite a few hours try­ing to re­verse-engi­neer the em­bed­ding & var­i­ous im­ple­men­ta­tions.↩︎

  3. Be sure to check out .↩︎

  4. Glow’s re­ported re­sults re­quired >40 GPU-weeks; BigGAN’s to­tal com­pute is un­clear as it was trained on a TPUv3 Google clus­ter but it would ap­pear that a 128px BigGAN might be ~4 GPU-months as­sum­ing hard­ware like an 8-GPU ma­chine, 256px ~8 GPU-months, and 512px ≫8 GPU-months, with VRAM be­ing the main lim­it­ing fac­tor for larger mod­els (although pro­gres­sive grow­ing might be able to cut those es­ti­mates).↩︎

  5. is an old & small CNN trained to pre­dict a few -booru tags on anime im­ages, and so pro­vides an em­bed­ding—but not a good one. The lack of a good em­bed­ding is the ma­jor lim­i­ta­tion for anime deep learn­ing as of Feb­ru­ary 2019. (Deep­Dan­booru, while per­form­ing well ap­par­ent­ly, has not yet been used for em­bed­dings.) An em­bed­ding is nec­es­sary for text → im­age GANs, im­age searches & near­est-neigh­bor checks of over­fit­ting, FID er­rors for ob­jec­tively com­par­ing GANs, mini­batch dis­crim­i­na­tion to help the D/pro­vide an aux­il­iary loss to sta­bi­lize learn­ing, anime style trans­fer (both for its own sake & for cre­at­ing a ‘StyleDan­booru2018’ to re­duce tex­ture cheat­ing), en­cod­ing into GAN la­tent spaces for ma­nip­u­la­tion, data clean­ing (to de­tect anom­alous dat­a­points like failed face crop­s), per­cep­tual losses for en­coders or as an ad­di­tional aux­il­iary loss/pre­train­ing (like , which trains a Gen­er­a­tor on a per­cep­tual loss and does GAN train­ing only for fine­tun­ing) etc. A good tag­ger is also a good start­ing point for do­ing pix­el-level se­man­tic seg­men­ta­tion (via “weak su­per­vi­sion”), which meta­data is key for train­ing some­thing like Nvidi­a’s GauGAN suc­ces­sor to pix2pix (; source).↩︎

  6. Tech­ni­cal note: I typ­i­cally train NNs us­ing my work­sta­tion with 2×1080ti GPUs. For eas­ier com­par­ison, I con­vert all my times to single-GPU equiv­a­lent (ie “6 GPU-weeks” means 3 re­al­time/wall­clock weeks on my 2 GPUs).↩︎

  7. ob­serves (§4 “Us­ing pre­ci­sion and re­call to an­a­lyze and im­prove StyleGAN”) that StyleGAN with pro­gres­sive grow­ing dis­abled does work but at some cost to pre­ci­sion/re­call qual­ity met­rics; whether this re­flects in­fe­rior per­for­mance on a given train­ing bud­get or an in­her­ent limit—BigGAN and other self­-at­ten­tion-us­ing GANs do not use pro­gres­sive grow­ing at all, sug­gest­ing it is not truly nec­es­sary—is not in­ves­ti­gat­ed. In De­cem­ber 2019, StyleGAN 2 suc­cess­fully dropped pro­gres­sive grow­ing en­tirely at mod­est per­for­mance cost.↩︎

  8. This has con­fused some peo­ple, so to clar­ify the se­quence of events: I trained my anime face StyleGAN and posted notes on Twit­ter, re­leas­ing an early mod­el; road­run­ner01 gen­er­ated an in­ter­po­la­tion video us­ing said model (but a differ­ent ran­dom seed, of course); this in­ter­po­la­tion video was retweeted by the Japan­ese Twit­ter user _Ry­obot, upon which it went vi­ral and was ‘liked’ by Elon Musk, fur­ther dri­ving vi­ral­ity (19k re­shares, 65k likes, 1.29m watches as of 2019-03-22).↩︎

  9. Google Colab is a free service which includes free GPU time (up to 12 hours on a small GPU). Especially for people who do not have a reasonably capable GPU on their personal computers (such as all Apple users) or do not want to engage in the admitted hassle of renting a real cloud GPU instance, Colab can be a great way to play with a pretrained model, like generating GPT-2-117M text completions or StyleGAN interpolation videos, or to prototype on tiny problems.

    However, it is a bad idea to try to train real models, like 512–1024px StyleGANs, on a Colab instance: the GPUs have little VRAM and are far slower (6 hours per StyleGAN tick!), and Colab is unwieldy to work with (one must save snapshots constantly to restart when the session runs out), doesn't have a real command-line, etc. Colab is just barely adequate for perhaps 1 or 2 ticks of transfer learning, but not more. If you harbor greater ambitions but still refuse to spend any money (rather than time), Kaggle has a similar service with P100 GPU slices rather than K80s. Otherwise, one needs to get access to real GPUs.↩︎

  10. Cu­ri­ous­ly, the ben­e­fit of many more FC lay­ers than usual may have been stum­bled across be­fore: IllustrationGAN found that adding some FC lay­ers seemed to help their DCGAN gen­er­ate anime faces, and when I & Feep­ingCrea­ture ex­per­i­mented with adding 2–4 FC lay­ers to WGAN-GP along IllustrationGAN’s lines, it did help our lack­lus­ter re­sults, and at the time I spec­u­lated that “the ful­ly-con­nected lay­ers are trans­form­ing the la­tent-z/noise into a sort of global tem­plate which the sub­se­quent con­vo­lu­tion lay­ers can then fill in more lo­cal­ly.” But we never dreamed of go­ing as deep as 8!↩︎

  11. The ProGAN/StyleGAN code­base re­port­edly does work with con­di­tion­ing, but none of the pa­pers re­port on this func­tion­al­ity and I have not used it my­self.↩︎

  12. The la­tent em­bed­ding z is usu­ally gen­er­ated in about the sim­plest pos­si­ble way: draws from the Nor­mal dis­tri­b­u­tion, 𝒩(0,1). A Uni­for­m(−1,1) is some­times used in­stead. There is no good jus­ti­fi­ca­tion for this and some rea­son to think this can be bad (how does a GAN eas­ily map a dis­crete or bi­nary la­tent fac­tor, such as the pres­ence or ab­sence of the left ear, onto a Nor­mal vari­able?).

    The BigGAN pa­per ex­plores al­ter­na­tives, find­ing im­prove­ments in train­ing time and/or fi­nal qual­ity from us­ing in­stead (in as­cend­ing or­der): a Nor­mal + bi­nary Bernoulli (p = 0.5; per­sonal com­mu­ni­ca­tion, Brock) vari­able, a bi­nary (Bernoul­li), and a (some­times called a “cen­sored nor­mal” even though that sounds like a rather than the rec­ti­fied one). The rec­ti­fied Gauss­ian dis­tri­b­u­tion “out­per­forms 𝒩(0,1)(in terms of IS) by 15–20% and tends to re­quire fewer it­er­a­tions.”

    The down­side is that the “trun­ca­tion trick”, which yields even larger av­er­age im­prove­ments in im­age qual­ity (at the ex­pense of di­ver­si­ty) does­n’t quite ap­ply, and the rec­ti­fied Gauss­ian sans trun­ca­tion pro­duced sim­i­lar re­sults as the Nor­mal+trun­ca­tion, so BigGAN re­verted to the de­fault Nor­mal dis­tri­b­u­tion+trun­ca­tion (per­sonal com­mu­ni­ca­tion).

    The truncation trick either directly applies to some of the other distributions, particularly the Rectified Gaussian, or could easily be adapted—possibly yielding an improvement over either approach. The Rectified Gaussian can be truncated just like the default Normals can. And for the Bernoulli, one could decrease p during the generation, or, what is probably equivalent, re-sample whenever the variance (ie squared sum) of all the Bernoulli latent variables exceeds a certain constant. (With p = 0.5, a latent vector of 512 Bernoullis would on average sum up to simply 0.5 × 512 = 256, with the 2.5%–97.5% quantiles being 234–278, so a 'truncation trick' here might be throwing out every vector with a sum above, say, the 80% quantile of 266.)

    One also won­ders about vec­tors which draw from mul­ti­ple dis­tri­b­u­tions rather than just one. Could the StyleGAN 8-FC-layer learned-la­ten­t-vari­able be re­verse-engi­neered? Per­haps the first layer or two merely con­verts the nor­mal in­put into a more use­ful dis­tri­b­u­tion & pa­ra­me­ter­s/­train­ing could be saved or in­sight gained by im­i­tat­ing that.↩︎

  13. Which raises the ques­tion: if you added any or all of those fea­tures, would StyleGAN be­come that much bet­ter? Un­for­tu­nate­ly, while the­o­rists & prac­ti­tion­ers have had many ideas, so far the­ory has proven more fe­cund than fa­tidi­cal and the large-s­cale GAN ex­per­i­ments nec­es­sary to truly test the sug­ges­tions are too ex­pen­sive for most. Half of these sug­ges­tions are great ideas—but which half?↩︎

  14. For more on the choice of con­vo­lu­tion lay­er­s/k­er­nel sizes, see Karpa­thy’s 2015 notes for “CS231n: Con­vo­lu­tional Neural Net­works for Vi­sual Recog­ni­tion”, or take a look at these Con­vo­lu­tion an­i­ma­tions & Yang’s in­ter­ac­tive “Con­vo­lu­tion Vi­su­al­izer”.↩︎

  15. These observations apply only to the Generator in GANs (which is what we primarily care about); curiously, there's some reason to think that GAN Discriminators are in fact mostly memorizing (see later).↩︎

  16. A pos­si­ble al­ter­na­tive is ESRGAN ().↩︎

  17. Based on eye­balling the ‘cat’ bar graph in Fig­ure 3 of .↩︎

  18. CATS offer an amus­ing in­stance of the dan­gers of data aug­men­ta­tion: ProGAN used hor­i­zon­tal flip­ping/mir­ror­ing for every­thing, be­cause why not? This led to strange Cyril­lic text cap­tions show­ing up in the gen­er­ated cat im­ages. Why not Latin al­pha­bet cap­tions? Be­cause every cat im­age was be­ing shown mir­rored as well as nor­mal­ly! For StyleGAN, mir­ror­ing was dis­abled, so now the lol­cat cap­tions are rec­og­niz­ably Latin al­pha­bet­i­cal, and even al­most Eng­lish words. This demon­strates that even datasets where left­/right does­n’t seem to mat­ter, like cat pho­tos, can sur­prise you.↩︎

  19. I es­ti­mated the to­tal cost us­ing AWS EC2 pre­emptible hourly costs on 2019-03-15 as fol­lows:

    • 1 GPU: p2.xlarge in­stance in us-east-2a, Half of a K80 (12GB VRAM): $0.3235/hour
    • 2 GPUs: NA—there is no P2 in­stance with 2 GPUs, only 1/8/16
    • 8 GPUs: p2.8xlarge in us-east-2a, 8 halves of K80s (12GB VRAM each): $2.160/hour

    As usu­al, there is sub­lin­ear scal­ing, and larger in­stances cost dis­pro­por­tion­ately more, be­cause one is pay­ing for faster wall­clock train­ing (time is valu­able) and for not hav­ing to cre­ate a dis­trib­uted in­fra­struc­ture which can ex­ploit the cheap single-GPU in­stances.

    This cost es­ti­mate does not count ad­di­tional costs like hard drive space. In ad­di­tion to the dataset size (the StyleGAN data en­cod­ing is ~18× larger than the raw data size, so a 10GB folder of im­ages → 200GB of .tfrecords), you would need at least 100GB HDD (50GB for the OS, and 50GB for check­points/im­ages/etc to avoid crashes from run­ning out of space).↩︎

  20. I re­gard this as a flaw in StyleGAN & TF in gen­er­al. Com­put­ers are more than fast enough to load & process im­ages asyn­chro­nously us­ing a few worker threads, and work­ing with a di­rec­tory of im­ages (rather than a spe­cial bi­nary for­mat 10–20× larg­er) avoids im­pos­ing se­ri­ous bur­dens on the user & hard dri­ve. Py­Torch GANs al­most al­ways avoid this mis­take, and are much more pleas­ant to work with as one can freely mod­ify the dataset be­tween (and even dur­ing) runs.↩︎

  21. For ex­am­ple, my Dan­booru2018 anime por­trait dataset is 16GB, but the StyleGAN en­coded dataset is 296GB.↩︎

  22. This may be why some peo­ple re­port that StyleGAN just crashes for them & they can’t fig­ure out why. They should try chang­ing their dataset JPG ↔︎ PNG.↩︎

  23. That is, in train­ing G, the G’s fake im­ages must be aug­mented be­fore be­ing passed to the D for rat­ing; and in train­ing D, both real & fake im­ages must be aug­mented the same way be­fore be­ing passed to D. Pre­vi­ous­ly, all GAN re­searchers ap­pear to have as­sumed that one should only aug­ment real im­ages be­fore pass­ing to D dur­ing D train­ing, which con­ve­niently can be done at dataset cre­ation; un­for­tu­nate­ly, this hid­den as­sump­tion turns out to be about the most harm­ful way pos­si­ble!↩︎

  24. I would de­scribe the dis­tinc­tions as: Soft­ware 0.0 was im­per­a­tive pro­gram­ming for ham­mer­ing out clock­work mech­a­nism; Soft­ware 1.0 was de­clar­a­tive pro­gram­ming with spec­i­fi­ca­tion of pol­i­cy; and Soft­ware 2.0 is deep learn­ing by gar­den­ing loss func­tions (with every­thing else, from model arch to which dat­a­points to la­bel ide­ally learned end-to-end). Con­tin­u­ing the the­me, we might say that di­a­logue with mod­els, like , are “Soft­ware 3.0”…↩︎

  25. But you may not want to–re­mem­ber the lol­cat cap­tions!↩︎

  26. Note: If you use a different command to resize, check it thoroughly. With ImageMagick, if you use the ^ operator like -resize 512x512^, you will not get exactly 512×512px images as you need; while if you use the ! operator like -resize 512x512!, the images will be exactly 512×512px but the aspect ratios will be distorted to make images fit, and this may confuse anything you are training by introducing unnecessary meaningless distortions & will make any generated images look bad.↩︎

  27. If you are us­ing Python 2, you will get print syn­tax er­ror mes­sages; if you are us­ing Python 3–3.6, you will get ‘type hint’ er­rors.↩︎

  28. Stas Pod­gorskiy has demon­strated that the StyleGAN 2 cor­rec­tion can be re­verse-engi­neered and ap­plied back to StyleGAN 1 gen­er­a­tors if nec­es­sary.↩︎

  29. This makes it con­form to a trun­cated nor­mal dis­tri­b­u­tion; why trun­cated rather than rec­ti­fied/win­sorized at a max like 0.5 or 1.0 in­stead? Be­cause then many, pos­si­bly most, of the la­tent vari­ables would all be at the max, in­stead of smoothly spread out over the per­mit­ted range.↩︎

  30. No mini­batches are used, so this is much slower than nec­es­sary.↩︎

  31. Simply encoding each possible tag as a one-hot categorical variable would scale poorly: in the worst case, Danbooru2020 has >434,000 possible tags. If that was passed into a fully-connected layer which output a 1024-long embedding, then that would use up 434,000 × 1,024 = 444 million parameters! The embedding would be larger than the actual StyleGAN model, and accordingly expensive. RNNs historically are commonly used to convert text inputs to an embedding for a CNN to process, but they are finicky and hard to work with. word2vec doesn't work because, as the name suggests, it only converts a single tag/word at a time into an embedding; doc2vec is its equivalent for sequences of text. If we were doing it in 2021, we would probably just throw a Transformer at it (attention is all you need!) with a window of 512 tokens or something. (You don't need that many tags, and it's unlikely that a feasible model would make good use of ~100 tags anyway.)↩︎

  32. Keep this change in mind if you run into er­rors like ValueError: Cannot feed value of shape (1024,) for Tensor 'G/dlatent_avg/new_value:0', which has shape '(512,)' try­ing to reuse the mod­el.↩︎

  33. Mov­ing self­-at­ten­tion around in BigGAN also makes sur­pris­ingly lit­tle differ­ence. We dis­cussed it with BigGAN’s Brock, and he noted that self­-at­ten­tion was ex­pen­sive & never seemed to be as im­por­tant to BigGAN as one would as­sume (com­pared to other im­prove­ments like the or­thog­o­nal reg­u­lar­iza­tion, large mod­els, and large mini­batch­es). Given ex­am­ples like , , , or , I sus­pect that the ben­e­fits of self­-at­ten­tion may be rel­a­tively min­i­mal at the raw pixel lev­el, and bet­ter fo­cused on the ‘se­man­tic level’ in some sense, such as in pro­cess­ing the la­tent vec­tor or VQ-VAE to­kens.↩︎

  34. The ques­tion is not whether one is to start with an ini­tial­iza­tion at all, but whether to start with one which does every­thing poor­ly, or one which does a few sim­i­lar things well. Sim­i­lar­ly, from a Bayesian sta­tis­tics per­spec­tive, the ques­tion of what to use is one that every­one faces; how­ev­er, many ap­proaches sweep it un­der the rug and effec­tively as­sume a de­fault flat prior that is con­sis­tently bad and op­ti­mal for no mean­ing­ful prob­lem ever.↩︎

  35. ADA/StyleGAN3 is re­port­edly much more sam­ple-effi­cient and re­duces the need for trans­fer learn­ing: . But if a rel­e­vant model is avail­able, it should still be used. Back­port­ing the ADA data aug­men­ta­tion trick to StyleGAN1–2 will be a ma­jor up­grade.↩︎

  36. There are more real Asuka im­ages than Holo to be­gin with, but there is no par­tic­u­lar rea­son for the 10× data aug­men­ta­tion com­pared to the Holo’s 3×—the data aug­men­ta­tions were just done at differ­ent times and hap­pened to have less or more aug­men­ta­tions en­abled.↩︎

  37. A fa­mous ex­am­ple is char­ac­ter de­signer Yoshiyuki Sadamoto demon­strat­ing how to turn Na­dia () into (Evan­ge­lion).↩︎

  38. It turns out that this la­tent vec­tor trick does work. In­trigu­ing­ly, it works even bet­ter to do ‘model av­er­ag­ing’ or ‘model blend­ing’ (/, Pinkney & Adler 2020): re­train model A on dataset B, and then take a weighted av­er­age of the 2 mod­els (you av­er­age them, pa­ra­me­ter by pa­ra­me­ter, and re­mark­ably, that Just Works, or you can swap out lay­ers be­tween mod­el­s), and then you can cre­ate faces which are ar­bi­trar­ily in be­tween A and B. So for ex­am­ple, you can blend FFHQ/Western-animation faces (Co­lab note­book), ukiy­o-e/FFHQ faces, furries/foxes/FFHQ faces, or even furries/foxes/FFHQ/anime/ponies.↩︎

  39. In ret­ro­spect, this should­n’t’ve sur­prised me.↩︎

  40. There is for other architectures like flow-based ones such as Glow, and this is one of their benefits–while the requirement to be made out of building blocks which can be run backwards & forwards equally well, to be 'invertible', is currently extremely expensive and the results not competitive either in final image quality or compute requirements, the invertibility means that encoding an arbitrary real image to get its inferred latents Just Works™ and one can easily morph between 2 arbitrary images, or encode an arbitrary image & edit it in the latent space to do things like add/remove glasses from a face or create an opposite-sex version.↩︎

  41. This fi­nal ap­proach is, in­ter­est­ing­ly, the his­tor­i­cal rea­son back­prop­a­ga­tion was in­vent­ed: it cor­re­sponds to plan­ning in a model. For ex­am­ple, in plan­ning the flight path of an air­plane (/): the des­ti­na­tion or ‘out­put’ is fixed, the aero­dy­nam­ic­s+­geog­ra­phy or ‘model pa­ra­me­ters’ are also fixed, and the ques­tion is what ac­tions de­ter­min­ing a flight path will re­duce the loss func­tion of time or fuel spent. One starts with a ran­dom set of ac­tions pick­ing a ran­dom flight path, runs it for­ward through the en­vi­ron­ment mod­el, gets a fi­nal time/­fuel spent, and then back­prop­a­gates through the model to get the gra­di­ents for the flight path, ad­just­ing the flight path to­wards a new set of ac­tions which will slightly re­duce the time/­fuel spent; the new ac­tions are used to plan out the flight to get a new loss, and so on, un­til a lo­cal min­i­mum of the ac­tions has been found. This works with non-s­to­chas­tic prob­lems; for sto­chas­tic ones where the path can’t be guar­an­teed to be ex­e­cut­ed, “mod­el-pre­dic­tive con­trol” can be used to re­plan at every step and ex­e­cute ad­just­ments as nec­es­sary. An­other in­ter­est­ing use of back­prop­a­ga­tion for out­puts is which tack­les the long-s­tand­ing prob­lem of how to get NNs to out­put sets rather than list out­puts by gen­er­at­ing a pos­si­ble set out­put & re­fin­ing it via back­prop­a­ga­tion.↩︎

  42. SGD is com­mon, but a sec­ond-order al­go­rithm like is often used in these ap­pli­ca­tions in or­der to run as few it­er­a­tions as pos­si­ble.↩︎

  43. shows that BigGAN/StyleGAN la­tent em­bed­dings can also go be­yond what one might ex­pect, to in­clude zooms, trans­la­tions, and other trans­forms.↩︎

  44. Flow mod­els have other ad­van­tages, mostly stem­ming from the max­i­mum like­li­hood train­ing ob­jec­tive. Since the im­age can be prop­a­gated back­wards and for­wards loss­less­ly, in­stead of be­ing lim­ited to gen­er­at­ing ran­dom sam­ples like a GAN, it’s pos­si­ble to cal­cu­late the ex­act prob­a­bil­ity of an im­age, en­abling max­i­mum like­li­hood as a loss to op­ti­mize, and drop­ping the Dis­crim­i­na­tor en­tire­ly. With no GAN dy­nam­ics, there’s no worry about weird train­ing dy­nam­ics, and the like­li­hood loss also for­bids ‘mode drop­ping’: the flow model can’t sim­ply con­spire with a Dis­crim­i­na­tor to for­get pos­si­ble im­ages.↩︎

  45. StyleGAN 2 is more com­pu­ta­tion­ally ex­pen­sive but Kar­ras et al op­ti­mized the code­base to make up for it, keep­ing to­tal com­pute con­stant.↩︎

  46. Back­up-backup mir­ror: rsync rsync://78.46.86.149:873/biggan/2020-01-11-skylion-stylegan2-animeportraits-networksnapshot-024664.pkl.xz ./↩︎