Danbooru2020: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset

Danbooru2020 is a large-scale anime image database with 4.2m+ images annotated with 130m+ tags; it can be useful for machine learning purposes such as image recognition and generation.
statistics, NN, anime, shell, dataset
2015-12-15–2021-01-21 finished certainty: likely importance: 6


Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short on image count, tag count, or unrestricted distribution. I suggest that the “image boorus” be used. The image boorus are longstanding web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations.

The best known booru, with a fo­cus on qual­i­ty, is Dan­booru. We pro­vide a torrent/rsync mir­ror which con­tains ~3.4TB of 4.22m im­ages with 130m tag in­stances (of 434k de­fined tags, ~30/image) cov­er­ing Dan­booru from 2005-05-24–2020-12-31 (fi­nal ID: #4,279,845), pro­vid­ing the im­age files & a JSON ex­port of the meta­da­ta. We also pro­vide a smaller tor­rent of SFW im­ages down­scaled to 512×512px JPGs (0.37TB; 3,227,715 im­ages) for con­ve­nience. (To­tal: 3.7T­B.)

Our hope is that the Danbooru2020 dataset can be used for rich large-scale classification/tagging & learned embeddings, to test the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, to provide an archival backup for the Danbooru community, to feed back metadata improvements & corrections, and to serve as a testbed for advanced techniques such as conditional image generation or style transfer.

Image boorus like Danbooru are image hosting websites developed by the anime community for collaborative tagging. Images are uploaded and tagged by users; collections can be large, such as Danbooru, and richly annotated with textual ‘tags’.

Dan­booru in par­tic­u­lar is old, large, well-tagged, and its op­er­a­tors have al­ways sup­ported uses be­yond reg­u­lar brows­ing—pro­vid­ing an API and even a data­base ex­port. With their per­mis­sion, I have pe­ri­od­i­cally cre­ated sta­tic snap­shots of Dan­booru ori­ented to­wards ML use pat­terns.

Image booru description

Image booru tags are typically divided into a few major groups:

  • copyright (the overall franchise, movie, TV series, manga etc a work is based on; for long-running franchises or “crossover” images, there can be multiple such tags, or if there is no such associated work, it would be tagged “original”)

  • char­ac­ter (often mul­ti­ple)

  • au­thor

  • ex­plic­it­ness rat­ing

    Danbooru does not ban sexually suggestive or pornographic content; instead, images are classified into 3 categories: safe, questionable, & explicit. (Represented in the SQL as “s”/“q”/“e” respectively.)

    safe is for relatively SFW content including swimsuits, while questionable would be more appropriate for highly-revealing swimsuit images or nudity or highly sexually suggestive situations, and explicit denotes anything hard-core pornographic. (8.5% of images are classified as “e”, 15% as “q”, and 77% as “s”; as the default rating is “q”, this may underestimate the number of “s” images, but “s” should probably be considered the SFW subset. These proportions can be recomputed with the jq one-liner after this list.)

  • descriptive tags (eg the top 10 tags are 1girl/solo/long_hair/highres/breasts/blush/short_hair/smile/multiple_girls/open_mouth/looking_at_viewer, which reflect the expected focus of anime fandom)

    These tags form a “folksonomy” to describe aspects of images; beyond the expected tags like long_hair or looking_at_viewer, there are many strange and unusual tags, including many anime or illustration-specific tags like seiyuu_connection (images where the joke is based on knowing the two characters are voiced in different anime by the same voice actor) or bad_feet (artists frequently accidentally draw two left feet, or just bad_anatomy in general). Tags may also be hierarchical and one tag “imply” another.

    Im­ages with text in them will have tags like translated, comic, or speech_bubble.
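
A minimal jq sketch for recomputing the rating distribution (assuming the metadata JSON, described under Image Metadata below, has been unpacked into a metadata/ directory of JSON-lines files):

# tally rating values ("s"/"q"/"e") across all metadata records
cat metadata/* | jq -r '.rating' | sort | uniq -c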

Im­ages can have other as­so­ci­ated meta­data with them, in­clud­ing:

  • Dan­booru ID, a unique pos­i­tive in­te­ger

  • MD5 hash

    The MD5s are often in­cor­rect.
  • the up­loader user­name

  • the orig­i­nal URL or the name of the work

  • up/downvotes

  • sib­ling im­ages (often an im­age will ex­ist in many forms, such as sketch or black­-white ver­sions in ad­di­tion to a fi­nal color im­age, edited or larger/smaller ver­sions, SFW vs NSFW, or de­pict­ing mul­ti­ple mo­ments in a scene)

  • captions/dialogue (many images will have written Japanese captions/dialogue, which have been translated into English by users and annotated using HTML)

  • au­thor com­men­tary (also often trans­lat­ed)

  • pools (ordered se­quences of im­ages from across Dan­booru; often used for comics or im­age groups, or for dis­parate im­ages with some uni­fy­ing theme which is in­suffi­ciently ob­jec­tive to be a nor­mal tag)

Im­age boorus typ­i­cally sup­port ad­vanced Boolean searches on mul­ti­ple at­trib­utes si­mul­ta­ne­ous­ly, which in con­junc­tion with the rich tag­ging, can al­low users to dis­cover ex­tremely spe­cific sets of im­ages.

Samples

100 ran­dom sam­ple im­ages from the 512px SFW sub­set of Dan­booru in a 10×10 grid.

Download

Dan­booru2020 is cur­rently avail­able for down­load in 2 ways:

  1. Bit­Tor­rent
  2. pub­lic rsync server

Torrent

The images have been downloaded using a curl script & the Danbooru API, and losslessly optimized using optipng/jpegoptim; the metadata has been exported from the Danbooru BigQuery mirror.
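
For orientation, a sketch of the kind of lossless-optimization pass described above (paths assume the original/ layout described under Format; the exact flags used to prepare the dataset are not documented here, so -o7 and --strip-all are illustrative choices):

# losslessly recompress PNGs & strip JPG metadata, in parallel (illustrative flags):
find original/ -name '*.png' -print0 | xargs -0 -n1 -P"$(nproc)" optipng -o7 -quiet
find original/ -name '*.jpg' -print0 | xargs -0 -n1 -P"$(nproc)" jpegoptim --strip-all --quiet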

Tor­rents are the pre­ferred down­load method as they stress the seed server less, can po­ten­tially be faster due to many peers, are re­silient to server down­time, and have built-in ECC. (How­ev­er, Dan­booru2020 is ap­proach­ing the lim­its of Bit­Tor­rent clients, and Dan­booru2021 may be forced to drop tor­rent sup­port.)

Due to the num­ber of files, the tor­rent has been bro­ken up into 10 sep­a­rate tor­rents, each cov­er­ing a range of IDs mod­ulo 1000. They are avail­able as an XZ-com­pressed tar­ball (o­rig­i­nal im­ages) (25MB) and the SFW 512px down­scaled sub­set tor­rent (13M­B); down­load & un­pack into one’s tor­rent di­rec­to­ry.

The torrents appear to work with some clients on Linux & Linux/Windows; they reportedly do not work on qBittorrent 3.3–4.0.4 (but may on >=4.0.5), Deluge, or most Windows torrent clients.

Rsync

Due to torrent compatibility & network issues, I provide an alternate download route via a public anonymous rsync server (rsync is available for any Unix; alternative implementations are available for other platforms). To list all available files, a file-list can be downloaded (equivalent to but much faster than rsync --list-only):

rsync rsync://78.46.86.149:873/danbooru2020/filelist.txt.xz ./

To down­load all avail­able files (test with --dry-run):

rsync --verbose --recursive rsync://78.46.86.149:873/danbooru2020/ ./danbooru2020/

For a single file (eg the metadata tarball), one can download it thus:

rsync --verbose rsync://78.46.86.149:873/danbooru2020/metadata.json.tar.xz ./

For a spe­cific sub­set, like the SFW 512px sub­set:

rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2020/512px/ ./danbooru2020/512px/
rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2020/original/ ./danbooru2020/original/

Note that rsync supports glob-style patterns in queries (remember to escape globs so they are not interpreted by the shell), and also supports reading a list of filenames from a file:

            --exclude=PATTERN       exclude files matching PATTERN
            --exclude-from=FILE     read exclude patterns from FILE
            --include=PATTERN       don't exclude files matching PATTERN
            --include-from=FILE     read include patterns from FILE
            --files-from=FILE       read list of source-file names from FILE

So one can query the meta­data to build up a file list­ing IDs match­ing ar­bi­trary cri­te­ria, cre­ate the cor­re­spond­ing dataset file­names (like for ID #58991, 512px/0991/58991.jpg or, even lazier, just do in­clude matches via a glob pat­tern like */58991.*). This can re­quire far less time & band­width than down­load­ing the full dataset, and is also far faster than do­ing rsync one file at a time. See the rsync doc­u­men­ta­tion for fur­ther de­tails.
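
As a concrete sketch of that workflow (assuming the metadata tarball has been unpacked into metadata/ and jq is installed; the tag and the jq program are illustrative), one can build a file list of 512px filenames and pass it to rsync:

# list bucket/filename paths for all SFW images with a given tag, then fetch only those:
cat metadata/* \
    | jq -r 'select(.rating == "s")
             | select([.tags[].name] | index("monochrome"))
             | (.id | tonumber % 1000 | tostring | "000" + . | .[-4:]) + "/" + .id + ".jpg"' \
    > filelist.txt
rsync --verbose --files-from=filelist.txt rsync://78.46.86.149:873/danbooru2020/512px/ ./512px/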

And for the full dataset (meta­data+o­rig­i­nal+512px):

rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2020 ./danbooru2020/

I also pro­vide rsync mir­rors of a num­ber of mod­els & datasets, such as the cleaned anime por­trait dataset; see Projects for a list­ing of de­riv­a­tive works.

Kaggle

A combination of an n = 300k subset of the 512px SFW subset of Danbooru2017 and Nagadomi’s moeimouto face dataset are available as a Kaggle-hosted dataset: “Tagged Anime Illustrations” (36GB).

Kag­gle also hosts the meta­data of Safebooru up to 2016-11-20: “Safebooru—Anime Im­age Meta­data”.

Model zoo

Cur­rently avail­able:

Use­ful mod­els would be:

  • per­cep­tual loss model (us­ing Deep­Dan­booru?)
  • “s”/“q”/“e” clas­si­fier
  • text em­bed­ding RNN, and pre-com­puted text em­bed­dings for all im­ages’ tags

Updating

If there is interest, the dataset will continue to be updated at regular annual intervals (“Danbooru2021”, “Danbooru2022”, etc).

Up­dates ex­ploit the ECC ca­pa­bil­ity of Bit­Tor­rent by up­dat­ing the images/metadata and cre­at­ing a new .torrent file; users down­load the new .torrent, over­write the old .torrent, and after re­hash­ing files to dis­cover which ones have changed/are miss­ing, the new ones are down­loaded. (This method has been suc­cess­fully used by other pe­ri­od­i­cal­ly-up­dated large tor­rents, such as the Touhou Loss­less Mu­sic Tor­rent, at ~1.75tb after 19 ver­sion­s.)

Turnover in Bit­Tor­rent swarms means that ear­lier ver­sions of the tor­rent will quickly dis­ap­pear, so for eas­ier re­pro­ducibil­i­ty, the meta­data files can be archived into sub­di­rec­to­ries (im­ages gen­er­ally will not change, so re­pro­ducibil­ity is less of a con­cern—to re­pro­duce the sub­set for an ear­lier re­lease, one sim­ply fil­ters on up­load date or takes the file list from the old meta­data).
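
For example, a minimal sketch of that filtering step, using jq on the metadata (field names as in the metadata examples below; the cutoff date is illustrative):

# list the IDs of all images uploaded before 2020, approximating the Danbooru2019 subset:
cat metadata/* | jq -r 'select(.created_at < "2020-01-01") | .id' > danbooru2019-ids.txt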

Notification of updates

To re­ceive no­ti­fi­ca­tion of fu­ture up­dates to the dataset, please sub­scribe to the no­ti­fi­ca­tion mail­ing list.

Possible Uses

Such a dataset would sup­port many pos­si­ble us­es:

  • clas­si­fi­ca­tion & tag­ging:

    • im­age cat­e­go­riza­tion (of ma­jor char­ac­ter­is­tics such as fran­chise or char­ac­ter or SFW/NSFW de­tec­tion eg Der­pi­booru)

    • im­age mul­ti­-la­bel clas­si­fi­ca­tion (tag­ging), ex­ploit­ing the ~20 tags per im­age (cur­rently there is a pro­to­type, Deep­Dan­booru)

      • a large-s­cale test­bed for re­al-world ap­pli­ca­tion of ac­tive learn­ing / man-ma­chine col­lab­o­ra­tion
      • test­ing the scal­ing lim­its of ex­ist­ing tag­ging ap­proaches and mo­ti­vat­ing ze­ro-shot & one-shot learn­ing tech­niques
      • boot­strap­ping video summaries/descriptions
    • robustness of image classifiers to different illustration styles
  • im­age gen­er­a­tion:

  • im­age analy­sis:

    • facial detection & localization for drawn images (on which normal techniques such as OpenCV’s Haar filters fail, requiring special-purpose approaches like AnimeFace 2009/lbpcascade_animeface)
    • im­age popularity/upvote pre­dic­tion
    • im­age-to-text lo­cal­iza­tion, tran­scrip­tion, and trans­la­tion of text in im­ages
    • il­lus­tra­tion-spe­cial­ized com­pres­sion (for bet­ter per­for­mance than PNG/JPG)
  • im­age search:

    • col­lab­o­ra­tive filtering/recommendation, im­age sim­i­lar­ity search (Flickr) of im­ages (use­ful for users look­ing for im­ages, for dis­cov­er­ing tag mis­takes, and for var­i­ous di­ag­nos­tics like check­ing GANs are not mem­o­riz­ing)
    • manga recommendation
    • artist sim­i­lar­ity and de-anonymiza­tion
  • knowl­edge graph ex­trac­tion from tags/tag-implications and im­ages

    • clus­ter­ing tags
    • tem­po­ral trends in tags (fran­chise pop­u­lar­ity trends)

Advantages

Size and metadata

Im­age clas­si­fi­ca­tion has been su­per­charged by work on Im­a­geNet, but Im­a­geNet it­self is lim­ited by its small set of class­es, many of which are de­bat­able, and which en­com­pass only a lim­ited set. Com­pound­ing these lim­its, tagging/classification datasets are no­to­ri­ously un­di­verse & have im­bal­ance prob­lems or are small:

  • ImageNet: dog breeds

  • Youtube-BB: toilets/giraffes

  • MS COCO: bath­rooms and African sa­van­nah an­i­mals; 328k im­ages, 80 cat­e­gories, short 1-sen­tence de­scrip­tions

  • bird/flowers: a few score of each kind (eg no ea­gles in the birds dataset)

  • Vi­sual Re­la­tion­ship De­tec­tion (VRD) dataset: 5k im­ages

  • Pas­cal VOC: 11k im­ages

  • Vi­sual Genome: 108k im­ages

  • nico-open­data: 400k, but SFW & re­stricted to ap­proved re­searchers

  • Open Images V4: released 2018, 30.1m tags for 9.2m images and 15.4m bounding-boxes, with high label quality; a major advantage of this dataset is that it uses CC-BY-licensed Flickr photographs/images, and so it should be freely distributable

  • BAM!: 65m raw images, 393k? tags for 2.5m? tagged images (semi-supervised), restricted access?

The external validity of classifiers trained on these datasets is somewhat questionable, as the learned discriminative models may collapse or simplify in undesirable ways, and overfit on the datasets’ individual biases (Torralba & Efros 2011). For example, ImageNet classifiers sometimes appear to ‘cheat’ by relying on localized textures in a “bag-of-words”-style approach and simplistic outlines/shapes, recognizing leopards only by the color texture of the fur, or believing barbells are extensions of arms. CNNs by default appear to rely almost entirely on texture and ignore shapes/outlines, unlike human vision, rendering them fragile to transforms; training which emphasizes shape/outline data augmentation can improve accuracy & robustness, making anime images a challenging testbed (and this texture-bias may explain the poor performance of anime-targeted NNs in the past). These datasets are simply not large enough, or richly annotated enough, to train classifiers or taggers better than that, or, with residual networks reaching human parity, to reveal differences between the best algorithms and the merely good. (Dataset biases have also been issues on question-answering datasets.) As well, the datasets are static, not accepting any additions, better metadata, or corrections. Like MNIST before it, ImageNet is verging on ‘solved’ (the ILSVRC organizers ended it after the 2017 competition) and further progress may simply be overfitting to idiosyncrasies of the datapoints and errors; even if lowered error rates are not overfitting, the low error rates compress the differences between algorithms, giving a misleading view of progress and understating the benefits of better architectures, as improvements become comparable in size to simple chance in initializations/training/validation-set choice. As one text-to-image paper notes:

It is an open is­sue of tex­t-to-im­age map­ping that the dis­tri­b­u­tion of im­ages con­di­tioned on a sen­tence is highly mul­ti­-modal. In the past few years, we’ve wit­nessed a break­through in the ap­pli­ca­tion of re­cur­rent neural net­works (RNN) to gen­er­at­ing tex­tual de­scrip­tions con­di­tioned on im­ages [1, 2], with Xu et al. show­ing that the mul­ti­-modal­ity prob­lem can be de­com­posed se­quen­tially [3]. How­ev­er, the lack of datasets with di­ver­sity de­scrip­tions of im­ages lim­its the per­for­mance of tex­t-to-im­age syn­the­sis on mul­ti­-cat­e­gories dataset like MSCOCO [4]. There­fore, the prob­lem of tex­t-to-im­age syn­the­sis is still far from be­ing solved

In contrast, the Danbooru dataset is larger than ImageNet as a whole and larger than the most widely-used multi-description dataset, MS COCO, with far richer metadata than the ‘subject verb object’ sentence summary that is dominant in MS COCO or the birds dataset (sentences which could be adequately summarized in perhaps 5 tags, if even that). While the Danbooru community does focus heavily on female anime characters, they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included. It is unlikely that the performance ceiling will be reached anytime soon, and advanced techniques such as attention will likely be required to get anywhere near the ceiling. And Danbooru is constantly expanding and can be easily updated by anyone anywhere, allowing for regular releases of improved annotations.

Danbooru and the image boorus have been only minimally used in previous machine learning work; principally, in Illustration2Vec (Saito & Matsui 2015, project), which used 1.287m images to train a finetuned VGG-based CNN to detect 1,539 tags (drawn from the 512 most frequent tags of general/copyright/character each) with an overall precision of 32.2%, or “Symbolic Understanding of Anime Using Deep Learning”, Li 2018. But the datasets for past research are typically not distributed and there has been little followup.

Non-photographic

Anime images and illustrations, as compared to photographs, differ in many ways: for example, illustrations are frequently black-and-white rather than color, line art rather than photographic, and even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working on a higher level of abstraction (a leopard would not be as trivially recognized by simple pattern-matching on yellow and black dots), with the irrelevant details that a discriminator might cheaply latch onto typically suppressed in favor of global gestalt, and often heavily stylized. With the exception of MNIST & Omniglot, almost all commonly-used deep learning-related image datasets are photographic.

Humans can still easily perceive a black-white line drawing of a leopard as being a leopard, but can a standard ImageNet classifier? Likewise, the difficulty face detectors encounter on anime images suggests that other detectors like nudity or pornographic detectors may fail; but surely moderation tasks require detection of penises, whether they are drawn or photographed? The attempts to apply CNNs to GANs, image generation, image inpainting, or style transfer have sometimes thrown up artifacts which don’t seem to be issues when using the same architecture on photographic material; for example, in GAN image generation & style transfer, I almost always note, in my own or others’ attempts, what I call the “watercolor effect”, where instead of producing the usual abstracted regions of whitespace, monotone coloring, or simple color gradients, the CNN instead consistently produces noisy transition textures which look like watercolor paintings; these can be beautiful, and an interesting style in their own right (eg the style2paints samples), but mean the CNNs are failing to some degree. This watercolor effect appears not to be a problem in photographic applications, but on the other hand, photos are filled with noisy transition textures, and watching a GAN train, you can see that the learning process generates textures first and only gradually learns to build edges and regions and transitions from the blurred textures; is this anime-specific problem due simply to insufficient data/training, or is there something more fundamentally at issue with current convolutions?

Be­cause il­lus­tra­tions are pro­duced by an en­tirely differ­ent process and fo­cus only on salient de­tails while ab­stract­ing the rest, they offer a way to test ex­ter­nal va­lid­ity and the ex­tent to which tag­gers are tap­ping into high­er-level se­man­tic per­cep­tion.

As well, many ML re­searchers are anime fans and might en­joy work­ing on such a dataset—­train­ing NNs to gen­er­ate anime im­ages can be amus­ing. It is, at least, more in­ter­est­ing than pho­tos of street signs or store­fronts. (“There are few sources of en­ergy so pow­er­ful as a pro­cras­ti­nat­ing grad stu­dent.”)

Community value

A full dataset is of im­me­di­ate value to the Dan­booru com­mu­nity as an archival snap­shot of Dan­booru which can be down­loaded in lieu of ham­mer­ing the main site and down­load­ing ter­abytes of data; back­ups are oc­ca­sion­ally re­quested on the Dan­booru fo­rum but the need is cur­rently not met.

There is potential for a symbiosis between the Danbooru community & ML researchers: in a virtuous circle, the community provides curation and expansion of a rich dataset, while ML researchers can contribute back tools from their research on it which help improve the dataset. The Danbooru community is relatively large and would likely welcome the development of tools like taggers to support semi-automatic (or eventually, fully automatic) image tagging, as use of a tagger could offer orders of magnitude improvement in speed and accuracy compared to their existing manual methods, as well as being newbie-friendly. They are also a pre-existing audience which would be interested in new research results.

Format

The goal of the dataset is to be as easy as pos­si­ble to use im­me­di­ate­ly, avoid­ing ob­scure file for­mats, while al­low­ing si­mul­ta­ne­ous re­search & seed­ing of the tor­rent, with easy up­dates.

Images are provided in the full original form (be that JPG, PNG, GIF or otherwise) for reference/archival purposes, along with a script for converting to JPGs & downscaling (creating a smaller corpus more suitable for ML use).

Im­ages are buck­eted into 1000 sub­di­rec­to­ries 0–999, which is the Dan­booru ID mod­ulo 1000 (ie all im­ages in 0999/ have an ID end­ing in ‘999’). A sin­gle di­rec­tory would cause patho­log­i­cal filesys­tem per­for­mance, and mod­ulo ID spreads im­ages evenly with­out re­quir­ing ad­di­tional di­rec­to­ries to be made. The ID is not ze­ro-padded and files end in the rel­e­vant ex­ten­sion, hence the file lay­out looks like this:

original/0000/
original/0000/1000.png
original/0000/2000.jpg
original/0000/3000.jpg
original/0000/4000.png
original/0000/5000.jpg
original/0000/6000.jpg
original/0000/7000.jpg
original/0000/8000.jpg
original/0000/9000.jpg
...

Cur­rently rep­re­sented file ex­ten­sions are: avi/bmp/gif/html/jpeg/jpg/mp3/mp4/mpg/pdf/png/rar/swf/webm/wmv/zip. (JPG/PNG files have been loss­lessly op­ti­mized us­ing jpegoptim/OptiPNG, sav­ing ~100G­B.)
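
For reference, a minimal shell sketch of the ID-to-path mapping (bucket = ID modulo 1000, zero-padded to 4 digits; the extension varies, hence the glob):

ID=1525146
BUCKET=$(printf "%04d" $(( ID % 1000 )))   # -> 0146
ls original/"$BUCKET"/"$ID".*              # -> original/0146/1525146.jpg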

Raw orig­i­nal files are treach­er­ous

Be careful if working with the original rather than 512px subset. There are many odd files: truncated, non-sRGB colorspace, wrong file extensions (eg some PNGs have .jpg extensions, like original/0146/1525146.jpg or original/0558/1422558.jpg), etc.
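
One heuristic way to flag such problem files (assuming ImageMagick’s identify and the standard file utility are installed; a sketch, not the validation used to build the dataset):

# files whose actual MIME type disagrees with their .jpg extension:
find original/ -type f -name '*.jpg' -print0 | xargs -0 file --mime-type | grep -v 'image/jpeg$'

# corrupt or unreadable images (identify exits non-zero on files it cannot parse):
find original/ -type f \( -name '*.jpg' -o -name '*.png' \) -print0 \
    | xargs -0 -n1 -P"$(nproc)" sh -c 'identify "$0" >/dev/null 2>&1 || echo "BAD: $0"'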

The SFW tor­rent fol­lows the same schema but in­side the 512px/ di­rec­tory in­stead and con­verted to JPG for the SFW files: 512px/0000/1000.jpg etc.

An experimental shell script for parallelized conversion of the full-size original images into a more tractable ~250GB corpus of 512×512px images is included: rescale_images.sh. It requires ImageMagick & GNU parallel to be installed.
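
The included script should be used, but as a rough sketch of the kind of parallelized downscaling it performs (output quality, padding, and handling of odd files here are illustrative, not the script’s exact behavior):

mkdir -p 512px
find original/ -type f \( -name '*.jpg' -o -name '*.png' \) \
    | parallel --progress '
        mkdir -p 512px/$(basename $(dirname {}));
        convert {} -resize "512x512>" -quality 90 512px/$(basename $(dirname {}))/{/.}.jpg'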

Image Metadata

The metadata is available as an XZ-compressed tarball of JSON files as exported from the Danbooru BigQuery database mirror (metadata.json.tar.xz). Each line is an individual JSON object for a single image; ad hoc queries can be run easily by piping into jq, and several are illustrated in the shell query appendix.

Here is an ex­am­ple of a shell script for get­ting the file­names of all SFW im­ages match­ing a par­tic­u­lar tag:

# print out filenames of all SFW Danbooru images matching a particular tag.
# assumes being in a root directory like '/media/gwern/Data2/danbooru2020'
TAG="monochrome"

TEMP=$(mktemp /tmp/matches-XXXX.txt)
# grep the JSON-lines metadata for records containing the tag & an SFW rating,
# then extract the image IDs with jq (drop the `head -1000` to go beyond a quick sample):
cat metadata/* | head -1000 | fgrep -e '"name":"'"$TAG"'"' | fgrep '"rating":"s"' \
    | jq -c '.id' | tr -d '"' >> "$TEMP"

# map each ID to its bucket (ID modulo 1000, zero-padded to 4 digits) and list the file:
for ID in $(cat "$TEMP"); do
        BUCKET=$(printf "%04d" $(( ID % 1000 )) );
        TARGET=$(ls ./original/"$BUCKET/$ID".*)
        ls "$TARGET"
done

Three example metadata records (jq-formatted):

{
  "id": "148112",
  "created_at": "2007-10-25 21:29:41.5877 UTC",
  "uploader_id": "1",
  "score": "2",
  "source": "",
  "md5": "afc6c473332f8372afba07cb597818af",
  "last_commented_at": "1970-01-01 00:00:00 UTC",
  "rating": "s",
  "image_width": "1555",
  "image_height": "1200",
  "is_note_locked": false,
  "file_ext": "jpg",
  "last_noted_at": "1970-01-01 00:00:00 UTC",
  "is_rating_locked": false,
  "parent_id": "0",
  "has_children": false,
  "approver_id": "0",
  "file_size": "390946",
  "is_status_locked": false,
  "up_score": "2",
  "down_score": "0",
  "is_pending": false,
  "is_flagged": false,
  "is_deleted": false,
  "updated_at": "2016-03-26 16:29:45.28726 UTC",
  "is_banned": false,
  "pixiv_id": "0",
  "tags": [
    {
      "id": "567316",
      "name": "6+girls",
      "category": "0"
    },
    {
      "id": "437490",
      "name": "artist_request",
      "category": "5"
    },
    {
      "id": "6059",
      "name": "blazer",
      "category": "0"
    },
    {
      "id": "2378",
      "name": "buruma",
      "category": "0"
    },
    {
      "id": "484628",
      "name": "copyright_request",
      "category": "5"
    },
    {
      "id": "6532",
      "name": "glasses",
      "category": "0"
    },
    {
      "id": "7450",
      "name": "gym_uniform",
      "category": "0"
    },
    {
      "id": "1566",
      "name": "highres",
      "category": "5"
    },
    {
      "id": "3843",
      "name": "jacket",
      "category": "0"
    },
    {
      "id": "566835",
      "name": "multiple_girls",
      "category": "0"
    },
    {
      "id": "391",
      "name": "panties",
      "category": "0"
    },
    {
      "id": "2770",
      "name": "pantyshot",
      "category": "0"
    },
    {
      "id": "16509",
      "name": "school_uniform",
      "category": "0"
    },
    {
      "id": "3477",
      "name": "sweater",
      "category": "0"
    },
    {
      "id": "432529",
      "name": "sweater_vest",
      "category": "0"
    },
    {
      "id": "3291",
      "name": "teacher",
      "category": "0"
    },
    {
      "id": "1882",
      "name": "thighhighs",
      "category": "0"
    },
    {
      "id": "464906",
      "name": "underwear",
      "category": "0"
    },
    {
      "id": "6176",
      "name": "vest",
      "category": "0"
    },
    {
      "id": "230",
      "name": "waitress",
      "category": "0"
    },
    {
      "id": "4123",
      "name": "wind",
      "category": "0"
    },
    {
      "id": "378454",
      "name": "wind_lift",
      "category": "0"
    },
    {
      "id": "10644",
      "name": "zettai_ryouiki",
      "category": "0"
    }
  ],
  "pools": [],
  "favs": [
    "11896",
    "1200",
    "13418",
    "11637",
    "108341"
  ]
}

{
  "id": "251218",
  "created_at": "2008-05-21 00:41:56.83102 UTC",
  "uploader_id": "1",
  "score": "2",
  "source": "http://i2.pixiv.net/img10/img/aki-prism/7956060_p31.jpg",
  "md5": "a3b948d2feab35045201da677adaa925",
  "last_commented_at": "1970-01-01 00:00:00 UTC",
  "rating": "s",
  "image_width": "350",
  "image_height": "700",
  "is_note_locked": false,
  "file_ext": "jpg",
  "last_noted_at": "1970-01-01 00:00:00 UTC",
  "is_rating_locked": false,
  "parent_id": "0",
  "has_children": false,
  "approver_id": "0",
  "file_size": "73187",
  "is_status_locked": false,
  "up_score": "2",
  "down_score": "0",
  "is_pending": false,
  "is_flagged": false,
  "is_deleted": false,
  "updated_at": "2020-05-05 23:42:39.02344 UTC",
  "is_banned": false,
  "pixiv_id": "7956060",
  "tags": [
    {
      "id": "470575",
      "name": "1girl",
      "category": "0"
    },
    {
      "id": "6126",
      "name": "animal_ears",
      "category": "0"
    },
    {
      "id": "401178",
      "name": "aruruw",
      "category": "4"
    },
    {
      "id": "465619",
      "name": "closed_eyes",
      "category": "0"
    },
    {
      "id": "10157",
      "name": "honey",
      "category": "0"
    },
    {
      "id": "412964",
      "name": "honeypot",
      "category": "0"
    },
    {
      "id": "426559",
      "name": "marupeke",
      "category": "1"
    },
    {
      "id": "402239",
      "name": "photoshop_(medium)",
      "category": "5"
    },
    {
      "id": "16509",
      "name": "school_uniform",
      "category": "0"
    },
    {
      "id": "268819",
      "name": "serafuku",
      "category": "0"
    },
    {
      "id": "212816",
      "name": "solo",
      "category": "0"
    },
    {
      "id": "15674",
      "name": "tail",
      "category": "0"
    },
    {
      "id": "575561",
      "name": "utawareru_mono",
      "category": "3"
    }
  ],
  "pools": [],
  "favs": [
    "13392",
    "35380",
    "106523",
    "484488",
    "60223"
  ]
}

{
  "id": "901634",
  "created_at": "2011-04-21 22:18:02.20889 UTC",
  "uploader_id": "37391",
  "score": "7",
  "source": "http://www.sword-girls.com/default.aspx",
  "md5": "2c70ff536e7fc8186b70b6d9023d579f",
  "last_commented_at": "1970-01-01 00:00:00 UTC",
  "rating": "s",
  "image_width": "320",
  "image_height": "480",
  "is_note_locked": false,
  "file_ext": "jpg",
  "last_noted_at": "1970-01-01 00:00:00 UTC",
  "is_rating_locked": false,
  "parent_id": "0",
  "has_children": false,
  "approver_id": "288549",
  "file_size": "162693",
  "is_status_locked": false,
  "up_score": "5",
  "down_score": "0",
  "is_pending": false,
  "is_flagged": false,
  "is_deleted": false,
  "updated_at": "2013-05-25 15:10:19.68411 UTC",
  "is_banned": false,
  "pixiv_id": "0",
  "tags": [
    {
      "id": "470575",
      "name": "1girl",
      "category": "0"
    },
    {
      "id": "89368",
      "name": "aqua_eyes",
      "category": "0"
    },
    {
      "id": "399827",
      "name": "arms_up",
      "category": "0"
    },
    {
      "id": "4011",
      "name": "blade",
      "category": "0"
    },
    {
      "id": "378993",
      "name": "energy_sword",
      "category": "0"
    },
    {
      "id": "2270",
      "name": "eyepatch",
      "category": "0"
    },
    {
      "id": "464559",
      "name": "flower",
      "category": "0"
    },
    {
      "id": "7581",
      "name": "garter_belt",
      "category": "0"
    },
    {
      "id": "197",
      "name": "garters",
      "category": "0"
    },
    {
      "id": "620491",
      "name": "iri_flina",
      "category": "4"
    },
    {
      "id": "495048",
      "name": "lily_(flower)",
      "category": "0"
    },
    {
      "id": "10606",
      "name": "lowres",
      "category": "5"
    },
    {
      "id": "461172",
      "name": "nardack",
      "category": "1"
    },
    {
      "id": "15080",
      "name": "short_hair",
      "category": "0"
    },
    {
      "id": "15425",
      "name": "silver_hair",
      "category": "0"
    },
    {
      "id": "429",
      "name": "skirt",
      "category": "0"
    },
    {
      "id": "212816",
      "name": "solo",
      "category": "0"
    },
    {
      "id": "401228",
      "name": "sword",
      "category": "0"
    },
    {
      "id": "620408",
      "name": "sword_girls",
      "category": "3"
    },
    {
      "id": "1882",
      "name": "thighhighs",
      "category": "0"
    },
    {
      "id": "11449",
      "name": "weapon",
      "category": "0"
    },
    {
      "id": "10644",
      "name": "zettai_ryouiki",
      "category": "0"
    }
  ],
  "pools": [],
  "favs": [
    "23888",
    "115871",
    "342656",
    "332770",
    "95046",
    "324891",
    "20124",
    "149704",
    "34355",
    "290816",
    "228600",
    "55507",
    "338018",
    "134865",
    "72221",
    "256960",
    "104143",
    "85939",
    "386036",
    "450665",
    "497363",
    "550966"
  ]
}
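
Given records like these, a small jq sketch for flattening each record into a tab-separated line of ID, rating, and space-separated tag names:

# one TSV line per image: id, rating, tags
cat metadata/* | jq -r '[.id, .rating, ([.tags[].name] | join(" "))] | @tsv'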

Citing

Please cite this dataset as:

  • Anony­mous, The Dan­booru Com­mu­ni­ty, & Gw­ern Bran­wen; “Dan­booru2020: A Large-S­cale Crowd­sourced and Tagged Anime Il­lus­tra­tion Dataset”, 2020-01-12. Web. Ac­cessed [DATE] https://www.gwern.net/Danbooru2020

    @misc{danbooru2020,
        author = {Anonymous and Danbooru community and Gwern Branwen},
        title = {Danbooru2020: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset},
        howpublished = {\url{https://www.gwern.net/Danbooru2020}},
        url = {https://www.gwern.net/Danbooru2020},
        type = {dataset},
        year = {2021},
        month = {January},
        timestamp = {2020-01-12},
        note = {Accessed: DATE} }

Past releases

Danbooru2017

The first re­lease, Dan­booru2017, con­tained ~1.9tb of 2.94m im­ages with 77.5m tag in­stances (of 333k de­fined tags, ~26.3/image) cov­er­ing Dan­booru from 2005-05-24 through 2017-12-31 (fi­nal ID: #2,973,532).

Dan­booru2018 added 0.413TB/392,557 images/15,208,974 tags/31,698 new unique tags.

To reconstruct Danbooru2017, download Danbooru2018, take the image subset ID #1–2973532 as the image dataset, and use the JSON metadata in the subdirectory metadata/2017/ as the metadata. That should give you Danbooru2017 bit-identical to the version released on 2018-02-13.

Danbooru2018

The sec­ond re­lease was a tor­rent of ~2.5tb of 3.33m im­ages with 92.7m tag in­stances (of 365k de­fined tags, ~27.8/image) cov­er­ing Dan­booru from 2005-05-24 through 2018-12-31 (fi­nal ID: #3,368,713), pro­vid­ing the im­age files & a JSON ex­port of the meta­da­ta. We also pro­vided a smaller tor­rent of SFW im­ages down­scaled to 512×512px JPGs (241GB; 2,232,462 im­ages) for con­ve­nience.

Dan­booru2018 can be re­con­structed sim­i­larly us­ing metadata/2018/.

Danbooru2019

The third re­lease was 3tb of 3.69m im­ages, 108m tags, through 2019-12-31 (fi­nal ID: #3,734,660). Dan­booru2019 can be re­con­structed like­wise.

Applications

Projects

Code and de­rived datasets:

Publications

Re­search:

  • , Gokaslan et al 2018:

    Un­su­per­vised im­age-to-im­age trans­la­tion tech­niques are able to map lo­cal tex­ture be­tween two do­mains, but they are typ­i­cally un­suc­cess­ful when the do­mains re­quire larger shape change. In­spired by se­man­tic seg­men­ta­tion, we in­tro­duce a dis­crim­i­na­tor with di­lated con­vo­lu­tions that is able to use in­for­ma­tion from across the en­tire im­age to train a more con­tex­t-aware gen­er­a­tor. This is cou­pled with a mul­ti­-s­cale per­cep­tual loss that is bet­ter able to rep­re­sent er­ror in the un­der­ly­ing shape of ob­jects. We demon­strate that this de­sign is more ca­pa­ble of rep­re­sent­ing shape de­for­ma­tion in a chal­leng­ing toy dataset, plus in com­plex map­pings with sig­nifi­cant dataset vari­a­tion be­tween hu­mans, dolls, and anime faces, and be­tween cats and dogs.

  • , Zhang et al 2018 (on style2­paints, ver­sion 3):

    Sketch or line art col­oriza­tion is a re­search field with sig­nifi­cant mar­ket de­mand. Differ­ent from photo col­oriza­tion which strongly re­lies on tex­ture in­for­ma­tion, sketch col­oriza­tion is more chal­leng­ing as sketches may not have tex­ture. Even worse, col­or, tex­ture, and gra­di­ent have to be gen­er­ated from the ab­stract sketch lines. In this pa­per, we pro­pose a semi­-au­to­matic learn­ing-based frame­work to col­orize sketches with proper col­or, tex­ture as well as gra­di­ent. Our frame­work con­sists of two stages. In the first draft­ing stage, our model guesses color re­gions and splashes a rich va­ri­ety of col­ors over the sketch to ob­tain a color draft. In the sec­ond re­fine­ment stage, it de­tects the un­nat­ural col­ors and ar­ti­facts, and try to fix and re­fine the re­sult. Com­par­ing to ex­ist­ing ap­proach­es, this two-stage de­sign effec­tively di­vides the com­plex col­oriza­tion task into two sim­pler and goal-clearer sub­tasks. This eases the learn­ing and raises the qual­ity of col­oriza­tion. Our model re­solves the ar­ti­facts such as wa­ter-color blur­ring, color dis­tor­tion, and dull tex­tures.

    We build an in­ter­ac­tive soft­ware based on our model for eval­u­a­tion. Users can it­er­a­tively edit and re­fine the col­oriza­tion. We eval­u­ate our learn­ing model and the in­ter­ac­tive sys­tem through an ex­ten­sive user study. Sta­tis­tics shows that our method out­per­forms the state-of-art tech­niques and in­dus­trial ap­pli­ca­tions in sev­eral as­pects in­clud­ing, the vi­sual qual­i­ty, the abil­ity of user con­trol, user ex­pe­ri­ence, and other met­rics.

  • “Ap­pli­ca­tion of Gen­er­a­tive Ad­ver­sar­ial Net­work on Im­age Style Trans­for­ma­tion and Im­age Pro­cess­ing”, Wang 2018:

    Im­age-to-Im­age trans­la­tion is a col­lec­tion of com­puter vi­sion prob­lems that aim to learn a map­ping be­tween two differ­ent do­mains or mul­ti­ple do­mains. Re­cent re­search in com­puter vi­sion and deep learn­ing pro­duced pow­er­ful tools for the task. Con­di­tional ad­ver­sar­ial net­works serve as a gen­er­al-pur­pose so­lu­tion for im­age-to-im­age trans­la­tion prob­lems. Deep Con­vo­lu­tional Neural Net­works can learn an im­age rep­re­sen­ta­tion that can be ap­plied for recog­ni­tion, de­tec­tion, and seg­men­ta­tion. Gen­er­a­tive Ad­ver­sar­ial Net­works (GANs) has gained suc­cess in im­age syn­the­sis. How­ev­er, tra­di­tional mod­els that re­quire paired train­ing data might not be ap­plic­a­ble in most sit­u­a­tions due to lack of paired da­ta.

    Here we re­view and com­pare two differ­ent mod­els for learn­ing un­su­per­vised im­age to im­age trans­la­tion: CycleGAN and Un­su­per­vised Im­age-to-Im­age Trans­la­tion Net­works (UNIT). Both mod­els adopt cy­cle con­sis­ten­cy, which en­ables us to con­duct un­su­per­vised learn­ing with­out paired da­ta. We show that both mod­els can suc­cess­fully per­form im­age style trans­la­tion. The ex­per­i­ments re­veal that CycleGAN can gen­er­ate more re­al­is­tic re­sults, and UNIT can gen­er­ate var­ied im­ages and bet­ter pre­serve the struc­ture of in­put im­ages.

  • , Noguchi & Harada 2019 (Dan­booru2018 by way of StyleGAN/TWDNE-gen­er­ated im­ages):

    Thanks to the re­cent de­vel­op­ment of deep gen­er­a­tive mod­els, it is be­com­ing pos­si­ble to gen­er­ate high­-qual­ity im­ages with both fi­delity and di­ver­si­ty. How­ev­er, the train­ing of such gen­er­a­tive mod­els re­quires a large dataset. To re­duce the amount of data re­quired, we pro­pose a new method for trans­fer­ring prior knowl­edge of the pre-trained gen­er­a­tor, which is trained with a large dataset, to a small dataset in a differ­ent do­main. Us­ing such prior knowl­edge, the model can gen­er­ate im­ages lever­ag­ing some com­mon sense that can­not be ac­quired from a small dataset. In this work, we pro­pose a novel method fo­cus­ing on the pa­ra­me­ters for batch sta­tis­tics, scale and shift, of the hid­den lay­ers in the gen­er­a­tor. By train­ing only these pa­ra­me­ters in a su­per­vised man­ner, we achieved sta­ble train­ing of the gen­er­a­tor, and our method can gen­er­ate higher qual­ity im­ages com­pared to pre­vi­ous meth­ods with­out col­laps­ing even when the dataset is small (~100). Our re­sults show that the di­ver­sity of the fil­ters ac­quired in the pre-trained gen­er­a­tor is im­por­tant for the per­for­mance on the tar­get do­main. By our method, it be­comes pos­si­ble to add a new class or do­main to a pre-trained gen­er­a­tor with­out dis­turb­ing the per­for­mance on the orig­i­nal do­main.

  • , Suzuki et al 2018:

    We present a novel CNN-based im­age edit­ing strat­egy that al­lows the user to change the se­man­tic in­for­ma­tion of an im­age over an ar­bi­trary re­gion by ma­nip­u­lat­ing the fea­ture-space rep­re­sen­ta­tion of the im­age in a trained GAN mod­el. We will present two vari­ants of our strat­e­gy: (1) spa­tial con­di­tional batch nor­mal­iza­tion (sCBN), a type of con­di­tional batch nor­mal­iza­tion with user-speci­fi­able spa­tial weight maps, and (2) fea­ture-blend­ing, a method of di­rectly mod­i­fy­ing the in­ter­me­di­ate fea­tures. Our meth­ods can be used to edit both ar­ti­fi­cial im­age and real im­age, and they both can be used to­gether with any GAN with con­di­tional nor­mal­iza­tion lay­ers. We will demon­strate the power of our method through ex­per­i­ments on var­i­ous types of GANs trained on differ­ent datasets. Code will be avail­able at this URL.

  • , Wang et al 2019:

    One of the at­trac­tive char­ac­ter­is­tics of deep neural net­works is their abil­ity to trans­fer knowl­edge ob­tained in one do­main to other re­lated do­mains. As a re­sult, high­-qual­ity net­works can be trained in do­mains with rel­a­tively lit­tle train­ing da­ta. This prop­erty has been ex­ten­sively stud­ied for dis­crim­i­na­tive net­works but has re­ceived sig­nifi­cantly less at­ten­tion for gen­er­a­tive mod­el­s.­Given the often enor­mous effort re­quired to train GANs, both com­pu­ta­tion­ally as well as in the dataset col­lec­tion, the re-use of pre­trained GANs is a de­sir­able ob­jec­tive. We pro­pose a novel knowl­edge trans­fer method for gen­er­a­tive mod­els based on min­ing the knowl­edge that is most ben­e­fi­cial to a spe­cific tar­get do­main, ei­ther from a sin­gle or mul­ti­ple pre­trained GANs. This is done us­ing a miner net­work that iden­ti­fies which part of the gen­er­a­tive dis­tri­b­u­tion of each pre­trained GAN out­puts sam­ples clos­est to the tar­get do­main. Min­ing effec­tively steers GAN sam­pling to­wards suit­able re­gions of the la­tent space, which fa­cil­i­tates the pos­te­rior fine­tun­ing and avoids patholo­gies of other meth­ods such as mode col­lapse and lack of flex­i­bil­i­ty. We per­form ex­per­i­ments on sev­eral com­plex datasets us­ing var­i­ous GAN ar­chi­tec­tures (BigGAN, Pro­gres­sive GAN) and show that the pro­posed method, called MineGAN, effec­tively trans­fers knowl­edge to do­mains with few tar­get im­ages, out­per­form­ing ex­ist­ing meth­ods. In ad­di­tion, MineGAN can suc­cess­fully trans­fer knowl­edge from mul­ti­ple pre­trained GANs.

  • , Kim et al 2019b (Tag2Pix CLI/GUI):

    Line art col­oriza­tion is ex­pen­sive and chal­leng­ing to au­to­mate. A GAN ap­proach is pro­posed, called Tag2Pix, of line art col­oriza­tion which takes as in­put a grayscale line art and color tag in­for­ma­tion and pro­duces a qual­ity col­ored im­age. First, we present the Tag2Pix line art col­oriza­tion dataset. A gen­er­a­tor net­work is pro­posed which con­sists of con­vo­lu­tional lay­ers to trans­form the in­put line art, a pre-trained se­man­tic ex­trac­tion net­work, and an en­coder for in­put color in­for­ma­tion. The dis­crim­i­na­tor is based on an aux­il­iary clas­si­fier GAN to clas­sify the tag in­for­ma­tion as well as gen­uine­ness. In ad­di­tion, we pro­pose a novel net­work struc­ture called SECat, which makes the gen­er­a­tor prop­erly col­orize even small fea­tures such as eyes, and also sug­gest a novel two-step train­ing method where the gen­er­a­tor and dis­crim­i­na­tor first learn the no­tion of ob­ject and shape and then, based on the learned no­tion, learn col­oriza­tion, such as where and how to place which col­or. We present both quan­ti­ta­tive and qual­i­ta­tive eval­u­a­tions which prove the effec­tive­ness of the pro­posed method.

    , Lee et al 2020:

    This pa­per tack­les the au­to­matic col­oriza­tion task of a sketch im­age given an al­ready-col­ored ref­er­ence im­age. Col­oriz­ing a sketch im­age is in high de­mand in comics, an­i­ma­tion, and other con­tent cre­ation ap­pli­ca­tions, but it suffers from in­for­ma­tion scarcity of a sketch im­age. To ad­dress this, a ref­er­ence im­age can ren­der the col­oriza­tion process in a re­li­able and user-driven man­ner. How­ev­er, it is diffi­cult to pre­pare for a train­ing data set that has a suffi­cient amount of se­man­ti­cally mean­ing­ful pairs of im­ages as well as the ground truth for a col­ored im­age re­flect­ing a given ref­er­ence (e.g., col­or­ing a sketch of an orig­i­nally blue car given a ref­er­ence green car). To tackle this chal­lenge, we pro­pose to uti­lize the iden­ti­cal im­age with geo­met­ric dis­tor­tion as a vir­tual ref­er­ence, which makes it pos­si­ble to se­cure the ground truth for a col­ored out­put im­age. Fur­ther­more, it nat­u­rally pro­vides the ground truth for dense se­man­tic cor­re­spon­dence, which we uti­lize in our in­ter­nal at­ten­tion mech­a­nism for color trans­fer from ref­er­ence to sketch in­put. We demon­strate the effec­tive­ness of our ap­proach in var­i­ous types of sketch im­age col­oriza­tion via quan­ti­ta­tive as well as qual­i­ta­tive eval­u­a­tion against ex­ist­ing meth­ods.

  • , Xi­ang & Li 2019 (?)

  • , Chen et al 2019:

    In­stance based photo car­tooniza­tion is one of the chal­leng­ing im­age styl­iza­tion tasks which aim at trans­form­ing re­al­is­tic pho­tos into car­toon style im­ages while pre­serv­ing the se­man­tic con­tents of the pho­tos. State-of-the-art Deep Neural Net­works (DNNs) meth­ods still fail to pro­duce sat­is­fac­tory re­sults with in­put pho­tos in the wild, es­pe­cially for pho­tos which have high con­trast and full of rich tex­tures. This is due to that: car­toon style im­ages tend to have smooth color re­gions and em­pha­sized edges which are con­tra­dict to re­al­is­tic pho­tos which re­quire clear se­man­tic con­tents, i.e., tex­tures, shapes etc. Pre­vi­ous meth­ods have diffi­culty in sat­is­fy­ing car­toon style tex­tures and pre­serv­ing se­man­tic con­tents at the same time. In this work, we pro­pose a novel “Car­toon­Ren­derer” frame­work which uti­liz­ing a sin­gle trained model to gen­er­ate mul­ti­ple car­toon styles. In a nut­shell, our method maps photo into a fea­ture model and ren­ders the fea­ture model back into im­age space. In par­tic­u­lar, car­tooniza­tion is achieved by con­duct­ing some trans­for­ma­tion ma­nip­u­la­tion in the fea­ture space with our pro­posed Soft­-AdaIN. Ex­ten­sive ex­per­i­men­tal re­sults show our method pro­duces higher qual­ity car­toon style im­ages than prior arts, with ac­cu­rate se­man­tic con­tent preser­va­tion. In ad­di­tion, due to the de­cou­pling of whole gen­er­at­ing process into “Mod­el­ing-Co­or­di­nat­ing-Ren­der­ing” parts, our method could eas­ily process higher res­o­lu­tion pho­tos, which is in­tractable for ex­ist­ing meth­ods.

  • “Un­paired Sketch-to-Line Trans­la­tion via Syn­the­sis of Sketches”, Lee et al 2019:

    Con­vert­ing hand-drawn sketches into clean line draw­ings is a cru­cial step for di­verse artis­tic works such as comics and prod­uct de­signs. Re­cent data-driven meth­ods us­ing deep learn­ing have shown their great abil­i­ties to au­to­mat­i­cally sim­plify sketches on raster im­ages. Since it is diffi­cult to col­lect or gen­er­ate paired sketch and line im­ages, lack of train­ing data is a main ob­sta­cle to use these mod­els. In this pa­per, we pro­pose a train­ing scheme that re­quires only un­paired sketch and line im­ages for learn­ing sketch-to-line trans­la­tion. To do this, we first gen­er­ate re­al­is­tic paired sketch and line im­ages from un­paired sketch and line im­ages us­ing rule-based line aug­men­ta­tion and un­su­per­vised tex­ture con­ver­sion. Next, with our syn­thetic paired data, we train a model for sketch-to-line trans­la­tion us­ing su­per­vised learn­ing. Com­pared to un­su­per­vised meth­ods that use cy­cle con­sis­tency loss­es, our model shows bet­ter per­for­mance at re­mov­ing noisy strokes. We also show that our model sim­pli­fies com­pli­cated sketches bet­ter than mod­els trained on a lim­ited num­ber of hand­crafted paired da­ta.

  • “Con­tent Cu­ra­tion, Eval­u­a­tion, and Re­fine­ment on a Non­lin­early Di­rected Im­age­board: Lessons From Dan­booru”, Britt 2019:

    While lin­early di­rected im­age­boards like 4chan have been ex­ten­sively stud­ied, user par­tic­i­pa­tion on non­lin­early di­rected im­age­boards, or “boorus,” has been over­looked de­spite high ac­tiv­i­ty, ex­pan­sive mul­ti­me­dia repos­i­to­ries with user-de­fined clas­si­fi­ca­tions and eval­u­a­tions, and unique affor­dances pri­or­i­tiz­ing mu­tual con­tent cu­ra­tion, eval­u­a­tion, and re­fine­ment over overt dis­course. To ad­dress the gap in the lit­er­a­ture re­lated to par­tic­i­pa­tory en­gage­ment on non­lin­early di­rected im­age­boards, user ac­tiv­ity around the full data­base of N = 2,987,525, sub­mis­sions to Dan­booru, a promi­nent non­lin­early di­rected im­age­board, was eval­u­ated us­ing re­gres­sion. The re­sults il­lus­trate the role played by the affor­dances of non­lin­early di­rected im­age­boards and the vis­i­ble at­trib­utes of in­di­vid­ual sub­mis­sions in shap­ing the user processes of con­tent cu­ra­tion, eval­u­a­tion, and re­fine­ment, as well as the in­ter­re­la­tion­ships be­tween these three core ac­tiv­i­ties. These re­sults pro­vide a foun­da­tion for fur­ther re­search within the unique en­vi­ron­ments of non­lin­early di­rected im­age­boards and sug­gest prac­ti­cal ap­pli­ca­tions across on­line do­mains.

  • , Ye et al 2019:

    Anime line sketch col­oriza­tion is to fill a va­ri­ety of col­ors the anime sketch, to make it col­or­ful and di­verse. The col­or­ing prob­lem is not a new re­search di­rec­tion in the field of deep learn­ing tech­nol­o­gy. Be­cause of col­or­ing of the anime sketch does not have fixed color and we can’t take tex­ture or shadow as ref­er­ence, so it is diffi­cult to learn and have a cer­tain stan­dard to de­ter­mine whether it is cor­rect or not. After gen­er­a­tive ad­ver­sar­ial net­works (GANs) was pro­posed, some used GANs to do col­or­ing re­search, achieved some re­sult, but the col­or­ing effect is lim­it­ed. This study pro­poses a method use deep resid­ual net­work, and adding dis­crim­i­na­tor to net­work, that ex­pect the color of col­ored im­ages can con­sis­tent with the de­sired color by the user and can achieve good col­or­ing re­sults.

  • , Lee et al 2019:

    Con­vert­ing hand-drawn sketches into clean line draw­ings is a cru­cial step for di­verse artis­tic works such as comics and prod­uct de­signs. Re­cent data-driven meth­ods us­ing deep learn­ing have shown their great abil­i­ties to au­to­mat­i­cally sim­plify sketches on raster im­ages. Since it is diffi­cult to col­lect or gen­er­ate paired sketch and line im­ages, lack of train­ing data is a main ob­sta­cle to use these mod­els. In this pa­per, we pro­pose a train­ing scheme that re­quires only un­paired sketch and line im­ages for learn­ing sketch-to-line trans­la­tion. To do this, we first gen­er­ate re­al­is­tic paired sketch and line im­ages from un­paired sketch and line im­ages us­ing rule-based line aug­men­ta­tion and un­su­per­vised tex­ture con­ver­sion. Next, with our syn­thetic paired data, we train a model for sketch-to-line trans­la­tion us­ing su­per­vised learn­ing. Com­pared to un­su­per­vised meth­ods that use cy­cle con­sis­tency loss­es, our model shows bet­ter per­for­mance at re­mov­ing noisy strokes. We also show that our model sim­pli­fies com­pli­cated sketches bet­ter than mod­els trained on a lim­ited num­ber of hand­crafted paired da­ta.

  • , Huang et al 2019:

    Many im­age-to-im­age (I2I) trans­la­tion prob­lems are in na­ture of high di­ver­sity that a sin­gle in­put may have var­i­ous coun­ter­parts. Prior works pro­posed the mul­ti­-modal net­work that can build a many-to-many map­ping be­tween two vi­sual do­mains. How­ev­er, most of them are guided by sam­pled nois­es. Some oth­ers en­code the ref­er­ence im­ages into a la­tent vec­tor, by which the se­man­tic in­for­ma­tion of the ref­er­ence im­age will be washed away. In this work, we aim to pro­vide a so­lu­tion to con­trol the out­put based on ref­er­ences se­man­ti­cal­ly. Given a ref­er­ence im­age and an in­put in an­other do­main, a se­man­tic match­ing is first per­formed be­tween the two vi­sual con­tents and gen­er­ates the aux­il­iary im­age, which is ex­plic­itly en­cour­aged to pre­serve se­man­tic char­ac­ter­is­tics of the ref­er­ence. A deep net­work then is used for I2I trans­la­tion and the fi­nal out­puts are ex­pected to be se­man­ti­cally sim­i­lar to both the in­put and the ref­er­ence; how­ev­er, no such paired data can sat­isfy that du­al-sim­i­lar­ity in a su­per­vised fash­ion, so we build up a self­-su­per­vised frame­work to serve the train­ing pur­pose. We im­prove the qual­ity and di­ver­sity of the out­puts by em­ploy­ing non-lo­cal blocks and a mul­ti­-task ar­chi­tec­ture. We as­sess the pro­posed method through ex­ten­sive qual­i­ta­tive and quan­ti­ta­tive eval­u­a­tions and also pre­sented com­par­isons with sev­eral state-of-art mod­els.

  • , Liu et al 2019:

    Anime sketch col­or­ing is to fill var­i­ous col­ors into the black­-and-white anime sketches and fi­nally ob­tain the color anime im­ages. Re­cent­ly, anime sketch col­or­ing has be­come a new re­search hotspot in the field of deep learn­ing. In anime sketch col­or­ing, gen­er­a­tive ad­ver­sar­ial net­works (GANs) have been used to de­sign ap­pro­pri­ate col­or­ing meth­ods and achieved some re­sults. How­ev­er, the ex­ist­ing meth­ods based on GANs gen­er­ally have low-qual­ity col­or­ing effects, such as un­rea­son­able color mix­ing, poor color gra­di­ent effect. In this pa­per, an effi­cient anime sketch col­or­ing method us­ing swish-gated resid­ual U-net (SGRU) and spec­trally nor­mal­ized GAN (SNGAN) has been pro­posed to solve the above prob­lems. The pro­posed method is called spec­trally nor­mal­ized GAN with swish-gated resid­ual U-net (SSN-GAN). In SSN-GAN, SGRU is used as the gen­er­a­tor. SGRU is the U-net with the pro­posed swish layer and swish-gated resid­ual blocks (SGBs). In SGRU, the pro­posed swish layer and swish-gated resid­ual blocks (SGBs) effec­tively fil­ter the in­for­ma­tion trans­mit­ted by each level and im­prove the per­for­mance of the net­work. The per­cep­tual loss and the per-pixel loss are used to con­sti­tute the fi­nal loss of SGRU. The dis­crim­i­na­tor of SSN-GAN uses spec­tral nor­mal­iza­tion as a sta­bi­lizer of train­ing of GAN, and it is also used as the per­cep­tual net­work for cal­cu­lat­ing the per­cep­tual loss. SSN-GAN can au­to­mat­i­cally color the sketch with­out pro­vid­ing any col­or­ing hints in ad­vance and can be eas­ily end-to-end trained. Ex­per­i­men­tal re­sults show that our method per­forms bet­ter than other state-of-the-art col­or­ing meth­ods, and can ob­tain col­or­ful anime im­ages with higher vi­sual qual­i­ty.

  • , Gopalakr­ish­nan et al 2020:

    Con­trary to the con­ven­tion of us­ing su­per­vi­sion for class-con­di­tioned gen­er­a­tive mod­el­ing, this work ex­plores and demon­strates the fea­si­bil­ity of a learned su­per­vised rep­re­sen­ta­tion space trained on a dis­crim­i­na­tive clas­si­fier for the down­stream task of sam­ple gen­er­a­tion. Un­like gen­er­a­tive mod­el­ing ap­proaches that aim to model the man­i­fold dis­tri­b­u­tion, we di­rectly rep­re­sent the given data man­i­fold in the clas­si­fi­ca­tion space and lever­age prop­er­ties of la­tent space rep­re­sen­ta­tions to gen­er­ate new rep­re­sen­ta­tions that are guar­an­teed to be in the same class. In­ter­est­ing­ly, such rep­re­sen­ta­tions al­low for con­trolled sam­ple gen­er­a­tions for any given class from ex­ist­ing sam­ples and do not re­quire en­forc­ing prior dis­tri­b­u­tion. We show that these la­tent space rep­re­sen­ta­tions can be smartly ma­nip­u­lated (us­ing con­vex com­bi­na­tions of n sam­ples, n≥2) to yield mean­ing­ful sam­ple gen­er­a­tions. Ex­per­i­ments on im­age datasets of vary­ing res­o­lu­tions demon­strate that down­stream gen­er­a­tions have higher clas­si­fi­ca­tion ac­cu­racy than ex­ist­ing con­di­tional gen­er­a­tive mod­els while be­ing com­pet­i­tive in terms of FID.

  • , Su & Fang 2020 (C­S230 class pro­ject; source):

    Hu­man sketches can be ex­pres­sive and ab­stract at the same time. Gen­er­at­ing anime avatars from sim­ple or even bad face draw­ing is an in­ter­est­ing area. Lots of re­lated work has been done such as au­to-col­or­ing sketches to anime or trans­form­ing real pho­tos to ani­me. How­ev­er, there aren’t many in­ter­est­ing works yet to show how to gen­er­ate anime avatars from just some sim­ple draw­ing in­put. In this pro­ject, we pro­pose us­ing GAN to gen­er­ate anime avatars from sketch­es.

  • , Huang et al 2020:

    Sketch-to-im­age (S2I) trans­la­tion plays an im­por­tant role in im­age syn­the­sis and ma­nip­u­la­tion tasks, such as photo edit­ing and col­oriza­tion. Some spe­cific S2I trans­la­tion in­clud­ing sketch-to-photo and sketch-to-paint­ing can be used as pow­er­ful tools in the art de­sign in­dus­try. How­ev­er, pre­vi­ous meth­ods only sup­port S2I trans­la­tion with a sin­gle level of den­si­ty, which gives less flex­i­bil­ity to users for con­trol­ling the in­put sketch­es. In this work, we pro­pose the first mul­ti­-level den­sity sketch-to-im­age trans­la­tion frame­work, which al­lows the in­put sketch to cover a wide range from rough ob­ject out­lines to mi­cro struc­tures. More­over, to tackle the prob­lem of non­con­tin­u­ous rep­re­sen­ta­tion of mul­ti­-level den­sity in­put sketch­es, we project the den­sity level into a con­tin­u­ous la­tent space, which can then be lin­early con­trolled by a pa­ra­me­ter. This al­lows users to con­ve­niently con­trol the den­si­ties of in­put sketches and gen­er­a­tion of im­ages. More­over, our method has been suc­cess­fully ver­i­fied on var­i­ous datasets for differ­ent ap­pli­ca­tions in­clud­ing face edit­ing, mul­ti­-modal sketch-to-photo trans­la­tion, and anime col­oriza­tion, pro­vid­ing coarse-to-fine lev­els of con­trols to these ap­pli­ca­tions.

  • , Akita et al 2020:

    Many stud­ies have re­cently ap­plied deep learn­ing to the au­to­matic col­oriza­tion of line draw­ings. How­ev­er, it is diffi­cult to paint empty pupils us­ing ex­ist­ing meth­ods be­cause the net­works are trained with pupils that have edges, which are gen­er­ated from color im­ages us­ing im­age pro­cess­ing. Most ac­tual line draw­ings have empty pupils that artists must paint in. In this pa­per, we pro­pose a novel net­work model that trans­fers the pupil de­tails in a ref­er­ence color im­age to in­put line draw­ings with empty pupils. We also pro­pose a method for ac­cu­rately and au­to­mat­i­cally col­or­ing eyes. In this method, eye patches are ex­tracted from a ref­er­ence color im­age and au­to­mat­i­cally added to an in­put line draw­ing as color hints us­ing our eye po­si­tion es­ti­ma­tion net­work.

    , Akita et al 2020b:

    Many studies have recently applied deep learning to the automatic colorization of line drawings. However, it is difficult to paint empty pupils using existing methods because the convolutional neural networks are trained with pupils that have edges, which are generated from color images using image processing. Most actual line drawings have empty pupils that artists must paint in. In this paper, we propose a novel network model that transfers the pupil details in a reference color image to input line drawings with empty pupils. We also propose a method for accurately and automatically colorizing eyes. In this method, eye patches are extracted from a reference color image and automatically added to an input line drawing as color hints using our pupil position estimation network.

  • “Dan­booRe­gion: An Il­lus­tra­tion Re­gion Dataset”, Zhang et al 2020 (Github):

    Region is a fundamental element of various cartoon animation techniques and artistic painting applications. Achieving satisfactory regions is essential to the success of these techniques. Motivated to assist diverse region-based cartoon applications, we invite artists to annotate regions for in-the-wild cartoon images with several application-oriented goals: (1) To assist image-based cartoon rendering, relighting, and cartoon intrinsic decomposition literature, artists identify object outlines and eliminate lighting-and-shadow boundaries. (2) To assist cartoon inking tools, cartoon structure extraction applications, and cartoon texture processing techniques, artists clean up texture or deformation patterns and emphasize cartoon structural boundary lines. (3) To assist region-based cartoon digitalization, clip-art vectorization, and animation tracking applications, artists inpaint and reconstruct broken or blurred regions in cartoon images. Given the typicality of these involved applications, this dataset is also likely to be used in other cartoon techniques. We detail the challenges in achieving this dataset and present a human-in-the-loop workflow named Feasibility-based Assignment Recommendation (FAR) to enable large-scale annotating. The FAR tends to reduce artist trials-and-errors and encourage their enthusiasm during annotating. Finally, we present a dataset that contains a large number of artistic region compositions paired with corresponding cartoon illustrations. We also invite multiple professional artists to assure the quality of each annotation. [Keywords: artistic creation, fine art, cartoon, region processing]

  • , Ko & Cho 2020 (Github):

    The translation of comics (and Manga) involves removing text from foreign comic images and typesetting translated letters into them. The text in comics contains a variety of deformed letters drawn in arbitrary positions, in complex images or patterns. These letters have to be removed by experts, as computationally erasing these letters is very challenging. Although several classical image processing algorithms and tools have been developed, a completely automated method that could erase the text is still lacking. Therefore, we propose an image processing framework called ‘SickZil-Machine’ (SZMC) that automates the removal of text from comics. SZMC works through a two-step process. In the first step, the text areas are segmented at the pixel level. In the second step, the letters in the segmented areas are erased and inpainted naturally to match their surroundings. SZMC exhibited a notable performance, employing deep learning-based image segmentation and image inpainting models. To train these models, we constructed 285 pairs of original comic pages, a text area-mask dataset, and a dataset of 31,497 comic pages. We identified the characteristics of the dataset that could improve SZMC performance.

  • , Del Gobbo & Her­rera 2020:

    The detection and recognition of unconstrained text is an open problem in research. Text in comic books has unusual styles that raise many challenges for text detection. This work aims to identify text characters at a pixel level in a comic genre with highly sophisticated text styles: Japanese manga. To overcome the lack of a manga dataset with individual character-level annotations, we create our own. Most of the literature in text detection uses bounding-box metrics, which are unsuitable for pixel-level evaluation. Thus, we implemented special metrics to evaluate performance. Using these resources, we designed and evaluated a deep network model, outperforming current methods for text detection in manga in most metrics.

  • , Zheng et al 2020:

    This paper deals with the challenging task of learning from different modalities by tackling the difficult problem of joint face recognition across abstract-like sketches, cartoons, caricatures, and real-life photographs. Due to the significant variations in the abstract faces, building vision models for recognizing data from these modalities is extremely challenging. We propose a novel framework termed Meta-Continual Learning with Knowledge Embedding to address the task of joint sketch, cartoon, and caricature face recognition. In particular, we firstly present a deep relational network to capture and memorize the relation among different samples. Secondly, we present the construction of our knowledge graph that relates images with labels as the guidance of our meta-learner. We then design a knowledge embedding mechanism to incorporate the knowledge representation into our network. Thirdly, to mitigate catastrophic forgetting, we use a meta-continual model that updates our ensemble model and improves its prediction accuracy. With this meta-continual model, our network can learn from its past. The final classification is derived from our network by learning to compare the features of samples. Experimental results demonstrate that our approach achieves significantly higher performance compared with other state-of-the-art approaches.

  • , Cao et al 2020:

    The cartoon animation industry has developed into a huge industrial chain with a large potential market involving games, digital entertainment, and other industries. However, due to the coarse-grained classification of cartoon materials, cartoon animators can hardly find relevant materials during the process of creation. The polar emotions of cartoon materials are an important reference for creators, as they can help them easily obtain the pictures they need. Some methods for obtaining the emotions of cartoon pictures have been proposed, but most of these focus on expression recognition. Meanwhile, other emotion recognition methods are not ideal for use on cartoon materials. We propose a deep learning-based method to classify the polar emotions of the cartoon pictures of the “Moe” drawing style. According to the expression features of the cartoon characters of this drawing style, we recognize the facial expressions of cartoon characters and extract the scene and facial features of the cartoon images. Then, we correct the emotions of the pictures obtained by the expression recognition according to the scene features. Finally, we can obtain the polar emotions of the corresponding picture. We designed a dataset and performed verification tests on it, achieving 81.9% experimental accuracy. The experimental results prove that our method is competitive. [Keywords: cartoon; emotion classification; deep learning]

  • , Huang et al 2020:

    Im­age-to-Im­age (I2I) trans­la­tion is a heated topic in acad­e­mia, and it also has been ap­plied in re­al-world in­dus­try for tasks like im­age syn­the­sis, su­per-res­o­lu­tion, and col­oriza­tion. How­ev­er, tra­di­tional I2I trans­la­tion meth­ods train data in two or more do­mains to­geth­er. This re­quires lots of com­pu­ta­tion re­sources. More­over, the re­sults are of lower qual­i­ty, and they con­tain many more ar­ti­facts. The train­ing process could be un­sta­ble when the data in differ­ent do­mains are not bal­anced, and modal col­lapse is more likely to hap­pen. We pro­posed a new I2I trans­la­tion method that gen­er­ates a new model in the tar­get do­main via a se­ries of model trans­for­ma­tions on a pre-trained StyleGAN2 model in the source do­main. After that, we pro­posed an in­ver­sion method to achieve the con­ver­sion be­tween an im­age and its la­tent vec­tor. By feed­ing the la­tent vec­tor into the gen­er­ated mod­el, we can per­form I2I trans­la­tion be­tween the source do­main and tar­get do­main. Both qual­i­ta­tive and quan­ti­ta­tive eval­u­a­tions were con­ducted to prove that the pro­posed method can achieve out­stand­ing per­for­mance in terms of im­age qual­i­ty, di­ver­sity and se­man­tic sim­i­lar­ity to the in­put and ref­er­ence im­ages com­pared to state-of-the-art works.

  • , Robb et al 2020:

    Gen­er­a­tive Ad­ver­sar­ial Net­works (GANs) have shown re­mark­able per­for­mance in im­age syn­the­sis tasks, but typ­i­cally re­quire a large num­ber of train­ing sam­ples to achieve high­-qual­ity syn­the­sis. This pa­per pro­poses a sim­ple and effec­tive method, Few-Shot GAN (FSGAN), for adapt­ing GANs in few-shot set­tings (less than 100 im­ages). FSGAN re­pur­poses com­po­nent analy­sis tech­niques and learns to adapt the sin­gu­lar val­ues of the pre-trained weights while freez­ing the cor­re­spond­ing sin­gu­lar vec­tors. This pro­vides a highly ex­pres­sive pa­ra­me­ter space for adap­ta­tion while con­strain­ing changes to the pre­trained weights. We val­i­date our method in a chal­leng­ing few-shot set­ting of 5–100 im­ages in the tar­get do­main. We show that our method has sig­nifi­cant vi­sual qual­ity gains com­pared with ex­ist­ing GAN adap­ta­tion meth­ods. We re­port qual­i­ta­tive and quan­ti­ta­tive re­sults show­ing the effec­tive­ness of our method. We ad­di­tion­ally high­light a prob­lem for few-shot syn­the­sis in the stan­dard quan­ti­ta­tive met­ric used by data-effi­cient im­age syn­the­sis works. Code and ad­di­tional re­sults are avail­able at this URL.

  • , Wu et al 2020:

    Wa­ter­mark­ing neural net­works is a quite im­por­tant means to pro­tect the in­tel­lec­tual prop­erty (IP) of neural net­works. In this pa­per, we in­tro­duce a novel dig­i­tal wa­ter­mark­ing frame­work suit­able for deep neural net­works that out­put im­ages as the re­sults, in which any im­age out­putted from a wa­ter­marked neural net­work must con­tain a cer­tain wa­ter­mark. Here, the host neural net­work to be pro­tected and a wa­ter­mark-ex­trac­tion net­work are trained to­geth­er, so that, by op­ti­miz­ing a com­bined loss func­tion, the trained neural net­work can ac­com­plish the orig­i­nal task while em­bed­ding a wa­ter­mark into the out­putted im­ages. This work is to­tally differ­ent from pre­vi­ous schemes car­ry­ing a wa­ter­mark by net­work weights or clas­si­fi­ca­tion la­bels of the trig­ger set. By de­tect­ing wa­ter­marks in the out­putted im­ages, this tech­nique can be adopted to iden­tify the own­er­ship of the host net­work and find whether an im­age is gen­er­ated from a cer­tain neural net­work or not. We demon­strate that this tech­nique is effec­tive and ro­bust on a va­ri­ety of im­age pro­cess­ing tasks, in­clud­ing im­age col­oriza­tion, su­per-res­o­lu­tion, im­age edit­ing, se­man­tic seg­men­ta­tion and so on.

  • , Rom­bach et al 2020 (us­ing ):

    Given the ever-in­creas­ing com­pu­ta­tional costs of mod­ern ma­chine learn­ing mod­els, we need to find new ways to reuse such ex­pert mod­els and thus tap into the re­sources that have been in­vested in their cre­ation. Re­cent work sug­gests that the power of these mas­sive mod­els is cap­tured by the rep­re­sen­ta­tions they learn. There­fore, we seek a model that can re­late be­tween differ­ent ex­ist­ing rep­re­sen­ta­tions and pro­pose to solve this task with a con­di­tion­ally in­vert­ible net­work. This net­work demon­strates its ca­pa­bil­ity by (1) pro­vid­ing generic trans­fer be­tween di­verse do­mains, (2) en­abling con­trolled con­tent syn­the­sis by al­low­ing mod­i­fi­ca­tion in other do­mains, and (3) fa­cil­i­tat­ing di­ag­no­sis of ex­ist­ing rep­re­sen­ta­tions by trans­lat­ing them into in­ter­pretable do­mains such as im­ages. Our do­main trans­fer net­work can trans­late be­tween fixed rep­re­sen­ta­tions with­out hav­ing to learn or fine­tune them. This al­lows users to uti­lize var­i­ous ex­ist­ing do­main-spe­cific ex­pert mod­els from the lit­er­a­ture that had been trained with ex­ten­sive com­pu­ta­tional re­sources. Ex­per­i­ments on di­verse con­di­tional im­age syn­the­sis tasks, com­pet­i­tive im­age mod­i­fi­ca­tion re­sults and ex­per­i­ments on im­age-to-im­age and tex­t-to-im­age gen­er­a­tion demon­strate the generic ap­plic­a­bil­ity of our ap­proach. For ex­am­ple, we trans­late be­tween BERT and BigGAN, state-of-the-art text and im­age mod­els to pro­vide tex­t-to-im­age gen­er­a­tion, which nei­ther of both ex­perts can per­form on their own.

  • , Anony­mous et al 2020:

    A computationally-efficient GAN for few-shot high-fidelity image datasets (converges on a single GPU with a few hours' training, on 1024px-resolution sub-hundred-image datasets). · Training Generative Adversarial Networks (GAN) on high-fidelity images usually requires large-scale GPU-clusters and a vast number of training images. In this paper, we study the few-shot image synthesis task for GAN with minimum computing cost. We propose a light-weight GAN structure that gains superior quality on 1024×1024px resolution. Notably, the model converges from scratch with just a few hours of training on a single RTX-2080 GPU; and has a consistent performance, even with less than 100 training samples. Two technique designs constitute our work: a skip-layer channel-wise excitation module and a self-supervised discriminator trained as a feature-encoder. With 13 datasets covering a wide variety of image domains, we show our model's robustness and its superior performance compared to the state-of-the-art StyleGAN2.

  • , Shen & Zhou 2020:

    A rich set of se­man­tic at­trib­utes has been shown to emerge in the la­tent space of the Gen­er­a­tive Ad­ver­sar­ial Net­works (GANs) trained for syn­the­siz­ing im­ages. In or­der to iden­tify such la­tent se­man­tics for im­age ma­nip­u­la­tion, pre­vi­ous meth­ods an­no­tate a col­lec­tion of syn­the­sized sam­ples and then train su­per­vised clas­si­fiers in the la­tent space. How­ev­er, they re­quire a clear de­fi­n­i­tion of the tar­get at­tribute as well as the cor­re­spond­ing man­ual an­no­ta­tions, se­verely lim­it­ing their ap­pli­ca­tions in prac­tice. In this work, we ex­am­ine the in­ter­nal rep­re­sen­ta­tion learned by GANs to re­veal the un­der­ly­ing vari­a­tion fac­tors in an un­su­per­vised man­ner. By study­ing the es­sen­tial role of the ful­ly-con­nected layer that takes the la­tent code into the gen­er­a­tor of GANs, we pro­pose a gen­eral closed-form fac­tor­iza­tion method for la­tent se­man­tic dis­cov­ery. The prop­er­ties of the iden­ti­fied se­man­tics are fur­ther an­a­lyzed both the­o­ret­i­cally and em­pir­i­cal­ly. With its fast and effi­cient im­ple­men­ta­tion, our ap­proach is ca­pa­ble of not only find­ing la­tent se­man­tics as ac­cu­rately as the state-of-the-art su­per­vised meth­ods, but also re­sult­ing in far more ver­sa­tile se­man­tic classes across mul­ti­ple GAN mod­els trained on a wide range of datasets.

  • , Mangla et al 2020:

    Re­cent ad­vances in gen­er­a­tive ad­ver­sar­ial net­works (GANs) have shown re­mark­able progress in gen­er­at­ing high­-qual­ity im­ages. How­ev­er, this gain in per­for­mance de­pends on the avail­abil­ity of a large amount of train­ing da­ta. In lim­ited data regimes, train­ing typ­i­cally di­verges, and there­fore the gen­er­ated sam­ples are of low qual­ity and lack di­ver­si­ty. Pre­vi­ous works have ad­dressed train­ing in low data set­ting by lever­ag­ing trans­fer learn­ing and data aug­men­ta­tion tech­niques. We pro­pose a novel trans­fer learn­ing method for GANs in the lim­ited data do­main by lever­ag­ing in­for­ma­tive data prior de­rived from self-supervised/supervised pre-trained net­works trained on a di­verse source do­main. We per­form ex­per­i­ments on sev­eral stan­dard vi­sion datasets us­ing var­i­ous GAN ar­chi­tec­tures (BigGAN, SNGAN, StyleGAN2) to demon­strate that the pro­posed method effec­tively trans­fers knowl­edge to do­mains with few tar­get im­ages, out­per­form­ing ex­ist­ing state-of-the-art tech­niques in terms of im­age qual­ity and di­ver­si­ty. We also show the util­ity of data in­stance prior in large-s­cale un­con­di­tional im­age gen­er­a­tion and im­age edit­ing tasks.

  • , Rios et al 2021 (repo):

    In this work we tackle the challenging problem of anime character recognition. Anime refers to animation produced within Japan and work derived from or inspired by it. For this purpose we present DAF:re (DanbooruAnimeFaces:revamped), a large-scale, crowd-sourced, long-tailed dataset with almost 500K images spread across more than 3000 classes. Additionally, we conduct experiments on DAF:re and similar datasets using a variety of classification models, including CNN-based ResNets and self-attention-based ViT models. Our results give new insights into the generalization and transfer learning properties of ViT models on substantially different domain datasets from those used for the upstream pre-training, including the influence of batch and image size in their training. Additionally, we share our dataset, source-code, pre-trained checkpoints and results, as Animesion, the first end-to-end framework for large-scale anime character recognition.

Scraping

This project is not officially affiliated with or run by Danbooru; however, the site founder Albert (and his successor, Evazion) has given his permission for scraping. I have registered the accounts gwern and gwern-bot for use in downloading & participating on Danbooru; it is considered good research ethics to try to offset any use of resources when crawling an online community (eg. running Tor nodes to pay back the bandwidth), so I have donated $20 to Danbooru via an account upgrade.

Danbooru IDs are sequential positive integers, but the images are stored under their MD5 hashes; so downloading the full images can be done by querying the JSON API for the metadata of an ID, getting the URL for the full upload, and downloading that to the ID plus extension.
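
A minimal shell sketch of this per-ID download (not the actual mirroring script; it assumes curl & jq are installed, and the JSON field names file_url/file_ext reflect the current API and may change):

## fetch the metadata for one post ID and save the full-size image under that ID:
id=4279845
meta=$(curl --silent "https://danbooru.donmai.us/posts/$id.json")
url=$(echo "$meta" | jq --raw-output '.file_url')   ## full-size upload (stored by MD5 server-side)
ext=$(echo "$meta" | jq --raw-output '.file_ext')
[ "$url" != "null" ] && curl --silent "$url" --output "$id.$ext"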

The meta­data can be down­loaded from Big­Query via BigQuery-API-based tools.
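
As a hedged sketch of the BigQuery route, using the bq command-line tool from the Google Cloud SDK; the table path below is a placeholder to be replaced with the published BigQuery location of the metadata:

## count posts by rating in BigQuery without downloading the torrent:
bq query --use_legacy_sql=false \
  'SELECT rating, COUNT(*) AS n FROM `PROJECT.DATASET.posts` GROUP BY rating ORDER BY n DESC'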

Bugs

Known bugs:

  • Miss­ing trans­la­tion meta­data: the meta­data does not in­clude the trans­la­tions or bound­ing-boxes of captions/translations (“notes”); they were omit­ted from the Big­Query mir­ror and tech­ni­cal prob­lems meant they could not be added to BQ be­fore re­lease. The captions/translations can be re­trieved via the Dan­booru API if nec­es­sary.

  • 512px SFW subset transparency problem: some images have transparent backgrounds; if they are also black-white, like black line-art drawings, then the conversion to JPG with a default black background will render them almost 100% black and the image will be invisible (eg. files with the two tags transparent_background lineart). This affects somewhere in the hundreds of images. Users can either ignore this as affecting a minute percentage of files, filter out images based on the tag-combination, or include data quality checks in their image loading code to drop anomalous images with too few unique colors or which are too white/too black (a sketch of such a check follows this list).

    Pro­posed fix: in Dan­booru2019+’s 512px SFW sub­set, the down­scal­ing has switched to adding white back­grounds rather than black back­grounds; while the same is­sue can still arise in the case of white line-art draw­ings with trans­par­ent back­grounds, these are much rar­er. (It might also be pos­si­ble to make the con­ver­sion shell script query im­ages for use of trans­paren­cy, av­er­age the con­tents, and pick a back­ground which is most op­po­site the con­tent.)
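
The data-quality check mentioned above could look like the following sketch, assuming ImageMagick's identify utility and the torrent's bucketed 512px directory layout; the thresholds are arbitrary guesses rather than tuned values:

## flag 512px JPGs that are nearly all one shade or have very few unique colors
## (catches line-art rendered onto the wrong background):
for f in 512px/*/*.jpg; do
    mean=$(identify -format '%[fx:mean]' "$f")   ## average intensity: 0=black, 1=white
    colors=$(identify -format '%k' "$f")         ## number of unique colors
    if awk -v m="$mean" -v k="$colors" 'BEGIN{ exit !(m < 0.02 || m > 0.98 || k < 8) }'; then
        echo "suspect: $f (mean=$mean, colors=$colors)"
    fi
done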

Future work

Metadata Quality Improvement via Active Learning

How high is the quality of the Danbooru metadata? As with ImageNet, it is critical that the tags are extremely accurate, or else this will lower-bound the error rates and impede the learning of taggers, especially on rarer tags, where even a low error rate may cause false negatives to outweigh the true positives.

I would say that the Danbooru tag data is of quite high quality but imbalanced: almost all tags on images are correct, but the absence of a tag is often wrong—that is, many tags are missing on Danbooru (there are so many possible tags that no user could possibly know them all). So the absence of a tag isn't as informative as the presence of a tag—eyeballing images and some rarer tags, I would guess that tags are present <10% of the time they should be.

This sug­gests lever­ag­ing an ac­tive learn­ing (Set­tles 2010) form of train­ing: train a tag­ger, have a hu­man re­view the er­rors, up­date the meta­data when it was not an er­ror, and re­train.

More specifi­cal­ly: train the tag­ger; run the tag­ger on the en­tire dataset, record­ing the out­puts and er­rors; a hu­man ex­am­ines the er­rors in­ter­ac­tively by com­par­ing the sup­posed er­ror with the im­age; and for false neg­a­tives, the tag can be added to the Dan­booru source us­ing the Dan­booru API and added to the lo­cal im­age meta­data data­base, and for false pos­i­tives, the ‘neg­a­tive tag’ can be added to the lo­cal data­base; train a new model (pos­si­bly ini­tial­iz­ing from the last check­point). Since there will prob­a­bly be thou­sands of er­rors, one would go through them by mag­ni­tude of er­ror: for a false pos­i­tive, start with tag­ging prob­a­bil­i­ties of 1.0 and go down, and for false neg­a­tives, 0.0 and go up. This would be equiv­a­lent to the ac­tive learn­ing strat­egy “un­cer­tainty sam­pling”, which is sim­ple, easy to im­ple­ment, and effec­tive (al­beit not nec­es­sar­ily op­ti­mal for ac­tive learn­ing as the worst er­rors will tend to be highly correlated/redundant and the set of cor­rec­tions overkil­l). Once all er­rors have been hand-checked, the train­ing weight on ab­sent tags can be in­creased, as any miss­ing tags should have shown up as false pos­i­tives.
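
A minimal sketch of that review ordering, assuming a hypothetical tagger dump predictions.csv with the columns id,tag,probability,tag_present (0/1); the tagger itself and the exact file format are not specified here:

## false positives (tag absent in the metadata but predicted with high probability):
## review from probability 1.0 downwards
awk -F, '$4 == 0 && $3 >= 0.5' predictions.csv | sort -t, -k3,3 -gr > review_false_positives.csv
## false negatives (tag present in the metadata but scored low): review from 0.0 upwards
awk -F, '$4 == 1 && $3 <  0.5' predictions.csv | sort -t, -k3,3 -g  > review_false_negatives.csv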

Over mul­ti­ple it­er­a­tions of ac­tive learn­ing + re­train­ing, the pro­ce­dure should be able to fer­ret out er­rors in the dataset and boost its qual­ity while also in­creas­ing its per­for­mance.

Based on my ex­pe­ri­ences with semi­-au­to­matic edit­ing on Wikipedia (us­ing pywikipediabot for solv­ing wik­ilinks), I would es­ti­mate that given an ap­pro­pri­ate ter­mi­nal in­ter­face, a hu­man should be able to check at least 1 er­ror per sec­ond and so check­ing ~30,000 er­rors per day is pos­si­ble (al­beit ex­tremely te­dious). Fix­ing the top mil­lion er­rors should offer a no­tice­able in­crease in per­for­mance.

There are many open ques­tions about how best to op­ti­mize tag­ging per­for­mance: is it bet­ter to re­fine tags on the ex­ist­ing set of im­ages or would adding more on­ly-par­tial­ly-tagged im­ages be more use­ful?

Appendix

Shell queries for statistics

## count number of images/files in Danbooru2020
find /media/gwern/Data2/danbooru2020/original/ -type f | wc --lines
# 4226544
## count total filesize of original fullsized images in Danbooru2020:
du -sch /media/gwern/Data2/danbooru2020/original/
# 3.4T

# on JSON files concatenated together:
## number of unique tags
cd metadata/; cat * > all.json
cat all.json | jq '.tags | .[] | .name' > tags.txt
sort -u tags.txt  | wc --lines
# 392446
## number of total tags
wc --lines tags.txt
# 108029170
## Average tag count per image:
R
# 108029170 / 3692578
# # [1] 29.2557584
## Most popular tags:
sort tags.txt  | uniq -c | sort -g | tac | head -19
# 2617569 "1girl"
# 2162839 "solo"
# 1808646 "long_hair"
# 1470460 "highres"
# 1268611 "breasts"
# 1204519 "blush"
# 1101925 "smile"
# 1009723 "looking_at_viewer"
# 1006628 "short_hair"
#  904246 "open_mouth"
#  802786 "multiple_girls"
#  758690 "blue_eyes"
#  722932 "blonde_hair"
#  686706 "brown_hair"
#  675740 "skirt"
#  630385 "touhou"
#  606550 "large_breasts"
#  592200 "hat"
#  588769 "thighhighs"

## count Danbooru images by rating
cat all.json  | jq '.rating' > ratings.txt
sort ratings.txt  | uniq -c | sort -g
#  315713 "e"
#  539329 "q"
# 2853721 "s"

wc --lines ratings.txt
## 3708763 ratings.txt
R
# c(315713, 539329, 2853721) / 3708763
# # [1] 0.0851262267  0.1454201846  0.7694535887

# earliest upload:
cat all.json | jq '.created_at' | fgrep '2005' > uploaded.txt
sort -g uploaded.txt | head -1
# "2005-05-24 03:35:31 UTC"

  1. While Danbooru is not the largest anime image booru in existence—TBIB, for example, claimed >4.7m images ~2017, or almost twice as many as Danbooru2017, by mirroring from multiple boorus—Danbooru is generally considered to focus on higher-quality images & have better tagging; I suspect >4m images is into diminishing returns and the focus then ought to be on improving the metadata. Google finds () that image classification is logarithmic in image count up to n = 300M with noisy labels (likewise other scaling papers), which I interpret as suggesting that for the rest of us with limited hard drives & compute, going past millions is not that helpful; unfortunately, that experiment doesn't examine the impact of the noise in their categories, so one can't guess how many images each additional tag is equivalent to for improving final accuracy. (They do compare training on equally large datasets with small vs large numbers of categories, but fine vs coarse-grained categories is not directly comparable to a fixed number of images with fewer or more tags on each image.) The impact of tag noise could be quantified by removing varying numbers of random images/tags and comparing the curve of final accuracy. As adding more images is hard but semi-automatically fixing tags with an active-learning approach should be easy, I would bet that the cost-benefit is strongly in favor of improving the existing metadata rather than adding more images from recent Danbooru uploads or other -boorus.↩︎

  2. This is done to save >100GB of space/bandwidth; it is true that the loss­less op­ti­miza­tion will in­val­i­date the MD5s, but note that the orig­i­nal MD5 hashes are avail­able in the meta­data, and many thou­sands of them are in­cor­rect even on the orig­i­nal Dan­booru server, and the files’ true hashes are in­her­ently val­i­dated as part of the Bit­Tor­rent down­load process—so there is lit­tle point in any­one ei­ther check­ing them or try­ing to avoid mod­i­fy­ing files, and loss­less op­ti­miza­tion saves a great deal.↩︎

  3. If one is only in­ter­ested in the meta­data, one could run queries on the Big­Query ver­sion of the Dan­booru data­base in­stead of down­load­ing the tor­rent. The Big­Query data­base is also up­dated dai­ly.↩︎

  4. Ap­par­ently a bug due to an an­ti-DoS mech­a­nism, which should be fixed.↩︎

  5. An au­thor of style2paints, a NN painter for anime im­ages, notes that stan­dard style trans­fer ap­proaches (typ­i­cally us­ing an Im­a­geNet-based CNN) fail abysmally on anime im­ages: “All trans­fer­ring meth­ods based on Anime Clas­si­fier are not good enough be­cause we do not have anime Im­a­geNet”. This is in­ter­est­ing in part be­cause it sug­gests that Im­a­geNet CNNs are still only cap­tur­ing a sub­set of hu­man per­cep­tion if they only work on pho­tographs & not il­lus­tra­tions.↩︎

  6. Dan­booru2020 does not by de­fault pro­vide a “face” dataset of im­ages cropped to just faces like that of Getchu or Na­gadomi’s moeimouto; how­ev­er, the tags can be used to fil­ter down to a large set of face close­ups, and Na­gadomi’s face-de­tec­tion code is highly effec­tive at ex­tract­ing faces from Dan­booru2020 im­ages & can be com­bined with wai­fu2× for cre­at­ing large sets of large face im­ages. Sev­eral face datasets have been con­struct­ed, see else­where.↩︎

  7. See for ex­am­ple the pair high­lighted in , mo­ti­vat­ing them to use hu­man di­a­logues to pro­vide more descriptions/supervision.↩︎

  8. A tag­ger could be in­te­grated into the site to au­to­mat­i­cally pro­pose tags for new­ly-u­ploaded im­ages to be ap­proved by the up­load­er; new users, un­con­fi­dent or un­fa­mil­iar with the full breadth, would then have the much eas­ier task of sim­ply check­ing that all the pro­posed tags are cor­rect.↩︎