Danbooru2019: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset

Danbooru2019 is a large-scale anime image database with 3.69m+ images annotated with 108m+ tags; it can be useful for machine learning purposes such as image recognition and generation.
statistics, NN, anime, shell, dataset
2015-12-15–2020-09-04 finished certainty: likely importance: 6


Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image or tag count or restricted distribution. I suggest that the “image boorus” be used. The image boorus are longstanding web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations.

The best known booru, with a focus on qual­i­ty, is Dan­booru. We pro­vide a torrent/rsync mir­ror which con­tains ~3tb of 3.69m images with 108m tag instances (of 392k defined tags, ~29/image) cov­er­ing Dan­booru from 2005-05-24–2019-12-31 (fi­nal ID: #3,734,659), pro­vid­ing the image files & a JSON export of the meta­da­ta. We also pro­vide a smaller tor­rent of SFW images down­scaled to 512×512px JPGs (295GB; 2,828,400 images) for con­ve­nience.

Our hope is that a Danbooru2019 dataset can be used for rich large-scale classification/tagging & learned embeddings, to test the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, to provide an archival backup for the Danbooru community, to feed back metadata improvements & corrections, and to serve as a testbed for advanced techniques such as conditional image generation or style transfer.

Image boorus like Danbooru are image hosting websites developed by the anime community for collaborative tagging. Images are uploaded and tagged by users; they can be large, such as Danbooru, and richly annotated with textual ‘tags’.

Dan­booru in par­tic­u­lar is old, large, well-­tagged, and its oper­a­tors have always sup­ported uses beyond reg­u­lar brows­ing—pro­vid­ing an API and even a data­base export. With their per­mis­sion, I have peri­od­i­cally cre­ated sta­tic snap­shots of Dan­booru ori­ented towards ML use pat­terns.

Image booru description

Image booru tags are typically divided into a few major groups:

  • copyright (the overall franchise, movie, TV series, manga etc a work is based on; for long-running franchises or “crossover” images, there can be multiple such tags, or, if there is no associated work, it is tagged “original”)

  • char­ac­ter (often mul­ti­ple)

  • author

  • explic­it­ness rat­ing

    Danbooru does not ban sexually suggestive or pornographic content; instead, images are classified into 3 categories: safe, questionable, & explicit. (Represented in the SQL as “s”/“q”/“e” respectively.)

    safe is for rel­a­tively SFW con­tent includ­ing swim­suits, while questionable would be more appro­pri­ate for high­ly-re­veal­ing swim­suit images or nudity or highly sex­u­ally sug­ges­tive sit­u­a­tions, and explicit denotes any­thing hard-­core porno­graph­ic. (8.5% of images are clas­si­fied as “e”, 15% as “q”, and 77% as “s”; as the default tag is “q”, this may under­es­ti­mate the num­ber of “s” images, but “s” should prob­a­bly be con­sid­ered the SFW sub­set.)

  • descriptive tags (eg the top 10 tags are 1girl/solo/long_hair/highres/breasts/blush/short_hair/smile/multiple_girls/open_mouth/looking_at_viewer, which reflect the expected focus of anime fandom on female characters)

    These tags form a “folksonomy” to describe aspects of images; beyond the expected tags like long_hair or looking_at_viewer, there are many strange and unusual tags, including many anime or illustration-specific tags like seiyuu_connection (images where the joke is based on knowing the two characters are voiced in different anime by the same voice actor) or bad_feet (artists frequently accidentally draw two left feet, or just bad_anatomy in general). Tags may also be hierarchical, and one tag may “imply” another.

    Images with text in them will have tags like translated, comic, or speech_bubble.

Images can have other asso­ci­ated meta­data with them, includ­ing:

  • Dan­booru ID, a unique pos­i­tive inte­ger

  • MD5 hash

    The MD5s are often incor­rect.
  • the uploader user­name

  • the orig­i­nal URL or the name of the work

  • up/downvotes

  • sib­ling images (often an image will exist in many forms, such as sketch or black­-white ver­sions in addi­tion to a final color image, edited or larger/smaller ver­sions, SFW vs NSFW, or depict­ing mul­ti­ple moments in a scene)

  • captions/dialogue (many images will have written Japanese captions/dialogue, which have been translated into English by users and annotated using HTML)

  • author com­men­tary (also often trans­lat­ed)

  • pools (ordered sequences of images from across Dan­booru; often used for comics or image groups, or for dis­parate images with some uni­fy­ing theme which is insuf­fi­ciently objec­tive to be a nor­mal tag)

Image boorus typically support advanced Boolean searches on multiple attributes simultaneously, which, in conjunction with the rich tagging, can allow users to discover extremely specific sets of images.

Samples

100 ran­dom sam­ple images from the 512px SFW sub­set of Dan­booru in a 10×10 grid.

Download

Dan­booru2019 is cur­rently avail­able for down­load in 2 ways:

  1. Bit­Tor­rent
  2. pub­lic rsync server

Torrent

The images have been downloaded using a curl script & the Danbooru API, and losslessly optimized using optipng/jpegoptim; the metadata has been exported from the Danbooru BigQuery mirror.

Tor­rents are the pre­ferred down­load method as they stress the seed server less, can poten­tially be faster due to many peers, are resilient to server down­time, and have built-in ECC. (How­ev­er, Dan­booru2019 is approach­ing the lim­its of Bit­Tor­rent clients, and Dan­booru2020 may be forced to drop tor­rent sup­port.)

Due to the num­ber of files, the tor­rent has been bro­ken up into 10 sep­a­rate tor­rents, each cov­er­ing a range of IDs mod­ulo 1000. They are avail­able as an XZ-­com­pressed tar­ball (full archive) (21MB) and the SFW 512px down­scaled sub­set tor­rent (12M­B); down­load & unpack into one’s tor­rent direc­to­ry.

The torrents appear to work with some clients on Linux & Linux/Windows; they reportedly do not work on qBittorrent 3.3–4.0.4 (but may on ≥4.0.5), Deluge, or most Windows torrent clients.

Rsync

Due to torrent compatibility & network issues, I provide an alternate download route via a public anonymous rsync server (rsync is available for any Unix; alternative implementations are available for other platforms). To list all available files, a file-list can be downloaded (equivalent to, but much faster than, rsync --list-only):

rsync rsync://78.46.86.149:873/danbooru2019/filelist.txt.xz ./

To down­load all avail­able files (test with --dry-run):

rsync --verbose --recursive rsync://78.46.86.149:873/danbooru2019/ ./danbooru2019/

For a single file (eg the metadata tarball), one can download it like so:

rsync --verbose rsync://78.46.86.149:873/danbooru2019/metadata.json.tar.xz ./
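
Once downloaded, the tarball can be unpacked in place with standard tar; this is a sketch, which assumes (as the example queries later on this page do) that the JSON files end up under ./metadata/:

# unpack the XZ-compressed metadata into the current (dataset root) directory
tar -xJf metadata.json.tar.xz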

For a spe­cific sub­set, like the SFW 512px sub­set:

rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2019/512px/ ./danbooru2019/512px/
rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2019/original/ ./danbooru2019/original/

Note that rsync supports glob-like patterns in queries (remember to escape globs so they are not interpreted by the shell), and also supports reading a list of filenames from a file:

            --exclude=PATTERN       exclude files matching PATTERN
            --exclude-from=FILE     read exclude patterns from FILE
            --include=PATTERN       don't exclude files matching PATTERN
            --include-from=FILE     read include patterns from FILE
            --files-from=FILE       read list of source-file names from FILE

So one can query the metadata for IDs matching arbitrary criteria and build up a list of the corresponding dataset filenames (eg for ID #58991, 512px/0991/58991.jpg; or, even lazier, just use include matches via a glob pattern like */58991.*); see the sketch below. This can require far less time & bandwidth than downloading the full dataset, and is also far faster than doing rsync one file at a time. See the rsync documentation for further details.
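
As a concrete sketch of that workflow: the tag 1girl here is just an example, and the jq field names (.id, .rating, .tags[].name) are assumptions about the metadata schema, consistent with the example script in the Image Metadata section below.

# Sketch: build an rsync file list of every 512px SFW image tagged "1girl" (example tag),
# then fetch just those files; assumes the metadata JSON (one object per line) is under ./metadata/.
cat metadata/* \
    | jq -r 'select(.rating=="s" and any(.tags[]; .name=="1girl")) | .id' \
    | awk '{ printf "512px/%04d/%d.jpg\n", $1 % 1000, $1 }' > filelist.txt
rsync --verbose --files-from=filelist.txt rsync://78.46.86.149:873/danbooru2019/ ./danbooru2019/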

And for the full dataset (meta­data+o­rig­i­nal+512px):

rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2019 ./danbooru2019/

I also pro­vide rsync mir­rors of a num­ber of mod­els & datasets, such as the cleaned anime por­trait dataset; see Projects for a list­ing of deriv­a­tive works.

Kaggle

A combination of an n = 300k subset of the 512px SFW subset of Danbooru2017 and Nagadomi’s moeimouto face dataset is available as a Kaggle-hosted dataset: “Tagged Anime Illustrations” (36GB).

Kag­gle also hosts the meta­data of Safebooru up to 2016-11-20: “Safebooru—Anime Image Meta­data”.

Model zoo

Cur­rently avail­able:

Use­ful mod­els would be:

  • per­cep­tual loss model (us­ing Deep­Dan­booru?)
  • “s”/“q”/“e” clas­si­fier
  • text embed­ding RNN, and pre-­com­puted text embed­dings for all images’ tags

Updating

If there is inter­est, the dataset will con­tinue to be updated at reg­u­lar annual inter­vals (“Dan­booru2020”, “Dan­booru2021” etc).

Updates exploit the ECC capa­bil­ity of Bit­Tor­rent by updat­ing the images/metadata and cre­at­ing a new .torrent file; users down­load the new .torrent, over­write the old .torrent, and after rehash­ing files to dis­cover which ones have changed/are miss­ing, the new ones are down­loaded. (This method has been suc­cess­fully used by other peri­od­i­cal­ly-up­dated large tor­rents, such as the Touhou Loss­less Music Tor­rent, at ~1.75tb after 19 ver­sion­s.)

Turnover in Bit­Tor­rent swarms means that ear­lier ver­sions of the tor­rent will quickly dis­ap­pear, so for eas­ier repro­ducibil­i­ty, the meta­data files can be archived into sub­di­rec­to­ries (im­ages gen­er­ally will not change, so repro­ducibil­ity is less of a con­cern—to repro­duce the sub­set for an ear­lier release, one sim­ply fil­ters on upload date or takes the file list from the old meta­data).

Notification of updates

To receive noti­fi­ca­tion of future updates to the dataset, please sub­scribe to the noti­fi­ca­tion mail­ing list.

Possible Uses

Such a dataset would sup­port many pos­si­ble uses:

  • clas­si­fi­ca­tion & tag­ging:

    • image categorization of major characteristics such as franchise or character, or SFW/NSFW detection (eg Derpibooru)

    • image mul­ti­-la­bel clas­si­fi­ca­tion (tag­ging), exploit­ing the ~20 tags per image (cur­rently there is a pro­to­type, Deep­Dan­booru)

      • a large-s­cale test­bed for real-­world appli­ca­tion of active learn­ing / man-­ma­chine col­lab­o­ra­tion
      • test­ing the scal­ing lim­its of exist­ing tag­ging approaches and moti­vat­ing zero-shot & one-shot learn­ing tech­niques
      • boot­strap­ping video summaries/descriptions
      • robustness of image classifiers to different illustration styles
  • image gen­er­a­tion:

  • image analy­sis:

    • facial detection & localization for drawn images (on which normal techniques such as OpenCV’s Haar filters fail, requiring special-purpose approaches like AnimeFace 2009/lbpcascade_animeface)
    • image popularity/upvote pre­dic­tion
    • image-­to-­text local­iza­tion, tran­scrip­tion, and trans­la­tion of text in images
    • illus­tra­tion-spe­cial­ized com­pres­sion (for bet­ter per­for­mance than PNG/JPG)
  • image search:

    • col­lab­o­ra­tive filtering/recommendation, image sim­i­lar­ity search (Flickr) of images (use­ful for users look­ing for images, for dis­cov­er­ing tag mis­takes, and for var­i­ous diag­nos­tics like check­ing GANs are not mem­o­riz­ing)
    • manga recommendation
    • artist sim­i­lar­ity and de-anonymiza­tion
  • knowl­edge graph extrac­tion from tags/tag-implications and images

    • clus­ter­ing tags
    • tem­po­ral trends in tags (fran­chise pop­u­lar­ity trends)

Advantages

Size and metadata

Image clas­si­fi­ca­tion has been super­charged by work on Ima­geNet, but Ima­geNet itself is lim­ited by its small set of class­es, many of which are debat­able, and which encom­pass only a lim­ited set. Com­pound­ing these lim­its, tagging/classification datasets are noto­ri­ously undi­verse & have imbal­ance prob­lems or are small:

  • ImageNet: dog breeds

  • Youtube-BB: toilets/giraffes

  • MS COCO: bath­rooms and African savan­nah ani­mals; 328k images, 80 cat­e­gories, short 1-sen­tence descrip­tions

  • bird/flowers: a few score of each kind (eg no eagles in the birds dataset)

  • Visual Rela­tion­ship Detec­tion (VRD) dataset: 5k images

  • Pas­cal VOC: 11k images

  • Visual Genome: 108k images

  • nico-open­data: 400k, but SFW & restricted to approved researchers

  • Open Images V4: released 2018, 30.1m tags for 9.2m images and 15.4m bounding-boxes, with high label quality; a major advantage of this dataset is that it uses CC-BY-licensed Flickr photographs/images, and so it should be freely distributable.

  • BAM!: 65m raw images, 393k? tags for 2.5m? tagged images (semi-supervised), restricted access?

The external validity of classifiers trained on these datasets is somewhat questionable, as the learned discriminative models may collapse or simplify in undesirable ways, and overfit on the datasets’ individual biases (Torralba & Efros 2011). For example, ImageNet classifiers sometimes appear to ‘cheat’ by relying on localized textures in a “bag-of-words”-style approach and simplistic outlines/shapes—recognizing leopards only by the color texture of the fur, or believing barbells are extensions of arms. CNNs by default appear to rely almost entirely on texture and ignore shapes/outlines, unlike human vision, rendering them fragile to transforms; training which emphasizes shape/outline data augmentation can improve accuracy & robustness, making anime images a challenging testbed (and this texture-bias possibly explains the poor performance of anime-targeted NNs in the past). ImageNet is simply not large enough, or richly annotated enough, to train classifiers or taggers better than that, or, with residual networks reaching human parity, to reveal differences between the best algorithms and the merely good. (Dataset biases have also been issues on question-answering datasets.) As well, the datasets are static, not accepting any additions, better metadata, or corrections. Like MNIST before it, ImageNet is verging on ‘solved’ (the ILSVRC organizers ended it after the 2017 competition) and further progress may simply be overfitting to idiosyncrasies of the datapoints and errors; even if lowered error rates are not overfitting, the low error rates compress the differences between algorithms, giving a misleading view of progress and understating the benefits of better architectures, as improvements become comparable in size to simple chance in initializations/training/validation-set choice. As the authors of one text-to-image synthesis paper note:

It is an open issue of tex­t-­to-im­age map­ping that the dis­tri­b­u­tion of images con­di­tioned on a sen­tence is highly mul­ti­-­modal. In the past few years, we’ve wit­nessed a break­through in the appli­ca­tion of recur­rent neural net­works (RNN) to gen­er­at­ing tex­tual descrip­tions con­di­tioned on images [1, 2], with Xu et al. show­ing that the mul­ti­-­modal­ity prob­lem can be decom­posed sequen­tially [3]. How­ev­er, the lack of datasets with diver­sity descrip­tions of images lim­its the per­for­mance of tex­t-­to-im­age syn­the­sis on mul­ti­-­cat­e­gories dataset like MSCOCO [4]. There­fore, the prob­lem of tex­t-­to-im­age syn­the­sis is still far from being solved

In contrast, the Danbooru dataset is larger than ImageNet as a whole and larger than the most widely-used multi-description dataset, MS COCO, with far richer metadata than the ‘subject verb object’ sentence summary that is dominant in MS COCO or the birds dataset (sentences which could be adequately summarized in perhaps 5 tags, if even that). While the Danbooru community does focus heavily on female anime characters, they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included. It is unlikely that the performance ceiling will be reached anytime soon, and advanced techniques such as attention will likely be required to get anywhere near the ceiling. And Danbooru is constantly expanding and can be easily updated by anyone anywhere, allowing for regular releases of improved annotations.

Danbooru and the image boorus have been only minimally used in previous machine learning work; principally in Illustration2Vec (project), which used 1.287m images to train a finetuned VGG-based CNN to detect 1,539 tags (drawn from the 512 most frequent tags of general/copyright/character each) with an overall precision of 32.2%, or “Symbolic Understanding of Anime Using Deep Learning”, Li 2018. But the datasets for past research are typically not distributed and there has been little followup.

Non-photographic

Anime images and illustrations, on the other hand, differ from photographs in many ways—for example, illustrations are frequently black-and-white rather than color, line art rather than photographs, and even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working on a higher level of abstraction—a leopard would not be as trivially recognized by simple pattern-matching on yellow and black dots—with irrelevant details that a discriminator might cheaply classify based on being typically suppressed in favor of the global gestalt, and the images often heavily stylized. With the exception of MNIST & Omniglot, almost all commonly-used deep learning-related image datasets are photographic.

Humans can still easily perceive a black-and-white line drawing of a leopard as being a leopard—but can a standard ImageNet classifier? Likewise, the difficulty face detectors encounter on anime images suggests that other detectors like nudity or pornographic detectors may fail; but surely moderation tasks require detection of penises, whether they are drawn or photographed? The attempts to apply CNNs to GANs, image generation, image inpainting, or style transfer have sometimes thrown up artifacts which don’t seem to be issues when using the same architecture on photographic material; for example, in GAN image generation & style transfer, I almost always note, in my own or others’ attempts, what I call the “watercolor effect”, where instead of producing the usual abstracted regions of whitespace, monotone coloring, or simple color gradients, the CNN instead consistently produces noisy transition textures which look like watercolor paintings—which can be beautiful, and an interesting style in its own right (eg the style2paints samples), but means the CNNs are failing to some degree. This watercolor effect appears to not be a problem in photographic applications, but on the other hand, photos are filled with noisy transition textures; and watching a GAN train, you can see that the learning process generates textures first and only gradually learns to build edges, regions, and transitions out of the blurred textures. Is this anime-specific problem due simply to insufficient data/training, or is something more fundamentally wrong with current convolutions?

Because illus­tra­tions are pro­duced by an entirely dif­fer­ent process and focus only on salient details while abstract­ing the rest, they offer a way to test exter­nal valid­ity and the extent to which tag­gers are tap­ping into high­er-level seman­tic per­cep­tion.

As well, many ML researchers are anime fans and might enjoy work­ing on such a dataset—­train­ing NNs to gen­er­ate anime images can be amus­ing. It is, at least, more inter­est­ing than pho­tos of street signs or store­fronts. (“There are few sources of energy so pow­er­ful as a pro­cras­ti­nat­ing grad stu­dent.”)

Community value

A full dataset is of imme­di­ate value to the Dan­booru com­mu­nity as an archival snap­shot of Dan­booru which can be down­loaded in lieu of ham­mer­ing the main site and down­load­ing ter­abytes of data; back­ups are occa­sion­ally requested on the Dan­booru forum but the need is cur­rently not met.

There is potential for a symbiosis between the Danbooru community & ML researchers: in a virtuous circle, the community provides curation and expansion of a rich dataset, while ML researchers can contribute back tools from their research on it which help improve the dataset. The Danbooru community is relatively large and would likely welcome the development of tools like taggers to support semi-automatic (or eventually, fully automatic) image tagging, as use of a tagger could offer orders of magnitude improvement in speed and accuracy compared to their existing manual methods, as well as being newbie-friendly. They are also a pre-existing audience which would be interested in new research results.

Format

The goal of the dataset is to be as easy as pos­si­ble to use imme­di­ate­ly, avoid­ing obscure file for­mats, while allow­ing simul­ta­ne­ous research & seed­ing of the tor­rent, with easy updates.

Images are provided in the full original form (be that JPG, PNG, GIF or otherwise) for reference/archival purposes, along with a script for converting to JPGs & downscaling (creating a smaller corpus more suitable for ML use).

Images are buck­eted into 1000 sub­di­rec­to­ries 0–999, which is the Dan­booru ID mod­ulo 1000 (ie all images in 0999/ have an ID end­ing in ‘999’). A sin­gle direc­tory would cause patho­log­i­cal filesys­tem per­for­mance, and mod­ulo ID spreads images evenly with­out requir­ing addi­tional direc­to­ries to be made. The ID is not zero-­padded and files end in the rel­e­vant exten­sion, hence the file lay­out looks like this:

original/0000/
original/0000/1000.png
original/0000/2000.jpg
original/0000/3000.jpg
original/0000/4000.png
original/0000/5000.jpg
original/0000/6000.jpg
original/0000/7000.jpg
original/0000/8000.jpg
original/0000/9000.jpg
...

Cur­rently rep­re­sented file exten­sions are: avi/bmp/gif/html/jpeg/jpg/mp3/mp4/mpg/pdf/png/rar/swf/webm/wmv/zip. (JPG/PNG files have been loss­lessly opti­mized using jpegoptim/OptiPNG, sav­ing ~100G­B.)

Raw orig­i­nal files are treach­er­ous

Be care­ful if work­ing with the orig­i­nal rather than 512px sub­set. There are many odd files: trun­cat­ed, non-sRGB col­or­space, wrong file exten­sions (eg some PNGs have .jpg exten­sions like original/0146/1525146.jpg/original/0558/1422558.jpg), etc.
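
For example, a quick (non-exhaustive) sanity check for mismatched extensions, assuming the standard file utility is installed:

# Sketch: flag '.jpg' files whose actual MIME type is not JPEG (eg mislabeled PNGs).
find original/ -type f -name '*.jpg' | while read -r FILE; do
    TYPE=$(file --brief --mime-type "$FILE")
    [ "$TYPE" = "image/jpeg" ] || echo "suspicious: $FILE ($TYPE)"
done
# (ImageMagick's `identify` can similarly be used to catch truncated/corrupt images.)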

The SFW tor­rent fol­lows the same schema but inside the 512px/ direc­tory instead and con­verted to JPG for the SFW files: 512px/0000/1000.jpg etc.

An experimental shell script for parallelized conversion of the full-size original images into a more tractable ~250GB corpus of 512×512px images is included: rescale_images.sh. It requires ImageMagick & GNU parallel to be installed.
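
The core of such a conversion looks roughly like the following: a simplified sketch covering a single bucket only, not the actual rescale_images.sh, and the exact resizing/padding options are assumptions:

# Simplified sketch: downscale one bucket of originals to 512x512 JPGs (white-padded to square).
BUCKET=0000
mkdir -p 512px/"$BUCKET"
find original/"$BUCKET" -type f \( -name '*.jpg' -o -name '*.png' \) \
    | parallel "convert {} -resize '512x512>' -background white -gravity center -extent 512x512 512px/$BUCKET/{/.}.jpg"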

Image Metadata

The metadata is available as an XZ-compressed tarball of JSON files as exported from the Danbooru BigQuery database mirror (metadata.json.tar.xz). Each line is an individual JSON object for a single image; ad hoc queries can be run easily by piping into jq, and several are illustrated in the shell query appendix.
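
For instance, a one-liner to reproduce the rating breakdown quoted earlier (assuming the metadata has been unpacked into ./metadata/):

# Count images by rating ("s"/"q"/"e") across the whole metadata export.
cat metadata/* | jq -r '.rating' | sort | uniq -c | sort -nr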

Here is an exam­ple of a shell script for get­ting the file­names of all SFW images match­ing a par­tic­u­lar tag:

# print out filenames of all SFW Danbooru images matching a particular tag.
# assumes being in the dataset root directory, eg '/media/gwern/Data2/danbooru2019'
TAG="monochrome"

TEMP=$(mktemp /tmp/matches-XXXX.txt)
# crude but fast pre-filter on the raw JSON (matches any tag name *starting* with $TAG),
# then extract the image IDs of the SFW hits:
cat metadata/* | fgrep -e '"name":"'"$TAG" | fgrep '"rating":"s"' \
    | jq -c '.id' | tr -d '"' >> "$TEMP"

for ID in $(cat "$TEMP"); do
        # images are bucketed by ID modulo 1000, zero-padded to 4 digits:
        BUCKET=$(printf "%04d" $(( ID % 1000 )) );
        TARGET=$(ls ./original/"$BUCKET/$ID".*)
        ls "$TARGET"
done

Citing

Please cite this dataset as:

  • Anony­mous, The Dan­booru Com­mu­ni­ty, & Gwern Bran­wen; “Dan­booru2019: A Large-S­cale Crowd­sourced and Tagged Anime Illus­tra­tion Dataset”, 2020-01-13. Web. Accessed [DATE] https://www.gwern.net/Danbooru2019

    @misc{danbooru2019,
        author = {Anonymous and Danbooru community and Gwern Branwen},
        title = {Danbooru2019: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset},
        howpublished = {\url{https://www.gwern.net/Danbooru2019}},
        url = {https://www.gwern.net/Danbooru2019},
        type = {dataset},
        year = {2020},
        month = {January},
        timestamp = {2020-01-13},
        note = {Accessed: DATE} }

Past releases

Danbooru2017

The first release, Dan­booru2017, con­tained ~1.9tb of 2.94m images with 77.5m tag instances (of 333k defined tags, ~26.3/image) cov­er­ing Dan­booru from 2005-05-24 through 2017-12-31 (fi­nal ID: #2,973,532).

Dan­booru2018 added 0.413TB/392,557 images/15,208,974 tags/31,698 new unique tags.

To recon­struct Dan­booru2017, down­load Dan­booru2018, and take the image sub­set ID #1–2973532 as the image dataset, and the JSON meta­data in the sub­di­rec­tory metadata/2017/ as the meta­da­ta. That should give you Dan­booru2017 bit-i­den­ti­cal to as released on 2018-02-13.
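
A sketch of the image-subsetting step for the 512px subset (the original/ subset works the same way, modulo file extensions); this hard-links matching files into a hypothetical danbooru2017/ directory:

# Keep only images with ID <= 2973532 (the final Danbooru2017 ID).
find 512px/ -type f | while read -r FILE; do
    ID=$(basename "$FILE" | cut -d. -f1)
    if [ "$ID" -le 2973532 ]; then
        mkdir -p "danbooru2017/$(dirname "$FILE")"
        ln "$FILE" "danbooru2017/$FILE"
    fi
done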

Danbooru2018

The sec­ond release was a tor­rent of ~2.5tb of 3.33m images with 92.7m tag instances (of 365k defined tags, ~27.8/image) cov­er­ing Dan­booru from 2005-05-24 through 2018-12-31 (fi­nal ID: #3,368,713), pro­vid­ing the image files & a JSON export of the meta­da­ta. We also pro­vided a smaller tor­rent of SFW images down­scaled to 512×512px JPGs (241GB; 2,232,462 images) for con­ve­nience.

Dan­booru2018 can be recon­structed sim­i­larly using metadata/2018/.

Applications

Projects

Code and derived datasets:

Publications

Research:

  • , Gokaslan et al 2018:

    Unsu­per­vised image-­to-im­age trans­la­tion tech­niques are able to map local tex­ture between two domains, but they are typ­i­cally unsuc­cess­ful when the domains require larger shape change. Inspired by seman­tic seg­men­ta­tion, we intro­duce a dis­crim­i­na­tor with dilated con­vo­lu­tions that is able to use infor­ma­tion from across the entire image to train a more con­tex­t-aware gen­er­a­tor. This is cou­pled with a mul­ti­-s­cale per­cep­tual loss that is bet­ter able to rep­re­sent error in the under­ly­ing shape of objects. We demon­strate that this design is more capa­ble of rep­re­sent­ing shape defor­ma­tion in a chal­leng­ing toy dataset, plus in com­plex map­pings with sig­nif­i­cant dataset vari­a­tion between humans, dolls, and anime faces, and between cats and dogs.

  • , Zhang et al 2018 (on style2­paints, ver­sion 3):

    Sketch or line art col­oriza­tion is a research field with sig­nif­i­cant mar­ket demand. Dif­fer­ent from photo col­oriza­tion which strongly relies on tex­ture infor­ma­tion, sketch col­oriza­tion is more chal­leng­ing as sketches may not have tex­ture. Even worse, col­or, tex­ture, and gra­di­ent have to be gen­er­ated from the abstract sketch lines. In this paper, we pro­pose a semi­-au­to­matic learn­ing-based frame­work to col­orize sketches with proper col­or, tex­ture as well as gra­di­ent. Our frame­work con­sists of two stages. In the first draft­ing stage, our model guesses color regions and splashes a rich vari­ety of col­ors over the sketch to obtain a color draft. In the sec­ond refine­ment stage, it detects the unnat­ural col­ors and arti­facts, and try to fix and refine the result. Com­par­ing to exist­ing approach­es, this two-stage design effec­tively divides the com­plex col­oriza­tion task into two sim­pler and goal-­clearer sub­tasks. This eases the learn­ing and raises the qual­ity of col­oriza­tion. Our model resolves the arti­facts such as water-­color blur­ring, color dis­tor­tion, and dull tex­tures.

    We build an inter­ac­tive soft­ware based on our model for eval­u­a­tion. Users can iter­a­tively edit and refine the col­oriza­tion. We eval­u­ate our learn­ing model and the inter­ac­tive sys­tem through an exten­sive user study. Sta­tis­tics shows that our method out­per­forms the state-of-art tech­niques and indus­trial appli­ca­tions in sev­eral aspects includ­ing, the visual qual­i­ty, the abil­ity of user con­trol, user expe­ri­ence, and other met­rics.

  • “Appli­ca­tion of Gen­er­a­tive Adver­sar­ial Net­work on Image Style Trans­for­ma­tion and Image Pro­cess­ing”, Wang 2018:

    Image-­to-Im­age trans­la­tion is a col­lec­tion of com­puter vision prob­lems that aim to learn a map­ping between two dif­fer­ent domains or mul­ti­ple domains. Recent research in com­puter vision and deep learn­ing pro­duced pow­er­ful tools for the task. Con­di­tional adver­sar­ial net­works serve as a gen­er­al-pur­pose solu­tion for image-­to-im­age trans­la­tion prob­lems. Deep Con­vo­lu­tional Neural Net­works can learn an image rep­re­sen­ta­tion that can be applied for recog­ni­tion, detec­tion, and seg­men­ta­tion. Gen­er­a­tive Adver­sar­ial Net­works (GANs) has gained suc­cess in image syn­the­sis. How­ev­er, tra­di­tional mod­els that require paired train­ing data might not be applic­a­ble in most sit­u­a­tions due to lack of paired data.

    Here we review and com­pare two dif­fer­ent mod­els for learn­ing unsu­per­vised image to image trans­la­tion: CycleGAN and Unsu­per­vised Image-­to-Im­age Trans­la­tion Net­works (UNIT). Both mod­els adopt cycle con­sis­ten­cy, which enables us to con­duct unsu­per­vised learn­ing with­out paired data. We show that both mod­els can suc­cess­fully per­form image style trans­la­tion. The exper­i­ments reveal that CycleGAN can gen­er­ate more real­is­tic results, and UNIT can gen­er­ate var­ied images and bet­ter pre­serve the struc­ture of input images.

  • , Noguchi & Harada 2019 (Dan­booru2018 by way of /-gen­er­ated images):

    Thanks to the recent devel­op­ment of deep gen­er­a­tive mod­els, it is becom­ing pos­si­ble to gen­er­ate high­-qual­ity images with both fidelity and diver­si­ty. How­ev­er, the train­ing of such gen­er­a­tive mod­els requires a large dataset. To reduce the amount of data required, we pro­pose a new method for trans­fer­ring prior knowl­edge of the pre-­trained gen­er­a­tor, which is trained with a large dataset, to a small dataset in a dif­fer­ent domain. Using such prior knowl­edge, the model can gen­er­ate images lever­ag­ing some com­mon sense that can­not be acquired from a small dataset. In this work, we pro­pose a novel method focus­ing on the para­me­ters for batch sta­tis­tics, scale and shift, of the hid­den lay­ers in the gen­er­a­tor. By train­ing only these para­me­ters in a super­vised man­ner, we achieved sta­ble train­ing of the gen­er­a­tor, and our method can gen­er­ate higher qual­ity images com­pared to pre­vi­ous meth­ods with­out col­laps­ing even when the dataset is small (~100). Our results show that the diver­sity of the fil­ters acquired in the pre-­trained gen­er­a­tor is impor­tant for the per­for­mance on the tar­get domain. By our method, it becomes pos­si­ble to add a new class or domain to a pre-­trained gen­er­a­tor with­out dis­turb­ing the per­for­mance on the orig­i­nal domain.

  • , Suzuki et al 2018:

    We present a novel CNN-based image edit­ing strat­egy that allows the user to change the seman­tic infor­ma­tion of an image over an arbi­trary region by manip­u­lat­ing the fea­ture-­space rep­re­sen­ta­tion of the image in a trained GAN mod­el. We will present two vari­ants of our strat­e­gy: (1) spa­tial con­di­tional batch nor­mal­iza­tion (sCBN), a type of con­di­tional batch nor­mal­iza­tion with user-speci­fi­able spa­tial weight maps, and (2) fea­ture-blend­ing, a method of directly mod­i­fy­ing the inter­me­di­ate fea­tures. Our meth­ods can be used to edit both arti­fi­cial image and real image, and they both can be used together with any GAN with con­di­tional nor­mal­iza­tion lay­ers. We will demon­strate the power of our method through exper­i­ments on var­i­ous types of GANs trained on dif­fer­ent datasets. Code will be avail­able at this URL.

  • , Wang et al 2019:

    One of the attractive characteristics of deep neural networks is their ability to transfer knowledge obtained in one domain to other related domains. As a result, high-quality networks can be trained in domains with relatively little training data. This property has been extensively studied for discriminative networks but has received significantly less attention for generative models. Given the often enormous effort required to train GANs, both computationally as well as in the dataset collection, the re-use of pretrained GANs is a desirable objective. We propose a novel knowledge transfer method for generative models based on mining the knowledge that is most beneficial to a specific target domain, either from a single or multiple pretrained GANs. This is done using a miner network that identifies which part of the generative distribution of each pretrained GAN outputs samples closest to the target domain. Mining effectively steers GAN sampling towards suitable regions of the latent space, which facilitates the posterior finetuning and avoids pathologies of other methods such as mode collapse and lack of flexibility. We perform experiments on several complex datasets using various GAN architectures (BigGAN, Progressive GAN) and show that the proposed method, called MineGAN, effectively transfers knowledge to domains with few target images, outperforming existing methods. In addition, MineGAN can successfully transfer knowledge from multiple pretrained GANs.

  • , Kim et al 2019 (Tag2Pix CLI/GUI):

    Line art col­oriza­tion is expen­sive and chal­leng­ing to auto­mate. A GAN approach is pro­posed, called Tag2Pix, of line art col­oriza­tion which takes as input a grayscale line art and color tag infor­ma­tion and pro­duces a qual­ity col­ored image. First, we present the Tag2Pix line art col­oriza­tion dataset. A gen­er­a­tor net­work is pro­posed which con­sists of con­vo­lu­tional lay­ers to trans­form the input line art, a pre-­trained seman­tic extrac­tion net­work, and an encoder for input color infor­ma­tion. The dis­crim­i­na­tor is based on an aux­il­iary clas­si­fier GAN to clas­sify the tag infor­ma­tion as well as gen­uine­ness. In addi­tion, we pro­pose a novel net­work struc­ture called SECat, which makes the gen­er­a­tor prop­erly col­orize even small fea­tures such as eyes, and also sug­gest a novel two-step train­ing method where the gen­er­a­tor and dis­crim­i­na­tor first learn the notion of object and shape and then, based on the learned notion, learn col­oriza­tion, such as where and how to place which col­or. We present both quan­ti­ta­tive and qual­i­ta­tive eval­u­a­tions which prove the effec­tive­ness of the pro­posed method.

  • , Lee et al 2020:

    This paper tack­les the auto­matic col­oriza­tion task of a sketch image given an already-­col­ored ref­er­ence image. Col­oriz­ing a sketch image is in high demand in comics, ani­ma­tion, and other con­tent cre­ation appli­ca­tions, but it suf­fers from infor­ma­tion scarcity of a sketch image. To address this, a ref­er­ence image can ren­der the col­oriza­tion process in a reli­able and user-­driven man­ner. How­ev­er, it is dif­fi­cult to pre­pare for a train­ing data set that has a suf­fi­cient amount of seman­ti­cally mean­ing­ful pairs of images as well as the ground truth for a col­ored image reflect­ing a given ref­er­ence (e.g., col­or­ing a sketch of an orig­i­nally blue car given a ref­er­ence green car). To tackle this chal­lenge, we pro­pose to uti­lize the iden­ti­cal image with geo­met­ric dis­tor­tion as a vir­tual ref­er­ence, which makes it pos­si­ble to secure the ground truth for a col­ored out­put image. Fur­ther­more, it nat­u­rally pro­vides the ground truth for dense seman­tic cor­re­spon­dence, which we uti­lize in our inter­nal atten­tion mech­a­nism for color trans­fer from ref­er­ence to sketch input. We demon­strate the effec­tive­ness of our approach in var­i­ous types of sketch image col­oriza­tion via quan­ti­ta­tive as well as qual­i­ta­tive eval­u­a­tion against exist­ing meth­ods.

  • , Xiang & Li 2019 (?)

  • , Chen et al 2019:

    Instance based photo car­tooniza­tion is one of the chal­leng­ing image styl­iza­tion tasks which aim at trans­form­ing real­is­tic pho­tos into car­toon style images while pre­serv­ing the seman­tic con­tents of the pho­tos. State-of-the-art Deep Neural Net­works (DNNs) meth­ods still fail to pro­duce sat­is­fac­tory results with input pho­tos in the wild, espe­cially for pho­tos which have high con­trast and full of rich tex­tures. This is due to that: car­toon style images tend to have smooth color regions and empha­sized edges which are con­tra­dict to real­is­tic pho­tos which require clear seman­tic con­tents, i.e., tex­tures, shapes etc. Pre­vi­ous meth­ods have dif­fi­culty in sat­is­fy­ing car­toon style tex­tures and pre­serv­ing seman­tic con­tents at the same time. In this work, we pro­pose a novel “Car­toon­Ren­derer” frame­work which uti­liz­ing a sin­gle trained model to gen­er­ate mul­ti­ple car­toon styles. In a nut­shell, our method maps photo into a fea­ture model and ren­ders the fea­ture model back into image space. In par­tic­u­lar, car­tooniza­tion is achieved by con­duct­ing some trans­for­ma­tion manip­u­la­tion in the fea­ture space with our pro­posed Soft­-AdaIN. Exten­sive exper­i­men­tal results show our method pro­duces higher qual­ity car­toon style images than prior arts, with accu­rate seman­tic con­tent preser­va­tion. In addi­tion, due to the decou­pling of whole gen­er­at­ing process into “Mod­el­ing-­Co­or­di­nat­ing-Ren­der­ing” parts, our method could eas­ily process higher res­o­lu­tion pho­tos, which is intractable for exist­ing meth­ods.

  • “Unpaired Sketch-­to-­Line Trans­la­tion via Syn­the­sis of Sketches”, Lee et al 2019:

    Con­vert­ing hand-­drawn sketches into clean line draw­ings is a cru­cial step for diverse artis­tic works such as comics and prod­uct designs. Recent data-­driven meth­ods using deep learn­ing have shown their great abil­i­ties to auto­mat­i­cally sim­plify sketches on raster images. Since it is dif­fi­cult to col­lect or gen­er­ate paired sketch and line images, lack of train­ing data is a main obsta­cle to use these mod­els. In this paper, we pro­pose a train­ing scheme that requires only unpaired sketch and line images for learn­ing sketch-­to-­line trans­la­tion. To do this, we first gen­er­ate real­is­tic paired sketch and line images from unpaired sketch and line images using rule-based line aug­men­ta­tion and unsu­per­vised tex­ture con­ver­sion. Next, with our syn­thetic paired data, we train a model for sketch-­to-­line trans­la­tion using super­vised learn­ing. Com­pared to unsu­per­vised meth­ods that use cycle con­sis­tency loss­es, our model shows bet­ter per­for­mance at remov­ing noisy strokes. We also show that our model sim­pli­fies com­pli­cated sketches bet­ter than mod­els trained on a lim­ited num­ber of hand­crafted paired data.

  • “Con­tent Cura­tion, Eval­u­a­tion, and Refine­ment on a Non­lin­early Directed Image­board: Lessons From Dan­booru”, Britt 2019:

    While lin­early directed image­boards like 4chan have been exten­sively stud­ied, user par­tic­i­pa­tion on non­lin­early directed image­boards, or “boorus,” has been over­looked despite high activ­i­ty, expan­sive mul­ti­me­dia repos­i­to­ries with user-de­fined clas­si­fi­ca­tions and eval­u­a­tions, and unique affor­dances pri­or­i­tiz­ing mutual con­tent cura­tion, eval­u­a­tion, and refine­ment over overt dis­course. To address the gap in the lit­er­a­ture related to par­tic­i­pa­tory engage­ment on non­lin­early directed image­boards, user activ­ity around the full data­base of N = 2,987,525, sub­mis­sions to Dan­booru, a promi­nent non­lin­early directed image­board, was eval­u­ated using regres­sion. The results illus­trate the role played by the affor­dances of non­lin­early directed image­boards and the vis­i­ble attrib­utes of indi­vid­ual sub­mis­sions in shap­ing the user processes of con­tent cura­tion, eval­u­a­tion, and refine­ment, as well as the inter­re­la­tion­ships between these three core activ­i­ties. These results pro­vide a foun­da­tion for fur­ther research within the unique envi­ron­ments of non­lin­early directed image­boards and sug­gest prac­ti­cal appli­ca­tions across online domains.

  • , Ye et al 2019:

    Anime line sketch col­oriza­tion is to fill a vari­ety of col­ors the anime sketch, to make it col­or­ful and diverse. The col­or­ing prob­lem is not a new research direc­tion in the field of deep learn­ing tech­nol­o­gy. Because of col­or­ing of the anime sketch does not have fixed color and we can’t take tex­ture or shadow as ref­er­ence, so it is dif­fi­cult to learn and have a cer­tain stan­dard to deter­mine whether it is cor­rect or not. After gen­er­a­tive adver­sar­ial net­works (GANs) was pro­posed, some used GANs to do col­or­ing research, achieved some result, but the col­or­ing effect is lim­it­ed. This study pro­poses a method use deep resid­ual net­work, and adding dis­crim­i­na­tor to net­work, that expect the color of col­ored images can con­sis­tent with the desired color by the user and can achieve good col­or­ing results.

  • , Huang et al 2019:

    Many image-­to-im­age (I2I) trans­la­tion prob­lems are in nature of high diver­sity that a sin­gle input may have var­i­ous coun­ter­parts. Prior works pro­posed the mul­ti­-­modal net­work that can build a many-­to-­many map­ping between two visual domains. How­ev­er, most of them are guided by sam­pled nois­es. Some oth­ers encode the ref­er­ence images into a latent vec­tor, by which the seman­tic infor­ma­tion of the ref­er­ence image will be washed away. In this work, we aim to pro­vide a solu­tion to con­trol the out­put based on ref­er­ences seman­ti­cal­ly. Given a ref­er­ence image and an input in another domain, a seman­tic match­ing is first per­formed between the two visual con­tents and gen­er­ates the aux­il­iary image, which is explic­itly encour­aged to pre­serve seman­tic char­ac­ter­is­tics of the ref­er­ence. A deep net­work then is used for I2I trans­la­tion and the final out­puts are expected to be seman­ti­cally sim­i­lar to both the input and the ref­er­ence; how­ev­er, no such paired data can sat­isfy that dual-sim­i­lar­ity in a super­vised fash­ion, so we build up a self­-­su­per­vised frame­work to serve the train­ing pur­pose. We improve the qual­ity and diver­sity of the out­puts by employ­ing non-lo­cal blocks and a mul­ti­-­task archi­tec­ture. We assess the pro­posed method through exten­sive qual­i­ta­tive and quan­ti­ta­tive eval­u­a­tions and also pre­sented com­par­isons with sev­eral state-of-art mod­els.

  • , Liu et al 2019:

    Anime sketch col­or­ing is to fill var­i­ous col­ors into the black­-and-white anime sketches and finally obtain the color anime images. Recent­ly, anime sketch col­or­ing has become a new research hotspot in the field of deep learn­ing. In anime sketch col­or­ing, gen­er­a­tive adver­sar­ial net­works (GANs) have been used to design appro­pri­ate col­or­ing meth­ods and achieved some results. How­ev­er, the exist­ing meth­ods based on GANs gen­er­ally have low-qual­ity col­or­ing effects, such as unrea­son­able color mix­ing, poor color gra­di­ent effect. In this paper, an effi­cient anime sketch col­or­ing method using swish-­gated resid­ual U-net (SGRU) and spec­trally nor­mal­ized GAN (SNGAN) has been pro­posed to solve the above prob­lems. The pro­posed method is called spec­trally nor­mal­ized GAN with swish-­gated resid­ual U-net (SSN-GAN). In SSN-GAN, SGRU is used as the gen­er­a­tor. SGRU is the U-net with the pro­posed swish layer and swish-­gated resid­ual blocks (SGBs). In SGRU, the pro­posed swish layer and swish-­gated resid­ual blocks (SGBs) effec­tively fil­ter the infor­ma­tion trans­mit­ted by each level and improve the per­for­mance of the net­work. The per­cep­tual loss and the per-pixel loss are used to con­sti­tute the final loss of SGRU. The dis­crim­i­na­tor of SSN-GAN uses spec­tral nor­mal­iza­tion as a sta­bi­lizer of train­ing of GAN, and it is also used as the per­cep­tual net­work for cal­cu­lat­ing the per­cep­tual loss. SSN-GAN can auto­mat­i­cally color the sketch with­out pro­vid­ing any col­or­ing hints in advance and can be eas­ily end-­to-end trained. Exper­i­men­tal results show that our method per­forms bet­ter than other state-of-the-art col­or­ing meth­ods, and can obtain col­or­ful anime images with higher visual qual­i­ty.

  • , Gopalakr­ish­nan et al 2020:

    Con­trary to the con­ven­tion of using super­vi­sion for class-­con­di­tioned gen­er­a­tive mod­el­ing, this work explores and demon­strates the fea­si­bil­ity of a learned super­vised rep­re­sen­ta­tion space trained on a dis­crim­i­na­tive clas­si­fier for the down­stream task of sam­ple gen­er­a­tion. Unlike gen­er­a­tive mod­el­ing approaches that aim to model the man­i­fold dis­tri­b­u­tion, we directly rep­re­sent the given data man­i­fold in the clas­si­fi­ca­tion space and lever­age prop­er­ties of latent space rep­re­sen­ta­tions to gen­er­ate new rep­re­sen­ta­tions that are guar­an­teed to be in the same class. Inter­est­ing­ly, such rep­re­sen­ta­tions allow for con­trolled sam­ple gen­er­a­tions for any given class from exist­ing sam­ples and do not require enforc­ing prior dis­tri­b­u­tion. We show that these latent space rep­re­sen­ta­tions can be smartly manip­u­lated (us­ing con­vex com­bi­na­tions of n sam­ples, n≥2) to yield mean­ing­ful sam­ple gen­er­a­tions. Exper­i­ments on image datasets of vary­ing res­o­lu­tions demon­strate that down­stream gen­er­a­tions have higher clas­si­fi­ca­tion accu­racy than exist­ing con­di­tional gen­er­a­tive mod­els while being com­pet­i­tive in terms of FID.

  • , Su & Fang 2020 (CS230 class pro­ject; source):

    Human sketches can be expres­sive and abstract at the same time. Gen­er­at­ing anime avatars from sim­ple or even bad face draw­ing is an inter­est­ing area. Lots of related work has been done such as auto-­col­or­ing sketches to anime or trans­form­ing real pho­tos to ani­me. How­ev­er, there aren’t many inter­est­ing works yet to show how to gen­er­ate anime avatars from just some sim­ple draw­ing input. In this pro­ject, we pro­pose using GAN to gen­er­ate anime avatars from sketch­es.

  • , Huang et al 2020

    Sketch-­to-im­age (S2I) trans­la­tion plays an impor­tant role in image syn­the­sis and manip­u­la­tion tasks, such as photo edit­ing and col­oriza­tion. Some spe­cific S2I trans­la­tion includ­ing sketch-­to-photo and sketch-­to-­paint­ing can be used as pow­er­ful tools in the art design indus­try. How­ev­er, pre­vi­ous meth­ods only sup­port S2I trans­la­tion with a sin­gle level of den­si­ty, which gives less flex­i­bil­ity to users for con­trol­ling the input sketch­es. In this work, we pro­pose the first mul­ti­-level den­sity sketch-­to-im­age trans­la­tion frame­work, which allows the input sketch to cover a wide range from rough object out­lines to micro struc­tures. More­over, to tackle the prob­lem of non­con­tin­u­ous rep­re­sen­ta­tion of mul­ti­-level den­sity input sketch­es, we project the den­sity level into a con­tin­u­ous latent space, which can then be lin­early con­trolled by a para­me­ter. This allows users to con­ve­niently con­trol the den­si­ties of input sketches and gen­er­a­tion of images. More­over, our method has been suc­cess­fully ver­i­fied on var­i­ous datasets for dif­fer­ent appli­ca­tions includ­ing face edit­ing, mul­ti­-­modal sketch-­to-photo trans­la­tion, and anime col­oriza­tion, pro­vid­ing coarse-­to-fine lev­els of con­trols to these appli­ca­tions.

  • , Akita et al 2020:

    Many stud­ies have recently applied deep learn­ing to the auto­matic col­oriza­tion of line draw­ings. How­ev­er, it is dif­fi­cult to paint empty pupils using exist­ing meth­ods because the net­works are trained with pupils that have edges, which are gen­er­ated from color images using image pro­cess­ing. Most actual line draw­ings have empty pupils that artists must paint in. In this paper, we pro­pose a novel net­work model that trans­fers the pupil details in a ref­er­ence color image to input line draw­ings with empty pupils. We also pro­pose a method for accu­rately and auto­mat­i­cally col­or­ing eyes. In this method, eye patches are extracted from a ref­er­ence color image and auto­mat­i­cally added to an input line draw­ing as color hints using our eye posi­tion esti­ma­tion net­work.

  • “Dan­booRe­gion: An Illus­tra­tion Region Dataset”, Zhang et al 2020 (Github):

    Region is a fun­da­men­tal ele­ment of var­i­ous car­toon ani­ma­tion tech­niques and artis­tic paint­ing appli­ca­tions. Achiev­ing sat­is­fac­tory region is essen­tial to the suc­cess of these tech­niques. Moti­vated to assist diverse region-based car­toon appli­ca­tions, we invite artists to anno­tate regions for in-the-wild car­toon images with sev­eral appli­ca­tion-ori­ented goals: (1) To assist image-based car­toon ren­der­ing, relight­ing, and car­toon intrin­sic decom­po­si­tion lit­er­a­ture, artists iden­tify object out­lines and elim­i­nate light­ing-and-shadow bound­aries. (2) To assist car­toon ink­ing tools, car­toon struc­ture extrac­tion appli­ca­tions, and car­toon tex­ture pro­cess­ing tech­niques, artists clean-up tex­ture or defor­ma­tion pat­terns and empha­size car­toon struc­tural bound­ary lines. (3) To assist region-based car­toon dig­i­tal­iza­tion, clip-art vec­tor­iza­tion, and ani­ma­tion track­ing appli­ca­tions, artists inpaint and recon­struct bro­ken or blurred regions in car­toon images. Given the typ­i­cal­ity of these involved appli­ca­tions, this dataset is also likely to be used in other car­toon tech­niques. We detail the chal­lenges in achiev­ing this dataset and present a human-in-the-loop work­flow named Fea­si­bil­i­ty-based Assign­ment Rec­om­men­da­tion (FAR) to enable large-s­cale anno­tat­ing. The FAR tends to reduce artist trail­s-and-er­rors and encour­age their enthu­si­asm dur­ing anno­tat­ing. Final­ly, we present a dataset that con­tains a large num­ber of artis­tic region com­po­si­tions paired with cor­re­spond­ing car­toon illus­tra­tions. We also invite mul­ti­ple pro­fes­sional artists to assure the qual­ity of each anno­ta­tion. [Key­words: artis­tic cre­ation, fine art, car­toon, region pro­cess­ing]

  • , Ko & Cho 2020 (Github):

    The trans­la­tion of comics (and Man­ga) involves remov­ing text from a for­eign comic images and type­set­ting trans­lated let­ters into it. The text in comics con­tain a vari­ety of deformed let­ters drawn in arbi­trary posi­tions, in com­plex images or pat­terns. These let­ters have to be removed by experts, as com­pu­ta­tion­ally eras­ing these let­ters is very chal­leng­ing. Although sev­eral clas­si­cal image pro­cess­ing algo­rithms and tools have been devel­oped, a com­pletely auto­mated method that could erase the text is still lack­ing. There­fore, we pro­pose an image pro­cess­ing frame­work called ‘SickZil-­Ma­chine’ (SZMC) that auto­mates the removal of text from comics. SZMC works through a two-step process. In the first step, the text areas are seg­mented at the pixel lev­el. In the sec­ond step, the let­ters in the seg­mented areas are erased and inpainted nat­u­rally to match their sur­round­ings. SZMC exhib­ited a notable per­for­mance, employ­ing deep learn­ing based image seg­men­ta­tion and image inpaint­ing mod­els. To train these mod­els, we con­structed 285 pairs of orig­i­nal comic pages, a text area-­mask dataset, and a dataset of 31,497 comic pages. We iden­ti­fied the char­ac­ter­is­tics of the dataset that could improve SZMC per­for­mance.

  • , Del Gobbo & Her­rera 2020:

    The detec­tion and recog­ni­tion of uncon­strained text is an open prob­lem in research. Text in comic books has unusual styles that raise many chal­lenges for text detec­tion. This work aims to iden­tify text char­ac­ters at a pixel level in a comic genre with highly sophis­ti­cated text styles: Japan­ese man­ga. To over­come the lack of a manga dataset with indi­vid­ual char­ac­ter level anno­ta­tions, we cre­ate our own. Most of the lit­er­a­ture in text detec­tion use bound­ing box met­rics, which are unsuit­able for pix­el-level eval­u­a­tion. Thus, we imple­mented spe­cial met­rics to eval­u­ate per­for­mance. Using these resources, we designed and eval­u­ated a deep net­work mod­el, out­per­form­ing cur­rent meth­ods for text detec­tion in manga in most met­rics.

  • , Zheng et al 2020:

    This paper deals with a chal­leng­ing task of learn­ing from dif­fer­ent modal­i­ties by tack­ling the dif­fi­culty prob­lem of jointly face recog­ni­tion between abstrac­t-­like sketch­es, car­toons, car­i­ca­tures and real-life pho­tographs. Due to the sig­nif­i­cant vari­a­tions in the abstract faces, build­ing vision mod­els for rec­og­niz­ing data from these modal­i­ties is an extremely chal­leng­ing. We pro­pose a novel frame­work termed as Meta-­Con­tin­ual Learn­ing with Knowl­edge Embed­ding to address the task of jointly sketch, car­toon, and car­i­ca­ture face recog­ni­tion. In par­tic­u­lar, we firstly present a deep rela­tional net­work to cap­ture and mem­o­rize the rela­tion among dif­fer­ent sam­ples. Sec­ond­ly, we present the con­struc­tion of our knowl­edge graph that relates image with the label as the guid­ance of our meta-learn­er. We then design a knowl­edge embed­ding mech­a­nism to incor­po­rate the knowl­edge rep­re­sen­ta­tion into our net­work. Third­ly, to mit­i­gate cat­a­strophic for­get­ting, we use a meta-­con­tin­ual model that updates our ensem­ble model and improves its pre­dic­tion accu­ra­cy. With this meta-­con­tin­ual mod­el, our net­work can learn from its past. The final clas­si­fi­ca­tion is derived from our net­work by learn­ing to com­pare the fea­tures of sam­ples. Exper­i­men­tal results demon­strate that our approach achieves sig­nif­i­cantly higher per­for­mance com­pared with other state-of-the-art approach­es.

  • , Cao et al 2020:

    The cartoon animation industry has developed into a huge industrial chain with a large potential market involving games, digital entertainment, and other industries. However, due to the coarse-grained classification of cartoon materials, cartoon animators can hardly find relevant materials during the process of creation. The polar emotions of cartoon materials are an important reference for creators, as they can help them easily obtain the pictures they need. Some methods for obtaining the emotions of cartoon pictures have been proposed, but most of these focus on expression recognition, while other emotion-recognition methods are not ideal for use with cartoon materials. We propose a deep-learning-based method to classify the polar emotions of cartoon pictures of the “Moe” drawing style. According to the expression features of the cartoon characters of this drawing style, we recognize the facial expressions of cartoon characters and extract the scene and facial features of the cartoon images. Then, we correct the emotions of the pictures obtained by the expression recognition according to the scene features. Finally, we obtain the polar emotion of the corresponding picture. We designed a dataset and performed verification tests on it, achieving 81.9% experimental accuracy. The experimental results prove that our method is competitive. [Keywords: cartoon; emotion classification; deep learning]

  • , Huang et al 2020:

    Image-to-Image (I2I) translation is a hot topic in academia, and it has also been applied in real-world industry for tasks like image synthesis, super-resolution, and colorization. However, traditional I2I translation methods train data in two or more domains together. This requires lots of computation resources. Moreover, the results are of lower quality, and they contain many more artifacts. The training process could be unstable when the data in the different domains are not balanced, and mode collapse is more likely to happen. We propose a new I2I translation method that generates a new model in the target domain via a series of model transformations on a pre-trained StyleGAN2 model in the source domain. After that, we propose an inversion method to achieve the conversion between an image and its latent vector. By feeding the latent vector into the generated model, we can perform I2I translation between the source domain and target domain. Both qualitative and quantitative evaluations were conducted to prove that the proposed method can achieve outstanding performance in terms of image quality, diversity, and semantic similarity to the input and reference images compared to state-of-the-art works.

  • , Robb et al 2020:

    Gen­er­a­tive Adver­sar­ial Net­works (GANs) have shown remark­able per­for­mance in image syn­the­sis tasks, but typ­i­cally require a large num­ber of train­ing sam­ples to achieve high­-qual­ity syn­the­sis. This paper pro­poses a sim­ple and effec­tive method, Few-Shot GAN (FSGAN), for adapt­ing GANs in few-shot set­tings (less than 100 images). FSGAN repur­poses com­po­nent analy­sis tech­niques and learns to adapt the sin­gu­lar val­ues of the pre-­trained weights while freez­ing the cor­re­spond­ing sin­gu­lar vec­tors. This pro­vides a highly expres­sive para­me­ter space for adap­ta­tion while con­strain­ing changes to the pre­trained weights. We val­i­date our method in a chal­leng­ing few-shot set­ting of 5–100 images in the tar­get domain. We show that our method has sig­nif­i­cant visual qual­ity gains com­pared with exist­ing GAN adap­ta­tion meth­ods. We report qual­i­ta­tive and quan­ti­ta­tive results show­ing the effec­tive­ness of our method. We addi­tion­ally high­light a prob­lem for few-shot syn­the­sis in the stan­dard quan­ti­ta­tive met­ric used by data-­ef­fi­cient image syn­the­sis works. Code and addi­tional results are avail­able at this URL.

Scraping

This project is not officially affiliated with or run by Danbooru; however, the site founder Albert (and his successor, Evazion) has given permission for scraping. I have registered the accounts gwern and gwern-bot for use in downloading & participating on Danbooru; it is considered good research ethics to try to offset any use of resources when crawling an online community (eg running Tor nodes to pay back the bandwidth), so I have donated $20 to Danbooru via an account upgrade.

Danbooru IDs are sequential positive integers, but the images are stored under their MD5 hashes; so downloading the full image for a given ID means querying the JSON API for that ID's metadata, extracting the URL of the full-size upload, and downloading it to a file named after the ID plus its file extension.
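
A minimal shell sketch of that workflow, assuming the current posts.json endpoint and that the metadata exposes file_url & file_ext fields (some posts may additionally require API-key authentication):

## look up one post by ID and save the full-size image as ID.extension
ID=3734659
METADATA=$(curl --silent "https://danbooru.donmai.us/posts/$ID.json")
URL=$(echo "$METADATA" | jq --raw-output '.file_url')
EXT=$(echo "$METADATA" | jq --raw-output '.file_ext')
wget --quiet "$URL" --output-document="$ID.$EXT"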

The meta­data can be down­loaded from Big­Query via BigQuery-API-based tools.
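
For example, a sketch using Google's bq command-line tool; the dataset/table path and column names below are placeholders (the columns are guessed from the Danbooru API), so substitute the actual names of the BigQuery mirror:

## hypothetical table path & columns; adjust to the real BigQuery mirror
bq query --use_legacy_sql=false --format=json \
    'SELECT id, md5, rating, tag_string
     FROM `danbooru-mirror.danbooru.posts`
     WHERE id > 3700000
     LIMIT 1000' > metadata-sample.json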

Bugs

Known bugs:

  • all: the metadata does not include the translations or bounding-boxes of captions/translations (“notes”); they were omitted from the BigQuery mirror, and technical problems meant they could not be added to BQ before release. The captions/translations can be retrieved via the Danbooru API if necessary (a sketch of such an API call appears after this list).

  • 512px SFW subset: some images have transparent backgrounds; if they are also black-white, like black line-art drawings, then the conversion to JPG with a default black background will render them almost 100% black and the image will be invisible (eg files with the two tags transparent_background lineart). This affects somewhere in the hundreds of images. Users can either ignore this as affecting a minute percentage of files, filter out images based on the tag-combination, or include data-quality checks in their image-loading code to drop anomalous images with too few unique colors or which are too white/too black (see the data-quality sketch after this list).

    Proposed fix: in the next version, Danbooru2020’s 512px SFW subset, the downscaling will switch to white backgrounds rather than black backgrounds; while the same issue can still arise in the case of white line-art drawings with transparent backgrounds, these are much rarer. (It might also be possible to make the conversion shell script query images for use of transparency, average the contents, and pick a background which is most opposite the content.)
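
For the missing notes, a hedged sketch of retrieving them per-post, assuming the current notes.json endpoint and its search[post_id] parameter (the post ID and the note fields shown are illustrative and may differ):

## hypothetical example post ID; --globoff stops curl from mangling the [] in the query string
ID=1234567
curl --silent --globoff "https://danbooru.donmai.us/notes.json?search[post_id]=$ID" \
    | jq '.[] | {x, y, width, height, body}'

For the transparency bug, one possible data-quality check is to flag downscaled JPGs whose mean intensity is extreme or which contain very few unique colors; a sketch using ImageMagick's identify (the directory layout & thresholds are illustrative, not prescriptive):

## flag suspiciously dark/bright or near-monochrome images in the 512px subset
for f in 512px/*/*.jpg; do
    read mean colors <<< "$(identify -format '%[fx:mean] %k' "$f")"
    if awk -v m="$mean" -v k="$colors" 'BEGIN{ exit !(m < 0.01 || m > 0.99 || k < 8) }'; then
        echo "$f"    # candidate anomaly: nearly all-black, all-white, or near-monochrome
    fi
done > anomalous.txt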

Future work

Metadata Quality Improvement via Active Learning

How high-quality is the Danbooru metadata? As with ImageNet, it is critical that the tags be highly accurate, or else the label noise will lower-bound the achievable error rates and impede the learning of taggers, especially on rarer tags, where even a low error rate may cause false negatives to outweigh the true positives.

I would say that the quality of the Danbooru tag data is quite high but imbalanced: almost all tags present on images are correct, but the absence of a tag is often wrong—that is, many applicable tags are missing on Danbooru (there are so many possible tags that no user could possibly know them all). So the absence of a tag is not as informative as the presence of a tag—eyeballing images and some rarer tags, I would guess that such tags are present on <10% of the images they should be.

This sug­gests lever­ag­ing an active learn­ing (Set­tles 2010) form of train­ing: train a tag­ger, have a human review the errors, update the meta­data when it was not an error, and retrain.

More specifically, each iteration would look like this:

  • train the tagger;

  • run the tagger on the entire dataset, recording the outputs and errors;

  • a human examines the errors interactively by comparing the supposed error with the image: for false negatives, the tag can be added to the Danbooru source using the Danbooru API and added to the local image metadata database, and for false positives, the ‘negative tag’ can be added to the local database;

  • train a new model (possibly initializing from the last checkpoint).

Since there will probably be thousands of errors, one would go through them by magnitude of error: for a false positive, start with tagging probabilities of 1.0 and go down, and for false negatives, 0.0 and go up. This would be equivalent to the active learning strategy “uncertainty sampling”, which is simple, easy to implement, and effective (albeit not necessarily optimal for active learning, as the worst errors will tend to be highly correlated/redundant and the set of corrections overkill). Once all errors have been hand-checked, the training weight on absent tags can be increased, as any missing tags should have shown up as false positives.

Over mul­ti­ple iter­a­tions of active learn­ing + retrain­ing, the pro­ce­dure should be able to fer­ret out errors in the dataset and boost its qual­ity while also increas­ing its per­for­mance.

Based on my expe­ri­ences with semi­-au­to­matic edit­ing on Wikipedia (us­ing pywikipediabot for solv­ing dis­am­bigua­tion wik­ilinks), I would esti­mate that given an appro­pri­ate ter­mi­nal inter­face, a human should be able to check at least 1 error per sec­ond and so check­ing ~30,000 errors per day is pos­si­ble (al­beit extremely tedious). Fix­ing the top mil­lion errors should offer a notice­able increase in per­for­mance.
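
As a concrete (and entirely hypothetical) sketch of such a terminal interface: given a TSV of the tagger's most-confident disagreements with the metadata (image path, tag, predicted probability, already sorted by confidence), a reviewer could step through them one keystroke at a time, logging decisions for later upload via the Danbooru API; the file names, column layout, and use of the feh image viewer are all illustrative:

## errors.tsv columns: image-path <TAB> tag <TAB> predicted-probability
while IFS=$'\t' read -r image tag prob; do
    feh --geometry 512x512 "$image" &       # display the image under review
    viewer=$!
    printf '%s  p=%s  tag %s -- correct? [y/n/q] ' "$image" "$prob" "$tag"
    read -r answer < /dev/tty
    kill "$viewer" 2> /dev/null
    case "$answer" in
        y) printf '%s\t%s\tadd\n'    "$image" "$tag" >> corrections.tsv ;;  # tag applies: metadata was missing it
        n) printf '%s\t%s\tnegate\n' "$image" "$tag" >> corrections.tsv ;;  # tagger was wrong: record a local 'negative tag'
        q) break ;;
    esac
done < errors.tsv

At roughly one decision per second, this matches the ~30,000-errors/day throughput estimated above.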

There are many open ques­tions about how best to opti­mize tag­ging per­for­mance: is it bet­ter to refine tags on the exist­ing set of images or would adding more only-­par­tial­ly-­tagged images be more use­ful?

Appendix

Shell queries for statistics

## count number of images/files in Danbooru2019
find /media/gwern/Data2/danbooru2019/original/ -type f | wc --lines
# 3692578
## count total filesize of original fullsized images in Danbooru2019:
du -sch /media/gwern/Data2/danbooru2019/original/
# 2.8T

# on JSON files concatenated together:
## number of unique tags
cd metadata/; cat * > all.json
cat all.json | jq '.tags | .[] | .name' > tags.txt
sort -u tags.txt  | wc --lines
# 392446
## number of total tags
wc --lines tags.txt
# 108029170
## Average tag count per image:
R
# R> 108029170 / 3692578
# [1] 29.2557584
## Most popular tags:
sort tags.txt  | uniq -c | sort -g | tac | head -19
# 2617569 "1girl"
# 2162839 "solo"
# 1808646 "long_hair"
# 1470460 "highres"
# 1268611 "breasts"
# 1204519 "blush"
# 1101925 "smile"
# 1009723 "looking_at_viewer"
# 1006628 "short_hair"
#  904246 "open_mouth"
#  802786 "multiple_girls"
#  758690 "blue_eyes"
#  722932 "blonde_hair"
#  686706 "brown_hair"
#  675740 "skirt"
#  630385 "touhou"
#  606550 "large_breasts"
#  592200 "hat"
#  588769 "thighhighs"

## count Danbooru images by rating
cat all.json  | jq '.rating' > ratings.txt
sort ratings.txt  | uniq -c | sort -g
#  315713 "e"
#  539329 "q"
# 2853721 "s"

wc --lines ratings.txt
# 3708763 ratings.txt
R
# R> c(315713, 539329, 2853721) / 3708763
# [1] 0.0851262267  0.1454201846  0.7694535887

# earliest upload:
cat all.json | jq '.created_at' | fgrep '2005' > uploaded.txt
sort -g uploaded.txt | head -1
# "2005-05-24 03:35:31 UTC"

  1. Danbooru is not the largest anime image booru in existence (TBIB, for example, claims >4.7m images, or almost twice as many, by mirroring from multiple boorus), but Danbooru is generally considered to focus on higher-quality images & have better tagging; I suspect >4m images is into diminishing returns and the focus then ought to be on improving the metadata. Google finds () that image classification is logarithmic in image count up to n = 300M with noisy labels, which I interpret as suggesting that for the rest of us with limited hard drives & compute, going past millions is not that helpful; unfortunately, that experiment doesn’t examine the impact of the noise in their categories, so one can’t guess how many images each additional tag is equivalent to for improving final accuracy. (They do compare training on equally large datasets with small vs large numbers of categories, but fine vs coarse-grained categories is not directly comparable to a fixed number of images with fewer or more tags on each image.) The impact of tag noise could be quantified by removing varying numbers of random images/tags and comparing the curve of final accuracy. As adding more images is hard but semi-automatically fixing tags with an active-learning approach should be easy, I would bet that the cost-benefit is strongly in favor of improving the existing metadata rather than adding more images from recent Danbooru uploads or other -boorus.↩︎

  2. This is done to save >100GB of space/bandwidth; it is true that the lossless optimization will invalidate the MD5s, but the original MD5 hashes are available in the metadata (and many thousands of them are incorrect even on the original Danbooru server), and the files’ true hashes are inherently validated as part of the BitTorrent download process—so there is little point in anyone either checking them or trying to avoid modifying files, and lossless optimization saves a great deal.↩︎

  3. If one is only inter­ested in the meta­data, one could run queries on the Big­Query ver­sion of the Dan­booru data­base instead of down­load­ing the tor­rent. The Big­Query data­base is also updated dai­ly.↩︎

  4. Appar­ently a bug due to an anti-­DoS mech­a­nism, which should be fixed.↩︎

  5. An author of style2paints, a NN painter for anime images, notes that stan­dard style trans­fer approaches (typ­i­cally using an Ima­geNet-based CNN) fail abysmally on anime images: “All trans­fer­ring meth­ods based on Anime Clas­si­fier are not good enough because we do not have anime Ima­geNet”. This is inter­est­ing in part because it sug­gests that Ima­geNet CNNs are still only cap­tur­ing a sub­set of human per­cep­tion if they only work on pho­tographs & not illus­tra­tions.↩︎

  6. Danbooru2019 does not by default provide a “face” dataset of images cropped to just faces like that of Getchu or Nagadomi’s moeimouto; however, the tags can be used to filter down to a large set of face closeups, and Nagadomi’s face-detection code is highly effective at extracting faces from Danbooru2019 images & can be combined with waifu2× for creating large sets of large face images. (A rough tag-based filter is sketched after these notes.)↩︎

  7. See for exam­ple the pair high­lighted in , moti­vat­ing them to use human dia­logues to pro­vide more descriptions/supervision.↩︎

  8. A tag­ger could be inte­grated into the site to auto­mat­i­cally pro­pose tags for new­ly-u­ploaded images to be approved by the upload­er; new users, uncon­fi­dent or unfa­mil­iar with the full breadth, would then have the much eas­ier task of sim­ply check­ing that all the pro­posed tags are cor­rect.↩︎
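
As a rough first pass on the tag-based filtering mentioned in note 6, candidate post IDs can be pulled out of the metadata with jq; the specific tags used here (“portrait”, “face”, “close-up”) are merely illustrative candidates and should be checked against the Danbooru tag wiki:

## list IDs of SFW posts carrying face-centric tags, for later cropping/face-detection
cat metadata/* \
    | jq --raw-output 'select(.rating == "s")
                       | select([.tags[].name] | any(. == "portrait" or . == "face" or . == "close-up"))
                       | .id' > face-ids.txt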