Anime Crop Datasets: Faces, Figures, & Hands

Description of 3 anime datasets for machine learning based on Danbooru: cropped anime faces, whole-single-character crops, and hand crops (with hand detection model).
NN, anime, dataset
2020-05-10–2020-08-05 · finished · certainty: log · importance: 4


Documentation of 3 anime datasets for machine learning based on Danbooru: 300k cropped anime faces (primarily used for StyleGAN/This Waifu Does Not Exist), 855k whole-single-character figure crops (extracted from Danbooru using AniSeg), and 58k hand crops (based on a dataset of 14k hand-annotated bounding boxes used to train a YOLOv3 hand detection model).

These datasets can be used for machine learning directly, or as data augmentation: faces, figures, and hands are some of the most noticeable features of anime images, and by cropping images down to just those 3 features, they can enhance modeling of them by eliminating distracting context, zooming in, and increasing their weight during training.

Danbooru2019 Portraits

Danbooru2019 Portraits is a dataset of n = 302,652 (16GB) 512px anime faces cropped from solo SFW Danbooru2019 images in a relatively broad ‘portrait’ style encompassing necklines/ears/hats/etc rather than tightly focused on the face, upscaled to 512px as necessary, with low-quality images deleted by manual review & discriminator ranking; it has been used for creating TWDNE.

The Portraits dataset was constructed to train StyleGAN for creating TWDNE.

Faces → Portraits Motivation

The main issues I saw for the faces based on TWDNE feedback were:

  1. Sexually-Suggestive Faces: because I had not expected StyleGAN to work or to wind up making something like TWDNE, I had not taken the effort to crop faces solely from the SFW subset (since no GAN had proven to be good enough to pick up any embarrassing details and I was more concerned with maximizing the dataset size).

    Danbooru is divided into 3 ratings, “safe”/“questionable”/“explicit”, with “questionable” bordering on softcore. The explicitly-NSFW images make up only ~9% of Danbooru, but between the SFW-but-suggestive images, the explicit ones, and StyleGAN’s learning capabilities, this proved to be enough to make some of the faces quite naughty-looking. Naturally, everyone insisted on joking about this. This could be fixed simply by filtering to “safe”-only images rather than merely filtering out “explicit” ones.

  2. Head Crops: Nagadomi’s face-cropper is a face cropper, not a head-cropper or a portrait-cropper.

    The face-cropper centers its crops on the center of a face (like the nose) and, given the original bounding box, will necessarily cut off all the additional details associated with anime heads such as the ‘ahoge’ or bunny ears or twin-tails, since those are not faces. Similarly, I had left Nagadomi’s face-cropper on the default settings instead of bothering to tweak it to produce more head-shot-like crops, since if GANs couldn’t master the faces, there was no point in making the problem even harder & worrying about details of the hair.

    This was not good for characters with distinctive hats or hair or animal ears (such as Holo’s wolf ears). This could be fixed by playing with scaling the bounding box around the face by different x/y multipliers to see what picks up the rest of the head. (Another approach would be to use AniSeg to detect the face & whole-character figure simultaneously, and crop the figure from its top to the bottom of the face.)

  3. Messy Backgrounds/Bodies: I suspected that the tightness of the crops also made it hard for StyleGAN to learn things at the edges, like backgrounds or shoulders, because they would always be partial if the face-cropper was doing its job.

    With bigger crops, there would be more variation and more opportunity to see whole shoulders or large unobstructed backgrounds, and this might lead to more convincing overall images.

  4. Holo/Asuka Overrepresentation: to my surprise, TWDNE viewers seemed quite annoyed by the overrepresentation of Holo/Asuka-like (but mostly Holo) samples.

    For the same reason as not filtering to SFW, I had thrown in 2 earlier datasets I had made of Holo & Asuka faces: I had made them at 512px and cleaned them fairly thoroughly, and they would increase the dataset size, so why not? Being overrepresented, and well-represented in Danbooru (a major part of why I had chosen them in the first place to make prototype datasets with), StyleGAN was of course more likely to generate samples looking like them than other popular anime characters.1 Why this annoyed people, I don’t understand, but it might as well be fixed.

  5. Persistent Global Artifacts: despite the generally excellent results, there are still occasional bizarre anomalous images which are scarcely faces at all, even with 𝜓=0.7; I suspect that this may be due to the small percentage of non-faces, cut-off faces, or just poorly/weirdly-drawn faces, and that more stringent data cleaning would help polish the model.

Portraits Improvements

Issues #1–3 can be fixed by transfer-learning StyleGAN on a new dataset made of faces from the SFW subset and cropped with much larger margins to produce more ‘portrait’-style face crops. (There would still be many errors or suboptimal crops, but I am not sure there is any full solution short of training a face-localization CNN just for anime images.)

For this, I needed to edit lbpcascade_animeface’s crop.py and adjust the margins. Experimenting, I changed the cropping line to:

    # expand the detected box: start the crop well above the face and widen it
    # left & right, so hair, hats, & ears are retained in a 'portrait' crop
    for (x, y, w, h) in faces:
        cropped = image[int(y*0.25): y + h, int(x*0.90): x + int(w*1.25)]
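
For context, here is a minimal standalone sketch of the whole expanded-margin cropping loop (not the exact script used): it assumes OpenCV plus a local copy of Nagadomi’s lbpcascade_animeface.xml cascade, and the paths, detection parameters, & naive 512px resize are merely illustrative:

    # Hedged sketch: expanded-margin 'portrait' face cropping with lbpcascade_animeface.
    # Assumes lbpcascade_animeface.xml has been downloaded into the working directory.
    import glob
    import os
    import cv2

    cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")

    def crop_portraits(in_glob="danbooru/original/*.jpg", out_dir="portraits/"):
        os.makedirs(out_dir, exist_ok=True)
        for path in glob.glob(in_glob):
            image = cv2.imread(path)
            if image is None:                     # unreadable/corrupt file
                continue
            gray = cv2.equalizeHist(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))
            faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                             minNeighbors=5, minSize=(128, 128))
            for i, (x, y, w, h) in enumerate(faces):
                # same expanded margins as above: extra headroom above the face,
                # and extra width left & right for hair/hats/ears
                cropped = image[int(y*0.25): y + h, int(x*0.90): x + int(w*1.25)]
                if cropped.size == 0:
                    continue
                cropped = cv2.resize(cropped, (512, 512))   # naive square resize
                out = os.path.join(out_dir, f"{os.path.basename(path)}-{i}.jpg")
                cv2.imwrite(out, cropped)

    crop_portraits()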

These margins seemed to deliver acceptable results which generally show the entire head while leaving enough room for extra background or hats/ears (although there is still the occasional error, like an image with multiple faces or a head still partially cropped):

100 real faces from the ‘portrait’ dataset (SFW Danbooru2018 cropped with expanded margins) in a 10×10 grid

After cropping all ~2.8m SFW Danbooru2018 full-resolution images (as demonstrated in the cropping section), I was left with ~700k faces. This was a large dataset, but the disadvantage was that many heads/faces overlapped, so after a few weeks of training, I had decent portraits marred by strange hydra-like heads jutting in from the side. So I redid the cropping process using the solo tag to eliminate images which might have multiple faces in them.

Issue #4 is solved by just not adding the Asuka/Holo datasets.

Finally, issue #5 is harder to deal with: pruning 200k+ images by hand is infeasible, there’s no easy way to improve the face-cropping script, and I don’t have the budget to Mechanical-Turk review all the faces like Karras et al 2018 did for FFHQ to remove their false positives (like statues).

One way I do have to improve it is to exploit the Discriminator of a pretrained face GAN. The anime face StyleGAN D would be ideal since it clearly works so well already, so I wrote a ranker.py script (see previous section) to use a StyleGAN checkpoint and rank specified images on disk, and then rebuilt the .tfrecords with troublesome images removed. (This process can be repeated as the StyleGAN model improves and the D improves its ability to spot anomalies.) I engaged in 5 cycles of ranker.py cleaning over April 2019, deleting 14k images; it seemed to reduce some of the artifacting related to hands.
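
A rough sketch of the discriminator-ranking idea (not the actual ranker.py): assuming the original TensorFlow StyleGAN codebase (dnnlib) is importable, a trained .pkl snapshot is on disk, and the D accepts NCHW float32 images scaled to [-1, 1], images can be scored and the lowest-ranked flagged for review:

    # Hedged sketch of discriminator ranking; assumes the TF1 StyleGAN repo's dnnlib
    # is importable and 'network-snapshot.pkl' is a trained anime-face checkpoint.
    import glob
    import pickle
    import numpy as np
    import PIL.Image
    import dnnlib.tflib as tflib

    tflib.init_tf()
    with open("network-snapshot.pkl", "rb") as f:
        _G, D, _Gs = pickle.load(f)              # StyleGAN pickles hold (G, D, Gs)

    def score(path):
        img = PIL.Image.open(path).convert("RGB").resize((512, 512))
        arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0   # scale to [-1, 1]
        arr = arr.transpose(2, 0, 1)[np.newaxis]                # HWC -> NCHW batch of 1
        return float(D.run(arr, None)[0])        # higher = more 'real'-looking to the D

    ranked = sorted((score(p), p) for p in glob.glob("portraits/*.jpg"))
    for s, p in ranked[:1000]:                   # flag the 1,000 lowest-ranked images
        print(f"{s:.3f}\t{p}")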

Portraits Dataset

The final 512px portrait dataset (with portrait crops, improved filtering via solo, & discriminator ranking for cleaning) is available for download via rsync (16GB, n = 302,652):

rsync --verbose --recursive rsync://78.46.86.149:873/biggan/portraits/ ./portraits/

Portraits Citing

Please cite this dataset as:

  • Gwern Branwen, Anonymous, & The Danbooru Community; “Danbooru2019 Portraits: A Large-Scale Anime Head Illustration Dataset”, 2019-03-12. Web. Accessed [DATE] https://www.gwern.net/Crops#danbooru2019-portraits

    @misc{danbooru2019Portraits,
        author = {Gwern Branwen and Anonymous and Danbooru Community},
        title = {Danbooru2019 Portraits: A Large-Scale Anime Head Illustration Dataset},
        howpublished = {\url{https://www.gwern.net/Crops#danbooru2019-portraits}},
        url = {https://www.gwern.net/Crops#danbooru2019-portraits},
        type = {dataset},
        year = {2019},
        month = {March},
        timestamp = {2019-03-12},
        note = {Accessed: DATE} }

Danbooru2019 Figures

The Danbooru2019 Figures dataset is a large-scale anime character illustration dataset of n = 855,880 images (248GB; minimum width 512px) cropped from Danbooru2019 using the AniSeg anime character detection model. The images are cropped to focus on a single character’s entire visible body, extending ‘portrait’ crops to ‘figure’ crops. This is useful for tasks focusing on individual characters, such as character classification, or for generative tasks (a corpus for weaker models like StyleGAN, or data augmentation for BigGAN).

40 random figure crops from Danbooru2019 (4×10 grid, resized to 256px)

I created this dataset to assist our BigGAN training by data augmentation of difficult object classes: by providing a large set of images cropped to just the character (as opposed to the usual random crops), BigGAN should better learn body structure and reuse that knowledge elsewhere. Focus on just the hard parts. This is a ML trick which we have used for faces/portraits in BigGAN, and will use for hands as well. This could also be useful for StyleGAN, by greatly restricting the variation in images to single centered objects (StyleGAN falls apart when it needs to model multiple objects in a variety of positions). Other applications might be using it as a starting dataset for object localizers to crop out things like faces, where images with multiple instances would be ambiguous or too occluded (multiple faces overlapping) or too low-quality (eg backgrounds), so the whole Danbooru2019 dataset wouldn’t be as useful.

Figures Download

To download the cropped images:

rsync --verbose --recursive rsync://78.46.86.149:873/biggan/danbooru2019-figures ./danbooru2019-figures/

Figures Construction

Details of getting AniSeg running & cropping. Danbooru2019 Figures was constructed by filtering images from Danbooru2019 by solo & SFW status (~1,538,723 images of ~65,783 characters illustrated by ~133,856 artists), and then cropping using Jerry Li’s AniSeg model (a TensorFlow Object Detection API-based Python model for anime character face detection & portrait segmentation), which Li constructed by annotating images from Danbooru20182.
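
For illustration, a sketch of the solo/SFW filter, assuming the Danbooru metadata JSONL layout in which each line is a post with an id, a rating ("s"/"q"/"e"), and a tags list of {"name": …} objects (field names should be checked against the actual metadata release):

    # Hedged sketch: select Danbooru2019 post IDs which are rated 'safe' and tagged 'solo'.
    # Assumes the standard Danbooru20xx metadata JSONL layout; field names may differ.
    import glob
    import json

    keep = set()
    for meta_file in glob.glob("danbooru2019/metadata/*.json"):
        with open(meta_file, encoding="utf-8") as f:
            for line in f:
                post = json.loads(line)
                tags = {t["name"] for t in post.get("tags", [])}
                if post.get("rating") == "s" and "solo" in tags:
                    keep.add(post["id"])

    print(f"{len(keep)} solo SFW posts selected")
    # The kept IDs can then be symlinked/copied into a working directory
    # before running the AniSeg figure cropper shown below.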

Before running AniSeg, I had to make 3 changes: AniSeg had two bugs (a SciPy dependency & the model-loading code), and the provided script for detecting faces/figures does not include any functionality for cropping images. The bugs have since been fixed & the detection code now supports cropping with the options --output_cropped_image/--only_output_cropped_single_object. At the time, I modified the script to do cropping without those options, and I ran the figure cropper (slowly) over Danbooru2019 like so:

python3 infer_from_image.py --inference_graph=./2019-04-29-jerryli27-aniseg-models-figurefacecrop/figuresegmentation.pb \
    --input_images='/media/gwern/Data2/danbooru2019/original-sfw-solo/*/*' \
    --output_path=/media/gwern/Data/danbooru2019-datasets/danbooru2019-figures

Filter & upscale. After cropping out figures, I followed the image processing described in my StyleGAN faces writeup: I converted the images to JPG, deleted images <50kb, deleted images <256px in width, used waifu2x to 2× upscale images <512px in width to ≥512px in width, and deleted monochrome images (images with <255 unique colors). Note that unlike the portraits dataset, these images are not resized to 512×512px squares with black backgrounds as necessary. This allows random crops if the user wants, and they can be downscaled as necessary (eg mogrify -resize 512x512\> -extent 512x512\> -gravity center -background black). This gave a final dataset of n = 855,880 JPGs (248GB).
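
A sketch of that cleanup pass (PIL for the size & color checks; the waifu2x binary name & flags are assumptions, as several waifu2x ports exist with different CLIs, and the JPG-conversion step is omitted):

    # Hedged sketch of the figure-crop cleanup: drop tiny/low-quality/monochrome crops
    # and 2x-upscale the narrow-but-usable ones with waifu2x (binary name is an assumption).
    import glob
    import os
    import subprocess
    from PIL import Image

    for path in glob.glob("danbooru2019-figures/*.jpg"):
        if os.path.getsize(path) < 50 * 1024:          # delete files <50kb
            os.remove(path)
            continue
        with Image.open(path) as img:
            width = img.width
            few_colors = img.convert("RGB").getcolors(maxcolors=255) is not None
        if width < 256 or few_colors:                  # too narrow or near-monochrome
            os.remove(path)
            continue
        if width < 512:                                # 2x upscale, then replace original
            up = path + ".up.jpg"
            subprocess.run(["waifu2x-ncnn-vulkan", "-s", "2", "-i", path, "-o", up],
                           check=True)
            os.replace(up, path)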

Figures Citing

Please cite this dataset as:

  • Gwern Branwen, Anonymous, & The Danbooru Community; “Danbooru2019 Figures: A Large-Scale Anime Character Illustration Dataset”, 2020-01-13. Web. Accessed [DATE] https://www.gwern.net/Crops#figures

    @misc{danbooru2019Figures,
        author = {Gwern Branwen and Anonymous and Danbooru Community},
        title = {Danbooru2019 Figures: A Large-Scale Anime Character Illustration Dataset},
        howpublished = {\url{https://www.gwern.net/Crops#figures}},
        url = {https://www.gwern.net/Crops#figures},
        type = {dataset},
        year = {2020},
        month = {May},
        timestamp = {2020-05-31},
        note = {Accessed: DATE} }

Hands

We create & release PALM: the PALM Anime Locator Model. PALM is a pretrained anime hand detector/localization neural network, and 3 sets of accompanying anime hand datasets:

  1. A dataset of 5,382 anime-style Danbooru2019 images annotated with the locations of 14,394 hands.

    This labeled dataset is used to train a model to detect hands in anime.

  2. A second dataset of 96,534 hands cropped from the Danbooru2019 SFW dataset using the PALM YOLO model.

  3. A cleaned version of #2, consisting of 58,536 hand crops upscaled to ≥512px.

Hand detection can be used to clean images (eg remove face images with any hands in the way), to generate datasets of just hands (as a form of data augmentation for GANs), to generate reference datasets for artists, or for other purposes.

After faces & whole bodies, the next most glaring source of artifacts in GAN anime samples like TWDNE’s is drawing hands. Hands are notorious among human artists for being difficult and easily breaking suspension of disbelief, and it’s worth noting that aside from the face, the hands are the biggest part of the cortical homunculus, suggesting the attention we pay to them; no wonder that so many illustrations carefully crop the subject to avoid hands, or tuck hands into dresses or sleeves, among many other resorts to avoid depicting hands. Common GAN failure: hands.

Trick: target errors using data augmentation. But even in face/portrait crops, hands appear frequently enough that StyleGAN will attempt to generate them, as hands occupy a relatively small part of the image at 512px while being highly varied & frequently occluded. BigGAN does somewhat better but still struggles with hands (eg in our BigGAN samples), unsure if they are round blobs or how many fingers should be visible. One way to train a NN is to oversample hard data points by active learning: seek out the class of errors and add in enough data that it can and must learn to solve them. Faces work well in our BigGAN because they are so common in the data, and can be further emphasized using my anime face datasets; bodies work reasonably well, and better after Danbooru2019 Figures was created & added. By the same logic, if hands are a glaring class of errors which BigGAN struggles with and which particularly break suspension of disbelief, adding additional hand data would help fix this. The most straightforward way to obtain a large corpus of anime hands is to use a hand detector to crop out hands from the ~3m images in Danbooru2019.

Hand Model

There are no pretrained anime hand detectors, and it is unlikely that standard human photographic hand detectors would work on anime (they don’t work on anime faces, and anime hands are even more stylized and abstract).

Rolling my own. Arfafax had considerable success in hand-labeling images (using a custom web interface for drawing bounding boxes on images) for a YOLO-based furry facial landmark & face detector, which he used to select & align images for his This Fursona Does Not Exist / This Pony Does Not Exist projects. We decided to use his workflow to build a hand detector and crop hands from Danbooru2019. Aside from the data augmentation trick, an anime hand detector would allow filtering out data with hands, or generated samples with hands, and doubtless people can find other uses for it.

Hand Annotations

Custom Danbooru annotation website. Instead of using random Danbooru2019 samples, which might not have useful hands and would yield mostly ‘easy’ hands for training, we enriched the corpus by selecting the 14k images corresponding to the query hands rating:s: hands is a Danbooru tag used “when an image has a character’s hand(s) as the main focus or emphasizes the usage of hands.” All the samples had hands, in different locations, sizes, styles, and occlusions, and some samples were challenging to annotate:

Example of annotating hands in the website for 2 particularly challenging Danbooru2019 images

Biting the bullet. We used Shawn Presser’s annotation website May–June 2020, and in total, we annotated n = 14,394 hands in k = 5,382 images (JSON). (I did ~10k annotations, which took ~24h over 3–4 evenings.)

Random selection of 297 hand-annotated hands cropped from Danbooru2019 hands images (downsized to 128px)
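
To train a darknet-style YOLO model on these annotations (next section), each training image needs a sidecar .txt label file with one "class x-center y-center width height" line per box, normalized to [0, 1]. A hedged conversion sketch, assuming hypothetical field names (file/width/height/boxes with pixel x/y/w/h) for the annotation JSON, which should be checked against the actual export:

    # Hedged sketch: convert bounding-box JSON annotations into darknet/YOLO label files.
    # The field names (file, width, height, boxes, x/y/w/h) are hypothetical placeholders.
    import json
    import os

    with open("hand-annotations.json", encoding="utf-8") as f:
        records = json.load(f)

    os.makedirs("labels", exist_ok=True)
    for rec in records:
        stem = os.path.splitext(os.path.basename(rec["file"]))[0]
        lines = []
        for box in rec["boxes"]:
            # darknet wants the box *center* and size, normalized by image dimensions
            xc = (box["x"] + box["w"] / 2) / rec["width"]
            yc = (box["y"] + box["h"] / 2) / rec["height"]
            lines.append(f"0 {xc:.6f} {yc:.6f} "
                         f"{box['w'] / rec['width']:.6f} {box['h'] / rec['height']:.6f}")
        with open(os.path.join("labels", stem + ".txt"), "w") as out:
            out.write("\n".join(lines) + "\n")   # class 0 = 'hand'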

YOLO Hand Model

Off-the-shelf YOLO model training. I trained a YOLOv3 model using the AlexeyAB darknet repo, following Arfafax’s notebook, using largely default settings. With n = 64 minibatches & 2k iterations on my 1080ti, it achieved a ‘total loss’ of 1.6; it didn’t look truly converged, so I retried with 6k iterations & n = 124 minibatches for ~8 GPU-hours, with a final loss of ~1.26. (I also attempted to train a YOLOv4 model with the same settings other than adjusting the subdivisions=16 setting, but it trained extremely slowly and had not approached YOLOv3’s performance after 16 GPU-hours with a loss of 3.6, and the YOLOv3 hand-cropping performance appeared satisfactory, so I didn’t experiment further to figure out what misconfiguration or other issue there was.)

Good enough. False positives typically are things like faces, flowers, feet, clouds or stars (particularly five-pointed ones), things with many parallel lines like floors or clothing, small animals, jewelry, and text captions or speech bubbles. The YOLO model appears to look for round objects with radial symmetry or rectangular objects with parallel lines (which makes sense). This model could surely be improved by training a more advanced model with more aggressive data augmentation & doing active learning on Danbooru2019 to finetune hard cases. (Rereading the YOLO docs, one easily-remedied flaw is the absence of negative samples: hard-mining hands meant that no images were labeled with zero hands, ie. every image had at least 1 hand to detect, which might bias the YOLO model towards finding hands.)
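
If one wanted to remedy that, the AlexeyAB darknet convention is simply to include object-free images in the training list with empty label files; a small sketch (paths illustrative):

    # Hedged sketch: add hand-free images as negative samples for darknet training by
    # giving them empty .txt label files and appending them to the training list.
    import glob
    import os

    negatives = glob.glob("negatives/*.jpg")   # images verified to contain no hands
    for path in negatives:
        open(os.path.splitext(path)[0] + ".txt", "w").close()   # empty label file

    with open("train.txt", "a") as f:
        f.write("\n".join(negatives) + "\n")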

Cropping Hands

58k 512px hands; 96k total. I modified Arfa’s example script to crop the SFW3 Danbooru2019 dataset (n = 2,285,676 JPG/PNGs) with the YOLOv3 hand model at a threshold of 0.6 (which yields roughly 1 hand per 10–20 original images and a false-positive rate of ~1 in 15); after some manual cleaning along the way, this yielded n = 96,534 cropped hands. (The metadata of all detected hand crops is available in features.csv.) To generate full-sized ≥512px hands useful for GAN training, I copied images already ≥512px in width, skipped images <128px in width, used waifu2x to upscale 256–511px-wide images 2× and 128–255px-wide images 4×, and then downscaled to 512px. This yielded n = 58,536 final hands. Images are lossily optimized to reduce file size. (Note that the output of the YOLOv3 model, filename/bounding-box/confidence for all files, is available in features.csv in the PALM repo for those who want to extract hands with different thresholds.)

Random sample of the upscaled subset of Danbooru2019 hands
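
For anyone re-extracting hands at a different threshold from features.csv, a sketch of the idea, assuming hypothetical column names (filename/confidence/left/top/right/bottom) which should be checked against the actual CSV header:

    # Hedged sketch: re-crop hands from the Danbooru2019 originals using the detections
    # recorded in features.csv, at a custom confidence threshold. Column names are assumptions.
    import csv
    import os
    from PIL import Image

    THRESHOLD = 0.6   # the released crops used 0.6; lower it for more (but noisier) hands

    os.makedirs("hands", exist_ok=True)
    with open("features.csv", newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if float(row["confidence"]) < THRESHOLD:
                continue
            with Image.open(row["filename"]) as img:
                box = (int(float(row["left"])), int(float(row["top"])),
                       int(float(row["right"])), int(float(row["bottom"])))
                img.crop(box).convert("RGB").save(os.path.join("hands", f"hand-{i:06d}.jpg"))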

Hands Download

The PALM YOLOv3 model (Mega mirror; 235MB):

rsync --verbose rsync://78.46.86.149:873/biggan/palm/2020-06-08-gwern-palm-yolov3-handdetector126.weights ./

The original hands cropped out of Danbooru2019 (n = 96,534; 800MB):

rsync --recursive --verbose rsync://78.46.86.149:873/biggan/palm/original-hands/ ./original-hands/

The upscaled hand subset (n = 58,536; 1.5GB):

rsync --recursive --verbose rsync://78.46.86.149:873/biggan/palm/clean-hands/ ./clean-hands/

The training dataset of annotated images, YOLOv3 configuration files, etc (k = 5,382/n = 14,394; 6GB):

rsync --verbose rsync://78.46.86.149:873/biggan/palm/2020-06-09-gwern-palm-yolov3-trainingdatasetlogs.tar ./

Hands Citing

Please cite this dataset as:

  • Gwern Branwen, Arfafax, Shawn Presser, Anonymous, & Danbooru community; “PALM: The PALM Anime Location Model And Dataset”, 2020-06-12. Web. Accessed [DATE] https://www.gwern.net/Crops#hands

    @misc{palm,
        author = {Gwern Branwen and Arfafax and Shawn Presser and Anonymous and Danbooru community},
        title = {PALM: The PALM Anime Location Model And Dataset},
        howpublished = {\url{https://www.gwern.net/Crops#hands}},
        url = {https://www.gwern.net/Crops#hands},
        type = {dataset},
        year = {2020},
        month = {June},
        timestamp = {2020-06-12},
        note = {Accessed: DATE} }

  1. Holo faces were far more common in the generated samples than Asuka faces. There were 12,611 Holo faces & 5,838 Asuka faces, so Holo was only ~2× more common in the dataset, and Asuka is a more popular character in general on Danbooru, so I am a little puzzled why Holo showed up so much more than Asuka. One possibility is that Holo is inherently easier to model under the truncation trick: I noticed that the brown short-haired face at 𝜓=0 resembles Holo much more than Asuka, so perhaps when setting 𝜓, Asukas are disproportionately filtered out? Or faces closer to the origin (because of brown hair?) are simply more likely to be generated to begin with.↩︎

  2. I’ve mirrored the manually-segmented anime figure dataset & the face/figure segmentation models:

    rsync --verbose rsync://78.46.86.149:873/biggan/2019-04-29-jerryli27-aniseg-figuresegmentation-dataset.tar ./
    rsync --verbose rsync://78.46.86.149:873/biggan/2019-04-29-jerryli27-aniseg-models-figurefacecrop.tar.xz   ./
    ↩︎
  3. NSFW did not yield good results.↩︎