Danbooru2017: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset

Danbooru2017 is a proposed large-scale anime image database with 2.2m+ images annotated with 48m+ tags; it can be useful for machine learning purposes such as image recognition and generation. (statistics, NN, anime, shell)
created: 15 Dec 2015; modified: 24 Oct 2017; status: in progress; confidence: likely; importance: 7

Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits applications of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each, publicly distributable to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image or tag count. I suggest that the image boorus be used. Image boorus are websites which host large numbers of images which can be tagged or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who annotate the images in extreme detail. The best-known booru, with a focus on quality, is Danbooru, which contains TODO tb of TODO images with TODO tags (TODO unique). Cleaned with active learning, packaged as a dataset, distributed as a torrent, and updated annually, a Danbooru2017 dataset would democratize rich large-scale classification/tagging, provide an archival backup for the Danbooru community, and serve as a testbed for non-photographic computer vision tasks.

Image boorus

Image boorus are image hosting websites developed by the anime community for collaborative tagging. Images are uploaded by users who are part of a highly active community, such as Danbooru1, and richly annotated with textual tags, typically divided into a few major groups:

  • copyright (the overall franchise, movie, TV series, manga, etc., that a work is based on; for long-running franchises like Neon Genesis Evangelion or for crossover images, there can be multiple such tags, and if there is no associated work, it is tagged original)
  • character (often multiple)
  • author
  • explicitness rating

    Danbooru does not ban sexually suggestive or pornographic content; instead, images are classified into 3 categories: safe, questionable, & explicit.

    safe is for unambiguously SFW content including tasteful swimsuits, while questionable would be more appropriate for highly-revealing swimsuit images or moderate nudity or sexually suggestive situations, and explicit denotes anything pornographic. (TODO percentages)
  • descriptive tags (eg the top 20 tags are TODO)

    These tags form a folksonomy to describe aspects of images; beyond the expected tags like long_hair or looking_at_the_viewer, there are many strange and unusual tags, including many anime- or illustration-specific tags like seiyuu_connection (images where the joke is based on knowing that two characters are voiced in different anime by the same voice actor) or bad_feet (artists frequently accidentally draw two left feet, or just bad_anatomy in general). Tags may also be hierarchical, with one tag implying another.

Images can have other associated metadata with them, including:

  • Danbooru ID, a unique positive integer
  • MD5 hash
  • the uploader username
  • the original URL or the name of the work
  • up/downvotes
  • sibling images (often an image will exist in many forms, such as sketch or black-white versions in addition to a final color image, edited or larger/smaller versions, SFW vs NSFW, or depicting multiple moments in a scene)
  • captions/dialogue (many images will have written Japanese captions/dialogue, which have been translated into English by users and annotated using HTML image maps)
  • author commentary (also often translated)
  • pools (ordered sequences of images from across Danbooru; often used for comics or image groups, or for disparate images with some unifying theme which is insufficiently objective to be a normal tag)

Image boorus typically support advanced Boolean searches on multiple attributes simultaneously, which, in conjunction with the rich tagging, allows users to discover extremely specific sets of images; for example, a search like 1girl solo -monochrome requires both of the first two tags while excluding the third.

Uses

Such a dataset would support many possible uses:

  • classification & tagging:

    • image categorization (of major characteristics such as franchise or character or SFW/NSFW detection)
    • image multi-label classification (tagging), exploiting the ~TODO tags per image

      • a large-scale testbed for real-world application of active learning / man-machine collaboration
      • testing the scaling limits of existing tagging approaches and motivating zero-shot & one-shot learning techniques
      • bootstrapping video summaries/descriptions
  • image generation:

    • unconditional image generation (eg with GANs; cf the note below on generating anime images)
    • text-to-image synthesis conditioned on tags (cf the Dong et al 2017 discussion below)

  • image analysis:

    • facial detection & localization for drawn images (on which normal techniques such as OpenCV’s Haar cascades fail)
    • image popularity/upvote prediction
    • image-to-text localization, transcription, and translation of text in images
  • image search:

    • collaborative filtering/recommendation, image similarity search (Flickr) of images (useful for users looking for images, for discovering tag mistakes, and for various diagnostics like checking GANs are not memorizing)
    • manga recommendation (Vie et al 2017)
    • artist similarity and de-anonymization
  • knowledge graph extraction from tags/tag-implications and images

    • clustering tags
    • temporal trends in tags (franchise popularity trends)

Advantages

Size and metadata

Image classification has been supercharged by work on ImageNet, but ImageNet itself is limited by its small set of classes, many of which are debatable and which encompass only a limited range of subjects. Compounding these limits, tagging/classification datasets are notoriously undiverse & suffer from class imbalance, or are simply small.

The external validity of classifiers trained on these datasets is somewhat questionable, as the learned discriminative models may collapse or simplify in undesirable ways and overfit on the datasets’ individual biases (Torralba & Efros 2011). For example, ImageNet classifiers sometimes appear to cheat by relying on textures and simplistic outlines - recognizing leopards only by the color texture of the fur, or believing barbells are extensions of arms. The dataset is simply not large enough, or richly annotated enough, to train classifiers or taggers better than that, or, with residual networks reaching human parity, to reveal differences between the best algorithms and the merely good. (Dataset biases have also been issues on question-answering datasets.) As well, the datasets are static, not accepting any additions, better metadata, or corrections. Like MNIST before it, ImageNet is verging on solved (the ILSVRC organizers ended it after the 2017 competition) and further progress may simply be overfitting to idiosyncrasies of the datapoints and errors; even if lowered error rates are not overfitting, the very low error rates compress the differences between algorithms, giving a misleading view of progress and understating the benefits of better architectures, as improvements become comparable in size to simple chance in initializations/training/validation-set choice. As Dong et al 2017 note:

It is an open issue of text-to-image mapping that the distribution of images conditioned on a sentence is highly multi-modal. In the past few years, we’ve witnessed a breakthrough in the application of recurrent neural networks (RNN) to generating textual descriptions conditioned on images [1, 2], with Xu et al. showing that the multi-modality problem can be decomposed sequentially [3]. However, the lack of datasets with diversity descriptions of images limits the performance of text-to-image synthesis on multi-categories dataset like MSCOCO [4]. Therefore, the problem of text-to-image synthesis is still far from being solved

In contrast, the Danbooru dataset is TODOx larger than ImageNet as a whole and TODOx larger than the current largest multi-description dataset, MS COCO, with far richer metadata than the subject-verb-object sentence summaries that dominate MS COCO. While the Danbooru community focuses heavily on female anime characters, they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included. It is unlikely that the performance ceiling will be reached anytime soon, and advanced techniques such as attention will likely be required to get anywhere near the ceiling. And Danbooru is constantly expanding and can be easily updated by anyone anywhere, allowing for regular releases of improved annotations.

Danbooru and the image boorus have been only minimally used in previous machine learning work; principally, in Illustration2Vec: A Semantic Vector Representation of Images, Saito & Matsui 2015 (project), which used 1.287m images to train a finetuned VGG-based CNN to detect 1,539 tags (drawn from the 512 most frequent tags of general/copyright/character each) with an overall precision of 32.2%. But the dataset was not distributed and there has been little followup.

Non-photographic

Anime images and illustrations, by contrast with photographs, differ in many ways: illustrations are frequently black-and-white rather than color, and line art rather than photographic; even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working at a higher level of abstraction - a leopard would not be as trivially recognized by pattern-matching on yellow and black dots - with the irrelevant details that a discriminator might cheaply classify on typically suppressed in favor of global gestalt; illustrations are also often heavily stylized (eg frequent use of Dutch angles).

Humans can still easily perceive a black-white line drawing of a leopard, but can a standard ImageNet classifier?

Likewise, the difficulty face detectors encounter on anime images suggests that other detectors like nudity or pornographic detectors may fail; but surely moderation tasks require detection of penises whether they are drawn or photographed? Because illustrations are produced by an entirely different process and focus only on salient details while abstracting the rest, they offer a way to test external validity and the extent to which taggers are tapping into higher-level semantic perception.

As well, many ML researchers are anime fans and might enjoy working on such a dataset - training NNs to generate anime images can be amusing.

Community value

A full dataset is of immediate value to the Danbooru community as an archival snapshot of Danbooru which can be downloaded in lieu of hammering the main site and downloading terabytes of data; backups are occasionally requested on the Danbooru forum but the need is currently not met.

There is much potential for a mutually rewarding symbiosis between the Danbooru community and ML researchers: in a virtuous circle, the community provides curation and expansion of a rich dataset, while ML researchers can contribute back tools from their research on it which help improve the dataset. The Danbooru community is relatively large and would likely welcome the development of tools like taggers to support semi-automatic (or eventually, fully automatic) image tagging, as use of a tagger could offer orders of magnitude improvement in speed and accuracy compared to their existing manual methods, as well as being newbie-friendly.2 They are also a pre-existing audience which would be interested in new research results.

Citing

Please cite this dataset as:

  • Anonymous, The Danbooru Community, Gwern Branwen, & Aaron Gokaslan; Danbooru2017: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset, 1 April 2017. Web. Accessed [DATE] https://www.gwern.net/Danbooru2017

    @misc{danbooru2017,
        author = {Anonymous and The Danbooru Community and Gwern Branwen and Aaron Gokaslan},
        title = {Danbooru2017: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset},
        howpublished = {\url{https://www.gwern.net/Danbooru2017}},
        url = {https://www.gwern.net/Danbooru2017},
        type = {dataset},
        year = {2017},
        month = {April},
        timestamp = {2017-04-01},
        note = {Accessed: DATE} }

Format

The goal of the dataset is to be as easy as possible to use immediately, avoiding obscure file formats, while allowing simultaneous research & seeding of the torrent, with easy updates.

Since many users are uninterested in downloading, seeing, or analyzing NSFW images, the dataset will be split into a large safe dataset and a smaller questionable+explicit dataset; the categorization can be double-checked for errors using active learning.
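
As a sketch of how the split might be automated from the metadata (assuming a JSON-lines metadata export with Danbooru's standard id and rating fields, where ratings are s/q/e; the filename metadata.json is hypothetical):

import json

# Partition post IDs by rating: "s" (safe) vs "q"/"e" (questionable/explicit).
safe, nsfw = [], []
with open("metadata.json") as f:
    for line in f:
        post = json.loads(line)
        (safe if post["rating"] == "s" else nsfw).append(post["id"])

print(f"safe: {len(safe)}, questionable+explicit: {len(nsfw)}")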

Images will be provided in both the full original form (be that JPG, PNG, GIF or otherwise) for reference/archival purposes and in a smaller form more suitable for ML use:

  • non-images such as animated GIFs & videos will be omitted from the ML dataset for uniformity

    The short video files hosted on Danbooru require different handling than images; the metadata for video files likely does not apply to each frame individually, so it would be damaging to take a naive approach of turning each one into a static image & copying the metadata for the whole video.
  • all images converted to valid JPG
  • all images losslessly optimized with jpegoptim (while this invalidates the MD5, the MD5s are not always valid for Danbooru images anyway and ECC is taken care of by BitTorrent; the advantage is that using jpegoptim may cut disk space & IO by 5-10%, which is substantial on a multi-terabyte dataset)
  • downscaled to 512x512px with black borders

    Why 512px exactly? Image sizes are tricky: too small, and key details become impossible to see; too large, and images become difficult to store, transmit, and process. 64px thumbnails obliterate all details; 128px is too small for much beyond global recognition; as of 2016, CNNs tend to target ~250x250px inputs; but to future-proof the torrent, the resolution needs to be higher than that - higher resolutions will be feasible with new GPU generations and required for the best tagging performance, as objects/characters can be very small - leading to 512px as the next natural image size. (The PixelCNN architecture has already been used in March 2017 by DeepMind to generate 512px images: Reed et al 2017.) Viewing 512px downscales, they offer enough resolution for finer details to be plausibly visible to a CNN while still being ~1/5th the bandwidth/disk usage.

    In addition, one should err on the side of the downscales being too large rather than too small: should 512px be too large, additional downscaling to 128px or 256px, or various kinds of data augmentation, is relatively easier than downscaling from the full original images and can be done on the fly by users without impeding training. If instead the downscales are too small, users are stuck between a rock and a hard place: accepting suboptimal results or having to create & permanently store an entire downscaled dataset themselves.
  • stored as individual files:

    • named with the Danbooru ID
    • bucketed into ~100 subdirectories of <50k images each to avoid slow directory lookups (eg 1/$ID.jpg; see the sketch below)
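
A minimal sketch of such a bucketing scheme in Python (the modulus, filenames, and paths here are illustrative assumptions rather than a fixed spec):

import os
import shutil

def bucketed_path(danbooru_id: int, root: str = "512px") -> str:
    """Bucket images into ~100 subdirectories by ID modulo 100,
    keeping each directory well under 50k files for fast lookups."""
    bucket = danbooru_id % 100              # eg ID 123456 -> bucket 56
    return os.path.join(root, str(bucket), f"{danbooru_id}.jpg")

# eg move a freshly-downscaled image into its bucket:
src = "123456.jpg"                          # hypothetical filename
dst = bucketed_path(123456)
os.makedirs(os.path.dirname(dst), exist_ok=True)
shutil.move(src, dst)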

Metadata will be provided in the original full SQL export and a slimmed down database of just filename/tags.

Size

  • TODO images
  • TODO unique tags, used TODO times (mean per image: TODO; median: TODO)
  • TODO tb with TODO terapixels

The downscaled version will likely be ~430GB, based on the size of a preliminary download after downscaling:

# count the preliminary sample of full-size images:
find /media/gwern/My\ Book/danbooru/danbooru/ -type f -name "*.jpg" | wc --lines
# 76594
# total disk usage of the full-size sample:
du -ch /media/gwern/My\ Book/danbooru/danbooru/
# 37G /media/gwern/My Book/danbooru/danbooru/
# downscale to fit within 512x512px, letterboxed with black borders
# (plain `-resize`, not `-resize ...^`, so images are padded rather than cropped);
# run from inside the danbooru directory:
find . -type f -exec convert "{}" -resize 512x512 -background black -gravity center -extent 512x512 ../512px/"{}" \;
du -sh ../512px/
# 9.5G    ../512px/

Comparing these 76k Danbooru images, the 512px downscales are ~25% the size of the originals; so for 1.7tb of originals, the 512px archive would be ~0.43tb or ~430GB.

Legality

Danbooru is operated & hosted in the USA. As a community site, it is impossible to assure that every single image is legal to possess; the most questionable content on Danbooru likely falls under the lolicon/shotacon heading, which are legal in the USA & Japan but may not be legal in (to quote Wikipedia) Australia, Canada, the Philippines, South Africa, South Korea and the United Kingdom. Downloaders should consider their current jurisdiction and if the legal situation is unclear or unfavorable, avoid downloading the questionable+explicit subset of images.

Preparation

  • download all metadata from the official daily-updated Danbooru BigQuery metadata database (TODO: is there an easier way than http://stackoverflow.com/a/18497215/329866 ?)
  • download associated images (curl script)
  • train a NSFW classifier on safe/questionable/explicit (without tags which make it too easy): check all questionable/explicit images are appropriately labeled and can be segregated
  • train a top 10k tags tagger: active learn as much as feasible

Scraping

I have registered the accounts gwern and gwern-bot for use in downloading & participating on Danbooru; it is considered good research ethics to try to offset any use of resources when crawling an online community (eg DNM scrapers try to run Tor nodes to pay back the bandwidth), so I have donated $20 to Danbooru via an account upgrade.

Danbooru IDs are sequential positive integers, but the images are stored under their MD5 hashes; so downloading the full images can be done by querying the JSON API for an ID’s metadata, extracting the URL for the full upload, and downloading it to a file named by the ID plus extension.
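
A minimal sketch of that loop in Python, assuming Danbooru's public JSON post endpoint and its file_url field (error handling, rate-limiting, and login are omitted; verify the endpoint and field names against the current API):

import os
import requests

def download_post(post_id: int, out_dir: str = "danbooru") -> None:
    """Fetch one post's metadata, then save the full-size upload as ID.ext."""
    meta = requests.get(f"https://danbooru.donmai.us/posts/{post_id}.json").json()
    file_url = meta["file_url"]                # URL of the original upload
    ext = os.path.splitext(file_url)[1]        # eg ".jpg", ".png", ".gif"
    image = requests.get(file_url).content
    with open(os.path.join(out_dir, f"{post_id}{ext}"), "wb") as f:
        f.write(image)

os.makedirs("danbooru", exist_ok=True)
download_post(1)                               # eg post #1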

The metadata can be downloaded from BigQuery via BigQuery-API-based tools.
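
For instance, a sketch using the google-cloud-bigquery client library (the table name danbooru.posts is a placeholder for the actual Danbooru BigQuery table, and GCP credentials must already be configured):

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
# NOTE: `danbooru.posts` is a hypothetical table name; substitute the
# actual public Danbooru metadata table.
query = """
SELECT id, md5, rating, tags
FROM `danbooru.posts`
LIMIT 1000
"""
for row in client.query(query).result():
    print(row.id, row.rating)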

Hosting

1.7tb of full-size images, plus ~0.43tb for the downscaled version (per the estimate above).

Hosting options for 2-3tb typically involve BitTorrent as the best method of distributing very large datasets to many people quickly & allowing for easy updates:

  • local torrent: infeasible initially as my local connection maxes out at ~1MB/s and the initial seeding would take unacceptable months
  • Amazon S3: my usual hosting solution, but also infeasible:

    • Amazon S3 torrents do not allow files larger than ~4GB, so the dataset would have to be split into thousands of torrents, defeating most of the point
    • Amazon S3 disk-space & outgoing bandwidth are notoriously expensive & S3 is avoided by backup services like Backblaze: the AWS price calculator estimates that 2tb+occasional-full-downloads would cost >$1000/month (Amazon Glacier is more reasonably priced but totally unsuited for regular downloads)
  • VPS hosts: typically grossly inadequate disk space
  • seedbox on Hetzner or other dedicated hosts: dedicated servers range up to 4tb easily along with gigabit uploads for $50-100/month; I found many servers in Hetzner’s auctions with adequate disk space (>=3tb) can be rented for ~$30/month or ~$360/year, which is sufficiently low that I can pay for it indefinitely (and should interest take off and the torrent swarm become permanently robust, the seedbox can be abandoned)

The final option appears to be best.

Given the intent, Academic Torrents is a usable tracker; a backup option would be Nyaa.

(The .torrent file itself can be created with mktorrent.)

Updating

Should the dataset prove of value to the ML & Danbooru communities, it can be updated at regular annual intervals (giving Danbooru2017, Danbooru2018, Danbooru2019, etc).

Updates would exploit the ECC capability of BitTorrent by updating the images/metadata and creating a new .torrent file; users download the new .torrent, overwrite the old .torrent, and after rehashing files to discover which ones have changed/are missing, the new ones are downloaded. (This method has been successfully used by other very-large periodically-updated torrents, such as the Touhou Lossless Music Torrent, at 1.4tb after 18 versions.)

Turnover in BitTorrent swarms means that earlier versions of the torrent will quickly disappear, so for easier reproducibility, the metadata files can be archived into subdirectories (images generally will not change, so reproducibility is less of a concern - to reproduce the subset for an earlier release, one simply filters on upload date or takes the file list from the old metadata).

Notification

To receive notification of future updates to the dataset, please subscribe to the notification mailing list.

Future work

Model zoo

If possible, additional models and derived metadata may be supplied as part of a model zoo. Particularly desirable would be:

  • a NSFW classifier
  • a top-10,000-tag tagger
  • a text embedding RNN, and pre-computed text embeddings for all images’ tags

Metadata Quality Improvement via Active Learning

How high is the quality of the Danbooru metadata? As with ImageNet, it is critical that the tags be extremely accurate, or else this will lower-bound the error rates and impede the learning of taggers, especially on rarer tags, where even a low error rate may cause false negatives to outweigh the true positives.

I would say that the Danbooru tag data is of quite high quality but imbalanced: almost all tags on images are correct, but the absence of a tag is often wrong - that is, many tags are missing on Danbooru (there are so many possible tags that no user could possibly know them all). So the absence of a tag isn’t nearly as informative as the presence of a tag - eyeballing images and some rarer tags, I would guess that tags are present <10% of the time they should be.

This suggests leveraging an active learning (Settles 2010) form of training: train a tagger, have a human review the putative errors, update the metadata whenever the “error” was actually correct, and retrain.

More specifically: train the tagger; run the tagger on the entire dataset, recording the outputs and errors; a human examines the errors interactively by comparing the supposed error with the image; and for false negatives, the tag can be added to the Danbooru source using the Danbooru API and added to the local image metadata database, and for false positives, the negative tag can be added to the local database; train a new model (possibly initializing from the last checkpoint). Since there will probably be thousands of errors, one would go through them by magnitude of error: for a false positive, start with tagging probabilities of 1.0 and go down, and for false negatives, 0.0 and go up. This would be equivalent to the active learning strategy uncertainty sampling, which is simple, easy to implement, and effective (albeit not necessarily optimal for active learning as the worst errors will tend to be highly correlated/redundant and the set of corrections overkill). Once all errors have been hand-checked, the training weight on absent tags can be increased, as any missing tags should have shown up as false positives.
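
A minimal sketch of the error-ordering step, assuming per-tag probabilities from the tagger and the current metadata labels as arrays (the arrays here are random stand-ins for illustration):

import numpy as np

# probs: (n_images, n_tags) predicted tag probabilities from the tagger
# labels: (n_images, n_tags) current metadata: 1 = tagged, 0 = absent
rng = np.random.default_rng(0)
probs = rng.random((1000, 50))
labels = (rng.random((1000, 50)) > 0.9).astype(int)

preds = (probs > 0.5).astype(int)

# model false positives (predicted present, metadata absent):
# review starting from p=1.0 and going down
fp = np.argwhere((preds == 1) & (labels == 0))
fp = fp[np.argsort(-probs[fp[:, 0], fp[:, 1]])]

# model false negatives (metadata present, predicted absent):
# review starting from p=0.0 and going up
fn = np.argwhere((preds == 0) & (labels == 1))
fn = fn[np.argsort(probs[fn[:, 0], fn[:, 1]])]

for img, tag in fp[:10]:  # show a human the 10 most confident disagreements
    print(f"image {img}: model asserts tag {tag} (p={probs[img, tag]:.2f})")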

Over multiple iterations of active learning + retraining, the procedure should be able to ferret out errors in the dataset and boost its quality while also increasing its performance.

Based on my experiences with semi-automatic editing on Wikipedia (using pywikipediabot for solving disambiguation wikilinks), I would estimate that given an appropriate terminal interface, a human should be able to check at least 1 error per second and so checking ~30,000 errors per day is possible (albeit extremely tedious). Fixing the top million errors should offer a noticeable increase in performance.

There are many open questions about how best to optimize tagging performance: is it better to refine tags on the existing set of images or would adding more only-partially-tagged images be more useful?

TODO: possible metadata: tag traffic/queries/search? other indicators of tag quality? tags which are often searched for are likely more reliably tagged across the corpus, and can be given a heavier weight in the loss function, or alternately, used to prioritize active learning for obscurer tags richer in errors

TODO tag architecture idea for the loss function: have positive tags (1), negative tags (-1), and absent tags (0), with a training weight of perhaps 0.1 on absent tags. This avoids over-penalizing the CNN for incompletely-tagged images, while ensuring it doesn’t learn a degenerate solution like predicting that every tag is present.
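
A minimal sketch of such a weighted multi-label loss as a sigmoid cross-entropy in NumPy (the {1, -1, 0} encoding and the 0.1 weight follow the idea above; the rest is an illustrative assumption, not a finished architecture):

import numpy as np

def weighted_tag_loss(logits: np.ndarray, tags: np.ndarray,
                      absent_weight: float = 0.1) -> float:
    """Sigmoid cross-entropy over tags encoded as 1 (present), -1 (verified
    absent), and 0 (unlabeled): unlabeled tags are treated as negatives but
    down-weighted, so missing tags do not over-penalize the network while
    still ruling out the degenerate predict-everything solution."""
    targets = (tags == 1).astype(float)         # 1 -> 1.0; -1/0 -> 0.0
    weights = np.where(tags == 0, absent_weight, 1.0)
    probs = 1.0 / (1.0 + np.exp(-logits))       # sigmoid
    eps = 1e-7
    ce = -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
    return float(np.sum(weights * ce) / np.sum(weights))

# eg 2 images x 4 tags; tag 2 of image 1 is verified-absent (-1):
logits = np.array([[2.0, -1.0, 0.5, -3.0], [0.0, 1.5, -2.0, 0.2]])
tags   = np.array([[1,   -1,   0,    0],   [0,   1,   -1,    0]])
print(weighted_tag_loss(logits, tags))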


  1. While Danbooru is not the largest anime image booru in existence - TBIB, for example, claims >4.7m images, or almost twice as many, by mirroring from multiple boorus - Danbooru is generally considered to focus on higher-quality images & to have better tagging; I suspect >2.4m images is into diminishing returns for sheer bulk, and the focus then ought to be on improving the metadata.

  2. A tagger could be integrated into the site to automatically propose tags for newly-uploaded images to be approved by the uploader; new users, unconfident in or unfamiliar with the full breadth of tags, would then have the much easier task of simply checking that all the proposed tags are correct.