Danbooru2018 is a large-scale anime image database with 3.33m+ images annotated with 92.7m+ tags; it can be useful for machine learning purposes such as image recognition and generation.
created: 15 Dec 2015; modified: 23 Feb 2019; status: finished; confidence: likely; importance: 6
Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image or tag count or restricted distribution. I suggest that the “image boorus” be used. The image boorus are longstanding web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations.
The best known booru, with a focus on quality, is Danbooru. We create & provide a torrent which contains ~2.5tb of 3.33m images with 92.7m tag instances (of 365k defined tags, ~27.8/image) covering Danbooru from 24 May 2005 through 31 December 2018 (final ID: #3,368,713), providing the image files & a JSON export of the metadata. We also provide a smaller torrent of SFW images downscaled to 512x512px JPGs (241GB; 2,232,462 images) for convenience.
Our hope is that the Danbooru2018 dataset can be used for rich large-scale classification/tagging & learned embeddings, to test the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, to provide an archival backup for the Danbooru community, to feed back metadata improvements & corrections, and to serve as a testbed for advanced techniques such as conditional image generation or style transfer.
Image booru description
Image boorus are image hosting websites developed by the anime community for collaborative tagging. Images are uploaded and tagged by users; they can be large, such as Danbooru[1], and richly annotated with textual ‘tags’, typically divided into a few major groups:
- copyright (the overall franchise, movie, TV series, manga etc a work is based on; for long-running franchises like Neon Genesis Evangelion or “crossover” images, there can be multiple such tags, or if there is no such associated work, it would be tagged “original”)
- character (often multiple)
- author
- descriptive tags (eg the top 19 tags are 1girl/solo/long_hair/highres/breasts/blush/short_hair/smile/multiple_girls/open_mouth/looking_at_viewer/blue_eyes/blonde_hair/touhou/brown_hair/skirt/hat/thighhighs/black_hair, which reflect the expected focus of anime fandom on things like the Touhou franchise)

Danbooru does not ban sexually suggestive or pornographic content; instead, images are classified into 3 categories: safe, questionable, & explicit. (Represented in the SQL as “s”/“q”/“e” respectively.) safe is for relatively SFW content including swimsuits, while questionable would be more appropriate for highly-revealing swimsuit images, nudity, or highly sexually suggestive situations, and explicit denotes anything hard-core pornographic. (8.7% of images are classified as “e”, 14.9% as “q”, and 76.3% as “s”; as the default rating is “q”, this may underestimate the number of “s” images, but “s” should probably be considered the SFW subset.)

These tags form a “folksonomy” to describe aspects of images; beyond the expected tags like long_hair or looking_at_the_viewer, there are many strange and unusual tags, including many anime- or illustration-specific tags like seiyuu_connection (images where the joke is based on knowing that the two characters are voiced in different anime by the same voice actor) or bad_feet (artists frequently accidentally draw two left feet, or just bad_anatomy in general). Tags may also be hierarchical, with one tag “implying” another.
Images can have other associated metadata with them, including:
- Danbooru ID, a unique positive integer
- MD5 hash
- the uploader username
- the original URL or the name of the work
- up/downvotes
- sibling images (often an image will exist in many forms, such as sketch or black-white versions in addition to a final color image, edited or larger/smaller versions, SFW vs NSFW, or depicting multiple moments in a scene)
- captions/dialogue (many images will have written Japanese captions/dialogue, which have been translated into English by users and annotated using HTML image maps)
- author commentary (also often translated)
- pools (ordered sequences of images from across Danbooru; often used for comics or image groups, or for disparate images with some unifying theme which is insufficiently objective to be a normal tag)
Image boorus typically support advanced Boolean searches on multiple attributes simultaneously, which in conjunction with the rich tagging, can allow users to discover extremely specific sets of images.
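For illustration, here are a couple of such searches run through the JSON search endpoint; this is only a sketch, and the endpoint, parameters, and 2-tag anonymous limit are assumptions based on the public Danbooru API rather than anything specific to this dataset:

```bash
# Hypothetical examples of Danbooru's Boolean tag search via the JSON API;
# anonymous users are assumed to be limited to 2 tags per query.
curl --silent 'https://danbooru.donmai.us/posts.json?tags=1girl+solo&limit=5' | jq '.[].id'
curl --silent 'https://danbooru.donmai.us/posts.json?tags=touhou+rating:safe&limit=5' | jq '.[].id'
```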
Download
Danbooru2018 is currently available for download in 3 ways:
- BitTorrent
- public rsync server
- Kaggle-hosted dataset
Torrent
The images have been downloaded using a curl script & the Danbooru API, and losslessly optimized using optipng/jpegoptim[2]; the metadata has been exported from the Danbooru BigQuery mirror.[3]
Torrents are the preferred download method as they stress the seed server less, can potentially be faster due to many peers, are resilient to server downtime, and have built-in ECC.
Due to the number of files, the torrent has been broken up into 10 separate torrents, each covering a range of IDs modulo 1000. The torrent files are available as an XZ-compressed tarball (full archive; 19MB), along with the SFW 512px downscaled subset torrent (11MB); download & unpack them into one’s torrent directory.
The torrents appear to work with rTorrent on Linux & Transmission on Linux/Windows; they reportedly do not work with qBittorrent 3.3-4.0.4 (but may on >=4.0.5[4]), Deluge, or most Windows torrent clients.
Rsync
Due to torrent compatibility & network issues, I provide an alternate download route via a public anonymous rsync server. To list all available files:
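A sketch of such a listing, reusing the server & module from the subset commands below:

```bash
# List the top level of the danbooru2018 rsync module:
rsync --list-only rsync://78.46.86.149:873/danbooru2018/
```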
For a single file (eg the metadata tarball), one can download it like this:
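For instance, a sketch for the metadata tarball; its exact path within the module is an assumption, so check the listing above:

```bash
# Fetch only the metadata tarball into the current directory:
rsync --times --verbose rsync://78.46.86.149:873/danbooru2018/metadata.json.tar.xz ./
```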
For a specific subset, like the SFW 512px subset or the full-resolution originals:

rsync --recursive --times --verbose rsync://78.46.86.149:873/danbooru2018/512px/ ./danbooru2018/512px/
rsync --recursive --times --verbose rsync://78.46.86.149:873/danbooru2018/original/ ./danbooru2018/original/

And for the full dataset (metadata+original+512px):
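A sketch, assuming the full dataset is simply the root of the rsync module:

```bash
# Mirror everything (metadata + original + 512px); requires ~3tb of free space:
rsync --recursive --times --verbose rsync://78.46.86.149:873/danbooru2018/ ./danbooru2018/
```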
Kaggle
An n=300k subset of the 512px SFW subset of Danbooru2017, combined with Nagadomi’s moeimouto face dataset, is available as a Kaggle-hosted dataset: “Tagged Anime Illustrations” (36GB).
Kaggle also hosts the metadata of Safebooru up to 20 November 2016: “Safebooru—Anime Image Metadata”.
Updating
If there is interest, the dataset will be updated at regular annual intervals (“Danbooru2019”, “Danbooru2020” etc).
Updates exploit the ECC capability of BitTorrent by updating the images/metadata and creating a new .torrent file; users download the new .torrent, overwrite the old .torrent, and after rehashing files to discover which ones have changed/are missing, the new ones are downloaded. (This method has been successfully used by other periodically-updated large torrents, such as the Touhou Lossless Music Torrent, at ~1.75tb after 19 versions.)
Turnover in BitTorrent swarms means that earlier versions of the torrent will quickly disappear, so for easier reproducibility, the metadata files can be archived into subdirectories (images generally will not change, so reproducibility is less of a concern—to reproduce the subset for an earlier release, one simply filters on upload date or takes the file list from the old metadata).
Notification of updates
To receive notification of future updates to the dataset, please subscribe to the notification mailing list.
Possible Uses
Such a dataset would support many possible uses:
classification & tagging:
- image categorization (of major characteristics such as franchise or character or SFW/NSFW detection, eg Derpibooru)
- image multi-label classification (tagging), exploiting the ~20 tags per image (currently there is a prototype, DeepDanbooru)
- a large-scale testbed for real-world application of active learning / man-machine collaboration
- testing the scaling limits of existing tagging approaches and motivating zero-shot & one-shot learning techniques
- bootstrapping video summaries/descriptions
- robustness of image classifiers to different illustration styles (eg Icons-50)
image generation:
- text-to-image synthesis (StackGAN++ would benefit greatly from the tags, which are more informative than the sentence descriptions of COCO)
- unsupervised image generation (DCGANs, VAEs, PixelCNNs, WGANs, eg MakeGirlsMoe or Xiang & Li 2018)
- image transformation: upscaling (waifu2x), colorizing (deepcolor/Frans 2017) or palette color scheme generation (Colormind), inpainting, sketch-to-drawing (Simo-Serra et al 2017), photo-to-drawing (using the reference_photo/photo_reference tags), artistic style transfer[5]/image analogies (Liao et al 2017), optimization (“Image Synthesis from Yahoo’s open_nsfw”, pix2pix, DiscoGAN, CycleGAN; eg using CycleGAN for silverizing anime character hair or for photo⟺illustration face mapping[6], as in Gokaslan et al 2018/Li 2018), CGI model/pose generation (PSGAN)
image analysis:
- facial detection & localization for drawn images (on which normal techniques such as OpenCV’s Haar filters fail, requiring special-purpose approaches like AnimeFace 2009/lbpcascade_animeface)
- image popularity/upvote prediction
- image-to-text localization, transcription, and translation of text in images
- illustration-specialized compression (for better performance than PNG/JPG)
image search:
- collaborative filtering/recommendation, image similarity search (Flickr) of images (useful for users looking for images, for discovering tag mistakes, and for various diagnostics like checking GANs are not memorizing)
- manga recommendation (Vie et al 2017)
- artist similarity and de-anonymization
knowledge graph extraction from tags/tag-implications and images
- clustering tags
- temporal trends in tags (franchise popularity trends)
Advantages
Size and metadata
Image classification has been supercharged by work on ImageNet, but ImageNet itself is limited by its small set of classes, many of which are debatable and which encompass only a limited range of subjects. Compounding these limits, tagging/classification datasets are notoriously undiverse & have imbalance problems or are small:
- ImageNet: dog breeds (memorably brought out by DeepDream)
- WebVision (Li et al 2017a; Li et al 2017b; Guo et al 2018): 2.4m images noisily classified via search engine/Flickr queries into the ImageNet 1k categories
- Youtube-BB: toilets/giraffes
- MS COCO: bathrooms and African savannah animals; 328k images, 80 categories, short 1-sentence descriptions
- birds/flowers: a few score of each kind (eg no eagles in the birds dataset)
- Visual Relationship Detection (VRD) dataset: 5k images
- Pascal VOC: 11k images
- Visual Genome: 108k images
- nico-opendata: 400k, but SFW & restricted to approved researchers
- Open Images V4: released 2018, 30.1m tags for 9.2m images and 15.4m bounding-boxes, with high label quality; a major advantage of this dataset is that it uses CC-BY-licensed Flickr photographs/images, and so it should be freely distributable.
- BAM! (Wilber et al 2017): 65m raw images, 393k? tags for 2.5m? tagged images (semi-supervised), restricted access?
The external validity of classifiers trained on these datasets is somewhat questionable, as the learned discriminative models may collapse or simplify in undesirable ways, and overfit on the datasets’ individual biases (Torralba & Efros 2011). For example, ImageNet classifiers sometimes appear to ‘cheat’ by relying on localized textures in a “bag-of-words”-style approach and simplistic outlines/shapes—recognizing leopards only by the color texture of the fur, or believing barbells are extensions of arms. CNNs by default appear to rely almost entirely on texture and ignore shapes/outlines, unlike human vision, rendering them fragile to transforms; training which emphasizes shape/outline data augmentation can improve accuracy & robustness (Geirhos et al 2018), making anime images a challenging testbed (and this texture-bias possibly explaining the poor performance of anime-targeted NNs in the past). These datasets are simply not large enough, or richly annotated enough, to train classifiers or taggers better than that, or, with residual networks reaching human parity, to reveal differences between the best algorithms and the merely good. (Dataset biases have also been issues on question-answering datasets.) As well, the datasets are static, not accepting any additions, better metadata, or corrections. Like MNIST before it, ImageNet is verging on ‘solved’ (the ILSVRC organizers ended it after the 2017 competition) and further progress may simply be overfitting to idiosyncrasies of the datapoints and errors; even if lowered error rates are not overfitting, the low error rates compress the differences between algorithms, giving a misleading view of progress and understating the benefits of better architectures, as improvements become comparable in size to simple chance in initializations/training/validation-set choice. As Dong et al 2017 note:
It is an open issue of text-to-image mapping that the distribution of images conditioned on a sentence is highly multi-modal. In the past few years, we’ve witnessed a breakthrough in the application of recurrent neural networks (RNN) to generating textual descriptions conditioned on images [1, 2], with Xu et al. showing that the multi-modality problem can be decomposed sequentially [3]. However, the lack of datasets with diversity descriptions of images limits the performance of text-to-image synthesis on multi-categories dataset like MSCOCO [4]. Therefore, the problem of text-to-image synthesis is still far from being solved
In contrast, the Danbooru dataset is larger than ImageNet as a whole and larger than the most widely-used multi-description dataset, MS COCO, with far richer metadata than the ‘subject verb object’ sentence summary that is dominant in MS COCO or the birds dataset (sentences which could be adequately summarized in perhaps 5 tags, if even that[7]). While the Danbooru community does focus heavily on female anime characters, they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included. It is unlikely that the performance ceiling will be reached anytime soon, and advanced techniques such as attention will likely be required to get anywhere near the ceiling. And Danbooru is constantly expanding and can be easily updated by anyone anywhere, allowing for regular releases of improved annotations.
Danbooru and the image boorus have been only minimally used in previous machine learning work; principally, in “Illustration2Vec: A Semantic Vector Representation of Images”, Saito & Matsui 2015 (project), which used 1.287m images to train a finetuned VGG-based CNN to detect 1,539 tags (drawn from the 512 most frequent tags of general/copyright/character each) with an overall precision of 32.2%, or in “Symbolic Understanding of Anime Using Deep Learning”, Li 2018. But the datasets for past research are typically not distributed and there has been little followup.
Non-photographic
Anime images and illustrations, on the other hand, differ from photographs in many ways—for example, illustrations are frequently black-and-white rather than color, and line art rather than photorealistic renderings; even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working on a higher level of abstraction—a leopard would not be as trivially recognized by simple pattern-matching on yellow and black dots—with irrelevant details (which a discriminator might otherwise cheaply exploit) typically suppressed in favor of global gestalt, and compositions often heavily stylized (eg frequent use of “Dutch angles”). With the exception of MNIST & Omniglot, almost all commonly-used deep learning-related image datasets are photographic.
Humans can still easily perceive a black-white line drawing of a leopard as being a leopard—but can a standard ImageNet classifier? Likewise, the difficulty face detectors encounter on anime images suggests that other detectors like nudity or pornographic detectors may fail; but surely moderation tasks require detection of penises, whether they are drawn or photographed? The attempts to apply CNNs to GANs, image generation, image inpainting, or style transfer have sometimes thrown up artifacts which don’t seem to be issues when using the same architecture on photographic material; for example, in GAN image generation & style transfer, I almost always note, in my own or others’ attempts, what I call the “watercolor effect”, where instead of producing the usual abstracted regions of whitespace, monotone coloring, or simple color gradients, the CNN instead consistently produces noisy transition textures which look like watercolor paintings—which can be beautiful, and an interesting style in its own right (eg the style2paints samples), but means the CNNs are failing to some degree. This watercolor effect appears to not be a problem in photographic applications; but on the other hand, photos are filled with noisy transition textures, and watching a GAN train, you can see that the learning process generates textures first and only gradually learns to build edges and regions and transitions from the blurred textures. Is this anime-specific problem due simply to insufficient data/training, or is there something more fundamentally at issue with current convolutions?
Because illustrations are produced by an entirely different process and focus only on salient details while abstracting the rest, they offer a way to test external validity and the extent to which taggers are tapping into higher-level semantic perception.
As well, many ML researchers are anime fans and might enjoy working on such a dataset—training NNs to generate anime images can be amusing. It is, at least, more interesting than photos of street signs or storefronts. (“There are few sources of energy so powerful as a procrastinating grad student.”)
Community value
A full dataset is of immediate value to the Danbooru community as an archival snapshot of Danbooru which can be downloaded in lieu of hammering the main site and downloading terabytes of data; backups are occasionally requested on the Danbooru forum but the need is currently not met.
There is potential for a symbiosis between the Danbooru community & ML researchers: in a virtuous circle, the community provides curation and expansion of a rich dataset, while ML researchers can contribute back tools from their research on it which help improve the dataset. The Danbooru community is relatively large and would likely welcome the development of tools like taggers to support semi-automatic (or eventually, fully automatic) image tagging, as use of a tagger could offer orders of magnitude improvement in speed and accuracy compared to their existing manual methods, as well as being newbie-friendly.[8] They are also a pre-existing audience which would be interested in new research results.
Format
The goal of the dataset is to be as easy as possible to use immediately, avoiding obscure file formats, while allowing simultaneous research & seeding of the torrent, with easy updates.
Images are provided in the full original form (be that JPG, PNG, GIF or otherwise) for reference/archival purposes, along with a script for converting to JPGs & downscaling (creating a smaller corpus more suitable for ML use).
Images are bucketed into 1000 subdirectories 0-999, which is the Danbooru ID modulo 1000 (ie all images in 0999/ have an ID ending in ‘999’). A single directory would cause pathological filesystem performance, and modulo ID spreads images evenly without requiring additional directories to be made. The ID is not zero-padded and files end in the relevant extension, hence the file layout looks like this:
original/0000/
original/0000/1000.png
original/0000/2000.jpg
original/0000/3000.jpg
original/0000/4000.png
original/0000/5000.jpg
original/0000/6000.jpg
original/0000/7000.jpg
original/0000/8000.jpg
original/0000/9000.jpg
...

Currently represented file extensions are: “avi”/“bmp”/“gif”/“html”/“jpeg”/“jpg”/“mp3”/“mp4”/“mpg”/“pdf”/“png”/“rar”/“swf”/“webm”/“wmv”/“zip”. (JPG/PNG files have been losslessly optimized using jpegoptim/OptiPNG, saving ~100GB.)
The SFW torrent follows the same schema but inside the 512px/ directory instead and converted to JPG for the SFW files: 512px/0000/1000.jpg etc.
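As a small illustration of the path convention (the 4-digit zero-padded bucket and unpadded filename as described above; the original’s file extension varies per image and must be taken from the metadata):

```bash
# Map a Danbooru ID to its expected locations in the two torrents.
# bucket = ID modulo 1000, zero-padded to 4 digits; filenames are the unpadded ID.
danbooru_paths() {
    local id="$1" ext="$2"   # ext: the original upload's extension, as recorded in the metadata
    local bucket
    bucket="$(printf '%04d' $(( id % 1000 )))"
    echo "original/${bucket}/${id}.${ext}"
    echo "512px/${bucket}/${id}.jpg"   # present only if the image is in the SFW subset
}
danbooru_paths 3368713 png
# original/0713/3368713.png
# 512px/0713/3368713.jpg
```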
An experimental shell script for parallelized conversion of the full-size original images into a more tractable ~250GB corpus of 512x512px images is included: rescale_images.sh. It requires ImageMagick & GNU parallel to be installed.
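The canonical script ships with the dataset; the following is only a minimal sketch of the kind of conversion it performs (the ImageMagick flags here are assumptions, and see the Bugs section regarding the black-vs-white background choice):

```bash
#!/usr/bin/env bash
# Sketch: flatten onto a white background, fit within 512x512, pad to exactly 512x512,
# and write JPGs under 512px/ mirroring the original/ bucket layout.
convert_one() {
    local src="$1"
    local bucket id
    bucket="$(basename "$(dirname "$src")")"
    id="$(basename "${src%.*}")"
    mkdir -p "512px/${bucket}"
    convert "$src" -background white -alpha remove -flatten \
            -resize 512x512 -gravity center -extent 512x512 \
            -quality 90 "512px/${bucket}/${id}.jpg"
}
export -f convert_one
find original/ -type f \( -iname '*.jpg' -o -iname '*.png' \) -print0 |
    parallel --null convert_one {}
```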
Image Metadata
The metadata is available as an XZ-compressed tarball of JSON files as exported from the Danbooru BigQuery database mirror (metadata.json.tar.xz). Each line is an individual JSON object for a single image; ad hoc queries can be run easily by piping into jq.
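For example, a hedged sketch counting safe-rated Touhou images by streaming the JSON straight out of the tarball (the field names .rating & .tags[].name are those used in the Appendix queries; the tarball’s internal layout is an assumption):

```bash
# Count safe-rated images tagged "touhou" without unpacking the metadata to disk.
tar --extract --xz --to-stdout --file=metadata.json.tar.xz |
    jq 'select(.rating == "s" and ([.tags[].name] | index("touhou"))) | .created_at' |
    wc --lines
```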
Citing
Please cite this dataset as:
Anonymous, The Danbooru Community, Gwern Branwen, & Aaron Gokaslan; “Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, 3 January 2019. Web. Accessed [DATE]
https://www.gwern.net/Danbooru2018

@misc{danbooru2018,
    author = {Anonymous, the Danbooru community, Gwern Branwen, Aaron Gokaslan},
    title = {Danbooru2018: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset},
    howpublished = {\url{https://www.gwern.net/Danbooru2018}},
    url = {https://www.gwern.net/Danbooru2018},
    type = {dataset},
    year = {2019},
    month = {January},
    timestamp = {2019-01-02},
    note = {Accessed: DATE}
}
Past releases
Danbooru2017
The first release, Danbooru2017, contained ~1.9tb of 2.94m images with 77.5m tag instances (of 333k defined tags, ~26.3/image) covering Danbooru from 24 May 2005 through 31 December 2017 (final ID: #2,973,532).
Danbooru2018 added 0.413TB/392,557 images/15,208,974 tags/31,698 new unique tags.
To reconstruct Danbooru2017, download Danbooru2018, and take the image subset ID #1-2973532 as the image dataset, and the JSON metadata in the subdirectory metadata/2017/ as the metadata. That should give you Danbooru2017 as released on 2018-02-13.
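A minimal sketch of that filtering step for the images (paths per the Format section; hard links require source and destination to be on the same filesystem):

```bash
# Recreate the Danbooru2017 image subset (IDs 1-2973532) from a Danbooru2018 copy by hard-linking.
find danbooru2018/original/ -type f | while read -r f; do
    id="$(basename "${f%.*}")"
    if [ "$id" -le 2973532 ]; then
        dest="danbooru2017/original/$(basename "$(dirname "$f")")"
        mkdir -p "$dest"
        ln "$f" "$dest/"
    fi
done
```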
Applications
Projects:
“PaintsTransfer-Euclid”/“style2paints” (line-art colorizer): used Danbooru2017 for training (see Zhang et al 2018 for details)
“This Waifu Does Not Exist”: trains a StyleGAN on faces cropped from the Danbooru corpus, generating high-quality 512px anime faces; site displays random samples
“Text Segmentation and Image Inpainting”, yu45020
This is an ongoing project that aims to solve a simple but tedious procedure: remove texts from an image. It will reduce comic book translators’ time on erasing Japanese words.
DCGAN/LSGAN in PyTorch, Kevin Lyu
DeepCreamPy: Decensoring Hentai with Deep Neural Networks, deeppomf
“animeGM: Anime Generative Model for Style Transfer”, Peter Chau: 1/2/3
danbooru-faces: Jupyter notebooks for cropping and processing anime faces using Nagadomi’s lbpcascade_animeface (see also Nagadomi’s moeimouto face dataset on Kaggle)
Publications:
“Improving Shape Deformation in Unsupervised Image-to-Image Translation”, Gokaslan et al 2018:
Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, plus in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.
“Two Stage Sketch Colorization”, Zhang et al 2018: (on style2paints, version 3)
Sketch or line art colorization is a research field with significant market demand. Different from photo colorization which strongly relies on texture information, sketch colorization is more challenging as sketches may not have texture. Even worse, color, texture, and gradient have to be generated from the abstract sketch lines. In this paper, we propose a semi-automatic learning-based framework to colorize sketches with proper color, texture as well as gradient. Our framework consists of two stages. In the first drafting stage, our model guesses color regions and splashes a rich variety of colors over the sketch to obtain a color draft. In the second refinement stage, it detects the unnatural colors and artifacts, and try to fix and refine the result. Comparing to existing approaches, this two-stage design effectively divides the complex colorization task into two simpler and goal-clearer subtasks. This eases the learning and raises the quality of colorization. Our model resolves the artifacts such as water-color blurring, color distortion, and dull textures.
We build an interactive software based on our model for evaluation. Users can iteratively edit and refine the colorization. We evaluate our learning model and the interactive system through an extensive user study. Statistics shows that our method outperforms the state-of-art techniques and industrial applications in several aspects including, the visual quality, the ability of user control, user experience, and other metrics.
“Application of Generative Adversarial Network on Image Style Transformation and Image Processing”, Wang 2018
Image-to-Image translation is a collection of computer vision problems that aim to learn a mapping between two different domains or multiple domains. Recent research in computer vision and deep learning produced powerful tools for the task. Conditional adversarial networks serve as a general-purpose solution for image-to-image translation problems. Deep Convolutional Neural Networks can learn an image representation that can be applied for recognition, detection, and segmentation. Generative Adversarial Networks (GANs) has gained success in image synthesis. However, traditional models that require paired training data might not be applicable in most situations due to lack of paired data.
Here we review and compare two different models for learning unsupervised image to image translation: CycleGAN and Unsupervised Image-to-Image Translation Networks (UNIT). Both models adopt cycle consistency, which enables us to conduct unsupervised learning without paired data. We show that both models can successfully perform image style translation. The experiments reveal that CycleGAN can generate more realistic results, and UNIT can generate varied images and better preserve the structure of input images.
Scraping
This project is not officially affiliated with or run by Danbooru; however, the site operator Albert has given his permission for scraping. I have registered the accounts gwern and gwern-bot for use in downloading & participating on Danbooru; it is considered good research ethics to try to offset any use of resources when crawling an online community (eg DNM scrapers try to run Tor nodes to pay back the bandwidth), so I have donated $20 to Danbooru via an account upgrade.
Danbooru IDs are sequential positive integers, but the images are stored at their MD5 hashes; so downloading the full images can be done by a query to the JSON API for the metadata for an ID, getting the URL for the full upload, and downloading that to the ID plus extension.
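A hedged sketch of that loop for a single ID (the endpoint and the file_url field are assumptions based on the public Danbooru API; consult the API documentation and rate-limit politely before scraping at scale):

```bash
# Download one image by ID: look up its metadata, extract the full-size URL, save as ID.extension.
id=1234567   # hypothetical ID
url="$(curl --silent "https://danbooru.donmai.us/posts/${id}.json" | jq --raw-output '.file_url')"
ext="${url##*.}"
curl --silent --output "${id}.${ext}" "$url"
```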
The metadata can be downloaded from BigQuery via BigQuery-API-based tools.
Bugs
Known bugs:
all: the metadata does not include the translations or bounding-boxes of captions/translations (“notes”); they were omitted from the BigQuery mirror and technical problems meant they could not be added to BQ before release. The captions/translations can be retrieved via the Danbooru API if necessary.
512px SFW subset: some images have transparent backgrounds; if they are also black-white, like black line-art drawings, then the conversion to JPG with a default black background will render them almost 100% black and the image will be invisible (eg files with the two tags transparent_background lineart). This affects somewhere in the hundreds of images. Users can either ignore this as affecting a minute percentage of files, filter out images based on the tag-combination, or include data quality checks in their image-loading code to drop anomalous images with too few unique colors or which are too white/too black. Proposed fix: in the next version’s 512px SFW subset, the downscaling will switch to white backgrounds rather than black backgrounds; while the same issue can still arise in the case of white line-art drawings with transparent backgrounds, these are much rarer. (It might also be possible to make the conversion shell script query images for use of transparency, average the contents, and pick a background which is most opposite the content.)
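One possible form of such a data-quality check, using ImageMagick to count unique colors (the threshold of 8 is an arbitrary illustration):

```bash
# Flag 512px JPGs that collapsed to (almost) a single color, eg line art rendered invisible
# on the default black background.
is_degenerate() {
    local colors
    colors="$(identify -format '%k' "$1")"   # %k = number of unique colors
    [ "$colors" -lt 8 ]
}
find 512px/ -type f -name '*.jpg' | while read -r f; do
    is_degenerate "$f" && echo "degenerate: $f"
done
```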
Future work
Model zoo
If possible, additional models and derived metadata may be supplied as part of a “model zoo”. Particularly desirable would be:
- “s”/“q”/“e” classifier
- top-10,000-tag tagger
- text embedding RNN, and pre-computed text embeddings for all images’ tags
Metadata Quality Improvement via Active Learning
How high is the quality of the Danbooru metadata? As with ImageNet, it is critical that the tags are extremely accurate, or else this will lower-bound the error rates and impede the learning of taggers, especially on rarer tags where a low error rate may still cause false negatives to outweigh the true positives.
I would say that the Danbooru tag data is of quite high quality but imbalanced: almost all tags on images are correct, but the absence of a tag is often wrong—that is, many tags are missing on Danbooru (there are so many possible tags that no user could possibly know them all). So the absence of a tag isn’t as informative as the presence of a tag—eyeballing images and some rarer tags, I would guess that tags are present <10% of the time they should be.
This suggests leveraging an active learning (Settles 2010) form of training: train a tagger, have a human review the errors, update the metadata when it was not an error, and retrain.
More specifically: train the tagger; run the tagger on the entire dataset, recording the outputs and errors; a human examines the errors interactively by comparing the supposed error with the image; and for false negatives, the tag can be added to the Danbooru source using the Danbooru API and added to the local image metadata database, and for false positives, the ‘negative tag’ can be added to the local database; train a new model (possibly initializing from the last checkpoint). Since there will probably be thousands of errors, one would go through them by magnitude of error: for a false positive, start with tagging probabilities of 1.0 and go down, and for false negatives, 0.0 and go up. This would be equivalent to the active learning strategy “uncertainty sampling”, which is simple, easy to implement, and effective (albeit not necessarily optimal for active learning as the worst errors will tend to be highly correlated/redundant and the set of corrections overkill). Once all errors have been hand-checked, the training weight on absent tags can be increased, as any missing tags should have shown up as false positives.
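As a concrete sketch of the review ordering, given a hypothetical TSV of predictions with columns id, tag, probability, and current label (0 = tag absent, 1 = tag present):

```bash
# Queue candidate false positives from p~1.0 downward and candidate false negatives from p~0.0 upward,
# ie review the most confident disagreements between tagger and metadata first.
awk -F'\t' '$4 == 0' predictions.tsv | sort -t$'\t' -k3,3gr > review-false-positives.tsv
awk -F'\t' '$4 == 1' predictions.tsv | sort -t$'\t' -k3,3g  > review-false-negatives.tsv
```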
Over multiple iterations of active learning + retraining, the procedure should be able to ferret out errors in the dataset and boost its quality while also increasing its performance.
Based on my experiences with semi-automatic editing on Wikipedia (using pywikipediabot for solving disambiguation wikilinks), I would estimate that given an appropriate terminal interface, a human should be able to check at least 1 error per second and so checking ~30,000 errors per day is possible (albeit extremely tedious). Fixing the top million errors should offer a noticeable increase in performance.
There are many open questions about how best to optimize tagging performance: is it better to refine tags on the existing set of images or would adding more only-partially-tagged images be more useful?
External links
- Discussion: /r/MachineLearning, /r/anime
- “Deep Learning Anime Papers”
- pybooru (Python client library for the Danbooru API)
Appendix
Shell queries for statistics
# count number of images/files in Danbooru2018
find /media/gwern/My\ Book/danbooru2018/original/ -type f | wc --lines
2941205
# count total filesize of images in Danbooru2018
du -sch /media/gwern/My\ Book/danbooru2018/original/
1.9TB
# on JSON files concatenated together:
## number of unique tags
cat all.json | jq '.tags | .[] | .name' > tags.txt
sort -u tags.txt | wc --lines
# 333333
## number of total tags
wc --lines tags.txt
# 77565442
## Average tag count per image:
R
# R> 77565442 / 2941205
# [1] 26.37199447
## Most popular tags:
sort tags.txt | uniq -c | sort -g | tac | head -19
# 2060363 "1girl"
# 1710762 "solo"
# 1318516 "long_hair"
# 1018512 "highres"
# 900500 "breasts"
# 870086 "blush"
# 813241 "short_hair"
# 800738 "smile"
# 662570 "multiple_girls"
# 651028 "open_mouth"
# 631508 "looking_at_viewer"
# 585282 "blue_eyes"
# 580097 "blonde_hair"
# 573047 "touhou"
# 541144 "brown_hair"
# 508437 "skirt"
# 478805 "hat"
# 466367 "thighhighs"
# 450133 "black_hair"
## count Danbooru images by rating
cat all.json | jq '.rating' > ratings.txt
sort ratings.txt | uniq -c | sort -g
# 257967 "e"
# 443750 "q"
# 2262211 "s"
wc --lines ratings.txt
## 2963928 ratings.txt
R
# R> c(257967, 443750, 2262211) / 2963928
# [1] 0.08703551503 0.14971686222 0.76324762275
# earliest upload:
cat all.json | jq '.created_at' | fgrep '2005' > uploaded.txt
sort -g uploaded.txt | head -1
# "2005-05-24 03:35:31 UTC"While Danbooru is not the largest anime image booru in existence—TBIB, for example, claims >4.7m images or almost twice as many, by mirroring from multiple boorus—but Danbooru is generally considered to focus on higher-quality images & have better tagging; I suspect >2.9m images is into diminishing returns and the focus then ought to be on improving the metadata. Google finds (Sun et al 2017) that image classification is logarithmic in image count up to n=300M with noisy labels, which I interpret as suggesting that for the rest of us with limited hard drives & compute, going past millions is not that helpful; unfortunately that experiment doesn’t examine the impact of the noise in their categories so one can’t guess how many images each additional tag is equivalent to for improving final accuracy. (They do compare training on equally large datasets with small vs large number of categories, but fine vs coarse-grained categories is not directly comparable to a fixed number of images with less or more tags on each image.) The impact of tag noise could be quantified by removing varying numbers of random images/tags and comparing the curve of final accuracy. As adding more images is hard but semi-automatically fixing tags with an active-learning approach should be easy, I would bet that the cost-benefit is strongly in favor of improving the existing metadata than in adding more images from recent Danbooru uploads or other -boorus.↩
[2] This is done to save >100GB of space/bandwidth, as hashes are already inherently validated as part of the BitTorrent download process; the original MD5 hashes are available in the metadata.↩
[3] If one is only interested in the metadata, one could run queries on the BigQuery version of the Danbooru database instead of downloading the torrent. The BigQuery database is also updated daily.↩
[4] Apparently a bug due to an anti-DoS mechanism, which should be fixed.↩
[5] An author of style2paints, a NN painter for anime images, notes that standard style transfer approaches (typically using an ImageNet-based CNN) fail abysmally on anime images: “All transferring methods based on Anime Classifier are not good enough because we do not have anime ImageNet”. This is interesting in part because it suggests that ImageNet CNNs are still only capturing a subset of human perception if they only work on photographs & not illustrations.↩
[6] Danbooru2018 does not by default provide a “face” dataset of images cropped to just faces like that of Getchu or Nagadomi’s moeimouto; however, the tags can be used to filter down to a large set of face closeups, and Nagadomi’s face-detection code is highly effective at extracting faces from Danbooru2018 images & can be combined with waifu2x for creating large sets of large face images.↩
[7] See for example the pair highlighted in Sharma et al 2018, motivating them to use human dialogues to provide more descriptions/supervision.↩
[8] A tagger could be integrated into the site to automatically propose tags for newly-uploaded images to be approved by the uploader; new users, unconfident or unfamiliar with the full breadth of the tag vocabulary, would then have the much easier task of simply checking that all the proposed tags are correct.↩
