Danbooru2020 is a large-scale anime image database with 4.2m+ images annotated with 130m+ tags; it can be useful for machine learning purposes such as image recognition and generation.
2015-12-15–2021-01-12
finished
certainty: likely
importance: 6
Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short either on image/tag count or on unrestricted distribution. I suggest that the “image boorus” be used. The image boorus are longstanding web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations. The best known booru, with a focus on quality, is Danbooru.
We provide a torrent/rsync mirror which contains ~3.4TB of 4.22m images with 130m tag instances (of 434k defined tags, ~30/image) covering Danbooru from 2005-05-24–2020-12-31 (final ID: #4,279,845), providing the image files & a JSON export of the metadata. We also provide a smaller torrent of SFW images downscaled to 512×512px JPGs (0.37TB; 3,227,715 images) for convenience. (Total: 3.7TB.)
Our hope is that the Danbooru2020 dataset can be used for rich large-scale classification/tagging & learned embeddings, to test the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, to provide an archival backup for the Danbooru community, to feed back metadata improvements & corrections, and to serve as a testbed for advanced techniques such as conditional image generation or style transfer.
Image boorus like Danbooru are image hosting websites developed by the anime community for collaborative tagging. Images are uploaded and tagged by users; they can be large, such as Danbooru1, and richly annotated with textual ‘tags’.
Danbooru in particular is old, large, well-tagged, and its operators have always supported uses beyond regular browsing—providing an API and even a database export. With their permission, I have periodically created static snapshots of Danbooru oriented towards ML use patterns.
Image booru description
Image booru tags are typically divided into a few major groups:
- copyright (the overall franchise, movie, TV series, manga etc a work is based on; for long-running franchises like Neon Genesis Evangelion or “crossover” images, there can be multiple such tags, or if there is no such associated work, it would be tagged “original”)
- character (often multiple)
- author
- rating: Danbooru does not ban sexually suggestive or pornographic content; instead, images are classified into 3 categories: safe, questionable, & explicit (represented in the SQL as “s”/“q”/“e” respectively). safe is for relatively SFW content including swimsuits, while questionable would be more appropriate for highly-revealing swimsuit images, nudity, or highly sexually suggestive situations, and explicit denotes anything hard-core pornographic. (8.5% of images are classified as “e”, 15% as “q”, and 77% as “s”; as the default tag is “q”, this may underestimate the number of “s” images, but “s” should probably be considered the SFW subset.)
- descriptive tags (eg the top 10 tags are 1girl/solo/long_hair/highres/breasts/blush/short_hair/smile/multiple_girls/open_mouth/looking_at_viewer, which reflect the expected focus of anime fandom on things like the Touhou franchise)

These tags form a “folksonomy” to describe aspects of images; beyond the expected tags like long_hair or looking_at_viewer, there are many strange and unusual tags, including many anime or illustration-specific tags like seiyuu_connection (images where the joke is based on knowing the two characters are voiced in different anime by the same voice actor) or bad_feet (artists frequently accidentally draw two left feet, or just bad_anatomy in general). Tags may also be hierarchical, and one tag may “imply” another.

Images with text in them will have tags like translated, comic, or speech_bubble.
Images can have other associated metadata with them, including:
- Danbooru ID, a unique positive integer
- MD5 hash (the MD5s are often incorrect)
- the uploader username
- the original URL or the name of the work
- up/downvotes
- sibling images (often an image will exist in many forms, such as sketch or black-white versions in addition to a final color image, edited or larger/smaller versions, SFW vs NSFW, or depicting multiple moments in a scene)
- captions/dialogue (many images will have written Japanese captions/dialogue, which have been translated into English by users and annotated using HTML image maps)
- author commentary (also often translated)
- pools (ordered sequences of images from across Danbooru; often used for comics or image groups, or for disparate images with some unifying theme which is insufficiently objective to be a normal tag)
Image boorus typically support advanced Boolean searches on multiple attributes simultaneously, which, in conjunction with the rich tagging, can allow users to discover extremely specific sets of images.
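For instance, a query along the lines of the following (an illustrative example of the tag-search syntax, not taken from Danbooru’s documentation; Danbooru itself may cap the number of tags per query depending on account level) would return safe-rated solo images of a single character by a single artist, excluding monochrome versions:

iri_flina nardack solo rating:safe -monochrome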
Samples

Download
Danbooru2020 is currently available for download in 2 ways:
- BitTorrent
- public rsync server
Torrent
The images have been downloaded using a curl script & the Danbooru API, and losslessly optimized using optipng/jpegoptim; the metadata has been exported from the Danbooru BigQuery mirror.
Torrents are the preferred download method as they stress the seed server less, can potentially be faster due to many peers, are resilient to server downtime, and have built-in ECC. (However, Danbooru2020 is approaching the limits of BitTorrent clients, and Danbooru2021 may be forced to drop torrent support.)
Due to the number of files, the torrent has been broken up into 10 separate torrents, each covering a range of IDs modulo 1000. The torrent files are available as an XZ-compressed tarball for the original images (25MB) and for the SFW 512px downscaled subset (13MB); download & unpack them into one’s torrent directory.
The torrents appear to work with rTorrent on Linux & Transmission on Linux/Mac.
Rsync
Due to torrent compatibility & network issues, I provide an alternate download route via a public anonymous rsync server (rsync is available for any Unix; alternative implementations are available for Mac/Windows). To get a listing of all available files (or browse with rsync --list-only), download the file list:
rsync rsync://78.46.86.149:873/danbooru2020/filelist.txt.xz ./
To download all available files (test with --dry-run):
rsync --verbose --recursive rsync://78.46.86.149:873/danbooru2020/ ./danbooru2020/
For a single file (eg the metadata tarball), one can download it like so:
rsync --verbose rsync://78.46.86.149:873/danbooru2020/metadata.json.tar.xz ./
For a specific subset, like the SFW 512px subset or the full-resolution originals:
rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2020/512px/ ./danbooru2020/512px/
rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2020/original/ ./danbooru2020/original/
Note that rsync supports a kind of globbing “pattern” in queries (remember to escape globs so they are not interpreted by the shell), and also supports reading a list of filenames from a file:
--exclude=PATTERN exclude files matching PATTERN
--exclude-from=FILE read exclude patterns from FILE
--include=PATTERN don't exclude files matching PATTERN
--include-from=FILE read include patterns from FILE
--files-from=FILE read list of source-file names from FILE
So one can query the metadata to build up a file listing IDs matching arbitrary criteria, create the corresponding dataset filenames (like for ID #58991, 512px/0991/58991.jpg, or, even lazier, just do include matches via a glob pattern like */58991.*). This can require far less time & bandwidth than downloading the full dataset, and is also far faster than doing rsync one file at a time. See the rsync documentation for further details.
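As a concrete illustration, here is a minimal (untested) sketch of that workflow, assuming the metadata tarball has been unpacked into ./metadata/ as in the later examples, and that one wants the SFW 512px versions (which are always .jpg):

# Sketch: download only the SFW 512px versions of images with a given tag, by using
# jq to build a file list from the metadata and rsync --files-from to fetch just those files.
TAG="monochrome"
cat metadata/* \
    | jq -r --arg tag "$TAG" 'select(.rating=="s") | select(any(.tags[]; .name == $tag))
        | "512px/\(.id|tonumber % 1000 | tostring | ("000"+.)[-4:])/\(.id).jpg"' \
    > files.txt
rsync --verbose --files-from=files.txt rsync://78.46.86.149:873/danbooru2020/ ./danbooru2020/

rsync reads the relative paths from files.txt and transfers only those, so the time & bandwidth scale with the size of the subset rather than of the whole dataset.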
And for the full dataset (metadata+original+512px):
rsync --recursive --verbose rsync://78.46.86.149:873/danbooru2020 ./danbooru2020/
I also provide rsync mirrors of a number of models & datasets, such as the cleaned anime portrait dataset; see Projects for a listing of derivative works.
Kaggle
A combination of an n = 300k subset of the 512px SFW subset of Danbooru2017 and Nagadomi’s moeimouto face dataset is available as a Kaggle-hosted dataset: “Tagged Anime Illustrations” (36GB).
Kaggle also hosts the metadata of Safebooru up to 2016-11-20: “Safebooru—Anime Image Metadata”.
Model zoo
Currently available:
- taggers:
    - DeepDanbooru (service; implemented in CNTK & TensorFlow on top-7112 tags from Danbooru2018); DeepDanbooru activation/saliency maps
    - danbooru-pretrained (PyTorch; top-6000 tags from Danbooru2018)
- face detection/figure segmentation: AniSeg/Yet-Another-Anime-Segmenter
- StyleGAN models:
    - 512px cropped faces (all characters)
    - 512px cropped ‘portrait’ faces
    - various character-specific StyleGAN models
- TwinGAN: human ↔︎ anime face conversion

Useful models would be:
- perceptual loss model (using DeepDanbooru?)
- “s”/“q”/“e” classifier
- text embedding RNN, and pre-computed text embeddings for all images’ tags
Updating
If there is interest, the dataset will continue to be updated at regular annual intervals (“Danbooru2021”, “Danbooru2022”, etc).
Updates exploit the ECC capability of BitTorrent by updating the image .torrent files; users download the new .torrent, overwrite the old .torrent, and, after rehashing files to discover which ones have changed or been added, download only the difference.
Turnover in BitTorrent swarms means that earlier versions of the torrent will quickly disappear, so for easier reproducibility, the metadata files can be archived into subdirectories (images generally will not change, so reproducibility is less of a concern—to reproduce the subset for an earlier release, one simply filters on upload date or takes the file list from the old metadata).
Notification of updates
To receive notification of future updates to the dataset, please subscribe to the notification mailing list.
Possible Uses
Such a dataset would support many possible uses:
- classification & tagging:
    - image categorization (of major characteristics such as franchise or character or SFW/NSFW detection eg Derpibooru)
    - image multi-label classification (tagging), exploiting the ~20 tags per image (currently there is a prototype, DeepDanbooru)
    - a large-scale testbed for real-world application of active learning/man-machine collaboration
    - testing the scaling limits of existing tagging approaches and motivating zero-shot & one-shot learning techniques
    - bootstrapping video summaries/descriptions
    - robustness of image classifiers to different illustration styles (eg Icons-50)
- image generation:
    - text-to-image synthesis (eg DALL·E would benefit greatly from the tags as more informative than the sentence descriptions of MS COCO or the poor quality captions of web scrapes)
    - unsupervised image generation (DCGANs, VAEs, PixelCNNs, WGANs, eg MakeGirlsMoe or Xiang & Li 2018)
    - image transformation: upscaling (waifu2×), colorizing (Frans 2017) or palette color scheme generation (Colormind), inpainting, sketch-to-drawing (Simo-Serra et al 2017), photo-to-drawing (using the reference_photo/photo_reference tags), artistic style transfer/image analogies (Liao et al 2017), optimization (“Image Synthesis from Yahoo’s open_nsfw”, pix2pix, DiscoGAN, CycleGAN eg CycleGAN for silverizing anime character hair or photo⟺illustration face mapping eg Gokaslan et al 2018/Li 2018), CGI model/pose generation (PSGAN)
- image analysis:
    - facial detection & localization for drawn images (on which normal techniques such as OpenCV’s Haar filters fail, requiring special-purpose approaches like AnimeFace 2009/lbpcascade_animeface)
    - image popularity/upvote prediction
    - image-to-text localization, transcription, and translation of text in images
    - illustration-specialized compression (for better performance than PNG/JPG)
- image search:
    - collaborative filtering/recommendation, image similarity search (Flickr) of images (useful for users looking for images, for discovering tag mistakes, and for various diagnostics like checking GANs are not memorizing)
    - manga recommendation (Vie et al 2017)
    - artist similarity and de-anonymization
    - knowledge graph extraction from tags/tag-implications and images
    - clustering tags
    - temporal trends in tags (franchise popularity trends)
Advantages
Size and metadata
Image classification has been supercharged by work on ImageNet, but ImageNet itself is limited by its small set of classes, many of which are debatable, and which encompass only a limited set. Compounding these limits, existing tagging/classification datasets tend to be small, idiosyncratic in coverage, or restricted in distribution:
- ImageNet: dog breeds (memorably brought out by DeepDream)
- WebVision (Li et al 2017a; Li et al 2017b; Guo et al 2018): 2.4m images noisily classified via search engine/Flickr queries into the ImageNet 1k categories
- Youtube-BB: toilets/giraffes
- MS COCO: bathrooms and African savannah animals; 328k images, 80 categories, short 1-sentence descriptions
- birds/flowers: a few score of each kind (eg no eagles in the birds dataset)
- Visual Relationship Detection (VRD) dataset: 5k images
- Pascal VOC: 11k images
- Visual Genome: 108k images
- nico-opendata: 400k images, but SFW & restricted to approved researchers
- Open Images V4: released 2018, 30.1m tags for 9.2m images and 15.4m bounding-boxes, with high label quality; a major advantage of this dataset is that it uses CC-BY-licensed Flickr photographs/images, and so it should be freely distributable
- BAM! (Wilber et al 2017): 65m raw images, 393k? tags for 2.5m? tagged images (semi-supervised), restricted access?
The external validity of classifiers trained on these datasets is somewhat questionable, as the learned discriminative models may collapse or simplify in undesirable ways, and overfit on the datasets’ individual biases (Torralba & Efros 2011). For example, ImageNet classifiers sometimes appear to ‘cheat’ by relying on localized textures in a “bag-of-words”-style approach and simplistic outlines/shapes.
It is an open issue of text-to-image mapping that the distribution of images conditioned on a sentence is highly multi-modal. In the past few years, we’ve witnessed a breakthrough in the application of recurrent neural networks (RNN) to generating textual descriptions conditioned on images [1, 2], with Xu et al. showing that the multi-modality problem can be decomposed sequentially [3]. However, the lack of datasets with diversity descriptions of images limits the performance of text-to-image synthesis on multi-categories dataset like MSCOCO [4]. Therefore, the problem of text-to-image synthesis is still far from being solved
In contrast, the Danbooru dataset is larger than ImageNet as a whole and larger than the most widely-used multi-description dataset, MS COCO, with far richer metadata than the ‘subject verb object’ sentence summary that is dominant in MS COCO or the birds dataset (sentences which could be adequately summarized in perhaps 5 tags, if even that7). While the Danbooru community does focus heavily on female anime characters, they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included. It is unlikely that the performance ceiling will be reached anytime soon, and advanced techniques such as attention will likely be required to get anywhere near the ceiling. And Danbooru is constantly expanding and can be easily updated by anyone anywhere, allowing for regular releases of improved annotations.
Danbooru and the image boorus have been only minimally used in previous machine learning work; principally, in “Illustration2Vec: A Semantic Vector Representation of Images”, Saito & Matsui 2015 (project), which used 1.287m images to train a finetuned VGG-based CNN to detect 1,539 tags (drawn from the 512 most frequent tags of each of the general/copyright/character tag groups, plus the 3 ratings).
Non-photographic
Anime images and illustrations, on the other hand, differ from photographs in many ways: illustrations are frequently black-and-white rather than color, line art rather than photographs, and even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working at a higher level of abstraction—a leopard would not be as trivially recognized by simple pattern-matching on yellow and black dots—with the irrelevant details that a discriminator might cheaply latch onto typically suppressed in favor of global gestalt, and they are often heavily stylized (eg frequent use of “Dutch angles”). With the exception of MNIST & Omniglot, almost all commonly-used deep learning-related image datasets are photographic.
Humans can still easily perceive a black-white line drawing of a leopard as being a leopard—but can a standard ImageNet classifier? Likewise, the difficulty face detectors encounter on anime images suggests that other detectors like nudity or pornographic detectors may fail; but surely moderation tasks require detection of penises, whether they are drawn or photographed? The attempts to apply CNNs to GANs, image generation, image inpainting, or style transfer have sometimes thrown up artifacts which don’t seem to be issues when using the same architecture on photographic material; for example, in GAN image generation & style transfer, I almost always note, in my own or others’ attempts, what I call the “watercolor effect”, where instead of producing the usual abstracted regions of whitespace, monotone coloring, or simple color gradients, the CNN instead consistently produces noisy transition textures which look like watercolor paintings—which can be beautiful, and an interesting style in its own right (eg the style2paints samples), but means the CNNs are failing to some degree. This watercolor effect appears not to be a problem in photographic applications; but, on the other hand, photos are filled with noisy transition textures, and watching a GAN train, you can see that the learning process generates textures first and only gradually learns to build edges, regions, and transitions from the blurred textures. Is this anime-specific problem due simply to insufficient data/training, or to something more fundamentally different about illustrations?
Because illustrations are produced by an entirely different process and focus only on salient details while abstracting the rest, they offer a way to test external validity and the extent to which taggers are tapping into higher-level semantic perception.
As well, many ML researchers are anime fans and might enjoy working on such a dataset—training NNs to generate anime images can be amusing. It is, at least, more interesting than photos of street signs or storefronts. (“There are few sources of energy so powerful as a procrastinating grad student.”)
Community value
A full dataset is of immediate value to the Danbooru community as an archival snapshot of Danbooru which can be downloaded in lieu of hammering the main site and downloading terabytes of data; backups are occasionally requested on the Danbooru forum but the need is currently not met.
There is potential for a symbiosis between the Danbooru community & ML researchers: in a virtuous circle, the community provides curation and expansion of a rich dataset, while ML researchers can contribute back tools from their research on it which help improve the dataset. The Danbooru community is relatively large and would likely welcome the development of tools like taggers to support semi-automatic (or eventually, fully automatic) image tagging, as use of a tagger could offer orders of magnitude improvement in speed and accuracy compared to their existing manual methods, as well as being newbie-friendly. They are also a pre-existing audience which would be interested in new research results.
Format
The goal of the dataset is to be as easy as possible to use immediately, avoiding obscure file formats, while allowing simultaneous research & seeding of the torrent, with easy updates.
Images are provided in the full original form (be that JPG, PNG, GIF or otherwise) for reference/archival purposes.
Images are bucketed into 1000 subdirectories 0000–0999, which is the Danbooru ID modulo 1000 (ie all images in 0999/ have an ID ending in ‘999’). A single directory would cause pathological filesystem performance, and modulo ID spreads images evenly without requiring additional directories to be made. The ID is not zero-padded and files end in the relevant extension, hence the file layout looks like this:
original/0000/
original/0000/1000.png
original/0000/2000.jpg
original/0000/3000.jpg
original/0000/4000.png
original/0000/5000.jpg
original/0000/6000.jpg
original/0000/7000.jpg
original/0000/8000.jpg
original/0000/9000.jpg
...
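So the bucket and path for any given ID can be computed directly in the shell (a trivial sketch; for the original/ subset the extension varies and must be globbed or taken from the metadata, so the fixed-extension 512px/ path is shown):

ID=58991
printf '512px/%04d/%d.jpg\n' $(( ID % 1000 )) "$ID"    # -> 512px/0991/58991.jpg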
Currently represented file extensions are: avi/bmp/gif/html/jpeg/jpg/mp3/mp4/mpg/pdf/png/rar/swf/webm/wmv/zip. (The JPG/PNG files are the ones which have been losslessly optimized with jpegoptim/optipng, as noted earlier.)
Be careful if working with the original rather than 512px subset. There are many odd files: truncated, non-sRGB colorspace, wrong file extensions (eg some PNGs have .jpg extensions like original/0146/1525146.jpg or original/0558/1422558.jpg), etc.
The SFW torrent follows the same schema, but inside the 512px/ directory instead, and converted to JPG for the SFW files: 512px/0000/1000.jpg etc.
An experimental shell script for parallelized conversion of the full-size original images into a more tractable ~250GB corpus of 512×512px images is included: rescale_images.sh. It requires ImageMagick & GNU parallel to be installed.
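For reference, the core of such a conversion might look roughly like the following (a sketch only, not the contents of rescale_images.sh; details such as the padding color and exact ImageMagick options are assumptions):

# Downscale every original JPG/PNG into a 512x512 padded JPG under ./512px/,
# preserving the modulo-1000 bucket layout.
# (The released 512px subset is also restricted to rating:s images, which this sketch does not filter.)
mkdir -p 512px/{0000..0999}
find original/ -type f \( -name '*.jpg' -o -name '*.png' \) \
    | parallel --eta convert {} -resize 512x512 -background white -gravity center \
        -extent 512x512 '512px/{= s|.*original/||; s|\.[^.]+$|.jpg| =}'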
Image Metadata
The metadata is available as an XZ-compressed tarball of JSON files as exported from the Danbooru BigQuery database mirror (metadata.json.tar.xz). Each line is an individual JSON object for a single image; ad hoc queries can be run easily by piping into jq, and several are illustrated in the shell query appendix.
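For instance, a quick sanity check of the rating distribution (assuming the tarball has been unpacked into ./metadata/):

# Count images by rating ("s"/"q"/"e"):
cat metadata/* | jq -r '.rating' | sort | uniq -c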
Here is an example of a shell script for getting the filenames of all SFW images matching a particular tag:
# print out filenames of all SFW Danbooru images matching a particular tag.
# assumes being in root directory like '/media/gwern/Data2/danbooru2020'
TAG="monochrome"
TEMP=$(mktemp /tmp/matches-XXXX.txt)
cat metadata/* | fgrep -e '"name":"'"$TAG"'"' | fgrep '"rating":"s"' \
    | jq -c '.id' | tr -d '"' >> "$TEMP"        # exact tag match + SFW rating; extract the IDs
for ID in $(cat "$TEMP"); do
    BUCKET=$(printf "%04d" $(( ID % 1000 )) );  # ID modulo 1000, zero-padded to the bucket name
    TARGET=$(ls ./original/"$BUCKET/$ID".*)     # glob for the unknown file extension
    ls "$TARGET"
done
3 example metadata entries (jq-formatted):
{
"id": "148112",
"created_at": "2007-10-25 21:29:41.5877 UTC",
"uploader_id": "1",
"score": "2",
"source": "",
"md5": "afc6c473332f8372afba07cb597818af",
"last_commented_at": "1970-01-01 00:00:00 UTC",
"rating": "s",
"image_width": "1555",
"image_height": "1200",
"is_note_locked": false,
"file_ext": "jpg",
"last_noted_at": "1970-01-01 00:00:00 UTC",
"is_rating_locked": false,
"parent_id": "0",
"has_children": false,
"approver_id": "0",
"file_size": "390946",
"is_status_locked": false,
"up_score": "2",
"down_score": "0",
"is_pending": false,
"is_flagged": false,
"is_deleted": false,
"updated_at": "2016-03-26 16:29:45.28726 UTC",
"is_banned": false,
"pixiv_id": "0",
"tags": [
{
"id": "567316",
"name": "6+girls",
"category": "0"
},
{
"id": "437490",
"name": "artist_request",
"category": "5"
},
{
"id": "6059",
"name": "blazer",
"category": "0"
},
{
"id": "2378",
"name": "buruma",
"category": "0"
},
{
"id": "484628",
"name": "copyright_request",
"category": "5"
},
{
"id": "6532",
"name": "glasses",
"category": "0"
},
{
"id": "7450",
"name": "gym_uniform",
"category": "0"
},
{
"id": "1566",
"name": "highres",
"category": "5"
},
{
"id": "3843",
"name": "jacket",
"category": "0"
},
{
"id": "566835",
"name": "multiple_girls",
"category": "0"
},
{
"id": "391",
"name": "panties",
"category": "0"
},
{
"id": "2770",
"name": "pantyshot",
"category": "0"
},
{
"id": "16509",
"name": "school_uniform",
"category": "0"
},
{
"id": "3477",
"name": "sweater",
"category": "0"
},
{
"id": "432529",
"name": "sweater_vest",
"category": "0"
},
{
"id": "3291",
"name": "teacher",
"category": "0"
},
{
"id": "1882",
"name": "thighhighs",
"category": "0"
},
{
"id": "464906",
"name": "underwear",
"category": "0"
},
{
"id": "6176",
"name": "vest",
"category": "0"
},
{
"id": "230",
"name": "waitress",
"category": "0"
},
{
"id": "4123",
"name": "wind",
"category": "0"
},
{
"id": "378454",
"name": "wind_lift",
"category": "0"
},
{
"id": "10644",
"name": "zettai_ryouiki",
"category": "0"
}
],
"pools": [],
"favs": [
"11896",
"1200",
"13418",
"11637",
"108341"
]
}
{
"id": "251218",
"created_at": "2008-05-21 00:41:56.83102 UTC",
"uploader_id": "1",
"score": "2",
"source": "http://i2.pixiv.net/img10/img/aki-prism/7956060_p31.jpg",
"md5": "a3b948d2feab35045201da677adaa925",
"last_commented_at": "1970-01-01 00:00:00 UTC",
"rating": "s",
"image_width": "350",
"image_height": "700",
"is_note_locked": false,
"file_ext": "jpg",
"last_noted_at": "1970-01-01 00:00:00 UTC",
"is_rating_locked": false,
"parent_id": "0",
"has_children": false,
"approver_id": "0",
"file_size": "73187",
"is_status_locked": false,
"up_score": "2",
"down_score": "0",
"is_pending": false,
"is_flagged": false,
"is_deleted": false,
"updated_at": "2020-05-05 23:42:39.02344 UTC",
"is_banned": false,
"pixiv_id": "7956060",
"tags": [
{
"id": "470575",
"name": "1girl",
"category": "0"
},
{
"id": "6126",
"name": "animal_ears",
"category": "0"
},
{
"id": "401178",
"name": "aruruw",
"category": "4"
},
{
"id": "465619",
"name": "closed_eyes",
"category": "0"
},
{
"id": "10157",
"name": "honey",
"category": "0"
},
{
"id": "412964",
"name": "honeypot",
"category": "0"
},
{
"id": "426559",
"name": "marupeke",
"category": "1"
},
{
"id": "402239",
"name": "photoshop_(medium)",
"category": "5"
},
{
"id": "16509",
"name": "school_uniform",
"category": "0"
},
{
"id": "268819",
"name": "serafuku",
"category": "0"
},
{
"id": "212816",
"name": "solo",
"category": "0"
},
{
"id": "15674",
"name": "tail",
"category": "0"
},
{
"id": "575561",
"name": "utawareru_mono",
"category": "3"
}
],
"pools": [],
"favs": [
"13392",
"35380",
"106523",
"484488",
"60223"
]
}
{
"id": "901634",
"created_at": "2011-04-21 22:18:02.20889 UTC",
"uploader_id": "37391",
"score": "7",
"source": "http://www.sword-girls.com/default.aspx",
"md5": "2c70ff536e7fc8186b70b6d9023d579f",
"last_commented_at": "1970-01-01 00:00:00 UTC",
"rating": "s",
"image_width": "320",
"image_height": "480",
"is_note_locked": false,
"file_ext": "jpg",
"last_noted_at": "1970-01-01 00:00:00 UTC",
"is_rating_locked": false,
"parent_id": "0",
"has_children": false,
"approver_id": "288549",
"file_size": "162693",
"is_status_locked": false,
"up_score": "5",
"down_score": "0",
"is_pending": false,
"is_flagged": false,
"is_deleted": false,
"updated_at": "2013-05-25 15:10:19.68411 UTC",
"is_banned": false,
"pixiv_id": "0",
"tags": [
{
"id": "470575",
"name": "1girl",
"category": "0"
},
{
"id": "89368",
"name": "aqua_eyes",
"category": "0"
},
{
"id": "399827",
"name": "arms_up",
"category": "0"
},
{
"id": "4011",
"name": "blade",
"category": "0"
},
{
"id": "378993",
"name": "energy_sword",
"category": "0"
},
{
"id": "2270",
"name": "eyepatch",
"category": "0"
},
{
"id": "464559",
"name": "flower",
"category": "0"
},
{
"id": "7581",
"name": "garter_belt",
"category": "0"
},
{
"id": "197",
"name": "garters",
"category": "0"
},
{
"id": "620491",
"name": "iri_flina",
"category": "4"
},
{
"id": "495048",
"name": "lily_(flower)",
"category": "0"
},
{
"id": "10606",
"name": "lowres",
"category": "5"
},
{
"id": "461172",
"name": "nardack",
"category": "1"
},
{
"id": "15080",
"name": "short_hair",
"category": "0"
},
{
"id": "15425",
"name": "silver_hair",
"category": "0"
},
{
"id": "429",
"name": "skirt",
"category": "0"
},
{
"id": "212816",
"name": "solo",
"category": "0"
},
{
"id": "401228",
"name": "sword",
"category": "0"
},
{
"id": "620408",
"name": "sword_girls",
"category": "3"
},
{
"id": "1882",
"name": "thighhighs",
"category": "0"
},
{
"id": "11449",
"name": "weapon",
"category": "0"
},
{
"id": "10644",
"name": "zettai_ryouiki",
"category": "0"
}
],
"pools": [],
"favs": [
"23888",
"115871",
"342656",
"332770",
"95046",
"324891",
"20124",
"149704",
"34355",
"290816",
"228600",
"55507",
"338018",
"134865",
"72221",
"256960",
"104143",
"85939",
"386036",
"450665",
"497363",
"550966"
]
}
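The category field distinguishes the tag groups described earlier; judging from the examples above, 0 = general, 1 = artist, 3 = copyright, 4 = character, & 5 = meta. So, for example, the character tags of a given image can be pulled out with a short jq filter (a sketch using the last example):

# Extract the character tags (category 4) of image #901634 (IDs & categories are stored as strings):
cat metadata/* | jq -r 'select(.id=="901634") | .tags[] | select(.category=="4") | .name'
# -> iri_flina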
Citing
Please cite this dataset as:
Anonymous, The Danbooru Community, & Gwern Branwen; “Danbooru2020: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, 2020-01-12. Web. Accessed [DATE]
https://www.gwern.net/Danbooru2020
@misc{danbooru2020,
    author = {Anonymous and Danbooru community and Gwern Branwen},
    title = {Danbooru2020: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset},
    howpublished = {\url{https://www.gwern.net/Danbooru2020}},
    url = {https://www.gwern.net/Danbooru2020},
    type = {dataset},
    year = {2021},
    month = {January},
    timestamp = {2020-01-12},
    note = {Accessed: DATE}
}
Past releases
Danbooru2017
The first release, Danbooru2017, contained ~1.9TB of 2.94m images with 77.5m tag instances (of 333k defined tags, ~26.3/image).
Danbooru2018 then added 0.413TB of new images (~390k) & updated metadata on top of Danbooru2017.
To reconstruct Danbooru2017, download Danbooru2018, and take the image subset ID #1–2,973,532 as the image dataset, and the JSON metadata in the subdirectory metadata/2017/ as the metadata. That should give you Danbooru2017 bit-identical to as released on 2018-02-13.
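A minimal sketch of the image-subset step (assuming the original/ bucket layout described under Format, and that filenames are plain ID.extension):

# List the image files belonging to the Danbooru2017 subset (IDs #1-2,973,532) of a Danbooru2018 copy;
# splitting each path on "/" and "." makes the ID the second-to-last field.
find original/ -type f | awk -F'[/.]' '$(NF-1) <= 2973532'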
Danbooru2018
The second release was a torrent of ~2.5TB of 3.33m images with 92.7m tag instances (of 365k defined tags, ~27.8/image).
Danbooru2018 can be reconstructed similarly using metadata/2018/.
Danbooru2019
The third release was 3tb of 3.69m images, 108m tags, through 2019-12-31 (final ID: #3,734,660). Danbooru2019 can be reconstructed likewise.
Applications
Projects
Code and derived datasets:
Projects:
- “PaintsTransfer-Euclid”/“style2paints” (line-art colorizer): used Danbooru2017 for training (see Zhang et al 2018 for details; a Style2Paints V3 replication in PyTorch)
- “This Waifu Does Not Exist” & other StyleGAN anime faces: trains a StyleGAN 2 on faces cropped from the Danbooru corpus, generating high-quality 512px anime faces; the site displays random samples. Both face crop datasets, the original faces and broader ‘portrait’ crops, are available for download.

  (Figure: hand-selected sample from an Asuka Souryuu Langley-finetuned StyleGAN.)

- “Text Segmentation and Image Inpainting”, yu45020:
This is an ongoing project that aims to solve a simple but tedious procedure: remove texts from an image. It will reduce comic book translators’ time on erasing Japanese words.
  See also SickZil-Machine/SZMC (Ko & Cho 2020), and Del Gobbo & Herrera 2020.

- DCGAN/LSGAN in PyTorch, Kevin Lyu
- DeepCreamPy: Decensoring Hentai with Deep Neural Networks, deeppomf
- “animeGM: Anime Generative Model for Style Transfer”, Peter Chau: 1/2/3
- selfie2anime (using Kim et al 2019’s UGATIT)
- “Animating gAnime with StyleGAN: Part 1—Introducing a tool for interacting with generative models”, Nolan Kent (re-implementing StyleGAN for improved character generation with rectangular convolutions & feature map visualizations, and interactive manipulation)
- SZMC (image editor for erasing text in bubbles in manga/comics, for scanlation; paper: Ko & Cho 2020)
- CEDEC 2020 session (JA) on GAN generation of Mixi’s Monster Strike character art
Datasets:
“Danbooru 2018 Anime Character Recognition Dataset” (1m face crops of 70k characters, with bounding boxes & pretrained classification model; good test-case for few-shot classification given long tail: “20k tags only have one single image.”)
The original face dataset can be downloaded via rsync:
rsync --verbose rsync://78.46.86.149:873/biggan/2019-07-27-grapeot-danbooru2018-animecharacterrecognition.tar ./
SeePrettyFace.com: face dataset (512px face crops of Danbooru2018; n = 140,000)
GochiUsa Faces dataset:

We introduce the GochiUsa Faces dataset, building a dataset of almost 40k pictures of faces from nine characters. The resolution ranges from 26×26 to 987×987, with 356×356 being the median resolution. We also provide two supplementary datasets: a test set of independent drawings and an additional face dataset for nine minor characters.
Some experiments show the subject on which GochiUsa Faces could serve as a toy dataset. They include categorization, data compression and conditional generation.
Danbooru2019 Figures dataset (855k single-character images cropped to the character figure using AniSeg)
“PALM: The PALM Anime Location Model And Dataset” (58k anime hands: cropped from Danbooru2019 using a custom YOLO anime hand detection & upscaled to 512px)
The DanbooRegion 2020 Dataset, Style2Paints: Danbooru2018 images which have been human-segmented into small ‘regions’ of single color/semantics, somewhat like semantic pixel segmentation, and a NN model trained to segment anime into regions; regions/skeletons can be used to colorize, clean up, style transfer, or support further semantic annotations.
“Danbooru Sketch Pair 128px: Anime Sketch Colorization Pair 128x128” (337k color/grayscale pairs; color images from the Kaggle Danbooru2017 dataset are mechanically converted into ‘sketches’ using the sketchKeras sketch tool)
Utilities/Tools:
- image classification/tagging:
    - DeepDanbooru (service; implemented in CNTK & TensorFlow on top-7112 tags from Danbooru2018); DeepDanbooru activation/saliency maps
    - danbooru-tagger: PyTorch ResNet-50, top-6000 tags
    - RegDeepDanbooru, zyddnys (PyTorch RegNet; 1000 tags, half attributes half characters)
- image superresolution/upscaling: SAN_pytorch (SAN trained on Danbooru2019); NatSR_pytorch (NatSR)
- object localization:
    - danbooru-faces: Jupyter notebooks for cropping and processing anime faces using Nagadomi’s lbpcascade_animeface (see also Nagadomi’s moeimouto face dataset on Kaggle)
    - danbooru-utility: Python script which aims to help “explore the dataset, filter by tags, rating, and score, detect faces, and resize the images”
    - AniSeg: A TensorFlow faster-rcnn model for anime character face detection & portrait segmentation; I’ve mirrored the manually-segmented anime figure dataset & the face/figure segmentation models:

        rsync --verbose rsync://78.46.86.149:873/biggan/2019-04-29-jerryli27-aniseg-figuresegmentation-dataset.tar ./
        rsync --verbose rsync://78.46.86.149:873/biggan/2019-04-29-jerryli27-aniseg-models-figurefacecrop.tar.xz ./

    - light-anime-face-detector, Cheese Roll (fast LFFD model distilling Anime-Face-Detector to run at 100 FPS/GPU & 10 FPS/CPU)
- SQLite database metadata conversion (based on jxu)
- GUI tag browser (Tkinter Python 3 GUI for local browsing of tagged images)
- Danbooru-Dataset-Maker, Atom-101: “Helper scripts to download images with specific tags from the Danbooru dataset.” (Queries metadata for included/excluded tags, and builds a list to download just matching images with rsync.)
Publications
Research:
“Improving Shape Deformation in Unsupervised Image-to-Image Translation”, Gokaslan et al 2018:
Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, plus in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.
“Two Stage Sketch Colorization”, Zhang et al 2018 (on style2paints, version 3):
Sketch or line art colorization is a research field with significant market demand. Different from photo colorization which strongly relies on texture information, sketch colorization is more challenging as sketches may not have texture. Even worse, color, texture, and gradient have to be generated from the abstract sketch lines. In this paper, we propose a semi-automatic learning-based framework to colorize sketches with proper color, texture as well as gradient. Our framework consists of two stages. In the first drafting stage, our model guesses color regions and splashes a rich variety of colors over the sketch to obtain a color draft. In the second refinement stage, it detects the unnatural colors and artifacts, and try to fix and refine the result. Comparing to existing approaches, this two-stage design effectively divides the complex colorization task into two simpler and goal-clearer subtasks. This eases the learning and raises the quality of colorization. Our model resolves the artifacts such as water-color blurring, color distortion, and dull textures.
We build an interactive software based on our model for evaluation. Users can iteratively edit and refine the colorization. We evaluate our learning model and the interactive system through an extensive user study. Statistics shows that our method outperforms the state-of-art techniques and industrial applications in several aspects including, the visual quality, the ability of user control, user experience, and other metrics.
A review comparing CycleGAN & UNIT for unsupervised image-to-image translation:
Image-to-Image translation is a collection of computer vision problems that aim to learn a mapping between two different domains or multiple domains. Recent research in computer vision and deep learning produced powerful tools for the task. Conditional adversarial networks serve as a general-purpose solution for image-to-image translation problems. Deep Convolutional Neural Networks can learn an image representation that can be applied for recognition, detection, and segmentation. Generative Adversarial Networks (GANs) has gained success in image synthesis. However, traditional models that require paired training data might not be applicable in most situations due to lack of paired data.
Here we review and compare two different models for learning unsupervised image to image translation: CycleGAN and Unsupervised Image-to-Image Translation Networks (UNIT). Both models adopt cycle consistency, which enables us to conduct unsupervised learning without paired data. We show that both models can successfully perform image style translation. The experiments reveal that CycleGAN can generate more realistic results, and UNIT can generate varied images and better preserve the structure of input images.
“Image Generation from Small Datasets via Batch Statistics Adaptation”, Noguchi & Harada 2019 (Danbooru2018 by way of StyleGAN/TWDNE-generated images):
Thanks to the recent development of deep generative models, it is becoming possible to generate high-quality images with both fidelity and diversity. However, the training of such generative models requires a large dataset. To reduce the amount of data required, we propose a new method for transferring prior knowledge of the pre-trained generator, which is trained with a large dataset, to a small dataset in a different domain. Using such prior knowledge, the model can generate images leveraging some common sense that cannot be acquired from a small dataset. In this work, we propose a novel method focusing on the parameters for batch statistics, scale and shift, of the hidden layers in the generator. By training only these parameters in a supervised manner, we achieved stable training of the generator, and our method can generate higher quality images compared to previous methods without collapsing even when the dataset is small (~100). Our results show that the diversity of the filters acquired in the pre-trained generator is important for the performance on the target domain. By our method, it becomes possible to add a new class or domain to a pre-trained generator without disturbing the performance on the original domain.
“Spatially Controllable Image Synthesis with Internal Representation Collaging”, Suzuki et al 2018:
We present a novel CNN-based image editing strategy that allows the user to change the semantic information of an image over an arbitrary region by manipulating the feature-space representation of the image in a trained GAN model. We will present two variants of our strategy: (1) spatial conditional batch normalization (sCBN), a type of conditional batch normalization with user-specifiable spatial weight maps, and (2) feature-blending, a method of directly modifying the intermediate features. Our methods can be used to edit both artificial image and real image, and they both can be used together with any GAN with conditional normalization layers. We will demonstrate the power of our method through experiments on various types of GANs trained on different datasets. Code will be available at this URL.
“MineGAN: effective knowledge transfer from GANs to target domains with few images”, Wang et al 2019:
One of the attractive characteristics of deep neural networks is their ability to transfer knowledge obtained in one domain to other related domains. As a result, high-quality networks can be trained in domains with relatively little training data. This property has been extensively studied for discriminative networks but has received significantly less attention for generative models.Given the often enormous effort required to train GANs, both computationally as well as in the dataset collection, the re-use of pretrained GANs is a desirable objective. We propose a novel knowledge transfer method for generative models based on mining the knowledge that is most beneficial to a specific target domain, either from a single or multiple pretrained GANs. This is done using a miner network that identifies which part of the generative distribution of each pretrained GAN outputs samples closest to the target domain. Mining effectively steers GAN sampling towards suitable regions of the latent space, which facilitates the posterior finetuning and avoids pathologies of other methods such as mode collapse and lack of flexibility. We perform experiments on several complex datasets using various GAN architectures (BigGAN, Progressive GAN) and show that the proposed method, called MineGAN, effectively transfers knowledge to domains with few target images, outperforming existing methods. In addition, MineGAN can successfully transfer knowledge from multiple pretrained GANs.
“Tag2Pix: Line Art Colorization Using Text Tag With SECat and Changing Loss”, Kim et al 2019b (Tag2Pix CLI/GUI):
Line art colorization is expensive and challenging to automate. A GAN approach is proposed, called Tag2Pix, of line art colorization which takes as input a grayscale line art and color tag information and produces a quality colored image. First, we present the Tag2Pix line art colorization dataset. A generator network is proposed which consists of convolutional layers to transform the input line art, a pre-trained semantic extraction network, and an encoder for input color information. The discriminator is based on an auxiliary classifier GAN to classify the tag information as well as genuineness. In addition, we propose a novel network structure called SECat, which makes the generator properly colorize even small features such as eyes, and also suggest a novel two-step training method where the generator and discriminator first learn the notion of object and shape and then, based on the learned notion, learn colorization, such as where and how to place which color. We present both quantitative and qualitative evaluations which prove the effectiveness of the proposed method.
“Reference-Based Sketch Image Colorization using Augmented-Self Reference and Dense Semantic Correspondence”, Lee et al 2020:
This paper tackles the automatic colorization task of a sketch image given an already-colored reference image. Colorizing a sketch image is in high demand in comics, animation, and other content creation applications, but it suffers from information scarcity of a sketch image. To address this, a reference image can render the colorization process in a reliable and user-driven manner. However, it is difficult to prepare for a training data set that has a sufficient amount of semantically meaningful pairs of images as well as the ground truth for a colored image reflecting a given reference (e.g., coloring a sketch of an originally blue car given a reference green car). To tackle this challenge, we propose to utilize the identical image with geometric distortion as a virtual reference, which makes it possible to secure the ground truth for a colored output image. Furthermore, it naturally provides the ground truth for dense semantic correspondence, which we utilize in our internal attention mechanism for color transfer from reference to sketch input. We demonstrate the effectiveness of our approach in various types of sketch image colorization via quantitative as well as qualitative evaluation against existing methods.
“Disentangling Style and Content in Anime Illustrations”, Xiang & Li 2019 (?)
“CartoonRenderer: An Instance-based Multi-Style Cartoon Image Translator”, Chen et al 2019:
Instance based photo cartoonization is one of the challenging image stylization tasks which aim at transforming realistic photos into cartoon style images while preserving the semantic contents of the photos. State-of-the-art Deep Neural Networks (DNNs) methods still fail to produce satisfactory results with input photos in the wild, especially for photos which have high contrast and full of rich textures. This is due to that: cartoon style images tend to have smooth color regions and emphasized edges which are contradict to realistic photos which require clear semantic contents, i.e., textures, shapes etc. Previous methods have difficulty in satisfying cartoon style textures and preserving semantic contents at the same time. In this work, we propose a novel “CartoonRenderer” framework which utilizing a single trained model to generate multiple cartoon styles. In a nutshell, our method maps photo into a feature model and renders the feature model back into image space. In particular, cartoonization is achieved by conducting some transformation manipulation in the feature space with our proposed Soft-AdaIN. Extensive experimental results show our method produces higher quality cartoon style images than prior arts, with accurate semantic content preservation. In addition, due to the decoupling of whole generating process into “Modeling-Coordinating-Rendering” parts, our method could easily process higher resolution photos, which is intractable for existing methods.
“Unpaired Sketch-to-Line Translation via Synthesis of Sketches”, Lee et al 2019:
Converting hand-drawn sketches into clean line drawings is a crucial step for diverse artistic works such as comics and product designs. Recent data-driven methods using deep learning have shown their great abilities to automatically simplify sketches on raster images. Since it is difficult to collect or generate paired sketch and line images, lack of training data is a main obstacle to use these models. In this paper, we propose a training scheme that requires only unpaired sketch and line images for learning sketch-to-line translation. To do this, we first generate realistic paired sketch and line images from unpaired sketch and line images using rule-based line augmentation and unsupervised texture conversion. Next, with our synthetic paired data, we train a model for sketch-to-line translation using supervised learning. Compared to unsupervised methods that use cycle consistency losses, our model shows better performance at removing noisy strokes. We also show that our model simplifies complicated sketches better than models trained on a limited number of handcrafted paired data.
A regression study of user participation on Danbooru (“nonlinearly directed imageboards”):
While linearly directed imageboards like 4chan have been extensively studied, user participation on nonlinearly directed imageboards, or “boorus,” has been overlooked despite high activity, expansive multimedia repositories with user-defined classifications and evaluations, and unique affordances prioritizing mutual content curation, evaluation, and refinement over overt discourse. To address the gap in the literature related to participatory engagement on nonlinearly directed imageboards, user activity around the full database of N = 2,987,525 submissions to Danbooru, a prominent nonlinearly directed imageboard, was evaluated using regression. The results illustrate the role played by the affordances of nonlinearly directed imageboards and the visible attributes of individual submissions in shaping the user processes of content curation, evaluation, and refinement, as well as the interrelationships between these three core activities. These results provide a foundation for further research within the unique environments of nonlinearly directed imageboards and suggest practical applications across online domains.
“Interactive Anime Sketch Colorization with Style Consistency via a Deep Residual Neural Network”, Ye et al 2019:
Anime line sketch colorization is to fill a variety of colors the anime sketch, to make it colorful and diverse. The coloring problem is not a new research direction in the field of deep learning technology. Because of coloring of the anime sketch does not have fixed color and we can’t take texture or shadow as reference, so it is difficult to learn and have a certain standard to determine whether it is correct or not. After generative adversarial networks (GANs) was proposed, some used GANs to do coloring research, achieved some result, but the coloring effect is limited. This study proposes a method use deep residual network, and adding discriminator to network, that expect the color of colored images can consistent with the desired color by the user and can achieve good coloring results.
“Semantic Example Guided Image-to-Image Translation”, Huang et al 2019:
Many image-to-image (I2I) translation problems are in nature of high diversity that a single input may have various counterparts. Prior works proposed the multi-modal network that can build a many-to-many mapping between two visual domains. However, most of them are guided by sampled noises. Some others encode the reference images into a latent vector, by which the semantic information of the reference image will be washed away. In this work, we aim to provide a solution to control the output based on references semantically. Given a reference image and an input in another domain, a semantic matching is first performed between the two visual contents and generates the auxiliary image, which is explicitly encouraged to preserve semantic characteristics of the reference. A deep network then is used for I2I translation and the final outputs are expected to be semantically similar to both the input and the reference; however, no such paired data can satisfy that dual-similarity in a supervised fashion, so we build up a self-supervised framework to serve the training purpose. We improve the quality and diversity of the outputs by employing non-local blocks and a multi-task architecture. We assess the proposed method through extensive qualitative and quantitative evaluations and also presented comparisons with several state-of-art models.
“Anime Sketch Coloring with Swish-gated Residual U-net and Spectrally Normalized GAN”, Liu et al 2019:
Anime sketch coloring is to fill various colors into the black-and-white anime sketches and finally obtain the color anime images. Recently, anime sketch coloring has become a new research hotspot in the field of deep learning. In anime sketch coloring, generative adversarial networks (GANs) have been used to design appropriate coloring methods and achieved some results. However, the existing methods based on GANs generally have low-quality coloring effects, such as unreasonable color mixing, poor color gradient effect. In this paper, an efficient anime sketch coloring method using swish-gated residual U-net (SGRU) and spectrally normalized GAN (SNGAN) has been proposed to solve the above problems. The proposed method is called spectrally normalized GAN with swish-gated residual U-net (SSN-GAN). In SSN-GAN, SGRU is used as the generator. SGRU is the U-net with the proposed swish layer and swish-gated residual blocks (SGBs). In SGRU, the proposed swish layer and swish-gated residual blocks (SGBs) effectively filter the information transmitted by each level and improve the performance of the network. The perceptual loss and the per-pixel loss are used to constitute the final loss of SGRU. The discriminator of SSN-GAN uses spectral normalization as a stabilizer of training of GAN, and it is also used as the perceptual network for calculating the perceptual loss. SSN-GAN can automatically color the sketch without providing any coloring hints in advance and can be easily end-to-end trained. Experimental results show that our method performs better than other state-of-the-art coloring methods, and can obtain colorful anime images with higher visual quality.
“Classification Representations Can be Reused for Downstream Generations”, Gopalakrishnan et al 2020:
Contrary to the convention of using supervision for class-conditioned generative modeling, this work explores and demonstrates the feasibility of a learned supervised representation space trained on a discriminative classifier for the downstream task of sample generation. Unlike generative modeling approaches that aim to model the manifold distribution, we directly represent the given data manifold in the classification space and leverage properties of latent space representations to generate new representations that are guaranteed to be in the same class. Interestingly, such representations allow for controlled sample generations for any given class from existing samples and do not require enforcing prior distribution. We show that these latent space representations can be smartly manipulated (using convex combinations of n samples, n≥2) to yield meaningful sample generations. Experiments on image datasets of varying resolutions demonstrate that downstream generations have higher classification accuracy than existing conditional generative models while being competitive in terms of FID.
“Avatar Artist Using GAN”, Su & Fang 2020 (CS230 class project; source):
Human sketches can be expressive and abstract at the same time. Generating anime avatars from simple or even bad face drawing is an interesting area. Lots of related work has been done such as auto-coloring sketches to anime or transforming real photos to anime. However, there aren’t many interesting works yet to show how to generate anime avatars from just some simple drawing input. In this project, we propose using GAN to generate anime avatars from sketches.
“MDSG: Multi-Density Sketch-to-Image Translation Network”, Huang et al 2020:
Sketch-to-image (S2I) translation plays an important role in image synthesis and manipulation tasks, such as photo editing and colorization. Some specific S2I translation including sketch-to-photo and sketch-to-painting can be used as powerful tools in the art design industry. However, previous methods only support S2I translation with a single level of density, which gives less flexibility to users for controlling the input sketches. In this work, we propose the first multi-level density sketch-to-image translation framework, which allows the input sketch to cover a wide range from rough object outlines to micro structures. Moreover, to tackle the problem of noncontinuous representation of multi-level density input sketches, we project the density level into a continuous latent space, which can then be linearly controlled by a parameter. This allows users to conveniently control the densities of input sketches and generation of images. Moreover, our method has been successfully verified on various datasets for different applications including face editing, multi-modal sketch-to-photo translation, and anime colorization, providing coarse-to-fine levels of controls to these applications.
“Deep–Eyes: Fully Automatic Anime Character Colorization with Painting of Details on Empty Pupils”, Akita et al 2020:
Many studies have recently applied deep learning to the automatic colorization of line drawings. However, it is difficult to paint empty pupils using existing methods because the networks are trained with pupils that have edges, which are generated from color images using image processing. Most actual line drawings have empty pupils that artists must paint in. In this paper, we propose a novel network model that transfers the pupil details in a reference color image to input line drawings with empty pupils. We also propose a method for accurately and automatically coloring eyes. In this method, eye patches are extracted from a reference color image and automatically added to an input line drawing as color hints using our eye position estimation network.
“Colorization of Line Drawings with Empty Pupils”, Akita et al 2020b:
Many studies have recently applied deep learning to the automatic colorization of line drawings. However, it is difficult to paint empty pupils using existing methods because the convolutional neural networks are trained with pupils that have edges, which are generated from color images using image processing. Most actual line drawings have empty pupils that artists must paint in. In this paper, we propose a novel network model that transfers the pupil details in a reference color image to input line drawings with empty pupils. We also propose a method for accurately and automatically colorizing eyes. In this method, eye patches are extracted from a reference color image and automatically added to an input line drawing as color hints using our pupil position estimation network.
“DanbooRegion: An Illustration Region Dataset”, Zhang et al 2020 (Github):
Region is a fundamental element of various cartoon animation techniques and artistic painting applications. Achieving satisfactory regions is essential to the success of these techniques. Motivated to assist diverse region-based cartoon applications, we invite artists to annotate regions for in-the-wild cartoon images with several application-oriented goals: (1) To assist image-based cartoon rendering, relighting, and cartoon intrinsic decomposition literature, artists identify object outlines and eliminate lighting-and-shadow boundaries. (2) To assist cartoon inking tools, cartoon structure extraction applications, and cartoon texture processing techniques, artists clean up texture or deformation patterns and emphasize cartoon structural boundary lines. (3) To assist region-based cartoon digitalization, clip-art vectorization, and animation tracking applications, artists inpaint and reconstruct broken or blurred regions in cartoon images. Given the typicality of these involved applications, this dataset is also likely to be used in other cartoon techniques. We detail the challenges in achieving this dataset and present a human-in-the-loop workflow named Feasibility-based Assignment Recommendation (FAR) to enable large-scale annotating. The FAR tends to reduce artist trial-and-error and encourage their enthusiasm during annotating. Finally, we present a dataset that contains a large number of artistic region compositions paired with corresponding cartoon illustrations. We also invite multiple professional artists to assure the quality of each annotation. [Keywords: artistic creation, fine art, cartoon, region processing]
“SickZil–Machine (SZMC): A Deep Learning Based Script Text Isolation System for Comics Translation”, Ko & Cho 2020 (Github):
The translation of comics (and Manga) involves removing text from foreign comic images and typesetting translated letters into them. The text in comics contains a variety of deformed letters drawn in arbitrary positions, in complex images or patterns. These letters have to be removed by experts, as computationally erasing these letters is very challenging. Although several classical image processing algorithms and tools have been developed, a completely automated method that could erase the text is still lacking. Therefore, we propose an image processing framework called ‘SickZil-Machine’ (SZMC) that automates the removal of text from comics. SZMC works through a two-step process. In the first step, the text areas are segmented at the pixel level. In the second step, the letters in the segmented areas are erased and inpainted naturally to match their surroundings. SZMC exhibited a notable performance, employing deep learning based image segmentation and image inpainting models. To train these models, we constructed 285 pairs of original comic pages, a text area-mask dataset, and a dataset of 31,497 comic pages. We identified the characteristics of the dataset that could improve SZMC performance.
“Unconstrained Text Detection in Manga”, Del Gobbo & Herrera 2020:
The detection and recognition of unconstrained text is an open problem in research. Text in comic books has unusual styles that raise many challenges for text detection. This work aims to identify text characters at a pixel level in a comic genre with highly sophisticated text styles: Japanese manga. To overcome the lack of a manga dataset with individual character-level annotations, we create our own. Most of the literature in text detection uses bounding-box metrics, which are unsuitable for pixel-level evaluation. Thus, we implemented special metrics to evaluate performance. Using these resources, we designed and evaluated a deep network model, outperforming current methods for text detection in manga in most metrics.
“Learning from the Past: Meta-Continual Learning with Knowledge Embedding for Jointly Sketch, Cartoon, and Caricature Face Recognition”, Zheng et al 2020:
This paper deals with the challenging task of learning from different modalities by tackling the difficult problem of joint face recognition across abstract sketches, cartoons, caricatures, and real-life photographs. Due to the significant variations in the abstract faces, building vision models for recognizing data from these modalities is extremely challenging. We propose a novel framework termed Meta-Continual Learning with Knowledge Embedding to address the task of joint sketch, cartoon, and caricature face recognition. In particular, we first present a deep relational network to capture and memorize the relations among different samples. Secondly, we present the construction of our knowledge graph that relates image with the label as the guidance of our meta-learner. We then design a knowledge embedding mechanism to incorporate the knowledge representation into our network. Thirdly, to mitigate catastrophic forgetting, we use a meta-continual model that updates our ensemble model and improves its prediction accuracy. With this meta-continual model, our network can learn from its past. The final classification is derived from our network by learning to compare the features of samples. Experimental results demonstrate that our approach achieves significantly higher performance compared with other state-of-the-art approaches.
“Deep learning-based classification of the polar emotions of ‘moe’-style cartoon pictures”, Cao et al 2020:
The cartoon animation industry has developed into a huge industrial chain with a large potential market involving games, digital entertainment, and other industries. However, due to the coarse-grained classification of cartoon materials, cartoon animators can hardly find relevant materials during the process of creation. The polar emotions of cartoon materials are an important reference for creators as they can help them easily obtain the pictures they need. Some methods for obtaining the emotions of cartoon pictures have been proposed, but most of these focus on expression recognition. Meanwhile, other emotion recognition methods are not ideal for use on cartoon materials. We propose a deep learning-based method to classify the polar emotions of cartoon pictures of the “Moe” drawing style. According to the expression features of the cartoon characters of this drawing style, we recognize the facial expressions of cartoon characters and extract the scene and facial features of the cartoon images. Then, we correct the emotions of the pictures obtained by the expression recognition according to the scene features. Finally, we can obtain the polar emotions of the corresponding picture. We designed a dataset and performed verification tests on it, achieving 81.9% experimental accuracy. The experimental results prove that our method is competitive. [Keywords: cartoon; emotion classification; deep learning]
“Unsupervised Image-to-Image Translation via Pre-trained StyleGAN2 Network”, Huang et al 2020:
Image-to-Image (I2I) translation is a heated topic in academia, and it also has been applied in real-world industry for tasks like image synthesis, super-resolution, and colorization. However, traditional I2I translation methods train data in two or more domains together. This requires lots of computational resources. Moreover, the results are of lower quality, and they contain many more artifacts. The training process could be unstable when the data in different domains are not balanced, and mode collapse is more likely to happen. We proposed a new I2I translation method that generates a new model in the target domain via a series of model transformations on a pre-trained StyleGAN2 model in the source domain. After that, we proposed an inversion method to achieve the conversion between an image and its latent vector. By feeding the latent vector into the generated model, we can perform I2I translation between the source domain and target domain. Both qualitative and quantitative evaluations were conducted to prove that the proposed method can achieve outstanding performance in terms of image quality, diversity and semantic similarity to the input and reference images compared to state-of-the-art works.
“FSGAN: Few-Shot Adaptation of Generative Adversarial Networks”, Robb et al 2020:
Generative Adversarial Networks (GANs) have shown remarkable performance in image synthesis tasks, but typically require a large number of training samples to achieve high-quality synthesis. This paper proposes a simple and effective method, Few-Shot GAN (FSGAN), for adapting GANs in few-shot settings (less than 100 images). FSGAN repurposes component analysis techniques and learns to adapt the singular values of the pre-trained weights while freezing the corresponding singular vectors. This provides a highly expressive parameter space for adaptation while constraining changes to the pretrained weights. We validate our method in a challenging few-shot setting of 5–100 images in the target domain. We show that our method has significant visual quality gains compared with existing GAN adaptation methods. We report qualitative and quantitative results showing the effectiveness of our method. We additionally highlight a problem for few-shot synthesis in the standard quantitative metric used by data-efficient image synthesis works. Code and additional results are available at this URL.
“Watermarking Neural Networks with Watermarked Images”, Wu et al 2020:
Watermarking neural networks is a quite important means to protect the intellectual property (IP) of neural networks. In this paper, we introduce a novel digital watermarking framework suitable for deep neural networks that output images as the results, in which any image outputted from a watermarked neural network must contain a certain watermark. Here, the host neural network to be protected and a watermark-extraction network are trained together, so that, by optimizing a combined loss function, the trained neural network can accomplish the original task while embedding a watermark into the outputted images. This work is totally different from previous schemes carrying a watermark by network weights or classification labels of the trigger set. By detecting watermarks in the outputted images, this technique can be adopted to identify the ownership of the host network and find whether an image is generated from a certain neural network or not. We demonstrate that this technique is effective and robust on a variety of image processing tasks, including image colorization, super-resolution, image editing, semantic segmentation and so on.
“Network-to-Network Translation with Conditional Invertible Neural Networks”, Rombach et al 2020 (using Portraits):
Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. This network demonstrates its capability by (1) providing generic transfer between diverse domains, (2) enabling controlled content synthesis by allowing modification in other domains, and (3) facilitating diagnosis of existing representations by translating them into interpretable domains such as images. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. For example, we translate between BERT and BigGAN, state-of-the-art text and image models to provide text-to-image generation, which neither of both experts can perform on their own.
“Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis”, Anonymous et al 2020:
A computationally-efficient GAN for few-shot high-fidelity image datasets (converges on a single GPU with a few hours’ training, on 1024px-resolution datasets of under a hundred images). · Training Generative Adversarial Networks (GAN) on high-fidelity images usually requires large-scale GPU-clusters and a vast number of training images. In this paper, we study the few-shot image synthesis task for GAN with minimum computing cost. We propose a light-weight GAN structure that gains superior quality on 1024×1024px resolution. Notably, the model converges from scratch with just a few hours of training on a single RTX-2080 GPU; and has a consistent performance, even with less than 100 training samples. Two technique designs constitute our work: a skip-layer channel-wise excitation module and a self-supervised discriminator trained as a feature-encoder. With 13 datasets covering a wide variety of image domains, we show our model’s robustness and its superior performance compared to the state-of-the-art StyleGAN2.
“Semantics Factorization (SeFa): Closed-Form Factorization of Latent Semantics in GANs”, Shen & Zhou 2020:
A rich set of semantic attributes has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images. In order to identify such latent semantics for image manipulation, previous methods annotate a collection of synthesized samples and then train supervised classifiers in the latent space. However, they require a clear definition of the target attribute as well as the corresponding manual annotations, severely limiting their applications in practice. In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner. By studying the essential role of the fully-connected layer that takes the latent code into the generator of GANs, we propose a general closed-form factorization method for latent semantic discovery. The properties of the identified semantics are further analyzed both theoretically and empirically. With its fast and efficient implementation, our approach is capable of not only finding latent semantics as accurately as the state-of-the-art supervised methods, but also resulting in far more versatile semantic classes across multiple GAN models trained on a wide range of datasets.
“Data Instance Prior for Transfer Learning in GANs”, Mangla et al 2020:
Recent advances in generative adversarial networks (GANs) have shown remarkable progress in generating high-quality images. However, this gain in performance depends on the availability of a large amount of training data. In limited data regimes, training typically diverges, and therefore the generated samples are of low quality and lack diversity. Previous works have addressed training in low data settings by leveraging transfer learning and data augmentation techniques. We propose a novel transfer learning method for GANs in the limited data domain by leveraging an informative data prior derived from self-supervised/supervised pre-trained networks trained on a diverse source domain. We perform experiments on several standard vision datasets using various GAN architectures (BigGAN, SNGAN, StyleGAN2) to demonstrate that the proposed method effectively transfers knowledge to domains with few target images, outperforming existing state-of-the-art techniques in terms of image quality and diversity. We also show the utility of the data instance prior in large-scale unconditional image generation and image editing tasks.
Scraping
This project is not officially affiliated with or run by Danbooru; however, the site founder Albert (and his successor, Evazion) has given permission for scraping. I have registered the accounts gwern and gwern-bot for use in downloading & participating on Danbooru; it is considered good research ethics to try to offset any use of resources when crawling an online community (eg DNM scrapers try to run Tor nodes to pay back the bandwidth), so I have donated $20 to Danbooru via an account upgrade.
Danbooru IDs are sequential positive integers, but the images are stored under their MD5 hashes; so downloading the full-size images can be done by querying the JSON API for an ID’s metadata, extracting the URL of the full upload, and downloading it to a file named after the ID plus extension.
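For example, a minimal shell sketch of this loop (not the official mirroring tooling; it assumes the public posts/<id>.json endpoint and its file_url & file_ext fields, and requires curl & jq):

## hypothetical example: download the full-size image for a single post ID
id=4279845
curl --silent "https://danbooru.donmai.us/posts/$id.json" > "$id.json"
url=$(jq --raw-output '.file_url' "$id.json")
ext=$(jq --raw-output '.file_ext' "$id.json")
curl --silent --output "$id.$ext" "$url"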
The metadata can also be downloaded from the BigQuery mirror using any BigQuery-API-based tool.
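For instance, with Google’s bq CLI (a sketch only: the dataset/table name below is a placeholder rather than the actual public mirror’s name, and the column names assume the standard Danbooru post schema):

## hypothetical example: pull a small sample of post metadata from the BigQuery mirror
bq --format=prettyjson query --use_legacy_sql=false \
    'SELECT id, md5, rating, tag_string FROM `danbooru-mirror.danbooru.posts` LIMIT 10'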
Bugs
Known bugs:
- Missing translation metadata: the metadata does not include the translations or bounding-boxes of captions/translations (“notes”); they were omitted from the BigQuery mirror and technical problems meant they could not be added to BQ before release. The captions/translations can be retrieved via the Danbooru API if necessary.
- 512px SFW subset transparency problem: some images have transparent backgrounds; if they are also black-white, like black line-art drawings, then the conversion to JPG with a default black background will render them almost 100% black and the image will be invisible (eg files with the two tags transparent_background lineart). This affects somewhere in the hundreds of images. Users can either ignore this as affecting a minute percentage of files, filter out images based on the tag-combination, or include data quality checks in their image-loading code to drop anomalous images with too few unique colors or which are too white/too black (see the sketch after this list). Proposed fix: in Danbooru2019+’s 512px SFW subset, the downscaling has switched to adding white backgrounds rather than black backgrounds; while the same issue can still arise in the case of white line-art drawings with transparent backgrounds, these are much rarer. (It might also be possible to make the conversion shell script query images for use of transparency, average the contents, and pick a background which is most opposite the content.)
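A sketch of such a data-quality check in shell (assuming ImageMagick is installed; the thresholds and the bucketed 512px/ directory glob are illustrative rather than canonical):

## flag 512px JPGs which are nearly all-black/all-white or have very few unique colors
for f in 512px/*/*.jpg; do
    printf '%s %s\n' "$f" "$(convert "$f" -colorspace Gray -format '%[fx:mean] %k' info:)"
done | awk '$2 < 0.02 || $2 > 0.98 || $3 < 8 { print "anomalous:", $1 }'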
Future work
Metadata Quality Improvement via Active Learning
How high-quality is the Danbooru metadata? As with ImageNet, it is critical that the tags be extremely accurate, or else this will lower-bound the error rates and impede the learning of taggers, especially on rarer tags, where even a low error rate may cause false negatives to outweigh the true positives.
I would say that the Danbooru tag data is of quite high quality but imbalanced: almost all tags on images are correct, but the absence of a tag is often wrong—that is, many tags are missing on Danbooru (there are so many possible tags that no user could possibly know them all). So the absence of a tag isn’t as informative as the presence of a tag—eyeballing images and some rarer tags, I would guess that tags are present <10% of the time they should be.
This suggests leveraging an active learning (Settles 2010) form of training: train a tagger, have a human review the errors, update the metadata when it was not an error, and retrain.
More specifically: train the tagger; run the tagger on the entire dataset, recording the outputs and errors; have a human examine the errors interactively by comparing the supposed error with the image; for false negatives, the tag can be added to the Danbooru source using the Danbooru API and added to the local image metadata database, and for false positives, the ‘negative tag’ can be added to the local database; then train a new model (possibly initializing from the last checkpoint). Since there will probably be thousands of errors, one would go through them by magnitude of error: for a false positive, start with tagging probabilities of 1.0 and go down, and for false negatives, 0.0 and go up. This would be equivalent to the active learning strategy “uncertainty sampling”, which is simple, easy to implement, and effective (albeit not necessarily optimal for active learning, as the worst errors will tend to be highly correlated/redundant).
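As a concrete sketch of that review ordering (uncertainty-sampling-style prioritization), suppose the tagger’s disagreements with the metadata were dumped as a hypothetical TSV of (image ID, tag, predicted probability, current label ∈ {0,1}); sorting by |probability − label| then puts the most confident apparent errors first:

## hypothetical input format: id <tab> tag <tab> predicted_probability <tab> current_label
awk 'BEGIN{OFS="\t"} { d = $3 - $4; if (d < 0) d = -d; print d, $0 }' disagreements.tsv \
    | sort --general-numeric-sort --reverse > review-queue.tsv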
Over multiple iterations of active learning + retraining, the procedure should be able to ferret out errors in the dataset and boost the dataset’s quality while also increasing the tagger’s performance.
Based on my experiences with semi-automatic editing on Wikipedia (using pywikipediabot for solving disambiguation wikilinks), I would estimate that given an appropriate terminal interface, a human should be able to check at least 1 error per second and so checking ~30,000 errors per day is possible (albeit extremely tedious). Fixing the top million errors should offer a noticeable increase in performance.
There are many open questions about how best to optimize tagging performance: is it better to refine tags on the existing set of images or would adding more only-partially-tagged images be more useful?
External Links
Discussion: /r/MachineLearning, /r/anime

Anime-related ML resources:

- “Deep Learning Anime Papers”
- “Awesome ACG Machine Learning”
- /r/AnimeResearch
- “E621 Face Dataset”, Arfafax
- “MyWaifuList: A dataset containing info and pictures of over 15,000 waifus” (scrape of metadata, profile image, and user votes for/against)
Appendix
Shell queries for statistics
## count number of images/files in Danbooru2020
find /media/gwern/Data2/danbooru2020/original/ -type f | wc --lines
# 4226544
## count total filesize of original fullsized images in Danbooru2020:
du -sch /media/gwern/Data2/danbooru2020/original/
# 3.4T
# on JSON files concatenated together:
## number of unique tags
cd metadata/; cat * > all.json
cat all.json | jq '.tags | .[] | .name' > tags.txt
sort -u tags.txt | wc --lines
# 392446
## number of total tags
wc --lines tags.txt
# 108029170
## Average tag count per image:
R
# 108029170 / 3692578
## [1] 29.2557584
## Most popular tags:
sort tags.txt | uniq -c | sort -g | tac | head -19
# 2617569 "1girl"
# 2162839 "solo"
# 1808646 "long_hair"
# 1470460 "highres"
# 1268611 "breasts"
# 1204519 "blush"
# 1101925 "smile"
# 1009723 "looking_at_viewer"
# 1006628 "short_hair"
# 904246 "open_mouth"
# 802786 "multiple_girls"
# 758690 "blue_eyes"
# 722932 "blonde_hair"
# 686706 "brown_hair"
# 675740 "skirt"
# 630385 "touhou"
# 606550 "large_breasts"
# 592200 "hat"
# 588769 "thighhighs"
## count Danbooru images by rating
cat all.json | jq '.rating' > ratings.txt
sort ratings.txt | uniq -c | sort -g
# 315713 "e"
# 539329 "q"
# 2853721 "s"
wc --lines ratings.txt
## 3708763 ratings.txt
R
# c(315713, 539329, 2853721) / 3708763
## [1] 0.0851262267 0.1454201846 0.7694535887
# earliest upload:
cat all.json | jq '.created_at' | fgrep '2005' > uploaded.txt
sort -g uploaded.txt | head -1
# "2005-05-24 03:35:31 UTC"
While Danbooru is not the largest anime image booru in existence—TBIB, for example, claimed >4.7m images ~2017, or almost twice as many as Danbooru2017, by mirroring from multiple boorus—Danbooru is generally considered to focus on higher-quality images & have better tagging; I suspect >4m images is into diminishing returns and the focus then ought to be on improving the metadata. Google finds (Sun et al 2017) that image classification is logarithmic in image count up to n = 300M with noisy labels (likewise other scaling papers), which I interpret as suggesting that for the rest of us with limited hard drives & compute, going past millions is not that helpful; unfortunately, that experiment doesn’t examine the impact of the noise in their categories, so one can’t guess how many images each additional tag is equivalent to for improving final accuracy. (They do compare training on equally large datasets with small vs large numbers of categories, but fine vs coarse-grained categories is not directly comparable to a fixed number of images with fewer or more tags on each image.) The impact of tag noise could be quantified by removing varying numbers of random images/tags and comparing the curve of final accuracy. As adding more images is hard but semi-automatically fixing tags with an active-learning approach should be easy, I would bet that the cost-benefit is strongly in favor of improving the existing metadata rather than adding more images from recent Danbooru uploads or other -boorus.↩︎

This is done to save >100GB of space/bandwidth; it is true that the lossless optimization will invalidate the MD5s, but note that the original MD5 hashes are available in the metadata, and many thousands of them are incorrect even on the original Danbooru server, and the files’ true hashes are inherently validated as part of the BitTorrent download process—so there is little point in anyone either checking them or trying to avoid modifying files, and lossless optimization saves a great deal.↩︎

If one is only interested in the metadata, one could run queries on the BigQuery version of the Danbooru database instead of downloading the torrent. The BigQuery database is also updated daily.↩︎
Apparently a bug due to an anti-DoS mechanism, which should be fixed.↩︎
An author of style2paints, a NN painter for anime images, notes that standard style transfer approaches (typically using an ImageNet-based CNN) fail abysmally on anime images: “All transferring methods based on Anime Classifier are not good enough because we do not have anime ImageNet”. This is interesting in part because it suggests that ImageNet CNNs are still only capturing a subset of human perception if they only work on photographs & not illustrations.↩︎

Danbooru2020 does not by default provide a “face” dataset of images cropped to just faces like that of Getchu or Nagadomi’s moeimouto; however, the tags can be used to filter down to a large set of face closeups, and Nagadomi’s face-detection code is highly effective at extracting faces from Danbooru2020 images & can be combined with waifu2× for creating large sets of large face images. Several face datasets have been constructed; see elsewhere.↩︎
See for example the pair highlighted in Sharma et al 2018, motivating them to use human dialogues to provide more descriptions/supervision.↩︎

A tagger could be integrated into the site to automatically propose tags for newly-uploaded images to be approved by the uploader; new users, unconfident or unfamiliar with the full breadth of tags, would then have the much easier task of simply checking that all the proposed tags are correct.↩︎