dataset directory

Links

“Anime Crop Datasets: Faces, Figures, & Hands”, Branwen et al 2020

Crops: “Anime Crop Datasets: Faces, Figures, & Hands”, Gwern Branwen, Arfafax, Shawn Presser, Anonymous, Danbooru Community (2020-05-10):

Description of 3 anime datasets for machine learning based on Danbooru: cropped anime faces, whole-single-character crops, and hand crops (with hand detection model).

Documentation of 3 anime datasets for machine learning based on Danbooru: 300k cropped anime faces (primarily used for StyleGAN/This Waifu Does Not Exist), 855k whole-single-character figure crops (extracted from Danbooru using AniSeg), and 58k hand crops (based on a dataset of 14k hand-annotated bounding boxes used to train a YOLOv3 hand detection model).

These datasets can be used for machine learning directly, or included as data augmentation: faces, figures, and hands are some of the most noticeable features of anime images, and by cropping images down to just those 3 features, they can enhance modeling of those by eliminating distracting context, zooming in, and increasing the weight during training.
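As a rough illustration of the cropping step (not the authors' actual pipeline, which relies on AniSeg and a YOLOv3 detector to produce the boxes), the sketch below takes a detected bounding box, pads it by a fractional margin, and clamps it to the image bounds. The function name `padded_crop_box` and the `margin` parameter are hypothetical and chosen only for this example.

```python
def padded_crop_box(bbox, image_size, margin=0.25):
    """Expand a (left, top, right, bottom) box by a fractional margin
    on each side, then clamp it so it stays inside the image.

    `margin` is a hypothetical knob for this sketch, not a documented
    parameter of the datasets described above.
    """
    left, top, right, bottom = bbox
    width, height = image_size
    pad_x = (right - left) * margin   # horizontal padding in pixels
    pad_y = (bottom - top) * margin   # vertical padding in pixels
    return (
        max(0, int(left - pad_x)),
        max(0, int(top - pad_y)),
        min(width, int(right + pad_x)),
        min(height, int(bottom + pad_y)),
    )


# A 200x200px detection near the right edge of a 320x400px image.
box = padded_crop_box((100, 50, 300, 250), (320, 400))
print(box)
```

The resulting 4-tuple uses the (left, upper, right, lower) ordering, so it could be handed to an image library's crop routine (for example Pillow's `Image.crop`) if one is available.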

“Danbooru2021: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, Branwen 2015

Danbooru2021: “Danbooru2021: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset”, Gwern Branwen (2015-12-15):

Danbooru2021 is a large-scale anime image database with 4.9m+ images annotated with 162m+ tags; it can be useful for machine learning purposes such as image recognition and generation.

Deep learning for computer vision relies on large annotated datasets. Classification/categorization has benefited from the creation of ImageNet, which classifies 1m photos into 1000 categories. But classification/categorization is a coarse description of an image which limits application of classifiers, and there is no comparably large dataset of images with many tags or labels which would allow learning and detecting much richer information about images. Such a dataset would ideally be >1m images with at least 10 descriptive tags each which can be publicly distributed to all interested researchers, hobbyists, and organizations. There are currently no such public datasets, as ImageNet, Birds, Flowers, and MS COCO fall short on image count, on tag count, or because of restricted distribution. I suggest that the “image boorus” be used. The image boorus are long-standing web databases which host large numbers of images which can be ‘tagged’ or labeled with an arbitrary number of textual descriptions; they were developed for and are most popular among fans of anime, who provide detailed annotations.

The best known booru, with a focus on quality, is Danbooru. We provide an rsync mirror which contains ~4.5TB of 4.9m images with 162m tag instances (of 498k defined tags, ~32/image) covering Danbooru 2005-05-24–2021-12-31 (final ID: #5,020,995), providing the image files & a JSONL export of the metadata. We also provide a smaller torrent of SFW images downscaled to 512×512px JPGs (0.39TB; 3,789,092 images) for convenience. (Total: 4.9TB.)
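Because the metadata ships as JSONL (one JSON object per line), it can be streamed without loading the whole export into memory. The sketch below counts images, tag instances, and per-tag frequencies. The record schema is an assumption made for illustration: the `tags` field name and the choice of tags stored either as plain strings or as objects with a `name` key are hypothetical, so check the actual export before relying on them.

```python
import io
import json
from collections import Counter


def tag_stats(lines, tags_field="tags"):
    """Stream JSONL records; return (image_count, tag_instance_count, Counter).

    ASSUMPTION: each record holds its tags under `tags_field`, either as
    strings or as dicts with a "name" key. The real export may differ.
    """
    images = 0
    instances = 0
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        tags = record.get(tags_field, [])
        names = [t["name"] if isinstance(t, dict) else t for t in tags]
        images += 1
        instances += len(names)
        counts.update(names)
    return images, instances, counts


# Tiny made-up sample standing in for a real metadata file.
sample = io.StringIO(
    '{"id": "1", "tags": [{"name": "1girl"}, {"name": "solo"}]}\n'
    '{"id": "2", "tags": [{"name": "solo"}]}\n'
)
images, instances, counts = tag_stats(sample)
print(images, instances, instances / images, counts.most_common(1))
```

On a real export, one would pass an open file handle instead of the `io.StringIO` sample; the ratio `instances / images` is the same kind of per-image tag average quoted above.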

Our hope is that the Danbooru2021 dataset can be used for rich large-scale classification/tagging & learned embeddings, to test the transferability of existing computer vision techniques (primarily developed using photographs) to illustration/anime-style images, to provide an archival backup for the Danbooru community, to feed back metadata improvements & corrections, and to serve as a testbed for advanced techniques such as conditional image generation or style transfer.

“Darknet Market Archives (2013–2015)”, Branwen 2013

DNM-archives: “Darknet Market Archives (2013–2015)”, Gwern Branwen (2013-12-01):

Mirrors of ~89 Tor-Bitcoin darknet markets & forums 2011–2015, and related material.

Dark Net Markets (DNM) are online markets typically hosted as Tor hidden services providing escrow services between buyers & sellers transacting in Bitcoin or other cryptocoins, usually for drugs or other illegal/​regulated goods; the most famous DNM was Silk Road 1, which pioneered the business model in 2011.

From 2013–2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; these scrapes covered vendor pages, feedback, images, etc. In addition, I made or obtained copies of as many other datasets & documents related to the DNMs as I could.

This uniquely comprehensive collection is now publicly released as a 50GB (~1.6TB uncompressed) collection covering 89 DNMs & 37+ related forums, representing <4,438 mirrors, and is available for any research.

This page documents the download, contents, interpretation, and technical methods behind the scrapes.