all 79 comments

[–]gwern[S] 46 points47 points  (0 children)

(Happy Valentine's Day.)

[–]MrEldritch 29 points30 points  (9 children)

In contrast, the Danbooru dataset is larger than ImageNet as a whole and larger than the current largest multi-description dataset, MS COCO, with far richer metadata than the "subject verb object" sentence summary that is dominant in MS COCO or the birds dataset (sentences which could be adequately summarized in perhaps 5 tags). While the Danbooru community does focus heavily on female anime characters, they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included. It is unlikely that the performance ceiling will be reached anytime soon, and advanced techniques such as attention will likely be required to get anywhere near the ceiling. And Danbooru is constantly expanding and can be easily updated by anyone anywhere, allowing for regular releases of improved annotations.

I really hope this doesn't end up getting ignored as "not serious" or "not respectable" as a dataset. I agree with /u/gwern - I think this really could be a very useful dataset for more advanced image understanding techniques, and I frankly don't see how we could otherwise get a better dataset with similarly rich metadata for any feasible amount of effort.

[–]gwern[S] 18 points19 points  (6 children)

Yeah. The first time I ran into Danbooru my reaction was amazement: 'what an incredible amount of metadata all hand-contributed by humans! and there's even an API? how is no one using this for computer vision AI yet?!' I waited 3 years and still nothing comes close to the density of annotation (even if there are various datasets which have more images), so since no one else was going to do it, I did.

[–]MrEldritch 18 points19 points  (4 children)

In all seriousness, the next best dataset if this takes off might be e621. It is ... almost exclusively furry pornography, but the dataset is of the same order of magnitude in size (1.3m vs. 2.9m) and MUCH more diverse (in both content and visual style) than Danbooru. And the tagging is comparable in quality, if not better - although for the 'long tail' of more obscure tags, the problem persists that a tag's absence doesn't necessarily signify its negation.

An example task - one that's tricky even for human taggers - would be tagging character gender. There isn't enough of a bias to declare a "default" gender that images are assumed to contain (like Danbooru's focus on anime girls), and there is an extremely large segment of confusing, ambiguous cases - intersex characters, extremely feminine-looking male characters, transgender characters, etc. However, because of e621's 'tag what you see' policy, all tagging decisions are ultimately based only on what's available in the post itself, and not external context, so the correct answers are still decidable based only on pixels. (The quantity of intersex characters alone makes tagging gender a more challenging task, requiring relational reasoning - you cannot infer what gender characters an image contains only from the presence of obvious primary or secondary sexual characteristics; you also have to take into account whether they're attached to the same person.)

But you would never, ever get anyone to take it seriously.

[–]gwern[S] 11 points12 points  (0 children)

The diversity is a good point, beyond just 'more dakka!'. But I thought a little about the broader idea of grabbing images from multiple boorus and concluded that it's probably pointless overall. It's like that Google paper on 'revisiting the unreasonable effectiveness of data' and showing CNN classification gains continue up to n=300m - yeah, that's great, but even Google can only just barely train a CNN to convergence on n=300m much less tweak it or do research in that regime. 2.9m images is already more than most people can swallow. Adding more images to that will just slow everything down more as you need that many more minibatches/computation; but if you can improve the metadata, you should be able to train a better CNN faster with the same computing power (less noisy gradients, bigger steps), which is why I think that's the right direction for improving Danbooru2017, until such time as we all have TPU pods with 200+ TFLOPs.

But you would never, ever get anyone to take it seriously.

Yep... I feel sorry for the furries. Truly the bottom rung of the nerd totem pole.

[–][deleted] 4 points5 points  (0 children)

although for the 'long tail'

I see what you did there.

[–]Muffinmaster19 2 points3 points  (1 child)

I am glad to see someone else has thought of this.

I have been planning to scrape all of e621's images and tags, then resize the images to 256x256 and convert the tags to vectors (there is both discrete and continuous metadata, so maybe separate those).

Then give the dataset a name that seems legitimate at the acronym level, like ACAP - the Anthropomorphic Cartoon Animal Pornography dataset.

As you say, nobody would take it seriously but that doesn't change the fact that it is ~1.3 million images with thousands of attributes.

Imagine training a Conditional GAN on that to generate furries, it would be hilarious.
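The "tags to vectors" step for the discrete tags could be as simple as a multi-hot encoding against a fixed tag vocabulary. A minimal sketch (the tag names here are made up for illustration):

```python
def tags_to_vector(tags, vocabulary):
    """Multi-hot encode a list of tags against a fixed tag vocabulary.
    Tags outside the vocabulary are silently dropped."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tag in tags:
        if tag in index:
            vector[index[tag]] = 1
    return vector

# Example with a toy vocabulary:
vocab = ["canine", "feline", "solo", "standing"]
print(tags_to_vector(["canine", "solo"], vocab))  # [1, 0, 1, 0]
```

The continuous metadata (scores, dimensions, dates) would get normalized separately and concatenated, as the comment suggests.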

[–]MrEldritch 1 point2 points  (0 children)

I think Flash animations/interactives, animated .GIFs, and WebM videos actually make up a fairly significant chunk of the site's images, along with images that would be too confusing to classify or illegible when downscaled: stuff like huge images with many characters (leading to a zillion tags for details that would probably be illegible at low-res), or comics with multiple panels, or images with an excessive aspect ratio. The total images left over after processing would probably be less than 1 million, although still of the same order of magnitude.

[–]EliezerYudkowsky 8 points9 points  (0 children)

I had exactly the same thought. Kudos for actually doing it!

[–]gogogoscott 1 point2 points  (0 children)

It is definitely something that shouldn't be overlooked.

[–]epicwisdom 0 points1 point  (0 children)

A significant financial investment into replicating (on a much smaller scale) Google's image crawler, perhaps.

[–]gwern[S] 7 points8 points  (3 children)

By popular demand, there is now a SFW downscaled 512x512px subset (241GB, 2.2m images) available as a torrent: https://gwern.net/doc/anime/danbooru2017-sfw512px-torrent.tar.xz This should address everyone's concerns about too much disk space, the NSFW content and legal/reputational risk, and the annoyance of pre-processing to downscale the big images.

[–]gwern[S] 2 points3 points  (0 children)

By further popular demand due to issues with torrents, now I've set up a rsync server (and if you can't get rsync working, I have to give up). Quick start:

rsync --recursive --times --verbose rsync://78.46.86.149:873/danbooru2017 ./danbooru2017/

This will download both the full & 512px versions; if you want only one of them, or just an arbitrary subset, list the files and grab what you need. E.g., if you only want the tag metadata, you can do:

rsync --verbose rsync://78.46.86.149:873/danbooru2017/metadata.json.tar.xz ./

[–]gwern[S] 0 points1 point  (0 children)

Another question: how useful would an AWS S3 'requester-pays' bucket be for people? It theoretically would make it much easier to run DL on the 512px SFW dataset, since you would get within-AWS transfer speeds/storage/attachable buckets.

Poll: https://twitter.com/gwern/status/966507701261012992

[–]fosa2 0 points1 point  (0 children)

Thanks for this! Which torrent file has the metadata table?

[–]zawerf 5 points6 points  (4 children)

Noob question: how do you deal with a dataset with such varying dimensions? There are images with a height of 30,000px in there. They seem to be comic strips, so you can't just resize them. Do you bother chopping those up, or just filter them out? Also, a good chunk of them have normal aspect ratios but unnecessarily high DPI (I guess these I can just rescale).

This seems like a lot of cleaning work that everyone using this dataset has to repeat.

Also if you don't want to accidentally view CP (which there's a lot of...), you need to filter for only "rating"=="s" for safe images.

Edit: Some more issues: images with highly negative scores are also included. Records flagged is_deleted, is_banned, is_flagged, or is_pending are (expectedly) missing their images, so they need to be filtered. There are also images in all kinds of formats, which need to be converted to RGB/RGBA first or you get a random number of color channels. I didn't even know 2-channel image formats exist (I am guessing grayscale + alpha?).
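A first-pass filter implementing these checks might look like the following sketch (the field names are assumed to match the ones listed above, and the metadata file is assumed to be newline-delimited JSON records):

```python
import json

# Flags whose presence means the image file won't be in the dump.
SKIP_FLAGS = ("is_deleted", "is_banned", "is_flagged", "is_pending")

def keep(record, min_score=0, sfw_only=True):
    """Return True if a metadata record points to a usable, safe image."""
    if any(record.get(flag) for flag in SKIP_FLAGS):
        return False  # image is missing from the dump
    if sfw_only and record.get("rating") != "s":
        return False  # only keep "s"-rated (safe) images
    if int(record.get("score", 0)) < min_score:
        return False  # drop highly-downvoted images
    return True

def filter_metadata(path):
    """Yield usable records from a newline-delimited JSON metadata file."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if keep(record):
                yield record
```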

Edit2: Looking at this dataset some more I am not sure I can ever associate my real identity with it. Maybe you can split out the NSFW part as a separate dataset so it's safe by default?

[–]MrEldritch 7 points8 points  (0 children)

Personally, I'd just filter out anything with an aspect ratio more extreme than 1:3 or so, and then pad everything to 1:1.
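That filter-and-pad rule can be expressed as pure geometry before touching any image library. A sketch:

```python
def pad_to_square(width, height, max_ratio=3.0):
    """Decide how to pad an image to a 1:1 aspect ratio.

    Returns (side, x_offset, y_offset) for pasting the image centered on a
    square canvas, or None if the aspect ratio is more extreme than
    max_ratio and the image should be dropped instead."""
    if max(width, height) / min(width, height) > max_ratio:
        return None
    side = max(width, height)
    return side, (side - width) // 2, (side - height) // 2

print(pad_to_square(100, 400))  # None: 1:4 is more extreme than 1:3
print(pad_to_square(100, 200))  # (200, 50, 0): center on a 200x200 canvas
```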

[–]gwern[S] 2 points3 points  (2 children)

For the full-scale dataset, the sanest workflow would be to iterate over directories for JPG/PNG/BMP files, look up the filename as ID to get the metadata, possibly check against a blacklist of tags (aside from rating, I think there are a number of tags which are probably better off being removed even if you don't remove the image, stuff like "seiyuu_connection" where it is unreasonable to expect a CNN to learn them), and then convert to 512x512px JPG to feed into your CNN. This automatically handles missing images (they aren't in the directory to be iterated over), weird file types like HTML or SWF or no extension at all, lets you filter on anything you want, your image library will handle the channels, etc. This sort of munging is inherent to any dataset; there's not much that can be done aside from packaging up a standard 'data loader' function for the various frameworks like PyTorch, but obviously they wouldn't accept such a patch until the dataset has been out there for a while.
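The lookup-and-filter part of that workflow, sketched out (the SFW check and the example blacklist tag come from this thread; the metadata is assumed to be a dict keyed by image ID):

```python
import os

# Tags that are unreasonable to expect a CNN to learn from pixels alone.
TAG_BLACKLIST = {"seiyuu_connection"}

ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")

def usable_tags(filename, metadata, sfw_only=True):
    """Look up a file's metadata record by ID (the filename stem) and return
    its tags minus the blacklist, or None if the file should be skipped."""
    image_id, ext = os.path.splitext(os.path.basename(filename))
    if ext.lower() not in ALLOWED_EXTENSIONS:
        return None  # skip HTML, SWF, extension-less oddities, etc.
    record = metadata.get(image_id)
    if record is None:
        return None  # no metadata record: deleted/banned/pending image
    if sfw_only and record.get("rating") != "s":
        return None
    return [t for t in record.get("tags", []) if t not in TAG_BLACKLIST]
```

The surviving files then get resized to 512x512px JPG by whatever image library you prefer.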

Maybe you can split out the NSFW part as a separate dataset so it's safe by default?

See my other comment: https://www.reddit.com/r/MachineLearning/comments/7xk4zh/p_danbooru2017_a_new_dataset_of_294m_anime_images/du9qqsf/ I assume a 512x512px SFW downscaled version (~300GB) would be adequate for you?

[–]zawerf 2 points3 points  (1 child)

By the way, sorry if it sounded like I was complaining. I was mostly listing basic stuff for the benefit of other noobs like me who are encountering their first non-MNIST/ImageNet dataset.

I assume a 512x512px SFW downscaled version (~300GB) would be adequate for you?

Kind of. Re-establishing a reputation as a SFW dataset is probably the most important part since the actual preprocessing I can do myself.

Controversy is inevitable once the outrage society knows what's in the dataset. And good luck telling the FBI that the gigabytes of loli/guro/etc on your computer is "for research". See the numerous cases listed in https://en.wikipedia.org/wiki/Legal_status_of_cartoon_pornography_depicting_minors#United_States

(though it will probably be the first time in history that the "for research" defense will actually hold up)

[–]gwern[S] 3 points4 points  (0 children)

That's really not a lot of cases, and the free speech grounds are quite shaky. Danbooru itself is located in the USA and always has been for the past 13 years.

[–]MrEldritch 4 points5 points  (4 children)

Oh man, I've been waiting eagerly for this to come out!

...although now that it's out, I'm realizing that I can't actually download it, since 1.9tb would more than fill all the storage media I own. I'm going to have to buy a new external hard drive if I want to play with this, I suppose.

[–]gwern[S] 12 points13 points  (3 children)

Hm. I didn't realize 1.9TB was going to be such a problem. So... out of curiosity, how interested would people be in something like me shipping a 3TB internal HDD with Danbooru2017 preloaded to people? (Looks like such drives only cost ~$60 on Newegg, so I could either donate it for students with some sort of track record or just charge $70 for HDD+S&H. This sort of thing seems to work well for Nvidia.)

[–]RDxzFCFF3qeTIdWNXIWO[🍰] 5 points6 points  (1 child)

I wonder if there's a market big enough for a service that ships large training data sets, for example this one or coco and imagenet.

I'd be interested, but shipping from US to EU usually doesn't work well and customs are a hassle. Also I'd have to trust random internet people to not run away with my money.

[–]Mandrathax 5 points6 points  (5 children)

Nice. You should cross-post to r/anime

[–]gwern[S] 9 points10 points  (4 children)

Man, screw /r/anime. I submitted a great article to them the other day, and their bot snidely removed it for having a live-action connection.

[–]Mandrathax 1 point2 points  (3 children)

I think you will have more success with this one^

[–]gwern[S] 24 points25 points  (2 children)

Well, alright. But I'm not crossposting because I want to make you happy or anything, understand, I just want to prove you wrong!

[–]Colopty 13 points14 points  (0 children)

Excellent tsundere act.

[–]mirh 0 points1 point  (0 children)

Well, truth be told you shouldn't have reused the same title of this sub.

I mean, even putting aside those who don't get what a "dataset" is in the first place, in that context... I dunno, even I would be a bit lost, at least for a few seconds?

[–]Ending_Credits 2 points3 points  (5 children)

You might want to run a face extraction routine on a few hundred thousand images, similar to https://github.com/jayleicn/animeGAN . It makes a good, harder alternative to CelebA, plus you have freely available identity information (which is not the case with CelebA, although you can ask nicely and they'll give it to you).

I actually made a dataset of 150,000 faces of ~500 characters, grouped by character but also with tags, made using the tools in the above repo. It's actually quite reasonably sized, so I can upload it somewhere if people are interested - I just need to find somewhere to host it. Example of the fun things you can do with it here: https://github.com/EndingCredits/Set-CGAN

[–]gwern[S] 2 points3 points  (0 children)

What I did when I was playing around with a face GAN was use https://github.com/nagadomi/lbpcascade_animeface on an earlier dump of Danbooru images. It worked reasonably well but not perfectly. I wouldn't want to distribute an auto-extracted face dataset without some way of semi-manually verifying it or using a better face extractor (perhaps trained on Danbooru2017! I'm sure there's some interesting semi-supervised thing you could do - you know from the tags what images probably have faces somewhere in them, if not bounding boxes of where...).

On a side note, I'd love to see people try out ProGAN on Danbooru2017. No more excuses about not having high resolution images!

[–]Skylion007 2 points3 points  (3 children)

I'm actually working on releasing a paper using this soon, and I'll publish the dataset. I plan to release the face dataset with u/gwern as well. So far I have about 1.1 million faces that I was able to extract. Stay posted. :)

[–]Ending_Credits 0 points1 point  (1 child)

Would be interested to hear your tips and tricks.

Personally I found https://www.microsoft.com/en-us/research/publication/stabilizing-training-of-generative-adversarial-networks-through-regularization/ works very nicely, although it still suffers from melty waifu syndrome.

Part of me wonders whether, to get truly good samples without geometric distortions, we might need more sophisticated generator architectures (e.g. multiple 'passes' which successively build up the image). It's pretty impressive what we get considering everything is generated end-to-end.

Thinking about it, a few more FC layers before the conv layers might also be good in this regard (I did see a paper suggesting training a predictor of z to avoid mode collapse when adding more of these layers).

[–]gwern[S] 0 points1 point  (0 children)

All the existing anime GANs thus far have been on a small enough scale, both compute and parameter-wise, that I'm not totally convinced we yet have a good handle on how much the current GAN/CNN architectures are badly suited. I mean, people were generating tons of crummy CelebA facial samples but we know from ProGAN that simply stacking layers continuously for 2 or 3 weeks on a top-end GPU can produce damn near photorealistic CelebA facial samples and eliminates all the nasty distortions and artifacts we see earlier in training.

Perhaps the 'watercolor effect' and other geometric distortions are like that too, artifacts of incomplete training or poor convergence. The early stages or sequence of progress doesn't have to make sense to us. (For example, the Zero paper notes that without the 'ladder' features encoded into AG1, Zero doesn't manage to learn to defeat ladders on its own until surprisingly far into training, despite how trivial we find ladders; doesn't stop it from quickly becoming superhuman though.)

[–][deleted] 0 points1 point  (0 children)

Looking forward to seeing your face dataset! I purchased Danbooru gold just so I could get enough filters to get my own high-quality face dataset, but I didn't get anywhere close to 1.1 million faces.

[–]visarga 6 points7 points  (7 children)

NSFW

[–]gwern[S] 23 points24 points  (5 children)

It's only 8.7% NSFW, which is another way of saying it's 91.3% SFW, really.* (And you can just filter out the "e" tagged images in your code if it's a problem. Or better yet, test out how well CNNs work for NSFW classification on illustrations/anime rather than photos; mind-melting horrors optional.)

* EDIT: OK I've looked a little closer at 'q' images & Danbooru official rules, and it's a lot more permissive than I thought, so maybe something like 80-85% ("s" + some of "q") is more genuinely SFW.

[–]asquared31415 1 point2 points  (4 children)

That’s still half of my primary hard drive filled with NSFW material.

[–]gwern[S] 4 points5 points  (3 children)

(Your primary drive is only 300GB...? Doesn't that interfere with all your other ML stuff?)

In any case, I think it's fair to ask people to store the full dataset: this is how torrents scale and stay alive - we all contribute bandwidth and drive space even if we have already finished watching the movie or whatever. If it gets split into multiple torrents and becomes fragmented, the torrents will die or slow down.

I've thought about providing a SFW-only 512x512px downscaled version torrent, for people with less disk space, but am not yet convinced the fragmentation is worthwhile.

[–]MrEldritch 6 points7 points  (1 child)

I think a downscaled version is basically essential. By making the minimum download size larger than most people's hard drives, you essentially make this dataset completely inaccessible to hobbyists and amateurs - who are precisely the people that would be most interested in an anime-based dataset from a popular booru.

Even the full-size version of ImageNet is rarely used because it's considered prohibitively large - both in file size, and file resolution. This full-version dataset is even more painful, on both counts, and is an enormous amount to download for a dataset most researchers and users will have to immediately spend considerable computing time downscaling into a size their models can actually use.

I appreciate that you really can't sustainably host the full dataset yourself, such that the only way it can be available to the community in a sustainable long-term fashion is if many other people download and seed it. But I fear that the enormous size of this dataset will not lead to people grudgingly sighing and choosing to download and peer the full size when they would have much preferred a smaller one; it will just lead to them not using your dataset.

[–]gwern[S] 9 points10 points  (0 children)

I think you may be right. There's 2 people here so far who sound like they would've downloaded it if it was smaller like 300GB, and I've gotten another 2 similar comments elsewhere, so it's a fair number of people by the usual rule of thumb of 1%. I've started a 512px rescale job on the server to see how big a converted one (with the "e"/"q" images deleted, since that's the other complaint, having NSFW mixed in) would be.

(Rescaling is actually not that hard; with 8 threads you can chew through Danbooru2017 in a few hours or a day at most. I think it's partially because the bottleneck is often the hard drive especially for writes, but if you're writing out 512px images, of course they're a lot smaller. At under an hour, it's converted ~80k images so far.)

[–]asquared31415 0 points1 point  (0 children)

My primary drive, with the OS and software is a 250GB SSD (I guess it would actually take about 2/3 now that I do the math better), but I have a 2TB HDD for large files and data, and I plan to upgrade my primary drive soon, it’s a couple years old now.

I was just making a comment on how you’re saying “it’s only 8.7%” (I agree that’s a rather small percentage) while many people would look at the raw amount, and not in proportion to the entire data set (150 GB is quite a bit if you don’t work with lots of data on a regular basis)

[–]skgoa 1 point2 points  (0 children)

Yeah, that is a big roadblock. It means I can't play with this dataset on my work computer. A SFW version would be super amazing, though.

[–]mikhael4440 2 points3 points  (0 children)

Nice-u

[–]poctakeover 3 points4 points  (0 children)

baka :/

[–]deeppomf 3 points4 points  (8 children)

Is it possible to only download NSFW images?

[–]gwern[S] 3 points4 points  (6 children)

If you only want NSFW images*, the torrent may not be a good idea. As I pointed out, "e" images make up <10% of the torrent. If you are dealing with such a small subset of images, you are probably better off using an API loop or a tool like DanbooruDownloader; you could then also easily pick up everything uploaded in January & February 2018 and include tags as filters.

* I'm not sure why you would want only NSFW images; surely even for fapping purposes you would want more than just "e" images?

[–]deeppomf 17 points18 points  (5 children)

Some images are restricted to gold accounts only, and I need as much data as possible for my hentai decensoring project: https://github.com/deeppomf/DeepMindBreak

[–]q914847518 5 points6 points  (0 children)

Great project. I am watching it. (๑•̀ㅂ•́)و✧

[–]gwern[S] 1 point2 points  (3 children)

I see. That does make sense. But couldn't you treat that as a subset of general inpainting and use all images with random sections cropped out during training?
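The random-crop variant would amount to generating synthetic training masks, roughly like this sketch (the size fractions are arbitrary choices, not anything from the thread):

```python
import random

def random_mask(width, height, min_frac=0.1, max_frac=0.3, rng=random):
    """Pick a random axis-aligned rectangle (x, y, w, h) to blank out of an
    image for inpainting training; each side covers between min_frac and
    max_frac of the corresponding image dimension."""
    w = rng.randint(int(width * min_frac), int(width * max_frac))
    h = rng.randint(int(height * min_frac), int(height * max_frac))
    x = rng.randint(0, width - w)
    y = rng.randint(0, height - h)
    return x, y, w, h
```

At train time, the network sees the image with that rectangle zeroed out and is scored on reconstructing the original pixels inside it.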

[–]MrEldritch 5 points6 points  (1 child)

Given that the subset of things that'd be covered by censor bars in an anime image that users would want to decensor is quite limited, I'm not sure that adding a large proportion of non-lewd artificial censors would actually improve quality. You don't want the inpainter thinking that perfectly SFW stuff is a likely candidate for what's under a censor bar.

[–]gwern[S] 4 points5 points  (0 children)

I was thinking in terms of regularization and global knowledge - even for deeppomf's rather specific use-case, I would expect a CNN to benefit from understanding the overall geometry of the scene, what characters and genders are present, the image style and colors, and what the non-genital parts of the censored-out box are (it's not an exact outline of a penis or whatever, after all, it'll cover SFW parts of clothing or bodies as well).

[–]deeppomf 1 point2 points  (0 children)

Resolution and content are issues. Many images are too large, and downscaling them loses too much detail. And since NSFW images make up a small proportion of all images, the learning the GAN does while inpainting SFW images probably interferes with its inpainting of NSFW ones. (I don't have proof of this, but that's my hunch. It's like trying to inpaint faces, except images of other body parts are included in the training data. The other body parts aren't improving face inpainting.)

To solve these problems, I use cropped images of uncensored penises and vaginas for training data.

[–]MrEldritch 0 points1 point  (0 children)

Not directly, but there is a potential solution. The dataset is broken up into ten smaller torrents that each contain 1/10th of the dataset. You could download the dataset 1/10th at a time, delete all the images not tagged as 'explicit', and repeat with the next tenth. That should allow you to end up with only the NSFW images, without filling up your hard drive.

[–]Silver_Sky 0 points1 point  (0 children)

What are the best torrent clients now? Bittorrent has been my main client for a while, but it seems to have major trouble with these large torrents. Thanks for any recs.

[–][deleted]  (4 children)

[deleted]

    [–]gwern[S] 0 points1 point  (3 children)

    Works fine for me and http://downforeveryoneorjustme.com/gwern.net (Being down would be weird as it's just a static site on S3 cached by Cloudflare, there's not really anything to be down.)

    [–][deleted]  (2 children)

    [deleted]

      [–]gwern[S] 0 points1 point  (1 child)

      I don't run Danbooru, Albert does. But look on the bright side, at least my website & torrent are still working?

      [–]girlyman1 0 points1 point  (0 children)

      Well, it's no big deal for me anyway lol, it's fine

      [–]fosa2 0 points1 point  (3 children)

      Where do we download the tags dataset? No way to get the tags database for just the 300m SFW images?

      [–]gwern[S] 0 points1 point  (0 children)

      The tags dataset is in the tarball which is included in the first torrent of both the full and the SFW subset, so if you download either, you should have it. If you don't want to download any images, you can sign up on BigQuery and dump it - it's a bit of a nuisance since you have to provide a CC and jump through a few hoops saving it to your Google account and then downloading it, but shouldn't take more than half an hour. In both cases, you are getting the full tag dataset, there being no point in providing just the SFW subset - if you look up by image ID, by definition you'll never hit the NSFW records, and if you're querying the JSON record by record, you can just throw in a rating==1 or whatever clause. (Although I think you can probably have BigQuery do a SQL filter before dumping.)

      [–]gwern[S] 0 points1 point  (1 child)

      I've set up a rsync server so you can download just the tag tarball now if you want: see https://www.reddit.com/r/MachineLearning/comments/7xk4zh/p_danbooru2017_a_new_dataset_of_294m_anime_images/dvvf8yo/

      [–]fosa2 0 points1 point  (0 children)

      Thank you very much! I was able to use the original 1.9TB link to grab the tags, but your effort with this anime dataset is very much appreciated!

      [–]inkplay_ 0 points1 point  (2 children)

      I am new at this - how do you download this? The torrent either doesn't work with most clients or has no seeds in the Windows version of Transmission.

      [–]gwern[S] 0 points1 point  (1 child)

      Use a client that works, I guess. If you can't figure it out, I can give you SSH access to my seedbox and you can rsync it down.

      [–]inkplay_ 0 points1 point  (0 children)

      Transmission works - the torrent loads, but there are no peers so the speed stays at zero. I'll try Transmission again; maybe it's my network.

      [–]Arias-go 0 points1 point  (1 child)

      An amazing and exciting dataset - that's exactly what I am looking for. By the way, I cannot download even part of the dataset with the SFW torrent files. No users are uploading...

      [–]gwern[S] 0 points1 point  (0 children)

      More client issues, I assume. As far as I can tell, the SFW torrents are working fine (I just logged in and I can see 5 of them in rtorrent seeding a collective 100kb/s up). As usual, if you can't figure out your torrent problem, I can give you a SSH login and you can just rsync/scp it down.

      [–]unguided_deepness -1 points0 points  (1 child)

      Jack off and do some machine learning at the same time, what a great idea!

      [–]poctakeover -1 points0 points  (0 children)

      weeb neural networks

      [–]unguided_deepness -5 points-4 points  (8 children)

      I propose that the model trained from this dataset be called "Incelnet"

      [–]MrEldritch 17 points18 points  (7 children)

      Don't be ridiculous. It'd be DeepWeeb, and the ArXiv submission title would be some kind of strained pun about the "deep web"

      [–]gwern[S] 14 points15 points  (4 children)

      I feel that given the open questions about style transfer, the first model should probably be called 'VeGGeta' and explain how the architecture allows Inception scores of over 9000.

      [–]MrEldritch 18 points19 points  (3 children)

      [–]EliezerYudkowsky 13 points14 points  (2 children)

      That is the greatest acronym I've seen in three years.

      [–]silverius 5 points6 points  (0 children)

      So what did you see on February 15th 2015?

      [–]MrEldritch 5 points6 points  (0 children)

      S-senpai noticed me...

      [–]quanticle 4 points5 points  (1 child)

      You're both wrong. It should be called "hentAI".

      EDIT: Or perhaps SenpAI

      [–]wakeshima 5 points6 points  (0 children)

      SenpAI trained on SFW images only, hentAI trained on the full dataset ( ͡° ͜ʖ ͡°)