FastML

Machine learning made easy

Goodbooks-10k: a new dataset for book recommendations

There have been a few recommendation datasets for movies (Netflix, MovieLens) and music (Million Song Dataset), but not for books. That is, until now.

The dataset contains six million ratings for the ten thousand most popular books (those with the most ratings). There are also:

  • books marked to read by the users
  • book metadata (author, year, etc.)
  • tags/shelves/genres

As to the source, let’s say that these ratings come from a site similar to goodreads.com, but with more permissive terms of use.

Contents

There are a few types of data here:

  • explicit ratings
  • implicit feedback indicators (books marked to read)
  • tabular data (book info)
  • tags

For a quick exploratory analysis of the data, see the notebook.

ratings.csv contains ratings sorted by time. It is 69MB and looks like this:

user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3

Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1-10000, for users, 1-53424.
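As a minimal sketch of how one might load the ratings and sanity-check the properties described above, the snippet below reads the five sample rows from an in-memory string with pandas. The helper name ids_are_contiguous is made up for this example, and the ratings.csv path in the comment is an assumption about where you saved the file.

```python
import io

import pandas as pd

# The first rows of ratings.csv, copied from the sample above.
# To run against the full file, use pd.read_csv("ratings.csv") instead
# (the path is an assumption).
sample = io.StringIO(
    "user_id,book_id,rating\n"
    "1,258,5\n"
    "2,4081,4\n"
    "2,260,5\n"
    "2,9296,5\n"
    "2,2318,3\n"
)
ratings = pd.read_csv(sample)


def ids_are_contiguous(ids, expected_max):
    """Return True if ids cover exactly 1..expected_max with no gaps."""
    unique = set(ids)
    return (
        min(unique) == 1
        and max(unique) == expected_max
        and len(unique) == expected_max
    )


# Ratings should all fall in the one-to-five range.
ratings_in_range = bool(ratings["rating"].between(1, 5).all())

# Mean rating per user, a common first look at rating bias.
mean_by_user = ratings.groupby("user_id")["rating"].mean()
```

On the full file, ids_are_contiguous(ratings["book_id"], 10000) and ids_are_contiguous(ratings["user_id"], 53424) should both hold if the ID ranges are as stated.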

to_read.csv provides IDs of the books marked “to read” by each user, as user_id,book_id pairs, sorted by time. There are close to a million pairs.
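Since "to read" marks are implicit feedback, each pair is a positive signal with no rating attached. The sketch below shows one way to turn the pairs into per-user wishlists and per-book counts. The three rows are made up for illustration and only follow the user_id,book_id layout described above.

```python
import io

import pandas as pd

# Made-up rows in the user_id,book_id layout; not taken from the real file.
sample = io.StringIO(
    "user_id,book_id\n"
    "1,112\n"
    "1,235\n"
    "2,112\n"
)
to_read = pd.read_csv(sample)

# One set of wanted books per user. There are no explicit negatives here:
# a missing pair only means the user has not marked the book.
wishlists = to_read.groupby("user_id")["book_id"].apply(set).to_dict()

# How many users marked each book, most wanted first.
popularity = to_read["book_id"].value_counts()
```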

books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata have been extracted from goodreads XML files, available in books_xml.

Tags

In raw XML files, tags look like this:

<popular_shelves>
    <shelf name="science-fiction" count="833"/>
    <shelf name="fantasy" count="543"/>
    <shelf name="sci-fi" count="542"/>
    ...
    <shelf name="for-fun" count="8"/>
    <shelf name="all-time-favorites" count="8"/>
    <shelf name="science-fiction-and-fantasy" count="7"/>   
</popular_shelves>
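If you want to work with the raw XML directly, a fragment like the one above can be read with Python's standard xml.etree.ElementTree module. This is a sketch on a shortened copy of the fragment; the function name parse_shelves is made up for this example, and the real files in books_xml presumably wrap this element in a larger document.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the fragment shown above.
xml_snippet = """
<popular_shelves>
    <shelf name="science-fiction" count="833"/>
    <shelf name="fantasy" count="543"/>
    <shelf name="sci-fi" count="542"/>
    <shelf name="for-fun" count="8"/>
</popular_shelves>
"""


def parse_shelves(xml_text):
    """Return (shelf name, count) pairs in document order."""
    root = ET.fromstring(xml_text.strip())
    # iter() also finds <shelf> elements nested deeper in a larger document.
    return [(s.get("name"), int(s.get("count"))) for s in root.iter("shelf")]


shelves = parse_shelves(xml_snippet)
```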

book_tags.csv contains tags/shelves/genres assigned to books by users. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.

goodreads_book_id,tag_id,count
1,30574,167697
1,11305,37174
1,11557,34173

tags.csv translates tag IDs to names.

tag_id,tag_name
0,-
19,--your-message-here--
25,-fiction
26,-fictional
27,-fictitious
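To get readable tags for a book, the two files can be joined on tag_id. The sketch below uses the three book_tags.csv rows shown above. The tag names are placeholders, because the sample of tags.csv above does not include these IDs.

```python
import io

import pandas as pd

# Rows copied from the book_tags.csv sample above.
book_tags = pd.read_csv(io.StringIO(
    "goodreads_book_id,tag_id,count\n"
    "1,30574,167697\n"
    "1,11305,37174\n"
    "1,11557,34173\n"
))

# Placeholder names: these are NOT the real names behind the tag IDs.
tags = pd.read_csv(io.StringIO(
    "tag_id,tag_name\n"
    "30574,placeholder-a\n"
    "11305,placeholder-b\n"
    "11557,placeholder-c\n"
))

# A left join keeps every book/tag row even if a tag ID has no name.
named = book_tags.merge(tags, on="tag_id", how="left")

# Tag names per book, most frequently used first.
top_tags = (
    named.sort_values(["goodreads_book_id", "count"], ascending=[True, False])
    .groupby("goodreads_book_id")["tag_name"]
    .apply(list)
)
```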


Goodreads book and work IDs

Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

You can use the goodreads book and work IDs to create URLs to see the difference:

https://www.goodreads.com/book/show/2767052
https://www.goodreads.com/work/editions/2792775
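The two URL patterns above are easy to generate from the ID columns. A small sketch, with made-up helper names book_url and work_url:

```python
# URL patterns taken from the two example links above.
BOOK_URL = "https://www.goodreads.com/book/show/{}"
WORK_URL = "https://www.goodreads.com/work/editions/{}"


def book_url(goodreads_book_id):
    """Link to a single edition of a book."""
    return BOOK_URL.format(int(goodreads_book_id))


def work_url(work_id):
    """Link to the list of editions of a work."""
    return WORK_URL.format(int(work_id))
```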

Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id. This means that ratings for different editions are aggregated.
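To attach Goodreads IDs to ratings, one option is to join through books.csv. The sketch below assumes that books.csv carries book_id, goodreads_book_id and work_id columns. The ID values in the books table are made up for illustration and do not come from the real file.

```python
import io

import pandas as pd

# Two rows in the ratings.csv layout.
ratings = pd.read_csv(io.StringIO(
    "user_id,book_id,rating\n"
    "1,258,5\n"
    "2,260,5\n"
))

# Assumed column names; the goodreads_book_id and work_id values are made up.
books = pd.read_csv(io.StringIO(
    "book_id,goodreads_book_id,work_id\n"
    "258,1001,2001\n"
    "260,1002,2002\n"
))

# Each rating matches at most one book row; validate raises if book_id
# is duplicated in the books table.
with_ids = ratings.merge(books, on="book_id", how="left", validate="many_to_one")
```

The resulting goodreads_book_id column can then be used to join against book_tags.csv, which is keyed by goodreads_book_id rather than book_id.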

Download

All files are available on GitHub. Some of them are quite large, so GitHub won’t show their contents online. See samples for smaller CSV snippets. You can download individual zipped files from releases.

Thanks to Maciej Kula, the dataset is accessible from Spotlight, recommender software based on PyTorch.

Citing

This is the preferred citation style:

@article{goodbooks2017,
    author = {Zajac, Zygmunt},
    title = {Goodbooks-10k: a new dataset for book recommendations},
    year = {2017},
    publisher = {FastML},
    journal = {FastML},
    howpublished = {\url{http://fastml.com/goodbooks-10k}},
}

The canonical name of the dataset is goodbooks-10k, or Goodbooks-10k at the start of a sentence, if you prefer.

Want more? A Shiny notebook by Philipp Spachtholz provides an extensive analysis of the slightly smaller first version of the dataset. Be sure to run it if you want to see all the plots.
