GoodReads (Link Bibliography)

“GoodReads” links:

  1. #data

  2. #bayesian-modeling


  4. #modeling-with-covariates


  6. Morning-writing







  13. ⁠, Jordan Ellenberg (2014-07-03):

    Sadly overlooked is that other crucial literary category: the summer non-read, the book that you pick up, all full of ambition, at the beginning of June and put away, the bookmark now and forever halfway through chapter 1, on Labor Day. The classic of this genre is Stephen Hawking’s A Brief History of Time, widely called “the most unread book of all time.”…How can we find today’s greatest non-reads? Amazon’s “Popular Highlights” feature provides one quick and dirty measure. Every book’s Kindle page lists the five passages most highlighted by readers. If every reader is getting to the end, those highlights could be scattered throughout the length of the book. If nobody has made it past the introduction, the popular highlights will be clustered at the beginning.

    Thus, the Hawking Index (HI): Take the page numbers of a book’s five top highlights, average them, and divide by the number of pages in the whole book. The higher the number, the more of the book we’re guessing most people are likely to have read. (Disclaimer: This is not remotely scientific and is for entertainment purposes only!) Here’s how some current best sellers and classics weigh in, from highest HI to lowest:

    • Thinking Fast and Slow by Daniel Kahneman: 6.8%

      Apparently the reading was more slow than fast. To be fair, Prof. Kahneman’s book, the summation of a life’s work at the forefront of cognitive psychology, is more than twice as long as Lean In, so his score probably represents just as much total reading as Ms. Sandberg’s does.

    • A Brief History of Time by Stephen Hawking: 6.6%

      The original avatar backs up its reputation pretty well. But it’s outpaced by one more recent entrant—which brings us to our champion, the most unread book of this year (and perhaps any other). Ladies and gentlemen, I present:

    • Capital in the Twenty-First Century by Thomas Piketty: 2.4%

      Yes, it came out just three months ago. But the contest isn’t even close. Mr. Piketty’s book is almost 700 pages long, and the last of the top five popular highlights appears on page 26. Stephen Hawking is off the hook; from now on, this measure should be known as the Piketty Index.
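The Hawking Index described above is simple enough to sketch in a few lines. This is only an illustration: the function name is made up, and of the five highlight pages below only the last one (page 26) and the rough book length (about 700 pages) come from the excerpt. The other four pages are hypothetical, chosen so the result lands on the quoted 2.4%.

```python
from statistics import mean


def hawking_index(highlight_pages, total_pages):
    """Average page of the top highlights as a fraction of the book's length."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return mean(highlight_pages) / total_pages


# Hypothetical highlight pages: only page 26 and the ~700-page length
# are taken from the excerpt; the remaining values are invented.
piketty_pages = [8, 12, 17, 21, 26]
print(f"{hawking_index(piketty_pages, 700):.1%}")  # prints 2.4%
```

A low value only says the popular highlights cluster near the front of the book; as the excerpt itself stresses, it is not a scientific measure of how far people read.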

  14. ⁠, Minda Zetlin (2018-06-13):

    …a distasteful practice called “book stuffing” by some Kindle Unlimited authors. Kindle Unlimited is an Amazon program that works like Netflix for books: You can read as much as you want for a flat monthly fee. For various reasons, Kindle Unlimited is filled with books written and self-published by independent authors, many of them in the romance genre.

    How do authors get compensated when readers pay a flat fee for the service? Amazon has created a pool of funds that authors are paid from, currently around $22.5 million. Up until 2015, authors earned a flat fee for each download of their books. But the company noticed that many of these Kindle Unlimited books were very, very short. So instead, Amazon began paying a bit less than ¢0.5 for each page that was actually read. That’s how book stuffing was born.

    It works like this. An Amazon author publishes a new book that’s, say, 300 pages long. At ¢0.5 per page, the author would earn about $1.50 every time that book was read to the end. To beef up their earnings, book stuffers add several other already-published books, or a long series of newsletters, to the end of the book as “bonus material.” Most stuffed books run near 3,000 pages, the maximum that Amazon will pay for. In the current system, an author could earn about $13.50 per book this way, which is more than most authors earn from traditional publishers when their books are sold as hardcovers.
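The per-read arithmetic in this passage can be checked with a small sketch. The 3,000-page cap and the two dollar figures come from the excerpt. The rate of ¢0.45 per page is an assumption, back-computed from "$13.50 for 3,000 pages" and consistent with "a bit less than" half a cent; the function name is hypothetical.

```python
MAX_PAID_PAGES = 3000  # payout cap mentioned in the excerpt


def payout_per_read(pages, rate_cents_per_page):
    """Dollars earned for one complete read-through, with the page cap applied."""
    paid_pages = min(pages, MAX_PAID_PAGES)
    return paid_pages * rate_cents_per_page / 100


# 300-page book at the rounded rate of half a cent per page (about $1.50).
print(round(payout_per_read(300, 0.5), 2))
# Stuffed 3,000-page book at an assumed 0.45 cents per page (about $13.50).
print(round(payout_per_read(3000, 0.45), 2))
# Pages beyond the cap earn nothing extra.
print(round(payout_per_read(5000, 0.45), 2))
```

The ninefold gap between the first two figures is the incentive the excerpt describes: the payout scales with pages counted as read, not with the number of books or readers.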

    $1.2 million a year?

    Serious book stuffers acquire email lists that they sometimes share with each other. They boost their sales by sending out promotional email to hundreds of thousands of email addresses. They also spend a lot of money on Amazon Marketing Services, promoting their books as “sponsored” to Kindle Unlimited subscribers and other Kindle shoppers. These tactics, in combination with artificially producing positive reviews (against Amazon’s rules), help them rank high in Amazon’s romance category, crowding out authors who take a more traditional approach. Some book stuffers publish a new book every couple of weeks (they may use ghostwriters to actually write the books), doing a new promotion for each one. In this way, observers report, they can earn as much as $100,000 per month.

    …Why would anyone read through 2,700 pages of uninteresting bonus material? They usually don’t, but many authors do something that gets people to turn to the last page of the book, such as promising a contest or giveaway (forbidden by Amazon rules), or putting some new and perhaps particularly racy content right at the end of the book. On some devices, Amazon may simply be using the last page opened as a measure of how much of a book was “read.” Thus, the author gets full credit for the book, even though the customer didn’t read all of it.

    …Carter openly invited other authors to pay for the use of his “platform” to send out promotional emails to their own mailing lists and also share mailing lists and cross-promote with other authors/book stuffers. In fact, he was so proud of his book stuffing talents that he posted his credo for the world to see in a Kindle publishing forum:

    • Making content as long as possible.
    • Releasing as frequently as possible.
    • Advertising as hard as possible.
    • Ranking as high as possible.
    • And then doing it all over again.
  15. ⁠, Sarah Jeong (2018-07-16):

    On June 4th, a group of lawyers shuffled into a federal court in Manhattan to argue over two trademark registrations. The day’s hearing was the culmination of months of internet drama—furious blog posts, Twitter hashtags, YouTube videos, claims of doxxing, and death threats…They were gathered there that day because one self-published romance author was suing another for using the word “cocky” in her titles. And as absurd as this courtroom scene was—with a federal judge soberly examining the shirtless doctors on the cover of an “MFM Menage Romance”—it didn’t even begin to scratch the surface.

    The fight over #Cockygate, as it was branded online, emerged from the strange universe of Amazon Kindle Unlimited, where authors collaborate and compete to game Amazon’s algorithm. Trademark trolling is just the beginning: There are private chat groups, ebook exploits, conspiracies to seed hyper-specific trends like “Navy SEALs” and “mountain men”, and even a controversial sweepstakes in which a popular self-published author offered his readers a chance to win diamonds from Tiffany’s if they reviewed his new book…A genre that mostly features shiny, shirtless men on its covers and sells ebooks for ¢99 a pop might seem unserious. But at stake are revenues sometimes amounting to a million dollars a year, with some authors easily netting six figures a month. The top authors can drop $50,000 on a single ad campaign that will keep them in the charts—and see a worthwhile return on that investment.

    …According to Willink, over the course of RWA, Valderrama told her about certain marketing and sales strategies, which she claimed to handle for other authors. Valderrama allegedly said that she organized newsletter swaps, in which authors would promote each other’s books to their respective mailing lists. She also claimed to manage review teams—groups of assigned readers who were expected to leave reviews for books online. According to Willink, Valderrama’s authors often bought each other’s books to improve their ranking on the charts—something that she arranged, coordinating payments through her own account. Valderrama also told her that she used multiple email addresses to buy authors’ books on iBooks when they were trying to hit the USA Today list. When Valderrama invited Willink to a private chat group of romance authors, Willink learned practices like chart gaming and newsletter placement selling—and much more—were surprisingly common.

    …In yet more screencaps, members discuss the mechanics of “book stuffing.” Book stuffing is a term that encompasses a wide range of methods for taking advantage of the Kindle Unlimited revenue structure. In Kindle Unlimited, readers pay $9.99 a month to read as many books as they want that are available through the KU program. This includes both popular mainstream titles like the Harry Potter series and self-published romances put out by authors like Crescent and Hopkins. Authors are paid according to pages read, creating incentives to produce massively inflated and strangely structured books. The more pages Amazon thinks have been read, the more money an author receives.

    …Book stuffing is particularly controversial because Amazon pays authors from a single communal pot. In other words, Kindle Unlimited is a zero-sum game. The more one author gets from Kindle Unlimited, the less the other authors get. The romance authors Willink was discovering didn’t go in for clumsy stuffings of automatic translations or HTML cruft; rather, they stuffed their books with ghostwritten content or repackaged, previously published material. In the latter case, the author will bait readers with promises of fresh content, like a new novella, at the end of the book. Every time a reader reads to the end of a 3,000-page book, the author earns almost 14 dollars. For titles that break into the top of the Kindle Unlimited charts, this trick can generate a fortune.
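The zero-sum point can be made concrete with a toy model. The sketch below assumes the communal pot is split strictly in proportion to pages read, which is a simplification and not a documented Amazon formula; the pool size, author names, and page counts are all hypothetical.

```python
def pool_payouts(pool_dollars, pages_read):
    """Split a fixed pool among authors in proportion to pages read (assumed rule)."""
    total = sum(pages_read.values())
    if total == 0:
        return {author: 0.0 for author in pages_read}
    return {author: pool_dollars * pages / total
            for author, pages in pages_read.items()}


# Two authors, each read for 300 pages: the pool is split evenly.
honest = pool_payouts(1000.0, {"author_a": 300, "author_b": 300})
# author_a stuffs the book to 3,000 pages; author_b changes nothing.
stuffed = pool_payouts(1000.0, {"author_a": 3000, "author_b": 300})

print(honest)
print(stuffed)
```

Under this assumption, author_b's payout falls even though author_b's readership is unchanged. With a real pool shared by many thousands of authors the effect of a single stuffer is far smaller, but the direction is the same.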

  16. Regression


  18. ⁠, Joe Alcorn (2020-12-13):

    In news that surprises nobody, last week Goodreads quietly announced the deprecation of their public APIs. And I mean really quietly—the only people who were told about this were those unfortunate enough to have their existing API keys disabled without warning. Other than a small banner at the top of the API docs which mentions vague “plans to retire these tools”, nobody else appears to have heard anything from Goodreads, including those whose API keys remain active…So this is an “announcement” much in the way a windshield announces its presence to bugs on a highway, and with the same consequences: dead bugs. Some developers have taken to the API discussion boards and blogs, but the overall impression I’m getting is grim acceptance. Really the surprising thing is how long it took them: Amazon has been in charge at Goodreads for almost 8 years now, and I think we’ve all been expecting this to come at some point.

    So why now? What’s changed? Well, the fact is the market’s changing—and Goodreads isn’t. Alternative options are starting to emerge, and since Goodreads has forgotten how to innovate, it wants to use its market position to stifle innovation instead.

  19. ⁠, Melanie Walsh, Maria Antoniak (2021-04-21; fiction):

    The Classics “Shelf”: Genre, Hashtag, Advertising Keyword: This essay understands Goodreads users to be readers as well as “amateur critics”,…

    The Goodreads Algorithmic Echo Chamber: …The first key insight is that Goodreads purposely conceals and obfuscates its data from the public. The company does not provide programmatic (API) access to the full text of its reviews, as some websites and social media platforms do. To collect reviews, we thus needed to use a technique called “web scraping”, where one extracts data from the web, specifically from the part of a web page that users can see, as opposed to retrieving it from an internal source. The Goodreads web interface makes it difficult to scrape large amounts of review data, however. It’s not just difficult for researchers to collect Goodreads reviews. It’s difficult for anyone to interact with Goodreads reviews. Though more than 90 million reviews have been published on Goodreads in the site’s history, one can only view 300 reviews for any given book in any given sort setting, a restriction that was implemented in 2016. Previously, Goodreads users could read through thousands of reviews for any given book. Because there are a handful of ways to sort Goodreads reviews (eg., by publication date or by language), it is technically possible to read through 300 reviews in each of these sort settings. But even when accounting for all possible sort setting permutations, the number of visible and accessible Goodreads reviews is still only a tiny fraction of total Goodreads reviews. This throttling has been a source of frustration both for Goodreads users and for researchers.

    Table 2: Summary Statistics for Goodreads Classics Reviews

    Variable                           Oldest    Newest    Default   All
    Number of Reviews                  42,311    42,657    42,884    127,855
    Mean Length of Reviews (words)     54.6      91.8      261.2     136.3
    Number of Unique Users             24,163    33,486    17,362    69,342
    Mean Number of Reviews per User    1.75      1.27      2.47      1.84
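The last row of Table 2 follows from the first and third rows, which gives a quick consistency check. The sketch below uses only the counts quoted in the table; the dictionary and function names are illustrative.

```python
# (number of reviews, number of unique users) per sort setting, from Table 2.
TABLE_2 = {
    "Oldest": (42_311, 24_163),
    "Newest": (42_657, 33_486),
    "Default": (42_884, 17_362),
    "All": (127_855, 69_342),
}


def reviews_per_user(reviews, users):
    """Mean number of reviews written per unique user."""
    return reviews / users


for sort_name, (reviews, users) in TABLE_2.items():
    print(f"{sort_name:8s} {reviews_per_user(reviews, users):.2f}")
```

The default sort has the fewest unique users for roughly the same number of reviews, which matches the excerpt's observation that default-sorted reviews are concentrated among a small set of reviewers.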

    Working within these constraints, we collected approximately 900 unique reviews for each classic book—300 default sorted reviews, 300 newest reviews, and 300 oldest reviews—for a total of 127,855 Goodreads reviews. We collected these reviews regardless of whether the user explicitly shelved the book as a “classic” or not. We also explicitly filtered for English language reviews. Despite this filtering, a small number of non-English and multi-language reviews are included in the dataset, and they show up as outliers in some of our later results. Compared to the archives of most readership and reception studies, this dataset is large and presents exciting possibilities for studying reception at scale. But it is important to note that this dataset is not large or random enough to be a statistically representative sample of the “true” distribution of classics reviews on Goodreads. We believe our results provide valuable insight into Goodreads and the classics nonetheless.
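The collection strategy described here (up to 300 reviews in each of three sort settings, merged into a set of unique reviews) can be sketched as below. The excerpt does not publish the authors' code, so this is a guess at the general shape: the review records, the `id` field, and the function name are hypothetical, and no actual scraping is performed.

```python
def collect_unique_reviews(reviews_by_sort, per_sort_cap=300):
    """Merge capped review lists from several sort settings, keeping one copy per ID."""
    unique = {}
    for sort_name, reviews in reviews_by_sort.items():
        for review in reviews[:per_sort_cap]:
            # A review visible under several sort settings is kept only once.
            unique.setdefault(review["id"], review)
    return list(unique.values())


# Tiny made-up example: review 2 appears under two sort settings.
reviews_by_sort = {
    "default": [{"id": 1, "text": "..."}, {"id": 2, "text": "..."}],
    "newest": [{"id": 2, "text": "..."}, {"id": 3, "text": "..."}],
    "oldest": [{"id": 4, "text": "..."}],
}
print(len(collect_unique_reviews(reviews_by_sort)))  # prints 4
```

Because the same review can appear under more than one sort setting, the merged total per book comes out at "approximately 900" and not exactly 900.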

    Though the constraints of the Goodreads platform distort our dataset in certain ways, we tried to use this distortion to better scrutinize the influence of the web interface on Goodreads users. For example, the company never makes clear how it sorts reviews by default, but we found that reviews with a combination of more likes and more comments almost always appear above those with fewer—except in certain cases when there is, perhaps, another invisible social engagement metric such as the number of clicks, views, or shares that a review has received. Since we collected data in multiple sort settings, we are able to go further than this basic observation and investigate how exactly this default sorting algorithm shapes Goodreads users’ behavior, social interactions, and perceptions of the classics. Based on our analysis, we found that the first 300 default visible reviews for any given book develop into an echo chamber. Once a Goodreads review appears in the default sorting, in other words, it is more likely to be liked and commented on, and more likely to stay there (Figure 6). Meanwhile the majority of reviews quickly age beyond “newest” status and become hidden from public view. These liking patterns reveal that Goodreads users reinforce certain kinds of reviews, such as longer reviews (Figure 7), reviews that include a “spoiler alert” (Figure 9), and reviews written by a small set of Goodreads users who likely have many followers (Table 2). If a review is prominently displayed by the default sorting algorithm, its author may be more likely to go back and modify this review. More default-sorted reviews included the words “update” or “updated” than oldest or newest reviews (Figure 8). In one especially interesting updated review, a Goodreads user raised her rating of Toni Morrison’s The Bluest Eye and apologized for the way that her original, more negative review offended others and reflected her white privilege, which other Goodreads users had pointed out.

    Figure 6: This figure shows the number of average likes per review, broken down by Goodreads main review sort orders. The error bars indicate the standard deviation across 20 bootstrapped samples of the books, providing a measure of instability when a particular book is included or excluded in the dataset.
    Figure 7: This figure shows the average length of reviews, broken down by Goodreads main review sort orders. The error bars indicate the standard deviation across 20 bootstrapped samples of the books, providing a measure of instability when a particular book is included or excluded in the dataset.
    Figure 8: This figure shows the number of reviews that included the word “update” or “updated”, broken down by Goodreads main review sort orders. The error bars indicate the standard deviation across 20 bootstrapped samples of the books, providing a measure of instability when a particular book is included or excluded in the dataset.
    Figure 9: This figure shows the number of reviews that included a “spoiler” tag, broken down by Goodreads main review sort orders. The error bars indicate the standard deviation across 20 bootstrapped samples of the books, providing a measure of instability when a particular book is included or excluded in the dataset.
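The error bars in Figures 6 to 9 are described as the standard deviation across 20 bootstrapped samples of the books. A minimal sketch of that idea follows, assuming books are resampled with replacement and the statistic is mean likes per review; the captions do not spell out these details, and the data and names below are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_sd(likes_by_book, n_samples=20, seed=0):
    """SD of mean likes per review across bootstrap resamples of the books."""
    rng = random.Random(seed)
    books = list(likes_by_book)
    estimates = []
    for _ in range(n_samples):
        # Resample whole books with replacement, then pool their reviews.
        sample = rng.choices(books, k=len(books))
        likes = [n for book in sample for n in likes_by_book[book]]
        estimates.append(mean(likes))
    return stdev(estimates)


# Made-up like counts: one book dominates, so the estimate is unstable.
likes_by_book = {
    "book_a": [0, 1, 2, 1],
    "book_b": [1, 0, 0, 2],
    "book_c": [250, 400, 90, 30],
}
print(bootstrap_sd(likes_by_book))
```

Resampling at the level of books, not individual reviews, is what makes the error bar reflect how much a single book's inclusion or exclusion moves the result.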
  20. ⁠, Constance Grady (2019-02-06):

    The oft-repeated elevator pitch on Black Leopard Red Wolf, the buzzy new novel from Man Booker Prize winner Marlon James, is that it’s the African Game of Thrones. (“I said that as a joke”, James protested in an interview this week.) To a certain extent, the comparison holds. Black Leopard Red Wolf is a lush epic fantasy set in an enchanted and mythical Africa, filled with quests and magical beasts and vicious battles to the death. But it’s also a much weirder, twistier book than the Game of Thrones parallels would suggest. Most notably, it is not driven by story. Black Leopard Red Wolf actively resists any attempts on the reader’s part to sink inside the world of the book and lose themselves. It is deliberately opaque, on the level of sentence as well as plot.

    On the sentence level, James likes to withhold proper nouns until the last possible moment and then waits to reveal them just a little bit longer than you’d think he should be able to get away with. That means his sentences are generally carried by verbs, and you don’t know who is doing what or why for long stretches at a time: You just get an impression of anonymous limbs tangled together in sex or battle for some reason that is not immediately clear.

    On the plot level, the quest for a missing boy that ostensibly powers the action of the book is so confusing, and has so little to do with the main character’s motivations, that the rest of the characters are constantly complaining about it. “This child carries no stakes for you”, one says toward the end of the novel to Tracker, our protagonist, and she’s correct. So is the poor sad giant who has the premise of the quest he is on explained to him multiple times and can only conclude, “Confusing, this is.”

    …In other words, we know that the quest will be futile and the child will die. We also know that the protagonist is not particularly interested in the quest. It is nearly impossible for a reader to hook into the narrative. Yet Black Leopard Red Wolf spends hundreds and hundreds of pages tracking its many twists and permutations. The opacity here is clearly a deliberate choice on James’s part. He is not interested in easy reads or straightforward stories. “The African folktale is not your refuge from skepticism”, he told the earlier this year. “It is not here to make things easy for you, to give you faith so you don’t have to think.” And James plans to keep things challenging through the rest of the Dark Star trilogy, of which Black Leopard is only the first volume. He’s modeling it on Showtime’s Rashomon-like series The Affair, he says, so that each volume will present the same events to the reader through a different point of view. “The series is three different versions of the same story, and I’m not going to tell people which they should believe”, James says.


  22. ⁠, Paul Bürkner ():

    The brms package provides an interface to fit Bayesian generalized (non-)linear multivariate multilevel models using Stan, which is a C++ package for performing full Bayesian inference (see the Stan website). The formula syntax is very similar to that of the package lme4 to provide a familiar and simple interface for performing regression analyses. A wide range of response distributions are supported, allowing users to fit—among others—linear, robust linear, count data, survival, response times, ordinal, zero-inflated, and even self-defined mixture models all in a multilevel context. Further modeling options include non-linear and smooth terms, auto-correlation structures, censored data, missing value imputation, and quite a few more. In addition, all parameters of the response distribution can be predicted in order to perform distributional regression. Multivariate models (ie., models with multiple response variables) can be fit, as well. Prior specifications are flexible and explicitly encourage users to apply prior distributions that actually reflect their beliefs. Model fit can easily be assessed and compared with posterior predictive checks, cross-validation, and Bayes factors.

  23. Resorter#background

  24. ⁠, Paul Bürkner (2019-08-29):

    This vignette provides an introduction on how to fit distributional regression models with brms. We use the term distributional model to refer to a model, in which we can specify predictor terms for all parameters of the assumed response distribution. In the vast majority of regression model implementations, only the location parameter (usually the mean) of the response distribution depends on the predictors and corresponding regression parameters. Other parameters (eg., scale or shape parameters) are estimated as auxiliary parameters assuming them to be constant across observations. This assumption is so common that most researchers applying regression models are often (in my experience) not aware of the possibility of relaxing it. This is understandable insofar as relaxing this assumption drastically increases model complexity and thus makes models hard to fit. Fortunately, brms uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models so that model complexity is much less of an issue.
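To make the idea of a distributional model concrete, here is a small sketch of a Gaussian log-likelihood in which both the mean and the standard deviation depend on a predictor. It is written in plain Python and is not brms code or the brms API; the log link for the scale, the parameter names, and the toy data are all assumptions made for illustration.

```python
import math


def gaussian_loglik(xs, ys, b0, b1, g0, g1=0.0):
    """Log-likelihood with mean b0 + b1*x and sd exp(g0 + g1*x).

    With g1 = 0 this reduces to an ordinary regression with constant sd.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        mu = b0 + b1 * x
        sigma = math.exp(g0 + g1 * x)  # assumed log link keeps sigma positive
        z = (y - mu) / sigma
        total += -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z
    return total


# Toy data whose spread grows with x (sd 1 at x=0, sd e at x=1).
xs = [0, 0, 1, 1]
ys = [-1.0, 1.0, -math.e, math.e]

constant_sd = math.sqrt(sum(y * y for y in ys) / len(ys))
print(gaussian_loglik(xs, ys, 0.0, 0.0, math.log(constant_sd)))  # constant scale
print(gaussian_loglik(xs, ys, 0.0, 0.0, 0.0, 1.0))               # scale varies with x
```

On data like these, letting the scale depend on the predictor yields a higher log-likelihood than the best constant scale, which is the kind of structure a distributional model is meant to capture.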

    …In the examples so far, we did not have multilevel data and thus did not fully use the capabilities of the distributional regression framework of brms. In the example presented below, we will not only show how to deal with multilevel data in distributional models, but also how to incorporate smooth terms (ie., splines) into the model. In many applications, we have no or only a very vague idea what the relationship between a predictor and the response looks like. A very flexible approach to tackle this problem is to use splines and let them figure out the form of the relationship.

  25. ⁠, Tim O'Reilly (2002-12-11):

    The continuing controversy over online file sharing sparks me to offer a few thoughts as an author and publisher. To be sure, I write and publish neither movies nor music, but books. But I think that some of the lessons of my experience still apply.

    1. Lesson 1: Obscurity is a far greater threat to authors and creative artists than piracy.

      …More than 100,000 books are published each year, with several million books in print, yet fewer than 10,000 of those new books have any substantial sales, and only a hundred thousand or so of all the books in print are carried in even the largest stores…The web has been a boon for readers, since it makes it easier to spread book recommendations and to purchase the books once you hear about them. But even then, few books survive their first year or two in print. Empty the warehouses and you couldn’t give many of them away…

    2. Lesson 2: Piracy is progressive taxation

      For all of these creative artists, most laboring in obscurity, being well-enough known to be pirated would be a crowning achievement. Piracy is a kind of progressive taxation, which may shave a few percentage points off the sales of well-known artists (and I say “may” because even that point is not proven), in exchange for massive benefits to the far greater number for whom exposure may lead to increased revenues…

    3. Lesson 3: Customers want to do the right thing, if they can.

      …We’ve found little or no abatement of sales of printed books that are also available for sale online…The simplest way to get customers to stop trading illicit digital copies of music and movies is to give those customers a legitimate alternative, at a fair price.

    4. Lesson 4: Shoplifting is a bigger threat than piracy.

      …What we have is a problem that is analogous, at best, to shoplifting, an annoying cost of doing business. And overall, as a book publisher who also makes many of our books available in electronic form, we rate the piracy problem as somewhere below shoplifting as a tax on our revenues. Consistent with my observation that obscurity is a greater danger than piracy, shoplifting of a single copy can lead to lost sales of many more. If a bookstore has only one copy of your book, or a music store one copy of your CD, a shoplifted copy essentially makes it disappear from the next potential buyer’s field of possibility. Because the store’s inventory control system says the product hasn’t been sold, it may not be reordered for weeks or months, perhaps not at all. I have many times asked a bookstore why they didn’t have copies of one of my books, only to be told, after a quick look at the inventory control system: “But we do. It says we still have one copy in stock, and it hasn’t sold in months, so we see no need to reorder.” It takes some prodding to force the point that perhaps it hasn’t sold because it is no longer on the shelf…

    5. Lesson 5: File sharing networks don’t threaten book, music, or film publishing. They threaten existing publishers.

      …The question before us is not whether technologies such as peer-to-peer file sharing will undermine the role of the creative artist or the publisher, but how creative artists can leverage new technologies to increase the visibility of their work. For publishers, the question is whether they will understand how to perform their role in the new medium before someone else does. Publishing is an ecological niche; new publishers will rush in to fill it if the old ones fail to do so…Over time, it may be that online music publishing services will replace CDs and other physical distribution media, much as recorded music relegated sheet music publishers to a niche and, for many, made household pianos a nostalgic affectation rather than the home entertainment center. But the role of the artist and the music publisher will remain. The question then, is not the death of book publishing, music publishing, or film production, but rather one of who will be the publishers.

    6. Lesson 6: “Free” is eventually replaced by a higher-quality paid service

      A question for my readers: How many of you still get your email via peer-to-peer UUCP dialups or the old “free” Internet, and how many of you pay $19.95 a month (in 2002 dollars; about $30.96 adjusted for inflation) or more to an ISP? How many of you watch “free” television over the airwaves, and how many of you pay $20–$60 a month (in 2002 dollars; about $31–$93 adjusted for inflation) for cable or satellite television? (Not to mention continue to rent movies on videotape and DVD, and purchasing physical copies of your favorites.) Services like Kazaa flourish in the absence of competitive alternatives. I confidently predict that once the music industry provides a service that provides access to all the same songs, freedom from onerous copy-restriction, more accurate metadata and other added value, there will be hundreds of millions of paying subscribers…Another lesson from television is that people prefer subscriptions to pay-per-view, except for very special events. What’s more, they prefer subscriptions to larger collections of content, rather than single channels. So, people subscribe to “the movie package”, “the sports package” and so on. The recording industry’s “per song” trial balloons may work, but I predict that in the long term, an “all-you-can-eat” monthly subscription service (perhaps segmented by musical genre) will prevail in the marketplace.

    7. Lesson 7: There’s more than one way to do it.

      A study of other media marketplaces shows, though, that there is no single silver-bullet solution. A smart company maximizes revenue through all its channels, realizing that its real opportunity comes when it serves the customer who ultimately pays its bills…Interestingly, some of our most successful print/online hybrids have come about where we present the same material in different ways for the print and online contexts. For example, much of the content of our bestselling book Programming Perl (more than 600,000 copies in print) is available online as part of the standard Perl documentation. But the entire package—not to mention the convenience of a paper copy, and the aesthetic pleasure of the strongly branded packaging—is only available in print. Multiple ways to present the same information and the same product increase the overall size and richness of the market. And that’s the ultimate lesson. “Give the Wookiee what he wants!” as Han Solo said so memorably in the first Star Wars movie. Give it to him in as many ways as you can find, at a fair price, and let him choose which works best for him.

  26. themathematicsofbeauty.html: , Christian Rudder (OKCupid) (2011-01-10; psychology/okcupid):

    [Today’s dataset: 1.54m votes, 596k messages, 64k profiles.]

    This post investigates female attractiveness, but without the usual photo analysis stuff. Instead, we look past a woman’s picture, into the reaction she creates in the reptile mind of the human male. Among the remarkable things we’ll show:

    • that the more men as a group disagree about a woman’s looks, the more they end up liking her
    • guys tend to ignore girls who are merely cute
    • and, in fact, having some men think she’s ugly actually works in a woman’s favor

    …Now let’s look back at the two real users from before, this time with their own graphs. OkCupid uses a 1 to 5 star system for rating people, so the rest of our discussion will be in those terms. All the users pictured were generous and confident enough to allow us to dissect their experience on our site, and we appreciate it. Okay, so we have: […] As you can see, though the average attractiveness for the two women above is very close, their vote patterns differ. On the left you have consensus, and on the right you have split opinion.

    To put a fine point on it:

    • Ms. Left is, in an absolute sense, considered slightly more attractive
    • Ms. Right was also given the lowest rating 142% more often
    • yet Ms. Right gets as many messages

    When we began pairing other people of similar looks and profiles, but different message outcomes, this pattern presented itself again and again. The less-messaged woman was usually considered consistently attractive, while the more-messaged woman often created variation in male opinion…Our first result was to compare the standard deviation of a woman’s votes to the messages she gets. The more men disagree about a woman’s looks, the more they like her. I’ve plotted the deviation vs. messages curve below, again including some examples…
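The comparison described here, similar average rating but different spread of votes, comes down to the mean and standard deviation of a 1-to-5 star distribution. The vote counts below are invented to illustrate the contrast and are not OkCupid data; the function name is hypothetical.

```python
from statistics import mean, pstdev


def expand_votes(counts):
    """Turn [n_1star, ..., n_5star] counts into a flat list of star ratings."""
    return [stars for stars, n in enumerate(counts, start=1) for _ in range(n)]


# Hypothetical vote counts for 1..5 stars, both with 20 votes in total.
consensus = expand_votes([0, 2, 16, 2, 0])   # most voters agree on 3 stars
polarizing = expand_votes([6, 2, 4, 2, 6])   # many 1-star and many 5-star votes

for name, votes in [("consensus", consensus), ("polarizing", polarizing)]:
    print(name, mean(votes), round(pstdev(votes), 3))
```

Both profiles average 3 stars, but the second has a much larger standard deviation. The excerpt's finding is that, at a similar average, the higher-deviation profile tends to receive more messages.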

  27. Books

  28. ⁠, Zygmunt Z. (FastML) (2017-11-29):

    There have been a few recommendations datasets for movies (Netflix, Movielens) and music (Million Songs), but not for books. That is, until now. The dataset contains six million ratings for ten thousand most popular books (with most ratings). There are also:

    • books marked to read by the users
    • book metadata (author, year, etc.)
    • tags/shelves/genres

    As to the source, let’s say that these ratings come from a site similar to, but with more permissive terms of use. There are a few types of data here:

    • explicit ratings
    • implicit feedback indicators (books marked to read)
    • tabular data (book info)
    • tags

    are available on ⁠. Some of them are quite large, so GitHub won’t show their contents online. See samples for smaller CSV snippets. You can download individual zipped files from releases⁠.

  29. ⁠, Zygmunt Z. (FastML) (2017-11-29):

    This dataset contains six million ratings for ten thousand most popular (with most ratings) books. There are also:

    • books marked to read by the users
    • book metadata (author, year, etc.)
    • tags/shelves/genres

    Access: Some of these files are quite large, so GitHub won’t show their contents online. See samples/ for smaller CSV snippets.

    Open the notebook for a quick look at the data. Download individual zipped files from releases⁠.

    The dataset is accessible from Spotlight⁠, recommender software based on PyTorch.

    Contents: ratings.csv contains ratings sorted by time. It is 69MB and looks like this:

    user_id,book_id,rating
    1,258,5
    2,4081,4
    2,260,5
    2,9296,5
    2,2318,3

    Ratings go from one to five. Both book IDs and user IDs are contiguous. For books, they are 1–10000, for users, 1–53424.
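    As a rough illustration of working with this layout, the sketch below parses the sample rows from the excerpt with Python's standard `csv` module and averages ratings per user. The helper name `mean_rating_per_user` is hypothetical, and the in-memory string stands in for the real ratings.csv file.

```python
import csv
import io
from collections import defaultdict

# The five sample rows quoted in the README excerpt.
SAMPLE = """user_id,book_id,rating
1,258,5
2,4081,4
2,260,5
2,9296,5
2,2318,3
"""

def mean_rating_per_user(csv_file):
    """Return {user_id: mean rating} from rows of user_id,book_id,rating."""
    totals = defaultdict(lambda: [0, 0])  # user_id -> [rating sum, rating count]
    for row in csv.DictReader(csv_file):
        rating = int(row["rating"])
        if not 1 <= rating <= 5:  # the README says ratings go from one to five
            raise ValueError(f"rating out of range: {rating}")
        entry = totals[int(row["user_id"])]
        entry[0] += rating
        entry[1] += 1
    return {user: total / count for user, (total, count) in totals.items()}

means = mean_rating_per_user(io.StringIO(SAMPLE))
print(means)
```

    For the full file, the same function could be given an open file handle instead of the `io.StringIO` wrapper.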

    to_read.csv provides IDs of the books marked “to read” by each user, as user_id, book_id pairs, sorted by time. There are close to a million pairs.

    books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The metadata have been extracted from goodreads XML files, available in books_xml.


    book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. They are sorted by goodreads_book_id ascending and count descending.

    In raw XML files, tags look like this:

    <popular_shelves>
        <shelf name="science-fiction" count="833"/>
        <shelf name="fantasy" count="543"/>
        <shelf name="sci-fi" count="542"/>
        …
        <shelf name="for-fun" count="8"/>
        <shelf name="all-time-favorites" count="8"/>
        <shelf name="science-fiction-and-fantasy" count="7"/>
    </popular_shelves>

    Here, each tag/shelf is given an ID. tags.csv translates tag IDs to names.
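    One way to picture the step from raw XML to ID-based rows is sketched below with `xml.etree.ElementTree`. The ID assignment (first-seen order starting at 0) and the function name `shelves_to_rows` are assumptions made for illustration; the excerpt does not say how the dataset actually numbers its tags.

```python
import xml.etree.ElementTree as ET

# A few of the shelves quoted in the excerpt (elided entries omitted).
XML = """<popular_shelves>
    <shelf name="science-fiction" count="833"/>
    <shelf name="fantasy" count="543"/>
    <shelf name="sci-fi" count="542"/>
    <shelf name="for-fun" count="8"/>
</popular_shelves>"""

def shelves_to_rows(xml_text, tag_ids):
    """Convert <shelf> elements to (tag_id, count) rows, count descending.

    tag_ids maps tag name -> ID and is extended in place for unseen names
    (hypothetical numbering scheme, not the dataset's real one).
    """
    root = ET.fromstring(xml_text)
    rows = []
    for shelf in root.iter("shelf"):
        name = shelf.get("name")
        tag_id = tag_ids.setdefault(name, len(tag_ids))
        rows.append((tag_id, int(shelf.get("count"))))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

tag_ids = {}
rows = shelves_to_rows(XML, tag_ids)
print(rows)
print(tag_ids)
```

    Inverting `tag_ids` gives the ID-to-name lookup that tags.csv provides in the real dataset.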

    goodreads IDs

    Each book may have many editions. goodreads_book_id and best_book_id generally point to the most popular edition of a given book, while goodreads work_id refers to the book in the abstract sense.

    You can use the goodreads book and work IDs to create URLs as follows:

    Note that book_id in ratings.csv and to_read.csv maps to work_id, not to goodreads_book_id, meaning that ratings for different editions are aggregated.


  31. ⁠, Nick Evershed (The Guardian) (2013-07-12):

    So, what are the movies that people loved, but critics hated? And what about those movies that got rave reviews but just didn’t click with audiences?

    To try and answer these questions I’ve analysed 10,000 movies from 1970 to 2013 in the Rotten Tomatoes database, and determined the difference between audience score and critic score by subtracting the latter from the former. This gives us an index of audience-critic agreement, which I’ve named the Tisdale-Carano index. From this, we can see which movies the audience loved, but the critics hated—which will be more positive, and movies the critics loved but the audience hated—more negative. We can also find out what types of movies fall into these categories—like which actors, directors and genres are most common to each.

    …I used this IMDb list of 10,000 US-released movies from 1970–2013 (though I did notice a film from 1967) to get ID numbers for a large number of movies. I then wrote a program that accesses the Rotten Tomatoes database via their API and grabbed the title, first two actors listed, genres, first director listed, studio, year of release, and Motion Picture Association of America (MPAA) rating of each movie based on the IMDb number. From this, I removed 2,828 films without a user or critic rating. This produced the dataset for analysis. I created the Tisdale-Carano index by simply subtracting the critic score from the user score, then ranking the entire dataset by this number.
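    The index itself is a subtraction followed by a sort. A minimal sketch under stated assumptions: the titles and scores below are invented placeholders, `rank_by_index` is a hypothetical name, and missing ratings are represented as `None`.

```python
def rank_by_index(movies):
    """Rank (title, audience_score, critic_score) tuples by audience minus critic.

    Films missing either score are dropped, mirroring the article's removal
    of films without a user or critic rating.
    """
    scored = [
        (title, audience - critic)
        for title, audience, critic in movies
        if audience is not None and critic is not None
    ]
    # Most positive first: films audiences liked far more than critics did.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores on a 0-100 scale (assumption, not Rotten Tomatoes data).
movies = [
    ("Movie A", 80, 30),
    ("Movie B", 40, 90),
    ("Movie C", 70, 70),
    ("Movie D", None, 55),  # no audience rating, so it is excluded
]

ranked = rank_by_index(movies)
print(ranked)
```

    The top of the ranking holds films the audience preferred, and the bottom holds films the critics preferred.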