'HP: Methods of Rationality' review statistics

Recording fan speculation for retrospectives; statistically modeling reviews for ongoing story with R
statistics, predictions, Haskell, shell, R, survival-analysis
2012-11-03–2017-06-18 finished certainty: likely importance: 3


The unprecedented gap in Methods of Rationality updates prompts musing about whether readership is increasing enough & what statistics one would use; I write code to download FF.net reviews, clean it, parse it, load into R, summarize the data & depict it graphically, run linear regression on a subset & all reviews, note the poor fit, develop a quadratic fit instead, and use it to predict future review quantities.

Then, I run a similar analysis on a competing fanfiction to find out when they will have equal total review-counts. A try at logarithmic fits fails; fitting a linear model to the previous 100 days of MoR and the competitor works much better, and they predict a convergence in <5 years.

A survival analysis finds no major anomalies in reviewer lifetimes, but an apparent increase in mortality for reviewers who started reviewing with later chapters, consistent with (but far from proving) the original theory that the later chapters' delays are having negative effects.

In a LW comment, I asked:

I noticed in Eliezer's latest [2012-11-01] MoR update that he now has 18,000 words written [of chapter 86], and even when that "chapter" is finished, he still doesn't plan to post anything, on top of a drought that has now lasted more than half a year. This doesn't seem very optimal from the point of view of gaining readers.

But I was wondering how one would quantify that - how one would estimate how many readers Eliezer's MoR strategy of very rare huge dumps is costing him. Maybe survivorship curves, where survivorship is defined as "posting a review on FF.net"? So if, say, during the weekly MoR updates, a reader who posted a review of chapter X posted a review anywhere in X to X+n, that counts as survival of that reader. One problem here is that new readers are constantly arriving… You can't simply say "X readers did not return from chapter 25 to chapter 30, while X+N did not return from chapter 85 to 86, therefore frequent updates are better", since you would expect the latter to be bigger simply because more people are reading. And even if you had data on unfavoriting or unfollowing, the important part is the opportunity cost - how many readers would have subscribed with regular updates?

If you had total page views, that'd be another thing; you could look at conversions: what percentage of page views resulted in conversions to subscribers for the regular update period versus the feast/famine periods. But I don't think FF.net provides it, and while HPMoR.com (set up 2011-06-13 as an alternative to FF.net; note the MoR subreddit was founded 2011-11-30) has Google Analytics, I don't have access. Maybe one could look at each chapter pair-wise, and see what fraction of reviewers return? The growth might average out since we're doing so many comparisons… But the delay is now so extreme this would fail: we'd expect a huge growth in reviewers from chapter 85 to chapter 86, for example, simply because it's been like 8 months now - here too the problem is that the growth in reviewers will be less than what it "should be". But how do we figure out "should be"?

After some additional discussion with clone_of_saturn, I've rejected the idea of survivorship curves; we thought correlations between duration and total review count might work, but interpretation was not clear. So the best current idea is: gather the duration between each chapter, the # of reviews posted within X days (where X is something like 2 or 3), plot the points, and then see whether a line fits better or a logarithm/logistic curve - to see whether growth slowed as the updates were spaced further apart.

Getting the data is the problem. It's easy to get total reviews for each chapter since FF.net provides them. I don't know of any way to get total reviews X days after posting, though! A script or program could probably do it, but I'm not sure I want to bother with all this work (especially since I don't know how to do Bayesian line or logistic-fitting) if it's not of interest to anyone or Eliezer would simply ignore any results.

I decided to take a stab at it.

Data

FF.net does not provide published dates for each chapter, so we'll be working off solely the posted reviews. FF.net reviews for MoR are provided on 2223 pages (as of 2017-06-18), numbering from 1. If we start with http://www.fanfiction.net/r/5782108/85/1/, we can download them with a script enumerating the URLs & using string interpolation. Eyeballing a textual dump of the HTML (courtesy of elinks), the line carrying each review's user, chapter number, and date is easy to extract with grep, leading to this script:

for i in {1..2223}
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/5782108/0/$i/ \
     | grep "report review for abuse " | tee -a reviews.txt
done

Since page #1 holds the most recent reviews and we started with 1, the reviews were dumped newest to oldest; e.g.:

...
[IMG] report review for abuse SparklesandSunshine chapter 25 . 19h
[IMG] report review for abuse SparklesandSunshine chapter 24 . 19h
report review for abuse Emily chapter 7 . 6/12
[IMG] report review for abuse Ambassador666 chapter 17 . 6/12
[IMG] report review for abuse Teeners87 chapter 122 . 6/11
[IMG] report review for abuse Ryoji Mochizuki chapter 122 . 6/10
...
[IMG] report review for abuse Agent blue rose chapter 2 . 12/20/2016
report review for abuse Kuba Ober chapter 2 . 12/20/2016
[IMG] report review for abuse datnebnnii23 chapter 122 . 12/20/2016
report review for abuse Penguinjava chapter 101 . 12/20/2016
[IMG] report review for abuse Meerhawk chapter 122 . 12/20/2016
report review for abuse Laterreader12345 chapter 113 . 12/19/2016
...

We can clean this up with more shell scripting:

sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt

This gives us better-looking data like:

Zethil1276 chapter 10 . 1h
Guest chapter 2 . 4h
SparklesandSunshine chapter 25 . 19h
SparklesandSunshine chapter 24 . 19h
Emily chapter 7 . 6/12
Ambassador666 chapter 17 . 6/12
Teeners87 chapter 122 . 6/11
Ryoji Mochizuki chapter 122 . 6/10

The spaces in names mean we can't simply do a search-and-replace to turn spaces into commas and call that a CSV which can easily be read in R; so more scripting, in Haskell this time (though I'm sure awk or something could do the job, I don't know it):

import Data.List (intercalate)

main :: IO ()
main = do f <- readFile "filtered.txt"
          let parsed = map words (lines f)
          let unparsed = "User,Date,Chapter\n" ++ unlines (map unsplit parsed)
          writeFile "reviews.csv" unparsed

-- eg "Ryoji Mochizuki chapter 122 . 6/10" → ["Ryoji", "Mochizuki", "6/10", "122"] → "Ryoji Mochizuki,6/10,22"
unsplit :: [String] -> String
unsplit xs = let chapter = last $ init $ init xs
                 date = last xs
                 user = intercalate " " $ takeWhile (/="chapter") xs
             in intercalate "," [user, date, chapter]

FF.net as of 2017 has changed its date format to vary by page, so the most recent reviews need to be hand-edited.
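(For the curious, the same parse can be sketched directly in R instead of Haskell - a rough equivalent, assuming filtered.txt lines like "Ryoji Mochizuki chapter 122 . 6/10" and no usernames containing the word "chapter":)

lines <- readLines("filtered.txt")
parts <- strsplit(lines, " ")
## everything before the token "chapter" is the (possibly multi-word) username;
## the token after it is the chapter number; the last token is the date
reviews <- data.frame(
  User    = sapply(parts, function(x) paste(x[seq_len(which(x == "chapter")[1] - 1)], collapse=" ")),
  Date    = sapply(parts, function(x) x[length(x)]),
  Chapter = sapply(parts, function(x) x[which(x == "chapter")[1] + 1]))
write.csv(reviews, "reviews.csv", row.names=FALSE)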

R

A runhaskell mor.hs later and our CSV of 33,332 reviews is ready for analysis in the R interpreter:

data <- read.csv("https://www.gwern.net/docs/lwsurvey/hpmor/2017-hpmor-reviews.csv")
data
...
#                  User      Date Chapter
# 1          Zethil1276 6/18/2017      10
# 2               Guest 6/18/2017       2
# 3 SparklesandSunshine 6/18/2017      25
# 4 SparklesandSunshine 6/18/2017      24
# 5               Emily 6/12/2017       7
# 6       Ambassador666 6/12/2017      17
## Parse the date strings into R date objects
data$Date <- as.Date(data$Date,"%m/%d/%Y")
## Reorganize columns (we care more about the chapter & date reviews were given on
## than which user did a review):
data <- data.frame(Chapter=data$Chapter, Date=data$Date, User=data$User)
data <- data[order(data$Date),]

Descriptive

summary(data)
#     Chapter               Date                           User
#  Min.   :  1.00000   Min.   :2010-02-28   Guest            : 1798
#  1st Qu.: 10.00000   1st Qu.:2010-10-25   Anon             :  128
#  Median : 39.00000   Median :2012-03-26   anon             :  121
#  Mean   : 47.65211   Mean   :2012-09-20   The Anguished One:  121
#  3rd Qu.: 80.00000   3rd Qu.:2014-12-06   Montara          :  119
#  Max.   :122.00000   Max.   :2017-06-18   (Other)          :31044
#                                           NA's             :    1

The dates clearly indicate the slowdown in MoR posting, with the median chapter being posted all the way back in 2010. What does the per-day review pattern look like?

plot(table(data$Date))
Reviews posted, by day

Quite spiky. At a guess, the days with huge spikes are new chapters, with the final spike corresponding to the ending, while a low-level average of what looks like ~10 reviews a day holds steady over time. What's the cumulative total of all reviews over time?

plot(unique(data$Date), cumsum(table(data$Date)), xlab="Date", ylab="Cumulative sum of FF.net reviews left on MoR over time")
All (total cumulative) reviews posted, by day

When we plot each review by the chapter it was reviewing and the date it was posted, we see clearer patterns:

library(ggplot2)
qplot(data$Chapter, data$Date, color=as.ordered(data$Chapter)) +
    theme(legend.position = "none") + xlab("Chapter") + ylab("Date")
Scatterplot of all reviews, date posted versus chapter

Obviously no one can review a chapter before it was posted, which accounts for the thick black line along the bottom. We can also see how the delays between chapters grew: originally very small, the gaps increase tremendously in the 70s chapters - just a few of the delays look like they cover the majority of the entire publishing lifespan of MoR! Another interesting observation is that chapter 1 and chapter 5 are extremely popular for reviewers, with hardly a day passing without review even as vast wastelands open up in 2012 in chapters 40-60. A final curious observation is the presence of distinct horizontal lines; I am guessing that these are people prone to reviewing who are binging their way through MoR, leaving reviews on increasing chapters over short time periods.
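We could crudely check that binge-reading guess by looking for reviewers who leave many reviews within a short calendar span; a sketch (the 10-review/14-day thresholds are arbitrary):

## per-user review count & calendar span between first and last review
perUser <- aggregate(Date ~ User, data=data,
                     FUN=function(d) as.numeric(max(d) - min(d)))
perUser$N <- as.integer(table(data$User)[as.character(perUser$User)])
## candidate bingers: >=10 reviews, all within 2 weeks
nrow(perUser[perUser$N >= 10 & perUser$Date <= 14,])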

Filtered reviews

The next question is: if we plot only the subset of reviews which were made within 7 days of the first review, what does that look like?


## what's the total by each chapter?
totalReviews <- aggregate(Date ~ Chapter, length, data=data)
totalReviews$Date
#   [1] 1490  766  387  549 1896  815  650  563  550  758  568  325  401  409  333  412  547  420  401  420  321  273  290  346  196  325
#  [27]  362  193  180  227  186  230  211  156   63  123  131   83  211  133  151  242   86   58  223  281  405  147  180  128  237   99
#  [53]   72  206   62  166   96  231   96  154  154  196  342  263  166   58  171  151  301  341  242  421  153  308  227  266  574  209
#  [79]  189  328  274  135   90  185  301  175  280   50  426  169   53  119  134  119  169  123  107  192   74   79  178  153   83  181
# [105]   81   51  101  111   80   95   86  117 2080   88  204  104   96   60  144   77   84  544

earlyReviews <- integer(122)
for (i in 1:122) {
    reviews <- data[data$Chapter==i,]$Date
    published <- min(reviews)
    earlyReviews[i] <- sum(reviews < (published+7))
}
earlyReviews
#   [1]   17   11   13   34  326  190  203  209  296  281  117  136  162  144  126  153  195  198  159  201  127  124  140  110   92  141
#  [27]  192   96   91   43  113  169   76   88    5   34   73   17   48   71   88  128   17   10   40  163  238  100  121   81  148   50
#  [53]   31  164   32  127   62  164   66  116  107  139  151   54  104   19  106  104  170  185  104  172  124  211  163  181  113  168
#  [79]  136  299  206  104   69  139  102  109  120   32  350  144   27  103  104   84  136   89   78  101   21   36   56   54   56  149
# [105]   70   42   87   88   66   80   66  109 2026   67  191   94   89   50  128   65   67  233

How did chapter 5 get 326 reviews within 7 days when the previous ones were getting 10-30, you ask - might my data be wrong? But if you check the chapter 5 reviews and scroll all the way down, you'll find that people back in 2010 were very chuffed by the Draco scenes.

Analysis

Linear modeling

Moving onward, we want to graph the data and then fit a linear model (a line which minimizes the deviation of all of the points from it; I drew on a handout for R programming):

## turn vector into a plottable table & plot it
filtered <- data.frame(Chapter=c(1:122), Reviews=earlyReviews)
plot(filtered$Chapter, filtered$Reviews, xlab="Chapter",
    ylab="Reviews left on each chapter within 7 days of publication")

## run a linear model: is there any clear trend?
l <- lm(Reviews ~ Chapter, data=filtered); summary(l)
# ...Coefficients:
#                     Estimate  Std. Error t value  Pr(>|t|)
# (Intercept)      109.3701395  33.9635875 3.22022 0.0016484
# filtered$Chapter   0.3359780   0.4792406 0.70106 0.4846207
#
# Residual standard error: 186.4181 on 120 degrees of freedom
# Multiple R-squared:  0.004079042, Adjusted R-squared:  -0.0042203
# F-statistic: 0.4914898 on 1 and 120 DF,  p-value: 0.4846207

## the slope doesn't look very interesting...
## let's plot the regression line too:
abline(l)
Graph reviews posted within a week, per chapter.

Look at the noise. We have some huge spikes in the early chapters, some sort of dip in the 40s-70s, and then a similar spread of highly and lowly reviewed chapters at the end. The linear model is basically just the average.

The readership of MoR did increase substantially over its run, so why is there such a small increase over time in the number of fans? Possibly most of the activity drained over to Reddit as the subreddit started up in November 2011, and FF.net reviews also decreased when the official website started up in June 2011. But those could only affect chapter 73 and later.
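One way to eyeball those two hypotheses is to mark the launch dates on the review scatterplot and look for a level shift afterwards (a sketch using the dates given earlier):

plot(data$Date, data$Chapter, xlab="Date", ylab="Chapter")
## official website launch 2011-06-13; subreddit founded 2011-11-30
abline(v=as.numeric(as.Date(c("2011-06-13", "2011-11-30"))), col="red", lty=2)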

All reviews

Same thing if we look at total reviews for each chapter:

filtered2 <- data.frame(Chapter=c(1:122), Reviews=totalReviews$Date)
plot(filtered2$Chapter, filtered2$Reviews, xlab="Chapter", ylab="Total reviews left on each chapter")
model2 <- lm(Reviews ~ Chapter, data=filtered2)
summary(model2)
# ...Coefficients:
#                Estimate  Std. Error  t value   Pr(>|t|)
# (Intercept) 460.8210270  50.7937240  9.07240 2.7651e-15
# Chapter      -3.0505352   0.7167209 -4.25624 4.1477e-05
#
# Residual standard error: 278.7948 on 120 degrees of freedom
# Multiple R-squared:  0.1311624,   Adjusted R-squared:  0.1239221
# F-statistic: 18.11557 on 1 and 120 DF,  p-value: 4.147744e-05
abline(model2)
Graph all reviews by posted time and chapter

Fitting quadratics

Looking at the line, the fit seems poor overall (large residuals), and quite poor at the beginning and end - doesn't it look like a big U? That sounds like a quadratic, so more Googling for R code and an hour or two later, I fit some quadratics to the 7-day reviews and the total reviews:

plot(filtered$Chapter, filtered$Reviews)
model <- lm(filtered$Reviews ~ filtered$Chapter + I(filtered$Chapter^2))
summary(model)
...
Residuals:
     Min       1Q   Median       3Q      Max
-160.478  -42.795   -3.364   40.813  179.321

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           178.41168   22.70879   7.857 1.34e-11 ***
filtered$Chapter       -3.54725    1.21875  -2.911  0.00464 **
I(filtered$Chapter^2)   0.04012    0.01373   2.922  0.00449 **
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 68.15 on 82 degrees of freedom
Multiple R-squared: 0.09533,    Adjusted R-squared: 0.07327
F-statistic:  4.32 on 2 and 82 DF,  p-value: 0.01645

xx <- seq(min(filtered$Chapter),max(filtered$Chapter),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
Fit a quadratic (U-curve) to the filtered review-count for each chapter

The quadratic fits nicely. And for total reviews, we observe a similar curve:

plot(filtered2$Chapter, filtered2$Reviews)
model2 <- lm(filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2))
summary(model2)
...
Residuals:
    Min      1Q  Median      3Q     Max
-289.68  -56.83  -17.59   36.67  695.45

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            543.84019   41.43142  13.126  < 2e-16 ***
filtered2$Chapter      -16.88265    2.22356  -7.593 4.45e-11 ***
I(filtered2$Chapter^2)   0.16479    0.02505   6.578 4.17e-09 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 124.3 on 82 degrees of freedom
Multiple R-squared: 0.4518, Adjusted R-squared: 0.4384
F-statistic: 33.79 on 2 and 82 DF,  p-value: 1.977e-11

xx <- seq(min(filtered2$Chapter),max(filtered2$Chapter),len=200)
yy <- model2$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
Fit a quadratic (U-curve) to the final 2012-11-03 review-count for each chapter

Predictions / confidence intervals

Linear models, even as a quadratic, are pretty simple. They don't have to be a black box, and they're interpretable the moment you are given the coefficients: 'this' times 'that' equals 'that other thing'. We learned these things in middle school by hand-graphing functions. So it's not hard to figure out what function our fitted lines represent, or to extend them rightwards to uncharted waters.

Our very first linear model for filtered$Reviews ~ filtered$Chapter told us that the coefficients were Intercept: 128.3773, filtered$Chapter: -0.0966. Going back to middle school math, the intercept is the y-intercept, the value of y when x = 0; then the other number must be how much we multiply x by.

So the linear model is simply this: y = 128.3773 + (-0.0966 × x). To make predictions, we substitute in for x as we please; so chapter 1 is 128.3773 + (-0.0966 × 1) ≈ 128.28, chapter 85 is 128.3773 + (-0.0966 × 85) ≈ 120.17 (well, what did you expect of a nearly-flat line?) and so on.
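The same hand-arithmetic can be pulled straight out of the fitted model object rather than retyping coefficients; a sketch using the earlier lm fit l (the printed numbers will reflect whichever dataset l was fit to, not necessarily the 2012 coefficients quoted above):

b <- coef(l)                   # intercept & slope
b[[1]] + b[[2]] * c(1, 85)     # manual predictions for chapters 1 & 85
predict(l, newdata=data.frame(Chapter=c(1, 85)))  # same thing via predict()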

We can do this for the quadratic too; your ordinary quadratic looked like y = a + bx + cx², so it's easy to guess what the filled-in equivalent will be. We'll look at total reviews this time (filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2)). The intercept was 543.84019, the first coefficient was -16.88265, and the third mystery number was 0.16479, so the new equation is y = 543.84019 + (-16.88265 × x) + (0.16479 × x²). What's the prediction for chapter 86? 543.84019 + (-16.88265 × 86) + (0.16479 × 86²) ≈ 310.72. And for chapter 100, it'd go 543.84019 + (-16.88265 × 100) + (0.16479 × 100²) ≈ 503.48.

Can we automate this? Yes. Continuing with total reviews per chapter:

x = filtered2$Chapter; y = filtered2$Reviews
model2 <- lm(y ~ x + I(x^2))
predict.lm(model2, newdata = data.frame(x = c(86:100)))
       1        2        3        4        5        6        7        8
310.7242 322.3504 334.3061 346.5914 359.2063 372.1507 385.4248 399.0284
       9       10       11       12       13       14       15
412.9616 427.2244 441.8168 456.7387 471.9903 487.5714 503.4821

(Notice that this use of predict spits out the same answers we calculated for chapter 86 and chapter 100.)

So for example, the fitted quadratic model predicts ~503 total reviews for chapter 100. Making further assumptions (the error in estimating is independent of time, normally distributed, has zero mean, and constant variance), we predict a 95% confidence interval of 215-792 total reviews for chapter 100:

predict.lm(model2, newdata = data.frame(x = 100), interval="prediction")
       fit    lwr      upr
  503.4821 215.05 791.9141

Clearly the model isn't making very strong predictions because the standard error is so high. We get a similar result asking just for the 7-day reviews:

filtered <- data.frame(Chapter=c(1:85), Reviews=earlyReviews)
y <- filtered$Reviews; x <- filtered$Chapter

model = lm(y ~ x + I(x^2))
model
...
Coefficients:
(Intercept)            x       I(x^2)
  178.41168     -3.54725      0.04012

predict.lm(model, newdata = data.frame(x = 100), interval="prediction")
      fit      lwr      upr
1 224.925 66.83383 383.0162

This seems like a fairly reasonable prediction: 66-383 reviews posted within 7 days. But this is based purely off the model; is it taking into account that even the quadratic model has some pretty large residuals? Let's plot the entire prediction band:

filtered <- data.frame(Chapter=c(1:85), Reviews=earlyReviews)
y <- filtered$Reviews; x <- filtered$Chapter
plot(x, xlab="Chapter",
     y, ylab="Reviews posted within week",
     main="7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals")

model <- lm(y ~ x + I(x^2))
xx <- seq(min(x),max(x),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)

prd <- predict(model, newdata = data.frame(x = c(1:85)), interval="prediction", type="response")
lines(c(1:85),prd[,2],col="red",lty=2)
lines(c(1:85),prd[,3],col="red",lty=2)
7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals

The prediction bands look like they're more or less living up to their claimed coverage (7⁄85 isn't hugely far from 5⁄100), but they do so by being very large - embarrassingly, they can't even exclude zero. This should temper our forecasting enthusiasm. What might predict the residuals better? Day of week? Length of chapter update? Interval between updates? Hard to know.
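If we had such covariates, checking them would be straightforward - regress the quadratic's residuals on each candidate. A hypothetical sketch, assuming a vector updateGap of between-chapter intervals in days (the survival-analysis section below computes exactly such an intervals vector):

## do chapters posted after longer gaps get more reviews than the quadratic expects?
res <- residuals(model)
summary(lm(res ~ head(updateGap, length(res))))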

Modeling conclusions

Sadly, neither the quadratic nor the linear fits lend any support to the original notion that we could estimate how much readership MoR is being cost by the exponentially-slowing updates, for the simple reason that the observed reviews per chapter do not match any simple increasing model of readership but a stranger U-curve (why do reviews decline so much in the middle when updates were quite regular?).

We could argue that the sluggishness of the recent upturn in the U-curve (in the 80s) is due to there only being a few humongous chapters, but those chapters are quantitatively and qualitatively different from the early chapters (especially chapter 5), so it's not clear that any of them "should" have, say, 341 reviews within a week.

Additional graphs

Total cumulative MoR reviews as a function of time, with linear fit
Total MoR reviews vs time, quadratic fit
Total cumulative MoR reviews, 100 days prior to 2012-11-17, linear fit

The Review Race: Unexpected Circumstances versus MoR

An interesting question is whether and when MoR will become the most-reviewed fanfiction on FFN (and by extension, one of the most-reviewed fanfictions of all time). It has already passed such hits as "Parachute" & "Fridays at Noon", but the next competitor is Unexpected Circumstances (UC) at 24,163 reviews to MoR's 18,917¹.

Averaging

24.2k vs 19k is a large gap to cover. Some quick and dirty reasoning:

  • MoR: 18911 reviews, begun 02-28-10 or 992 days ago, so 18911⁄992 ≈ 19.1 reviews per day (Status: in progress)
  • UC: 24163 reviews, begun 11-22-10 or 725 days ago, so 24163⁄725 ≈ 33.3 reviews per day (Status: complete)

Obviously if UC doesn't see its average review rate decline, it'll never be passed by MoR! But let's assume the review rate has gone to zero (not too unrealistic a starting assumption, since UC is finished and MoR is not); how many days will it take MoR to catch up? (24163 − 18911)⁄19.1 ≈ 275 days, or under a year.
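The same back-of-the-envelope arithmetic in R (a sketch from the totals & ages above):

morRate <- 18911 / 992      # ~19.1 MoR reviews per day since it began
ucRate  <- 24163 / 725      # ~33.3 UC reviews per day while it ran
(24163 - 18911) / morRate   # ~275 days for MoR to catch a frozen UC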

Modeling

Descriptive Data and Graphs

We can take into account the existing review rates with the same techniques we used previously for investigating MoR review patterns. First, we can download & parse UC's reviews the same way as before: make the obvious tweaks to the shell script, run the sed command, re-run the Haskell, then load into R & format:

for i in {1..1611};
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/6496709/0/$i/ \
    | grep "report review for abuse " >> reviews.txt;
done
sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt
runhaskell mor.hs
data <- read.table("https://www.gwern.net/docs/lwsurvey/hpmor/2012-uc-reviews.csv",header=TRUE, sep=",",quote="")
# Convert the date strings to R dates
data=data.frame(Chapter=data$Chapter, Date=as.Date(data$Date,"%m/%d/%Y"), User=data$User)
data
   Chapter     Date                     User
1       39 12-11-14                    REL57
2       27 12-11-13                    Guest
3       26 12-11-13                  Melanie
4       24 12-11-13                  Melanie
5       39 12-11-13    Dream-with-your-heart
6       39 12-11-12          ebdarcy.qt4good
7       19 12-11-10                    Guest
8       15 12-11-10                 JD Thorn
9       39 12-11-09                 shollyRK
10       9 12-11-08 of-Snowfall-and-Seagulls
...
summary(data)
    Chapter           Date                          User
 Min.   : 1.00   Min.   :10-11-22   Guest             :   54
 1st Qu.:12.00   1st Qu.:11-02-26   AbruptlyChagrined :   39
 Median :22.00   Median :11-06-20   AlexaBrandonCullen:   39
 Mean   :20.62   Mean   :11-06-10   Angeldolphin01    :   39
 3rd Qu.:29.00   3rd Qu.:11-09-09   AnjieNet          :   39
 Max.   :39.00   Max.   :12-11-14   beachblonde2244   :   39
                                    (Other)           :23915

Unlike MoR, multiple people have reviewed all (39) chapters. Let's look at total reviews posted each calendar day:

plot(table(data$Date))

We'd like to see what a running or cumulative total of reviews looks like: how fast is UC's total review-count growing these days?

plot(cumsum(table(data$Date)))
Total reviews as a function of time

Visually, it's not quite a straight line, not quite a quadratic…

A graph of each day UC has existed against reviews left that day

We also get a similar graph to MoR if we graph by chapter instead, with thick bars indicating reviews left at the "latest" chapter while the series was updating, and many reviews left at the final chapter as people binge through UC:

plot(data$Chapter,data$Date)
Total reviews on each chapter of UC

Let's start modeling.

Linear model

The linear model doesn't match well, since it clearly is massively overpredicting future growth!

totals <- cumsum(table(data$Date))
days <- c(1:(length(totals)))
plot(days, totals)

model <- lm(totals ~ days)
abline(model)
A graph of total reviews vs time with a (bad) linear fit overlaid

Quadratic model

Well, the graph does look more like a quadratic, but this has another problem:

model <- lm(totals ~ days  + I(days^2))
xx <- seq(min(days),max(days),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
plot(days, totals)
lines(xx,yy,lwd=2,col=3)
Same graph, better (but still bad) quadratic fit overlaid

It fits much better overall, but for our purposes it is still rubbish: it is now predicting negative growth in total review count (a parabola must come down, after all), which of course cannot happen since reviews are not deleted.
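We can locate exactly where the fitted parabola turns over by computing its vertex from the coefficients (a sketch; with a negative quadratic term, the vertex is the day the fit predicts cumulative reviews to peak and then impossibly 'fall'):

b  <- coef(model)[["days"]]
c2 <- coef(model)[["I(days^2)"]]
-b / (2 * c2)   # vertex of the parabola: predicted peak day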

Logarithmic model

We need a logarithmic function to try to model this "leveling-off" behavior:

logEstimate = lm(totals ~ log(days))
plot(days,predict(logEstimate),type='l',col='blue')
points(days,totals)
I'm a logarithm, and I'm OK…

The fit is very bad at the start, but the part we care about, recent reviews, seems better modeled than before. Progress! With a working model, let's ask the original question: with these models predicting growth for UC and MoR, when do they finally have an equal number of reviews? We can use the predict.lm function to help us:

# for MoR (logEstimate as fit on the MoR totals)
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
...
    9993     9994     9995     9996     9997     9998     9999    10000
31781.54 31781.78 31782.03 31782.27 31782.52 31782.76 31783.01 31783.25
   10001
31783.50
# for UC (logEstimate as re-fit on the UC totals)
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
...
    9993     9994     9995     9996     9997     9998     9999    10000
50129.51 50129.88 50130.26 50130.64 50131.02 50131.39 50131.77 50132.15
   10001
50132.53

Growth in reviews has asymptoted for both fics way out at 10,000 days or 27 years, but still UC > MoR! This feels a little unlikely: the log function grows very slowly, but are the reviews really slowing down that much?

Linear model revisited

Let's ask a more restricted question - instead of trying to model the entire set of reviews, why not look at just the most recent chunk, the one sans any updates (for both, sadly), say the last 100 days?

totals <- cumsum(table(data$Date))
totals <- totals[(length(totals)-100):length(totals)]
days <- c(1:(length(totals)))
days <- days[(length(days)-100):length(days)]
plot(days, totals)
model <- lm(totals ~ days)
summary(model)
...
Residuals:
     Min       1Q   Median       3Q      Max
-18.4305  -4.4631   0.2601   7.4555  16.0125

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 2.392e+04  1.769e+00 13517.26   <2e-16 ***
days        2.472e+00  3.012e-02    82.08   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.826 on 99 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9854
F-statistic:  6737 on 1 and 99 DF,  p-value: < 2.2e-16

abline(model)
A linear fit to the last 100 days of UC reviews

That's an impressive fit: visually a great match, relatively small residuals, extreme statistical significance. (We saw a similar graph for MoR too.) So clearly the linear model is accurately describing recent reviews.

Predicting

Let's use 2 models like that to make our predictions. We'll extract the 2 coefficients which define each linear model, and solve for the equality:

# UC:
coefficients(model)
 (Intercept)         days
23918.704158     2.472312
# MoR:
coefficients(model)
 (Intercept)         days
18323.275842     5.285603

The important thing here is 2.4723 vs 5.2856: that's how many reviews per day UC versus MoR has been getting over the last 100 days, according to the linear models which fit very well in that time period. Where do they intercept each other - or, to put it in our middle-school math class terms, where do they equal the same y and thus each other?

  1. UC: y = 23918.704158 + 2.472312x
  2. MoR: y = 18323.275842 + 5.285603x

We solve for x:

  1. x = (23918.704158 − 18323.275842)⁄(5.285603 − 2.472312) = 5595.428316⁄2.813291 ≈ 1989 days, or ~5.4 years (mechanized in the sketch below)
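(In R, the same solve can be done mechanically from the two coefficient vectors - a sketch, with ucModel & morModel standing for the two 100-day fits above:)

uc  <- coefficients(ucModel)   # 23918.704158, 2.472312
mor <- coefficients(morModel)  # 18323.275842, 5.285603
x <- (uc[[1]] - mor[[1]]) / (mor[[2]] - uc[[2]])
x           # ~1989 days until the regression lines cross
x / 365.25  # ~5.4 years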

Phew! That's a long time. But it's not "never" (as it is in the logarithmic fits). With that in mind, I don't predict that MoR will surpass UC within a year, but bearing in mind that 5.4 years isn't too far off and I expect additional bursts of reviews as MoR updates, I consider it possible within 2 years.

Survival analysis

I compile additional reviews and perform a survival analysis, finding a difference in retention between roughly the first half of MoR and the second half, which I attribute to the changed update schedule & slower production in the second half.

On 2013-05-17, having finished learning & practicing survival analysis on a real dataset, I decided to return to MoR. At that point, more than 7 months had passed since my previous analysis, and more reviews had accumulated (1256 pages' worth in November 2012, 1364 by May 2013), so I downloaded them as before, processed them into a clean CSV, and wound up with 20447 reviews by 7508 users.

I then consolidated the reviews into per-user records: for each reviewer, I extracted their username, the first chapter they reviewed, the date of the first review, the last chapter they reviewed, and the last review's date. A great many reviewers apparently leave only 1 review ever. The sparsity of reviews presents a challenge for defining an event or 'death', since to me the most natural definition of 'death' is "not having reviewed the latest chapter": with such sparsity, a reviewer may still be a reader (the property of interest) without having left a review on the latest chapter. Specifically, of the 7508 users, only 182 have reviewed chapter 87. To deal with this, I arbitrarily make the cutoff "any review of chapter 82 or more recent": a great deal of plot happened in those 5 chapters, so I reason anyone who isn't moved to comment (however briefly) on any of them may not be heavily engaged with the story.

As the days pass, the number of reviewers who are that 'old' will dwindle as they check out or unsubscribe or simply stop being as enthusiastic as they used to be. We can graph this decline over time (zooming in to account for the issue that only 35% of reviewers leave >1 review ever); no particular pattern jumps out beyond its being a type III survivorship curve ("greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring"):

Age of reviewers against probability they will not leave another review

But this is plotting age of reviewers and has no clear connection to calendar time, which was the original hypothesis: that the past 2 years have damaged readership more than the earlier chapters, which were posted more promptly. In particular, chapter posting seems to have slowed down ~ch62 (published 2010-11-27). So, if we were to look at every user who seems to have begun reading MoR only after that date (as evidenced by their first review postdating ch62's publication) and compare them to those reading before, do they have additional mortality (above and beyond the low mortality we would expect of relative latecomers to MoR, which is adjusted for by the survival curve)? The answer seems to be yes; the survival curves look different when we partition users this way:

Different mortality curves for 2 groups of reviewers
  n= 7508, number of events= 6927

           coef exp(coef) se(coef)    z Pr(>|z|)
LateTRUE 0.2868    1.3322   0.0247 11.6   <2e-16

         exp(coef) exp(-coef) lower .95 upper .95
LateTRUE      1.33      0.751      1.27       1.4

A pretty clear difference between groups, and a hazard ratio of 1.33 (95% CI: 1.26-1.39) is not trivially small. I think we can consider the original hypothesis supported to some degree: the later chapters, either because of the update schedule or perhaps the change in subject matter or attracting a different audience or some other reason, seem more likely to lose their readers.
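(As a sanity check: the hazard ratio and its interval are just the exponentiated Cox coefficient ± 1.96 standard errors, computable from the printed output:)

exp(0.2868)                            # ~1.33: the hazard ratio for 'late' reviewers
exp(0.2868 + c(-1.96, 1.96) * 0.0247)  # ~1.27-1.40, matching the printed 95% CI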

But two further minor observations:

  • Visually, the two groups of early & late reviewers look like they might be converging after several years of aging. We might try to interpret this as a 'hardcore' vs 'populist' distinction: the early reviewers tend to be early-adopting hardcore geeks who become devoted fans in for the long haul, while the latecoming regular people gradually get turned off and only the remaining geeks continue to review.

  • Can we more specifically nail down this increased mortality to the interval between a reviewer's last review and the next chapter? So far, no; there is actually a trivially small effect in the other direction when we add that as a variable:

      n= 7326, number of events= 6927
       (182 observations deleted due to missingness)
    
                  coef exp(coef)  se(coef)     z Pr(>|z|)
    LateTRUE  0.433604  1.542808  0.024906  17.4   <2e-16
    Delay    -0.004236  0.995773  0.000262 -16.2   <2e-16
    
             exp(coef) exp(-coef) lower .95 upper .95
    LateTRUE     1.543      0.648     1.469     1.620
    Delay        0.996      1.004     0.995     0.996

    I interpret this as indicating that an update which is posted after a long drought may be a tiny bit (95% CI: 0.9953-0.9963) more likely to elicit some subsequent reviews (perhaps because extremely long delays are associated with posting multiple chapters quickly), but nevertheless, that an overall period with long delays is associated with increased reviewer loss. This makes some post hoc intuitive sense as a polarization effect: if a franchise finally releases an exciting new work after a long hiatus, you may be more likely to discuss or review it, even as the hiatuses make you more likely to finally "check out" for good - you'll either engage or finally abandon it, but you probably won't be indifferent.
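    To get a feel for the size of that Delay coefficient, we can exponentiate it at a few realistic gap lengths (a sketch):

    exp(-0.004236 * c(7, 30, 90, 180))
    # ~0.971 0.881 0.683 0.467: multiplicative hazard for 1-week through 6-month gaps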

Source code

Generation and pre-processing, using the shell & Haskell scripts from before:

data <- read.csv("2013-hpmor-rawreviews.csv", colClasses=c("integer", "Date", "factor"))
users <- unique(data$User)
lifetimes <- data.frame(User=NA, Start=NA, End=NA, FirstChapter=NA, LastChapter=NA)[-1,]
for (i in 1:length(users)) { # isolate a specific user
                  d <- data[data$User == users[i],]
                  # sort by date
                  d <- d[order(d$Date),]

                  first <- head(d, n=1)
                  last <- tail(d, n=1)
                  # build & store new row (per user)
                  new <- data.frame(User=users[i], Start=first$Date, End=last$Date,
                                    FirstChapter=first$Chapter, LastChapter=last$Chapter)
                  lifetimes <- rbind(lifetimes, new)
}
lifetimes$Days <- lifetimes$End - lifetimes$Start
summary(lifetimes$Days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0     0.0    55.2     6.0  1160.0
lifetimes$Dead <- lifetimes$LastChapter <= 82

lifetimes$Late <- lifetimes$Start >= as.Date("10-11-27")

# figure out when each chapter was written by its earliest review
chapterDates <- as.Date(NA)
for (i in 1:max(data$Chapter)) { chapterDates[i] <- min(data[data$Chapter==i,]$Date) }

# how long after chapter X was posted did chapter X+1 come along? (obviously NA/unknown for latest chapter)
intervals <- integer(0)
for (i in 1:length(chapterDates)) { intervals[i] <- chapterDates[i+1] - chapterDates[i] }

# look up for every user the relevant delay after their last review
delay <- integer(0)
for (i in 1:nrow(lifetimes)) { delay[i] <- intervals[lifetimes$LastChapter[i]] }
lifetimes$Delay <- delay

Now that we have a clean set of data with the necessary variables like dates, we can analyze:

lifetimes <- read.csv("https://www.gwern.net/docs/lwsurvey/hpmor/2013-hpmor-reviewers.csv",
                       colClasses=c("factor", "Date", "Date", "integer", "integer",
                                    "integer", "logical", "logical","integer"))

library(survival)

surv <- survfit(Surv(Days, Dead, type="right") ~ 1, data=lifetimes)

png(file="~/wiki/images/hpmor/survival-overall.png", width = 3*480, height = 1*480)
plot(surv, ylim=c(0, 0.35))
invisible(dev.off())

cmodel <- coxph(Surv(Days, Dead) ~ Late, data=lifetimes)
print(summary(cmodel))

cat("\nTest for assumption violation to confirm the Cox regression:\n")
print(cox.zph(cmodel))

png(file="~/wiki/images/hpmor/survival-earlylate-split.png", width = 3*480, height = 1*480)
plot(survfit(Surv(Days, Dead) ~ Late, data=lifetimes),
     lty=c(1,2), ylim=c(0, 0.4), main="Reader survivorship: first review before or after chapter 62",
     ylab="Fraction of reader group surviving", xlab="Days between reader's first & last review")
legend("topright", legend=c("Early readers", "Late readers"), lty=c(1 ,2), inset=0.02)
invisible(dev.off())

print(summary(coxph(Surv(Days, Dead) ~ Late + Delay, data=lifetimes)))

cat("\nBegin bootstrap test of coefficient size...\n")
library(boot)
cat("\nGet a subsample, train Cox on it, extract coefficient estimate;\n")
cat("\nfirst the simplest early/late model, then the more complex:\n")
# TODO: factor out duplication
coxCoefficient1 <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  cmodel <- coxph(Surv(Days, Dead) ~ Late, data=g)
  return(coef(cmodel))
}
cox1Bs <- boot(data=lifetimes, statistic=coxCoefficient1, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox1Bs, type="perc"))
coxCoefficient2 <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  cmodel <- coxph(Surv(Days, Dead) ~ Late + Delay, data=g)
  return(coef(cmodel))
}
cox2Bs <- boot(data=lifetimes, statistic=coxCoefficient2, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox2Bs, type="perc", index=1)) # `Late`
print(boot.ci(cox2Bs, type="perc", index=2)) # `Delay`

See Also


  1. The author points out that his "Offside" fanfiction is past 32,000 reviews.↩︎