‘HP: Methods of Rationality’ review statistics

Recording fan speculation for retrospectives; statistically modeling reviews for ongoing story with R
statistics, predictions, Haskell, shell, R, survival-analysis
2012-11-03–2017-06-18 finished certainty: likely importance: 3


The unprecedented gap in Methods of Rationality updates prompts musing about whether readership is increasing enough & what statistics one would use; I write code to download FF.net reviews, clean it, parse it, load into R, summarize the data & depict it graphically, run linear regression on a subset & all reviews, note the poor fit, develop a quadratic fit instead, and use it to predict future review quantities.

Then, I run a similar analysis on a competing fanfiction to find out when they will have equal total review-counts. A try at logarithmic fits fails; fitting a linear model to the previous 100 days of MoR and the competitor works much better, and they predict a convergence in <5 years.

A survival analysis finds no major anomalies in reviewer lifetimes, but an apparent increase in mortality for reviewers who started reviewing with later chapters, consistent with (but far from proving) the original theory that the later chapters’ delays are having negative effects.

In a LW comment, I asked:

I noticed in Eliezer’s latest [2012-11-01] MoR update that he now has 18,000 words written [of chapter 86], and even when that “chapter” is finished, he still doesn’t plan to post anything, on top of a drought that has now lasted more than half a year. This doesn’t seem very optimal from the point of view of gaining readers.

But I was wondering how one would quantify that - how one would estimate how many readers Eliezer’s MoR strategy of very rare huge dumps is costing him. Maybe survivorship curves, where survivorship is defined as “posting a review on FF.net”? So if, say, during the weekly MoR updates, a reader who posted a review of chapter X posted a review anywhere in X to X+n, that counts as survival of that reader. One problem here is that new readers are constantly arriving… You can’t simply say “X readers did not return from chapter 25 to chapter 30, while X+N did not return from chapter 85 to 86, therefore frequent updates are better” since you would expect the latter to be bigger simply because more people are reading. And even if you had data on unfavoriting or unfollowing, the important part is the opportunity cost - how many readers would have subscribed with regular updates.

If you had total page views, that’d be another thing; you could look at conversions: what percentage of page views resulted in conversions to subscribers for the regular update period versus the feast/famine periods. But I don’t think FF.net provides it, and while HPMoR.com (set up 2011-06-13 as an alternative to FF.net; note the MoR subreddit was founded 2011-11-30) has Google Analytics, I don’t have access. Maybe one could look at each chapter pair-wise, and seeing what fraction of reviewers return? The growth might average out since we’re doing so many comparisons… But the delay is now so extreme this would fail: we’d expect a huge growth in reviewers from chapter 85 to chapter 86, for example, simply because it’s been like 8 months now - here too the problem is that the growth in reviewers will be less than what it “should be”. But how do we figure out “should be”?

After some additional discussion with clone_of_saturn, I’ve rejected the idea of survivorship curves; we thought correlations between duration and total review count might work, but interpretation was not clear. So the best current idea is: gather duration between each chapter, the # of reviews posted within X days (where X is something like 2 or 3), plot the points, and then see whether a line fits it better or a logarithm/logistic curve - to see whether growth slowed as the updates were spaced further apart.

Getting the data is the problem. It’s easy to get total reviews for each chapter since FF.net provides them. I don’t know of any way to get the total reviews as of X days after posting, though! A script or program could probably do it, but I’m not sure I want to bother with all this work (especially since I don’t know how to do Bayesian line or logistic-fitting) if it’s not of interest to anyone or Eliezer would simply ignore any results.

I decided to take a stab at it.

Data

FF.net does not provide published dates for each chapter, so we’ll be working off solely the posted reviews. FF.net reviews for MoR are provided on 2223 pages (as of 2017-06-18), numbering from 1. If we start with http://www.fanfiction.net/r/5782108/85/1/, we can download them with a script enumerating the URLs & using string interpolation. Eyeballing a textual dump of the HTML (courtesy of elinks), it’s easy to extract with grep the line carrying the user, chapter number, and date embedded as a link with each review, leading to this script:

for i in {1..2223}
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/5782108/0/$i/ \
     | grep "report review for abuse " | tee -a reviews.txt
done

Since page #1 holds the most recent reviews and we started with 1, the reviews were dumped newest to oldest: eg.

...
[IMG] report review for abuse SparklesandSunshine chapter 25 . 19h
[IMG] report review for abuse SparklesandSunshine chapter 24 . 19h
report review for abuse Emily chapter 7 . 6/12
[IMG] report review for abuse Ambassador666 chapter 17 . 6/12
[IMG] report review for abuse Teeners87 chapter 122 . 6/11
[IMG] report review for abuse Ryoji Mochizuki chapter 122 . 6/10
...
[IMG] report review for abuse Agent blue rose chapter 2 . 12/20/2016
report review for abuse Kuba Ober chapter 2 . 12/20/2016
[IMG] report review for abuse datnebnnii23 chapter 122 . 12/20/2016
report review for abuse Penguinjava chapter 101 . 12/20/2016
[IMG] report review for abuse Meerhawk chapter 122 . 12/20/2016
report review for abuse Laterreader12345 chapter 113 . 12/19/2016
...

We can clean this up with more shell scripting:

sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt

This gives us better-looking data like:

Zethil1276 chapter 10 . 1h
Guest chapter 2 . 4h
SparklesandSunshine chapter 25 . 19h
SparklesandSunshine chapter 24 . 19h
Emily chapter 7 . 6/12
Ambassador666 chapter 17 . 6/12
Teeners87 chapter 122 . 6/11
Ryoji Mochizuki chapter 122 . 6/10

The spaces in names mean we can’t simply do a search-and-replace to turn spaces into commas and call that a CSV which can easily be read into R; so more scripting, in Haskell this time (though I’m sure awk or something could do the job, I don’t know it):

import Data.List (intercalate)

main :: IO ()
main = do f <- readFile "filtered.txt"
          let parsed = map words (lines f)
          let unparsed = "User,Date,Chapter\n" ++ unlines (map unsplit parsed)
          writeFile "reviews.csv" unparsed

-- eg "Ryoji Mochizuki chapter 122 . 6/10" → ["Ryoji","Mochizuki","chapter","122",".","6/10"] → "Ryoji Mochizuki,6/10,122"
unsplit :: [String] -> String
unsplit xs = let chapter = last $ init $ init xs
                 date = last xs
                 user = intercalate " " $ takeWhile (/="chapter") xs
             in intercalate "," [user, date, chapter]

FF.net as of 2017 has changed its date format to vary by page, so the most recent reviews need to be hand-edited.
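
(The hand-editing could be scripted away; a minimal R sketch, where normalizeDate is a hypothetical helper assuming the scrape ran on 2017-06-18 and that FF.net omits the year only for the current year’s reviews:)

normalizeDate <- function(s, scraped=as.Date("2017-06-18")) {
    if (grepl("^[0-9]+h$", s)) return(scraped)  # relative ages like "19h": call them the scrape day
    if (grepl("^[0-9]+/[0-9]+$", s)) return(as.Date(paste0(s, "/2017"), "%m/%d/%Y")) # year-less "6/12"
    as.Date(s, "%m/%d/%Y")                      # full dates like "12/20/2016"
}
normalizeDate("19h"); normalizeDate("6/12"); normalizeDate("12/20/2016")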

R

A runhaskell mor.hs later and our CSV of 33,332 reviews is ready for analysis in the R interpreter:

data <- read.csv("https://www.gwern.net/docs/lwsurvey/hpmor/2017-hpmor-reviews.csv")
data
# ...               User      Date Chapter
# 1          Zethil1276 6/18/2017      10
# 2               Guest 6/18/2017       2
# 3 SparklesandSunshine 6/18/2017      25
# 4 SparklesandSunshine 6/18/2017      24
# 5               Emily 6/12/2017       7
# 6       Ambassador666 6/12/2017      17
## Parse the date strings into R date objects
data$Date <- as.Date(data$Date,"%m/%d/%Y")
## Reorganize columns (we care more about the chapter & date reviews were given on
## than which user did a review):
data <- data.frame(Chapter=data$Chapter, Date=data$Date, User=data$User)
data <- data[order(data$Date),]

Descriptive

summary(data)
#     Chapter               Date                           User
#  Min.   :  1.00000   Min.   :2010-02-28   Guest            : 1798
#  1st Qu.: 10.00000   1st Qu.:2010-10-25   Anon             :  128
#  Median : 39.00000   Median :2012-03-26   anon             :  121
#  Mean   : 47.65211   Mean   :2012-09-20   The Anguished One:  121
#  3rd Qu.: 80.00000   3rd Qu.:2014-12-06   Montara          :  119
#  Max.   :122.00000   Max.   :2017-06-18   (Other)          :31044
#                                           NA's             :    1

The dates clearly indicate the slowdown in MoR posting, with the median chapter being posted all the way back in 2010. What does the per-day review pattern look like?

plot(table(data$Date))
Reviews posted, by day

Quite spiky. At a guess, the days with huge spikes are new chapters with the final spike corresponding to the ending, while a low-level average of what looks like ~10 reviews a day holds steady over time. What’s the cumulative total of all reviews over time?

plot(unique(data$Date), cumsum(table(data$Date)), xlab="Date", ylab="Cumulative sum of FF.net reviews left on MoR over time")
All (total cumulative) reviews posted, by day

When we plot each review by the chapter it was reviewing and the date it was posted, we see clearer patterns:

library(ggplot2)
qplot(data$Chapter, data$Date, color=as.ordered(data$Chapter)) +
    theme(legend.position = "none") + xlab("Chapter") + ylab("Date")
Scatterplot of all reviews, date posted versus chapter

Obviously no one can review a chapter before it was posted, which is the thick black line along the bottom. We can also see the delays between chapters: originally very small, the gaps increase tremendously in the 70s chapters - just a few of the delays look like they cover the majority of the entire publishing lifespan of MoR! Another interesting observation is that chapter 1 and chapter 5 are extremely popular for reviewers, with hardly a day passing without review even as vast wastelands open up in 2012 in chapters 40-60. A final curious observation is the presence of distinct horizontal lines; I am guessing that these are people prone to reviewing who are binging their way through MoR, leaving reviews on increasing chapters over short time periods.
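
That last guess can be spot-checked; a rough sketch (the thresholds - treating ≥10 distinct chapters reviewed within a 30-day span as a ‘binge’ - are arbitrary, and shared names like “Guest” muddy per-user grouping):

perUser <- split(data, data$User)
stats <- data.frame(User     = names(perUser),
                    Chapters = sapply(perUser, function(d) length(unique(d$Chapter))),
                    SpanDays = sapply(perUser, function(d) as.numeric(max(d$Date) - min(d$Date))))
## candidate binge-reviewers: many chapters covered in a short calendar span
head(stats[stats$Chapters >= 10 & stats$SpanDays <= 30,])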

Filtered reviews

The next question is: if we plot only the subset of reviews which were made within 7 days of the first review, what does that look like?


## what's the total by each chapter?
totalReviews <- aggregate(Date ~ Chapter, data=data, FUN=length)
totalReviews$Date
#   [1] 1490  766  387  549 1896  815  650  563  550  758  568  325  401  409  333  412  547  420  401  420  321  273  290  346  196  325
#  [27]  362  193  180  227  186  230  211  156   63  123  131   83  211  133  151  242   86   58  223  281  405  147  180  128  237   99
#  [53]   72  206   62  166   96  231   96  154  154  196  342  263  166   58  171  151  301  341  242  421  153  308  227  266  574  209
#  [79]  189  328  274  135   90  185  301  175  280   50  426  169   53  119  134  119  169  123  107  192   74   79  178  153   83  181
# [105]   81   51  101  111   80   95   86  117 2080   88  204  104   96   60  144   77   84  544

## count reviews posted within 7 days of each chapter's earliest review
## (the earliest review serving as a proxy for the chapter's publication date):
earlyReviews <- integer(122)
for (i in 1:122) {
    reviews <- data[data$Chapter==i,]$Date
    published <- min(reviews)
    earlyReviews[i] <- sum(reviews < (published+7))
}
earlyReviews
#   [1]   17   11   13   34  326  190  203  209  296  281  117  136  162  144  126  153  195  198  159  201  127  124  140  110   92  141
#  [27]  192   96   91   43  113  169   76   88    5   34   73   17   48   71   88  128   17   10   40  163  238  100  121   81  148   50
#  [53]   31  164   32  127   62  164   66  116  107  139  151   54  104   19  106  104  170  185  104  172  124  211  163  181  113  168
#  [79]  136  299  206  104   69  139  102  109  120   32  350  144   27  103  104   84  136   89   78  101   21   36   56   54   56  149
# [105]   70   42   87   88   66   80   66  109 2026   67  191   94   89   50  128   65   67  233

How did chapter 5 get 326 reviews within 7 days when the previous ones were getting 10-30, you ask - might my data be wrong? But if you check the chapter 5 reviews and scroll all the way down, you’ll find that people back in 2010 were very chuffed by the Draco scenes.

Analysis

Linear modeling

Moving onward, we want to graph the data and then fit a linear regression (a line which minimizes the deviation of all of the points from it; I drew on a handout for R programming):

## turn vector into a plottable table & plot it
filtered <- data.frame(Chapter=c(1:122), Reviews=earlyReviews)
plot(filtered$Chapter, filtered$Reviews, xlab="Chapter",
    ylab="Reviews left on each chapter within 7 days of publication")

## run a linear model: is there any clear trend?
l <- lm(Reviews ~ Chapter, data=filtered); summary(l)
# ...Coefficients:
#                     Estimate  Std. Error t value  Pr(>|t|)
# (Intercept)      109.3701395  33.9635875 3.22022 0.0016484
# filtered$Chapter   0.3359780   0.4792406 0.70106 0.4846207
#
# Residual standard error: 186.4181 on 120 degrees of freedom
# Multiple R-squared:  0.004079042, Adjusted R-squared:  -0.0042203
# F-statistic: 0.4914898 on 1 and 120 DF,  p-value: 0.4846207

## the slope doesn't look very interesting...
## let's plot the regression line too:
abline(l)
Graph reviews posted within a week, per chapter.

Look at the noise. We have some huge spikes in the early chapters, some sort of dip in the 40s-70s, and then a similar spread of highly and lowly reviewed chapters at the end. The linear model is basically just the average.

The readership of MoR did increase substantially over its run, so why is there such a small increase over time in the number of fans? Possibly most of the activity was drained over to Reddit as the subreddit started up in November 2011, and FF.net reviews also decreased when the official website started up in June 2011. But those could only affect chapter 73 and later.

All reviews

Same thing if we look at total reviews for each chapter:

filtered2 <- data.frame(Chapter=c(1:122), Reviews=totalReviews$Date)
plot(filtered2$Chapter, filtered2$Reviews, xlab="Chapter", ylab="Total reviews left on each chapter")
model2 <- lm(Reviews ~ Chapter, data=filtered2)
summary(model2)
# ...Coefficients:
#                Estimate  Std. Error  t value   Pr(>|t|)
# (Intercept) 460.8210270  50.7937240  9.07240 2.7651e-15
# Chapter      -3.0505352   0.7167209 -4.25624 4.1477e-05
#
# Residual standard error: 278.7948 on 120 degrees of freedom
# Multiple R-squared:  0.1311624,   Adjusted R-squared:  0.1239221
# F-statistic: 18.11557 on 1 and 120 DF,  p-value: 4.147744e-05
abline(model2)
Graph all reviews by posted time and chapter

Fitting quadratics

Looking at the line, the fit seems poor overall (large residuals), and quite poor at the beginning and end - doesn’t it look like a big U? That sounds like a quadratic, so more Googling for R code and an hour or two later, I fit some quadratics to the 7-day reviews and the total reviews:

plot(filtered$Chapter, filtered$Reviews)
model <- lm(filtered$Reviews ~ filtered$Chapter + I(filtered$Chapter^2))
summary(model)
# ...Residuals:
#      Min       1Q   Median       3Q      Max
# -160.478  -42.795   -3.364   40.813  179.321
#
# Coefficients:
#                        Estimate Std. Error t value Pr(>|t|)
# (Intercept)           178.41168   22.70879   7.857 1.34e-11 ***
# filtered$Chapter       -3.54725    1.21875  -2.911  0.00464 **
# I(filtered$Chapter^2)   0.04012    0.01373   2.922  0.00449 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 68.15 on 82 degrees of freedom
# Multiple R-squared: 0.09533,    Adjusted R-squared: 0.07327
# F-statistic:  4.32 on 2 and 82 DF,  p-value: 0.01645

xx <- seq(min(filtered$Chapter),max(filtered$Chapter),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
Fit a quadratic (U-curve) to the filtered review-count for each chapter

The quadratic fits nicely. And for total reviews, we observe a similar curve:

plot(filtered2$Chapter, filtered2$Reviews)
model2 <- lm(filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2))
summary(model2)
# ...
# Residuals:
#     Min      1Q  Median      3Q     Max
# -289.68  -56.83  -17.59   36.67  695.45
#
# Coefficients:
#                         Estimate Std. Error t value Pr(>|t|)
# (Intercept)            543.84019   41.43142  13.126  < 2e-16 ***
# filtered2$Chapter      -16.88265    2.22356  -7.593 4.45e-11 ***
# I(filtered2$Chapter^2)   0.16479    0.02505   6.578 4.17e-09 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 124.3 on 82 degrees of freedom
# Multiple R-squared: 0.4518, Adjusted R-squared: 0.4384
# F-statistic: 33.79 on 2 and 82 DF,  p-value: 1.977e-11

xx <- seq(min(filtered2$Chapter),max(filtered2$Chapter),len=200)
yy <- model2$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
Fit a quadratic (U-curve) to the final 2012-11-03 review-count for each chapter

Predictions / confidence intervals

Linear models, even quadratic ones, are pretty simple. They don’t have to be a black box, and they’re interpretable the moment you are given the coefficients: ‘this’ times ‘that’ equals ‘that other thing’. We learned these things in middle school by hand-graphing functions. So it’s not hard to figure out what function our fitted lines represent, or to extend them rightwards to uncharted waters.

Our very first linear model for filtered$Reviews ~ filtered$Chapter told us that the coefficients were Intercept: 109.3701, filtered$Chapter: 0.3360. Going back to middle school math, the intercept is the y-intercept, the value of y when x = 0; then the other number must be how much we multiply x by.

So the linear model is simply this: y = 109.3701 + 0.3360x. To make predictions, we substitute in for x as we please; so chapter 1 is 109.3701 + (0.3360 × 1) = 109.71, chapter 85 is 109.3701 + (0.3360 × 85) = 137.93 (well, what did you expect of a nearly-flat line?), and so on.

We can do this for the quadratic too; your ordinary quadratic looked like y = ax² + bx + c, so it’s easy to guess what the filled-in equivalent will be. We’ll look at total reviews this time (filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2)). The intercept was 543.84019, the first coefficient was -16.88265, and the third mystery number was 0.16479, so the new equation is y = 543.84019 - 16.88265x + 0.16479x². What’s the prediction for chapter 86? 543.84019 - (16.88265 × 86) + (0.16479 × 86²) = 310.72. And for chapter 100, it’d go 543.84019 - (16.88265 × 100) + (0.16479 × 100²) = 503.48.

Can we automate this? Yes. Continuing with total reviews per chapter:

x = filtered2$Chapter; y = filtered2$Reviews
model2 <- lm(y ~ x + I(x^2))
predict.lm(model2, newdata = data.frame(x = c(86:100)))
#        1        2        3        4        5        6        7        8
# 310.7242 322.3504 334.3061 346.5914 359.2063 372.1507 385.4248 399.0284
#        9       10       11       12       13       14       15
# 412.9616 427.2244 441.8168 456.7387 471.9903 487.5714 503.4821

(Notice that this use of predict spits out the same answers we calculated for chapter 86 and chapter 100.)

So for example, the fitted quadratic model predicts ~503 total reviews for chapter 100. Making further assumptions (the error in estimating is independent of time, normally distributed, has zero mean, and constant variance), we predict a 95% prediction interval of 215-792 total reviews for chapter 100:

predict.lm(model2, newdata = data.frame(x = 100), interval="prediction")
#        fit    lwr      upr
#   503.4821 215.05 791.9141

Clearly the model isn’t making very strong predictions because the standard error is so high. We get a similar result asking just for the 7-day reviews:

filtered <- data.frame(Chapter=c(1:85), Reviews=earlyReviews[1:85]) # the 85 chapters extant at the time of the original analysis
y <- filtered$Reviews; x <- filtered$Chapter

model = lm(y ~ x + I(x^2))
model
# ...Coefficients:
# (Intercept)            x       I(x^2)
#   178.41168     -3.54725      0.04012

predict.lm(model, newdata = data.frame(x = 100), interval="prediction")
#       fit      lwr      upr
# 1 224.925 66.83383 383.0162

This seems like a fairly reasonable prediction: 66-383 reviews posted within 7 days. But this is based purely off the model; is it taking into account that even the quadratic model has some pretty large residuals? Let’s plot the entire 95% prediction band:

filtered <- data.frame(Chapter=c(1:85), Reviews=earlyReviews[1:85])
y <- filtered$Reviews; x <- filtered$Chapter
plot(x, xlab="Chapter",
     y, ylab="Reviews posted within week",
     main="7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals")

model <- lm(y ~ x + I(x^2))
xx <- seq(min(x),max(x),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)

prd <- predict(model, newdata = data.frame(x = c(1:85)), interval="prediction", type="response")
lines(c(1:85),prd[,2],col="red",lty=2)
lines(c(1:85),prd[,3],col="red",lty=2)
7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals

The prediction bands look like they’re more or less living up to their claimed coverage (7⁄85 isn’t hugely far from 5⁄100), but they do this by being very large, and embarrassingly they can’t even exclude zero. This should temper our forecasting enthusiasm. What might predict the residuals better? Day of week? Length of chapter update? Interval between updates? Hard to know.
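
One of those candidates is at least easy to eyeball: a sketch (reusing the earliest-review proxy for publication dates, and assuming the 85-chapter quadratic model from above is still in scope) regressing the quadratic fit’s residuals on the delay preceding each chapter:

## proxy each chapter's publication date by its earliest review, as before:
chapterDates <- sapply(1:85, function(i) min(as.numeric(data[data$Chapter==i,]$Date)))
interval <- c(NA, diff(chapterDates))    # days of delay preceding each chapter
summary(lm(residuals(model) ~ interval)) # do longer delays predict the residuals?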

Modeling conclusions

Sadly, neither the quadratic nor the linear fits lend any support to the original notion that we could estimate how much readership MoR is being cost by the exponentially-slowing updates, for the simple reason that the observed reviews per chapter do not match any simple increasing model of readership but a stranger U-curve (why do reviews decline so much in the middle when updates were quite regular?).

We could argue that the sluggishness of the recent upturn in the U-curve (in the 80s) is due to there only being a few humongous chapters, but those chapters are quantitatively and qualitatively different from the early chapters (especially chapter 5), so it’s not clear that any of them “should” have, say, 341 reviews within a week.

Additional graphs

Total cumulative MoR reviews as a function of time, with linear fit
Total MoR reviews vs time, quadratic fit
Total cumulative MoR reviews, 100 days prior to 2012-11-17, linear fit

The Review Race: Unexpected Circumstances versus MoR

An interesting question is whether and when MoR will become the most-reviewed fanfiction on FFN (and by extension, one of the most-reviewed fanfictions of all time). It has already passed such hits as “Parachute” & “Fridays at Noon”, but the next competitor is Unexpected Circumstances (UC) at 24,163 reviews to MoR’s 18,917¹.

Averaging

24.2k vs 19k is a large gap to cover. Some quick and dirty reasoning:

  • MoR: 18911 reviews, begun 2010-02-28 or 992 days ago, so 18911 / 992 ≈ 19.06 reviews per day (Status: in progress)
  • UC: 24163 reviews, begun 2010-11-22 or 725 days ago, so 24163 / 725 ≈ 33.33 reviews per day (Status: complete)

Obviously if UC doesn’t see its average review rate decline, it’ll never be passed by MoR! But let’s assume the review rate has gone to zero (not too unrealistic a starting assumption, since UC is finished and MoR is not); how many days will it take MoR to catch up? (24163 - 18911) / 19.06 ≈ 276 days, or under a year.
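
The same back-of-the-envelope arithmetic in R:

morRate <- 18911 / 992     # MoR's lifetime average: ~19.06 reviews/day
(24163 - 18911) / morRate  # days for MoR to close the gap if UC froze: ~276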

Modeling

Descriptive Data and Graphs

We can take into account the existing review rates with the same techniques we used previously for investigating MoR review patterns. First, we can download & parse UC’s reviews the same way as before: make the obvious tweaks to the shell script, run the sed command, re-run the Haskell, then load into R & format:

for i in {1..1611};
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/6496709/0/$i/ \
    | grep "report review for abuse " >> reviews.txt;
done
sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt
runhaskell mor.hs
data <- read.table("https://www.gwern.net/docs/lwsurvey/hpmor/2012-uc-reviews.csv",header=TRUE, sep=",",quote="")
# Convert the date strings to R dates
data=data.frame(Chapter=data$Chapter, Date=as.Date(data$Date,"%m/%d/%Y"), User=data$User)
data
#    Chapter     Date                     User
# 1       39 12-11-14                    REL57
# 2       27 12-11-13                    Guest
# 3       26 12-11-13                  Melanie
# 4       24 12-11-13                  Melanie
# 5       39 12-11-13    Dream-with-your-heart
# 6       39 12-11-12          ebdarcy.qt4good
# 7       19 12-11-10                    Guest
# 8       15 12-11-10                 JD Thorn
# 9       39 12-11-09                 shollyRK
# 10       9 12-11-08 of-Snowfall-and-Seagulls
# ...
summary(data)
#     Chapter           Date                          User
#  Min.   : 1.00   Min.   :10-11-22   Guest             :   54
#  1st Qu.:12.00   1st Qu.:11-02-26   AbruptlyChagrined :   39
#  Median :22.00   Median :11-06-20   AlexaBrandonCullen:   39
#  Mean   :20.62   Mean   :11-06-10   Angeldolphin01    :   39
#  3rd Qu.:29.00   3rd Qu.:11-09-09   AnjieNet          :   39
#  Max.   :39.00   Max.   :12-11-14   beachblonde2244   :   39
#                                     (Other)           :23915

Unlike MoR, multiple people have reviewed all (39) chapters. Let’s look at total reviews posted each calendar day:

plot(table(data$Date))

We’d like to see what a running or cumulative total of reviews looks like: how fast is UC’s total review-count growing these days?

plot(cumsum(table(data$Date)))
Total reviews as a function of time

Visually, it’s not quite a straight line, not quite a quadratic…

A graph of each day UC has existed against reviews left that day

We also get a similar graph to MoR if we graph by chapter instead, with thick bars indicating reviews left at the “latest” chapter while the series was updating, and many reviews left at the final chapter as people binge through UC:

plot(data$Chapter,data$Date)
Total reviews on each chapter of UC

Let’s start modeling.

Linear model

The linear model doesn’t match well, since it clearly is massively overpredicting future growth!

totals <- cumsum(table(data$Date))
days <- c(1:(length(totals)))
plot(days, totals)

model <- lm(totals ~ days)
abline(model)
A graph of total reviews vs time with a (bad) linear fit overlaid

Quadratic model

Well, the graph does look more like a quadratic, but this has another problem:

model <- lm(totals ~ days  + I(days^2))
xx <- seq(min(days),max(days),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
plot(days, totals)
lines(xx,yy,lwd=2,col=3)
Same graph, better (but still bad) quadratic fit overlaid

It fits much better overall, but for our purposes it is still rubbish: it is now predicting negative growth in total review count (a parabola must come down, after all), which of course cannot happen since reviews are not deleted.
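
We can compute exactly where the fit turns around; a quick sketch using the vertex of a parabola at -b⁄(2a):

## day at which the fitted quadratic peaks & afterwards absurdly predicts decline:
b <- coef(model)["days"]; a <- coef(model)["I(days^2)"]
-b / (2*a)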

Logarithmic model

We need a logarithmic function to try to model this “leveling-off” behavior:

logEstimate = lm(totals ~ log(days))
plot(days,predict(logEstimate),type='l',col='blue')
points(days,totals)
I’m a logarithm, and I’m OK…

The fit is very bad at the start, but the part we care about, recent reviews, seems better modeled than before. Progress! With a working model, let’s ask the original question: with these models predicting growth for UC and MoR, when do they finally have an equal number of reviews? We can use the predict.lm function to help us:

## for MoR (the same logarithmic fit, re-fit on the MoR review data):
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
# ...
#     9993     9994     9995     9996     9997     9998     9999    10000
# 31781.54 31781.78 31782.03 31782.27 31782.52 31782.76 31783.01 31783.25
#    10001
# 31783.50

## for UC
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
# ...
#     9993     9994     9995     9996     9997     9998     9999    10000
# 50129.51 50129.88 50130.26 50130.64 50131.02 50131.39 50131.77 50132.15
#    10001
# 50132.53

Growth in reviews has asymptoted for each fic way out at 10,000 days or 27 years, but still UC > MoR! This feels a little unlikely: the log function grows very slowly, but are the reviews really slowing down that much?
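
As a sanity check on what the logarithmic model claims: the derivative of a + b×log(x) is b⁄x, so the implied review rate far out is minuscule. For the UC fit:

## implied reviews per day around day 10,000 under the log model:
coef(logEstimate)["log(days)"] / 10000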

Linear model revisited

Let’s ask a more restricted question - instead of trying to model the entire set of reviews, why not look at just the most recent chunk, the one sans any updates (for both, sadly), say the last 100 days?

totals <- cumsum(table(data$Date))
totals <- totals[(length(totals)-100):length(totals)]
days <- c(1:(length(totals)))
days <- days[(length(days)-100):length(days)]
plot(days, totals)
model <- lm(totals ~ days)
summary(model)
# ...Residuals:
#      Min       1Q   Median       3Q      Max
# -18.4305  -4.4631   0.2601   7.4555  16.0125
#
# Coefficients:
#              Estimate Std. Error  t value Pr(>|t|)
# (Intercept) 2.392e+04  1.769e+00 13517.26   <2e-16 ***
# days        2.472e+00  3.012e-02    82.08   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 8.826 on 99 degrees of freedom
# Multiple R-squared: 0.9855, Adjusted R-squared: 0.9854
# F-statistic:  6737 on 1 and 99 DF,  p-value: < 2.2e-16

abline(model)
A linear fit to the last 100 days of UC reviews

That’s an impressive fit: visually a great match, relatively small residuals, extreme statistical significance. (We saw a similar graph for MoR too.) So clearly the linear model is accurately describing recent reviews.

Predicting

Let’s use 2 models like that to make our predictions. We’ll extract the 2 coefficients which define a linear model, and solve for the equality:

## UC:
coefficients(model)
#  (Intercept)         days
# 23918.704158     2.472312

## MoR (the same last-100-days fit, re-run on the MoR data):
coefficients(model)
#  (Intercept)         days
# 18323.275842     5.285603

The important thing here is 2.4723 vs 5.2856: that’s how many reviews per day UC versus MoR is getting over the last 100 days, according to the linear models which fit very well in that time period. Where do they intersect each other, or to put it in our middle-school math class terms, where do they equal the same y and thus equal each other?

  1. UC: y = 23918.704 + 2.4723x
  2. MoR: y = 18323.276 + 5.2856x

We solve for x:

  1. 23918.704 + 2.4723x = 18323.276 + 5.2856x, so x = (23918.704 - 18323.276) ⁄ (5.2856 - 2.4723) ≈ 1989 days, or ~5.4 years
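
Or, doing the algebra in R (a sketch; cUC & cMoR are just the intercept/slope pairs printed above):

cUC  <- c(23918.704158, 2.472312)
cMoR <- c(18323.275842, 5.285603)
## set the two lines equal & solve for the day x at which total reviews match:
x <- (cUC[1] - cMoR[1]) / (cMoR[2] - cUC[2])
c(days=x, years=x/365.25) # ~1989 days, or ~5.45 years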

Phew! That’s a long time. But it’s not “never” (as it is in the logarithmic fits). With that in mind, I don’t predict that MoR will surpass UC within a year, but bearing in mind that 5.4 years isn’t too far off and I expect additional bursts of reviews as MoR updates, I consider it possible within 2 years.

Survival analysis

I compile additional reviews and perform a survival analysis finding a difference in retention between roughly the first half of MoR and the second half, which I attribute to the changed update schedule & slower production in the second half.

On 2013-05-17, having finished learning & practicing survival analysis on a real dataset, I decided to return to MoR. At that point, more than 7 months had passed since my previous analysis, and more reviews had accumulated (1256 pages’ worth in November 2012, 1364 by May 2013), so I downloaded them as before, processed them into a clean CSV, and wound up with 20447 reviews by 7508 users.

I then consolidated the reviews into per-user records: for each reviewer, I extracted their username, the first chapter they reviewed, the date of the first review, the last chapter they reviewed, and the last review’s date. A great many reviewers apparently leave only 1 review ever. The sparsity of reviews presents a challenge for defining an event or ‘death’, since to me the most natural definition of ‘death’ is “not having reviewed the latest chapter”: with such sparsity, a reviewer may still be a reader (the property of interest) without having left a review on the latest chapter. Specifically, of the 7508 users, only 182 have reviewed chapter 87. To deal with this, I arbitrarily make the cutoff “any review of chapter 82 or more recent”: a great deal of plot happened in those 5 chapters, so I reason anyone who isn’t moved to comment (however briefly) on any of them may not be heavily engaged with the story.
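
(A quick sanity-check of that cutoff, assuming the lifetimes data frame constructed in the Source code section below:)

sum(lifetimes$LastChapter >= 82)  # reviewers whose last review is ch82+, counted as 'alive'
mean(lifetimes$LastChapter >= 82) # as a fraction of all 7508 reviewers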

As the days pass, the number of reviewers who are that ‘old’ will dwindle as they check out or unsubscribe or simply stop being as enthusiastic as they used to be. We can graph this decline over time (zooming in to account for the issue that only 35% of reviewers ever leave >1 review); no particular pattern jumps out beyond being a type III survivorship curve (“greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring”):

Age of reviewers against probability they will not leave another review

But this is plotting age of reviewers and has no clear connection to calendar time, which was the original hypothesis: that the past 2 years have damaged readership more than the earlier chapters, which were posted more promptly. In particular, chapter posting seems to have slowed down ~ch62 (published 2010-11-27). So, if we were to look at every user who seems to have begun reading MoR only after that date (as evidenced by their first review postdating ch62’s publication) and compare them to those reading before, do they have additional mortality (above and beyond the low mortality we would expect by being relatively latecomers to MoR, which is adjusted for by the survival curve)? The answer seems to be yes; the survival curves look different when we partition users this way:

Different mortality curves for 2 groups of reviewers
#   n= 7508, number of events= 6927
#
#            coef exp(coef) se(coef)    z Pr(>|z|)
# LateTRUE 0.2868    1.3322   0.0247 11.6   <2e-16
#
#          exp(coef) exp(-coef) lower .95 upper .95
# LateTRUE      1.33      0.751      1.27       1.4

A pretty clear difference between groups, and a hazard ratio of 1.33 (95% CI: 1.26-1.39) is not trivially small. I think we can consider the original hypothesis supported to some degree: the later chapters, either because of the update schedule or perhaps the change in subject matter or attracting a different audience or some other reason, seem to be more likely to lose readers.
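
(For intuition about a hazard ratio of 1.33: under the proportional-hazards assumption, the late group’s survival curve is the early group’s raised to that power, so if, say, 20% of early reviewers are still ‘alive’ at some age, we’d expect only 0.20^1.33 ≈ 12% of late reviewers to be:)

0.20^1.33 # ≈ 0.118: ~12% of late reviewers surviving vs 20% of early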

But two further minor observations:

  • Visually, the two groups of early & late reviewers look like they might be converging after several years of aging. We might try to interpret this as a ‘hardcore’ vs ‘populist’ distinction: the early reviewers tend to all be early-adopting hardcore geeks who become devoted fans in for the long haul, while the latecoming regular people get gradually turned off and only the remaining geeks continue to review.

  • Can we more specifically nail down this increased mortality to the interval between a reviewer’s last review and the next chapter? So far, no; there is actually a trivially small effect in the other direction when we add that as a variable:

    #   n= 7326, number of events= 6927
    #    (182 observations deleted due to missingness)
    #
    #               coef exp(coef)  se(coef)     z Pr(>|z|)
    # LateTRUE  0.433604  1.542808  0.024906  17.4   <2e-16
    # Delay    -0.004236  0.995773  0.000262 -16.2   <2e-16
    #
    #          exp(coef) exp(-coef) lower .95 upper .95
    # LateTRUE     1.543      0.648     1.469     1.620
    # Delay        0.996      1.004     0.995     0.996

    I interpret this as indicating that an update which is posted after a long drought may be a tiny bit (95% CI: 0.9953-0.9963) more likely to elicit some subsequent reviews (perhaps because extremely long delays are associated with posting multiple chapters quickly), but nevertheless, that overall period with long delays is associated with increased reviewer loss. This makes some post hoc intuitive sense as a polarization effect: if a franchise after a long hiatus finally releases an exciting new work, you may be more likely to discuss or review it, even as the hiatuses make you more likely to finally “check out” for good - you’ll either engage or finally abandon it, but you probably won’t be indifferent.

Source code

Generation and pre-processing, using the shell & Haskell scripts from before:

data <- read.csv("2013-hpmor-rawreviews.csv", colClasses=c("integer", "Date", "factor"))
users <- unique(data$User)
lifetimes <- data.frame(User=NA, Start=NA, End=NA, FirstChapter=NA, LastChapter=NA)[-1,]
for (i in 1:length(users)) { # isolate a specific user
                  d <- data[data$User == users[i],]
                  # sort by date
                  d <- d[order(d$Date),]

                  first <- head(d, n=1)
                  last <- tail(d, n=1)
                  # build & store new row (per user)
                  new <- data.frame(User=users[i], Start=first$Date, End=last$Date,
                                    FirstChapter=first$Chapter, LastChapter=last$Chapter)
                  lifetimes <- rbind(lifetimes, new)
}
lifetimes$Days <- lifetimes$End - lifetimes$Start
summary(lifetimes$Days)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#     0.0     0.0     0.0    55.2     6.0  1160.0
lifetimes$Dead <- lifetimes$LastChapter <= 82

lifetimes$Late <- lifetimes$Start >= as.Date("10-11-27")

# figure out when each chapter was written by its earliest review
chapterDates <- as.Date(NA)
for (i in 1:max(data$Chapter)) { chapterDates[i] <- min(data[data$Chapter==i,]$Date) }

# how long after chapter X was posted did chapter X+1 come along? (obviously NA/unknown for latest chapter)
intervals <- integer(0)
for (i in 1:length(chapterDates)) { intervals[i] <- chapterDates[i+1] - chapterDates[i] }

# look up for every user the relevant delay after their last review
delay <- integer(0)
for (i in 1:nrow(lifetimes)) { delay[i] <- intervals[lifetimes$LastChapter[i]] }
lifetimes$Delay <- delay

Now that we have a clean set of data with the necessary variables like dates, we can analyze:

lifetimes <- read.csv("https://www.gwern.net/docs/lwsurvey/hpmor/2013-hpmor-reviewers.csv",
                       colClasses=c("factor", "Date", "Date", "integer", "integer",
                                    "integer", "logical", "logical","integer"))

library(survival)

surv <- survfit(Surv(Days, Dead, type="right") ~ 1, data=lifetimes)

png(file="~/wiki/images/hpmor/survival-overall.png", width = 3*480, height = 1*480)
plot(surv, ylim=c(0, 0.35))
invisible(dev.off())

cmodel <- coxph(Surv(Days, Dead) ~ Late, data=lifetimes)
print(summary(cmodel))

cat("\nTest for assumption violation to confirm the Cox regression:\n")
print(cox.zph(cmodel))

png(file="~/wiki/images/hpmor/survival-earlylate-split.png", width = 3*480, height = 1*480)
plot(survfit(Surv(Days, Dead) ~ Late, data=lifetimes),
     lty=c(1,2), ylim=c(0, 0.4), main="Reader survivorship: first review before or after chapter 62",
     ylab="Fraction of reader group surviving", xlab="Days between reader's first & last review")
legend("topright", legend=c("Early readers", "Late readers"), lty=c(1 ,2), inset=0.02)
invisible(dev.off())

print(summary(coxph(Surv(Days, Dead) ~ Late + Delay, data=lifetimes)))

cat("\nBegin bootstrap test of coefficient size...\n")
library(boot)
cat("\nGet a subsample, train Cox on it, extract coefficient estimate;\n")
cat("\nfirst the simplest early/late model, then the more complex:\n")
# TODO: factor out duplication
coxCoefficient1 <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  cmodel <- coxph(Surv(Days, Dead) ~ Late, data=g)
  return(coef(cmodel))
}
cox1Bs <- boot(data=lifetimes, statistic=coxCoefficient1, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox1Bs, type="perc"))
coxCoefficient2 <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  cmodel <- coxph(Surv(Days, Dead) ~ Late + Delay, data=g)
  return(coef(cmodel))
}
cox2Bs <- boot(data=lifetimes, statistic=coxCoefficient2, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox2Bs, type="perc", index=1)) # `Late`
print(boot.ci(cox2Bs, type="perc", index=2)) # `Delay`

See Also


  1. The author points out that his “Offside” fanfiction is past 32,000 reviews.↩︎