Advertisement for 'HTerm, The Graphical Terminal'

The unprecedented gap in Methods of Rationality updates prompts musing about whether readership is increasing enough & what statistics one would use; I write code to download FF.net reviews, clean it, parse it, load into R, summarize the data & depict it graphically, run linear regression on a subset & all reviews, note the poor fit, develop a quadratic fit instead, and use it to predict future review quantities.

Then, I run a similar analysis on a competing fanfiction to find out when they will have equal total review-counts. A try at logarithmic fits fails; fitting a linear model to the previous 100 days of MoR and the competitor works much better, and they predict a convergence in <5 years.

A survival analysis finds no major anomalies in reviewer lifetimes, but an apparent increase in mortality for reviewers who started reviewing with later chapters, consistent with (but far from proving) the original theory that the later chapters’ delays are having negative effects.

Background

In a LW comment, I asked:

I noticed in Eliezer’s latest [1 November 2012] MoR update that he now has 18,000 words written [of chapter 86], and even when that “chapter” is finished, he still doesn’t plan to post anything, on top of a drought that has now lasted more than half a year. This doesn’t seem very optimal from the point of view of gaining readers.

But I was wondering how one would quantify that - how one would estimate how many readers Eliezer’s MoR strategy of very rare huge dumps is costing him. Maybe survivorship curves, where survivorship is defined as “posting a review on FF.net”? So if say during the weekly MoR updates, a reader who posted a review of chapter X posted a review anywhere in X to X+n, that counts as survival of that reader. One problem here is that new readers are constantly arriving… You can’t simply say “X readers did not return from chapter 25 to chapter 30, while X+N did not return from chapter 85 to 86, therefore frequent updates are better” since you would expect the latter to be bigger simply because more people are reading. And even if you had data on unfavoriting or unfollowing, the important part is the opportunity cost - how many readers would have subscribed with regular updates

If you had total page views, that’d be another thing; you could look at conversions: what percentage of page views resulted in conversions to subscribers for the regular update period versus the feast/fame periods. But I don’t think FF.net provides it and while HP:MoR.com has Google Analytics, I don’t have access. Maybe one could look at each chapter pair-wise, and seeing what fraction of reviewers return? The growth might average out since we’re doing so many comparisons… But the delay is now so extreme this would fail: we’d expect a huge growth in reviewers from chapter 85 to chapter 86, for example, simply because it’s been like 8 months now - here too the problem is that the growth in reviewers will be less than what it “should be”. But how do we figure out “should be”?

After some additional discussion with clone_of_saturn, I’ve rejected the idea of survivorship curves; we thought correlations between duration and total review count might work, but interpretation was not clear. So the best current idea is: gather duration between each chapter, the # of reviews posted within X days (where X is something like 2 or 3), plot the points, and then see whether a line fits it better or a logarithm/logistic curve - to see whether growth slowed as the updates were spaced further apart.

Getting the data is the problem. It’s easy to get total reviews for each chapter since FF.net provides them. I don’t know of any way to get total reviews after X days posted, though! A script or program could probably do it, but I’m not sure I want to bother with all this work (especially since I don’t know how to do Bayesian line or logistic-fitting) if it’s not of interest to anyone or Eliezer would simply ignore any results.

I decided to take a stab at it.

Data

FanFiction.Net does not provide published dates for each chapter, so we’ll be working off solely the posted reviews. FF.net reviews for MoR are provided on 1256 pages (as of 3 November 2012), numbering from 1. If we go from 1-1256, starting with http://www.fanfiction.net/r/5782108/85/1/, we can download them with a script enumerating the URLs & using string interpolation. Eyeballing a textual dump of the HTML (courtesy of elinks), it’s easy to extract with grep the line with user, chapter number, and date with a HTML string embedded as a link with each review, leading to this script:

for i in {1..1256}
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/5782108/0/$i/ \
     | grep "report review for abuse " >> reviews.txt
done

Since page #1 is for the most recent URLs and we started with 1, the reviews were dumped newest to oldest: eg.

...
[IMG] report review for abuse Faerie of Tara 4/1/12 . chapter 3
[IMG] report review for abuse Faerie of Tara 4/1/12 . chapter 2
report review for abuse Aww 4/1/12 . chapter 81
report review for abuse bella-farfallina 4/1/12 . chapter 81
[IMG] report review for abuse Isis Aurora Tomoe 4/1/12 . chapter 2
[IMG] report review for abuse bosk 4/1/12 . chapter 1
[IMG] report review for abuse yamiishot 4/1/12 . chapter 81
report review for abuse Joe in Australia 4/1/12 . chapter 81
report review for abuse of course 3/31/12 . chapter 45
report review for abuse RickJs 3/31/12 . chapter 22
...

We can clean this up with more shell scripting:

sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt

This gives us better looking data like:

NATWEST 3/4/10 4
hemotem 3/4/10 4
Isabelle Eir 3/4/10 4
Orc Shaman 3/3/10 1
Mariann's 3/3/10 4
cheekylildevil 3/3/10 4
Star Bear 3/3/10 4

The spaces in names means we can’t simply do a search-and-replace to turn spaces into commas and call that a CSV which can easily be read in R; so more scripting, in Haskell this time (though I’m sure awk or something could do the job, I don’t know it):

import Data.List (intercalate)

main :: IO ()
main = do f <- readFile "filtered.txt"
          let parsed = map words (lines f)
          let unparsed = "User,Date,Chapter\n" ++ unlines (map unsplit parsed)
          writeFile "reviews.csv" unparsed

-- eg "Star Bear 3/3/10 4" ~> ["Star", "Bear", "3/3/10", "4") ~> "Star Bear,3/3/10,4"
unsplit :: [String] -> String
unsplit xs = let chapter = last xs
                 date = last (init xs)
                 user = unwords (init (init xs))
             in intercalate "," [user, date, chapter]

R

A runhaskell mor.hs later and our CSV of 18,852 reviews is ready for analysis in the R interpreter:

data <- read.csv("http://www.gwern.net/docs/2012-hpmor-reviews.csv")
data
...
18840                                        cheekylildevil  2/28/10       1
18841                                       scarletalphabet  2/28/10       1
18842                                           StoryTagger  2/28/10       1
18843                                                HJP265  2/28/10       1
18844                                              dougal74  2/28/10       1
18845                                                baceba  2/28/10       1
18846                                               Cibbler  2/28/10       1
18847                                               Davek86  2/28/10       1
18848                                        omegahurricane  2/28/10       1
18849                                         hhrforeverhhr  2/28/10       1
18850                                    anonnatymousMARTIN  2/28/10       1
18851                                            wetterwaxs  2/28/10       1
18852                                                Errrrr  2/28/10       1
# Parse the date strings into R date objects
data$Date <- as.Date(data$Date,"%m/%d/%Y")
# Reorganize columns (we care more about the chapter & date reviews were given on
# than which user did a review):
data <- data.frame(Chapter=data$Chapter, Date=data$Date, User=data$User)

Descriptive

summary(data)

    Chapter           Date                      User
 Min.   : 1.00   Min.   :10-02-28   Guest         :  310
 1st Qu.:10.00   1st Qu.:10-06-16   anon          :  106
 Median :27.00   Median :10-11-26   voodooqueen126:   85
 Mean   :35.64   Mean   :11-02-10   Kutta         :   77
 3rd Qu.:63.00   3rd Qu.:11-09-02   badkidoh      :   73
 Max.   :85.00   Max.   :12-11-02   AR            :   72
                                    (Other)       :18129

Surprisingly, it looks like only one user (‘voodooqueen126’) has reviewed every chapter. The dates also clearly indicate the slowdown in MoR posting, with the median chapter being posted all the way back in 2010. What does the per-day review pattern look like?

plot(table(data$Date))
Reviews posted, by day
Reviews posted, by day

Quite spiky. At a guess, the days with huge spikes are new chapters, while a low-level average of what looks like ~10 reviews a day holds steady over time. What’s the cumulative total of all reviews over time?

plot(cumsum(table(data$Date)))
All (total cumulative) reviews posted, by day
All (total cumulative) reviews posted, by day

When we plot each review by the chapter it was reviewing and the date it was posted, we see more clear patterns:

plot(data$Chapter,data$Date)
Scatterplot of all reviews, date posted versus chapter
Scatterplot of all reviews, date posted versus chapter

Obviously no one can review a chapter before it was posted, which is the thick black line along the bottom. We can also see how the delays between chapters: originally very small, the gaps increase tremendously in the 70s (remember the whole y-axis is ~3 years) - just 2 of the delays look like they cover something like 1312 the entire lifespan of MoR! Other interesting observations is that chapter 1 and chapter 5 are extremely popular for reviewers, with hardly a day passing without review even as vast wastelands open up in 2012 in chapters 40-60. A final curious observation is the presence of distinct horizontal lines; I am guessing that these are people prone to reviewing who are binging their way through MoR.

Filtered reviews

The next question is: if we plot only the subset of reviews which were made within 7 days of the first review, what does that look like?

# extract each chapter's reviews into its own vector:
# one vector of dates for ch1, one vector of dates for ch2, for ch3 etc.
myList <- list()
for (i in 1:85) { myList[[i]] <- rev(data$Date[data$Chapter==i]) }
# take a look at the first entry in each vector as a sanity check
lapply(myList, function(a) a[1])
[[1]]
[1] "10-02-28"

[[2]]
[1] "10-02-28"

[[3]]
[1] "10-03-01"

[[4]]
[1] "10-03-03"

[[5]]
[1] "10-03-08"
...
# looks reasonable
# what's the total by each chapter?
totalreviews <- unlist(lapply(myList, length))
totalreviews
 [1]  823  391  205  293 1159  499  419  375  423  498  332  215  270  283  222
[16]  291  385  302  270  311  248  199  215  211  141  243  283  138  136  143
[31]  160  204  148  119   28   77  106   48  130  101  117  183   52   33  139
[46]  246  334  120  148  104  181   69   47  183   38  143   74  202   80  131
[61]  127  164  303  175  146   42  147  130  253  296  217  398  136  249  187
[76]  231  525  185  153  308  235  113   74  154  236
# Now the hard part: map over the list of vectors, & filter out anything dated 7 days after the first date
earlyreviews <- unlist(lapply(lapply(myList,
     function(a) Filter(function(b) {if (b > a[1]+7) return(FALSE) else return(TRUE)}, a)), length))
 [1]  17  11  13  42 341 191 204 210 296 282 117 136 162 144 126 155 199 199 159
[20] 209 133 125 142 111  92 145 193  96  92  43 115 170  78  88   5  34  77  17
[39]  48  71  88 129  18  10  40 164 246 100 124  82 150  51  32 168  32 129  62
[58] 167  66 116 109 140 155  55 114  19 111 104 175 187 110 175 127 212 164 181
[77] 117 169 139 299 211 106  69 143 106

How did chapter 5 get 341 reviews within 7 days and have 1159 in total, you ask - might my data not be wrong? But if you check the chapter 5 reviews and scroll all the way down, you’ll find that people back in 2010 were very chuffed by the Draco scenes.

Analysis

Linear modeling

Moving onward, we want to graph the date and then fit a linear model (a line which minimizes the deviation of all of the points from it; I drew on a handout for R programming):

# turn vector into a plottable table & plot it
filtered <- data.frame(Chapter=c(1:85),Reviews=earlyreviews)
plot(filtered$Chapter, filtered$Reviews)

# run a linear model: is there any clear trend?
lm(filtered$Reviews ~ filtered$Chapter)
...
Coefficients:
     (Intercept)  filtered$Chapter
        128.3773           -0.0966

# the slope doesn't look very interesting...
# let's plot the regression line too:
abline(lm(filtered$Reviews ~ filtered$Chapter))
Graph reviews posted within a week, per chapter.
Graph reviews posted within a week, per chapter.

Look at the noise. We have some huge spikes in the early chapters, some sort of dip in the 40s-70s, and then a similar spread of highly and lowly reviewed chapters at the end. The linear model is basically just the average (mean(filtered$Reviews) = 124.2); it’s actually predicting slight declines in reviews posted within 7 days, interestingly.

All reviews

Same thing if we look at total reviews for each chapter:

filtered2=data.frame(Chapter=c(1:85),Reviews=totalreviews)
plot(filtered2$Chapter, filtered2$Reviews)
model2 = lm(filtered2$Reviews ~ filtered2$Chapter)
summary(model2)
...
Residuals:
    Min      1Q  Median      3Q     Max
-215.47  -98.81  -24.02   58.08  834.21

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       338.3462    33.4318  10.120 3.79e-16 ***
filtered2$Chapter  -2.7107     0.6753  -4.014  0.00013 ***
---
Signif. codes:  0 ‘***0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 152.8 on 83 degrees of freedom
Multiple R-squared: 0.1626, Adjusted R-squared: 0.1525
F-statistic: 16.11 on 1 and 83 DF,  p-value: 0.0001302

abline(model2)
Graph all reviews by posted time and chapter
Graph all reviews by posted time and chapter

Fitting quadratics

Looking at the line, the fit seems poor over all (large residuals) quite poor at the beginning and end - doesn’t it look like a big U? That sounds like a quadratic, so more Googling for R code and an hour or two later, I fit some quadratics to the 7-day reviews and the total reviews:

plot(filtered$Chapter, filtered$Reviews)
model <- lm(filtered$Reviews ~ filtered$Chapter + I(filtered$Chapter^2))
summary(model)
...
Residuals:
     Min       1Q   Median       3Q      Max
-160.478  -42.795   -3.364   40.813  179.321

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           178.41168   22.70879   7.857 1.34e-11 ***
filtered$Chapter       -3.54725    1.21875  -2.911  0.00464 **
I(filtered$Chapter^2)   0.04012    0.01373   2.922  0.00449 **
---
Signif. codes:  0 ‘***0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 68.15 on 82 degrees of freedom
Multiple R-squared: 0.09533,    Adjusted R-squared: 0.07327
F-statistic:  4.32 on 2 and 82 DF,  p-value: 0.01645

xx <- seq(min(filtered$Chapter),max(filtered$Chapter),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
Fit a quadratic (U-curve) to the filtered review-count for each chapter
Fit a quadratic (U-curve) to the filtered review-count for each chapter

The quadratic fits nicely. And for total reviews, we observe a similar curve:

plot(filtered2$Chapter, filtered2$Reviews)
model2 <- lm(filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2))
summary(model2)
...
Residuals:
    Min      1Q  Median      3Q     Max
-289.68  -56.83  -17.59   36.67  695.45

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)            543.84019   41.43142  13.126  < 2e-16 ***
filtered2$Chapter      -16.88265    2.22356  -7.593 4.45e-11 ***
I(filtered2$Chapter^2)   0.16479    0.02505   6.578 4.17e-09 ***
---
Signif. codes:  0 ‘***0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 124.3 on 82 degrees of freedom
Multiple R-squared: 0.4518, Adjusted R-squared: 0.4384
F-statistic: 33.79 on 2 and 82 DF,  p-value: 1.977e-11

xx <- seq(min(filtered2$Chapter),max(filtered2$Chapter),len=200)
yy <- model2$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
Fit a quadratic (U-curve) to the final 3 November 2012 review-count for each chapter
Fit a quadratic (U-curve) to the final 3 November 2012 review-count for each chapter

Predictions / confidence intervals

Linear models, even as a quadratic, are pretty simple. They don’t have to be a black box, and they’re interpretable the moment you are given the coefficients: ‘this’ times ‘that’ equals ‘that other thing’. We learned these things in middle school by hand-graphing functions. So it’s not hard to figure out what function our fitted lines represent, or to extend them rightwards to uncharted waters.

Our very first linear model for filtered$Reviews ~ filtered$Chapter told us that the coefficients were Intercept: 128.3773, filtered$Chapter: -0.0966. Going back to middle school math, the intercept is the y-intercept, the value of y when x=0; then the other number must be how much we multiply x by.

So the linear model is simply this: y=0.0966x+128.3773. To make predictions, we substitute in for x as we please; so chapter 1 is 0.0966×1+128.3773=128.28, chapter 85 is 0.0966×85+128.3773=120.16 (well, what did you expect of a nearly-flat line?) and so on.

We can do this for the quadratic too; your ordinary quadratic looked like y=x2+x+1, so it’s easy to guess what the filled-in equivalent will be. We’ll look at total reviews this time (filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2)). The intercept was 543.84019, the first coefficient was -16.88265, and the third mystery number was 0.16479, so the new equation is y=0.16479x2+16.88265x+543.84019. What’s the prediction for chapter 86? 0.16479×(862)+(16.88265)×86+543.8401=310.719. And for chapter 100, it’d go 0.16479×(1002)+(16.88265)×100+543.8401=503.475.

Can we automate this? Yes. Continuing with total reviews per chapter:

x = filtered2$Chapter; y = filtered2$Reviews
model2 <- lm(y ~ x + I(x^2))
predict.lm(model2, newdata = data.frame(x = c(86:100)))
       1        2        3        4        5        6        7        8
310.7242 322.3504 334.3061 346.5914 359.2063 372.1507 385.4248 399.0284
       9       10       11       12       13       14       15
412.9616 427.2244 441.8168 456.7387 471.9903 487.5714 503.4821

(Notice that this use of predict spits out the same answers we calculated for chapter 86 and chapter 100.)

So for example, the fitted quadratic model predicts ~503 total reviews for chapter 100. Making further assumptions (the error in estimating is independent of time, normally distributed, has zero mean, and constant variance), we predict that our 95% confidence interval 215-792 total reviews of chapter 100:

predict.lm(model2, newdata = data.frame(x = 100), interval="prediction")
       fit    lwr      upr
  503.4821 215.05 791.9141

Clearly the model isn’t making very strong predictions because the standard error is so high. We get a similar result asking just for the 7-day reviews:

filtered=data.frame(Chapter=c(1:85),Reviews=earlyreviews)
y=filtered$Reviews; x=filtered$Chapter

model = lm(y ~ x + I(x^2))
model
...
Coefficients:
(Intercept)            x       I(x^2)
  178.41168     -3.54725      0.04012

predict.lm(model, newdata = data.frame(x = 100), interval="prediction")
      fit      lwr      upr
1 224.925 66.83383 383.0162

This seems like a fairly reasonable prediction: 66-383 reviews posted within 7 days. But this is based purely off the model, is it taking into account that even the quadratic model has some pretty large residuals? Let’s plot an entire confidence band:

filtered <- data.frame(Chapter=c(1:85),Reviews=earlyreviews)
y <- filtered$Reviews; x = filtered$Chapter
plot(x, xlab="Chapter",
     y, ylab="Reviews posted within week",
     main="7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals")

model <- lm(y ~ x + I(x^2))
xx <- seq(min(x),max(x),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)

prd <- predict(model, newdata = data.frame(x = c(1:85)), interval="prediction", type="response")
lines(c(1:85),prd[,2],col="red",lty=2)
lines(c(1:85),prd[,3],col="red",lty=2)
7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals
7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals

The prediction bands look like they’re more or less living up to their claimed coverage (7/85 isn’t hugely far from 5/100) but do this by being very large and embarrassingly can’t even exclude zero. This should temper our forecasting enthusiasm. What might predict the residuals better? Day of week? Length of chapter update? Interval between updates? Hard to know.

Modeling conclusions

Sadly, neither the quadratic nor the linear fits lend any support to the original notion that we could estimate how much readership MoR is being cost by the exponentially-slowing updates for the simple reason that the observed reviews per chapter do not match any simple increasing model of readership but a stranger U-curve (why do reviews decline so much in the middle when updates were quite regular?).

We could argue that the sluggishness of the recent upturn in the U-curve (in the 80s) is due to there only being a few humongous chapters, but those chapters are quantitatively and qualitatively different from the early chapters (especially chapter 5) so it’s not clear that any of them “should” have say, 341 reviews within a week.

Additional graphs

Total cumulative MoR reviews as a function of time, with linear fit Total MoR reviews vs time, quadratic fit Total cumulative MoR reviews, 100 days prior to 17 November 2012, linear fit

The Review Race: Unexpected Circumstances versus MoR

An interesting question is whether and when MoR will become the most-reviewed fanfiction on FFN (and by extension, one of the most-reviewed fanfictions of all time). It has already passed such hits as “Parachute” & “Fridays at Noon”, but the next competitor is Unexpected Circumstances (UC) at 24,163 reviews to MoR’s 18,9171.

Averaging

24.2k vs 19k is a large gap to cover. Some quick and dirty reasoning:

  • MoR: 18911 reviews, begun 02-28-10 or 992 days ago, so 18911992=19.1 reviews per day (Status: in progress)
  • UC: 24163 reviews, begun 11-22-10 or 725 days ago, so 24163725=33.3 reviews per day (Status: complete)

Obviously if UC doesn’t see its average review rate decline, it’ll never be passed by MoR! But let’s assume the review rate has gone to zero (not too unrealistic a starting assumption, since UC is finished and MoR is not), how many days will it take MoR to catch up? 241631891118911992=275.5 or under a year.

Modeling

Descriptive Data and Graphs

We can take into account the existing review rates with the same techniques we used previously for investigating MoR review patterns. First, we can download & parse UC’s reviews the same way as before: make the obvious tweaks to the shell script, run the sed command, re-run the Haskell, then load into R & format:

for i in {1..1611};
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/6496709/0/$i/
    | grep "report review for abuse " >> reviews.txt;
done
sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt
runhaskell mor.hs
data <- read.table("http://www.gwern.net/docs/2012-uc-reviews.csv",header=TRUE, sep=",",quote="")
# Convert the date strings to R dates
data=data.frame(Chapter=data$Chapter, Date=as.Date(data$Date,"%m/%d/%Y"), User=data$User)
data
   Chapter     Date                     User
1       39 12-11-14                    REL57
2       27 12-11-13                    Guest
3       26 12-11-13                  Melanie
4       24 12-11-13                  Melanie
5       39 12-11-13    Dream-with-your-heart
6       39 12-11-12          ebdarcy.qt4good
7       19 12-11-10                    Guest
8       15 12-11-10                 JD Thorn
9       39 12-11-09                 shollyRK
10       9 12-11-08 of-Snowfall-and-Seagulls
...
summary(data)
    Chapter           Date                          User
 Min.   : 1.00   Min.   :10-11-22   Guest             :   54
 1st Qu.:12.00   1st Qu.:11-02-26   AbruptlyChagrined :   39
 Median :22.00   Median :11-06-20   AlexaBrandonCullen:   39
 Mean   :20.62   Mean   :11-06-10   Angeldolphin01    :   39
 3rd Qu.:29.00   3rd Qu.:11-09-09   AnjieNet          :   39
 Max.   :39.00   Max.   :12-11-14   beachblonde2244   :   39
                                    (Other)           :23915

Unlike MoR, multiple people have reviewed all (39) chapters. Let’s look at total reviews posted each calendar day:

plot(table(data$Date))

We’d like to see what a running or cumulative total reviews looks like: how fast is UC’s total review-count growing these days?

plot(cumsum(table(data$Date)))
Total reviews as a function of time
Total reviews as a function of time

Visually, it’s not quite a straight line, not quite a quadratic…

A graph of each day UC has existed against reviews left that day
A graph of each day UC has existed against reviews left that day

We also get a similar graph to MoR if we graph by chapter instead, with thick bars indicating reviews left at the “latest” chapter while the series was updating, and many reviews left at the final chapter as people binge through UC:

plot(data$Chapter,data$Date)
Total reviews on each chapter of UC
Total reviews on each chapter of UC

Let’s start modeling.

Linear model

The linear doesn’t match well since it clearly is massively overpredicting future growth!

totals <- cumsum(table(data$Date))
days <- c(1:(length(totals)))
plot(days, totals)

model <- lm(totals ~ days)
abline(model)
A graph of total reviews vs time with a (bad) linear fit overlaid
A graph of total reviews vs time with a (bad) linear fit overlaid

Quadratic model

Well, the graph does look more like a quadratic, but this has another problem:

model <- lm(totals ~ days  + I(days^2))
xx <- seq(min(days),max(days),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
plot(days, totals)
lines(xx,yy,lwd=2,col=3)
Same graph, better (but still bad) quadratic fit overlaid
Same graph, better (but still bad) quadratic fit overlaid

It fits much better overall, but for our purposes it is still rubbish: it is now predicting negative growth in total review count (a parabola must come down, after all), which of course cannot happen since reviews are not deleted.

Logarithmic model

We need a logarithmic function to try to model this “leveling-off” behavior:

logEstimate = lm(totals ~ log(days))
plot(days,predict(logEstimate),type='l',col='blue')
points(days,totals)
I’m a logarithm, and I’m OK…
I’m a logarithm, and I’m OK…

The fit is very bad at the start, but the part we care about, recent reviews, seems better modeled than before. Progress! With a working model, let’s ask the original question: with these models predicting growth for UC and MoR, when do they finally have an equal number of reviews? We can use the predict.lm function to help us

# for MoR
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
...
    9993     9994     9995     9996     9997     9998     9999    10000
31781.54 31781.78 31782.03 31782.27 31782.52 31782.76 31783.01 31783.25
   10001
31783.50
# for UC
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
...
    9993     9994     9995     9996     9997     9998     9999    10000
50129.51 50129.88 50130.26 50130.64 50131.02 50131.39 50131.77 50132.15
   10001
50132.53

Growth in reviews has asymptoted for each way out at 10,000 days or 27 years, but still UC > MoR! This feels a little unlikely: the log function grows very slowly, but are the reviews really slowing down that much?

Linear model revisited

Let’s ask a more restricted question - instead of trying to model the entire set of reviews, why not look at just the most recent chunk, the one sans any updates (for both, sadly), say the last 100 days?

totals <- cumsum(table(data$Date))
totals <- totals[(length(totals)-100):length(totals)]
days <- c(1:(length(totals)))
days <- days[(length(days)-100):length(days)]
plot(days, totals)
model <- lm(totals ~ days)
summary(model)
...
Residuals:
     Min       1Q   Median       3Q      Max
-18.4305  -4.4631   0.2601   7.4555  16.0125

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 2.392e+04  1.769e+00 13517.26   <2e-16 ***
days        2.472e+00  3.012e-02    82.08   <2e-16 ***
---
Signif. codes:  0 ‘***0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.826 on 99 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9854
F-statistic:  6737 on 1 and 99 DF,  p-value: < 2.2e-16

abline(model)
A linear fit to the last 100 days of UC reviews
A linear fit to the last 100 days of UC reviews

That’s an impressive fit: visually a great match, relatively small residuals, extreme statistical significance. (We saw a similar graph for MoR too.) So clearly the linear model is accurately describing recent reviews.

Predicting

Let’s use 2 models like that to make our predictions. We’ll extract the 2 coefficients which define a y=ax+c linear model, and solve for the equality:

# UC:
coefficients(model)
 (Intercept)         days
23918.704158     2.472312
# MoR:
coefficients(model)
 (Intercept)         days
18323.275842     5.285603

The important thing here is 2.4723 vs 5.2856: that’s how many reviews UC is getting per day versus MoR over the last 100 days according to the linear models which fit very well in that time period. Where do they intercept each other, or to put it in our middle-school math class terms, where do they equal the same y and thus equal each other?

  1. UC: y=2.472312x+23918.704158
  2. MoR: y=5.285603x+18323.275842

We solve for x:

  1. 5.285603x+18323.275842=2.472312x+23918.704158
  2. 5.285603x=2.472312x+(23918.70415818323.275842)
  3. 5.285603x=2.472312x+5595.428
  4. 5.2856032.472312x=x+5595.4282.472312
  5. 2.137919x=x+2263.237
  6. 2.137919x1x=2263.237
  7. 1.137919x=2263.237
  8. x=2263.2371.137919
  9. x=1988.926 days, or 1988.926365.25=5.445383 years

Phew! That’s a long time. But it’s not “never” (as it is in the logarithmic fits). With that in mind, I don’t predict that MoR will surpass UC within a year, but bearing in mind that 5.4 years isn’t too far off and I expect additional bursts of reviews as MoR updates, I consider it possible within 2 years.

Survival analysis

I compile additional reviews and perform a survival analysis finding a difference in retention between roughly the first half of MoR and the second half, which I attribute to the changed update schedule & slower production in the second half.

On 17 May 2013, having finished learning survival analysis & practicing it on a real dataset in my analysis of Google shutdowns, I decided to return to MoR. At that point, more than 7 months had passed since my previous analysis, and more reviews had accumulated (1256 pages’ worth in November 2012, 1364 by May 2013), so I downloaded as before, processed into a clean CSV, and wound up with 20447 reviews by 7508 users.

I then consolidated the reviews into per-user records: for each reviewer, I extracted their username, the first chapter they reviewed, the date of the first review, the last chapter they reviewed, and the last review’s date. A great many reviewers apparently leave only 1 review ever. The sparsity of reviews presents a challenge for defining an event or ‘death’, since to me the most natural definition of ‘death’ is “not having reviewed the latest chapter”: with such sparsity, a reviewer may still be a reader (the property of interest) without having left a review on the latest chapter. Specifically, of the 7508 users, only 182 have reviewed chapter 87. To deal with this, I arbitrarily make the cutoff “any review of chapter 82 or more recent”: a great deal of plot happened in those 5 chapters, so I reason anyone who isn’t moved to comment (however briefly) on any of them may not be heavily engaged with the story.

As the days pass, the number of reviewers who are that ‘old’ will dwindle as they check out or unsubscribe or simply stop being as enthusiastic as they used to be. We can graph this decline over time (zooming in to account for the issue that only 35% of reviewers leave >1 reviews ever); no particular pattern jumps out beyond being a type III curve (“greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring (see r/K selection theory)”):

Age of reviewers against probability they will not leave another review
Age of reviewers against probability they will not leave another review

But this is plotting age of reviewers and has no clear connection to calendar time, which was the original hypothesis: that the past 2 years have damaged readership more than the earlier chapters which were posted more timely. In particular, chapter posting seems to have slowed down ~ch62 (published 2010-11-27). So, if we were to look at every user who seems to have begun reading MoR only after that date (as evidenced by their first review postdating ch62’s publication) and compare them to those reading before, do they have additional mortality (above and beyond the low mortality we would expect by being relatively latecomers to MoR, which is adjusted for by the survival curve)? The answer seems to be yes, the survival curves look different when we partition users this way:

Different mortality curves for 2 groups of reviewers
Different mortality curves for 2 groups of reviewers
  n= 7508, number of events= 6927

           coef exp(coef) se(coef)    z Pr(>|z|)
LateTRUE 0.2868    1.3322   0.0247 11.6   <2e-16

         exp(coef) exp(-coef) lower .95 upper .95
LateTRUE      1.33      0.751      1.27       1.4

A pretty clear difference between groups, and an odds ratio of 1.33 (bootstrap 95% CI: 1.26-1.39) is not trivially small. I think we can consider the original hypothesis supported to some degree: the later chapters, either because of the update schedule or perhaps the change in subject matter or attracting a different audience or some other reason, seem to be more likely to lose its readers.

But two further minor observations:

  • Visually, the two groups of early & late reviewers look like they might be converging after several years of aging. We might try to interpret this as a ‘hardcore’ vs ‘populist’ distinction: the early reviewers tend to all be early adopters hardcore geeks who become devoted fans in for the long haul, while the latecoming regular people get gradually turned off and only the remaining geeks continue to review.
  • Can we more specifically nail down this increased mortality to the interval between a reviewer’s last review and the next chapter? So far, no; there is actually a trivially small effect the other direction when we add that as a variable:

      n= 7326, number of events= 6927
       (182 observations deleted due to missingness)
    
                  coef exp(coef)  se(coef)     z Pr(>|z|)
    LateTRUE  0.433604  1.542808  0.024906  17.4   <2e-16
    Delay    -0.004236  0.995773  0.000262 -16.2   <2e-16
    
             exp(coef) exp(-coef) lower .95 upper .95
    LateTRUE     1.543      0.648     1.469     1.620
    Delay        0.996      1.004     0.995     0.996

    I interpret this as indicating that an update which is posted after a long drought may be a tiny bit (95% CI: 0.9953-0.9963) more likely to elicit some subsequent reviews (perhaps because extremely long delays are associated with posting multiple chapters quickly), but nevertheless, that overall period with long delays is associated with increased reviewer loss. This makes some post hoc intuitive sense as a polarization effect: if a franchise after a long hiatus finally releases an exciting new work, you may be more likely to discuss or review it, even as the hiatuses make you more likely to finally “check out” for good - you’ll either engage or finally abandon it, but you probably won’t be indifferent.

Source code

Generation and pre-processing, using the shell & Haskell scripts from before:

data <- read.csv("2013-hpmor-rawreviews.csv", colClasses=c("integer", "Date", "factor"))
users <- unique(data$User)
lifetimes <- data.frame(User=NA, Start=NA, End=NA, FirstChapter=NA, LastChapter=NA)[-1,]
for (i in 1:length(users)) { # isolate a specific user
                  d <- data[data$User == users[i],]
                  # sort by date
                  d <- d[order(d$Date),]

                  first <- head(d, n=1)
                  last <- tail(d, n=1)
                  # build & store new row (per user)
                  new <- data.frame(User=users[i], Start=first$Date, End=last$Date,
                                    FirstChapter=first$Chapter, LastChapter=last$Chapter)
                  lifetimes <- rbind(lifetimes, new)
}
lifetimes$Days <- lifetimes$End - lifetimes$Start
summary(lifetimes$Days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0     0.0    55.2     6.0  1160.0
lifetimes$Dead <- lifetimes$LastChapter <= 82

lifetimes$Late <- lifetimes$Start >= as.Date("10-11-27")

# figure out when each chapter was written by its earliest review
chapterDates <- as.Date(NA)
for (i in 1:max(data$Chapter)) { chapterDates[i] <- min(data[data$Chapter==i,]$Date) }

# how long after chapter X was posted did chapter X+1 come along? (obviously NA/unknown for latest chapter)
intervals <- integer(0)
for (i in 1:length(chapterDates)) { intervals[i] <- chapterDates[i+1] - chapterDates[i] }

# look up for every user the relevant delay after their last review
delay <- integer(0)
for (i in 1:nrow(lifetimes)) { delay[i] <- intervals[lifetimes$LastChapter[i]] }
lifetimes$Delay <- delay

Now that we have a clean set of data with the necessary variables like dates, we can analyze:

lifetimes <- read.csv("http://www.gwern.net/docs/2013-hpmor-reviewers.csv",
                       colClasses=c("factor", "Date", "Date", "integer", "integer",
                                    "integer", "logical", "logical","integer"))

library(survival)

surv <- survfit(Surv(Days, Dead, type="right") ~ 1, data=lifetimes)

png(file="~/wiki/images/hpmor/survival-overall.png", width = 3*480, height = 1*480)
plot(surv, ylim=c(0, 0.35))
invisible(dev.off())

cmodel <- coxph(Surv(Days, Dead) ~ Late, data=lifetimes)
print(summary(cmodel))

cat("\nTest for assumption violation to confirm the Cox regression:\n")
print(cox.zph(cmodel))

png(file="~/wiki/images/hpmor/survival-earlylate-split.png", width = 3*480, height = 1*480)
plot(survfit(Surv(Days, Dead) ~ Late, data=lifetimes),
     lty=c(1,2), ylim=c(0, 0.4), main="Reader survivorship: first review before or after chapter 62",
     ylab="Fraction of reader group surviving", xlab="Days between reader's first & last review")
legend("topright", legend=c("Early readers", "Late readers"), lty=c(1 ,2), inset=0.02)
invisible(dev.off())

print(summary(coxph(Surv(Days, Dead) ~ Late + Delay, data=lifetimes)))

cat("\nBegin bootstrap test of coefficient size...\n")
library(boot)
cat("\nGet a subsample, train Cox on it, extract coefficient estimate;\n")
cat("\nfirst the simplest early/late model, then the more complex:\n")
# TODO: factor out duplication
coxCoefficient1 <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  cmodel <- coxph(Surv(Days, Dead) ~ Late, data=g)
  return(coef(cmodel))
}
cox1Bs <- boot(data=lifetimes, statistic=coxCoefficient1, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox1Bs, type="perc"))
coxCoefficient2 <- function(gb, indices) {
  g <- gb[indices,] # allows boot to select subsample
  # train new regression model on subsample
  cmodel <- coxph(Surv(Days, Dead) ~ Late + Delay, data=g)
  return(coef(cmodel))
}
cox2Bs <- boot(data=lifetimes, statistic=coxCoefficient2, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox2Bs, type="perc", index=1)) # `Late`
print(boot.ci(cox2Bs, type="perc", index=2)) # `Delay`

See also


  1. The author points out that his “Offside” fanfiction is past 32,000 reviews.