*Recording fan speculation for retrospectives; statistically modeling reviews for ongoing story with R*( )

created:

*3 Nov 2012*; modified:

*09 Jan 2014*; status:

*finished*; belief:

*likely*

The unprecedented gap in

Methods of Rationalityupdates prompts musing about whether readership is increasing enough & what statistics one would use; I write code to download FF.net reviews, clean it, parse it, load into R, summarize the data & depict it graphically, run linear regression on a subset & all reviews, note the poor fit, develop a quadratic fit instead, and use it to predict future review quantities.Then, I run a similar analysis on a competing fanfiction to find out when they will have equal total review-counts. A try at logarithmic fits fails; fitting a linear model to the previous 100 days of

MoRand the competitor works much better, and they predict a convergence in <5 years.A survival analysis finds no major anomalies in reviewer lifetimes, but an apparent increase in mortality for reviewers who started reviewing with later chapters, consistent with (but far from proving) the original theory that the later chapters’ delays are having negative effects.

# Background

In a LW comment, I asked:

I noticed in Eliezer’s latest [1 November 2012] MoR update that he now has

18,000 words written[of chapter 86], and even when that “chapter” is finished, he still doesn’t plan to post anything, on top of a drought that has now lasted more than half a year. This doesn’t seem very optimal from the point of view of gaining readers.But I was wondering how one would quantify that - how one would estimate how many readers Eliezer’s MoR strategy of very rare huge dumps is costing him. Maybe survivorship curves, where survivorship is defined as “posting a review on FF.net”? So if say during the weekly MoR updates, a reader who posted a review of chapter X posted a review anywhere in X to X+n, that counts as survival of that reader. One problem here is that new readers are constantly arriving… You can’t simply say “X readers did not return from chapter 25 to chapter 30, while X+N did not return from chapter 85 to 86, therefore frequent updates are better” since you would expect the latter to be bigger simply because more people are reading. And even if you had data on unfavoriting or unfollowing, the important part is the opportunity cost - how many readers would have subscribed with regular updates

If you had total page views, that’d be another thing; you could look at conversions: what percentage of page views resulted in conversions to subscribers for the regular update period versus the feast/fame periods. But I don’t think FF.net provides it and while HP:MoR.com has Google Analytics, I don’t have access. Maybe one could look at each chapter pair-wise, and seeing what fraction of reviewers return? The growth might average out since we’re doing so many comparisons… But the delay is now so extreme this would fail: we’d expect a huge growth in reviewers from chapter 85 to chapter 86, for example, simply because it’s been like 8 months now - here too the problem is that the growth in reviewers will be less than what it “should be”. But how do we figure out “should be”?

After some additional discussion with clone_of_saturn, I’ve rejected the idea of survivorship curves; we thought correlations between duration and total review count might work, but interpretation was not clear. So the best current idea is: gather duration between each chapter, the # of reviews posted within X days (where X is something like 2 or 3), plot the points, and then see whether a line fits it better or a logarithm/logistic curve - to see whether growth slowed as the updates were spaced further apart.

Getting the data is the problem. It’s easy to get total reviews for each chapter since FF.net provides them. I don’t know of any way to get total reviews after X days posted, though! A script or program could probably do it, but I’m not sure I want to bother with all this work (especially since I don’t know how to do Bayesian line or logistic-fitting) if it’s not of interest to anyone or Eliezer would simply ignore any results.

I decided to take a stab at it.

# Data

FanFiction.Net does not provide published dates for each chapter, so we’ll be working off solely the posted reviews. FF.net reviews for *MoR* are provided on 1256 pages (as of 3 November 2012), numbering from 1. If we go from 1-1256, starting with `http://www.fanfiction.net/r/5782108/85/1/`

, we can download them with a script enumerating the URLs & using string interpolation. Eyeballing a textual dump of the HTML (courtesy of `elinks`

), it’s easy to extract with `grep`

the line with user, chapter number, and date with a HTML string embedded as a link with each review, leading to this script:

```
for i in {1..1256}
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/5782108/0/$i/ \
| grep "report review for abuse " >> reviews.txt
done
```

Since page #1 is for the most recent URLs and we started with 1, the reviews were dumped newest to oldest: eg.

```
...
[IMG] report review for abuse Faerie of Tara 4/1/12 . chapter 3
[IMG] report review for abuse Faerie of Tara 4/1/12 . chapter 2
report review for abuse Aww 4/1/12 . chapter 81
report review for abuse bella-farfallina 4/1/12 . chapter 81
[IMG] report review for abuse Isis Aurora Tomoe 4/1/12 . chapter 2
[IMG] report review for abuse bosk 4/1/12 . chapter 1
[IMG] report review for abuse yamiishot 4/1/12 . chapter 81
report review for abuse Joe in Australia 4/1/12 . chapter 81
report review for abuse of course 3/31/12 . chapter 45
report review for abuse RickJs 3/31/12 . chapter 22
...
```

We can clean this up with more shell scripting:

`sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt`

This gives us better looking data like:

```
NATWEST 3/4/10 4
hemotem 3/4/10 4
Isabelle Eir 3/4/10 4
Orc Shaman 3/3/10 1
Mariann's 3/3/10 4
cheekylildevil 3/3/10 4
Star Bear 3/3/10 4
```

The spaces in names means we can’t simply do a search-and-replace to turn spaces into commas and call that a CSV which can easily be read in R; so more scripting, in Haskell this time (though I’m sure `awk`

or something could do the job, I don’t know it):

```
import Data.List (intercalate)
main :: IO ()
main = do f <- readFile "filtered.txt"
let parsed = map words (lines f)
let unparsed = "User,Date,Chapter\n" ++ unlines (map unsplit parsed)
writeFile "reviews.csv" unparsed
-- eg "Star Bear 3/3/10 4" ~> ["Star", "Bear", "3/3/10", "4") ~> "Star Bear,3/3/10,4"
unsplit :: [String] -> String
unsplit xs = let chapter = last xs
date = last (init xs)
user = unwords (init (init xs))
in intercalate "," [user, date, chapter]
```

## R

A `runhaskell mor.hs`

later and our CSV of 18,852 reviews is ready for analysis in the R interpreter:

```
data <- read.csv("http://www.gwern.net/docs/2012-hpmor-reviews.csv")
data
...
18840 cheekylildevil 2/28/10 1
18841 scarletalphabet 2/28/10 1
18842 StoryTagger 2/28/10 1
18843 HJP265 2/28/10 1
18844 dougal74 2/28/10 1
18845 baceba 2/28/10 1
18846 Cibbler 2/28/10 1
18847 Davek86 2/28/10 1
18848 omegahurricane 2/28/10 1
18849 hhrforeverhhr 2/28/10 1
18850 anonnatymousMARTIN 2/28/10 1
18851 wetterwaxs 2/28/10 1
18852 Errrrr 2/28/10 1
# Parse the date strings into R date objects
data$Date <- as.Date(data$Date,"%m/%d/%Y")
# Reorganize columns (we care more about the chapter & date reviews were given on
# than which user did a review):
data <- data.frame(Chapter=data$Chapter, Date=data$Date, User=data$User)
```

## Descriptive

```
summary(data)
Chapter Date User
Min. : 1.00 Min. :10-02-28 Guest : 310
1st Qu.:10.00 1st Qu.:10-06-16 anon : 106
Median :27.00 Median :10-11-26 voodooqueen126: 85
Mean :35.64 Mean :11-02-10 Kutta : 77
3rd Qu.:63.00 3rd Qu.:11-09-02 badkidoh : 73
Max. :85.00 Max. :12-11-02 AR : 72
(Other) :18129
```

Surprisingly, it looks like only one user (‘voodooqueen126’) has reviewed every chapter. The dates also clearly indicate the slowdown in *MoR* posting, with the median chapter being posted all the way back in 2010. What does the per-day review pattern look like?

`plot(table(data$Date))`

Quite spiky. At a guess, the days with huge spikes are new chapters, while a low-level average of what looks like ~10 reviews a day holds steady over time. What’s the cumulative total of all reviews over time?

`plot(cumsum(table(data$Date)))`

When we plot each review by the chapter it was reviewing and the date it was posted, we see more clear patterns:

`plot(data$Chapter,data$Date)`

Obviously no one can review a chapter before it was posted, which is the thick black line along the bottom. We can also see how the delays between chapters: originally very small, the gaps increase tremendously in the 70s (remember the whole y-axis is ~3 years) - just 2 of the delays look like they cover something like $\frac{1}{3}-\frac{1}{2}$ the entire lifespan of *MoR*! Other interesting observations is that chapter 1 and chapter 5 are extremely popular for reviewers, with hardly a day passing without review even as vast wastelands open up in 2012 in chapters 40-60. A final curious observation is the presence of distinct horizontal lines; I am guessing that these are people prone to reviewing who are binging their way through *MoR*.

### Filtered reviews

The next question is: if we plot only the subset of reviews which were made within 7 days of the first review, what does that look like?

```
# extract each chapter's reviews into its own vector:
# one vector of dates for ch1, one vector of dates for ch2, for ch3 etc.
myList <- list()
for (i in 1:85) { myList[[i]] <- rev(data$Date[data$Chapter==i]) }
# take a look at the first entry in each vector as a sanity check
lapply(myList, function(a) a[1])
[[1]]
[1] "10-02-28"
[[2]]
[1] "10-02-28"
[[3]]
[1] "10-03-01"
[[4]]
[1] "10-03-03"
[[5]]
[1] "10-03-08"
...
# looks reasonable
# what's the total by each chapter?
totalreviews <- unlist(lapply(myList, length))
totalreviews
[1] 823 391 205 293 1159 499 419 375 423 498 332 215 270 283 222
[16] 291 385 302 270 311 248 199 215 211 141 243 283 138 136 143
[31] 160 204 148 119 28 77 106 48 130 101 117 183 52 33 139
[46] 246 334 120 148 104 181 69 47 183 38 143 74 202 80 131
[61] 127 164 303 175 146 42 147 130 253 296 217 398 136 249 187
[76] 231 525 185 153 308 235 113 74 154 236
# Now the hard part: map over the list of vectors, & filter out anything dated 7 days after the first date
earlyreviews <- unlist(lapply(lapply(myList,
function(a) Filter(function(b) {if (b > a[1]+7) return(FALSE) else return(TRUE)}, a)), length))
[1] 17 11 13 42 341 191 204 210 296 282 117 136 162 144 126 155 199 199 159
[20] 209 133 125 142 111 92 145 193 96 92 43 115 170 78 88 5 34 77 17
[39] 48 71 88 129 18 10 40 164 246 100 124 82 150 51 32 168 32 129 62
[58] 167 66 116 109 140 155 55 114 19 111 104 175 187 110 175 127 212 164 181
[77] 117 169 139 299 211 106 69 143 106
```

How did chapter 5 get *341* reviews within 7 days and have *1159* in total, you ask - might my data not be wrong? But if you check the chapter 5 reviews and scroll all the way down, you’ll find that people back in 2010 were *very* chuffed by the Draco scenes.

# Analysis

## Linear modeling

Moving onward, we want to graph the date and then fit a linear model (a line which minimizes the deviation of all of the points from it; I drew on a handout for R programming):

```
# turn vector into a plottable table & plot it
filtered <- data.frame(Chapter=c(1:85),Reviews=earlyreviews)
plot(filtered$Chapter, filtered$Reviews)
# run a linear model: is there any clear trend?
lm(filtered$Reviews ~ filtered$Chapter)
...
Coefficients:
(Intercept) filtered$Chapter
128.3773 -0.0966
# the slope doesn't look very interesting...
# let's plot the regression line too:
abline(lm(filtered$Reviews ~ filtered$Chapter))
```

Look at the noise. We have some huge spikes in the early chapters, some sort of dip in the 40s-70s, and then a similar spread of highly and lowly reviewed chapters at the end. The linear model is basically just the average (`mean(filtered$Reviews)`

= 124.2); it’s actually predicting slight declines in reviews posted within 7 days, interestingly.

### All reviews

Same thing if we look at *total* reviews for each chapter:

```
filtered2=data.frame(Chapter=c(1:85),Reviews=totalreviews)
plot(filtered2$Chapter, filtered2$Reviews)
model2 = lm(filtered2$Reviews ~ filtered2$Chapter)
summary(model2)
...
Residuals:
Min 1Q Median 3Q Max
-215.47 -98.81 -24.02 58.08 834.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 338.3462 33.4318 10.120 3.79e-16 ***
filtered2$Chapter -2.7107 0.6753 -4.014 0.00013 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 152.8 on 83 degrees of freedom
Multiple R-squared: 0.1626, Adjusted R-squared: 0.1525
F-statistic: 16.11 on 1 and 83 DF, p-value: 0.0001302
abline(model2)
```

## Fitting quadratics

Looking at the line, the fit seems poor over all (large residuals) quite poor at the beginning and end - doesn’t it look like a big U? That sounds like a quadratic, so more Googling for R code and an hour or two later, I fit some quadratics to the 7-day reviews and the total reviews:

```
plot(filtered$Chapter, filtered$Reviews)
model <- lm(filtered$Reviews ~ filtered$Chapter + I(filtered$Chapter^2))
summary(model)
...
Residuals:
Min 1Q Median 3Q Max
-160.478 -42.795 -3.364 40.813 179.321
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 178.41168 22.70879 7.857 1.34e-11 ***
filtered$Chapter -3.54725 1.21875 -2.911 0.00464 **
I(filtered$Chapter^2) 0.04012 0.01373 2.922 0.00449 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 68.15 on 82 degrees of freedom
Multiple R-squared: 0.09533, Adjusted R-squared: 0.07327
F-statistic: 4.32 on 2 and 82 DF, p-value: 0.01645
xx <- seq(min(filtered$Chapter),max(filtered$Chapter),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
```

The quadratic fits nicely. And for total reviews, we observe a similar curve:

```
plot(filtered2$Chapter, filtered2$Reviews)
model2 <- lm(filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2))
summary(model2)
...
Residuals:
Min 1Q Median 3Q Max
-289.68 -56.83 -17.59 36.67 695.45
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 543.84019 41.43142 13.126 < 2e-16 ***
filtered2$Chapter -16.88265 2.22356 -7.593 4.45e-11 ***
I(filtered2$Chapter^2) 0.16479 0.02505 6.578 4.17e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 124.3 on 82 degrees of freedom
Multiple R-squared: 0.4518, Adjusted R-squared: 0.4384
F-statistic: 33.79 on 2 and 82 DF, p-value: 1.977e-11
xx <- seq(min(filtered2$Chapter),max(filtered2$Chapter),len=200)
yy <- model2$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
```

### Predictions / confidence intervals

Linear models, even as a quadratic, are pretty simple. They don’t have to be a black box, and they’re interpretable the moment you are given the coefficients: ‘this’ times ‘that’ equals ‘that other thing’. We learned these things in middle school by hand-graphing functions. So it’s not hard to figure out what function our fitted lines represent, or to extend them rightwards to uncharted waters.

Our very first linear model for `filtered$Reviews ~ filtered$Chapter`

told us that the coefficients were Intercept: 128.3773, `filtered$Chapter`

: -0.0966. Going back to middle school math, the intercept is the *y*-intercept, the value of *y* when *x*=0; then the other number must be how much we multiply *x* by.

So the linear model is simply this: $y=-0.0966x+128.3773$. To make predictions, we substitute in for *x* as we please; so chapter 1 is $-0.0966\times 1+128.3773=128.28$, chapter 85 is $-0.0966\times 85+128.3773=120.16$ (well, what did you expect of a nearly-flat line?) and so on.

We can do this for the quadratic too; your ordinary quadratic looked like $y={x}^{2}+x+1$, so it’s easy to guess what the filled-in equivalent will be. We’ll look at total reviews this time (`filtered2$Reviews ~ filtered2$Chapter + I(filtered2$Chapter^2)`

). The intercept was 543.84019, the first coefficient was -16.88265, and the third mystery number was 0.16479, so the new equation is $y=0.16479{x}^{2}+-16.88265x+543.84019$. What’s the prediction for chapter 86? $0.16479\times ({86}^{2})+(-16.88265)\times 86+543.8401=310.719$. And for chapter 100, it’d go $0.16479\times ({100}^{2})+(-16.88265)\times 100+543.8401=503.475$.

Can we automate this? Yes. Continuing with total reviews per chapter:

```
x = filtered2$Chapter; y = filtered2$Reviews
model2 <- lm(y ~ x + I(x^2))
predict.lm(model2, newdata = data.frame(x = c(86:100)))
1 2 3 4 5 6 7 8
310.7242 322.3504 334.3061 346.5914 359.2063 372.1507 385.4248 399.0284
9 10 11 12 13 14 15
412.9616 427.2244 441.8168 456.7387 471.9903 487.5714 503.4821
```

(Notice that this use of `predict`

spits out the same answers we calculated for chapter 86 and chapter 100.)

So for example, the fitted quadratic model predicts ~503 total reviews for chapter 100. Making further assumptions (the error in estimating is independent of time, normally distributed, has zero mean, and constant variance), we predict that our 95% confidence interval 215-792 total reviews of chapter 100:

```
predict.lm(model2, newdata = data.frame(x = 100), interval="prediction")
fit lwr upr
503.4821 215.05 791.9141
```

Clearly the model isn’t making very strong predictions because the standard error is so high. We get a similar result asking just for the 7-day reviews:

```
filtered=data.frame(Chapter=c(1:85),Reviews=earlyreviews)
y=filtered$Reviews; x=filtered$Chapter
model = lm(y ~ x + I(x^2))
model
...
Coefficients:
(Intercept) x I(x^2)
178.41168 -3.54725 0.04012
predict.lm(model, newdata = data.frame(x = 100), interval="prediction")
fit lwr upr
1 224.925 66.83383 383.0162
```

This *seems* like a fairly reasonable prediction: 66-383 reviews posted within 7 days. But this is based purely off the model, is it taking into account that even the quadratic model has some pretty large residuals? Let’s plot an entire confidence band:

```
filtered <- data.frame(Chapter=c(1:85),Reviews=earlyreviews)
y <- filtered$Reviews; x = filtered$Chapter
plot(x, xlab="Chapter",
y, ylab="Reviews posted within week",
main="7-Day Reviews per Chapter with Quadratic Fit and 95% Prediction Intervals")
model <- lm(y ~ x + I(x^2))
xx <- seq(min(x),max(x),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
lines(xx,yy,lwd=2,col=3)
prd <- predict(model, newdata = data.frame(x = c(1:85)), interval="prediction", type="response")
lines(c(1:85),prd[,2],col="red",lty=2)
lines(c(1:85),prd[,3],col="red",lty=2)
```

The prediction bands look like they’re more or less living up to their claimed coverage (7/85 isn’t hugely far from 5/100) but do this by being very large and embarrassingly can’t even exclude zero. This should temper our forecasting enthusiasm. What might predict the residuals better? Day of week? Length of chapter update? Interval between updates? Hard to know.

## Modeling conclusions

Sadly, neither the quadratic nor the linear fits lend any support to the original notion that we could estimate how much readership *MoR* is being cost by the exponentially-slowing updates for the simple reason that the observed reviews per chapter do not match any simple increasing model of readership but a stranger U-curve (why do reviews decline so much in the middle when updates were quite regular?).

We could argue that the sluggishness of the recent upturn in the U-curve (in the 80s) is due to there only being a few humongous chapters, but those chapters are quantitatively and qualitatively different from the early chapters (especially chapter 5) so it’s not clear that any of them “should” have say, 341 reviews within a week.

## Additional graphs

## The Review Race: *Unexpected Circumstances* versus *MoR*

An interesting question is whether and when *MoR* will become the most-reviewed fanfiction on FFN (and by extension, one of the most-reviewed fanfictions of all time). It has already passed such hits as “Parachute” & “Fridays at Noon”, but the next competitor is *Unexpected Circumstances* (*UC*) at 24,163 reviews to *MoR*’s 18,917^{1}.

### Averaging

24.2k vs 19k is a large gap to cover. Some quick and dirty reasoning:

*MoR*: 18911 reviews, begun 02-28-10 or 992 days ago, so $\frac{18911}{992}=19.1$ reviews per day (Status: in progress)*UC*: 24163 reviews, begun 11-22-10 or 725 days ago, so $\frac{24163}{725}=33.3$ reviews per day (Status: complete)

Obviously if *UC* doesn’t see its average review rate decline, it’ll never be passed by *MoR*! But let’s assume the review rate has gone to zero (not *too* unrealistic a starting assumption, since *UC* is finished and *MoR* is not), how many days will it take *MoR* to catch up? $\frac{24163-18911}{\frac{18911}{992}}=275.5$ or under a year.

### Modeling

#### Descriptive Data and Graphs

We can take into account the existing review rates with the same techniques we used previously for investigating *MoR* review patterns. First, we can download & parse *UC*’s reviews the same way as before: make the obvious tweaks to the shell script, run the `sed`

command, re-run the Haskell, then load into R & format:

```
for i in {1..1611};
do elinks -dump-width 1000 -no-numbering -no-references -dump http://www.fanfiction.net/r/6496709/0/$i/
| grep "report review for abuse " >> reviews.txt;
done
sed -e 's/\[IMG\] //' -e 's/.*for abuse //' -e 's/ \. chapter / /' reviews.txt >> filtered.txt
runhaskell mor.hs
```

```
data <- read.table("http://www.gwern.net/docs/2012-uc-reviews.csv",header=TRUE, sep=",",quote="")
# Convert the date strings to R dates
data=data.frame(Chapter=data$Chapter, Date=as.Date(data$Date,"%m/%d/%Y"), User=data$User)
data
Chapter Date User
1 39 12-11-14 REL57
2 27 12-11-13 Guest
3 26 12-11-13 Melanie
4 24 12-11-13 Melanie
5 39 12-11-13 Dream-with-your-heart
6 39 12-11-12 ebdarcy.qt4good
7 19 12-11-10 Guest
8 15 12-11-10 JD Thorn
9 39 12-11-09 shollyRK
10 9 12-11-08 of-Snowfall-and-Seagulls
...
summary(data)
Chapter Date User
Min. : 1.00 Min. :10-11-22 Guest : 54
1st Qu.:12.00 1st Qu.:11-02-26 AbruptlyChagrined : 39
Median :22.00 Median :11-06-20 AlexaBrandonCullen: 39
Mean :20.62 Mean :11-06-10 Angeldolphin01 : 39
3rd Qu.:29.00 3rd Qu.:11-09-09 AnjieNet : 39
Max. :39.00 Max. :12-11-14 beachblonde2244 : 39
(Other) :23915
```

Unlike *MoR*, multiple people have reviewed all (39) chapters. Let’s look at total reviews posted each calendar day:

`plot(table(data$Date))`

We’d like to see what a running or cumulative total reviews looks like: how fast is *UC*’s total review-count growing these days?

`plot(cumsum(table(data$Date)))`

Visually, it’s not quite a straight line, not quite a quadratic…

We also get a similar graph to *MoR* if we graph by chapter instead, with thick bars indicating reviews left at the “latest” chapter while the series was updating, and many reviews left at the final chapter as people binge through *UC*:

`plot(data$Chapter,data$Date)`

Let’s start modeling.

#### Linear model

The linear doesn’t match well since it clearly is massively overpredicting future growth!

```
totals <- cumsum(table(data$Date))
days <- c(1:(length(totals)))
plot(days, totals)
model <- lm(totals ~ days)
abline(model)
```

#### Quadratic model

Well, the graph does look more like a quadratic, but this has another problem:

```
model <- lm(totals ~ days + I(days^2))
xx <- seq(min(days),max(days),len=200)
yy <- model$coef %*% rbind(1,xx,xx^2)
plot(days, totals)
lines(xx,yy,lwd=2,col=3)
```

It fits much better overall, but for our purposes it is still rubbish: it is now predicting *negative* growth in total review count (a parabola must come down, after all), which of course cannot happen since reviews are not deleted.

#### Logarithmic model

We need a logarithmic function to try to model this “leveling-off” behavior:

```
logEstimate = lm(totals ~ log(days))
plot(days,predict(logEstimate),type='l',col='blue')
points(days,totals)
```

The fit is very bad at the start, but the part we care about, recent reviews, seems better modeled than before. Progress! With a working model, let’s ask the original question: with these models predicting growth for *UC* and *MoR*, when do they finally have an equal number of reviews? We can use the `predict.lm`

function to help us

```
# for MoR
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
...
9993 9994 9995 9996 9997 9998 9999 10000
31781.54 31781.78 31782.03 31782.27 31782.52 31782.76 31783.01 31783.25
10001
31783.50
# for UC
predict.lm(logEstimate, newdata = data.frame(days = c(10000:20000)))
...
9993 9994 9995 9996 9997 9998 9999 10000
50129.51 50129.88 50130.26 50130.64 50131.02 50131.39 50131.77 50132.15
10001
50132.53
```

Growth in reviews has asymptoted for each way out at 10,000 days or 27 years, but still *UC* > *MoR*! This feels a little unlikely: the log function grows very slowly, but are the reviews really slowing down *that* much?

#### Linear model revisited

Let’s ask a more restricted question - instead of trying to model the entire set of reviews, why not look at just the most recent chunk, the one sans any updates (for both, sadly), say the last 100 days?

```
totals <- cumsum(table(data$Date))
totals <- totals[(length(totals)-100):length(totals)]
days <- c(1:(length(totals)))
days <- days[(length(days)-100):length(days)]
plot(days, totals)
model <- lm(totals ~ days)
summary(model)
...
Residuals:
Min 1Q Median 3Q Max
-18.4305 -4.4631 0.2601 7.4555 16.0125
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.392e+04 1.769e+00 13517.26 <2e-16 ***
days 2.472e+00 3.012e-02 82.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.826 on 99 degrees of freedom
Multiple R-squared: 0.9855, Adjusted R-squared: 0.9854
F-statistic: 6737 on 1 and 99 DF, p-value: < 2.2e-16
abline(model)
```

That’s an impressive fit: visually a great match, relatively small residuals, extreme statistical significance. (We saw a similar graph for *MoR* too.) So clearly the linear model is accurately describing recent reviews.

##### Predicting

Let’s use 2 models like that to make our predictions. We’ll extract the 2 coefficients which define a $y=ax+c$ linear model, and solve for the equality:

```
# UC:
coefficients(model)
(Intercept) days
23918.704158 2.472312
# MoR:
coefficients(model)
(Intercept) days
18323.275842 5.285603
```

The important thing here is 2.4723 vs 5.2856: that’s how many reviews *UC* is getting per day versus *MoR* over the last 100 days according to the linear models which fit very well in that time period. Where do they intercept each other, or to put it in our middle-school math class terms, where do they equal the same *y* and thus equal each other?

*UC*: $y=2.472312x+23918.704158$*MoR*: $y=5.285603x+18323.275842$

We solve for *x*:

- $5.285603x+18323.275842=2.472312x+23918.704158$
- $5.285603x=2.472312x+(23918.704158-18323.275842)$
- $5.285603x=2.472312x+5595.428$
- $\frac{5.285603}{2.472312}x=x+\frac{5595.428}{2.472312}$
- $2.137919x=x+2263.237$
- $2.137919x-1x=2263.237$
- $1.137919x=2263.237$
- $x=\frac{2263.237}{1.137919}$
- $x=1988.926$ days, or $\frac{1988.926}{365.25}=5.445383$ years

Phew! That’s a long time. But it’s not “never” (as it is in the logarithmic fits). With that in mind, I don’t predict that *MoR* will surpass *UC* within a year, but bearing in mind that 5.4 years isn’t too far off and I expect additional bursts of reviews as *MoR* updates, I consider it possible within 2 years.

## Survival analysis

I compile additional reviews and perform a survival analysis finding a difference in retention between roughly the first half of

MoRand the second half, which I attribute to the changed update schedule & slower production in the second half.

On 17 May 2013, having finished learning survival analysis & practicing it on a real dataset in my analysis of Google shutdowns, I decided to return to *MoR*. At that point, more than 7 months had passed since my previous analysis, and more reviews had accumulated (1256 pages’ worth in November 2012, 1364 by May 2013), so I downloaded as before, processed into a clean CSV, and wound up with 20447 reviews by 7508 users.

I then consolidated the reviews into per-user records: for each reviewer, I extracted their username, the first chapter they reviewed, the date of the first review, the last chapter they reviewed, and the last review’s date. A great many reviewers apparently leave only 1 review ever. The sparsity of reviews presents a challenge for defining an event or ‘death’, since to me the most natural definition of ‘death’ is “not having reviewed the latest chapter”: with such sparsity, a reviewer may still be a reader (the property of interest) without having left a review on the latest chapter. Specifically, of the 7508 users, only 182 have reviewed chapter 87. To deal with this, I arbitrarily make the cutoff “any review of chapter 82 or more recent”: a great deal of plot happened in those 5 chapters, so I reason anyone who isn’t moved to comment (however briefly) on any of them may not be heavily engaged with the story.

As the days pass, the number of reviewers who are that ‘old’ will dwindle as they check out or unsubscribe or simply stop being as enthusiastic as they used to be. We can graph this decline over time (zooming in to account for the issue that only 35% of reviewers leave >1 reviews ever); no particular pattern jumps out beyond being a type III curve (“greatest mortality is experienced early on in life, with relatively low rates of death for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring (see r/K selection theory)”):

But this is plotting age of reviewers and has no clear connection to calendar time, which was the original hypothesis: that the past 2 years have damaged readership more than the earlier chapters which were posted more timely. In particular, chapter posting seems to have slowed down ~ch62 (published 2010-11-27). So, if we were to look at every user who seems to have begun reading *MoR* only after that date (as evidenced by their first review postdating ch62’s publication) and compare them to those reading before, do they have additional mortality (above and beyond the low mortality we would expect by being relatively latecomers to *MoR*, which is adjusted for by the survival curve)? The answer seems to be yes, the survival curves look different when we partition users this way:

```
n= 7508, number of events= 6927
coef exp(coef) se(coef) z Pr(>|z|)
LateTRUE 0.2868 1.3322 0.0247 11.6 <2e-16
exp(coef) exp(-coef) lower .95 upper .95
LateTRUE 1.33 0.751 1.27 1.4
```

A pretty clear difference between groups, and an odds ratio of 1.33 (bootstrap 95% CI: 1.26-1.39) is not trivially small. I think we can consider the original hypothesis supported to some degree: the later chapters, either because of the update schedule or perhaps the change in subject matter or attracting a different audience or some other reason, seem to be more likely to lose its readers.

But two further minor observations:

- Visually, the two groups of early & late reviewers look like they might be converging after several years of aging. We might try to interpret this as a ‘hardcore’ vs ‘populist’ distinction: the early reviewers tend to all be early adopters hardcore geeks who become devoted fans in for the long haul, while the latecoming regular people get gradually turned off and only the remaining geeks continue to review.
Can we more specifically nail down this increased mortality to the interval between a reviewer’s last review and the next chapter? So far, no; there is actually a trivially small effect the other direction when we add that as a variable:

`n= 7326, number of events= 6927 (182 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(>|z|) LateTRUE 0.433604 1.542808 0.024906 17.4 <2e-16 Delay -0.004236 0.995773 0.000262 -16.2 <2e-16 exp(coef) exp(-coef) lower .95 upper .95 LateTRUE 1.543 0.648 1.469 1.620 Delay 0.996 1.004 0.995 0.996`

I interpret this as indicating that an update which is posted after a long drought may be a

*tiny*bit (95% CI: 0.9953-0.9963) more likely to elicit some subsequent reviews (perhaps because extremely long delays are associated with posting multiple chapters quickly), but nevertheless, that overall period with long delays is associated with increased reviewer loss. This makes some post hoc intuitive sense as a polarization effect: if a franchise after a long hiatus finally releases an exciting new work, you may be more likely to discuss or review it, even as the hiatuses make you more likely to finally “check out” for good - you’ll either engage or finally abandon it, but you probably won’t be indifferent.

### Source code

Generation and pre-processing, using the shell & Haskell scripts from before:

```
data <- read.csv("2013-hpmor-rawreviews.csv", colClasses=c("integer", "Date", "factor"))
users <- unique(data$User)
lifetimes <- data.frame(User=NA, Start=NA, End=NA, FirstChapter=NA, LastChapter=NA)[-1,]
for (i in 1:length(users)) { # isolate a specific user
d <- data[data$User == users[i],]
# sort by date
d <- d[order(d$Date),]
first <- head(d, n=1)
last <- tail(d, n=1)
# build & store new row (per user)
new <- data.frame(User=users[i], Start=first$Date, End=last$Date,
FirstChapter=first$Chapter, LastChapter=last$Chapter)
lifetimes <- rbind(lifetimes, new)
}
lifetimes$Days <- lifetimes$End - lifetimes$Start
summary(lifetimes$Days)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 0.0 55.2 6.0 1160.0
lifetimes$Dead <- lifetimes$LastChapter <= 82
lifetimes$Late <- lifetimes$Start >= as.Date("10-11-27")
# figure out when each chapter was written by its earliest review
chapterDates <- as.Date(NA)
for (i in 1:max(data$Chapter)) { chapterDates[i] <- min(data[data$Chapter==i,]$Date) }
# how long after chapter X was posted did chapter X+1 come along? (obviously NA/unknown for latest chapter)
intervals <- integer(0)
for (i in 1:length(chapterDates)) { intervals[i] <- chapterDates[i+1] - chapterDates[i] }
# look up for every user the relevant delay after their last review
delay <- integer(0)
for (i in 1:nrow(lifetimes)) { delay[i] <- intervals[lifetimes$LastChapter[i]] }
lifetimes$Delay <- delay
```

Now that we have a clean set of data with the necessary variables like dates, we can analyze:

```
lifetimes <- read.csv("http://www.gwern.net/docs/2013-hpmor-reviewers.csv",
colClasses=c("factor", "Date", "Date", "integer", "integer",
"integer", "logical", "logical","integer"))
library(survival)
surv <- survfit(Surv(Days, Dead, type="right") ~ 1, data=lifetimes)
png(file="~/wiki/images/hpmor/survival-overall.png", width = 3*480, height = 1*480)
plot(surv, ylim=c(0, 0.35))
invisible(dev.off())
cmodel <- coxph(Surv(Days, Dead) ~ Late, data=lifetimes)
print(summary(cmodel))
cat("\nTest for assumption violation to confirm the Cox regression:\n")
print(cox.zph(cmodel))
png(file="~/wiki/images/hpmor/survival-earlylate-split.png", width = 3*480, height = 1*480)
plot(survfit(Surv(Days, Dead) ~ Late, data=lifetimes),
lty=c(1,2), ylim=c(0, 0.4), main="Reader survivorship: first review before or after chapter 62",
ylab="Fraction of reader group surviving", xlab="Days between reader's first & last review")
legend("topright", legend=c("Early readers", "Late readers"), lty=c(1 ,2), inset=0.02)
invisible(dev.off())
print(summary(coxph(Surv(Days, Dead) ~ Late + Delay, data=lifetimes)))
cat("\nBegin bootstrap test of coefficient size...\n")
library(boot)
cat("\nGet a subsample, train Cox on it, extract coefficient estimate;\n")
cat("\nfirst the simplest early/late model, then the more complex:\n")
# TODO: factor out duplication
coxCoefficient1 <- function(gb, indices) {
g <- gb[indices,] # allows boot to select subsample
# train new regression model on subsample
cmodel <- coxph(Surv(Days, Dead) ~ Late, data=g)
return(coef(cmodel))
}
cox1Bs <- boot(data=lifetimes, statistic=coxCoefficient1, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox1Bs, type="perc"))
coxCoefficient2 <- function(gb, indices) {
g <- gb[indices,] # allows boot to select subsample
# train new regression model on subsample
cmodel <- coxph(Surv(Days, Dead) ~ Late + Delay, data=g)
return(coef(cmodel))
}
cox2Bs <- boot(data=lifetimes, statistic=coxCoefficient2, R=20000, parallel="multicore", ncpus=4)
print(boot.ci(cox2Bs, type="perc", index=1)) # `Late`
print(boot.ci(cox2Bs, type="perc", index=2)) # `Delay`
```