# Sleep

My expectations are that the treadmill will increase how much I sleep, decrease sleep latency, and possibly have a small negative effect on productivity (which may be offset by an improvement in mood and less need for a daily walk). Subjectively, whenever I use the treadmill, it feels like I can’t work on hard material like programming or statistics, and I need to sit down and be still to really focus; I wonder if it is because my head bobbles slightly as I walk, and whether a VR solution like an Oculus Rift might fix the jiggling issue. (If the walking were intense aerobic exercise, I might expect an increase in cognitive abilities of various sorts, but it’s not, so I don’t expect any effect on Mnemosyne scores.)

# Typing

Fortunately, I had used Amphetype for typing practice for 3 years prior to finding the treadmill, so I could compare my daily treadmill typing sessions to a very long dataseries.

The graph looks like WPM (but not Accuracy) may have been damaged, but it’s not clear at all: we should do statistics. Amphetype stores the graphed data in a SQLite database, from which, after a little tinkering, I figured out how to extract the WPM & Accuracy scores:

$ sqlite3 -batch gwern.db 'SELECT w, wpm, accuracy FROM result;' > ~/stats.txt

Which gives a file like:

1233502576.01172|70.2471151325281|0.981412639405205
1233502634.48339|80.9762013034008|0.989159891598916
1233502677.26434|74.0623733171948|0.988326848249027
...

The pipes are delimiters, which I replaced with commas (tr '|' ','). The first field is a date-stamp expressed in seconds since the Unix epoch; these can be converted to more readable dates like so:

$ date --date '@1308320681.44771'
Fri Jun 17 10:24:41 EDT 2011
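The same conversion can be done programmatically; a minimal Python sketch (not part of my original shell pipeline), using the timestamp from above:

```python
from datetime import datetime, timezone

def epoch_to_date(ts: float) -> str:
    """Convert a Unix-epoch timestamp (seconds) to a readable UTC date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(epoch_to_date(1308320681.44771))  # 2011-06-17 14:24:41 UTC (= Fri Jun 17 10:24:41 EDT 2011)
```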

I went through the 2870 lines until I found the first treadmill session I did, on June 16. After splitting the file there, deleting the date-stamps, and adding a CSV header (WPM,Accuracy), I had 2285 entries in 2012-gwern-amphetype-before.csv and 585 in 2012-gwern-amphetype-after.csv. Then it is easy to load the CSVs into R and test:

before <- read.csv("http://www.gwern.net/docs/2012-gwern-amphetype-before.csv")
before$Treadmill <- 0
after <- read.csv("http://www.gwern.net/docs/2012-gwern-amphetype-after.csv")
after$Treadmill <- 1
amphetype <- rbind(before,after)
l <- lm(cbind(WPM, Accuracy) ~ Treadmill, data=amphetype)

summary(manova(l))
Df Pillai approx F num Df den Df Pr(>F)
Treadmill    1 0.0556     84.4      2   2867 <2e-16

summary(l)
Response WPM :

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   82.343      0.195   422.2   <2e-16
Treadmill      5.216      0.432    12.1   <2e-16

Response Accuracy :

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.987517   0.000170 5813.22  < 2e-16
Treadmill   0.001610   0.000376    4.28  1.9e-05
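Since Treadmill is a 0/1 indicator, the regression is just comparing group means: the intercept is the mean of the before data, and the Treadmill coefficient is the after-minus-before difference. A toy pure-Python check of that identity (illustrative numbers, not the real dataset):

```python
def ols_binary(x, y):
    """Closed-form OLS fit of y ~ x for a binary predictor x in {0, 1}."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope  # (intercept, slope)

y0 = [80.0, 82.0, 84.0]           # toy "before" WPM scores
y1 = [86.0, 88.0, 90.0]           # toy "after" WPM scores
x = [0] * len(y0) + [1] * len(y1)
intercept, slope = ols_binary(x, y0 + y1)
print(intercept, slope)  # intercept = 82.0 = mean(before); slope = 6.0 = mean(after) - mean(before)
```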

What? Using a treadmill made my average WPM go up 5 WPM? And my average accuracy increased by 0.16%? And both are highly statistically-significant (not a surprise, given how many entries there were)? What’s going on - this is the exact opposite of what was expected! The key is the low mean of the before data: I type much faster than 82 WPM now, more like 90 or 100 WPM. What happened was that I spent 3 years practicing. Given that I was improving the whole time, it is wrong to compare the recent treadmill typing data against a low long-run average without any consideration of this upward trend in WPM. A fairer comparison with after would lop off the first half of the before data, since I began to plateau around then. Redoing the tests:

secondHalf <- amphetype[(nrow(amphetype)/2):nrow(amphetype),]
l2 <- lm(cbind(WPM, Accuracy) ~ Treadmill, data=secondHalf)
summary(l2)

Response WPM :

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   85.826      0.315  272.13  < 2e-16
Treadmill      1.733      0.494    3.51  0.00047

Response Accuracy :

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.988951   0.000259 3820.00   <2e-16
Treadmill   0.000176   0.000406    0.43     0.66

This is more reasonable: only a 2 WPM gain from the treadmill. 2 WPM could be explicable as just a placebo effect: me wanting to justify the time I’ve sunk into the treadmill and typing practice every day. It’s still a little surprising, but the result initially seems more solid. (If we drop every score before 2000 instead of 1144, the difference continues to shrink but still favors the treadmill. We have to go to scores 2100-2285 before the treadmill starts to lose, but with 2200-2285 the treadmill wins!) Accuracy seems largely unaffected. Better yet, we can model the linear progress of my WPM over time and test for a treadmill effect that way:

amphetype$Nth <- 1:nrow(amphetype)
summary(lm(cbind(WPM, Accuracy) ~ Nth + Treadmill, data=amphetype))

Response WPM :

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.06152    0.37071  207.88   <2e-16
Nth          0.00462    0.00028   16.49   <2e-16
Treadmill   -1.41533    0.57651   -2.45    0.014

Response Accuracy :

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.86e-01   3.35e-04 2938.81  < 2e-16
Nth          1.63e-06   2.54e-07    6.44  1.4e-10
Treadmill   -7.34e-04   5.22e-04   -1.41     0.16

This is more as expected: walking on the treadmill cost me about 1.4 WPM in typing speed, while each additional practice session correlates with +0.0046 WPM (so a month of daily sessions would be worth ~0.14 WPM). Having reached diminishing returns, I decided to stop typing practice.

# Treadmill effect on spaced repetition performance: randomized experiment

It has been claimed that doing spaced repetition review while on a walking treadmill improves memory performance. I ran a randomized experiment from August 2013 to May 2014 and found that using a treadmill damaged my recall performance.

## Background

Starting in 2010, Seth Roberts claimed that he found his Anki flashcard reviews (for spaced repetition) to be easier & better when he did them while using his treadmill, and offers some just-so evolutionary-psychology theorizing that walking may cue knowledge absorption via a “thirst for knowledge”. He doesn’t offer any hard data, but he does quote some data from a 2012 presentation by Jeremy Howard, who claims a 5% review error-rate while walking versus 8% while not walking, and to be “40% faster [at learning]”; a near-halving of lower grades is certainly an effect to be reckoned with and well worthwhile.

An effect strikes me as plausible: flashcard review does not require fine motor skills or (too) difficult thinking, and the walking might well wake one up if nothing else. And it would be convenient if it were true, since spaced repetition on one’s treadmill would kill two birds with one stone.
But on the other hand, the walking might be a distraction from the work of recall and damage real performance, much like how many students claim that playing music while studying “helps them focus”, which is dubious (eg Perham & Sykora 2012 found music damaged memory recall, and music the listener enjoyed was the worst). Consistent with this, my own experience with treadmills was that they impeded concentration. And I couldn’t help but notice Roberts’s failure to present hard data: since Anki (like almost all spaced-repetition software) records detailed statistics about flashcard reviews in order to implement the scheduling algorithm, he had access to the data needed to show objective performance measurements, such as whether days on the treadmill increase the average flashcard grade; all he had to do was record his treadmill use and then extract the data, which wouldn’t take too long to show “a big effect” (a month or two would likely be enough). But as far as I know, he never made any use of his Anki data.

Having acquired a treadmill, and being a long-time user of Mnemosyne, I found the claim eminently testable! I simply randomize whether I do my daily Mnemosyne review before or after getting on the treadmill. (Unfortunately, I can think of no way to blind treadmill use, so randomization is it.)

One concern, prompted by the 2013 Lewis meditation results, is that there may be time-of-day effects on flashcard review; I tend not to use the treadmill in the morning (I am not a morning person), so if recall improved in the afternoon, it might be conflated with the treadmill. I downloaded the 4GB public Mnemosyne dataset (every Mnemosyne user is offered the option to anonymously submit statistical data about their flashcards) to try to analyze it and estimate fixed effects of time. The full dataset showed many such effects, so time variables will be included in the analysis.
## Method

Each day I decided to do spaced repetition, I randomly flipped a bit (50-50) in Bash to determine whether I would review seated or on my treadmill (which is set to 1mph), and after the review recorded whether that day was treadmill-affected. This ran from August 2013 to May 2014, when I noticed that the experiment had become a trivial inconvenience damaging my hard-earned spaced repetition habit, and ended it. I didn’t do a formal power analysis, but my intuition was that this would be enough data to show an effect, especially one as large as claimed.

The endpoint is the grades given to flashcards each day; the analysis will be a multilevel ordinal logistic regression.

## Data

Extract the raw data from my Mnemosyne database:

$ sqlite3 -batch ~/.local/share/mnemosyne/default.db \
"SELECT timestamp,easiness,grade FROM log WHERE event_type==9;" | \
tr "|" "," \
> gwern-mnemosyne.csv
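(As an aside, the daily coin-flip described in the Method section can be done with Bash’s $RANDOM builtin; a minimal sketch of one way to do it:)

```shell
# Flip a fair bit: prints 1 (treadmill) or 0 (seated).
# $RANDOM yields 0-32767; 32768 is even, so modulo 2 is exactly 50-50.
echo $(( RANDOM % 2 ))
```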

Processing:

## read into R
mnemosyne <- read.csv("gwern-mnemosyne.csv", header=FALSE,
                      col.names =c("Timestamp", "Easiness", "Grade"),
                      colClasses=c("integer",   "numeric",  "integer"))
mnemosyne$Timestamp <- as.POSIXct(mnemosyne$Timestamp, origin = "1970-01-01", tz = "EST")

## extract the temporal covariates from the timestamp
mnemosyne$WeekDay <- as.factor(weekdays(mnemosyne$Timestamp))
mnemosyne$Hour <- as.factor(as.numeric(format(mnemosyne$Timestamp, "%H")))
mnemosyne$Date <- as.Date(mnemosyne$Timestamp)

## select data from during the experiment
mnemosyneFormatted <- with(mnemosyne, data.frame(Timestamp=Timestamp, Date=Date, WeekDay=WeekDay,
                                                 Hour=Hour, Easiness=Easiness, Grade=Grade))
treadmill <- mnemosyneFormatted[mnemosyneFormatted$Date > as.Date("2013-08-22") & mnemosyneFormatted$Date < as.Date("2014-06-01"),]

## code which days' review was done on the treadmill
treadmill$Treadmill <- FALSE
treadmillDates <- as.Date(c("2013-08-25", "2013-08-26", "2013-08-28", "2013-09-14", "2013-09-27",
                            "2013-10-14", "2013-11-09", "2013-11-10", "2013-11-14", "2013-11-29",
                            "2013-12-05", "2013-12-07", "2014-01-29", "2014-02-10", "2014-02-15",
                            "2014-02-25", "2014-02-28", "2014-03-04", "2014-03-05", "2014-03-07",
                            "2014-03-09", "2014-03-19", "2014-03-19", "2014-03-24", "2014-03-25",
                            "2014-03-26", "2014-04-03", "2014-04-22", "2014-05-01", "2014-05-05",
                            "2014-05-06", "2014-05-28", "2014-05-29", "2014-05-31"))
for (i in 1:length(treadmillDates)) { treadmill[treadmill$Date==treadmillDates[i],]$Treadmill <- TRUE; }

## serialize clean CSV for analysis
write.csv(treadmill, "2014-05-31-mnemosyne-treadmill.csv", row.names=FALSE)

## Analysis

### Exploratory

treadmill <- read.csv("http://www.gwern.net/docs/spacedrepetition/2014-05-31-mnemosyne-treadmill.csv")
summary(treadmill)
#                Timestamp            Date         WeekDay          Hour         Easiness
#  2013-11-26 19:24:44:   2   2013-09-25: 254   Friday   : 577   Min.   : 9.0   Min.   :1.30
#  2013-11-26 22:22:12:   2   2014-02-10: 171   Monday   : 711   1st Qu.:15.0   1st Qu.:1.44
#  2013-12-01 18:21:49:   2   2014-02-28: 163   Saturday : 856   Median :17.0   Median :1.93
#  2013-12-01 18:22:04:   2   2013-11-09: 162   Sunday   : 869   Mean   :17.2   Mean   :1.87
#  2013-08-23 22:56:28:   1   2013-11-14: 155   Thursday :1034   3rd Qu.:20.0   3rd Qu.:2.16
#  2013-08-23 22:56:36:   1   2014-04-22: 145   Tuesday  :1021   Max.   :23.0   Max.   :3.00
#  (Other)            :5843   (Other)   :4803   Wednesday: 785
#      Grade       Treadmill
#  Min.   :2.00   Mode :logical
#  1st Qu.:4.00   FALSE:2695
#  Median :4.00   TRUE :3158
#  Mean   :3.78   NA's :0
#  3rd Qu.:4.00
#  Max.   :5.00

## graphing all 5853 reviews is unreadable, so summarize by day & throw out outliers
daily <- aggregate(Grade ~ Date + Treadmill, treadmill, mean)
daily <- daily[order(daily$Date),]
daily <- daily[daily$Grade>=3 & daily$Grade<=4,]
qplot(Date, Grade, color=Treadmill, size=I(5), data=daily)

### Tests

Because there are only 4 possible responses in the dataset (2/3/4/5) and they don’t look like a normal distribution (even with n=5853), my analysis preference is for an ordinal logistic regression, which captures that structure. Reviews are grouped by day, so I want a multilevel ordinal logistic regression to reflect that nesting. And because my earlier analysis of the ~50 million-response Mnemosyne dataset confirmed that there are meaningful hour-of-day and day-of-week effects, I’ll want to include those as covariates. (I was originally going to include card ID as a random-effects variable to reflect the easiness of each card and help reduce the unpredictability of grades; but no card had been reviewed more than 7 times during the experiment, so the possible gain was limited, and when an analysis with card IDs as a variable had run for >2 hours without finishing, I decided to simply use Mnemosyne’s internal estimate of “easiness”.) I’ll also check with a U-test that any effect isn’t being completely driven by the covariates.

The best-fitting such model confirms that there is an effect: it’s negative. The proportional-odds effect on grades is -1.381 (95% CI: -2.086 to -0.6755; p=0.00012) or, in terms of a multilevel linear model, a mean grade lower by ~0.08 (95% CI: -0.147 to -0.020).

wilcox.test(Grade ~ Treadmill, conf.int=TRUE, data=treadmill)
#
#     Wilcoxon rank sum test with continuity correction
#
# W = 4363353, p-value = 0.01656
# alternative hypothesis: true location shift is not equal to 0
# 95% confidence interval:
#  -2.355e-05  5.387e-05
# sample estimates:
# difference in location
#              4.155e-05
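For reference, the rank-sum statistic behind wilcox.test can be computed from scratch; a pure-Python sketch using the normal approximation (no tie or continuity correction, so its p-value will differ slightly from R’s):

```python
import math
from statistics import NormalDist

def mann_whitney_u(xs, ys):
    """Mann-Whitney U with average ranks for ties, plus a two-sided
    normal-approximation p-value (no tie/continuity correction)."""
    combined = sorted((v, g) for g, vals in ((0, xs), (1, ys)) for v in vals)
    n = len(combined)
    rank_sum_x = 0.0
    i = 0
    while i < n:                       # walk over runs of tied values
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2     # average of ranks i+1 .. j
        for k in range(i, j):
            if combined[k][1] == 0:    # value came from the first sample
                rank_sum_x += avg_rank
        i = j
    n1, n2 = len(xs), len(ys)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - NormalDist().cdf(abs((u - mu) / sigma)))
    return u, p

u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u)  # 0.0: every value in the first sample is below every value in the second
```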

library(ordinal)
c1  <- clmm(ordered(Grade) ~ Treadmill + Easiness + (1|Date) + (1|WeekDay) + (1|Hour), data=treadmill)
c2  <- clmm(ordered(Grade) ~ Treadmill + Easiness + (1|Date) + (1|WeekDay)           , data=treadmill)
c3  <- clmm(ordered(Grade) ~ Treadmill + Easiness +            (1|WeekDay) + (1|Hour), data=treadmill)
c4  <- clmm(ordered(Grade) ~ Treadmill + Easiness + (1|Date)               + (1|Hour), data=treadmill)
c5  <- clmm(ordered(Grade) ~ Treadmill + Easiness + (1|Date)                         , data=treadmill)
c6  <- clmm(ordered(Grade) ~ Treadmill + Easiness +            (1|WeekDay)           , data=treadmill)
c7  <- clmm(ordered(Grade) ~ Treadmill + Easiness +                          (1|Hour), data=treadmill)
c11 <- clm(ordered(Grade)  ~ 1                                                       , data=treadmill)
anova(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11)
# ...
#     no.par  AIC logLik LR.stat df Pr(>Chisq)
# c11      3 8215  -4104
# c9       4 8211  -4101    5.77  1      0.016
# c10      4 8211  -4101    0.00  0
# c8       5 6959  -3475 1253.46  1     <2e-16
# c5       6 6842  -3415  119.16  1     <2e-16
# c6       6 6951  -3470 -108.79  0
# c7       6 6946  -3467    5.16  0
# c2       7 6842  -3414  106.17  1     <2e-16
# c3       7 6939  -3462  -96.90  0
# c4       7 6833  -3409  106.03  0
# c1       8 6834  -3409    0.32  1      0.570
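As a sanity check on the anova table, AIC is simply 2k − 2·logLik; the printed AIC column matches the rounded logLik column to within rounding error:

```python
def aic(n_params: int, log_lik: float) -> float:
    """Akaike information criterion: 2k - 2*logLik."""
    return 2 * n_params - 2 * log_lik

# (no.par, logLik, reported AIC) rows from the anova table above;
# logLik is rounded in the printout, so allow a small tolerance.
rows = [(3, -4104, 8215), (4, -4101, 8211), (5, -3475, 6959),
        (6, -3415, 6842), (8, -3409, 6834)]
for k, ll, reported in rows:
    assert abs(aic(k, ll) - reported) <= 2
print("AIC column consistent with no.par & logLik up to rounding")
```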

summary(c4)
# ...
# Random effects:
#  Groups Name        Variance Std.Dev.
#  Date   (Intercept) 2.008    1.417
#  Hour   (Intercept) 0.179    0.423
# Number of groups:  Date 97,  Hour 15
#
# Coefficients:
#               Estimate Std. Error z value Pr(>|z|)
# TreadmillTRUE   -1.381      0.360   -3.84  0.00012
# Easiness         3.365      0.121   27.73  < 2e-16
#
# Threshold coefficients:
#     Estimate Std. Error z value
# 2|3    1.751      0.338    5.19
# 3|4    2.842      0.338    8.41
# 4|5    9.814      0.388   25.28
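To make the −1.381 log-odds coefficient concrete: in a proportional-odds model, P(Grade ≤ k) = σ(θ_k − η), where η is the linear predictor and σ the logistic function. A Python sketch plugging in the fitted thresholds and coefficients above at the mean Easiness of 1.87 (from the earlier summary), and, as a simplification, ignoring the random effects:

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# Fitted values from summary(c4) above
thresholds = {"2|3": 1.751, "3|4": 2.842, "4|5": 9.814}
b_treadmill, b_easiness = -1.381, 3.365
mean_easiness = 1.87  # mean Easiness from summary(treadmill)

def p_low_grade(treadmill: bool) -> float:
    """P(Grade <= 3), i.e. a poor review, at mean easiness, ignoring random effects."""
    eta = b_easiness * mean_easiness + (b_treadmill if treadmill else 0.0)
    return sigmoid(thresholds["3|4"] - eta)

print(round(p_low_grade(False), 3))  # 0.031: ~3% poor reviews seated
print(round(p_low_grade(True), 3))   # 0.112: ~11% poor reviews on the treadmill
```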

## easier to interpret a linear model: how much does average grade fall on treadmill?
library(lme4)
l4 <- lmer(Grade ~ Treadmill + Easiness + (1|Date) + (1|Hour), data=treadmill); summary(l4)
# ...
# Fixed effects:
#               Estimate Std. Error t value
# (Intercept)     2.7003     0.0422    64.0
# TreadmillTRUE  -0.0805     0.0296    -2.7
# Easiness        0.6085     0.0180    33.8

confint(c4)
#                2.5 %  97.5 %
# 2|3            1.089  2.4124
# 3|4            2.180  3.5047
# 4|5            9.053 10.5749
# TreadmillTRUE -2.086 -0.6755
# Easiness       3.127  3.6032

confint(l4)
# Computing profile confidence intervals ...
#                  2.5 %   97.5 %
# .sig01         0.06273  0.14300
# .sig02         0.01809  0.08734
# .sigma         0.54550  0.56598
# (Intercept)    2.61418  2.78532
# TreadmillTRUE -0.14732 -0.02029
# Easiness       0.57313  0.64363

## Conclusion

While the result seems highly likely to be true for me, I don’t know how well it might generalize to other people. For example, perhaps more fit people can use a treadmill without harm and the negative effect is due to the treadmill usage tiring & distracting me; I try to walk 2 miles a day, but that’s not much compared to some people.

Given this harmful impact, I will avoid doing spaced repetition on my treadmill in the future, and given this & the typing result, will relegate any computer+treadmill usage to non-intellectually-demanding work like watching movies.