2013 Lewis meditation results

Multilevel modeling of effect of small group's meditation on math errors
topics: psychology, experiments, statistics, R
created: 12 July 2013; modified: 27 Mar 2019; status: in progress; confidence: unlikely; importance: 4


A small group of Quantified Selfers tested themselves daily on arithmetic and engaged in a month of meditation. I analyze their scores with a multilevel model with per-subject grouping, and find the expected result: a small decrease in arithmetic errors which is not statistically-significant, with practice & time-of-day effects (but no day-of-week or weekend effects). This suggests a longer experiment with twice as many participants would be needed to detect the effect.

Quantified Self hobbyist Peter Lewis in April 2013 began a self-experiment with 11 other hobbyists (with the results posted on Seth Roberts’s blog): they used a cellphone app “Math Workout” to do timed simple arithmetic problems (eg. “3.7 + 7.3, 93 + 18, 14 * 7, and 12² + √9”) in an ABA experimental design, where B was practicing mindfulness meditation. This was motivated in part by the observation that, while there is a ton of scientific research on meditation, the studies are often of doubtful quality (he cited a remarkable 2007 collection of meditation meta-analyses).

Lewis thought the results were positive inasmuch as his timings & error rate fell during the meditation period and the falls seemed visually clear when graphed, but some of the participants dropped out and others had graphs that were not so positive.

The interpretation of the data from a statistical point of view was not terribly clear. Lewis & Roberts did not do the analysis. Reading the post, I thought that it was a perfect dataset to try out multilevel models on, as this sort of psychology setup (multiple groups of data per subject, and multiple subjects) is a standard use-case.

Data

The original data was provided by Lewis in an Excel spreadsheet, but formatted in a way I couldn’t make use of. I extracted the data, turned it into a ‘long’ row-by-row format, added IDs for each subject, transformed the dates into R format, and added a variable indicating whether each score came from a meditation day. The result was a CSV which can be loaded into R:
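Loading it back in might look something like this (a sketch: the filename is hypothetical, and the exact column handling depends on how the CSV was written):

mdtt <- read.csv("2013-lewis-meditation.csv", stringsAsFactors=TRUE) # hypothetical filename
mdtt$Date      <- as.Date(mdtt$Date)         # convert the dates to R's Date class
mdtt$Meditated <- as.logical(mdtt$Meditated) # TRUE if the score came from a meditation day
summary(mdtt)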

The TestDuration variable jumps out as bizarre: it goes from 143 seconds to 2185.7 seconds? A closer look shows that the last two entries are 368.42 and then 2185.66; this is an extreme outlier, so I deleted it. The errors also jump from 4 to 25, and a closer look shows a similar jump from 16 to the maximum value 25, which would distort plots later on, so I chose to make 16 the ceiling. We also don’t want the dummy rows which indicate a meditation session, since they’re redundant once the Meditated variable has been created.
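In R, that cleaning might be done along these lines (a sketch; how the dummy meditation-session rows are marked in the CSV is an assumption on my part):

mdtt <- mdtt[!is.na(mdtt$ErrorCount), ]      # drop the dummy rows marking meditation sessions (assumed to have no scores)
mdtt <- mdtt[mdtt$TestDuration < 2000, ]     # delete the bizarre 2185.66s TestDuration outlier
mdtt$ErrorCount <- pmin(mdtt$ErrorCount, 16) # cap error counts at 16 so the outliers don't distort plots

Now that the data is cleaned up and coded, we can look at it. For example, all error scores colored by subject: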

Analysis

Descriptive

We have two response variables, number of errors & how long it took to finish the set of 50 arithmetic problems:

library(ggplot2)
qplot(Date, jitter(ErrorCount, factor=1.5), color=Subject, data=mdtt, ylab="Number of errors out of 50 problems")
qplot(Date, TestDuration, color=Subject, data=mdtt, ylab="Time taken to answer 50 problems (lower=better)")

We can clearly see a practice effect in both test duration & number of errors, where subjects are gradually getting better over time. We will need to remember to control for that.

Splitting by subject instead and looking at small multiples for any ABA patterns (we delete 3 of the attrited subjects to free up vertical space):
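Something like the following would produce those small multiples (a sketch: attrited is a hypothetical vector naming the 3 dropped subjects):

attrited <- c("S4", "S7", "S11") # hypothetical IDs of the 3 subjects dropped for lack of data
qplot(Date, ErrorCount,   color=Meditated, data=subset(mdtt, !(Subject %in% attrited))) + facet_grid(Subject ~ .)
qplot(Date, TestDuration, color=Meditated, data=subset(mdtt, !(Subject %in% attrited))) + facet_grid(Subject ~ .)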

Errors in each phase of ABA, stratified by subject
Test time required in each phase, by subject

Overall, these are looking equivocal. There might be something going on, there might not be - ErrorCount looks better than TestDuration, but in both cases, the clear practice effect makes comparison hard.

Goals

Roberts comments:

Another way to improve this experiment would be to do statistical tests that generate p values; this would give a better indication of the strength of the evidence. Because this experiment didn’t reach steady states, the best tests are complicated (e.g., comparison of slopes of fitted lines). With steady-state data, these tests are simple (e.g., comparison of means).

If you are sophisticated at statistics, you could look for a time-of-day effect (are tests later in the day faster?), a day-of-week effect, and so on. If these effects exist, their removal would make the experiment more sensitive. In my brain-function experiments, I use a small number of problems so that I can adjust for problem difficulty. That isn’t possible here.

p-values are standard in psychology, of course, despite their conceptual & practical problems which help contribute to systematic problems in science, so we’ll calculate one.

But we know in advance what the result will be: greater than 0.05, or non-statistically-significant. A p-value is mostly a function of how much data you have and how large the effect size is - we know a priori that this is a small experiment, and we also know that the effect of meditation on arithmetic scores cannot be huge (otherwise we would already have heard of the groundbreaking research & society would make kids meditate before standardized exams); on top of that, the subjects didn’t do much meditation & were inexperienced at it. A small experiment on a small effect is guaranteed not to turn in a statistically-significant result, even when there genuinely is an effect, because the experiment doesn’t have much statistical power. In fact, a decrease in errors large enough to reach a statistical-significance of p<0.05 would be so bizarre & unexpected that it would be grounds for throwing out the data entirely, as someone must be lying or have screwed something up or something must have gone terribly wrong somewhere.

So the best we can hope for is to show that there was a benefit at all, rather than a penalty - it’s not clear, visually, that we’ll see better scores once we take into account a practice effect of time. Having shown this, then we can ask additional questions like, “how large an experiment would we need to reach statistical-significance, anyway?”

Multilevel models

So, we’ll set up a few models in lme4 and compare them. The first model is the simplest: each subject has a different base level of accuracy (a per-subject intercept), but all subjects are influenced the same way by meditation (meditation is a fixed effect):
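Concretely, the fit looks something like this (a sketch - the model names are my own bookkeeping, though mlm3 below matches the name used later):

library(lme4)
mlm1 <- lmer(ErrorCount ~ Meditated + (1|Subject), data=mdtt) # random intercept per subject, fixed meditation effect
summary(mlm1)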

(This is with a normal distribution family; using a Poisson distribution with family=poisson would give a more accurate model, but I am not yet comfortable using or interpreting Poissons.) We get a fixed effect of -0.2, or 1/5 of an error less on meditation days. Not a big effect. The next step is to imagine that, besides having differing natural levels of accuracy, subjects may also be affected more or less strongly by meditation:
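That is, we let the meditation coefficient vary by subject as well (again a sketch):

mlm2 <- lmer(ErrorCount ~ Meditated + (Meditated|Subject), data=mdtt) # per-subject intercept & per-subject meditation slope
summary(mlm2)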

In this view of things, there’s a weaker effect of meditation: 1/7th of an error less (rather than 1/5th).

But of course, what about the practice effect? The date should matter! Let’s re-run those two models with the date:
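Reusing the two names for the Date-adjusted versions (my own bookkeeping, not necessarily the original’s):

mlm1 <- lmer(ErrorCount ~ Meditated + Date + (1|Subject),         data=mdtt)
mlm2 <- lmer(ErrorCount ~ Meditated + Date + (Meditated|Subject), data=mdtt)
summary(mlm1); summary(mlm2)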

Both agree that there is definitely a practice effect, it’s probably statistically-significant with such large t-values, and the effect size is cumulatively large - at -0.045 errors per day, we’d expect our error rate to decrease by 1.4 errors a month - which is a lot when we remember that the total average errors per session was 3.1!

Time of day

So, that’s the practice effect verified. What else? Roberts suggested “time-of-day effect (are tests later in the day faster?)” and “day-of-week effect”. First we’ll convert the times to hours and regress on that:
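For example (a sketch: I assume a time-of-day column in the CSV, here hypothetically called Time and formatted like "07:35"):

mdtt$Hours <- as.integer(substr(mdtt$Time, 1, 2)) + as.integer(substr(mdtt$Time, 4, 5))/60 # fractional hour of day
mlm3 <- lmer(ErrorCount ~ Meditated + Date + Hours + (1|Subject), data=mdtt)
summary(mlm3)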

To my surprise, there really is a time-of-day effect. (I expected that there wouldn’t be enough data to see one, or that it would be a complex squiggle corresponding to circadian rhythms.) And it’s a simple linear one, too, which jumps out when one plots errors against hour:

qplot(Hours, ErrorCount, color=Meditated, data=mdtt)

The lesson here is that doing arithmetic at 7AM is not a good idea. (Would that American high schools could realize this…) But moving on: I really don’t think there’s enough data to get a day-of-week effect as Roberts suggests, so I’ll also do a simpler split on weekends.
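Both versions, sketched (the Weekend/Day coding and the mlm4/mlm5 names are mine):

mdtt$Day     <- weekdays(mdtt$Date)                  # day-of-week name
mdtt$Weekend <- mdtt$Day %in% c("Saturday","Sunday") # simpler weekend/weekday split
mlm4 <- lmer(ErrorCount ~ Meditated + Date + Hours + Weekend + (1|Subject), data=mdtt)
mlm5 <- lmer(ErrorCount ~ Meditated + Date + Hours + Day     + (1|Subject), data=mdtt)
summary(mlm4); summary(mlm5)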

Neither one is impressive. The model using weekends is completely useless, while the model using days looks like it’s overfitting the data.

Which of the 5 models is better, overall and keeping in mind that simplicity is good (we don’t want to overfit the data by throwing in a huge number of variables which might “explain” variation in error-scores)?
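One simple way to compare them is by information criteria, which penalize the extra parameters (a sketch; anova() refits the lmer models with maximum likelihood before comparing):

anova(mlm1, mlm2, mlm3, mlm4, mlm5) # AIC/BIC/log-likelihood comparison of all 5 models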

As we might have guessed, model 3 (Meditated + Date + Hours + (1|Subject)) wins because the practice effect & time-of-day effect matter but apparently not day-of-week in the small dataset we have. (In a larger dataset, then the fancier models might do better.)

Significance

So, in mlm3 the meditation coefficient was -0.1875. How reliable is this meditation coefficient? We’d like a p-value, but lme4 reports only a t-value, and a t-value of -1.106 is not easy to interpret on its own.
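One way to get a p-value & confidence interval (the original analysis may have used a different method, such as MCMC sampling):

library(lmerTest) # adds Satterthwaite-approximation p-values to lmer summaries
mlm3 <- lmer(ErrorCount ~ Meditated + Date + Hours + (1|Subject), data=mdtt)
summary(mlm3)$coefficients   # now includes degrees of freedom & p-values
confint(mlm3, method="Wald") # quick approximate 95% CIs for the fixed effects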

The practice & time-of-day effects are pretty statistically-significant, but we get a p-value of 0.29 for meditation at -0.15 (-0.54-0.15). To put this in perspective: the dataset isn’t a whole lot of data to begin with, and it is heavily weighted towards non-meditation data - not such an issue for the date or time-of-day effects, which get spread around and estimated evenly, but bad for the meditation estimate. Another check is to bootstrap our model from the data and see how the meditation estimate varies - does it vary like it did above?
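lme4’s bootMer() can do this (a sketch; the coefficient name "MeditatedTRUE" assumes Meditated is coded as a logical, and the number of simulations is my choice):

library(boot)
b <- bootMer(mlm3, FUN=function(m) fixef(m)["MeditatedTRUE"], nsim=2000) # resimulate & refit 2000 times
boot.ci(b, type="perc")                                                  # percentile bootstrap 95% CI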

Very similar: -0.16 (-0.51-0.17).

Power for a followup study?

Lewis viewed the collected data as a proto-study, and wants to do a followup experiment which will be better. Besides the obvious methodological improvements one could make like explicitly randomizing blocks of weeks & using better controls (some other mental activity or alternate form of meditation like yoga), how big should the experiment be so that a meditation effect could reach a p<0.05?

Taking the final effect estimate from my analysis of Lewis’s proto-study at face-value and assuming it’s the true effect: how many comparisons of meditation/no-meditation would we have to make to have an 80% chance of detecting the true effect at the 0.05 threshold, if we ignored the possibility that meditation worsened scores? Roughly estimating it, quite a few. A simple power calculation for the estimated effect of -0.187:
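A sketch of such a calculation (using the raw standard deviation of ErrorCount as the noise estimate is a simplifying assumption of mine):

power.t.test(delta=0.187, sd=sd(mdtt$ErrorCount), sig.level=0.05, power=0.80,
             type="two.sample", alternative="one.sided")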

(A t-test calculation because I am not yet sure how to do power analysis for an MLM, and this at least will get us in the neighborhood.)

Conclusion

If I’m interpreting the power right, this is not as bad as it looks since each of those n is one math-game, but still. If you have 10 participants and each does 4 games a day, then you’ll need each of your 10 participants to spend 64 days playing the game without meditation (2544 / (10*4) = 64) and another 50 days playing with the meditation. If you could get 20 participants then it’s a more reasonable 2 months total each.

But getting that many fully-compliant volunteers would be quite difficult, and one has to ask: why does one care? It’s not clear that this effect is worth caring about for several reasons:

  1. the value of meditation is over-determined by all the existing research, which in aggregate covers probably hundreds of thousands of subjects. 10 or 30 QSers’ data should not, for a rational person, override or noticeably affect one’s beliefs about the value of meditation. If the existing research says meditation is worthwhile, that is far more trustworthy than this; if it says meditation is not worthwhile, likewise.

  2. the methodology here means that the small effect is not a causal effect of meditation alone, but an unknown melange of countless factors.

    What the small decrease means or is caused by, I am agnostic about. I know a lot of people get annoyed when someone brings up methodological issues like expectancy or placebo effects, but with my own blinded self-experiments with nootropics I have felt first-hand placebo effects, and I have been deeply impressed by the results of my dual n-back meta-analysis where simply switching from a passive to an active control group changes the IQ gain from 9.3 IQ points to 2.5.

    For me, things like that are the reality behind most self-experiments - and certainly behind one with self-selected participants, no randomization or blinding, and attrition, whose small effect is easily explicable by subjects simply trying a little bit harder.

  3. the metric of arithmetic scores has no clear relation to anything of importance

    No one actually cares about simple arithmetic scores. Many results in brain training studies follow a pattern where the training successfully increases scores on some psychological task like digit span, but when researchers look for “far transfer” to things like school grades, no effect can be found. Even if all the other issues could be dealt with and mindfulness meditation did improve arithmetic, what is it a valid proxy for?

See also