I meta-analyze the >19 studies which measure IQ after an n-back intervention, and find a net gain over all studies of medium effect size.

This IQ gain correlates with the methodological concern of whether a study used active or passive control groups. This indicates n-back training does not increase underlying intelligence, and that the gains are due to the motivational effect of passive control groups (who did not train on anything) not trying as hard as the n-back-trained experimental groups on the post-tests. The remaining studies using active control groups find a small effect (this may be due to matrix-test-specific training, undetected publication bias, smaller motivational effects, etc.)

I also investigate several other n-back claims, criticisms, and indicators of bias, finding:

Dual n-back is a working memory exercise which stresses holding several items in memory and quickly updating them; the study Jaeggi et al 2008 found that training dual n-back increases scores on an IQ test for healthy young adults. If this result were true and influenced underlying intelligence (with its many correlates such as higher income or educational achievement), it would be an unprecedented result of inestimable social value and practical impact, and so is worth investigating in detail. In my DNB FAQ, I discuss a list of post-2008 experiments investigating how much and whether practicing dual n-back can increase IQ; they conflict heavily, with some finding large gains and others finding gains which are not statistically-significant or no gain at all.

What is one to make of these studies? When one has multiple quantitative studies going in both directions, one resorts to a meta-analysis: we pool the studies with their various sample sizes and effect sizes and get some overall answer - do a bunch of small positive studies outweigh a few big negative ones? Or vice versa? Or any mix thereof? Unfortunately, no one has done one for n-back & IQ already; the existing meta-analysis, “Is Working Memory Training Effective? A Meta-Analytic Review” (Melby-Lervåg & Hulme 2013), covers working memory training in general. To summarize:

However, a recent meta-analysis by Melby-Lervåg and Hulme (in press) indicates that even when considering published studies, few appropriately-powered empirical studies have found evidence for transfer from various WM training programs to fluid intelligence. Melby-Lervåg and Hulme reported that WM training showed evidence of transfer to verbal and spatial WM tasks (d = .79 and .52, respectively). When examining the effect of WM training on transfer to nonverbal abilities tests in 22 comparisons across 20 studies, they found an effect of d = .19. Critically, a moderator analysis showed that there was no effect (d = .00) in the 10 comparisons that used a treated control group, and there was a medium effect (d = .38) in the 12 comparisons that used an untreated control group.

I’m not as interested in near WM transfer from n-back training (as the Melby-Lervåg & Hulme 2013 meta-analysis confirms, it surely occurs) but in the transfer with many more ramifications: transfer to IQ as measured by a matrix test. So I decided to do a meta-analysis of my own.

For background on conducting meta-analyses, I am using chapter 9 of part 2 of the Cochrane Collaboration’s Cochrane Handbook for Systematic Reviews of Interventions. For the actual statistical analysis, I am using the metafor package for the R language.

# Data

The candidate studies:

Variables:

1. active moderator variable: whether a control group was no-contact (coded 0) or trained on some other task (coded 1).
2. IQ type:

    0. BOMAT
    1. Raven’s Advanced Progressive Matrices (RAPM)
    2. Raven’s Standard Progressive Matrices (SPM)
    3. other (eg. WAIS, Cattell’s Culture Fair Intelligence Test/CFIT, TONI)
3. record speed of IQ test: minutes allotted (upper bound if more details are given; if no time limits, default to 30 minutes since no subjects take longer)
4. n-back type:

    0. dual n-back (audio & visual modalities)
    1. single n-back (visual modality)
    2. single n-back (audio modality)
5. paid: expected value of total payment in dollars, converted if necessary; if a paper does not mention payment or compensation, I assume 0 (likewise subjects receiving course credit or extra credit - so common in psychology studies that there must not be any effect), and if the rewards are of real but small value (eg. “For each correct response, participants earned points that they could cash in for token prizes such as pencils or stickers.”), I code as 1.

## Table

The data from the surviving studies:

year study n.e mean.e sd.e n.c mean.c sd.c active training IQ speed nbt paid
2008 Jaeggi1.8 4 14 2.928 8 12.13 2.588 0 200 0 10 0 0
2008 Jaeggi1.8 4 14 2.928 7 12.86 1.46 1 200 0 10 0 0
2008 Jaeggi1.12 11 9.55 1.968 11 8.73 3.409 0 300 1 10 0 0
2008 Jaeggi1.17 8 10.25 2.188 8 8 1.604 0 425 1 10 0 0
2008 Jaeggi1.19 7 14.71 3.546 8 13.88 3.643 0 475 1 20 0 0
2009 Qiu 9 132.1 3.2 10 130 5.3 0 250 2 25 0 0
2009 polar 13 32.76 1.83 8 25.12 9.37 0 360 0 30 0 0
2010 Jaeggi2.1 21 13.67 3.17 21.5 11.44 2.58 0 370 0 16 1 91
2010 Jaeggi2.2 25 12.28 3.09 21.5 11.44 2.58 0 370 0 16 0 20
2010 Stephenson.1 14 17.54 0.76 9.3 15.50 0.99 1 400 1 10 0 0.44
2010 Stephenson.2 14 17.54 0.76 8.6 14.08 0.65 0 400 1 10 0 0.44
2010 Stephenson.3 14.5 15.34 0.90 9.3 15.50 0.99 1 400 1 10 1 0.44
2010 Stephenson.4 14.5 15.34 0.90 8.6 14.08 0.65 0 400 1 10 1 0.44
2010 Stephenson.5 12.5 15.32 0.83 9.3 15.50 0.99 1 400 1 10 2 0.44
2010 Stephenson.6 12.5 15.32 0.83 8.6 14.08 0.65 0 400 1 10 2 0.44
2011 Chooi.1.1 4.5 12.7 2 15 13.3 1.91 1 240 1 20 0 0
2011 Chooi.1.2 4.5 12.7 2 22 11.3 2.59 0 240 1 20 0 0
2011 Chooi.2.1 6.5 12.1 2.81 11 13.4 2.7 1 600 1 20 0 0
2011 Chooi.2.2 6.5 12.1 2.81 23 11.9 2.64 0 600 1 20 0 0
2011 Jaeggi3 32 16.94 4.75 30 16.2 5.1 1 287 2 10 1 1
2011 Kundu1 3 31 1.73 3 30.3 4.51 1 1000 1 40 0 0
2011 Schweizer 29 27.07 2.16 16 26.5 4.5 1 463 2 30 0 0
2011 Zhong.1.05d 17.6 21.38 1.71 8.8 21.85 2.6 0 125 1 30 0 0
2011 Zhong.1.05s 17.6 22.83 2.5 8.8 21.85 2.6 0 125 1 30 0 0
2011 Zhong.1.10d 17.6 22.21 2.3 8.8 21 1.94 0 250 1 30 0 0
2011 Zhong.1.10s 17.6 23.12 1.83 8.8 21 1.94 0 250 1 30 0 0
2011 Zhong.1.15d 17.6 24.12 1.83 8.8 23.78 1.48 0 375 1 30 0 0
2011 Zhong.1.15s 17.6 25.11 1.45 8.8 23.78 1.48 0 375 1 30 0 0
2011 Zhong.1.20d 17.6 23.06 1.48 8.8 23.38 1.56 0 500 1 30 0 0
2011 Zhong.1.20s 17.6 23.06 3.15 8.8 23.38 1.56 0 500 1 30 0 0
2011 Zhong.2.15s 18.5 6.89 0.99 18.5 5.15 2.01 0 375 1 30 0 0
2011 Zhong.2.19s 18.5 6.72 1.07 18.5 5.35 1.62 0 475 1 30 0 0
2012 Jaušovec 14 32.43 5.65 15 29.2 6.34 1 1800 1 8.3 0 0
2012 Kundu2 11 10.81 2.32 12 9.5 2.02 1 1000 1 10 0 0
2012 Redick.1 12 6.25 3.08 20 6 3 0 700 1 10 0 204.3
2012 Redick.2 12 6.25 3.08 29 6.24 3.34 1 700 1 10 0 204.3
2012 Rudebeck 27 9.52 2.03 28 7.75 2.53 0 400 0 10 0 0
2012 Salminen 13 13.7 2.2 9 10.9 4.3 0 319 1 20 0 55
2012 Takeuchi 41 31.9 0.4 20 31.2 0.9 0 270 1 30 0 0
2013 Clouter 18 30.84 4.11 18 28.83 2.68 1 400 3 12.5 0 115
2013 Colom 28 37.25 6.23 28 35.46 8.26 0 720 1 20 0 204
2013 Heinzel.1 15 24.53 2.9 15 23.07 2.34 0 540 2 7.5 1 129
2013 Heinzel.2 15 17 3.89 15 15.87 3.13 0 540 2 7.5 1 129
2013 Jaeggi.5 25 14.96 2.7 13.5 14.74 2.8 1 500 1 30 0 0
2013 Jaeggi.5 26 15.23 2.44 13.5 14.74 2.8 1 500 1 30 2 0
2013 Oelhafen 28 18.7 3.75 15 19.9 4.7 0 350 0 45 0 54
2013 Smith.1 5 11.5 2.99 9 11.9 1.58 0 340 1 10 0 3.9
2013 Smith.2 5 11.5 2.99 20 12.15 2.735 1 340 1 10 0 3.9
2013 Sprenger.1 34 9.76 3.68 18.5 9.95 3.42 1 410 1 10 1 100
2013 Sprenger.2 34 9.24 3.34 18.5 9.95 3.42 1 205 1 10 1 100
2013 Thompson.1 10 13.2 0.67 19 12.7 0.62 0 800 1 25 0 740
2013 Thompson.2 10 13.2 0.67 19 13.3 0.5 1 800 1 25 0 740
2013 Vartanian 17 11.18 2.53 17 10.41 2.24 1 60 1 10 1 0
2013 Savage 23 11.61 2.5 27 11.21 2.5 1 625 1 20 0 0
2013 Stepankova.1 20 20.25 3.77 12.5 17.04 5.02 0 250 3 30 1 29
2013 Stepankova.2 20 21.1 2.95 12.5 17.04 5.02 0 500 3 30 1 29
2013 Nussbaumer 29 13.69 2.54 27 11.89 2.24 1 450 1 30 0 0
2014 Burki 11 37.41 6.43 20 35.95 7.55 1 300 1 30 1 0
2014 Burki 11 37.41 6.43 21 36.86 6.55 0 300 1 30 1 0
2014 Burki 11 28.86 7.10 20 31.20 6.67 1 300 2 30 1 0
2014 Burki 11 28.86 7.10 23 27.61 6.82 0 300 2 30 1 0
2014 Pugin 14 40.29 2.30 15 41.33 1.97 0 600 3 30 1 1
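Each row of the table reduces to one standardized mean difference (SMD) and a sampling variance before pooling. A minimal sketch in Python of that conversion, assuming the usual bias-corrected SMD and its large-sample variance; the function name `smd` is hypothetical, and this is an illustration rather than the metafor call used for the analysis:

```python
import math

def smd(n_e, mean_e, sd_e, n_c, mean_c, sd_c):
    """Return (g, variance) for a bias-corrected standardized mean difference."""
    df = n_e + n_c - 2
    # Pooled standard deviation across the experimental and control groups.
    sd_pooled = math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_e - mean_c) / sd_pooled
    # Approximate small-sample correction factor.
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Large-sample approximation to the sampling variance of g.
    var = (n_e + n_c) / (n_e * n_c) + g**2 / (2 * (n_e + n_c))
    return g, var

# The 2012 Rudebeck row: n.e mean.e sd.e n.c mean.c sd.c
g, var = smd(27, 9.52, 2.03, 28, 7.75, 2.53)
print(f"g = {g:.2f}, SE = {math.sqrt(var):.2f}")
```

Because the formulas only use the group sizes arithmetically, the fractional sample sizes produced by splitting shared groups (eg. 21.5 or 4.5) pass through unchanged.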

# Analysis

The result of the meta-analysis:

Random-Effects Model (k = 62; tau^2 estimator: REML)

tau^2 (estimated amount of total heterogeneity): 0.1234 (SE = 0.0496)
tau (square root of estimated tau^2 value):      0.3512
I^2 (total heterogeneity / total variability):   45.86%
H^2 (total variability / sampling variability):  1.85

Test for Heterogeneity:
Q(df = 61) = 125.2211, p-val < .0001

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub
0.4011   0.0675   5.9414   <.0001   0.2688   0.5334

To depict the random-effects model in a more graphic form, we use the “forest plot”:

The overall effect is reasonably strong. But there seem to be substantial differences between studies: this heterogeneity may be what is showing up as a high τ² and I²; and indeed, if we look at the computed SMDs, we see one sample with d=2.59 (!) and some instances of d<0. The high heterogeneity means that the fixed-effects model is inappropriate, as clearly the studies are not all measuring the same effect, so we use a random-effects model.
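To make the fixed- versus random-effects distinction concrete, here is a minimal Python sketch of random-effects pooling. It swaps in the simpler DerSimonian-Laird estimator of τ² for the REML estimator reported above, so its numbers would differ somewhat from the metafor output; the function name and the toy inputs are hypothetical.

```python
def random_effects_dl(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator."""
    k = len(effects)
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to every study's sampling variance.
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = (1 / sum(w_re)) ** 0.5
    return {"estimate": pooled, "se": se, "tau2": tau2, "Q": q}

# Toy inputs (not from the table): three heterogeneous studies.
print(random_effects_dl([0.1, 0.6, 1.2], [0.05, 0.08, 0.10]))
```

When τ² is estimated as 0 the weights collapse to the fixed-effect weights; the larger τ² gets, the more evenly small and large studies are weighted.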

The confidence interval excludes zero, so one might conclude that n-back does increase IQ scores. From a Bayesian standpoint, it’s worth pointing out that this is not nearly as conclusive as it seems, for several reasons:

1. meta-analyses are generally believed to be biased towards larger effects due to systematic biases like publication bias
2. our prior that any particular intervention would increase the underlying genuine fluid intelligence is extremely small, as scores or hundreds of attempts to increase IQ over the past century have all eventually turned out to be failures, with few exceptions (eg pre-natal iodine or iron supplementation), so very strong evidence is necessary to conclude that a particular attempt is one of those extremely rare exceptions. As the saying goes, “extraordinary claims require extraordinary evidence”. (For further reading, see the statistics/methodology discussion in the DNB FAQ.) David Hambrick explains it informally:

…Yet I and many other intelligence researchers are skeptical of this research. Before anyone spends any more time and money looking for a quick and easy way to boost intelligence, it’s important to explain why we’re not sold on the idea…Does this [Jaeggi et al 2008] sound like an extraordinary claim? It should. There have been many attempts to demonstrate large, lasting gains in intelligence through educational interventions, with few successes. When gains in intelligence have been achieved, they have been modest and the result of many years of effort. For instance, in a University of North Carolina study known as the Abecedarian Early Intervention Project, children received an intensive educational intervention from infancy to age 5 designed to increase intelligence. In follow-up tests, these children showed an advantage of six I.Q. points over a control group (and as adults, they were four times more likely to graduate from college). By contrast, the increase implied by the findings of the Jaeggi study was six I.Q. points after only six hours of training - an I.Q. point an hour. Though the Jaeggi results are intriguing, many researchers have failed to demonstrate statistically significant gains in intelligence using other, similar cognitive training programs, like Cogmed’s… We shouldn’t be surprised if extraordinary claims of quick gains in intelligence turn out to be wrong. Most extraordinary claims are.

3. it’s not clear that just because IQ tests like Raven’s are valid and useful for measuring levels of intelligence, that an increase on the tests can be interpreted as an increase of intelligence. Haier 2014 analogizes claims of breakthrough IQ increases to the initial reports of cold fusion and comments:

The basic misunderstanding is assuming that intelligence test scores are units of measurement like inches or liters or grams. They are not. Inches, liters and grams are ratio scales where zero means zero and 100 units are twice 50 units. Intelligence test scores estimate a construct using interval scales and have meaning only relative to other people of the same age and sex. People with high scores generally do better on a broad range of mental ability tests, but someone with an IQ score of 130 is not 30% smarter then someone with an IQ score of 100…This makes simple interpretation of intelligence test score changes impossible. Most recent studies that have claimed increases in intelligence after a cognitive training intervention rely on comparing an intelligence test score before the intervention to a second score after the intervention. If there is an average change score increase for the training group that is statistically significant (using a dependent t-test or similar statistical test), this is treated as evidence that intelligence has increased. This reasoning is correct if one is measuring ratio scales like inches, liters or grams before and after some intervention (assuming suitable and reliable instruments like rulers to avoid erroneous Cold Fusion-like conclusions that apparently were based on faulty heat measurement); it is not correct for intelligence test scores on interval scales that only estimate a relative rank order rather than measure the construct of intelligence….Studies that use a single test to estimate intelligence before and after an intervention are using less reliable and more variable scores (bigger standard errors) than studies that combine scores from a battery of tests….Speaking about science, Carl Sagan observed that extraordinary claims require extraordinary evidence. 
So far, we do not have it for claims about increasing intelligence after cognitive training or, for that matter, any other manipulation or treatment, including early childhood education. Small statistically significant changes in test scores may be important observations about attention or memory or some other elemental cognitive variable or a specific mental ability assessed with a ratio scale like milliseconds, but they are not sufficient proof that general intelligence has changed.

This skeptical attitude is relevant to our examination of moderators.

## Moderators

### Control groups

A major criticism of n-back studies is that the effect is being manufactured by the methodological problem of some studies using a no-contact or passive control group rather than an active control group. (Passive controls know they received no intervention and that the researchers don’t expect them to do better on the post-test, which may reduce their efforts & lower their scores.)

The review Morrison & Chein 2011 noted that no-contact control groups limited the validity of such studies, a criticism that was echoed with greater force by Shipstead, Redick, & Engle 2012. The Melby-Lervåg & Hulme 2013 WM training meta-analysis then confirmed that use of no-contact controls inflated the effect size estimates, similar to Zehnder et al 2009’s results in the aged and Rapport et al 2013’s blind vs unblinded ratings in WM/executive training of ADHD; this is also consistent with the increase of d=0.2 across many kinds of psychological therapies which was found by Lipsey & Wilson 1993.

So I wondered if this held true for the subset of n-back & IQ studies. (Age is an interesting moderator in Melby-Lervåg & Hulme 2013, but in the following DNB & IQ studies there is only 1 study involving children - all the others are adults or young adults.) Each study has been coded appropriately, and we can ask whether it matters:

Test of Moderators (coefficient(s) 1,2):
QM(df = 2) = 48.8463, p-val < .0001

Model Results:

estimate      se    zval    pval    ci.lb   ci.ub
factor(active)0    0.5494  0.0812  6.7643  <.0001   0.3902  0.7086
factor(active)1    0.1727  0.0982  1.7579  0.0788  -0.0198  0.3652

The active/control variable confirms the criticism: lack of active control groups is responsible for a large chunk of the overall effect. The effect with passive control groups is a dramatic d=0.55, while with active control groups the IQ gains shrink to a small d=0.17 (whose 95% CI does not exclude d=0). Indeed, the two confidence intervals do not overlap (upper bound of 0.37 for active & lower bound of 0.39 for passive).
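The interval comparison can be checked directly from the reported estimates and standard errors. The Python sketch below rebuilds the two 95% CIs and adds a simple z-test for the difference between subgroups; that z-test treats the two estimates as independent, which is an assumption on my part rather than something taken from the metafor output.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def ci95(estimate, se):
    """Return the (lower, upper) bounds of a normal-theory 95% CI."""
    return estimate - Z95 * se, estimate + Z95 * se

passive = (0.5494, 0.0812)  # factor(active)0: estimate, se
active = (0.1727, 0.0982)   # factor(active)1: estimate, se

passive_ci = ci95(*passive)
active_ci = ci95(*active)
# The intervals overlap only if the active upper bound reaches the passive lower bound.
overlap = active_ci[1] >= passive_ci[0]

# z-test for the difference, assuming independent subgroup estimates.
diff = passive[0] - active[0]
se_diff = (passive[1] ** 2 + active[1] ** 2) ** 0.5
z = diff / se_diff
p = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"passive CI: {passive_ci[0]:.2f} to {passive_ci[1]:.2f}")
print(f"active CI:  {active_ci[0]:.2f} to {active_ci[1]:.2f}")
print(f"overlap: {overlap}; difference = {diff:.2f}, z = {z:.2f}, p = {p:.4f}")
```

Non-overlapping 95% intervals are a stricter criterion than a significant difference, so the z-test is the more direct check of whether the two kinds of control group really differ.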

We can see the difference by splitting a forest plot on passive vs active:

This is very damaging to the case that dual n-back increases IQ. Not only do the better studies find a drastically smaller effect, they are not sufficiently powered to find such a small effect at all, even aggregated in a meta-analysis: their power is ~12%, which is dismal indeed when compared to the usual benchmark of 80%, and leads to worries that even that is too high an estimate and that the active control studies are aberrant somehow in being subject to a winner’s curse or other biases. (Because many studies used convenient passive control groups and the passive effect size is 3x larger, they in aggregate are very well-powered at 89%; however, we already know they are skewed upwards, so we don’t care if we can detect a biased effect or not.) In particular, Boot et al 2013 argues that active control groups do not suffice to identify the true causal effect because the subjects in the active control group can still have different expectations than the experimental group, and the groups’ differing awareness & expectations can cause differing performance on tests; they suggest recording expectancies (somewhat similar to Redick et al 2013), checking for a dose-response relationship (see the following section for whether dose-response exists for dual n-back/IQ), and using different experimental designs which actively manipulate subject expectations to identify how much effects are inflated by remaining placebo/expectancy effects.

The active estimate of d=0.17 does allow us to estimate how many subjects a two-group experiment with an active control group would require in order for it to be well-powered (80%) to detect an effect; a total n of 1054 subjects (527 in each group).
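The sample-size figure can be approximated by hand. Here is a minimal Python sketch using the normal approximation to a two-sided, two-sample comparison at α=0.05 and the unrounded active-control estimate of d=0.1727; an exact t-test power calculation would differ by about one subject per group at most, and `n_per_group` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test of effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile corresponding to the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

n = n_per_group(0.1727)
print(n, 2 * n)  # per-group n and total n
```

Under these assumptions the sketch reproduces 527 per group (1054 in total). Because the required n scales with 1/d², even a slightly smaller true effect would demand a noticeably larger experiment.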

### Training time

Jaeggi et al 2008 observed a dose-response to training, where those who trained the longest apparently improved the most. Ever since, this has been cited as a factor in which studies will observe gains, or as an explanation of why some studies did not see improvements - perhaps they just didn’t do enough training. metafor is able to look at the number of minutes subjects in each study trained for to see if there’s any obvious linear relationship:

Test of Moderators (coefficient(s) 2):
QM(df = 1) = 0.0613, p-val = 0.8044

Model Results:

estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.4314  0.1377   3.1334  0.0017   0.1616  0.7013
mods      -0.0001  0.0003  -0.2476  0.8044  -0.0006  0.0005

The estimate of the relationship is that there is none at all: the estimated coefficient has a large p-value, and further, that coefficient is negative. This may seem initially implausible but if we graph the time spent training per study with the final (unweighted) effect size, we see why:

### IQ test time

Similarly, Moody 2009 identified the 10 minute test-time or “speeding” of the RAPM as a concern in whether far transfer actually happened; after collecting the allotted test time for the studies, we can likewise look for whether there is an inverse relationship (the more time given to subjects on the IQ test, the smaller their IQ gains):

Test of Moderators (coefficient(s) 2):
QM(df = 1) = 0.5571, p-val = 0.4554

Model Results:

estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.5121  0.1626   3.1501  0.0016   0.1935  0.8307
mods      -0.0053  0.0071  -0.7464  0.4554  -0.0191  0.0086

A tiny slope which is extremely non-statistically-significant; graphing the (unweighted) studies suggests as much:
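To get a feel for what the fitted line implies, one can plug test times into the reported intercept and slope. This is a small Python sketch of the arithmetic only; the chosen test times come from the `speed` column, and `predicted_effect` is a hypothetical helper name.

```python
def predicted_effect(intercept, slope, minutes):
    """Predicted effect size from a one-moderator meta-regression line."""
    return intercept + slope * minutes

INTERCEPT, SLOPE = 0.5121, -0.0053  # reported point estimates
SLOPE_CI = (-0.0191, 0.0086)        # reported 95% CI for the slope

for minutes in (10, 30, 45):
    print(minutes, round(predicted_effect(INTERCEPT, SLOPE, minutes), 3))

# The slope's CI spans zero, so the data are also compatible with a flat line.
print("CI includes zero:", SLOPE_CI[0] < 0 < SLOPE_CI[1])
```

The point estimates fall as more test time is allotted, but since the slope's interval covers both negative and positive values, this trend should not be read as established.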

### Training type

One question of interest both for issues of validity and for effective training is whether the existing studies show larger effects for a particular kind of n-back training: dual (visual & audio; labeled 0) or single (visual; labeled 1) or single (audio; labeled 2)? If visual single n-back turns in the largest effects, that is troubling since it’s also the one most resembling a matrix IQ test. Checking against the 3 kinds of n-back training:

Test of Moderators (coefficient(s) 1,2,3):
QM(df = 3) = 35.9048, p-val < .0001

Model Results:

estimate      se    zval    pval    ci.lb   ci.ub
factor(nbt)0    0.4497  0.0837  5.3744  <.0001   0.2857  0.6138
factor(nbt)1    0.2880  0.1263  2.2804  0.0226   0.0405  0.5354
factor(nbt)2    0.4283  0.3175  1.3491  0.1773  -0.1939  1.0506

There are not enough studies using the other kinds of n-back to say anything conclusive other than there seem to be differences, but it’s interesting that single visual n-back and single auditory n-back have weaker results so far.

### Payment/extrinsic motivation

In a 2013 talk, “‘Brain Training: Current Challenges and Potential Resolutions’, with Susanne Jaeggi, PhD”, Jaeggi suggests

Extrinsic reward can undermine people’s intrinsic motivation. If extrinsic reward is crucial, then its influence should be visible in our data.

I investigated payment as a moderator. Payment seems to actually be quite rare in n-back studies (in part because it’s so common in psychology to just recruit students with course credit or extra credit), and so the result is that as a moderator payment is currently a small and non-statistically-significant negative effect, whether you regress on the total payment amount or treat it as a boolean variable. More interestingly, it seems that the negative sign is being driven by payment being associated with higher-quality studies using active control groups, because when you look at the interaction, payment in a study with an active control group actually flips sign to being positive again (correlating with a bigger effect size).

More specifically, if we check payment as a binary variable, we get a decrease which is (almost) statistically-significant:

Test of Moderators (coefficient(s) 2):
QM(df = 1) = 2.6840, p-val = 0.1014

Model Results:

estimate      se     zval    pval    ci.lb   ci.ub
intrcpt                 0.4828  0.0834   5.7864  <.0001   0.3193  0.6464
as.logical(paid)TRUE   -0.2259  0.1379  -1.6383  0.1014  -0.4961  0.0443

If we instead regress against the total payment size (perhaps larger payments discourage one more?), the effect of each additional dollar is very small and 0 is far from excluded as the coefficient:

Test of Moderators (coefficient(s) 2):
QM(df = 1) = 0.6319, p-val = 0.4267

Model Results:

estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.4219  0.0726   5.8136  <.0001   0.2797  0.5641
paid      -0.0004  0.0005  -0.7949  0.4267  -0.0014  0.0006

Why would treating payment as a binary category yield a major result when there is only a small slope within the paid studies? As I’ve mentioned before, the difference in effect size between active and passive control groups is quite striking, and I noticed that eg. the Redick et al 2012 experiment paid subjects a lot of money to put up with all its tests and ensure subject retention & Thompson et al 2013 paid a lot to put up with the fMRI machine and long training sessions, and likewise with Oelhafen et al 2013; so what happens if we look for an interaction?

Test of Moderators (coefficient(s) 2,3,4):
QM(df = 3) = 12.6970, p-val = 0.0053

Model Results:

estimate      se     zval    pval    ci.lb    ci.ub
intrcpt                        0.6494  0.1025   6.3344  <.0001   0.4484   0.8503
active                        -0.4003  0.1570  -2.5498  0.0108  -0.7080  -0.0926
as.logical(paid)TRUE          -0.2573  0.1635  -1.5735  0.1156  -0.5779   0.0632
active:as.logical(paid)TRUE    0.0336  0.2608   0.1290  0.8973  -0.4775   0.5448

Using active control groups cuts the observed effect of n-back by more than half, as before, and payment decreases the effect size, potentially supporting claims about motivation; but then in studies with both active controls & payment, the effect size increases slightly again (an interaction of +0.03, far from statistically-significant), which seems a little curious.
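The interaction model is easier to read as four predicted cells. The Python sketch below simply sums the reported coefficients for each combination of control type and payment; it uses point estimates only and ignores their uncertainty, and the dictionary keys are shorthand labels of my own.

```python
# Reported coefficients from the active x paid interaction model.
COEF = {
    "intrcpt": 0.6494,
    "active": -0.4003,
    "paid": -0.2573,
    "active:paid": 0.0336,
}

def predicted(active, paid):
    """Predicted effect size for a study with the given 0/1 codings."""
    return (COEF["intrcpt"]
            + COEF["active"] * active
            + COEF["paid"] * paid
            + COEF["active:paid"] * active * paid)

cells = {(a, p): round(predicted(a, p), 4) for a in (0, 1) for p in (0, 1)}
for (a, p), d in cells.items():
    label = ("active" if a else "passive") + " / " + ("paid" if p else "unpaid")
    print(f"{label}: d = {d}")
```

Read this way, the positive interaction term only softens the combined penalty slightly: the predicted effect for paid studies with active controls is still the smallest of the four cells.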

## Biases

N-back has been presented in some popular & academic media in an entirely uncritical & positive light: ignoring the overwhelming failure of intelligence interventions in the past, not citing the failures to replicate, and giving short shrift to the criticisms which have been made. (Examples include the NYT, WSJ, Scientific American, & Nisbett et al 2012.) One researcher told me that a reviewer savaged their work, asserting that n-back works and thus their null result meant only that they did something wrong. So it’s worth investigating, to the extent we can, whether there is a publication bias towards publishing only positive results.

20-odd studies (some quite small) is considered medium-sized for a meta-analysis, but that many does permit us to generate funnel plots, or check for possible publication bias via the trim-and-fill method.

### Funnel plot

test for funnel plot asymmetry: z = 2.6084, p = 0.0091

The asymmetry has reached statistical-significance, so let’s visualize it:

This looks reasonably good, although we see that studies are crowding the edges of the funnel. We know that the studies with passive control groups show about three times the effect size of those with active control groups; is this related? If we plot the residual left after correcting for active vs passive, the funnel plot improves a lot (Stephenson remains an outlier):

### Trim-and-fill

The trim-and-fill estimate:

Estimated number of missing studies on the left side: 0 (SE = 4.4573)

Graphing it:

Overall, the results suggest that this particular (comprehensive) collection of DNB studies does not suffer from serious publication bias after taking into account the active/passive moderator.

## Notes

Going through them, I must note:

• Jaeggi 2008: group-level data provided by Jaeggi to Redick for Redick et al 2013; the 8-session group included both active & passive controls, so experimental DNB group was split in half. IQ test time is based on the description in Redick et al 2012:

In addition, the 19-session groups were 20 min to complete BOMAT, whereas the 12- and 17-session groups received only 10 min (S. M. Jaeggi, personal communication, May 25, 2011). As shown in Figure 2, the use of the short time limit in the 12- and 17-session studies produced substantially lower scores than the 19-session study.

• polar: control, 2nd scores: 23,27,19,15,12,35,36,34; experiment, 2nd scores: 30,35,33,33,32,30,35,33,35,33,34,30,33
• Jaeggi 2010: used BOMAT scores; should I somehow pool RAPM with BOMAT? Control group split.
• Jaeggi 2011: used SPM (a Raven’s); should I somehow pool the TONI?
• Schweizer 2011: used the adjusted final scores as suggested by the authors due to potential pre-existing differences in their control & experimental groups:

…This raises the possibility that the relative gains in Gf in the training versus control groups may be to some extent an artefact of baseline differences. However, the interactive effect of transfer as a function of group remained [statistically-]significant even after more closely matching the training and control groups for pre-training RPM scores (by removing the highest scoring controls) F(1, 30) = 3.66, P = 0.032, ηp² = 0.10. The adjusted means (standard deviations) for the control and training groups were now 27.20 (1.93), 26.63 (2.60) at pre-training (t(43) = 1.29, P > 0.05) and 26.50 (4.50), 27.07 (2.16) at post-training, respectively.

• Stephenson data from pg79/95; means are post-scores on Raven’s. I am omitting Stephenson scores on WASI, Cattell’s Culture Fair Test, & BETA III Matrix Reasoning subset because metafor does not support multivariate meta-analyses and including them as separate studies would be statistically illegitimate. The active and passive control groups were split into thirds over each of the 3 n-back training regimens, and each training regimen split in half over the active & passive controls.

The splitting is worth discussion. Some of these studies have multiple experimental groups, control groups, or both. A criticism of early studies was the use of no-contact control groups - the control groups did nothing except be tested twice, and it was suggested that the experimental group gains might be in part solely because they are doing a task, any task, and the control group should be doing some non-WM task as well. The WM meta-analysis Melby-Lervåg & Hulme 2013 checked for this and found that use of no-contact control groups led to a much larger estimate of effect size than studies which did use an active control. When trying to incorporate such a multi-part experiment, one cannot just copy controls as the Cochrane Handbook points out:

One approach that must be avoided is simply to enter several comparisons into the meta-analysis when these have one or more intervention groups in common. This ‘double-counts’ the participants in the ‘shared’ intervention group(s), and creates a unit-of-analysis error due to the unaddressed correlation between the estimated intervention effects from multiple comparisons (see Chapter 9, Section 9.3).

Just dropping one control or experimental group weakens the meta-analysis, and may bias it as well if not done systematically. I have used one of its suggested approaches which accepts some additional error in exchange for greater power in checking this possible active versus no-contact distinction, in which we instead split the shared group:

A further possibility is to include each pair-wise comparison separately, but with shared intervention groups divided out approximately evenly among the comparisons. For example, if a trial compares 121 patients receiving acupuncture with 124 patients receiving sham acupuncture and 117 patients receiving no acupuncture, then two comparisons (of, say, 61 ‘acupuncture’ against 124 ‘sham acupuncture’, and of 60 ‘acupuncture’ against 117 ‘no intervention’) might be entered into the meta-analysis. For dichotomous outcomes, both the number of events and the total number of patients would be divided up. For continuous outcomes, only the total number of participants would be divided up and the means and standard deviations left unchanged. This method only partially overcomes the unit-of-analysis error (because the resulting comparisons remain correlated) so is not generally recommended. A potential advantage of this approach, however, would be that approximate investigations of heterogeneity across intervention arms are possible (for example, in the case of the example here, the difference between using sham acupuncture and no intervention as a control group).
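The splitting rule in the Handbook passage is simple to state in code. This Python sketch shows the whole-number split from the acupuncture example, plus an exact fractional split of the kind that the non-integer sample sizes in the table (eg. 21.5 or 4.5) suggest; both function names are hypothetical, and reading the table's fractions as exact division is my assumption.

```python
def split_evenly(n, k):
    """Split n participants into k whole-number parts that differ by at most 1."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

def split_fractional(n, k):
    """Split n participants into k exactly equal (possibly fractional) parts."""
    return [n / k] * k

# Handbook example: 121 acupuncture patients shared across two comparisons.
print(split_evenly(121, 2))

# A shared group of 9 split over 2 comparisons, keeping fractions.
print(split_fractional(9, 2))
```

Only the sample sizes are divided; for continuous outcomes the means and standard deviations of the shared group are entered unchanged in each comparison.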

• Chooi: the relevant table was provided in private communication; I split each experimental group in half to pair it up with the active and passive control groups which trained the same number of days
• Takeuchi et al 2012: subjects were trained on 3 WM tasks in addition to DNB for 27 days, 30-60 minutes; RAPM scores used, BOMAT & Tanaka B-type intelligence test scores omitted
• Jaušovec 2012: IQ test time was calculated based on the description

Used were 50 test items - 25 easy (Advanced Progressive Matrices Set I - 12 items and the B Set of the Colored Progressive Matrices), and 25 difficult items (Advanced Progressive Matrices Set II, items 12-36). Participants saw a figural matrix with the lower right entry missing. They had to determine which of the four options fitted into the missing space. The tasks were presented on a computer screen (positioned about 80-100 cm in front of the respondent), at fixed 10 or 14 s interstimulus intervals. They were exposed for 6 s (easy) or 10 s (difficult) following a 2-s interval, when a cross was presented. During this time the participants were instructed to press a button on a response pad (1-4) which indicated their answer.

$\frac{25 \times \left(6+2\right) + 25 \times \left(10+2\right)}{60} = 8.33$ minutes.
• Zhong 2011: “dual attention channel” task omitted, dual and single n-back scores kept unpooled and controls split across the 2; I thank Emile Kroger for his translations of key parts of the thesis. Unable to get whether IQ test was administered speeded. Zhong 2011 appears to have replicated Jaeggi 2008’s training time.
• Jonasson 2011 omitted for lacking any measure of IQ
• Preece 2011 omitted; only the Figure Weights subtest from the WAIS was reported, but RAPM scores were taken and published in the inaccessible Palmer 2011
• Kundu et al 2011 and Kundu 2012 have been split into 2 experiments based on the raw data provided to me by Kundu: the smaller one using the full RAPM 36-matrix 40-minute test, and the larger an 18-matrix 10-minute test. (Kundu 2012 subsumes 2011, but the procedure was changed partway on Jaeggi’s advice, so they are separate results.) The final results were reported in “Strengthened effective connectivity underlies transfer of working memory training to tests of short-term memory and attention”, Kundu et al 2013.
• Redick et al: n-back split over passive control & active control (visual search) RAPM post scores (omitted SPM and Cattell Culture-Fair Test)
• Vartanian 2013: short n-back intervention not adaptive; I did not specify in advance that the n-back interventions had to be adaptive (possibly some of the others were not) and subjects trained for <50 minutes, so the lack of adaptiveness may not have mattered.
• Heinzel et al 2013 mentions conducting a pilot study; I contacted Heinzel and no measures like Raven’s were taken in it. The main study used both SPM and also “the Figural Relations subtest of a German intelligence test (LPS)”; as usual, I drop alternatives in favor of the more common test.
• Thompson et al 2013; used RAPM rather than WAIS; treated the “multiple object tracking”/MOT as an active control group since it did not statistically-significantly improve RAPM scores
• Smith et al 2013; 4 groups. Consistent with all the other studies, I have ignored the post-post-tests (a 4-week followup). To deal with the 4 groups, I have combined the Brain Age & strategy game groups into a single active control group, and then split the dual n-back group in half over the original passive control group and the new active control group.
• Jaeggi 2005: Jaeggi et al 2008 is not clear about the source of its 4 experiments, but one of them seems to be experiment 7 from Jaeggi 2005, so I omit experiment 7 to avoid any double-counting, and only use experiment 6.
• Oelhafen 2013: merged the lure and non-lure dual n-back groups
• Sprenger 2013: split the active control group over the n-back+Floop group and the combo group; training time refers solely to time spent on n-back and not the other tasks
• Jaeggi et al 2013: administered the RAPM, Cattell’s Culture Fair Test / CFT, & BOMAT; in keeping with all previous choices, I used the RAPM data; the active control group is split over the two kinds of n-back training groups. This was previously included in the meta-analysis as Jaeggi4 based on the poster but deleted once it was formally published as Jaeggi et al 2013.
• Clouter 2013: means & standard deviations, payment amount, and training time were provided by him; student participants could be paid in credit points as well as money, so to get $115, I combined the base payment of $75 with the no-credit-points option of another $40 (rather than try to assign any monetary value to credit points or figure out an average payment)
• Colom 2013: the experiment group was trained with 2 weeks of visual single n-back, then 2 weeks of auditory n-back, then 2 weeks of dual n-back; since the IQ tests were simply pre/post it’s impossible to break out the training gains separately, so I coded the n-back type as dual n-back since visual+auditory single n-back = dual n-back, and they finished with dual n-back. Colom administered 3 IQ tests - RAPM, DAT-AR, & PMA-R; as usual, I used RAPM.
• Savage 2013: administered RAPM & CCFT; as usual, only used RAPM
• Stepankova et al 2013: administered the Block Design (BD) & Matrix Reasoning (MR) nonverbal subtests of the WAIS-III
• Nussbaumer et al 2013: administered RAPM & I-S-T 2000 R tests; participants were trained in 3 conditions: non-adaptive single 1-back (“low”); non-adaptive single 3-back (“medium”); adaptive dual n-back (“high”). Given the low training time, I decided to drop the medium group as being unclear whether the intervention is doing anything, and treat the high group as the experimental group vs a “low” active control group.
• Burki et al 2014: split experimental groups across the passive & active controls; young and old groups were left unpooled because they used RAPM and RSPM respectively
• Pugin et al 2014: used the TONI-IV IQ test from the post-test, but not the followup scores; the paper reports the age-adjusted scaled values, but Fiona Pugin provided me the raw TONI-IV scores

The following authors had their studies omitted and have been contacted for clarification:

• Seidler, Jaeggi et al 2010 (experimental: n=47; control: n=45) did not report means or standard deviations
• Preece’s supervising researcher
• Minear
• Katz

## Source

Run as R --slave --file=dnb.r:

set.seed(7777) # for reproducible numbers
# TODO: factor out common parts of png (& make less square), and rma calls
library(XML)
dnb <- readHTMLTable(colClasses = c("integer", "character", rep("numeric", 11)),
"http://www.gwern.net/DNB%20meta-analysis")[[1]]
# install.packages("metafor") # if not installed
library(metafor)

cat("Basic random-effects meta-analysis of all studies:\n")
res1 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
data = dnb); res1

png(file="~/wiki/images/dnb/forest.png", width = 580, height = 600)
forest(res1, slab = paste(dnb$study, dnb$year, sep = ", "))
invisible(dev.off())

cat("Random-effects with passive/active control groups moderator:\n")
res0 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
data = dnb, mods = ~ factor(active) - 1); res0

cat("Power analysis of the passive control group sample, then the active:\n")
# mean() takes a single vector; mean(a, b) would treat b as the `trim` argument
with(dnb[dnb$active==0,],
power.t.test(n = mean(c(sum(n.c), sum(n.e))), delta=res0$b[1], sd = mean(c(sd.c, sd.e))))
with(dnb[dnb$active==1,],
power.t.test(n = mean(c(sum(n.c), sum(n.e))), delta=res0$b[2], sd = mean(c(sd.c, sd.e))))
cat("Calculate necessary sample size for active-control experiment of 80% power:\n")
power.t.test(delta = res0$b[2], power=0.8)

# this is ridiculously ugly & fragile, I regret ever wanting to show the split.
png(file="~/wiki/images/dnb/forest-activevspassive.png", width = 580, height = 780)
totalHeight <- nrow(dnb) + 8
activeTop <- nrow(dnb[dnb$active==1,])+2
forest(rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb),
slab = paste(dnb$study, dnb$year, sep = ", "), order=order(dnb$active, decreasing=TRUE),
ylim=c(0, totalHeight), rows=c(3:activeTop, (4+activeTop):(totalHeight-3)), mlab="overall")
res2 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
subset=(active==0),data = dnb)
# place the passive summary in the middle of the graph:
res3 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
subset=(active==1),data = dnb)
# place the active summary right at the bottom of the graph:
invisible(dev.off())

cat("Random-effects, regressing against training time:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
mods = training)

png(file="~/wiki/images/dnb/effectsizevstrainingtime.png", width = 580, height = 600)
plot(dnb$training, res1$yi)
invisible(dev.off())

cat("Random-effects, regressing against administered speed of IQ tests:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
data = dnb, mods=speed)

png(file="~/wiki/images/dnb/iqspeedversuseffect.png", width = 580, height = 600)
plot(dnb$speed, res1$yi)
invisible(dev.off())

cat("Random-effects, regressing against kind of n-back training:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
data = dnb, mods=~factor(nbt)-1)

cat("*, payment as a binary moderator:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
mods = ~ as.logical(paid))
cat("*, regressing against payment amount:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
mods = ~ paid)
cat("*, checking for interaction with higher experiment quality:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
mods = ~ active * as.logical(paid))

cat("Publication bias checks using funnel plots:\n")
regtest(res1, model = "rma", predictor = "sei", ni = NULL)

png(file="~/wiki/images/dnb/funnel.png", width = 580, height = 600)
funnel(res1)
invisible(dev.off())

# If we plot the residual left after correcting for active vs passive, the funnel plot improves
png(file="~/wiki/images/dnb/funnel-moderators.png", width = 580, height = 600)
res2 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
data = dnb, mods = ~ factor(active)-1 )
funnel(res2)
invisible(dev.off())

cat("Little publication bias, but let's see trim-and-fill's suggestions anyway:\n")
tf <- trimfill(res1); tf

png(file="~/wiki/images/dnb/funnel-trimfill.png", width = 580, height = 600)
funnel(tf)
invisible(dev.off())

# optimize the generated graphs by cropping whitespace & losslessly compressing them
system(paste('cd ~/wiki/images/dnb/ &&',
'for f in *.png; do convert "$f" -crop',
# the inner convert computes the trimmed geometry that is passed to -crop
'$(nice convert "$f" -virtual-pixel edge -blur 0x5 -fuzz 10% -trim -format',
'\'%wx%h%O\' info:) +repage "$f"; done'))
system("optipng -o9 -fix ~/wiki/images/dnb/*.png", ignore.stdout = TRUE)

1. To give an idea of how intensive, it cost ~$14,000 (2002) or $18,200 (2013) per child per year.

2. from pg 54-55:

An issue of great concern is that observed test score improvements may be achieved through various influences on the expectations or level of investment of participants, rather than on the intentionally targeted cognitive processes. One form of expectancy bias relates to the placebo effects observed in clinical drug studies. Simply the belief that training should have a positive influence on cognition may produce a measurable improvement on post-training performance. Participants may also be affected by the demand characteristics of the training study. Namely, in anticipation of the goals of the experiment, participants may put forth a greater effort in their performance during the post-training assessment. Finally, apparent training-related improvements may reflect differences in participants’ level of cognitive investment during the period of training. Since participants in the experimental group often engage in more mentally taxing activities, they may work harder during post-training assessments to assure the value of their earlier efforts.

Even seemingly small differences between control and training groups may yield measurable differences in effort, expectancy, and investment, but these confounds are most problematic in studies that use no control group (Holmes et al., 2010; Mezzacappa & Buckner, 2010), or only a no-contact control group; a cohort of participants that completes the pre and post training assessments but has no contact with the lab in the interval between assessments. Comparison to a no-contact control group is a prevalent practice among studies reporting positive far transfer (Chein & Morrison, 2010; Jaeggi et al., 2008; Olesen et al., 2004; Schmiedek et al., 2010; Vogt et al., 2009). This approach allows experimenters to rule out simple test-retest improvements, but is potentially vulnerable to confounding due to expectancy effects. An alternative approach is to use a “control training” group, which matches the treatment group on time and effort invested, but is not expected to benefit from training (groups receiving control training are sometimes referred to as “active control” groups). For instance, in Persson and Reuter-Lorenz (2008), both trained and control subjects practiced a common set of memory tasks, but difficulty and level of interference were higher in the experimental group’s training. Similarly, control training groups completing a non-adaptive form of training (Holmes et al., 2009; Klingberg et al., 2005) or receiving a smaller dose of training (one-third of the training trials as the experimental group, e.g., Klingberg et al., 2002) have been used as comparison groups in assessments of Cogmed variants. One recent study conducted in young children found no differences in performance gains demonstrated by a no-contact control group and a control group that completed a non-adaptive version of training, suggesting that the former approach may be adequate (Thorell et al., 2009).
We note, however, that regardless of the control procedures used, not a single study conducted to date has simultaneously controlled motivation, commitment, and difficulty, nor has any study attempted to demonstrate explicitly (for instance through subject self-report) that the control subjects experienced a comparable degree of motivation or commitment, or had similar expectancies about the benefits of training.

3. Details about the treated (active) vs untreated (passive) differences in Melby-Lervåg & Hulme 2013:

…This controls for apparently irrelevant aspects of the training that might nevertheless affect performance. In a review of educational research Clark and Sugrue (1991) estimated that such Hawthorne or expectancy effects account for up to 0.3 standard deviations improvement in many studies.

The meta-analytic results:

1. Verbal WM: d=0.99 vs 0.69
2. Visuospatial WM: 0.63 vs 0.36
3. Nonverbal abilities: 0 vs 0.38
4. Stroop: 0.30 vs 0.35

There was a significant difference in outcome between studies with treated controls and studies with only untreated controls. In fact, the studies with treated control groups had a mean effect size close to zero (notably, the 95% confidence intervals for treated controls were d=-0.24 to 0.22, and for untreated controls d=0.23 to 0.56). More specifically, several of the research groups demonstrated significant transfer effects to nonverbal ability when they used untreated control groups but did not replicate such effects when a treated control group was used (e.g., Jaeggi, Buschkuehl, Jonides, & Shah, 2011; Nutley, Söderqvist, Bryde, Thorell, Humphreys, & Klingberg, 2011). Similarly, the difference in outcome between randomized and nonrandomized studies was close to significance (p=.06), with the randomized studies giving a mean effect size that was close to zero. Notably, all the studies with untreated control groups are also nonrandomized; it is apparent from these analyses that the use of randomized designs with an alternative treatment control group are essential to give unambiguous evidence for training effects in this field.
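
As a rough check on the “significant difference” in the quoted passage, the two 95% confidence intervals can be turned back into approximate point estimates and standard errors and compared with a z-test. This is only a sketch: it assumes symmetric normal-theory intervals and independent subgroups, and the helper `ci_to_est` is a hypothetical name, not something taken from Melby-Lervåg & Hulme 2013 or from the script in the Source section.

```r
# Hypothetical helper: recover an approximate estimate and standard error
# from a symmetric 95% confidence interval (assumes normal theory).
ci_to_est <- function(lower, upper) {
  z95 <- qnorm(0.975)
  list(est = (lower + upper) / 2, se = (upper - lower) / (2 * z95))
}

# The two intervals quoted above for nonverbal ability.
near_zero <- ci_to_est(-0.24, 0.22)
positive  <- ci_to_est(0.23, 0.56)

# z-test for the difference between two independent estimates.
diff_est <- positive$est - near_zero$est
diff_se  <- sqrt(positive$se^2 + near_zero$se^2)
z        <- diff_est / diff_se
p        <- 2 * pnorm(-abs(z))
cat(sprintf("difference = %.3f, z = %.2f, p = %.4f\n", diff_est, z, p))
```

Under these assumptions the two subgroup estimates differ by roughly 0.4 standard deviations, with a two-sided p-value well below 0.05, which agrees with the passage's description of the difference as significant.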