Dual N-Back Meta-Analysis

Does DNB increase IQ? What factors affect the studies? Probably not: gains are driven by studies with weakest methodology like apathetic control groups.
DNB, psychology, meta-analysis, R, power-analysis, Bayes, IQ, bibliography
2012-05-20–2018-11-30 · in progress · certainty: highly likely · importance: 9

I meta-analyze the >19 studies up to 2016 which measure IQ after an intervention, finding (over all studies) a net medium-sized gain on the post-training IQ tests.

The size of this increase in IQ test scores correlates highly with the methodological concern of whether a study used active or passive control groups. This indicates that the medium effect size is due to methodological problems: n-back training does not increase subjects' underlying fluid intelligence; rather, the gains come from the motivational effect of passive control groups (who did not train on anything) not trying as hard as the n-back-trained experimental groups on the post-tests. The remaining studies using active control groups find a small positive effect (but this may be due to matrix-test-specific training, undetected publication bias, smaller motivational effects, etc.)

I also investigate several other n-back claims, criticisms, and indicators of bias.

Dual N-Back (DNB) is a working memory task which stresses holding several items in memory and quickly updating them. Developed for cognitive testing batteries, DNB has been repurposed for cognitive training, starting with the first study, Jaeggi et al 2008, which found that training dual n-back increases scores on an IQ test for healthy young adults. If this result were true and influenced underlying intelligence (with its many correlates such as higher income or educational achievement), it would be an unprecedented result of inestimable social value and practical impact, and so is worth investigating in detail. In my DNB FAQ, I discuss a list of post-2008 experiments investigating how much and whether practicing dual n-back can increase IQ; they conflict heavily, with some finding large gains and others finding gains which are not statistically-significant, or no gain at all.

What is one to make of these studies? When one has multiple quantitative studies going in both directions, one resorts to a meta-analysis: we pool the studies with their various sample sizes and effect sizes and get some overall answer - do a bunch of small positive studies outweigh a few big negative ones? Or vice versa? Or any mix thereof? Unfortunately, when I began compiling studies in 2011-2013, no one had yet done one for n-back & IQ; the closest existing study (Melby-Lervåg & Hulme 2013) covers working memory training in general. To summarize:

However, a recent meta-analysis by Melby-Lervåg and Hulme (in press) indicates that even when considering published studies, few appropriately-powered empirical studies have found evidence for transfer from various WM training programs to fluid intelligence. Melby-Lervåg and Hulme reported that WM training showed evidence of transfer to verbal and spatial WM tasks (d = .79 and .52, respectively). When examining the effect of WM training on transfer to nonverbal abilities tests in 22 comparisons across 20 studies, they found an effect of d = .19. Critically, a moderator analysis showed that there was no effect (d = .00) in the 10 comparisons that used a treated control group, and there was a medium effect (d = .38) in the 12 comparisons that used an untreated control group.

Similar results were found by the later WM meta-analyses Schwaighofer et al 2015 and Melby-Lervåg et al 2016, and for executive function by Kassai et al 2019. I'm not as interested in near WM transfer from n-back training - as the Melby-Lervåg & Hulme 2013 meta-analysis confirms, that surely happens - but in the transfer with many more ramifications: transfer to IQ as measured by a matrix test. So in early 2012, I decided to start a meta-analysis of my own. My method & results differ from the later 2014 DNB meta-analysis by Au & Jaeggi et al (see my Bayesian re-analysis) and the broad meta-analysis of all computer training.

For background on conducting meta-analyses, I am using chapter 9 of part 2 of the Cochrane Handbook for Systematic Reviews of Interventions. For the actual statistical analysis, I am using the metafor package for the R language.
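Concretely, the core of such an analysis in metafor looks like the sketch below (the column names mirror the data table later on this page; the three rows of numbers here are toy values for illustration only):

```r
library(metafor)  # meta-analysis package for R

# Toy data in the same shape as the real table below: experimental &
# control group sizes, means, and SDs on the IQ post-test.
dnb <- data.frame(
  study  = c("A", "B", "C"),
  n.e    = c(14, 11, 8), mean.e = c(14, 9.55, 10.25), sd.e = c(2.93, 1.97, 2.19),
  n.c    = c(8, 11, 8),  mean.c = c(12.13, 8.73, 8),  sd.c = c(2.59, 3.41, 1.60))

# Compute standardized mean differences (bias-corrected SMD, ie. Hedges's g)
# and their sampling variances:
dnb <- escalc(measure = "SMD", m1i = mean.e, sd1i = sd.e, n1i = n.e,
                               m2i = mean.c, sd2i = sd.c, n2i = n.c, data = dnb)

# Fit the random-effects model (REML estimator, metafor's default):
res1 <- rma(yi, vi, data = dnb)
summary(res1)
```

The same two calls (escalc to get yi/vi, then rma) underlie every model reported below; the moderator analyses just add a `mods =` formula.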


The candidate studies:

  • Jaeggi 2008

  • Li et al 2008

    Excluded for using non-adaptive n-back and not administering an IQ post-test.

  • Qiu 2009

  • polar 2009

  • Seidler 2010

  • Stephenson 2010

  • Jaeggi 2010

  • Jaeggi 2011

  • Chooi 2011

  • Schweizer 2011

  • Preece 2011

  • Zhong 2011

  • Jaušovec 2012

  • Kundu et al 2012

  • Salminen et al 2012

  • Redick et al 2012

  • Jaeggi 2012?

  • Takeuchi et al 2012

  • Rudebeck 2012

  • Thompson et al 2013

  • Vartanian 2013

  • Heinzel et al 2013

  • Smith et al 2013

  • Nussbaumer et al 2013

  • Oelhafen et al 2013

  • Clouter 2013

  • Sprenger et al 2013

  • Jaeggi et al 2013

  • Colom et al 2013

  • Savage 2013

  • Stepankova et al 2013

  • Minear et al 2013

  • Katz et al 2013

  • Burki et al 2014

  • Pugin et al 2014

  • Schmiedek et al 2014

  • Horvat 2014

  • Heffernan 2014

  • Hancock 2013

  • Loosli et al 2015 (supplement):

    excluded because both groups received n-back training, the experimental difference being additional interference, with near-identical improvements on all measured tasks.

  • Waris et al 2015

  • Baniqued et al 2015

  • Kuper & Karbach 2015

  • Zając-Lam­parska & Trem­pała 2016

    excluded: the n-back experimental group was non-adaptive (single 1/2-back)

  • Lindeløv et al 2016

  • Schwarb et al 2015

  • Studer-Luethi et al 2015

    excluded because did not report means/SDs for RPM scores

  • Min­ear et al 2016

  • Tayeri et al 2016: excluded for being quasi-experimental

  • Studer-Luethi et al 2016: found no transfer to the Ravens, but did not report the summary statistics so can't be included


  1. active moderator variable: whether a control group was no-contact or trained on some other task.

  2. IQ type:

    1. BOMAT
    2. Raven's Advanced Progressive Matrices (RAPM)
    3. Raven's Standard Progressive Matrices (SPM)
    4. other (eg. WAIS or WASI, Cattell's Culture Fair Intelligence Test/CFIT, TONI)
  3. record speed of IQ test: minutes allotted (upper bound if more details are given; if no time limits, default to 30 minutes, since no subjects take longer)

  4. n-back type:

    1. dual n-back (audio & visual modalities simultaneously)
    2. single n-back (visual modality)
    3. single n-back (audio modality)
    4. mixed n-back (eg. audio or visual in each block, alternating or at random)
  5. paid: expected value of total payment in dollars, converted if necessary; if a paper does not mention payment or compensation, I assume 0 (likewise subjects receiving course credit or extra credit - so common in psychology studies that there must not be any effect), and if the rewards are of real but small value (eg. "For each correct response, participants earned points that they could cash in for token prizes such as pencils or stickers."), I code as 1.

  6. country: what country the subjects are from/trained in (suggested by Au et al 2014)


The data from the surviving studies:

Meta-analysis of randomized interventions of dual n-back on IQ, with covariates (active vs passive control groups, total training time, IQ test type, IQ administration speed, n-back type, participant payments, and country)
year study Publication n.e mean.e sd.e n.c mean.c sd.c active training IQ speed N.back paid country
2008 Jaeggi1.8 Jaeggi1 4 14 2.928 8 12.13 2.588 FALSE 200 0 10 0 0 Switzerland
2008 Jaeggi1.8 Jaeggi1 4 14 2.928 7 12.86 1.46 TRUE 200 0 10 0 0 Switzerland
2008 Jaeggi1.12 Jaeggi1 11 9.55 1.968 11 8.73 3.409 FALSE 300 1 10 0 0 Switzerland
2008 Jaeggi1.17 Jaeggi1 8 10.25 2.188 8 8 1.604 FALSE 425 1 10 0 0 Switzerland
2008 Jaeggi1.19 Jaeggi1 7 14.71 3.546 8 13.88 3.643 FALSE 475 1 20 0 0 Switzerland
2009 Qiu Qiu 9 132.1 3.2 10 130 5.3 FALSE 250 2 25 0 0 China
2009 polar Marcek1 13 32.76 1.83 8 25.12 9.37 FALSE 360 0 30 0 0 Czech
2010 Jaeggi2.1 Jaeggi2 21 13.67 3.17 21.5 11.44 2.58 FALSE 370 0 16 1 91 USA
2010 Jaeggi2.2 Jaeggi2 25 12.28 3.09 21.5 11.44 2.58 FALSE 370 0 16 0 20 USA
2010 Stephenson.1 Stephenson 14 17.54 0.76 9.3 15.50 0.99 TRUE 400 1 10 0 0.44 USA
2010 Stephenson.2 Stephenson 14 17.54 0.76 8.6 14.08 0.65 FALSE 400 1 10 0 0.44 USA
2010 Stephenson.3 Stephenson 14.5 15.34 0.90 9.3 15.50 0.99 TRUE 400 1 10 1 0.44 USA
2010 Stephenson.4 Stephenson 14.5 15.34 0.90 8.6 14.08 0.65 FALSE 400 1 10 1 0.44 USA
2010 Stephenson.5 Stephenson 12.5 15.32 0.83 9.3 15.50 0.99 TRUE 400 1 10 2 0.44 USA
2010 Stephenson.6 Stephenson 12.5 15.32 0.83 8.6 14.08 0.65 FALSE 400 1 10 2 0.44 USA
2011 Chooi.1.1 Chooi 4.5 12.7 2 15 13.3 1.91 TRUE 240 1 20 0 0 USA
2011 Chooi.1.2 Chooi 4.5 12.7 2 22 11.3 2.59 FALSE 240 1 20 0 0 USA
2011 Chooi.2.1 Chooi 6.5 12.1 2.81 11 13.4 2.7 TRUE 600 1 20 0 0 USA
2011 Chooi.2.2 Chooi 6.5 12.1 2.81 23 11.9 2.64 FALSE 600 1 20 0 0 USA
2011 Jaeggi3 Jaeggi3 32 16.94 4.75 30 16.2 5.1 TRUE 287 2 10 1 1 USA
2011 Kundu1 Kundu1 3 31 1.73 3 30.3 4.51 TRUE 1000 1 40 0 0 USA
2011 Schweizer Schweizer 29 27.07 2.16 16 26.5 4.5 TRUE 463 2 30 0 0 UK
2011 Zhong.1.05d Zhong 17.6 21.38 1.71 8.8 21.85 2.6 FALSE 125 1 30 0 0 China
2011 Zhong.1.05s Zhong 17.6 22.83 2.5 8.8 21.85 2.6 FALSE 125 1 30 0 0 China
2011 Zhong.1.10d Zhong 17.6 22.21 2.3 8.8 21 1.94 FALSE 250 1 30 0 0 China
2011 Zhong.1.10s Zhong 17.6 23.12 1.83 8.8 21 1.94 FALSE 250 1 30 0 0 China
2011 Zhong.1.15d Zhong 17.6 24.12 1.83 8.8 23.78 1.48 FALSE 375 1 30 0 0 China
2011 Zhong.1.15s Zhong 17.6 25.11 1.45 8.8 23.78 1.48 FALSE 375 1 30 0 0 China
2011 Zhong.1.20d Zhong 17.6 23.06 1.48 8.8 23.38 1.56 FALSE 500 1 30 0 0 China
2011 Zhong.1.20s Zhong 17.6 23.06 3.15 8.8 23.38 1.56 FALSE 500 1 30 0 0 China
2011 Zhong.2.15s Zhong 18.5 6.89 0.99 18.5 5.15 2.01 FALSE 375 1 30 0 0 China
2011 Zhong.2.19s Zhong 18.5 6.72 1.07 18.5 5.35 1.62 FALSE 475 1 30 0 0 China
2012 Jaušovec Jaušovec 14 32.43 5.65 15 29.2 6.34 TRUE 1800 1 8.3 0 0 Slovenia
2012 Kundu2 Kundu2 11 10.81 2.32 12 9.5 2.02 TRUE 1000 1 10 0 0 USA
2012 Redick.1 Redick 12 6.25 3.08 20 6 3 FALSE 700 1 10 0 204.3 USA
2012 Redick.2 Redick 12 6.25 3.08 29 6.24 3.34 TRUE 700 1 10 0 204.3 USA
2012 Rudebeck Rudebeck 27 9.52 2.03 28 7.75 2.53 FALSE 400 0 10 0 0 UK
2012 Salminen Salminen 13 13.7 2.2 9 10.9 4.3 FALSE 319 1 20 0 55 Germany
2012 Takeuchi Takeuchi 41 31.9 0.4 20 31.2 0.9 FALSE 270 1 30 0 0 Japan
2013 Clouter Clouter 18 30.84 4.11 18 28.83 2.68 TRUE 400 3 12.5 0 115 Canada
2013 Colom Colom 28 37.25 6.23 28 35.46 8.26 FALSE 720 1 20 0 204 Spain
2013 Heinzel.1 Heinzel 15 24.53 2.9 15 23.07 2.34 FALSE 540 2 7.5 1 129 Germany
2013 Heinzel.2 Heinzel 15 17 3.89 15 15.87 3.13 FALSE 540 2 7.5 1 129 Germany
2013 Jaeggi.4 Jaeggi4 25 14.96 2.7 13.5 14.74 2.8 TRUE 500 1 30 0 0 USA
2013 Jaeggi.4 Jaeggi4 26 15.23 2.44 13.5 14.74 2.8 TRUE 500 1 30 2 0 USA
2013 Oelhafen Oelhafen 28 18.7 3.75 15 19.9 4.7 FALSE 350 0 45 0 54 Switzerland
2013 Smith.1 Smith 5 11.5 2.99 9 11.9 1.58 FALSE 340 1 10 0 3.9 UK
2013 Smith.2 Smith 5 11.5 2.99 20 12.15 2.749 TRUE 340 1 10 0 3.9 UK
2013 Sprenger.1 Sprenger 34 9.76 3.68 18.5 9.95 3.42 TRUE 410 1 10 1 100 USA
2013 Sprenger.2 Sprenger 34 9.24 3.34 18.5 9.95 3.42 TRUE 205 1 10 1 100 USA
2013 Thompson.1 Thompson 10 13.2 0.67 19 12.7 0.62 FALSE 800 1 25 0 740 USA
2013 Thompson.2 Thompson 10 13.2 0.67 19 13.3 0.5 TRUE 800 1 25 0 740 USA
2013 Vartanian Vartanian 17 11.18 2.53 17 10.41 2.24 TRUE 60 1 10 1 0 Canada
2013 Savage Savage 23 11.61 2.5 27 11.21 2.5 TRUE 625 1 20 0 0 Canada
2013 Stepankova.1 Stepankova 20 20.25 3.77 12.5 17.04 5.02 FALSE 250 3 30 1 29 Czech
2013 Stepankova.2 Stepankova 20 21.1 2.95 12.5 17.04 5.02 FALSE 500 3 30 1 29 Czech
2013 Nussbaumer Nussbaumer 29 13.69 2.54 27 11.89 2.24 TRUE 450 1 30 0 0 Switzerland
2014 Burki.1 Burki 11 37.41 6.43 20 35.95 7.55 TRUE 300 1 30 1 0 Switzerland
2014 Burki.2 Burki 11 37.41 6.43 21 36.86 6.55 FALSE 300 1 30 1 0 Switzerland
2014 Burki.3 Burki 11 28.86 7.10 20 31.20 6.67 TRUE 300 2 30 1 0 Switzerland
2014 Burki.4 Burki 11 28.86 7.10 23 27.61 6.82 FALSE 300 2 30 1 0 Switzerland
2014 Pugin Pugin 14 40.29 2.30 15 41.33 1.97 FALSE 600 3 30 1 1 Switzerland
2014 Horvat Horvat 14 48 5.68 15 47 7.49 FALSE 225 2 45 0 0 Slovenia
2014 Heffernan Heffernan 9 32.78 2.91 10 31 3.06 TRUE 450 3 20 0 140 Canada
2013 Hancock Hancock 20 9.32 3.47 20 10.44 4.35 TRUE 810 1 30 1 50 USA
2015 Waris Waris 15 16.4 2.8 16 15.9 3.0 TRUE 675 3 30 0 76 Finland
2015 Baniqued Baniqued 42 10.125 3.0275 45 10.25 3.2824 TRUE 240 1 10 0 700 USA
2015 Kuper.1 Kuper 18 23.6 6.1 9 20.7 7.3 FALSE 150 1 20 1 45 Germany
2015 Kuper.2 Kuper 18 24.9 5.5 9 20.7 7.3 FALSE 150 1 20 0 45 Germany
2016 Lindeløv.1 Lindeløv 9 10.67 2.5 9 9.89 3.79 TRUE 260 1 10 3 0 Denmark
2016 Lindeløv.2 Lindeløv 8 3.25 2.71 9 4.11 1.36 TRUE 260 1 10 3 0 Denmark
2015 Schwarb.1 Schwarb 26 0.11 2.5 26 0.69 3.2 FALSE 480 1 10 3 80 USA
2015 Schwarb.2.1 Schwarb 22 0.23 2.5 10.5 -0.36 2.1 FALSE 480 1 10 1 80 USA
2015 Schwarb.2.2 Schwarb 23 0.91 3.1 10.5 -0.36 2.1 FALSE 480 1 10 2 80 USA
2016 Heinzel.1 Heinzel.2 15 17.73 1.20 14 16.43 1.00 FALSE 540 2 7.5 1 1 Germany
2016 Lawlor-Savage Lawlor 27 11.48 2.98 30 11.02 2.34 TRUE 391 1 20 0 0 Canada
2016 Minear.1 Minear 15.5 24.1 4.5 26 24 5.1 TRUE 400 1 15 1 300 USA
2016 Minear.2 Minear 15.5 24.1 4.5 37 22.8 4.6 TRUE 400 1 15 1 300 USA


The result of the meta-analysis:

Random-Effects Model (k = 74; tau^2 estimator: REML)

tau^2 (estimated amount of total heterogeneity): 0.1103 (SE = 0.0424)
tau (square root of estimated tau^2 value):      0.3321
I^2 (total heterogeneity / total variability):   43.83%
H^2 (total variability / sampling variability):  1.78

Test for Heterogeneity:
Q(df = 73) = 144.2388, p-val < .0001

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub
  0.3460   0.0598   5.7836   <.0001   0.2288   0.4633

To depict the random-effects model in a more graphical form, we use a "forest plot":

forest(res1, slab = paste(dnb$study, dnb$year, sep = ", "))

The overall effect is somewhat strong. But there seem to be substantial differences between studies: this heterogeneity may be what is showing up as a high τ² and I²; and indeed, if we look at the computed SMDs, we see one sample with d = 2.59 (!) and some instances of d < 0. The high heterogeneity means that a fixed-effects model is inappropriate, as clearly the studies are not all measuring the same effect, so we use a random-effects model.

The confidence interval excludes zero, so one might conclude that n-back does increase IQ scores. From a Bayesian standpoint, it's worth pointing out that this is not nearly as conclusive as it seems, for several reasons:

  1. Published research can be weak; meta-analyses are generally believed to be biased towards larger effects due to systematic biases like publication bias

  2. our prior that any particular intervention would increase the underlying genuine fluid intelligence is extremely small, as scores or hundreds of attempts to increase IQ over the past century have all eventually turned out to be failures, with few exceptions (eg. pre-natal or iron supplementation), so strong evidence is necessary to conclude that a particular attempt is one of those extremely rare exceptions. As the saying goes, "extraordinary claims require extraordinary evidence". David Hambrick explains it informally:

    …Yet I and many other intelligence researchers are skeptical of this research. Before anyone spends any more time and money looking for a quick and easy way to boost intelligence, it's important to explain why we're not sold on the idea…Does this [Jaeggi et al 2008] sound like an extraordinary claim? It should. There have been many attempts to demonstrate large, lasting gains in intelligence through educational interventions, with few successes. When gains in intelligence have been achieved, they have been modest and the result of many years of effort. For instance, in a University of North Carolina study known as the Abecedarian Project, children received an intensive educational intervention from infancy to age 5 designed to increase intelligence. In follow-up tests, these children showed an advantage of six I.Q. points over a control group (and as adults, they were four times more likely to graduate from college). By contrast, the increase implied by the findings of the Jaeggi study was six I.Q. points after only six hours of training - an I.Q. point an hour. Though the Jaeggi results are intriguing, many researchers have failed to demonstrate statistically significant gains in intelligence using other, similar cognitive training programs, like Cogmed's… We shouldn't be surprised if extraordinary claims of quick gains in intelligence turn out to be wrong. Most extraordinary claims are.

  3. it's not clear that just because IQ tests like Raven's are valid and useful for measuring levels of intelligence, that an increase on the tests can be interpreted as an increase of intelligence; intelligence poses unique problems of its own in any attempt to show increases in the latent factor of gf rather than just the raw scores of tests (which can be, essentially, 'gamed'). Haier 2014 analogizes claims of breakthrough IQ increases to the initial reports of cold fusion and comments:

    The basic misunderstanding is assuming that intelligence test scores are units of measurement like inches or liters or grams. They are not. Inches, liters and grams are ratio scales where zero means zero and 100 units are twice 50 units. Intelligence test scores estimate a construct using interval scales and have meaning only relative to other people of the same age and sex. People with high scores generally do better on a broad range of mental ability tests, but someone with an IQ score of 130 is not 30% smarter than someone with an IQ score of 100…This makes simple interpretation of intelligence test score changes impossible. Most recent studies that have claimed increases in intelligence after a cognitive training intervention rely on comparing an intelligence test score before the intervention to a second score after the intervention. If there is an average change score increase for the training group that is statistically significant (using a dependent t-test or similar statistical test), this is treated as evidence that intelligence has increased. This reasoning is correct if one is measuring ratio scales like inches, liters or grams before and after some intervention (assuming suitable and reliable instruments like rulers to avoid erroneous Cold Fusion-like conclusions that apparently were based on faulty heat measurement); it is not correct for intelligence test scores on interval scales that only estimate a relative rank order rather than measure the construct of intelligence….Studies that use a single test to estimate intelligence before and after an intervention are using less reliable and more variable scores (bigger standard errors) than studies that combine scores from a battery of tests….Speaking about science, Carl Sagan observed that extraordinary claims require extraordinary evidence.
    So far, we do not have it for claims about increasing intelligence after cognitive training or, for that matter, any other manipulation or treatment, including early childhood education. Small statistically significant changes in test scores may be important observations about attention or memory or some other elemental cognitive variable or a specific mental ability assessed with a ratio scale like milliseconds, but they are not sufficient proof that general intelligence has changed.

    For statistical background on how one should be measuring changes on a latent variable like intelligence and running intervention studies, see Cronbach & Furby 1970 & Moreau et al 2016; for examples of past IQ interventions which fade out, see Protzko 2015; for examples of past IQ interventions which prove not to be on g when analyzed in a latent variable approach, see Jensen's informal comments on the Milwaukee Project (Jensen 1989), te Nijenhuis et al 2007, te Nijenhuis et al 2014, te Nijenhuis et al 2015, Nutley et al 2011, Shipstead et al 2012, Colom et al 2013, Ritchie et al 2015, Estrada et al 2015, and Bailey et al 2017 (and, in one null result, a composite score).

This skeptical attitude is relevant to our examination of moderators.


Control groups

A major criticism of n-back studies is that the effect is being manufactured by the methodological problem of some studies using a no-contact or passive control group rather than an active control group. (Passive controls know they received no intervention and that the researchers don't expect them to do better on the post-test, which may reduce their efforts & lower their scores.)

The review Morrison & Chein 2011 noted that no-contact control groups limited the validity of such studies, a criticism that was echoed with greater force by Shipstead, Redick, & Engle 2012. The WM training meta-analysis then confirmed that use of no-contact controls inflated the effect size estimates, similar to Zehnder et al 2009's results in the aged, Rapport et al 2013's blind vs unblinded ratings in WM/executive training of ADHD, Long et al 2019's inflation of emotional-regulation benefits from DNB training, and WM training in young children (Sala & Gobet 2017a find passive control groups inflate effect sizes by g = 0.12, and are further critical in Sala & Gobet 2017b); it is also consistent with the "dodo bird verdict" and the increase of d = 0.2 which has been found across many kinds of psychological therapies (but inconsistent with the g = 0.20 vs g = 0.26 of Lampit et al 2014), and with De Quidt et al 2018's demonstration that strong demand effects can reach as high as d = 1 (with a mean d = 0.6, quite close to the actual passive DNB effect).

So I wondered if this held true for the subset of n-back & IQ studies. (Age is an interesting moderator in Melby-Lervåg & Hulme 2013, but in the following DNB & IQ studies there is only 1 study involving children - all the others are adults or young adults.) Each study has been coded appropriately, and we can ask whether it matters:

Mixed-Effects Model (k = 74; tau^2 estimator: REML)

tau^2 (estimated amount of residual heterogeneity):     0.0803 (SE = 0.0373)
tau (square root of estimated tau^2 value):             0.2834
I^2 (residual heterogeneity / unaccounted variability): 36.14%
H^2 (unaccounted variability / sampling variability):   1.57

Test for Residual Heterogeneity:
QE(df = 72) = 129.2820, p-val < .0001

Test of Moderators (coefficient(s) 1,2):
QM(df = 2) = 46.5977, p-val < .0001

Model Results:

                     estimate      se    zval    pval    ci.lb   ci.ub
factor(active)FALSE    0.4895  0.0738  6.6310  <.0001   0.3448  0.6342
factor(active)TRUE     0.1397  0.0862  1.6211  0.1050  -0.0292  0.3085

The active-control moderator confirms the criticism: lack of active control groups is responsible for a large chunk of the overall effect, with the confidence intervals not overlapping. The effect with passive control groups is a medium-large d = 0.5, while with active control groups, the IQ gains shrink to a small effect (whose 95% CI does not exclude d = 0).
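For reference, the moderator model above corresponds to a metafor call of roughly this shape; the sketch below uses made-up effect sizes (`yi`) & variances (`vi`) rather than the real data, and drops the intercept with `- 1` so that each control-group type gets its own pooled estimate:

```r
library(metafor)

# Synthetic SMDs & variances for illustration; `active` codes the control group.
dnb <- data.frame(
  yi     = c(0.5, 0.6, 0.4, 0.1, 0.2, 0.0),
  vi     = c(0.04, 0.05, 0.04, 0.05, 0.04, 0.05),
  active = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE))

# One pooled estimate per control-group type:
res2 <- rma(yi, vi, mods = ~ factor(active) - 1, data = dnb)
coef(res2)  # first coefficient: passive (FALSE); second: active (TRUE)
```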

We can see the difference by splitting a forest plot on passive vs active:

The visibly different groups of passive then active studies, plotted on the same axis

This is damaging to the case that dual n-back increases intelligence, if it's unclear whether it even increases a particular test score. Not only do the better studies find a drastically smaller effect, they are not sufficiently powered to find such a small effect at all, even aggregated in a meta-analysis, with a power of ~11%, which is dismal indeed when compared to the usual benchmark of 80%, and leads to worries that even that is too high an estimate and that the active-control studies are aberrant somehow in being subject to a winner's curse or other biases. (Because many studies used convenient passive control groups and the passive effect size is 3x larger, they in aggregate are well-powered at 82%; however, we already know they are skewed upwards, so we don't care whether we can detect a biased effect.) In particular, Boot et al 2013 argues that active control groups do not suffice to identify the true causal effect, because the subjects in the active control group can still have different expectations than the experimental group, and the groups' differing awareness & expectations can cause differing performance on tests; they suggest recording expectancies (somewhat similar to Redick et al 2013), checking for a dose-response relationship (see the following section for whether dose-response exists for dual n-back/IQ), and using different experimental designs which actively manipulate subject expectations to identify how much effects are inflated by remaining placebo/expectancy effects.

The active estimate of d = 0.14 does allow us to estimate how many subjects a simple two-group experiment with an active control group would require in order for it to be well-powered (80%) to detect the effect: a total n of >1600 subjects (~805 in each group).
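That sample-size estimate can be checked with base R's power.t.test, plugging in the active-control point estimate of d = 0.1397 from the moderator model above (no meta-analytic machinery needed):

```r
# Per-group n for a two-group design at 80% power & alpha = 0.05,
# to detect a standardized effect of d = 0.1397:
pt <- power.t.test(delta = 0.1397, sd = 1, sig.level = 0.05, power = 0.80)
ceiling(pt$n)      # n per group: ~805
2 * ceiling(pt$n)  # total n: >1600
```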

Training time

Jaeggi et al 2008 observed a dose-response to training, where those who trained the longest apparently improved the most. Ever since, this has been cited as a factor in which studies will observe gains, or as an explanation of why some studies did not see improvements - perhaps they just didn't do enough training. metafor is able to look at the number of minutes subjects in each study trained for, to see if there's any clear linear relationship:

         estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.3961  0.1226   3.2299  0.0012   0.1558  0.6365
mods      -0.0001  0.0002  -0.4640  0.6427  -0.0006  0.0004

The estimate of the relationship is that there is none at all: the estimated coefficient has a large p-value, and further, that coefficient is negative. This may seem initially implausible, but if we graph the time spent training per study against the final (unweighted) effect size, we see why:

plot(dnb$training, res1$yi)
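The coefficient table above comes from the same rma machinery with a continuous moderator; a sketch with synthetic data (in the real analysis, `training` is the minutes of n-back practice from the data table):

```r
library(metafor)

# Synthetic effect sizes with no real relationship to training minutes:
dnb <- data.frame(
  yi       = c(0.5, 0.2, 0.6, 0.1, 0.4, 0.3),
  vi       = rep(0.05, 6),
  training = c(200, 300, 425, 475, 250, 360))

# Regress effect size on minutes of training; the `training` slope
# is the estimated linear dose-response:
res3 <- rma(yi, vi, mods = ~ training, data = dnb)
coef(res3)  # intercept & dose-response slope
```

The IQ-test-speed analysis in the next section is identical in form, with `speed` substituted for `training` as the moderator.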

IQ test time

Similarly, Moody 2009 identified the 10-minute test time or "speeding" of the RAPM as a concern in whether far transfer actually happened; after collecting the allotted test time for the studies, we can likewise look for whether there is an inverse relationship (the more time given to subjects on the IQ test, the smaller their IQ gains):

         estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.4197  0.1379   3.0435  0.0023   0.1494  0.6899
mods      -0.0036  0.0061  -0.5874  0.5570  -0.0154  0.0083

A tiny slope which is also not statistically-significant; graphing the (unweighted) studies suggests as much:

plot(dnb$speed, res1$yi)

Training type

One question of interest, both for issues of validity and for effective training, is whether the existing studies show larger effects for a particular kind of n-back training: dual (visual & audio; labeled 0) or single (visual; labeled 1) or single (audio; labeled 2)? If visual single n-back turns in the largest effects, that is troubling, since it's also the one most resembling a matrix IQ test. Checking against the 4 kinds of n-back training:

Mixed-Effects Model (k = 74; tau^2 estimator: REML)

tau^2 (estimated amount of residual heterogeneity):     0.1029 (SE = 0.0421)
tau (square root of estimated tau^2 value):             0.3208
I^2 (residual heterogeneity / unaccounted variability): 41.94%
H^2 (unaccounted variability / sampling variability):   1.72

Test for Residual Heterogeneity:
QE(df = 70) = 135.5275, p-val < .0001

Test of Moderators (coefficient(s) 1,2,3,4):
QM(df = 4) = 39.1393, p-val < .0001

Model Results:

                 estimate      se     zval    pval    ci.lb   ci.ub
factor(N.back)0    0.4219  0.0747   5.6454  <.0001   0.2754  0.5684
factor(N.back)1    0.2300  0.1102   2.0876  0.0368   0.0141  0.4459
factor(N.back)2    0.4255  0.2586   1.6458  0.0998  -0.0812  0.9323
factor(N.back)3   -0.1325  0.2946  -0.4497  0.6529  -0.7099  0.4449

There are not enough studies using the other kinds of n-back to say anything conclusive other than there seem to be differences, but it's interesting that single visual n-back has weaker results so far.

Payment/extrinsic motivation

In a 2013 talk, "'Brain Training: Current Challenges and Potential Resolutions', with Susanne Jaeggi, PhD", Jaeggi suggests:

Extrinsic reward can undermine people's intrinsic motivation. If extrinsic reward is crucial, then its influence should be visible in our data.

I investigated payment as a moderator. Payment seems to actually be quite rare in n-back studies (in part because it's so common in psychology to just recruit students with course credit or extra credit), and so the result is that, as a moderator, payment is currently a small and non-statistically-significant negative effect, whether you regress on the total payment amount or treat it as a boolean variable. More interestingly, it seems that the negative sign is being driven by payment being associated with higher-quality studies using active control groups, because when you look at the interaction, payment in a study with an active control group actually flips sign to being positive again (correlating with a bigger effect size).

More specifically, if we check payment as a binary variable, we get a decrease which is statistically-significant:

                      estimate      se     zval    pval    ci.lb   ci.ub
intrcpt                 0.4514  0.0776   5.8168  <.0001   0.2993   0.6035
as.logical(paid)TRUE   -0.2424  0.1164  -2.0828  0.0373  -0.4706  -0.0143

If we instead regress against the total payment size (logically, larger payments would discourage participants more), the effect of each additional dollar is tiny and 0 is far from excluded as the coefficient:

         estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.3753  0.0647   5.7976  <.0001   0.2484  0.5022
paid      -0.0004  0.0004  -1.1633  0.2447  -0.0012  0.0003

Why would treating payment as a binary category yield a major result when there is only a small slope within the paid studies? It would be odd if n-back could achieve the holy grail of increasing intelligence, but the effect vanished immediately upon paying subjects anything at all, whether $1 or $1000.

As I've mentioned before, the difference in effect size between active and passive control groups is quite striking, and I noticed that eg. the Redick et al 2012 experiment paid subjects a lot of money to put up with all its tests and ensure subject retention, & Thompson et al 2013 paid a lot to put up with the fMRI machine and long training sessions, and likewise with Oelhafen et al 2013 and Baniqued et al 2015 etc; so what happens if we look for an interaction?

                                 estimate      se     zval    pval    ci.lb    ci.ub
intrcpt                            0.6244  0.0971   6.4309  <.0001   0.4341   0.8147
activeTRUE                        -0.4013  0.1468  -2.7342  0.0063  -0.6890  -0.1136
as.logical(paid)TRUE              -0.2977  0.1427  -2.0860  0.0370  -0.5774  -0.0180
activeTRUE:as.logical(paid)TRUE    0.1039  0.2194   0.4737  0.6357  -0.3262   0.5340

An active control group cuts the observed effect of n-back by more than half, as before, and payment reduces the effect size; but in studies which both use an active control group and pay subjects, the positive interaction term partially offsets the payment penalty, which seems a little curious if we buy the story about extrinsic motivation crowding out intrinsic motivation and defeating any gains.
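To unpack that interaction table: under dummy coding, the model's predicted effect size for each of the 4 cells is just a sum of the relevant coefficients (a quick check using the estimates above):

```r
# Coefficients from the interaction model (passive & unpaid is the reference cell):
b.intercept   <-  0.6244
b.active      <- -0.4013
b.paid        <- -0.2977
b.interaction <-  0.1039

passive.unpaid <- b.intercept                                      # 0.62
passive.paid   <- b.intercept + b.paid                             # 0.33
active.unpaid  <- b.intercept + b.active                           # 0.22
active.paid    <- b.intercept + b.active + b.paid + b.interaction  # 0.03
```

Notably, the arguably best-controlled cell - paid studies with active control groups - has a predicted effect size near zero.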


N-back has been presented in some popular & academic media in an entirely uncritical & positive light: ignoring the overwhelming failure of intelligence interventions in the past, not citing the failures to replicate, and giving short shrift to the criticisms which have been made. (Examples include the NYT, WSJ, Scientific American, & Nisbett et al 2012.) One researcher told me that a reviewer savaged their work, asserting that n-back works and thus their null result meant only that they did something wrong. So it's worth investigating, to the extent we can, whether there is a bias towards publishing only positive results.

20-odd studies (some quite small) is considered medium-sized for a meta-analysis, but that many do permit us to generate funnel plots, and to check for possible publication bias via the trim-and-fill method.

Funnel plot

test for funnel plot asymmetry: z = 3.0010, p = 0.0027

The asym­me­try has reached sta­tis­ti­cal-sig­nifi­cance, so let’s visu­al­ize it:


This looks reasonably good, although we see that studies are crowding the edges of the funnel. We know that the studies with passive control groups show twice the effect-size of the studies with active control groups; is this related? If we plot the residuals left after correcting for active vs passive, the funnel plot improves a lot (Stephenson remains an outlier):

Mixed-effects plot of stan­dard error ver­sus effect size after mod­er­a­tor cor­rec­tion.


The trim-and-fill esti­mate:

Estimated number of missing studies on the left side: 0 (SE = 4.8908)

Graph­ing it:


Overall, the results suggest that this particular (comprehensive) collection of DNB studies does not suffer from serious publication bias after taking into account the active/passive moderator.


Going through them, I must note:

  • Jaeggi 2008: group-level data provided by Jaeggi to Redick for Redick et al 2013; the 8-session group included both active & passive controls, so the experimental DNB group was split in half. IQ test time is based on the description in Redick et al 2012:

    In addition, the 19-session groups were given 20 min to complete the BOMAT, whereas the 12- and 17-session groups received only 10 min (S. M. Jaeggi, personal communication, May 25, 2011). As shown in Figure 2, the use of the short time limit in the 12- and 17-session studies produced substantially lower scores than the 19-session study.

  • polar: con­trol, 2nd scores: 23,27,19,15,12,35,36,34; exper­i­ment, 2nd scores: 30,35,33,33,32,30,35,33,35,33,34,30,33

  • Jaeggi 2010: used BOMAT scores; should I some­how pool RAPM with BOMAT? Con­trol group split.

  • Jaeggi 2011: used SPM (a Raven’s); should I some­how pool the TONI?

  • Schweizer 2011: used the adjusted final scores as sug­gested by the authors due to poten­tial pre-ex­ist­ing differ­ences in their con­trol & exper­i­men­tal groups:

    …This raises the possibility that the relative gains in Gf in the training versus control groups may be to some extent an artefact of baseline differences. However, the interactive effect of transfer as a function of group remained [statistically-]significant even after more closely matching the training and control groups for pre-training RPM scores (by removing the highest scoring controls) F(1, 30) = 3.66, P = 0.032, ηp² = 0.10. The adjusted means (standard deviations) for the control and training groups were now 27.20 (1.93), 26.63 (2.60) at pre-training (t(43) = 1.29, P > 0.05) and 26.50 (4.50), 27.07 (2.16) at post-training, respectively.

  • Stephen­son data from pg79/95; means are post-s­cores on Raven’s. I am omit­ting Stephen­son scores on WASI, Cat­tel­l’s Cul­ture Fair Test, & BETA III Matrix Rea­son­ing sub­set because metafor does not sup­port mul­ti­vari­ate meta-analy­ses and includ­ing them as sep­a­rate stud­ies would be sta­tis­ti­cally ille­git­i­mate. The active and pas­sive con­trol groups were split into thirds over each of the 3 n-back train­ing reg­i­mens, and each train­ing reg­i­men split in half over the active & pas­sive con­trols.

    The split­ting is worth dis­cus­sion. Some of these stud­ies have mul­ti­ple exper­i­men­tal groups, con­trol groups, or both. A crit­i­cism of early stud­ies was the use of no-con­tact con­trol groups - the con­trol groups did noth­ing except be tested twice, and it was sug­gested that the exper­i­men­tal group gains might be in part solely because they are doing a task, any task, and the con­trol group should be doing some non-WM task as well. The WM meta-analy­sis Mel­by-Lervåg & Hulme 2013 checked for this and found that use of no-con­tact con­trol groups led to a much larger esti­mate of effect size than stud­ies which did use an active con­trol. When try­ing to incor­po­rate such a mul­ti­-part exper­i­ment, one can­not just copy con­trols as the Cochrane Hand­book points out:

    One approach that must be avoided is sim­ply to enter sev­eral com­par­isons into the meta-analy­sis when these have one or more inter­ven­tion groups in com­mon. This ‘dou­ble-counts’ the par­tic­i­pants in the ‘shared’ inter­ven­tion group(s), and cre­ates a unit-of-analy­sis error due to the unad­dressed cor­re­la­tion between the esti­mated inter­ven­tion effects from mul­ti­ple com­par­isons (see Chap­ter 9, Sec­tion 9.3).

    Just drop­ping one con­trol or exper­i­men­tal group weak­ens the meta-analy­sis, and may bias it as well if not done sys­tem­at­i­cal­ly. I have used one of its sug­gested approaches which accepts some addi­tional error in exchange for greater power in check­ing this pos­si­ble active ver­sus no-con­tact dis­tinc­tion, in which we instead split the shared group:

    A fur­ther pos­si­bil­ity is to include each pair-wise com­par­i­son sep­a­rate­ly, but with shared inter­ven­tion groups divided out approx­i­mately evenly among the com­par­isons. For exam­ple, if a trial com­pares 121 patients receiv­ing acupunc­ture with 124 patients receiv­ing sham acupunc­ture and 117 patients receiv­ing no acupunc­ture, then two com­par­isons (of, say, 61 ‘acupunc­ture’ against 124 ‘sham acupunc­ture’, and of 60 ‘acupunc­ture’ against 117 ‘no inter­ven­tion’) might be entered into the meta-analy­sis. For dichoto­mous out­comes, both the num­ber of events and the total num­ber of patients would be divided up. For con­tin­u­ous out­comes, only the total num­ber of par­tic­i­pants would be divided up and the means and stan­dard devi­a­tions left unchanged. This method only par­tially over­comes the unit-of-analy­sis error (be­cause the result­ing com­par­isons remain cor­re­lat­ed) so is not gen­er­ally rec­om­mend­ed. A poten­tial advan­tage of this approach, how­ev­er, would be that approx­i­mate inves­ti­ga­tions of het­ero­gene­ity across inter­ven­tion arms are pos­si­ble (for exam­ple, in the case of the exam­ple here, the differ­ence between using sham acupunc­ture and no inter­ven­tion as a con­trol group).

  • Chooi: the rel­e­vant table was pro­vided in pri­vate com­mu­ni­ca­tion; I split each exper­i­men­tal group in half to pair it up with the active and pas­sive con­trol groups which trained the same num­ber of days

  • Takeuchi et al 2012: sub­jects were trained on 3 WM tasks in addi­tion to DNB for 27 days, 30-60 min­utes; RAPM scores used, BOMAT & Tanaka B-type intel­li­gence test scores omit­ted

  • Jaušovec 2012: IQ test time was cal­cu­lated based on the descrip­tion

    Used were 50 test items - 25 easy (Ad­vanced Pro­gres­sive Matri­ces Set I - 12 items and the B Set of the Col­ored Pro­gres­sive Matri­ces), and 25 diffi­cult items (Ad­vanced Pro­gres­sive Matri­ces Set II, items 12-36). Par­tic­i­pants saw a fig­ural matrix with the lower right entry miss­ing. They had to deter­mine which of the four options fit­ted into the miss­ing space. The tasks were pre­sented on a com­puter screen (po­si­tioned about 80-100 cm in front of the respon­den­t), at fixed 10 or 14 s inter­stim­u­lus inter­vals. They were exposed for 6 s (easy) or 10 s (diffi­cult) fol­low­ing a 2-s inter­val, when a cross was pre­sent­ed. Dur­ing this time the par­tic­i­pants were instructed to press a but­ton on a response pad (1-4) which indi­cated their answer.


  • Zhong 2011: “dual attention channel” task omitted, dual and single n-back scores kept unpooled and controls split across the 2; I thank Emile Kroger for his translations of key parts of the thesis. I was unable to determine whether the IQ test was administered speeded. Zhong 2011 appears to have replicated Jaeggi 2008's training time.

  • Jonas­son 2011 omit­ted for lack­ing any mea­sure of IQ

  • Preece 2011 omit­ted; only the Fig­ure Weights sub­test from the WAIS was report­ed, but RAPM scores were taken and pub­lished in the inac­ces­si­ble Palmer 2011

  • Kundu et al 2011 and Kundu 2012 have been split into 2 experiments based on the raw data provided to me by Kundu: the smaller one using the full RAPM 36-matrix 40-minute test, and the larger an 18-matrix 10-minute test. (Kundu 2012 subsumes 2011, but the procedure was changed partway on Jaeggi's advice, so they are separate results.) The final results were formally reported in Kundu et al 2013.

  • Redick et al: n-back split over pas­sive con­trol & active con­trol (vi­sual search) RAPM post scores (omit­ted SPM and Cat­tell Cul­ture-Fair Test)

  • Var­tan­ian 2013: short n-back inter­ven­tion not adap­tive; I did not spec­ify in advance that the n-back inter­ven­tions had to be adap­tive (pos­si­bly some of the oth­ers were not) and sub­jects trained for <50 min­utes, so the lack of adap­tive­ness may not have mat­tered.

  • Heinzel et al 2013 men­tions con­duct­ing a pilot study; I con­tacted Heinzel and no mea­sures like Raven’s were taken in it. The main study used both SPM and also “the Fig­ural Rela­tions sub­test of a Ger­man intel­li­gence test (LPS)”; as usu­al, I drop alter­na­tives in favor of the more com­mon test.

  • Thomp­son et al 2013; used RAPM rather than WAIS; treated the “mul­ti­ple object track­ing”/MOT as an active con­trol group since it did not sta­tis­ti­cal­ly-sig­nifi­cantly improve RAPM scores

  • Smith et al 2013; 4 groups. Con­sis­tent with all the other stud­ies, I have ignored the post-post-tests (a 4-week fol­lowup). To deal with the 4 groups, I have com­bined the Brain Age & strat­egy game groups into a sin­gle active con­trol group, and then split the dual n-back group in half over the orig­i­nal pas­sive con­trol group and the new active con­trol group.

  • Jaeggi 2005: Jaeggi et al 2008 is not clear about the source of its 4 exper­i­ments, but one of them seems to be exper­i­ment 7 from Jaeggi 2005, so I omit exper­i­ment 7 to avoid any dou­ble-count­ing, and only use exper­i­ment 6.

  • Oel­hafen 2013: merged the lure and non-lure dual n-back groups

  • Sprenger 2013: split the active con­trol group over the n-back­+Floop group and the combo group; train­ing time refers solely to time spent on n-back and not the other tasks

  • Jaeggi et al 2013: administered the RAPM, Cattell's Culture Fair Test / CFT, & BOMAT; in keeping with all previous choices, I used the RAPM data; the active control group is split over the two kinds of n-back training groups. This was previously included in the meta-analysis as Jaeggi4 based on the poster but deleted once it was formally published as Jaeggi et al 2013.

  • Clouter 2013: means & stan­dard devi­a­tions, pay­ment amount, and train­ing time were pro­vided by him; stu­dent par­tic­i­pants could be paid in credit points as well as mon­ey, so to get $115, I com­bined the base pay­ment of $75 with the no-cred­it-points option of another $40 (rather than try to assign any mon­e­tary value to credit points or fig­ure out an aver­age pay­ment)

  • Colom 2013: the exper­i­ment group was trained with 2 weeks of visual sin­gle n-back, then 2 weeks of audi­tory n-back, then 2 weeks of dual n-back; since the IQ tests were sim­ply pre/post it’s impos­si­ble to break out the train­ing gains sep­a­rate­ly, so I coded the n-back type as dual n-back since visu­al+au­di­tory sin­gle n-back = dual n-back, and they fin­ished with dual n-back. Colom admin­is­tered 3 IQ tests - RAPM, DAT-AR, & PMA-R; as usu­al, I used RAPM.

  • Sav­age 2013: admin­is­tered RAPM & CCFT; as usu­al, only used RAPM

  • Stepankova et al 2013: admin­is­tered the Block Design (BD) & Matrix Rea­son­ing (MR) non­ver­bal sub­tests of the WAIS-III

  • Nuss­baumer et al 2013: admin­is­tered RAPM & I-S-T 2000 R tests; par­tic­i­pants were trained in 3 con­di­tions: non-adap­tive sin­gle 1-back (“low”); non-adap­tive sin­gle 3-back (“medium”); adap­tive dual n-back (“high”). Given the low train­ing time, I decided to drop the medium group as being unclear whether the inter­ven­tion is doing any­thing, and treat the high group as the exper­i­men­tal group vs a “low” active con­trol group.

  • Burki et al 2014: split exper­i­men­tal groups across the pas­sive & active con­trols; young and old groups were left unpooled because they used RAPM and RSPM respec­tively

  • Pugin et al 2014: used the TONI-IV IQ test from the post-test, but not the fol­lowup scores; the paper reports the age-ad­justed scaled val­ues, but Fiona Pugin pro­vided me the raw TONI-IV scores

  • Schmiedek et al 2014: “Younger Adults Show Long-Term Effects of Cog­ni­tive Train­ing on Broad Cog­ni­tive Abil­i­ties Over 2 Years”/ had sub­jects prac­tice on 12 differ­ent tasks, one of which was sin­gle (spa­tial) n-back, but it was not adap­tive (“diffi­culty lev­els for the EM and WM tasks were indi­vid­u­al­ized using differ­ent pre­sen­ta­tion times (PT) based on pre-test per­for­mance”); due to the lack of adap­tive­ness and the 11 other tasks par­tic­i­pants trained, I am omit­ting their data.

  • Hor­vat 2014: I thank Sergei & Google Trans­late for help­ing with extract­ing rel­e­vant details from the body of the the­sis, which is writ­ten in Sloven­ian. The train­ing time was 20-25 min­utes in 10 ses­sions or 225 min­utes total. The SPM test scores can be found on pg57, Table 4; the non-speed­ing of the SPM is dis­cussed on pg44; the esti­mate of $0 in com­pen­sa­tion is based on the absence of ref­er­ences to the local cur­rency (eu­ros), the cita­tion on pg32-33 of Jaeg­gi’s the­o­ries on pay­ment block­ing trans­fer due to intrin­sic vs extrin­sic moti­va­tion, and the gen­eral rar­ity of pay­ing young sub­jects like the 13-15yos used by Hor­vat.

  • Baniqued et al 2015: note that total compensation is twice as high as one would estimate from the training time times hourly pay; see the supplementary materials. They administered several measures of Gf; as usual, I have extracted only the one closest to being a matrix test and probably most g-loaded. That particular test is based on the RAPM, so it is coded as RAPM. The full training involved 6 tasks, one of which was DNB; the training time is coded as just the time spent on DNB (ie the total training time divided by 6). Means & SDs of post-test matrix scores were extracted from the raw data provided by the authors.

  • Kuper & Kar­bach 2015: con­trol group split

  • Schwarb et al 2015: reports 2 exper­i­ments, both of whose RAPM data is reported as “change scores” (the aver­age test-retest gain & the SDs of the paired differ­ences); the Cochrane Hand­book argues that change scores can be included as-is in a meta-analy­sis using post-test vari­ables as the differ­ence between the post-tests of controls/experimentals will become equiv­a­lent to change scores.

    The sec­ond exper­i­ment has 3 groups: a pas­sive con­trol group, a visual n-back, and an audi­tory n-back. The con­trol group is split.

  • Heinzel et al 2016 does not spec­ify how much par­tic­i­pants were paid

  • Lawlor-Sav­age & Goghari 2016 recorded post-tests for both RAPM & CCFT; I use RAPM as usual

  • Min­ear et al 2016: two active con­trol groups (Star­craft and non-adap­tive n-back), split con­trol group. They also admin­is­tered two sub­tests of the ETS Kit of Fac­tor-Ref­er­enced Tests, the RPM, and Cat­tell Cul­ture Fair Tests, so I use the RPM

The fol­low­ing authors had their stud­ies omit­ted and have been con­tacted for clar­i­fi­ca­tion:

  • Sei­dler, Jaeggi et al 2010 (ex­per­i­men­tal: n = 47; con­trol: n = 45) did not report means or stan­dard devi­a­tions
  • Preece’s super­vis­ing researcher
  • Min­ear
  • Katz
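The group-splitting applied throughout the notes above can be sketched as a small helper (`split.shared` is a name of my own invention, following the Cochrane Handbook's rule that for continuous outcomes only the shared group's sample size is divided, with means & SDs left unchanged):

```r
# Divide a shared group's n as evenly as possible across k comparisons;
# means and standard deviations of the shared group are not changed.
split.shared <- function(n.shared, k) {
    sizes     <- rep(n.shared %/% k, k)
    remainder <- n.shared %% k
    if (remainder > 0)
        sizes[seq_len(remainder)] <- sizes[seq_len(remainder)] + 1
    sizes
}

split.shared(121, 2)  # the Handbook's acupuncture example: 61 and 60
split.shared(117, 3)  # eg. a control group split over 3 training regimens: 39 each
```

This only partially avoids the unit-of-analysis error (the resulting comparisons remain correlated), but, as the Handbook notes, it permits approximate investigation of heterogeneity across arms, which is exactly the active-vs-passive contrast of interest here.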


Run as R --slave --file=dnb.r | less:

set.seed(7777) # for reproducible numbers
# TODO: factor out common parts of `png` (& make less square), and `rma` calls
library(XML)     # provides readHTMLTable
library(metafor) # install.packages(c("XML", "metafor")) # if not installed
dnb <- readHTMLTable(colClasses = c("integer", "character", "factor",
                                    "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
                                    "logical", "integer", "factor", "integer", "factor", "integer", "factor"), "/tmp/burl8109K_P.html")[[1]]

cat("Basic random-effects meta-analysis of all studies:\n")
res1 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = dnb); res1

png(file="~/wiki/images/dnb/forest.png", width = 680, height = 800)
forest(res1, slab = paste(dnb$study, dnb$year, sep = ", "))
invisible(dev.off())

cat("Random-effects with passive/active control groups moderator:\n")
res0 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = dnb,
            mods = ~ factor(active) - 1); res0
cat("Power analysis of the passive control group sample, then the active:\n")
with(dnb, power.t.test(n = mean(c(sum(n.c), sum(n.e))), delta = res0$b[1], sd = mean(c(sd.c, sd.e))))
with(dnb, power.t.test(n = mean(c(sum(n.c), sum(n.e))), delta = res0$b[2], sd = mean(c(sd.c, sd.e))))
cat("Calculate necessary sample size for active-control experiment of 80% power:")
power.t.test(delta = res0$b[2], power=0.8)

png(file="~/wiki/images/dnb/forest-activevspassive.png", width = 750, height = 1100)
par(mfrow=c(2,1), mar=c(1,4.5,1,0))
active <- dnb[dnb$active==TRUE,]
passive <- dnb[dnb$active==FALSE,]
forest(rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = passive),
       order=order(passive$year), slab=paste(passive$study, passive$year, sep = ", "),
       mlab="Studies with passive control groups")
forest(rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = active),
       order=order(active$year), slab=paste(active$study, active$year, sep = ", "),
       mlab="Studies with active control groups")
invisible(dev.off())

cat("Random-effects, regressing against training time:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ training)

png(file="~/wiki/images/dnb/effectsizevstrainingtime.png", width = 580, height = 600)
plot(dnb$training, res1$yi, xlab="Minutes spent n-backing", ylab="SMD")
invisible(dev.off())

cat("Random-effects, regressing against administered speed of IQ tests:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
    data = dnb, mods = ~ speed)

png(file="~/wiki/images/dnb/iqspeedversuseffect.png", width = 580, height = 600)
plot(dnb$speed, res1$yi, xlab="Administered IQ test speed", ylab="SMD")
invisible(dev.off())

cat("Random-effects, regressing against kind of n-back training:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
                   data = dnb, mods=~factor(N.back)-1)

cat("*, payment as a binary moderator:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ as.logical(paid))
cat("*, regressing against payment amount:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ paid)
cat("*, checking for interaction with higher experiment quality:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ active * as.logical(paid))

cat("Test Au's claim about active control groups being a proxy for international differences:")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ active + I(country=="USA"))

cat("Look at all covariates together:")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ active + I(country=="USA") + training + IQ + speed + N.back + paid + I(paid==0))

cat("Publication bias checks using funnel plots:\n")
regtest(res1, model = "rma", predictor = "sei")

png(file="~/wiki/images/dnb/funnel.png", width = 580, height = 600)
funnel(res1)
invisible(dev.off())

# If we plot the residuals left after correcting for active vs passive, the funnel plot improves
png(file="~/wiki/images/dnb/funnel-moderators.png", width = 580, height = 600)
res2 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
           data = dnb, mods = ~ factor(active) - 1)
funnel(res2)
invisible(dev.off())

cat("Little publication bias, but let's see trim-and-fill's suggestions anyway:\n")
tf <- trimfill(res1); tf

png(file="~/wiki/images/dnb/funnel-trimfill.png", width = 580, height = 600)
funnel(tf)
invisible(dev.off())

# optimize the generated graphs by cropping whitespace & losslessly compressing them
system(paste('cd ~/wiki/images/dnb/ &&',
             'for f in *.png; do convert "$f" -crop',
             '`nice convert "$f" -virtual-pixel edge -blur 0x5 -fuzz 10% -trim -format',
             '\'%wx%h%O\' info:` +repage "$f"; done'))
system("optipng -quiet -o9 -fix ~/wiki/images/dnb/*.png", ignore.stdout = TRUE)
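For reference, the measure="SMD" in the rma() calls above is the standardized mean difference (Hedges' g, ie. Cohen's d scaled by a small-sample bias correction); a hand-computation sketch on made-up group statistics, not any study's actual data:

```r
# Hedges' g: standardized mean difference with small-sample bias correction.
hedges.g <- function(m1, m2, sd1, sd2, n1, n2) {
    sp <- sqrt(((n1-1)*sd1^2 + (n2-1)*sd2^2) / (n1+n2-2))  # pooled SD
    J  <- 1 - 3 / (4*(n1+n2-2) - 1)                        # bias correction factor
    J * (m1 - m2) / sp
}

hedges.g(m1=105, m2=100, sd1=15, sd2=15, n1=20, n2=20)  # ~0.33
```

With 20 subjects per arm and a 5-point (one-third SD) advantage, g comes out just under Cohen's d of 0.333, since the correction factor shrinks it slightly.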

  1. To give an idea of how inten­sive, it cost ~$14,000 (2002) or $18,200 (2013) per child per year.↩︎

  2. from pg 54-55:

    An issue of great con­cern is that observed test score improve­ments may be achieved through var­i­ous influ­ences on the expec­ta­tions or level of invest­ment of par­tic­i­pants, rather than on the inten­tion­ally tar­geted cog­ni­tive process­es. One form of expectancy bias relates to the placebo effects observed in clin­i­cal drug stud­ies. Sim­ply the belief that train­ing should have a pos­i­tive influ­ence on cog­ni­tion may pro­duce a mea­sur­able improve­ment on post-train­ing per­for­mance. Par­tic­i­pants may also be affected by the demand char­ac­ter­is­tics of the train­ing study. Name­ly, in antic­i­pa­tion of the goals of the exper­i­ment, par­tic­i­pants may put forth a greater effort in their per­for­mance dur­ing the post-train­ing assess­ment. Final­ly, appar­ent train­ing-re­lated improve­ments may reflect differ­ences in par­tic­i­pants’ level of cog­ni­tive invest­ment dur­ing the period of train­ing. Since par­tic­i­pants in the exper­i­men­tal group often engage in more men­tally tax­ing activ­i­ties, they may work harder dur­ing post-train­ing assess­ments to assure the value of their ear­lier efforts.

    Even seemingly small differences between control and training groups may yield measurable differences in effort, expectancy, and investment, but these confounds are most problematic in studies that use no control group (Holmes et al., 2010; Mezzacappa & Buckner, 2010), or only a no-contact control group; a cohort of participants that completes the pre and post training assessments but has no contact with the lab in the interval between assessments. Comparison to a no-contact control group is a prevalent practice among studies reporting positive far transfer (Chein & Morrison, 2010; Jaeggi et al., 2008; Olesen et al., 2004; Schmiedek et al., 2010; Vogt et al., 2009). This approach allows experimenters to rule out simple test-retest improvements, but is potentially vulnerable to confounding due to expectancy effects. An alternative approach is to use a “control training” group, which matches the treatment group on time and effort invested, but is not expected to benefit from training (groups receiving control training are sometimes referred to as “active control” groups). For instance, in Persson and Reuter-Lorenz (2008), both trained and control subjects practiced a common set of memory tasks, but difficulty and level of interference were higher in the experimental group’s training. Similarly, control training groups completing a non-adaptive form of training (Holmes et al., 2009; Klingberg et al., 2005) or receiving a smaller dose of training (one-third of the training trials as the experimental group, e.g., Klingberg et al., 2002) have been used as comparison groups in assessments of Cogmed variants. One recent study conducted in young children found no differences in performance gains demonstrated by a no-contact control group and a control group that completed a non-adaptive version of training, suggesting that the former approach may be adequate (Thorell et al., 2009). We note, however, that regardless of the control procedures used, not a single study conducted to date has simultaneously controlled motivation, commitment, and difficulty, nor has any study attempted to demonstrate explicitly (for instance through subject self-report) that the control subjects experienced a comparable degree of motivation or commitment, or had similar expectancies about the benefits of training.

  3. Details about the treated (ac­tive) vs untreated (pas­sive) differ­ences in Mel­by-Lervåg & Hulme 2013:

    …This con­trols for appar­ently irrel­e­vant aspects of the train­ing that might nev­er­the­less affect per­for­mance. In a review of edu­ca­tional research Clark and Sug­rue (1991) [“Research on instruc­tional media, 1978-1988” in ed Anglin 1991, Instruc­tional Tech­nol­ogy] esti­mated that such Hawthorne or expectancy effects account for up to 0.3 stan­dard devi­a­tions improve­ment in many stud­ies.

    The meta-an­a­lytic results:

    1. Ver­bal WM: d = 0.99 vs 0.69
    2. Visu­ospa­tial WM: 0.63 vs 0.36
    3. Non­ver­bal abil­i­ties: 0 vs 0.38
    4. Stroop: 0.30 vs 0.35

    There was a significant difference in outcome between studies with treated controls and studies with only untreated controls. In fact, the studies with treated control groups had a mean effect size close to zero (notably, the 95% confidence intervals for treated controls were d = -0.24 to 0.22, and for untreated controls d = 0.23 to 0.56). More specifically, several of the research groups demonstrated significant transfer effects to nonverbal ability when they used untreated control groups but did not replicate such effects when a treated control group was used (e.g., Jaeggi, Buschkuehl, Jonides, & Shah, 2011; Nutley, Söderqvist, Bryde, Thorell, Humphreys, & Klingberg, 2011). Similarly, the difference in outcome between randomized and nonrandomized studies was close to significance (p = 0.06), with the randomized studies giving a mean effect size that was close to zero. Notably, all the studies with untreated control groups are also nonrandomized; it is apparent from these analyses that the use of randomized designs with an alternative treatment control group are essential to give unambiguous evidence for training effects in this field.

  4. A more com­pli­cated analy­sis, includ­ing base­line per­for­mance and other covari­ates, would do bet­ter.↩︎