Dual N-Back Meta-Analysis

Does DNB increase IQ? What factors affect the studies? Probably not: the gains are driven by the studies with the weakest methodology, such as those using apathetic control groups.
DNB, psychology, meta-analysis, R, power-analysis, Bayes, IQ, bibliography
2012-05-20 to 2018-11-30; in progress; certainty: highly likely; importance: 9

I meta-an­a­lyze the >19 stud­ies up to 2016 which mea­sure IQ after an n-back in­ter­ven­tion, find­ing (over all stud­ies) a net gain (medi­um-sized) on the post-train­ing IQ tests.

The size of this in­crease on IQ test score cor­re­lates highly with the method­olog­i­cal con­cern of whether a study used ac­tive or pas­sive con­trol groups. This in­di­cates that the medium effect size is due to method­olog­i­cal prob­lems and that n-back train­ing does not in­crease sub­jects’ un­der­ly­ing fluid in­tel­li­gence but the gains are due to the mo­ti­va­tional effect of pas­sive con­trol groups (who did not train on any­thing) not try­ing as hard as the n-back­-trained ex­per­i­men­tal groups on the post-tests. The re­main­ing stud­ies us­ing ac­tive con­trol groups find a small pos­i­tive effect (but this may be due to ma­trix-test-spe­cific train­ing, un­de­tected pub­li­ca­tion bi­as, smaller mo­ti­va­tional effects, etc.)

I also investigate several other n-back claims, criticisms, and indicators of bias (training time, IQ test speed, type of n-back, payment of subjects, and publication bias).

Dual N-Back (DNB) is a work­ing mem­ory task which stresses hold­ing sev­eral items in mem­ory and quickly up­dat­ing them. De­vel­oped for cog­ni­tive test­ing bat­ter­ies, DNB has been re­pur­posed for cog­ni­tive train­ing, start­ing with the first study Jaeggi et al 2008, which found that train­ing dual n-back in­creases scores on an IQ test for healthy young adults. If this re­sult were true and in­flu­enced un­der­ly­ing in­tel­li­gence (with its many cor­re­lates such as higher in­come or ed­u­ca­tional achieve­men­t), it would be an un­prece­dented re­sult of in­es­timable so­cial value and prac­ti­cal im­pact, and so is worth in­ves­ti­gat­ing in de­tail. In my DNB FAQ, I dis­cuss a list of post-2008 ex­per­i­ments in­ves­ti­gat­ing how much and whether prac­tic­ing dual n-back can in­crease IQ; they con­flict heav­i­ly, with some find­ing large gains and oth­ers find­ing gains which are not sta­tis­ti­cal­ly-sig­nifi­cant or no gain at all.

What is one to make of these studies? When one has multiple quantitative studies going in both directions, one resorts to a meta-analysis: we pool the studies with their various sample sizes and effect sizes and get some overall answer. Do a bunch of small positive studies outweigh a few big negative ones? Or vice versa? Or any mix thereof? Unfortunately, when I began compiling studies in 2011-2013, no one had yet done a meta-analysis for n-back & IQ; the existing meta-analysis (Melby-Lervåg & Hulme 2013) covers working memory training in general. To summarize:

How­ev­er, a re­cent meta-analy­sis by Mel­by-Lervåg and Hulme (in press) in­di­cates that even when con­sid­er­ing pub­lished stud­ies, few ap­pro­pri­ate­ly-pow­ered em­pir­i­cal stud­ies have found ev­i­dence for trans­fer from var­i­ous WM train­ing pro­grams to fluid in­tel­li­gence. Mel­by-Lervåg and Hulme re­ported that WM train­ing showed ev­i­dence of trans­fer to ver­bal and spa­tial WM tasks (d = .79 and .52, re­spec­tive­ly). When ex­am­in­ing the effect of WM train­ing on trans­fer to non­ver­bal abil­i­ties tests in 22 com­par­isons across 20 stud­ies, they found an effect of d = .19. Crit­i­cal­ly, a mod­er­a­tor analy­sis showed that there was no effect (d = .00) in the 10 com­par­isons that used a treated con­trol group, and there was a medium effect (d = .38) in the 12 com­par­isons that used an un­treated con­trol group.

Similar results were found by the later WM meta-analyses Schwaighofer et al 2015 and Melby-Lervåg et al 2016, and for executive function by Kassai et al 2019. I’m not as interested in near WM transfer from n-back training (as the Melby-Lervåg & Hulme 2013 meta-analysis confirms, it surely happens) but in the transfer with many more ramifications: transfer to IQ as measured by a matrix test. So in early 2012, I decided to start a meta-analysis of my own. My method & results differ from the later 2014 DNB meta-analysis by Au & Jaeggi et al (Bayesian re-analysis) and the broad meta-analysis of all computer-training.

For back­ground on con­duct­ing meta-analy­ses, I am us­ing chap­ter 9 of part 2 of the Cochrane Hand­book for Sys­tem­atic Re­views of In­ter­ven­tions. For the ac­tual sta­tis­ti­cal analy­sis, I am us­ing the metafor pack­age for the R lan­guage.


The can­di­date stud­ies:

  • Jaeggi 2008

  • Li et al 2008

    Ex­cluded for us­ing non-adap­tive n-back and not ad­min­is­ter­ing an IQ post-test.

  • Qiu 2009

  • po­lar 2009

  • Sei­dler 2010

  • Stephen­son 2010

  • Jaeggi 2010

  • Jaeggi 2011

  • Chooi 2011

  • Schweizer 2011

  • Preece 2011

  • Zhong 2011

  • Jaušovec 2012

  • Kundu et al 2012

  • Salmi­nen et al 2012

  • Redick et al 2012

  • Jaeggi 2012?

  • Takeuchi et al 2012

  • Rude­beck 2012

  • Thomp­son et al 2013

  • Var­tan­ian 2013

  • Heinzel et al 2013

  • Smith et al 2013

  • Nuss­baumer et al 2013

  • Oel­hafen et al 2013

  • Clouter 2013

  • Sprenger et al 2013

  • Jaeggi et al 2013

  • Colom et al 2013

  • Sav­age 2013

  • Stepankova et al 2013

  • Min­ear et al 2013

  • Katz et al 2013

  • Burki et al 2014

  • Pu­gin et al 2014

  • Schmiedek et al 2014

  • Hor­vat 2014

  • Heffer­nan 2014

  • Han­cock 2013

  • Loosli et al 2015 (sup­ple­ment):

    ex­cluded be­cause both groups re­ceived n-back train­ing, the ex­per­i­men­tal differ­ence be­ing ad­di­tional in­ter­fer­ence, with near-i­den­ti­cal im­prove­ments to all mea­sured tasks.

  • Waris et al 2015

  • Ban­iqued et al 2015

  • Ku­per & Kar­bach 2015

  • Za­jąc-Lam­parska & Trem­pała 2016

    ex­clud­ed: n-back ex­per­i­men­tal group was non-adap­tive (s­in­gle 1/2-back)

  • Lin­deløv et al 2016

  • Schwarb et al 2015

  • Stud­er-Luethi et al 2015

    ex­cluded be­cause did not re­port means/SDs for RPM scores

  • Min­ear et al 2016

  • Tay­eri et al 2016: ex­cluded for be­ing qua­si­-ex­per­i­men­tal

  • Stud­er-Luethi et al 2016: found no trans­fer to the Ravens; but did not re­port the sum­mary sta­tis­tics so can’t be in­cluded


  1. active mod­er­a­tor vari­able: whether a con­trol group was no-con­tact or trained on some other task.

  2. IQ type:

    0. BOMAT
    1. Raven’s Advanced Progressive Matrices (RAPM)
    2. Raven’s Standard Progressive Matrices (SPM)
    3. other (eg. WAIS or WASI, Cattell’s Culture Fair Intelligence Test/CFIT, TONI)
  3. record speed of IQ test: min­utes al­lot­ted (up­per bound if more de­tails are given; if no time lim­its, de­fault to 30 min­utes since no sub­jects take longer)

  4. n-back type:

    0. dual n-back (audio & visual modalities simultaneously)
    1. single n-back (visual modality)
    2. single n-back (audio modality)
    3. mixed n-back (eg audio or visual in each block, alternating or at random)
  5. paid: ex­pected value of to­tal pay­ment in dol­lars, con­verted if nec­es­sary; if a pa­per does not men­tion pay­ment or com­pen­sa­tion, I as­sume 0 (like­wise sub­jects re­ceiv­ing course credit or ex­tra credit - so com­mon in psy­chol­ogy stud­ies that there must not be any effec­t), and if the re­wards are of real but small value (eg. “For each cor­rect re­spon­se, par­tic­i­pants earned points that they could cash in for to­ken prizes such as pen­cils or stick­ers.”), I code as 1.

  6. coun­try: what coun­try the sub­jects are from/trained in (sug­gested by Au et al 2014)


The data from the sur­viv­ing stud­ies:

Meta-analysis of randomized interventions of dual n-back on IQ, with covariates (active vs passive control groups, total training time, IQ test type, IQ administration speed, n-back type, participant payments, and country). Columns ending in .e describe the experimental group and those ending in .c the control group:
year study Publication n.e mean.e sd.e n.c mean.c sd.c active training IQ speed N.back paid country
2008 Jaeg­gi1.8 Jaeg­gi1 4 14 2.928 8 12.13 2.588 FALSE 200 0 10 0 0 Switzer­land
2008 Jaeg­gi1.8 Jaeg­gi1 4 14 2.928 7 12.86 1.46 TRUE 200 0 10 0 0 Switzer­land
2008 Jaeg­gi1.12 Jaeg­gi1 11 9.55 1.968 11 8.73 3.409 FALSE 300 1 10 0 0 Switzer­land
2008 Jaeg­gi1.17 Jaeg­gi1 8 10.25 2.188 8 8 1.604 FALSE 425 1 10 0 0 Switzer­land
2008 Jaeg­gi1.19 Jaeg­gi1 7 14.71 3.546 8 13.88 3.643 FALSE 475 1 20 0 0 Switzer­land
2009 Qiu Qiu 9 132.1 3.2 10 130 5.3 FALSE 250 2 25 0 0 China
2009 po­lar Marcek1 13 32.76 1.83 8 25.12 9.37 FALSE 360 0 30 0 0 Czech
2010 Jaeg­gi2.1 Jaeg­gi2 21 13.67 3.17 21.5 11.44 2.58 FALSE 370 0 16 1 91 USA
2010 Jaeg­gi2.2 Jaeg­gi2 25 12.28 3.09 21.5 11.44 2.58 FALSE 370 0 16 0 20 USA
2010 Stephen­son.1 Stephen­son 14 17.54 0.76 9.3 15.50 0.99 TRUE 400 1 10 0 0.44 USA
2010 Stephen­son.2 Stephen­son 14 17.54 0.76 8.6 14.08 0.65 FALSE 400 1 10 0 0.44 USA
2010 Stephen­son.3 Stephen­son 14.5 15.34 0.90 9.3 15.50 0.99 TRUE 400 1 10 1 0.44 USA
2010 Stephen­son.4 Stephen­son 14.5 15.34 0.90 8.6 14.08 0.65 FALSE 400 1 10 1 0.44 USA
2010 Stephen­son.5 Stephen­son 12.5 15.32 0.83 9.3 15.50 0.99 TRUE 400 1 10 2 0.44 USA
2010 Stephen­son.6 Stephen­son 12.5 15.32 0.83 8.6 14.08 0.65 FALSE 400 1 10 2 0.44 USA
2011 Chooi.1.1 Chooi 4.5 12.7 2 15 13.3 1.91 TRUE 240 1 20 0 0 USA
2011 Chooi.1.2 Chooi 4.5 12.7 2 22 11.3 2.59 FALSE 240 1 20 0 0 USA
2011 Chooi.2.1 Chooi 6.5 12.1 2.81 11 13.4 2.7 TRUE 600 1 20 0 0 USA
2011 Chooi.2.2 Chooi 6.5 12.1 2.81 23 11.9 2.64 FALSE 600 1 20 0 0 USA
2011 Jaeg­gi3 Jaeg­gi3 32 16.94 4.75 30 16.2 5.1 TRUE 287 2 10 1 1 USA
2011 Kun­du1 Kun­du1 3 31 1.73 3 30.3 4.51 TRUE 1000 1 40 0 0 USA
2011 Schweizer Schweizer 29 27.07 2.16 16 26.5 4.5 TRUE 463 2 30 0 0 UK
2011 Zhong.1.05d Zhong 17.6 21.38 1.71 8.8 21.85 2.6 FALSE 125 1 30 0 0 China
2011 Zhong.1.05s Zhong 17.6 22.83 2.5 8.8 21.85 2.6 FALSE 125 1 30 0 0 China
2011 Zhong.1.10d Zhong 17.6 22.21 2.3 8.8 21 1.94 FALSE 250 1 30 0 0 China
2011 Zhong.1.10s Zhong 17.6 23.12 1.83 8.8 21 1.94 FALSE 250 1 30 0 0 China
2011 Zhong.1.15d Zhong 17.6 24.12 1.83 8.8 23.78 1.48 FALSE 375 1 30 0 0 China
2011 Zhong.1.15s Zhong 17.6 25.11 1.45 8.8 23.78 1.48 FALSE 375 1 30 0 0 China
2011 Zhong.1.20d Zhong 17.6 23.06 1.48 8.8 23.38 1.56 FALSE 500 1 30 0 0 China
2011 Zhong.1.20s Zhong 17.6 23.06 3.15 8.8 23.38 1.56 FALSE 500 1 30 0 0 China
2011 Zhong.2.15s Zhong 18.5 6.89 0.99 18.5 5.15 2.01 FALSE 375 1 30 0 0 China
2011 Zhong.2.19s Zhong 18.5 6.72 1.07 18.5 5.35 1.62 FALSE 475 1 30 0 0 China
2012 Jaušovec Jaušovec 14 32.43 5.65 15 29.2 6.34 TRUE 1800 1 8.3 0 0 Slove­nia
2012 Kun­du2 Kun­du2 11 10.81 2.32 12 9.5 2.02 TRUE 1000 1 10 0 0 USA
2012 Redick.1 Redick 12 6.25 3.08 20 6 3 FALSE 700 1 10 0 204.3 USA
2012 Redick.2 Redick 12 6.25 3.08 29 6.24 3.34 TRUE 700 1 10 0 204.3 USA
2012 Rude­beck Rude­beck 27 9.52 2.03 28 7.75 2.53 FALSE 400 0 10 0 0 UK
2012 Salmi­nen Salmi­nen 13 13.7 2.2 9 10.9 4.3 FALSE 319 1 20 0 55 Ger­many
2012 Takeuchi Takeuchi 41 31.9 0.4 20 31.2 0.9 FALSE 270 1 30 0 0 Japan
2013 Clouter Clouter 18 30.84 4.11 18 28.83 2.68 TRUE 400 3 12.5 0 115 Canada
2013 Colom Colom 28 37.25 6.23 28 35.46 8.26 FALSE 720 1 20 0 204 Spain
2013 Heinzel.1 Heinzel 15 24.53 2.9 15 23.07 2.34 FALSE 540 2 7.5 1 129 Ger­many
2013 Heinzel.2 Heinzel 15 17 3.89 15 15.87 3.13 FALSE 540 2 7.5 1 129 Ger­many
2013 Jaeg­gi.4 Jaeg­gi4 25 14.96 2.7 13.5 14.74 2.8 TRUE 500 1 30 0 0 USA
2013 Jaeg­gi.4 Jaeg­gi4 26 15.23 2.44 13.5 14.74 2.8 TRUE 500 1 30 2 0 USA
2013 Oel­hafen Oel­hafen 28 18.7 3.75 15 19.9 4.7 FALSE 350 0 45 0 54 Switzer­land
2013 Smith.1 Smith 5 11.5 2.99 9 11.9 1.58 FALSE 340 1 10 0 3.9 UK
2013 Smith.2 Smith 5 11.5 2.99 20 12.15 2.749 TRUE 340 1 10 0 3.9 UK
2013 Sprenger.1 Sprenger 34 9.76 3.68 18.5 9.95 3.42 TRUE 410 1 10 1 100 USA
2013 Sprenger.2 Sprenger 34 9.24 3.34 18.5 9.95 3.42 TRUE 205 1 10 1 100 USA
2013 Thomp­son.1 Thomp­son 10 13.2 0.67 19 12.7 0.62 FALSE 800 1 25 0 740 USA
2013 Thomp­son.2 Thomp­son 10 13.2 0.67 19 13.3 0.5 TRUE 800 1 25 0 740 USA
2013 Var­tan­ian Var­tan­ian 17 11.18 2.53 17 10.41 2.24 TRUE 60 1 10 1 0 Canada
2013 Sav­age Sav­age 23 11.61 2.5 27 11.21 2.5 TRUE 625 1 20 0 0 Canada
2013 Stepanko­va.1 Stepankova 20 20.25 3.77 12.5 17.04 5.02 FALSE 250 3 30 1 29 Czech
2013 Stepanko­va.2 Stepankova 20 21.1 2.95 12.5 17.04 5.02 FALSE 500 3 30 1 29 Czech
2013 Nuss­baumer Nuss­baumer 29 13.69 2.54 27 11.89 2.24 TRUE 450 1 30 0 0 Switzer­land
2014 Bur­k­i.1 Burki 11 37.41 6.43 20 35.95 7.55 TRUE 300 1 30 1 0 Switzer­land
2014 Bur­k­i.2 Burki 11 37.41 6.43 21 36.86 6.55 FALSE 300 1 30 1 0 Switzer­land
2014 Bur­k­i.3 Burki 11 28.86 7.10 20 31.20 6.67 TRUE 300 2 30 1 0 Switzer­land
2014 Bur­k­i.4 Burki 11 28.86 7.10 23 27.61 6.82 FALSE 300 2 30 1 0 Switzer­land
2014 Pu­gin Pu­gin 14 40.29 2.30 15 41.33 1.97 FALSE 600 3 30 1 1 Switzer­land
2014 Hor­vat Hor­vat 14 48 5.68 15 47 7.49 FALSE 225 2 45 0 0 Slove­nia
2014 Heffer­nan Heffer­nan 9 32.78 2.91 10 31 3.06 TRUE 450 3 20 0 140 Canada
2013 Han­cock Han­cock 20 9.32 3.47 20 10.44 4.35 TRUE 810 1 30 1 50 USA
2015 Waris Waris 15 16.4 2.8 16 15.9 3.0 TRUE 675 3 30 0 76 Fin­land
2015 Ban­iqued Ban­iqued 42 10.125 3.0275 45 10.25 3.2824 TRUE 240 1 10 0 700 USA
2015 Ku­per.1 Ku­per 18 23.6 6.1 9 20.7 7.3 FALSE 150 1 20 1 45 Ger­many
2015 Ku­per.2 Ku­per 18 24.9 5.5 9 20.7 7.3 FALSE 150 1 20 0 45 Ger­many
2016 Lin­deløv.1 Lin­deløv 9 10.67 2.5 9 9.89 3.79 TRUE 260 1 10 3 0 Den­mark
2016 Lin­deløv.2 Lin­deløv 8 3.25 2.71 9 4.11 1.36 TRUE 260 1 10 3 0 Den­mark
2015 Schwar­b.1 Schwarb 26 0.11 2.5 26 0.69 3.2 FALSE 480 1 10 3 80 USA
2015 Schwar­b.2.1 Schwarb 22 0.23 2.5 10.5 -0.36 2.1 FALSE 480 1 10 1 80 USA
2015 Schwar­b.2.2 Schwarb 23 0.91 3.1 10.5 -0.36 2.1 FALSE 480 1 10 2 80 USA
2016 Heinzel.1 Heinzel.2 15 17.73 1.20 14 16.43 1.00 FALSE 540 2 7.5 1 1 Ger­many
2016 Lawlor-Sav­age Lawlor 27 11.48 2.98 30 11.02 2.34 TRUE 391 1 20 0 0 Canada
2016 Min­ear.1 Min­ear 15.5 24.1 4.5 26 24 5.1 TRUE 400 1 15 1 300 USA
2016 Min­ear.2 Min­ear 15.5 24.1 4.5 37 22.8 4.6 TRUE 400 1 15 1 300 USA
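
As a rough illustration of how one row of this table turns into an effect size, here is a minimal base-R sketch of the standardized mean difference and its approximate sampling variance. The helper name `smd` is hypothetical, and the small-sample correction uses the common approximate formula; metafor may differ in small details, so treat this as a sketch rather than the exact computation behind the results.

```r
# Standardized mean difference for one two-group comparison, from summary
# statistics. `smd` is a hypothetical helper name, not a metafor function.
smd <- function(n.e, mean.e, sd.e, n.c, mean.c, sd.c) {
  # pooled standard deviation of the two groups
  sd.pooled <- sqrt(((n.e - 1) * sd.e^2 + (n.c - 1) * sd.c^2) / (n.e + n.c - 2))
  d <- (mean.e - mean.c) / sd.pooled           # Cohen's d
  # approximate small-sample correction factor (Hedges' g)
  J <- 1 - 3 / (4 * (n.e + n.c - 2) - 1)
  g <- J * d
  # large-sample approximation to the sampling variance of g
  v <- (n.e + n.c) / (n.e * n.c) + g^2 / (2 * (n.e + n.c))
  list(d = d, g = g, v = v)
}

# The Rudebeck 2012 row from the table above
rudebeck <- smd(n.e = 27, mean.e = 9.52, sd.e = 2.03,
                n.c = 28, mean.c = 7.75, sd.c = 2.53)
print(rudebeck)
```

The variance `v` is what determines how much weight a study gets when the effect sizes are pooled: small studies have large variances and so count for less.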


The re­sult of the meta-analy­sis:

Random-Effects Model (k = 74; tau^2 estimator: REML)

tau^2 (estimated amount of total heterogeneity): 0.1103 (SE = 0.0424)
tau (square root of estimated tau^2 value):      0.3321
I^2 (total heterogeneity / total variability):   43.83%
H^2 (total variability / sampling variability):  1.78

Test for Heterogeneity:
Q(df = 73) = 144.2388, p-val < .0001

Model Results:

estimate       se     zval     pval    ci.lb    ci.ub
  0.3460   0.0598   5.7836   <.0001   0.2288   0.4633

To depict the random-effects model in a more graphic form, we use the “forest plot”:

forest(res1, slab = paste(dnb$study, dnb$year, sep = ", "))

The overall effect is somewhat strong. But there seem to be substantial differences between studies: this heterogeneity may be what is showing up as a high τ² and I²; and indeed, if we look at the computed SMDs, we see one sample with d = 2.59 (!) and some instances of d < 0. The high heterogeneity means that the fixed-effects model is inappropriate, as clearly the studies are not all measuring the same effect, so we use a random-effects model.
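
To make the heterogeneity statistics concrete, the following base-R sketch pools a handful of rows from the data table with a random-effects model. Two caveats: it uses only six single-comparison studies, so its numbers will not match the full k = 74 result, and it uses the simpler DerSimonian-Laird estimator of τ² rather than the REML estimator reported above. The function name `random_effects_dl` is hypothetical.

```r
# Six single-comparison rows from the data table (Qiu, Rudebeck, Salminen,
# Colom, Oelhafen, Savage); an illustrative subset, not the full dataset.
n.e    <- c(9, 27, 13, 28, 28, 23)
mean.e <- c(132.1, 9.52, 13.7, 37.25, 18.7, 11.61)
sd.e   <- c(3.2, 2.03, 2.2, 6.23, 3.75, 2.5)
n.c    <- c(10, 28, 9, 28, 15, 27)
mean.c <- c(130, 7.75, 10.9, 35.46, 19.9, 11.21)
sd.c   <- c(5.3, 2.53, 4.3, 8.26, 4.7, 2.5)

# Standardized mean differences and their approximate sampling variances
sd.pooled <- sqrt(((n.e - 1) * sd.e^2 + (n.c - 1) * sd.c^2) / (n.e + n.c - 2))
yi <- (mean.e - mean.c) / sd.pooled
vi <- (n.e + n.c) / (n.e * n.c) + yi^2 / (2 * (n.e + n.c))

# DerSimonian-Laird random-effects pooling (assumed estimator; not REML)
random_effects_dl <- function(yi, vi) {
  k    <- length(yi)
  wi   <- 1 / vi                              # fixed-effects weights
  fe   <- sum(wi * yi) / sum(wi)              # fixed-effects estimate
  Q    <- sum(wi * (yi - fe)^2)               # Cochran's Q
  C    <- sum(wi) - sum(wi^2) / sum(wi)
  tau2 <- max(0, (Q - (k - 1)) / C)           # between-study variance
  I2   <- max(0, (Q - (k - 1)) / Q) * 100     # % of variability from heterogeneity
  ws   <- 1 / (vi + tau2)                     # random-effects weights
  est  <- sum(ws * yi) / sum(ws)
  se   <- sqrt(1 / sum(ws))
  list(estimate = est, se = se,
       ci = est + c(-1, 1) * qnorm(0.975) * se,
       tau2 = tau2, I2 = I2, Q = Q)
}

res <- random_effects_dl(yi, vi)
print(res)
```

When τ² is estimated as 0 the random-effects weights collapse to the fixed-effects weights; the larger τ² is, the more evenly the studies are weighted regardless of their size.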

The con­fi­dence in­ter­val ex­cludes ze­ro, so one might con­clude that n-back does in­crease IQ scores. From a Bayesian stand­point, it’s worth point­ing out that this is not nearly as con­clu­sive as it seems, for sev­eral rea­sons:

  1. Published research can be weak; meta-analyses are generally believed to be biased towards larger effects due to systematic biases like publication bias.

  2. our prior that any par­tic­u­lar in­ter­ven­tion would in­crease the un­der­ly­ing gen­uine fluid in­tel­li­gence is ex­tremely small, as scores or hun­dreds of at­tempts to in­crease IQ over the past cen­tury have all even­tu­ally turned out to be fail­ures, with few ex­cep­tions (eg pre-na­tal or iron sup­ple­men­ta­tion), so strong ev­i­dence is nec­es­sary to con­clude that a par­tic­u­lar at­tempt is one of those ex­tremely rare ex­cep­tions. As the say­ing goes, “ex­tra­or­di­nary claims re­quire ex­tra­or­di­nary ev­i­dence”. David Ham­brick ex­plains it in­for­mally:

    …Yet I and many other intelligence researchers are skeptical of this research. Before anyone spends any more time and money looking for a quick and easy way to boost intelligence, it’s important to explain why we’re not sold on the idea…Does this [Jaeggi et al 2008] sound like an extraordinary claim? It should. There have been many attempts to demonstrate large, lasting gains in intelligence through educational interventions, with few successes. When gains in intelligence have been achieved, they have been modest and the result of many years of effort. For instance, in a University of North Carolina study known as the Abecedarian Early Intervention Project, children received an intensive educational intervention from infancy to age 5 designed to increase intelligence[1]. In follow-up tests, these children showed an advantage of six I.Q. points over a control group (and as adults, they were four times more likely to graduate from college). By contrast, the increase implied by the findings of the Jaeggi study was six I.Q. points after only six hours of training - an I.Q. point an hour. Though the Jaeggi results are intriguing, many researchers have failed to demonstrate statistically significant gains in intelligence using other, similar cognitive training programs, like Cogmed’s… We shouldn’t be surprised if extraordinary claims of quick gains in intelligence turn out to be wrong. Most extraordinary claims are.

  3. it’s not clear that just because IQ tests like Raven’s are valid and useful for measuring levels of intelligence, an increase on the tests can be interpreted as an increase of intelligence; intelligence poses unique problems of its own in any attempt to show increases in the latent variable of gf rather than just the raw scores of tests (which can be, essentially, ‘gamed’). Haier 2014 analogizes claims of breakthrough IQ increases to the initial reports of cold fusion and comments:

    The basic misunderstanding is assuming that intelligence test scores are units of measurement like inches or liters or grams. They are not. Inches, liters and grams are ratio scales where zero means zero and 100 units are twice 50 units. Intelligence test scores estimate a construct using interval scales and have meaning only relative to other people of the same age and sex. People with high scores generally do better on a broad range of mental ability tests, but someone with an IQ score of 130 is not 30% smarter than someone with an IQ score of 100…This makes simple interpretation of intelligence test score changes impossible. Most recent studies that have claimed increases in intelligence after a cognitive training intervention rely on comparing an intelligence test score before the intervention to a second score after the intervention. If there is an average change score increase for the training group that is statistically significant (using a dependent t-test or similar statistical test), this is treated as evidence that intelligence has increased. This reasoning is correct if one is measuring ratio scales like inches, liters or grams before and after some intervention (assuming suitable and reliable instruments like rulers to avoid erroneous Cold Fusion-like conclusions that apparently were based on faulty heat measurement); it is not correct for intelligence test scores on interval scales that only estimate a relative rank order rather than measure the construct of intelligence…. Studies that use a single test to estimate intelligence before and after an intervention are using less reliable and more variable scores (bigger standard errors) than studies that combine scores from a battery of tests…. Speaking about science, Carl Sagan observed that extraordinary claims require extraordinary evidence. So far, we do not have it for claims about increasing intelligence after cognitive training or, for that matter, any other manipulation or treatment, including early childhood education. Small statistically significant changes in test scores may be important observations about attention or memory or some other elemental cognitive variable or a specific mental ability assessed with a ratio scale like milliseconds, but they are not sufficient proof that general intelligence has changed.

    For statistical background on how one should be measuring changes on a latent variable like intelligence and running intervention studies, see Cronbach & Furby 1970 & Moreau et al 2016; for examples of past IQ interventions which fade out, see Protzko 2015; for examples of past IQ interventions which prove not to be on g when analyzed in a latent variable approach, see Jensen’s informal comments on the Milwaukee Project (Jensen 1989), te Nijenhuis et al 2007, te Nijenhuis et al 2014, te Nijenhuis et al 2015, Nutley et al 2011, Shipstead et al 2012, Colom et al 2013, Ritchie et al 2015, Estrada et al 2015, Bailey et al 2017 (and in a null, the composite score in Baniqued et al 2015).

This skep­ti­cal at­ti­tude is rel­e­vant to our ex­am­i­na­tion of mod­er­a­tors.


Control groups

A ma­jor crit­i­cism of n-back stud­ies is that the effect is be­ing man­u­fac­tured by the method­olog­i­cal prob­lem of some stud­ies us­ing a no-con­tact or pas­sive con­trol group rather than an ac­tive con­trol group. (Pas­sive con­trols know they re­ceived no in­ter­ven­tion and that the re­searchers don’t ex­pect them to do bet­ter on the post-test, which may re­duce their efforts & lower their scores.)

The review Morrison & Chein 2011[2] noted that no-contact control groups limited the validity of such studies, a criticism that was echoed with greater force by Shipstead, Redick, & Engle 2012. The WM training meta-analysis then confirmed that use of no-contact controls inflated the effect size estimates[3]. This is similar to Zehnder et al 2009’s results in the aged, Rapport et al 2013’s blind vs unblinded ratings in WM/executive training of ADHD, Long et al 2019’s inflation of emotional regulation benefits from DNB training, and WM training in young children (Sala & Gobet 2017a find passive control groups inflate effect sizes by g = 0.12, and are further critical in Sala & Gobet 2017b). It is also consistent with the “dodo bird verdict” and the increase of d = 0.2 found across many kinds of psychological therapies (but inconsistent with the g = 0.20 vs g = 0.26 of Lampit et al 2014), and with De Quidt et al 2018 demonstrating that strong demand effects can reach as high as d = 1 (with a mean d = 0.6, quite close to the actual passive DNB effect).

So I won­dered if this held true for the sub­set of n-back & IQ stud­ies. (Age is an in­ter­est­ing mod­er­a­tor in Mel­by-Lervåg & Hulme 2013, but in the fol­low­ing DNB & IQ stud­ies there is only 1 study in­volv­ing chil­dren - all the oth­ers are adults or young adult­s.) Each study has been coded ap­pro­pri­ate­ly, and we can ask whether it mat­ters:

Mixed-Effects Model (k = 74; tau^2 estimator: REML)

tau^2 (estimated amount of residual heterogeneity):     0.0803 (SE = 0.0373)
tau (square root of estimated tau^2 value):             0.2834
I^2 (residual heterogeneity / unaccounted variability): 36.14%
H^2 (unaccounted variability / sampling variability):   1.57

Test for Residual Heterogeneity:
QE(df = 72) = 129.2820, p-val < .0001

Test of Moderators (coefficient(s) 1,2):
QM(df = 2) = 46.5977, p-val < .0001

Model Results:

                     estimate      se    zval    pval    ci.lb   ci.ub
factor(active)FALSE    0.4895  0.0738  6.6310  <.0001   0.3448  0.6342
factor(active)TRUE     0.1397  0.0862  1.6211  0.1050  -0.0292  0.3085

The active/passive moderator confirms the criticism: lack of active control groups is responsible for a large chunk of the overall effect, with the confidence intervals not overlapping. The effect with passive control groups is a medium-large d = 0.5, while with active control groups the IQ gains shrink to a small effect of d = 0.14 (whose 95% CI does not exclude d = 0).
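
As a back-of-the-envelope check on how different the two subgroup estimates are, one can compute a Wald-type z-test directly from the estimates and standard errors reported in the model output above. This sketch assumes the two estimates can be treated as independent, which ignores studies that contribute comparisons to both subgroups, so it is only an approximation.

```r
# Reported subgroup estimates and standard errors from the model output above
est.passive <- 0.4895; se.passive <- 0.0738
est.active  <- 0.1397; se.active  <- 0.0862

# Wald-type z-test for the difference between the two estimates,
# under the (assumed) independence of the two subgroup estimates
delta    <- est.passive - est.active
se.delta <- sqrt(se.passive^2 + se.active^2)
z        <- delta / se.delta
p        <- 2 * pnorm(-abs(z))
ci.delta <- delta + c(-1, 1) * qnorm(0.975) * se.delta

print(round(c(delta = delta, se = se.delta, z = z, p = p), 4))
print(round(ci.delta, 4))
```

A direct test of the difference like this is a stricter check than eyeballing whether two confidence intervals overlap, since intervals can overlap slightly even when the difference itself is statistically-significant.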

We can see the differ­ence by split­ting a for­est plot on pas­sive vs ac­tive:

Figure: the visibly different groups of passive, then active, studies, plotted on the same axis.

This is damaging to the case that dual n-back increases intelligence, since it is unclear whether it even increases a particular test score. Not only do the better studies find a drastically smaller effect, they are not sufficiently powered to find such a small effect at all, even aggregated in a meta-analysis: their power is ~11%, which is dismal indeed when compared to the usual benchmark of 80%, and leads to worries that even that small effect is too high an estimate and that the active control studies are aberrant somehow in being subject to a winner’s curse or to other biases. (Because many studies used convenient passive control groups and the passive effect size is 3x larger, they in aggregate are well-powered at 82%; however, we already know they are skewed upwards, so we don’t care if we can detect a biased effect or not.) In particular, Boot et al 2013 argues that active control groups do not suffice to identify the true causal effect because the subjects in the active control group can still have different expectations than the experimental group, and the groups’ differing awareness & expectations can cause differing performance on tests; they suggest recording expectancies (somewhat similar to Redick et al 2013), checking for a dose-response relationship (see the following section for whether dose-response exists for dual n-back/IQ), and using different experimental designs which actively manipulate subject expectations to identify how much effects are inflated by remaining placebo/expectancy effects.

The active estimate of d = 0.14 does allow us to estimate how many subjects a simple[4] two-group experiment with an active control group would require in order for it to be well-powered (80%) to detect an effect: a total n of >1600 subjects (805 in each group).
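
A sample-size figure of this kind can be sketched with base R's `power.t.test`. The choices below (a two-sided two-sample t-test at alpha = 0.05, and the unrounded active-control estimate of d = 0.1397 from the model output) are my assumptions about the setup, so the result should land close to, but not necessarily exactly on, the number quoted above.

```r
# Per-group sample size needed to detect the active-control estimate
# (d = 0.1397) with 80% power; assumes a two-sided two-sample t-test
# at alpha = 0.05 and equal group sizes.
pw <- power.t.test(delta = 0.1397, sd = 1, power = 0.80,
                   sig.level = 0.05, type = "two.sample",
                   alternative = "two.sided")
n.per.group <- ceiling(pw$n)
print(c(per.group = n.per.group, total = 2 * n.per.group))
```

Because the required n scales with 1/d², even a small overestimate of the true effect makes this an optimistic lower bound on the necessary sample size.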

Training time

Jaeggi et al 2008 observed a dose-response to training, where those who trained the longest apparently improved the most. Ever since, this has been cited as a factor in which studies will observe gains or as an explanation of why some studies did not see improvements: perhaps they just didn’t do enough training. metafor is able to look at the number of minutes subjects in each study trained for to see if there’s any clear linear relationship:

         estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.3961  0.1226   3.2299  0.0012   0.1558  0.6365
mods      -0.0001  0.0002  -0.4640  0.6427  -0.0006  0.0004

The es­ti­mate of the re­la­tion­ship is that there is none at all: the es­ti­mated co­effi­cient has a large p-val­ue, and fur­ther, that co­effi­cient is neg­a­tive. This may seem ini­tially im­plau­si­ble but if we graph the time spent train­ing per study with the fi­nal (un­weight­ed) effect size, we see why:

plot(dnb$training, res1$yi)

IQ test time

Sim­i­lar­ly, Moody 2009 iden­ti­fied the 10 minute test-time or “speed­ing” of the RAPM as a con­cern in whether far trans­fer ac­tu­ally hap­pened; after col­lect­ing the al­lot­ted test time for the stud­ies, we can like­wise look for whether there is an in­verse re­la­tion­ship (the more time given to sub­jects on the IQ test, the smaller their IQ gain­s):

         estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.4197  0.1379   3.0435  0.0023   0.1494  0.6899
mods      -0.0036  0.0061  -0.5874  0.5570  -0.0154  0.0083

A tiny slope which is also non-s­ta­tis­ti­cal­ly-sig­nifi­cant; graph­ing the (un­weight­ed) stud­ies sug­gests as much:

plot(dnb$speed, res1$yi)

Training type

One question of interest both for issues of validity and for effective training is whether the existing studies show larger effects for a particular kind of n-back training: dual (visual & audio; labeled 0), single (visual; labeled 1), single (audio; labeled 2), or mixed (labeled 3)? If visual single n-back turns in the largest effects, that is troubling since it’s also the one most resembling a matrix IQ test. Checking against the 4 kinds of n-back training:

Mixed-Effects Model (k = 74; tau^2 estimator: REML)

tau^2 (estimated amount of residual heterogeneity):     0.1029 (SE = 0.0421)
tau (square root of estimated tau^2 value):             0.3208
I^2 (residual heterogeneity / unaccounted variability): 41.94%
H^2 (unaccounted variability / sampling variability):   1.72

Test for Residual Heterogeneity:
QE(df = 70) = 135.5275, p-val < .0001

Test of Moderators (coefficient(s) 1,2,3,4):
QM(df = 4) = 39.1393, p-val < .0001

Model Results:

                 estimate      se     zval    pval    ci.lb   ci.ub
factor(N.back)0    0.4219  0.0747   5.6454  <.0001   0.2754  0.5684
factor(N.back)1    0.2300  0.1102   2.0876  0.0368   0.0141  0.4459
factor(N.back)2    0.4255  0.2586   1.6458  0.0998  -0.0812  0.9323
factor(N.back)3   -0.1325  0.2946  -0.4497  0.6529  -0.7099  0.4449

There are not enough stud­ies us­ing the other kinds of n-back to say any­thing con­clu­sive other than there seem to be differ­ences, but it’s in­ter­est­ing that sin­gle vi­sual n-back has weaker re­sults so far.

Payment/extrinsic motivation

In a 2013 talk, “‘Brain Train­ing: Cur­rent Chal­lenges and Po­ten­tial Res­o­lu­tions’, with Su­sanne Jaeg­gi, PhD”, Jaeggi sug­gests

Ex­trin­sic re­ward can un­der­mine peo­ple’s in­trin­sic mo­ti­va­tion. If ex­trin­sic re­ward is cru­cial, then its in­flu­ence should be vis­i­ble in our da­ta.

I investigated payment as a moderator. Payment seems to actually be quite rare in n-back studies (in part because it’s so common in psychology to just recruit students with course credit or extra credit), and so the result is that as a moderator payment currently has a negative effect which is statistically-significant when payment is treated as a boolean variable but tiny and non-statistically-significant when regressing on the total payment amount. More interestingly, it seems that the negative sign may be partly driven by payment being associated with higher-quality studies using active control groups, because when you look at the interaction, payment in a study with an active control group has a positive (though non-statistically-significant) interaction term, partially offsetting the negative effect of payment.

More specifi­cal­ly, if we check pay­ment as a bi­nary vari­able, we get a de­crease which is sta­tis­ti­cal­ly-sig­nifi­cant:

                      estimate      se     zval    pval    ci.lb   ci.ub
intrcpt                 0.4514  0.0776   5.8168  <.0001   0.2993   0.6035
as.logical(paid)TRUE   -0.2424  0.1164  -2.0828  0.0373  -0.4706  -0.0143

If we in­stead regress against the to­tal pay­ment size (log­i­cal­ly, larger pay­ments would dis­cour­age par­tic­i­pants more), the effect of each ad­di­tional dol­lar is tiny and 0 is far from ex­cluded as the co­effi­cient:

         estimate      se     zval    pval    ci.lb   ci.ub
intrcpt    0.3753  0.0647   5.7976  <.0001   0.2484  0.5022
paid      -0.0004  0.0004  -1.1633  0.2447  -0.0012  0.0003

Why would treating payment as a binary category yield a major result when there is only a small slope within the paid studies? It would be odd if n-back could achieve the holy grail of increasing intelligence, but the effect vanishes as soon as you pay subjects anything, whether $1 or $1000.
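
To make the tension between the two regressions concrete, one can ask how large a payment the linear model would need before it predicted the same drop as the boolean model. The following is a rough sketch using only the rounded coefficients printed above; the two models were fit separately (with different intercepts) and the slope is rounded to a single significant digit, so the break-even figure is an order-of-magnitude illustration rather than an estimate. The payment sizes are hypothetical.

```r
# Rounded coefficients copied from the two meta-regressions above
binary_drop      <- -0.2424  # change in SMD for any payment at all (boolean moderator)
slope_per_dollar <- -0.0004  # change in SMD per additional dollar (continuous moderator)

# Payment at which the linear model predicts the same drop as the boolean model
breakeven_dollars <- binary_drop / slope_per_dollar

# Drops predicted by the linear model for a few hypothetical payment sizes
payments    <- c(1, 100, 1000)
linear_drop <- slope_per_dollar * payments

print(breakeven_dollars)
print(data.frame(payment = payments, linear_drop = linear_drop))
```

On these rounded numbers the linear model needs a payment of roughly $600 to match the boolean drop, while a $1 payment predicts essentially no change, which is the oddity the paragraph above points at.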

As I’ve men­tioned be­fore, the differ­ence in effect size be­tween ac­tive and pas­sive con­trol groups is quite strik­ing, and I no­ticed that eg. the Redick et al 2012 ex­per­i­ment paid sub­jects a lot of money to put up with all its tests and en­sure sub­ject re­ten­tion & Thomp­son et al 2013 paid a lot to put up with the fMRI ma­chine and long train­ing ses­sions, and like­wise with Oel­hafen et al 2013 and Ban­iqued et al 2015 etc; so what hap­pens if we look for an in­ter­ac­tion?

                                 estimate      se     zval    pval    ci.lb    ci.ub
intrcpt                            0.6244  0.0971   6.4309  <.0001   0.4341   0.8147
activeTRUE                        -0.4013  0.1468  -2.7342  0.0063  -0.6890  -0.1136
as.logical(paid)TRUE              -0.2977  0.1427  -2.0860  0.0370  -0.5774  -0.0180
activeTRUE:as.logical(paid)TRUE    0.1039  0.2194   0.4737  0.6357  -0.3262   0.5340

Active control groups cut the observed effect of n-back by more than half, as before, and payment decreases the effect size; but in studies which use active control groups and also pay subjects, the interaction term is slightly positive again (though far from statistically-significant), which seems a little curious if we buy the story about extrinsic motivation crowding out intrinsic and defeating any gains.
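
The interaction output is easier to read as predicted effect sizes for each of the four kinds of study. A minimal sketch in base R, assuming only the rounded point estimates printed above (it ignores their standard errors entirely, and the interaction term in particular has a very wide confidence interval); the names `b` and `cells` are my own.

```r
# Rounded coefficients copied from the interaction model output above
b <- c(intercept = 0.6244, active = -0.4013, paid = -0.2977, active_paid = 0.1039)

# The four combinations of control-group type and payment
cells <- expand.grid(active = c(FALSE, TRUE), paid = c(FALSE, TRUE))

# Predicted SMD: intercept, plus each term that applies to the cell
cells$predicted <- with(cells,
    b[["intercept"]] + b[["active"]] * active + b[["paid"]] * paid +
    b[["active_paid"]] * (active & paid))

print(cells)
```

On these point estimates the positive interaction only partly offsets the negative payment coefficient: within active-control studies the net change associated with payment is the sum of the `paid` and `active_paid` terms.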


N-back has been presented in some popular & academic media in an entirely uncritical & positive light: ignoring the overwhelming failure of intelligence interventions in the past, not citing the failures to replicate, and giving short shrift to the criticisms which have been made. (Examples include the NYT, WSJ, Scientific American, & Nisbett et al 2012.) One researcher told me that a reviewer savaged their work, asserting that n-back works and thus their null result meant only that they did something wrong. So it’s worth investigating, to the extent we can, whether there is a publication bias towards publishing only positive results.

20-odd studies (some quite small) is considered medium-sized for a meta-analysis, but that many does permit us to generate funnel plots, or check for possible publication bias via the trim-and-fill method.

Funnel plot

test for funnel plot asymmetry: z = 3.0010, p = 0.0027

The asym­me­try has reached sta­tis­ti­cal-sig­nifi­cance, so let’s vi­su­al­ize it:


This looks reasonably good, although we see that studies are crowding the edges of the funnel. We know that the studies with passive control groups show twice the effect-size of those with active control groups; is this related? If we plot the residual left after correcting for active vs passive, the funnel plot improves a lot (Stephenson remains an outlier):

Mixed-effects plot of stan­dard er­ror ver­sus effect size after mod­er­a­tor cor­rec­tion.


The trim-and-fill es­ti­mate:

Estimated number of missing studies on the left side: 0 (SE = 4.8908)

Graph­ing it:


Overall, the results suggest that this particular (comprehensive) collection of DNB studies does not suffer from serious publication bias after taking into account the active/passive moderator.


Go­ing through them, I must note:

  • Jaeggi 2008: group-level data pro­vided by Jaeggi to Redick for Redick et al 2013; the 8-ses­sion group in­cluded both ac­tive & pas­sive con­trols, so ex­per­i­men­tal DNB group was split in half. IQ test time is based on the de­scrip­tion in Redick et al 2012:

    In ad­di­tion, the 19-ses­sion groups were 20 min to com­plete BOMAT, whereas the 12- and 17-ses­sion groups re­ceived only 10 min (S. M. Jaeg­gi, per­sonal com­mu­ni­ca­tion, May 25, 2011). As shown in Fig­ure 2, the use of the short time limit in the 12- and 17-ses­sion stud­ies pro­duced sub­stan­tially lower scores than the 19-ses­sion study.

  • po­lar: con­trol, 2nd scores: 23,27,19,15,12,35,36,34; ex­per­i­ment, 2nd scores: 30,35,33,33,32,30,35,33,35,33,34,30,33

  • Jaeggi 2010: used BOMAT scores; should I some­how pool RAPM with BOMAT? Con­trol group split.

  • Jaeggi 2011: used SPM (a Raven’s); should I some­how pool the TONI?

  • Schweizer 2011: used the ad­justed fi­nal scores as sug­gested by the au­thors due to po­ten­tial pre-ex­ist­ing differ­ences in their con­trol & ex­per­i­men­tal groups:

    …This raises the possibility that the relative gains in Gf in the training versus control groups may be to some extent an artefact of baseline differences. However, the interactive effect of transfer as a function of group remained [statistically-]significant even after more closely matching the training and control groups for pre-training RPM scores (by removing the highest scoring controls) F(1, 30) = 3.66, P = 0.032, ηp² = 0.10. The adjusted means (standard deviations) for the control and training groups were now 27.20 (1.93), 26.63 (2.60) at pre-training (t(43) = 1.29, P > 0.05) and 26.50 (4.50), 27.07 (2.16) at post-training, respectively.

  • Stephen­son data from pg79/95; means are post-s­cores on Raven’s. I am omit­ting Stephen­son scores on WASI, Cat­tel­l’s Cul­ture Fair Test, & BETA III Ma­trix Rea­son­ing sub­set be­cause metafor does not sup­port mul­ti­vari­ate meta-analy­ses and in­clud­ing them as sep­a­rate stud­ies would be sta­tis­ti­cally il­le­git­i­mate. The ac­tive and pas­sive con­trol groups were split into thirds over each of the 3 n-back train­ing reg­i­mens, and each train­ing reg­i­men split in half over the ac­tive & pas­sive con­trols.

    The split­ting is worth dis­cus­sion. Some of these stud­ies have mul­ti­ple ex­per­i­men­tal groups, con­trol groups, or both. A crit­i­cism of early stud­ies was the use of no-con­tact con­trol groups - the con­trol groups did noth­ing ex­cept be tested twice, and it was sug­gested that the ex­per­i­men­tal group gains might be in part solely be­cause they are do­ing a task, any task, and the con­trol group should be do­ing some non-WM task as well. The WM meta-analy­sis Mel­by-Lervåg & Hulme 2013 checked for this and found that use of no-con­tact con­trol groups led to a much larger es­ti­mate of effect size than stud­ies which did use an ac­tive con­trol. When try­ing to in­cor­po­rate such a mul­ti­-part ex­per­i­ment, one can­not just copy con­trols as the Cochrane Hand­book points out:

    One ap­proach that must be avoided is sim­ply to en­ter sev­eral com­par­isons into the meta-analy­sis when these have one or more in­ter­ven­tion groups in com­mon. This ‘dou­ble-counts’ the par­tic­i­pants in the ‘shared’ in­ter­ven­tion group(s), and cre­ates a unit-of-analy­sis er­ror due to the un­ad­dressed cor­re­la­tion be­tween the es­ti­mated in­ter­ven­tion effects from mul­ti­ple com­par­isons (see Chap­ter 9, Sec­tion 9.3).

    Just drop­ping one con­trol or ex­per­i­men­tal group weak­ens the meta-analy­sis, and may bias it as well if not done sys­tem­at­i­cal­ly. I have used one of its sug­gested ap­proaches which ac­cepts some ad­di­tional er­ror in ex­change for greater power in check­ing this pos­si­ble ac­tive ver­sus no-con­tact dis­tinc­tion, in which we in­stead split the shared group:

    A fur­ther pos­si­bil­ity is to in­clude each pair-wise com­par­i­son sep­a­rate­ly, but with shared in­ter­ven­tion groups di­vided out ap­prox­i­mately evenly among the com­par­isons. For ex­am­ple, if a trial com­pares 121 pa­tients re­ceiv­ing acupunc­ture with 124 pa­tients re­ceiv­ing sham acupunc­ture and 117 pa­tients re­ceiv­ing no acupunc­ture, then two com­par­isons (of, say, 61 ‘acupunc­ture’ against 124 ‘sham acupunc­ture’, and of 60 ‘acupunc­ture’ against 117 ‘no in­ter­ven­tion’) might be en­tered into the meta-analy­sis. For di­choto­mous out­comes, both the num­ber of events and the to­tal num­ber of pa­tients would be di­vided up. For con­tin­u­ous out­comes, only the to­tal num­ber of par­tic­i­pants would be di­vided up and the means and stan­dard de­vi­a­tions left un­changed. This method only par­tially over­comes the unit-of-analy­sis er­ror (be­cause the re­sult­ing com­par­isons re­main cor­re­lat­ed) so is not gen­er­ally rec­om­mend­ed. A po­ten­tial ad­van­tage of this ap­proach, how­ev­er, would be that ap­prox­i­mate in­ves­ti­ga­tions of het­ero­gene­ity across in­ter­ven­tion arms are pos­si­ble (for ex­am­ple, in the case of the ex­am­ple here, the differ­ence be­tween us­ing sham acupunc­ture and no in­ter­ven­tion as a con­trol group).

  • Chooi: the rel­e­vant ta­ble was pro­vided in pri­vate com­mu­ni­ca­tion; I split each ex­per­i­men­tal group in half to pair it up with the ac­tive and pas­sive con­trol groups which trained the same num­ber of days

  • Takeuchi et al 2012: sub­jects were trained on 3 WM tasks in ad­di­tion to DNB for 27 days, 30-60 min­utes; RAPM scores used, BOMAT & Tanaka B-type in­tel­li­gence test scores omit­ted

  • Jaušovec 2012: IQ test time was cal­cu­lated based on the de­scrip­tion

    Used were 50 test items - 25 easy (Ad­vanced Pro­gres­sive Ma­tri­ces Set I - 12 items and the B Set of the Col­ored Pro­gres­sive Ma­tri­ces), and 25 diffi­cult items (Ad­vanced Pro­gres­sive Ma­tri­ces Set II, items 12-36). Par­tic­i­pants saw a fig­ural ma­trix with the lower right en­try miss­ing. They had to de­ter­mine which of the four op­tions fit­ted into the miss­ing space. The tasks were pre­sented on a com­puter screen (po­si­tioned about 80-100 cm in front of the re­spon­den­t), at fixed 10 or 14 s in­ter­stim­u­lus in­ter­vals. They were ex­posed for 6 s (easy) or 10 s (d­iffi­cult) fol­low­ing a 2-s in­ter­val, when a cross was pre­sent­ed. Dur­ing this time the par­tic­i­pants were in­structed to press a but­ton on a re­sponse pad (1-4) which in­di­cated their an­swer.


  • Zhong 2011: “dual at­ten­tion chan­nel” task omit­ted, dual and sin­gle n-back scores kept un­pooled and con­trols split across the 2; I thank Emile Kroger for his trans­la­tions of key parts of the the­sis. Un­able to get whether IQ test was ad­min­is­tered speed­ed. Zhong 2011 ap­pears to have repli­cated Jaeggi 2008’s train­ing time.

  • Jonas­son 2011 omit­ted for lack­ing any mea­sure of IQ

  • Preece 2011 omit­ted; only the Fig­ure Weights sub­test from the WAIS was re­port­ed, but RAPM scores were taken and pub­lished in the in­ac­ces­si­ble Palmer 2011

  • Kundu et al 2011 and Kundu 2012 have been split into 2 experiments based on the raw data provided to me by Kundu: the smaller one using the full RAPM 36-matrix 40-minute test, and the larger an 18-matrix 10-minute test. (Kundu 2012 subsumes 2011, but the procedure was changed partway on Jaeggi’s advice, so they are separate results.) The final results were reported in Kundu et al 2013.

  • Redick et al: n-back split over pas­sive con­trol & ac­tive con­trol (vi­sual search) RAPM post scores (omit­ted SPM and Cat­tell Cul­ture-Fair Test)

  • Var­tan­ian 2013: short n-back in­ter­ven­tion not adap­tive; I did not spec­ify in ad­vance that the n-back in­ter­ven­tions had to be adap­tive (pos­si­bly some of the oth­ers were not) and sub­jects trained for <50 min­utes, so the lack of adap­tive­ness may not have mat­tered.

  • Heinzel et al 2013 men­tions con­duct­ing a pi­lot study; I con­tacted Heinzel and no mea­sures like Raven’s were taken in it. The main study used both SPM and also “the Fig­ural Re­la­tions sub­test of a Ger­man in­tel­li­gence test (LPS)”; as usu­al, I drop al­ter­na­tives in fa­vor of the more com­mon test.

  • Thomp­son et al 2013; used RAPM rather than WAIS; treated the “mul­ti­ple ob­ject track­ing”/MOT as an ac­tive con­trol group since it did not sta­tis­ti­cal­ly-sig­nifi­cantly im­prove RAPM scores

  • Smith et al 2013; 4 groups. Con­sis­tent with all the other stud­ies, I have ig­nored the post-post-tests (a 4-week fol­lowup). To deal with the 4 groups, I have com­bined the Brain Age & strat­egy game groups into a sin­gle ac­tive con­trol group, and then split the dual n-back group in half over the orig­i­nal pas­sive con­trol group and the new ac­tive con­trol group.

  • Jaeggi 2005: Jaeggi et al 2008 is not clear about the source of its 4 ex­per­i­ments, but one of them seems to be ex­per­i­ment 7 from Jaeggi 2005, so I omit ex­per­i­ment 7 to avoid any dou­ble-count­ing, and only use ex­per­i­ment 6.

  • Oel­hafen 2013: merged the lure and non-lure dual n-back groups

  • Sprenger 2013: split the ac­tive con­trol group over the n-back­+Floop group and the combo group; train­ing time refers solely to time spent on n-back and not the other tasks

  • Jaeggi et al 2013: administered the RAPM, Cattell’s Culture Fair Test / CFT, & BOMAT; in keeping with all previous choices, I used the RAPM data; the active control group is split over the two kinds of n-back training groups. This was previously included in the meta-analysis as Jaeggi4 based on the poster but deleted once it was formally published as Jaeggi et al 2013.

  • Clouter 2013: means & stan­dard de­vi­a­tions, pay­ment amount, and train­ing time were pro­vided by him; stu­dent par­tic­i­pants could be paid in credit points as well as mon­ey, so to get $115, I com­bined the base pay­ment of $75 with the no-cred­it-points op­tion of an­other $40 (rather than try to as­sign any mon­e­tary value to credit points or fig­ure out an av­er­age pay­ment)

  • Colom 2013: the ex­per­i­ment group was trained with 2 weeks of vi­sual sin­gle n-back, then 2 weeks of au­di­tory n-back, then 2 weeks of dual n-back; since the IQ tests were sim­ply pre/post it’s im­pos­si­ble to break out the train­ing gains sep­a­rate­ly, so I coded the n-back type as dual n-back since vi­su­al+au­di­tory sin­gle n-back = dual n-back, and they fin­ished with dual n-back. Colom ad­min­is­tered 3 IQ tests - RAPM, DAT-AR, & PMA-R; as usu­al, I used RAPM.

  • Sav­age 2013: ad­min­is­tered RAPM & CCFT; as usu­al, only used RAPM

  • Stepankova et al 2013: ad­min­is­tered the Block De­sign (BD) & Ma­trix Rea­son­ing (MR) non­ver­bal sub­tests of the WAIS-III

  • Nuss­baumer et al 2013: ad­min­is­tered RAPM & I-S-T 2000 R tests; par­tic­i­pants were trained in 3 con­di­tions: non-adap­tive sin­gle 1-back (“low”); non-adap­tive sin­gle 3-back (“medium”); adap­tive dual n-back (“high”). Given the low train­ing time, I de­cided to drop the medium group as be­ing un­clear whether the in­ter­ven­tion is do­ing any­thing, and treat the high group as the ex­per­i­men­tal group vs a “low” ac­tive con­trol group.

  • Burki et al 2014: split ex­per­i­men­tal groups across the pas­sive & ac­tive con­trols; young and old groups were left un­pooled be­cause they used RAPM and RSPM re­spec­tively

  • Pu­gin et al 2014: used the TONI-IV IQ test from the post-test, but not the fol­lowup scores; the pa­per re­ports the age-ad­justed scaled val­ues, but Fiona Pu­gin pro­vided me the raw TONI-IV scores

  • Schmiedek et al 2014: “Younger Adults Show Long-Term Effects of Cognitive Training on Broad Cognitive Abilities Over 2 Years” had subjects practice on 12 different tasks, one of which was single (spatial) n-back, but it was not adaptive (“difficulty levels for the EM and WM tasks were individualized using different presentation times (PT) based on pre-test performance”); due to the lack of adaptiveness and the 11 other tasks participants trained, I am omitting their data.

  • Hor­vat 2014: I thank Sergei & Google Trans­late for help­ing with ex­tract­ing rel­e­vant de­tails from the body of the the­sis, which is writ­ten in Sloven­ian. The train­ing time was 20-25 min­utes in 10 ses­sions or 225 min­utes to­tal. The SPM test scores can be found on pg57, Ta­ble 4; the non-speed­ing of the SPM is dis­cussed on pg44; the es­ti­mate of $0 in com­pen­sa­tion is based on the ab­sence of ref­er­ences to the lo­cal cur­rency (eu­ros), the ci­ta­tion on pg32-33 of Jaeg­gi’s the­o­ries on pay­ment block­ing trans­fer due to in­trin­sic vs ex­trin­sic mo­ti­va­tion, and the gen­eral rar­ity of pay­ing young sub­jects like the 13-15yos used by Hor­vat.

  • Ban­iqued et al 2015: note that to­tal com­pen­sa­tion is twice as high as one would es­ti­mate from the train­ing time times hourly pay; see sup­ple­men­tary. They ad­min­is­tered sev­eral mea­sures of Gf, and as usual I have ex­tracted only the one clos­est to be­ing a ma­trix test and prob­a­bly most g-load­ed, which is the ma­trix test they ad­min­is­tered. That par­tic­u­lar test is based on the RAPM, so it is coded as RAPM. The full train­ing in­volved 6 tasks, one of which was DNB; the train­ing time is coded as just the time spent on DNB (ie the to­tal train­ing time di­vided by 6). Means & SDs of post-test ma­trix scores were ex­tracted from the raw data pro­vided by the au­thors.

  • Ku­per & Kar­bach 2015: con­trol group split

  • Schwarb et al 2015: re­ports 2 ex­per­i­ments, both of whose RAPM data is re­ported as “change scores” (the av­er­age test-retest gain & the SDs of the paired differ­ences); the Cochrane Hand­book ar­gues that change scores can be in­cluded as-is in a meta-analy­sis us­ing post-test vari­ables as the differ­ence be­tween the post-tests of controls/experimentals will be­come equiv­a­lent to change scores.

    The sec­ond ex­per­i­ment has 3 groups: a pas­sive con­trol group, a vi­sual n-back, and an au­di­tory n-back. The con­trol group is split.

  • Heinzel et al 2016 does not spec­ify how much par­tic­i­pants were paid

  • Lawlor-Sav­age & Goghari 2016 recorded post-tests for both RAPM & CCFT; I use RAPM as usual

  • Min­ear et al 2016: two ac­tive con­trol groups (Star­craft and non-adap­tive n-back), split con­trol group. They also ad­min­is­tered two sub­tests of the ETS Kit of Fac­tor-Ref­er­enced Tests, the RPM, and Cat­tell Cul­ture Fair Tests, so I use the RPM
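
The group-splitting approach quoted from the Cochrane Handbook in the Stephenson note above (divide only the shared group's sample size across the comparisons, leaving its mean and standard deviation unchanged) can be sketched in a few lines of base R. The helper `split_shared_n` is a hypothetical name of my own, not something from metafor or the Handbook, and giving the leftover participant to the first comparison is an arbitrary choice; the numbers are the Handbook's acupuncture example.

```r
# Split a shared group's sample size approximately evenly across k comparisons;
# the group's mean and SD are reused unchanged in each comparison (continuous outcomes)
split_shared_n <- function(n, k) {
    base  <- n %/% k   # participants every comparison receives
    extra <- n %% k    # leftover participants, handed to the first comparisons
    base + as.integer(seq_len(k) <= extra)
}

# Cochrane Handbook example: 121 acupuncture patients shared by 2 comparisons
acupuncture <- split_shared_n(121, 2)

comparisons <- data.frame(
    control = c("sham acupuncture", "no intervention"),
    n.e     = acupuncture,    # split experimental group
    n.c     = c(124, 117))    # each control group used once, unsplit
print(comparisons)
```

As the Handbook quote warns, the resulting comparisons remain correlated, so this only partially addresses the unit-of-analysis problem.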

The fol­low­ing au­thors had their stud­ies omit­ted and have been con­tacted for clar­i­fi­ca­tion:

  • Sei­dler, Jaeggi et al 2010 (ex­per­i­men­tal: n = 47; con­trol: n = 45) did not re­port means or stan­dard de­vi­a­tions
  • Preece’s su­per­vis­ing re­searcher
  • Min­ear
  • Katz


Run as R --slave --file=dnb.r | less:

set.seed(7777) # for reproducible numbers
# install.packages(c("XML", "metafor")) # if not installed
library(XML)     # readHTMLTable
library(metafor) # rma, forest, funnel, regtest, trimfill
# TODO: factor out common parts of `png` (& make less square), and `rma` calls
dnb <- readHTMLTable("/tmp/burl8109K_P.html",
                     colClasses = c("integer", "character", "factor",
                                    "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
                                    "logical", "integer", "factor", "integer", "factor", "integer", "factor"))[[1]]

cat("Basic random-effects meta-analysis of all studies:\n")
res1 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = dnb); res1

png(file="~/wiki/images/dnb/forest.png", width = 680, height = 800)
forest(res1, slab = paste(dnb$study, dnb$year, sep = ", "))
invisible(dev.off()) # close the device so the PNG is written before post-processing

cat("Random-effects with passive/active control groups moderator:\n")
res0 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = dnb,
            mods = ~ factor(active) - 1); res0
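cat("Power analysis of the passive control group sample, then the active:\n")
# n = average of the total control & experimental sample sizes (mean() needs a single vector)
with(dnb[dnb$active == FALSE, ],
     power.t.test(n = mean(c(sum(n.c), sum(n.e))), delta = res0$b[1], sd = mean(c(sd.c, sd.e))))
with(dnb[dnb$active == TRUE, ],
     power.t.test(n = mean(c(sum(n.c), sum(n.e))), delta = res0$b[2], sd = mean(c(sd.c, sd.e))))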
cat("Calculate necessary sample size for active-control experiment of 80% power:")
power.t.test(delta = res0$b[2], power=0.8)

png(file="~/wiki/images/dnb/forest-activevspassive.png", width = 750, height = 1100)
par(mfrow=c(2,1), mar=c(1,4.5,1,0))
active <- dnb[dnb$active==TRUE,]
passive <- dnb[dnb$active==FALSE,]
forest(rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = passive),
       order=order(passive$year), slab=paste(passive$study, passive$year, sep = ", "),
       mlab="Studies with passive control groups")
forest(rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
            data = active),
       order=order(active$year), slab=paste(active$study, active$year, sep = ", "),
       mlab="Studies with active control groups")
invisible(dev.off())

cat("Random-effects, regressing against training time:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = training)

png(file="~/wiki/images/dnb/effectsizevstrainingtime.png", width = 580, height = 600)
plot(dnb$training, res1$yi, xlab="Minutes spent n-backing", ylab="SMD")
invisible(dev.off())

cat("Random-effects, regressing against administered speed of IQ tests:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
    data = dnb, mods=speed)

png(file="~/wiki/images/dnb/iqspeedversuseffect.png", width = 580, height = 600)
plot(dnb$speed, res1$yi)
invisible(dev.off())

cat("Random-effects, regressing against kind of n-back training:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
                   data = dnb, mods=~factor(N.back)-1)

cat("*, payment as a binary moderator:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ as.logical(paid))
cat("*, regressing against payment amount:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ paid)
cat("*, checking for interaction with higher experiment quality:\n")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ active * as.logical(paid))

cat("Test Au's claim about active control groups being a proxy for international differences:")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ active + I(country=="USA"))

cat("Look at all covariates together:")
rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c, data = dnb,
    mods = ~ active + I(country=="USA") + training + IQ + speed + N.back + paid + I(paid==0))

cat("Publication bias checks using funnel plots:\n")
regtest(res1, model = "rma", predictor = "sei", ni = NULL)

png(file="~/wiki/images/dnb/funnel.png", width = 580, height = 600)
funnel(res1)
invisible(dev.off())

# If we plot the residual left after correcting for active vs passive, the funnel plot improves
png(file="~/wiki/images/dnb/funnel-moderators.png", width = 580, height = 600)
res2 <- rma(measure="SMD", m1i = mean.e, m2i = mean.c, sd1i = sd.e, sd2i = sd.c, n1i = n.e, n2i = n.c,
           data = dnb, mods = ~ factor(active) - 1)
funnel(res2)
invisible(dev.off())

cat("Little publication bias, but let's see trim-and-fill's suggestions anyway:\n")
tf <- trimfill(res1); tf

png(file="~/wiki/images/dnb/funnel-trimfill.png", width = 580, height = 600)
funnel(tf)
invisible(dev.off())

# optimize the generated graphs by cropping whitespace & losslessly compressing them
system(paste('cd ~/wiki/images/dnb/ &&',
             'for f in *.png; do convert "$f" -crop',
             '`nice convert "$f" -virtual-pixel edge -blur 0x5 -fuzz 10% -trim -format',
             '\'%wx%h%O\' info:` +repage "$f"; done'))
system("optipng -quiet -o9 -fix ~/wiki/images/dnb/*.png", ignore.stdout = TRUE)

  1. To give an idea of how in­ten­sive, it cost ~$14,000 (2002) or $18,200 (2013) per child per year.↩︎

  2. from pg 54-55:

    An is­sue of great con­cern is that ob­served test score im­prove­ments may be achieved through var­i­ous in­flu­ences on the ex­pec­ta­tions or level of in­vest­ment of par­tic­i­pants, rather than on the in­ten­tion­ally tar­geted cog­ni­tive process­es. One form of ex­pectancy bias re­lates to the placebo effects ob­served in clin­i­cal drug stud­ies. Sim­ply the be­lief that train­ing should have a pos­i­tive in­flu­ence on cog­ni­tion may pro­duce a mea­sur­able im­prove­ment on post-train­ing per­for­mance. Par­tic­i­pants may also be affected by the de­mand char­ac­ter­is­tics of the train­ing study. Name­ly, in an­tic­i­pa­tion of the goals of the ex­per­i­ment, par­tic­i­pants may put forth a greater effort in their per­for­mance dur­ing the post-train­ing as­sess­ment. Fi­nal­ly, ap­par­ent train­ing-re­lated im­prove­ments may re­flect differ­ences in par­tic­i­pants’ level of cog­ni­tive in­vest­ment dur­ing the pe­riod of train­ing. Since par­tic­i­pants in the ex­per­i­men­tal group often en­gage in more men­tally tax­ing ac­tiv­i­ties, they may work harder dur­ing post-train­ing as­sess­ments to as­sure the value of their ear­lier efforts.

    Even seemingly small differences between control and training groups may yield measurable differences in effort, expectancy, and investment, but these confounds are most problematic in studies that use no control group (Holmes et al., 2010; Mezzacappa & Buckner, 2010), or only a no-contact control group; a cohort of participants that completes the pre and post training assessments but has no contact with the lab in the interval between assessments. Comparison to a no-contact control group is a prevalent practice among studies reporting positive far transfer (Chein & Morrison, 2010; Jaeggi et al., 2008; Olesen et al., 2004; Schmiedek et al., 2010; Vogt et al., 2009). This approach allows experimenters to rule out simple test-retest improvements, but is potentially vulnerable to confounding due to expectancy effects. An alternative approach is to use a “control training” group, which matches the treatment group on time and effort invested, but is not expected to benefit from training (groups receiving control training are sometimes referred to as “active control” groups). For instance, in Persson and Reuter-Lorenz (2008), both trained and control subjects practiced a common set of memory tasks, but difficulty and level of interference were higher in the experimental group’s training. Similarly, control training groups completing a non-adaptive form of training (Holmes et al., 2009; Klingberg et al., 2005) or receiving a smaller dose of training (one-third of the training trials as the experimental group, e.g., Klingberg et al., 2002) have been used as comparison groups in assessments of Cogmed variants. One recent study conducted in young children found no differences in performance gains demonstrated by a no-contact control group and a control group that completed a non-adaptive version of training, suggesting that the former approach may be adequate (Thorell et al., 2009). We note, however, that regardless of the control procedures used, not a single study conducted to date has simultaneously controlled motivation, commitment, and difficulty, nor has any study attempted to demonstrate explicitly (for instance through subject self-report) that the control subjects experienced a comparable degree of motivation or commitment, or had similar expectancies about the benefits of training

  3. De­tails about the treated (ac­tive) vs un­treated (pas­sive) differ­ences in Mel­by-Lervåg & Hulme 2013:

    …This con­trols for ap­par­ently ir­rel­e­vant as­pects of the train­ing that might nev­er­the­less affect per­for­mance. In a re­view of ed­u­ca­tional re­search Clark and Sug­rue (1991) [“Re­search on in­struc­tional me­dia, 1978-1988” in ed An­glin 1991, In­struc­tional Tech­nol­ogy] es­ti­mated that such Hawthorne or ex­pectancy effects ac­count for up to 0.3 stan­dard de­vi­a­tions im­prove­ment in many stud­ies.

    The meta-an­a­lytic re­sults:

    1. Ver­bal WM: d = 0.99 vs 0.69
    2. Vi­su­ospa­tial WM: 0.63 vs 0.36
    3. Non­ver­bal abil­i­ties: 0 vs 0.38
    4. Stroop: 0.30 vs 0.35

    There was a significant difference in outcome between studies with treated controls and studies with only untreated controls. In fact, the studies with treated control groups had a mean effect size close to zero (notably, the 95% confidence intervals for treated controls were d = -0.24 to 0.22, and for untreated controls d = 0.23 to 0.56). More specifically, several of the research groups demonstrated significant transfer effects to nonverbal ability when they used untreated control groups but did not replicate such effects when a treated control group was used (e.g., Jaeggi, Buschkuehl, Jonides, & Shah, 2011; Nutley, Söderqvist, Bryde, Thorell, Humphreys, & Klingberg, 2011). Similarly, the difference in outcome between randomized and nonrandomized studies was close to significance (p = 0.06), with the randomized studies giving a mean effect size that was close to zero. Notably, all the studies with untreated control groups are also nonrandomized; it is apparent from these analyses that the use of randomized designs with an alternative treatment control group are essential to give unambiguous evidence for training effects in this field.

  4. A more com­pli­cated analy­sis, in­clud­ing base­line per­for­mance and other co­vari­ates, would do bet­ter.↩︎