How Often Does Correlation=Causality?

Compilation of studies comparing observational results with randomized experimental results on the same intervention, drawn from medicine/economics/psychology, indicating that a large fraction of the time (although probably not a majority) correlation ≠ causality.
statistics, causality, insight-porn
2014-06-24–2019-04-16 · in progress · certainty: log · importance: 10


“How study design affects outcomes in comparisons of therapy. I: Medical”, Colditz et al 1989:

We analysed 113 reports published in 1980 in a sample of medical journals to relate features of study design to the magnitude of gains attributed to new therapies over old. Overall we rated 87% of new therapies as improvements over standard therapies. The mean gain (measured by the Mann-Whitney statistic) was relatively constant across study designs, except for non-randomized controlled trials with sequential assignment to therapy, which showed a significantly higher likelihood that a patient would do better on the innovation than on standard therapy (p = 0.004). Randomized controlled trials that did not use a double-blind design had a higher likelihood of showing a gain for the innovation than did double-blind trials (p = 0.02). Any evaluation of an innovation may include both bias and the true efficacy of the new therapy, therefore we may consider making adjustments for the average bias associated with a study design. When interpreting an evaluation of a new therapy, readers should consider the impact of the following average adjustments to the Mann-Whitney statistic: for trials with non-random sequential assignment a decrease of 0.15, for non-double-blind randomized controlled trials a decrease of 0.11.
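The Mann-Whitney statistic used throughout these comparisons is just the probability that a randomly chosen patient on the new therapy does better than a randomly chosen patient on the standard therapy (0.50 = a toss-up). A minimal sketch of the statistic, and of Colditz et al's suggested average bias adjustments, using made-up outcome scores rather than anything from the paper:

```python
import numpy as np

def mann_whitney_stat(new, std):
    """Estimate P(new > std) + 0.5 * P(tie) over all patient pairs.
    0.50 means the innovation and the standard therapy are a toss-up."""
    new, std = np.asarray(new), np.asarray(std)
    greater = (new[:, None] > std[None, :]).mean()
    ties = (new[:, None] == std[None, :]).mean()
    return greater + 0.5 * ties

# Hypothetical outcome scores for the two arms (higher = better)
innovation = [7, 8, 6, 9, 7]
standard   = [5, 6, 7, 4, 6]
theta = mann_whitney_stat(innovation, standard)

# Colditz et al's suggested *average* adjustments for design bias:
adjusted_nonrandom = theta - 0.15  # non-random sequential assignment
adjusted_unblinded = theta - 0.11  # non-double-blind RCT
```

Since the adjustments are averages estimated across many trials, applying them to any single study is only a rough heuristic, not a correction.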

“How study design affects outcomes in comparisons of therapy. II: Surgical”, Miller et al 1989:

We analysed the results of 221 comparisons of an innovation with a standard treatment in surgery published in 6 leading surgery journals in 1983 to relate features of study design to the magnitude of gain. For each comparison we measured the gain attributed to the innovation over the standard therapy by the Mann-Whitney statistic and the difference in proportion of treatment successes. For primary treatments (aimed at curing or ameliorating a patient’s principal disease), an average gain of 0.56 was produced by 20 randomized controlled trials. This was less than the 0.62 average for four non-randomized controlled trials, 0.63 for 19 externally controlled trials, and 0.57 for 73 record reviews (0.50 represents a toss-up between innovation and standard). For secondary therapies (used to prevent or treat complications of therapy), the average gain was 0.53 for 61 randomized controlled trials, 0.58 for eleven non-randomized controlled trials, 0.54 for eight externally controlled trials, and 0.55 for 18 record reviews.

“Developing improved observational methods for evaluating therapeutic effectiveness”, Horwitz et al 1990:

…The specific topic investigated was the prophylactic effectiveness of β-blocker therapy after an acute myocardial infarction. To accomplish the research objective, three sets of data were compared. First, we developed a restricted cohort based on the eligibility criteria of the randomized clinical trial; second, we assembled an expanded cohort using the same design principles except for not restricting patient eligibility; and third, we used the data from the Beta Blocker Heart Attack Trial (BHAT), whose results served as the gold standard for comparison. In this research, the treatment difference in death rates for the restricted cohort and the BHAT trial was nearly identical. In contrast, the expanded cohort had a larger treatment difference than was observed in the BHAT trial. We also noted the important and largely neglected role that eligibility criteria may play in ensuring the validity of treatment comparisons and study outcomes….

“Choosing between randomised and non-randomised studies: a systematic review”, Britton et al 1998:

This review explored those issues related to the process of randomisation that may affect the validity of conclusions drawn from the results of RCTs and non-randomised studies. …Previous comparisons of RCTs and non-randomised studies: 18 papers that directly compared the results of RCTs and prospective non-randomised studies were found and analysed. No obvious patterns emerged; neither the RCTs nor the non-randomised studies consistently gave larger or smaller estimates of the treatment effect. The type of intervention did not appear to be influential, though more comparisons need to be conducted before definite conclusions can be drawn.

7 of the 18 papers found no [statistically-]significant differences between treatment effects from the two types of study. 5 of these 7 had adjusted results in the non-randomised studies for baseline prognostic differences. The remaining 11 papers reported [statistically-significant] differences which are summarised in Table 3.

7 studies obtained differences in the same direction but of significantly different magnitude. In 3, effect sizes were greater in the RCTs.

…However, the evidence reviewed here is extremely limited. It suggests that adjustment for baseline differences in arms of non-randomised studies will not necessarily result in similar effect sizes to those obtained from RCTs.

Egger et al 1998:

  • Meta-analysis of observational studies is as common as meta-analysis of controlled trials
  • Confounding and selection bias often distort the findings from observational studies
  • There is a danger that meta-analyses of observational data produce very precise but equally spurious results
  • The statistical combination of data should therefore not be a prominent component of reviews of observational studies
  • More is gained by carefully examining possible sources of heterogeneity between the results from observational studies
  • Reviews of any type of research and data should use a systematic approach, which is documented in a materials and methods section.

…The randomised controlled trial is the principal research design in the evaluation of medical interventions. However, aetiological hypotheses - for example, those relating common exposures to the occurrence of disease - cannot generally be tested in randomised experiments. Does breathing other people’s tobacco smoke cause lung cancer, drinking coffee cause coronary heart disease, and eating a diet rich in saturated fat cause breast cancer? Studies of such “menaces of daily life”6 use observational designs or examine the presumed biological mechanisms in the laboratory. In these situations the risks involved are generally small, but once a large proportion of the population is exposed, the potential public health implications of these associations - if they are causal - can be striking.

…If years later established interventions are incriminated with adverse effects, there will be ethical, political, and legal obstacles to the conduct of a new trial. Recent examples for such situations include the controversy surrounding a possible association between intramuscular administration of vitamin K to newborns and the risk of childhood cancer8 and whether oral contraceptives increase women’s risk of breast cancer.9

…Patients exposed to the factor under investigation may differ in several other aspects that are relevant to the risk of developing the disease in question. Consider, for example, smoking as a risk factor for suicide. Virtually all cohort studies have shown a positive association, with a dose-response relation being evident between the amount smoked and the probability of committing suicide.14-19 Figure 1 illustrates this for four prospective studies of middle aged men, including the massive cohort of patients screened for the multiple risk factor intervention trial. Based on over 390 000 men and almost five million years of follow up, a meta-analysis of these cohorts produces highly precise and significant estimates of the increase in suicide risk that is associated with smoking different daily amounts of cigarettes: relative rate for 1-14 cigarettes 1.43 (95% confidence interval 1.06 to 1.93), for 15-24 cigarettes 1.88 (1.53 to 2.32), >25 cigarettes 2.18 (1.82 to 2.61). On the basis of established criteria,20 many would consider the association to be causal - if only it were more plausible. Indeed, it is improbable that smoking is causally related to suicide.14 Rather, it is the social and mental states predisposing to suicide that are also associated with the habit of smoking.
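The mechanics behind such "highly precise" pooled estimates are ordinary inverse-variance fixed-effect meta-analysis on the log scale, with each study's standard error recovered from its 95% confidence interval. A sketch with hypothetical cohort results (not the actual MRFIT-era data); note that pooling narrows the confidence interval even when every input study shares the same confounding:

```python
import math

def se_from_ci(lo, hi):
    # SE of the log relative rate, recovered from a symmetric 95% CI
    return (math.log(hi) - math.log(lo)) / (2 * 1.96)

def fixed_effect_pool(estimates):
    """Inverse-variance weighted pooling of log relative rates.
    estimates: list of (rr, ci_lo, ci_hi) tuples."""
    w_sum = wx_sum = 0.0
    for rr, lo, hi in estimates:
        w = 1.0 / se_from_ci(lo, hi) ** 2
        w_sum += w
        wx_sum += w * math.log(rr)
    pooled = wx_sum / w_sum
    se = math.sqrt(1.0 / w_sum)
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))

# Hypothetical cohort results: the pooled CI is narrower than any
# single study's CI, regardless of whether the association is causal
studies = [(1.4, 1.1, 1.8), (1.6, 1.2, 2.1), (1.5, 1.3, 1.7)]
rr, lo, hi = fixed_effect_pool(studies)
```

This is exactly why Egger et al warn that meta-analysis of observational data can be "very precise but equally spurious": pooling reduces sampling error but does nothing about shared bias.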

…Beta carotene has antioxidant properties and could thus plausibly be expected to prevent carcinogenesis and atherogenesis by reducing oxidative damage to DNA and lipoproteins.27 Contrary to many other associations found in observational studies, this hypothesis could be, and was, tested in experimental studies. The findings of four large trials have recently been published.28-31 The results were disappointing and even - for the two trials conducted in men at high risk (smokers and workers exposed to asbestos)28,29 - disturbing. …With a fixed effects model, the meta-analysis of the cohort studies shows a significantly lower risk of cardiovascular death (relative risk reduction 31% (95% confidence interval 41% to 20%, p < 0.0001)) (fig 2). The results from the randomised trials, however, show a moderate adverse effect of β-carotene supplementation (relative increase in the risk of cardiovascular death 12% (4% to 22%, p = 0.005)). Similarly discrepant results between epidemiological studies and trials were observed for the incidence of and mortality from cancer. …Fig 2: Meta-analysis of association between β-carotene intake and cardiovascular mortality: results from observational studies show considerable benefit, whereas the findings from randomised controlled trials show an increase in the risk of death. Meta-analysis is by fixed effects model.

…However, even if adjustments for confounding factors have been made in the analysis, residual confounding remains a potentially serious problem in observational research. Residual confounding arises when a confounding factor cannot be measured with sufficient precision - which often occurs in epidemiological studies.22,23

…Implausibility of results, as in the case of smoking and suicide, rarely protects us from reaching misleading claims. It is generally easy to produce plausible explanations for the findings from observational research. In a cohort study of sex workers, for example, one group of researchers that investigated cofactors in transmission of HIV among heterosexual men and women found a strong association between oral contraceptives and HIV infection, which was independent of other factors.25 The authors hypothesised that, among other mechanisms, the risk of transmission could be increased with oral contraceptives due to “effects on the genital mucosa, such as increasing the area of ectopy and the potential for mucosal disruption during intercourse.” In a cross sectional study another group produced diametrically opposed findings, indicating that oral contraceptives protect against the virus.26 This was considered to be equally plausible, “since progesterone-containing oral contraceptives thicken cervical mucus, which might be expected to hamper the entry of HIV into the uterine cavity.” It is likely that confounding and bias had a role in producing these contradictory findings. This example should be kept in mind when assessing other seemingly plausible epidemiological associations.

…Several such situations are depicted in figure 3. Consider diet and breast cancer. The hypothesis from ecological analyses33 that higher intake of saturated fat could increase the risk of breast cancer generated much observational research, often with contradictory results. A comprehensive meta-analysis34 showed an association for case-control but not for cohort studies (odds ratio 1.36 for case-control studies versus relative rate 0.95 for cohort studies comparing highest with lowest category of saturated fat intake, p = 0.0002 for difference in our calculation) (fig 2). This discrepancy was also shown in two separate large collaborative pooled analyses of cohort and case-control studies.35,36

The most likely explanation for this situation is that biases in the recall of dietary items and in the selection of study participants have produced a spurious association in the case-control comparisons.36 That differential recall of past exposures may introduce bias is also evident from a meta-analysis of case-control studies of intermittent sunlight exposure and melanoma (fig 3).37 When studies were combined in which some degree of blinding to the study hypothesis was achieved, only a small and non-significant effect (odds ratio 1.17 (95% confidence interval 0.98 to 1.39)) was evident. Conversely, in studies without blinding, the effect was considerably greater and significant (1.84 (1.52 to 2.25)). The difference between these two estimates is unlikely to be a product of chance (p = 0.0004 in our calculation).
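The "p = 0.0004 in our calculation" style of comparison can be reproduced (approximately, given the rounding of the published intervals) by a z-test on the difference of the two log odds ratios, with each standard error recovered from its confidence interval:

```python
import math

def compare_odds_ratios(or1, ci1, or2, ci2):
    """Two-sided z-test for the difference between two independent
    odds ratios, each given as a point estimate plus its 95% CI."""
    def log_se(lo, hi):
        # SE of the log odds ratio, recovered from the 95% CI
        return (math.log(hi) - math.log(lo)) / (2 * 1.96)
    diff = math.log(or2) - math.log(or1)
    se = math.sqrt(log_se(*ci1) ** 2 + log_se(*ci2) ** 2)
    z = diff / se
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Blinded vs unblinded case-control studies of sunlight and melanoma
z, p = compare_odds_ratios(1.17, (0.98, 1.39), 1.84, (1.52, 2.25))
```

With these rounded inputs the test gives z ≈ 3.4 and p under 0.001, in the same neighbourhood as Egger et al's quoted p = 0.0004.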

The importance of the methods used for assessing exposure is further illustrated by a meta-analysis of cross sectional data of dietary calcium intake and blood pressure from 23 different studies.38 As shown in figure 3, the regression slope describing the change in systolic blood pressure (in mm Hg) per 100 mg of calcium intake is strongly influenced by the approach used for assessing the amount of calcium consumed. The association is small and only marginally significant with diet histories (slope −0.01 (−0.003 to −0.016)) but large and highly significant when food frequency questionnaires were used (−0.15 (−0.11 to −0.19)). With studies using 24 hour recall an intermediate result emerges (−0.06 (−0.09 to −0.03)). Diet histories assess patterns of usual intake over long periods of time and require an extensive interview with a nutritionist, whereas 24 hour recall and food frequency questionnaires are simpler methods that reflect current consumption.39 It is conceivable that different precision in the assessment of current calcium intake may explain the differences in the strength of the associations found, a statistical phenomenon known as regression dilution bias.40
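Regression dilution bias is easy to demonstrate by simulation: classical measurement error in the exposure attenuates the fitted slope toward zero by the factor var(true) / (var(true) + var(error)). A toy illustration with invented data, not the calcium/blood-pressure studies themselves:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = -0.15

# True (standardised) exposure and an outcome depending on it
exposure = rng.normal(0.0, 1.0, n)
outcome = true_slope * exposure + rng.normal(0.0, 1.0, n)

# Two measurement instruments for the same exposure:
precise = exposure                          # near-exact assessment
noisy = exposure + rng.normal(0.0, 1.5, n)  # error-prone questionnaire

slope_precise = np.polyfit(precise, outcome, 1)[0]
slope_noisy = np.polyfit(noisy, outcome, 1)[0]
# expected attenuation factor: 1 / (1 + 1.5**2) ≈ 0.31, so the noisy
# instrument recovers only about a third of the true slope
```

The noisier the instrument, the flatter the estimated slope, which is why the choice of dietary assessment method alone can move the apparent calcium/blood-pressure association by an order of magnitude.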

An important criterion supporting causality of associations is a dose-response relation. In occupational epidemiology the quest to show such an association can lead to very different groups of employees being compared. In a meta-analysis that examined the link between exposure to formaldehyde and cancer, funeral directors and embalmers (high exposure) were compared with anatomists and pathologists (intermediate to high exposure) and with industrial workers (low to high exposure, depending on job assignment).41 As shown in figure 3, there is a striking deficit of deaths from lung cancer among anatomists and pathologists (standardised mortality ratio 33 (95% confidence interval 22 to 47)), which is most likely to be due to a lower prevalence of smoking among this group. In this situation few would argue that formaldehyde protects against lung cancer. In other instances, however, such selection bias may be less obvious.

“Evaluating non-randomised intervention studies”, Deeks et al 2003:

In the systematic reviews, eight studies compared results of randomised and non-randomised studies across multiple interventions using meta-epidemiological techniques. A total of 194 tools were identified that could be or had been used to assess non-randomised studies. Sixty tools covered at least five of six pre-specified internal validity domains. Fourteen tools covered three of four core items of particular importance for non-randomised studies. Six tools were thought suitable for use in systematic reviews. Of 511 systematic reviews that included non-randomised studies, only 169 (33%) assessed study quality. Sixty-nine reviews investigated the impact of quality on study results in a quantitative manner. The new empirical studies estimated the bias associated with non-random allocation and found that the bias could lead to consistent over- or underestimations of treatment effects, and that the bias also increased variation in results for both historical and concurrent controls, owing to haphazard differences in case-mix between groups. The biases were large enough to lead studies falsely to conclude significant findings of benefit or harm.

…Conclusions: Results of non-randomised studies sometimes, but not always, differ from results of randomised studies of the same intervention. Non-randomised studies may still give seriously misleading results when treated and control groups appear similar in key prognostic factors. Standard methods of case-mix adjustment do not guarantee removal of bias. Residual confounding may be high even when good prognostic data are available, and in some situations adjusted results may appear more biased than unadjusted results.


  1. Three reviews were conducted to consider:

    • empirical evidence of bias associated with non-randomised studies
    • the content of quality assessment tools for non-randomised studies
    • the use of quality assessment in systematic reviews of non-randomised studies.

    These reviews were conducted systematically, identifying relevant literature through comprehensive searches across electronic databases, hand-searches and contact with experts.

  2. New empirical investigations were conducted generating non-randomised studies from two large, multi-centre RCTs by selectively resampling trial participants according to allocated treatment, centre and period. These were used to examine:

    • systematic bias introduced by the use of historical and non-randomised concurrent controls
    • whether results of non-randomised studies are more variable than results of RCTs
    • the ability of case-mix adjustment methods to correct for selection bias introduced by non-random allocation.

    The resampling design overcame particular problems of meta-confounding and variability of direction and magnitude of bias that hinder the interpretation of previous reviews.

The first systematic review looks at existing evidence of bias in non-randomised studies, critically evaluating previous methodological studies that have attempted to estimate and characterise differences in results between RCTs and non-randomised studies. Two further systematic reviews focus on the issue of quality assessment of non-randomised studies. The first identifies and evaluates tools that can be used to assess the quality of non-randomised studies. The second looks at ways that study quality has been assessed and addressed in systematic reviews of healthcare interventions that have included non-randomised studies. The two empirical investigations focus on the issue of selection bias in non-randomised studies. The first investigates the size and behaviour of selection bias in evaluations of two specific clinical interventions and the second assesses the degree to which case-mix adjustment corrects for selection bias.

Evidence about the importance of design features of RCTs has accumulated rapidly during recent years.19-21 This evidence has mainly been obtained by a method of investigation that has been termed meta-epidemiology, a powerful but simple technique of investigating variations in the results of RCTs of the same intervention according to features of their study design.22 The process involves first identifying substantial numbers of systematic reviews each containing RCTs both with and without the design feature of interest. Within each review, results are compared between the trials meeting and not meeting each design criterion. These comparisons are then aggregated across the reviews in a grand overall meta-analysis to obtain an estimate of the systematic bias removed by the design feature. For RCTs, the relative importance of proper randomisation, concealment of allocation and blinding have all been estimated using this technique.20,21 The results have been shown to be consistent across clinical fields,23 providing some evidence that meta-epidemiology may be a reliable investigative technique. The method has also been applied to investigate sources of bias in studies of diagnostic accuracy, where participant selection, independent testing and use of consistent reference standards have been identified as being the most important design features.24
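In code, the meta-epidemiological aggregation reduces to computing a within-review contrast (for example, a ratio of odds ratios between trials lacking and having the design feature) and then pooling those contrasts across reviews by inverse variance. A sketch with hypothetical inputs, not figures from any of the cited reviews:

```python
import math

def pooled_ratio_of_odds_ratios(reviews):
    """Fixed-effect pooling of within-review contrasts.
    Each review supplies (log_ror, se): the log ratio of odds ratios
    comparing trials *without* the design feature to trials *with* it.
    Returns the pooled ROR and its standard error on the log scale."""
    w_sum = wx_sum = 0.0
    for log_ror, se in reviews:
        w = 1.0 / se ** 2
        w_sum += w
        wx_sum += w * log_ror
    pooled = wx_sum / w_sum
    return math.exp(pooled), math.sqrt(1.0 / w_sum)

# Hypothetical contrasts from three reviews (unblinded vs blinded trials)
reviews = [(math.log(0.83), 0.10),
           (math.log(0.90), 0.15),
           (math.log(0.80), 0.12)]
ror, se = pooled_ratio_of_odds_ratios(reviews)
# ROR < 1 here would indicate that trials lacking the feature report
# systematically more favourable (smaller) odds ratios
```

Pooling contrasts *within* reviews, rather than comparing raw effect sizes across different interventions, is what protects the technique from meta-confounding.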

8 reviews were identified which fulfilled the inclusion criteria; seven considered medical interventions and one psychological interventions. Brief descriptions of the methods and findings of each review are given below, with summary details given in Table 2. There is substantial overlap in the interventions (and hence studies) that were included in the reviews of medical interventions (Table 3):

  • Sacks et al 1982, “Randomized versus historical controls for clinical trials”:

    Sacks and colleagues compared the results of RCTs with historically controlled trials (HCTs). The studies were identified in Chalmers’ personal collection of RCTs, HCTs and uncontrolled studies maintained since 1955 by searches of Index Medicus, Current Contents and references of reviews and papers in areas of particular medical interest (full list not stated). Six interventions were included for which at least two RCTs and two HCTs were identified [cirrhosis with oesophageal varices, coronary artery surgery, anticoagulants for acute myocardial infarction, 5-fluorouracil adjuvant therapy for colon cancer, bacille Calmette-Guérin vaccine (BCG) adjuvant immunotherapy and diethylstilbestrol for habitual abortion (Table 3)]. Trial results were classified as positive if there was either a statistically significant benefit or if the authors concluded benefit in the absence of statistical analysis, otherwise as negative. For each of the six interventions, a higher percentage of HCTs compared with RCTs concluded benefit: across all six interventions 20% of RCTs showed benefit compared with 79% of the HCTs.

  • Kunz and Oxman [Kunz & Oxman 1998] and Kunz, Vist and Oxman [Kunz et al 2002/2008, “Randomisation to protect against selection bias in healthcare trials (Cochrane Methodology Review)”]:

    Kunz and Oxman searched the literature for reviews that made empirical comparisons between the results of randomised and non-randomised studies. They included the results of the six comparisons in Sacks and colleagues’ study above, and results from a further five published comparisons [antiarrhythmic therapy for atrial fibrillation, allogenic leucocyte immunotherapy for recurrent miscarriage, contrast media for salpingography, hormonal therapy for cryptorchidism, and transcutaneous electrical nerve stimulation (TENS) for postoperative pain (Table 3)]. In some of the comparisons, RCTs were compared with truly observational studies and, in others, they were compared with quasi-experimental trials. A separate publication of anticoagulants for acute myocardial infarction already included in Sacks and colleagues’ review was also reviewed,30 as was a comparison of differences in control group event rates between randomised and non-randomised studies for treatments for six cancers (which does not fit within our inclusion criteria).31 The review was updated in 2002 including a further 11 comparisons, and published as a Cochrane methodology review.29 The results of each empirical evaluation were described, but no overall quantitative synthesis was carried out. The results showed differences between RCTs and non-randomised studies in 15 of the 23 comparisons, but with inconsistency in the direction and magnitude of the difference. It was noted that non-randomised studies overestimated more often than they underestimated treatment effects. …In 15 of 23 comparisons effects were larger in non-randomised studies, 4 studies had comparable results, whilst 4 reported smaller effects.

  • Britton, McKee, Black, McPherson, Sanderson and Bain25 [Britton et al 1998, “Choosing between randomised and non-randomised studies: a systematic review”]:

    Britton and colleagues searched for primary publications that made comparisons between single randomised and non-randomised studies (14 comparisons) and secondary publications (reviews) making similar comparisons (four comparisons). Both observational and quasi-experimental studies were included in the non-randomised category. They included all four of the secondary comparisons included in the review by Kunz and colleagues28 (Table 3). The single study comparisons included studies where a comparison was made between participants who were allocated to experimental treatment as part of a trial and a group who declined to participate, and studies of centres where simultaneous randomised and patient-preference studies had been undertaken of the same intervention. The studies were assessed to ensure that the randomised and non-randomised studies were comparable on several dimensions (Table 4). There were statistically significant differences between randomised and non-randomised studies for 11 of the 18 comparisons. The direction of these differences was inconsistent and the magnitude extremely variable. For some interventions the differences were very large. For example, in a review of treatments for acute non-lymphatic leukaemia, the risk ratio in RCTs was 24 compared with 3.7 in non-randomised studies (comparison 23 in Table 3). The impact of statistical adjustment for baseline imbalances in prognostic factors was investigated in two primary studies, and in four additional comparisons (coronary angioplasty versus bypass grafting, calcium antagonists for cardiovascular disease, malaria vaccines and stroke unit care: comparisons 25-28 in Table 3). In two of the six comparisons there was evidence that adjustment for prognostic factors led to improved concordance of results between randomised and non-randomised studies.

  • MacLehose, Reeves, Harvey, Sheldon, Russell and Black26 [MacLehose et al 2000, “A systematic review of comparisons of effect sizes derived from randomised and non-randomised studies”]:

    MacLehose and colleagues restricted their review to studies where results of randomised and non-randomised comparisons were reported together in a single paper, arguing that such comparisons are more likely to be of ‘like-with-like’ than those made between studies reported in separate papers. They included primary studies and also reviews that pooled results from several individual studies. Of the 14 comparisons included in their report, three were based on reviews (comparisons 3, 7 and 25 in Table 3) and the rest were results from comparisons within single studies. The non-randomised designs included comprehensive cohort studies, other observational studies and quasi-experimental designs. The ‘fairness’ or ‘quality’ of each of the comparisons made was assessed for comparability of patients, interventions and outcomes and additional study methodology (see Table 4). Although the authors did not categorise comparisons as showing equivalence or discrepancy, the differences in results were found to be significantly greater in comparisons ranked as being low quality. …In 14 of 35 comparisons the discrepancy in RR was <10%, in 5 comparisons it was >50%. Discrepancies were smaller in “fairer” comparisons.

  • Benson and Hartz32 [Benson & Hartz 2000, “A comparison of observational studies and randomized, controlled trials”]:

    Benson and Hartz evaluated 19 treatment comparisons (eight in common with Britton and colleagues25) for which they located at least one randomised and one observational study (defined as being a study where the treatment was not allocated for the purpose of research) in a search of MEDLINE and the databases in the Cochrane Library (Table 4). They only considered treatments administered by physicians. Across the 19 comparisons they found 53 observational and 83 randomised studies, the results of which were meta-analysed separately for each treatment comparison. Comparisons were made between the pooled estimates, noting whether the point estimate from the combined observational studies was within the confidence interval of the RCTs. They found only two instances where the observational and randomised studies did not meet this criterion.

  • Con­ca­to, Shah and Hor­witz33 [Con­cato et al 2000, “Ran­dom­ized, Con­trolled Tri­als, Ob­ser­va­tional Stud­ies, and the Hi­er­ar­chy of Re­search De­signs”]:

    Con­cato and col­leagues searched for meta-analy­ses of RCTs and of ob­ser­va­tional stud­ies (re­stricted to case-con­trol and con­cur­rent co­hort stud­ies) pub­lished in five lead­ing gen­eral med­ical jour­nals. They found only five com­par­isons where both types of study had been meta-analysed [BCG vac­ci­na­tion for tu­ber­cu­lo­sis (T­B), mam­mo­graphic screen­ing for breast can­cer mor­tal­i­ty, cho­les­terol lev­els and death from trau­ma, treat­ment of hy­per­ten­sion and stroke, treat­ment of hy­per­ten­sion and coro­nary heart dis­ease (CHD) (Table 3)] com­bin­ing a to­tal of 55 ran­domised and 44 ob­ser­va­tional stud­ies. They tab­u­lated the re­sults of meta-analy­ses of the ran­domised and the ob­ser­va­tional stud­ies and con­sid­ered the sim­i­lar­ity of the point es­ti­mates and the range of find­ings from the in­di­vid­ual stud­ies. In all five in­stances they noted the pooled re­sults of ran­domised and non-ran­domised stud­ies to be sim­i­lar. Where in­di­vid­ual study re­sults were avail­able, the range of the RCT re­sults was greater than the range of the ob­ser­va­tional re­sults.

    [Pocock & Elbourne criticism of the Benson and Concato studies.]

  • Ioannidis, Haidich, Pappa, Pantazis, Kokori, Tektonidou, Contopoulos-Ioannidis and Lau34 [Ioannidis et al 2001]:

    Ioannidis and colleagues searched for reviews that considered results of RCTs and non-randomised studies. In addition to searching MEDLINE they included systematic reviews published in the Cochrane Library, locating in total 45 comparisons. Comparisons of RCTs with both quasi-randomised and observational studies were included. All meta-analytical results were expressed as odds ratios; differences between randomised and non-randomised results were expressed as a ratio of odds ratios, and their statistical significance was calculated. Findings across the 45 topic areas were pooled, incorporating results from 240 RCTs and 168 non-randomised studies. Larger treatment effects were noted more often in non-randomised studies. In 15 cases (33%) there was at least a twofold variation in odds ratios, whereas in 16% there were statistically significant differences between the results of randomised and non-randomised studies. The authors also tested the heterogeneity of the results of the randomised and non-randomised studies for each topic. Significant heterogeneity was noted for 23% of the reviews of RCTs and for 41% of the reviews of non-randomised studies.

  • Lipsey and Wilson35 [Lipsey & Wilson 1993] and Wilson and Lipsey [Wilson & Lipsey 2001, “The role of method in treatment effectiveness research: evidence from meta-analysis”]36:

    Lipsey and Wilson searched for all meta-analyses of psychological interventions, broadly defined as treatments whose intention was to induce psychological change (whether emotional, attitudinal, cognitive or behavioural). Evaluations of individual components of interventions and broad interventional policies or organisational arrangements were excluded. Searches of psychology and sociology databases supported by manual searches identified a total of 302 meta-analyses, 76 of which contained both randomised and non-randomised comparative studies. Results were analysed in two ways. First, the average effect sizes of randomised and non-randomised studies were computed across the 74 reviews, and average effects were noted to be very slightly smaller for non-randomised than randomised studies. Second (and more usefully), the difference in effect sizes between randomised and non-randomised studies within each of the reviews was computed and plotted. This revealed both large over- and underestimates with non-randomised studies, with differences in effect sizes ranging from -0.60 to +0.77 standard deviations.

Three commonly cited studies were excluded from our review.37-39 Although these studies made comparisons between the results of randomised and non-randomised studies across many interventions, they did not match RCTs and non-randomised studies according to the intervention. Although they provide some information about the average findings of selected randomised and non-randomised studies, they did not consider whether there are differences in the results of RCTs and non-randomised studies of the same intervention.

Findings of the eight reviews: The eight reviews have drawn conflicting conclusions. Five of the eight reviews concluded that there are differences between the results of randomised and non-randomised studies in many but not all clinical areas, but without there being a consistent pattern indicating systematic bias.25,26,28,34,35 One of the eight reviews found an overestimation of effects in all areas studied.27 The final two concluded that the results of randomised and non-randomised studies were ‘remarkably similar’.32,33 Of the two reviews that considered the relative variability of randomised and non-randomised results, one concluded that RCTs were more consistent34 and the other that they were less consistent.33
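The agreement criteria used by these reviews are easy to make concrete. Below is a minimal Python sketch (all numbers hypothetical) of the two checks described above: whether the pooled observational point estimate falls inside the RCT confidence interval (the Benson & Hartz criterion), and a ratio-of-odds-ratios comparison with a twofold-variation flag in the style of Ioannidis et al 2001, assuming independent estimates on the log-odds scale:

```python
import math

def within_rct_ci(obs_or, rct_or, rct_log_se, z_crit=1.96):
    """Benson & Hartz-style check: does the pooled observational odds
    ratio fall inside the 95% CI of the pooled RCT odds ratio?"""
    lo = math.log(rct_or) - z_crit * rct_log_se
    hi = math.log(rct_or) + z_crit * rct_log_se
    return lo <= math.log(obs_or) <= hi

def compare_designs(rct_or, rct_log_se, obs_or, obs_log_se):
    """Ioannidis-style comparison: ratio of odds ratios (RCT/observational),
    a two-sided z-test on the log scale (treating the two pooled estimates
    as independent), and a flag for at least twofold variation."""
    log_ror = math.log(rct_or) - math.log(obs_or)
    se = math.hypot(rct_log_se, obs_log_se)
    z = log_ror / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    ror = math.exp(log_ror)
    return ror, p, (ror >= 2 or ror <= 0.5)

# Hypothetical topic: RCTs pool to OR 0.90, observational studies to OR 0.40.
ror, p, twofold = compare_designs(0.90, 0.10, 0.40, 0.15)
```

Here the designs disagree more than twofold and the difference is statistically significant, which is exactly the kind of topic the Ioannidis review tallies.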

“Contradicted and Initially Stronger Effects in Highly Cited Clinical Research”, Ioannidis 2005:

5 of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (p = 0.008)…Matched control studies did not have a significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results…Similarly, there is some evidence on disagreements between epidemiological studies and randomized trials.3-5

…For highly cited nonrandomized studies, subsequently published pertinent randomized trials and meta-analyses thereof were eligible regardless of sample size; nonrandomized evidence was also considered, if randomized trials were not available…5 of 6 highly cited nonrandomized studies had been contradicted or had initially stronger effects while this was seen in only 9 of 39 highly cited randomized trials (p = 0.008). Table 3 shows that trials with contradicted or initially stronger effects had significantly smaller sample sizes and tended to be older than those with replicated or unchallenged findings. There were no significant differences on the type of disease. The proportion of contradicted or initially stronger effects did not differ significantly across journals (p = 0.60)…Small studies using surrogate markers may also sometimes lead to erroneous clinical inferences.158 There were only 2 studies with typical surrogate markers among the highly cited studies examined herein, but both were subsequently contradicted in their clinical extrapolations about the efficacy of nitric oxide22 and hormone therapy.42

Box 2, “Contradicted and Initially Stronger Effects in Control Studies”. Contradicted Findings:

  • In a prospective cohort,91 vitamin A was inversely related to breast cancer (relative risk in the highest quintile, 0.84; 95% confidence interval [CI], 0.71-0.98) and vitamin A supplementation was associated with a reduced risk (p = 0.03) in women in the lowest quintile group; in a randomized trial128 exploring further the retinoid-breast cancer hypothesis, fenretinide treatment of women with breast cancer for 5 years had no effect on the incidence of second breast malignancies.
  • A trial (n = 51) showed that cladribine significantly improved the clinical scores of patients with chronic progressive multiple sclerosis.119 In a larger trial of 159 patients, no significant treatment effects were found for cladribine in terms of changes in clinical scores.129

Initially Stronger Effects:

  • A trial (n = 28) of aerosolized ribavirin in infants receiving mechanical ventilation for severe respiratory syncytial virus infection82 showed significant decreases in mechanical ventilation (4.9 vs 9.9 days) and hospital stay (13.3 vs 15.0 days). A meta-analysis of 3 trials (n = 104) showed a decrease of only 1.8 days in the duration of mechanical ventilation and a nonsignificant decrease of 1.9 days in duration of hospitalization.130
  • A trial (n = 406) of intermittent diazepam administered during fever to prevent recurrence of febrile seizures90 showed a significant 44% relative risk reduction in seizures. The effect was smaller in other trials and the overall risk reduction was no longer formally significant131; moreover, the safety profile of diazepam was deemed unfavorable to recommend routine preventive use.
  • A case-control and cohort study evaluation92 showed that the increased risk of sudden infant death syndrome among infants who sleep prone is increased by use of natural-fiber mattresses, swaddling, and heating in bedrooms. Several observational studies have been done since, and they have provided inconsistent results on these interventions; in particular, they disagree on the possible role of overheating.132
  • A trial of 54 children95 showed that the steroid budesonide significantly reduced the croup score by 2 points at 4 hours, and significantly decreased readmissions by 86%. A meta-analysis (n = 3736)133 showed a significant improvement in the Westley score at 6 hours (1.2 points) and 12 hours (1.9 points), but not at 24 hours. Fewer return visits and/or (re)admissions occurred in patients treated with glucocorticoids, but the relative risk reduction was only 50% (95% CI, 24%-64%).
  • A trial (n = 55) showed that misoprostol was as effective as dinoprostone for termination of second-trimester pregnancy and was associated with fewer adverse effects than dinoprostone.96 A subsequent trial134 showed equal efficacy, but a higher rate of adverse effects with misoprostol (74%) than with dinoprostone (47%).
  • A trial (n = 50) comparing botulinum toxin vs glyceryl trinitrate for chronic anal fissure concluded that both are effective alternatives to surgery but botulinum toxin is the more effective nonsurgical treatment (1 failure vs 9 failures with nitroglycerin).109 In a meta-analysis135 of 31 trials, botulinum toxin compared with placebo showed no significant efficacy (relative risk of failure, 0.75; 95% CI, 0.32-1.77), and was also no better than glyceryl trinitrate (relative risk of failure, 0.48; 95% CI, 0.21-1.10); surgery was more effective than medical therapy in curing fissure (relative risk of failure, 0.12; 95% CI, 0.07-0.22).
  • A trial of acetylcysteine (n = 83) showed that it was highly effective in preventing contrast nephropathy (90% relative risk reduction).110 There have been many more trials and many meta-analyses on this topic. The latest meta-analysis136 shows a nonsignificant 27% relative risk reduction with acetylcysteine.
  • A trial of 129 stunted Jamaican children found that both nutritional supplementation and psychosocial stimulation improved the mental development of stunted children; children who got both interventions had additive benefits and achieved scores close to those of nonstunted children.117 With long-term follow-up, however, it was found that the benefits were small and the 2 interventions no longer had additive effects.137

…It is possible that high-profile journals may tend to publish occasionally very striking findings and that this may lead to some difficulty in replicating some of these findings.163 Poynard et al [“Truth Survival in Clinical Research: An Evidence-Based Requiem?”] evaluated the conclusions of hepatology-related articles published between 1945 and 1999 and found that, overall, 60% of these conclusions were considered to be true in 2000 and that there was no difference between randomized and nonrandomized studies or high- vs low-quality studies. Allowing for somewhat different definitions, the higher rates of refutation and the generally worse performance of nonrandomized studies in the present analysis may stem from the fact that I focused on a selected sample of the most noticed and influential clinical research. For such highly cited studies, the turnaround of “truth” may be faster; in particular, highly cited nonrandomized studies may be more likely to be probed and challenged than nonrandomized studies published in the general literature.

“Comparison of Evidence on Harms of Medical Interventions in Randomized and Nonrandomized Studies”, Papanikolaou et al 2006:

Background: Information on major harms of medical interventions comes primarily from epidemiologic studies performed after licensing and marketing. Comparison with data from large-scale randomized trials is occasionally feasible. We compared evidence from randomized trials with that from epidemiologic studies to determine whether they give different estimates of risk for important harms of medical interventions.

Methods: We targeted well-defined, specific harms of various medical interventions for which data were already available from large-scale randomized trials (> 4000 subjects). Nonrandomized studies involving at least 4000 subjects addressing these same harms were retrieved through a search of MEDLINE. We compared the relative risks and absolute risk differences for specific harms in the randomized and nonrandomized studies.

Results: Eligible nonrandomized studies were found for 15 harms for which data were available from randomized trials addressing the same harms. Comparisons of relative risks between the study types were feasible for 13 of the 15 topics, and of absolute risk differences for 8 topics. The estimated increase in relative risk differed more than 2-fold between the randomized and nonrandomized studies for 7 (54%) of the 13 topics; the estimated increase in absolute risk differed more than 2-fold for 5 (62%) of the 8 topics. There was no clear predilection for randomized or nonrandomized studies to estimate greater relative risks, but usually (75% [6/8]) the randomized trials estimated larger absolute excess risks of harm than the nonrandomized studies did.

Interpretation: Nonrandomized studies are often conservative in estimating absolute risks of harms. It would be useful to compare and scrutinize the evidence on harms obtained from both randomized and nonrandomized studies.

…In total, data from nonrandomized studies could be juxtaposed against data from randomized trials for 15 of the 66 harms (Table 1). All of the studied harms were serious and clinically relevant. The interventions included drugs, vitamins, vaccines and surgical procedures. A large variety of prospective and retrospective approaches were used in the nonrandomized studies, including both controlled and uncontrolled designs (Table 1)…For 5 (38%) of the 13 topics for which estimated increases in relative risk could be compared, the increase was greater in the nonrandomized studies than in the respective randomized trials; for the other 8 topics (62%), the increase was greater in the randomized trials. The estimated increase in relative risk differed more than 2-fold between the randomized and nonrandomized studies for 7 (54%) of the 13 topics (symptomatic intracranial bleed with oral anticoagulant therapy [topic 5], major extracranial bleed with anticoagulant v. antiplatelet therapy [topic 6], symptomatic intracranial bleed with ASA [topic 8], vascular or visceral injury with laparoscopic v. open surgical repair of inguinal hernia [topic 10], major bleed with platelet glycoprotein IIb/IIIa blocker therapy for percutaneous coronary intervention [topic 14], multiple gestation with folate supplementation [topic 13], and acute myocardial infarction with rofecoxib v. naproxen therapy [topic 15]). Differences in relative risk beyond chance between the randomized and nonrandomized studies occurred for 2 of the 13 topics: the relative risks for symptomatic intracranial bleed with oral anticoagulant therapy (topic 5) and for vascular or visceral injury with laparoscopic versus open surgical repair of inguinal hernia (topic 10) were significantly greater in the nonrandomized studies than in the randomized trials. Between-study heterogeneity was more common in the syntheses of data from the nonrandomized studies than in the syntheses of data from the randomized trials. There was significant between-study heterogeneity (p < 0.10 on the Q statistic) among the randomized trials for 2 data syntheses (topics 3 and 14) and among the nonrandomized studies for 5 data syntheses (topics 4, 7, 8, 13 and 15). The adjusted and unadjusted estimates of relative risk in the nonrandomized studies were similar (see online Appendix 4, available at www.cmaj.ca/cgi/content/full/cmaj.050873/DC1)…The randomized trials usually estimated larger absolute risks of harms than the nonrandomized studies did; for 1 topic, the difference was almost 40-fold.

Young & Karr 2011, “Deming, data and observational studies: A process out of control and needing fixing”:

As long ago as 19881,2 it was noted that there were contradicted results for case-control studies in 56 different topic areas, of which cancer and things that cause it or cure it were by far the most frequent. An average of 2.4 studies supported each association - and an average of 2.3 studies did not support it. For example, 3 studies supported an association between the anti-hypertensive drug reserpine and breast cancer, and 8 did not. It was asserted2 that “much of the disagreement may occur because a set of rigorous scientific principles has not yet been accepted to guide the design or interpretation of case-control research”.

  1. Mayes et al 1988, “A collection of 56 topics with contradictory results in case-control research”
  2. Feinstein 1988, “Scientific standards in epidemiologic studies of the menace of daily life”

…We ourselves carried out an informal but comprehensive accounting of 12 randomised clinical trials that tested observational claims - see Table 1. The 12 clinical trials tested 52 observational claims. They confirmed none of the claims in the direction of the observational claims. We repeat that figure: 0 out of 52. To put it another way, 100% of the observational claims failed to replicate. In fact, 5 claims (9.6%) are statistically significant in the clinical trials in the opposite direction to the observational claim. To us, a false discovery rate of over 80% is potent evidence that the observational study process is not in control. The problem, which has been recognised at least since 1988, is systemic.
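As a quick sanity check on how extreme “0 out of 52” is, here is a sketch (exact binomial in Python; the coin-flip and 2.5%-per-tail replication models are illustrative assumptions, not Young & Karr’s analysis) of how unlikely such a tally would be if the observational claims had no systematic problem:

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(k + 1))

# Probability that 0 of 52 claims replicate in the claimed direction,
# if each independently had even a 50-50 chance of doing so:
p_zero = binom_cdf(0, 52)            # = 0.5**52, on the order of 2e-16

# Probability of 5 or more trials landing significant in the *opposite*
# direction purely by accident, at 2.5% per tail:
p_opposite = 1 - binom_cdf(4, 52, 0.025)
```

Both tail probabilities are tiny, which is the informal point of the “process out of control” argument: chance alone does not produce this pattern.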

…The “females eating cereal leads to more boy babies” claim translated the cartoon example into real life. The claim appeared in the Proceedings of the Royal Society, Series B. It makes essentially no biological sense, as for humans the Y chromosome controls gender and comes from the male parent. The data set consisted of the gender of the children of 740 mothers along with the results of a food questionnaire - not of breakfast cereal alone but of 133 different food items, compared to only 20 colours of jelly beans. Breakfast cereal during the second time period at issue was one of the few foods of the 133 to give a positive. We reanalysed the data6, with 262 t-tests, and concluded that the result was easily explained as pure chance.

  1. Young et al 2009, “Cereal-induced gender selection? Most likely a multiple testing false positive”
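The multiple-testing explanation is easy to illustrate with a small simulation on purely synthetic noise (the 262-test count comes from the reanalysis above; everything else here is a made-up toy): at the conventional 0.05 threshold, a questionnaire with that many tests will produce “significant” foods essentially every time.

```python
import random

random.seed(2014)

def false_positives(n_tests=262, alpha=0.05):
    """Count how many of n_tests tests on pure noise come out 'significant'.
    Each test's p-value is uniform on [0,1] under the null."""
    return sum(random.random() < alpha for _ in range(n_tests))

# Across 1000 simulated questionnaire analyses, pure chance routinely
# yields double-digit 'significant' food items:
counts = [false_positives() for _ in range(1000)]
mean_hits = sum(counts) / len(counts)        # expect about 262 * 0.05 = 13.1
share_with_any = sum(c > 0 for c in counts) / len(counts)
```

Under the null, the chance of at least one false positive is 1 - 0.95^262, which is indistinguishable from 1, so a “cereal effect” somewhere in the table is the expected outcome, not a discovery.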

…The US Centers for Disease Control assayed the urine of around 1000 people for 275 chemicals, one of which was bisphenol A (BPA). One resulting claim was that BPA is associated with cardiovascular diagnoses, diabetes, and abnormal liver enzyme concentrations. BPA is a chemical much in the news and under attack from people fearful of chemicals. The people who had their urine assayed for chemicals also gave a self-reported health status for 32 medical outcomes. For each person, ten demographic variables (such as ethnicity, education, and income) were also collected. There are 275 × 32 = 8800 potential endpoints for analysis. Using simple linear regression for covariate adjustment, there are approximately 1000 potential models, including or not including each demographic variable. Altogether the search space is about 9 million models and endpoints11. The authors remain convinced that their claim is valid.

  1. Young, S. S. and Yu, M. (2009), “To the Editor: Association of Bisphenol A With Diabetes and Other Abnormalities”, Journal of the American Medical Association, 301, 720-721
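The search-space arithmetic in the paragraph above can be reconstructed in a few lines (just the counting argument, no data involved):

```python
# 275 assayed chemicals x 32 self-reported outcomes gives the endpoints;
# including or excluding each of 10 demographic covariates gives
# 2^10 = 1024 (~1000) linear-regression models per endpoint.
endpoints = 275 * 32            # 8800 chemical-outcome pairs
models_per_endpoint = 2 ** 10   # 1024 covariate subsets
search_space = endpoints * models_per_endpoint
print(search_space)             # ~9 million model/endpoint combinations
```

With a search space that large, even a minuscule per-model false-positive rate guarantees many spurious “associations”, which is the authors’ point.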

“Vitamin D and multiple health outcomes: umbrella review of systematic reviews and meta-analyses of observational studies and randomised trials”, Theodoratou et al 2014:

Comparison of findings from observational studies and clinical trials:

One hundred and twenty three (90%) outcomes were examined only by syntheses of observational evidence (n = 84) or only by meta-analyses of randomised evidence (n = 39), so we could not compare observational and randomised evidence.

Ten (7%) outcomes were examined by both meta-analyses of observational studies and meta-analyses of randomised controlled trials: cardiovascular disease, hypertension, birth weight, birth length, head circumference at birth, small for gestational age birth, mortality in patients with chronic kidney disease, all cause mortality, fractures, and hip fractures (table 5). The direction of the association/effect and level of statistical significance was concordant only for birth weight, but this outcome could not be tested for hints of bias in the meta-analysis of observational studies (owing to lack of the individual data). The direction of the association/effect but not the level of statistical significance was concordant in six outcomes (cardiovascular disease, hypertension, birth length, head circumference, small for gestational age births, and all cause mortality), but only two of them (cardiovascular disease and hypertension) could be tested and were found to be free from hint of bias and of low heterogeneity in the meta-analyses of observational studies. For mortality in chronic kidney disease patients, fractures in older populations, and hip fractures, both the direction and the level of significance of the association/effect were not concordant.
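The concordance bookkeeping used here (direction plus significance) can be expressed as a tiny helper; this is a hedged sketch with hypothetical inputs, not the review’s actual code:

```python
def concordance(obs_effect, obs_sig, rct_effect, rct_sig):
    """Classify an outcome the way the umbrella review does: concordant in
    direction and significance, concordant in direction only, or discordant.
    Effects are signed (e.g. log relative risks); sig flags are booleans."""
    same_direction = (obs_effect > 0) == (rct_effect > 0)
    if same_direction and obs_sig == rct_sig:
        return "direction and significance concordant"
    if same_direction:
        return "direction concordant only"
    return "discordant"

# Hypothetical outcome: observational meta-analysis finds a significant
# protective effect, the RCT meta-analysis a non-significant one in the
# same direction -- the most common pattern in the table above.
label = concordance(-0.20, True, -0.05, False)
```

Note that “direction concordant only” is exactly the ambiguous middle category that dominates the vitamin D comparison: same sign, but only one design reaches significance.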

“Agreement of Treatment Effects for Mortality from Routinely Collected Data and Subsequent Randomized Trials: Meta-Epidemiological Survey”, Hemkens et al 2016:

Objective: To assess differences in estimated treatment effects for mortality between observational studies with routinely collected health data (RCD; that are published before trials are available) and subsequent evidence from randomized controlled trials on the same clinical question.

Design: Meta-epidemiological survey.

Data sources: PubMed searched up to November 2014.

Methods: Eligible RCD studies were published up to 2010, used propensity scores to address confounding bias, and reported comparative effects of interventions for mortality. The analysis included only RCD studies conducted before any trial was published on the same topic. The direction of treatment effects, confidence intervals, and effect sizes (odds ratios) were compared between RCD studies and randomized controlled trials. The relative odds ratio (that is, the summary odds ratio of trial(s) divided by the RCD study estimate) and the summary relative odds ratio were calculated across all pairs of RCD studies and trials. A summary relative odds ratio greater than one indicates that RCD studies gave more favorable mortality results.

Results: The evaluation included 16 eligible RCD studies and 36 subsequently published randomized controlled trials investigating the same clinical questions (with 17,275 patients and 835 deaths). Trials were published a median of three years after the corresponding RCD study. For five (31%) of the 16 clinical questions, the direction of treatment effects differed between RCD studies and trials. Confidence intervals in nine (56%) RCD studies did not include the RCT effect estimate. Overall, RCD studies showed significantly more favorable mortality estimates, by 31%, than subsequent trials (summary relative odds ratio 1.31 (95% confidence interval 1.03 to 1.65; I2 = 0%)).

Conclusions: Studies of routinely collected health data could give different answers from subsequent randomized controlled trials on the same clinical questions, and may substantially overestimate treatment effects. Caution is needed to prevent misguided clinical decision making.
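The relative odds ratio defined in the methods, and its fixed-effect pooling across clinical questions, can be sketched as follows (all numbers hypothetical; Hemkens et al’s actual analysis used random-effects meta-analysis machinery, and inverse-variance fixed-effect pooling is an assumption here):

```python
import math

def relative_odds_ratio(or_trials, or_rcd):
    """Summary trial OR divided by the RCD study OR, per the definition
    above; > 1 means the RCD study gave the more favorable estimate."""
    return or_trials / or_rcd

def summary_ror(pairs):
    """Fixed-effect inverse-variance pooling of log relative odds ratios.
    Each pair is (trial OR, RCD OR, variance of the log ROR)."""
    weights = [1.0 / var for *_, var in pairs]
    log_rors = [math.log(t / r) for t, r, _ in pairs]
    pooled = sum(w * lr for w, lr in zip(weights, log_rors)) / sum(weights)
    return math.exp(pooled)

# Three hypothetical clinical questions where the RCD study looked better
# for mortality than the later trials:
pairs = [(1.10, 0.80, 0.04), (0.95, 0.85, 0.09), (1.30, 1.00, 0.16)]
summary = summary_ror(pairs)   # > 1: RCD studies more favorable on average
```

A summary of 1.31, as reported above, corresponds to RCD studies looking 31% more favorable on the odds scale than the trials that followed them.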


“Evaluating the Econometric Evaluations of Training Programs with Experimental Data”, LaLonde 1986:

This paper compares the effect on trainee earnings of an employment program that was run as a field experiment, where participants were randomly assigned to treatment and control groups, with the estimates that would have been produced by an econometrician. This comparison shows that many of the econometric procedures do not replicate the experimentally determined results, and it suggests that researchers should be aware of the potential for specification errors in other nonexperimental evaluations.

…The National Supported Work Demonstration (NSW) was a temporary employment program designed to help disadvantaged workers lacking basic job skills move into the labor market by giving them work experience and counseling in a sheltered environment. Unlike other federally sponsored employment and training programs, the NSW program assigned qualified applicants to training positions randomly. Those assigned to the treatment group received all the benefits of the NSW program, while those assigned to the control group were left to fend for themselves.3 During the mid-1970s, the Manpower Demonstration Research Corporation (MDRC) operated the NSW program in ten sites across the United States. The MDRC admitted into the program AFDC women, ex-drug addicts, ex-criminal offenders, and high school dropouts of both sexes.4 For those assigned to the treatment group, the program guaranteed a job for 9 to 18 months, depending on the target group and site. The treatment group was divided into crews of three to five participants who worked together and met frequently with an NSW counselor to discuss grievances and performance. The NSW program paid the treatment group members for their work. The wage schedule offered the trainees lower wage rates than they would have received on a regular job, but allowed their earnings to increase for satisfactory performance and attendance. The trainees could stay on their supported work jobs until their terms in the program expired and they were forced to find regular employment. …male and female participants frequently performed different sorts of work. The female participants usually worked in service occupations, whereas the male participants tended to work in construction occupations. Consequently, the program costs varied across the sites and target groups. The program cost $9,100 per AFDC participant and approximately $6,800 for the other target groups’ trainees.

The first two columns of Tables 2 and 3 present the annual earnings of the treatment and control group members.9 The earnings of the experimental groups were the same in the pre-training year 1975, diverged during the employment program, and converged to some extent after the program ended. The post-training year was 1979 for the AFDC females and 1978 for the males.10 Columns 2 and 3 in the first row of Tables 4 and 5 show that both the unadjusted and regression-adjusted pre-training earnings of the two sets of treatment and control group members are essentially identical. Therefore, because of the NSW program’s experimental design, the difference between the post-training earnings of the experimental groups is an unbiased estimator of the training effect, and the other estimators described in columns 5-10(11) are unbiased estimators as well. The estimates in column 4 indicate that the earnings of the AFDC females were $851 higher than they would have been without the NSW program, while the earnings of the male participants were $886 higher.11 Moreover, the other columns show that the econometric procedure does not affect these estimates.

The researchers who evaluated these federally sponsored programs devised both experimental and nonexperimental procedures to estimate the training effect, because they recognized that the difference between the trainees’ pre- and post-training earnings was a poor estimate of the training effect. In a dynamic economy, the trainees’ earnings may grow even without an effective program. The goal of these program evaluations is to estimate the earnings of the trainees had they not participated in the program. Researchers using experimental data take the earnings of the control group members to be an estimate of the trainees’ earnings without the program. Without experimental data, researchers estimate the earnings of the trainees by using the regression-adjusted earnings of a comparison group drawn from the population. This adjustment takes into account that the observable characteristics of the trainees and the comparison group members differ, and that their unobservable characteristics may differ as well.

The first step in a nonexperimental evaluation is to select a comparison group whose earnings can be compared to the earnings of the trainees. Tables 2 and 3 present the mean annual earnings of female and male comparison groups drawn from the Panel Study of Income Dynamics (PSID) and Westat’s Matched Current Population Survey - Social Security Administration File (CPS-SSA). These groups are characteristic of two types of comparison groups frequently used in the program evaluation literature. The PSID-1 and the CPS-SSA-1 groups are large, stratified random samples from populations of household heads and households, respectively.14 The other, smaller, comparison groups are composed of individuals whose characteristics are consistent with some of the eligibility criteria used to admit applicants into the NSW program. For example, the PSID-3 and CPS-SSA-4 comparison groups in Table 2 include females from the PSID and the CPS-SSA who received AFDC payments in 1975 and were not employed in the spring of 1976. Tables 2 and 3 show that the NSW trainees and controls have earnings histories that are more similar to those of the smaller comparison groups.

Unlike the experimental estimates, the nonexperimental estimates are sensitive both to the composition of the comparison group and to the econometric procedure. For example, many of the estimates in column 9 of Table 4 replicate the experimental results, while other estimates are more than $1,000 larger than the experimental results. More specifically, the results for the female participants (Table 4) tend to be positive and larger than the experimental estimate, while for the male participants (Table 5), the estimates tend to be negative and smaller than the experimental impact.20 Additionally, the nonexperimental procedures replicate the experimental results more closely when the nonexperimental data include pre-training earnings rather than cross-sectional data alone, or when evaluating female rather than male participants.

Before taking some of these estimates too seriously, many econometricians at a minimum would require that their estimators be based on econometric models that are consistent with the pre-training earnings data. Thus, if the regression-adjusted difference between the post-training earnings of the two groups is going to be a consistent estimator of the training effect, the regression-adjusted pretraining earnings of the two groups should be the same. Based on this specification test, econometricians might reject the nonexperimental estimates in columns 4-7 of Table 4 in favor of the ones in columns 8-11. Few econometricians would report the training effect of $870 in column 5, even though this estimate differs from the experimental result by only $19. If the cross-sectional estimator properly controlled for differences between the trainees and comparison group members, we would not expect the difference between the regression-adjusted pre-training earnings of the two groups to be $1,550, as reported in column 3. Likewise, econometricians might refrain from reporting the difference-in-differences estimates in columns 6 and 7, even though all these estimates are within two standard errors of $3,000. As noted earlier, this estimator is not consistent with the decline in the trainees' pre-training earnings.
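LaLonde's specification test can be sketched in a few lines: regress *pre*-training earnings on a trainee dummy plus observed covariates; since training cannot affect earlier earnings, a large adjusted gap signals that the comparison group differs on unobservables. This is a minimal sketch with simulated data (all variable names and parameter values are hypothetical, not the NSW dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
trainee = rng.integers(0, 2, n)                 # 1 = trainee, 0 = comparison
educ = rng.normal(12, 2, n)                     # observed covariate
ability = rng.normal(0, 1, n) - 0.8 * trainee   # unobserved; trainees poorer

def adjusted_gap(pre_earnings):
    """OLS coefficient on the trainee dummy, controlling for education."""
    X = np.column_stack([np.ones(n), trainee, educ])
    beta, *_ = np.linalg.lstsq(X, pre_earnings, rcond=None)
    return beta[1]

pre = 5000 + 800 * educ + 3000 * ability + rng.normal(0, 1000, n)
gap = adjusted_gap(pre)
# A large adjusted pre-training gap (here roughly -$2,400, driven by the
# unobserved ability term) means the estimator fails the specification test.
print(round(gap))
```

Passing the test (a gap near zero) is necessary but not sufficient: it only checks balance on the pre-period outcome, not on every unobservable.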

The two-step estimates are usually closer than the one-step estimates to the experimental results for the male trainees as well. One estimate, which used the CPS-SSA-1 sample as a comparison group, is within $600 of the experimental result, while the one-step estimate falls short by $1,695. The estimates of the participation coefficients are negative, although unlike these estimates for the females, they are always significantly different from zero. This finding is consistent with the example cited earlier in which individuals with high participation unobservables and low earnings unobservables were more likely to be in training. As predicted, the unrestricted estimates are larger than the one-step estimates. However, as with the results for the females, this procedure may leave econometricians with a considerable range ($1,546) of imprecise estimates.

“The Endogeneity Problem in Developmental Studies”, Duncan et al 2004:

For example, despite theoretical arguments to the contrary, most empirical studies of the effects of divorce on children have assumed that divorce is randomly assigned to children. They do this by failing to control for the fact that divorce is the product of the parents' temperaments, resources, and other stressors that face parents, most of which will influence children's outcomes in their own right. As a result, studies comparing developmental outcomes of children with and without past parental divorces after controlling for a handful of family background characteristics are likely to confound the effects of divorce with the effects of unmeasured parent and child variables. Indeed, studies that control for children's behavior problems prior to a possible divorce find much smaller apparent effects of the divorce itself (Cherlin, Chase-Lansdale, & McRae, 1998).

… These experiments can provide researchers with some sense for the bias that results from nonexperimental estimates as well as providing direct evidence for the causal effects of some developmental influence of interest. For example, Wilde and Hollister (2002) compare nonexperimental and experimental results for the widely cited Tennessee Student-Teacher Achievement Ratio (STAR) class-size experiment. The STAR experiment provides an unbiased estimate of the impact of class size on student achievement by comparing the average achievement levels of students assigned to small (experimental) and regular (control) classrooms. However, Wilde and Hollister also estimated a series of more conventional nonexperimental regressions that related naturally occurring class-size variation within the set of regular classrooms to student achievement, controlling for an extensive set of student demographic characteristics and socioeconomic status.

Table 1 compares the experimental and nonexperimental estimates of class-size impacts by school. The table shows substantial variability across schools in the effects of smaller classes on student standardized test scores. In some cases (e.g., Schools B, D, and I), the two sets of estimates are quite close, but in others (e.g., Schools C, E, G, and H) they are quite different. A comparison of the nonexperimental and experimental results as a whole reveals that the average bias (i.e., the absolute difference between the experimental and nonexperimental impact estimates) is on the order of 10 percentile points - about the same as the average experimental estimate for the effects of smaller classes!

Table 1, Comparison of Experimental and Nonexperimental Estimates for Effects of Class Size on Student Test Scores [* = estimate statistically significant at the 5% cutoff.]

School | Nonexperimental Regression | Experimental Estimate
A      |   9.6                      |  -5.2
B      |  15.3*                     |  13.0*
C      |   1.9                      |  24.1*
D      |  35.2*                     |  33.1*
E      |  20.4*                     | -10.5
F      |   0.2                      |   1.3
G      |  -8.6                      |  10.6*
H      |  -5.6                      |   9.6*
I      |  16.5*                     |  14.7*
J      |  24.3*                     |  16.2*
K      |  27.8*                     |  19.3*
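The paper's summary numbers can be checked directly against the table above (a quick arithmetic check, not part of the original analysis): both the mean absolute nonexperimental-vs-experimental gap and the mean experimental effect come out to about 11.5 percentile points.

```python
# Recomputing Wilde & Hollister's summary from Table 1: the mean absolute
# gap between the nonexperimental and experimental estimates, and the mean
# experimental effect itself (both in percentile points).
nonexp = [9.6, 15.3, 1.9, 35.2, 20.4, 0.2, -8.6, -5.6, 16.5, 24.3, 27.8]
exp_ = [-5.2, 13.0, 24.1, 33.1, -10.5, 1.3, 10.6, 9.6, 14.7, 16.2, 19.3]

mean_bias = sum(abs(n - e) for n, e in zip(nonexp, exp_)) / len(exp_)
mean_effect = sum(exp_) / len(exp_)
print(round(mean_bias, 1), round(mean_effect, 1))   # both ≈ 11.5
```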

A second example of the bias that may result with nonexperimental estimates comes from the U.S. Department of Housing and Urban Development's Moving to Opportunity (MTO) housing-voucher experiment, which randomly assigned housing-project residents in high-poverty neighborhoods of five of the nation's largest cities to either a group that was offered a housing voucher to relocate to a lower-poverty area or to a control group that received no mobility assistance under the program (Ludwig, Duncan, & Hirschfield, 2001). Because of well-implemented random assignment, each of the groups on average should be equivalent (subject to sampling variability) with respect to all observable and unobservable preprogram characteristics.

Table 2 presents the results of using the randomized design of MTO to generate unbiased estimates of the effects of moving from high- to low-poverty census tracts on teen crime. The experimental estimates are the difference between average outcomes of all families offered vouchers and those assigned to the control group, divided by the difference across the two groups in the proportion of families who moved to a low-poverty area. (Note the implication that these kinds of experimental data can be used to produce unbiased estimates of the effects of neighborhood characteristics on developmental outcomes, even if the takeup rate is less than 100% in the treatment group and greater than 0% among the control group.) The nonexperimental estimates simply compare families who moved to low-poverty neighborhoods with those who did not, ignoring information about each family's random assignment and relying on the set of pre-random-assignment measures of MTO family characteristics to adjust for differences between families who chose to move and those who did not. As seen in Table 2, even after statistically adjusting for a rich set of background characteristics, the nonexperimental measure-the-unmeasured approach leads to starkly different inferences about the effects of residential mobility compared with the unbiased experimental estimates. For example, the experimental estimates suggest that moving from a high- to a low-poverty census tract significantly reduces the number of violent crimes. In contrast, the nonexperimental estimates find that such moves have essentially no effect on violent arrests. In the case of "other" crimes, the nonexperimental estimates suggest that such moves reduce crime, but the experimentally based estimates do not.
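The experimental estimator described above is the standard Wald/Bloom-style ratio: the intent-to-treat difference in outcomes, scaled by the difference in takeup across the two arms. A minimal sketch, with illustrative numbers rather than the actual MTO data:

```python
# Wald/Bloom estimator: effect of actually moving, recovered from the
# effect of being *offered* a voucher. Numbers below are illustrative.
def effect_of_moving(y_voucher, y_control, takeup_voucher, takeup_control):
    itt = y_voucher - y_control                  # effect of the offer (ITT)
    first_stage = takeup_voucher - takeup_control  # offer's effect on moving
    return itt / first_stage

# e.g. the offer lowers arrests from 50 to 40 per 100 juveniles, and raises
# the probability of moving to a low-poverty area from 5% to 55%:
print(effect_of_moving(40.0, 50.0, 0.55, 0.05))  # -20.0 arrests per 100
```

The scaling works because the offer itself affects outcomes only through moving, so the ITT effect is diluted by exactly the fraction of families whose moving behavior the offer changed.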

Table 2, Estimated Impacts of Moving From a High- to a Low-Poverty Neighborhood on Arrests Per 100 Juveniles [From Ludwig (1999), based on data from the Baltimore Moving to Opportunity experiment. Regression models also control for baseline measurement of gender, age at random assignment, and preprogram criminal involvement, family's preprogram victimization, mother's schooling, welfare receipt and marital status. * = estimate statistically significant at the 5% cutoff level.]

Measure        | Experimental | SE   | Non-experimental | SE   | Sample Size
Violent Crime  | -47.4*       | 24.3 |  -4.9            | 12.5 | 259
Property Crime |  29.7        | 28.9 | -10.8            | 14.1 | 259
Other Crimes   |  -0.6        | 37.4 | -36.9*           | 14.3 | 259

…A final example comes from the National Evaluation of Welfare-to-Work Strategies, a randomized experiment designed to evaluate welfare-to-work programs in seven sites across the United States. One of the treatment streams encouraged welfare-recipient mothers to participate in education activities. In addition to measuring outcomes such as clients' welfare receipt, employment, and earnings, the evaluation study also tested young children's school readiness using the Bracken Basic Concepts Scale School Readiness Subscale. Using a method for generating experimental estimates similar to that used in the MTO analyses, Magnuson and McGroder (2002) examined the effects of the experimentally induced increases in maternal schooling on children's school readiness. Again, the results suggest that nonexperimental estimates did not closely reproduce experimentally based estimates.

A much larger literature within economics, statistics, and program evaluation has focused on the ability of nonexperimental regression-adjustment methods to replicate experimental estimates for the effects of job training or welfare-to-work programs. Although the "contexts" represented by these programs may be less interesting to developmentalists, the results of this literature nevertheless bear directly on the question considered in this article: Can regression methods with often quite detailed background covariates reproduce experimental impact estimates for such programs? As one recent review concluded, "Occasionally, but not in a way that can be easily predicted" (Glazerman, Levy, & Myers, 2002, p. 46; see also Bloom, Michalopoulos, Hill, & Lei, 2002).

Allcott 2011, “Social Norms and Energy Conservation”:

Nearly all energy-efficiency programs are still evaluated using non-experimental estimators or engineering accounting approaches. How important is the experimental control group to consistently-estimated ATEs? This issue is crucial for several of OPOWER's initial programs that were implemented without a control group but must estimate impacts to report to state regulators. While LaLonde (1986) documented that non-experimental estimators performed poorly in evaluating job-training programs and similar arguments have been made in many other domains, weather-adjusted non-experimental estimators could in theory perform well in modeling energy demand. The importance of randomized controlled trials has not yet been clearly documented to analysts and policymakers in this context.

Without an experimental control group, there are two econometric approaches that could be used. The first is to use a difference estimator, comparing electricity use in the treated population before and after treatment. In implementing this, I control for weather differences non-parametrically, using bins with width one average degree day. This slightly outperforms the use of fourth-degree polynomials in heating and cooling degree-days. This estimator is unbiased if and only if there are no other factors associated with energy demand that vary between the pre-treatment and post-treatment period. A second non-experimental approach is to use a difference-in-differences estimator with nearby households as a control group. For each experiment, I form a control group using the average monthly energy use of households in other utilities in the same state, using data that regulated utilities report to the U.S. Department of Energy on Form EIA 826. The estimator includes utility-by-month fixed effects to capture different seasonal patterns - for example, there may be local variation in how many households use electric heat instead of natural gas or oil, which then affects winter electricity demand. This estimator is unbiased if and only if there are no unobserved factors that differentially affect average household energy demand in the OPOWER partner utility vs. the other utilities in the same state.

Fig. 6 presents the experimental ATEs for each experiment along with point estimates for the two types of non-experimental estimators. There is substantial variance in the non-experimental estimators: the average absolute errors for the difference and difference-in-differences estimators, respectively, are 2.1% and 3.0%. Across the 14 experiments, the estimators are also biased on average. In particular, the mean of the ATEs from the difference-in-differences estimator is −3.75%, which is nearly double the mean of the experimental ATEs.
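The two non-experimental estimators Allcott describes reduce to simple arithmetic once the controls are set aside. A toy sketch with made-up monthly usage numbers (the weather bins and utility-by-month fixed effects are omitted):

```python
# Toy monthly electricity use (kWh): a treated utility vs. neighboring
# utilities in the same state, before and after the program starts.
treated_pre, treated_post = 1000.0, 955.0
neighbor_pre, neighbor_post = 1010.0, 985.0

# (1) Simple difference: pre vs. post within the treated population.
# Attributes *any* change over time (weather, prices, recession) to the program.
diff = (treated_post - treated_pre) / treated_pre            # -4.5%

# (2) Difference-in-differences: net out the change seen in the neighbors.
# Attributes any *differential* change to the program.
did = (treated_post - treated_pre) - (neighbor_post - neighbor_pre)
did_pct = did / treated_pre                                  # -2.0%

print(round(100 * diff, 1), round(100 * did_pct, 1))
```

Allcott's point is that even the more defensible second estimator, applied to real data, misses the experimental ATE by more on average than the treatment effect itself.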

…What's particularly insidious about the non-experimental estimates is that they would appear quite plausible if not compared to the experimental benchmark. Nearly all are within the confidence intervals of the small-sample pilots by Schultz et al. (2007) and Nolan et al. (2008) that were discussed above. Evaluations of similar types of energy-use information feedback programs have reported impacts of zero to 10% (Darby, 2006).

“What Do Workplace Wellness Programs Do? Evidence from the Illinois Workplace Wellness Study”, Jones et al 2018:

Workplace wellness programs cover over 50 million workers and are intended to reduce medical spending, increase productivity, and improve well-being. Yet, limited evidence exists to support these claims. We designed and implemented a comprehensive workplace wellness program for a large employer with over 12,000 employees, and randomly assigned program eligibility and financial incentives at the individual level. Over 56% of eligible (treatment group) employees participated in the program. We find strong patterns of selection: during the year prior to the intervention, program participants had lower medical expenditures and healthier behaviors than non-participants. However, we do not find significant causal effects of treatment on total medical expenditures, health behaviors, employee productivity, or self-reported health status in the first year. Our 95% confidence intervals rule out 83% of previous estimates on medical spending and absenteeism. Our selection results suggest these programs may act as a screening mechanism: even in the absence of any direct savings, differential recruitment or retention of lower-cost participants could result in net savings for employers.

…We invited 12,459 benefits-eligible university employees to participate in our study. Study participants (n = 4,834) assigned to the treatment group (n = 3,300) were invited to take paid time off to participate in our workplace wellness program. Those who successfully completed the entire program earned rewards ranging from $50 to $350, with the amounts randomly assigned and communicated at the start of the program. The remaining subjects (n = 1,534) were assigned to a control group, which was not permitted to participate. Our analysis combines individual-level data from online surveys, university employment records, health insurance claims, campus gym visit records, and administrative records from a popular community running event. We can therefore examine outcomes commonly studied by the prior literature (namely, medical spending and employee absenteeism) as well as a large number of novel outcomes.

…Third, we do not find significant effects of our intervention on 37 out of the 39 outcomes we examine in the first year following random assignment. These 37 outcomes include all our measures of medical spending, productivity, health behaviors, and self-reported health. We investigate the effect on medical expenditures in detail, but fail to find significant effects on different quantiles of the spending distribution or on any major subcategory of medical expenditures (pharmaceutical drugs, office, or hospital). We also do not find any effect of our intervention on the number of visits to campus gym facilities or on the probability of participating in a popular annual community running event, two health behaviors that are relatively simple for a motivated employee to change over the course of one year. These null estimates are meaningfully precise, particularly for two key outcomes of interest in the literature: medical spending and absenteeism. Our 95% confidence intervals rule out 83% of the effects reported in 115 prior studies, and the 99% confidence intervals for the return on investment (ROI) of our intervention rule out the widely cited medical spending and absenteeism ROIs reported in the meta-analysis of Baicker, Cutler and Song (2010). In addition, we show that our OLS (non-RCT) estimate for medical spending is in line with estimates from prior observational studies, but is ruled out by the 95% confidence interval of our IV (RCT) estimate. This demonstrates the value of employing an RCT design in this literature.

…Our randomized controlled design allows us to establish reliable causal effects by comparing outcomes across the treatment and control groups. By contrast, most existing studies rely on observational comparisons between participants and non-participants (see Pelletier, 2011, and Chapman, 2012, for reviews). Reviews of the literature have called for additional research on this topic and have also noted the potential for publication bias to skew the set of existing results (Baicker, Cutler and Song, 2010; Abraham and White, 2017). To that end, our intervention, empirical specifications, and outcome variables were prespecified and publicly archived. In addition, the analyses in this paper were independently replicated by a J-PAL affiliated researcher.

…Figure 8 illustrates how our estimates compare to the prior literature. The top-left figure in Panel (a) plots the distribution of the intent-to-treat (ITT) point estimates for medical spending from 22 prior workplace wellness studies. The figure also plots our ITT point estimate for total medical spending from Table 4, and shows that our 95% confidence interval rules out 20 of these 22 estimates. For ease of comparison, all effects are expressed as % changes. The bottom-left figure in Panel (a) plots the distribution of treatment-on-the-treated (TOT) estimates for health spending from 33 prior studies, along with the IV estimates from our study. In this case, our 95% confidence interval rules out 23 of the 33 studies (70%). Overall, our confidence intervals rule out 43 of 55 (78%) prior ITT and TOT point estimates for health spending. The two figures in Panel (b) repeat this exercise for absenteeism, and show that our estimates rule out 53 of 60 (88%) prior ITT and TOT point estimates for absenteeism. Across both sets of outcomes, we rule out 96 of 115 (83%) prior estimates. We can also combine our spending and absenteeism estimates with our cost data to calculate a return on investment (ROI) for workplace wellness programs. The 99% confidence intervals for the ROI associated with our intervention rule out the widely cited savings estimates reported in the meta-analysis of Baicker, Cutler and Song (2010).

Figure 8 of Jones et al 2018: comparison of the previous literature's correlational point-estimates with the Jones et al 2018 randomized effect's CI, demonstrating that almost none fall within the Jones et al 2018 CI.

4.3.3 IV versus OLS

Across a variety of outcomes, we find very little evidence that our intervention had any effect in its first year. As shown above, our results differ from many prior studies that find significant reductions in health expenditures and absenteeism. One possible reason for this discrepancy is the presence of advantageous selection bias in these other studies, which are generally not randomized controlled trials. A second possibility is that there is something unique about our setting. We investigate these competing explanations by performing a typical observational (OLS) analysis and comparing its results to those of our experimental estimates.

Specifically, we estimate y_i = α + τ·P_i + X_i′γ + ε_i (5), where y_i is the outcome variable as in (4), P_i is an indicator for participating in the screening and HRA, and X_i is a vector of variables that control for potentially non-random selection into participation. We estimate two variants of equation (5). The first is an instrumental variables (IV) specification that includes observations for individuals in the treatment or control groups, and uses treatment assignment as an instrument for completing the screening and HRA. The second variant estimates equation (5) using OLS, restricted to individuals in the treatment group. For each of these two variants, we estimate three specifications similar to those used for the ITT analysis described above (no controls, strata fixed effects, and post-Lasso).
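Why can the IV and OLS variants disagree so sharply? A simulation (not the study's data; all parameter values are invented) makes the mechanism concrete: when participation is selected on an unobservable that also lowers spending, comparing participants to nonparticipants is biased even though the true effect is zero, while the Wald/IV estimate built from random assignment recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
assigned = rng.integers(0, 2, n).astype(bool)   # randomized treatment offer
health = rng.normal(0, 1, n)                    # unobserved healthiness

# Healthier people are more likely to participate, but only if assigned.
participates = assigned & (rng.normal(0, 1, n) + health > 0.5)

# True program effect on monthly spending is zero by construction.
spending = 500.0 - 100.0 * health + rng.normal(0, 50, n)

# OLS-style comparison: participants vs. nonparticipants (selection-biased).
ols = spending[participates].mean() - spending[~participates].mean()

# Wald/IV: ITT difference scaled by the first stage (assignment -> takeup).
itt = spending[assigned].mean() - spending[~assigned].mean()
first_stage = participates[assigned].mean() - participates[~assigned].mean()
iv = itt / first_stage

print(round(ols), round(iv))   # OLS far below zero; IV near zero
```

The IV works here because assignment is random (independent of `health`) and, in this toy setup, affects spending only through participation.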

This generates six estimates for each outcome variable. Table 5 reports the results for our primary outcomes of interest. The results for all pre-specified administrative and survey outcomes are reported in Appendix Tables A.3e-A.3f.

Table 5, comparing the randomized estimate with the correlational estimates
Visualization of 5 entries from Table 5, from the New York Times.

As in our previous ITT analysis, the IV estimates reported in columns (1)-(3) are small and indistinguishable from zero for nearly every outcome. By contrast, the observational estimates reported in columns (4)-(6) are frequently large and statistically significant. Moreover, the IV estimate rules out the OLS estimate for several key outcomes. Based on our most precise and well-controlled specification (post-Lasso), the OLS monthly spending estimate of -$88.1 (row 1, column (6)) lies outside the 95% confidence interval of the IV estimate of $38.5 with a standard error of $58.8 (row 1, column (3)). For participation in the 2017 IL Marathon/10K/5K, the OLS estimate of 0.024 lies outside the 99% confidence interval of the corresponding IV estimate of -0.011 (standard error = 0.011). For campus gym visits, the OLS estimate of 2.160 lies just inside the 95% confidence interval of the corresponding IV estimate of 0.757 (standard error = 0.656). Under the assumption that the IV (RCT) estimates are unbiased, these differences imply that even after conditioning on a rich set of controls, participants selected into our workplace wellness program on the basis of lower-than-average contemporaneous spending and higher-than-average health activity. This is consistent with the evidence presented in Section 3.2 that pre-existing spending is lower, and pre-existing behaviors are healthier, among participants than among non-participants. In addition, the observational estimates presented in columns (4)-(6) are in line with estimates from previous observational studies, which suggests that our setting is not particularly unique. In the spirit of LaLonde (1986), these estimates demonstrate that even well-controlled observational analyses can suffer from significant selection bias in our setting, suggesting that similar biases might be at play in other wellness program settings as well.
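The "rules out" claims are simple confidence-interval checks, which can be verified from the point estimates quoted above:

```python
# Does an observational (OLS) estimate fall outside the confidence interval
# of the experimental (IV) estimate? Numbers are from Jones et al 2018's
# Table 5 as quoted above.
def outside_ci(other_estimate, estimate, se, z):
    lo, hi = estimate - z * se, estimate + z * se
    return not (lo <= other_estimate <= hi)

# Monthly spending: OLS -$88.1 vs. IV $38.5 (SE $58.8), 95% CI (z = 1.96).
print(outside_ci(-88.1, 38.5, 58.8, 1.96))      # True
# Marathon/10K/5K: OLS 0.024 vs. IV -0.011 (SE 0.011), 99% CI (z = 2.576).
print(outside_ci(0.024, -0.011, 0.011, 2.576))  # True
```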

Jones et al 2018 appendix: Table A3, a-c, all randomized vs correlational estimates
Jones et al 2018 appendix: Table A3, d-e, all randomized vs correlational estimates
Jones et al 2018 appendix: Table A3, f-g, all randomized vs correlational estimates

“Effect of a Workplace Wellness Program on Employee Health and Economic Outcomes: A Randomized Clinical Trial”, Song & Baicker 2019:

Design, Setting, and Participants: This clustered randomized trial was implemented at 160 worksites from January 2015 through June 2016. Administrative claims and employment data were gathered continuously through June 30, 2016; data from surveys and biometrics were collected from July 1, 2016, through August 31, 2016.

Interventions: There were 20 randomly selected treatment worksites (4,037 employees) and 140 randomly selected control worksites (28,937 employees, including 20 primary control worksites [4,106 employees]). Control worksites received no wellness programming. The program comprised 8 modules focused on nutrition, physical activity, stress reduction, and related topics implemented by registered dietitians at the treatment worksites.

Main Outcomes and Measures: Four outcome domains were assessed. Self-reported health and behaviors via surveys (29 outcomes) and clinical measures of health via screenings (10 outcomes) were compared among 20 intervention and 20 primary control sites; health care spending and utilization (38 outcomes) and employment outcomes (3 outcomes) from administrative data were compared among 20 intervention and 140 control sites.

Results: Among 32,974 employees (mean [SD] age, 38.6 [15.2] years; 15,272 [45.9%] women), the mean participation rate in surveys and screenings at intervention sites was 36.2% to 44.6% (n = 4,037 employees) and at primary control sites was 34.4% to 43.0% (n = 4,106 employees) (mean of 1.3 program modules completed). After 18 months, the rates for 2 self-reported outcomes were higher in the intervention group than in the control group: for engaging in regular exercise (69.8% vs 61.9%; adjusted difference, 8.3 percentage points [95% CI, 3.9-12.8]; adjusted p = .03) and for actively managing weight (69.2% vs 54.7%; adjusted difference, 13.6 percentage points [95% CI, 7.1-20.2]; adjusted p = .02). The program had no significant effects on other prespecified outcomes: 27 self-reported health outcomes and behaviors (including self-reported health, sleep quality, and food choices), 10 clinical markers of health (including cholesterol, blood pressure, and body mass index), 38 medical and pharmaceutical spending and utilization measures, and 3 employment outcomes (absenteeism, job tenure, and job performance).

…To assess endogenous selection into program participation, we compared the baseline characteristics of program participants to those of non-participants in treatment sites. To assess endogenous selection into participation in primary data collection, we compared baseline characteristics of workers who elected to provide clinical data or complete the health risk assessment to those of workers who did not, separately within the treatment group and the control group. This enabled us to assess any potential differential selection into primary data collection. Additionally, to examine differences in findings between our randomized trial approach and a standard observational design (and thereby any bias that confounding factors would have introduced into naive observational estimates), we generated estimates of program effects using ordinary least squares to compare program participants with nonparticipants (rather than using the variation generated by randomization).

Selection Into Program Participation. Comparisons of preintervention characteristics between participants and nonparticipants in the treatment group provided evidence of potential selection effects. Participants were significantly more likely to be female, nonwhite, and full-time salaried workers in sales, although neither mean health care spending nor the probability of having any spending during the year before the program was significantly different between participants and nonparticipants (eTable 15 in Supplement 2)…an observational approach comparing workers who elected to participate with nonparticipants would have incorrectly suggested that the program had larger effects on some outcomes than the effects found using the controlled design, underscoring the importance of randomization to obtain unbiased estimates (eTable 17 in Supplement 2).


“How Close Is Close Enough? Testing Nonexperimental Estimates of Impact against Experimental Estimates of Impact with Education Test Scores as Outcomes”, Wilde & Hollister 2002:

In this study we test the performance of some nonexperimental estimators of impacts applied to an educational intervention (reduction in class size) where achievement test scores were the outcome. We compare the nonexperimental estimates of the impacts to "true impact" estimates provided by a random-assignment design used to assess the effects of that intervention. Our primary focus in this study is on a nonexperimental estimator based on a complex procedure called propensity score matching. Previous studies which tested nonexperimental estimators against experimental ones all had employment or welfare use as the outcome variable. We tried to determine whether the conclusions from those studies about the performance of nonexperimental estimators carried over into the education domain.

…Project Star is the source of data for the experimental estimates and the source for drawing nonexperimental comparison groups used to make nonexperimental estimates. Project Star was an experiment in Tennessee involving 79 schools in which students in kindergarten through third grade were randomly assigned to small classes (the treatment group) or to regular-size classes (the control group). The outcome variables from the data set were the math and reading achievement test scores. We carried out the propensity-score-matching estimating procedure separately for each of 11 schools' kindergartens and used it to derive nonexperimental estimates of the impact of smaller class size. We also developed proper standard errors for the propensity-score-matched estimators by using bootstrapping procedures. We found that in most cases, the propensity-score estimate of the impact differed substantially from the "true impact" estimated by the experiment. We then attempted to assess how close the nonexperimental estimates were to the experimental ones. We suggested several different ways of attempting to assess "closeness." Most of them led to the conclusion, in our view, that the nonexperimental estimates were not very "close" and therefore were not reliable guides as to what the "true impact" was. We put greatest emphasis on looking at the question of "how close is close enough?" in terms of a decision-maker trying to use the evaluation to determine whether to invest in wider application of the intervention being assessed (in this case, reduction in class size). We illustrate this in terms of a rough cost-benefit framework for small class size as applied to Project Star. We find that in 30 to 45% of the 11 cases, the propensity-score-matching nonexperimental estimators would have led to the "wrong" decision.
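A bare-bones sketch of propensity-score matching of the kind tested here, on simulated data (logistic propensity model fit by Newton's method, 1-nearest-neighbor matching with replacement; the paper's bootstrapped standard errors are omitted, and the favorable outcome below holds only because selection is on *observed* covariates by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(0, 1, (n, 2))                            # observed covariates
treated = rng.random(n) < 1 / (1 + np.exp(-(x[:, 0] - 0.3)))
y = 5.0 * treated + 2.0 * x[:, 0] + rng.normal(0, 1, n)  # true effect = 5

# 1. Fit a logistic propensity model P(treated | x) by Newton's method.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (treated - p))
pscore = 1 / (1 + np.exp(-X @ beta))

# 2. Match each treated unit to the control with the nearest propensity score.
t_idx, c_idx = np.flatnonzero(treated), np.flatnonzero(~treated)
matches = c_idx[np.abs(pscore[c_idx][None, :]
                       - pscore[t_idx][:, None]).argmin(axis=1)]

# 3. Average treated-minus-matched-control outcome difference (ATT).
att = (y[t_idx] - y[matches]).mean()
print(round(att, 1))   # close to the true effect of 5 here
```

Wilde & Hollister's point is precisely that this favorable case need not hold in practice: when selection runs through unobservables, the matched estimate can land far from the experimental benchmark while looking just as plausible.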

…Two ma­jor con­sid­er­a­tions mo­ti­vated us to un­der­take this study. First, four im­por­tant stud­ies (Fraker and May­nard, 1987; LaLon­de, 1986; Fried­lan­der and Robins, 1995; and De­he­jia and Wah­ba, 1999) have as­sessed the effec­tive­ness of non­ex­per­i­men­tal meth­ods of im­pact as­sess­ment in a com­pelling fash­ion, but these stud­ies have fo­cused solely on so­cial in­ter­ven­tions re­lated to work and their im­pact on the out­come vari­ables of earn­ings, em­ploy­ment rates, and wel­fare uti­liza­tion.

…Be­cause we are in­ter­ested in test­ing non­ex­per­i­men­tal meth­ods on ed­u­ca­tional out­comes, we use Ten­nessee’s Project Star as the source of the “true ran­dom-as­sign­ment” da­ta. We de­scribe Project Star in de­tail be­low. We use the treat­ment group data from a given school for the treat­ments and then con­struct com­par­i­son groups in var­i­ous non­ex­per­i­men­tal ways with data taken out of the con­trol groups in other schools.

…From 1985 to 1989, researchers collected observational data including sex, age, race, and free-lunch status from over 11,000 students (Word, 1990). The schools chosen for the experiment were broadly distributed throughout Tennessee. Originally, the project included eight schools from nonmetropolitan cities and large towns (for example, Manchester and Maryville), 38 schools from rural areas, and 17 inner-city and 16 suburban schools drawn from four metropolitan areas: Knoxville, Nashville, Memphis, and Chattanooga. Beginning in 1985-86, the kindergarten teachers and students within Project Star classes were randomly assigned within schools to either “small” (13-17 pupils), “regular” (22-25), or “regular-with-aide” classes. New students who entered a Project Star school in 1986, 1987, 1988, or 1989 were randomly assigned to classes. Because each school had “the same kinds of students, curriculum, principal, policy, schedule, expenditures, etc, for each class” and the randomization occurred within school, theoretically, the estimated within-school effect of small classes should have been unbiased. During the course of the project, however, there were several deviations from the original experimental design: for example, after kindergarten the students in the regular and regular-with-aide classes were randomly reassigned between regular and regular-with-aide classes, and a significant number of students switched class types between grades. However, Krueger found that, after adjusting for these and other problems, the main Project Star results were unaffected; in all four school types students in small classes scored significantly higher on standardized tests than students in regular-size classes.
In this study, fol­low­ing Krueger’s ex­am­ple, test score is used as the mea­sure of stu­dent achieve­ment and is the out­come vari­able. For all com­par­isons, test score is cal­cu­lated as a per­centile rank of the com­bined raw Stan­ford Achieve­ment read­ing and math scores within the en­tire sam­ple dis­tri­b­u­tion for that grade. The Project Star data set pro­vides mea­sures of a num­ber of stu­dent, teacher, and school char­ac­ter­is­tics. The fol­low­ing are the vari­ables avail­able to use as mea­sures prior to ran­dom as­sign­ment: stu­dent sex, stu­dent race, stu­dent free-lunch sta­tus, teacher race, teacher ed­u­ca­tion, teacher ca­reer lad­der, teacher ex­pe­ri­ence, school type, and school sys­tem ID. In ad­di­tion, the fol­low­ing vari­ables mea­sured con­tem­po­ra­ne­ously can be con­sid­ered ex­oge­nous: stu­dent age, as­sign­ment to small class size.

…One very important and stringent measure of closeness is whether there are many cases in which the nonexperimental impact estimates are opposite in sign from the experimental impact estimates and both sets of impact estimates are statistically significantly different from 0, e.g., the experimental estimates said that the mean test scores of those in smaller classes were significantly negative while the nonexperimental estimates indicated they were significantly positive. There is only one case in these 11 which comes close to this situation. For school 27, the experimental impact estimate is −10.5 and significant at the 6% level, just above the usual significance cutoff of 5%. The nonexperimental impact estimate is 35.2 and significant at better than the 1% level. In other cases (school 7 and school 33), the impact estimates are of opposite sign, but one or the other of them fails the test for being significantly different from 0. If we weaken the stringency of the criterion a bit, we can consider cases in which the experimental impact estimates were significantly different from 0 but the nonexperimental estimates were not (school 16 and school 33), or vice versa (schools 7, 16, and 28). Another, perhaps better, way of assessing the differences in the impact estimates is to look at column 8, which presents a test for whether the impact estimate from the nonexperimental procedure is significantly different from the impact estimate from the experimental procedure. For 8 of the 11 schools, the two impact estimates were statistically significantly different from each other.

…When we first be­gan con­sid­er­ing this is­sue, we thought that a cri­te­rion might be based on the per­cent­age differ­ence in the point es­ti­mates of the im­pact. For ex­am­ple, for school 9 the non­ex­per­i­men­tal es­ti­mate of the im­pact is 135% larger than the ex­per­i­men­tal im­pact. But for school 22 the non­ex­per­i­men­tal im­pact es­ti­mate is 9% larg­er. In­deed, all but two of the non­ex­per­i­men­tal es­ti­mates are more than 50% differ­ent from the ex­per­i­men­tal im­pact es­ti­mates. Whereas in this case the per­cent­age differ­ence in im­pact es­ti­mates seems to in­di­cate quite con­clu­sively that the non­ex­per­i­men­tal es­ti­mates are not gen­er­ally close to the ex­per­i­men­tal ones, in some cases such a per­cent­age differ­ence cri­te­rion might be mis­lead­ing. The cri­te­rion which seems to us the most de­fin­i­tive is whether dis­tance be­tween the non­ex­per­i­men­tal and the ex­per­i­men­tal im­pact es­ti­mates would have been suffi­cient to cause an ob­server to make a differ­ent de­ci­sion from one based on the true ex­per­i­men­tal re­sults. For ex­am­ple, sup­pose that the ex­per­i­men­tal im­pact es­ti­mate had been 0.02 and the non­ex­per­i­men­tal im­pact es­ti­mate had been 0.04, a 100% differ­ence in im­pact es­ti­mate. But fur­ther sup­pose that the de­ci­sion about whether to take an ac­tion, e.g., in­vest in the type of ac­tiv­ity which the treat­ment in­ter­ven­tion rep­re­sents, would have been a yes if the differ­ence be­tween the treat­ments and com­par­isons had been 0.05 or greater and a no if the im­pact es­ti­mate had been less than 0.05. Then even though the non­ex­per­i­men­tal es­ti­mate was 100% larger than the ex­per­i­men­tal es­ti­mate, one would still have de­cided not to in­vest in this type of in­ter­ven­tion whether one had the true ex­per­i­men­tal es­ti­mate or the non­ex­per­i­men­tal es­ti­mate.

…In a couple of his articles presenting aspects of his research using Project Star data, Krueger (1999, 2000) has developed some rough benefit-cost calculations related to reduction in class size. In Appendix B we sketch in a few elements of his calculations which provide the background for the summary measures derived from his calculations that we use to illustrate our “close enough” criterion. The benefits Krueger focuses on are increases in future earnings that could be associated with test score gains. He carefully develops estimates, based on other literature, of what increase in future earnings might be associated with a gain in test scores in the early years of elementary school. With appropriate discounting to present values, and other adjustments, he uses these values as estimates of benefits and then compares them to the estimated cost of reducing class size from 22 to 15, taken from the experience in Project Star and appropriately adjusted. For our purposes, what is most interesting is the way he uses these benefit-cost calculations to answer a slightly different question: How big an effect on test scores due to reduction of class size from 22 to 15 would have been necessary to just justify the expenditures it took to reduce the class size by that much? He states the answer in terms of “effect size,” that is, the impact divided by the estimated standard deviation of the impact. This is a measure that is increasingly used to compare across outcomes that are measured in somewhat different metrics. His answer is that an effect size of 0.2 of a standard deviation of test scores would have been just large enough to generate estimated future earnings gains sufficient to justify the costs.
Krueger indicates that the estimated effect for kindergarten was a 5.37 percentile increase in achievement test scores due to smaller class size and that this was equivalent to 0.2 of a standard deviation in test scores. Therefore we use 5.4 percentile points as the critical value for a decision of whether the reduction in class size from 22 to 15 would have been cost-effective. In Table 2 we use the results from Table 1 to apply the cost-effectiveness criterion to determine the extent to which the nonexperimental estimates might have led to the wrong decision. To create the entries in this table, we look at the Table 1 entry for a given school. If the impact estimate is greater than 5.4 percentile points and statistically significantly different from 0, we enter a Yes, indicating the impact estimate would have led to a conclusion that reducing class size from 22 to 15 was cost-effective. If the impact estimate is less than 5.4 or statistically not significantly different from 0, we enter a No to indicate a conclusion that the class-size reduction was not cost-effective. Column 1 is the school ID, column 2 gives the conclusion on the basis of the experimental impact estimate, column 3 gives the conclusion on the basis of the nonexperimental impact estimate, and column 4 contains an x if the nonexperimental estimate would have led to a “wrong” cost-effectiveness conclusion, i.e., a different conclusion from the experimental impact conclusion about cost-effectiveness.
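The decision rule just described is mechanical enough to state in a few lines. The school impact/p-value numbers below are invented stand-ins, not values from the paper's Table 1; only the 5.4-point break-even threshold and 5% significance cutoff come from the text.

```python
def cost_effective(impact, p_value, threshold=5.4, alpha=0.05):
    """The Table 2 rule: 'Yes' iff the estimated gain exceeds the 5.4-percentile
    break-even point AND is statistically significantly different from 0."""
    return impact > threshold and p_value < alpha

def wrong_decision(exp_impact, exp_p, nonexp_impact, nonexp_p):
    """Would the nonexperimental estimate flip the cost-effectiveness conclusion
    relative to the experimental ('true') one?"""
    return cost_effective(exp_impact, exp_p) != cost_effective(nonexp_impact, nonexp_p)

# Hypothetical school: the experiment finds a 6.1-point gain (p = 0.03), i.e.
# cost-effective, but the nonexperimental estimate is 3.0 points (p = 0.04),
# i.e. not cost-effective: a "wrong" decision under this criterion.
flipped = wrong_decision(6.1, 0.03, 3.0, 0.04)
```

Note that under this rule a large but nonsignificant estimate and a significant but sub-threshold estimate both yield a No, which is exactly why the school-16 "Maybe" case discussed below hinges on the choice of significance level.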

It is easy to see in Ta­ble 2 that the non­ex­per­i­men­tal es­ti­mate would have led to the wrong con­clu­sion in four of the 11 cas­es. For a fifth case, school 16, we en­tered a Maybe in Col­umn 4 be­cause, as shown in Ta­ble 1, for that school the non­ex­per­i­men­tal es­ti­mate was sig­nifi­cantly differ­ent from 0 at only the 9% lev­el, whereas the usual sig­nifi­cance cri­te­rion is 5%. Even though the non­ex­per­i­men­tal point es­ti­mate of im­pact was greater than 5.4 per­centile points, strict use of the 5% sig­nifi­cance cri­te­rion would have led to the con­clu­sion that the re­duc­tion in class size was not cost-effec­tive. On the other hand, an­a­lysts some­times use a 10% sig­nifi­cance cri­te­ri­on, so it could be ar­gued that they might have used that level to con­clude the pro­gram was cost-effec­tive - thus the Maybe en­try for this school.

…In all seven selected cases, the experimental and nonexperimental estimates differ considerably from each other. One of the nonexperimental estimates is of the wrong sign, while in the other estimates, the signs are the same but all the estimates differ by at least 1.8 percentage points, ranging up to as much as 12 percentage points (rural-city). Statistical inferences about the significance of these program effects also vary (five of the seven pairs had differing inferences; i.e., only one estimate of the program effect in a pair is statistically significant at the 10% level). All of the differences between the experimental and nonexperimental estimates (the test of difference between the outcomes for the experimental control group and the nonexperimental comparison group) in this subset were statistically significant.

Ta­ble 5 shows the re­sults for the com­plete set of the first 49 pairs of es­ti­mates. Each col­umn shows a differ­ent type of com­par­i­son (ei­ther school type or dis­trict type). The top row in each col­umn pro­vides the num­ber of pairs of ex­per­i­men­tal and non­ex­per­i­men­tal es­ti­mates in the col­umn. The sec­ond row shows the mean es­ti­mate of pro­gram effect from the (un­bi­ased) ex­per­i­men­tal es­ti­mates. The third row has the mean ab­solute differ­ences be­tween these es­ti­mates, pro­vid­ing some in­di­ca­tion of the size of our non­ex­per­i­men­tal bias. The fourth row pro­vides the per­cent­age of pairs in which the ex­per­i­men­tal and non­ex­per­i­men­tal es­ti­mates led to differ­ent in­fer­ences about the sig­nifi­cance of the pro­gram effect. The fifth row in­di­cates the per­cent­age of pairs in which the differ­ence be­tween the two es­ti­mated val­ues was sig­nifi­cant (a­gain the test of differ­ence be­tween con­trol and com­par­i­son group). Look­ing at the sum­ma­rized re­sults for com­par­isons across school type, these re­sults sug­gest that con­struct­ing non­ex­per­i­men­tal groups based on sim­i­lar de­mo­graphic school types leads to non­ex­per­i­men­tal es­ti­mates that do not per­form very well when com­pared with the ex­per­i­men­tal es­ti­mates for the same group. In 50% of the pairs, ex­per­i­men­tal and non­ex­per­i­men­tal es­ti­mates had differ­ent sta­tis­ti­cal in­fer­ences, with a mean ab­solute differ­ence in effect es­ti­mate of 4.65. Over 75% of these differ­ences were sta­tis­ti­cally sig­nifi­cant. About half of the es­ti­mated pairs in com­par­isons across school type differ by more than 5 per­cent­age points.


“As­sign­ment Meth­ods in Ex­per­i­men­ta­tion: When Do Non­ran­dom­ized Ex­per­i­ments Ap­prox­i­mate An­swers From Ran­dom­ized Ex­per­i­ments?”, Heins­man & Shadish 1996:

This meta-analysis compares effect size estimates from 51 randomized experiments to those from 47 nonrandomized experiments. These experiments were drawn from published and unpublished studies of Scholastic Aptitude Test coaching, ability grouping of students within classrooms, presurgical education of patients to improve post-surgical outcome, and drug abuse prevention with juveniles. The raw results suggest that the two kinds of experiments yield very different answers. But when studies are equated for crucial features (which is not always possible), nonrandomized experiments can yield a reasonably accurate effect size in comparison with randomized designs. Crucial design features include the activity level of the intervention given to the control group, pretest effect size, selection and attrition levels, and the accuracy of the effect-size estimation method. Implications of these results for the conduct of meta-analysis and for the design of good nonrandomized experiments are discussed.

…When cer­tain as­sump­tions are met (e.g., no treat­ment cor­re­lated at­tri­tion) and it is prop­erly ex­e­cuted (e.g., as­sign­ment is not over­rid­den), ran­dom as­sign­ment al­lows un­bi­ased es­ti­mates of treat­ment effects and jus­ti­fies the the­ory that leads to tests of sig­nifi­cance. We com­pare this ex­per­i­ment to a closely re­lated qua­si­ex­per­i­men­tal de­sign - the non­equiv­a­lent con­trol group de­sign - that is sim­i­lar to the ran­dom­ized ex­per­i­ment ex­cept that units are not as­signed to con­di­tions at ran­dom (Cook & Camp­bell, 1979).

Sta­tis­ti­cal the­ory is mostly silent about the sta­tis­ti­cal char­ac­ter­is­tics (bi­as, con­sis­ten­cy, and effi­cien­cy) of this de­sign. How­ev­er, meta-an­a­lysts have em­pir­i­cally com­pared the two de­signs. In meta-analy­sis, study out­comes are sum­ma­rized with an effect size sta­tis­tic (Glass, 1976). In the present case, the stan­dard­ized mean differ­ence sta­tis­tic is rel­e­vant:

d = (M_E − M_C) / SD_P, where M_E is the mean of the experimental group, M_C is the mean of the comparison group, and SD_P is the pooled standard deviation. This statistic allows the meta-analyst to combine study outcomes that are in disparate metrics into a single metric for aggregation. Comparisons of effect sizes from randomized and nonrandomized experiments have yielded inconsistent results (e.g., Becker, 1990; Colditz, Miller, & Mosteller, 1988; Hazelrigg, Cooper, & Borduin, 1987; Shapiro & Shapiro, 1983; Smith, Glass & Miller, 1980). A recent summary of such work (Lipsey & Wilson, 1993) aggregated the results of 74 meta-analyses that reported separate standardized mean difference statistics for randomized and nonrandomized studies. Overall, the randomized studies yielded an average standardized mean difference statistic of d = 0.46 (SD = 0.28), trivially higher than the nonrandomized studies’ d = 0.41 (SD = 0.36); that is, the difference was near zero on the average over these 74 meta-analyses. Lipsey and Wilson (1993) concluded that “there is no strong pattern or bias in the direction of the difference made by lower quality methods. In a given treatment area, poor design or low methodological quality may result in a treatment estimate quite discrepant from what a better quality design would yield, but it is almost as likely to be an underestimate as an overestimate” (p. 1193). However, we believe that considerable ambiguity still remains about this methodological issue.
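In code, the standardized mean difference with a pooled standard deviation is a one-liner plus bookkeeping; the two tiny samples below are made up purely to check the arithmetic.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(experimental, comparison):
    """d = (M_E - M_C) / SD_P, with SD_P the pooled (sample) standard deviation."""
    n_e, n_c = len(experimental), len(comparison)
    sd_p = math.sqrt(((n_e - 1) * variance(experimental) +
                      (n_c - 1) * variance(comparison)) / (n_e + n_c - 2))
    return (mean(experimental) - mean(comparison)) / sd_p

# Invented check: means 4 vs 2, both sample variances 1, so SD_P = 1 and d = 2.0
d = standardized_mean_difference([3, 4, 5], [1, 2, 3])
```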

…The present study drew from four past meta-analy­ses that con­tained both ran­dom and non­ran­dom­ized ex­per­i­ments on ju­ve­nile drug use pre­ven­tion pro­grams (To­bler, 1986), psy­choso­cial in­ter­ven­tions for post-surgery out­comes (Devine, 1992), coach­ing for Scholas­tic Ap­ti­tude Test per­for­mance (Beck­er, 1990), and abil­ity group­ing of pupils in sec­ondary school classes (Slav­in, 1990). These four ar­eas were se­lected de­lib­er­ately to re­flect differ­ent kinds of in­ter­ven­tions and sub­stan­tive top­ics. …All four meta-analy­ses also in­cluded many un­pub­lished man­u­scripts, al­low­ing us to ex­am­ine pub­li­ca­tion bias effects. In this re­gard, a prac­ti­cal rea­son for choos­ing these four was that pre­vi­ous con­tacts with three of the four au­thors of these meta-analy­ses sug­gested that they would be will­ing to pro­vide us with these un­pub­lished doc­u­ments.

…This procedure yielded 98 studies for inclusion, 51 random and 47 nonrandom. These studies allowed computation of 733 effect sizes, which we aggregated to 98 study-level effect sizes. Table 1 describes the number of studies in more detail. Retrieving equal numbers of published and unpublished studies in each cell of Table 1 proved impossible. Selection criteria resulted in elimination of 203 studies, of which 40 did not provide enough statistics to calculate at least one good effect size; 119 reported data only for significant effects but not for nonsignificant ones; 15 did not describe assignment method adequately; 11 reported only dichotomous outcome measures; 9 used haphazard assignment; 5 had no control group; and 4 were eliminated for other reasons (extremely implausible data, no posttest reported, severe unit of analysis problem, or failure to report any empirical results).

…Table 2 shows that over all 98 studies, experiments in which subjects were randomly assigned to conditions yielded significantly larger effect sizes than did experiments in which random assignment did not take place (Q = 82.09, df = 1, p < 0.0001). Within area, randomized experiments yielded significantly more positive effect sizes for ability grouping (Q = 4.76, df = 1, p = 0.029) and for drug-use prevention studies (Q = 15.67, df = 1, p = 0.000075) but not for SAT coaching (Q = 0.02, df = 1, p = 0.89) and presurgical intervention studies (Q = 0.17, df = 1, p = 0.68). This yielded a borderline interaction between assignment mechanism and substantive area (Q = 5.93, df = 3, p = 0.12). We include this interaction in subsequent regressions because power to detect interactions is smaller than power to detect main effects and because such an interaction is conceptually the same as Lipsey and Wilson’s (1993) finding that assignment method differences may vary considerably over substantive areas. Finally, as Hedges (1983) predicted, the variance component for nonrandomized experiments was twice as large as the variance component for randomized experiments in the overall sample. Within areas, variance components were equal in two areas but larger for nonrandomized experiments in two others. Hence nonrandom assignment may result in unusually disparate effect size estimates, creating different means and variances.
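The Q statistics reported above are standard fixed-effect heterogeneity tests. A minimal sketch of the between-groups version, comparing the weighted mean effect size of randomized vs. nonrandomized studies, follows; the effect sizes and variances are invented, and this is the generic textbook computation, not necessarily the authors' exact procedure. With two groups, df = 1, so the 5% critical value is 3.84.

```python
def q_between(groups):
    """Fixed-effect between-groups heterogeneity statistic.
    groups: one list per design type, each holding (effect_size, variance) pairs."""
    stats = []
    for g in groups:
        w = sum(1 / v for _, v in g)            # inverse-variance group weight
        m = sum(d / v for d, v in g) / w        # weighted mean effect size of group
        stats.append((m, w))
    grand = sum(m * w for m, w in stats) / sum(w for _, w in stats)
    return sum(w * (m - grand) ** 2 for m, w in stats)

# Invented example: 5 randomized studies centered on d = 0.5, 5 nonrandomized
# studies centered on d = 0.3, all with sampling variance 0.01.
randomized = [(0.5, 0.01)] * 5
nonrandomized = [(0.3, 0.01)] * 5
q = q_between([randomized, nonrandomized])      # 10.0 > 3.84: significant at 5%
```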

…Effect size was higher with low differ­en­tial and to­tal at­tri­tion, with pas­sive con­trols, with higher pretest effect sizes, when the se­lec­tion mech­a­nism did not in­volve self­-s­e­lec­tion of sub­jects into treat­ment, and with ex­act effect size com­pu­ta­tion mea­sures.

Pro­ject­ing the Re­sults of an Ideal Com­par­i­son: Given these find­ings, one might ask what an ideal com­par­i­son be­tween ran­dom­ized and non­ran­dom­ized ex­per­i­ments would yield. We sim­u­late such a com­par­i­son in Ta­ble 6 us­ing the re­sults in Ta­ble 5, pro­ject­ing effect sizes us­ing pre­dic­tor val­ues that equate stud­ies at an ideal or a rea­son­able lev­el. The pro­jec­tions in Ta­ble 6 as­sume that both ran­dom­ized and non­ran­dom­ized ex­per­i­ments used pas­sive con­trol groups, in­ter­nal con­trol groups, and match­ing; al­lowed ex­act com­pu­ta­tion of d; had no at­tri­tion; stan­dard­ized treat­ments; were pub­lished; had pretest effect sizes of ze­ro; used n = 1,000 sub­jects per study; did not al­low self­-s­e­lec­tion of sub­jects into con­di­tions; and used out­comes based on self­-re­ports and specifi­cally tai­lored to treat­ment. Area effects and in­ter­ac­tion effects be­tween area and as­sign­ment were in­cluded in the pro­jec­tion. Note that the over­all differ­ence among the eight cell means has di­min­ished dra­mat­i­cally in com­par­i­son with Ta­ble 2. In Ta­ble 2, the low­est cell mean was -0.23 and the high­est was 0.37, for a range of 0.60. The range in Ta­ble 6 is only half as large (0.34). The same con­clu­sion is true for the range within each area. In Ta­ble 2 that range was 0.01 for the small­est differ­ence be­tween ran­dom­ized and non­ran­dom­ized ex­per­i­ments (SAT coach­ing) to 0.21 for the largest differ­ence (drug-use pre­ven­tion). In Ta­ble 6, the range was 0.11 (SAT coach­ing), 0.01 (a­bil­ity group­ing), 0.05 (presur­gi­cal in­ter­ven­tion­s), and 0.09 (drug-use pre­ven­tion). Put a bit more sim­ply, non­ran­dom­ized ex­per­i­ments are more like ran­dom­ized ex­per­i­ments if one takes con­founds into ac­count.

“Comparison of Evidence of Treatment Effects in Randomized and Nonrandomized Studies”, Ioannidis et al 2001:

Study Se­lec­tion: 45 di­verse top­ics were iden­ti­fied for which both ran­dom­ized tri­als (n = 240) and non­ran­dom­ized stud­ies (n = 168) had been per­formed and had been con­sid­ered in meta-analy­ses of bi­nary out­comes.

Data Ex­trac­tion: Data on events per pa­tient in each study arm and de­sign and char­ac­ter­is­tics of each study con­sid­ered in each meta-analy­sis were ex­tracted and syn­the­sized sep­a­rately for ran­dom­ized and non­ran­dom­ized stud­ies.

Data Synthesis: Very good correlation was observed between the summary odds ratios of randomized and nonrandomized studies (r = 0.75; p < 0.001); however, nonrandomized studies tended to show larger treatment effects (28 vs 11; p = 0.009). Between-study heterogeneity was frequent among randomized trials alone (23%) and very frequent among nonrandomized studies alone (41%). The summary results of the 2 types of designs differed beyond chance in 7 cases (16%). Discrepancies beyond chance were less common when only prospective studies were considered (8%). Occasional differences in sample size and timing of publication were also noted between discrepant randomized and nonrandomized studies. In 28 cases (62%), the natural logarithm of the odds ratio differed by at least 50%, and in 15 cases (33%), the odds ratio varied at least 2-fold between nonrandomized studies and randomized trials.
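The last two descriptive criteria are easy to make precise. The abstract does not spell out edge cases (e.g., a randomized OR of exactly 1, for which a relative log difference is undefined), so the handling below is an assumption, and the odds ratios are invented.

```python
import math

def discrepancy_flags(or_randomized, or_nonrandomized):
    """Return (log-OR differs by >= 50%, OR differs by >= 2-fold)."""
    log_r, log_n = math.log(or_randomized), math.log(or_nonrandomized)
    # relative difference of the natural-log odds ratios, taking the RCT as the
    # denominator (assumed convention; undefined when the RCT OR is exactly 1)
    rel = abs(log_n - log_r) / abs(log_r) if log_r else float("inf")
    fold = max(or_randomized, or_nonrandomized) / min(or_randomized, or_nonrandomized)
    return rel >= 0.5, fold >= 2.0

# Invented ORs 0.8 (RCT) vs 0.5 (observational): the log-ORs (-0.22 vs -0.69)
# differ by far more than 50%, yet the ratio is only 1.6-fold, below 2.
flags = discrepancy_flags(0.8, 0.5)
```

This also illustrates why the two criteria can disagree: near the null, modest absolute shifts in the OR are enormous in relative log terms.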

Con­clu­sions: De­spite good cor­re­la­tion be­tween ran­dom­ized tri­als and non­ran­dom­ized stud­ies - in par­tic­u­lar, prospec­tive stud­ies - dis­crep­an­cies be­yond chance do oc­cur and differ­ences in es­ti­mated mag­ni­tude of treat­ment effect are very com­mon.

“Com­par­i­son of Effects in Ran­dom­ized Con­trolled Tri­als With Ob­ser­va­tional Stud­ies in Di­ges­tive Surgery”, Shikata et al 2006:

Meth­ods: The PubMed (1966 to April 2004), EMBASE (1986 to April 2004) and Cochrane data­bases (Is­sue 2, 2004) were searched to iden­tify meta-analy­ses of ran­dom­ized con­trolled tri­als in di­ges­tive surgery. Fifty-two out­comes of 18 top­ics were iden­ti­fied from 276 orig­i­nal ar­ti­cles (96 ran­dom­ized tri­als, 180 ob­ser­va­tional stud­ies) and in­cluded in meta-analy­ses. All avail­able bi­nary data and study char­ac­ter­is­tics were ex­tracted and com­bined sep­a­rately for ran­dom­ized and ob­ser­va­tional stud­ies. In each se­lected di­ges­tive sur­gi­cal top­ic, sum­mary odds ra­tios or rel­a­tive risks from ran­dom­ized con­trolled tri­als were com­pared with ob­ser­va­tional stud­ies us­ing an equiv­a­lent cal­cu­la­tion method.

Re­sults: Sig­nifi­cant be­tween-s­tudy het­ero­gene­ity was seen more often among ob­ser­va­tional stud­ies (5 of 12 top­ics) than among ran­dom­ized tri­als (1 of 9 top­ic­s). In 4 of the 16 pri­mary out­comes com­pared (10 of 52 to­tal out­comes), sum­mary es­ti­mates of treat­ment effects showed sig­nifi­cant dis­crep­an­cies be­tween the two de­signs.

Con­clu­sions: One fourth of ob­ser­va­tional stud­ies gave differ­ent re­sults than ran­dom­ized tri­als, and be­tween-s­tudy het­ero­gene­ity was more com­mon in ob­ser­va­tional stud­ies in the field of di­ges­tive surgery.

“A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”, Gordon et al 2019:

We examine how common techniques used to measure the causal impact of ad exposures on users’ conversion outcomes compare to the “gold standard” of a true experiment (randomized controlled trial). Using data from 12 US advertising lift studies at Facebook comprising 435 million user-study observations and 1.4 billion total impressions, we contrast the experimental results to those obtained from observational methods, such as comparing exposed to unexposed users, matching methods, model-based adjustments, synthetic matched-markets tests, and before-after tests. We show that observational methods often fail to produce the same results as true experiments even after conditioning on information from thousands of behavioral variables and using non-linear models. We explain why this is the case. Our findings suggest that common approaches used to measure advertising effectiveness in industry fail to measure accurately the true effect of ads.

…Fig­ure 13 sum­ma­rizes re­sults for the four stud­ies for which there was a con­ver­sion pixel on a reg­is­tra­tion page. Fig­ure 14 sum­ma­rizes re­sults for the three stud­ies for which there was a con­ver­sion pixel on a key land­ing page. The re­sults for these stud­ies vary across stud­ies in how they com­pare to the RCT re­sults, just as they do for the check­out con­ver­sion stud­ies re­ported in Fig­ures 11 and 12.

We sum­ma­rize the per­for­mance of differ­ent ob­ser­va­tional ap­proaches us­ing two differ­ent met­rics. We want to know first how often an ob­ser­va­tional study fails to cap­ture the truth. Said in a sta­tis­ti­cally pre­cise way, “For how many of the stud­ies do we re­ject the hy­poth­e­sis that the lift of the ob­ser­va­tional method is equal to the RCT lift?” Ta­ble 7 re­ports the an­swer to this ques­tion. We di­vide the ta­ble by out­come re­ported in the study (check­out is in the top sec­tion of Ta­ble 7, fol­lowed by reg­is­tra­tion and page view). The first row of Ta­ble 7 tells us that of the 11 stud­ies that tracked check­out con­ver­sions, we sta­tis­ti­cally re­ject the hy­poth­e­sis that the ex­act match­ing es­ti­mate of lift equals the RCT es­ti­mate. As we go down the column, the propen­sity score match­ing and re­gres­sion ad­just­ment ap­proaches fare a lit­tle bet­ter, but for all but one spec­i­fi­ca­tion, we re­ject equal­ity with the RCT es­ti­mate for half the stud­ies or more.

We would also like to know how different the estimate produced by an observational method is from the RCT estimate. Said more precisely, we ask “Across evaluated studies of a given outcome, what is the average absolute deviation in percentage points between the observational method estimate of lift and the RCT lift?” For example, the RCT lift for study 1 (checkout outcome) is 33%. The EM lift estimate is 117%. Hence the absolute lift deviation is 84 percentage points. For study 2 (checkout outcome) the RCT lift is 0.9%, the EM lift estimate is 535%, and the absolute lift deviation is 534 percentage points. When we average over all studies, exact matching leads to an average absolute lift deviation of 661 percentage points relative to an average RCT lift of 57% across studies (see the last two columns of the first row of the table).
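The arithmetic in this paragraph can be replicated directly (lifts in percent; only the two worked examples from the text are used, not the full set of studies behind the 661-point average):

```python
def absolute_lift_deviation(rct_lift, obs_lift):
    """Absolute deviation, in percentage points, between two lift estimates."""
    return abs(obs_lift - rct_lift)

def average_absolute_deviation(pairs):
    """Mean absolute deviation over (RCT lift, observational lift) study pairs."""
    return sum(absolute_lift_deviation(r, o) for r, o in pairs) / len(pairs)

study1 = absolute_lift_deviation(33, 117)    # 84 percentage points
study2 = absolute_lift_deviation(0.9, 535)   # 534.1 percentage points
```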

“Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on Pandora Internet Radio”, Huang et al 2018:

How valu­able is the vari­a­tion gen­er­ated by this ex­per­i­ment? Since it can be diffi­cult to con­vince de­ci­sion mak­ers to run ex­per­i­ments on key eco­nomic de­ci­sions, and it con­sumes en­gi­neer­ing re­sources to im­ple­ment such an ex­per­i­ment prop­er­ly, could we have done just as well by us­ing ob­ser­va­tional data? To in­ves­ti­gate this ques­tion, we re-run our analy­sis us­ing only the 17 mil­lion lis­ten­ers in the con­trol group, since they were un­touched by the ex­per­i­ment in the sense that they re­ceived the de­fault ad load. In the ab­sence of ex­per­i­men­tal in­stru­men­tal vari­ables, we run the re­gres­sion based on nat­u­rally oc­cur­ring vari­a­tion in ad load, such as that caused by higher ad­ver­tiser de­mand for some lis­ten­ers than oth­ers, ex­clud­ing lis­ten­ers who got no ads dur­ing the ex­per­i­men­tal pe­ri­od.

The re­sults are found in Ta­ble 7. We find that the en­do­gene­ity of the re­al­ized ad load (some peo­ple get more ad load due to ad­ver­tiser de­mand, and these peo­ple hap­pen to lis­ten less than peo­ple with lower ad­ver­tiser de­mand) causes us to over­es­ti­mate the true causal im­pact of ad load by a fac­tor of ap­prox­i­mately 4.

…To give the panel estimator its best shot, we use what we have learned from the experiment, and allow the panel estimator a 20-month time period between observations… We see from Table 8 that the point estimate for active days is much closer to that found in Table 5, but it still overestimates the impact of ad load by more than the width of our 95% confidence intervals. The panel point estimate for total hours, while an improvement over the cross-sectional regression results, still overestimates the effect by a factor of 3. Our result suggests that, even after controlling for time-invariant listener heterogeneity, observational techniques still suffer from omitted-variable bias caused by unobservable terms that vary across individuals and time that correlate with ad load and listening behaviors. And without a long-run experiment, we would not have known the relevant timescale to consider to measure the long-run sensitivity to advertising (which is what matters for the platform’s policy decisions).
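None of Pandora's data appears here; the following toy simulation (all coefficients invented) merely illustrates the mechanism the authors describe. When unobserved advertiser demand raises a listener's ad load and independently depresses their listening, a naive observational regression overstates the causal effect, while randomized ad load recovers it.

```python
import random

def slope(x, y):
    """Univariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
            sum((a - mx) ** 2 for a in x))

rng = random.Random(7)
n, true_effect = 50_000, -1.0   # assume each extra ad/hour costs 1 listening hour

demand = [rng.gauss(0, 1) for _ in range(n)]          # unobserved advertiser demand
ads_obs = [5 + d + rng.gauss(0, 1) for d in demand]   # demand raises realized ad load
ads_rct = [5 + rng.gauss(0, 1) for _ in range(n)]     # randomized: independent of demand

def listening(ads):
    # demand also directly lowers listening (e.g., demographics advertisers target)
    return [20 + true_effect * a - 2 * d + rng.gauss(0, 1)
            for a, d in zip(ads, demand)]

naive = slope(ads_obs, listening(ads_obs))            # converges to -2: 2x overstatement
experimental = slope(ads_rct, listening(ads_rct))     # converges to -1: the truth
```

With these particular invented coefficients the asymptotic naive slope is Cov(a, y)/Var(a) = −4/2 = −2, i.e. a 2-fold overstatement; the paper reports factors of roughly 3-4, which would just correspond to different coefficient values.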


“An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys”, Bacher-Hicks et al 2017:

We conduct a random assignment experiment to test the predictive validity of three measures of teacher performance: value added, classroom observations, and student surveys. Combining our results with those from two previous random assignment experiments, we provide further (and more precise) evidence that value-added measures are unbiased predictors of teacher performance. In addition, we provide the first evidence that classroom observation scores are unbiased predictors of teacher performance on a rubric measuring the quality of mathematics instruction, but we lack the statistical power to reach any similar conclusions for student survey responses.

We used the pre-existing administrative records and the additional data collected by NCTE to generate estimates of teacher performance on five measures: (a) students’ scores on state standardized mathematics tests; (b) students’ scores on the project-developed mathematics test (Hickman, Fu, & Hill, 2012); (c) teachers’ performance on the Mathematical Quality of Instruction (MQI; Hill et al., 2008) classroom observation instrument; (d) teachers’ performance on the Classroom Assessment Scoring System (CLASS; La Paro, Pianta, & Hamre, 2012) observation instrument; and (e) students’ responses to a Tripod-based perception survey (Ferguson, 2009; cf. Kane and Staiger, 2008; McCaffrey, Miller, & Staiger, 2013).

“Validating Teacher Effects On Students’ Attitudes And Behaviors: Evidence From Random Assignment Of Teachers To Students”, Blazar 2017:

There is growing interest among researchers, policymakers, and practitioners in identifying teachers who are skilled at improving student outcomes beyond test scores. However, important questions remain about the validity of these teacher effect estimates. Leveraging the random assignment of teachers to classes, I find that teachers have causal effects on their students’ self-reported behavior in class, self-efficacy in math, and happiness in class that are similar in magnitude to effects on math test scores. Weak correlations between teacher effects on different student outcomes indicate that these measures capture unique skills that teachers bring to the classroom. Teacher effects calculated in non-experimental data are related to these same outcomes following random assignment, revealing that they contain important information content on teachers. However, for some non-experimental teacher effect estimates, large and potentially important degrees of bias remain. These results suggest that researchers and policymakers should proceed with caution when using these measures. They likely are more appropriate for low-stakes decisions, such as matching teachers to professional development, than for high-stakes personnel decisions and accountability.

…In Table 5, I report estimates describing the relationship between non-experimental teacher effects on student outcomes and these same measures following random assignment. Cells contain estimates from separate regression models where the dependent variable is the student attitude or behavior listed in each column. The independent variable of interest is the non-experimental teacher effect on this same outcome estimated in years prior to random assignment. All models include fixed effects for randomization block to match the experimental design. In order to increase the precision of my estimates, models also control for students’ prior achievement in math and reading, student demographic characteristics, and classroom characteristics from randomly assigned rosters.

…Validity evidence for teacher effects on students’ math performance is consistent with other experimental studies (Kane et al. 2013; Kane and Staiger 2008), where predicted differences in teacher effectiveness in observational data come close to actual differences following random assignment of teachers to classes. The non-experimental teacher effect estimate that comes closest to a 1:1 relationship is the shrunken estimate that controls for students’ prior achievement and other demographic characteristics (0.995 SD).

Despite a relatively small sample of teachers, the standard error for this estimate (0.084) is substantively smaller than those in other studies—including the meta-analysis conducted by Bacher-Hicks et al. (2017)—and allows me to rule out relatively large degrees of bias in teacher effects calculated from this model. A likely explanation for greater precision in this study relative to others is the fact that other studies generate estimates through instrumental variables estimation to calculate treatment on the treated. Instead, I use OLS regression and account for non-compliance by narrowing in on randomization blocks in which very few, if any, students moved out of their randomly assigned teachers’ classroom. Non-experimental teacher effects calculated without shrinkage are related less strongly to current student outcomes, though differences in estimates and associated standard errors between Panel A and Panel B are not large. All corresponding estimates (e.g., Model 1 from Panel A versus Panel B) have overlapping 95% confidence intervals.

…For both Self-Efficacy in Math and Happiness in Class, non-experimental teacher effect estimates have moderate predictive validity. Generally, I can distinguish estimates from 0 SD, indicating that they contain some information content on teachers. The exception is shrunken estimates for Self-Efficacy in Math. Although estimates are similar in magnitude to the unshrunken estimates in Panel A, between 0.42 SD and 0.58 SD, standard errors are large and 95% confidence intervals cross 0 SD. I also can distinguish many estimates from 1 SD. This indicates that non-experimental teacher effects on students’ Self-Efficacy in Math and Happiness in Class contain potentially large and important degrees of bias. For both measures of teacher effectiveness, point estimates around 0.5 SD suggest that they contain roughly 50% bias.

“Table 5: Relationship between Current Student Outcomes and Prior, Non-experimental Teacher Effect Outcomes” [3⁄4 of the effects are <1, implying bias]
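The “slope <1 implies bias” reading can be checked with a toy calculation (all numbers invented, not from Blazar’s data): if a non-experimental teacher-effect estimate equals the true effect plus an orthogonal error-plus-bias term of equal variance, then regressing actual post-random-assignment outcomes on the prior prediction yields a slope of Cov(prediction, truth)/Var(prediction) = 1/(1+1) = 0.5, meaning only half of any predicted difference between two teachers materializes under random assignment:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # hypothetical teachers (inflated for a clean estimate)

# Invented setup: prediction = truth + an orthogonal error/bias term
# with the same variance as the truth.
true_effect = rng.normal(0, 1, n)
predicted = true_effect + rng.normal(0, 1, n)

# Forecast regression of actual outcomes on the prior prediction:
# slope = Cov(predicted, truth) / Var(predicted) = 1 / (1 + 1) = 0.5.
b = np.cov(predicted, true_effect)[0, 1] / np.var(predicted, ddof=1)
print(f"forecast slope ≈ {b:.2f}")
```

A slope near 0.5 means a predicted 1-SD gap between teachers corresponds to only a ~0.5-SD actual gap, which is the sense in which a 0.5 SD point estimate reads as “roughly 50% bias”.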