>
> School  Nonexperimental Regression   Experimental Estimate
> ------- ---------------------------- -----------------------
> A       9.6                          -5.2
> B       15.3\*                       13.0\*
> C       1.9                          24.1\*
> D       35.2\*                       33.1\*
> E       20.4\*                       -10.5
> F       0.2                          1.3
> G       -8.6                         10.6\*
> H       -5.6                         9.6\*
> I       16.5\*                       14.7\*
> J       24.3\*                       16.2\*
> K       27.8\*                       19.3\*
>
> Table: Table 1, Comparison of Experimental and Nonexperimental Estimates for Effects of Class Size on Student Test Scores [\* = estimate statistically significant at the 5% cutoff.]
>
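
One way to read Table 1 is to tally where its two columns disagree. The following is a minimal sketch that uses only the point estimates and significance stars transcribed from the table; the dictionary layout and the `summarize` helper are illustrative assumptions, not part of the source study.

```python
# Table 1 as transcribed above.
# school -> ((nonexperimental estimate, starred?), (experimental estimate, starred?))
TABLE_1 = {
    "A": ((9.6, False), (-5.2, False)),
    "B": ((15.3, True), (13.0, True)),
    "C": ((1.9, False), (24.1, True)),
    "D": ((35.2, True), (33.1, True)),
    "E": ((20.4, True), (-10.5, False)),
    "F": ((0.2, False), (1.3, False)),
    "G": ((-8.6, False), (10.6, True)),
    "H": ((-5.6, False), (9.6, True)),
    "I": ((16.5, True), (14.7, True)),
    "J": ((24.3, True), (16.2, True)),
    "K": ((27.8, True), (19.3, True)),
}


def summarize(table):
    """Return (schools with opposite-signed estimates, schools where the stars disagree)."""
    sign_flips = sorted(
        school for school, ((ne, _), (ex, _)) in table.items() if ne * ex < 0
    )
    sig_mismatch = sorted(
        school
        for school, ((_, ne_sig), (_, ex_sig)) in table.items()
        if ne_sig != ex_sig
    )
    return sign_flips, sig_mismatch


sign_flips, sig_mismatch = summarize(TABLE_1)
print("opposite signs:", sign_flips)
print("significance disagrees:", sig_mismatch)
```

A tally like this only counts disagreements in sign and in the 5% significance stars; it does not test whether the two estimates for a school differ significantly from each other.
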

> A second example of the bias that may result with nonexperimental estimates comes from the U.S. Department of Housing and Urban Development's Moving to Opportunity (MTO) housing-voucher experiment, which randomly assigned housing-project residents in high-poverty neighborhoods of five of the nation's largest cities to either a group that was offered a housing voucher to relocate to a lower poverty area or to a control group that received no mobility assistance under the program (Ludwig, Duncan, & Hirschfield, 2001). Because of well-implemented random assignment, each of the groups on average should be equivalent (subject to sampling variability) with respect to all observable and unobservable preprogram characteristics.
>
> - Ludwig, J. (1999). "Experimental and non-experimental estimates of neighborhood effects on low-income families". Unpublished document, Georgetown University.
> - Ludwig, J., Duncan, G., & Hirschfield, P. (2001). ["Urban poverty and juvenile crime: Evidence from a randomized housing-mobility experiment"](http://www.nber.org/mtopublic/baltimore/mto_balt_delinquency.pdf). _Quarterly Journal of Economics_, 116, 655-680.
>
> Table 2 presents the results of using the randomized design of MTO to generate unbiased estimates of the effects of moving from high- to low-poverty census tracts on teen crime. The experimental estimates are the difference between average outcomes of all families offered vouchers and those assigned to the control group, divided by the difference across the two groups in the proportion of families who moved to a low-poverty area. (Note the implication that these kinds of experimental data can be used to produce unbiased estimates of the effects of neighborhood characteristics on developmental outcomes, even if the takeup rate is less than 100% in the treatment group and greater than 0% among the control group.)^4^ The nonexperimental estimates simply compare families who moved to low-poverty neighborhoods with those who did not, ignoring information about each family's random assignment and relying on the set of pre-random-assignment measures of MTO family characteristics to adjust for differences between families who chose to move and those who did not.^5^ As seen in Table 2, even after statistically adjusting for a rich set of background characteristics the nonexperimental measure-the-unmeasured approach leads to starkly different inferences about the effects of residential mobility compared with the unbiased experimental estimates. For example, the experimental estimates suggest that moving from a high- to a low-poverty census tract significantly reduces the number of violent crimes. In contrast, the nonexperimental estimates find that such moves have essentially no effect on violent arrests. In the case of "other" crimes, the nonexperimental estimates suggest that such moves reduce crime, but the experimentally based estimates do not.
>
>
>
> Measure            Experimental   SE    Non-experimental   SE    Sample Size
> ------------------ -------------- ----- ------------------ ----- --------------
> Violent Crime      -47.4\*        24.3  -4.9               12.5  259
> Property Crime     29.7           28.9  -10.8              14.1  259
> Other Crimes       -0.6           37.4  -36.9\*            14.3  259
>
> Table: Table 2, Estimated Impacts of Moving From a High- to a Low-Poverty Neighborhood on Arrests Per 100 Juveniles [From Ludwig (1999), based on data from the Baltimore Moving to Opportunity experiment. Regression models also control for baseline measurement of gender, age at random assignment, and preprogram criminal involvement, family's preprogram victimization, mother's schooling, welfare receipt and marital status. \* = estimated effect of dropout program on dropout rates statistically significant at the 5% cutoff level.]
>
>
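
The experimental estimator described in the MTO excerpt (the difference in mean outcomes between the offered and control groups, divided by the difference in the share who moved) can be sketched in a few lines. This is a minimal illustration, not the study's code: the function names and the toy data are hypothetical, and the naive comparison shown for contrast omits the regression adjustment that the nonexperimental estimates in Table 2 include.

```python
from statistics import fmean


def wald_estimate(y_offered, y_control, moved_offered, moved_control):
    """Difference in mean outcomes divided by the difference in take-up rates."""
    itt = fmean(y_offered) - fmean(y_control)  # effect of being offered a voucher
    takeup_gap = fmean(moved_offered) - fmean(moved_control)
    if takeup_gap == 0:
        raise ValueError("take-up does not differ between groups; ratio undefined")
    return itt / takeup_gap


def naive_mover_comparison(y, moved):
    """Unadjusted movers-minus-stayers difference, ignoring random assignment."""
    movers = [yi for yi, m in zip(y, moved) if m]
    stayers = [yi for yi, m in zip(y, moved) if not m]
    return fmean(movers) - fmean(stayers)


# Hypothetical toy data: 1 = arrested / moved, 0 = not.
y_offered = [0, 0, 1, 0, 0, 1, 0, 0]
moved_offered = [1, 1, 0, 1, 0, 0, 1, 0]  # 50% take-up among the offered group
y_control = [1, 0, 1, 0, 1, 0, 0, 1]
moved_control = [0, 0, 0, 1, 0, 0, 0, 0]  # 12.5% of controls moved anyway

experimental = wald_estimate(y_offered, y_control, moved_offered, moved_control)
naive = naive_mover_comparison(y_offered + y_control, moved_offered + moved_control)
print(f"experimental (Wald) estimate: {experimental:.3f}")
print(f"naive mover comparison:       {naive:.3f}")
```

The ratio is defined whenever take-up differs between the two groups, which is why take-up below 100% in the treatment group and above 0% in the control group does not prevent estimation.
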

>
> ...A final example comes from the National Evaluation of Welfare-to-Work Strategies, a randomized experiment designed to evaluate welfare-to-work programs in seven sites across the United States. One of the treatment streams encouraged welfare-recipient mothers to participate in education activities. In addition to measuring outcomes such as clients' welfare receipt, employment, and earnings, the evaluation study also tested young children's school readiness using the Bracken Basic Concepts Scale School Readiness Subscale. Using a method for generating experimental estimates similar to that used in the MTO analyses, Magnuson and McGroder (2002) examined the effects of the experimentally induced increases in maternal schooling on children's school readiness. Again, the results suggest that nonexperimental estimates did not closely reproduce experimentally based estimates.
>
> - Magnuson, K. A., & McGroder, S. (2002). ["The effect of increasing welfare mothers' education on their young children's academic problems and school readiness"](http://www.jonescollegeprep.org/ourpages/auto/2013/1/17/61972839/Effect%20of%20Increasing%20Welfare%20Mothers%20Education%20on%20their%20Young%20Childrens%20Problems%20and%20School%20Readiness.pdf). Joint Center for Poverty Research, Working Paper No. 280, Northwestern University.
>
> A much larger literature within economics, statistics, and program evaluation has focused on the ability of nonexperimental regression-adjustment methods to replicate experimental estimates for the effects of job training or welfare-to-work programs. Although the "contexts" represented by these programs may be less interesting to developmentalists, the results of this literature nevertheless bear directly on the question considered in this article: Can regression methods with often quite detailed background covariates reproduce experimental impact estimates for such programs? As one recent review concluded, "Occasionally, but not in a way that can be easily predicted" (Glazerman, Levy, & Myers, 2002, p. 46; see also Bloom, Michalopoulos, Hill, & Lei, 2002).
>
> - Glazerman, S., Levy, D., & Myers, D. (2002). [_Nonexperimental replications of social experiments: A systematic review_](https://www.mathematica-mpr.com/-/media/publications/pdfs/nonexperimentalreps.pdf). Washington, DC: Mathematica Policy Research
> - Bloom, H. S., Michalopoulos, C., Hill, C., & Lei, Y. (2002) [_Can non-experimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs?_](http://files.eric.ed.gov/fulltext/ED471814.pdf) New York: Manpower Demonstration Research Corporation.

Allcott 2011, ["Social Norms and Energy Conservation"](http://bwl.univie.ac.at/fileadmin/user_upload/lehrstuhl_ind_en_uw/lehre/ws1213/SE_Energy_WS12_13/Social_norms_and_energy_conservation.pdf):

> Nearly all energy efficiency programs are still evaluated using non-experimental estimators or engineering accounting approaches. How important is the experimental control group to consistently-estimated ATEs? This issue is crucial for several of OPOWER's initial programs that were implemented without a control group but must estimate impacts to report to state regulators. While LaLonde (1986) documented that non-experimental estimators performed poorly in evaluating job training programs and similar arguments have been made in many other domains, weather-adjusted non-experimental estimators could in theory perform well in modeling energy demand. The importance of randomized controlled trials has not yet been clearly documented to analysts and policymakers in this context.
>
> Without an experimental control group, there are two econometric approaches that could be used. The first is to use a difference estimator, comparing electricity use in the treated population before and after treatment. In implementing this, I control for weather differences non-parametrically, using bins with width one average degree day. This slightly outperforms the use of fourth degree polynomials in heating and cooling degree-days. This estimator is unbiased if and only if there are no other factors associated with energy demand that vary between the pre-treatment and post-treatment period. A second non-experimental approach is to use a difference-in-differences estimator with nearby households as a control group. For each experiment, I form a control group using the average monthly energy use of households in other utilities in the same state, using data that regulated utilities report to the U.S. Department of Energy on Form EIA 826. The estimator includes utility-by-month fixed effects to capture different seasonal patterns - for example, there may be local variation in how many households use electric heat instead of natural gas or oil, which then affects winter electricity demand. This estimator is unbiased if and only if there are no unobserved factors that differentially affect average household energy demand in the OPOWER partner utility vs. the other utilities in the same state. Fig. 6 presents the experimental ATEs for each experiment along with point estimates for the two types of non-experimental estimators. There is substantial variance in the non-experimental estimators: the average absolute errors for the difference and difference-in-differences estimators, respectively, are 2.1% and 3.0%. Across the 14 experiments, the estimators are also biased on average. In particular, the mean of the ATEs from the difference-in-differences estimator is −3.75%, which is nearly double the mean of the experimental ATEs.
>
> ...What's particularly insidious about the non-experimental estimates is that they would appear quite plausible if not compared to the experimental benchmark. Nearly all are within the confidence intervals of the small sample pilots by Schultz et al. (2007) and Nolan et al. (2008) that were discussed above. Evaluations of similar types of energy use information feedback programs have reported impacts of zero to 10% (Darby, 2006).
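
The two nonexperimental estimators in the Allcott excerpt reduce, in their simplest form, to comparisons of group means. The sketch below is a bare-bones illustration with hypothetical usage numbers; it deliberately leaves out the weather controls and the utility-by-month fixed effects that the paper describes, so it shows only the structure of each estimator.

```python
from statistics import fmean


def difference_estimate(treated_pre, treated_post):
    """Before/after change in mean usage for the treated population."""
    return fmean(treated_post) - fmean(treated_pre)


def did_estimate(treated_pre, treated_post, comparison_pre, comparison_post):
    """Treated change minus the change in a comparison group over the same period."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    comparison_change = fmean(comparison_post) - fmean(comparison_pre)
    return treated_change - comparison_change


# Hypothetical monthly kWh readings, for illustration only.
treated_pre = [1000, 1100, 900]
treated_post = [960, 1050, 870]
comparison_pre = [1000, 1200]
comparison_post = [990, 1188]

diff = difference_estimate(treated_pre, treated_post)
did = did_estimate(treated_pre, treated_post, comparison_pre, comparison_post)
baseline = fmean(treated_pre)
print(f"difference estimator: {diff:+.1f} kWh ({100 * diff / baseline:+.1f}%)")
print(f"difference-in-differences: {did:+.1f} kWh ({100 * did / baseline:+.1f}%)")
```

In this toy example the before/after difference attributes the whole drop in usage to the program, while the difference-in-differences version subtracts the drop also seen in the comparison group. Either can be biased for the reasons given in the excerpt.
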

["What Do Workplace Wellness Programs Do? Evidence from the Illinois Workplace Wellness Study"](/docs/statistics/causality/2018-jones.pdf), Jones et al 2018:

> Workplace wellness programs cover over 50 million workers and are intended to reduce medical spending, increase productivity, and improve well-being. Yet, limited evidence exists to support these claims. We designed and implemented a comprehensive workplace wellness program for a large employer with over 12,000 employees, and randomly assigned program eligibility and financial incentives at the individual level. Over 56% of eligible (treatment group) employees participated in the program. We find strong patterns of selection: during the year prior to the intervention, program participants had lower medical expenditures and healthier behaviors than non-participants. However, we do not find significant causal effects of treatment on total medical expenditures, health behaviors, employee productivity, or self-reported health status in the first year. Our 95% confidence intervals rule out 83% of previous estimates on medical spending and absenteeism. Our selection results suggest these programs may act as a screening mechanism: even in the absence of any direct savings, differential recruitment or retention of lower-cost participants could result in net savings for employers.
>
> ...We invited 12,459 benefits-eligible university employees to participate in our study.^3^ Study participants (_n_=4,834) assigned to the treatment group (_n_=3,300) were invited to take paid time off to participate in our workplace wellness program. Those who successfully completed the entire program earned rewards ranging from \$50 to \$350, with the amounts randomly assigned and communicated at the start of the program. The remaining subjects (_n_=1,534) were assigned to a control group, which was not permitted to participate. Our analysis combines individual-level data from online surveys, university employment records, health insurance claims, campus gym visit records, and administrative records from a popular community running event. We can therefore examine outcomes commonly studied by the prior literature (namely, medical spending and employee absenteeism) as well as a large number of novel outcomes.
>
> ...Third, we do not find significant effects of our intervention on 37 out of the 39 outcomes we examine in the first year following random assignment. These 37 outcomes include all our measures of medical spending, productivity, health behaviors, and self-reported health. We investigate the effect on medical expenditures in detail, but fail to find significant effects on different quantiles of the spending distribution or on any major subcategory of medical expenditures (pharmaceutical drugs, office, or hospital). We also do not find any effect of our intervention on the number of visits to campus gym facilities or on the probability of participating in a popular annual community running event, two health behaviors that are relatively simple for a motivated employee to change over the course of one year. These null estimates are meaningfully precise, particularly for two key outcomes of interest in the literature: medical spending and absenteeism. Our 95% confidence intervals rule out 83% of the effects reported in 115 prior studies, and the 99% confidence intervals for the return on investment (ROI) of our intervention rule out the widely cited medical spending and absenteeism ROI's reported in the meta-analysis of Baicker, Cutler and Song (2010). In addition, we show that our OLS (non-RCT) estimate for medical spending is in line with estimates from prior observational studies, but is ruled out by the 95% confidence interval of our IV (RCT) estimate. This demonstrates the value of employing an RCT design in this literature.
>
> ...Our randomized controlled design allows us to establish reliable causal effects by comparing outcomes across the treatment and control groups. By contrast, most existing studies rely on observational comparisons between participants and non-participants (see Pelletier, 2011, and Chapman, 2012, for reviews). Reviews of the literature have called for additional research on this topic and have also noted the potential for publication bias to skew the set of existing results (Baicker, Cutler and Song, 2010; Abraham and White, 2017). To that end, our intervention, empirical specifications, and outcome variables were prespecified and publicly archived. In addition, the analyses in this paper were independently replicated by a J-PAL affiliated researcher.
>
> ...Figure 8 illustrates how our estimates compare to the prior literature. The top-left figure in Panel (a) plots the distribution of the intent-to-treat (ITT) point estimates for medical spending from 22 prior workplace wellness studies. The figure also plots our ITT point estimate for total medical spending from Table 4, and shows that our 95% confidence interval rules out 20 of these 22 estimates. For ease of comparison, all effects are expressed as % changes. The bottom-left figure in Panel (a) plots the distribution of treatment-on-the-treated (TOT) estimates for health spending from 33 prior studies, along with the IV estimates from our study. In this case, our 95% confidence interval rules out 23 of the 33 studies (70%). Overall, our confidence intervals rule out 43 of 55 (78%) prior ITT and TOT point estimates for health spending. The two figures in Panel (b) repeat this exercise for absenteeism, and show that our estimates rule out 53 of 60 (88%) prior ITT and TOT point estimates for absenteeism. Across both sets of outcomes, we rule out 96 of 115 (83%) prior estimates. We can also combine our spending and absenteeism estimates with our cost data to calculate a return on investment (ROI) for workplace wellness programs. The 99% confidence intervals for the ROI associated with our intervention rule out the widely cited savings estimates reported in the meta-analysis of Baicker, Cutler and Song (2010).
>
> ![Figure 8 of Jones et al 2018: comparison of previous literature's correlational point-estimates with the Jones et al 2018 randomized effect's CI, demonstrating that almost none fall within the Jones et al 2018 CI.](/images/causality/2018-jones-figure8-randomizedvscorrelationliterature.png)
>
> 4.3.3 IV versus OLS
>
> Across a variety of outcomes, we find very little evidence that our intervention had any effect in its first year. As shown above, our results differ from many prior studies that find significant reductions in health expenditures and absenteeism. One possible reason for this discrepancy is the presence of advantageous selection bias in these other studies, which are generally not randomized controlled trials. A second possibility is that there is something unique about our setting. We investigate these competing explanations by performing a typical observational (OLS) analysis and comparing its results to those of our experimental estimates.
>
> Specifically, we estimate $Y_i = \alpha + \gamma P_i + \Gamma X_i + \varepsilon_i$ (5), where $Y_i$ is the outcome variable as in (4), $P_i$ is an indicator for participating in the screening and HRA, and $X_i$ is a vector of variables that control for potentially non-random selection into participation. We estimate two variants of equation (5). The first is an instrumental variables (IV) specification that includes observations for individuals in the treatment or control groups, and uses treatment assignment as an instrument for completing the screening and HRA. The second variant estimates equation (5) using OLS, restricted to individuals in the treatment group. For each of these two variants, we estimate three specifications similar to those used for the ITT analysis described above (no controls, strata fixed effects, and post-Lasso).
>
> This generates six estimates for each outcome variable. Table 5 reports the results for our primary outcomes of interest. The results for all pre-specified administrative and survey outcomes are reported in Appendix Tables A.3e-A.3f.
>
> ![Table 5, comparing the randomized estimate with the correlational estimates](/images/causality/2018-jones-table5-correlationvsrandomized.png){.full-width}
>
> ![Visualization of 5 entries from Table 5, from the [_New York Times_](https://www.nytimes.com/2018/08/06/upshot/employer-wellness-programs-randomized-trials.html "Workplace Wellness Programs Don't Work Well. Why Some Studies Show Otherwise. Randomized controlled trials, despite their flaws, remain a powerful tool.").](/images/causality/2018-jones-table5-nyt-randomizedvscorrelation.png)
>
> As in our previous ITT analysis, the IV estimates reported in columns (1)-(3) are small and indistinguishable from zero for nearly every outcome. By contrast, the observational estimates reported in columns (4)-(6) are frequently large and statistically significant. Moreover, the IV estimate rules out the OLS estimate for several key outcomes. Based on our most precise and well-controlled specification (post-Lasso), the OLS monthly spending estimate of -\$88.1 (row 1, column (6)) lies outside the 95% confidence interval of the IV estimate of \$38.5 with a standard error of \$58.8 (row 1, column (3)). For participation in the 2017 IL Marathon/10K/5K, the OLS estimate of 0.024 lies outside the 99% confidence interval of the corresponding IV estimate of -0.011 (standard error = 0.011). For campus gym visits, the OLS estimate of 2.160 lies just outside the 95% confidence interval of the corresponding IV estimate of 0.757 (standard error = 0.656). Under the assumption that the IV (RCT) estimates are unbiased, these differences imply that even after conditioning on a rich set of controls, participants selected into our workplace wellness program on the basis of lower-than-average contemporaneous spending and higher-than-average health activity. This is consistent with the evidence presented in Section 3.2 that pre-existing spending is lower, and pre-existing behaviors are healthier, among participants than among non-participants. In addition, the observational estimates presented in columns (4)-(6) are in line with estimates from previous observational studies, which suggests that our setting is not particularly unique. In the spirit of LaLonde (1986), these estimates demonstrate that even well-controlled observational analyses can suffer from significant selection bias in our setting, suggesting that similar biases might be at play in other wellness program settings as well.
>
> ![Jones et al 2018 appendix: Table A3, a-c, all randomized vs correlational estimates](/images/causality/2018-jones-supplement-randomizedvscorrelation-tablea3-ac.png)
>
> ![Jones et al 2018 appendix: Table A3, d-e, all randomized vs correlational estimates](/images/causality/2018-jones-supplement-randomizedvscorrelation-tablea3-de.png)
>
> ![Jones et al 2018 appendix: Table A3, f-g, all randomized vs correlational estimates](/images/causality/2018-jones-supplement-randomizedvscorrelation-tablea3-fg.png)
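
The "rules out" comparisons in the Jones et al excerpt can be checked approximately from the reported point estimates and standard errors. The sketch below assumes a symmetric normal-approximation interval (estimate ± z × SE), which may not match the interval construction used in the paper; the helper names are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, se, level=0.95):
    """Symmetric normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def ruled_out(value, estimate, se, level=0.95):
    """True if `value` falls outside the interval around `estimate`."""
    lower, upper = confidence_interval(estimate, se, level)
    return not (lower <= value <= upper)


# (label, OLS estimate, IV estimate, IV standard error, confidence level),
# using the figures quoted in the excerpt above.
checks = [
    ("monthly spending", -88.1, 38.5, 58.8, 0.95),
    ("marathon/10K/5K participation", 0.024, -0.011, 0.011, 0.99),
    ("campus gym visits", 2.160, 0.757, 0.656, 0.95),
]

results = {}
for label, ols, iv, se, level in checks:
    lower, upper = confidence_interval(iv, se, level)
    results[label] = ruled_out(ols, iv, se, level)
    print(f"{label}: {level:.0%} CI [{lower:.3f}, {upper:.3f}], "
          f"OLS = {ols}, ruled out = {results[label]}")
```

Under this approximation the gym-visit OLS estimate sits close to the upper bound of the interval (roughly 2.04), so its classification is sensitive to rounding and to the exact way the interval is built.
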

["Effect of a Workplace Wellness Program on Employee Health and Economic Outcomes: A Randomized Clinical Trial"](/docs/statistics/causality/2019-song.pdf), Song & Baicker 2019:

> *Design, Setting, and Participants*: This clustered randomized trial was implemented at 160 worksites from January 2015 through June 2016. Administrative claims and employment data were gathered continuously through June 30, 2016; data from surveys and biometrics were collected from July 1, 2016, through August 31, 2016.
>
> *Interventions*: There were 20 randomly selected treatment worksites (4037 employees) and 140 randomly selected control worksites (28 937 employees, including 20 primary control worksites [4106 employees]). Control worksites received no wellness programming. The program comprised 8 modules focused on nutrition, physical activity, stress reduction, and related topics implemented by registered dietitians at the treatment worksites.
>
> *Main Outcomes and Measures*: Four outcome domains were assessed. Self-reported health and behaviors via surveys (29 outcomes) and clinical measures of health via screenings (10 outcomes) were compared among 20 intervention and 20 primary control sites; health care spending and utilization (38 outcomes) and employment outcomes (3 outcomes) from administrative data were compared among 20 intervention and 140 control sites.
>
> *Results*: Among 32 974 employees (mean [SD] age, 38.6 [15.2] years; 15 272 [45.9%] women), the mean participation rate in surveys and screenings at intervention sites was 36.2% to 44.6% (_n_= 4037 employees) and at primary control sites was 34.4% to 43.0% (_n_= 4106 employees) (mean of 1.3 program modules completed). After 18 months, the rates for 2 self-reported outcomes were higher in the intervention group than in the control group: for engaging in regular exercise (69.8% vs 61.9%; adjusted difference, 8.3 percentage points [95% CI, 3.9-12.8]; adjusted _p_= .03) and for actively managing weight (69.2% vs 54.7%; adjusted difference, 13.6 percentage points [95% CI, 7.1-20.2]; adjusted _p_= .02). The program had no significant effects on other prespecified outcomes: 27 self-reported health outcomes and behaviors (including self-reported health, sleep quality, and food choices), 10 clinical markers of health (including cholesterol, blood pressure, and body mass index), 38 medical and pharmaceutical spending and utilization measures, and 3 employment outcomes (absenteeism, job tenure, and job performance).
>
> ...To assess endogenous selection into program participation, we compared the baseline characteristics of program participants to those of non-participants in treatment sites. To assess endogenous selection into participation in primary data collection, we compared baseline characteristics of workers who elected to provide clinical data or complete the health risk assessment to those of workers who did not, separately within the treatment group and the control group. This enabled us to assess any potential differential selection into primary data collection. Additionally, to examine differences in findings between our randomized trial approach and a standard observational design (and thereby any bias that confounding factors would have introduced into naive observational estimates), we generated estimates of program effects using ordinary least squares to compare program participants with nonparticipants (rather than using the variation generated by randomization).
>
> ...*Selection Into Program Participation*. Comparisons of preintervention characteristics between participants and nonparticipants in the treatment group provided evidence of potential selection effects. Participants were significantly more likely to be female, nonwhite, and full-time salaried workers in sales, although neither mean health care spending nor the probability of having any spending during the year before the program was significantly different between participants and nonparticipants (eTable 15 in Supplement 2)...an observational approach comparing workers who elected to participate with nonparticipants would have incorrectly suggested that the program had larger effects on some outcomes than the effects found using the controlled design, underscoring the importance of randomization to obtain unbiased estimates ([eTable 17 in Supplement 2](/docs/statistics/causality/2019-song-supplement2.pdf#page=45)).

# Sociology

["How Close Is Close Enough? Testing Nonexperimental Estimates of Impact against Experimental Estimates of Impact with Education Test Scores as Outcomes"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.217.2276&rep=rep1&type=pdf), Wilde & Hollister 2002:

> In this study we test the performance of some nonexperimental estimators of impacts applied to an educational intervention (reduction in class size) where achievement test scores were the outcome. We compare the nonexperimental estimates of the impacts to "true impact" estimates provided by a random-assignment design used to assess the effects of that intervention. Our primary focus in this study is on a nonexperimental estimator based on a complex procedure called propensity score matching. Previous studies which tested nonexperimental estimators against experimental ones all had employment or welfare use as the outcome variable. We tried to determine whether the conclusions from those studies about the performance of nonexperimental estimators carried over into the education domain.
>
> ...Project Star is the source of data for the experimental estimates and the source for drawing nonexperimental comparison groups used to make nonexperimental estimates. Project Star was an experiment in Tennessee involving 79 schools in which students in kindergarten through third grade were randomly assigned to small classes (the treatment group) or to regular-size classes (the control group). The outcome variables from the data set were the math and reading achievement test scores. We carried out the propensity-score-matching estimating procedure separately for each of 11 schools' kindergartens and used it to derive nonexperimental estimates of the impact of smaller class size. We also developed proper standard errors for the propensity-score-matched estimators by using bootstrapping procedures. We found that in most cases, the propensity-score estimate of the impact differed substantially from the "true impact" estimated by the experiment. We then attempted to assess how close the nonexperimental estimates were to the experimental ones. We suggested several different ways of attempting to assess "closeness." Most of them led to the conclusion, in our view, that the nonexperimental estimates were not very "close" and therefore were not reliable guides as to what the "true impact" was. We put greatest emphasis on looking at the question of "how close is close enough?" in terms of a decision-maker trying to use the evaluation to determine whether to invest in wider application of the intervention being assessed (in this case, reduction in class size). We illustrate this in terms of a rough cost-benefit framework for small class size as applied to Project Star. We find that in 30 to 45% of the 11 cases, the propensity-score-matching nonexperimental estimators would have led to the "wrong" decision.
>
> ...Two major considerations motivated us to undertake this study. First, four important studies (Fraker and Maynard, 1987; LaLonde, 1986; Friedlander and Robins, 1995; and Dehejia and Wahba, 1999) have assessed the effectiveness of nonexperimental methods of impact assessment in a compelling fashion, but these studies have focused solely on social interventions related to work and their impact on the outcome variables of earnings, employment rates, and welfare utilization.
>
> - Fraker & Maynard 1987, ["The adequacy of comparison group designs for evaluations of employment-related programs"](/docs/statistics/causality/1987-fraker.pdf)
> - Friedlander & Robins 1995, ["Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods"](/docs/statistics/causality/1995-friedlander.pdf)
> - Dehejia & Wahba 1999 ["Causal effects in non-experimental studies: re-evaluating the evaluation of training programs"](/docs/statistics/causality/1999-dehejia.pdf)
>
> ...Because we are interested in testing nonexperimental methods on educational outcomes, we use Tennessee's Project Star as the source of the "true random-assignment" data. We describe Project Star in detail below. We use the treatment group data from a given school for the treatments and then construct comparison groups in various nonexperimental ways with data taken out of the control groups in other schools.
>
> ...From 1985 to 1989, researchers collected observational data including sex, age, race, and free-lunch status from over 11,000 students (Word, 1990). The schools chosen for the experiment were broadly distributed throughout Tennessee. Originally, the project included eight schools from nonmetropolitan cities and large towns (for example, Manchester and Maryville), 38 schools from rural areas, and 17 inner-city and 16 suburban schools drawn from four metropolitan areas: Knoxville, Nashville, Memphis, and Chattanooga. Beginning in 1985-86, the kindergarten teachers and students within Project Star classes were randomly assigned within schools to either "small" (13-17 pupils), "regular" (22-25), or "regular-with-aide" classes. New students who entered a Project Star school in 1986, 1987, 1988, or 1989 were randomly assigned to classes. Because each school had "the same kinds of students, curriculum, principal, policy, schedule, expenditures, etc, for each class" and the randomization occurred within school, theoretically, the estimated within-school effect of small classes should have been unbiased. During the course of the project, however, there were several deviations from the original experimental design: for example, after kindergarten the students in the regular and regular-with-aide classes were randomly reassigned between regular and regular-with-aide classes, and a significant number of students switched class types between grades. However, Krueger found that, after adjusting for these and other problems, the main Project Star results were unaffected; in all four school types students in small classes scored significantly higher on standardized tests than students in regular-size classes. In this study, following Krueger's example, test score is used as the measure of student achievement and is the outcome variable. For all comparisons, test score is calculated as a percentile rank of the combined raw Stanford Achievement reading and math scores within the entire sample distribution for that grade. The Project Star data set provides measures of a number of student, teacher, and school characteristics. The following are the variables available to use as measures prior to random assignment: student sex, student race, student free-lunch status, teacher race, teacher education, teacher career ladder, teacher experience, school type, and school system ID. In addition, the following variables measured contemporaneously can be considered exogenous: student age, assignment to small class size.
>
> ...One very important and stringent measure of closeness is whether there are many cases in which the nonexperimental impact estimates are opposite in sign from the experimental impact estimates and both sets of impact estimates are statistically significantly different from 0, e.g., the experimental estimates said that the mean test scores of those in smaller classes were significantly negative while the nonexperimental estimates indicated they were significantly positive. There is only one case in these 11 which comes close to this situation. For school 27, the experimental impact estimate is -10.5 and significant at the 6% level, just above the usual significance cutoff of 5%. The nonexperimental impact estimate is 35.2 and significant at better than the 1% level. In other cases (school 7 and school 33), the impact estimates are of opposite sign, but one or the other of them fails the test for being significantly different from 0. If we weaken the stringency of the criterion a bit, we can consider cases in which the experimental impact estimates were significantly different from 0 but the nonexperimental estimates were not (school 16 and school 33), or vice versa (schools 7, 16, and 28). Another, perhaps better, way of assessing the differences in the impact estimates is to look at column 8, which presents a test for whether the impact estimate from the nonexperimental procedure is significantly different from the impact estimate from the experimental procedure. For 8 of the 11 schools, the two impact estimates were statistically significantly different from each other.
>
> ...When we first began considering this issue, we thought that a criterion might be based on the percentage difference in the point estimates of the impact. For example, for school 9 the nonexperimental estimate of the impact is 135% larger than the experimental impact. But for school 22 the nonexperimental impact estimate is 9% larger. Indeed, all but two of the nonexperimental estimates are more than 50% different from the experimental impact estimates. Whereas in this case the percentage difference in impact estimates seems to indicate quite conclusively that the nonexperimental estimates are not generally close to the experimental ones, in some cases such a percentage difference criterion might be misleading. The criterion which seems to us the most definitive is whether distance between the nonexperimental and the experimental impact estimates would have been sufficient to cause an observer to make a different decision from one based on the true experimental results. For example, suppose that the experimental impact estimate had been 0.02 and the nonexperimental impact estimate had been 0.04, a 100% difference in impact estimate. But further suppose that the decision about whether to take an action, e.g., invest in the type of activity which the treatment intervention represents, would have been a yes if the difference between the treatments and comparisons had been 0.05 or greater and a no if the impact estimate had been less than 0.05. Then even though the nonexperimental estimate was 100% larger than the experimental estimate, one would still have decided not to invest in this type of intervention whether one had the true experimental estimate or the nonexperimental estimate.
>
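
The contrast between a percentage-difference criterion and a decision-based criterion can be sketched in a few lines of Python. This is only an illustration of the hypothetical numbers in the passage above (0.02, 0.04, and a 0.05 decision threshold); the function names are invented here and are not from the paper.

```python
def percent_difference(nonexperimental: float, experimental: float) -> float:
    """Percentage by which the nonexperimental estimate exceeds the experimental one."""
    return 100.0 * (nonexperimental - experimental) / abs(experimental)


def would_invest(impact: float, threshold: float = 0.05) -> bool:
    """Decision rule from the quoted example: act only if the impact reaches the threshold."""
    return impact >= threshold


# Hypothetical values taken from the quoted passage.
experimental, nonexperimental = 0.02, 0.04

gap = percent_difference(nonexperimental, experimental)
same_decision = would_invest(experimental) == would_invest(nonexperimental)

print(f"percentage difference: {gap:.0f}%")
print(f"same decision under both estimates: {same_decision}")
```

The two estimates differ by 100%, yet both fall below the threshold, so the decision is the same either way. This is the sense in which a percentage criterion can mislead.
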
> ...In a couple of his articles presenting aspects of his research using Project Star data, Krueger (1999, 2000) has developed some rough benefit-cost calculations related to reduction in class size. In Appendix B we sketch in a few elements of his calculations which provide the background for the summary measures derived from his calculations that we use to illustrate our "close enough." The benefits Krueger focuses on are increases in future earnings that could be associated with test score gains. He carefully develops estimates - based on other literature - of what increase in future earnings might be associated with a gain in test scores in the early years of elementary school. With appropriate discounting to present values, and other adjustments, he uses these values as estimates of benefits and then compares them to the estimated cost of reducing class size from 22 to 15, taken from the experience in Project Star and appropriately adjusted. For our purposes, what is most interesting is the way he uses these benefit-cost calculations to answer a slightly different question: How big an effect on test scores due to reduction of class size from 22 to 15 would have been necessary to just justify the expenditures it took to reduce the class size by that much? He states the answer in terms of "effect size," that is the impact divided by the estimated standard deviation of the impact. This is a measure that is increasingly used to compare across outcomes that are measured in somewhat different metrics. His answer is that an effect size of 0.2 of a standard deviation of test scores would have been just large enough to generate estimated future earnings gains sufficient to justify the costs.^4^ Krueger indicates that the estimated effect for kindergarten was a 5.37 percentile increase in achievement test scores due to smaller class size and that this was equivalent to 0.2 of a standard deviation in test scores.
> Therefore we use 5.4 percentile points as the critical value for a decision of whether the reduction in class size from 22 to 15 would have been cost-effective. In Table 2 we use the results from Table 1 to apply the cost-effectiveness criterion to determine the extent to which the nonexperimental estimates might have led to the wrong decision. To create the entries in this table, we look at the Table 1 entry for a given school. If the impact estimate is greater than 5.4 percentile points and statistically significantly different from 0, we enter a Yes, indicating the impact estimate would have led to a conclusion that reducing class size from 22 to 15 was cost-effective. If the impact estimate is less than 5.4 or statistically not significantly different from 0, we enter a No to indicate a conclusion that the class-size reduction was not cost-effective. Column 1 is the school ID, column 2 gives the conclusion on the basis of the experimental impact estimate, column 3 gives the conclusion on the basis of the nonexperimental impact estimate, and column 4 contains an x if the nonexperimental estimate would have led to a "wrong" cost-effectiveness conclusion, i.e., a different conclusion from the experimental impact conclusion about cost-effectiveness.
>
> It is easy to see in Table 2 that the nonexperimental estimate would have led to the wrong conclusion in four of the 11 cases. For a fifth case, school 16, we entered a Maybe in Column 4 because, as shown in Table 1, for that school the nonexperimental estimate was significantly different from 0 at only the 9% level, whereas the usual significance criterion is 5%. Even though the nonexperimental point estimate of impact was greater than 5.4 percentile points, strict use of the 5% significance criterion would have led to the conclusion that the reduction in class size was not cost-effective. On the other hand, analysts sometimes use a 10% significance criterion, so it could be argued that they might have used that level to conclude the program was cost-effective - thus the Maybe entry for this school.
>
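
The cost-effectiveness rule described above is mechanical enough to sketch in Python. The sketch below assumes the Table 1 estimates as transcribed earlier on this page, with an asterisk read as "significant at the 5% level". That table labels schools A through K while the quoted text uses numeric school IDs; the mapping between the two is not given, so the letters are kept. The names `TABLE_1` and `cost_effective` are invented for illustration.

```python
CRITICAL_VALUE = 5.4  # percentile points: the break-even effect used in the quoted passage

# (school, nonexperimental estimate, significant at 5%?, experimental estimate, significant at 5%?)
# Values and significance flags transcribed from Table 1 earlier on this page.
TABLE_1 = [
    ("A", 9.6, False, -5.2, False),
    ("B", 15.3, True, 13.0, True),
    ("C", 1.9, False, 24.1, True),
    ("D", 35.2, True, 33.1, True),
    ("E", 20.4, True, -10.5, False),
    ("F", 0.2, False, 1.3, False),
    ("G", -8.6, False, 10.6, True),
    ("H", -5.6, False, 9.6, True),
    ("I", 16.5, True, 14.7, True),
    ("J", 24.3, True, 16.2, True),
    ("K", 27.8, True, 19.3, True),
]


def cost_effective(estimate: float, significant: bool) -> bool:
    """'Yes' only if the estimate exceeds the critical value and is significant."""
    return significant and estimate > CRITICAL_VALUE


# Schools where the nonexperimental estimate leads to a different conclusion
# than the experimental estimate.
wrong = [
    school
    for school, nonexp, nonexp_sig, exp, exp_sig in TABLE_1
    if cost_effective(nonexp, nonexp_sig) != cost_effective(exp, exp_sig)
]

print(f"{len(wrong)} of {len(TABLE_1)} schools disagree: {wrong}")
```

Under a strict 5% cutoff this flags four of the 11 schools, matching the count in the quoted text. The "Maybe" case is not modeled, since it depends on relaxing the cutoff to 10%.
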
> ...In all seven selected cases, the experimental and nonexperimental estimates differ considerably from each other. One of the nonexperimental estimates is of the wrong sign, while in the other estimates, the signs are the same but all the estimates differ by at least 1.8 percentage points, ranging up to as much as 12 percentage points (rural-city). Statistical inferences about the significance of these program effects also vary (five of the seven pairs had differing inferences - i.e., only one estimate of the program effect in a pair is statistically significant at the 10% level). All of the differences between the experimental and nonexperimental estimates (the test of difference between the outcomes for the experimental control group and the nonexperimental comparison group) in this subset were statistically significant.
>
> Table 5 shows the results for the complete set of the first 49 pairs of estimates. Each column shows a different type of comparison (either school type or district type). The top row in each column provides the number of pairs of experimental and nonexperimental estimates in the column. The second row shows the mean estimate of program effect from the (unbiased) experimental estimates. The third row has the mean absolute differences between these estimates, providing some indication of the size of our nonexperimental bias. The fourth row provides the percentage of pairs in which the experimental and nonexperimental estimates led to different inferences about the significance of the program effect. The fifth row indicates the percentage of pairs in which the difference between the two estimated values was significant (again the test of difference between control and comparison group). Looking at the summarized results for comparisons across school type, these results suggest that constructing nonexperimental groups based on similar demographic school types leads to nonexperimental estimates that do not perform very well when compared with the experimental estimates for the same group. In 50% of the pairs, experimental and nonexperimental estimates had different statistical inferences, with a mean absolute difference in effect estimate of 4.65. Over 75% of these differences were statistically significant. About half of the estimated pairs in comparisons across school type differ by more than 5 percentage points.

# Psychology

["Assignment Methods in Experimentation: When Do Nonrandomized Experiments Approximate Answers From Randomized Experiments?"](/docs/statistics/causality/1996-heinsman.pdf), Heinsman & Shadish 1996:

> This meta-analysis compares effect size estimates from 51 randomized experiments to those from 47 nonrandomized experiments. These experiments were drawn from published and unpublished studies of Scholastic Aptitude Test coaching, ability grouping of students within classrooms, presurgical education of patients to improve post-surgical outcome, and drug abuse prevention with juveniles. The raw results suggest that the two kinds of experiments yield very different answers. But when studies are equated for crucial features (which is not always possible), nonrandomized experiments can yield a reasonably accurate effect size in comparison with randomized designs. Crucial design features include the activity level of the intervention given the control group, pretest effect size, selection and attrition levels, and the accuracy of the effect-size estimation method. Implications of these results for the conduct of meta-analysis and for the design of good nonrandomized experiments are discussed.
>
> ...When certain assumptions are met (e.g., no treatment correlated attrition) and it is properly executed (e.g., assignment is not overridden), random assignment allows unbiased estimates of treatment effects and justifies the theory that leads to tests of significance. We compare this experiment to a closely related quasiexperimental design - the nonequivalent control group design - that is similar to the randomized experiment except that units are not assigned to conditions at random (Cook & Campbell, 1979).
>
> Statistical theory is mostly silent about the statistical characteristics (bias, consistency, and efficiency) of this design. However, meta-analysts have empirically compared the two designs. In meta-analysis, study outcomes are summarized with an effect size statistic (Glass, 1976). In the present case, the standardized mean difference statistic is relevant:
>
> $$d = \frac{M_T - M_C}{SD_P}$$
>
> where _M~T~_ is the mean of the experimental group, _M~C~_ is the mean of the comparison group, and _SD~P~_ is the pooled standard deviation. This statistic allows the meta-analyst to combine study outcomes that are in disparate metrics into a single metric for aggregation. Comparisons of effect sizes from randomized and nonrandomized experiments have yielded inconsistent results (e.g., Becker, 1990; Colditz, Miller, & Mosteller, 1988; Hazelrigg, Cooper, & Borduin, 1987; Shapiro & Shapiro, 1983; Smith, Glass & Miller, 1980). A recent summary of such work (Lipsey & Wilson, 1993) aggregated the results of 74 meta-analyses that reported separate standardized mean difference statistics for randomized and nonrandomized studies. Overall, the randomized studies yielded an average standardized mean difference statistic of _d_=0.46 (SD = 0.28), trivially higher than the nonrandomized studies _d_=0.41 (SD = 0.36); that is, the difference was near zero on the average over these 74 meta-analyses. Lipsey and Wilson (1993) concluded that "there is no strong pattern or bias in the direction of the difference made by lower quality methods. In a given treatment area, poor design or low methodological quality may result in a treatment estimate quite discrepant from what a better quality design would yield, but it is almost as likely to be an underestimate as an overestimate" (p. 1193). However, we believe that considerable ambiguity still remains about this methodological issue.
>
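
As a concrete illustration of the standardized mean difference defined above (treatment-group mean minus comparison-group mean, divided by the pooled standard deviation), here is a minimal Python sketch. The quoted passage does not spell out how the standard deviation is pooled; the sketch assumes the common degrees-of-freedom-weighted pooling of the two sample variances, and the scores are made up.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treatment, comparison):
    """d = (mean of treatment - mean of comparison) / pooled SD.

    Assumption: the pooled SD weights each group's sample variance
    by its degrees of freedom (n - 1).
    """
    n_t, n_c = len(treatment), len(comparison)
    pooled_variance = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(comparison)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(comparison)) / sqrt(pooled_variance)


# Made-up outcome scores for two small groups.
treatment_scores = [12.0, 15.0, 11.0, 14.0, 13.0]
comparison_scores = [10.0, 12.0, 9.0, 11.0, 13.0]

d = standardized_mean_difference(treatment_scores, comparison_scores)
print(f"d = {d:.2f}")
```

Because _d_ is unit-free, effects measured on different scales (test scores, recovery days, drug-use rates) can be aggregated in a single meta-analysis.
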
> - Becker, B. J. (1990). ["Coaching for the Scholastic Aptitude Test: Further synthesis and appraisal"](/docs/statistics/causality/1990-baker.pdf). _Review of Educational Research_, 60, 373-417.
> - Colditz, G. A., Miller, J. N., & Mosteller, F. (1988). ["The effect of study design on gain in evaluation of new treatments in medicine and surgery"](/docs/statistics/causality/1988-colditz.pdf). _Drug Information Journal_, 22, 343-352.
> - Hazelrigg, M. D., Cooper, H. M., & Borduin, C. M. (1987). ["Evaluating the effectiveness of family therapies: An integrative review and analysis"](/docs/statistics/causality/1987-hazelrigg.pdf). _Psychological Bulletin_, 101, 428-442.
> - Lipsey, M. W., & Wilson, D. B. (1993). ["The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis"](/docs/psychology/1993-lipsey.pdf). _American Psychologist_, 48, 1181-1209.
> - Shapiro, D. A., & Shapiro, D. (1983). ["Comparative therapy outcome research: Methodological implications of meta-analysis"](/docs/statistics/causality/1983-shapiro.pdf). _Journal of Consulting and Clinical Psychology_, 51, 42-53.
> - Smith et al 1980, [_The benefits of psychotherapy_](/docs/psychology/1980-smith-thebenefitsofpsychotherapy.pdf). Baltimore: Johns Hopkins University Press. [followup to [the 1977 paper](/docs/psychology/1977-smith.pdf "'Meta-Analysis of Psychotherapy Outcome Studies', Smith & Glass 1977")? paper doesn't include anything about randomization as a covariate. Ordered used copy; book mentions issue but only notes that there is no large average effect difference; does not provide original data! An entire book and they can't include the coded-up data which would be impossible to reproduce at this date... TODO: ask around if there are any archives anywhere - maybe University of Colorado has the original data in Smith or Glass's archived papers?]
>
> ...The present study drew from four past meta-analyses that contained both random and nonrandomized experiments on juvenile drug use prevention programs (Tobler, 1986), psychosocial interventions for post-surgery outcomes (Devine, 1992), coaching for Scholastic Aptitude Test performance (Becker, 1990), and ability grouping of pupils in secondary school classes (Slavin, 1990). These four areas were selected deliberately to reflect different kinds of interventions and substantive topics.
> ...All four meta-analyses also included many unpublished manuscripts, allowing us to examine publication bias effects. In this regard, a practical reason for choosing these four was that previous contacts with three of the four authors of these meta-analyses suggested that they would be willing to provide us with these unpublished documents.
>
> ...This procedure yielded 98 studies for inclusion, 51 random and 47 nonrandom. These studies allowed computation of 733 effect sizes, which we aggregated to 98 study-level effect sizes. Table 1 describes the number of studies in more detail. Retrieving equal numbers of published and unpublished studies in each cell of Table 1 proved impossible. Selection criteria resulted in elimination of 103 studies, of which 40 did not provide enough statistics to calculate at least one good effect size; 19 reported data only for significant effects but not for nonsignificant ones; 15 did not describe assignment method adequately; 11 reported only dichotomous outcome measures; 9 used haphazard assignment; 5 had no control group; and 4 were eliminated for other reasons (extremely implausible data, no posttest reported, severe unit of analysis problem, or failure to report any empirical results).
>
> ...Table 2 shows that over all 98 studies, experiments in which subjects were randomly assigned to conditions yielded significantly larger effect sizes than did experiments in which random assignment did not take place (Q = 82.09, df = 1, _p_<0.0001). Within area, randomized experiments yielded significantly more positive effect sizes for ability grouping (Q = 4.76, df = 1, _p_=0.029) and for drug-use prevention studies (Q = 15.67, df = 1, _p_=0.000075) but not for SAT coaching (Q = 0.02, df = 1, _p_=0.89) and presurgical intervention studies (Q = 0.17, df = 1, _p_=0.68). This yielded a borderline interaction between assignment mechanism and substantive area (Q = 5.93, df = 3, _p_=0.12). We include this interaction in subsequent regressions because power to detect interactions is smaller than power to detect main effects and because such an interaction is conceptually the same as Lipsey and Wilson's (1993) finding that assignment method differences may vary considerably over substantive areas. Finally, as Hedges (1983) predicted, the variance component for nonrandomized experiments was twice as large as the variance component for randomized experiments in the overall sample. Within areas, variance components were equal in two areas but larger for nonrandomized experiments in two others. Hence nonrandom assignment may result in unusually disparate effect size estimates, creating different means and variances.
>
> - Hedges, L. V. (1983). ["A random effects model for effect sizes"](/docs/statistics/causality/1983-hedges.pdf). _Psychological Bulletin_, 93, 388-395
>
> ...Effect size was higher with low differential and total attrition, with passive controls, with higher pretest effect sizes, when the selection mechanism did not involve self-selection of subjects into treatment, and with exact effect size computation measures.
>
> ...*Projecting the Results of an Ideal Comparison*:
> Given these findings, one might ask what an ideal comparison between randomized and nonrandomized experiments would yield. We simulate such a comparison in Table 6 using the results in Table 5, projecting effect sizes using predictor values that equate studies at an ideal or a reasonable level. The projections in Table 6 assume that both randomized and nonrandomized experiments used passive control groups, internal control groups, and matching; allowed exact computation of _d_; had no attrition; standardized treatments; were published; had pretest effect sizes of zero; used _n_=1,000 subjects per study; did not allow self-selection of subjects into conditions; and used outcomes based on self-reports and specifically tailored to treatment. Area effects and interaction effects between area and assignment were included in the projection. Note that the overall difference among the eight cell means has diminished dramatically in comparison with Table 2. In Table 2, the lowest cell mean was -0.23 and the highest was 0.37, for a range of 0.60. The range in Table 6 is only half as large (0.34). The same conclusion is true for the range within each area. In Table 2 that range was 0.01 for the smallest difference between randomized and nonrandomized experiments (SAT coaching) to 0.21 for the largest difference (drug-use prevention). In Table 6, the range was 0.11 (SAT coaching), 0.01 (ability grouping), 0.05 (presurgical interventions), and 0.09 (drug-use prevention). Put a bit more simply, nonrandomized experiments are more like randomized experiments if one takes confounds into account.

["Comparison of evidence of treatment effects in randomized and nonrandomized studies"](/docs/statistics/causality/2001-ioannidis.pdf), Ioannidis et al 2001:

> *Study Selection*: 45 diverse topics were identified for which both randomized trials (_n_=240) and nonrandomized studies (_n_=168) had been performed and had been considered in meta-analyses of binary outcomes.
>
> *Data Extraction*: Data on events per patient in each study arm and design and characteristics of each study considered in each meta-analysis were extracted and synthesized separately for randomized and nonrandomized studies.
>
> *Data Synthesis*: Very good correlation was observed between the summary odds ratios of randomized and nonrandomized studies (_r_= 0.75; _p_<.001); however, non-randomized studies tended to show larger treatment effects (28 vs 11; _p_=0.009). Between-study heterogeneity was frequent among randomized trials alone (23%) and very frequent among nonrandomized studies alone (41%). The summary results of the 2 types of designs differed beyond chance in 7 cases (16%). Discrepancies beyond chance were less common when only prospective studies were considered (8%). Occasional differences in sample size and timing of publication were also noted between discrepant randomized and nonrandomized studies. In 28 cases (62%), the natural logarithm of the odds ratio differed by at least 50%, and in 15 cases (33%), the odds ratio varied at least 2-fold between nonrandomized studies and randomized trials.
>
> *Conclusions*: Despite good correlation between randomized trials and nonrandomized studies - in particular, prospective studies - discrepancies beyond chance do occur and differences in estimated magnitude of treatment effect are very common.
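
The two discrepancy criteria in the abstract above (log odds ratios differing by at least 50%, and odds ratios differing at least 2-fold) can be made concrete with a small Python sketch. The event counts are hypothetical, and the abstract does not say which estimate is the baseline for the 50% comparison, so the sketch assumes the difference is taken relative to the randomized-trial value. All function names are invented here.

```python
from math import log


def odds_ratio(events_treated, n_treated, events_control, n_control):
    """Odds ratio from a 2x2 table (assumes no arm has zero events or zero non-events)."""
    odds_treated = events_treated / (n_treated - events_treated)
    odds_control = events_control / (n_control - events_control)
    return odds_treated / odds_control


def log_or_differs_by_half(or_randomized, or_nonrandomized):
    """Assumed reading: |ln OR difference| is at least 50% of |ln OR| from the randomized trials."""
    ln_r, ln_n = log(or_randomized), log(or_nonrandomized)
    return abs(ln_n - ln_r) >= 0.5 * abs(ln_r)


def at_least_twofold(or_randomized, or_nonrandomized):
    """True if one odds ratio is at least twice the other."""
    ratio = or_nonrandomized / or_randomized
    return ratio >= 2.0 or ratio <= 0.5


# Hypothetical counts: events / patients in the treated and control arms.
or_rct = odds_ratio(30, 100, 40, 100)   # pooled randomized trials
or_obs = odds_ratio(15, 100, 40, 100)   # pooled nonrandomized studies

print(f"randomized OR = {or_rct:.2f}, nonrandomized OR = {or_obs:.2f}")
print("ln(OR) differs by >= 50%:", log_or_differs_by_half(or_rct, or_obs))
print("ORs differ >= 2-fold:", at_least_twofold(or_rct, or_obs))
```

Both estimates here point in the same direction, which is compatible with a high correlation across topics, yet the nonrandomized estimate shows a much larger treatment effect. That is the pattern the abstract describes.
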

["Comparison of Effects in Randomized Controlled Trials With Observational Studies in Digestive Surgery"](/docs/statistics/causality/2006-shikata.pdf), Shikata et al 2006:

> *Methods*: The PubMed (1966 to April 2004), EMBASE (1986 to April 2004) and Cochrane databases (Issue 2, 2004) were searched to identify meta-analyses of randomized controlled trials in digestive surgery. Fifty-two outcomes of 18 topics were identified from 276 original articles (96 randomized trials, 180 observational studies) and included in meta-analyses. All available binary data and study characteristics were extracted and combined separately for randomized and observational studies. In each selected digestive surgical topic, summary odds ratios or relative risks from randomized controlled trials were compared with observational studies using an equivalent calculation method.
>
> *Results*: Significant between-study heterogeneity was seen more often among observational studies (5 of 12 topics) than among randomized trials (1 of 9 topics). In 4 of the 16 primary outcomes compared (10 of 52 total outcomes), summary estimates of treatment effects showed significant discrepancies between the two designs.
>
> *Conclusions*: One fourth of observational studies gave different results than randomized trials, and between-study heterogeneity was more common in observational studies in the field of digestive surgery.

["A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook"](/docs/statistics/causality/2019-gordon.pdf), Gordon et al 2019:

> We examine how common techniques used to measure the causal impact of ad exposures on users' conversion outcomes compare to the "gold standard" of a true experiment (randomized controlled trial). Using data from 12 US advertising lift studies at Facebook comprising 435 million user-study observations and 1.4 billion total impressions we contrast the experimental results to those obtained from observational methods, such as comparing exposed to unexposed users, matching methods, model-based adjustments, synthetic matched-markets tests, and before-after tests. We show that observational methods often fail to produce the same results as true experiments even after conditioning on information from thousands of behavioral variables and using non-linear models. We explain why this is the case. Our findings suggest that common approaches used to measure advertising effectiveness in industry fail to measure accurately the true effect of ads.
>
> ...Figure 13 summarizes results for the four studies for which there was a conversion pixel on a registration page. Figure 14 summarizes results for the three studies for which there was a conversion pixel on a key landing page. The results for these studies vary across studies in how they compare to the RCT results, just as they do for the checkout conversion studies reported in Figures 11 and 12.
>
> We summarize the performance of different observational approaches using two different metrics. We want to know first how often an observational study fails to capture the truth. Said in a statistically precise way, "For how many of the studies do we reject the hypothesis that the lift of the observational method is equal to the RCT lift?" Table 7 reports the answer to this question. We divide the table by outcome reported in the study (checkout is in the top section of Table 7, followed by registration and page view). The first row of Table 7 tells us that of the 11 studies that tracked checkout conversions, we statistically reject the hypothesis that the exact matching estimate of lift equals the RCT estimate. As we go down the column, the propensity score matching and regression adjustment approaches fare a little better, but for all but one specification, we reject equality with the RCT estimate for half the studies or more.
>
> We would also like to know how different the estimate produced by an observational method is from the RCT estimate. Said more precisely, we ask "Across evaluated studies of a given outcome, what is the average absolute deviation in percentage points between the observational method estimate of lift and the RCT lift?" For example, the RCT lift for study 1 (checkout outcome) is 33%. The EM lift estimate is 117%. Hence the absolute lift deviation is 84 percentage points. For study 2 (checkout outcome) the RCT lift is 0.9%, the EM lift estimate is 535%, and the absolute lift deviation is 534 percentage points. When we average over all studies, exact matching leads to an average absolute lift deviation of 661 percentage points relative to an average RCT lift of 57% across studies (see the last two columns of the first row of the table.)
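
The absolute lift deviation is simple arithmetic, sketched below in Python using only the two studies whose figures are quoted above. The function and variable names are invented for illustration. The 661-point average in the quote is taken over all the checkout studies and cannot be reproduced from these two alone.

```python
def absolute_lift_deviation(observational_lift: float, rct_lift: float) -> float:
    """Absolute difference, in percentage points, between two lift estimates."""
    return abs(observational_lift - rct_lift)


# (exact-matching lift %, RCT lift %) for the two checkout studies quoted above.
QUOTED_LIFTS = {
    "study 1": (117.0, 33.0),
    "study 2": (535.0, 0.9),
}

deviations = {
    study: absolute_lift_deviation(observational, rct)
    for study, (observational, rct) in QUOTED_LIFTS.items()
}
mean_deviation = sum(deviations.values()) / len(deviations)

for study, deviation in deviations.items():
    print(f"{study}: {deviation:.1f} percentage points")
print(f"mean over these two studies: {mean_deviation:.2f} percentage points")
```

Study 2 comes out at 534.1 points, which the quoted text rounds to 534.
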

["Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on Pandora Internet Radio"](https://davidreiley.com/papers/PandoraListenerDemandCurve.pdf), Huang et al 2018:

> How valuable is the variation generated by this experiment? Since it can be difficult to convince decision makers to run experiments on key economic decisions, and it consumes engineering resources to implement such an experiment properly, could we have done just as well by using observational data? To investigate this question, we re-run our analysis using only the 17 million listeners in the control group, since they were untouched by the experiment in the sense that they received the default ad load. In the absence of experimental instrumental variables, we run the regression based on naturally occurring variation in ad load, such as that caused by higher advertiser demand for some listeners than others, excluding listeners who got no ads during the experimental period.
>
> The results are found in Table 7. We find that the endogeneity of the realized ad load (some people get more ad load due to advertiser demand, and these people happen to listen less than people with lower advertiser demand) causes us to overestimate the true causal impact of ad load by a factor of approximately 4.
>
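
The mechanism described above, in which an unobserved factor drives both ad load and listening, can be illustrated with a toy simulation in Python. Nothing here uses Pandora's data. The data-generating process, the coefficients, and the variable names are all assumptions, chosen so that the naive regression overstates the true effect by roughly a factor of 4.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_effect=-1.0, seed=0):
    """Return (observational slope, experimental slope) for a toy confounded process."""
    rng = random.Random(seed)
    # Unobserved advertiser demand per listener (hypothetical confounder).
    demand = [rng.gauss(0, 1) for _ in range(n)]

    # Observational world: ad load rises with advertiser demand.
    observed_load = [d + rng.gauss(0, 1) for d in demand]
    # Experimental world: ad load is randomized, independent of demand.
    randomized_load = [rng.gauss(0, 1) for _ in range(n)]

    def listening(load, d):
        # Assumed process: high-demand listeners listen less regardless of ad load.
        return true_effect * load - 6.0 * d + rng.gauss(0, 1)

    observed_hours = [listening(l, d) for l, d in zip(observed_load, demand)]
    randomized_hours = [listening(l, d) for l, d in zip(randomized_load, demand)]
    return (
        ols_slope(observed_load, observed_hours),
        ols_slope(randomized_load, randomized_hours),
    )


naive_slope, experimental_slope = simulate()
print(f"observational slope: {naive_slope:.2f}")
print(f"experimental slope:  {experimental_slope:.2f}")
```

The randomized slope recovers the assumed true effect of about -1, while the observational slope is close to -4 because it absorbs the effect of the omitted demand variable. Adding controls helps only to the extent that they capture that variable.
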
> ...To give the panel estimator its best shot, we use what we have learned from the experiment, and allow the panel estimator a 20-month time period between observations...We see from Table 8 that the point estimate for active days is much closer to that found in Table 5, but it still overestimates the impact of ad load by more than the width of our 95% confidence intervals. The panel point estimate for total hours, while an improvement over the cross-sectional regression results, still overestimates the effect by a factor of 3. Our result suggests that, even after controlling for time-invariant listener heterogeneity, observational techniques still suffer from omitted-variable bias caused by unobservable terms that vary across individuals *and* time that correlate with ad load and listening behaviors. And without a long-run experiment, we would not have known the relevant timescale to consider to measure the long-run sensitivity to advertising (which is what matters for the platform's policy decisions)

# Education

["An evaluation of bias in three measures of teacher quality: Value-added, classroom observations, and student surveys"](https://scholar.harvard.edu/files/andrewbacherhicks/files/an_evaluation_of_bias_in_three_measures_of_teacher_quality.pdf), Bacher-Hicks et al 2017:

> We conduct a random assignment experiment to test the predictive validity of three measures of teacher performance: value added, classroom observations, and student surveys. Combining our results with those from two previous random assignment experiments, we provide further (and more precise) evidence that value-added measures are unbiased predictors of teacher performance. In addition, we provide the first evidence that classroom observation scores are unbiased predictors of teacher performance on a rubric measuring the quality of mathematics instruction, but we lack the statistical power to reach any similar conclusions for student survey responses.
>
> We used the pre-existing administrative records and the additional data collected by NCTE to generate estimates of teacher performance on five measures: (a) students' scores on state standardized mathematics tests; (b) students' scores on the project-developed mathematics test (Hickman, Fu, & Hill, 2012); (c) teachers' performance on the Mathematical Quality of Instruction (MQI; Hill et al., 2008) classroom observation instrument; (d) teachers' performance on the Classroom Assessment Scoring System (CLASS; La Paro, Pianta, & Hamre, 2012) observation instrument; and (e) students' responses to a Tripod-based perception survey (Ferguson, 2009).^5^ Kane and Staiger (2008) McCaffrey, Miller, and Staiger (2013)

["Validating Teacher Effects On Students' Attitudes And Behaviors: Evidence From Random Assignment Of Teachers To Students"](/docs/statistics/causality/2017-blazar.pdf), Blazar 2017:

> There is growing interest among researchers, policymakers, and practitioners in identifying teachers who are skilled at improving student outcomes beyond test scores. However, important questions remain about the validity of these teacher effect estimates. Leveraging the random assignment of teachers to classes, I find that teachers have causal effects on their students' self-reported behavior in class, self-efficacy in math, and happiness in class that are similar in magnitude to effects on math test scores. Weak correlations between teacher effects on different student outcomes indicate that these measures capture unique skills that teachers bring to the classroom. Teacher effects calculated in non-experimental data are related to these same outcomes following random assignment, revealing that they contain important information content on teachers. However, for some non-experimental teacher effect estimates, large and potentially important degrees of bias remain. These results suggest that researchers and policymakers should proceed with caution when using these measures. They likely are more appropriate for low-stakes decisions, such as matching teachers to professional development, than for high-stakes personnel decisions and accountability.
>
> ...In Table 5, I report estimates describing the relationship between non-experimental teacher effects on student outcomes and these same measures following random assignment. Cells contain estimates from separate regression models where the dependent variable is the student attitude or behavior listed in each column. The independent variable of interest is the non-experimental teacher effect on this same outcome estimated in years prior to random assignment. All models include fixed effects for randomization block to match the experimental design. In order to increase the precision of my estimates, models also control for students' prior achievement in math and reading, student demographic characteristics, and classroom characteristics from randomly assigned rosters.
>
> ...Validity evidence for teacher effects on students' math performance is consistent with other experimental studies (Kane et al. 2013; Kane and Staiger 2008), where predicted differences in teacher effectiveness in observational data come close to actual differences following random assignment of teachers to classes. The non-experimental teacher effect estimate that comes closest to a 1:1 relationship is the shrunken estimate that controls for students' prior achievement and other demographic characteristics (0.995 SD).
> Despite a relatively small sample of teachers, the standard error for this estimate (0.084) is substantively smaller than those in other studies - including the meta-analysis conducted by Bacher-Hicks et al. (2017) - and allows me to rule out relatively large degrees of bias in teacher effects calculated from this model. A likely explanation for greater precision in this study relative to others is the fact that other studies generate estimates through instrumental variables estimation to calculate treatment on the treated. Instead, I use OLS regression and account for non-compliance by narrowing in on randomization blocks in which very few, if any, students moved out of their randomly assigned teachers' classroom. Non-experimental teacher effects calculated without shrinkage are related less strongly to current student outcomes, though differences in estimates and associated standard errors between Panel A and Panel B are not large. All corresponding estimates (e.g., Model 1 from Panel A versus Panel B) have overlapping 95% confidence intervals.
>
> ...For both Self-Efficacy in Math and Happiness in Class, non-experimental teacher effect estimates have moderate predictive validity. Generally, I can distinguish estimates from 0 SD, indicating that they contain some information content on teachers. The exception is shrunken estimates for Self-Efficacy in Math. Although estimates are similar in magnitude to the unshrunken estimates in Panel A, between 0.42 SD and 0.58 SD, standard errors are large and 95% confidence intervals cross 0 SD. I also can distinguish many estimates from 1 SD. This indicates that non-experimental teacher effects on students' Self-Efficacy in Math and Happiness in Class contain potentially large and important degrees of bias. For both measures of teacher effectiveness, point estimates around 0.5 SD suggest that they contain roughly 50% bias.
>
> !["Table 5: Relationship between Current Student Outcomes and Prior, Non-experimental Teacher Effect Outcomes" [$\frac{3}{4}$ of effects <1 imply bias]](/images/causality/2017-blazar-table5-correlationvscausation.png){.full-width}
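
The logic of Blazar's validity test can be sketched in a few lines of Python. The reading follows the excerpt above: a coefficient of 1 means the non-experimental teacher effect predicts the post-randomization outcome one-for-one, so 1 minus the coefficient is the implied share of bias. The only figures taken from the paper are the math-score estimate (0.995) and its standard error (0.084). The 1.96 normal-approximation multiplier and the names `check_validity` and `implied_bias` are assumptions made for illustration, not the paper's own procedure.

```python
from dataclasses import dataclass

# Normal-approximation critical value for a 95% interval.
# Assumption: the excerpt does not say how the paper builds its intervals.
Z_95 = 1.96


@dataclass
class ValidityCheck:
    estimate: float
    ci_low: float
    ci_high: float
    excludes_zero: bool   # True -> the estimate carries some information
    excludes_one: bool    # True -> a 1:1 (unbiased) relationship is rejected
    implied_bias: float   # 1 - coefficient, per the "0.5 SD ~ 50% bias" reading


def implied_bias(estimate: float) -> float:
    """Share of bias implied by a validation coefficient (1 = no bias)."""
    return 1.0 - estimate


def check_validity(estimate: float, std_error: float) -> ValidityCheck:
    """Build a 95% interval and test it against the benchmarks 0 and 1."""
    half_width = Z_95 * std_error
    low, high = estimate - half_width, estimate + half_width
    return ValidityCheck(
        estimate=estimate,
        ci_low=low,
        ci_high=high,
        excludes_zero=not (low <= 0.0 <= high),
        excludes_one=not (low <= 1.0 <= high),
        implied_bias=implied_bias(estimate),
    )


# Math test scores: shrunken estimate and standard error quoted above.
math_effect = check_validity(0.995, 0.084)
print(math_effect)

# Self-efficacy and happiness: only the point estimate (~0.5 SD) is quoted,
# so only the implied bias can be computed here.
print(implied_bias(0.5))
```

Under these assumptions, the math-score interval runs from roughly 0.83 to 1.16. It excludes 0 but not 1, which matches the claim that large bias can be ruled out for that model. A point estimate of 0.5 gives an implied bias of one half.
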
# TODO
- Lonn EM, Yusuf S. "Is there a role for antioxidant vitamins in the prevention of cardiovascular disease? An update on epidemiological and clinical trials data". _Can J Cardiol_ 1997;13:957-965
- Patterson RE, White E, Kristal AR, Neuhouser ML, Potter JD. "Vitamin supplement and cancer risk: the epidemiological evidence". _Cancer Causes Control_ 1997;8:786-802
- "Comparisons of effect sizes derived from randomised and non-randomised studies", BC Reeves, RR MacLehose, IM Harvey, TA Sheldon... - _Health Services Research_ ..., 1998 [superseded by MacLehose et al 2000?]
- Glazerman, S., Levy, D., & Myers, D. (2002). [_Nonexperimental replications of social experiments: A systematic review_](https://www.mathematica-mpr.com/-/media/publications/pdfs/nonexperimentalreps.pdf). Washington, DC: Mathematica Policy Research
- Bloom, H. S., Michalopoulos, C., Hill, C., & Lei, Y. (2002). [_Can non-experimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs?_](http://files.eric.ed.gov/fulltext/ED471814.pdf) New York: Manpower Demonstration Research Corporation.
- Fraker & Maynard 1987, ["The adequacy of comparison group designs for evaluations of employment-related programs"](/docs/statistics/causality/1987-fraker.pdf)
- Friedlander & Robins 1995, ["Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods"](/docs/statistics/causality/1995-friedlander.pdf)
- Dehejia & Wahba 1999, ["Causal effects in non-experimental studies: re-evaluating the evaluation of training programs"](/docs/statistics/causality/1999-dehejia.pdf)
- Becker, B. J. (1990). ["Coaching for the Scholastic Aptitude Test: Further synthesis and appraisal"](/docs/statistics/causality/1990-baker.pdf). _Review of Educational Research_, 60, 373-417
- Hazelrigg, M. D., Cooper, H. M., & Borduin, C. M. (1987). ["Evaluating the effectiveness of family therapies: An integrative review and analysis"](/docs/statistics/causality/1987-hazelrigg.pdf). _Psychological Bulletin_, 101, 428-442
- Lipsey, M. W., & Wilson, D. B. (1993). ["The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis"](/docs/psychology/1993-lipsey.pdf). _American Psychologist_, 48, 1181-1209.
- Shapiro, D. A., & Shapiro, D. (1983). ["Comparative therapy outcome research: Methodological implications of meta-analysis"](/docs/statistics/causality/1983-shapiro.pdf). _Journal of Consulting and Clinical Psychology_, 51, 42-53.
- Hedges, L. V. (1983). ["A random effects model for effect sizes"](/docs/statistics/causality/1983-hedges.pdf). _Psychological Bulletin_, 93, 388-395