
statistics/​causality directory


“Sibling Comparison Studies”, Sjölander et al 2022

2022-sjolander.pdf: “Sibling Comparison Studies”, Arvid Sjölander, Thomas Frisell, Sara Öberg (2022-03-01):

Unmeasured confounding is one of the main sources of bias in observational studies. A popular way to reduce confounding bias is to use sibling comparisons, which implicitly adjust for several factors in the early environment or upbringing without requiring them to be measured or known.

In this article we provide a broad exposition of the statistical analysis methods for sibling comparison studies. We further discuss a number of methodological challenges that arise in sibling comparison studies.
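As a rough illustration of why comparing siblings can remove confounding from shared family factors, the sketch below simulates sibling pairs with an unmeasured family-level confounder, then compares a naive regression with a regression on within-pair differences. The data-generating model, the effect sizes, and the names `ols_slope` and `simulate_sibling_pairs` are all hypothetical assumptions for illustration; they are not taken from the article, which covers a broader range of analysis methods.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_sibling_pairs(n_families=20000, true_effect=0.5, seed=1):
    """Return (naive_slope, within_pair_slope) under an assumed toy model."""
    rng = random.Random(seed)
    xs, ys, diff_x, diff_y = [], [], [], []
    for _ in range(n_families):
        # Unmeasured confounder shared by both siblings (assumption).
        family_factor = rng.gauss(0, 1)
        pair = []
        for _ in range(2):
            exposure = family_factor + rng.gauss(0, 1)
            outcome = true_effect * exposure + 2.0 * family_factor + rng.gauss(0, 1)
            pair.append((exposure, outcome))
            xs.append(exposure)
            ys.append(outcome)
        # Differencing within the pair cancels the shared family factor.
        diff_x.append(pair[0][0] - pair[1][0])
        diff_y.append(pair[0][1] - pair[1][1])
    return ols_slope(xs, ys), ols_slope(diff_x, diff_y)


naive, within = simulate_sibling_pairs()
print(f"naive slope: {naive:.2f}, within-pair slope: {within:.2f}")
```

Under these assumed parameters the naive slope should land near 1.5 and the within-pair slope near the true 0.5. The differencing only removes factors that the siblings share; it does nothing about confounders that differ between siblings.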

“Learning Causal Overhypotheses through Exploration in Children and Computational Models”, Kosoy et al 2022

“Learning Causal Overhypotheses through Exploration in Children and Computational Models”, Eliza Kosoy, Adrian Liu, Jasmine Collins, David M. Chan, Jessica B. Hamrick, Nan Rosemary Ke, Sandy H. Huang et al (2022-02-21):

Despite recent progress in reinforcement learning (RL), RL algorithms for exploration still remain an active area of research. Existing methods often focus on state-based metrics, which do not consider the underlying causal structures of the environment, and while recent research has begun to explore RL environments for causal learning, these environments primarily leverage causal information through causal inference or induction rather than exploration. In contrast, human children, some of the most proficient explorers, have been shown to use causal information to great benefit. In this work, we introduce a novel RL environment designed with a controllable causal structure, which allows us to evaluate exploration strategies used by both agents and children in a unified environment. In addition, through experimentation on both computational models and children, we demonstrate that there are statistically-significant differences between information-gain optimal RL exploration in causal environments and the exploration of children in the same environments. We conclude with a discussion of how these findings may inspire new directions of research into efficient exploration and disambiguation of causal structures for RL algorithms.

“Causal and Associational Language in Observational Health Research: A Systematic Evaluation”, Haber et al 2021

“Causal and Associational Language in Observational Health Research: A systematic evaluation”, Noah A. Haber, Sarah E. Wieten, Julia M. Rohrer, Onyebuchi A. Arah, Peter W. G. Tennant, Elizabeth A. Stuart et al (2021-12-16):

We estimated the degree to which language used in the high profile medical/​public health/​epidemiology literature implied causality using language linking exposures to outcomes and action recommendations; examined disconnects between language and recommendations; identified the most common linking phrases; and estimated how strongly linking phrases imply causality.

We searched and screened for 1,170 articles from 18 high-profile journals (65 per journal) published from 2010–2019. Based on written framing and systematic guidance, three reviewers rated the degree of causality implied in abstracts and full text for exposure/​outcome linking language and action recommendations.

Reviewers rated the causal implication of exposure/outcome linking language as None (no causal implication) in 13.8%, Weak in 34.2%, Moderate in 33.2%, and Strong in 18.7% of abstracts. The implied causality of action recommendations was higher than that of linking sentences for 44.5% of articles and commensurate for 40.3%. The most common linking word in abstracts was “associate” (45.7%). Reviewers’ ratings of linking word roots were highly heterogeneous; over half of reviewers rated “association” as having at least some causal implication.

This research undercuts the assumption that avoiding “causal” words leads to clarity of interpretation in medical research.

Figure 5: Strength of causal implication ratings for the most common root linking words. This chart shows the distribution of ratings given by reviewers during the root word rating exercise. On the left side, they are sorted by median rating + the number of reviewers who would have to change their ratings in order for the rating to change. On the right, the chart is sorted alphabetically.

“Providing a Lower-bound Estimate for Psychology’s “crud Factor”: The Case of Aggression”, Ferguson & Heene 2021

2021-ferguson.pdf: “Providing a lower-bound estimate for psychology’s “crud factor”: The case of aggression”, Christopher J. Ferguson, Moritz Heene (2021-11-01):

When conducting research on large data sets, statistically-significant findings having only trivial interpretive meaning may appear. Little consensus exists whether such small effects can be meaningfully interpreted. The current analysis examines the possibility that trivial effects may emerge in large datasets, but that some such effects may lack interpretive value. When such results match an investigator’s hypothesis, they may be over-interpreted.

The current study examines this issue as related to aggression research in 2 large samples. Specifically, in the first study, the National Longitudinal Study of Adolescent to Adult Health (Add Health) dataset was used. 14 variables with little theoretical relevance to aggression were selected, then correlated with self-reported delinquency. For the second study, the Understanding Society database was used. As with Study 1, 14 nonsensical variables were correlated with conduct problems.

Many variables achieved “statistical-significance” and some effect-sizes approached or exceeded r = 0.10, despite little theoretical relevance between the variables.

It is recommended that effect sizes below r = 0.10 should not be interpreted as hypothesis supportive.
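To see why trivial correlations can reach “statistical-significance” in large datasets, the sketch below approximates the two-sided p-value of a Pearson correlation with the Fisher z-transform and a normal approximation. The sample sizes and the correlation of 0.02 are hypothetical values chosen for illustration, not figures from either study, and `two_sided_p` and `variance_explained` are names invented here.

```python
import math


def two_sided_p(r, n):
    """Approximate two-sided p-value for a Pearson correlation r with sample size n.

    Uses the Fisher z-transform with a normal approximation, which is
    reasonable for large n.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def variance_explained(r):
    """Share of variance in one variable linearly accounted for by the other."""
    return r ** 2


# Hypothetical sample sizes; the same tiny correlation at each.
for n in (1_000, 20_000):
    print(n, two_sided_p(0.02, n), variance_explained(0.02))
```

The same tiny correlation is far from significant at the smaller sample size but clears the conventional 0.05 threshold at the larger one, while the variance explained stays the same. At the recommended cutoff of r = 0.10, the variance explained is 1%.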

Table 1: Correlations Between Crud and Delinquency for Study 1
Table 2: Correlations Between Crud and Conduct Problems for Study 2

“Testing the Structure of Human Cognitive Ability Using Evidence Obtained from the Impact of Brain Lesions over Abilities”, Protzko & Colom 2021

2021-protzko.pdf: “Testing the structure of human cognitive ability using evidence obtained from the impact of brain lesions over abilities”, John Protzko, Roberto Colom (2021-11-01):

  • Focal cortical lesions lead to local, not global, deficits.
  • Measurement models to explain the positive manifold are causal models with unique predictions going beyond model fit statistics.
  • Correlated factor, network, process sampling, mutualism, and investment models make causal predictions inconsistent with lesion evidence.
  • Hierarchical and bifactor models are consistent with the pattern of lesion effects, as well as possibly one form of bonds sampling models.
  • Future models and explanations of the positive manifold have to accommodate focal lesions leading to local not global deficits.

Here we examine 3 classes of models regarding the structure of human cognition: common cause models, sampling/​network models, and interconnected models. That disparate models can accommodate one of the most globally replicated psychological phenomena—namely, the positive manifold—is an extension of underdetermination of theory by data. Statistical fit indices are an insufficient and sometimes intractable method of demarcating between the theories; strict tests and further evidence should be brought to bear on understanding the potential causes of the positive manifold. The cognitive impact of focal cortical lesions allows testing the necessary causal connections predicted by competing models. This evidence shows focal cortical lesions lead to local, not global (across all abilities), deficits. Only models that can accommodate a deficit in a given ability without effects on other covarying abilities can accommodate focal lesion evidence. After studying how different models pass this test, we suggest bifactor models (class: common cause models) and bond models (class: sampling models) are best supported. In short, competing psychometric models can be informed when their implied causal connections and predictions are tested.

[Keywords: human intelligence, structural models⁠, causality, statistical model fit, cortical lesions]

[This would seem to explain the failure of dual n-back & WM training in general.

Training the specific ability of WM could only cause g increases in models with ‘upwards causation’ like hierarchical models or dynamic mutual causation like mutualism/​investment models; these are ruled out by the lesion literature which finds that physically-tiny lesions damage specific abilities but not g, and if decreasing a specific ability cannot decrease g, then it’s hard to see how increasing that ability could ever increase g. See also Lee et al 2019⁠.]

“Is Coffee the Cause or the Cure? Conflicting Nutrition Messages in 2 Decades of Online New York Times’ Nutrition News Coverage”, Ihekweazu 2021

2021-ihekweazu.pdf: “Is Coffee the Cause or the Cure? Conflicting Nutrition Messages in 2 Decades of Online New York Times’ Nutrition News Coverage”, Chioma Ihekweazu (2021-09-14):

2⁄3rds of US adults report hearing news stories about diet and health relationships daily or a few times a week. These stories have often been labeled as conflicting. While public opinion suggests conflicting nutrition messages are widespread, there has been limited empirical research to support this belief.

This study examined the prevalence of conflicting information in online New York Times’ news articles discussing published nutrition research between 1996–2016. It also examined the contextual differences that existed between conflicting studies. The final sample included 375 news articles discussing 416 diet and health relationships (228 distinct relationships).

The most popular dietary items discussed were alcoholic beverages (n = 51), vitamin D (n = 26), and B vitamins (n = 23). Over the 20-year study period, 12.7% of the 228 diet and health relationships had conflicting reports. Just under 3⁄4ths of the conflicting reports involved changes in study design, 79% involved changes in study population, and 31% involved changes in industry funding.

Conflicting nutrition messages can have negative cognitive and behavioral consequences for individuals. To help effectively address conflicting nutrition news coverage, a multi-pronged approach involving journalists, researchers, and news audiences is needed.

“Common Elective Orthopaedic Procedures and Their Clinical Effectiveness: Umbrella Review of Level 1 Evidence”, Blom et al 2021

“Common elective orthopaedic procedures and their clinical effectiveness: umbrella review of level 1 evidence”, Ashley W. Blom, Richard L. Donovan, Andrew D. Beswick, Michael R. Whitehouse, Setor K. Kunutsor (2021-07-08):

Objective: To determine the clinical effectiveness of common elective orthopaedic procedures compared with no treatment, placebo, or non-operative care and assess the impact on clinical guidelines.

Design: Umbrella review of meta-analyses of randomised controlled trials or other study designs in the absence of meta-analyses of randomised controlled trials.

Data sources: 10 of the most common elective orthopaedic procedures—arthroscopic anterior cruciate ligament reconstruction⁠, arthroscopic meniscal repair of the knee, arthroscopic partial meniscectomy of the knee, arthroscopic rotator cuff repair⁠, arthroscopic subacromial decompression, carpal tunnel decompression⁠, lumbar spine decompression⁠, lumbar spine fusion⁠, total hip replacement⁠, and total knee replacement—were studied. MEDLINE⁠, Embase⁠, Cochrane Library, and bibliographies were searched until September 2020.

Eligibility criteria for selecting studies: Meta-analyses of randomised controlled trials (or in the absence of meta-analysis other study designs) that compared the clinical effectiveness of any of the 10 orthopaedic procedures with no treatment, placebo, or non-operative care.

Data extraction and synthesis: Summary data were extracted by 2 independent investigators, and a consensus was reached with the involvement of a third. The methodological quality of each meta-analysis was assessed using the Assessment of Multiple Systematic Reviews instrument. The Jadad decision algorithm was used to ascertain which meta-analysis represented the best evidence. The National Institute for Health and Care Excellence Evidence search was used to check whether recommendations for each procedure reflected the body of evidence.

Main outcome measures: Quality and quantity of evidence behind common elective orthopaedic interventions and comparisons with the strength of recommendations in relevant national clinical guidelines.

Results: Randomised controlled trial evidence supports the superiority of carpal tunnel decompression and total knee replacement over non-operative care. No randomised controlled trials specifically compared total hip replacement or meniscal repair with non-operative care. Trial evidence for the other 6 procedures showed no benefit over non-operative care.

Conclusions: Although they may be effective overall or in certain subgroups, no strong, high quality evidence base shows that many commonly performed elective orthopaedic procedures are more effective than non-operative alternatives. Despite the lack of strong evidence, some of these procedures are still recommended by national guidelines in certain situations.

Systematic review registration: PROSPERO CRD42018115917.

“What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory”, Lundberg et al 2021

2021-lundberg.pdf: “What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory”, Ian Lundberg, Rebecca Johnson, Brandon M. Stewart (2021-06-04):

We make only one point in this article. Every quantitative study must be able to answer the question: what is your estimand? The estimand is the target quantity—the purpose of the statistical analysis. Much attention is already placed on how to do estimation; a similar degree of care should be given to defining the thing we are estimating. We advocate that authors state the central quantity of each analysis—the theoretical estimand—in precise terms that exist outside of any statistical model. In our framework, researchers do three things: (1) set a theoretical estimand, clearly connecting this quantity to theory; (2) link to an empirical estimand, which is informative about the theoretical estimand under some identification assumptions; and (3) learn from data. Adding precise estimands to research practice expands the space of theoretical questions, clarifies how evidence can speak to those questions, and unlocks new tools for estimation. By grounding all three steps in a precise statement of the target quantity, our framework connects statistical evidence to theory.

“The Revolution Will Be Hard to Evaluate: How Co-occurring Policy Changes Affect Research on the Health Effects of Social Policies”, Matthay et al 2021

“The revolution will be hard to evaluate: How co-occurring policy changes affect research on the health effects of social policies”, Ellicott C. Matthay, Erin Hagan, Spruha Joshi, May Lynn Tan, David Vlahov, Nancy Adler, M. Maria Glymour et al (2021-05-15):

Extensive empirical health research leverages variation in the timing and location of policy changes as quasi-experiments. Multiple social policies may be adopted simultaneously in the same locations, creating co-occurrence which must be addressed analytically for valid inferences. The pervasiveness and consequences of co-occurring policies have received limited attention. We analyzed a systematic sample of 13 social policy databases covering diverse domains including poverty, paid family leave, and tobacco. We quantified policy co-occurrence in each database as the fraction of variation in each policy measure across different jurisdictions and times that could be explained by co-variation with other policies (R²). We used simulations to estimate the ratio of the variance of effect estimates under the observed policy co-occurrence to variance if policies were independent. Policy co-occurrence ranged from very high for state-level cannabis policies to low for country-level sexual minority rights policies. For 65% of policies, greater than 90% of the place-time variation was explained by other policies. Policy co-occurrence increased the variance of effect estimates by a median of 57×. Co-occurring policies are common and pose a major methodological challenge to rigorously evaluating health effects of individual social policies. When uncontrolled, co-occurring policies confound one another, and when controlled, resulting positivity violations may substantially inflate the variance of estimated effects. Tools to enhance validity and precision for evaluating co-occurring policies are needed.
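The link between co-occurrence and inflated variance can be illustrated with the classical variance inflation factor from linear regression, 1 / (1 − R²), where R² is the share of a predictor's variance explained by the other predictors. This is a textbook simplification offered only as intuition; the authors' figures come from their own simulations, which are not reproduced here, and the function names below are hypothetical.

```python
def variance_inflation(r_squared):
    """Classical variance inflation factor for a regression coefficient.

    r_squared is the share of the predictor's variance explained by the
    other predictors in the model.
    """
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r_squared_for_inflation(factor):
    """R-squared that would produce a given inflation factor under this formula."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return 1.0 - 1.0 / factor


print(variance_inflation(0.90))        # inflation when 90% of variation is shared
print(r_squared_for_inflation(57.0))   # R-squared implied by a 57x inflation
```

Under this simplified formula, a policy with 90% of its variation explained by other policies has its coefficient variance inflated roughly tenfold, and a 57× inflation would correspond to an R² of about 0.98.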

“The Piranha Problem: Large Effects Swimming in a Small Pond”, Tosh et al 2021

“The piranha problem: Large effects swimming in a small pond”, Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, Daniel Hsu (2021-04-08):

In some scientific fields, it is common to have certain variables of interest that are of particular importance and for which there are many studies indicating a relationship with a different explanatory variable. In such cases, particularly those where no relationships are known among explanatory variables, it is worth asking under what conditions it is possible for all such claimed effects to exist simultaneously. This paper addresses this question by reviewing some theorems from multivariate analysis that show, unless the explanatory variables also have sizable effects on each other, it is impossible to have many such large effects. We also discuss implications for the replication crisis in social science.

…The implication of the claims regarding ovulation and voting, shark attacks and voting, college football and voting, etc., is not merely that some voters are superficial and fickle. No, these papers claim that seemingly trivial or irrelevant factors have large and consistent effects, and this runs into the problem of interactions. For example, the effect on your vote of the local college football team losing could depend crucially on whether there’s been a shark attack lately, or on what’s up with your hormones on election day. Or the effect could be positive in an election with a female candidate and negative in an election with a male candidate. Or the effect could interact with your parents’ socioeconomic status⁠, or whether your child is a boy or a girl, or the latest campaign ad, or any of the many other factors that have been studied in the evolutionary psychology and political psychology literatures. Again, we are not saying that psychological factors have no effect on social, political, or economic decision making; we are only arguing that such effects, if large, will necessarily interact in complex ways. Similar reasoning has been used to argue against naive assumptions of causal identification in economics, where there is a large literature considering rainfall as an instrumental variable, without accounting for the implication that these many hypothesized causal pathways would, if taken seriously, represent violations of the assumption of exclusion restriction (Mellon, 2020).

In this work, we demonstrate that there is an inevitable consequence of having many explanatory variables with large effects: the explanatory variables must have large effects on each other. We call this type of result a “piranha theorem” (Gelman, 2017), the analogy being the folk wisdom that if one has a large number of piranhas (representing large effects) in a single fish tank, then one will soon be left with far fewer piranhas (Anonymous, 2021). If there is some outcome on which a large number of studies demonstrate an effect of a novel explanatory variable, then we can conclude that either some of the claimed effects are smaller than claimed, or some of the explanatory variables are essentially measuring the same phenomenon.

There are a multitude of ways to capture the dependency of random variables, and thus we should expect there to be a correspondingly large collection of piranha theorems. We formalize and prove piranha theorems for correlation, regression, and mutual information in Sections 2 and 3. These theorems illustrate the general phenomena at work in any setting with multiple causal or explanatory variables. In Section 4, we examine typical correlations in a finite sample under a simple probabilistic model.
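One elementary result in this spirit can be stated in a few lines: if the explanatory variables are mutually uncorrelated, their squared correlations with the outcome add up to the squared multiple correlation, which cannot exceed 1. The sketch below computes the resulting cap on how many such variables can each have a correlation of a given size. This is only the simplest special case, shown as an assumption-laden illustration; it is not a reproduction of the theorems in the paper, and the function names are invented here.

```python
import math


def max_uncorrelated_predictors(r):
    """Largest number of mutually uncorrelated predictors that can each have
    a correlation of magnitude r with the same outcome.

    Follows from sum(r_i ** 2) <= 1 for mutually uncorrelated predictors.
    """
    if not 0 < abs(r) <= 1:
        raise ValueError("r must be nonzero and at most 1 in magnitude")
    # The small tolerance guards against floating-point rounding in 1 / r**2.
    return math.floor(1.0 / r ** 2 + 1e-9)


def largest_equal_correlation(k):
    """Largest correlation that k mutually uncorrelated predictors can all share."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 / math.sqrt(k)


for r in (0.5, 0.3, 0.1):
    print(r, max_uncorrelated_predictors(r))
```

For example, no more than 4 mutually uncorrelated variables can each correlate 0.5 with the same outcome, and 100 such variables can share a correlation of at most 0.1. To fit more large effects in, the variables must be correlated with one another.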

…For example, an influential experiment from 1996 reported that participants were given a scrambled-sentence task and then were surreptitiously timed when walking away from the lab (Bargh et al 1996). Students whose sentences included elderly-related words such as “worried”, “Florida”, “old”, and “lonely” walked an average of 13% more slowly than students in the control condition, and the difference was statistically-significant.

This experimental claim is of historical interest in psychology in that, despite its implausibility, it was taken seriously for many years (for example, “You have no choice but to accept that the major conclusions of these studies are true” (Kahneman, 2011)), but it failed to replicate (Harris et al 2013) and is no longer generally believed to represent a real effect; for background see Wagenmakers et al 2015. Now we understand such apparently statistically-significant findings as the result of selection with many researcher degrees of freedom (Simmons et al 2011).

Here, though, we will take the published claim at face value and also work within its larger theoretical structure, under which weak indirect stimuli can produce large effects.

An effect of 13% on walking speed is not in itself huge; the difficulty comes when considering elderly-related words as just one of many potential stimuli. Here are just some of the factors that have been presented in the social priming literature as having large effects on behavior: hormones (male and female), subliminal images, the outcomes of recent football games, irrelevant news events such as shark attacks, a chance encounter with a stranger, parental socioeconomic status, weather, the last digit of one’s age, the sex of a hurricane name, the sexes of siblings, the position in which a person is sitting, and many others. A common feature of these examples is that the stimuli have no clear direct effect on the measured outcomes, and in most cases the experimental subject is not even aware of the manipulation. Based on these examples, one can come up with dozens of other potential stimuli that fit the pattern. For example, in addition to elderly-related words, one could also consider word lengths (with longer words corresponding to slower movement), sounds of words (with smooth sibilance motivating faster walking), subject matter (sports-related words as compared to sedentary words), affect (happy words compared to sad words, or calm compared to angry), words related to travel (inducing faster walking) or invoking adhesives such as tape or glue (inducing slower walking), and so on. Similarly, one can consider many different sorts of incidental events, not just encounters with strangers but also a ringing phone or knocking at the door or the presence of a male or female lab assistant (which could have a main effect or interact with the participant’s sex) or the presence or absence of a newspaper or magazine on a nearby table, ad infinitum.

Now we can invoke the piranha theorem. Suppose we can imagine 100 possible stimuli, each with an effect of 13% on walking speed, all of which could arise in a real-world setting where we encounter many sources of text, news, and internal and external stimuli. If the effects are independent, then at any given time we could expect, on the log scale, a total effect with standard deviation 0.5 × √100 × log(1.13) = 0.61, thus walking speed could easily be multiplied or divided by e^0.61 = 1.8 based on a collection of arbitrary stimuli that are imperceptible to the person being affected. And this factor of 1.8 could be made arbitrarily large by simply increasing the number of potential primes.

It is ridiculous to think that walking speed could be randomly doubled or halved based on a random collection of unnoticed stimuli—but that is the implication of the embodied cognition literature. It is basically a Brownian motion model in which the individual inputs are too large to work out.
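The arithmetic in the walking-speed example can be checked directly. In the sketch below, the function name `log_scale_sd` and the parameter `presence_sd` are hypothetical; reading the factor of 0.5 as the standard deviation of an on/off stimulus that is present half the time is an assumption, since the excerpt does not spell it out.

```python
import math


def log_scale_sd(n_stimuli, effect_ratio, presence_sd=0.5):
    """Standard deviation of the summed log-effect of independent stimuli.

    presence_sd = 0.5 is assumed to be the SD of a stimulus that is
    switched on half the time (an interpretation, not stated in the text).
    """
    return presence_sd * math.sqrt(n_stimuli) * math.log(effect_ratio)


sd = log_scale_sd(100, 1.13)
factor = math.exp(sd)
print(round(sd, 2), round(factor, 1))  # 0.61 1.8
```

Because the standard deviation grows with the square root of the number of stimuli, passing a larger `n_stimuli` shows how the implied multiplicative swing keeps growing as more primes are hypothesized.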

“Interpolating Causal Mechanisms: The Paradox of Knowing More”, Stephan et al 2021

2021-stephan.pdf: “Interpolating Causal Mechanisms: The Paradox of Knowing More”, Simon Stephan, Katya Tentori, Stefania Pighin, Michael R. Waldmann (2021-03):

Causal knowledge is not static; it is constantly modified based on new evidence. The present set of seven experiments explores 1 important case of causal belief revision that has been neglected in research so far: causal interpolations.

A simple prototypic case of an interpolation is a situation in which we initially have knowledge about a causal relation or a positive covariation between 2 variables but later become interested in the mechanism linking these 2 variables. Our key finding is that the interpolation of mechanism variables tends to be misrepresented, which leads to the paradox of knowing more: The more people know about a mechanism, the weaker they tend to find the probabilistic relation between the 2 variables (ie. weakening effect). Indeed, in all our experiments we found that, despite identical learning data about 2 variables, the probability linking the 2 variables was judged higher when follow-up research showed that the 2 variables were assumed to be directly causally linked (ie. C → E) than when participants were instructed that the causal relation is in fact mediated by a variable representing a component of the mechanism (M; ie. C → M → E).

Our explanation of the weakening effect is that people often confuse discoveries of preexisting but unknown mechanisms with situations in which new variables are being added to a previously simpler causal model, thus violating causal stability assumptions in natural kind domains. The experiments test several implications of this hypothesis.

[Keywords: belief revision, causal Bayes nets, causal reasoning, interpolation, probabilistic reasoning]

[Original OSF data⁠; remember, “it all adds up to normality”.]

“Quantifying Causality in Data Science With Quasi-experiments”, Liu et al 2021

2021-liu.pdf: “Quantifying causality in data science with quasi-experiments”⁠, Tony Liu, Lyle Ungar, Konrad Kording (2021-01-14):

Estimating causality from observational data is essential in many data science questions but can be a challenging task.

Here we review approaches to causality that are popular in econometrics and that exploit (quasi) random variation in existing data, called quasi-experiments, and show how they can be combined with machine learning to answer causal questions within typical data science settings.

We also highlight how data scientists can help advance these methods to bring causal estimation to high-dimensional data from medicine, industry and society.

“Intelligence and General Psychopathology in the Vietnam Experience Study: A Closer Look”, Kirkegaard & Nyborg 2021

2021-kirkegaard.pdf: “Intelligence and General Psychopathology in the Vietnam Experience Study: A Closer Look”, Emil O. W. Kirkegaard, Helmuth Nyborg (2021):

Prior research has indicated that one can summarize the variation in psychopathology measures in a single dimension, labeled P by analogy with the g factor of intelligence. Research shows that this P factor has a weak to moderate negative relationship to intelligence.

We used data from the Vietnam Experience Study to reexamine the relations between psychopathology assessed with the MMPI (Minnesota Multiphasic Personality Inventory) and intelligence (total n = 4,462: 3,654 whites, 525 blacks, 200 Hispanics, and 83 others).

We show that the scoring of the P factor affects the strength of the relationship with intelligence. Specifically, item response theory-based scores correlate more strongly with intelligence than sum-scoring or scale-based scores: r’s = −0.35, −0.31, and −0.25, respectively.

We furthermore show that the factor loadings from these analyses show moderately strong Jensen patterns such that items and scales with stronger loadings on the P factor also correlate more negatively with intelligence (r = −0.51 for 566 items, r = −0.60 for 14 scales).

Finally, we show that training an elastic net model on the item data allows one to predict intelligence with extremely high precision, r = 0.84. We examined whether these predicted values worked as intended with regards to cross-racial predictive validity, and relations to other variables. We mostly find that they work as intended, but seem slightly less valid for blacks and Hispanics (r’s = 0.85, 0.83, and 0.81, for whites, Hispanics, and blacks, respectively).

[Keywords: Vietnam Experience Study, MMPI, general psychopathology factor, intelligence, cognitive ability, machine learning, elastic net, LASSO⁠, random forest⁠, crud factor]

…To further examine predictive accuracy, we trained a lasso model to see if a relatively sparse model could be obtained. The validity of the lasso model, however, was essentially identical to the elastic net one, and the optimal lasso fit was not very sparse (363 out of 556 items used)…It is seen that about 90 items are needed to reach a correlation accuracy of 0.80, whereas only 3 items are needed to reach 0.50. This may be surprising, but some items have absolute correlations to g of around 0.40, so it is unsurprising that combining 3 of them yields a model accuracy at 0.50.
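The remark about combining 3 items can be made concrete with the standard formula for the correlation between a criterion and a unit-weighted sum of k standardized items. This is only a rough analogue of the penalized regression models used in the paper, and the inter-item correlations below are hypothetical values, since the excerpt does not report them; `composite_correlation` is a name invented for this sketch.

```python
import math


def composite_correlation(k, item_criterion_r, inter_item_r):
    """Correlation between a criterion and the unit-weighted sum of k
    standardized items.

    Assumes every item correlates item_criterion_r with the criterion and
    every pair of items correlates inter_item_r with each other.
    """
    covariance = k * item_criterion_r                    # cov(sum of items, criterion)
    sum_sd = math.sqrt(k + k * (k - 1) * inter_item_r)   # SD of the item sum
    return covariance / sum_sd


# Hypothetical inter-item correlations; 0.40 is the item-criterion value
# mentioned in the text.
for inter_item_r in (0.0, 0.2, 0.46):
    print(inter_item_r, composite_correlation(3, 0.40, inter_item_r))
```

With uncorrelated items the composite would correlate about 0.69 with the criterion; the composite falls to 0.50 only when the items overlap substantially with one another (an inter-item correlation of 0.46 under these assumptions).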

…Finally, we fit a random forest model. This performed slightly worse than the elastic net (r = 0.78). The failure of the random forest model to do better than the elastic net indicates that nonlinear and interaction effects are not important in a given dataset for the purpose of prediction. In other words, the additive assumption is supported for this dataset and outcome variable.

“Objecting to Experiments Even While Approving of the Policies or Treatments They Compare”, Heck et al 2020

“Objecting to experiments even while approving of the policies or treatments they compare”, Patrick R. Heck, Christopher F. Chabris, Duncan J. Watts, Michelle N. Meyer (2020-08-11):

We resolve a controversy over two competing hypotheses about why people object to randomized experiments: (1) People unsurprisingly object to experiments only when they object to a policy or treatment the experiment contains, or (2) people can paradoxically object to experiments even when they approve of implementing either condition for everyone. Using multiple measures of preference and test criteria in 5 preregistered within-subjects studies with 1,955 participants, we find that people often disapprove of experiments involving randomization despite approving of the policies or treatments to be tested.

[Keywords: field experiments, A/​B tests, randomized controlled trials, research ethics, pragmatic trials]

“Commentary: Cynical Epidemiology”, Kaufman 2020

2020-kaufman.pdf: “Commentary: Cynical epidemiology”⁠, Jay S. Kaufman (2020-06-28)

“Rethinking Causation for Data-intensive Biology: Constraints, Cancellations, and Quantized Organisms: Causality in Complex Organisms Is Sculpted by Constraints rather than Instigators, With Outcomes Perhaps Better Described by Quantized Patterns Than Rectilinear Pathways”, Brash 2020

2020-brash.pdf: “Rethinking Causation for Data-intensive Biology: Constraints, Cancellations, and Quantized Organisms: Causality in complex organisms is sculpted by constraints rather than instigators, with outcomes perhaps better described by quantized patterns than rectilinear pathways”, Douglas E. Brash (2020-06-02):

Complex organisms thwart the simple rectilinear causality paradigm of “necessary and sufficient”, with its experimental strategy of “knock down and overexpress.”

This Essay organizes the eccentricities of biology into 4 categories that call for new mathematical approaches; recaps for the biologist the philosopher’s recent refinements to the causation concept and the mathematician’s computational tools that handle some but not all of the biological eccentricities; and describes overlooked insights that make causal properties of physical hierarchies such as emergence and downward causation straightforward.

Reviewing and extrapolating from similar situations in physics, it is suggested that new mathematical tools for causation analysis incorporating feedback, signal cancellation, nonlinear dependencies, physical hierarchies, and fixed constraints rather than instigative changes will reveal unconventional biological behaviors. These include “eigenisms”, organisms that are limited to quantized states; trajectories that steer a system such as an evolving species toward optimal states; and medical control via distributed “sheets” rather than single control points.

[Keywords: causation, constraint, driver, emergence, feedback, hierarchy, quantization]

“Exploring the Persome: The Power of the Item in Understanding Personality Structure”, Revelle et al 2020

“Exploring the persome: The power of the item in understanding personality structure”, William Revelle, Elizabeth M. Dworak, David M. Condon (2020-03-06; backlinks; similar):

We discuss methods of data collection and analysis that emphasize the power of individual personality items for predicting real world criteria (eg. smoking, exercise, self-rated health). These methods are borrowed by analogy from radio astronomy and human genomics. Synthetic Aperture Personality Assessment (SAPA) applies a matrix sampling procedure that synthesizes very large covariance matrices through the application of massively missing at random data collection. These large covariance matrices can be applied, in turn, in Persome Wide Association Studies (PWAS) to form personality prediction scores for particular criteria. We use two open source data sets (n = 4,000 and 126,884 with 135 and 696 items respectively) for demonstrations of both of these procedures. We compare these procedures to the more traditional use of “Big 5” or a larger set of narrower factors (the “little 27”). We argue that there is more information at the item level than is used when aggregating items to form factorially derived scales.

[Keywords: Persome, Persome Wide Association Studies, Synthetic Aperture Personality Assessment (SAPA), Massively Missing Completely at Random (MMCAR), Scale construction, Factor analysis⁠, Item analysis]
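The matrix-sampling idea described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the SAPA implementation: the function names `sample_responses` and `pairwise_cov`, the single latent trait, and all sample sizes are hypothetical. Each simulated participant answers a random subset of items, and each covariance is estimated from the participants who happened to answer both items.

```python
import random

def sample_responses(n_people, n_items, items_per_person, seed=0):
    """Simulate responses where each person sees only a random subset of items."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        trait = rng.gauss(0, 1)  # hypothetical latent trait shared by all items
        shown = rng.sample(range(n_items), items_per_person)
        # Store only the items this person was shown; the rest are missing.
        data.append({i: trait + rng.gauss(0, 1) for i in shown})
    return data

def pairwise_cov(data, i, j):
    """Covariance of items i and j using only people who answered both."""
    pairs = [(row[i], row[j]) for row in data if i in row and j in row]
    n = len(pairs)
    if n < 2:
        return None  # not enough overlap to estimate this cell
    mean_i = sum(a for a, _ in pairs) / n
    mean_j = sum(b for _, b in pairs) / n
    return sum((a - mean_i) * (b - mean_j) for a, b in pairs) / (n - 1)

data = sample_responses(n_people=2000, n_items=20, items_per_person=5)
cov_01 = pairwise_cov(data, 0, 1)
```

No single participant answers all 20 items, yet every cell of the 20 × 20 covariance matrix can still be filled in, because each pair of items is co-administered to some participants.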

“How Should We Critique Research?”, Branwen 2019

Research-criticism: “How Should We Critique Research?”, Gwern Branwen (2019-05-19; backlinks; similar):

Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.

Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of meta-science have demonstrated that much research is of low quality and often false.

But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?

I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.

This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.

If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.

“A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”, Gordon et al 2019

2019-gordon.pdf: “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”, Brett R. Gordon, Florian Zettelmeyer, Neha Bhargava, Dan Chapsky (2019-05-04; backlinks; similar):

Measuring the causal effects of digital advertising remains challenging despite the availability of granular data. Unobservable factors make exposure endogenous, and advertising’s effect on outcomes tends to be small. In principle, these concerns could be addressed using randomized controlled trials (RCTs). In practice, few online ad campaigns rely on RCTs and instead use observational methods to estimate ad effects. We assess empirically whether the variation in data typically available in the advertising industry enables observational methods to recover the causal effects of online advertising. Using data from 15 U.S. advertising experiments at Facebook comprising 500 million user-experiment observations and 1.6 billion ad impressions, we contrast the experimental results to those obtained from multiple observational models. The observational methods often fail to produce the same effects as the randomized experiments, even after conditioning on extensive demographic and behavioral variables. In our setting, advances in causal inference methods do not allow us to isolate the exogenous variation needed to estimate the treatment effects. We also characterize the incremental explanatory power our data would require to enable observational methods to successfully measure advertising effects. Our findings suggest that commonly used observational approaches based on the data usually available in the industry often fail to accurately measure the true effect of advertising.

“Association Studies of up to 1.2 Million Individuals Yield New Insights into the Genetic Etiology of Tobacco and Alcohol Use”, Liu et al 2019

2019-liu.pdf: “Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use”, Mengzhen Liu, Yu Jiang, Robbee Wedow, Yue Li, David M. Brazel, Fang Chen, Gargi Datta, Jose Davila-Velderrain et al (2019-01-01; backlinks)

“Effect of a Workplace Wellness Program on Employee Health and Economic Outcomes: A Randomized Clinical Trial”, Song & Baicker 2019

2019-song.pdf: “Effect of a Workplace Wellness Program on Employee Health and Economic Outcomes: A Randomized Clinical Trial”, Zirui Song, Katherine Baicker (2019-01-01; backlinks)

“Correlation = Causation? Music Training, Psychology, and Neuroscience”, Schellenberg 2019

2019-schellenberg.pdf: “Correlation = causation? Music training, psychology, and neuroscience”⁠, E. Glenn Schellenberg (2019-01-01)

“Why Scatter Plots Suggest Causality, and What We Can Do about It”, Bergstrom & West 2018

“Why scatter plots suggest causality, and what we can do about it”, Carl T. Bergstrom, Jevin D. West (2018-09-25; similar):

Scatter plots carry an implicit if subtle message about causality. Whether we look at functions of one variable in pure mathematics, plots of experimental measurements as a function of the experimental conditions, or scatter plots of predictor and response variables, the value plotted on the vertical axis is by convention assumed to be determined or influenced by the value on the horizontal axis. This is a problem for the public understanding of scientific results and perhaps also for professional scientists’ interpretations of scatter plots. To avoid suggesting a causal relationship between the x and y values in a scatter plot, we propose a new type of data visualization, the diamond plot. Diamond plots are essentially 45 degree rotations of ordinary scatter plots; by visually jarring the viewer they clearly indicate that she should not draw the usual distinction between independent/​predictor variable and dependent/​response variable. Instead, she should see the relationship as purely correlative.
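The 45-degree rotation behind the diamond plot can be sketched in a few lines. This is an illustrative assumption rather than the authors' code: the function name `rotate_45` is hypothetical, and the counter-clockwise direction is chosen arbitrarily, since the abstract does not specify one.

```python
import math

def rotate_45(points):
    """Rotate (x, y) points by 45 degrees counter-clockwise about the origin."""
    c = math.cos(math.pi / 4)
    s = math.sin(math.pi / 4)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Three example points: one on the line x = y, two mirrored across it.
points = [(1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
rotated = rotate_45(points)
```

After the rotation, points with x = y fall on the vertical axis and the two mirrored points sit symmetrically to either side of it, so neither variable is privileged as the horizontal "predictor".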

“Causal Language and Strength of Inference in Academic and Media Articles Shared in Social Media (CLAIMS): A Systematic Review”, Haber et al 2018

“Causal language and strength of inference in academic and media articles shared in social media (CLAIMS): A systematic review”, Noah Haber, Emily R. Smith, Ellen Moscoe, Kathryn Andrews, Robin Audy, Winnie Bell, Alana T. Brennan et al (2018-05-30; backlinks; similar):

Background: The pathway from evidence generation to consumption contains many steps which can lead to overstatement or misinformation. The proliferation of internet-based health news may encourage selection of media and academic research articles that overstate strength of causal inference. We investigated the state of causal inference in health research as it appears at the end of the pathway, at the point of social media consumption.

Methods: We screened the NewsWhip Insights database for the most shared media articles on Facebook and Twitter reporting about peer-reviewed academic studies associating an exposure with a health outcome in 2015, extracting the 50 most-shared academic articles and media articles covering them. We designed and utilized a review tool to systematically assess and summarize studies’ strength of causal inference, including generalizability, potential confounders, and methods used. These were then compared with the strength of causal language used to describe results in both academic and media articles. Two randomly assigned independent reviewers and one arbitrating reviewer from a pool of 21 reviewers assessed each article.

Results: We accepted the most shared 64 media articles pertaining to 50 academic articles for review, representing 68% of Facebook and 45% of Twitter shares in 2015. 34% of academic studies and 48% of media articles used language that reviewers considered too strong for their strength of causal inference. 70% of academic studies were considered low or very low strength of inference, with only 6% considered high or very high strength of causal inference. The most severe issues with academic studies’ causal inference were reported to be omitted confounding variables and generalizability. 58% of media articles were found to have inaccurately reported the question, results, intervention, or population of the academic study.

Conclusions: We find a large disparity between the strength of language as presented to the research consumer and the underlying strength of causal inference among the studies most widely shared on social media. However, because this sample was designed to be representative of the articles selected and shared on social media, it is unlikely to be representative of all academic and media work. More research is needed to determine how academic institutions, media organizations, and social network sharing patterns impact causal inference and language as received by the research consumer.

“Amusing Ourselves to Death?”, Branwen 2018

Amuse: “Amusing Ourselves to Death?”, Gwern Branwen (2018-05-12; backlinks; similar):

A suggested x-risk/Great Filter is the possibility of advanced entertainment technology leading to wireheading/mass sterility/population collapse and extinction. As media consumption patterns are highly heritable, any such effect would trigger rapid human adaptation, implying extinction is almost impossible unless the collapse is immediate or addictiveness accelerates exponentially.

To demonstrate the point that there are pervasive genetic influences on all aspects of media consumption or leisure time activities/preferences/attitudes, I compile >580 heritability estimates from the behavioral genetics literature (drawing particularly on Loehlin & Nichols 1976’s A Study of 850 Sets of Twins), roughly divided into ~13 categories.

“Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on Pandora Internet Radio”, Huang et al 2018

“Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on Pandora Internet Radio”, Jason Huang, David H. Reiley, Nickolai M. Riabov (2018-04-21; backlinks; similar):

A randomized experiment with almost 35 million Pandora listeners enables us to measure the sensitivity of consumers to advertising, an important topic of study in the era of ad-supported digital content provision. The experiment randomized listeners into 9 treatment groups, each of which received a different level of audio advertising interrupting their music listening, with the highest treatment group receiving more than twice as many ads as the lowest treatment group. By keeping consistent treatment assignment for 21 months, we are able to measure long-run demand effects, with three times as much ad-load sensitivity as we would have obtained if we had run a month-long experiment.

We estimate a demand curve that is strikingly linear, with the number of hours listened decreasing linearly in the number of ads per hour (also known as the price of ad-supported listening). We also show the negative impact on the number of days listened and on the probability of listening at all in the final month.

Using an experimental design that separately varies the number of commercial interruptions per hour and the number of ads per commercial interruption, we find that neither makes much difference to listeners beyond their impact on the total number of ads per hour. Lastly, we find that increased ad load causes a substantial increase in the number of paid ad-free subscriptions to Pandora, particularly among older listeners.

“A Combined Analysis of Genetically Correlated Traits Identifies 187 Loci and a Role for Neurogenesis and Myelination in Intelligence”, Hill et al 2018

“A combined analysis of genetically correlated traits identifies 187 loci and a role for neurogenesis and myelination in intelligence”, William D. Hill, Robert E. Marioni, O. Maghzian, Stuart J. Ritchie, Sarah P. Hagenaars, A. M. McIntosh et al (2018-01-11; backlinks; similar):

Intelligence, or general cognitive function, is phenotypically and genetically correlated with many traits, including a wide range of physical, and mental health variables. Education is strongly genetically correlated with intelligence (rg = 0.70). We used these findings as foundations for our use of a novel approach—multi-trait analysis of genome-wide association studies (MTAG; Turley et al 2017)—to combine two large genome-wide association studies (GWASs) of education and intelligence, increasing statistical power and resulting in the largest GWAS of intelligence yet reported. Our study had four goals: first, to facilitate the discovery of new genetic loci associated with intelligence; second, to add to our understanding of the biology of intelligence differences; third, to examine whether combining genetically correlated traits in this way produces results consistent with the primary phenotype of intelligence; and, finally, to test how well this new meta-analytic data sample on intelligence predicts phenotypic intelligence in an independent sample.

By combining datasets using MTAG, our functional sample size increased from 199,242 participants to 248,482. We found 187 independent loci associated with intelligence, implicating 538 genes, using both SNP-based and gene-based GWAS. We found evidence that neurogenesis and myelination—as well as genes expressed in the synapse, and those involved in the regulation of the nervous system—may explain some of the biological differences in intelligence.

The results of our combined analysis demonstrated the same pattern of genetic correlations as those from previous GWASs of intelligence, providing support for the meta-analysis of these genetically-related phenotypes.

“Polygenic Prediction of the Phenome, across Ancestry, in Emerging Adulthood”, Docherty et al 2017

“Polygenic prediction of the phenome, across ancestry, in emerging adulthood”, Anna R. Docherty, Arden Moscati, Danielle Dick, Jeanne E. Savage, Jessica E. Salvatore, Megan Cooke, Fazil Aliev et al (2017-11-27; backlinks; similar):

Background: Identifying genetic relationships between complex traits in emerging adulthood can provide useful etiological insights into risk for psychopathology. College-age individuals are under-represented in genomic analyses thus far, and the majority of work has focused on the clinical disorder or cognitive abilities rather than normal-range behavioral outcomes.

Methods: This study examined a sample of emerging adults 18–22 years of age (n = 5947) to construct an atlas of polygenic risk for 33 traits predicting relevant phenotypic outcomes. 28 hypotheses were tested based on the previous literature on samples of European ancestry, and the availability of rich assessment data allowed for polygenic predictions across 55 psychological and medical phenotypes.

Results: Polygenic risk for schizophrenia (SZ) in emerging adults predicted anxiety, depression, nicotine use, trauma, and family history of psychological disorders. Polygenic risk for neuroticism predicted anxiety, depression, phobia, panic, neuroticism, and was correlated with polygenic risk for cardiovascular disease.

Conclusions: These results demonstrate the extensive impact of genetic risk for SZ, neuroticism, and major depression on a range of health outcomes in early adulthood. Minimal cross-ancestry replication of these phenomic patterns of polygenic influence underscores the need for more genome-wide association studies of non-European populations.

“Percutaneous Coronary Intervention in Stable Angina (ORBITA): a Double-blind, Randomised Controlled Trial”, Al-Lamee et al 2017

2017-allamee.pdf: “Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial”, Rasha Al-Lamee, David Thompson, Hakim-Moulay Dehbi, Sayan Sen, Kare Tang, John Davies, Thomas Keeble et al (2017-11-02; backlinks; similar):

Background: Symptomatic relief is the primary goal of percutaneous coronary intervention (PCI) in stable angina and is commonly observed clinically. However, there is no evidence from blinded, placebo-controlled randomised trials to show its efficacy.

Methods: ORBITA is a blinded, multicentre randomised trial of PCI versus a placebo procedure for angina relief that was done at five study sites in the UK. We enrolled patients with severe (≥70%) single-vessel stenoses. After enrolment, patients received 6 weeks of medication optimisation. Patients then had pre-randomisation assessments with cardiopulmonary exercise testing, symptom questionnaires, and dobutamine stress echocardiography. Patients were randomised 1:1 to undergo PCI or a placebo procedure by use of an automated online randomisation tool. After 6 weeks of follow-up, the assessments done before randomisation were repeated at the final assessment. The primary endpoint was difference in exercise time increment between groups. All analyses were based on the intention-to-treat principle and the study population contained all participants who underwent randomisation. This study is registered with ClinicalTrials.gov, number NCT02062593.

Findings: ORBITA enrolled 230 patients with ischaemic symptoms. After the medication optimisation phase and between Jan 6, 2014, and Aug 11, 2017, 200 patients underwent randomisation, with 105 patients assigned PCI and 95 assigned the placebo procedure. Lesions had mean area stenosis of 84.4% (SD 10.2), fractional flow reserve of 0.69 (0.16), and instantaneous wave-free ratio of 0.76 (0.22). There was no statistically-significant difference in the primary endpoint of exercise time increment between groups (PCI minus placebo 16.6 s, 95% CI −8.9 to 42.0, p = 0.200). There were no deaths. Serious adverse events included four pressure-wire related complications in the placebo group, which required PCI, and five major bleeding events, including two in the PCI group and three in the placebo group.

Interpretation: In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure. The efficacy of invasive procedures can be assessed with a placebo control, as is standard for pharmacotherapy.
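As a consistency check on the Findings, the reported p-value can be approximately recovered from the point estimate and 95% CI alone. This is a back-of-the-envelope sketch that assumes a symmetric normal-approximation interval; it is not the trial's actual analysis, and the variable names are illustrative.

```python
import math

estimate = 16.6               # PCI minus placebo, exercise time increment (seconds)
ci_low, ci_high = -8.9, 42.0  # reported 95% CI
z_crit = 1.959964             # two-sided 95% critical value of the standard normal

# Under a normal approximation the CI spans 2 * z_crit standard errors.
se = (ci_high - ci_low) / (2 * z_crit)
z = estimate / se

# Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
```

The implied standard error is about 13 s, and the resulting p-value lands close to the reported p = 0.200, so the interval and the p-value in the abstract agree with each other.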

“Implicit Causal Models for Genome-wide Association Studies”, Tran & Blei 2017

“Implicit Causal Models for Genome-wide Association Studies”, Dustin Tran, David M. Blei (2017-10-30; similar):

Progress in probabilistic generative models has accelerated, developing richer models with neural architectures, implicit densities, and with scalable algorithms for their Bayesian inference⁠. However, there has been limited progress in models that capture causal relationships, for example, how individual genetic factors cause major human diseases.

In this work, we focus on two challenges in particular:

How do we build richer causal models, which can capture highly nonlinear relationships and interactions between multiple causes?

How do we adjust for latent confounders, which are variables influencing both cause and effect and which prevent learning of causal relationships?

To address these challenges, we synthesize ideas from causality and modern probabilistic modeling.

For the first, we describe implicit causal models, a class of causal models that leverages neural architectures with an implicit density.

For the second, we describe an implicit causal model that adjusts for confounders by sharing strength across examples.

In experiments, we scale Bayesian inference on up to a billion genetic measurements. We achieve state of the art accuracy for identifying causal factors: we significantly outperform existing genetics methods by an absolute difference of 15–45.3%.

“Genome-wide Meta-analysis Associates HLA-DQA1/DRB1 and LPA and Lifestyle Factors With Human Longevity”, Joshi et al 2017

“Genome-wide meta-analysis associates HLA-DQA1/DRB1 and LPA and lifestyle factors with human longevity”, Peter K. Joshi, Nicola Pirastu, Katherine A. Kentistou, Krista Fischer, Edith Hofer, Katharina E. Schraut et al (2017-10-13; backlinks; similar):

Genomic analysis of longevity offers the potential to illuminate the biology of human aging. Here, using genome-wide association meta-analysis of 606,059 parents’ survival, we discover two regions associated with longevity (HLA-DQA1/​DRB1 and LPA). We also validate previous suggestions that APOE, CHRNA3/​5, CDKN2A/​B, SH2B3 and FOXO3A influence longevity. Next we show that giving up smoking, educational attainment, openness to new experience and high-density lipoprotein (HDL) cholesterol levels are most positively genetically correlated with lifespan while susceptibility to coronary artery disease (CAD), cigarettes smoked per day, lung cancer, insulin resistance and body fat are most negatively correlated. We suggest that the effect of education on lifespan is principally mediated through smoking while the effect of obesity appears to act via CAD. Using instrumental variables, we suggest that an increase of one body mass index unit reduces lifespan by 7 months while 1 year of education adds 11 months to expected lifespan.

“The Surprising Implications of Familial Association in Disease Risk”, Valberg et al 2017

“The surprising implications of familial association in disease risk”, Morten Valberg, Mats Julius Stensrud, Odd O. Aalen (2017-06-14; backlinks; similar):

Background: A wide range of diseases show some degree of clustering in families; family history is therefore an important aspect for clinicians when making risk predictions. Familial aggregation is often quantified in terms of a familial relative risk (FRR), and although at first glance this measure may seem simple and intuitive as an average risk prediction, its implications are not straightforward.

Methods: We use two statistical models for the distribution of disease risk in a population: a dichotomous risk model that gives an intuitive understanding of the implication of a given FRR, and a continuous risk model that facilitates a more detailed computation of the inequalities in disease risk. Published estimates of FRRs are used to produce Lorenz curves and Gini indices that quantify the inequalities in risk for a range of diseases.

Results: We demonstrate that even a moderate familial association in disease risk implies a very large difference in risk between individuals in the population. We give examples of diseases for which this is likely to be true, and we further demonstrate the relationship between the point estimates of FRRs and the distribution of risk in the population.

Conclusions: The variation in risk for several severe diseases may be larger than the variation in income in many countries. The implications of familial risk estimates should be recognized by epidemiologists and clinicians.
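To make the Lorenz-curve and Gini-index idea concrete, here is a minimal sketch for a two-group (dichotomous) risk distribution. It is not the authors' model and does not derive anything from an FRR: the function names and the example numbers (10% of the population carrying 10 times the baseline risk) are purely hypothetical.

```python
def lorenz_points(groups):
    """Lorenz curve vertices for (population_share, risk) groups, lowest risk first."""
    ordered = sorted(groups, key=lambda g: g[1])
    total_pop = sum(share for share, _ in ordered)
    total_risk = sum(share * risk for share, risk in ordered)
    points = [(0.0, 0.0)]
    cum_pop = cum_risk = 0.0
    for share, risk in ordered:
        cum_pop += share
        cum_risk += share * risk
        points.append((cum_pop / total_pop, cum_risk / total_risk))
    return points

def gini(groups):
    """Gini index = 1 - 2 * (area under the piecewise-linear Lorenz curve)."""
    pts = lorenz_points(groups)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area

# Hypothetical population: 90% at baseline risk 1, 10% at risk 10.
g = gini([(0.9, 1.0), (0.1, 10.0)])
```

With these made-up inputs the high-risk tenth of the population carries more than half of the total risk; a population in which everyone has the same risk gives a Gini index of 0.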

“Banner Ads Considered Harmful”, Branwen 2017

Ads: “Banner Ads Considered Harmful”, Gwern Branwen (2017-01-08; backlinks; similar):

9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.

One source of complexity & JavaScript use on Gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg. social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of an expected traffic loss of −9.7% (driven by the subset of users without adblock), with an implied −14% traffic loss if all traffic were exposed to ads (95% credible interval: −13% to −16%), exceeding my decision threshold for disabling ads & strongly ruling out the possibility of acceptably small losses which might justify further experimentation.

Thus, banner ads on Gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.
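One way to read the “randomized paired 2-day blocks” in the Design paragraph is that consecutive 2-day blocks are grouped into pairs, and a coin flip decides which block of each pair shows ads. The sketch below implements that reading only as an assumption; the function name `paired_block_schedule`, the seed, and the number of pairs are hypothetical, and this is not the client-side JS actually used in the experiment.

```python
import random
from datetime import date, timedelta

def paired_block_schedule(start, n_pairs, block_days=2, seed=0):
    """Assign each day to 'ads' or 'no-ads' in randomized paired blocks."""
    rng = random.Random(seed)
    schedule = {}
    day = start
    for _ in range(n_pairs):
        arms = ["ads", "no-ads"]
        rng.shuffle(arms)  # random order of the two arms within the pair
        for arm in arms:
            for _ in range(block_days):
                schedule[day] = arm
                day += timedelta(days=1)
    return schedule

# Three pairs of 2-day blocks, starting on the first day of the experiment.
schedule = paired_block_schedule(date(2017, 1, 1), n_pairs=3)
```

Pairing guarantees that every 4-day window contains exactly 2 ad days and 2 ad-free days, which keeps the two arms balanced against slow drifts in traffic while still leaving the order within each pair random.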

“Graphical Models for Quasi-experimental Designs”, Steiner et al 2017

“Graphical Models for Quasi-experimental Designs”⁠, Peter M. Steiner, Yongnam Kim, Courtney E. Hall, Dan Su (2017):

Randomized controlled trials (RCTs) and quasi-experimental designs like regression discontinuity (RD) designs, instrumental variable (IV) designs, and matching and propensity score (PS) designs are frequently used for inferring causal effects. It is well known that the features of these designs facilitate the identification of a causal estimand and, thus, warrant a causal interpretation of the estimated effect.

In this article, we discuss and compare the identifying assumptions of quasi-experiments using causal graphs. The increasing complexity of the causal graphs as one switches from an RCT to RD, IV, or PS designs reveals that the assumptions become stronger as the researcher’s control over treatment selection diminishes.

We introduce limiting graphs for the RD design and conditional graphs for the latent subgroups of compliers, always takers, and never takers of the IV design, and argue that the PS is a collider that offsets confounding bias via collider bias.

“Could a Neuroscientist Understand a Microprocessor?”, Jonas & Kording 2016

“Could a Neuroscientist Understand a Microprocessor?”, Eric Jonas, Konrad Paul Kording (2016-11-14; backlinks; similar):

[Reply to “Can a biologist fix a radio?”⁠; earlier, Doug the biochemist & Bill the geneticist research how cars work] There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.

Author Summary

Neuroscience is held back by the fact that it is hard to evaluate if a conclusion is correct; the complexity of the systems under study and their experimental inaccessibility make the assessment of algorithmic and data analytic techniques challenging at best. We thus argue for testing approaches using known artifacts, where the correct interpretation is known. Here we present a microprocessor platform as one such test case. We find that many approaches in neuroscience, when used naively, fall short of producing a meaningful understanding.

“Redundancy, Unilateralism and Bias beyond GDP—results of a Global Index Benchmark”, Dill & Gebhart 2016

“Redundancy, Unilateralism and Bias beyond GDP—results of a Global Index Benchmark”⁠, Alexander Dill, Nicolas Gebhart (2016-09-25; backlinks; similar):

Eight out of ten leading international indices to assess developing countries in aspects beyond GDP are showing strong redundancy, bias and unilateralism. The quantitative comparison gives evidence for the fact that always the same countries lead the ranks with a low standard deviation. The dependency of the GDP is striking: do the indices only measure indicators that are direct effects of a strong GDP? While the impact of GDP can be discussed reverse as well, the standard deviation shows a strong bias: only one out of the twenty countries with the highest standard deviation is among the Top-20 countries of the world, but 11 countries among those with the lowest standard deviation. Let’s have a look at the backsides of global statistics and methods to compare their findings. The article is the result of a pre-study to assess Social Capital for development countries made for the German Federal Ministry for Economic Cooperation and Development. The study led to the UN Sustainable Development Goals (UN SDG) project World Social Capital Monitor.

“Molecular Genetic Contributions to Social Deprivation and Household Income in UK Biobank (n = 112,151)”, Hill et al 2016

“Molecular genetic contributions to social deprivation and household income in UK Biobank (n = 112,151)”, W. David Hill, Saskia P. Hagenaars, Riccardo E. Marioni, Sarah E. Harris, David C. M. Liewald, Gail Davies et al (2016-03-09; backlinks; similar):

Individuals with lower socio-economic status (SES) are at increased risk of physical and mental illnesses and tend to die at an earlier age. Explanations for the association between SES and health typically focus on factors that are environmental in origin. However, common single nucleotide polymorphisms (SNPs) have been found collectively to explain around 18% (SE = 5%) of the phenotypic variance of an area-based social deprivation measure of SES. Molecular genetic studies have also shown that physical and psychiatric diseases are at least partly heritable. It is possible, therefore, that phenotypic associations between SES and health arise partly due to a shared genetic etiology.

We conducted a genome-wide association study (GWAS) on social deprivation and on household income using the 112,151 participants of UK Biobank⁠. We find that common SNPs explain 21% (SE = 0.5%) of the variation in social deprivation and 11% (SE = 0.7%) in household income. 2 independent SNPs attained genome-wide statistical-significance for household income, rs187848990 on chromosome 2, and rs8100891 on chromosome 19. Genes in the regions of these SNPs have been associated with intellectual disabilities, schizophrenia, and synaptic plasticity. Extensive genetic correlations were found between both measures of socioeconomic status and illnesses, anthropometric variables, psychiatric disorders, and cognitive ability.

These findings show that some SNPs associated with SES are involved in the brain and central nervous system. The genetic associations with SES are probably mediated via other partly-heritable variables, including cognitive ability, education, personality, and health.

[Figure: Genetic correlation between household income and health variables.]

“Agreement of Treatment Effects for Mortality from Routinely Collected Data and Subsequent Randomized Trials: Meta-epidemiological Survey”, Hemkens et al 2016

“Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey”⁠, Lars G. Hemkens, Despina G. Contopoulos-Ioannidis, John P. A. Ioannidis (2016-02-08; backlinks; similar):

Objective: To assess differences in estimated treatment effects for mortality between observational studies with routinely collected health data (RCD; that are published before trials are available) and subsequent evidence from randomized controlled trials on the same clinical question.

Design: Meta-epidemiological survey.

Data sources: PubMed searched up to November 2014.

Methods: Eligible RCD studies were published up to 2010 that used propensity scores to address confounding bias and reported comparative effects of interventions for mortality. The analysis included only RCD studies conducted before any trial was published on the same topic. The direction of treatment effects, confidence intervals⁠, and effect sizes (odds ratios) were compared between RCD studies and randomized controlled trials. The relative odds ratio (that is, the summary odds ratio of trial(s) divided by the RCD study estimate) and the summary relative odds ratio were calculated across all pairs of RCD studies and trials. A summary relative odds ratio greater than one indicates that RCD studies gave more favorable mortality results.

Results: The evaluation included 16 eligible RCD studies, and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17 275 patients and 835 deaths). Trials were published a median of three years after the corresponding RCD study. For five (31%) of the 16 clinical questions, the direction of treatment effects differed between RCD studies and trials. Confidence intervals in nine (56%) RCD studies did not include the RCT effect estimate. Overall, RCD studies showed statistically-significantly more favorable mortality estimates by 31% than subsequent trials (summary relative odds ratio 1.31 (95% confidence interval 1.03 to 1.65; I² = 0%)).

Conclusions: Studies of routinely collected health data could give different answers from subsequent randomized controlled trials on the same clinical questions, and may substantially overestimate treatment effects. Caution is needed to prevent misguided clinical decision making.
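
The relative odds ratio defined in the Methods above is simple arithmetic. The sketch below is only an illustration: the two odds ratios are hypothetical numbers, not values from the paper, and the function name is an assumption.

```python
import math

def relative_odds_ratio(trial_or: float, rcd_or: float) -> float:
    """Summary odds ratio of the trial(s) divided by the RCD study estimate."""
    return trial_or / rcd_or

# Hypothetical clinical question: the RCD study reports OR = 0.70 for
# mortality, while the later randomized trials pool to OR = 0.90.
ror = relative_odds_ratio(trial_or=0.90, rcd_or=0.70)

# Ratios are usually combined across questions on the log scale.
log_ror = math.log(ror)

# A value above 1 means the RCD study looked more favorable than the trials.
rcd_more_favorable = ror > 1
```

With these made-up inputs the ratio is above 1, which under the paper's convention means the observational estimate was the more optimistic one.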

“Shared Genetic Aetiology between Cognitive Functions and Physical and Mental Health in UK Biobank (n = 112,151) and 24 GWAS Consortia”, Hagenaars et al 2016

“Shared genetic aetiology between cognitive functions and physical and mental health in UK Biobank (n = 112,151) and 24 GWAS consortia”, S. P. Hagenaars, S. E. Harris, G. Davies, W. D. Hill, D. C. M. Liewald, S. J. Ritchie, R. E. Marioni et al (2016-01-26; backlinks; similar):

Causes of the well-documented association between low levels of cognitive functioning and many adverse neuropsychiatric outcomes, poorer physical health and earlier death remain unknown. We used linkage disequilibrium score regression and polygenic profile scoring to test for shared genetic aetiology between cognitive functions and neuropsychiatric disorders and physical health. Using information provided by many published genome-wide association study consortia, we created polygenic profile scores for 24 vascular-metabolic, neuropsychiatric, physiological-anthropometric and cognitive traits in the participants of UK Biobank, a very large population-based sample (n = 112 151). Pleiotropy between cognitive and health traits was quantified by deriving genetic correlations using summary genome-wide association study statistics and the method of linkage disequilibrium score regression. Substantial and statistically-significant genetic correlations were observed between cognitive test scores in the UK Biobank sample and many of the mental and physical health-related traits and disorders assessed here. In addition, highly statistically-significant associations were observed between the cognitive test scores in the UK Biobank sample and many polygenic profile scores, including coronary artery disease, stroke, Alzheimer’s disease, schizophrenia, autism, major depressive disorder, body mass index, intracranial volume, infant head circumference and childhood cognitive ability. Where disease diagnosis was available for UK Biobank participants, we were able to show that these results were not confounded by those who had the relevant disease. These findings indicate that a substantial level of pleiotropy exists between cognitive abilities and many human mental and physical health disorders and traits and that it can be used to predict phenotypic variance across samples.

“Beyond GDP? Welfare across Countries and Time”, Jones & Klenow 2016

“Beyond GDP? Welfare across Countries and Time”⁠, Charles I. Jones, Peter J. Klenow (2016; backlinks; similar):

We propose a summary statistic for the economic well-being of people in a country. Our measure incorporates consumption, leisure, mortality, and inequality, first for a narrow set of countries using detailed micro data, and then more broadly using multi-country datasets. While welfare is highly correlated with GDP per capita, deviations are often large. Western Europe looks considerably closer to the United States, emerging Asia has not caught up as much, and many developing countries are further behind. Each component we introduce plays an important role in accounting for these differences, with mortality being most important.

Key Point 1: GDP per person is an excellent indicator of welfare across the broad range of countries: the two measures have a correlation of 0.98. Nevertheless, for any given country, the difference between the two measures can be important. Across 13 countries, the median deviation is about 35%.

Figure 5 illustrates this first point. The top panel plots the welfare measure, λ, against GDP per person. What emerges prominently is that the two measures are highly correlated, with a correlation coefficient (for the logs) of 0.98. Thus per capita GDP is a good proxy for welfare under our assumptions. At the same time, there are clear departures from the 45° line. In particular, many countries with very low GDP per capita exhibit even lower welfare. As a result, welfare is more dispersed (standard deviation of 1.51 in logs) than is income (standard deviation of 1.27 in logs).

The bottom panel provides a closer look at the deviations. This figure plots the ratio of welfare to per capita GDP across countries. The European countries have welfare measures 22% higher than their incomes. The remaining countries, in contrast, have welfare levels that are typically 25–50% below their incomes. The way to reconcile these large deviations with the high correlation between welfare and income is that the “scales” are so different. Incomes vary by more than a factor of 64 in our sample, ie. 6,300%, whereas the deviations are on the order of 25–50%.

“The Unfavorable Economics of Measuring the Returns to Advertising”, Lewis & Rao 2015

2015-lewis.pdf: “The Unfavorable Economics of Measuring the Returns to Advertising”, Randall A. Lewis, Justin M. Rao (2015-07-06; backlinks; similar):

25 large field experiments with major U.S. retailers and brokerages, most reaching millions of customers and collectively representing $2.80 million in digital advertising expenditure (2015 dollars; about $3.53 million inflation-adjusted), reveal that measuring the returns to advertising is difficult.

The median confidence interval on return on investment is over 100 percentage points wide. Detailed sales data show that relative to the per capita cost of the advertising, individual-level sales are very volatile; a coefficient of variation of 10 is common. Hence, informative advertising experiments can easily require more than 10 million person-weeks, making experiments costly and potentially infeasible for many firms.

Despite these unfavorable economics, randomized control trials represent progress by injecting new, unbiased information into the market. The inference challenges revealed in the field experiments also show that selection bias, due to the targeted nature of advertising, is a crippling concern for widely employed observational methods.
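
To see why a coefficient of variation of 10 pushes experiments toward tens of millions of observations, here is a rough power calculation. It is a sketch using the standard two-sample normal approximation, not the paper's own calculation. The 1% sales lift, the 5% significance level, and the 80% power target are assumptions chosen for illustration, and `n_per_arm` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def n_per_arm(cv: float, lift: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Observations needed per arm to detect a relative lift in mean sales.

    cv   : coefficient of variation of individual sales (sd / mean)
    lift : true effect as a fraction of mean sales (assumed, e.g. 0.01 = 1%)
    Uses n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sd / effect)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (cv / lift) ** 2)

# The abstract calls a coefficient of variation of 10 "common";
# the 1% lift is a hypothetical effect size.
n = n_per_arm(cv=10, lift=0.01)
```

Under these assumptions the required sample is on the order of 10^7 observations per arm, in line with the abstract's "more than 10 million person-weeks". Halving the assumed lift quadruples the requirement, because the sample size scales with the inverse square of the effect.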

“Mendelian Randomization With Invalid Instruments: Effect Estimation and Bias Detection through Egger Regression (MR-Egger)”, Bowden et al 2015

“Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression (MR-Egger)”, Jack Bowden, George Davey Smith, Stephen Burgess (2015-06-06; backlinks; similar):

  • Mendelian randomization analyses using multiple genetic variants can be viewed as a meta-analysis of the causal estimates from each variant.
  • If the genetic variants have pleiotropic effects on the outcome, these causal estimates will be biased.
  • Funnel plots offer a simple way to detect directional pleiotropy; that is, whether causal estimates from weaker variants tend to be skewed in one direction.
  • Under a weaker set of assumptions than typically used in Mendelian randomization, an adaption of Egger regression (MR-Egger) can be used to detect and correct for the bias due to directional pleiotropy.

Background: The number of Mendelian randomization analyses including large numbers of genetic variants is rapidly increasing. This is due to the proliferation of genome-wide association studies⁠, and the desire to obtain more precise estimates of causal effects. However, some genetic variants may not be valid instrumental variables, in particular due to them having more than one proximal phenotypic correlate (pleiotropy).

Methods: We view Mendelian randomization with multiple instruments as a meta-analysis, and show that bias caused by pleiotropy can be regarded as analogous to small study bias. Causal estimates using each instrument can be displayed visually by a funnel plot to assess potential asymmetry. Egger regression⁠, a tool to detect small study bias in meta-analysis, can be adapted to test for bias from pleiotropy, and the slope coefficient from Egger regression provides an estimate of the causal effect. Under the assumption that the association of each genetic variant with the exposure is independent of the pleiotropic effect of the variant (not via the exposure), Egger’s test gives a valid test of the null causal hypothesis and a consistent causal effect estimate even when all the genetic variants are invalid instrumental variables⁠.

Results: We illustrate the use of this approach by re-analysing 2 published Mendelian randomization studies of the causal effect of height on lung function, and the causal effect of blood pressure on coronary artery disease risk. The conservative nature of this approach is illustrated with these examples.

Conclusions: An adaption of Egger regression (which we call MR-Egger) can detect some violations of the standard instrumental variable assumptions, and provide an effect estimate which is not subject to these violations. The approach provides a sensitivity analysis for the robustness of the findings from a Mendelian randomization investigation.

[Keywords: Mendelian randomization⁠, invalid instruments, meta-analysis⁠, pleiotropy⁠, small study bias, MR-Egger test]
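
The regression idea in the Methods above can be sketched on summary statistics. This is a minimal, noise-free illustration rather than the authors' implementation. The per-variant associations are hypothetical, every variant is given the same direct (pleiotropic) effect on the outcome, and the variants are assumed to be coded so that their exposure associations are positive. `weighted_fit` is an assumed helper name.

```python
import numpy as np

def weighted_fit(bx, by, se_by, intercept):
    """Weighted least squares of by on bx, with weights 1 / se_by**2."""
    w = 1.0 / se_by**2
    cols = [np.ones_like(bx), bx] if intercept else [bx]
    X = np.column_stack(cols)
    # Solve the weighted normal equations (X' W X) coef = X' W y.
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ by)

# Hypothetical summary statistics for six variants.
theta_true = 0.5      # assumed causal effect of exposure on outcome
alpha_pleio = 0.02    # assumed directional pleiotropy, same for all variants
bx = np.array([0.05, 0.08, 0.10, 0.12, 0.15, 0.20])  # variant-exposure
by = alpha_pleio + theta_true * bx                    # variant-outcome
se_by = np.full_like(bx, 0.01)

# Inverse-variance weighted estimate: regression through the origin.
ivw_slope = weighted_fit(bx, by, se_by, intercept=False)[0]

# MR-Egger: same regression with an intercept; the intercept absorbs the
# average pleiotropic effect and the slope estimates the causal effect.
egger_intercept, egger_slope = weighted_fit(bx, by, se_by, intercept=True)
```

In this constructed example the through-the-origin estimate is pulled above the true effect by the pleiotropy, while the Egger slope recovers it and the intercept recovers the pleiotropic term. With real, noisy data the Egger estimate is typically much less precise, which is the conservatism the abstract mentions.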

“Bounding a Linear Causal Effect Using Relative Correlation Restrictions”, Krauth 2015

2015-krauth.pdf: “Bounding a Linear Causal Effect Using Relative Correlation Restrictions”⁠, Brian Krauth (2015-06-04; backlinks; similar):

This paper describes and implements a simple partial solution to the most common problem in applied microeconometrics: estimating a linear causal effect with a potentially endogenous explanatory variable and no suitable instrumental variables. Empirical researchers faced with this situation can either assume away the endogeneity or accept that the effect of interest is not identified. This paper describes a middle ground in which the researcher assumes plausible but nontrivial restrictions on the correlation between the variable of interest and relevant unobserved variables relative to the correlation between the variable of interest and observed control variables. Given such relative correlation restrictions, the researcher can then estimate informative bounds on the effect and assess the sensitivity of conventional estimates to plausible deviations from exogeneity. Two empirical applications demonstrate the potential usefulness of this method for both experimental and observational data.

“Everything Is Correlated”, Branwen 2014

Everything: “Everything Is Correlated”, Gwern Branwen (2014-09-12; backlinks; similar):

Anthology of sociological, statistical, or psychological papers discussing the observation that all real-world variables have non-zero correlations and the implications for statistical theory such as ‘null hypothesis testing’.

Statistical folklore asserts that “everything is correlated”: in any real-world dataset, most or all measured variables will have non-zero correlations, even between variables which appear to be completely independent of each other, and that these correlations are not merely sampling error flukes but will appear in large-scale datasets to arbitrarily designated levels of statistical-significance or posterior probability.

This raises serious questions for null-hypothesis statistical-significance testing, as it implies the null hypothesis of 0 will always be rejected with sufficient data, meaning that a failure to reject only implies insufficient data, and provides no actual test or confirmation of a theory. Even a directional prediction is minimally confirmatory since there is a 50% chance of picking the right direction at random.

It also has implications for conceptualizations of theories & causal models, interpretations of structural models, and other statistical principles such as the “sparsity principle”.
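
The claim that a null hypothesis of exactly 0 will always be rejected with sufficient data can be illustrated numerically. The sketch below uses the Fisher z-approximation for a correlation test. The observed correlation of 0.01 and the sample sizes are hypothetical, and `two_sided_p` is an assumed helper name.

```python
import math

def two_sided_p(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0 given sample correlation r.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is roughly
    standard normal under the null hypothesis.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc keeps precision for very small tail probabilities.
    return math.erfc(abs(z) / math.sqrt(2))

# The same tiny correlation at increasing sample sizes.
p_values = {n: two_sided_p(0.01, n) for n in (100, 10_000, 1_000_000)}
```

The correlation is identical in all three cases; only the sample size changes, and with it the verdict: far from statistically-significant at n = 100, overwhelmingly so at n = 1,000,000.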

“Statistical Notes”, Branwen 2014

Statistical-notes: “Statistical Notes”, Gwern Branwen (2014-07-17; backlinks; similar):

Miscellaneous statistical stuff

Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.

“Why Correlation Usually ≠ Causation”, Branwen 2014

Causality: “Why Correlation Usually ≠ Causation”, Gwern Branwen (2014-06-24; backlinks; similar):

Correlations are oft interpreted as evidence for causation; this is oft falsified; do causal graphs explain why this is so common, because the number of possible indirect paths greatly exceeds the direct paths necessary for useful manipulation?

It is widely understood that statistical correlation between two variables ≠ causation. Despite this admonition, people are overconfident in claiming correlations to support favored causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimate the prevalence of confounds / common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in realistic DAGs but toy models, the imbalance also explains overconfidence.
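
The imbalance between direct causal links and correlated pairs can be seen in even a tiny graph. The sketch below uses a hypothetical five-node DAG and the standard rule that two variables are marginally dependent (absent exact cancellations) when they share a common ancestor, counting a node as its own ancestor. It illustrates the speculation; it does not prove it.

```python
from itertools import combinations

# Hypothetical toy DAG, written as parent -> list of children.
dag = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

def ancestors_incl_self(dag, node):
    """Return the node together with all of its ancestors."""
    parents = {n: [p for p, kids in dag.items() if n in kids] for n in dag}
    seen, stack = {node}, [node]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

anc = {n: ancestors_incl_self(dag, n) for n in dag}

# Direct causal relationships: one per arrow.
n_edges = sum(len(kids) for kids in dag.values())

# Correlated pairs: any two nodes with a common ancestor.
correlated_pairs = [(x, y) for x, y in combinations(dag, 2) if anc[x] & anc[y]]
```

Here 4 arrows produce 10 correlated pairs, so most observed correlations (for example between D and E) correspond to no direct causal link in either direction.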

“Lizardman Constant in Surveys”, Branwen 2013

Lizardman-constant: “Lizardman Constant in Surveys”, Gwern Branwen (2013-04-12; backlinks; similar):

A small fraction of human responses will always be garbage because we are lazy, bored, trolling, or crazy.

Researchers have demonstrated repeatedly in human surveys the stylized fact that, far from being an oracle or gold standard, a certain small percentage of human responses will reliably be bullshit: “jokester” or “mischievous responders”, or more memorably, “lizardman constant” responders—respondents who give the wrong answer to simple questions.

Below a certain percentage of responses, for sufficiently rare responses, much or all of responding humans may be lying, lazy, crazy, or maliciously responding and the responses are false. This systematic error seriously undermines attempts to study rare beliefs such as conspiracy theories, and puts bounds on how accurate any single survey can hope to be.
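
A back-of-the-envelope calculation shows how quickly garbage responses swamp a rare belief. All numbers below are hypothetical, and the model is a deliberate simplification: sincere respondents answer truthfully, while garbage responders say "yes" with a fixed probability regardless of what they believe.

```python
def observed_yes_rate(true_rate, garbage_share, garbage_yes_prob):
    """Return (observed 'yes' rate, the part of it that is genuine)."""
    genuine_yes = true_rate * (1 - garbage_share)
    garbage_yes = garbage_share * garbage_yes_prob
    return genuine_yes + garbage_yes, genuine_yes

# Hypothetical survey: 1% truly hold the belief, 4% of respondents answer
# garbage, and garbage responders pick "yes" half the time.
observed, genuine = observed_yes_rate(
    true_rate=0.01, garbage_share=0.04, garbage_yes_prob=0.5
)
genuine_fraction = genuine / observed
```

Under these assumptions the survey reports nearly 3% agreement, of which only about a third is sincere, so the headline figure says little about the true prevalence.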

“Observational Studies Often Make Clinical Practice Recommendations: an Empirical Evaluation of Authors' Attitudes”, Prasad et al 2013

2013-prasad.pdf: “Observational studies often make clinical practice recommendations: an empirical evaluation of authors' attitudes”⁠, Vinay Prasad, Joel Jorgenson, John P. A. Ioannidis, Adam Cifu (2013-01-01; backlinks)

“‘Story Of Your Life’ Is Not A Time-Travel Story”, Branwen 2012

Story-Of-Your-Life: “‘Story Of Your Life’ Is Not A Time-Travel Story”, Gwern Branwen (2012-12-12; backlinks; similar):

Famous Ted Chiang SF short story ‘Story Of Your Life’ is usually misinterpreted as, like the movie version Arrival, being about time-travel/​precognition; I explain it is instead an exploration of xenopsychology and a psychology of timeless physics.

One of Ted Chiang’s most noted philosophical SF short stories, “Story of Your Life”, was made into a successful time-travel movie, Arrival, sparking interest in the original. However, movie viewers often misread the short story: “Story” is not a time-travel story. At no point does the protagonist travel in time or enjoy precognitive powers; interpreting the story this way leads to many serious plot holes, it renders most of the exposition-heavy dialogue (which is a large fraction of the wordcount) completely irrelevant, and genuine precognition undercuts the themes of tragedy & acceptance.

Instead, what appears to be precognition in Chiang’s story is actually far more interesting, and a novel twist on psychology and physics: classical physics allows usefully interpreting the laws of physics both in a ‘forward’ way, in which events happen step by step, and in a teleological way, in which events are simply the unique optimal solution to a set of constraints including the outcome, which allows reasoning ‘backwards’. The alien race exemplifies this other, equally valid, possible way of thinking and viewing the universe, and the protagonist learns their way of thinking by studying their language, which requires seeing written characters as a unified gestalt. This holistic view of the universe as an immutable ‘block-universe’, in which events unfold as they must, changes the protagonist’s attitude towards life and the tragic death of her daughter, teaching her in a somewhat Buddhist or Stoic fashion to embrace life in both its ups and downs.

“The Iron Law Of Evaluation And Other Metallic Rules”, Rossi 2012

1987-rossi: “The Iron Law Of Evaluation And Other Metallic Rules”, Peter H. Rossi (2012-09-18; backlinks; similar):

Problems with social experiments and evaluating them, loopholes, causes, and suggestions; non-experimental methods systematically deliver false results, as most interventions fail or have small effects.

“The Iron Law Of Evaluation And Other Metallic Rules” is a classic review paper by American “sociologist Peter Rossi, a dedicated progressive and the nation’s leading expert on social program evaluation from the 1960s through the 1980s”; it discusses the difficulties of creating a useful social program, and proposes some aphoristic summary rules, including most famously:

  • The Iron Law: “The expected value of any net impact assessment of any large scale social program is zero”
  • The Stainless Steel Law: “the better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.”

It expands an earlier paper by Rossi (“Issues in the evaluation of human services delivery”⁠, Rossi 1978), where he coined the first, “Iron Law”.

I provide an annotated HTML version with fulltext for all references, as well as a bibliography collating many negative results in social experiments I’ve found since Rossi’s paper was published (see also the closely-related Replication Crisis).

“Correlation and Causation in the Study of Personality”, Lee 2012

2012-lee.pdf: “Correlation and Causation in the Study of Personality”, James Jung-Hun Lee (2012-07-26; backlinks; similar):

Personality psychology aims to explain the causes and the consequences of variation in behavioural traits. Because of the observational nature of the pertinent data, this endeavour has provoked many controversies. In recent years, the computer scientist Judea Pearl has used a graphical approach to extend the innovations in causal inference developed by Ronald Fisher and Sewall Wright. Besides shedding much light on the philosophical notion of causality itself, this graphical framework now contains many powerful concepts of relevance to the controversies just mentioned. In this article, some of these concepts are applied to areas of personality research where questions of causation arise, including the analysis of observational data and the genetic sources of individual differences.

“One Man’s Modus Ponens”, Branwen 2012

Modus: “One Man’s Modus Ponens”, Gwern Branwen (2012-05-01; backlinks; similar):

One man’s modus ponens is another man’s modus tollens is a saying in Western philosophy encapsulating a common response to a logical proof which generalizes the reductio ad absurdum and consists of rejecting a premise based on an implied conclusion. I explain it in more detail, provide examples, and a Bayesian gloss.

A logically-valid argument which takes the form of a modus ponens may be interpreted in several ways; a major one is to interpret it as a kind of reductio ad absurdum, where by ‘proving’ a conclusion believed to be false, one might instead take it as a modus tollens which proves that one of the premises is false. This “Moorean shift” is aphorized as the snowclone⁠, “One man’s modus ponens is another man’s modus tollens”.

The Moorean shift is a powerful counter-argument which has been deployed against many skeptical & metaphysical claims in philosophy, where often the conclusion is extremely unlikely and little evidence can be provided for the premises used in the proofs; and it is relevant to many other debates, particularly methodological ones.

“Uniform Random Generation of Large Acyclic Digraphs”, Kuipers & Moffa 2012

“Uniform random generation of large acyclic digraphs”⁠, Jack Kuipers, Giusi Moffa (2012-02-29; backlinks; similar):

Directed acyclic graphs are the basic representation of the structure underlying Bayesian networks, which represent multivariate probability distributions. In many practical applications, such as the reverse engineering of gene regulatory networks, not only the estimation of model parameters but the reconstruction of the structure itself is of great interest. As well as for the assessment of different structure learning algorithms in simulation studies, a uniform sample from the space of directed acyclic graphs is required to evaluate the prevalence of certain structural features. Here we analyse how to sample acyclic digraphs uniformly at random through recursive enumeration, an approach previously thought too computationally involved. Based on complexity considerations, we discuss in particular how the enumeration directly provides an exact method, which avoids the convergence issues of the alternative Markov chain methods and is actually computationally much faster. The limiting behaviour of the distribution of acyclic digraphs then allows us to sample arbitrarily large graphs. Building on the ideas of recursive enumeration based sampling we also introduce a novel hybrid Markov chain with much faster convergence than current alternatives while still being easy to adapt to various restrictions. Finally we discuss how to include such restrictions in the combinatorial enumeration and the new hybrid Markov chain method for efficient uniform sampling of the corresponding graphs.
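
Recursive enumeration starts from counting how many labelled DAGs exist on n nodes. As a minimal sketch of that counting step only (the classic inclusion-exclusion recurrence over the k nodes with no incoming arcs, usually attributed to Robinson), and not the authors' sampler:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labelled directed acyclic graphs on n nodes.

    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k),
    where k indexes a chosen set of nodes with no incoming arcs.
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

counts = [count_dags(n) for n in range(6)]
# counts == [1, 1, 3, 25, 543, 29281]
```

The counts grow super-exponentially, which is why naive enumeration of all graphs is hopeless and why a sampler built on the recurrence itself is attractive.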

“Does Retail Advertising Work? Measuring the Effects of Advertising on Sales Via a Controlled Experiment on Yahoo!”, Lewis & Reiley 2011

“Does Retail Advertising Work? Measuring the Effects of Advertising on Sales Via a Controlled Experiment on Yahoo!”, Randall A. Lewis, David H. Reiley (2011-06-08; backlinks; similar):

We measure the causal effects of online advertising on sales, using a randomized experiment performed in cooperation between Yahoo! and a major retailer.

After identifying over one million customers matched in the databases of the retailer and Yahoo!, we randomly assign them to treatment and control groups. We analyze individual-level data on ad exposure and weekly purchases at this retailer, both online and in stores.

We find statistically-significant and economically substantial impacts of the advertising on sales. The treatment effect persists for weeks after the end of an advertising campaign, and the total effect on revenues is estimated to be more than seven times the retailer’s expenditure on advertising during the study. Additional results explore differences in the number of advertising impressions delivered to each individual, online and offline sales, and the effects of advertising on those who click the ads versus those who merely view them.

Statistical power calculations show that, due to the high variance of sales, our large number of observations brings us just to the frontier of being able to measure economically substantial effects of advertising.

We also demonstrate that without an experiment, using industry-standard methods based on endogenous cross-sectional variation in advertising exposure, we would have obtained a wildly inaccurate estimate of advertising effectiveness.

“Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising”, Lewis et al 2011

2011-lewis.pdf: “Here, there, and everywhere: correlated online behaviors can lead to overestimates of the effects of advertising”, Randall A. Lewis, Justin M. Rao, David H. Reiley (2011-03; backlinks; similar):

Measuring the causal effects of online advertising (adfx) on user behavior is important to the health of the WWW publishing industry. In this paper, using three controlled experiments, we show that observational data frequently lead to incorrect estimates of adfx. The reason, which we label “activity bias”, comes from the surprising amount of time-based correlation between the myriad activities that users undertake online.

In Experiment 1, users who are exposed to an ad on a given day are much more likely to engage in brand-relevant search queries as compared to their recent history for reasons that had nothing to do with the advertisement. In Experiment 2, we show that activity bias occurs for page views across diverse websites. In Experiment 3, we track account sign-ups at a competitor’s (of the advertiser) website and find that many more people sign-up on the day they saw an advertisement than on other days, but that the true “competitive effect” was minimal.

In all three experiments, exposure to a campaign signals doing “more of everything” in a given period of time, making it difficult to find a suitable “matched control” using prior behavior. In such cases, the “match” is fundamentally different from the exposed group, and we show how and why observational methods lead to a massive overestimate of adfx in such circumstances.

[Keywords: advertising effectiveness, browsing behavior, causal inference, field experiments, selection bias]

“The Replication Crisis: Flaws in Mainstream Science”, Branwen 2010

Replication: “The Replication Crisis: Flaws in Mainstream Science”, Gwern Branwen (2010-10-27; backlinks; similar):

2013 discussion of how systemic biases in science, particularly medicine and psychology, have resulted in a research literature filled with false positives and exaggerated effects, called ‘the Replication Crisis’.

Long-standing problems in standard scientific methodology have exploded as the “Replication Crisis”: the discovery that many results in fields as diverse as psychology, economics, medicine, biology, and sociology are in fact false or quantitatively highly inaccurately measured. I cover here a handful of the issues and publications on this large, important, and rapidly developing topic up to about 2013, at which point the Replication Crisis became too large a topic to cover more than cursorily. (A compilation of some additional links is provided for post-2013 developments.)

The crisis is caused by methods & publishing procedures which interpret random noise as important results, far too small datasets, selective analysis by an analyst trying to reach expected/​desired results, publication bias, poor implementation of existing best-practices, nontrivial levels of research fraud, software errors, philosophical beliefs among researchers that false positives are acceptable, neglect of known confounding like genetics, and skewed incentives (financial & professional) to publish ‘hot’ results.

Thus, any individual piece of research typically establishes little. Scientific validation comes not from small p-values, but from discovering a regular feature of the world which disinterested third parties can discover with straightforward research done independently on new data with new procedures—replication.

“Causal Inference and Developmental Psychology”, Foster 2010

2010-foster.pdf: “Causal Inference and Developmental Psychology”⁠, E. Michael Foster (2010-01-01)

“Causal Inference and Observational Research: The Utility of Twins”, McGue et al 2010

“Causal Inference and Observational Research: The Utility of Twins”⁠, Matt McGue, Merete Osler, Kaare Christensen (2010; ; backlinks; similar):

Valid causal inference is central to progress in theoretical and applied psychology. Although the randomized experiment is widely considered the gold standard for determining whether a given exposure increases the likelihood of some specified outcome, experiments are not always feasible and in some cases can result in biased estimates of causal effects. Alternatively, standard observational approaches are limited by the possibility of confounding, reverse causation⁠, and the nonrandom distribution of exposure (ie. selection).

We describe the counterfactual model of causation and apply it to the challenges of causal inference in observational research, with a particular focus on aging. We argue that the study of twin pairs discordant on exposure, and in particular discordant monozygotic twins, provides a useful analog to the idealized counterfactual design.

A review of discordant-twin studies in aging reveals that they are consistent with, but do not unambiguously establish, a causal effect of lifestyle factors on important late-life outcomes. Nonetheless, the existing studies are few in number and have clear limitations that have not always been considered in interpreting their results.

It is concluded that twin researchers could make greater use of the discordant-twin design as one approach to strengthen causal inferences in observational research.

“Retrospectives Guinnessometrics: The Economic Foundation of “Student’s” t”, Ziliak 2008

2008-ziliak.pdf: “Retrospectives Guinnessometrics: The Economic Foundation of “Student’s” t”⁠, Stephen T. Ziliak (2008-09; ; backlinks; similar):

In economics and other sciences, “statistical-significance” is by custom, habit, and education a necessary and sufficient condition for proving an empirical result (Ziliak and McCloskey, 2008; McCloskey & Ziliak, 1996). The canonical routine is to calculate what’s called a t-statistic and then to compare its estimated value against a theoretically expected value of it, which is found in “Student’s” t table. A result yielding a t-value greater than or equal to about 2.0 is said to be “statistically-significant at the 95% level.” Alternatively, a regression coefficient is said to be “statistically-significantly different from the null, p < 0.05.” Canonically speaking, if a coefficient clears the 95% hurdle, it warrants additional scientific attention. If not, not. The first presentation of “Student’s” test of statistical-significance came a century ago, in “The Probable Error of a Mean” (1908b), published by an anonymous “Student.” The author’s commercial employer required that his identity be shielded from competitors, but we have known for some decades that the article was written by William Sealy Gosset (1876–1937), whose entire career was spent at Guinness’s brewery in Dublin, where Gosset was a master brewer and experimental scientist (E. S. Pearson, 1937). Perhaps surprisingly, the ingenious “Student” did not give a hoot for a single finding of “statistical”-significance, even at the 95% level of statistical-significance as established by his own tables. Beginning in 1904, “Student”, who was a businessman besides a scientist, took an economic approach to the logic of uncertainty, arguing finally that statistical-significance is “nearly valueless” in itself.

“Proceeding From Observed Correlation to Causal Inference: The Use of Natural Experiments”, Rutter 2007

“Proceeding From Observed Correlation to Causal Inference: The Use of Natural Experiments”, Michael Rutter (2007; backlinks; similar):

This article notes 5 reasons why a correlation between a risk (or protective) factor and some specified outcome might not reflect environmental causation. In keeping with numerous other writers, it is noted that a causal effect is usually composed of a constellation of components acting in concert. The study of causation, therefore, will necessarily be informative on only one or more subsets of such components. There is no such thing as a single basic necessary and sufficient cause. Attention is drawn to the need (albeit unobservable) to consider the counterfactual (ie. what would have happened if the individual had not had the supposed risk experience). 15 possible types of natural experiments that may be used to test causal inferences with respect to naturally occurring prior causes (rather than planned interventions) are described. These comprise 5 types of genetically sensitive designs intended to control for possible genetic mediation (as well as dealing with other issues), 6 uses of twin or adoptee strategies to deal with other issues such as selection bias or the contrasts between different environmental risks, 2 designs to deal with selection bias, regression discontinuity designs to take into account unmeasured confounders, and the study of contextual effects. It is concluded that, taken in conjunction, natural experiments can be very helpful in both strengthening and weakening causal inferences.

“Personality and the Prediction of Consequential Outcomes”, Ozer & Benet-Martínez 2006

2006-ozer.pdf: “Personality and the Prediction of Consequential Outcomes”⁠, Daniel J. Ozer, Verónica Benet-Martínez (2006-02-01; ⁠, ; backlinks; similar):

Personality has consequences. Measures of personality have contemporaneous and predictive relations to a variety of important outcomes. Using the Big Five factors as heuristics for organizing the research literature, numerous consequential relations are identified. Personality dispositions are associated with happiness, physical and psychological health, spirituality, and identity at an individual level; associated with the quality of relationships with peers, family, and romantic others at an interpersonal level; and associated with occupational choice, satisfaction, and performance, as well as community involvement, criminal activity, and political ideology at a social institutional level.

[Keywords: individual differences, traits, life outcomes, consequences]

“Contradicted and Initially Stronger Effects in Highly Cited Clinical Research”, Ioannidis 2005

“Contradicted and Initially Stronger Effects in Highly Cited Clinical Research”⁠, John P. A. Ioannidis (2005-07-13; ; backlinks; similar):

Context: Controversy and uncertainty ensue when the results of clinical research on the effectiveness of interventions are subsequently contradicted. Controversies are most prominent when high-impact research is involved.

Objectives: To understand how frequently highly cited studies are contradicted or find effects that are stronger than in other similar studies and to discern whether specific characteristics are associated with such refutation over time.

Design: All original clinical research studies published in 3 major general clinical journals or high-impact-factor specialty journals in 1990–2003 and cited more than 1000 times in the literature were examined.

Main Outcome Measure: The results of highly cited articles were compared against subsequent studies of comparable or larger sample size and similar or better controlled designs. The same analysis was also performed comparatively for matched studies that were not so highly cited.

Results: Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged. Five of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (p = 0.008). Among randomized trials, studies with contradicted or stronger effects were smaller (p = 0.009) than replicated or unchallenged studies although there was no statistically-significant difference in their early or overall citation impact. Matched control studies did not have a statistically-significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results.
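The arithmetic in these results can be rechecked from the counts alone. The sketch below recomputes the four percentages and the randomized-versus-nonrandomized comparison; it assumes a two-sided Fisher exact test, which the abstract does not name, and the function and variable names are illustrative only.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(k):
        # Hypergeometric probability that row 1 holds k of the col1 "events".
        return comb(col1, k) * comb(total - col1, row1 - k) / denom

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum every possible table that is at least as unlikely as the observed one.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Counts from the abstract: 45 studies claimed an effective intervention.
claimed_effective = 45
counts = {"contradicted": 7, "initially stronger": 7,
          "replicated": 20, "unchallenged": 11}
shares = {name: round(100 * n / claimed_effective) for name, n in counts.items()}

# Contradicted or initially stronger: 5 of 6 nonrandomized vs 9 of 39 randomized.
p_design = fisher_exact_two_sided(5, 1, 9, 30)

print(shares)
print(round(p_design, 3))
```

Under that assumption the p-value rounds to the reported 0.008, and the four shares match the reported 16%, 16%, 44%, and 24%.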

Conclusions: Contradiction and initially stronger effects are not unusual in highly cited research of clinical interventions and their outcomes. The extent to which high citations may provoke contradictions and vice versa needs more study. Controversies are most common with highly cited nonrandomized studies, but even the most highly cited randomized trials may be challenged and refuted over time, especially small ones.

“Testing Hypotheses about the Relationship between Cannabis Use and Psychosis”, Degenhardt et al 2003

2002-degenhardt.pdf: “Testing hypotheses about the relationship between cannabis use and psychosis”⁠, Louisa Degenhardt, Wayne Hall, Michael Lynskey (2003-01-01; ⁠, ; backlinks)

“Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs? MDRC Working Papers on Research Methodology”, Bloom et al 2002

“Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs? MDRC Working Papers on Research Methodology”⁠, Howard S. Bloom, Michael Michalopoulos, Carolyn J. Hill, Ying Lei (2002; backlinks; similar):

A study explored which nonexperimental comparison group methods provide the most accurate estimates of the impacts of mandatory welfare-to-work programs and whether the best methods work well enough to substitute for random assignment experiments. Findings were compared for nonexperimental comparison groups and statistical adjustment procedures with those for experimental control groups from a large-sample, six-state random assignment experiment, the National Evaluation of Welfare-to-Work Strategies. The methods were assessed in terms of their ability to estimate program impacts on annual earnings during short-run and medium-run follow-up periods. Findings with respect to the first issue suggested in-state comparison groups perform somewhat better than out-of-state or multi-state ones, especially for medium-run impact estimates; a simple difference of means or ordinary least squares regression can perform as well as or better than more complex methods when used with a local comparison group; impact estimates for out-of-state or multi-state comparison groups are not improved substantially by more complex estimation procedures but are improved somewhat when propensity score methods are used to eliminate comparison groups that are not balanced on their baseline characteristics. Findings with respect to the second issue indicated the best methods did not work well enough to replace random assignment. Statistical analyses are appended.

“How Close Is Close Enough? Testing Nonexperimental Estimates of Impact against Experimental Estimates of Impact With Education Test Scores As Outcomes”, Wilde & Hollister 2002

“How Close Is Close Enough? Testing Nonexperimental Estimates of Impact against Experimental Estimates of Impact with Education Test Scores as Outcomes”⁠, Elizabeth Ty Wilde, Robinson Hollister (2002; backlinks; similar):

In this study we test the performance of some nonexperimental estimators of impacts applied to an educational intervention—reduction in class size—where achievement test scores were the outcome. We compare the nonexperimental estimates of the impacts to “true impact” estimates provided by a random-assignment design used to assess the effects of that intervention. Our primary focus in this study is on a nonexperimental estimator based on a complex procedure called propensity score matching.

We put greatest emphasis on looking at the question of “how close is close enough?” in terms of a decision-maker trying to use the evaluation to determine whether to invest in wider application of the intervention being assessed—in this case, reduction in class size. We illustrate this in terms of a rough cost-benefit framework for small class size as applied to Project Star. We find that in 30 to 45% of the 11 cases, the propensity-score-matching nonexperimental estimators would have led to the “wrong” decision.

“Comparison of Evidence of Treatment Effects in Randomized and Nonrandomized Studies”, Ioannidis et al 2001

2001-ioannidis.pdf: “Comparison of Evidence of Treatment Effects in Randomized and Nonrandomized Studies”⁠, John P. A. Ioannidis, Anna-Bettina Haidich, Maroudia Pappa, Nikos Pantazis, Styliani I. Kokori, Maria G. Tektonidou et al (2001-08-01; backlinks; similar):

Context: There is substantial debate about whether the results of nonrandomized studies are consistent with the results of randomized controlled trials on the same topic.

Objectives: To compare results of randomized and nonrandomized studies that evaluated medical interventions and to examine characteristics that may explain discrepancies between randomized and nonrandomized studies.

Data Sources: MEDLINE (1966–March 2000), the Cochrane Library (Issue 3, 2000), and major journals were searched.

Study Selection: Forty-five diverse topics were identified for which both randomized trials (n = 240) and nonrandomized studies (n = 168) had been performed and had been considered in meta-analyses of binary outcomes.

Data Extraction: Data on events per patient in each study arm and design and characteristics of each study considered in each meta-analysis were extracted and synthesized separately for randomized and nonrandomized studies.

Data Synthesis: Very good correlation was observed between the summary odds ratios of randomized and nonrandomized studies (r = 0.75; p < 0.001); however, nonrandomized studies tended to show larger treatment effects (28 vs 11; p = 0.009). Between-study heterogeneity was frequent among randomized trials alone (23%) and very frequent among nonrandomized studies alone (41%). The summary results of the 2 types of designs differed beyond chance in 7 cases (16%). Discrepancies beyond chance were less common when only prospective studies were considered (8%). Occasional differences in sample size and timing of publication were also noted between discrepant randomized and nonrandomized studies. In 28 cases (62%), the natural logarithm of the odds ratio differed by at least 50%, and in 15 cases (33%), the odds ratio varied at least 2-fold between nonrandomized studies and randomized trials.
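As a rough check on the figures above, the sketch below reads “28 vs 11” as counts of topics in which one design or the other showed the larger effect, and applies a two-sided sign test. Both that reading and the choice of test are assumptions, since the abstract names neither; topics not counted in the 28 or 11 are simply left out. All names are illustrative.

```python
from math import comb

def sign_test_two_sided(wins, losses):
    """Two-sided exact sign test against a 50/50 split (ties excluded)."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

topics = 45  # topics with both randomized and nonrandomized evidence

# Assumed reading: larger effect in nonrandomized studies for 28 topics,
# larger effect in randomized trials for 11 topics.
p_larger = sign_test_two_sided(28, 11)

def pct(n):
    """Percentage of the 45 topics, rounded to a whole number."""
    return round(100 * n / topics)

print(round(p_larger, 3))
print(pct(7), pct(28), pct(15))
```

Under these assumptions the sign test gives a p-value that rounds to the reported 0.009, and the counts of 7, 28, and 15 out of 45 topics reproduce the reported 16%, 62%, and 33%.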

Conclusions: Despite good correlation between randomized trials and nonrandomized studies—in particular, prospective studies—discrepancies beyond chance do occur and differences in estimated magnitude of treatment effect are very common.

“Crosstalk and Specificity in Signalling: Are We Crosstalking Ourselves into General Confusion?”, Dumont et al 2001

2001-dumont.pdf: “Crosstalk and specificity in signalling: Are we crosstalking ourselves into general confusion?”⁠, Jacques E. Dumont, Frédéric Pécasse, Carine Maenhaut (2001-07-01; ; backlinks; similar):

The numerous examples of “crosstalk” between signal transduction pathways reported in the biochemical literature seem to imply a general common response of cells to different stimuli, even when these stimuli act initially on different cascades.

This contradicts our knowledge of the specificity of action of extracellular signals in different cell types.

This discrepancy is explained by the restricted occurrence of crosstalks in any cell type and by several categories of cell specificity mechanisms, for instance, the specific qualitative and quantitative expression of the various subtypes of signal transduction proteins, the combinatorial control of the cascades with specific sets of regulatory factors and the compartmentalization of signal transduction cascades or their elements.

[Keywords: signal transduction pathways, crosstalk, specificity, compartmental, combinatorial, isoforms, kinetics, cell models]

Figure 1: The New Simple View: Everything Does Everything. Cross-signalling between 5 signal transduction cascades as reported in the literature for 2 years. In black, the “textbook” representation of the linear cascades. In color, the cross-signallings: in red, negative controls (ie. inhibitions); in green, positive controls (ie. stimulations).

“Study Design and Estimates of Effectiveness”, MacLehose et al 2000

2000-maclehose.pdf: “Study design and estimates of effectiveness”⁠, R. R. MacLehose, B. C. Reeves, I. M. Harvey, T. A. Sheldon, I. T. Russell, A. M. S. Black (2000-01-01; backlinks)

“Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs”, Dehejia & Wahba 1999

1999-dehejia.pdf: “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs”⁠, Rajeev H. Dehejia, Sadek Wahba (1999-10-01; backlinks; similar):

This article uses propensity score methods to estimate the treatment impact of the National Supported Work (NSW) Demonstration, a labor training program, on post-intervention earnings. We use data from Lalonde’s evaluation of nonexperimental methods that combine the treated units from a randomized evaluation of the NSW with nonexperimental comparison units drawn from survey datasets. We apply propensity score methods to this composite dataset and demonstrate that, relative to the estimators that Lalonde evaluates, propensity score estimates of the treatment impact are much closer to the experimental benchmark estimate.

Propensity score methods assume that the variables associated with assignment to treatment are observed (referred to as ignorable treatment assignment, or selection on observables). Even under this assumption, it is difficult to control for differences between the treatment and comparison groups when they are dissimilar and when there are many pre-intervention variables. The estimated propensity score (the probability of assignment to treatment, conditional on pre-intervention variables) summarizes the pre-intervention variables. This offers a diagnostic on the comparability of the treatment and comparison groups, because one has only to compare the estimated propensity score across the two groups. We discuss several methods (such as stratification and matching) that use the propensity score to estimate the treatment impact. When the range of estimated propensity scores of the treatment and comparison groups overlap, these methods can estimate the treatment impact for the treatment group. A sensitivity analysis shows that our estimates are not sensitive to the specification of the estimated propensity score, but are sensitive to the assumption of selection on observables. We conclude that when the treatment and comparison groups overlap, and when the variables determining assignment to treatment are observed, these methods provide a means to estimate the treatment impact. Even though propensity score methods are not always applicable, they offer a diagnostic on the quality of nonexperimental comparison groups in terms of observable pre-intervention variables.
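A minimal sketch of stratification on an estimated propensity score, using simulated data rather than the NSW data. Everything here is hypothetical: a single discrete pre-intervention covariate `x`, the assignment probabilities, and a true effect of 2.0. With one discrete covariate the propensity score can be estimated as the share treated within each covariate cell, which avoids fitting a logistic model.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(0)

# Hypothetical setup: x drives both treatment assignment and the outcome,
# so selection is on an observable; the true treatment effect is 2.0.
TRUE_EFFECT = 2.0
ASSIGN_PROB = {0: 0.2, 1: 0.5, 2: 0.8}

units = []
for _ in range(20000):
    x = random.choice([0, 1, 2])
    t = 1 if random.random() < ASSIGN_PROB[x] else 0
    y = TRUE_EFFECT * t + 3.0 * x + random.gauss(0, 1)
    units.append((x, t, y))

# Naive difference of means, ignoring the covariate (confounded by x).
naive = (mean(y for _, t, y in units if t)
         - mean(y for _, t, y in units if not t))

# Estimated propensity score: observed share treated among units with the same x.
by_x = defaultdict(list)
for x, t, _ in units:
    by_x[x].append(t)
score = {x: mean(ts) for x, ts in by_x.items()}

# Stratify on the estimated score; keep only strata where both groups appear.
strata = defaultdict(lambda: {0: [], 1: []})
for x, t, y in units:
    strata[score[x]][t].append(y)
overlap = {s: g for s, g in strata.items() if g[0] and g[1]}

# Effect for the treated: within-stratum differences weighted by treated counts.
n_treated = sum(len(g[1]) for g in overlap.values())
att = sum(len(g[1]) * (mean(g[1]) - mean(g[0]))
          for g in overlap.values()) / n_treated

print(f"naive: {naive:.2f}  stratified: {att:.2f}")
```

In this simulation the naive comparison is pulled well above 2.0 because treated units have higher `x`, while the stratified estimate should land close to the true effect. That holds only because `x` is observed and the groups overlap in every stratum, the two conditions the abstract singles out.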

“Interpreting the Evidence: Choosing between Randomised and Non-randomised Studies”, McKee et al 1999

“Interpreting the evidence: choosing between randomised and non-randomised studies”⁠, Martin McKee, Annie Britton, Nick Black, Klim McPherson, Colin Sanderson, Chris Bain (1999-07-31; backlinks; similar):

Evaluations of healthcare interventions can either randomise subjects to comparison groups, or not. In both designs there are potential threats to validity, which can be external (the extent to which they are generalisable to all potential recipients) or internal (whether differences in observed effects can be attributed to differences in the intervention). Randomisation should ensure that comparison groups of sufficient size differ only in their exposure to the intervention concerned. However, some investigators have argued that randomised controlled trials (RCTs) tend to exclude, consciously or otherwise, some types of patient to whom results will subsequently be applied. Furthermore, in unblinded trials the outcome of treatment may be influenced by practitioners’ and patients’ preferences for one or other intervention. Though non-randomised studies are less selective in terms of recruitment, they are subject to selection bias in allocation if treatment is related to initial prognosis.

Summary points:

  • Treatment effects obtained from randomised and non-randomised studies may differ, but one method does not give a consistently greater effect than the other
  • Treatment effects measured in each type of study best approximate each other when the exclusion criteria are the same and where potential prognostic factors are well understood and controlled for in the non-randomised studies
  • Subjects excluded from randomised controlled trials tend to have a worse prognosis than those included, and this limits generalisability
  • Subjects participating in randomised controlled trials evaluating treatment of existing conditions tend to be less affluent, educated, and healthy than those who do not; the opposite is true for trials of preventive interventions

“Superadditive Correlation”, Giraud et al 1999

1999-giraud.pdf: “Superadditive Correlation”⁠, B. G. Giraud, John M. Heumann, Alan S. Lapedes (1999-01-01)

“Spurious Precision? Meta-analysis of Observational Studies”, Egger et al 1998

“Spurious precision? Meta-analysis of observational studies”⁠, M. Egger, M. Schneider, G. Davey Smith (1998-01-10; backlinks; similar):

In previous articles we have focused on the potentials, principles, and pitfalls of meta-analysis of randomised controlled trials. Meta-analysis of observational data is, however, also becoming common. In a MEDLINE search we identified 566 articles (excluding those published as letters) published in 1995 and indexed with the medical subject heading (MeSH) term “meta-analysis.” We randomly selected 100 of these articles and examined them further. Sixty articles reported on actual meta-analyses, and 40 were methodological papers, editorials, and traditional reviews (1). Among the meta-analyses, about half were based on observational studies, mainly cohort and case-control studies of medical interventions or aetiological associations.

Summary points:

  • Meta-analysis of observational studies is as common as meta-analysis of controlled trials
  • Confounding and selection bias often distort the findings from observational studies
  • There is a danger that meta-analyses of observational data produce very precise but equally spurious results
  • The statistical combination of data should therefore not be a prominent component of reviews of observational studies
  • More is gained by carefully examining possible sources of heterogeneity between the results from observational studies
  • Reviews of any type of research and data should use a systematic approach, which is documented in a materials and methods section

“Choosing Between Randomised and Non-randomised Studies”, Britton et al 1998

1998-britton.pdf: “Choosing Between Randomised and Non-randomised Studies”⁠, A. Britton, M. McKee, N. Black, K. McPherson, C. Sanderson, C. Bain (1998-01-01; backlinks)

“There Is a Time and a Place for Significance Testing”, Mulaik et al 1997

1997-muzaik.pdf: “There Is a Time and a Place for Significance Testing”⁠, Stanley A. Mulaik, Nambury S. Raju, Richard A. Harshman (1997-01-01; backlinks)

“Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods”, Friedlander & Robins 1995

1995-friedlander.pdf: “Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods”⁠, Daniel Friedlander, Philip K. Robins (1995-09-01; backlinks)

“The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis”, Lipsey & Wilson 1993

1993-lipsey.pdf: “The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis”, Mark W. Lipsey, David B. Wilson (1993-12-01; backlinks; similar):

Conventional reviews of research on the efficacy of psychological, educational, and behavioral treatments often find considerable variation in outcome among studies and, as a consequence, fail to reach firm conclusions about the overall effectiveness of the interventions in question. In contrast, meta-analysis reviews show a strong, dramatic pattern of positive overall effects that cannot readily be explained as artifacts of meta-analytic technique or generalized placebo effects. Moreover, the effects are not so small that they can be dismissed as lacking practical or clinical-significance. Although meta-analysis has limitations, there are good reasons to believe that its results are more credible than those of conventional reviews and to conclude that well-developed psychological, educational, and behavioral treatment is generally efficacious.

“Two and One-Half Decades of Leadership in Measurement and Evaluation”, Thompson 1992

1992-thompson.pdf: “Two and One-Half Decades of Leadership in Measurement and Evaluation”⁠, Bruce Thompson (1992-01-01; ; backlinks)

“Smoking As 'independent' Risk Factor for Suicide: Illustration of an Artifact from Observational Epidemiology?”, Smith et al 1992

1992-smith.pdf: “Smoking as 'independent' risk factor for suicide: illustration of an artifact from observational epidemiology?”⁠, George Davey Smith, Andrew N. Phillips, James D. Neaton (1992-01-01; ; backlinks)

“Bias in Relative Odds Estimation owing to Imprecise Measurement of Correlated Exposures”, Phillips & Smith 1992

1992-phillips.pdf: “Bias in relative odds estimation owing to imprecise measurement of correlated exposures”⁠, Andrew N. Phillips, George Davey Smith (1992-01-01; ; backlinks)

“How Independent Are 'independent' Effects? Relative Risk Estimation When Correlated Exposures Are Measured Imprecisely”, Phillips & Smith 1991

1991-phillips.pdf: “How independent are 'independent' effects? Relative risk estimation when correlated exposures are measured imprecisely”⁠, Andrew N. Phillips, George Davey Smith (1991-01-01; ; backlinks)

“Developing Improved Observational Methods for Evaluating Therapeutic Effectiveness”, Horwitz et al 1990

1990-horwitz.pdf: “Developing improved observational methods for evaluating therapeutic effectiveness”⁠, Ralph I. Horwitz, Catherine M. Viscoli, John D. Clemens, Robert T. Sadock (1990-11; backlinks; similar):

Therapeutic efficacy is often studied with observational surveys of patients whose treatments were selected non-experimentally. The results of these surveys are distrusted because of the fear that biased results occur in the absence of experimental principles, particularly randomization. The purpose of the current study was to develop and validate improved observational study designs by incorporating many of the design principles and patient assembly procedures of the randomized trial. The specific topic investigated was the prophylactic effectiveness of β-blocker therapy after an acute myocardial infarction.

To accomplish the research objective, three sets of data were compared. First, we developed a restricted cohort based on the eligibility criteria of the randomized clinical trial; second, we assembled an expanded cohort using the same design principles except for not restricting patient eligibility; and third, we used the data from the Beta Blocker Heart Attack Trial (BHAT), whose results served as the gold standard for comparison.

In this research, the treatment difference in death rates for the restricted cohort and the BHAT trial was nearly identical. In contrast, the expanded cohort had a larger treatment difference than was observed in the BHAT trial. We also noted the important and largely neglected role that eligibility criteria may play in ensuring the validity of treatment comparisons and study outcomes. The new methodological strategies we developed may improve the quality of observational studies and may be useful in assessing the efficacy of the many medical/​surgical therapies that cannot be tested with randomized clinical trials.

“Memories of the British Streptomycin Trial in Tuberculosis: The First Randomized Trial”, Hill 1990

1990-hill.pdf: “Memories of the British streptomycin trial in tuberculosis: The First Randomized Trial”⁠, Austin Bradford Hill (1990-01-01)

“The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs”, Fraker & Maynard 1987

1987-fraker.pdf: “The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs”⁠, Thomas Fraker, Rebecca Maynard (1987; ; backlinks; similar):

This study investigates empirically the strengths and limitations of using experimental versus nonexperimental designs for evaluating employment and training programs. The assessment involves comparing results from an experimental-design study (the National Supported Work Demonstration) with the estimated impacts of Supported Work based on analyses using comparison groups constructed from the Current Population Surveys.

The results indicate that nonexperimental designs cannot be relied on to estimate the effectiveness of employment programs. Impact estimates tend to be sensitive both to the comparison group construction methodology and to the analytic model used. There is currently no way a priori to ensure that the results of comparison group studies will be valid indicators of the program impacts.

[Keywords: public assistance programs, analytical models, analytical estimating, employment, control groups, estimation methods, random sampling, human resources, public works legislation, statistical-significance]

“Back to Spearman?”, Tyler 1986

1986-tyler.pdf: “Back to Spearman?”⁠, Leona E. Tyler (1986-01-01; ; backlinks)

“The Role of General Ability in Prediction”, Thorndike 1986

1986-thorndike.pdf: “The role of general ability in prediction”⁠, Robert L. Thorndike (1986-01-01; ; backlinks)

“Comments on the g Factor in Employment Testing”, Linn 1986

1986-linn.pdf: “Comments on the g factor in employment testing”⁠, Robert L. Linn (1986-01-01; ; backlinks)

“g: Artifact or Reality?”, Jensen 1986

1986-jensen.pdf: “g: Artifact or Reality?”⁠, Arthur R. Jensen (1986-01-01; ; backlinks)

“Cognitive Ability, Cognitive Aptitudes, Job Knowledge, and Job Performance”, Hunter 1986

1986-hunter.pdf: “Cognitive ability, cognitive aptitudes, job knowledge, and job performance”⁠, John E. Hunter (1986-01-01; ; backlinks)

“Commentary [on 'The g Factor in Employment Special Issue']”, Humphreys 1986

1986-humphreys.pdf: “Commentary [on 'The g factor in employment special issue']”, Lloyd G. Humphreys (1986-01-01; ; backlinks)

“Real World Implications of g”, Hawk 1986

1986-hawk.pdf: “Real world implications of g”⁠, John Hawk (1986-01-01; ; backlinks)

“The g Factor in Employment”, Gottfredson 1986

1986-gottfredson.pdf: “The g factor in employment”, Linda S. Gottfredson (1986-01-01; ; backlinks)

“Societal Consequences of the g Factor in Employment”, Gottfredson 1986c

1986-gottfredson-3.pdf: “Societal consequences of the g factor in employment”⁠, Linda S. Gottfredson (1986-01-01; ; backlinks)

“Validity versus Utility of Mental Tests: Example of the SAT”, Gottfredson & Crouse 1986b

1986-gottfredson-2.pdf: “Validity versus utility of mental tests: Example of the SAT”⁠, Linda S. Gottfredson, James Crouse (1986-01-01; ; backlinks)

“Origins of and Reactions to the PTC Conference on 'The g Factor In Employment Testing'”, Avery 1986

1986-avery.pdf: “Origins of and Reactions to the PTC conference on 'The g Factor In Employment Testing'”, Lillian Markos Avery (1986-01-01; ; backlinks)

“General Ability in Employment: A Discussion”, Arvey 1986

1986-arvey.pdf: “General ability in employment: A discussion”⁠, Richard D. Arvey (1986-01-01; ; backlinks)

“Why Do We Need Some Large, Simple Randomized Trials?”, Yusuf et al 1984

1984-yusuf.pdf: “Why do we need some large, simple randomized trials?”⁠, Salim Yusuf, Rory Collins, Richard Peto (1984-01-01)

“Essence of Statistics (Second Edition)”, Loftus & Loftus 1982

1982-loftus-essenceofstatistics.pdf: “Essence of Statistics (Second Edition)”⁠, Geoffrey R. Loftus, Elizabeth F. Loftus (1982-01-01; backlinks)

“Theory Confirmation in Psychology”, Swoyer & Monson 1975

1975-swoyer.pdf: “Theory Confirmation in Psychology”⁠, Chris Swoyer, Thomas C. Monson (1975-01-01; backlinks)

“On the Alleged Falsity of the Null Hypothesis”, Oakes 1975

1975-oakes.pdf: “On the alleged falsity of the null hypothesis”⁠, William F. Oakes (1975-01-01; backlinks)

“On Prior Probabilities of Rejecting Statistical Hypotheses”, Keuth 1973

1973-keuth.pdf: “On Prior Probabilities of Rejecting Statistical Hypotheses”⁠, Herbert Keuth (1973-01-01; backlinks)

“How We *All* Failed In Performance Contracting”, Page 1972

1972-page.pdf: “How We *All* Failed In Performance Contracting”⁠, Ellis B. Page (1972-01-01; ; backlinks)

“Heredity, Environment, and School Achievement”, Nichols 1968

1968-nichols.pdf: “Heredity, Environment, and School Achievement”⁠, Robert C. Nichols (1968-01-01; ; backlinks)

“Use and Abuse of Regression”, Box 1966

1966-box.pdf: “Use and Abuse of Regression”⁠, George E. P. Box (1966-01-01)

“Distributions of Correlation Coefficients in Economic Time Series”, Ames & Reiter 1961

1961-ames.pdf: “Distributions of Correlation Coefficients in Economic Time Series”⁠, Edward Ames, Stanley Reiter (1961; ; backlinks; similar):

This paper presents results, mainly in tabular form, of a sampling experiment in which 100 economic time series 25 years long were drawn at random from the Historical Statistics for the United States. Sampling distributions of coefficients of correlation and autocorrelation were computed using these series, and their logarithms, with and without correction for linear trend.

We find that the frequency distribution of autocorrelation coefficients has the following properties:

  1. It is roughly invariant under logarithmic transformation of data.
  2. It is approximated by a Pearson Type XII function.
  3. It approaches a rectangular distribution symmetric about 0 as the lag increases.

The autocorrelation properties observed are not to be explained by linear trends alone. Correlations and lagged cross-correlations are quite high for all classes of data. E.g., given a randomly selected series, it is possible to find, by random drawing, another series which explains at least 50% of the variance of the first one, in from 2 to 6 random trials, depending on the class of data involved. The sampling distributions obtained provide a basis for tests of statistical-significance of correlations of economic time series. We also find that our economic series are well described by exact linear difference equations of low order.

“The Fallacy Of The Null-Hypothesis Statistical-Significance Test”, Rozeboom 1960

“The Fallacy Of The Null-Hypothesis Statistical-Significance Test”⁠, William W. Rozeboom (1960; backlinks; similar):

In this paper, I wish to examine a dogma of inferential procedure which, for psychologists at least, has attained the status of a religious conviction. The dogma to be scrutinized is the “null-hypothesis statistical-significance test” orthodoxy that passing statistical judgment on a scientific hypothesis by means of experimental observation is a decision procedure wherein one rejects or accepts a null hypothesis according to whether or not the value of a sample statistic yielded by an experiment falls within a certain predetermined “rejection region” of its possible values. The thesis to be advanced is that despite the awesome preeminence this method has attained in our experimental journals and textbooks of applied statistics, it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.

“Testing Statistical Hypotheses (First Edition)”, Lehmann 1959

1959-lehmann-testingstatisticalhypotheses.pdf: “Testing Statistical Hypotheses (First Edition)”⁠, E. L. Lehmann (1959-01-01; ; backlinks)

“Unsolved Problems of Experimental Statistics”, Tukey 1954

1954-tukey.pdf: “Unsolved Problems of Experimental Statistics”⁠, John W. Tukey (1954-01-01; ; backlinks)

“The Influence of 'Statistical Methods for Research Workers' on the Development of the Science of Statistics”, Yates 1951

1951-yates.pdf: “The Influence of 'Statistical Methods for Research Workers' on the Development of the Science of Statistics”⁠, Francis Yates (1951-01-01; backlinks)

“Probability and the Weighing of Evidence”, Good 1950-page-96

1950-good-probabilityandtheweighingofevidence.pdf#page=96: “Probability and the Weighing of Evidence”⁠, I. J. Good (1950-01-01; ; backlinks)

“'Superstition' in the Pigeon”, Skinner 1948

1948-skinner.pdf: “'Superstition' in the Pigeon”⁠, B. F. Skinner (1948-04; ; similar):

“A pigeon is brought to a stable state of hunger by reducing it to 75% of its weight when well fed. It is put into an experimental cage for a few minutes each day. A food hopper attached to the cage may be swung into place so that the pigeon can eat from it. A solenoid and a timing relay hold the hopper in place for 5 sec. at each reinforcement. If a clock is now arranged to present the food hopper at regular intervals with no reference whatsoever to the bird’s behavior, operant conditioning usually takes place.” The bird tends to learn whatever response it is making when the hopper appears. The response may be extinguished and reconditioned. “The experiment might be said to demonstrate a sort of superstition. The bird behaves as if there were a causal relation between its behavior and the presentation of food, although such a relation is lacking.”

“Factorial Studies Of Intelligence”, Thurstone & Thurstone 1941

1941-thurstone-factorialstudiesofintelligence.pdf: “Factorial Studies Of Intelligence”⁠, Louis L. Thurstone, Thelma G. Thurstone (1941-01-01; ; backlinks)

“A New Measure of Introversion-Extroversion”, Evans & McConnell 1941

1941-evans.pdf: “A New Measure of Introversion-Extroversion”⁠, Catharine Evans, T. R. McConnell (1941; ; backlinks; similar):

This paper describes the development of relatively independent measures for 3 types of Introversion-Extroversion: Thinking, Social, and Emotional. The need for clarifying the concept of I-E and for devising new inventories can best be understood by reviewing the confusion concerning its nature and measurement. In the effort to simplify the original complex description of I-E by Jung, psychologists either have introduced new concepts or emphasized varying phases of Jung’s definition. In this process of elaboration, they have actually complicated rather than clarified the idea of I-E. The use of these terms in the popular literature has only added to the confusion. Unfortunately, introversion, at least in the popular writings on psychology, has come to denote an undesirable personality tendency which borders on a neurotic condition.

In general, the available I-E inventories purport to measure a general, undifferentiated trait. However, the intercorrelations between the published inventories are surprisingly low. Only 5 of the 19 coefficients of intercorrelation reported in the literature for nine inventories are above 0.40, and only 2 are above 0.80. The 2 coefficients above 0.80 are between 2 inventories and revised forms of these same inventories.

…This study has reduced the confusion in the field of measurement of I-E by getting away from the general undifferentiated concept of I-E. An inventory was constructed to measure, not a general trait, but 3 types or phases of I-E which were clearly defined. By a simple technique of item analysis, 3 homogeneous and relatively independent I-E tests were developed. Each test seems to be sufficiently reliable for individual prediction. The demonstrated ability of each test to discriminate between groups of college students which one would logically expect to be characteristically different in a given type of I-E justifies the conclusion that each test is sufficiently valid for the inventory to be employed in the diagnosis and counseling of college students.

“"Student" As Statistician”, Pearson 1939

1939-pearson.pdf: “"Student" as Statistician”⁠, E. S. Pearson (1939-01; ; backlinks; similar):

[Egon Pearson describes Student⁠, or Gosset, as a statistician: Student corresponded widely with young statisticians/​mathematicians, encouraging them, and having an outsized influence not reflected in his publication. Student’s preferred statistical tools were remarkably simple, focused on correlations and standard deviations, but wielded effectively in the analysis and efficient design of experiments (particularly agricultural experiments), and he was an early decision-theorist, focused on practical problems connected to his Guinness Brewery job—which detachment from academia partially explains why he didn’t publish methods or results immediately or often. The need to handle small n of the brewery led to his work on small-sample approximations rather than, like Pearson et al in the Galton biometric tradition, relying on collecting large datasets and using asymptotic methods, and Student carried out one of the first Monte Carlo simulations.]

“Why Do We Sometimes Get Nonsense-Correlations between Time-Series?—A Study in Sampling and the Nature of Time-Series”, Yule 1926

1926-yule.pdf: “Why do we Sometimes get Nonsense-Correlations between Time-Series?—A Study in Sampling and the Nature of Time-Series”⁠, G. Udny Yule (1926-01-01)

“On Testing Varieties of Cereals”, Gosset 1923

1923-student.pdf: “On Testing Varieties of Cereals”⁠, William Sealy Gosset (1923-01-01; ; backlinks)

“Intelligence and Its Uses”, Thorndike 1920

1920-thorndike.pdf: “Intelligence and Its Uses”⁠, Edward L. Thorndike (1920-01-01; ; backlinks)

Spanish Christmas Lottery