Focal cortical lesions lead to local, not global, deficits.
Measurement models to explain the positive manifold are causal models with unique predictions going beyond model fit statistics.
Correlated factor, network, process sampling, mutualism, and investment models make causal predictions inconsistent with lesion evidence.
Hierarchical and bifactor models are consistent with the pattern of lesion effects, as well as possibly one form of bonds sampling models.
Future models and explanations of the positive manifold have to accommodate focal lesions leading to local not global deficits.
Here we examine 3 classes of models regarding the structure of human cognition: common cause models, sampling/network models, and interconnected models. That disparate models can accommodate one of the most globally replicated psychological phenomena—namely, the positive manifold—is an extension of underdetermination of theory by data. Statistical fit indices are an insufficient and sometimes intractable method of demarcating between the theories; strict tests and further evidence should be brought to bear on understanding the potential causes of the positive manifold. The cognitive impact of focal cortical lesions allows testing the necessary causal connections predicted by competing models. This evidence shows focal cortical lesions lead to local, not global (across all abilities), deficits. Only models that can accommodate a deficit in a given ability without effects on other covarying abilities can accommodate focal lesion evidence. After studying how different models pass this test, we suggest bifactor models (class: common cause models) and bond models (class: sampling models) are best supported. In short, competing psychometric models can be informed when their implied causal connections and predictions are tested.
[Keywords: human intelligence, structural models, causality, statistical model fit, cortical lesions]
[This would seem to explain the failure of dual n-back & WM training in general.
Training the specific ability of WM could only cause g increases in models with ‘upwards causation’ like hierarchical models or dynamic mutual causation like mutualism/investment models; these are ruled out by the lesion literature which finds that physically-tiny lesions damage specific abilities but not g, and if decreasing a specific ability cannot decrease g, then it’s hard to see how increasing that ability could ever increase g. See also Lee et al 2019.]
Causal knowledge is not static; it is constantly modified based on new evidence. The present set of 7 experiments explores 1 important case of causal belief revision that has been neglected in research so far: causal interpolations.
A simple prototypic case of an interpolation is a situation in which we initially have knowledge about a causal relation or a positive covariation between 2 variables but later become interested in the mechanism linking these 2 variables. Our key finding is that the interpolation of mechanism variables tends to be misrepresented, which leads to the paradox of knowing more: The more people know about a mechanism, the weaker they tend to find the probabilistic relation between the 2 variables (ie., weakening effect). Indeed, in all our experiments we found that, despite identical learning data about 2 variables, the probability linking the 2 variables was judged higher when follow-up research showed that the 2 variables were assumed to be directly causally linked (ie., C → E) than when participants were instructed that the causal relation is in fact mediated by a variable representing a component of the mechanism (M; ie., C → M → E).
Our explanation of the weakening effect is that people often confuse discoveries of preexisting but unknown mechanisms with situations in which new variables are being added to a previously simpler causal model, thus violating causal stability assumptions in natural kind domains. The experiments test several implications of this hypothesis.
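[One mechanical way the weakening effect could arise (an illustrative multiplication account, not the authors’ stability-violation explanation above): if a judge represents C → M → E as a chain and multiplies imperfect link probabilities, the mediated estimate is necessarily lower than a matched direct estimate. A minimal sketch with assumed numbers:]

```python
# Illustration only: multiplying link probabilities along C -> M -> E.
p_direct = 0.8   # judged P(E|C) under the direct model C -> E
p_c_to_m = 0.8   # judged P(M|C) under the mediated model
p_m_to_e = 0.8   # judged P(E|M) under the mediated model

p_mediated = p_c_to_m * p_m_to_e  # 0.64 < 0.80: a "weakening effect"
print(p_direct, p_mediated)
```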
We resolve a controversy over two competing hypotheses about why people object to randomized experiments: (1) People unsurprisingly object to experiments only when they object to a policy or treatment the experiment contains, or (2) people can paradoxically object to experiments even when they approve of implementing either condition for everyone. Using multiple measures of preference and test criteria in 5 preregistered within-subjects studies with 1,955 participants, we find that people often disapprove of experiments involving randomization despite approving of the policies or treatments to be tested.
[Keywords: field experiments, A/B tests, randomized controlled trials, research ethics, pragmatic trials]
Complex organisms thwart the simple rectilinear causality paradigm of “necessary and sufficient”, with its experimental strategy of “knock down and overexpress.”
This Essay organizes the eccentricities of biology into 4 categories that call for new mathematical approaches; recaps for the biologist the philosopher’s recent refinements to the causation concept and the mathematician’s computational tools that handle some but not all of the biological eccentricities; and describes overlooked insights that make causal properties of physical hierarchies such as emergence and downward causation straightforward.
Reviewing and extrapolating from similar situations in physics, it is suggested that new mathematical tools for causation analysis incorporating feedback, signal cancellation, nonlinear dependencies, physical hierarchies, and fixed constraints rather than instigative changes will reveal unconventional biological behaviors. These include “eigenisms”, organisms that are limited to quantized states; trajectories that steer a system such as an evolving species toward optimal states; and medical control via distributed “sheets” rather than single control points.
We discuss methods of data collection and analysis that emphasize the power of individual personality items for predicting real world criteria (eg., smoking, exercise, self-rated health). These methods are borrowed by analogy from radio astronomy and human genomics. Synthetic Aperture Personality Assessment (SAPA) applies a matrix sampling procedure that synthesizes very large covariance matrices through the application of massively missing at random data collection. These large covariance matrices can be applied, in turn, in Persome Wide Association Studies (PWAS) to form personality prediction scores for particular criteria. We use two open source data sets (n = 4,000 and 126,884 with 135 and 696 items respectively) for demonstrations of both of these procedures. We compare these procedures to the more traditional use of “Big 5” or a larger set of narrower factors (the “little 27”). We argue that there is more information at the item level than is used when aggregating items to form factorially derived scales.
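[A minimal sketch of the matrix-sampling idea with simulated data: when each respondent answers only a random subset of items, every item covariance can still be estimated from the respondents who happened to receive both items, and the pairwise estimates assemble into one large synthetic covariance matrix. All numbers here are assumptions for illustration:]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_subjects, n_items, items_per_person = 10_000, 100, 20

# Simulate complete item responses, then give each person only a random
# subset of items, mimicking SAPA's massively-missing-at-random design.
full = pd.DataFrame(rng.normal(size=(n_subjects, n_items)))
mask = np.zeros((n_subjects, n_items), dtype=bool)
for i in range(n_subjects):
    mask[i, rng.choice(n_items, size=items_per_person, replace=False)] = True
sampled = full.where(mask)

# pandas estimates each covariance from pairwise-complete observations,
# synthesizing the full 100x100 matrix despite 80% missingness
# (roughly 400 joint observations per item pair).
synthetic_cov = sampled.cov()
print(synthetic_cov.shape)
```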
Measuring the causal effects of digital advertising remains challenging despite the availability of granular data. Unobservable factors make exposure endogenous, and advertising’s effect on outcomes tends to be small. In principle, these concerns could be addressed using randomized controlled trials (RCTs). In practice, few online ad campaigns rely on RCTs and instead use observational methods to estimate ad effects. We assess empirically whether the variation in data typically available in the advertising industry enables observational methods to recover the causal effects of online advertising. Using data from 15 U.S. advertising experiments at Facebook comprising 500 million user-experiment observations and 1.6 billion ad impressions, we contrast the experimental results to those obtained from multiple observational models. The observational methods often fail to produce the same effects as the randomized experiments, even after conditioning on extensive demographic and behavioral variables. In our setting, advances in causal inference methods do not allow us to isolate the exogenous variation needed to estimate the treatment effects. We also characterize the incremental explanatory power our data would require to enable observational methods to successfully measure advertising effects. Our findings suggest that commonly used observational approaches based on the data usually available in the industry often fail to accurately measure the true effect of advertising.
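[A toy simulation of the identification problem, with assumed numbers: when targeting makes ad exposure depend on unobserved purchase intent, a naive exposed-vs-unexposed comparison wildly overstates a small true effect, which is why the RCT benchmark matters:]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
intent = rng.normal(size=n)                    # unobserved purchase intent
exposed = (intent + rng.normal(size=n)) > 0    # targeted, hence endogenous
outcome = 0.02 * exposed + 0.5 * intent + rng.normal(size=n)

naive = outcome[exposed].mean() - outcome[~exposed].mean()
print(round(naive, 3), "vs. true effect 0.02")  # naive estimate is inflated
```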
After a decade of genome-wide association studies (GWASs), fundamental questions in human genetics are still unanswered, such as the extent of pleiotropy across the genome, the nature of trait-associated genetic variants and the disparate genetic architecture across human traits. The current availability of hundreds of GWAS results provides the unique opportunity to gain insight into these questions. In this study, we harmonized and systematically analysed 4,155 publicly available GWASs. For a subset of well-powered GWASs on 558 unique traits, we provide an extensive overview of pleiotropy and genetic architecture. We show that trait-associated loci cover more than half of the genome, and 90% of those loci are associated with multiple trait domains. We further show that potential causal genetic variants are enriched in coding and flanking regions, as well as in regulatory elements, and how trait polygenicity is related to an estimate of the required sample size to detect 90% of causal genetic variants. Our results provide novel insights into how genetic variation contributes to trait variation. All GWAS results can be queried and visualized at the GWAS ATLAS resource (http://atlas.ctglab.nl).
Human DNA varies across geographic regions, with most variation observed so far reflecting distant ancestry differences. Here, we investigate the geographic clustering of genetic variants that influence complex traits and disease risk in a sample of ~450,000 individuals from Great Britain. Out of 30 traits analyzed, 16 show significant geographic clustering at the genetic level after controlling for ancestry, likely reflecting recent migration driven by socio-economic status (SES). Alleles associated with educational attainment (EA) show most clustering, with EA-decreasing alleles clustering in lower SES areas such as coal mining areas. Individuals that leave coal mining areas carry more EA-increasing alleles on average than the rest of Great Britain. In addition, we leveraged the geographic clustering of complex trait variation to further disentangle regional differences in socio-economic and cultural outcomes through genome-wide association studies on publicly available regional measures, namely coal mining, religiousness, 1970/2015 general election outcomes, and Brexit referendum results.
Background: The pathway from evidence generation to consumption contains many steps which can lead to overstatement or misinformation. The proliferation of internet-based health news may encourage selection of media and academic research articles that overstate strength of causal inference. We investigated the state of causal inference in health research as it appears at the end of the pathway, at the point of social media consumption.
Methods: We screened the NewsWhip Insights database for the most shared media articles on Facebook and Twitter reporting about peer-reviewed academic studies associating an exposure with a health outcome in 2015, extracting the 50 most-shared academic articles and media articles covering them. We designed and utilized a review tool to systematically assess and summarize studies’ strength of causal inference, including generalizability, potential confounders, and methods used. These were then compared with the strength of causal language used to describe results in both academic and media articles. Two randomly assigned independent reviewers and one arbitrating reviewer from a pool of 21 reviewers assessed each article.
Results: We accepted the 64 most-shared media articles, pertaining to 50 academic articles, for review, representing 68% of Facebook and 45% of Twitter shares in 2015. 34% of academic studies and 48% of media articles used language that reviewers considered too strong for their strength of causal inference. 70% of academic studies were considered low or very low strength of inference, with only 6% considered high or very high strength of causal inference. The most severe issues with academic studies’ causal inference were reported to be omitted confounding variables and generalizability. 58% of media articles were found to have inaccurately reported the question, results, intervention, or population of the academic study.
Conclusions: We find a large disparity between the strength of language as presented to the research consumer and the underlying strength of causal inference among the studies most widely shared on social media. However, because this sample was designed to be representative of the articles selected and shared on social media, it is unlikely to be representative of all academic and media work. More research is needed to determine how academic institutions, media organizations, and social network sharing patterns impact causal inference and language as received by the research consumer.
Understanding the nature and extent of horizontal pleiotropy, where one genetic variant has independent effects on multiple observable traits, is vitally important for our understanding of the genetic architecture of human phenotypes, as well as the design of genome-wide association studies (GWASs) and Mendelian randomization (MR) studies. Many recent studies have pointed to the existence of horizontal pleiotropy among human phenotypes, but the exact extent remains unknown, largely due to difficulty in disentangling the inherently correlated nature of observable traits. Here, we present a statistical framework to isolate and quantify horizontal pleiotropy in human genetic variation using a two-component pleiotropy score computed from summary statistic data derived from published GWASs. This score uses a statistical whitening procedure to remove correlations between observable traits and normalize effect sizes across all traits, and is able to detect horizontal pleiotropy under a range of different models in our simulations. When applied to real human phenotype data using association statistics for 1,564 traits measured in 337,119 individuals from the UK Biobank, our score detects a statistically-significant excess of horizontal pleiotropy. This signal of horizontal pleiotropy is pervasive throughout the human genome and across a wide range of phenotypes and biological functions, but is especially prominent in regions of high linkage disequilibrium and among phenotypes known to be highly polygenic and heterogeneous. Using our pleiotropy score, we identify thousands of loci with extreme levels of horizontal pleiotropy, a majority of which have never been previously reported in any published GWAS. This highlights an under-recognized class of genetic variation that has weak effects on many distinct phenotypes but no specific marked effect on any one phenotype. We show that a large fraction of these loci replicate using independent datasets of GWAS summary statistics. Our results highlight the central role horizontal pleiotropy plays in the genetic architecture of human phenotypes, and the importance of modeling horizontal pleiotropy in genomic medicine.
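[A minimal sketch of the whitening step (not the authors’ exact two-component score; all data here are simulated): decorrelate each variant’s effect sizes across correlated traits, so that a variant’s influence can be measured on statistically independent trait axes:]

```python
import numpy as np

def whiten_effects(Z, R):
    """Decorrelate a (variants x traits) z-score matrix Z, given the
    trait correlation matrix R, by multiplying with R^(-1/2)."""
    vals, vecs = np.linalg.eigh(R)
    R_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return Z @ R_inv_sqrt

rng = np.random.default_rng(2)
R = np.array([[1.0, 0.6], [0.6, 1.0]])           # two correlated traits
Z = rng.multivariate_normal([0, 0], R, size=5)   # toy per-variant z-scores

Zw = whiten_effects(Z, R)
# A variant scoring high on many *independent* axes (large squared row
# norm) is a candidate for horizontal pleiotropy.
print((Zw ** 2).sum(axis=1))
```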
A randomized experiment with almost 35 million Pandora listeners enables us to measure the sensitivity of consumers to advertising, an important topic of study in the era of ad-supported digital content provision. The experiment randomized listeners into 9 treatment groups, each of which received a different level of audio advertising interrupting their music listening, with the highest treatment group receiving more than twice as many ads as the lowest treatment group. By keeping consistent treatment assignment for 21 months, we are able to measure long-run demand effects, with three times as much ad-load sensitivity as we would have obtained if we had run a month-long experiment. We estimate a demand curve that is strikingly linear, with the number of hours listened decreasing linearly in the number of ads per hour (also known as the price of ad-supported listening). We also show the negative impact on the number of days listened and on the probability of listening at all in the final month. Using an experimental design that separately varies the number of commercial interruptions per hour and the number of ads per commercial interruption, we find that neither makes much difference to listeners beyond their impact on the total number of ads per hour. Lastly, we find that increased ad load causes a substantial increase in the number of paid ad-free subscriptions to Pandora, particularly among older listeners.
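[A minimal sketch of fitting the linear demand curve from group-level data, with invented numbers standing in for the 9 treatment arms:]

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical arm-level data: ads per hour vs. mean hours listened.
ads_per_hour = np.linspace(2, 8, 9)
hours = 20 - 1.5 * ads_per_hour + rng.normal(0, 0.3, size=9)

slope, intercept = np.polyfit(ads_per_hour, hours, 1)
print(round(slope, 2))   # hours of listening lost per additional ad/hour
```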
Accurate estimation of genetic correlation requires large sample sizes and access to genetically informative data, which are not always available. Accordingly, phenotypic correlations are often assumed to reflect genotypic correlations in evolutionary biology. Cheverud’s conjecture asserts that the use of phenotypic correlations as proxies for genetic correlations is appropriate. Empirical evidence of the conjecture has been found across plant and animal species, with results suggesting that there is indeed a robust relationship between the two. Here, we investigate the conjecture in human populations, an analysis made possible by recent developments in availability of human genomic data and computing resources. A sample of 108,035 British European individuals from the UK Biobank was split equally into discovery and replication datasets. 17 traits were selected based on sample size, distribution and heritability. Genetic correlations were calculated using linkage disequilibrium score regression applied to the genome-wide association summary statistics of pairs of traits, and compared within and across datasets. Strong and statistically-significant correlations were found for the between-dataset comparison, suggesting that the genetic correlations from one independent sample were able to predict the phenotypic correlations from another independent sample within the same population. Designating the selected traits as morphological or non-morphological indicated little difference in correlation. The results of this study support the existence of a relationship between genetic and phenotypic correlations in humans. This finding is of specific interest in anthropological studies, which use measured phenotypic correlations to make inferences about the genetics of ancient human populations.
Intelligence, or general cognitive function, is phenotypically and genetically correlated with many traits, including a wide range of physical and mental health variables. Education is strongly genetically correlated with intelligence (rg = 0.70). We used these findings as foundations for our use of a novel approach—multi-trait analysis of genome-wide association studies (MTAG; Turley et al. 2017)—to combine two large genome-wide association studies (GWASs) of education and intelligence, increasing statistical power and resulting in the largest GWAS of intelligence yet reported. Our study had four goals: first, to facilitate the discovery of new genetic loci associated with intelligence; second, to add to our understanding of the biology of intelligence differences; third, to examine whether combining genetically correlated traits in this way produces results consistent with the primary phenotype of intelligence; and, finally, to test how well this new meta-analytic data sample on intelligence predicts phenotypic intelligence in an independent sample. By combining datasets using MTAG, our functional sample size increased from 199,242 participants to 248,482. We found 187 independent loci associated with intelligence, implicating 538 genes, using both SNP-based and gene-based GWAS. We found evidence that neurogenesis and myelination—as well as genes expressed in the synapse, and those involved in the regulation of the nervous system—may explain some of the biological differences in intelligence. The results of our combined analysis demonstrated the same pattern of genetic correlations as those from previous GWASs of intelligence, providing support for the meta-analysis of these genetically-related phenotypes.
Background: Identifying genetic relationships between complex traits in emerging adulthood can provide useful etiological insights into risk for psychopathology. College-age individuals are under-represented in genomic analyses thus far, and the majority of work has focused on clinical disorders or cognitive abilities rather than normal-range behavioral outcomes.
Methods: This study examined a sample of emerging adults 18–22 years of age (n = 5,947) to construct an atlas of polygenic risk for 33 traits predicting relevant phenotypic outcomes. 28 hypotheses were tested based on the previous literature on samples of European ancestry, and the availability of rich assessment data allowed for polygenic predictions across 55 psychological and medical phenotypes.
Results: Polygenic risk for schizophrenia (SZ) in emerging adults predicted anxiety, depression, nicotine use, trauma, and family history of psychological disorders. Polygenic risk for neuroticism predicted anxiety, depression, phobia, panic, neuroticism, and was correlated with polygenic risk for cardiovascular disease.
Conclusions: These results demonstrate the extensive impact of genetic risk for SZ, neuroticism, and major depression on a range of health outcomes in early adulthood. Minimal cross-ancestry replication of these phenomic patterns of polygenic influence underscores the need for more genome-wide association studies of non-European populations.
Background: Symptomatic relief is the primary goal of percutaneous coronary intervention (PCI) in stable angina and is commonly observed clinically. However, there is no evidence from blinded, placebo-controlled randomised trials to show its efficacy.
Methods: ORBITA is a blinded, multicentre randomised trial of PCI versus a placebo procedure for angina relief that was done at five study sites in the UK. We enrolled patients with severe (≥70%) single-vessel stenoses. After enrolment, patients received 6 weeks of medication optimisation. Patients then had pre-randomisation assessments with cardiopulmonary exercise testing, symptom questionnaires, and dobutamine stress echocardiography. Patients were randomised 1:1 to undergo PCI or a placebo procedure by use of an automated online randomisation tool. After 6 weeks of follow-up, the assessments done before randomisation were repeated at the final assessment. The primary endpoint was difference in exercise time increment between groups. All analyses were based on the intention-to-treat principle and the study population contained all participants who underwent randomisation. This study is registered with ClinicalTrials.gov, number NCT02062593.
Findings: ORBITA enrolled 230 patients with ischaemic symptoms. After the medication optimisation phase and between Jan 6, 2014, and Aug 11, 2017, 200 patients underwent randomisation, with 105 patients assigned PCI and 95 assigned the placebo procedure. Lesions had mean area stenosis of 84.4% (SD 10.2), fractional flow reserve of 0.69 (0.16), and instantaneous wave-free ratio of 0.76 (0.22). There was no statistically-significant difference in the primary endpoint of exercise time increment between groups (PCI minus placebo 16.6 s, 95% CI −8.9 to 42.0, p = 0.200). There were no deaths. Serious adverse events included four pressure-wire related complications in the placebo group, which required PCI, and five major bleeding events, including two in the PCI group and three in the placebo group.
Interpretation: In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure. The efficacy of invasive procedures can be assessed with a placebo control, as is standard for pharmacotherapy.
Background: There is now convincing evidence that pleiotropy across the genome contributes to the correlation between human traits and comorbidity of diseases. The recent availability of genome-wide association study (GWAS) results has made the polygenic risk score (PRS) approach a powerful way to perform genetic prediction and identify genetic overlap among phenotypes.
Methods and Findings:
Here we use the PRS method to assess evidence for shared genetic aetiology across hundreds of traits within a single epidemiological study—the Northern Finland Birth Cohort 1966 (NFBC1966). We replicate numerous recent findings, such as a genetic association between Alzheimer’s disease and lipid levels, while the depth of phenotyping in the NFBC1966 highlights a range of novel significant genetic associations between traits.
Conclusion: This study illustrates the power in taking a hypothesis-free approach to the study of shared genetic aetiology between human traits and diseases. It also demonstrates the potential of the PRS method to provide important biological insights using only a single well-phenotyped epidemiological study of moderate sample size (~5k), with important advantages over evaluating genetic correlations from GWAS summary statistics only.
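[A minimal sketch of the PRS computation itself, with hypothetical GWAS weights and genotype dosages: each person’s score is simply the weighted sum of their risk-allele counts:]

```python
import numpy as np

rng = np.random.default_rng(4)
n_people, n_snps = 1_000, 500

dosages = rng.integers(0, 3, size=(n_people, n_snps))  # 0/1/2 risk alleles
betas = rng.normal(0, 0.05, size=n_snps)               # GWAS effect sizes

prs = dosages @ betas   # polygenic risk score per person
print(prs[:5])
```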
“Genome-wide meta-analysis associates HLA-DQA1/DRB1 and LPA and lifestyle factors with human longevity”, Peter K. Joshi, Nicola Pirastu, Katherine A. Kentistou, Krista Fischer, Edith Hofer, Katharina E. Schraut, David W. Clark, Teresa Nutile, Catriona L. K. Barnes, Paul R. H. J. Timmers, Xia Shen, Ilaria Gandin, Aaron F. McDaid, Thomas Folkmann Hansen, Scott D. Gordon, Franco Giulianini, Thibaud S. Boutin, Abdel Abdellaoui, Wei Zhao, Carolina Medina-Gomez, Traci M. Bartz, Stella Trompet, Leslie A. Lange, Laura Raffield, Ashley van der Spek, Tessel E. Galesloot, Petroula Proitsi, Lisa R. Yanek, Lawrence F. Bielak, Antony Payton, Federico Murgia, Maria Pina Concas, Ginevra Biino, Salman M. Tajuddin, Ilkka Seppälä, Najaf Amin, Eric Boerwinkle, Anders D. Børglum, Archie Campbell, Ellen W. Demerath, Ilja Demuth, Jessica D. Faul, Ian Ford, Alessandro Gialluisi, Martin Gögele, MariaElisa Graff, Aroon Hingorani, Jouke-Jan Hottenga, David M. Hougaard, Mikko A. Hurme, M. Arfan Ikram, Marja Jylhä, Diana Kuh, Lannie Ligthart, Christina M. Lill, Ulman Lindenberger, Thomas Lumley, Reedik Mägi, Pedro Marques-Vidal, Sarah E. Medland, Lili Milani, Reka Nagy, William E. R. Ollier, Patricia A. Peyser, Peter P. Pramstaller, Paul M. Ridker, Fernando Rivadeneira, Daniela Ruggiero, Yasaman Saba, Reinhold Schmidt, Helena Schmidt, P. Eline Slagboom, Blair H. Smith, Jennifer A. Smith, Nona Sotoodehnia, Elisabeth Steinhagen-Thiessen, Frank J. A. van Rooij, André L. Verbeek, Sita H. Vermeulen, Peter Vollenweider, Yunpeng Wang, Thomas Werge, John B. Whitfield, Alan B. Zonderman, Terho Lehtimäki, Michele K. Evans, Mario Pirastu, Christian Fuchsberger, Lars Bertram, Neil Pendleton, Sharon L. R. Kardia, Marina Ciullo, Diane M. Becker, Andrew Wong, Bruce M. Psaty, Cornelia M. van Duijn, James G. Wilson, J. Wouter Jukema, Lambertus Kiemeney, André G. Uitterlinden, Nora Franceschini, Kari E. North, David R. Weir, Andres Metspalu, Dorret I. Boomsma, Caroline Hayward, Daniel Chasman, Nicholas G. Martin, Naveed Sattar, Harry Campbell, Tõnu Esko, Zoltán Kutalik & James F. Wilson (2017-10-13; exercise, genetics / heritable, statistics / survival-analysis):
Genomic analysis of longevity offers the potential to illuminate the biology of human aging. Here, using genome-wide association meta-analysis of 606,059 parents’ survival, we discover two regions associated with longevity (HLA-DQA1/DRB1 and LPA). We also validate previous suggestions that APOE, CHRNA3/5, CDKN2A/B, SH2B3 and FOXO3A influence longevity. Next we show that giving up smoking, educational attainment, openness to new experience and high-density lipoprotein (HDL) cholesterol levels are most positively genetically correlated with lifespan while susceptibility to coronary artery disease (CAD), cigarettes smoked per day, lung cancer, insulin resistance and body fat are most negatively correlated. We suggest that the effect of education on lifespan is principally mediated through smoking while the effect of obesity appears to act via CAD. Using instrumental variables, we suggest that an increase of one body mass index unit reduces lifespan by 7 months while 1 year of education adds 11 months to expected lifespan.
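[A minimal sketch of the instrumental-variable logic behind estimates like “one BMI unit costs 7 months”, using the single-instrument Wald ratio with hypothetical effect sizes (the paper’s actual MR analysis is more elaborate):]

```python
# Wald ratio: causal effect of exposure X on outcome Y from one genetic
# instrument G, assuming G affects Y only through X (no pleiotropy).
beta_gx = 0.35    # per-allele effect of G on BMI (hypothetical)
beta_gy = -0.20   # per-allele effect of G on lifespan, months (hypothetical)

wald_ratio = beta_gy / beta_gx   # months of lifespan per BMI unit
print(round(wald_ratio, 2))
```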
Background: A wide range of diseases show some degree of clustering in families; family history is therefore an important aspect for clinicians when making risk predictions. Familial aggregation is often quantified in terms of a familial relative risk (FRR), and although at first glance this measure may seem simple and intuitive as an average risk prediction, its implications are not straightforward.
Methods: We use two statistical models for the distribution of disease risk in a population: a dichotomous risk model that gives an intuitive understanding of the implication of a given FRR, and a continuous risk model that facilitates a more detailed computation of the inequalities in disease risk. Published estimates of FRRs are used to produce Lorenz curves and Gini indices that quantify the inequalities in risk for a range of diseases.
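[A minimal sketch of the final computational step, assuming a hypothetical log-normal distribution of individual risks for the continuous risk model: sort the risks, accumulate the Lorenz curve, and integrate to get the Gini index:]

```python
import numpy as np

def gini(risks):
    """Gini index of a vector of individual disease risks."""
    r = np.sort(np.asarray(risks, dtype=float))
    n = r.size
    lorenz = np.cumsum(r) / r.sum()              # cumulative risk share
    area = (lorenz.sum() - lorenz[-1] / 2) / n   # area under Lorenz curve
    return 1 - 2 * area

rng = np.random.default_rng(5)
risks = rng.lognormal(mean=-4, sigma=1.0, size=100_000)  # hypothetical
print(round(gini(risks), 3))   # ~0.52 for sigma = 1
```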
Results: We demonstrate that even a moderate familial association in disease risk implies a very large difference in risk between individuals in the population. We give examples of diseases for which this is likely to be true, and we further demonstrate the relationship between the point estimates of FRRs and the distribution of risk in the population.
Conclusions: The variation in risk for several severe diseases may be larger than the variation in income in many countries. The implications of familial risk estimates should be recognized by epidemiologists and clinicians.
Eight out of ten leading international indices assessing developing countries on aspects beyond GDP show strong redundancy, bias, and unilateralism. The quantitative comparison shows that the same countries consistently lead the ranks, with low standard deviations. The dependency on GDP is striking: do the indices only measure indicators that are direct effects of a strong GDP? While the impact of GDP can also be read in the reverse direction, the standard deviations reveal a strong bias: only one of the twenty countries with the highest standard deviations is among the Top-20 countries of the world, but 11 of those with the lowest standard deviations are. We examine the weaknesses of these global statistics and methods for comparing their findings. The article is the result of a pre-study to assess Social Capital for developing countries made for the German Federal Ministry for Economic Cooperation and Development. The study led to the UN Sustainable Development Goals (UN SDG) project World Social Capital Monitor.
Individuals with lower socio-economic status (SES) are at increased risk of physical and mental illnesses and tend to die at an earlier age. Explanations for the association between SES and health typically focus on factors that are environmental in origin. However, common single nucleotide polymorphisms (SNPs) have been found collectively to explain around 18% (SE = 5%) of the phenotypic variance of an area-based social deprivation measure of SES. Molecular genetic studies have also shown that physical and psychiatric diseases are at least partly heritable. It is possible, therefore, that phenotypic associations between SES and health arise partly due to a shared genetic etiology.
We conducted a genome-wide association study (GWAS) on social deprivation and on household income using the 112,151 participants of UK Biobank. We find that common SNPs explain 21% (SE = 0.5%) of the variation in social deprivation and 11% (SE = 0.7%) in household income. 2 independent SNPs attained genome-wide statistical-significance for household income, rs187848990 on chromosome 2, and rs8100891 on chromosome 19. Genes in the regions of these SNPs have been associated with intellectual disabilities, schizophrenia, and synaptic plasticity. Extensive genetic correlations were found between both measures of socioeconomic status and illnesses, anthropometric variables, psychiatric disorders, and cognitive ability.
These findings show that some SNPs associated with SES are involved in the brain and central nervous system. The genetic associations with SES are probably mediated via other partly-heritable variables, including cognitive ability, education, personality, and health.
Objective: To assess differences in estimated treatment effects for mortality between observational studies with routinely collected health data (RCD; that are published before trials are available) and subsequent evidence from randomized controlled trials on the same clinical question.
Design: Meta-epidemiological survey.
Data sources: PubMed searched up to November 2014.
Methods: Eligible RCD studies were published up to 2010 that used propensity scores to address confounding bias and reported comparative effects of interventions for mortality. The analysis included only RCD studies conducted before any trial was published on the same topic. The direction of treatment effects, confidence intervals, and effect sizes (odds ratios) were compared between RCD studies and randomized controlled trials. The relative odds ratio (that is, the summary odds ratio of trial(s) divided by the RCD study estimate) and the summary relative odds ratio were calculated across all pairs of RCD studies and trials. A summary relative odds ratio greater than one indicates that RCD studies gave more favorable mortality results.
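[A minimal sketch of the comparison statistic with hypothetical odds-ratio pairs; the real analysis pools log-ratios with meta-analytic inverse-variance weights, whereas an unweighted geometric mean is used here for brevity:]

```python
import math

# Hypothetical (trial OR, RCD-study OR) pairs for the same questions.
pairs = [(0.90, 0.70), (1.05, 0.80), (0.95, 0.75)]

log_rors = [math.log(trial / rcd) for trial, rcd in pairs]
summary_ror = math.exp(sum(log_rors) / len(log_rors))
# >1 means the RCD studies gave more favorable mortality results.
print(round(summary_ror, 2))
```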
Results: The evaluation included 16 eligible RCD studies, and 36 subsequent published randomized controlled trials investigating the same clinical questions (with 17,275 patients and 835 deaths). Trials were published a median of three years after the corresponding RCD study. For five (31%) of the 16 clinical questions, the direction of treatment effects differed between RCD studies and trials. Confidence intervals in nine (56%) RCD studies did not include the RCT effect estimate. Overall, RCD studies showed mortality estimates that were statistically-significantly more favorable, by 31%, than those of subsequent trials (summary relative odds ratio 1.31 (95% confidence interval 1.03 to 1.65; I2 = 0%)).
Conclusions: Studies of routinely collected health data could give different answers from subsequent randomized controlled trials on the same clinical questions, and may substantially overestimate treatment effects. Caution is needed to prevent misguided clinical decision making.
Causes of the well-documented association between low levels of cognitive functioning and many adverse neuropsychiatric outcomes, poorer physical health and earlier death remain unknown. We used linkage disequilibrium regression and polygenic profile scoring to test for shared genetic aetiology between cognitive functions and neuropsychiatric disorders and physical health. Using information provided by many published genome-wide association study consortia, we created polygenic profile scores for 24 vascular-metabolic, neuropsychiatric, physiological-anthropometric and cognitive traits in the participants of UK Biobank, a very large population-based sample (n = 112,151). Pleiotropy between cognitive and health traits was quantified by deriving genetic correlations using summary genome-wide association study statistics and the method of linkage disequilibrium score regression. Substantial and statistically-significant genetic correlations were observed between cognitive test scores in the UK Biobank sample and many of the mental and physical health-related traits and disorders assessed here. In addition, highly statistically-significant associations were observed between the cognitive test scores in the UK Biobank sample and many polygenic profile scores, including coronary artery disease, stroke, Alzheimer’s disease, schizophrenia, autism, major depressive disorder, body mass index, intracranial volume, infant head circumference and childhood cognitive ability. Where disease diagnosis was available for UK Biobank participants, we were able to show that these results were not confounded by those who had the relevant disease. These findings indicate that a substantial level of pleiotropy exists between cognitive abilities and many human mental and physical health disorders and traits and that it can be used to predict phenotypic variance across samples.
We propose a summary statistic for the economic well-being of people in a country. Our measure incorporates consumption, leisure, mortality, and inequality, first for a narrow set of countries using detailed micro data, and then more broadly using multi-country datasets. While welfare is highly correlated with GDP per capita, deviations are often large. Western Europe looks considerably closer to the United States, emerging Asia has not caught up as much, and many developing countries are further behind. Each component we introduce plays an important role in accounting for these differences, with mortality being most important.
Key Point 1: GDP per person is an excellent indicator of welfare across the broad range of countries: the two measures have a correlation of 0.98. Nevertheless, for any given country, the difference between the two measures can be important. Across 13 countries, the median deviation is about 35%.
Figure 5 illustrates this first point. The top panel plots the welfare measure, λ, against GDP per person. What emerges prominently is that the two measures are highly correlated, with a correlation coefficient (for the logs) of 0.98. Thus per capita GDP is a good proxy for welfare under our assumptions. At the same time, there are clear departures from the 45° line. In particular, many countries with very low GDP per capita exhibit even lower welfare. As a result, welfare is more dispersed (standard deviation of 1.51 in logs) than is income (standard deviation of 1.27 in logs).
The bottom panel provides a closer look at the deviations. This figure plots the ratio of welfare to per capita GDP across countries. The European countries have welfare measures 22% higher than their incomes. The remaining countries, in contrast, have welfare levels that are typically 25–50% below their incomes. The way to reconcile these large deviations with the high correlation between welfare and income is that the “scales” are so different. Incomes vary by more than a factor of 64 in our sample, ie., 6,300%, whereas the deviations are on the order of 25–50%.
25 large field experiments with major U.S. retailers and brokerages, most reaching millions of customers and collectively representing $2.8 million (2015 dollars) in digital advertising expenditure, reveal that measuring the returns to advertising is difficult. The median confidence interval on return on investment is over 100 percentage points wide. Detailed sales data show that relative to the per capita cost of the advertising, individual-level sales are very volatile; a coefficient of variation of 10 is common. Hence, informative advertising experiments can easily require more than 10 million person-weeks, making experiments costly and potentially infeasible for many firms. Despite these unfavorable economics, randomized control trials represent progress by injecting new, unbiased information into the market. The inference challenges revealed in the field experiments also show that selection bias, due to the targeted nature of advertising, is a crippling concern for widely employed observational methods.
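[A minimal sketch of why the economics are so unfavorable, combining the abstract’s coefficient of variation of 10 with a hypothetical target lift in the standard two-sample sample-size formula:]

```python
from scipy.stats import norm

def n_per_arm(cv, lift, alpha=0.05, power=0.8):
    """Per-arm sample size to detect a relative sales lift, given the
    coefficient of variation (sigma/mean) of individual sales."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * (cv / lift) ** 2

# With sigma = 10x mean sales, even a 2% lift needs ~3.9 million
# person-weeks per arm, consistent with the "10 million" figure above.
print(f"{n_per_arm(cv=10, lift=0.02):,.0f}")
```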
Despite a century of research on complex traits in humans, the relative importance and specific nature of the influences of genes and environment on human traits remain controversial. We report a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all published twin studies of complex traits. Estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. For a majority (69%) of traits, the observed twin correlations are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation. The data are inconsistent with substantial influences from shared environment or non-additive genetic variation. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All the results can be visualized using the MaTCH webtool.
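[A minimal sketch of how such twin correlations decompose, using the classic Falconer formulas under the ACE model with illustrative values near the paper’s averages:]

```python
# Falconer decomposition from MZ and DZ twin correlations (ACE model):
#   h2 (additive genetics)  = 2 * (r_mz - r_dz)
#   c2 (shared environment) = 2 * r_dz - r_mz
#   e2 (unique environment) = 1 - r_mz
r_mz, r_dz = 0.49, 0.245   # illustrative; r_mz = 2*r_dz fits a pure AE model

h2 = 2 * (r_mz - r_dz)   # 0.49, matching the reported ~49% heritability
c2 = 2 * r_dz - r_mz     # 0.0: no shared-environment contribution
e2 = 1 - r_mz            # 0.51
print(h2, c2, e2)
```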
In this commentary we answer 3 questions that are often posed when debating the usefulness and accuracy of correcting criterion-related validity coefficients for unreliability: (a) Is 0.52 an inaccurate estimate? (b) Do corrections for criterion unreliability lead us to choose different selection tools? (c) Is too much variance explained?
[1. Yes; 2. No, because rank-order of tools’ utility is preserved by the corrections; 3. No, because while everything is correlated r = 0.30 on average, most of those variables are unknowable at hiring time and also adding up variables ignores diminishing returns/intercorrelations between the predictors, so one will never predict perfectly.]
Conclusion: Based on our review of the evidence, the 0.52 estimate of the interrater reliability of supervisor ratings of job performance is an appropriate estimate; corrections for unreliability do not appear to change our decisions regarding the choice of one selection tool over another; and most variables may be more strongly correlated than people expect, making it difficult to demonstrate continued incremental validity in predicting job performance when adding additional predictors. We agree with LeBreton et al that psychologists need to be careful when applying and interpreting corrections, and we are thankful that they sponsored a discussion on the topic.
Corrections are critical for both basic science (ie., estimating population parameters) and practice (ie., recognizing artifacts attenuating estimates on which our work may be evaluated by stakeholders, courts, and other third parties). Ultimately, the appropriate use of corrections depends on the purpose of the project. If the goal is to explain variation among a sample of incumbents on observed criterion scores, then no corrections need to be made. If the goal is to explain variation among incumbents on a true score for job performance, then a correction for unreliability is not only desirable but necessary. Finally, if the goal is to estimate how much variation among applicants is explained by a predictor for a true score on job performance, then corrections for range restriction and unreliability are indispensable. This goal represents the target validity inference that was included in Binning & Barrett 1989’s figure, but (rather interestingly) is omitted from LeBreton et al’s reproduction of that figure. We believe that the target validity inference is the most important inference in personnel selection; it provides the critical link from the observed predictor to the criterion construct (see also Putka & Sackett 2010).
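[A minimal sketch of the correction at issue, using the 0.52 interrater-reliability estimate and a hypothetical observed validity; dividing every tool’s validity by the same constant is exactly why rank order across tools is preserved:]

```python
import math

r_observed = 0.30   # hypothetical observed validity of a selection tool
r_yy = 0.52         # interrater reliability of supervisor ratings

# Spearman's correction for attenuation, criterion side only.
r_corrected = r_observed / math.sqrt(r_yy)
print(round(r_corrected, 2))   # ~0.42
```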
Directed acyclic graphs are the basic representation of the structure underlying Bayesian networks, which represent multivariate probability distributions. In many practical applications, such as the reverse engineering of gene regulatory networks, not only the estimation of model parameters but the reconstruction of the structure itself is of great interest. As well as for the assessment of different structure learning algorithms in simulation studies, a uniform sample from the space of directed acyclic graphs is required to evaluate the prevalence of certain structural features. Here we analyse how to sample acyclic digraphs uniformly at random through recursive enumeration, an approach previously thought too computationally involved. Based on complexity considerations, we discuss in particular how the enumeration directly provides an exact method, which avoids the convergence issues of the alternative Markov chain methods and is actually computationally much faster. The limiting behaviour of the distribution of acyclic digraphs then allows us to sample arbitrarily large graphs. Building on the ideas of recursive enumeration based sampling we also introduce a novel hybrid Markov chain with much faster convergence than current alternatives while still being easy to adapt to various restrictions. Finally we discuss how to include such restrictions in the combinatorial enumeration and the new hybrid Markov chain method for efficient uniform sampling of the corresponding graphs.
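[A minimal sketch of the enumeration these methods build on: Robinson’s recursion for the number of labeled DAGs on n nodes, which recursive-enumeration samplers invert to draw graphs uniformly (the sampler and the paper’s hybrid Markov chain are omitted here):]

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def n_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recursion
    (inclusion-exclusion over the k nodes with no incoming edges)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * n_dags(n - k)
               for k in range(1, n + 1))

print([n_dags(n) for n in range(6)])   # [1, 1, 3, 25, 543, 29281]
```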
We measure the causal effects of online advertising on sales, using a randomized experiment performed in cooperation between Yahoo! and a major retailer. After identifying over one million customers matched in the databases of the retailer and Yahoo!, we randomly assign them to treatment and control groups. We analyze individual-level data on ad exposure and weekly purchases at this retailer, both online and in stores. We find statistically-significant and economically substantial impacts of the advertising on sales. The treatment effect persists for weeks after the end of an advertising campaign, and the total effect on revenues is estimated to be more than seven times the retailer’s expenditure on advertising during the study. Additional results explore differences in the number of advertising impressions delivered to each individual, online and offline sales, and the effects of advertising on those who click the ads versus those who merely view them. Power calculations show that, due to the high variance of sales, our large number of observations brings us just to the frontier of being able to measure economically substantial effects of advertising. We also demonstrate that without an experiment, using industry-standard methods based on endogenous cross-sectional variation in advertising exposure, we would have obtained a wildly inaccurate estimate of advertising effectiveness.
Measuring the causal effects of online advertising (adfx) on user behavior is important to the health of the WWW publishing industry. In this paper, using three controlled experiments, we show that observational data frequently lead to incorrect estimates of adfx. The reason, which we label “activity bias”, comes from the surprising amount of time-based correlation between the myriad activities that users undertake online.
In Experiment 1, users who are exposed to an ad on a given day are much more likely to engage in brand-relevant search queries as compared to their recent history for reasons that had nothing to do with the advertisement. In Experiment 2, we show that activity bias occurs for page views across diverse websites. In Experiment 3, we track account sign-ups at a competitor’s (of the advertiser) website and find that many more people sign up on the day they saw an advertisement than on other days, but that the true “competitive effect” was minimal.
In all three experiments, exposure to a campaign signals doing “more of everything” in a given period of time, making it difficult to find a suitable “matched control” using prior behavior. In such cases, the “match” is fundamentally different from the exposed group, and we show how and why observational methods lead to a massive overestimate of adfx in such circumstances.
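[A toy simulation of activity bias under assumed numbers: daily activity drives both ad exposure and sign-ups, so exposed user-days show far higher sign-up rates even with zero true ad effect:]

```python
import numpy as np

rng = np.random.default_rng(6)
n_users, n_days = 50_000, 30

# A user's activity varies day to day; on active days they BOTH see
# more ads and do more of everything else (including signing up).
activity = rng.exponential(size=(n_users, n_days))
saw_ad = rng.random((n_users, n_days)) < activity / (1 + activity)
signed_up = rng.random((n_users, n_days)) < 0.01 * activity  # no ad effect

# Ratio >> 1 despite the ad having no causal effect at all.
print(signed_up[saw_ad].mean() / signed_up[~saw_ad].mean())
```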
Valid causal inference is central to progress in theoretical and applied psychology. Although the randomized experiment is widely considered the gold standard for determining whether a given exposure increases the likelihood of some specified outcome, experiments are not always feasible and in some cases can result in biased estimates of causal effects. Alternatively, standard observational approaches are limited by the possibility of confounding, reverse causation, and the nonrandom distribution of exposure (i.e., selection). We describe the counterfactual model of causation and apply it to the challenges of causal inference in observational research, with a particular focus on aging. We argue that the study of twin pairs discordant on exposure, and in particular discordant monozygotic twins, provides a useful analog to the idealized counterfactual design. A review of discordant-twin studies in aging reveals that they are consistent with, but do not unambiguously establish, a causal effect of lifestyle factors on important late-life outcomes. Nonetheless, the existing studies are few in number and have clear limitations that have not always been considered in interpreting their results. It is concluded that twin researchers could make greater use of the discordant-twin design as one approach to strengthen causal inferences in observational research.
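[A minimal sketch of the discordant-MZ-twin analysis with simulated data: because co-twins share genes and rearing environment, the within-pair contrast removes those confounders by design:]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_pairs = 200

# Hypothetical late-life outcome for MZ pairs discordant on an exposure:
# shared familial component + a small exposure effect + individual noise.
familial = rng.normal(size=n_pairs)
exposed = familial + 0.3 + rng.normal(size=n_pairs)
unexposed = familial + rng.normal(size=n_pairs)

# The paired comparison cancels the shared familial term entirely.
t, p = stats.ttest_rel(exposed, unexposed)
print(round(t, 2), round(p, 4))
```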
Personality researchers have recently advocated the use of very short personality inventories in order to minimize administration time. However, few such inventories are currently available. Here I introduce an automated method that can be used to abbreviate virtually any personality inventory with minimal effort. After validating the method against existing measures in Studies 1 and 2, a new 181-item inventory is generated in Study 3 that accurately recaptures scores on 8 different broadband inventories comprising 203 distinct scales. Collectively, the results validate a powerful new way to improve the efficiency of personality measurement in research settings.
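[A simplified stand-in for the automated abbreviation idea (greedy forward selection rather than the paper’s optimization method), with simulated item data: repeatedly add the item whose inclusion best recaptures the full-inventory score:]

```python
import numpy as np

def greedy_abbreviate(items, full_score, k):
    """Pick k item columns whose sum best correlates with the
    full-scale score: a crude automated abbreviation."""
    chosen = []
    for _ in range(k):
        best_j, best_r = None, -np.inf
        for j in range(items.shape[1]):
            if j in chosen:
                continue
            r = np.corrcoef(items[:, chosen + [j]].sum(axis=1), full_score)[0, 1]
            if r > best_r:
                best_j, best_r = j, r
        chosen.append(best_j)
    return chosen, best_r

rng = np.random.default_rng(8)
items = rng.normal(size=(500, 40))
full = items.sum(axis=1)              # score on the "full inventory"
subset, r = greedy_abbreviate(items, full, k=10)
print(subset, round(r, 2))
```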
In economics and other sciences, “statistical-significance” is by custom, habit, and education a necessary and sufficient condition for proving an empirical result (Ziliak and McCloskey, 2008; McCloskey & Ziliak, 1996). The canonical routine is to calculate what’s called a t-statistic and then to compare its estimated value against a theoretically expected value of it, which is found in “Student’s” t table. A result yielding a t-value greater than or equal to about 2.0 is said to be “statistically-significant at the 95 percent level.” Alternatively, a regression coefficient is said to be “statistically-significantly different from the null, p < 0.05.” Canonically speaking, if a coefficient clears the 95 percent hurdle, it warrants additional scientific attention. If not, not. The first presentation of “Student’s” test of statistical-significance came a century ago, in “The Probable Error of a Mean” (1908b), published by an anonymous “Student.” The author’s commercial employer required that his identity be shielded from competitors, but we have known for some decades that the article was written by William Sealy Gosset (1876–1937), whose entire career was spent at Guinness’s brewery in Dublin, where Gosset was a master brewer and experimental scientist (E. S. Pearson, 1937). Perhaps surprisingly, the ingenious “Student” did not give a hoot for a single finding of “statistical”-significance, even at the 95 percent level of statistical-significance as established by his own tables. Beginning in 1904, “Student”, who was a businessman as well as a scientist, took an economic approach to the logic of uncertainty, arguing finally that statistical-significance is “nearly valueless” in itself.
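[The canonical routine in code, with a hypothetical sample: compute t and compare it to the ~2.0 cutoff, the mechanical practice that “Student” himself found “nearly valueless” in itself:]

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(loc=0.2, scale=1.0, size=100)   # hypothetical sample

# One-sample t-statistic against a null mean of zero.
t = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
print(round(t, 2), "significant" if abs(t) >= 2.0 else "not significant")
```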
In conventional epidemiology confounding of the exposure of interest with lifestyle or socioeconomic factors, and reverse causation whereby disease status influences exposure rather than vice versa, may invalidate causal interpretations of observed associations. Conversely, genetic variants should not be related to the confounding factors that distort associations in conventional observational epidemiological studies. Furthermore, disease onset will not influence genotype. Therefore, it has been suggested that genetic variants that are known to be associated with a modifiable (nongenetic) risk factor can be used to help determine the causal effect of this modifiable risk factor on disease outcomes. This approach, mendelian randomization, is increasingly being applied within epidemiological studies. However, there is debate about the underlying premise that associations between genotypes and disease outcomes are not confounded by other risk factors. We examined the extent to which genetic variants, on the one hand, and nongenetic environmental exposures or phenotypic characteristics on the other, tend to be associated with each other, to assess the degree of confounding that would exist in conventional epidemiological studies compared with mendelian randomization studies.
Methods and Findings:
We estimated pairwise correlations between nongenetic baseline variables and genetic variables in a cross-sectional study comparing the number of correlations that were statistically-significant at the 5%, 1%, and 0.01% level (α = 0.05, 0.01, and 0.0001, respectively) with the number expected by chance if all variables were in fact uncorrelated, using a two-sided binomial exact test. We demonstrate that behavioural, socioeconomic, and physiological factors are strongly interrelated, with 45% of all possible pairwise associations between 96 nongenetic characteristics (n = 4,560 correlations) being statistically-significant at the p < 0.01 level (the ratio of observed to expected statistically-significant associations was 45; p-value for difference between observed and expected < 0.000001). Similar findings were observed for other levels of significance. In contrast, genetic variants showed no greater association with each other, or with the 96 behavioural, socioeconomic, and physiological factors, than would be expected by chance.
These data illustrate why observational studies have produced misleading claims regarding potentially causal factors for disease. The findings demonstrate the potential power of a methodology that utilizes genetic variants as indicators of exposure level when studying environmentally modifiable risk factors.
In a cross-sectional study Davey Smith and colleagues show why observational studies can produce misleading claims regarding potential causal factors for disease, and illustrate the use of mendelian randomization to study environmentally modifiable risk factors.
Epidemiology is the study of the distribution and causes of human disease. Observational epidemiological studies investigate whether particular modifiable factors (for example, smoking or eating healthily) are associated with the risk of a particular disease. The link between smoking and lung cancer was discovered in this way. Once the modifiable factors associated with a disease are established as causal factors, individuals can reduce their risk of developing that disease by avoiding causative factors or by increasing their exposure to protective factors. Unfortunately, modifiable factors that are associated with risk of a disease in observational studies sometimes turn out not to cause or prevent disease. For example, higher intake of vitamins C and E apparently protected people against heart problems in observational studies, but taking these vitamins did not show any protection against heart disease in randomized controlled trials (studies in which identical groups of patients are randomly assigned various interventions and then their health monitored). One explanation for this type of discrepancy is known as confounding—the distortion of the effect of one factor by the presence of another that is associated both with the exposure under study and with the disease outcome. So in this example, people who took vitamin supplements might also have exercised more than people who did not take supplements and it could have been the exercise rather than the supplements that was protective against heart disease.
Why Was This Study Done?:
It isn’t always possible to check the results of observational studies in randomized controlled trials so epidemiologists have developed other ways to minimize confounding. One approach is known as mendelian randomization. Several gene variants have been identified that affect risk factors. For example, variants in a gene called APOE affect the level of cholesterol in an individual’s blood, a risk factor for heart disease. People inherit gene variants randomly from their parents to build up their own unique genotype (total genetic makeup). Consequently, a study that examines the associations between a gene variant and a disease can indicate whether the risk factor affected by that gene variant causes the disease. There should be no confounding in this type of study, the argument goes, because different genetic variants should not be associated with each other or with nongenetic variables that typically confound directly assessed associations between risk factors and disease. But is this true? In this study, the researchers have tested whether nongenetic risk factors are confounded by each other and also whether genetic variants are confounded by nongenetic risk factors and also by other genetic variants.
What Did the Researchers Do and Find?:
Using data collected in the British Women’s Heart and Health Study, the researchers calculated how many pairs of nongenetic variables (for example, frequency of eating meat, alcohol intake) were statistically-significantly correlated with each other; that is, how many pairs showed a stronger correlation than would be expected by chance. They compared this number with the number of correlations that would occur by chance if all the variables were totally independent. When the researchers assumed that 1 in 100 pairs of variables would be correlated by chance (α = 0.01), statistically-significant correlations occurred 45 times more frequently than expected (an observed-to-expected ratio of 45). When the researchers repeated this exercise with genetic variants, the ratio of observed to expected statistically-significant correlations was 1.58, a figure not statistically-significantly different from 1. Similarly, the ratio of observed to expected statistically-significant correlations when pairwise combinations between genetic and nongenetic variants were considered was 1.22.
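[The observed-versus-expected arithmetic can be reproduced in a few lines; a minimal sketch assuming the figures quoted above (96 variables, hence 4,560 pairs, 45% of them significant at p < 0.01), not the authors’ code:]

```python
from math import comb
from scipy.stats import binomtest

n_vars = 96
n_pairs = comb(n_vars, 2)         # 4,560 possible pairwise correlations
alpha = 0.01                      # the significance level being audited
expected = alpha * n_pairs        # ~45.6 significant pairs if all nulls were true
observed = round(0.45 * n_pairs)  # ~2,052 pairs reported as significant

print(f"observed/expected ratio: {observed / expected:.0f}")  # -> 45

# Two-sided binomial exact test: could `observed` successes in `n_pairs`
# trials plausibly arise if each pair had probability `alpha` of significance?
print(f"p-value: {binomtest(observed, n_pairs, alpha).pvalue:.1e}")  # effectively 0
```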
What Do These Findings Mean?:
These findings have two main implications. First, the large excess of observed over expected associations among the nongenetic variables indicates that many nongenetic modifiable factors occur in clusters—for example, people with healthy diets often have other healthy habits. Researchers doing observational studies always try to adjust for confounding but this result suggests that this adjustment will be hard to do, in part because it will not always be clear which factors are confounders. Second, the lack of a large excess of observed over expected associations among the genetic variables (and also among genetic variables paired with nongenetic variables) indicates that little confounding is likely to occur in studies that use mendelian randomization. In other words, this approach is a valid way to identify which environmentally modifiable risk factors cause human disease.
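[The mendelian-randomization logic is essentially an instrumental-variables argument. A toy simulation (hypothetical effect sizes, not data from the study) shows a naive exposure-outcome regression being biased by a confounder while the gene-based Wald ratio recovers the true effect:]

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
gene = rng.binomial(2, 0.3, n)          # allele count of an APOE-like variant
confounder = rng.normal(size=n)         # e.g. an unmeasured lifestyle factor
risk_factor = 0.5 * gene + confounder + rng.normal(size=n)
disease = 0.3 * risk_factor + confounder + rng.normal(size=n)  # true effect: 0.3

# Naive regression of outcome on exposure is biased by the confounder.
naive = np.cov(risk_factor, disease)[0, 1] / np.var(risk_factor)
# Wald ratio: gene-outcome effect divided by gene-exposure effect.
wald = np.cov(gene, disease)[0, 1] / np.cov(gene, risk_factor)[0, 1]
print(f"naive: {naive:.2f} (biased upward); MR estimate: {wald:.2f} (~0.30)")
```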
This article notes 5 reasons why a correlation between a risk (or protective) factor and some specified outcome might not reflect environmental causation. In keeping with numerous other writers, it is noted that a causal effect is usually composed of a constellation of components acting in concert. The study of causation, therefore, will necessarily be informative on only one or more subsets of such components. There is no such thing as a single basic necessary and sufficient cause. Attention is drawn to the need to consider the counterfactual (ie., what would have happened if the individual had not had the supposed risk experience), even though it is unobservable. 15 possible types of natural experiments that may be used to test causal inferences with respect to naturally occurring prior causes (rather than planned interventions) are described. These comprise 5 types of genetically sensitive designs intended to control for possible genetic mediation (as well as dealing with other issues), 6 uses of twin or adoptee strategies to deal with other issues such as selection bias or the contrasts between different environmental risks, 2 designs to deal with selection bias, regression discontinuity designs to take into account unmeasured confounders, and the study of contextual effects. It is concluded that, taken in conjunction, natural experiments can be very helpful in both strengthening and weakening causal inferences.
Personality has consequences. Measures of personality have contemporaneous and predictive relations to a variety of important outcomes. Using the Big Five factors as heuristics for organizing the research literature, numerous consequential relations are identified. Personality dispositions are associated with happiness, physical and psychological health, spirituality, and identity at an individual level; associated with the quality of relationships with peers, family, and romantic others at an interpersonal level; and associated with occupational choice, satisfaction, and performance, as well as community involvement, criminal activity, and political ideology at a social institutional level.
[Keywords: individual differences, traits, life outcomes, consequences]
Background: Information on major harms of medical interventions comes primarily from epidemiologic studies performed after licensing and marketing. Comparison with data from large-scale randomized trials is occasionally feasible. We compared evidence from randomized trials with that from epidemiologic studies to determine whether they give different estimates of risk for important harms of medical interventions.
Methods: We targeted well-defined, specific harms of various medical interventions for which data were already available from large-scale randomized trials (> 4000 subjects). Nonrandomized studies involving at least 4000 subjects addressing these same harms were retrieved through a search of MEDLINE. We compared the relative risks and absolute risk differences for specific harms in the randomized and nonrandomized studies.
Results: Eligible nonrandomized studies were found for 15 harms for which data were available from randomized trials addressing the same harms. Comparisons of relative risks between the study types were feasible for 13 of the 15 topics, and of absolute risk differences for 8 topics. The estimated increase in relative risk differed more than 2× between the randomized and nonrandomized studies for 7 (54%) of the 13 topics; the estimated increase in absolute risk differed more than 2× for 5 (62%) of the 8 topics. There was no clear predilection for randomized or nonrandomized studies to estimate greater relative risks, but usually (75% [6/8]) the randomized trials estimated larger absolute excess risks of harm than the nonrandomized studies did.
Interpretation: Nonrandomized studies are often conservative in estimating absolute risks of harms. It would be useful to compare and scrutinize the evidence on harms obtained from both randomized and nonrandomized studies.
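[A toy calculation with invented counts illustrates why the two metrics compared above can diverge: two studies can agree exactly on relative risk yet differ 2-fold in absolute risk difference whenever their baseline risks differ:]

```python
def risk_metrics(events_tx, n_tx, events_ctrl, n_ctrl):
    """Return (relative risk, absolute risk difference) for a 2-arm study."""
    risk_tx, risk_ctrl = events_tx / n_tx, events_ctrl / n_ctrl
    return risk_tx / risk_ctrl, risk_tx - risk_ctrl

# Randomized trial: a sicker, selected population with a higher baseline risk.
rr_rct, ard_rct = risk_metrics(40, 2000, 20, 2000)
# Nonrandomized study: the same relative effect, half the baseline risk.
rr_obs, ard_obs = risk_metrics(40, 4000, 20, 4000)

print(f"RCT: RR = {rr_rct:.1f}, ARD = {ard_rct:.3f}")  # RR = 2.0, ARD = 0.010
print(f"Obs: RR = {rr_obs:.1f}, ARD = {ard_obs:.3f}")  # RR = 2.0, ARD = 0.005
```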
Context: Controversy and uncertainty ensue when the results of clinical research on the effectiveness of interventions are subsequently contradicted. Controversies are most prominent when high-impact research is involved.
Objectives: To understand how frequently highly cited studies are contradicted or find effects that are stronger than in other similar studies and to discern whether specific characteristics are associated with such refutation over time.
Design: All original clinical research studies published in 3 major general clinical journals or high-impact-factor specialty journals in 1990–2003 and cited more than 1000 times in the literature were examined.
Main Outcome Measure: The results of highly cited articles were compared against subsequent studies of comparable or larger sample size and similar or better controlled designs. The same analysis was also performed comparatively for matched studies that were not so highly cited.
Results: Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged. Five of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (p = 0.008). Among randomized trials, studies with contradicted or stronger effects were smaller (p = 0.009) than replicated or unchallenged studies although there was no statistically-significant difference in their early or overall citation impact. Matched control studies did not have a statistically-significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results.
Conclusions: Contradiction and initially stronger effects are not unusual in highly cited research of clinical interventions and their outcomes. The extent to which high citations may provoke contradictions and vice versa needs more study. Controversies are most common with highly cited nonrandomized studies, but even the most highly cited randomized trials may be challenged and refuted over time, especially small ones.
Controlled experiments, where subjects are randomly assigned to receive interventions, are desirable but frequently perceived to be infeasible or overly burdensome, especially in social settings. Therefore, nonexperimental (also called quasi-experimental) methods are often used instead. Quasi-experimental methods are less intrusive and sometimes less costly than controlled experiments, but their validity rests on particular assumptions that are often difficult to test. It is therefore important to find empirical evidence to assess the likelihood that a given method applied in a given context will yield unbiased estimates. The current study is a systematic review of validation research to better understand the conditions under which quasi-experimental methods most closely approximate the results that would be found in a well-designed and well-executed experimental study. We collect and summarize a set of earlier studies that each tried, using convenience samples and one or more quasi-experimental methods, to replicate the findings from a social experiment. Our synthesis aims to give both producers and consumers of social program evaluations a clear understanding of what we know and what we do not know about the performance of quasi-experimental evaluation methods.
Context: There is substantial debate about whether the results of nonrandomized studies are consistent with the results of randomized controlled trials on the same topic.
Objectives: To compare results of randomized and nonrandomized studies that evaluated medical interventions and to examine characteristics that may explain discrepancies between randomized and nonrandomized studies.
Data Sources: MEDLINE (1966–March 2000), the Cochrane Library (Issue 3, 2000), and major journals were searched.
Study Selection: Forty-five diverse topics were identified for which both randomized trials (n = 240) and nonrandomized studies (n = 168) had been performed and had been considered in meta-analyses of binary outcomes.
Data Extraction: Data on events per patient in each study arm and design and characteristics of each study considered in each meta-analysis were extracted and synthesized separately for randomized and nonrandomized studies.
Data Synthesis: Very good correlation was observed between the summary odds ratios of randomized and nonrandomized studies (r = 0.75; p < 0.001); however, nonrandomized studies tended to show larger treatment effects (larger in 28 topics vs 11; p = 0.009). Between-study heterogeneity was frequent among randomized trials alone (23%) and very frequent among nonrandomized studies alone (41%). The summary results of the 2 types of designs differed beyond chance in 7 cases (16%). Discrepancies beyond chance were less common when only prospective studies were considered (8%). Occasional differences in sample size and timing of publication were also noted between discrepant randomized and nonrandomized studies. In 28 cases (62%), the natural logarithm of the odds ratio differed by at least 50%, and in 15 cases (33%), the odds ratio varied at least 2-fold between nonrandomized studies and randomized trials.
Conclusions: Despite good correlation between randomized trials and nonrandomized studies—in particular, prospective studies—discrepancies beyond chance do occur and differences in estimated magnitude of treatment effect are very common.
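[For concreteness, the “natural logarithm of the odds ratio differed by at least 50%” criterion works as follows, with hypothetical odds ratios:]

```python
import math

or_rct, or_nonrct = 1.5, 2.0  # hypothetical summary odds ratios
log_rct, log_nonrct = math.log(or_rct), math.log(or_nonrct)
rel_diff = abs(log_nonrct - log_rct) / abs(log_rct)
print(f"ln(OR): {log_rct:.2f} vs {log_nonrct:.2f}; relative difference = {rel_diff:.0%}")
# -> ln(OR): 0.41 vs 0.69; relative difference = 71%, so this pair would count
```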
This article uses propensity score methods to estimate the treatment impact of the National Supported Work (NSW) Demonstration, a labor training program, on post-intervention earnings. We use data from Lalonde’s evaluation of nonexperimental methods that combine the treated units from a randomized evaluation of the NSW with nonexperimental comparison units drawn from survey datasets. We apply propensity score methods to this composite dataset and demonstrate that, relative to the estimators that Lalonde evaluates, propensity score estimates of the treatment impact are much closer to the experimental benchmark estimate. Propensity score methods assume that the variables associated with assignment to treatment are observed (referred to as ignorable treatment assignment, or selection on observables). Even under this assumption, it is difficult to control for differences between the treatment and comparison groups when they are dissimilar and when there are many pre-intervention variables. The estimated propensity score (the probability of assignment to treatment, conditional on pre-intervention variables) summarizes the pre-intervention variables. This offers a diagnostic on the comparability of the treatment and comparison groups, because one has only to compare the estimated propensity score across the two groups. We discuss several methods (such as stratification and matching) that use the propensity score to estimate the treatment impact. When the range of estimated propensity scores of the treatment and comparison groups overlap, these methods can estimate the treatment impact for the treatment group. A sensitivity analysis shows that our estimates are not sensitive to the specification of the estimated propensity score, but are sensitive to the assumption of selection on observables. We conclude that when the treatment and comparison groups overlap, and when the variables determining assignment to treatment are observed, these methods provide a means to estimate the treatment impact. Even though propensity score methods are not always applicable, they offer a diagnostic on the quality of nonexperimental comparison groups in terms of observable pre-intervention variables.
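[A minimal propensity-score matching sketch in the spirit of the approach described above; the covariates, outcome model, and effect size are all invented for illustration, not the NSW data:]

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"age": rng.normal(30, 8, n), "educ": rng.normal(11, 2, n)})
# Selection on observables: treatment depends only on the measured covariates;
# the outcome has a true treatment effect of 5.
p_treat = 1 / (1 + np.exp(-(-4 + 0.1 * df["age"] + 0.1 * df["educ"])))
df["treated"] = rng.binomial(1, p_treat)
df["earnings"] = (100 + 2 * df["age"] + 3 * df["educ"]
                  + 5 * df["treated"] + rng.normal(0, 5, n))

# 1. Estimate the propensity score P(treated | covariates).
ps_model = LogisticRegression().fit(df[["age", "educ"]], df["treated"])
df["pscore"] = ps_model.predict_proba(df[["age", "educ"]])[:, 1]

# 2. Match each treated unit to the comparison unit with the nearest score.
treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]
nearest = np.abs(control["pscore"].values[None, :]
                 - treated["pscore"].values[:, None]).argmin(axis=1)

# 3. The mean outcome difference across matched pairs estimates the effect.
effect = (treated["earnings"].values - control["earnings"].values[nearest]).mean()
print(f"estimated treatment effect: {effect:.1f}  (true effect: 5)")
```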
Evaluations of healthcare interventions can either randomise subjects to comparison groups, or not. In both designs there are potential threats to validity, which can be external (the extent to which they are generalisable to all potential recipients) or internal (whether differences in observed effects can be attributed to differences in the intervention). Randomisation should ensure that comparison groups of sufficient size differ only in their exposure to the intervention concerned. However, some investigators have argued that randomised controlled trials (RCTs) tend to exclude, consciously or otherwise, some types of patient to whom results will subsequently be applied. Furthermore, in unblinded trials the outcome of treatment may be influenced by practitioners’ and patients’ preferences for one or other intervention. Though non-randomised studies are less selective in terms of recruitment, they are subject to selection bias in allocation if treatment is related to initial prognosis.
Treatment effects obtained from randomised and non-randomised studies may differ, but one method does not give a consistently greater effect than the other
Treatment effects measured in each type of study best approximate when the exclusion criteria are the same and where potential prognostic factors are well understood and controlled for in the non-randomised studies
Subjects excluded from randomised controlled trials tend to have a worse prognosis than those included, and this limits generalisability
Subjects participating in randomised controlled trials evaluating treatment of existing conditions tend to be less affluent, educated, and healthy than those who do not participate; the opposite is true for trials of preventive interventions
In previous articles we have focused on the potentials, principles, and pitfalls of meta-analysis of randomised controlled trials. Meta-analysis of observational data is, however, also becoming common. In a MEDLINE search we identified 566 articles (excluding those published as letters) published in 1995 and indexed with the medical subject heading (MeSH) term “meta-analysis.” We randomly selected 100 of these articles and examined them further. Sixty articles reported on actual meta-analyses, and 40 were methodological papers, editorials, and traditional reviews (1). Among the meta-analyses, about half were based on observational studies, mainly cohort and case-control studies of medical interventions or aetiological associations.
Meta-analysis of observational studies is as common as meta-analysis of controlled trials
Confounding and selection bias often distort the findings from observational studies
There is a danger that meta-analyses of observational data produce very precise but equally spurious results
The statistical combination of data should therefore not be a prominent component of reviews of observational studies
More is gained by carefully examining possible sources of heterogeneity between the results from observational studies
Reviews of any type of research and data should use a systematic approach, which is documented in a materials and methods section
Objective: To summarise comparisons of randomised clinical trials and non-randomised clinical trials, trials with adequately concealed random allocation versus inadequately concealed random allocation, and high quality trials versus low quality trials where the effect of randomisation could not be separated from the effects of other methodological manoeuvres.
Design: Systematic review.
Selection Criteria: Cohorts or meta-analyses of clinical trials that included an empirical assessment of the relation between randomisation and estimates of effect.
Data Sources: Cochrane Review Methodology Database, Medline, SciSearch, bibliographies, hand searching of journals, personal communication with methodologists, and the reference lists of relevant articles.
Main Outcome Measures: Relation between randomisation and estimates of effect.
Results: Eleven studies that compared randomised controlled trials with non-randomised controlled trials (eight for evaluations of the same intervention and three across different interventions), two studies that compared trials with adequately concealed random allocation and inadequately concealed random allocation, and five studies that assessed the relation between quality scores and estimates of treatment effects, were identified. Failure to use random allocation and concealment of allocation were associated with relative increases in estimates of effects of 150% or more, relative decreases of up to 90%, inversion of the estimated effect and, in some cases, no difference. On average, failure to use randomisation or adequate concealment of allocation resulted in larger estimates of effect due to a poorer prognosis in non-randomly selected control groups compared with randomly selected control groups.
Conclusions: Failure to use adequately concealed random allocation can distort the apparent effects of care in either direction, causing the effects to seem either larger or smaller than they really are. The size of these distortions can be as large as or larger than the size of the effects that are to be detected.
Conventional reviews of research on the efficacy of psychological, educational, and behavioral treatments often find considerable variation in outcome among studies and, as a consequence, fail to reach firm conclusions about the overall effectiveness of the interventions in question. In contrast, meta-analysis reviews show a strong, dramatic pattern of positive overall effects that cannot readily be explained as artifacts of meta-analytic technique or generalized placebo effects. Moreover, the effects are not so small that they can be dismissed as lacking practical or clinical-significance. Although meta-analysis has limitations, there are good reasons to believe that its results are more credible than those of conventional reviews and to conclude that well-developed psychological, educational, and behavioral treatment is generally efficacious.
Therapeutic efficacy is often studied with observational surveys of patients whose treatments were selected non-experimentally. The results of these surveys are distrusted because of the fear that biased results occur in the absence of experimental principles, particularly randomization. The purpose of the current study was to develop and validate improved observational study designs by incorporating many of the design principles and patient assembly procedures of the randomized trial. The specific topic investigated was the prophylactic effectiveness of β-blocker therapy after an acute myocardial infarction.
To accomplish the research objective, three sets of data were compared. First, we developed a restricted cohort based on the eligibility criteria of the randomized clinical trial; second, we assembled an expanded cohort using the same design principles except for not restricting patient eligibility; and third, we used the data from the Beta Blocker Heart Attack Trial (BHAT), whose results served as the gold standard for comparison.
In this research, the treatment difference in death rates in the restricted cohort was nearly identical to that in the BHAT trial. In contrast, the expanded cohort had a larger treatment difference than was observed in the BHAT trial. We also noted the important and largely neglected role that eligibility criteria may play in ensuring the validity of treatment comparisons and study outcomes. The new methodological strategies we developed may improve the quality of observational studies and may be useful in assessing the efficacy of the many medical/surgical therapies that cannot be tested with randomized clinical trials.
This study investigates empirically the strengths and limitations of using experimental versus nonexperimental designs for evaluating employment and training programs. The assessment involves comparing results from an experimental-design study (the National Supported Work Demonstration) with the estimated impacts of Supported Work based on analyses using comparison groups constructed from the Current Population Surveys.
The results indicate that nonexperimental designs cannot be relied on to estimate the effectiveness of employment programs. Impact estimates tend to be sensitive both to the comparison group construction methodology and to the analytic model used. There is currently no way a priori to ensure that the results of comparison group studies will be valid indicators of the program impacts.
[Keywords: public assistance programs, analytical models, analytical estimating, employment, control groups, estimation methods, random sampling, human resources, public works legislation, statistical-significance]
This volume reports on a study of 850 pairs of twins who were tested to determine the influence of heredity and environment on individual differences in personality, ability, and interests. It presents the background, research design, and procedures of the study, a complete tabulation of the test results, and the authors’ extensive analysis of their findings. Based on one of the largest studies of twin behavior ever conducted, the book challenges a number of traditional beliefs about genetic and environmental contributions to personality development.
The subjects were chosen from participants in the National Merit Scholarship Qualifying Test of 1962 and were mailed a battery of personality and interest questionnaires. In addition, parents of the twins were sent questionnaires asking about the twins’ early experiences. A similar sample of nontwin students who had taken the merit exam provided a comparison group. The questions investigated included how twins are similar to or different from non-twins, how identical twins are similar to or different from fraternal twins, how the personalities and interests of twins reflect genetic factors, how the personalities and interests of twins reflect early environmental factors, and what implications these questions have for the general issue of how heredity and environment influence the development of psychological characteristics. In attempting to answer these questions, the authors shed new light on the importance of both genes and environment and have formed the basis for new approaches in behavior genetic research.
This paper presents results, mainly in tabular form, of a sampling experiment in which 100 economic time series 25 years long were drawn at random from the Historical Statistics for the United States. Sampling distributions of coefficients of correlation and autocorrelation were computed using these series, and their logarithms, with and without correction for linear trend.
We find that the frequency distribution of autocorrelation coefficients has the following properties:
It is roughly invariant under logarithmic transformation of data.
It is approximated by a Pearson Type XII function.
The autocorrelation properties observed are not to be explained by linear trends alone. Correlations and lagged cross-correlations are quite high for all classes of data: eg., given a randomly selected series, it is possible to find, by random drawing, another series which explains at least 50% of the variance of the first one in 2 to 6 random trials, depending on the class of data involved. The sampling distributions obtained provide a basis for tests of statistical-significance of correlations of economic time series. We also find that our economic series are well described by exact linear difference equations of low order.
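[The sampling experiment is easy to re-create in miniature. A sketch using 100 independent random walks of length 25 (illustrative only, not the authors’ data); the median number of random draws needed to find a partner series with r² ≥ 0.5 typically lands in the single digits, echoing the 2-6-trial figure:]

```python
import numpy as np

rng = np.random.default_rng(42)
n_series, T = 100, 25                    # 100 series, each 25 "years" long
series = np.cumsum(rng.normal(size=(n_series, T)), axis=1)  # random walks

draws_needed = []
for _ in range(500):                     # repeat the hunt from random targets
    t = rng.integers(n_series)
    for k in range(1, 1000):
        c = (t + rng.integers(1, n_series)) % n_series  # any series except t
        r = np.corrcoef(series[t], series[c])[0, 1]
        if r ** 2 >= 0.5:                # partner "explains" >=50% of variance
            draws_needed.append(k)
            break

print("median random draws needed:", int(np.median(draws_needed)))
```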
In this paper, I wish to examine a dogma of inferential procedure which, for psychologists at least, has attained the status of a religious conviction. The dogma to be scrutinized is the “null-hypothesis statistical-significance test” orthodoxy that passing statistical judgment on a scientific hypothesis by means of experimental observation is a decision procedure wherein one rejects or accepts a null hypothesis according to whether or not the value of a sample statistic yielded by an experiment falls within a certain predetermined “rejection region” of its possible values. The thesis to be advanced is that despite the awesome preeminence this method has attained in our experimental journals and textbooks of applied statistics, it is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.
“A pigeon is brought to a stable state of hunger by reducing it to 75% of its weight when well fed. It is put into an experimental cage for a few minutes each day. A food hopper attached to the cage may be swung into place so that the pigeon can eat from it. A solenoid and a timing relay hold the hopper in place for 5 sec. at each reinforcement. If a clock is now arranged to present the food hopper at regular intervals with no reference whatsoever to the bird’s behavior, operant conditioning usually takes place.” The bird tends to learn whatever response it is making when the hopper appears. The response may be extinguished and reconditioned. “The experiment might be said to demonstrate a sort of superstition. The bird behaves as if there were a causal relation between its behavior and the presentation of food, although such a relation is lacking.”
This paper describes the development of relatively independent measures for 3 types of Introversion-Extroversion: Thinking, Social, and Emotional. The need for clarifying the concept of I-E and for devising new inventories can best be understood by reviewing the confusion concerning its nature and measurement. In the effort to simplify the original complex description of I-E by Jung, psychologists either have introduced new concepts or emphasized varying phases of Jung’s definition. In this process of elaboration, they have actually complicated rather than clarified the idea of I-E. The use of these terms in the popular literature has only added to the confusion. Unfortunately, introversion, at least in the popular writings on psychology, has come to denote an undesirable personality tendency which borders on a neurotic condition.
In general, the available I-E inventories purport to measure a general, undifferentiated trait. However, the intercorrelations between the published inventories are surprisingly low. Only 5 of the 19 coefficients of intercorrelation reported in the literature for nine inventories are above 0.40, and only 2 are above 0.80. The 2 coefficients above 0.80 are between 2 inventories and revised forms of these same inventories.
…This study has reduced the confusion in the field of measurement of I-E by getting away from the general undifferentiated concept of I-E. An inventory was constructed to measure, not a general trait, but 3 types or phases of I-E which were clearly defined. By a simple technique of item analysis, 3 homogeneous and relatively independent I-E tests were developed. Each test seems to be sufficiently reliable for individual prediction. The demonstrated ability of each test to discriminate between groups of college students which one would logically expect to be characteristically different in a given type of I-E justifies the conclusion that each test is sufficiently valid for the inventory to be employed in the diagnosis and counseling of college students.
[Egon Pearson describes Student, or Gosset, as a statistician: Student corresponded widely with young statisticians/mathematicians, encouraging them, and having an outsized influence not reflected in his publications. Student’s preferred statistical tools were remarkably simple, focused on correlations and standard deviations, but wielded effectively in the analysis and efficient design of experiments (particularly agricultural experiments), and he was an early decision-theorist, focused on practical problems connected to his Guinness Brewery job—a detachment from academia which partially explains why he didn’t publish methods or results immediately or often. The need to handle the small n of the brewery led to his work on small-sample approximations rather than, like Pearson et al in the Galton biometric tradition, relying on collecting large datasets and using asymptotic methods, and Student carried out one of the first Monte Carlo simulations.]