2021-mcbee.pdf: “Challenging the Link Between Early Childhood Television Exposure and Later Attention Problems: A Multiverse Approach”, (2021-03-25):
In 2004, Christakis and colleagues published findings that he and others used to argue for a link between early childhood television exposure and later attention problems, a claim that continues to be frequently promoted by the popular media. Using the same National Longitudinal Survey of Youth 1979 data set (n = 2,108), we conducted two multiverse analyses to examine whether the finding reported by Christakis and colleagues was robust to different analytic choices. We evaluated 848 models, including logistic regression models, linear regression models, and two forms of propensity-score analysis. If the claim were true, we would expect most of the justifiable analyses to produce significant results in the predicted direction. However, only 166 models (19.6%) yielded a statistically-significant relationship, and most of these employed questionable analytic choices. We concluded that these data do not provide compelling evidence of a harmful effect of TV exposure on attention.
2021-huntingtonklein.pdf: “The influence of hidden researcher decisions in applied microeconomics”, (2021-03-22):
Researchers make hundreds of decisions about data collection, preparation, and analysis in their research. We use a many-analysts approach to measure the extent and impact of these decisions.
Two published causal empirical results are replicated by 7 replicators each. We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No 2 replicators reported the same sample size. Statistical-significance varied across replications, and for 1 of the studies the effect’s sign varied as well. The standard deviation of estimates across replications was 3–4 times the mean reported standard error.
2021-lishner.pdf: “Sorting the File Drawer: A Typology for Describing Unpublished Studies”, (2021-03-01):
A typology of unpublished studies is presented to describe various types of unpublished studies and the reasons for their nonpublication. Reasons for nonpublication are classified by whether they stem from an awareness of the study results (result-dependent reasons) or not (result-independent reasons) and whether the reasons affect the publication decisions of individual researchers or reviewers/editors. I argue that result-independent reasons for nonpublication are less likely to introduce motivated reasoning into the publication decision process than are result-dependent reasons. I also argue that some reasons for nonpublication would produce beneficial as opposed to problematic publication bias. The typology of unpublished studies provides a descriptive scheme that can facilitate understanding of the population of study results across the field of psychology, within subdisciplines of psychology, or within specific psychology research domains. The typology also offers insight into different publication biases and research-dissemination practices and can guide individual researchers in organizing their own file drawers of unpublished studies.
2021-broers.pdf: “When the Numbers Do Not Add Up: The Practical Limits of Stochastologicals for Soft Psychology”, (2021-01-22):
One particular weakness of psychology that was left implicit by Meehl is the fact that psychological theories tend to be verbal theories, permitting at best ordinal predictions. Such predictions do not enable the high-risk tests that would strengthen our belief in the verisimilitude of theories but instead lead to the practice of null-hypothesis statistical-significance testing, a practice Meehl believed to be a major reason for the slow theoretical progress of soft psychology. The rising popularity of meta-analysis has led some to argue that we should move away from statistical-significance testing and focus on the size and stability of effects instead. Proponents of this reform assume that a greater emphasis on quantity can help psychology to develop a cumulative body of knowledge. The crucial question in this endeavor is whether the resulting numbers really have theoretical meaning. Psychological science lacks an undisputed, preexisting domain of observations analogous to the observations in the space-time continuum in physics. It is argued that, for this reason, effect sizes do not really exist independently of the adopted research design that led to their manifestation. Consequently, they can have no bearing on the verisimilitude of a theory.
2021-berkman.pdf: “So Useful as a Good Theory? The Practicality Crisis in (Social) Psychological Theory”, (2021-01-07):
Practicality was a valued attribute of academic psychological theory during its initial decades, but usefulness has since faded in importance to the field. Theories are now evaluated mainly on their ability to account for decontextualized laboratory data and not their ability to help solve societal problems. With laudable exceptions in the clinical, intergroup, and health domains, most psychological theories have little relevance to people’s everyday lives, poor accessibility to policymakers, or even applicability to the work of other academics who are better positioned to translate the theories to the practical realm. We refer to the lack of relevance, accessibility, and applicability of psychological theory to the rest of society as the practicality crisis. The practicality crisis harms the field in its ability to attract the next generation of scholars and maintain viability at the national level. We describe practical theory and illustrate its use in the field of self-regulation. Psychological theory is historically and scientifically well positioned to become useful should scholars in the field decide to value practicality. We offer a set of incentives to encourage the return of social psychology to the Lewinian vision of a useful science that speaks to pressing social issues.
[The unusually large chasm between the social sciences and their practical applications has been noted before. Focusing specifically on social psychology, Berkman and Wilson (2021) grade 360 articles published in the top-cited journal of the field over a five-year period on various criteria of practical import, generally finding quite low levels of “practicality” in the published research. For example, their average grade for the extent to which published papers offered actionable steps to address a specific problem was just 0.9 out of 4. They also look at the publication criteria of ten top journals; while all of them highlight the importance of original work that contributes to scientific progress, only 2 ask for even a brief statement of the public importance of the work.]
2020-ebersole.pdf: “Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability”, (2020-11-13):
Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically-significant effect (p < 0.05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = 0.002 or 0.014, depending on analytic approach). The median effect size for the revised protocols (r = 0.05) was similar to that of the RP:P protocols (r = 0.04) and the original RP:P replications (r = 0.11), and smaller than that of the original studies (r = 0.37).
Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = 0.07, range = 0.00–0.15) were 78% smaller, on average, than the original effect sizes (median r = 0.37, range = 0.19–0.50).
[Keywords: replication, reproducibility, metascience, peer review, Registered Reports, open data, preregistered]
2020-artner.pdf: “The reproducibility of statistical results in psychological research: An investigation using unpublished raw data”, (2020-11-12):
We investigated the reproducibility of the major statistical conclusions drawn in 46 articles published in 2012 in three APA journals. After having identified 232 key statistical claims, we tried to reproduce, for each claim, the test statistic, its degrees of freedom, and the corresponding p-value, starting from the raw data that were provided by the authors and closely following the Method section in the article. Out of the 232 claims, we were able to successfully reproduce 163 (70%), 18 of which only by deviating from the article’s analytical description. Thirteen (7%) of the 185 claims deemed statistically-significant by the authors are no longer so. The reproduction successes were often the result of cumbersome and time-consuming trial-and-error work, suggesting that APA style reporting in conjunction with raw data makes numerical verification at least hard, if not impossible. This article discusses the types of mistakes we could identify and the tediousness of our reproduction efforts in the light of a newly developed taxonomy for reproducibility. We then link our findings with other findings of empirical research on this topic, give practical recommendations on how to achieve reproducibility, and discuss the challenges of large-scale reproducibility checks as well as promising ideas that could considerably increase the reproducibility of psychological research.
2020-mcabe.pdf: “Cite Unseen: Theory and Evidence on the Effect of Open Access on Cites to Academic Articles Across the Quality Spectrum”, (2020-11-01):
Our previous paper (McCabe and Snyder 2014) contained the provocative result that, despite a positive average effect, open access reduces cites to some articles, in particular those published in lower-tier journals. We propose a model in which open access leads more readers to acquire the full text, yielding more cites from some, but fewer cites from those who would have cited the article based on superficial knowledge but who refrain once they learn that the article is a bad match. We test the theory with data for over 200,000 science articles binned by cites received during a pre-study period. Consistent with the theory, the marginal effect of open access is negative for the least-cited articles, positive for the most cited, and generally monotonic for quality levels in between. Also consistent with the theory is a magnification of these effects for articles placed on PubMed Central, one of the broadest open-access platforms, and the differential pattern of results for cites from insiders versus outsiders to the article’s field.
2020-lilienfeld.pdf: “Psychological measurement and the replication crisis: Four sacred cows”, (2020-11-01):
Although there are surely multiple contributors to the replication crisis in psychology, one largely unappreciated source is a neglect of basic principles of measurement. We consider 4 sacred cows—widely shared and rarely questioned assumptions—in psychological measurement that may fuel the replicability crisis by contributing to questionable measurement practices. These 4 sacred cows are:
1. we can safely rely on the name of a measure to infer its content;
2. reliability is not a major concern for laboratory measures;
3. using measures that are difficult to collect obviates the need for large sample sizes; and
4. convergent validity data afford sufficient evidence for construct validity.
For items #1 and #4, we provide provisional data from recent psychological journals that support our assertion that such beliefs are prevalent among authors.
To enhance the replicability of psychological science, researchers will need to become vigilant against erroneous assumptions regarding both the psychometric properties of their measures and the implications of these psychometric properties for their studies.
[Keywords: discriminant validity, experimental replication, measurement, psychological assessment, sample size, construct validity, convergent validity, experimental laboratories, test reliability, statistical power]
2020-olssoncollentine.pdf: “Heterogeneity in direct replications in psychology and its association with effect size”, (2020-10-01):
Impact Statement: This article suggests that for direct replications in social and cognitive psychology research, small variations in design (sample settings and population) are an unlikely explanation for differences in findings of studies. Differences in findings of direct replications are particularly unlikely if the overall effect is (close to) 0, whereas these differences are more likely if the overall effect is larger.
We examined the evidence for heterogeneity (of effect sizes) when only minor changes to sample population and settings were made between studies and explored the association between heterogeneity and average effect size in a sample of 68 meta-analyses from 13 preregistered multilab direct replication projects in social and cognitive psychology. Among the many examined effects, examples include the Stroop effect, the ‘verbal overshadowing’ effect, and various priming effects such as ‘anchoring’ effects. We found limited heterogeneity; 48⁄68 (71%) meta-analyses had nonsignificant heterogeneity, and most (49⁄68; 72%) were most likely to have zero to small heterogeneity. Power to detect small heterogeneity (as defined by Higgins, Thompson, Deeks, & Altman, 2003) was low for all projects (mean 43%), but good to excellent for medium and large heterogeneity. Our findings thus show little evidence of widespread heterogeneity in direct replication studies in social and cognitive psychology, suggesting that minor changes in sample population and settings are unlikely to affect research outcomes in these fields of psychology. We also found strong correlations between observed average effect sizes (standardized mean differences and log odds ratios) and heterogeneity in our sample. Our results suggest that heterogeneity and moderation of effects is unlikely for a 0 average true effect size, but increasingly likely for larger average true effect size.
[Keywords: heterogeneity, meta-analysis, psychology, direct replication, many labs]
2020-boulesteix.pdf: “A replication crisis in methodological research?”, (2020-09-29):
Statisticians have been keen to critique statistical aspects of the “replication crisis” in other scientific disciplines. But new statistical tools are often published and promoted without any thought to replicability. This needs to change, argue Anne-Laure Boulesteix, Sabine Hoffmann, Alethea Charlton and Heidi Seibold.
2020-serragarcia.pdf: “Can Short Psychological Interventions Affect Educational Performance? Revisiting the Effect of Self-Affirmation Interventions”, (2020-07-01):
Large amounts of resources are spent annually to improve educational achievement and to close the gender gap in sciences with typically very modest effects. In 2010, a 15-min self-affirmation intervention showed a dramatic reduction in this gender gap. We reanalyzed the original data and found several critical problems. First, the self-affirmation hypothesis stated that women’s performance would improve. However, the data showed no improvement for women. There was an interaction effect between self-affirmation and gender caused by a negative effect on men’s performance. Second, the findings were based on covariate-adjusted interaction effects, which imply that self-affirmation reduced the gender gap only for the small sample of men and women who did not differ in the covariates. Third, specification-curve analyses with more than 1,500 possible specifications showed that less than one quarter yielded statistically-significant interaction effects and less than 3% showed significant improvements among women.
2020-harder.pdf: “The Multiverse of Methods: Extending the Multiverse Analysis to Address Data-Collection Decisions”, (2020-06-29):
When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying the approach to 19 studies on shooting decisions to demonstrate the usefulness of this approach and conclude with a further discussion of the limitations and applications of this method.
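[A minimal sketch of the classic multiverse procedure described above: enumerate every combination of defensible data decisions, build the corresponding data set, and rerun the same analysis on each. It does not implement the author's data-collection extension, and the toy records, the decision names (`exclusion`, `coding`), the cutoff values, and the `effect` statistic are hypothetical, chosen only for illustration.]

```python
from itertools import product
from statistics import mean

# Hypothetical toy records: (group, score, reaction_time_ms).
RECORDS = [
    ("control", 4.0, 350), ("control", 5.0, 420), ("control", 6.0, 1900),
    ("treated", 6.0, 380), ("treated", 7.0, 450), ("treated", 2.0, 2100),
]

# Each decision offers several defensible options; each option is a
# function that turns one list of records into another.
DECISIONS = {
    "exclusion": {
        "keep_all": lambda rows: rows,
        "drop_slow": lambda rows: [r for r in rows if r[2] < 1500],
    },
    "coding": {
        "raw": lambda rows: rows,
        "capped": lambda rows: [(g, min(s, 6.0), rt) for g, s, rt in rows],
    },
}


def effect(rows):
    """Difference in mean score, treated minus control."""
    treated = [s for g, s, _ in rows if g == "treated"]
    control = [s for g, s, _ in rows if g == "control"]
    return mean(treated) - mean(control)


def multiverse(records, decisions):
    """Run the same analysis on every combination of decision options."""
    names = list(decisions)
    results = {}
    for combo in product(*(decisions[name] for name in names)):
        rows = records
        # Apply the chosen option for each decision, in declaration order.
        for name, option in zip(names, combo):
            rows = decisions[name][option](rows)
        results[combo] = effect(rows)
    return results


RESULTS = multiverse(RECORDS, DECISIONS)

if __name__ == "__main__":
    for combo, estimate in RESULTS.items():
        print(combo, round(estimate, 3))
```

[In this toy example the estimate ranges from slightly negative to clearly positive depending only on the exclusion and coding choices, which is the kind of sensitivity a multiverse analysis is meant to expose.]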
2020-griggs.pdf: “New Revelations About Rosenhan’s Pseudopatient Study: Scientific Integrity in Remission”, (2020-06-11):
David Rosenhan’s pseudopatient study is one of the most famous studies in psychology, but it is also one of the most criticized studies in psychology. Almost 50 years after its publication, it is still discussed in psychology textbooks, but the extensive body of criticism is not, likely leading teachers not to present the study as the contentious classic that it is. New revelations by Susannah Cahalan (2019), based on her years of investigation of the study and her analysis of the study’s archival materials, question the validity and veracity of both Rosenhan’s study and his reporting of it as well as Rosenhan’s scientific integrity. Because many (if not most) teachers are likely not aware of Cahalan’s findings, we provide a summary of her main findings so that if they still opt to cover Rosenhan’s study, they can do so more accurately. Because these findings are related to scientific integrity, we think that they are best discussed in the context of research ethics and methods. To aid teachers in this task, we provide some suggestions for such discussions.
[ToC: Rosenhan’s Misrepresentations of the Pseudopatient Script and His Medical Record · Selective Reporting of Data · Rosenhan’s Failure to Prepare and Protect Other Pseudopatients · Reporting Questionable Data and Possibly Pseudo-Pseudopatients · Concluding Remarks · Footnotes · References]
2020-cowan.pdf: “How Do Scientific Views Change? Notes From an Extended Adversarial Collaboration”, (2020-06-08):
There are few examples of an extended adversarial collaboration, in which investigators committed to different theoretical views collaborate to test opposing predictions. Whereas previous adversarial collaborations have produced single research articles, here, we share our experience in programmatic, extended adversarial collaboration involving three laboratories in different countries with different theoretical views regarding working memory, the limited information retained in mind, serving ongoing thought and action. We have focused on short-term memory retention of items (letters) during a distracting task (arithmetic), and effects of aging on these tasks. Over several years, we have conducted and published joint research with preregistered predictions, methods, and analysis plans, with replication of each study across two laboratories concurrently. We argue that, although an adversarial collaboration will not usually induce senior researchers to abandon favored theoretical views and adopt opposing views, it will necessitate varieties of their views that are more similar to one another, in that they must account for a growing, common corpus of evidence. This approach promotes understanding of others’ views and presents to the field research findings accepted as valid by researchers with opposing interpretations. We illustrate this process with our own research experiences and make recommendations applicable to diverse scientific areas.
[Keywords: scientific method, adversarial collaboration, scientific views, changing views, working memory]
2020-elliott.pdf: “What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis”, (2020-06-07):
Identifying brain biomarkers of disease risk is a growing priority in neuroscience. The ability to identify meaningful biomarkers is limited by measurement reliability; unreliable measures are unsuitable for predicting clinical outcomes. Measuring brain activity using task functional MRI (fMRI) is a major focus of biomarker development; however, the reliability of task fMRI has not been systematically evaluated. We present converging evidence demonstrating poor reliability of task-fMRI measures. First, a meta-analysis of 90 experiments (n = 1,008) revealed poor overall reliability—mean intraclass correlation coefficient (ICC) = 0.397. Second, the test-retest reliabilities of activity in a priori regions of interest across 11 common fMRI tasks collected by the Human Connectome Project (n = 45) and the Dunedin Study (n = 20) were poor (ICCs = 0.067–0.485). Collectively, these findings demonstrate that common task-fMRI measures are not currently suitable for brain biomarker discovery or for individual-differences research. We review how this state of affairs came to be and highlight avenues for improving task-fMRI reliability.
2020-voekl.pdf: “Reproducibility of animal research in light of biological variation”, (2020-06-02):
Context-dependent biological variation presents a unique challenge to the reproducibility of results in experimental animal research, because organisms’ responses to experimental treatments can vary with both genotype and environmental conditions. In March 2019, experts in animal biology, experimental design and statistics convened in Blonay, Switzerland, to discuss strategies addressing this challenge.
In contrast to the current gold standard of rigorous standardization in experimental animal research, we recommend the use of systematic heterogenization of study samples and conditions by actively incorporating biological variation into study design through diversifying study samples and conditions.
Here we provide the scientific rationale for this approach in the hope that researchers, regulators, funders and editors can embrace this paradigm shift. We also present a road map towards better practices in view of improving the reproducibility of animal research.
2020-botviniknezer.pdf: “Variability in the analysis of a single neuroimaging dataset by many teams”, (2020-05-20):
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a statistically-significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of statistically-significant findings, even by researchers with direct knowledge of the dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
2020-peters.pdf: “Ideological diversity, hostility, and discrimination in philosophy”, (2020-04-16):
Members of the field of philosophy have, just as other people, political convictions or, as psychologists call them, ideologies. How are different ideologies distributed and perceived in the field? Using the familiar distinction between the political left and right, we surveyed an international sample of 794 subjects in philosophy. We found that survey participants clearly leaned left (75%), while right-leaning individuals (14%) and moderates (11%) were underrepresented. Moreover, and strikingly, across the political spectrum from very left-leaning individuals and moderates to very right-leaning individuals, participants reported experiencing ideological hostility in the field, occasionally even from those on their own side of the political spectrum. Finally, while about half of the subjects believed that discrimination against left-leaning or right-leaning individuals in the field is not justified, a substantial minority displayed an explicit willingness to discriminate against colleagues with the opposite ideology. Our findings are both surprising and important because a commitment to tolerance and equality is widespread in philosophy, and there is reason to think that ideological similarity, hostility, and discrimination undermine reliable belief formation in many areas of the discipline.
[Keywords: Ideological bias, diversity, demographics.]
2020-gelman.pdf: “Statistics as Squid Ink: How Prominent Researchers Can Get Away with Misrepresenting Data”, Andrew Gelman, Alexey Guzey
2020-devito.pdf: “Compliance with legal requirement to report clinical trial results on ClinicalTrials.gov: a cohort study”, (2020-01-17):
Background: Failure to report the results of a clinical trial can distort the evidence base for clinical practice, breaches researchers’ ethical obligations to participants, and represents an important source of research waste. The Food and Drug Administration Amendments Act (FDAAA) of 2007 now requires sponsors of applicable trials to report their results directly onto ClinicalTrials.gov within 1 year of completion. The first trials covered by the Final Rule of this act became due to report results in January, 2018. In this cohort study, we set out to assess compliance.
Methods: We downloaded data for all registered trials on ClinicalTrials.gov each month from March, 2018, to September, 2019. All cross-sectional analyses in this manuscript were performed on data extracted from ClinicalTrials.gov on Sept 16, 2019; monthly trends analysis used archived data closest to the 15th day of each month from March, 2018, to September, 2019. Our study cohort included all applicable trials due to report results under FDAAA. We excluded all non-applicable trials, those not yet due to report, and those given a certificate allowing for delayed reporting. A trial was considered reported if results had been submitted and were either publicly available, or undergoing quality control review at ClinicalTrials.gov. A trial was considered compliant if these results were submitted within 1 year of the primary completion date, as required by the legislation. We described compliance with the FDAAA 2007 Final Rule, assessed trial characteristics associated with results reporting using logistic regression models, described sponsor-level reporting, examined trends in reporting, and described time-to-report using the Kaplan-Meier method.
Findings: 4209 trials were due to report results; 1722 (40·9%; 95% CI 39·4–42·2) did so within the 1-year deadline. 2686 (63·8%; 62·4–65·3) trials had results submitted at any time. Compliance has not improved since July, 2018. Industry sponsors were statistically-significantly more likely to be compliant than non-industry, non-US Government sponsors (odds ratio [OR] 3·08 [95% CI 2·52–3·77]), and sponsors running large numbers of trials were statistically-significantly more likely to be compliant than smaller sponsors (OR 11·84 [9·36–14·99]). The median delay from primary completion date to submission date was 424 days (95% CI 412–435), 59 days higher than the legal reporting requirement of 1 year.
Interpretation: Compliance with the FDAAA 2007 is poor, and not improving. To our knowledge, this is the first study to fully assess compliance with the Final Rule of the FDAAA 2007. Poor compliance is likely to reflect lack of enforcement by regulators. Effective enforcement and action from sponsors is needed; until then, open public audit of compliance for each individual sponsor may help. We will maintain updated compliance data for each individual sponsor and trial at fdaaa.trialstracker.net.
Funding: Laura and John Arnold Foundation.
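[A small arithmetic check of the headline figures in the Findings paragraph above, using only the counts quoted there. The variable names are illustrative, and treating the 1-year deadline as 365 days is an assumption.]

```python
# Counts quoted in the Findings paragraph.
TRIALS_DUE = 4209
REPORTED_ON_TIME = 1722
REPORTED_EVER = 2686
MEDIAN_DELAY_DAYS = 424
DEADLINE_DAYS = 365  # assumption: the 1-year deadline counted as 365 days


def percent(part, whole):
    """Share of `whole` made up by `part`, as a percentage to one decimal."""
    return round(100 * part / whole, 1)


ON_TIME_PCT = percent(REPORTED_ON_TIME, TRIALS_DUE)
EVER_PCT = percent(REPORTED_EVER, TRIALS_DUE)
DAYS_OVER_DEADLINE = MEDIAN_DELAY_DAYS - DEADLINE_DAYS

if __name__ == "__main__":
    print(ON_TIME_PCT, EVER_PCT, DAYS_OVER_DEADLINE)
```

[Under these assumptions the recomputed values match the reported 40·9%, 63·8%, and 59 days.]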
2020-silander.pdf: “Implications of ideological bias in social psychology on clinical practice”, (2020-01-14):
Ideological bias is a worsening but often neglected concern for social and psychological sciences, affecting a range of professional activities and relationships, from self-reported willingness to discriminate to the promotion of ideologically saturated and scientifically questionable research constructs. Though clinical psychologists co-produce and apply social psychological research, little is known about its impact on the profession of clinical psychology.
Following a brief review of relevant topics, such as “concept creep” and the importance of the psychotherapeutic relationship, the relevance of ideological bias to clinical psychology, counterarguments and a rebuttal, clinical applications, and potential solutions are presented. For providing empathic and multiculturally competent clinical services, in accordance with professional ethics, psychologists would benefit from treating ideological diversity as another professionally recognized diversity area.
[See also “Political Diversity Will Improve Social Psychological Science”, Duarte et al 2015.]
2019-kvarven.pdf: “Comparing meta-analyses and preregistered multiple-laboratory replication projects”, (2019-12-23):
Many researchers rely on meta-analysis to summarize research evidence. However, there is a concern that publication bias and selective reporting may lead to biased meta-analytic effect sizes. We compare the results of meta-analyses to large-scale preregistered replications in psychology carried out at multiple laboratories. The multiple-laboratory replications provide precisely estimated effect sizes that do not suffer from publication bias or selective reporting. We searched the literature and identified 15 meta-analyses on the same topics as multiple-laboratory replications. We find that meta-analytic effect sizes are statistically-significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost 3 times as large as replication effect sizes. We also implement 3 methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.
2018-camerer.pdf: “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015”, (2018-08-27):
Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a statistically-significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.
2018-wood.pdf: “The Elusive Backfire Effect: Mass Attitudes’ Steadfast Factual Adherence”, Thomas Wood, Ethan Porter
2017-bianco.pdf: “Knowing What We Are Getting: Evaluating Scientific Research on the International Space Station”, (2017-12-26):
The debate over the value of the International Space Station has overlooked a fundamental question: What is the station’s contribution to scientific knowledge? We address this question using a multivariate analysis of publication and patent data from station experiments. We find a relatively high probability that ISS experiments with PIs drawn from outside NASA will yield refereed publications and, furthermore, that these experiments have non-negligible probabilities of finding publication in high-impact journals or producing government patents. However, technology demonstrations and experiments with all-NASA PIs have much weaker track records. These results highlight the complexities inherent to constructing a compelling case for science onboard the ISS or for crewed spaceflight in general.
2017-sigut.pdf: “Avoiding erroneous citations in ecological research: read before you apply”, (2017-04-24):
The Shannon-Wiener index is a popular nonparametric metric widely used in ecological research as a measure of species diversity. We used the Web of Science database to examine cases where papers published from 1990 to 2015 mislabeled this index. We provide detailed insights into causes potentially affecting use of the wrong name ‘Weaver’ instead of the correct ‘Wiener’. Basic science serves as a fundamental information source for applied research, so we emphasize the effect of the type of research (applied or basic) on the incidence of the error. Biological research, especially applied studies, increasingly uses indices, even though some researchers have strongly criticized their use. Applied research papers had a higher frequency of the wrong index name than did basic research papers. The mislabeling frequency decreased in both categories over the 25-year period, although the decrease lagged in applied research. Moreover, the index use and mistake proportion differed by region and authors’ countries of origin. Our study also provides insight into citation culture, and results suggest that almost 50% of authors have not actually read their cited sources. Applied research scientists in particular should be more cautious during manuscript preparation, carefully select sources from basic research, and read theoretical background articles before they apply the theories to their research. Moreover, theoretical ecologists should liaise with applied researchers and present their research for the broader scientific community. Researchers should point out known, often-repeated errors and phenomena not only in specialized books and journals but also in widely used and fundamental literature.
2016-kausel.pdf: “Overconfidence in personnel selection: When and why unstructured interview information can hurt hiring decisions”, (2016-11-01):
- Individuals responsible for hiring decisions participated in two studies.
- We manipulated the information presented to them.
- Information about unstructured interviews boosted overconfidence.
- A third study showed that overconfidence was linked to fewer payoffs.
- In the presence of valid predictors, unstructured interviews can hurt hiring decisions.
Overconfidence is an important bias related to the ability to recognize the limits of one’s knowledge.
The present study examines overconfidence in predictions of job performance for participants presented with information about candidates based solely on standardized tests versus those who also were presented with unstructured interview information. We conducted two studies with individuals responsible for hiring decisions. Results showed that individuals presented with interview information exhibited more overconfidence than individuals presented with test scores only. In a third study, consisting of a betting competition for undergraduate students, larger overconfidence was related to fewer payoffs.
These combined results emphasize the importance of studying confidence and decision-related variables in selection decisions. Furthermore, while previous research has shown that the predictive validity of unstructured interviews is low, this study provides compelling evidence that they not only fail to help personnel selection decisions, but can actually hurt them.
[Keywords: judgment and decision making, behavioral decision theory, overconfidence, hiring decisions, personnel selection, human resource management, Conscientiousness, General Mental Ability, unstructured interviews, evidence-based management]
2016-zschirnt.pdf: “Ethnic discrimination in hiring decisions: a meta-analysis of correspondence tests 1990–2015”, (2016-01-22):
For almost 50 years field experiments have been used to study ethnic and racial discrimination in hiring decisions, consistently reporting high rates of discrimination against minority applicants—including immigrants—irrespective of time, location, or minority groups tested. While Peter A. Riach and Judith Rich [2002. “Field Experiments of Discrimination in the Market Place.” The Economic Journal 112 (483): F480–F518] and Judith Rich [2014. “What Do Field Experiments of Discrimination in Markets Tell Us? A Meta Analysis of Studies Conducted since 2000.” In Discussion Paper Series. Bonn: IZA] provide systematic reviews of existing field experiments, no study has undertaken a meta-analysis to examine the findings in the studies reported. In this article, we present a meta-analysis of 738 correspondence tests in 43 separate studies conducted in OECD countries between 1990 and 2015. In addition to summarising research findings, we focus on groups of specific tests to ascertain the robustness of findings, emphasising differences across countries, gender, and economic contexts. Moreover we examine patterns of discrimination, by drawing on the fact that the groups considered in correspondence tests and the contexts of testing vary to some extent. We focus on first-generation and second-generation immigrants, differences between specific minority groups, the implementation of EU directives, and the length of job application packs.
[Keywords: Ethnic discrimination, hiring, correspondence test, meta-analysis, immigration]
2016-boudreau.pdf: “Looking Across and Looking Beyond the Knowledge Frontier: Intellectual Distance, Novelty, and Resource Allocation in Science”, (2016-01-08):
Selecting among alternative projects is a core management task in all innovating organizations. In this paper, we focus on the evaluation of frontier scientific research projects. We argue that the “intellectual distance” between the knowledge embodied in research proposals and an evaluator’s own expertise systematically relates to the evaluations given. To estimate relationships, we designed and executed a grant proposal process at a leading research university in which we randomized the assignment of evaluators and proposals to generate 2,130 evaluator-proposal pairs. We find that evaluators systematically give lower scores to research proposals that are closer to their own areas of expertise and to those that are highly novel. The patterns are consistent with biases associated with boundedly rational evaluation of new ideas. The patterns are inconsistent with intellectual distance simply contributing “noise” or being associated with private interests of evaluators. We discuss implications for policy, managerial intervention, and allocation of resources in the ongoing accumulation of scientific knowledge.
2016-pica.pdf: “PEDS_20160223.indd”
2016-lane.pdf: “Is there a publication bias in behavioral intranasal oxytocin research on humans? Opening the file drawer of one lab”, A. Lane, O. Luminet, G. Nave, M. Mikolajczak
2015-opensciencecollaboration.pdf: “Estimating the reproducibility of psychological science”, (2015-08-28):
Empirically analyzing empirical evidence: One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al 2015 describe the replication of 100 experiments reported in papers published in 2008 in 3 high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study.
Introduction: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.
Rationale: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.
Results: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using statistical-significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. 97% of original studies had statistically-significant results (p < 0.05). 36% of replications had statistically-significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically-significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Conclusion: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.
Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
2014-raven.pdf: “The Corrupted Epidemiological Evidence Base of Psychiatry: A Key Driver of Overdiagnosis”, Melissa Raven
2014-mccabe.pdf: “Identifying The Effect Of Open Access On Citations Using A Panel Of Science Journals”, (2014-02-20):
An open-access journal allows free online access to its articles, obtaining revenue from fees charged to submitting authors or from institutional support. Using panel data on science journals, we are able to circumvent problems plaguing previous studies of the impact of open access on citations. In contrast to the huge effects found in these previous studies, we find a more modest effect: moving from paid to open access increases cites by 8% on average in our sample. The benefit is concentrated among top-ranked journals. In fact, open access causes a statistically-significant reduction in cites to the bottom-ranked journals in our sample, leading us to conjecture that open access may intensify competition among articles for readers’ attention, generating losers as well as winners. [See also the 2020 followup by the same authors, “Cite Unseen: Theory and Evidence on the Effect of Open Access on Cites to Academic Articles Across the Quality Spectrum”.]
2014-andreoliversbach.pdf: “Open Access to Data: An Ideal Professed but Not Practised”, (2014):
Data-sharing is an essential tool for replication, validation and extension of empirical results. Using a hand-collected data set describing the data-sharing behaviour of 488 randomly selected empirical researchers, we provide evidence that most researchers in economics and management do not share their data voluntarily. We derive testable hypotheses based on the theoretical literature on information-sharing and relate data-sharing to observable characteristics of researchers. We find empirical support for the hypotheses that voluntary data-sharing statistically-significantly increases with (a) academic tenure, (b) the quality of researchers, (c) the share of published articles subject to a mandatory data-disclosure policy of journals, and (d) personal attitudes towards “open science” principles. On the basis of our empirical evidence, we discuss a set of policy recommendations.
2013-dana.pdf: “Belief in the unstructured interview: The persistence of an illusion”, (2013-09-01):
Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy.
In three studies, we investigated the propensity for “sensemaking”—the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution”—the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview.
People form confident impressions even when interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.
[Keywords: unstructured interview, random interview, clinical judgment, actuarial judgment]
2013-ioannidis.pdf: “What's to know about the credibility of empirical economics?”, (2013):
The scientific credibility of economics is itself a scientific question that can be addressed with both theoretical speculations and empirical data. In this review, we examine the major parameters that are expected to affect the credibility of empirical economics: sample size, magnitude of pursued effects, number and pre-selection of tested relationships, flexibility and lack of standardization in designs, definitions, outcomes and analyses, financial and other interests and prejudices, and the multiplicity and fragmentation of efforts. We summarize and discuss the empirical evidence on the lack of a robust reproducibility culture in economics and business research, the prevalence of potential publication and other selective reporting biases, and other failures and biases in the market of scientific information. Overall, the credibility of the economics literature is likely to be modest or even low.
[Keywords: Bias; Credibility; Economics; Meta-research; Replication; Reproducibility]
2012-masicampo.pdf: “A peculiar prevalence of p values just below 0.05”, (2012-11-01):
In null hypothesis statistical-significance testing (NHST), p-values are judged relative to an arbitrary threshold for statistical-significance (0.05). The present work examined whether that standard influences the distribution of p-values reported in the psychology literature.
We examined a large subset of papers from 3 highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p-values were much more common immediately below 0.05 than would be expected based on the number of p-values occurring in other ranges. This prevalence of p-values just below the arbitrary criterion for statistical-significance was observed in all 3 journals.
We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.
2012-tinsley.pdf: “How Near-Miss Events Amplify or Attenuate Risky Decision Making”, (2012-04-18):
In the aftermath of many natural and man-made disasters, people often wonder why those affected were underprepared, especially when the disaster was the result of known or regularly occurring hazards (e.g., hurricanes). We study one contributing factor: prior near-miss experiences. Near misses are events that have some nontrivial expectation of ending in disaster but, by chance, do not. We demonstrate that when near misses are interpreted as disasters that did not occur, people illegitimately underestimate the danger of subsequent hazardous situations and make riskier decisions (e.g., choosing not to engage in mitigation activities for the potential hazard). On the other hand, if near misses can be recognized and interpreted as disasters that almost happened, this will counter the basic “near-miss” effect and encourage more mitigation. We illustrate the robustness of this pattern across populations with varying levels of real expertise with hazards and different hazard contexts (household evacuation for a hurricane, Caribbean cruises during hurricane season, and deep-water oil drilling). We conclude with ideas to help people manage and communicate about risk.
[Keywords: near miss; risk; decision making; natural disasters; organizational hazards; hurricanes; oil spills.]
2012-fanelli.pdf: “Negative results are disappearing from most disciplines and countries”, (2011-09-11):
Concerns that the growing competition for funding and citations might distort science are frequently discussed, but have not been verified directly. Of the hypothesized problems, perhaps the most worrying is a worsening of positive-outcome bias. A system that disfavours negative results not only distorts the scientific literature directly, but might also discourage high-risk projects and pressure scientists to fabricate and falsify their data. This study analysed over 4,600 papers published in all disciplines between 1990 and 2007, measuring the frequency of papers that, having declared to have “tested” a hypothesis, reported a positive support for it. The overall frequency of positive supports has grown by over 22% between 1990 and 2007, with statistically-significant differences between disciplines and countries. The increase was stronger in the social and some biomedical disciplines. The United States had published, over the years, statistically-significantly fewer positive results than Asian countries (and particularly Japan) but more than European countries (and in particular the United Kingdom). Methodological artefacts cannot explain away these patterns, which support the hypotheses that research is becoming less pioneering and/or that the objectivity with which results are produced and published is decreasing.
2011-tatum.pdf: “Artifact and Recording Concepts in EEG”, (2011-06):
Artifact is present when electrical potentials that are not brain derived are recorded on the EEG and is commonly encountered during interpretation.
Many artifacts obscure the tracing, while others reflect physiologic functions that are crucial for routine visual analysis. Both physiologic and nonphysiologic sources of artifact may act as a source of confusion with abnormality and lead to misinterpretation. Identifying the mismatch between potentials that are generated by the brain and activity that does not conform to a realistic head model is the foundation for recognizing artifact. Electroencephalographers are challenged with the task of correct interpretation among the many artifacts that could potentially be misleading, resulting in an incorrect diagnosis and treatment that may adversely impact patient care. Despite advances in digital EEG, artifact identification, recognition, and elimination are essential for correct interpretation of the EEG.
The authors discuss recording concepts for interpreting EEG that contains artifact.
1987-raopalmer-psi.pdf: “On rustles, wolf interpretations, and other wild speculations”, David Navon
2009-ljungqvist.pdf: “Rewriting History”, (2009-07-16):
We document widespread changes to the historical I/B/E/S analyst stock recommendations database. Across seven I/B/E/S downloads, obtained between 2000 and 2007, we find that between 6,580 (1.6%) and 97,582 (21.7%) of matched observations are different from one download to the next. The changes include alterations of recommendations, additions and deletions of records, and removal of analyst names. These changes are nonrandom, clustering by analyst reputation, broker size and status, and recommendation boldness, and affect trading signal classifications and back‐tests of three stylized facts: profitability of trading signals, profitability of consensus recommendation changes, and persistence in individual analyst stock‐picking ability.
2008-charlton.pdf: “Figureheads, ghost-writers and pseudonymous quant bloggers: The recent evolution of authorship in science publishing”, Bruce G. Charlton
2008-gerber.pdf: “Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results?”, (2008-08-01):
Despite great attention to the quality of research methods in individual studies, if publication decisions of journals are a function of the statistical-significance of research findings, the published literature as a whole may not produce accurate measures of true effects.
This article examines the 2 most prominent sociology journals (the American Sociological Review and the American Journal of Sociology) and another important though less influential journal (The Sociological Quarterly) for evidence of publication bias. The effect of the 0.05 statistical-significance level on the pattern of published findings is examined using a “caliper” test, and the hypothesis of no publication bias can be rejected at approximately the 1 in 10 million level.
Findings suggest that some of the results reported in leading sociology journals may be misleading and inaccurate due to publication bias. Some reasons for publication bias and proposed reforms to reduce its impact on research are also discussed.
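The logic of a “caliper” test can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes an absolute caliper half-width (`width`, a hypothetical parameter) around the critical value 1.96, counts the test statistics falling just above and just below it, and asks how surprising that split would be if either side were equally likely. The function name and the example z-values are made up for illustration.

```python
from math import comb

def caliper_test(z_values, critical=1.96, width=0.2):
    """Return (n_over, n_under, one-sided binomial p-value).

    Counts |z| statistics in the narrow bands just above and just below
    `critical`. Absent publication bias, a statistic landing in the
    caliper should be about equally likely to fall on either side.
    """
    zs = [abs(z) for z in z_values]
    n_over = sum(1 for z in zs if critical <= z < critical + width)
    n_under = sum(1 for z in zs if critical - width <= z < critical)
    n = n_over + n_under
    # Exact P(X >= n_over) for X ~ Binomial(n, 0.5).
    p_value = sum(comb(n, k) for k in range(n_over, n + 1)) / 2 ** n
    return n_over, n_under, p_value

# Hypothetical z-statistics: nine just over the threshold, one just under,
# plus two far from the threshold that the caliper ignores.
example = [2.0] * 9 + [1.9] + [0.5, 3.0]
print(caliper_test(example))
```

A small p-value here means that far more results sit just above the threshold than just below it, which is the pattern the abstract attributes to publication bias.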
2008-falk.pdf: “The allure of equality: Uniformity in probabilistic and statistical judgment”, Ruma Falk, Avital Lann
2007-simkin.pdf: “A mathematical theory of citing”, (2007-07-13):
Recently we proposed a model in which when a scientist writes a manuscript, he picks up several random papers, cites them, and also copies a fraction of their references. The model was stimulated by our finding that a majority of scientific citations are copied from the lists of references used in other papers. It accounted quantitatively for several properties of the empirically observed distribution of citations; however, important features such as power-law distributions of citations to papers published during the same year and the fact that the average rate of citing decreases with aging of a paper were not accounted for by that model. Here, we propose a modified model: When a scientist writes a manuscript, he picks up several random recent papers, cites them, and also copies some of their references. The difference with the original model is the word recent. We solve the model using methods of the theory of branching processes, and find that it can explain the aforementioned features of citation distribution, which our original model could not account for. The model also can explain “sleeping beauties in science;” that is, papers that are little cited for a decade or so and later “awaken” and get many citations. Although much can be understood from purely random models, we find that to obtain a good quantitative agreement with empirical citation data, one must introduce a Darwinian fitness parameter for the papers.
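The modified model is simple enough to simulate directly. The sketch below is only an illustration under stated assumptions, not the paper's branching-process solution: “recent” is taken to mean “published in the previous year”, each copied reference is copied independently with probability `p_copy`, and all parameter names and default values are hypothetical. No fitness parameter is included.

```python
import random

def simulate_citations(years=20, papers_per_year=100, n_picked=3,
                       p_copy=0.25, seed=1):
    """Simulate the 'cite recent papers and copy their references' model.

    Returns (citations, references): citations[i] is the number of times
    paper i was cited; references[i] is the set of papers that i cites.
    """
    rng = random.Random(seed)
    references, citations = [], []
    previous_year = []
    for _ in range(years):
        this_year = []
        for _ in range(papers_per_year):
            paper_id = len(references)
            refs = set()
            if previous_year:  # first-year papers have nothing to cite
                for source in rng.sample(previous_year, n_picked):
                    refs.add(source)  # cite the recent paper itself
                    # ...and copy each of its references with prob. p_copy
                    for ref in sorted(references[source]):
                        if rng.random() < p_copy:
                            refs.add(ref)
            references.append(refs)
            citations.append(0)
            for ref in refs:
                citations[ref] += 1
            this_year.append(paper_id)
        previous_year = this_year
    return citations, references

citations, references = simulate_citations()
mean_cites = sum(citations) / len(citations)
print("papers:", len(citations))
print("mean citations:", round(mean_cites, 2), "max citations:", max(citations))
```

Because older papers can only gain citations by being copied from reference lists, an already well-cited paper has more chances to be copied again, which is what produces the skewed citation counts.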
2008-scherer.pdf: “Full publication of results initially presented in abstracts”, (2007):
Studies initially reported as conference abstracts that have positive results are subsequently published as full-length journal articles more often than studies with negative results.
Less than half of all studies, and about 60% of randomized or controlled clinical trials, initially presented as summaries or abstracts at professional meetings are subsequently published as peer-reviewed journal articles. An important factor appearing to influence whether a study described in an abstract is published in full is the presence of ‘positive’ results in the abstract. Thus, the efforts of persons trying to collect all of the evidence in a field may be stymied, first by the failure of investigators to take abstract study results to full publication, and second, by the tendency to take to full publication only those studies reporting ‘significant’ results. The consequence of this is that systematic reviews will tend to over-estimate treatment effects.
Background: Abstracts of presentations at scientific meetings are usually available only in conference proceedings. If subsequent full publication of abstract results is based on the magnitude or direction of study results, publication bias may result. Publication bias, in turn, creates problems for those conducting systematic reviews or relying on the published literature for evidence.
Objectives: To determine the rate at which abstract results are subsequently published in full, and the time between meeting presentation and full publication.
Search methods: We searched MEDLINE, EMBASE, The Cochrane Library, Science Citation Index, reference lists, and author files. Date of most recent search: June 2003.
Selection criteria: We included all reports that examined the subsequent full publication rate of biomedical results initially presented as abstracts or in summary form. Follow-up of abstracts had to be at least two years.
Data collection and analysis: Two reviewers extracted data. We calculated the weighted mean full publication rate and time to full publication. Dichotomous variables were analyzed using relative risk and random effects models. We assessed time to publication using Kaplan-Meier survival analyses.
Main results: Combining data from 79 reports (29,729 abstracts) resulted in a weighted mean full publication rate of 44.5% (95% confidence interval (CI) 43.9 to 45.1). Survival analyses resulted in an estimated publication rate at 9 years of 52.6% for all studies, 63.1% for randomized or controlled clinical trials, and 49.3% for other types of study designs.
‘Positive’ results defined as any ‘significant’ result showed an association with full publication (RR = 1.30; CI 1.14 to 1.47), as did ‘positive’ results defined as a result favoring the experimental treatment (RR = 1.17; CI 1.02 to 1.35), and ‘positive’ results emanating from randomized or controlled clinical trials (RR = 1.18, CI 1.07 to 1.30).
Other factors associated with full publication include oral presentation (RR = 1.28; CI 1.09 to 1.49); acceptance for meeting presentation (RR = 1.78; CI 1.50 to 2.12); randomized trial study design (RR = 1.24; CI 1.14 to 1.36); and basic research (RR = 0.79; CI 0.70 to 0.89). Higher quality of abstracts describing randomized or controlled clinical trials was also associated with full publication (RR = 1.30, CI 1.00 to 1.71).
Authors’ conclusions: Only 63% of results from abstracts describing randomized or controlled clinical trials are published in full. ‘Positive’ results were more frequently published than not ‘positive’ results.
2005-jussim.pdf: “Teacher expectations and self-fulfilling prophecies: knowns and unknowns, resolved and unresolved controversies”, (2005):
This article shows that 35 years of empirical research on teacher expectations justifies the following conclusions: (a) Self-fulfilling prophecies in the classroom do occur, but these effects are typically small, they do not accumulate greatly across perceivers or over time, and they may be more likely to dissipate than accumulate; (b) powerful self-fulfilling prophecies may selectively occur among students from stigmatized social groups; (c) whether self-fulfilling prophecies affect intelligence, and whether they in general do more harm than good, remains unclear, and (d) teacher expectations may predict student outcomes more because these expectations are accurate than because they are self-fulfilling. Implications for future research, the role of self-fulfilling prophecies in social problems, and perspectives emphasizing the power of erroneous beliefs to create social reality are discussed.
[Jussim discusses the famous ‘Pygmalion effect’. It demonstrates the replication crisis: an initial extraordinary finding indicating that teachers could raise student IQs by dozens of points gradually shrank over repeated replications to essentially zero net long-term effect. The original finding was driven by statistical malpractice bordering on research fraud: some students had “pretest IQ scores near zero, and others had post-test IQ scores over 200”! Rosenthal further maintained the Pygmalion effect by statistical trickery, such as his ‘fail-safe N’, which attempted to show that hundreds of null studies would have to have gone unpublished in order for the Pygmalion effect to be spurious; but this assumes zero publication bias in those unpublished studies and begs the question.]
2004-brett.pdf: “When is a correlation between non-independent variables 'spurious'?”, (2004-05-14):
Correlations which are artifacts of various types of data transformations can be said to be spurious. This study considers four common types of analyses where the X and Y variables are not independent; these include regressions of the form X⁄Z vs Y⁄Z, X×Z vs Y×Z, X vs Y⁄X, and X+Y vs Y.
These analyses were carried out using a series of Monte Carlo simulations while varying sample size and sample variability. The impact of disparities in variability between the shared and non-shared terms and measurement error for the shared term on the magnitude of the spurious correlations was also considered. The accuracy of equations previously derived to predict the magnitude of spurious correlations was also assessed.
These results show the risk of producing spurious correlations when analyzing non-independent variables is very large. Spurious correlations occurred in all cases assessed, the mean spurious coefficient of determination (R²) frequently exceeded 0.50, and in some cases the 90% confidence interval for these simulations included all large R² values. The magnitude of spurious correlations was sensitive to differences in the variability of the shared and non-shared terms, with large spurious correlations obtained when the variability for the shared term was larger. Sample size had only a modest impact on the magnitude of spurious correlations. When measurement error for the shared variable was smaller than one half the coefficient of variation for that variable, which is generally the case, the measurement error did not generate large spurious correlations.
The equations available to predict expected spurious correlations provided accurate predictions for the case of X×Z vs Y×Z, variable predictions for the case of X vs Y⁄X, and poor predictions for most cases of X⁄Z vs Y⁄Z, and X+Y vs Y.
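The X⁄Z vs Y⁄Z case can be illustrated with a small Monte Carlo sketch. This is not the paper's simulation design: the uniform distributions, their ranges, the sample size, and the function names `pearson_r` and `simulate_ratio_r2` are all assumptions chosen for illustration. X, Y, and Z are drawn independently, so any correlation between the two ratios comes only from the shared divisor.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_ratio_r2(n=2000, z_halfwidth=50.0, seed=0):
    """Return (R^2 of X vs Y, R^2 of X/Z vs Y/Z) for independent X, Y, Z.

    Assumed setup: X and Y are uniform on [90, 110]; the shared term Z is
    uniform on [100 - z_halfwidth, 100 + z_halfwidth], so z_halfwidth
    controls how variable Z is relative to X and Y.
    """
    rng = random.Random(seed)
    x = [rng.uniform(90.0, 110.0) for _ in range(n)]
    y = [rng.uniform(90.0, 110.0) for _ in range(n)]
    z = [rng.uniform(100.0 - z_halfwidth, 100.0 + z_halfwidth) for _ in range(n)]
    raw_r2 = pearson_r(x, y) ** 2
    ratio_r2 = pearson_r(
        [a / c for a, c in zip(x, z)],
        [b / c for b, c in zip(y, z)],
    ) ** 2
    return raw_r2, ratio_r2


# Shared term much more variable than X and Y: large spurious R^2.
print(simulate_ratio_r2(z_halfwidth=50.0))
# Shared term much less variable than X and Y: little spurious correlation.
print(simulate_ratio_r2(z_halfwidth=2.0))
```

Under these assumed settings the raw X vs Y correlation stays near zero, while dividing both by a highly variable Z yields a large R², in line with the abstract's statement that spurious correlations are largest when the shared term is the more variable one.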
1997-matthews.pdf: “The Science of Murphy’s Law: Life’s little annoyances are not as random as they seem: the awful truth is that the universe is against you”, (1997-04-01; ):
[Popularization of Matthews’s other articles on physics & statistics and what truth there is to “Murphy’s law”:
Toast falling butter side down: true, because tables are not high enough for toast to be likely to complete one or more rotations before landing, given the tilt it acquires as it falls off the edge; therefore toast will in fact tend to land on its top (buttered) half.
Maps putting things on edges: true, because on a printed paper map the place one wants to go will tend to be toward an edge; this is simply a geometric fact, since most of a map’s area lies near its edges.
Other Checkout Lines Being Faster: true, because of anthropics, as most waiting time is spent in the slowest line; even if the lanes are equally loaded, order statistics show there is only a 1 in n chance that one picked the fastest line out of n lanes.
Mismatched Socks: also true, simply because there are many more ways for one sock of a pair to go missing than for socks to go missing in complete pairs or still match up.
Raining: forecasts are fairly accurate, but this ignores base-rates and that much of that accuracy is due to predicting non-rain. It’s a version of the diagnostics/
For example, suppose that the hourly base rate of rain is 0.1, meaning that it is 10 times more likely not to rain during your hour-long stroll. Probability theory then shows that even an 80% accurate forecast of rain is twice as likely to prove wrong as right during your walk—and you’ll end up taking an umbrella unnecessarily. The fact is that even today’s apparently highly accurate forecasts are still not good enough to predict rare events reliably.
Keywords: socks, maps, umbrellas, family law, combinatorics, weather forecasting, technology law, mathematical constants, physics, probability theory]
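The umbrella arithmetic in the rain example can be checked with Bayes’ rule. The sketch below makes one assumption the excerpt does not spell out: that “80% accurate” means the forecast is right 80% of the time both when it rains and when it does not (so the false-alarm rate is 20%). The function name `p_rain_given_forecast` is illustrative only.

```python
def p_rain_given_forecast(base_rate, hit_rate, false_alarm_rate):
    """Probability that it rains, given that rain was forecast (Bayes' rule)."""
    true_positives = base_rate * hit_rate
    false_positives = (1.0 - base_rate) * false_alarm_rate
    return true_positives / (true_positives + false_positives)


# Hourly base rate of rain 0.1; assumed 80% accuracy in both directions.
p_right = p_rain_given_forecast(base_rate=0.1, hit_rate=0.8, false_alarm_rate=0.2)
odds_wrong_to_right = (1.0 - p_right) / p_right

print(round(p_right, 3))              # roughly 0.308
print(round(odds_wrong_to_right, 2))  # roughly 2.25
```

Under that assumption a forecast of rain is right only about 31% of the time, so it is a little more than twice as likely to be wrong as right, matching the quoted claim.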
1997-schwartz.pdf: “The Rise and Fall of Uncitedness”, (1997-01-01; ):
Large-scale uncitedness refers to the remarkable proportion of articles that do not receive a single citation within five years of publication. Equally remarkable is the brief and troubled history of this area of inquiry, which was prone to miscalculation, misinterpretation, and politicization. This article reassesses large-scale uncitedness as both a general phenomenon in the scholarly communication system and a case study of library and information science, where its rate is 72%.
1993-ernhart.pdf: “On Being a Whistleblower: The Needleman Case”, (1993; ):
We believe that members of the scientific community have a primary obligation to promote integrity in research and that this obligation includes a duty to report observations that suggest misconduct to agencies that are empowered to examine and evaluate such evidence. Consonant with this responsibility, we became whistleblowers in the case of Herbert Needleman. His 1979 study (Needleman et al 1979), on the effects of low-level lead exposure on children, is widely cited and highly influential in the formulation of public policy on lead. The opportunity we had to examine subject selection and data analyses from this study was prematurely halted by efforts to prevent disclosure of our observations. Nevertheless, what we saw left us with serious concerns. We hope that the events here summarized will contribute to revisions of process by which allegations of scientific misconduct are handled and that such revisions will result in less damage to scientists who speak out.
1992-rogers.pdf: “How a Publicity Blitz Created The Myth of Subliminal Advertising”, (1992-12-01; ):
[‘Subliminal advertising’ was the Cambridge Analytica of the 1950s.] In September 1957, I began what to me was a serious study of contemporary applied psychology at Hofstra College in Hempstead, Long Island. At exactly the same time, in nearby New York City, an unemployed market researcher named James M. Vicary made a startling announcement based on research in high-speed photography later popularized by Eastman Kodak Company.
His persuasive sales pitch was that consumers would comprehend information projected at 1/60,000th of a second, although they could not literally “see” the flash. And he sent a news release to the major media announcing his “discovery”.
…And, as a follow-up, toward the end of 1957 Vicary invited 50 reporters to a film studio in New York where he projected some motion picture footage, and claimed that he had also projected a subliminal message. He then handed out another of his well written and nicely printed news releases claiming that he had actually conducted major research on how an invisible image could cause people to buy something even if they didn’t want to.
The release said that in an unidentified motion picture theater a “scientific test” had been conducted in which 45,699 persons unknowingly had been exposed to 2 advertising messages projected subliminally on alternate nights. One message, the release claimed, had advised the moviegoers to “Eat Popcorn” while the other had read “Drink Coca-Cola.”
…Vicary swore that the invisible advertising had increased sales of popcorn an average of 57.5%, and increased the sales of Coca-Cola an average of 18.1%. No explanation was offered for the difference in size of the percentages, no allowance was made for variations in attendance, and no other details were provided as to how or under what conditions the purported tests had been conducted. Vicary got off the hook for his lack of specificity by stating that the research information formed part of his patent application for the projection device, and therefore must remain secret. He assured the media, however, that what he called “sound statistical controls” had been employed in the theater test. At least as importantly, too, he had observed the proven propagandist’s ploy of using odd numbers, and also including a decimal in a percentage. The figures 57.5 and 18.1% rang with a clear tone of Truth.
…When I learned of Vicary’s claim, I made the short drive to Fort Lee to learn first-hand about his clearly remarkable experiment.
The size of that small-town theater suggested it should have taken considerably longer than 6 weeks to complete a test of nearly 50,000 movie patrons. But even more perplexing was the response of the theater manager to my eager questioning. He declared that no such test had ever been conducted at his theater.
There went my term paper for my psychology class.
Soon after my disappointment, Motion Picture Daily reported that the same theater manager had sworn to one of its reporters that there had been no effect on refreshment stand patronage, whether a test had been conducted or not—a rather curious form of denial, I think.
…Technological Impossibility: Vicary also informed the reporters that subliminal advertising would have its “biggest initial impact” on the television medium.
When I learned of this, I visited the engineering section of RCA…I was assured by their helpful and knowledgeable engineering liaison man that, because of the time required for an electron beam to scan the surface of a television picture tube, and the persistence of the phosphor glow, it was technologically impossible to project a television image faster than the human eye could perceive.
“In a nighttime scene on television, watch the way the image of a car’s headlights lingers; that’s called comet-tailing”, the engineer explained. “See how long it takes before the headlights fade away.” Clearly there was no way that even the slower tachistoscope speeds of 1/3,000th of a second that Vicary had begun talking about in early 1958 could work on contemporary television.
…It has been estimated he collected retainer and consulting fees from America’s largest advertisers totaling some $4.5 million in 1958 dollars [$34.16 million inflation-adjusted], about $22.5 million in today’s [1992] dollars [$51.46 million inflation-adjusted].
Then, some time in June 1958, Mr. Vicary disappeared from the New York marketing scene, reportedly leaving no bank accounts, no clothes in his closet, and no hint as to where he might have gone. The big advertisers, apparently ashamed of having been fooled by such an obvious scam, have said nothing since about subliminal advertising, except to deny that they have ever used it.
1989-moed.pdf: “Possible inaccuracies occurring in citation analysis”, (1989-04-01; ):
Citation analysis of scientific articles constitutes an important tool in quantitative studies of science and technology. Moreover, citation indexes are used frequently in searches for relevant scientific documents. In this article we focus on the issue of reliability of citation analysis. How accurate are citation counts to individual scientific articles? What pitfalls might occur in the process of data collection? To what extent do ‘random’ or ‘systematic’ errors affect the results of the citation analysis?
We present a detailed analysis of discrepancies between target articles and cited references with respect to author names, publication year, volume number, and starting page number. Our data consist of some 4500 target articles published in five scientific journals, and 25000 citations to these articles. Both target and citation data were obtained from the Science Citation Index, produced by the Institute for Scientific Information.
It appears that in many cases a specific error in a citation to a particular target article occurs in more than one citing publication. We present evidence that authors, in compiling reference lists, may copy references from reference lists in other articles, and that this may be one of the mechanisms underlying this phenomenon of ‘multiple’ variations/
1989-omni-walterstewartinterview.pdf: “Walter Stewart: Fighting Fraud in Science (They call him the 'terrorist of the lab', but this self-appointed scourge of scientific fraud has reason to suspect that as much as 25 percent of all research papers may be intentionally fudged) [interview]”, Walter Stewart, Doug Stewart ( )
1989-diaconis.pdf: “Methods for Studying Coincidences”, (1989-01-01; ):
This article illustrates basic statistical techniques for studying coincidences. These include data-gathering methods (informal anecdotes, case studies, observational studies, and experiments) and methods of analysis (exploratory and confirmatory data analysis, special analytic techniques, and probabilistic modeling, both general and special purpose). We develop a version of the birthday problem general enough to include dependence, inhomogeneity, and almost and multiple matches. We review Fisher’s techniques for giving partial credit for close matches. We develop a model for studying coincidences involving newly learned words. Once we set aside coincidences having apparent causes, four principles account for large numbers of remaining coincidences: hidden cause; psychology, including memory and perception; multiplicity of endpoints, including the counting of “close” or nearly alike events as if they were identical; and the law of truly large numbers, which says that when enormous numbers of events and people and their interactions cumulate over time, almost any outrageous event is bound to occur. These sources account for much of the force of synchronicity.
[Keywords: birthday problems, extrasensory perception, Jung, Kammerer, multiple endpoints, rare events, synchronicity]
…Because of our different reading habits, we readers are exposed to the same words at different observed rates, even when the long-run rates are the same. Some words will appear relatively early in your experience, some relatively late. More than half will appear before their expected time of appearance, probably more than 60% of them if we use the exponential model, so the appearance of new words is like a Poisson process. On the other hand, some words will take more than twice the average time to appear, about 1⁄7 of them (1⁄e²) in the exponential model. They will look rarer than they actually are. Furthermore, their average time to reappearance is less than half that of their observed first appearance, and about 10% of those that took at least twice as long as they should have to occur will appear in less than 1⁄20 of the time they originally took to appear. The model we are using supposes an exponential waiting time to first occurrence of events. The phenomenon that accounts for part of this variable behavior of the words is of course the regression effect.
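The exponential-model fractions quoted in this excerpt follow directly from the exponential survival function. A minimal sketch, with waiting times measured in units of the mean; the function names are illustrative, and the last figure is evaluated only at the boundary case of a first wait of exactly twice the mean (longer first waits give larger values):

```python
import math


def frac_before_mean():
    """P(T < mean) for an exponential waiting time: 1 - 1/e."""
    return 1.0 - math.exp(-1.0)


def frac_after_twice_mean():
    """P(T > 2 * mean) for an exponential waiting time: 1/e^2."""
    return math.exp(-2.0)


def frac_reappear_within(fraction_of_first_wait, first_wait_in_means=2.0):
    """P(next wait < fraction * first wait), given the first wait's length.

    Exponential waits are memoryless, so the next wait does not depend on
    the first one; only the length of the comparison window does.
    """
    window = fraction_of_first_wait * first_wait_in_means
    return 1.0 - math.exp(-window)


print(round(frac_before_mean(), 3))             # about 0.632, "more than 60%"
print(round(frac_after_twice_mean(), 3))        # about 0.135, close to 1/7
print(round(frac_reappear_within(1 / 20), 3))   # about 0.095, "about 10%"
```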
…We now extend the model. Suppose that we are somewhat more complicated creatures, that we require k exposures to notice a word for the first time, and that k is itself a Poisson random variable…Then, the mean time until the word is noticed is (𝜆 + 1)T, where T is the average time between actual occurrences of the word. The variance of the time is (2𝜆 + 1)T². Suppose T = 1 year and 𝜆 = 4. Then, as an approximation, 5% of the words will take at least time [𝜆 + 1 + 1.65 (2𝜆 + 1)^(1⁄2)]T or about 10 years to be detected the first time. Assume further that, now that you are sensitized, you will detect the word the next time it appears. On the average it will be a year, but about 3% of these words that were so slow to be detected the first time will appear within a month by natural variation alone. So what took 10 years to happen once happens again within a month. No wonder we are astonished. One of our graduate students learned the word “formication” on a Friday and read part of this manuscript the next Sunday, two days later, illustrating the effect and providing an anecdote. Here, sensitizing the individual, the regression effect, and the recall of notable events and the non-recall of humdrum events produce a situation where coincidences are noted with much higher than their expected frequency. This model can explain vast numbers of seeming coincidences.
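The mean, variance, and 95% cutoff in the extended model can be reproduced numerically. The excerpt elides part of the setup, so this sketch assumes one reading that matches the quoted mean (𝜆 + 1)T: the word is noticed after 1 + Poisson(𝜆) occurrences, each separated by an independent exponential wait with mean T. The function names and the simulation check are illustrative additions, not the authors’ code.

```python
import math
import random


def notice_time_stats(lam, T):
    """Mean, variance, and normal-approximation 95% cutoff from the excerpt."""
    mean = (lam + 1.0) * T
    variance = (2.0 * lam + 1.0) * T ** 2
    cutoff_95 = mean + 1.65 * math.sqrt(variance)
    return mean, variance, cutoff_95


def poisson_draw(rng, lam):
    """Poisson(lam) sample via Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_notice_times(lam, T, n=20000, seed=0):
    """Simulate times to first notice under the assumed 1 + Poisson(lam) reading."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        exposures = 1 + poisson_draw(rng, lam)
        times.append(sum(rng.expovariate(1.0 / T) for _ in range(exposures)))
    return times


mean, variance, cutoff_95 = notice_time_stats(lam=4.0, T=1.0)
print(mean, variance, round(cutoff_95, 2))  # mean 5 years, variance 9, cutoff near 10 years

sim = simulate_notice_times(lam=4.0, T=1.0)
sim_mean = sum(sim) / len(sim)
sim_var = sum((t - sim_mean) ** 2 for t in sim) / (len(sim) - 1)
print(round(sim_mean, 2), round(sim_var, 2))  # should land close to the formulas above
```

With T = 1 year and 𝜆 = 4 the formulas give a mean of 5 years, a variance of 9, and a cutoff of about 10 years, matching the figures in the excerpt; the simulation is only a sanity check on the assumed reading of the model.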
1987-eichorn.pdf: “Do authors check their references? A survey of accuracy of references in 3 public health journals”, (1987-08-01; ):
We verified a random sample of 50 references in the May 1986 issue of each of 3 public health journals.
31% of the 150 references had citation errors, one out of 10 being a major error (reference not locatable). 30% of the references differed from authors’ use of them with half being a major error (cited paper not related to author’s contention).
1983-broadus.pdf: “An investigation of the validity of bibliographic citations”, Robert N. Broadus ( )
1983-jensen-2.pdf: “Taboo, constraint, and responsibility in educational research”, Arthur R. Jensen
1976-lando.pdf: “On being sane in insane places: A supplemental report”, (1976-01-01; ):
Describes the author’s experiences as a pseudo-patient on the psychiatric ward of a large public hospital for 19 days. Hospital facilities were judged excellent, and therapy tended to be extensive. Close contact with both patients and staff was obtained. Despite this contact, however, not only was the author’s simulation not detected, but his behavior was seen as consistent with the admitting diagnosis of “chronic undifferentiated schizophrenia.” Even with this misattribution it is concluded that the present institution had many positive aspects and that the depersonalization of patients so strongly emphasized by D. Rosenhan (see record 1973-21600-001) did not exist in this setting. It is recommended that future research address positive characteristics of existing institutions and possibly emulate these in upgrading psychiatric care.
…I was the ninth pseudopatient in the Rosenhan study, and my data were not included in the original report.
1976-schmidt.pdf: “Critical Analysis of the Statistical and Ethical Implications of Various Definitions of 'Test Bias'”, (1976; ):
Three mutually incompatible ethical positions in regard to the fair and unbiased use of psychological tests for different groups such as Blacks and Whites are defined, including unqualified and qualified individualism and quotas. 5 statistical definitions of test bias are also reviewed and are related to the ethical judgments. Each definition is critically examined for its weaknesses on either technical or social grounds. It is argued that a statistical attempt to define fair use without recourse to substantive and causal analysis is doomed to failure.
1976-rosenthal-experimenterexpectancyeffects.pdf: “Experimenter Effects in Behavioral Research: Enlarged Edition”, (1976; ):
Within the context of a general discussion of the unintended effects of scientists on the results of their research, this work reported on the growing evidence that the hypothesis of the behavioral scientist could come to serve as self-fulfilling prophecy, by means of subtle processes of communication between the experimenter and the human or animal research subject. [The Science Citation Index (SCI) and the Social Sciences Citation Index (SSCI) indicate that the book has been cited over 740 times since 1966 [as of 1979].] —“Citation Classic”
[Enlarged Edition, expanded with discussion of the Pygmalion effect etc: ISBN 0-470-01391-5]
1973-gergen.pdf: “Social Psychology as History”, (1973-01-01; ):
Presents an analysis of theory and research in social psychology which reveals that while methods of research are scientific in character, theories of social behavior are primarily reflections of contemporary history. The dissemination of psychological knowledge modifies the patterns of behavior upon which the knowledge is based. This modification occurs because of the prescriptive bias of psychological theorizing, the liberating effects of knowledge, and the resistance based on common values of freedom and individuality. In addition, theoretical premises are based primarily on acquired dispositions. As the culture changes, such dispositions are altered, and the premises are often invalidated. Several modifications in the scope and methods of social psychology are derived from this analysis.
…Yet, while the propagandizing effects of psychological terminology must be lamented, it is also important to trace their sources. In part the evaluative loading of theoretical terms seems quite intentional. The act of publishing implies the desire to be heard. However, value-free terms have low interest value for the potential reader, and value-free research rapidly becomes obscure. If obedience were relabeled alpha behavior and not rendered deplorable through associations with Adolf Eichmann, public concern would undoubtedly be meagre. In addition to capturing the interest of the public and the profession, value-laden concepts also provide an expressive outlet for the psychologist. I have talked with countless graduate students drawn into psychology out of deep humanistic concern. Within many lies a frustrated poet, philosopher, or humanitarian who finds the scientific method at once a means to expressive ends and an encumbrance to free expression. Resented is the apparent fact that the ticket to open expression through the professional media is a near lifetime in the laboratory. Many wish to share their values directly, unfettered by constant demands for systematic evidence. For them, value-laden concepts compensate for the conservatism usually imparted by these demands. The more established psychologist may indulge himself more directly. Normally, however, we are not inclined to view our personal biases as propagandistic so much as reflecting “basic truths.”
1973-furby.pdf: “Interpreting regression toward the mean in developmental research”, (1973; ):
Explicates the fundamental nature of regression toward the mean, which is frequently misunderstood by developmental researchers. While errors of measurement are commonly assumed to be the sole source of regression effects, the latter also are obtained with errorless measures. The conditions under which regression phenomena can appear are first clearly defined. Next, an explanation of regression effects is presented which applies both when variables contain errors of measurement and when they are errorless. The analysis focuses on cause and effect relationships of psychologically meaningful variables. Finally, the implications for interpreting regression effects in developmental research are illustrated with several empirical examples.
1971-elashoff-pygmalionreconsidered.pdf: “Pygmalion Reconsidered: A Case Study in Statistical Inference: Reconsideration of the Rosenthal-Jacobson Data on Teacher Expectancy”, Janet D. Elashoff, Richard E. Snow ( )
1970-meehl.pdf: “Nuisance Variables and the Ex Post Facto Design”, Paul E. Meehl ( )
1968-rosenthal-pygmalionintheclassroom.pdf: “Pygmalion In The Classroom: Teacher Expectation and Pupil's Intellectual Development”, Robert Rosenthal, Lenore Jacobson ( )
1967-vesell.pdf: “Induction of Drug-Metabolizing Enzymes in Liver Microsomes of Mice and Rats by Softwood Bedding”, (1967-09-01; ):
Induction of three drug-metabolizing enzymes occurred in liver microsomes of mice and rats kept on softwood bedding of either red cedar, white pine, or ponderosa pine. This induction was reversed when animals were placed on hardwood bedding composed of a mixture of beech, birch, and maple. Differences in the capacity of various beddings to induce may partially explain divergent results of studies on drug-metabolizing enzymes. The presence of such inducing substances in the environment may influence the pharmacologic responsiveness of animals to a wide variety of drugs.
[Cronbach 1975 description:
Even the animal experimenter is not exempt from problems of interaction. (I am indebted to Neal Miller for the following example.) Investigators checking on how animals metabolize drugs found that results differed mysteriously from laboratory to laboratory. The most startling inconsistency of all occurred after a refurbishing of a National Institutes of Health (NIH) animal room brought in new cages and new supplies. Previously, a mouse would sleep for about 35 minutes after a standard injection of hexobarbital. In their new homes, the NIH mice came miraculously back to their feet just 16 minutes after receiving a shot of the drug. Detective work proved that red-cedar bedding made the difference, stepping up the activity of several enzymes that metabolize hexobarbital. Pine shavings had the same effect. When the softwood was replaced with birch or maple bedding like that originally used, drug response came back in line with previous experience (Vesell, 1967).]
1966-dunnette.pdf: “Fads, fashions, and folderol in psychology”, (1966; ):
[Influential early critique of academic psychology: weak theories, no predictions, poor measurements, poor replicability, high levels of publication bias, non-progressive theory building, and constant churn; many of these criticisms would be taken up by the ‘Minnesota school’ of Bouchard/Meehl/Lykken/etc.]
Fads include brain-storming, Q technique, level of aspiration, forced choice, critical incidents, semantic differential, role playing, and need theory. Fashions include theorizing and theory building, criterion fixation, model building, null-hypothesis testing, and sensitivity training. Folderol includes tendencies to be fixated on theories, methods, and points of view, conducting “little” studies with great precision, attaching dramatic but unnecessary trappings to experiments, grantsmanship, coining new names for old concepts, fixation on methods and apparatus, etc.
1962-wolins.pdf: “Responsibility for Raw Data”, (1962-09; ):
Comments on an Iowa State University graduate student’s endeavor to obtain data of a particular kind in order to carry out a study for his master’s thesis. This student wrote to 37 authors whose journal articles appeared in APA journals between 1959 and 1961. Of these authors, 32 replied. 21 of those reported the data misplaced, lost, or inadvertently destroyed. 2 of the remaining 11 offered their data on the conditions that they be notified of the intended use of their data, and stated that they have control of anything that would be published involving these data. Errors were found in some of the raw data that were obtained, which caused a dilemma of either reporting the errors or not. The commentator states that if it were clearly set forth by the APA that the responsibility for retaining raw data and submitting them for scrutiny upon request lies with the author, this dilemma would not exist. The commentator suggests that a possibly more effective means of controlling quality of publication would be to institute a system of quality control whereby random samples of raw data from submitted journal articles would be requested by editors and scrutinized for accuracy and the appropriateness of the analysis performed.