Replication (Link Bibliography)

“Replication” links:

  1. #further-reading


  3. ⁠, John P. A. Ioannidis ():


    There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; when there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical-significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

    Published research findings are sometimes refuted by subsequent evidence, says Ioannidis, with ensuing confusion and disappointment.
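    The PPV algebra behind this argument is compact enough to sketch directly. A minimal implementation of the paper's formula (the function name and example parameter values are mine), where R is the pre-study odds of a true relationship, 1 − β the power, α the significance level, and u the bias:

```python
def ppv(R, power, alpha=0.05, bias=0.0):
    """Positive predictive value of a claimed finding (Ioannidis 2005):
    probability the claim is true given that it was reported as significant.
    R: pre-study odds of a true relationship; bias u: fraction of analyses
    that would otherwise be negative but get reported as positive anyway."""
    beta = 1 - power
    u = bias
    true_positives = (1 - beta) * R + u * beta * R
    all_positives = R + alpha - beta * R + u - u * alpha + u * beta * R
    return true_positives / all_positives

# A well-powered test of a 1:1-odds hypothesis:  ppv(1.0, 0.8)            -> ~0.94
# An underpowered, biased exploratory search:    ppv(0.01, 0.2, bias=0.3) -> ~0.01
```

    Lowering power, lowering the pre-study odds, or raising bias each drags the PPV down, which is the paper's core claim in quantitative form.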

  4. 2005-ioannidis-table4-positivepredictivevalueofpublicationsforpowervsbaseratevsbias.png

  5. ⁠, Leah R. Jager, Jeffrey T. Leek (2013-01-16):

    The accuracy of published medical research is critical for scientists, physicians, and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature using reported p-values as the data. We then collect p-values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a statistically-significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, p = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, p = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.

  6. 2015-opensciencecollaboration.pdf: ⁠, Open Science Collaboration (2015-08-28; statistics  /​ ​​ ​bias):

    Empirically analyzing empirical evidence: One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al 2015 describe the replication of 100 experiments reported in papers published in 2008 in 3 high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study.

    Introduction: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.

    Rationale: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.

    Results: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using statistical-significance and P values, effect sizes, subjective assessments of replication teams, and meta-analyses of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. 97% of original studies had statistically-significant results (p < 0.05). 36% of replications had statistically-significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically-significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

    Conclusion: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.

    Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.

    Figure 1: Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by statistically-significant (blue) and non-statistically-significant (red) effects.

  8. 2008-scherer.pdf: ⁠, Roberta W. Scherer, Patricia Langenberg, Erik von Elm (2007; statistics  /​ ​​ ​bias):

    Studies initially reported as conference abstracts that have positive results are subsequently published as full-length journal articles more often than studies with negative results.

    Less than half of all studies, and about 60% of randomized or controlled clinical trials, initially presented as summaries or abstracts at professional meetings are subsequently published as peer-reviewed journal articles. An important factor appearing to influence whether a study described in an abstract is published in full is the presence of ‘positive’ results in the abstract. Thus, the efforts of persons trying to collect all of the evidence in a field may be stymied, first by the failure of investigators to take abstract study results to full publication, and second, by the tendency to take to full publication only those studies reporting ‘significant’ results. The consequence of this is that systematic reviews will tend to over-estimate treatment effects.

    Background: Abstracts of presentations at scientific meetings are usually available only in conference proceedings. If subsequent full publication of abstract results is based on the magnitude or direction of study results, publication bias may result. Publication bias, in turn, creates problems for those conducting systematic reviews or relying on the published literature for evidence.

    Objectives: To determine the rate at which abstract results are subsequently published in full, and the time between meeting presentation and full publication.

    Search methods: We searched MEDLINE, EMBASE, The Cochrane Library, Science Citation Index, reference lists, and author files. Date of most recent search: June 2003.

    Selection criteria: We included all reports that examined the subsequent full publication rate of biomedical results initially presented as abstracts or in summary form. Follow-up of abstracts had to be at least two years.

    Data collection and analysis: Two reviewers extracted data. We calculated the weighted mean full publication rate and time to full publication. Dichotomous variables were analyzed using relative risk and random effects models. We assessed time to publication using survival analyses.

    Main results: Combining data from 79 reports (29,729 abstracts) resulted in a weighted mean full publication rate of 44.5% (95% confidence interval (CI) 43.9 to 45.1). Survival analyses resulted in an estimated publication rate at 9 years of 52.6% for all studies, 63.1% for randomized or controlled clinical trials, and 49.3% for other types of study designs.

    ‘Positive’ results defined as any ‘significant’ result showed an association with full publication (RR = 1.30; CI 1.14 to 1.47), as did ‘positive’ results defined as a result favoring the experimental treatment (RR = 1.17; CI 1.02 to 1.35), and ‘positive’ results emanating from randomized or controlled clinical trials (RR = 1.18, CI 1.07 to 1.30).

    Other factors associated with full publication include oral presentation (RR = 1.28; CI 1.09 to 1.49); acceptance for meeting presentation (RR = 1.78; CI 1.50 to 2.12); randomized trial study design (RR = 1.24; CI 1.14 to 1.36); and basic research (RR = 0.79; CI 0.70 to 0.89). Higher quality of abstracts describing randomized or controlled clinical trials was also associated with full publication (RR = 1.30, CI 1.00 to 1.71).

    Authors’ conclusions: Only 63% of results from abstracts describing randomized or controlled clinical trials are published in full. ‘Positive’ results were published more frequently than non-‘positive’ results.



  11. 2007-kyzas.pdf

  12. 1996-csada.pdf

  13. 2002-jennions.pdf


  15. 1995-sterling.pdf

  16. 2012-masicampo.pdf: ⁠, E. J. Masicampo, Daniel R. Lalande (2012-11-01; statistics  /​ ​​ ​bias):

    In null hypothesis statistical-significance testing (NHST), p-values are judged relative to an arbitrary threshold for statistical-significance (0.05). The present work examined whether that standard influences the distribution of p-values reported in the psychology literature.

    We examined a large subset of papers from 3 highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p-values were much more common immediately below 0.05 than would be expected based on the number of p-values occurring in other ranges. This prevalence of p-values just below the arbitrary criterion for statistical-significance was observed in all 3 journals.

    We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.
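    A toy version of this check: bin the reported p-values and compare the count in the bin just below 0.05 against the neighboring bins, which under a smooth p-curve should hold similar counts (the bin width and comparison rule here are illustrative, not the paper's exact method):

```python
def excess_below_threshold(p_values, thresh=0.05, width=0.005):
    """Count p-values in the bin just under `thresh` vs. the average count
    in the two adjacent bins; a large ratio suggests a pile-up at the
    statistical-significance threshold."""
    just_below = sum(thresh - width <= p < thresh for p in p_values)
    next_below = sum(thresh - 2 * width <= p < thresh - width for p in p_values)
    just_above = sum(thresh <= p < thresh + width for p in p_values)
    return just_below, (next_below + just_above) / 2
```

    Applied to a literature with no pile-up, the two returned counts should be comparable; Masicampo & Lalande's finding corresponds to the first count being far larger than the second.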


  18. ⁠, Anton Kühberger, Astrid Fritz, Thomas Scherndl (2014-07-29):


    Background: The p value obtained from a significance test provides no information about the magnitude or importance of the underlying phenomenon. Therefore, additional reporting of effect size is often recommended. Effect sizes are theoretically independent from sample size. Yet this may not hold true empirically: non-independence could indicate publication bias.

    Methods: We investigate whether effect size is independent from sample size in psychological research. We randomly sampled 1,000 psychological articles from all areas of psychological research. We extracted p values, effect sizes, and sample sizes of all empirical papers, and calculated the correlation between effect size and sample size, and investigated the distribution of p values.

    Results: We found a negative correlation of r = −.45 [95% CI: −.53; −.35] between effect size and sample size. In addition, we found an inordinately high number of p values just passing the boundary of statistical-significance. Additional data showed that neither implicit nor explicit power analysis could account for this pattern of findings.

    Conclusions: The negative correlation between effect size and sample size, and the biased distribution of p values, indicate pervasive publication bias in the entire field of psychology.


  20. ⁠, Abel Brodeur, Mathias Lé, Marc Sangnier, Yanos Zylberberg (2013-03):

    Journals favor rejection of the null hypothesis. This selection upon tests may distort the behavior of researchers. Using 50,000 tests published between 2005 and 2011 in the AER, JPE, and QJE, we identify a residual in the distribution of tests that cannot be explained by selection. The distribution of p-values exhibits a camel shape with abundant p-values above 0.25, a valley between 0.25 and 0.10, and a bump slightly below 0.05. The missing tests (with p-values between 0.25 and 0.10) can be retrieved just after the 0.05 threshold and represent 10% to 20% of marginally rejected tests. Our interpretation is that researchers might be tempted to inflate the value of those almost-rejected tests by choosing a “significant” specification. We propose a method to measure inflation and decompose it along articles’ and authors’ characteristics.

  21. 2008-gerber.pdf: ⁠, Alan S. Gerber, Neil Malhotra (2008-08-01; statistics  /​ ​​ ​bias):

    Despite great attention to the quality of research methods in individual studies, if publication decisions of journals are a function of the statistical-significance of research findings, the published literature as a whole may not produce accurate measures of true effects.

    This article examines the 2 most prominent sociology journals (the American Sociological Review and the American Journal of Sociology) and another important though less influential journal (The Sociological Quarterly) for evidence of publication bias. The effect of the 0.05 statistical-significance level on the pattern of published findings is examined using a “caliper” test, and the hypothesis of no publication bias can be rejected at approximately the 1 in 10 million level.

    Findings suggest that some of the results reported in leading sociology journals may be misleading and inaccurate due to publication bias. Some reasons for publication bias and proposed reforms to reduce its impact on research are also discussed.
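    The caliper test itself is simple enough to sketch: collect the z-statistics within a narrow band around the critical value 1.96; absent publication bias, a result should be about as likely to land just below the threshold as just above it, so the observed split can be checked against a Binomial(n, 0.5) null. A toy version (the band width and data are illustrative, not Gerber & Malhotra's exact choices):

```python
from math import comb

def caliper_test(z_stats, z_crit=1.96, width=0.10):
    """Count test statistics just over vs. just under the critical value,
    and return a one-sided binomial p-value for the null of a 50/50 split."""
    in_band = [z for z in z_stats if abs(abs(z) - z_crit) <= width]
    over = sum(abs(z) > z_crit for z in in_band)
    n = len(in_band)
    # P(X >= over) for X ~ Binomial(n, 0.5)
    p = sum(comb(n, k) for k in range(over, n + 1)) / 2**n
    return over, n - over, p

# 9 of 10 in-band statistics falling just over the threshold:
# caliper_test([2.0]*9 + [1.9] + [0.5, 3.5])  ->  (9, 1, ~0.011)
```

    Gerber & Malhotra's 1-in-10-million rejection corresponds to a far more lopsided split over a much larger sample of published coefficients.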

  22. 2001-ioannidis.pdf

  23. ⁠, Daniele Fanelli (2010-03-24):

    The growing competition and “publish or perish” culture in academia might conflict with the objectivity and integrity of research, because it forces scientists to produce “publishable” results at all costs. Papers are less likely to be published and to be cited if they report “negative” results (results that fail to support the tested hypothesis). Therefore, if publication pressures increase scientific bias, the frequency of “positive” results in the literature should be higher in the more competitive and “productive” academic environments. This study tested this hypothesis by measuring the frequency of positive results in a large random sample of papers with a corresponding author based in the US. Across all disciplines, papers were more likely to support a tested hypothesis if their corresponding authors were working in states that, according to NSF data, produced more academic papers per capita. The size of this effect increased when controlling for states’ per capita R&D expenditure and for study characteristics that previous research showed to correlate with the frequency of positive results, including discipline and methodology. Although the effect of institutions’ prestige could not be excluded (researchers in the more productive universities could be the most clever and successful in their experiments), these results support the hypothesis that competitive academic environments increase not only scientists’ productivity but also their bias. The same phenomenon might be observed in other countries where academic competition and pressures to publish are high.

  24. 2012-fanelli.pdf: ⁠, Daniele Fanelli (2011-09-11; statistics  /​ ​​ ​bias):

    Concerns that the growing competition for funding and citations might distort science are frequently discussed, but have not been verified directly. Of the hypothesized problems, perhaps the most worrying is a worsening of positive-outcome bias. A system that disfavours negative results not only distorts the scientific literature directly, but might also discourage high-risk projects and pressure scientists to fabricate and falsify their data. This study analysed over 4,600 papers published in all disciplines between 1990 and 2007, measuring the frequency of papers that, having declared to have “tested” a hypothesis, reported a positive support for it. The overall frequency of positive supports has grown by over 22% between 1990 and 2007, with statistically-significant differences between disciplines and countries. The increase was stronger in the social and some biomedical disciplines. The United States had published, over the years, statistically-significantly fewer positive results than Asian countries (and particularly Japan) but more than European countries (and in particular the United Kingdom). Methodological artefacts cannot explain away these patterns, which support the hypotheses that research is becoming less pioneering and/or that the objectivity with which results are produced and published is decreasing.














  38. 2011-young.pdf

  39. #animal-models



  42. ⁠, Thomas Pfeiffer, Robert Hoffmann (2009-05-21):

    Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field. Here we provide empirical support for this prediction. We evaluate published statements on protein interactions with data from high-throughput experiments. We find evidence for two distinctive effects. First, with increasing popularity of the interaction partners, individual statements in the literature become more erroneous. Second, the overall evidence on an interaction becomes increasingly distorted by multiple independent testing. We therefore argue that for increasing the reliability of research it is essential to assess the negative effects of popularity and develop approaches to diminish these effects.



  45. ⁠, Christopher F. Chabris, Benjamin M. Hebert, Daniel J. Benjamin, Jonathan Beauchamp, David Cesarini, Matthijs van der Loos, Magnus Johannesson, Patrik K. E. Magnusson, Paul Lichtenstein, Craig S. Atwood, Jeremy Freese, Taissa S. Hauser, Robert M. Hauser, Nicholas Christakis, David Laibson (2012):

    General intelligence (g) and virtually all other behavioral traits are heritable. Associations between g and specific single-nucleotide polymorphisms (SNPs) in several candidate genes involved in brain function have been reported. We sought to replicate published associations between g and 12 specific genetic variants (in the genes DTNBP1, CTSD, DRD2, ANKK1, CHRM2, SSADH, COMT, BDNF, CHRNA4, DISC1, APOE, and SNAP25) using data sets from three independent, well-characterized longitudinal studies with samples of 5,571, 1,759, and 2,441 individuals. Of 32 independent tests across all three data sets, only 1 was nominally statistically-significant. By contrast, power analyses showed that we should have expected 10 to 15 statistically-significant associations, given reasonable assumptions for genotype effect sizes. For positive controls, we confirmed accepted genetic associations for Alzheimer’s disease and body mass index, and we used SNP-based calculations of genetic relatedness to replicate previous estimates that about half of the variance in g is accounted for by common genetic variation among individuals. We conclude that the molecular genetics of psychology and social science requires approaches that go beyond the examination of candidate genes.

  46. #wicherts-et-al-2011



  49. 1990-rothman.pdf



  52. 1985-godfrey.pdf

  53. 1987-pocock.pdf

  54. 1987-smith.pdf

  55. 1993-lipsey.pdf: ⁠, Mark W. Lipsey, David B. Wilson (1993-12-01; psychology):

    Conventional reviews of research on the efficacy of psychological, educational, and behavioral treatments often find considerable variation in outcome among studies and, as a consequence, fail to reach firm conclusions about the overall effectiveness of the interventions in question. In contrast, meta-analysis reviews show a strong, dramatic pattern of positive overall effects that cannot readily be explained as artifacts of meta-analytic technique or generalized placebo effects. Moreover, the effects are not so small that they can be dismissed as lacking practical or clinical-significance. Although meta-analysis has limitations, there are good reasons to believe that its results are more credible than those of conventional reviews and to conclude that well-developed psychological, educational, and behavioral treatment is generally efficacious.


  57. 2012-mitchell.pdf





  62. 2011-simmons.pdf


  64. 2013-button.pdf







  71. ⁠, Scott Alexander (2014-04-28):

    Allan Crossman calls parapsychology the control group for science⁠. That is, in let’s say a drug testing experiment, you give some people the drug and they recover. That doesn’t tell you much until you give some other people a placebo drug you know doesn’t work—but which they themselves believe in—and see how many of them recover. That number tells you how many people will recover whether the drug works or not. Unless people on your real drug do substantially better than people on the placebo drug, you haven’t found anything. On the meta-level, you’re studying some phenomenon and you get some positive findings. That doesn’t tell you much until you take some other researchers who are studying a phenomenon you know doesn’t exist—but which they themselves believe in—and see how many of them get positive findings. That number tells you how many studies will discover positive results whether the phenomenon is real or not. Unless studies of the real phenomenon do substantially better than studies of the placebo phenomenon, you haven’t found anything.

    Trying to set up placebo science would be a logistical nightmare. You’d have to find a phenomenon that definitely doesn’t exist, somehow convince a whole community of scientists across the world that it does, and fund them to study it for a couple of decades without them figuring it out.

    Luckily we have a natural experiment in terms of parapsychology—the study of psychic phenomena—which most reasonable people believe don’t exist, but which a community of practicing scientists believes in and publishes papers on all the time. The results are pretty dismal. Parapsychologists are able to produce experimental evidence for psychic phenomena about as easily as normal scientists are able to produce such evidence for normal, non-psychic phenomena. This suggests the existence of a very large “placebo effect” in science—ie with enough energy focused on a subject, you can always produce “experimental evidence” for it that meets the usual scientific standards. As Eliezer Yudkowsky puts it:

    Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored—that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter.


  73. 2014-andreoliversbach.pdf: ⁠, Patrick Andreoli-Versbach, Frank Mueller-Langer (2014; statistics  /​ ​​ ​bias):

    Data-sharing is an essential tool for replication, validation and extension of empirical results. Using a hand-collected data set describing the data-sharing behaviour of 488 randomly selected empirical researchers, we provide evidence that most researchers in economics and management do not share their data voluntarily. We derive testable hypotheses based on the theoretical literature on information-sharing and relate data-sharing to observable characteristics of researchers. We find empirical support for the hypotheses that voluntary data-sharing statistically-significantly increases with (a) academic tenure, (b) the quality of researchers, (c) the share of published articles subject to a mandatory data-disclosure policy of journals, and (d) personal attitudes towards “open science” principles. On the basis of our empirical evidence, we discuss a set of policy recommendations.


  75. 2009-ljungqvist.pdf: ⁠, Alexander Ljungqvist, Christopher Malloy, Felicia Marston (2009-07-16; statistics  /​ ​​ ​bias):

    We document widespread changes to the historical I/B/E/S analyst stock recommendations database. Across seven I/B/E/S downloads, obtained between 2000 and 2007, we find that between 6,580 (1.6%) and 97,582 (21.7%) of matched observations are different from one download to the next. The changes include alterations of recommendations, additions and deletions of records, and removal of analyst names. These changes are nonrandom, clustering by analyst reputation, broker size and status, and recommendation boldness, and affect trading signal classifications and back-tests of three stylized facts: profitability of trading signals, profitability of consensus recommendation changes, and persistence in individual analyst stock-picking ability.

  76. 2013-ioannidis.pdf: ⁠, John Ioannidis, Chris Doucouliagos (2013; statistics  /​ ​​ ​bias):

    The scientific credibility of economics is itself a scientific question that can be addressed with both theoretical speculations and empirical data.

    In this review, we examine the major parameters that are expected to affect the credibility of empirical economics: sample size, magnitude of pursued effects, number and pre-selection of tested relationships, flexibility and lack of standardization in designs, definitions, outcomes and analyses, financial and other interests and prejudices, and the multiplicity and fragmentation of efforts.

    We summarize and discuss the empirical evidence on the lack of a robust reproducibility culture in economics and business research, the prevalence of potential publication and other selective reporting biases, and other failures and biases in the market of scientific information. Overall, the credibility of the economics literature is likely to be modest or even low.

    [Keywords: bias, credibility, economics, meta-research, replication, reproducibility]

  77. 2015-krawczyk.pdf

  78. ⁠, Timothy Vines, Arianne Albert, Rose Andrew, Florence Debarré, Dan Bock, Michelle Franklin, Kimberley Gilbert, Jean-Sébastien Moore, Sébastien Renaut, Diana J. Rennison (2013-12-19):

    Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2-4], and journal [5,6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8-11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested datasets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a dataset being extant fell by 17% per year. In addition, the odds that we could find a working email address for the first, last or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.







  85. 2009-anda.pdf


  87. 2011-sainani.pdf










  97. ⁠, Laramie E. Duncan, Matthew C. Keller (2011):

    Objective: Gene-by-environment interaction (G×E) studies in psychiatry have typically been conducted using a candidate G×E (cG×E) approach, analogous to the candidate gene association approach used to test genetic main effects. Such cG×E research has received widespread attention and acclaim, yet cG×E findings remain controversial. The authors examined whether the many positive cG×E findings reported in the psychiatric literature were robust or if, in aggregate, cG×E findings were consistent with the existence of publication bias, low statistical power, and a high false discovery rate.

    Method: The authors conducted analyses on data extracted from all published studies (103 studies) from the first decade (2000-2009) of cG×E research in psychiatry.

    Results: Ninety-six percent of novel cG×E studies were significant compared with 27% of replication attempts. These findings are consistent with the existence of publication bias among novel cG×E studies, making cG×E hypotheses appear more robust than they actually are. There also appears to be publication bias among replication attempts because positive replication attempts had smaller average sample sizes than negative ones. Power calculations using observed sample sizes suggest that cG×E studies are underpowered. Low power along with the likely low probability of a given cG×E hypothesis being true suggests that most or even all positive cG×E findings represent type I errors.

    Conclusions: In this new era of big data and small effects, a recalibration of views about groundbreaking findings is necessary. Well-powered direct replications deserve more attention than novel cG×E findings and indirect replications.

  98. ⁠, Ramal Moonesinghe, Muin J. Khoury, A. Cecile J. W. Janssens ():

    We examine the PPV as a function of the number of statistically-significant findings. Figure 1 shows the PPV of at least one, two, or three statistically-significant research findings out of ten independent studies as a function of the pre-study odds of a true relationship (R) for powers of 20% and 80%. The lower lines correspond to Ioannidis’ finding and indicate the probability of a true association when at least one out of ten studies shows a statistically-significant result. As can be seen, the PPV is substantially higher when more research findings are statistically-significant. Thus, a few positive replications can considerably enhance our confidence that the research findings reflect a true relationship. When R ranged from 0.0001 to 0.01, a higher number of positive studies is required to attain a reasonable PPV. The difference in PPV for power of 80% and power of 20% when at least three studies are positive is higher than when at least one study is positive. Figure 2 gives the PPV for increasing number of positive studies out of ten, 25, and 50 studies for pre-study odds of 0.0001, 0.01, 0.1, and 0.5 for powers of 20% and 80%. When there is at least one positive study (r = 1) and power equal to 80%, as indicated in Ioannidis’ paper, PPV declined approximately 50% for 50 studies compared to ten studies for R values between 0.0001 and 0.1. However, PPV increases with increasing number of positive studies and the percentage of positive studies required to achieve a given PPV declines with increasing number of studies. The number of positive studies required to achieve a PPV of at least 70% increased from eight for ten studies to 12 for 50 studies when pre-study odds equaled 0.0001, from five for ten studies to eight for 50 studies when pre-study odds equaled 0.01, from three for ten studies to six for 50 studies when pre-study odds equaled 0.1, and from two for ten studies to five for 50 studies when pre-study odds equaled 0.5. The difference in PPV for powers of 80% and 20% declines with increasing number of studies.

    …In summary, while we agree with Ioannidis that most research findings are false, we clearly demonstrate that replication of research findings enhances the positive predictive value of research findings being true. While this is not unexpected, it should be encouraging news to researchers in their never-ending pursuit of scientific hypothesis generation and testing. Nevertheless, more methodological work is needed to assess and interpret cumulative evidence of research findings and their biological plausibility. This is especially urgent in the exploding field of genetic associations.
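
The r-of-n calculation behind these figures can be sketched directly (a minimal reconstruction assuming independent studies, a 5% significance level, and no bias; the function names are mine, not the authors’):

```python
from math import comb

def p_at_least(r, n, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def ppv(r, n, R, power, alpha=0.05):
    """P(relationship is true | at least r of n independent studies are significant)."""
    prior = R / (1 + R)             # convert pre-study odds R to a prior probability
    hit = p_at_least(r, n, power)   # chance a true relationship yields >= r positives
    fa = p_at_least(r, n, alpha)    # chance a null relationship yields >= r positives
    return prior * hit / (prior * hit + (1 - prior) * fa)
```

With pre-study odds R = 0.1 and 80% power, a single positive result among ten studies gives a PPV of only ~0.20, while requiring at least three positives raises it to ~0.90, matching the qualitative pattern described above.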





  103. ⁠, Aaron Mobley, Suzanne K. Linder, Russell Braeuer, Lee M. Ellis, Leonard Zwelling (2013-03-29):


    The pharmaceutical and biotechnology industries depend on findings from academic investigators prior to initiating programs to develop new diagnostic and therapeutic agents to benefit cancer patients. The success of these programs depends on the validity of published findings. This validity, represented by the reproducibility of published findings, has come into question recently as investigators from companies have raised the issue of poor reproducibility of published results from academic laboratories. Furthermore, retraction rates in high impact journals are climbing.

    Methods and Findings:

    To examine a microcosm of the academic experience with data reproducibility, we surveyed the faculty and trainees at MD Anderson Cancer Center using an anonymous computerized questionnaire; we sought to ascertain the frequency and potential causes of non-reproducible data. We found that ~50% of respondents had experienced at least one episode of the inability to reproduce published data; many who pursued this issue with the original authors were never able to identify the reason for the lack of reproducibility; some were even met with a less than “collegial” interaction.


    These results suggest that the problem of data reproducibility is real. Biomedical science needs to establish processes to decrease the problem and adjudicate discrepancies in findings when they are discovered.



  106. 2011-osherovich.pdf

  107. 1986-henrion.pdf


  109. ⁠, Stefan Thurner, Rudolf Hanel (2010-08-25):

    One of the virtues of peer review is that it provides a self-regulating selection mechanism for scientific work, papers and projects. Peer review as a selection mechanism is hard to evaluate in terms of its efficiency. Serious efforts to understand its strengths and weaknesses have not yet led to clear answers. In theory peer review works if the involved parties (editors and referees) conform to a set of requirements, such as love for high-quality science, objectiveness, and absence of biases, nepotism, friend and clique networks, selfishness, etc. If these requirements are violated, what is the effect on the selection of high-quality work? We study this question with a simple agent-based model. In particular we are interested in the effects of rational referees, who might not have any incentive to see high-quality work other than their own published or promoted. We find that a small fraction of incorrect (selfish or rational) referees can drastically reduce the quality of the published (accepted) scientific standard. We quantify the fraction for which peer review will no longer select better than pure chance. Decline of quality of accepted scientific work is shown as a function of the fraction of rational and unqualified referees. We show how a simple quality-increasing policy of e.g. a journal can lead to a loss in overall scientific quality, and how mutual support-networks of authors and referees deteriorate the system.
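
The qualitative effect is easy to reproduce in a toy simulation (my own sketch, not the authors’ model): papers have a latent quality, honest referees accept above-average work, and “rational” referees vote at random, so the mean quality of accepted papers falls as the selfish fraction grows.

```python
import random

def mean_accepted_quality(frac_selfish, n_papers=20000, seed=0):
    """Mean latent quality of papers accepted by a 2-referee panel.

    Honest referees accept iff quality > 0; selfish/rational referees
    vote at random, diluting the selection.  (Toy model only.)
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_papers):
        quality = rng.gauss(0, 1)
        votes = sum(
            (rng.random() < 0.5) if rng.random() < frac_selfish else (quality > 0)
            for _ in range(2)
        )
        if votes == 2:  # unanimous acceptance required
            accepted.append(quality)
    return sum(accepted) / len(accepted)
```

With all referees honest the mean accepted quality is about 0.8 SD above average; with all referees random it is indistinguishable from pure chance (mean ≈ 0).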






  115. 2008-gonzalezalvarez.pdf: “Science in the 21st century: social, political, and economic issues”⁠, Juan R. González-Álvarez



  118. ⁠, Daniele Fanelli (2009-04-19):

    The frequency with which scientists fabricate and falsify data, or commit other forms of scientific misconduct is a matter of controversy. Many surveys have asked scientists directly whether they have committed or know of a colleague who committed research misconduct, but their results appeared difficult to compare and synthesize. This is the first meta-analysis of these surveys.

    To standardize outcomes, the number of respondents who recalled at least one incident of misconduct was calculated for each question, and the analysis was limited to behaviours that distort scientific knowledge: fabrication, falsification, “cooking” of data, etc… Survey questions on plagiarism and other forms of professional misconduct were excluded. The final sample consisted of 21 surveys that were included in the systematic review, and 18 in the meta-analysis.

    A pooled weighted average of 1.97% (n = 7, 95% CI: 0.86–4.45) of scientists admitted to have fabricated, falsified or modified data or results at least once (a serious form of misconduct by any standard) and up to 33.7% admitted other questionable research practices. In surveys asking about the behaviour of colleagues, admission rates were 14.12% (n = 12, 95% CI: 9.91–19.72) for falsification, and up to 72% for other questionable research practices. Meta-regression showed that self-report surveys, surveys using the words “falsification” or “fabrication”, and mailed surveys yielded lower percentages of misconduct. When these factors were controlled for, misconduct was reported more frequently by medical/​​​​pharmacological researchers than others.

    Considering that these surveys ask sensitive questions and have other limitations, it appears likely that this is a conservative estimate of the true prevalence of scientific misconduct.
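
A pooled admission rate of this kind is conventionally computed by inverse-variance weighting on the logit scale; here is a fixed-effect sketch (the paper itself uses random-effects weighting, and all names here are mine):

```python
from math import log, exp

def pool_logit_proportions(events_totals):
    """Fixed-effect inverse-variance pooling of proportions on the logit scale.

    events_totals: list of (admitters, respondents) pairs, one per survey.
    """
    num = den = 0.0
    for events, total in events_totals:
        p = (events + 0.5) / (total + 1)                       # continuity-corrected
        logit = log(p / (1 - p))
        var = 1 / (events + 0.5) + 1 / (total - events + 0.5)  # approx. logit variance
        num += logit / var
        den += 1 / var
    pooled = num / den
    return exp(pooled) / (1 + exp(pooled))                     # back to a proportion
```

A single survey round-trips to its own (corrected) proportion, and pooling identical surveys leaves the estimate unchanged.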

  119. ⁠, Leslie K. John, George Loewenstein, Drazen Prelec (2012):

    Cases of clear scientific misconduct have received substantial media attention recently, but less flagrantly questionable research practices may be more prevalent and, ultimately, more damaging to the academic enterprise.

    Using an anonymous elicitation format supplemented by incentives for honest reporting, we surveyed over 2,000 psychologists about their involvement in questionable research practices. The impact of truth-telling incentives on self-admissions of questionable research practices was positive, and this impact was greater for practices that respondents judged to be less defensible.

    Combining three different estimation methods, we found that the percentage of respondents who have engaged in questionable practices was surprisingly high. This finding suggests that some questionable practices may constitute the prevailing research norm.

    [Keywords: professional standards, judgment, disclosure, methodology]


  121. Research-criticism





  126. 1959-schlaifer-probabilitystatisticsbusinessdecisions.pdf#page=488

  127. ⁠, Xiao-Li Meng (2018-07-28):

    Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample average X̄_n and the population average X̄_N is the product of three terms: (1) a data quality measure, ρ_{R,X}, the correlation between X_j and the response/​​​​recording indicator R_j; (2) a data quantity measure, √((N − n)/n), where N is the population size; and (3) a problem difficulty measure, σ_X, the standard deviation of X. This decomposition provides multiple insights: (1) Probabilistic sampling ensures high data quality by controlling ρ_{R,X} at the level of N^−1/2; (2) When we lose this control, the impact of N is no longer canceled by ρ_{R,X}, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate 1/√n, increases with √N; (3) the “bigness” of such Big Data (for population inferences) should be measured by the relative size f = n/N, not the absolute size n; and (4) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes.

    Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a ρ_{R,X} ≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.
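
One practical consequence of the identity: equating the biased sample’s mean squared error, ρ² · ((N − n)/n) · σ², with the σ²/m of a simple random sample of size m gives the effective sample size (a sketch of the arithmetic; the function name is mine):

```python
def effective_sample_size(n, N, rho):
    """Size of the simple random sample whose mean squared error matches a
    biased sample of size n from a population of N with data defect
    correlation rho (from rho^2 * (N - n)/n * sigma^2 = sigma^2 / m)."""
    return n / ((N - n) * rho**2)
```

With N ≈ 231.6 million eligible voters, n = 1% of N, and rho = −0.005, this returns roughly 400 — the collapse described in the CCES example above.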

  128. ⁠, Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, Andrew Gelman (2018-07-25):

    It is well known among researchers and practitioners that election polls suffer from a variety of sampling and nonsampling errors, often collectively referred to as total survey error. Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (eg., errors due to uncertainty in who will vote).

    Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean square error is approximately 3.5 percentage points, about twice as large as that implied by most reported margins of error. We decompose survey error into election-level bias and variance terms. We find that average absolute election-level bias is about 2 percentage points, indicating that polls for a given election often share a common component of error. This shared error may stem from the fact that polling organizations often face similar difficulties in reaching various subgroups of the population, and that they rely on similar screening rules when estimating who will vote. We also find that average election-level variance is higher than implied by simple random sampling, in part because polling organizations often use complex sampling designs and adjustment procedures.

    We conclude by discussing how these results help explain polling failures in the 2016 U.S. presidential election, and offer recommendations to improve polling practice.

    [Keywords: margin of error, non-sampling error, polling bias, total survey error]
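
The bias/variance decomposition of total survey error can be sketched as follows (a minimal illustration with made-up errors in percentage points; not the authors’ code):

```python
from statistics import mean, pvariance

def decompose(polls_by_election):
    """polls_by_election: {election: [poll_estimate - actual_outcome, ...]}.

    Returns (overall RMSE, mean absolute election-level bias, mean
    within-election variance); with equal-sized elections,
    RMSE^2 = mean(bias^2) + mean(variance).
    """
    biases = [mean(errs) for errs in polls_by_election.values()]
    variances = [pvariance(errs) for errs in polls_by_election.values()]
    all_errs = [e for errs in polls_by_election.values() for e in errs]
    rmse = mean(e**2 for e in all_errs) ** 0.5
    return rmse, mean(abs(b) for b in biases), mean(variances)
```

The shared per-election component (bias) is what the reported margin of error misses, which is why the observed RMSE can be roughly twice the nominal one.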

  129. #shiranimehr-et-al-2018



  132. Correlation

  133. Causality

  134. Littlewood

  135. The-Existential-Risk-of-Mathematical-Error

  136. Leprechauns

  137. Hydrocephalus

  138. Mouse-Utopia

  139. DNB-meta-analysis

  140. Lunar-sleep

  141. Questions#jeanne-calment

  142. 2016-munksgaard.pdf: ⁠, Rasmus Munksgaard, Jakob Demant, Gwern Branwen (2016-09; silk-road):

    [Debunking a remarkably sloppy paper which screwed up its scraping and somehow concluded that the notorious Silk Road 2, in defiance of all observable evidence & subsequent FBI data, actually sold primarily e-books and hardly any drugs. This study has yet to be retracted.] The development of cryptomarkets has gained increasing attention from academics, including growing scientific literature on the distribution of illegal goods using cryptomarkets. Dolliver’s 2015 article “Evaluating drug trafficking on the Tor Network: Silk Road 2, the Sequel” addresses this theme by evaluating drug trafficking on one of the most well-known cryptomarkets, Silk Road 2.0. The research on cryptomarkets in general—particularly in Dolliver’s article—poses a number of new questions for methodologies. This commentary is structured around a replication of Dolliver’s original study. The replication study is not based on Dolliver’s original dataset, but on a second dataset collected applying the same methodology. We have found that the results produced by Dolliver differ greatly from our replicated study. While a margin of error is to be expected, the inconsistencies we found are too great to attribute to anything other than methodological issues. The analysis and conclusions drawn from studies using these methods are promising and insightful. However, based on the replication of Dolliver’s study, we suggest that researchers using these methodologies ensure that datasets are made available for other researchers, and that methodology and dataset metrics (eg. number of downloaded pages, error logs) are described thoroughly in the context of webometrics and web crawling.

  143. newsletter





  148. 2013-kurzban.pdf: ⁠, Robert Kurzban, Angela Duckworth, Joseph W. Kable, Justus Myers (2013-12-04; psychology):

    Why does performing certain tasks cause the aversive experience of mental effort and concomitant deterioration in task performance? One explanation posits a physical resource that is depleted over time. We propose an alternative explanation that centers on mental representations of the costs and benefits associated with task performance. Specifically, certain computational mechanisms, especially those associated with executive function, can be deployed for only a limited number of simultaneous tasks at any given moment. Consequently, the deployment of these computational mechanisms carries an opportunity cost—that is, the next-best use to which these systems might be put. We argue that the phenomenology of effort can be understood as the felt output of these cost/​​​​benefit computations. In turn, the subjective experience of effort motivates reduced deployment of these computational mechanisms in the service of the present task. These opportunity cost representations, then, together with other cost/​​​​benefit calculations, determine effort expended and, everything else equal, result in performance reductions. In making our case for this position, we review alternative explanations for both the phenomenology of effort associated with these tasks and for performance reductions over time. Likewise, we review the broad range of relevant empirical results from across sub-disciplines, especially psychology and neuroscience. We hope that our proposal will help to build links among the diverse fields that have been addressing similar questions from different perspectives, and we emphasize ways in which alternative models might be empirically distinguished.

  149. 2016-orquin.pdf






  155. 2013-couzinfrankel.pdf: “Science Magazine”




  159. 2003-krum.pdf

  160. ⁠, Coalition for Evidence-Based Policy (2013):

    Since the establishment of the Institute for Education Sciences (IES) within the U.S. Department of Education in 2002, IES has commissioned a sizable number of well-conducted randomized controlled trials (RCTs) evaluating the effectiveness of diverse educational programs, practices, and strategies (“interventions”). These interventions have included, for example, various educational curricula, teacher professional development programs, school choice programs, educational software, and data-driven school reform initiatives. Largely as a result of these IES studies, there now exists—for the first time in U.S. education—a sizable body of credible knowledge about what works and what doesn’t work to improve key educational outcomes of American students. A clear pattern of findings in these IES studies is that the large majority of interventions evaluated produced weak or no positive effects compared to usual school practices. This pattern is consistent with findings in other fields where RCTs are frequently carried out, such as medicine and business, and underscores the need to test many different interventions so as to build the number shown to work.






  166. ⁠, David Krauth, Andrew Anglemyer, Rose Philipps, Lisa Bero (2013-12-09):

    Industry-sponsored clinical drug studies are associated with publication of outcomes that favor the sponsor, even when controlling for potential bias in the methods used. However, the influence of sponsorship bias has not been examined in preclinical animal studies.

    We performed a meta-analysis of preclinical statin studies to determine whether industry sponsorship is associated with either increased effect sizes of efficacy outcomes and/​​​​or risks of bias in a cohort of published preclinical statin studies. We searched Medline (January 1966–April 2012) and identified 63 studies evaluating the effects of statins on atherosclerosis outcomes in animals. Two coders independently extracted study design criteria aimed at reducing bias, results for all relevant outcomes, sponsorship source, and investigator financial ties. The I2 statistic was used to examine heterogeneity. We calculated the standardized mean difference (SMD) for each outcome and pooled data across studies to estimate the pooled average SMD using random effects models. In a priori subgroup analyses, we assessed statin efficacy by outcome measured, sponsorship source, presence or absence of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization.

    The effect of statins was statistically-significantly larger for studies sponsored by nonindustry sources (−1.99; 95% CI −2.68, −1.31) versus studies sponsored by industry (−0.73; 95% CI −1.00, −0.47) (p < 0.001). Statin efficacy did not differ by disclosure of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization. Possible reasons for the differences between nonindustry-sponsored and industry-sponsored studies, such as selective reporting of outcomes, require further study.

    Author Summary: Industry-sponsored clinical drug studies are associated with publication of outcomes that favor the sponsor, even when controlling for potential bias in the methods used. However, the influence of sponsorship bias has not been examined in preclinical animal studies. We performed a meta-analysis to identify whether industry sponsorship is associated with increased risks of bias or effect sizes of outcomes in a cohort of published preclinical studies of the effects of statins on outcomes related to atherosclerosis. We found that in contrast to clinical studies, the effect of statins was statistically-significantly larger for studies sponsored by nonindustry sources versus studies sponsored by industry. Furthermore, statin efficacy did not differ with respect to disclosure of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization. Possible reasons for the differences between nonindustry-sponsored and industry-sponsored studies, such as selective outcome reporting, require further study. Overall, our findings provide empirical evidence regarding the impact of funding and other methodological criteria on research outcomes.
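
The two summary statistics named above can be sketched in a few lines (standard textbook forms, not the authors’ exact code):

```python
def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference: group-mean difference over the pooled SD."""
    pooled_sd = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

def i_squared(Q, k):
    """I^2 heterogeneity: share of between-study variability beyond chance,
    from Cochran's Q over k studies (df = k - 1); bounded below at 0."""
    return max(0.0, (Q - (k - 1)) / Q)
```

Expressing every outcome as an SMD is what lets the meta-analysis pool atherosclerosis measures on different scales, and I² flags when the pooled studies disagree more than sampling error allows.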







  173. 2014-franco.pdf












  185. 2019-shewach.pdf: ⁠, Oren R. Shewach, Paul R. Sackett, Sander Quint (2019; psychology):

    The stereotype threat literature primarily comprises lab studies, many of which involve features that would not be present in high-stakes testing settings. We meta-analyze the effect of stereotype threat on cognitive ability tests, focusing on both laboratory and operational studies with features likely to be present in high stakes settings. First, we examine the features of cognitive ability test metric, stereotype threat cue activation strength, and type of non-threat control group, and conduct a focal analysis removing conditions that would not be present in high stakes settings. We also take into account a previously unrecognized methodological error in how data are analyzed in studies that control for scores on a prior cognitive ability test, which resulted in a biased estimate of stereotype threat. The focal sample, restricting the database to samples utilizing operational testing-relevant conditions, displayed a threat effect of d = −0.14 (k = 45, N = 3,532, SDδ = 0.31). Second, we present a comprehensive meta-analysis of stereotype threat. Third, we examine a small subset of studies in operational test settings and studies utilizing motivational incentives, which yielded d-values ranging from 0.00 to −0.14. Fourth, the meta-analytic database is subjected to tests of publication bias, finding nontrivial evidence for publication bias. Overall, results indicate that the size of the stereotype threat effect that can be experienced on tests of cognitive ability in operational scenarios such as college admissions tests and employment testing may range from negligible to small.



  188. ⁠, John P. A. Ioannidis ():

    • Currently, many published research findings are false or exaggerated, and an estimated 85% of research resources are wasted.
    • To make more published research true, practices that have improved credibility and efficiency in specific fields may be transplanted to others which would benefit from them—possibilities include the adoption of large-scale collaborative research; replication culture; registration; sharing; reproducibility practices; better statistical methods; standardization of definitions and analyses; more appropriate (usually more stringent) statistical thresholds; and improvement in study design standards, peer review, reporting and dissemination of research, and training of the scientific workforce.
    • Selection of interventions to improve research practices requires rigorous examination and experimental testing whenever feasible.
    • Optimal interventions need to understand and harness the motives of various stakeholders who operate in scientific research and who differ on the extent to which they are interested in promoting publishable, fundable, translatable, or profitable results.
    • Modifications need to be made in the reward system for science, affecting the exchange rates for currencies (eg., publications and grants) and purchased academic goods (eg., promotion and other academic or administrative power) and introducing currencies that are better aligned with translatable and reproducible research.

  190. 2014-hambrick.pdf: ⁠, David Z. Hambrick, Frederick L. Oswald, Erik M. Altmann, Elizabeth J. Meinz, Fernand Gobet, Guillermo Campitelli (2014-07; psychology):

    • Ericsson and colleagues argue that deliberate practice explains expert performance.
    • We tested this view in the two most studied domains in expertise research.
    • Deliberate practice is not sufficient to explain expert performance.
    • Other factors must be considered to advance the science of expertise.

    Twenty years ago, Ericsson, Krampe, and Tesch-Römer (1993) proposed that expert performance reflects a long period of deliberate practice rather than innate ability, or “talent”. Ericsson et al. found that elite musicians had accumulated thousands of hours more deliberate practice than less accomplished musicians, and concluded that their theoretical framework could provide “a sufficient account of the major facts about the nature and scarcity of exceptional performance” (p. 392). The deliberate practice view has since gained popularity as a theoretical account of expert performance, but here we show that deliberate practice is not sufficient to explain individual differences in performance in the two most widely studied domains in expertise research—chess and music. For researchers interested in advancing the science of expert performance, the task now is to develop and rigorously test theories that take into account as many potentially relevant explanatory constructs as possible.

    [Keywords: Expert performance, Expertise, Deliberate practice, Talent]

  191. 2014-horowitz.pdf


  193. 2010-kruschke.pdf: “Bayesian data analysis”⁠, John K. Kruschke




  197. 1973-gergen.pdf: ⁠, Kenneth J. Gergen (1973-01-01; statistics  /​ ​​ ​bias):

    Presents an analysis of theory and research in social psychology which reveals that while methods of research are scientific in character, theories of social behavior are primarily reflections of contemporary history. The dissemination of psychological knowledge modifies the patterns of behavior upon which the knowledge is based. This modification occurs because of the prescriptive bias of psychological theorizing, the liberating effects of knowledge, and the resistance based on common values of freedom and individuality. In addition, theoretical premises are based primarily on acquired dispositions. As the culture changes, such dispositions are altered, and the premises are often invalidated. Several modifications in the scope and methods of social psychology are derived from this analysis.

    …Yet, while the propagandizing effects of psychological terminology must be lamented, it is also important to trace their sources. In part the evaluative loading of theoretical terms seems quite intentional. The act of publishing implies the desire to be heard. However, value-free terms have low interest value for the potential reader, and value-free research rapidly becomes obscure. If obedience were relabeled alpha behavior and not rendered deplorable through associations with Adolf Eichmann, public concern would undoubtedly be meagre. In addition to capturing the interest of the public and the profession, value-laden concepts also provide an expressive outlet for the psychologist. I have talked with countless graduate students drawn into psychology out of deep humanistic concern. Within many lies a frustrated poet, philosopher, or humanitarian who finds the scientific method at once a means to expressive ends and an encumbrance to free expression. Resented is the apparent fact that the ticket to open expression through the professional media is a near lifetime in the laboratory. Many wish to share their values directly, unfettered by constant demands for systematic evidence. For them, value-laden concepts compensate for the conservatism usually imparted by these demands. The more established psychologist may indulge himself more directly. Normally, however, we are not inclined to view our personal biases as propagandistic so much as reflecting “basic truths.”


  199. 2020-silander.pdf: ⁠, Nina C. Silander, Bela Geczy Jr., Olivia Marks, Robert D. Mather (2020-01-14; statistics  /​ ​​ ​bias):

    Ideological bias is a worsening but often neglected concern for social and psychological sciences, affecting a range of professional activities and relationships, from self-reported willingness to discriminate to the promotion of ideologically saturated and scientifically questionable research constructs. Though clinical psychologists co-produce and apply social psychological research, little is known about its impact on the profession of clinical psychology.

    Following a brief review of relevant topics, such as “concept creep” and the importance of the psychotherapeutic relationship, the relevance of ideological bias to clinical psychology, counterarguments and a rebuttal, clinical applications, and potential solutions are presented. For providing empathic and multiculturally competent clinical services, in accordance with professional ethics, psychologists would benefit from treating ideological diversity as another professionally recognized diversity area.

    [See also “Political Diversity Will Improve Social Psychological Science”⁠, Duarte et al 2015.]

  200. ⁠, John P. A. Ioannidis (2005-07-13):

    Context: Controversy and uncertainty ensue when the results of clinical research on the effectiveness of interventions are subsequently contradicted. Controversies are most prominent when high-impact research is involved.

    Objectives: To understand how frequently highly cited studies are contradicted or find effects that are stronger than in other similar studies and to discern whether specific characteristics are associated with such refutation over time.

    Design: All original clinical research studies published in 3 major general clinical journals or high-impact-factor specialty journals in 1990–2003 and cited more than 1000 times in the literature were examined.

    Main Outcome Measure: The results of highly cited articles were compared against subsequent studies of comparable or larger sample size and similar or better controlled designs. The same analysis was also performed comparatively for matched studies that were not so highly cited.

    Results: Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged. Five of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (p = 0.008). Among randomized trials, studies with contradicted or stronger effects were smaller (p = 0.009) than replicated or unchallenged studies although there was no statistically-significant difference in their early or overall citation impact. Matched control studies did not have a statistically-significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results.

    Conclusions: Contradiction and initially stronger effects are not unusual in highly cited research of clinical interventions and their outcomes. The extent to which high citations may provoke contradictions and vice versa needs more study. Controversies are most common with highly cited nonrandomized studies, but even the most highly cited randomized trials may be challenged and refuted over time, especially small ones.

  201. ⁠, Marta Serra-Garcia, Uri Gneezy (2021-05-21):

    We use publicly available data to show that published papers in top psychology, economics, and general interest journals that fail to replicate are cited more than those that replicate. This difference in citation does not change after the publication of the failure to replicate. Only 12% of post-replication citations of non-replicable findings acknowledge the replication failure. Existing evidence also shows that experts predict well which papers will be replicated. Given this prediction, why are non-replicable papers accepted for publication in the first place? A possible answer is that the review team faces a trade-off. When the results are more “interesting”, they apply lower standards regarding their reproducibility.


  203. 1955-abramson.pdf: ⁠, H. A. Abramson, M. E. Jarvik, A. Levine, M. R. Kaufman, M. W. Hirsch (1955; psychology):

    The purpose of this paper is to study the responses given to a questionnaire by subjects who received a tap water ‘placebo’ instead of lysergic acid diethylamide (LSD-25), and to relate the number of responses to other variables. These variables are: body weight, number of responses on a health questionnaire, arithmetic test scores, scores on the Wechsler-Bellevue Intelligence Scale, and Rorschach test responses.

    Figure 4 shows for each question the percentage and number of subjects out of 28 who gave a positive response at least once during the 0.5, 2.5, and 4.5-hour intervals. The questions appear in the figure in the order of decreasing percentages of response to them. The time of the response and the magnitude are disregarded in this tabulation. The question receiving the greatest percentage response was (Question 24), “Are your palms moist?” As many as 60.7% reported this symptom. Half of the subjects reported headache (Question 13), fatigue (Question 44), and drowsiness (Question 45). About 36% reported anxiety (Question 47). Illness (Question 1) and dizziness (Question 15) were reported by 28.6% of the group, and 25% indicated a dream-like feeling (Question 46), increased appetite (Question 6), unsteadiness (Question 16), a hot feeling (Question 22), heaviness of hands and feet (Question 30), and weakness (Question 43). There were 19 questions which received positive responses from between 10% and 22% of the subjects. Less than 10% of the group (or no more than two subjects) responded positively to the remaining questions, but each question received a positive response from at least one subject.

    …The findings point out that a substance such as tap water, which is generally considered chemically and pharmacologically inactive, is capable of eliciting certain responses from certain subjects who believe they have received lysergic acid diethylamide. These observations emphasize once more the need for placebo controls in studies investigating the effects of drugs; without them changes which are produced merely by the situation and not by the drug are frequently falsely attributed to the action of the drug…Most subjects who respond to a placebo tend to do so most markedly during the first 0.5 hour after receiving the substance. At this time their anticipation of, and anxiety about, the effects of LSD-25 are probably greatest. Gradually the effects wear off, as the anticipation wears off. Individual differences exist in the time of peak effect, but this is the most common finding. The questions which elicited the greatest percentage response from the group were those related to anxiety (moist palms and feeling anxious) or to phenomena which commonly occur without the presence of any foreign agent (drowsiness, fatigue, and headache). The remaining questions received random responses. The fact that there is a wide range in the number of positive responses made to the questionnaire is of major interest.



  206. ⁠, Leonard Leibovici (2001-12-22):

    Objective: To determine whether remote, retroactive intercessory prayer, said for a group of patients with a bloodstream infection, has an effect on outcomes.

    Design: Double blind, parallel group, randomised controlled trial of a retroactive intervention.

    Setting: University hospital.

    Subjects: All 3393 adult patients whose bloodstream infection was detected at the hospital in 1990–1996.

    Intervention: In July 2000 patients were randomised to a control group and an intervention group. A remote, retroactive intercessory prayer was said for the well being and full recovery of the intervention group.

    Main outcome measures: Mortality in hospital, length of stay in hospital, and duration of fever.

    Results: Mortality was 28.1% (475⁄1691) in the intervention group and 30.2% (514⁄1702) in the control group (p for difference = 0.4). Length of stay in hospital and duration of fever were statistically-significantly shorter in the intervention group than in the control group (p = 0.01 and p = 0.04, respectively).

    Conclusions: Remote, retroactive intercessory prayer said for a group is associated with a shorter stay in hospital and shorter duration of fever in patients with a bloodstream infection and should be considered for use in clinical practice.

    What is already known on this topic: 2 randomised controlled trials of remote intercessory prayer (praying for persons unknown) showed a beneficial effect in patients in an intensive coronary care unit. A recent systematic review found that 57% of the randomised, placebo controlled trials of distant healing showed a positive treatment effect. What this study adds: Remote intercessory prayer said for a group of patients is associated with a shorter hospital stay and shorter duration of fever in patients with a bloodstream infection, even when the intervention is performed 4–10 years after the infection.
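    One reason to be unsurprised by the nominally significant secondary endpoints is simple multiplicity (a back-of-the-envelope observation of mine, not the paper's): with 3 independent outcomes tested at α = 0.05 under the null, the chance of at least one “significant” result is already ~14%:

    ```python
    # Probability of at least one false positive among k independent
    # null hypotheses each tested at significance level alpha.
    def familywise_error(k: int, alpha: float = 0.05) -> float:
        return 1 - (1 - alpha) ** k

    # The trial reports 3 outcomes (mortality, length of stay, fever duration):
    print(familywise_error(3))  # 0.142625
    ```

    This does not by itself explain p = 0.01, but it illustrates why a trial with several endpoints and a biologically impossible intervention is better read as a cautionary tale about significance testing than as evidence for retroactive prayer.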



  209. ⁠, Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, Lyudmila Balakireva, Ke Zhou, Richard Tobin (2014-12-26):

    The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. But the transition from a paper-based to a web-based scholarly communication system also poses challenges. In this paper, we focus on reference rot, the combination of link rot and content drift to which references to web resources included in Science, Technology, and Medicine (STM) articles are subject. We investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. We do so on the basis of a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we determine whether the HTTP URI is still responsive on the live web and whether web archives contain an archived snapshot representative of the state the referenced resource had at the time it was referenced. We observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten. We suggest that, in order to safeguard the long-term integrity of the web-based scholarly record, robust solutions to combat the reference rot problem are required. In conclusion, we provide a brief insight into the directions that are explored in this regard in the context of the Hiberlink project.








  217. ⁠, Ramal Moonesinghe, Muin J. Khoury, A. Cecile J. W. Janssens ():

    While the authors agree with John Ioannidis that “most research findings are false,” here they show that replication of research findings enhances the positive predictive value of research findings being true.
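    The core of their argument fits in a few lines. In Ioannidis's framework, with pre-study odds _R_ that a probed relationship is true, power 1 − β, and significance level α, the positive predictive value after _k_ independent studies that all reach significance is _R_(1 − β)^_k_ / (_R_(1 − β)^_k_ + α^_k_). A sketch (the parameter values below are my illustrative choices, not theirs):

    ```python
    def ppv_after_replications(R: float, power: float, alpha: float, k: int) -> float:
        """PPV of a finding when k independent studies all report significance.

        R: pre-study odds that a probed relationship is true."""
        true_path = R * power ** k   # a real effect, detected k times in a row
        false_path = alpha ** k      # no effect, k false positives in a row
        return true_path / (true_path + false_path)

    # Even with unfavorable prior odds (1 true per 4 false relationships),
    # each successful replication sharply raises the PPV:
    for k in (1, 2, 3):
        print(k, round(ppv_after_replications(R=0.25, power=0.8, alpha=0.05, k=k), 3))
    # 1 0.8
    # 2 0.985
    # 3 0.999
    ```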






  223. ⁠, Leonard P. Freedman, Iain M. Cockburn, Timothy S. Simcoe ():

    Low reproducibility rates within life science research undermine cumulative knowledge production and contribute to both delays and costs of therapeutic drug development. An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/​​​​year spent on preclinical research that is not reproducible—in the United States alone. We outline a framework for solutions and a plan for long-term improvements in reproducibility rates that will help to accelerate the discovery of life-saving therapies and cures.

    This Perspective provides estimates of the rate of irreproducibility of preclinical research and its direct cost implications. It goes on to outline a framework for solutions and a plan for long-term improvements in reproducibility rates.
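    The headline number is simple arithmetic on two estimates: an irreproducibility rate of roughly 50% applied to annual US preclinical research spending of roughly US$56B (treat these inputs as the paper's approximations):

    ```python
    us_preclinical_spend_billions = 56.4  # approximate annual US preclinical spend
    irreproducible_fraction = 0.50        # midpoint of the >50% prevalence estimate

    waste = us_preclinical_spend_billions * irreproducible_fraction
    print(f"US${waste:.1f}B/year")  # US$28.2B/year
    ```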

  224. ⁠, Killeen, Peter R (2005):

    The statistic p(rep) estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, p(rep) provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference.
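    For a normally distributed effect estimate, one common form of Killeen's estimator has a simple closed form: the expected replication _z_ is the observed _z_ diluted by √2 (sampling noise enters twice, once in the original study and once in the replication), so _p_~rep~ = Φ(_z_/√2), where _z_ corresponds to the one-tailed _p_-value. A standard-library sketch (my implementation of this form, not code from the paper):

    ```python
    from statistics import NormalDist

    _N = NormalDist()

    def p_rep(p_two_tailed: float) -> float:
        """Killeen's p(rep): estimated probability that an exact replication
        yields an effect in the same direction as the original study."""
        z = _N.inv_cdf(1 - p_two_tailed / 2)  # observed z-score
        return _N.cdf(z / 2 ** 0.5)           # noise doubles across two studies

    print(round(p_rep(0.05), 3))  # 0.917
    ```

    So a result just clearing the conventional _p_ = 0.05 threshold has roughly a 92% estimated chance of merely repeating its sign, not its statistical significance, in an exact replication.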









  233. ⁠, Kenneth Hung, William Fithian (2019-03-20):

    Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on the definition of “replicability” and disentangle the inexorable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on (1) whether published findings about the signs of effects are mostly correct, (2) how effective replication studies are in reproducing whatever true effect size was present in the original experiment, and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and post-selection inference to develop new methods that answer these questions while explicitly accounting for selection bias. Our analyses suggest that the RP:P dataset is largely consistent with publication bias due to selection of statistically-significant effects. The methods in this paper make no distributional assumptions about the true effect sizes.
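    The mechanism they model, selection of significant results inflating published effect sizes, is easy to simulate. A minimal sketch with my own parameters (a true effect of 0.2 estimated with standard error 0.15, i.e. a badly underpowered literature):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    true_effect, se, n = 0.2, 0.15, 200_000

    estimates = rng.normal(true_effect, se, size=n)       # noisy study estimates
    published = estimates[np.abs(estimates / se) > 1.96]  # only |z| > 1.96 is published

    inflation = published[published > 0].mean() / true_effect
    sign_errors = (published < 0).mean()
    print(f"inflation of positive published effects: {inflation:.2f}x")
    print(f"fraction of published effects with wrong sign: {sign_errors:.4f}")
    ```

    Under these assumptions the published literature roughly doubles the true effect size, so a careful replication "failing" to match the original effect is exactly what selection on statistical significance predicts.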





  238. ⁠, Jack W. Scannell, Jim Bosley (2016-02-10):

    A striking contrast runs through the last 60 years of biopharmaceutical discovery, research, and development. Huge scientific and technological gains should have increased the quality of academic science and raised industrial R&D efficiency. However, academia faces a “reproducibility crisis”; inflation-adjusted industrial R&D costs per novel drug increased nearly 100× between 1950 and 2010; and drugs are more likely to fail in clinical development today than in the 1970s. The contrast is explicable only if powerful headwinds reversed the gains and/​​​​or if many “gains” have proved illusory. However, discussions of reproducibility and R&D productivity rarely address this point explicitly.

    The main objectives of the primary research in this paper are: (a) to provide quantitatively and historically plausible explanations of the contrast; and (b) identify factors to which R&D efficiency is sensitive.

    We present a quantitative decision-theoretic model of the R&D process [a ‘leaky pipeline’⁠; cf the log-normal]. The model represents therapeutic candidates (eg., putative drug targets, molecules in a screening library, etc.) within a “measurement space”, with candidates’ positions determined by their performance on a variety of assays (eg., binding affinity, toxicity, in vivo efficacy, etc.) whose results correlate to a greater or lesser degree. We apply decision rules to segment the space, and assess the probability of correct R&D decisions.

    We find that when searching for rare positives (eg., candidates that will successfully complete clinical development), changes in the predictive validity of screening and disease models that many people working in drug discovery would regard as small and/​​​​or unknowable (ie., a 0.1 absolute change in correlation coefficient between model output and clinical outcomes in man) can offset large (eg., 10×, even 100×) changes in models’ brute-force efficiency. We also show how validity and reproducibility correlate across a population of simulated screening and disease models.

    We hypothesize that screening and disease models with high predictive validity are more likely to yield good answers and good treatments, so tend to render themselves and their diseases academically and commercially redundant. Perhaps there has also been too much enthusiasm for reductionist molecular models which have insufficient predictive validity. Thus we hypothesize that the average predictive validity of the stock of academically and industrially “interesting” screening and disease models has declined over time, with even small falls able to offset large gains in scientific knowledge and brute-force efficiency. The rate of creation of valid screening and disease models may be the major constraint on R&D productivity.
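    Their central result, that a small gain in a model's predictive validity can swamp a large gain in throughput, can be sketched with a toy simulation. Assume each candidate has a latent "clinical quality" _t_, a screening model outputs a score correlated with _t_ at ρ, and the top-scoring candidates are advanced (all parameter choices below are mine, not the paper's):

    ```python
    import numpy as np

    def screen_ppv(rho: float, n_candidates: int, n_advanced: int = 500,
                   rare_cutoff: float = 2.326, seed: int = 1) -> float:
        """Fraction of advanced candidates that are true positives (latent
        quality in the top ~1%), when the screen's output correlates with
        latent quality at rho."""
        rng = np.random.default_rng(seed)
        t = rng.normal(size=n_candidates)  # latent clinical quality
        score = rho * t + np.sqrt(1 - rho ** 2) * rng.normal(size=n_candidates)
        advanced = np.argsort(score)[-n_advanced:]  # advance the top scorers
        return float((t[advanced] > rare_cutoff).mean())

    for rho in (0.3, 0.5, 0.7):
        print(rho, screen_ppv(rho, n_candidates=200_000))
    ```

    The probability that an advanced candidate is genuinely good rises steeply with ρ, which is why modest erosion in the validity of screening and disease models can quietly cancel enormous gains in how many candidates get screened.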

  239. ⁠, Andreas Bender, Isidro Cortés-Ciriano (2021-02):

    We first attempted to simulate the effect of (1) speeding up phases in the drug discovery process, (2) making them cheaper and (3) making individual phases more successful on the overall financial outcome of drug-discovery projects. In every case, an improvement of the respective measure (speed, cost and success of phase) of 20% (in the case of failure rate in relative terms) has been assumed to quantify effects on the capital cost of bringing one successful drug to the market. For the simulations, a patent lifetime of 20 years was assumed, with patent applications filed at the start of clinical Phase I, and the net effect of changes of speed, cost and quality of decisions on overall project return was calculated, assuming that projects, on average, are able to return their own cost…(Studies such as [], which posed the question of which changes are most efficient in terms of improving R&D productivity, returned similar results to those presented here, although we have quantified them in more detail.)

    It can be seen in Figure 2 that a reduction of the failure rate (in particular across all clinical phases) has by far the most substantial impact on project value overall, multiple times that of a reduction of the cost of a particular phase or a decrease in the amount of time a particular phase takes. This effect is most profound in clinical Phase II, in agreement with previous studies [33], and it is a result of the relatively low success rate, long duration and high cost of the clinical phases. In other words, increasing the success of clinical phases decreases the number of expensive clinical trials needed to bring a drug to the market, and this decrease in the number of failures matters more than failing more quickly or more cheaply in terms of cost per successful, approved drug.

    Figure 2: The impact of increasing speed (with the time taken for each phase reduced by 20%), improving the quality of the compounds tested in each phase (with the failure rate reduced by 20%), and decreasing costs (by 20%) on the net profit of a drug-discovery project, assuming patenting at time of first in human tests, and with other assumptions based on [“When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”, Scannell & Bosley 2016]. It can be seen that the quality of compounds taken forward has a much more profound impact on the success of projects, far beyond improving the speed and reducing the cost of the respective phase. This has implications for the most beneficial uses of AI in drug-discovery projects.
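    The ordering in Figure 2 can be reproduced with a bare-bones expected-cost model: cost per approved drug is each phase's cost times the expected number of candidates entering that phase per eventual approval, which is the reciprocal of the product of that phase's and all downstream success probabilities. The phase costs and success rates below are round illustrative numbers of my own, not the paper's:

    ```python
    # (cost in $M, probability of success) per clinical phase -- illustrative only
    PHASES = [(25, 0.52), (60, 0.29), (150, 0.58)]  # Phase I, II, III

    def cost_per_approval(phases):
        total, entrants_per_approval = 0.0, 1.0
        # Work backwards: entrants to phase i per approval = 1 / product of
        # success probabilities from phase i onward.
        for cost, p in reversed(phases):
            entrants_per_approval /= p
            total += cost * entrants_per_approval
        return total

    baseline = cost_per_approval(PHASES)
    cheaper_ph2 = [(25, 0.52), (60 * 0.8, 0.29), (150, 0.58)]      # Phase II 20% cheaper
    better_ph2 = [(25, 0.52), (60, 1 - 0.71 * 0.8), (150, 0.58)]   # 20% fewer failures

    print(f"baseline:             ${baseline:.0f}M per approval")
    print(f"20% cheaper Phase II: ${cost_per_approval(cheaper_ph2):.0f}M")
    print(f"20% fewer failures:   ${cost_per_approval(better_ph2):.0f}M")
    ```

    This ignores the speed/patent-life component of their model, but for these numbers it reproduces the qualitative finding: cutting the Phase II failure rate saves several times more per approval than cutting Phase II cost by the same 20%.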

    …When translating this to drug-discovery programmes, this means that AI needs to support:

    1. better compounds going into clinical trials (related to the structure itself, but also including the right dosing/​​​​​PK for suitable efficacy versus the safety/​​​​​therapeutic index, in the desired target tissue);
    2. better validated targets (to decrease the number of failures owing to efficacy, especially in clinical Phases II and III, which have a profound impact on overall project success and in which target validation is currently probably not yet where one would like it to be []);
    3. better patient selection (eg., using biomarkers) []; and
    4. better conductance of trials (with respect to, eg., patient recruitment and adherence) [36].

    This finding is in line with previous research in the area cited already [33], as well as a study that compared the impact of the quality of decisions that can be made to the number of compounds that can be processed with a particular technique [30]. In this latter case, the authors found that: “when searching for rare positives (eg., candidates that will successfully complete clinical development), changes in the predictive validity of screening and disease models that many people working in drug discovery would regard as small and/​​​​or unknowable (ie., a 0.1 absolute change in correlation coefficient between model output and clinical outcomes in man) can offset large (eg., tenfold, even 100-fold) changes in models’ brute-force efficiency.” Still, currently the main focus of AI in drug discovery, in many cases, seems to be on speed and cost, as opposed to the quality of decisions.



  242. 2015-franco.pdf


  244. 2016-lane.pdf: “Is there a publication bias in behavioral intranasal oxytocin research on humans? Opening the file drawer of one lab”⁠, A. Lane, O. Luminet, G. Nave, M. Mikolajczak



  247. 2012-lee.pdf: ⁠, James Jung-Hun Lee (2012-07-26; genetics  /​ ​​ ​heritable):

    Personality psychology aims to explain the causes and the consequences of variation in behavioural traits. Because of the observational nature of the pertinent data, this endeavour has provoked many controversies. In recent years, the computer scientist Judea Pearl has used a graphical approach to extend the innovations in causal inference developed by Ronald Fisher and Sewall Wright. Besides shedding much light on the philosophical notion of causality itself, this graphical framework now contains many powerful concepts of relevance to the controversies just mentioned. In this article, some of these concepts are applied to areas of personality research where questions of causation arise, including the analysis of observational data and the genetic sources of individual differences.




  251. 2017-mukadam.pdf



  254. 2019-antoniou.pdf: ⁠, Mark Antoniou (2019-01-01; psychology):

    Bilingualism was once thought to result in cognitive disadvantages, but research in recent decades has demonstrated that experience with two (or more) languages confers a bilingual advantage in executive functions and may delay the incidence of Alzheimer’s disease. However, conflicting evidence has emerged leading to questions concerning the robustness of the bilingual advantage for both executive functions and dementia incidence. Some investigators have failed to find evidence of a bilingual advantage; others have suggested that bilingual advantages may be entirely spurious, while proponents of the advantage case have continued to defend it. A heated debate has ensued, and the field has now reached an impasse. This review critically examines evidence for and against the bilingual advantage in executive functions, cognitive aging, and brain plasticity, before outlining how future research could shed light on this debate and advance knowledge of how experience with multiple languages affects cognition and the brain.

  255. 2020-nichols.pdf: ⁠, Emily S. Nichols, Conor J. Wild, Bobby Stojanoski, Michael E. Battista, Adrian M. Owen (2020-04-20; psychology):

    Whether acquiring a second language affords any general advantages to executive function has been a matter of fierce scientific debate for decades. If being bilingual does have benefits over and above the broader social, employment, and lifestyle gains that are available to speakers of a second language, then it should manifest as a cognitive advantage in the general population of bilinguals. We assessed 11,041 participants on a broad battery of 12 executive tasks whose functional and neural properties have been well described. Bilinguals showed an advantage over monolinguals on only one test (whereas monolinguals performed better on four tests), and these effects all disappeared when the groups were matched to remove potentially confounding factors. In any case, the size of the positive bilingual effect in the unmatched groups was so small that it would likely have a negligible impact on the cognitive performance of any individual.

    [Keywords: bilingualism, executive function, cognition, aging, null-hypothesis testing.]





  260. ⁠, Eric Jonas, Konrad Paul Kording (2016-11-14):

    There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.

    Author Summary

    Neuroscience is held back by the fact that it is hard to evaluate if a conclusion is correct; the complexity of the systems under study and their experimental inaccessibility make the assessment of algorithmic and data analytic techniques challenging at best. We thus argue for testing approaches using known artifacts, where the correct interpretation is known. Here we present a microprocessor platform as one such test case. We find that many approaches in neuroscience, when used naïvely, fall short of producing a meaningful understanding.

  261. 2002-lazebnik.pdf

  262. ⁠, Jacob Westfall, Tal Yarkoni (2016-03-17):

    Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity.
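    The core phenomenon is reproducible in a short simulation: if _y_ is driven by a single latent construct _T_, and x1 and x2 are two noisy measures of that same construct, a regression of _y_ on both will routinely declare x2 “significant over and above” x1—not because x2 has incremental validity, but because x2 carries construct variance that the unreliable x1 missed. A sketch with my own parameters (reliability 0.7, _n_ = 500 per simulated study):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n, sims, reliability = 500, 300, 0.7
    noise_var = (1 - reliability) / reliability  # measurement-error variance

    false_positives = 0
    for _ in range(sims):
        T = rng.normal(size=n)                          # latent construct
        x1 = T + rng.normal(scale=noise_var ** 0.5, size=n)  # noisy measure 1
        x2 = T + rng.normal(scale=noise_var ** 0.5, size=n)  # noisy measure 2
        y = T + rng.normal(size=n)                      # y depends only on T
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS fit
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 3)
        se2 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
        z = beta[2] / se2                               # test x2's coefficient
        false_positives += abs(z) > 1.96                # normal approx.; n is large

    fp_rate = false_positives / sims
    print(f"Type I error rate for 'incremental validity' of x2: {fp_rate:.2f}")
    ```

    With these parameters the false-positive rate is essentially 100%, matching the paper's counterintuitive finding that larger samples make the problem worse rather than better.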

  263. 1936-stouffer.pdf: ⁠, Samuel A. Stouffer (1936; statistics):

    It is not generally recognized that such an analysis [using regression] assumes that each of the variables is perfectly measured, such that a second measure X’i, of the variable measured by Xi, has a correlation of unity with Xi. If some of the measures are more accurate than others, the analysis is impaired [by measurement error]. For example, the sociologist may have a problem in which an index of economic status and an index of nativity are independent variables. What is the effect, if the index of economic status is much less satisfactory than the index of nativity? Ordinarily, the effect will be to underestimate the [coefficient] of the less adequately measured variable and to overestimate the [coefficient] of the more adequately measured variable.

    If either the reliability or validity of an index is in question, at least two measures of the variable are required to permit an evaluation. The purpose of this paper is to provide a logical basis and a simple arithmetical procedure (a) for measuring the effect of the use of 2 indexes, each of one or more variables, in partial and multiple correlation analysis and (b) for estimating the likely effect if 2 indexes, not available, could be secured.

  264. 1942-thorndike.pdf⁠, Robert L. Thorndike

  265. 1965-kahneman.pdf: “Control of spurious association and the reliability of the controlled variable”⁠, Daniel Kahneman

  266. 2015-mackenzie.pdf


  268. 2016-findley.pdf


  270. 2015-cofnas.pdf


  272. 1987-rossi


  274. ⁠, Toby Ord, Rafaela Hillerbrand, Anders Sandberg (2008-10-30):

    Some risks have extremely high stakes. For example, a worldwide pandemic or asteroid impact could potentially kill more than a billion people. Comfortingly, scientific calculations often put very low probabilities on the occurrence of such catastrophes. In this paper, we argue that there are important new methodological problems which arise when assessing global catastrophic risks and we focus on a problem regarding probability estimation. When an expert provides a calculation of the probability of an outcome, they are really providing the probability of the outcome occurring, given that their argument is watertight. However, their argument may fail for a number of reasons such as a flaw in the underlying theory, a flaw in the modeling of the problem, or a mistake in the calculations. If the probability estimate given by an argument is dwarfed by the chance that the argument itself is flawed, then the estimate is suspect. We develop this idea formally, explaining how it differs from the related distinctions of model and parameter uncertainty. Using the risk estimates from the Large Hadron Collider as a test case, we show how serious the problem can be when it comes to catastrophic risks and how best to address it.


  276. 2018-camerer.pdf: ⁠, Colin F. Camerer, Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek, Thomas Pfeiffer, Adam Altmejd, Nick Buttrick, Taizan Chan, Yiling Chen, Eskil Forsell, Anup Gampa, Emma Heikensten, Lily Hummer, Taisuke Imai, Siri Isaksson, Dylan Manfredi, Julia Rose, Eric-Jan Wagenmakers, Hang Wu (2018-08-27; statistics  /​ ​​ ​bias):

    Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a statistically-significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.

  277. ⁠, Eskil Forsell, Domenico Viganola, Thomas Pfeiffer, Johan Almenberg, Brad Wilson, Yiling Chen, Brian A. Nosek, Magnus Johannesson, Anna Dreber (2018-10-25):


    • Psychologists participated in prediction markets to predict replication outcomes.
    • Prediction markets correctly predicted 75% of the replication outcomes.
    • Prediction markets performed better than survey data in predicting replication outcomes.
    • Survey data performed better in predicting relative effect size of the replications.

    Understanding and improving reproducibility is crucial for scientific progress. Prediction markets and related methods of eliciting peer beliefs are promising tools to predict replication outcomes. We invited researchers in the field of psychology to judge the replicability of 24 studies replicated in the large scale Many Labs 2 project. We elicited peer beliefs in prediction markets and surveys about two replication success metrics: the probability that the replication yields a statistically-significant effect in the original direction (p < 0.001), and the relative effect size of the replication. The prediction markets correctly predicted 75% of the replication outcomes, and were highly correlated with the replication outcomes. Survey beliefs were also statistically-significantly correlated with replication outcomes, but had larger prediction errors. The prediction markets for relative effect sizes attracted little trading and thus did not work well. The survey beliefs about relative effect sizes performed better and were statistically-significantly correlated with observed relative effect sizes. The results suggest that replication outcomes can be predicted and that the elicitation of peer beliefs can increase our knowledge about scientific reproducibility and the dynamics of hypothesis testing.

    [Keywords: reproducibility, replications, prediction markets, beliefs]

  278. ⁠, Suzanne Hoogeveen, Alexandra Sarafoglou, Eric-Jan Wagenmakers (2020-08-21):

    Large-scale collaborative projects recently demonstrated that several key findings from the social-science literature could not be replicated successfully. Here, we assess the extent to which a finding’s replication success relates to its intuitive plausibility. Each of 27 high-profile social-science findings was evaluated by 233 people without a Ph.D. in psychology. Results showed that these laypeople predicted replication success with above-chance accuracy (ie., 59%). In addition, when participants were informed about the strength of evidence from the original studies, this boosted their prediction performance to 67%. We discuss the prediction patterns and apply signal detection theory to disentangle detection ability from response bias. Our study suggests that laypeople’s predictions contain useful information for assessing the probability that a given finding will be replicated successfully.

    [Keywords: open science, meta-science, replication crisis, prediction survey, open data, open materials]

  279. 2016-panero.pdf





  284. {#linkBibliography-(nautilus)-2016 .docMetadata}, Bob Henderson (Nautilus) (2016-12-29):

    [Memoir of an ex-theoretical-physics grad student at the University of Rochester with Sarada Rajeev who gradually became disillusioned with physics research, burned out, and left to work in finance and is now a writer. Henderson was attracted by the life of the mind and the grandeur of uncovering the mysteries of the universe, only to discover that, after the endless triumphs of the 20th century and predicting enormous swathes of empirical experimental data, theoretical physics has drifted and become a branch of abstract mathematics, exploring ever more recondite, simplified, and implausible models in the hopes of obtaining any insight into physics’ intractable problems; one must be brilliant to even understand the questions being asked by the math and incredibly hardworking to make any progress which hasn’t already been tried by even more brilliant physicists of the past (while living in ignominious poverty and terror of not getting a grant or tenure), but one’s entire career may be spent chasing a useless dead end without one having any clue.]

    The next thing I knew I was crouched in a chair in Rajeev’s little office, with a notebook on my knee and focused with everything I had on an impromptu lecture he was giving me on an esoteric aspect of some mathematical subject I’d never heard of before. Zeta functions, or elliptic functions, or something like that. I’d barely introduced myself when he’d started banging out equations on his board. Trying to follow was like learning a new game, with strangely shaped pieces and arbitrary rules. It was a challenge, but I was excited to be talking to a real physicist about his real research, even though there was one big question nagging me that I didn’t dare to ask: What does any of this have to do with physics?

    …Even a Theory of Everything, I started to realize, might suffer the same fate of multiple interpretations. The Grail could just be a hall of mirrors, with no clear answer to the “What?” or the “How?”—let alone the “Why?” Plus physics had changed since Big Al bestrode it. Mathematical as opposed to physical intuition had become more central, partly because quantum mechanics was such a strange multi-headed beast that it diminished the role that everyday, or even Einstein-level, intuition could play. So much for my dreams of staring out windows and into the secrets of the universe.

    …If I did lose my marbles for a while, this is how it started. With cutting my time outside of Bausch and Lomb down to nine hours a day—just enough to pedal my mountain bike back to my bat cave of an apartment each night, sleep, shower, and pedal back in. With filling my file cabinet with boxes and cans of food, and carting in a coffee maker, mini-fridge, and microwave so that I could maximize the time spent at my desk. With feeling guilty after any day that I didn’t make my 15-hour quota. And with exceeding that quota frequently enough that I regularly circumnavigated the clock: staying later and later each night until I was going home in the morning, then in the afternoon, and finally at night again.

    …The longer and harder I worked, the more I realized I didn’t know. Papers that took days or weeks to work through cited dozens more that seemed just as essential to digest; the piles on my desk grew rather than shrunk. I discovered the stark difference between classes and research: With no syllabus to guide me I didn’t know how to keep on a path of profitable inquiry. Getting “wonderfully lost” sounded nice, but the reality of being lost, and of re-living, again and again, that first night in the old woman’s house, with all of its doubts and dead-ends and that horrible hissing voice was … something else. At some point, flipping the lights on in the library no longer filled me with excitement but with dread.

    …My mental model building was hitting its limits. I’d sit there in Rajeev’s office with him and his other students, or in a seminar given by some visiting luminary, listening and putting each piece in place, and try to fix in memory what I’d built so far. But at some point I’d lose track of how the green stick connected to the red wheel, or whatever, and I’d realize my picture had diverged from reality. Then I’d try toggling between tracing my steps back in memory to repair my mistake and catching all the new pieces still flying in from the talk. Stray pieces would fall to the ground. My model would start falling down. And I would fall hopelessly behind. A year or so of research with Rajeev, and I found myself frustrated and in a fog, sinking deeper into the quicksand but not knowing why. Was it my lack of mathematical background? My grandiose goals? Was I just not intelligent enough?

    …I turned 30 during this time and the milestone hit me hard. I was nearly four years into the Ph.D. program, and while my classmates seemed to be systematically marching toward their degrees, collecting data and writing papers, I had no thesis topic and no clear path to graduation. My engineering friends were becoming managers, getting married, buying houses. And there I was entering my fourth decade of life feeling like a pitiful and penniless mole, aimlessly wandering dark empty tunnels at night, coming home to a creepy crypt each morning with nothing to show for it, and checking my bed for bugs before turning out the lights…As I put the final touches on my thesis, I weighed my options. I was broke, burned out, and doubted my ability to go any further in theoretical physics. But mostly, with The Grail now gone and the physics landscape grown so immense, I thought back to Rajeev’s comment about knowing which problems to solve and realized that I still didn’t know what, for me, they were.





  289. 2017-debarra.pdf: “Reporting bias inflates the reputation of medical treatments: A comparison of outcomes in clinical trials and online product reviews”⁠, Mícheál de Barra

  290. ⁠, Jelte M. Wicherts, Marjan Bakker, Dylan Molenaar (2011-10-04):


    Background:

    The widespread reluctance to share published research data is often hypothesized to be due to the authors’ fear that reanalysis may expose errors in their work or may produce conclusions that contradict their own. However, these hypotheses have not previously been studied systematically.

    Methods and Findings:

    We related the reluctance to share research data for reanalysis to 1148 statistically-significant results reported in 49 papers published in two major psychology journals. We found the reluctance to share data to be associated with weaker evidence (against the null hypothesis of no effect) and a higher prevalence of apparent errors in the reporting of statistical results. The unwillingness to share data was particularly clear when reporting errors had a bearing on statistical-significance.


    Conclusions:

    Our findings on the basis of psychological papers suggest that statistical results are particularly hard to verify when reanalysis is more likely to lead to contrasting conclusions. This highlights the importance of establishing mandatory data archiving policies.


  292. ⁠, Denes Szucs, John P. A. Ioannidis (2017-02-06):

    We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically-significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.

    Author summary:

    Biomedical science, psychology, and many other fields may be suffering from a serious replication crisis. In order to gain insight into some factors behind this crisis, we have analyzed statistical information extracted from thousands of cognitive neuroscience and psychology research papers. We established that the statistical power to discover existing relationships has not improved during the past half century. A consequence of low statistical power is that research studies are likely to report many false positive findings. Using our large dataset, we estimated the probability that a statistically-significant finding is false (called false report probability). With some reasonable assumptions about how often researchers come up with correct hypotheses, we conclude that more than 50% of published findings deemed to be statistically-significant are likely to be false. We also observed that cognitive neuroscience studies had higher false report probability than psychology studies, due to smaller sample sizes in cognitive neuroscience. In addition, the higher the impact factors of the journals in which the studies were published, the lower was the statistical power. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
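    The false report probability described above follows directly from Bayes’ rule: among statistically-significant results, what fraction come from true null hypotheses? A minimal sketch (the 0.44 power figure is the paper’s median for medium effects; the 1-in-10 prior and α = 0.05 are illustrative assumptions, not the paper’s exact parameters):

```python
# Hedged sketch of the false report probability (FRP):
# FRP = P(H0 true | result significant), given the prior probability that a
# tested hypothesis is true, the statistical power, and the alpha level.

def false_report_probability(prior: float, power: float, alpha: float = 0.05) -> float:
    """P(null is true | significant result) for a field with prior P(H1) = prior."""
    true_positives = prior * power           # significant results from real effects
    false_positives = (1 - prior) * alpha    # significant results from null effects
    return false_positives / (true_positives + false_positives)

# Median power for medium effects reported above (0.44), with an assumed
# 1-in-10 prior that a tested hypothesis is true:
frp = false_report_probability(prior=0.10, power=0.44, alpha=0.05)
print(f"FRP = {frp:.2f}")  # → FRP = 0.51
```

    Under these assumptions, slightly more than half of significant findings are false, matching the paper’s “likely to exceed 50%” conclusion.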

  293. 2017-fanelli.pdf




  297. 2017-adams.pdf



  300. Modus

  301. ⁠, Roy F. Baumeister, Jennifer D. Campbell, Joachim I. Krueger, Kathleen D. Vohs (2003-05-01):

    Self-esteem has become a household word. Teachers, parents, therapists, and others have focused efforts on boosting self-esteem, on the assumption that high self-esteem will cause many positive outcomes and benefits—an assumption that is critically evaluated in this review.

    Appraisal of the effects of self-esteem is complicated by several factors. Because many people with high self-esteem exaggerate their successes and good traits, we emphasize objective measures of outcomes. High self-esteem is also a heterogeneous category, encompassing people who frankly accept their good qualities along with narcissistic, defensive, and conceited individuals.

    The modest correlations between self-esteem and school performance do not indicate that high self-esteem leads to good performance. Instead, high self-esteem is partly the result of good school performance. Efforts to boost the self-esteem of pupils have not been shown to improve academic performance and may sometimes be counterproductive. Job performance in adults is sometimes related to self-esteem, although the correlations vary widely, and the direction of causality has not been established. Occupational success may boost self-esteem rather than the reverse. Alternatively, self-esteem may be helpful only in some job contexts. Laboratory studies have generally failed to find that self-esteem causes good task performance, with the important exception that high self-esteem facilitates persistence after failure.

    People high in self-esteem claim to be more likable and attractive, to have better relationships, and to make better impressions on others than people with low self-esteem, but objective measures disconfirm most of these beliefs. Narcissists are charming at first but tend to alienate others eventually. Self-esteem has not been shown to predict the quality or duration of relationships.

    High self-esteem makes people more willing to speak up in groups and to criticize the group’s approach. Leadership does not stem directly from self-esteem, but self-esteem may have indirect effects. Relative to people with low self-esteem, those with high self-esteem show stronger in-group favoritism, which may increase prejudice and discrimination.

    Neither high nor low self-esteem is a direct cause of violence. Narcissism leads to increased aggression in retaliation for wounded pride. Low self-esteem may contribute to externalizing behavior and delinquency, although some studies have found that there are no effects or that the effect of self-esteem vanishes when other variables are controlled. The highest and lowest rates of cheating and bullying are found in different subcategories of high self-esteem.

    Self-esteem has a strong relation to happiness. Although the research has not clearly established causation, we are persuaded that high self-esteem does lead to greater happiness. Low self-esteem is more likely than high to lead to depression under some circumstances. Some studies support the buffer hypothesis, which is that high self-esteem mitigates the effects of stress, but other studies come to the opposite conclusion, indicating that the negative effects of low self-esteem are mainly felt in good times. Still others find that high self-esteem leads to happier outcomes regardless of stress or other circumstances.

    High self-esteem does not prevent children from smoking, drinking, taking drugs, or engaging in early sex. If anything, high self-esteem fosters experimentation, which may increase early sexual activity or drinking, but in general effects of self-esteem are negligible. One important exception is that high self-esteem reduces the chances of bulimia in females.

    Overall, the benefits of high self-esteem fall into two categories: enhanced initiative and pleasant feelings. We have not found evidence that boosting self-esteem (by therapeutic interventions or school programs) causes benefits. Our findings do not support continued widespread efforts to boost self-esteem in the hope that it will by itself foster improved outcomes. In view of the heterogeneity of high self-esteem, indiscriminate praise might just as easily promote narcissism, with its less desirable consequences. Instead, we recommend using praise to boost self-esteem as a reward for socially desirable behavior and self-improvement.

  302. 2014-02-25-matter-themanwhodestroyedamericasego.html



  305. 2016-pica.pdf

  306. ⁠, Andrew Gelman, Daniel Simpson, Michael Betancourt (2017-08-24):

    A key sticking point of Bayesian analysis is the choice of prior distribution, and there is a vast literature on potential defaults including uniform priors, Jeffreys’ priors, reference priors, maximum entropy priors, and weakly informative priors. These methods, however, often manifest a key conceptual tension in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood. In this paper we resolve this apparent paradox by placing the choice of prior into the context of the entire Bayesian analysis, from inference to prediction to model evaluation.






  312. 2017-kjaer.pdf






  318. 2017-allamee.pdf: ⁠, Rasha Al-Lamee, David Thompson, Hakim-Moulay Dehbi, Sayan Sen, Kare Tang, John Davies, Thomas Keeble, Michael Mielewczik, Raffi Kaprielian, Iqbal S. Malik, Sukhjinder S. Nijjer, Ricardo Petraco, Christopher Cook, Yousif Ahmad, James Howard, Christopher Baker, Andrew Sharp, Robert Gerber, Suneel Talwar, Ravi Assomull, Jamil Mayet, Roland Wensel, David Collier, Matthew Shun-Shin, Simon A. Thom, Justin E. Davies, Darrel P. Francis (2017-11-02; statistics  /​ ​​ ​causality):

    Background: Symptomatic relief is the primary goal of percutaneous coronary intervention (PCI) in stable angina and is commonly observed clinically. However, there is no evidence from blinded, placebo-controlled randomised trials to show its efficacy.

    Methods: ORBITA is a blinded, multicentre randomised trial of PCI versus a placebo procedure for angina relief that was done at five study sites in the UK. We enrolled patients with severe (≥70%) single-vessel stenoses. After enrolment, patients received 6 weeks of medication optimisation. Patients then had pre-randomisation assessments with cardiopulmonary exercise testing, symptom questionnaires, and dobutamine stress echocardiography. Patients were randomised 1:1 to undergo PCI or a placebo procedure by use of an automated online randomisation tool. After 6 weeks of follow-up, the assessments done before randomisation were repeated at the final assessment. The primary endpoint was difference in exercise time increment between groups. All analyses were based on the intention-to-treat principle and the study population contained all participants who underwent randomisation. This study is registered with ClinicalTrials.gov, number NCT02062593.

    Findings: ORBITA enrolled 230 patients with ischaemic symptoms. After the medication optimisation phase and between Jan 6, 2014, and Aug 11, 2017, 200 patients underwent randomisation, with 105 patients assigned PCI and 95 assigned the placebo procedure. Lesions had mean area stenosis of 84.4% (SD 10.2), fractional flow reserve of 0.69 (0.16), and instantaneous wave-free ratio of 0.76 (0.22). There was no statistically-significant difference in the primary endpoint of exercise time increment between groups (PCI minus placebo 16.6 s, 95% CI −8.9 to 42.0, p = 0.200). There were no deaths. Serious adverse events included four pressure-wire related complications in the placebo group, which required PCI, and five major bleeding events, including two in the PCI group and three in the placebo group.

    Interpretation: In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure. The efficacy of invasive procedures can be assessed with a placebo control, as is standard for pharmacotherapy.


  320. ⁠, Giovanni Sala, Fernand Gobet (2017-02):

    • Music training is thought to improve youngsters’ cognitive and academic skills.
    • Results show a small overall effect size (d = 0.16, K = 118).
    • Music training seems to moderately enhance youngsters’ intelligence and memory.
    • The design quality of the studies is negatively related to the size of the effects.
    • Future studies should include random assignment and active control groups.

    Music training has been recently claimed to enhance children and young adolescents’ cognitive and academic skills. However, substantive research on transfer of skills suggests that far-transfer—ie., the transfer of skills between 2 areas only loosely related to each other—occurs rarely.

    In this meta-analysis, we examined the available experimental evidence regarding the impact of music training on children and young adolescents’ cognitive and academic skills. The results of the meta-analysis showed (a) a small overall effect size (d = 0.16); (b) slightly greater effect sizes with regard to intelligence (d = 0.35) and memory-related outcomes (d = 0.34); and (c) an inverse relation between the size of the effects and the methodological quality of the study design.

    These results suggest that music training does not reliably enhance children and young adolescents’ cognitive or academic skills, and that previous positive findings were probably due to confounding variables.

    [Keywords: music training, transfer, cognitive skills, education, meta-analysis]


  322. 2014-mosing.pdf: ⁠, Miriam A. Mosing, Guy Madison, Nancy L. Pedersen, Ralf Kuja-Halkola, Fredrik Ullén (2014-07-30; genetics  /​ ​​ ​correlation):

    The relative importance of nature and nurture for various forms of expertise has been intensely debated. Music proficiency is viewed as a general model for expertise, and associations between deliberate practice and music proficiency have been interpreted as supporting the prevailing idea that long-term deliberate practice inevitably results in increased music ability.

    Here, we examined the associations (rs = 0.18–0.36) between music practice and music ability (rhythm, melody, and pitch discrimination) in 10,500 Swedish twins. We found that music practice was substantially heritable (40%–70%). Associations between music practice and music ability were predominantly genetic, and, contrary to the causal hypothesis, nonshared environmental influences did not contribute. There was no difference in ability within monozygotic twin pairs differing in their amount of practice, so that when genetic predisposition was controlled for, more practice was no longer associated with better music skills.

    These findings suggest that music practice may not causally influence music ability and that genetic variation among individuals affects both ability and inclination to practice.

    [Keywords: training, expertise, music ability, practice, heritability, twin, causality]

  323. ⁠, Giovanni Sala, Fernand Gobet (2020-01-14):

    Music training has repeatedly been claimed to positively impact children’s cognitive skills and academic achievement. This claim relies on the assumption that engaging in intellectually demanding activities fosters particular domain-general cognitive skills, or even general intelligence. The present meta-analytic review (N = 6,984, k = 254, m = 54) shows that this belief is incorrect. Once study design quality is controlled for, the overall effect of music training programs is null (g ≈ 0) and highly consistent across studies (τ2 ≈ 0). Small statistically-significant overall effects are obtained only in those studies implementing no random allocation of participants and employing non-active controls (g ≈ 0.200, p < 0.001). Interestingly, music training is ineffective regardless of the type of outcome measure (eg., verbal, non-verbal, speed-related, etc.). Furthermore, we note that, beyond meta-analysis of experimental studies, a considerable amount of cross-sectional evidence indicates that engagement in music has no impact on people’s non-music cognitive skills or academic achievement. We conclude that researchers’ optimism about the benefits of music training is empirically unjustified and stems from misinterpretation of the empirical data and, possibly, confirmation bias. Given the clarity of the results, the large number of participants involved, and the numerous studies carried out so far, we conclude that this line of research should be dismissed.



  326. 2018-wood.pdf: “The Elusive Backfire Effect: Mass Attitudes’ Steadfast Factual Adherence”⁠, Thomas Wood, Ethan Porter



  329. 2017-jerrim.pdf: “Does teaching children how to play cognitively demanding games improve their educational attainment? Evidence from a Randomised Controlled Trial of chess instruction in England”⁠, John Jerrim

  330. 2017-wallach.pdf: “Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials”⁠, American Medical Association



  333. 2018-langbert.pdf: “Homogenous: The Political Affiliations of Elite Liberal Arts College Faculty”⁠, Mitchell Langbert


  335. 2018-watts.pdf: ⁠, Tyler W. Watts, Greg J. Duncan, Haonan Quan (2018-01-01; psychology):

    We replicated and extended Shoda, Mischel, and Peake’s (1990) famous marshmallow study, which showed strong bivariate correlations between a child’s ability to delay gratification just before entering school and both adolescent achievement and socioemotional behaviors. Concentrating on children whose mothers had not completed college, we found that an additional minute waited at age 4 predicted a gain of approximately one tenth of a standard deviation in achievement at age 15. But this bivariate correlation was only half the size of those reported in the original studies and was reduced by two thirds in the presence of controls for family background, early cognitive ability, and the home environment. Most of the variation in adolescent achievement came from being able to wait at least 20 s. Associations between delay time and measures of behavioral outcomes at age 15 were much smaller and rarely statistically-significant.




  339. 2018-berman.pdf: “p-Hacking and False Discovery in A  /​ ​​ ​B Testing”⁠, Ron Berman, Leonid Pekelis, Aisling Scott, Christophe Van den Bulte





  344. ⁠, Matthew Cobb (2017-11-16):

    In 1961, the National Institutes of Health (NIH) began to circulate biological preprints in a forgotten experiment called the Information Exchange Groups (IEGs). This system eventually attracted over 3,600 participants and saw the production of over 2,500 different documents, but by 1967, it was effectively shut down following the refusal of journals to accept articles that had been circulated as preprints. This article charts the rise and fall of the IEGs and explores the parallels with the 1990s and the biomedical preprint movement of today.


  346. ⁠, Y. A. de Vries, A. M. Roest, P. de Jonge, P. Cuijpers, M. R. Munafò, J. A. Bastiaansen (2018-08-18):

    Evidence-based medicine is the cornerstone of clinical practice, but it is dependent on the quality of evidence upon which it is based. Unfortunately, up to half of all randomized controlled trials (RCTs) have never been published, and trials with statistically-significant findings are more likely to be published than those without (Dwan et al 2013). Importantly, negative trials face additional hurdles beyond study publication bias that can result in the disappearance of non-significant results (Boutron et al 2010; Dwan et al 2013; Duyx et al 2017). Here, we analyze the cumulative impact of biases on apparent efficacy, and discuss possible remedies, using the evidence base for two effective treatments for depression: antidepressants and psychotherapy.

    Figure 1: The cumulative impact of reporting and citation biases on the evidence base for antidepressants. (a) displays the initial, complete cohort of trials, while (b) through (e) show the cumulative effect of biases. Each circle indicates a trial, while the color indicates the results or the presence of spin. Circles connected by a grey line indicate trials that were published together in a pooled publication. In (e), the size of the circle indicates the (relative) number of citations received by that category of studies.






  353. ⁠, Thomas Schäfer, Marcus A. Schwarz (2019-04-11; statistics  /​ ​​ ​peer-review):

    Effect sizes are the currency of psychological research. They quantify the results of a study to answer the research question and are used to calculate statistical power. The interpretation of effect sizes—when is an effect small, medium, or large?—has been guided by the recommendations Jacob Cohen gave in his pioneering writings starting in 1962: Either compare an effect with the effects found in past research or use certain conventional benchmarks.

    The present analysis shows that neither of these recommendations is currently applicable. From past publications without pre-registration, 900 effects were randomly drawn and compared with 93 effects from publications with pre-registration, revealing a large difference: Effects from the former (median r = 0.36) were much larger than effects from the latter (median r = 0.16). That is, certain biases, such as publication bias or questionable research practices, have caused a dramatic inflation in published effects, making it difficult to compare an actual effect with the real population effects (as these are unknown). In addition, there were very large differences in the mean effects between psychological sub-disciplines and between different study designs, making it impossible to apply any global benchmarks.

    Many more pre-registered studies are needed in the future to derive a reliable picture of real population effects.

    Figure 1: Distributions of effects (absolute values) from articles published with (N = 89) and without (N = 684) pre-registration. The distributions contain all effects that were extracted as or could be transformed into a correlation coefficient r.
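    One practical consequence of this inflation can be sketched with the standard Fisher-z sample-size approximation for a correlation test: a study powered for the inflated published median (r = 0.36) is far too small to detect the preregistered median (r = 0.16). The formula and z-constants below are textbook values (two-sided α = 0.05, 80% power); applying them to these two medians is my illustration, not the paper’s calculation.

```python
import math

# Hedged sketch: approximate sample size to detect a correlation r at
# two-sided alpha = 0.05 and 80% power, via the Fisher z-transform.
# z_alpha and z_beta are the standard normal quantiles for those settings.

def n_for_correlation(r: float, z_alpha: float = 1.959964, z_beta: float = 0.841621) -> int:
    """Approximate n needed to detect correlation r."""
    z_r = math.atanh(r)  # Fisher z-transform of r
    return math.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

print(n_for_correlation(0.36))  # planned from the inflated literature: 59
print(n_for_correlation(0.16))  # needed for the preregistered median: 305
```

    A researcher who budgets ~60 participants based on the published literature would have well under 80% power against the more realistic effect.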



  357. ⁠, Richard Klein, Michelangelo Vianello, Fred Hasselman, Byron Adams, Reginald B. Adams, Jr., Sinan Alper, Mark Aveyard, Jordan Axt, Mayowa Babalola, Štěpán Bahník, Mihaly Berkics, Michael Bernstein, Daniel Berry, Olga Bialobrzeska, Konrad Bocian, Mark Brandt, Robert Busching, Huajian Cai, Fanny Cambier, Katarzyna Cantarero, Cheryl Carmichael, Zeynep Cemalcilar, Jesse Chandler, Jen-Ho Chang, Armand Chatard, Eva Chen, Winnee Cheong, David ⁠, Sharon Coen, Jennifer Coleman, Brian Collisson, Morgan Conway, Katherine Corker, Paul Curran, Fiery Cushman, Ilker Dalgar, William Davis, Maaike de Bruijn, Marieke de Vries, Thierry Devos, Canay Doğulu, Nerisa Dozo, Kristin Dukes, Yarrow Dunham, Kevin Durrheim, Matthew Easterbrook, Charles Ebersole, John Edlund, Alexander English, Anja Eller, Carolyn Finck, Miguel-Ángel Freyre, Mike Friedman, Natalia Frankowska, Elisa Galliani, Tanuka Ghoshal, Steffen Giessner, Tripat Gill, Timo Gnambs, Angel Gomez, Roberto Gonzalez, Jesse Graham, Jon Grahe, Ivan Grahek, Eva Green, Kakul Hai, Matthew Haigh, Elizabeth Haines, Michael Hall, Marie Heffernan, Joshua Hicks, Petr Houdek, Marije van der Hulst, Jeffrey Huntsinger, Ho Huynh, Hans IJzerman, Yoel Inbar, Åse Innes-Ker, William Jimenez-Leal, Melissa-Sue John, Jennifer Joy-Gaba, Roza Kamiloglu, Andreas Kappes, Heather Kappes, Serdar Karabati, Haruna Karick, Victor Keller, Anna Kende, Nicolas Kervyn, Goran Knezevic, Carrie Kovacs, Lacy Krueger, German Kurapov, Jaime Kurtz, Daniel Lakens, Ljiljana Lazarevic, Carmel Levitan, Neil Lewis, Samuel Lins, Esther Maassen, Angela Maitner, Winfrida Malingumu, Robyn Mallett, Satia Marotta, Jason McIntyre, Janko Međedović, Taciano Milfont, Wendy Morris, Andriy Myachykov, Sean Murphy, Koen Neijenhuijs, Anthony Nelson, Felix Neto, Austin Nichols, Susan O'Donnell, Masanori Oikawa, Gabor Orosz, Malgorzata Osowiecka, Grant Packard, Rolando Pérez, Boban Petrovic, Ronaldo Pilati, Brad Pinter, Lysandra Podesta, Monique Pollmann, Anna Dalla Rosa, Abraham Rutchick, Patricio Saavedra, Airi Sacco, Alexander Saeri, Erika Salomon, Kathleen Schmidt, Felix Schönbrodt, Maciek Sekerdej, David Sirlopu, Jeanine Skorinko, Michael Smith, Vanessa Smith-Castro, Agata Sobkow, Walter Sowden, Philipp Spachtholz, Troy Steiner, Jeroen Stouten, Chris Street, Oskar Sundfelt, Ewa Szumowska, Andrew Tang, Norbert Tanzer, Morgan Tear, Jordan Theriault, Manuela Thomae, David Torres-Fernández, Jakub Traczyk, Joshua Tybur, Adrienn Ujhelyi, Marcel van Assen, Anna van 't Veer, Alejandro Vásquez-Echeverría, Leigh Ann Vaughn, Alexandra Vázquez, Diego Vega, Catherine Verniers, Mark Verschoor, Ingrid Voermans, Marek Vranka, Cheryl Welch, Aaron Wichman, Lisa Williams, Julie Woodzicka, Marta Wronska, Liane Young, John Zelenski, Brian Nosek (2019-11-19):

    We conducted preregistered replications of 28 classic and contemporary published findings with protocols that were peer reviewed in advance to examine variation in effect magnitudes across sample and setting. Each protocol was administered to approximately half of 125 samples and 15,305 total participants from 36 countries and territories. Using conventional statistical-significance (p < 0.05), fifteen (54%) of the replications provided statistically-significant evidence in the same direction as the original finding. With a strict statistical-significance criterion (p < 0.0001), fourteen (50%) provided such evidence, reflecting the extremely high-powered design. Seven (25%) of the replications had effect sizes larger than the original finding and 21 (75%) had effect sizes smaller than the original finding. The median comparable Cohen’s d effect size for original findings was 0.60 and for replications was 0.15. Sixteen replications (57%) had small effect sizes (< 0.20) and 9 (32%) were in the opposite direction from the original finding. Across settings, 11 (39%) showed statistically-significant heterogeneity using the Q statistic and most of those were among the findings eliciting the largest overall effect sizes; only one effect that was near zero in the aggregate showed statistically-significant heterogeneity. Only one effect showed a Tau > 0.20, indicating moderate heterogeneity. Nine others had a Tau near or slightly above 0.10, indicating slight heterogeneity. In moderation tests, very little heterogeneity was attributable to task order, administration in lab versus online, and exploratory WEIRD versus less WEIRD culture comparisons. Cumulatively, variability in observed effect sizes was more attributable to the effect being studied than the sample or setting in which it was studied.
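    The Q statistic and Tau used above are standard meta-analytic heterogeneity measures; a minimal sketch using the common DerSimonian-Laird estimator follows. The lab-level effect sizes and sampling variances below are hypothetical illustrations, not data from the replication project.

```python
# Hedged sketch: Cochran's Q and the DerSimonian-Laird estimate of tau^2
# (between-study variance) from per-study effect sizes and their variances.

def dersimonian_laird(effects, variances):
    """Return (Q, tau2) for a set of study effects with known sampling variances."""
    w = [1 / v for v in variances]                                # fixed-effect weights
    mean = sum(wi * e for wi, e in zip(w, effects)) / sum(w)      # weighted mean effect
    q = sum(wi * (e - mean) ** 2 for wi, e in zip(w, effects))    # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                                 # truncated at zero
    return q, tau2

effects = [0.60, 0.15, 0.40, 0.05, 0.25]     # hypothetical lab-level Cohen's d values
variances = [0.02, 0.01, 0.015, 0.01, 0.02]  # hypothetical sampling variances
q, tau2 = dersimonian_laird(effects, variances)
print(f"Q = {q:.2f}, tau = {tau2 ** 0.5:.2f}")  # → Q = 12.61, tau = 0.17
```

    A Tau around 0.10–0.20, as here, corresponds to the “slight to moderate” heterogeneity the authors describe; identical effects across labs would give Q ≈ df and tau = 0.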


  359. 2020-olssoncollentine.pdf: ⁠, Anton Olsson-Collentine, Jelte M. Wicherts, Marcel A. L. M. van Assen (2020-10-01; statistics  /​ ​​ ​bias):

    Impact Statement: This article suggests that for direct replications in social and cognitive psychology research, small variations in design (sample settings and population) are an unlikely explanation for differences in findings of studies. Differences in findings of direct replications are particularly unlikely if the overall effect is (close to) 0, whereas these differences are more likely if the overall effect is larger.

    We examined the evidence for heterogeneity (of effect sizes) when only minor changes to sample population and settings were made between studies and explored the association between heterogeneity and average effect size in a sample of 68 meta-analyses from 13 preregistered multilab direct replication projects in social and cognitive psychology. Among the many examined effects, examples include the ⁠, the ‘verbal overshadowing’ effect, and various effects such as ‘anchoring’ effects. We found limited heterogeneity; 48⁄68 (71%) meta-analyses had nonsignificant heterogeneity, and most (49⁄68; 72%) were most likely to have zero to small heterogeneity. Power to detect small heterogeneity (as defined by Higgins, Thompson, Deeks, & Altman, 2003) was low for all projects (mean 43%), but good to excellent for medium and large heterogeneity. Our findings thus show little evidence of widespread heterogeneity in direct replication studies in social and cognitive psychology, suggesting that minor changes in sample population and settings are unlikely to affect research outcomes in these fields of psychology. We also found strong correlations between observed average effect sizes (standardized mean differences and log odds ratios) and heterogeneity in our sample. Our results suggest that heterogeneity and moderation of effects is unlikely for a 0 average true effect size, but increasingly likely for larger average true effect size.

    [Keywords: heterogeneity, meta-analysis, psychology, direct replication, many labs]
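The heterogeneity measure the abstract refers to (I², as defined by Higgins, Thompson, Deeks, & Altman, 2003) is straightforward to compute from study effect sizes and sampling variances. Below is a minimal pure-Python sketch; the function name and the effect-size numbers are hypothetical stand-ins for a multi-lab replication, not data from the paper:

```python
# Cochran's Q and the I^2 heterogeneity statistic (Higgins et al 2003),
# computed for a hypothetical set of multi-lab replication effect sizes.
# Illustrative sketch only; inputs are invented, not from Olsson-Collentine et al.

def heterogeneity(effects, variances):
    """Return (Q, I2) given study effect sizes and their sampling variances."""
    weights = [1.0 / v for v in variances]            # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation beyond what chance alone would produce
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

# Five labs with similar effects and similar precision -> negligible heterogeneity
effects = [0.11, 0.08, 0.13, 0.09, 0.12]
variances = [0.004] * 5
q, i2 = heterogeneity(effects, variances)
```

When Q falls below its degrees of freedom, as here, I² is truncated to 0, which corresponds to the "zero to small heterogeneity" verdict most of the paper's meta-analyses received.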

  360. 2020-ebersole.pdf: ⁠, Charles R. Ebersole, Maya B. Mathur, Erica Baranski, Diane-Jo Bart-Plange, Nicholas R. Buttrick, Christopher R. Chartier, Katherine S. Corker, Martin Corley, Joshua K. Hartshorne, Hans IJzerman, Ljiljana B. Lazarević, Hugh Rabagliati, Ivan Ropovik, Balazs Aczel, Lena F. Aeschbach, Luca Andrighetto, Jack D. Arnal, Holly Arrow, Peter Babincak, Bence E. Bakos, Gabriel Baník, Ernest Baskin, Radomir Belopavlović, Michael H. Bernstein, Michał Białek, Nicholas G. Bloxsom, Bojana Bodroža, Diane B. V. Bonfiglio, Leanne Boucher, Florian Brühlmann, Claudia C. Brumbaugh, Erica Casini, Yiling Chen, Carlo Chiorri, William J. Chopik, Oliver Christ, Antonia M. Ciunci, Heather M. Claypool, Sean Coary, Marija V. Čolić, W. Matthew Collins, Paul G. Curran, Chris R. Day, Benjamin Dering, Anna Dreber, John E. Edlund, Filipe Falcão, Anna Fedor, Lily Feinberg, Ian R. Ferguson, Máire Ford, Michael C. Frank, Emily Fryberger, Alexander Garinther, Katarzyna Gawryluk, Kayla Ashbaugh, Mauro Giacomantonio, Steffen R. Giessner, Jon E. Grahe, Rosanna E. Guadagno, Ewa Hałasa, Peter J. B. Hancock, Rias A. Hilliard, Joachim Hüffmeier, Sean Hughes, Katarzyna Idzikowska, Michael Inzlicht, Alan Jern, William Jiménez-Leal, Magnus Johannesson, Jennifer A. Joy-Gaba, Mathias Kauff, Danielle J. Kellier, Grecia Kessinger, Mallory C. Kidwell, Amanda M. Kimbrough, Josiah P. J. King, Vanessa S. Kolb, Sabina Kołodziej, Marton Kovacs, Karolina Krasuska, Sue Kraus, Lacy E. Krueger, Katarzyna Kuchno, Caio Ambrosio Lage, Eleanor V. Langford, Carmel A. Levitan, Tiago Jessé Souza de Lima, Hause Lin, Samuel Lins, Jia E. Loy, Dylan Manfredi, Łukasz Markiewicz, Madhavi Menon, Brett Mercier, Mitchell Metzger, Venus Meyet, Ailsa E. Millen, Jeremy K. Miller, Andres Montealegre, Don A. Moore, Rafał Muda, Gideon Nave, Austin Lee Nichols, Sarah A. Novak, Christian Nunnally, Ana Orlić, Anna Palinkas, Angelo Panno, Kimberly P. Parks, Ivana Pedović, Emilian Pękala, Matthew R. 
Penner, Sebastiaan Pessers, Boban Petrović, Thomas Pfeiffer, Damian Pieńkosz, Emanuele Preti, Danka Purić, Tiago Ramos, Jonathan Ravid, Timothy S. Razza, Katrin Rentzsch, Juliette Richetin, Sean C. Rife, Anna Dalla Rosa, Kaylis Hase Rudy, Janos Salamon, Blair Saunders, Przemysław Sawicki, Kathleen Schmidt, Kurt Schuepfer, Thomas Schultze, Stefan Schulz-Hardt, Astrid Schütz, Ani N. Shabazian, Rachel L. Shubella, Adam Siegel, Rúben Silva, Barbara Sioma, Lauren Skorb, Luana Elayne Cunha de Souza, Sara Steegen, L. A. R. Stein, R. Weylin Sternglanz, Darko Stojilović, Daniel Storage, Gavin Brent Sullivan, Barnabas Szaszi, Peter Szecsi, Orsolya Szöke, Attila Szuts, Manuela Thomae, Natasha D. Tidwell, Carly Tocco, Ann-Kathrin Torka, Francis Tuerlinckx, Wolf Vanpaemel, Leigh Ann Vaughn, Michelangelo Vianello, Domenico Viganola, Maria Vlachou, Ryan J. Walker, Sophia C. Weissgerber, Aaron L. Wichman, Bradford J. Wiggins, Daniel Wolf, Michael J. Wood, David Zealley, Iris Žeželj, Mark Zrubka, Brian A. Nosek (2020-11-13; statistics  /​ ​​ ​bias):

    Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically-significant effect (p < 0.05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = 0.002 or 0.014, depending on analytic approach). The median effect size for the revised protocols (r = 0.05) was similar to that of the RP:P protocols (r = 0.04) and the original RP:P replications (r = 0.11), and smaller than that of the original studies (r = 0.37). 
Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = 0.07, range = 0.00–0.15) were 78% smaller, on average, than the original effect sizes (median r = 0.37, range = 0.19–0.50).

    [Keywords: replication, reproducibility, metascience, peer review, Registered Reports, open data, preregistered]

  361. 2019-kvarven.pdf: ⁠, Amanda Kvarven, Eirik Strømland, Magnus Johannesson (2019-12-23; statistics  /​ ​​ ​bias):

    Many researchers rely on meta-analysis to summarize research evidence. However, there is a concern that publication bias and selective reporting may lead to biased meta-analytic effect sizes. We compare the results of meta-analyses to large-scale preregistered replications in psychology carried out at multiple laboratories. The multiple-laboratory replications provide precisely estimated effect sizes that do not suffer from publication bias or selective reporting. We searched the literature and identified 15 meta-analyses on the same topics as multiple-laboratory replications. We find that meta-analytic effect sizes are statistically-significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost 3 times as large as replication effect sizes. We also implement 3 methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.
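The inflation mechanism at issue can be illustrated with a toy simulation (all parameter values here are illustrative assumptions, not Kvarven et al's data): if only statistically-significant estimates reach the literature, a naive meta-analytic average badly overshoots a small true effect:

```python
# Toy simulation of publication bias inflating a naive meta-analytic mean.
# Parameter values are illustrative assumptions, not from Kvarven et al.
import random

random.seed(0)
TRUE_EFFECT, SE = 0.1, 0.2   # small true effect; typical single-study precision

estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(10_000)]
# Simplification: only 'significant' positive results get published (z > 1.96).
published = [e for e in estimates if e / SE > 1.96]

all_studies_mean = sum(estimates) / len(estimates)   # ~0.1: unbiased
naive_meta = sum(published) / len(published)         # several times too large
```

Under this selection rule every published estimate exceeds 1.96 × SE ≈ 0.39, so the published-only average lands several times above the true 0.1, in the same direction as the ~3× inflation Kvarven et al report.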



  364. 2018-salminen.pdf: “Five-Year Follow-up of Antibiotic Therapy for Uncomplicated Acute Appendicitis in the APPAC Randomized Clinical Trial”⁠, Paulina Salminen, Risto Tuominen, Hannu Paajanen, Tero Rautio, Pia Nordström, Markku Aarnio, Tuomo Rantanen, Saija Hurme, Jukka-Pekka Mecklin, Juhani Sand, Johanna Virtanen, Airi Jartti, Juha M. Grönroos


  366. ⁠, Noah Haber, Emily R. Smith, Ellen Moscoe, Kathryn Andrews, Robin Audy, Winnie Bell, Alana T. Brennan, Alexander Breskin, Jeremy C. Kane, Mahesh Karra, Elizabeth S. McClure, Elizabeth A. Suarez (2018-05-30):

    Background: The pathway from evidence generation to consumption contains many steps which can lead to overstatement or misinformation. The proliferation of internet-based health news may encourage selection of media and academic research articles that overstate strength of causal inference. We investigated the state of causal inference in health research as it appears at the end of the pathway, at the point of social media consumption.

    Methods: We screened the NewsWhip Insights database for the most shared media articles on Facebook and Twitter reporting about peer-reviewed academic studies associating an exposure with a health outcome in 2015, extracting the 50 most-shared academic articles and media articles covering them. We designed and utilized a review tool to systematically assess and summarize studies’ strength of causal inference, including generalizability, potential confounders, and methods used. These were then compared with the strength of causal language used to describe results in both academic and media articles. Two randomly assigned independent reviewers and one arbitrating reviewer from a pool of 21 reviewers assessed each article.

    Results: We accepted for review the 64 most-shared media articles pertaining to 50 academic articles, representing 68% of Facebook and 45% of Twitter shares in 2015. 34% of academic studies and 48% of media articles used language that reviewers considered too strong for their strength of causal inference. 70% of academic studies were considered low or very low strength of inference, with only 6% considered high or very high strength of causal inference. The most severe issues with academic studies’ causal inference were reported to be omitted confounding variables and generalizability. 58% of media articles were found to have inaccurately reported the question, results, intervention, or population of the academic study.

    Conclusions: We find a large disparity between the strength of language as presented to the research consumer and the underlying strength of causal inference among the studies most widely shared on social media. However, because this sample was designed to be representative of the articles selected and shared on social media, it is unlikely to be representative of all academic and media work. More research is needed to determine how academic institutions, media organizations, and social network sharing patterns impact causal inference and language as received by the research consumer.


  368. Leprechauns#spinach

  369. 1991-lykken.pdf: ⁠, David T. Lykken (1991; psychology):

    [Lykken’s (1991) classic criticisms of psychology’s dominant research tradition, from the perspective of the Minnesotan psychometrics school, in association with Paul Meehl: psychology’s replication crisis, the constant fading-away of trendy theories, and inability to predict the real world; the measurement problem, null hypothesis statistical-significance testing, and the granularity of research methods.]

    I shall argue the following theses:

    1. Psychology isn’t doing very well as a scientific discipline and something seems to be wrong somewhere.
    2. This is due partly to the fact that psychology is simply harder than physics or chemistry, and for a variety of reasons. One interesting reason is that people differ structurally from one another and, to that extent, cannot be understood in terms of the same theory since theories are guesses about structure.
    3. But the problems of psychology are also due in part to a defect in our research tradition; our students are carefully taught to behave in the same obfuscating, self-deluding, pettifogging ways that (some of) their teachers have employed.

  370. ⁠, Daniel Burfoot (2011-04-28):

    This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation.

    The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 Gb, it is easy to justify a 10 Mb model). The complex models and abstractions learned on the basis of the raw data (images, language, etc) can then be reused to solve any specific learning problem, such as face recognition or machine translation.
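The compression criterion is easy to demonstrate in miniature (a sketch of the general idea only, not code from the book): a general-purpose compressor like zlib shrinks data exactly to the extent that it contains discoverable regularities, while structureless noise stays essentially full-size:

```python
# Compression as regularity-discovery: structured data compresses far better
# than random noise of the same length. Illustrative sketch of the book's
# criterion, not code from the book itself.
import random
import zlib

random.seed(0)
structured = ("the cat sat on the mat. " * 400).encode()  # highly regular text
noise = bytes(random.randrange(256) for _ in range(len(structured)))

ratio_structured = len(zlib.compress(structured, 9)) / len(structured)
ratio_noise = len(zlib.compress(noise, 9)) / len(noise)  # near (or above) 1.0
```

On the book's view, a scientific theory earns its keep the same way: by letting the compressor-plus-model shrink a benchmark database below what a generic compressor achieves, after charging for the model's own length.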




  374. 2019-orben.pdf: “The association between adolescent well-being and digital technology use”⁠, Amy Orben, Andrew K. Przybylski

  375. 2019-lortieforgues.pdf: ⁠, Hugues Lortie-Forgues, Matthew Inglis (2019-03-11; sociology):

    There are a growing number of large-scale educational randomized controlled trials (RCTs). Considering their expense, it is important to reflect on the effectiveness of this approach. We assessed the magnitude and precision of effects found in those large-scale RCTs commissioned by the UK-based Education Endowment Foundation and the U.S.-based National Center for Educational Evaluation and Regional Assistance, which evaluated interventions aimed at improving academic achievement in K–12 (141 RCTs; 1,222,024 students). The mean effect size was 0.06 standard deviations. These sat within relatively large confidence intervals (mean width = 0.30 SDs), which meant that the results were often uninformative (the median Bayes factor was 0.56). We argue that our field needs, as a priority, to understand why educational RCTs often find small and uninformative effects.

    [Keywords: educational policy, evaluation, meta-analysis, program evaluation.]

  376. 2019-soto.pdf: ⁠, Christopher J. Soto (2019-01-01; psychology):

    The Big Five personality traits have been linked to dozens of life outcomes. However, metascientific research has raised questions about the replicability of behavioral science. The Life Outcomes of Personality Replication (LOOPR) Project was therefore conducted to estimate the replicability of the personality-outcome literature. Specifically, I conducted preregistered, high-powered (median n = 1,504) replications of 78 previously published trait–outcome associations. Overall, 87% of the replication attempts were statistically-significant in the expected direction. The replication effects were typically 77% as strong as the corresponding original effects, which represents a significant decline in effect size. The replicability of individual effects was predicted by the effect size and design of the original study, as well as the sample size and statistical power of the replication. These results indicate that the personality-outcome literature provides a reasonably accurate map of trait–outcome associations but also that it stands to benefit from efforts to improve replicability.


  378. 2014-turkheimer.pdf: “Behavior Genetic Research Methods: Testing Quasi-Causal Hypotheses Using Multivariate Twin Data”⁠, Eric Turkheimer, K. Paige Harden

  379. 2019-duncan.pdf: “How genome-wide association studies (GWAS) made traditional candidate gene studies obsolete”⁠, Laramie E. Duncan, Michael Ostacher, Jacob Ballon


  381. 2019-border.pdf: ⁠, Richard Border, Emma C. Johnson, Luke M. Evans, Andrew Smolen, Noah Berley, Patrick F. Sullivan, Matthew C. Keller (2019; genetics  /​ ​​ ​heritable):

    Objective: Interest in candidate gene and candidate gene-by-environment interaction hypotheses regarding major depressive disorder remains strong despite controversy surrounding the validity of previous findings. In response to this controversy, the present investigation empirically identified 18 candidate genes for depression that have been studied 10 or more times and examined evidence for their relevance to depression phenotypes.

    Methods: Utilizing data from large population-based and case-control samples (_n_s ranging from 62,138 to 443,264 across subsamples), the authors conducted a series of preregistered analyses examining candidate gene polymorphism main effects, polymorphism-by-environment interactions, and gene-level effects across a number of operational definitions of depression (eg., lifetime diagnosis, current severity, episode recurrence) and environmental moderators (eg., sexual or physical abuse during childhood, socioeconomic adversity).

    Results: No clear evidence was found for any candidate gene polymorphism associations with depression phenotypes or any polymorphism-by-environment moderator effects. As a set, depression candidate genes were no more associated with depression phenotypes than non-candidate genes. The authors demonstrate that phenotypic measurement error is unlikely to account for these null findings.

    Conclusions: The study results do not support previous depression candidate gene findings, in which large genetic effects are frequently reported in samples orders of magnitude smaller than those examined here. Instead, the results suggest that early hypotheses about depression candidate genes were incorrect and that the large number of associations reported in the depression candidate gene literature are likely to be false positives.

    Figure 2: Main effects and gene-by-environment effects of 16 candidate polymorphisms on estimated lifetime depression diagnosis and current depression severity in the UK Biobank sample. The graphs show effect size estimates for 16 candidate polymorphisms, presented in order of estimated number of studies from left to right, descending, on estimated lifetime depression diagnosis (panel A) and past-2-week depression symptom severity from the online mental health follow-up assessment (panel B) in the UK Biobank sample (N = 115,257). Both polymorphism main effects and polymorphism-by-environment moderator interaction effects are presented for each outcome. Detailed descriptions of the variables and of the association and power analysis models are provided in sections S3 and S4, respectively, of the online supplement.
    Figure 3: Gene-wise statistics for effects of 18 candidate genes on primary depression outcomes in the UK Biobank sample. The plot shows gene-wise p values across the genome, highlighting the 18 candidate polymorphisms’ effects on estimated depression diagnosis (filled points) and past-2-week depression symptom severity (unfilled points) from the online mental health follow-up assessment in the UK Biobank sample (N = 115,257). Gene labels alternate colors to aid readability. Detailed descriptions of the variables and of the association models are provided in sections S3 and S4.2, respectively, of the online supplement.

  383. ⁠, Thomas S. Redick (2019-05-16):

    Seventeen years and hundreds of studies after the first journal article on working memory training was published, evidence for the efficacy of working memory training is still wanting. Numerous studies show that individuals who repeatedly practice computerized working memory tasks improve on those tasks and closely related variants. Critically, although individual studies have shown improvements in untrained abilities and behaviors, systematic reviews of the broader literature show that studies producing large, positive findings are often those with the most methodological shortcomings. The current review discusses the past, present, and future status of working memory training, including consideration of factors that might influence working memory training and transfer efficacy.

  384. ⁠, Diana Herrera-Perez, Alyson Haslam, Tyler Crain, Jennifer Gill, Catherine Livingston, Victoria Kaestner, Michael Hayes, Dan Morgan, Adam S. Cifu, Vinay Prasad (2019-06-11):

    The ability to identify medical reversals and other low-value medical practices is an essential prerequisite for efforts to reduce spending on such practices. Through an analysis of more than 3000 randomized controlled trials (RCTs) published in three leading medical journals (the Journal of the American Medical Association, the Lancet, and the New England Journal of Medicine), we have identified 396 medical reversals. Most of the studies (92%) were conducted on populations in high-income countries, cardiovascular disease was the most common medical category (20%), and medication was the most common type of intervention (33%).

  385. ⁠, David Goodstein (1994):

    [On the end to the post-WWII Vannevar Bushian exponential growth of academia and consequences thereof: growth can’t go on forever, and it didn’t.]

    According to modern cosmology, the universe began with a big bang about 10 billion years ago, and it has been expanding ever since. If the density of mass in the universe is great enough, its gravitational force will cause that expansion to slow down and reverse, causing the universe to fall back in on itself. Then the universe will end in a cataclysmic event known as ‘the Big Crunch’. I would like to present to you a vaguely analogous theory of the history of science. The upper curve on Figure 1 was first made by historian Derek de Solla Price, sometime in the 1950s. It is a semilog plot of the cumulative number of scientific journals founded worldwide as a function of time…the growth of the profession of science, the scientific enterprise, is bound to reach certain limits. I contend that these limits have now been reached.

    …But after about 1970 and the Big Crunch, the gleaming gems produced at the end of the vast mining-and-sorting operation produced less often from American ore. Research professors and their universities, using ore imported from across the oceans, kept the machinery humming.

    …Let me finish by summarizing what I’ve been trying to tell you. We stand at an historic juncture in the history of science. The long era of exponential expansion ended decades ago, but we have not yet reconciled ourselves to that fact. The present social structure of science, by which I mean institutions, education, funding, publications and so on all evolved during the period of exponential expansion, before The Big Crunch. They are not suited to the unknown future we face. Today’s scientific leaders, in the universities, government, industry and the scientific societies are mostly people who came of age during the golden era, 1950–1970. I am myself part of that generation. We think those were normal times and expect them to return. But we are wrong. Nothing like it will ever happen again. It is by no means certain that science will even survive, much less flourish, in the difficult times we face. Before it can survive, those of us who have gained so much from the era of scientific elites and scientific illiterates must learn to face reality, and admit that those days are gone forever.

  386. ⁠, Saul Justin Newman (2019-07-16):

    The observation of individuals attaining remarkable ages, and their concentration into geographic sub-regions or ‘blue zones’, has generated considerable scientific interest. Proposed drivers of remarkable longevity include high vegetable intake, strong social connections, and genetic markers. Here, we reveal new predictors of remarkable longevity and ‘supercentenarian’ status. In the United States, supercentenarian status is predicted by the absence of vital registration. The state-specific introduction of birth certificates is associated with a 69–82% fall in the number of supercentenarian records. In Italy, which has more uniform vital registration, remarkable longevity is instead predicted by low per capita incomes and a short life expectancy. Finally, the designated ‘blue zones’ of Sardinia, Okinawa, and Ikaria corresponded to regions with low incomes, low literacy, high crime rate and short life expectancy relative to their national average. As such, relative poverty and short lifespan constitute unexpected predictors of centenarian and supercentenarian status, and support a primary role of fraud and error in generating remarkable human age records.

  387. 2017-mercier.pdf: ⁠, Hugo Mercier (2017-05-18; psychology⁠, sociology⁠, philosophy⁠, advertising):

    A long tradition of scholarship, from ancient Greece to Marxism or some contemporary social psychology, portrays humans as strongly gullible—wont to accept harmful messages by being unduly deferent. However, if humans are reasonably well adapted, they should not be strongly gullible: they should be vigilant toward communicated information. Evidence from experimental psychology reveals that humans are equipped with well-functioning mechanisms of epistemic vigilance. They check the plausibility of messages against their background beliefs, calibrate their trust as a function of the source’s competence and benevolence, and critically evaluate arguments offered to them. Even if humans are equipped with well-functioning mechanisms of epistemic vigilance, an adaptive lag might render them gullible in the face of new challenges, from clever marketing to omnipresent propaganda. I review evidence from different cultural domains often taken as proof of strong gullibility: religion, demagoguery, propaganda, political campaigns, advertising, erroneous medical beliefs, and rumors. Converging evidence reveals that communication is much less influential than often believed—that religious proselytizing, propaganda, advertising, and so forth are generally not very effective at changing people’s minds. Beliefs that lead to costly behavior are even less likely to be accepted. Finally, it is also argued that most cases of acceptance of misguided communicated information do not stem from undue deference, but from a fit between the communicated information and the audience’s preexisting beliefs.

    [Keywords: epistemic vigilance, gullibility, trust]

  388. ⁠, Richard Wiseman, Caroline Watt, Diana Kornbrot (2019-01-16):

    The recent ‘replication crisis’ in psychology has focused attention on ways of increasing methodological rigor within the behavioral sciences. Part of this work has involved promoting ‘Registered Reports’, wherein journals peer review papers prior to data collection and publication. Although this approach is usually seen as a relatively recent development, we note that a prototype of this publishing model was initiated in the mid-1970s by parapsychologist Martin Johnson in the European Journal of Parapsychology (EJP). A retrospective and observational comparison of Registered and non-Registered Reports published in the EJP during a seventeen-year period provides circumstantial evidence to suggest that the approach helped to reduce questionable research practices. This paper aims both to bring Johnson’s pioneering work to a wider audience, and to investigate the positive role that Registered Reports may play in helping to promote higher methodological and statistical standards.

    …The final dataset contained 60 papers: 25 RRs and 35 non-RRs. The RRs described 31 experiments that tested 131 hypotheses, and the non-RRs described 60 experiments that tested 232 hypotheses.

    28.4% of the statistical tests reported in non-RRs were statistically-significant (66⁄232: 95% CI [21.5%–36.4%]); compared to 8.4% of those in the RRs (11⁄131: 95% CI [4.0%–16.8%]). A simple 2 × 2 contingency analysis showed that this difference is highly statistically-significant (Fisher’s exact test: p < 0.0005, Pearson chi-square = 20.1, Cohen’s d = 0.48).
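The reported 2 × 2 comparison can be re-derived directly from the counts in the abstract; a short pure-Python check (Pearson chi-square, no continuity correction) recovers the reported value of roughly 20.1:

```python
# Recompute the Pearson chi-square for the 2x2 table reported above:
# non-Registered Reports: 66 significant results out of 232 tests;
# Registered Reports: 11 significant out of 131.
table = [[66, 232 - 66],   # non-RR: significant, nonsignificant
         [11, 131 - 11]]   # RR:     significant, nonsignificant

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# chi^2 = sum over cells of (observed - expected)^2 / expected,
# with expected = row_total * col_total / n
chi2 = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))
```

The result, about 20.1, matches the chi-square the paper reports for the gap between the 28.4% significance rate in non-Registered Reports and the 8.4% rate in Registered Reports.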

    …Parapsychologists investigate the possible existence of phenomena that, for many, have a low a priori likelihood of being genuine (see, eg., Wagenmakers et al 2011). This has often resulted in their work being subjected to a considerable amount of critical attention (from both within and outwith the field) that has led to them pioneering several methodological advances prior to their use within mainstream psychology, including the development of randomisation in experimental design (Hacking, 1988), the use of blinds (Kaptchuk, 1998), explorations into randomisation and statistical inference (Fisher, 1924), advances in replication issues (Rosenthal, 1986), the need for pre-specification in meta-analysis (Akers, 1985; Milton, 1999; Kennedy, 2004), and the creation of a formal study registry (Watt, 2012; Watt & Kennedy, 2015). Johnson’s work on RRs provides another striking illustration of this principle at work.

  389. 1975-johnson.pdf: ⁠, Martin U. Johnson (1975; statistics  /​ ​​ ​peer-review):

    The author discusses how to increase the quality and reliability of the research and reporting process in experimental parapsychology. Three levels of bias and control of bias are discussed. The levels are referred to as Model 1, Model 2 and Model 3 respectively.

    1. Model 1 is characterized by its very low level of intersubjective control. The reliability of the results depends to a very great extent upon the reliability of the investigator and the editor.
    2. Model 2 is relevant to the case when the experimenter is aware of the potential risk of making both errors of observation and recording and tries to control this bias. However, this model of control does not make allowances for the case when data are intentionally manipulated.
    3. Model 3 depicts a rather sophisticated system of control. One feature of this model is, that selective reporting will become harder since the editor has to make his decision as regards the acceptance or rejection of an experimental article prior to the results being obtained, and subsequently based upon the quality of the outline of the experiment. However, it should be stressed, that not even this model provides a fool-proof guarantee against deliberate fraud.

    It is assumed that the models of bias and control of bias under discussion are relevant to most branches of the behavioral sciences.

  390. 1975-johnson-2.pdf: ⁠, Martin U. Johnson (1975; statistics  /​ ​​ ​peer-review):

    This copy represents our first ‘real’ issue of the European Journal of Parapsychology…As far as experimental articles are concerned, we would like to ask potential contributors to try and adhere to the publishing policy which we have outlined in the editorial of the demonstration copy, and which is also discussed at some length in the article: ‘Models of Bias and Control of Bias’ [Johnson 1975a], in this issue. In short we shall try to avoid selective reporting and yet at the same time we shall try to refrain from making our journal a graveyard for all those studies which did not ‘turn out’. These objectives may be fulfilled by the editorial rule of basing our judgment entirely on our impressions of the quality of the design and methodology of the planned study. The acceptance or rejection of a manuscript should if possible take place prior to the carrying out and the evaluation of the results of the study.

  391. 1976-johnson.pdf: ⁠, Martin U. Johnson (1976; statistics  /​ ​​ ​peer-review):

    …even the most proper use of statistics may lead to spurious correlations or conclusions if there are inadequacies regarding the research process itself. One of these sources of error in the research process is related to selective reporting; another to human limitations with regard to the ability to make reliable observations or evaluations. Dunnette 1966 says:

    The most common variant is, of course, the tendency to bury negative results. I only recently became aware of the massive size of this great graveyard for dead studies when a colleague expressed gratification that only a third of his studies ‘turned out’—as he put it. Recently, a second variant of this secret game was discovered, quite inadvertently, by ⁠, when he wrote to 37 authors to ask for the raw-data on which they had based recent journal articles. Wolins found that of the 37 who replied, 21 reported their data to be either misplaced, lost, or inadvertently destroyed. Finally, after some negotiation, Wolins was able to complete 7 re-analyses on the data supplied from 5 authors. Of the 7, he found gross errors in 3—errors so great as to clearly change the outcome of the experiments already reported.

    It should also be stressed that Rosenthal and others have demonstrated that experimenters tend to arrive at results found to be in full agreement with their expectancies, or with the expectancies of those within the scientific establishment in charge of the rewards. Even if some of Rosenthal’s results have been questioned [especially the ‘Pygmalion effect’] the general tendency seems to be unaffected.

    I guess we can all agree upon the fact that selective reporting in studies on the reliability and validity of, for instance, a personality test is a bad thing. But what could be the reason for selective reporting? Why does a research worker manipulate his data? Is it only because the research worker has a ‘weak’ mind or does there exist some kind of ‘steering field’ that exerts such an influence that improper behavior on the part of the research worker occurs?

    It seems rather reasonable to assume that the editors of professional journals or research leaders in general could exert a certain harmful influence in this connection…There is no doubt at all in my mind about the ‘filtering’ or ‘shaping’ effect an editor may exert upon the output of his journal…As I see it, the major risk of selective reporting is not primarily a statistical one, but rather the research climate which the underlying policy creates (“you are ‘good’ if you obtain supporting results; you are ‘no-good’ if you only arrive at chance results”).

    …The analysis I carried out has had practical implications for the publication policy which we have stated as an ideal for our new journal: the European Journal of Parapsychology.

  392. 1966-dunnette.pdf: ⁠, Marvin D. Dunnette (1966; statistics  /​ ​​ ​bias):

    [Influential early critique of academic psychology: weak theories, no predictions, poor measurements, poor replicability, high levels of publication bias, non-progressive theory building, and constant churn; many of these criticisms would be taken up by the ‘Minnesota school’ of Bouchard/​​​​Meehl/​​​​Lykken/​​​​etc.]

    Fads include brain-storming, Q technique, level of aspiration, forced choice, critical incidents, semantic differential, role playing, and need theory. Fashions include theorizing and theory building, criterion ⁠, model building, null-hypothesis testing, and sensitivity training. Folderol includes tendencies to be fixated on theories, methods, and points of view, conducting “little” studies with great precision, attaching dramatic but unnecessary trappings to experiments, grantsmanship, coining new names for old concepts, fixation on methods and apparatus, etc.

  393. 1962-wolins.pdf: ⁠, Leroy Wolins (1962-09; statistics  /​ ​​ ​bias):

    Comments on an Iowa State University graduate student’s endeavor to obtain data of a particular kind in order to carry out a study for his master’s thesis. This student wrote to 37 authors whose journal articles appeared in APA journals between 1959 and 1961. Of these authors, 32 replied. 21 of those reported the data misplaced, lost, or inadvertently destroyed. 2 of the remaining 11 offered their data on the condition that they be notified of our intended use of their data, and stated that they have control of anything that we would publish involving these data. Errors were found in some of the raw data that was obtained, posing a dilemma of whether or not to report the errors. The commentator states that if it were clearly set forth by the APA that the responsibility for retaining raw data and submitting them for scrutiny upon request lies with the author, this dilemma would not exist. The commentator suggests that a possibly more effective means of controlling quality of publication would be to institute a system of quality control whereby random samples of raw data from submitted journal articles would be requested by editors and scrutinized for accuracy and the appropriateness of the analysis performed.
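    The attrition in Wolins’ informal audit can be tallied in a few lines; a trivial sketch using only the counts quoted above (variable names are ours):

    ```python
    # Counts reported in Wolins (1962), as summarized above:
    # 37 authors contacted, 32 replied, 21 reported data lost/misplaced/destroyed,
    # 7 re-analyses were completed, 3 of those showed gross outcome-changing errors.
    contacted, replied, lost, reanalyzed, gross_errors = 37, 32, 21, 7, 3

    data_loss_rate = lost / replied          # share of respondents whose data were gone
    error_rate = gross_errors / reanalyzed   # share of completed re-analyses with gross errors

    print(f"{data_loss_rate:.0%} of respondents had lost their data")
    print(f"{error_rate:.0%} of completed re-analyses found gross errors")
    ```

    Roughly two-thirds of the responding authors could not produce their raw data at all, and of the handful of datasets that could be re-analyzed, over 40% contained errors large enough to change the reported outcome.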

  394. 2019-horowitz.pdf: ⁠, Mark Horowitz, William Yaworsky, Kenneth Kickham (2019-10; sociology⁠, sociology  /​ ​​ ​preference-falsification):

    In recent decades the field of anthropology has been characterized as sharply divided between pro-science and anti-science factions. The aim of this study is to empirically evaluate that characterization. We survey anthropologists in graduate programs in the United States regarding their views of science and advocacy, moral and epistemic relativism, and the merits of evolutionary biological explanations. We examine anthropologists’ views in concert with their varying appraisals of major controversies in the discipline (⁠, ⁠, and ). We find that disciplinary specialization and especially gender and political orientation are statistically-significant predictors of anthropologists’ views. We interpret our findings through the lens of an intuitionist social psychology that helps explain the dynamics of such controversies as well as ongoing ideological divisions in the field.


  396. 2019-zeraatkar.pdf: ⁠, Dena Zeraatkar, Bradley C. Johnston, Jessica Bartoszko, Kevin Cheung, Malgorzata M. Bala, Claudia Valli, Montserrat Rabassa, Deagan Sit, Kirolos Milio, Behnam Sadeghirad, Arnav Agarwal, Adriana M. Zea, Yung Lee, Mi Ah Han, Robin W. M. Vernooij, Pablo Alonso-Coello, Gordon H. Guyatt, Regina El Dib (2019-10-01; longevity):

    Background: Few randomized trials have evaluated the effect of reducing red meat intake on clinically important outcomes.

    Purpose: To summarize the effect of lower versus higher red meat intake on the incidence of cardiometabolic and cancer outcomes in adults.

    Data Sources: EMBASE, CENTRAL, CINAHL, Web of Science, and ProQuest from inception to July 2018 and MEDLINE from inception to April 2019, without language restrictions.

    Study Selection: Randomized trials (published in any language) comparing diets lower in red meat with diets higher in red meat that differed by a gradient of at least 1 serving per week for 6 months or more.

    Data Extraction: Teams of 2 reviewers independently extracted data and assessed the risk of bias and the certainty of the evidence.

    Data Synthesis: Of 12 eligible trials, a single trial enrolling 48 835 women provided the most credible, though still low-certainty, evidence that diets lower in red meat may have little or no effect on all-cause mortality (hazard ratio [HR], 0.99 [95% CI, 0.95 to 1.03]), cardiovascular mortality (HR, 0.98 [CI, 0.91 to 1.06]), and cardiovascular disease (HR, 0.99 [CI, 0.94 to 1.05]). That trial also provided low-certainty to very-low-certainty evidence that diets lower in red meat may have little or no effect on total cancer mortality (HR, 0.95 [CI, 0.89 to 1.01]) and the incidence of cancer, including colorectal cancer (HR, 1.04 [CI, 0.90 to 1.20]) and breast cancer (HR, 0.97 [CI, 0.90 to 1.04]).

    Limitations: There were few trials, most addressing only surrogate outcomes, with heterogeneous comparators and small gradients in red meat consumption between lower versus higher intake groups.

    Conclusion: Low-certainty to very-low-certainty evidence suggests that diets restricted in red meat may have little or no effect on major cardiometabolic outcomes and cancer mortality and incidence.
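    As a quick sanity check on the abstract’s “little or no effect” summary, one can verify that every reported 95% confidence interval spans the null hazard ratio of 1.0; a sketch using only the numbers quoted above:

    ```python
    # Hazard ratios and 95% CIs reported by Zeraatkar et al 2019 (from the abstract above).
    # An interval containing 1.0 is consistent with "little or no effect".
    results = {
        "all-cause mortality":      (0.99, 0.95, 1.03),
        "cardiovascular mortality": (0.98, 0.91, 1.06),
        "cardiovascular disease":   (0.99, 0.94, 1.05),
        "total cancer mortality":   (0.95, 0.89, 1.01),
        "colorectal cancer":        (1.04, 0.90, 1.20),
        "breast cancer":            (0.97, 0.90, 1.04),
    }
    for outcome, (hr, lo, hi) in results.items():
        crosses_null = lo <= 1.0 <= hi
        print(f"{outcome}: HR={hr} [{lo}, {hi}] includes 1.0: {crosses_null}")
    # Every interval includes 1.0: none of the six endpoints is statistically significant.
    ```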


  398. 2019-vrij.pdf: ⁠, Aldert Vrij, Maria Hartwig, Pär Anders Granhag (2019-01-01; psychology):

    The relationship between nonverbal communication and deception continues to attract much interest, but there are many misconceptions about it. In this review, we present a scientific view on this relationship. We describe theories explaining why liars would behave differently from truth tellers, followed by research on how liars actually behave and individuals’ ability to detect lies. We show that the nonverbal cues to deceit discovered to date are faint and unreliable and that people are mediocre lie catchers when they pay attention to behavior. We also discuss why individuals hold misbeliefs about the relationship between nonverbal behavior and deception—beliefs that appear very hard to debunk. We further discuss the ways in which researchers could improve the state of affairs by examining nonverbal behaviors in different ways and in different settings than they currently do.

  399. ⁠, Alexander Coppock, Seth J. Hill, Lynn Vavreck (2020-09-02; sociology⁠, advertising):

    Evidence across social science indicates that average effects of persuasive messages are small. One commonly offered explanation for these small effects is heterogeneity: Persuasion may only work well in specific circumstances. To evaluate heterogeneity, we repeated an experiment weekly in real time using 2016 U.S. presidential election campaign advertisements. We tested 49 political advertisements in 59 unique experiments on 34,000 people. We investigate heterogeneous effects by sender (candidates or groups), receiver (subject partisanship), content (attack or promotional), and context (battleground versus non-battleground, primary versus general election, and early versus late). We find small average effects on candidate favorability and vote. These small effects, however, do not mask substantial heterogeneity even where theory from political science suggests that we should find it. During the primary and general election, in battleground states, for Democrats, Republicans, and Independents, effects are similarly small. Heterogeneity with large offsetting effects is not the source of small average effects.

  400. Leprechauns#citogenesis-how-often-do-researchers-not-read-the-papers-they-cite

  401. Movies#project-nim

  402. ⁠, Susannah Cahalan (2019-11-02):

    [Summary of investigation into David Rosenhan: like the Robbers Cave or Stanford Prison Experiment, his famous fake-insane patients experiment cannot be verified and many troubling anomalies have come to light. Cahalan is unable to find almost all of the supposed participants, Rosenhan hid his own participation & his own medical records show he fabricated details of his case, he threw out participant data that didn’t match his narrative, reported numbers are inconsistent, Rosenhan abandoned a lucrative book deal about it and avoided further psychiatric research, and showed some character traits of a fabulist eager to please.]

  403. {#linkBibliography-(spectator)-2020 .docMetadata}, Andrew Scull (Spectator) (2020-01-25):

    As her work proceeded, her doubts about Rosenhan’s work grew. At one point, I suggested that she write to Science and request copies of the peer review of the paper. What had the reviewers seen and requested? Did they know the identities of the anonymous pseudo-patients and the institutions to which they had been consigned? What checks had they made on the validity of Rosenhan’s claims? Had they, for example, asked to see the raw field notes? The editorial office told her that the peer review was confidential and they couldn’t share it. I wondered whether an approach from an academic rather than a journalist might be more successful, and with Cahalan’s permission, I sought the records myself, pointing out the important issues at stake, and noting that it would be perfectly acceptable for the names of the expert referees to be redacted. This time the excuse was different: the journal had moved offices, and the peer reviews no longer existed. That’s plausible, but it is distinctly odd that such different explanations should be offered.

    …Of course, proving a negative, especially after decades have passed, is nigh on impossible. Perhaps the appearance of The Great Pretender will cause one or more of the missing pseudo-patients to surface, or for their descendants to speak up and reveal their identities, for surely anyone who participated in such a famous study could not fail to mention it to someone. More likely, I think, is that these people are fictitious, invented by someone who Cahalan’s researches suggest was fully capable of such deception. (Indeed, the distinguished psychologist Eleanor Maccoby, who was in charge of assessing Rosenhan’s tenure file, reported that she and others were deeply suspicious of him, and that they found it ‘impossible to know what he had really done, or if he had done it’, granting him tenure only because of his popularity as a teacher.)

    …Most damning of all, though, are Rosenhan’s own medical records. When he was admitted to the hospital, it was not because he simply claimed to be hearing voices but was otherwise ‘normal’. On the contrary, he told his psychiatrist his auditory hallucinations included the interception of radio signals and listening in to other people’s thoughts. He had tried to keep these out by putting copper over his ears, and sought admission to the hospital because it was ‘better insulated there’. For months, he reported he had been unable to work or sleep, financial difficulties had mounted and he had contemplated suicide. His speech was retarded, he grimaced and twitched, and told several staff that the world would be better off without him. No wonder he was admitted.

    Perhaps out of sympathy for Rosenhan’s son and his closest friends, who had granted access to all this damning material and with whom she became close, I think Cahalan pulls her punches a bit when she brings her book to a conclusion. But the evidence she provides makes an overwhelming case: Rosenhan pulled off one of the greatest scientific frauds of the past 75 years, and it was a fraud whose real-world consequences still resonate today. Exposing what he got up to is a quite exceptional accomplishment, and Cahalan recounts the story vividly and with great skill.

  404. 1976-lando.pdf: ⁠, Harry A. Lando (1976-01-01; statistics  /​ ​​ ​bias):

    Describes the author’s experiences as a pseudo-patient on the psychiatric ward of a large public hospital for 19 days. Hospital facilities were judged excellent, and therapy tended to be extensive. Close contact with both patients and staff was obtained. Despite this contact, however, not only was the author’s simulation not detected, but his behavior was seen as consistent with the admitting diagnosis of “chronic undifferentiated ⁠.” Even with this misattribution it is concluded that the present institution had many positive aspects and that the depersonalization of patients so strongly emphasized by D. Rosenhan (see record 1973-21600-001) did not exist in this setting. It is recommended that future research address positive characteristics of existing institutions and possibly emulate these in upgrading psychiatric care.

    …I was the ninth pseudopatient in the Rosenhan study, and my data were not included in the original report.

  405. ⁠, Alison Abbott (2019-10-29):

    Although Rosenhan died in 2012, Cahalan easily tracked down his archives, held by social psychologist Lee Ross, his friend and colleague at Stanford. They included the first 200 pages of Rosenhan’s unfinished draft of a book about the experiment…Ross warned her that Rosenhan had been secretive. As her attempts to identify the pseudonymous pseudopatients hit one dead end after the other, she realized Ross’s prescience.

    The archives did allow Cahalan to piece together the beginnings of the experiment in 1969, when Rosenhan was teaching psychology at Swarthmore College in Pennsylvania…Rosenhan cautiously decided to check things out for himself first. He emerged humbled from nine traumatizing days in a locked ward, and abandoned the idea of putting students through the experience.

    …According to Rosenhan’s draft, it was at a conference dinner that he met his first recruits: a recently retired psychiatrist and his psychologist wife. The psychiatrist’s sister also signed up. But the draft didn’t explain how, when and why subsequent recruits signed up. Cahalan interviewed numerous people who had known Rosenhan personally or indirectly. She also chased down the medical records of individuals whom she suspected could have been involved in the experiment, and spoke with their families and friends. But her sleuthing brought her to only one participant, a former Stanford graduate student called Bill Underwood.

    …Underwood and his wife were happy to talk, but two of their comments jarred. Rosenhan’s draft described how he prepared his volunteers very carefully, over weeks. Underwood, however, remembered only brief guidance on how to avoid swallowing medication by hiding pills in his cheek. His wife recalled Rosenhan telling her that he had prepared writs of habeas corpus for each pseudopatient, in case an institution would not discharge them. But Cahalan had already worked out that that wasn’t so.

    Comparing the Science report with documents in Rosenhan’s archives, she also noted many mismatches in numbers. For instance, Rosenhan’s draft, and the Science paper, stated that Underwood had spent seven days in a hospital with 8,000 patients, whereas he spent eight days in a hospital with 1,500 patients.

    When all of the leads from her contacts came to nothing, she published a commentary in The Lancet Psychiatry asking for help in finding the remaining pseudopatients—to no avail. Had Rosenhan invented them, she found herself asking?

  406. 2020-griggs.pdf: ⁠, Richard A. Griggs, Jenna Blyler, Sherri L. Jackson (2020-06-11; statistics  /​ ​​ ​bias):

    David Rosenhan’s pseudopatient study is one of the most famous studies in psychology, but it is also one of the most criticized studies in psychology. Almost 50 years after its publication, it is still discussed in psychology textbooks, but the extensive body of criticism is not, likely leading teachers not to present the study as the contentious classic that it is. New revelations by Susannah Cahalan (2019), based on her years of investigation of the study and her analysis of the study’s archival materials, question the validity and veracity of both Rosenhan’s study and his reporting of it as well as Rosenhan’s scientific integrity. Because many (if not most) teachers are likely not aware of Cahalan’s findings, we provide a summary of her main findings so that if they still opt to cover Rosenhan’s study, they can do so more accurately. Because these findings are related to scientific integrity, we think that they are best discussed in the context of research ethics and methods. To aid teachers in this task, we provide some suggestions for such discussions.

    [ToC: Rosenhan’s Misrepresentations of the Pseudopatient Script and His Medical Record · Selective Reporting of Data · Rosenhan’s Failure to Prepare and Protect Other Pseudopatients · Reporting Questionable Data and Possibly Pseudo-Pseudopatients · Concluding Remarks · Footnotes · References]

  407. ⁠, Robert L. Spitzer (1975):

    Rosenhan’s “On Being Sane in Insane Places” is pseudoscience presented as science. Just as his pseudopatients were diagnosed at discharge as “schizophrenia in remission”, so a careful examination of this study’s methods, results, and conclusion leads to a diagnosis of “logic in remission”. Rosenhan’s study proves that pseudopatients are not detected by psychiatrists as having simulated signs of mental illness. This rather unremarkable finding is not relevant to the real problems of the reliability and validity of psychiatric diagnosis and only serves to obscure them. A correct interpretation of these data contradicts the conclusions that were drawn. In the setting of a psychiatric hospital, psychiatrists seem remarkably able to distinguish the “sane” from the “insane”.

  408. ⁠, Jonatan Pallesen (2019-02-19):

    Blind auditions and gender discrimination: A seminal paper from 2000 investigated the impact of blind auditions in orchestras, and found that they increased the proportion of women in symphony orchestras. I investigate the study, and find that there is no good evidence presented. [The study is temporally confounded by a national trend of increasing female participation, does not actually establish any particular correlate of blind auditions, much less randomized experiments of blinding, the dataset is extremely underpowered, the effects cited in coverage cannot be found anywhere in the paper, and the critical comparisons which are there are not even statistically-significant in the first place. None of these caveats are included in the numerous citations of the study as “proving” discrimination against women.]


  410. 2019-letexier.pdf: ⁠, Thibault Le Texier (2019-08-05; psychology):

    The (SPE) is one of psychology’s most famous studies. It has been criticized on many grounds, and yet a majority of textbook authors have ignored these criticisms in their discussions of the SPE, thereby misleading both students and the general public about the study’s questionable scientific validity.

    Data collected from a thorough investigation of the SPE archives and interviews with 15 of the participants in the experiment further question the study’s scientific merit. These data are not only supportive of previous criticisms of the SPE, such as the presence of ⁠, but provide new criticisms of the SPE based on heretofore unknown information. These new criticisms include the biased and incomplete collection of data, the extent to which the SPE drew on a prison experiment devised and conducted by students in one of Zimbardo’s classes 3 months earlier, the fact that the guards received precise instructions regarding the treatment of the prisoners, the fact that the guards were not told they were subjects, and the fact that participants were almost never completely immersed by the situation.

    Possible explanations of the inaccurate textbook portrayal and general misperception of the SPE’s scientific validity over the past 5 decades, in spite of its flaws and shortcomings, are discussed.

    [Keywords: Stanford Prison Experiment, Zimbardo, epistemology]

  411. ⁠, David Shariatmadari (2018-04-16):

    In 50s Middle Grove, things didn’t go according to plan either, though the surprise was of a different nature. Despite his pretence of leaving the 11-year-olds to their own devices, Sherif and his research staff, posing as camp counsellors and caretakers, interfered to engineer the result they wanted. He believed he could make the two groups, called the Pythons and the Panthers, sworn enemies via a series of well-timed “frustration exercises”. These included his assistants stealing items of clothing from the boys’ tents and cutting the rope that held up the Panthers’ homemade flag, in the hope they would blame the Pythons. One of the researchers crushed the Panthers’ tent, flung their suitcases into the bushes and broke a boy’s beloved ukulele. To Sherif’s dismay, however, the children just couldn’t be persuaded to hate each other…The robustness of the boys’ “civilised” values came as a blow to Sherif, making him angry enough to want to punch one of his young academic helpers. It turned out that the strong bonds forged at the beginning of the camp weren’t easily broken. Thankfully, he never did start the forest fire—he aborted the experiment when he realised it wasn’t going to support his hypothesis.

    But the Rockefeller Foundation had given Sherif $38,000 in 1953 (≈$306,948 today). In his mind, perhaps, if he came back empty-handed, he would face not just their anger but the ruin of his reputation. So, within a year, he had recruited boys for a second camp, this time in Robbers Cave state park in Oklahoma. He was determined not to repeat the mistakes of Middle Grove.

    …At Robbers Cave, things went more to plan. After a tug-of-war in which they were defeated, the Eagles burned the Rattlers’ flag. Then all hell broke loose, with raids on cabins, vandalism and food fights. Each moment of confrontation, however, was subtly manipulated by the research team. They egged the boys on, providing them with the means to provoke one another—who else, asks Perry in her book, could have supplied the matches for the flag-burning?

    …Sherif was elated. And, with the publication of his findings that same year, his status as world-class scholar was confirmed. The “Robbers Cave experiment” is considered seminal by social psychologists, still one of the best-known examples of “realistic conflict theory”. It is often cited in modern research. But was it scientifically rigorous? And why were the results of the Middle Grove experiment—where the researchers couldn’t get the boys to fight—suppressed? “Sherif was clearly driven by a kind of a passion”, Perry says. “That shaped his view and it also shaped the methods he used. He really did come from that tradition in the 30s of using experiments as demonstrations—as a confirmation, not to try to find something new.” In other words, think of the theory first and then find a way to get the results that match it. If the results say something else? Bury them…“I think people are aware now that there are real ethical problems with Sherif’s research”, she tells me, “but probably much less aware of the backstage [manipulation] that I’ve found. And that’s understandable because the way a scientist writes about their research is accepted at face value.” The published report of Robbers Cave uses studiedly neutral language. “It’s not until you are able to compare the published version with the archival material that you can see how that story is shaped and edited and made more respectable in the process.” That polishing up still happens today, she explains. “I wouldn’t describe him as a charlatan…every journal article, every textbook is written to convince, persuade and to provide evidence for a point of view. So I don’t think Sherif is unusual in that way.”

  422. ⁠, Alexey Guzey (2019-11-15):

    …In the process of reading the book and encountering some extraordinary claims about sleep, I decided to compare the facts it presented with the scientific literature. I found that the book consistently overstates the problem of lack of sleep, sometimes egregiously so. It misrepresents basic sleep research and contradicts its own sources.

    In one instance, Walker claims that sleeping less than six or seven hours a night doubles one’s risk of cancer—this is not supported by the scientific evidence (Section 1.1). In another instance, Walker seems to have invented a “fact” that the WHO has declared a sleep loss epidemic (Section 4). In yet another instance, he falsely claims that the National Sleep Foundation recommends 8 hours of sleep per night, and then uses this “fact” to falsely claim that two-thirds of people in developed nations sleep less than the “the recommended eight hours of nightly sleep” (Section 5).

    Walker’s book has likely wasted thousands of hours of life and worsened the health of people who read it and took its recommendations at face value (Section 7).

  423. ⁠, Alexey Guzey ():

    I’m an independent researcher with a background in Economics, Mathematics, and Cognitive Science. My biggest intellectual influences are Scott Alexander⁠, Dan Carlin⁠, Scott Adams, and Gwern⁠.

    Right now, I think about meta-science, biology and philanthropy⁠. My long-term goal is to make the future humane, aesthetic, and to make it happen faster.

    You can contact me at or via Twitter⁠, Telegram⁠, Facebook or VK

  424. {#linkBibliography-(stat)-2019 .docMetadata}, Sharon Begley (STAT) (2019-06-25; sociology  /​ ​​ ​preference-falsification):

    In the 30 years that biomedical researchers have worked determinedly to find a cure for Alzheimer’s disease, their counterparts have developed drugs that helped cut deaths from cardiovascular disease by more than half, and cancer drugs able to eliminate tumors that had been incurable. But for Alzheimer’s, not only is there no cure, there is not even a disease-slowing treatment.

    …In more than two dozen interviews, scientists whose ideas fell outside the dogma recounted how, for decades, believers in the dominant hypothesis suppressed research on alternative ideas: They influenced what studies got published in top journals, which scientists got funded, who got tenure, and who got speaking slots at reputation-buffing scientific conferences. The scientists described the frustrating, even career-ending, obstacles that they confronted in pursuing their research. A top journal told one that it would not publish her paper because others hadn’t. Another got whispered advice to at least pretend that the research for which she was seeking funding was related to the leading idea—that a protein fragment called beta-amyloid accumulates in the brain, creating neuron-killing clumps that are both the cause of Alzheimer’s and the key to treating it. Others could not get speaking slots at important meetings, a key showcase for research results. Several who tried to start companies to develop Alzheimer’s cures were told again and again by venture capital firms and major biopharma companies that they would back only an amyloid approach.

    …For all her regrets about the amyloid hegemony, Neve is an unlikely critic: She co-led the 1987 discovery of mutations in a gene called APP that increases amyloid levels and causes Alzheimer’s in middle age, supporting the then-emerging orthodoxy. Yet she believes that one reason Alzheimer’s remains incurable and untreatable is that the amyloid camp “dominated the field”, she said. Its followers were influential “to the extent that they persuaded the National Institute of Neurological Disorders and Stroke [part of the National Institutes of Health] that it was a waste of money to fund any Alzheimer’s-related grants that didn’t center around amyloid.” To be sure, NIH did fund some Alzheimer’s research that did not focus on amyloid. In a sea of amyloid-focused grants, there are tiny islands of research on oxidative stress, neuroinflammation, and, especially, a protein called tau. But Neve’s NINDS program officer, she said, “told me that I should at least collaborate with the amyloid people or I wouldn’t get any more NINDS grants.” (She hoped to study how neurons die.) A decade after her APP discovery, a disillusioned Neve left Alzheimer’s research, building a distinguished career in gene editing. Today, she said, she is “sick about the millions of people who have needlessly died from” the disease.

    Dr. Daniel Alkon, a longtime NIH neuroscientist who started a company to develop an Alzheimer’s treatment, is even more emphatic: “If it weren’t for the near-total dominance of the idea that amyloid is the only appropriate drug target”, he said, “we would be 10 or 15 years ahead of where we are now.”

    Making it worse is that the empirical support for the amyloid hypothesis has always been shaky. There were numerous red flags over the decades that targeting amyloid alone might not slow or reverse Alzheimer’s. “Even at the time the amyloid hypothesis emerged, 30 years ago, there was concern about putting all our eggs into one basket, especially the idea that ridding the brain of amyloid would lead to a successful treatment”, said neurobiologist Susan Fitzpatrick, president of the James S. McDonnell Foundation. But research pointing out shortcomings of the hypothesis was relegated to second-tier journals, at best, a signal to other scientists and drug companies that the criticisms needn’t be taken too seriously. Zaven Khachaturian spent years at NIH overseeing its early Alzheimer’s funding. Amyloid partisans, he said, “came to permeate drug companies, journals, and NIH study sections”, the groups of mostly outside academics who decide what research NIH should fund. “Things shifted from a scientific inquiry into an almost religious belief system, where people stopped being skeptical or even questioning.”

    …“You had a whole industry going after amyloid, hundreds of clinical trials targeting it in different ways”, Alkon said. Despite success in millions of mice, “none of it worked in patients.”

    Scientists who raised doubts about the amyloid model suspected why. Amyloid deposits, they thought, are a response to the true cause of Alzheimer’s and therefore a marker of the disease—again, the gravestones of neurons and synapses, not the killers. The evidence? For one thing, although the brains of elderly Alzheimer’s patients had amyloid plaques, so did the brains of people the same age who died with no signs of dementia, a pathologist discovered in 1991. Why didn’t amyloid rob them of their memories? For another, mice engineered with human genes for early Alzheimer’s developed both amyloid plaques and dementia, but there was no proof that the much more common, late-onset form of Alzheimer’s worked the same way. And yes, amyloid plaques destroy synapses (the basis of memory and every other brain function) in mouse brains, but there is no correlation between the degree of cognitive impairment in humans and the amyloid burden in the memory-forming hippocampus or the higher-thought frontal cortex. “There were so many clues”, said neuroscientist Nikolaos Robakis of the Icahn School of Medicine at Mount Sinai, who also discovered a mutation for early-onset Alzheimer’s. “Somehow the field believed all the studies supporting it, but not those raising doubts, which were very strong. The many weaknesses in the theory were ignored.”

  425. ⁠, David S. Yeager, Paul Hanselman, Gregory M. Walton, Jared S. Murray, Robert Crosnoe, Chandra Muller, Elizabeth Tipton, Barbara Schneider, Chris S. Hulleman, Cintia P. Hinojosa, David Paunesku, Carissa Romero, Kate Flint, Alice Roberts, Jill Trott, Ronaldo Iachan, Jenny Buontempo, Sophia Man Yang, Carlos M. Carvalho, P. Richard Hahn, Maithreyi Gopalan, Pratik Mhatre, Ronald Ferguson, Angela L. Duckworth, Carol S. Dweck (2019-08-07):

    A global priority for the behavioural sciences is to develop cost-effective, scalable interventions that could improve the academic outcomes of adolescents at a population level, but no such interventions have so far been evaluated in a population-generalizable sample. Here we show that a short (less than one hour), online growth mindset intervention—which teaches that intellectual abilities can be developed—improved grades among lower-achieving students and increased overall enrolment to advanced mathematics courses in a nationally representative sample of students in secondary education in the United States. Notably, the study identified school contexts that sustained the effects of the growth mindset intervention: the intervention changed grades when peer norms aligned with the messages of the intervention. Confidence in the conclusions of this study comes from independent data collection and processing, pre-registration of analyses, and corroboration of results by a blinded Bayesian analysis.

  426. 2019-forscher.pdf: ⁠, Patrick Forscher, Calvin Lai, Jordan Axt, Charles Ebersole, Michelle Herman, Patricia Devine, Brian Nosek (2019-08-19; psychology):

    Using a novel technique known as network meta-analysis, we synthesized evidence from 492 studies (87,418 participants) to investigate the effectiveness of procedures in changing implicit measures, which we define as response biases on implicit tasks. We also evaluated these procedures’ effects on explicit and behavioral measures. We found that implicit measures can be changed, but effects are often relatively weak (|ds| < .30). Most studies focused on producing short-term changes with brief, single-session manipulations. Procedures that associate sets of concepts, invoke goals or motivations, or tax mental resources changed implicit measures the most, whereas procedures that induced threat, affirmation, or specific moods/​​​​emotions changed implicit measures the least. Bias tests suggested that implicit effects could be inflated relative to their true population values. Procedures changed explicit measures less consistently and to a smaller degree than implicit measures and generally produced trivial changes in behavior. Finally, changes in implicit measures did not mediate changes in explicit measures or behavior. Our findings suggest that changes in implicit measures are possible, but those changes do not necessarily translate into changes in explicit measures or behavior.


  428. ⁠, Tom Stafford (2016-12-08):

    This seeming evidence of the irrationality of judges has been cited hundreds of times, in economics, psychology and legal scholarship. Now, a new analysis by Andreas Glöckner in the journal Judgment and Decision Making questions these conclusions.

    Glöckner’s analysis doesn’t prove that extraneous factors weren’t influencing the judges, but he shows how the same effect could be produced by entirely rational judges interacting with the protocols required by the legal system.

    The main analysis works like this: we know that favourable rulings take longer than unfavourable ones (~7 mins vs ~5 mins), and we assume that judges are able to guess how long a case will take to rule on before they begin it (from clues like the thickness of the file, the types of request made, the representation the prisoner has and so on). Finally, we assume judges have a time limit in mind for each of the three sessions of the day, and will avoid starting cases which they estimate will overrun the time limit for the current session.

    It turns out that this kind of rational time-management is sufficient to generate the drops in favourable outcomes. How this occurs isn’t straightforward, and it interacts with a quirk of the original authors’ data presentation (specifically, their graph shows the ordinal position of cases when the number of cases in each session varied from day to day—so, for example, it shows that the 12th case after a break is least likely to be judged favourably, but there wasn’t always a 12th case in each session. So sessions in which there were more unfavourable cases were more likely to contribute to this data point).
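    This selection mechanism can be sketched in a toy simulation (my own illustration, with invented parameters, not Glöckner’s exact model): favourable rulings take about 7 minutes, unfavourable about 5, and the judge never starts a case expected to overrun the session. Late ordinal positions are then reachable only in sessions packed with short (unfavourable) cases, so the favourable rate falls with position even though every case is decided on its merits:

```python
import random

def simulate_sessions(n_days=1000, session_limit=60, seed=1):
    """Toy model of the rational-judge account (an illustration, not
    Glockner's exact model). Favourable rulings take 7 min, unfavourable
    5 min; the judge defers any case that would overrun the session."""
    random.seed(seed)
    by_position = {}  # ordinal position within a session -> [favourable?, ...]
    for _ in range(n_days):
        queue = [random.random() < 0.5 for _ in range(20)]  # True = favourable
        elapsed, position = 0.0, 0
        while queue:
            duration = 7 if queue[0] else 5
            if elapsed + duration > session_limit:
                elapsed, position = 0.0, 0  # take a break; start a new session
                continue
            favourable = queue.pop(0)
            position += 1
            elapsed += duration
            by_position.setdefault(position, []).append(favourable)
    return {p: sum(v) / len(v) for p, v in sorted(by_position.items())}

rates = simulate_sessions()
# Early positions hover near the 50% base rate; the last reachable positions,
# hit only when nearly every preceding case was short (unfavourable), drop
# toward zero -- reproducing the "hungry judge" curve without any hunger.
```

    The drop is purely compositional: a 12th case can only be started if the first 11 were almost all short, i.e., unfavourable.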

  429. ⁠, Daniël Lakens (2017-07-05):

    I was listening to a recent Radiolab episode on blame and guilt, where the guest Robert Sapolsky mentioned a famous study on judges handing out harsher sentences before lunch than after lunch…During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to 0% over the number of cases that are decided on. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn’t believe my eyes. Not only is the drop indeed as large as mentioned—it occurs three times in a row over the course of the day, and after a break, it returns to exactly 65%!

    …Some people dislike statistics. They are only interested in effects that are so large, you can see them by just plotting the data. This study might seem to be a convincing illustration of such an effect. My goal in this blog is to argue against this idea. You need statistics, maybe especially when effects are so large they jump out at you. When reporting findings, authors should report and interpret effect sizes. An important reason for this is that effects can be impossibly large.

    …If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45AM. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just like manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal—you would already know it exists. Sort of how the “after lunch dip” is a strong and replicable finding that you can feel yourself (and that, as it happens, is directly in conflict with the finding that judges perform better immediately after lunch—surprisingly, the authors don’t discuss the after lunch dip).

    …I think it is telling that most psychologists don’t seem to be able to recognize data patterns that are too large to be caused by psychological mechanisms. There are simply no plausible psychological effects that are strong enough to cause the data pattern in the hungry judges study. Implausibility is not a reason to completely dismiss empirical findings, but impossibility is. It is up to authors to interpret the effect size in their study, and to show the mechanism through which an effect that is impossibly large, becomes plausible. Without such an explanation, the finding should simply be dismissed.

  430. 2020-devito.pdf: ⁠, Nicholas J. DeVito, Seb Bacon, Ben Goldacre (2020-01-17; statistics  /​ ​​ ​bias):

    Background: Failure to report the results of a clinical trial can distort the evidence base for clinical practice, breaches researchers’ ethical obligations to participants, and represents an important source of research waste. The Food and Drug Administration Amendments Act (FDAAA) of 2007 now requires sponsors of applicable trials to report their results directly onto within 1 year of completion. The first trials covered by the Final Rule of this act became due to report results in January, 2018. In this cohort study, we set out to assess compliance.

    Methods: We downloaded data for all registered trials on each month from March, 2018, to September, 2019. All cross-sectional analyses in this manuscript were performed on data extracted from on Sept 16, 2019; monthly trends analysis used archived data closest to the 15th day of each month from March, 2018, to September, 2019. Our study cohort included all applicable trials due to report results under FDAAA. We excluded all non-applicable trials, those not yet due to report, and those given a certificate allowing for delayed reporting. A trial was considered reported if results had been submitted and were either publicly available, or undergoing quality control review at A trial was considered compliant if these results were submitted within 1 year of the primary completion date, as required by the legislation. We described compliance with the FDAAA 2007 Final Rule, assessed trial characteristics associated with results reporting using logistic regression models, described sponsor-level reporting, examined trends in reporting, and described time-to-report using the Kaplan-Meier method.

    Findings: 4209 trials were due to report results; 1722 (40·9%; 95% CI 39·4–42·2) did so within the 1-year deadline. 2686 (63·8%; 62·4–65·3) trials had results submitted at any time. Compliance has not improved since July, 2018. Industry sponsors were statistically-significantly more likely to be compliant than non-industry, non-US Government sponsors (odds ratio [OR] 3·08 [95% CI 2·52–3·77]), and sponsors running large numbers of trials were statistically-significantly more likely to be compliant than smaller sponsors (OR 11·84 [9·36–14·99]). The median delay from primary completion date to submission date was 424 days (95% CI 412–435), 59 days higher than the legal reporting requirement of 1 year.

    Interpretation: Compliance with the FDAAA 2007 is poor, and not improving. To our knowledge, this is the first study to fully assess compliance with the Final Rule of the FDAAA 2007. Poor compliance is likely to reflect lack of enforcement by regulators. Effective enforcement and action from sponsors is needed; until then, open public audit of compliance for each individual sponsor may help. We will maintain updated compliance data for each individual sponsor and trial at

    Funding: Laura and John Arnold Foundation.

  431. {#linkBibliography-(science)-2020 .docMetadata doi=“10.1126/science.aba8123”}, Charles Piller (Science) (2020-01-13):

    The rule took full effect 2 years ago, on 2018-01-18, giving trial sponsors ample time to comply. But a Science investigation shows that many still ignore the requirement, while federal officials do little or nothing to enforce the law.

    Science examined more than 4700 trials whose results should have been posted on the NIH website under the 2017 rule. Reporting rates by most large pharmaceutical companies and some universities have improved sharply, but performance by many other trial sponsors—including, ironically, NIH itself—was lackluster. Those sponsors, typically either the institution conducting a trial or its funder, must deposit results and other data within 1 year of completing a trial. But of 184 sponsor organizations with at least five trials due as of 2019-09-25, 30 companies, universities, or medical centers never met a single deadline. As of that date, those habitual violators had failed to report any results for 67% of their trials and averaged 268 days late for those and all trials that missed their deadlines. They included such eminent institutions as the Harvard University-affiliated Boston Children’s Hospital, the University of Minnesota, and Baylor College of Medicine—all among the top 50 recipients of NIH grants in 2019. The violations cover trials in virtually all fields of medicine, and the missing or late results offer potentially vital information for the most desperate patients. For example, in one long-overdue trial, researchers compared the efficacy of different chemotherapy regimens in 200 patients with advanced lymphoma; another—nearly 2 years late—tests immunotherapy against conventional chemotherapy in about 600 people with late-stage lung cancer.

    …Contacted for comment, none of the institutions disputed the findings of this investigation. In all 4768 trials Science checked, sponsors violated the reporting law more than 55% of the time. And in hundreds of cases where the sponsors got credit for reporting trial results, they have yet to be publicly posted because of quality lapses flagged by staff (see sidebar).

    Although the 2017 rule, and officials’ statements at the time, promised aggressive enforcement and stiff penalties, neither NIH nor FDA has cracked down. FDA now says it won’t brandish its big stick—penalties of up to $12,103 a day for failing to report a trial’s results—until after the agency issues further “guidance” on how it will exercise that power. It has not set a date. NIH said at a 2016 briefing on the final rule that it would cut off grants to those who ignore the trial reporting requirements, as authorized in the 2007 law, but so far has not done so…NIH and FDA officials do not seem inclined to apply that pressure. Lyric Jorgenson, NIH deputy director for science policy, says her agency has been “trying to change the culture of how clinical trial results are reported and disseminated; not so much on the ‘aha, we caught you’, as much as getting people to understand the value, and making it as easy as possible to share and disseminate results.” To that end, she says, staff have educated researchers about the website and improved its usability. As for FDA, Patrick McNeilly, an official at the agency who handles trial enforcement matters, recently told an industry conference session on that “FDA has limited resources, and we encourage voluntary compliance.” He said the agency also reviews reporting of information on as part of inspections of trial sites, or when it receives complaints. McNeilly declined an interview request, but at the conference he discounted violations of reporting requirements found by journalists and watchdog groups. “We’re not going to blanketly accept an entire list of trials that people say are noncompliant”, he said.

    …It also highlights that pharma’s record has been markedly better than that of academia and the federal government.

    …But such good performance shouldn’t be an exception, Harvard’s Zarin says. “Further public accountability of the trialists, but also our government organizations, has to happen. One possibility is that FDA and NIH will be shamed into enforcing the law. Another possibility is that sponsors will be shamed into doing a better job. A third possibility is that will never fully achieve its vital aspirations.”

  432. ⁠, Marc P. Raphael, Paul E. Sheehan, Gary J. Vora (2020-03-10):

    In 2016, the US Defense Advanced Research Projects Agency (DARPA) told eight research groups that their proposals had made it through the review gauntlet and would soon get a few million dollars from its Biological Technologies Office (BTO). Along with congratulations, the teams received a reminder that their award came with an unusual requirement—an independent shadow team of scientists tasked with reproducing their results. Thus began an intense, multi-year controlled trial in reproducibility. Each shadow team consists of three to five researchers, who visit the ‘performer’ team’s laboratory and often host visits themselves. Between 3% and 8% of the programme’s total funds go to this independent validation and verification (IV&V) work…Awardees were told from the outset that they would be paired with an IV&V team consisting of unbiased, third-party scientists hired by and accountable to DARPA. In this programme, we relied on US Department of Defense laboratories, with specific teams selected for their technical competence and ability to solve problems creatively.

    …Results so far show a high degree of experimental reproducibility. The technologies investigated include using chemical triggers to control how cells migrate1; introducing synthetic circuits that control other cell functions2; intricate protein switches that can be programmed to respond to various cellular conditions3; and timed bacterial expression that works even in the variable environment of the mammalian gut4…getting to this point was more difficult than we expected. It demanded intense coordination, communication and attention to detail…Our effort needed capable research groups that could dedicate much more time (in one case, 20 months) and that could flexibly follow evolving research…A key component of the IV&V teams’ effort has been to spend a day or more working with the performer teams in their laboratories. Often, members of a performer laboratory travel to the IV&V laboratory as well. These interactions lead to a better grasp of methodology than reading a paper, frequently revealing person-to-person differences that can affect results…Still, our IV&V efforts have been derailed for weeks at a time for trivial reasons (see ‘Hard lessons’), such as a typo that meant an ingredient in cell media was off by an order of magnitude. We lost more than a year after discovering that commonly used biochemicals that were thought to be interchangeable are not.

    Document Reagents:…We lost weeks of work and performed useless experiments when we assumed that identically named reagents (for example, polyethylene glycol or fetal bovine serum) from different vendors could be used interchangeably. · See It Live:…In our hands, washing cells too vigorously or using the wrong-size pipette tip changed results unpredictably. · State a range: …Knowing whether 21 ° C means 20.5–21.5 ° C or 20–22 ° C can tell you whether cells will thrive or wither, and whether you’ll need to buy an incubator to make an experiment work. · Test, then ship: …Incorrect, outdated or otherwise diminished products were sent to the IV&V team for verification many times. · Double check: …A typo in one protocol cost us four weeks of failed experiments, and in general, vague descriptions of formulation protocols (for example, for expressing genes and making proteins without cells) caused months of delay and cost thousands of dollars in wasted reagents. · Pick a person: …The projects that lacked a dedicated and stable point of contact were the same ones that took the longest to reproduce. That is not coincidence. · Keep in silico analysis up to date: …Teams had to visit each others’ labs more than once to understand and fully implement computational-analysis pipelines for large microscopy data sets.

    …We have learnt to note the flow rates used when washing cells from culture dishes, to optimize salt concentration in each batch of medium and to describe temperature and other conditions with a range rather than a single number. This last practice came about after we realized that diminished slime-mould viability in our Washington DC facility was due to lab temperatures that could fluctuate by 2 °C on warm summer days, versus the more tightly controlled temperature of the performer lab in Baltimore 63 kilometres away. Such observations can be written up in a protocol paper…As one of our scientists said, “IV&V forces performers to think more critically about what qualifies as a successful system, and facilitates candid discussion about system performance and limitations.”

  433. 2020-artner.pdf: ⁠, Richard Artner, Thomas Verliefde, Sara Steegen, Sara Gomes, Frits Traets, Francis Tuerlinckx, Wolf Vanpaemel (2020-11-12; statistics  /​ ​​ ​bias):

    We investigated the reproducibility of the major statistical conclusions drawn in 46 articles published in 2012 in three APA journals. After having identified 232 key statistical claims, we tried to reproduce, for each claim, the test statistic, its degrees of freedom, and the corresponding p-value, starting from the raw data that were provided by the authors and closely following the Method section in the article. Out of the 232 claims, we were able to successfully reproduce 163 (70%), 18 of which only by deviating from the article’s analytical description. Thirteen (7%) of the 185 claims deemed statistically-significant by the authors are no longer so. The reproduction successes were often the result of cumbersome and time-consuming trial-and-error work, suggesting that APA style reporting in conjunction with raw data makes numerical verification at least hard, if not impossible. This article discusses the types of mistakes we could identify and the tediousness of our reproduction efforts in the light of a newly developed taxonomy for reproducibility. We then link our findings with other findings of empirical research on this topic, give practical recommendations on how to achieve reproducibility, and discuss the challenges of large-scale reproducibility checks as well as promising ideas that could considerably increase the reproducibility of psychological research.

  434. 2009-mytkowicz.pdf: ⁠, Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney (2009-03-07; cs):

    This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a substantial bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences.

    Our results demonstrate that measurement bias is substantial and commonplace in computer system evaluation. By substantial we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel’s C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias.

    Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.

    [Keywords: experimentation, measurement, performance, bias]
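    The flavour of the problem, and of the setup-randomization fix, can be captured in a toy Monte-Carlo sketch (all numbers invented for illustration; this is a caricature of the paper’s idea, not its benchmarks): each program’s measured runtime picks up a bias that depends on an irrelevant setup factor, so a comparison under one fixed setup can reverse the true ordering, while averaging over randomized setups recovers it:

```python
import random
import statistics

# Invented per-setup biases (think: alignment effects of environment size);
# each averages to zero across the four setups but differs per program.
BIAS_A = {0: 0.00, 1: 0.30, 2: -0.25, 3: -0.05}
BIAS_B = {0: 0.25, 1: -0.10, 2: 0.10, 3: -0.25}

def measured(true_time, bias, setup):
    """One noisy runtime measurement under a given experimental setup."""
    return true_time + bias[setup] + random.gauss(0, 0.02)

def mean_diff(setups):
    """Mean measured A-minus-B runtime difference; A is truly 0.1 slower."""
    return statistics.mean(
        measured(1.1, BIAS_A, s) - measured(1.0, BIAS_B, s) for s in setups)

random.seed(0)
fixed = mean_diff([0] * 200)                                  # one fixed setup
randomized = mean_diff(random.randrange(4) for _ in range(1000))  # randomized
# fixed < 0: wrongly concludes A is faster; randomized > 0: correct ordering.
```

    Under the fixed setup the bias never averages out, no matter how many repetitions are run; randomizing the setup turns the bias into noise that repetition can defeat.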

  435. ⁠, Robert M. Kaplan, Veronica L. Irvin (2015-05-21):

    Background: We explore whether the number of null results in large National Heart, Lung, and Blood Institute (NHLBI) funded trials has increased over time.

    Methods: We identified all large NHLBI supported RCTs between 1970 and 2012 evaluating drugs or dietary supplements for the treatment or prevention of cardiovascular disease. Trials were included if direct costs >$500,000/year, participants were adult humans, and the primary outcome was cardiovascular risk, disease or death. The 55 trials meeting these criteria were coded for whether they were published prior to or after the year 2000, whether they were registered in prior to publication, used active or placebo comparator, and whether or not the trial had industry co-sponsorship. We tabulated whether the study reported a positive, negative, or null result on the primary outcome variable and for total mortality.

    Results: 17 of 30 studies (57%) published prior to 2000 showed a statistically-significant benefit of intervention on the primary outcome in comparison to only 2 among the 25 (8%) trials published after 2000 (χ2 = 12.2, df = 1, p = 0.0005). There has been no change in the proportion of trials that compared treatment to placebo versus active comparator. Industry co-sponsorship was unrelated to the probability of reporting a statistically-significant benefit. Pre-registration in was strongly associated with the trend toward null findings.

    Conclusions: The number of NHLBI trials reporting positive results declined after the year 2000. Prospective declaration of outcomes in RCTs, and the adoption of transparent reporting standards, as required by, may have contributed to the trend toward null findings.

  436. 2020-blake.pdf: ⁠, Khandis R. Blake, Steven Gangestad (2020-03-25; statistics):

    The replication crisis has seen increased focus on best practice techniques to improve the reliability of scientific findings. What remains elusive to many researchers and is frequently misunderstood is that predictions involving interactions dramatically affect the calculation of statistical power. Using recent papers published in Personality and Social Psychology Bulletin (PSPB), we illustrate the pitfalls of improper power estimations in studies where attenuated interactions are predicted. Our investigation shows why even a programmatic series of 6 studies employing 2×2 designs, with samples exceeding n = 500, can be woefully underpowered to detect genuine effects. We also highlight the importance of accounting for error-prone measures when estimating effect sizes and calculating power, explaining why even positive results can mislead when power is low. We then provide five guidelines for researchers to avoid these pitfalls, including cautioning against the heuristic that a series of underpowered studies approximates the credibility of one well-powered study.

    [Keywords: statistical power, effect size, fertility, ovulation, interaction effects]
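    The core power problem can be seen in a short Monte-Carlo sketch (a generic illustration of the issue, not the paper’s own analysis): in a 2×2 design where the effect of factor A (d = 0.3) is fully attenuated (present at one level of B, absent at the other), the interaction contrast has twice the standard error of the simple effect, so at the same cell size its power is far lower:

```python
import random
import statistics

def simulate_power(n_per_cell, d=0.3, sims=2000, seed=7):
    """Monte-Carlo power for a 2x2 design with a fully attenuated
    interaction: factor A has effect d at B=0 and no effect at B=1.
    Uses rough z-tests on cell means (known unit variance) for clarity.
    Returns (power of the simple effect of A at B=0,
             power of the A x B interaction contrast)."""
    random.seed(seed)
    se = (1 / n_per_cell) ** 0.5        # standard error of one cell mean
    hits_simple = hits_inter = 0
    for _ in range(sims):
        m = {(a, b): statistics.mean(
                 random.gauss(d if (a, b) == (1, 0) else 0.0, 1)
                 for _ in range(n_per_cell))
             for a in (0, 1) for b in (0, 1)}
        z_simple = (m[1, 0] - m[0, 0]) / (se * 2 ** 0.5)
        z_inter = ((m[1, 0] - m[0, 0]) - (m[1, 1] - m[0, 1])) / (se * 2.0)
        hits_simple += abs(z_simple) > 1.96
        hits_inter += abs(z_inter) > 1.96
    return hits_simple / sims, hits_inter / sims

p_simple, p_inter = simulate_power(n_per_cell=100)
# With 100 per cell the simple effect is detected roughly half the time,
# the attenuated interaction much less often: testing the "same" effect as
# an interaction needs about double the sample per cell (across twice the
# cells) to reach the same power.
```

    A researcher who powers the study for the simple effect will thus be badly underpowered for the interaction that the hypothesis actually predicts.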

  437. 2020-botviniknezer.pdf: ⁠, Rotem Botvinik-Nezer, Felix Holzmeister, Colin F. Camerer, Anna Dreber, Juergen Huber, Magnus Johannesson, Michael Kirchler, Roni Iwanir, Jeanette A. Mumford, R. Alison Adcock, Paolo Avesani, Blazej M. Baczkowski, Aahana Bajracharya, Leah Bakst, Sheryl Ball, Marco Barilari, Nadège Bault, Derek Beaton, Julia Beitner, Roland G. Benoit, Ruud M. W. J. Berkers, Jamil P. Bhanji, Bharat B. Biswal, Sebastian Bobadilla-Suarez, Tiago Bortolini, Katherine L. Bottenhorn, Alexander Bowring, Senne Braem, Hayley R. Brooks, Emily G. Brudner, Cristian B. Calderon, Julia A. Camilleri, Jaime J. Castrellon, Luca Cecchetti, Edna C. Cieslik, Zachary J. Cole, Olivier Collignon, Robert W. Cox, William A. Cunningham, Stefan Czoschke, Kamalaker Dadi, Charles P. Davis, Alberto De Luca, Mauricio R. Delgado, Lysia Demetriou, Jeffrey B. Dennison, Xin Di, Erin W. Dickie, Ekaterina Dobryakova, Claire L. Donnat, Juergen Dukart, Niall W. Duncan, Joke Durnez, Amr Eed, Simon B. Eickhoff, Andrew Erhart, Laura Fontanesi, G. Matthew Fricke, Shiguang Fu, Adriana Galván, Remi Gau, Sarah Genon, Tristan Glatard, Enrico Glerean, Jelle J. Goeman, Sergej A. E. Golowin, Carlos González-García, Krzysztof J. Gorgolewski, Cheryl L. Grady, Mikella A. Green, João F. Guassi Moreira, Olivia Guest, Shabnam Hakimi, J. Paul Hamilton, Roeland Hancock, Giacomo Handjaras, Bronson B. Harry, Colin Hawco, Peer Herholz, Gabrielle Herman, Stephan Heunis, Felix Hoffstaedter, Jeremy Hogeveen, Susan Holmes, Chuan-Peng Hu, Scott A. Huettel, Matthew E. Hughes, Vittorio Iacovella, Alexandru D. Iordan, Peder M. Isager, Ayse I. Isik, Andrew Jahn, Matthew R. Johnson, Tom Johnstone, Michael J. E. Joseph, Anthony C. Juliano, Joseph W. Kable, Michalis Kassinopoulos, Cemal Koba, Xiang-Zhen Kong, Timothy R. Koscik, Nuri Erkut Kucukboyaci, Brice A. Kuhl, Sebastian Kupek, Angela R. 
Laird, Claus Lamm, Robert Langner, Nina Lauharatanahirun, Hongmi Lee, Sangil Lee, Alexander Leemans, Andrea Leo, Elise Lesage, Flora Li, Monica Y. C. Li, Phui Cheng Lim, Evan N. Lintz, Schuyler W. Liphardt, Annabel B. Losecaat Vermeer, Bradley C. Love, Michael L. Mack, Norberto Malpica, Theo Marins, Camille Maumet, Kelsey McDonald, Joseph T. McGuire, Helena Melero, Adriana S. Méndez Leal, Benjamin Meyer, Kristin N. Meyer, Glad Mihai, Georgios D. Mitsis, Jorge Moll, Dylan M. Nielson, Gustav Nilsonne, Michael P. Notter, Emanuele Olivetti, Adrian I. Onicas, Paolo Papale, Kaustubh R. Patil, Jonathan E. Peelle, Alexandre Pérez, Doris Pischedda, Jean-Baptiste Poline, Yanina Prystauka, Shruti Ray, Patricia A. Reuter-Lorenz, Richard C. Reynolds, Emiliano Ricciardi, Jenny R. Rieck, Anais M. Rodriguez-Thompson, Anthony Romyn, Taylor Salo, Gregory R. Samanez-Larkin, Emilio Sanz-Morales, Margaret L. Schlichting, Douglas H. Schultz, Qiang Shen, Margaret A. Sheridan, Jennifer A. Silvers, Kenny Skagerlund, Alec Smith, David V. Smith, Peter Sokol-Hessner, Simon R. Steinkamp, Sarah M. Tashjian, Bertrand Thirion, John N. Thorp, Gustav Tinghög, Loreen Tisdall, Steven H. Tompson, Claudio Toro-Serey, Juan Jesus Torre Tresols, Leonardo Tozzi, Vuong Truong, Luca Turella, Anna E. van ’t Veer, Tom Verguts, Jean M. Vettel, Sagana Vijayarajah, Khoi Vo, Matthew B. Wall, Wouter D. Weeda, Susanne Weis, David J. White, David Wisniewski, Alba Xifra-Porxas, Emily A. Yearling, Sangsuk Yoon, Rui Yuan, Kenneth S. L. Yuen, Lei Zhang, Xu Zhang, Joshua E. Zosky, Thomas E. Nichols, Russell A. Poldrack, Tom Schonberg (2020-05-20; statistics  /​ ​​ ​bias):

    Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses1. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a statistically-significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of statistically-significant findings, even by researchers with direct knowledge of the dataset2,3,4,5. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.

  438. ⁠, Scott Marek, Brenden Tervo-Clemmens, Finnegan J. Calabro, David F. Montez, Benjamin P. Kay, Alexander S. Hatoum, Meghan Rose Donohue, William Foran, Ryland L. Miller, Eric Feczko, Oscar Miranda-Dominguez, Alice M. Graham, Eric A. Earl, Anders J. Perrone, Michaela Cordova, Olivia Doyle, Lucille A. Moore, Greg Conan, Johnny Uriarte, Kathy Snider, Angela Tam, Jianzhong Chen, Dillan J. Newbold, Annie Zheng, Nicole A. Seider, Andrew N. Van, Timothy O. Laumann, Wesley K. Thompson, Deanna J. Greene, Steven E. Petersen, Thomas E. Nichols, B. T. Thomas Yeo, Deanna M. Barch, Hugh Garavan, Beatriz Luna, Damien A. Fair, Nico U. F. Dosenbach (2020-08-22):

    Magnetic resonance imaging (MRI) continues to drive many important neuroscientific advances. However, progress in uncovering reproducible associations between individual differences in brain structure/function and behavioral phenotypes (e.g., cognition, mental health) may have been undermined by typical neuroimaging sample sizes (median n = 25)1,2. Leveraging the Adolescent Brain Cognitive Development (ABCD) Study3 (n = 11,878), we estimated the effect sizes and reproducibility of these brain-wide association studies (BWAS) as a function of sample size. The very largest, replicable brain-wide associations for univariate and multivariate methods were r = 0.14 and r = 0.34, respectively. In smaller samples, typical for brain-wide association studies (BWAS), irreproducible, inflated effect sizes were ubiquitous, no matter the method (univariate, multivariate). Until sample sizes started to approach consortium-levels, BWAS were underpowered and statistical errors assured. Multiple factors contribute to replication failures4–6; here, we show that the pairing of small brain-behavioral phenotype effect sizes with sampling variability is a key element in wide-spread BWAS replication failure. Brain-behavioral phenotype associations stabilize and become more reproducible with sample sizes of N ⪆ 2,000. While investigator-initiated brain-behavior research continues to generate hypotheses and propel innovation, large consortia are needed to usher in a new era of reproducible human brain-wide association studies.
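    The winner’s-curse mechanism behind these inflated small-sample effects is easy to reproduce in a generic sketch (my own illustration, not the paper’s analysis): with a true correlation of r = 0.14, a study of n = 25 can only reach p < 0.05 when sampling error pushes the observed correlation to roughly 0.4 or beyond, so every “significant” small-n estimate is inflated severalfold, while n = 2000 recovers the true value:

```python
import random
import statistics

def sample_r(true_r, n, rng):
    """Sample correlation from a bivariate normal with correlation true_r."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_r * x + (1 - true_r ** 2) ** 0.5 * rng.gauss(0, 1) for x in xs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs, mx), statistics.pstdev(ys, my)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

def significant_rs(true_r=0.14, n=25, sims=2000, seed=3):
    """Observed correlations from repeated studies that pass a rough
    two-tailed p<.05 cutoff (|r| > 2/sqrt(n))."""
    rng = random.Random(seed)
    thresh = 2 / n ** 0.5
    return [r for r in (sample_r(true_r, n, rng) for _ in range(sims))
            if abs(r) > thresh]

small = significant_rs(n=25)              # cutoff |r| > 0.40
big = significant_rs(n=2000, sims=200)    # cutoff |r| > ~0.045
# Mean |r| among "significant" n=25 studies exceeds 0.4, roughly triple the
# true 0.14; at n=2000 nearly every study is significant and the mean
# estimate sits near the true value.
```

    Publication filtering on significance thus guarantees inflated effect sizes at BWAS-typical sample sizes, independent of any questionable research practices.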

  439. 1988-spiller.pdf: “S.L.A. Marshall and the ratio of fire”⁠, Roger J. Spiller


  441. 1989-hackworth-aboutface-slamarshallexcerpts.pdf#page=37: “About Face: The Odyssey of an American Warrior [excerpts about S. L. A. Marshall (SLAM)]”⁠, David H. Hackworth, Julie Sherman



  444. 2002-jordan.pdf: “Right for the Wrong Reasons: S. L. A. Marshall and the Ratio of Fire in Korea”⁠, Kelly C. Jordan

  445. 2003-chambers.pdf: ⁠, John Whiteclay Chambers II (2003-09-01; history):

    Chambers discusses the findings of journalist-soldier S. L. A. Marshall about combat fire ratios, particularly in World War II. Marshall claimed that his figure for the ratio of fire, the proportion of a rifle unit firing its weapons in battle, was derived from his group after-action interviews, a method he developed as a field historian in World War II and continued to use as a civilian journalist, Reserve officer, and military consultant. Although the ratio-of-fire figure was his most famous product, Marshall was proudest of his methodology.

    [Chambers interviews Frank L. Brennan, an assistant of Marshall during his Korean War after-action interview work, who accompanied him to every interview. Brennan described Marshall’s methodology as follows: the group interviews typically lasted about 2 hours at most; Marshall took few notes; Marshall preferred to ask open-ended questions and listen to the discussions; he rarely asked questions specifically about the rate of fire or whether a soldier had fired his weapon; he did not seem to collect any of the statistics he would later report in his books; and Marshall was evasive when Brennan asked about his WWII statistics’ sources. Brennan noted that in Marshall’s autobiography, Marshall greatly inflated his importance & the resources placed at his disposal in Korea, and the length of his interviews. Brennan also served in combat afterwards, and observed a high rate of fire in his own men.]


  447. 2008-engen.pdf: ⁠, Robert Engen (2008-12-01; history):

    It would appear, then, that Lieutenant Colonel Grossman’s appeals to biology and psychology are flawed, and that the bulwark of his historical evidence—S. L. A. Marshall’s assertion that soldiers do not fire their weapons—can be verifiably disproven. The theory of an innate, biological resistance to killing has little support in either evolutionary biology or in what we know about psychology, and, discounting Marshall’s claims, there is little basis in military history for such a theory either. This is not to say that all people can or will kill, or even that all soldiers can or will kill. Combat is staggeringly complex, an environment where human beings are pushed beyond all tolerable limits. There is much that we do not know, and plenty that we should be doing more to learn about. Grossman is clearly leading the way in posing these questions. Much of his work on the processes of killing and the relevance of physical distance to killing is extremely insightful. There is material in On Combat about fear, heart rate, and combat effectiveness that might be groundbreaking, and it should be studied carefully by historians trying to understand human behaviour in war. No disrespect to Lieutenant-Colonel Grossman is intended by this article, and it is not meant to devalue his work. I personally believe that some of the elements of his books, particularly the physiology of combat, would actually be strengthened if they were not shackled to the idea that humans cannot kill one another. But there are still questions that need to be asked, and the subject should not be considered closed. Grossman’s overall picture of killing in war and society is heavily informed by a belief in an innate human resistance to killing that, as has been offered here, does not stand up well to scrutiny. More research on the processes of human killing is needed, and although On Killing and On Combat form an excellent starting point, there are too many problems with their interpretation for them to be considered the final word on the subject. I believe that, in the future, the Canadian Forces needs to take a more critical posture when it comes to incorporating Grossman’s studies into its own doctrine. It is imperative that our nation’s military culture remain one devoted to pursuing the best available evidence at all costs, rather than one merely following the most popular consensus.

  448. 2020-serragarcia.pdf: ⁠, Marta Serra-Garcia, Karsten T. Hansen, Uri Gneezy (2020-07-01; statistics  /​ ​​ ​bias):

    Large amounts of resources are spent annually to improve educational achievement and to close the gender gap in sciences with typically very modest effects. In 2010, a 15-min self-affirmation intervention showed a dramatic reduction in this gender gap. We reanalyzed the original data and found several critical problems. First, the self-affirmation hypothesis stated that women’s performance would improve. However, the data showed no improvement for women. There was an interaction effect between self-affirmation and gender caused by a negative effect on men’s performance. Second, the findings were based on covariate-adjusted interaction effects, which imply that self-affirmation reduced the gender gap only for the small sample of men and women who did not differ in the covariates. Third, specification-curve analyses with more than 1,500 possible specifications showed that less than one quarter yielded statistically-significant interaction effects and less than 3% showed significant improvements among women.

  449. ⁠, Justin Nix, M. James Lozada (2019-12-30; statistics  /​ ​​ ​bias⁠, sociology):

    We reevaluate the claim from Bor et al. (2018: 302) that “police killings of unarmed black Americans have effects on mental health among black American adults in the general population.”

    The Mapping Police Violence data used by the authors includes 91 incidents involving black decedents who were either (1) not killed by police officers in the line of duty or (2) armed when killed. These incidents should have been removed or recoded prior to analysis.

    Correctly recoding these incidents decreased in magnitude all of the reported coefficients, and, more importantly, eliminated the reported statistically-significant effect of exposure to police killings of unarmed black individuals on the mental health of black Americans in the general population.

    We caution researchers to vet carefully crowdsourced data that tracks police behaviors and warn against reducing these complex incidents to overly simplistic armed/​​​​unarmed dichotomies.

  450. ⁠, Erik van Zwet, Simon Schwab, Stephen Senn (2020-11-30):

    We abstract the concept of a randomized controlled trial (RCT) as a triple (β, b, s), where β is the primary efficacy parameter, b the estimate and s the standard error (s > 0). The parameter β is either a difference of means, a log odds ratio or a log hazard ratio. If we assume that b is unbiased and normally distributed, then we can estimate the full joint distribution of (β, b, s) from a sample of pairs (bᵢ, sᵢ).

    We use 23,747 such pairs from the Cochrane Database of Systematic Reviews to do so. Here, we report the estimated distribution of the signal-to-noise ratio β⁄s and the achieved power. We estimate the median achieved power to be 0.13. We also consider the exaggeration ratio, which is the factor by which the magnitude of β is overestimated. We find that if the estimate is just statistically-significant at the 5% level, we would expect it to overestimate the true effect by a factor of 1.7.

    This exaggeration is sometimes referred to as the winner’s curse, and it is undoubtedly to a considerable extent responsible for disappointing replication results. For this reason, we believe it is important to shrink the unbiased estimator, and we propose a method for doing so.
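
    The mechanics can be sketched in a few lines with a toy normal model (assumed numbers, not the paper’s Cochrane-based estimate): give a trial the paper’s estimated median achieved power of ~13% and condition on statistical-significance, and the surviving estimates overstate the true effect severalfold on average; the paper’s 1.7× figure is the milder, model-based exaggeration for an estimate that is only just significant.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: true effect beta with standard error s, chosen so that the
# signal-to-noise ratio beta/s = 0.8 yields ~13% power at the 5% level
# (the paper's estimated median achieved power). Numbers are assumptions.
beta, s = 0.8, 1.0
b = rng.normal(beta, s, size=1_000_000)  # unbiased estimates across trials

significant = np.abs(b / s) > 1.96
power = significant.mean()
exaggeration = np.abs(b[significant]).mean() / beta

print(f"achieved power ≈ {power:.2f}")
print(f"mean exaggeration among significant estimates ≈ {exaggeration:.1f}×")
```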

  451. ⁠, Ann Lin, Christopher J. Giuliano, Ann Palladino, Kristen M. John, Connor Abramowicz, Monet Lou Yuan, Erin L. Sausville, Devon A. Lukow, Luwei Liu, Alexander R. Chait, Zachary C. Galluzzo, Clara Tucker, Jason M. Sheltzer (2019):

    Ninety-seven percent of drug-indication pairs that are tested in clinical trials in oncology never advance to receive U.S. Food and Drug Administration approval. While lack of efficacy and dose-limiting toxicities are the most common causes of trial failure, the reason(s) why so many new drugs encounter these problems is not well understood. Using CRISPR-Cas9 mutagenesis, we investigated a set of cancer drugs and drug targets in various stages of clinical testing. We show that, contrary to previous reports obtained predominantly with RNA interference and small-molecule inhibitors, the proteins ostensibly targeted by these drugs are nonessential for cancer cell proliferation. Moreover, the efficacy of each drug that we tested was unaffected by the loss of its putative target, indicating that these compounds kill cells via off-target effects. By applying a genetic target-deconvolution strategy, we found that the mischaracterized anticancer agent OTS964 is actually a potent inhibitor of the cyclin-dependent kinase CDK11 and that multiple cancer types are addicted to CDK11 expression. We suggest that stringent genetic validation of the mechanism of action of cancer drugs in the preclinical setting may decrease the number of therapies tested in human patients that fail to provide any clinical benefit.



  454. Lizardman-constant

  455. 2020-dellavigna.pdf: ⁠, Stefano DellaVigna, Elizabeth Linos (2020-07-01; sociology):

    Nudge interventions have quickly expanded from academic studies to larger implementation in so-called Nudge Units in governments. This provides an opportunity to compare interventions in research studies, versus at scale. We assemble a unique data set of 126 RCTs covering over 23 million individuals, including all trials run by 2 of the largest Nudge Units in the United States. We compare these trials to a sample of nudge trials published in academic journals from 2 recent meta-analyses.

    In papers published in academic journals, the average impact of a nudge is very large—an 8.7 percentage point take-up effect, a 33.5% increase over the average control. In the Nudge Unit trials, the average impact is still sizable and highly statistically-significant, but smaller at 1.4 percentage points, an 8.1% increase [8.7 / 1.4 = 6.2×].

    We consider 5 potential channels for this gap: statistical power, selective publication, academic involvement, differences in trial features and in nudge features. Publication bias in the academic journals, exacerbated by low statistical power, can account for the full difference in effect sizes. Academic involvement does not account for the difference. Different features of the nudges, such as in-person versus letter-based communication, likely reflecting institutional constraints, can partially explain the different effect sizes.

    We conjecture that larger sample sizes and institutional constraints, which play an important role in our setting, are relevant in other at-scale implementations. Finally, we compare these results to the predictions of academics and practitioners. Most forecasters overestimate the impact for the Nudge Unit interventions, though nudge practitioners are almost perfectly calibrated.

    Figure 4: Nudge treatment effects. This figure plots the treatment effect relative to control group take-up for each nudge. Nudges with extreme treatment effects are labeled for context.

    …In this paper, we present the results of a unique collaboration with 2 of the major “Nudge Units”: BIT North America operating at the level of US cities and SBST/OES for the US Federal government. These 2 units kept a comprehensive record of all trials that they ran from inception in 2015 to July 2019, for a total of 165 trials testing 349 nudge treatments and a sample size of over 37 million participants. In a remarkable case of administrative transparency, each trial had a trial report, including in many cases a pre-analysis plan. The 2 units worked with us to retrieve the results of all the trials. Importantly, over 90% of these trials have not been documented in working paper or academic publication format. [emphasis added]

    …Since we are interested in comparing the Nudge Unit trials to nudge papers in the literature, we aim to find broadly comparable studies in academic journals, without hand-picking individual papers. We lean on 2 recent meta-analyses summarizing over 100 RCTs across many different applications (⁠, and ). We apply similar restrictions as we did in the Nudge Unit sample, excluding lab or hypothetical experiments and non-RCTs, treatments with financial incentives, requiring treatments with binary dependent variables, and excluding default effects. This leaves a final sample of 26 RCTs, including 74 nudge treatments with 505,337 participants. Before we turn to the results, we stress that the features of behavioral interventions in academic journals do not perfectly match with the nudge treatments implemented by the Nudge Units, a difference to which we indeed return below. At the same time, overall interventions conducted by Nudge Units are fairly representative of the type of nudge treatments that are run by researchers.

    What do we find? In the sample of 26 papers in the Academic Journals sample, we compute the average (unweighted) impact of a nudge across the 74 nudge interventions. We find that on average a nudge intervention increases the take up by 8.7 (s.e. = 2.5) percentage points, out of an average control take up of 26.0 percentage points.

    Turning to the 126 trials by Nudge Units, we estimate an unweighted impact of 1.4 percentage points (s.e. = 0.3), out of an average control take up of 17.4 percentage points. While this impact is highly statistically-significantly different from 0 and sizable, it is about 1⁄6th the size of the estimated nudge impact in academic papers. What explains this large difference in the impact of nudges?

    We discuss 3 features of the 2 samples which could account for this difference. First, we document a large difference in the sample size and thus statistical power of the interventions. The median nudge intervention in the Academic Journals sample has treatment arm sample size of 484 participants and a minimum detectable effect size (MDE, the effect size that can be detected with 80% power) of 6.3 percentage points. In contrast, the nudge interventions in the Nudge Units have a median treatment arm sample size of 10,006 participants and MDE of 0.8 percentage points. Thus, the statistical power for the trials in the Academic Journals sample is nearly an order of magnitude smaller. This illustrates a key feature of the “at scale” implementation: the implementation in an administrative setting allows for a larger sample size. Importantly, the smaller sample size for the Academic Journals papers could lead not just to noisier estimates, but also to upward-biased point estimates in the presence of publication bias.

    A second difference, directly zooming into publication bias, is the evidence of selective publication of studies with statistically-significant results (t > 1.96), versus studies that are not statistically-significant (t < 1.96). In the sample of Academic Journals nudges, there are over 4× as many studies with a t-statistic for the most statistically-significant nudge between 1.96 and 2.96, versus the number of studies with the most statistically-significant nudge with a t between 0.96 and 1.96. Interestingly, the publication bias appears to operate at the level of the most statistically-significant treatment arm within a paper. By comparison, we find no evidence of a discontinuity in the distribution of t-statistics for the Nudge Unit sample, consistent with the fact that the Nudge Unit registry contains the comprehensive sample of all studies run. We stress here that with “publication bias” we include not just whether a journal would publish a paper, but also whether a researcher would write up a study (the “file drawer” problem). In the Nudge Units sample, all these selective steps are removed, as we access all studies that were run.
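
    The file-drawer mechanism described above is easy to demonstrate numerically (illustrative numbers only: the true effect below is the reported Nudge Unit average, while the standard error is an assumption loosely matched to a small, underpowered academic trial, not the paper’s data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Every trial measures the same true lift, but only estimates reaching
# t > 1.96 escape the file drawer and get written up / published.
true_lift = 1.4  # percentage points (the Nudge Unit average take-up effect)
se = 2.5         # assumed s.e. of a small academic trial arm
estimates = rng.normal(true_lift, se, size=200_000)

published = estimates[estimates / se > 1.96]  # survives selective publication
print(f"true effect:              {true_lift:.1f} pp")
print(f"average published effect: {published.mean():.1f} pp")
```

    With low power, the rare estimates that clear the significance bar are necessarily several times larger than the truth, which is the paper’s explanation for the 8.7 versus 1.4 percentage-point gap.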


  457. 2014-oboyle.pdf: ⁠, Ernest Hugh O’Boyle Jr., George Christopher Banks, Erik Gonzalez-Mulé (2014; statistics  /​ ​​ ​bias):

    The issue of a published literature not representative of the population of research is most often discussed in terms of entire studies being suppressed. However, alternative sources of publication bias are questionable research practices (QRPs) that entail post hoc alterations of hypotheses to support data or post hoc alterations of data to support hypotheses. Using general strain theory as an explanatory framework, we outline the means, motives, and opportunities for researchers to better their chances of publication independent of rigor and relevance. We then assess the frequency of QRPs in management research by tracking differences between dissertations and their resulting journal publications. Our primary finding is that from dissertation to journal article, the ratio of supported to unsupported hypotheses more than doubled (0.82 to 1.00 versus 1.94 to 1.00). The rise in predictive accuracy resulted from the dropping of statistically nonsignificant hypotheses, the addition of statistically-significant hypotheses, the reversing of predicted direction of hypotheses, and alterations to data. We conclude with recommendations to help mitigate the problem of an unrepresentative literature that we label the “Chrysalis Effect.”

  458. ⁠, Julia M. Rohrer, Warren Tierney, Eric L. Uhlmann, Lisa M. DeBruine, Tom Heyman, Benedict Jones, Stefan C. Schmukle, Raphael Silberzahn, Rebecca M. Willén, Rickard Carlsson, Richard E. Lucas, Julia Strand, Simine Vazire, Jessica K. Witt, Thomas R. Zentall, Christopher F. Chabris, Tal Yarkoni (2021-03-01):

    Science is often perceived to be a self-correcting enterprise. In principle, the assessment of scientific claims is supposed to proceed in a cumulative fashion, with the reigning theories of the day progressively approximating truth more accurately over time. In practice, however, cumulative self-correction tends to proceed less efficiently than one might naively suppose. Far from evaluating new evidence dispassionately and infallibly, individual scientists often cling stubbornly to prior findings.

    Here we explore the dynamics of scientific self-correction at an individual rather than collective level. In 13 written statements, researchers from diverse branches of psychology share why and how they have lost confidence in one of their own published findings. We qualitatively characterize these disclosures and explore their implications.

    A cross-disciplinary survey suggests that such loss-of-confidence sentiments are surprisingly common among members of the broader scientific population yet rarely become part of the public record. We argue that removing barriers to self-correction at the individual level is imperative if the scientific community as a whole is to achieve the ideal of efficient self-correction.

    [Keywords: self-correction, knowledge accumulation, metascience, scientific falsification, incentive structure, scientific errors]

  459. 1968-rosenthal-pygmalionintheclassroom.pdf: “Pygmalion In The Classroom: Teacher Expectation and Pupil's Intellectual Development”⁠, Robert Rosenthal, Lenore Jacobson

  460. 1976-rosenthal-experimenterexpectancyeffects.pdf: ⁠, Robert Rosenthal (1976; statistics  /​ ​​ ​bias):

    Within the context of a general discussion of the unintended effects of scientists on the results of their research, this work reported on the growing evidence that the hypothesis of the behavioral scientist could come to serve as self-fulfilling prophecy, by means of subtle processes of communication between the experimenter and the human or animal research subject. [The Science Citation Index (SCI) and the Social Sciences Citation Index (SSCI) indicate that the book has been cited over 740 times since 1966 [as of 1979].] —“Citation Classic”

    [Enlarged Edition, expanded with discussion of the Pygmalion effect etc: ISBN 0-470-01391-5]

  461. 1968-thorndike.pdf

  462. 1969-thorndike.pdf

  463. 1969-snow.html

  464. 1971-elashoff-pygmalionreconsidered.pdf: “Pygmalion Reconsidered: A Case Study in Statistical Inference: Reconsideration of the Rosenthal-Jacobson Data on Teacher Expectancy”⁠, Janet D. Elashoff, Richard E. Snow

  465. 1975-cronbach.pdf

  466. 1987-wineburg.pdf

  467. 1987-wineburg-2.pdf

  468. 1995-snow.pdf

  469. 1999-spitz.pdf: ⁠, Herman H. Spitz (1999-09; iq):

    The 1968 publication of the Rosenthal and Jacobson’s Pygmalion in the Classroom offered the optimistic message that raising teachers’ expectations of their pupils’ potentials would raise their pupils’ intelligence. This claim was, and still is, endorsed by many psychologists and educators. The original study, along with the scores of attempted replications and the acrimonious controversy that followed it, is reviewed, and its consequences discussed.

  470. 2005-jussim.pdf: ⁠, Lee Jussim, Kent D. Harber (2005; statistics  /​ ​​ ​bias):

    This article shows that 35 years of empirical research on teacher expectations justifies the following conclusions: (a) Self-fulfilling prophecies in the classroom do occur, but these effects are typically small, they do not accumulate greatly across perceivers or over time, and they may be more likely to dissipate than accumulate; (b) powerful self-fulfilling prophecies may selectively occur among students from stigmatized social groups; (c) whether self-fulfilling prophecies affect intelligence, and whether they in general do more harm than good, remains unclear, and (d) teacher expectations may predict student outcomes more because these expectations are accurate than because they are self-fulfilling. Implications for future research, the role of self-fulfilling prophecies in social problems, and perspectives emphasizing the power of erroneous beliefs to create social reality are discussed.

    [Jussim discusses the famous ‘Pygmalion effect’. It demonstrates the Replication crisis: an initial extraordinary finding indicating that teachers could raise student IQs by dozens of points gradually shrunk over repeated replications to essentially zero net long-term effect. The original finding was driven by statistical malpractice bordering on research fraud: some students had “pretest IQ scores near zero, and others had post-test IQ scores over 200”! Rosenthal further maintained the Pygmalion effect by statistical trickery, such as his ‘fail-safe N’, which attempted to show that hundreds of studies would have to have not been published in order for the Pygmalion effect to be true—except this assumes zero publication bias in those unpublished studies and begs the question.]


  472. Books#experimenter-effects-in-behavioral-research-rosenthal-1976











  483. #baumeister-et-al-2003




















  503. 2011-09-slacktory-correlations-biebertonsillitis.png




  507. 1955-michie.pdf: “The Importance of Being Cross-Bred”⁠, Donald Michie, Anne McLaren

  508. 2020-voekl.pdf: ⁠, Bernhard Voelkl, Naomi S. Altman, Anders Forsman, Wolfgang Forstmeier, Jessica Gurevitch, Ivana Jaric, Natasha A. Karp, Martien J. Kas, Holger Schielzeth, Tom Casteele, Hanno Würbel (2020-06-02; statistics  /​ ​​ ​bias):

    Context-dependent biological variation presents a unique challenge to the reproducibility of results in experimental animal research, because organisms’ responses to experimental treatments can vary with both genotype and environmental conditions. In March 2019, experts in animal biology, experimental design and statistics convened in Blonay, Switzerland, to discuss strategies addressing this challenge.

    In contrast to the current gold standard of rigorous standardization in experimental animal research, we recommend the use of systematic heterogenization of study samples and conditions by actively incorporating biological variation into study design through diversifying study samples and conditions.

    Here we provide the scientific rationale for this approach in the hope that researchers, regulators, funders and editors can embrace this paradigm shift. We also present a road map towards better practices in view of improving the reproducibility of animal research.

  509. 1967-vesell.pdf: ⁠, Elliot S. Vesell (1967-09-01; statistics  /​ ​​ ​bias):

    Induction of three drug-metabolizing enzymes occurred in liver microsomes of mice and rats kept on softwood bedding of either red cedar, white pine, or ponderosa pine. This induction was reversed when animals were placed on hardwood bedding composed of a mixture of beech, birch, and maple. Differences in the capacity of various beddings to induce may partially explain divergent results of studies on drug-metabolizing enzymes. The presence of such inducing substances in the environment may influence the pharmacologic responsiveness of animals to a wide variety of drugs.

    [ description:

    Even the animal experimenter is not exempt from problems of interaction. (I am indebted to Neal Miller for the following example.) Investigators checking on how animals metabolize drugs found that results differed mysteriously from laboratory to laboratory. The most startling inconsistency of all occurred after a refurbishing of a National Institutes of Health (NIH) animal room brought in new cages and new supplies. Previously, a mouse would sleep for about 35 minutes after a standard injection of hexobarbital. In their new homes, the NIH mice came miraculously back to their feet just 16 minutes after receiving a shot of the drug. Detective work proved that red-cedar bedding made the difference, stepping up the activity of several enzymes that metabolize hexobarbital. Pine shavings had the same effect. When the softwood was replaced with birch or maple bedding like that originally used, drug response came back in line with previous experience (Vesell, 1967).]


  510. 1970-schein.pdf


  512. 1981-lagakos.pdf

  513. 1986-wilbourn.pdf



  516. 2000-fishbain.pdf

  517. 2000-olson.pdf: “Concordance of the Toxicity of Pharmaceuticals in Humans and in Animals”⁠, Olson, H., et al.




  521. 2002-sandercock.pdf

  522. 2002-ikonomidou.pdf


  524. 2003-lee.pdf


  526. ⁠, Pandora Pound, Shah Ebrahim, Peter Sandercock, Michael B. Bracken, Ian Roberts, Reviewing Animal Trials Systematically (RATS) Group (2004-02-28):

    Much animal research into potential treatments for humans is wasted because it is poorly conducted and not evaluated through systematic reviews.

    …We searched MEDLINE to identify published systematic reviews of animal experiments (see for the search strategy). The search identified 277 possible papers, of which 22 were reports of systematic reviews. We are also aware of one recently published study and two unpublished studies, bringing the total to 25. Three further studies are in progress (M Macleod, personal communication). Seven of the 25 papers were systematic reviews of animal studies that had been conducted to find out how the animal research had informed the clinical research. Two of these reported on the same group of studies, giving six reviews in this category. A further 10 papers were systematic reviews of animal studies conducted to assess the evidence for proceeding to clinical trials or to establish an evidence base. 8 systematically reviewed both the animal and human studies in a particular field, again before clinical trials had taken place. We focus on the 6 studies in the first category because these shed the most light on the contribution that animal research makes to clinical medicine.

    …The clinical trials of nimodipine and low level laser therapy were conducted concurrently with the animal studies, while the clinical trials of fluid resuscitation, thrombolytic therapy, and endothelin receptor blockade went ahead despite evidence of harm from the animal studies. This suggests that the animal data were regarded as irrelevant, calling into question why the studies were done in the first place and seriously undermining the principle that animal experiments are necessary to inform clinical medicine.

    Furthermore, many of the existing animal experiments were poorly designed…Although randomisation and blinding are accepted as standard in clinical trials, no such standards exist for animal studies. Bebarta et al. found that animal studies that did not report randomisation and blinding were more likely to report a treatment effect than studies that used these methods. The box summarises further potential methodological problems.

    Summary points:

    • The value of animal research into potential human treatments needs urgent rigorous evaluation
    • Systematic reviews can provide important insights into the validity of animal research
    • The few existing reviews have highlighted deficiencies such as animal and clinical trials being conducted simultaneously
    • Many animal studies were of poor methodological quality
    • Systematic reviews should become routine to ensure the best use of existing animal data as well as improve the estimates of effect from animal experiments

  528. 2004-greaves.pdf




  532. 2005-macleod.pdf

  533. 2005-macleod-2.pdf




  537. 2006-hackam.pdf

  538. 2007-ocollins.pdf

  539. 2006-peters.pdf: ⁠, Jaime L. Peters, Alex J. Sutton, David R. Jones, Lesley Rushton, Keith R. Abrams (2006; statistics  /​ ​​ ​meta-analysis):

    To maximize the findings of animal experiments to inform likely health effects in humans, a thorough review and evaluation of the animal evidence is required. Systematic reviews and, where appropriate, meta-analyses have great potential in facilitating such an evaluation, making efficient use of the animal evidence while minimizing possible sources of bias. The extent to which systematic review and meta-analysis methods have been applied to evaluate animal experiments to inform human health is unknown.

    Using systematic review methods, we examine the extent and quality of systematic reviews and meta-analyses of in vivo animal experiments carried out to inform human health. We identified 103 articles meeting the inclusion criteria: 57 reported a systematic review, 29 a systematic review and a meta-analysis, and 17 reported a meta-analysis only.

    The use of these methods to evaluate animal evidence has increased over time. Although the reporting of systematic reviews is of adequate quality, the reporting of meta-analyses is poor. The inadequate reporting of meta-analyses observed here leads to questions on whether the most appropriate methods were used to maximize the use of the animal evidence to inform policy or decision-making. We recommend that guidelines proposed here be used to help improve the reporting of systematic reviews and meta-analyses of animal experiments.

    Further consideration of the use and methodological quality and reporting of such studies is needed.

    [Keywords: animal experiments, guidelines, meta-analysis, reporting, review, systematic review]

  540. ⁠, Daniel G. Hackam (2007-01-27):

    Poor methodological standards in animal studies mean that positive results rarely translate to the clinical domain… in a systematic review reported in this week’s BMJ, Perel and colleagues find that therapeutic efficacy in animals often does not translate to the clinical domain.2 The authors conducted meta-analyses of all available animal data for six interventions that showed definitive proof of benefit or harm in humans. For three of the interventions—corticosteroids for brain injury, antifibrinolytics in haemorrhage, and tirilazad for acute ischaemic stroke—they found major discordance between the results of the animal experiments and human trials. Equally concerning, they found consistent methodological flaws throughout the animal data, irrespective of the intervention or disease studied. For example, only eight of the 113 animal studies on thrombolysis for stroke reported a sample size calculation, a fundamental step in helping to ensure an appropriately powered precise estimate of effect. In addition, the use of randomisation, concealed allocation, and blinded outcome assessment—standards that are considered the norm when planning and reporting modern human clinical trials—were inconsistent in the animal studies.

    …What can be done to remedy this situation? Firstly, uniform reporting requirements are needed urgently and would improve the quality of animal research; as in the clinical research world, this would require cooperation between investigators, editors, and funders of basic scientific research. A more immediate solution is to promote rigorous systematic reviews of experimental treatments before clinical trials begin. Many clinical trials would probably not have gone ahead if all the data had been subjected to meta-analysis. Such reviews would also provide robust estimates of effect size and variance for adequately powering randomised trials. A third solution, which Perel and colleagues call for, is a system for registering animal experiments, analogous to that for clinical trials. This would help to reduce publication bias and provide a more informed view before proceeding to clinical trials. Until such improvements occur, it seems prudent to be critical and cautious about the applicability of animal data to the clinical domain.



  543. 2007-sena.pdf

  544. 2007-dixit.pdf



  547. 2008-whiteside.pdf




  551. ⁠, Emily S. Sena, H. Bart van der Worp, Philip M. W. Bath, David W. Howells, Malcolm R. Macleod (2010-02-18):

    Publication bias confounds attempts to use systematic reviews to assess the efficacy of various interventions tested in experiments modelling acute ischaemic stroke, leading to a 30% overstatement of efficacy of interventions tested in animals.

    The consolidation of scientific knowledge proceeds through the interpretation and then distillation of data presented in research reports, first in review articles and then in textbooks and undergraduate courses, until truths become accepted as such both amongst “experts” and in the public understanding. Where data are collected but remain unpublished, they cannot contribute to this distillation of knowledge. If these unpublished data differ substantially from published work, conclusions may not reflect adequately the underlying biological effects being described. The existence and any impact of such “publication bias” in the laboratory sciences have not been described. Using the CAMARADES (Collaborative Approach to Meta-analysis and Review of Animal Data in Experimental Studies) database we identified 16 systematic reviews of interventions tested in animal studies of acute ischaemic stroke involving 525 unique publications. Only ten publications (2%) reported no statistically-significant effects on infarct volume and only six (1.2%) did not report at least one significant finding. Egger regression and trim-and-fill analysis suggested that publication bias was highly prevalent (present in the literature for 16 and ten interventions, respectively) in animal studies modelling stroke. Trim-and-fill analysis suggested that publication bias might account for around one-third of the efficacy reported in systematic reviews, with reported efficacy falling from 31.3% to 23.8% after adjustment for publication bias. We estimate that a further 214 experiments (in addition to the 1,359 identified through rigorous systematic review; non-publication rate 14%) have been conducted but not reported. It is probable that publication bias has an important impact in other animal disease models, and more broadly in the life sciences.

    Author Summary:

    Publication bias is known to be a major problem in the reporting of clinical trials, but its impact in basic research has not previously been quantified. Here we show that publication bias is prevalent in reports of laboratory-based research in animal models of stroke, such that data from as many as one in seven experiments remain unpublished. The result of this bias is that systematic reviews of the published results of interventions in animal models of stroke overstate their efficacy by around one third. Nonpublication of data raises ethical concerns, first because the animals used have not contributed to the sum of human knowledge, and second because participants in clinical trials may be put at unnecessary risk if efficacy in animals has been overstated. It is unlikely that this publication bias in the basic sciences is restricted to the area we have studied, the preclinical modelling of the efficacy of candidate drugs for stroke. A related article in PLoS Medicine (van der Worp et al., doi:10.1371/journal.pmed.1000245) discusses the controversies and possibilities of translating the results of animal experiments into human clinical trials.
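The Egger regression step the authors report can be sketched in a few lines: regress each study's standardized effect on its precision, and test whether the intercept differs from zero. This is a minimal illustration on simulated data — the function name and the numbers are hypothetical, not the CAMARADES implementation, and trim-and-fill is omitted:

```python
import numpy as np
from scipy import stats

def egger_test(effects, ses):
    """Egger regression test for funnel-plot asymmetry: regress the
    standardized effect (effect / SE) on precision (1 / SE). An intercept
    far from zero suggests small-study effects such as publication bias."""
    effects = np.asarray(effects, float)
    ses = np.asarray(ses, float)
    z = effects / ses          # standardized effects
    x = 1.0 / ses              # precision
    n = len(z)
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    slope = np.sum((x - xbar) * (z - z.mean())) / sxx
    intercept = z.mean() - slope * xbar
    resid = z - (intercept + slope * x)
    s2 = np.sum(resid ** 2) / (n - 2)            # residual variance
    se_int = np.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    t = intercept / se_int
    p = 2 * stats.t.sf(abs(t), n - 2)            # two-sided p-value
    return intercept, p

# A biased literature: the smaller the study (larger SE), the larger
# the reported effect -- the classic asymmetric funnel.
ses = np.linspace(0.1, 1.0, 10)
effects = 0.3 + 1.0 * ses + 0.01 * (-1.0) ** np.arange(10)
intercept, p = egger_test(effects, ses)
```

With effects that grow as precision shrinks, the intercept is pushed well away from zero and the test rejects symmetry.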

  552. ⁠, H. Bart van der Worp, David W. Howells, Emily S. Sena, Michelle J. Porritt, Sarah Rewell, Victoria O'Collins, Malcolm R. Macleod ():

    H. Bart van der Worp and colleagues discuss the controversies and possibilities of translating the results of animal experiments into human clinical trials.

  553. 2010-vesterinen.pdf: ⁠, Hanna M. Vesterinen, Emily S. Sena, Charles ffrench-Constant, Anna Williams, Siddharthan Chandran, Malcolm R. Macleod (2010-08-04; statistics/meta-analysis):

    Background: In other neurological diseases, the failure to translate pre-clinical findings to effective clinical treatments has been partially attributed to bias introduced by shortcomings in the design of animal experiments.

    Objectives: Here we evaluate published studies of interventions in animal models of multiple sclerosis for methodological design and quality, and identify candidate interventions with the best evidence of efficacy.

    Methods: A systematic review of the literature describing experiments testing the effectiveness of interventions in animal models of multiple sclerosis was carried out. Data were extracted for reported study quality and design and for neurobehavioural outcome. Weighted mean difference meta-analysis was used to provide summary estimates of the efficacy for drugs where this was reported in five or more publications.

    Results: The use of a drug in a pre-clinical multiple sclerosis model was reported in 1152 publications, of which 1117 were experimental autoimmune encephalomyelitis (EAE). For 36 interventions analysed in greater detail, neurobehavioural score was improved by 39.6% (95% CI 34.9–44.2%, p < 0.001). However, few studies reported measures to reduce bias, and those reporting randomization or blinding found statistically-significantly smaller effect sizes.

    Conclusions: EAE has proven to be a valuable model in elucidating pathogenesis as well as identifying candidate therapies for multiple sclerosis. However, there is an inconsistent application of measures to limit bias that could be addressed by adopting methodological best practice in study design. Our analysis provides an estimate of sample size required for different levels of power in future studies and suggests a number of interventions for which there are substantial animal data supporting efficacy.
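The weighted mean difference meta-analysis the authors describe reduces to inverse-variance pooling of raw mean differences across studies. A minimal sketch with made-up numbers (not the authors' code; `fixed_effect_wmd` and the example data are hypothetical):

```python
import numpy as np

def fixed_effect_wmd(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Fixed-effect (inverse-variance) weighted-mean-difference pooling:
    each study's raw mean difference is weighted by the reciprocal of its
    variance, so larger and more precise studies dominate the summary."""
    md = np.asarray(mean_t, float) - np.asarray(mean_c, float)
    var = (np.asarray(sd_t, float) ** 2 / np.asarray(n_t, float)
           + np.asarray(sd_c, float) ** 2 / np.asarray(n_c, float))
    w = 1.0 / var
    pooled = np.sum(w * md) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)

# Two hypothetical studies, each reporting treated vs. control
# neurobehavioural scores (mean, SD, n per arm):
pooled, se, ci = fixed_effect_wmd([1.0, 1.0], [0.0, 0.0],
                                  [1.0, 1.0], [1.0, 1.0],
                                  [10, 10], [10, 10])
```

The same weights underlie the power calculations the authors provide for planning future studies: halving the pooled standard error requires quadrupling the total inverse-variance weight.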


  555. ⁠, Konstantinos K. Tsilidis, Orestis A. Panagiotou, Emily S. Sena, Eleni Aretouli, Evangelos Evangelou, David W. Howells, Rustam Al-Shahi Salman, Malcolm R. Macleod, John P. A. Ioannidis (2013-06-06):

    The evaluation of 160 meta-analyses of animal studies on potential treatments for neurological disorders reveals that the number of statistically-significant results was too large to be true, suggesting biases.

    Animal studies generate valuable hypotheses that lead to the conduct of preventive or therapeutic clinical trials. We assessed whether there is evidence for excess statistical-significance in results of animal studies on neurological disorders, suggesting biases. We used data from meta-analyses of interventions deposited in Collaborative Approach to Meta-Analysis and Review of Animal Data in Experimental Studies (CAMARADES). The number of observed studies with statistically-significant results (O) was compared with the expected number (E), based on the statistical power of each study under different assumptions for the plausible effect size. We assessed 4,445 datasets synthesized in 160 meta-analyses on Alzheimer disease (n = 2), experimental autoimmune encephalomyelitis (n = 34), focal ischemia (n = 16), intracerebral hemorrhage (n = 61), Parkinson disease (n = 45), and spinal cord injury (n = 2). 112 meta-analyses (70%) found nominally (p≤0.05) statistically-significant summary fixed effects. Assuming the effect size in the most precise study to be a plausible effect, 919 out of 4,445 nominally statistically-significant results were expected versus 1,719 observed (p<10−9). Excess significance was present across all neurological disorders, in all subgroups defined by methodological characteristics, and also according to alternative plausible effects. Asymmetry tests also showed evidence of small-study effects in 74 (46%) meta-analyses. Significantly effective interventions with more than 500 animals and no hints of bias were seen in eight (5%) meta-analyses. Overall, there are too many animal studies with statistically-significant results in the literature of neurological disorders. This observation suggests strong biases, with selective analysis and outcome reporting biases being plausible explanations, and provides novel evidence on how these biases might influence the whole research domain of neurological animal literature.

    Author Summary:

    Studies have shown that the results of animal biomedical experiments fail to translate into human clinical trials; this could be attributed either to real differences in the underlying biology between humans and animals, to shortcomings in the experimental design, or to bias in the reporting of results from the animal studies. We use a statistical technique to evaluate whether the number of published animal studies with “positive” (statistically-significant) results is too large to be true. We assess 4,445 animal studies for 160 candidate treatments of neurological disorders, and observe that 1,719 of them have a “positive” result, whereas only 919 studies would a priori be expected to have such a result. According to our methodology, only eight of the 160 evaluated treatments should have been subsequently tested in humans. In summary, we judge that there are too many animal studies with “positive” results in the neurological disorder literature, and we discuss the reasons and potential remedies for this phenomenon.
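The O-versus-E comparison at the heart of this method can be approximated in a few lines: sum each study's power to get the expected count of significant results, then ask how surprising the observed count is. A simplified sketch (the real test handles unequal powers via a Poisson-binomial/chi-square approach; the function names and numbers here are hypothetical):

```python
import numpy as np
from scipy import stats

def power_two_sample(effect, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal
    approximation): a study's a-priori probability of reaching
    statistical significance under an assumed plausible effect size."""
    se = np.sqrt(2.0 / n_per_group)
    z_a = stats.norm.ppf(1 - alpha / 2)
    z = effect / se
    return stats.norm.sf(z_a - z) + stats.norm.cdf(-z_a - z)

def excess_significance(observed_sig, powers):
    """Compare the observed number of 'positive' studies (O) with the
    number expected (E) from the studies' powers; a one-sided binomial
    test using the mean power flags an excess of significant results."""
    powers = np.asarray(powers, float)
    E = powers.sum()
    n = len(powers)
    p = stats.binom.sf(observed_sig - 1, n, E / n)   # P(X >= O)
    return E, p

# Hypothetical literature: 100 studies, each with 20% power, yet 40
# of them report a statistically-significant result.
E, p = excess_significance(40, [0.20] * 100)
```

At zero effect the power formula collapses to the alpha level, and observing twice the expected number of positives yields a vanishingly small p-value — the same qualitative signal as the paper's 1,719 observed versus 919 expected.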

  556. ⁠, Thomas Hartung (2013):

    Misled by animal studies and basic research? Whenever we take a closer look at the outcome of clinical trials in a field such as, most recently, stroke or septic shock, we see how limited the value of our preclinical models was. For all indications, 95% of drugs that enter clinical trials do not make it to the market, despite all the promise of the (animal) models used to develop them. Drug development has started already to decrease its reliance on animal models: In Europe, for example, despite increasing R&D expenditure, animal use by pharmaceutical companies dropped by more than 25% from 2005 to 2008. In vitro studies are likewise limited: questionable cell authenticity, over-passaging, mycoplasma infections, and lack of differentiation as well as non-homeostatic and non-physiologic culture conditions endanger the relevance of these models. The standards of statistics and reporting often are poor, further impairing reliability. Alarming studies from industry show miserable reproducibility of landmark studies. This paper discusses factors contributing to the lack of reproducibility and relevance of pre-clinical research.

    The Conclusion: Publish less but of better quality and do not rely on the face value of animal studies.

  557. ⁠, David Baker, Katie Lidster, Ana Sottomayor, Sandra Amor ():

    A study by David Baker and colleagues reveals poor quality of reporting in pre-clinical animal research and a failure of journals to implement the ARRIVE guidelines.

    There is growing concern that poor experimental design and lack of transparent reporting contribute to the frequent failure of pre-clinical animal studies to translate into treatments for human disease. In 2010, the Animal Research: Reporting of In Vivo Experiments (ARRIVE) guidelines were introduced to help improve reporting standards. They were published in PLOS Biology and endorsed by funding agencies and publishers and their journals, including PLOS, Nature research journals, and other top-tier journals. Yet our analysis of papers published in PLOS and Nature journals indicates that there has been very little improvement in reporting standards since then. This suggests that authors, referees, and editors generally are ignoring guidelines, and the editorial endorsement is yet to be effectively implemented.


  559. 2015-gaukler.pdf: ⁠, James S. Ruff, Tessa Galland, Kirstie A. Kandaris, Tristan K. Underwood, Nicole M. Liu, Elizabeth L. Young, Linda C. Morrison, Garold S. Yost, Wayne K. Potts (2015; statistics):

    Paroxetine is a selective serotonin reuptake inhibitor (SSRI) that is currently available on the market and is suspected of causing congenital malformations in babies born to mothers who take the drug during the first trimester of pregnancy.

    We utilized organismal performance assays (OPAs), a novel toxicity assessment method, to assess the safety of paroxetine during pregnancy in a rodent model. OPAs utilize genetically diverse wild mice (Mus musculus) to evaluate competitive performance between experimental and control animals as they compete amongst each other for limited resources in semi-natural enclosures. Performance measures included reproductive success, male competitive ability and survivorship.

    Paroxetine-exposed males weighed 13% less, had 44% fewer offspring, dominated 53% fewer territories and experienced a 2.5-fold increased trend in mortality, when compared with controls. Paroxetine-exposed females had 65% fewer offspring early in the study, but rebounded at later time points. In cages, paroxetine-exposed breeders took 2.3 times longer to produce their first litter and pups of both sexes experienced reduced weight when compared with controls. The low-dose paroxetine-induced health declines detected in this study were undetected in preclinical trials with doses 2.5–8 times higher than human therapeutic doses.

    These data indicate that OPAs detect phenotypic adversity and provide unique information that could be useful in safety testing during pharmaceutical development.

    [Keywords: intraspecific competition, pharmacodynamics, reproductive success, semi-natural enclosures, SSRI, toxicity assessment.]

  560. ⁠, Bonnie L. Hylander, Elizabeth A. Repasky (2016-04-22):


    Several mouse models show statistically-significant differences in experimental outcomes at standard sub-thermoneutral (ST, 22–26°C) versus thermoneutral housing temperatures (TT, 30–32°C), including models of cardiovascular disease, obesity, inflammation and atherosclerosis, graft versus host disease and cancer.

    Norepinephrine (NE) levels are higher, anti-tumor immunity is impaired, and tumor growth is statistically-significantly enhanced in mice housed at ST compared to TT. NE levels are reduced, immunosuppression is reversed and tumor growth is slowed by housing mice at TT.

    Housing temperature should be reported in every study such that potential sources of data bias or non-reproducibility can be identified.

    Our opinion is that any experiment designed to understand tumor biology and/or having an immune component could potentially have different outcomes in mice housed at ST versus TT and this should be tested.

    The ‘mild’ cold stress caused by standard sub-thermoneutral housing temperatures used for laboratory mice in research institutes is sufficient to statistically-significantly bias conclusions drawn from murine models of several human diseases. We review the data leading to this conclusion, discuss the implications for research and suggest ways to reduce problems in reproducibility and experimental transparency caused by this housing variable. We have found that these cool temperatures suppress endogenous immune responses, skewing tumor growth data and the severity of graft versus host disease, and also increase the therapeutic resistance of tumors. Owing to the potential for ambient temperature to affect energy homeostasis as well as adrenergic stress, both of which could contribute to biased outcomes in murine cancer models, housing temperature should be reported in all publications and considered as a potential source of variability in results between laboratories. Researchers and regulatory agencies should work together to determine whether changes in housing parameters would enhance the use of mouse models in cancer research, as well as for other diseases. Finally, for many years agencies such as the National Cancer Institute (NCI) have encouraged the development of newer and more sophisticated mouse models for cancer research, but we believe that, without an appreciation of how basic murine physiology is affected by ambient temperature, even data from these models is likely to be compromised.

    [Keywords: thermoneutrality, tumor microenvironment, immunosuppression, energy balance, metabolism, adrenergic stress]

  561. ⁠, Neri Kafkafi, Joseph Agassi, Elissa J. Chesler, John C. Crabbe, Wim E. Crusio, David Eilam, Robert Gerlai, Ilan Golani, Alex Gomez-Marin, Ruth Heller, Fuad Iraqi, Iman Jaljuli, Natasha A. Karp, Hugh Morgan, George Nicholson, Donald W. Pfaff, S. Helene Richter, Philip B. Stark, Oliver Stiedl, Victoria Stodden, Lisa M. Tarantino, Valter Tucci, William Valdar, Robert W. Williams, Hanno Würbel, Yoav Benjamini (2016-10-17):

    The scientific community is increasingly concerned with cases of published “discoveries” that are not replicated in further studies. The field of mouse behavioral phenotyping was one of the first to raise this concern, and to relate it to other complicated methodological issues: the complex interaction between genotype and environment; the definitions of behavioral constructs; and the use of the mouse as a model animal for human health and disease mechanisms. In January 2015, researchers from various disciplines including genetics, behavior genetics, neuroscience, ethology, statistics and bioinformatics gathered in Tel Aviv University to discuss these issues. The general consensus presented here was that the issue is prevalent and of concern, and should be addressed at the statistical, methodological and policy levels, but is not so severe as to call into question the validity and the usefulness of model organisms as a whole. Well-organized community efforts, coupled with improved data and metadata sharing, were agreed by all to have a key role to play in identifying specific problems and promoting effective solutions. As replicability is related to validity and may also affect generalizability and translation of findings, the implications of the present discussion reach far beyond the issue of replicability of mouse phenotypes but may be highly relevant throughout biomedical research.

  562. ⁠, Stanley E. Lazic, Charlie J. Clarke-Williams, Marcus R. Munafò (2017-09-02):

    Biologists establish the existence of experimental effects by applying treatments or interventions to biological entities or units, such as people, animals, slice preparations, or cells. When done appropriately, independent replication of the entity-intervention pair contributes to the sample size (N) and forms the basis of statistical inference. However, sometimes the appropriate entity-intervention pair may not be obvious, and the wrong choice can make an experiment worthless. We surveyed a random sample of published animal experiments from 2011 to 2016 where interventions were applied to parents but effects examined in the offspring, as regulatory authorities have provided clear guidelines on replication with such designs. We found that only 22% of studies (95% CI = 17% to 29%) replicated the correct entity-intervention pair and thus made valid statistical inferences. Approximately half of the studies (46%, 95% CI = 38% to 53%) had pseudoreplication while 32% (95% CI = 26% to 39%) provided insufficient information to make a judgement. Pseudoreplication artificially inflates the sample size, leading to more false positive results and inflating the apparent evidence supporting a scientific claim. It is hard for science to advance when so many experiments are poorly designed and analysed. We argue that distinguishing between biological units, experimental units, and observational units clarifies where replication should occur, describe the criteria for genuine replication, and provide guidelines for designing and analysing in vitro, ex vivo, and in vivo experiments.
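The cost of pseudoreplication can be made concrete with a small simulation (a hypothetical sketch, not the authors' survey code): under a null treatment effect applied to dams, counting pups rather than litters as the unit of analysis inflates the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def litter_experiment(n_dams=8, pups_per_litter=6,
                      litter_sd=1.0, pup_sd=1.0):
    """One null experiment: dams are the experimental units (half treated,
    half control, with NO true effect); each dam's pups share a common
    litter effect, so pups are correlated rather than independent."""
    litters = (rng.normal(0.0, litter_sd, n_dams)[:, None]
               + rng.normal(0.0, pup_sd, (n_dams, pups_per_litter)))
    treated, control = litters[:n_dams // 2], litters[n_dams // 2:]
    # Pseudoreplicated analysis: every pup counted as independent (N=24/arm)
    p_pup = stats.ttest_ind(treated.ravel(), control.ravel()).pvalue
    # Correct analysis: one summary value per experimental unit (N=4/arm)
    p_litter = stats.ttest_ind(treated.mean(axis=1),
                               control.mean(axis=1)).pvalue
    return p_pup, p_litter

pvals = np.array([litter_experiment() for _ in range(2000)])
fp_pup = float((pvals[:, 0] < 0.05).mean())     # inflated false-positive rate
fp_litter = float((pvals[:, 1] < 0.05).mean())  # close to the nominal 5%
```

With half the variance at the litter level, the pup-level test rejects the true null several times more often than it should, while the litter-level test stays near 5% — the "artificially inflated sample size" the authors describe.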

  563. ⁠, Gordon J. Lithgow, Monica Driscoll, Patrick Phillips (2017-08-22):

    About 15 years ago, one of us (G.J.L.) got an uncomfortable phone call from a colleague and collaborator. After nearly a year of frustrating experiments, this colleague was about to publish a paper chronicling his team’s inability to reproduce the results of our high-profile paper in a mainstream journal. Our study was the first to show clearly that a drug-like molecule could extend an animal’s lifespan. We had found over and over again that the treatment lengthened the life of a roundworm by as much as 67%. Numerous phone calls and e-mails failed to identify why this apparently simple experiment produced different results between the labs. Then another lab failed to replicate our study. Despite more experiments and additional publications, we couldn’t work out why the labs were getting different lifespan results. To this day, we still don’t know. A few years later, the same scenario played out with different compounds in other labs…In another, now-famous example, two cancer labs spent more than a year trying to understand inconsistencies. It took scientists working side by side on the same tumour biopsy to reveal that small differences in how they isolated cells—vigorous stirring versus prolonged gentle rocking—produced different results. Subtle tinkering has long been important in getting biology experiments to work. Before researchers purchased kits of reagents for common experiments, it wasn’t unheard of for a team to cart distilled water from one institution when it moved to another. Lab members would spend months tweaking conditions until experiments with the new institution’s water worked as well as before. Sources of variation include the quality and purity of reagents, daily fluctuations in microenvironment and the idiosyncratic techniques of investigators. With so many ways of getting it wrong, perhaps we should be surprised at how often experimental findings are reproducible.

    …Nonetheless, scores of publications continued to appear with claims about compounds that slow ageing. There was little effort at replication. In 2013, the three of us were charged with that unglamorous task…Our first task, to develop a protocol, seemed straightforward.

    But subtle disparities were endless. In one particularly painful teleconference, we spent an hour debating the proper procedure for picking up worms and placing them on new agar plates. Some batches of worms lived a full day longer with gentler technicians. Because a worm’s lifespan is only about 20 days, this is a big deal. Hundreds of e-mails and many teleconferences later, we converged on a technique but still had a stupendous three-day difference in lifespan between labs. The problem, it turned out, was notation—one lab determined age on the basis of when an egg hatched, others on when it was laid. We decided to buy shared batches of reagents from the start. Coordination was a nightmare; we arranged with suppliers to give us the same lot numbers and elected to change lots at the same time. We grew worms and their food from a common stock and had strict rules for handling. We established protocols that included precise positions of flasks in autoclave runs. We purchased worm incubators at the same time, from the same vendor. We also needed to cope with a large amount of data going from each lab to a single database. We wrote an iPad app so that measurements were entered directly into the system and not jotted on paper to be entered later. The app prompted us to include full descriptors for each plate of worms, and ensured that data and metadata for each experiment were proofread (the strain names MY16 and my16 are not the same). This simple technology removed small recording errors that could disproportionately affect statistical analyses.

    Once this system was in place, variability between labs decreased. After more than a year of pilot experiments and discussion of methods in excruciating detail, we almost completely eliminated systematic differences in worm survival across our labs (see ‘Worm wonders’)…Even in a single lab performing apparently identical experiments, we could not eliminate run-to-run differences.

    …We have found one compound that lengthens lifespan across all strains and species. Most do so in only two or three strains, and often show detrimental effects in others.

  564. ⁠, Mark Lucanic, W. Todd Plummer, Esteban Chen, Jailynn Harke, Anna C. Foulger, Brian Onken, Anna L. Coleman-Hulbert, Kathleen J. Dumas, Suzhen Guo, Erik Johnson, Dipa Bhaumik, Jian Xue, Anna B. Crist, Michael P. Presley, Girish Harinath, Christine A. Sedore, Manish Chamoli, Shaunak Kamat, Michelle K. Chen, Suzanne Angeli, Christina Chang, John H. Willis, Daniel Edgar, Mary Anne Royal, Elizabeth A. Chao, Shobhna Patel, Theo Garrett, Carolina Ibanez-Ventoso, June Hope, Jason L. Kish, Max Guo, Gordon J. Lithgow, Monica Driscoll, Patrick C. Phillips (2017-02-21):

    Limiting the debilitating consequences of ageing is a major medical challenge of our time. Robust pharmacological interventions that promote healthy ageing across diverse genetic backgrounds may engage conserved longevity pathways. Here we report results from the Caenorhabditis Intervention Testing Program in assessing longevity variation across 22 Caenorhabditis strains spanning 3 species, using multiple replicates collected across three independent laboratories. Reproducibility between test sites is high, whereas individual trial reproducibility is relatively low. Of ten pro-longevity chemicals tested, six statistically-significantly extend lifespan in at least one strain. Three reported dietary restriction mimetics are mainly effective across C. elegans strains, indicating species and strain-specific responses. In contrast, the amyloid dye ThioflavinT is both potent and robust across the strains. Our results highlight promising pharmacological leads and demonstrate the importance of assessing lifespans of discrete cohorts across repeat studies to capture biological variation in the search for reproducible ageing interventions.



  567. ⁠, Maria Moiron, Kate L. Laskowski, Petri T. Niemelä (2019-12-06):

    Research focusing on among-individual differences in behaviour (‘animal personality’) has been blooming for over a decade. Central theories explaining the maintenance of such behavioural variation posit that individuals expressing greater “risky” behaviours should suffer higher mortality. Here, for the first time, we synthesize the existing empirical evidence for this key prediction. Our results did not support this prediction, as there was no directional relationship between riskier behaviour and greater mortality; however, there was a statistically-significant absolute relationship between behaviour and survival. In total, behaviour explained a statistically-significant, but small, portion (5.8%) of the variance in survival. We also found that risky (vs. “shy”) behavioural types live statistically-significantly longer in the wild, but not in the laboratory. This suggests that individuals expressing risky behaviours might be of overall higher quality but the lack of predation pressure and resource restrictions masks this effect in laboratory environments. Our work demonstrates that individual differences in behaviour explain important differences in survival but not in the direction predicted by theory. Importantly, this suggests that models predicting behaviour to be a mediator of reproduction-survival trade-offs may need revision and/or empiricists may need to reconsider their proxies of risky behaviours when testing such theory.



  570. ⁠, Mira van der Naald, Steven Wenker, Pieter A. Doevendans, Kimberley E. Wever, Steven A. J. Chamuleau (2020-08-27):

    Objectives: The ultimate goal of biomedical research is the development of new treatment options for patients. Animal models are used if questions cannot be addressed otherwise. Currently, it is widely believed that a large fraction of performed studies are never published, but there are no data that directly address this question.

    Methods: We have tracked a selection of animal study protocols approved in the University Medical Center Utrecht in the Netherlands, to assess whether these have led to a publication with a follow-up period of 7 years.

    Results: We found that 60% of all animal study protocols led to at least one publication (full text or abstract). A total of 5590 animals were used in these studies, of which 26% was reported in the resulting publications.

    Conclusions: The data presented here underline the need for preclinical preregistration, in view of the risk of reporting and publication bias in preclinical research. We plead that all animal study protocols should be prospectively registered on an online, accessible platform to increase transparency and data sharing. To facilitate this, we have developed a platform dedicated to animal study protocol registration.

    Strengths and limitations of this study:

    • This study directly traces animal study protocols to potential publications and is the first study to assess the number of animals used and the number of animals published.
    • We had full access to all documents submitted to the animal experiment committee of the University Medical Center Utrecht from the selected protocols.
    • There is a sufficient follow-up period for researchers to publish their animal study.
    • Due to privacy reasons, we are not able to publish the exact search terms used.
    • A delay has occurred between the start of this project and the time of publishing; this is related to the political sensitivity of this subject.