The issue of a published literature not representative of the population of research is most often discussed in terms of entire studies being suppressed. However, alternative sources of publication bias are questionable research practices (QRPs) that entail post hoc alterations of hypotheses to support data or post hoc alterations of data to support hypotheses. Using general strain theory as an explanatory framework, we outline the means, motives, and opportunities for researchers to better their chances of publication independent of rigor and relevance. We then assess the frequency of QRPs in management research by tracking differences between dissertations and their resulting journal publications. Our primary finding is that from dissertation to journal article, the ratio of supported to unsupported hypotheses more than doubled (0.82 to 1.00 versus 1.94 to 1.00). The rise in predictive accuracy resulted from the dropping of statistically nonsignificant hypotheses, the addition of statistically-significant hypotheses, the reversing of predicted direction of hypotheses, and alterations to data. We conclude with recommendations to help mitigate the problem of an unrepresentative literature that we label the “Chrysalis Effect.”
Digital media defines modern childhood, but its cognitive effects are unclear and hotly debated. We estimated the impact of different types of screen time (watching, socializing, or gaming) on children’s intelligence while controlling for genetic differences in cognition and socioeconomic background. We analyzed 9855 children from the ABCD dataset with measures of intelligence at baseline (ages 9–10) and after 2 years.
At baseline, time watching and socializing were negatively correlated with intelligence, while gaming had no correlation. After 2 years, gaming positively impacted intelligence, but socializing had no effect. This is consistent with cognitive benefits documented in experimental studies on video gaming. Unexpectedly, watching videos also benefited intelligence, contrary to prior research on the effect of watching TV.
Broadly, our results are in line with research on the malleability of cognitive abilities from environmental factors, such as cognitive training and the Flynn effect.
[Keywords: children, cognitive abilities, development, digital media, genetics, intelligence, polygenic scores, screen time, socioeconomic status, TV, video games]
We estimate the distribution of television advertising elasticities and the distribution of the advertising return on investment (ROI) for a large number of products in many categories…We construct a data set by merging market (DMA) level TV advertising data with retail sales and price data at the brand level…Our identification strategy is based on the institutions of the ad buying process.
Our results reveal substantially smaller advertising elasticities compared to the results documented in the literature, as well as a sizable percentage of statistically insignificant or negative estimates. The results are robust to functional form assumptions and are not driven by insufficient statistical power or measurement error.
The ROI analysis shows negative ROIs at the margin for more than 80% of brands, implying over-investment in advertising by most firms. Further, the overall ROI of the observed advertising schedule is only positive for one third of all brands.
[Keywords: advertising, return on investment, empirical generalizations, agency issues, consumer packaged goods, media markets]
…We find that the mean and median of the distribution of estimated long-run own-advertising elasticities are 0.023 and 0.014, respectively, and 2 thirds of the elasticity estimates are not statistically different from zero. These magnitudes are considerably smaller than the results in the extant literature. The results are robust to controls for own and competitor prices and feature and display advertising, and the advertising effect distributions are similar whether a carryover parameter is assumed or estimated. The estimates are also robust if we allow for a flexible functional form for the advertising effect, and they do not appear to be driven by measurement error. As we are not able to include all sensitivity checks in the paper, we created an interactive web application that allows the reader to explore all model specifications. The web application is available.
…First, the advertising elasticity estimates in the baseline specification are small. The median elasticity is 0.0140, and the mean is 0.0233. These averages are substantially smaller than the average elasticities reported in extant meta-analyses of published case studies (Assmus, Farley, and Lehmann (1984b), Sethuraman, Tellis, and Briesch (2011)). Second, 2 thirds of the estimates are not statistically distinguishable from zero. We show in Figure 2 that the most precise estimates are those closest to the mean and the least precise estimates are in the extremes.
…6.1 Average ROI of Advertising in a Given Week:
In the first policy experiment, we measure the ROI of the observedadvertising levels (in all DMAs) in a given week t relative to not advertising in week t. For each brand, we compute the corresponding ROI for all weeks with positive advertising, and then average the ROIsacross all weeks to compute the average ROI of weekly advertising. This metric reveals if, on the margin, firms choose the (approximately) correct advertising level or could increase profits by either increasing or decreasing advertising.
We provide key summary statistics in the top panel of Table III, and we show the distribution of the predicted ROIs in Figure 3(a). The average ROI of weekly advertising is negative for most brands over the whole range of assumed manufacturer margins. At a 30% margin, the median ROI is −88.15%, and only 12% of brands have positive ROI.Further, for only 3% of brands the ROI is positive and statisticallydifferent from zero, whereas for 68% of brands the ROI is negative and statistically different from zero.
These results provide strong evidence for over-investment in advertising at the margin. [In Appendix C.3, we assess how much larger the TV advertising effects would need to be for the observed level of weekly advertising to be profitable. For the median brand with a positive estimated ad elasticity, the advertising effect would have to be 5.33× larger for the observed level of weekly advertising to yield a positive ROI (assuming a 30% margin).]
6.2 Overall ROI of the Observed Advertising Schedule: In the second policy experiment, we investigate if firms are better off when advertising at the observed levels versus not advertising at all. Hence, we calculate the ROI of the observed advertising schedule relative to a counterfactual baseline with zero advertising in all periods.
We present the results in the bottom panel of Table III and in Figure 3(b). At a 30% margin, the median ROI is −57.34%, and 34% of brands have a positive return from the observed advertising schedule versus not advertising at all. Whereas 12% of brands only have positive and 30% of brands only negative values in their confidence intervals, there is more uncertainty about the sign of the ROI for the remaining 58% of brands. This evidence leaves open the possibility that advertising may be valuable for a substantial number of brands, especially if they reduce advertising on the margin.
…Our results have important positive and normative implications. Why do firms spend billions of dollars on TV advertising each year if the return is negative? There are several possible explanations. First, agency issues, in particular career concerns, may lead managers (or consultants) to overstate the effectiveness of advertising if they expect to lose their jobs if their advertising campaigns are revealed to be unprofitable. Second, an incorrect prior (i.e., conventional wisdom that advertising is typically effective) may lead a decision maker to rationally shrink the estimated advertising effect from their data to an incorrect, inflated prior mean. These proposed explanations are not mutually exclusive. In particular, agency issues may be exacerbated if the general effectiveness of advertising or a specific advertising effect estimate is overstated. [Another explanation is that many brands have objectives for advertising other than stimulating sales. This is a nonstandard objective in economic analysis, but nonetheless, we cannot rule it out.] While we cannot conclusively point to these explanations as the source of the documented over-investment in advertising, our discussions with managers and industry insiders suggest that these may be contributing factors.
Some evidence suggests that people behave more cooperatively and generously when observed or in the presence of images of eyes (termed the ‘watching eyes’ effect). Eye images are thought to trigger feelings of observation, which in turn motivate people to behave more cooperatively to earn a good reputation. However, several recent studies have failed to find evidence of the eyes effect. One possibility is that inconsistent evidence in support of the eyes effect is a product of individual differences in sensitivity or susceptibility to the cue. In fact, some evidence suggests that people who are generally more prosocial are less susceptible to situation-specific reputation-based cues of observation. In this paper, we sought to (1) replicate the eyes effect, (2) replicate the past finding that people who are dispositionally less prosocial are more responsive to observation than people who are more dispositionally more prosocial, and (3) determine if this effect extends to the watching eyes effect. Results from a pre-registered study showed that people did not give more money in a dictator game when decisions were made public or in the presence of eye images, even though participants felt more observed when decisions were public. That is, we failed to replicate the eyes effect and observation effect. An initial, but underpowered, interaction model suggests that egoists give less than prosocials in private, but not public, conditions. This suggests a direction for future research investigating if and how individual differences in prosociality influence observation effects.
We analyze the extent to which citing practices may be driven by strategic considerations. The discontinuation of the Journal of Business (JB) in 2006 for extraneous reasons serves as an exogenous shock for analyzing strategic citing behavior. [We thank Douglas Diamond, the editor of the Journal of Business for 13 years, who told us that the main reason for the discontinuation was the difficulty in finding an editor from within Booth’s faculty.]
Using a difference-in-differences analysis, we find that articles published in JB before 2006 experienced a relative reduction in citations of approximately 20% after 2006.
Since the discontinuation of JB is unrelated to the scientific contributions of its articles, the results imply that the referencing of articles is systematically affected by strategic considerations, which hinders scientific progress.
We draw on genetics research to argue that complex psychological phenomena are most likely determined by a multitude of causes and that any individual cause is likely to have only a small effect.
Building on this, we highlight the dangers of a publication culture that continues to demand large effects. First, it rewards inflated effects that are unlikely to be real and encourages practices likely to yield such effects. Second, it overlooks the small effects that are most likely to be real, hindering attempts to identify and understand the actual determinants of complex psychological phenomena.
We then explain the theoretical and practical relevance of small effects, which can have substantial consequences, especially when considered at scale and over time. Finally, we suggest ways in which scholars can harness these insights to advance research and practices in psychology (i.e., leveraging the power of big data, machine learning, and crowdsourcing science; promoting rigorous preregistration, including prespecifying the smallest effect size of interest; contextualizing effects; changing cultural norms to reward accurate and meaningful effects rather than exaggerated and unreliable effects).
Only once small effects are accepted as the norm, rather than the exception, can a reliable and reproducible cumulative psychological science be built.
[See variance-components for one route forward in quantifying small effects given the daunting statistical power challenges. Götz et al appear locked into the conventional framework of directly estimating effects, when what they really need to borrow from genetics is looking at variance terms like heritability… You can’t afford to gather n in the millions when you aren’t even sure your haystack contains a needle!]
Data visualizations and graphs are increasingly common in both scientific and mass media settings. While graphs are useful tools for communicating patterns in data, they also have the potential to mislead viewers.
In 5 studies, we provide empirical evidence that y-axis truncation leads viewers to perceive illustrated differences as larger (i.e., a truncation effect). This effect persisted after viewers were taught about the effects of y-axis truncation and was robust across participants, with 83.5% of participants across these 5 studies showing a truncation effect. We also found that individual differences in graph literacy failed to predict the size of individuals’ truncation effects. PhD students in both quantitative fields and the humanities were susceptible to the truncation effect, but quantitative PhD students were slightly more resistant when no warning about truncated axes was provided.
We discuss the implications of these results for the underlying mechanisms and make practical recommendations for training critical consumers and creators of graphs.
[Keywords: data visualization, misleading graphs, misinformation, axis truncation, bar graphs]
News media, opinion pieces, social media, and scientific publications are full of graphs meant to communicate and persuade. Such graphs may be technically accurate in displaying correct numerical values and yet misleading because they lead people to draw inappropriate conclusions. In 5 studies, we investigate the practice of truncating the y-axis of bar graphs to start at a non-zero value. While this has been called one of “the worst of crimes in data visualization” by The Economist, it is surprisingly common in not just news and social media, but also in scientific conferences and publications. This might be because the injunction to “not truncate the axis!” may be seen as more dogmatic than data-driven.
We examine how truncated graphs consistently lead people to perceive a larger difference between 2 quantities in 5 studies, and we find that 83.5% of participants across studies show a truncation effect. In other words, 83.5% of people in our studies judged differences illustrated by truncated bar graphs as larger than differences illustrated by graphs where the y-axis starts at 0.
Surprisingly, we found that the truncation effect was very persistent. People were misled by y-axis truncation even when we thoroughly explained the technique right before they rated graphs, although this warning reduced the degree to which people were misled. People with extensive experience working with data and statistics (i.e., PhD students in quantitative fields) were also susceptible to the truncation effect. Overall, our work shows the consequences of truncating bar graphs and the extent to which interventions, such as warning people, can help but are limited in their scope
…Study 1 established a paradigm for examining the effects of y-axis truncation within bar graphs, providing a useful paradigm for studying deceptive graphs. In Study 2, we investigate whether providing an explicit explanation of y-axis truncation would reduce or eliminate the truncation effect…We expected that explicit warnings about y-axis truncation would give participants the information needed to identify truncated graphs and adjust their judgments accordingly.
…The results of Study 2 were surprising in that an explicit warning did not eliminate the truncation effect. To further investigate this, in Study 3, we directly manipulate in a single experiment whether participants are given an explanatory warning about y-axis truncation. Doing so allows us to directly compare the effects of having an explanatory warning or not on the truncation effect.
…Study 4: In Study 3, we found that providing an explanatory warning before participants rated control and truncated bar graphs reduced but did not eliminate the truncation effect. In the world, however, an explicit warning will rarely immediately precede graphs with truncated vertical axes. Here, we extend the findings of Study 3 by asking participants to provide judgments about a new set of bar graphs after a 1-day delay. The purpose in doing so was to examine whether the effects of the explicit warning on the first day will extend to the next day.
…In Study 5, we examined the size of the truncation effect in two doctoral student populations: PhD students pursuing quantitative fields versus the humanities.
We use publicly available data to show that published papers in top psychology, economics, and general interest journals that fail to replicate are cited more than those that replicate. This difference in citation does not change after the publication of the failure to replicate. Only 12% of post-replication citations of non-replicable findings acknowledge the replication failure. Existing evidence also shows that experts predict well which papers will be replicated. Given this prediction, why are non-replicable papers accepted for publication in the first place? A possible answer is that the review team faces a trade-off. When the results are more “interesting”, they apply lower standards regarding their reproducibility.
In 2004, Christakis and colleagues published findings that he and others used to argue for a link between early childhood television exposure and later attention problems, a claim that continues to be frequently promoted by the popular media. Using the same National Longitudinal Survey of Youth 1979 data set (n = 2,108), we conducted two multiverse analyses to examine whether the finding reported by Christakis and colleagues was robust to different analytic choices. We evaluated 848 models, including logistic regression models, linear regression models, and two forms of propensity-score analysis. If the claim were true, we would expect most of the justifiable analyses to produce significant results in the predicted direction. However, only 166 models (19.6%) yielded a statistically-significant relationship, and most of these employed questionable analytic choices. We concluded that these data do not provide compelling evidence of a harmful effect of TV exposure on attention.
Researchers make hundreds of decisions about data collection, preparation, and analysis in their research. We use a many-analysts approach to measure the extent and impact of these decisions.
Two published causal empirical results are replicated by 7 replicators each. We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No 2 replicators reported the same sample size. statistical-significance varied across replications, and for 1 of the studies the effect’s sign varied as well. The standard deviation of estimates across replications was 3–3× the mean reported standard error.
Science is often perceived to be a self-correcting enterprise. In principle, the assessment of scientific claims is supposed to proceed in a cumulative fashion, with the reigning theories of the day progressively approximating truth more accurately over time. In practice, however, cumulative self-correction tends to proceed less efficiently than one might naively suppose. Far from evaluating new evidence dispassionately and infallibly, individual scientists often cling stubbornly to prior findings.
Here we explore the dynamics of scientific self-correction at an individual rather than collective level. In 13 written statements, researchers from diverse branches of psychology share why and how they have lost confidence in one of their own published findings. We qualitatively characterize these disclosures and explore their implications.
A cross-disciplinary survey suggests that such loss-of-confidence sentiments are surprisingly common among members of the broader scientific population yet rarely become part of the public record. We argue that removing barriers to self-correction at the individual level is imperative if the scientific community as a whole is to achieve the ideal of efficient self-correction.
A typology of unpublished studies is presented to describe various types of unpublished studies and the reasons for their nonpublication. Reasons for nonpublication are classified by whether they stem from an awareness of the study results (result-dependent reasons) or not (result-independent reasons) and whether the reasons affect the publication decisions of individual researchers or reviewers/editors. I argue that result-independent reasons for nonpublication are less likely to introduce motivated reasoning into the publication decision process than are result-dependent reasons. I also argue that some reasons for nonpublication would produce beneficial as opposed to problematic publication bias. The typology of unpublished studies provides a descriptive scheme that can facilitate understanding of the population of study results across the field of psychology, within subdisciplines of psychology, or within specific psychology research domains. The typology also offers insight into different publication biases and research-dissemination practices and can guide individual researchers in organizing their own file drawers of unpublished studies.
One particular weakness of psychology that was left implicit by Meehl is the fact that psychological theories tend to be verbal theories, permitting at best ordinal predictions. Such predictions do not enable the high-risk tests that would strengthen our belief in the verisimilitude of theories but instead lead to the practice of null-hypothesis statistical-significance testing, a practice Meehl believed to be a major reason for the slow theoretical progress of soft psychology. The rising popularity of meta-analysis has led some to argue that we should move away from statistical-significance testing and focus on the size and stability of effects instead. Proponents of this reform assume that a greater emphasis on quantity can help psychology to develop a cumulative body of knowledge. The crucial question in this endeavor is whether the resulting numbers really have theoretical meaning. Psychological science lacks an undisputed, preexisting domain of observations analogous to the observations in the space-time continuum in physics. It is argued that, for this reason, effect sizes do not really exist independently of the adopted research design that led to their manifestation. Consequently, they can have no bearing on the verisimilitude of a theory.
Practicality was a valued attribute of academic psychological theory during its initial decades, but usefulness has since faded in importance to the field. Theories are now evaluated mainly on their ability to account for decontextualized laboratory data and not their ability to help solve societal problems. With laudable exceptions in the clinical, intergroup, and health domains, most psychological theories have little relevance to people’s everyday lives, poor accessibility to policymakers, or even applicability to the work of other academics who are better positioned to translate the theories to the practical realm. We refer to the lack of relevance, accessibility, and applicability of psychological theory to the rest of society as the practicality crisis. The practicality crisis harms the field in its ability to attract the next generation of scholars and maintain viability at the national level. We describe practical theory and illustrate its use in the field of self-regulation. Psychological theory is historically and scientifically well positioned to become useful should scholars in the field decide to value practicality. We offer a set of incentives to encourage the return of social psychology to the Lewinian vision of a useful science that speaks to pressing social issues.
[The unusually large chasm between the social sciences and its practical applications has been noted before. Focusing specifically on social psychology, Berkman and Wilson (2021) grade 360 articles published in the top cited journal of the field over a five year period on various criteria of practical import, generally finding quite low levels of “practicality” of the published research. For example, their average grade for the extent to which published papers offered actionable steps to address a specific problem was just 0.9 out of 4. They also look at the publication criterion in ten top journals; while all of them highlight the importance of original work that contributes to scientific progress, only 2 ask for even a brief statement of the public importance of the work.]
We abstract the concept of a randomized controlled trial (RCT) as a triple (β, b, s), where β is the primary efficacy parameter, b the estimate and s the standard error (s > 0). The parameter β is either a difference of means, a log odds ratio or a log hazard ratio. If we assume that b is unbiased and normally distributed, then we can estimate the full joint distribution of (β, b, s) from a sample of pairs (bi, si).
We have collected 23,747 such pairs from the Cochrane Database of Systematic Reviews to do so. Here, we report the estimated distribution of the signal-to-noise ratio β⁄s and the achieved power. We estimate the median achieved power to be 0.13. We also consider the exaggeration ratio which is the factor by which the magnitude of β is overestimated. We find that if the estimate is just statistically-significant at the 5% level, we would expect it to overestimate the true effect by a factor of 1.7.
This exaggeration is sometimes referred to as the winner’s curse and it is undoubtedly to a considerable extent responsible for disappointing replication results. For this reason, we believe it is important to shrink the unbiased estimator, and we propose a method for doing so.
2020-ebersole.pdf: “Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability”, Charles R. Ebersole, Maya B. Mathur, Erica Baranski, Diane-Jo Bart-Plange, Nicholas R. Buttrick, Christopher R. Chartier, Katherine S. Corker, Martin Corley, Joshua K. Hartshorne, Hans IJzerman, Ljiljana B. Lazarević, Hugh Rabagliati, Ivan Ropovik, Balazs Aczel, Lena F. Aeschbach, Luca Andrighetto, Jack D. Arnal, Holly Arrow, Peter Babincak, Bence E. Bakos, Gabriel Baník, Ernest Baskin, Radomir Belopavlović, Michael H. Bernstein, Michał Białek, Nicholas G. Bloxsom, Bojana Bodroža, Diane B. V. Bonfiglio, Leanne Boucher, Florian Brühlmann, Claudia C. Brumbaugh, Erica Casini, Yiling Chen, Carlo Chiorri, William J. Chopik, Oliver Christ, Antonia M. Ciunci, Heather M. Claypool, Sean Coary, Marija V. Čolić, W. Matthew Collins, Paul G. Curran, Chris R. Day, Benjamin Dering, Anna Dreber, John E. Edlund, Filipe Falcão, Anna Fedor, Lily Feinberg, Ian R. Ferguson, Máire Ford, Michael C. Frank, Emily Fryberger, Alexander Garinther, Katarzyna Gawryluk, Kayla Ashbaugh, Mauro Giacomantonio, Steffen R. Giessner, Jon E. Grahe, Rosanna E. Guadagno, Ewa Hałasa, Peter J. B. Hancock, Rias A. Hilliard, Joachim Hüffmeier, Sean Hughes, Katarzyna Idzikowska, Michael Inzlicht, Alan Jern, William Jiménez-Leal, Magnus Johannesson, Jennifer A. Joy-Gaba, Mathias Kauff, Danielle J. Kellier, Grecia Kessinger, Mallory C. Kidwell, Amanda M. Kimbrough, Josiah P. J. King, Vanessa S. Kolb, Sabina Kołodziej, Marton Kovacs, Karolina Krasuska, Sue Kraus, Lacy E. Krueger, Katarzyna Kuchno, Caio Ambrosio Lage, Eleanor V. Langford, Carmel A. Levitan, Tiago Jessé Souza de Lima, Hause Lin, Samuel Lins, Jia E. Loy, Dylan Manfredi, Łukasz Markiewicz, Madhavi Menon, Brett Mercier, Mitchell Metzger, Venus Meyet, Ailsa E. Millen, Jeremy K. Miller, Andres Montealegre, Don A. Moore, Rafał Muda, Gideon Nave, Austin Lee Nichols, Sarah A. Novak, Christian Nunnally, Ana Orlić, Anna Palinkas, Angelo Panno, Kimberly P. Parks, Ivana Pedović, Emilian Pękala, Matthew R. Penner, Sebastiaan Pessers, Boban Petrović, Thomas Pfeiffer, Damian Pieńkosz, Emanuele Preti, Danka Purić, Tiago Ramos, Jonathan Ravid, Timothy S. Razza, Katrin Rentzsch, Juliette Richetin, Sean C. Rife, Anna Dalla Rosa, Kaylis Hase Rudy, Janos Salamon, Blair Saunders, Przemysław Sawicki, Kathleen Schmidt, Kurt Schuepfer, Thomas Schultze, Stefan Schulz-Hardt, Astrid Schütz, Ani N. Shabazian, Rachel L. Shubella, Adam Siegel, Rúben Silva, Barbara Sioma, Lauren Skorb, Luana Elayne Cunha de Souza, Sara Steegen, L. A. R. Stein, R. Weylin Sternglanz, Darko Stojilović, Daniel Storage, Gavin Brent Sullivan, Barnabas Szaszi, Peter Szecsi, Orsolya Szöke, Attila Szuts, Manuela Thomae, Natasha D. Tidwell, Carly Tocco, Ann-Kathrin Torka, Francis Tuerlinckx, Wolf Vanpaemel, Leigh Ann Vaughn, Michelangelo Vianello, Domenico Viganola, Maria Vlachou, Ryan J. Walker, Sophia C. Weissgerber, Aaron L. Wichman, Bradford J. Wiggins, Daniel Wolf, Michael J. Wood, David Zealley, Iris Žeželj, Mark Zrubka, Brian A. Nosek (2020-11-13; backlinks):
Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically-significant effect (p < 0.05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = 0.002 or 0.014, depending on analytic approach). The median effect size for the revised protocols (r = 0.05) was similar to that of the RP:P protocols (r = 0.04) and the original RP:P replications (r = 0.11), and smaller than that of the original studies (r = 0.37). Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = 0.07, range = 0.00–0.15) were 78% smaller, on average, than the original effect sizes (median r = 0.37, range = 0.19–0.50).
We investigated the reproducibility of the major statistical conclusions drawn in 46 articles published in 2012 in three APA journals. After having identified 232 key statistical claims, we tried to reproduce, for each claim, the test statistic, its degrees of freedom, and the corresponding p-value, starting from the raw data that were provided by the authors and closely following the Method section in the article. Out of the 232 claims, we were able to successfully reproduce 163 (70%), 18 of which only by deviating from the article’s analytical description. Thirteen (7%) of the 185 claims deemed statistically-significant by the authors are no longer so. The reproduction successes were often the result of cumbersome and time-consuming trial-and-error work, suggesting that APA style reporting in conjunction with raw data makes numerical verification at least hard, if not impossible. This article discusses the types of mistakes we could identify and the tediousness of our reproduction efforts in the light of a newly developed taxonomy for reproducibility. We then link our findings with other findings of empirical research on this topic, give practical recommendations on how to achieve reproducibility, and discuss the challenges of large-scale reproducibility checks as well as promising ideas that could considerably increase the reproducibility of psychological research.
Our previous paper (McCabe & Snyder 2014) contained the provocative result that, despite a positive average effect, open access reduces cites to some articles, in particular those published in lower-tier journals. We propose a model in which open access leads more readers to acquire the full text, yielding more cites from some, but fewer cites from those who would have cited the article based on superficial knowledge but who refrain once they learn that the article is a bad match. We test the theory with data for over 200,000 science articles binned by cites received during a pre-study period. Consistent with the theory, the marginal effect of open access is negative for the least-cited articles, positive for the most cited, and generally monotonic for quality levels in between. Also consistent with the theory is a magnification of these effects for articles placed on PubMed Central, one of the broadest open-access platforms, and the differential pattern of results for cites from insiders versus outsiders to the article’s field.
Although there are surely multiple contributors to the replication crisis in psychology, one largely unappreciated source is a neglect of basic principles of measurement. We consider 4 sacred cows—widely shared and rarely questioned assumptions—in psychological measurement that may fuel the replicability crisis by contributing to questionable measurement practices. These 4 sacred cows are:
we can safely rely on the name of a measure to infer its content;
reliability is not a major concern for laboratory measures;
using measures that are difficult to collect obviates the need for large sample sizes; and
convergent validity data afford sufficient evidence for construct validity.
For items #1 and #4, we provide provisional data from recent psychological journals that support our assertion that such beliefs are prevalent among authors.
To enhance the replicability of psychological science, researchers will need to become vigilant against erroneous assumptions regarding both the psychometric properties of their measures and the implications of these psychometric properties for their studies.
C60 is a potent antioxidant that has been reported to substantially extend the lifespan of rodents when formulated in olive oil (C60-OO) or extra virgin olive oil (C60-EVOO). Despite there being no regulated form of C60-OO, people have begun obtaining it from online sources and dosing it to themselves or their pets, presumably with the assumption of safety and efficacy. In this study, we obtain C60-OO from a sample of online vendors, and find marked discrepancies in appearance, impurity profile, concentration, and activity relative to pristine C60-OO formulated in-house. We additionally find that pristine C60-OO causes no acute toxicity in a rodent model but does form toxic species that can cause statistically-significant morbidity and mortality in mice in under 2 weeks when exposed to light levels consistent with ambient light. Intraperitoneal injections of C60-OO did not affect the lifespan of CB6F1 female mice. Finally, we conduct a lifespan and health span study in males and females C57BL/6 J mice comparing oral treatment with pristine C60-EVOO and EVOO alone versus untreated controls. We failed to observe statistically-significant lifespan and health span benefits of C60-EVOO or EVOO supplementation compared to untreated controls, both starting the treatment in adult or old age. Our results call into question the biological benefit of C60-OO in aging.
Impact Statement: This article suggests that for direct replications in social and cognitive psychology research, small variations in design (sample settings and population) are an unlikely explanation for differences in findings of studies. Differences in findings of direct replications are particularly unlikely if the overall effect is (close to) 0, whereas these differences are more likely if the overall effect is larger.
We examined the evidence for heterogeneity (of effect sizes) when only minor changes to sample population and settings were made between studies and explored the association between heterogeneity and average effect size in a sample of 68 meta-analyses from 13 preregistered multilab direct replication projects in social and cognitive psychology. Among the many examined effects, examples include the Stroop effect, the ‘verbal overshadowing’ effect, and various priming effects such as ‘anchoring’ effects. We found limited heterogeneity; 48⁄68 (71%) meta-analyses had nonsignificant heterogeneity, and most (49⁄68; 72%) were most likely to have zero to small heterogeneity. Power to detect small heterogeneity (as defined by Higgins, Thompson, Deeks, & Altman, 2003) was low for all projects (mean 43%), but good to excellent for medium and large heterogeneity. Our findings thus show little evidence of widespread heterogeneity in direct replication studies in social and cognitive psychology, suggesting that minor changes in sample population and settings are unlikely to affect research outcomes in these fields of psychology. We also found strong correlations between observed average effect sizes (standardized mean differences and log odds ratios) and heterogeneity in our sample. Our results suggest that heterogeneity and moderation of effects is unlikely for a 0 average true effect size, but increasingly likely for larger average true effect size.
[Keywords: heterogeneity, meta-analysis, psychology, direct replication, many labs]
Statisticians have been keen to critique statistical aspects of the “replication crisis” in other scientific disciplines. But new statistical tools are often published and promoted without any thought to replicability. This needs to change, argue Anne-Laure Boulesteix, Sabine Hoffmann, Alethea Charlton and Heidi Seibold.
Evidence across social science indicates that average effects of persuasive messages are small. One commonly offered explanation for these small effects is heterogeneity: Persuasion may only work well in specific circumstances. To evaluate heterogeneity, we repeated an experiment weekly in real time using 2016 U.S. presidential election campaign advertisements. We tested 49 political advertisements in 59 unique experiments on 34,000 people. We investigate heterogeneous effects by sender (candidates or groups), receiver (subject partisanship), content (attack or promotional), and context (battleground versus non-battleground, primary versus general election, and early versus late). We find small average effects on candidate favorability and vote. These small effects, however, do not mask substantial heterogeneity even where theory from political science suggests that we should find it. During the primary and general election, in battleground states, for Democrats, Republicans, and Independents, effects are similarly small. Heterogeneity with large offsetting effects is not the source of small average effects.
Objectives: The ultimate goal of biomedical research is the development of new treatment options for patients. Animal models are used if questions cannot be addressed otherwise. Currently, it is widely believed that a large fraction of performed studies are never published, but there are no data that directly address this question.
Methods: We have tracked a selection of animal study protocols approved in the University Medical Center Utrecht in the Netherlands, to assess whether these have led to a publication with a follow-up period of 7 years.
Results: We found that 60% of all animal study protocols led to at least one publication (full text or abstract). A total of 5590 animals were used in these studies, of which 26% was reported in the resulting publications.
Conclusions: The data presented here underline the need for preclinical preregistration, in view of the risk of reporting and publication bias in preclinical research. We plea that all animal study protocols should be prospectively registered on an online, accessible platform to increase transparency and data sharing. To facilitate this, we have developed a platform dedicated to animal study protocol registration: www.preclinicaltrials.eu.
Strengths and limitations of this study:
This study directly traces animal study protocols to potential publications and is the first study to assess the number of animals used and the number of animals published.
We had full access to all documents submitted to the animal experiment committee of the University Medical Center Utrecht from the selected protocols.
There is a sufficient follow-up period for researchers to publish their animal study.
Due to privacy reasons, we are not able to publish the exact search terms used.
A delay has occurred between the start of this project and time of publishing, this is related to the political sensitivity of this subject.
“Towards Reproducible Brain-Wide Association Studies”, Scott Marek, Brenden Tervo-Clemmens, Finnegan J. Calabro, David F. Montez, Benjamin P. Kay, Alexander S. Hatoum, Meghan Rose Donohue, William Foran, Ryl, L. Miller, Eric Feczko, Oscar Miranda-Dominguez, Alice M. Graham, Eric A. Earl, Anders J. Perrone, Michaela Cordova, Olivia Doyle, Lucille A. Moore, Greg Conan, Johnny Uriarte, Kathy Snider, Angela Tam, Jianzhong Chen, Dillan J. Newbold, Annie Zheng, Nicole A. Seider, Andrew N. Van, Timothy O. Laumann, Wesley K. Thompson, Deanna J. Greene, Steven E. Petersen, Thomas E. Nichols, B. T. Thomas Yeo, Deanna M. Barch, Hugh Garavan, Beatriz Luna, Damien A. Fair, Nico U. F. Dosenbach (2020-08-22; backlinks):
Magnetic resonance imaging (MRI) continues to drive many important neuroscientific advances. However, progress in uncovering reproducible associations between individual differences in brain structure/function and behavioral phenotypes (e.g., cognition, mental health) may have been undermined by typical neuroimaging sample sizes (median n = 25)1,2. Leveraging the Adolescent Brain Cognitive Development (ABCD) Study3 (n = 11,878), we estimated the effect sizes and reproducibility of these brain-wide associations studies (BWAS) as a function of sample size. The very largest, replicable brain-wide associations for univariate and multivariate methods werer = 0.14 andr = 0.34, respectively. In smaller samples, typical for brain-wide association studies (BWAS), irreproducible, inflated effect sizes were ubiquitous, no matter the method (univariate, multivariate). Until sample sizes started to approach consortium-levels, BWAS were underpowered and statistical errors assured. Multiple factors contribute to replication failures4–6; here, we show that the pairing of small brain-behavioral phenotype effect sizes with sampling variability is a key element in wide-spread BWAS replication failure. Brain-behavioral phenotype associations stabilize and become more reproducible with sample sizes of N⪆2,000. While investigator-initiated brain-behavior research continues to generate hypotheses and propel innovation, large consortia are needed to usher in a new era of reproducible human brain-wide association studies.
Large-scale collaborative projects recently demonstrated that several key findings from the social-science literature could not be replicated successfully. Here, we assess the extent to which a finding’s replication success relates to its intuitive plausibility. Each of 27 high-profile social-science findings was evaluated by 233 people without a Ph.D. in psychology. Results showed that these laypeople predicted replication success with above-chance accuracy (ie., 59%). In addition, when participants were informed about the strength of evidence from the original studies, this boosted their prediction performance to 67%. We discuss the prediction patterns and apply signal detection theory to disentangle detection ability from response bias. Our study suggests that laypeople’s predictions contain useful information for assessing the probability that a given finding will be replicated successfully.
[Keywords: open science, meta-science, replication crisis, prediction survey, open data, open materials, preregistered]
Nudge interventions have quickly expanded from academic studies to larger implementation in so-called Nudge Units in governments. This provides an opportunity to compare interventions in research studies, versus at scale. We assemble a unique data set of 126 RCTs covering over 23 million individuals, including all trials run by 2 of the largest Nudge Units in the United States. We compare these trials to a sample of nudge trials published in academic journals from 2 recent meta-analyses.
In papers published in academic journals, the average impact of a nudge is very large—an 8.7 percentage point take-up effect, a 33.5% increase over the average control. In the Nudge Unit trials, the average impact is still sizable and highly statistically-significant, but smaller at 1.4 percentage points, an 8.1% increase [8.7 / 1.4 = 6.2×].
We consider 5 potential channels for this gap: statistical power, selective publication, academic involvement, differences in trial features and in nudge features. Publication bias in the academic journals, exacerbated by low statistical power, can account for the full difference in effect sizes. Academic involvement does not account for the difference. Different features of the nudges, such as in-person versus letter-based communication, likely reflecting institutional constraints, can partially explain the different effect sizes.
We conjecture that larger sample sizes and institutional constraints, which play an important role in our setting, are relevant in other at-scale implementations. Finally, we compare these results to the predictions of academics and practitioners. Most forecasters overestimate the impact for the Nudge Unit interventions, though nudge practitioners are almost perfectly calibrated.
…In this paper, we present the results of a unique collaboration with 2 of the major “Nudge Units”: BIT North America operating at thelevel of US cities and SBST/OES for the US Federal government. These 2 units kept a comprehensive record of all trials that they ran from inception in 2015 to July 2019, for a total of 165 trials testing 349 nudge treatments and a sample size of over 37 million participants. In a remarkable case of administrative transparency, each trial had atrial report, including in many cases a pre-analysis plan. The 2 units worked with us to retrieve the results of all the trials. Importantly, over 90% of these trials have not been documented in working paper or academic publication format. [emphasis added]
…Since we are interested in comparing the Nudge Unit trials to nudge papers in the literature, we aim to find broadly comparable studies in academic journals, without hand-picking individual papers. We lean on 2 recent meta-analyses summarizing over 100 RCTs across many different applications (Benartzi et al 2017, and Hummel & Maedche 2019). We apply similar restrictions as we did in the Nudge Unit sample, excluding lab or hypothetical experiments and non-RCTs, treatments with financial incentives, requiring treatments with binary dependent variables, and excluding default effects. This leaves a final sample of 26 RCTs, including 74 nudge treatments with 505,337 participants. Before we turn to the results, we stress that the features of behavioral interventions in academic journals do not perfectly match with the nudge treatments implemented by the Nudge Units, a difference to which we indeed return below. At the same time, overall interventions conducted by Nudge Units are fairly representative of the type of nudge treatments that are run by researchers.
What do we find? In the sample of 26 papers in the Academic Journals sample, we compute the average (unweighted) impact of a nudge across the 74 nudge interventions. We find that on average a nudge intervention increases the take up by 8.7 (s.e. = 2.5) percentage points, out of an average control take up of 26.0 percentage points.
Turning to the 126 trials by Nudge Units, we estimate an unweighted impact of 1.4 percentage points (s.e. = 0.3), out of an average control take up of 17.4 percentage points. While this impact is highly statistically-significantly different from 0 and sizable, it is about 1⁄6th the size of the estimated nudge impact in academic papers. What explains this large difference in the impact of nudges?
We discuss 3 features of the 2 samples which could account for this difference. First, we document a large difference in the sample size and thus statistical power of the interventions. The median nudge intervention in the Academic Journals sample has treatment arm sample size of 484 participants and a minimum detectable effect size (MDE, the effect size that can be detected with 80% power) of 6.3 percentage points. In contrast, the nudge interventions in the Nudge Units have a median treatment arm sample size of 10,006 participants and MDE of 0.8 percentage points. Thus, the statistical power for the trials in the Academic Journals sample is nearly an order of magnitude smaller. This illustrates a key feature of the “at scale” implementation: the implementation in an administrative setting allows for a larger sample size. Importantly, the smaller sample size for the Academic Journals papers could lead not just to noisier estimates, but also to upward-biased point estimates in the presence of publication bias.
A second difference, directly zooming into publication bias, is the evidence of selective publication of studies with statistically-significant results (t > 1.96), versus studies that are not statistically-significant (t < 1.96). In the sample of Academic Journals nudges, there are over 4× as many studies with a t-statistic for the most statistically-significant nudge between 1.96 and 2.96, versus the number of studies with the most statistically-significant nudge with at between 0.96 and 1.96. Interestingly, the publication bias appears to operate at the level of the most statistically-significant treatment arm within a paper. By comparison, we find no evidence of a discontinuity in the distribution of t-statistics for the Nudge Unit sample, consistent with the fact that the Nudge Unit registry contains the comprehensive sample of all studies run. We stress here that with “publication bias” we include not just whether a journal would publish a paper, but also whether a researcher would write up a study (the “file drawer” problem). In the Nudge Units sample, all these selective steps are removed, as we access all studies that were run.
Large amounts of resources are spent annually to improve educational achievement and to close the gender gap in sciences with typically very modest effects. In 2010, a 15-min self-affirmation intervention showed a dramatic reduction in this gender gap. We reanalyzed the original data and found several critical problems. First, the self-affirmation hypothesis stated that women’s performance would improve. However, the data showed no improvement for women. There was an interaction effect between self-affirmation and gender caused by a negative effect on men’s performance. Second, the findings were based on covariate-adjusted interaction effects, which imply that self-affirmation reduced the gender gap only for the small sample of men and women who did not differ in the covariates. Third, specification-curve analyses with more than 1,500 possible specifications showed that less than one quarter yielded statistically-significant interaction effects and less than 3% showed significant improvements among women.
When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying the approach to 19 studies on shooting decisions to demonstrate the usefulness of this approach and conclude with a further discussion of the limitations and applications of this method.
David Rosenhan’s pseudopatient study is one of the most famous studies in psychology, but it is also one of the most criticized studies in psychology. Almost 50 years after its publication, it is still discussed in psychology textbooks, but the extensive body of criticism is not, likely leading teachers not to present the study as the contentious classic that it is. New revelations by Susannah Cahalan (2019), based on her years of investigation of the study and her analysis of the study’s archival materials, question the validity and veracity of both Rosenhan’s study and his reporting of it as well as Rosenhan’s scientific integrity. Because many (if not most) teachers are likely not aware of Cahalan’s findings, we provide a summary of her main findings so that if they still opt to cover Rosenhan’s study, they can do so more accurately. Because these findings are related to scientific integrity, we think that they are best discussed in the context of research ethics and methods. To aid teachers in this task, we provide some suggestions for such discussions.
[ToC: Rosenhan’s Misrepresentations of the Pseudopatient Script and His Medical Record · Selective Reporting of Data · Rosenhan’s Failure to Prepare and Protect Other Pseudopatients · Reporting Questionable Data and Possibly Pseudo-Pseudopatients · Concluding Remarks · Footnotes · References]
There are few examples of an extended adversarial collaboration, in which investigators committed to different theoretical views collaborate to test opposing predictions. Whereas previous adversarial collaborations have produced single research articles, here, we share our experience in programmatic, extended adversarial collaboration involving three laboratories in different countries with different theoretical views regarding working memory, the limited information retained in mind, serving ongoing thought and action. We have focused on short-term memory retention of items (letters) during a distracting task (arithmetic), and effects of aging on these tasks. Over several years, we have conducted and published joint research with preregistered predictions, methods, and analysis plans, with replication of each study across two laboratories concurrently. We argue that, although an adversarial collaboration will not usually induce senior researchers to abandon favored theoretical views and adopt opposing views, it will necessitate varieties of their views that are more similar to one another, in that they must account for a growing, common corpus of evidence. This approach promotes understanding of others’ views and presents to the field research findings accepted as valid by researchers with opposing interpretations. We illustrate this process with our own research experiences and make recommendations applicable to diverse scientific areas.
Identifying brain biomarkers of disease risk is a growing priority in neuroscience. The ability to identify meaningful biomarkers is limited by measurement reliability; unreliable measures are unsuitable for predicting clinical outcomes. Measuring brain activity using task functional MRI (fMRI) is a major focus of biomarker development;however, the reliability of task fMRI has not been systematically evaluated. We present converging evidence demonstrating poor reliability of task-fMRI measures. First, ameta-analysis of 90 experiments (n = 1,008) revealed poor overall reliability—mean intraclass correlation coefficient (ICC) = 0.397. Second, the test-retest reliabilities of activity in a priori regions of interest across 11 common fMRI tasks collected by the Human Connectome Project (n = 45) and the Dunedin Study (n = 20) were poor (ICCs = 0.067–.485). Collectively, these findings demonstrate that common task-fMRI measures are not currently suitable for brain biomarker discovery or for individual-differences research. We review how this state of affairs came to be and highlight avenues for improving task-fMRI reliability.
Context-dependent biological variation presents a unique challenge to the reproducibility of results in experimental animal research, because organisms’ responses to experimental treatments can vary with both genotype and environmental conditions. In March 2019, experts in animal biology, experimental design and statistics convened in Blonay, Switzerland, to discuss strategies addressing this challenge.
In contrast to the current gold standard of rigorous standardization in experimental animal research, we recommend the use of systematic heterogenization of study samples and conditions by actively incorporating biological variation into study design through diversifying study samples and conditions.
Here we provide the scientific rationale for this approach in the hope that researchers, regulators, funders and editors can embrace this paradigm shift. We also present a road map towards better practices in view of improving the reproducibility of animal research.
2020-botviniknezer.pdf: “Variability in the analysis of a single neuroimaging dataset by many teams”, Rotem Botvinik-Nezer, Felix Holzmeister, Colin F. Camerer, Anna Dreber, Juergen Huber, Magnus Johannesson, Michael Kirchler, Roni Iwanir, Jeanette A. Mumford, R. Alison Adcock, Paolo Avesani, Blazej M. Baczkowski, Aahana Bajracharya, Leah Bakst, Sheryl Ball, Marco Barilari, Nadège Bault, Derek Beaton, Julia Beitner, Roland G. Benoit, Ruud M. W. J. Berkers, Jamil P. Bhanji, Bharat B. Biswal, Sebastian Bobadilla-Suarez, Tiago Bortolini, Katherine L. Bottenhorn, Alexander Bowring, Senne Braem, Hayley R. Brooks, Emily G. Brudner, Cristian B. Calderon, Julia A. Camilleri, Jaime J. Castrellon, Luca Cecchetti, Edna C. Cieslik, Zachary J. Cole, Olivier Collignon, Robert W. Cox, William A. Cunningham, Stefan Czoschke, Kamalaker Dadi, Charles P. Davis, Alberto De Luca, Mauricio R. Delgado, Lysia Demetriou, Jeffrey B. Dennison, Xin Di, Erin W. Dickie, Ekaterina Dobryakova, Claire L. Donnat, Juergen Dukart, Niall W. Duncan, Joke Durnez, Amr Eed, Simon B. Eickhoff, Andrew Erhart, Laura Fontanesi, G. Matthew Fricke, Shiguang Fu, Adriana Galván, Remi Gau, Sarah Genon, Tristan Glatard, Enrico Glerean, Jelle J. Goeman, Sergej A. E. Golowin, Carlos González-García, Krzysztof J. Gorgolewski, Cheryl L. Grady, Mikella A. Green, João F. Guassi Moreira, Olivia Guest, Shabnam Hakimi, J. Paul Hamilton, Roeland Hancock, Giacomo Handjaras, Bronson B. Harry, Colin Hawco, Peer Herholz, Gabrielle Herman, Stephan Heunis, Felix Hoffstaedter, Jeremy Hogeveen, Susan Holmes, Chuan-Peng Hu, Scott A. Huettel, Matthew E. Hughes, Vittorio Iacovella, Alexandru D. Iordan, Peder M. Isager, Ayse I. Isik, Andrew Jahn, Matthew R. Johnson, Tom Johnstone, Michael J. E. Joseph, Anthony C. Juliano, Joseph W. Kable, Michalis Kassinopoulos, Cemal Koba, Xiang-Zhen Kong, Timothy R. Koscik, Nuri Erkut Kucukboyaci, Brice A. Kuhl, Sebastian Kupek, Angela R. Laird, Claus Lamm, Robert Langner, Nina Lauharatanahirun, Hongmi Lee, Sangil Lee, Alexander Leemans, Andrea Leo, Elise Lesage, Flora Li, Monica Y. C. Li, Phui Cheng Lim, Evan N. Lintz, Schuyler W. Liphardt, Annabel B. Losecaat Vermeer, Bradley C. Love, Michael L. Mack, Norberto Malpica, Theo Marins, Camille Maumet, Kelsey McDonald, Joseph T. McGuire, Helena Melero, Adriana S. Méndez Leal, Benjamin Meyer, Kristin N. Meyer, Glad Mihai, Georgios D. Mitsis, Jorge Moll, Dylan M. Nielson, Gustav Nilsonne, Michael P. Notter, Emanuele Olivetti, Adrian I. Onicas, Paolo Papale, Kaustubh R. Patil, Jonathan E. Peelle, Alexandre Pérez, Doris Pischedda, Jean-Baptiste Poline, Yanina Prystauka, Shruti Ray, Patricia A. Reuter-Lorenz, Richard C. Reynolds, Emiliano Ricciardi, Jenny R. Rieck, Anais M. Rodriguez-Thompson, Anthony Romyn, Taylor Salo, Gregory R. Samanez-Larkin, Emilio Sanz-Morales, Margaret L. Schlichting, Douglas H. Schultz, Qiang Shen, Margaret A. Sheridan, Jennifer A. Silvers, Kenny Skagerlund, Alec Smith, David V. Smith, Peter Sokol-Hessner, Simon R. Steinkamp, Sarah M. Tashjian, Bertrand Thirion, John N. Thorp, Gustav Tinghög, Loreen Tisdall, Steven H. Tompson, Claudio Toro-Serey, Juan Jesus Torre Tresols, Leonardo Tozzi, Vuong Truong, Luca Turella, Anna E. van ’t Veer, Tom Verguts, Jean M. Vettel, Sagana Vijayarajah, Khoi Vo, Matthew B. Wall, Wouter D. Weeda, Susanne Weis, David J. White, David Wisniewski, Alba Xifra-Porxas, Emily A. Yearling, Sangsuk Yoon, Rui Yuan, Kenneth S. L. Yuen, Lei Zhang, Xu Zhang, Joshua E. Zosky, Thomas E. Nichols, Russell A. Poldrack, Tom Schonberg (2020-05-20; backlinks):
Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses1. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a statistically-significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of statistically-significant findings, even by researchers with direct knowledge of the dataset2,3,4,5. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.
Whether acquiring a second language affords any general advantages to executive function has been a matter of fierce scientific debate for decades. If being bilingual does have benefits over and above the broader social, employment, and lifestyle gains that are available to speakers of a second language, then it should manifest as a cognitive advantage in the general population of bilinguals. We assessed 11,041 participants on a broad battery of 12 executive tasks whose functional and neural properties have been well described. Bilinguals showed an advantage over monolinguals on only one test (whereas monolinguals performed better on four tests), and these effects all disappeared when the groups were matched to remove potentially confounding factors. In any case, the size of the positive bilingual effect in the unmatched groups was so small that it would likely have a negligible impact on the cognitive performance of any individual.
Members of the field of philosophy have, just as other people, political convictions or, as psychologists call them, ideologies. How are different ideologies distributed and perceived in the field? Using the familiar distinction between the political left and right, we surveyed an international sample of 794 subjects in philosophy. We found that survey participants clearly leaned left (75%), while right-leaning individuals (14%) and moderates (11%) were underrepresented. Moreover, and strikingly, across the political spectrum from very left-leaning individuals and moderates to very right-leaning individuals, participants reported experiencing ideological hostility in the field, occasionally even from those on their own side of the political spectrum. Finally, while about half of the subjects believed that discrimination against left-leaning or right-leaning individuals in the field is not justified, a substantial minority displayed an explicit willingness to discriminate against colleagues with the opposite ideology. Our findings are both surprising and important because a commitment to tolerance and equality is widespread in philosophy, and there is reason to think that ideological similarity, hostility, and discrimination undermine reliable belief formation in many areas of the discipline.
The replication crisis has seen increased focus on best practice techniques to improve the reliability of scientific findings. What remains elusive to many researchers and is frequently misunderstood is that predictions involving interactions dramatically affect the calculation of statistical power. Using recent papers published in Personality and Social Psychology Bulletin (PSPB), we illustrate the pitfalls of improper power estimations in studies where attenuated interactions are predicted. Our investigation shows why even a programmatic series of 6 studies employing 2×2 designs, with samples exceeding n = 500, can be woefully underpowered to detect genuine effects. We also highlight the importance of accounting for error-prone measures when estimating effect sizes and calculating power, explaining why even positive results can mislead when power is low. We then provide five guidelines for researchers to avoid these pitfalls, including cautioning against the heuristic that a series of underpowered studies approximates the credibility of one well-powered study.
In 2016, the US Defense Advanced Research Projects Agency (DARPA) told eight research groups that their proposals had made it through the review gauntlet and would soon get a few million dollars from its Biological Technologies Office (BTO). Along with congratulations, the teams received a reminder that their award came with an unusual requirement—an independent shadow team of scientists tasked with reproducing their results. Thus began an intense, multi-year controlled trial in reproducibility. Each shadow team consists of three to five researchers, who visit the ‘performer’ team’s laboratory and often host visits themselves. Between 3% and 8% of the programme’s total funds go to this independent validation and verification (IV&V) work…Awardees were told from the outset that they would be paired with an IV&V team consisting of unbiased, third-party scientists hired by and accountable to DARPA. In this programme, we relied on US Department of Defense laboratories, with specific teams selected for their technical competence and ability to solve problems creatively.
…Results so far show a high degree of experimental reproducibility. The technologies investigated include using chemical triggers to control how cells migrate1; introducing synthetic circuits that control other cell functions2; intricate protein switches that can be programmed to respond to various cellular conditions3; and timed bacterial expression that works even in the variable environment of the mammalian gut4…getting to this point was more difficult than we expected. It demanded intense coordination, communication and attention to detail…Our effort needed capable research groups that could dedicate much more time (in one case, 20 months) and that could flexibly follow evolving research…A key component of the IV&V teams’ effort has been to spend a day or more working with the performer teams in their laboratories. Often, members of a performer laboratory travel to the IV&V laboratory as well. These interactions lead to a better grasp of methodology than reading a paper, frequently revealing person-to-person differences that can affect results…Still, our IV&V efforts have been derailed for weeks at a time for trivial reasons (see ‘Hard lessons’), such as a typo that meant an ingredient in cell media was off by an order of magnitude. We lost more than a year after discovering that commonly used biochemicals that were thought to be interchangeable are not.
Document Reagents:…We lost weeks of work and performed useless experiments when we assumed that identically named reagents (for example, polyethylene glycol or fetal bovine serum) from different vendors could be used interchangeably. · See It Live:…In our hands, washing cells too vigorously or using the wrong-size pipette tip changed results unpredictably. · State a range: …Knowing whether 21 ° C means 20.5–21.5 ° C or 20–22 ° C can tell you whether cells will thrive or wither, and whether you’ll need to buy an incubator to make an experiment work. · Test, then ship: …Incorrect, outdated or otherwise diminished products were sent to the IV&V team for verification many times. · Double check: …A typo in one protocol cost us four weeks of failed experiments, and in general, vague descriptions of formulation protocols (for example, for expressing genes and making proteins without cells) caused months of delay and cost thousands of dollars in wasted reagents. · Pick a person: …The projects that lacked a dedicated and stable point of contact were the same ones that took the longest to reproduce. That is not coincidence. · Keep in silico analysis up to date: …Teams had to visit each others’ labs more than once to understand and fully implement computational-analysis pipelines for large microscopy data sets.
…We have learnt to note the flow rates used when washing cells from culture dishes, to optimize salt concentration in each batch of medium and to describe temperature and other conditions with a range rather than a single number. This last practice came about after we realized that diminished slime-mould viability in our Washington DC facility was due to lab temperatures that could fluctuate by 2 °C on warm summer days, versus the more tightly controlled temperature of the performer lab in Baltimore 63 kilometres away. Such observations can be written up in a protocol paper…As one of our scientists said, “IV&V forces performers to think more critically about what qualifies as a successful system, and facilitates candid discussion about system performance and limitations.”
Foreign language learning in older age has been proposed as a promising avenue for combatting age-related cognitive decline. We tested this hypothesis in a randomized controlled study in a sample of 160 healthy older participants (aged 65–75 years) who were randomized to 11 weeks of either language learning or relaxation training. Participants in the language learning condition obtained some basic knowledge in the new language (Italian), but between-groups differences in improvements on latent factors of verbal intelligence, spatial intelligence, working memory, item memory, or associative memory were negligible. We argue that this is not due to either poor measurement, low course intensity, or low statistical power, but that basic studies in foreign languages in older age are likely to have no or trivially small effects on cognitive abilities. We place this in the context of the cognitive training and engagement literature and conclude that while foreign language learning may expand the behavioral repertoire, it does little to improve cognitive processing abilities.
As her work proceeded, her doubts about Rosenhan’s work grew. At one point, I suggested that she write to Science and request copies of the peer review of the paper. What had the reviewers seen and requested? Did they know the identities of the anonymous pseudo-patients and the institutions to which they had been consigned? What checks had they made on the validity of Rosenhan’s claims? Had they, for example, asked to see the raw field notes? The editorial office told her that the peer review was confidential and they couldn’t share it. I wondered whether an approach from an academic rather than a journalist might be more successful, and with Cahalan’s permission, I sought the records myself, pointing out the important issues at stake, and noting that it would be perfectly acceptable for the names of the expert referees to be redacted. This time the excuse was different: the journal had moved offices, and the peer reviews no longer existed. That’s plausible, but it is distinctly odd that such different explanations should be offered.
…Of course, proving a negative, especially after decades have passed, is nigh on impossible. Perhaps the appearance of The Great Pretender will cause one or more of the missing pseudo-patients to surface, or for their descendants to speak up and reveal their identities, for surely anyone who participated in such a famous study could not fail to mention it to someone. More likely, I think, is that these people are fictitious, invented by someone who Cahalan’s researches suggest was fully capable of such deception. (Indeed, the distinguished psychologist Eleanor Maccoby, who was in charge of assessing Rosenhan’s tenure file, reported that she and others were deeply suspicious of him, and that they found it ‘impossible to know what he had really done, or if he had done it’, granting him tenure only because of his popularity as a teacher.)
…Most damning of all, though, are Rosenhan’s own medical records. When he was admitted to the hospital, it was not because he simply claimed to be hearing voices but was otherwise ‘normal’. On the contrary, he told his psychiatrist his auditory hallucinations included the interception of radio signals and listening in to other people’s thoughts. He had tried to keep these out by putting copper over his ears, and sought admission to the hospital because it was ‘better insulated there’. For months, he reported he had been unable to work or sleep, financial difficulties had mounted and he had contemplated suicide. His speech was retarded, he grimaced and twitched, and told several staff that the world would be better off without him. No wonder he was admitted.
Perhaps out of sympathy for Rosenhan’s son and his closest friends, who had granted access to all this damning material and with whom she became close, I think Cahalan pulls her punches a bit when she brings her book to a conclusion. But the evidence she provides makes an overwhelming case: Rosenhan pulled off one of the greatest scientific frauds of the past 75 years, and it was a fraud whose real-world consequences still resonate today. Exposing what he got up to is a quite exceptional accomplishment, and Cahalan recounts the story vividly and with great skill.
Background: Failure to report the results of a clinical trial can distort the evidence base for clinical practice, breaches researchers’ ethical obligations to participants, and represents an important source of research waste. The Food and Drug Administration Amendments Act (FDAAA) of 2007 now requires sponsors of applicable trials to report their results directly onto ClinicalTrials.gov within 1 year of completion. The first trials covered by the Final Rule of this act became due to report results in January, 2018. In this cohort study, we set out to assess compliance.
Methods: We downloaded data for all registered trials on ClinicalTrials.gov each month from March, 2018, to September, 2019. All cross-sectional analyses in this manuscript were performed on data extracted from ClinicalTrials.gov on Sept 16, 2019; monthly trends analysis used archived data closest to the 15th day of each month from March, 2018, to September, 2019. Our study cohort included all applicable trials due to report results under FDAAA. We excluded all non-applicable trials, those not yet due to report, and those given a certificate allowing for delayed reporting. A trial was considered reported if results had been submitted and were either publicly available, or undergoing quality control review at ClinicalTrials.gov. A trial was considered compliant if these results were submitted within 1 year of the primary completion date, as required by the legislation. We described compliance with the FDAAA 2007 Final Rule, assessed trial characteristics associated with results reporting using logistic regression models, described sponsor-level reporting, examined trends in reporting, and described time-to-report using the Kaplan-Meier method.
Findings: 4209 trials were due to report results; 1722 (40·9%; 95% CI 39·4–42·2) did so within the 1-year deadline. 2686 (63·8%; 62·4–65·3) trials had results submitted at any time. Compliance has not improved since July, 2018. Industry sponsors were statistically-significantly more likely to be compliant than non-industry, non-US Government sponsors (odds ratio [OR] 3·08 [95% CI 2·52–3·77]), and sponsors running large numbers of trials were statistically-significantly more likely to be compliant than smaller sponsors (OR 11·84 [9·36–14·99]). The median delay from primary completion date to submission date was 424 days (95% CI 412–435), 59 days higher than the legal reporting requirement of 1 year.
Interpretation: Compliance with the FDAAA 2007 is poor, and not improving. To our knowledge, this is the first study to fully assess compliance with the Final Rule of the FDAAA 2007. Poor compliance is likely to reflect lack of enforcement by regulators. Effective enforcement and action from sponsors is needed; until then, open public audit of compliance for each individual sponsor may help. We will maintain updated compliance data for each individual sponsor and trial at fdaaa.trialstracker.net.
Music training has repeatedly been claimed to positively impact on children’s cognitive skills and academic achievement. This claim relies on the assumption that engaging in intellectually demanding activities fosters particular domain-general cognitive skills, or even general intelligence. The present meta-analytic review (N = 6,984, k = 254, m = 54) shows that this belief is incorrect. Once the study quality design is controlled for, the overall effect of music training programs is null (g ≈ 0) and highly consistent across studies (τ2 ≈ 0). Small statistically-significant overall effects are obtained only in those studies implementing no random allocation of participants and employing non-active controls (g ≈ 0.200, p < 0.001). Interestingly, music training is ineffective regardless of the type of outcome measure (eg., verbal, non-verbal, speed-related, etc.). Furthermore, we note that, beyond meta-analysis of experimental studies, a considerable amount of cross-sectional evidence indicates that engagement in music has no impact on people’s non-music cognitive skills or academic achievement. We conclude that researchers’ optimism about the benefits of music training is empirically unjustified and stem from misinterpretation of the empirical data and, possibly, confirmation bias. Given the clarity of the results, the large number of participants involved, and the numerous studies carried out so far, we conclude that this line of research should be dismissed.
Ideological bias is a worsening but often neglected concern for social and psychological sciences, affecting a range of professional activities and relationships, from self-reported willingness to discriminate to the promotion of ideologically saturated and scientifically questionable research constructs. Though clinical psychologists co-produce and apply social psychological research, little is known about its impact on the profession of clinical psychology.
Following a brief review of relevant topics, such as “concept creep” and the importance of the psychotherapeutic relationship, the relevance of ideological bias to clinical psychology, counterarguments and a rebuttal, clinical applications, and potential solutions are presented. For providing empathic and multiculturally competent clinical services, in accordance with professional ethics, psychologists would benefit from treating ideological diversity as another professionally recognized diversity area.
The rule took full effect 2 years ago, on 2018-01-18, giving trial sponsors ample time to comply. But a Science investigation shows that many still ignore the requirement, while federal officials do little or nothing to enforce the law.
Science examined more than 4700 trials whose results should have been posted on the NIH website ClinicalTrials.gov under the 2017 rule. Reporting rates by most large pharmaceutical companies and some universities have improved sharply, but performance by many other trial sponsors—including, ironically, NIH itself—was lackluster. Those sponsors, typically either the institution conducting a trial or its funder, must deposit results and other data within 1 year of completing a trial. But of 184 sponsor organizations with at least five trials due as of 2019-09-25, 30 companies, universities, or medical centers never met a single deadline. As of that date, those habitual violators had failed to report any results for 67% of their trials and averaged 268 days late for those and all trials that missed their deadlines. They included such eminent institutions as the Harvard University-affiliated Boston Children’s Hospital, the University of Minnesota, and Baylor College of Medicine—all among the top 50 recipients of NIH grants in 2019. The violations cover trials in virtually all fields of medicine, and the missing or late results offer potentially vital information for the most desperate patients. For example, in one long-overdue trial, researchers compared the efficacy of different chemotherapy regimens in 200 patients with advanced lymphoma; another—nearly 2 years late—tests immunotherapy against conventional chemotherapy in about 600 people with late-stage lung cancer.
…Contacted for comment, none of the institutions disputed the findings of this investigation. In all 4768 trials Science checked, sponsors violated the reporting law more than 55% of the time. And in hundreds of cases where the sponsors got credit for reporting trial results, they have yet to be publicly posted because of quality lapses flagged by ClinicalTrials.gov staff (see sidebar).
Although the 2017 rule, and officials’ statements at the time, promised aggressive enforcement and stiff penalties, neither NIH norFDA has cracked down. FDA now says it won’t brandish its big stick—penalties of up to $12,103 a day for failing to report a trial’s results—until after the agency issues further “guidance” on how it will exercise that power. It has not set a date. NIH said at a 2016 briefing on the final rule that it would cut off grants to those who ignore the trial reporting requirements, as authorized in the 2007 law, but so far has not done so…NIH and FDA officials do not seem inclined to apply that pressure. Lyric Jorgenson, NIH deputy director for science policy, says her agency has been “trying to change the culture of how clinical trial results are reported and disseminated; not so much on the ‘aha, we caught you’, as much as getting people to understand the value, and making it as easy as possible to share and disseminate results.” To that end, she says, ClinicalTrials.gov staff have educated researchers about the website and improved its usability. As for FDA, Patrick McNeilly, an official at the agency who handles trial enforcement matters, recently told an industry conference session on ClinicalTrials.gov that “FDA has limited resources, and we encourage voluntary compliance.” He said the agency also reviews reporting of information on ClinicalTrials.gov as part of inspections of trial sites, or when it receives complaints. McNeilly declined an interview request, but at the conference he discounted violations of ClinicalTrials.gov reporting requirements found by journalists and watchdog groups. “We’re not going to blanketly accept an entire list of trials that people say are noncompliant”, he said.
…It also highlights that pharma’s record has been markedly better than that of academia and the federal government.
…But such good performance shouldn’t be an exception, Harvard’s Zarin says. “Further public accountability of the trialists, but also our government organizations, has to happen. One possibility is that FDA and NIH will be shamed into enforcing the law. Another possibility is that sponsors will be shamed into doing a better job. A third possibility is that ClinicalTrials.gov will never fully achieve its vital aspirations.”
We reevaluate the claim from Bor et al. (2018: 302) that “police killings of unarmed black Americans have effects on mental health among black American adults in the general population.”
The Mapping Police Violence data used by the authors includes 91 incidents involving black decedents who were either (1) not killed by police officers in the line of duty or (2) armed when killed. These incidents should have been removed or recoded prior to analysis.
Correctly recoding these incidents decreased in magnitude all of the reported coefficients, and, more importantly, eliminated the reported statistically-significant effect of exposure to police killings of unarmed black individuals on the mental health of black Americans in the general population.
We caution researchers to vet carefully crowdsourced data that tracks police behaviors and warn against reducing these complex incidents to overly simplistic armed/unarmed dichotomies.
We develop a simple algorithm for detecting exam cheating between students who copy off one another’s exams.
When this algorithm is applied to exams in a general science course at a top university, we find strong evidence of cheating by at least 10% of the students. Students studying together cannot explain our findings. Matching incorrect answers proves to be a stronger indicator of cheating than matching correct answers.
When seating locations are randomly assigned, and monitoring is increased, cheating virtually disappears.
Many researchers rely on meta-analysis to summarize research evidence. However, there is a concern that publication bias and selective reporting may lead to biased meta-analytic effect sizes. We compare the results of meta-analyses to large-scale preregistered replications in psychology carried out at multiple laboratories. The multiple-laboratory replications provide precisely estimated effect sizes that do not suffer from publication bias or selective reporting. We searched the literature and identified 15 meta-analyses on the same topics as multiple-laboratory replications. We find that meta-analytic effect sizes are statistically-significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost 3× as large as replication effect sizes. We also implement 3 methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.
Research focusing on among-individual differences in behaviour (‘animal personality’) has been blooming for over a decade. Central theories explaining the maintenance of such behavioural variation posits that individuals expressing greater “risky” behaviours should suffer higher mortality. Here, for the first time, we synthesize the existing empirical evidence for this key prediction. Our results did not support this prediction as there was no directional relationship between riskier behaviour and greater mortality; however there was a statistically-significant absolute relationship between behaviour and survival. In total, behaviour explained a statistically-significant, but small, portion (5.8%) of the variance in survival. We also found that risky (vs. “shy”) behavioural types live statistically-significantly longer in the wild, but not in the laboratory. This suggests that individuals expressing risky behaviours might be of overall higher quality but the lack of predation pressure and resource restrictions mask this effect in laboratory environments. Our work demonstrates that individual differences in behaviour explain important differences in survival but not in the direction predicted by theory. Importantly, this suggests that models predicting behaviour to be a mediator of reproduction-survival trade-offs may need revision and/or empiricists may need to reconsider their proxies of risky behaviours when testing such theory.
“Many Labs 2: Investigating Variation in Replicability Across Sample and Setting”, Richard Klein, Michelangelo, Vianello, Fred, Hasselman, Byron, Adams, Reginald B. Adams, Jr., Sinan Alper, Mark Aveyard, Jordan Axt, Mayowa Babalola, Š těpán Bahník, Mihaly Berkics, Michael Bernstein, Daniel Berry, Olga Bialobrzeska, Konrad Bocian, Mark Brandt, Robert Busching, Huajian Cai, Fanny Cambier, Katarzyna Cantarero, Cheryl Carmichael, Zeynep Cemalcilar, Jesse Chandler, Jen-Ho Chang, Armand Chatard, Eva CHEN, Winnee Cheong, DavidCicero, Sharon Coen, Jennifer Coleman, Brian Collisson, Morgan Conway, Katherine Corker, Paul Curran, Fiery Cushman, Ilker Dalgar, William Davis, Maaike de Bruijn, Marieke de Vries, Thierry Devos, Canay Doğulu, Nerisa Dozo, Kristin Dukes, Yarrow Dunham, Kevin Durrheim, Matthew Easterbrook, Charles Ebersole, John Edlund, Alexander English, Anja Eller, Carolyn Finck, Miguel-Ángel Freyre, Mike Friedman, Natalia Frankowska, Elisa Galliani, Tanuka Ghoshal, Steffen Giessner, Tripat Gill, Timo Gnambs, Angel Gomez, Roberto Gonzalez, Jesse Graham, Jon Grahe, Ivan Grahek, Eva Green, Kakul Hai, Matthew Haigh, Elizabeth Haines, Michael Hall, Marie Heffernan, Joshua Hicks, Petr Houdek, Marije van, der Hulst, Jeffrey Huntsinger, Ho Huynh, Hans IJzerman, Yoel Inbar, Åse Innes-Ker, William Jimenez-Leal, Melissa-Sue John, Jennifer Joy-Gaba, Roza Kamiloglu, Andreas Kappes, Heather Kappes, Serdar Karabati, Haruna Karick, Victor Keller, Anna Kende, Nicolas Kervyn, Goran Knezevic, Carrie Kovacs, Lacy Krueger, German Kurapov, Jaime Kurtz, Daniel Lakens, Ljiljana Lazarevic, Carmel Levitan, Neil Lewis, Samuel Lins, Esther Maassen, Angela Maitner, Winfrida Malingumu, Robyn Mallett, Satia Marotta, Jason McIntyre, Janko Međedović, Taciano Milfont, Wendy Morris, Andriy Myachykov, Sean Murphy, Koen Neijenhuijs, Anthony Nelson, Felix Neto, Austin Nichols, Susan O'Donnell, Masanori Oikawa, Gabor Orosz, Malgorzata Osowiecka, Grant Packard, Rolando Pérez, Boban Petrovic, Ronaldo Pilati, Brad Pinter, Lysandra Podesta, Monique Pollmann, Anna Dalla Rosa, Abraham Rutchick, Patricio Saavedra, Airi Sacco, Alexander Saeri, Erika Salomon, Kathleen Schmidt, Felix Schönbrodt, Maciek Sekerdej, David Sirlopu, Jeanine Skorinko, Michael Smith, Vanessa Smith-Castro, Agata Sobkow, Walter Sowden, Philipp Spachtholz, Troy Steiner, Jeroen Stouten, Chris Street, Oskar Sundfelt, Ewa Szumowska, Andrew Tang, Norbert Tanzer, Morgan Tear, Jordan Theriault, Manuela Thomae, David Torres-Fernández, Jakub Traczyk, Joshua Tybur, Adrienn Ujhelyi, Marcel van Assen, Anna van 't Veer, Alejandro, Vásquez-Echeverría Leigh, Ann Vaughn, Alexandra Vázquez, Diego Vega, Catherine Verniers, Mark Verschoor, Ingrid Voermans, Marek Vranka, Cheryl Welch, Aaron Wichman, Lisa Williams, Julie Woodzicka, Marta Wronska, Liane Young, John Zelenski, Brian Nosek (2019-11-19; backlinks):
We conducted preregistered replications of 28 classic and contemporary published findings with protocols that were peer reviewed in advance to examine variation in effect magnitudes across sample and setting. Each protocol was administered to approximately half of 125 samples and 15,305 total participants from 36 countries and territories. Using conventional statistical-significance (p < 0.05), fifteen (54%) of the replications provided evidence in the same direction and statistically-significant as the original finding. With a strict statistical-significance criterion (p < 0.0001), fourteen (50%) provide such evidence reflecting the extremely high powered design. Seven (25%) of the replications had effect sizes larger than the original finding and 21 (75%) had effect sizes smaller than the original finding. The median comparable Cohen’s d effect sizes for original findings was 0.60 and for replications was 0.15. Sixteen replications (57%) had small effect sizes (< 0.20) and 9 (32%) were in the opposite direction from the original finding. Across settings, 11 (39%) showed statistically-significant heterogeneity using the Q statistic and most of those were among the findings eliciting the largest overall effect sizes; only one effect that was near zero in the aggregate showed statistically-significant heterogeneity. Only one effect showed a Tau > 0.20 indicating moderate heterogeneity. Nine others had a Tau near or slightly above 0.10 indicating slight heterogeneity. In moderation tests, very little heterogeneity was attributable to task order, administration in lab versus online, and exploratory WEIRD versus less WEIRD culture comparisons. Cumulatively, variability in observed effect sizes was more attributable to the effect being studied than the sample or setting in which it was studied.
…In the process of reading the book and encountering some extraordinary claims about sleep, I decided to compare the facts it presented with the scientific literature. I found that the book consistently overstates the problem of lack of sleep, sometimes egregiously so. It misrepresents basic sleep research and contradicts its own sources.
In one instance, Walker claims that sleeping less than six or seven hours a night doubles one’s risk of cancer—this is not supported by the scientific evidence (Section 1.1). In another instance, Walker seems to have invented a “fact” that the WHO has declared a sleep loss epidemic (Section 4). In yet another instance, he falsely claims that the National Sleep Foundation recommends 8 hours of sleep per night, and then uses this “fact” to falsely claim that two-thirds of people in developed nations sleep less than the “the recommended eight hours of nightly sleep” (Section 5).
Walker’s book has likely wasted thousands of hours of life and worsened the health of people who read it and took its recommendations at face value (Section 7).
[Summary of investigation into David Rosenhan: like the Robbers Cave or Stanford Prison Experiment, his famous fake-insane patients experiment cannot be verified and many troubling anomalies have come to light. Cahalan is unable to find almost all of the supposed participants, Rosenhan hid his own participation & his own medical records show he fabricated details of his case, he throw out participant data that didn’t match his narrative, reported numbers are inconsistent, Rosenhan abandoned a lucrative book deal about it and avoided further psychiatric research, and showed some character traits of a fabulist eager to please.]
Although Rosenhan died in 2012, Cahalan easily tracked down his archives, held by social psychologist Lee Ross, his friend and colleague at Stanford. They included the first 200 pages of Rosenhan’s unfinished draft of a book about the experiment…Ross warned her that Rosenhan had been secretive. As her attempts to identify the pseudonymous pseudopatients hit one dead end after the other, she realized Ross’s prescience.
The archives did allow Cahalan to piece together the beginnings of the experiment in 1969, when Rosenhan was teaching psychology at Swarthmore College in Pennsylvania…Rosenhan cautiously decided to check things out for himself first. He emerged humbled from nine traumatizing days in a locked ward, and abandoned the idea of putting students through the experience.
…According to Rosenhan’s draft, it was at a conference dinner that he met his first recruits: a recently retired psychiatrist and his psychologist wife. The psychiatrist’s sister also signed up. But the draft didn’t explain how, when and why subsequent recruits signed up. Cahalan interviewed numerous people who had known Rosenhan personally or indirectly. She also chased down the medical records of individuals whom she suspected could have been involved in the experiment, and spoke with their families and friends. But her sleuthing brought her to only one participant, a former Stanford graduate student called Bill Underwood.
…Underwood and his wife were happy to talk, but two of their comments jarred. Rosenhan’s draft described how he prepared his volunteers very carefully, over weeks. Underwood, however, remembered only brief guidance on how to avoid swallowing medication by hiding pills in his cheek. His wife recalled Rosenhan telling her that he had prepared writs of habeas corpus for each pseudopatient, in case an institution would not discharge them. But Cahalan had already worked out that that wasn’t so.
Comparing the Science report with documents in Rosenhan’s archives, she also noted many mismatches in numbers. For instance, Rosenhan’s draft, and the Science paper, stated that Underwood had spent seven days in a hospital with 8,000 patients, whereas he spent eight days in a hospital with 1,500 patients.
When all of the leads from her contacts led to ground, she published a commentary in The Lancet Psychiatry asking for help in finding them—to no avail. Had Rosenhan invented them, she found herself asking?
Background: Few randomized trials have evaluated the effect of reducing red meat intake on clinically important outcomes.
Purpose: To summarize the effect of lower versus higher red meat intake on the incidence of cardiometabolic and cancer outcomes in adults.
Data Sources: EMBASE, CENTRAL, CINAHL, Web of Science, andProQuest from inception to July 2018 and MEDLINE from inception to April 2019, without language restrictions.
Study Selection: Randomized trials (published in any language) comparing diets lower in red meat with diets higher in red meat that differed by a gradient of at least 1 serving per week for 6 months or more.
Data Extraction: Teams of 2 reviewers independently extracted data and assessed the risk of bias and the certainty of the evidence.
Data Synthesis: Of 12 eligible trials, a single trial enrolling 48 835 women provided the most credible, though still low-certainty, evidence that diets lower in red meat may have little or no effect on all-cause mortality (hazard ratio [HR], 0.99 [95% CI, 0.95 to 1.03]), cardiovascular mortality (HR, 0.98 [CI, 0.91 to 1.06]), and cardiovascular disease (HR, 0.99 [CI, 0.94 to 1.05]). That trial also provided low-certainty to very-low-certainty evidence that diets lower in red meat may have little or no effect on total cancer mortality (HR, 0.95 [CI, 0.89 to 1.01]) and the incidence of cancer, including colorectal cancer (HR, 1.04 [CI, 0.90 to 1.20]) and breast cancer (HR, 0.97 [0.90 to 1.04]).
Limitations: There were few trials, most addressing only surrogate outcomes, with heterogeneous comparators and small gradients in red meat consumption between lower versus higher intake groups.
Conclusion: Low-certainty to very-low-certainty evidence suggests that diets restricted in red meat may have little or no effect on major cardiometabolic outcomes and cancer mortality and incidence.
In recent decades the field of anthropology has been characterized as sharply divided between pro-science and anti-science factions. The aim of this study is to empirically evaluate that characterization. We survey anthropologists in graduate programs in the United States regarding their views of science and advocacy, moral and epistemic relativism, and the merits of evolutionary biological explanations. We examine anthropologists’ views in concert with their varying appraisals of major controversies in the discipline (Chagnon / Tierney, Mead / Freeman, and Menchú / Stoll). We find that disciplinary specialization and especially gender and political orientation are statistically-significant predictors of anthropologists’ views. We interpret our findings through the lens of an intuitionist social psychology that helps explain the dynamics of such controversies as well as ongoing ideological divisions in the field.
Using a novel technique known as network meta-analysis, we synthesized evidence from 492 studies (87,418 participants) to investigate the effectiveness of procedures in changing implicit measures, which we define as response biases on implicit tasks. We also evaluated these procedures’ effects on explicit and behavioral measures. We found that implicit measures can be changed, but effects are often relatively weak (|ds| < .30). Most studies focused on producing short-term changes with brief, single-session manipulations. Procedures that associate sets of concepts, invoke goals or motivations, or tax mental resources changed implicit measures the most, whereas procedures that induced threat, affirmation, or specific moods/emotions changed implicit measures the least. Bias tests suggested that implicit effects could be inflated relative to their true population values. Procedures changed explicit measures less consistently and to a smaller degree than implicit measures and generally produced trivial changes in behavior. Finally, changes in implicit measures did not mediate changes in explicit measures or behavior. Our findings suggest that changes in implicit measures are possible, but those changes do not necessarily translate into changes in explicit measures or behavior.
“A national experiment reveals where a growth mindset improves achievement”, David S. Yeager, Paul Hanselman, Gregory M. Walton, Jared S. Murray, Robert Crosnoe, Chandra Muller, Elizabeth Tipton, Barbara Schneider, Chris S. Hulleman, Cintia P. Hinojosa, David Paunesku, Carissa Romero, Kate Flint, Alice Roberts, Jill Trott, Ronaldo Iachan, Jenny Buontempo, Sophia Man Yang, Carlos M. Carvalho, P. Richard Hahn, Maithreyi Gopalan, Pratik Mhatre, Ronald Ferguson, Angela L. Duckworth, Carol S. Dweck (2019-08-07; sociology; backlinks):
A global priority for the behavioural sciences is to develop cost-effective, scalable interventions that could improve the academic outcomes of adolescents at a population level, but no such interventions have so far been evaluated in a population-generalizable sample. Here we show that a short (less than one hour), online growth mindset intervention—which teaches that intellectual abilities can be developed—improved grades among lower-achieving students and increased overall enrolment to advanced mathematics courses in a nationally representative sample of students in secondary education in the United States. Notably, the study identified school contexts that sustained the effects of the growth mindset intervention: the intervention changed grades when peer norms aligned with the messages of the intervention. Confidence in the conclusions of this study comes from independent data collection and processing, pre-registration of analyses, and corroboration of results by a blinded Bayesian analysis.
The Stanford Prison Experiment(SPE) is one of psychology’s most famous studies. It has been criticized on many grounds, and yet a majority of textbook authors have ignored these criticisms in their discussions of the SPE, thereby misleading both students and the general public about the study’s questionable scientific validity.
Data collected from a thorough investigation of the SPE archives and interviews with 15 of the participants in the experiment further question the study’s scientific merit. These data are not only supportive of previous criticisms of the SPE, such as the presence ofdemand characteristics, but provide new criticisms of the SPE based on heretofore unknown information. These new criticisms include the biased and incomplete collection of data, the extent to which the SPE drew on a prison experiment devised and conducted by students in one of Zimbardo’s classes 3 months earlier, the fact that the guards received precise instructions regarding the treatment of the prisoners, the fact that the guards were not told they were subjects, and the fact that participants were almost never completely immersed by the situation.
Possible explanations of the inaccurate textbook portrayal and general misperception of the SPE’s scientific validity over the past 5 decades, in spite of its flaws and shortcomings, are discussed.
The observation of individuals attaining remarkable ages, and their concentration into geographic sub-regions or ‘blue zones’, has generated considerable scientific interest. Proposed drivers of remarkable longevity include high vegetable intake, strong social connections, and genetic markers. Here, we reveal new predictors of remarkable longevity and ‘supercentenarian’ status. In the United States, supercentenarian status is predicted by the absence of vital registration. The state-specific introduction of birth certificates is associated with a 69–82% fall in the number of supercentenarian records. In Italy, which has more uniform vital registration, remarkable longevity is instead predicted by low per capita incomes and a short life expectancy. Finally, the designated ‘blue zones’ of Sardinia, Okinawa, and Ikaria corresponded to regions with low incomes, low literacy, high crime rate and short life expectancy relative to their national average. As such, relative poverty and short lifespan constitute unexpected predictors of centenarian and supercentenarian status, and support a primary role of fraud and error in generating remarkable human age records.
In the 30 years that biomedical researchers have worked determinedly to find a cure for Alzheimer’s disease, their counterparts have developed drugs that helped cut deaths from cardiovascular disease by more than half, and cancer drugs able to eliminate tumors that had been incurable. But for Alzheimer’s, not only is there no cure, there is not even a disease-slowing treatment.
…In more than two dozen interviews, scientists whose ideas fell outside the dogma recounted how, for decades, believers in the dominant hypothesis suppressed research on alternative ideas: They influenced what studies got published in top journals, which scientists got funded, who got tenure, and who got speaking slots at reputation-buffing scientific conferences. The scientists described the frustrating, even career-ending, obstacles that they confronted in pursuing their research. A top journal told one that it would not publish her paper because others hadn’t. Another got whispered advice to at least pretend that the research for which she was seeking funding was related to the leading idea—that a protein fragment called beta-amyloid accumulates in the brain, creating neuron-killing clumps that are both the cause of Alzheimer’s and the key to treating it. Others could not get speaking slots at important meetings, a key showcase for research results. Several who tried to start companies to develop Alzheimer’s cures were told again and again by venture capital firms and major biopharma companies that they would back only an amyloid approach.
…For all her regrets about the amyloid hegemony, Neve is an unlikely critic: She co-led the 1987 discovery of mutations in a gene called APP that increases amyloid levels and causes Alzheimer’s in middle age, supporting the then-emerging orthodoxy. Yet she believes that one reason Alzheimer’s remains incurable and untreatable is that the amyloid camp “dominated the field”, she said. Its followers were influential “to the extent that they persuaded the National Institute of Neurological Disorders and Stroke [part of the National Institutes of Health] that it was a waste of money to fund any Alzheimer’s-related grants that didn’t center around amyloid.” To be sure, NIH did fund some Alzheimer’s research that did not focus on amyloid. In a sea of amyloid-focused grants, there are tiny islands of research on oxidative stress, neuroinflammation, and, especially, a protein called tau. But Neve’s NINDS program officer, she said, “told me that I should at least collaborate with the amyloid people or I wouldn’t get any more NINDS grants.” (She hoped to study how neurons die.) A decade after her APP discovery, a disillusioned Neve left Alzheimer’s research, building a distinguished career in gene editing. Today, she said, she is “sick about the millions of people who have needlessly died from” the disease.
Dr. Daniel Alkon, a longtime NIH neuroscientist who started a company to develop an Alzheimer’s treatment, is even more emphatic: “If it weren’t for the near-total dominance of the idea that amyloid is the only appropriate drug target”, he said, “we would be 10 or 15 years ahead of where we are now.”
Making it worse is that the empirical support for the amyloid hypothesis has always been shaky. There were numerous red flags over the decades that targeting amyloid alone might not slow or reverse Alzheimer’s. “Even at the time the amyloid hypothesis emerged, 30 years ago, there was concern about putting all our eggs into one basket, especially the idea that ridding the brain of amyloid would lead to a successful treatment”, said neurobiologist Susan Fitzpatrick, president of the James S. McDonnell Foundation. But research pointing out shortcomings of the hypothesis was relegated to second-tier journals, at best, a signal to other scientists and drug companies that the criticisms needn’t be taken too seriously. Zaven Khachaturian spent years at NIH overseeing its early Alzheimer’s funding. Amyloid partisans, he said, “came to permeate drug companies, journals, and NIH study sections”, the groups of mostly outside academics who decide what research NIH should fund. “Things shifted from a scientific inquiry into an almost religious belief system, where people stopped being skeptical or even questioning.”
…“You had a whole industry going after amyloid, hundreds of clinical trials targeting it in different ways”, Alkon said. Despite success in millions of mice, “none of it worked in patients.”
Scientists who raised doubts about the amyloid model suspected why. Amyloid deposits, they thought, are a response to the true cause of Alzheimer’s and therefore a marker of the disease—again, the gravestones of neurons and synapses, not the killers. The evidence? For one thing, although the brains of elderly Alzheimer’s patients had amyloid plaques, so did the brains of people the same age who died with no signs of dementia, a pathologist discovered in 1991. Why didn’t amyloid rob them of their memories? For another, mice engineered with human genes for early Alzheimer’s developed both amyloid plaques and dementia, but there was no proof that the much more common, late-onset form of Alzheimer’s worked the same way. And yes, amyloid plaques destroy synapses (the basis of memory and every other brain function) in mouse brains, but there is no correlation between the degree of cognitive impairment in humans and the amyloid burden in the memory-forming hippocampus or the higher-thought frontal cortex. “There were so many clues”, said neuroscientist Nikolaos Robakis of the Icahn School of Medicine at Mount Sinai, who also discovered a mutation for early-onset Alzheimer’s. “Somehow the field believed all the studies supporting it, but not those raising doubts, which were very strong. The many weaknesses in the theory were ignored.”
The ability to identify medical reversals and other low-value medical practices is an essential prerequisite for efforts to reduce spending on such practices. Through an analysis of more than 3000 randomized controlled trials (RCTs) published in three leading medical journals (the Journal of the American Medical Association, the Lancet, and the New England Journal of Medicine), we have identified 396 medical reversals. Most of the studies (92%) were conducted on populations in high-income countries, cardiovascular disease was the most common medical category (20%), and medication was the most common type of intervention (33%).
We provide generalizable and robust results on the causal sales effect of TV advertising based on the distribution of advertising elasticities for a large number of products (brands) in many categories. Such generalizable results provide a prior distribution that can improve the advertising decisions made by firms and the analysis and recommendations of anti-trust and public policy makers. A single case study cannot provide generalizable results, and hence the marketing literature provides several meta-analyses based on published case studies of advertising effects. However, publication bias results if the research or review process systematically rejects estimates of small, statistically insignificant, or “unexpected” advertising elasticities. Consequently, if there is publication bias, the results of a meta-analysis will not reflect the true population distribution of advertising effects.
To provide generalizable results, we base our analysis on a large number of products and clearly lay out the research protocol used to select the products. We characterize the distribution of all estimates, irrespective of sign, size, or statistical-significance. To ensure generalizability we document the robustness of the estimates. First, we examine the sensitivity of the results to the approach and assumptions made when constructing the data used in estimation from the raw sources. Second, as we aim to provide causal estimates, we document if the estimated effects are sensitive to the identification strategies that we use to claim causality based on observational data. Our results reveal substantially smaller effects of own-advertising compared to the results documented in the extant literature, as well as a sizable percentage of statistically insignificant or negative estimates. If we only select products with statistically-significant and positive estimates, the mean or median of the advertising effect distribution increases by a factor of about five.
The results are robust to various identifying assumptions, and are consistent with both publication bias and bias due to non-robust identification strategies to obtain causal estimates in the literature.
Seventeen years and hundreds of studies after the first journal article on working memory training was published, evidence for the efficacy of working memory training is still wanting. Numerous studies show that individuals who repeatedly practice computerized working memory tasks improve on those tasks and closely related variants. Critically, although individual studies have shown improvements in untrained abilities and behaviors, systematic reviews of the broader literature show that studies producing large, positive findings are often those with the most methodological shortcomings. The current review discusses the past, present, and future status of working memory training, including consideration of factors that might influence working memory training and transfer efficacy.
Effect sizes are the currency of psychological research. They quantify the results of a study to answer the research question and are used to calculate statistical power. The interpretation of effect sizes—when is an effect small, medium, or large?—has been guided by the recommendations Jacob Cohen gave in his pioneering writings starting in 1962: Either compare an effect with the effects found in past research or use certain conventional benchmarks.
The present analysis shows that neither of these recommendations is currently applicable. From past publications without pre-registration, 900 effects were randomly drawn and compared with 93 effects from publications with pre-registration, revealing a large difference: Effects from the former (median r = 0.36) were much larger than effects from the latter (median r = 0.16). That is, certain biases, such as publication bias or questionable research practices, have caused a dramatic inflation in published effects, making it difficult to compare an actual effect with the real population effects (as these are unknown). In addition, there were very large differences in the mean effects between psychological sub-disciplines and between different study designs, making it impossible to apply any global benchmarks.
Many more pre-registered studies are needed in the future to derive a reliable picture of real population effects.
Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on the definition of “replicability” and disentangle the inexorable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on (1) whether published findings about the signs of effects are mostly correct, (2) how effective replication studies are in reproducing whatever true effect size was present in the original experiment, and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and post-selection inference to develop new methods that answer these questions while explicitly accounting for selection bias. Our analyses suggest that the RP:P dataset is largely consistent with publication bias due to selection of statistically-significant effects. The methods in this paper make no distributional assumptions about the true effect sizes.
There are a growing number of large-scale educational randomized controlled trials (RCTs). Considering their expense, it is important to reflect on the effectiveness of this approach. We assessed the magnitude and precision of effects found in those large-scale RCTs commissioned by the UK-based Education Endowment Foundation and the U.S.-based National Center for Educational Evaluation and Regional Assistance, which evaluated interventions aimed at improving academic achievement in K–12 (141 RCTs; 1,222,024 students). The mean effect size was 0.06 standard deviations. These sat within relatively large confidence intervals (mean width = 0.30 SDs), which meant that the results were often uninformative (the median Bayes factor was 0.56). We argue that our field needs, as a priority, to understand why educational RCTs often find small and uninformative effects.
[Keywords: educational policy, evaluation, meta-analysis, program evaluation.]
Blind auditions and gender discrimination: A seminal paper from 2000 investigated the impact of blind auditions in orchestras, and found that they increased the proportion of women in symphony orchestras. I investigate the study, and find that there is no good evidence presented. [The study is temporally confounded by a national trend of increasing female participation, does not actually establish any particular correlate of blind auditions, much less randomized experiments of blinding, the dataset is extremely underpowered, the effects cited in coverage cannot be found anywhere in the paper, and the critical comparisons which are there are not even statistically-significant in the first place. None of these caveats are included in the numerous citations of the study as “proving” discrimination against women.]
The recent ‘replication crisis’ in psychology has focused attention on ways of increasing methodological rigor within the behavioral sciences. Part of this work has involved promoting ‘Registered Reports’, wherein journals peer review papers prior to data collection and publication. Although this approach is usually seen as a relatively recent development, we note that a prototype of this publishing model was initiated in the mid-1970s by parapsychologist Martin Johnson in the European Journal of Parapsychology (EJP). A retrospective and observational comparison of Registered and non-Registered Reportspublished in the EJP during a seventeen-year period provides circumstantial evidence to suggest that the approach helped to reduce questionable research practices. This paper aims both to bring Johnson’s pioneering work to a wider audience, and to investigate the positive role that Registered Reports may play in helping to promote higher methodological and statistical standards.
…The final dataset contained 60 papers: 25 RRs and 35 non-RRs. The RRs described 31 experiments that tested 131 hypotheses, and the non-RRs described 60 experiments that tested 232 hypotheses.
28.4% of the statistical tests reported in non-RRs were statistically-significant (66⁄232: 95% CI [21.5%–36.4%]); compared to 8.4% of those in the RRs (11⁄131: 95% CI [4.0%–16.8%]). A simple 2 × 2 contingency analysis showed that this difference is highly statistically-significant (Fisher’s exact test: p < 0.0005, Pearson chi-square = 20.1, Cohen’s d = 0.48).
…Parapsychologists investigate the possible existence of phenomena that, for many, have a low a priori likelihood of being genuine (see, eg., Wagenmakers et al 2011). This has often resulted in their work being subjected to a considerable amount of critical attention (from both within and outwith the field) that has led to them pioneering several methodological advances prior to their use within mainstream psychology, including the development of randomisation in experimental design (Hacking, 1988), the use of blinds (Kaptchuk, 1998), explorations into randomisation and statistical inference (Fisher, 1924), advances in replication issues (Rosenthal, 1986), the need for pre-specification in meta-analysis (Akers, 1985; Milton, 1999; Kennedy, 2004), and the creation of a formal study registry (Watt, 2012; Watt & Kennedy, 2015). Johnson’s work on RRs provides another striking illustration of this principle at work.
The relationship between nonverbal communication and deception continues to attract much interest, but there are many misconceptions about it. In this review, we present a scientific view on this relationship. We describe theories explaining why liars would behave differently from truth tellers, followed by research on how liars actually behave and individuals’ ability to detect lies. We show that the nonverbal cues to deceit discovered to date are faint and unreliable and that people are mediocre lie catchers when they pay attention to behavior. We also discuss why individuals hold misbeliefs about the relationship between nonverbal behavior and deception—beliefs that appear very hard to debunk. We further discuss the ways in which researchers could improve the state of affairs by examining nonverbal behaviors in different ways and in different settings than they currently do.
The Big Five personality traits have been linked to dozens of life outcomes. However, metascientific research has raised questions about the replicability of behavioral science. The Life Outcomes of Personality Replication (LOOPR) Project was therefore conducted to estimate the replicability of the personality-outcome literature. Specifically, I conducted preregistered, high-powered (median n = 1,504) replications of 78 previously published trait–outcome associations. Overall, 87% of the replication attempts were statistically-significant in the expected direction. The replication effects were typically 77% as strong as the corresponding original effects, which represents a significant decline in effect size. The replicability of individual effects was predicted by the effect size and design of the original study, as well as the sample size and statistical power of the replication. These results indicate that the personality-outcome literature provides a reasonably accurate map of trait–outcome associations but also that it stands to benefit from efforts to improve replicability.
Bilingualism was once thought to result in cognitive disadvantages, but research in recent decades has demonstrated that experience with two (or more) languages confers a bilingual advantage in executive functions and may delay the incidence of Alzheimer’s disease. However, conflicting evidence has emerged leading to questions concerning the robustness of the bilingual advantage for both executive functions and dementia incidence. Some investigators have failed to find evidence of a bilingual advantage; others have suggested that bilingual advantages may be entirely spurious, while proponents of the advantage case have continued to defend it. A heated debate has ensued, and the field has now reached an impasse. This review critically examines evidence for and against the bilingual advantage in executive functions, cognitive aging, and brain plasticity, before outlining how future research could shed light on this debate and advance knowledge of how experience with multiple languages affects cognition and the brain.
Ninety-seven percent of drug-indication pairs that are tested in clinical trials in oncology never advance to receive U.S. Food and Drug Administration approval. While lack of efficacy and dose-limiting toxicities are the most common causes of trial failure, the reason(s) why so many new drugs encounter these problems is not well understood. Using CRISPR-Cas9 mutagenesis, we investigated a set of cancer drugs and drug targets in various stages of clinical testing. We show that-contrary to previous reports obtained predominantly with RNA interference and small-molecule inhibitors-the proteins ostensibly targeted by these drugs are nonessential for cancer cell proliferation. Moreover, the efficacy of each drug that we tested was unaffected by the loss of its putative target, indicating that these compounds kill cells via off-target effects. By applying a genetic target-deconvolution strategy, we found that the mischaracterized anticancer agent OTS964 is actually a potent inhibitor of thecyclin-dependent kinase CDK11 and that multiple cancer types areaddicted to CDK11 expression. We suggest that stringent genetic validation of the mechanism of action of cancer drugs in the preclinical setting may decrease the number of therapies tested in human patients that fail to provide any clinical benefit.
Learning a second language in childhood is inherently advantageous for communication. However, parents, educators and scientists have been interested in determining whether there are additional cognitive advantages. One of the most exciting yet controversial1 findings about bilinguals is a reported advantage for executive function. That is, several studies suggest that bilinguals perform better than monolinguals on tasks assessing cognitive abilities that are central to the voluntary control of thoughts and behaviours-the so-called ‘executive functions’ (for example, attention, inhibitory control, task switching and resolving conflict). Although a number of small-2–4 and large-sample5,6 studies have reported a bilingual executive function advantage (see refs. 7–9 for a review), there have been several failures to replicate these findings10–15, and recent meta-analyses have called into question the reliability of the original empirical claims8,9. Here we show, in a very large, demographically representative sample (n = 4,524) of 9- to 10-year-olds across the United States, that there is little evidence for a bilingual advantage for inhibitory control, attention and task switching, or cognitive flexibility, which are key aspects of executive function. We also replicate previously reported disadvantages in English vocabulary in bilinguals7,16,17. However, these English vocabulary differences are substantially mitigated when we account for individual differences in socioeconomic status or intelligence. In summary, notwithstanding the inherently positive benefits of learning a second language in childhood18, we found little evidence that it engenders additional benefits to executive function development.
The stereotype threat literature primarily comprises lab studies, many of which involve features that would not be present in high-stakes testing settings. We meta-analyze the effect of stereotype threat on cognitive ability tests, focusing on both laboratory and operational studies with features likely to be present in high stakes settings. First, we examine the features of cognitive ability test metric, stereotype threat cue activation strength, and type of non-threat control group, and conduct a focal analysis removing conditions that would not be present in high stakes settings. We also take into account a previously unrecognized methodological error in how data are analyzed in studies that control for scores on a prior cognitive ability test, which resulted in a biased estimate of stereotype threat. The focal sample, restricting the database to samples utilizing operational testing-relevant conditions, displayed a threat effect of d = −0.14 (k = 45, N = 3,532, SDδ = 0.31). Second, we present a comprehensive meta-analysis of stereotype threat. Third, we examine a small subset of studies in operational test settings and studies utilizing motivational incentives, which yielded d-values ranging from 0.00 to −0.14. Fourth, the meta-analytic database is subjected to tests of publication bias, finding nontrivial evidence for publication bias. Overall, results indicate that the size of the stereotype threat effect that can be experienced on tests of cognitive ability in operational scenarios such as college admissions tests and employment testing may range from negligible to small.
Objective: Interest in candidate gene and candidate gene-by-environment interaction hypotheses regarding major depressive disorder remains strong despite controversy surrounding the validity of previous findings. In response to this controversy, the present investigation empirically identified 18 candidate genes for depression that have been studied 10 or more times and examined evidence for their relevance to depression phenotypes.
Methods: Utilizing data from large population-based and case-control samples (_n_s ranging from 62,138 to 443,264 across subsamples), the authors conducted a series of preregistered analyses examining candidate gene polymorphism main effects, polymorphism-by-environment interactions, and gene-level effects across a number of operational definitions of depression (eg., lifetime diagnosis, current severity, episode recurrence) and environmental moderators (eg., sexual or physical abuse during childhood, socioeconomic adversity).
Results: No clear evidence was found for any candidate gene polymorphism associations with depression phenotypes or any polymorphism-by-environment moderator effects. As a set, depression candidate genes were no more associated with depression phenotypes than non-candidate genes. The authors demonstrate that phenotypic measurement error is unlikely to account for these null findings.
Conclusions: The study results do not support previous depression candidate gene findings, in which large genetic effects are frequently reported in samples orders of magnitude smaller than those examined here. Instead, the results suggest that early hypotheses about depression candidate genes were incorrect and that the large number of associations reported in the depression candidate gene literature are likely to be false positives.
Prediction markets correctly predicted 75% of the replication outcomes.
Prediction markets performed better than survey data in predicting replication outcomes.
Survey data performed better in predicting relative effect size of the replications.
Understanding and improving reproducibility is crucial for scientific progress. Prediction markets and related methods of eliciting peer beliefs are promising tools to predict replication outcomes. We invited researchers in the field of psychology to judge the replicability of 24 studies replicated in the large scale Many Labs 2 project. We elicited peer beliefs in prediction markets and surveys about two replication success metrics: the probability that the replication yields a statistically-significant effect in the original direction (p < 0.001), and the relative effect size of the replication. The prediction markets correctly predicted 75% of the replication outcomes, and were highly correlated with the replication outcomes. Survey beliefs were also statistically-significantly correlated with replication outcomes, but had larger prediction errors. The prediction markets for relative effect sizes attracted little trading and thus did not work well. The survey beliefs about relative effect sizes performed better and were statistically-significantly correlated with observed relative effect sizes. The results suggest that replication outcomes can be predicted and that the elicitation of peer beliefs can increase our knowledge about scientific reproducibility and the dynamics of hypothesis testing.
Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a statistically-significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.
Evidence-based medicine is the cornerstone of clinical practice, but it is dependent on the quality of evidence upon which it is based. Unfortunately, up to half of all randomized controlled trials (RCTs) have never been published, and trials with statistically-significant findings are more likely to be published than those without (Dwan et al 2013). Importantly, negative trials face additional hurdles beyond study publication bias that can result in the disappearance of non-significant results (Boutron et al 2010; Dwan et al 2013; Duyx et al 2017). Here, we analyze the cumulative impact of biases on apparent efficacy, and discuss possible remedies, using the evidence base for two effective treatments for depression: antidepressants and psychotherapy.
Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample average ¯¯¯¯¯Xn and the population average ¯¯¯¯¯XN is the product of three terms: (1) a data quality measure, ρR,X, the correlation between Xj and the response/recording indicator Rj; (2) a data quantity measure, √(N−n)/n, where N is the population size; and (3) a problem difficulty measure, σX, the standard deviation of X. This decomposition provides multiple insights: (1) Probabilistic sampling ensures high data quality by controlling ρR,X at the level of N−1/2; (2) When we lose this control, the impact of N is no longer canceled by ρR,X, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate 1/√n, increases with √N; and (3) the “bigness” of such Big Data (for population inferences) should be measured by the relative sizef=n/N, not the absolute sizen; (4) When combining data sources for population inferences, those relatively tiny but higher quality ones should be given far more weights than suggested by their sizes.
Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest aρR,X≈−0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n≈2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n≈400, a 99.98%reduction of sample size (and hence our confidence). The CCES datademonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual 95%confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.
It is well known among researchers and practitioners that election polls suffer from a variety of sampling and nonsampling errors, often collectively referred to as total survey error. Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (eg., errors due to uncertainty in who will vote).
Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean square error is approximately 3.5 percentage points, about twice as large as that implied by most reported margins of error. We decompose survey error into election-level bias and variance terms. We find that average absolute election-level bias is about 2 percentage points, indicating that polls for a given election often share a common component of error. This shared error may stem from the fact that polling organizations often face similar difficulties in reaching various subgroups of the population, and that they rely on similar screening rules when estimating who will vote. We also find that average election-level variance is higher than implied by simple random sampling, in part because polling organizations often use complex sampling designs and adjustment procedures.
We conclude by discussing how these results help explain polling failures in the 2016 U.S. presidential election, and offer recommendations to improve polling practice.
[Keywords: margin of error, non-sampling error, polling bias, total survey error]
Background: The pathway from evidence generation to consumption contains many steps which can lead to overstatement or misinformation. The proliferation of internet-based health news may encourage selection of media and academic research articles that overstate strength of causal inference. We investigated the state of causal inference in health research as it appears at the end of the pathway, at the point of social media consumption.
Methods: We screened the NewsWhip Insights database for the most shared media articles on Facebook and Twitter reporting about peer-reviewed academic studies associating an exposure with a health outcome in 2015, extracting the 50 most-shared academic articles and media articles covering them. We designed and utilized a review tool to systematically assess and summarize studies’ strength of causal inference, including generalizability, potential confounders, and methods used. These were then compared with the strength of causal language used to describe results in both academic and media articles. Two randomly assigned independent reviewers and one arbitrating reviewer from a pool of 21 reviewers assessed each article.
Results: We accepted the most shared 64 media articles pertaining to 50 academic articles for review, representing 68% of Facebook and 45% of Twitter shares in 2015. 34% of academic studies and 48% of media articles used language that reviewers considered too strong for their strength of causal inference. 70% of academic studies were considered low or very low strength of inference, with only 6% considered high or very high strength of causal inference. The most severe issues with academic studies’ causal inference were reported to be omitted confounding variables and generalizability. 58% of media articles were found to have inaccurately reported the question, results, intervention, or population of the academic study.
Conclusions: We find a large disparity between the strength of language as presented to the research consumer and the underlying strength of causal inference among the studies most widely shared on social media. However, because this sample was designed to be representative of the articles selected and shared on social media, it is unlikely to be representative of all academic and media work. More research is needed to determine how academic institutions, media organizations, and social network sharing patterns impact causal inference and language as received by the research consumer.
[pro; con] It is often claimed that negative events carry a larger weight than positive events. Loss aversion is the manifestation of this argument in monetary outcomes. In this review, we examine early studies of the utility function of gains and losses, and in particular the original evidence for loss aversion reported by Kahneman and Tversky (Econometrica 47:263–291, 1979).
We suggest that loss aversion proponents have over-interpreted these findings. Specifically, the early studies of utility functions have shown that while very large losses are overweighted, smaller losses are often not. In addition, the findings of some of these studies have been systematically misrepresented to reflect loss aversion, though they did not find it.
These findings shed light both on the inability of modern studies to reproduce loss aversion as well as a second literature arguing strongly for it.
In 50s Middle Grove, things didn’t go according to plan either, though the surprise was of a different nature. Despite his pretence of leaving the 11-year-olds to their own devices, Sherif and his research staff, posing as camp counsellors and caretakers, interfered to engineer the result they wanted. He believed he could make the two groups, called the Pythons and the Panthers, sworn enemies via a series of well-timed “frustration exercises”. These included his assistants stealing items of clothing from the boys’ tents and cutting the rope that held up the Panthers’ homemade flag, in the hope they would blame the Pythons. One of the researchers crushed the Panthers’ tent, flung their suitcases into the bushes and broke a boy’s beloved ukulele. To Sherif’s dismay, however, the children just couldn’t be persuaded to hate each other…The robustness of the boy’s “civilised” values came as a blow to Sherif, making him angry enough to want to punch one of his young academic helpers. It turned out that the strong bonds forged at the beginning of the camp weren’t easily broken. Thankfully, he never did start the forest fire—he aborted the experiment when he realised it wasn’t going to support his hypothesis.
But the Rockefeller Foundation had given Sherif $316,156$38,0001953. In his mind, perhaps, if he came back empty-handed, he would face not just their anger but the ruin of his reputation. So, within a year, he had recruited boys for a second camp, this time in Robbers Cave state park in Oklahoma. He was determined not to repeat the mistakes of Middle Grove.
…At Robbers Cave, things went more to plan. After a tug-of-war in which they were defeated, the Eagles burned the Rattler’s flag. Then all hell broke loose, with raids on cabins, vandalism and food fights. Each moment of confrontation, however, was subtly manipulated by the research team. They egged the boys on, providing them with the means to provoke one another—who else, asks Perry in her book, could have supplied the matches for the flag-burning?
…Sherif was elated. And, with the publication of his findings that same year, his status as world-class scholar was confirmed. The “Robbers Cave experiment” is considered seminal by social psychologists, still one of the best-known examples of “realistic conflict theory”. It is often cited in modern research. But was it scientifically rigorous? And why were the results of the Middle Grove experiment—where the researchers couldn’t get the boys to fight—suppressed? “Sherif was clearly driven by a kind of a passion”, Perry says. “That shaped his view and it also shaped the methods he used. He really did come from that tradition in the 30s of using experiments as demonstrations—as a confirmation, not to try to find something new.” In other words, think of the theory first and then find a way to get the results that match it. If the results say something else? Bury them…“I think people are aware now that there are real ethical problems with Sherif’s research”, she tells me, “but probably much less aware of the backstage [manipulation] that I’ve found. And that’s understandable because the way a scientist writes about their research is accepted at face value.” The published report of Robbers Cave uses studiedly neutral language. “It’s not until you are able to compare the published version with the archival material that you can see how that story is shaped and edited and made more respectable in the process.” That polishing up still happens today, she explains. “I wouldn’t describe him as a charlatan…every journal article, every textbook is written to convince, persuade and to provide evidence for a point of view. So I don’t think Sherif is unusual in that way.”
We replicated and extended Shoda, Mischel, and Peake’s (1990) famous marshmallow study, which showed strong bivariate correlations between a child’s ability to delay gratification just before entering school and both adolescent achievement and socioemotional behaviors. Concentrating on children whose mothers had not completed college, we found that an additional minute waited at age 4 predicted a gain of approximately one tenth of a standard deviation in achievement at age 15. But this bivariate correlation was only half the size of those reported in the original studies and was reduced by two thirds in the presence of controls for family background, early cognitive ability, and the home environment. Most of the variation in adolescent achievement came from being able to wait at least 20 s. Associations between delay time and measures of behavioral outcomes at age 15 were much smaller and rarely statistically-significant.
The debate over the value of the International Space Station has overlooked a fundamental question: What is the station’s contribution to scientific knowledge? We address this question using a multivariate analysis of publication and patent data from station experiments. We find a relatively high probability that ISS experiments with PIs drawnfrom outside NASA will yield refereed publications and, furthermore, that these experiments have non-negligible probabilities of finding publication in high-impact journals or producing government patents. However, technology demonstrations and experiments with all-NASA PIs have much weaker track records. These results highlight the complexities inherent to constructing a compelling case for science onboard the ISS or for crewed spaceflight in general.
In 1961, the National Institutes of Health (NIH) began to circulate biological preprints in a forgotten experiment called the Information Exchange Groups (IEGs). This system eventually attracted over 3,600 participants and saw the production of over 2,500 different documents, but by 1967, it was effectively shut down following the refusal of journals to accept articles that had been circulated as preprints. This article charts the rise and fall of the IEGs and explores the parallels with the 1990s and the biomedical preprint movement of today.
Background: Symptomatic relief is the primary goal of percutaneous coronary intervention (PCI) in stable angina and is commonly observed clinically. However, there is no evidence from blinded, placebo-controlled randomised trials to show its efficacy.
Methods: ORBITA is a blinded, multicentre randomised trial of PCI versus a placebo procedure for angina relief that was done at five study sites in the UK. We enrolled patients with severe (≥70%) single-vessel stenoses. After enrolment, patients received 6 weeks of medication optimisation. Patients then had pre-randomisation assessments with cardiopulmonary exercise testing, symptom questionnaires, and dobutamine stress echocardiography. Patients were randomised 1:1 to undergo PCI or a placebo procedure by use of an automated online randomisation tool. After 6 weeks of follow-up, the assessments done before randomisation were repeated at the final assessment. The primary endpoint was difference in exercise time increment between groups. All analyses were based on the intention-to-treat principle and the study population contained all participants who underwent randomisation. This study is registered with ClinicalTrials.gov, number NCT02062593.
Findings: ORBITA enrolled 230 patients with ischaemic symptoms. After the medication optimisation phase and between Jan 6, 2014, and Aug 11, 2017, 200 patients underwent randomisation, with 105 patients assigned PCI and 95 assigned the placebo procedure. Lesions had mean area stenosis of 84.4% (SD 10.2), fractional flow reserve of 0.69 (0.16), and instantaneous wave-free ratio of 0.76 (0.22). There was no statistically-significant difference in the primary endpoint of exercise time increment between groups (PCI minus placebo 16.6 s, 95% CI −8.9 to 42.0, p = 0.200). There were no deaths. Serious adverse events included four pressure-wire related complications in the placebo group, which required PCI, and five major bleeding events,including two in the PCI group and three in the placebo group.
Interpretation: In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure. The efficacy of invasive procedures can be assessed with a placebo control, as is standard for pharmacotherapy.
Biologists establish the existence of experimental effects by applying treatments or interventions to biological entities or units, such as people, animals, slice preparations, or cells. When done appropriately, independent replication of the entity-intervention pair contributes to the sample size (N) and forms the basis of statistical inference. However, sometimes the appropriate entity-intervention pair may not be obvious, and the wrong choice can make an experiment worthless. We surveyed a random sample of published animal experiments from 2011 to 2016 where interventions were applied to parents but effects examined in the offspring, as regulatory authorities have provided clear guidelines on replication with such designs. We found that only 22% of studies (95% CI = 17% to 29%) replicated the correct entity-intervention pair and thus made valid statistical inferences. Approximately half of the studies (46%, 95% CI = 38% to 53%) had pseudoreplication while 32% (95% CI = 26% to 39%) provided insufficient information to make a judgement. Pseudoreplication artificially inflates the sample size, leading to more false positive results and inflating the apparent evidence supporting a scientific claim. It is hard for science to advance when so many experiments are poorly designed and analysed. We argue that distinguishing between biological units, experimental units, and observational units clarifies where replication should occur, describe the criteria for genuine replication, and provide guidelines for designing and analysing in vitro, ex vivo, and in vivo experiments.
A key sticking point of Bayesian analysis is the choice of prior distribution, and there is a vast literature on potential defaults including uniform priors, Jeffreys’ priors, reference priors, maximum entropy priors, and weakly informative priors. These methods, however, often manifest a key conceptual tension in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood. In this paper we resolve this apparent paradox by placing the choice of prior into the context of the entire Bayesian analysis, from inference to prediction to model evaluation.
About 15 years ago, one of us (G.J.L.) got an uncomfortable phone call from a colleague and collaborator. After nearly a year of frustrating experiments, this colleague was about to publish a paper1 chronicling his team’s inability to reproduce the results of our high-profile paper2 in a mainstream journal. Our study was the first to show clearly that a drug-like molecule could extend an animal’s lifespan. We had found over and over again that the treatment lengthened the life of a roundworm by as much as 67%. Numerous phone calls and e-mails failed to identify why this apparently simple experiment produced different results between the labs. Then another lab failed to replicate our study. Despite more experiments and additional publications, we couldn’t work out why the labs were getting different lifespan results. To this day, we still don’t know. A few years later, the same scenario played out with different compounds in other labs…In another, now-famous example, two cancer labs spent more than a year trying to understand inconsistencies6. It took scientists working side by side on the same tumour biopsy to reveal that small differences in how they isolated cells—vigorous stirring versus prolonged gentle rocking—produced different results. Subtle tinkering has long been important in getting biology experiments to work. Before researchers purchased kits of reagents for common experiments, it wasn’t unheard of for a team to cart distilled water from one institution when it moved to another. Lab members would spend months tweaking conditions until experiments with the new institution’s water worked as well as before. Sources of variation include the quality and purity of reagents, daily fluctuations in microenvironment and the idiosyncratic techniques of investigators7. With so many ways of getting it wrong, perhaps we should be surprised at how often experimental findings are reproducible.
…Nonetheless, scores of publications continued to appear with claims about compounds that slow ageing. There was little effort at replication. In 2013, the three of us were charged with that unglamorous task…Our first task, to develop a protocol, seemed straightforward.
But subtle disparities were endless. In one particularly painful teleconference, we spent an hour debating the proper procedure for picking up worms and placing them on new agar plates. Some batches of worms lived a full day longer with gentler technicians. Because a worm’s lifespan is only about 20 days, this is a big deal. Hundreds of e-mails and many teleconferences later, we converged on a technique but still had a stupendous three-day difference in lifespan between labs. The problem, it turned out, was notation—one lab determined age on the basis of when an egg hatched, others on when it was laid. We decided to buy shared batches of reagents from the start. Coordination was a nightmare; we arranged with suppliers to give us the same lot numbers and elected to change lots at the same time. We grew worms and their food from a common stock and had strict rules for handling. We established protocols that included precise positions of flasks in autoclave runs. We purchased worm incubators at the same time, from the same vendor. We also needed to cope with a large amount of data going from each lab to a single database. We wrote an iPad app so that measurements were entered directly into the system and not jotted on paper to be entered later. The app prompted us to include full descriptors for each plate of worms, and ensured that data and metadata for each experiment were proofread (the strain names MY16 and my16 are not the same). This simple technology removed small recording errors that could disproportionately affect statistical analyses.
Once this system was in place, variability between labs decreased. After more than a year of pilot experiments and discussion of methods in excruciating detail, we almost completely eliminated systematic differences in worm survival across our labs (see ‘Worm wonders’)…Even in a single lab performing apparently identical experiments, we could not eliminate run-to-run differences.
…We have found one compound that lengthens lifespan across all strains and species. Most do so in only two or three strains, and often show detrimental effects in others.
I was listening to a recent Radiolab episode on blame and guilt, where the guest Robert Sapolsky mentioned a famous study on judges handing out harsher sentences before lunch than after lunch…During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to 0% over the number of cases that are decided on. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn’t believe my eyes. Not only is the drop indeed as large as mentioned—it occurs three times in a row over the course of the day, and after a break, it returns to exactly 65%!
…Some people dislike statistics. They are only interested in effects that are so large, you can see them by just plotting the data. This study might seem to be a convincing illustration of such an effect. My goal in this blog is to argue against this idea. You need statistics, maybe especially when effects are so large they jump out at you. When reporting findings, authors should report and interpret effect sizes. An important reason for this is that effects can be impossibly large.
…If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45AM. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just like manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal—you would already know it exists. Sort of how the “after lunch dip” is a strong and replicable finding that you can feel yourself (and that, as it happens, is directly in conflict with the finding that judges perform better immediately after lunch—surprisingly, the authors don’t discuss the after lunch dip).
…I think it is telling that most psychologists don’t seem to be able to recognize data patterns that are too large to be caused by psychological mechanisms. There are simply no plausible psychological effects that are strong enough to cause the data pattern in the hungry judges study. Implausibility is not a reason to completely dismiss empirical findings, but impossibility is. It is up to authors to interpret the effect size in their study, and to show the mechanism through which an effect that is impossibly large, becomes plausible. Without such an explanation, the finding should simply be dismissed.
A long tradition of scholarship, from ancient Greece to Marxism or some contemporary social psychology, portrays humans as strongly gullible—wont to accept harmful messages by being unduly deferent. However, if humans are reasonably well adapted, they should not be strongly gullible: they should be vigilant toward communicated information. Evidence from experimental psychology reveals that humans are equipped with well-functioning mechanisms of epistemic vigilance. They check the plausibility of messages against their background beliefs, calibrate their trust as a function of the source’s competence and benevolence, and critically evaluate arguments offered to them. Even if humans are equipped with well-functioning mechanisms of epistemic vigilance, an adaptive lag might render them gullible in the face of new challenges, from clever marketing to omnipresent propaganda. I review evidence from different cultural domains often taken as proof of strong gullibility: religion, demagoguery, propaganda, political campaigns, advertising, erroneous medical beliefs, and rumors. Converging evidence reveals that communication is much less influential than often believed—that religious proselytizing, propaganda, advertising, and so forth are generally not very effective at changing people’s minds. Beliefs that lead to costly behavior are even less likely to be accepted. Finally, it is also argued that most cases of acceptance of misguided communicated information do not stem from undue deference, but from a fit between the communicated information and the audience’s preexisting beliefs.
The Shannon-Wiener index is a popular nonparametric metric widely used in ecological research as a measure of species diversity. We used the Web of Science database to examine cases where papers published from 1990 to 2015 mislabeled this index. We provide detailed insights into causes potentially affecting use of the wrong name ‘Weaver’ instead of the correct ‘Wiener’. Basic science serves as a fundamental information source for applied research, so we emphasize the effect of the type of research (applied or basic) on the incidence of the error. Biological research, especially applied studies, increasingly uses indices, even though some researchers have strongly criticized their use. Applied research papers had a higher frequency of the wrong index name than did basic research papers. The mislabeling frequency decreased in both categories over the 25-year period, although the decrease lagged in applied research. Moreover, the index use and mistake proportion differed by region and authors’ countries of origin. Our study also provides insight into citation culture, and results suggest that almost 50% of authors have not actually read their cited sources. Applied research scientists in particular should be more cautious during manuscript preparation, carefully select sources from basic research, and read theoretical background articles before they apply the theories to their research. Moreover, theoretical ecologists should liaise with applied researchers and present their research for the broader scientific community. Researchers should point out known, often-repeated errors and phenomena not only in specialized books and journals but also in widely used and fundamental literature.
The sampling frame was constructed from telephone directories and automobile registration lists, and the survey had a 24% response rate. But if information collected by the poll about votes cast in 1932 had been used to weight the results, the poll would have predicted a majority of electoral votes for Roosevelt in 1936, and thus would have correctly predicted the winner of the election.
We explore alternative weighting methods for the 1936 poll and the models that support them. While weighting would have resulted in Roosevelt being projected as the winner, the bias in the estimates is still very large.
We discuss implications of these results for today’s low-response rate surveys and how the accuracy of the modeling might be reflected better than current practice.
…After every election in which polls err, numerous commentators publish articles about what went wrong with the polls. Gallup’s (1938) commentary was of this type. Deming (1986: p. 319) observed that “Dr. George Gallup remarked in a speech one time (after a fiasco) that he made his prediction in advance of the election. Other people, smarter, made their predictions after the election, explaining how it all happened.” In some respects, post hoc explanations view the polling inadequacies as what Deming called a “special cause” attributable to unusual features of that particular election. However, since the outcomes of elections typically differ from the poll estimates by considerably more than sampling error, Deming would argue that this is a system-level, or “common” cause. Our system of assessing the uncertainty of estimates from surveys is inadequate and this needs to be addressed systematically rather than trying to explain what is wrong with a particular outcome. Unfortunately, the main lesson from 1936 has not yet been learned.
“Impact of genetic background and experimental reproducibility on identifying chemical compounds with robust longevity effects”, Mark Lucanic, W. Todd Plummer, Esteban Chen, Jailynn Harke, Anna C. Foulger, Brian Onken, Anna L. Coleman-Hulbert, Kathleen J. Dumas, Suzhen Guo, Erik Johnson, Dipa Bhaumik, Jian Xue, Anna B. Crist, Michael P. Presley, Girish Harinath, Christine A. Sedore, Manish Chamoli, Shaunak Kamat, Michelle K. Chen, Suzanne Angeli, Christina Chang, John H. Willis, Daniel Edgar, Mary Anne Royal, Elizabeth A. Chao, Shobhna Patel, Theo Garrett, Carolina Ibanez-Ventoso, June Hope, Jason L. Kish, Max Guo, Gordon J. Lithgow, Monica Driscoll, Patrick C. Phillips (2017-02-21; backlinks):
Limiting the debilitating consequences of ageing is a major medical challenge of our time. Robust pharmacological interventions that promote healthy ageing across diverse genetic backgrounds may engage conserved longevity pathways. Here we report results from the Caenorhabditis Intervention Testing Program in assessing longevity variation across 22 Caenorhabditis strains spanning 3 species, using multiple replicates collected across three independent laboratories. Reproducibility between test sites is high, whereas individual trial reproducibility is relatively low. Of ten pro-longevity chemicals tested, six statistically-significantly extend lifespan in at least one strain. Three reported dietary restriction mimetics are mainly effective across C. elegans strains, indicating species and strain-specific responses. In contrast, the amyloid dye ThioflavinT is both potent and robust across the strains. Our results highlight promising pharmacological leads and demonstrate the importance of assessing lifespans of discrete cohorts across repeat studies to capture biological variation in the search for reproducible ageing interventions.
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64–1.46) for nominally statistically-significant results and D = 0.24 (0.11–0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
Biomedical science, psychology, and many other fields may be suffering from a serious replication crisis. In order to gain insight into some factors behind this crisis, we have analyzed statistical information extracted from thousands of cognitive neuroscience and psychology research papers. We established that the statistical power to discover existing relationships has not improved during the past half century. A consequence of low statistical power is that research studies are likely to report many false positive findings. Using our large dataset, we estimated the probability that a statistically-significant finding is false (called false report probability). With some reasonable assumptions about how often researchers come up with correct hypotheses, we conclude that more than 50% of published findings deemed to be statistically-significant are likely to be false. We also observed that cognitive neuroscience studies had higher false report probability than psychology studies, due to smaller sample sizes in cognitive neuroscience. In addition, the higher the impact factors of the journals in which the studies were published, the lower was the statistical power. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
Music training is thought to improve youngsters’ cognitive and academic skills.
Results show a small overall effect size (d = 0.16, K = 118).
Music training seems to moderately enhance youngsters’ intelligence and memory.
The design quality of the studies is negatively related to the size of the effects.
Future studies should include random assignment and active control groups.
Music training has been recently claimed to enhance children and young adolescents’ cognitive and academic skills. However, substantive research on transfer of skills suggests that far-transfer—ie., the transfer of skills between 2 areas only loosely related to each other—occurs rarely.
In this meta-analysis, we examined the available experimental evidence regarding the impact of music training on children and young adolescents’ cognitive and academic skills. The results of the random-effects models showed (a) a small overall effect size (d = 0.16); (b) slightly greater effect sizes with regard to intelligence (d = 0.35) and memory-related outcomes (d = 0.34); and (c) an inverse relation between the size of the effects and the methodological quality of the study design.
These results suggest that music training does not reliably enhance children and young adolescents’ cognitive or academic skills, and that previous positive findings were probably due to confounding variables.
[Keywords: music training, transfer, cognitive skills, education, meta-analysis]
[Memoir of an ex-theoretical-physics grad student at the University of Rochester with Sarada Rajeev who gradually became disillusioned with physics research, burned out, and left to work in finance and is now a writer. Henderson was attracted by the life of the mind and the grandeur of uncovering the mysteries of the universe, only to discover that, after the endless triumphs of the 20th century and predicting enormous swathes of empirical experimental data, theoretical physics has drifted and become a branch of abstract mathematics, exploring ever more recondite, simplified, and implausible models in the hopes of obtaining any insight into physics’ intractable problems; one must be brilliant to even understand the questions being asked by the math and incredibly hardworking to make any progress which hasn’t already been tried by even more brilliant physicists of the past (while living in ignominious poverty and terror of not getting a grant or tenure), but one’s entire career may be spent chasing a useless dead end without one having any clue.]
The next thing I knew I was crouched in a chair in Rajeev’s little office, with a notebook on my knee and focused with everything I had on an impromptu lecture he was giving me on an esoteric aspect of some mathematical subject I’d never heard of before. Zeta functions, or elliptic functions, or something like that. I’d barely introduced myself when he’d started banging out equations on his board. Trying to follow was like learning a new game, with strangely shaped pieces and arbitrary rules. It was a challenge, but I was excited to be talking to a real physicist about his real research, even though there was one big question nagging me that I didn’t dare to ask: What does any of this have to do with physics?
…Even a Theory of Everything, I started to realize, might suffer the same fate of multiple interpretations. The Grail could just be a hall of mirrors, with no clear answer to the “What?” or the “How?”—let alone the “Why?” Plus physics had changed since Big Al bestrode it. Mathematical as opposed to physical intuition had become more central, partly because quantum mechanics was such a strange multi-headed beast that it diminished the role that everyday, or even Einstein-level, intuition could play. So much for my dreams of staring out windows and into the secrets of the universe.
…If I did lose my marbles for a while, this is how it started. With cutting my time outside of Bausch and Lomb down to nine hours a day—just enough to pedal my mountain bike back to my bat cave of an apartment each night, sleep, shower, and pedal back in. With filling my file cabinet with boxes and cans of food, and carting in a coffee maker, mini-fridge, and microwave so that I could maximize the time spent at my desk. With feeling guilty after any day that I didn’t make my 15-hour quota. And with exceeding that quota frequently enough that I regularly circumnavigated the clock: staying later and later each night until I was going home in the morning, then in the afternoon, and finally at night again.
…The longer and harder I worked, the more I realized I didn’t know. Papers that took days or weeks to work through cited dozens more that seemed just as essential to digest; the piles on my desk grew rather than shrunk. I discovered the stark difference between classes and research: With no syllabus to guide me I didn’t know how to keep on a path of profitable inquiry. Getting “wonderfully lost” sounded nice, but the reality of being lost, and of re-living, again and again, that first night in the old woman’s house, with all of its doubts and dead-ends and that horrible hissing voice was … something else. At some point, flipping the lights on in the library no longer filled me with excitement but with dread.
…My mental model building was hitting its limits. I’d sit there in Rajeev’s office with him and his other students, or in a seminar given by some visiting luminary, listening and putting each piece in place, and try to fix in memory what I’d built so far. But at some point I’d lose track of how the green stick connected to the red wheel, or whatever, and I’d realize my picture had diverged from reality. Then I’d try toggling between tracing my steps back in memory to repair my mistake and catching all the new pieces still flying in from the talk. Stray pieces would fall to the ground. My model would start falling down. And I would fall hopelessly behind. A year or so of research with Rajeev, and I found myself frustrated and in a fog, sinking deeper into the quicksand but not knowing why. Was it my lack of mathematical background? My grandiose goals? Was I just not intelligent enough?
…I turned 30 during this time and the milestone hit me hard. I was nearly four years into the Ph.D. program, and while my classmates seemed to be systematically marching toward their degrees, collecting data and writing papers, I had no thesis topic and no clear path to graduation. My engineering friends were becoming managers, getting married, buying houses. And there I was entering my fourth decade of life feeling like a pitiful and penniless mole, aimlessly wandering dark empty tunnels at night, coming home to a creepy crypt each morning with nothing to show for it, and checking my bed for bugs before turning out the lights…As I put the final touches on my thesis, I weighed my options. I was broke, burned out, and doubted my ability to go any further in theoretical physics. But mostly, with The Grail now gone and the physics landscape grown so immense, I thought back to Rajeev’s comment about knowing which problems to solve and realized that I still didn’t know what, for me, they were.
Glöckner’s analysis doesn’t prove that extraneous factors weren’t influencing the judges, but he shows how the same effect could be produced by entirely rational judges interacting with the protocols required by the legal system.
The main analysis works like this: we know that favourable rulings take longer than unfavourable ones (~7 mins vs ~5 mins), and we assume that judges are able to guess how long a case will take to rule on before they begin it (from clues like the thickness of the file, the types of request made, the representation the prisoner has and so on). Finally, we assume judges have a time limit in mind for each of the three sessions of the day, and will avoid starting cases which they estimate will overrun the time limit for the current session.
It turns out that this kind of rational time-management is sufficient to generate the drops in favourable outcomes. How this occurs isn’t straightforward and interacts with a quirk of original author’s data presentation (specifically their graph shows the order number of cases when the number of cases in each session varied day to day—so, for example, it shows that the 12th case after a break is least likely to be judged favourably, but there wasn’t always a 12th case in each session. So sessions in which there were more unfavourable cases were more likely to contribute to this data point).
There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
Neuroscience is held back by the fact that it is hard to evaluate if a conclusion is correct; the complexity of the systems under study and their experimental inaccessability make the assessment of algorithmic and data analytic technqiues challenging at best. We thus argue for testing approaches using known artifacts, where the correct interpretation is known. Here we present a microprocessor platform as one such test case. We find that many approaches in neuroscience, when used na•vely, fall short of producing a meaningful understanding.
Individuals responsible for hiring decisions participated in two studies.
We manipulated the information presented to them.
Information about unstructured interviews boosted overconfidence.
A third study showed that overconfidence was linked to fewer payoffs.
In the presence of valid predictors, unstructured interviews can hurt hiring decisions.
Overconfidence is an important bias related to the ability to recognize the limits of one’s knowledge.
The present study examines overconfidence in predictions of job performance for participants presented with information about candidates based solely on standardized tests versus those who also were presented with unstructured interview information. We conducted two studies with individuals responsible for hiring decisions. Results showed that individuals presented with interview information exhibited more overconfidence than individuals presented with test scores only. In a third study, consisting of a betting competition for undergraduate students, larger overconfidence was related to fewer payoffs.
These combined results emphasize the importance of studying confidence and decision-related variables in selection decisions. Furthermore, while previous research has shown that the predictive validity of unstructured interviews is low, this study provides compelling evidence that they not only fail to help personnel selection decisions, but can actually hurt them.
[Keywords: judgment and decision making, behavioral decision theory, overconfidence, hiring decisions, personnel selection, human resource management, Conscientiousness, General Mental Ability, unstructured interviews, evidence-based management]
“Reproducibility and replicability of rodent phenotyping in preclinical studies”, Neri Kafkafi, Joseph Agassi, Elissa J. Chesler, John C. Crabbe, Wim E. Crusio, David Eilam, Robert Gerlai, Ilan Golani, Alex Gomez-Marin, Ruth Heller, Fuad Iraqi, Iman Jaljuli, Natasha A. Karp, Hugh Morgan, George Nicholson, Donald W. Pfaff, S. Helene Richter, Philip B. Stark, Oliver Stiedl, Victoria Stodden, Lisa M. Tarantino, Valter Tucci, William Valdar, Robert W. Williams, Hanno Würbel, Yoav Benjamini (2016-10-17; backlinks):
The scientific community is increasingly concerned with cases of published “discoveries” that are not replicated in further studies. The field of mouse behavioral phenotyping was one of the first to raise this concern, and to relate it to other complicated methodological issues: the complex interaction between genotype and environment; the definitions of behavioral constructs; and the use of the mouse as a model animal for human health and disease mechanisms. In January 2015, researchers from various disciplines including genetics, behavior genetics, neuroscience, ethology, statistics and bioinformatics gathered in Tel Aviv University to discuss these issues. The general consent presented here was that the issue is prevalent and of concern, and should be addressed at the statistical, methodological and policy levels, but is not so severe as to call into question the validity and the usefulness of model organisms as a whole. Well-organized community efforts, coupled with improved data and metadata sharing, were agreed by all to have a key role to play in identifying specific problems and promoting effective solutions. As replicability is related to validity and may also affect generalizability and translation of findings, the implications of the present discussion reach far beyond the issue of replicability of mouse phenotypes but may be highly relevant throughout biomedical research.
[Debunking a remarkably sloppy darknet market paper which screwed up its scraping and somehow concluded that the notorious Silk Road 2, in defiance of all observable evidence & subsequent FBI data, actually sold primarily e-books and hardly any drugs. This study has yet to be retracted.] The development of cryptomarkets has gained increasing attention from academics, including growing scientific literature on the distribution of illegal goods using cryptomarkets. Dolliver’s 2015 article “Evaluating drug trafficking on the Tor Network: Silk Road 2, the Sequel” addresses this theme by evaluating drug trafficking on one of the most well-known cryptomarkets, Silk Road 2.0. The research on cryptomarkets in general—particularly in Dolliver’s article—poses a number of new questions for methodologies. This commentary is structured around a replication of Dolliver’s original study. The replication study is not based on Dolliver’s original dataset, but on a second dataset collected applying the same methodology. We have found that the results produced by Dolliver differ greatly from our replicated study. While a margin of error is to be expected, the inconsistencies we found are too great to attribute to anything other than methodological issues. The analysis and conclusions drawn from studies using these methods are promising and insightful. However, based on the replication of Dolliver’s study, we suggest that researchers using these methodologies consider and that datasets be made available for other researchers, and that methodology and dataset metrics (eg. number of downloaded pages, error logs) are described thoroughly in the context of web-o-metrics and web crawling.
Several mouse models show statistically-significant differences in experimental outcomes at standard sub-thermoneutral (ST, 22–26°C) versus thermoneutral housing temperatures (TT, 30–32°C), including models of cardiovascular disease, obesity, inflammation and atherosclerosis, graft versus host disease and cancer.
NE levels are higher, anti-tumor immunity is impaired, and tumor growth is statistically-significantly enhanced in mice housed at ST compared to TT. NE levels are reduced, immunosuppression is reversed and tumor growth is slowed by housing mice at TT.
Housing temperature should be reported in every study such that potential sources of data bias or non-reproducibility can be identified.
Our opinion is that any experiment designed to understand tumor biology and/or having an immune component could potentially have different outcomes in mice housed at ST versus TT and this should be tested.
The ‘mild’ cold stress caused by standard sub-thermoneutral housing temperatures used for laboratory mice in research institutes is sufficient to statistically-significantly bias conclusions drawn from murine models of several human diseases. We review the data leading to this conclusion, discuss the implications for research and suggest ways to reduce problems in reproducibility and experimental transparency caused by this housing variable. We have found that these cool temperatures suppress endogenous immune responses, skewing tumor growth data and the severity of graft versus host disease, and also increase the therapeutic resistance of tumors. Owing to the potential for ambient temperature to affect energy homeostasis as well as adrenergic stress, both of which could contribute to biased outcomes in murine cancer models, housing temperature should be reported in all publications and considered as a potential source of variability in results between laboratories. Researchers and regulatory agencies should work together to determine whether changes in housing parameters would enhance the use of mouse models in cancer research, as well as for other diseases. Finally, for many years agencies such as the National Cancer Institute (NCI) have encouraged the development of newer and more sophisticated mouse models for cancer research, but we believe that, without an appreciation of how basic murine physiology is affected by ambient temperature, even data from these models is likely to be compromised.
[Keywords: thermoneutrality, tumor microenvironment, immunosuppression, energy balance, metabolism, adrenergic stress]
Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application (http: / / jakewestfall.org / ivy / ) that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity.
A striking contrast runs through the last 60 years of biopharmaceutical discovery, research, and development. Huge scientific and technological gains should have increased the quality of academic science and raised industrial R&D efficiency. However, academia faces a “reproducibility crisis”; inflation-adjusted industrial R&D costs per novel drug increased nearly 100× between 1950 and 2010; and drugs are more likely to fail in clinical development today than in the 1970s. The contrast is explicable only if powerful headwinds reversed the gains and/or if many “gains” have proved illusory. However, discussions of reproducibility and R&D productivity rarely address this point explicitly.
The main objectives of the primary research in this paper are: (a) to provide quantitatively and historically plausible explanations of the contrast; and (b) identify factors to which R&D efficiency is sensitive.
We present a quantitative decision-theoretic model of the R&D process [a ‘leaky pipeline’; cf the log-normal]. The model represents therapeutic candidates (eg., putative drug targets, molecules in a screening library, etc.) within a “measurement space”, with candidates’ positions determined by their performance on a variety of assays (eg., binding affinity, toxicity, in vivo efficacy, etc.) whose results correlate to a greater or lesser degree. We apply decision rules to segment the space, and assess the probability of correct R&D decisions.
We find that when searching for rare positives (eg., candidates that will successfully complete clinical development), changes in the predictive validity of screening and disease models that many people working in drug discovery would regard as small and/or unknowable (ie., an 0.1 absolute change in correlation coefficient between model output and clinical outcomes in man) can offset large (eg., 10×, even 100×) changes in models’ brute-force efficiency. We also show how validity and reproducibility correlate across a population of simulated screening and disease models.
We hypothesize that screening and disease models with high predictive validity are more likely to yield good answers and good treatments, so tend to render themselves and their diseases academically and commercially redundant. Perhaps there has also been too much enthusiasm for reductionist molecular models which have insufficient predictive validity. Thus we hypothesize that the average predictive validity of the stock of academically and industrially “interesting” screening and disease models has declined over time, with even small falls able to offset large gains in scientific knowledge and brute-force efficiency. The rate of creation of valid screening and disease models may be the major constraint on R&D productivity.
For almost 50 years field experiments have been used to study ethnic and racial discrimination in hiring decisions, consistently reporting high rates of discrimination against minority applicants—including immigrants—irrespective of time, location, or minority groups tested. While Peter A. Riach and Judith Rich [2002. “Field Experiments of Discrimination in the Market Place.” The Economic Journal 112 (483): F480–F518] and Judith Rich [2014. “What Do Field Experiments of Discrimination in Markets Tell Us? A Meta Analysis of Studies Conducted since 2000.” In Discussion Paper Series. Bonn: IZA] provide systematic reviews of existing field experiments, no study has undertaken a meta-analysis to examine the findings in the studies reported. In this article, we present a meta-analysis of 738 correspondence tests in 43 separate studies conducted in OECD countries between 1990 and 2015. In addition to summarising research findings, we focus on groups of specific tests to ascertain the robustness of findings, emphasising differences across countries, gender, and economic contexts. Moreover we examine patterns of discrimination, by drawing on the fact that the groups considered in correspondence tests and the contexts of testing vary to some extent. We focus on first-generation and second-generation immigrants, differences between specific minority groups, the implementation of EU directives, and the length of job application packs.
Selecting among alternative projects is a core management task in all innovating organizations. In this paper, we focus on the evaluation of frontier scientific research projects. We argue that the “intellectual distance” between the knowledge embodied in research proposals and an evaluator’s own expertise systematically relates to the evaluations given. To estimate relationships, we designed and executed a grant proposal process at a leading research university in which we randomized the assignment of evaluators and proposals to generate 2,130 evaluator-proposal pairs. We find that evaluators systematically give lower scores to research proposals that are closer to their own areas of expertise and to those that are highly novel. The patterns are consistent with biases associated with boundedly rational evaluation of new ideas. The patterns are inconsistent with intellectual distance simply contributing “noise” or being associated with private interests of evaluators. We discuss implications for policy, managerial intervention, and allocation of resources in the ongoing accumulation of scientific knowledge.
Empirically analyzing empirical evidence: One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/effect relation most often manipulate the postulated causal factor. Aarts et al 2015 describe the replication of 100 experiments reported in papers published in 2008 in 3 high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study.
Introduction: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.
Rationale: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.
Results: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using statistical-significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. 97% of original studies had statistically-significant results (p < 0.05). 36% of replications had statistically-significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically-significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Conclusion: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.
Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
Background: We explore whether the number of null results in large National Heart Lung, and Blood Institute (NHLBI) funded trials has increased over time.
Methods: We identified all large NHLBI supportedRCTs between 1970 and 2012 evaluating drugs or dietary supplements for the treatment or prevention of cardiovascular disease. Trials were included if direct costs >$614,915$500,0002015/year, participants were adult humans, and the primary outcome was cardiovascular risk, disease or death. The 55 trials meeting these criteria were coded for whether they were published prior to or after the year 2000, whether they registered in clinicaltrials.gov prior to publication, used active or placebo comparator, and whether or not the trial had industry co-sponsorship. We tabulated whether the study reported a positive, negative, or null result on the primary outcome variable and for total mortality.
Results: 17 of 30 studies (57%) published prior to 2000 showed a statistically-significant benefit of intervention on the primary outcome in comparison to only 2 among the 25 (8%) trials published after 2000 (χ2 = 12.2, df= 1, p = 0.0005). There has been no change in the proportion of trials that compared treatment to placebo versus active comparator. Industry co-sponsorship was unrelated to the probability of reporting a statistically-significant benefit. Pre-registration in clinical trials.gov was strongly associated with the trend toward null findings.
Conclusions: The number NHLBI trials reporting positive results declined after the year 2000. Prospective declaration of outcomes in RCTs, and the adoption of transparent reporting standards, as required by clinicaltrials.gov, may have contributed to the trend toward null findings.
Paroxetine is a selective serotonin reuptake inhibitor (SSRI) that is currently available on the market and is suspected of causing congenital malformations in babies born to mothers who take the drug during the first trimester of pregnancy.
We utilized organismal performance assays (OPAs), a novel toxicity assessment method, to assess the safety of paroxetine during pregnancy in a rodent model. OPAs utilize genetically diverse wild mice (Mus musculus) to evaluate competitive performance between experimental and control animals as they compete amongst each other for limited resources in semi-natural enclosures. Performance measures included reproductive success, male competitive ability and survivorship.
Paroxetine-exposed males weighed 13% less, had 44% fewer offspring, dominated 53% fewer territories and experienced a 2.5-fold increased trend in mortality, when compared with controls. Paroxetine-exposed females had 65% fewer offspring early in the study, but rebounded at later time points. In cages, paroxetine-exposed breeders took 2.3× longer to produce their first litter and pups of both sexes experienced reduced weight when compared with controls. Low-dose paroxetine-induced health declines detected in this study were undetected in preclinical trials with dose 2.5-8× higher than human therapeutic doses.
These data indicate that OPAs detect phenotypic adversity and provide unique information that could useful towards safety testing during pharmaceutical development.
The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. But, the transition from a paper-based to a web-based scholarly communication system also poses challenges. In this paper, we focus on reference rot, the combination of link rot and content drift to which references to web resources included in Science, Technology, and Medicine (STM) articles are subject. We investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. We do so on the basis of a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we determine whether the HTTPURI is still responsive on the live web and whether web archives contain an archived snapshot representative of the state the referenced resource had at the time it was referenced. We observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten. We suggest that, in order to safeguard the long-term integrity of the web-based scholarly record, robust solutions to combat the reference rot problem are required. In conclusion, we provide a brief insight into the directions that are explored with this regard in the context of the Hiberlink project.
The relative importance of nature and nurture for various forms of expertise has been intensely debated. Music proficiency is viewed as a general model for expertise, and associations between deliberate practice and music proficiency have been interpreted as supporting the prevailing idea that long-term deliberate practice inevitably results in increased music ability.
Here, we examined the associations (rs = 0.18–0.36) between music practice and music ability (rhythm, melody, and pitch discrimination) in 10,500 Swedish twins. We found that music practice was substantially heritable (40%–70%). Associations between music practice and music ability were predominantly genetic, and, contrary to the causal hypothesis, nonshared environmental influences did not contribute. There was no difference in ability within monozygotic twin pairs differing in their amount of practice, so that when genetic predisposition was controlled for, more practice was no longer associated with better music skills.
These findings suggest that music practice may not causally influence music ability and that genetic variation among individuals affects both ability and inclination to practice.
[Keywords: training, expertise, music ability, practice, heritability, twin, causality]
The p value obtained from a significance test provides no information about the magnitude or importance of the underlying phenomenon. Therefore, additional reporting of effect size is often recommended. Effect sizes are theoretically independent from sample size. Yet this may not hold true empirically: non-independence could indicate publication bias.
We investigate whether effect size is independent from sample size in psychological research. We randomly sampled 1,000 psychological articles from all areas of psychological research. We extracted p values, effect sizes, and sample sizes of all empirical papers, and calculated the correlation between effect size and sample size, and investigated the distribution of p values.
We found a negative correlation of r = −.45 [95% CI: −.53; −.35] between effect size and sample size. In addition, we found an inordinately high number of p values just passing the boundary of statistical-significance. Additional data showed that neither implicit nor explicit power analysis could account for this pattern of findings.
The negative correlation between effect size and samples size, and the biased distribution of p values indicate pervasive publication bias in the entire field of psychology.
Ericsson and colleagues argue that deliberate practice explains expert performance.
We tested this view in the two most studied domains in expertise research.
Deliberate practice is not sufficient to explain expert performance.
Other factors must be considered to advance the science of expertise.
Twenty years ago, Ericsson, Krampe, and Tesch-Römer (1993) proposed that expert performance reflects a long period of deliberate practice rather than innate ability, or “talent”. Ericsson et al. found that elite musicians had accumulated thousands of hours more deliberate practice than less accomplished musicians, and concluded that their theoretical framework could provide “a sufficient account of the major facts about the nature and scarcity of exceptional performance” (p. 392). The deliberate practice view has since gained popularity as a theoretical account of expert performance, but here we show that deliberate practice is not sufficient to explain individual differences in performance in the two most widely studied domains in expertise research—chess and music. For researchers interested in advancing the science of expert performance, the task now is to develop and rigorously test theories that take into account as many potentially relevant explanatory constructs as possible.
Allan Crossman calls parapsychology the control group for science. That is, in let’s say a drug testing experiment, you give some people the drug and they recover. That doesn’t tell you much until you give some other people a placebo drug you know doesn’t work—but which they themselves believe in—and see how many of them recover. That number tells you how many people will recover whether the drug works or not. Unless people on your real drug do substantially better than people on the placebo drug, you haven’t found anything. On the meta-level, you’re studying some phenomenon and you get some positive findings. That doesn’t tell you much until you take some other researchers who are studying a phenomenon you know doesn’t exist—but which they themselves believe in—and see how many of them get positive findings. That number tells you how many studies will discover positive results whether the phenomenon is real or not. Unless studies of the real phenomenon do substantially better than studies of the placebo phenomenon, you haven’t found anything.
Trying to set up placebo science would be a logistical nightmare. You’d have to find a phenomenon that definitely doesn’t exist, somehow convince a whole community of scientists across the world that it does, and fund them to study it for a couple of decades without them figuring it out.
Luckily we have a natural experiment in terms of parapsychology—the study of psychic phenomena—which most reasonable people believe don’t exist, but which a community of practicing scientists believes in and publishes papers on all the time. The results are pretty dismal. Parapsychologists are able to produce experimental evidence for psychic phenomena about as easily as normal scientists are able to produce such evidence for normal, non-psychic phenomena. This suggests the existence of a very large “placebo effect” in science—ie with enough energy focused on a subject, you can always produce “experimental evidence” for it that meets the usual scientific standards. As Eliezer Yudkowsky puts it:
Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored—that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter.
An open-access journal allows free online access to its articles, obtaining revenue from fees charged to submitting authors or from institutional support. Using panel data on science journals, we are able to circumvent problems plaguing previous studies of the impact of open access on citations. In contrast to the huge effects found in these previous studies, we find a more modest effect: moving from paid to open access increases cites by 8% on average in our sample. The benefit is concentrated among top-ranked journals. In fact, open access causes a statistically-significant reduction in cites to the bottom-ranked journals in our sample, leading us to conjecture that open access may intensify competition among articles for readers’ attention, generating losers as well as winners. [See also the 2020 followup by the same authors, “Cite Unseen: Theory and Evidence on the Effect of Open Access on Cites to Academic Articles Across the Quality Spectrum”.]
Data-sharing is an essential tool for replication, validation and extension of empirical results. Using a hand-collected data set describing the data-sharing behaviour of 488 randomly selected empirical researchers, we provide evidence that most researchers in economics and management do not share their data voluntarily. We derive testable hypotheses based on the theoretical literature on information-sharing and relate data-sharing to observable characteristics of researchers. We find empirical support for the hypotheses that voluntary data-sharing statistically-significantly increases with (a) academic tenure, (b) the quality of researchers, (c) the share of published articles subject to a mandatory data-disclosure policy of journals, and (d) personal attitudes towards “open science” principles. On the basis of our empirical evidence, we discuss a set of policy recommendations.
Policies ensuring that research data are available on public archives are increasingly being implemented at the government , funding agency [2-4], and journal [5,6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term , and indeed many studies have found that authors are often unable or unwilling to share their data [8-11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested datasets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a dataset being extant fell by 17% per year. In addition, the odds that we could find a working email address for the first, last or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
Industry-sponsored clinical drug studies are associated with publication of outcomes that favor the sponsor, even when controlling for potential bias in the methods used. However, the influence of sponsorship bias has not been examined in preclinical animal studies.
We performed a meta-analysis of preclinical statin studies to determine whether industry sponsorship is associated with either increased effect sizes of efficacy outcomes and/or risks of bias in a cohort of published preclinical statin studies. We searched Medline (January 1966–April 2012) and identified 63 studies evaluating the effects of statins on atherosclerosis outcomes in animals. Two coders independently extracted study design criteria aimed at reducing bias, results for all relevant outcomes, sponsorship source, and investigator financial ties. The I2 statistic was used to examine heterogeneity. We calculated the standardized mean difference (SMD) for each outcome and pooled data across studies to estimate the pooled average SMD using random effects models. In a priori subgroup analyses, we assessed statin efficacy by outcome measured, sponsorship source, presence or absence of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization.
The effect of statins was statistically-significantly larger for studies sponsored by nonindustry sources (−1.99; 95% CI −2.68, −1.31) versus studies sponsored by industry (−0.73; 95% CI −1.00, −0.47) (p < 0.001). Statin efficacy did not differ by disclosure of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization. Possible reasons for the differences between nonindustry-sponsored and industry-sponsored studies, such as selective reporting of outcomes, require further study.
Author Summary: Industry-sponsored clinical drug studies are associated with publication of outcomes that favor the sponsor, even when controlling for potential bias in the methods used. However, the influence of sponsorship bias has not been examined in preclinical animal studies. We performed a meta-analysis to identify whether industry sponsorship is associated with increased risks of bias or effect sizes of outcomes in a cohort of published preclinical studies of the effects of statins on outcomes related to atherosclerosis. We found that in contrast to clinical studies, the effect of statins was statistically-significantly larger for studies sponsored by nonindustry sources versus studies sponsored by industry. Furthermore, statin efficacy did not differ with respect to disclosure of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization. Possible reasons for the differences between nonindustry-sponsored and industry-sponsored studies, such as selective outcome reporting, require further study. Overall, our findings provide empirical evidence regarding the impact of funding and other methodological criteria on research outcomes.
Why does performing certain tasks cause the aversive experience of mental effort and concomitant deterioration in task performance? One explanation posits a physical resource that is depleted over time. We propose an alternative explanation that centers on mental representations of the costs and benefits associated with task performance. Specifically, certain computational mechanisms, especially those associated with executive function, can be deployed for only a limited number of simultaneous tasks at any given moment. Consequently, the deployment of these computational mechanisms carries an opportunity cost – that is, the next-best use to which these systems might be put. We argue that the phenomenology of effort can be understood as the felt output of these cost/benefit computations. In turn, the subjective experience of effort motivates reduced deployment of these computational mechanisms in the service of the present task. These opportunity cost representations, then, together with other cost/benefit calculations, determine effort expended and, everything else equal, result in performance reductions. In making our case for this position, we review alternative explanations for both the phenomenology of effort associated with these tasks and for performance reductions over time. Likewise, we review the broad range of relevant empirical results from across sub-disciplines, especially psychology and neuroscience. We hope that our proposal will help to build links among the diverse fields that have been addressing similar questions from different perspectives, and we emphasize ways in which alternative models might be empirically distinguished.
Why does performing certain tasks cause the aversive experience of mental effort and concomitant deterioration in task performance? One explanation posits a physical resource that is depleted over time. We propose an alternative explanation that centers on mental representations of the costs and benefits associated with task performance. Specifically, certain computational mechanisms, especially those associated with executive function, can be deployed for only a limited number of simultaneous tasks at any given moment. Consequently, the deployment of these computational mechanisms carries an opportunity cost—that is, the next-best use to which these systems might be put. We argue that the phenomenology of effort can be understood as the felt output of these cost/benefit computations. In turn, the subjective experience of effort motivates reduced deployment of these computational mechanisms in the service of the present task. These opportunity cost representations, then, together with other cost/benefit calculations, determine effort expended and, everything else equal, result in performance reductions. In making our case for this position, we review alternative explanations for both the phenomenology of effort associated with these tasks and for performance reductions over time. Likewise, we review the broad range of relevant empirical results from across sub-disciplines, especially psychology and neuroscience. We hope that our proposal will help to build links among the diverse fields that have been addressing similar questions from different perspectives, and we emphasize ways in which alternative models might be empirically distinguished.
Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy.
In three studies, we investigated the propensity for “sensemaking”—the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution”—the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid backgroundinformation like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview.
People form confident impressions even interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.
[Keywords: unstructured interview, random interview, clinical judgment, actuarial judgment]
The evaluation of 160 meta-analyses of animal studies on potential treatments for neurological disorders reveals that the number of statistically-significant results was too large to be true, suggesting biases.
Animal studies generate valuable hypotheses that lead to the conduct of preventive or therapeutic clinical trials. We assessed whether there is evidence for excess statistical-significance in results of animal studies on neurological disorders, suggesting biases. We used data from meta-analyses of interventions deposited in Collaborative Approach to Meta-Analysis and Review of Animal Data in Experimental Studies (CAMARADES). The number of observed studies with statistically-significant results (O) was compared with the expected number (E), based on the statistical power of each study under different assumptions for the plausible effect size. We assessed 4,445 datasets synthesized in 160 meta-analyses on Alzheimer disease (n = 2), experimental autoimmune encephalomyelitis (n = 34), focal ischemia (n = 16), intracerebral hemorrhage (n = 61), Parkinson disease (n = 45), and spinal cord injury (n = 2). 112 meta-analyses (70%) found nominally (p≤0.05) statistically-significant summary fixed effects. Assuming the effect size in the most precise study to be a plausible effect, 919 out of 4,445 nominally statistically-significant results were expected versus 1,719 observed (p<10−9). Excess significance was present across all neurological disorders, in all subgroups defined by methodological characteristics, and also according to alternative plausible effects. Asymmetry tests also showed evidence of small-study effects in 74 (46%) meta-analyses. Significantly effective interventions with more than 500 animals, and no hints of bias were seen in eight (5%) meta-analyses. Overall, there are too many animal studies with statistically-significant results in the literature of neurological disorders. This observation suggests strong biases, with selective analysis and outcome reporting biases being plausible explanations, and provides novel evidence on how these biases might influence the whole research domain of neurological animal literature.
Studies have shown that the results of animal biomedical experiments fail to translate into human clinical trials; this could be attributed either to real differences in the underlying biology between humans and animals, to shortcomings in the experimental design, or to bias in the reporting of results from the animal studies. We use a statistical technique to evaluate whether the number of published animal studies with “positive” (statistically-significant) results is too large to be true. We assess 4,445 animal studies for 160 candidate treatments of neurological disorders, and observe that 1,719 of them have a “positive” result, whereas only 919 studies would a priori be expected to have such a result. According to our methodology, only eight of the 160 evaluated treatments should have been subsequently tested in humans. In summary, we judge that there are too many animal studies with “positive” results in the neurological disorder literature, and we discuss the reasons and potential remedies for this phenomenon.
The pharmaceutical and biotechnology industries depend on findings from academic investigators prior to initiating programs to develop new diagnostic and therapeutic agents to benefit cancer patients. The success of these programs depends on the validity of published findings. This validity, represented by the reproducibility of published findings, has come into question recently as investigators from companies have raised the issue of poor reproducibility of published results from academic laboratories. Furthermore, retraction rates in high impact journals are climbing.
Methods and Findings:
To examine a microcosm of the academic experience with data reproducibility, we surveyed the faculty and trainees at MD Anderson Cancer Center using an anonymous computerized questionnaire; we sought to ascertain the frequency and potential causes of non-reproducible data. We found that ~50% of respondents had experienced at least one episode of the inability to reproduce published data; many who pursued this issue with the original authors were never able to identify the reason for the lack of reproducibility; some were even met with a less than “collegial” interaction.
These results suggest that the problem of data reproducibility is real. Biomedical science needs to establish processes to decrease the problem and adjudicate discrepancies in findings when they are discovered.
Journals favor rejection of the null hypothesis. This selection upon tests may distort the behavior of researchers. Using 50,000 tests published between 2005 and 2011 in the AER, JPE, and QJE, we identify a residual in the distribution of tests that cannot be explained by selection. The distribution of p-values exhibits a camel shape with abundant p-values above 0.25, a valley between 0.25 and 0.10 and a bump slightly below 0.05. The missing tests (with p-values between 0.25 and 0.10) can be retrieved just after the 0.05 threshold and represent 10% to 20% of marginally rejected tests. Our interpretation is that researchers might be tempted to inflate the value of those almost-rejected tests by choosing a “significant” specification. We propose a method to measure inflation and decompose it along articles’ and authors’ characteristics.
The accuracy of published medical research is critical both for scientists, physicians and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature using reported p-values as the data. We then collect p-values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a statistically-significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, p = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, p = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.
Unlabelled: Misled by animal studies and basic research? Whenever we take a closer look at the outcome of clinical trials in a field such as, most recently, stroke or septic shock, we see how limited the value of our preclinical models was. For all indications, 95% of drugs that enter clinical trials do not make it to the market, despite all promise of the (animal) models used to develop them. Drug development has started already to decrease its reliance on animal models: In Europe, for example, despite increasing R&D expenditure, animal use by pharmaceutical companies dropped by more than 25% from 2005 to 2008. In vitro studies are likewise limited: questionable cell authenticity, over-passaging, mycoplasma infections, and lack of differentiation as well as non-homeostatic and non-physiologic culture conditions endanger the relevance of these models. The standards of statistics and reporting often are poor, further impairing reliability. Alarming studies from industry show miserable reproducibility of landmark studies. This paper discusses factors contributing to the lack of reproducibility and relevance of pre-clinical research.
The Conclusion: Publish less but of better quality and do not rely on the face value of animal studies.
Since the establishment of the Institute for Education Sciences (IES)within the U.S. Department of Education in 2002, IES has commissioned a sizable number of well-conducted randomized controlled trials (RCTs) evaluating the effectiveness of diverse educational programs, practices, and strategies (“interventions”). These interventions have included, for example, various educational curricula, teacher professional development programs, school choice programs, educational software, and data-driven school reform initiatives. Largely as a result of these IES studies, there now exists—for the first time in U.S. education—a sizable body of credible knowledge about what works and what doesn’t work to improve key educational outcomes of American students. A clear pattern of findings in these IES studies is that the large majority of interventions evaluated produced weak or no positive effects compared to usual school practices. This pattern is consistent with findings in other fields where RCTs are frequently carried out, such as medicine and business,1 and underscores the need to test many different interventions so as to build the number shown to work.
The scientific credibility of economics is itself a scientific question that can be addressed with both theoretical speculations and empirical data.
In this review, we examine the major parameters that are expected to affect the credibility of empirical economics: sample size, magnitude of pursued effects, number and pre-selection of tested relationships, flexibility and lack of standardization in designs, definitions, outcomes and analyses, financial and other interests and prejudices, and the multiplicity and fragmentation of efforts.
We summarize and discuss the empirical evidence on the lack of a robust reproducibility culture in economics and business research, the prevalence of potential publication and other selective reporting biases, and other failures and biases in the market of scientific information. Overall, the credibility of the economics literature is likely to be modest or even low.
In null hypothesis statistical-significancetesting (NHST), p-values are judged relative to an arbitrary threshold for statistical-significance (0.05). The present work examined whether that standard influences the distribution of p-values reported in the psychology literature.
We examined a large subset of papers from 3 highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p-values were much more common immediately below 0.05 than would be expected based on the number of p-values occurring in other ranges. This prevalence of p-values just below the arbitrary criterion for statistical-significance was observed in all 3 journals.
We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.
Personality psychology aims to explain the causes and the consequences of variation in behavioural traits. Because of the observational nature of the pertinent data, this endeavour has provoked many controversies. In recent years, the computer scientist Judea Pearl has used a graphical approach to extend the innovations in causal inference developed by Ronald Fisher and Sewall Wright. Besides shedding much light on the philosophical notion of causality itself, this graphical framework now contains many powerful concepts of relevance to the controversies just mentioned. In this article, some of these concepts are applied to areas of personality research where questions of causation arise, including the analysis of observational data and the genetic sources of individual differences.
In the aftermath of many natural and man-made disasters, people often wonder why those affected were underprepared, especially when the disaster was the result of known or regularly occurring hazards (eg., hurricanes). We study one contributing factor: prior near-miss experiences. Near misses are events that have some nontrivial expectation of ending in disaster but, by chance, do not. We demonstrate that when near misses are interpreted as disasters that did not occur, people illegitimately underestimate the danger of subsequent hazardous situations and make riskier decisions (eg., choosing not to engage in mitigation activities for the potential hazard). On the other hand, if near misses can be recognized and interpreted as disasters that almost happened, this will counter the basic “near-miss” effect and encourage more mitigation. We illustrate the robustness of this pattern across populations with varying levels of real expertise with hazards and different hazard contexts (household evacuation for a hurricane, Caribbean cruises during hurricane season, and deep-water oil drilling). We conclude with ideas to help people manage and communicate about risk.
General intelligence (g) and virtually all other behavioral traits are heritable. Associations between g and specific single-nucleotide polymorphisms (SNPs) in several candidate genes involved in brain function have been reported. We sought to replicate published associations between g and 12 specific genetic variants (in the genes DTNBP1, CTSD, DRD2, ANKK1, CHRM2, SSADH, COMT, BDNF, CHRNA4, DISC1,APOE, and SNAP25) using data sets from three independent, well-characterized longitudinal studies with samples of 5,571, 1,759, and 2,441 individuals. Of 32 independent tests across all three data sets, only 1 was nominally statistically-significant. By contrast, power analyses showed that we should have expected 10 to 15 statistically-significant associations, given reasonable assumptions for genotype effect sizes. For positive controls, we confirmed accepted genetic associations for Alzheimer’s disease and body mass index, and we used SNP-based calculations of genetic relatedness to replicate previous estimates that about half of the variance in g is accounted for by common genetic variation among individuals. We conclude that the molecular genetics of psychology and social science requires approaches that go beyond the examination of candidate genes.
Cases of clear scientific misconduct have received substantial media attention recently, but less flagrantly questionable research practices may be more prevalent and, ultimately, more damaging to the academic enterprise.
Using an anonymous elicitation format supplemented by incentives for honest reporting, we surveyed over 2,000 psychologists about their involvement in questionable research practices. The impact of truth-telling incentives on self-admissions of questionable research practices was positive, and this impact was greater for practices that respondents judged to be less defensible.
Combining three different estimation methods, we found that the percentage of respondents who have engaged in questionable practices was surprisingly high. This finding suggests that some questionable practices may constitute the prevailing research norm.
[Keywords: professional standards, judgment, disclosure, methodology]
The widespread reluctance to share published research data is often hypothesized to be due to the authors’ fear that reanalysis may expose errors in their work or may produce conclusions that contradict their own. However, these hypotheses have not previously been studied systematically.
Methods and Findings:
We related the reluctance to share research data for reanalysis to 1148 statistically-significant results reported in 49 papers published in two major psychology journals. We found the reluctance to share data to be associated with weaker evidence (against the null hypothesis of no effect) and a higher prevalence of apparent errors in the reporting of statistical results. The unwillingness to share data was particularly clear when reporting errors had a bearing on statistical-significance.
Our findings on the basis of psychological papers suggest that statistical results are particularly hard to verify when reanalysis is more likely to lead to contrasting conclusions. This highlights the importance of establishing mandatory data archiving policies.
Concerns that the growing competition for funding and citations might distort science are frequently discussed, but have not been verified directly. Of the hypothesized problems, perhaps the most worrying is a worsening of positive-outcome bias. A system that disfavours negative results not only distorts the scientific literature directly, but might also discourage high-risk projects and pressure scientists to fabricate and falsify their data. This study analysed over 4,600 papers published in all disciplines between 1990 and 2007, measuring the frequency of papers that, having declared to have “tested” a hypothesis, reported a positive support for it. The overall frequency of positive supports has grown by over 22% between 1990 and 2007, with statistically-significant differences between disciplines and countries. The increase was stronger in the social and some biomedical disciplines. The United States had published, over the years, statistically-significantly fewer positive results than Asian countries (and particularly Japan) but more than European countries (and in particular the United Kingdom). Methodological artefacts cannot explain away these patterns, which support the hypotheses that research is becoming less pioneering and/or that the objectivity with which results are produced and published is decreasing.
Artifact is present when electrical potentials that are not brain derived are recorded on the EEG and is commonly encountered during interpretation.
Many artifacts obscure the tracing, while others reflect physiologic functions that are crucial for routine visual analysis. Both physiologic and nonphysiologic sources of artifact may act as source of confusion with abnormality and lead to misinterpretation. Identifying the mismatch between potentials that are generated by the brain from activity that does not conform to a realistic head model is the foundation for recognizing artifact. Electroencephalographers are challenged with the task of correct interpretations among the many artifacts that could potentially be misleading, resulting in an incorrect diagnosis and treatment that may adversely impact patient care. Despite advances in digital EEG, artifact identification, recognition, and elimination are essential for correct interpretation of the EEG.
The authors discuss recording concepts for interpreting EEG that contains artifact.
This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a standard benchmark database to a small size, taking into account the length of the compressor itself. This methodology therefore includes an Occam principle as well as a solution to the problem of demarcation. Because of the fundamental difficulty of lossless compression, this type of research must be empirical in nature: compression can only be achieved by discovering and characterizing empirical regularities in the data. Because of this, the philosophy provides a way to reformulate fields such as computer vision and computational linguistics as empirical sciences: the former by attempting to compress databases of natural images, the latter by attempting to compress large text databases. The book argues that the rigor and objectivity of the compression principle should set the stage for systematic progress in these fields. The argument is especially strong in the context of computer vision, which is plagued by chronic problems of evaluation.
The book also considers the field of machine learning. Here the traditional approach requires that the models proposed to solve learning problems be extremely simple, in order to avoid overfitting. However, the world may contain intrinsically complex phenomena, which would require complex models to understand. The compression philosophy can justify complex models because of the large quantity of data being modeled (if the target database is 100 Gb, it is easy to justify a 10 Mb model). The complex models and abstractions learned on the basis of the raw data (images, language, etc) can then be reused to solve any specific learning problem, such as face recognition or machine translation.
Objective: Gene-by-environment interaction (G×E) studies in psychiatry have typically been conducted using a candidate G×E (cG×E) approach, analogous to the candidate gene association approach used to test genetic main effects. Such cG×E research has received widespread attention and acclaim, yet cG×E findings remain controversial. The authors examined whether the many positive cG×E findings reported in the psychiatric literature were robust or if, in aggregate, cG×E findings were consistent with the existence of publication bias, low statistical power, and a high false discovery rate.
Method: The authors conducted analyses on data extracted from all published studies (103 studies) from the first decade (2000-2009) of cG×E research in psychiatry.
Results: Ninety-six percent of novel cG×E studies were significant compared with 27% of replication attempts. These findings are consistent with the existence of publication bias among novel cG×E studies, making cG×E hypotheses appear more robust than they actually are. There also appears to be publication bias among replication attempts because positive replication attempts had smaller average sample sizes than negative ones. Power calculations using observed sample sizes suggest that cG×E studies are underpowered. Low power along with the likely low prior probability of a given cG×E hypothesis being true suggests that most or even all positive cG×E findings represent type I errors.
Conclusions: In this new era of big data and small effects, a recalibration of views about groundbreaking findings is necessary. Well-powered direct replications deserve more attention than novel cG×E findings and indirect replications.
One of the virtues of peer review is that it provides a self-regulating selection mechanism for scientific work, papers and projects. Peer review as a selection mechanism is hard to evaluate in terms of its efficiency. Serious efforts to understand its strengths and weaknesses have not yet lead to clear answers. In theory peer review works if the involved parties (editors and referees) conform to a set of requirements, such as love for high quality science, objectiveness, and absence of biases, nepotism, friend and clique networks, selfishness, etc. If these requirements are violated, what is the effect on the selection of high quality work? We study this question with a simple agent based model. In particular we are interested in the effects of rational referees, who might not have any incentive to see high quality work other than their own published or promoted. We find that a small fraction of incorrect (selfish or rational) referees can drastically reduce the quality of the published (accepted) scientific standard. We quantify the fraction for which peer review will no longer select better than pure chance. Decline of quality of accepted scientific work is shown as a function of the fraction of rational and unqualified referees. We show how a simple quality-increasing policy of e.g. a journal can lead to a loss in overall scientific quality, and how mutual support-networks of authors and referees deteriorate the system.
Background: In other neurological diseases, the failure to translate pre-clinical findings to effective clinical treatments has been partially attributed to bias introduced by shortcomings in the design of animal experiments.
Objectives: Here we evaluate published studies of interventions in animal models of multiple sclerosis for methodological design and quality and to identify candidate interventions with the best evidence of efficacy.
Methods: A systematic review of the literature describing experiments testing the effectiveness of interventions in animal models of multiple sclerosis was carried out. Data were extracted for reported study quality and design and for neurobehavioural outcome. Weighted mean difference meta-analysis was used to provide summary estimates of the efficacy for drugs where this was reported in five or more publications.
Results: The use of a drug in a pre-clinical multiple sclerosis model was reported in 1152 publications, of which 1117 were experimental autoimmune encephalomyelitis (EAE). For 36 interventions analysed in greater detail, neurobehavioural score was improved by 39.6% (95% CI 34.9–44.2%, p < 0.001). However, few studies reported measures to reduce bias, and those reporting randomization or blinding found statistically-significantly smaller effect sizes.
Conclusions: EAE has proven to be a valuable model in elucidating pathogenesis as well as identifying candidate therapies for multiple sclerosis. However, there is an inconsistent application of measures to limit bias that could be addressed by adopting methodological best practice in study design. Our analysis provides an estimate of sample size required for different levels of power in future studies and suggests a number of interventions for which there are substantial animal data supporting efficacy.
The growing competition and “publish or perish” culture in academia might conflict with the objectivity and integrity of research, because it forces scientists to produce “publishable” results at all costs. Papers are less likely to be published and to be cited if they report “negative” results (results that fail to support the tested hypothesis). Therefore, if publication pressures increase scientific bias, the frequency of “positive” results in the literature should be higher in the more competitive and “productive” academic environments. This study verified this hypothesis by measuring the frequency of positive results in a large random sample of papers with a corresponding author based in the US. Across all disciplines, papers were more likely to support a tested hypothesis if their corresponding authors were working in states that, according to NSF data, produced more academic papers per capita. The size of this effect increased when controlling for state’s per capita R&D expenditure and for study characteristics that previous research showed to correlate with the frequency of positive results, including discipline and methodology. Although the confounding effect of institutions’ prestige could not be excluded (researchers in the more productive universities could be the most clever and successful in their experiments), these results support the hypothesis that competitive academic environments increase not only scientists’ productivity but also their bias. The same phenomenon might be observed in other countries where academic competition and pressures to publish are high.
Publication bias confounds attempts to use systematic reviews to assess the efficacy of various interventions tested in experiments modelling acute ischaemic stroke, leading to a 30% overstatement of efficacy of interventions tested in animals.
The consolidation of scientific knowledge proceeds through the interpretation and then distillation of data presented in research reports, first in review articles and then in textbooks and undergraduate courses, until truths become accepted as such both amongst “experts” and in the public understanding. Where data are collected but remain unpublished, they cannot contribute to this distillation of knowledge. If these unpublished data differ substantially from published work, conclusions may not reflect adequately the underlying biological effects being described. The existence and any impact of such “publication bias” in the laboratory sciences have not been described. Using the CAMARADES (Collaborative Approach to Meta-analysis and Review of Animal Data in Experimental Studies) database we identified 16 systematic reviews of interventions tested in animal studies of acute ischaemic stroke involving 525 unique publications. Only ten publications (2%) reported no statistically-significant effects on infarct volume and only six (1.2%) did not report at least one significant finding. Egger regression and trim-and-fill analysis suggested that publication bias was highly prevalent (present in the literature for 16 and ten interventions, respectively) in animal studies modelling stroke. Trim-and-fill analysis suggested that publication bias might account for around one-third of the efficacy reported in systematic reviews, with reported efficacy falling from 31.3% to 23.8% after adjustment for publication bias. We estimate that a further 214 experiments (in addition to the 1,359 identified through rigorous systematic review; non publication rate 14%) have been conducted but not reported. It is probable that publication bias has an important impact in other animal disease models, and more broadly in the life sciences.
Publication bias is known to be a major problem in the reporting of clinical trials, but its impact in basic research has not previously been quantified. Here we show that publication bias is prevalent in reports of laboratory-based research in animal models of stroke, such that data from as many as one in seven experiments remain unpublished. The result of this bias is that systematic reviews of the published results of interventions in animal models of stroke overstate their efficacy by around one third. Nonpublication of data raises ethical concerns, first because the animals used have not contributed to the sum of human knowledge, and second because participants in clinical trials may be put at unnecessary risk if efficacy in animals has been overstated. It is unlikely that this publication bias in the basic sciences is restricted to the area we have studied, the preclinical modelling of the efficacy of candidate drugs for stroke. A related article in PLoS Medicine (van der Worp et al., doi:10.1371 / journal.pmed.1000245) discusses the controversies and possibilities of translating the results of animal experiments into human clinical trials.
We document widespread changes to the historical I/B/E/S analyst stock recommendations database. Across seven I/B/E/S downloads, obtained between 2000 and 2007, we find that between 6,580 (1.6%) and 97,582 (21.7%) of matched observations are different from one download to the next. The changes include alterations of recommendations, additions and deletions of records, and removal of analyst names. These changes are nonrandom, clustering by analyst reputation, broker size and status, and recommendation boldness, and affect trading signal classifications and back‐tests of three stylized facts: profitability of trading signals, profitability of consensus recommendation changes, and persistence in individual analyst stock‐picking ability.
Based on theoretical reasoning it has been suggested that the reliability of findings published in the scientific literature decreases with the popularity of a research field. Here we provide empirical support for this prediction. We evaluate published statements on protein interactions with data from high-throughput experiments. We find evidence for two distinctive effects. First, with increasing popularity of the interaction partners, individual statements in the literature become more erroneous. Second, the overall evidence on an interaction becomes increasingly distorted by multiple independent testing. We therefore argue that for increasing the reliability of research it is essential to assess the negative effects of popularity and develop approaches to diminish these effects.
The frequency with which scientists fabricate and falsify data, or commit other forms of scientific misconduct is a matter of controversy. Many surveys have asked scientists directly whether they have committed or know of a colleague who committed research misconduct, but their results appeared difficult to compare and synthesize. This is the first meta-analysis of these surveys.
To standardize outcomes, the number of respondents who recalled at least one incident of misconduct was calculated for each question, and the analysis was limited to behaviours that distort scientific knowledge: fabrication, falsification, “cooking” of data, etc… Survey questions on plagiarism and other forms of professional misconduct were excluded. The final sample consisted of 21 surveys that were included in the systematic review, and 18 in the meta-analysis.
A pooled weighted average of 1.97% (n = 7, 95%CI: 0.86–4.45) of scientists admitted to have fabricated, falsified or modified data or results at least once –a serious form of misconduct by any standard– and up to 33.7% admitted other questionable research practices. In surveys asking about the behaviour of colleagues, admission rates were 14.12% (n = 12, 95% CI: 9.91–19.72) for falsification, and up to 72% for other questionable research practices. Meta-regression showed that self reports surveys, surveys using the words “falsification” or “fabrication”, and mailed surveys yielded lower percentages of misconduct. When these factors were controlled for, misconduct was reported more frequently by medical/pharmacological researchers than others.
Considering that these surveys ask sensitive questions and have other limitations, it appears likely that this is a conservative estimate of the true prevalence of scientific misconduct.
The purpose of this study was to examine links between public status and performance in a real-world, high-pressure sport task.
It was believed that high public status could negatively affect performance through added performance pressure. Video analyses were conducted of all penalty shootouts ever held in 3 major soccer tournaments (n = 366 kicks) and public status was derived from prestigious international awards (e.g., “FIFA World Player of the year”).
The results showed that players with high current status performed worse and seemed to engage more in certain escapist self-regulatory behaviors than players with future status. Some of these performance drops may be accounted for by misdirected self-regulation (particularly low response time), but only small multivariate effects were found.
[Previously, Blackwell 1998] This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a substantial bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences.
Our results demonstrate that measurement bias is substantial and commonplace in computer system evaluation. By significant we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel’s C compiler), and most of the SPECCPU2006 C programs. Thus, we cannot ignore measurement bias.Nevertheless, in a literature survey of 133 recent papers from ASPLOS,PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias.
Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.
It would appear, then, that Lieutenant Colonel Grossman’s appeals to biology and psychology are flawed, and that the bulwark of his historical evidence—S.L. A. Marshall’s assertion that soldiers do not fire their weapons—can be verifiably disproven. The theory of an innate, biological resistance to killing has little support in either evolutionary biology or in what we know about psychology, and, discounting Marshall’s claims, there is little basis in military history for such a theory either. This is not to say that all people can or will kill, or even that all soldiers can or will kill. Combat is staggeringly complex, an environment where human beings are pushed beyond all tolerable limits. There is much that we do not know, and plenty that we should be doing more to learn about. Grossman is clearly leading the way in posing these questions. Much of his work on the processes of killing and the relevance of physical distance to killing is extremely insightful. There is material in On Combat about fear, heart rate, and combat effectiveness that might be groundbreaking, and it should be studied carefully by historians trying to understand human behaviour in war. No disrespect to Lieutenant-Colonel Grossman is intended by this article, and it is not meant to devalue his work. I personally believe that some of the elements of his books, particularly the physiology of combat, would actually be strengthened if they were not shackled to the idea that humans cannot kill one another. But there are still questions that need to be asked, and the subject should not be considered closed. Grossman’s overall picture of killing in war and society is heavily informed by a belief in an innate human resistance to killing that, as has been offered here, does not stand up well to scrutiny. More research on the processes of human killing is needed, and although On Killing and On Combat form an excellent starting point, there are too many problems with their interpretation for them to be considered the final word on the subject. I believe that, in the future, the Canadian Forces needs to take a more critical posture when it comes to incorporating Grossman’s studies into its own doctrine. It is imperative that our nation’s military culture remain one devoted to pursuing the best available evidence at all costs, rather than one merely following the most popular consensus.
Some risks have extremely high stakes. For example, a worldwide pandemic or asteroid impact could potentially kill more than a billion people. Comfortingly, scientific calculations often put very low probabilities on the occurrence of such catastrophes. In this paper, we argue that there are important new methodological problems which arise when assessing global catastrophic risks and we focus on a problem regarding probability estimation. When an expert provides a calculation of the probability of an outcome, they are really providing the probability of the outcome occurring, given that their argument is watertight. However, their argument may fail for a number of reasons such as a flaw in the underlying theory, a flaw in the modeling of the problem, or a mistake in the calculations. If the probability estimate given by an argument is dwarfed by the chance that the argument itself is flawed, then the estimate is suspect. We develop this idea formally, explaining how it differs from the related distinctions of model and parameter uncertainty. Using the risk estimates from the Large Hadron Collider as a test case, we show how serious the problem can be when it comes to catastrophic risks and how best to address it.
Despite great attention to the quality of research methods in individual studies, if publication decisions of journals are a function of the statistical-significance of research findings, the published literature as a whole may not produce accurate measures of true effects.
This article examines the 2 most prominent sociology journals (the American Sociological Review and the American Journal of Sociology) and another important though less influential journal (The Sociological Quarterly) for evidence of publication bias. The effect of the 0.05 statistical-significance level on the pattern of published findings is examined using a “caliper” test, and the hypothesis of no publication bias can be rejected at approximately the 1 in 10 million level.
Findings suggest that some of the results reported in leading sociology journals may be misleading and inaccurate due to publication bias. Some reasons for publication bias and proposed reforms to reduce its impact on research are also discussed.
Recently we proposed a model in which when a scientist writes a manuscript, he picks up several random papers, cites them, and also copies a fraction of their references. The model was stimulated by our finding that a majority of scientific citations are copied from the lists of references used in other papers. It accounted quantitatively for several properties of empirically observed distribution of citations; however, important features such as power-law distributions of citations to papers published during the same year and the fact that the average rate of citing decreases with aging of a paper were not accounted for by that model. Here, we propose a modified model: When a scientist writes a manuscript, he picks up several random recent papers, cites them, and also copies some of their references. The difference with the original model is the word recent. We solve the model using methods of the theory of branching processes, and find that it can explain the aforementioned features of citation distribution, which our original model could not account for. The model also can explain “sleeping beauties in science”; that is, papers that are little cited for a decade or so and later “awaken” and get many citations. Although much can be understood from purely random models, we find that to obtain a good quantitative agreement with empirical citation data, one must introduce Darwinian fitness parameter for the papers.
Poor methodological standards in animal studies mean that positive results rarely translate to the clinical domain… in a systematic review reported in this week’s BMJ Perel and colleagues find that therapeutic efficacy in animals often does not translate to the clinical domain.2 The authors conducted meta-analyses of all available animal data for six interventions that showed definitive proof of benefit or harm in humans. For three of the interventions—corticosteroids for brain injury, antifibrinolytics in haemorrhage, and tirilazad for acute ischaemic stroke—they found major discordance between the results of the animal experiments and human trials. Equally concerning, they found consistent methodological flaws throughout the animal data, irrespective of the intervention or disease studied. For example, only eight of the 113 animal studies on thrombolysis for stroke reported a sample size calculation, a fundamental step in helping to ensure an appropriately powered precise estimate of effect. In addition, the use of randomisation, concealed allocation, and blinded outcome assessment—standards that are considered the norm when planning and reporting modern human clinical trials—were inconsistent in the animal studies.
…What can be done to remedy this situation? Firstly, uniform reporting requirements are needed urgently and would improve the quality of animal research; as in the clinical research world, this would require cooperation between investigators, editors, and funders of basic scientific research. A more immediate solution is to promote rigorous systematic reviews of experimental treatments before clinical trials begin. Many clinical trials would probably not have gone ahead if all the data had been subjected to meta-analysis. Such reviews would also provide robust estimates of effect size and variance for adequately powering randomised trials. A third solution, which Perel and colleagues call for, is a system for registering animal experiments, analogous to that for clinical trials. This would help to reduce publication bias and provide a more informed view before proceeding to clinical trials. Until such improvements occur, it seems prudent to be critical and cautious about the applicability of animal data to the clinical domain.
Studies initially reported as conference abstracts that have positive results are subsequently published as full-length journal articles more often than studies with negative results.
Less than half of all studies, and about 60% of randomized or controlled clinical trials, initially presented as summaries or abstracts at professional meetings are subsequently published as peer-reviewed journal articles. An important factor appearing to influence whether a study described in an abstract is published in full is the presence of ‘positive’ results in the abstract. Thus, the efforts of persons trying to collect all of the evidence in a field may be stymied, first by the failure of investigators to take abstract study results to full publication, and second, by the tendency to take to full publication only those studies reporting ‘significant’ results. The consequence of this is that systematic reviews will tend to over-estimate treatment effects.
Background: Abstracts of presentations at scientific meetings are usually available only in conference proceedings. If subsequent full publication of abstract results is based on the magnitude or direction of study results, publication bias may result. Publication bias, in turn, creates problems for those conducting systematic reviews or relying on the published literature for evidence.
Objectives: To determine the rate at which abstract results are subsequently published in full, and the time between meeting presentation and full publication.
Search methods: We searched MEDLINE, EMBASE, The Cochrane Library, Science Citation Index, reference lists, and author files. Date of most recent search: June 2003. Selection criteria We included all reports that examined the subsequent full publication rate of biomedical results initially presented as abstracts or in summary form. Follow-up of abstracts had to be at least two years.
Data collection and analysis: Two reviewers extracted data. We calculated the weighted mean full publication rate and time to full publication. Dichotomous variables were analyzed using relative risk and random effects models. We assessed time to publication using Kaplan-Meier survival analyses.
Main results: Combining data from 79 reports (29,729 abstracts) resulted in a weighted mean full publication rate of 44.5% (95% confidence interval (CI) 43.9 to 45.1). Survival analyses resulted in an estimated publication rate at 9 years of 52.6% for all studies, 63.1% for randomized or controlled clinical trials, and 49.3% for other types of study designs.
‘Positive’ results defined as any ‘significant’ result showed an association with full publication (RR = 1.30; CI 1.14 to 1.47), as did ‘positive’ results defined as a result favoring the experimental treatment (RR = 1.17; CI 1.02 to 1.35), and ‘positive’ results emanating from randomized or controlled clinical trials (RR = 1.18, CI 1.07 to 1.30).
Other factors associated with full publication include oral presentation (RR = 1.28; CI 1.09 to 1.49); acceptance for meeting presentation (RR = 1.78; CI 1.50 to 2.12); randomized trial study design (RR = 1.24; CI 1.14 to 1.36); and basic research (RR = 0.79; CI 0.70 to 0.89). Higher quality of abstracts describing randomized or controlled clinical trials was also associated with full publication (RR = 1.30, CI 1.00 to 1.71).
Authors’ conclusions: Only 63% of results from abstracts describing randomized or controlled clinical trials are published in full. ‘Positive’ results were more frequently published than not ‘positive’ results.
To maximize the findings of animal experiments to inform likely health effects in humans, a thorough review and evaluation of the animal evidence is required. Systematic reviews and, where appropriate, meta-analyses have great potential in facilitating such an evaluation, making efficient use of the animal evidence while minimizing possible sources of bias. The extent to which systematic review and meta-analysis methods have been applied to evaluate animal experiments to inform human health is unknown.
Using systematic review methods, we examine the extent and quality of systematic reviews and meta-analyses of in vivo animal experiments carried out to inform human health. We identified 103 articles meeting the inclusion criteria: 57 reported a systematic review, 29 a systematic review and a meta-analysis, and 17 reported a meta-analysis only.
The use of these methods to evaluate animal evidence has increased over time. Although the reporting of systematic reviews is of adequate quality, the reporting of meta-analyses is poor. The inadequate reporting of meta-analyses observed here leads to questions on whether the most appropriate methods were used to maximize the use of the animal evidence to inform policy or decision-making. We recommend that guidelines proposed here be used to help improve the reporting of systematic reviews and meta-analyses of animal experiments.
Further consideration of the use and methodological quality and reporting of such studies is needed.
Context: Controversy and uncertainty ensue when the results of clinical research on the effectiveness of interventions are subsequently contradicted. Controversies are most prominent when high-impact research is involved.
Objectives: To understand how frequently highly cited studies are contradicted or find effects that are stronger than in other similar studies and to discern whether specific characteristics are associated with such refutation over time.
Design: All original clinical research studies published in 3 major general clinical journals or high-impact-factor specialty journals in 1990–2003 and cited more than 1000 times in the literature were examined.
Main Outcome Measure: The results of highly cited articles were compared against subsequent studies of comparable or larger sample size and similar or better controlled designs. The same analysis was also performed comparatively for matched studies that were not so highly cited.
Results: Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remained largely unchallenged. Five of 6 highly-cited nonrandomized studies had been contradicted or had found stronger effects vs 9 of 39 randomized controlled trials (p = 0.008). Among randomized trials, studies with contradicted or stronger effects were smaller (p = 0.009) than replicated or unchallenged studies although there was no statistically-significant difference in their early or overall citation impact. Matched control studies did not have a statistically-significantly different share of refuted results than highly cited studies, but they included more studies with “negative” results.
Conclusions: Contradiction and initially stronger effects are not unusual in highly cited research of clinical interventions and their outcomes. The extent to which high citations may provoke contradictions and vice versa needs more study. Controversies are most common with highly cited nonrandomized studies, but even the most highly cited randomized trials may be challenged and refuted over time, especially small ones.
The statistic p(rep) estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, p(rep) provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference.
This article shows that 35 years of empirical research on teacher expectations justifies the following conclusions: (a) Self-fulfilling prophecies in the classroom do occur, but these effects are typically small, they do not accumulate greatly across perceivers or over time, and they may be more likely to dissipate than accumulate; (b) powerful self-fulfilling prophecies may selectively occur among students from stigmatized social groups; (c) whether self-fulfilling prophecies affect intelligence, and whether they in general do more harm than good, remains unclear, and (d) teacher expectations may predict student outcomes more because these expectations are accurate than because they are self-fulfilling. Implications for future research, the role of self-fulfilling prophecies in social problems, and perspectives emphasizing the power of erroneous beliefs to create social reality are discussed.
[Jussim discusses the famous ‘Pygmalion effect’. It demonstrates the Replication crisis: an initial extraordinary finding indicating that teachers could raise student IQs by dozens of points gradually shrunk over repeated replications to essentially zero net long-term effect. The original finding was driven by statistical malpractice bordering on research fraud: some students had “pretest IQ scores near zero, and others had post-test IQ scores over 200”! Rosenthal further maintained the Pygmalion effect by statistical trickery, such as his ‘fail-safe N’, which attempted to show that hundreds of studies would have to have not been published in order for the Pygmalion effect to be true—except this assumes zero publication bias in those unpublished studies and begs the question.]
Correlations which are artifacts of various types of data transformations can be said to be spurious. This study considers four common types of analyses where the X and Y variables are not independent; these include regressions of the form X⁄Z vs Y⁄Z, X×Z vsY×Z, X vs Y⁄X, and X+Y vs Y.
These analyses were carried out using a series of Monte Carlo simulations while varying sample size and sample variability. The impact of disparities in variability between the shared and non-shared terms and measurement error for the shared term on the magnitude of the spurious correlations was also considered. The accuracy of equations previously derived to predict the magnitude of spurious correlations was also assessed.
These results show the risk of producing spurious correlations when analyzing non-independent variables is very large. Spurious correlations occurred in all cases assessed, the mean spurious coefficient of determination (R2) frequently exceeded 0.50, and in some cases the 90% confidence interval for these simulations included all large R2 values. The magnitude of spurious correlations was sensitive to differences in the variability of the shared and non-shared terms, with large spurious correlations obtained when the variability for the shared term was larger. Sample size had only a modest impact on the magnitude of spurious correlations. When measurement error for the shared variable was smaller than one half the coefficient of variation for that variable, which is generally the case, the measurement error did not generate large spurious correlations.
The equations available to predict expected spurious correlations provided accurate predictions for the case of X×Z vs Y×Z, variable predictions for the case of X vs Y⁄X, and poor predictions for mostcases of X⁄Z vs Y⁄Z, and X+Y vs Y.
Much animal research into potential treatments for humans is wasted because it is poorly conducted and not evaluated through systematic reviews.
…We searched MEDLINE to identify published systematic reviews of animal experiments (see bmj.com for the search strategy). The search identified 277 possible papers, of which 22 were reports of systematic reviews. We are also aware of one recently published study and two unpublished studies, bringing the total to 25. Three further studies are in progress (M Macleod, personal communication). Seven of the 25 papers were systematic reviews of animal studies that had been conducted to find out how the animal research had informed the clinical research. Two of these reported on the same group of studies, giving six reviews in this category. A further 10 papers were systematic reviews of animal studies conducted to assess the evidence for proceeding to clinical trials or to establish an evidence base. 8 systematically reviewed both the animal and human studies in a particular field, again before clinical trials had taken place. We focus on the 6 studies in the first category because these shed the most light on the contribution that animal research makes to clinical medicine.
…The clinical trials of nimodipine and low level laser therapy were conducted concurrently with the animal studies, while the clinical trials of fluid resuscitation, thrombolytic therapy, and endothelin receptor blockade went ahead despite evidence of harm from the animal studies. This suggests that the animal data were regarded as irrelevant, calling into question why the studies were done in the first place and seriously undermining the principle that animal experiments are necessary to inform clinical medicine.
Furthermore, many of the existing animal experiments were poorly designed…Although randomisation and blinding are accepted as standard in clinical trials, no such standards exist for animal studies. Bebarta et al found that animal studies that did not report randomisation and blinding were more likely to report a treatment effect than studies that used these methods. The box summarises further potential methodological problems.
The value of animal research into potential human treatments needs urgent rigorous evaluation
Systematic reviews can provide important insights into the validity of animal research
The few existing reviews have highlighted deficiencies such as animal and clinical trials being conducted simultaneously
Many animal studies were of poor methodological quality
Systematic reviews should become routine to ensure the best use of existing animal data as well as improve the estimates of effect from animal experiments
Chambers II discusses the findings of journalist-soldier S. L. A. Marshall about combat fire ratios particularly that in World War II. Marshall claimed that his figures about the ratio of fire, the proportion of a rifle unit firing its weapons in battle was derived from his group after-action interviews, a method he developed as a field historian in world War II and which as a civilian journalist, Reserve officer, and military consultant. Although the ratio-of-fire figure was his most famous product, Marshall was proudest of his methodology.
[Chambers interviews Frank L. Brennan, an assistant of Marshall during his Korea War after-action interview work, who accompanied him to every interview. Brennan described Marshall’s methodology as follows: the group interviews typically lasted about 2 hours at most; Marshall took few notes; Marshall preferred to ask open-ended questions and listen to the discussions; he rarely asked questions specifically about the rate of fire or whether a soldier had fired his weapon; he did not seem to collect any of the statistics he would later report in his books; and Marshall was evasive when Brennan asked about his WWII statistics’ sources. Brennan noted that in Marshall’s autobiography, Marshall greatly inflated his importance & the resources placed at his disposal in Korea, and the length of his interviews. Brennan also served in combat afterwards, and observed a high rate of fire in his own men.]
Self-esteem has become a household word. Teachers, parents, therapists, and others have focused efforts on boosting self-esteem, on the assumption that high self-esteem will cause many positive outcomes and benefits—an assumption that is critically evaluated in this review.
Appraisal of the effects of self-esteem is complicated by several factors. Because many people with high self-esteem exaggerate their successes and good traits, we emphasize objective measures of outcomes. High self-esteem is also a heterogeneous category, encompassing people who frankly accept their good qualities along with narcissistic, defensive, and conceited individuals.
The modest correlations between self-esteem and school performance do not indicate that high self-esteem leads to good performance. Instead, high self-esteem is partly the result of good school performance. Efforts to boost the self-esteem of pupils have not been shown to improve academic performance and may sometimes be counterproductive. Job performance in adults is sometimes related to self-esteem, although the correlations vary widely, and the direction of causality has not been established. Occupational success may boost self-esteem rather than the reverse. Alternatively, self-esteem may be helpful only in some job contexts. Laboratory studies have generally failed to find that self-esteem causes good task performance, with the important exception that high self-esteem facilitates persistence after failure.
People high in self-esteem claim to be more likable and attractive, to have better relationships, and to make better impressions on others than people with low self-esteem, but objective measures disconfirm most of these beliefs. Narcissists are charming at first but tend to alienate others eventually. Self-esteem has not been shown to predict the quality or duration of relationships.
High self-esteem makes people more willing to speak up in groups and to criticize the group’s approach. Leadership does not stem directly from self-esteem, but self-esteem may have indirect effects. Relative to people with low self-esteem, those with high self-esteem show stronger in-group favoritism, which may increase prejudice and discrimination.
Neither high nor low self-esteem is a direct cause of violence. Narcissism leads to increased aggression in retaliation for wounded pride. Low self-esteem may contribute to externalizing behavior and delinquency, although some studies have found that there are no effects or that the effect of self-esteem vanishes when other variables are controlled. The highest and lowest rates of cheating and bullying are found in different subcategories of high self-esteem.
Self-esteem has a strong relation to happiness. Although the research has not clearly established causation, we are persuaded that high self-esteem does lead to greater happiness. Low self-esteem is more likely than high to lead to depression under some circumstances. Some studies support the buffer hypothesis, which is that high self-esteem mitigates the effects of stress, but other studies come to the opposite conclusion, indicating that the negative effects of low self-esteem are mainly felt in good times. Still others find that high self-esteem leads to happier outcomes regardless of stress or other circumstances.
High self-esteem does not prevent children from smoking, drinking, taking drugs, or engaging in early sex. If anything, high self-esteem fosters experimentation, which may increase early sexual activity or drinking, but in general effects of self-esteem are negligible. One important exception is that high self-esteem reduces the chances of bulimia in females.
Overall, the benefits of high self-esteem fall into two categories: enhanced initiative and pleasant feelings. We have not found evidence that boosting self-esteem (by therapeutic interventions or school programs) causes benefits. Our findings do not support continued widespread efforts to boost self-esteem in the hope that it will by itself foster improved outcomes. In view of the heterogeneity of high self-esteem, indiscriminate praise might just as easily promote narcissism, with its less desirable consequences. Instead, we recommend using praise to boost self-esteem as a reward for socially desirable behavior and self-improvement.
Objective:To determine whether remote, retroactive intercessory prayer, said for a group of patients with a bloodstream infection, has an effect on outcomes.
Design: Double blind, parallel group, randomised controlled trial of a retroactive intervention.
Setting: University hospital.
Subjects:All 3393 adult patients whose bloodstream infection was detected at the hospital in 1990–1996.
Intervention: In July 2000 patients were randomised to a control group and an intervention group. A remote, retroactive intercessory prayer was said for the well being and full recovery of the intervention group.
Main outcome measures: Mortality in hospital, length of stay in hospital, and duration of fever.
Results: Mortality was 28.1% (475⁄1691) in the intervention group and 30.2% (514⁄1702) in the control group (p for difference = 0.4). Length of stay in hospital and duration of fever were statistically-significantly shorter in the intervention group than in the control group (p = 0.01 and p = 0.04, respectively).
Conclusions: Remote, retroactive intercessory prayer said for a group is associated with a shorter stay in hospital and shorter duration of fever in patients with a bloodstream infection and should be considered for use in clinical practice.
What is already known on this topic: 2 randomised controlled trials of remote intercessory prayer (praying for persons unknown) showed a beneficial effect in patients in an intensive coronary care unit. A recent systematic review found that 57% of the randomised, placebo controlled trials of distant healing showed a positive treatment effect. What this study adds: Remote intercessory prayer said for a group of patients is associated with a shorter hospital stay and shorter duration of fever in patients with a bloodstream infection, even when the intervention is performed 4–10 years after the infection.
The 1968 publication of the Rosenthal and Jacobson’s Pygmalion in the Classroom offered the optimistic message that raising teachers’ expectations of their pupils’ potentials would raise their pupils’ intelligence. This claim was, and still is, endorsed by many psychologists and educators. The original study, along with the scores of attempted replications and the acrimonious controversy that followed it, is reviewed, and its consequences discussed.
This thesis presents and analyzes a simple principle for building systems: that there should be a random component in all arbitrary decisions. If no randomness is used, system performance can vary widely and unpredictably due to small changes in the system workload or configuration. This makes measurements hard to reproduce and less meaningful as predictors of performance that could be expected in similar situations.
To measure the sensitivity of non-randomized systems to slight configuration changes, we measured the variation in performance of both TCP / IP and workstation memory systems as a result of “small” configuration perturbations. By “small”, we mean within the range over which things may change unintentionally due to other modifications being evaluated, or within the range of accuracy that an independent researcher could reasonably achieve.
For TCP/IP, changes of a few percent in link propagation delays and other parameters caused order of magnitude shifts in bandwidth allocation between competing connections. For memory systems, changes in the essentially arbitrary order in which functions were arranged in memory caused changes in runtime of tens of percent for single benchmarks, and of a few percent when averaged across a suite of benchmarks. In both applications the measured variability is larger than performance increases often reported for new improved designs, suggesting that many published measurements of the benefits of new schemes may be erroneous or at least irreproducible.
To make TCP/IP and memory systems measurable enough to make benchmark results meaningful and convincing, randomness must be added. Methods for adding randomness to conventional program linkers, to linkers which try to optimize memory system performance by avoiding cache conflicts, and to TCP/IP are presented and analyzed. In all of the systems, various amounts of randomness can be added in many different places. We show how to choose reasonable amounts of randomness based on measuring configuration sensitivity, and propose specific recipes for randomizing TCP/IP and memory systems. Substantial reductions in the configuration sensitivity are demonstrated, making measurements much more robust and meaningful. The accuracy of the results increases with the number of runs and thus is limited only by the available computing resources.
When the overall performance of a system is strongly influenced by the worst case behavior, reducing the sensitivity of the system can also make it perform better. Using average waiting time as a metric, TCP/IP performance is shown to improve substantially when randomization is added to the sending host’s congestion window calculations. Although the improvements are less than those achieved by previously proposed schemes using randomized packet discard algorithms inside the network, the proposed modifications can be implemented entirely in the sending host and so can be deployed more easily.
…When arbitrary decisions are made deterministically based on small-scale properties of the system, the overall performance of the system is likely to be highly sensitive to slight changes in configuration, initial conditions, or the results of internal computations of the system itself. For such a system, the graph of performance as a function of one or more of its configuration parameters looks jagged, and a measurement with any given set of configuration parameters may not be representative. If past decisions can affect future decisions, the system may be chaotic, meaning that an arbitrarily small change in initial conditions can cause an arbitrarily large change in the behavior of the system.
Highly sensitive deterministic systems have 2 disadvantages. First, it is hard to make robust and reproducible measurements of them. These terms will be described formally below, but intuitively, if the system performs radically better or worse due to small changes in some aspect of the system, then any single measurement is a poor predictor of the performance that could be expected in general.
The second disadvantage of highly sensitive deterministic systems is that they are likely to fall into behavior patterns which repeat regularly on the small scale and result in poor overall performance on the large scale. For example, a memoryless resource allocator must decide to grant its resource to one user or another based only on the current set of requests. If there is no random input to its decision making process, it may persistently favor some users and starve others, resulting in poor overall performance.
To avoid extreme sensitivity, arbitrary decisions should be randomized. When randomness is added to highly sensitive systems and performance is reported as the distribution across a large number of individual measurements, we show that performance does not change radically with small changes in the initial conditions.
The following hypothetical example illustrates how extreme sensitivity leads to misleading results and incorrect design decisions. Consider the problem of deciding whether or not a particular compiler optimization improves performance (i.e. reduces the runtime of the compiled program.) Suppose that the optimization eliminates some redundant instructions, thereby reducing the number of instruction execution cycles by 2%. The eliminated instructions reduce the sizes of some procedures, and (as will be shown in detail in Chapter 4), cache effects cause runtime to be highly sensitive to procedure sizes. Although in general the optimization would be expected to reduce memory system costs, suppose that changing cache conflicts cause it to increase memory system costs by 10% on the particular benchmark being used by the compiler developer. Thus a measurement would show an 8% performance decrease leading to a decision not to include the optimization in the compiler. Note that if another optimization is added which also affects procedure sizes, the memory system effects might be quite different.
[Popularization of Matthews’s other articles on physics & statistics and what truth there is to “Murphy’s law”]:
Toast falling butter side up: true, because tables are not high enough for toast to be likely to complete one or more rotations before landing given the tilt & falling off edges, therefore toast will in fact tend to land on its top half.
Maps putting things on edges: true—printed paper maps tend to be hard to use because the place one wants to go will tend to be toward an edge; this is simply a geometric fact due to most of the area of a volume being towards the edge.
Other Checkout Lines Being Faster: true, because of anthropics, as most waiting time is spent in the slowest line; even if equally loaded, order statistics points out there is only a 1 in n chance that one picked the fastest line out of n lanes
Mismatched Socks: also true, simply because there are many more ways for socks in a pair to go missing than to go missing in pairs or match up
Raining: forecasts are fairly accurate, but this ignores base-rates and that much of that accuracy is due to predicting non-rain. It’s a version of the diagnostics/screening paradox:
For example, suppose that the hourly base rate of rain is 0.1, meaning that it is 10× more likely not to rain during your hour-long stroll. Probability theory then shows that even an 80% accurate forecast of rain is twice as likely to prove wrong as right during your walk—and you’ll end up taking an umbrella unnecessarily. The fact is that even today’s apparently highly accurate forecasts are still not good enough to predict rare events reliably.
[Keywords: socks, maps, umbrellas, family law, combinatorics, weather forecasting, technology law, mathematical constants, physics, probability theory]
Large-scale uncitedness refers to the remarkable proportion of articles that do not receive a single citation within five years of publication. Equally remarkable is the brief and troubled history of this area of inquiry, which was prone to miscalculation, misinterpretation, and politicization. This article reassesses large-scale uncitedness as both a general phenomenon in the scholarly communication system and a case study of library and information science, where its rate is 72%.
[On the end to the post-WWII Vannevar Bushian exponential growth of academia and consequences thereof: growth can’t go on forever, and it didn’t.]
According to modern cosmology, the universe began with a big bang about 10 billion years ago, and it has been expanding ever since. If the density of mass in the universe is great enough, its gravitational force will cause that expansion to slow down and reverse, causing the universe to fall back in on itself. Then the universe will end in a cataclysmic event known as ‘the Big Crunch’. I would like to present to you a vaguely analogous theory of the history of science. The upper curve on Figure 1 was first made by historian Derek da Solla Price, sometime in the 1950s. It is a semilog plot of the cumulative number of scientific journals founded worldwide as a function of time…the growth of the profession of science, the scientific enterprise, is bound to reach certain limits. I contend that these limits have now been reached.
…But after about 1970 and the Big Crunch, the gleaming gems produced at the end of the vast mining-and-sorting operation produced less often from American ore. Research professors and their universities, using ore imported from across the oceans, kept the machinery humming.
…Let me finish by summarizing what I’ve been trying to tell you. We stand at an historic juncture in the history of science. The long era of exponential expansion ended decades ago, but we have not yet reconciled ourselves to that fact. The present social structure of science, by which I mean institutions, education, funding, publications and so on all evolved during the period of exponential expansion, before The Big Crunch. They are not suited to the unknown future we face. Today’s scientific leaders, in the universities, government, industry and the scientific societies are mostly people who came of age during the golden era, 1950–1970. I am myself part of that generation. We think those were normal times and expect them to return. But we are wrong. Nothing like it will ever happen again. It is by no means certain that science will even survive, much less flourish, in the difficult times we face. Before it can survive, those of us who have gained so much from the era of scientific elites and scientific illiterates must learn to face reality, and admit that those days are gone forever.
Conventional reviews of research on the efficacy of psychological, educational, and behavioral treatments often find considerable variation in outcome among studies and, as a consequence, fail to reach firm conclusions about the overall effectiveness of the interventions in question. In contrast, meta-analysis reviews show a strong, dramatic pattern of positive overall effects that cannot readily be explained as artifacts of meta-analytic technique or generalized placebo effects. Moreover, the effects are not so small that they can be dismissed as lacking practical or clinical-significance. Although meta-analysis has limitations, there are good reasons to believe that its results are more credible than those of conventional reviews and to conclude that well-developed psychological, educational, and behavioral treatment is generally efficacious.
We believe that members of the scientific community have a primary obligation to promote integrity in research and that this obligation includes a duty to report observations that suggest misconduct to agencies that are empowered to examine and evaluate such evidence. Consonant with this responsibility, we became whistleblowers in the case of Herbert Needleman. His 1979 study (Needleman et al 1979), on the effects of low-level lead exposure on children, is widely cited and highly influential in the formulation of public policy on lead. The opportunity we had to examine subject selection and data analyses from this study was prematurely halted by efforts to prevent disclosure of our observations. Nevertheless, what we saw left us with serious concerns. We hope that the events here summarized will contribute to revisions of process by which allegations of scientific misconduct are handled and that such revisions will result in less damage to scientists who speak out.
[‘Subliminal advertising’ was the Cambridge Analytica of the 1950s.] In September 1957, I began what to me was a serious study of contemporary applied psychology at Hofstra College in Hempstead, Long Island. At exactly the same time, in nearby New York City, an unemployed market researcher named James M. Vicary made a startling announcement based on research in high-speed photography later popularized by Eastman Kodak Company.
His persuasive sales pitch was that consumers would comprehend information projected at 1/ 60,000th of a second, although they could not literally “see” the flash. And he sent a news release to the major media announcing his “discovery”.
…And, as a follow-up, toward the end of 1957 Vicary invited 50 reporters to a film studio in New York where he projected some motion picture footage, and claimed that he had also projected a subliminal message. He then handed out another of his well written and nicely printed news releases claiming that he had actually conducted major research on how an invisible image could cause people to buy something even if they didn’t want to.
The release said that in an unidentified motion picture theater a “scientific test” had been conducted in which 45,699 persons unknowingly had been exposed to 2 advertising messages projected subliminally on alternate nights. One message, the release claimed, had advised the moviegoers to “Eat Popcorn” while the other had read “Drink Coca-Cola.”
…Vicary swore that the invisible advertising had increased sales of popcorn an average of 57.5%, and increased the sales of Coca-Cola an average of 18.1%. No explanation was offered for the difference in size of the percentages, no allowance was made for variations in attendance, and no other details were provided as to how or under what conditions the purported tests had been conducted. Vicary got off the hook for his lack of specificity by stating that the research information formed part of his patent application for the projection device, and therefore must remain secret. He assured the media, however, that what he called “sound statistical controls” had been employed in the theater test. At least as importantly, too, he had observed the proven propagandist’s ploy of using odd numbers, and also including a decimal in a percentage. The figures 57.5 and 18.1% rang with a clear tone of Truth.
…When I learned of Vicary’s claim, I made the short drive to Fort Lee to learn first-hand about his clearly remarkable experiment.
The size of that small-town theater suggested it should have taken considerably longer than 6 weeks to complete a test of nearly 50,000 movie patrons. But even more perplexing was the response of the theater manager to my eager questioning. He declared that no such test had ever been conducted at his theater.
There went my term paper for my psychology class.
Soon after my disappointment, Motion Picture Daily reported that the same theater manager had sworn to one of its reporters that there had been no effect on refreshment stand patronage, whether a test had been conducted or not—a rather curious form of denial, I think.
…Technological Impossibility: Vicary also informed the reporters that subliminal advertising would have its “biggest initial impact” on the television medium.
When I learned of this, I visited the engineering section of RCA…I was assured by their helpful and knowledgeable engineering liaison man that, because of the time required for an electron beam to scan the surface of a television picture tube, and the persistence of the phosphor glow, it was technologically impossible to project a television image faster than the human eye could perceive.
“In a nighttime scene on television, watch the way the image of a car’s headlights lingers; that’s called comet-tailing”, the engineer explained. “See how long it takes before the headlights fade away.” Clearly there was no way that even the slower tachistoscope speeds of 1/3,000th of a second that Vicary had begun talking about in early 1958 could work on contemporary television.
…It has been estimated he collected retainer and consulting fees from America’s largest advertisers totaling some $35.19$4.51958 million—about $53.01$22.51992 million in today’s dollars.
Then, some time in June 1958, Mr. Vicary disappeared from the New York marketing scene, reportedly leaving no bank accounts, no clothes in his closet, and no hint as to where he might have gone. The big advertisers, apparently ashamed of having been fooled by such an obvious scam, have said nothing since about subliminal advertising, except to deny that they have ever used it.
[Lykken’s (1991) classic criticisms of psychology’s dominant research tradition, from the perspective of the Minnesotan psychometrics school, in association with Paul Meehl: psychology’s replication crisis, the constant fading-away of trendy theories, and inability to predict the real world the measurement problem, null hypothesis statistical-significance testing, and the granularity of research methods.]
I shall argue the following theses:
Psychology isn’t doing very well as a scientific discipline and something seems to be wrong somewhere.
This is due partly to the fact that psychology is simply harder than physics or chemistry, and for a variety of reasons. One interesting reason is that people differ structurally from one another and, to that extent, cannot be understood in terms of the same theory since theories are guesses about structure.
But the problems of psychology are also due in part to a defect in our research tradition; our students are carefully taught to behave in the same obfuscating, self-deluding, pettifogging ways that (some of) their teachers have employed.
The future of measurement in psychology and education in the decades to come will depend on the availability of measurement faculty in our universities, the range of measurement offerings on our campuses, and the standards for measurement literacy reflected in the preparation of psychological and educational professionals, in criteria for professional program accreditation, and in standards for licenses and credentials for psychological and educational practice.
In this article, I argue that, by becoming aware of the inadequate supply of future psychometricians and the range of coverage of both classical and modern test theory in undergraduate and graduate courses on our university campuses, psychologists can promote measurement literacy among future professional psychologists and educators. In our efforts to promote measurement literacy, we should acknowledge the excellent work of our colleagues who contributed to the revised Standards for Educational and Psychological Tests and the several other testing documents that have been published subsequently as well as the current efforts of the Joint Committee on Testing Practices and its subcommittees. Although these documents support raising the level of measurement literacy among education and psychology professionals, the documents will not lead us far toward the goal of minimal standards for measurement literacy, unless we, measurement professionals, carry the movement (1) back to our places of work and (2) outward to the various committees involved in proposals for revising national standards in all fields dependent on measurement literacy for competent practices.
Citation analysis of scientific articles constitutes an im portant tool in quantitative studies of science and technology. Moreover, citation indexes are used frequently in searches for relevant scientific documents. In this article we focus on the issue of reliability of citation analysis. How accurate are cita tion counts to individual scientific articles? What pitfalls might occur in the process of data collection? To what extent do ‘random’ or ‘systematic’ errors affect the results of the citation analysis?
We present a detailed analysis of discrepancies between target articles and cited references with respect to author names, publication year, volume number, and starting page number. Our data consist of some 4500 target articles pub lished in five scientific journals, and 25000 citations to these articles. Both target and citation data were obtained from the Science Citation Index, produced by the Institute for Scientific Information.
It appears that in many cases a specific error in a citation to a particular target article occurs in more than one citing publication. We present evidence that authors in compiling reference lists, may copy references from reference lists in other articles, and that this may be one of the mechanisms underlying this phenomenon of multiple’ variations/errors.
This article illustrates basic statistical techniques for studying coincidences. These include data-gathering methods (informal anecdotes, case studies, observational studies, and experiments) and methods of analysis (exploratory and confirmatory data analysis, special analytic techniques, and probabilistic modeling, both general and special purpose). We develop a version of the birthday problem general enough to include dependence, inhomogeneity, and almost and multiple matches. We review Fisher’s techniques for giving partial credit for close matches. We develop a model for studying coincidences involving newly learned words. Once we set aside coincidences having apparent causes, four principles account for large numbers of remaining coincidences: hidden cause; psychology, including memory and perception; multiplicity of endpoints, including the counting of “close” or nearly alike events as if they were identical; and the law of truly large numbers, which says that when enormous numbers of events and people and their interactions cumulate over time, almost any outrageous event is bound to occur. These sources account for much of the force of synchronicity.
…Because of our different reading habits, we readers are exposed to the same words at different observed rates, even when the long-run rates are the same Some words will appear relatively early in your experience, some relatively late. More than half will appear before their expected time of appearance, probably more than 60% of them if we use the exponential model, so the appearance of new words is like a Poisson process. On the other hand, some words will take more than twice the average time to appear, about 1⁄7 of them (1⁄e2) in the exponential model. They will look rarer than they actually are. Furthermore, their average time to reappearance is less than half that of their observed first appearance, and about 10% of those that took at least twice as long as they should have to occur will appear in less than 1⁄20 of the time they originally took to appear. The model we are using supposes an exponential waiting time to first occurrence of events. The phenomenon that accounts for part of this variable behavior of the words is of course the regression effect.
…We now extend the model. Suppose that we are somewhat more complicated creatures, that we require k exposures to notice a word for the first time, and that k is itself a Poisson random variable…Then, the mean time until the word is noticed is (𝜆 + 1)T, where T is the average time between actual occurrences of the word. The variance of the time is (2𝜆 + 1)T2. Suppose T = 1 year and 𝜆 = 4. Then, as an approximation, 5% of the words will take at least time [𝜆 + 1 + 1.65 (2𝜆 + 1)(1⁄2)]T or about 10 years to be detected the first time. Assume further that, now that you are sensitized, you will detect the word the next time it appears. On the average it will be a year, but about 3% of these words that were so slow to be detected the first time will appear within a month by natural variation alone. So what took 10 years to happen once happens again within a month. No wonder we are astonished. One of our graduate students learned the word “formication” on a Friday and read part of this manuscript the next Sunday, two days later, illustrating the effect and providing an anecdote. Here, sensitizing the individual, the regression effect, and the recall of notable events and the non-recall of humdrum events produce a situation where coincidences are noted with much higher than their expected frequency. This model can explain vast numbers of seeming coincidences.
We verified a random sample of 50 references in the May 1986 issue of each of 3 public health journals.
31% of the 150 references had citation errors, one out of 10 being a major error (reference not locatable). 30% of the references differed from authors’ use of them with half being a major error (cited paper not related to author’s contention).
Describes the author’s experiences as a pseudo-patient on the psychiatric ward of a large public hospital for 19 days. Hospital facilities were judged excellent, and therapy tended to be extensive. Close contact with both patients and staff was obtained. Despite this contact, however, not only was the author’s simulation not detected, but his behavior was seen as consistent with the admitting diagnosis of “chronic undifferentiated schizophrenia.” Even with this misattribution it is concluded that the present institution had many positive aspects and that the depersonalization of patients so strongly emphasized by D. Rosenhan (see record 1973-21600-001) did not exist in this setting. It is recommended that future research address positive characteristics of existing institutions and possibly emulate these in upgrading psychiatric care.
…I was the ninth pseudopatient in the Rosenhan study, and my data were not included in the original report.
…even the most proper use of statistics may lead to spurious correlations or conclusions if there are inadequacies regarding the research process itself. One of these sources of error in the research process is related to selective reporting; another to human limitations with regard to the ability to make reliable observations or evaluations. Dunette (1) says:
The most common variant is, of course, the tendency to bury negative results. I only recently became aware of the massive size of this great graveyard for dead studies when a colleague expressed gratification that only a third of his studies ‘turned out’—as he put it. Recently, a second variant of this secret game was discovered, quite inadvertently, by Wolins 1962, when he wrote to 37 authors to ask for the raw-data on which they had based recent journal articles. Wolins found that of the 37 who replied, 21 reported their data to be either misplaced, lost, or inadvertently destroyed. Finally, after some negotiation, Wolins was able to complete 7 re-analyses on the data supplied from 5 authors. Of the 7, he found gross errors in 3—errors so great as to clearly change the outcome of the experiments already reported.
It should also be stressed that Rosenthal and others have demonstrated that experimenters tend to arrive at results found to be in full agreement with their expectancies, or with the expectancies of those within the scientific establishment in charge of the rewards. Even if some of Rosenthal’s results have been questioned [especially the ‘Pygmalion effect’] the general tendency seems to be unaffected.
I guess we can all agree upon the fact that selective reporting in studies on the reliability and validity, of for instance a personality test, is a bad thing. But what could be the reason for selective reporting? Why does a research worker manipulate his dead? Is it only because the research worker has a ‘weak’ mind or does there exist some kind of ‘steering field’ that exerts such an influence that improper behavior on the part of the research worker occurs?
It seems rather reasonable to assume that the editors of professional journals or research leaders in general could exert a certain harmful influence in this connection…There is no doubt at all in my mind about the ‘filtering’ or ‘shaping’ effect an editor may exert upon the output of his journal…As I see it, the major risk of selective reporting is not primarily a statistical one, but rather the research climate which the underlying policy create (“you are ‘good’ if you obtain supporting results; you are”no-good" if you only arrive at chance results").
…The analysis I carried out has had practical implications for the publication policy which we have stated as an ideal for our new journal: the European Journal of Parapsychology.
Three mutually incompatible ethical positions in regard to the fair and unbiased use of psychological tests for different groups such as Blacks and Whites are defined, including unqualified and qualified individualism and quotas. 5 statistical definitions of test bias are also reviewed and are related to the ethical judgments. Each definition is critically examined for its weaknesses on either technical or social grounds. It is argued that a statistical attempt to define fair use without recourse to substantive and causal analysis is doomed to failure.
Within the context of a general discussion of the unintended effects of scientists on the results of their research, this work reported on the growing evidence that the hypothesis of the behavioral scientist could come to serve as self-fulfilling prophecy, by means of subtle processes of communication between the experimenter and the human or animal research subject. [The Science Citation Index (SCI) and the Social Sciences Citation Index (SSCI) indicate that the book has been cited over 740 times since 1966 [as of 1979].] —“Citation Classic”
[Enlarged Edition, expanded with discussion of the Pygmalion effect etc: ISBN 0-470-01391-5]
Rosenhan’s “On Being Sane in Insane Places” is pseudoscience presented as science. Just as his pseudopatients were diagnosed at discharge as “schizophrenia in remission”, so a careful examination of this study’s methods, results, and conclusion leads to a diagnosis of “logic in remission”. Rosenhan’s study proves that pseudopatients are not detected by psychiatrists as having simulated signs of mental illness. This rather unremarkable finding is not relevant to the real problems of the reliability and validity of psychiatric diagnosis and only serves to obscure them. A correct interpretation of these data contradicts the conclusions that were drawn. In the setting of a psychiatric hospital, psychiatrists seem remarkably able to distinguish the “sane” from the “insane”.
The author discusses how to increase the quality and reliability of the research and reporting process in experimental parapsychology. Three levels of bias and control of bias are discussed. The levels are referred to as Model 1, Model 2 and Model 3 respectively.
Model 1 is characterized by its very low level of intersubjective control. The reliability of the results depends to a very great extent upon the reliability of the investigator and the editor.
Model 2 is relevant to the case when the experimenter is aware of the potential risk of making both errors of observation and recording and tries to control this bias. However, this model of control does not make allowances for the case when data are intentionally manipulated.
Model 3 depicts a rather sophisticated system of control. One feature of this model is, that selective reporting will become harder since the editor has to make his decision as regards the acceptance or rejection of an experimental article prior to the results being obtained, and subsequently based upon the quality of the outline of the experiment. However, it should be stressed, that not even this model provides a fool-proof guarantee against deliberate fraud.
It is assumed that the models of bias and control of bias under discussion are relevant to most branches of the behavioral sciences.
This copy represents our first ‘real’ issue of the European Journal of Parapsychology…As far as experimental articles are concerned, we would like to ask potential contributors to try and adhere to the publishing policy which we have outlined in the editorial of the demonstration copy, and which is also discussed at some length in the article: ‘Models of Bias and Control of Bias’ [Johnson 1975a], in this issue. In short we shall try to avoid selective reporting and yet at the same time we shall try to refrain from making our journal a graveyard for all those studies which did not ‘turn out’. These objectives may be fulfilled by the editorial rule of basing our judgment entirely on our impressions of the quality of the design and methodology of the planned study. The acceptance or rejection of a manuscript should if possible take place prior to the carrying out and the evaluation of the results of the study.
Presents an analysis of theory and research in social psychology which reveals that while methods of research are scientific in character, theories of social behavior are primarily reflections of contemporary history. The dissemination of psychological knowledge modifies the patterns of behavior upon which the knowledge is based. This modification occurs because of the prescriptive bias of psychological theorizing, the liberating effects of knowledge, and the resistance based on common values of freedom and individuality. In addition, theoretical premises are based primarily on acquired dispositions. As the culture changes, such dispositions are altered, and the premises are often invalidated. Several modifications in the scope and methods of social psychology are derived from this analysis.
…Yet, while the propagandizing effects of psychological terminology must be lamented, it is also important to trace their sources. In part the evaluative loading of theoretical terms seems quite intentional. The act of publishing implies the desire to be heard. However, value-free terms have low interest value for the potential reader, and value-free research rapidly becomes obscure. If obedience were relabeled alpha behavior and not rendered deplorable through associations with Adolf Eichmann, public concern would undoubtedly be meagre. In addition to capturing the interest of the public and the profession, value-laden concepts also provide an expressive outlet for the psychologist. I have talked with countless graduate students drawn into psychology out of deep humanistic concern. Within many lies a frustrated poet, philosopher, or humanitarian who finds the scientific method at once a means to expressive ends and an encumbrance to free expression. Resented is the apparent fact that the ticket to open expression through the professional media is a near lifetime in the laboratory. Many wish to share their values directly, unfettered by constant demands for systematic evidence. For them, value-laden concepts compensate for the conservatism usually imparted by these demands. The more established psychologist may indulge himself more directly. Normally, however, we are not inclined to view our personal biases as propagandistic so much as reflecting “basic truths.”
Explicates the fundamental nature of regression toward the mean, which is frequently misunderstood by developmental researchers. While errors of measurement are commonly assumed to be the sole source of regression effects, the latter also are obtained with errorless measures. The conditions under which regression phenomena can appear are first clearly defined. Next, an explanation of regression effects is presented which applies both when variables contain errors of measurement and when they are errorless. The analysis focuses on cause and effect relationships of psychologically meaningful variables. Finally, the implications for interpreting regression effects in developmental research are illustrated with several empirical examples.
Induction of three drug-metabolizing enzymes occurred in liver microsomes of mice and rats kept on softwood bedding of either red cedar, white pine, or ponderosa pine. This induction was reversed when animals were placed on hardwood bedding composed of a mixture of beech, birch, and maple. Differences in the capacity of various beddings to induce may partially explain divergent results of studies on drug-metabolizing enzymes. The presence of such inducing substances in the environment may influence the pharmacologic responsiveness of animals to a wide variety of drugs.
Even the animal experimenter is not exempt from problems of interaction. (I am indebted to Neal Miller for the following example.) Investigators checking on how animals metabolize drugs found that results differed mysteriously from laboratory to laboratory. The most startling inconsistency of all occurred after a refurbishing of a National Institutes of Health (NIH) animal room brought in new cages and new supplies. Previously, a mouse would sleep for about 35 minutes after a standard injection of hexobarbital. In their new homes, the NIH mice came miraculously back to their feet just 16 minutes after receiving a shot of the drug. Detective work proved that red-cedar bedding made the difference, stepping up the activity of several enzymes that metabolize hexobarbital. Pine shavings had the same effect. When the softwood was replaced with birch or maple bedding like that originally used, drug response came back in line with previous experience (Vesell, 1967).
[Influential early critique of academic psychology: weak theories, no predictions, poor measurements, poor replicability, high levels of publication bias, non-progressive theory building, and constant churn; many of these criticisms would be taken up by the ‘Minnesota school’ of Bouchard/Meehl/Lykken/etc.]
Fads include brain-storming, Q technique, level of aspiration, forced choice, critical incidents, semantic differential, role playing, and need theory. Fashions include theorizing and theory building, criterion fixation, model building, null-hypothesis testing, and sensitivity training. Folderol includes tendencies to be fixated on theories, methods, and points of view, conducting “little” studies with great precision, attaching dramatic but unnecessary trappings to experiments, grantsmanship, coining new names for old concepts, fixation on methods and apparatus, etc.
Comments on a Iowa State University graduate student’s endeavor of requiring data of a particular kind in order to carry out a study for his master’s thesis. This student wrote to 37 authors whose journal articles appeared in APA journals between 1959 and 1961. Of these authors, 32 replied. 21 of those reported the data misplaced, lost, or inadvertently destroyed. 2 of the remaining 11 offered their data on the conditions that they be notified of our intended use of their data, and stated that they have control of anything that we would publish involving these data. Errors were found in some of the raw data that was obtained which caused a dilemma of either reporting the errors or not. The commentator states that if it were clearly set forth by the APA that the responsibility for retaining raw data and submitting them for scrutiny upon request lies with the author, this dilemma would not exist. The commentator suggests that a possibly more effective means of controlling quality of publication would be to institute a system of quality control whereby random samples of raw data from submitted journal articles would be requested by editors and scrutinized for accuracy and the appropriateness of the analysis performed.
The purpose of this paper is to study the responses given to a questionnaire by subjects who received a tap water ‘placebo’ instead of lysergic acid diethylamide (LSD-25), and to relate the number of responses to other variables. These variables are: body weight, number of responses on a health questionnaire, arithmetic test scores, scores on the Wechsler-Bellevue Intelligence Scale, and Rorschach test responses.
…Figure 4 shows for each question the percentage and number of subjects out of 28 who gave a positive response at least once during the 0.5, 2.5, and 4.5-hour intervals. The questions appear in the figure in the order of decreasing percentages of response to them. The time of the response and the magnitude are disregarded in this tabulation. The question receiving the greatest percentage response was (Subject 24), “Are your palms moist?” As many as 60.7% reported this symptom. Half of the subjects reported headache (Subject 13), fatigue (Subject 44), and drowsiness (Subject 45). About 36% reported anxiety (Subject 47). Illness (Subject 1), and dizziness (Subject 15) were reported by 28.6 per cent of the group and 25% indicated a dream-like feeling (Subject 46), increased appetite (Subject 6), unsteadiness (Subject 16), a hot feeling (Subject 22), heaviness of hands and feet (Subject 30), and weakness (Subject 43). There were 19 questions which received positive responses from between 10 and 22 per cent of the subjects. Less than 10% of the group (or no more than two subjects) responded positively to the remaining questions, but each question received a positive response from at least one subject.
…The findings point out that a substance such as tap water, which is generally considered chemically and pharmacologically inactive, is capable of eliciting certain responses from certain subjects who believe they have received lysergic acid diethylamide. These observations emphasize once more the need for placebo controls in studies investigating the effects of drugs; without them changes which are produced merely by the situation and not by the drug are frequently falsely attributed to the action of the drug…Most subjects who respond to a placebo tend to do so most markedly during the first 0.5 hour after receiving the substance. At this time their anticipation of, and anxiety about, the effects of LSD-25 are probably greatest. Gradually the effects wear off, as the anticipation wears off. Individual differences exist in the time of peak effect, but this is the most common finding. The questions which elicited the greatest percentage response from the group were those related to anxiety (moist palms and feeling anxious) or to phenomena which commonly occur without the presence of any foreign agent (drowsiness, fatigue, and headache). The remaining questions received random responses. The fact that there is a wide range in the number of positive responses made to the questionnaire is of major interest.
It is not generally recognized that such an analysis [using regression] assumes that each of the variables is perfectly measured, such that a second measure X’i, of the variable measured by Xi, has a correlation of unity with Xi. If some of the measures are more accurate than others, the analysis is impaired [by measurement error]. For example, the sociologist may have a problem in which an index of economic status and an index of nativity are independent variables. What is the effect, if the index of economic status is much less satisfactory than the index of nativity? Ordinarily, the effect will be to underestimate the [coefficient] of the less adequately measured variable and to overestimate the [coefficient] of the more adequately measured variable.
If either the reliability or validity of an index is in question, at least two measures of the variable are required to permit an evaluation. The purpose of this paper is to provide a logical basis and a simple arithmetical procedure (a) for measuring the effect of the use of 2 indexes, each of one or more variables, in partial and multiple correlation analysis and (b) for estimating the likely effect if 2 indexes, not available, could be secured.
While the authors agree with John Ioannidis that “most research findings are false,” here they show that replication of research findings enhances the positive predictive value of research findings being true.
Currently, many published research findings are false or exaggerated, and an estimated 85% of research resources are wasted.
To make more published research true, practices that have improved credibility and efficiency in specific fields may be transplanted to others which would benefit from them—possibilities include the adoption of large-scale collaborative research; replication culture; registration; sharing; reproducibility practices; better statistical methods; standardization of definitions and analyses; more appropriate (usually more stringent) statistical thresholds; and improvement in study design standards, peer review, reporting and dissemination of research, and training of the scientific workforce.
Selection of interventions to improve research practices requires rigorous examination and experimental testing whenever feasible.
Optimal interventions need to understand and harness the motives of various stakeholders who operate in scientific research and who differ on the extent to which they are interested in promoting publishable, fundable, translatable, or profitable results.
Modifications need to be made in the reward system for science, affecting the exchange rates for currencies (eg., publications and grants) and purchased academic goods (eg., promotion and other academic or administrative power) and introducing currencies that are better aligned with translatable and reproducible research.
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical-significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Published research findings are sometimes refuted by subsequent evidence, says Ioannidis, with ensuing confusion and disappointment.
We examine the PPV as a function of the number of statistically-significant findings. Figure 1 shows the PPV of at least one, two, or three statistically-significant research findings out of ten independent studies as a function of the pre-study odds of a true relationship (R) for powers of 20% and 80%. The lower lines correspond to Ioannidis’ finding and indicate the probability of a true association when at least one out of ten studies shows a statistically-significant result. As can be seen, the PPV is substantially higher when more research findings are statistically-significant. Thus, a few positive replications can considerably enhance our confidence that the research findings reflect a true relationship. When R ranged from 0.0001 to 0.01, a higher number of positive studies is required to attain a reasonable PPV. Thedifference in PPV for power of 80% and power of 20% when at least three studies are positive is higher than when at least one study is positive. Figure 2 gives the PPV for increasing number of positive studies out of ten, 25, and 50 studies for pre-study odds of 0.0001, 0.01, 0.1, and 0.5 for powers of 20% and 80%. When there is at least one positive study (r = 1) and power equal to 80%, as indicated in Ioannidis’ paper, PPV declined approximately 50% for 50 studies compared to ten studies for R values between 0.0001 and 0.1. However, PPV increases with increasing number of positive studies and thepercentage of positive studies required to achieve a given PPV declines with increasing number of studies. The number of positive studies required to achieve a PPV of at least 70% increased from eight for ten studies to 12 for 50 studies when pre-study odds equaled 0.0001, from five for ten studies to eight for 50 studies when pre-study odds equaled 0.01, from three for ten studies to six for 50 studies when pre-study odds equaled 0.1, and from two for ten studies to five for 50 studies when pre-study odds equaled 0.5. The difference in PPV for powers of 80% and 20% declines with increasing number of studies.
…In summary, while we agree with Ioannidis that most research findings are false, we clearly demonstrate that replication of research findings enhances the positive predictive value of research findings being true. While this is not unexpected, it should be encouraging news to researchers in their never-ending pursuit of scientific hypothesis generation and testing. Nevertheless, more methodological work is needed to assess and interpret cumulative evidence of research findings and their biological plausibility. This is especially urgent in the exploding field of genetic associations.
Low reproducibility rates within life science research undermine cumulative knowledge production and contribute to both delays and costs of therapeutic drug development. An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/year spent on preclinical research that is not reproducible—in the United States alone. We outline a framework for solutions and a plan for long-term improvements in reproducibility rates that will help to accelerate the discovery of life-saving therapies and cures.
This Perspective provides estimates of the rate of irreproducibility of preclinical research and its direct cost implications. It goes on to outline a framework for solutions and a plan for long-term improvements in reproducibility rates.
A study by David Baker and colleagues reveals poor quality of reporting in pre-clinical animal research and a failure of journals to implement the ARRIVE guidelines.
There is growing concern that poor experimental design and lack of transparent reporting contribute to the frequent failure of pre-clinical animal studies to translate into treatments for human disease. In 2010, the Animal Research: Reporting of In VivoExperiments (ARRIVE) guidelines were introduced to help improve reporting standards. They were published in PLOS Biology and endorsed by funding agencies and publishers and their journals, including PLOS, Nature research journals, and other top-tier journals.Yet our analysis of papers published in PLOS and Nature journals indicates that there has been very little improvement in reporting standards since then. This suggests that authors, referees, and editors generally are ignoring guidelines, and the editorial endorsement is yet to be effectively implemented.