statistics/​bias directory


“The Impact of Digital Media on Children’s Intelligence While Controlling for Genetic Differences in Cognition and Socioeconomic Background”, Sauce et al 2022

“The impact of digital media on children’s intelligence while controlling for genetic differences in cognition and socioeconomic background”⁠, Bruno Sauce, Magnus Liebherr, Nicholas Judd, Torkel Klingberg (2022-05-11; ; backlinks):

Digital media defines modern childhood, but its cognitive effects are unclear and hotly debated. We believe that studies with genetic data could clarify causal claims and correct for the typically unaccounted role of genetic predispositions.

Here, we estimated the impact of different types of screen time (watching, socializing, or gaming) on children’s intelligence while controlling [using a polygenic score predicting 7% IQ variance] for the confounding effects of genetic differences in cognition and socioeconomic status⁠. We analyzed 9,855 children from the USA who were part of the ABCD dataset with measures of intelligence at baseline (ages 9–10) and after 2 years.

At baseline, time watching (r = −0.12) and socializing (r = −0.10) were negatively correlated with intelligence, while gaming did not correlate. After 2 years, gaming positively impacted intelligence (standardized β = +0.17), but socializing had no effect. This is consistent with cognitive benefits documented in experimental studies on video gaming. Unexpectedly, watching videos also benefited intelligence (standardized β = +0.12), contrary to prior research on the effect of watching TV. However, in a post hoc analysis, this was not statistically-significant when parental education (instead of SES) was controlled for.

Broadly, our results are in line with research on the malleability of cognitive abilities from environmental factors, such as cognitive training and the Flynn effect⁠.

[Obviously wrong. Cognitive training doesn’t work in randomized experiments, the Lee et al 2018 PGS explains less than a fifth of genetics and doesn’t ‘control’ for much at all, and their correlates are probably just residual confounding⁠.]
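The arithmetic behind this objection can be made explicit. A back-of-envelope sketch, assuming an illustrative adult IQ heritability of ~50% (my assumption, not a figure from the paper):

```python
# How much of the genetic confound does the polygenic score actually remove?
h2 = 0.50      # assumed total variance in IQ explained by genetics (illustrative)
pgs_r2 = 0.07  # variance explained by the PGS, from the abstract

fraction_controlled = pgs_r2 / h2
print(f"{fraction_controlled:.0%} of genetic variance controlled")
```

Under that assumption, ~86% of the genetic signal is left uncontrolled, consistent with the "less than a fifth" complaint.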

“Theoretical False Positive Psychology”, Wilson et al 2022

2022-wilson.pdf: “Theoretical false positive psychology”⁠, Brent M. Wilson, Christine R. Harris, John T. Wixted (2022-05-02; backlinks):

A fundamental goal of scientific research is to generate true positives (ie. authentic discoveries). Statistically, a true positive is a statistically-significant finding for which the underlying effect size (δ) is greater than 0, whereas a false positive is a statistically-significant finding for which δ equals 0. However, the null hypothesis of no difference (δ = 0) may never be strictly true because innumerable nuisance factors can introduce small effects for theoretically uninteresting reasons. If δ never equals zero, then with sufficient power, every experiment would yield a statistically-significant result. Yet running studies with higher power by increasing sample size (N) is one of the most widely agreed upon reforms to increase replicability. Moreover, and perhaps not surprisingly, the idea that psychology should attach greater value to small effect sizes is gaining currency.

Increasing N without limit makes sense for purely measurement-focused research, where the magnitude of δ itself is of interest, but it makes less sense for theory-focused research, where the truth status of the theory under investigation is of interest.

Increasing power to enhance replicability will increase true positives at the level of the effect size (statistical true positives) while increasing false positives at the level of theory (theoretical false positives). With too much power, the cumulative foundation of psychological science would consist largely of nuisance effects masquerading as theoretically important discoveries.

Positive predictive value at the level of theory is maximized by using an optimal N, one that is neither too small nor too large…PPV at the level of theory is the probability that a p < 0.05 result confirming a theory-based prediction reflects the effect of the theoretical mechanism, not a nuisance factor.

[Keywords: null hypothesis statistical-significance-testing, false positives, positive predictive value, replication crisis]
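Wilson et al's tradeoff can be sketched numerically. In this sketch, all effect sizes and the prior are illustrative assumptions (not values from the paper): as N grows, tests of tiny nuisance effects reach full power, so the share of "theory-confirming" significant results that reflect the theorized mechanism falls toward the prior.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(delta, n, crit=1.96):
    """One-sample z-test power in the predicted direction (normal approximation)."""
    return 1.0 - phi(crit - delta * sqrt(n))

def theory_ppv(n, delta_true=0.5, delta_nuisance=0.05, prior=0.5):
    """P(theory true | significant result in the predicted direction).
    delta_true, delta_nuisance, and prior are illustrative assumptions."""
    tp = prior * power(delta_true, n)              # theory-driven true positives
    fp = (1 - prior) * power(delta_nuisance, n)    # nuisance-driven "confirmations"
    return tp / (tp + fp)

for n in (20, 200, 2000, 20000):
    print(n, round(theory_ppv(n), 3))
```

With these assumptions, theoretical PPV falls from ~0.94 at n = 20 toward the 0.5 prior by n = 20,000: statistical power rises while PPV at the level of theory collapses.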

“Does Democracy Matter?”, Gerring et al 2022

“Does Democracy Matter?”⁠, John Gerring, Carl Henrik Knutsen, Jonas Berge (2022-05; ):

Does democracy matter for normatively desirable outcomes?

We survey results from 1,100 cross-country analyses drawn from 600 journal articles published after the year 2000. These analyses are conducted on 30 distinct outcomes pertaining to social policy, economic policy, citizenship and human rights, military and criminal justice, and overall governance.

Across these diverse outcomes, most studies report either a positive or null relationship with democracy.

However, there is evidence of threshold bias, suggesting that reported findings may reflect a somewhat exaggerated image of democracy’s effects. Additionally, democratic effects are more likely to be found for outcomes that are easily attained than for those that lie beyond the reach of government but are often of great normative importance. We also find that outcomes measured by subjective indicators show a stronger positive relationship with democracy than outcomes that are measured or proxied by more objective indicators.

[Keywords: democracy, regime type, governance, economic policy, social policy]

“‘I Think I Discovered a Military Base in the Middle of the Ocean’—Null Island, the Most Real of Fictional Places”, Juhasz & Mooney 2022

“‘I think I discovered a military base in the middle of the ocean’—Null Island, the most real of fictional places”⁠, Levente Juhasz, Peter Mooney (2022-04-18; ):

This paper explores Null Island⁠, a fictional place located at 0° latitude and 0° longitude in the WGS84 geographic coordinate system.

Null Island is erroneously associated with large amounts of geographic data in a wide variety of location-based services, place databases, social media and web-based maps. While it was originally considered a joke within the geospatial community, this article will demonstrate implications of its existence, both technological and social in nature, promoting Null Island as a fundamental issue of geographic information that requires more widespread awareness.

The article summarizes error sources that lead to data being associated with Null Island.

We identify 4 evolutionary phases which help explain how this fictional place evolved and established itself as an entity reaching beyond the geospatial profession to the point of being discovered by the visual arts and the general population. After providing an accurate account of data that can be found at (0, 0), geospatial, technological and social implications of Null Island are discussed.

Guidelines to avoid misplacing data to Null Island are provided. Since data will likely continue to appear at this location, our contribution is aimed at both GIScientists and the general population to promote awareness of this error source.
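One of the most common error sources is easy to reproduce: a parser that fails silently and coerces a missing or unparseable coordinate to zero. A minimal sketch (the function names and fallback behavior are illustrative, not taken from the paper):

```python
def parse_coord(record):
    """Buggy but widespread pattern: an unparseable lat/lon silently becomes
    0.0, relocating the record to Null Island at (0, 0)."""
    try:
        return float(record.get("lat", "")), float(record.get("lon", ""))
    except (TypeError, ValueError):
        return 0.0, 0.0  # fabricates a real WGS84 location in the Gulf of Guinea

def parse_coord_safe(record):
    """Safer: propagate absence instead of inventing coordinates."""
    try:
        return float(record["lat"]), float(record["lon"])
    except (KeyError, TypeError, ValueError):
        return None

print(parse_coord({"lat": "n/a", "lon": ""}))       # (0.0, 0.0) -> Null Island
print(parse_coord_safe({"lat": "n/a", "lon": ""}))  # None
```

The design point is that `None` forces downstream code to handle missing locations explicitly, whereas a `(0.0, 0.0)` default is indistinguishable from a genuine fix at 0° latitude, 0° longitude.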

“Clinical Prediction Models in Psychiatry: a Systematic Review of Two Decades of Progress and Challenges”, Meehan et al 2022

“Clinical prediction models in psychiatry: a systematic review of two decades of progress and challenges”⁠, Alan J. Meehan, Stephanie J. Lewis, Seena Fazel, Paolo Fusar-Poli, Ewout W. Steyerberg, Daniel Stahl et al (2022-04-01; ; similar):

Recent years have seen the rapid proliferation of clinical prediction models aiming to support risk stratification and individualized care within psychiatry. Despite growing interest, attempts to synthesize current evidence in the nascent field of precision psychiatry have remained scarce.

This systematic review therefore sought to summarize progress towards clinical implementation of prediction modeling for psychiatric outcomes. We searched MEDLINE⁠, PubMed⁠, Embase⁠, and PsycINFO databases from inception to September 30, 2020, for English-language articles that developed and/​or validated multivariable models to predict (at an individual level) onset, course, or treatment response for non-organic psychiatric disorders (PROSPERO: CRD42020216530). Individual prediction models were evaluated based on 3 key criteria: (1) mitigation of bias and overfitting; (2) generalizability; and (3) clinical utility. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was used to formally appraise each study’s risk of bias.

228 studies detailing 308 prediction models were ultimately eligible for inclusion. 94.5% of developed prediction models were deemed to be at high risk of bias, largely due to inadequate or inappropriate analytic decisions. Insufficient internal validation efforts (within the development sample) were also observed, while only one-fifth of models underwent external validation in an independent sample. Finally, our search identified just one published model whose potential utility in clinical practice was formally assessed.

Our findings illustrated substantial growth in precision psychiatry with promising progress towards real-world application. Nevertheless, these efforts have been inhibited by a preponderance of bias and overfitting, while the generalizability and clinical utility of many published models has yet to be formally established. Through improved methodological rigor during initial development, robust evaluations of reproducibility via independent validation, and evidence-based implementation frameworks, future research has the potential to generate risk prediction tools capable of enhancing clinical decision-making in psychiatric care.

[ML prediction will work, but like GWASes or deep learning or brain imaging, people will be unhappy how much data it will take.]

“Reproducible Brain-wide Association Studies Require Thousands of Individuals”, Marek et al 2022

2022-marek.pdf: “Reproducible brain-wide association studies require thousands of individuals”⁠, Scott Marek, Brenden Tervo-Clemmens, Finnegan J. Calabro, David F. Montez, Benjamin P. Kay, Alexander S. Hatoum et al (2022-03-16; ; similar):

Magnetic resonance imaging (MRI) has transformed our understanding of the human brain through well-replicated mapping of abilities to specific structures (for example, lesion studies) and functions (for example, task functional MRI (fMRI)). Mental health research and care have yet to realize similar advances from MRI. A primary challenge has been replicating associations between inter-individual differences in brain structure or function and complex cognitive or mental health phenotypes (brain-wide association studies (BWAS)). Such BWAS have typically relied on sample sizes appropriate for classical brain mapping (the median neuroimaging study sample size is about 25), but potentially too small for capturing reproducible brain-behavioural phenotype associations.

Here we used 3 of the largest neuroimaging datasets currently available—with a total sample size of around 50,000 individuals—to quantify BWAS effect sizes and reproducibility as a function of sample size. [Adolescent Brain Cognitive Development (ABCD) study, n = 11,874; Human Connectome Project (HCP), n = 1,200; and UK Biobank (UKB), n = 35,735]

BWAS associations were smaller than previously thought, resulting in statistically underpowered studies, inflated effect sizes and replication failures at typical sample sizes. As sample sizes grew into the thousands, replication rates began to improve and effect size inflation decreased. More robust BWAS effects were detected for functional MRI (versus structural), cognitive tests (versus mental health questionnaires) and multivariate methods (versus univariate).

Smaller than expected brain-phenotype associations and variability across population subsamples can explain widespread BWAS replication failures. In contrast to non-BWAS approaches with larger effects (for example, lesions, interventions and within-person), BWAS reproducibility requires samples with thousands of individuals.
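The sample-size claim follows directly from power analysis for small correlations. A sketch using the Fisher z-approximation, with illustrative effect sizes (the specific r values below are my choices, not the paper's estimates):

```python
from math import atanh, erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def corr_power(r, n, crit=1.96):
    """Approximate power of a two-sided test of rho = 0 via Fisher's z
    (SE of atanh(r) is 1/sqrt(n - 3))."""
    return 1.0 - phi(crit - atanh(r) * sqrt(n - 3))

def n_for_power(r, target=0.80):
    """Smallest n reaching the target power for a true correlation r."""
    n = 5
    while corr_power(r, n) < target:
        n += 1
    return n

for r in (0.5, 0.2, 0.1, 0.05):
    print(f"r = {r}: n ≈ {n_for_power(r)}")
```

Under these assumptions, the "classical" n ≈ 25 is adequate only for large effects around r ≈ 0.5, while an effect of r = 0.05 demands thousands of participants, which is the paper's headline conclusion.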

“A 680,000-person Megastudy of Nudges to Encourage Vaccination in Pharmacies”, Milkman et al 2022

“A 680,000-person megastudy of nudges to encourage vaccination in pharmacies”⁠, Katherine L. Milkman, Linnea Gandhi, Mitesh S. Patel, Heather N. Graci, Dena M. Gromet, Hung Ho, Joseph S. Kay et al (2022-02-08; ⁠, ; similar):

[See also Hoogeveen et al 2020⁠; “The pandemic fallacy: Inaccuracy of social scientists’ and lay judgments about COVID-19’s societal consequences in America”⁠, Hutcherson et al 2021] Encouraging vaccination is a pressing policy problem. Our megastudy with 689,693 Walmart pharmacy customers demonstrates that text-based reminders can encourage pharmacy vaccination and establishes what kinds of messages work best.

We tested 22 different text reminders using a variety of different behavioral science principles to nudge flu vaccination⁠. Reminder texts increased vaccination rates by an average of 2.0 percentage points (6.8%) over a business-as-usual control condition. The most-effective messages reminded patients that a flu shot was waiting for them and delivered reminders on multiple days. The top-performing intervention included 2 texts 3d apart and stated that a vaccine was “waiting for you.”

[Scientists but not laymen on Prolific] Forecasters failed to anticipate that this would be the best-performing treatment, underscoring the value of testing.

Encouraging vaccination is a pressing policy problem. To assess whether text-based reminders can encourage pharmacy vaccination and what kinds of messages work best, we conducted a megastudy.

We randomly assigned 689,693 Walmart pharmacy patients to receive one of 22 different text reminders using a variety of different behavioral science principles to nudge flu vaccination or to a business-as-usual control condition that received no messages.

We found that the reminder texts that we tested increased pharmacy vaccination rates by an average of 2.0 percentage points, or 6.8%, over a 3-mo follow-up period. The most-effective messages reminded patients that a flu shot was waiting for them and delivered reminders on multiple days. The top-performing intervention included 2 texts delivered 3d apart and communicated to patients that a vaccine was “waiting for you”.

Neither experts [r = 0.03] nor lay people [r = 0.60] anticipated that this would be the best-performing treatment, underscoring the value of simultaneously testing many different nudges in a highly powered megastudy.

[Keywords: vaccination, COVID-19, nudge, influenza, field experiment]
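The two lift figures reported in the abstract pin down the control group's baseline rate, worth making explicit since absolute and relative lifts are easily conflated:

```python
absolute_lift_pp = 2.0  # percentage points, from the abstract
relative_lift = 0.068   # 6.8% relative increase, from the abstract

# relative = absolute / control  =>  control = absolute / relative
implied_control_rate = absolute_lift_pp / relative_lift
print(f"implied control vaccination rate ≈ {implied_control_rate:.1f}%")
```

So roughly 29% of control patients got vaccinated anyway, and the reminder texts moved that to about 31%.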

…To assess how well the relative success of these messages could be forecasted ex ante, both the scientists who developed the texts and a separate sample of lay survey respondents predicted the impact of different interventions on flu vaccination rates…Prediction Study Methods: To assess the ex ante predictability of this megastudy’s results, we collected forecasts of different interventions’ efficacy from 2 populations. First, in November 2020, we invited each of the scientists who designed one or more interventions in our megastudy to estimate the vaccination rates among patients in all 22 intervention conditions as well as among patients in the business-as-usual control condition. 24 scientists participated (89% of those asked), including at least one representative from each design team, and these scientists made a total of 528 forecasts. In January 2021, we also recruited 406 survey respondents from Prolific to predict the vaccination rates among patients in 6 different intervention conditions (independently selected randomly from the 22 interventions for each forecaster) as well as among patients in the business-as-usual control, which generated a total of 2,842 predictions. Participants from both populations were shown a realistic rendering of the messages sent in a given intervention and then asked to predict the percentage of people in that condition who would get a flu shot from Walmart pharmacy between September 25, 2020, and October 30, 2020. For more information on recruitment materials, participant demographics, and the prediction survey, refer to SI Appendix⁠.

Prediction Study Results: The average predictions of scientists did not correlate with observed vaccination rates across our megastudy’s 23 different experimental conditions (n = 23, r = 0.03, and p = 0.880).

Prolific raters, in contrast, on average accurately predicted relative vaccination rates across our megastudy’s conditions (n = 23, r = 0.60, and p = 0.003)—a marginally statistically-significant difference (Dunn and Clark’s z-test: p = 0.048; Steiger’s z-test: p = 0.051; and Meng et al’s z-test: p = 0.055) (refs. 25–27).

Further, the median scientist prediction of the average lift in vaccinations across our interventions was 6.2%, while the median Prolific respondent guess was 8.3%—remarkably close to the observed average of 8.9%. Notably, neither population correctly guessed the top-performing intervention. In fact, scientists’ predictions placed it 15th out of 22, while Prolific raters’ predictions placed it 16th out of 22 (SI Appendix, Table S18).

Figure S3: By condition, the actual vaccination rate versus the 95% confidence interval predictions by scientists (24 scientists making a total of 552 predictions, Panel A) and lay predictors (406 individuals making a total of 2,842 predictions, Panel B).

“The Backfire Effect After Correcting Misinformation Is Strongly Associated With Reliability”, Swire-Thompson et al 2022

2022-swirethompson.pdf: “The backfire effect after correcting misinformation is strongly associated with reliability”⁠, Briony Swire-Thompson, Nicholas Miklaucic, John P. Wihbey, David Lazer, Joseph DeGutis (2022-02-07; ⁠, ; similar):

The “backfire effect” is when a correction increases belief in the very misconception it is attempting to correct, and it is often used as a reason not to correct misinformation.

The current study aimed to test whether correcting misinformation increases belief more than a no-correction control. Furthermore, we aimed to examine whether item-level differences in backfire rates were associated with test-retest reliability or theoretically meaningful factors. These factors included worldview-related attributes, including perceived importance and strength of pre-correction belief, and familiarity-related attributes, including perceived novelty and the illusory truth effect⁠.

In 2 nearly identical experiments, we conducted a longitudinal pre/​post design with n = 388 and 532 participants. Participants rated 21 misinformation items and were assigned to a correction condition or test-retest control.

We found that no items backfired more in the correction condition compared to test-retest control or initial belief ratings. Item backfire rates were strongly negatively correlated with item reliability (ρ = −0.61/−0.73) and did not correlate with worldview-related attributes. Familiarity-related attributes were statistically-significantly correlated with backfire rate, though they did not consistently account for unique variance beyond reliability. While there have been previous papers highlighting the non-replicable nature of backfire effects, the current findings provide a potential mechanism for this poor replicability.

It is crucial for future research into backfire effects to use reliable measures, report the reliability of their measures, and take reliability into account in analyses. Furthermore, fact-checkers and communicators should not avoid giving corrective information due to backfire concerns.

[Keywords: misinformation, reliability, belief updating, the backfire effect]

…At best, unreliable measures add noise and complicate the interpretation of effects observed. At worst, unreliable measures can produce statistically-significant findings that are spurious artifacts (Loken & Gelman 2017). A major drawback of prior misinformation research is that experiments investigating backfire effects have typically not reported the reliability of their measures (for an exception, see Horne et al 2015). Due to random variation or regression to the mean in a pre/​post study, items with low reliability would be more likely to show a backfire effect. In a previous meta-analysis (Swire-Thompson et al 2020), we found preliminary evidence for this reliability-backfire relationship by comparing studies using single-item measures—which typically have poorer reliability (Jacoby 1978⁠; Peter 1979)—with more reliable multi-item measures. Examining 31 studies and 72 dependent measures, we found that the proportion of backfire effects observed with single item measures was substantially greater than those found in multi-item measures. Notably, when a backfire effect was reported, 81% of these cases were with single-item measures (70% of worldview backfire effects and 100% of familiarity backfire effects), whereas only 19% of cases used multi-item measures. This suggests that measurement error could be a contributing factor, but it is important to more directly measure the contribution of reliability to the backfire effect.
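The proposed mechanism (measurement error masquerading as backfire) can be sketched analytically: for an item whose belief truly drops after correction, the probability that its observed mean change points the wrong way grows as test-retest reliability falls. All numbers below are illustrative assumptions, not the paper's data.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def apparent_backfire_prob(reliability, true_effect=-0.3, n=50):
    """P(observed mean belief change > 0) for an item whose belief truly
    drops by |true_effect| SD after correction. With unit true-score
    variance, per-rating error variance is (1 - reliability)/reliability,
    and a pre/post change score carries twice that error."""
    error_var = (1.0 - reliability) / reliability
    se = sqrt(2.0 * error_var / n)
    return 1.0 - phi(-true_effect / se)

for rel in (0.9, 0.7, 0.5, 0.3):
    print(f"reliability {rel}: apparent backfire rate ≈ {apparent_backfire_prob(rel):.3f}")
```

Reliably measured items essentially never appear to backfire here; noisy (eg. single-item) measures do, purely from sampling error, in the direction of the strong negative item-level correlation the authors report.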

“Dream Interpretation from a Cognitive and Cultural Evolutionary Perspective: The Case of Oneiromancy in Traditional China”, Hong 2022

2022-hong.pdf: “Dream Interpretation from a Cognitive and Cultural Evolutionary Perspective: The Case of Oneiromancy in Traditional China”⁠, Ze Hong (2022-01-23; ⁠, ⁠, ; similar):

Why did people across the world and throughout history believe that dreams can foretell what will occur in the future? In this paper, I attempt to answer this question within a cultural evolutionary framework by emphasizing the cognitive aspect of dream interpretation; namely, the fact that dreams were often viewed as meaningful and interpretable has to do with various psychological and social factors that influence how people obtain and process information regarding the validity of dream interpretation as a technique.

Through a comprehensive analysis of a large dataset of dream occurrences in the official Chinese historical records [and dream encyclopedias], I argue that the ubiquity and persistence of dream interpretation have a strong empirical component (predictively accurate dream cases), which is particularly vulnerable to transmission errors and biases. The overwhelmingly successful records of dream prediction in transmitted texts, I suggest, is largely due to the fabrication and retrospective inference of past dreams, as well as the under-reporting of predictive failures [selection bias⁠, post hoc confirmation bias⁠/​publication bias]. These “positive data” then reinforce individuals’ confidence in the predictive power of dreams.

I finally show a potential decline of the popularity of dream interpretation in traditional China and offer a few suggestive explanations drawing on the unique characteristics of oneiromancy compared to other divination techniques.

[Keywords: cultural evolution, divination, oneiromancy, China, dream]

…Since this paper focuses on why people believe in the validity of oneiromancy, I propose to classify dreams by their epistemological status. Specifically, dreams as signs that usually need to be interpreted (often with professional expertise) and dreams as messages transmitted by other humans or human-like agents. This distinction is useful because it highlights how the perceived plausibility of the 2 kinds of dreams may be affected by one’s larger theoretical commitment. The famous Eastern Han skeptical thinker, Wang Chong (27–97 CE), for example, denies the possibility of message dreams but would entertain the possibility of certain sign dreams (He 2011).

2.3. The cultural transmission of oneiromancy instructions and cases: Because of the indispensability of interpretation in sign dreams, there is often an interest in and demand for instructions on how to correctly interpret the content of dreams. In ancient China, there was a rich tradition of collecting and compiling dreams and their associated meanings (Fu 2017; Liu 1989), and some of the most popular compilations, such as The Duke of Zhou’s Explanations of Dreams, can still be purchased in bookstores today (Yun 2013). As mentioned, the other aspect of cultural transmission of oneiromancy, the transmission of actual oneiromancy cases and the associated predictive outcomes (whether the prediction was successful or not), is also important; intuitively, one would not take dreams very seriously if all she hears about oneiromancy are failed predictions.

In China, oneiromancy cases were recorded in historical records, philosophical writings, and a wide range of literary forms (fiction, drama, poetry, etc.) (Liu 1989). During later dynasties, compilations of oneiromancy cases in the form of encyclopedias became popular with improved printing technology and the expansion of book publishing and distribution (Vance 2012, “Textualizing dreams in a Late Ming dream encyclopedia”). These encyclopedias often contained both dream prognostic instructions and actual cases; in an extensive analysis of an oneiromancy encyclopedia, Forest of Dreams compiled in 1636 CE, for example, Vance 2012 shows that it contained not only instructions on how to interpret dreams but also many case descriptions of predictive dreams.

…In total, I collected 793 dream occurrences and recorded information regarding the type of dreams, the dreamer, the interpreter, the interpretation of the dream, and the predictive accuracy of the dream interpretation whenever possible (see Supplementary Material for details)…Figure 3 shows the relative proportion of dreams in terms of their predictive accuracy over historical time, and what is immediately obvious is that most dream occurrences are prophetic and have an associated confirmatory outcome. That is, whenever dreams are mentioned in these official historical records, the readers can expect that they are predictive of some later outcome, which is usually verified.

Figure 3: Relative proportion of dreams of different accuracy types as recorded in official dynastic records by chronological order.

…To what extent were these stories believed? Historical texts do not offer straightforward answers, but we can, nonetheless, get some indirect clues. The famous skeptic during the Eastern Han dynasty⁠, Wang Chong (27–97 CE) made the following comment on the story about how the mother of the first Han emperor Liu Ao dreamed of a dragon which presumably induced the pregnancy:

“From the chronicle of Gaozu (the later founding emperor of the Han dynasty) we learn that dame Liu (mother of Gaozu) was reposing on the banks of a large lake. In her dream she met with a spirit. At the time there was a tempest with thunder and lightning and a great darkness. Taigong (Gaozu’s father) went near, and perceived a dragon above her. She became enceinte and was delivered of Gaozu. These instances of the supernatural action of spirits are not only narrated, but also written down, and all the savants of the day swear by them.” (Lun Heng⁠, Chapter 26 [?], Forke 1907’s translation)

Thus, the story goes that Gaozu’s mother met with a spirit (and presumably had sexual intercourse with it) whose earthly manifestation was a dragon. According to Wang Chong, all the savants believed the veracity of the story, and he felt compelled to make a case against it. Of course, we do not know for sure whether the savants at the time genuinely believed in it or were merely pretending for political reasons. I suggest that some, perhaps many of them were genuine believers; even Wang Chong himself who argued against this kind of supernatural pregnancy believed that when great men are born, there will be signs occurring either in reality or dreams; he just does not believe that nonhuman species, such as dragons, can have sexual intercourse with humans.

…To get a better sense of the number of such “political justification” dreams, I computed the percentage of such dreams out of the total number of dreams in different historical periods (Table 1).

From Table 1, we can clearly see that in all 3 historical periods (the reason for using Southern-Northern Dynasties as the dividing period will be made clear in Section 3.4), there is a nontrivial proportion of recorded dreams of such type. The percentage of dreams that could be used to justify political power is slightly higher in the pre-Southern-Northern Dynasties period and remains roughly constant in the later 2 periods.

In addition to intentional fabrication, some dreams may be “false memories”; that is, individuals may falsely remember and report dreams that they never experienced if these dreams were expected in the community. Recent psychological research on dreams has suggested that the encoding of memories of dreams may share the same neurocognitive basis as autobiographical memory and thus be subject to false memory (Beaulieu-Prévost & Zadra 2015). Psychologists have long known that subjective dream reports are often unreliable (Schwitzgebel 2011, Perplexities of consciousness), and both theoretical accounts and empirical studies (Beaulieu-Prévost & Zadra 2015) have suggested that false memories may occur quite often in dreams (Rosen 2013). In particular, Rosen 2013 points out there is often substantial memory loss in dream recall, which may lead to a “fill in the blanks” process.

While the dreamer may fabricate or falsely remember their dreams, the observer can also infer dreams retrospectively. Historians in ancient China often have an “if there is an outcome, then there must be a sign” mentality (Zheng 2014) when recording events that were supposed to be predicted by divination. Similarly, Vance 2012 in her extensive treatment of dream interpretation of the Ming dynasty argues that written and transmitted dreams often reveal not what the dreamers actually dreamed of but what the recorder believed about the dreams. In my dataset, a substantial proportion of the dreams (11%) were described in a retrospective and explanatory manner, marked by the phrase “in the beginning” (chu). This way of writing gives the impression that the authors were trying to find signs that had already foretold the fate of individuals in order to create a coherent narrative.

Therefore, it is likely that the retelling and recording of dreams involved an imaginative and inferential process. Li 1999 points out that in early Chinese historical writing, authors may present cases where multiple individuals shared the same dream to prove its objective veracity. In my dataset, 1.3% of total dreams were reported to have multiple dreamers, and in the most extreme case, hundreds of people were said to have dreamed of the same thing. Although this is not statistically impossible, we can safely conclude (unless we seriously entertain the possibility of ghosts and spirits sending dream messages to multiple individuals simultaneously) that there was either some serious fabrication or false inference.

3.3. Under-reporting of failed dream predictions/​wrong dream interpretations: In addition to the fabrication/​retrospective inference of oneiromancy cases, under-reporting of failed predictions very likely existed to a substantial extent. The Song historian and philosopher Lü Zuqian (1137–1181 CE) made the following statement when commenting on the Confucian text Zuo Zhuan (~500 BCE) regarding the accuracy of divination predictions:

“Some people ask: ‘Zuo’s record of crackmaking and milfoil divination cases were so amazing and spectacular; given such predictive accuracy, why are there so few [records] of them?’ The answer: ‘From the Lord Yin [Duke Yin of Lu] till Lord Ai was a total of 222 years. Kings, lords, dukes, the literati and the commoners perhaps made tens of thousands of divinations, and only tens of the efficacious cases were recorded in Zuo’s book. These tens of cases were collected in Zuo’s book and therefore feel like a lot; if they were dispersed into the 222 years they would feel extremely rare. If divination cases were of deceptive nature or had failed predictions, they would not have been transmitted during their time and not be recorded in the book. I do not know how many tens of thousands of them were missed. If we had all of them [recorded], they would not be so rare.’” (Donglai Zuoshi Boyi)

The early Qing scholar Xiong Bolong (1616–1669 CE) commented on using dream signs to predict the sex of the fetus more specifically14:

It is not the case that all pregnant women have the same type of dreams, and it is not the case that if [she] dreams of certain signs she must give birth to a son or a daughter. There are also instances where one dreams of a bear15 yet gives birth to a daughter, and instances where one dreams of a snake and gives birth to a son. The poets [diviners] tell of the cases where their predictions are fulfilled and do not talk about the cases where they failed. (Wuhe Ji16)

…my own fieldwork in southwest China among the Yi shows that many people are unwilling to reveal the divination or healing ritual failures of local shamans because these shamans are often friends and neighbors of the clients and there is the concern that spreading “accidental” failures may taint their reputation (Hong, submitted).

…As we have argued elsewhere, under-reporting of failed predictions may be a prevalent feature of divination in ancient societies (Anonymized, forthcoming). By selectively omitting failed predictions, these transmitted texts give a false impression that dream interpretations are overwhelmingly accurate, which, along with fabrication and ad hoc inference of predictive dreams, serves as a powerful mechanism to empirically sustain the validity of oneiromancy.
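The base-rate logic of Lü Zuqian’s argument is easy to demonstrate. A minimal simulation (all numbers hypothetical): even if divination predictions are pure chance, a written record that transmits only the successes will look perfectly accurate.

```python
import random

random.seed(1)

# Hypothetical numbers: tens of thousands of divinations, predictions at
# pure chance accuracy (50%), but chroniclers transmit only successes.
n_divinations = 20_000
outcomes = [random.random() < 0.5 for _ in range(n_divinations)]  # True = fulfilled

# Only a few dozen of the successes survive into the written record.
recorded = [o for o in outcomes if o][:40]

overall_accuracy = sum(outcomes) / len(outcomes)
apparent_accuracy = sum(recorded) / len(recorded)

print(f"accuracy over all divinations: {overall_accuracy:.2f}")    # ~0.50
print(f"accuracy in the surviving record: {apparent_accuracy:.2f}")  # 1.00
```

Whatever the true base rate, selective transmission yields the same impression of near-infallibility, so the surviving record is uninformative about divinatory accuracy.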

“A Systematic Review and Meta-analysis of the Success of Blinding in Antidepressant RCTs”, Scott et al 2022

2022-scott.pdf: “A systematic review and meta-analysis of the success of blinding in antidepressant RCTs”⁠, Amelia J. Scott, Louise Sharpe, Ben Colagiuri (2022-01; ; similar):

  • Successful blinding is an important feature of double-blind randomized controlled trials⁠, and ensures that the safety and efficacy of treatments are accurately appraised.
  • In a range of fields (eg. chronic pain⁠, general medicine), few trials report assessing the success of blinding.
  • We do not know the frequency or success of blinding assessment among antidepressant RCTs within depression.
  • Only 4.7% of RCTs examining antidepressants in depression assess blinding.
  • Overall, blinding is not successful among either patients or investigators.

Successful blinding in double-blind RCTs is crucial for minimizing bias; however, studies rarely report information about blinding. Among RCTs for depression, the rates of testing and success of blinding are unknown.

We conducted a systematic review and meta-analysis of the rates of testing, predictors, and success of blinding in RCTs of antidepressants for depression. Following a systematic search, further information about blinding assessment was requested from corresponding authors of the included studies. We reported the frequency of blinding assessment across all RCTs, and conducted logistic regression analyses to assess predictors of blinding reporting. Participant and/​or investigator guesses about treatment allocation were used to calculate Bang’s Blinding Index (BI). The BI between RCT arms was compared using meta-analysis.

Across the 295 included trials, only 4.7% of studies assessed blinding. Pharmaceutical company sponsorship predicted blinding assessment; unsponsored trials were more likely to assess blinding. Meta-analysis suggested that blinding was unsuccessful among participants and investigators. Results suggest that blinding is rarely assessed, and often fails, among RCTs of antidepressants.

This is concerning considering controversy around the efficacy of antidepressant medication. Blinding should be routinely assessed and reported in RCTs of antidepressants, and trial outcomes should be considered in light of blinding success or failure.

[Keywords: randomized controlled trials, blinding, depression, antidepressants]
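Bang’s Blinding Index, the measure meta-analyzed here, is simple to compute per trial arm; a minimal sketch (the example counts are hypothetical, not from the paper):

```python
def bang_blinding_index(n_correct: int, n_incorrect: int, n_dont_know: int) -> float:
    """Bang's Blinding Index for one trial arm:
    BI = (n_correct - n_incorrect) / n_total, ranging from -1 to 1.
    0 suggests random guessing (successful blinding), 1 complete
    unblinding, and negative values systematic opposite guessing."""
    n_total = n_correct + n_incorrect + n_dont_know
    return (n_correct - n_incorrect) / n_total

# Hypothetical arm: of 100 participants, 60 guess their allocation
# correctly, 25 guess wrong, 15 answer "don't know".
print(bang_blinding_index(60, 25, 15))  # 0.35: evidence of unblinding
print(bang_blinding_index(25, 25, 50))  # 0.0: consistent with blinding
```

In a blinded trial the BI in each arm should be near 0; a meta-analytic BI reliably above 0 in the active-drug arm is the failure pattern the authors report.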

“How Malleable Are Cognitive Abilities? A Critical Perspective on Popular Brief Interventions”, Moreau 2021

2021-moreau.pdf: “How malleable are cognitive abilities? A critical perspective on popular brief interventions”⁠, David Moreau (2021-12-23; ⁠, ⁠, ; similar):

This review discusses evidence across a number of popular brief interventions designed to enhance cognitive abilities and suggests that these interventions often fail to elicit reliable improvements. Consequences of exaggerated claims are discussed, together with a call for constructive criticism when evaluating this body of research.

A number of popular research areas suggest that cognitive performance can be manipulated via relatively brief interventions. These findings have generated a lot of traction, given their inherent appeal to individuals and society. However, recent evidence indicates that cognitive abilities might not be as malleable as preliminary findings implied and that other more stable factors play an important role.

In this article, I provide a critical outlook on these trends of research, combining findings that have mainly remained segregated despite shared characteristics.

Specifically, I suggest that the purported cognitive improvements elicited by many interventions are not reliable, and that their ecological validity remains limited.

I conclude with a call for constructive skepticism when evaluating claims of generalized cognitive improvements following brief interventions.

[Keywords: behavioral interventions, cognitive improvements, brain plasticity⁠, genetics, intelligence]

“More Treatment but No Less Depression: The Treatment-prevalence Paradox”, Ormel et al 2021

“More treatment but no less depression: The treatment-prevalence paradox”⁠, Johan Ormel, Steven D. Hollon, Ronald C. Kessler, Pim Cuijpers, Scott M. Monroe (2021-12-11; ; similar):

  • The puzzling paradox of more treatment but no less depression requires answers.
  • First incidence has probably not increased and offset treatment-driven prevalence drops.
  • The published trial literature substantially overestimates efficacy of treatments⁠.
  • In addition, treatment-quality gaps in routine care reduce effectiveness further.
  • Long-term outcome, under-treatment of recurrence, iatrogenicity need further study.

Treatments for depression have improved, and their availability has markedly increased since the 1980s. Mysteriously, the general population prevalence of depression has not decreased. This “treatment-prevalence paradox” (TPP) raises fundamental questions about the diagnosis and treatment of depression.

We propose and evaluate 7 explanations for the TPP. First, 2 explanations assume that improved and more widely available treatments have reduced prevalence, but that the reduction has been offset by an increase in:

  1. misdiagnosing distress as depression, yielding more “false positive” diagnoses; or

  2. an actual increase in depression incidence.

    Second, the remaining 5 explanations assume prevalence has not decreased, but suggest that:

  3. treatments are less efficacious and

  4. less enduring than the literature suggests;

  5. RCT efficacy doesn’t generalize to real-world settings;

  6. population-level treatment impact differs for chronic-recurrent versus non-recurrent cases; and

  7. treatments have some iatrogenic consequences.

Any of these 7 explanations could undermine treatment impact on prevalence, thereby helping to explain the TPP.

Our analysis reveals that there is little evidence that incidence or prevalence have increased as a result of error or fact (Explanations 1 and 2), and strong evidence that (a) the published literature overestimates short-term and long-term treatment efficacy, (b) treatments are considerably less effective as deployed in “real world” settings, and (c) treatment impact differs substantially for chronic-recurrent cases relative to non-recurrent cases.

Collectively, these 4 explanations likely account for most of the TPP. Lastly, little research exists on iatrogenic effects of current treatments (Explanation 7), but further exploration is critical.

[Keywords: depression, treatment, prevalence, more treatment but not less depression, explanations treatment-prevalence paradox]

Possible Explanations for the TPP, with Evidence and (Preliminary) Conclusions:
  1. Have prevalence estimates been spuriously inflated due to increasing societal recognition of depression and associated diagnostic practices?
People have probably become more willing to admit depressive symptoms and to present for treatment, where they may receive false-positive diagnoses of MDD. But since epidemiologic surveys are conducted by well-trained interviewers using structured interviews to generate well-standardized diagnoses, it is unlikely that systematic drift in ‘caseness’ has occurred at the population level. Thus, an increase in “false positive” diagnoses would not mask a treatment-driven drop in “true” epidemiological prevalence.
  2. Have first incidence rates increased and offset a “true” treatment-driven reduction in point-prevalence?
Post-1980 first-incidence studies and information on trends in causal risk factors are few, and too inconsistent to provide a conclusive answer. The handful of incidence studies, though, do not hint at any substantial rise in incidence since the 1980s, and some even suggest a decrease. The evidence is sparse and uncertain, though, and ends around 2010. It seems unlikely that a true increase in depression incidence offsets any treatment-driven prevalence reduction.
  3. Do RCTs overestimate Acute-Phase treatment efficacy? Might biases both within the trials and across the larger literature on medication and psychotherapy inflate these short-term benefits of treatment?
RCTs do yield inflated acute-phase efficacy estimates. Adjusted for bias, efficacy drops by a third to a half, to modest effect-sizes at best (about 0.30 for medications vs. pill-placebo and psychotherapy vs. care-as-usual). It is unclear how long Acute-Phase treatment benefits persist. Given the biases and large heterogeneity, it is not surprising that there is substantial disagreement about the clinical impact of treatments. It is not that treatments do not work, just that they do not work as well as the published literature suggests, or as is widely believed. Hence, a more accurate estimate of short-term efficacy is at best modest, which can partly explain the TPP.
  4. Does research on maintenance of treatment gains and long-term efficacy overestimate beneficial effects? Are estimates for medication and psychotherapy interventions to prevent relapse-recurrence upwardly biased due to non-eligibility, insufficient response to Acute-Phase treatment, symptom return risk, and a variety of biases that need to be taken into account?
RCTs evaluating treatments aimed at reducing relapse-recurrence risk show substantial efficacy for preventive psychotherapy and for continued medication. However, these “effects” are rife with possible biases (misclassification, unblinding, allegiance effects, and differential mortality), complicating interpretation. In addition, many patients without sufficient response to acute-phase treatment are not eligible for relapse or recurrence prevention trials, and relapse-recurrence rates over 2 years after preventive treatment remain substantial (though estimates vary greatly). Hence, limited overall long-term efficacy also may help to explain the TPP.
  5. Do RCTs generalize to real-world settings? How large is the gap between RCT-based efficacy and real-world effectiveness?
RCT-based efficacy does not generalize all that well to real-world practice, for both medication and psychotherapy. Reasons: large gaps in treatment choice and implementation quality exist in real-world practice; compared to the typical RCT patient, the real-world patient is somewhat less treatable (suicidal ideation, addiction, severe comorbidities). What the gaps, along with naturalistic follow-up studies, tell us is that treatment is not as effective long-term as we would like it to be. This explanation appears to be one of the strongest candidates for understanding the TPP, as it also amplifies the contribution of explanations 3 and 4.
  6. Does treatment efficacy vary by different subtypes of depression? Specifically, could differential treatment benefits for chronic-recurrent versus non-recurrent cases dilute the potential beneficial effects of treatments for those most in clinical need? Further, chronic-recurrent cases are often very difficult to treat, or treatment-resistant.
The majority of people who initially become depressed have few if any recurrences, whereas recurrent and chronic cases become or remain depressed for much more time over the course of their lives. The availability of more and better treatments consequently has many more opportunities to benefit the smaller number of chronic-recurrent cases, while treatment effects at the population level for the many more non-recurrent cases most likely will be very limited. The resulting limited effects at the population level for the larger non-recurrent group could dilute more pronounced effects for the chronic-recurrent subgroup, obscuring a positive impact on prevalence for those in greatest need. However, it is unclear to what extent advances in preventive treatments specifically benefit the chronic-recurrent subgroup, or if these treatments are adequately transported into routine care for them (Explanations 3–5). Individually and combined, these subgroup considerations also provide potentially strong explanations for the TPP.
  7. Can treatment sometimes also have counterproductive consequences?
Oppositional perturbation refers to a medication-induced state of built-up perturbation in homeostatic monoamine regulatory mechanisms that “bounces back” when medication is discontinued, and then overshoots the normal balance of monoamine storage and release, increasing the risk for symptom return compared to spontaneous remission. Loss of agency refers to the hypothesis that either medication or psychotherapy could be counterproductive if either or both reduce self-help activity and active coping and thereby interfere with natural recovery mechanisms. Although some indirect evidence exists for each possibility, both mechanisms are largely speculative. The explanatory potential of this concern remains to be demonstrated for understanding the TPP, but is worthy of further investigation.

“The Psychophysiology of Political Ideology: Replications, Reanalyses, and Recommendations”, Osmundsen et al 2021

2021-osmundsen.pdf: “The Psychophysiology of Political Ideology: Replications, Reanalyses, and Recommendations”⁠, Mathias Osmundsen, David J. Hendry, Lasse Laustsen, Kevin B. Smith, Michael Bang Petersen (2021-11-25; ⁠, ; similar):

This article presents a large-scale, empirical evaluation of the psychophysiological correlates of political ideology and, in particular, the claim that conservatives react with higher levels of electrodermal activity to threatening stimuli than liberals.

We (1) conduct 2 large replications of this claim, using locally representative samples of Danes and Americans; (2) reanalyze all published studies and evaluate their reliability and validity⁠; and (3) test several features to enhance the validity of psychophysiological measures and offer a number of recommendations.

Overall, we find little empirical support for the claim. This is caused by large reliability and validity problems related to measuring threat sensitivity using electrodermal activity. When assessed reliably, electrodermal activity in the replications and published studies captures individual differences in the physiological changes associated with attention shifts, which are unrelated to ideology. In contrast to psychophysiological reactions, self-reported emotional reactions to threatening stimuli are reliably associated with ideology.

[Keywords: political ideology, threat sensitivity, electrodermal activity, replication, measurement, psychometrics]

…In the process of revising this article, a preprint of another large-scale replication effort became available. Bakker et al 2019 field 2 conceptual replications, as well as a preregistered direct replication of Oxley et al 2008⁠. All of these efforts fail to replicate the results. We encourage readers to consult Bakker et al 2019, which is aligned with and reinforces the conclusions of the present article.

“The Implicit Association Test in Introductory Psychology Textbooks: Blind Spot for Controversy”, Bartels & Schoenrade 2021

2021-bartels.pdf: “The Implicit Association Test in Introductory Psychology Textbooks: Blind Spot for Controversy”⁠, Jared M. Bartels, Patricia Schoenrade (2021-11-13; ; backlinks; similar):

The Implicit Association Test (IAT) has been widely discussed as a potential measure of “implicit bias.” Yet the IAT is controversial; research suggests that it is far from clear precisely what the instrument measures, and it does not appear to be a strong predictor of behavior. The presentation of this topic in Introductory Psychology texts is important as, for many students, it is their first introduction to scientific treatment of such issues. In the present study, we examined twenty current Introductory Psychology texts in terms of their coverage of the controversy and presentation of the strengths and weaknesses of the measure. Of the 17 texts that discussed the IAT, a minority presented any of the concerns, including the lack of measurement clarity (29%), an automatic preference for White people among African Americans (12%), lack of predictive validity (12%), and lack of caution about the meaning of a score (0%); most provided students with a link to the Project Implicit website (65%). Overall, 82% of the texts were rated as biased or partially biased on their coverage of the IAT. The implications for the perceptions and self-perceptions of students, particularly when a link to Project Implicit is included, are discussed.

“A Pre-registered, Multi-lab Non-replication of the Action-sentence Compatibility Effect (ACE)”, Morey et al 2021

“A pre-registered, multi-lab non-replication of the action-sentence compatibility effect (ACE)”⁠, Richard D. Morey, Michael P. Kaschak, Antonio M. Díez-Álamo, Arthur M. Glenberg, Rolf A. Zwaan, Daniël Lakens et al (2021-11-09; ; backlinks; similar):

The Action-sentence Compatibility Effect (ACE) is a well-known demonstration of the role of motor activity in the comprehension of language. Participants are asked to make sensibility judgments on sentences by producing movements toward the body or away from the body. The ACE is the finding that movements are faster when the direction of the movement (eg. ‘toward’) matches the direction of the action in the to-be-judged sentence (eg. ‘Art gave you the pen’ describes action toward you).

We report on a pre-registered⁠, multi-lab replication of one version of the ACE.

The results show that none of the 18 labs involved in the study observed a reliable ACE, and that the meta-analytic estimate of the size of the ACE was essentially zero.

Figure 6: Action-sentence Compatibility Effect (ACE) interaction effects on the logarithm of the lift-off times across all labs. Thick error bars show standard errors from the linear mixed effects model analysis; thin error bars are the corresponding 95% CI. The shaded region represents our pre-registered, predicted conclusions about the ACE: Effects within the lighter shaded region were pre-registered as too small to be consistent with the ACE; effects in the dark gray region were pre-registered as negligibly small. Above the gray region was considered consistent with the extant ACE literature.
Figure 7: Action-sentence Compatibility Effect (ACE) interaction effects on the logarithm of the move times across all labs. Thick error bars show standard errors from the linear mixed effects model analysis; thin error bars are the corresponding 95% CI. Asterisks before the names indicate a singular fit due to the random effect variance of items being estimated as 0. For comparability of the effect, we include them here so that all effects presented were estimated using the same model.

“Are Conservatives More Rigid Than Liberals? A Meta-Analytic Test of the Rigidity-of-the-Right Hypothesis”, Costello et al 2021

“Are Conservatives More Rigid Than Liberals? A Meta-Analytic Test of the Rigidity-of-the-Right Hypothesis”⁠, Thomas H. Costello, Shauna Bowes, Matthew Baldwin, Scott O. Lilienfeld, Arber Tasimi (2021-11-06; ⁠, ; similar):

[See also “Political Diversity Will Improve Social Psychological Science”⁠, Duarte et al 2015; “Clarifying the Structure and Nature of Left-Wing Authoritarianism (LWA)”⁠, Costello et al 2021] The rigidity-of-the-right hypothesis (RRH), which posits that cognitive, motivational, and ideological rigidity resonate with political conservatism, is the dominant psychological account of political ideology.

Here, we conduct an extensive review of the RRH, using multilevel meta-analysis to examine relations between varieties of rigidity and ideology alongside a bevy of potential moderators (s = 329, k = 708, n = 187,612).

Associations between conservatism and rigidity were enormously heterogeneous, such that broad theoretical accounts of left-right asymmetries in rigidity have masked complex—yet conceptually fertile—patterns of relations. Most notably, correlations between economic conservatism and rigidity constructs were almost uniformly not statistically-significant, whereas social conservatism and rigidity were statistically-significantly positively correlated. Further, leftists and rightists exhibited modestly asymmetrical motivations yet closely symmetrical thinking styles and cognitive architecture. Dogmatism was a special case, with rightists being clearly more dogmatic. Complicating this picture, moderator analyses revealed that the RRH may not generalize to key environmental/​psychometric modalities.

Thus, our work represents a crucial launch point for advancing a more accurate—but admittedly more nuanced—model of political social cognition. We resolve that drilling into this complexity, thereby moving away from the question of whether conservatives are essentially rigid, will amplify the explanatory power of political psychology.

[Keywords: conservatism, meta-analysis, personality psychology, political ideology, political psychology, rigidity, social psychology]
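The heterogeneity and pooled correlations reported here come from standard random-effects meta-analytic machinery (the paper uses multilevel models; the single-level DerSimonian-Laird estimator below is a simplified sketch, and the example effect sizes are hypothetical):

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate with the DerSimonian-Laird
    estimator of between-study variance (tau^2) and the I^2
    heterogeneity statistic."""
    w = [1 / v for v in variances]                      # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, tau2, i2

# Three hypothetical study correlations with equal sampling variances:
pooled, tau2, i2 = dersimonian_laird([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
print(pooled, tau2, i2)  # ~0.30, tau^2 = 0.03, I^2 = 75%
```

A large I² like this is what “enormously heterogeneous” means operationally: most of the observed variation in rigidity-conservatism correlations reflects real between-study differences rather than sampling noise.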

“Empirical Audit and Review and an Assessment of Evidentiary Value in Research on the Psychological Consequences of Scarcity”, O’Donnell et al 2021

“Empirical audit and review and an assessment of evidentiary value in research on the psychological consequences of scarcity”⁠, Michael O’Donnell, Amelia S. Dev, Stephen Antonoplis, Stephen M. Baum, Arianna H. Benedetti, N. Derek Brown et al (2021-11-02; ; backlinks; similar):

Empirical audit and review is an approach to assessing the evidentiary value of a research area. It involves identifying a topic and selecting a cross-section of studies for replication. We apply the method to research on the psychological consequences of scarcity. Starting with the papers citing a seminal publication in the field, we conducted replications of 20 studies that evaluate the role of scarcity priming in pain sensitivity, resource allocation, materialism, and many other domains. There was considerable variability in the replicability, with some strong successes and other undeniable failures. Empirical audit and review does not attempt to assign an overall replication rate for a heterogeneous field, but rather facilitates researchers seeking to incorporate strength of evidence as they refine theories and plan new investigations in the research area. This method allows for an integration of qualitative and quantitative approaches to review and enables the growth of a cumulative science.

[Keywords: scarcity, reproducibility, open science, meta-analysis, evidentiary value]

…We selected 20 studies for replication. We built a set of eligible papers and then drew from that set at random. The set included studies that (1) cited Shah et al 2012’s seminal paper on scarcity, (2) included scarcity as a factor in their design, and (3) could be replicated with an online sample. We did not decide on an operational definition of scarcity, but we accepted all measures and manipulations of scarcity that were proposed by the original authors…To give us sufficient precision to comment on the statistical power of the original effects, our replications employed 2.5× the sample size of the original paper (8). Because this approach would also allow us to detect smaller effects than in the original studies, it would have allowed us to detect statistically-significant effects even in the cases where the original findings were not statistically-significant.
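The retrospective power estimates in the rightmost columns work by asking how likely the original study, at its original sample size, was to detect an effect of the size found in the replication. A sketch for a simple correlation test (numbers hypothetical; the authors’ exact procedure may differ):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_correlation(r, n):
    """Approximate power of a two-sided alpha = 0.05 test of rho = 0,
    using the Fisher z-transformation: atanh(r_hat) is approximately
    normal with mean atanh(rho) and SD 1/sqrt(n - 3)."""
    z_crit = 1.959964  # Phi^{-1}(0.975)
    delta = math.atanh(r) * math.sqrt(n - 3)
    return (1 - norm_cdf(z_crit - delta)) + norm_cdf(-z_crit - delta)

# If a replication estimates r = 0.10 and the original study had n = 80,
# the original had little chance of detecting an effect of that size:
print(round(power_correlation(0.10, 80), 2))  # ~0.14, i.e. badly underpowered
```

This is the sense in which “most of the 20 effects … were too small to be detectably studied in the original investigations”: at the replication-estimated effect sizes, the original sample sizes imply power far below the conventional 80%.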

Figure 1: The Leftmost columns indicate common features among the replicated studies and the Middle column depicts effect size (correlation coefficients) for the original and replication studies. Effect sizes are bounded by 95% CIs. The Right columns indicate the estimated power in the original studies (third column from Right), the upper bound of the 95% CI for estimated power in the original (second column from Right), as well as an estimated sample size required for 80% power, based on the replication effect (Rightmost column).

Figure 1 shows our results. The Leftmost columns categorize commonalities among the 20 studies. In the 6 studies featuring writing-based independent variables, we reviewed the responses for nonsensical or careless answers and excluded them. Results including these responses are in SI Appendix. The Middle column shows that replication effect sizes were smaller than the original effect sizes for 80% of the 20 studies, and directionally opposite for 30% of these 20 studies. Of the 20 studies that were statistically-significant in the original, 4 of our replication efforts yielded statistically-significant results. But statistical-significance is only one way to evaluate the results of a replication. The 3 Rightmost columns report estimates of the power in the original studies based on the replication effects. This analysis provides the upper bounds of the 95% CI for the estimated power of the original studies. Only 9 of the original studies included 33% power in these 95% CIs, indicating that most of the 20 effects we attempted to replicate were too small to be detectably studied in the original investigations.

…Scarcity is a real and enduring societal problem, yet our results suggest that behavioral scientists have not fully identified the underlying psychology. Although this project has neither the goal nor the capacity to “accept the null” hypothesis for any of these tests, the replications of these 20 studies indicate that within this set, scarcity primes have a minimal influence on cognitive ability, product attitudes, or well-being.

“Effect Sizes Reported in Highly Cited Emotion Research Compared With Larger Studies and Meta-Analyses Addressing the Same Questions”, Cristea et al 2021

2021-cristea.pdf: “Effect Sizes Reported in Highly Cited Emotion Research Compared With Larger Studies and Meta-Analyses Addressing the Same Questions”⁠, Ioana A. Cristea, Raluca Georgescu, John P. A. Ioannidis (2021-11-01; similar):

We assessed whether the most highly cited studies in emotion research reported larger effect sizes compared with meta-analyses and the largest studies on the same question. We screened all reports with at least 1,000 citations and identified matching meta-analyses for 40 highly cited observational studies and 25 highly cited experimental studies. Highly cited observational studies had effects greater on average by 1.42× (95% confidence interval [CI] = [1.09, 1.87]) compared with meta-analyses and 1.99× (95% CI = [1.33, 2.99]) compared with largest studies on the same questions. Highly cited experimental studies had increases of 1.29× (95% CI = [1.01, 1.63]) compared with meta-analyses and 2.02× (95% CI = [1.60, 2.57]) compared with the largest studies. There was substantial between-topics heterogeneity, more prominently for observational studies. Highly cited studies often did not have the largest weight in meta-analyses (12 of 65 topics, 18%) but were frequently the earliest ones published on the topic (31 of 65 topics, 48%). Highly cited studies may offer, on average, exaggerated estimates of effects in both observational and experimental designs.

“The Role of Human Fallibility in Psychological Research: A Survey of Mistakes in Data Management”, Kovacs et al 2021

“The Role of Human Fallibility in Psychological Research: A Survey of Mistakes in Data Management”⁠, Marton Kovacs, Rink Hoekstra, Balazs Aczel (2021-10-21; backlinks; similar):

Errors are an inevitable consequence of human fallibility, and researchers are no exception. Most researchers can recall major frustrations or serious time delays due to human errors while collecting, analyzing, or reporting data. The present study is an exploration of mistakes made during the data-management process in psychological research.

We surveyed 488 researchers regarding the type, frequency, seriousness, and outcome of mistakes that have occurred in their research team during the last 5 years.

The majority of respondents suggested that mistakes occurred with very low or low frequency. Most respondents reported that the most frequent mistakes led to insignificant or minor consequences, such as time loss or frustration. The most serious mistakes caused insignificant or minor consequences for about a third of respondents, moderate consequences for almost half of respondents, and major or extreme consequences for about one fifth of respondents. The most frequently reported types of mistakes were ambiguous naming/​defining of data, version control error, and wrong data processing/​analysis. Most mistakes were reportedly due to poor project preparation or management and/​or personal difficulties (physical or cognitive constraints).

With these initial exploratory findings, we do not aim to provide a description representative for psychological scientists but, rather, to lay the groundwork for a systematic investigation of human fallibility in research data management and the development of solutions to reduce errors and mitigate their impact.

[Keywords: human error, data-management mistakes, research workflow, life cycle of the data, open data, open materials, preregistered]

“On the Reliability of Published Findings Using the Regression Discontinuity Design in Political Science”, Stommes et al 2021

“On the reliability of published findings using the regression discontinuity design in political science”⁠, Drew Stommes, P. M. Aronow, Fredrik Sävje (2021-09-29; similar):

The regression discontinuity (RD) design offers identification of causal effects under weak assumptions, earning it the position as a standard method in modern political science research. But identification does not necessarily imply that the causal effects can be estimated accurately with limited data. In this paper, we highlight that estimation is particularly challenging with the RD design and investigate how these challenges manifest themselves in the empirical literature. We collect all RD-based findings published in top political science journals from 2009–2018. The findings exhibit pathological features; estimates tend to bunch just above the conventional level of statistical-significance. A reanalysis of all studies with available data suggests that researchers’ discretion is not a major driver of these pathological features, but researchers tend to use inappropriate methods for inference, rendering standard errors artificially small. A retrospective power analysis reveals that most of these studies were underpowered to detect all but large effects. The issues we uncover, combined with well-documented selection pressures in academic publishing, cause concern that many published findings using the RD design are exaggerated, if not entirely spurious.
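Bunching “just above the conventional level of statistical-significance” can be checked with a simple caliper test; a minimal sketch with hypothetical z-statistics (not the paper’s data):

```python
def caliper_test(z_values, z_crit=1.96, width=0.2):
    """Caliper test for selective reporting: count test statistics in
    narrow bins just above vs. just below the significance threshold.
    Absent selection, the two counts should be roughly equal; a large
    excess just above z_crit suggests publication bias."""
    above = sum(1 for z in z_values if z_crit < z <= z_crit + width)
    below = sum(1 for z in z_values if z_crit - width < z <= z_crit)
    return above, below

# Hypothetical published z-statistics, bunched just past 1.96:
zs = [1.97, 1.99, 2.02, 2.05, 2.10, 2.14, 1.90, 2.50, 0.80, 3.10]
print(caliper_test(zs))  # (6, 1): far more just-significant than just-insignificant
```

A 6-to-1 imbalance in such narrow bins is very unlikely under an unselected literature, which is the intuition behind the “pathological features” the authors document.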

“Is Coffee the Cause or the Cure? Conflicting Nutrition Messages in 2 Decades of Online New York Times’ Nutrition News Coverage”, Ihekweazu 2021

2021-ihekweazu.pdf: “Is Coffee the Cause or the Cure? Conflicting Nutrition Messages in 2 Decades of Online New York Times’ Nutrition News Coverage”⁠, Chioma Ihekweazu (2021-09-14; ; similar):

2⁄3rds of US adults report hearing news stories about diet and health relationships daily or a few times a week. These stories have often been labeled as conflicting. While public opinion suggests conflicting nutrition messages are widespread, there has been limited empirical research to support this belief.

This study examined the prevalence of conflicting information in online New York Times’ news articles discussing published nutrition research between 1996–2016. It also examined the contextual differences that existed between conflicting studies. The final sample included 375 news articles discussing 416 diet and health relationships (228 distinct relationships).

The most popular dietary items discussed were alcoholic beverages (n = 51), vitamin D (n = 26), and B vitamins (n = 23). Over the 20-year study period, 12.7% of the 228 diet and health relationships had conflicting reports. Just under 3⁄4ths of the conflicting reports involved changes in study design, 79% involved changes in study population, and 31% involved changes in industry funding.

Conflicting nutrition messages can have negative cognitive and behavioral consequences for individuals. To help effectively address conflicting nutrition news coverage, a multi-pronged approach involving journalists, researchers, and news audiences is needed.

“A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect”, Vohs et al 2021

2021-vohs.pdf: “A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect”⁠, Kathleen D. Vohs, Brandon J. Schmeichel, Sophie Lohmann, Quentin F. Gronau, Anna J. Finley, Sarah E. Ainsworth et al (2021-09-14; ; backlinks):

We conducted a preregistered multilaboratory project (k = 36; n = 3,531) to assess the size and robustness of ego-depletion effects using a novel replication method, termed the paradigmatic replication approach. Each laboratory implemented one of two procedures that was intended to manipulate self-control and tested performance on a subsequent measure of self-control. Confirmatory tests found a nonsignificant result (d = 0.06). Confirmatory Bayesian meta-analyses using an informed-prior hypothesis (δ = 0.30, SD = 0.15) found that the data were 4× more likely under the null than the alternative hypothesis. Hence, preregistered analyses did not find evidence for a depletion effect. Exploratory analyses on the full sample (ie. ignoring exclusion criteria) found a statistically-significant effect (d = 0.08); Bayesian analyses showed that the data were about equally likely under the null and informed-prior hypotheses. Exploratory moderator tests suggested that the depletion effect was larger for participants who reported more fatigue but was not moderated by trait self-control, willpower beliefs, or action orientation.

“TV Advertising Effectiveness and Profitability: Generalizable Results From 288 Brands”, Shapiro et al 2021

2021-shapiro.pdf: “TV Advertising Effectiveness and Profitability: Generalizable Results From 288 Brands”⁠, Bradley T. Shapiro, Gunter J. Hitsch, Anna E. Tuchman (2021-07-26; ⁠, ⁠, ; backlinks; similar):

We estimate the distribution of television advertising elasticities and the distribution of the advertising return on investment (ROI) for a large number of products in many categories…We construct a data set by merging market (DMA) level TV advertising data with retail sales and price data at the brand level…Our identification strategy is based on the institutions of the ad buying process.

Our results reveal substantially smaller advertising elasticities compared to the results documented in the literature, as well as a sizable percentage of statistically insignificant or negative estimates. The results are robust to functional form assumptions and are not driven by insufficient statistical power or measurement error.

The ROI analysis shows negative ROIs at the margin for more than 80% of brands, implying over-investment in advertising by most firms. Further, the overall ROI of the observed advertising schedule is only positive for one third of all brands.

[Keywords: advertising, return on investment, empirical generalizations, agency issues, consumer packaged goods, media markets]

…We find that the mean and median of the distribution of estimated long-run own-advertising elasticities are 0.023 and 0.014, respectively, and 2⁄3rds of the elasticity estimates are not statistically different from zero. These magnitudes are considerably smaller than the results in the extant literature. The results are robust to controls for own and competitor prices and feature and display advertising, and the advertising effect distributions are similar whether a carryover parameter is assumed or estimated. The estimates are also robust if we allow for a flexible functional form for the advertising effect, and they do not appear to be driven by measurement error. As we are not able to include all sensitivity checks in the paper, we created an interactive web application that allows the reader to explore all model specifications. The web application is available⁠.

…First, the advertising elasticity estimates in the baseline specification are small. The median elasticity is 0.0140, and the mean is 0.0233. These averages are substantially smaller than the average elasticities reported in extant meta-analyses of published case studies (Assmus, Farley, and Lehmann (1984b), Sethuraman, Tellis, and Briesch (2011)). Second, 2⁄3rds of the estimates are not statistically distinguishable from zero. We show in Figure 2 that the most precise estimates are those closest to the mean and the least precise estimates are in the extremes.

Figure 2: Advertising effects and confidence intervals using baseline strategy. Note: Brands are arranged on the horizontal axis in increasing order of their estimated ad effects. For each brand, a dot plots the point estimate of the ad effect and a vertical bar represents the 95% confidence interval. Results are from the baseline strategy model with δ = 0.9 (equation (1)).

6.1 Average ROI of Advertising in a Given Week:

In the first policy experiment, we measure the ROI of the observed advertising levels (in all DMAs) in a given week t relative to not advertising in week t. For each brand, we compute the corresponding ROI for all weeks with positive advertising, and then average the ROIs across all weeks to compute the average ROI of weekly advertising. This metric reveals if, on the margin, firms choose the (approximately) correct advertising level or could increase profits by either increasing or decreasing advertising.
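
[The weekly-ROI metric described above is simple to sketch: incremental gross profit (margin × incremental revenue) net of ad cost, per dollar of ad spend, averaged over the weeks with positive advertising. A minimal illustration with invented figures, not taken from the paper:]

```python
import numpy as np

def weekly_roi(incremental_sales, ad_spend, margin=0.30):
    """ROI of one week's advertising: incremental gross profit
    (margin x incremental revenue) minus ad cost, per dollar spent."""
    return (margin * incremental_sales - ad_spend) / ad_spend

# Hypothetical brand-weeks (these figures are made up, not from the paper):
sales_lift = np.array([1000.0, 500.0, 0.0, 2000.0])  # incremental revenue ($)
spend = np.array([800.0, 900.0, 0.0, 1200.0])        # TV ad spend ($)

# Average the weekly ROIs over weeks with positive advertising only.
positive = spend > 0
avg_roi = weekly_roi(sales_lift[positive], spend[positive]).mean()
# avg_roi ~= -0.65: a -65% average weekly ROI at a 30% margin
```

[Even a sizable sales lift can leave the marginal ROI deeply negative once the margin factor is applied, which is the mechanism behind the paper’s median weekly ROI of −88.15%.]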

We provide key summary statistics in the top panel of Table III, and we show the distribution of the predicted ROIs in Figure 3(a). The average ROI of weekly advertising is negative for most brands over the whole range of assumed manufacturer margins. At a 30% margin, the median ROI is −88.15%, and only 12% of brands have positive ROI. Further, for only 3% of brands the ROI is positive and statistically different from zero, whereas for 68% of brands the ROI is negative and statistically different from zero.

Figure 3: Predicted ROIs. Note: Panel (a) provides the distribution of the estimated ROI of weekly advertising and panel (b) provides the distribution of the overall ROI of the observed advertising schedule. Each is provided for 3 margin factors, m = 0.2, m = 0.3, and m = 0.4. The median is denoted by a solid vertical line and zero is denoted with a vertical dashed line. Gray indicates brands with negative ROI that is statistically different from zero. Red indicates brands with positive ROI that is statistically different from zero. Blue indicates brands with ROI not statistically different from zero.

These results provide strong evidence for over-investment in advertising at the margin. [In Appendix C.3, we assess how much larger the TV advertising effects would need to be for the observed level of weekly advertising to be profitable. For the median brand with a positive estimated ad elasticity, the advertising effect would have to be 5.33× larger for the observed level of weekly advertising to yield a positive ROI (assuming a 30% margin).]

6.2 Overall ROI of the Observed Advertising Schedule: In the second policy experiment, we investigate if firms are better off when advertising at the observed levels versus not advertising at all. Hence, we calculate the ROI of the observed advertising schedule relative to a counterfactual baseline with zero advertising in all periods.

We present the results in the bottom panel of Table III and in Figure 3(b). At a 30% margin, the median ROI is −57.34%, and 34% of brands have a positive return from the observed advertising schedule versus not advertising at all. Whereas 12% of brands have only positive values and 30% only negative values in their confidence intervals, there is more uncertainty about the sign of the ROI for the remaining 58% of brands. This evidence leaves open the possibility that advertising may be valuable for a substantial number of brands, especially if they reduce advertising on the margin.

…Our results have important positive and normative implications. Why do firms spend billions of dollars on TV advertising each year if the return is negative? There are several possible explanations. First, agency issues, in particular career concerns, may lead managers (or consultants) to overstate the effectiveness of advertising if they expect to lose their jobs if their advertising campaigns are revealed to be unprofitable. Second, an incorrect prior (ie. conventional wisdom that advertising is typically effective) may lead a decision maker to rationally shrink the estimated advertising effect from their data to an incorrect, inflated prior mean. These proposed explanations are not mutually exclusive. In particular, agency issues may be exacerbated if the general effectiveness of advertising or a specific advertising effect estimate is overstated. [Another explanation is that many brands have objectives for advertising other than stimulating sales. This is a nonstandard objective in economic analysis, but nonetheless, we cannot rule it out.] While we cannot conclusively point to these explanations as the source of the documented over-investment in advertising, our discussions with managers and industry insiders suggest that these may be contributing factors.

“Systematic Bias in the Progress of Research”, Rubin & Rubin 2021

2021-rubin.pdf: “Systematic Bias in the Progress of Research”⁠, Amir Rubin, Eran Rubin (2021-07-12; similar):

We analyze the extent to which citing practices may be driven by strategic considerations. The discontinuation of the Journal of Business (JB) in 2006 for extraneous reasons serves as an exogenous shock for analyzing strategic citing behavior. [We thank Douglas Diamond, the editor of the Journal of Business for 13 years, who told us that the main reason for the discontinuation was the difficulty in finding an editor from within Booth’s faculty.]

Using a difference-in-differences analysis, we find that articles published in JB before 2006 experienced a relative reduction in citations of ~20% after 2006.
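
[The difference-in-differences logic reduces to comparing the before/after change in log citations for JB articles against the same change for matched control articles. A toy sketch, with hypothetical means chosen only to reproduce a roughly 20% relative drop:]

```python
from math import exp

# Hypothetical difference-in-differences on mean log(1 + citation count):
# (after-minus-before change for JB articles) minus (the same change for
# matched articles in the other top-4 finance journals). Means are made up.
jb_pre, jb_post = 2.10, 1.95
ctrl_pre, ctrl_post = 2.05, 2.12
did = (jb_post - jb_pre) - (ctrl_post - ctrl_pre)  # effect of discontinuation
relative_change = exp(did) - 1  # back out the relative citation change
# did ~= -0.22 on the log scale, ie. a ~20% relative citation drop
```

[The log(1 + citations) outcome matches the specification shown in Figure 3; exponentiating the DiD coefficient converts it back to a relative citation change.]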

Since the discontinuation of JB is unrelated to the scientific contributions of its articles, the results imply that the referencing of articles is systematically affected by strategic considerations, which hinders scientific progress.

Figure 3: One-to-one matching based on propensity score. The figure depicts the mean log of (1 + citation count) of JB articles and that of matched articles taken from the pool of all other articles in the top 4 finance journals, based on having the closest propensity score. PS and PS (topics FE) correspond to propensity score matching based on equations (1) and (2), respectively.

“Common Elective Orthopaedic Procedures and Their Clinical Effectiveness: Umbrella Review of Level 1 Evidence”, Blom et al 2021

“Common elective orthopaedic procedures and their clinical effectiveness: umbrella review of level 1 evidence”⁠, Ashley W. Blom, Richard L. Donovan, Andrew D. Beswick, Michael R. Whitehouse, Setor K. Kunutsor (2021-07-08; ; similar):

Objective: To determine the clinical effectiveness of common elective orthopaedic procedures compared with no treatment, placebo, or non-operative care and assess the impact on clinical guidelines.

Design: Umbrella review of meta-analyses of randomised controlled trials or other study designs in the absence of meta-analyses of randomised controlled trials.

Data sources: 10 of the most common elective orthopaedic procedures—arthroscopic anterior cruciate ligament reconstruction⁠, arthroscopic meniscal repair of the knee, arthroscopic partial meniscectomy of the knee, arthroscopic rotator cuff repair⁠, arthroscopic subacromial decompression, carpal tunnel decompression⁠, lumbar spine decompression⁠, lumbar spine fusion⁠, total hip replacement⁠, and total knee replacement—were studied. MEDLINE⁠, Embase⁠, Cochrane Library, and bibliographies were searched until September 2020.

Eligibility criteria for selecting studies: Meta-analyses of randomised controlled trials (or in the absence of meta-analysis other study designs) that compared the clinical effectiveness of any of the 10 orthopaedic procedures with no treatment, placebo, or non-operative care.

Data extraction and synthesis: Summary data were extracted by 2 independent investigators, and a consensus was reached with the involvement of a third. The methodological quality of each meta-analysis was assessed using the Assessment of Multiple Systematic Reviews instrument. The Jadad decision algorithm was used to ascertain which meta-analysis represented the best evidence. The National Institute for Health and Care Excellence Evidence search was used to check whether recommendations for each procedure reflected the body of evidence.

Main outcome measures: Quality and quantity of evidence behind common elective orthopaedic interventions and comparisons with the strength of recommendations in relevant national clinical guidelines.

Results: Randomised controlled trial evidence supports the superiority of carpal tunnel decompression and total knee replacement over non-operative care. No randomised controlled trials specifically compared total hip replacement or meniscal repair with non-operative care. Trial evidence for the other 6 procedures showed no benefit over non-operative care.

Conclusions: Although they may be effective overall or in certain subgroups, no strong, high quality evidence base shows that many commonly performed elective orthopaedic procedures are more effective than non-operative alternatives. Despite the lack of strong evidence, some of these procedures are still recommended by national guidelines in certain situations.

Systematic review registration: PROSPERO CRD42018115917.

“Small Effects: The Indispensable Foundation for a Cumulative Psychological Science”, Götz et al 2021

2021-gotz.pdf: “Small Effects: The Indispensable Foundation for a Cumulative Psychological Science”⁠, Friedrich M. Götz, Samuel D. Gosling, Peter J. Rentfrow (2021-07-02; backlinks; similar):

We draw on genetics research to argue that complex psychological phenomena are most likely determined by a multitude of causes and that any individual cause is likely to have only a small effect.

Building on this, we highlight the dangers of a publication culture that continues to demand large effects. First, it rewards inflated effects that are unlikely to be real and encourages practices likely to yield such effects. Second, it overlooks the small effects that are most likely to be real, hindering attempts to identify and understand the actual determinants of complex psychological phenomena.

We then explain the theoretical and practical relevance of small effects, which can have substantial consequences, especially when considered at scale and over time. Finally, we suggest ways in which scholars can harness these insights to advance research and practices in psychology (ie. leveraging the power of big data, machine learning, and crowdsourcing science; promoting rigorous preregistration, including prespecifying the smallest effect size of interest; contextualizing effects; changing cultural norms to reward accurate and meaningful effects rather than exaggerated and unreliable effects).

Only once small effects are accepted as the norm, rather than the exception, can a reliable and reproducible cumulative psychological science be built.

[See variance-components for one route forward in quantifying small effects given the daunting statistical power challenges. Götz et al appear locked into the conventional framework of directly estimating effects, when what they really need to borrow from genetics is looking at variance terms like heritability… You can’t afford to gather n in the millions when you aren’t even sure your haystack contains a needle!]

“Truncating Bar Graphs Persistently Misleads Viewers”, Yang et al 2021

2021-yang.pdf: “Truncating Bar Graphs Persistently Misleads Viewers”⁠, Brenda W. Yang, Camila Vargas Restrepo, Matthew L. Stanley, Elizabeth J. Marsh (2021-06-01; similar):

Data visualizations and graphs are increasingly common in both scientific and mass media settings. While graphs are useful tools for communicating patterns in data, they also have the potential to mislead viewers.

In 5 studies, we provide empirical evidence that y-axis truncation leads viewers to perceive illustrated differences as larger (ie. a truncation effect). This effect persisted after viewers were taught about the effects of y-axis truncation and was robust across participants, with 83.5% of participants across these 5 studies showing a truncation effect. We also found that individual differences in graph literacy failed to predict the size of individuals’ truncation effects. PhD students in both quantitative fields and the humanities were susceptible to the truncation effect, but quantitative PhD students were slightly more resistant when no warning about truncated axes was provided.

We discuss the implications of these results for the underlying mechanisms and make practical recommendations for training critical consumers and creators of graphs.

[Keywords: data visualization, misleading graphs, misinformation, axis truncation, bar graphs]

News media, opinion pieces, social media, and scientific publications are full of graphs meant to communicate and persuade. Such graphs may be technically accurate in displaying correct numerical values and yet misleading because they lead people to draw inappropriate conclusions. In 5 studies, we investigate the practice of truncating the y-axis of bar graphs to start at a non-zero value. While this has been called one of “the worst of crimes in data visualization” by The Economist, it is surprisingly common not just in news and social media but also in scientific conferences and publications. This might be because the injunction to “not truncate the axis!” may be seen as more dogmatic than data-driven.

We examine how truncated graphs consistently lead people to perceive a larger difference between 2 quantities in 5 studies, and we find that 83.5% of participants across studies show a truncation effect. In other words, 83.5% of people in our studies judged differences illustrated by truncated bar graphs as larger than differences illustrated by graphs where the y-axis starts at 0.

Surprisingly, we found that the truncation effect was very persistent. People were misled by y-axis truncation even when we thoroughly explained the technique right before they rated graphs, although this warning reduced the degree to which people were misled. People with extensive experience working with data and statistics (ie. PhD students in quantitative fields) were also susceptible to the truncation effect. Overall, our work shows the consequences of truncating bar graphs and the extent to which interventions, such as warning people, can help but are limited in their scope.

Study 1 established a paradigm for examining the effects of y-axis truncation within bar graphs, providing a useful paradigm for studying deceptive graphs. In Study 2, we investigate whether providing an explicit explanation of y-axis truncation would reduce or eliminate the truncation effect…We expected that explicit warnings about y-axis truncation would give participants the information needed to identify truncated graphs and adjust their judgments accordingly.

Figure 3: Raincloud plots for Study 1 (a) and Study 2 (b). These raincloud plots (Allen et al 2018) depict average participant ratings for truncated and control graphs, respectively. Error bars reflect correlation-adjusted and difference-adjusted 95% confidence intervals of the means (Cousineau 2017). The truncated versus control graphs variable was manipulated within-subjects; each participant is represented by 2 points. In Study 2, all participants received an explanatory warning.

…The results of Study 2 were surprising in that an explicit warning did not eliminate the truncation effect. To further investigate this, in Study 3, we directly manipulate in a single experiment whether participants are given an explanatory warning about y-axis truncation. Doing so allows us to directly compare the effects of having an explanatory warning or not on the truncation effect.

Figure 4: Raincloud plot for Study 3. Error bars reflect correlation-adjusted and difference-adjusted 95% confidence intervals of the means and points represent each participant twice. The truncated versus control graphs variable was manipulated within-subject.

Study 4: In Study 3, we found that providing an explanatory warning before participants rated control and truncated bar graphs reduced but did not eliminate the truncation effect. In the world, however, an explicit warning will rarely immediately precede graphs with truncated vertical axes. Here, we extend the findings of Study 3 by asking participants to provide judgments about a new set of bar graphs after a 1-day delay. The purpose in doing so was to examine whether the effects of the explicit warning on the first day will extend to the next day.

Figure 5: Raincloud Plot for Study 4. Error bars reflect correlation-adjusted and difference-adjusted 95% confidence intervals of the means. Points represent each participant twice. The session (1 and 2) and graph type (truncated and control) variables were within-subject manipulations. Cohen’s d and 95% CIs from left to right: No Warning1 = 0.69 [0.37, 1.02], No Warning2 = 0.62 [0.30, 0.94]; Warning1 = 0.42 [0.10, 0.74], Warning2 = 0.40 [0.08, 0.72]

…In Study 5, we examined the size of the truncation effect in two doctoral student populations: PhD students pursuing quantitative fields versus the humanities.

Figure 6: Raincloud plot for Study 5. Error bars reflect correlation-adjusted and difference-adjusted 95% confidence intervals of the means. Points represent each participant twice. The truncated versus control graphs variable was manipulated within-subjects.

“Of Forking Paths and Tied Hands: Selective Publication of Findings, and What Economists Should Do about It”, Kasy 2021

2021-kasy.pdf: “Of Forking Paths and Tied Hands: Selective Publication of Findings, and What Economists Should Do about It”⁠, Maximilian Kasy (2021-06-01; ; similar):

A key challenge for interpreting published empirical research is the fact that published findings might be selected by researchers or by journals. Selection might be based on criteria such as significance, consistency with theory, or the surprisingness of findings or their plausibility. Selection leads to biased estimates, reduced coverage of confidence intervals, and distorted posterior beliefs. I review methods for detecting and quantifying selection based on the distribution of p-values, systematic replication studies, and meta-studies. I then discuss the conflicting recommendations regarding selection resulting from alternative objectives, in particular, the validity of inference versus the relevance of findings for decision-makers. Based on this discussion, I consider various reform proposals, such as de-emphasizing significance, pre-analysis plans, journals for null results and replication studies, and a functionally differentiated publication system. In conclusion, I argue that we need alternative foundations of statistics that go beyond the single-agent model of decision theory⁠.

“Non-replicable Publications Are Cited More Than Replicable Ones”, Serra-Garcia & Gneezy 2021

“Non-replicable publications are cited more than replicable ones”⁠, Marta Serra-Garcia, Uri Gneezy (2021-05-21; backlinks; similar):

We use publicly available data to show that published papers in top psychology, economics, and general interest journals that fail to replicate are cited more than those that replicate. This difference in citation does not change after the publication of the failure to replicate. Only 12% of post-replication citations of non-replicable findings acknowledge the replication failure. Existing evidence also shows that experts predict well which papers will be replicated. Given this prediction, why are non-replicable papers accepted for publication in the first place? A possible answer is that the review team faces a trade-off. When the results are more “interesting”, they apply lower standards regarding their reproducibility.

“The Revolution Will Be Hard to Evaluate: How Co-occurring Policy Changes Affect Research on the Health Effects of Social Policies”, Matthay et al 2021

“The revolution will be hard to evaluate: How co-occurring policy changes affect research on the health effects of social policies”⁠, Ellicott C. Matthay, Erin Hagan, Spruha Joshi, May Lynn Tan, David Vlahov, Nancy Adler, M. Maria Glymour et al (2021-05-15; ; similar):

Extensive empirical health research leverages variation in the timing and location of policy changes as quasi-experiments. Multiple social policies may be adopted simultaneously in the same locations, creating co-occurrence which must be addressed analytically for valid inferences. The pervasiveness and consequences of co-occurring policies have received limited attention. We analyzed a systematic sample of 13 social policy databases covering diverse domains including poverty, paid family leave, and tobacco. We quantified policy co-occurrence in each database as the fraction of variation in each policy measure across different jurisdictions and times that could be explained by co-variation with other policies (R2). We used simulations to estimate the ratio of the variance of effect estimates under the observed policy co-occurrence to variance if policies were independent. Policy co-occurrence ranged from very high for state-level cannabis policies to low for country-level sexual minority rights policies. For 65% of policies, greater than 90% of the place-time variation was explained by other policies. Policy co-occurrence increased the variance of effect estimates by a median of 57×. Co-occurring policies are common and pose a major methodological challenge to rigorously evaluating health effects of individual social policies. When uncontrolled, co-occurring policies confound one another, and when controlled, resulting positivity violations may substantially inflate the variance of estimated effects. Tools to enhance validity and precision for evaluating co-occurring policies are needed.
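
[The co-occurrence measure is just the R² from regressing each policy indicator, across place-time units, on all the other policies. A minimal simulated sketch, where the panel, the 5 policies, and their shared adoption propensity are all invented for illustration:]

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical place-time panel: 1,000 state-year units and 5 binary
# policies that tend to be adopted together via a shared latent propensity.
n, k = 1000, 5
latent = rng.normal(size=n)
policies = ((latent[:, None] + rng.normal(size=(n, k))) > 0).astype(float)

# Co-occurrence of policy 0: R^2 from regressing it on the other 4 policies
# (least squares with an intercept), analogous to the paper's measure.
y = policies[:, 0]
X = np.column_stack([np.ones(n), policies[:, 1:]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - ((y - X @ beta) ** 2).mean() / y.var()
```

[A high R² means the policy barely varies independently of its companions, which is what drives the positivity violations and variance inflation the authors document.]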

“Challenging the Link Between Early Childhood Television Exposure and Later Attention Problems: A Multiverse Approach”, McBee et al 2021

2021-mcbee.pdf: “Challenging the Link Between Early Childhood Television Exposure and Later Attention Problems: A Multiverse Approach”⁠, Matthew T. McBee, Rebecca J. Brand, Wallace E. Dixon, Jr. (2021-03-25; similar):

In 2004, Christakis and colleagues published findings that he and others used to argue for a link between early childhood television exposure and later attention problems, a claim that continues to be frequently promoted by the popular media. Using the same National Longitudinal Survey of Youth 1979 data set (n = 2,108), we conducted two multiverse analyses to examine whether the finding reported by Christakis and colleagues was robust to different analytic choices. We evaluated 848 models, including logistic regression models, linear regression models, and two forms of propensity-score analysis. If the claim were true, we would expect most of the justifiable analyses to produce significant results in the predicted direction. However, only 166 models (19.6%) yielded a statistically-significant relationship, and most of these employed questionable analytic choices. We concluded that these data do not provide compelling evidence of a harmful effect of TV exposure on attention.

“The Influence of Hidden Researcher Decisions in Applied Microeconomics”, Huntington-Klein et al 2021

2021-huntingtonklein.pdf: “The influence of hidden researcher decisions in applied microeconomics”⁠, Nick Huntington-Klein, Andreu Arenas, Emily Beam, Marco Bertoni, Jeffrey R. Bloem, Pralhad Burli, Naibin Chen et al (2021-03-22; similar):

Researchers make hundreds of decisions about data collection, preparation, and analysis in their research. We use a many-analysts approach to measure the extent and impact of these decisions.

Two published causal empirical results are replicated by 7 replicators each. We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No 2 replicators reported the same sample size. Statistical-significance varied across replications, and for 1 of the studies the effect’s sign varied as well. The standard deviation of estimates across replications was 3–4× the mean reported standard error.

“Putting the Self in Self-Correction: Findings From the Loss-of-Confidence Project”, Rohrer et al 2021

“Putting the Self in Self-Correction: Findings From the Loss-of-Confidence Project”⁠, Julia M. Rohrer, Warren Tierney, Eric L. Uhlmann, Lisa M. DeBruine, Tom Heyman, Benedict Jones, Stefan C. Schmukle et al (2021-03-01; backlinks; similar):

Science is often perceived to be a self-correcting enterprise. In principle, the assessment of scientific claims is supposed to proceed in a cumulative fashion, with the reigning theories of the day progressively approximating truth more accurately over time. In practice, however, cumulative self-correction tends to proceed less efficiently than one might naively suppose. Far from evaluating new evidence dispassionately and infallibly, individual scientists often cling stubbornly to prior findings.

Here we explore the dynamics of scientific self-correction at an individual rather than collective level. In 13 written statements, researchers from diverse branches of psychology share why and how they have lost confidence in one of their own published findings. We qualitatively characterize these disclosures and explore their implications.

A cross-disciplinary survey suggests that such loss-of-confidence sentiments are surprisingly common among members of the broader scientific population yet rarely become part of the public record. We argue that removing barriers to self-correction at the individual level is imperative if the scientific community as a whole is to achieve the ideal of efficient self-correction.

[Keywords: self-correction, knowledge accumulation, metascience, scientific falsification, incentive structure, scientific errors]

“Maximal Positive Controls: A Method for Estimating the Largest Plausible Effect Size”, Hilgard 2021

2021-hilgard.pdf: “Maximal positive controls: A method for estimating the largest plausible effect size”⁠, Joseph Hilgard (2021-03-01; ⁠, ; similar):

  • Some reported effect sizes are too big for the hypothesized process.
  • Simple, obvious manipulations can reveal which effects are too big.
  • A demonstration is provided examining an implausibly large effect.

Effect sizes in social psychology are generally not large and are limited by error variance in manipulation and measurement. Effect sizes exceeding these limits are implausible and should be viewed with skepticism. Maximal positive controls, experimental conditions that should show an obvious and predictable effect [eg. a Stroop effect], can provide estimates of the upper limits of plausible effect sizes on a measure.

In this work, maximal positive controls are conducted for 3 measures of aggressive cognition, and the effect sizes obtained are compared to studies found through systematic review. Questions are raised regarding the plausibility of certain reports with effect sizes comparable to, or in excess of, the effect sizes found in maximal positive controls.

Maximal positive controls may provide a means to identify implausible study results at lower cost than direct replication.

[Keywords: violent video games, aggression⁠, aggressive thought, positive controls, scientific self-correction]

[Positive controls eliciting a hitherto-maximum effect can be seen as a kind of empirical Bayes estimating the distribution of plausible effects: if a reported effect size exceeds the empirical max, either something extremely unlikely has occurred (a new max out of n effects ever observed) or an error. For large n, the posterior probability of an error will be much larger.]
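
A minimal sketch of that argument in Python (illustrative only: it assumes exchangeable effect sizes, so a legitimate new record occurs with probability 1⁄(n + 1), and the base rate of erroneous reports is a made-up parameter):

```python
def p_new_max(n_prior: int) -> float:
    """Probability that a new *legitimate* effect exceeds all n_prior
    previously observed effects, assuming exchangeability."""
    return 1.0 / (n_prior + 1)

def p_error_given_above_max(n_prior: int, base_error_rate: float) -> float:
    """Posterior probability that an above-max report is an error,
    treating 'genuine new record' and 'error' as the only two routes."""
    p_max = p_new_max(n_prior)
    return base_error_rate / (base_error_rate + (1 - base_error_rate) * p_max)
```

With 200 effects on record, even a 5% base rate of erroneous reports makes an above-max result >90% likely to be an error, and the posterior only grows with *n*.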

“Sorting the File Drawer: A Typology for Describing Unpublished Studies”, Lishner 2021

2021-lishner.pdf: “Sorting the File Drawer: A Typology for Describing Unpublished Studies”⁠, David A. Lishner (2021-03-01; similar):

A typology of unpublished studies is presented to describe various types of unpublished studies and the reasons for their nonpublication. Reasons for nonpublication are classified by whether they stem from an awareness of the study results (result-dependent reasons) or not (result-independent reasons) and whether the reasons affect the publication decisions of individual researchers or reviewers/​editors. I argue that result-independent reasons for nonpublication are less likely to introduce motivated reasoning into the publication decision process than are result-dependent reasons. I also argue that some reasons for nonpublication would produce beneficial as opposed to problematic publication bias. The typology of unpublished studies provides a descriptive scheme that can facilitate understanding of the population of study results across the field of psychology, within subdisciplines of psychology, or within specific psychology research domains. The typology also offers insight into different publication biases and research-dissemination practices and can guide individual researchers in organizing their own file drawers of unpublished studies.

“Artificial Intelligence in Drug Discovery: What Is Realistic, What Are Illusions? Part 1: Ways to Make an Impact, and Why We Are Not There Yet: Quality Is More Important Than Speed and Cost in Drug Discovery”, Bender & Cortés-Ciriano 2021

“Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet: Quality is more important than speed and cost in drug discovery”⁠, Andreas Bender, Isidro Cortés-Ciriano (2021-02; ; backlinks; similar):

We first attempted to simulate the effect of (1) speeding up phases in the drug discovery process, (2) making them cheaper and (3) making individual phases more successful on the overall financial outcome of drug-discovery projects. In every case, an improvement of the respective measure (speed, cost and success of phase) of 20% (in the case of failure rate in relative terms) has been assumed to quantify effects on the capital cost of bringing one successful drug to the market. For the simulations, a patent lifetime of 20 years was assumed, with patent applications filed at the start of clinical Phase I, and the net effect of changes of speed, cost and quality of decisions on overall project return was calculated, assuming that projects, on average, are able to return their own cost…(Studies such as [33], which posed the question of which changes are most efficient in terms of improving R&D productivity, returned similar results to those presented here, although we have quantified them in more detail.)

It can be seen in Figure 2 that a reduction of the failure rate (in particular across all clinical phases) has by far the most substantial impact on project value overall, multiple times that of a reduction of the cost of a particular phase or a decrease in the amount of time a particular phase takes. This effect is most profound in clinical Phase II, in agreement with previous studies [33], and it is a result of the relatively low success rate, long duration and high cost of the clinical phases. In other words, increasing the success of clinical phases decreases the number of expensive clinical trials needed to bring a drug to the market, and this decrease in the number of failures matters more than failing more quickly or more cheaply in terms of cost per successful, approved drug.

Figure 2: The impact of increasing speed (with the time taken for each phase reduced by 20%), improving the quality of the compounds tested in each phase (with the failure rate reduced by 20%), and decreasing costs (by 20%) on the net profit of a drug-discovery project, assuming patenting at time of first in human tests, and with other assumptions based on [“When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”, Scannell & Bosley 2016]. It can be seen that the quality of compounds taken forward has a much more profound impact on the success of projects, far beyond improving the speed and reducing the cost of the respective phase. This has implications for the most beneficial uses of AI in drug-discovery projects.
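
The figure’s qualitative claim is easy to reproduce with a toy pipeline model (the phase costs and success rates below are hypothetical round numbers, not the paper’s calibration, and discounting and patent life are ignored):

```python
# Hypothetical 3-phase clinical pipeline: (cost in $M, P(success)).
PHASES = [(25.0, 0.60),   # Phase I
          (60.0, 0.35),   # Phase II
          (150.0, 0.60)]  # Phase III

def cost_per_approval(phases):
    """Expected total spend divided by P(reaching approval)."""
    expected_cost, p_reach = 0.0, 1.0
    for cost, p_success in phases:
        expected_cost += p_reach * cost  # paid only if the project gets here
        p_reach *= p_success
    return expected_cost / p_reach

def tweak(phases, i, cost_factor=1.0, failure_factor=1.0):
    """Copy of the pipeline with phase i's cost scaled by cost_factor and
    its failure rate scaled by failure_factor (0.8 = 20% fewer failures)."""
    out = list(phases)
    cost, p = out[i]
    out[i] = (cost * cost_factor, 1.0 - (1.0 - p) * failure_factor)
    return out
```

Under these toy numbers, cutting Phase II costs by 20% drops the expected cost per approved drug from ≈$734M to ≈$677M, while cutting Phase II failures by 20% drops it to ≈$603M: quality of decisions beats cheapness of phases.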

…When translating this to drug-discovery programmes, this means that AI needs to support:

  1. better compounds going into clinical trials (related to the structure itself, but also including the right dosing/​PK for suitable efficacy versus the safety/​therapeutic index, in the desired target tissue);
  2. better validated targets (to decrease the number of failures owing to efficacy, especially in clinical Phases II and III, which have a profound impact on overall project success and in which target validation is currently probably not yet where one would like it to be [35]);
  3. better patient selection (eg. using biomarkers) [31]; and
  4. better conductance of trials (with respect to, eg. patient recruitment and adherence) [36].

This finding is in line with previous research in the area cited already [33], as well as a study that compared the impact of the quality of decisions that can be made to the number of compounds that can be processed with a particular technique [30]. In this latter case, the authors found that: “when searching for rare positives (eg. candidates that will successfully complete clinical development), changes in the predictive validity of screening and disease models that many people working in drug discovery would regard as small and/​or unknowable (ie. a 0.1 absolute change in correlation coefficient between model output and clinical outcomes in man) can offset large (eg. tenfold, even 100-fold) changes in models’ brute-force efficiency.” Still, currently the main focus of AI in drug discovery, in many cases, seems to be on speed and cost, as opposed to the quality of decisions.

“When the Numbers Do Not Add Up: The Practical Limits of Stochastologicals for Soft Psychology”, Broers 2021

2021-broers.pdf: “When the Numbers Do Not Add Up: The Practical Limits of Stochastologicals for Soft Psychology”⁠, Nick J. Broers (2021-01-22; similar):

One particular weakness of psychology that was left implicit by Meehl is the fact that psychological theories tend to be verbal theories, permitting at best ordinal predictions. Such predictions do not enable the high-risk tests that would strengthen our belief in the verisimilitude of theories but instead lead to the practice of null-hypothesis statistical-significance testing, a practice Meehl believed to be a major reason for the slow theoretical progress of soft psychology. The rising popularity of meta-analysis has led some to argue that we should move away from statistical-significance testing and focus on the size and stability of effects instead. Proponents of this reform assume that a greater emphasis on quantity can help psychology to develop a cumulative body of knowledge. The crucial question in this endeavor is whether the resulting numbers really have theoretical meaning. Psychological science lacks an undisputed, preexisting domain of observations analogous to the observations in the space-time continuum in physics. It is argued that, for this reason, effect sizes do not really exist independently of the adopted research design that led to their manifestation. Consequently, they can have no bearing on the verisimilitude of a theory.

“So Useful As a Good Theory? The Practicality Crisis in (Social) Psychological Theory”, Berkman & Wilson 2021

2021-berkman.pdf: “So Useful as a Good Theory? The Practicality Crisis in (Social) Psychological Theory”⁠, Elliot T. Berkman, Sylas M. Wilson (2021-01-07; similar):

Practicality was a valued attribute of academic psychological theory during its initial decades, but usefulness has since faded in importance to the field. Theories are now evaluated mainly on their ability to account for decontextualized laboratory data and not their ability to help solve societal problems. With laudable exceptions in the clinical, intergroup, and health domains, most psychological theories have little relevance to people’s everyday lives, poor accessibility to policymakers, or even applicability to the work of other academics who are better positioned to translate the theories to the practical realm. We refer to the lack of relevance, accessibility, and applicability of psychological theory to the rest of society as the practicality crisis. The practicality crisis harms the field in its ability to attract the next generation of scholars and maintain viability at the national level. We describe practical theory and illustrate its use in the field of self-regulation. Psychological theory is historically and scientifically well positioned to become useful should scholars in the field decide to value practicality. We offer a set of incentives to encourage the return of social psychology to the Lewinian vision of a useful science that speaks to pressing social issues.

[The unusually large chasm between the social sciences and their practical applications has been noted before. Focusing specifically on social psychology, Berkman and Wilson (2021) grade 360 articles published in the top-cited journal of the field over a five-year period on various criteria of practical import, generally finding quite low levels of “practicality” in the published research. For example, their average grade for the extent to which published papers offered actionable steps to address a specific problem was just 0.9 out of 4. They also look at the publication criteria of ten top journals; while all of them highlight the importance of original work that contributes to scientific progress, only 2 ask for even a brief statement of the public importance of the work.]

“The Statistical Properties of RCTs and a Proposal for Shrinkage”, Zwet et al 2020

“The statistical properties of RCTs and a proposal for shrinkage”⁠, Erik van Zwet, Simon Schwab, Stephen Senn (2020-11-30; backlinks; similar):

We abstract the concept of a randomized controlled trial (RCT) as a triple (β, b, s), where β is the primary efficacy parameter, b the estimate and s the standard error (s > 0). The parameter β is either a difference of means, a log odds ratio or a log hazard ratio. If we assume that b is unbiased and normally distributed⁠, then we can estimate the full joint distribution of (β, b, s) from a sample of pairs (bᵢ, sᵢ).

We have collected 23,747 such pairs from the Cochrane Database of Systematic Reviews to do so. Here, we report the estimated distribution of the signal-to-noise ratio β⁄s and the achieved power. We estimate the median achieved power to be 0.13. We also consider the exaggeration ratio, which is the factor by which the magnitude of β is overestimated. We find that if the estimate is just statistically-significant at the 5% level, we would expect it to overestimate the true effect by a factor of 1.7.

This exaggeration is sometimes referred to as the winner’s curse and it is undoubtedly to a considerable extent responsible for disappointing replication results. For this reason, we believe it is important to shrink the unbiased estimator, and we propose a method for doing so.
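
The winner’s curse is straightforward to simulate. A sketch in Python (using a single arbitrary true effect with β = s, ≈17% power, rather than van Zwet et al’s Cochrane-fitted distribution of signal-to-noise ratios):

```python
import random
from statistics import mean

def exaggeration_ratio(beta, s, n_sims=200_000, z_crit=1.96, seed=1):
    """Among unbiased estimates b ~ N(beta, s^2) that happen to reach
    statistical-significance (|b| > z_crit * s), by what factor does
    the average magnitude overstate the true |beta|?"""
    rng = random.Random(seed)
    magnitudes = (abs(rng.gauss(beta, s)) for _ in range(n_sims))
    winners = [b for b in magnitudes if b > z_crit * s]
    return mean(winners) / abs(beta)
```

At this power, the statistically-significant estimates overstate the true effect by ≈2.5× on average; the authors’ 1.7× figure instead conditions on estimates *just* reaching significance, averaged over their fitted Cochrane distribution.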

“Many Labs 5: Testing Pre-Data-Collection Peer Review As an Intervention to Increase Replicability”, Ebersole et al 2020

2020-ebersole.pdf: “Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability”⁠, Charles R. Ebersole, Maya B. Mathur, Erica Baranski, Diane-Jo Bart-Plange, Nicholas R. Buttrick, Christopher R. Chartier et al (2020-11-13; backlinks; similar):

Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically-significant effect (p < 0.05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = 0.002 or 0.014, depending on analytic approach). The median effect size for the revised protocols (r = 0.05) was similar to that of the RP:P protocols (r = 0.04) and the original RP:P replications (r = 0.11), and smaller than that of the original studies (r = 0.37). 
Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = 0.07, range = 0.00–0.15) were 78% smaller, on average, than the original effect sizes (median r = 0.37, range = 0.19–0.50).

[Keywords: replication, reproducibility, metascience, peer review, Registered Reports, open data, preregistered]

“The Reproducibility of Statistical Results in Psychological Research: An Investigation Using Unpublished Raw Data”, Artner et al 2020

2020-artner.pdf: “The reproducibility of statistical results in psychological research: An investigation using unpublished raw data”⁠, Richard Artner, Thomas Verliefde, Sara Steegen, Sara Gomes, Frits Traets, Francis Tuerlinckx, Wolf Vanpaemel et al (2020-11-12; backlinks; similar):

We investigated the reproducibility of the major statistical conclusions drawn in 46 articles published in 2012 in three APA journals. After having identified 232 key statistical claims, we tried to reproduce, for each claim, the test statistic, its degrees of freedom, and the corresponding p-value, starting from the raw data that were provided by the authors and closely following the Method section in the article. Out of the 232 claims, we were able to successfully reproduce 163 (70%), 18 of which only by deviating from the article’s analytical description. Thirteen (7%) of the 185 claims deemed statistically-significant by the authors are no longer so. The reproduction successes were often the result of cumbersome and time-consuming trial-and-error work, suggesting that APA style reporting in conjunction with raw data makes numerical verification at least hard, if not impossible. This article discusses the types of mistakes we could identify and the tediousness of our reproduction efforts in the light of a newly developed taxonomy for reproducibility. We then link our findings with other findings of empirical research on this topic, give practical recommendations on how to achieve reproducibility, and discuss the challenges of large-scale reproducibility checks as well as promising ideas that could considerably increase the reproducibility of psychological research.

“Cite Unseen: Theory and Evidence on the Effect of Open Access on Cites to Academic Articles Across the Quality Spectrum”, McCabe & Snyder 2020

2020-mcabe.pdf: “Cite Unseen: Theory and Evidence on the Effect of Open Access on Cites to Academic Articles Across the Quality Spectrum”⁠, Mark J. McCabe, Christopher Snyder (2020-11-01; backlinks; similar):

Our previous paper (McCabe & Snyder 2014) contained the provocative result that, despite a positive average effect, open access reduces cites to some articles, in particular those published in lower-tier journals. We propose a model in which open access leads more readers to acquire the full text, yielding more cites from some, but fewer cites from those who would have cited the article based on superficial knowledge but who refrain once they learn that the article is a bad match. We test the theory with data for over 200,000 science articles binned by cites received during a pre-study period. Consistent with the theory, the marginal effect of open access is negative for the least-cited articles, positive for the most cited, and generally monotonic for quality levels in between. Also consistent with the theory is a magnification of these effects for articles placed on PubMed Central, one of the broadest open-access platforms, and the differential pattern of results for cites from insiders versus outsiders to the article’s field.

“Psychological Measurement and the Replication Crisis: Four Sacred Cows”, Lilienfeld & Strother 2020

2020-lilienfeld.pdf: “Psychological measurement and the replication crisis: Four sacred cows”⁠, Scott O. Lilienfeld, Adele N. Strother (2020-11-01; similar):

Although there are surely multiple contributors to the replication crisis in psychology, one largely unappreciated source is a neglect of basic principles of measurement. We consider 4 sacred cows—widely shared and rarely questioned assumptions—in psychological measurement that may fuel the replicability crisis by contributing to questionable measurement practices. These 4 sacred cows are:

  1. we can safely rely on the name of a measure to infer its content;
  2. reliability is not a major concern for laboratory measures;
  3. using measures that are difficult to collect obviates the need for large sample sizes; and
  4. convergent validity data afford sufficient evidence for construct validity.

For items #1 and #4, we provide provisional data from recent psychological journals that support our assertion that such beliefs are prevalent among authors.

To enhance the replicability of psychological science, researchers will need to become vigilant against erroneous assumptions regarding both the psychometric properties of their measures and the implications of these psychometric properties for their studies.

[Keywords: discriminant validity, experimental replication, measurement, psychological assessment, sample size, construct validity, convergent validity, experimental laboratories, test reliability, statistical power]

“C60 in Olive Oil Causes Light-dependent Toxicity and Does Not Extend Lifespan in Mice”, Grohn et al 2020

2020-grohn.pdf: “C60 in olive oil causes light-dependent toxicity and does not extend lifespan in mice”⁠, Kristopher J. Grohn, Brandon S. Moyer, Danique C. Wortel, Cheyanne M. Fisher, Ellie Lumen, Anthony H. Bianchi et al (2020-10-29; ; backlinks; similar):

C60 is a potent antioxidant that has been reported to substantially extend the lifespan of rodents when formulated in olive oil (C60-OO) or extra virgin olive oil (C60-EVOO). Despite there being no regulated form of C60-OO, people have begun obtaining it from online sources and dosing it to themselves or their pets, presumably with the assumption of safety and efficacy. In this study, we obtain C60-OO from a sample of online vendors, and find marked discrepancies in appearance, impurity profile, concentration, and activity relative to pristine C60-OO formulated in-house. We additionally find that pristine C60-OO causes no acute toxicity in a rodent model but does form toxic species that can cause statistically-significant morbidity and mortality in mice in under 2 weeks when exposed to light levels consistent with ambient light. Intraperitoneal injections of C60-OO did not affect the lifespan of CB6F1 female mice. Finally, we conduct a lifespan and health span study in male and female C57BL/6J mice comparing oral treatment with pristine C60-EVOO and EVOO alone versus untreated controls. We failed to observe statistically-significant lifespan and health span benefits of C60-EVOO or EVOO supplementation compared to untreated controls, both starting the treatment in adult or old age. Our results call into question the biological benefit of C60-OO in aging.

“Heterogeneity in Direct Replications in Psychology and Its Association With Effect Size”, Olsson-Collentine et al 2020

2020-olssoncollentine.pdf: “Heterogeneity in direct replications in psychology and its association with effect size”⁠, Anton Olsson-Collentine, Jelte M. Wicherts, Marcel A. L. M. van Assen (2020-10-01; backlinks; similar):

Impact Statement: This article suggests that for direct replications in social and cognitive psychology research, small variations in design (sample settings and population) are an unlikely explanation for differences in findings of studies. Differences in findings of direct replications are particularly unlikely if the overall effect is (close to) 0, whereas these differences are more likely if the overall effect is larger.

We examined the evidence for heterogeneity (of effect sizes) when only minor changes to sample population and settings were made between studies and explored the association between heterogeneity and average effect size in a sample of 68 meta-analyses from 13 preregistered multilab direct replication projects in social and cognitive psychology. Among the many examined effects, examples include the Stroop effect, the ‘verbal overshadowing’ effect, and various priming effects such as ‘anchoring’ effects. We found limited heterogeneity; 48⁄68 (71%) meta-analyses had nonsignificant heterogeneity, and most (49⁄68; 72%) were most likely to have zero to small heterogeneity. Power to detect small heterogeneity (as defined by Higgins, Thompson, Deeks, & Altman, 2003) was low for all projects (mean 43%), but good to excellent for medium and large heterogeneity. Our findings thus show little evidence of widespread heterogeneity in direct replication studies in social and cognitive psychology, suggesting that minor changes in sample population and settings are unlikely to affect research outcomes in these fields of psychology. We also found strong correlations between observed average effect sizes (standardized mean differences and log odds ratios) and heterogeneity in our sample. Our results suggest that heterogeneity and moderation of effects are unlikely for a 0 average true effect size, but increasingly likely for larger average true effect sizes.

[Keywords: heterogeneity, meta-analysis, psychology, direct replication, many labs]

“A Replication Crisis in Methodological Research?”, Boulesteix et al 2020

2020-boulesteix.pdf: “A replication crisis in methodological research?”⁠, Anne-Laure Boulesteix, Sabine Hoffmann, Alethea Charlton, Heidi Seibold (2020-09-29; similar):

Statisticians have been keen to critique statistical aspects of the “replication crisis” in other scientific disciplines. But new statistical tools are often published and promoted without any thought to replicability. This needs to change, argue Anne-Laure Boulesteix, Sabine Hoffmann, Alethea Charlton and Heidi Seibold.

“The Small Effects of Political Advertising Are Small regardless of Context, Message, Sender, or Receiver: Evidence from 59 Real-time Randomized Experiments”, Coppock et al 2020

“The small effects of political advertising are small regardless of context, message, sender, or receiver: Evidence from 59 real-time randomized experiments”⁠, Alexander Coppock, Seth J. Hill, Lynn Vavreck (2020-09-02; ⁠, ; backlinks; similar):

Evidence across social science indicates that average effects of persuasive messages are small. One commonly offered explanation for these small effects is heterogeneity: Persuasion may only work well in specific circumstances. To evaluate heterogeneity, we repeated an experiment weekly in real time using 2016 U.S. presidential election campaign advertisements. We tested 49 political advertisements in 59 unique experiments on 34,000 people.

We investigate heterogeneous effects by sender (candidates or groups), receiver (subject partisanship), content (attack or promotional), and context (battleground versus non-battleground, primary versus general election, and early versus late). We find small average effects on candidate favorability and vote. These small effects, however, do not mask substantial heterogeneity even where theory from political science suggests that we should find it.

During the primary and general election, in battleground states, for Democrats, Republicans, and Independents, effects are similarly small. Heterogeneity with large offsetting effects is not the source of small average effects.

“Publication Rate in Preclinical Research: a Plea for Preregistration”, Naald et al 2020

“Publication rate in preclinical research: a plea for preregistration”⁠, Mira van der Naald, Steven Wenker, Pieter A. Doevendans, Kimberley E. Wever, Steven A. J. Chamuleau (2020-08-27; backlinks; similar):

Objectives: The ultimate goal of biomedical research is the development of new treatment options for patients. Animal models are used if questions cannot be addressed otherwise. Currently, it is widely believed that a large fraction of performed studies are never published, but there are no data that directly address this question.

Methods: We have tracked a selection of animal study protocols approved in the University Medical Center Utrecht in the Netherlands, to assess whether these have led to a publication with a follow-up period of 7 years.

Results: We found that 60% of all animal study protocols led to at least one publication (full text or abstract). A total of 5,590 animals were used in these studies, of which 26% were reported in the resulting publications.

Conclusions: The data presented here underline the need for preclinical preregistration, in view of the risk of reporting and publication bias in preclinical research. We plead that all animal study protocols should be prospectively registered on an online, accessible platform to increase transparency and data sharing. To facilitate this, we have developed a platform dedicated to animal study protocol registration.

Strengths and limitations of this study:

  • This study directly traces animal study protocols to potential publications and is the first study to assess the number of animals used and the number of animals published.
  • We had full access to all documents submitted to the animal experiment committee of the University Medical Center Utrecht from the selected protocols.
  • There is a sufficient follow-up period for researchers to publish their animal study.
  • Due to privacy reasons, we are not able to publish the exact search terms used.
  • A delay has occurred between the start of this project and time of publishing, this is related to the political sensitivity of this subject.

“Towards Reproducible Brain-Wide Association Studies”, Marek et al 2020

“Towards Reproducible Brain-Wide Association Studies”⁠, Scott Marek, Brenden Tervo-Clemmens, Finnegan J. Calabro, David F. Montez, Benjamin P. Kay, Alexander S. Hatoum et al (2020-08-22; ; backlinks; similar):

Magnetic resonance imaging (MRI) continues to drive many important neuroscientific advances. However, progress in uncovering reproducible associations between individual differences in brain structure/​function and behavioral phenotypes (eg. cognition, mental health) may have been undermined by typical neuroimaging sample sizes (median n = 25) [1, 2].

Leveraging the Adolescent Brain Cognitive Development (ABCD) Study [3] (n = 11,878), we estimated the effect sizes and reproducibility of these brain-wide associations studies (BWAS) as a function of sample size.

The very largest, replicable brain-wide associations for univariate and multivariate methods were r = 0.14 and r = 0.34, respectively. In smaller samples, typical for brain-wide association studies (BWAS), irreproducible, inflated effect sizes were ubiquitous, no matter the method (univariate, multivariate).

Until sample sizes started to approach consortium-levels, BWAS were underpowered and statistical errors assured. Multiple factors contribute to replication failures [4–6]; here, we show that the pairing of small brain-behavioral phenotype effect sizes with sampling variability is a key element in widespread BWAS replication failure. Brain-behavioral phenotype associations stabilize and become more reproducible with sample sizes of n ⪆ 2,000. While investigator-initiated brain-behavior research continues to generate hypotheses and propel innovation, large consortia are needed to usher in a new era of reproducible human brain-wide association studies.
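
The core issue, sampling variability of correlations at typical n, can be illustrated in a few lines of Python (toy bivariate-normal data with a true r = 0.07, near the median effect size above):

```python
import math, random
from statistics import pstdev

def sample_r(true_r, n, rng):
    """Sample correlation from n draws of a bivariate normal with
    population correlation true_r."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [true_r * x + math.sqrt(1.0 - true_r ** 2) * rng.gauss(0.0, 1.0)
          for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def spread(true_r, n, reps=2000, seed=0):
    """Standard deviation of the sample correlation across simulated studies."""
    rng = random.Random(seed)
    return pstdev(sample_r(true_r, n, rng) for _ in range(reps))
```

At n = 25 the sample correlation swings with a standard deviation of ≈0.2, so observing |r| ≥ 0.3 around a true r of 0.07 is routine; at n = 2,000 the spread shrinks to ≈0.02.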

“Laypeople Can Predict Which Social-Science Studies Will Be Replicated Successfully”, Hoogeveen et al 2020

“Laypeople Can Predict Which Social-Science Studies Will Be Replicated Successfully”⁠, Suzanne Hoogeveen, Alexandra Sarafoglou, Eric-Jan Wagenmakers (2020-08-21; backlinks; similar):

Large-scale collaborative projects recently demonstrated that several key findings from the social-science literature could not be replicated successfully. Here, we assess the extent to which a finding’s replication success relates to its intuitive plausibility. Each of 27 high-profile social-science findings was evaluated by 233 people without a Ph.D. in psychology. Results showed that these laypeople predicted replication success with above-chance accuracy (ie. 59%). In addition, when participants were informed about the strength of evidence from the original studies, this boosted their prediction performance to 67%. We discuss the prediction patterns and apply signal detection theory to disentangle detection ability from response bias. Our study suggests that laypeople’s predictions contain useful information for assessing the probability that a given finding will be replicated successfully.

[Keywords: open science, meta-science, replication crisis, prediction survey, open data, open materials, preregistered]
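[The signal-detection decomposition mentioned in the abstract separates detection ability (d′) from response bias (criterion c); a minimal sketch, with hypothetical hit/false-alarm rates (not the paper's data) chosen only for illustration:

```python
from statistics import NormalDist

def sdt(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion_c): sensitivity and response bias
    under the equal-variance Gaussian signal-detection model."""
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(false_alarm_rate)
    c = -0.5 * (z(hit_rate) + z(false_alarm_rate))
    return d_prime, c

# Hypothetical rates: 69% of replicable studies judged "will replicate",
# but 31% of non-replicable studies also judged "will replicate".
d, c = sdt(0.69, 0.31)
print(d, c)  # d' ≈ 0.99, c ≈ 0 (above-chance sensitivity, no bias)
```
]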

“Specification Curve Analysis”, Simonsohn et al 2020

2020-simonsohn.pdf: “Specification curve analysis”⁠, Uri Simonsohn, Joseph P. Simmons, Leif D. Nelson (2020-07-27; ; backlinks; similar):

Empirical results hinge on analytical decisions that are defensible, arbitrary and motivated. These decisions probably introduce bias (towards the narrative put forward by the authors), and they certainly involve variability not reflected by standard errors.

To address this source of noise and bias, we introduce specification curve analysis, which consists of 3 steps: (1) identifying the set of theoretically justified, statistically valid and non-redundant specifications; (2) displaying the results graphically, allowing readers to identify consequential specification decisions; and (3) conducting joint inference across all specifications.

We illustrate the use of this technique by applying it to 3 findings from 2 different papers, one investigating discrimination based on distinctively black names, the other investigating the effect of assigning female versus male names to hurricanes. Specification curve analysis reveals that one finding is robust, one is weak and one is not robust at all.
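[A minimal sketch of the 3 steps on toy data; the specifications (covariate adjustment, outlier trimming) and all numbers are invented for illustration:

```python
import itertools
import random
import statistics

random.seed(0)
# Toy data: outcome y depends weakly on x, more strongly on covariate z.
n = 200
z = [random.gauss(0, 1) for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.2 * xi + 0.5 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

def slope(xs, ys):
    """OLS slope of ys on xs (simple regression)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

# Step 1: enumerate the defensible specifications (here: 2 binary decisions).
specs = list(itertools.product(["none", "adjust_z"], ["all", "trim_outliers"]))
estimates = []
for covariate, sample_rule in specs:
    keep = [i for i in range(n) if sample_rule == "all" or abs(y[i]) < 2.5]
    ys, xs = [y[i] for i in keep], [x[i] for i in keep]
    if covariate == "adjust_z":  # crude adjustment: residualize y on z
        zs = [z[i] for i in keep]
        b = slope(zs, ys)
        ys = [yi - b * zi for yi, zi in zip(ys, zs)]
    estimates.append(slope(xs, ys))

# Steps 2-3: sort the estimates into the "curve" and summarize jointly.
print(sorted(round(e, 3) for e in estimates))
```

A robust finding yields a curve of estimates that stays on one side of zero across all defensible specifications; a fragile one flips sign or collapses under minor decision changes.]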

“RCTs to Scale: Comprehensive Evidence from Two Nudge Units”, DellaVigna & Linos 2020

2020-dellavigna.pdf: “RCTs to Scale: Comprehensive Evidence from Two Nudge Units”⁠, Stefano DellaVigna, Elizabeth Linos (2020-07-01; ; backlinks; similar):

Nudge interventions have quickly expanded from academic studies to larger implementation in so-called Nudge Units in governments. This provides an opportunity to compare interventions in research studies, versus at scale. We assemble an unique data set of 126 RCTs covering over 23 million individuals, including all trials run by 2 of the largest Nudge Units in the United States. We compare these trials to a sample of nudge trials published in academic journals from 2 recent meta-analyses.

In papers published in academic journals, the average impact of a nudge is very large—an 8.7 percentage point take-up effect, a 33.5% increase over the average control. In the Nudge Unit trials, the average impact is still sizable and highly statistically-significant, but smaller at 1.4 percentage points, an 8.1% increase [8.7 / 1.4 = 6.2×].

We consider 5 potential channels for this gap: statistical power, selective publication, academic involvement, differences in trial features and in nudge features. Publication bias in the academic journals, exacerbated by low statistical power, can account for the full difference in effect sizes. Academic involvement does not account for the difference. Different features of the nudges, such as in-person versus letter-based communication, likely reflecting institutional constraints, can partially explain the different effect sizes.

We conjecture that larger sample sizes and institutional constraints, which play an important role in our setting, are relevant in other at-scale implementations. Finally, we compare these results to the predictions of academics and practitioners. Most forecasters overestimate the impact for the Nudge Unit interventions, though nudge practitioners are almost perfectly calibrated.

Figure 4: Nudge treatment effects. This figure plots the treatment effect relative to control group take-up for each nudge. Nudges with extreme treatment effects are labeled for context.

…In this paper, we present the results of an unique collaboration with 2 of the major “Nudge Units”: BIT North America operating at the level of US cities and SBST/​OES for the US Federal government. These 2 units kept a comprehensive record of all trials that they ran from inception in 2015 to July 2019, for a total of 165 trials testing 349 nudge treatments and a sample size of over 37 million participants. In a remarkable case of administrative transparency, each trial had a trial report, including in many cases a pre-analysis plan. The 2 units worked with us to retrieve the results of all the trials. Importantly, over 90% of these trials have not been documented in working paper or academic publication format. [emphasis added]

…Since we are interested in comparing the Nudge Unit trials to nudge papers in the literature, we aim to find broadly comparable studies in academic journals, without hand-picking individual papers. We lean on 2 recent meta-analyses summarizing over 100 RCTs across many different applications (Benartzi et al 2017⁠, and Hummel & Maedche 2019). We apply similar restrictions as we did in the Nudge Unit sample, excluding lab or hypothetical experiments and non-RCTs, treatments with financial incentives, requiring treatments with binary dependent variables, and excluding default effects. This leaves a final sample of 26 RCTs, including 74 nudge treatments with 505,337 participants. Before we turn to the results, we stress that the features of behavioral interventions in academic journals do not perfectly match with the nudge treatments implemented by the Nudge Units, a difference to which we indeed return below. At the same time, overall interventions conducted by Nudge Units are fairly representative of the type of nudge treatments that are run by researchers.

What do we find? In the sample of 26 papers in the Academic Journals sample, we compute the average (unweighted) impact of a nudge across the 74 nudge interventions. We find that on average a nudge intervention increases the take-up by 8.7 (s.e. = 2.5) percentage points, out of an average control take-up of 26.0 percentage points.

Turning to the 126 trials by Nudge Units, we estimate an unweighted impact of 1.4 percentage points (s.e. = 0.3), out of an average control take-up of 17.4 percentage points. While this impact is highly statistically-significantly different from 0 and sizable, it is about 1⁄6th the size of the estimated nudge impact in academic papers. What explains this large difference in the impact of nudges?

We discuss 3 features of the 2 samples which could account for this difference. First, we document a large difference in the sample size and thus statistical power of the interventions. The median nudge intervention in the Academic Journals sample has a treatment arm sample size of 484 participants and a minimum detectable effect size (MDE, the effect size that can be detected with 80% power) of 6.3 percentage points. In contrast, the nudge interventions in the Nudge Units have a median treatment arm sample size of 10,006 participants and MDE of 0.8 percentage points. Thus, the statistical power for the trials in the Academic Journals sample is nearly an order of magnitude smaller. This illustrates a key feature of the “at scale” implementation: the implementation in an administrative setting allows for a larger sample size. Importantly, the smaller sample size for the Academic Journals papers could lead not just to noisier estimates, but also to upward-biased point estimates in the presence of publication bias.
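[The MDE figures follow from the standard normal-approximation formula for a two-proportion test; this sketch plugs in the samples' average control take-up rather than per-trial rates, so it only roughly reproduces the medians quoted above:

```python
from math import sqrt
from statistics import NormalDist

def mde_two_proportions(p_control, n_per_arm, alpha=0.05, power=0.80):
    """Minimum detectable effect (percentage-point lift) for a two-arm
    test of proportions, normal approximation, equal arm sizes."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return z * sqrt(2 * p_control * (1 - p_control) / n_per_arm)

# Academic Journals sample: median arm n = 484, average control take-up 26.0%:
print(round(mde_two_proportions(0.26, 484), 3))      # ≈ 0.079 (~8 points)
# Nudge Unit sample: median arm n = 10,006, average control take-up 17.4%:
print(round(mde_two_proportions(0.174, 10_006), 3))  # ≈ 0.015 (~1.5 points)
```
]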

A second difference, directly zooming into publication bias, is the evidence of selective publication of studies with statistically-significant results (t > 1.96), versus studies that are not statistically-significant (t < 1.96). In the sample of Academic Journals nudges, there are over 4× as many studies with a t-statistic for the most statistically-significant nudge between 1.96 and 2.96, versus the number of studies with the most statistically-significant nudge with a t between 0.96 and 1.96. Interestingly, the publication bias appears to operate at the level of the most statistically-significant treatment arm within a paper. By comparison, we find no evidence of a discontinuity in the distribution of t-statistics for the Nudge Unit sample, consistent with the fact that the Nudge Unit registry contains the comprehensive sample of all studies run. We stress here that with “publication bias” we include not just whether a journal would publish a paper, but also whether a researcher would write up a study (the “file drawer” problem). In the Nudge Units sample, all these selective steps are removed, as we access all studies that were run.
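[The discontinuity check amounts to a caliper test: absent selective publication, studies should land just above vs just below t = 1.96 about equally often, so a lopsided split is evidence of selection. A sketch with hypothetical counts echoing the “over 4×” ratio (not the paper's actual counts):

```python
from math import comb

def caliper_test(n_above, n_below):
    """Two-sided exact binomial test that studies fall above vs below
    the significance threshold with equal probability (p = 0.5)."""
    n = n_above + n_below
    k = max(n_above, n_below)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical: 20 studies with max t in (1.96, 2.96) vs 4 in (0.96, 1.96).
print(round(caliper_test(20, 4), 4))  # ≈ 0.0015: strong evidence of selection
```
]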

“Can Short Psychological Interventions Affect Educational Performance? Revisiting the Effect of Self-Affirmation Interventions”, Serra-Garcia et al 2020

2020-serragarcia.pdf: “Can Short Psychological Interventions Affect Educational Performance? Revisiting the Effect of Self-Affirmation Interventions”⁠, Marta Serra-Garcia, Karsten T. Hansen, Uri Gneezy (2020-07-01; backlinks; similar):

Large amounts of resources are spent annually to improve educational achievement and to close the gender gap in sciences with typically very modest effects. In 2010, a 15-min self-affirmation intervention showed a dramatic reduction in this gender gap. We reanalyzed the original data and found several critical problems. First, the self-affirmation hypothesis stated that women’s performance would improve. However, the data showed no improvement for women. There was an interaction effect between self-affirmation and gender caused by a negative effect on men’s performance. Second, the findings were based on covariate-adjusted interaction effects, which imply that self-affirmation reduced the gender gap only for the small sample of men and women who did not differ in the covariates. Third, specification-curve analyses with more than 1,500 possible specifications showed that less than one quarter yielded statistically-significant interaction effects and less than 3% showed significant improvements among women.

“The Multiverse of Methods: Extending the Multiverse Analysis to Address Data-Collection Decisions”, Harder 2020

2020-harder.pdf: “The Multiverse of Methods: Extending the Multiverse Analysis to Address Data-Collection Decisions”⁠, Jenna A. Harder (2020-06-29; similar):

When analyzing data, researchers may have multiple reasonable options for the many decisions they must make about the data—for example, how to code a variable or which participants to exclude. Therefore, there exists a multiverse of possible data sets. A classic multiverse analysis involves performing a given analysis on every potential data set in this multiverse to examine how each data decision affects the results. However, a limitation of the multiverse analysis is that it addresses only data cleaning and analytic decisions, yet researcher decisions that affect results also happen at the data-collection stage. I propose an adaptation of the multiverse method in which the multiverse of data sets is composed of real data sets from studies varying in data-collection methods of interest. I walk through an example analysis applying the approach to 19 studies on shooting decisions to demonstrate the usefulness of this approach and conclude with a further discussion of the limitations and applications of this method.

“New Revelations About Rosenhan’s Pseudopatient Study: Scientific Integrity in Remission”, Griggs et al 2020

2020-griggs.pdf: “New Revelations About Rosenhan’s Pseudopatient Study: Scientific Integrity in Remission”⁠, Richard A. Griggs, Jenna Blyler, Sherri L. Jackson (2020-06-11; backlinks; similar):

David Rosenhan’s pseudopatient study is one of the most famous studies in psychology, but it is also one of the most criticized studies in psychology. Almost 50 years after its publication, it is still discussed in psychology textbooks, but the extensive body of criticism is not, likely leading teachers not to present the study as the contentious classic that it is. New revelations by Susannah Cahalan (2019), based on her years of investigation of the study and her analysis of the study’s archival materials, question the validity and veracity of both Rosenhan’s study and his reporting of it as well as Rosenhan’s scientific integrity. Because many (if not most) teachers are likely not aware of Cahalan’s findings, we provide a summary of her main findings so that if they still opt to cover Rosenhan’s study, they can do so more accurately. Because these findings are related to scientific integrity, we think that they are best discussed in the context of research ethics and methods. To aid teachers in this task, we provide some suggestions for such discussions.

[ToC: Rosenhan’s Misrepresentations of the Pseudopatient Script and His Medical Record · Selective Reporting of Data · Rosenhan’s Failure to Prepare and Protect Other Pseudopatients · Reporting Questionable Data and Possibly Pseudo-Pseudopatients · Concluding Remarks · Footnotes · References]

“How Do Scientific Views Change? Notes From an Extended Adversarial Collaboration”, Cowan et al 2020

2020-cowan.pdf: “How Do Scientific Views Change? Notes From an Extended Adversarial Collaboration”⁠, Nelson Cowan, Clément Belletier, Jason M. Doherty, Agnieszka J. Jaroslawska, Stephen Rhodes, Alicia Forsberg et al (2020-06-08; similar):

There are few examples of an extended adversarial collaboration, in which investigators committed to different theoretical views collaborate to test opposing predictions. Whereas previous adversarial collaborations have produced single research articles, here, we share our experience in programmatic, extended adversarial collaboration involving three laboratories in different countries with different theoretical views regarding working memory⁠, the limited information retained in mind, serving ongoing thought and action. We have focused on short-term memory retention of items (letters) during a distracting task (arithmetic), and effects of aging on these tasks. Over several years, we have conducted and published joint research with preregistered predictions, methods, and analysis plans, with replication of each study across two laboratories concurrently. We argue that, although an adversarial collaboration will not usually induce senior researchers to abandon favored theoretical views and adopt opposing views, it will necessitate varieties of their views that are more similar to one another, in that they must account for a growing, common corpus of evidence. This approach promotes understanding of others’ views and presents to the field research findings accepted as valid by researchers with opposing interpretations. We illustrate this process with our own research experiences and make recommendations applicable to diverse scientific areas.

[Keywords: scientific method, adversarial collaboration, scientific views, changing views, working memory]

“What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis”, Elliott et al 2020

2020-elliott.pdf: “What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis”⁠, Maxwell L. Elliott, Annchen R. Knodt, David Ireland, Meriwether L. Morris, Richie Poulton, Sandhya Ramrakha et al (2020-06-07; similar):

Identifying brain biomarkers of disease risk is a growing priority in neuroscience. The ability to identify meaningful biomarkers is limited by measurement reliability; unreliable measures are unsuitable for predicting clinical outcomes. Measuring brain activity using task functional MRI (fMRI) is a major focus of biomarker development; however, the reliability of task fMRI has not been systematically evaluated.

We present converging evidence demonstrating poor reliability of task-fMRI measures. First, a meta-analysis of 90 experiments (n = 1,008) revealed poor overall reliability—mean intraclass correlation coefficient (ICC) = 0.397. Second, the test-retest reliabilities of activity in a priori regions of interest across 11 common fMRI tasks collected by the Human Connectome Project (n = 45) and the Dunedin Study (n = 20) were poor (ICCs = 0.067–0.485).

Collectively, these findings demonstrate that common task-fMRI measures are not currently suitable for brain biomarker discovery or for individual-differences research. We review how this state of affairs came to be and highlight avenues for improving task-fMRI reliability.
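[For reference, a from-scratch sketch of ICC(2,1) (two-way random effects, absolute agreement, single measurement), one common test-retest index; the exact ICC variant differs across the studies meta-analyzed, so this is illustrative only:

```python
import statistics

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measure. `data` is a list of rows, one per subject; columns are
    test/retest sessions."""
    n, k = len(data), len(data[0])
    grand = statistics.fmean(v for row in data for v in row)
    row_means = [statistics.fmean(row) for row in data]
    col_means = [statistics.fmean(col) for col in zip(*data)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in data for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                # between-subject mean square
    msc = ss_cols / (k - 1)                # between-session mean square
    mse = ss_err / ((n - 1) * (k - 1))     # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect test-retest agreement -> ICC = 1:
perfect = [[1, 1], [2, 2], [3, 3], [4, 4]]
print(round(icc_2_1(perfect), 3))  # 1.0
# A constant session effect (everyone scores 2 points higher at retest)
# lowers absolute-agreement reliability even though ranks are preserved:
shifted = [[1, 3], [2, 4], [3, 5], [4, 6]]
print(round(icc_2_1(shifted), 3))  # 0.455
```
]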

“Reproducibility of Animal Research in Light of Biological Variation”, Voelkl et al 2020

2020-voekl.pdf: “Reproducibility of animal research in light of biological variation”⁠, Bernhard Voelkl, Naomi S. Altman, Anders Forsman, Wolfgang Forstmeier, Jessica Gurevitch, Ivana Jaric et al (2020-06-02; backlinks; similar):

Context-dependent biological variation presents an unique challenge to the reproducibility of results in experimental animal research, because organisms’ responses to experimental treatments can vary with both genotype and environmental conditions. In March 2019, experts in animal biology, experimental design and statistics convened in Blonay, Switzerland, to discuss strategies addressing this challenge.

In contrast to the current gold standard of rigorous standardization in experimental animal research, we recommend the use of systematic heterogenization of study samples and conditions by actively incorporating biological variation into study design through diversifying study samples and conditions.

Here we provide the scientific rationale for this approach in the hope that researchers, regulators, funders and editors can embrace this paradigm shift. We also present a road map towards better practices in view of improving the reproducibility of animal research.

“Variability in the Analysis of a Single Neuroimaging Dataset by Many Teams”, Botvinik-Nezer et al 2020

2020-botviniknezer.pdf: “Variability in the analysis of a single neuroimaging dataset by many teams”⁠, Rotem Botvinik-Nezer, Felix Holzmeister, Colin F. Camerer, Anna Dreber, Juergen Huber, Magnus Johannesson et al (2020-05-20; backlinks; similar):

Data analysis workflows in many scientific domains have become increasingly complex and flexible. Here we assess the effect of this flexibility on the results of functional magnetic resonance imaging by asking 70 independent teams to analyse the same dataset, testing the same 9 ex-ante hypotheses. The flexibility of analytical approaches is exemplified by the fact that no two teams chose identical workflows to analyse the data. This flexibility resulted in sizeable variation in the results of hypothesis tests, even for teams whose statistical maps were highly correlated at intermediate stages of the analysis pipeline. Variation in reported results was related to several aspects of analysis methodology. Notably, a meta-analytical approach that aggregated information across teams yielded a statistically-significant consensus in activated regions. Furthermore, prediction markets of researchers in the field revealed an overestimation of the likelihood of statistically-significant findings, even by researchers with direct knowledge of the dataset. Our findings show that analytical flexibility can have substantial effects on scientific conclusions, and identify factors that may be related to variability in the analysis of functional magnetic resonance imaging. The results emphasize the importance of validating and sharing complex analysis workflows, and demonstrate the need for performing and reporting multiple analyses of the same data. Potential approaches that could be used to mitigate issues related to analytical variability are discussed.

“Bilingualism Affords No General Cognitive Advantages: A Population Study of Executive Function in 11,000 People”, Nichols et al 2020

2020-nichols.pdf: “Bilingualism Affords No General Cognitive Advantages: A Population Study of Executive Function in 11,000 People”⁠, Emily S. Nichols, Conor J. Wild, Bobby Stojanoski, Michael E. Battista, Adrian M. Owen (2020-04-20; ; backlinks; similar):

Whether acquiring a second language affords any general advantages to executive function has been a matter of fierce scientific debate for decades. If being bilingual does have benefits over and above the broader social, employment, and lifestyle gains that are available to speakers of a second language, then it should manifest as a cognitive advantage in the general population of bilinguals. We assessed 11,041 participants on a broad battery of 12 executive tasks whose functional and neural properties have been well described. Bilinguals showed an advantage over monolinguals on only one test (whereas monolinguals performed better on four tests), and these effects all disappeared when the groups were matched to remove potentially confounding factors. In any case, the size of the positive bilingual effect in the unmatched groups was so small that it would likely have a negligible impact on the cognitive performance of any individual.

[Keywords: bilingualism, executive function, cognition, aging, null-hypothesis testing.]

“Ideological Diversity, Hostility, and Discrimination in Philosophy”, Peters et al 2020

2020-peters.pdf: “Ideological diversity, hostility, and discrimination in philosophy”⁠, Uwe Peters, Nathan Honeycutt, Andreas De Block, Lee Jussim (2020-04-16; similar):

Members of the field of philosophy have, just as other people, political convictions or, as psychologists call them, ideologies. How are different ideologies distributed and perceived in the field? Using the familiar distinction between the political left and right, we surveyed an international sample of 794 subjects in philosophy. We found that survey participants clearly leaned left (75%), while right-leaning individuals (14%) and moderates (11%) were underrepresented. Moreover, and strikingly, across the political spectrum from very left-leaning individuals and moderates to very right-leaning individuals, participants reported experiencing ideological hostility in the field, occasionally even from those on their own side of the political spectrum. Finally, while about half of the subjects believed that discrimination against left-leaning or right-leaning individuals in the field is not justified, a substantial minority displayed an explicit willingness to discriminate against colleagues with the opposite ideology. Our findings are both surprising and important because a commitment to tolerance and equality is widespread in philosophy, and there is reason to think that ideological similarity, hostility, and discrimination undermine reliable belief formation in many areas of the discipline.

[Keywords: Ideological bias, diversity, demographics.]

“Statistics As Squid Ink: How Prominent Researchers Can Get Away With Misrepresenting Data”, Gelman & Guzey 2020

2020-gelman.pdf: “Statistics as Squid Ink: How Prominent Researchers Can Get Away with Misrepresenting Data”⁠, Andrew Gelman, Alexey Guzey (2020-04-16)

“On Attenuated Interactions, Measurement Error, and Statistical Power: Guidelines for Social and Personality Psychologists”, Blake & Gangestad 2020

2020-blake.pdf: “On Attenuated Interactions, Measurement Error, and Statistical Power: Guidelines for Social and Personality Psychologists”⁠, Khandis R. Blake, Steven Gangestad (2020-03-25; ; backlinks; similar):

The replication crisis has seen increased focus on best practice techniques to improve the reliability of scientific findings. What remains elusive to many researchers and is frequently misunderstood is that predictions involving interactions dramatically affect the calculation of statistical power. Using recent papers published in Personality and Social Psychology Bulletin (PSPB), we illustrate the pitfalls of improper power estimations in studies where attenuated interactions are predicted. Our investigation shows why even a programmatic series of 6 studies employing 2×2 designs, with samples exceeding n = 500, can be woefully underpowered to detect genuine effects. We also highlight the importance of accounting for error-prone measures when estimating effect sizes and calculating power, explaining why even positive results can mislead when power is low. We then provide five guidelines for researchers to avoid these pitfalls, including cautioning against the heuristic that a series of underpowered studies approximates the credibility of one well-powered study.

[Keywords: statistical power, effect size, fertility, ovulation, interaction effects]
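[The power penalty for attenuated interactions can be seen in a small simulation: when the effect is present at only one level of the moderator, the interaction contrast has twice the standard error of the simple effect, so even total N = 500 leaves the interaction test badly underpowered. The effect size and alpha below are illustrative assumptions, not values from the paper:

```python
import random
import statistics

random.seed(1)

def power_sim(effect, n_cell, sims=3000):
    """Monte Carlo power for an attenuated 2x2 design: the effect is
    present at one moderator level (cell 1 vs 2) and absent at the
    other (cell 3 vs 4). Returns (power of the simple-effect test,
    power of the interaction test), two-sided alpha = 0.05, normal
    approximation."""
    hits_simple = hits_inter = 0
    for _ in range(sims):
        cells = [[random.gauss(mu, 1) for _ in range(n_cell)]
                 for mu in (effect, 0.0, 0.0, 0.0)]
        means = [statistics.fmean(c) for c in cells]
        se2 = sum(statistics.variance(c) for c in cells) / (4 * n_cell)
        # Simple effect: cell 1 vs cell 2 (variance of difference = 2*se2).
        z_simple = (means[0] - means[1]) / (2 * se2) ** 0.5
        # Interaction: difference of differences (variance = 4*se2).
        z_inter = (means[0] - means[1] - means[2] + means[3]) / (4 * se2) ** 0.5
        hits_simple += abs(z_simple) > 1.96
        hits_inter += abs(z_inter) > 1.96
    return hits_simple / sims, hits_inter / sims

p_simple, p_inter = power_sim(effect=0.3, n_cell=125)  # total N = 500
print(p_simple, p_inter)  # interaction test is clearly less powerful
```

The gap widens further once the error-prone measures discussed above shrink the observed effect size.]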

“A Controlled Trial for Reproducibility: For Three Years, Part of DARPA Has Funded Two Teams for Each Project: One for Research and One for Reproducibility. The Investment Is Paying Off.”, Raphael et al 2020

“A controlled trial for reproducibility: For three years, part of DARPA has funded two teams for each project: one for research and one for reproducibility. The investment is paying off.”⁠, Marc P. Raphael, Paul E. Sheehan, Gary J. Vora (2020-03-10; backlinks; similar):

In 2016, the US Defense Advanced Research Projects Agency (DARPA) told eight research groups that their proposals had made it through the review gauntlet and would soon get a few million dollars from its Biological Technologies Office (BTO). Along with congratulations, the teams received a reminder that their award came with an unusual requirement—an independent shadow team of scientists tasked with reproducing their results. Thus began an intense, multi-year controlled trial in reproducibility. Each shadow team consists of three to five researchers, who visit the ‘performer’ team’s laboratory and often host visits themselves. Between 3% and 8% of the programme’s total funds go to this independent validation and verification (IV&V) work…Awardees were told from the outset that they would be paired with an IV&V team consisting of unbiased, third-party scientists hired by and accountable to DARPA. In this programme, we relied on US Department of Defense laboratories, with specific teams selected for their technical competence and ability to solve problems creatively.

…Results so far show a high degree of experimental reproducibility. The technologies investigated include using chemical triggers to control how cells migrate; introducing synthetic circuits that control other cell functions; intricate protein switches that can be programmed to respond to various cellular conditions; and timed bacterial expression that works even in the variable environment of the mammalian gut…getting to this point was more difficult than we expected. It demanded intense coordination, communication and attention to detail…Our effort needed capable research groups that could dedicate much more time (in one case, 20 months) and that could flexibly follow evolving research…A key component of the IV&V teams’ effort has been to spend a day or more working with the performer teams in their laboratories. Often, members of a performer laboratory travel to the IV&V laboratory as well. These interactions lead to a better grasp of methodology than reading a paper, frequently revealing person-to-person differences that can affect results…Still, our IV&V efforts have been derailed for weeks at a time for trivial reasons (see ‘Hard lessons’), such as a typo that meant an ingredient in cell media was off by an order of magnitude. We lost more than a year after discovering that commonly used biochemicals that were thought to be interchangeable are not.

Document Reagents:…We lost weeks of work and performed useless experiments when we assumed that identically named reagents (for example, polyethylene glycol or fetal bovine serum) from different vendors could be used interchangeably. · See It Live:…In our hands, washing cells too vigorously or using the wrong-size pipette tip changed results unpredictably. · State a range: …Knowing whether 21 °C means 20.5–21.5 °C or 20–22 °C can tell you whether cells will thrive or wither, and whether you’ll need to buy an incubator to make an experiment work. · Test, then ship: …Incorrect, outdated or otherwise diminished products were sent to the IV&V team for verification many times. · Double check: …A typo in one protocol cost us four weeks of failed experiments, and in general, vague descriptions of formulation protocols (for example, for expressing genes and making proteins without cells) caused months of delay and cost thousands of dollars in wasted reagents. · Pick a person: …The projects that lacked a dedicated and stable point of contact were the same ones that took the longest to reproduce. That is not coincidence. · Keep in silico analysis up to date: …Teams had to visit each others’ labs more than once to understand and fully implement computational-analysis pipelines for large microscopy data sets.

…We have learnt to note the flow rates used when washing cells from culture dishes, to optimize salt concentration in each batch of medium and to describe temperature and other conditions with a range rather than a single number. This last practice came about after we realized that diminished slime-mould viability in our Washington DC facility was due to lab temperatures that could fluctuate by 2 °C on warm summer days, versus the more tightly controlled temperature of the performer lab in Baltimore 63 kilometres away. Such observations can be written up in a protocol paper…As one of our scientists said, “IV&V forces performers to think more critically about what qualifies as a successful system, and facilitates candid discussion about system performance and limitations.”

“Foreign Language Learning in Older Age Does Not Improve Memory or Intelligence: Evidence from a Randomized Controlled Study”, Berggren et al 2020

2020-berggren.pdf: “Foreign language learning in older age does not improve memory or intelligence: Evidence from a randomized controlled study”⁠, Rasmus Berggren, Jonna Nilsson, Yvonne Brehmer, Florian Schmiedek, Martin Lövdén (2020-03-01; ; backlinks; similar):

Foreign language learning in older age has been proposed as a promising avenue for combatting age-related cognitive decline. We tested this hypothesis in a randomized controlled study in a sample of 160 healthy older participants (aged 65–75 years) who were randomized to 11 weeks of either language learning or relaxation training. Participants in the language learning condition obtained some basic knowledge in the new language (Italian), but between-groups differences in improvements on latent factors of verbal intelligence, spatial intelligence, working memory, item memory, or associative memory were negligible. We argue that this is not due to either poor measurement, low course intensity, or low statistical power, but that basic studies in foreign languages in older age are likely to have no or trivially small effects on cognitive abilities. We place this in the context of the cognitive training and engagement literature and conclude that while foreign language learning may expand the behavioral repertoire, it does little to improve cognitive processing abilities.

“The Stewart Retractions: A Quantitative and Qualitative Analysis”, Pickett 2020

“The Stewart Retractions: A Quantitative and Qualitative Analysis”⁠, Justin T. Pickett (2020-03; ; similar):

Sociology has recently experienced its first large-scale retraction event. Dr. Eric Stewart and his coauthors have retracted five articles from three journals, Social Problems, Criminology, and Law & Society Review. I coauthored one of the retracted articles. The retraction notices are uninformative, stating only that the authors uncovered an unacceptable number of errors in each article. Misinformation about the event abounds. Some of the authors have continued to insist in print that the retracted findings are correct. I analyze both quantitative and qualitative data about what happened, in the articles, among the coauthors, and at the journals. The findings suggest that the five articles were likely fraudulent, several coauthors acted with negligence bordering on complicity after learning about the data irregularities, and the editors violated the ethical standards advanced by the Committee on Publication Ethics (COPE). Suggested reforms include requiring data verification by coauthors and editorial adherence to COPE standards.

[Keywords: open science, reproducibility, peer review, research misconduct, scientific fraud]

“How David Rosenhan’s Fraudulent Thud Experiment Set Back Psychiatry for Decades: In the 1970s, a Social Psychologist Published ‘findings’ Deeply Critical of American Psychiatric Methods. The Problem Was They Were Almost Entirely Fictional”, Scull 2020

“How David Rosenhan’s fraudulent Thud experiment set back psychiatry for decades: In the 1970s, a social psychologist published ‘findings’ deeply critical of American psychiatric methods. The problem was they were almost entirely fictional”⁠, Andrew Scull (2020-01-25; backlinks; similar):

As her work proceeded, her doubts about Rosenhan’s work grew. At one point, I suggested that she write to Science and request copies of the peer review of the paper. What had the reviewers seen and requested? Did they know the identities of the anonymous pseudo-patients and the institutions to which they had been consigned? What checks had they made on the validity of Rosenhan’s claims? Had they, for example, asked to see the raw field notes? The editorial office told her that the peer review was confidential and they couldn’t share it. I wondered whether an approach from an academic rather than a journalist might be more successful, and with Cahalan’s permission, I sought the records myself, pointing out the important issues at stake, and noting that it would be perfectly acceptable for the names of the expert referees to be redacted. This time the excuse was different: the journal had moved offices, and the peer reviews no longer existed. That’s plausible, but it is distinctly odd that such different explanations should be offered.

…Of course, proving a negative, especially after decades have passed, is nigh on impossible. Perhaps the appearance of The Great Pretender will cause one or more of the missing pseudo-patients to surface, or for their descendants to speak up and reveal their identities, for surely anyone who participated in such a famous study could not fail to mention it to someone. More likely, I think, is that these people are fictitious, invented by someone who Cahalan’s researches suggest was fully capable of such deception. (Indeed, the distinguished psychologist Eleanor Maccoby, who was in charge of assessing Rosenhan’s tenure file, reported that she and others were deeply suspicious of him, and that they found it ‘impossible to know what he had really done, or if he had done it’, granting him tenure only because of his popularity as a teacher.)

…Most damning of all, though, are Rosenhan’s own medical records. When he was admitted to the hospital, it was not because he simply claimed to be hearing voices but was otherwise ‘normal’. On the contrary, he told his psychiatrist his auditory hallucinations included the interception of radio signals and listening in to other people’s thoughts. He had tried to keep these out by putting copper over his ears, and sought admission to the hospital because it was ‘better insulated there’. For months, he reported he had been unable to work or sleep, financial difficulties had mounted and he had contemplated suicide. His speech was retarded, he grimaced and twitched, and told several staff that the world would be better off without him. No wonder he was admitted.

Perhaps out of sympathy for Rosenhan’s son and his closest friends, who had granted access to all this damning material and with whom she became close, I think Cahalan pulls her punches a bit when she brings her book to a conclusion. But the evidence she provides makes an overwhelming case: Rosenhan pulled off one of the greatest scientific frauds of the past 75 years, and it was a fraud whose real-world consequences still resonate today. Exposing what he got up to is a quite exceptional accomplishment, and Cahalan recounts the story vividly and with great skill.

“Compliance With Legal Requirement to Report Clinical Trial Results on a Cohort Study”, DeVito et al 2020

2020-devito.pdf: “Compliance with legal requirement to report clinical trial results on a cohort study”⁠, Nicholas J. DeVito, Seb Bacon, Ben Goldacre (2020-01-17; backlinks; similar):

Background: Failure to report the results of a clinical trial can distort the evidence base for clinical practice, breaches researchers’ ethical obligations to participants, and represents an important source of research waste. The Food and Drug Administration Amendments Act (FDAAA) of 2007 now requires sponsors of applicable trials to report their results directly onto ClinicalTrials.gov within 1 year of completion. The first trials covered by the Final Rule of this act became due to report results in January, 2018. In this cohort study, we set out to assess compliance.

Methods: We downloaded data for all registered trials on ClinicalTrials.gov each month from March, 2018, to September, 2019. All cross-sectional analyses in this manuscript were performed on data extracted from ClinicalTrials.gov on Sept 16, 2019; monthly trends analysis used archived data closest to the 15th day of each month from March, 2018, to September, 2019. Our study cohort included all applicable trials due to report results under FDAAA. We excluded all non-applicable trials, those not yet due to report, and those given a certificate allowing for delayed reporting. A trial was considered reported if results had been submitted and were either publicly available, or undergoing quality control review at ClinicalTrials.gov. A trial was considered compliant if these results were submitted within 1 year of the primary completion date, as required by the legislation. We described compliance with the FDAAA 2007 Final Rule, assessed trial characteristics associated with results reporting using logistic regression models, described sponsor-level reporting, examined trends in reporting, and described time-to-report using the Kaplan-Meier method.

Findings: 4209 trials were due to report results; 1722 (40·9%; 95% CI 39·4–42·2) did so within the 1-year deadline. 2686 (63·8%; 62·4–65·3) trials had results submitted at any time. Compliance has not improved since July, 2018. Industry sponsors were statistically-significantly more likely to be compliant than non-industry, non-US Government sponsors (odds ratio [OR] 3·08 [95% CI 2·52–3·77]), and sponsors running large numbers of trials were statistically-significantly more likely to be compliant than smaller sponsors (OR 11·84 [9·36–14·99]). The median delay from primary completion date to submission date was 424 days (95% CI 412–435), 59 days higher than the legal reporting requirement of 1 year.
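The headline compliance figure is a simple binomial proportion. As a minimal sketch, the point estimate and a normal-approximation 95% interval can be recomputed from the reported counts (the paper’s exact interval method may differ slightly at the margins):

```python
import math

def prop_ci(k, n, z=1.96):
    """Point estimate and normal-approximation 95% CI for a proportion."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# 1,722 of the 4,209 due trials reported within the 1-year deadline.
p, lo, hi = prop_ci(1722, 4209)
print(f"compliance: {100*p:.1f}% (95% CI {100*lo:.1f}-{100*hi:.1f})")
```

This reproduces the 40.9% point estimate; the interval agrees with the reported one to within rounding of the interval method used.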

Interpretation: Compliance with the FDAAA 2007 is poor, and not improving. To our knowledge, this is the first study to fully assess compliance with the Final Rule of the FDAAA 2007. Poor compliance is likely to reflect lack of enforcement by regulators. Effective enforcement and action from sponsors is needed; until then, open public audit of compliance for each individual sponsor may help. We will maintain updated compliance data for each individual sponsor and trial at fdaaa.trialstracker.net.

Funding: Laura and John Arnold Foundation.

“Cognitive and Academic Benefits of Music Training With Children: A Multilevel Meta-analysis”, Sala & Gobet 2020

“Cognitive and academic benefits of music training with children: A multilevel meta-analysis”⁠, Giovanni Sala, Fernand Gobet (2020-01-14; ; backlinks; similar):

Music training has repeatedly been claimed to positively impact children’s cognitive skills and academic achievement. This claim relies on the assumption that engaging in intellectually demanding activities fosters particular domain-general cognitive skills, or even general intelligence. The present meta-analytic review (n = 6,984, k = 254, m = 54) shows that this belief is incorrect. Once the quality of study design is controlled for, the overall effect of music training programs is null (g ≈ 0) and highly consistent across studies (τ2 ≈ 0). Small statistically-significant overall effects are obtained only in those studies implementing no random allocation of participants and employing non-active controls (g ≈ 0.200, p < 0.001). Interestingly, music training is ineffective regardless of the type of outcome measure (eg. verbal, non-verbal, speed-related, etc.). Furthermore, we note that, beyond meta-analysis of experimental studies, a considerable amount of cross-sectional evidence indicates that engagement in music has no impact on people’s non-music cognitive skills or academic achievement. We conclude that researchers’ optimism about the benefits of music training is empirically unjustified and stems from misinterpretation of the empirical data and, possibly, confirmation bias. Given the clarity of the results, the large number of participants involved, and the numerous studies carried out so far, we conclude that this line of research should be dismissed.

“Implications of Ideological Bias in Social Psychology on Clinical Practice”, Silander et al 2020

2020-silander.pdf: “Implications of ideological bias in social psychology on clinical practice”⁠, Nina C. Silander, Bela Geczy Jr., Olivia Marks, Robert D. Mather (2020-01-14; backlinks; similar):

Ideological bias is a worsening but often neglected concern for social and psychological sciences, affecting a range of professional activities and relationships, from self-reported willingness to discriminate to the promotion of ideologically saturated and scientifically questionable research constructs. Though clinical psychologists co-produce and apply social psychological research, little is known about its impact on the profession of clinical psychology.

Following a brief review of relevant topics, such as “concept creep” and the importance of the psychotherapeutic relationship, we present the relevance of ideological bias to clinical psychology, counterarguments and a rebuttal, clinical applications, and potential solutions. For providing empathic and multiculturally competent clinical services, in accordance with professional ethics, psychologists would benefit from treating ideological diversity as another professionally recognized diversity area.

[See also “Political Diversity Will Improve Social Psychological Science”⁠, Duarte et al 2015.]

“FDA and NIH Let Clinical Trial Sponsors Keep Results Secret and Break the Law”, Piller 2020

“FDA and NIH let clinical trial sponsors keep results secret and break the law”⁠, Charles Piller (2020-01-13; backlinks; similar):

The rule took full effect 2 years ago, on 2018-01-18, giving trial sponsors ample time to comply. But a Science investigation shows that many still ignore the requirement, while federal officials do little or nothing to enforce the law.

Science examined more than 4700 trials whose results should have been posted on the NIH website under the 2017 rule. Reporting rates by most large pharmaceutical companies and some universities have improved sharply, but performance by many other trial sponsors—including, ironically, NIH itself—was lackluster. Those sponsors, typically either the institution conducting a trial or its funder, must deposit results and other data within 1 year of completing a trial. But of 184 sponsor organizations with at least five trials due as of 2019-09-25, 30 companies, universities, or medical centers never met a single deadline. As of that date, those habitual violators had failed to report any results for 67% of their trials and averaged 268 days late for those and all trials that missed their deadlines. They included such eminent institutions as the Harvard University-affiliated Boston Children’s Hospital, the University of Minnesota, and Baylor College of Medicine—all among the top 50 recipients of NIH grants in 2019. The violations cover trials in virtually all fields of medicine, and the missing or late results offer potentially vital information for the most desperate patients. For example, in one long-overdue trial, researchers compared the efficacy of different chemotherapy regimens in 200 patients with advanced lymphoma; another—nearly 2 years late—tests immunotherapy against conventional chemotherapy in about 600 people with late-stage lung cancer.

…Contacted for comment, none of the institutions disputed the findings of this investigation. In all 4768 trials Science checked, sponsors violated the reporting law more than 55% of the time. And in hundreds of cases where the sponsors got credit for reporting trial results, they have yet to be publicly posted because of quality lapses flagged by staff (see sidebar).

Although the 2017 rule, and officials’ statements at the time, promised aggressive enforcement and stiff penalties, neither NIH nor FDA has cracked down. FDA now says it won’t brandish its big stick—penalties of up to $12,103 a day for failing to report a trial’s results—until after the agency issues further “guidance” on how it will exercise that power. It has not set a date. NIH said at a 2016 briefing on the final rule that it would cut off grants to those who ignore the trial reporting requirements, as authorized in the 2007 law, but so far has not done so…NIH and FDA officials do not seem inclined to apply that pressure. Lyric Jorgenson, NIH deputy director for science policy, says her agency has been “trying to change the culture of how clinical trial results are reported and disseminated; not so much on the ‘aha, we caught you’, as much as getting people to understand the value, and making it as easy as possible to share and disseminate results.” To that end, she says, staff have educated researchers about the website and improved its usability. As for FDA, Patrick McNeilly, an official at the agency who handles trial enforcement matters, recently told an industry conference session on ClinicalTrials.gov that “FDA has limited resources, and we encourage voluntary compliance.” He said the agency also reviews reporting of information on ClinicalTrials.gov as part of inspections of trial sites, or when it receives complaints. McNeilly declined an interview request, but at the conference he discounted violations of reporting requirements found by journalists and watchdog groups. “We’re not going to blanketly accept an entire list of trials that people say are noncompliant”, he said.

…It also highlights that pharma’s record has been markedly better than that of academia and the federal government.

…But such good performance shouldn’t be an exception, Harvard’s Zarin says. “Further public accountability of the trialists, but also our government organizations, has to happen. One possibility is that FDA and NIH will be shamed into enforcing the law. Another possibility is that sponsors will be shamed into doing a better job. A third possibility is that ClinicalTrials.gov will never fully achieve its vital aspirations.”

“Do Police Killings of Unarmed Persons Really Have Spillover Effects? Reanalyzing Bor Et Al (2018)”, Nix & Lozada 2019

“Do police killings of unarmed persons really have spillover effects? Reanalyzing Bor et al (2018)”⁠, Justin Nix, M. James Lozada (2019-12-30; ; backlinks; similar):

We reevaluate the claim from Bor et al (2018: 302) that “police killings of unarmed black Americans have effects on mental health among black American adults in the general population.”

The Mapping Police Violence data used by the authors includes 91 incidents involving black decedents who were either (1) not killed by police officers in the line of duty or (2) armed when killed. These incidents should have been removed or recoded prior to analysis.

Correctly recoding these incidents decreased in magnitude all of the reported coefficients, and, more importantly, eliminated the reported statistically-significant effect of exposure to police killings of unarmed black individuals on the mental health of black Americans in the general population.

We caution researchers to carefully vet crowdsourced data that track police behaviors, and warn against reducing these complex incidents to overly simplistic armed/​unarmed dichotomies.

“Catching Cheating Students”, Lin & Levitt 2019

2019-lin.pdf: “Catching Cheating Students”⁠, Ming-Jen Lin, Steven D. Levitt (2019-12-29; similar):

We develop a simple algorithm for detecting exam cheating between students who copy off one another’s exams.

When this algorithm is applied to exams in a general science course at a top university, we find strong evidence of cheating by at least 10% of the students. Students studying together cannot explain our findings. Matching incorrect answers proves to be a stronger indicator of cheating than matching correct answers.

When seating locations are randomly assigned, and monitoring is increased, cheating virtually disappears.
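The paper’s key signal—pairs of students sharing the same *incorrect* answers far more often than chance—can be sketched in a few lines. (The data, names, and scoring below are illustrative, not Lin & Levitt’s actual algorithm, which also models the expected match rate under independent answering.)

```python
from itertools import combinations

def matching_incorrect(answers, key):
    """For each pair of students, count questions where both gave the
    SAME incorrect answer: a stronger cheating signal than matching
    correct answers, which strong students share naturally."""
    scores = {}
    for (s1, a1), (s2, a2) in combinations(answers.items(), 2):
        shared_wrong = sum(
            1 for q, correct in enumerate(key)
            if a1[q] == a2[q] != correct
        )
        scores[(s1, s2)] = shared_wrong
    return scores

# Toy exam: 6 questions, 3 students (hypothetical data).
key = ["A", "B", "C", "D", "A", "B"]
answers = {
    "alice": ["A", "B", "C", "D", "A", "B"],  # all correct
    "bob":   ["A", "C", "C", "B", "A", "D"],  # 3 idiosyncratic errors
    "carol": ["A", "C", "C", "B", "A", "D"],  # identical errors to bob
}
pairs = matching_incorrect(answers, key)
print(pairs[("bob", "carol")])  # → 3 shared wrong answers
print(pairs[("alice", "bob")])  # → 0
```

In practice one would compare each pair’s count against its expectation given the two students’ ability levels and the popularity of each wrong option, and flag outlying pairs—which is why randomized seating (breaking the correlation between friendship and proximity) makes the signal vanish.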

“Comparing Meta-analyses and Preregistered Multiple-laboratory Replication Projects”, Kvarven et al 2019

2019-kvarven.pdf: “Comparing meta-analyses and preregistered multiple-laboratory replication projects”⁠, Amanda Kvarven, Eirik Strømland, Magnus Johannesson (2019-12-23; backlinks; similar):

Many researchers rely on meta-analysis to summarize research evidence. However, there is a concern that publication bias and selective reporting may lead to biased meta-analytic effect sizes. We compare the results of meta-analyses to large-scale preregistered replications in psychology carried out at multiple laboratories. The multiple-laboratory replications provide precisely estimated effect sizes that do not suffer from publication bias or selective reporting. We searched the literature and identified 15 meta-analyses on the same topics as multiple-laboratory replications. We find that meta-analytic effect sizes are statistically-significantly different from replication effect sizes for 12 out of the 15 meta-replication pairs. These differences are systematic and, on average, meta-analytic effect sizes are almost 3× as large as replication effect sizes. We also implement 3 methods of correcting meta-analysis for bias, but these methods do not substantively improve the meta-analytic results.

“Individual Differences in Behaviour Explain Variation in Survival: a Meta-analysis”, Moiron et al 2019

“Individual differences in behaviour explain variation in survival: a meta-analysis”⁠, Maria Moiron, Kate L. Laskowski, Petri T. Niemelä (2019-12-06; ⁠, ; backlinks; similar):

Research focusing on among-individual differences in behaviour (‘animal personality’) has been blooming for over a decade. Central theories explaining the maintenance of such behavioural variation posit that individuals expressing greater “risky” behaviours should suffer higher mortality. Here, for the first time, we synthesize the existing empirical evidence for this key prediction. Our results did not support this prediction as there was no directional relationship between riskier behaviour and greater mortality; however there was a statistically-significant absolute relationship between behaviour and survival. In total, behaviour explained a statistically-significant, but small, portion (5.8%) of the variance in survival. We also found that risky (vs. “shy”) behavioural types live statistically-significantly longer in the wild, but not in the laboratory. This suggests that individuals expressing risky behaviours might be of overall higher quality but the lack of predation pressure and resource restrictions mask this effect in laboratory environments. Our work demonstrates that individual differences in behaviour explain important differences in survival but not in the direction predicted by theory. Importantly, this suggests that models predicting behaviour to be a mediator of reproduction-survival trade-offs may need revision and/​or empiricists may need to reconsider their proxies of risky behaviours when testing such theory.

“Many Labs 2: Investigating Variation in Replicability Across Sample and Setting”, Klein et al 2019

“Many Labs 2: Investigating Variation in Replicability Across Sample and Setting”⁠, Richard Klein, Michelangelo Vianello, Fred Hasselman, Byron Adams, Reginald B. Adams, Jr., Sinan Alper et al (2019-11-19; backlinks; similar):

We conducted preregistered replications of 28 classic and contemporary published findings with protocols that were peer reviewed in advance to examine variation in effect magnitudes across sample and setting. Each protocol was administered to ~half of 125 samples and 15,305 total participants from 36 countries and territories. Using conventional statistical-significance (p < 0.05), fifteen (54%) of the replications provided evidence in the same direction and statistically-significant as the original finding. With a strict statistical-significance criterion (p < 0.0001), fourteen (50%) provide such evidence reflecting the extremely high-powered design. Seven (25%) of the replications had effect sizes larger than the original finding and 21 (75%) had effect sizes smaller than the original finding. The median comparable Cohen’s d effect sizes for original findings was 0.60 and for replications was 0.15. Sixteen replications (57%) had small effect sizes (< 0.20) and 9 (32%) were in the opposite direction from the original finding. Across settings, 11 (39%) showed statistically-significant heterogeneity using the Q statistic and most of those were among the findings eliciting the largest overall effect sizes; only one effect that was near zero in the aggregate showed statistically-significant heterogeneity. Only one effect showed a Tau > 0.20 indicating moderate heterogeneity. Nine others had a Tau near or slightly above 0.10 indicating slight heterogeneity. In moderation tests, very little heterogeneity was attributable to task order, administration in lab versus online, and exploratory WEIRD versus less WEIRD culture comparisons. Cumulatively, variability in observed effect sizes was more attributable to the effect being studied than the sample or setting in which it was studied.
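The Q and Tau statistics reported here are the standard Cochran’s Q test and DerSimonian–Laird τ² for between-site heterogeneity. A minimal sketch with invented per-site data (not Many Labs 2’s numbers):

```python
def dl_heterogeneity(effects, variances):
    """Cochran's Q and DerSimonian-Laird tau^2 from per-site effect
    sizes and their sampling variances."""
    w = [1 / v for v in variances]                       # inverse-variance weights
    d_bar = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    Q = sum(wi * (di - d_bar) ** 2 for wi, di in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (Q - df) / c)                        # truncate at zero
    return Q, tau2

# Toy data: Cohen's d and its variance from 5 replication sites.
effects   = [0.10, 0.18, 0.12, 0.25, 0.15]
variances = [0.004, 0.005, 0.004, 0.006, 0.005]
Q, tau2 = dl_heterogeneity(effects, variances)
print(f"Q = {Q:.2f}, tau = {tau2 ** 0.5:.3f}")
```

With Q below its degrees of freedom, τ² truncates to zero—the same qualitative picture as most Many Labs 2 effects, where site-to-site variation was smaller than sampling noise would suggest.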

“Matthew Walker’s Why We Sleep Is Riddled With Scientific and Factual Errors”, Guzey 2019

“Matthew Walker’s Why We Sleep Is Riddled with Scientific and Factual Errors”⁠, Alexey Guzey (2019-11-15; backlinks; similar):

…In the process of reading the book and encountering some extraordinary claims about sleep, I decided to compare the facts it presented with the scientific literature. I found that the book consistently overstates the problem of lack of sleep, sometimes egregiously so. It misrepresents basic sleep research and contradicts its own sources.

In one instance, Walker claims that sleeping less than six or seven hours a night doubles one’s risk of cancer—this is not supported by the scientific evidence (Section 1.1). In another instance, Walker seems to have invented a “fact” that the WHO has declared a sleep loss epidemic (Section 4). In yet another instance, he falsely claims that the National Sleep Foundation recommends 8 hours of sleep per night, and then uses this “fact” to falsely claim that two-thirds of people in developed nations sleep less than the “the recommended eight hours of nightly sleep” (Section 5).

Walker’s book has likely wasted thousands of hours of life and worsened the health of people who read it and took its recommendations at face value (Section 7).

“Stanford Professor Who Changed America With Just One Study Was Also a Liar”, Cahalan 2019

“Stanford professor who changed America with just one study was also a liar”⁠, Susannah Cahalan (2019-11-02; ⁠, ⁠, ; backlinks; similar):

[Summary of investigation into David Rosenhan: like the Robbers Cave or Stanford Prison Experiment, his famous fake-insane patients experiment cannot be verified and many troubling anomalies have come to light. Cahalan is unable to find almost all of the supposed participants, Rosenhan hid his own participation & his own medical records show he fabricated details of his case, he threw out participant data that didn’t match his narrative, reported numbers are inconsistent, Rosenhan abandoned a lucrative book deal about it and avoided further psychiatric research, and showed some character traits of a fabulist eager to please.]

“On the Troubling Trail of Psychiatry’s Pseudopatients Stunt: Susannah Cahalan’s Investigation of the Social-psychology Experiment That Saw Healthy People Sent to Mental Hospitals Finds Inconsistencies”, Abbott 2019

“On the troubling trail of psychiatry’s pseudopatients stunt: Susannah Cahalan’s investigation of the social-psychology experiment that saw healthy people sent to mental hospitals finds inconsistencies”⁠, Alison Abbott (2019-10-29; backlinks; similar):

Although Rosenhan died in 2012, Cahalan easily tracked down his archives, held by social psychologist Lee Ross, his friend and colleague at Stanford. They included the first 200 pages of Rosenhan’s unfinished draft of a book about the experiment…Ross warned her that Rosenhan had been secretive. As her attempts to identify the pseudonymous pseudopatients hit one dead end after the other, she realized Ross’s prescience.

The archives did allow Cahalan to piece together the beginnings of the experiment in 1969, when Rosenhan was teaching psychology at Swarthmore College in Pennsylvania…Rosenhan cautiously decided to check things out for himself first. He emerged humbled from nine traumatizing days in a locked ward, and abandoned the idea of putting students through the experience.

…According to Rosenhan’s draft, it was at a conference dinner that he met his first recruits: a recently retired psychiatrist and his psychologist wife. The psychiatrist’s sister also signed up. But the draft didn’t explain how, when and why subsequent recruits signed up. Cahalan interviewed numerous people who had known Rosenhan personally or indirectly. She also chased down the medical records of individuals whom she suspected could have been involved in the experiment, and spoke with their families and friends. But her sleuthing brought her to only one participant, a former Stanford graduate student called Bill Underwood.

…Underwood and his wife were happy to talk, but two of their comments jarred. Rosenhan’s draft described how he prepared his volunteers very carefully, over weeks. Underwood, however, remembered only brief guidance on how to avoid swallowing medication by hiding pills in his cheek. His wife recalled Rosenhan telling her that he had prepared writs of habeas corpus for each pseudopatient, in case an institution would not discharge them. But Cahalan had already worked out that that wasn’t so.

Comparing the Science report with documents in Rosenhan’s archives, she also noted many mismatches in numbers. For instance, Rosenhan’s draft, and the Science paper, stated that Underwood had spent seven days in a hospital with 8,000 patients, whereas he spent eight days in a hospital with 1,500 patients.

When all of the leads from her contacts ran to ground, she published a commentary in The Lancet Psychiatry asking for help in finding them—to no avail. Had Rosenhan invented them, she found herself asking?

“Effect of Lower Versus Higher Red Meat Intake on Cardiometabolic and Cancer Outcomes: A Systematic Review of Randomized Trials”, Zeraatkar et al 2019

2019-zeraatkar.pdf: “Effect of Lower Versus Higher Red Meat Intake on Cardiometabolic and Cancer Outcomes: A Systematic Review of Randomized Trials”⁠, Dena Zeraatkar, Bradley C. Johnston, Jessica Bartoszko, Kevin Cheung, Malgorzata M. Bala, Claudia Valli et al (2019-10-01; ; backlinks; similar):

Background: Few randomized trials have evaluated the effect of reducing red meat intake on clinically important outcomes.

Purpose: To summarize the effect of lower versus higher red meat intake on the incidence of cardiometabolic and cancer outcomes in adults.

Data Sources: Embase⁠, CENTRAL, CINAHL, Web of Science⁠, and ProQuest from inception to July 2018 and MEDLINE from inception to April 2019, without language restrictions.

Study Selection: Randomized trials (published in any language) comparing diets lower in red meat with diets higher in red meat that differed by a gradient of at least 1 serving per week for 6 months or more.

Data Extraction: Teams of 2 reviewers independently extracted data and assessed the risk of bias and the certainty of the evidence.

Data Synthesis: Of 12 eligible trials, a single trial enrolling 48 835 women provided the most credible, though still low-certainty, evidence that diets lower in red meat may have little or no effect on all-cause mortality (hazard ratio [HR], 0.99 [95% CI, 0.95 to 1.03]), cardiovascular mortality (HR, 0.98 [CI, 0.91 to 1.06]), and cardiovascular disease (HR, 0.99 [CI, 0.94 to 1.05]). That trial also provided low-certainty to very-low-certainty evidence that diets lower in red meat may have little or no effect on total cancer mortality (HR, 0.95 [CI, 0.89 to 1.01]) and the incidence of cancer, including colorectal cancer (HR, 1.04 [CI, 0.90 to 1.20]) and breast cancer (HR, 0.97 [0.90 to 1.04]).

Limitations: There were few trials, most addressing only surrogate outcomes, with heterogeneous comparators and small gradients in red meat consumption between lower versus higher intake groups.

Conclusion: Low-certainty to very-low-certainty evidence suggests that diets restricted in red meat may have little or no effect on major cardiometabolic outcomes and cancer mortality and incidence.

“Anthropology's Science Wars: Insights from a New Survey”, Horowitz et al 2019

2019-horowitz.pdf: “Anthropology's Science Wars: Insights from a New Survey”⁠, Mark Horowitz, William Yaworsky, Kenneth Kickham (2019-10; ; backlinks; similar):

In recent decades the field of anthropology has been characterized as sharply divided between pro-science and anti-science factions. The aim of this study is to empirically evaluate that characterization. We survey anthropologists in graduate programs in the United States regarding their views of science and advocacy, moral and epistemic relativism, and the merits of evolutionary biological explanations. We examine anthropologists’ views in concert with their varying appraisals of major controversies in the discipline (Chagnon /  ​Tierney⁠, Mead /  ​Freeman⁠, and Menchú /  ​Stoll). We find that disciplinary specialization and especially gender and political orientation are statistically-significant predictors of anthropologists’ views. We interpret our findings through the lens of an intuitionist social psychology that helps explain the dynamics of such controversies as well as ongoing ideological divisions in the field.

“A Meta-Analysis of Procedures to Change Implicit Measures”, Forscher et al 2019

2019-forscher.pdf: “A Meta-Analysis of Procedures to Change Implicit Measures”⁠, Patrick Forscher, Calvin Lai, Jordan Axt, Charles Ebersole, Michelle Herman, Patricia Devine, Brian Nosek et al (2019-08-19; ; backlinks; similar):

Using a novel technique known as network meta-analysis, we synthesized evidence from 492 studies (87,418 participants) to investigate the effectiveness of procedures in changing implicit measures, which we define as response biases on implicit tasks. We also evaluated these procedures’ effects on explicit and behavioral measures. We found that implicit measures can be changed, but effects are often relatively weak (|ds| < .30). Most studies focused on producing short-term changes with brief, single-session manipulations. Procedures that associate sets of concepts, invoke goals or motivations, or tax mental resources changed implicit measures the most, whereas procedures that induced threat, affirmation, or specific moods/​emotions changed implicit measures the least. Bias tests suggested that implicit effects could be inflated relative to their true population values. Procedures changed explicit measures less consistently and to a smaller degree than implicit measures and generally produced trivial changes in behavior. Finally, changes in implicit measures did not mediate changes in explicit measures or behavior. Our findings suggest that changes in implicit measures are possible, but those changes do not necessarily translate into changes in explicit measures or behavior.

“Does Mouse Utopia Exist?”, Branwen 2019

Mouse-Utopia: “Does Mouse Utopia Exist?”⁠, Gwern Branwen (2019-08-12; ⁠, ⁠, ⁠, ; backlinks; similar):

Did John Calhoun’s 1960s Mouse Utopia really show that animal (and human) populations will expand to arbitrary densities, creating socially-driven pathology and collapse? I give reasons for doubt about its replicability, interpretation, and meaningfulness.

One of the most famous experiments in psychology & sociology was John Calhoun’s Mouse Utopia experiments in the 1960s–1970s. In the usual telling, Mouse Utopia created ideal mouse environments in which the mouse population was permitted to increase as much as possible; however, the overcrowding inevitably resulted in extreme levels of physical & social dysfunctionality, and eventually population collapse & even extinction. Looking more closely into it, there are reasons to doubt the replicability of the growth & pathological behavior & collapse of this utopia (“no-place”), and if it does happen, whether it is driven by the social pressures as claimed by Calhoun or by other causal mechanisms at least as consistent with the evidence like disease or mutational meltdown.

“A National Experiment Reveals Where a Growth Mindset Improves Achievement”, Yeager et al 2019

“A national experiment reveals where a growth mindset improves achievement”⁠, David S. Yeager, Paul Hanselman, Gregory M. Walton, Jared S. Murray, Robert Crosnoe, Chandra Muller, Elizabeth Tipton et al (2019-08-07; ; backlinks; similar):

A global priority for the behavioural sciences is to develop cost-effective, scalable interventions that could improve the academic outcomes of adolescents at a population level, but no such interventions have so far been evaluated in a population-generalizable sample. Here we show that a short (less than one hour), online growth mindset intervention—which teaches that intellectual abilities can be developed—improved grades among lower-achieving students and increased overall enrolment to advanced mathematics courses in a nationally representative sample of students in secondary education in the United States. Notably, the study identified school contexts that sustained the effects of the growth mindset intervention: the intervention changed grades when peer norms aligned with the messages of the intervention. Confidence in the conclusions of this study comes from independent data collection and processing, pre-registration of analyses, and corroboration of results by a blinded Bayesian analysis.

“Debunking the Stanford Prison Experiment”, Le Texier 2019

2019-letexier.pdf: “Debunking the Stanford Prison Experiment”⁠, Thibault Le Texier (2019-08-05; ; backlinks; similar):

The Stanford Prison Experiment (SPE) is one of psychology’s most famous studies. It has been criticized on many grounds, and yet a majority of textbook authors have ignored these criticisms in their discussions of the SPE, thereby misleading both students and the general public about the study’s questionable scientific validity.

Data collected from a thorough investigation of the SPE archives and interviews with 15 of the participants in the experiment further question the study’s scientific merit. These data are not only supportive of previous criticisms of the SPE, such as the presence of demand characteristics⁠, but provide new criticisms of the SPE based on heretofore unknown information. These new criticisms include the biased and incomplete collection of data, the extent to which the SPE drew on a prison experiment devised and conducted by students in one of Zimbardo’s classes 3 months earlier, the fact that the guards received precise instructions regarding the treatment of the prisoners, the fact that the guards were not told they were subjects, and the fact that participants were almost never completely immersed in the situation.

Possible explanations of the inaccurate textbook portrayal and general misperception of the SPE’s scientific validity over the past 5 decades, in spite of its flaws and shortcomings, are discussed.

[Keywords: Stanford Prison Experiment, Zimbardo, epistemology]

“The Maddening Saga of How an Alzheimer’s ‘cabal’ Thwarted Progress toward a Cure for Decades”, Begley 2019

“The maddening saga of how an Alzheimer’s ‘cabal’ thwarted progress toward a cure for decades”⁠, Sharon Begley (2019-06-25; ; backlinks; similar):

In the 30 years that biomedical researchers have worked determinedly to find a cure for Alzheimer’s disease, their counterparts have developed drugs that helped cut deaths from cardiovascular disease by more than half, and cancer drugs able to eliminate tumors that had been incurable. But for Alzheimer’s, not only is there no cure, there is not even a disease-slowing treatment.

…In more than two dozen interviews, scientists whose ideas fell outside the dogma recounted how, for decades, believers in the dominant hypothesis suppressed research on alternative ideas: They influenced what studies got published in top journals, which scientists got funded, who got tenure, and who got speaking slots at reputation-buffing scientific conferences. The scientists described the frustrating, even career-ending, obstacles that they confronted in pursuing their research. A top journal told one that it would not publish her paper because others hadn’t. Another got whispered advice to at least pretend that the research for which she was seeking funding was related to the leading idea—that a protein fragment called beta-amyloid accumulates in the brain, creating neuron-killing clumps that are both the cause of Alzheimer’s and the key to treating it. Others could not get speaking slots at important meetings, a key showcase for research results. Several who tried to start companies to develop Alzheimer’s cures were told again and again by venture capital firms and major biopharma companies that they would back only an amyloid approach.

…For all her regrets about the amyloid hegemony, Neve is an unlikely critic: She co-led the 1987 discovery of mutations in a gene called APP that increases amyloid levels and causes Alzheimer’s in middle age, supporting the then-emerging orthodoxy. Yet she believes that one reason Alzheimer’s remains incurable and untreatable is that the amyloid camp “dominated the field”, she said. Its followers were influential “to the extent that they persuaded the National Institute of Neurological Disorders and Stroke [part of the National Institutes of Health] that it was a waste of money to fund any Alzheimer’s-related grants that didn’t center around amyloid.” To be sure, NIH did fund some Alzheimer’s research that did not focus on amyloid. In a sea of amyloid-focused grants, there are tiny islands of research on oxidative stress, neuroinflammation, and, especially, a protein called tau. But Neve’s NINDS program officer, she said, “told me that I should at least collaborate with the amyloid people or I wouldn’t get any more NINDS grants.” (She hoped to study how neurons die.) A decade after her APP discovery, a disillusioned Neve left Alzheimer’s research, building a distinguished career in gene editing. Today, she said, she is “sick about the millions of people who have needlessly died from” the disease.

Dr. Daniel Alkon, a longtime NIH neuroscientist who started a company to develop an Alzheimer’s treatment, is even more emphatic: “If it weren’t for the near-total dominance of the idea that amyloid is the only appropriate drug target”, he said, “we would be 10 or 15 years ahead of where we are now.”

Making it worse is that the empirical support for the amyloid hypothesis has always been shaky. There were numerous red flags over the decades that targeting amyloid alone might not slow or reverse Alzheimer’s. “Even at the time the amyloid hypothesis emerged, 30 years ago, there was concern about putting all our eggs into one basket, especially the idea that ridding the brain of amyloid would lead to a successful treatment”, said neurobiologist Susan Fitzpatrick, president of the James S. McDonnell Foundation. But research pointing out shortcomings of the hypothesis was relegated to second-tier journals, at best, a signal to other scientists and drug companies that the criticisms needn’t be taken too seriously. Zaven Khachaturian spent years at NIH overseeing its early Alzheimer’s funding. Amyloid partisans, he said, “came to permeate drug companies, journals, and NIH study sections”, the groups of mostly outside academics who decide what research NIH should fund. “Things shifted from a scientific inquiry into an almost religious belief system, where people stopped being skeptical or even questioning.”

…“You had a whole industry going after amyloid, hundreds of clinical trials targeting it in different ways”, Alkon said. Despite success in millions of mice, “none of it worked in patients.”

Scientists who raised doubts about the amyloid model suspected why. Amyloid deposits, they thought, are a response to the true cause of Alzheimer’s and therefore a marker of the disease—again, the gravestones of neurons and synapses, not the killers. The evidence? For one thing, although the brains of elderly Alzheimer’s patients had amyloid plaques, so did the brains of people the same age who died with no signs of dementia, a pathologist discovered in 1991. Why didn’t amyloid rob them of their memories? For another, mice engineered with human genes for early Alzheimer’s developed both amyloid plaques and dementia, but there was no proof that the much more common, late-onset form of Alzheimer’s worked the same way. And yes, amyloid plaques destroy synapses (the basis of memory and every other brain function) in mouse brains, but there is no correlation between the degree of cognitive impairment in humans and the amyloid burden in the memory-forming hippocampus or the higher-thought frontal cortex. “There were so many clues”, said neuroscientist Nikolaos Robakis of the Icahn School of Medicine at Mount Sinai, who also discovered a mutation for early-onset Alzheimer’s. “Somehow the field believed all the studies supporting it, but not those raising doubts, which were very strong. The many weaknesses in the theory were ignored.”

“Meta-Research: A Comprehensive Review of Randomized Clinical Trials in Three Medical Journals Reveals 396 Medical Reversals”, Herrera-Perez et al 2019

“Meta-Research: A comprehensive review of randomized clinical trials in three medical journals reveals 396 medical reversals”⁠, Diana Herrera-Perez, Alyson Haslam, Tyler Crain, Jennifer Gill, Catherine Livingston, Victoria Kaestner et al (2019-06-11; backlinks; similar):

The ability to identify medical reversals and other low-value medical practices is an essential prerequisite for efforts to reduce spending on such practices. Through an analysis of more than 3000 randomized controlled trials (RCTs) published in three leading medical journals (the Journal of the American Medical Association, the Lancet, and the New England Journal of Medicine), we have identified 396 medical reversals. Most of the studies (92%) were conducted on populations in high-income countries, cardiovascular disease was the most common medical category (20%), and medication was the most common type of intervention (33%).

“Generalizable and Robust TV Advertising Effects”, Shapiro et al 2019

2019-shapiro.pdf: “Generalizable and Robust TV Advertising Effects”⁠, Bradley Shapiro, Günter J. Hitsch, Anna Tuchman (2019-06-11; ⁠, ⁠, ; backlinks; similar):

We provide generalizable and robust results on the causal sales effect of TV advertising based on the distribution of advertising elasticities for a large number of products (brands) in many categories. Such generalizable results provide a prior distribution that can improve the advertising decisions made by firms and the analysis and recommendations of anti-trust and public policy makers. A single case study cannot provide generalizable results, and hence the marketing literature provides several meta-analyses based on published case studies of advertising effects. However, publication bias results if the research or review process systematically rejects estimates of small, statistically insignificant, or “unexpected” advertising elasticities. Consequently, if there is publication bias, the results of a meta-analysis will not reflect the true population distribution of advertising effects.

To provide generalizable results, we base our analysis on a large number of products and clearly lay out the research protocol used to select the products. We characterize the distribution of all estimates, irrespective of sign, size, or statistical-significance. To ensure generalizability we document the robustness of the estimates. First, we examine the sensitivity of the results to the approach and assumptions made when constructing the data used in estimation from the raw sources. Second, as we aim to provide causal estimates, we document if the estimated effects are sensitive to the identification strategies that we use to claim causality based on observational data. Our results reveal substantially smaller effects of own-advertising compared to the results documented in the extant literature, as well as a sizable percentage of statistically insignificant or negative estimates. If we only select products with statistically-significant and positive estimates, the mean or median of the advertising effect distribution increases by a factor of about five.

The results are robust to various identifying assumptions, and are consistent with both publication bias and bias due to non-robust identification strategies to obtain causal estimates in the literature.

[Keywords: advertising, publication bias, generalizability]
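Shapiro et al’s selection effect is easy to illustrate with a toy simulation (all parameter values below are assumptions chosen for illustration, not estimates from the paper): if one publishes only positive, statistically-significant estimates from a population of small, noisy advertising elasticities, the mean of the “published” distribution is inflated severalfold.

```python
import numpy as np

rng = np.random.default_rng(0)

# True advertising elasticities: mostly near zero, consistent with the
# paper's finding of small effects. (Illustrative assumptions, not data.)
true_effects = rng.normal(loc=0.01, scale=0.02, size=100_000)
se = 0.02                                       # sampling error of each estimate
estimates = true_effects + rng.normal(0, se, size=true_effects.size)

# "Publication": keep only positive estimates clearing |t| > 1.96
published = estimates[(estimates > 0) & (np.abs(estimates / se) > 1.96)]

inflation = published.mean() / estimates.mean()
print(f"all estimates:  mean = {estimates.mean():.4f}")
print(f"published only: mean = {published.mean():.4f}")
print(f"inflation factor ≈ {inflation:.1f}")
```

Under these particular (assumed) parameters the inflation factor comes out in the same ballpark as the roughly five-fold inflation Shapiro et al report, but the qualitative point—selection on sign and significance inflates the mean—holds across a wide range of parameter choices.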

“How Should We Critique Research?”, Branwen 2019

Research-criticism: “How Should We Critique Research?”⁠, Gwern Branwen (2019-05-19; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.

Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of meta-science have demonstrated that much research is of low quality and often false.

But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?

I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.

This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.

If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.

“The Hype Cycle of Working Memory Training”, Redick 2019

“The Hype Cycle of Working Memory Training”⁠, Thomas S. Redick (2019-05-16; ; backlinks; similar):

Seventeen years and hundreds of studies after the first journal article on working memory training was published, evidence for the efficacy of working memory training is still wanting. Numerous studies show that individuals who repeatedly practice computerized working memory tasks improve on those tasks and closely related variants. Critically, although individual studies have shown improvements in untrained abilities and behaviors, systematic reviews of the broader literature show that studies producing large, positive findings are often those with the most methodological shortcomings. The current review discusses the past, present, and future status of working memory training, including consideration of factors that might influence working memory training and transfer efficacy.

“The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases”, Schäfer & Schwarz 2019

“The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases”⁠, Thomas Schäfer, Marcus A. Schwarz (2019-04-11; ; backlinks; similar):

Effect sizes are the currency of psychological research. They quantify the results of a study to answer the research question and are used to calculate statistical power. The interpretation of effect sizes—when is an effect small, medium, or large?—has been guided by the recommendations Jacob Cohen gave in his pioneering writings starting in 1962: Either compare an effect with the effects found in past research or use certain conventional benchmarks.

The present analysis shows that neither of these recommendations is currently applicable. From past publications without pre-registration⁠, 900 effects were randomly drawn and compared with 93 effects from publications with pre-registration, revealing a large difference: Effects from the former (median r = 0.36) were much larger than effects from the latter (median r = 0.16). That is, certain biases, such as publication bias or questionable research practices⁠, have caused a dramatic inflation in published effects, making it difficult to compare an actual effect with the real population effects (as these are unknown). In addition, there were very large differences in the mean effects between psychological sub-disciplines and between different study designs, making it impossible to apply any global benchmarks.

Many more pre-registered studies are needed in the future to derive a reliable picture of real population effects.

Figure 1: Distributions of effects (absolute values) from articles published with (n = 89) and without (n = 684) pre-registration. The distributions contain all effects that were extracted as or could be transformed into a correlation coefficient r.

“Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned?”, Lortie-Forgues & Inglis 2019

2019-lortieforgues.pdf: “Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned?”⁠, Hugues Lortie-Forgues, Matthew Inglis (2019-03-11; ; backlinks; similar):

There are a growing number of large-scale educational randomized controlled trials (RCTs). Considering their expense, it is important to reflect on the effectiveness of this approach. We assessed the magnitude and precision of effects found in those large-scale RCTs commissioned by the UK-based Education Endowment Foundation and the U.S.-based National Center for Educational Evaluation and Regional Assistance, which evaluated interventions aimed at improving academic achievement in K–12 (141 RCTs; 1,222,024 students). The mean effect size was 0.06 standard deviations. These sat within relatively large confidence intervals (mean width = 0.30 SDs), which meant that the results were often uninformative (the median Bayes factor was 0.56). We argue that our field needs, as a priority, to understand why educational RCTs often find small and uninformative effects.

[Keywords: educational policy, evaluation, meta-analysis, program evaluation.]
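As a rough illustration of why such results are uninformative, one can compute an approximate Bayes factor for the mean reported result (d = 0.06, 95% CI width 0.30 → SE ≈ 0.077) under a normal approximation. The N(0, 0.4²) prior on plausible effect sizes is my assumption, not the prior Lortie-Forgues & Inglis actually used, so this is only a back-of-the-envelope sketch:

```python
import math

d, ci_width = 0.06, 0.30
se = ci_width / (2 * 1.96)          # ≈ 0.077 SD
prior_sd = 0.4                      # assumed prior scale for plausible effects

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# BF10: marginal likelihood under H1 (delta ~ N(0, prior_sd^2))
# divided by the likelihood under H0 (delta = 0 exactly)
bf10 = normal_pdf(d, 0, math.sqrt(se**2 + prior_sd**2)) / normal_pdf(d, 0, se)
print(f"BF10 ≈ {bf10:.2f}")
```

A BF near 1 discriminates between the hypotheses hardly at all; values like this (well under 3 in either direction) are why the median study in their sample settles nothing.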

“Genetically Heterogeneous Mice Exhibit a Female Survival Advantage That Is Age-specific and Site-specific: Results from a Large Multi-site Study”, Cheng et al 2019

“Genetically heterogeneous mice exhibit a female survival advantage that is age-specific and site-specific: Results from a large multi-site study”⁠, Catherine J. Cheng, Jonathan A. L. Gelfond, Randy Strong, James F. Nelson (2019-02-23; ; backlinks; similar):

[See also Lucanic et al 2017 on C. elegans] The female survival advantage is a robust characteristic of human longevity. However, underlying mechanisms are not understood, and rodent models exhibiting a female advantage are lacking. Here, we report that the genetically heterogeneous (UM-HET3) mice used by the National Institute on Aging Interventions Testing Program (ITP) are such a model.

Analysis of age-specific survival of 3,690 control ITP mice revealed a female survival advantage paralleling that of humans. As in humans, the female advantage in mice was greatest in early adulthood, peaking around 350 days of age and diminishing progressively thereafter. This persistent finding was observed at 3 geographically distinct sites and in 6 separate cohorts over a 10-year period.

Because males weigh more than females and bodyweight is often inversely related to lifespan, we examined sex differences in the relationship between bodyweight and survival. Although present in both sexes, the inverse relationship between bodyweight and longevity was much stronger in males, indicating that male mortality is more influenced by bodyweight than is female mortality.

In addition, male survival varied more across site and cohort than female survival, suggesting greater resistance of females to environmental modulators of survival. Notably, at 24 months the relationship between bodyweight and longevity shifted from negative to positive in both sexes, similar to the human condition in advanced age.

These results indicate that the UM-HET3 mouse models the human female survival advantage and provide evidence for greater resilience of females to modulators of survival.

“Orchestrating False Beliefs about Gender Discrimination”, Pallesen 2019

“Orchestrating false beliefs about gender discrimination”⁠, Jonatan Pallesen (2019-02-19; backlinks; similar):

Blind auditions and gender discrimination: A seminal paper from 2000 investigated the impact of blind auditions in orchestras, and found that they increased the proportion of women in symphony orchestras. I investigate the study, and find that there is no good evidence presented. [The study is temporally confounded by a national trend of increasing female participation, does not actually establish any particular correlate of blind auditions, much less randomized experiments of blinding, the dataset is extremely underpowered, the effects cited in coverage cannot be found anywhere in the paper, and the critical comparisons which are there are not even statistically-significant in the first place. None of these caveats are included in the numerous citations of the study as “proving” discrimination against women.]

“On the Estimation of Treatment Effects With Endogenous Misreporting”, Nguimkeu et al 2019

2019-nguimkeu.pdf: “On the estimation of treatment effects with endogenous misreporting”⁠, Pierre Nguimkeu, Augustine Denteh, Rusty Tchernis (2019-02-01; backlinks; similar):

Participation in social programs is often misreported in survey data, complicating the estimation of treatment effects.

We propose a model to estimate treatment effects under endogenous participation and endogenous misreporting. We present an expression for the asymptotic bias of both OLS and IV estimators and discuss the conditions under which sign reversal may occur. We provide a method for eliminating this bias when researchers have access to information regarding participation and misreporting.

We establish the consistency and asymptotic normality of our proposed estimator and assess its small sample performance through Monte Carlo simulations⁠. An empirical example illustrates the proposed method.

[Keywords: treatment effect, misclassification, endogeneity, binary regressor, partial observability, bias, measurement-error]
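The sign-reversal possibility Nguimkeu et al analyze can be illustrated with a toy simulation (this is not their estimator; the misreporting rule and all parameters below are invented for illustration): when misreporting is endogenous—here, participants with high unobservables tend to deny participation—naive OLS on reported status can flip the sign of a positive true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

d = rng.integers(0, 2, n)                  # true participation
u = rng.standard_normal(n)                 # unobservable affecting the outcome
y = 0.5 * d + u                            # true treatment effect = +0.5

# Endogenous misreporting: participants with high unobservables deny
# participation 80% of the time. (Illustrative assumption.)
misreport = (d == 1) & (u > 0) & (rng.random(n) < 0.8)
d_reported = np.where(misreport, 0, d)

def ols_slope(x, y):
    # For a binary regressor this equals the difference in group means
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"OLS on true D:     {ols_slope(d, y):+.2f}")           # ≈ +0.50
print(f"OLS on reported D: {ols_slope(d_reported, y):+.2f}")  # negative
```

The reported-participant group is selectively drained of its high-outcome members, which lowers its mean, while the misreporters raise the mean of the reported-nonparticipant group—so the naive contrast comes out negative despite a positive true effect.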

“Registered Reports: an Early Example and Analysis”, Wiseman et al 2019

“Registered reports: an early example and analysis”⁠, Richard Wiseman, Caroline Watt, Diana Kornbrot (2019-01-16; backlinks; similar):

The recent ‘replication crisis’ in psychology has focused attention on ways of increasing methodological rigor within the behavioral sciences. Part of this work has involved promoting ‘Registered Reports’, wherein journals peer review papers prior to data collection and publication. Although this approach is usually seen as a relatively recent development, we note that a prototype of this publishing model was initiated in the mid-1970s by parapsychologist Martin Johnson in the European Journal of Parapsychology (EJP). A retrospective and observational comparison of Registered and non-Registered Reports published in the EJP during a seventeen-year period provides circumstantial evidence to suggest that the approach helped to reduce questionable research practices. This paper aims both to bring Johnson’s pioneering work to a wider audience, and to investigate the positive role that Registered Reports may play in helping to promote higher methodological and statistical standards.

…The final dataset contained 60 papers: 25 RRs and 35 non-RRs. The RRs described 31 experiments that tested 131 hypotheses, and the non-RRs described 60 experiments that tested 232 hypotheses.

28.4% of the statistical tests reported in non-RRs were statistically-significant (66⁄232: 95% CI [21.5%–36.4%]); compared to 8.4% of those in the RRs (11⁄131: 95% CI [4.0%–16.8%]). A simple 2 × 2 contingency analysis showed that this difference is highly statistically-significant (Fisher’s exact test: p < 0.0005, Pearson chi-square = 20.1, Cohen’s d = 0.48).

…Parapsychologists investigate the possible existence of phenomena that, for many, have a low a priori likelihood of being genuine (see, eg. Wagenmakers et al 2011). This has often resulted in their work being subjected to a considerable amount of critical attention (from both within and outwith the field) that has led to them pioneering several methodological advances prior to their use within mainstream psychology, including the development of randomisation in experimental design (Hacking, 1988), the use of blinds (Kaptchuk, 1998), explorations into randomisation and statistical inference (Fisher, 1924), advances in replication issues (Rosenthal, 1986), the need for pre-specification in meta-analysis (Akers, 1985; Milton, 1999; Kennedy, 2004), and the creation of a formal study registry (Watt, 2012; Watt & Kennedy, 2015). Johnson’s work on RRs provides another striking illustration of this principle at work.
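Wiseman et al’s 2×2 contingency comparison can be reproduced directly from the counts they report (a reader’s sanity check using scipy, not code from the paper):

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: non-Registered Reports vs. Registered Reports
# Columns: statistically-significant vs. non-significant hypothesis tests
table = [[66, 232 - 66],    # non-RRs: 66 of 232 tests significant (28.4%)
         [11, 131 - 11]]    # RRs:     11 of 131 tests significant (8.4%)

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
print(f"Fisher exact p = {p_fisher:.6f}")   # p < 0.0005, as reported
print(f"Pearson chi-square = {chi2:.1f}")   # ≈ 20.1, as reported
```

Both reported figures (Fisher’s exact p < 0.0005 and Pearson chi-square = 20.1) check out against the raw counts; `correction=False` is needed because the paper evidently used the uncorrected Pearson statistic rather than the Yates-corrected one.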

“How Replicable Are Links Between Personality Traits and Consequential Life Outcomes? The Life Outcomes of Personality Replication Project”, Soto 2019

2019-soto.pdf: “How Replicable Are Links Between Personality Traits and Consequential Life Outcomes? The Life Outcomes of Personality Replication Project”⁠, Christopher J. Soto (2019; ; backlinks; similar):

The Big Five personality traits have been linked to dozens of life outcomes. However, meta-scientific research has raised questions about the replicability of behavioral science. The Life Outcomes of Personality Replication (LOOPR) Project was therefore conducted to estimate the replicability of the personality-outcome literature.

Specifically, I conducted preregistered, high-powered (median n = 1,504) replications of 78 previously published trait-outcome associations. Overall, 87% of the replication attempts were statistically-significant in the expected direction. The replication effects were typically 77% as strong as the corresponding original effects, which represents a significant decline in effect size.

The replicability of individual effects was predicted by the effect size and design of the original study, as well as the sample size and statistical power of the replication. These results indicate that the personality-outcome literature provides a reasonably accurate map of trait-outcome associations but also that it stands to benefit from efforts to improve replicability.

“Reading Lies: Nonverbal Communication and Deception”, Vrij et al 2019

2019-vrij.pdf: “Reading Lies: Nonverbal Communication and Deception”⁠, Aldert Vrij, Maria Hartwig, Pär Anders Granhag (2019; ; backlinks; similar):

The relationship between nonverbal communication and deception continues to attract much interest, but there are many misconceptions about it. In this review, we present a scientific view on this relationship. We describe theories explaining why liars would behave differently from truth tellers, followed by research on how liars actually behave and individuals’ ability to detect lies. We show that the nonverbal cues to deceit discovered to date are faint and unreliable and that people are mediocre lie catchers when they pay attention to behavior. We also discuss why individuals hold misbeliefs about the relationship between nonverbal behavior and deception—beliefs that appear very hard to debunk. We further discuss the ways in which researchers could improve the state of affairs by examining nonverbal behaviors in different ways and in different settings than they currently do.

“The Association between Adolescent Well-being and Digital Technology Use”, Orben & Przybylski 2019

2019-orben.pdf: “The association between adolescent well-being and digital technology use”⁠, Amy Orben, Andrew K. Przybylski (2019-01-01; ; backlinks)

“The Advantages of Bilingualism Debate”, Antoniou 2019

2019-antoniou.pdf: “The Advantages of Bilingualism Debate”⁠, Mark Antoniou (2019; ; backlinks; similar):

Bilingualism was once thought to result in cognitive disadvantages, but research in recent decades has demonstrated that experience with two (or more) languages confers a bilingual advantage in executive functions and may delay the incidence of Alzheimer’s disease. However, conflicting evidence has emerged leading to questions concerning the robustness of the bilingual advantage for both executive functions and dementia incidence. Some investigators have failed to find evidence of a bilingual advantage; others have suggested that bilingual advantages may be entirely spurious, while proponents of the advantage case have continued to defend it. A heated debate has ensued, and the field has now reached an impasse.

This review critically examines evidence for and against the bilingual advantage in executive functions, cognitive aging, and brain plasticity, before outlining how future research could shed light on this debate and advance knowledge of how experience with multiple languages affects cognition and the brain.

“How Genome-wide Association Studies (GWAS) Made Traditional Candidate Gene Studies Obsolete”, Duncan et al 2019

2019-duncan.pdf: “How genome-wide association studies (GWAS) made traditional candidate gene studies obsolete”⁠, Laramie E. Duncan, Michael Ostacher, Jacob Ballon (2019-01-01; ; backlinks)

“Stereotype Threat Effects in Settings With Features Likely versus Unlikely in Operational Test Settings: A Meta-analysis”, Shewach et al 2019

2019-shewach.pdf: “Stereotype threat effects in settings with features likely versus unlikely in operational test settings: A meta-analysis”⁠, Oren R. Shewach, Paul R. Sackett, Sander Quint (2019; ; backlinks; similar):

The stereotype threat literature primarily comprises lab studies, many of which involve features that would not be present in high-stakes testing settings. We meta-analyze the effect of stereotype threat on cognitive ability tests, focusing on both laboratory and operational studies with features likely to be present in high stakes settings. First, we examine the features of cognitive ability test metric, stereotype threat cue activation strength, and type of non-threat control group, and conduct a focal analysis removing conditions that would not be present in high stakes settings. We also take into account a previously unrecognized methodological error in how data are analyzed in studies that control for scores on a prior cognitive ability test, which resulted in a biased estimate of stereotype threat. The focal sample, restricting the database to samples utilizing operational testing-relevant conditions, displayed a threat effect of d = −0.14 (k = 45, n = 3,532, SDδ = 0.31). Second, we present a comprehensive meta-analysis of stereotype threat. Third, we examine a small subset of studies in operational test settings and studies utilizing motivational incentives, which yielded d-values ranging from 0.00 to −0.14. Fourth, the meta-analytic database is subjected to tests of publication bias, finding nontrivial evidence for publication bias. Overall, results indicate that the size of the stereotype threat effect that can be experienced on tests of cognitive ability in operational scenarios such as college admissions tests and employment testing may range from negligible to small.

“No Support for Historical Candidate Gene or Candidate Gene-by-Interaction Hypotheses for Major Depression Across Multiple Large Samples”, Border et al 2019

2019-border.pdf: “No Support for Historical Candidate Gene or Candidate Gene-by-Interaction Hypotheses for Major Depression Across Multiple Large Samples”⁠, Richard Border, Emma C. Johnson, Luke M. Evans, Andrew Smolen, Noah Berley, Patrick F. Sullivan, Matthew C. Keller et al (2019; ; backlinks; similar):

Objective: Interest in candidate gene and candidate gene-by-environment interaction hypotheses regarding major depressive disorder remains strong despite controversy surrounding the validity of previous findings. In response to this controversy, the present investigation empirically identified 18 candidate genes for depression that have been studied 10 or more times and examined evidence for their relevance to depression phenotypes.

Methods: Utilizing data from large population-based and case-control samples (_n_s ranging from 62,138 to 443,264 across subsamples), the authors conducted a series of preregistered analyses examining candidate gene polymorphism main effects, polymorphism-by-environment interactions, and gene-level effects across a number of operational definitions of depression (eg. lifetime diagnosis, current severity, episode recurrence) and environmental moderators (eg. sexual or physical abuse during childhood, socioeconomic adversity).

Results: No clear evidence was found for any candidate gene polymorphism associations with depression phenotypes or any polymorphism-by-environment moderator effects. As a set, depression candidate genes were no more associated with depression phenotypes than non-candidate genes. The authors demonstrate that phenotypic measurement error is unlikely to account for these null findings.

Conclusions: The study results do not support previous depression candidate gene findings, in which large genetic effects are frequently reported in samples orders of magnitude smaller than those examined here. Instead, the results suggest that early hypotheses about depression candidate genes were incorrect and that the large number of associations reported in the depression candidate gene literature are likely to be false positives.

Figure 2: Main effects and gene-by-environment effects of 16 candidate polymorphisms on estimated lifetime depression diagnosis and current depression severity in the UK Biobank sample. The graphs show effect size estimates for 16 candidate polymorphisms, presented in order of estimated number of studies from left to right, descending, on estimated lifetime depression diagnosis (panel A) and past-2-week depression symptom severity from the online mental health follow-up assessment (panel B) in the UK Biobank sample (N = 115,257). Both polymorphism main effects and polymorphism-by-environment moderator interaction effects are presented for each outcome. Detailed descriptions of the variables and of the association and power analysis models are provided in sections S3 and S4, respectively, of the online supplement.
Figure 3: Gene-wise statistics for effects of 18 candidate genes on primary depression outcomes in the UK Biobank sample. The plot shows gene-wise p values across the genome, highlighting the 18 candidate polymorphisms’ effects on estimated depression diagnosis (filled points) and past-2-week depression symptom severity (unfilled points) from the online mental health follow-up assessment in the UK Biobank sample (N = 115,257). Gene labels alternate colors to aid readability. Detailed descriptions of the variables and of the association models are provided in sections S3 and S4.2, respectively, of the online supplement.

“Littlewood’s Law and the Global Media”, Branwen 2018

Littlewood: “Littlewood’s Law and the Global Media”⁠, Gwern Branwen (2018-12-15; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Selection effects in media become increasingly strong as populations and media increase, meaning that rare datapoints driven by unusual processes such as the mentally ill or hoaxers are increasingly unreliable as evidence of anything at all and must be ignored. At scale, anything that can happen will happen a small but nonzero number of times.

Online & mainstream media and social networking have become increasingly misleading as to the state of the world by focusing on ‘stories’ and ‘events’ rather than trends and averages. This is because as the global population increases and the scope of media increases, media’s urge for narrative focuses on the most extreme outlier datapoints—but such datapoints are, at a global scale, deeply misleading as they are driven by unusual processes such as the mentally ill or hoaxers.

At a global scale, anything that can happen will happen a small but nonzero number of times: this has been epitomized as “Littlewood’s Law: in the course of any normal person’s life, miracles happen at a rate of roughly one per month.” This must now be extended to a global scale for a hyper-networked global media covering anomalies from 8 billion people—all coincidences, hoaxes, mental illnesses, psychological oddities, extremes of continuums, mistakes, misunderstandings, terrorism, unexplained phenomena etc. Hence, there will be enough ‘miracles’ that all media coverage of events can potentially be composed of nothing but extreme outliers, even though it would seem like an ‘extraordinary’ claim to say that all media-reported events may be flukes.

This creates an epistemic environment deeply hostile to understanding reality, one which is dedicated to finding arbitrary amounts of and amplifying the least representative datapoints.

Given this, it is important to maintain extreme skepticism of any individual anecdotes or stories which are selectively reported but still claimed (often implicitly) to be representative of a general trend or fact about the world. Standard techniques like critical thinking, emphasizing trends & averages, and demanding original sources can help fight the biasing effect of news.

“Mesmerising Science: The Franklin Commission and the Modern Clinical Trial”, Laukaityte 2018

“Mesmerising Science: The Franklin Commission and the Modern Clinical Trial”⁠, Urte Laukaityte (2018-11-20; ⁠, ; similar):

Benjamin Franklin, magnetic trees, and erotically-charged séances— Urte Laukaityte on how a craze for sessions of “animal magnetism” in late 18th-century Paris led to the randomised placebo-controlled and double-blind clinical trials we know and love today…By a lucky coincidence, Benjamin Franklin was in France as the first US ambassador with a mission to ensure an official alliance against its arch nemesis, the British. On account of his fame as a great man of science in general and his experiments on one such invisible force—electricity—in particular, Franklin was appointed as head of the royal commission. The investigating team also included the chemist Antoine-Laurent Lavoisier, the astronomer Jean-Sylvain Bailly, and the doctor Joseph-Ignace Guillotin. It is a curious fact of history that both Lavoisier and Bailly were later executed by the guillotine—the device attributed to their fellow commissioner. The revolution also, of course, brought the same fate to King Louis XVI and his Mesmer-supporting wife Marie Antoinette. In a stroke of insight, the commissioners figured that the cures might be affected by one of two possible mechanisms: psychological suggestion (what they refer to as “imagination”) or some actual physical magnetic action. Mesmer and his followers claimed it was the magnetic fluid, so that served as the experimental condition if you like. Continuing with the modern analogies, suggestion would then represent a rudimentary placebo control condition. So to test animal magnetism, they came up with two kinds of trials to try and separate the two possibilities: either the research subject is being magnetised but does not know it (magnetism without imagination) or the subject is not being magnetised but thinks that they are (imagination without magnetism). 
The fact that the trials were blind, or in other words, the patients did not know when the magnetic operation was being performed, marks the commission’s most innovative contribution to science…Whatever the moral case may be, the report paved the way for the modern empirical approach in more ways than one. Stephen Jay Gould called the work “a masterpiece of the genre, an enduring testimony to the power and beauty of reason” that “should be rescued from its current obscurity, translated into all languages”. Just to mention a few further insights, the commissioners were patently aware of psychological phenomena like the experimenter effect, concerned as they were that some patients might report certain sensations because they thought that is what the eminent men of science wanted to hear. That seems to be what propelled them to make the study placebo-controlled and single-blind. Other phenomena reminiscent of the modern-day notion of priming⁠, and the role of expectations more generally, are pointed out throughout the document. The report also contains a detailed account of how self-directed attention can generate what are known today as psychosomatic symptoms. Relatedly, there is an incredibly lucid discussion of mass psychogenic illness, and mass hysteria more generally, including in cases of war and political upheaval. Just 5 years later, France would descend into the chaos of a violent revolution.

“Predicting Replication Outcomes in the Many Labs 2 Study”, Forsell et al 2018

“Predicting replication outcomes in the Many Labs 2 study”⁠, Eskil Forsell, Domenico Viganola, Thomas Pfeiffer, Johan Almenberg, Brad Wilson, Yiling Chen, Brian A. Nosek et al (2018-10-25; backlinks; similar):


  • Psychologists participated in prediction markets to predict replication outcomes.
  • Prediction markets correctly predicted 75% of the replication outcomes.
  • Prediction markets performed better than survey data in predicting replication outcomes.
  • Survey data performed better in predicting relative effect size of the replications.

Understanding and improving reproducibility is crucial for scientific progress. Prediction markets and related methods of eliciting peer beliefs are promising tools to predict replication outcomes. We invited researchers in the field of psychology to judge the replicability of 24 studies replicated in the large-scale Many Labs 2 project. We elicited peer beliefs in prediction markets and surveys about two replication success metrics: the probability that the replication yields a statistically-significant effect in the original direction (p < 0.001), and the relative effect size of the replication. The prediction markets correctly predicted 75% of the replication outcomes, and were highly correlated with the replication outcomes. Survey beliefs were also statistically-significantly correlated with replication outcomes, but had larger prediction errors. The prediction markets for relative effect sizes attracted little trading and thus did not work well. The survey beliefs about relative effect sizes performed better and were statistically-significantly correlated with observed relative effect sizes. The results suggest that replication outcomes can be predicted and that the elicitation of peer beliefs can increase our knowledge about scientific reproducibility and the dynamics of hypothesis testing.

[Keywords: reproducibility, replications, prediction markets, beliefs]

“Open Questions”, Branwen 2018

Questions: “Open Questions”⁠, Gwern Branwen (2018-10-17; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Some anomalies/​questions which are not necessarily important, but do puzzle me or where I find existing explanations to be unsatisfying.


A list of some questions which are not necessarily important, but do puzzle me or where I find existing ‘answers’ to be unsatisfying, categorized by subject (along the lines of Patrick Collison’s list & Alex Guzey⁠; see also my list of project ideas).

“Effects of the Tennessee Prekindergarten Program on Children’s Achievement and Behavior through Third Grade”, Lipsey et al 2018

“Effects of the Tennessee Prekindergarten Program on children’s achievement and behavior through third grade”⁠, Mark W. Lipsey, Dale C. Farran, Kelley Durkin (2018-09; ; backlinks; similar):

  • This study of the Tennessee Voluntary Pre-K Program (VPK) is the first randomized control trial of a state pre-k program.
  • Positive achievement effects at the end of pre-k reversed and began favoring the control children by 2nd and 3rd grade.
  • VPK participants had more disciplinary infractions and special education placements by 3rd grade than control children.
  • No effects of VPK were found on attendance or retention in the later grades.
  • These findings have policy implications for scaling up pre-k and supporting its benefits in the later grades.

[6th grade followup⁠; cf. Duncan & Magnuson 2013⁠/​Pages et al 2020⁠; SSC] This report presents results of a randomized trial of a state prekindergarten program.

Low-income children (n = 2,990) applying to oversubscribed programs were randomly assigned to receive offers of admission or remain on a waiting list. Data from pre-k through 3rd grade were obtained from state education records; additional data were collected for a subset of children with parental consent (n = 1,076).

At the end of pre-k, pre-k participants in the consented subsample performed better than control children on a battery of achievement tests, with non-native English speakers and children scoring lowest at baseline showing the greatest gains. During the kindergarten year and thereafter, the control children caught up with the pre-k participants on those tests and generally surpassed them. Similar results appeared on the 3rd grade state achievement tests for the full randomized sample—pre-k participants did not perform as well as the control children. Teacher ratings of classroom behavior did not favor either group overall, though some negative treatment effects were seen in 1st and 2nd grade. There were differential positive pre-k effects for male and Black children on a few ratings and on attendance. Pre-k participants had lower retention rates in kindergarten that did not persist, and higher rates of school rule violations in later grades. Many pre-k participants received special education designations that remained through later years, creating higher rates than for control children.

Issues raised by these findings and implications for pre-k policy are discussed.

[Keywords: public pre-k, randomized control trial, longitudinal, early childhood education, achievement, policy]

“Evaluating the Replicability of Social Science Experiments in Nature and Science between 2010 and 2015”, Camerer et al 2018

2018-camerer.pdf: “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015”⁠, Colin F. Camerer, Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler et al (2018-08-27; backlinks; similar):

Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about 5× higher than in the original studies. We find a statistically-significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.

“The Cumulative Effect of Reporting and Citation Biases on the Apparent Efficacy of Treatments: the Case of Depression”, Vries et al 2018

2018-devries.pdf: “The cumulative effect of reporting and citation biases on the apparent efficacy of treatments: the case of depression”⁠, Y. A. de Vries, A. M. Roest, P. de Jonge, P. Cuijpers, M. R. Munafò, J. A. Bastiaansen (2018-08-18; ; backlinks; similar):

Evidence-based medicine is the cornerstone of clinical practice, but it is dependent on the quality of evidence upon which it is based. Unfortunately, up to half of all randomized controlled trials (RCTs) have never been published, and trials with statistically-significant findings are more likely to be published than those without (Dwan et al 2013). Importantly, negative trials face additional hurdles beyond study publication bias that can result in the disappearance of non-significant results (Boutron et al 2010; Dwan et al 2013; Duyx et al 2017). Here, we analyze the cumulative impact of biases on apparent efficacy, and discuss possible remedies, using the evidence base for two effective treatments for depression: antidepressants and psychotherapy.

Figure 1: The cumulative impact of reporting and citation biases on the evidence base for antidepressants. (a) displays the initial, complete cohort of trials, while (b) through (e) show the cumulative effect of biases. Each circle indicates a trial, while the color indicates the results or the presence of spin. Circles connected by a grey line indicate trials that were published together in a pooled publication. In (e), the size of the circle indicates the (relative) number of citations received by that category of studies.

“Statistical Paradises and Paradoxes in Big Data (1): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election”, Meng 2018

“Statistical paradises and paradoxes in big data (1): Law of large populations, big data paradox, and the 2016 US presidential election”⁠, Xiao-Li Meng (2018-07-28; backlinks; similar):

Statisticians are increasingly posed with thought-provoking and even paradoxical questions, challenging our qualifications for entering the statistical paradises created by Big Data. By developing measures for data quality, this article suggests a framework to address such a question: “Which one should I trust more: a 1% survey with 60% response rate or a self-reported administrative dataset covering 80% of the population?” A 5-element Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample average X̄_n and the population average X̄_N is the product of three terms: (1) a data quality measure, ρ_{R,X}, the correlation between X_j and the response/​recording indicator R_j; (2) a data quantity measure, √((N − n)/n), where N is the population size; and (3) a problem difficulty measure, σ_X, the standard deviation of X.

This decomposition provides multiple insights: (1) Probabilistic sampling ensures high data quality by controlling ρ_{R,X} at the level of N^(−1/2); (2) When we lose this control, the impact of N is no longer canceled by ρ_{R,X}, leading to a Law of Large Populations (LLP), that is, our estimation error, relative to the benchmarking rate 1/√n, increases with √N; (3) the “bigness” of such Big Data (for population inferences) should be measured by the relative size f = n/N, not the absolute size n; (4) When combining data sources for population inferences, those relatively tiny but higher-quality ones should be given far more weight than suggested by their sizes.

Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a ρ_{R,X} ≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence). The CCES data demonstrate LLP vividly: on average, the larger the state’s voter populations, the further away the actual Trump vote shares from the usual 95% confidence intervals based on the sample proportions. This should remind us that, without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.
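Meng’s identity can be checked numerically (this sketch is not from the paper: the population and the recording mechanism are synthetic, and the closing effective-sample-size line is only a back-of-envelope approximation plugging in the abstract’s ρ ≈ −0.005 and f = 1%):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                          # population size
X = rng.normal(50.0, 10.0, N)        # population values

# Non-probabilistic "big data": units with larger X are slightly more
# likely to be recorded, inducing a small data defect correlation.
p_record = 1 / (1 + np.exp(-(X - 50.0) / 40.0))
R = rng.random(N) < p_record         # response/recording indicator
n = int(R.sum())

lhs = X[R].mean() - X.mean()         # sample average minus population average

rho = np.corrcoef(R.astype(float), X)[0, 1]  # data quality: corr(R_j, X_j)
quantity = np.sqrt((N - n) / n)              # data quantity: sqrt((N-n)/n)
sigma = X.std()                              # difficulty: sigma_X (ddof=0)
rhs = rho * quantity * sigma

assert abs(lhs - rhs) < 1e-8         # identity holds exactly for any realized R

# Back-of-envelope effective sample size for the abstract's CCES numbers:
# equate rho^2 * (1-f)/f * sigma^2 with the SRS variance sigma^2 / n_eff.
f, rho_cces = 0.01, -0.005
n_eff = f / (rho_cces**2 * (1 - f))
print(round(n_eff))                  # ≈ 404, matching the abstract's "n ≈ 400"
```

Note that the identity is exact for each realized indicator vector R, not merely in expectation; only the interpretation of ρ_{R,X} as a random “data defect correlation” is statistical.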

“Disentangling Bias and Variance in Election Polls”, Shirani-Mehr et al 2018

“Disentangling Bias and Variance in Election Polls”⁠, Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, Andrew Gelman (2018-07-25; ; backlinks; similar):

It is well known among researchers and practitioners that election polls suffer from a variety of sampling and nonsampling errors, often collectively referred to as total survey error. Reported margins of error typically only capture sampling variability, and in particular, generally ignore nonsampling errors in defining the target population (eg. errors due to uncertainty in who will vote).

Here, we empirically analyze 4221 polls for 608 state-level presidential, senatorial, and gubernatorial elections between 1998 and 2014, all of which were conducted during the final three weeks of the campaigns. Comparing to the actual election outcomes, we find that average survey error as measured by root mean square error is ~3.5 percentage points, about twice as large as that implied by most reported margins of error. We decompose survey error into election-level bias and variance terms. We find that average absolute election-level bias is about 2 percentage points, indicating that polls for a given election often share a common component of error. This shared error may stem from the fact that polling organizations often face similar difficulties in reaching various subgroups of the population, and that they rely on similar screening rules when estimating who will vote. We also find that average election-level variance is higher than implied by simple random sampling, in part because polling organizations often use complex sampling designs and adjustment procedures.

We conclude by discussing how these results help explain polling failures in the 2016 U.S. presidential election, and offer recommendations to improve polling practice.

[Keywords: margin of error, non-sampling error, polling bias, total survey error]
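The decomposition of total survey error into a shared election-level bias and poll-level variance can be illustrated with a toy simulation (all parameters below are hypothetical, chosen only to echo the magnitudes in the abstract: a ~2-point average absolute election-level bias contributing to a ~3.5-point RMSE):

```python
import numpy as np

rng = np.random.default_rng(1)
E, P = 608, 7                        # elections x polls per election (toy sizes)
truth = rng.uniform(40, 60, E)       # true vote shares, in percentage points
shared_bias = rng.normal(0, 2.5, E)  # error component common to an election's polls
noise = rng.normal(0, 2.5, (E, P))   # idiosyncratic poll-level error
polls = truth[:, None] + shared_bias[:, None] + noise

err = polls - truth[:, None]
rmse = np.sqrt((err ** 2).mean())            # total survey error, all polls pooled
elec_bias = err.mean(axis=1)                 # estimated per-election bias
avg_abs_bias = np.abs(elec_bias).mean()      # average absolute election-level bias
within_sd = err.std(axis=1, ddof=1).mean()   # poll-to-poll spread within an election
```

Because the polls within an election share the bias term, averaging more polls per election shrinks the noise but not the bias, which is why aggregators cannot average their way past the ~2-point shared component.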

“Causal Language and Strength of Inference in Academic and Media Articles Shared in Social Media (CLAIMS): A Systematic Review”, Haber et al 2018

“Causal language and strength of inference in academic and media articles shared in social media (CLAIMS): A systematic review”⁠, Noah Haber, Emily R. Smith, Ellen Moscoe, Kathryn Andrews, Robin Audy, Winnie Bell, Alana T. Brennan et al (2018-05-30; ; backlinks; similar):

Background: The pathway from evidence generation to consumption contains many steps which can lead to overstatement or misinformation. The proliferation of internet-based health news may encourage selection of media and academic research articles that overstate strength of causal inference. We investigated the state of causal inference in health research as it appears at the end of the pathway, at the point of social media consumption.

Methods: We screened the NewsWhip Insights database for the most shared media articles on Facebook and Twitter reporting about peer-reviewed academic studies associating an exposure with a health outcome in 2015, extracting the 50 most-shared academic articles and media articles covering them. We designed and utilized a review tool to systematically assess and summarize studies’ strength of causal inference, including generalizability, potential confounders, and methods used. These were then compared with the strength of causal language used to describe results in both academic and media articles. Two randomly assigned independent reviewers and one arbitrating reviewer from a pool of 21 reviewers assessed each article.

Results: We accepted the most shared 64 media articles pertaining to 50 academic articles for review, representing 68% of Facebook and 45% of Twitter shares in 2015. 34% of academic studies and 48% of media articles used language that reviewers considered too strong for their strength of causal inference. 70% of academic studies were considered low or very low strength of inference, with only 6% considered high or very high strength of causal inference. The most severe issues with academic studies’ causal inference were reported to be omitted confounding variables and generalizability. 58% of media articles were found to have inaccurately reported the question, results, intervention, or population of the academic study.

Conclusions: We find a large disparity between the strength of language as presented to the research consumer and the underlying strength of causal inference among the studies most widely shared on social media. However, because this sample was designed to be representative of the articles selected and shared on social media, it is unlikely to be representative of all academic and media work. More research is needed to determine how academic institutions, media organizations, and social network sharing patterns impact causal inference and language as received by the research consumer.

“Acceptable Losses: the Debatable Origins of Loss Aversion”, Yechiam 2018

2018-yechiam.pdf: “Acceptable losses: the debatable origins of loss aversion”⁠, Eldad Yechiam (2018-05-16; similar):

[pro⁠; con] It is often claimed that negative events carry a larger weight than positive events. Loss aversion is the manifestation of this argument in monetary outcomes. In this review, we examine early studies of the utility function of gains and losses, and in particular the original evidence for loss aversion reported by Kahneman and Tversky (Econometrica 47:263–291, 1979).

We suggest that loss aversion proponents have over-interpreted these findings. Specifically, the early studies of utility functions have shown that while very large losses are overweighted, smaller losses are often not. In addition, the findings of some of these studies have been systematically misrepresented to reflect loss aversion, though they did not find it.

These findings shed light both on the inability of modern studies to reproduce loss aversion as well as a second literature arguing strongly for it.

“A Real-life Lord of the Flies: the Troubling Legacy of the Robbers Cave Experiment; In the Early 1950s, the Psychologist Muzafer Sherif Brought Together a Group of Boys at a US Summer Camp—and Tried to Make Them Fight Each Other. Does His Work Teach Us Anything about Our Age of Resurgent Tribalism? [an Extract from The Lost Boys]”, Shariatmadari 2018

“A real-life Lord of the Flies: the troubling legacy of the Robbers Cave experiment; In the early 1950s, the psychologist Muzafer Sherif brought together a group of boys at a US summer camp—and tried to make them fight each other. Does his work teach us anything about our age of resurgent tribalism? [an extract from The Lost Boys]”⁠, David Shariatmadari (2018-04-16; backlinks; similar):

In 50s Middle Grove, things didn’t go according to plan either, though the surprise was of a different nature. Despite his pretence of leaving the 11-year-olds to their own devices, Sherif and his research staff, posing as camp counsellors and caretakers, interfered to engineer the result they wanted. He believed he could make the two groups, called the Pythons and the Panthers, sworn enemies via a series of well-timed “frustration exercises”. These included his assistants stealing items of clothing from the boys’ tents and cutting the rope that held up the Panthers’ homemade flag, in the hope they would blame the Pythons. One of the researchers crushed the Panthers’ tent, flung their suitcases into the bushes and broke a boy’s beloved ukulele. To Sherif’s dismay, however, the children just couldn’t be persuaded to hate each other…The robustness of the boy’s “civilised” values came as a blow to Sherif, making him angry enough to want to punch one of his young academic helpers. It turned out that the strong bonds forged at the beginning of the camp weren’t easily broken. Thankfully, he never did start the forest fire—he aborted the experiment when he realised it wasn’t going to support his hypothesis.

But the Rockefeller Foundation had given Sherif $38,000 in 1953 (~$323,938 in today’s dollars). In his mind, perhaps, if he came back empty-handed, he would face not just their anger but the ruin of his reputation. So, within a year, he had recruited boys for a second camp, this time in Robbers Cave state park in Oklahoma. He was determined not to repeat the mistakes of Middle Grove.

…At Robbers Cave, things went more to plan. After a tug-of-war in which they were defeated, the Eagles burned the Rattler’s flag. Then all hell broke loose, with raids on cabins, vandalism and food fights. Each moment of confrontation, however, was subtly manipulated by the research team. They egged the boys on, providing them with the means to provoke one another—who else, asks Perry in her book, could have supplied the matches for the flag-burning?

…Sherif was elated. And, with the publication of his findings that same year, his status as world-class scholar was confirmed. The “Robbers Cave experiment” is considered seminal by social psychologists, still one of the best-known examples of “realistic conflict theory”. It is often cited in modern research. But was it scientifically rigorous? And why were the results of the Middle Grove experiment—where the researchers couldn’t get the boys to fight—suppressed? “Sherif was clearly driven by a kind of a passion”, Perry says. “That shaped his view and it also shaped the methods he used. He really did come from that tradition in the 30s of using experiments as demonstrations—as a confirmation, not to try to find something new.” In other words, think of the theory first and then find a way to get the results that match it. If the results say something else? Bury them…“I think people are aware now that there are real ethical problems with Sherif’s research”, she tells me, “but probably much less aware of the backstage [manipulation] that I’ve found. And that’s understandable because the way a scientist writes about their research is accepted at face value.” The published report of Robbers Cave uses studiedly neutral language. “It’s not until you are able to compare the published version with the archival material that you can see how that story is shaped and edited and made more respectable in the process.” That polishing up still happens today, she explains. “I wouldn’t describe him as a charlatan…every journal article, every textbook is written to convince, persuade and to provide evidence for a point of view. So I don’t think Sherif is unusual in that way.”

“P-Hacking and False Discovery in A/B Testing”, Berman et al 2018

2018-berman.pdf: “p-Hacking and False Discovery in A/B Testing”⁠, Ron Berman, Leonid Pekelis, Aisling Scott, Christophe Van den Bulte (2018-01-01; ⁠, ; backlinks)

“Homogenous: The Political Affiliations of Elite Liberal Arts College Faculty”, Langbert 2018

2018-langbert.pdf: “Homogenous: The Political Affiliations of Elite Liberal Arts College Faculty”⁠, Mitchell Langbert (2018-01-01; ; backlinks)

“Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes”, Watts et al 2018

2018-watts.pdf: “Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes”⁠, Tyler W. Watts, Greg J. Duncan, Haonan Quan (2018; ; backlinks; similar):

We replicated and extended Shoda et al 1990’s famous marshmallow study⁠, which showed strong bivariate correlations between a child’s ability to delay gratification just before entering school and both adolescent achievement and socioemotional behaviors.

Concentrating on children whose mothers had not completed college, we found that an additional minute waited at age 4 predicted a gain of ~1⁄10th of a standard deviation in achievement at age 15. But this bivariate correlation was only half the size of those reported in the original studies and was reduced by two thirds in the presence of controls for family background, early cognitive ability, and the home environment.

Most of the variation in adolescent achievement came from being able to wait at least 20s. Associations between delay time and measures of behavioral outcomes at age 15 were much smaller and rarely statistically-significant.
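The attenuation Watts et al report is the classic confounding pattern: a bivariate regression of achievement on delay time also absorbs the effect of family background, which predicts both. A minimal simulation of the mechanism (all coefficients and variable names are hypothetical, chosen only for illustration, not fitted to their data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: family background confounds both delay time and achievement.
background = rng.normal(size=n)
delay = 0.5 * background + rng.normal(size=n)                # minutes waited (standardized)
achievement = 0.05 * delay + 0.4 * background + rng.normal(size=n)

def ols_slope(y, *xs):
    """Return the OLS coefficient on the first regressor, with an intercept."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

bivariate = ols_slope(achievement, delay)               # inflated by confounding (~0.21)
controlled = ols_slope(achievement, delay, background)  # close to the true 0.05

print(f"bivariate: {bivariate:.3f}, controlled: {controlled:.3f}")
```

With these made-up parameters the bivariate slope is roughly 4× the controlled one, the same direction of shrinkage (if larger in degree) that Watts et al found when adding controls for background, early cognitive ability, and the home environment.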

“The Elusive Backfire Effect: Mass Attitudes’ Steadfast Factual Adherence”, Wood & Porter 2018

2018-wood.pdf: “The Elusive Backfire Effect: Mass Attitudes’ Steadfast Factual Adherence”⁠, Thomas Wood, Ethan Porter (2018-01-01; backlinks)

“Knowing What We Are Getting: Evaluating Scientific Research on the International Space Station”, Bianco & Schmidt 2017

2017-bianco.pdf: “Knowing What We Are Getting: Evaluating Scientific Research on the International Space Station”⁠, William Bianco, Eric Schmidt (2017-12-26; similar):

The debate over the value of the International Space Station has overlooked a fundamental question: What is the station’s contribution to scientific knowledge? We address this question using a multivariate analysis of publication and patent data from station experiments. We find a relatively high probability that ISS experiments with PIs drawn from outside NASA will yield refereed publications and, furthermore, that these experiments have non-negligible probabilities of finding publication in high-impact journals or producing government patents. However, technology demonstrations and experiments with all-NASA PIs have much weaker track records. These results highlight the complexities inherent to constructing a compelling case for science onboard the ISS or for crewed spaceflight in general.

“The Prehistory of Biology Preprints: A Forgotten Experiment from the 1960s”, Cobb 2017

“The prehistory of biology preprints: A forgotten experiment from the 1960s”⁠, Matthew Cobb (2017-11-16; ; backlinks; similar):

In 1961, the National Institutes of Health (NIH) began to circulate biological preprints in a forgotten experiment called the Information Exchange Groups (IEGs). This system eventually attracted over 3,600 participants and saw the production of over 2,500 different documents, but by 1967, it was effectively shut down following the refusal of journals to accept articles that had been circulated as preprints.

This article charts the rise and fall of the IEGs and explores the parallels with the 1990s and the biomedical preprint movement of today.

“Percutaneous Coronary Intervention in Stable Angina (ORBITA): a Double-blind, Randomised Controlled Trial”, Al-Lamee et al 2017

2017-allamee.pdf: “Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial”⁠, Rasha Al-Lamee, David Thompson, Hakim-Moulay Dehbi, Sayan Sen, Kare Tang, John Davies, Thomas Keeble et al (2017-11-02; ⁠, ; backlinks; similar):

Background: Symptomatic relief is the primary goal of percutaneous coronary intervention (PCI) in stable angina and is commonly observed clinically. However, there is no evidence from blinded, placebo-controlled randomised trials to show its efficacy.

Methods: ORBITA is a blinded, multicentre randomised trial of PCI versus a placebo procedure for angina relief that was done at five study sites in the UK. We enrolled patients with severe (≥70%) single-vessel stenoses. After enrolment, patients received 6 weeks of medication optimisation. Patients then had pre-randomisation assessments with cardiopulmonary exercise testing, symptom questionnaires, and dobutamine stress echocardiography. Patients were randomised 1:1 to undergo PCI or a placebo procedure by use of an automated online randomisation tool. After 6 weeks of follow-up, the assessments done before randomisation were repeated at the final assessment. The primary endpoint was difference in exercise time increment between groups. All analyses were based on the intention-to-treat principle and the study population contained all participants who underwent randomisation. This study is registered with ClinicalTrials.gov, number NCT02062593.

Findings: ORBITA enrolled 230 patients with ischaemic symptoms. After the medication optimisation phase and between Jan 6, 2014, and Aug 11, 2017, 200 patients underwent randomisation, with 105 patients assigned PCI and 95 assigned the placebo procedure. Lesions had mean area stenosis of 84.4% (SD 10.2), fractional flow reserve of 0.69 (0.16), and instantaneous wave-free ratio of 0.76 (0.22). There was no statistically-significant difference in the primary endpoint of exercise time increment between groups (PCI minus placebo 16.6 s, 95% CI −8.9 to 42.0, p = 0.200). There were no deaths. Serious adverse events included four pressure-wire related complications in the placebo group, which required PCI, and five major bleeding events, including two in the PCI group and three in the placebo group.

Interpretation: In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure. The efficacy of invasive procedures can be assessed with a placebo control, as is standard for pharmacotherapy.

“A Long Journey to Reproducible Results: Replicating Our Work Took Four Years and 100,000 Worms but Brought Surprising Discoveries”, Lithgow et al 2017

“A long journey to reproducible results: Replicating our work took four years and 100,000 worms but brought surprising discoveries”⁠, Gordon J. Lithgow, Monica Driscoll, Patrick Phillips (2017-08-22; ⁠, ; backlinks; similar):

About 15 years ago, one of us (G.J.L.) got an uncomfortable phone call from a colleague and collaborator. After nearly a year of frustrating experiments, this colleague was about to publish a paper chronicling his team’s inability to reproduce the results of our high-profile paper in a mainstream journal. Our study was the first to show clearly that a drug-like molecule could extend an animal’s lifespan. We had found over and over again that the treatment lengthened the life of a roundworm by as much as 67%. Numerous phone calls and e-mails failed to identify why this apparently simple experiment produced different results between the labs. Then another lab failed to replicate our study. Despite more experiments and additional publications, we couldn’t work out why the labs were getting different lifespan results. To this day, we still don’t know. A few years later, the same scenario played out with different compounds in other labs…In another, now-famous example, two cancer labs spent more than a year trying to understand inconsistencies. It took scientists working side by side on the same tumour biopsy to reveal that small differences in how they isolated cells—vigorous stirring versus prolonged gentle rocking—produced different results. Subtle tinkering has long been important in getting biology experiments to work. Before researchers purchased kits of reagents for common experiments, it wasn’t unheard of for a team to cart distilled water from one institution when it moved to another. Lab members would spend months tweaking conditions until experiments with the new institution’s water worked as well as before. Sources of variation include the quality and purity of reagents, daily fluctuations in microenvironment and the idiosyncratic techniques of investigators. With so many ways of getting it wrong, perhaps we should be surprised at how often experimental findings are reproducible.

…Nonetheless, scores of publications continued to appear with claims about compounds that slow ageing. There was little effort at replication. In 2013, the three of us were charged with that unglamorous task…Our first task, to develop a protocol, seemed straightforward.

But subtle disparities were endless. In one particularly painful teleconference, we spent an hour debating the proper procedure for picking up worms and placing them on new agar plates. Some batches of worms lived a full day longer with gentler technicians. Because a worm’s lifespan is only about 20 days, this is a big deal. Hundreds of e-mails and many teleconferences later, we converged on a technique but still had a stupendous three-day difference in lifespan between labs. The problem, it turned out, was notation—one lab determined age on the basis of when an egg hatched, others on when it was laid. We decided to buy shared batches of reagents from the start. Coordination was a nightmare; we arranged with suppliers to give us the same lot numbers and elected to change lots at the same time. We grew worms and their food from a common stock and had strict rules for handling. We established protocols that included precise positions of flasks in autoclave runs. We purchased worm incubators at the same time, from the same vendor. We also needed to cope with a large amount of data going from each lab to a single database. We wrote an iPad app so that measurements were entered directly into the system and not jotted on paper to be entered later. The app prompted us to include full descriptors for each plate of worms, and ensured that data and metadata for each experiment were proofread (the strain names MY16 and my16 are not the same). This simple technology removed small recording errors that could disproportionately affect statistical analyses.

Once this system was in place, variability between labs decreased. After more than a year of pilot experiments and discussion of methods in excruciating detail, we almost completely eliminated systematic differences in worm survival across our labs (see ‘Worm wonders’)…Even in a single lab performing apparently identical experiments, we could not eliminate run-to-run differences.

…We have found one compound that lengthens lifespan across all strains and species. Most do so in only two or three strains, and often show detrimental effects in others.

“Does Diversity Pay? A Replication of Herring 2009”, Stojmenovska et al 2017

2017-stojmenovska.pdf: “Does Diversity Pay? A Replication of Herring 2009”⁠, Dragana Stojmenovska, Thijs Bol, Thomas Leopold (2017-07-07; ⁠, ; similar):

In an influential article published in the American Sociological Review in 2009, Herring finds that diverse workforces are beneficial for business. His analysis supports seven out of eight hypotheses on the positive effects of gender and racial diversity on sales revenue, number of customers, perceived relative market share, and perceived relative profitability. This comment points out that Herring’s analysis contains two errors. First, missing codes on the outcome variables are treated as substantive codes. Second, two control variables—company size and establishment size—are highly skewed, and this skew obscures their positive associations with the predictor and outcome variables. We replicate Herring’s analysis correcting for both errors. The findings support only one of the original eight hypotheses, suggesting that diversity is inconsequential, rather than beneficial, to business success.

“Impossibly Hungry Judges”, Lakens 2017

“Impossibly Hungry Judges”⁠, Daniël Lakens (2017-07-05; backlinks; similar):

I was listening to a recent Radiolab episode on blame and guilt, where the guest Robert Sapolsky mentioned a famous study on judges handing out harsher sentences before lunch than after lunch…During the podcast, it was mentioned that the percentage of favorable decisions drops from 65% to 0% over the number of cases that are decided on. This sounded unlikely. I looked at Figure 1 from the paper (below), and I couldn’t believe my eyes. Not only is the drop indeed as large as mentioned—it occurs three times in a row over the course of the day, and after a break, it returns to exactly 65%!

…Some people dislike statistics. They are only interested in effects that are so large, you can see them by just plotting the data. This study might seem to be a convincing illustration of such an effect. My goal in this blog is to argue against this idea. You need statistics, maybe especially when effects are so large they jump out at you. When reporting findings, authors should report and interpret effect sizes. An important reason for this is that effects can be impossibly large.

…If hunger had an effect on our mental resources of this magnitude, our society would fall into minor chaos every day at 11:45AM. Or at the very least, our society would have organized itself around this incredibly strong effect of mental depletion. Just like manufacturers take size differences between men and women into account when producing items such as golf clubs or watches, we would stop teaching in the time before lunch, doctors would not schedule surgery, and driving before lunch would be illegal. If a psychological effect is this big, we don’t need to discover it and publish it in a scientific journal—you would already know it exists. Sort of how the “after lunch dip” is a strong and replicable finding that you can feel yourself (and that, as it happens, is directly in conflict with the finding that judges perform better immediately after lunch—surprisingly, the authors don’t discuss the after lunch dip).

…I think it is telling that most psychologists don’t seem to be able to recognize data patterns that are too large to be caused by psychological mechanisms. There are simply no plausible psychological effects that are strong enough to cause the data pattern in the hungry judges study. Implausibility is not a reason to completely dismiss empirical findings, but impossibility is. It is up to authors to interpret the effect size in their study, and to show the mechanism through which an effect that is impossibly large, becomes plausible. Without such an explanation, the finding should simply be dismissed.

“How Gullible Are We? A Review of the Evidence from Psychology and Social Science”, Mercier 2017

2017-mercier.pdf: “How Gullible are We? A Review of the Evidence from Psychology and Social Science”⁠, Hugo Mercier (2017-05-18; ⁠, ⁠, ⁠, ; backlinks; similar):

A long tradition of scholarship, from ancient Greece to Marxism or some contemporary social psychology, portrays humans as strongly gullible—wont to accept harmful messages by being unduly deferent. However, if humans are reasonably well adapted, they should not be strongly gullible: they should be vigilant toward communicated information. Evidence from experimental psychology reveals that humans are equipped with well-functioning mechanisms of epistemic vigilance. They check the plausibility of messages against their background beliefs, calibrate their trust as a function of the source’s competence and benevolence, and critically evaluate arguments offered to them. Even if humans are equipped with well-functioning mechanisms of epistemic vigilance, an adaptive lag might render them gullible in the face of new challenges, from clever marketing to omnipresent propaganda. I review evidence from different cultural domains often taken as proof of strong gullibility: religion, demagoguery, propaganda, political campaigns, advertising, erroneous medical beliefs, and rumors. Converging evidence reveals that communication is much less influential than often believed—that religious proselytizing, propaganda, advertising, and so forth are generally not very effective at changing people’s minds. Beliefs that lead to costly behavior are even less likely to be accepted. Finally, it is also argued that most cases of acceptance of misguided communicated information do not stem from undue deference, but from a fit between the communicated information and the audience’s preexisting beliefs.

[Keywords: epistemic vigilance, gullibility, trust]

“Avoiding Erroneous Citations in Ecological Research: Read Before You Apply”, Šigut et al 2017

2017-sigut.pdf: “Avoiding erroneous citations in ecological research: read before you apply”⁠, Martin Šigut, Hana Šigutová, Petr Pyszko, Aleš Dolný, Michaela Drozdová, Pavel Drozd (2017-04-24; backlinks; similar):

The Shannon-Wiener index is a popular nonparametric metric widely used in ecological research as a measure of species diversity. We used the Web of Science database to examine cases where papers published from 1990 to 2015 mislabeled this index. We provide detailed insights into causes potentially affecting use of the wrong name ‘Weaver’ instead of the correct ‘Wiener’. Basic science serves as a fundamental information source for applied research, so we emphasize the effect of the type of research (applied or basic) on the incidence of the error. Biological research, especially applied studies, increasingly uses indices, even though some researchers have strongly criticized their use. Applied research papers had a higher frequency of the wrong index name than did basic research papers. The mislabeling frequency decreased in both categories over the 25-year period, although the decrease lagged in applied research. Moreover, the index use and mistake proportion differed by region and authors’ countries of origin. Our study also provides insight into citation culture, and results suggest that almost 50% of authors have not actually read their cited sources. Applied research scientists in particular should be more cautious during manuscript preparation, carefully select sources from basic research, and read theoretical background articles before they apply the theories to their research. Moreover, theoretical ecologists should liaise with applied researchers and present their research for the broader scientific community. Researchers should point out known, often-repeated errors and phenomena not only in specialized books and journals but also in widely used and fundamental literature.

“Roosevelt Predicted to Win: Revisiting the 1936 Literary Digest Poll”, Lohr & Brick 2017

2017-lohr.pdf: “Roosevelt Predicted to Win: Revisiting the 1936 Literary Digest Poll”⁠, Sharon L. Lohr, J. Michael Brick (2017-03-31; similar):

The Literary Digest poll of 1936⁠, which incorrectly predicted that Landon would defeat Roosevelt in the 1936 US presidential election⁠, has long been held up as an example of how not to sample.

The sampling frame was constructed from telephone directories and automobile registration lists, and the survey had a 24% response rate⁠. But if information collected by the poll about votes cast in 1932 had been used to weight the results, the poll would have predicted a majority of electoral votes for Roosevelt in 1936, and thus would have correctly predicted the winner of the election.

We explore alternative weighting methods for the 1936 poll and the models that support them. While weighting would have resulted in Roosevelt being projected as the winner, the bias in the estimates is still very large.

We discuss implications of these results for today’s low-response rate surveys and how the accuracy of the modeling might be reflected better than current practice.

…After every election in which polls err, numerous commentators publish articles about what went wrong with the polls. Gallup’s (1938) commentary was of this type. Deming (1986: p. 319) observed that “Dr. George Gallup remarked in a speech one time (after a fiasco) that he made his prediction in advance of the election. Other people, smarter, made their predictions after the election, explaining how it all happened.” In some respects, post hoc explanations view the polling inadequacies as what Deming called a “special cause” attributable to unusual features of that particular election. However, since the outcomes of elections typically differ from the poll estimates by considerably more than sampling error, Deming would argue that this is a system-level, or “common” cause. Our system of assessing the uncertainty of estimates from surveys is inadequate and this needs to be addressed systematically rather than trying to explain what is wrong with a particular outcome. Unfortunately, the main lesson from 1936 has not yet been learned.
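The weighting Lohr & Brick describe amounts to post-stratifying on reported 1932 vote: scale each cell of respondents so that the sample’s 1932 vote shares match the known 1932 result, then re-tabulate 1936 intentions. A toy sketch with invented counts (not the actual Digest data), chosen only to show how a Hoover-heavy sample can flip the predicted winner once reweighted:

```python
# Hypothetical respondent counts by (1932 vote, 1936 intention); not the real Digest data.
# The Digest frame over-represented 1932 Hoover voters; weighting each 1932-vote cell
# back to the known 1932 vote shares shifts the 1936 prediction toward Roosevelt.
sample = {
    ("Roosevelt", "Roosevelt"): 320_000,
    ("Roosevelt", "Landon"):     80_000,
    ("Hoover",    "Roosevelt"): 108_000,
    ("Hoover",    "Landon"):    492_000,
}
actual_1932_share = {"Roosevelt": 0.59, "Hoover": 0.41}  # rough two-party split

n = sum(sample.values())

def predicted_share(weights):
    """Roosevelt's predicted 1936 share under per-cell weights keyed by 1932 vote."""
    votes = {"Roosevelt": 0.0, "Landon": 0.0}
    for (v32, v36), count in sample.items():
        votes[v36] += count * weights[v32]
    return votes["Roosevelt"] / sum(votes.values())

# Unweighted: every respondent counts equally.
unweighted = predicted_share({"Roosevelt": 1.0, "Hoover": 1.0})

# Post-stratified: scale each 1932-vote cell to its known population share.
sample_share = {
    v32: sum(c for (v, _), c in sample.items() if v == v32) / n
    for v32 in actual_1932_share
}
weights = {v32: actual_1932_share[v32] / sample_share[v32] for v32 in actual_1932_share}
weighted = predicted_share(weights)

print(f"Roosevelt share, unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

With these invented numbers the raw poll calls the election for Landon (Roosevelt at 43%) while the post-stratified estimate puts Roosevelt above 50%, mirroring Lohr & Brick’s point that weighting would have named the right winner even though substantial bias remains.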

“Impact of Genetic Background and Experimental Reproducibility on Identifying Chemical Compounds With Robust Longevity Effects”, Lucanic et al 2017

“Impact of genetic background and experimental reproducibility on identifying chemical compounds with robust longevity effects”⁠, Mark Lucanic, W. Todd Plummer, Esteban Chen, Jailynn Harke, Anna C. Foulger, Brian Onken, Anna L. Coleman-Hulbert et al (2017-02-21; backlinks; similar):

Limiting the debilitating consequences of ageing is a major medical challenge of our time. Robust pharmacological interventions that promote healthy ageing across diverse genetic backgrounds may engage conserved longevity pathways. Here we report results from the Caenorhabditis Intervention Testing Program in assessing longevity variation across 22 Caenorhabditis strains spanning 3 species, using multiple replicates collected across three independent laboratories. Reproducibility between test sites is high, whereas individual trial reproducibility is relatively low. Of ten pro-longevity chemicals tested, six statistically-significantly extend lifespan in at least one strain. Three reported dietary restriction mimetics are mainly effective across C. elegans strains, indicating species and strain-specific responses. In contrast, the amyloid dye ThioflavinT is both potent and robust across the strains. Our results highlight promising pharmacological leads and demonstrate the importance of assessing lifespans of discrete cohorts across repeat studies to capture biological variation in the search for reproducible ageing interventions.

“When the Music’s Over. Does Music Skill Transfer to Children’s and Young Adolescents’ Cognitive and Academic Skills? A Meta-analysis”, Sala & Gobet 2017

“When the music’s over. Does music skill transfer to children’s and young adolescents’ cognitive and academic skills? A meta-analysis”⁠, Giovanni Sala, Fernand Gobet (2017-02; ; backlinks; similar):

  • Music training is thought to improve youngsters’ cognitive and academic skills.
  • Results show a small overall effect size (d = 0.16, K = 118).
  • Music training seems to moderately enhance youngsters’ intelligence and memory.
  • The design quality of the studies is negatively related to the size of the effects.
  • Future studies should include random assignment and active control groups.

Music training has been recently claimed to enhance children and young adolescents’ cognitive and academic skills. However, substantive research on transfer of skills suggests that far-transfer—ie. the transfer of skills between 2 areas only loosely related to each other—occurs rarely.

In this meta-analysis, we examined the available experimental evidence regarding the impact of music training on children and young adolescents’ cognitive and academic skills. The results of the random-effects models showed (a) a small overall effect size (d = 0.16); (b) slightly greater effect sizes with regard to intelligence (d = 0.35) and memory-related outcomes (d = 0.34); and (c) an inverse relation between the size of the effects and the methodological quality of the study design.

These results suggest that music training does not reliably enhance children and young adolescents’ cognitive or academic skills, and that previous positive findings were probably due to confounding variables.

[Keywords: music training, transfer, cognitive skills, education, meta-analysis]
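The random-effects models reported are of the standard kind; a minimal DerSimonian-Laird pooling sketch (the study effects and variances below are invented for illustration, not Sala & Gobet’s 118 effect sizes):

```python
import numpy as np

def dersimonian_laird(d, v):
    """Pool standardized effect sizes d with within-study variances v
    under a DerSimonian-Laird random-effects model."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1.0 / v                                  # fixed-effect (inverse-variance) weights
    d_fe = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fe) ** 2)              # Cochran's Q heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)      # between-study variance estimate
    w_re = 1.0 / (v + tau2)                      # random-effects weights
    d_re = np.sum(w_re * d) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return d_re, se, tau2

# Made-up effect sizes (Cohen's d) and variances for 5 hypothetical studies.
effects =   [0.40, 0.10, 0.25, -0.05, 0.15]
variances = [0.02, 0.03, 0.05,  0.04, 0.01]
d_pooled, se, tau2 = dersimonian_laird(effects, variances)
print(f"pooled d = {d_pooled:.2f} ± {1.96 * se:.2f}")
```

The between-study variance τ² is what makes the model “random-effects”: when studies disagree more than sampling error alone predicts, τ² grows and the pooled estimate’s confidence interval widens accordingly.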

“Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials”, Wallach et al 2017

2017-wallach.pdf: “Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials”⁠, Joshua D. Wallach et al (2017-01-01; ; backlinks)

“Does Teaching Children How to Play Cognitively Demanding Games Improve Their Educational Attainment? Evidence from a Randomised Controlled Trial of Chess Instruction in England”, Jerrim 2017

2017-jerrim.pdf: “Does teaching children how to play cognitively demanding games improve their educational attainment? Evidence from a Randomised Controlled Trial of chess instruction in England”⁠, John Jerrim (2017-01-01; ⁠, ; backlinks)

“What Does Any of This Have To Do With Physics? Einstein and Feynman Ushered Me into Grad School, Reality Ushered Me Out”, Henderson 2016

“What Does Any of This Have To Do with Physics? Einstein and Feynman ushered me into grad school, reality ushered me out”⁠, Bob Henderson (2016-12-29; backlinks; similar):

[Memoir of an ex-theoretical-physics grad student at the University of Rochester with Sarada Rajeev who gradually became disillusioned with physics research, burned out, and left to work in finance and is now a writer. Henderson was attracted by the life of the mind and the grandeur of uncovering the mysteries of the universe, only to discover that, after the endless triumphs of the 20th century and predicting enormous swathes of empirical experimental data, theoretical physics has drifted and become a branch of abstract mathematics, exploring ever more recondite, simplified, and implausible models in the hopes of obtaining any insight into physics’ intractable problems; one must be brilliant to even understand the questions being asked by the math and incredibly hardworking to make any progress not already tried by even more brilliant physicists of the past (while living in ignominious poverty and terror of not getting a grant or tenure), but one’s entire career may be spent chasing a useless dead end without one having any clue.]

The next thing I knew I was crouched in a chair in Rajeev’s little office, with a notebook on my knee and focused with everything I had on an impromptu lecture he was giving me on an esoteric aspect of some mathematical subject I’d never heard of before. Zeta functions, or elliptic functions, or something like that. I’d barely introduced myself when he’d started banging out equations on his board. Trying to follow was like learning a new game, with strangely shaped pieces and arbitrary rules. It was a challenge, but I was excited to be talking to a real physicist about his real research, even though there was one big question nagging me that I didn’t dare to ask: What does any of this have to do with physics?

…Even a Theory of Everything, I started to realize, might suffer the same fate of multiple interpretations. The Grail could just be a hall of mirrors, with no clear answer to the “What?” or the “How?”—let alone the “Why?” Plus physics had changed since Big Al bestrode it. Mathematical as opposed to physical intuition had become more central, partly because quantum mechanics was such a strange multi-headed beast that it diminished the role that everyday, or even Einstein-level, intuition could play. So much for my dreams of staring out windows and into the secrets of the universe.

…If I did lose my marbles for a while, this is how it started. With cutting my time outside of Bausch and Lomb down to nine hours a day—just enough to pedal my mountain bike back to my bat cave of an apartment each night, sleep, shower, and pedal back in. With filling my file cabinet with boxes and cans of food, and carting in a coffee maker, mini-fridge, and microwave so that I could maximize the time spent at my desk. With feeling guilty after any day that I didn’t make my 15-hour quota. And with exceeding that quota frequently enough that I regularly circumnavigated the clock: staying later and later each night until I was going home in the morning, then in the afternoon, and finally at night again.

…The longer and harder I worked, the more I realized I didn’t know. Papers that took days or weeks to work through cited dozens more that seemed just as essential to digest; the piles on my desk grew rather than shrunk. I discovered the stark difference between classes and research: With no syllabus to guide me I didn’t know how to keep on a path of profitable inquiry. Getting “wonderfully lost” sounded nice, but the reality of being lost, and of re-living, again and again, that first night in the old woman’s house, with all of its doubts and dead-ends and that horrible hissing voice was … something else. At some point, flipping the lights on in the library no longer filled me with excitement but with dread.

…My mental model building was hitting its limits. I’d sit there in Rajeev’s office with him and his other students, or in a seminar given by some visiting luminary, listening and putting each piece in place, and try to fix in memory what I’d built so far. But at some point I’d lose track of how the green stick connected to the red wheel, or whatever, and I’d realize my picture had diverged from reality. Then I’d try toggling between tracing my steps back in memory to repair my mistake and catching all the new pieces still flying in from the talk. Stray pieces would fall to the ground. My model would start falling down. And I would fall hopelessly behind. A year or so of research with Rajeev, and I found myself frustrated and in a fog, sinking deeper into the quicksand but not knowing why. Was it my lack of mathematical background? My grandiose goals? Was I just not intelligent enough?

…I turned 30 during this time and the milestone hit me hard. I was nearly four years into the Ph.D. program, and while my classmates seemed to be systematically marching toward their degrees, collecting data and writing papers, I had no thesis topic and no clear path to graduation. My engineering friends were becoming managers, getting married, buying houses. And there I was entering my fourth decade of life feeling like a pitiful and penniless mole, aimlessly wandering dark empty tunnels at night, coming home to a creepy crypt each morning with nothing to show for it, and checking my bed for bugs before turning out the lights…As I put the final touches on my thesis, I weighed my options. I was broke, burned out, and doubted my ability to go any further in theoretical physics. But mostly, with The Grail now gone and the physics landscape grown so immense, I thought back to Rajeev’s comment about knowing which problems to solve and realized that I still didn’t know what, for me, they were.

“Rational Judges, Not Extraneous Factors In Decisions”, Stafford 2016

“Rational Judges, Not Extraneous Factors In Decisions”⁠, Tom Stafford (2016-12-08; backlinks; similar):

This seeming evidence of the irrationality of judges has been cited hundreds of times, in economics, psychology and legal scholarship. Now, a new analysis by Andreas Glöckner in the journal Judgment and Decision Making questions these conclusions.

Glöckner’s analysis doesn’t prove that extraneous factors weren’t influencing the judges, but he shows how the same effect could be produced by entirely rational judges interacting with the protocols required by the legal system.

The main analysis works like this: we know that favourable rulings take longer than unfavourable ones (~7 mins vs ~5 mins), and we assume that judges are able to guess how long a case will take to rule on before they begin it (from clues like the thickness of the file, the types of request made, the representation the prisoner has and so on). Finally, we assume judges have a time limit in mind for each of the three sessions of the day, and will avoid starting cases which they estimate will overrun the time limit for the current session.

It turns out that this kind of rational time-management is sufficient to generate the drops in favourable outcomes. How this occurs is not straightforward, and it interacts with a quirk of the original authors’ data presentation: their graph plots the order number of cases even though the number of cases per session varied from day to day. For example, it shows that the 12th case after a break is least likely to be judged favourably, but there was not always a 12th case in a session, so sessions containing more (short) unfavourable cases were more likely to contribute to that data point.
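
The mechanism is easy to demonstrate by simulation. The sketch below is my own minimal version of the rational time-management account, not Glöckner’s code, and the parameter values (60-minute sessions, 35% base rate of favourable rulings) are illustrative rather than taken from the paper:

```python
import random

random.seed(0)

def session_favourable_rates(n_sessions=20_000, session_limit=60,
                             fav_mins=7, unfav_mins=5, p_fav=0.35):
    """Simulate judges who end a session rather than start a case they
    foresee will overrun the time limit (illustrative parameters)."""
    fav_count, total_count = {}, {}
    for _ in range(n_sessions):
        elapsed, position = 0, 0
        while True:
            is_fav = random.random() < p_fav           # case merits a favourable ruling
            duration = fav_mins if is_fav else unfav_mins
            if elapsed + duration > session_limit:     # judge foresees the overrun...
                break                                  # ...and ends the session instead
            elapsed += duration
            position += 1
            fav_count[position] = fav_count.get(position, 0) + is_fav
            total_count[position] = total_count.get(position, 0) + 1
    return {p: fav_count[p] / total_count[p] for p in sorted(total_count)}

rates = session_favourable_rates()
# Early positions hover near the base rate; the last reachable position shows
# a sharp drop in favourable rulings, with no hunger or fatigue involved.
print(rates[1], rates[max(rates)])
```

With these illustrative numbers, a 12th case can only occur after 11 short (unfavourable) cases, and a favourable 12th case never fits in the remaining time, so its favourable rate is exactly 0: the “12th case after a break” artifact described above, produced by entirely rational scheduling.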

“Could a Neuroscientist Understand a Microprocessor?”, Jonas & Kording 2016

“Could a Neuroscientist Understand a Microprocessor?”⁠, Eric Jonas, Konrad Paul Kording (2016-11-14; ⁠, ⁠, ; backlinks; similar):

[Reply to “Can a biologist fix a radio?”⁠; earlier, Doug the biochemist & Bill the geneticist research how cars work] There is a popular belief in neuroscience that we are primarily data limited, and that producing large, multimodal, and complex datasets will, with the help of advanced data analysis algorithms, lead to fundamental insights into the way the brain processes information. These datasets do not yet exist, and if they did we would have no way of evaluating whether or not the algorithmically-generated insights were sufficient or even correct. To address this, here we take a classical microprocessor as a model organism, and use our ability to perform arbitrary experiments on it to see if popular data analysis methods from neuroscience can elucidate the way it processes information. Microprocessors are among those artificial information processing systems that are both complex and that we understand at all levels, from the overall logical flow, via logical gates, to the dynamics of transistors. We show that the approaches reveal interesting structure in the data but do not meaningfully describe the hierarchy of information processing in the microprocessor. This suggests current analytic approaches in neuroscience may fall short of producing meaningful understanding of neural systems, regardless of the amount of data. Additionally, we argue that scientists should use complex non-linear dynamical systems with known ground truth, such as the microprocessor, as validation platforms for time-series and structure discovery methods.

Author Summary

Neuroscience is held back by the fact that it is hard to evaluate if a conclusion is correct; the complexity of the systems under study and their experimental inaccessibility make the assessment of algorithmic and data analytic techniques challenging at best. We thus argue for testing approaches using known artifacts, where the correct interpretation is known. Here we present a microprocessor platform as one such test case. We find that many approaches in neuroscience, when used naively, fall short of producing a meaningful understanding.

“Overconfidence in Personnel Selection: When and Why Unstructured Interview Information Can Hurt Hiring Decisions”, Kausel et al 2016

2016-kausel.pdf: “Overconfidence in personnel selection: When and why unstructured interview information can hurt hiring decisions”⁠, Edgar E. Kausel, Satoris S. Culbertson, Hector P. Madrid (2016-11-01; similar):


  • Individuals responsible for hiring decisions participated in two studies.
  • We manipulated the information presented to them.
  • Information about unstructured interviews boosted overconfidence.
  • A third study showed that overconfidence was linked to lower payoffs.
  • In the presence of valid predictors, unstructured interviews can hurt hiring decisions.

Overconfidence is an important bias related to the ability to recognize the limits of one’s knowledge.

The present study examines overconfidence in predictions of job performance for participants presented with information about candidates based solely on standardized tests versus those who also were presented with unstructured interview information. We conducted two studies with individuals responsible for hiring decisions. Results showed that individuals presented with interview information exhibited more overconfidence than individuals presented with test scores only. In a third study, consisting of a betting competition for undergraduate students, greater overconfidence was related to lower payoffs.

These combined results emphasize the importance of studying confidence and decision-related variables in selection decisions. Furthermore, while previous research has shown that the predictive validity of unstructured interviews is low, this study provides compelling evidence that they not only fail to help personnel selection decisions, but can actually hurt them.

[Keywords: judgment and decision making, behavioral decision theory, overconfidence, hiring decisions, personnel selection, human resource management, Conscientiousness⁠, General Mental Ability, unstructured interviews, evidence-based management]

“A Replication and Methodological Critique of the Study “Evaluating Drug Trafficking on the Tor Network””, Munksgaard et al 2016

2016-munksgaard.pdf: “A replication and methodological critique of the study “Evaluating drug trafficking on the Tor Network””⁠, Rasmus Munksgaard, Jakob Demant, Gwern Branwen (2016-09; ⁠, ; backlinks):

[Debunking a remarkably sloppy darknet market paper which screwed up its scraping and somehow concluded that the notorious Silk Road 2, in defiance of all observable evidence & subsequent FBI data, actually sold primarily e-books and hardly any drugs. This study has yet to be retracted.] The development of cryptomarkets has gained increasing attention from academics, including growing scientific literature on the distribution of illegal goods using cryptomarkets. Dolliver’s 2015 article “Evaluating drug trafficking on the Tor Network: Silk Road 2, the Sequel” addresses this theme by evaluating drug trafficking on one of the most well-known cryptomarkets, Silk Road 2.0. The research on cryptomarkets in general—particularly in Dolliver’s article—poses a number of new questions for methodologies. This commentary is structured around a replication of Dolliver’s original study. The replication study is not based on Dolliver’s original dataset, but on a second dataset collected applying the same methodology. We have found that the results produced by Dolliver differ greatly from our replicated study. While a margin of error is to be expected, the inconsistencies we found are too great to attribute to anything other than methodological issues. The analysis and conclusions drawn from studies using these methods are promising and insightful. However, based on the replication of Dolliver’s study, we suggest that researchers using these methodologies consider these issues: datasets should be made available to other researchers, and methodology and dataset metrics (eg. number of downloaded pages, error logs) should be described thoroughly in the context of webometrics and web crawling.

“Thermoneutrality, Mice, and Cancer: A Heated Opinion”, Hylander & Repasky 2016

“Thermoneutrality, Mice, and Cancer: A Heated Opinion”⁠, Bonnie L. Hylander, Elizabeth A. Repasky (2016-04-22; backlinks; similar):


Several mouse models show statistically-significant differences in experimental outcomes at standard sub-thermoneutral (ST, 22–26°C) versus thermoneutral housing temperatures (TT, 30–32°C), including models of cardiovascular disease, obesity, inflammation and atherosclerosis, graft versus host disease and cancer.

NE levels are higher, anti-tumor immunity is impaired, and tumor growth is statistically-significantly enhanced in mice housed at ST compared to TT. NE levels are reduced, immunosuppression is reversed and tumor growth is slowed by housing mice at TT.

Housing temperature should be reported in every study such that potential sources of data bias or non-reproducibility can be identified.

Our opinion is that any experiment designed to understand tumor biology and/​or having an immune component could potentially have different outcomes in mice housed at ST versus TT and this should be tested.

The ‘mild’ cold stress caused by standard sub-thermoneutral housing temperatures used for laboratory mice in research institutes is sufficient to statistically-significantly bias conclusions drawn from murine models of several human diseases. We review the data leading to this conclusion, discuss the implications for research and suggest ways to reduce problems in reproducibility and experimental transparency caused by this housing variable. We have found that these cool temperatures suppress endogenous immune responses, skewing tumor growth data and the severity of graft versus host disease, and also increase the therapeutic resistance of tumors. Owing to the potential for ambient temperature to affect energy homeostasis as well as adrenergic stress, both of which could contribute to biased outcomes in murine cancer models, housing temperature should be reported in all publications and considered as a potential source of variability in results between laboratories. Researchers and regulatory agencies should work together to determine whether changes in housing parameters would enhance the use of mouse models in cancer research, as well as for other diseases. Finally, for many years agencies such as the National Cancer Institute (NCI) have encouraged the development of newer and more sophisticated mouse models for cancer research, but we believe that, without an appreciation of how basic murine physiology is affected by ambient temperature, even data from these models is likely to be compromised.

[Keywords: thermoneutrality, tumor microenvironment, immunosuppression, energy balance, metabolism, adrenergic stress]

“When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”, Scannell & Bosley 2016

“When Quality Beats Quantity: Decision Theory, Drug Discovery, and the Reproducibility Crisis”⁠, Jack W. Scannell, Jim Bosley (2016-02-10; ; backlinks; similar):

A striking contrast runs through the last 60 years of biopharmaceutical discovery, research, and development. Huge scientific and technological gains should have increased the quality of academic science and raised industrial R&D efficiency. However, academia faces a “reproducibility crisis”; inflation-adjusted industrial R&D costs per novel drug increased nearly 100× between 1950 and 2010; and drugs are more likely to fail in clinical development today than in the 1970s. The contrast is explicable only if powerful headwinds reversed the gains and/​or if many “gains” have proved illusory. However, discussions of reproducibility and R&D productivity rarely address this point explicitly.

The main objectives of the primary research in this paper are: (a) to provide quantitatively and historically plausible explanations of the contrast; and (b) identify factors to which R&D efficiency is sensitive.

We present a quantitative decision-theoretic model of the R&D process [a ‘leaky pipeline’⁠; cf. the log-normal]. The model represents therapeutic candidates (eg. putative drug targets, molecules in a screening library, etc.) within a “measurement space”, with candidates’ positions determined by their performance on a variety of assays (eg. binding affinity, toxicity, in vivo efficacy, etc.) whose results correlate to a greater or lesser degree. We apply decision rules to segment the space, and assess the probability of correct R&D decisions.

We find that when searching for rare positives (eg. candidates that will successfully complete clinical development), changes in the predictive validity of screening and disease models that many people working in drug discovery would regard as small and/​or unknowable (ie. an 0.1 absolute change in correlation coefficient between model output and clinical outcomes in man) can offset large (eg. 10×, even 100×) changes in models’ brute-force efficiency. We also show how validity and reproducibility correlate across a population of simulated screening and disease models.
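
This sensitivity claim can be illustrated with a back-of-envelope normal-theory sketch (my own illustrative numbers and approximations, not the paper’s model): true candidate quality is standard normal, the screen’s score correlates r with it, “rare positives” are the top 1% of true quality, and we ask how often the single best-scoring candidate is a true positive.

```python
import math
from statistics import NormalDist

N01 = NormalDist()

def expected_max_z(n):
    """Approximate expected maximum of n standard normal draws."""
    b = math.sqrt(2 * math.log(n))
    return b - (math.log(math.log(n)) + math.log(4 * math.pi)) / (2 * b)

def hit_prob(r, n, z_positive=2.326):
    """P(best-scoring candidate is a true 'rare positive', ie. top-1% quality),
    when the screen correlates r with true quality and n candidates are screened."""
    s = expected_max_z(n)            # screening score of the selected candidate
    mean = r * s                     # E[true quality | screening score]
    sd = math.sqrt(1 - r * r)
    return 1 - N01.cdf((z_positive - mean) / sd)

# A 0.1 gain in predictive validity roughly offsets a 10x gain in throughput:
better_assay = hit_prob(r=0.5, n=1_000)   # more valid screen, fewer candidates
brute_force = hit_prob(r=0.4, n=10_000)   # weaker screen, 10x the screening
print(round(better_assay, 3), round(brute_force, 3))
```

Under these assumptions both probabilities come out near 0.18: screening ten times as many candidates with the weaker assay buys essentially nothing over a modestly more valid assay, which is the paper’s point about predictive validity dominating brute-force efficiency.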

We hypothesize that screening and disease models with high predictive validity are more likely to yield good answers and good treatments, so tend to render themselves and their diseases academically and commercially redundant. Perhaps there has also been too much enthusiasm for reductionist molecular models which have insufficient predictive validity. Thus we hypothesize that the average predictive validity of the stock of academically and industrially “interesting” screening and disease models has declined over time, with even small falls able to offset large gains in scientific knowledge and brute-force efficiency. The rate of creation of valid screening and disease models may be the major constraint on R&D productivity.

“Ethnic Discrimination in Hiring Decisions: a Meta-analysis of Correspondence Tests 1990–2015”, Zschirnt & Ruedin 2016

2016-zschirnt.pdf: “Ethnic discrimination in hiring decisions: a meta-analysis of correspondence tests 1990–2015”⁠, Eva Zschirnt, Didier Ruedin (2016-01-22; similar):

For almost 50 years field experiments have been used to study ethnic and racial discrimination in hiring decisions, consistently reporting high rates of discrimination against minority applicants—including immigrants—irrespective of time, location, or minority groups tested. While Peter A. Riach and Judith Rich [2002. “Field Experiments of Discrimination in the Market Place.” The Economic Journal 112 (483): F480–F518] and Judith Rich [2014. “What Do Field Experiments of Discrimination in Markets Tell Us? A Meta Analysis of Studies Conducted since 2000.” In Discussion Paper Series. Bonn: IZA] provide systematic reviews of existing field experiments, no study has undertaken a meta-analysis to examine the findings in the studies reported. In this article, we present a meta-analysis of 738 correspondence tests in 43 separate studies conducted in OECD countries between 1990 and 2015. In addition to summarising research findings, we focus on groups of specific tests to ascertain the robustness of findings, emphasising differences across countries, gender, and economic contexts. Moreover we examine patterns of discrimination, by drawing on the fact that the groups considered in correspondence tests and the contexts of testing vary to some extent. We focus on first-generation and second-generation immigrants, differences between specific minority groups, the implementation of EU directives, and the length of job application packs.

[Keywords: Ethnic discrimination, hiring, correspondence test, meta-analysis, immigration]

“Looking Across and Looking Beyond the Knowledge Frontier: Intellectual Distance, Novelty, and Resource Allocation in Science”, Boudreau et al 2016

2016-boudreau.pdf: “Looking Across and Looking Beyond the Knowledge Frontier: Intellectual Distance, Novelty, and Resource Allocation in Science”⁠, Kevin J. Boudreau, Eva C. Guinan, Karim R. Lakhani, Christoph Riedl (2016-01-08; ⁠, ; backlinks; similar):

Selecting among alternative projects is a core management task in all innovating organizations. In this paper, we focus on the evaluation of frontier scientific research projects. We argue that the “intellectual distance” between the knowledge embodied in research proposals and an evaluator’s own expertise systematically relates to the evaluations given. To estimate relationships, we designed and executed a grant proposal process at a leading research university in which we randomized the assignment of evaluators and proposals to generate 2,130 evaluator-proposal pairs. We find that evaluators systematically give lower scores to research proposals that are closer to their own areas of expertise and to those that are highly novel. The patterns are consistent with biases associated with boundedly rational evaluation of new ideas. The patterns are inconsistent with intellectual distance simply contributing “noise” or being associated with private interests of evaluators. We discuss implications for policy, managerial intervention, and allocation of resources in the ongoing accumulation of scientific knowledge.

“Discontinuation and Nonpublication of Randomized Clinical Trials Conducted in Children”

2016-pica.pdf: “Discontinuation and Nonpublication of Randomized Clinical Trials Conducted in Children” (2016-01-01; ; backlinks)

“Is There a Publication Bias in Behavioral Intranasal Oxytocin Research on Humans? Opening the File Drawer of One Lab”, Lane et al 2016

2016-lane.pdf: “Is there a publication bias in behavioral intranasal oxytocin research on humans? Opening the file drawer of one lab”⁠, A. Lane, O. Luminet, G. Nave, M. Mikolajczak (2016-01-01; backlinks)

“The Use and Abuse of Transcranial Magnetic Stimulation to Modulate Corticospinal Excitability in Humans”, Héroux et al 2015

“The Use and Abuse of Transcranial Magnetic Stimulation to Modulate Corticospinal Excitability in Humans”⁠, Martin E. Héroux, Janet L. Taylor, Simon C. Gandevia (2015-11-13; ; similar):

The magnitude and direction of reported physiological effects induced using transcranial magnetic stimulation (TMS) to modulate human motor cortical excitability have proven difficult to replicate routinely. We conducted an online survey on the prevalence and possible causes of these reproducibility issues. A total of 153 researchers were identified via their publications and invited to complete an anonymous internet-based survey that asked about their experience trying to reproduce published findings for various TMS protocols. The prevalence of questionable research practices known to contribute to low reproducibility was also determined. We received 47 completed surveys from researchers with an average of 16.4 published papers (95% CI 10.8–22.0) that used TMS to modulate motor cortical excitability. Respondents also had a mean of 4.0 (2.5–5.7) relevant completed studies that would never be published. Across a range of TMS protocols, 45–60% of respondents found similar results to those in the original publications; the other respondents were able to reproduce the original effects only sometimes or not at all. Only 20% of respondents used formal power calculations to determine study sample sizes. Others relied on previously published studies (25%), personal experience (24%) or flexible post-hoc criteria (41%). ~44% of respondents knew researchers who engaged in questionable research practices (range 32–70%), yet only 18% admitted to engaging in them (range 6–38%). These practices included screening subjects to find those that respond in a desired way to a TMS protocol, selectively reporting results and rejecting data based on a gut feeling. In a sample of 56 published papers that were inspected, not a single questionable research practice was reported. Our survey revealed that ~50% of researchers are unable to reproduce published TMS effects. 
Researchers need to start increasing study sample size and eliminating—or at least reporting—questionable research practices in order to make the outcomes of TMS research reproducible.

“Estimating the Reproducibility of Psychological Science”, Collaboration 2015

2015-opensciencecollaboration.pdf: “Estimating the reproducibility of psychological science”⁠, Open Science Collaboration (2015-08-28; backlinks; similar):

Empirically analyzing empirical evidence: One of the central goals in any scientific endeavor is to understand causality. Experiments that seek to demonstrate a cause/​effect relation most often manipulate the postulated causal factor. Aarts et al 2015 describe the replication of 100 experiments reported in papers published in 2008 in 3 high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study.

Introduction: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.

Rationale: There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.

Results: We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using statistical-significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. 97% of original studies had statistically-significant results (p < 0.05). 36% of replications had statistically-significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically-significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Conclusion: No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.

Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.

Figure 1: Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by statistically-significant (blue) and non-statistically-significant (red) effects.

“Hydrocephalus and Intelligence: The Hollow Men”, Branwen 2015

Hydrocephalus: “Hydrocephalus and Intelligence: The Hollow Men”⁠, Gwern Branwen (2015-07-28; ⁠, ⁠, ; backlinks; similar):

Some claim the disease hydrocephalus reduces brain size by 95% but often with normal or even above-average intelligence, and thus brains aren’t really necessary. Neither is true.

Hydrocephalus is a damaging brain disorder where fluids compress the brain, sometimes drastically decreasing its volume. While often extremely harmful or life-threatening when untreated, some people with severe compression nevertheless are relatively normal, and in one case (Lorber) they have been claimed to have IQs as high as 126 with a brain volume 5% of normal brains. A few of these case studies have been used to argue the extraordinary claim that brain volume has little or nothing to do with intelligence; authors have argued that hydrocephalus suggests enormous untapped cognitive potential which are tapped into rarely for repairs and can boost intelligence on net, or that intelligence/​consciousness are non-material or tapping into ESP.

I point out why this claim is almost certainly untrue because it predicts countless phenomena we never observe, and investigate the claimed examples in more detail: the cases turn out to be suspiciously unverifiable (Lorber), likely fraudulent (Oliveira), or actually low intelligence (Feuillet). It is unclear if high-functioning cases of hydrocephalus even have less brain mass, as opposed to lower proxy measures like brain volume.

I then summarize anthropologist John Hawks’s criticisms of the original hydrocephalus author: his brain imaging data could not have been as precise as claimed, he studied a selective sample, the story of the legendary IQ 126 hydrocephalus patient raises questions as to how normal or intelligent he really was, and hydrocephalus in general appears to be no more anomalous or hard-to-explain than many other kinds of brain injuries, and in a comparison, hemispherectomies, removing or severing a hemisphere, has produced no anomalous reports of above-average intelligence (just deficits), though they ought to be just the same in terms of repairs or ESP.

That hydrocephalus cases can reach roughly normal levels of functioning, various deficits aside, can be explained by brain size being only one of several relevant variables, by brain plasticity enabling cognitive flexibility & recovery from gradually-developing conditions, and by overparameterization giving robustness to damage and poor environments while preserving learning ability. The field of deep learning has observed similar phenomena in the training of artificial neural networks. This is consistent with Lorber’s original contention that the brain is more robust, and hydrocephalus more treatable, than commonly accepted, but does not support any of the more exotic interpretations since put on his findings.

In short, there is little anomalous to explain, and standard brain-centric accounts appear to account for existing verified observations without much problem or resort to extraordinary claims.

“Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time”, Kaplan & Irvin 2015

“Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time”⁠, Robert M. Kaplan, Veronica L. Irvin (2015-05-21; backlinks; similar):

Background: We explore whether the number of null results in large National Heart Lung, and Blood Institute (NHLBI) funded trials has increased over time.

Methods: We identified all large NHLBI-supported RCTs between 1970 and 2012 evaluating drugs or dietary supplements for the treatment or prevention of cardiovascular disease. Trials were included if direct costs were >$500,000/year (2015 dollars; ≈$630,051 inflation-adjusted), participants were adult humans, and the primary outcome was cardiovascular risk, disease or death. The 55 trials meeting these criteria were coded for whether they were published prior to or after the year 2000, whether they were registered in prior to publication, whether they used an active or placebo comparator, and whether or not the trial had industry co-sponsorship. We tabulated whether the study reported a positive, negative, or null result on the primary outcome variable and for total mortality.

Results: 17 of 30 studies (57%) published prior to 2000 showed a statistically-significant benefit of intervention on the primary outcome, in comparison to only 2 among the 25 trials (8%) published after 2000 (χ² = 12.2, df = 1, p = 0.0005). There has been no change in the proportion of trials that compared treatment to placebo versus an active comparator. Industry co-sponsorship was unrelated to the probability of reporting a statistically-significant benefit. Pre-registration in was strongly associated with the trend toward null findings.
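
The reported test statistic can be checked directly from the counts given (17/30 positive trials before 2000 vs 2/25 after): a 2×2 chi-squared test with Yates’ continuity correction, sketched here in plain Python, reproduces the paper’s χ² = 12.2.

```python
def yates_chi2(a, b, c, d):
    """Chi-squared statistic for a 2x2 table [[a, b], [c, d]],
    with Yates' continuity correction."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Rows: published before vs after 2000; columns: positive vs non-positive outcome.
chi2 = yates_chi2(17, 13, 2, 23)
print(round(chi2, 1))  # → 12.2
```

The same value is what `scipy.stats.chi2_contingency` returns by default for a 2×2 table, since it applies the Yates correction automatically.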

Conclusions: The number of NHLBI trials reporting positive results declined after the year 2000. Prospective declaration of outcomes in RCTs, and the adoption of transparent reporting standards, as required by, may have contributed to the trend toward null findings.

Figure 1: Relative risk of showing benefit or harm of treatment by year of publication for large NHLBI trials of pharmaceutical and dietary-supplement interventions. Positive trials are indicated by plus signs, while trials showing harm are indicated by a diagonal line within a circle. Prior to 2000, when trials were not registered in, there was substantial variability in outcome. Following the imposition of the requirement that trials preregister in, the relative risk on primary outcomes showed considerably less variability around 1.0.

“Small Telescopes: Detectability and the Evaluation of Replication Results”, Simonsohn 2015

2015-simonsohn.pdf: “Small Telescopes: Detectability and the Evaluation of Replication Results”⁠, Uri Simonsohn (2015-03-23; backlinks; similar):

This article introduces a new approach for evaluating replication results. It combines effect-size estimation with hypothesis testing, assessing the extent to which the replication results are consistent with an effect size big enough to have been detectable in the original study. The approach is demonstrated by examining replications of three well-known findings. Its benefits include the following: (a) differentiating “unsuccessful” replication attempts (ie. studies yielding p > .05) that are too noisy from those that actively indicate the effect is undetectably different from zero, (b) “protecting” true findings from underpowered replications, and (c) arriving at intuitively compelling inferences in general and for the revisited replications in particular.
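
For a correlation, the core calculation is simple enough to sketch with the Fisher z approximation: find r₃₃, the effect size the original study had 33% power to detect, then test (one-sided) whether the replication effect is significantly smaller than it. This is a sketch of the idea rather than Simonsohn’s exact code, and the sample sizes in the example are hypothetical:

```python
import math
from statistics import NormalDist

N01 = NormalDist()

def r33(n_orig, alpha=0.05, power=1/3):
    """Correlation the original study had 33% power to detect
    (two-sided test, Fisher z approximation)."""
    z_crit = N01.inv_cdf(1 - alpha / 2)
    z_effect = (z_crit + N01.inv_cdf(power)) / math.sqrt(n_orig - 3)
    return math.tanh(z_effect)

def small_telescope_p(r_rep, n_rep, n_orig):
    """One-sided p-value for replication effect < r33: a small p says the effect
    is too small to have been detectable by the original study's 'telescope'."""
    z = (math.atanh(r_rep) - math.atanh(r33(n_orig))) * math.sqrt(n_rep - 3)
    return N01.cdf(z)

# Hypothetical numbers: original n = 100, replication finds r = 0.05 with n = 300.
print(round(r33(100), 3))                           # → 0.154
print(round(small_telescope_p(0.05, 300, 100), 3))  # → 0.035
```

Here the replication’s p is below 0.05, so under this criterion the effect is “undetectably different from zero” in Simonsohn’s sense, rather than merely a noisy p > 0.05 non-replication.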

“Low-dose Paroxetine Exposure Causes Lifetime Declines in Male Mouse Body Weight, Reproduction and Competitive Ability As Measured by the Novel Organismal Performance Assay”, Ruff et al 2015

2015-gaukler.pdf: “Low-dose paroxetine exposure causes lifetime declines in male mouse body weight, reproduction and competitive ability as measured by the novel organismal performance assay”⁠, James S. Ruff, Tessa Galland, Kirstie A. Kandaris, Tristan K. Underwood, Nicole M. Liu, Elizabeth L. Young et al (2015; ⁠, ⁠, ; backlinks; similar):

Paroxetine is a selective serotonin reuptake inhibitor (SSRI) that is currently available on the market and is suspected of causing congenital malformations in babies born to mothers who take the drug during the first trimester of pregnancy.

We utilized organismal performance assays (OPAs), a novel toxicity assessment method, to assess the safety of paroxetine during pregnancy in a rodent model. OPAs utilize genetically diverse wild mice (Mus musculus) to evaluate competitive performance between experimental and control animals as they compete amongst each other for limited resources in semi-natural enclosures. Performance measures included reproductive success, male competitive ability and survivorship.

Paroxetine-exposed males weighed 13% less, had 44% fewer offspring, dominated 53% fewer territories and experienced a 2.5-fold increased trend in mortality, when compared with controls. Paroxetine-exposed females had 65% fewer offspring early in the study, but rebounded at later time points. In cages, paroxetine-exposed breeders took 2.3× longer to produce their first litter and pups of both sexes experienced reduced weight when compared with controls. Low-dose paroxetine-induced health declines detected in this study were undetected in preclinical trials with doses 2.5–8× higher than human therapeutic doses.

These data indicate that OPAs detect phenotypic adversity and provide unique information that could be useful for safety testing during pharmaceutical development.

[Keywords: intraspecific competition, pharmacodynamics, reproductive success, semi-natural enclosures, SSRI, toxicity assessment.]

“Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot”, Klein et al 2014

“Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot”⁠, Martin Klein, Herbert Van de Sompel, Robert Sanderson, Harihar Shankar, Lyudmila Balakireva, Ke Zhou et al (2014-12-26; ; backlinks; similar):

The emergence of the web has fundamentally affected most aspects of information communication, including scholarly communication. The immediacy that characterizes publishing information to the web, as well as accessing it, allows for a dramatic increase in the speed of dissemination of scholarly knowledge. But, the transition from a paper-based to a web-based scholarly communication system also poses challenges. In this paper, we focus on reference rot, the combination of link rot and content drift to which references to web resources included in Science, Technology, and Medicine (STM) articles are subject. We investigate the extent to which reference rot impacts the ability to revisit the web context that surrounds STM articles some time after their publication. We do so on the basis of a vast collection of articles from three corpora that span publication years 1997 to 2012. For over one million references to web resources extracted from over 3.5 million articles, we determine whether the HTTP URI is still responsive on the live web and whether web archives contain an archived snapshot representative of the state the referenced resource had at the time it was referenced. We observe that the fraction of articles containing references to web resources is growing steadily over time. We find one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten. We suggest that, in order to safeguard the long-term integrity of the web-based scholarly record, robust solutions to combat the reference rot problem are required. In conclusion, we provide a brief insight into the directions that are explored with this regard in the context of the Hiberlink project.

“The Corrupted Epidemiological Evidence Base of Psychiatry: A Key Driver of Overdiagnosis”, Raven 2014

2014-raven.pdf: “The Corrupted Epidemiological Evidence Base of Psychiatry: A Key Driver of Overdiagnosis”⁠, Melissa Raven (2014-09-17)

“Association Between Analytic Strategy and Estimates of Treatment Outcomes in Meta-analyses”, Dechartres et al 2014

“Association Between Analytic Strategy and Estimates of Treatment Outcomes in Meta-analyses”⁠, Agnes Dechartres, Douglas G. Altman, Ludovic Trinquart, Isabelle Boutron, Philippe Ravaud (2014-08-13; ; similar):

Importance: A persistent dilemma when performing meta-analyses is whether all available trials should be included in the meta-analysis.

Objectives: To compare treatment outcomes estimated by meta-analysis of all trials and several alternative analytic strategies: single most precise trial (ie. trial with the narrowest confidence interval), meta-analysis restricted to the 25% largest trials, limit meta-analysis (a meta-analysis model adjusted for small-study effect), and meta-analysis restricted to trials at low overall risk of bias.

Data Sources: 163 meta-analyses published between 2008 and 2010 in high-impact-factor journals and between 2011 and 2013 in the Cochrane Database of Systematic Reviews: 92 (705 randomized clinical trials [RCTs]) with subjective outcomes and 71 (535 RCTs) with objective outcomes.

Data Synthesis: For each meta-analysis, the difference in treatment outcomes between meta-analysis of all trials and each alternative strategy, expressed as a ratio of odds ratios (ROR), was assessed considering the dependency between strategies. A difference greater than 30% was considered substantial. RORs were combined by random-effects meta-analysis models to obtain an average difference across the sample. An ROR greater than 1 indicates larger treatment outcomes with meta-analysis of all trials. Subjective and objective outcomes were analyzed separately.

Results: Treatment outcomes were larger in the meta-analysis of all trials than in the single most precise trial (combined ROR, 1.13 [95% CI, 1.07–1.19]) for subjective outcomes and 1.03 (95% CI, 1.01–1.05) for objective outcomes. The difference in treatment outcomes between these strategies was substantial in 47 of 92 (51%) meta-analyses of subjective outcomes (meta-analysis of all trials showing larger outcomes in 40⁄47) and in 28 of 71 (39%) meta-analyses of objective outcomes (meta-analysis of all trials showing larger outcomes in 21⁄28). The combined ROR for subjective and objective outcomes was, respectively, 1.08 (95% CI, 1.04–1.13) and 1.03 (95% CI, 1.00–1.06) when comparing meta-analysis of all trials and meta-analysis of the 25% largest trials, 1.17 (95% CI, 1.11–1.22) and 1.13 (95% CI, 0.82–1.55) when comparing meta-analysis of all trials and limit meta-analysis, and 0.94 (95% CI, 0.86–1.04) and 1.03 (95% CI, 1.00–1.06) when comparing meta-analysis of all trials and meta-analysis restricted to trials at low risk of bias.

Conclusions & Relevance: Estimation of treatment outcomes in meta-analyses differs depending on the strategy used. This instability in findings can result in major alterations in the conclusions derived from the analysis and underlines the need for systematic sensitivity analyses. [discussion]
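
The ROR comparison at the heart of the analysis is simple to state in code; a minimal sketch with hypothetical odds ratios (not figures from the paper):

```python
def ratio_of_odds_ratios(or_all_trials, or_alternative, threshold=0.30):
    """Ratio of odds ratios (ROR) comparing two analytic strategies.

    ROR > 1 means the meta-analysis of all trials estimates a larger
    treatment effect than the alternative strategy; a difference of more
    than `threshold` (30%) in either direction counts as 'substantial'.
    """
    ror = or_all_trials / or_alternative
    substantial = ror > 1 + threshold or ror < 1 / (1 + threshold)
    return ror, substantial

# Hypothetical: all-trials OR = 0.60 vs single-most-precise-trial OR = 0.75:
r, s = ratio_of_odds_ratios(0.60, 0.75)
print(round(r, 2), s)  # → 0.8 False (a 20% difference: not 'substantial')
```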

“Practice Does Not Make Perfect: No Causal Effect of Music Practice on Music Ability”, Mosing et al 2014

2014-mosing.pdf: “Practice Does Not Make Perfect: No Causal Effect of Music Practice on Music Ability”⁠, Miriam A. Mosing, Guy Madison, Nancy L. Pedersen, Ralf Kuja-Halkola, Fredrik Ullén (2014-07-30; ⁠, ; backlinks; similar):

The relative importance of nature and nurture for various forms of expertise has been intensely debated. Music proficiency is viewed as a general model for expertise, and associations between deliberate practice and music proficiency have been interpreted as supporting the prevailing idea that long-term deliberate practice inevitably results in increased music ability.

Here, we examined the associations (rs = 0.18–0.36) between music practice and music ability (rhythm, melody, and pitch discrimination) in 10,500 Swedish twins. We found that music practice was substantially heritable (40%–70%). Associations between music practice and music ability were predominantly genetic, and, contrary to the causal hypothesis, nonshared environmental influences did not contribute. There was no difference in ability within monozygotic twin pairs differing in their amount of practice, so that when genetic predisposition was controlled for, more practice was no longer associated with better music skills.

These findings suggest that music practice may not causally influence music ability and that genetic variation among individuals affects both ability and inclination to practice.

[Keywords: training, expertise, music ability, practice, heritability, twin, causality]

“Deliberate Practice: Is That All It Takes to Become an Expert?”, Hambrick et al 2014

2014-hambrick.pdf: “Deliberate practice: Is that all it takes to become an expert?”⁠, David Z. Hambrick, Frederick L. Oswald, Erik M. Altmann, Elizabeth J. Meinz, Fernand Gobet, Guillermo Campitelli et al (2014-07; ⁠, ⁠, ; backlinks):

  • Ericsson and colleagues argue that deliberate practice explains expert performance.
  • We tested this view in the two most studied domains in expertise research.
  • Deliberate practice is not sufficient to explain expert performance.
  • Other factors must be considered to advance the science of expertise.

Twenty years ago, Ericsson et al 1993 proposed that expert performance reflects a long period of deliberate practice rather than innate ability, or “talent”. Ericsson et al 1993 found that elite musicians had accumulated thousands of hours more deliberate practice than less accomplished musicians, and concluded that their theoretical framework could provide “a sufficient account of the major facts about the nature and scarcity of exceptional performance” (p. 392). The deliberate practice view has since gained popularity as a theoretical account of expert performance, but here we show that deliberate practice is not sufficient to explain individual differences in performance in the two most widely studied domains in expertise research—chess and music. For researchers interested in advancing the science of expert performance, the task now is to develop and rigorously test theories that take into account as many potentially relevant explanatory constructs as possible.

[Keywords: Expert performance, Expertise, Deliberate practice, Talent]

“Leprechaun Hunting & Citogenesis”, Branwen 2014

Leprechauns: “Leprechaun Hunting & Citogenesis”⁠, Gwern Branwen (2014-06-30; ; backlinks; similar):

Many claims, about history in particular, turn out to be false when traced back to their origins, and form kinds of academic urban legends. These “leprechauns” are particularly pernicious because they are often widely-repeated due to their apparent trustworthiness, yet difficult to research & debunk due to the difficulty of following deeply-nested chains of citations through ever more obscure sources. This page lists instances I have run into.

A major source of leprechaun transmission is the frequency with which researchers do not read the papers they cite: because they do not read them, they repeat misstatements or add their own errors, further transforming the leprechaun and adding another link in the chain to anyone seeking the original source. This can be quantified by checking statements against the original paper, and examining the spread of typos in citations: someone reading the original will fix a typo in the usual citation, or is unlikely to make the same typo, and so will not repeat it. Both methods indicate high rates of non-reading, explaining how leprechauns can propagate so easily.

“Why Correlation Usually ≠ Causation”, Branwen 2014

Causality: “Why Correlation Usually ≠ Causation”⁠, Gwern Branwen (2014-06-24; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Correlations are oft interpreted as evidence for causation; this is oft falsified; do causal graphs explain why this is so common, because the number of possible indirect paths greatly exceeds the direct paths necessary for useful manipulation?

It is widely understood that statistical correlation between two variables ≠ causation. Despite this admonition, people are overconfident in claiming correlations to support favored causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimate the prevalence of confounds / common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in realistic DAGs but toy models, the imbalance also explains overconfidence.
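
The conjecture is easy to probe with a toy simulation over random DAGs; a rough sketch (parameters are arbitrary, and it counts only causation and confounding, ignoring colliders and selection effects):

```python
import itertools
import random

def random_dag(n, p, rng):
    """Random DAG on nodes 0..n-1: edge i->j exists (i < j) with probability p."""
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}

def ancestors(node, edges):
    """All nodes with a directed path into `node`."""
    anc, frontier = set(), {node}
    while frontier:
        frontier = {i for (i, j) in edges if j in frontier} - anc
        anc |= frontier
    return anc

def correlated_vs_causal(n=15, p=0.1, trials=100, seed=0):
    """Count node pairs that are (generically) correlated -- one causes the
    other, or they share a common ancestor (confounding) -- versus pairs
    where one actually causes the other."""
    rng = random.Random(seed)
    causal = correlated = 0
    for _ in range(trials):
        edges = random_dag(n, p, rng)
        anc = {v: ancestors(v, edges) for v in range(n)}
        for x, y in itertools.combinations(range(n), 2):
            causes = x in anc[y] or y in anc[x]
            confounded = bool(anc[x] & anc[y])
            if causes or confounded:
                correlated += 1
                causal += causes
    return causal, correlated
```

Here `causal / correlated` is the fraction of generic correlations that reflect direct or indirect causation between the pair; the remainder are pure confounding via shared ancestors.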

“Movie Reviews”, Branwen 2014

Movies: “Movie Reviews”⁠, Gwern Branwen (2014-05-01; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A compilation of movie, television, and opera reviews since 2014.

This is a compilation of my film/​television/​theater reviews; it is compiled from my newsletter⁠. Reviews are sorted by rating in descending order.

See also my book & anime/manga reviews.

“The Control Group Is Out Of Control”, Alexander 2014

“The Control Group Is Out Of Control”⁠, Scott Alexander (2014-04-28; backlinks; similar):

Allan Crossman calls parapsychology the control group for science⁠. That is, in let’s say a drug testing experiment, you give some people the drug and they recover. That doesn’t tell you much until you give some other people a placebo drug you know doesn’t work—but which they themselves believe in—and see how many of them recover. That number tells you how many people will recover whether the drug works or not. Unless people on your real drug do substantially better than people on the placebo drug, you haven’t found anything. On the meta-level, you’re studying some phenomenon and you get some positive findings. That doesn’t tell you much until you take some other researchers who are studying a phenomenon you know doesn’t exist—but which they themselves believe in—and see how many of them get positive findings. That number tells you how many studies will discover positive results whether the phenomenon is real or not. Unless studies of the real phenomenon do substantially better than studies of the placebo phenomenon, you haven’t found anything.

Trying to set up placebo science would be a logistical nightmare. You’d have to find a phenomenon that definitely doesn’t exist, somehow convince a whole community of scientists across the world that it does, and fund them to study it for a couple of decades without them figuring it out.

Luckily we have a natural experiment in terms of parapsychology—the study of psychic phenomena—which most reasonable people believe don’t exist, but which a community of practicing scientists believes in and publishes papers on all the time. The results are pretty dismal. Parapsychologists are able to produce experimental evidence for psychic phenomena about as easily as normal scientists are able to produce such evidence for normal, non-psychic phenomena. This suggests the existence of a very large “placebo effect” in science—ie with enough energy focused on a subject, you can always produce “experimental evidence” for it that meets the usual scientific standards. As Eliezer Yudkowsky puts it:

Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored—that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter.

“The Chrysalis Effect: How Ugly Initial Results Metamorphosize Into Beautiful Articles”, O’Boyle et al 2014

2014-oboyle.pdf: “The Chrysalis Effect: How Ugly Initial Results Metamorphosize Into Beautiful Articles”⁠, Ernest Hugh O’Boyle, Jr., George Christopher Banks, Erik Gonzalez-Mulé (2014-03-19; backlinks; similar):

The issue of a published literature not representative of the population of research is most often discussed in terms of entire studies being suppressed. However, alternative sources of publication bias are questionable research practices (QRPs) that entail post hoc alterations of hypotheses to support data or post hoc alterations of data to support hypotheses. Using general strain theory as an explanatory framework, we outline the means, motives, and opportunities for researchers to better their chances of publication independent of rigor and relevance. We then assess the frequency of QRPs in management research by tracking differences between dissertations and their resulting journal publications. Our primary finding is that from dissertation to journal article, the ratio of supported to unsupported hypotheses more than doubled (0.82 to 1.00 versus 1.94 to 1.00). The rise in predictive accuracy resulted from the dropping of statistically nonsignificant hypotheses, the addition of statistically-significant hypotheses, the reversing of predicted direction of hypotheses, and alterations to data. We conclude with recommendations to help mitigate the problem of an unrepresentative literature that we label the “Chrysalis Effect.”

“Identifying The Effect Of Open Access On Citations Using A Panel Of Science Journals”, McCabe & Snyder 2014

2014-mccabe.pdf: “Identifying The Effect Of Open Access On Citations Using A Panel Of Science Journals”⁠, Mark J. McCabe, Christopher M. Snyder (2014-02-20; backlinks; similar):

An open-access journal allows free online access to its articles, obtaining revenue from fees charged to submitting authors or from institutional support. Using panel data on science journals, we are able to circumvent problems plaguing previous studies of the impact of open access on citations. In contrast to the huge effects found in these previous studies, we find a more modest effect: moving from paid to open access increases cites by 8% on average in our sample. The benefit is concentrated among top-ranked journals. In fact, open access causes a statistically-significant reduction in cites to the bottom-ranked journals in our sample, leading us to conjecture that open access may intensify competition among articles for readers’ attention, generating losers as well as winners. [See also the 2020 followup by the same authors, “Cite Unseen: Theory and Evidence on the Effect of Open Access on Cites to Academic Articles Across the Quality Spectrum”⁠.]

“Behavior Genetic Research Methods: Testing Quasi-Causal Hypotheses Using Multivariate Twin Data”, Turkheimer & Harden 2014

2014-turkheimer.pdf: “Behavior Genetic Research Methods: Testing Quasi-Causal Hypotheses Using Multivariate Twin Data”⁠, Eric Turkheimer, K. Paige Harden (2014-01-01; ; backlinks)

“Open Access to Data: An Ideal Professed but Not Practised”, Andreoli-Versbach & Mueller-Langer 2014

2014-andreoliversbach.pdf: “Open Access to Data: An Ideal Professed but Not Practised”⁠, Patrick Andreoli-Versbach, Frank Mueller-Langer (2014; backlinks; similar):

Data-sharing is an essential tool for replication, validation and extension of empirical results. Using a hand-collected data set describing the data-sharing behaviour of 488 randomly selected empirical researchers, we provide evidence that most researchers in economics and management do not share their data voluntarily. We derive testable hypotheses based on the theoretical literature on information-sharing and relate data-sharing to observable characteristics of researchers. We find empirical support for the hypotheses that voluntary data-sharing statistically-significantly increases with (1) academic tenure, (2) the quality of researchers, (3) the share of published articles subject to a mandatory data-disclosure policy of journals, and (4) personal attitudes towards “open science” principles. On the basis of our empirical evidence, we discuss a set of policy recommendations.

“The Availability of Research Data Declines Rapidly With Article Age”, Vines et al 2013

“The availability of research data declines rapidly with article age”⁠, Timothy Vines, Arianne Albert, Rose Andrew, Florence Debarré, Dan Bock, Michelle Franklin, Kimberley Gilbert et al (2013-12-19; backlinks; similar):

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2–4], and journal [5,6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8–11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested datasets from a relatively homogenous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a dataset being extant fell by 17% per year. In addition, the odds that we could find a working email address for the first, last or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
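
Because the quoted 17%/year decline is on the odds scale, it compounds multiplicatively; a sketch of the implied survival curve (the 80% baseline is a hypothetical, not a figure from the paper):

```python
def extant_probability(p0, years, annual_odds_decline=0.17):
    """Probability a dataset is still extant after `years`, if the odds
    fall by a fixed fraction per year: odds_t = odds_0 * (1 - decline)**t."""
    odds = (p0 / (1 - p0)) * (1 - annual_odds_decline) ** years
    return odds / (1 + odds)

# Starting from a hypothetical 80% availability at publication:
for t in (0, 5, 10, 20):
    print(t, round(extant_probability(0.80, t), 2))
# → 0 0.8 / 5 0.61 / 10 0.38 / 20 0.09
```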

“Nonindustry-Sponsored Preclinical Studies on Statins Yield Greater Efficacy Estimates Than Industry-Sponsored Studies: A Meta-Analysis”, Krauth et al 2013

“Nonindustry-Sponsored Preclinical Studies on Statins Yield Greater Efficacy Estimates Than Industry-Sponsored Studies: A Meta-Analysis”⁠, David Krauth, Andrew Anglemyer, Rose Philipps, Lisa Bero (2013-12-09; backlinks; similar):

Industry-sponsored clinical drug studies are associated with publication of outcomes that favor the sponsor, even when controlling for potential bias in the methods used. However, the influence of sponsorship bias has not been examined in preclinical animal studies.

We performed a meta-analysis of preclinical statin studies to determine whether industry sponsorship is associated with either increased effect sizes of efficacy outcomes and/​or risks of bias in a cohort of published preclinical statin studies. We searched MEDLINE (January 1966–April 2012) and identified 63 studies evaluating the effects of statins on atherosclerosis outcomes in animals. Two coders independently extracted study design criteria aimed at reducing bias, results for all relevant outcomes, sponsorship source, and investigator financial ties. The I2 statistic was used to examine heterogeneity. We calculated the standardized mean difference (SMD) for each outcome and pooled data across studies to estimate the pooled average SMD using random effects models. In a priori subgroup analyses, we assessed statin efficacy by outcome measured, sponsorship source, presence or absence of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization.

The effect of statins was statistically-significantly larger for studies sponsored by nonindustry sources (−1.99; 95% CI −2.68, −1.31) versus studies sponsored by industry (−0.73; 95% CI −1.00, −0.47) (p < 0.001). Statin efficacy did not differ by disclosure of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization. Possible reasons for the differences between nonindustry-sponsored and industry-sponsored studies, such as selective reporting of outcomes, require further study.

Author Summary: Industry-sponsored clinical drug studies are associated with publication of outcomes that favor the sponsor, even when controlling for potential bias in the methods used. However, the influence of sponsorship bias has not been examined in preclinical animal studies. We performed a meta-analysis to identify whether industry sponsorship is associated with increased risks of bias or effect sizes of outcomes in a cohort of published preclinical studies of the effects of statins on outcomes related to atherosclerosis. We found that in contrast to clinical studies, the effect of statins was statistically-significantly larger for studies sponsored by nonindustry sources versus studies sponsored by industry. Furthermore, statin efficacy did not differ with respect to disclosure of financial conflict information, use of an optimal time window for outcome assessment, accounting for all animals, inclusion criteria, blinding, and randomization. Possible reasons for the differences between nonindustry-sponsored and industry-sponsored studies, such as selective outcome reporting, require further study. Overall, our findings provide empirical evidence regarding the impact of funding and other methodological criteria on research outcomes.
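
The "pooled average SMD using random effects models" is conventionally computed with the DerSimonian-Laird estimator; a minimal sketch (the three SMDs below are invented for illustration):

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate (DerSimonian-Laird).
    `effects`: per-study standardized mean differences (SMDs);
    `variances`: their sampling variances."""
    w = [1 / v for v in variances]                      # inverse-variance weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)       # between-study variance
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    return pooled, 1 / sum(w_re)                        # estimate and its variance

# Hypothetical SMDs from three statin studies:
pooled, var = dersimonian_laird([-1.8, -0.9, -1.4], [0.10, 0.05, 0.20])
```

The I² heterogeneity statistic mentioned in the abstract follows from the same quantities as `max(0, (q - df) / q)`.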

“An Opportunity Cost Model of Subjective Effort and Task Performance”, Kurzban et al 2013-page-14

2013-kurzban.pdf#page=14: “An opportunity cost model of subjective effort and task performance”⁠, Robert Kurzban, Angela Duckworth, Joseph W. Kable, Justus Myers (2013-12-04; ⁠, ⁠, ⁠, ; backlinks):

Why does performing certain tasks cause the aversive experience of mental effort and concomitant deterioration in task performance? One explanation posits a physical resource that is depleted over time. We propose an alternative explanation that centers on mental representations of the costs and benefits associated with task performance. Specifically, certain computational mechanisms, especially those associated with executive function, can be deployed for only a limited number of simultaneous tasks at any given moment. Consequently, the deployment of these computational mechanisms carries an opportunity cost—that is, the next-best use to which these systems might be put. We argue that the phenomenology of effort can be understood as the felt output of these cost/​benefit computations. In turn, the subjective experience of effort motivates reduced deployment of these computational mechanisms in the service of the present task. These opportunity cost representations, then, together with other cost/​benefit calculations, determine effort expended and, everything else equal, result in performance reductions. In making our case for this position, we review alternative explanations for both the phenomenology of effort associated with these tasks and for performance reductions over time. Likewise, we review the broad range of relevant empirical results from across sub-disciplines, especially psychology and neuroscience. We hope that our proposal will help to build links among the diverse fields that have been addressing similar questions from different perspectives, and we emphasize ways in which alternative models might be empirically distinguished.

“When Mice Mislead: Tackling a Long-standing Disconnect between Animal and Human Studies, Some Charge That Animal Researchers Need Stricter Safeguards and Better Statistics to Ensure Their Science Is Solid”, Couzin-Frankel 2013

2013-couzinfrankel.pdf: “When Mice Mislead: Tackling a long-standing disconnect between animal and human studies, some charge that animal researchers need stricter safeguards and better statistics to ensure their science is solid”⁠, Jennifer Couzin-Frankel (2013-11-22; ; backlinks):

Tackling a long-standing disconnect between animal and human studies, some charge that animal researchers need stricter safeguards and better statistics to ensure their science is solid.

“Belief in the Unstructured Interview: The Persistence of an Illusion”, Dana et al 2013

2013-dana.pdf: “Belief in the unstructured interview: The persistence of an illusion”⁠, Jason Dana, Robyn Dawes, Nathanial Peterson (2013-09-01; similar):

Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy.

In three studies, we investigated the propensity for “sensemaking”—the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution”—the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview.

People form confident impressions even when interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.

[Keywords: unstructured interview, random interview, clinical judgment, actuarial judgment]

“Book Reviews”, Branwen 2013

Books: “Book Reviews”⁠, Gwern Branwen (2013-08-23; backlinks; similar):

A compilation of reviews of books I have read since ~1997.

This is a compilation of my book reviews. Reviews are sorted by star rating, and by length of review within each star level, under the assumption that longer reviews are of more interest to readers.

See also my anime/manga and film/TV/theater reviews.

“Lunar Circadian Rhythms”, Branwen 2013

Lunar-sleep: “Lunar circadian rhythms”⁠, Gwern Branwen (2013-07-26; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Is sleep affected by the phase of the moon? An analysis of several years of 4 Zeo users’ sleep data shows no lunar cycle.

I attempt to replicate, using public Zeo-recorded sleep datasets, a finding of a monthly circadian rhythm affecting sleep in a small sleep lab. I find only small non-statistically-significant correlations, despite being well-powered⁠.

“Lizardman Constant in Surveys”, Branwen 2013

Lizardman-constant: “Lizardman Constant in Surveys”⁠, Gwern Branwen (2013-04-12; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A small fraction of human responses will always be garbage because we are lazy, bored, trolling, or crazy.

Researchers have demonstrated repeatedly in human surveys the stylized fact that, far from being an oracle or gold standard, a certain small percentage of human responses will reliably be bullshit: “jokester” or “mischievous responders”, or more memorably, “lizardman constant” responders—respondents who give the wrong answer to simple questions.

Below a certain percentage, for sufficiently rare responses, many or all of the respondents may be lying, lazy, crazy, or malicious, and the responses false. This systematic error seriously undermines attempts to study rare beliefs such as conspiracy theories, and puts bounds on how accurate any single survey can hope to be.
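
This bound can be illustrated with a back-of-the-envelope model (the 4% noise floor and the assumption that noise-responders say “yes” half the time are illustrative numbers, not estimates from any particular survey):

```python
# Toy model of the lizardman constant (all parameters are assumptions for
# illustration): a fraction `noise` of respondents answer essentially at
# random, saying "yes" to a binary question with probability `p_yes_garbage`.
def observed_rate(true_rate, noise=0.04, p_yes_garbage=0.5):
    """Measured 'yes' rate: sincere answers plus noise-responder answers."""
    return true_rate * (1 - noise) + noise * p_yes_garbage

# A belief genuinely held by 1% of respondents is measured at ~3%,
# overstating it by roughly 3x: the noise floor dominates the signal.
measured = observed_rate(0.01)
```

Inverting the model only helps if the noise rate and its “yes” bias are known, which they rarely are; this is why single surveys of rare beliefs face a hard accuracy floor.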

“Investing in Preschool Programs”, Duncan & Magnuson 2013

“Investing in Preschool Programs”⁠, Greg J. Duncan, Katherine Magnuson (2013-04; ⁠, ; backlinks; similar):

At the beginning of kindergarten, the math and reading achievement gaps between children in the bottom and top income quintiles amount to more than a full standard deviation. Early childhood education programs provide child care services and may facilitate the labor market careers of parents, but their greatest potential value is as a human capital investment in young children, particularly children from economically disadvantaged families (Heckman 2006). After all, both human and animal studies highlight the critical importance of experiences in the earliest years of life for establishing the brain architecture that will shape future cognitive, social, and emotional development, as well as physical and mental health (Sapolsky 2004; Knudsen et al 2006). Moreover, research on the malleability (plasticity) of cognitive abilities finds these skills to be highly responsive to environmental enrichment during the early childhood period (Nelson & Sheridan 2011). Perhaps early childhood education programs can be designed to provide the kinds of enrichment that low-income children most need to do well in school and succeed in the labor market.

We summarize the available evidence on the extent to which expenditures on early childhood education programs constitute worthy social investments in the human capital of children. We begin with a short overview of existing early childhood education programs, and then summarize results from a substantial body of methodologically sound evaluations of the impacts of early childhood education. We find that the evidence supports few unqualified conclusions. Many early childhood education programs appear to boost cognitive ability and early school achievement in the short run. However, most of them show smaller impacts than those generated by the best-known programs, and their cognitive impacts largely disappear within a few years. Despite this fade-out, long-run follow-ups from a handful of well-known programs show lasting positive effects on such outcomes as greater educational attainment, higher earnings, and lower rates of crime. Since findings regarding short and longer-run impacts on “noncognitive” outcomes are mixed, it is uncertain what skills, behaviors, or developmental processes are particularly important in producing these longer-run impacts.

Our review also describes different models of human development used by social scientists, examines heterogeneous results across groups, and tries to identify the ingredients of early childhood education programs that are most likely to improve the performance of these programs. We use the terms “early childhood education” and “preschool” interchangeably to denote the subset of programs that provide group-based care in a center setting and offer some kind of developmental and educational focus. This definition is intentionally broad, as historical distinctions between early education and other kinds of center-based child care programs have blurred. Many early education programs now claim the dual goals of supporting working families and providing enriched learning environments to children, while many child care centers also foster early learning and development (Adams & Rohacek 2002).

Figure 2 shows the distribution of 84 program-average treatment effect-sizes for cognitive and achievement outcomes, measured at the end of each program’s treatment period, by the calendar year in which the program began. Reflecting their approximate contributions to weighted results, “bubble” sizes are proportional to the inverse of the squared standard error of the estimated program impact. The figure differentiates between evaluations of Head Start and other early childhood education programs and also includes a weighted regression line of effect size by calendar year.

Figure 2: Average Impact of Early Child Care Programs at End of Treatment (standard deviation units).
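
The inverse-variance weighting behind the bubble sizes and the regression line can be sketched in a few lines (an illustrative helper, not the authors’ code; `x` is each program’s start year, `y` its effect size, `se` the standard error of the estimate):

```python
def wls_slope(x, y, se):
    """Slope of a weighted least-squares fit of y on x, using the
    inverse-variance weights w_i = 1 / se_i**2 described in the text."""
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Hypothetical data: two precisely-estimated programs (se = 0.1) lying on a
# slope of -0.02 per year, plus one noisy program (se = 1.0) that is
# down-weighted 100x and so barely moves the fitted trend.
slope = wls_slope([1965, 1985, 2005], [0.8, 0.4, 0.5], [0.1, 0.1, 1.0])
```

Weighting by 1/SE² lets the well-powered evaluations dominate the trend line, rather than the many small, noisy studies.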

“Star Wars: The Empirics Strike Back”, Brodeur et al 2013

“Star Wars: The Empirics Strike Back”⁠, Abel Brodeur, Mathias Lé, Marc Sangnier, Yanos Zylberberg (2013-03; backlinks; similar):

Journals favor rejection of the null hypothesis. This selection upon tests may distort the behavior of researchers. Using 50,000 tests published between 2005 and 2011 in the AER, JPE, and QJE, we identify a residual in the distribution of tests that cannot be explained by selection. The distribution of p-values exhibits a camel shape with abundant p-values above 0.25, a valley between 0.25 and 0.10 and a bump slightly below 0.05. The missing tests (with p-values between 0.25 and 0.10) can be retrieved just after the 0.05 threshold and represent 10% to 20% of marginally rejected tests. Our interpretation is that researchers might be tempted to inflate the value of those almost-rejected tests by choosing a “significant” specification. We propose a method to measure inflation and decompose it along articles’ and authors’ characteristics.
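
The proposed inflation mechanism can be illustrated with a toy simulation (entirely assumed numbers: a uniform null distribution of p-values, and a 50% chance that a marginal result is pushed below 0.05 by specification search):

```python
import random

random.seed(0)

# Toy model of specification search (assumed parameters): draw p-values from
# a uniform null, then move half of the "almost rejected" tests
# (0.05 < p < 0.10) to just below the 0.05 threshold.
p_values = [random.random() for _ in range(100_000)]
inflated = [random.uniform(0.040, 0.049)
            if 0.05 < p < 0.10 and random.random() < 0.5
            else p
            for p in p_values]

def density(ps, lo, hi):
    """Share of p-values per unit width in [lo, hi); uniform baseline is 1.0."""
    return sum(lo <= p < hi for p in ps) / len(ps) / (hi - lo)

bump = density(inflated, 0.040, 0.050)    # ~3.5: spike just below 0.05
valley = density(inflated, 0.050, 0.100)  # ~0.5: the missing marginal tests
```

In the real data the bump sits on top of a non-uniform alternative distribution, but the signature is the same: a valley of missing marginal tests retrieved just below the 0.05 threshold.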

“Empirical Estimates Suggest Most Published Medical Research Is True”, Jager & Leek 2013

“Empirical estimates suggest most published medical research is true”⁠, Leah R. Jager, Jeffrey T. Leek (2013-01-16; backlinks; similar):

The accuracy of published medical research is critical for the scientists, physicians, and patients who rely on these results. But fundamental belief in the medical literature was called into serious question by a paper suggesting that most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature, using reported p-values as the data. We then collect p-values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a statistically-significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, p = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, p = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data, but our analysis suggests that the medical literature remains a reliable record of scientific progress.

“Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects?”, Policy 2013

“Randomized Controlled Trials Commissioned by the Institute of Education Sciences Since 2002: How Many Found Positive Versus Weak or No Effects?”⁠, Coalition for Evidence-Based Policy (2013; ; backlinks; similar):

Since the establishment of the Institute for Education Sciences (IES) within the U.S. Department of Education in 2002, IES has commissioned a sizable number of well-conducted randomized controlled trials (RCTs) evaluating the effectiveness of diverse educational programs, practices, and strategies (“interventions”). These interventions have included, for example, various educational curricula, teacher professional development programs, school choice programs, educational software, and data-driven school reform initiatives.

Largely as a result of these IES studies, there now exists—for the first time in U.S. education—a sizable body of credible knowledge about what works and what doesn’t work to improve key educational outcomes of American students.

A clear pattern of findings in these IES studies is that the large majority of interventions evaluated produced weak or no positive effects compared to usual school practices. This pattern is consistent with findings in other fields where RCTs are frequently carried out, such as medicine and business, and underscores the need to test many different interventions so as to build the number shown to work.

“What’s to Know about the Credibility of Empirical Economics?”, Ioannidis & Doucouliagos 2013

2013-ioannidis.pdf: “What’s to know about the credibility of empirical economics?”⁠, John Ioannidis, Chris Doucouliagos (2013; ; backlinks; similar):

The scientific credibility of economics is itself a scientific question that can be addressed with both theoretical speculations and empirical data.

In this review, we examine the major parameters that are expected to affect the credibility of empirical economics: sample size, magnitude of pursued effects, number and pre-selection of tested relationships, flexibility and lack of standardization in designs, definitions, outcomes and analyses, financial and other interests and prejudices, and the multiplicity and fragmentation of efforts.

We summarize and discuss the empirical evidence on the lack of a robust reproducibility culture in economics and business research, the prevalence of potential publication and other selective reporting biases, and other failures and biases in the market of scientific information. Overall, the credibility of the economics literature is likely to be modest or even low.

[Keywords: bias, credibility, economics, meta-research, replication, reproducibility]

“Flawed Science: The Fraudulent Research Practices of Social Psychologist Diederik Stapel”, Committee et al 2012

2012-levelt.pdf: “Flawed science: The fraudulent research practices of social psychologist Diederik Stapel”⁠, Levelt Committee, Noort Committee, Drenth Committee (2012-11-28; backlinks)

“A Peculiar Prevalence of p Values Just below 0.05”, Masicampo & Lalande 2012

2012-masicampo.pdf: “A peculiar prevalence of p values just below 0.05”⁠, E. J. Masicampo, Daniel R. Lalande (2012-11-01; backlinks; similar):

In null hypothesis statistical-significance testing (NHST), p-values are judged relative to an arbitrary threshold for statistical-significance (0.05). The present work examined whether that standard influences the distribution of p-values reported in the psychology literature.

We examined a large subset of papers from 3 highly regarded journals. Distributions of p were found to be similar across the different journals. Moreover, p-values were much more common immediately below 0.05 than would be expected based on the number of p-values occurring in other ranges. This prevalence of p-values just below the arbitrary criterion for statistical-significance was observed in all 3 journals.

We discuss potential sources of this pattern, including publication bias and researcher degrees of freedom.

“The Iron Law Of Evaluation And Other Metallic Rules”, Rossi 2012

1987-rossi: “The Iron Law Of Evaluation And Other Metallic Rules”⁠, Peter H. Rossi (2012-09-18; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Problems with social experiments and evaluating them, loopholes, causes, and suggestions; non-experimental methods systematically deliver false results, as most interventions fail or have small effects.

“The Iron Law Of Evaluation And Other Metallic Rules” is a classic review paper by American sociologist Peter Rossi⁠, “a dedicated progressive and the nation’s leading expert on social program evaluation from the 1960s through the 1980s”; it discusses the difficulties of creating a useful social program⁠, and proposes some aphoristic summary rules, including most famously: