
Bayes directory


“Fast and Accurate Bayesian Polygenic Risk Modeling With Variational Inference”, Zabad et al 2022

“Fast and Accurate Bayesian Polygenic Risk Modeling with Variational Inference”⁠, Shadi Zabad, Simon Gravel, Yue Li (2022-05-11):

The recent proliferation of large-scale genome-wide association studies (GWASs) has motivated the development of statistical methods for phenotype prediction using single nucleotide polymorphism (SNP) array data. These polygenic risk score (PRS) methods formulate the task of polygenic prediction in terms of a multiple linear regression framework, where the goal is to infer the joint effect sizes of all genetic variants on the trait. Among the subset of PRS methods that operate on GWAS summary statistics, sparse Bayesian methods have shown competitive predictive ability. However, existing Bayesian approaches employ Markov Chain Monte Carlo (MCMC) algorithms for posterior inference, which are computationally inefficient and do not scale favorably with the number of SNPs included in the analysis. Here, we introduce Variational Inference of Polygenic Risk Scores (VIPRS), a Bayesian summary statistics-based PRS method that utilizes Variational Inference (VI) techniques to efficiently approximate the posterior distribution for the effect sizes. Our experiments with genome-wide simulations and real phenotypes from the UK Biobank (UKB) dataset demonstrated that variational approximations to the posterior are competitively accurate and highly efficient. When compared to state-of-the-art PRS methods, VIPRS consistently achieves the best or second best predictive accuracy in our analyses of 18 simulation configurations as well as 12 real phenotypes measured among the UKB participants of “White British” background. This performance advantage was higher among individuals from other ethnic groups, with an increase in R-squared of up to 1.7× among participants of Nigerian ancestry for Low-Density Lipoprotein (LDL) cholesterol. Furthermore, given its computational efficiency, we applied VIPRS to a dataset of up to 10 million genetic markers, an order of magnitude greater than the standard HapMap3 subset used to train existing PRS methods. Modeling this expanded set of variants conferred modest improvements in prediction accuracy for a number of highly polygenic traits, such as standing height.

“The InterModel Vigorish (IMV): A Flexible and Portable Approach for Quantifying Predictive Accuracy With Binary Outcomes”, Domingue et al 2022

“The InterModel Vigorish (IMV): A flexible and portable approach for quantifying predictive accuracy with binary outcomes”⁠, Ben Domingue, Charles Rahal, Jessica Faul, Jeremy Freese, Klint Kanopka, Alexandros Rigos, Ben Stenhaug et al (2022-01-12):

[Twitter⁠; app] Understanding the “fit” of models designed to predict binary outcomes has been a long-standing problem.

We propose a flexible, portable, and intuitive metric for quantifying the change in accuracy between 2 predictive systems in the case of a binary outcome, the InterModel Vigorish (IMV). The IMV is based on an analogy to well-characterized physical systems with tractable probabilities: weighted coins. The IMV is always a statement about the change in fit relative to some baseline—which can be as simple as the prevalence—whereas other metrics are stand-alone measures that need to be further manipulated to yield indices related to differences in fit across models. Moreover, the IMV is consistently interpretable independent of baseline prevalence.

We illustrate the flexible properties of this metric in numerous simulations and showcase its flexibility across examples spanning the social, biomedical, and physical sciences.

[Keywords: binary outcomes, fit index, logistic regression⁠, prediction, Kelly criterion⁠, entropy⁠, coherence]
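The weighted-coin construction can be sketched in code (a minimal stdlib-only sketch of the definition as I understand it from the abstract, with function names of my own: each prediction system is mapped to the coin weight w ≥ 0.5 whose geometric-mean likelihood matches the system's, and the IMV is the relative gain in w over the baseline):

```python
import math

def coin_weight(preds, outcomes):
    """Weight w >= 0.5 of the coin whose geometric-mean likelihood
    equals that of the given predicted probabilities."""
    ll = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
             for p, y in zip(preds, outcomes)) / len(preds)
    # solve w*log(w) + (1-w)*log(1-w) = ll by bisection on [0.5, 1)
    lo, hi = 0.5, 1 - 1e-12
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log(mid) + (1 - mid) * math.log(1 - mid) <= ll:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def imv(preds_baseline, preds_enhanced, outcomes):
    """Relative improvement in equivalent coin weight: (w1 - w0) / w0."""
    w0 = coin_weight(preds_baseline, outcomes)
    w1 = coin_weight(preds_enhanced, outcomes)
    return (w1 - w0) / w0
```

A baseline that predicts the prevalence of a balanced outcome maps to w = 0.5, so any real predictive signal in the enhanced system shows up directly as a positive IMV.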

“The Science of Visual Data Communication: What Works”, Franconeri et al 2021

2021-franconeri.pdf: “The Science of Visual Data Communication: What Works”⁠, Steven L. Franconeri, Lace M. Padilla, Priti Shah, Jeffrey M. Zacks, Jessica Hullman (2021-12-15):

Effectively designed data visualizations allow viewers to use their powerful visual systems to understand patterns in data across science, education, health, and public policy. But ineffectively designed visualizations can cause confusion, misunderstanding, or even distrust—especially among viewers with low graphical literacy.

We review research-backed guidelines for creating effective and intuitive visualizations oriented toward communicating data to students, coworkers, and the general public. We describe how the visual system can quickly extract broad statistics from a display, whereas poorly designed displays can lead to misperceptions and illusions. Extracting global statistics is fast, but comparing between subsets of values is slow. Effective graphics avoid taxing working memory, guide attention, and respect familiar conventions.

Data visualizations can play a critical role in teaching and communication, provided that designers tailor those visualizations to their audience.

“How to Learn and Represent Abstractions: An Investigation Using Symbolic Alchemy”, AlKhamissi et al 2021

“How to Learn and Represent Abstractions: An Investigation using Symbolic Alchemy”⁠, Badr AlKhamissi, Akshay Srinivasan, Zeb Kurth-Nelson, Sam Ritter (2021-12-14):

Alchemy is a new meta-learning environment rich enough to contain interesting abstractions, yet simple enough to make fine-grained analysis tractable. Further, Alchemy provides an optional symbolic interface that enables meta-RL research without a large compute budget. In this work, we take the first steps toward using Symbolic Alchemy to identify design choices that enable deep-RL agents to learn various types of abstraction. Then, using a variety of behavioral and introspective analyses we investigate how our trained agents use and represent abstract task variables, and find intriguing connections to the neuroscience of abstraction. We conclude by discussing the next steps for using meta-RL and Alchemy to better understand the representation of abstract variables in the brain.

“An Experimental Design Perspective on Model-Based Reinforcement Learning”, Mehta et al 2021

“An Experimental Design Perspective on Model-Based Reinforcement Learning”⁠, Viraj Mehta, Biswajit Paria, Jeff Schneider, Stefano Ermon, Willie Neiswanger (2021-12-09):

[blog] In many practical applications of RL, it is expensive to observe state transitions from the environment. For example, in the problem of plasma control for nuclear fusion, computing the next state for a given state-action pair requires querying an expensive transition function which can lead to many hours of computer simulation or dollars of scientific research. Such expensive data collection prohibits application of standard RL algorithms which usually require a large number of observations to learn.

In this work, we address the problem of efficiently learning a policy while making a minimal number of state-action queries to the transition function. In particular, we leverage ideas from Bayesian optimal experimental design to guide the selection of state-action queries for efficient learning. We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process. At each iteration, our algorithm maximizes this acquisition function, to choose the most informative state-action pair to be queried, thus yielding a data-efficient RL approach.

We experiment with a variety of simulated continuous control problems and show that our approach learns an optimal policy with up to 5–1,000× less data than model-based RL baselines and 10³–10⁵× less data than model-free RL baselines. We also provide several ablated comparisons which point to substantial improvements arising from the principled method of obtaining data.

“Prior Knowledge Elicitation: The Past, Present, and Future”, Mikkola et al 2021

“Prior knowledge elicitation: The past, present, and future”⁠, Petrus Mikkola, Osvaldo A. Martin, Suyog Chandramouli, Marcelo Hartmann, Oriol Abril Pla, Owen Thomas et al (2021-12-01):

Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem, in principle. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.

Why are we not widely using prior elicitation? We analyze the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing under-studied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.

“Improving GWAS Discovery and Genomic Prediction Accuracy in Biobank Data”, Orliac et al 2021

“Improving GWAS discovery and genomic prediction accuracy in Biobank data”⁠, Etienne J. Orliac, Daniel Trejo Banos, Sven Erik Ojavee, Kristi Läll, Reedik Mägi, Peter M. Visscher et al (2021-11-08):

Genetically informed and deep-phenotyped biobanks are an important research resource. The cost of phenotyping far outstrips that of genotyping, and therefore it is imperative that the most powerful, versatile and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. On average, GMRM accuracies were 15% (SE 7%) greater than prediction models run in the LDAK software with SNP annotation marker groups, 18% (SE 3%) greater than a baseline BayesR model without SNP markers grouped into MAF-LD-annotation categories, and 106% (SE 9%) greater than polygenic risk scores calculated from mixed-linear model association (MLMA) estimates. For height, the prediction accuracy was 47% in a UK Biobank hold-out sample, which was 76% of the estimated SNP-heritability. We then extend our GMRM prediction model to provide MLMA SNP marker estimates for GWAS discovery, which increased the independent loci detected to 7,910 in unrelated UK Biobank individuals, as compared to 5,521 from BOLT-LMM and 5,727 from Regenie, a 43% and 38% increase respectively. The average χ² value of the leading markers was 34% (SE 5.11) higher for GMRM as compared to Regenie, and increased by 17% for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modelling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and for discovery in large-scale individual-level biobank studies.

“An Explanation of In-context Learning As Implicit Bayesian Inference”, Xie et al 2021

“An Explanation of In-context Learning as Implicit Bayesian Inference”⁠, Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma (2021-11-03):

Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass without parameter updates on “out-of-distribution” prompts. Thus, it is unclear what mechanism enables in-context learning.

In this paper, we study the role of the pretraining distribution on the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. Here, language model pretraining requires inferring a latent document-level concept from the conditioning text to generate coherent next tokens. At test time, this mechanism enables in-context learning by inferring the shared latent concept between prompt examples and applying it to make a prediction on the test example.

Concretely, we prove that in-context learning occurs implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs⁠. This can occur despite the distribution mismatch between prompts and pretraining data. In contrast to messy large-scale pretraining datasets for in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) where Transformer and LSTM language models both exhibit in-context learning.

Beyond the theory which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.

“Unifying Individual Differences in Personality, Predictability and Plasticity: A Practical Guide”, O’Dea et al 2021

2021-odea.pdf: “Unifying individual differences in personality, predictability and plasticity: A practical guide”⁠, Rose E. O’Dea, Daniel W. A. Noble, Shinichi Nakagawa (2021-11-01):

  1. Organisms use labile traits to respond to different conditions over short time-scales. When a population experiences the same conditions, we might expect all individuals to adjust their trait expression to the same, optimal, value, thereby minimizing phenotypic variation. Instead, variation abounds. Individuals substantially differ not only from each other, but also from their former selves, with the expression of labile traits varying both predictably and unpredictably over time.
  2. A powerful tool for studying the evolution of phenotypic variation in labile traits is the mixed model⁠. Here, we review how mixed models are used to quantify individual differences in both means and variability, and their between-individual correlations. Individuals can differ in their average phenotypes (eg. behavioural personalities), their variability (known as ‘predictability’ or intra-individual variability), and their plastic response to different contexts.
  3. We provide detailed descriptions and resources for simultaneously modelling individual differences in averages, plasticity and predictability. Empiricists can use these methods to quantify how traits covary across individuals and test theoretical ideas about phenotypic integration. These methods can be extended to incorporate plastic changes in predictability (termed ‘stochastic malleability’).
  4. Overall, we showcase the unfulfilled potential of existing statistical tools to test more holistic and nuanced questions about the evolution, function, and maintenance of phenotypic variation, for any trait that is repeatedly expressed.

[Keywords: brms⁠, coefficient of variation, DHGLM⁠, Double Hierarchical⁠, location-scale regression, multivariate, repeatability, rstan]

Conclusions And Future Directions: Incorporating predictability into studies of personality and plasticity creates an opportunity to test more nuanced questions about how phenotypic variation is maintained, or constrained. For some traits, it might be adaptive to be unpredictable, such as in predator-prey interactions (Briffa 2013). For other traits, selection might act to minimise maladaptive imprecision around an optimal mean (Hansen et al 2006). The supplementary worked example and open code (O’Dea et al 2021) show between-individual correlations in predictability across multiple behavioural traits, and some correlations of predictability with personality and plasticity. If driven by biological integration and not measurement errors or statistical artefacts, these correlations could hint at genetic integration too; other studies have found additive genetic variance in predictability (Martin et al 2017; Prentice et al 2020). Given that different traits might have different optimal levels of unpredictability, integration of predictability could constrain variation in one trait (resulting in lower than optimal variability) and maintain variation in another (resulting in greater than optimal variability). Because of associations with personality and plasticity, variation in predictability—the lowest level of the phenotypic hierarchy—could have cascading effects upwards (Westneat et al 2015). Empirical estimates of the strength of these associations can inform theoretical models on the simultaneous evolution of means and variances.

“Transformers Can Do Bayesian Inference”, Müller et al 2021

“Transformers Can Do Bayesian Inference”⁠, Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, Frank Hutter (2021-10-05):

Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference.

We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200× speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released under [ANONYMIZED; please see the supplementary material].

“Why Generalization in RL Is Difficult: Epistemic POMDPs and Implicit Partial Observability”, Ghosh et al 2021

“Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability”⁠, Dibya Ghosh, Jad Rahme, Aviral Kumar, Amy Zhang, Ryan P. Adams, Sergey Levine (2021-07-13):

Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world.

In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the well-studied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we show that, perhaps surprisingly, this is not the case in RL.

We show that generalization to unseen test conditions from a limited number of training conditions induces implicit partial observability, effectively turning even fully-observed MDPs into POMDPs⁠. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemble-based technique for solving the partially observed problem.

Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves substantial gains in generalization over current methods on the Procgen benchmark suite.

“The Bayesian Learning Rule”, Khan & Rue 2021

“The Bayesian Learning Rule”⁠, Mohammad Emtiyaz Khan, Håvard Rue (2021-07-09):

We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression⁠, Newton’s method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
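Schematically, the rule the abstract describes can be written as a natural-gradient update on the candidate distribution (a sketch in my own notation, reconstructed from the abstract: qλ is the candidate exponential-family distribution with natural parameter λ, ℓ the loss, H the entropy, ∇̃ the natural gradient, and ρt a step size):

```latex
\lambda_{t+1} \;=\; \lambda_t \;-\; \rho_t\,
\widetilde{\nabla}_{\lambda}
\Big[\, \mathbb{E}_{q_{\lambda}}\!\big[\ell(\theta)\big] \;-\; \mathcal{H}\!\big(q_{\lambda}\big) \,\Big]
```

Choosing a point-mass-like Gaussian candidate recovers gradient-descent-style algorithms, while richer candidates recover the Bayesian and second-order methods listed above.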

“No Need to Choose: Robust Bayesian Meta-Analysis With Competing Publication Bias Adjustment Methods”, Bartoš et al 2021

“No Need to Choose: Robust Bayesian Meta-Analysis with Competing Publication Bias Adjustment Methods”⁠, František Bartoš, Maximilian Maier, Eric-Jan Wagenmakers, Hristos Doucouliagos, T. D. Stanley (2021-06-17):

Publication bias is a ubiquitous threat to the validity of meta-analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods’ performance to depend on the true data generating process—no method consistently outperforms the others across a wide range of conditions.

To avoid the condition-dependent, all-or-none choice between competing methods we extend robust Bayesian meta-analysis and model-average across 2 prominent approaches to adjusting for publication bias: (1) selection models of p-values and (2) models of the relationship between effect-sizes and their standard errors. The resulting estimator weights the models with the support they receive from the existing research record.

Applications, simulations, and comparisons to preregistered⁠, multi-lab replications demonstrate the benefits of Bayesian model-averaging of competing publication bias adjustment methods.

[Keywords: Bayesian model-averaging⁠, meta-analysis, PET-PEESE⁠, publication bias, selection models]

“Maternal Judgments of Child Numeracy and Reading Ability Predict Gains in Academic Achievement and Interest”, Parker et al 2021

2021-parker.pdf: “Maternal Judgments of Child Numeracy and Reading Ability Predict Gains in Academic Achievement and Interest”⁠, Philip D. Parker, Taren Sanders, Jake Anders, Rhiannon B. Parker, Jasper J. Duineveld (2021-05-15):

[Example of regression to the mean fallacies: parents know much more about their children than highly unreliable early childhood exam scores, and their “overestimates” predict later performance (particularly for immigrant parents about second-language proficiency). Of course. How could it be otherwise? (Not to mention that we already know the ‘Pygmalion effect’ isn’t real so the claimed causal explanation of their correlates has already been ruled out.)]

In a representative longitudinal sample of 2,602 Australian children (52% boys; 2% Indigenous; 13% language other than English background; 22% of Mothers born overseas; and 65% Urban) and their mothers (first surveyed in 2003), this article examined if maternal judgments of numeracy and reading ability varied by child demographics and influenced achievement and interest gains.

We linked survey data to administrative data of national standardized tests in Year 3, 5, and 7 and found that maternal judgments followed gender stereotype patterns, favoring girls in reading and boys in numeracy. Maternal judgments were more positive for children from non-English speaking backgrounds. Maternal judgments predicted gains in children’s achievement (consistently) and academic interest (generally) including during the transition to high school.

His team collected data from more than 2,600 Australian children and tracked their academic performance through NAPLAN tests between grade 3, 5 and 7.

They also collected information from the primary caregiver—mostly the child’s mother—as to whether they thought their child’s academic performance was better than average, average or below average.

“What we found was that in year 5, the kids whose parents overestimated their ability—they were optimistic—they did better in subsequent NAPLAN tests”, Professor Parker says.

“And more importantly, [the children] actually grew in their interest. They were more interested in maths, they were more interested in reading than [those who had] parents who are more pessimistic.”

Professor Philip Parker says your expectations of your child can become a self-fulfilling prophecy.

The study also found that mothers who were not from English-speaking backgrounds had statistically-significantly more positive judgments than English-speaking mothers towards their child when assessing them on reading. This was not the case when assessing numeracy.

Professor Parker says there are many ways that a parent’s optimism can benefit their child. “So they might hire a tutor, or they … buy one of those computer games for maths classes … also they tend to be more motivating. And they tend to give homework help that is more positive and supportive, rather than controlling and detrimental.”

“Genetic Sensitivity Analysis: Adjusting for Genetic Confounding in Epidemiological Associations”, Pingault et al 2021

“Genetic sensitivity analysis: Adjusting for genetic confounding in epidemiological associations”⁠, Jean-Baptiste Pingault, Frühling Rijsdijk, Tabea Schoeler, Shing Wan Choi, Saskia Selzam, Eva Krapohl et al (2021-05-07):

Associations between exposures and outcomes reported in epidemiological studies are typically unadjusted for genetic confounding. We propose a two-stage approach for estimating the degree to which such observed associations can be explained by genetic confounding. First, we assess attenuation of exposure effects in regressions controlling for increasingly powerful polygenic scores. Second, we use structural equation models to estimate genetic confounding using heritability estimates derived from both SNP-based and twin-based studies. We examine associations between maternal education and three developmental outcomes—child educational achievement, Body Mass Index (BMI), and Attention Deficit Hyperactivity Disorder (ADHD). Polygenic scores explain between 14.3% and 23.0% of the original associations, while analyses under SNP-based and twin-based heritability scenarios indicate that observed associations could be almost entirely explained by genetic confounding. Thus, caution is needed when interpreting associations from non-genetically informed epidemiology studies. Our approach, akin to a genetically informed sensitivity analysis, can be applied widely.

Author summary:

An objective shared across the life, behavioural, and social sciences is to identify factors that increase risk for a particular disease or trait. However, identifying true risk factors is challenging. Often, a risk factor is statistically associated with a disease even if it is not really relevant, meaning that even successfully improving the risk factor will not impact the disease. One reason for the existence of such misleading associations stems from genetic confounding. This is when genetic factors influence directly both the risk factor and the disease, which generates a statistical association even in the absence of a true effect of the risk factor. Here, we propose a method to estimate genetic confounding and quantify its effect on observed associations. We show that a large part of the associations between maternal education and 3 child outcomes—educational achievement, body mass index and Attention-Deficit Hyperactivity Disorder—is explained by genetic confounding. Our findings can be applied to better understand the role of genetics in explaining associations of key risk factors with diseases and traits.

“What Are Bayesian Neural Network Posteriors Really Like?”, Izmailov et al 2021

“What Are Bayesian Neural Network Posteriors Really Like?”⁠, Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson (2021-04-29):

The posterior over Bayesian neural network (BNN) parameters is extremely high-dimensional and non-convex. For computational reasons, researchers approximate this posterior using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient Markov chain Monte Carlo (SGMCMC). To investigate foundational questions in Bayesian deep learning, we instead use full-batch Hamiltonian Monte Carlo (HMC) on modern architectures. We show that (1) BNNs can achieve significant performance gains over standard training and deep ensembles⁠; (2) a single long HMC chain can provide a comparable representation of the posterior to multiple shorter chains; (3) in contrast to recent studies, we find posterior tempering is not needed for near-optimal performance, with little evidence for a “cold posterior” effect, which we show is largely an artifact of data augmentation; (4) BMA performance is robust to the choice of prior scale, and relatively similar for diagonal Gaussian, mixture of Gaussian, and logistic priors; (5) Bayesian neural networks show surprisingly poor generalization under domain shift; (6) while cheaper alternatives such as deep ensembles and SGMCMC methods can provide good generalization, they provide distinct predictive distributions from HMC. Notably, deep ensemble predictive distributions are similarly close to HMC as standard SGLD, and closer than standard variational inference.

“Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020”, Turner et al 2021

“Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020”⁠, Ryan Turner, David Eriksson, Michael McCourt, Juha Kiili, Eero Laaksonen, Zhen Xu, Isabelle Guyon (2021-04-20):

This paper presents the results and insights from the black-box optimization (BBO) challenge at NeurIPS 2020, which ran from July to October 2020.

The challenge emphasized the importance of evaluating derivative-free optimizers for tuning the hyperparameters of machine learning models. This was the first black-box optimization challenge with a machine learning emphasis. It was based on tuning (validation set) performance of standard machine learning models on real datasets. This competition has widespread impact as black-box optimization (eg. Bayesian optimization) is relevant for hyperparameter tuning in almost every machine learning project as well as many applications outside of machine learning. The final leaderboard was determined using the optimization performance on held-out (hidden) objective functions, where the optimizers ran without human intervention.

Baselines were set using the default settings of several open-source black-box optimization packages as well as random search.

“Maximal Positive Controls: A Method for Estimating the Largest Plausible Effect Size”, Hilgard 2021

2021-hilgard.pdf: “Maximal positive controls: A method for estimating the largest plausible effect size”⁠, Joseph Hilgard (2021-03-01; ⁠, ; similar):

  • Some reported effect sizes are too big for the hypothesized process.
  • Simple, obvious manipulations can reveal which effects are too big.
  • A demonstration is provided examining an implausibly large effect.

Effect sizes in social psychology are generally not large and are limited by error variance in manipulation and measurement. Effect sizes exceeding these limits are implausible and should be viewed with skepticism. Maximal positive controls, experimental conditions that should show an obvious and predictable effect [eg. a Stroop effect], can provide estimates of the upper limits of plausible effect sizes on a measure.

In this work, maximal positive controls are conducted for 3 measures of aggressive cognition, and the effect sizes obtained are compared to studies found through systematic review. Questions are raised regarding the plausibility of certain reports with effect sizes comparable to, or in excess of, the effect sizes found in maximal positive controls.

Maximal positive controls may provide a means to identify implausible study results at lower cost than direct replication.

[Keywords: violent video games, aggression⁠, aggressive thought, positive controls, scientific self-correction]

[Positive controls eliciting a hitherto-maximum effect can be seen as a kind of empirical Bayes estimating the distribution of plausible effects: if a reported effect size exceeds the empirical max, either something extremely unlikely has occurred (a new max out of n effects ever observed) or an error. For large n, the posterior probability of an error will be much larger.]
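This argument can be made concrete with a small sketch (my own illustration with an assumed error base rate, not Hilgard’s analysis): for n exchangeable effects, a genuinely new maximum has probability 1⁄(n + 1), so for large n an above-max report is far more likely to be an error.

```python
# Sketch: under exchangeability, the probability that a new effect size
# legitimately exceeds the maximum of n previously observed effects is
# 1/(n+1). If erroneous reports occur with base rate p_err (and an error
# is assumed to exceed the max with probability ~1), the posterior
# probability of an error given an above-max report grows with n.

def p_new_max(n: int) -> float:
    """P(a new draw is the largest of n+1 exchangeable draws)."""
    return 1.0 / (n + 1)

def p_error_given_above_max(n: int, p_err: float) -> float:
    """Posterior P(error | reported effect exceeds the empirical max)."""
    p_legit = (1 - p_err) * p_new_max(n)
    return p_err / (p_err + p_legit)

# With 200 prior effects and a 5% error base rate, an above-max report
# is far more likely to be an error than a genuine new maximum.
print(p_new_max(200))                      # ≈ 0.005
print(p_error_given_above_max(200, 0.05))  # ≈ 0.914
```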

“Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021

2021-smith.pdf: “Informational Herding, Optimal Experimentation, and Contrarianism”⁠, Lones Smith, Peter Norman Sørensen, Jianrong Tian (2021-02-25; ⁠, ⁠, ; similar):

In the standard herding model⁠, privately informed individuals sequentially see prior actions and then act. An identical action herd eventually starts and public beliefs tend to “cascade sets” where social learning stops. What behaviour is socially efficient when actions ignore informational externalities?

We characterize the outcome that maximizes the discounted sum of utilities. Our 4 key findings are:

  1. Cascade sets shrink but do not vanish, and herding should occur but less readily as greater weight is attached to posterity.
  2. An optimal mechanism rewards individuals mimicked by their successor.
  3. Cascades cannot start after period one under a signal log-concavity condition.
  4. Given this condition, efficient behaviour is contrarian, leaning against the myopically more popular actions in every period.

We make 2 technical contributions: as value functions with learning are not smooth, we use monotone comparative statics under uncertainty to deduce optimal dynamic behaviour. We also adapt dynamic pivot mechanisms to Bayesian learning.

[Keywords: herding, mimicking, contrarian, cascade, efficiency, monotonicity, log-concavity]

“Hot under the Collar: A Latent Measure of Interstate Hostility”, Terechshenko 2020

2020-terechshenko.pdf: “Hot under the collar: A latent measure of interstate hostility”⁠, Zhanna Terechshenko (2020-11-17; ; similar):

The majority of studies on international conflict escalation use a variety of measures of hostility including the use of force, reciprocity, and the number of fatalities. The use of different measures, however, leads to different empirical results and creates difficulties when testing existing theories of interstate conflict. Furthermore, hostility measures currently used in the conflict literature are ill suited to the task of identifying consistent predictors of international conflict escalation. This article presents a new dyadic latent measure of interstate hostility, created using a Bayesian item-response theory model and conflict data from the Militarized Interstate Dispute (MID) and Phoenix political event datasets. This model (1) provides a more granular, conceptually precise, and validated measure of hostility, which incorporates the uncertainty inherent in the latent variable; and (2) solves the problem of temporal variation in event data using a varying-intercept structure and human-coded data as a benchmark against which biases in machine-coded data are corrected. In addition, this measurement model allows for the systematic evaluation of how existing measures relate to the construct of hostility. The presented model will therefore enhance the ability of researchers to understand factors affecting conflict dynamics, including escalation and de-escalation processes.

“What Matters More for Entrepreneurship Success? A Meta-analysis Comparing General Mental Ability and Emotional Intelligence in Entrepreneurial Settings”, Allen et al 2020

“What matters more for entrepreneurship success? A meta-analysis comparing general mental ability and emotional intelligence in entrepreneurial settings”⁠, Jared S. Allen, Regan M. Stevenson, Ernest H. O’Boyle, Scott Seibert (2020-11-03; backlinks; similar):

Using meta-analysis, we investigate the extent to which General Mental Ability (GMA) and Emotional Intelligence (EI) predict entrepreneurial success. Based on 65,826 observations, we find that both GMA and EI matter for success, but that the size of the relationship is more than twice as large for EI. Our study contradicts and adds important contextual nuance to previous meta-analyses on performance in traditional workplace settings, where GMA is considered to be more critical than EI. We also contribute to the literature on cognitive and emotional intelligence in entrepreneurship.

Managerial Summary: While previous studies have shown General Mental Ability (GMA, cognitive intelligence) to be more important for success compared to Emotional Intelligence (EI) in traditional workplace settings, we theorize that EI will be more important in entrepreneurial contexts. Entrepreneurship is an extreme setting with distinct emotional and social demands relative to many other organizational settings. Moreover, managing an entrepreneurial business has been described as an “emotional rollercoaster.” Thus, on a relative basis we expected EI to matter more in entrepreneurial contexts and explore this assumption using a meta-analysis of 65,826 observations. We find that both GMA and EI matter for entrepreneurial success, but that the size of the relationship is more than twice as large for EI.

…The dominant meta-analytic paradigm in entrepreneurship is psychometric meta-analysis (Hunter & Schmidt 2004). However, we did not choose this procedure for 2 reasons. First, the chief advantage of psychometric meta-analysis is the ability to correct for statistical artifacts such as unreliability and range restriction⁠. In our data, a large percentage of the samples did not report the needed information to make these corrections locally and the global corrections via artifact distributions with the limited number of samples that reported necessary information would likely have been strongly influenced by second order sampling error.

“From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning”, Wojtowicz & DeDeo 2020

2020-wojtowicz.pdf: “From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning”⁠, Zachary Wojtowicz, Simon DeDeo (2020-10-23; similar):


  • Recent experiments show that we value explanations for many reasons, such as predictive power and simplicity.
  • Bayesian rational analysis provides a functional account of these values, along with concrete definitions that allow us to measure and compare them across a variety of contexts, including visual perception, politics, and science.
  • These values include descriptiveness, co-explanation, and measures of simplicity such as parsimony and concision. The first two are associated with the evaluation of explanations in the light of experience, while the latter concern the intrinsic features of an explanation.
  • Failures to explain well can be understood as imbalances in these values: a conspiracy theorist, for example, may over-rate co-explanation relative to simplicity, and many similar ‘failures to explain’ that we see in social life may be analyzable at this level.

Recent work in cognitive science has uncovered a diversity of explanatory values, or dimensions along which we judge explanations as better or worse. We propose a Bayesian account of these values that clarifies their function and shows how they fit together to guide explanation-making. The resulting taxonomy shows that core values from psychology, statistics, and the philosophy of science emerge from a common mathematical framework and provide insight into why people adopt the explanations they do. This framework not only operationalizes the explanatory virtues associated with, for example, scientific argument-making, but also enables us to reinterpret the explanatory vices that drive phenomena such as conspiracy theories, delusions, and extremist ideologies.

[Keywords: explanation, explanatory values, Bayesian cognition, rational analysis, simplicity, vice epistemology]

“Meta-trained Agents Implement Bayes-optimal Agents”, Mikulik et al 2020

“Meta-trained agents implement Bayes-optimal agents”⁠, Vladimir Mikulik, Grégoire Delétang, Tom McGrath, Tim Genewein, Miljan Martic, Shane Legg, Pedro A. Ortega et al (2020-10-21; ⁠, ; backlinks; similar):

Memory-based meta-learning is a powerful technique to build agents that adapt fast to any task within a target distribution. A previous theoretical study has argued that this remarkable performance is because the meta-training protocol incentivizes agents to behave Bayes-optimally. We empirically investigate this claim on a number of prediction and bandit tasks. Inspired by ideas from theoretical computer science, we show that meta-learned and Bayes-optimal agents not only behave alike, but they even share a similar computational structure, in the sense that one agent system can simulate the other. Furthermore, we show that Bayes-optimal agents are fixed points of the meta-learning dynamics. Our results suggest that memory-based meta-learning might serve as a general technique for numerically approximating Bayes-optimal agents—that is, even for task distributions for which we currently don’t possess tractable models.

“Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020

“Learning not to learn: Nature versus nurture in silico”⁠, Robert Tjarko Lange, Henning Sprekeler (2020-10-09; ⁠, ⁠, ; backlinks; similar):

Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of meta-learning (or ‘learning to learn’) to answer when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents’ lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: One in which meta-learning yields a learning algorithm that implements task-dependent information-integration and a second regime in which meta-learning imprints a heuristic or ‘hard-coded’ behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.

“Searching for the Backfire Effect: Measurement and Design Considerations”, Swire-Thompson et al 2020

“Searching for the Backfire Effect: Measurement and Design Considerations”⁠, Briony Swire-Thompson, Joseph De Gutis, David Lazer (2020-09; backlinks; similar):

One of the most concerning notions for science communicators, fact-checkers, and advocates of truth is the backfire effect; this is when a correction leads to an individual increasing their belief in the very misconception the correction is aiming to rectify. There is currently a debate in the literature as to whether backfire effects exist at all, as recent studies have failed to find the phenomenon, even under theoretically favorable conditions.

In this review, we summarize the current state of the worldview and familiarity backfire effect literatures. We subsequently examine barriers to measuring the backfire phenomenon, discuss approaches to improving measurement and design, and conclude with recommendations for fact-checkers.

We suggest that backfire effects are not a robust empirical phenomenon, and more reliable measures, powerful designs, and stronger links between experimental design and theory could greatly help move the field ahead.

[Keywords: backfire effects, belief updating, misinformation, continued influence effect, reliability]

“Is SGD a Bayesian Sampler? Well, Almost”, Mingard et al 2020

“Is SGD a Bayesian sampler? Well, almost”⁠, Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis (2020-06-26; ; similar):

Overparameterized deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalize remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterized DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S.

Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S) and that P_B(f|S) is strongly biased towards low-error and low-complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalize so well in the overparameterized regime.

While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed new light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimizer choice affect DNN performance.

“Laplace’s Theories of Cognitive Illusions, Heuristics and Biases”, Miller & Gelman 2020

2020-miller.pdf: “Laplace’s Theories of Cognitive Illusions, Heuristics and Biases”⁠, Joshua B. Miller, Andrew Gelman (2020-06-03; similar):

In his book from the early 1800s, Essai Philosophique sur les Probabilités, the mathematician Pierre-Simon de Laplace anticipated many ideas developed within the past 50 years in cognitive psychology and behavioral economics, explaining human tendencies to deviate from norms of rationality in the presence of probability and uncertainty. A look at Laplace’s theories and reasoning is striking in how modern they seem, how much progress he made without the benefit of systematic experimentation, and the novelty of a few of his unexplored conjectures. We argue that this work points to these theories being more fundamental and less contingent on recent experimental findings than we might have thought.

“Exploring Bayesian Optimization: Breaking Bayesian Optimization into Small, Sizeable Chunks”, Agnihotri & Batra 2020

“Exploring Bayesian Optimization: Breaking Bayesian Optimization into small, sizeable chunks”⁠, Apoorv Agnihotri, Nipun Batra (2020-05-05; ; similar):

[Discussion of Bayesian optimization (BO), a decision-theoretic application of Bayesian statistics (typically using Gaussian processes for flexibility) which tries to find the best value of an expensive function in the fewest collected data points possible. This differs from classical experiment design, which tries to maximize the overall information about all points given a fixed number of samples rather than just locating the best point, and from “active learning”, which selects data points to make the model as predictive as possible. The difference can be visualized by watching posterior distributions for simple 2D problems evolve as data is collected according to BO, active-learning, or simple grid-search/random baseline strategies. The optimal strategy is usually infeasible to calculate, so various heuristics like “expected improvement” or “Thompson sampling” are used, and their differing behavior can be visualized and compared. BO is heavily used in machine learning to find the best combinations of settings for machine learning models.]

In this article, we looked at Bayesian Optimization for optimizing a black-box function. Bayesian Optimization is well suited when the function evaluations are expensive, making grid or exhaustive search impractical. We looked at the key components of Bayesian Optimization. First, we looked at the notion of using a surrogate function (with a prior over the space of objective functions) to model our black-box function. Next, we looked at the “Bayes” in Bayesian Optimization — the function evaluations are used as data to obtain the surrogate posterior. We then looked at acquisition functions, which are functions of the surrogate posterior and are optimized sequentially; this sequential optimization is inexpensive and thus of utility to us. We also looked at a few acquisition functions and showed how these different functions balance exploration and exploitation. Finally, we looked at some practical examples of Bayesian Optimization for optimizing hyper-parameters for machine learning models.
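The loop the article describes (surrogate posterior, acquisition maximization, evaluation) can be sketched end-to-end (a minimal illustration, not the article’s code; the RBF length-scale, candidate grid, and toy objective are my own assumptions):

```python
# Minimal Bayesian optimization: a Gaussian-process surrogate with an RBF
# kernel plus the expected-improvement (EI) acquisition, maximized over a
# fixed candidate grid instead of with a continuous optimizer.
import numpy as np
from math import erf

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel on 1-D inputs (k(x, x) = 1)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """GP posterior mean and stddev at x_new, given near-noiseless observations."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    Ks = rbf(x_obs, x_new)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, y_best):
    """EI: expected gain over the incumbent best observation."""
    z = (mu - y_best) / sigma
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    cdf = 0.5 * (1 + np.array([erf(t / np.sqrt(2)) for t in z]))
    return (mu - y_best) * cdf + sigma * pdf

def black_box(x):              # stand-in for an expensive objective
    return -(x - 0.7) ** 2     # true maximum at x = 0.7

grid = np.linspace(0, 1, 201)  # candidates for the acquisition argmax
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = black_box(x_obs)
for _ in range(10):            # sequential design loop
    mu, sigma = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y_obs.max()))]
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, black_box(x_next))

best_x = x_obs[np.argmax(y_obs)]
print(best_x)  # close to 0.7 after only ~13 evaluations
```

Swapping `expected_improvement` for another acquisition (eg. an upper confidence bound) changes only one line, which is part of why acquisition functions are so easy to compare visually.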

“The Social and Genetic Inheritance of Educational Attainment: Genes, Parental Education, and Educational Expansion”, Lin 2020

2020-lin.pdf: “The social and genetic inheritance of educational attainment: Genes, parental education, and educational expansion”⁠, Meng-Jung Lin (2020-02-01; ; backlinks)

“The Most ‘Abandoned’ Books on GoodReads”, Branwen 2019

GoodReads: “The Most ‘Abandoned’ Books on GoodReads”⁠, Gwern Branwen (2019-12-09; ⁠, ⁠, ; backlinks; similar):

Which books on GoodReads are most difficult to finish? Estimating proportions in December 2019 gives an entirely different result than absolute counts.

What books are hardest for a reader who starts them to finish, and most likely to be abandoned? I scrape a crowdsourced tag⁠, abandoned, from the GoodReads book social network on 2019-12-09 to estimate conditional probability of being abandoned.

The default GoodReads tag interface presents only raw counts of tags, not counts divided by total ratings ( = reads). This conflates popularity with probability of being abandoned: a popular but rarely-abandoned book may have more abandoned tags than a less popular but often-abandoned book. There is also residual error from the winner’s curse where books with fewer ratings are more mis-estimated than popular books. I fix that to see what more correct rankings look like.
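The correction can be sketched as empirical-Bayes beta-binomial shrinkage (my own illustration with made-up counts and an assumed prior, not the essay’s actual data or model):

```python
# Each book's raw abandoned/ratings fraction is shrunk toward a pooled
# Beta prior mean, pulling noisy low-rating books back toward the pack
# while leaving well-measured popular books almost untouched.

def shrunk_proportion(abandoned, ratings, prior_a, prior_b):
    """Posterior mean of a Beta-Binomial: (a + k) / (a + b + n)."""
    return (prior_a + abandoned) / (prior_a + prior_b + ratings)

# Hypothetical books: (abandoned-tag count, total ratings).
books = {
    "popular-but-finished": (500, 1_000_000),  # most tags, tiny rate
    "niche-and-abandoned":  (40, 400),         # few tags, high rate
    "tiny-sample":          (3, 10),           # raw rate 30%, but noisy
}
a, b = 2.0, 500.0  # assumed prior: mean abandon rate ~0.4%

# Raw counts rank the popular book first; shrunken proportions rank the
# niche often-abandoned book first and discount the 10-rating book.
for name, (k, n) in books.items():
    print(name, k / n, shrunk_proportion(k, n, a, b))
```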

Correcting for both changes the top-5 ranking completely, from (raw counts):

  1. The Casual Vacancy, J. K. Rowling
  2. Catch-22, Joseph Heller
  3. American Gods, Neil Gaiman
  4. A Game of Thrones, George R. R. Martin
  5. The Book Thief, Markus Zusak

to (shrunken posterior proportions):

  1. Black Leopard, Red Wolf, Marlon James
  2. Space Opera⁠, Catherynne M. Valente
  3. Little, Big, John Crowley
  4. The Witches: Salem, 1692⁠, Stacy Schiff
  5. Tender Morsels, Margo Lanagan

I also consider a model adjusting for covariates (author/​average-rating/​year), to see what books are most surprisingly often-abandoned given their pedigrees & rating etc. Abandon rates increase the newer a book is, and the lower the average rating.

Adjusting for those, the top-5 are:

  1. The Casual Vacancy, J. K. Rowling
  2. The Chemist⁠, Stephenie Meyer
  3. Infinite Jest, David Foster Wallace
  4. The Glass Bead Game, Hermann Hesse
  5. Theft by Finding: Diaries (1977–2002), David Sedaris

Books at the top of the adjusted list appear to reflect a mix of highly-popular authors changing genres, and ‘prestige’ books which are highly-rated but a slog to read.

These results are interesting for how they highlight how people read books for many reasons (such as marketing campaigns, literary prestige, or following a popular author), and this is reflected in their decision whether to continue reading or to abandon a book.

“The Propensity for Aggressive Behavior and Lifetime Incarceration Risk: A Test for Gene-environment Interaction (G × E) Using Whole-genome Data”, Barnes et al 2019

2019-barnes.pdf: “The propensity for aggressive behavior and lifetime incarceration risk: A test for gene-environment interaction (G × E) using whole-genome data”⁠, J. C. Barnes, Hexuan Liu, Ryan T. Motz, Peter T. Tanksley, Rachel Kail, Amber L. Beckley, Daniel W. Belsky et al (2019-11-01; ⁠, ; backlinks; similar):

  • Socio-genomics offers insight into gene-environment interplay.
  • We construct a genome-wide measure of genetic propensity for aggressive behavior.
  • Males with higher genetic propensity were more likely to experience incarceration.
  • But a gene-environment interaction (G × E) was observed.
  • Genetic propensity was not predictive for males raised in high education homes.

Incarceration is a disruptive event that is experienced by a considerable proportion of the United States population. Research has identified social factors that predict incarceration risk, but scholars have called for a focus on the ways that individual differences combine with social factors to affect incarceration risk. Our study is an initial attempt to heed this call using whole-genome data.

We use data from the Health and Retirement Study (HRS) (n = 6716) to construct a genome-wide measure of genetic propensity for aggressive behavior and use it to predict lifetime incarceration risk. We find that participants with a higher genetic propensity for aggression are more likely to experience incarceration, but the effect is stronger for males than females. Importantly, we identify a gene-environment interaction (G × E)—genetic propensity is reduced, substantively and statistically, to a non-significant predictor for males raised in homes where at least one parent graduated high school.

We close by placing these findings in the broader context of concerns that have been raised about genetics research in criminology.

[Keywords: lifetime incarceration, genome-wide polygenic score (PGS), parental educational attainment, gene-environment interaction (G × E)]

“Bayesian Parameter Estimation Using Conditional Variational Autoencoders for Gravitational-wave Astronomy”, Gabbard et al 2019

“Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy”⁠, Hunter Gabbard, Chris Messenger, Ik Siong Heng, Francesco Tonolini, Roderick Murray-Smith (2019-09-13; ; similar):

Gravitational wave (GW) detection is now commonplace and as the sensitivity of the global network of GW detectors improves, we will observe 𝒪(100)s of transient GW events per year. The current methods used to estimate their source parameters employ optimally sensitive but computationally costly Bayesian inference approaches where typical analyses have taken between 6 hours and 5 days. For binary neutron star and neutron star black hole systems, prompt counterpart electromagnetic (EM) signatures are expected on timescales of 1s–1min, and the current fastest method for alerting EM follow-up observers can provide estimates in 𝒪(1) minutes on a limited range of key source parameters.

Here we show that a conditional variational autoencoder pre-trained on binary black hole signals can return Bayesian posterior probability estimates.

The training procedure need only be performed once for a given prior parameter space and the resulting trained machine can then generate samples describing the posterior distribution ~6 orders of magnitude faster than existing techniques.

“New Paradigms in the Psychology of Reasoning”, Oaksford & Chater 2019

2020-oaksford.pdf: “New Paradigms in the Psychology of Reasoning”⁠, Mike Oaksford, Nick Chater (2019-09-12; similar):

The psychology of verbal reasoning initially compared performance with classical logic. In the last 25 years, a new paradigm has arisen, which focuses on knowledge-rich reasoning for communication and persuasion and is typically modeled using Bayesian probability theory rather than logic. This paradigm provides a new perspective on argumentation, explaining the rational persuasiveness of arguments that are logical fallacies. It also helps explain how and why people stray from logic when given deductive reasoning tasks. What appear to be erroneous responses, when compared against logic, often turn out to be rationally justified when seen in the richer rational framework of the new paradigm. Moreover, the same approach extends naturally to inductive reasoning tasks, in which people extrapolate beyond the data they are given and logic does not readily apply. We outline links between social and individual reasoning and set recent developments in the psychology of reasoning in the wider context of Bayesian cognitive science.

“Estimating Distributional Models With Brms: Additive Distributional Models”, Bürkner 2019

“Estimating Distributional Models with brms: Additive Distributional Models”⁠, Paul Bürkner (2019-08-29; ; backlinks; similar):

This vignette provides an introduction on how to fit distributional regression models with brms. We use the term distributional model to refer to a model, in which we can specify predictor terms for all parameters of the assumed response distribution.

In the vast majority of regression model implementations, only the location parameter (usually the mean) of the response distribution depends on the predictors and corresponding regression parameters. Other parameters (eg. scale or shape parameters) are estimated as auxiliary parameters assuming them to be constant across observations. This assumption is so common that most researchers applying regression models are often (in my experience) not aware of the possibility of relaxing it. This is understandable insofar as relaxing this assumption drastically increases model complexity and thus makes models hard to fit. Fortunately, brms uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models so that model complexity is much less of an issue.

…In the examples so far, we did not have multilevel data and thus did not fully use the capabilities of the distributional regression framework of brms. In the example presented below, we will not only show how to deal with multilevel data in distributional models, but also how to incorporate smooth terms (ie. splines) into the model. In many applications, we have no or only a very vague idea of what the relationship between a predictor and the response looks like. A very flexible approach to tackle this problem is to use splines and let them figure out the form of the relationship.
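The core idea, separate predictors for the mean and the scale, can be sketched outside of brms/Stan (a Python illustration of the concept only; brms itself compiles such models to Stan and fits them by MCMC rather than maximum likelihood):

```python
# A distributional Gaussian model: both the mean and the (log) scale get
# their own linear predictor, and all four coefficients are fit jointly
# by maximum likelihood instead of assuming constant residual variance.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
# Simulated data: mean and log-sd both depend on x.
y = (1.0 + 2.0 * x) + np.exp(-0.5 + 1.0 * x) * rng.normal(size=500)

def neg_log_lik(theta):
    b0, b1, g0, g1 = theta
    mu = b0 + b1 * x             # linear predictor for the mean
    sigma = np.exp(g0 + g1 * x)  # log-link keeps the scale positive
    return np.sum(np.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2)

fit = minimize(neg_log_lik, x0=np.zeros(4), method="BFGS")
b0, b1, g0, g1 = fit.x
print(b0, b1, g0, g1)  # recovers roughly (1, 2, -0.5, 1)
```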

“Allocation to Groups: Examples of Lord's Paradox”, Wright 2019

2019-wright.pdf: “Allocation to groups: Examples of Lord's paradox”⁠, Daniel B. Wright (2019-07-12; backlinks; similar):

Background: Educational and developmental psychologists often examine how groups change over time. 2 analytic procedures—analysis of covariance (ANCOVA) and the gain score model—each seem well suited for the simplest situation, with just 2 groups and 2 time points. They can produce different results, what is known as Lord’s paradox⁠.

Aims: Several factors should influence a researcher’s analytic choice, including whether the score from the initial time influences how people are assigned to groups. Examples of educational relevance are shown, which will help explain this to researchers and students. It is shown that a common method used to measure school effectiveness is biased against schools that serve students from groups that are historically poor-performing.

Methods and results: The examples come from sports and measuring educational effectiveness (eg. for teachers or schools). A simulation study shows that if the covariate influences group allocation, the ANCOVA is preferred, but otherwise, the gain score model may be appropriate. Regression towards the mean is used to account for these findings.
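That simulation result is easy to reproduce in miniature (my own sketch, not the paper’s code): with no true treatment effect and allocation driven by the pretest, the gain-score contrast is badly biased by regression toward the mean, while ANCOVA recovers the null.

```python
# Lord's-paradox simulation: group membership is determined by the
# pretest, the posttest depends only on the pretest (no treatment
# effect), and the two analytic procedures disagree.
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
pre = rng.normal(size=n)
group = (pre > 0).astype(float)        # allocation driven by the covariate
post = 0.5 * pre + rng.normal(size=n)  # regression toward the mean, no effect

# Gain-score estimate: difference in mean (post - pre) between groups.
gain = post - pre
gain_est = gain[group == 1].mean() - gain[group == 0].mean()

# ANCOVA estimate: coefficient on `group` in post ~ 1 + pre + group.
X = np.column_stack([np.ones(n), pre, group])
ancova_est = np.linalg.lstsq(X, post, rcond=None)[0][2]

print(round(gain_est, 2))    # ≈ -0.80: a spurious "harmful" effect
print(round(ancova_est, 2))  # ≈ 0.00: the correct null
```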

Conclusions: Analysts should consider the relationship between the covariate and group allocation when deciding upon their analytic method. Because the influence of the covariate on group allocation may be complex, the appropriate method may be complex. Because the influence of the covariate on group allocation may be unknown, the choice of method may require several assumptions.

[Keywords: Lord’s paradox, value-added models⁠, ANCOVA, educator equity]

“Evolutionary Implementation of Bayesian Computations”, Czégel et al 2019

“Evolutionary implementation of Bayesian computations”⁠, Dániel Czégel, Hamza Giaffar, István Zachar, Eörs Szathmáry (2019-06-28; ⁠, ; backlinks; similar):

A wide variety of human and non-human behavior is computationally well accounted for by probabilistic generative models, formalized consistently in a Bayesian framework.

Recently, it has been suggested that another family of adaptive systems, namely those governed by Darwinian evolutionary dynamics, is capable of implementing building blocks of Bayesian computations. These algorithmic similarities rely on the analogous competition dynamics of generative models and of Darwinian replicators to fit possibly high-dimensional and stochastic environments. Identified computational building blocks include Bayesian update over a single variable and replicator dynamics, transition between hidden states and mutation, and Bayesian inference in hierarchical models and multilevel selection.

Here we provide a coherent mathematical discussion of these observations in terms of Bayesian graphical models and a step-by-step introduction to their evolutionary interpretation. We also extend existing results by adding two missing components: a correspondence between likelihood optimization and phenotypic adaptation, and between expectation-maximization-like dynamics in mixture models and ecological competition.

These correspondences suggest a deeper algorithmic analogy between evolutionary dynamics and statistical learning, pointing towards a unified computational understanding of mechanisms Nature invented to adapt to high-dimensional and uncertain environments.
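The simplest of these building blocks, the correspondence between a Bayesian update and a discrete replicator step, can be checked numerically (a sketch of the correspondence, not the authors’ code): both map frequencies p_i to p_i * w_i / sum_j p_j * w_j, with fitness playing the role of likelihood.

```python
# Bayes' rule and the discrete-time replicator equation are the same map
# on the simplex: multiply each frequency by its likelihood/fitness and
# renormalize by the mean.
import numpy as np

def bayes_update(prior, likelihood):
    post = prior * likelihood
    return post / post.sum()

def replicator_step(freqs, fitness):
    mean_fitness = freqs @ fitness
    return freqs * fitness / mean_fitness

p = np.array([0.5, 0.3, 0.2])  # prior beliefs / type frequencies
w = np.array([0.9, 0.1, 0.4])  # likelihoods / fitnesses
print(bayes_update(p, w))      # identical outputs
print(replicator_step(p, w))
```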

“How Should We Critique Research?”, Branwen 2019

Research-criticism: “How Should We Critique Research?”⁠, Gwern Branwen (2019-05-19; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.

Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of meta-science have demonstrated that much research is of low quality and often false.

But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?

I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.

This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.

If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.

“Reinforcement Learning, Fast and Slow”, Botvinick et al 2019

“Reinforcement Learning, Fast and Slow”⁠, Matthew Botvinick, Sam Ritter, Jane X. Wang, Zeb Kurth-Nelson, Charles Blundell, Demis Hassabis (2019-05-16; ⁠, ⁠, ; backlinks; similar):

Recent AI research has given rise to powerful techniques for deep reinforcement learning. In their combination of representation learning with reward-driven behavior, deep reinforcement learning would appear to have inherent interest for psychology and neuroscience.

One reservation has been that deep reinforcement learning procedures demand large amounts of training data, suggesting that these algorithms may differ fundamentally from those underlying human learning.

While this concern applies to the initial wave of deep RL techniques, subsequent AI work has established methods that allow deep RL systems to learn more quickly and efficiently. Two particularly interesting and promising techniques center, respectively, on episodic memory and meta-learning. Alongside their interest as AI techniques, deep RL methods leveraging episodic memory and meta-learning have direct and interesting implications for psychology and neuroscience. One subtle but critically important insight which these techniques bring into focus is the fundamental connection between fast and slow forms of learning.

Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample-inefficient—that is, it may simply be too slow—to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.

“Meta Reinforcement Learning As Task Inference”, Humplik et al 2019

“Meta reinforcement learning as task inference”⁠, Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, Nicolas Heess (2019-05-15; ⁠, ; similar):

Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent’s observations with an estimate of the belief about the task based on past experience. However, estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.
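The reduction described above — augmenting observations with a belief over the hidden task — can be sketched as a discrete Bayes filter (a minimal illustration with an invented two-task observation model; the paper learns this belief rather than computing it exactly):

```python
import numpy as np

def update_belief(belief, obs, obs_likelihoods):
    """One Bayes-filter step: belief_i ∝ belief_i * P(obs | task_i)."""
    belief = belief * np.array([lik[obs] for lik in obs_likelihoods])
    return belief / belief.sum()

# Two hypothetical tasks with different observation distributions over {0, 1}.
obs_likelihoods = [
    {0: 0.8, 1: 0.2},  # task A mostly emits 0
    {0: 0.3, 1: 0.7},  # task B mostly emits 1
]
belief = np.array([0.5, 0.5])       # uniform prior over tasks
for obs in [0, 0, 1, 0]:            # a short observation history
    belief = update_belief(belief, obs, obs_likelihoods)

# The belief concentrates on task A; appending it to the raw observation
# recovers a Markov state for a standard RL algorithm.
assert belief[0] > belief[1]
```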

“Structural Equation Models As Computation Graphs”, Kesteren & Oberski 2019

“Structural Equation Models as Computation Graphs”⁠, Erik-Jan van Kesteren, Daniel L. Oberski (2019-05-11; similar):

Structural equation modeling (SEM) is a popular tool in the social and behavioural sciences, where it is being applied to ever more complex data types. The high-dimensional data produced by modern sensors, brain images, or (epi)genetic measurements require variable selection using parameter penalization; experimental models combining disparate data sources benefit from regularization to obtain a stable result; and genomic SEM or network models lead to alternative objective functions. With each proposed extension, researchers currently have to completely reformulate SEM and its optimization algorithm—a challenging and time-consuming task.

In this paper, we consider each SEM as a computation graph, a flexible method of specifying objective functions borrowed from the field of deep learning. When combined with state-of-the-art optimizers, our computation graph approach can extend SEM without the need for bespoke software development. We show that both existing and novel SEM improvements follow naturally from our approach. To demonstrate, we discuss least absolute deviation estimation and penalized regression models. We also introduce spike-and-slab SEM, which may perform better when shrinkage of large factor loadings is not desired. By applying computation graphs to SEM, we hope to greatly accelerate the process of developing SEM techniques, paving the way for new applications. We provide an accompanying R package tensorsem.

“Meta-learning of Sequential Strategies”, Ortega et al 2019

“Meta-learning of Sequential Strategies”⁠, Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess et al (2019-05-08; ⁠, ⁠, ; backlinks; similar):

In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
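The "state-machine of sufficient statistics" has a familiar conjugate special case: for Bernoulli observations under a Beta prior, the entire Bayes-filtered history compresses to two counts (a generic textbook sketch, not the paper's meta-learned agents):

```python
# A memory that carries only sufficient statistics: for a Beta-Bernoulli
# model, (successes, failures) summarize the whole observation history.

def step(state, obs):
    """Update the sufficient statistics with one Bernoulli observation."""
    successes, failures = state
    return (successes + obs, failures + (1 - obs))

def predictive(state):
    """Posterior predictive P(next obs = 1) under a Beta(1, 1) prior."""
    successes, failures = state
    return (successes + 1) / (successes + failures + 2)

state = (0, 0)                      # Beta(1, 1) prior: no data yet
for obs in [1, 1, 0, 1]:
    state = step(state, obs)

assert state == (3, 1)
assert predictive(state) == (3 + 1) / (3 + 1 + 2)   # = 2/3
```

A memory-based meta-learner trained on such tasks need only learn to implement this two-number state machine in its recurrent dynamics, which is the sense in which its strategy amortizes Bayes-filtered data.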

“Fermi Calculation Examples”, Branwen 2019

Fermi: “Fermi Calculation Examples”⁠, Gwern Branwen (2019-03-29; backlinks; similar):

Fermi estimates or problems are quick heuristic solutions to apparently insoluble quantitative problems rewarding clever use of real-world knowledge and critical thinking; bibliography of some examples.

A short discussion of “Fermi calculations”: quick-and-dirty approximate answers to quantitative questions which prize cleverness in exploiting implications of common knowledge or basic principles in giving reasonable answers to apparently unanswerable questions.

Links to discussions of Fermi estimates, and a list of some Fermi estimates I’ve done.

“Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019

2019-isakov.pdf: “Is the FDA too conservative or too aggressive?: A Bayesian decision analysis of clinical trial design”⁠, Leah Isakov, Andrew W. Lo, Vahid Montazerhodjat (2019-01-04; ⁠, ; similar):

Implicit in the drug-approval process is a host of decisions—target patient population, control group, primary endpoint, sample size, follow-up period, etc.—all of which determine the trade-off between Type I and Type II error. We explore the application of Bayesian decision analysis (BDA) to minimize the expected cost of drug approval, where the relative costs of the two types of errors are calibrated using U.S. Burden of Disease Study 2010 data. The results for conventional fixed-sample randomized clinical-trial designs suggest that for terminal illnesses with no existing therapies such as pancreatic cancer, the standard threshold of 2.5% is substantially more conservative than the BDA-optimal threshold of 23.9% to 27.8%. For relatively less deadly conditions such as prostate cancer, 2.5% is more risk-tolerant or aggressive than the BDA-optimal threshold of 1.2% to 1.5%. We compute BDA-optimal sizes for 25 of the most lethal diseases and show how a BDA-informed approval process can incorporate all stakeholders’ views in a systematic, transparent, internally consistent, and repeatable manner.
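The underlying decision analysis can be sketched numerically: choose the significance threshold α that minimizes expected cost, where the relative costs of a false approval (Type I) and a missed effective drug (Type II) drive the answer (illustrative costs, prior, and power curve — not the paper's Burden-of-Disease calibration):

```python
from statistics import NormalDist

N = NormalDist()

def expected_cost(alpha, c_fp, c_fn, p_effective=0.5, effect=2.8):
    """Expected cost of the approval decision at one-sided significance
    threshold alpha: Type I cost if the drug is ineffective, Type II
    cost if an effective drug is missed."""
    z_alpha = N.inv_cdf(1 - alpha)      # critical value of the z-test
    beta = N.cdf(z_alpha - effect)      # Type II error rate at this threshold
    return (1 - p_effective) * alpha * c_fp + p_effective * beta * c_fn

alphas = [i / 1000 for i in range(1, 500)]

# Terminal illness: a missed effective drug is far costlier than a false
# approval, so the cost-minimizing threshold is far looser than 2.5%...
opt_terminal = min(alphas, key=lambda a: expected_cost(a, c_fp=1, c_fn=20))

# ...while comparable error costs pull the optimal threshold back down.
opt_mild = min(alphas, key=lambda a: expected_cost(a, c_fp=1, c_fn=1))

assert opt_terminal > opt_mild
```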

“Bayesian Statistics in Sociology: Past, Present, and Future”, Lynch & Bartlett 2019

2019-lynch.pdf: “Bayesian Statistics in Sociology: Past, Present, and Future”⁠, Scott M. Lynch, Bryce Bartlett (2019; similar):

Although Bayes’ theorem has been around for more than 250 years, widespread application of the Bayesian approach only began in statistics in 1990. By 2000, Bayesian statistics had made considerable headway into social science, but even now its direct use is rare in articles in top sociology journals, perhaps because of a lack of knowledge about the topic. In this review, we provide an overview of the key ideas and terminology of Bayesian statistics, and we discuss articles in the top journals that have used or developed Bayesian methods over the last decade. In this process, we elucidate some of the advantages of the Bayesian approach. We highlight that many sociologists are, in fact, using Bayesian methods, even if they do not realize it, because techniques deployed by popular software packages often involve Bayesian logic and/​or computation. Finally, we conclude by briefly discussing the future of Bayesian statistics in sociology.

“Approximate Bayesian Computation”, Beaumont 2019

2019-beaumont.pdf: “Approximate Bayesian Computation”⁠, Mark A. Beaumont (2019; similar):

Many of the statistical models that could provide an accurate, interesting, and testable explanation for the structure of a data set turn out to have intractable likelihood functions. The method of approximate Bayesian computation (ABC) has become a popular approach for tackling such models. This review gives an overview of the method and the main issues and challenges that are the subject of current research.
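In its simplest rejection form, ABC requires only the ability to simulate from the model (a generic textbook sketch inferring a Bernoulli success rate, with invented data and tolerance):

```python
import random

random.seed(0)

def simulate(theta, n=100):
    """Forward-simulate the model: count successes in n Bernoulli(theta) trials."""
    return sum(random.random() < theta for _ in range(n))

observed = 30      # observed number of successes out of 100
epsilon = 2        # acceptance tolerance on the summary statistic

# Rejection ABC: draw theta from the prior, simulate, and keep theta only
# if the simulated summary lands within epsilon of the observed one.
accepted = []
while len(accepted) < 500:
    theta = random.random()             # Uniform(0, 1) prior
    if abs(simulate(theta) - observed) <= epsilon:
        accepted.append(theta)

posterior_mean = sum(accepted) / len(accepted)
# The approximate posterior concentrates near the observed rate of 0.3.
assert 0.25 < posterior_mean < 0.35
```

The likelihood is never evaluated; accepted draws approximate the posterior, with the tolerance ε trading accuracy against acceptance rate — the central tension current ABC research addresses.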

“Accounting Theory As a Bayesian Discipline”, Johnstone 2018

2018-johnstone.pdf: “Accounting Theory as a Bayesian Discipline”⁠, David Johnstone (2018-12-28; ⁠, ; similar):

Accounting Theory as a Bayesian Discipline introduces Bayesian theory and its role in statistical accounting information theory. The Bayesian statistical logic of probability, evidence and decision lies at the historical and modern center of accounting thought and research. It is not only the presumed rule of reasoning in analytical models of accounting disclosure, it is the default position for empiricists when hypothesizing about how the users of financial statements think. Bayesian logic comes to light throughout accounting research and is the soul of most strategic disclosure models. In addition, Bayesianism is similarly a large part of the stated and unstated motivation of empirical studies of how market prices and their implied costs of capital react to better financial disclosure.

The approach taken in this monograph is a Demski 1973-like treatment of “accounting numbers” as “signals” rather than as “measurements”. Of course, “good” measurements like “quality earnings” reports generally make better signals. However, to be useful for decision making under uncertainty, accounting measurements need to have more than established accounting measurement virtues. This monograph explains what those Bayesian information attributes are, where they come from in Bayesian theory, and how they apply in statistical accounting information theory.

The Bayesian logic of probability, evidence and decision is the presumed rule of reasoning in analytical models of accounting disclosure. Any rational explication of the decades-old accounting notions of “information content”, “value relevance”, “decision useful”, and possibly conservatism, is inevitably Bayesian. By raising some of the probability principles, paradoxes and surprises in Bayesian theory, intuition in accounting theory about information, and its value, can be tested and enhanced. Of all the branches of the social sciences, accounting information theory begs Bayesian insights.

This monograph lays out the main logical constructs and principles of Bayesianism, and relates them to important contributions in the theoretical accounting literature. The approach taken is essentially “old-fashioned” normative statistics, building on the expositions of Demski, Ijiri, Feltham and other early accounting theorists who brought Bayesian theory to accounting theory. Some history of this nexus, and the role of business schools in the development of Bayesian statistics in the 1950–1970s, is described. Later developments in accounting, especially noisy rational expectations models under which the information reported by firms is endogenous, rather than unaffected or “drawn from nature”, make the task of Bayesian inference more difficult yet no different in principle.

The information user must still revise beliefs based on what is reported. The extra complexity is that users must allow for the firm’s perceived disclosure motives and other relevant background knowledge in their Bayesian models. A known strength of Bayesian modelling is that subjective considerations are admitted and formally incorporated. Allowances for perceived self-interest or biased reporting, along with any other apparent signal defects or “information uncertainty”, are part and parcel of Bayesian information theory.

  1. Introduction

  2. Bayesianism Early in Accounting Theory

    1. Rise of Bayesian statistics
    2. Bayes in US business schools
    3. Early Bayesian accounting theorists
    4. Postscript
  3. Survey of Bayesian Fundamentals

    1. All probability is subjective
    2. Inference comes first
    3. Bayesian learning
    4. No objective priors
    5. Independence is subjective
    6. No distinction between risk and uncertainty
    7. The likelihood function (ie. model)
    8. Sufficiency and the likelihood principle
    9. Coherence
    10. Coherent means no “Dutch book”
    11. Coherent is not necessarily accurate
    12. Accuracy is relative
    13. Odds form of Bayes theorem
    14. Data can’t speak for itself
    15. Ancillary information
    16. Nuisance parameters “integrate out”
    17. “Randomness” is subjective
    18. “Exchangeable” samples
    19. The Bayes factor
    20. Conditioning on all evidence
    21. Bayesian versus conventional inference
    22. Simpson’s paradox
    23. Data swamps prior
    24. Stable estimation
    25. Cromwell’s rule
    26. Decisions follow inference
    27. Inference, not estimation
    28. Calibration
    29. Economic scoring rules
    30. Market scoring rules
    31. Measures of information
    32. Ex ante versus ex post accuracy
    33. Sampling to forgone conclusion
    34. Predictive distributions
    35. Model averaging
    36. Definition of a subjectivist Bayesian
    37. What makes a Bayesian?
    38. Rise of Bayesianism in data science
  4. Case Study: Using All the Evidence

    1. Interpreting “p-level ≤ α”
    2. Bayesian interpretation of frequentist reports
    3. A generic inference problem
  5. Is Accounting Bayesian or Frequentist?

    1. 2 Bayesian schools in accounting
    2. Markowitz, subjectivist Bayesian
    3. Characterization of information in accounting
    4. Why accounting literature emphasizes “precision”
    5. Bayesian description of information quality
    6. Likelihood function of earnings
    7. Capturing conditional conservatism
  6. Decision Support Role of Accounting Information

    1. A formal Bayesian model
    2. Parallels with meteorology
    3. Bayesian fundamental analysis
  7. Demski’s (1973) Impossibility Result

    1. Example: binary accounting signals
    2. Conservatism and the user’s risk aversion
  8. Does Information Reduce Uncertainty?

    1. Beaver’s (1968) prescription
    2. Bayesian basics
    3. Contrary views in accounting
    4. Bayesian roots in finance
    5. The general Bayesian law
    6. Rogers et al 2009
    7. Dye & Hughes 2018
    8. Why a Predictive Distribution?
    9. Limits to certainty
    10. Lewellen & Shanken 2002
    11. Neururer et al 2016
    12. Veronesi 1999
  9. How Information Combines

    1. Combining 2 risky signals
  10. Ex Ante Effect of Greater Risk/​Uncertainty

    1. Risk adds to ex ante expected utility
    2. Implications for Bayesian decision analysis
    3. Volatility pumping
  11. Ex Post Decision Outcomes: 1. Practical investment

    1. Economic Darwinism
    2. Bayesian Darwinian selection
    3. Good probability assessments
    4. Implications for accounting information
  12. Information Uncertainty

    1. Bayesian definition of information uncertainty
    2. Bayesian treatment of information uncertainty
    3. Model risk as information risk
  13. Conditioning Beliefs and the Cost of Capital

    1. Numerical example
    2. Interpretation
  14. Reliance on the Normal-Normal Model

    1. Intuitive counter-example
    2. Appeal to the normal-normal model in accounting
    3. Unknown variance, increasing after observation
    4. Beyer 2009
    5. Armstrong et al 2016

  15. Bayesian Subjective Beta

    1. Core et al 2015
    2. Verrecchia 2001: Understated influence of the mean
    3. Decision analysis effect of the mean
  16. Other Bayesian Points of Interest

    1. Accounting input in prediction models
    2. Earnings quality and accurate probability assessments
    3. Expected variance as a measure of information
    4. Information stays relevant
    5. Bayesian view of earnings management
    6. Numerator versus denominator news
    7. Mixtures of normals
    8. Information content
    9. Fundamental versus information risk
    10. When information adds to information asymmetry
    11. Value of independent information sources
    12. How might market probabilities behave?
    13. “Idiosyncratic” versus “undiversifiable” information
  17. Conclusion

  18. References

“The Bayesian Superorganism III: Externalized Memories Facilitate Distributed Sampling”, Hunt et al 2018

“The Bayesian Superorganism III: externalized memories facilitate distributed sampling”⁠, Edmund R. Hunt, Nigel R. Franks, Roland J. Baddeley (2018-12-21; ; similar):

A key challenge for any animal is to avoid wasting time by searching for resources in places it has already found to be unprofitable. This challenge is particularly strong when the organism is a central place forager—returning to a nest between foraging bouts—because it is destined repeatedly to cover much the same ground. Furthermore, this problem will reach its zenith if many individuals forage from the same central place, as in social insects.

Foraging performance may be greatly enhanced by coordinating movement trajectories such that each ant visits separate parts of the surrounding (unknown) space. In this third of three papers, we find experimental evidence for an externalized spatial memory in Temnothorax albipennis ants: chemical markers (either pheromones or other cues such as cuticular hydrocarbon footprints) that are used by nest-mates to mark explored space. We show these markers could be used by the ants to scout the space surrounding their nest more efficiently through indirect coordination.

We also develop a simple model of this marking behaviour that can be applied in the context of Markov chain Monte Carlo methods (see part two of this series). This substantially enhances the performance of standard methods like the Metropolis–Hastings algorithm in sampling from sparse probability distributions (such as those confronted by the ants) with little additional computational cost.

Our Bayesian framework for superorganismal behaviour motivates the evolution of exploratory mechanisms such as trail marking in terms of enhanced collective information processing.
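The Metropolis–Hastings baseline that the marking model is reported to improve on can be sketched in a few lines (a generic random-walk sampler on an invented 1-D "resource patch" target, not the authors' spatial model):

```python
import math
import random

random.seed(1)

def log_target(x):
    """Log of an unnormalized sparse target: a narrow Gaussian 'resource
    patch' at x = 3 that a blind random walk must discover."""
    return -0.5 * ((x - 3.0) / 0.5) ** 2

# Random-walk Metropolis-Hastings: propose a local move, accept with
# probability min(1, target(proposal) / target(current)).
x, samples = 0.0, []
for _ in range(20000):
    proposal = x + random.gauss(0.0, 1.0)
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

burned = samples[5000:]          # discard burn-in
mean = sum(burned) / len(burned)
assert abs(mean - 3.0) < 0.2     # the chain has found the patch
```

Because only the ratio of target densities enters the acceptance rule, the sampler works with unnormalized distributions — but, as with an unmarked foraging ant, it wastes many proposals revisiting ground it has already covered.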

“Evolution As Backstop for Reinforcement Learning”, Branwen 2018

Backstop: “Evolution as Backstop for Reinforcement Learning”⁠, Gwern Branwen (2018-12-06; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Markets/​evolution as backstops/​ground truths for reinforcement learning/​optimization: on some connections between Coase’s theory of the firm/​linear optimization/​DRL/​evolution/​multicellular life/​pain/​Internet communities as multi-level optimization problems.

One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting this phenomenon as a multi-level nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming (a group-selection perspective). So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/​optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/​non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.

“SMPY Bibliography”, Branwen 2018

SMPY: “SMPY Bibliography”⁠, Gwern Branwen (2018-07-28; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

An annotated fulltext bibliography of publications on the Study of Mathematically Precocious Youth (SMPY), a longitudinal study of high-IQ youth.

SMPY (Study of Mathematically Precocious Youth) is a long-running longitudinal survey of extremely mathematically-talented or intelligent youth, which has been following high-IQ cohorts since the 1970s. It has provided the largest and most concrete findings about the correlates and predictive power of screening extremely intelligent children, and revolutionized gifted & talented educational practices.

Because it has been running for over 40 years, SMPY-related publications are difficult to find; many early papers were published only in long-out-of-print books and are not available in any other way. Others are digitized and more accessible, but one must already know they exist. Between these barriers, SMPY information is less widely available & used than it should be given its importance.

To fix this, I have been gradually going through all SMPY citations and making fulltext copies available online with occasional commentary.

“Deep Learning Generalizes Because the Parameter-function Map Is Biased towards Simple Functions”, Valle-Pérez et al 2018

“Deep learning generalizes because the parameter-function map is biased towards simple functions”⁠, Guillermo Valle-Pérez, Chico Q. Camargo, Ard A. Louis (2018-05-22; ; similar):

Deep neural networks (DNNs) generalize remarkably well without explicit regularization even in the strongly over-parameterized regime where classical learning theory would instead predict that they would severely overfit. While many proposals for some kind of implicit regularization have been made to rationalize this success, there is no consensus for the fundamental reason why DNNs do not strongly overfit.

In this paper, we provide a new explanation. By applying a very general probability-complexity bound recently derived from algorithmic information theory (AIT), we argue that the parameter-function map of many DNNs should be exponentially biased towards simple functions. We then provide clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks applied to CIFAR10 and MNIST⁠.

As the target functions in many real problems are expected to be highly structured, this intrinsic simplicity bias helps explain why deep networks generalize well on real world problems. This picture also facilitates a novel PAC-Bayes approach where the prior is taken over the DNN input-output function space, rather than the more conventional prior over parameter space. If we assume that the training algorithm samples parameters close to uniformly within the zero-error region then the PAC-Bayes theorem can be used to guarantee good expected generalization for target functions producing high-likelihood training sets.

By exploiting recently discovered connections between DNNs and Gaussian processes to estimate the marginal likelihood, we produce relatively tight generalization PAC-Bayes error bounds which correlate well with the true error on realistic datasets such as MNIST and CIFAR10 and for architectures including convolutional and fully connected networks.
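A McAllester-style PAC-Bayes bound of this general kind can be evaluated directly from its formula (the generic bound with illustrative numbers; the paper's own bounds additionally use a GP-based estimate of the marginal likelihood):

```python
import math

def pac_bayes_bound(emp_risk, kl, m, delta=0.05):
    """McAllester-style bound: with probability >= 1 - delta over the sample,
    risk(Q) <= emp_risk(Q) + sqrt((KL(Q||P) + ln(2*sqrt(m)/delta)) / (2m))."""
    return emp_risk + math.sqrt(
        (kl + math.log(2 * math.sqrt(m) / delta)) / (2 * m))

# Illustrative numbers: a posterior close to the prior (small KL) on a
# large training set gives a non-vacuous guarantee...
tight = pac_bayes_bound(emp_risk=0.01, kl=50.0, m=60000)
# ...while a large KL on the same data loosens the bound substantially.
loose = pac_bayes_bound(emp_risk=0.01, kl=50000.0, m=60000)

assert tight < 0.05
assert loose > tight
```

This is why placing the prior over function space matters: a simplicity-biased parameter-function map keeps the KL term small for the functions networks actually learn, tightening the bound.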

“On Having Enough Socks”, Branwen 2017

Socks: “On Having Enough Socks”⁠, Gwern Branwen (2017-11-22; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Personal experience and surveys on running out of socks; discussion of socks as small example of human procrastination and irrationality, caused by lack of explicit deliberative thought where no natural triggers or habits exist.

After running out of socks one day, I reflected on how ordinary tasks get neglected. Anecdotally and in 3 online surveys, people report often not having enough socks, a problem which correlates with rarity of sock purchases and demographic variables, consistent with a neglect/​procrastination interpretation: because there is no specific time or triggering factor to replenish a shrinking sock stockpile, it is easy to run out.

This reminds me of akrasia on minor tasks, ‘yak shaving’, and the nature of disaster in complex systems: lack of hard rules lets errors accumulate, without any ‘global’ understanding of the drift into disaster (or at least inefficiency). Humans on a smaller scale also ‘drift’ when they engage in System I reactive thinking & action for too long, resulting in cognitive biases⁠. An example of drift is the generalized human failure to explore/​experiment adequately, resulting in overly greedy exploitative behavior of the current local optimum. Grocery shopping provides a case study: despite large gains, most people do not explore, perhaps because there is no established routine or practice involving experimentation. Fixes for these things can be seen as ensuring that System II deliberative cognition is periodically invoked to review things at a global level, such as developing a habit of maximum exploration at first purchase of a food product, or annually reviewing possessions to note problems like a lack of socks.

While socks may be small things, they may reflect big things.

“Implicit Causal Models for Genome-wide Association Studies”, Tran & Blei 2017

“Implicit Causal Models for Genome-wide Association Studies”⁠, Dustin Tran, David M. Blei (2017-10-30; ⁠, ; similar):

Progress in probabilistic generative models has accelerated, developing richer models with neural architectures, implicit densities, and with scalable algorithms for their Bayesian inference. However, there has been limited progress in models that capture causal relationships, for example, how individual genetic factors cause major human diseases.

In this work, we focus on two challenges in particular:

How do we build richer causal models, which can capture highly nonlinear relationships and interactions between multiple causes?

How do we adjust for latent confounders, which are variables influencing both cause and effect and which prevent learning of causal relationships?

To address these challenges, we synthesize ideas from causality and modern probabilistic modeling.

For the first, we describe implicit causal models, a class of causal models that leverages neural architectures with an implicit density.

For the second, we describe an implicit causal model that adjusts for confounders by sharing strength across examples.

In experiments, we scale Bayesian inference on up to a billion genetic measurements. We achieve state-of-the-art accuracy for identifying causal factors: we significantly outperform existing genetics methods by an absolute difference of 15–45.3%.

“A Rational Choice Framework for Collective Behavior”, Krafft 2017

“A Rational Choice Framework for Collective Behavior”⁠, Peter M. Krafft (2017-09; ; backlinks; similar):

As the world becomes increasingly digitally mediated, people can more and more easily form groups, teams, and communities around shared interests and goals. Yet there is a constant struggle across forms of social organization to maintain stability and coherency in the face of disparate individual experiences and agendas. When are collectives able to function and thrive despite these challenges?

In this thesis I propose a theoretical framework for reasoning about collective intelligence—the ability of people to accomplish their shared goals together. A simple result from the literature on multiagent systems suggests that strong general collective intelligence in the form of “rational group agency” arises from three conditions: aligned utilities, accurate shared beliefs, and coordinated actions. However, achieving these conditions can be difficult, as evidenced by impossibility results related to each condition from the literature on social choice, belief aggregation, and distributed systems.

The theoretical framework I propose serves as a point of inspiration to study how human groups address these difficulties. To this end, I develop computational models of facets of human collective intelligence, and test these models in specific case studies. The models I introduce suggest distributed Bayesian inference as a framework for understanding shared belief formation, and also show that people can overcome other difficult computational challenges associated with achieving rational group agency, including balancing the group “exploration versus exploitation dilemma” for information gathering and inferring levels of “common p-belief” to coordinate actions.

“Statistical Correction of the Winner’s Curse Explains Replication Variability in Quantitative Trait Genome-wide Association Studies”, Palmer & Pe’er 2017

“Statistical correction of the Winner’s Curse explains replication variability in quantitative trait genome-wide association studies”⁠, Cameron Palmer, Itsik Pe’er (2017-07-10; backlinks; similar):

Genome-wide association studies (GWAS) have identified hundreds of SNPs responsible for variation in human quantitative traits. However, genome-wide-significant associations often fail to replicate across independent cohorts, in apparent inconsistency with their strong effects in discovery cohorts. This limited success of replication raises pervasive questions about the utility of the GWAS field.

We identify all 332 studies of quantitative traits from the NHGRI-EBI GWAS Database with attempted replication. We find that the majority of studies provide insufficient data to evaluate replication rates. The remaining papers replicate statistically-significantly worse than expected (p < 10⁻¹⁴), even when adjusting for the regression to the mean of effect sizes between discovery and replication cohorts, termed the Winner’s Curse (p < 10⁻¹⁶). We show this is due in part to misreporting replication cohort-size as a maximum number, rather than a per-locus one. In 39 studies accurately reporting per-locus cohort-size for attempted replication of 707 loci in samples with similar ancestry, replication rate matched expectation (predicted 458, observed 457, p = 0.94). In contrast, ancestry differences between replication and discovery (13 studies, 385 loci) cause the most highly-powered decile of loci to replicate worse than expected, due to difference in linkage disequilibrium.

Author summary:

The majority of associations between common genetic variation and human traits come from genome-wide association studies, which have analyzed millions of single-nucleotide polymorphisms in millions of samples. These kinds of studies pose serious statistical challenges to discovering new associations. Finite resources restrict the number of candidate associations that can be brought forward into validation samples, introducing the need for a statistical-significance threshold. This threshold creates a phenomenon called the Winner’s Curse, in which candidate associations close to the discovery threshold are more likely to have biased overestimates of the variant’s true association in the sampled population.

We survey all human quantitative trait association studies that validated at least one signal. We find the majority of these studies do not publish sufficient information to actually support their claims of replication. For studies that did, we computationally correct the Winner’s Curse and evaluate replication performance. While all variants combined replicate statistically-significantly less than expected, we find that the subset of studies that (1) perform both discovery and replication in samples of the same ancestry; and (2) report accurate per-variant sample sizes, replicate as expected.

This study provides strong, rigorous evidence for the broad reliability of genome-wide association studies. We furthermore provide a model for more efficient selection of variants as candidates for replication, as selecting variants using cursed discovery data enriches for variants with little real evidence for trait association.
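The core of such a Winner's Curse correction can be sketched as follows: given a z-score observed only because it crossed a two-sided significance threshold c, find the true mean whose conditional expectation E[z | |z| > c] matches the observation. This is the generic conditional-expectation deflation, not the paper's actual estimator or code; the threshold value is the usual genome-wide one.

```python
from scipy.stats import norm
from scipy.optimize import brentq

# Given a z-score observed only because it crossed a two-sided significance
# threshold c, find the true mean mu whose conditional expectation
# E[z | |z| > c] equals the observed z. (Generic conditional-expectation
# deflation; not the paper's actual estimator.)
def conditional_mean(mu, c):
    p_sig = norm.cdf(mu - c) + norm.cdf(-mu - c)   # P(|z| > c | mu)
    return mu + (norm.pdf(c - mu) - norm.pdf(c + mu)) / p_sig

def corrected_z(z_obs, c):
    # For z_obs > 0 a root always exists in (0, z_obs): the conditional mean
    # is 0 at mu = 0 (by symmetry) and exceeds mu (upward bias) at mu = z_obs.
    return brentq(lambda mu: conditional_mean(mu, c) - z_obs, 0, z_obs)

# eg. a barely genome-wide-significant hit (c ~ 5.45 for two-sided p = 5e-8)
# gets shrunk substantially, while a very strong hit is barely touched:
z_corrected = corrected_z(6.0, 5.45)
```

The behavior matches the paper's qualitative point: effects just past the threshold are heavily deflated, while overwhelming associations are essentially unchanged.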

“A Tutorial on Thompson Sampling”, Russo et al 2017

“A Tutorial on Thompson Sampling”⁠, Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen (2017-07-07; ; similar):

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
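The Bernoulli-bandit case covered first in the tutorial is simple enough to sketch in a few lines of Python: each arm keeps a Beta posterior over its success rate, and each round the algorithm plays the arm whose posterior sample is largest. The arm probabilities below are invented for illustration.

```python
import random

# Minimal Thompson sampling for a 3-armed Bernoulli bandit.
true_probs = [0.3, 0.5, 0.7]     # unknown to the agent
alpha = [1] * 3                  # Beta(1,1) priors: 1 + successes
beta = [1] * 3                   # 1 + failures

random.seed(0)
pulls = [0] * 3
for t in range(5000):
    # Sample a plausible success rate for each arm from its posterior...
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    # ...and greedily play the arm whose sample is highest.
    arm = samples.index(max(samples))
    reward = 1 if random.random() < true_probs[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1
```

Because posterior sampling naturally balances exploration and exploitation, the best arm accumulates the overwhelming majority of pulls while the others are tried only often enough to rule them out.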

“Black-Box Data-efficient Policy Search for Robotics”, Chatzilygeroudis et al 2017

“Black-Box Data-efficient Policy Search for Robotics”⁠, Konstantinos Chatzilygeroudis, Roberto Rama, Rituraj Kaushik, Dorian Goepp, Vassilis Vassiliades, Jean-Baptiste Mouret et al (2017-03-21; ):

The most data-efficient algorithms for reinforcement learning (RL) in robotics are based on uncertain dynamical models: after each episode, they first learn a dynamical model of the robot, then they use an optimization algorithm to find a policy that maximizes the expected return given the model and its uncertainties. It is often believed that this optimization can be tractable only if analytical, gradient-based algorithms are used; however, these algorithms require using specific families of reward functions and policies, which greatly limits the flexibility of the overall approach. In this paper, we introduce a novel model-based RL algorithm, called Black-DROPS (Black-box Data-efficient RObot Policy Search) that: (1) does not impose any constraint on the reward function or the policy (they are treated as black-boxes), (2) is as data-efficient as the state-of-the-art algorithm for data-efficient RL in robotics, and (3) is as fast (or faster) than analytical approaches when several cores are available. The key idea is to replace the gradient-based optimization algorithm with a parallel, black-box algorithm that takes into account the model uncertainties. We demonstrate the performance of our new algorithm on two standard control benchmark problems (in simulation) and a low-cost robotic manipulator (with a real robot).

“ZMA Sleep Experiment”, Branwen 2017

ZMA: “ZMA Sleep Experiment”⁠, Gwern Branwen (2017-03-13; ⁠, ⁠, ⁠, ; backlinks; similar):

A randomized blinded self-experiment of the effects of ZMA (zinc+magnesium+vitamin B6) on my sleep; results suggest small benefit to sleep quality but are underpowered and damaged by Zeo measurement error/​data issues.

I ran a blinded randomized self-experiment of the effect of 2.5g nightly ZMA powder on Zeo-recorded sleep data during March–October 2017 (n = 127). The linear model and SEM model show no statistically-significant effects or high posterior probability of benefits, although all point-estimates were in the direction of benefits. Data quality issues reduced the available dataset, rendering the experiment particularly underpowered and the results more inconclusive. I decided to not continue use of ZMA after running out; ZMA may help my sleep but I need to improve data quality before attempting any further sleep self-experiments on it.

“Self-Blinded Mineral Water Taste Test”, Branwen 2017

Water: “Self-Blinded Mineral Water Taste Test”⁠, Gwern Branwen (2017-02-15; ⁠, ⁠, ; backlinks; similar):

Blind randomized taste-test of mineral/​distilled/​tap waters using Bayesian best-arm finding; no large differences in preference.

The kind of water used in tea is claimed to make a difference in the flavor: mineral water being better than tap water or distilled water. However, mineral water is vastly more expensive than tap water.

To test the claim, I run a preliminary test of pure water to see if any water differences are detectable at all. I compared my tap water, 3 distilled water brands (Great Value, Nestle Pure Life, & Poland Spring), 1 osmosis-purified brand (Aquafina), and 3 non-carbonated mineral water brands (Evian, Voss, & Fiji) in a series of n = 67 blinded randomized comparisons of water flavor. The comparisons are modeled using a Bradley-Terry competitive model implemented in Stan; comparisons were chosen using an adaptive Bayesian best-arm sequential trial (racing) method designed to locate the best-tasting water in the minimum number of samples by preferentially comparing the best-known arm to potentially superior arms. Blinding & randomization are achieved by using a Lazy Susan to physically randomize two identical (but marked in a hidden spot) cups of water.

The final posterior distribution indicates that some differences between waters are likely to exist but are small & imprecisely estimated and of little practical concern.
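The Bradley-Terry model at the heart of this design can be fit with the standard minorization-maximization (MM) updates in a few lines. The win counts below are invented for 3 hypothetical waters, not the experiment's data, and this sketch is maximum-likelihood rather than the Stan posterior.

```python
import numpy as np

# wins[i][j] = times item i was preferred over item j (hypothetical counts).
wins = np.array([[0, 8, 10],
                 [4, 0, 7],
                 [2, 5, 0]], dtype=float)
n = wins + wins.T            # comparisons played between each pair
p = np.ones(3)               # latent Bradley-Terry "strengths", init equal

# Standard MM update for the Bradley-Terry likelihood:
#   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
for _ in range(200):
    W = wins.sum(axis=1)
    denom = np.array([sum(n[i, j] / (p[i] + p[j]) for j in range(3) if j != i)
                      for i in range(3)])
    p = W / denom
    p /= p.sum()             # fix the scale: strengths sum to 1
```

Under the model, item i beats item j with probability p_i / (p_i + p_j), so the fitted strengths directly rank the waters and quantify how decisive each pairwise preference is.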

“The Kelly Coin-Flipping Game: Exact Solutions”, Branwen et al 2017

Coin-flip: “The Kelly Coin-Flipping Game: Exact Solutions”⁠, Gwern Branwen, Arthur B., nshepperd, FeepingCreature, Gurkenglas (2017-01-19; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016’s 300-round double-or-nothing coin-flipping game with an edge and ceiling, better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC.

Haghani & Dewey 2016 experiment with a double-or-nothing coin-flipping game where the player starts with $30.4 ($25.00 in 2016) and has an edge of 60%, and can play 300 times, choosing how much to bet each time, winning up to a maximum ceiling of $303.8 ($250.00 in 2016). Most of their subjects fail to play well, earning an average $110.6 ($91.00 in 2016), compared to Haghani & Dewey 2016’s heuristic benchmark of ~$291.6 ($240.00 in 2016) in winnings achievable using a modified Kelly Criterion as their strategy. The KC, however, is not optimal for this problem as it ignores the ceiling and limited number of plays.

We solve the problem of the value of optimal play exactly by using decision trees & dynamic programming for calculating the value function, with implementations in R, Haskell⁠, and C. We also provide a closed-form exact value formula in R & Python, several approximations using Monte Carlo/​random forests⁠/​neural networks, visualizations of the value function, and a Python implementation of the game for the OpenAI Gym collection. We find that optimal play yields $246.61 on average (rather than ~$240), and so the human players actually earned only 36.8% of what was possible, losing $155.6 in potential profit. Comparing decision trees and the Kelly criterion for various horizons (bets left), the relative advantage of the decision tree strategy depends on the horizon: it is highest when the player can make few bets (at b = 23, with a difference of ~$36), and decreases with number of bets as more strategies hit the ceiling.

In the Kelly game, the maximum winnings, number of rounds, and edge are fixed; we describe a more difficult generalized version in which the 3 parameters are drawn from Pareto, normal, and beta distributions and are unknown to the player (who can use Bayesian inference to try to estimate them during play). Upper and lower bounds are estimated on the value of this game. In the variant of this game where subjects are not told the exact edge of 60%, a Bayesian decision tree approach shows that performance can closely approach that of the decision tree, with a penalty for 1 plausible prior of only $1. Two deep reinforcement learning agents, DQN & DDPG⁠, are implemented but DQN fails to learn and DDPG doesn’t show acceptable performance, indicating better deep RL methods may be required to solve the generalized Kelly game.
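The decision-tree/dynamic-programming approach is easy to sketch for a shrunken version of the game: a hypothetical $4.00 ceiling with bets in whole cents, rather than the paper's $250 ceiling and 300 rounds, which keeps the memoized state space tiny.

```python
from functools import lru_cache

# Value function for a toy coin-flipping game: 60% edge, arbitrary bets,
# a payout ceiling, and a limited number of rounds. Wealth is in cents.
CAP = 400     # ceiling: $4.00 (the paper's game uses $250 and 300 rounds)
EDGE = 0.6

@lru_cache(maxsize=None)
def value(wealth, rounds_left):
    if rounds_left == 0 or wealth == 0 or wealth >= CAP:
        return min(wealth, CAP)
    best = value(wealth, rounds_left - 1)          # option: bet nothing
    for bet in range(1, wealth + 1):
        ev = (EDGE * value(min(wealth + bet, CAP), rounds_left - 1)
              + (1 - EDGE) * value(wealth - bet, rounds_left - 1))
        best = max(best, ev)
    return best
```

`value(100, 10)` gives the expected final wealth, in cents, of optimal play starting from $1.00 with 10 flips left; as in the essay, it weakly increases with the number of remaining bets and is bounded by the ceiling.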

“Banner Ads Considered Harmful”, Branwen 2017

Ads: “Banner Ads Considered Harmful”⁠, Gwern Branwen (2017-01-08; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

9 months of daily A/​B-testing of Google AdSense banner ads on this site indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.

One source of complexity & JavaScript use on this site is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/​B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/​bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg. social media behavior), rendering the usual A/​B testing method of per-page-load/​session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/​B-test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/​B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of an expected traffic loss of −9.7% (driven by the subset of users without adblock), with an implied −14% traffic loss if all traffic were exposed to ads (95% credible interval: −13% to −16%), exceeding my decision threshold for disabling ads & strongly ruling out the possibility of acceptably small losses which might justify further experimentation.

Thus, banner ads on this site appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

“Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles”, Lakshminarayanan et al 2016

“Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles”⁠, Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell (2016-12-05; ; backlinks; similar):

Deep neural networks (NNs) are powerful black box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in NNs is a challenging and yet unsolved problem.

Bayesian NNs, which learn a distribution over weights, are currently the state-of-the-art for estimating predictive uncertainty; however these require significant modifications to the training procedure and are computationally expensive compared to standard (non-Bayesian) NNs.

We propose an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high quality predictive uncertainty estimates. Through a series of experiments on classification and regression benchmarks, we demonstrate that our method produces well-calibrated uncertainty estimates which are as good or better than approximate Bayesian NNs. To assess robustness to dataset shift, we evaluate the predictive uncertainty on test examples from known and unknown distributions, and show that our method is able to express higher uncertainty on out-of-distribution examples.

We demonstrate the scalability of our method by evaluating predictive uncertainty estimates on ImageNet⁠.
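For regression, the paper's recipe combines each member's predicted Gaussian (mean, variance) into a uniform mixture whose first two moments give the ensemble's predictive mean and variance. The member outputs below are made-up numbers standing in for M trained networks' predictions at a single test point.

```python
import numpy as np

# Hypothetical outputs of M = 3 ensemble members at one test input:
mus = np.array([1.9, 2.1, 2.6])        # predicted means
sigma2s = np.array([0.4, 0.3, 0.5])    # predicted variances

# Moments of the uniform mixture of the member Gaussians:
mu_star = mus.mean()
var_star = (sigma2s + mus**2).mean() - mu_star**2
```

The decomposition makes the out-of-distribution behavior intuitive: `var_star` equals the average within-member variance plus the variance of the member means, so disagreement between members inflates the predictive uncertainty.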

“Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016

“Bayesian Reinforcement Learning: A Survey”⁠, Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar (2016-09-14; ; backlinks; similar):

Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods for the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are: (1) it provides an elegant approach to action-selection (exploration/​exploitation) as a function of the uncertainty in learning; and (2) it provides a machinery to incorporate prior knowledge into the algorithms.

We first discuss models and methods for Bayesian inference in the simple single-step Bandit model⁠. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed on the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class.

The objective of the paper is to provide a comprehensive survey on Bayesian RL algorithms and their theoretical and empirical properties.

“Why Tool AIs Want to Be Agent AIs”, Branwen 2016

Tool-AI: “Why Tool AIs Want to Be Agent AIs”⁠, Gwern Branwen (2016-09-07; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

AIs limited to pure computation (Tool AIs) supporting humans will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and meta-learn, because all problems are reinforcement-learning problems.

Autonomous AI systems (Agent AIs) trained using reinforcement learning can do harm when they take wrong actions, especially superintelligent Agent AIs. One solution would be to eliminate their agency by not giving AIs the ability to take actions, confining them to purely informational or inferential tasks such as classification or prediction (Tool AIs), and have all actions be approved & executed by humans, giving equivalently superintelligent results without the risk.

I argue that this is not an effective solution for two major reasons. First, because Agent AIs will by definition be better at actions than Tool AIs, giving an economic advantage. Secondly, because Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as long-term memories or external software or large databases or the Internet, and how best to acquire new data.

All of these actions will result in Agent AIs more intelligent than Tool AIs, in addition to their greater economic competitiveness. Thus, Tool AIs will be inferior to Agent AIs in both actions and intelligence, implying use of Tool AIs is an even more highly unstable equilibrium than previously argued, as users of Agent AIs will be able to outcompete them on two dimensions (and not just one).

“‘Genius Revisited’ Revisited”, Branwen 2016

Hunter: “‘Genius Revisited’ Revisited”⁠, Gwern Branwen (2016-06-19; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A book study of surveys of the high-IQ elementary school HCES concludes that high IQ is not predictive of accomplishment; I point out that results are consistent with regression to the mean from extremely early IQ tests and small total sample size.

Genius Revisited documents the longitudinal results of a high-IQ/​gifted-and-talented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the general high education & income levels, but absence of great accomplishment on a national or global scale (eg. a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.

I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making 2 statistical errors: base-rate neglect⁠, and regression to the mean⁠.

First, their standards fall prey to a base-rate fallacy and even extreme predictive value of IQ would not predict 1 or more Nobel prizes because Nobel prize odds are measured at 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.

Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable as it is driven by the regression to the mean caused by the relatively low correlation of early childhood with adult IQs—which means their sample is far less elite as adults than they believe. Using early-childhood/​adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES’s associated high-IQ/​gifted-and-talented high school, Hunter High, which has access to the adolescents’ more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).

This unavoidable statistical fact undermines the main rationale of HCES: extremely high-IQ adults cannot be accurately selected as kindergartners on the basis of a simple test. This greater-regression problem can be lessened by the use of additional variables in admissions, such as parental IQs or high-quality genetic polygenic scores⁠; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and HCES students should not be assigned scarce magnet high school slots.
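The regression-to-the-mean arithmetic here is one line: the expected adult score shrinks toward the population mean in proportion to the childhood-adult correlation. The r of 0.57 below is an assumed value chosen to reproduce the essay's 157 → ~133 example, not a figure quoted from it.

```python
# Expected adult IQ after regression to the mean: the childhood deviation
# from the population mean of 100 shrinks by the childhood-adult IQ
# correlation r (r = 0.57 is an assumption for illustration).
def expected_adult_iq(child_iq, r, mean=100):
    return mean + r * (child_iq - mean)

predicted = expected_adult_iq(157, 0.57)   # ~132.5
```

A perfectly predictive early test (r = 1) would imply no regression at all; the lower the early-childhood correlation, the farther the selected sample falls back toward 100 as adults.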

“Candy Japan’s New Box A/B Test”, Branwen 2016

Candy-Japan: “Candy Japan’s new box A/B test”⁠, Gwern Branwen (2016-05-06; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Bayesian decision-theoretic analysis of the effect of fancier packaging on subscription cancellations & optimal experiment design.

I analyze an A/​B test from a mail-order company of two different kinds of box packaging from a Bayesian decision-theory perspective, balancing posterior probability of improvements & greater profit against the cost of packaging & risk of worse results, finding that as the company’s analysis suggested, the new box is unlikely to be sufficiently better than the old. Calculating expected values of information shows that it is not worth experimenting on further, and that such fixed-sample trials are unlikely to ever be cost-effective for packaging improvements. However, adaptive experiments may be worthwhile.

“Calculating The Gaussian Expected Maximum”, Branwen 2016

Order-statistics: “Calculating The Gaussian Expected Maximum”⁠, Gwern Branwen (2016-01-22; ⁠, ; backlinks; similar):

In generating a sample of n datapoints drawn from a normal/​Gaussian distribution, how big on average the biggest datapoint is will depend on how large n is. I implement a variety of exact & approximate calculations from the literature in R to compare efficiency & accuracy.

In generating a sample of n datapoints drawn from a normal/​Gaussian distribution with a particular mean/​SD, how big on average the biggest datapoint is will depend on how large n is. Knowing this average is useful in a number of areas like sports or breeding or manufacturing, as it defines how bad/​good the worst/​best datapoint will be (eg. the score of the winner in a multi-player game).

The order statistic of the mean/​average/​expectation of the maximum of a draw of n samples from a normal distribution has no exact formula, unfortunately, and is generally not built into any programming language’s libraries.

I implement & compare some of the approaches to estimating this order statistic in the R programming language, for both the maximum and the general order statistic. The overall best approach is to calculate the exact order statistics for the n range of interest using numerical integration via lmomco and cache them in a lookup table, rescaling the mean/​SD as necessary for arbitrary normal distributions; next best is a polynomial regression approximation; finally, the Elfving correction to the Blom 1958 approximation is fast, easily implemented, and accurate for reasonably large n such as n > 100.
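The same exact calculation is straightforward in Python via numerical integration of the order-statistic integral E[max] = ∫ x · n·φ(x)·Φ(x)ⁿ⁻¹ dx; this is an analogue of the lmomco-based approach, not the essay's R code.

```python
from math import sqrt, pi
from scipy.stats import norm
from scipy.integrate import quad

# Expected maximum of n standard-normal draws, by numerically integrating
# x times the density of the maximum, n * phi(x) * Phi(x)^(n-1).
def expected_max(n):
    integrand = lambda x: x * n * norm.pdf(x) * norm.cdf(x) ** (n - 1)
    val, _ = quad(integrand, -10, 10)
    return val

# For n = 2 the exact answer is 1/sqrt(pi) ~= 0.5642; rescale by an arbitrary
# normal's mean/SD as mean + sd * expected_max(n).
```

Caching these values in a lookup table for the n of interest, as the essay recommends, avoids repeating the integration.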

“Top 10 Replicated Findings From Behavioral Genetics”, Plomin et al 2016-page-10

2016-plomin.pdf#page=10: “Top 10 Replicated Findings From Behavioral Genetics”⁠, Robert Plomin, John C. DeFries, Valerie S. Knopik, Jenae M. Neiderhiser (2016; ⁠, ⁠, ⁠, ; backlinks; similar):

Finding 7. Most measures of the “environment” show substantial genetic influence

Although it might seem a peculiar thing to do, measures of the environment widely used in psychological science—such as parenting, social support, and life events—can be treated as dependent measures in genetic analyses. If they are truly measures of the environment, they should not show genetic influence. To the contrary, in 1991, Plomin and Bergeman conducted a review of the first 18 studies in which environmental measures were used as dependent measures in genetically sensitive designs and found evidence for genetic influence for these measures of the environment. Substantial genetic influence was found for objective measures such as videotaped observations of parenting as well as self-report measures of parenting, social support, and life events. How can measures of the environment show genetic influence? The reason appears to be that such measures do not assess the environment independent of the person. As noted earlier, humans select, modify, and create environments correlated with their genetic behavioral propensities such as personality and psychopathology (McAdams, Gregory, & Eley, 2013). For example, in studies of twin children, parenting has been found to reflect genetic differences in children’s characteristics such as personality and psychopathology (Avinun & Knafo, 2014; Klahr & Burt, 2014; Plomin, 1994).

Since 1991, more than 150 articles have been published in which environmental measures were used in genetically sensitive designs; they have shown consistently that there is substantial genetic influence on environmental measures, extending the findings from family environments to neighborhood, school, and work environments. Kendler and Baker (2007) conducted a review of 55 independent genetic studies and found an average heritability of 0.27 across 35 diverse environmental measures (confidence intervals not available). Meta-analyses of parenting, the most frequently studied domain, have shown genetic influence that is driven by child characteristics (Avinun & Knafo, 2014) as well as by parent characteristics (Klahr & Burt, 2014). Some exceptions have emerged. Not surprisingly, when life events are separated into uncontrollable events (eg. death of a spouse) and controllable life events (eg. financial problems), the former show nonsignificant genetic influence. In an example of how all behavioral genetic results can differ in different cultures, Shikishima, Hiraishi, Yamagata, Neiderhiser, and Ando (2012) compared parenting in Japan and Sweden and found that parenting in Japan showed more genetic influence than in Sweden, consistent with the view that parenting is more child centered in Japan than in the West.

Researchers have begun to use GCTA to replicate these findings from twin studies. For example, GCTA has been used to show substantial genetic influence on stressful life events (Power et al 2013) and on variables often used as environmental measures in epidemiological studies such as years of schooling (C. A. Rietveld, Medland, et al 2013). Use of GCTA can also circumvent a limitation of twin studies of children. Such twin studies are limited to investigating within-family (twin-specific) experiences, whereas many important environmental factors such as socioeconomic status (SES) are the same for two children in a family. However, researchers can use GCTA to assess genetic influence on family environments such as SES that differ between families, not within families. GCTA has been used to show genetic influence on family SES (Trzaskowski et al 2014) and an index of social deprivation (Marioni et al 2014).

“World Catnip Surveys”, Branwen 2015

Catnip-survey: “World Catnip Surveys”⁠, Gwern Branwen (2015-11-15; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

International population online surveys of cat owners about catnip and other cat stimulant use.

In compiling a meta-analysis of reports of catnip response rates in domestic cats⁠, yielding a meta-analytic average of ~2⁄3, the available data suggests heterogeneity from cross-country differences in rates (possibly for genetic reasons) but is insufficient to definitively demonstrate the existence of or estimate those differences (particularly a possible extremely high catnip response rate in Japan). I use Google Surveys August–September 2017 to conduct a brief 1-question online survey of a proportional population sample of 9 countries about cat ownership & catnip use, specifically: Canada, the USA, UK, Japan, Germany, Brazil, Spain, Australia, & Mexico. In total, I surveyed n = 31,471 people, of whom n = 9,087 are cat owners, of whom n = 4,402 report having used catnip on their cat, and of whom n = 2,996 report a catnip response.

The survey yields catnip response rates of Canada (82%), USA (79%), UK (74%), Japan (71%), Germany (57%), Brazil (56%), Spain (54%), Australia (53%), and Mexico (52%). The differences are substantial and of high posterior probability, supporting the existence of large cross-country differences. In additional analysis, the other conditional probabilities of cat ownership and trying catnip with a cat appear to correlate with catnip response rates; this intercorrelation suggests a “cat factor” of some sort influencing responses, although what causal relationship there might be between proportion of cat owners and proportion of catnip-responder cats is unclear.

An additional survey of a convenience sample of primarily US Internet users about catnip is reported, although the improbable catnip response rates compared to the population survey suggest the respondents are either highly unrepresentative or the questions caused demand bias.
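In the simplest independent-Beta sketch, the "high posterior probability" comparisons between countries reduce to Monte Carlo draws from each country's posterior response rate. The counts below are hypothetical round numbers, not the survey's actual per-country sample sizes.

```python
import random

# Posterior probability that country 1's response rate exceeds country 2's,
# given k responders out of n catnip-tested cats in each, under independent
# flat Beta(1,1) priors.
def prob_greater(k1, n1, k2, n2, draws=20000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        p1 = rng.betavariate(k1 + 1, n1 - k1 + 1)   # Beta posterior draws
        p2 = rng.betavariate(k2 + 1, n2 - k2 + 1)
        hits += p1 > p2
    return hits / draws

# eg. a hypothetical 82% of 300 Canadian cats vs 52% of 300 Mexican cats:
p = prob_greater(246, 300, 156, 300)
```

With gaps as large as those reported (82% vs 52%) and samples of a few hundred, the posterior probability of a real difference is essentially 1.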

“Catnip Immunity and Alternatives”, Branwen 2015

Catnip: “Catnip immunity and alternatives”⁠, Gwern Branwen (2015-11-07; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Estimation of catnip immunity rates by country with meta-analysis and surveys, and discussion of catnip alternatives.

Not all cats respond to the catnip stimulant; the rate of responders is generally estimated at ~70% of cats. A meta-analysis of catnip response experiments since the 1940s indicates the true value is ~62%. The low quality of studies and the reporting of their data makes examination of possible moderators like age, sex, and country difficult. Catnip responses have been recorded for a number of species both inside and outside the Felidae family; of them, there is evidence for a catnip response in the Felidae, and, more uncertainly, the Paradoxurinae, and Herpestinae.

To extend the analysis, I run large-scale online surveys measuring catnip response rates globally in domestic cats, finding high heterogeneity but considerable rates of catnip immunity worldwide.

As a piece of practical advice for cat-hallucinogen sommeliers, I treat catnip response & finding catnip substitutes as a decision problem, modeling it as a Markov decision process where one wishes to find a working psychoactive at minimum cost. Bol et al 2017 measured multiple psychoactives simultaneously in a large sample of cats, permitting prediction of responses conditional on not responding to others. (The solution to the specific problem is to test in the sequence catnip → honeysuckle → silvervine → Valerian⁠.)

For discussion of cat psychology in general, see my Cat Sense review.

“Bitter Melon for Blood Glucose”, Branwen 2015

Melon: “Bitter Melon for blood glucose”⁠, Gwern Branwen (2015-09-14; ⁠, ; similar):

Analysis of whether bitter melon reduces blood glucose in one self-experiment, and the utility of further self-experimentation

I re-analyze a bitter-melon/​blood-glucose self-experiment, finding a small effect of increasing blood glucose after correcting for temporal trends & daily variation, giving both frequentist & Bayesian analyses. I then analyze the self-experiment from a subjective Bayesian decision-theoretic perspective, cursorily estimating the costs of diabetes & benefits of intervention in order to estimate Value Of Information for the self-experiment and the benefit of further self-experimenting; I find that the expected value of more data (EVSI) is negative and further self-experimenting would not be optimal compared to trying out other anti-diabetes interventions.

“Resorting Media Ratings”, Branwen 2015

Resorter: “Resorting Media Ratings”⁠, Gwern Branwen (2015-09-07; ⁠, ; backlinks; similar):

Commandline tool providing interactive statistical pairwise ranking and sorting of items

User-created datasets using ordinal scales (such as media ratings) tend to drift or ‘clump’ towards the extremes and fail to be as informative as possible, falling prey to ceiling effects and making it difficult to distinguish between the mediocre & the excellent.

This can be counteracted by rerating the dataset to create a uniform (and hence, informative) distribution of ratings—but such manual rerating is difficult.

I provide an anytime CLI program, resorter, written in R (should be cross-platform but only tested on Linux) which keeps track of comparisons, infers underlying ratings assuming that they are noisy in the ELO-like Bradley-Terry model⁠, and interactively & intelligently queries the user with comparisons of the media with the most uncertain current ratings, until the user ends the session and a fully rescaled set of ratings is output.
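
The underlying Bradley-Terry inference (not resorter’s actual R code) can be sketched with the standard MM fixed-point algorithm on a toy win matrix; assumed here: a balanced design where each pair has been compared 10 times:

```python
# Fit Bradley-Terry strengths to pairwise win counts with the standard
# MM update (Hunter 2004): w_i <- W_i / sum_j N_ij / (w_i + w_j),
# where W_i = total wins of i and N_ij = games between i and j.
# Toy data: wins[i][j] = number of times item i beat item j.
wins = [
    [0, 7, 9, 10],
    [3, 0, 8, 9],
    [1, 2, 0, 7],
    [0, 1, 3, 0],
]
n = len(wins)
w = [1.0] * n
for _ in range(200):
    new_w = []
    for i in range(n):
        total_wins = sum(wins[i])
        denom = sum(
            (wins[i][j] + wins[j][i]) / (w[i] + w[j])
            for j in range(n) if j != i
        )
        new_w.append(total_wins / denom)
    s = sum(new_w)
    w = [x / s for x in new_w]  # normalize: strengths are scale-free

ranking = sorted(range(n), key=lambda i: -w[i])
print(ranking, [round(x, 3) for x in w])
```

A query strategy like resorter’s would then repeatedly present the user with the pair whose fitted strengths are closest (most uncertain outcome) and refit.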

“When Should I Check The Mail?”, Branwen 2015

Mail-delivery: “When Should I Check The Mail?”⁠, Gwern Branwen (2015-06-21; ⁠, ⁠, ⁠, ; backlinks; similar):

Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis⁠, model comparison, optimizing check times with a loss function⁠, and optimal data collection.

Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/​postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely), and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).
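
The core interval-censored likelihood idea can be sketched in miniature (this is not the essay’s dataset or its R/JAGS code; the delivery-time distribution, check times, and grid-search MLE are all illustrative assumptions):

```python
import math, random

random.seed(0)

def PHI(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Simulate the data-generating process: mail arrives ~ N(11.0, 0.5) (hours);
# each day we check once at a random time and record only delivered-or-not.
TRUE_MU, TRUE_SD = 11.0, 0.5
data = []
for _ in range(500):
    check = random.uniform(9.5, 12.5)
    delivered = random.gauss(TRUE_MU, TRUE_SD) <= check
    data.append((check, delivered))

def loglik(mu, sd):
    # P(delivered by t) = Phi((t-mu)/sd); P(not yet delivered) = 1 - Phi(...)
    ll = 0.0
    for t, d in data:
        p = PHI((t - mu) / sd)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        ll += math.log(p if d else 1 - p)
    return ll

# Brute-force MLE over a grid (a real analysis would use survreg or an optimizer).
grid_mu = [10 + 0.02 * i for i in range(101)]   # 10.0 .. 12.0
grid_sd = [0.2 + 0.02 * i for i in range(41)]   # 0.2 .. 1.0
mu_hat, sd_hat = max(
    ((m, s) for m in grid_mu for s in grid_sd),
    key=lambda ms: loglik(*ms),
)
print(f"mu_hat={mu_hat:.2f}, sd_hat={sd_hat:.2f}")
```

Despite never observing an exact delivery time, 500 delivered-or-not observations recover the delivery distribution well; the essay’s loss-function and Thompson-sampling steps then operate on this fitted distribution.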

“Gaussian Processes for Data-Efficient Learning in Robotics and Control”, Deisenroth et al 2015

“Gaussian Processes for Data-Efficient Learning in Robotics and Control”⁠, Marc Peter Deisenroth, Dieter Fox, Carl Edward Rasmussen (2015-02-10; ):

Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning allows one to reduce the amount of engineering knowledge which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time-consuming. To address this problem, current learning approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.

“Predictive Distributions for Between-study Heterogeneity and Simple Methods for Their Application in Bayesian Meta-analysis”, Turner et al 2014

“Predictive distributions for between-study heterogeneity and simple methods for their application in Bayesian meta-analysis”⁠, Rebecca M. Turner, Dan Jackson, Yinghui Wei, Simon G. Thompson, Julian P. T. Higgins (2014-12-05; ⁠, ; similar):

Numerous meta-analyses in healthcare research combine results from only a small number of studies, for which the variance representing between-study heterogeneity is estimated imprecisely. A Bayesian approach to estimation allows external evidence on the expected magnitude of heterogeneity to be incorporated.

The aim of this paper is to provide tools that improve the accessibility of Bayesian meta-analysis. We present 2 methods for implementing Bayesian meta-analysis, using numerical integration and importance sampling techniques. Based on 14,886 binary outcome meta-analyses in the Cochrane Database of Systematic Reviews⁠, we derive a novel set of predictive distributions for the degree of heterogeneity expected in 80 settings depending on the outcomes assessed and comparisons made. These can be used as prior distributions for heterogeneity in future meta-analyses.

The 2 methods are implemented in R, for which code is provided. Both methods produce equivalent results to standard but more complex Markov chain Monte Carlo approaches. The priors are derived as log-normal distributions for the between-study variance, applicable to meta-analyses of binary outcomes on the log odds ratio scale. The methods are applied to 2 example meta-analyses, incorporating the relevant predictive distributions as prior distributions for between-study heterogeneity.

We have provided resources to facilitate Bayesian meta-analysis, in a form accessible to applied researchers, which allow relevant prior information on the degree of heterogeneity to be incorporated.

[Erik van Zwet:

The distribution of tau across all the meta-analyses in Cochrane with a binary outcome has been estimated by Turner et al 2014.

They estimated the distribution of log(τ²) as normal with mean −2.56 and standard deviation 1.74. I’ve estimated the distribution of μ across Cochrane as a generalized t-distribution with mean = 0, scale = 0.52 and 3.4 degrees of freedom.

These estimated priors usually don’t make a very big difference compared to flat priors⁠. That’s just because the signal-to-noise ratio of most meta-analyses is reasonably good. For most meta-analyses, finding an honest set of reliable studies seems to be a much bigger problem than sampling error.]
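
The quoted prior is easy to use directly: since log(τ²) ~ Normal(−2.56, 1.74²), samples of the between-study standard deviation τ are just exp(x⁄2) for x drawn from that normal. A minimal sketch of what this predictive distribution implies for τ:

```python
import math, random

random.seed(0)

# Turner et al 2014's overall predictive distribution for between-study
# heterogeneity in binary-outcome meta-analyses (log odds ratio scale):
# log(tau^2) ~ Normal(-2.56, 1.74^2), so tau = exp(x/2).
samples = [math.exp(random.gauss(-2.56, 1.74) / 2) for _ in range(100_000)]
samples.sort()

median = samples[len(samples) // 2]
lo, hi = samples[2_500], samples[97_500]  # central 95% interval
print(f"tau median ~= {median:.3f}, 95% interval ~= ({lo:.3f}, {hi:.3f})")
```

The implied median τ is exp(−1.28) ≈ 0.28 on the log-odds-ratio scale: modest heterogeneity is expected a priori, but the 95% interval is wide, from near-homogeneity up to τ ≈ 1.5.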

“Thompson Sampling With the Online Bootstrap”, Eckles & Kaptein 2014

“Thompson sampling with the online bootstrap”⁠, Dean Eckles, Maurits Kaptein (2014-10-15; ⁠, ; similar):

Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large scale bandit problems, and its performance is dependent on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson sampling by replacing the posterior distribution used in Thompson sampling by a bootstrap distribution. We first explain BTS and show that the performance of BTS is competitive to Thompson sampling in the well-studied Bernoulli bandit case. Subsequently, we detail why BTS using the online bootstrap is more scalable than regular Thompson sampling, and we show through simulation that BTS is more robust to a misspecified error distribution. BTS is an appealing modification of Thompson sampling, especially when samples from the posterior are otherwise not available or are costly.
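
A minimal sketch of BTS for a two-armed Bernoulli bandit, using the “double-or-nothing” online bootstrap (each replicate randomly includes or skips each new observation); the replicate count, pseudo-counts, and weighting details here are simplifying assumptions, not the paper’s exact algorithm:

```python
import random

random.seed(0)

TRUE_P = [0.7, 0.4]        # Bernoulli arm means (unknown to the agent)
J = 50                     # bootstrap replicates per arm
# Each replicate keeps [successes, trials], seeded with one pseudo-success
# and one pseudo-failure so early estimates are not degenerate.
reps = [[[1, 2] for _ in range(J)] for _ in TRUE_P]

pulls = [0, 0]
for _ in range(2000):
    # Thompson step with a bootstrap distribution in place of a posterior:
    # draw one replicate per arm, play the arm with the higher sampled mean.
    est = [reps[a][random.randrange(J)] for a in range(2)]
    arm = max(range(2), key=lambda a: est[a][0] / est[a][1])
    reward = random.random() < TRUE_P[arm]
    pulls[arm] += 1
    # Online "double-or-nothing" bootstrap: each replicate takes the new
    # observation with probability 1/2 (at double weight) or skips it.
    for s in reps[arm]:
        if random.random() < 0.5:
            s[0] += 2 * reward
            s[1] += 2

print(pulls)
```

The replicate-to-replicate disagreement plays the role of posterior uncertainty, so exploration fades as the replicates agree; no likelihood needs to be specified, which is the source of BTS’s robustness to misspecification.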

“Everything Is Correlated”, Branwen 2014

Everything: “Everything Is Correlated”⁠, Gwern Branwen (2014-09-12; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Anthology of sociology, statistical, or psychological papers discussing the observation that all real-world variables have non-zero correlations and the implications for statistical theory such as ‘null hypothesis testing’.

Statistical folklore asserts that “everything is correlated”: in any real-world dataset, most or all measured variables will have non-zero correlations, even between variables which appear to be completely independent of each other, and that these correlations are not merely sampling error flukes but will appear in large-scale datasets to arbitrarily designated levels of statistical-significance or posterior probability.

This raises serious questions for null-hypothesis statistical-significance testing, as it implies the null hypothesis of 0 will always be rejected with sufficient data, meaning that a failure to reject only implies insufficient data, and provides no actual test or confirmation of a theory. Even a directional prediction is minimally confirmatory since there is a 50% chance of picking the right direction at random.

It also has implications for conceptualizations of theories & causal models, interpretations of structural models, and other statistical principles such as the “sparsity principle”.

“Statistical Notes”, Branwen 2014

Statistical-notes: “Statistical Notes”⁠, Gwern Branwen (2014-07-17; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Miscellaneous statistical stuff

Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.
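
The bimodal posterior can be reproduced on a grid without ABC or MCMC. The numbers below are hypothetical stand-ins for the two polls: a small reliable poll at 50% (large SE, taken at face value), and a large precise poll at 80% that is totally mistaken with probability 1⁄2 (in which case it is flat over [0,1]):

```python
import math

def normpdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

SMALL, SE_SMALL = 0.50, 0.20   # small poll: imprecise but trustworthy
BIG, SE_BIG = 0.80, 0.01       # large poll: precise but possibly junk
RELIABLE = 0.5                 # P(large poll is not totally mistaken)

grid = [i / 1000 for i in range(1, 1000)]
# Posterior ~ small-poll likelihood x mixture of (reliable normal, flat junk):
post = [
    normpdf(theta, SMALL, SE_SMALL)
    * (RELIABLE * normpdf(theta, BIG, SE_BIG) + (1 - RELIABLE) * 1.0)
    for theta in grid
]
z = sum(post)
post = [p / z for p in post]

modes = [
    grid[i] for i in range(1, len(grid) - 1)
    if post[i] > post[i - 1] and post[i] > post[i + 1]
]
mean = sum(t * p for t, p in zip(grid, post))
print(f"modes at {modes}, posterior mean {mean:.3f}")
```

The posterior has one mode at each poll’s estimate, with the overall mean sitting between them; the wide credible interval covers both modes, exactly the behavior the note describes.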

“Why Correlation Usually ≠ Causation”, Branwen 2014

Causality: “Why Correlation Usually ≠ Causation”⁠, Gwern Branwen (2014-06-24; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Correlations are oft interpreted as evidence for causation; this is oft falsified; do causal graphs explain why this is so common, because the number of possible indirect paths greatly exceeds the direct paths necessary for useful manipulation?

It is widely understood that statistical correlation between two variables ≠ causation. Despite this admonition, people are overconfident in claiming correlations to support favored causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimate the prevalence of confounds / common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in realistic DAGs but toy models, the imbalance also explains overconfidence.

“Bacopa Quasi-Experiment”, Branwen 2014

Bacopa: “Bacopa Quasi-Experiment”⁠, Gwern Branwen (2014-05-06; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A small 2014-2015 non-blinded self-experiment using Bacopa monnieri to investigate effect on memory/​sleep/​self-ratings in an ABABA design; no particular effects were found.

Bacopa is a supplement herb often used for memory or stress adaptation. Its chronic effects reportedly take many weeks to manifest, with no important acute effects. Out of curiosity, I bought 2 bottles of Bacognize Bacopa pills and ran a non-randomized non-blinded ABABA quasi-self-experiment from June 2014 to September 2015, measuring effects on my memory performance, sleep, and daily self-ratings of mood/​productivity. For analysis, a multi-level Bayesian model on two memory performance variables was used to extract per-day performance, factor analysis was used to extract a sleep index from 9 Zeo sleep variables, and the 3 endpoints were modeled as a multivariate Bayesian time-series regression with splines. Because of the slow onset of chronic effects, small effective sample size, definite temporal trends probably unrelated to Bacopa, and noise in the variables, the results were as expected, ambiguous, and do not strongly support any correlation between Bacopa and memory/​sleep/​self-rating (+/​-/​- respectively).

“Bayesian Model Selection: The Steepest Mountain to Climb”, Tenan et al 2014

2014-tenan.pdf: “Bayesian model selection: The steepest mountain to climb”⁠, Simone Tenan, Robert B. O’Hara, Iris Hendriks, Giacomo Tavecchia (2014-01-01; backlinks)

“(More) Efficient Reinforcement Learning via Posterior Sampling”, Osband et al 2013

“(More) Efficient Reinforcement Learning via Posterior Sampling”⁠, Ian Osband, Benjamin Van Roy, Daniel Russo (2013-06-04; ; similar):

Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration.

We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way.

We establish an Õ(τ ⋅ S ⋅ √(AT)) bound on the expected regret⁠, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.

We show through simulation that PSRL substantially outperforms existing algorithms with similar regret bounds.
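
The episode loop of PSRL is simple enough to sketch end-to-end on a toy chain MDP (the environment, Dirichlet(1) prior, and known-rewards assumption are all illustrative choices, not the paper’s experimental setup):

```python
import random

random.seed(0)

S, A, H = 3, 2, 10          # states, actions (0=left, 1=right), horizon
REWARD = [0.0, 0.0, 1.0]    # known reward for occupying each state

def true_step(s, a):
    # Hypothetical dynamics, unknown to the agent: the chosen direction
    # succeeds with probability 0.9, otherwise the state is unchanged.
    if random.random() < 0.9:
        return min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return s

def sample_dirichlet(alphas):
    g = [random.gammavariate(a, 1) for a in alphas]
    t = sum(g)
    return [x / t for x in g]

def plan(P):
    # Finite-horizon value iteration on the sampled model; returns a
    # greedy stationary policy (adequate for this short horizon).
    V = [0.0] * S
    pi = [0] * S
    for _ in range(H):
        newV, newpi = [], []
        for s in range(S):
            q = [REWARD[s] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
                 for a in range(A)]
            newpi.append(max(range(A), key=lambda a: q[a]))
            newV.append(max(q))
        V, pi = newV, newpi
    return pi

counts = [[[1.0] * S for _ in range(A)] for _ in range(S)]  # Dirichlet(1) prior
total = 0.0
for _ in range(200):
    # PSRL: sample ONE MDP from the posterior, act optimally in it all episode.
    P = [[sample_dirichlet(counts[s][a]) for a in range(A)] for s in range(S)]
    pi = plan(P)
    s = 0
    for _ in range(H):
        a = pi[s]
        s2 = true_step(s, a)
        counts[s][a][s2] += 1   # posterior update: Dirichlet counts
        total += REWARD[s2]
        s = s2

print(f"average per-episode reward: {total / 200:.2f}")
```

Exploration comes entirely from posterior sampling: while the transition posterior is diffuse, sampled models disagree and induce varied policies; as counts accumulate, the sampled models and hence the policy converge, with no optimism bonus anywhere.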

“Magnesium Self-Experiments”, Branwen 2013

Magnesium: “Magnesium Self-Experiments”⁠, Gwern Branwen (2013-05-13; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

3 magnesium self-experiments on magnesium l-threonate and magnesium citrate.

Encouraged by TruBrain’s magnesium & my magnesium l-threonate use, I design and run a blind random self-experiment to see whether magnesium citrate supplementation would improve my mood or productivity. I collected ~200 days of data at two dose levels. The analysis finds that the net effect was negative, but a more detailed look shows time-varying effects with a large initial benefit negated by an increasingly-negative effect. Combined with my expectations, the long half-life, and the higher-than-intended dosage, I infer that I overdosed on the magnesium. To verify this, I will be running a followup experiment with a much smaller dose.

“Caffeine Wakeup Experiment”, Branwen 2013

Caffeine: “Caffeine wakeup experiment”⁠, Gwern Branwen (2013-04-07; ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Self-experiment on whether consuming caffeine immediately upon waking results in less time in bed & higher productivity. The results indicate a small and uncertain effect.

One trick to combat morning sluggishness is to get caffeine extra-early by using caffeine pills shortly before or upon trying to get up. From 2013-2014 I ran a blinded & placebo-controlled randomized experiment measuring the effect of caffeine pills in the morning upon awakening time and daily productivity. The estimated effect is small and the posterior probability relatively low, but a decision analysis suggests that since caffeine pills are so cheap, it would be worthwhile to conduct another experiment; however, increasing Zeo equipment problems have made me hold off additional experiments indefinitely.

“Bayesian Estimation Supersedes the t Test”, Kruschke 2013

2012-kruschke.pdf: “Bayesian estimation supersedes the t test”⁠, John J. Kruschke (2013; ; backlinks; similar):

Bayesian estimation for 2 groups provides complete distributions of credible values for the effect size, group means and their difference, standard deviations and their difference, and the normality of the data. The method handles outliers. The decision rule can accept the null value (unlike traditional t-tests) when certainty in the estimate is high (unlike Bayesian model comparison using Bayes factors). The method also yields precise estimates of statistical power for various research goals. The software and programs are free and run on Macintosh, Windows, and Linux platforms.

[Keywords: Bayesian statistics, effect size, robust estimation, Bayes factor, confidence interval]
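
The core of the model (a t-distributed likelihood for each group, so outliers are down-weighted, with Kruschke’s exponential prior on the normality parameter ν) can be sketched with a plain random-walk Metropolis sampler standing in for the paper’s JAGS implementation; the simulated data and proposal widths are illustrative assumptions:

```python
import math, random

random.seed(0)

# Simulated two-group data (hypothetical, in place of a real dataset).
g1 = [random.gauss(1.0, 1.0) for _ in range(60)]
g2 = [random.gauss(0.0, 1.0) for _ in range(60)]

def group_loglik(xs, m, ls, nu):
    """Log-likelihood of xs under a Student-t(location m, scale e^ls, df nu)."""
    s = math.exp(ls)
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - ls)
    return sum(const - (nu + 1) / 2 * math.log1p(((x - m) / s) ** 2 / nu)
               for x in xs)

def log_post(mu1, mu2, ls1, ls2, nu):
    if not (1.0 < nu < 200.0):
        return -math.inf
    # Kruschke's prior on normality: nu - 1 ~ Exponential(mean 29);
    # flat (improper) priors on the means and log-scales.
    return (group_loglik(g1, mu1, ls1, nu) + group_loglik(g2, mu2, ls2, nu)
            - (nu - 1) / 29)

# Random-walk Metropolis over (mu1, mu2, log s1, log s2, nu).
theta = [0.0, 0.0, 0.0, 0.0, 10.0]
lp = log_post(*theta)
diffs = []
for step in range(20_000):
    prop = [t + random.gauss(0, w)
            for t, w in zip(theta, (0.1, 0.1, 0.1, 0.1, 2.0))]
    lp2 = log_post(*prop)
    if math.log(random.random()) < lp2 - lp:
        theta, lp = prop, lp2
    if step >= 5_000:                 # discard burn-in
        diffs.append(theta[0] - theta[1])

est = sum(diffs) / len(diffs)
print(f"posterior mean of mu1 - mu2: {est:.2f}")
```

The retained `diffs` give the full credible distribution of the mean difference that the paper emphasizes (not just a point estimate and p-value); quantiles of `diffs` yield a credible interval directly.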

“Potassium Sleep Experiments”, Branwen 2012

Potassium: “Potassium sleep experiments”⁠, Gwern Branwen (2012-12-21; ⁠, ⁠, ⁠, ; backlinks; similar):

2 self-experiments on potassium citrate effects on sleep: harm to sleep when taken daily or in the morning

Potassium and magnesium are minerals that many Americans are deficient in. I tried using potassium citrate and immediately noticed difficulty sleeping. A short randomized (but not blinded) self-experiment of ~4g potassium taken throughout the day confirmed large negative effects on my sleep. A longer followup randomized and blinded self-experiment used standardized doses taken once a day early in the morning, and also found some harm to sleep, and I discontinued potassium use entirely.

“2012 Election Predictions”, Branwen 2012

2012-election-predictions: “2012 election predictions”⁠, Gwern Branwen (2012-11-05; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Compiling academic and media forecaster’s 2012 American Presidential election predictions and statistically judging correctness; Nate Silver was not the best.

Statistically analyzing in R hundreds of predictions compiled for ~10 forecasters of the 2012 American Presidential election, and ranking them by Brier, RMSE, & log scores; the best overall performance seems to be by Drew Linzer and Wang & Holbrook, while Nate Silver appears somewhat over-rated and the famous Intrade prediction market turned in a disappointing overall performance.
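
The three scoring rules used for the ranking are each a one-liner; the forecasts below are hypothetical, but they show why the scores agree that confident-and-right beats hedged:

```python
import math

def brier(probs, outcomes):
    """Mean squared error of forecast probabilities against 0/1 outcomes (lower = better)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes):
    """Mean log probability assigned to what actually happened (higher = better)."""
    return sum(math.log(p if o else 1 - p)
               for p, o in zip(probs, outcomes)) / len(probs)

def rmse(probs, outcomes):
    return math.sqrt(brier(probs, outcomes))

# Hypothetical forecasts for 4 states won (1) or lost (0) by a candidate:
confident = [0.95, 0.9, 0.8, 0.1]
hedged    = [0.7, 0.6, 0.6, 0.4]
outcomes  = [1, 1, 1, 0]
for name, f in [("confident", confident), ("hedged", hedged)]:
    print(name, round(brier(f, outcomes), 3), round(log_score(f, outcomes), 3))
```

Both rules are proper: a forecaster minimizes expected Brier loss (and maximizes expected log score) by reporting their true beliefs, which is what makes them suitable for ranking forecasters.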

“Biased Information As Anti-information”, Branwen 2012

backfire-effect: “Biased information as anti-information”⁠, Gwern Branwen (2012-10-19; ⁠, ⁠, ; similar):

Filtered data for a belief can rationally push you away from that belief

The backfire effect is a recently-discovered bias where arguments contrary to a person’s belief lead them to believe even more strongly in that belief; this is taken as obviously “irrational”. The “rational” update can be statistically modeled as a shift in the estimated mean of a normal distribution where each randomly distributed datapoint is an argument: new datapoints below the mean cause a shift of the inferred mean downward and likewise if above. When this model is changed to include the “censoring” of datapoints, then the valid inference changes and a datapoint below the mean can lead to a shift of the mean upwards. This suggests that providing a person with anything less than the best data contrary to, or decisive refutations of, one of their beliefs may result in them becoming even more certain of that belief. If it is enjoyable or profitable to argue with a person while one does less than one’s best, it is bad to hold false beliefs, and this badness is not shared between both parties, then arguing online may constitute a negative externality: an activity whose benefits are gained by one party but whose full costs are not paid by the same party. In many moral systems, negative externalities are considered selfish and immoral; hence, lazy or half-hearted arguing may be immoral because it internalizes any benefits while possibly leaving the other person epistemically worse off.
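
One way to make the censoring model concrete (the selection mechanism and n = 10 are illustrative assumptions, not the essay’s exact specification): suppose each argument is a draw from N(μ, 1), with lower values more contrary to the belief, and the argument you are shown is the most contrary of n available (a stand-in for selective presentation). Then an argument that is below your prior mean, but not as far below as selection would predict, rationally shifts the mean up:

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def phi(z):
    return math.exp(-0.5 * z * z) / SQRT2PI      # standard normal pdf

def PHI(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

N = 10        # the shown argument is the most contrary (minimum) of N draws
X_OBS = -0.5  # observed argument: below the prior mean of 0

def lik(mu):
    # Density of the minimum of N draws from N(mu, 1):
    # f(x) = N * phi(x - mu) * (1 - PHI(x - mu))**(N - 1)
    z = X_OBS - mu
    return N * phi(z) * (1 - PHI(z)) ** (N - 1)

grid = [-4 + 0.01 * i for i in range(801)]
prior = [phi(mu) for mu in grid]                 # prior: mu ~ N(0, 1)

# Naive update (treating the datapoint as a random draw, not a minimum):
naive = [p * phi(X_OBS - mu) for p, mu in zip(prior, grid)]
# Censoring-aware update:
aware = [p * lik(mu) for p, mu in zip(prior, grid)]

def mean(weights):
    z = sum(weights)
    return sum(mu * w for mu, w in zip(grid, weights)) / z

print(f"naive posterior mean: {mean(naive):+.3f}")
print(f"selection-aware posterior mean: {mean(aware):+.3f}")
```

The naive update drags the mean below zero toward the datapoint, but the selection-aware update moves it above zero: the worst of 10 arguments “should” have been near −1.5, so observing only −0.5 is evidence the underlying case is stronger than you thought.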

“A/B Testing Long-form Readability on”, Branwen 2012

AB-testing: “A/B testing long-form readability on”⁠, Gwern Branwen (2012-06-16; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.

To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/​B tests since June 2012. As expected, most do not show any meaningful difference.

“Dual N-Back Meta-Analysis”, Branwen 2012

DNB-meta-analysis: “Dual n-Back Meta-Analysis”⁠, Gwern Branwen (2012-05-20; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Does DNB increase IQ? What factors affect the studies? Probably not: gains are driven by studies with weakest methodology like apathetic control groups.

I meta-analyze the >19 studies up to 2016 which measure IQ after an n-back intervention, finding (over all studies) a net gain (medium-sized) on the post-training IQ tests.

The size of this increase on IQ test score correlates highly with the methodological concern of whether a study used active or passive control groups⁠. This indicates that the medium effect size is due to methodological problems and that n-back training does not increase subjects’ underlying fluid intelligence but the gains are due to the motivational effect of passive control groups (who did not train on anything) not trying as hard as the n-back-trained experimental groups on the post-tests. The remaining studies using active control groups find a small positive effect (but this may be due to matrix-test-specific training, undetected publication bias, smaller motivational effects, etc.)

I also investigate several other n-back claims, criticisms, and indicators of bias, finding:

“One Man’s Modus Ponens”, Branwen 2012

Modus: “One Man’s Modus Ponens”⁠, Gwern Branwen (2012-05-01; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

One man’s modus ponens is another man’s modus tollens is a saying in Western philosophy encapsulating a common response to a logical proof which generalizes the reductio ad absurdum and consists of rejecting a premise based on an implied conclusion. I explain it in more detail, provide examples, and give a Bayesian gloss.

A logically-valid argument which takes the form of a modus ponens may be interpreted in several ways; a major one is to interpret it as a kind of reductio ad absurdum, where by ‘proving’ a conclusion believed to be false, one might instead take it as a modus tollens which proves that one of the premises is false. This “Moorean shift” is aphorized as the snowclone⁠, “One man’s modus ponens is another man’s modus tollens”.

The Moorean shift is a powerful counter-argument which has been deployed against many skeptical & metaphysical claims in philosophy, where often the conclusion is extremely unlikely and little evidence can be provided for the premises used in the proofs; and it is relevant to many other debates, particularly methodological ones.

“Learning Is Planning: near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search”, Asmuth & Littman 2012

“Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search”⁠, John Asmuth, Michael L. Littman (2012-02-14; ⁠, ; similar):

Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.

“Learning Performance of Prediction Markets With Kelly Bettors”, Beygelzimer et al 2012

“Learning Performance of Prediction Markets with Kelly Bettors”⁠, Alina Beygelzimer, John Langford, David Pennock (2012-01-31; ; similar):

[blog] In evaluating prediction markets (and other crowd-prediction mechanisms), investigators have repeatedly observed a so-called “wisdom of crowds” effect, which roughly says that the average of participants performs much better than the average participant. The market price—an average or at least aggregate of traders’ beliefs—offers a better estimate than most any individual trader’s opinion.

In this paper, we ask a stronger question: how does the market price compare to the best trader’s belief, not just the average trader. We measure the market’s worst-case log regret⁠, a notion common in machine learning theory. To arrive at a meaningful answer, we need to assume something about how traders behave. We suppose that every trader optimizes according to the Kelly criteria⁠, a strategy that provably maximizes the compound growth of wealth over an (infinite) sequence of market interactions. We show several consequences.

First, the market prediction is a wealth-weighted average of the individual participants’ beliefs. Second, the market learns at the optimal rate, the market price reacts exactly as if updating according to Bayes’ Law, and the market prediction has low worst-case log regret to the best individual participant. We simulate a sequence of markets where an underlying true probability exists, showing that the market converges to the true objective frequency as if updating a Beta distribution⁠, as the theory predicts. If agents adopt a fractional Kelly criteria, a common practical variant, we show that agents behave like full-Kelly agents with beliefs weighted between their own and the market’s, and that the market price converges to a time-discounted frequency.

Our analysis provides a new justification for fractional Kelly betting, a strategy widely used in practice for ad-hoc reasons. Finally, we propose a method for an agent to learn her own optimal Kelly fraction.
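
The paper’s central dynamic is easy to simulate (this sketch is a simplified reading of the model, with a hypothetical population of bettors, not the paper’s code): fixed-belief Kelly bettors re-bet each round at the wealth-weighted-average price, and wealth flows toward the best-calibrated beliefs:

```python
import random

random.seed(0)

TRUE_P = 0.7
# 99 Kelly bettors with fixed beliefs spread over (0,1) and equal wealth:
beliefs = [i / 100 for i in range(1, 100)]
wealth = [1.0] * 99

def price():
    # The paper's result: the market price is the wealth-weighted average belief.
    return sum(w * b for w, b in zip(wealth, beliefs)) / sum(wealth)

for _ in range(2000):
    p = price()
    outcome = random.random() < TRUE_P
    # A Kelly bettor with belief b stakes so that her wealth is multiplied
    # by b/p if the event happens and by (1-b)/(1-p) if it does not.
    for i, b in enumerate(beliefs):
        wealth[i] *= (b / p) if outcome else ((1 - b) / (1 - p))

print(f"final market price: {price():.3f}")
```

Total wealth is conserved each round (the winners’ gains are exactly the losers’ losses), and relative wealth after k successes in n rounds is proportional to bᵏ(1−b)ⁿ⁻ᵏ, so the price evolves exactly like the mean of a discretized Beta posterior, converging to the true frequency as the paper predicts.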

“Vitamin D Sleep Experiments”, Branwen 2012

Vitamin-D: “Vitamin D sleep experiments”⁠, Gwern Branwen (2012; ⁠, ⁠, ⁠, ; backlinks; similar):

Self-experiment on vitamin D effects on sleep: harmful taken at night, no or beneficial effects when taken in the morning.

Vitamin D is a hormone endogenously created by exposure to sunlight; due to historically low outdoors activity levels, it has become a popular supplement and I use it. Some anecdotes suggest that vitamin D may have circadian and zeitgeber effects due to its origin, and is harmful to sleep when taken at night. I ran a blinded randomized self-experiment on taking vitamin D pills at bedtime. The vitamin D damaged my sleep and especially how rested I felt upon wakening, suggesting vitamin D did have a stimulating effect which obstructed sleep. I conducted a followup blinded randomized self-experiment on the logical next question: if vitamin D is a daytime cue, then would vitamin D taken in the morning show some beneficial effects? The results were inconclusive (but slightly in favor of benefits). Given the asymmetry, I suggest that vitamin D supplements should be taken only in the morning.

“Silk Road 1: Theory & Practice”, Branwen 2011

Silk-Road: “Silk Road 1: Theory & Practice”⁠, Gwern Branwen (2011-07-11; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

History, background, visiting, ordering, using, & analyzing the drug market Silk Road 1

The cypherpunk movement laid the ideological roots of Bitcoin and the online drug market Silk Road; balancing previous emphasis on cryptography, I emphasize the non-cryptographic market aspects of Silk Road, which are rooted in cypherpunk economic reasoning, give a fully detailed account of how a buyer might use market information to rationally buy, and finish by discussing strengths and weaknesses of Silk Road and what future developments are predicted by cypherpunk ideas.

“PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Deisenroth & Rasmussen 2011

2011-deisenroth.pdf: “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”⁠, Marc Peter Deisenroth, Carl Edward Rasmussen (2011-06-01; ⁠, ⁠, ; backlinks; similar):

In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way.

By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement.

We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

[Remarkably, PILCO can learn your standard “Cartpole” task within just a few trials by carefully building a Bayesian Gaussian process model and picking the maximally-informative experiments to run. Cartpole is quite difficult for a human, incidentally; there’s an installation of one in the SF Exploratorium⁠, and I just had to try it out once I recognized it. (My sample-efficiency was not better than PILCO.)]

“Death Note: L, Anonymity & Eluding Entropy”, Branwen 2011

Death-Note-Anonymity: “Death Note: L, Anonymity & Eluding Entropy”⁠, Gwern Branwen (2011-05-04; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Applied Computer Science: On Murder Considered As STEM Field—using information theory to quantify the magnitude of Light Yagami’s mistakes in Death Note and considering fixes

In the manga Death Note, the protagonist Light Yagami is given the supernatural weapon “Death Note” which can kill anyone on demand, and begins using it to reshape the world. The genius detective L attempts to track him down with analysis and trickery, and ultimately succeeds. Death Note is almost a thought experiment: given the perfect murder weapon, how can you screw up anyway? I consider the various steps of L’s process from the perspective of computer security, cryptography, and information theory, to quantify Light’s initial anonymity and how L gradually de-anonymizes him, and consider which mistake was the largest as follows:

  1. Light’s fundamental mistake is to kill in ways unrelated to his goal.

    Killing through heart attacks does not just make him visible early on, but the deaths reveal that his assassination method is impossibly precise and something profoundly anomalous is going on. L has been tipped off that Kira exists. Whatever the bogus justification may be, this is a major victory for his opponents. (To deter criminals and villains, it is not necessary for there to be a globally-known single anomalous or supernatural killer, when it would be equally effective to arrange for all the killings to be done naturalistically by ordinary mechanisms such as third parties/​police/​judiciary or used indirectly as parallel construction to crack cases.)

  2. Worse, the deaths are non-random in other ways—they tend to occur at particular times!

    Just the scheduling of deaths cost Light 6 bits of anonymity.

  3. Light’s third mistake was reacting to the blatant provocation of Lind L. Tailor.

    Taking the bait let L narrow his target down to 1⁄3 the original Japanese population, for a gain of ~1.6 bits.

  4. Light’s fourth mistake was to use confidential police information stolen using his policeman father’s credentials.

    This mistake was the largest in bits lost: it cost him 11 bits of anonymity; in other words, twice what his scheduling cost him and almost 8 times the murder of Tailor!

  5. Killing Ray Penbar and the FBI team.

    If we assume Penbar was tasked 200 leads out of the 10,000, then murdering him and the fiancee dropped Light just 6 bits, or a little over half the fourth mistake, and comparable to the original scheduling mistake.

  6. Endgame: At this point in the plot, L resorts to direct measures and enters Light’s life directly, enrolling at the university, with Light unable to perfectly play the role of innocent under intense in-person surveillance.

From that point on, Light is screwed as he is now playing a deadly game of “Mafia” with L & the investigative team. He frittered away >25 bits of anonymity and then L intuited the rest and suspected him all along.
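The bit-accounting above is just logarithms of suspect-pool sizes; a minimal sketch of the arithmetic (the population figures here are round illustrative numbers, not the essay’s exact inputs):

```python
from math import log2

def bits_of_anonymity(candidates):
    """Anonymity of one person hiding in a pool of `candidates` suspects."""
    return log2(candidates)

def bits_lost(before, after):
    """Information leaked by narrowing the suspect pool from `before` to `after`."""
    return log2(before) - log2(after)

world = 7_000_000_000
# Hiding among ~7 billion people is ≈32.7 bits of anonymity to burn through.
initial = bits_of_anonymity(world)

# Mistake 3: reacting to Lind L. Tailor narrowed the pool to 1/3 of Japan:
# a loss of log2(3) ≈ 1.6 bits, regardless of the absolute population size.
tailor = bits_lost(3, 1)
```

Note that narrowing by a fixed *ratio* always costs the same number of bits, which is why a 3× narrowing is worth only ~1.6 bits while an 11-bit mistake corresponds to a ~2,000× narrowing.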

Finally, I suggest how Light could have most effectively employed the Death Note and limited his loss of anonymity. In an appendix, I discuss the maximum amount of information leakage possible from using a Death Note as a communication device.

(Note: This essay assumes a familiarity with the early plot of Death Note and Light Yagami. If you are unfamiliar with DN, see my Death Note Ending essay or consult Wikipedia or read the DN rules⁠.)

“Tea Reviews”, Branwen 2011

Tea: “Tea Reviews”⁠, Gwern Branwen (2011-04-13; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Teas I have drunk, with reviews and future purchases; focused primarily on oolongs and greens. Plus experiments on water.

Electric kettles are faster, but I was curious how much faster my electric kettle heated water to high or boiling temperatures than does my stove-top kettle. So I collected some data and compared them directly, trying out a number of statistical methods (principally: nonparametric & parametric tests of difference, linear & beta regression models, and a Bayesian measurement error model). My electric kettle is faster than the stove-top kettle (the difference is both statistically-significant p≪0.01 & the posterior probability of difference is P ≈ 1), and the modeling suggests time to boil is largely predictable from a combination of volume, end-temperature, and kettle type.
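The kettle comparison can be reproduced in miniature with an exact permutation test on the difference in mean boil times; the timings below are hypothetical stand-ins for the actual measurements:

```python
from itertools import combinations

def perm_test(a, b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = a + b
    n = len(a)
    observed = abs(sum(a) / n - sum(b) / len(b))
    hits = total = 0
    # enumerate every relabeling of the pooled data into groups of size n
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        ga = [pooled[i] for i in chosen]
        gb = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(ga) / len(ga) - sum(gb) / len(gb))
        if diff >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total  # fraction of relabelings at least as extreme

electric = [228, 235, 241, 230, 238, 233, 229, 236]  # seconds to boil (hypothetical)
stove    = [310, 322, 305, 318, 325, 309, 315, 320]
p = perm_test(electric, stove)
```

With two fully-separated samples of 8, only the 2 labelings matching the observed split are as extreme, so p = 2⁄12,870 ≈ 0.00016, mirroring the p≪0.01 result reported above.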

“ Website Traffic”, Branwen 2011

Traffic: “ Website Traffic”⁠, Gwern Branwen (2011-02-03; ⁠, ⁠, ⁠, ⁠, ; similar):

Meta page describing editing activity, traffic statistics, and referrer details, primarily sourced from Google Analytics (2011-present).

On a semi-annual basis, since 2011, I review website traffic using Google Analytics; although what most readers value is not what I value, I find it motivating to see total traffic statistics reminding me of readers (writing can be a lonely and abstract endeavour), and useful to see what the major referrers are. The site typically enjoys steady traffic in the 50–100k range per month, with occasional spikes from social media, particularly Hacker News; over the first decade (2010–2020), there were 7.98m pageviews by 3.8m unique users.

“Zeo Sleep Self-experiments”, Branwen 2010

Zeo: “Zeo sleep self-experiments”⁠, Gwern Branwen (2010-12-28; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

EEG recordings of sleep and my experiments with things affecting sleep quality or duration: melatonin, potassium, vitamin D, etc.

I discuss my beliefs about Quantified Self⁠, and demonstrate with a series of single-subject design self-experiments using a Zeo. A Zeo records sleep via EEG; I have made many measurements and performed many experiments. This is what I have learned so far:

  1. the Zeo headband is wearable long-term
  2. melatonin improves my sleep
  3. one-legged standing does little
  4. Vitamin D at night damages my sleep & Vitamin D in morning does not affect my sleep
  5. potassium (over the day but not so much the morning) damages my sleep and does not improve my mood/​productivity
  6. small quantities of alcohol appear to make little difference to my sleep quality
  7. I may be better off changing my sleep timing by waking up somewhat earlier & going to bed somewhat earlier
  8. lithium orotate does not affect my sleep
  9. Redshift causes me to go to bed earlier
  10. ZMA: inconclusive results slightly suggestive of benefits

“The Replication Crisis: Flaws in Mainstream Science”, Branwen 2010

Replication: “The Replication Crisis: Flaws in Mainstream Science”⁠, Gwern Branwen (2010-10-27; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

2013 discussion of how systemic biases in science, particularly medicine and psychology, have resulted in a research literature filled with false positives and exaggerated effects, called ‘the Replication Crisis’.

Long-standing problems in standard scientific methodology have exploded as the “Replication Crisis”: the discovery that many results in fields as diverse as psychology, economics, medicine, biology, and sociology are in fact false or measured with gross quantitative inaccuracy. I cover here a handful of the issues and publications on this large, important, and rapidly developing topic up to about 2013, at which point the Replication Crisis became too large a topic to cover more than cursorily. (A compilation of some additional links is provided for post-2013 developments.)

The crisis is caused by methods & publishing procedures which interpret random noise as important results, far too small datasets, selective analysis by an analyst trying to reach expected/​desired results, publication bias, poor implementation of existing best-practices, nontrivial levels of research fraud, software errors, philosophical beliefs among researchers that false positives are acceptable, neglect of known confounding like genetics, and skewed incentives (financial & professional) to publish ‘hot’ results.

Thus, any individual piece of research typically establishes little. Scientific validation comes not from small p-values, but from discovering a regular feature of the world which disinterested third parties can discover with straightforward research done independently on new data with new procedures—replication.

“About This Website”, Branwen 2010

About: “About This Website”⁠, Gwern Branwen (2010-10-01; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Meta page describing site ideals of stable long-term essays which improve over time; idea sources and writing methodology; metadata definitions; site statistics; copyright license.

This page is about content; for the details of its implementation & design like the popup paradigm, see Design⁠; and for information about me, see Links⁠.

“Bayesian Data Analysis”, Kruschke 2010

2010-kruschke.pdf: “Bayesian data analysis”⁠, John K. Kruschke (2010-08-10; ; backlinks)

“Nootropics”, Branwen 2010

Nootropics: “Nootropics”⁠, Gwern Branwen (2010-01-02; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar)

“Predicting the Next Big Thing: Success As a Signal of Poor Judgment”, Denrell & Fang 2010

2010-denrell.pdf: “Predicting the Next Big Thing: Success as a Signal of Poor Judgment”⁠, Jerker Denrell, Christina Fang (2010-01-01; ; backlinks)

“Darwin, Galton and the Statistical Enlightenment”, Stigler 2010

2010-stigler.pdf: “Darwin, Galton and the Statistical Enlightenment”⁠, Stephen M. Stigler (2010-01-01)

“Who Wrote The ‘Death Note’ Script?”, Branwen 2009

Death-Note-script: “Who Wrote The ‘Death Note’ Script?”⁠, Gwern Branwen (2009-11-02; ⁠, ⁠, ⁠, ; backlinks; similar):

Internal, external, stylometric evidence point to live-action leak of Death Note Hollywood script being real.

I give a history of the 2009 leaked script, discuss internal & external evidence for its realness including stylometrics; and then give a simple step-by-step Bayesian analysis of each point. We finish with high confidence in the script being real, discussion of how this analysis was surprisingly enlightening, and what followup work the analysis suggests would be most valuable.

“Miscellaneous”, Branwen 2009

Notes: “Miscellaneous”⁠, Gwern Branwen (2009-08-05; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Misc thoughts, memories, proto-essays, musings, etc.

We usually clean up after ourselves, but sometimes, we are expected to clean before (ie. after others) instead. Why?

Because in those cases, pre-cleanup is the same amount of work, but game-theoretically better whenever a failure of post-cleanup would cause the next person problems.

“When Superstars Flop: Public Status and Choking Under Pressure in International Soccer Penalty Shootouts”, Jordet 2009

2009-jordet.pdf: “When Superstars Flop: Public Status and Choking Under Pressure in International Soccer Penalty Shootouts”⁠, Geir Jordet (2009-04-15; ; backlinks; similar):

The purpose of this study was to examine links between public status and performance in a real-world, high-pressure sport task.

It was believed that high public status could negatively affect performance through added performance pressure. Video analyses were conducted of all penalty shootouts ever held in 3 major soccer tournaments (n = 366 kicks) and public status was derived from prestigious international awards (eg. “FIFA World Player of the year”).

The results showed that players with high current status performed worse and seemed to engage more in certain escapist self-regulatory behaviors than players with future status. Some of these performance drops may be accounted for by misdirected self-regulation (particularly low response time), but only small multivariate effects were found.

[See Regression To The Mean Fallacies⁠.]

“Dual N-Back FAQ”, Branwen 2009

DNB-FAQ: “Dual n-Back FAQ”⁠, Gwern Branwen (2009-03-25; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

A compendium of DNB, WM⁠, IQ information up to 2015.

Between 2008 and 2011, I collected a number of anecdotal reports about the effects of n-backing; there are many other anecdotes out there, but the following are a good representation—for what they’re worth.

“Modafinil”, Branwen 2009

Modafinil: “Modafinil”⁠, Gwern Branwen (2009-02-20; ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

Effects, health concerns, suppliers, prices & rational ordering.

Modafinil is a prescription stimulant drug. I discuss informally, from a cost-benefit-informed perspective, the research up to 2015 on modafinil’s cognitive effects, the risks of side-effects and addiction/​tolerance and law enforcement, and give a table of current grey-market suppliers and discuss how to order from them.

“Models for Potentially Biased Evidence in Meta-analysis Using Empirically Based Priors”, Welton et al 2008

2009-welton.pdf: “Models for potentially biased evidence in meta-analysis using empirically based priors”⁠, N. J. Welton, A. E. Ades, J. B. Carlin, D. G. Altman, J. A. C. Sterne (2008-12-22; ; similar):

We present models for the combined analysis of evidence from randomized controlled trials categorized as being at either low or high risk of bias due to a flaw in their conduct.

We formulate a bias model that incorporates between-study and between-meta-analysis heterogeneity in bias, and uncertainty in overall mean bias. We obtain algebraic expressions for the posterior distribution of the bias-adjusted treatment effect, which provide limiting values for the information that can be obtained from studies at high risk of bias.

The parameters of the bias model can be estimated from collections of previously published meta-analyses. We explore alternative models for such data, and alternative methods for introducing prior information on the bias parameters into a new meta-analysis.

Results from an illustrative example show that the bias-adjusted treatment effect estimates are sensitive to the way in which the meta-epidemiological data are modelled, but that using point estimates for bias parameters provides an adequate approximation to using a full joint prior distribution⁠. A sensitivity analysis shows that the gain in precision from including studies at high risk of bias is likely to be low, however numerous or large their size, and that little is gained by incorporating such studies, unless the information from studies at low risk of bias is limited.

We discuss approaches that might increase the value of including studies at high risk of bias, and the acceptability of the methods in the evaluation of health care interventions.

[Keywords: Bayesian methods, Bias, health technology assessment, Markov chain Monte Carlo methods⁠, randomized controlled trial]

“Optimal Approximation of Signal Priors”, Hyvarinen 2008

2008-hyvarinen.pdf: “Optimal approximation of signal priors”⁠, Aapo Hyvärinen (2008-12-01; ; backlinks; similar):

In signal restoration by Bayesian inference, one typically uses a parametric model of the prior distribution of the signal. Here, we consider how the parameters of a prior model should be estimated from observations of uncorrupted signals. A lot of recent work has implicitly assumed that maximum likelihood estimation is the optimal estimation method. Our results imply that this is not the case. We first obtain an objective function that approximates the error incurred in signal restoration due to an imperfect prior model. Next, we show that in an important special case (small Gaussian noise), the error is the same as the score-matching objective function, which was previously proposed as an alternative to likelihood, based on purely computational considerations. Our analysis thus shows that score matching combines computational simplicity with statistical optimality in signal restoration, providing a viable alternative to maximum likelihood methods. We also show how the method leads to a new intuitive and geometric interpretation of structure inherent in probability distributions.

“Verbal Probability Expressions In National Intelligence Estimates: A Comprehensive Analysis Of Trends From The Fifties Through Post-9/11”, Kesselman 2008

2008-kesselman.pdf: “Verbal Probability Expressions In National Intelligence Estimates: A Comprehensive Analysis Of Trends From The Fifties Through Post-9/11”⁠, Rachel F. Kesselman (2008-05; ⁠, ; backlinks; similar):

This research presents the findings of a study that analyzed words of estimative probability in the key judgments of National Intelligence Estimates from the 1950s through the 2000s. The research found that of the 50 words examined, only 13 were statistically-significant. Furthermore, interesting trends have emerged when the words are broken down into English modals, terminology that conveys analytical assessments and words employed by the National Intelligence Council as of 2006. One of the more intriguing findings is that use of the word will has by far been the most popular for analysts, registering over 700 occurrences throughout the decades; however, a word of such certainty is problematic in the sense that intelligence should never deal with 100% certitude. The relatively low occurrence and wide variety of word usage across the decades demonstrates a real lack of consistency in the way analysts have been conveying assessments over the past 58 years. Finally, the researcher suggests the Kesselman List of Estimative Words for use in the IC. The word list takes into account the literature review findings as well as the results of this study in equating odds with verbal probabilities.

[Rachel’s lit review, for example, makes for very interesting reading. She has done a thorough search of not only the intelligence but also the business, linguistics and other literatures in order to find out how other disciplines have dealt with the problem of “What do we mean when we say something is ‘likely’…” She uncovered, for example, that, in medicine, words of estimative probability such as “likely”, “remote” and “probably” have taken on more or less fixed meanings due primarily to outside intervention or, as she put it, “legal ramifications”. Her comparative analysis of the results and approaches taken by these other disciplines is required reading for anyone in the Intelligence Community trying to understand how verbal expressions of probability are actually interpreted. The NIC’s list only became final in the last several years, so it is arguable whether this list of nine words really captures the breadth of estimative word usage across the decades. Rather, it would be arguable if this chart didn’t make it crystal clear that the Intelligence Community has really relied on just two words, “probably” and “likely”, to express its estimates of probabilities for the last 60 years. All other words are used rarely or not at all.

Based on her research of what works and what doesn’t and which words seem to have the most consistent meanings to users, Rachel even offers her own list of estimative words along with their associated probabilities:

  1. Almost certain: 86–99%
  2. Highly likely: 71–85%
  3. Likely: 56–70%
  4. Chances a little better [or less] than even: 46–55%
  5. Unlikely: 31–45%
  6. Highly unlikely: 16–30%
  7. Remote: 1–15%


[See also “Decision by sampling”⁠, Stewart et al 2006; “Processing Linguistic Probabilities: General Principles and Empirical Evidence”⁠, Budescu & Wallsten 1995.]

“The Allure of Equality: Uniformity in Probabilistic and Statistical Judgment”, Falk & Lann 2008

2008-falk.pdf: “The allure of equality: Uniformity in probabilistic and statistical judgment”⁠, Ruma Falk, Avital Lann (2008-01-01; ; backlinks)

“Experiments on Partisanship and Public Opinion: Party Cues, False Beliefs, and Bayesian Updating”, Bullock 2007

2007-bullock.pdf: “Experiments on partisanship and public opinion: Party cues, false beliefs, and Bayesian updating”⁠, John G. Bullock (2007-06-01; ; backlinks; similar):

This dissertation contains 3 parts—three papers. The first is about the effects of party cues on policy attitudes and candidate preferences. The second is about the resilience of false political beliefs. The third is about Bayesian updating of public opinion. Substantively, what unites them is my interest in partisanship and public opinion. Normatively, they all spring from my interest in the quality of citizens’ thinking about politics. Methodologically, they are bound by my conviction that we gain purchase on interesting empirical questions by doing things differently: first, by bringing more experiments to fields still dominated by cross-sectional survey research; second, by using experiments unlike the ones that have gone before.

  1. Part 1: It is widely believed that party cues affect political attitudes. But their effects have rarely been demonstrated, and most demonstrations rely on questionable inferences about cue-taking behavior. I use data from 3 experiments on representative national samples to show that party cues affect even the extremely well-informed and that their effects are, as Downs predicted, decreasing in the amount of policy-relevant information that people have. But the effects are often smaller than we imagine and much smaller than the ones caused by changes in policy-relevant information. Partisans tend to perceive themselves as much less influenced by cues than members of the other party—a finding with troubling implications for those who subscribe to deliberative theories of democracy.
  2. Part 2: The widely noted tendency of people to resist challenges to their political beliefs can usually be explained by the poverty of those challenges: they are easily avoided, often ambiguous, and almost always easily dismissed as irrelevant, biased, or uninformed. It is natural to hope that stronger challenges will be more successful. In a trio of experiments that draw on real-world cases of misinformation, I instill false political beliefs and then challenge them in ways that are unambiguous and nearly impossible to avoid or dismiss for the conventional reasons. The success of these challenges proves highly contingent on party identification.
  3. Part 3: Political scientists are increasingly interested in using Bayes’ Theorem to evaluate citizens’ thinking about politics. But there is widespread uncertainty about why the Theorem should be considered a normative standard for rational information processing and whether models based on it can accommodate ordinary features of political cognition including partisan bias, attitude polarization, and enduring disagreement. I clarify these points with reference to the best-known Bayesian updating model and several little-known but more realistic alternatives. I show that the Theorem is more accommodating than many suppose—but that, precisely because it is so accommodating, it is far from an ideal standard for rational information processing.

“A Free Energy Principle for the Brain”, Friston et al 2006

2006-friston.pdf: “A free energy principle for the brain”⁠, Karl Friston, James Kilner, Lee Harrison (2006-07-01; ⁠, ; backlinks; similar):

By formulating Helmholtz’s ideas about perception, in terms of modern-day theories, one arrives at a model of perceptual inference and learning that can explain a remarkable range of neurobiological facts: using constructs from statistical physics, the problems of inferring the causes of sensory input and learning the causal structure of their generation can be resolved using exactly the same principles. Furthermore, inference and learning can proceed in a biologically plausible fashion. The ensuing scheme rests on Empirical Bayes and hierarchical models of how sensory input is caused. The use of hierarchical models enables the brain to construct prior expectations in a dynamic and context-sensitive fashion. This scheme provides a principled way to understand many aspects of cortical organisation and responses.

In this paper, we show these perceptual processes are just one aspect of emergent behaviours of systems that conform to a free energy principle. The free energy considered here measures the difference between the probability distribution of environmental quantities that act on the system and an arbitrary distribution encoded by its configuration. The system can minimise free energy by changing its configuration to affect the way it samples the environment or change the distribution it encodes. These changes correspond to action and perception respectively and lead to an adaptive exchange with the environment that is characteristic of biological systems. This treatment assumes that the system’s state and structure encode an implicit and probabilistic model of the environment. We will look at the models entailed by the brain and how minimisation of its free energy can explain its dynamics and structure.

[Keywords: Variational Bayes⁠, free energy, inference, perception, action, learning, attention, selection, hierarchical]

“The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, Smith & Winkler 2006

2006-smith.pdf: “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”⁠, James E. Smith, Robert L. Winkler (2006-03-01; ; backlinks; similar):

Decision analysis produces measures of value such as expected net present values or expected utilities and ranks alternatives by these value estimates. Other optimization-based processes operate in a similar manner. With uncertainty and limited resources, an analysis is never perfect, so these value estimates are subject to error. We show that if we take these value estimates at face value and select accordingly, we should expect the value of the chosen alternative to be less than its estimate, even if the value estimates are unbiased. Thus, when comparing actual outcomes to value estimates, we should expect to be disappointed on average, not because of any inherent bias in the estimates themselves, but because of the optimization-based selection process. We call this phenomenon the optimizer’s curse and argue that it is not well understood or appreciated in the decision analysis and management science communities. This curse may be a factor in creating skepticism in decision makers who review the results of an analysis.

In this paper, we study the optimizer’s curse and show that the resulting expected disappointment may be substantial. We then propose the use of Bayesian methods to adjust value estimates. These Bayesian methods can be viewed as disciplined skepticism and provide a method for avoiding this postdecision disappointment.
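The curse is easy to demonstrate by simulation; a minimal sketch (the choice of 10 alternatives and unit-variance noise is arbitrary, not taken from the paper):

```python
import random

random.seed(0)

def expected_disappointment(k=10, trials=2000):
    """All k alternatives have true value 0; their value estimates are the
    true value plus unbiased N(0,1) error. Selecting the apparent best and
    comparing its estimate to its true value reveals a systematic gap."""
    gap = 0.0
    for _ in range(trials):
        estimates = [random.gauss(0, 1) for _ in range(k)]
        gap += max(estimates) - 0.0  # chosen estimate minus its true value (0)
    return gap / trials

avg_gap = expected_disappointment()
```

Even though every individual estimate is unbiased, the average gap converges to the expected maximum of 10 standard normals (≈1.5): the selection step, not the estimation step, creates the bias.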

“Three Statistical Paradoxes in the Interpretation of Group Differences: Illustrated With Medical School Admission and Licensing Data”, Wainer & Brown 2006

2006-wainer.pdf: “Three Statistical Paradoxes in the Interpretation of Group Differences: Illustrated with Medical School Admission and Licensing Data”⁠, Howard Wainer, Lisa M. Brown (2006; backlinks; similar):

Interpreting group differences observed in aggregated data is a practice that must be done with enormous care. Often the truth underlying such data is quite different than a naïve first look would indicate. The confusions that can arise are so perplexing that some of the more frequently occurring ones have been dubbed paradoxes. In this chapter we describe three of the best known of these paradoxes—Simpson’s Paradox⁠, Kelley’s Paradox⁠, and Lord’s Paradox—and illustrate them in a single data set. The data set contains the score distributions, separated by race, on the biological sciences component of the Medical College Admission Test (MCAT) and Step 1 of the United States Medical Licensing Examination™ (USMLE). Our goal in examining these data was to move toward a greater understanding of race differences in admissions policies in medical schools. As we demonstrate, the path toward this goal is hindered by differences in the score distributions which give rise to these three paradoxes. The ease with which we were able to illustrate all of these paradoxes within a single data set is indicative of how widespread they are likely to be in practice.
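Simpson’s Paradox, at least, is easy to reproduce numerically; here is the classic kidney-stone treatment data (Charig et al 1986, not the MCAT/USMLE data of this chapter), in which one treatment wins within every stratum yet loses in aggregate:

```python
from fractions import Fraction as F

# successes/attempts for two treatments, stratified by kidney-stone size
a = {"small": F(81, 87),   "large": F(192, 263)}  # treatment A (open surgery)
b = {"small": F(234, 270), "large": F(55, 80)}    # treatment B (PCNL)

# A has the higher success rate within each stratum...
a_wins_each_stratum = all(a[s] > b[s] for s in a)

# ...but the lower rate after aggregation, because the strata are unevenly
# mixed: A was assigned mostly the hard (large-stone) cases.
a_total = F(81 + 192, 87 + 263)   # 273/350 ≈ 78%
b_total = F(234 + 55, 270 + 80)   # 289/350 ≈ 83%
```

The aggregate comparison silently reweights the strata, which is exactly the trap the score-distribution differences create in the MCAT/USMLE data.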

“Estimation of Non-Normalized Statistical Models by Score Matching”, Hyvarinen 2005

“Estimation of Non-Normalized Statistical Models by Score Matching”⁠, Aapo Hyvärinen (2005-04; ; backlinks; similar):

One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant⁠. Typically, one then has to resort to Markov Chain Monte Carlo methods⁠, or approximations of the normalization constant.

Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data.

While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model.
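Writing $\psi(x;\theta) = \nabla_x \log p(x;\theta)$ for the model score, the two forms of the objective described above can be stated compactly (following Hyvärinen 2005):

```latex
% Naive objective: match the model score to the (unknown) data score
J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}
  \left[ \left\| \nabla_x \log p(x;\theta) - \nabla_x \log p_{\mathrm{data}}(x) \right\|^2 \right]

% Integration by parts eliminates the data score, leaving only model terms:
J(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}
  \left[ \sum_i \left( \partial_i \psi_i(x;\theta)
    + \tfrac{1}{2}\,\psi_i(x;\theta)^2 \right) \right] + \mathrm{const}
```

The second form is the “surprising result”: it is a plain sample average over the observed data, requires no normalization constant, and never evaluates $p_{\mathrm{data}}$ itself.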

The validity of the [score-matching] method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data.

[Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo⁠, contrastive divergence]

“The Bayesian Brain: the Role of Uncertainty in Neural Coding and Computation”, Knill & Pouget 2004

2004-knill.pdf: “The Bayesian brain: the role of uncertainty in neural coding and computation”⁠, David C. Knill, Alexandre Pouget (2004-12; ⁠, ; backlinks; similar):

To use sensory information efficiently to make judgments and guide action in the world, the brain must represent and use information about uncertainty in its computations for perception and action. Bayesian methods have proven successful in building computational theories for perception and sensorimotor control, and psychophysics is providing a growing body of evidence that human perceptual computations are ‘Bayes’ optimal’. This leads to the ‘Bayesian coding hypothesis’: that the brain represents sensory information probabilistically, in the form of probability distributions. Several computational schemes have recently been proposed for how this might be achieved in populations of neurons. Neurophysiological data on the hypothesis, however, is almost non-existent. A major challenge for neuroscientists is to test these ideas experimentally, and so determine whether and how neurons code information about sensory uncertainty.

“Methods of Meta-Analysis: Correcting Error and Bias in Research Findings”, Hunter & Schmidt 2004

2004-hunterschmidt-methodsofmetaanalysis.pdf: “Methods of Meta-Analysis: Correcting Error and Bias in Research Findings”⁠, John E. Hunter, Dr. Frank L. Schmidt (2004-01-01; ⁠, ; backlinks)

“Bayesian Informal Logic and Fallacy”, Korb 2004

2003-korb.pdf: “Bayesian Informal Logic and Fallacy”⁠, Kevin Korb (2004; similar):

Bayesian reasoning has been applied formally to statistical inference, machine learning and analysing scientific method. Here I apply it informally to more common forms of inference, namely natural language arguments. I analyse a variety of traditional fallacies, deductive, inductive and causal, and find more merit in them than is generally acknowledged. Bayesian principles provide a framework for understanding ordinary arguments which is well worth developing.

“Two Statistical Paradoxes in the Interpretation of Group Differences: Illustrated With Medical School Admission and Licensing Data”, Wainer & Brown 2004

2004-wainer.pdf: “Two Statistical Paradoxes in the Interpretation of Group Differences: Illustrated with Medical School Admission and Licensing Data”⁠, Howard Wainer, Lisa M. Brown (2004; similar):

Interpreting group differences observed in aggregated data is a practice that must be done with enormous care. Often the truth underlying such data is quite different than a naïve first look would indicate. The confusions that can arise are so perplexing that some of the more frequently occurring ones have been dubbed paradoxes. This article describes two of these paradoxes—Simpson’s paradox and Lord’s paradox—and illustrates them in a single dataset. The dataset contains the score distributions, separated by race, on the biological sciences component of the Medical College Admission Test (MCAT) and Step 1 of the United States Medical Licensing Examination™ (USMLE). Our goal in examining these data was to move toward a greater understanding of race differences in admissions policies in medical schools. As we demonstrate, the path toward this goal is hindered by differences in the score distributions which give rise to these two paradoxes. The ease with which we were able to illustrate both of these paradoxes within a single dataset is indicative of how widespread they are likely to be in practice.

[Keywords: group differences, Lord’s paradox, Medical College Admission Test, Rubin’s model for causal inference, Simpson’s paradox, standardization, United States Medical Licensing Examination]

“Bayesian Computation: a Statistical Revolution”, Brooks 2003

2003-brooks.pdf: “Bayesian computation: a statistical revolution”⁠, Stephen P. Brooks (2003-11-03; backlinks; similar):

The 1990s saw a statistical revolution sparked predominantly by the phenomenal advances in computing technology from the early 1980s onwards. These advances enabled the development of powerful new computational tools, which reignited interest in a philosophy of statistics that had lain almost dormant since the turn of the century.

In this paper we briefly review the historic and philosophical foundations of the 2 schools of statistical thought, before examining the implications of the reascendance of the Bayesian paradigm for both current and future statistical practice.

[Keywords: computer packages [BUGS], Markov chain Monte Carlo⁠, model discrimination⁠, population ecology⁠, prior beliefs⁠, posterior distribution]

“A Bayesian Framework for Reinforcement Learning”, Strens 2000

“A Bayesian Framework for Reinforcement Learning”⁠, Malcolm Strens (2000-06-28; ; backlinks; similar):

The reinforcement learning problem can be decomposed into two parallel types of inference: (1) estimating the parameters of a model for the underlying process; (2) determining behavior which maximizes return under the estimated model.

Following Dearden et al 1999, it is proposed that the learning process estimates online the full posterior distribution over models.

To determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming⁠. By using a different hypothesis for each trial, appropriate exploratory and exploitative behavior is obtained.

This Bayesian method always converges to the optimal policy for a stationary process with discrete states.
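
Strens’s “sample a hypothesis from the posterior, act greedily on it” idea is now usually called Thompson sampling. A minimal sketch on a 2-armed Bernoulli bandit (rather than the paper’s full MDP-plus-dynamic-programming setting; arm payoff probabilities are hypothetical):

```python
import random

random.seed(0)
true_p = [0.3, 0.7]            # unknown to the learner
alpha = [1, 1]; beta = [1, 1]  # Beta(1, 1) posterior over each arm's p

for trial in range(2000):
    # Sample one hypothesis (a value of p per arm) from the posterior...
    sampled = [random.betavariate(alpha[a], beta[a]) for a in range(2)]
    # ...and act greedily with respect to that sampled hypothesis.
    arm = max(range(2), key=lambda a: sampled[a])
    reward = 1 if random.random() < true_p[arm] else 0
    # Conjugate Beta-Bernoulli posterior update.
    alpha[arm] += reward
    beta[arm] += 1 - reward

# The posterior concentrates on the better arm, so it is pulled far
# more often: exploration and exploitation emerge from the sampling.
pulls = [alpha[a] + beta[a] - 2 for a in range(2)]
assert pulls[1] > pulls[0]
```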

“Kelley’s Paradox”, Wainer 2000

2000-wainer.pdf: “Kelley’s Paradox”⁠, Howard Wainer (2000-01-01; backlinks)

“Statistical Issues in the Analysis of Data Gathered in the New Designs”, Kadane & Seidenfeld 1996

1996-kadane.pdf: “Statistical Issues in the Analysis of Data Gathered in the New Designs”⁠, Joseph B. Kadane, Teddy Seidenfeld (1996-01-01; backlinks)

“Is There Sufficient Historical Evidence to Establish the Resurrection of Jesus?”, Cavin 1995

1995-cavin.pdf: “Is There Sufficient Historical Evidence to Establish the Resurrection of Jesus?”⁠, Robert Greg Cavin (1995-01-01; backlinks)

“The Relevance of Group Membership for Personnel Selection: A Demonstration Using Bayes' Theorem”, Miller 1994

1994-miller.pdf: “The Relevance of Group Membership for Personnel Selection: A Demonstration Using Bayes' Theorem”⁠, Edward M. Miller (1994-09-01; backlinks; similar):

A Bayesian approach to problems of personnel selection implies a fundamental conflict between non-discrimination and merit selection. Groups—such as ethnic groups, sexes and races—do differ in various attributes relevant to vocational success, including intelligence and personality.

This journal has repeatedly discussed the technical and ethical issues raised by the existence of groups (races, sexes, ethnic groups) that frequently differ in abilities and other job-related characteristics (Eysenck 1991; Jensen 1992; Levin 1990, 1991). This paper is meant to add to that discussion by providing mathematical proof that consideration of such groups is, in general, necessary in selecting the best employees or students.

It is almost an article of faith that race, sex, religion, national origin, or similar classifications (which will be referred to here as groups) are irrelevant for hiring, given a goal of selecting the best candidates. The standard wisdom is that those selecting for school admission or employment should devise an unbiased (in the statistical sense) procedure which predicts individual performance, evaluate individuals with this, and then select the highest ranked individuals. However, analysis shows that even with statistically unbiased evaluation procedures, group membership may still be relevant. If the goal is to pick the best individuals for jobs or training, membership in the group with the lower average performance (the disadvantaged group) should properly be held against the individual. In general, not considering group membership and selecting the best candidates are mutually exclusive.

Related Psychometric Discussions: How does the conclusion reached above about the relevance of group membership relate to discussions in the technical psychometric literature?

At least some psychometricians have been aware of the relevance of group membership. Hunter & Schmidt 1976 point out that differences in group means will typically lead to differences in intercepts. Jensen (1980, p. 94, Bias in Mental Testing) points out that the best estimate of true scores is obtained by regressing observed scores towards the mean, and that if there are 2 groups with different means, the downwards correction for the high scoring individuals will be greater for those from the low scoring group. Kelley (1947, p. 409, Fundamentals of Statistics) put it as follows: “This is an interesting equation in that it expresses the estimate of true ability as a weighted sum of 2 separate estimates, one based upon the individual’s observed score, X1, and the other based upon the mean of the group to which he belongs, M1. If the test is highly reliable, much weight is given to the test score and little to the group mean, and vice versa”, although he may not have been thinking of demographic groups. Cronbach, Gleser, Nanda, and Rajaratnam (1972, The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles) discuss the problem of deducing universe scores (essentially true scores in traditional terminology) from test data, recognizing that group means will be relevant. They even display an awareness that, since blacks normally score lower than whites, the logic of their reasoning calls for the use of higher cut-off scores for blacks than for whites (see p. 385). Mislevy (1993) also displays an awareness that group means are relevant, although he feels it would be unfair to use them.

In general, the relevance of group membership has been known to the specialist psychometric community, although few outside the community are aware of the effect. Thus, the contribution of Bayes’ theorem is to provide another demonstration, one that those outside the psychometric community may be more comfortable with.
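
Kelley’s equation quoted above, the estimate of true score as a reliability-weighted average of observed score and group mean, is short enough to compute directly. Numbers below are hypothetical:

```python
def kelley_true_score(observed, reliability, group_mean):
    # Kelley 1947: estimated true score = r*X + (1 - r)*M
    return reliability * observed + (1 - reliability) * group_mean

r = 0.9                  # a fairly reliable test
x = 130                  # the same observed score for two examinees
high_mean, low_mean = 110, 90

est_high = kelley_true_score(x, r, high_mean)  # ~128: regressed toward 110
est_low = kelley_true_score(x, r, low_mean)    # ~126: regressed toward 90

# Identical observed scores, but the examinee from the lower-mean group
# is regressed further downward -- Miller's point that a statistically
# unbiased test still leaves group membership informative.
assert est_low < est_high < x
```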

“Subjective Probability”, Wright & Ayton 1994

1994-wright-subjectiveprobability.pdf: “Subjective Probability”⁠, George Wright, Peter Ayton (1994-01-01; backlinks)

“The Influence of Prior Beliefs on Scientific Judgments of Evidence Quality”, Koehler 1993

1993-koehler.pdf: “The Influence of Prior Beliefs on Scientific Judgments of Evidence Quality”⁠, Jonathan J. Koehler (1993-10-01; ⁠, ; similar):

This paper is concerned with the influence of scientists' prior beliefs on their judgments of evidence quality.

A laboratory experiment using advanced graduate students in the sciences (study 1) and an experimental survey of practicing scientists on opposite sides of a controversial issue (study 2) revealed agreement effects. Research reports that agreed with scientists' prior beliefs were judged to be of higher quality than those that disagreed.

In study 1, a prior belief strength × agreement interaction was found, indicating that the agreement effect was larger among scientists who held strong prior beliefs. In both studies, the agreement effect was larger for general, evaluative judgments (eg. relevance, methodological quality, results clarity) than for more specific, analytical judgments (eg. adequacy of randomization procedures).

A Bayesian analysis indicates that the pattern of agreement effects found in these studies may be normatively defensible, although arguments against implementing a Bayesian approach to scientific judgment are also advanced.

“Smoking As 'independent' Risk Factor for Suicide: Illustration of an Artifact from Observational Epidemiology?”, Smith et al 1992

1992-smith.pdf: “Smoking as 'independent' risk factor for suicide: illustration of an artifact from observational epidemiology?”⁠, George Davey Smith, Andrew N. Phillips, James D. Neaton (1992-01-01; ; backlinks)

“Bias in Relative Odds Estimation owing to Imprecise Measurement of Correlated Exposures”, Phillips & Smith 1992

1992-phillips.pdf: “Bias in relative odds estimation owing to imprecise measurement of correlated exposures”⁠, Andrew N. Phillips, George Davey Smith (1992-01-01; ; backlinks)

“How Independent Are 'independent' Effects? Relative Risk Estimation When Correlated Exposures Are Measured Imprecisely”, Phillips & Smith 1991

1991-phillips.pdf: “How independent are 'independent' effects? Relative risk estimation when correlated exposures are measured imprecisely”⁠, Andrew N. Phillips, George Davey Smith (1991-01-01; ; backlinks)

“Bayes-Hermite Quadrature”, O’Hagan 1991

1991-ohagan.pdf: “Bayes-Hermite quadrature”⁠, Andrew O’Hagan (1991-01-01)

“The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators”, Stigler 1990

1990-stigler.pdf: “The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators”⁠, Stephen M. Stigler (1990-02-01; backlinks; similar):

More than 30 years ago, Charles Stein discovered that in 3 or more dimensions, the ordinary estimator of the vector of means of a multivariate normal distribution is inadmissible⁠.

This article examines Stein’s paradox from the perspective of an earlier century and shows that from that point of view the phenomenon is transparent. Furthermore, this earlier perspective leads to a relatively simple rigorous proof of Stein’s result, and the perspective can be extended to cover other situations, such as the simultaneous estimation of several Poisson means.

The relationship of this perspective to other earlier work, including the empirical Bayes approach, is also discussed.

[Keywords: admissibility, Empirical Bayes, James-Stein estimation, Poisson distribution⁠, regression, Stein paradox]
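
Stein’s result can be checked by simulation. A sketch of the positive-part James-Stein estimator for k ≥ 3 normal means (unit variance, one observation each), shrinking toward 0; the true means below are hypothetical:

```python
import random

random.seed(1)
theta = [1.0, -0.5, 2.0, 0.0, -1.5]   # k = 5 unknown means
k = len(theta)

def james_stein(x):
    # Positive-part James-Stein: shrink x toward 0 by 1 - (k-2)/||x||^2.
    s = sum(v * v for v in x)
    shrink = max(0.0, 1 - (k - 2) / s)
    return [shrink * v for v in x]

def sq_err(est):
    return sum((e - t) ** 2 for e, t in zip(est, theta))

mle_err = js_err = 0.0
for _ in range(20000):
    x = [random.gauss(t, 1) for t in theta]
    mle_err += sq_err(x)               # the ordinary estimator: x itself
    js_err += sq_err(james_stein(x))

# Averaged over replications, shrinkage beats the ordinary estimator in
# total squared error -- Stein's inadmissibility result in miniature.
assert js_err < mle_err
```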

“The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator”, Norton 1984

1984-norton.pdf: “The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator”⁠, Robert M. Norton (1984-01-01)

“Interpreting Regression toward the Mean in Developmental Research”, Furby 1973

1973-furby.pdf: “Interpreting regression toward the mean in developmental research”⁠, Lita Furby (1973; ; backlinks; similar):

Explicates the fundamental nature of regression toward the mean, which is frequently misunderstood by developmental researchers. While errors of measurement are commonly assumed to be the sole source of regression effects, the latter also are obtained with errorless measures. The conditions under which regression phenomena can appear are first clearly defined. Next, an explanation of regression effects is presented which applies both when variables contain errors of measurement and when they are errorless. The analysis focuses on cause and effect relationships of psychologically meaningful variables. Finally, the implications for interpreting regression effects in developmental research are illustrated with several empirical examples.
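
Furby’s central point, that regression toward the mean needs no measurement error, only imperfect correlation, can be shown in a simulation; the correlation and cutoff below are arbitrary:

```python
import random

random.seed(2)
r = 0.6
pairs = []
for _ in range(50000):
    x = random.gauss(0, 1)
    # y shares variance with x but is not a noisy remeasurement of it:
    # both are "errorless" scores, merely correlated at r.
    y = r * x + random.gauss(0, (1 - r * r) ** 0.5)
    pairs.append((x, y))

# Among individuals far above the mean on x, the mean of y is pulled
# back toward the grand mean of 0, since E[y | x] = r*x.
top = [(x, y) for x, y in pairs if x > 1.5]
mean_x = sum(x for x, _ in top) / len(top)
mean_y = sum(y for _, y in top) / len(top)
assert 0 < mean_y < mean_x
```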

“Nuisance Variables and the Ex Post Facto Design”, Meehl 1970

1970-meehl.pdf: “Nuisance Variables and the Ex Post Facto Design”⁠, Paul E. Meehl (1970-01-01; ; backlinks)

“Control of Spurious Association and the Reliability of the Controlled Variable”, Kahneman 1965

1965-kahneman.pdf: “Control of spurious association and the reliability of the controlled variable”⁠, Daniel Kahneman (1965-01-01; ; backlinks)

“Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers”, Mosteller & Wallace 1963

1963-mosteller.pdf: “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers”⁠, Frederick Mosteller, David L. Wallace (1963-06; similar):

This study has four purposes: to provide a comparison of discrimination methods; to explore the problems presented by techniques based strongly on Bayes’ theorem when they are used in a data analysis of large scale; to solve the authorship question of The Federalist papers; and to propose routine methods for solving other authorship problems.

Word counts are the variables used for discrimination. Since the topic written about heavily influences the rate with which a word is used, care in selection of words is necessary. The filler words of the language such as ‘an’, ‘of’, and ‘upon’, and, more generally, articles, prepositions, and conjunctions provide fairly stable rates, whereas more meaningful words like ‘war’, ‘executive’, and ‘legislature’ do not.

After an investigation of the distribution of these counts, the authors execute an analysis employing the usual discriminant function and an analysis based on Bayesian methods. The conclusions about the authorship problem are that Madison rather than Hamilton wrote all 12 of the disputed papers.

The findings about methods are presented in the closing section on conclusions.

This report, summarizing and abbreviating a forthcoming monograph, gives some of the results but very little of their empirical and theoretical foundation. It treats two of the four main studies presented in the monograph, and none of the side studies.
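
The core of the Mosteller-Wallace Bayesian analysis is accumulating log-odds from function-word counts. A toy version under a simple Poisson model per word (the per-1,000-word rates and “disputed” counts below are hypothetical, not the paper’s data):

```python
import math

# Rate per 1,000 words of each marker word, by author (hypothetical).
rates = {
    "Hamilton": {"upon": 3.0, "whilst": 0.1, "on": 3.3},
    "Madison":  {"upon": 0.2, "whilst": 0.5, "on": 7.8},
}

def log_odds_madison(counts, total_words):
    """Log posterior odds of Madison over Hamilton, equal prior odds,
    modeling each word count as Poisson(rate * length-in-1000s)."""
    n = total_words / 1000
    lo = 0.0
    for word, c in counts.items():
        for author, sign in (("Madison", +1), ("Hamilton", -1)):
            lam = rates[author][word] * n
            # Poisson log-likelihood; the log(c!) term cancels between
            # the two authors, so it is omitted.
            lo += sign * (c * math.log(lam) - lam)
    return lo

# A 2,000-word paper that never uses 'upon' (Hamilton's signature filler
# word) and uses 'on' heavily looks strongly Madisonian.
disputed = {"upon": 0, "whilst": 1, "on": 14}
assert log_odds_madison(disputed, 2000) > 0
```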

“The Argentine Writer and Tradition”, Borges 1951

1951-borges-theargentinewriterandtradition.pdf: “The Argentine Writer and Tradition”⁠, Jorge Luis Borges (1951; ; backlinks; similar):

[Borges considers the problem of whether Argentinian writing on non-Argentinian subjects can still be truly “Argentine.” His conclusion:]

…We should not be alarmed and that we should feel that our patrimony is the universe; we should essay all themes, and we cannot limit ourselves to purely Argentine subjects in order to be Argentine; for either being Argentine is an inescapable act of fate—and in that case we shall be so in all events—or being Argentine is a mere affectation, a mask. I believe that if we surrender ourselves to that voluntary dream which is artistic creation, we shall be Argentine and we shall also be good or tolerable writers.

“Probability and the Weighing of Evidence”, Good 1950-page-96

1950-good-probabilityandtheweighingofevidence.pdf#page=96: “Probability and the Weighing of Evidence”⁠, I. J. Good (1950-01-01; ; backlinks)

“Evaluating the Effect of Inadequately Measured Variables in Partial Correlation Analysis”, Stouffer 1936

1936-stouffer.pdf: “Evaluating the Effect of Inadequately Measured Variables in Partial Correlation Analysis”⁠, Samuel A. Stouffer (1936; ; backlinks; similar):

It is not generally recognized that such an analysis [using regression] assumes that each of the variables is perfectly measured, such that a second measure X’i, of the variable measured by Xi, has a correlation of unity with Xi. If some of the measures are more accurate than others, the analysis is impaired [by measurement error]. For example, the sociologist may have a problem in which an index of economic status and an index of nativity are independent variables. What is the effect, if the index of economic status is much less satisfactory than the index of nativity? Ordinarily, the effect will be to underestimate the [coefficient] of the less adequately measured variable and to overestimate the [coefficient] of the more adequately measured variable.

If either the reliability or validity of an index is in question, at least two measures of the variable are required to permit an evaluation. The purpose of this paper is to provide a logical basis and a simple arithmetical procedure (a) for measuring the effect of the use of 2 indexes, each of one or more variables, in partial and multiple correlation analysis and (b) for estimating the likely effect if 2 indexes, not available, could be secured.
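
Stouffer’s scenario can be simulated: two correlated predictors with equal true coefficients, one measured with error. OLS then underestimates the noisily-measured coefficient and overestimates the cleanly-measured one. All parameter values are hypothetical:

```python
import random

random.seed(3)
n = 50000
rows = []
for _ in range(n):
    t1 = random.gauss(0, 1)                        # true economic status
    t2 = 0.5 * t1 + random.gauss(0, 0.75 ** 0.5)   # true nativity, corr 0.5
    y = 1.0 * t1 + 1.0 * t2 + random.gauss(0, 1)   # equal true coefficients
    x1 = t1 + random.gauss(0, 1)                   # poor index of t1
    x2 = t2                                        # perfect index of t2
    rows.append((x1, x2, y))

# OLS for y ~ x1 + x2 by solving the 2x2 normal equations directly
# (variables are mean-zero by construction, so no intercept needed).
s11 = sum(x1 * x1 for x1, _, _ in rows)
s12 = sum(x1 * x2 for x1, x2, _ in rows)
s22 = sum(x2 * x2 for _, x2, _ in rows)
s1y = sum(x1 * y for x1, _, y in rows)
s2y = sum(x2 * y for _, x2, y in rows)
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# The noisy predictor's coefficient is attenuated below its true value
# of 1, and the clean predictor's is inflated above 1.
assert b1 < 1.0 < b2
```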

“Interpretation of Educational Measurements”, Kelley 1927

1927-kelley-interpretationofeducationalmeasurements.pdf: “Interpretation of Educational Measurements”⁠, Truman Lee Kelley (1927; ; backlinks; similar):

[Historically notable for introducing Kelley’s paradox⁠, another fallacy related to regression to the mean⁠.] Among the outstanding contributions of the book are (1) the judgments of the relative excellence of assorted tests in some 70 fields of accomplishment, by Kelley, Franzen, Freeman, McCall, Otis, Trabue and Van Wagenen; (2) detailed and exact information on the statistical and other characteristics of the same tests, based on a questionnaire addressed to the text authors or (in the absence of reply) estimates by Kelley on the best data available; (3) a chapter of 47 pages condensing all the principal elementary statistical methods. In addition, there is constant emphasis upon the importance of the probable error, with some illustrative applications; for example, it is maintained that about 90% of the abilities measured by our best “intelligence” and “achievement” tests are (due chiefly to the size of the probable errors) the same ability. A chapter sets forth the analytical procedures which lead to this conclusion and to four others earlier enunciated. “Idiosyncrasy”, or inequality among abilities, which the author regards as highly valuable, is considered in two chapters; the remainder of the volume is devoted to a historical sketch of the mental test movement and a statement of the purposes of tests, the latter being illustrated by appropriate chapters.

“Mr Keynes on Probability [review of J. M. Keynes, _A Treatise on Probability: 1921]”, Ramsey 1922

1922-ramsey.pdf: “Mr Keynes on Probability [review of J. M. Keynes, _A Treatise on Probability: 1921]”⁠, Frank P. Ramsey (1922-01-01; ; backlinks)

“Philosophical Essay on Probabilities, Chapter 11: Concerning the Probabilities of Testimonies”, Laplace 1814

1814-laplace-philosophicalessayonprobabilities-ch5probabilitiestestimonies.pdf: “Philosophical Essay on Probabilities, Chapter 11: Concerning the Probabilities of Testimonies”⁠, Pierre-Simon Laplace (1814; backlinks; similar):

The majority of our opinions being founded on the probability of proofs it is indeed important to submit it to calculus. Things it is true often become impossible by the difficulty of appreciating the veracity of witnesses and by the great number of circumstances which accompany the deeds they attest; but one is able in several cases to resolve the problems which have much analogy with the questions which are proposed and whose solutions may be regarded as suitable approximations to guide and to defend us against the errors and the dangers of false reasoning to which we are exposed. An approximation of this kind, when it is well made, is always preferable to the most specious reasonings.

We would give no credence to the testimony of a man who should attest to us that in throwing a hundred dice into the air they had all fallen on the same face. If we had ourselves been spectators of this event we should believe our own eyes only after having carefully examined all the circumstances, and after having brought in the testimonies of other eyes in order to be quite sure that there had been neither hallucination nor deception. But after this examination we should not hesitate to admit it in spite of its extreme improbability; and no one would be tempted, in order to explain it, to recur to a denial of the laws of vision. We ought to conclude from it that the probability of the constancy of the laws of nature is for us greater than this, that the event in question has not taken place at all, a probability greater than that of the majority of historical facts which we regard as incontestable. One may judge by this the immense weight of testimonies necessary to admit a suspension of natural laws, and how improper it would be to apply to this case the ordinary rules of criticism. All those who without offering this immensity of testimonies support this when making recitals of events contrary to those laws, decrease rather than augment the belief which they wish to inspire; for then those recitals render very probable the error or the falsehood of their authors. But that which diminishes the belief of educated men increases often that of the uneducated, always greedy for the wonderful.

The action of time enfeebles then, without ceasing, the probability of historical facts just as it changes the most durable monuments. One can indeed diminish it by multiplying and conserving the testimonies and the monuments which support them. Printing offers for this purpose a great means, unfortunately unknown to the ancients. In spite of the infinite advantages which it procures the physical and moral revolutions by which the surface of this globe will always be agitated will end, in conjunction with the inevitable effect of time, by rendering doubtful after thousands of years the historical facts regarded to-day as the most certain.
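
A worked instance of the kind of calculation Laplace urges, in a deliberately simple two-outcome model (a witness who reports correctly with probability p attests to an event of prior probability q; Laplace’s own treatment is more refined). The numbers are illustrative:

```python
def posterior_given_testimony(p, q):
    # Bayes' theorem: P(event | testimony) = pq / (pq + (1-p)(1-q)).
    return (p * q) / (p * q + (1 - p) * (1 - q))

# A mundane 50-50 event: a 90%-reliable witness makes it near-certain.
assert posterior_given_testimony(0.9, 0.5) > 0.89

# A hundred dice all landing the same face (q = 6**-99, given the first
# die): the same witness leaves it essentially incredible -- "the
# probability of the constancy of the laws of nature is for us greater"
# than the credibility of the report.
assert posterior_given_testimony(0.9, 6.0 ** -99) < 1e-75
```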

“Brms: an R Package for Bayesian Generalized Multivariate Non-linear Multilevel Models Using Stan”, Bürkner 2022

“brms: an R package for Bayesian generalized multivariate non-linear multilevel models using Stan”⁠, Paul Bürkner (2022; backlinks; similar):

The brms package provides an interface to fit Bayesian generalized (non-)linear multivariate multilevel models using Stan⁠, which is a C++ package for performing full Bayesian inference. The formula syntax is very similar to that of the package lme4 to provide a familiar and simple interface for performing regression analyses.

A wide range of response distributions are supported, allowing users to fit—among others—linear, robust linear, count data, survival, response times, ordinal, zero-inflated, and even self-defined mixture models all in a multilevel context. Further modeling options include non-linear and smooth terms, auto-correlation structures, censored data, missing value imputation⁠, and quite a few more. In addition, all parameters of the response distribution can be predicted in order to perform distributional regression. Multivariate models (ie. models with multiple response variables) can be fit, as well.

Prior specifications are flexible and explicitly encourage users to apply prior distributions that actually reflect their beliefs.

Model fit can easily be assessed and compared with posterior predictive checks, cross-validation, and Bayes factors.

Thompson sampling


Proving too much


Particle filter


Monte Carlo tree search


Gaussian process