
Links
 “Fast and Accurate Bayesian Polygenic Risk Modeling With Variational Inference”, Zabad et al 2022
 “The InterModel Vigorish (IMV): A Flexible and Portable Approach for Quantifying Predictive Accuracy With Binary Outcomes”, Domingue et al 2022
 “The Science of Visual Data Communication: What Works”, Franconeri et al 2021
 “How to Learn and Represent Abstractions: An Investigation Using Symbolic Alchemy”, AlKhamissi et al 2021
 “An Experimental Design Perspective on Model-Based Reinforcement Learning”, Mehta et al 2021
 “Prior Knowledge Elicitation: The Past, Present, and Future”, Mikkola et al 2021
 “Improving GWAS Discovery and Genomic Prediction Accuracy in Biobank Data”, Orliac et al 2021
 “An Explanation of In-context Learning As Implicit Bayesian Inference”, Xie et al 2021
 “Unifying Individual Differences in Personality, Predictability and Plasticity: A Practical Guide”, O’Dea et al 2021
 “Transformers Can Do Bayesian Inference”, Müller et al 2021
 “Why Generalization in RL Is Difficult: Epistemic POMDPs and Implicit Partial Observability”, Ghosh et al 2021
 “The Bayesian Learning Rule”, Khan & Rue 2021
 “No Need to Choose: Robust Bayesian Meta-Analysis With Competing Publication Bias Adjustment Methods”, Bartoš et al 2021
 “Maternal Judgments of Child Numeracy and Reading Ability Predict Gains in Academic Achievement and Interest”, Parker et al 2021
 “Genetic Sensitivity Analysis: Adjusting for Genetic Confounding in Epidemiological Associations”, Pingault et al 2021
 “What Are Bayesian Neural Network Posteriors Really Like?”, Izmailov et al 2021
 “Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the BlackBox Optimization Challenge 2020”, Turner et al 2021
 “Maximal Positive Controls: A Method for Estimating the Largest Plausible Effect Size”, Hilgard 2021
 “Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021
 “Hot under the Collar: A Latent Measure of Interstate Hostility”, Terechshenko 2020
 “What Matters More for Entrepreneurship Success? A Meta-analysis Comparing General Mental Ability and Emotional Intelligence in Entrepreneurial Settings”, Allen et al 2020
 “From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning”, Wojtowicz & DeDeo 2020
 “Meta-trained Agents Implement Bayes-optimal Agents”, Mikulik et al 2020
 “Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020
 “Searching for the Backfire Effect: Measurement and Design Considerations”, Swire-Thompson et al 2020
 “Is SGD a Bayesian Sampler? Well, Almost”, Mingard et al 2020
 “Laplace’s Theories of Cognitive Illusions, Heuristics and Biases”, Miller & Gelman 2020
 “Exploring Bayesian Optimization: Breaking Bayesian Optimization into Small, Sizeable Chunks”, Agnihotri & Batra 2020
 “The Social and Genetic Inheritance of Educational Attainment: Genes, Parental Education, and Educational Expansion”, Lin 2020
 “The Most ‘Abandoned’ Books on GoodReads”, Branwen 2019
 “The Propensity for Aggressive Behavior and Lifetime Incarceration Risk: A Test for Gene-environment Interaction (G × E) Using Whole-genome Data”, Barnes et al 2019
 “Bayesian Parameter Estimation Using Conditional Variational Autoencoders for Gravitational-wave Astronomy”, Gabbard et al 2019
 “New Paradigms in the Psychology of Reasoning”, Oaksford & Chater 2019
 “Estimating Distributional Models With brms: Additive Distributional Models”, Bürkner 2019
 “Allocation to Groups: Examples of Lord's Paradox”, Wright 2019
 “Evolutionary Implementation of Bayesian Computations”, Czégel et al 2019
 “How Should We Critique Research?”, Branwen 2019
 “Reinforcement Learning, Fast and Slow”, Botvinick et al 2019
 “Meta Reinforcement Learning As Task Inference”, Humplik et al 2019
 “Structural Equation Models As Computation Graphs”, Kesteren & Oberski 2019
 “Meta-learning of Sequential Strategies”, Ortega et al 2019
 “Fermi Calculation Examples”, Branwen 2019
 “Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019
 “Bayesian Statistics in Sociology: Past, Present, and Future”, Lynch & Bartlett 2019
 “Approximate Bayesian Computation”, Beaumont 2019
 “Accounting Theory As a Bayesian Discipline”, Johnstone 2018
 “The Bayesian Superorganism III: Externalized Memories Facilitate Distributed Sampling”, Hunt et al 2018
 “Evolution As Backstop for Reinforcement Learning”, Branwen 2018
 “SMPY Bibliography”, Branwen 2018
 “Deep Learning Generalizes Because the Parameter-function Map Is Biased towards Simple Functions”, Valle-Pérez et al 2018
 “On Having Enough Socks”, Branwen 2017
 “Implicit Causal Models for Genome-wide Association Studies”, Tran & Blei 2017
 “A Rational Choice Framework for Collective Behavior”, Krafft 2017
 “Statistical Correction of the Winner’s Curse Explains Replication Variability in Quantitative Trait Genome-wide Association Studies”, Palmer & Pe’er 2017
 “A Tutorial on Thompson Sampling”, Russo et al 2017
 “Black-Box Data-efficient Policy Search for Robotics”, Chatzilygeroudis et al 2017
 “ZMA Sleep Experiment”, Branwen 2017
 “Self-Blinded Mineral Water Taste Test”, Branwen 2017
 “The Kelly Coin-Flipping Game: Exact Solutions”, Branwen et al 2017
 “Banner Ads Considered Harmful”, Branwen 2017
 “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles”, Lakshminarayanan et al 2016
 “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016
 “Why Tool AIs Want to Be Agent AIs”, Branwen 2016
 “‘Genius Revisited’ Revisited”, Branwen 2016
 “Candy Japan’s New Box A/B Test”, Branwen 2016
 “Calculating The Gaussian Expected Maximum”, Branwen 2016
 “Top 10 Replicated Findings From Behavioral Genetics”, Plomin et al 2016 (page 10)
 “World Catnip Surveys”, Branwen 2015
 “Catnip Immunity and Alternatives”, Branwen 2015
 “Bitter Melon for Blood Glucose”, Branwen 2015
 “Resorting Media Ratings”, Branwen 2015
 “When Should I Check The Mail?”, Branwen 2015
 “Gaussian Processes for Data-Efficient Learning in Robotics and Control”, Deisenroth et al 2015
 “Predictive Distributions for Between-study Heterogeneity and Simple Methods for Their Application in Bayesian Meta-analysis”, Turner et al 2014
 “Thompson Sampling With the Online Bootstrap”, Eckles & Kaptein 2014
 “Everything Is Correlated”, Branwen 2014
 “Statistical Notes”, Branwen 2014
 “Why Correlation Usually ≠ Causation”, Branwen 2014
 “Bacopa Quasi-Experiment”, Branwen 2014
 “Bayesian Model Selection: The Steepest Mountain to Climb”, Tenan et al 2014
 “(More) Efficient Reinforcement Learning via Posterior Sampling”, Osband et al 2013
 “Magnesium Self-Experiments”, Branwen 2013
 “Caffeine Wakeup Experiment”, Branwen 2013
 “Bayesian Estimation Supersedes the t Test”, Kruschke 2013
 “Potassium Sleep Experiments”, Branwen 2012
 “2012 Election Predictions”, Branwen 2012
 “Biased Information As Antiinformation”, Branwen 2012
 “A/B Testing Long-form Readability on Gwern.net”, Branwen 2012
 “Dual N-Back Meta-Analysis”, Branwen 2012
 “One Man’s Modus Ponens”, Branwen 2012
 “Learning Is Planning: Near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search”, Asmuth & Littman 2012
 “Learning Performance of Prediction Markets With Kelly Bettors”, Beygelzimer et al 2012
 “Vitamin D Sleep Experiments”, Branwen 2012
 “Silk Road 1: Theory & Practice”, Branwen 2011
 “PILCO: A ModelBased and DataEfficient Approach to Policy Search”, Deisenroth & Rasmussen 2011
 “Death Note: L, Anonymity & Eluding Entropy”, Branwen 2011
 “Tea Reviews”, Branwen 2011
 “Gwern.net Website Traffic”, Branwen 2011
 “Zeo Sleep Self-experiments”, Branwen 2010
 “The Replication Crisis: Flaws in Mainstream Science”, Branwen 2010
 “About This Website”, Branwen 2010
 “Bayesian Data Analysis”, Kruschke 2010
 “Nootropics”, Branwen 2010
 “Predicting the Next Big Thing: Success As a Signal of Poor Judgment”, Denrell & Fang 2010
 “Who Wrote The ‘Death Note’ Script?”, Branwen 2009
 “Miscellaneous”, Branwen 2009
 “When Superstars Flop: Public Status and Choking Under Pressure in International Soccer Penalty Shootouts”, Jordet 2009
 “Dual N-Back FAQ”, Branwen 2009
 “Modafinil”, Branwen 2009
 “Models for Potentially Biased Evidence in Meta-analysis Using Empirically Based Priors”, Welton et al 2008
 “Optimal Approximation of Signal Priors”, Hyvarinen 2008
 “Verbal Probability Expressions In National Intelligence Estimates: A Comprehensive Analysis Of Trends From The Fifties Through Post-9/11”, Kesselman 2008
 “The Allure of Equality: Uniformity in Probabilistic and Statistical Judgment”, Falk & Lann 2008
 “Experiments on Partisanship and Public Opinion: Party Cues, False Beliefs, and Bayesian Updating”, Bullock 2007
 “A Free Energy Principle for the Brain”, Friston et al 2006
 “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, Smith & Winkler 2006
 “Three Statistical Paradoxes in the Interpretation of Group Differences: Illustrated With Medical School Admission and Licensing Data”, Wainer & Brown 2006
 “Estimation of Non-Normalized Statistical Models by Score Matching”, Hyvarinen 2005
 “The Bayesian Brain: the Role of Uncertainty in Neural Coding and Computation”, Knill & Pouget 2004
 “Methods of Meta-Analysis: Correcting Error and Bias in Research Findings”, Hunter & Schmidt 2004
 “Bayesian Informal Logic and Fallacy”, Korb 2004
 “Two Statistical Paradoxes in the Interpretation of Group Differences: Illustrated With Medical School Admission and Licensing Data”, Wainer & Brown 2004
 “Bayesian Computation: a Statistical Revolution”, Brooks 2003
 “A Bayesian Framework for Reinforcement Learning”, Strens 2000
 “Kelley’s Paradox”, Wainer 2000
 “Statistical Issues in the Analysis of Data Gathered in the New Designs”, Kadane & Seidenfeld 1996
 “Is There Sufficient Historical Evidence to Establish the Resurrection of Jesus?”, Cavin 1995
 “The Relevance of Group Membership for Personnel Selection: A Demonstration Using Bayes' Theorem”, Miller 1994
 “Subjective Probability”, Wright & Ayton 1994
 “The Influence of Prior Beliefs on Scientific Judgments of Evidence Quality”, Koehler 1993
 “Smoking As 'independent' Risk Factor for Suicide: Illustration of an Artifact from Observational Epidemiology?”, Smith et al 1992
 “Bias in Relative Odds Estimation owing to Imprecise Measurement of Correlated Exposures”, Phillips & Smith 1992
 “How Independent Are 'independent' Effects? Relative Risk Estimation When Correlated Exposures Are Measured Imprecisely”, Phillips & Smith 1991
 “Bayes-Hermite Quadrature”, O’Hagan 1991
 “The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators”, Stigler 1990
 “The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator”, Norton 1984
 “Interpreting Regression toward the Mean in Developmental Research”, Furby 1973
 “Nuisance Variables and the Ex Post Facto Design”, Meehl 1970
 “Control of Spurious Association and the Reliability of the Controlled Variable”, Kahneman 1965
 “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers”, Mosteller & Wallace 1963
 “The Argentine Writer and Tradition”, Borges 1951
 “Probability and the Weighing of Evidence”, Good 1950 (page 96)
 “Evaluating the Effect of Inadequately Measured Variables in Partial Correlation Analysis”, Stouffer 1936
 “Interpretation of Educational Measurements”, Kelley 1927
 “Mr Keynes on Probability [Review of J. M. Keynes, _A Treatise on Probability_, 1921]”, Ramsey 1922
 “Philosophical Essay on Probabilities, Chapter 11: Concerning the Probabilities of Testimonies”, Laplace 1814
 “brms: An R Package for Bayesian Generalized Multivariate Nonlinear Multilevel Models Using Stan”, Bürkner 2022
 Thompson sampling
 Proving too much
 Particle filter
 Monte Carlo tree search
 Gaussian process
 Miscellaneous
Links
“Fast and Accurate Bayesian Polygenic Risk Modeling With Variational Inference”, Zabad et al 2022
“Fast and Accurate Bayesian Polygenic Risk Modeling with Variational Inference”, (2022-05-11):
The recent proliferation of large-scale genome-wide association studies (GWASs) has motivated the development of statistical methods for phenotype prediction using single nucleotide polymorphism (SNP) array data. These polygenic risk score (PRS) methods formulate the task of polygenic prediction in terms of a multiple linear regression framework, where the goal is to infer the joint effect sizes of all genetic variants on the trait. Among the subset of PRS methods that operate on GWAS summary statistics, sparse Bayesian methods have shown competitive predictive ability. However, existing Bayesian approaches employ Markov Chain Monte Carlo (MCMC) algorithms for posterior inference, which are computationally inefficient and do not scale favorably with the number of SNPs included in the analysis. Here, we introduce Variational Inference of Polygenic Risk Scores (VIPRS), a Bayesian summary statistics-based PRS method that utilizes Variational Inference (VI) techniques to efficiently approximate the posterior distribution for the effect sizes. Our experiments with genome-wide simulations and real phenotypes from the UK Biobank (UKB) dataset demonstrated that variational approximations to the posterior are competitively accurate and highly efficient. When compared to state-of-the-art PRS methods, VIPRS consistently achieves the best or second-best predictive accuracy in our analyses of 18 simulation configurations as well as 12 real phenotypes measured among the UKB participants of “White British” background. This performance advantage was higher among individuals from other ethnic groups, with an increase in R-squared of up to 1.7× among participants of Nigerian ancestry for Low-Density Lipoprotein (LDL) cholesterol. Furthermore, given its computational efficiency, we applied VIPRS to a dataset of up to 10 million genetic markers, an order of magnitude greater than the standard HapMap3 subset used to train existing PRS methods. Modeling this expanded set of variants conferred modest improvements in prediction accuracy for a number of highly polygenic traits, such as standing height.
“The InterModel Vigorish (IMV): A Flexible and Portable Approach for Quantifying Predictive Accuracy With Binary Outcomes”, Domingue et al 2022
“The InterModel Vigorish (IMV): A flexible and portable approach for quantifying predictive accuracy with binary outcomes”, (2022-01-12; similar):
[Twitter; app] Understanding the “fit” of models designed to predict binary outcomes has been a longstanding problem.
We propose a flexible, portable, and intuitive metric for quantifying the change in accuracy between 2 predictive systems in the case of a binary outcome, the InterModel Vigorish (IMV). The IMV is based on an analogy to well-characterized physical systems with tractable probabilities: weighted coins. The IMV is always a statement about the change in fit relative to some baseline—which can be as simple as the prevalence—whereas other metrics are standalone measures that need to be further manipulated to yield indices related to differences in fit across models. Moreover, the IMV is consistently interpretable independent of baseline prevalence.
We illustrate the flexible properties of this metric in numerous simulations and showcase its flexibility across examples spanning the social, biomedical, and physical sciences.
[Keywords: binary outcomes, fit index, logistic regression, prediction, Kelly criterion, entropy, coherence]
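The weighted-coin analogy can be made concrete: each model's mean log-likelihood is converted into the weight of an "equivalent coin" with the same entropy, and the IMV is the relative gain in that weight between baseline and enhanced model. A minimal sketch of that idea (the bisection routine and function names here are mine, not the authors'):

```python
import math

def entropy_bits(w: float) -> float:
    """Shannon entropy (in bits) of a coin with heads-probability w."""
    if w in (0.0, 1.0):
        return 0.0
    return -(w * math.log2(w) + (1 - w) * math.log2(1 - w))

def equivalent_coin(mean_log2_lik: float) -> float:
    """Weight w in [0.5, 1] whose entropy equals the model's mean negative
    log2-likelihood, found by bisection (entropy is decreasing on [0.5, 1])."""
    target = -mean_log2_lik
    lo, hi = 0.5, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if entropy_bits(mid) > target:
            lo = mid  # entropy still too high -> move toward 1
        else:
            hi = mid
    return (lo + hi) / 2

def imv(mean_ll_baseline: float, mean_ll_enhanced: float) -> float:
    """InterModel Vigorish: relative gain in equivalent-coin weight when
    moving from the baseline model to the enhanced model."""
    w0 = equivalent_coin(mean_ll_baseline)
    w1 = equivalent_coin(mean_ll_enhanced)
    return (w1 - w0) / w0

# A fair coin (mean log2-likelihood = -1 bit) upgraded to a calibrated
# 0.8 coin gives IMV = (0.8 - 0.5) / 0.5 = 0.6.
print(imv(-1.0, 0.8 * math.log2(0.8) + 0.2 * math.log2(0.2)))  # ≈ 0.6
```

Because the metric is a ratio of coin weights rather than a raw likelihood, it stays interpretable regardless of the baseline prevalence, which is the portability property the abstract emphasizes.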
“The Science of Visual Data Communication: What Works”, Franconeri et al 2021
2021franconeri.pdf
: “The Science of Visual Data Communication: What Works”, (2021-12-15; backlinks; similar):
Effectively designed data visualizations allow viewers to use their powerful visual systems to understand patterns in data across science, education, health, and public policy. But ineffectively designed visualizations can cause confusion, misunderstanding, or even distrust—especially among viewers with low graphical literacy.
We review research-backed guidelines for creating effective and intuitive visualizations oriented toward communicating data to students, coworkers, and the general public. We describe how the visual system can quickly extract broad statistics from a display, whereas poorly designed displays can lead to misperceptions and illusions. Extracting global statistics is fast, but comparing between subsets of values is slow. Effective graphics avoid taxing working memory, guide attention, and respect familiar conventions.
Data visualizations can play a critical role in teaching and communication, provided that designers tailor those visualizations to their audience.
“How to Learn and Represent Abstractions: An Investigation Using Symbolic Alchemy”, AlKhamissi et al 2021
“How to Learn and Represent Abstractions: An Investigation using Symbolic Alchemy”, (2021-12-14; similar):
Alchemy is a new meta-learning environment rich enough to contain interesting abstractions, yet simple enough to make fine-grained analysis tractable. Further, Alchemy provides an optional symbolic interface that enables meta-RL research without a large compute budget. In this work, we take the first steps toward using Symbolic Alchemy to identify design choices that enable deep-RL agents to learn various types of abstraction. Then, using a variety of behavioral and introspective analyses, we investigate how our trained agents use and represent abstract task variables, and find intriguing connections to the neuroscience of abstraction. We conclude by discussing the next steps for using meta-RL and Alchemy to better understand the representation of abstract variables in the brain.
“An Experimental Design Perspective on ModelBased Reinforcement Learning”, Mehta et al 2021
“An Experimental Design Perspective on Model-Based Reinforcement Learning”, (2021-12-09):
[blog] In many practical applications of RL, it is expensive to observe state transitions from the environment. For example, in the problem of plasma control for nuclear fusion, computing the next state for a given state-action pair requires querying an expensive transition function which can lead to many hours of computer simulation or dollars of scientific research. Such expensive data collection prohibits application of standard RL algorithms, which usually require a large number of observations to learn.
In this work, we address the problem of efficiently learning a policy while making a minimal number of state-action queries to the transition function. In particular, we leverage ideas from Bayesian optimal experimental design to guide the selection of state-action queries for efficient learning. We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process. At each iteration, our algorithm maximizes this acquisition function, to choose the most informative state-action pair to be queried, thus yielding a data-efficient RL approach.
We experiment with a variety of simulated continuous control problems and show that our approach learns an optimal policy with up to 5–1,000× less data than model-based RL baselines and 10^{3}–10^{5}× less data than model-free RL baselines. We also provide several ablated comparisons which point to substantial improvements arising from the principled method of obtaining data.
“Prior Knowledge Elicitation: The Past, Present, and Future”, Mikkola et al 2021
“Prior knowledge elicitation: The past, present, and future”, (2021-12-01; similar):
Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem, in principle. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.
Why are we not widely using prior elicitation? We analyze the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing understudied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.
“Improving GWAS Discovery and Genomic Prediction Accuracy in Biobank Data”, Orliac et al 2021
“Improving GWAS discovery and genomic prediction accuracy in Biobank data”, (2021-11-08; similar):
Genetically informed and deep-phenotyped biobanks are an important research resource. The cost of phenotyping far outstrips that of genotyping, and therefore it is imperative that the most powerful, versatile and efficient analysis approaches are used. Here, we apply our recently developed Bayesian grouped mixture of regressions model (GMRM) in the UK and Estonian Biobanks and obtain the highest genomic prediction accuracy reported to date across 21 heritable traits. On average, GMRM accuracies were 15% (SE 7%) greater than prediction models run in the LDAK software with SNP annotation marker groups, 18% (SE 3%) greater than a baseline BayesR model without SNP markers grouped into MAF-LD-annotation categories, and 106% (SE 9%) greater than polygenic risk scores calculated from mixed-linear model association (MLMA) estimates. For height, the prediction accuracy was 47% in a UK Biobank holdout sample, which was 76% of the estimated SNP-heritability. We then extend our GMRM prediction model to provide MLMA SNP marker estimates for GWAS discovery, which increased the independent loci detected to 7,910 in unrelated UK Biobank individuals, as compared to 5,521 from BOLT-LMM and 5,727 from Regenie, a 43% and 38% increase respectively. The average χ² value of the leading markers was 34% (SE 5.11) higher for GMRM as compared to Regenie, and increased by 17% for every 1% increase in prediction accuracy gained over a baseline BayesR model across the traits. Thus, we show that modelling genetic associations accounting for MAF and LD differences among SNP markers, and incorporating prior knowledge of genomic function, is important for both genomic prediction and for discovery in large-scale, individual-level biobank studies.
“An Explanation of Incontext Learning As Implicit Bayesian Inference”, Xie et al 2021
“An Explanation of In-context Learning as Implicit Bayesian Inference”, (2021-11-03; backlinks; similar):
Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass without parameter updates on “out-of-distribution” prompts. Thus, it is unclear what mechanism enables in-context learning.
In this paper, we study the role of the pretraining distribution on the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. Here, language model pretraining requires inferring a latent document-level concept from the conditioning text to generate coherent next tokens. At test time, this mechanism enables in-context learning by inferring the shared latent concept between prompt examples and applying it to make a prediction on the test example.
Concretely, we prove that in-context learning occurs implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs. This can occur despite the distribution mismatch between prompts and pretraining data. In contrast to messy large-scale pretraining datasets for in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) where Transformer and LSTM language models both exhibit in-context learning.
Beyond the theory which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
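The paper's proposed mechanism—the model implicitly infers a shared latent concept from the prompt examples and predicts under the posterior—can be illustrated by making the Bayesian inference explicit in a toy setting. The two "concepts" and their emission probabilities below are invented for illustration, not taken from the paper:

```python
# Toy "concepts": each maps an input bit x to P(y=1 | x). The prompt's
# examples make the posterior concentrate on whichever concept generated them.
concepts = {
    "copy":   lambda x: 0.9 if x == 1 else 0.1,  # y tends to equal x
    "negate": lambda x: 0.1 if x == 1 else 0.9,  # y tends to flip x
}
prior = {"copy": 0.5, "negate": 0.5}

def posterior(examples):
    """P(concept | prompt examples) via Bayes' rule."""
    post = {}
    for name, f in concepts.items():
        lik = 1.0
        for x, y in examples:
            p1 = f(x)
            lik *= p1 if y == 1 else 1 - p1
        post[name] = prior[name] * lik
    z = sum(post.values())
    return {k: v / z for k, v in post.items()}

def predict(examples, x_query):
    """Posterior-predictive P(y=1 | x_query, examples): average the
    concepts' predictions, weighted by the posterior over concepts."""
    post = posterior(examples)
    return sum(post[k] * concepts[k](x_query) for k in concepts)

# Three examples consistent with "negate" push its posterior near 1,
# so the prediction for x=1 drops toward P(y=1 | x=1, negate) = 0.1.
prompt = [(1, 0), (0, 1), (1, 0)]
```

The claim of the paper is that a language model trained on a mixture of such generators learns to perform this marginalization implicitly in its forward pass, without ever representing the posterior explicitly.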
“Unifying Individual Differences in Personality, Predictability and Plasticity: A Practical Guide”, O’Dea et al 2021
2021odea.pdf
: “Unifying individual differences in personality, predictability and plasticity: A practical guide”, (2021-11-01; similar):
 Organisms use labile traits to respond to different conditions over short timescales. When a population experiences the same conditions, we might expect all individuals to adjust their trait expression to the same, optimal, value, thereby minimizing phenotypic variation. Instead, variation abounds. Individuals substantially differ not only from each other, but also from their former selves, with the expression of labile traits varying both predictably and unpredictably over time.
 A powerful tool for studying the evolution of phenotypic variation in labile traits is the mixed model. Here, we review how mixed models are used to quantify individual differences in both means and variability, and their between-individual correlations. Individuals can differ in their average phenotypes (e.g. behavioural personalities), their variability (known as ‘predictability’ or intra-individual variability), and their plastic response to different contexts.
 We provide detailed descriptions and resources for simultaneously modelling individual differences in averages, plasticity and predictability. Empiricists can use these methods to quantify how traits covary across individuals and test theoretical ideas about phenotypic integration. These methods can be extended to incorporate plastic changes in predictability (termed ‘stochastic malleability’).
 Overall, we showcase the unfulfilled potential of existing statistical tools to test more holistic and nuanced questions about the evolution, function, and maintenance of phenotypic variation, for any trait that is repeatedly expressed.
[Keywords: brms, coefficient of variation, DHGLM, Double Hierarchical, location-scale regression, multivariate, repeatability, rstan]
…Conclusions And Future Directions: Incorporating predictability into studies of personality and plasticity creates an opportunity to test more nuanced questions about how phenotypic variation is maintained, or constrained. For some traits, it might be adaptive to be unpredictable, such as in predator-prey interactions (Briffa 2013). For other traits, selection might act to minimise maladaptive imprecision around an optimal mean (Hansen et al 2006). The supplementary worked example and open code (O’Dea et al 2021) shows between-individual correlations in predictability across multiple behavioural traits, and some correlations of predictability with personality and plasticity. If driven by biological integration and not measurement errors or statistical artefacts, these correlations could hint at genetic integration too; other studies have found additive genetic variance in predictability (Martin et al 2017; Prentice et al 2020). Given that different traits might have different optimal levels of unpredictability, integration of predictability could constrain variation in one trait (resulting in lower than optimal variability) and maintain variation in another (resulting in greater than optimal variability). Because of associations with personality and plasticity, variation in predictability—the lowest level of the phenotypic hierarchy—could have cascading effects upwards (Westneat et al 2015). Empirical estimates of the strength of these associations can inform theoretical models on the simultaneous evolution of means and variances.
“Transformers Can Do Bayesian Inference”, Müller et al 2021
“Transformers Can Do Bayesian Inference”, (2021-10-05; similar):
Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels, and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200× speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released under [ANONYMIZED; please see the supplementary material].
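The only ingredient a PFN needs is a sampler of synthetic supervised tasks from the prior: each training step draws a task, samples a data set from it, and masks one label for the network to predict. A schematic of that data-generating loop (the linear-function prior here is a stand-in for illustration, not one of the paper's actual priors, and the transformer itself is omitted):

```python
import random

def sample_task():
    """Draw one regression function from a simple prior:
    y = a*x + b + noise. (Assumed toy prior over supervised tasks.)"""
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    return lambda x: a * x + b + random.gauss(0, 0.1)

def sample_training_instance(n_points=10):
    """One PFN training example: a context set of (x, y) pairs plus one
    held-out (query, target) pair whose label the network must predict
    from the set-valued context input."""
    f = sample_task()
    xs = [random.uniform(-1, 1) for _ in range(n_points)]
    ys = [f(x) for x in xs]
    hold = random.randrange(n_points)
    context = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i != hold]
    return context, xs[hold], ys[hold]

# Training would loop over such instances, feeding `context` and the query
# x to the network and scoring its probabilistic prediction of the target.
context, x_query, y_target = sample_training_instance()
```

Because the network is trained across many tasks drawn from the prior, its single forward pass ends up approximating the posterior predictive for a new task, which is what lets it mimic Gaussian-process inference at test time.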
“Why Generalization in RL Is Difficult: Epistemic POMDPs and Implicit Partial Observability”, Ghosh et al 2021
“Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability”, (2021-07-13; similar):
Generalization is a central challenge for the deployment of reinforcement learning (RL) systems in the real world.
In this paper, we show that the sequential structure of the RL problem necessitates new approaches to generalization beyond the well-studied techniques used in supervised learning. While supervised learning methods can generalize effectively without explicitly accounting for epistemic uncertainty, we show that, perhaps surprisingly, this is not the case in RL.
We show that generalization to unseen test conditions from a limited number of training conditions induces implicit partial observability, effectively turning even fully-observed MDPs into POMDPs. Informed by this observation, we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP. We demonstrate the failure modes of algorithms that do not appropriately handle this partial observability, and suggest a simple ensemble-based technique for solving the partially observed problem.
Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves substantial gains in generalization over current methods on the Procgen benchmark suite.
“The Bayesian Learning Rule”, Khan & Rue 2021
“The Bayesian Learning Rule”, (2021-07-09; similar):
We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms, and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
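One concrete instance of this kind of unification: ridge regression is exactly the posterior mean (and MAP estimate) of a Bayesian linear model with a Gaussian prior. A quick numerical check on synthetic data (the identification of penalty with prior precision assumes unit noise variance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

lam = 1.0  # ridge penalty == Gaussian prior precision (unit noise variance)

# Ridge solution: argmin_w ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The same numbers read as Bayes: posterior mean of w under the prior
# N(0, I/lam) and a Gaussian likelihood with unit noise variance.
posterior_cov = np.linalg.inv(X.T @ X + lam * np.eye(3))
w_map = posterior_cov @ X.T @ y

assert np.allclose(w_ridge, w_map)
```

The review's point is that this pattern generalizes: choosing different candidate posterior families and different natural-gradient approximations recovers Newton's method, Kalman filtering, SGD, RMSprop, and so on from the same update rule.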
“No Need to Choose: Robust Bayesian Meta-Analysis With Competing Publication Bias Adjustment Methods”, Bartoš et al 2021
“No Need to Choose: Robust Bayesian Meta-Analysis with Competing Publication Bias Adjustment Methods”, (2021-06-17; ; similar):
Publication bias is a ubiquitous threat to the validity of meta-analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods’ performance to depend on the true data-generating process—no method consistently outperforms the others across a wide range of conditions.
To avoid the condition-dependent, all-or-none choice between competing methods, we extend robust Bayesian meta-analysis and model-average across 2 prominent approaches to adjusting for publication bias: (1) selection models of p-values, and (2) models of the relationship between effect sizes and their standard errors. The resulting estimator weights the models by the support they receive from the existing research record.
Applications, simulations, and comparisons to preregistered, multi-lab replications demonstrate the benefits of Bayesian model-averaging of competing publication bias adjustment methods.
[Keywords: Bayesian model-averaging, meta-analysis, PET-PEESE, publication bias, selection models]
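The model-averaging step can be sketched concretely: each publication-bias model receives a weight proportional to its marginal likelihood times its prior probability, and the pooled estimate is the weight-averaged estimate across models. A minimal sketch (the log marginal likelihood numbers are hypothetical, not from the paper):

```python
import math

def posterior_model_probs(log_marginal_likelihoods, prior_probs=None):
    """Posterior model probabilities for Bayesian model-averaging.

    P(M_k | data) is proportional to P(data | M_k) * P(M_k);
    computed in log-space for numerical stability.
    """
    n = len(log_marginal_likelihoods)
    priors = prior_probs or [1.0 / n] * n
    log_post = [lml + math.log(p) for lml, p in zip(log_marginal_likelihoods, priors)]
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical log marginal likelihoods for, say, a selection model,
# a PET-PEESE-style regression model, and a no-bias model:
weights = posterior_model_probs([-10.2, -11.0, -13.5])
# The model-averaged effect estimate is then sum(w_k * estimate_k) over models,
# so better-supported bias adjustments dominate without an all-or-none choice.
```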
“Maternal Judgments of Child Numeracy and Reading Ability Predict Gains in Academic Achievement and Interest”, Parker et al 2021
2021parker.pdf
: “Maternal Judgments of Child Numeracy and Reading Ability Predict Gains in Academic Achievement and Interest”, (2021-05-15; ; backlinks; similar):
[Example of regression-to-the-mean fallacies: parents know much more about their children than highly unreliable early childhood exam scores, and their “overestimates” predict later performance (particularly for immigrant parents about second-language proficiency). Of course. How could it be otherwise? (Not to mention that we already know the ‘Pygmalion effect’ isn’t real, so the claimed causal explanation of their correlates has already been ruled out.)]
In a representative longitudinal sample of 2,602 Australian children (52% boys; 2% Indigenous; 13% language-other-than-English background; 22% of mothers born overseas; and 65% urban) and their mothers (first surveyed in 2003), this article examined whether maternal judgments of numeracy and reading ability varied by child demographics and influenced achievement and interest gains.
We linked survey data to administrative data of national standardized tests in Years 3, 5, and 7 and found that maternal judgments followed gender-stereotype patterns, favoring girls in reading and boys in numeracy. Maternal judgments were more positive for children from non-English-speaking backgrounds. Maternal judgments predicted gains in children’s achievement (consistently) and academic interest (generally), including during the transition to high school.
His team collected data from more than 2,600 Australian children and tracked their academic performance through NAPLAN tests between grades 3, 5 and 7.
They also collected information from the primary caregiver—mostly the child’s mother—as to whether they thought their child’s academic performance was better than average, average or below average.
“What we found was that in year 5, the kids whose parents overestimated their ability—they were optimistic—they did better in subsequent NAPLAN tests”, Professor Parker says.
“And more importantly, [the children] actually grew in their interest. They were more interested in maths, they were more interested in reading than [those who had] parents who are more pessimistic.”
Professor Philip Parker says your expectations of your child can become a selffulfilling prophecy.
The study also found that mothers who were not from English-speaking backgrounds had statistically-significantly more positive judgments than English-speaking mothers towards their child when assessing them on reading. This was not the case when assessing numeracy.
Professor Parker says there are many ways that a parent’s optimism can benefit their child. “So they might hire a tutor, or they … buy one of those computer games for maths classes … also they tend to be more motivating. And they tend to give homework help that is more positive and supportive, rather than controlling and detrimental.”
“Genetic Sensitivity Analysis: Adjusting for Genetic Confounding in Epidemiological Associations”, Pingault et al 2021
“Genetic sensitivity analysis: Adjusting for genetic confounding in epidemiological associations”, (2021-05-07; backlinks; similar):
Associations between exposures and outcomes reported in epidemiological studies are typically unadjusted for genetic confounding. We propose a two-stage approach for estimating the degree to which such observed associations can be explained by genetic confounding. First, we assess attenuation of exposure effects in regressions controlling for increasingly powerful polygenic scores. Second, we use structural equation models to estimate genetic confounding using heritability estimates derived from both SNP-based and twin-based studies. We examine associations between maternal education and three developmental outcomes—child educational achievement, Body Mass Index (BMI), and Attention Deficit Hyperactivity Disorder (ADHD). Polygenic scores explain between 14.3% and 23.0% of the original associations, while analyses under SNP-based and twin-based heritability scenarios indicate that observed associations could be almost entirely explained by genetic confounding. Thus, caution is needed when interpreting associations from non-genetically-informed epidemiology studies. Our approach, akin to a genetically informed sensitivity analysis, can be applied widely.
Author summary:
An objective shared across the life, behavioural, and social sciences is to identify factors that increase risk for a particular disease or trait. However, identifying true risk factors is challenging. Often, a risk factor is statistically associated with a disease even if it is not really relevant, meaning that even successfully improving the risk factor will not impact the disease. One reason for the existence of such misleading associations stems from genetic confounding: this is when genetic factors directly influence both the risk factor and the disease, which generates a statistical association even in the absence of a true effect of the risk factor. Here, we propose a method to estimate genetic confounding and quantify its effect on observed associations. We show that a large part of the associations between maternal education and 3 child outcomes—educational achievement, body mass index, and Attention-Deficit Hyperactivity Disorder—is explained by genetic confounding. Our findings can be applied to better understand the role of genetics in explaining associations of key risk factors with diseases and traits.
“What Are Bayesian Neural Network Posteriors Really Like?”, Izmailov et al 2021
“What Are Bayesian Neural Network Posteriors Really Like?”, (2021-04-29; ; similar):
The posterior over Bayesian neural network (BNN) parameters is extremely high-dimensional and non-convex. For computational reasons, researchers approximate this posterior using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient Markov chain Monte Carlo (SGMCMC). To investigate foundational questions in Bayesian deep learning, we instead use full-batch Hamiltonian Monte Carlo (HMC) on modern architectures. We show that (1) BNNs can achieve significant performance gains over standard training and deep ensembles; (2) a single long HMC chain can provide a comparable representation of the posterior to multiple shorter chains; (3) in contrast to recent studies, we find posterior tempering is not needed for near-optimal performance, with little evidence for a “cold posterior” effect, which we show is largely an artifact of data augmentation; (4) BMA performance is robust to the choice of prior scale, and relatively similar for diagonal Gaussian, mixture-of-Gaussian, and logistic priors; (5) Bayesian neural networks show surprisingly poor generalization under domain shift; (6) while cheaper alternatives such as deep ensembles and SGMCMC methods can provide good generalization, they provide distinct predictive distributions from HMC. Notably, deep ensemble predictive distributions are similarly close to HMC as standard SGLD, and closer than standard variational inference.
“Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020”, Turner et al 2021
“Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020”, (2021-04-20; ; similar):
This paper presents the results and insights from the black-box optimization (BBO) challenge at NeurIPS 2020, which ran from July to October 2020.
The challenge emphasized the importance of evaluating derivative-free optimizers for tuning the hyperparameters of machine learning models. This was the first black-box optimization challenge with a machine learning emphasis. It was based on tuning (validation set) performance of standard machine learning models on real datasets. This competition has widespread impact, as black-box optimization (eg. Bayesian optimization) is relevant for hyperparameter tuning in almost every machine learning project, as well as many applications outside of machine learning. The final leaderboard was determined using the optimization performance on held-out (hidden) objective functions, where the optimizers ran without human intervention.
Baselines were set using the default settings of several open-source black-box optimization packages, as well as random search.
“Maximal Positive Controls: A Method for Estimating the Largest Plausible Effect Size”, Hilgard 2021
2021hilgard.pdf
: “Maximal positive controls: A method for estimating the largest plausible effect size”, (2021-03-01; ; similar):
 Some reported effect sizes are too big for the hypothesized process.
 Simple, obvious manipulations can reveal which effects are too big.
 A demonstration is provided examining an implausibly large effect.
Effect sizes in social psychology are generally not large and are limited by error variance in manipulation and measurement. Effect sizes exceeding these limits are implausible and should be viewed with skepticism. Maximal positive controls, experimental conditions that should show an obvious and predictable effect [eg. a Stroop effect], can provide estimates of the upper limits of plausible effect sizes on a measure.
In this work, maximal positive controls are conducted for 3 measures of aggressive cognition, and the effect sizes obtained are compared to studies found through systematic review. Questions are raised regarding the plausibility of certain reports with effect sizes comparable to, or in excess of, the effect sizes found in maximal positive controls.
Maximal positive controls may provide a means to identify implausible study results at lower cost than direct replication.
[Keywords: violent video games, aggression, aggressive thought, positive controls, scientific selfcorrection]
[Positive controls eliciting a hitherto-maximum effect can be seen as a kind of empirical Bayes estimating the distribution of plausible effects: if a reported effect size exceeds the empirical max, either something extremely unlikely has occurred (a new max out of n effects ever observed) or an error. For large n, the posterior probability of an error will be much larger.]
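The bracketed argument can be made quantitative: among n+1 exchangeable draws, each is equally likely to be the largest, so a genuinely new maximum after n prior effects has probability 1/(n+1), and a simple Bayes calculation then weighs that against a base rate of erroneous reports. A minimal sketch (the error base rate is an assumed illustrative number, not an estimate from the paper):

```python
from fractions import Fraction

def prob_new_max(n):
    """By exchangeability, each of the n+1 i.i.d. draws is equally likely
    to be the largest, so a new draw beats the previous max of n draws
    with probability 1/(n+1)."""
    return Fraction(1, n + 1)

def posterior_prob_error(n, p_error):
    """Posterior probability that a claimed new-record effect size is an error,
    given an assumed base rate p_error of erroneous reports.

    Bayes: a 'new max' report arises either from an error (rate p_error)
    or from a genuine record (rate (1 - p_error) / (n + 1))."""
    p_record = float(prob_new_max(n))
    return p_error / (p_error + (1 - p_error) * p_record)

# With 200 prior studies and even a modest 5% error base rate, an apparent
# record effect is far more likely to be an error than a true new maximum
# (posterior probability of error > 0.9).
p = posterior_prob_error(200, 0.05)
```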
“Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021
2021smith.pdf
: “Informational Herding, Optimal Experimentation, and Contrarianism”, (2021-02-25; ; similar):
In the standard herding model, privately informed individuals sequentially see prior actions and then act. An identical action herd eventually starts and public beliefs tend to “cascade sets” where social learning stops. What behaviour is socially efficient when actions ignore informational externalities?
We characterize the outcome that maximizes the discounted sum of utilities. Our 4 key findings are:
 Cascade sets shrink but do not vanish, and herding should occur but less readily as greater weight is attached to posterity.
 An optimal mechanism rewards individuals mimicked by their successor.
 Cascades cannot start after period one under a signal log-concavity condition.
 Given this condition, efficient behaviour is contrarian, leaning against the myopically more popular actions in every period.
We make 2 technical contributions: as value functions with learning are not smooth, we use monotone comparative statics under uncertainty to deduce optimal dynamic behaviour. We also adapt dynamic pivot mechanisms to Bayesian learning.
[Keywords: herding, mimicking, contrarian, cascade, efficiency, monotonicity, log-concavity]
“Hot under the Collar: A Latent Measure of Interstate Hostility”, Terechshenko 2020
2020terechshenko.pdf
: “Hot under the collar: A latent measure of interstate hostility”, (2020-11-17; ; similar):
The majority of studies on international conflict escalation use a variety of measures of hostility, including the use of force, reciprocity, and the number of fatalities. The use of different measures, however, leads to different empirical results and creates difficulties when testing existing theories of interstate conflict. Furthermore, hostility measures currently used in the conflict literature are ill-suited to the task of identifying consistent predictors of international conflict escalation. This article presents a new dyadic latent measure of interstate hostility, created using a Bayesian item-response theory model and conflict data from the Militarized Interstate Dispute (MID) and Phoenix political event datasets. This model (1) provides a more granular, conceptually precise, and validated measure of hostility, which incorporates the uncertainty inherent in the latent variable; and (2) solves the problem of temporal variation in event data using a varying-intercept structure and human-coded data as a benchmark against which biases in machine-coded data are corrected. In addition, this measurement model allows for the systematic evaluation of how existing measures relate to the construct of hostility. The presented model will therefore enhance the ability of researchers to understand factors affecting conflict dynamics, including escalation and de-escalation processes.
“What Matters More for Entrepreneurship Success? A Meta-analysis Comparing General Mental Ability and Emotional Intelligence in Entrepreneurial Settings”, Allen et al 2020
“What matters more for entrepreneurship success? A meta-analysis comparing general mental ability and emotional intelligence in entrepreneurial settings”, (2020-11-03; backlinks; similar):
Using meta-analysis, we investigate the extent to which General Mental Ability (GMA) and Emotional Intelligence (EI) predict entrepreneurial success. Based on 65,826 observations, we find that both GMA and EI matter for success, but that the size of the relationship is more than twice as large for EI. Our study contradicts and adds important contextual nuance to previous meta-analyses on performance in traditional workplace settings, where GMA is considered to be more critical than EI. We also contribute to the literature on cognitive and emotional intelligence in entrepreneurship.
Managerial Summary: While previous studies have shown General Mental Ability (GMA, cognitive intelligence) to be more important for success compared to Emotional Intelligence (EI) in traditional workplace settings, we theorize that EI will be more important in entrepreneurial contexts. Entrepreneurship is an extreme setting with distinct emotional and social demands relative to many other organizational settings. Moreover, managing an entrepreneurial business has been described as an “emotional rollercoaster.” Thus, on a relative basis we expected EI to matter more in entrepreneurial contexts and explore this assumption using a meta-analysis of 65,826 observations. We find that both GMA and EI matter for entrepreneurial success, but that the size of the relationship is more than twice as large for EI.
…The dominant meta-analytic paradigm in entrepreneurship is psychometric meta-analysis (Hunter & Schmidt 2004). However, we did not choose this procedure for 2 reasons. First, the chief advantage of psychometric meta-analysis is the ability to correct for statistical artifacts such as unreliability and range restriction. In our data, a large percentage of the samples did not report the information needed to make these corrections locally, and the global corrections via artifact distributions with the limited number of samples that reported the necessary information would likely have been strongly influenced by second-order sampling error.
“From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning”, Wojtowicz & DeDeo 2020
2020wojtowicz.pdf
: “From Probability to Consilience: How Explanatory Values Implement Bayesian Reasoning”, (2020-10-23; similar):
Highlights:
 Recent experiments show that we value explanations for many reasons, such as predictive power and simplicity.
 Bayesian rational analysis provides a functional account of these values, along with concrete definitions that allow us to measure and compare them across a variety of contexts, including visual perception, politics, and science.
 These values include descriptiveness, co-explanation, and measures of simplicity such as parsimony and concision. The first two are associated with the evaluation of explanations in the light of experience, while the latter concern the intrinsic features of an explanation.
 Failures to explain well can be understood as imbalances in these values: a conspiracy theorist, for example, may overrate co-explanation relative to simplicity, and many similar ‘failures to explain’ that we see in social life may be analyzable at this level.
Recent work in cognitive science has uncovered a diversity of explanatory values, or dimensions along which we judge explanations as better or worse. We propose a Bayesian account of these values that clarifies their function and shows how they fit together to guide explanation-making. The resulting taxonomy shows that core values from psychology, statistics, and the philosophy of science emerge from a common mathematical framework and provide insight into why people adopt the explanations they do. This framework not only operationalizes the explanatory virtues associated with, for example, scientific argument-making, but also enables us to reinterpret the explanatory vices that drive phenomena such as conspiracy theories, delusions, and extremist ideologies.
[Keywords: explanation, explanatory values, Bayesian cognition, rational analysis, simplicity, vice epistemology]
“Meta-trained Agents Implement Bayes-optimal Agents”, Mikulik et al 2020
“Meta-trained agents implement Bayes-optimal agents”, (2020-10-21; ; backlinks; similar):
Memory-based meta-learning is a powerful technique to build agents that adapt fast to any task within a target distribution. A previous theoretical study has argued that this remarkable performance is because the meta-training protocol incentivizes agents to behave Bayes-optimally. We empirically investigate this claim on a number of prediction and bandit tasks. Inspired by ideas from theoretical computer science, we show that meta-learned and Bayes-optimal agents not only behave alike, but they even share a similar computational structure, in the sense that one agent system can simulate the other. Furthermore, we show that Bayes-optimal agents are fixed points of the meta-learning dynamics. Our results suggest that memory-based meta-learning might serve as a general technique for numerically approximating Bayes-optimal agents—that is, even for task distributions for which we currently don’t possess tractable models.
“Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020
“Learning not to learn: Nature versus nurture in silico”, (2020-10-09; ; backlinks; similar):
Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of meta-learning (or ‘learning to learn’) to answer when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents’ lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: one in which meta-learning yields a learning algorithm that implements task-dependent information-integration, and a second regime in which meta-learning imprints a heuristic or ‘hard-coded’ behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.
“Searching for the Backfire Effect: Measurement and Design Considerations”, Swire-Thompson et al 2020
“Searching for the Backfire Effect: Measurement and Design Considerations”, (2020-09; backlinks; similar):
One of the most concerning notions for science communicators, fact-checkers, and advocates of truth is the backfire effect: this is when a correction leads to an individual increasing their belief in the very misconception the correction is aiming to rectify. There is currently a debate in the literature as to whether backfire effects exist at all, as recent studies have failed to find the phenomenon, even under theoretically favorable conditions.
In this review, we summarize the current state of the worldview and familiarity backfire effect literatures. We subsequently examine barriers to measuring the backfire phenomenon, discuss approaches to improving measurement and design, and conclude with recommendations for fact-checkers.
We suggest that backfire effects are not a robust empirical phenomenon, and more reliable measures, powerful designs, and stronger links between experimental design and theory could greatly help move the field ahead.
[Keywords: backfire effects, belief updating, misinformation, continued influence effect, reliability]
“Is SGD a Bayesian Sampler? Well, Almost”, Mingard et al 2020
“Is SGD a Bayesian sampler? Well, almost”, (2020-06-26; ; similar):
Overparameterized deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalize remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability P_{SGD}(f|S) that an overparameterized DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_{b}(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S.
Our main findings are that P_{SGD}(f|S) correlates remarkably well with P_{b}(f|S) and that P_{b}(f|S) is strongly biased towards low-error and low-complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines P_{b}(f|S)), rather than a special property of SGD, is the primary explanation for why DNNs generalize so well in the overparameterized regime.
While our results suggest that the Bayesian posterior P_{b}(f|S) is the first-order determinant of P_{SGD}(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_{SGD}(f|S) and/or P_{b}(f|S), can shed new light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimizer choice affect DNN performance.
“Laplace’s Theories of Cognitive Illusions, Heuristics and Biases”, Miller & Gelman 2020
2020miller.pdf
: “Laplace’s Theories of Cognitive Illusions, Heuristics and Biases”, (2020-06-03; similar):
In his book from the early 1800s, Essai Philosophique sur les Probabilités, the mathematician Pierre-Simon de Laplace anticipated many ideas developed within the past 50 years in cognitive psychology and behavioral economics, explaining human tendencies to deviate from norms of rationality in the presence of probability and uncertainty. A look at Laplace’s theories and reasoning is striking in how modern they seem, how much progress he made without the benefit of systematic experimentation, and the novelty of a few of his unexplored conjectures. We argue that this work points to these theories being more fundamental and less contingent on recent experimental findings than we might have thought.
“Exploring Bayesian Optimization: Breaking Bayesian Optimization into Small, Sizeable Chunks”, Agnihotri & Batra 2020
“Exploring Bayesian Optimization: Breaking Bayesian Optimization into small, sizeable chunks”, (2020-05-05; ; similar):
[Discussion of Bayesian optimization (BO), a decision-theoretic application of Bayesian statistics (typically using Gaussian processes for flexibility) which tries to model a set of variables to find the maximum or best in the fewest number of collected data points possible. This differs from normal experiment design, which tries simply to maximize the overall information about all points given a fixed number of samples, not just the best point, or from “active learning”, which tries to select data points which make the model as predictive as possible while collecting samples. The difference can be visualized by watching posterior distributions for simple 2D problems evolve as data is collected according to different BO, active learning, or simple grid-search/random baseline strategies. The optimal strategy is usually infeasible to calculate, so various heuristics like “expected improvement” or “Thompson sampling” are used, and their different behavior can be visualized and compared. BO is heavily used in machine learning to find the best combinations of settings for machine learning models.]
In this article, we looked at Bayesian Optimization for optimizing a black-box function. Bayesian Optimization is well suited when the function evaluations are expensive, making grid or exhaustive search impractical. We looked at the key components of Bayesian Optimization. First, we looked at the notion of using a surrogate function (with a prior over the space of objective functions) to model our black-box function. Next, we looked at the “Bayes” in Bayesian Optimization — the function evaluations are used as data to obtain the surrogate posterior. We then looked at acquisition functions, which are functions of the surrogate posterior and are optimized sequentially. This new sequential optimization is inexpensive and thus of utility to us. We also looked at a few acquisition functions and showed how these different functions balance exploration and exploitation. Finally, we looked at some practical examples of Bayesian Optimization for optimizing hyperparameters for machine learning models.
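The “expected improvement” heuristic mentioned above has a closed form when the surrogate posterior at a candidate point is Gaussian: EI = (μ − f*)Φ(z) + σφ(z) with z = (μ − f*)/σ, where f* is the best value observed so far. A minimal sketch in pure Python (the surrogate means/standard deviations below are made-up illustrations, not from the article):

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best_so_far):
    """Closed-form EI for a Gaussian surrogate posterior N(mu, sigma^2)
    at one candidate point, when maximizing: E[max(f - best_so_far, 0)]."""
    if sigma == 0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * normal_cdf(z) + sigma * normal_pdf(z)

# Hypothetical surrogate predictions at 3 candidate hyperparameter settings:
candidates = [(0.70, 0.01), (0.65, 0.20), (0.72, 0.05)]  # (mean, std)
best = 0.71  # best validation score observed so far
scores = [expected_improvement(m, s, best) for m, s in candidates]
# BO would evaluate the candidate with the highest EI next; note that a
# low-mean but high-variance candidate (exploration) can beat a
# high-mean, low-variance one (exploitation).
```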
“The Social and Genetic Inheritance of Educational Attainment: Genes, Parental Education, and Educational Expansion”, Lin 2020
2020lin.pdf
: “The social and genetic inheritance of educational attainment: Genes, parental education, and educational expansion”, Meng-Jung Lin (2020-02-01; ; backlinks)
“The Most ‘Abandoned’ Books on GoodReads”, Branwen 2019
GoodReads
: “The Most ‘Abandoned’ Books on GoodReads”, (2019-12-09; ; backlinks; similar):
Which books on GoodReads are most difficult to finish? Estimating proportions in December 2019 gives an entirely different result than absolute counts.
What books are hardest for a reader who starts them to finish, and most likely to be abandoned? I scrape a crowdsourced tag, abandoned, from the GoodReads book social network on 2019-12-09 to estimate the conditional probability of being abandoned.
The default GoodReads tag interface presents only raw counts of tags, not counts divided by total ratings (= reads). This conflates popularity with probability of being abandoned: a popular but rarely-abandoned book may have more abandoned tags than a less popular but often-abandoned book. There is also residual error from the winner’s curse, where books with fewer ratings are more misestimated than popular books. I fix that to see what more correct rankings look like.
Correcting for both changes the top-5 ranking completely, from (raw counts):
 The Casual Vacancy, J. K. Rowling
 Catch-22, Joseph Heller
 American Gods, Neil Gaiman
 A Game of Thrones, George R. R. Martin
 The Book Thief, Markus Zusak
to (shrunken posterior proportions):
 Black Leopard, Red Wolf, Marlon James
 Space Opera, Catherynne M. Valente
 Little, Big, John Crowley
 The Witches: Salem, 1692, Stacy Schiff
 Tender Morsels, Margo Lanagan
I also consider a model adjusting for covariates (author/average-rating/year), to see what books are most surprisingly often-abandoned given their pedigrees & rating etc. Abandon rates increase the newer a book is, and the lower the average rating.
Adjusting for those, the top-5 are:
 The Casual Vacancy, J. K. Rowling
 The Chemist, Stephenie Meyer
 Infinite Jest, David Foster Wallace
 The Glass Bead Game, Hermann Hesse
 Theft by Finding: Diaries (1977–2002), David Sedaris
Books at the top of the adjusted list appear to reflect a mix of highly-popular authors changing genres, and ‘prestige’ books which are highly-rated but a slog to read.
These results are interesting for how they highlight how people read books for many reasons (such as marketing campaigns, literary prestige, or following a popular author), and this is reflected in their decision whether to continue reading or to abandon a book.
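The “shrunken posterior proportions” ranking can be sketched as Beta-Binomial shrinkage: treat each book's abandoned-tag count as a binomial draw out of its total ratings, put a Beta prior over abandonment rates, and rank by posterior mean rather than by raw count. A minimal sketch (the prior parameters and book counts are made up for illustration, not the real GoodReads data):

```python
def shrunken_proportion(abandoned, total, prior_a=2.0, prior_b=200.0):
    """Posterior-mean abandonment rate under a Beta(prior_a, prior_b) prior.
    The prior parameters here are assumed; in practice one would fit them
    to the full corpus of books (empirical Bayes)."""
    return (abandoned + prior_a) / (total + prior_a + prior_b)

# (book, abandoned-tag count, total ratings) -- hypothetical numbers:
books = [
    ("popular, rarely abandoned", 500, 100_000),
    ("obscure, often abandoned", 30, 400),
    ("tiny sample", 2, 10),
]
ranked = sorted(books, key=lambda b: -shrunken_proportion(b[1], b[2]))
# Raw counts would rank the popular book first (500 tags), but the shrunken
# posterior proportion ranks the obscure, often-abandoned book first, while
# the 10-rating book is pulled strongly back toward the prior mean.
```

This is the same correction described above: dividing by total ratings fixes the popularity conflation, and the Beta prior fixes the winner's-curse misestimation of small-sample books.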
“The Propensity for Aggressive Behavior and Lifetime Incarceration Risk: A Test for Gene-environment Interaction (G × E) Using Whole-genome Data”, Barnes et al 2019
2019barnes.pdf
: “The propensity for aggressive behavior and lifetime incarceration risk: A test for gene-environment interaction (G × E) using whole-genome data”, (2019-11-01; ; backlinks; similar):
 Sociogenomics offers insight into gene-environment interplay.
 We construct a genome-wide measure of genetic propensity for aggressive behavior.
 Males with higher genetic propensity were more likely to experience incarceration.
 But a gene-environment interaction (G × E) was observed:
 Genetic propensity was not predictive for males raised in high-education homes.
Incarceration is a disruptive event that is experienced by a considerable proportion of the United States population. Research has identified social factors that predict incarceration risk, but scholars have called for a focus on the ways that individual differences combine with social factors to affect incarceration risk. Our study is an initial attempt to heed this call using whole-genome data.
We use data from the Health and Retirement Study (HRS) (n = 6716) to construct a genome-wide measure of genetic propensity for aggressive behavior and use it to predict lifetime incarceration risk. We find that participants with a higher genetic propensity for aggression are more likely to experience incarceration, but the effect is stronger for males than females. Importantly, we identify a gene-environment interaction (G × E)—genetic propensity is reduced, substantively and statistically, to a nonsignificant predictor for males raised in homes where at least one parent graduated high school.
We close by placing these findings in the broader context of concerns that have been raised about genetics research in criminology.
[Keywords: lifetime incarceration, genome-wide polygenic score (PGS), parental educational attainment, gene-environment interaction (G × E)]
“Bayesian Parameter Estimation Using Conditional Variational Autoencoders for Gravitational-wave Astronomy”, Gabbard et al 2019
“Bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy”, (2019-09-13; ; similar):
Gravitational wave (GW) detection is now commonplace and, as the sensitivity of the global network of GW detectors improves, we will observe 𝒪(100)s of transient GW events per year. The current methods used to estimate their source parameters employ optimally sensitive but computationally costly Bayesian inference approaches, where typical analyses have taken between 6 hours and 5 days. For binary neutron star and neutron star-black hole systems, prompt counterpart electromagnetic (EM) signatures are expected on timescales of 1 s–1 min, and the current fastest method for alerting EM follow-up observers can provide estimates in 𝒪(1) minutes, on a limited range of key source parameters.
Here we show that a conditional variational autoencoder pre-trained on binary black hole signals can return Bayesian posterior probability estimates.
The training procedure need only be performed once for a given prior parameter space and the resulting trained machine can then generate samples describing the posterior distribution ~6 orders of magnitude faster than existing techniques.
“New Paradigms in the Psychology of Reasoning”, Oaksford & Chater 2019
2020oaksford.pdf
: “New Paradigms in the Psychology of Reasoning”, (2019-09-12; similar):
The psychology of verbal reasoning initially compared performance with classical logic. In the last 25 years, a new paradigm has arisen, which focuses on knowledgerich reasoning for communication and persuasion and is typically modeled using Bayesian probability theory rather than logic. This paradigm provides a new perspective on argumentation, explaining the rational persuasiveness of arguments that are logical fallacies. It also helps explain how and why people stray from logic when given deductive reasoning tasks. What appear to be erroneous responses, when compared against logic, often turn out to be rationally justified when seen in the richer rational framework of the new paradigm. Moreover, the same approach extends naturally to inductive reasoning tasks, in which people extrapolate beyond the data they are given and logic does not readily apply. We outline links between social and individual reasoning and set recent developments in the psychology of reasoning in the wider context of Bayesian cognitive science.
“Estimating Distributional Models With Brms: Additive Distributional Models”, Bürkner 2019
“Estimating Distributional Models with brms: Additive Distributional Models”, (2019-08-29; ; backlinks; similar):
This vignette provides an introduction on how to fit distributional regression models with
brms
. We use the term distributional model to refer to a model in which we can specify predictor terms for all parameters of the assumed response distribution. In the vast majority of regression model implementations, only the location parameter (usually the mean) of the response distribution depends on the predictors and corresponding regression parameters. Other parameters (eg. scale or shape parameters) are estimated as auxiliary parameters, assuming them to be constant across observations. This assumption is so common that most researchers applying regression models are often (in my experience) not aware of the possibility of relaxing it. This is understandable insofar as relaxing this assumption drastically increases model complexity and thus makes models hard to fit. Fortunately,
brms
uses Stan on the backend, which is an incredibly flexible and powerful tool for estimating Bayesian models so that model complexity is much less of an issue.…In the examples so far, we did not have multilevel data and thus did not fully use the capabilities of the distributional regression framework of
brms
. In the example presented below, we will not only show how to deal with multilevel data in distributional models, but also how to incorporate smooth terms (ie. splines) into the model. In many applications, we have no or only a very vague idea of what the relationship between a predictor and the response looks like. A very flexible approach to tackle this problem is to use splines and let them figure out the form of the relationship.
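Since brms is an R package, the core idea can be sketched language-agnostically. Below is a minimal Python illustration (all names and toy data invented, not the vignette's code) of a distributional Gaussian regression: both the mean and the log standard deviation are linear in the predictor, and both sets of coefficients are fit jointly by gradient descent on the negative log-likelihood.

```python
import math, random

random.seed(1)
N = 400
# toy heteroskedastic data: true mean = x, true log-sd = 0.5 * x
x = [i / N * 2 - 1 for i in range(N)]
y = [xi + random.gauss(0, 1) * math.exp(0.5 * xi) for xi in x]

# distributional model: mu_i = a + b*x_i, log(sigma_i) = c + d*x_i
a = b = c = d = 0.0
lr = 0.05
for _ in range(3000):
    ga = gb = gc = gd = 0.0
    for xi, yi in zip(x, y):
        s2 = math.exp(2 * (c + d * xi))    # sigma_i^2
        r = yi - a - b * xi                # residual
        ga += -r / s2                      # d NLL / d a
        gb += -r * xi / s2                 # d NLL / d b
        gc += 1 - r * r / s2               # d NLL / d c
        gd += xi * (1 - r * r / s2)        # d NLL / d d
    a -= lr * ga / N; b -= lr * gb / N
    c -= lr * gc / N; d -= lr * gd / N

# b should recover ~1 (mean effect), d should recover ~0.5 (scale effect)
```

In brms itself this corresponds to a formula like `bf(y ~ x, sigma ~ x)`; the sketch just makes explicit that "predicting the scale parameter" adds extra gradient terms for the scale coefficients rather than requiring a fundamentally new estimator.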
“Allocation to Groups: Examples of Lord's Paradox”, Wright 2019
2019wright.pdf
: “Allocation to groups: Examples of Lord's paradox”, (2019-07-12; backlinks; similar):
Background: Educational and developmental psychologists often examine how groups change over time. 2 analytic procedures—analysis of covariance (ANCOVA) and the gain score model—each seem well suited for the simplest situation, with just 2 groups and 2 time points. They can produce different results, which is known as Lord’s paradox.
Aims: Several factors should influence a researcher’s analytic choice, including whether the score from the initial time influences how people are assigned to groups. Examples of educational relevance are shown which will help to explain this to researchers and students. It is shown that a common method used to measure school effectiveness is biased against schools that serve students from groups that are historically poor-performing.
Methods and results: The examples come from sports and measuring educational effectiveness (eg. for teachers or schools). A simulation study shows that if the covariate influences group allocation, the ANCOVA is preferred, but otherwise, the gain score model may be appropriate. Regression towards the mean is used to account for these findings.
Conclusions: Analysts should consider the relationship between the covariate and group allocation when deciding upon their analytic method. Because the influence of the covariate on group allocation may be complex, the appropriate method may be complex. Because the influence of the covariate on group allocation may be unknown, the choice of method may require several assumptions.
[Keywords: Lord’s paradox, value-added models, ANCOVA, educator equity]
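The simulation result described in the Methods can be sketched in a few lines (a hypothetical toy version, not the authors' code): allocate groups by the pretest score, generate posttest scores with no true treatment effect, and compare the two models. Regression toward the mean makes the gain-score model report a spurious group effect, while ANCOVA correctly finds none.

```python
import random, statistics

random.seed(0)
N = 10000
pre = [random.gauss(0, 1) for _ in range(N)]
# regression toward the mean: post correlates 0.5 with pre; no treatment effect
post = [0.5 * p + random.gauss(0, 1) for p in pre]
# allocation determined by the covariate: high scorers go to group 1
grp = [1 if p > 0 else 0 for p in pre]

# gain-score model: compare mean(post - pre) across groups
gain1 = statistics.mean(post[i] - pre[i] for i in range(N) if grp[i] == 1)
gain0 = statistics.mean(post[i] - pre[i] for i in range(N) if grp[i] == 0)
gain_effect = gain1 - gain0        # spuriously negative (~ -0.8)

def ols_slope(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

# ANCOVA: partial effect of group on post controlling for pre,
# via the Frisch-Waugh trick (residualize post and group on pre)
b_post, b_grp = ols_slope(pre, post), ols_slope(pre, grp)
r_post = [post[i] - b_post * pre[i] for i in range(N)]
r_grp = [grp[i] - b_grp * pre[i] for i in range(N)]
ancova_effect = ols_slope(r_grp, r_post)   # ≈ 0: no true treatment effect
```

This matches the paper's conclusion: when the covariate drives allocation, the gain-score model confounds regression toward the mean with a group effect, and ANCOVA is preferred.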
“Evolutionary Implementation of Bayesian Computations”, Czégel et al 2019
“Evolutionary implementation of Bayesian computations”, (2019-06-28; ; backlinks; similar):
A wide variety of human and nonhuman behavior is computationally well accounted for by probabilistic generative models, formalized consistently in a Bayesian framework.
Recently, it has been suggested that another family of adaptive systems, namely, those governed by Darwinian evolutionary dynamics, are capable of implementing building blocks of Bayesian computations. These algorithmic similarities rely on the analogous competition dynamics of generative models and of Darwinian replicators to fit possibly high-dimensional and stochastic environments. Identified computational building blocks include Bayesian update over a single variable and replicator dynamics, transition between hidden states and mutation, and Bayesian inference in hierarchical models and multilevel selection.
Here we provide a coherent mathematical discussion of these observations in terms of Bayesian graphical models and a step-by-step introduction to their evolutionary interpretation. We also extend existing results by adding two missing components: a correspondence between likelihood optimization and phenotypic adaptation, and between expectation-maximization-like dynamics in mixture models and ecological competition.
These correspondences suggest a deeper algorithmic analogy between evolutionary dynamics and statistical learning, pointing towards a unified computational understanding of the mechanisms Nature invented to adapt to high-dimensional and uncertain environments.
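The first correspondence listed—Bayesian update over a single variable and replicator dynamics—is easy to verify directly: one generation of discrete replicator dynamics, with fitness playing the role of likelihood, is exactly Bayes' rule. A toy sketch with made-up numbers:

```python
def replicator_step(freqs, fitness):
    """One generation of discrete replicator dynamics:
    p_i' = p_i * f_i / mean fitness."""
    mean_fit = sum(p * f for p, f in zip(freqs, fitness))
    return [p * f / mean_fit for p, f in zip(freqs, fitness)]

prior      = [0.5, 0.3, 0.2]    # replicator frequencies / prior P(i)
likelihood = [0.9, 0.5, 0.1]    # fitnesses / likelihoods P(data | i)

posterior = replicator_step(prior, likelihood)

# identical to Bayes' rule: P(i | data) = P(i) P(data | i) / P(data)
norm = sum(p * l for p, l in zip(prior, likelihood))
bayes = [p * l / norm for p, l in zip(prior, likelihood)]
assert all(abs(a - b) < 1e-12 for a, b in zip(posterior, bayes))
```

Mean fitness plays the role of the marginal likelihood (the evidence), which is why the two normalizations coincide term by term.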
“How Should We Critique Research?”, Branwen 2019
Researchcriticism
: “How Should We Critique Research?”, (2019-05-19; ; backlinks; similar):
Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.
Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of metascience have demonstrated that much research is of low quality and often false.
But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?
I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.
This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.
If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.
“Reinforcement Learning, Fast and Slow”, Botvinick et al 2019
“Reinforcement Learning, Fast and Slow”, (2019-05-16; ; backlinks; similar):
Recent AI research has given rise to powerful techniques for deep reinforcement learning. In their combination of representation learning with reward-driven behavior, deep reinforcement learning would appear to have inherent interest for psychology and neuroscience.
One reservation has been that deep reinforcement learning procedures demand large amounts of training data, suggesting that these algorithms may differ fundamentally from those underlying human learning.
While this concern applies to the initial wave of deep RL techniques, subsequent AI work has established methods that allow deep RL systems to learn more quickly and efficiently. Two particularly interesting and promising techniques center, respectively, on episodic memory and meta-learning. Alongside their interest as AI techniques, deep RL methods leveraging episodic memory and meta-learning have direct and interesting implications for psychology and neuroscience. One subtle but critically important insight which these techniques bring into focus is the fundamental connection between fast and slow forms of learning.
Deep reinforcement learning (RL) methods have driven impressive advances in artificial intelligence in recent years, exceeding human performance in domains ranging from Atari to Go to no-limit poker. This progress has drawn the attention of cognitive scientists interested in understanding human learning. However, the concern has been raised that deep RL may be too sample-inefficient—that is, it may simply be too slow—to provide a plausible model of how humans learn. In the present review, we counter this critique by describing recently developed techniques that allow deep RL to operate more nimbly, solving problems much more quickly than previous methods. Although these techniques were developed in an AI context, we propose that they may have rich implications for psychology and neuroscience. A key insight, arising from these AI methods, concerns the fundamental connection between fast RL and slower, more incremental forms of learning.
“Meta Reinforcement Learning As Task Inference”, Humplik et al 2019
“Meta reinforcement learning as task inference”, (2019-05-15; ; similar):
Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There is considerable interest in designing reinforcement learning (RL) algorithms with similar properties. This includes proposals to learn the learning algorithm itself, an idea also known as meta-learning. One formal interpretation of this idea is as a partially observable multi-task RL problem in which task information is hidden from the agent. Such unknown task problems can be reduced to Markov decision processes (MDPs) by augmenting an agent’s observations with an estimate of the belief about the task based on past experience. However, estimating the belief state is intractable in most partially-observed MDPs. We propose a method that separately learns the policy and the task belief by taking advantage of various kinds of privileged information. Our approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment with sparse rewards and requiring long-term memory.
“Structural Equation Models As Computation Graphs”, Kesteren & Oberski 2019
“Structural Equation Models as Computation Graphs”, (2019-05-11; similar):
Structural equation modeling (SEM) is a popular tool in the social and behavioural sciences, where it is being applied to ever more complex data types. The high-dimensional data produced by modern sensors, brain images, or (epi)genetic measurements require variable selection using parameter penalization; experimental models combining disparate data sources benefit from regularization to obtain a stable result; and genomic SEM or network models lead to alternative objective functions. With each proposed extension, researchers currently have to completely reformulate SEM and its optimization algorithm—a challenging and time-consuming task.
In this paper, we consider each SEM as a computation graph, a flexible method of specifying objective functions borrowed from the field of deep learning. When combined with state-of-the-art optimizers, our computation graph approach can extend SEM without the need for bespoke software development. We show that both existing and novel SEM improvements follow naturally from our approach. To demonstrate, we discuss least absolute deviation estimation and penalized regression models. We also introduce spike-and-slab SEM, which may perform better when shrinkage of large factor loadings is not desired. By applying computation graphs to SEM, we hope to greatly accelerate the process of developing SEM techniques, paving the way for new applications. We provide an accompanying R package tensorsem.
“Meta-learning of Sequential Strategies”, Ortega et al 2019
“Meta-learning of Sequential Strategies”, (2019-05-08; ; backlinks; similar):
In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
“Fermi Calculation Examples”, Branwen 2019
Fermi
: “Fermi Calculation Examples”, (2019-03-29; backlinks; similar):
Fermi estimates or problems are quick heuristic solutions to apparently insoluble quantitative problems, rewarding clever use of real-world knowledge and critical thinking; bibliography of some examples.
A short discussion of “Fermi calculations”: quick-and-dirty approximate answers to quantitative questions which prize cleverness in exploiting implications of common knowledge or basic principles to give reasonable answers to apparently unanswerable questions.
Links to discussions of Fermi estimates, and a list of some Fermi estimates I’ve done.
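As a concrete illustration (the classic "piano tuners in Chicago" estimate, not one of the estimates from the page itself), a Fermi calculation is just a chain of round-number assumptions multiplied together:

```python
# Classic Fermi estimate: how many piano tuners are in Chicago?
# Every input is a deliberately rough round-number assumption.
population      = 3_000_000   # people in Chicago
per_household   = 2           # people per household
piano_rate      = 1 / 20      # fraction of households with a piano
tunings_per_yr  = 1           # tunings per piano per year
tunings_per_day = 4           # jobs one tuner can do per working day
work_days       = 250         # working days per year

pianos   = population / per_household * piano_rate   # ~75,000 pianos
demand   = pianos * tunings_per_yr                   # tunings needed per year
capacity = tunings_per_day * work_days               # tunings one tuner supplies
tuners   = demand / capacity
print(round(tuners))  # 75 -- the right order of magnitude
```

The point is not the exact answer but that errors in the individual guesses tend to partially cancel, so the product usually lands within an order of magnitude of the truth.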
“Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019
2019isakov.pdf
: “Is the FDA too conservative or too aggressive?: A Bayesian decision analysis of clinical trial design”, (2019-01-04; ; similar):
Implicit in the drug-approval process is a host of decisions—target patient population, control group, primary endpoint, sample size, follow-up period, etc.—all of which determine the trade-off between Type I and Type II error. We explore the application of Bayesian decision analysis (BDA) to minimize the expected cost of drug approval, where the relative costs of the two types of errors are calibrated using U.S. Burden of Disease Study 2010 data. The results for conventional fixed-sample randomized clinical-trial designs suggest that for terminal illnesses with no existing therapies such as pancreatic cancer, the standard threshold of 2.5% is substantially more conservative than the BDA-optimal threshold of 23.9% to 27.8%. For relatively less deadly conditions such as prostate cancer, 2.5% is more risk-tolerant or aggressive than the BDA-optimal threshold of 1.2% to 1.5%. We compute BDA-optimal sizes for 25 of the most lethal diseases and show how a BDA-informed approval process can incorporate all stakeholders’ views in a systematic, transparent, internally consistent, and repeatable manner.
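The paper calibrates error costs from Burden of Disease data; as a hedged toy sketch (all costs, priors, and effect sizes here are invented, not the paper's calibration), the qualitative result—lenient thresholds for deadly diseases, strict ones for milder conditions—falls out of minimizing expected cost over the critical value of a one-sided normal test:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def optimal_alpha(c1, c2, p_eff, z_effect):
    """Grid-search the one-sided critical value minimizing expected cost.
    c1: cost of a false approval (Type I); c2: cost of a false rejection
    (Type II); p_eff: prior P(drug works); z_effect: standardized effect."""
    best_cost, best_alpha = float("inf"), None
    for i in range(1, 400):
        z_crit = i / 100.0                  # critical values 0.01 .. 3.99
        alpha = 1.0 - Phi(z_crit)           # Type I error rate
        beta = Phi(z_crit - z_effect)       # Type II error rate
        cost = (1 - p_eff) * c1 * alpha + p_eff * c2 * beta
        if cost < best_cost:
            best_cost, best_alpha = cost, alpha
    return best_alpha

# terminal illness: a missed effective drug is far costlier -> lenient threshold
lenient = optimal_alpha(c1=1.0, c2=4.0, p_eff=0.5, z_effect=2.5)
# milder condition: false approvals weigh more -> strict threshold
strict = optimal_alpha(c1=4.0, c2=1.0, p_eff=0.5, z_effect=2.5)
print(lenient > strict)  # True
```

With these toy costs, the lenient threshold lands in the tens of percent and the strict one well under 5%, reproducing the direction (though not the calibrated values) of the paper's pancreatic-cancer versus prostate-cancer contrast.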
“Bayesian Statistics in Sociology: Past, Present, and Future”, Lynch & Bartlett 2019
2019lynch.pdf
: “Bayesian Statistics in Sociology: Past, Present, and Future”, (2019; similar):
Although Bayes’ theorem has been around for more than 250 years, widespread application of the Bayesian approach only began in statistics in 1990. By 2000, Bayesian statistics had made considerable headway into social science, but even now its direct use is rare in articles in top sociology journals, perhaps because of a lack of knowledge about the topic. In this review, we provide an overview of the key ideas and terminology of Bayesian statistics, and we discuss articles in the top journals that have used or developed Bayesian methods over the last decade. In this process, we elucidate some of the advantages of the Bayesian approach. We highlight that many sociologists are, in fact, using Bayesian methods, even if they do not realize it, because techniques deployed by popular software packages often involve Bayesian logic and/or computation. Finally, we conclude by briefly discussing the future of Bayesian statistics in sociology.
“Approximate Bayesian Computation”, Beaumont 2019
2019beaumont.pdf
: “Approximate Bayesian Computation”, (2019; similar):
Many of the statistical models that could provide an accurate, interesting, and testable explanation for the structure of a data set turn out to have intractable likelihood functions. The method of approximate Bayesian computation (ABC) has become a popular approach for tackling such models. This review gives an overview of the method and the main issues and challenges that are the subject of current research.
“Accounting Theory As a Bayesian Discipline”, Johnstone 2018
2018johnstone.pdf
: “Accounting Theory as a Bayesian Discipline”, (2018-12-28; ; similar):
Accounting Theory as a Bayesian Discipline introduces Bayesian theory and its role in statistical accounting information theory. The Bayesian statistical logic of probability, evidence and decision lies at the historical and modern center of accounting thought and research. It is not only the presumed rule of reasoning in analytical models of accounting disclosure, it is the default position for empiricists when hypothesizing about how the users of financial statements think. Bayesian logic comes to light throughout accounting research and is the soul of most strategic disclosure models. In addition, Bayesianism is similarly a large part of the stated and unstated motivation of empirical studies of how market prices and their implied costs of capital react to better financial disclosure.
The approach taken in this monograph is a Demski 1973-like treatment of “accounting numbers” as “signals” rather than as “measurements”. Of course, “good” measurements like “quality earnings” reports generally make better signals. However, to be useful for decision making under uncertainty, accounting measurements need to have more than established accounting measurement virtues. This monograph explains what those Bayesian information attributes are, where they come from in Bayesian theory, and how they apply in statistical accounting information theory.
The Bayesian logic of probability, evidence and decision is the presumed rule of reasoning in analytical models of accounting disclosure. Any rational explication of the decades-old accounting notions of “information content”, “value relevance”, “decision useful”, and possibly conservatism, is inevitably Bayesian. By raising some of the probability principles, paradoxes and surprises in Bayesian theory, intuition in accounting theory about information, and its value, can be tested and enhanced. Of all the branches of the social sciences, accounting information theory begs Bayesian insights.
This monograph lays out the main logical constructs and principles of Bayesianism, and relates them to important contributions in the theoretical accounting literature. The approach taken is essentially “old-fashioned” normative statistics, building on the expositions of Demski, Ijiri, Feltham and other early accounting theorists who brought Bayesian theory to accounting theory. Some history of this nexus, and the role of business schools in the development of Bayesian statistics in the 1950–1970s, is described. Later developments in accounting, especially noisy rational expectations models under which the information reported by firms is endogenous, rather than unaffected or “drawn from nature”, make the task of Bayesian inference more difficult yet no different in principle.
The information user must still revise beliefs based on what is reported. The extra complexity is that users must allow for the firm’s perceived disclosure motives and other relevant background knowledge in their Bayesian models. A known strength of Bayesian modelling is that subjective considerations are admitted and formally incorporated. Allowances for perceived self-interest or biased reporting, along with any other apparent signal defects or “information uncertainty”, are part and parcel of Bayesian information theory.
Introduction
Bayesianism Early in Accounting Theory
 Rise of Bayesian statistics
 Bayes in US business schools
 Early Bayesian accounting theorists
 Postscript
Survey of Bayesian Fundamentals
 All probability is subjective
 Inference comes first
 Bayesian learning
 No objective priors
 Independence is subjective
 No distinction between risk and uncertainty
 The likelihood function (ie. model)
 Sufficiency and the likelihood principle
 Coherence
 Coherent means no “Dutch book”
 Coherent is not necessarily accurate
 Accuracy is relative
 Odds form of Bayes theorem
 Data can’t speak for itself
 Ancillary information
 Nuisance parameters “integrate out”
 “Randomness” is subjective
 “Exchangeable” samples
 The Bayes factor
 Conditioning on all evidence
 Bayesian versus conventional inference
 Simpson’s paradox
 Data swamps prior
 Stable estimation
 Cromwell’s rule
 Decisions follow inference
 Inference, not estimation
 Calibration
 Economic scoring rules
 Market scoring rules
 Measures of information
 Ex ante versus ex post accuracy
 Sampling to a foregone conclusion
 Predictive distributions
 Model averaging
 Definition of a subjectivist Bayesian
 What makes a Bayesian?
 Rise of Bayesianism in data science
Case Study: Using All the Evidence
 Interpreting “p-level ≤ α”
 Bayesian interpretation of frequentist reports
 A generic inference problem
Is Accounting Bayesian or Frequentist?
 2 Bayesian schools in accounting
 Markowitz, subjectivist Bayesian
 Characterization of information in accounting
 Why accounting literature emphasizes “precision”
 Bayesian description of information quality
 Likelihood function of earnings
 Capturing conditional conservatism
Decision Support Role of Accounting Information
 A formal Bayesian model
 Parallels with meteorology
 Bayesian fundamental analysis
Demski’s (1973) Impossibility Result
 Example: binary accounting signals
 Conservatism and the user’s risk aversion
Does Information Reduce Uncertainty
 Beaver’s (1968) prescription
 Bayesian basics
 Contrary views in accounting
 Bayesian roots in finance
 The general Bayesian law
 Rogers et al 2009
 Dye & Hughes 2018
 Why a Predictive Distribution?
 Limits to certainty
 Lewellen & Shanken 2002
 Neururer et al 2016
 Veronesi 1999
How Information Combines
 Combining 2 risky signals
Ex Ante Effect of Greater Risk/Uncertainty
 Risk adds to ex ante expected utility
 Implications for Bayesian decision analysis
 Volatility pumping
Ex Post Decision Outcomes: 1. Practical investment
 Economic Darwinism
 Bayesian Darwinian selection
 Good probability assessments
 Implications for accounting information
Information Uncertainty
 Bayesian definition of information uncertainty
 Bayesian treatment of information uncertainty
 Model risk as information risk
Conditioning Beliefs and the Cost of Capital
Numerical example
Interpretation: 14. Reliance on the Normal-Normal Model
Intuitive counterexample
Appeal to the normal-normal model in accounting
Unknown variance, increasing after observation
Beyer 2009
Armstrong et al 2016
Bayesian Subjective Beta
 Core et al 2015
 Verrecchia 2001: Understated influence of the mean
 Decision analysis effect of the mean
Other Bayesian Points of Interest
 Accounting input in prediction models
 Earnings quality and accurate probability assessments
 Expected variance as a measure of information
 Information stays relevant
 Bayesian view of earnings management
 Numerator versus denominator news
 Mixtures of normals
 Information content
 Fundamental versus information risk
 When information adds to information asymmetry
 Value of independent information sources
 How might market probabilities behave?
 “Idiosyncratic” versus “undiversifiable” information
Conclusion
References
“The Bayesian Superorganism III: Externalized Memories Facilitate Distributed Sampling”, Hunt et al 2018
“The Bayesian Superorganism III: externalized memories facilitate distributed sampling”, (2018-12-21; ; similar):
A key challenge for any animal is to avoid wasting time by searching for resources in places it has already found to be unprofitable. This challenge is particularly strong when the organism is a central place forager—returning to a nest between foraging bouts—because it is destined repeatedly to cover much the same ground. Furthermore, this problem will reach its zenith if many individuals forage from the same central place, as in social insects.
Foraging performance may be greatly enhanced by coordinating movement trajectories such that each ant visits separate parts of the surrounding (unknown) space. In this third of three papers, we find experimental evidence for an externalized spatial memory in Temnothorax albipennis ants: chemical markers (either pheromones or other cues such as cuticular hydrocarbon footprints) that are used by nestmates to mark explored space. We show these markers could be used by the ants to scout the space surrounding their nest more efficiently through indirect coordination.
We also develop a simple model of this marking behaviour that can be applied in the context of Markov chain Monte Carlo methods (see part two of this series). This substantially enhances the performance of standard methods like the Metropolis-Hastings algorithm in sampling from sparse probability distributions (such as those confronted by the ants) with little additional computational cost.
Our Bayesian framework for superorganismal behaviour motivates the evolution of exploratory mechanisms such as trail marking in terms of enhanced collective information processing.
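The baseline the paper enhances is standard random-walk Metropolis-Hastings; for reference, a minimal self-contained version (the ants' marking modification itself is not reproduced here, and the 1-D Gaussian target is just a placeholder):

```python
import math, random

def metropolis_hastings(log_p, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings over a 1-D target given by its
    log-density (up to an additive constant)."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.gauss(0, step)   # symmetric Gaussian proposal
        # accept with probability min(1, p(x_new) / p(x))
        if rng.random() < math.exp(min(0.0, log_p(x_new) - log_p(x))):
            x = x_new
        samples.append(x)
    return samples

# placeholder target: standard normal, log p(x) = -x^2/2 + const
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=50_000)
burned = samples[5_000:]                 # discard burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
# mean ≈ 0, var ≈ 1 for the standard normal target
```

For sparse targets (density concentrated in small, separated regions), the proposal wastes most steps in near-zero-probability space; the externalized-memory idea amounts to biasing proposals away from already-explored regions.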
“Evolution As Backstop for Reinforcement Learning”, Branwen 2018
Backstop
: “Evolution as Backstop for Reinforcement Learning”, (2018-12-06; ; backlinks; similar):
Markets/evolution as backstops/ground truths for reinforcement learning/optimization: on some connections between Coase’s theory of the firm/linear optimization/DRL/evolution/multicellular life/pain/Internet communities as multi-level optimization problems.
One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that, with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting this phenomenon as a multi-level nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming, from a group selection perspective. So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.
“SMPY Bibliography”, Branwen 2018
SMPY
: “SMPY Bibliography”, (2018-07-28; ; backlinks; similar):
An annotated full-text bibliography of publications on the Study of Mathematically Precocious Youth (SMPY), a longitudinal study of high-IQ youth.
SMPY (Study of Mathematically Precocious Youth) is a long-running longitudinal survey of extremely mathematically-talented or intelligent youth, which has been following high-IQ cohorts since the 1970s. It has provided the largest and most concrete findings about the correlates and predictive power of screening extremely intelligent children, and revolutionized gifted & talented educational practices.
Because it has been running for over 40 years, SMPY-related publications are difficult to find; many early papers were published only in long-out-of-print books and are not available in any other way. Others are digitized and more accessible, but one must already know they exist. Between these barriers, SMPY information is less widely available & used than it should be given its importance.
To fix this, I have been gradually going through all SMPY citations and making full-text copies available online with occasional commentary.
 Missing
 Bibliography sources
 1950
 1970
 Keating & Stanley 1972
 Stanley 1973
 Hogan et al 1974
 Stanley et al 1974
 Hogan & Garvey 1975
 Keating 1975
 Solano & George 1975
 Gifted Child Quarterly 1976
 Cohn 1976
 Hogan & Garvey 1976
 Fox 1976b
 Fox 1976c
 Keating et al 1976
 Solano 1976
 Stanley 1976c
 Stanley 1976d
 George 1977
 Stanley 1977
 Stanley 1977b
 Stanley et al 1977
 Time 1977
 Albert 1978
 Cohn 1978
 Mills 1978
 Stanley 1978a
 Stanley 1978b
 Stanley & George 1978
 Cohn 1979
 Durden 1979
 Eisenberg & George 1979
 George & Stanley 1979
 Fox 1979
 Fox & Pyryt 1979
 George 1979
 George et al 1979
 Laycock 1979
 Mills 1979
 Stanley & George 1979
 1980
 Albert 1980
 Becker 1980
 Benbow 1980
 Benbow & Stanley 1980
 Fox et al 1980
 McClain & Durden 1980
 Mezynski & Stanley 1980
 Stanley 1980a
 Stanley 1980b
 House 1981
 Fox 1981
 Stanley 1981
 Bartkovich & Mezynski 1981
 Benbow 1981
 Benbow & Stanley 1982a
 Benbow & Stanley 1982b
 Moore 1982
 Sawyer & Daggett 1982
 Stanley & Benbow 1982
 Academic Precocity, Benbow & Stanley 1983a
 Benbow & Stanley 1983b
 Benbow & Stanley 1983c
 Benbow & Stanley 1983d
 Benbow et al 1983a
 Benbow et al 1983b
 Stanley 1983
 Stanley 1983b
 Stanley & Benbow 1983a
 Stanley & Benbow 1983b
 Stanley & Durden 1983
 Tursman 1983
 Benbow & Benbow 1984
 Benbow & Stanley 1984
 Holmes et al 1984
 Reynolds et al 1984
 Stanley 1984a
 Stanley 1984b
 Durden 1985
 Stanley 1985a
 Stanley 1985b
 Stanley 1985d
 Benbow 1986
 Benbow & Minor 1986
 Brody & Benbow 1986
 Stanley et al 1986
 University of North Texas, Julian C. Stanley archival materials (1986–1989)
 Benbow 1987a
 Benbow & Benbow 1987b
 Brody & Benbow 1987
 Fox 1987
 Stanley 1987a
 Stanley 1987b
 Stanley 1987c
 Stanley 1987d
 Stanley 1987e
 Benbow 1988
 Stanley 1988
 Anonymous 1989
 Stanley 1989a
 Stanley 1989b
 Stanley 1989c
 1990
 Benbow & Arjmand 1990
 Benbow & Minor 1990
 Dark & Benbow 1990
 Dauber & Benbow 1990
 Lubinski & Humphreys 1990
 Lupkowski et al 1990
 Lynch 1990
 Richardson & Benbow 1990
 Stanley 1990
 Stanley et al 1990
 Benbow et al 1991
 Stanley 1991a
 Stanley 1991b
 Stanley 1991c
 Swiatek & Benbow 1991a
 Swiatek & Benbow 1991b
 Brody et al 1991
 Benbow 1992a
 Benbow 1992b
 Kirschenbaum 1992
 Lubinski & Benbow 1992
 Lubinski & Humphreys 1992
 Pyryt & Moroz 1992
 Stanley 1992
 Stanley 1992b
 Benbow & Lubinski 1993a
 Benbow & Lubinski 1993b
 Bock & Ackrill 1993
 Lubinski et al 1993
 Mills 1993
 Southern et al 1993
 Sowell 1993
 Swiatek 1993
 Albert 1994
 Charlton et al 1994
 Lubinski & Benbow 1994
 Lubinski et al 1995
 Lubinski & Benbow 1995
 Sanders et al 1995
 Achter et al 1996
 Benbow & Lubinski 1996
 Benbow & Stanley 1996
 Lubinski et al 1996
 Stanley 1996
 Anonymous 1997
 Benbow & Lubinski 1997
 Johns Hopkins Magazine 1997
 Petrill et al 1997
 Stanley 1997
 Chorney et al 1998
 Pyryt 1998
 Schmidt et al 1998
 Achter et al 1999
 Lange 1999
 Norman et al 1999
 Rotigel & LupkowskiShoplik 1999
 2000
 Benbow et al 2000
 Heller et al 2000
 Lubinski & Benbow 2000
 Stanley 2000
 Lubinski et al 2001a
 Lubinski et al 2001b
 Plomin et al 2001
 Shea et al 2001
 Clark & Zimmerman 2002
 Moore 2002
 Webb et al 2002
 Achter & Lubinski 2003
 Kerr & Sodano 2003
 BleskeRechek et al 2004
 Lubinski 2004a
 Lubinski 2004b
 Benbow 2005
 Brody & Stanley 2005
 High Ability Studies 2005
 Wai et al 2005
 Benbow & Lubinski 2006
 Lubinski & Benbow 2006
 Lubinski et al 2006
 Muratori et al 2006
 Brody 2007
 Halpern et al 2007
 Lubinski & Benbow 2007
 Park 2007
 Swiatek 2007
 Webb et al 2007
 Leder 2008
 Benbow & Lubinski 2009
 Brody 2009
 Ferriman et al 2009
 Lubinski 2009a
 Lubinski 2009b
 Wai et al 2009
 Wai et al 2009b
 SteenbergenHu 2009
 2010
 Henshon 2010
 Lubinski 2010
 Robertson et al 2010
 Wai et al 2010
 Hunt 2011
 Touron & Touron 2011
 Benbow 2012
 Kell & Lubinski 2013
 Kell et al 2013a
 Kell et al 2013b
 Park et al 2013
 Nature 2013
 Stumpf et al 2013
 Beattie 2014
 Brody & Muratori 2014
 Lubinski et al 2014
 Kell & Lubinski 2014
 Wai 2014a
 Wai 2014b
 Brody 2015
 Lubinski 2016
 Makel et al 2016
 Spain et al 2016
 Kell et al 2017
 Wai & Kell 2017
 Lubinski 2018
 Bernstein et al 2019
 McCabe et al 2019
 Kell & Wai 2019
 2020
 See Also
“Deep Learning Generalizes Because the Parameter-function Map Is Biased towards Simple Functions”, Valle-Pérez et al 2018
“Deep learning generalizes because the parameter-function map is biased towards simple functions”, (2018-05-22; ; similar):
Deep neural networks (DNNs) generalize remarkably well without explicit regularization, even in the strongly overparameterized regime where classical learning theory would instead predict severe overfitting. While many proposals for some kind of implicit regularization have been made to rationalize this success, there is no consensus on the fundamental reason why DNNs do not strongly overfit.
In this paper, we provide a new explanation. By applying a very general probability-complexity bound recently derived from algorithmic information theory (AIT), we argue that the parameter-function map of many DNNs should be exponentially biased towards simple functions. We then provide clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully-connected and convolutional networks applied to CIFAR-10 and MNIST.
As the target functions in many real problems are expected to be highly structured, this intrinsic simplicity bias helps explain why deep networks generalize well on real-world problems. This picture also facilitates a novel PAC-Bayes approach where the prior is taken over the DNN input-output function space, rather than the more conventional prior over parameter space. If we assume that the training algorithm samples parameters close to uniformly within the zero-error region, then the PAC-Bayes theorem can be used to guarantee good expected generalization for target functions producing high-likelihood training sets.
By exploiting recently discovered connections between DNNs and Gaussian processes to estimate the marginal likelihood, we produce relatively tight PAC-Bayes generalization error bounds which correlate well with the true error on realistic datasets such as MNIST and CIFAR-10, for architectures including convolutional and fully-connected networks.
“On Having Enough Socks”, Branwen 2017
Socks
: “On Having Enough Socks”, (2017-11-22; ; backlinks; similar):
Personal experience and surveys on running out of socks; discussion of socks as small example of human procrastination and irrationality, caused by lack of explicit deliberative thought where no natural triggers or habits exist.
After running out of socks one day, I reflected on how ordinary tasks get neglected. Anecdotally and in 3 online surveys, people report often not having enough socks, a problem which correlates with rarity of sock purchases and demographic variables, consistent with a neglect/procrastination interpretation: because there is no specific time or triggering factor to replenish a shrinking sock stockpile, it is easy to run out.
This reminds me of akrasia on minor tasks, ‘yak shaving’, and the nature of disaster in complex systems: lack of hard rules lets errors accumulate, without any ‘global’ understanding of the drift into disaster (or at least inefficiency). Humans on a smaller scale also ‘drift’ when they engage in System I reactive thinking & action for too long, resulting in cognitive biases. An example of drift is the generalized human failure to explore/experiment adequately, resulting in overly greedy exploitative behavior of the current local optimum. Grocery shopping provides a case study: despite large gains, most people do not explore, perhaps because there is no established routine or practice involving experimentation. Fixes for these things can be seen as ensuring that System II deliberative cognition is periodically invoked to review things at a global level, such as developing a habit of maximum exploration at first purchase of a food product, or annually reviewing possessions to note problems like a lack of socks.
While socks may be small things, they may reflect big things.
“Implicit Causal Models for Genome-wide Association Studies”, Tran & Blei 2017
“Implicit Causal Models for Genome-wide Association Studies”, (2017-10-30; ; similar):
Progress in probabilistic generative models has accelerated, developing richer models with neural architectures, implicit densities, and with scalable algorithms for their Bayesian inference. However, there has been limited progress in models that capture causal relationships, for example, how individual genetic factors cause major human diseases.
In this work, we focus on two challenges in particular:
How do we build richer causal models, which can capture highly nonlinear relationships and interactions between multiple causes?
How do we adjust for latent confounders, which are variables influencing both cause and effect and which prevent learning of causal relationships?
To address these challenges, we synthesize ideas from causality and modern probabilistic modeling.
For the first, we describe implicit causal models, a class of causal models that leverages neural architectures with an implicit density.
For the second, we describe an implicit causal model that adjusts for confounders by sharing strength across examples.
In experiments, we scale Bayesian inference on up to a billion genetic measurements. We achieve state-of-the-art accuracy for identifying causal factors: we significantly outperform existing genetics methods by an absolute difference of 15–45.3%.
“A Rational Choice Framework for Collective Behavior”, Krafft 2017
“A Rational Choice Framework for Collective Behavior”, (2017-09; ; backlinks; similar):
As the world becomes increasingly digitally mediated, people can more and more easily form groups, teams, and communities around shared interests and goals. Yet there is a constant struggle across forms of social organization to maintain stability and coherency in the face of disparate individual experiences and agendas. When are collectives able to function and thrive despite these challenges? In this thesis I propose a theoretical framework for reasoning about collective intelligence—the ability of people to accomplish their shared goals together. A simple result from the literature on multi-agent systems suggests that strong general collective intelligence in the form of “rational group agency” arises from three conditions: aligned utilities, accurate shared beliefs, and coordinated actions. However, achieving these conditions can be difficult, as evidenced by impossibility results related to each condition from the literature on social choice, belief aggregation, and distributed systems. The theoretical framework I propose serves as a point of inspiration to study how human groups address these difficulties. To this end, I develop computational models of facets of human collective intelligence, and test these models in specific case studies. The models I introduce suggest distributed Bayesian inference as a framework for understanding shared belief formation, and also show that people can overcome other difficult computational challenges associated with achieving rational group agency, including balancing the group “exploration versus exploitation dilemma” for information gathering and inferring levels of “common p-belief” to coordinate actions.
“Statistical Correction of the Winner’s Curse Explains Replication Variability in Quantitative Trait Genomewide Association Studies”, Palmer & Pe’er 2017
“Statistical correction of the Winner’s Curse explains replication variability in quantitative trait genome-wide association studies”, (2017-07-10; backlinks; similar):
Genome-wide association studies (GWAS) have identified hundreds of SNPs responsible for variation in human quantitative traits. However, genome-wide-significant associations often fail to replicate across independent cohorts, in apparent inconsistency with their strong effects in discovery cohorts. This limited success of replication raises pervasive questions about the utility of the GWAS field.
We identify all 332 studies of quantitative traits from the NHGRI-EBI GWAS Database with attempted replication. We find that the majority of studies provide insufficient data to evaluate replication rates. The remaining papers replicate statistically-significantly worse than expected (p < 10^{−14}), even when adjusting for the regression-to-the-mean of effect size between discovery and replication cohorts termed the Winner’s Curse (p < 10^{−16}). We show this is due in part to misreporting replication cohort-size as a maximum number, rather than a per-locus one. In 39 studies accurately reporting per-locus cohort-size for attempted replication of 707 loci in samples with similar ancestry, replication rate matched expectation (predicted 458, observed 457, p = 0.94). In contrast, ancestry differences between replication and discovery (13 studies, 385 loci) cause the most highly-powered decile of loci to replicate worse than expected, due to differences in linkage disequilibrium.
Author summary:
The majority of associations between common genetic variation and human traits come from genome-wide association studies, which have analyzed millions of single-nucleotide polymorphisms in millions of samples. These kinds of studies pose serious statistical challenges to discovering new associations. Finite resources restrict the number of candidate associations that can be brought forward into validation samples, introducing the need for a statistical-significance threshold. This threshold creates a phenomenon called the Winner’s Curse, in which candidate associations close to the discovery threshold are more likely to have biased overestimates of the variant’s true association in the sampled population.
We survey all human quantitative trait association studies that validated at least one signal. We find the majority of these studies do not publish sufficient information to actually support their claims of replication. For studies that did, we computationally correct for the Winner’s Curse and evaluate replication performance. While all variants combined replicate statistically-significantly less than expected, we find that the subset of studies that (1) perform both discovery and replication in samples of the same ancestry; and (2) report accurate per-variant sample sizes, replicate as expected.
This study provides strong, rigorous evidence for the broad reliability of genome-wide association studies. We furthermore provide a model for more efficient selection of variants as candidates for replication, as selecting variants using cursed discovery data enriches for variants with little real evidence for trait association.
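The Winner’s Curse itself is easy to demonstrate by simulation; a minimal sketch with all numbers invented (the threshold is illustrative, not the genome-wide 5 × 10⁻⁸ convention):

```python
import random, statistics

random.seed(42)

true_beta = 0.05   # true per-SNP effect size (invented)
se = 0.02          # standard error of the discovery-cohort estimate (invented)
z_threshold = 3.0  # significance cutoff (illustrative)

# Simulate discovery-cohort estimates for many SNPs sharing the same true
# effect, then keep only those passing the significance threshold.
estimates = [random.gauss(true_beta, se) for _ in range(100_000)]
significant = [b for b in estimates if b / se > z_threshold]

# Conditioning on significance selects the upper tail of the noise, so the
# surviving estimates systematically overstate true_beta: the Winner's Curse.
print(round(statistics.mean(significant), 4))
```

Replication cohorts powered against these inflated discovery estimates will then appear to “fail”, which is exactly the bias the paper corrects for.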
“A Tutorial on Thompson Sampling”, Russo et al 2017
“A Tutorial on Thompson Sampling”, (2017-07-07; ; similar):
Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
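A minimal sketch of Thompson sampling for the Bernoulli bandit case the tutorial opens with (the arm probabilities here are invented):

```python
import random

random.seed(0)

def thompson_bernoulli(true_rates, rounds=5000):
    """Thompson sampling for a Bernoulli bandit with Beta(1,1) priors."""
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(rounds):
        # Sample a plausible rate for each arm from its Beta posterior...
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        # ...and play the arm whose sampled rate is highest.
        arm = max(range(k), key=lambda i: samples[i])
        if random.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

succ, fail = thompson_bernoulli([0.3, 0.5, 0.7])
pulls = [s + f for s, f in zip(succ, fail)]
print(pulls)  # the 0.7 arm should dominate the pull counts
```

Sampling from the posterior (rather than always playing the posterior-mean best arm) is what balances exploration against exploitation: uncertain arms occasionally produce high samples and get tried.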
“Black-Box Data-efficient Policy Search for Robotics”, Chatzilygeroudis et al 2017
“Black-Box Data-efficient Policy Search for Robotics”, (2017-03-21; ):
The most data-efficient algorithms for reinforcement learning (RL) in robotics are based on uncertain dynamical models: after each episode, they first learn a dynamical model of the robot, then they use an optimization algorithm to find a policy that maximizes the expected return given the model and its uncertainties. It is often believed that this optimization can be tractable only if analytical, gradient-based algorithms are used; however, these algorithms require using specific families of reward functions and policies, which greatly limits the flexibility of the overall approach. In this paper, we introduce a novel model-based RL algorithm, called Black-DROPS (Black-box Data-efficient RObot Policy Search), that: (1) does not impose any constraint on the reward function or the policy (they are treated as black-boxes), (2) is as data-efficient as the state-of-the-art algorithm for data-efficient RL in robotics, and (3) is as fast (or faster) than analytical approaches when several cores are available. The key idea is to replace the gradient-based optimization algorithm with a parallel, black-box algorithm that takes into account the model uncertainties. We demonstrate the performance of our new algorithm on two standard control benchmark problems (in simulation) and a low-cost robotic manipulator (with a real robot).
“ZMA Sleep Experiment”, Branwen 2017
ZMA
: “ZMA Sleep Experiment”, (2017-03-13; ; backlinks; similar):
A randomized blinded self-experiment of the effects of ZMA (zinc + magnesium + vitamin B6) on my sleep; results suggest a small benefit to sleep quality but are underpowered and damaged by Zeo measurement error/data issues.
I ran a blinded randomized self-experiment of the effect of 2.5g nightly ZMA powder on Zeo-recorded sleep data during March–October 2017 (n = 127). The linear model and SEM model show no statistically-significant effects or high posterior probability of benefits, although all point-estimates were in the direction of benefits. Data quality issues reduced the available dataset, rendering the experiment particularly underpowered and the results more inconclusive. I decided to not continue use of ZMA after running out; ZMA may help my sleep, but I need to improve data quality before attempting any further sleep self-experiments on it.
“Self-Blinded Mineral Water Taste Test”, Branwen 2017
Water
: “Self-Blinded Mineral Water Taste Test”, (2017-02-15; ; backlinks; similar):
Blind randomized taste-test of mineral/distilled/tap waters using Bayesian best-arm finding; no large differences in preference.
The kind of water used in tea is claimed to make a difference in the flavor: mineral water being better than tap water or distilled water. However, mineral water is vastly more expensive than tap water.
To test the claim, I run a preliminary test of pure water to see if any water differences are detectable at all. I compared my tap water, 3 distilled water brands (Great Value, Nestle Pure Life, & Poland Spring), 1 osmosis-purified brand (Aquafina), and 3 non-carbonated mineral water brands (Evian, Voss, & Fiji) in a series of n = 67 blinded randomized comparisons of water flavor. The comparisons are modeled using a Bradley-Terry competitive model implemented in Stan; comparisons were chosen using an adaptive Bayesian best-arm sequential trial (racing) method designed to locate the best-tasting water in the minimum number of samples by preferentially comparing the best-known arm to potentially superior arms. Blinding & randomization are achieved by using a Lazy Susan to physically randomize two identical (but marked in a hidden spot) cups of water.
The final posterior distribution indicates that some differences between waters are likely to exist but are small & imprecisely estimated and of little practical concern.
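A minimal sketch of fitting a Bradley-Terry model to simulated pairwise preferences by maximum likelihood; this is a plain gradient-ascent toy with invented “tastiness” scores, not the Stan model or the racing design described above:

```python
import math, random

random.seed(1)

# Hypothetical latent tastiness scores for 3 waters (invented data).
true_scores = [0.0, 0.5, 1.0]

def p_win(si, sj):
    """Bradley-Terry probability that item i beats item j."""
    return 1.0 / (1.0 + math.exp(sj - si))

# Simulate blinded pairwise comparisons.
comparisons = []
for _ in range(500):
    i, j = random.sample(range(3), 2)
    if random.random() < p_win(true_scores[i], true_scores[j]):
        comparisons.append((i, j))  # (winner, loser)
    else:
        comparisons.append((j, i))

# Fit scores by gradient ascent on the Bradley-Terry log-likelihood.
scores = [0.0] * 3
lr = 2.0 / len(comparisons)
for _ in range(1000):
    grad = [0.0] * 3
    for w, l in comparisons:
        p = p_win(scores[w], scores[l])
        grad[w] += 1 - p   # winner's score pushed up by surprise of the win
        grad[l] -= 1 - p   # loser's score pushed down symmetrically
    scores = [s + lr * g for s, g in zip(scores, grad)]

scores = [s - scores[0] for s in scores]  # anchor item 0 at score 0
print([round(s, 2) for s in scores])
```

Only score *differences* are identified, hence the anchoring of item 0 at zero; the Stan version handles this with a prior instead.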
“The Kelly Coin-Flipping Game: Exact Solutions”, Branwen et al 2017
Coin-flip
: “The Kelly Coin-Flipping Game: Exact Solutions”, (2017-01-19; ; backlinks; similar):
Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016’s 300-round double-or-nothing coin-flipping game with an edge and ceiling, better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC.
Haghani & Dewey 2016 experiment with a double-or-nothing coin-flipping game where the player starts with $30.4[^\$25.0^~2016~]{.supsub} and has an edge of 60%, and can play 300 times, choosing how much to bet each time, winning up to a maximum ceiling of $303.8[^\$250.0^~2016~]{.supsub}. Most of their subjects fail to play well, earning an average $110.6[^\$91.0^~2016~]{.supsub}, compared to Haghani & Dewey 2016’s heuristic benchmark of ~$291.6[^\$240.0^~2016~]{.supsub} in winnings achievable using a modified Kelly Criterion as their strategy. The KC, however, is not optimal for this problem, as it ignores the ceiling and the limited number of plays.
We solve the problem of the value of optimal play exactly by using decision trees & dynamic programming for calculating the value function, with implementations in R, Haskell, and C. We also provide a closed-form exact value formula in R & Python, several approximations using Monte Carlo/random forests/neural networks, visualizations of the value function, and a Python implementation of the game for the OpenAI Gym collection. We find that optimal play yields $246.61 on average (rather than ~$240), and so the human players actually earned only 36.8% of what was possible, losing $155.6 in potential profit. Comparing decision trees and the Kelly criterion for various horizons (bets left), the relative advantage of the decision tree strategy depends on the horizon: it is highest when the player can make few bets (at b = 23, with a difference of ~$36), and decreases with the number of bets as more strategies hit the ceiling.
In the Kelly game, the maximum winnings, number of rounds, and edge are fixed; we describe a more difficult generalized version in which the 3 parameters are drawn from Pareto, normal, and beta distributions and are unknown to the player (who can use Bayesian inference to try to estimate them during play). Upper and lower bounds are estimated on the value of this game. In the variant of this game where subjects are not told the exact edge of 60%, a Bayesian decision tree approach shows that performance can closely approach that of the decision tree, with a penalty for 1 plausible prior of only $1. Two deep reinforcement learning agents, DQN & DDPG, are implemented, but DQN fails to learn and DDPG doesn’t show acceptable performance, indicating better deep RL methods may be required to solve the generalized Kelly game.
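The dynamic-programming value function can be sketched with a memoized recursion; this is a simplified integer-dollar variant with only 20 rounds (rather than 300) to keep it fast, not the essay’s full implementation:

```python
from functools import lru_cache

# Simplified integer-dollar variant of the Haghani & Dewey game:
# 60% edge, $25 starting stake, $250 cap, 20 rounds.
P_WIN, CAP = 0.6, 250

@lru_cache(maxsize=None)
def value(wealth, rounds_left):
    """Expected final wealth under optimal play, restricted to integer bets."""
    if rounds_left == 0 or wealth == 0 or wealth == CAP:
        return float(wealth)  # busted or at the cap: nothing left to gain
    # Bellman recursion: pick the bet maximizing expected continuation value.
    return max(
        P_WIN * value(min(wealth + bet, CAP), rounds_left - 1)
        + (1 - P_WIN) * value(wealth - bet, rounds_left - 1)
        for bet in range(wealth + 1)
    )

print(round(value(25, 20), 2))
```

Because the ceiling and finite horizon enter the recursion directly, the optimal bets shade away from the fixed Kelly fraction as wealth approaches the cap or rounds run out.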
“Banner Ads Considered Harmful”, Branwen 2017
Ads
: “Banner Ads Considered Harmful”, (2017-01-08; ; backlinks; similar):
9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.
One source of complexity & JavaScript use on Gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss, and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg. social media behavior), rendering the usual A/B testing method of per-pageload/session randomization incorrect; instead, it would be better to analyze total traffic as a time-series experiment.
Design: A decision analysis of revenue vs. readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical Gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions. Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of an expected traffic loss of −9.7% (driven by the subset of users without adblock), with an implied −14% traffic loss if all traffic were exposed to ads (95% credible interval: −13% to −16%), exceeding my decision threshold for disabling ads & strongly ruling out the possibility of acceptably small losses which might justify further experimentation.
Thus, banner ads on Gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.
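The autocorrelation problem motivating the time-series design can be illustrated with a toy simulation (all numbers invented): autocorrelated daily log-traffic with ads served in randomized 2-day on/off blocks, analyzed by a naive difference in means:

```python
import random, statistics

random.seed(2017)

ad_effect = -0.10  # hypothetical true effect of ads on log-traffic
phi = 0.7          # AR(1) autocorrelation of daily traffic (invented)
days = 2000

# Randomized paired 2-day blocks of ads on/off, echoing the experiment's design.
blocks = [random.choice([0, 1]) for _ in range(days // 2)]
ads = [b for b in blocks for _ in (0, 1)]  # each block covers 2 days

traffic, prev = [], 0.0
for t in range(days):
    prev = phi * prev + random.gauss(0, 0.1)  # AR(1) noise process
    traffic.append(10.0 + prev + ad_effect * ads[t])

# A naive difference-in-means recovers the effect here, but under strong
# autocorrelation its true standard error is much larger than an iid analysis
# would report; hence the ARIMA model in the actual analysis.
on = [y for y, a in zip(traffic, ads) if a]
off = [y for y, a in zip(traffic, ads) if not a]
est = statistics.mean(on) - statistics.mean(off)
print(round(est, 3))
```

The blocked randomization still identifies the effect; what the autocorrelation destroys is the *precision* an iid-based power analysis would promise, which is why standard tests were underpowered on the historical data.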
“Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles”, Lakshminarayanan et al 2016
“Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles”, (2016-12-05; ; backlinks; similar):
Deep neural networks (NNs) are powerful black-box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in NNs is a challenging yet unsolved problem.
Bayesian NNs, which learn a distribution over weights, are currently the state-of-the-art for estimating predictive uncertainty; however, these require significant modifications to the training procedure and are computationally expensive compared to standard (non-Bayesian) NNs.
We propose an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high-quality predictive uncertainty estimates. Through a series of experiments on classification and regression benchmarks, we demonstrate that our method produces well-calibrated uncertainty estimates which are as good as or better than approximate Bayesian NNs. To assess robustness to dataset shift, we evaluate the predictive uncertainty on test examples from known and unknown distributions, and show that our method is able to express higher uncertainty on out-of-distribution examples.
We demonstrate the scalability of our method by evaluating predictive uncertainty estimates on ImageNet.
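For regression, the ensemble’s prediction is treated as an equal-weight Gaussian mixture over the member networks; a minimal sketch of that moment-matching combination (my paraphrase of the combination rule, not the paper’s code):

```python
def ensemble_moments(means, variances):
    """Combine each network's Gaussian prediction (mu_i, sigma_i^2) into the
    mean and variance of the equal-weight Gaussian mixture."""
    m = len(means)
    mu = sum(means) / m
    # Mixture variance = E[x^2] - E[x]^2:
    var = sum(v + mu_i ** 2 for v, mu_i in zip(variances, means)) / m - mu ** 2
    return mu, var

# Disagreement between networks inflates the predictive variance:
mu, var = ensemble_moments([1.0, 3.0], [0.5, 0.5])
print(mu, var)  # -> 2.0 1.5 (0.5 within-network + 1.0 between-network)
```

The variance decomposes into the average per-network variance plus the variance of the means, so out-of-distribution inputs, where the independently trained networks disagree, automatically get wider predictive intervals.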
“Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016
“Bayesian Reinforcement Learning: A Survey”, (2016-09-14; ; backlinks; similar):
Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods in the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are: (1) it provides an elegant approach to action-selection (exploration/exploitation) as a function of the uncertainty in learning; and (2) it provides a machinery to incorporate prior knowledge into the algorithms.
We first discuss models and methods for Bayesian inference in the simple single-step Bandit model. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed on the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class.
The objective of the paper is to provide a comprehensive survey on Bayesian RL algorithms and their theoretical and empirical properties.
“Why Tool AIs Want to Be Agent AIs”, Branwen 2016
ToolAI
: “Why Tool AIs Want to Be Agent AIs”, (2016-09-07; ; backlinks; similar):
AIs limited to pure computation (Tool AIs) supporting humans will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and meta-learn, because all problems are reinforcement-learning problems.
Autonomous AI systems (Agent AIs) trained using reinforcement learning can do harm when they take wrong actions, especially superintelligent Agent AIs. One solution would be to eliminate their agency by not giving AIs the ability to take actions, confining them to purely informational or inferential tasks such as classification or prediction (Tool AIs), and have all actions be approved & executed by humans, giving equivalently superintelligent results without the risk.
I argue that this is not an effective solution for two major reasons. First, because Agent AIs will by definition be better at actions than Tool AIs, giving them an economic advantage. Secondly, because Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as long-term memories or external software or large databases or the Internet, and how best to acquire new data.
All of these actions will result in Agent AIs more intelligent than Tool AIs, in addition to their greater economic competitiveness. Thus, Tool AIs will be inferior to Agent AIs in both actions and intelligence, implying use of Tool AIs is an even more highly unstable equilibrium than previously argued, as users of Agent AIs will be able to outcompete them on two dimensions (and not just one).
“‘Genius Revisited’ Revisited”, Branwen 2016
Hunter
: “‘Genius Revisited’ Revisited”, (2016-06-19; ; backlinks; similar):
A book study of surveys of the high-IQ elementary school HCES concludes that high IQ is not predictive of accomplishment; I point out that the results are consistent with regression to the mean from extremely early IQ tests and a small total sample size.
Genius Revisited documents the longitudinal results of a high-IQ/gifted-and-talented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the generally high education & income levels, but absence of great accomplishment on a national or global scale (eg. a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.
I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making 2 statistical errors: base-rate neglect, and regression to the mean.
First, their standards fall prey to a base-rate fallacy: even extreme predictive value of IQ would not predict 1 or more Nobel prizes, because Nobel prize odds are measured at 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.
Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable, as it is driven by the regression to the mean caused by the relatively low correlation of early childhood with adult IQs—which means their sample is far less elite as adults than they believe. Using early-childhood/adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES’s associated high-IQ/gifted-and-talented high school, Hunter High, which has access to the adolescents’ more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
This unavoidable statistical fact undermines the main rationale of HCES: extremely high-IQ adults cannot be accurately selected as kindergartners on the basis of a simple test. This greater-regression problem can be lessened by the use of additional variables in admissions, such as parental IQs or high-quality genetic polygenic scores; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and that HCES students should not be assigned scarce magnet high school slots.
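The regression-to-the-mean arithmetic can be sketched directly; the 0.58 correlation used here is an illustrative value chosen to reproduce the ~133 figure above, not a quoted estimate:

```python
def expected_adult_iq(child_iq, r, mean=100.0):
    """Expected adult IQ given a childhood IQ, assuming bivariate normality
    with equal SDs: scores regress toward the mean by the correlation r."""
    return mean + r * (child_iq - mean)

# With an early-childhood/adult IQ correlation of ~0.58 (illustrative),
# a kindergarten mean of 157 regresses to roughly 133 as adults:
print(round(expected_adult_iq(157, 0.58)))  # -> 133
```

The same formula shows why Hunter High fares better: selecting on adolescent scores with a higher adult correlation r leaves less room for the selected mean to regress.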
“Candy Japan’s New Box A/B Test”, Branwen 2016
CandyJapan
: “Candy Japan’s new box A/B test”, (2016-05-06; ; backlinks; similar):
Bayesian decision-theoretic analysis of the effect of fancier packaging on subscription cancellations & optimal experiment design.
I analyze an A/B test from a mail-order company of two different kinds of box packaging from a Bayesian decision-theory perspective, balancing posterior probability of improvements & greater profit against the cost of packaging & risk of worse results, finding that as the company’s analysis suggested, the new box is unlikely to be sufficiently better than the old. Calculating expected values of information shows that it is not worth experimenting on further, and that such fixed-sample trials are unlikely to ever be cost-effective for packaging improvements. However, adaptive experiments may be worthwhile.
“Calculating The Gaussian Expected Maximum”, Branwen 2016
Orderstatistics
: “Calculating The Gaussian Expected Maximum”, (2016-01-22; ; backlinks; similar):
In generating a sample of n datapoints drawn from a normal/Gaussian distribution, how big on average the biggest datapoint is will depend on how large n is. I implement a variety of exact & approximate calculations from the literature in R to compare efficiency & accuracy.
In generating a sample of n datapoints drawn from a normal/Gaussian distribution with a particular mean/SD, how big on average the biggest datapoint is will depend on how large n is. Knowing this average is useful in a number of areas like sports or breeding or manufacturing, as it defines how bad/good the worst/best datapoint will be (eg. the score of the winner in a multiplayer game).
The order statistic of the mean/average/expectation of the maximum of a draw of n samples from a normal distribution has no exact formula, unfortunately, and is generally not built into any programming language’s libraries.
I implement & compare some of the approaches to estimating this order statistic in the R programming language, for both the maximum and the general order statistic. The overall best approach is to calculate the exact order statistics for the n range of interest using numerical integration via
lmomco
and cache them in a lookup table, rescaling the mean/SD as necessary for arbitrary normal distributions; next best is a polynomial regression approximation; finally, the Elfving correction to the Blom 1958 approximation is fast, easily implemented, and accurate for reasonably large n such as n > 100.
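The two main approaches can be sketched outside R as well; here is a self-contained Python toy version (a reimplementation for illustration, not the lmomco-based code the essay describes): exact numerical integration of the order-statistic density, and the Blom approximation with Elfving's correction.

```python
import math

def norm_pdf(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
def norm_cdf(x): return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def exact_max(n, lo=-10.0, hi=10.0, steps=20000):
    """E[max of n std-normal draws]: Simpson integration of x*n*pdf(x)*cdf(x)^(n-1);
    rescale as mean + sd * exact_max(n) for an arbitrary Normal(mean, sd)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 1 if i in (0, steps) else (4 if i % 2 else 2)  # Simpson weights
        total += w * x * n * norm_pdf(x) * norm_cdf(x) ** (n - 1)
    return total * h / 3

def norm_ppf(p, lo=-10.0, hi=10.0):
    """Inverse normal CDF by bisection (crude but adequate here)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def blom_elfving(n, alpha=math.pi / 8):
    """Blom's approximation with Elfving's correction alpha = pi/8."""
    return norm_ppf((n - alpha) / (n - 2 * alpha + 1))

print(round(exact_max(100), 4), round(blom_elfving(100), 4))
```

For n = 100 the two agree to roughly 3 decimal places (E[max] ≈ 2.51), consistent with the Elfving correction being accurate for large n.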
“Top 10 Replicated Findings From Behavioral Genetics”, Plomin et al 2016 (page 10)
2016plomin.pdf#page=10
: “Top 10 Replicated Findings From Behavioral Genetics”, (2016; ; backlinks; similar):
Finding 7. Most measures of the “environment” show substantial genetic influence
Although it might seem a peculiar thing to do, measures of the environment widely used in psychological science—such as parenting, social support, and life events—can be treated as dependent measures in genetic analyses. If they are truly measures of the environment, they should not show genetic influence. To the contrary, in 1991, Plomin and Bergeman conducted a review of the first 18 studies in which environmental measures were used as dependent measures in genetically sensitive designs and found evidence for genetic influence for these measures of the environment. Substantial genetic influence was found for objective measures such as videotaped observations of parenting as well as self-report measures of parenting, social support, and life events. How can measures of the environment show genetic influence? The reason appears to be that such measures do not assess the environment independent of the person. As noted earlier, humans select, modify, and create environments correlated with their genetic behavioral propensities such as personality and psychopathology (McAdams, Gregory, & Eley, 2013). For example, in studies of twin children, parenting has been found to reflect genetic differences in children’s characteristics such as personality and psychopathology (Avinun & Knafo, 2014; Klahr & Burt, 2014; Plomin, 1994).
Since 1991, more than 150 articles have been published in which environmental measures were used in genetically sensitive designs; they have shown consistently that there is substantial genetic influence on environmental measures, extending the findings from family environments to neighborhood, school, and work environments. Kendler and Baker (2007) conducted a review of 55 independent genetic studies and found an average heritability of 0.27 across 35 diverse environmental measures (confidence intervals not available). Meta-analyses of parenting, the most frequently studied domain, have shown genetic influence that is driven by child characteristics (Avinun & Knafo, 2014) as well as by parent characteristics (Klahr & Burt, 2014). Some exceptions have emerged. Not surprisingly, when life events are separated into uncontrollable events (eg. death of a spouse) and controllable life events (eg. financial problems), the former show nonsignificant genetic influence. In an example of how all behavioral genetic results can differ in different cultures, Shikishima, Hiraishi, Yamagata, Neiderhiser, and Ando (2012) compared parenting in Japan and Sweden and found that parenting in Japan showed more genetic influence than in Sweden, consistent with the view that parenting is more child-centered in Japan than in the West.
Researchers have begun to use GCTA to replicate these findings from twin studies. For example, GCTA has been used to show substantial genetic influence on stressful life events (Power et al 2013) and on variables often used as environmental measures in epidemiological studies such as years of schooling (C. A. Rietveld, Medland, et al 2013). Use of GCTA can also circumvent a limitation of twin studies of children. Such twin studies are limited to investigating within-family (twin-specific) experiences, whereas many important environmental factors such as socioeconomic status (SES) are the same for two children in a family. However, researchers can use GCTA to assess genetic influence on family environments such as SES that differ between families, not within families. GCTA has been used to show genetic influence on family SES (Trzaskowski et al 2014) and an index of social deprivation (Marioni et al 2014).
“World Catnip Surveys”, Branwen 2015
Catnipsurvey
: “World Catnip Surveys”, (2015-11-15; ; backlinks; similar):
International population online surveys of cat owners about catnip and other cat stimulant use.
In compiling a meta-analysis of reports of catnip response rates in domestic cats, yielding a meta-analytic average of ~2⁄3, the available data suggests heterogeneity from cross-country differences in rates (possibly for genetic reasons) but is insufficient to definitively demonstrate the existence of or estimate those differences (particularly a possible extremely high catnip response rate in Japan). I use Google Surveys August–September 2017 to conduct a brief 1-question online survey of a proportional population sample of 9 countries about cat ownership & catnip use, specifically: Canada, the USA, UK, Japan, Germany, Brazil, Spain, Australia, & Mexico. In total, I surveyed n = 31,471 people, of whom n = 9,087 are cat owners, of whom n = 4,402 report having used catnip on their cat, and of whom n = 2,996 report a catnip response.
The survey yields catnip response rates of Canada (82%), USA (79%), UK (74%), Japan (71%), Germany (57%), Brazil (56%), Spain (54%), Australia (53%), and Mexico (52%). The differences are substantial and of high posterior probability, supporting the existence of large cross-country differences. In additional analysis, the other conditional probabilities of cat ownership and trying catnip with a cat appear to correlate with catnip response rates; this intercorrelation suggests a “cat factor” of some sort influencing responses, although what causal relationship there might be between proportion of cat owners and proportion of catnip-responder cats is unclear.
An additional survey of a convenience sample of primarily US Internet users about catnip is reported, although the improbable catnip response rates compared to the population survey suggest the respondents are either highly unrepresentative or the questions caused demand bias.
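As a sketch of how such a cross-country comparison can be assessed, one can compute the posterior probability that one country's response rate exceeds another's under independent Beta(1,1) priors; the counts below (82/100 vs 52/100) are invented round numbers echoing the Canada and Mexico point estimates, not the survey's actual cell sizes:

```python
import random

random.seed(2)

def p_greater(succ_a, n_a, succ_b, n_b, draws=20_000):
    """Monte Carlo estimate of P(rate_a > rate_b) under independent Beta(1,1) priors."""
    wins = 0
    for _ in range(draws):
        a = random.betavariate(1 + succ_a, 1 + n_a - succ_a)
        b = random.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += a > b
    return wins / draws

# invented counts: 82/100 responders vs 52/100 responders
print(p_greater(82, 100, 52, 100))
```

With a 30-percentage-point gap even n = 100 per country gives a posterior probability of a real difference near 1; the survey's much larger samples make the reported differences correspondingly firmer.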
“Catnip Immunity and Alternatives”, Branwen 2015
Catnip
: “Catnip immunity and alternatives”, (2015-11-07; ; backlinks; similar):
Estimation of catnip immunity rates by country with metaanalysis and surveys, and discussion of catnip alternatives.
Not all cats respond to the catnip stimulant; the rate of responders is generally estimated at ~70% of cats. A meta-analysis of catnip response experiments since the 1940s indicates the true value is ~62%. The low quality of studies and the reporting of their data makes examination of possible moderators like age, sex, and country difficult. Catnip responses have been recorded for a number of species both inside and outside the Felidae family; of them, there is evidence for a catnip response in the Felidae, and, more uncertainly, the Paradoxurinae, and Herpestinae.
To extend the analysis, I run largescale online surveys measuring catnip response rates globally in domestic cats, finding high heterogeneity but considerable rates of catnip immunity worldwide.
As a piece of practical advice for cat-hallucinogen sommeliers, I treat catnip response & finding catnip substitutes as a decision problem, modeling it as a Markov decision process where one wishes to find a working psychoactive at minimum cost. Bol et al 2017 measured multiple psychoactives simultaneously in a large sample of cats, permitting prediction of responses conditional on not responding to others. (The solution to the specific problem is to test in the sequence catnip → honeysuckle → silvervine → Valerian.)
For discussion of cat psychology in general, see my Cat Sense review.
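The decision problem above can be sketched as an exhaustive search over testing orders. This is a simplification: the essay conditions on Bol et al 2017's joint response data, whereas this toy version assumes independent responses, with invented rates and costs:

```python
from itertools import permutations

# Invented response rates & per-stimulant costs (NOT Bol et al 2017's numbers);
# responses are assumed independent, unlike the essay's conditional analysis.
p_response = {"catnip": 0.65, "honeysuckle": 0.50, "silvervine": 0.60, "valerian": 0.45}
cost = {"catnip": 1.0, "honeysuckle": 2.0, "silvervine": 3.0, "valerian": 3.0}

def expected_cost(order):
    """Expected total cost of testing in this order, stopping at the first hit."""
    total, p_reached = 0.0, 1.0
    for s in order:
        total += p_reached * cost[s]    # pay for this test if we get this far
        p_reached *= 1 - p_response[s]  # continue only if the cat ignores it
    return total

best = min(permutations(p_response), key=expected_cost)
print(best, round(expected_cost(best), 3))
```

With these invented numbers the optimum happens to match the essay's catnip → honeysuckle → silvervine → Valerian sequence, because under independence the optimal order simply sorts stimulants by response probability per unit cost.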
“Bitter Melon for Blood Glucose”, Branwen 2015
Melon
: “Bitter Melon for blood glucose”, (2015-09-14; ; similar):
Analysis of whether bitter melon reduces blood glucose in one self-experiment and utility of further self-experimentation.
I reanalyze a bitter-melon/blood-glucose self-experiment, finding a small effect of increasing blood glucose after correcting for temporal trends & daily variation, giving both frequentist & Bayesian analyses. I then analyze the self-experiment from a subjective Bayesian decision-theoretic perspective, cursorily estimating the costs of diabetes & benefits of intervention in order to estimate Value Of Information for the self-experiment and the benefit of further self-experimenting; I find that the expected value of more data (EVSI) is negative and further self-experimenting would not be optimal compared to trying out other anti-diabetes interventions.
“Resorting Media Ratings”, Branwen 2015
Resorter
: “Resorting Media Ratings”, (2015-09-07; ; backlinks; similar):
Command-line tool providing interactive statistical pairwise ranking and sorting of items.
User-created datasets using ordinal scales (such as media ratings) have tendencies to drift or ‘clump’ towards the extremes and fail to be as informative as possible, falling prey to ceiling effects and making it difficult to distinguish between the mediocre & excellent.
This can be counteracted by re-rating the dataset to create a uniform (and hence, informative) distribution of ratings—but such manual re-rating is difficult.
I provide an anytime CLI program,
resorter
, written in R (should be cross-platform but only tested on Linux) which keeps track of comparisons, infers underlying ratings assuming that they are noisy in the ELO-like Bradley-Terry model, and interactively & intelligently queries the user with comparisons of the media with the most uncertain current ratings, until the user ends the session and a fully rescaled set of ratings is output.
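A minimal version of the underlying inference, fitting Bradley-Terry strengths from pairwise comparisons via the classic MM/Zermelo iteration, might look like this (toy comparison data; a sketch, not resorter's actual R code):

```python
# Invented comparison counts: wins[(i, j)] = number of times i beat j.
wins = {("A", "B"): 3, ("B", "A"): 1, ("B", "C"): 4, ("C", "B"): 0,
        ("A", "C"): 2, ("C", "A"): 1}
items = sorted({i for pair in wins for i in pair})

def fit_bradley_terry(wins, items, iters=200):
    """MM/Zermelo fixed-point iteration for Bradley-Terry strengths."""
    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            w_i = sum(n for (a, _), n in wins.items() if a == i)  # i's total wins
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # games between i and j
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            new[i] = w_i / denom if denom else strength[i]
        total = sum(new.values())
        strength = {i: s * len(items) / total for i, s in new.items()}  # normalize
    return strength

s = fit_bradley_terry(wins, items)
ranking = sorted(items, key=lambda i: -s[i])
print(ranking)
```

An interactive tool like resorter would then repeatedly ask the user to compare the pair whose current strengths are most uncertain, refit, and finally map the fitted strengths back onto a uniform rating scale.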
“When Should I Check The Mail?”, Branwen 2015
Maildelivery
: “When Should I Check The Mail?”, (2015-06-21; ; backlinks; similar):
Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis, model comparison, optimizing check times with a loss function, and optimal data collection.
Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using
survival
library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely), and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).
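The core of the interval-censored likelihood is just a sum of log CDF differences: each day contributes log P(lo < T ≤ hi) for the interval bracketing the delivery. A Python sketch with invented check-times and a crude grid search standing in for the survival-package MLE:

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Invented observation intervals (hours after midnight): each day, the mail
# arrived sometime after the last empty check (lo) and by the first full check (hi).
intervals = [(9.5, 10.5), (10.0, 11.5), (10.8, 12.0), (10.2, 11.0), (9.8, 10.9)]

def neg_log_lik(mu, sigma):
    """Interval-censored Gaussian negative log-likelihood: -sum log P(lo < T <= hi)."""
    ll = 0.0
    for lo, hi in intervals:
        p = norm_cdf((hi - mu) / sigma) - norm_cdf((lo - mu) / sigma)
        ll += math.log(max(p, 1e-300))  # guard against log(0)
    return -ll

# crude grid search over (mu, sigma) standing in for a proper optimizer
best = min(((mu / 10, s / 10) for mu in range(90, 121) for s in range(2, 21)),
           key=lambda p: neg_log_lik(*p))
print(best)
```

The same likelihood is what the R survival library maximizes internally for interval-censored Gaussian regression; the MCMC and ABC analyses in the essay target the same data-generating model.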
“Gaussian Processes for DataEfficient Learning in Robotics and Control”, Deisenroth et al 2015
“Gaussian Processes for Data-Efficient Learning in Robotics and Control”, (2015-02-10; ):
Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning allows one to reduce the amount of engineering knowledge which is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time-consuming. To address this problem, current learning approaches typically require task-specific knowledge in form of expert demonstrations, realistic simulators, pre-shaped policies, or specific knowledge about the underlying dynamics. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
“Predictive Distributions for Betweenstudy Heterogeneity and Simple Methods for Their Application in Bayesian Metaanalysis”, Turner et al 2014
“Predictive distributions for between-study heterogeneity and simple methods for their application in Bayesian meta-analysis”, (2014-12-05; ; similar):
Numerous meta-analyses in healthcare research combine results from only a small number of studies, for which the variance representing between-study heterogeneity is estimated imprecisely. A Bayesian approach to estimation allows external evidence on the expected magnitude of heterogeneity to be incorporated.
The aim of this paper is to provide tools that improve the accessibility of Bayesian meta-analysis. We present 2 methods for implementing Bayesian meta-analysis, using numerical integration and importance sampling techniques. Based on 14,886 binary outcome meta-analyses in the Cochrane Database of Systematic Reviews, we derive a novel set of predictive distributions for the degree of heterogeneity expected in 80 settings depending on the outcomes assessed and comparisons made. These can be used as prior distributions for heterogeneity in future meta-analyses.
The 2 methods are implemented in R, for which code is provided. Both methods produce equivalent results to standard but more complex Markov chain Monte Carlo approaches. The priors are derived as log-normal distributions for the between-study variance, applicable to meta-analyses of binary outcomes on the log odds ratio scale. The methods are applied to 2 example meta-analyses, incorporating the relevant predictive distributions as prior distributions for between-study heterogeneity.
We have provided resources to facilitate Bayesian meta-analysis, in a form accessible to applied researchers, which allow relevant prior information on the degree of heterogeneity to be incorporated.
[The distribution of tau across all the meta-analyses in Cochrane with a binary outcome has been estimated by Turner et al 2014.
They estimated the distribution of log(τ²) as normal with mean −2.56 and standard deviation 1.74. I’ve estimated the distribution of μ across Cochrane as a generalized t-distribution with mean = 0, scale = 0.52, and 3.4 degrees of freedom.
These estimated priors usually don’t make a very big difference compared to flat priors. That’s just because the signal-to-noise ratio of most meta-analyses is reasonably good. For most meta-analyses, finding an honest set of reliable studies seems to be a much bigger problem than sampling error.]
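The quoted priors are easy to sample from; a Python sketch, assuming the stated parameterization (log(τ²) ~ Normal(−2.56, 1.74) and μ ~ 0.52 · Student-t(3.4)):

```python
import math, random

random.seed(1)

def sample_tau():
    """tau from the empirical prior: log(tau^2) ~ Normal(-2.56, 1.74)."""
    return math.exp(random.gauss(-2.56, 1.74) / 2)  # tau = sqrt(tau^2)

def sample_mu():
    """mu ~ 0.52 * Student-t(df=3.4), built as Normal / sqrt(ChiSq/df)."""
    df, scale = 3.4, 0.52
    chi2 = random.gammavariate(df / 2, 2)  # ChiSq(df) = Gamma(df/2, scale=2)
    return scale * random.gauss(0, 1) / math.sqrt(chi2 / df)

taus = sorted(sample_tau() for _ in range(100_000))
median_tau = taus[len(taus) // 2]
print(round(median_tau, 3))  # lognormal median: exp(-2.56/2) ≈ 0.278
```

Draws like these can be plugged in directly as the heterogeneity and effect-size priors of a new random-effects meta-analysis.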
“Thompson Sampling With the Online Bootstrap”, Eckles & Kaptein 2014
“Thompson sampling with the online bootstrap”, (2014-10-15; ; similar):
Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large-scale bandit problems, and its performance is dependent on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson sampling by replacing the posterior distribution used in Thompson sampling by a bootstrap distribution. We first explain BTS and show that the performance of BTS is competitive to Thompson sampling in the well-studied Bernoulli bandit case. Subsequently, we detail why BTS using the online bootstrap is more scalable than regular Thompson sampling, and we show through simulation that BTS is more robust to a misspecified error distribution. BTS is an appealing modification of Thompson sampling, especially when samples from the posterior are otherwise not available or are costly.
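A toy version of BTS for the Bernoulli bandit, using a double-or-nothing online bootstrap (all parameters invented; this follows the paper's idea, not its code): each arm keeps a set of bootstrap replicates of its mean reward, one replicate per arm is sampled to act, and each new observation is assigned to each replicate with probability 1/2.

```python
import random

random.seed(0)

J = 50                      # bootstrap replicates per arm (invented)
arms_p = [0.3, 0.5, 0.7]    # true (unknown) Bernoulli reward rates (invented)
# each replicate is [successes, trials], started with 1/2 pseudo-counts
reps = [[[1.0, 2.0] for _ in range(J)] for _ in arms_p]

def choose_arm():
    """Sample one bootstrap replicate per arm; play the arm with highest mean."""
    means = []
    for arm in reps:
        s, n = random.choice(arm)
        means.append(s / n)
    return max(range(len(arms_p)), key=lambda a: means[a])

def update(arm, reward):
    """Double-or-nothing online bootstrap: each replicate takes the observation
    with probability 1/2, at weight 2 to keep replicate totals unbiased."""
    for rep in reps[arm]:
        if random.random() < 0.5:
            rep[0] += 2 * reward
            rep[1] += 2

pulls = [0] * len(arms_p)
for _ in range(5000):
    a = choose_arm()
    pulls[a] += 1
    update(a, 1.0 if random.random() < arms_p[a] else 0.0)
print(pulls)  # the p=0.7 arm should receive the large majority of pulls
```

Replacing the posterior draw of ordinary Thompson sampling with a replicate draw is what makes BTS usable when an exact posterior is unavailable or expensive.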
“Everything Is Correlated”, Branwen 2014
Everything
: “Everything Is Correlated”, (2014-09-12; ; backlinks; similar):
Anthology of sociology, statistical, or psychological papers discussing the observation that all realworld variables have nonzero correlations and the implications for statistical theory such as ‘null hypothesis testing’.
Statistical folklore asserts that “everything is correlated”: in any real-world dataset, most or all measured variables will have nonzero correlations, even between variables which appear to be completely independent of each other, and that these correlations are not merely sampling error flukes but will appear in large-scale datasets to arbitrarily designated levels of statistical-significance or posterior probability.
This raises serious questions for null-hypothesis statistical-significance testing, as it implies the null hypothesis of 0 will always be rejected with sufficient data, meaning that a failure to reject only implies insufficient data, and provides no actual test or confirmation of a theory. Even a directional prediction is minimally confirmatory since there is a 50% chance of picking the right direction at random.
It also has implications for conceptualizations of theories & causal models, interpretations of structural models, and other statistical principles such as the “sparsity principle”.
 Importance
 Gosset / Student 1904
 Thorndike 1920
 Berkson 1938
 Thorndike 1939
 Good 1950
 Hodges & Lehmann 1954
 Savage 1954
 Fisher 1956
 Wallis & Roberts 1956
 Savage 1957
 Nunnally 1960
 Smith 1960
 Edwards 1963
 Bakan 1966
 Meehl 1967
 Lykken 1968
 Nichols 1968
 Hays 1973
 Oakes 1975
 Loehlin & Nichols 1976
 Meehl 1978
 Loftus & Loftus 1982
 Meehl 1990 (1)
 Meehl 1990 (2)
 Tukey 1991
 Raftery 1995
 Thompson 1995
 Mulaik et al 1997
 Waller 2004
 Starbuck 2006
 Smith et al 2007
 Hecht & Moxley 2009
 Andrew Gelman
 Lin et al 2013
 Schwitzgebel 2013
 Ellenberg 2014
 Lakens 2014
 Kirkegaard 2014
 Shen et al 2014
 Gordon et al 2019
 Kirkegaard 2020
 Ferguson & Heene 2021
 External Links
 Appendix
“Statistical Notes”, Branwen 2014
Statisticalnotes
: “Statistical Notes”, (2014-07-17; ; backlinks; similar):
Miscellaneous statistical stuff
Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.
 Critiques
 “Someone Should Do Something”: Wishlist of Miscellaneous Project Ideas
 Estimating censored test scores
 The Traveling Gerontologist problem
 Bayes nets
 Genome sequencing costs
 Proposal: handcounting mobile app for more fluid group discussions
 Air conditioner replacement
 Some ways of dealing with measurement error
 Value of Information: clinical prediction instruments for suicide
 Bayesian Model Averaging
 Dealing with allornothing unreliability of data
 Dysgenics power analysis
 Power analysis for racial admixture studies of continuous variables
 Operating on an aneurysm
 The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example
 RNN metadata for mimicking individual author style
 MCTS
 Candy Japan A/B test
 DeFries-Fulker power analysis
 Inferring mean IQs from SMPY / TIP elite samples
 Genius Revisited: On the Value of High IQ Elementary Schools
 Great Scott! Personal Name Collisions and the Birthday Paradox
 Detecting fake (human) Markov chain bots
 Optimal Existential Risk Reduction Investment
 Model Criticism via Machine Learning
 Proportion of Important Thinkers by Global Region Over Time in Charles Murray’s Human Accomplishment
 Program for non-spaced-repetition review of past written materials for serendipity & rediscovery: Archive Revisiter
 On the value of new statistical methods
 Bayesian power analysis: probability of exact replication
 Expectations are not expected deviations and large number of variables are not large samples
 Oh Deer: Could Deer Evolve to Avoid Car Accidents?
 Evolution as Backstop for Reinforcement Learning
 Acne: a good Quantified Self topic
 Fermi calculations
 Selective Emigration and Personality Trait Change
 The Most Abandoned Books on GoodReads
“Why Correlation Usually ≠ Causation”, Branwen 2014
Causality
: “Why Correlation Usually ≠ Causation”, (2014-06-24; ; backlinks; similar):
Correlations are oft interpreted as evidence for causation; this is oft falsified; do causal graphs explain why this is so common, because the number of possible indirect paths greatly exceeds the direct paths necessary for useful manipulation?
It is widely understood that statistical correlation between two variables ≠ causation. Despite this admonition, people are overconfident in claiming correlations to support favored causal interpretations and are surprised by the results of randomized experiments, suggesting that they are biased & systematically underestimate the prevalence of confounds/common-causation. I speculate that in realistic causal networks or DAGs, the number of possible correlations grows faster than the number of possible causal relationships. So confounds really are that common, and since people do not think in realistic DAGs but toy models, the imbalance also explains overconfidence.
“Bacopa QuasiExperiment”, Branwen 2014
Bacopa
: “Bacopa Quasi-Experiment”, (2014-05-06; ; backlinks; similar):
A small 2014-2015 non-blinded self-experiment using Bacopa monnieri to investigate effect on memory/sleep/self-ratings in an ABABA design; no particular effects were found.
Bacopa is a supplement herb often used for memory or stress adaptation. Its chronic effects reportedly take many weeks to manifest, with no important acute effects. Out of curiosity, I bought 2 bottles of Bacognize Bacopa pills and ran a non-randomized non-blinded ABABA quasi-self-experiment from June 2014 to September 2015, measuring effects on my memory performance, sleep, and daily self-ratings of mood/productivity. For analysis, a multilevel Bayesian model on two memory performance variables was used to extract per-day performance, factor analysis was used to extract a sleep index from 9 Zeo sleep variables, and the 3 endpoints were modeled as a multivariate Bayesian time-series regression with splines. Because of the slow onset of chronic effects, small effective sample size, definite temporal trends probably unrelated to Bacopa, and noise in the variables, the results were as expected, ambiguous, and do not strongly support any correlation between Bacopa and memory/sleep/self-rating (+// respectively).
“Bayesian Model Selection: The Steepest Mountain to Climb”, Tenan et al 2014
2014tenan.pdf
: “Bayesian model selection: The steepest mountain to climb”, Simone Tenan, Robert B. O’Hara, Iris Hendriks, Giacomo Tavecchia (2014-01-01; backlinks)
“(More) Efficient Reinforcement Learning via Posterior Sampling”, Osband et al 2013
“(More) Efficient Reinforcement Learning via Posterior Sampling”, (2013-06-04; ; similar):
Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration.
We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way.
We establish an Õ(τ ⋅ S ⋅ √(AT)) bound on the expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm.
We show through simulation that PSRL substantially outperforms existing algorithms with similar regret bounds.
“Magnesium SelfExperiments”, Branwen 2013
Magnesium
: “Magnesium Self-Experiments”, (2013-05-13; ; backlinks; similar):
3 magnesium self-experiments on magnesium l-threonate and magnesium citrate.
Encouraged by TruBrain’s magnesium & my magnesium l-threonate use, I design and run a blind random self-experiment to see whether magnesium citrate supplementation would improve my mood or productivity. I collected ~200 days of data at two dose levels. The analysis finds that the net effect was negative, but a more detailed look shows time-varying effects with a large initial benefit negated by an increasingly-negative effect. Combined with my expectations, the long half-life, and the higher-than-intended dosage, I infer that I overdosed on the magnesium. To verify this, I will be running a follow-up experiment with a much smaller dose.
“Caffeine Wakeup Experiment”, Branwen 2013
Caffeine
: “Caffeine wakeup experiment”, (2013-04-07; ; backlinks; similar):
Self-experiment on whether consuming caffeine immediately upon waking results in less time in bed & higher productivity. The results indicate a small and uncertain effect.
One trick to combat morning sluggishness is to get caffeine extra-early by using caffeine pills shortly before or upon trying to get up. From 2013-2014 I ran a blinded & placebo-controlled randomized experiment measuring the effect of caffeine pills in the morning upon awakening time and daily productivity. The estimated effect is small and the posterior probability relatively low, but a decision analysis suggests that since caffeine pills are so cheap, it would be worthwhile to conduct another experiment; however, increasing Zeo equipment problems have made me hold off additional experiments indefinitely.
“Bayesian Estimation Supersedes the t Test”, Kruschke 2013
2012kruschke.pdf
: “Bayesian estimation supersedes the t test”, (2013; ; backlinks; similar):
Bayesian estimation for 2 groups provides complete distributions of credible values for the effect size, group means and their difference, standard deviations and their difference, and the normality of the data. The method handles outliers. The decision rule can accept the null value (unlike traditional t-tests) when certainty in the estimate is high (unlike Bayesian model comparison using Bayes factors). The method also yields precise estimates of statistical power for various research goals. The software and programs are free and run on Macintosh, Windows, and Linux platforms.
[Keywords: Bayesian statistics, effect size, robust estimation, Bayes factor, confidence interval]
“Potassium Sleep Experiments”, Branwen 2012
Potassium
: “Potassium sleep experiments”, (2012-12-21; ; backlinks; similar):
2 self-experiments on potassium citrate effects on sleep: harm to sleep when taken daily or in the morning.
Potassium and magnesium are minerals that many Americans are deficient in. I tried using potassium citrate and immediately noticed difficulty sleeping. A short randomized (but not blinded) self-experiment of ~4g potassium taken throughout the day confirmed large negative effects on my sleep. A longer follow-up randomized and blinded self-experiment used standardized doses taken once a day early in the morning, and also found some harm to sleep, and I discontinued potassium use entirely.
“2012 Election Predictions”, Branwen 2012
2012electionpredictions
: “2012 election predictions”, (2012-11-05; ; backlinks; similar):
Compiling academic and media forecasters’ 2012 American Presidential election predictions and statistically judging correctness; Nate Silver was not the best.
Statistically analyzing in R hundreds of predictions compiled for ~10 forecasters of the 2012 American Presidential election, and ranking them by Brier, RMSE, & log scores; the best overall performance seems to be by Drew Linzer and Wang & Holbrook, while Nate Silver appears somewhat overrated and the famous Intrade prediction market turned in a disappointing overall performance.
“Biased Information As Anti-information”, Branwen 2012
backfireeffect
: “Biased information as anti-information”, (2012-10-19; ; similar):
Filtered data for a belief can rationally push you away from that belief
The backfire effect is a recently-discovered bias in which arguments contrary to a person’s belief lead them to believe even more strongly in that belief; this is taken as obviously “irrational”. The “rational” update can be statistically modeled as a shift in the estimated mean of a normal distribution where each randomly distributed datapoint is an argument: new datapoints below the mean cause a shift of the inferred mean downward and likewise if above. When this model is changed to include the “censoring” of datapoints, then the valid inference changes and a datapoint below the mean can lead to a shift of the mean upwards. This suggests that providing a person with anything less than the best data contrary to, or decisive refutations of, one of their beliefs may result in them becoming even more certain of that belief. If it is enjoyable or profitable to argue with a person while one does less than one’s best, it is bad to hold false beliefs, and this badness is not shared between both parties, then arguing online may constitute a negative externality: an activity whose benefits are gained by one party but whose full costs are not paid by the same party. In many moral systems, negative externalities are considered selfish and immoral; hence, lazy or half-hearted arguing may be immoral because it internalizes any benefits while possibly leaving the other person epistemically worse off.
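The censoring argument can be demonstrated in a few lines. Suppose an arguer shows only their single strongest (lowest) argument out of k draws from Normal(θ, 1): a naive listener treats it as a random draw, while a savvy listener models the selection, and then a weakly-contrary datapoint below the prior mean rationally shifts the posterior mean upward (illustrative numbers, flat prior on a grid; not the essay's exact model):

```python
import math

def norm_pdf(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
def norm_cdf(x): return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def posterior_mean(x, k):
    """Posterior mean of theta (flat grid prior) when x is the minimum of k
    draws from Normal(theta, 1): density is k*pdf(x-t)*(1-cdf(x-t))^(k-1)."""
    grid = [g / 100 for g in range(-500, 501)]
    w = [k * norm_pdf(x - t) * (1 - norm_cdf(x - t)) ** (k - 1) for t in grid]
    z = sum(w)
    return sum(t * wi for t, wi in zip(grid, w)) / z

x = -0.5                          # a weak contrary argument, just below the prior mean of 0
naive = posterior_mean(x, k=1)    # treats it as a random draw: mean falls to ~x
savvy = posterior_mean(x, k=10)   # models it as the best of 10: mean rises above 0
print(round(naive, 2), round(savvy, 2))
```

The savvy listener reasons: "if the best of your 10 arguments is only this strong, the underlying position is probably weaker than I thought", so the same datapoint that drags the naive estimate down pushes the informed estimate up.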
“A/B Testing Longform Readability on Gwern.net”, Branwen 2012
ABtesting
: “A/B testing longform readability on Gwern.net”, (20120616; ; backlinks; similar):
A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.
To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.
 Background
 Problems with “conversion” metric
 ideas for testing
 Testing
 Resumption: ABalytics
 Max-width
 Max-width redux
 Fonts
 Line height
 Null test
 Text & background color
 List symbol and font-size
 Blockquote formatting
 Font size & ToC background
 Section header capitalization
 ToC formatting
 BeeLine Reader text highlighting
 Floating footnotes
 Indented paragraphs
 Sidebar elements
 Moving sidebar’s metadata into page
 CSE
 Banner Ad Effect on Total Traffic
 Deep reinforcement learning
 Indentation + Left-Justified Text
 Appendix
“Dual N-Back Meta-Analysis”, Branwen 2012
DNBmetaanalysis
: “Dual n-Back Meta-Analysis”, (20120520; ; backlinks; similar):
Does DNB increase IQ? What factors affect the studies? Probably not: gains are driven by studies with the weakest methodology, like apathetic control groups.
I meta-analyze the >19 studies up to 2016 which measure IQ after an n-back intervention, finding (over all studies) a net medium-sized gain on the post-training IQ tests.
The size of this increase in IQ test score correlates highly with the methodological concern of whether a study used active or passive control groups. This indicates that the medium effect size is due to methodological problems and that n-back training does not increase subjects’ underlying fluid intelligence; rather, the gains are due to the motivational effect of passive control groups (who did not train on anything) not trying as hard as the n-back-trained experimental groups on the post-tests. The remaining studies using active control groups find a small positive effect (but this may be due to matrix-test-specific training, undetected publication bias, smaller motivational effects, etc.)
I also investigate several other n-back claims, criticisms, and indicators of bias, finding:
 payment reducing performance claim: possible
 dose-response relationship of n-back training time & IQ gains claim: not found
 kind of n-back matters: not found
 publication bias criticism: not found
 speeding of IQ tests criticism: not found
“One Man’s Modus Ponens”, Branwen 2012
Modus
: “One Man’s Modus Ponens”, (20120501; ; backlinks; similar):
One man’s modus ponens is another man’s modus tollens is a saying in Western philosophy encapsulating a common response to a logical proof which generalizes the reductio ad absurdum and consists of rejecting a premise based on an implied conclusion. I explain it in more detail, provide examples, and a Bayesian gloss.
A logically-valid argument which takes the form of a modus ponens may be interpreted in several ways; a major one is to interpret it as a kind of reductio ad absurdum, where by ‘proving’ a conclusion believed to be false, one might instead take it as a modus tollens which proves that one of the premises is false. This “Moorean shift” is aphorized as the snowclone, “One man’s modus ponens is another man’s modus tollens”.
The Moorean shift is a powerful counterargument which has been deployed against many skeptical & metaphysical claims in philosophy, where often the conclusion is extremely unlikely and little evidence can be provided for the premises used in the proofs; and it is relevant to many other debates, particularly methodological ones.
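The Bayesian gloss can be made concrete with a simple coherence bound (a hypothetical numeric sketch; the probabilities are invented, not from the essay): a valid argument forces P(conclusion) ≥ 1 − Σ P(premise is false), so a sufficiently implausible conclusion obliges a coherent reasoner to shed confidence in the premises instead of accepting the conclusion:

```python
# Hypothetical numbers: a valid argument with two plausible-seeming premises
# and an extremely implausible conclusion (e.g. a skeptical scenario).
p_premise1 = 0.9
p_premise2 = 0.9
prior_conclusion = 0.01

# Validity gives P(conclusion) >= P(both premises) >= 1 - sum of doubts:
lower_bound = 1 - ((1 - p_premise1) + (1 - p_premise2))  # = 0.8

# The prior on the conclusion violates that bound, so running the argument
# backwards (modus tollens), the reasoner must move at least this much
# total confidence off the premises:
required_doubt = lower_bound - prior_conclusion
print(required_doubt)  # approximately 0.79: the revision lands on the premises
```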
“Learning Is Planning: near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search”, Asmuth & Littman 2012
“Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search”, (20120214; ; similar):
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
“Learning Performance of Prediction Markets With Kelly Bettors”, Beygelzimer et al 2012
“Learning Performance of Prediction Markets with Kelly Bettors”, (20120131; ; similar):
[blog] In evaluating prediction markets (and other crowd-prediction mechanisms), investigators have repeatedly observed a so-called “wisdom of crowds” effect, which roughly says that the average of participants performs much better than the average participant. The market price—an average or at least aggregate of traders’ beliefs—offers a better estimate than most any individual trader’s opinion.
In this paper, we ask a stronger question: how does the market price compare to the best trader’s belief, not just the average trader’s? We measure the market’s worst-case log regret, a notion common in machine learning theory. To arrive at a meaningful answer, we need to assume something about how traders behave. We suppose that every trader optimizes according to the Kelly criterion, a strategy that provably maximizes the compound growth of wealth over an (infinite) sequence of market interactions. We show several consequences.
First, the market prediction is a wealth-weighted average of the individual participants’ beliefs. Second, the market learns at the optimal rate, the market price reacts exactly as if updating according to Bayes’ Law, and the market prediction has low worst-case log regret to the best individual participant. We simulate a sequence of markets where an underlying true probability exists, showing that the market converges to the true objective frequency as if updating a Beta distribution, as the theory predicts. If agents adopt a fractional Kelly criterion, a common practical variant, we show that agents behave like full-Kelly agents with beliefs weighted between their own and the market’s, and that the market price converges to a time-discounted frequency.
Our analysis provides a new justification for fractional Kelly betting, a strategy widely used in practice for ad-hoc reasons. Finally, we propose a method for an agent to learn her own optimal Kelly fraction.
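The wealth-weighted dynamics can be sketched in a toy simulation (illustrative only, not the paper’s code; the belief grid, trial count, and true probability are arbitrary choices). A full-Kelly bettor facing price q multiplies her wealth by p/q on a win and (1−p)/(1−q) on a loss, which is exactly a Bayes-factor update with wealth as posterior weight:

```python
import random

random.seed(0)

# Agents are hypotheses about the Bernoulli rate; wealth plays the role of
# the posterior weight (all numbers here are illustrative).
beliefs = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
wealth = [1.0] * len(beliefs)              # uniform "prior"
true_p = 0.7

def market_price(beliefs, wealth):
    """Wealth-weighted average belief = the market's prediction."""
    total = sum(wealth)
    return sum(b * w for b, w in zip(beliefs, wealth)) / total

for _ in range(3000):
    q = market_price(beliefs, wealth)
    outcome = random.random() < true_p
    # Full-Kelly wealth update: p_i/q on a win, (1-p_i)/(1-q) on a loss.
    # Note total wealth is conserved, since q is the wealth-weighted mean.
    for i, p in enumerate(beliefs):
        wealth[i] *= (p / q) if outcome else ((1 - p) / (1 - q))

print(market_price(beliefs, wealth))  # close to the 0.7 objective frequency
```

Wealth concentrates on the agent whose belief is closest to the empirical frequency, so the market price converges toward 0.7, as the Beta-updating result predicts.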
“Vitamin D Sleep Experiments”, Branwen 2012
VitaminD
: “Vitamin D sleep experiments”, (2012; ; backlinks; similar):
Self-experiment on vitamin D’s effects on sleep: harmful when taken at night; no effects, or beneficial ones, when taken in the morning.
Vitamin D is a hormone endogenously created by exposure to sunlight; due to historically low outdoors activity levels, it has become a popular supplement and I use it. Some anecdotes suggest that vitamin D may have circadian and zeitgeber effects due to its origin, and is harmful to sleep when taken at night. I ran a blinded randomized self-experiment on taking vitamin D pills at bedtime. The vitamin D damaged my sleep and especially how rested I felt upon waking, suggesting vitamin D did have a stimulating effect which obstructed sleep. I conducted a follow-up blinded randomized self-experiment on the logical next question: if vitamin D is a daytime cue, then would vitamin D taken in the morning show some beneficial effects? The results were inconclusive (but slightly in favor of benefits). Given the asymmetry, I suggest that vitamin D supplements should be taken only in the morning.
“Silk Road 1: Theory & Practice”, Branwen 2011
SilkRoad
: “Silk Road 1: Theory & Practice”, (20110711; ; backlinks; similar):
History, background, visiting, ordering, using, & analyzing the drug market Silk Road 1.
The cypherpunk movement laid the ideological roots of Bitcoin and the online drug market Silk Road; balancing previous emphasis on cryptography, I emphasize the non-cryptographic market aspects of Silk Road, which are rooted in cypherpunk economic reasoning, and give a fully detailed account of how a buyer might use market information to rationally buy, and finish by discussing strengths and weaknesses of Silk Road, and what future developments are predicted by cypherpunk ideas.
 Size
 Competitors
 Cypherpunks
 Bitcoin
 Silk Road as Cyphernomicon’s black markets
 Silk Road as a marketplace
 Preparations
 Silk Road
 Legal wares
 Anonymity
 Shopping
 LSD case study
 Finis
 Future Developments
 Postmortem
 See Also
 External Links
 Colophon
 Appendices
“PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Deisenroth & Rasmussen 2011
2011deisenroth.pdf
: “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, (20110601; ; backlinks; similar):
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way.
By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement.
We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
[Remarkably, PILCO can learn the standard “Cartpole” task within just a few trials by carefully building a Bayesian Gaussian process model and picking the maximally-informative experiments to run. Cartpole is quite difficult for a human, incidentally: there’s an installation of one in the SF Exploratorium, and I just had to try it out once I recognized it. (My sample-efficiency was not better than PILCO’s.)]
“Death Note: L, Anonymity & Eluding Entropy”, Branwen 2011
DeathNoteAnonymity
: “Death Note: L, Anonymity & Eluding Entropy”, (20110504; ; backlinks; similar):
Applied Computer Science: On Murder Considered As a STEM Field—using information theory to quantify the magnitude of Light Yagami’s mistakes in Death Note and considering fixes
In the manga Death Note, the protagonist Light Yagami is given the supernatural weapon “Death Note” which can kill anyone on demand, and begins using it to reshape the world. The genius detective L attempts to track him down with analysis and trickery, and ultimately succeeds. Death Note is almost a thought experiment: given the perfect murder weapon, how can you screw up anyway? I consider the various steps of L’s process from the perspective of computer security, cryptography, and information theory, to quantify Light’s initial anonymity and how L gradually deanonymizes him, and consider which mistake was the largest, as follows:
Light’s fundamental mistake is to kill in ways unrelated to his goal.
Killing through heart attacks does not just make him visible early on, but the deaths reveal that his assassination method is impossibly precise and that something profoundly anomalous is going on. L has been tipped off that Kira exists. Whatever the bogus justification may be, this is a major victory for his opponents. (To deter criminals and villains, it is not necessary for there to be a globally-known single anomalous or supernatural killer, when it would be equally effective to arrange for all the killings to be done naturalistically by ordinary mechanisms such as third parties/police/judiciary or used indirectly as parallel construction to crack cases.)
Worse, the deaths are nonrandom in other ways—they tend to occur at particular times!
Just the scheduling of deaths cost Light 6 bits of anonymity.
Light’s third mistake was reacting to the blatant provocation of Lind L. Tailor.
Taking the bait let L narrow his target down to 1⁄3 the original Japanese population, for a gain of ~1.6 bits.
Light’s fourth mistake was to use confidential police information stolen using his policeman father’s credentials.
This mistake was the largest in bits lost: it cost him 11 bits of anonymity; in other words, twice what his scheduling cost him and almost 8 times the murder of Tailor!
Killing Ray Penbar and the FBI team.
If we assume Penbar was tasked with 200 leads out of the 10,000, then murdering him and the fiancée dropped Light just 6 bits, or a little over half the fourth mistake and comparable to the original scheduling mistake.
Endgame: At this point in the plot, L resorts to direct measures and enters Light’s life directly, enrolling at the university, with Light unable to perfectly play the role of innocent under intense in-person surveillance.
From that point on, Light is screwed as he is now playing a deadly game of “Mafia” with L & the investigative team. He frittered away >25 bits of anonymity and then L intuited the rest and suspected him all along.
Finally, I suggest how Light could have most effectively employed the Death Note and limited his loss of anonymity. In an appendix, I discuss the maximum amount of information leakage possible from using a Death Note as a communication device.
(Note: This essay assumes a familiarity with the early plot of Death Note and Light Yagami. If you are unfamiliar with DN, see my Death Note Ending essay or consult Wikipedia or read the DN rules.)
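The bit accounting above is just the base-2 log of how much each mistake shrinks the anonymity set; a minimal sketch (the population figure is an illustrative round number, not the essay’s exact starting set):

```python
import math

def bits(population_before, population_after):
    """Information leaked by shrinking the anonymity set, in bits."""
    return math.log2(population_before / population_after)

# Lind L. Tailor: surviving the broadcast narrows Kira to ~1/3 of Japan.
japan = 128_000_000   # illustrative round figure for Japan's population
print(bits(japan, japan / 3))          # ~1.6 bits, as in the essay

# A 6-bit leak (the death-scheduling mistake) is a 2^6 = 64-fold shrink:
print(japan / 2 ** 6)                  # ~2 million remaining candidates
```

Leaks compose additively: the >25 bits Light fritters away correspond to shrinking his anonymity set by a factor of more than 2^25 ≈ 33 million.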
“Tea Reviews”, Branwen 2011
Tea
: “Tea Reviews”, (20110413; ; backlinks; similar):
Teas I have drunk, with reviews and future purchases; focused primarily on oolongs and greens. Plus experiments on water.
Electric kettles are faster, but I was curious how much faster my electric kettle heated water to high or boiling temperatures than my stovetop kettle does. So I collected some data and compared them directly, trying out a number of statistical methods (principally: non-parametric & parametric tests of difference, linear & beta regression models, and a Bayesian measurement-error model). My electric kettle is faster than the stovetop kettle (the difference is both statistically-significant, p≪0.01, & the posterior probability of a difference is P ≈ 1), and the modeling suggests time to boil is largely predictable from a combination of volume, end-temperature, and kettle type.
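The simplest of the comparisons mentioned can be sketched as a permutation test (the boil times below are invented for illustration, not the recorded data):

```python
import random

random.seed(0)

# Hypothetical boil times in seconds (not the page's actual measurements):
electric = [312, 305, 298, 320, 301, 295, 310, 308]
stovetop = [401, 388, 395, 410, 383, 399, 405, 392]

observed = sum(stovetop) / len(stovetop) - sum(electric) / len(electric)

# Permutation test: shuffle the kettle labels and count how often a gap
# at least this large arises by chance.
pooled = electric + stovetop
n = len(electric)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[n:]) / n - sum(pooled[:n]) / n
    if diff >= observed:
        extreme += 1

print(extreme / trials)  # p-value: ~0 for so cleanly separated a difference
```

With completely separated groups like these, essentially no relabeling reproduces the observed gap, so the permutation p-value is near zero, mirroring the page’s p≪0.01 result.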
“Gwern.net Website Traffic”, Branwen 2011
Traffic
: “Gwern.net Website Traffic”, (20110203; ; similar):
Meta page describing Gwern.net editing activity, traffic statistics, and referrer details, primarily sourced from Google Analytics (2011–present).
On a semiannual basis, since 2011, I review Gwern.net website traffic using Google Analytics; although what most readers value is not what I value, I find it motivating to see total traffic statistics reminding me of readers (writing can be a lonely and abstract endeavour), and useful to see what are major referrers.
Gwern.net typically enjoys steady traffic in the 50–100k range per month, with occasional spikes from social media, particularly Hacker News; over the first decade (2010–2020), there were 7.98m pageviews by 3.8m unique users.
“Zeo Sleep Self-Experiments”, Branwen 2010
Zeo
: “Zeo sleep self-experiments”, (20101228; ; backlinks; similar):
EEG recordings of sleep, and my experiments with things affecting sleep quality or duration: melatonin, potassium, vitamin D, etc.
I discuss my beliefs about Quantified Self, and demonstrate with a series of single-subject design self-experiments using a Zeo. A Zeo records sleep via EEG; I have made many measurements and performed many experiments. This is what I have learned so far:
 the Zeo headband is wearable longterm
 melatonin improves my sleep
 one-legged standing does little
 Vitamin D at night damages my sleep & Vitamin D in morning does not affect my sleep
 potassium (over the day but not so much the morning) damages my sleep and does not improve my mood/productivity
 small quantities of alcohol appear to make little difference to my sleep quality
 I may be better off changing my sleep timing by waking up somewhat earlier & going to bed somewhat earlier
 lithium orotate does not affect my sleep
 Redshift causes me to go to bed earlier
 ZMA: inconclusive results slightly suggestive of benefits
“The Replication Crisis: Flaws in Mainstream Science”, Branwen 2010
Replication
: “The Replication Crisis: Flaws in Mainstream Science”, (20101027; ; backlinks; similar):
A 2013 discussion of how systemic biases in science, particularly medicine and psychology, have resulted in a research literature filled with false positives and exaggerated effects, called ‘the Replication Crisis’.
Long-standing problems in standard scientific methodology have exploded as the “Replication Crisis”: the discovery that many results in fields as diverse as psychology, economics, medicine, biology, and sociology are in fact false or measured with gross quantitative inaccuracy. I cover here a handful of the issues and publications on this large, important, and rapidly developing topic up to about 2013, at which point the Replication Crisis became too large a topic to cover more than cursorily. (A compilation of some additional links is provided for post-2013 developments.)
The crisis is caused by methods & publishing procedures which interpret random noise as important results, far too small datasets, selective analysis by an analyst trying to reach expected/desired results, publication bias, poor implementation of existing best practices, nontrivial levels of research fraud, software errors, philosophical beliefs among researchers that false positives are acceptable, neglect of known confounding like genetics, and skewed incentives (financial & professional) to publish ‘hot’ results.
Thus, any individual piece of research typically establishes little. Scientific validation comes not from small p-values, but from discovering a regular feature of the world which disinterested third parties can discover with straightforward research done independently on new data with new procedures—replication.
“About This Website”, Branwen 2010
About
: “About This Website”, (20101001; ; backlinks; similar):
Meta page describing Gwern.net site ideals of stable long-term essays which improve over time; idea sources and writing methodology; metadata definitions; site statistics; copyright license.
This page is about Gwern.net content; for the details of its implementation & design like the popup paradigm, see Design; and for information about me, see Links.
“Bayesian Data Analysis”, Kruschke 2010
2010kruschke.pdf
: “Bayesian data analysis”, John K. Kruschke (20100810; ; backlinks)
“Nootropics”, Branwen 2010
Nootropics
: “Nootropics”, Gwern Branwen (20100102; ; backlinks; similar)
“Predicting the Next Big Thing: Success As a Signal of Poor Judgment”, Denrell & Fang 2010
2010denrell.pdf
: “Predicting the Next Big Thing: Success as a Signal of Poor Judgment”, Jerker Denrell, Christina Fang (20100101; ; backlinks)
“Rssa_a0157 469..482”
2010stigler.pdf
: “rssa_a0157 469..482” (20100101)
“Who Wrote The ‘Death Note’ Script?”, Branwen 2009
DeathNotescript
: “Who Wrote The ‘Death Note’ Script?”, (20091102; ; backlinks; similar):
Internal, external, & stylometric evidence point to the leaked live-action Death Note Hollywood script being real.
I give a history of the 2009 leaked script, discuss internal & external evidence for its realness including stylometrics; and then give a simple step-by-step Bayesian analysis of each point. We finish with high confidence in the script being real, discussion of how this analysis was surprisingly enlightening, and what follow-up work the analysis suggests would be most valuable.
“Miscellaneous”, Branwen 2009
Notes
: “Miscellaneous”, (20090805; ; backlinks; similar):
Misc thoughts, memories, protoessays, musings, etc.
We usually clean up after ourselves, but sometimes, we are expected to clean before (ie. after others) instead. Why?
Because in those cases, precleanup is the same amount of work, but gametheoretically better whenever a failure of postcleanup would cause the next person problems.
 Quickies
 Cleanup: Before Or After?
 Peak Human Speed
 Oldest Food
 Zuckerberg Futures
 Russia
 Conscientiousness And Online Education
 Fiction
 American light novels’ absence
 Cultural growth through diversity
 TV & the Matrix
 Cherchez Le Chien: Dogs as Class Markers in Anime
 Tradeoffs and costly signaling in appearances: the case of long hair
 The Tragedy of Grand Admiral Thrawn
 On dropping Family Guy
 Pom Poko’s glorification of group suicide
 Full Metal Alchemist: pride and knowledge
 A secular humanist reads The Tale of Genji
 Economics
 Long term investment
 Measuring social trust by offering free lunches
 Lip reading website
 Good governance & Girl Scouts
 Chinese Kremlinology
 Domain-squatting externalities
 Ordinary life improvements
 A Market For Fat: The Transfer Machine
 Urban Area Cost-of-Living as Big Tech Moats & Employee Golden Handcuffs
 Psychology
 Technology
 Somatic genetic engineering
 The advantage of an uncommon name
 Backups: life and death
 Measuring multiple times in a sandglass
 Powerful natural languages
 A Bitcoin+BitTorrentdriven economy for creators (Artcoin)
 William Carlos Williams
 Simplicity is the Price of Reliability
 November 2016 data loss postmortem
 Cats and Computer Keyboards
 How Would You Prove You Are a Time-traveler From the Past?
 ARPA and SCI: Surfing AI (Review of Roland & Shiman 2002)
 Open Questions
 The Reverse Amara’s Law
 Worldbuilding: The Lights in the Sky are Sacs
 Remote monitoring
 Surprising Turing-complete languages
 North Paw
 Leaf burgers
 Night watch
 Two cows: philosophy
 Venusian Revolution
 Hard problems in utilitarianism
 Who lives longer, men or women?
 Politicians are not unethical
 Defining ‘but’
 On metaethical optimization
 Alternate Futures: The Second English Restoration
 Cicadas
 No-poo self-experiment
 Newton’s System of the World and Comets
 Rationality Heuristic for Bias Detection: Updating Towards the Net Weight of Evidence
 Littlewood’s Law and the Global Media
 D&D Game #2 log
 Highly Potent Drugs As Psychological Warfare Weapons
 Who Buys Fonts?
 Advanced Chess obituary
“When Superstars Flop: Public Status and Choking Under Pressure in International Soccer Penalty Shootouts”, Jordet 2009
2009jordet.pdf
: “When Superstars Flop: Public Status and Choking Under Pressure in International Soccer Penalty Shootouts”, (20090415; ; backlinks; similar):
The purpose of this study was to examine links between public status and performance in a real-world, high-pressure sport task.
It was believed that high public status could negatively affect performance through added performance pressure. Video analyses were conducted of all penalty shootouts ever held in 3 major soccer tournaments (n = 366 kicks) and public status was derived from prestigious international awards (eg. “FIFA World Player of the Year”).
The results showed that players with high current status performed worse, and seemed to engage more in certain escapist self-regulatory behaviors, than players with future status. Some of these performance drops may be accounted for by misdirected self-regulation (particularly low response time), but only small multivariate effects were found.
[See Regression To The Mean Fallacies.]
“Dual N-Back FAQ”, Branwen 2009
DNBFAQ
: “Dual n-Back FAQ”, (20090325; ; backlinks; similar):
A compendium of DNB, WM, IQ information up to 2015.
Between 2008 and 2011, I collected a number of anecdotal reports about the effects of n-backing; there are many other anecdotes out there, but the following are a good representation—for what they’re worth.
 The Argument
 Training
 Terminology
 Notes from the author
 N-back training
 What’s some relevant research?
 Support
 Criticism
 Moody 2009 (re: Jaeggi 2008)
 Seidler 2010
 Jonasson 2011
 Chooi 2011
 Preece 2011 / Palmer 2011
 Kundu et al 2012
 Salminen 2012
 Redick et al 2012
 Rudebeck 2012
 Heinzel et al 2013
 Thompson et al 2013
 Smith et al 2013
 Nussbaumer et al 2013
 Oelhafen et al 2013
 Sprenger et al 2013
 Colom et al 2013
 Burki et al 2014
 Pugin et al 2014
 Heffernan 2014
 Hancock 2013
 Waris et al 2015
 Baniqued et al 2015
 Kuper & Karbach 2015
 Lindeløv et al 2016
 Schwarb et al 2015
 LawlorSavage & Goghari 2016
 StuderLuethi et al 2015
 Minear et al 2016
 StuderLuethi et al 2016
 Meta-analysis
 Does it really work?
 Non-IQ or non-DNB gains
 Saccading
 Sleep
 Lucid dreaming
 Aging
 TODO
 Software
 What else can I do?
 See Also
 Appendix
“Modafinil”, Branwen 2009
Modafinil
: “Modafinil”, (20090220; ; backlinks; similar):
Effects, health concerns, suppliers, prices & rational ordering.
Modafinil is a prescription stimulant drug. I discuss informally, from a cost-benefit-informed perspective, the research up to 2015 on modafinil’s cognitive effects, the risks of side-effects and addiction/tolerance and law enforcement, and give a table of current grey-market suppliers and discuss how to order from them.
“Models for Potentially Biased Evidence in Meta-analysis Using Empirically Based Priors”, Welton et al 2008
2009welton.pdf
: “Models for potentially biased evidence in meta-analysis using empirically based priors”, (20081222; ; similar):
We present models for the combined analysis of evidence from randomized controlled trials categorized as being at either low or high risk of bias due to a flaw in their conduct.
We formulate a bias model that incorporates between-study and between-meta-analysis heterogeneity in bias, and uncertainty in overall mean bias. We obtain algebraic expressions for the posterior distribution of the bias-adjusted treatment effect, which provide limiting values for the information that can be obtained from studies at high risk of bias.
The parameters of the bias model can be estimated from collections of previously published meta-analyses. We explore alternative models for such data, and alternative methods for introducing prior information on the bias parameters into a new meta-analysis.
Results from an illustrative example show that the bias-adjusted treatment effect estimates are sensitive to the way in which the meta-epidemiological data are modelled, but that using point estimates for bias parameters provides an adequate approximation to using a full joint prior distribution. A sensitivity analysis shows that the gain in precision from including studies at high risk of bias is likely to be low, however numerous the studies or large their size, and that little is gained by incorporating such studies unless the information from studies at low risk of bias is limited.
We discuss approaches that might increase the value of including studies at high risk of bias, and the acceptability of the methods in the evaluation of health care interventions.
[Keywords: Bayesian methods, Bias, health technology assessment, Markov chain Monte Carlo methods, randomized controlled trial]
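The limiting-value result can be illustrated with invented numbers (a sketch of a simple additive-bias model, not the paper’s actual formulation): per-study noise averages away as high-risk studies accumulate, but uncertainty about the mean bias does not, so precision floors:

```python
# Invented variance components (all hypothetical):
sampling_var = 0.04      # per-study sampling variance
tau2_bias = 0.02         # between-study heterogeneity in bias
var_mean_bias = 0.03     # uncertainty about the overall mean bias

def pooled_variance(n_studies):
    """Variance of the bias-adjusted pooled estimate from n high-risk studies.
    The per-study terms shrink like 1/n; the mean-bias term does not."""
    return var_mean_bias + (sampling_var + tau2_bias) / n_studies

for n in (1, 10, 1000):
    print(n, pooled_variance(n))
# The variance floors at var_mean_bias = 0.03 no matter how many
# high-risk studies are added -- the paper's "limiting value".
```

This is why the sensitivity analysis finds little gain from high-risk studies “however numerous or large”: their combined information is capped by how well the bias itself is known.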
“Optimal Approximation of Signal Priors”, Hyvarinen 2008
2008hyvarinen.pdf
: “Optimal approximation of signal priors”, (20081201; ; backlinks; similar):
In signal restoration by Bayesian inference, one typically uses a parametric model of the prior distribution of the signal. Here, we consider how the parameters of a prior model should be estimated from observations of uncorrupted signals. A lot of recent work has implicitly assumed that maximum likelihood estimation is the optimal estimation method. Our results imply that this is not the case. We first obtain an objective function that approximates the error incurred in signal restoration due to an imperfect prior model. Next, we show that in an important special case (small Gaussian noise), the error is the same as the score-matching objective function, which was previously proposed as an alternative to likelihood based on purely computational considerations. Our analysis thus shows that score matching combines computational simplicity with statistical optimality in signal restoration, providing a viable alternative to maximum likelihood methods. We also show how the method leads to a new intuitive and geometric interpretation of structure inherent in probability distributions.
“Verbal Probability Expressions In National Intelligence Estimates: A Comprehensive Analysis Of Trends From The Fifties Through Post-9/11”, Kesselman 2008
2008kesselman.pdf
: “Verbal Probability Expressions In National Intelligence Estimates: A Comprehensive Analysis Of Trends From The Fifties Through Post-9/11”, (200805; ; backlinks; similar):
This research presents the findings of a study that analyzed words of estimative probability in the key judgments of National Intelligence Estimates from the 1950s through the 2000s. The research found that of the 50 words examined, only 13 were statistically-significant. Furthermore, interesting trends have emerged when the words are broken down into English modals, terminology that conveys analytical assessments, and words employed by the National Intelligence Council as of 2006. One of the more intriguing findings is that use of the word “will” has by far been the most popular for analysts, registering over 700 occurrences throughout the decades; however, a word of such certainty is problematic in the sense that intelligence should never deal with 100% certitude. The relatively low occurrence and wide variety of word usage across the decades demonstrates a real lack of consistency in the way analysts have been conveying assessments over the past 58 years. Finally, the researcher suggests the Kesselman List of Estimative Words for use in the IC. The word list takes into account the literature-review findings as well as the results of this study in equating odds with verbal probabilities.
[Rachel’s lit review, for example, makes for very interesting reading. She has done a thorough search of not only the intelligence but also the business, linguistics and other literatures in order to find out how other disciplines have dealt with the problem of “What do we mean when we say something is ‘likely’…” She uncovered, for example, that, in medicine, words of estimative probability such as “likely”, “remote” and “probably” have taken on more or less fixed meanings due primarily to outside intervention or, as she put it, “legal ramifications”. Her comparative analysis of the results and approaches taken by these other disciplines is required reading for anyone in the Intelligence Community trying to understand how verbal expressions of probability are actually interpreted. The NIC’s list only became final in the last several years, so it is arguable whether this list of nine words really captures the breadth of estimative word usage across the decades. Rather, it would be arguable if this chart didn’t make it crystal clear that the Intelligence Community has really relied on just two words, “probably” and “likely”, to express its estimates of probabilities for the last 60 years. All other words are used rarely or not at all.
Based on her research of what works and what doesn’t and which words seem to have the most consistent meanings to users, Rachel even offers her own list of estimative words along with their associated probabilities:
 Almost certain: 86–99%
 Highly likely: 71–85%
 Likely: 56–70%
 Chances a little better [or less] than even: 46–55%
 Unlikely: 31–45%
 Highly unlikely: 16–30%
 Remote: 1–15%
]
[See also “Decision by sampling”, Stewart et al 2006; “Processing Linguistic Probabilities: General Principles and Empirical Evidence”, Budescu & Wallsten 1995.]
“The Allure of Equality: Uniformity in Probabilistic and Statistical Judgment”, Falk & Lann 2008
2008falk.pdf
: “The allure of equality: Uniformity in probabilistic and statistical judgment”, Ruma Falk, Avital Lann (20080101; ; backlinks)
“Experiments on Partisanship and Public Opinion: Party Cues, False Beliefs, and Bayesian Updating”, Bullock 2007
2007bullock.pdf
: “Experiments on partisanship and public opinion: Party cues, false beliefs, and Bayesian updating”, (2007-06-01; backlinks; similar):
This dissertation contains 3 parts—three papers. The first is about the effects of party cues on policy attitudes and candidate preferences. The second is about the resilience of false political beliefs. The third is about Bayesian updating of public opinion. Substantively, what unites them is my interest in partisanship and public opinion. Normatively, they all spring from my interest in the quality of citizens’ thinking about politics. Methodologically, they are bound by my conviction that we gain purchase on interesting empirical questions by doing things differently: first, by bringing more experiments to fields still dominated by cross-sectional survey research; second, by using experiments unlike the ones that have gone before.
 Part 1: It is widely believed that party cues affect political attitudes. But their effects have rarely been demonstrated, and most demonstrations rely on questionable inferences about cue-taking behavior. I use data from 3 experiments on representative national samples to show that party cues affect even the extremely well-informed and that their effects are, as Downs predicted, decreasing in the amount of policy-relevant information that people have. But the effects are often smaller than we imagine and much smaller than the ones caused by changes in policy-relevant information. Partisans tend to perceive themselves as much less influenced by cues than members of the other party—a finding with troubling implications for those who subscribe to deliberative theories of democracy.
 Part 2: The widely noted tendency of people to resist challenges to their political beliefs can usually be explained by the poverty of those challenges: they are easily avoided, often ambiguous, and almost always easily dismissed as irrelevant, biased, or uninformed. It is natural to hope that stronger challenges will be more successful. In a trio of experiments that draw on real-world cases of misinformation, I instill false political beliefs and then challenge them in ways that are unambiguous and nearly impossible to avoid or dismiss for the conventional reasons. The success of these challenges proves highly contingent on party identification.
 Part 3: Political scientists are increasingly interested in using Bayes’ Theorem to evaluate citizens’ thinking about politics. But there is widespread uncertainty about why the Theorem should be considered a normative standard for rational information processing and whether models based on it can accommodate ordinary features of political cognition including partisan bias, attitude polarization, and enduring disagreement. I clarify these points with reference to the best-known Bayesian updating model and several little-known but more realistic alternatives. I show that the Theorem is more accommodating than many suppose—but that, precisely because it is so accommodating, it is far from an ideal standard for rational information processing.
“A Free Energy Principle for the Brain”, Friston et al 2006
2006friston.pdf
: “A free energy principle for the brain”, (2006-07-01; backlinks; similar):
By formulating Helmholtz’s ideas about perception, in terms of modern-day theories, one arrives at a model of perceptual inference and learning that can explain a remarkable range of neurobiological facts: using constructs from statistical physics, the problems of inferring the causes of sensory input and learning the causal structure of their generation can be resolved using exactly the same principles. Furthermore, inference and learning can proceed in a biologically plausible fashion. The ensuing scheme rests on Empirical Bayes and hierarchical models of how sensory input is caused. The use of hierarchical models enables the brain to construct prior expectations in a dynamic and context-sensitive fashion. This scheme provides a principled way to understand many aspects of cortical organisation and responses.
In this paper, we show these perceptual processes are just one aspect of emergent behaviours of systems that conform to a free energy principle. The free energy considered here measures the difference between the probability distribution of environmental quantities that act on the system and an arbitrary distribution encoded by its configuration. The system can minimise free energy by changing its configuration to affect the way it samples the environment or change the distribution it encodes. These changes correspond to action and perception respectively and lead to an adaptive exchange with the environment that is characteristic of biological systems. This treatment assumes that the system’s state and structure encode an implicit and probabilistic model of the environment. We will look at the models entailed by the brain and how minimisation of its free energy can explain its dynamics and structure.
[Keywords: Variational Bayes, free energy, inference, perception, action, learning, attention, selection, hierarchical]
“The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, Smith & Winkler 2006
2006smith.pdf
: “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, (2006-03-01; backlinks; similar):
Decision analysis produces measures of value such as expected net present values or expected utilities and ranks alternatives by these value estimates. Other optimization-based processes operate in a similar manner. With uncertainty and limited resources, an analysis is never perfect, so these value estimates are subject to error. We show that if we take these value estimates at face value and select accordingly, we should expect the value of the chosen alternative to be less than its estimate, even if the value estimates are unbiased. Thus, when comparing actual outcomes to value estimates, we should expect to be disappointed on average, not because of any inherent bias in the estimates themselves, but because of the optimization-based selection process. We call this phenomenon the optimizer’s curse and argue that it is not well understood or appreciated in the decision analysis and management science communities. This curse may be a factor in creating skepticism in decision makers who review the results of an analysis.
In this paper, we study the optimizer’s curse and show that the resulting expected disappointment may be substantial. We then propose the use of Bayesian methods to adjust value estimates. These Bayesian methods can be viewed as disciplined skepticism and provide a method for avoiding this postdecision disappointment.
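The curse and its Bayesian correction are easy to simulate. The following is an illustrative sketch (my construction, not the paper’s model): true values are drawn from a known Gaussian prior, estimates are unbiased, and we compare the average post-decision gap of the max-estimate choice with and without shrinking each estimate by the posterior-mean factor:

```python
import random

random.seed(0)

def selection_gap(n_alt=10, prior_sd=1.0, noise_sd=1.0, n_trials=20000,
                  shrinkage=False):
    """Average (estimate - true value) for the alternative with the highest
    estimate; positive values are post-decision disappointment."""
    k = prior_sd**2 / (prior_sd**2 + noise_sd**2)  # posterior-mean shrinkage factor
    gap = 0.0
    for _ in range(n_trials):
        true = [random.gauss(0.0, prior_sd) for _ in range(n_alt)]
        est = [t + random.gauss(0.0, noise_sd) for t in true]  # unbiased estimates
        if shrinkage:
            est = [k * e for e in est]  # shrink toward the prior mean of 0
        i = max(range(n_alt), key=lambda j: est[j])
        gap += est[i] - true[i]
    return gap / n_trials

naive = selection_gap()                   # clearly positive: chosen option disappoints
adjusted = selection_gap(shrinkage=True)  # near 0: shrunken estimates are calibrated
```

Every individual estimate is unbiased, yet selecting on the maximum induces systematic disappointment; the Bayesian adjustment (“disciplined skepticism”) removes it.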
“Three Statistical Paradoxes in the Interpretation of Group Differences: Illustrated With Medical School Admission and Licensing Data”, Wainer & Brown 2006
2006wainer.pdf
: “Three Statistical Paradoxes in the Interpretation of Group Differences: Illustrated with Medical School Admission and Licensing Data”, (2006; backlinks; similar):
Interpreting group differences observed in aggregated data is a practice that must be done with enormous care. Often the truth underlying such data is quite different than a naïve first look would indicate. The confusions that can arise are so perplexing that some of the more frequently occurring ones have been dubbed paradoxes. In this chapter we describe three of the best known of these paradoxes—Simpson’s Paradox, Kelley’s Paradox, and Lord’s Paradox—and illustrate them in a single data set. The data set contains the score distributions, separated by race, on the biological sciences component of the Medical College Admission Test (MCAT) and Step 1 of the United States Medical Licensing Examination™ (USMLE). Our goal in examining these data was to move toward a greater understanding of race differences in admissions policies in medical schools. As we demonstrate, the path toward this goal is hindered by differences in the score distributions which give rise to these three paradoxes. The ease with which we were able to illustrate all of these paradoxes within a single data set is indicative of how widespread they are likely to be in practice.
“Estimation of Non-Normalized Statistical Models by Score Matching”, Hyvarinen 2005
“Estimation of Non-Normalized Statistical Models by Score Matching”, (2005-04; backlinks; similar):
One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant.
Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data.
While the estimation of the gradient of log-density function is, in principle, a very difficult nonparametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model.
The validity of the [score-matching] method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data.
[Keywords: statistical estimation, non-normalized densities, pseudolikelihood, Markov chain Monte Carlo, contrastive divergence]
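For intuition, a toy numeric sketch (my construction, not one of the paper’s experiments): for a Gaussian model with unnormalized density exp(−(x − μ)²⁄2σ²), the model score is ψ(x) = −(x − μ)⁄σ² with ψ′(x) = −1⁄σ², so the empirical score-matching objective E[½ψ² + ψ′] can be evaluated with no normalizing constant, and its minimizer recovers the data’s mean and variance:

```python
import random

random.seed(1)
data = [random.gauss(2.0, 3.0) for _ in range(2000)]  # samples from N(2, 9)

def J(mu, s2):
    """Empirical score-matching objective E[0.5*psi(x)^2 + psi'(x)] for the
    Gaussian model: psi(x) = -(x - mu)/s2, psi'(x) = -1/s2. Note that the
    normalizing constant of the density never appears."""
    return sum(0.5 * ((x - mu) / s2) ** 2 - 1.0 / s2 for x in data) / len(data)

# Brute-force grid search over (mu, s2); the minimizer lands near (2, 9).
score, mu_hat, s2_hat = min(
    (J(m * 0.5, v * 0.5), m * 0.5, v * 0.5)
    for m in range(0, 9)      # mu in 0.0 .. 4.0
    for v in range(10, 31))   # s2 in 5.0 .. 15.0
```

For the Gaussian this reproduces the maximum-likelihood estimates; the point of the method is that the same recipe works when the normalizer is intractable.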
“The Bayesian Brain: the Role of Uncertainty in Neural Coding and Computation”, Knill & Pouget 2004
2004knill.pdf
: “The Bayesian brain: the role of uncertainty in neural coding and computation”, (2004-12; backlinks; similar):
To use sensory information efficiently to make judgments and guide action in the world, the brain must represent and use information about uncertainty in its computations for perception and action. Bayesian methods have proven successful in building computational theories for perception and sensorimotor control, and psychophysics is providing a growing body of evidence that human perceptual computations are ‘Bayes’ optimal’. This leads to the ‘Bayesian coding hypothesis’: that the brain represents sensory information probabilistically, in the form of probability distributions. Several computational schemes have recently been proposed for how this might be achieved in populations of neurons. Neurophysiological data on the hypothesis, however, is almost nonexistent. A major challenge for neuroscientists is to test these ideas experimentally, and so determine whether and how neurons code information about sensory uncertainty.
“Methods of Meta-Analysis: Correcting Error and Bias in Research Findings”, Hunter & Schmidt 2004
2004hunterschmidtmethodsofmetaanalysis.pdf
: “Methods of Meta-Analysis: Correcting Error and Bias in Research Findings”, John E. Hunter, Dr. Frank L. Schmidt (2004-01-01; backlinks)
“Bayesian Informal Logic and Fallacy”, Korb 2004
2003korb.pdf
: “Bayesian Informal Logic and Fallacy”, (2004; similar):
Bayesian reasoning has been applied formally to statistical inference, machine learning and analysing scientific method. Here I apply it informally to more common forms of inference, namely natural language arguments. I analyse a variety of traditional fallacies, deductive, inductive and causal, and find more merit in them than is generally acknowledged. Bayesian principles provide a framework for understanding ordinary arguments which is well worth developing.
“Two Statistical Paradoxes in the Interpretation of Group Differences: Illustrated With Medical School Admission and Licensing Data”, Wainer & Brown 2004
2004wainer.pdf
: “Two Statistical Paradoxes in the Interpretation of Group Differences: Illustrated with Medical School Admission and Licensing Data”, (2004; similar):
Interpreting group differences observed in aggregated data is a practice that must be done with enormous care. Often the truth underlying such data is quite different than a naïve first look would indicate. The confusions that can arise are so perplexing that some of the more frequently occurring ones have been dubbed paradoxes. This article describes two of these paradoxes—Simpson’s paradox and Lord’s paradox—and illustrates them in a single dataset. The dataset contains the score distributions, separated by race, on the biological sciences component of the Medical College Admission Test (MCAT) and Step 1 of the United States Medical Licensing Examination™ (USMLE). Our goal in examining these data was to move toward a greater understanding of race differences in admissions policies in medical schools. As we demonstrate, the path toward this goal is hindered by differences in the score distributions which give rise to these two paradoxes. The ease with which we were able to illustrate both of these paradoxes within a single dataset is indicative of how widespread they are likely to be in practice.
[Keywords: group differences, Lord’s paradox, Medical College Admission Test, Rubin’s model for causal inference, Simpson’s paradox, standardization, United States Medical Licensing Examination]
“Bayesian Computation: a Statistical Revolution”, Brooks 2003
2003brooks.pdf
: “Bayesian computation: a statistical revolution”, (2003-11-03; backlinks; similar):
The 1990s saw a statistical revolution sparked predominantly by the phenomenal advances in computing technology from the early 1980s onwards. These advances enabled the development of powerful new computational tools, which reignited interest in a philosophy of statistics that had lain almost dormant since the turn of the century.
In this paper we briefly review the historic and philosophical foundations of the 2 schools of statistical thought, before examining the implications of the reascendance of the Bayesian paradigm for both current and future statistical practice.
[Keywords: computer packages [BUGS], Markov chain Monte Carlo, model discrimination, population ecology, prior beliefs, posterior distribution]
“A Bayesian Framework for Reinforcement Learning”, Strens 2000
“A Bayesian Framework for Reinforcement Learning”, (2000-06-28; backlinks; similar):
The reinforcement learning problem can be decomposed into two parallel types of inference: (1) estimating the parameters of a model for the underlying process; (2) determining behavior which maximizes return under the estimated model.
Following Dearden et al 1999, it is proposed that the learning process estimates online the full posterior distribution over models.
To determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming. By using a different hypothesis for each trial appropriate exploratory and exploitative behavior is obtained.
This Bayesian method always converges to the optimal policy for a stationary process with discrete states.
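A minimal analogue of this posterior-sampling scheme (an illustration, not Strens’ algorithm: a Bernoulli bandit stands in for the full MDP, a Beta posterior per arm for the distribution over models, and a simple argmax for the dynamic-programming step):

```python
import random

random.seed(0)
true_p = [0.3, 0.5, 0.7]   # unknown arm success rates
alpha = [1.0, 1.0, 1.0]    # Beta(1, 1) uniform prior per arm
beta = [1.0, 1.0, 1.0]
pulls = [0, 0, 0]

for _ in range(5000):
    # Sample one hypothesis (a success rate per arm) from the posterior...
    hypothesis = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    # ...and act greedily with respect to that sampled hypothesis.
    arm = max(range(3), key=lambda i: hypothesis[i])
    reward = 1 if random.random() < true_p[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1
# pulls concentrates on the best arm (index 2) as its posterior sharpens.
```

Early on the sampled hypotheses vary widely (exploration); as posteriors sharpen, the greedy choice is almost always the truly best arm (exploitation), with no explicit exploration parameter.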
“Kelley’s Paradox”, Wainer 2000
2000wainer.pdf
: “Kelley’s Paradox”, Howard Wainer (2000-01-01; backlinks)
“Statistical Issues in the Analysis of Data Gathered in the New Designs”, Kadane & Seidenfeld 1996
1996kadane.pdf
: “Statistical Issues in the Analysis of Data Gathered in the New Designs”, Joseph B. Kadane, Teddy Seidenfeld (1996-01-01; backlinks)
“Is There Sufficient Historical Evidence to Establish the Resurrection of Jesus?”, Cavin 1995
1995cavin.pdf
: “Is There Sufficient Historical Evidence to Establish the Resurrection of Jesus?”, Robert Greg Cavin (1995-01-01; backlinks)
“The Relevance of Group Membership for Personnel Selection: A Demonstration Using Bayes' Theorem”, Miller 1994
1994miller.pdf
: “The Relevance of Group Membership for Personnel Selection: A Demonstration Using Bayes' Theorem”, (1994-09-01; backlinks; similar):
A Bayesian approach to problems of personnel selection implies a fundamental conflict between nondiscrimination and merit selection. Groups—such as ethnic groups, sexes and races—do differ in various attributes relevant to vocational success, including intelligence and personality.
This journal has repeatedly discussed the technical and ethical issues raised by the existence of groups (races, sexes, ethnic groups) that frequently differ in abilities and other job-related characteristics (Eysenck, 1991; Jensen, 1992; Levin, 1990, 1991). This paper is meant to add to that discussion by providing mathematical proof that consideration of such groups is, in general, necessary in selecting the best employees or students.
It is almost an article of faith that race, sex, religion, national origin, or similar classifications (which will be referred to here as groups) are irrelevant for hiring, given a goal of selecting the best candidates. The standard wisdom is that those selecting for school admission or employment should devise an unbiased (in the statistical sense) procedure which predicts individual performance, evaluate individuals with this, and then select the highest ranked individuals. However, analysis shows that even with statistically unbiased evaluation procedures, group membership may still be relevant. If the goal is to pick the best individuals for jobs or training, membership in the group with the lower average performance (the disadvantaged group) should properly be held against the individual. In general, not considering group membership and selecting the best candidates are mutually exclusive.
…Related Psychometric Discussions: How does the conclusion reached above about the relevance of groups membership relate to discussions in the technical psychometric literature?
At least some psychometricians have been aware of the relevance of group membership. Hunter & Schmidt 1976 point out that differences in group means will typically lead to differences in intercepts. Jensen (1980, p. 94, Bias in Mental Testing) points out that the best estimate of true scores is obtained by regressing observed scores towards the mean, and that if there are 2 groups with different means, the downwards correction for the high scoring individuals will be greater for those from the low scoring group. Kelley (1947, p. 409, Fundamentals of Statistics) put it as follows: “This is an interesting equation in that it expresses the estimate of true ability as a weighted sum of 2 separate estimates, one based upon the individual’s observed score, X_{1}, and the other based upon the mean of the group to which he belongs, M_{1}. If the test is highly reliable, much weight is given to the test score and little to the group mean, and vice versa”, although he may not have been thinking of demographic groups. Cronbach, Gleser, Nanda, and Rajaratnam (1972, The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles) discuss the problem of deducing universe scores (essentially true scores in traditional terminology) from test data, recognizing that group means will be relevant. They even display an awareness that, since blacks normally score lower than whites, the logic of their reasoning calls for the use of higher cutoff scores for blacks than for whites (see p. 385). Mislevy (1993) also displays an awareness that group means are relevant, although he feels it would be unfair to use them.
In general, the relevance of group membership has been known to the specialist psychometric community, although few outside the community are aware of the effect. Thus, the contribution of Bayes’ theorem is to provide another demonstration, one that those outside the psychometric community may be more comfortable with.
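Kelley’s 1947 equation quoted above is short enough to state directly in code; the group means and reliability below are hypothetical numbers for illustration:

```python
# Kelley's estimator: regress the observed score toward the mean of the
# test-taker's group, in proportion to the test's unreliability.
def kelley_true_score(observed, group_mean, reliability):
    return reliability * observed + (1 - reliability) * group_mean

# Hypothetical groups with means 100 and 90, and a test reliability of 0.8:
# the same observed score of 110 yields different estimated true scores.
high_group = kelley_true_score(110, group_mean=100, reliability=0.8)  # ~108
low_group = kelley_true_score(110, group_mean=90, reliability=0.8)    # ~106
```

This two-point gap is exactly the Bayesian relevance of group membership the paper demonstrates: with imperfect measurement, the best estimate of true ability depends on the group mean as well as the individual score.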
“Subjective Probability”, Wright & Ayton 1994
1994wrightsubjectiveprobability.pdf
: “Subjective Probability”, George Wright, Peter Ayton (1994-01-01; backlinks)
“The Influence of Prior Beliefs on Scientific Judgments of Evidence Quality”, Koehler 1993
1993koehler.pdf
: “The Influence of Prior Beliefs on Scientific Judgments of Evidence Quality”, (1993-10-01; similar):
This paper is concerned with the influence of scientists’ prior beliefs on their judgments of evidence quality.
A laboratory experiment using advanced graduate students in the sciences (study 1) and an experimental survey of practicing scientists on opposite sides of a controversial issue (study 2) revealed agreement effects. Research reports that agreed with scientists’ prior beliefs were judged to be of higher quality than those that disagreed.
In study 1, a prior belief strength × agreement interaction was found, indicating that the agreement effect was larger among scientists who held strong prior beliefs. In both studies, the agreement effect was larger for general, evaluative judgments (eg. relevance, methodological quality, results clarity) than for more specific, analytical judgments (eg. adequacy of randomization procedures).
A Bayesian analysis indicates that the pattern of agreement effects found in these studies may be normatively defensible, although arguments against implementing a Bayesian approach to scientific judgment are also advanced.
“Smoking As 'independent' Risk Factor for Suicide: Illustration of an Artifact from Observational Epidemiology?”, Smith et al 1992
1992smith.pdf
: “Smoking as 'independent' risk factor for suicide: illustration of an artifact from observational epidemiology?”, George Davey Smith, Andrew N. Phillips, James D. Neaton (1992-01-01; backlinks)
“Bias in Relative Odds Estimation owing to Imprecise Measurement of Correlated Exposures”, Phillips & Smith 1992
1992phillips.pdf
: “Bias in relative odds estimation owing to imprecise measurement of correlated exposures”, Andrew N. Phillips, George Davey Smith (1992-01-01; backlinks)
“How Independent Are 'independent' Effects? Relative Risk Estimation When Correlated Exposures Are Measured Imprecisely”, Phillips & Smith 1991
1991phillips.pdf
: “How independent are 'independent' effects? Relative risk estimation when correlated exposures are measured imprecisely”, Andrew N. Phillips, George Davey Smith (1991-01-01; backlinks)
“Bayes-Hermite Quadrature”, O’Hagan 1991
1991ohagan.pdf
: “Bayes-Hermite quadrature”, Andrew O’Hagan (1991-01-01)
“The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators”, Stigler 1990
1990stigler.pdf
: “The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators”, (1990-02-01; backlinks; similar):
More than 30 years ago, Charles Stein discovered that in 3 or more dimensions, the ordinary estimator of the vector of means of a multivariate normal distribution is inadmissible.
This article examines Stein’s paradox from the perspective of an earlier century and shows that from that point of view the phenomenon is transparent. Furthermore, this earlier perspective leads to a relatively simple rigorous proof of Stein’s result, and the perspective can be extended to cover other situations, such as the simultaneous estimation of several Poisson means.
The relationship of this perspective to other earlier work, including the empirical Bayes approach, is also discussed.
[Keywords: admissibility, Empirical Bayes, JamesStein estimation, Poisson distribution, regression, Stein paradox]
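A small simulation sketch of the phenomenon the Galtonian argument explains (positive-part James-Stein shrinkage toward 0; the setup is my illustration, not the article’s):

```python
import random

random.seed(0)

def james_stein(z):
    """Positive-part James-Stein: shrink the observation vector toward 0."""
    p = len(z)
    s = sum(x * x for x in z)
    factor = max(0.0, 1.0 - (p - 2) / s)
    return [factor * x for x in z]

p, trials = 10, 2000
theta = [random.gauss(0.0, 1.0) for _ in range(p)]  # fixed, unknown true means
mse_raw = mse_js = 0.0
for _ in range(trials):
    z = [t + random.gauss(0.0, 1.0) for t in theta]  # unbiased observations
    js = james_stein(z)
    mse_raw += sum((a - t) ** 2 for a, t in zip(z, theta))
    mse_js += sum((a - t) ** 2 for a, t in zip(js, theta))
# mse_js comes out well below mse_raw: shrinkage beats the raw observations
# in total squared error, even though each raw coordinate is unbiased.
```

In Stigler’s regression perspective, the shrinkage factor is just the slope of the regression of true means on observed values, which is less than 1 whenever observations contain noise.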
“The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator”, Norton 1984
1984norton.pdf
: “The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator”, Robert M. Norton (1984-01-01)
“Interpreting Regression toward the Mean in Developmental Research”, Furby 1973
1973furby.pdf
: “Interpreting regression toward the mean in developmental research”, (1973; backlinks; similar):
Explicates the fundamental nature of regression toward the mean, which is frequently misunderstood by developmental researchers. While errors of measurement are commonly assumed to be the sole source of regression effects, the latter also are obtained with errorless measures. The conditions under which regression phenomena can appear are first clearly defined. Next, an explanation of regression effects is presented which applies both when variables contain errors of measurement and when they are errorless. The analysis focuses on cause and effect relationships of psychologically meaningful variables. Finally, the implications for interpreting regression effects in developmental research are illustrated with several empirical examples.
“Nuisance Variables and the Ex Post Facto Design”, Meehl 1970
1970meehl.pdf
: “Nuisance Variables and the Ex Post Facto Design”, Paul E. Meehl (1970-01-01; backlinks)
“Control of Spurious Association and the Reliability of the Controlled Variable”, Kahneman 1965
1965kahneman.pdf
: “Control of spurious association and the reliability of the controlled variable”, Daniel Kahneman (1965-01-01; backlinks)
“Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers”, Mosteller & Wallace 1963
1963mosteller.pdf
: “Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers”, (1963-06; similar):
This study has four purposes: to provide a comparison of discrimination methods; to explore the problems presented by techniques based strongly on Bayes’ theorem when they are used in a data analysis of large scale; to solve the authorship question of The Federalist papers; and to propose routine methods for solving other authorship problems.
Word counts are the variables used for discrimination. Since the topic written about heavily influences the rate with which a word is used, care in selection of words is necessary. The filler words of the language such as ‘an’, ‘of’, and ‘upon’, and, more generally, articles, prepositions, and conjunctions provide fairly stable rates, whereas more meaningful words like ‘war’, ‘executive’, and ‘legislature’ do not.
After an investigation of the distribution of these counts, the authors execute an analysis employing the usual discriminant function and an analysis based on Bayesian methods. The conclusions about the authorship problem are that Madison rather than Hamilton wrote all 12 of the disputed papers.
The findings about methods are presented in the closing section on conclusions.
This report, summarizing and abbreviating a forthcoming monograph^{8}, gives some of the results but very little of their empirical and theoretical foundation. It treats two of the four main studies presented in the monograph, and none of the side studies.
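Schematically, the Bayesian discrimination amounts to comparing the likelihood of a text’s function-word counts under per-author rate models. A toy sketch (the word rates below are made-up stand-ins, not Mosteller & Wallace’s estimates, and a plain Poisson replaces their negative-binomial refinement):

```python
import math

# Hypothetical per-1000-word rates for two "marker" function words; the real
# study estimated such rates from papers of known authorship.
rates = {
    "hamilton": {"upon": 3.0, "whilst": 0.1},
    "madison": {"upon": 0.2, "whilst": 1.0},
}

def log_likelihood(author, counts, length=2000):
    """Poisson log-likelihood of observed word counts in a text of given length."""
    ll = 0.0
    for word, c in counts.items():
        lam = rates[author][word] * length / 1000.0       # expected count
        ll += c * math.log(lam) - lam - math.lgamma(c + 1)  # Poisson log-pmf
    return ll

disputed = {"upon": 0, "whilst": 2}  # 'upon' absent, 'whilst' used twice
verdict = max(rates, key=lambda a: log_likelihood(a, disputed))
# verdict == "madison": a missing 'upon' plus two uses of 'whilst' weigh
# heavily toward Madison.
```

With a prior over authors, the log-likelihood difference becomes a posterior odds ratio; summed over many stable function words, such ratios become decisive, which is how the study attributes the disputed papers.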
“The Argentine Writer and Tradition”, Borges 1951
1951borgestheargentinewriterandtradition.pdf
: “The Argentine Writer and Tradition”, (1951; backlinks; similar):
[Borges considers the problem of whether Argentinian writing on nonArgentinian subjects can still be truly “Argentine.” His conclusion:]
…We should not be alarmed and that we should feel that our patrimony is the universe; we should essay all themes, and we cannot limit ourselves to purely Argentine subjects in order to be Argentine; for either being Argentine is an inescapable act of fate—and in that case we shall be so in all events—or being Argentine is a mere affectation, a mask. I believe that if we surrender ourselves to that voluntary dream which is artistic creation, we shall be Argentine and we shall also be good or tolerable writers.
“Probability and the Weighing of Evidence”, Good 1950 (page 96)
1950goodprobabilityandtheweighingofevidence.pdf#page=96
: “Probability and the Weighing of Evidence”, I. J. Good (1950-01-01; backlinks)
“Evaluating the Effect of Inadequately Measured Variables in Partial Correlation Analysis”, Stouffer 1936
1936stouffer.pdf
: “Evaluating the Effect of Inadequately Measured Variables in Partial Correlation Analysis”, (1936; backlinks; similar):
It is not generally recognized that such an analysis [using regression] assumes that each of the variables is perfectly measured, such that a second measure X’_{i}, of the variable measured by X_{i}, has a correlation of unity with X_{i}. If some of the measures are more accurate than others, the analysis is impaired [by measurement error]. For example, the sociologist may have a problem in which an index of economic status and an index of nativity are independent variables. What is the effect, if the index of economic status is much less satisfactory than the index of nativity? Ordinarily, the effect will be to underestimate the [coefficient] of the less adequately measured variable and to overestimate the [coefficient] of the more adequately measured variable.
If either the reliability or validity of an index is in question, at least two measures of the variable are required to permit an evaluation. The purpose of this paper is to provide a logical basis and a simple arithmetical procedure (a) for measuring the effect of the use of 2 indexes, each of one or more variables, in partial and multiple correlation analysis and (b) for estimating the likely effect if 2 indexes, not available, could be secured.
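Stouffer’s point is easy to reproduce by simulation. A sketch under assumed parameters (true coefficients of 1 on both predictors, correlation 0.6 between them, and measurement error only on the first):

```python
import random

random.seed(0)
n = 20000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.6 * a + 0.8 * random.gauss(0, 1) for a in x1]     # corr(x1, x2) = 0.6
y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]  # true coefficients: 1, 1
x1_obs = [a + random.gauss(0, 1) for a in x1]             # x1 measured unreliably

def ols2(u, v, w):
    """OLS coefficients of w ~ u + v via the centered 2x2 normal equations."""
    mu, mv, mw = sum(u) / n, sum(v) / n, sum(w) / n
    u = [a - mu for a in u]; v = [a - mv for a in v]; w = [a - mw for a in w]
    suu = sum(a * a for a in u); svv = sum(a * a for a in v)
    suv = sum(a * b for a, b in zip(u, v))
    suw = sum(a * b for a, b in zip(u, w)); svw = sum(a * b for a, b in zip(v, w))
    det = suu * svv - suv * suv
    return (svv * suw - suv * svw) / det, (suu * svw - suv * suw) / det

b1, b2 = ols2(x1_obs, x2, y)
# b1 comes out well below 1 (attenuated) and b2 well above 1 (inflated):
# error in one index biases *both* coefficients, overstating the
# better-measured variable, exactly as the paper warns.
```

(Under these parameters the population values work out to about b1 ≈ 0.39 and b2 ≈ 1.37, versus the true 1 and 1.)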
“Interpretation of Educational Measurements”, Kelley 1927
1927kelleyinterpretationofeducationalmeasurements.pdf
: “Interpretation of Educational Measurements”, (1927; backlinks; similar):
[Historically notable for introducing Kelley’s paradox, another fallacy related to regression to the mean.] Among the outstanding contributions of the book are (1) the judgments of the relative excellence of assorted tests in some 70 fields of accomplishment, by Kelley, Franzen, Freeman, McCall, Otis, Trabue and Van Wagenen; (2) detailed and exact information on the statistical and other characteristics of the same tests, based on a questionnaire addressed to the test authors or (in the absence of reply) estimates by Kelley on the best data available; (3) a chapter of 47 pages condensing all the principal elementary statistical methods. In addition, there is constant emphasis upon the importance of the probable error, with some illustrative applications; for example, it is maintained that about 90% of the abilities measured by our best “intelligence” and “achievement” tests are (due chiefly to the size of the probable errors) the same ability. A chapter sets forth the analytical procedures which lead to this conclusion and to four others earlier enunciated. “Idiosyncrasy”, or inequality among abilities, which the author regards as highly valuable, is considered in two chapters; the remainder of the volume is devoted to a historical sketch of the mental test movement and a statement of the purposes of tests, the latter being illustrated by appropriate chapters.
“Mr Keynes on Probability [review of J. M. Keynes, _A Treatise on Probability_, 1921]”, Ramsey 1922
1922ramsey.pdf
: “Mr Keynes on Probability [review of J. M. Keynes, _A Treatise on Probability_, 1921]”, Frank P. Ramsey (1922-01-01; backlinks)
“Philosophical Essay on Probabilities, Chapter 11: Concerning the Probabilities of Testimonies”, Laplace 1814
1814laplacephilosophicalessayonprobabilitiesch5probabilitiestestimonies.pdf
: “Philosophical Essay on Probabilities, Chapter 11: Concerning the Probabilities of Testimonies”, (1814; backlinks; similar):
The majority of our opinions being founded on the probability of proofs it is indeed important to submit it to calculus. Things it is true often become impossible by the difficulty of appreciating the veracity of witnesses and by the great number of circumstances which accompany the deeds they attest; but one is able in several cases to resolve the problems which have much analogy with the questions which are proposed and whose solutions may be regarded as suitable approximations to guide and to defend us against the errors and the dangers of false reasoning to which we are exposed. An approximation of this kind, when it is well made, is always preferable to the most specious reasonings.
We would give no credence to the testimony of a man who should attest to us that in throwing a hundred dice into the air they had all fallen on the same face. If we had ourselves been spectators of this event we should believe our own eyes only after having carefully examined all the circumstances, and after having brought in the testimonies of other eyes in order to be quite sure that there had been neither hallucination nor deception. But after this examination we should not hesitate to admit it in spite of its extreme improbability; and no one would be tempted, in order to explain it, to recur to a denial of the laws of vision. We ought to conclude from it that the probability of the constancy of the laws of nature is for us greater than this, that the event in question has not taken place at all a probability greater than that of the majority of historical facts which we regard as incontestable. One may judge by this the immense weight of testimonies necessary to admit a suspension of natural laws, and how improper it would be to apply to this case the ordinary rules of criticism. All those who without offering this immensity of testimonies support this when making recitals of events contrary to those laws, decrease rather than augment the belief which they wish to inspire; for then those recitals render very probable the error or the falsehood of their authors. But that which diminishes the belief of educated men increases often that of the uneducated, always greedy for the wonderful.
The action of time enfeebles then, without ceasing, the probability of historical facts just as it changes the most durable monuments. One can indeed diminish it by multiplying and conserving the testimonies and the monuments which support them. Printing offers for this purpose a great means, unfortunately unknown to the ancients. In spite of the infinite advantages which it procures the physical and moral revolutions by which the surface of this globe will always be agitated will end, in conjunction with the inevitable effect of time, by rendering doubtful after thousands of years the historical facts regarded today as the most certain.
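Laplace's dice example can be made quantitative with Bayes' theorem. The sketch below uses hypothetical numbers (a witness who errs only one time in a million is an illustrative assumption, not Laplace's own figure) to show why even extremely reliable testimony barely dents the improbability of the event:

```python
from fractions import Fraction

# Prior probability that 100 fair dice all land on the same face:
# 6 possible faces, each configuration with probability (1/6)^100.
prior = 6 * Fraction(1, 6) ** 100          # = (1/6)^99, roughly 1e-77

# Hypothetical witness: reports truthfully 999,999 times in a million,
# and would falsely report this particular event 1 time in a million.
p_report_given_true = Fraction(999_999, 1_000_000)
p_report_given_false = Fraction(1, 1_000_000)

# Bayes' theorem in odds form: posterior odds = prior odds x likelihood ratio.
prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * (p_report_given_true / p_report_given_false)
posterior = posterior_odds / (1 + posterior_odds)

# Even a million-to-one likelihood ratio leaves the event astronomically
# improbable: the testimony raises the probability by ~6 orders of
# magnitude, from ~1e-77 to ~1e-71.
print(float(posterior))
```

The exact rational arithmetic (`fractions.Fraction`) avoids floating-point underflow on the tiny prior; the point is Laplace's: the "immensity of testimonies" required scales with the improbability of the event attested.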
“brms: an R Package for Bayesian Generalized Multivariate Nonlinear Multilevel Models Using Stan”, Bürkner 2022
“brms: an R package for Bayesian generalized multivariate nonlinear multilevel models using Stan”, ( ; backlinks; similar):
The brms package provides an interface to fit Bayesian generalized (non)linear multivariate multilevel models using Stan, a C++ package for performing full Bayesian inference. The formula syntax is very similar to that of the package lme4, providing a familiar and simple interface for performing regression analyses. A wide range of response distributions is supported, allowing users to fit, among others, linear, robust linear, count-data, survival, response-time, ordinal, zero-inflated, and even self-defined mixture models, all in a multilevel context. Further modeling options include nonlinear and smooth terms, autocorrelation structures, censored data, missing-value imputation, and quite a few more. In addition, all parameters of the response distribution can be predicted in order to perform distributional regression. Multivariate models (ie. models with multiple response variables) can be fit as well.
Prior specifications are flexible and explicitly encourage users to apply prior distributions that actually reflect their beliefs.
Model fit can easily be assessed and compared with posterior predictive checks, crossvalidation, and Bayes factors.
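brms itself fits such models by MCMC in Stan; the Python sketch below is not brms, but illustrates the partial pooling at the heart of multilevel modeling in the one case with a closed form: normal group intercepts with known observation variance σ² and known group-level variance τ². Each group's posterior mean is a precision-weighted average of its own sample mean and the population mean μ (all values below are toy assumptions):

```python
def pooled_mean(ys, mu, sigma2, tau2):
    """Posterior mean of one group's intercept under a normal-normal model.

    Prior: theta_j ~ Normal(mu, tau2); data: y_i ~ Normal(theta_j, sigma2).
    The posterior mean is a precision-weighted average of the group's
    sample mean and the population mean mu (shrinkage toward mu).
    """
    n = len(ys)
    ybar = sum(ys) / n
    w_data = n / sigma2        # precision contributed by the group's data
    w_prior = 1.0 / tau2       # precision contributed by the group-level prior
    return (w_data * ybar + w_prior * mu) / (w_data + w_prior)

# Toy data: three groups of unequal size around a population mean of 10.
groups = {"a": [12.0, 11.5, 12.5], "b": [8.0, 8.5], "c": [10.2]}
mu, sigma2, tau2 = 10.0, 1.0, 4.0

shrunk = {g: pooled_mean(ys, mu, sigma2, tau2) for g, ys in groups.items()}
# Each estimate lies between its group mean and mu, and smaller groups
# are pulled more strongly toward mu than larger ones.
print(shrunk)
```

In brms the analogous model would be written with lme4-style formula syntax, roughly `brm(y ~ 1 + (1 | group))`, with the crucial difference that μ, σ², and τ² are not fixed by hand but given priors and estimated jointly with the group intercepts.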
Thompson sampling
Proving too much
Particle filter
Monte Carlo tree search
Gaussian process
Miscellaneous

2011sainudiin.pdf
(2011-01-01; backlinks) 
1987ayton.pdf
(1987-01-01; backlinks) 
1973meshalkincollectionofproblemsinprobabilitytheory.pdf
(1973-01-01; ; backlinks) 
1942thorndike.pdf
, Robert L. Thorndike (1942-01-01; ; backlinks) 
https://www.sumsar.net/blog/2015/01/probablepointsandcredibleintervalsparttwo/
( ) 
https://www.sumsar.net/blog/2014/10/tinydataandthesocksofkarlbroman/
(backlinks) 
https://www.bzarg.com/p/howakalmanfilterworksinpictures/

https://www.authorea.com/users/429500/articles/533177modellingatimeseriesofrecordsinpymc3
( ) 
https://towardsdatascience.com/neuralnetworksarefundamentallybayesianbee9a172fad8
( ) 
https://static.googleusercontent.com/media/research.google.com/en/pubs/archive/46507.pdf
( ) 
https://proceedings.neurips.cc/paper/2010/file/edfbe1afcf9246bb0d40eb4d8027d90fPaper.pdf
( ) 
https://lilianweng.github.io/lillog/2022/02/20/activelearning.html
( ) 
https://genomebiology.biomedcentral.com/articles/10.1186/s13059

https://bair.berkeley.edu/blog/2021/11/05/epistemicpomdp/
( ) 
https://archive.org/details/in.ernet.dli.2015.263082/page/n421/mode/2up
(backlinks) 
https://answers.google.com/answers/threadview?id=714986
(backlinks) 
https://allendowney.blogspot.com/2015/08/theinspectionparadoxiseverywhere.html
(backlinks) 
https://academic.oup.com/aje/article/166/6/646/89040
(backlinks) 
http://www.vietnamwar.net/84charliemopicscript.htm
(backlinks) 
http://www.scifiscripts.com/scripts/banzai_script.txt
(backlinks) 
http://www.mediafire.com/error.php?errno=320&origin=download
(backlinks) 
http://www.elizabethcallaway.net/goodomensstylometry
(backlinks) 
http://www.dtic.mil/cgibin/GetTRDoc?AD=ADA099503
(backlinks) 
http://www.dcscience.net/2015/12/11/placeboeffectsareweakregressiontothemeanisthemainreasonineffectivetreatmentsappeartowork/
(backlinks) 
http://www.dailyscript.com/scripts/fearandloathing.html
(backlinks) 
http://www.animevice.com/news/rumoralertdeathnotemoviescriptleaked/2653/
(backlinks) 
http://www.animevice.com/deathnotehollywoodliveaction/131374/rumoralertdeathnotemoviescriptleaked/97207040/?page=2#jspostbody117149
(backlinks) 
http://www.animevice.com/deathnotehollywoodliveaction/131374/rumoralertdeathnotemoviescriptleaked/97207040/#jspostbody116650
(backlinks) 
http://scifiscripts.com/scripts/twelvemonkeys.txt
(backlinks) 
http://rationallyspeaking.blogspot.com/2012/11/oddsagainbayesmadeusable.html
(backlinks) 
http://maxstock.scripts.mit.edu/blog/2019/01/13/278/
(backlinks) 
http://joelvelasco.net/teaching/3865/zabell%20%20symmetry%20and%20its%20discontents.pdf
(backlinks) 
http://blog.sigfpe.com/2012/12/shufflesbayestheoremandcontinuations.html
( ) 
1957felleranintroductiontoprobabilitytheoryanditsapplications.pdf
( ; backlinks) 
1923kelley.pdf
( ; backlinks) 
1968cohen.pdf
( ; backlinks) 
1965black.pdf
( ; backlinks) 
2004lawlor.pdf
( ; backlinks) 
1989konold.pdf
( ; backlinks) 
1963asimov
( ; backlinks; similar) 
2012gwerndeathnotescriptstylometricanalysis.tar.xz
( ; backlinks)