A key sticking point of Bayesian analysis is the choice of prior distribution, and there is a vast literature on potential defaults, including uniform priors, Jeffreys’ priors, reference priors, maximum entropy priors, and weakly informative priors. These methods, however, often manifest a key conceptual tension in prior modeling: a model encoding true prior information should be chosen without reference to the model of the measurement process, but almost all common prior modeling techniques are implicitly motivated by a reference likelihood. In this paper we resolve this apparent paradox by placing the choice of prior into the context of the entire Bayesian analysis, from inference to prediction to model evaluation.
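The tension between prior and likelihood can be made concrete with a prior predictive check. The sketch below is illustrative, not an example from the paper (the logistic-intercept model and the N(0, scale) priors are assumptions): a prior that looks diffuse on the logit scale in fact concentrates most of its prior predictive mass on extreme event probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior predictive check: what does a prior on a logit-scale intercept
# imply about the observable event probability? (Illustrative sketch;
# the N(0, scale) priors are assumptions, not the paper's example.)
def frac_extreme(scale, draws=100_000):
    alpha = rng.normal(0.0, scale, size=draws)  # prior draws for the intercept
    p = 1.0 / (1.0 + np.exp(-alpha))            # implied P(event)
    return np.mean((p < 0.01) | (p > 0.99))

wide = frac_extreme(10.0)   # a "diffuse" prior on the logit scale
narrow = frac_extreme(1.0)  # a unit-scale prior
```

A prior that seems noncommittal in parameter space can thus be highly informative about observables, which is why a prior can only be judged with reference to the measurement model it will be combined with.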
“Life Before Earth”, (2013-03-28):
An extrapolation of the genetic complexity of organisms to earlier times suggests that life began before the Earth was formed. Life may have started from systems with single heritable elements that are functionally equivalent to a nucleotide. The genetic complexity, roughly measured by the number of non-redundant functional nucleotides, is expected to have grown exponentially due to several positive feedback factors: gene cooperation, duplication of genes with their subsequent specialization, and emergence of novel functional niches associated with existing genes. Linear regression of genetic complexity on a log scale extrapolated back to just one base pair suggests the time of the origin of life 9.7 billion years ago. This cosmic time scale for the evolution of life has important consequences: life took ca. 5 billion years to reach the complexity of bacteria; the environments in which life originated and evolved to the prokaryote stage may have been quite different from those envisaged on Earth; there was no intelligent life in our universe prior to the origin of Earth, thus Earth could not have been deliberately seeded with life by intelligent aliens; Earth was seeded by panspermia; experimental replication of the origin of life from scratch may have to emulate many cumulative rare events; and the Drake equation for guesstimating the number of civilizations in the universe is likely wrong, as intelligent life has just begun appearing in our universe. Evolution of advanced organisms has accelerated via development of additional information-processing systems: epigenetic memory, primitive mind, multicellular brain, language, books, computers, and Internet. As a result the doubling time of complexity has reached ca. 20 years. Finally, we discuss the issue of the predicted technological singularity and give a biosemiotics perspective on the increase of complexity.
Background: The size of non-redundant functional genome can be an indicator of biological complexity of living organisms. Several positive feedback mechanisms including gene cooperation and duplication with subsequent specialization may result in the exponential growth of biological complexity in macro-evolution.
Results: I propose a hypothesis that biological complexity increased exponentially during evolution. Regression of the logarithm of functional non-redundant genome size versus time of origin in major groups of organisms showed a 7.8× increase per 1 billion years, and hence the increase of complexity can be viewed as a clock of macro-evolution. A strong version of the exponential hypothesis is that the rate of complexity increase in early (pre-prokaryotic) evolution of life was at most the same as (or even slower than) that observed in the evolution of prokaryotes and eukaryotes.
Conclusion: The increase of functional non-redundant genome size in macro-evolution was consistent with the exponential hypothesis. If the strong exponential hypothesis is true, then the origin of life should be dated to ~10 billion years ago. Thus, the possibility of panspermia as a source of life on Earth should be discussed on an equal basis with alternative hypotheses of de novo origin of life. Panspermia may be proven if bacteria similar to terrestrial ones are found on other planets or satellites in the Solar System.
Reviewers: This article was reviewed by Eugene V. Koonin, Chris Adami and Arcady Mushegian.
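The back-extrapolation behind the ~10-billion-year date is a one-line calculation; the sketch below uses round illustrative numbers (the bacterial genome size and age are assumptions for the arithmetic, not the paper's fitted regression data):

```python
import math

# Illustrative back-extrapolation; the genome size and age below are
# round assumed figures, not the paper's fitted data.
rate = 7.8          # complexity increase per billion years (from the abstract)
bacteria_bp = 5e5   # assumed functional non-redundant genome of bacteria, in bp
bacteria_age = 3.5  # assumed age of bacteria-grade life, billion years ago

# Billion years needed to grow from 1 bp to bacterial complexity at
# 7.8x per Gyr, added to the age of bacteria to date the origin of life.
gyr_to_bacteria = math.log(bacteria_bp) / math.log(rate)
origin_gya = bacteria_age + gyr_to_bacteria
```

With these round inputs the extrapolated origin lands near 10 billion years ago, in line with the abstract's estimate.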
“A Widely Applicable Bayesian Information Criterion”, (2012-08-31):
A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold.
Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution.
In the present paper, we define a widely applicable Bayesian information criterion (WBIC) as the average log likelihood function over the posterior distribution with inverse temperature 1/log(n), where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if the statistical model is singular for, or unrealizable by, the true distribution. Since WBIC can be numerically calculated without any information about the true distribution, it is a generalization of BIC to singular statistical models.
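For a toy conjugate model the WBIC recipe can be carried out by direct numerical integration rather than MCMC (a minimal sketch, not from the paper; the normal model, prior scale, and grid bounds are assumptions): form the posterior at inverse temperature 1/log(n) and average the negative log likelihood under it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(1.0, 1.0, size=n)  # data from N(1, 1); model variance known

# Assumed toy model: x_i ~ N(mu, 1), prior mu ~ N(0, 10^2); grid over mu
mu = np.linspace(-5.0, 5.0, 4001)
log_lik = np.array([-0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - m) ** 2)
                    for m in mu])
log_prior = -0.5 * np.log(2 * np.pi * 100.0) - mu ** 2 / 200.0

beta = 1.0 / np.log(n)  # WBIC's inverse temperature

# Tempered posterior ∝ prior * likelihood^beta, normalized on the grid
log_tempered = log_prior + beta * log_lik
w = np.exp(log_tempered - log_tempered.max())
w /= w.sum()

# WBIC: tempered-posterior average of the negative log likelihood
wbic = np.sum(w * (-log_lik))
```

For this conjugate model the exact Bayes free energy (minus log marginal likelihood) is available in closed form, and WBIC matches it up to the theory's lower-order error term, which is the point of the criterion: the same single-temperature computation works even where BIC's half-the-parameters penalty fails.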
“Estimating the evidence—a review”, (2011-11-08):
The model evidence is a vital quantity in the comparison of statistical models under the Bayesian paradigm. This paper presents a review of commonly used methods. We outline some guidelines and offer some practical advice. The reviewed methods are compared for two examples: non-nested Gaussian linear regression and covariate subset selection in logistic regression.
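One of the simplest estimators such reviews cover is naive Monte Carlo over prior draws, averaging the likelihood. A minimal sketch for a conjugate normal model (the model and prior scale are assumed here) where the exact evidence is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(0.5, 1.0, size=n)

# Assumed toy model: x_i ~ N(mu, 1), prior mu ~ N(0, 1). The evidence
# Z = ∫ p(x|mu) p(mu) dmu is estimated by averaging the likelihood
# over prior draws (unbiased, but high-variance in general).
S = 200_000
mu = rng.normal(0.0, 1.0, size=S)
log_lik = (-0.5 * n * np.log(2 * np.pi)
           - 0.5 * ((x[None, :] - mu[:, None]) ** 2).sum(axis=1))

# log-mean-exp of the likelihood over the prior draws
m = log_lik.max()
log_Z_hat = m + np.log(np.mean(np.exp(log_lik - m)))
```

This estimator is unbiased but its variance blows up when the posterior is much narrower than the prior, which is why the literature favors importance sampling, bridge sampling, and related methods for realistic models.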
2014-tenan.pdf: “Bayesian model selection: The steepest mountain to climb”, Simone Tenan, Robert B. O’Hara, Iris Hendriks, Giacomo Tavecchia
We propose a new method for approximate Bayesian statistical inference on the basis of summary statistics. The method is suited to complex problems that arise in population genetics, extending ideas developed in this setting by earlier authors. Properties of the posterior distribution of a parameter, such as its mean or density curve, are approximated without explicit likelihood calculations. This is achieved by fitting a local-linear regression of simulated parameter values on simulated summary statistics, and then substituting the observed summary statistics into the regression equation. The method combines many of the advantages of Bayesian statistical inference with the computational efficiency of methods based on summary statistics. A key advantage of the method is that the nuisance parameters are automatically integrated out in the simulation step, so that the large numbers of nuisance parameters that arise in population genetics problems can be handled without difficulty. Simulation results indicate computational and statistical efficiency that compares favorably with those of alternative methods previously proposed in the literature. We also compare the relative efficiency of inferences obtained using methods based on summary statistics with those obtained directly from the data using MCMC.
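The rejection-plus-local-linear-regression scheme can be sketched on a toy problem where the summary statistic is the sample mean (the normal model, uniform prior, 1% tolerance quantile, and Epanechnikov weights here are illustrative choices, not the paper's population-genetics application):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x_obs = rng.normal(2.0, 1.0, size=n)
s_obs = x_obs.mean()  # observed summary statistic

# Assumed prior: mu ~ Uniform(-10, 10); simulate the summary directly
# (shortcut: the sample mean of n N(mu, 1) draws is N(mu, 1/n))
S = 100_000
mu = rng.uniform(-10, 10, size=S)
s = rng.normal(mu, 1 / np.sqrt(n))

# Rejection step: keep the 1% of draws whose summary is nearest s_obs
d = np.abs(s - s_obs)
eps = np.quantile(d, 0.01)
keep = d <= eps
mu_a, s_a, d_a = mu[keep], s[keep], d[keep]

# Regression adjustment: fit a weighted local-linear regression of mu
# on s (Epanechnikov weights), then project each accepted draw to s_obs
w = 1 - (d_a / eps) ** 2
X = np.column_stack([np.ones_like(s_a), s_a - s_obs])
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], mu_a * sw, rcond=None)
mu_adj = mu_a - beta[1] * (s_a - s_obs)

post_mean = np.average(mu_adj, weights=w)
```

The regression step projects each accepted parameter value to what it would have been had its summary equaled the observed one, sharpening the plain rejection sample toward the true posterior (here approximately N(s_obs, 1/n)).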
The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east ~10,000 years ago, with a rate of westward spread of ~0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.
The demographic forces that have shaped the pattern of genetic variability in the plant species Arabidopsis thaliana provide an important backdrop for the use of this model organism in understanding the genetic determinants of plant natural variation. We investigated the demographic history of A. thaliana using novel population-genetic tools applied to a combination of molecular and geographic data. We infer that A. thaliana entered Europe from the east and spread westward at a rate of ~0.9 kilometers per year, and that its population size began increasing around 10,000 years ago. The “wave-of-advance” model suggested by these results is potentially consistent with the pattern expected if the species colonized Europe as the ice retreated at the end of the most recent glaciation. Alternatively, it is also compatible with the possibility that A. thaliana—a weedy species—may have spread into Europe with the diffusion of agriculture, providing an example of the phenomenon of “ecological imperialism” described by A. Crosby. In this framework, just as weeds from Europe invaded temperate regions worldwide during European human colonization, weeds originating from the source region of farming invaded Europe as a result of the disturbance caused by the spread of agriculture.
1959-schlaifer-probabilitystatisticsbusinessdecisions.pdf: “Probability and Statistics for Business Decisions: An Introduction to Managerial Economics Under Uncertainty”, (1959; ):
This book is a non-mathematical introduction to the logical analysis of practical business problems in which a decision must be reached under uncertainty. The analysis which it recommends is based on the modern theory of utility and what has come to be known as the “personal” definition of probability; the author believes, in other words, that when the consequences of various possible courses of action depend on some unpredictable event, the practical way of choosing the “best” act is to assign values to consequences and probabilities to events and then to select the act with the highest expected value. In the author’s experience, thoughtful businessmen intuitively apply exactly this kind of analysis in problems which are simple enough to allow of purely intuitive analysis; and he believes that they will readily accept its formalization once the essential logic of this formalization is presented in a way which can be comprehended by an intelligent layman. Excellent books on the pure mathematical theory of decision under uncertainty already exist; the present text is an endeavor to show how formal analysis of practical decision problems can be made to pay its way.
From the point of view taken in this book, there is no real difference between a “statistical” decision problem in which a part of the available evidence happens to come from a ‘sample’ and a problem in which all the evidence is of a less formal nature. Both kinds of problems are analyzed by use of the same basic principles; and one of the resulting advantages is that it becomes possible to avoid having to assert that nothing useful can be said about a sample which contains an unknown amount of bias while at the same time having to admit that in most practical situations it is totally impossible to draw a sample which does not contain an unknown amount of bias. In the same way and for the same reason there is no real difference between a decision problem in which the long-run-average demand for some commodity is known with certainty and one in which it is not; and not the least of the advantages which result from recognizing this fact is that it becomes possible to analyze a problem of inventory control without having to pretend that a finite amount of experience can ever give anyone perfect knowledge of long-run-average demand. The author is quite ready to admit that in some situations it may be difficult for the businessman to assess the numerical probabilities and utilities which are required for the kind of analysis recommended in this book, but he is confident that the businessman who really tries to make a reasoned analysis of a difficult decision problem will find it far easier to do this than to make a direct determination of, say, the correct risk premium to add to the pure cost of capital or of the correct level at which to conduct a test of statistical-significance.
In sum, the author believes that the modern theories of utility and personal probability have at last made it possible to develop a really complete theory to guide the making of managerial decisions—a theory into which the traditional disciplines of statistics and economics under certainty and the collection of miscellaneous techniques taught under the name of operations research will all enter as constituent parts. He hopes, therefore, that the present book will be of interest and value not only to students and practitioners of inventory control, quality control, marketing research, and other specific business functions but also to students of business and businessmen who are interested in the basic principles of managerial economics and to students of economics who are interested in the theory of the firm. Even the teacher of a course in mathematical decision theory who wishes to include applications as well as complete-class and existence theory may find the book useful as a source of examples of the practical decision problems which do arise in the real world.
1961-raiffa-appliedstatisticaldecisiontheory.pdf: “Applied Statistical Decision Theory”, Howard Raiffa, Robert Schlaifer