 See Also

Links
 “A Systematic Review of Human Challenge Trials, Designs, and Safety”, Adams-Phipps et al 2022
 “The InterModel Vigorish (IMV): A Flexible and Portable Approach for Quantifying Predictive Accuracy With Binary Outcomes”, Domingue et al 2022
 “The Geometry of Decision-making in Individuals and Collectives”, Sridhar et al 2021
 “Noise Increases Anchoring Effects”, Lee & Morewedge 2021
 “Prior Knowledge Elicitation: The Past, Present, and Future”, Mikkola et al 2021
 “νSDDP: Neural Stochastic Dual Dynamic Programming”, Dai et al 2021
 “A Rational Reinterpretation of Dual-process Theories”, Milli et al 2021
 “Strategically Overconfident (to a Fault): How Self-promotion Motivates Advisor Confidence”, Van Zant 2021
 “TV Advertising Effectiveness and Profitability: Generalizable Results From 288 Brands”, Shapiro et al 2021
 “Learning to Hesitate”, Descamps et al 2021
 “Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021
 “Adversarial Vulnerabilities of Human Decision-making”, Dezfouli et al 2020
 “Targeting for Long-term Outcomes”, Yang et al 2020
 “Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020
 “Robust Decision Theory and Econometrics”, Chamberlain 2020
 “Speed-accuracy Tradeoff in Plants”, Ceccarini et al 2020
 “The Secret History of Facial Recognition: Sixty Years Ago, a Sharecropper’s Son Invented a Technology to Identify Faces. Then the Record of His Role All but Vanished. Who Was Woody Bledsoe, and Who Was He Working For?”, Raviv 2020
 “A/B Testing With Fat Tails”, Azevedo et al 2019
 “Bayesian Persuasion and Information Design”, Kamenica 2019
 “Generalizable and Robust TV Advertising Effects”, Shapiro et al 2019
 “How Should We Critique Research?”, Branwen 2019
 “Meta-learning of Sequential Strategies”, Ortega et al 2019
 “Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019
 “Using the Results from Rigorous Multisite Evaluations to Inform Local Policy Decisions”, Orr et al 2019
 “Accounting Theory As a Bayesian Discipline”, Johnstone 2018
 “Evolution As Backstop for Reinforcement Learning”, Branwen 2018
 “Dog Cloning For Special Forces: Breed All You Can Breed”, Branwen 2018
 “Improving Width-based Planning With Compact Policies”, Junyent et al 2018
 “How to Train Your Oracle: The Delphi Method and Its Turbulent Youth in Operations Research and the Policy Sciences”, Dayé 2018
 “P-Hacking and False Discovery in A/B Testing”, Berman et al 2018
 “On Having Enough Socks”, Branwen 2017
 “An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits”, Sledge & Principe 2017
 “Toward a Rational and Mechanistic Account of Mental Effort”, Shenhav et al 2017
 “Pricing the Future in the 17th Century: Calculating Technologies in Competition”, Deringer 2017
 “Neural Combinatorial Optimization With Reinforcement Learning”, Bello et al 2017
 “Self-Blinded Mineral Water Taste Test”, Branwen 2017
 “The Kelly Coin-Flipping Game: Exact Solutions”, Branwen et al 2017
 “Banner Ads Considered Harmful”, Branwen 2017
 “The Risk Elicitation Puzzle”, Pedroni et al 2017
 “Was Angelina Jolie Right? Optimizing Cancer Prevention Strategies Among BRCA Mutation Carriers”, Nohdurft et al 2017
 “Internet WiFi Improvement”, Branwen 2016
 “Why Tool AIs Want to Be Agent AIs”, Branwen 2016
 “Candy Japan’s New Box A/B Test”, Branwen 2016
 “Embryo Selection For Intelligence”, Branwen 2016
 “Bitter Melon for Blood Glucose”, Branwen 2015
 “Deep DPG (DDPG): Continuous Control With Deep Reinforcement Learning”, Lillicrap et al 2015
 “The Unfavorable Economics of Measuring the Returns to Advertising”, Lewis & Rao 2015
 “When Should I Check The Mail?”, Branwen 2015
 “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Ioffe & Szegedy 2015
 “Selectiongain: an R Package for Optimizing Multistage Selection”, Mi et al 2015
 “Focusing on the Long-term: It’s Good for Users and Business”, Hohnhold et al 2015
 “Thompson Sampling With the Online Bootstrap”, Eckles & Kaptein 2014
 “Statistical Notes”, Branwen 2014
 “Playing Atari With Deep Reinforcement Learning”, Mnih et al 2013
 “On the Near Impossibility of Measuring the Returns to Advertising”, Lewis & Rao 2013
 “Caffeine Wakeup Experiment”, Branwen 2013
 “Experimental Design for Partially Observed Markov Decision Processes”, Thorbergsson & Hooker 2012
 “Rerandomization to Improve Covariate Balance in Experiments”, Morgan & Rubin 2012
 “Timing Technology: Lessons From The Media Lab”, Branwen 2012
 “A/B Testing Long-form Readability on Gwern.net”, Branwen 2012
 “Redshift Sleep Experiment”, Branwen 2012
 “Learning Is Planning: near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search”, Asmuth & Littman 2012
 “Why Philosophers Should Care About Computational Complexity”, Aaronson 2011
 “Does Retail Advertising Work? Measuring the Effects of Advertising on Sales Via a Controlled Experiment on Yahoo!”, Lewis & Reiley 2011
 “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Deisenroth & Rasmussen 2011
 “Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising”, Lewis et al 2011
 “Improving Vineyard Sampling Efficiency via Dynamic Spatially Explicit Optimisation”, Meyers et al 2011
 “The Time Resolution of the St Petersburg Paradox”, Peters 2011
 “How to Improve R&D Productivity: the Pharmaceutical Industry's Grand Challenge”, Paul et al 2010
 “Drug Harms in the UK: a Multicriteria Decision Analysis”, Nutt et al 2010
 “Adversarial Risk Analysis”, Insua et al 2009
 “Retrospectives: Guinnessometrics: The Economic Foundation of “Student’s” T”, Ziliak 2008
 “The Guidelines Manual: Chapter 8: Incorporating Health Economics in Guidelines and Assessing Resource Impact”, NICE 2007
 “On the Evolution of Investment Strategies and the Kelly Rule—A Darwinian Approach”, Lensberg & Schenk-Hoppé 2007
 “Information Systems Project Continuation in Escalation Situations: A Real Options Model”, Tiwana et al 2006
 “Decision by Sampling”, Stewart et al 2006
 “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, Smith & Winkler 2006
 “Investing in the Unknown and Unknowable”, Zeckhauser 2006
 “The Kelly Criterion in Blackjack, Sports Betting, and the Stock Market”, Thorp 2006
 “Good and Real: Demystifying Paradoxes from Physics to Ethics”, Drescher 2006
 “Policy Mining: Learning Decision Policies from Fixed Sets of Data”, Zadrozny 2003
 “John W. Tukey: His Life and Professional Contributions”, Brillinger 2002
 “Stigler’s Diet Problem Revisited”, Garille & Gass 2001
 “Should We Take Measurements at an Intermediate Design Point?”, Gelman 2000
 “Comparing Classifiers When the Misallocation Costs Are Uncertain”, Adams & Hand 1999
 “Adding Risks: Samuelson's Fallacy of Large Numbers Revisited”, Ross 1999
 “Information Theory and an Extension of the Maximum Likelihood Principle”, Akaike 1998
 “‘Improving Ratings’: Audit in the British University System”, Strathern 1997
 “The 'Awful Idea of Accountability': Inscribing People into the Measurement of Objects”, Hoskin 1996
 “Seeing The Forest From The Trees: When Predicting The Behavior Or Status Of Groups, Correlate Means”, Lubinski & Humphreys 1996b
 “Processing Linguistic Probabilities: General Principles and Empirical Evidence”, Budescu & Wallsten 1995
 “Computer Based Horse Race Handicapping and Wagering Systems: A Report”, Hausch et al 1994
 “Bayesian Updating in Hierarchic Markov Processes Applied to the Animal Replacement Problem”, Kristensen 1993
 “Learning from Coarse Information: Biased Contests and Career Profiles”, Meyer 1991
 “Weight or the Value of Knowledge”, Ramsey 1990
 “'Student': A Statistical Biography of William Sealy Gosset”, Pearson et al 1990
 “F. P. Ramsey: Philosophical Papers”, Ramsey & Mellor 1990
 “The Total Evidence Theorem for Probability Kinematics”, Graves 1989
 “Nonlinear Preference and Utility Theory”, Fishburn 1988
 “Measuring the Vague Meanings of Probability Terms”, Wallsten et al 1986
 “An Examination of Two Alternative Techniques to Estimate the Standard Deviation of Job Performance in Dollars”, Reilly & Smither 1985
 “Game Theoretic Analysis of a Bankruptcy Problem from the Talmud”, Aumann & Maschler 1985
 “Influence Diagrams”, Howard & Matheson 1984
 “The Citation Bias: Fad and Fashion in the Judgment and Decision Literature”, Christensen-Szalanski & Beach 1984
 “Readings on the Principles and Applications of Decision Analysis: Volume 2: Professional Collection”, Howard & Matheson 1983
 “Readings on the Principles and Applications of Decision Analysis: Volume 1: General Collection”, Howard & Matheson 1983
 “Multi-Bayesian Statistical Decision Theory”, Weerahandi & Zidek 1981
 “Impact of Valid Selection Procedures on Workforce Productivity”, Schmidt et al 1979
 “Science and Statistics”, Box 1976
 “When Values Conflict: Essays on Environmental Analysis, Discourse, and Decision”, Tribe et al 1976
 “Boundaries of Analysis: An Inquiry into the Tocks Island Dam Controversy”, Feiveson et al 1976
 “Portfolio Choice and the Kelly Criterion”, Thorp 1975
 “Cross-Modality Matching of Money Against Other Continua”, Galanter & Pliner 1974
 “The General Impossibility of Normative Accounting Standards”, Demski 1973
 “The Theory of Social Choice”, Fishburn 1973
 “What Makes for a Beautiful Problem in Science?”, Samuelson 1970
 “General Proof That Diversification Pays”, Samuelson 1967
 “Optimal Dairy Cow Replacement Policies”, Giaever 1966
 “Measuring Utility by a Single-response Sequential Method”, Becker et al 1964
 “A Model for Selecting One of Two Medical Treatments”, Colton 1963
 “Studies of War, Nuclear and Conventional”, Blackett 1962
 “Applied Statistical Decision Theory”, Raiffa & Schlaifer 1961
 “Gradient Theory of Optimal Flight Paths”, Kelley 1960
 “Testing Statistical Hypotheses (First Edition)”, Lehmann 1959
 “Probability and Statistics for Business Decisions: An Introduction to Managerial Economics Under Uncertainty”, Schlaifer 1959
 “An Optimum Character Recognition System Using Decision Functions”, Chow 1957
 “Unsolved Problems of Experimental Statistics”, Tukey 1954
 “Non-Cooperative Games”, Nash 1951
 “The Economic Life of Industrial Equipment”, Preinreich 1940
 “"Student" As Statistician”, Pearson 1939
 “Presidential Address to the First Indian Statistical Congress”, Fisher 1938
 “The Lanarkshire Milk Experiment”, Elderton 1933
 “Pasteurised and Raw Milk”, Fisher & Bartlett 1931
 “On Testing Varieties of Cereals”, Gosset 1923
 Thue-Morse sequence
 Thompson sampling
 Optimal stopping
 Multi-objective optimization
 Monte Carlo tree search
 Miscellaneous
See Also
Links
“A Systematic Review of Human Challenge Trials, Designs, and Safety”, Adams-Phipps et al 2022 (2022-03-21):
Background: There exists no prior systematic review of human challenge trials (HCTs) that focuses on participant safety. Key questions regarding HCTs include how risky such trials have been, how often adverse events (AEs) and serious adverse events (SAEs) occur, and whether risk mitigation measures have been effective.
Methods: A systematic search of PubMed and PubMed Central for articles reporting on results of HCTs published between 1980 and 2021 was performed and completed by 2021-10-07.
Results: Of 2,838 articles screened, 276 were reviewed in full. 15,046 challenged participants were described in 308 studies that met inclusion criteria. 286 (92.9%) of these studies reported mitigation measures used to minimize risk to the challenge population. Among 187 studies which reported on SAEs, 0.2% of participants experienced at least one challenge-related SAE. Among 94 studies that graded AEs by severity, challenge-related AEs graded “severe” were reported by between 5.6% and 15.8% of participants. AE data were provided as a range to account for unclear reporting. 80% of studies published after 2010 were registered in a trials database.
Conclusion: HCTs are increasingly common and used for an expanding list of diseases. Although AEs occur, severe AEs and SAEs are rare. Reporting has improved over time, though not all papers provide a comprehensive report of relevant health impacts. From the available data, most HCTs do not lead to a high number of severe symptoms or SAEs.
This study was preregistered on PROSPERO as CRD42021247218.
“The InterModel Vigorish (IMV): A Flexible and Portable Approach for Quantifying Predictive Accuracy With Binary Outcomes”, Domingue et al 2022 (2022-01-12):
[Twitter; app] Understanding the “fit” of models designed to predict binary outcomes has been a longstanding problem.
We propose a flexible, portable, and intuitive metric for quantifying the change in accuracy between 2 predictive systems in the case of a binary outcome, the InterModel Vigorish (IMV). The IMV is based on an analogy to well-characterized physical systems with tractable probabilities: weighted coins. The IMV is always a statement about the change in fit relative to some baseline—which can be as simple as the prevalence—whereas other metrics are standalone measures that need to be further manipulated to yield indices related to differences in fit across models. Moreover, the IMV is consistently interpretable independent of baseline prevalence.
We illustrate the flexible properties of this metric in numerous simulations and showcase its flexibility across examples spanning the social, biomedical, and physical sciences.
[Keywords: binary outcomes, fit index, logistic regression, prediction, Kelly criterion, entropy, coherence]
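[Illustration: a minimal Python sketch of the weighted-coin idea as described above, not the paper’s reference implementation. It maps each model’s mean log-likelihood on the binary outcomes to the entropy-equivalent coin weight w ≥ 0.5 and reports the relative gain over a baseline (here, the prevalence); the helper names coin_weight/imv are hypothetical, and the paper’s exact definition may differ in details:]

    import numpy as np
    from scipy.optimize import brentq

    def coin_weight(p_pred, y):
        # Mean log-likelihood of the predicted probabilities on 0/1 outcomes...
        ll = np.mean(y * np.log(p_pred) + (1 - y) * np.log(1 - p_pred))
        # ...mapped to the single coin weight w >= 0.5 with the same expected
        # log-likelihood (assumes the model beats a fair coin: ll > log 0.5).
        return brentq(lambda w: w * np.log(w) + (1 - w) * np.log(1 - w) - ll,
                      0.5 + 1e-9, 1 - 1e-9)

    def imv(p_baseline, p_enhanced, y):
        w0, w1 = coin_weight(p_baseline, y), coin_weight(p_enhanced, y)
        return (w1 - w0) / w0  # relative gain over the baseline coin

    y = np.array([0, 1, 1, 0, 1, 1, 0, 1])
    p_base = np.full(len(y), y.mean())   # prevalence-only baseline
    p_new = np.where(y == 1, 0.8, 0.3)   # hypothetical better predictions
    print(imv(p_base, p_new, y))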
“The Geometry of Decision-making in Individuals and Collectives”, Sridhar et al 2021 (2021-12-14):
Almost all animals must make decisions on the move. Here, employing an approach that integrates theory and high-throughput experiments (using state-of-the-art virtual reality), we reveal that there exist fundamental geometrical principles that result from the inherent interplay between movement and organisms’ internal representation of space. Specifically, we find that animals spontaneously reduce the world into a series of sequential binary decisions, a response that facilitates effective decision-making and is robust both to the number of options available and to context, such as whether options are static (eg. refuges) or mobile (eg. other animals). We present evidence that these same principles, hitherto overlooked, apply across scales of biological organization, from individual to collective decision-making.
Choosing among spatially distributed options is a central challenge for animals, from deciding among alternative potential food sources or refuges to choosing with whom to associate. Using an integrated theoretical and experimental approach (employing immersive virtual reality), we consider the interplay between movement and vectorial integration during decision-making regarding 2, or more, options in space.
In computational models of this process, we reveal the occurrence of spontaneous and abrupt “critical” transitions (associated with specific geometrical relationships) whereby organisms spontaneously switch from averaging vectorial information among, to suddenly excluding one among, the remaining options. This bifurcation process repeats until only one option—the one ultimately selected—remains. Thus, we predict that the brain repeatedly breaks multichoice decisions into a series of binary decisions in spacetime.
Experiments with fruit flies, desert locusts, and larval zebrafish reveal that they exhibit these same bifurcations, demonstrating that across taxa and ecological contexts, there exist fundamental geometric principles that are essential to explain how, and why, animals move the way they do.
[Keywords: ring attractor, movement ecology, navigation, collective behavior, embodied choice]
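[Illustration: the “average, then bifurcate” geometry can be caricatured with a toy two-option steering rule. This is only the geometric skeleton of the idea, not the paper’s ring-attractor model; the critical angle and step size below are arbitrary assumptions:]

    import numpy as np

    def trajectory(target_a, target_b, critical_angle_deg=60, step=0.05):
        # Steer along the average of the unit vectors to the remaining targets;
        # once their angular separation exceeds a critical angle, commit to the
        # target best aligned with the current heading (the abrupt bifurcation).
        pos, heading = np.zeros(2), np.array([0.0, 1.0])
        targets = [np.asarray(target_a, float), np.asarray(target_b, float)]
        path = [pos.copy()]
        while np.linalg.norm(targets[0] - pos) > 0.1:
            dirs = [(t - pos) / np.linalg.norm(t - pos) for t in targets]
            if len(targets) == 2:
                cos_sep = np.clip(dirs[0] @ dirs[1], -1.0, 1.0)
                if np.degrees(np.arccos(cos_sep)) > critical_angle_deg:
                    keep = int(np.argmax([d @ heading for d in dirs]))
                    targets, dirs = [targets[keep]], [dirs[keep]]
            heading = np.mean(dirs, axis=0)
            heading = heading / np.linalg.norm(heading)
            pos = pos + step * heading
            path.append(pos.copy())
        return np.array(path)

    # The agent first heads toward the midpoint between the two targets, then
    # abruptly commits to one of them once the separation angle grows too wide.
    print(trajectory((-1.0, 3.0), (1.0, 3.0))[-1])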
“Noise Increases Anchoring Effects”, Lee & Morewedge 2021 (2021-12-08):
We introduce a theoretical framework distinguishing between anchoring effects, anchoring bias, and judgmental noise: Anchoring effects require anchoring bias, but noise modulates their size. We tested this framework by manipulating stimulus magnitudes. As magnitudes increase, psychophysical noise due to scalar variability widens the perceived range of plausible values for the stimulus. This increased noise, in turn, increases the influence of anchoring bias on judgments. In 11 preregistered experiments (n = 3,552 adults), anchoring effects increased with stimulus magnitude for point estimates of familiar and novel stimuli (eg. reservation prices for hotels and donuts, counts in dot arrays). Comparisons of relevant and irrelevant anchors showed that noise itself did not produce anchoring effects. Noise amplified anchoring bias. Our findings identify a stimulus feature predicting the size and replicability of anchoring effects—stimulus magnitude. More broadly, we show how to use psychophysical noise to test relationships between bias and noise in judgment under uncertainty.
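[Illustration: the bias-versus-noise framework is easy to simulate. Below is a toy anchor-and-adjust model of my own, not the authors’ code: the judge perceives the stimulus with scalar variability (SD proportional to magnitude) and adjusts away from the anchor only to the edge of the plausible range around the percept, so larger magnitudes, and hence more noise, leave estimates stuck closer to the anchor:]

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_estimate(true_value, anchor, cv=0.2, z=1.5, n=100_000):
        # Scalar variability: perceptual SD grows in proportion to magnitude.
        sd = cv * true_value
        percept = rng.normal(true_value, sd, n)
        # Adjust from the anchor, but stop at the edge of the plausible range.
        return np.clip(anchor, percept - z * sd, percept + z * sd).mean()

    for magnitude in (10, 100, 1000):
        effect = (mean_estimate(magnitude, anchor=2.0 * magnitude)
                  - mean_estimate(magnitude, anchor=0.5 * magnitude))
        print(f"magnitude {magnitude:5d}: high-minus-low anchoring effect ~ {effect:.1f}")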
“Prior Knowledge Elicitation: The Past, Present, and Future”, Mikkola et al 2021 (2021-12-01):
Specification of the prior distribution for a Bayesian model is a central part of the Bayesian workflow for data analysis, but it is often difficult even for statistical experts. Prior elicitation transforms domain knowledge of various kinds into well-defined prior distributions, and offers a solution to the prior specification problem, in principle. In practice, however, we are still fairly far from having usable prior elicitation tools that could significantly influence the way we build probabilistic models in academia and industry. We lack elicitation methods that integrate well into the Bayesian workflow and perform elicitation efficiently in terms of costs of time and effort. We even lack a comprehensive theoretical framework for understanding different facets of the prior elicitation problem.
Why are we not widely using prior elicitation? We analyze the state of the art by identifying a range of key aspects of prior knowledge elicitation, from properties of the modelling task and the nature of the priors to the form of interaction with the expert. The existing prior elicitation literature is reviewed and categorized in these terms. This allows recognizing understudied directions in prior elicitation research, finally leading to a proposal of several new avenues to improve prior elicitation methodology.
“νSDDP: Neural Stochastic Dual Dynamic Programming”, Dai et al 2021 (2021-12-01):
Stochastic dual dynamic programming (SDDP) is a state-of-the-art method for solving multistage stochastic optimization, widely used for modeling real-world process optimization tasks. Unfortunately, SDDP has a worst-case complexity that scales exponentially in the number of decision variables, which severely limits applicability to only low dimensional problems.
To overcome this limitation, we extend SDDP by introducing a trainable neural model that learns to map problem instances to a piecewise linear value function within intrinsic low-dimension space, which is architected specifically to interact with a base SDDP solver, so that can accelerate optimization performance on new instances. The proposed Neural Stochastic Dual Dynamic Programming (νSDDP) continually self-improves by solving successive problems.
An empirical investigation demonstrates that νSDDP can substantially reduce problem solving cost without sacrificing solution quality over competitors such as SDDP and reinforcement learning algorithms, across a range of synthetic and real-world process optimization problems.
“A Rational Reinterpretation of Dual-process Theories”, Milli et al 2021 (2021-12-01):
Highly influential “dual-process” accounts of human cognition postulate the coexistence of a slow accurate system with a fast error-prone system. But why would there be just 2 systems rather than, say, one or 93?
Here, we argue that a dual-process architecture might reflect a rational tradeoff between the cognitive flexibility afforded by multiple systems and the time and effort required to choose between them. We investigate how the optimal set and number of cognitive systems depend on the structure of the environment.
We find that the optimal number of systems depends on the variability of the environment and the difficulty of deciding when to use which system. Furthermore, we find that there is a plausible range of conditions under which it is optimal to be equipped with a fast system that performs no deliberation (“System 1”) and a slow system that achieves a higher expected accuracy through deliberation (“System 2”).
Our findings thereby suggest a rational reinterpretation of dualprocess theories.
[Keywords: bounded rationality, dual-process theories, meta-decision making, bounded optimality, meta-reasoning, resource-rationality]
…We study this problem in 4 different domains where the dual systems framework has been applied to explain human decision-making: binary choice, planning, strategic interaction, and multi-alternative, multi-attribute risky choice. We investigate how the optimal cognitive architecture for each domain depends on the variability of the environment and the cost of choosing between multiple cognitive systems, which we call meta-reasoning cost.
“Strategically Overconfident (to a Fault): How Self-promotion Motivates Advisor Confidence”, Van Zant 2021 (2021-11-01):
Unlike judgments made in private, advice contexts invoke strategic social concerns that might increase overconfidence in advice. Many scholars have assumed that overconfident advice emerges as an adaptive response to advice seekers’ preference for confident advice and failure to punish overconfidence. However, another possibility is that advisors robustly display overconfidence as a self-promotion tactic—even when it is punished by others.
Across 4 experiments and a survey of advice professionals, the current research finds support for this account. First, it shows that advisors express more overconfidence than private decision-makers. This pattern held even after advice recipients punished advisors for their overconfidence. Second, it identifies the underlying motivations of advisors’ overconfidence. Advisors’ overconfidence was not driven by self-deception or a sincere desire to be helpful. Instead, it reflected strategic self-promotion.
Relative to the overconfidence revealed by their private beliefs, advisors purposely increased their overconfidence while broadcasting judgments when (a) it was salient that others would assess their competence and (b) looking competent served their self-interest.
“TV Advertising Effectiveness and Profitability: Generalizable Results From 288 Brands”, Shapiro et al 2021 (2021-07-26):
We estimate the distribution of television advertising elasticities and the distribution of the advertising return on investment (ROI) for a large number of products in many categories…We construct a data set by merging market (DMA) level TV advertising data with retail sales and price data at the brand level…Our identification strategy is based on the institutions of the ad buying process.
Our results reveal substantially smaller advertising elasticities compared to the results documented in the literature, as well as a sizable percentage of statistically insignificant or negative estimates. The results are robust to functional form assumptions and are not driven by insufficient statistical power or measurement error.
The ROI analysis shows negative ROIs at the margin for more than 80% of brands, implying overinvestment in advertising by most firms. Further, the overall ROI of the observed advertising schedule is only positive for one third of all brands.
[Keywords: advertising, return on investment, empirical generalizations, agency issues, consumer packaged goods, media markets]
…We find that the mean and median of the distribution of estimated long-run own-advertising elasticities are 0.023 and 0.014, respectively, and 2 thirds of the elasticity estimates are not statistically different from zero. These magnitudes are considerably smaller than the results in the extant literature. The results are robust to controls for own and competitor prices and feature and display advertising, and the advertising effect distributions are similar whether a carryover parameter is assumed or estimated. The estimates are also robust if we allow for a flexible functional form for the advertising effect, and they do not appear to be driven by measurement error. As we are not able to include all sensitivity checks in the paper, we created an interactive web application that allows the reader to explore all model specifications. The web application is available.
…First, the advertising elasticity estimates in the baseline specification are small. The median elasticity is 0.0140, and the mean is 0.0233. These averages are substantially smaller than the average elasticities reported in extant meta-analyses of published case studies (Assmus, Farley, and Lehmann (1984b), Sethuraman, Tellis, and Briesch (2011)). Second, 2 thirds of the estimates are not statistically distinguishable from zero. We show in Figure 2 that the most precise estimates are those closest to the mean and the least precise estimates are in the extremes.
…6.1 Average ROI of Advertising in a Given Week:
In the first policy experiment, we measure the ROI of the observed advertising levels (in all DMAs) in a given week t relative to not advertising in week t. For each brand, we compute the corresponding ROI for all weeks with positive advertising, and then average the ROIs across all weeks to compute the average ROI of weekly advertising. This metric reveals if, on the margin, firms choose the (approximately) correct advertising level or could increase profits by either increasing or decreasing advertising.
We provide key summary statistics in the top panel of Table III, and we show the distribution of the predicted ROIs in Figure 3(a). The average ROI of weekly advertising is negative for most brands over the whole range of assumed manufacturer margins. At a 30% margin, the median ROI is −88.15%, and only 12% of brands have positive ROI. Further, for only 3% of brands the ROI is positive and statistically different from zero, whereas for 68% of brands the ROI is negative and statistically different from zero.
These results provide strong evidence for overinvestment in advertising at the margin. [In Appendix C.3, we assess how much larger the TV advertising effects would need to be for the observed level of weekly advertising to be profitable. For the median brand with a positive estimated ad elasticity, the advertising effect would have to be 5.33× larger for the observed level of weekly advertising to yield a positive ROI (assuming a 30% margin).]
6.2 Overall ROI of the Observed Advertising Schedule: In the second policy experiment, we investigate if firms are better off when advertising at the observed levels versus not advertising at all. Hence, we calculate the ROI of the observed advertising schedule relative to a counterfactual baseline with zero advertising in all periods.
We present the results in the bottom panel of Table III and in Figure 3(b). At a 30% margin, the median ROI is −57.34%, and 34% of brands have a positive return from the observed advertising schedule versus not advertising at all. Whereas 12% of brands only have positive and 30% of brands only negative values in their confidence intervals, there is more uncertainty about the sign of the ROI for the remaining 58% of brands. This evidence leaves open the possibility that advertising may be valuable for a substantial number of brands, especially if they reduce advertising on the margin.
…Our results have important positive and normative implications. Why do firms spend billions of dollars on TV advertising each year if the return is negative? There are several possible explanations. First, agency issues, in particular career concerns, may lead managers (or consultants) to overstate the effectiveness of advertising if they expect to lose their jobs if their advertising campaigns are revealed to be unprofitable. Second, an incorrect prior (ie. conventional wisdom that advertising is typically effective) may lead a decision maker to rationally shrink the estimated advertising effect from their data to an incorrect, inflated prior mean. These proposed explanations are not mutually exclusive. In particular, agency issues may be exacerbated if the general effectiveness of advertising or a specific advertising effect estimate is overstated. [Another explanation is that many brands have objectives for advertising other than stimulating sales. This is a nonstandard objective in economic analysis, but nonetheless, we cannot rule it out.] While we cannot conclusively point to these explanations as the source of the documented overinvestment in advertising, our discussions with managers and industry insiders suggest that these may be contributing factors.
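[Illustration: the arithmetic that turns such a small elasticity into a deeply negative marginal ROI is simple. The sketch below uses the paper’s median elasticity (~0.014) and a 30% margin, but the revenue and TV budget are purely hypothetical; it is back-of-the-envelope arithmetic, not the paper’s estimator:]

    def marginal_roi(elasticity, margin, revenue, ad_spend, bump=0.01):
        # ROI of raising TV ad spend by `bump` (e.g. 1%), holding all else fixed:
        # extra revenue = elasticity * (% change in ads) * revenue.
        extra_profit = margin * elasticity * bump * revenue
        extra_cost = bump * ad_spend
        return extra_profit / extra_cost - 1

    # Hypothetical brand: $100M annual revenue, $5M TV budget, 30% margin,
    # own-advertising elasticity 0.014 (the paper's median estimate).
    print(marginal_roi(elasticity=0.014, margin=0.30, revenue=100e6, ad_spend=5e6))
    # -> about -0.92, i.e. roughly a -92% return on the marginal advertising dollar.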
“Learning to Hesitate”, Descamps et al 2021 (2021-06-22):
We investigate how people make choices when they are unsure about the value of the options they face and have to decide whether to choose now or wait and acquire more information first.
In an experiment, we find that participants deviate from optimal information acquisition in a systematic manner. They acquire too much information (when they should only collect little) or not enough (when they should collect a lot). We show that this pattern can be explained as naturally emerging from Fechner cognitive errors. Over time participants tend to learn to approximate the optimal strategy when information is relatively costly.
[Keywords: search, decision under uncertainty, information, optimal stopping, real option]
…We design a controlled situation where individuals have to choose between 2 alternatives with uncertain payoffs. Before making a choice, they have the opportunity to wait and collect additional (costly) pieces of information which help them get a better idea of the likely alternatives’ payoffs. The design of the experiment allows us to precisely identify the optimal sequential sampling strategy and to assess whether participants are able to approximate it.
We find that participants deviate in systematic ways from the optimal strategy. They tend to hesitate too long and oversample information when it is relatively costly, and therefore when the optimal strategy is to collect only little information. On the contrary, they tend to undersample information when it is relatively cheap, and therefore when the optimal strategy is to collect a lot of information. We show that this pattern of oversampling and undersampling can be explained as the result of Fechner cognitive errors which introduce stochasticity in decisions about whether or not to stop. Cognitive errors create a risk to stop at any time by mistake. When the optimal level of information to acquire is high, DMs should continue to sample information for a long time. As a consequence, errors are likely to lead to stop too early, and therefore to undersampling. When the optimal level of evidence to acquire is low, DMs should stop sampling early. In that case, cognitive errors are more likely to lead to fail to stop early enough, and therefore to oversampling. The deviations we observe, lead participants to lose between 10 and 25% of their potential payoff. However, participants learn to get closer to the optimal strategy over time, as long as information is relatively costly.
“Informational Herding, Optimal Experimentation, and Contrarianism”, Smith et al 2021 (2021-02-25):
In the standard herding model, privately informed individuals sequentially see prior actions and then act. An identical action herd eventually starts and public beliefs tend to “cascade sets” where social learning stops. What behaviour is socially efficient when actions ignore informational externalities?
We characterize the outcome that maximizes the discounted sum of utilities. Our 4 key findings are:
1. Cascade sets shrink but do not vanish, and herding should occur but less readily as greater weight is attached to posterity.
2. An optimal mechanism rewards individuals mimicked by their successor.
3. Cascades cannot start after period one under a signal log-concavity condition.
4. Given this condition, efficient behaviour is contrarian, leaning against the myopically more popular actions in every period.
We make 2 technical contributions: as value functions with learning are not smooth, we use monotone comparative statics under uncertainty to deduce optimal dynamic behaviour. We also adapt dynamic pivot mechanisms to Bayesian learning.
[Keywords: herding, mimicking, contrarian, cascade, efficiency, monotonicity, log-concavity]
“Adversarial Vulnerabilities of Human Decision-making”, Dezfouli et al 2020 (2020-11-04):
“What I cannot efficiently break, I cannot understand.” Understanding the vulnerabilities of human choice processes allows us to detect and potentially avoid adversarial attacks. We develop a general framework for creating adversaries for human decision-making. The framework is based on recent developments in deep reinforcement learning models and recurrent neural networks and can in principle be applied to any decision-making task and adversarial objective. We show the performance of the framework in 3 tasks involving choice, response inhibition, and social decision-making. In all of the cases the framework was successful in its adversarial attack. Furthermore, we show various ways to interpret the models to provide insights into the exploitability of human choice.
Adversarial examples are carefully crafted input patterns that are surprisingly poorly classified by artificial and/or natural neural networks. Here we examine adversarial vulnerabilities in the processes responsible for learning and choice in humans. Building upon recent recurrent neural network models of choice processes, we propose a general framework for generating adversarial opponents that can shape the choices of individuals in particular decision-making tasks toward the behavioral patterns desired by the adversary. We show the efficacy of the framework through 3 experiments involving action selection, response inhibition, and social decision-making. We further investigate the strategy used by the adversary in order to gain insights into the vulnerabilities of human choice. The framework may find applications across behavioral sciences in helping detect and avoid flawed choice.
[Keywords: decision-making, recurrent neural networks, reinforcement learning]
“Targeting for Long-term Outcomes”, Yang et al 2020 (2020-10-29):
Decision-makers often want to target interventions (eg. marketing campaigns) so as to maximize an outcome that is observed only in the long-term. This typically requires delaying decisions until the outcome is observed or relying on simple short-term proxies for the long-term outcome. Here we build on the statistical surrogacy and off-policy learning literature to impute the missing long-term outcomes and then approximate the optimal targeting policy on the imputed outcomes via a doubly-robust approach.
We apply our approach in large-scale proactive churn management experiments at The Boston Globe by targeting optimal discounts to its digital subscribers to maximize their long-term revenue.
We first show that conditions for validity of average treatment effect estimation with imputed outcomes are also sufficient for valid policy evaluation and optimization; furthermore, these conditions can be somewhat relaxed for policy optimization.
We then validate this approach empirically by comparing it with a policy learned on the ground truth long-term outcomes and show that they are statistically indistinguishable. Our approach also outperforms a policy learned on short-term proxies for the long-term outcome. In a second field experiment, we implement the optimal targeting policy with additional randomized exploration, which allows us to update the optimal policy for each new cohort of customers to account for potential nonstationarity.
Over 3 years, our approach had a net-positive revenue impact in the range of $4–$5 million compared to The Boston Globe’s current policies.
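[Illustration: the policy-evaluation step they build on can be sketched with a generic doubly-robust value estimate on logged randomized data, with the long-term outcome already imputed via surrogates. This is the textbook form, not necessarily the authors’ exact estimator:]

    import numpy as np

    def dr_policy_value(policy_action, logged_action, propensity, outcome, mu_hat):
        # Doubly-robust off-policy value estimate.
        #   policy_action[i] : action the candidate targeting policy would take
        #   logged_action[i] : action actually taken in the experiment
        #   propensity[i]    : P(logged_action[i] | X_i) under the logging policy
        #   outcome[i]       : observed (or surrogate-imputed) long-term outcome
        #   mu_hat[i, a]     : regression estimate of the outcome under action a
        n = len(logged_action)
        direct = mu_hat[np.arange(n), policy_action]
        ipw = (logged_action == policy_action) / propensity
        correction = ipw * (outcome - mu_hat[np.arange(n), logged_action])
        return float(np.mean(direct + correction))

    # Tiny example: 2 actions (0 = no discount, 1 = discount), 4 customers.
    mu = np.array([[5.0, 7.0], [6.0, 6.5], [4.0, 4.2], [8.0, 7.5]])
    print(dr_policy_value(policy_action=np.array([1, 1, 0, 0]),
                          logged_action=np.array([0, 1, 1, 0]),
                          propensity=np.full(4, 0.5),
                          outcome=np.array([5.2, 6.8, 4.1, 8.3]),
                          mu_hat=mu))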
“Learning Not to Learn: Nature versus Nurture in Silico”, Lange & Sprekeler 2020 (2020-10-09):
Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of meta-learning (or ‘learning to learn’) to answer when it is beneficial to learn such an adaptive strategy and when to hardcode a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents’ lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: One in which meta-learning yields a learning algorithm that implements task-dependent information-integration and a second regime in which meta-learning imprints a heuristic or ‘hardcoded’ behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hardcoded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.
“Robust Decision Theory and Econometrics”, Chamberlain 2020 (2020-08-01):
This review uses the empirical analysis of portfolio choice to illustrate econometric issues that arise in decision problems. Subjective expected utility (SEU) can provide normative guidance to an investor making a portfolio choice. The investor, however, may have doubts on the specification of the distribution and may seek a decision theory that is less sensitive to the specification. I consider three such theories: maxmin expected utility, variational preferences (including multiplier and divergence preferences and the associated constraint preferences), and smooth ambiguity preferences. I use a simple two-period model to illustrate their application. Normative empirical work on portfolio choice is mainly in the SEU framework, and bringing in ideas from robust decision theory may be fruitful.
“Speed-accuracy Tradeoff in Plants”, Ceccarini et al 2020 (2020-06-15):
Speed-accuracy tradeoff (SAT) is the tendency for decision speed to covary with decision accuracy. SAT is an inescapable property of aimed movements being present in a wide range of species, from insects to primates. An aspect that remains unsolved is whether SAT extends to plants’ movement.
Here, we tested this possibility by examining the swaying in circles of the tips of shoots exhibited by climbing plants (Pisum sativum L.) as they approach to grasp a potential support. In particular, by means of 3-dimensional kinematical analysis, we investigated whether climbing plants scale movement velocity as a function of the difficulty to coil a support.
Results showed that plants are able to process the properties of the support before contact and, similarly to animal species, strategically modulate movement velocity according to task difficulty.
…To date, a great absent in the Fitts’s law literature is the “green kingdom.” At first glance, plants seem relatively immobile, stuck to the ground in rigid structures and, unlike animals, unable to escape stressful environments. But, although markedly different from those of animals, movement pervades all aspects of plant behavior (Darwin & Darwin 1880). As observed by Darwin 1875, the tendrils of climbing plants undergo subtle movements around their axes of elongation. This elliptical movement, known as circumnutation, allows plants to explore their immediate surroundings in search, for instance, of a physical support to enhance light acquisition (Larson 2000). Also, Darwin (1875; see also Trewavas 2017) observed that the tendrils tend to assume the shape of whatever surface before they come into contact with. Implicitly this might signify that they “see” the support and plan the movement accordingly. In this view, climbing plants might be able to plan the course of an action ahead of time and program the tendrils’ choreography according to the “to-be-grasped” object.
Support for this contention comes from both theoretical and empirical studies suggesting that plant movement is not a simple product of cause-effect mechanisms but rather seems to be driven by processes that are anticipatory in nature (eg. Calvo & Friston 2017; Guerra et al 2019). For instance, a recent study shows that a climbing plant (Pisum sativum L.) not only is able to perceive a potential support, but it also scales the kinematics of tendrils’ aperture according to its size well ahead they touch the stimulus (Guerra et al 2019). This has been taken as the demonstration that plants plan the movement purposefully and in ways that are flexible and anticipatory.
With this in mind, one of the empirical predictions stemming from Fitts’s law can be well-suited to model the 3-dimensional circumnutation of plants. Precisely, we refer to the evidence that movement time scales as a function of the target’s size: When the distance is constant, thinner targets are reached more slowly than thicker ones (see Murata & Iwase 2001). We test this prediction in Pisum sativum L. by assessing the change of velocity of the tendrils during their approach-to-grasp a thin or a thicker support.
…Results…The analysis of movement time confirms this evidence, showing that movement time was shorter for the thinner than for the thicker stimulus (β < 0) with a probability of 79.3%. This evidence suggests that plants are able to process the properties of the support and are endowed with a form of perception underwriting a goal-directed and anticipatory behavior (Guerra et al 2019). However, in contrast with previous human and animal literature (eg. Beggs & Howarth 1972; Fitts 1954; Heitz & Schall 2012), our results indicate an opposite pattern of what Fitts’s law predicts. Remember that according to Fitts’s law, the velocity of the movement is inversely proportional to ID (2D/W). In other words, our results seem to suggest that plants exhibit more difficulty grasping a thicker than a thinner support. These findings are in line with previous reports showing a lower success rate of attachment for thick supports (Peñalosa 1982), and a preference for plants to climb supports with a smaller diameter (Darwin 1875; Putz 1984; Putz & Holbrook 1992 [The Biology of Vines]). Furthermore, by using the curvature of tendrils during the twining phase, Goriely & Neukirch 2006 demonstrate that for thinner supports, the contact angle (ie. the angle between the tip of the tendril and the tangent of the support) is a near-zero value. Instead, with thicker supports, the contact angle tends to increase as tendrils must curl into the support’s surface to maintain an efficient grip. When the support is too thick, the contact angle increases to an extent that the tendril curls back on itself, losing grip. Interestingly, field studies in rainforests showed that the presence of climbing plants tends to decrease in areas in which there is a prevalence of thicker supports (Carrasco-Urra & Gianoli 2009).
A possible explanation for this phenomenon may reside in the fact that, for plants, reaching to grasp thick supports is a more energy-consuming process than grasping for thinner ones. Indeed, the grasping of a thick support implies that plants have to increase the tendril length in order to efficiently coil the support (Rowe et al 2006), and to strengthen the tensional forces to resist gravity (Gianoli 2015).
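[For reference, the Fitts’s-law prediction being tested is that movement time grows with the index of difficulty ID = log2(2D/W), so at a fixed distance D a thinner support (smaller W) should take longer to reach; the pea tendrils showed the opposite. A one-line sketch, with arbitrary example widths rather than the study’s actual supports:]

    import math

    def fitts_index_of_difficulty(distance, width):
        # Fitts's law index of difficulty: ID = log2(2D / W); movement time is
        # modeled as MT = a + b * ID for empirically fitted constants a, b.
        return math.log2(2 * distance / width)

    # Same reaching distance, thick vs. thin support (arbitrary example values):
    print(fitts_index_of_difficulty(distance=10, width=3.0))   # easier (lower ID)
    print(fitts_index_of_difficulty(distance=10, width=1.2))   # harder (higher ID)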
“The Secret History of Facial Recognition: Sixty Years Ago, a Sharecropper’s Son Invented a Technology to Identify Faces. Then the Record of His Role All but Vanished. Who Was Woody Bledsoe, and Who Was He Working For?”, Raviv 2020 (2020-01-21):
Over the following year, Woody came to believe that the most promising path to automated facial recognition was one that reduced a face to a set of relationships between its major landmarks: eyes, ears, nose, eyebrows, lips. The system that he imagined was similar to one that Alphonse Bertillon, the French criminologist who invented the modern mug shot, had pioneered in 1879. Bertillon described people on the basis of 11 physical measurements, including the length of the left foot and the length from the elbow to the end of the middle finger. The idea was that, if you took enough measurements, every person was unique. Although the system was labor-intensive, it worked: In 1897, years before fingerprinting became widespread, French gendarmes used it to identify the serial killer Joseph Vacher. Throughout 1965, Panoramic attempted to create a fully automated Bertillon system for the face. The team tried to devise a program that could locate noses, lips, and the like by parsing patterns of lightness and darkness in a photograph, but the effort was mostly a flop.
…Even with this larger sample size, though, Woody’s team struggled to overcome all the usual obstacles. The computer still had trouble with smiles, for instance, which “distort the face and drastically change interfacial measurements.” Aging remained a problem too, as Woody’s own face proved. When asked to cross-match a photo of Woody from 1945 with one from 1965, the computer was flummoxed. It saw little resemblance between the younger man, with his toothy smile and dark widow’s peak, and the older one, with his grim expression and thinning hair. It was as if the decades had created a different person.
…In 1967, more than a year after his move to Austin, Woody took on one last assignment that involved recognizing patterns in the human face. The purpose of the experiment was to help law enforcement agencies quickly sift through databases of mug shots and portraits, looking for matches…Woody’s main collaborator on the project was Peter Hart, a research engineer in the Applied Physics Laboratory at the Stanford Research Institute. (Now known as SRI International, the institute split from Stanford University in 1970 because its heavy reliance on military funding had become so controversial on campus.) Woody and Hart began with a database of around 800 images—two newsprint-quality photos each of about “400 adult male caucasians”, varying in age and head rotation. (I did not see images of women or people of color, or references to them, in any of Woody’s facial-recognition studies.) Using the RAND tablet, they recorded 46 coordinates per photo, including five on each ear, seven on the nose, and four on each eyebrow. Building on Woody’s earlier experience at normalizing variations in images, they used a mathematical equation to rotate each head into a forward-looking position. Then, to account for differences in scale, they enlarged or reduced each image to a standard size, with the distance between the pupils as their anchor metric. The computer’s task was to memorize one version of each face and use it to identify the other. Woody and Hart offered the machine one of two shortcuts. With the first, known as group matching, the computer would divide the face into features—left eyebrow, right ear, and so on—and compare the relative distances between them. The second approach relied on Bayesian decision theory; it used 22 measurements to make an educated guess about the whole.
In the end, the two programs handled the task about equally well. More important, they blew their human competitors out of the water. When Woody and Hart asked three people to cross-match subsets of 100 faces, even the fastest one took six hours to finish. The CDC 3800 computer completed a similar task in about three minutes, reaching a hundredfold reduction in time. The humans were better at coping with head rotation and poor photographic quality, Woody and Hart acknowledged, but the computer was “vastly superior” at tolerating the differences caused by aging. Overall, they concluded, the machine “dominates” or “very nearly dominates” the humans.
This was the greatest success Woody ever had with his facial-recognition research. It was also the last paper he would write on the subject. The paper was never made public—for “government reasons”, Hart says—which both men lamented. In 1970, two years after the collaboration with Hart ended, a roboticist named Michael Kassler alerted Woody to a facial-recognition study that Leon Harmon at Bell Labs was planning. “I’m irked that this second rate study will now be published and appear to be the best man-machine system available”, Woody replied. “It sounds to me like Leon, if he works hard, will be almost 10 years behind us by 1975.” He must have been frustrated when Harmon’s research made the cover of Scientific American a few years later, while his own, more advanced work was essentially kept in a vault.
“A/B Testing With Fat Tails”, Eduardo M. Azevedo, Alex Deng, José Luis Montiel Olea, Justin M. Rao, E. Glen Weyl (2019-08-09)
“Bayesian Persuasion and Information Design”, Kamenica 2019 (2019-08-01):
A school may improve its students’ job outcomes if it issues only coarse grades. Google can reduce congestion on roads by giving drivers noisy information about the state of traffic. A social planner might raise everyone’s welfare by providing only partial information about solvency of banks. All of this can happen even when everyone is fully rational and understands the datagenerating process. Each of these examples raises questions of what is the (socially or privately) optimal information that should be revealed. In this article, I review the literature that answers such questions.
“Generalizable and Robust TV Advertising Effects”, Shapiro et al 2019 (2019-06-11):
We provide generalizable and robust results on the causal sales effect of TV advertising based on the distribution of advertising elasticities for a large number of products (brands) in many categories. Such generalizable results provide a prior distribution that can improve the advertising decisions made by firms and the analysis and recommendations of antitrust and public policy makers. A single case study cannot provide generalizable results, and hence the marketing literature provides several meta-analyses based on published case studies of advertising effects. However, publication bias results if the research or review process systematically rejects estimates of small, statistically insignificant, or “unexpected” advertising elasticities. Consequently, if there is publication bias, the results of a meta-analysis will not reflect the true population distribution of advertising effects.
To provide generalizable results, we base our analysis on a large number of products and clearly lay out the research protocol used to select the products. We characterize the distribution of all estimates, irrespective of sign, size, or statistical-significance. To ensure generalizability we document the robustness of the estimates. First, we examine the sensitivity of the results to the approach and assumptions made when constructing the data used in estimation from the raw sources. Second, as we aim to provide causal estimates, we document if the estimated effects are sensitive to the identification strategies that we use to claim causality based on observational data. Our results reveal substantially smaller effects of own-advertising compared to the results documented in the extant literature, as well as a sizable percentage of statistically insignificant or negative estimates. If we only select products with statistically-significant and positive estimates, the mean or median of the advertising effect distribution increases by a factor of about five.
The results are robust to various identifying assumptions, and are consistent with both publication bias and bias due to non-robust identification strategies to obtain causal estimates in the literature.
[Keywords: advertising, publication bias, generalizability]
“How Should We Critique Research?”, Branwen 2019 (2019-05-19):
Criticizing studies and statistics is hard in part because so many criticisms are possible, rendering them meaningless. What makes a good criticism is the chance of being a ‘difference which makes a difference’ to our ultimate actions.
Scientific and statistical research must be read with a critical eye to understand how credible the claims are. The Reproducibility Crisis and the growth of metascience have demonstrated that much research is of low quality and often false.
But there are so many possible things any given study could be criticized for, falling short of an unobtainable ideal, that it becomes unclear which possible criticism is important, and they may degenerate into mere rhetoric. How do we separate fatal flaws from unfortunate caveats from specious quibbling?
I offer a pragmatic criterion: what makes a criticism important is how much it could change a result if corrected and how much that would then change our decisions or actions: to what extent it is a “difference which makes a difference”.
This is why issues of research fraud, causal inference, or biases yielding overestimates are universally important: because a ‘causal’ effect turning out to be zero effect or grossly overestimated will change almost all decisions based on such research; while on the other hand, other issues like measurement error or distributional assumptions, which are equally common, are often not important: because they typically yield much smaller changes in conclusions, and hence decisions.
If we regularly ask whether a criticism would make this kind of difference, it will be clearer which ones are important criticisms, and which ones risk being rhetorical distractions and obstructing meaningful evaluation of research.
“Meta-learning of Sequential Strategies”, Ortega et al 2019 (2019-05-08):
In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.
“Is the FDA Too Conservative or Too Aggressive?: A Bayesian Decision Analysis of Clinical Trial Design”, Isakov et al 2019
2019isakov.pdf
: “Is the FDA too conservative or too aggressive?: A Bayesian decision analysis of clinical trial design”, (20190104; ; similar):
Implicit in the drug-approval process is a host of decisions—target patient population, control group, primary endpoint, sample size, follow-up period, etc.—all of which determine the tradeoff between Type I and Type II error. We explore the application of Bayesian decision analysis (BDA) to minimize the expected cost of drug approval, where the relative costs of the two types of errors are calibrated using U.S. Burden of Disease Study 2010 data. The results for conventional fixed-sample randomized clinical-trial designs suggest that for terminal illnesses with no existing therapies such as pancreatic cancer, the standard threshold of 2.5% is substantially more conservative than the BDA-optimal threshold of 23.9% to 27.8%. For relatively less deadly conditions such as prostate cancer, 2.5% is more risk-tolerant or aggressive than the BDA-optimal threshold of 1.2% to 1.5%. We compute BDA-optimal sizes for 25 of the most lethal diseases and show how a BDA-informed approval process can incorporate all stakeholders’ views in a systematic, transparent, internally consistent, and repeatable manner.
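A minimal sketch of the underlying trade-off (with made-up costs, prior, and trial size rather than the paper’s Burden-of-Disease calibration): pick the one-sided significance threshold α that minimizes expected cost, where a false approval and a missed effective drug carry different costs.

```python
# Minimal sketch of a Bayesian decision analysis of the Type I/II trade-off,
# in the spirit of Isakov et al 2019 but with hypothetical costs, prior, and
# trial sizes (their model is calibrated to Burden of Disease data).
import numpy as np
from scipy.stats import norm

def expected_cost(alpha, n_per_arm, effect, sd, p_effective, cost_fp, cost_fn):
    """Expected cost of a two-arm fixed-sample trial run at one-sided level alpha."""
    z_alpha = norm.ppf(1 - alpha)
    se = sd * np.sqrt(2 / n_per_arm)
    power = 1 - norm.cdf(z_alpha - effect / se)    # P(approve | drug works)
    beta = 1 - power                               # P(miss an effective drug)
    return (1 - p_effective) * alpha * cost_fp + p_effective * beta * cost_fn

alphas = np.linspace(1e-4, 0.5, 2000)
# Hypothetical numbers: a deadly disease where missing an effective drug is
# far costlier than approving an ineffective one.
costs = [expected_cost(a, n_per_arm=200, effect=0.3, sd=1.0,
                       p_effective=0.3, cost_fp=1.0, cost_fn=30.0)
         for a in alphas]
best = alphas[int(np.argmin(costs))]
print(f"BDA-optimal one-sided alpha ≈ {best:.3f} (vs the conventional 0.025)")
```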
“Using the Results from Rigorous Multisite Evaluations to Inform Local Policy Decisions”, Orr et al 2019
2019orr.pdf
: “Using the Results from Rigorous Multisite Evaluations to Inform Local Policy Decisions”, Larry L. Orr, Robert B. Olsen, Stephen H. Bell, Ian Schmid, Azim Shivji, Elizabeth A. Stuart (20190101)
“Accounting Theory As a Bayesian Discipline”, Johnstone 2018
2018johnstone.pdf
: “Accounting Theory as a Bayesian Discipline”, (20181228; ; similar):
Accounting Theory as a Bayesian Discipline introduces Bayesian theory and its role in statistical accounting information theory. The Bayesian statistical logic of probability, evidence and decision lies at the historical and modern center of accounting thought and research. It is not only the presumed rule of reasoning in analytical models of accounting disclosure, it is the default position for empiricists when hypothesizing about how the users of financial statements think. Bayesian logic comes to light throughout accounting research and is the soul of most strategic disclosure models. In addition, Bayesianism is similarly a large part of the stated and unstated motivation of empirical studies of how market prices and their implied costs of capital react to better financial disclosure.
The approach taken in this monograph is a Demski 1973-like treatment of “accounting numbers” as “signals” rather than as “measurements”. It should of course be the case that “good” measurements like “quality earnings” reports make generally better signals. However, to be useful for decision making under uncertainty, accounting measurements need to have more than established accounting measurement virtues. This monograph explains what those Bayesian information attributes are, where they come from in Bayesian theory, and how they apply in statistical accounting information theory.
The Bayesian logic of probability, evidence and decision is the presumed rule of reasoning in analytical models of accounting disclosure. Any rational explication of the decadesold accounting notions of “information content”, “value relevance”, “decision useful”, and possibly conservatism, is inevitably Bayesian. By raising some of the probability principles, paradoxes and surprises in Bayesian theory, intuition in accounting theory about information, and its value, can be tested and enhanced. Of all the branches of the social sciences, accounting information theory begs Bayesian insights.
This monograph lays out the main logical constructs and principles of Bayesianism, and relates them to important contributions in the theoretical accounting literature. The approach taken is essentially “oldfashioned” normative statistics, building on the expositions of Demski, Ijiri, Feltham and other early accounting theorists who brought Bayesian theory to accounting theory. Some history of this nexus, and the role of business schools in the development of Bayesian statistics in the 1950–1970s, is described. Later developments in accounting, especially noisy rational expectations models under which the information reported by firms is endogenous, rather than unaffected or “drawn from nature”, make the task of Bayesian inference more difficult yet no different in principle.
The information user must still revise beliefs based on what is reported. The extra complexity is that users must allow for the firm’s perceived disclosure motives and other relevant background knowledge in their Bayesian models. A known strength of Bayesian modelling is that subjective considerations are admitted and formally incorporated. Allowances for perceived selfinterest or biased reporting, along with any other apparent signal defects or “information uncertainty”, are part and parcel of Bayesian information theory.
Introduction
Bayesianism Early in Accounting Theory
 Rise of Bayesian statistics
 Bayes in US business schools
 Early Bayesian accounting theorists
 Postscript
Survey of Bayesian Fundamentals
 All probability is subjective
 Inference comes first
 Bayesian learning
 No objective priors
 Independence is subjective
 No distinction between risk and uncertainty
 The likelihood function (ie. model)
 Sufficiency and the likelihood principle
 Coherence
 Coherent means no “Dutch book”
 Coherent is not necessarily accurate
 Accuracy is relative
 Odds form of Bayes theorem
 Data can’t speak for itself
 Ancillary information
 Nuisance parameters “integrate out”
 “Randomness” is subjective
 “Exchangeable” samples
 The Bayes factor
 Conditioning on all evidence
 Bayesian versus conventional inference
 Simpson’s paradox
 Data swamps prior
 Stable estimation
 Cromwell’s rule
 Decisions follow inference
 Inference, not estimation
 Calibration
 Economic scoring rules
 Market scoring rules
 Measures of information
 Ex ante versus ex post accuracy
 Sampling to forgone conclusion
 Predictive distributions
 Model averaging
 Definition of a subjectivist Bayesian
 What makes a Bayesian?
 Rise of Bayesianism in data science
Case Study: Using All the Evidence
 Interpreting “plevel ≤ α”
 Bayesian interpretation of frequentist reports
 A generic inference problem
Is Accounting Bayesian or Frequentist?
 2 Bayesian schools in accounting
 Markowitz, subjectivist Bayesian
 Characterization of information in accounting
 Why accounting literature emphasizes “precision”
 Bayesian description of information quality
 Likelihood function of earnings
 Capturing conditional conservatism
Decision Support Role of Accounting Information
 A formal Bayesian model
 Parallels with meteorology
 Bayesian fundamental analysis
Demski’s (1973) Impossibility Result
 Example: binary accounting signals
 Conservatism and the user’s risk aversion
Does Information Reduce Uncertainty
 Beaver’s (1968) prescription
 Bayesian basics
 Contrary views in accounting
 Bayesian roots in finance
 The general Bayesian law
 Rogers et al 2009
 Dye & Hughes 2018
 Why a Predictive Distribution?
 Limits to certainty
 Lewellen & Shanken 2002
 Neururer et al 2016
 Veronesi 1999
How Information Combines
 Combining 2 risky signals
Ex Ante Effect of Greater Risk/Uncertainty
 Risk adds to ex ante expected utility
 Implications for Bayesian decision analysis
 Volatility pumping
Ex Post Decision Outcomes: 1. Practical investment
 Economic Darwinism
 Bayesian Darwinian selection
 Good probability assessments
 Implications for accounting information
Information Uncertainty
 Bayesian definition of information uncertainty
 Bayesian treatment of information uncertainty
 Model risk as information risk
Conditioning Beliefs and the Cost of Capital
Numerical example
Interpretation: 14. Reliance on the NormalNormal Model
Intuitive counterexample
Appeal to the normalnormal model in accounting
Unknown variance, increasing after observation
Beyer 2009
Armstrong et al 2016
Bayesian Subjective Beta
 Core et al 2015
 Verrecchia 2001: Understated influence of the mean
 Decision analysis effect of the mean
Other Bayesian Points of Interest
 Accounting input in prediction models
 Earnings quality and accurate probability assessments
 Expected variance as a measure of information
 Information stays relevant
 Bayesian view of earnings management
 Numerator versus denominator news
 Mixtures of normals
 Information content
 Fundamental versus information risk
 When information adds to information asymmetry
 Value of independent information sources
 How might market probabilities behave?
 “Idiosyncratic” versus “undiversifiable” information
Conclusion
References
“Evolution As Backstop for Reinforcement Learning”, Branwen 2018
Backstop
: “Evolution as Backstop for Reinforcement Learning”, (20181206; ; backlinks; similar):
Markets/evolution as backstops/ground truths for reinforcement learning/optimization: on some connections between Coase’s theory of the firm/linear optimization/DRL/evolution/multicellular life/pain/Internet communities as multilevel optimization problems.
One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting this phenomenon as a multilevel nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming (a group-selection perspective). So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.
“Dog Cloning For Special Forces: Breed All You Can Breed”, Branwen 2018
Clone
: “Dog Cloning For Special Forces: Breed All You Can Breed”, (20180918; ; backlinks; similar):
Decision analysis of whether cloning the most elite Special Forces dogs is a profitable improvement over standard selection procedures. Unless training is extremely cheap or heritability is extremely low, dog cloning is hypothetically profitable.
Cloning is widely used in animal & plant breeding despite steep costs due to its advantages; more unusual recent applications include creating entire polo horse teams and reported trials of cloning in elite police/Special Forces war dogs. Given the cost of dog cloning, however, can this ever make more sense than standard screening methods for selecting from working dog breeds, or would the increase in successful dog training be too low under all reasonable models to turn a profit?
I model the question as one of expected cost per dog with the trait of successfully passing training, success in training being a dichotomous liability threshold with a polygenic genetic architecture; given the extreme level of selection possible in selecting the best among already-elite Special Forces dogs and a range of heritabilities, this predicts clones’ success probabilities. To approximate the relevant parameters, I look at some reported training costs and success rates for regular dog candidates, broad dog heritabilities, and the few current dog cloning case studies reported in the media.
Since none of the relevant parameters are known with confidence, I run the cost-benefit equation for many hypothetical scenarios, and find that in a large fraction of them covering most plausible values, dog cloning would improve training yields enough to be profitable (in addition to its other advantages).
As further illustration of the use-case of screening for an extreme outcome based on a partial predictor, I consider the question of whether height PGSes could be used to screen the US population for people of NBA height, which turns out to be reasonably doable with current & future PGSes.
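The liability-threshold step can be sketched in a few lines (hypothetical base rate, heritability, and donor quality, not the essay’s fitted values): a clone carries the donor’s whole genotype, so its expected liability is h² times the donor’s phenotypic deviation, with residual variance 1 − h⁴.

```python
# Rough sketch of the liability-threshold calculation with hypothetical numbers,
# not the essay's fitted parameters.
from scipy.stats import norm

base_rate = 0.5        # fraction of ordinary candidates passing training (assumed)
h2        = 0.5        # heritability of the training-success liability (assumed)
donor_z   = 2.5        # donor's liability, e.g. best of an already-elite pool (assumed)

threshold  = norm.ppf(1 - base_rate)                # liability cutoff for success
clone_mean = h2 * donor_z                           # expected liability of a clone
clone_sd   = (1 - h2**2) ** 0.5                     # sqrt(1 - h^4)
p_clone = 1 - norm.cdf((threshold - clone_mean) / clone_sd)

print(f"ordinary candidate success rate: {base_rate:.0%}")
print(f"predicted clone success rate:    {p_clone:.0%}")
```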
“Improving Widthbased Planning With Compact Policies”, Junyent et al 2018
“Improving widthbased planning with compact policies”, (20180615; ; similar):
Optimal action selection in decision problems characterized by sparse, delayed rewards is still an open challenge. For these problems, current deep reinforcement learning methods require enormous amounts of data to learn controllers that reach human-level performance. In this work, we propose a method that interleaves planning and learning to address this issue. The planning step hinges on the Iterated-Width (IW) planner, a state-of-the-art planner that makes explicit use of the state representation to perform structured exploration. IW is able to scale up to problems independently of the size of the state space. From the state-actions visited by IW, the learning step estimates a compact policy, which in turn is used to guide the planning step. The type of exploration used by our method is radically different from the standard random exploration used in RL. We evaluate our method in simple problems where we show it to have superior performance to the state-of-the-art reinforcement learning algorithms A2C and Alpha Zero. Finally, we present preliminary results in a subset of the Atari games suite.
“How to Train Your Oracle: The Delphi Method and Its Turbulent Youth in Operations Research and the Policy Sciences”, Dayé 2018
2018daye.pdf
: “How to train your oracle: The Delphi method and its turbulent youth in operations research and the policy sciences”, (2018; similar):
Delphi is a procedure that produces forecasts on technological and social developments. This article traces the history of Delphi’s development to the early 1950s, where a group of logicians and mathematicians working at the RAND Corporation carried out experiments to assess the predictive capacities of groups of experts. While Delphi now has a rather stable methodological shape, this was not so in its early years. The vision that Delphi’s creators had for their brainchild changed considerably. While they had initially seen it as a technique, a few years later they reconfigured it as a scientific method. After some more years, however, they conceived of Delphi as a tool. This turbulent youth of Delphi can be explained by parallel changes in the fields that were deemed relevant audiences for the technique, operations research and the policy sciences. While changing the shape of Delphi led to some success, it had severe, yet unrecognized methodological consequences. The core assumption of Delphi that the convergence of expert opinions observed over the iterative stages of the procedure can be interpreted as consensus, appears not to be justified for the third shape of Delphi as a tool that continues to be the most prominent one.
“PHacking and False Discovery in A/B Testing”, Berman et al 2018
2018berman.pdf
: “p-Hacking and False Discovery in A/B Testing”, Ron Berman, Leonid Pekelis, Aisling Scott, Christophe Van den Bulte (20180101; ; backlinks)
“On Having Enough Socks”, Branwen 2017
Socks
: “On Having Enough Socks”, (20171122; ; backlinks; similar):
Personal experience and surveys on running out of socks; discussion of socks as small example of human procrastination and irrationality, caused by lack of explicit deliberative thought where no natural triggers or habits exist.
After running out of socks one day, I reflected on how ordinary tasks get neglected. Anecdotally and in 3 online surveys, people report often not having enough socks, a problem which correlates with rarity of sock purchases and demographic variables, consistent with a neglect/procrastination interpretation: because there is no specific time or triggering factor to replenish a shrinking sock stockpile, it is easy to run out.
This reminds me of akrasia on minor tasks, ‘yak shaving’, and the nature of disaster in complex systems: lack of hard rules lets errors accumulate, without any ‘global’ understanding of the drift into disaster (or at least inefficiency). Humans on a smaller scale also ‘drift’ when they engage in System I reactive thinking & action for too long, resulting in cognitive biases. An example of drift is the generalized human failure to explore/experiment adequately, resulting in overly greedy exploitative behavior of the current local optimum. Grocery shopping provides a case study: despite large gains, most people do not explore, perhaps because there is no established routine or practice involving experimentation. Fixes for these things can be seen as ensuring that System II deliberative cognition is periodically invoked to review things at a global level, such as developing a habit of maximum exploration at first purchase of a food product, or annually reviewing possessions to note problems like a lack of socks.
While socks may be small things, they may reflect big things.
“An Analysis of the Value of Information When Exploring Stochastic, Discrete MultiArmed Bandits”, Sledge & Principe 2017
“An Analysis of the Value of Information when Exploring Stochastic, Discrete MultiArmed Bandits”, (20171008; ; similar):
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
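This is not the authors’ value-of-information criterion itself, only a minimal sketch of the mechanism it resembles: softmax exploration whose temperature (the exploration-controlling parameter) is annealed over episodes.

```python
# Illustrative temperature-annealed softmax exploration on a Bernoulli bandit.
# Not Sledge & Principe's exact criterion; arm probabilities and the cooling
# schedule's form are assumptions made for the sketch.
import numpy as np

rng = np.random.default_rng(1)
true_p = np.array([0.3, 0.5, 0.7])        # hypothetical arm payoff probabilities
K = len(true_p)
counts = np.ones(K)                        # pseudo-counts to avoid division by zero
values = np.zeros(K)                       # running mean reward per arm

T = 5000
regret = 0.0
for t in range(1, T + 1):
    temperature = 1.0 / np.log(t + 1.0)    # fast cooling schedule (assumed form)
    prefs = values / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    arm = rng.choice(K, p=probs)
    reward = rng.random() < true_p[arm]
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
    regret += true_p.max() - true_p[arm]

print(f"average regret per episode after {T} episodes: {regret / T:.3f}")
```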
“Toward a Rational and Mechanistic Account of Mental Effort”, Shenhav et al 2017
2017shenhav.pdf
: “Toward a Rational and Mechanistic Account of Mental Effort”, (20170701; ; backlinks; similar):
In spite of its familiar phenomenology, the mechanistic basis for mental effort remains poorly understood. Although most researchers agree that mental effort is aversive and stems from limitations in our capacity to exercise cognitive control, it is unclear what gives rise to those limitations and why they result in an experience of control as costly. The presence of these control costs also raises further questions regarding how best to allocate mental effort to minimize those costs and maximize the attendant benefits. This review explores recent advances in computational modeling and empirical research aimed at addressing these questions at the level of psychological process and neural mechanism, examining both the limitations to mental effort exertion and how we manage those limited cognitive resources. We conclude by identifying remaining challenges for theoretical accounts of mental effort as well as possible applications of the available findings to understanding the causes of and potential solutions for apparent failures to exert the mental effort required of us.
[Keywords: motivation, cognitive control, decision making, reward, prefrontal cortex, executive function]
“Pricing the Future in the 17^{th} Century: Calculating Technologies in Competition”, Deringer 2017
“Pricing the Future in the 17^{th} Century: Calculating Technologies in Competition”, (201704; ; backlinks; similar):
Time is money. But how much? What is money in the future worth to you today? This question of “present value” arises in myriad economic activities, from valuing financial securities to real estate transactions to governmental costbenefit analysis—even the economics of climate change. In modern capitalist practice, one calculation offers the only “rational” way to answer: compoundinterest discounting. In the early modern period, though, economic actors used at least two alternative calculating technologies for thinking about present value, including a vernacular technique called years purchase and discounting by simple interest. All of these calculations had different strengths and affordances, and none was unquestionably better or more “rational” than the others at the time. The history of technology offers distinct resources for understanding such technological competitions, and thus for understanding the emergence of modern economic temporality.
“Neural Combinatorial Optimization With Reinforcement Learning”, Bello et al 2017
“Neural Combinatorial Optimization with Reinforcement Learning”, (20170217; ; backlinks; similar):
[Keywords: neural combinatorial optimization, reinforcement learning]
We present a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent neural network using a policy gradient method. Without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. These results, albeit still quite far from stateoftheart, give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems.
“SelfBlinded Mineral Water Taste Test”, Branwen 2017
Water
: “SelfBlinded Mineral Water Taste Test”, (20170215; ; backlinks; similar):
Blind randomized taste-test of mineral/distilled/tap waters using Bayesian best-arm finding; no large differences in preference.
The kind of water used in tea is claimed to make a difference in the flavor: mineral water being better than tap water or distilled water. However, mineral water is vastly more expensive than tap water.
To test the claim, I run a preliminary test of pure water to see if any water differences are detectable at all. I compared my tap water, 3 distilled water brands (Great Value, Nestle Pure Life, & Poland Spring), 1 osmosis-purified brand (Aquafina), and 3 non-carbonated mineral water brands (Evian, Voss, & Fiji) in a series of n = 67 blinded randomized comparisons of water flavor. The comparisons are modeled using a Bradley-Terry competitive model implemented in Stan; comparisons were chosen using an adaptive Bayesian best-arm sequential trial (racing) method designed to locate the best-tasting water in the minimum number of samples by preferentially comparing the best-known arm to potentially superior arms. Blinding & randomization are achieved by using a Lazy Susan to physically randomize two identical (but marked in a hidden spot) cups of water.
The final posterior distribution indicates that some differences between waters are likely to exist but are small & imprecisely estimated and of little practical concern.
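A stripped-down version of the core model (the essay fits a Bayesian Bradley-Terry model in Stan with adaptive best-arm sampling; this sketch is only the pairwise-preference likelihood, fit by maximum likelihood on invented comparison outcomes):

```python
# Stripped-down Bradley-Terry model for pairwise taste comparisons, fit by
# maximum likelihood on made-up data (the essay's version is Bayesian, in Stan).
import numpy as np
from scipy.optimize import minimize

waters = ["tap", "Evian", "Fiji", "Aquafina"]
# (winner_index, loser_index) pairs: invented outcomes for illustration
comparisons = [(1, 0), (1, 0), (2, 0), (0, 3), (1, 3), (2, 3), (1, 2), (0, 2)]

def neg_log_lik(strengths):
    s = np.concatenate([[0.0], strengths])        # fix the first water's strength at 0
    ll = 0.0
    for winner, loser in comparisons:
        ll += s[winner] - np.logaddexp(s[winner], s[loser])   # log P(winner beats loser)
    return -ll

fit = minimize(neg_log_lik, x0=np.zeros(len(waters) - 1))
strengths = np.concatenate([[0.0], fit.x])
for name, s in zip(waters, strengths):
    print(f"{name:10s} latent strength {s:+.2f}")
```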
“The Kelly CoinFlipping Game: Exact Solutions”, Branwen et al 2017
Coinflip
: “The Kelly CoinFlipping Game: Exact Solutions”, (20170119; ; backlinks; similar):
Decision-theoretic analysis of how to optimally play Haghani & Dewey 2016’s 300-round double-or-nothing coin-flipping game with an edge and ceiling better than using the Kelly Criterion. Computing and following an exact decision tree increases earnings by $6.6 over a modified KC.
Haghani & Dewey 2016 experiment with a doubleornothing coinflipping game where the player starts with $30.4[^\$25.0^~2016~]{.supsub} and has an edge of 60%, and can play 300 times, choosing how much to bet each time, winning up to a maximum ceiling of $303.8[^\$250.0^~2016~]{.supsub}. Most of their subjects fail to play well, earning an average $110.6[^\$91.0^~2016~]{.supsub}, compared to Haghani & Dewey 2016’s heuristic benchmark of ~$291.6[^\$240.0^~2016~]{.supsub} in winnings achievable using a modified Kelly Criterion as their strategy. The KC, however, is not optimal for this problem as it ignores the ceiling and limited number of plays.
We solve the problem of the value of optimal play exactly by using decision trees & dynamic programming for calculating the value function, with implementations in R, Haskell, and C. We also provide a closedform exact value formula in R & Python, several approximations using Monte Carlo/random forests/neural networks, visualizations of the value function, and a Python implementation of the game for the OpenAI Gym collection. We find that optimal play yields $246.61 on average (rather than ~$240), and so the human players actually earned only 36.8% of what was possible, losing $155.6 in potential profit. Comparing decision trees and the Kelly criterion for various horizons (bets left), the relative advantage of the decision tree strategy depends on the horizon: it is highest when the player can make few bets (at b = 23, with a difference of ~$36), and decreases with number of bets as more strategies hit the ceiling.
In the Kelly game, the maximum winnings, number of rounds, and edge are fixed; we describe a more difficult generalized version in which the 3 parameters are drawn from Pareto, normal, and beta distributions and are unknown to the player (who can use Bayesian inference to try to estimate them during play). Upper and lower bounds are estimated on the value of this game. In the variant of this game where subjects are not told the exact edge of 60%, a Bayesian decision tree approach shows that performance can closely approach that of the decision tree, with a penalty for 1 plausible prior of only $1. Two deep reinforcement learning agents, DQN & DDPG, are implemented but DQN fails to learn and DDPG doesn’t show acceptable performance, indicating better deep RL methods may be required to solve the generalized Kelly game.
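The value-function recursion is easy to sketch with wealth and bets coarsened to whole dollars (the essay’s implementations are exact and in R/Haskell/C, so the number printed here only approximates the $246.61 figure, and the run takes a minute or two):

```python
# Dynamic-programming sketch of the coin-flipping game's value function with a
# coarse $1 discretization of wealth and bets: $25 start, 60% edge, 300 bets,
# $250 ceiling. V(w, r) = max over bets b of p*V(min(w+b, cap), r-1) + (1-p)*V(w-b, r-1).
from functools import lru_cache

P, CAP, START, ROUNDS = 0.6, 250, 25, 300

@lru_cache(maxsize=None)
def value(wealth, rounds_left):
    """Expected final wealth under optimal whole-dollar betting."""
    if wealth <= 0 or wealth >= CAP or rounds_left == 0:
        return float(min(max(wealth, 0), CAP))
    best = 0.0
    for bet in range(0, wealth + 1):          # betting $0 (passing) is allowed
        ev = (P * value(min(wealth + bet, CAP), rounds_left - 1)
              + (1 - P) * value(wealth - bet, rounds_left - 1))
        best = max(best, ev)
    return best

print(f"approximate value of optimal play: ${value(START, ROUNDS):.2f}")
```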
“Banner Ads Considered Harmful”, Branwen 2017
Ads
: “Banner Ads Considered Harmful”, (20170108; ; backlinks; similar):
9 months of daily A/B-testing of Google AdSense banner ads on Gwern.net indicates banner ads decrease total traffic substantially, possibly due to spillover effects in reader engagement and resharing.
One source of complexity & JavaScript use on Gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg. social media behavior), rendering the usual A/B testing method of per-pageload/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.
Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical Gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B test running January–October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 2017-01-01 to 2017-10-15, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions. Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of an expected traffic loss of −9.7% (driven by the subset of users without adblock), with an implied −14% traffic loss if all traffic were exposed to ads (95% credible interval: −13–16%), exceeding my decision threshold for disabling ads & strongly ruling out the possibility of acceptably small losses which might justify further experimentation.
Thus, banner ads on Gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.
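A minimal frequentist sketch of the model structure (the essay’s actual analysis is Bayesian, and the CSV columns here are hypothetical): regress log daily pageviews on a 0/1 “ads shown” indicator with ARIMA(4,0,1) errors.

```python
# Sketch only: ARIMA(4,0,1) regression of log daily pageviews on an "ads on"
# indicator. The essay's model is Bayesian; the file and column names below
# ("traffic.csv", "pageviews", "ads") are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("traffic.csv", parse_dates=["date"]).set_index("date").asfreq("D")

model = ARIMA(np.log(df["pageviews"]), exog=df["ads"], order=(4, 0, 1))
res = model.fit()

effect = res.params["ads"]
print(res.summary())
print(f"estimated effect of ads on log traffic: {effect:+.3f} "
      f"(≈ {100 * (np.exp(effect) - 1):+.1f}% traffic change)")
```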
“The Risk Elicitation Puzzle”, Pedroni et al 2017
2017pedroni.pdf
: “The risk elicitation puzzle”, Andreas Pedroni, Renato Frey, Adrian Bruhin, Gilles Dutilh, Ralph Hertwig, Jörg Rieskamp (20170101)
“Was Angelina Jolie Right? Optimizing Cancer Prevention Strategies Among BRCA Mutation Carriers”, Nohdurft et al 2017
2017nohdurft.pdf
: “Was Angelina Jolie Right? Optimizing Cancer Prevention Strategies Among BRCA Mutation Carriers”, Eike Nohdurft, Elisa Long, Stefan Spinler (20170101)
“Internet WiFi Improvement”, Branwen 2016
WiFi
: “Internet WiFi improvement”, (20161020; ; backlinks; similar):
After putting up with slow glitchy WiFi Internet for years, I investigate improvements. Upgrading the router, switching to a high-gain antenna, and installing a buried Ethernet cable all offer increasing speeds.
My laptop in my apartment receives Internet via a WiFi repeater to another house, yielding slow speeds and frequent glitches. I replaced the obsolete WiFi router, which increased connection speeds somewhat but still left them inadequate. For a better solution, I used a directional antenna to connect directly to the new WiFi router, which, contrary to my expectations, yielded a ~6× increase in speed. Extensive benchmarking of all possible arrangements of laptops/dongles/repeaters/antennas/routers/positions shows that the antenna+router is inexpensive and near-optimal in speed, and that the only possible improvement would be a hardwired Ethernet line, which I installed a few weeks later after learning it was not as difficult as I thought it would be.
“Why Tool AIs Want to Be Agent AIs”, Branwen 2016
ToolAI
: “Why Tool AIs Want to Be Agent AIs”, (20160907; ; backlinks; similar):
AIs limited to pure computation (Tool AIs) supporting humans, will be less intelligent, efficient, and economically valuable than more autonomous reinforcementlearning AIs (Agent AIs) who act on their own and metalearn, because all problems are reinforcementlearning problems.
Autonomous AI systems (Agent AIs) trained using reinforcement learning can do harm when they take wrong actions, especially superintelligent Agent AIs. One solution would be to eliminate their agency by not giving AIs the ability to take actions, confining them to purely informational or inferential tasks such as classification or prediction (Tool AIs), and have all actions be approved & executed by humans, giving equivalently superintelligent results without the risk.
I argue that this is not an effective solution for two major reasons. First, because Agent AIs will by definition be better at actions than Tool AIs, giving an economic advantage. Secondly, because Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as longterm memories or external software or large databases or the Internet, and how best to acquire new data.
All of these actions will result in Agent AIs more intelligent than Tool AIs, in addition to their greater economic competitiveness. Thus, Tool AIs will be inferior to Agent AIs in both actions and intelligence, implying use of Tool AIs is an even more highly unstable equilibrium than previously argued, as users of Agent AIs will be able to outcompete them on two dimensions (and not just one).
“Candy Japan’s New Box A/B Test”, Branwen 2016
CandyJapan
: “Candy Japan’s new box A/B test”, (20160506; ; backlinks; similar):
Bayesian decision-theoretic analysis of the effect of fancier packaging on subscription cancellations & optimal experiment design.
I analyze an A/B test from a mail-order company of two different kinds of box packaging from a Bayesian decision-theory perspective, balancing posterior probability of improvements & greater profit against the cost of packaging & risk of worse results, finding that as the company’s analysis suggested, the new box is unlikely to be sufficiently better than the old. Calculating expected values of information shows that it is not worth experimenting on further, and that such fixed-sample trials are unlikely to ever be cost-effective for packaging improvements. However, adaptive experiments may be worthwhile.
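A minimal sketch of the decision-theoretic comparison with invented counts and costs (the post uses Candy Japan’s actual numbers): Beta posteriors over each box’s cancellation rate, the probability the new box is better, and the expected profit change net of the extra packaging cost.

```python
# Sketch of the decision analysis with made-up numbers, not Candy Japan's data.
import numpy as np

rng = np.random.default_rng(2)

# hypothetical A/B results: (subscribers, cancellations)
old_n, old_cancel = 400, 60
new_n, new_cancel = 400, 50

draws = 200_000
p_old = rng.beta(1 + old_cancel, 1 + old_n - old_cancel, draws)
p_new = rng.beta(1 + new_cancel, 1 + new_n - new_cancel, draws)

value_per_retained = 30.0     # assumed lifetime value of a retained subscriber, $
extra_box_cost     = 0.50     # assumed extra cost of the fancier box per shipment, $

profit_gain = (p_old - p_new) * value_per_retained - extra_box_cost

print(f"P(new box has lower cancellation rate) = {np.mean(p_new < p_old):.2f}")
print(f"expected profit change per subscriber  = ${profit_gain.mean():+.2f}")
print(f"P(switching is profitable)             = {np.mean(profit_gain > 0):.2f}")
```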
“Embryo Selection For Intelligence”, Branwen 2016
Embryoselection
: “Embryo Selection For Intelligence”, (20160122; ; backlinks; similar):
A cost-benefit analysis of the marginal cost of IVF-based embryo selection for intelligence and other traits with 2016–2017 state-of-the-art.
With genetic predictors of a phenotypic trait, it is possible to select embryos during an in vitro fertilization process to increase or decrease that trait. Extending the work of Shulman & Bostrom 2014/Hsu 2014, I consider the case of human intelligence using SNP-based genetic prediction, finding:
 a meta-analysis of GCTA results indicates that SNPs can explain >33% of variance in current intelligence scores, and >44% with better-quality phenotype testing
 this sets an upper bound on the effectiveness of SNP-based selection: a gain of 9 IQ points when selecting the top embryo out of 10 (see the order-statistics sketch below)
 the best 2016 polygenic score could achieve a gain of ~3 IQ points when selecting out of 10
 the marginal cost of embryo selection (assuming IVF is already being done) is modest, at $1,822.7[^\$1,500.0^~2016~]{.supsub} + $243.0[^\$200.0^~2016~]{.supsub} per embryo, with the sequencing cost projected to drop rapidly
 a model of the IVF process, incorporating number of extracted eggs, losses to abnormalities & vitrification & failed implantation & miscarriages from 2 real IVF patient populations, estimates feasible gains of 0.39 & 0.68 IQ points
 embryo selection is currently unprofitable (mean: $435.0[^\$358.0^~2016~]{.supsub}) in the USA under the lowest estimate of the value of an IQ point, but profitable under the highest (mean: $7,570.3[^\$6,230.0^~2016~]{.supsub}). The main constraint on selection profitability is the polygenic score; under the highest value, the NPV EVPI of a perfect SNP predictor is $29.2[^\$24.0^~2016~]{.supsub}b and the EVSI per education/SNP sample is $86.3[^\$71.0^~2016~]{.supsub}k
 under the worst-case estimate, selection can be made profitable with a better polygenic score, which would require n > 237,300 using education phenotype data (and much less using fluid intelligence measures)
 selection can be made more effective by selecting on multiple phenotype traits: considering an example using 7 traits (IQ/height/BMI/diabetes/ADHD/bipolar/schizophrenia), there is a factor gain over IQ alone; the outperformance of multiple selection remains after adjusting for genetic correlations & polygenic scores and using a broader set of 16 traits.
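A rough order-statistics sketch of where the “9 IQ points from the best of 10” upper bound comes from (a simplification of the essay’s full IVF model): gain ≈ trait SD × √(predictor variance ÷ 2, since siblings share roughly half the variation) × the expected maximum of n standard normals.

```python
# Rough order-statistics approximation of the selection upper bound; the 33%
# variance figure is from the abstract, the halving of variance for siblings
# and the omission of IVF losses are simplifying assumptions.
import numpy as np

rng = np.random.default_rng(3)

n_embryos = 10
var_explained = 0.33            # GCTA upper bound on SNP-explained variance
sd_iq = 15.0

# Expected maximum of n standard normals, by Monte Carlo
expected_max = rng.standard_normal((500_000, n_embryos)).max(axis=1).mean()

gain = sd_iq * np.sqrt(var_explained / 2) * expected_max
print(f"E[max of {n_embryos} std normals] ≈ {expected_max:.2f}")
print(f"approximate upper-bound gain ≈ {gain:.1f} IQ points")
```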
 Overview of Major Approaches
 FAQ: Frequently Asked Questions
 Embryo Selection CostEffectiveness
 Iterated Embryo Selection
 Cloning
 See Also
 External Links
 Appendix
 IQ / Income Bibliography
 The Genius Factory, Plotz 2005
 Kong Et Al 2017 Polygenic Score Decline Derivation
 The Bell Curve, Murray & Herrnstein 1994: Dysgenics Opportunity Cost
 Embryo Selection And Dynasties
 Polygenic Scores In Plink
 History of IES
 Glue Robbers: Sequencing Nobelists Using Collectible Letters
“Bitter Melon for Blood Glucose”, Branwen 2015
Melon
: “Bitter Melon for blood glucose”, (20150914; ; similar):
Analysis of whether bitter melon reduces blood glucose in one self-experiment, and utility of further self-experimentation.
I reanalyze a bitter-melon/blood-glucose self-experiment, finding a small effect of increasing blood glucose after correcting for temporal trends & daily variation, giving both frequentist & Bayesian analyses. I then analyze the self-experiment from a subjective Bayesian decision-theoretic perspective, cursorily estimating the costs of diabetes & benefits of intervention in order to estimate Value Of Information for the self-experiment and the benefit of further self-experimenting; I find that the expected value of more data (EVSI) is negative and further self-experimenting would not be optimal compared to trying out other anti-diabetes interventions.
“Deep DPG (DDPG): Continuous Control With Deep Reinforcement Learning”, Lillicrap et al 2015
“Deep DPG (DDPG): Continuous control with deep reinforcement learning”, (20150909; ; backlinks; similar):
We adapt the ideas underlying the success of Deep QLearning to the continuous action domain.
We present an actorcritic, modelfree algorithm based on the deterministic policy gradient that can operate over continuous action spaces.
Using the same learning algorithm, network architecture and hyperparameters, our DDPG algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swingup, dexterous manipulation [gripper/reacher], legged locomotion [Cheetah/walker] and car driving [TORCS]. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives.
We further demonstrate that for many of the tasks the algorithm can learn policies endtoend: directly from raw pixel inputs.
“The Unfavorable Economics of Measuring the Returns to Advertising”, Lewis & Rao 2015
2015lewis.pdf
: “The Unfavorable Economics of Measuring the Returns to Advertising”, (20150706; ; backlinks; similar):
25 large field experiments with major U.S. retailers and brokerages, most reaching millions of customers and collectively representing $3.53^{$2.80}_{2015} million in digital advertising expenditure, reveal that measuring the returns to advertising is difficult.
The median confidence interval on return on investment is over 100 percentage points wide. Detailed sales data show that relative to the per capita cost of the advertising, individual-level sales are very volatile; a coefficient of variation of 10 is common. Hence, informative advertising experiments can easily require more than 10 million person-weeks, making experiments costly and potentially infeasible for many firms.
Despite these unfavorable economics, randomized control trials represent progress by injecting new, unbiased information into the market. The inference challenges revealed in the field experiments also show that selection bias, due to the targeted nature of advertising, is a crippling concern for widely employed observational methods.
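The unfavorable arithmetic can be reproduced on the back of an envelope with the usual two-sample formula; the dollar figures below are invented, but with sales noise ~10× the mean and an effect that is a modest multiple of a small per-person ad cost, the required sample lands well past the paper’s ten-million person-week mark.

```python
# Back-of-the-envelope version of the argument with hypothetical dollar figures:
# sales per person are ~10x as volatile as their mean, while the effect worth
# detecting is a modest multiple of a small per-person ad spend.
from scipy.stats import norm

mean_sales    = 7.00               # mean sales per person-week, $ (assumed)
sd_sales      = 10 * mean_sales    # coefficient of variation of 10, as in the paper
ad_cost       = 0.14               # ad spend per person, $ (assumed)
roi_to_detect = 0.25               # want to distinguish e.g. +25% ROI from 0% ROI

delta = roi_to_detect * ad_cost                 # incremental sales per person to detect
z_alpha, z_beta = norm.ppf(0.975), norm.ppf(0.8)
n_per_arm = 2 * (z_alpha + z_beta) ** 2 * (sd_sales / delta) ** 2

print(f"required sample size ≈ {2 * n_per_arm / 1e6:.0f} million person-weeks")
```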
“When Should I Check The Mail?”, Branwen 2015
Maildelivery
: “When Should I Check The Mail?”, (20150621; ; backlinks; similar):
Bayesian decision-theoretic analysis of local mail delivery times: modeling deliveries as survival analysis, model comparison, optimizing check times with a loss function, and optimal data collection.
Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using the survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely), and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data).
“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Ioffe & Szegedy 2015
“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, (20150211; backlinks; similar):
Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training minibatch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a stateoftheart image classification model, Batch Normalization achieves the same accuracy with 14× fewer training steps, and beats the original model by a significant margin. Using an ensemble of batchnormalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
“Selectiongain: an R Package for Optimizing Multistage Selection”, Mi et al 2015
2015mi.pdf
: “Selectiongain: an R package for optimizing multistage selection”, Xuefei Mi, H. Friedrich Utz, Albrecht E. Melchinger (20150101; backlinks)
“Focusing on the Longterm: It’s Good for Users and Business”, Hohnhold et al 2015
2015hohnhold.pdf
: “Focusing on the Longterm: It’s Good for Users and Business”, (2015; ; backlinks; similar):
Over the past 10+ years, online companies large and small have adopted widespread A/B testing as a robust data-based method for evaluating potential product improvements. In online experimentation, it is straightforward to measure the short-term effect, ie. the impact observed during the experiment. However, the short-term effect is not always predictive of the long-term effect, ie. the final impact once the product has fully launched and users have changed their behavior in response. Thus, the challenge is how to determine the long-term user impact while still being able to make decisions in a timely manner.
We tackle that challenge in this paper by first developing experiment methodology for quantifying long-term user learning. We then apply this methodology to ads shown on Google search, more specifically, to determine and quantify the drivers of ads blindness and sightedness, the phenomenon of users changing their inherent propensity to click on or interact with ads.
We use these results to create a model that uses metrics measurable in the short-term to predict the long-term. We learn that user satisfaction is paramount: ads blindness and sightedness are driven by the quality of previously viewed or clicked ads, as measured by both ad relevance and landing page quality. Focusing on user satisfaction not only ensures happier users but also makes business sense, as our results illustrate. We describe two major applications of our findings: a conceptual change to our search ads auction that further increased the importance of ads quality, and a 50% reduction of the ad load on Google’s mobile search interface.
The results presented in this paper are generalizable in two major ways. First, the methodology may be used to quantify user learning effects and to evaluate online experiments in contexts other than ads. Second, the ads blindness/sightedness results indicate that a focus on user satisfaction could help to reduce the ad load on the internet at large with long-term neutral, or even positive, business impact.
[Keywords: Controlled experiments; A/B testing; predictive modeling; overall evaluation criterion]
“Thompson Sampling With the Online Bootstrap”, Eckles & Kaptein 2014
“Thompson sampling with the online bootstrap”, (20141015; ; similar):
Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large scale bandit problems, and its performance is dependent on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson sampling by replacing the posterior distribution used in Thompson sampling by a bootstrap distribution. We first explain BTS and show that the performance of BTS is competitive to Thompson sampling in the wellstudied Bernoulli bandit case. Subsequently, we detail why BTS using the online bootstrap is more scalable than regular Thompson sampling, and we show through simulation that BTS is more robust to a misspecified error distribution. BTS is an appealing modification of Thompson sampling, especially when samples from the posterior are otherwise not available or are costly.
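A minimal sketch of BTS on a Bernoulli bandit (arm probabilities, the number of replicates, and the double-or-nothing resampling weights below are illustrative choices, following the paper’s description loosely):

```python
# Bootstrap Thompson sampling sketch: each arm keeps J online-bootstrap replicates
# of its mean reward; each new observation updates a replicate with probability 1/2
# ("double-or-nothing" weights), and arms are chosen by a replicate drawn at random.
import numpy as np

rng = np.random.default_rng(4)
true_p = [0.4, 0.5, 0.6]          # hypothetical arm success probabilities
K, J, T = len(true_p), 100, 5000

sums   = np.ones((K, J))          # one pseudo-success / two pseudo-trials per
counts = np.full((K, J), 2.0)     # replicate, as a weak prior
pulls  = np.zeros(K, dtype=int)

for t in range(T):
    j = rng.integers(J)                       # pick one bootstrap replicate at random
    arm = int(np.argmax(sums[:, j] / counts[:, j]))
    reward = float(rng.random() < true_p[arm])
    include = rng.random(J) < 0.5             # double-or-nothing: each replicate sees
    sums[arm, include]   += 2 * reward        # the new observation with prob. 1/2,
    counts[arm, include] += 2                 # with weight 2 when it does
    pulls[arm] += 1

print("pulls per arm:", dict(zip(map(str, true_p), pulls)))
```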
“Statistical Notes”, Branwen 2014
Statisticalnotes
: “Statistical Notes”, (20140717; ; backlinks; similar):
Miscellaneous statistical stuff
Given two disagreeing polls, one small & imprecise but taken at facevalue, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.
 Critiques
 “Someone Should Do Something”: Wishlist of Miscellaneous Project Ideas
 Estimating censored test scores
 The Traveling Gerontologist problem
 Bayes nets
 Genome sequencing costs
 Proposal: handcounting mobile app for more fluid group discussions
 Air conditioner replacement
 Some ways of dealing with measurement error
 Value of Information: clinical prediction instruments for suicide
 Bayesian Model Averaging
 Dealing with allornothing unreliability of data
 Dysgenics power analysis
 Power analysis for racial admixture studies of continuous variables
 Operating on an aneurysm
 The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example
 RNN metadata for mimicking individual author style
 MCTS
 Candy Japan A/B test
 DeFriesFulker power analysis
 Inferring mean IQs from SMPY / TIP elite samples
 Genius Revisited: On the Value of High IQ Elementary Schools
 Great Scott! Personal Name Collisions and the Birthday Paradox
 Detecting fake (human) Markov chain bots
 Optimal Existential Risk Reduction Investment
 Model Criticism via Machine Learning
 Proportion of Important Thinkers by Global Region Over Time in Charles Murray’s Human Accomplishment
 Program for nonspacedrepetition review of past written materials for serendipity & rediscovery: Archive Revisiter
 On the value of new statistical methods
 Bayesian power analysis: probability of exact replication
 Expectations are not expected deviations and large number of variables are not large samples
 Oh Deer: Could Deer Evolve to Avoid Car Accidents?
 Evolution as Backstop for Reinforcement Learning
 Acne: a good Quantified Self topic
 Fermi calculations
 Selective Emigration and Personality Trait Change
 The Most Abandoned Books on GoodReads
“Playing Atari With Deep Reinforcement Learning”, Mnih et al 2013
“Playing Atari with Deep Reinforcement Learning”, (20131219; backlinks; similar):
We present the first deep learning model to successfully learn control policies directly from highdimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Qlearning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
“On the Near Impossibility of Measuring the Returns to Advertising”, Lewis & Rao 2013
“On the Near Impossibility of Measuring the Returns to Advertising”, (20130423; ; backlinks; similar):
Classical theories of the firm assume access to reliable signals to measure the causal impact of choice variables on profit.
For advertising expenditure we show, using 25 online field experiments (representing $3.69^{$2.80}_{2013} million) with major U.S. retailers and brokerages, that this assumption typically does not hold. Statistical evidence from the randomized trials is very weak because individual-level sales are incredibly volatile relative to the per capita cost of a campaign—a “small” impact on a noisy dependent variable can generate positive returns.
A concise statistical argument shows that the required sample size for an experiment to generate sufficiently informative confidence intervals is typically in excess of ten million person-weeks. This also implies that heterogeneity bias (or model misspecification) unaccounted for by observational methods only needs to explain a tiny fraction of the variation in sales to severely bias estimates.
The weak informational feedback means most firms cannot even approach profit maximization.
“Caffeine Wakeup Experiment”, Branwen 2013
Caffeine
: “Caffeine wakeup experiment”, (20130407; ; backlinks; similar):
Self-experiment on whether consuming caffeine immediately upon waking results in less time in bed & higher productivity. The results indicate a small and uncertain effect.
One trick to combat morning sluggishness is to get caffeine extra-early by using caffeine pills shortly before or upon trying to get up. From 2013–2014 I ran a blinded & placebo-controlled randomized experiment measuring the effect of caffeine pills in the morning upon awakening time and daily productivity. The estimated effect is small and the posterior probability relatively low, but a decision analysis suggests that since caffeine pills are so cheap, it would be worthwhile to conduct another experiment; however, increasing Zeo equipment problems have made me hold off additional experiments indefinitely.
“Experimental Design for Partially Observed Markov Decision Processes”, Thorbergsson & Hooker 2012
“Experimental design for Partially Observed Markov Decision Processes”, (20120918; ; similar):
This paper deals with the question of how to most effectively conduct experiments in Partially Observed Markov Decision Processes so as to provide data that is most informative about a parameter of interest. Methods from Markov decision processes, especially dynamic programming, are introduced and then used in an algorithm to maximize a relevant Fisher Information. The algorithm is then applied to two POMDP examples. The methods developed can also be applied to stochastic dynamical systems, by suitable discretization, and we consequently show what control policies look like in the MorrisLecar Neuron model, and simulation results are presented. We discuss how parameter dependence within these methods can be dealt with by the use of priors, and develop tools to update control policies online. This is demonstrated in another stochastic dynamical system describing growth dynamics of DNA template in a PCR model.
“Rerandomization to Improve Covariate Balance in Experiments”, Morgan & Rubin 2012
2012morgan.pdf
: “Rerandomization to improve covariate balance in experiments”, (20120718; backlinks; similar):
Randomized experiments are the “gold standard” for estimating causal effects, yet often in practice, chance imbalances exist in covariate distributions between treatment groups. If covariate data are available before units are exposed to treatments, these chance imbalances can be mitigated by first checking covariate balance before the physical experiment takes place. Provided a precise definition of imbalance has been specified in advance, unbalanced randomizations can be discarded, followed by a rerandomization, and this process can continue until a randomization yielding balance according to the definition is achieved. By improving covariate balance, rerandomization provides more precise and trustworthy estimates of treatment effects.
[Keywords: randomization, treatment allocation, experimental design, clinical trial, causal effect, Mahalanobis distance, Hotelling’s T^{2}]
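The procedure is short enough to sketch directly (covariates, group sizes, and the acceptance threshold below are made up): redraw the 50/50 assignment until the Mahalanobis distance between group covariate means falls below the pre-specified threshold.

```python
# Sketch of rerandomization: keep redrawing the treatment assignment until the
# Mahalanobis balance statistic is below a threshold chosen before the experiment.
# Covariates and threshold here are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))                  # hypothetical baseline covariates
threshold = 1.0                              # acceptance criterion, fixed in advance

cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_balance(assign):
    diff = X[assign == 1].mean(axis=0) - X[assign == 0].mean(axis=0)
    scale = n / 4                            # = 1/(1/n_t + 1/n_c) for a 50/50 split
    return scale * diff @ cov_inv @ diff     # Morgan & Rubin's M statistic

draws = 0
while True:
    draws += 1
    assign = rng.permutation(np.repeat([0, 1], n // 2))
    if mahalanobis_balance(assign) < threshold:
        break

print(f"accepted a balanced randomization after {draws} draw(s), "
      f"M = {mahalanobis_balance(assign):.2f}")
```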
“Timing Technology: Lessons From The Media Lab”, Branwen 2012
Timing
: “Timing Technology: Lessons From The Media Lab”, (20120712; ; backlinks; similar):
Technological developments can be foreseen but the knowledge is largely useless because startups are inherently risky and require optimal timing. A more practical approach is to embrace uncertainty, taking a reinforcement learning perspective.
How do you time your startup? Technological forecasts are often surprisingly prescient in terms of predicting that something was possible & desirable and what they predict eventually happens; but they are far less successful at predicting the timing, and almost always fail, with the success (and riches) going to another.
Why is their knowledge so useless? Why are success and failure so intertwined in the tech industry? The right moment cannot be known exactly in advance, so attempts to forecast will typically be off by years or worse. For many claims, there is no way to invest in an idea except by going all in and launching a company, resulting in extreme variance in outcomes, even when the idea is good and the forecasts correct about the (eventual) outcome.
Progress can happen and can be foreseen long before, but the details and exact timing due to bottlenecks are too difficult to get right. Launching too early means failure, but being conservative & launching later is just as bad because regardless of forecasting, a good idea will draw overlyoptimistic researchers or entrepreneurs to it like moths to a flame: all get immolated but the one with the dumb luck to kiss the flame at the perfect instant, who then wins everything, at which point everyone can see that the optimal time is past. All major success stories overshadow their long list of predecessors who did the same thing, but got unlucky. The lesson of history is that for every lesson, there is an equal and opposite lesson. So, ideas can be divided into the overlyoptimistic & likely doomed, or the fait accompli. On an individual level, ideas are worthless because so many others have them too—‘multiple invention’ is the rule, and not the exception. Progress, then, depends on the ‘unreasonable man’.
This overall problem falls under the reinforcement learning paradigm, and successful approaches are analogous to Thompson sampling/posterior sampling: even an informed strategy can’t reliably beat random exploration which gradually shifts towards successful areas while continuing to take occasional long shots. Since people tend to systematically overexploit, how is this implemented? Apparently by individuals acting suboptimally on the personal level, but optimally on societal level by serving as random exploration.
A major benefit of R&D, then, is in laying fallow until the ‘ripe time’ when they can be immediately exploited in previouslyunpredictable ways; applied R&D or VC strategies should focus on maintaining diversity of investments, while continuing to flexibly revisit previous failures which forecasts indicate may have reached ‘ripe time’. This balances overall exploitation & exploration to progress as fast as possible, showing the usefulness of technological forecasting on a global level despite its uselessness to individuals.
“A/B Testing Longform Readability on Gwern.net”, Branwen 2012
ABtesting
: “A/B testing longform readability on Gwern.net”, (20120616; ; backlinks; similar):
A log of experiments done on the site design, intended to render pages more readable, focusing on the challenge of testing a static site, page width, fonts, plugins, and effects of advertising.
To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.
 Background
 Problems with “conversion” metric
 Ideas for testing
 Testing
 Resumption: ABalytics
 Max-width
 Max-width redux
 Fonts
 Line height
 Null test
 Text & background color
 List symbol and fontsize
 Blockquote formatting
 Font size & ToC background
 Section header capitalization
 ToC formatting
 BeeLine Reader text highlighting
 Floating footnotes
 Indented paragraphs
 Sidebar elements
 Moving sidebar’s metadata into page
 CSE
 Banner Ad Effect on Total Traffic
 Deep reinforcement learning
 Indentation + LeftJustified Text
 Appendix
“Redshift Sleep Experiment”, Branwen 2012
Redshift
: “Redshift sleep experiment”, (20120509; ; backlinks; similar):
Selfexperiment on whether screentinting software such as Redshift/f.lux affect sleep times and sleep quality; Redshift lets me sleep earlier but doesn’t improve sleep quality.
I ran a randomized experiment with a free program (Redshift) which reddens screens at night to avoid tampering with melatonin secretion & sleep, from 2012–2013, measuring sleep changes with my Zeo. With 533 days of data, the main result is that Redshift causes me to go to sleep half an hour earlier but otherwise does not improve sleep quality.
“Learning Is Planning: near Bayesoptimal Reinforcement Learning via MonteCarlo Tree Search”, Asmuth & Littman 2012
“Learning is planning: near Bayesoptimal reinforcement learning via MonteCarlo tree search”, (20120214; ; similar):
Bayesoptimal behavior, while welldefined, is often difficult to achieve. Recent advances in the use of MonteCarlo tree search (MCTS) have shown that it is possible to act nearoptimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayesoptimal behavior in an unknown MDP is equivalent to optimal behavior in the known beliefspace MDP, although the size of this beliefspace MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayesoptimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
“Why Philosophers Should Care About Computational Complexity”, Aaronson 2011
“Why Philosophers Should Care About Computational Complexity”, (20110808; ; backlinks; similar):
One might think that, once we know something is computable, how efficiently it can be computed is a practical question with little further philosophical importance. In this essay, I offer a detailed case that one would be wrong. In particular, I argue that computational complexity theory—the field that studies the resources (such as time, space, and randomness) needed to solve computational problems—leads to new perspectives on the nature of mathematical knowledge, the strong AI debate, computationalism, the problem of logical omniscience, Hume’s problem of induction, Goodman’s grue riddle, the foundations of quantum mechanics, economic rationality, closed timelike curves, and several other topics of philosophical interest. I end by discussing aspects of complexity theory itself that could benefit from philosophical analysis.
“Does Retail Advertising Work? Measuring the Effects of Advertising on Sales Via a Controlled Experiment on Yahoo!”, Lewis & Reiley 2011
“Does Retail Advertising Work? Measuring the Effects of Advertising on Sales Via a Controlled Experiment on Yahoo!”, (20110608; ; backlinks; similar):
We measure the causal effects of online advertising on sales, using a randomized experiment performed in cooperation between Yahoo! and a major retailer.
After identifying over one million customers matched in the databases of the retailer and Yahoo!, we randomly assign them to treatment and control groups. We analyze individuallevel data on ad exposure and weekly purchases at this retailer, both online and in stores.
We find statisticallysignificant and economically substantial impacts of the advertising on sales. The treatment effect persists for weeks after the end of an advertising campaign, and the total effect on revenues is estimated to be more than seven times the retailer’s expenditure on advertising during the study. Additional results explore differences in the number of advertising impressions delivered to each individual, online and offline sales, and the effects of advertising on those who click the ads versus those who merely view them.
Statistical power calculations show that, due to the high variance of sales, our large number of observations brings us just to the frontier of being able to measure economically substantial effects of advertising.
We also demonstrate that without an experiment, using industrystandard methods based on endogenous crosssectional variation in advertising exposure, we would have obtained a wildly inaccurate estimate of advertising effectiveness.
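To see why “the high variance of sales” pushes such experiments to the edge of feasibility, a back-of-the-envelope two-sample power calculation with purely illustrative numbers (not the paper’s) already requires hundreds of thousands of users per arm:

```python
# Two-sample z-approximation sample-size calculation: per-person ad effects are
# tiny relative to the standard deviation of weekly purchases, so enormous
# samples are needed for 80% power at the conventional 5% level.
from scipy.stats import norm

def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Sample size per arm to detect a mean difference `effect` given outcome sd."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * ((z_a + z_b) * sd / effect) ** 2

print(n_per_arm(effect=0.10, sd=15.0))  # e.g. +$0.10/week on an SD of $15 → ≈350,000 per arm
```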
“PILCO: A ModelBased and DataEfficient Approach to Policy Search”, Deisenroth & Rasmussen 2011
2011deisenroth.pdf
: “PILCO: A ModelBased and DataEfficient Approach to Policy Search”, (20110601; ; backlinks; similar):
In this paper, we introduce PILCO, a practical, dataefficient modelbased policy search method. PILCO reduces model bias, one of the key problems of modelbased reinforcement learning, in a principled way.
By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into longterm planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using stateoftheart approximate inference. Furthermore, policy gradients are computed analytically for policy improvement.
We report unprecedented learning efficiency on challenging and highdimensional control tasks.
[Remarkably, PILCO can learn your standard “Cartpole” task within just a few trials by carefully building a Bayesian Gaussian process model and picking the maximallyinformative experiments to run. Cartpole is quite difficult for a human; incidentally, there’s an installation of one in the SF Exploratorium, and I just had to try it out once I recognized it. (My sampleefficiency was not better than PILCO.)]
“Here, There, and Everywhere: Correlated Online Behaviors Can Lead to Overestimates of the Effects of Advertising”, Lewis et al 2011
2011lewis.pdf
: “Here, there, and everywhere: correlated online behaviors can lead to overestimates of the effects of advertising”, (201103; ; backlinks; similar):
Measuring the causal effects of online advertising (adfx) on user behavior is important to the health of the WWW publishing industry. In this paper, using three controlled experiments, we show that observational data frequently lead to incorrect estimates of adfx. The reason, which we label “activity bias”, comes from the surprising amount of timebased correlation between the myriad activities that users undertake online.
In Experiment 1, users who are exposed to an ad on a given day are much more likely to engage in brandrelevant search queries as compared to their recent history, for reasons that had nothing to do with the advertisement. In Experiment 2, we show that activity bias occurs for page views across diverse websites. In Experiment 3, we track account signups at a competitor’s (of the advertiser) website and find that many more people sign up on the day they saw an advertisement than on other days, but that the true “competitive effect” was minimal.
In all three experiments, exposure to a campaign signals doing “more of everything” in a given period of time, making it difficult to find a suitable “matched control” using prior behavior. In such cases, the “match” is fundamentally different from the exposed group, and we show how and why observational methods lead to a massive overestimate of adfx in such circumstances.
[Keywords: advertising effectiveness, browsing behavior, causal inference, field experiments, selection bias]
“Improving Vineyard Sampling Efficiency via Dynamic Spatially Explicit Optimisation”, Meyers et al 2011
2011meyers.pdf
: “Improving vineyard sampling efficiency via dynamic spatially explicit optimisation”, J. M. Meyers, G. L. Sacks, H. M. Van Es, J. E. Vanden Heuvel (20110101)
“The Time Resolution of the St Petersburg Paradox”, Peters 2011
“The time resolution of the St Petersburg paradox”, (2011; ):
A resolution of the St Petersburg paradox is presented. In contrast to the standard resolution, utility is not required. Instead, the timeaverage performance of the lottery is computed. The final result can be phrased mathematically identically to Daniel Bernoulli’s resolution, which uses logarithmic utility, but is derived using a conceptually different argument. The advantage of the time resolution is the elimination of arbitrary utility functions.
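The contrast can be computed in a few lines. Below is my own sketch of the idea (not Peters’ notation): the ensemble-average payout has no finite limit, but the expected logarithmic growth rate of wealth for a player with wealth w paying ticket price c is finite, and whether playing is attractive depends on w, which is the content of the time resolution.

```python
# St Petersburg lottery: payout 2^k with probability 2^-k (k = flips to first heads).
import math

def expected_payout(max_k=60):
    # each term 2^-k * 2^k = 1, so the partial sums grow without bound
    return sum(0.5 ** k * 2 ** k for k in range(1, max_k + 1))

def time_average_growth_rate(w, c, max_k=60):
    # expected log growth of wealth per play: sum_k 2^-k * ln((w - c + 2^k) / w);
    # terms beyond k ≈ 60 are numerically negligible
    return sum(0.5 ** k * math.log((w - c + 2 ** k) / w) for k in range(1, max_k + 1))

print(expected_payout(20), expected_payout(60))   # 20.0 vs 60.0: no finite limit
print(time_average_growth_rate(w=100, c=2))       # positive: worth playing at this wealth
print(time_average_growth_rate(w=100, c=50))      # negative: ruinous at this wealth
```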
“How to Improve R&D Productivity: the Pharmaceutical Industry's Grand Challenge”, Paul et al 2010
2010paul.pdf
: “How to improve R&D productivity: the pharmaceutical industry's grand challenge”, (20100219; backlinks; similar):
 The biopharmaceutical industry is facing unprecedented challenges to its fundamental business model and currently cannot sustain sufficient innovation to replace its products and revenues lost due to patent expirations.
 The number of truly innovative new medicines approved by regulatory agencies such as the US Food and Drug Administration has declined substantially despite continued increases in R&D spending, raising the current cost of each new molecular entity (NME) to ~US$2.49^{$1.80}_{2010} billion.
 Declining R&D productivity is arguably the most important challenge the industry faces and thus improving R&D productivity is its most important priority.
 A detailed analysis of the key elements that determine overall R&D productivity and the cost to successfully develop an NME reveals exactly where (and to what degree) R&D productivity can (and must) be improved.
 Reducing latestage (Phase II and III) attrition rates and cycle times during drug development are among the key requirements for improving R&D productivity.
 To achieve the necessary increase in R&D productivity, R&D investments, both financial and intellectual, must be focused on the ‘sweet spot’ of drug discovery and early clinical development, from target selection to clinical proofofconcept.
 The transformation from a traditional biopharmaceutical FIPCo (fully integrated pharmaceutical company) to a FIPNet (fully integrated pharmaceutical network) should allow a given R&D organization to ‘play bigger than its size’ and to more affordably fund the necessary number and quality of pipeline assets.
The pharmaceutical industry is under growing pressure from a range of environmental issues, including major losses of revenue owing to patent expirations, increasingly costconstrained healthcare systems and more demanding regulatory requirements. In our view, the key to tackling the challenges such issues pose to both the future viability of the pharmaceutical industry and advances in healthcare is to substantially increase the number and quality of innovative, costeffective new medicines, without incurring unsustainable R&D costs. However, it is widely acknowledged that trends in industry R&D productivity have been moving in the opposite direction for a number of years.
Here, we present a detailed analysis based on comprehensive, recent, industrywide data to identify the relative contributions of each of the steps in the drug discovery and development process to overall R&D productivity. We then propose specific strategies that could have the most substantial impact in improving R&D productivity.
“Drug Harms in the UK: a Multicriteria Decision Analysis”, Nutt et al 2010
2010nutt.pdf
: “Drug harms in the UK: a multicriteria decision analysis”, (20100101)
“Adversarial Risk Analysis”, Insua et al 2009
2009insua.pdf
: “Adversarial Risk Analysis”, (2009; similar):
Applications in counterterrorism and corporate competition have led to the development of new methods for the analysis of decision making when there are intelligent opponents and uncertain outcomes.
This field represents a combination of statistical risk analysis and game theory, and is sometimes called adversarial risk analysis.
In this article, we describe several formulations of adversarial risk problems, and provide a framework that extends traditional risk analysis tools, such as influence diagrams and probabilistic reasoning, to adversarial problems.
We also discuss the research challenges that arise when dealing with these models, illustrate the ideas with examples from business, and point out relevance to national defense. [keywords: auctions, decision theory, game theory, influence diagrams]
“Retrospectives Guinnessometrics: The Economic Foundation of “Student’s” T”, Ziliak 2008
2008ziliak.pdf
: “Retrospectives Guinnessometrics: The Economic Foundation of “Student’s” t”, (200809; ; backlinks; similar):
In economics and other sciences, “statisticalsignificance” is by custom, habit, and education a necessary and sufficient condition for proving an empirical result (Ziliak and McCloskey, 2008; McCloskey & Ziliak, 1996). The canonical routine is to calculate what’s called a tstatistic and then to compare its estimated value against a theoretically expected value of it, which is found in “Student’s” t table. A result yielding a tvalue greater than or equal to about 2.0 is said to be “statisticallysignificant at the 95% level.” Alternatively, a regression coefficient is said to be “statisticallysignificantly different from the null, p < 0.05.” Canonically speaking, if a coefficient clears the 95% hurdle, it warrants additional scientific attention. If not, not. The first presentation of “Student’s” test of statisticalsignificance came a century ago, in “The Probable Error of a Mean” (1908b), published by an anonymous “Student.” The author’s commercial employer required that his identity be shielded from competitors, but we have known for some decades that the article was written by William Sealy Gosset (1876–1937), whose entire career was spent at Guinness’s brewery in Dublin, where Gosset was a master brewer and experimental scientist (E. S. Pearson, 1937). Perhaps surprisingly, the ingenious “Student” did not give a hoot for a single finding of “statistical” significance, even at the 95% level of statisticalsignificance as established by his own tables. Beginning in 1904, “Student”, who was a businessman besides a scientist, took an economic approach to the logic of uncertainty, arguing finally that statisticalsignificance is “nearly valueless” in itself.
“The Guidelines Manual  Chapter 8: Incorporating Health Economics in Guidelines and Assessing Resource Impact”, NICE 2007
2007niceguidelinesch8.pdf
: “The guidelines manual  Chapter 8: Incorporating health economics in guidelines and assessing resource impact”, NICE (20070413; ; backlinks)
“On the Evolution of Investment Strategies and the Kelly Rule—A Darwinian Approach”, Lensberg & SchenkHoppé 2007
2007lensberg.pdf
: “On the Evolution of Investment Strategies and the Kelly Rule—A Darwinian Approach”, Terje Lensberg, Klaus Reiner SchenkHoppé (20070101; backlinks)
“Information Systems Project Continuation in Escalation Situations: A Real Options Model”, Tiwana et al 2006
2006tiwana.pdf
: “Information Systems Project Continuation in Escalation Situations: A Real Options Model”, (20061009; ; backlinks; similar):
Software project escalation has been shown to be a widespread phenomenon. With few exceptions, prior research has portrayed escalation as an irrational decisionmaking process whereby additional resources are plowed into a failing project.
In this article, we examine the possibility that in some cases managers escalate their commitment not because they are acting irrationally, but rather as a rational response to real options that may be embedded in a project.
A project embeds real options when managers have the opportunity but not the obligation to adjust the future direction of the project in response to external or internal events. Examples include deferring the project, switching the project to serve a different purpose, changing the scale of the project, implementing it in incremental stages, abandoning the project, or using the project as a platform for future growth opportunities. Although real options can represent a substantial portion of a project’s value, they rarely enter a project’s formal justification process in the traditional quantitative discounted cashflowbased project valuation techniques.
Using experimental data collected from managers in 123 firms, we demonstrate that managers recognize and value the presence of real options. We also assess the relative importance that managers ascribe to each type of real option, showing that growth options are more highly valued than operational options. Finally, we demonstrate that the influence of the options on project continuation decisions is largely mediated by the perceived value that they add.
Implications for both theory and practice are discussed.
[Keywords: decision making, escalation, information integration, information systems, innovation management, investment decisions, project continuation, project management, real options]
“Decision by Sampling”, Stewart et al 2006
2006stewart.pdf
: “Decision by sampling”, (20060801; backlinks; similar):
We present a theory of decision by sampling (DbS) in which, in contrast with traditional models, there are no underlying psychoeconomic scales.
Instead, we assume that an attribute’s subjective value is constructed from a series of binary, ordinal comparisons to a sample of attribute values drawn from memory and is its rank within the sample. We assume that the sample reflects both the immediate distribution of attribute values from the current decision’s context and also the background, realworld distribution of attribute values.
DbS accounts for concave utility functions; losses looming larger than gains; hyperbolic temporal discounting; and the overestimation of small probabilities and the underestimation of large probabilities.
[Keywords: judgment, decision making, sampling, memory, utility, gains and losses, temporal discounting, subjective probability]
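A toy construction of my own (not from the paper) shows how rank-within-a-sample valuation yields diminishing marginal value without any underlying utility scale, provided the remembered comparison amounts are right-skewed, as real-world gains tend to be:

```python
# Decision by Sampling, crudely: an amount's subjective value is its relative
# rank within a sample of comparison amounts drawn from memory.
import numpy as np

rng = np.random.default_rng(0)
memory_sample = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # skewed "remembered" gains

def subjective_value(x, sample=memory_sample):
    return (sample < x).mean()       # proportion of remembered amounts this amount beats

for amount in [20, 40, 80, 160, 320]:
    print(amount, round(subjective_value(amount), 3))
# each doubling of the amount adds less and less subjective value: a concave
# value function emerges from ranks alone
```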
“The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, Smith & Winkler 2006
2006smith.pdf
: “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis”, (20060301; ; backlinks; similar):
Decision analysis produces measures of value such as expected net present values or expected utilities and ranks alternatives by these value estimates. Other optimizationbased processes operate in a similar manner. With uncertainty and limited resources, an analysis is never perfect, so these value estimates are subject to error. We show that if we take these value estimates at face value and select accordingly, we should expect the value of the chosen alternative to be less than its estimate, even if the value estimates are unbiased. Thus, when comparing actual outcomes to value estimates, we should expect to be disappointed on average, not because of any inherent bias in the estimates themselves, but because of the optimizationbased selection process. We call this phenomenon the optimizer’s curse and argue that it is not well understood or appreciated in the decision analysis and management science communities. This curse may be a factor in creating skepticism in decision makers who review the results of an analysis.
In this paper, we study the optimizer’s curse and show that the resulting expected disappointment may be substantial. We then propose the use of Bayesian methods to adjust value estimates. These Bayesian methods can be viewed as disciplined skepticism and provide a method for avoiding this postdecision disappointment.
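The curse and its Bayesian correction are easy to reproduce by simulation. The sketch below uses illustrative normal priors and noise levels of my own choosing (not the paper’s examples): the chosen alternative’s raw estimate systematically overstates its true value, while shrinking every estimate toward the prior mean yields a calibrated value for whatever is chosen.

```python
# Optimizer's curse: pick the alternative with the highest unbiased estimate and
# you will be disappointed on average; normal-normal shrinkage removes the bias.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_alternatives = 20_000, 10
tau, sigma = 1.0, 1.0                      # prior sd of true values, sd of estimation error

true = rng.normal(0, tau, size=(n_sims, n_alternatives))
est = true + rng.normal(0, sigma, size=true.shape)     # unbiased but noisy estimates

rows = np.arange(n_sims)
pick = est.argmax(axis=1)
print("estimate of chosen alternative:", est[rows, pick].mean())    # ≈ +2.2
print("true value of chosen alternative:", true[rows, pick].mean()) # ≈ +1.1: the curse

shrunk = est * tau**2 / (tau**2 + sigma**2)            # posterior mean given each estimate
pick_b = shrunk.argmax(axis=1)
print("shrunk estimate of chosen:", shrunk[rows, pick_b].mean())    # ≈ +1.1
print("true value of chosen (Bayes):", true[rows, pick_b].mean())   # ≈ +1.1: calibrated
```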
“Investing in the Unknown and Unknowable”, Zeckhauser 2006
2006zeckhauser.pdf
: “Investing in the Unknown and Unknowable”, (2006; ; similar):
From David Ricardo making a fortune buying British government bonds on the eve of the Battle of Waterloo to Warren Buffett selling insurance to the California earthquake authority, the wisest investors have earned extraordinary returns by investing in the unknown and the unknowable (UU). But they have done so on a reasoned, sensible basis. This essay explains some of the central principles that such investors employ. It starts by discussing “ignorance”, a widespread situation in the real world of investing, where even the possible states of the world are not known. Traditional finance theory does not apply in UU situations.
Strategic thinking, deducing what other investors might know or not, and assessing whether they might be deterred from investing, for example due to fiduciary requirements, frequently point the way to profitability. Most big investment payouts come when money is combined with complementary skills, such as knowing how to develop real estate or new technologies. Those who lack these skills can look for “sidecar” investments that allow them to put their money alongside that of people they know to be both capable and honest. The reader is asked to consider a number of such investments.
Central concepts in decision analysis, game theory, and behavioral decision are deployed alongside real investment decisions to unearth successful investment strategies. These strategies are distilled into 8 investment maxims. Learning to invest more wisely in a UU world may be the most promising way to substantially bolster your prosperity.
[Keywords: investing, unknown, unknowable, sidecar investment, fattailed distribution, Warren Buffett, Kelly Criterion, asymmetric information]
“The Kelly Criterion in Blackjack Sports Betting, and the Stock Market”, Thorp 2006
2006thorp.pdf
: “The Kelly Criterion in Blackjack Sports Betting, and the Stock Market”, (2006; similar):
[By Edward O. Thorp] The central problem for gamblers is to find positive expectation bets. But the gambler also needs to know how to manage his money, ie. how much to bet. In the stock market (more inclusively, the securities markets) the problem is similar but more complex. The gambler, who is now an “investor”, looks for “excess risk adjusted return”.
In both these settings, we explore the use of the Kelly criterion, which is to maximize the expected value of the logarithm of wealth (“maximize expected logarithmic utility”). The criterion is known to economists and financial theorists by names such as the “geometric mean maximizing portfolio strategy”, maximizing logarithmic utility, the growthoptimal strategy, the capital growth criterion, etc.
The author initiated the practical application of the Kelly criterion by using it for card counting in blackjack. We will present some useful formulas and methods to answer various natural questions about it that arise in blackjack and other gambling games. Then we illustrate its recent use in a successful casino sports betting system. Finally, we discuss its application to the securities markets where it has helped the author to make a 30 year total of 80 billion dollars worth of “bets”.
[Keywords: Kelly criterion, betting, long run investing, portfolio allocation, logarithmic utility, capital growth]
Abstract
Introduction
Coin tossing
Optimal growth: Kelly criterion formulas for practitioners
 The probability of reaching a fixed goal on or before n trials
 The probability of ever being reduced to a fraction x of this initial bankroll
 The probability of being at or above a specified value at the end of a specified number of trials
 Continuous approximation of expected time to reach a goal
 Comparing fixed fraction strategies: the probability that one strategy leads another after n trials
The long run: when will the Kelly strategy “dominate”?
Blackjack
Sports betting
Wall Street: the biggest game
 Continuous approximation
 The (almost) real world
 The case for “fractional Kelly”
 A remarkable formula
A case study
 The constraints
 The analysis and results
 The recommendation and the result
 The theory for a portfolio of securities
My experience with the Kelly approach
Conclusion
Acknowledgments
Appendix A: Integrals for deriving moments of E_{∞}
Appendix B: Derivation of formula (3.1)
Appendix C: Expected time to reach goal
References
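For the coin-tossing setting summarized in the abstract above, the standard textbook Kelly fraction and a quick simulation (my sketch, not Thorp’s own formulas or code) illustrate why both under-betting and over-betting grow wealth more slowly than full Kelly:

```python
# Kelly fraction for a bet paying b-to-1 with win probability p: f* = (p*b - (1-p)) / b.
import numpy as np

def kelly_fraction(p, b=1.0):
    return (p * b - (1 - p)) / b

def simulate_growth(f, p, b=1.0, rounds=10_000, seed=0):
    """Realized per-round log-growth of wealth betting a fixed fraction f."""
    rng = np.random.default_rng(seed)
    wins = rng.random(rounds) < p
    log_w = np.where(wins, np.log1p(f * b), np.log1p(-f)).sum()
    return log_w / rounds

p = 0.55                                   # 55% chance of winning an even-money bet
f_star = kelly_fraction(p)                 # = 0.10
for f in [0.5 * f_star, f_star, 2 * f_star]:
    print(f, simulate_growth(f, p))        # full Kelly has the highest growth rate;
                                           # 2x Kelly grows at roughly zero
```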
“Good and Real: Demystifying Paradoxes from Physics to Ethics”, Drescher 2006
2006dreschergoodandreal.pdf
: “Good and Real: Demystifying Paradoxes from Physics to Ethics”, (2006; backlinks; similar):
In Good and Real, a tourdeforce of metaphysical naturalism, computer scientist Gary Drescher examines a series of provocative paradoxes about consciousness, choice, ethics, quantum mechanics, and other topics, in an effort to reconcile a purely mechanical view of the universe with key aspects of our subjective impressions of our own existence.
Many scientists suspect that the universe can ultimately be described by a simple (perhaps even deterministic) formalism; all that is real unfolds mechanically according to that formalism. But how, then, is it possible for us to be conscious, or to make genuine choices? And how can there be an ethical dimension to such choices? Drescher sketches computational models of consciousness, choice, and subjunctive reasoning—what would happen if this or that were to occur?—to show how such phenomena are compatible with a mechanical, even deterministic universe.
Analyses of Newcomb’s Problem (a paradox about choice) and the Prisoner’s Dilemma (a paradox about selfinterest vs altruism, arguably reducible to Newcomb’s Problem) help bring the problems and proposed solutions into focus. Regarding quantum mechanics, Drescher builds on Everett’s relativestate formulation—but presenting a simplified formalism, accessible to laypersons—to argue that, contrary to some popular impressions, quantum mechanics is compatible with an objective, deterministic physical reality, and that there is no special connection between quantum phenomena and consciousness.
In each of several disparate but intertwined topics ranging from physics to ethics, Drescher argues that a missing technical linchpin can make the quest for objectivity seem impossible, until the elusive technical fix is at hand:
 Chapter 2 explores how inanimate, mechanical matter could be conscious, just by virtue of being organized to perform the right kind of computation.
 Chapter 3 explains why conscious beings would experience an apparent inexorable forward flow of time, even in a universe whose physical principles are timesymmetric and have no such flow, with everything sitting statically in spacetime.
 Chapter 4, following [Hugh] Everett, looks closely at the paradoxes of quantum mechanics, showing how some theorists came to conclude—mistakenly, I argue—that consciousness is part of the story of quantum phenomena, or vice versa. Chapter 4 also shows how quantum phenomena are consistent with determinism (even though socalled hiddenvariable theories of quantum determinism are provably wrong).
 Chapter 5 examines in detail how it can be that we make genuine choices in a mechanical, deterministic universe.
 Chapter 6 analyzes Newcomb’s Problem, a startling paradox that elicits some counterintuitive conclusions about choice and causality.
 Chapter 7 considers how our choices can have a moral component—that is, how even a mechanical, deterministic universe can provide a basis for distinguishing right from wrong.
 Chapter 8 wraps up the presentation and touches briefly on some concluding metaphysical questions.
“Policy Mining: Learning Decision Policies from Fixed Sets of Data”, Zadrozny 2003
2003zadrozny.pdf
: “Policy Mining: Learning Decision Policies from Fixed Sets of Data”, (2003; ; similar):
In this thesis we present a new data mining methodology for extracting decision policies from datasets containing descriptions of interactions with an environment. This methodology, which we call policy mining, is valuable for applications in which experimental interaction is not feasible but for which fixed sets of collected data are available. Examples of such applications are direct marketing, credit card fraud detection, recommender systems and medical treatment.
Recent advances in classifier learning and the availability of a great variety of offtheshelf learners make it attractive to use classifier learning as the core generalization tool in policy mining. However, in order to successfully apply classifier learning methods to policy mining, 3 important improvements to the current classifier learning technology are necessary.
First, standard classifier learners assume that all incorrect predictions are equally costly. This thesis presents 2 general methods for costsensitive learning that take into account the fact that misclassification costs are different for different examples and unknown for some examples. The methods we propose are evaluated carefully with experiments using large, difficult and highly costsensitive datasets from the direct marketing domain.
Second, most existing learning methods produce classifiers that output ranking scores along with the class label. These scores, however, are classifier dependent and cannot be easily combined with other sources of information for decisionmaking. This thesis presents a fast and effective calibration algorithm for transforming ranking scores into accurate class membership probability estimates. Experimental results using datasets from a variety of domains shows that the method produces probability estimates that are comparable to or better than the ones produced by other methods.
Finally, learning algorithms commonly assume that the available data consists of randomly drawn examples from the same underlying distribution of examples about which the learned model is expected to make predictions. In many situations, however, this assumption is violated because we do not have control over the data gathering process. This thesis formalizes the sample selection bias problem in machine learning and presents methods for learning and evaluation under sample selection bias.
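The score-calibration step can be illustrated with isotonic regression on held-out data, which is in the spirit of the thesis’s approach but stands in for its specific algorithm; the classifier, dataset, and bin-based reliability check below are my own illustrative choices.

```python
# Calibrating classifier scores into class-membership probabilities via a
# monotone (isotonic) map fit on a held-out calibration set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.4, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
raw = clf.predict_proba(X_cal)[:, 1]                 # ranking scores, often miscalibrated

iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y_cal)
calibrated = iso.predict(raw)                        # monotone map: score -> probability

# crude reliability check: average predicted probability vs observed rate per score bin
bins = np.digitize(raw, np.linspace(0, 1, 11))
for b in np.unique(bins):
    m = bins == b
    print(b, round(calibrated[m].mean(), 2), round(y_cal[m].mean(), 2))
```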
“John W. Tukey: His Life and Professional Contributions”, Brillinger 2002
2002brillinger.pdf
: “John W. Tukey: His Life and Professional Contributions”, David R. Brillinger (20021201)
“Stigler’s Diet Problem Revisited”, Garille & Gass 2001
2001garille.pdf
: “Stigler’s Diet Problem Revisited”, Susan Garner Garille, Saul I. Gass (20010101; ; backlinks)
“Should We Take Measurements at an Intermediate Design Point?”, Gelman 2000
2000gelman.pdf
: “Should we take measurements at an intermediate design point?”, (200003; similar):
It is well known that, for estimating a linear treatment effect with constant variance, the optimal design divides the units equally between the 2 extremes of the design space. If the doseresponse relation may be nonlinear, however, intermediate measurements may be useful in order to estimate the effects of partial treatments.
We consider the decision of whether to gather data at an intermediate design point: do the gains from learning about nonlinearity outweigh the loss in efficiency in estimating the linear effect?
Under reasonable assumptions about nonlinearity, we find that, unless sample size is very large, the design with no interior measurements is best, because with moderate total sample sizes, any nonlinearity in the doseresponse will be difficult to detect.
We discuss in the context of a simplified version of the problem that motivated this work—a study of pestcontrol treatments intended to reduce asthma symptoms in children.
[Keywords: asthma, Bayesian inference, doseresponse experimental design, pest control, statisticalsignificance.]
[See also: the “bet on sparsity principle”.]
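The efficiency side of the trade-off is easy to quantify. A toy calculation of my own (not Gelman’s analysis): if the dose-response really is linear, moving a third of the units to an interior design point inflates the variance of the estimated linear effect by 50%.

```python
# Variance of the OLS slope estimate under two allocations of 90 units on [-1, 1].
import numpy as np

def slope_variance(xs, sigma=1.0):
    X = np.column_stack([np.ones_like(xs), xs])      # intercept + dose
    return (sigma**2 * np.linalg.inv(X.T @ X))[1, 1]

n = 90
extremes = np.repeat([-1.0, 1.0], n // 2)            # all units at the two extremes
with_mid = np.repeat([-1.0, 0.0, 1.0], n // 3)       # a third moved to the interior point
print(slope_variance(extremes), slope_variance(with_mid))   # ratio = 1.5
```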
“Comparing Classifiers When the Misallocation Costs Are Uncertain”, Adams & Hand 1999
1999adams.pdf
: “Comparing classifiers when the misallocation costs are uncertain”, (19990701; similar):
Receiver Operating Characteristic (ROC) curves are popular ways of summarising the performance of two-class classification rules.
In fact, however, they are extremely inconvenient. If the relative severity of the two different kinds of misclassification is known, then an awkward projection operation is required to deduce the overall loss. At the other extreme, when the relative severity is unknown, the area under an ROC curve is often used as an index of performance. However, this essentially assumes that nothing whatsoever is known about the relative severity—a situation which is very rare in real problems.
We present an alternative plot which is more revealing than an ROC plot, and we describe a comparative index which allows one to take advantage of anything that may be known about the relative severity of the two kinds of misclassification.
[Keywords: ROC curve, error rate, loss function, misclassification costs, classification rule, supervised classification]
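In the spirit of the paper’s argument (though not its specific plot or index), one can compare classifiers by their minimum expected misclassification cost across a range of assumed cost ratios rather than by AUC alone; the data and classifiers below are synthetic stand-ins.

```python
# Compare two classifiers by best achievable expected cost at each assumed
# cost ratio (false negative cost : false positive cost), using their ROC points.
import numpy as np
from sklearn.metrics import roc_curve

def min_expected_cost(y, scores, cost_fn, cost_fp, prevalence):
    fpr, tpr, _ = roc_curve(y, scores)
    fnr = 1 - tpr
    costs = prevalence * fnr * cost_fn + (1 - prevalence) * fpr * cost_fp
    return costs.min()              # best threshold for that cost ratio

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
score_a = y + rng.normal(0, 1.0, size=y.size)     # hypothetical classifier A
score_b = y + rng.normal(0, 1.3, size=y.size)     # hypothetical, noisier classifier B

for ratio in [1, 2, 5, 10, 20]:
    a = min_expected_cost(y, score_a, cost_fn=ratio, cost_fp=1, prevalence=y.mean())
    b = min_expected_cost(y, score_b, cost_fn=ratio, cost_fp=1, prevalence=y.mean())
    print(ratio, round(a, 3), round(b, 3))        # A should dominate B at every ratio
```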
“Adding Risks: Samuelson's Fallacy of Large Numbers Revisited”, Ross 1999
1999ross.pdf
: “Adding Risks: Samuelson's Fallacy of Large Numbers Revisited”, Stephen A. Ross (19990101; backlinks)
“Information Theory and an Extension of the Maximum Likelihood Principle”, Akaike 1998
1998akaike.pdf
: “Information Theory and an Extension of the Maximum Likelihood Principle”, (1998; similar):
[From Selected Papers of Hirotugu Akaike, pg199–213; Originally published in Proceedings of the Second International Symposium on Information Theory, B.N. Petrov and F. Caski, eds., Akademiai Kiado, Budapest, 1973, 267–281]
In this paper it is shown that the classical maximum likelihood principle can be considered to be a method of asymptotic realization of an optimum estimate with respect to a very general information theoretic criterion. This observation shows an extension of the principle to provide answers to many practical problems of statistical model fitting.
[Keywords: autoregressive model, final prediction error, maximum likelihood principle, statistical model identification, statistical decision function]
“‘Improving Ratings’: Audit in the British University System”, Strathern 1997
1997strathern.pdf
: “‘Improving ratings’: audit in the British University system”, (19970701; similar):
This paper gives an anthropological comment on what has been called the ‘audit explosion’, the proliferation of procedures for evaluating performance. In higher education the subject of audit (in this sense) is not so much the education of the students as the institutional provision for their education. British universities, as institutions, are increasingly subject to national scrutiny for teaching, research and administrative competence. In the wake of this scrutiny comes a new cultural apparatus of expectations and technologies. While the metaphor of financial auditing points to the important values of accountability, audit does more than monitor—it has a life of its own that jeopardizes the life it audits. The runaway character of assessment practices is analysed in terms of cultural practice. Higher education is intimately bound up with the origins of such practices, and is not just the latter day target of them.
…When a measure becomes a target, it ceases to be a good measure. The more a 2.1 examination performance becomes an expectation, the poorer it becomes as a discriminator of individual performances. Hoskin describes this as ‘Goodhart’s law’, after the latter’s observation on instruments for monetary control which lead to other devices for monetary flexibility having to be invented. However, targets that seem measurable become enticing tools for improvement. The linking of improvement to commensurable increase produced practices of wide application. It was that conflation of ‘is’ and ‘ought’, alongside the techniques of quantifiable written assessments, which led in Hoskin’s view to the modernist invention of accountability. This was articulated in Britain for the first time around 1800 as ‘the awful idea of accountability’ (Ref. 3, p. 268)
“The 'awful Idea of Accountability': Inscribing People into the Measurement of Objects”, Hoskin 1996
1996hoskin.pdf
: “The 'awful idea of accountability': inscribing people into the measurement of objects”, Keith Hoskin (19960101; backlinks)
“Seeing The Forest From The Trees: When Predicting The Behavior Or Status Of Groups, Correlate Means”, Lubinski & Humphreys 1996b
1996lubinski2.pdf
: “Seeing The Forest From The Trees: When Predicting The Behavior Or Status Of Groups, Correlate Means”, (1996; ; backlinks; similar):
When measures of individual differences are used to predict group performance, the reporting of correlations computed on samples of individuals invites misinterpretation and dismissal of the data. In contrast, if regression equations are used in which the correlations required are computed on bivariate means, as are the distribution statistics, it is difficult to underappreciate or lightly dismiss the utility of psychological predictors.
Given sufficient sample size and linearity of regression, this technique produces crossvalidated regression equations that forecast criterion means with almost perfect accuracy. This level of accuracy is provided by correlations approaching unity between bivariate samples of predictor and criterion means, and this holds true regardless of the magnitude of the “simple” correlation (eg. r_{xy} = 0.20, or r_{xy} = 0.80).
We illustrate this technique empirically using a measure of general intelligence as the predictor and other measures of individual differences and socioeconomic status as criteria. In addition to theoretical applications pertaining to group trends, this methodology also has implications for applied problems aimed at developing policy in numerous fields.
…To summarize, psychological variables generating modest correlations frequently are discounted by those who focus on the magnitude of unaccounted for criterion variance, large standard errors, and frequent false positive and false negative errors in predicting individuals. Dismissal of modest correlations (and the utility of their regressions) by professionals based on this psychometricstatistical reasoning has spread to administrators, journalists, and legislative policy makers. Some examples of this have been compiled by Dawes (1979, 1988) and Linn (1982). They range from squaring a correlation of 0.345 (ie. 0.12) and concluding that for 88% of students, “An SAT score will predict their grade rank no more accurately than a pair of dice” (cf. Linn, 1982, p. 280) to evaluating the differential utility of two correlations 0.20 and 0.40 (based on different procedures for selecting graduate students) as “twice of nothing is nothing” (cf. Dawes, 1979, p. 580).
…Tests are used, however, in ways other than the prediction of individuals or of a specific outcome for Johnny or Jane. And policy decisions based on tests frequently have broader implications for individuals beyond those directly involved in the assessment and selection context (see the discussion later in this article). For example, selection of personnel in education, business, industry, and the military focuses on the criterion performance of groups of applicants whose scores on selection instruments differ. Selection psychologists have long made use of modest predictive correlations when the ratio of applicants to openings becomes large. The relation of utility to size of correlation, relative to the selection ratio and base rate for success (if one ignores the test scores), is incorporated in the wellknown TaylorRussell (1939) tables. These tables are examples of how psychological tests have revealed convincingly economic and societal benefits (Hartigan & Wigdor 1989), even when a correlation of modest size remains at center stage. For example, given a base rate of 30% for adequate performance and a predictive validity coefficient of 0.30 within the applicant population, selecting the top 20% on the predictor test will result in 46% of hires ultimately achieving adequate performance (a 16% gain over base rate). To be sure, the prediction for individuals within any group is not strong—about 9% of the variance in job performance. Yet, when training is expensive or timeconsuming, this can result in huge savings. For analyses of groups composed of anonymous persons, however, there is a more unequivocal way of illustrating the importance of modest correlations than even the TaylorRussell tables provide.
Rationale for an Alternative Approach: Applied psychologists discovered decades ago that it is more advantageous to report correlations between a continuous predictor and a dichotomous criterion graphically rather than as a number that varies between zero and one. For example, the correlation (point biserial) of about 0.40 with the passfail pilot training criterion and an abilitystanine predictor looks quite impressive when graphed in the manner of Figure 1a. In contrast, in Figure 1b, a scatter plot of a correlation of 0.40 between two continuous measures looks at first glance like the pattern of birdshot on a target. It takes close scrutiny to perceive that the pattern in Figure 1b is not quite circular for the small correlation. Figure 1a communicates the information more effectively than Figure 1b. When the data on the predictive validity of the pilot abilitystanine were presented in the form of Figure 1a (rather than, say, as a scatter plot of a correlation of 0.40; Figure 1b), general officers in recruitment, training, logistics, and operations immediately grasped the importance of the data for their problems. Because the Army Air Forces were an attractive career choice, there were many more applicants for pilot training than could be accommodated and selection was required…A small gain on a criterion for a unit of gain on the predictor, as long as it is predicted with nearperfect accuracy, can have high utility.
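The Taylor-Russell figure quoted above is easy to check by Monte Carlo under the usual bivariate-normal assumptions (the simulation below is mine, not the authors’):

```python
# Taylor-Russell-style check: base rate 30%, predictive validity r = 0.30,
# select the top 20% on the predictor -> success rate among hires ≈ 46%.
import numpy as np

rng = np.random.default_rng(0)
n, r = 1_000_000, 0.30
x = rng.normal(size=n)                               # predictor (test score)
y = r * x + np.sqrt(1 - r**2) * rng.normal(size=n)   # criterion correlated r with x

success = y > np.quantile(y, 0.70)                   # 30% base rate of adequate performance
selected = x > np.quantile(x, 0.80)                  # hire the top 20% on the predictor

print(success.mean(), success[selected].mean())      # ≈ 0.30 overall vs ≈ 0.46 among hires
```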
“Processing Linguistic Probabilities: General Principles and Empirical Evidence”, Budescu & Wallsten 1995
1995budescu.pdf
: “Processing Linguistic Probabilities: General Principles and Empirical Evidence”, (1995; backlinks; similar):
This chapter discusses that practical issues arise because weighty decisions often depend on forecasts and opinions communicated from one person or set of individuals to another.
The standard wisdom has been that numerical communication is better than linguistic, and therefore, especially in important contexts, it is to be preferred. A good deal of evidence suggests that this advice is not uniformly correct and is inconsistent with strongly held preferences. A theoretical understanding of the preceding questions is an important step toward the development of means for improving communication, judgment, and decision making under uncertainty. The theoretical issues concern how individuals interpret imprecise linguistic terms, what factors affect their interpretations, and how they combine those terms with other information for the purpose of taking action. The chapter reviews the relevant literature in order to develop a theory of how linguistic information about imprecise continuous quantities is processed in the service of decision making, judgment, and communication.
It provides the current view, which has evolved inductively, to substantiate it where the data allow, and to suggest where additional research is needed. It also summarizes the research on meanings of qualitative probability expressions and compares judgments and decisions made on the basis of vague and precise probabilities.
“Computer Based Horse Race Handicapping and Wagering Systems: A Report”, Hausch et al 1994
1994benter.pdf
: “Computer Based Horse Race Handicapping and Wagering Systems: A Report”, Donald B. Hausch, Victor SY Lo, William T. Ziemba (19940101)
“Bayesian Updating in Hierarchic Markov Processes Applied to the Animal Replacement Problem”, Kristensen 1993
1993kristensen.pdf
: “Bayesian updating in hierarchic Markov processes applied to the animal replacement problem”, (19930601; similar):
The observed level of milk yield of a dairy cow or the litter size of a sow is only partially the result of a permanent characteristic of the animal; temporary effects are also involved. Thus, we face a problem concerning the proper definition and measurement of the traits in order to give the best possible prediction of the future revenues from an animal considered for replacement. A trait model describing the underlying effects is built into a model combining a Bayesian approach with a hierarchic Markov process in order to be able to calculate optimal replacement policies under various conditions.
“Learning from Coarse Information: Biased Contests and Career Profiles”, Meyer 1991
1991meyer.pdf
: “Learning from Coarse Information: Biased Contests and Career Profiles”, (1991; similar):
An organization’s promotion decision between 2 workers is modelled as a problem of boundedlyrational learning about ability. The decisionmaker can bias noisy rankorder contests sequentially, thereby changing the information they convey.
The optimal finalperiod bias favours the “leader”, reinforcing his likely ability advantage. When optimally biased rankorder information is a sufficient statistic for cardinal information, the leader is favoured in every period. In other environments, bias in early periods may (1) favour the early loser, (2) be optimal even when the workers are equally rated, and (3) reduce the favoured worker’s promotion chances.
“Weight or the Value of Knowledge”, Ramsey 1990
1990ramsey.pdf
: “Weight or the Value of Knowledge”, Frank P. Ramsey (19900101; ; backlinks)
“'Student': A Statistical Biography of William Sealy Gosset”, Pearson et al 1990
1990pearsonstudentastatisticalbiographyofwilliamsealygosset.pdf
: “'Student': A Statistical Biography of William Sealy Gosset”, Egon S. Pearson, R. L. Plackett, G. A. Barnard (19900101)
“F. P. Ramsey: Philosophical Papers”, Ramsey & Mellor 1990
1990mellorfrankramseyphilosophicalpapers.pdf
: “F. P. Ramsey: Philosophical Papers”, F. P. Ramsey, D. H. Mellor (19900101; ; backlinks)
“The Total Evidence Theorem for Probability Kinematics”, Graves 1989
1989graves.pdf
: “The Total Evidence Theorem for Probability Kinematics”, (198906; similar):
L. J. Savage and I. J. Good have each demonstrated that the expected utility of free information [Value of Information] is never negative for a decision maker who updates her degrees of belief by conditionalization on propositions learned for certain. In this paper Good’s argument is generalized to show the same result for a decision maker who updates her degrees of belief on the basis of uncertain information by Richard Jeffrey’s probability kinematics. The Savage/Good result is shown to be a special case of the more general result.
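The Savage/Good result for the simple conditionalization case can be illustrated with a two-state, two-action toy decision problem of my own construction: acting after observing the state can never have lower expected payoff than acting on the prior alone, so the expected value of (free) perfect information is non-negative.

```python
# Expected value of perfect information (EVPI) for a tiny decision problem.
import numpy as np

payoff = np.array([[10.0, -5.0],    # action 0's payoff in state 0, state 1
                   [ 0.0,  2.0]])   # action 1's payoff in state 0, state 1
prior = np.array([0.4, 0.6])        # prior probabilities of the two states

ev_no_info = (payoff @ prior).max()               # commit to the single best action now
ev_perfect = (payoff.max(axis=0) * prior).sum()   # pick the best action in each state
print(ev_no_info, ev_perfect, ev_perfect - ev_no_info)   # 1.2, 5.2, EVPI = 4.0 >= 0
```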
“Nonlinear Preference and Utility Theory”, Fishburn 1988
1988fishburnnonlinearpreferencesandutilitytheory.pdf
: “Nonlinear Preference and Utility Theory”, Peter C. Fishburn (19880101)
“Measuring the Vague Meanings of Probability Terms”, Wallsten et al 1986
1986wallsten.pdf
: “Measuring the vague meanings of probability terms”, (19861201; backlinks; similar):
Can the vague meanings of probability terms such as doubtful, probable, or likely be expressed as membership functions over the [0, 1] probability interval? A function for a given term would assign a membership value of 0 to probabilities not at all in the vague concept represented by the term, a membership value of 1 to probabilities definitely in the concept, and intermediate membership values to probabilities represented by the term to some degree.
A modified paircomparison procedure was used in 2 experiments to empirically establish and assess membership functions for several probability terms. Subjects performed 2 tasks in both experiments: They judged (1) to what degree one probability rather than another was better described by a given probability term, and (2) to what degree one term rather than another better described a specified probability. Probabilities were displayed as relative areas on spinners.
Task 1 data were analyzed from the perspective of conjointmeasurement theory, and membership function values were obtained for each term according to various scaling models. The conjointmeasurement axioms were well satisfied and goodnessoffit measures for the scaling procedures were high. Individual differences were large but stable. Furthermore, the derived membership function values satisfactorily predicted the judgments independently obtained in task 2.
The results support the claim that the scaled values represented the vague meanings of the terms to the individual subjects in the present experimental context. Methodological implications are discussed, as are substantive issues raised by the data regarding the vague meanings of probability terms.
Assessed membership functions over the [0,1] probability interval for several vague meanings of probability terms (eg. doubtful, probable, likely), using a modified paircomparison procedure in 2 experiments with 20 and 8 graduate business students, respectively. Subjects performed 2 tasks in both experiments: They judged (A) to what degree one probability rather than another was better described by a given probability term and (B) to what degree one term rather than another better described a specified probability. Probabilities were displayed as relative areas on spinners. Task A data were analyzed from the perspective of conjointmeasurement theory, and membership function values were obtained for each term according to various scaling models. Findings show that the conjointmeasurement axioms were well satisfied and goodnessoffit measures for the scaling procedures were high. Individual differences were large but stable, and the derived membership function values satisfactorily predicted the judgments independently obtained in Task B. Results indicated that the scaled values represented the vague meanings of the terms to the individual Ss in the present experimental context.
“An Examination of Two Alternative Techniques to Estimate the Standard Deviation of Job Performance in Dollars”, Reilly & Smither 1985
1985reilly.pdf
: “An examination of two alternative techniques to estimate the standard deviation of job performance in dollars”, (19851101; ; similar):
Two methods for estimating dollar standard deviations were investigated in a simulated environment. 19 graduate students with management experience managed a simulated pharmaceutical firm for 4 quarters. Ss were given information describing the performance of sales representatives on 3 job components. Estimates derived using the method developed by F. L. Schmidt et al 1979 (see record 1981–02231–001) were relatively accurate with objective sales data that could be directly translated to dollars, but resulted in overestimates of means and standard deviations when data were less directly translatable to dollars and involved variable costs. An additional problem with the Schmidt et al procedure involved the presence of outliers, possibly caused by differing interpretations of instructions. The CascioRamos estimate of performance in dollars (CREPID) technique, proposed by W. F. Cascio (1982), yielded smaller dollar standard deviations, but Ss could reliably discriminate among job components in terms of importance and could accurately evaluate employee performance on those components. Problems with the CREPID method included the underlying scale used to obtain performance ratings and a dependency on job component intercorrelations.
“Game Theoretic Analysis of a Bankruptcy Problem from the Talmud”, Aumann & Maschler 1985
1985aumann.pdf
: “Game theoretic analysis of a bankruptcy problem from the Talmud”, Robert J. Aumann, Michael Maschler (19850101)
“Influence Diagrams”, Howard & Matheson 1984
2005howard.pdf
: “Influence Diagrams”, Ronald A. Howard, James E. Matheson (19840101)
“The Citation Bias: Fad and Fashion in the Judgment and Decision Literature”, ChristensenSzalanski & Beach 1984
1984christensenszalanski.pdf
: “The Citation Bias: Fad and Fashion in the Judgment and Decision Literature”, (1984; ; similar):
Examined whether selectivity was used in the citing of evidence in research on the psychology of judgment and decision making and investigated the possible effects that this citation bias might have on the views of readers of the literature.
An analysis of the frequency of citations of goodperformance and poorperformance articles cited in the Social Sciences Citation Index 1972–1981 revealed that poorperformance articles were cited statisticallysignificantly more often than goodperformance articles.
80 members of the Judgment and Decision Making Society, a semiformal professional group, were asked to complete a questionnaire assessing the overall quality of human judgment and decisionmaking abilities on a scale from 0 to 100 and to list 4 examples of documented poor judgment or decisionmaking performance and 4 examples of good performance. Subjects recalled statisticallysignificantly more examples of poor than of good performance. Less experienced Subjects in the field appeared to have a lower opinion of human reasoning ability than did highly experienced Subjects. Also, Subjects recalled 50% more examples of poor performance than of good performance, despite the fact that the variety of poorperformance examples was limited.
It is concluded that there is a citation bias in the judgment and decisionmaking literature, and poorperformance articles are receiving most of the attention from other writers, despite equivalent proportions of each type in the journals.
“Readings on the Principles and Applications of Decision Analysis: Volume 2: Professional Collection”, Howard & Matheson 1983
1983howardreadingsondecisionanalysisv2.pdf
: “Readings on the Principles and Applications of Decision Analysis: Volume 2: Professional Collection”, Ronald H. Howard, James E. Matheson (19830101)
“Readings on the Principles and Applications of Decision Analysis: Volume 1: General Collection”, Howard & Matheson 1983
1983howardreadingsondecisionanalysisv1.pdf
: “Readings on the Principles and Applications of Decision Analysis: Volume 1: General Collection”, Ronald H. Howard, James E. Matheson (19830101)
“MultiBayesian Statistical Decision Theory”, Weerahandi & Zidek 1981
1981weerahandi.pdf
: “MultiBayesian Statistical Decision Theory”, S. Weerahandi, J. V. Zidek (19810101)
“Impact of Valid Selection Procedures on Workforce Productivity”, Schmidt et al 1979
1979schmidt.pdf
: “Impact of valid selection procedures on workforce productivity”, (1979; ; backlinks; similar):
Used decision theoretic equations to estimate the impact of the Programmer Aptitude Test (PAT) on productivity if used to select new computer programmers for 1 yr in the federal government and the national economy. A newly developed technique was used to estimate the standard deviation of the dollar value of employee job performance, which in the past has been the most difficult and expensive item of required information. For the federal government and the US economy separately, results are presented for different selection ratios and for different assumed values for the validity of previously used selection procedures. The impact of the PAT on programmer productivity was substantial for all combinations of assumptions. Results support the conclusion that hundreds of millions of dollars in increased productivity could be realized by increasing the validity of selection decisions in this occupation. Similarities between computer programmers and other occupations are discussed. It is concluded that the impact of valid selection procedures on workforce productivity is considerably greater than most personnel psychologists have believed.
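The decision-theoretic logic can be sketched with the standard Brogden-Cronbach-Gleser-style utility equation; the numbers below are made up for illustration and are not Schmidt et al’s actual estimates for the federal workforce.

```python
# Expected annual dollar gain from a more valid selection procedure:
# gain = (number hired) x (validity gain) x (SD of performance in dollars)
#        x (mean standardized predictor score of those hired).
from scipy.stats import norm

def utility_gain(n_hired, validity_gain, sd_dollars, selection_ratio):
    z_cut = norm.ppf(1 - selection_ratio)
    mean_z_selected = norm.pdf(z_cut) / selection_ratio   # average predictor score of hires
    return n_hired * validity_gain * sd_dollars * mean_z_selected

# e.g. 600 programmers hired, validity improved from 0.20 to 0.76,
# SD of yearly job-performance value ~$10,000, 1 in 5 applicants hired:
print(utility_gain(600, 0.76 - 0.20, 10_000, 0.20))       # ≈ $4.7 million per year
```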
“Science and Statistics”, Box 1976
1976box.pdf
: “Science and Statistics”, (19761201; similar):
Aspects of scientific method are discussed: In particular, its representation as a motivated iteration in which, in succession, practice confronts theory, and theory, practice. Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models, to worry selectively about model inadequacies and to employ mathematics skillfully but appropriately. The development of statistical methods at Rothamsted Experimental Station by Sir Ronald Fisher is used to illustrate these themes.
…Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad… In applying mathematics to subjects such as physics or statistics we make tentative assumptions about the real world which we know are false but which we believe may be useful nonetheless. The physicist knows that particles have mass and yet certain results, approximating what really happens, may be derived from the assumption that they do not. Equally, the statistician knows, for example, that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.
It follows that, although rigorous derivation of logical consequences is of great importance to statistics, such derivations are necessarily encapsulated in the knowledge that premise, and hence consequence, do not describe natural truth. It follows that we cannot know that any statistical technique we develop is useful unless we use it. Major advances in science, and in the science of statistics in particular, usually occur, therefore, as the result of the theory-practice iteration.
“When Values Conflict: Essays on Environmental Analysis, Discourse, and Decision”, Tribe et al 1976
1976tribewhenvaluesconflict.pdf
: “When Values Conflict: Essays on Environmental Analysis, Discourse, and Decision”, (1976; similar):
When Values Conflict: Essays on Environmental Analysis, Discourse, and Decision is a collection of essays, each of which addresses the issue of value conflicts in environmental disputes. The authors discuss the need to integrate such “fragile” values as beauty and naturalness with “hard” values such as economic efficiency in the decision-making process. The collection will be of interest to those who seek to include environmentalist values in public policy debates. This work comprises seven essays.
 In the first chapter, Robert Socolow discusses obstacles to the integration of environmental values into natural resource policy. Technical studies often fail to resolve conflicts, because such conflicts rest on the parties’ very different goals and values. Nonetheless, agreement on the technical analysis may serve as a platform from which to more clearly articulate value differences.
 Irene Thomson draws on the case of the Tocks Island Dam controversy to explore environmental decision-making processes. She describes the impact that the various parties’ interests and values have on their analyses, and argues that the fragmentation of responsibility among institutional actors contributes to the production of inadequate analyses.
 Tribe’s essay suggests that a natural environment has intrinsic value, a value that cannot be reduced to human interests. This recognition may serve as the first step in developing an environmental ethic.
 Charles Frankel explores the idea that nature has rights. He first explores the meaning of nature, by contrast with the supernatural, the technological, and the cultural. He suggests that appeals to nature’s rights serve as an appeal for “institutional protection against being carried away by temporary enthusiasms.”
 In Chapter Five, Harvey Brooks describes three main functions which analysis serves in the environmental decision-making process: it grounds conclusions in neutral, generally accepted principles, it separates means from ends, and it legitimates the final policy decision. If environmental values such as beauty, naturalness, and uniqueness are to be incorporated into systems analysis, they must be incorporated in such a way as to preserve these basic functions of analysis.
 Henry Rowen discusses the use of policy analysis as an aid to making environmental decisions. He describes the characteristics of a good analysis, and argues that good analysis can help clarify the issues, and assist in “the design and invention of objectives and alternatives.” Rowen concludes by suggesting ways of improving the field of policy analysis.
 Robert Dorfman provides the Afterword for this collection. This essay distinguishes between value and price, and explores the import of this distinction for cost-benefit analysis. The author concludes that there can be no “formula for measuring a project’s contribution to humane values.” Environmental decisions will always require the use of human judgement and wisdom.
When Values Conflict: Essays on Environmental Analysis, Discourse, and Decision offers a series of thoughtful essays on the nature and weight of environmentalist values. The essays range from a philosophic investigation of natural value to a more concrete evaluation of the elements of good policy analysis.
“Boundaries of Analysis: An Inquiry into the Tocks Island Dam Controversy”, Feiveson et al 1976
1976feivesonboundariesofanalysis.pdf
: “Boundaries of Analysis: An Inquiry into the Tocks Island Dam Controversy”, (1976; similar):
This is a study of what happens to technical analyses in the real world of politics. The Tocks Island Dam project proposed construction of a dam on the Delaware River at Tocks Island, five miles north of the Delaware Water Gap. Planned and developed in the early 1960s, it was initially considered a model of water resource planning. But it soon became the target of an extended controversy involving a tangle of interconnected concerns—floods and droughts, energy, growth, congestion, recreation, and the uprooting of people and communities. Numerous participants—economists, scientists, planners, technologists, bureaucrats and environmentalists—measured, modeled and studied the Tocks Island proposal. The results were a weighty legacy of technical and economic analyses—and a decade of political stalemate regarding the fate of the dam.
These analyses, to a substantial degree, masked the value conflicts at stake in the controversy; they concealed the real political and human issues of who would win and who would lose if the Tocks Island project were undertaken. And, the studies were infected by rigid categories of thought and divisions of bureaucratic responsibilities.
This collection of original essays tells the story of the Tocks Island controversy, with a fresh perspective on the environmental issues at stake. Its contributors consider the political decisionmaking process throughout the controversy and show how economic and technological analyses affected those decisions. Viewed as a whole, the essays show that systematic analysis and an explicit concern for human values need not be mutually exclusive pursuits.
“Portfolio Choice and the Kelly Criterion”, Thorp 1975
1975thorp.pdf
: “Portfolio Choice and the Kelly Criterion”, (1975; similar):
This chapter focuses on Kelly’s capital growth criterion for long-term portfolio growth.
The Kelly (Bernoulli-Latané or capital growth) criterion is to maximize the expected value E log X of the logarithm of the random variable X, representing wealth. The chapter presents a treatment of the Kelly criterion and Breiman’s results.
Breiman’s results can be extended to cover many if not most of the more complicated situations which arise in real-world portfolios. Specifically, the number and distribution of investments can vary with the time period, the random variables need not be finite or even discrete, and a certain amount of dependence can be introduced between the investment universes for different time periods. The chapter also discusses a few relationships between the max expected log approach and Markowitz’s mean-variance approach.
It highlights a few misconceptions concerning the Kelly criterion, the most notable being the fact that decisions that maximize the expected log of wealth do not necessarily maximize expected utility of terminal wealth for arbitrarily large time horizons.
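In the simplest case of a repeated binary bet at b-to-1 odds with win probability p, maximizing E log X numerically recovers the familiar closed-form Kelly fraction f* = p - (1-p)/b; a toy sketch (not Thorp’s portfolio treatment):

```python
import numpy as np

def expected_log_growth(f, p, b):
    """E[log wealth growth] per bet when staking a fraction f of wealth on a
    bet paying b-to-1 with win probability p (simple binary Kelly setting)."""
    return p * np.log(1 + f * b) + (1 - p) * np.log(1 - f)

p, b = 0.55, 1.0                       # invented win probability, even-money odds
fs = np.linspace(0, 0.99, 1_000)
f_star = fs[np.argmax([expected_log_growth(f, p, b) for f in fs])]
print(f_star, p - (1 - p) / b)         # numerical optimum ~ closed form p - q/b = 0.10
```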
“CrossModality Matching of Money Against Other Continua”, Galanter & Pliner 1974
1974galanter.pdf
: “CrossModality Matching of Money Against Other Continua”, (1974; similar):
Cross-modality matching of hypothetical increments of money against loudness recovers the previously proposed exponent of the utility function for money within a few percent. Similar cross-modality matching experiments for decrements give a disutility exponent of 0.59, larger than the utility exponent for increments. This disutility exponent was checked by an additional cross-modality matching experiment against the disutility of drinking various concentrations of a bitter solution. The parameter estimated in this fashion was 0.63.
Three experiments were conducted in which monetary increments and decrements were matched to either the loudness of a tone or the bitterness of various concentrations of sucrose octaacetate. An additional experiment involving ratio estimates of monetary loss is also reported. Results confirm that the utility function for both monetary increments and decrements is a power function with exponents less than one. The data further suggest that the exponent of the disutility function is larger than that of the utility function, i.e., the rate of change of ‘unhappiness’ caused by monetary losses is greater than the comparable rate of ‘happiness’ produced by monetary gains.
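The reported exponents describe power-function (dis)utility curves; a tiny illustration (the 0.59 loss exponent is the paper’s, but the gain exponent below is only a placeholder, since the abstract does not state its value):

```python
def money_utility(x, gain_exp=0.45, loss_exp=0.59):
    """Power-law utility for monetary gains and losses.
    loss_exp = 0.59 follows the abstract; gain_exp = 0.45 is a placeholder."""
    return x ** gain_exp if x >= 0 else -((-x) ** loss_exp)

# For amounts above $1, the larger loss exponent means losses loom larger:
print(money_utility(100), money_utility(-100))   # ~7.9 vs ~-15.1
```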
“The General Impossibility of Normative Accounting Standards”, Demski 1973
1973demski.pdf
: “The General Impossibility of Normative Accounting Standards”, Joel S. Demski (19731001; backlinks)
“The Theory of Social Choice”, Fishburn 1973
1973fishburntheoryofsocialchoice.pdf
: “The Theory of Social Choice”, Peter C. Fishburn (19730101)
“What Makes for a Beautiful Problem in Science?”, Samuelson 1970
1970samuelson.pdf
: “What Makes for a Beautiful Problem in Science?”, Paul A. Samuelson (19700101; ; backlinks)
“General Proof That Diversification Pays”, Samuelson 1967
1967samuelson.pdf
: “General Proof that Diversification Pays”, Paul Samuelson (19670101; backlinks)
“Optimal Dairy Cow Replacement Policies”, Giaever 1966
1966giaever.pdf
: “Optimal Dairy Cow Replacement Policies”, Harald Birger Giaever (19660101)
“Measuring Utility by a Singleresponse Sequential Method”, Becker et al 1964
1964becker.pdf
: “Measuring utility by a singleresponse sequential method”, (1964; ; backlinks; similar):
[Becker-DeGroot-Marschak mechanism] A person deciding on a career, a wife, or a place to live bases his choice on 2 factors: (1) How much do I like each of the available alternatives? and (2) What are the chances for a successful outcome of each alternative? These 2 factors comprise the utility of each outcome for the person making the choice. This notion of utility is fundamental to most current theories of decision behavior.
According to the expected utility hypothesis, if we could know the utility function of a person, we could predict his choice from among any set of actions or objects. But the utility function of a given subject is almost impossible to measure directly.
To circumvent this difficulty, stochastic models of choice behavior have been formulated which do not predict the subject’s choices but make statements about the probabilities that the subject will choose a given action. This paper reports an experiment to measure utility and to test one stochastic model of choice behavior.
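The elicitation procedure that became known as the Becker-DeGroot-Marschak (BDM) mechanism can be sketched in a few lines; this is an illustrative reconstruction of the incentive-compatible idea, not the paper’s exact experimental protocol:

```python
import random

def bdm_round(stated_price, play_lottery, price_range=(0.0, 10.0)):
    """One BDM round: the subject names a selling price for a lottery; a buy
    offer is drawn uniformly at random; if the offer meets the stated price,
    the subject is paid the *offer* (not the stated price), otherwise the
    lottery is played out. Truthfully stating one's certainty equivalent is
    therefore the optimal report. Illustrative sketch only."""
    offer = random.uniform(*price_range)
    if offer >= stated_price:
        return offer
    return play_lottery()

# Hypothetical lottery: $10 with probability 0.5, else $0.
payoff = bdm_round(4.50, lambda: 10.0 if random.random() < 0.5 else 0.0)
```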
“A Model for Selecting One of Two Medical Treatments”, Colton 1963
1963colton.pdf
: “A Model for Selecting One of Two Medical Treatments”, (1963; similar):
A simple cost function approach is proposed for designing an optimal clinical trial when a total of n patients with a disease are to be treated with one of two medical treatments.
The cost function is constructed with but one cost, the consequences of treating a patient with the superior or inferior of the two treatments. Fixed sample size and sequential trials are considered. Minimax, maximin, and Bayesian approaches are used for determining the optimal size of a fixed sample trial and the optimal position of the boundaries of a sequential trial.
Comparisons of the different approaches are made as well as comparisons of the results for the fixed and sequential plans.
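Colton derives closed-form optima; the underlying trade-off (patients ‘spent’ inside the trial versus patients misallocated afterwards) can be illustrated with a crude Monte Carlo stand-in, in which every number is invented:

```python
import numpy as np

def expected_inferior_count(n_per_arm, n_total, true_delta=0.5, sigma=1.0,
                            n_sims=20_000, seed=0):
    """Crude Monte Carlo version of the fixed-sample trade-off (Colton's paper
    works with closed forms; this is only an illustration). A trial assigns
    n_per_arm patients to each of two treatments; the remaining patients all
    receive whichever arm looked better. Returns the expected number of
    patients who end up on the truly inferior treatment."""
    rng = np.random.default_rng(seed)
    better = rng.normal(true_delta, sigma, (n_sims, n_per_arm)).mean(axis=1)
    worse  = rng.normal(0.0,        sigma, (n_sims, n_per_arm)).mean(axis=1)
    p_wrong = (worse > better).mean()          # trial picks the inferior arm
    remainder = n_total - 2 * n_per_arm
    return n_per_arm + p_wrong * remainder

# Larger trials waste more patients inside the trial but misallocate fewer of
# the rest; the optimum n balances the two.
costs = {n: expected_inferior_count(n, n_total=1_000) for n in (5, 10, 25, 50, 100)}
```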
“Studies of War, Nuclear and Conventional”, Blackett 1962
1962blackettstudiesofwarnuclearandconventional.pdf
: “Studies of War, Nuclear and Conventional”, Patrick Maynard Stuart Blackett (19620101)
“Applied Statistical Decision Theory”, Raiffa & Schlaifer 1961
1961raiffaappliedstatisticaldecisiontheory.pdf
: “Applied Statistical Decision Theory”, Howard Raiffa, Robert Schlaifer (19610101; backlinks)
“Gradient Theory of Optimal Flight Paths”, Kelley 1960
1960kelley.pdf
: “Gradient Theory of Optimal Flight Paths”, (19601001; ; backlinks; similar):
An analytical development of flight performance optimization according to the method of gradients or ‘method of steepest descent’ is presented. Construction of a minimizing sequence of flight paths by a stepwise process of descent along the local gradient direction is described as a computational scheme. Numerical application of the technique is illustrated in a simple example of orbital transfer via solar sail propulsion. Successive approximations to minimum-time planar flight paths from Earth’s orbit to the orbit of Mars are presented for cases corresponding to free and fixed boundary conditions on terminal velocity components.
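The ‘method of gradients’ is steepest descent applied to a discretized performance functional; a toy version on an ordinary function (rather than a flight path) shows the minimizing-sequence idea:

```python
import numpy as np

def steepest_descent(grad, x0, step=0.1, iters=200):
    """Build a minimizing sequence by repeatedly stepping along the local
    negative-gradient direction (toy illustration, not Kelley's trajectory
    formulation)."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x -= step * grad(x)
    return x

# Toy quadratic stand-in for the performance index, minimized at (3, -1):
print(steepest_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), [0.0, 0.0]))
```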
“Testing Statistical Hypotheses (First Edition)”, Lehmann 1959
1959lehmanntestingstatisticalhypotheses.pdf
: “Testing Statistical Hypotheses (First Edition)”, E. L. Lehmann (19590101; ; backlinks)
“Probability and Statistics for Business Decisions: An Introduction to Managerial Economics Under Uncertainty”, Schlaifer 1959
1959schlaiferprobabilitystatisticsbusinessdecisions.pdf
: “Probability and Statistics for Business Decisions: An Introduction to Managerial Economics Under Uncertainty”, (1959; backlinks; similar):
This book is a non-mathematical introduction to the logical analysis of practical business problems in which a decision must be reached under uncertainty. The analysis which it recommends is based on the modern theory of utility and what has come to be known as the “personal” definition of probability; the author believes, in other words, that when the consequences of various possible courses of action depend on some unpredictable event, the practical way of choosing the “best” act is to assign values to consequences and probabilities to events and then to select the act with the highest expected value. In the author’s experience, thoughtful businessmen intuitively apply exactly this kind of analysis in problems which are simple enough to allow of purely intuitive analysis; and he believes that they will readily accept its formalization once the essential logic of this formalization is presented in a way which can be comprehended by an intelligent layman. Excellent books on the pure mathematical theory of decision under uncertainty already exist; the present text is an endeavor to show how formal analysis of practical decision problems can be made to pay its way.
From the point of view taken in this book, there is no real difference between a “statistical” decision problem in which a part of the available evidence happens to come from a ‘sample’ and a problem in which all the evidence is of a less formal nature. Both kinds of problems are analyzed by use of the same basic principles; and one of the resulting advantages is that it becomes possible to avoid having to assert that nothing useful can be said about a sample which contains an unknown amount of bias while at the same time having to admit that in most practical situations it is totally impossible to draw a sample which does not contain an unknown amount of bias. In the same way and for the same reason there is no real difference between a decision problem in which the long-run average demand for some commodity is known with certainty and one in which it is not; and not the least of the advantages which result from recognizing this fact is that it becomes possible to analyze a problem of inventory control without having to pretend that a finite amount of experience can ever give anyone perfect knowledge of long-run average demand. The author is quite ready to admit that in some situations it may be difficult for the businessman to assess the numerical probabilities and utilities which are required for the kind of analysis recommended in this book, but he is confident that the businessman who really tries to make a reasoned analysis of a difficult decision problem will find it far easier to do this than to make a direct determination of, say, the correct risk premium to add to the pure cost of capital or of the correct level at which to conduct a test of statistical significance.
In sum, the author believes that the modern theories of utility and personal probability have at last made it possible to develop a really complete theory to guide the making of managerial decisions—a theory into which the traditional disciplines of statistics and economics under certainty and the collection of miscellaneous techniques taught under the name of operations research will all enter as constituent parts. He hopes, therefore, that the present book will be of interest and value not only to students and practitioners of inventory control, quality control, marketing research, and other specific business functions but also to students of business and businessmen who are interested in the basic principles of managerial economics and to students of economics who are interested in the theory of the firm. Even the teacher of a course in mathematical decision theory who wishes to include applications as well as complete-class and existence theory may find the book useful as a source of examples of the practical decision problems which do arise in the real world.
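In the simplest case the recommended analysis reduces to assigning probabilities to events and dollar values to consequences, then picking the act with the highest expected value; a few lines of Python with invented numbers:

```python
def best_act(acts):
    """Return the act with the highest expected value, given a dict mapping
    each act to a list of (probability, dollar value) pairs. Illustrative only."""
    expected_value = lambda outcomes: sum(p * v for p, v in outcomes)
    return max(acts, key=lambda name: expected_value(acts[name]))

# Hypothetical decision: probabilities and consequences are made up.
acts = {"launch product": [(0.6, 500_000), (0.4, -200_000)],   # EV = $220,000
        "wait a year":    [(1.0, 50_000)]}                     # EV = $50,000
print(best_act(acts))   # -> "launch product"
```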
“An Optimum Character Recognition System Using Decision Functions”, Chow 1957
1957chow.pdf
: “An Optimum Character Recognition System Using Decision Functions”, (19571201; ; backlinks; similar):
The character recognition problem, usually resulting from characters being corrupted by printing deterioration and/or inherent noise of the devices, is considered from the viewpoint of statistical decision theory.
The optimization consists of minimizing the expected risk for a weight function which is preassigned to measure the consequences of system decisions. As an alternative, minimization of the error rate for a given rejection rate is used as the criterion. The optimum recognition system is thus obtained.
The optimum system consists of a conditional-probability densities computer; character channels, one for each character; a rejection channel; and a comparison network. Its precise structure and ultimate performance depend essentially upon the signals and noise structure.
Explicit examples for an additive Gaussian noise and a “cosine” noise are presented. Finally, an errorfree recognition system and a possible criterion to measure the character style and deterioration are presented.
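The rejection channel amounts to withholding a decision whenever even the most probable character is insufficiently probable (the rule now usually associated with Chow’s name); a schematic version, assuming the conditional-probability computer has already produced class posteriors:

```python
import numpy as np

def classify_with_reject(posteriors, reject_threshold=0.8):
    """Pick the character class with the highest posterior probability, but
    reject (defer) when even the best class falls below the threshold; the
    threshold trades error rate against rejection rate. Sketch only."""
    posteriors = np.asarray(posteriors)
    best = int(np.argmax(posteriors))
    if posteriors[best] < reject_threshold:
        return None                     # rejection channel
    return best

print(classify_with_reject([0.05, 0.90, 0.05]))   # -> 1 (accepted)
print(classify_with_reject([0.40, 0.35, 0.25]))   # -> None (rejected)
```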
“Unsolved Problems of Experimental Statistics”, Tukey 1954
1954tukey.pdf
: “Unsolved Problems of Experimental Statistics”, John W. Tukey (19540101; ; backlinks)
“NonCooperative Games”, Nash 1951
1951nash.pdf
: “NonCooperative Games”, (19510901; ; backlinks; similar):
…Our game theory, in contradistinction, is based on the absence of coalitions in that it is assumed that each participant acts independently, without collaboration or communication with any of the others.
The notion of an equilibrium point is the basic ingredient in our theory. This notion yields a generalization of the concept of the solution of a 2-person zero-sum game. It turns out that the set of equilibrium points of a 2-person zero-sum game is simply the set of all pairs of opposing ‘good strategies’.
In the immediately following sections we shall define equilibrium points and prove that a finite non-cooperative game always has at least one equilibrium point. We shall also introduce the notions of solvability and strong solvability of a non-cooperative game and prove a theorem on the geometrical structure of the set of equilibrium points of a solvable game.
As an example of the application of our theory we include a solution of a simplified 3-person poker game.
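For a finite game given in normal form, equilibrium points can be checked directly as mutual best responses; a small sketch for pure strategies only (Nash’s theorem guarantees existence only once mixed strategies are allowed):

```python
import numpy as np

def pure_nash_equilibria(A, B):
    """Enumerate pure-strategy equilibrium points of a two-player game, where
    A[i, j] and B[i, j] are the row and column players' payoffs for the action
    pair (i, j). Illustrative sketch; may return an empty list."""
    A, B = np.asarray(A), np.asarray(B)
    return [(i, j)
            for i in range(A.shape[0])
            for j in range(A.shape[1])
            if A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()]

# Prisoner's dilemma: mutual defection (1, 1) is the unique equilibrium point.
A = [[3, 0], [5, 1]]   # row player's payoffs
B = [[3, 5], [0, 1]]   # column player's payoffs
print(pure_nash_equilibria(A, B))   # -> [(1, 1)]
```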
“The Economic Life of Industrial Equipment”, Preinreich 1940
1940preinreich.pdf
: “The Economic Life of Industrial Equipment”, Gabriel A. D. Preinreich (19400101)
“"Student" As Statistician”, Pearson 1939
1939pearson.pdf
: “"Student" as Statistician”, (19390100; ; backlinks; similar):
[Egon Pearson describes Student, or Gosset, as a statistician: Student corresponded widely with young statisticians and mathematicians, encouraging them, and having an outsized influence not reflected in his publications. Student’s preferred statistical tools were remarkably simple, focused on correlations and standard deviations, but wielded effectively in the analysis and efficient design of experiments (particularly agricultural experiments), and he was an early decision-theorist, focused on practical problems connected to his Guinness Brewery job—a detachment from academia which partially explains why he didn’t publish methods or results immediately or often. The need to handle the small samples of brewery work led to his work on small-sample approximations rather than, like Pearson et al in the Galton biometric tradition, relying on collecting large datasets and using asymptotic methods, and Student carried out one of the first Monte Carlo simulations.]
“Presidential Address to the First Indian Statistical Congress”, Fisher 1938
1938fisher.pdf
: “Presidential address to the first Indian statistical congress”, R. A. Fisher (19380101; backlinks)
“The Lanarkshire Milk Experiment”, Elderton 1933
1933elderton.pdf
: “The Lanarkshire Milk Experiment”, Ethel M. Elderton (19330101; backlinks)
“Pasteurised and Raw Milk”, Fisher & Bartlett 1931
1931fisher.pdf
: “Pasteurised and Raw Milk”, R. A. Fisher, S. Bartlett (19310101; backlinks)
“On Testing Varieties of Cereals”, Gosset 1923
1923student.pdf
: “On Testing Varieties of Cereals”, William Sealy Gosset (19230101; ; backlinks)
ThueMorse sequence
Thompson sampling
Optimal stopping
Multiobjective optimization
Monte Carlo tree search
Miscellaneous

2018cohen.pdf
(20180101) 
2017chupeau.pdf
(20170101; ; backlinks) 
2011ioannidis.pdf
(20110101; ; backlinks) 
1997mcclellandoptimalexperimentdesign.pdf
(19970101; backlinks) 
1939taylor.pdf
(19390101; ; backlinks) 
https://www.sumsar.net/blog/2015/01/probablepointsandcredibleintervalsparttwo/
( ) 
https://www.lesswrong.com/posts/rEZqP7K4MG6waC2zf/optimizingcropplantingwithmixedintegerlinear

https://www.chrisstucchio.com/blog/2014/equal_weights.html
( ; backlinks) 
https://proceedings.neurips.cc/paper/2010/file/edfbe1afcf9246bb0d40eb4d8027d90fPaper.pdf
( ) 
https://hope.econ.duke.edu/sites/hope.econ.duke.edu/files/Banzhaf.pdf
( ) 
https://constructionphysics.substack.com/p/thescienceofproduction
( ) 
https://80000hours.org/podcast/episodes/brianchristianalgorithmstoliveby/
( ) 
Localoptima
( ; backlinks) 
1995tengs.pdf
( ; backlinks) 
1995prattintroductionstatisticaldecisiontheory.epub
(backlinks) 
1986lehmanntestingstatisticalhypotheses.pdf
( ; backlinks) 
1984frey.pdf
( ; backlinks) 
1981frey.pdf
( ; backlinks) 
1968cohen.pdf
( ; backlinks) 
1965black.pdf
( ; backlinks) 
1957savage.pdf
( ; backlinks) 
1954hodges.pdf
( ; backlinks) 
MuggingDP
( ; backlinks; similar) 
MCTSAI
( ) 
Embryoediting
( ; backlinks; similar)