“Early Canid Domestication: The Farm-Fox Experiment: Foxes bred for tamability in a 40-year experiment exhibit remarkable transformations that suggest an interplay between behavioral genetics and development”, (1999-03):
[Popular review of the domesticated red fox by the lead researcher. Trut gives the history of Belyaev’s founding of the experiment in 1959, and how the results gradually proved his theory about ‘domestication syndrome’: that domestication produces multiple simultaneous effects like floppy ears despite the foxes being bred solely for being willing to approach a strange human, suggesting an underlying common genetic mechanism]
Forty years into our unique lifelong experiment, we believe that Dmitry Belyaev would be pleased with its progress. By intense selective breeding, we have compressed into a few decades an ancient process that originally unfolded over thousands of years. Before our eyes, “the Beast” has turned into “Beauty”, as the aggressive behavior of our herd’s wild progenitors entirely disappeared. We have watched new morphological traits emerge, a process previously known only from archaeological evidence. Now we know that these changes can burst into a population early in domestication, triggered by the stresses of captivity, and that many of them result from changes in the timing of developmental processes. In some cases the changes in timing, such as earlier sexual maturity or retarded growth of somatic characters, resemble pedomorphosis. Some long-standing puzzles remain. We believed at the start that foxes could be made to reproduce twice a year and all year round, like dogs. We would like to understand why this has turned out not to be quite so. We are also curious about how the vocal repertoire of foxes changes under domestication. Some of the calls of our adult foxes resemble those of dogs and, like those of dogs, appear to be holdovers from puppyhood, but only further study will reveal the details. The biggest unanswered question is just how much further our selective-breeding experiment can go. The domestic fox is not a domestic dog, but we believe that it has the genetic potential to become more and more doglike.
Understanding the nature and extent of horizontal pleiotropy, where one genetic variant has independent effects on multiple observable traits, is vitally important for our understanding of the genetic architecture of human phenotypes, as well as the design of genome-wide association studies (GWASs) and Mendelian randomization (MR) studies. Many recent studies have pointed to the existence of horizontal pleiotropy among human phenotypes, but the exact extent remains unknown, largely due to difficulty in disentangling the inherently correlated nature of observable traits. Here, we present a statistical framework to isolate and quantify horizontal pleiotropy in human genetic variation using a two-component pleiotropy score computed from summary statistic data derived from published GWASs. This score uses a statistical whitening procedure to remove correlations between observable traits and normalize effect sizes across all traits, and is able to detect horizontal pleiotropy under a range of different models in our simulations. When applied to real human phenotype data using association statistics for 1,564 traits measured in 337,119 individuals from the UK Biobank, our score detects a statistically-significant excess of horizontal pleiotropy. This signal of horizontal pleiotropy is pervasive throughout the human genome and across a wide range of phenotypes and biological functions, but is especially prominent in regions of high linkage disequilibrium and among phenotypes known to be highly polygenic and heterogeneous. Using our pleiotropy score, we identify thousands of loci with extreme levels of horizontal pleiotropy, a majority of which have never been previously reported in any published GWAS. This highlights an under-recognized class of genetic variation that has weak effects on many distinct phenotypes but no specific marked effect on any one phenotype. We show that a large fraction of these loci replicate using independent datasets of summary statistics.
Our results highlight the central role horizontal pleiotropy plays in the genetic architecture of human phenotypes, and the importance of modeling horizontal pleiotropy in genomic medicine.
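The whitening step at the heart of the score is a standard transformation: rotate and rescale the trait matrix so the transformed traits are uncorrelated with unit variance. A minimal NumPy sketch of ZCA/Mahalanobis whitening on synthetic data (the paper's actual procedure, applied to GWAS summary statistics rather than raw traits, differs in detail):

```python
import numpy as np

def whiten(X):
    """Decorrelate the columns of X (samples x traits) via ZCA/Mahalanobis whitening.

    A sketch of the generic technique; computes C^(-1/2) from the sample
    covariance by eigendecomposition and applies it to the centered data."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                    # cov is symmetric PSD
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T    # inverse square root of cov
    return Xc @ W

rng = np.random.default_rng(0)
# Two strongly correlated synthetic "traits":
a = rng.normal(size=1000)
X = np.column_stack([a, 0.9 * a + 0.1 * rng.normal(size=1000)])
Z = whiten(X)
print(np.round(np.cov(Z, rowvar=False), 6))  # ≈ identity: traits decorrelated
```

After whitening, any variant whose effects remain spread across many transformed traits is a candidate for genuine horizontal (rather than mediated) pleiotropy.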
Accurate estimation of genetic correlation requires large sample sizes and access to genetically informative data, which are not always available. Accordingly, phenotypic correlations are often assumed to reflect genotypic correlations in evolutionary biology. Cheverud’s conjecture asserts that the use of phenotypic correlations as proxies for genetic correlations is appropriate. Empirical evidence for the conjecture has been found across plant and animal species, with results suggesting that there is indeed a robust relationship between the two. Here, we investigate the conjecture in human populations, an analysis made possible by recent developments in the availability of human genomic data and computing resources. A sample of 108,035 British European individuals from the UK Biobank was split equally into discovery and replication datasets. 17 traits were selected based on sample size, distribution, and heritability. Genetic correlations were calculated using linkage disequilibrium score regression applied to the genome-wide association summary statistics of pairs of traits, and compared within and across datasets. Strong and statistically-significant correlations were found for the between-dataset comparison, suggesting that the genetic correlations from one independent sample were able to predict the phenotypic correlations from another independent sample within the same population. Designating the selected traits as morphological or non-morphological indicated little difference in correlation. The results of this study support the existence of a relationship between genetic and phenotypic correlations in humans. This finding is of specific interest in anthropological studies, which use measured phenotypic correlations to make inferences about the genetics of ancient human populations.
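The between-dataset test reduces to correlating, across trait pairs, the genetic correlations estimated in one sample with the phenotypic correlations observed in another. A toy sketch with fabricated values (not taken from the paper) shows the shape of the comparison:

```python
import numpy as np

# Hypothetical rg values for 6 trait pairs, estimated in a discovery sample,
# and phenotypic correlations rp for the same pairs from a replication sample.
# All numbers are invented for illustration.
rg_discovery   = np.array([0.52, -0.31, 0.18, 0.65, -0.12, 0.40])
rp_replication = np.array([0.48, -0.25, 0.20, 0.58, -0.10, 0.35])

# Cheverud-style check: how well does rg in one sample predict rp in another?
r = np.corrcoef(rg_discovery, rp_replication)[0, 1]
print(f"correlation between rg and rp across trait pairs: {r:.2f}")
```

A value near 1 across many trait pairs is the pattern the paper reports as support for the conjecture.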
General cognitive function is a prominent and relatively stable human trait that is associated with many important life outcomes. We combine cognitive and genetic data from the CHARGE and COGENT consortia, and UK Biobank (total n = 300,486; age 16–102) and find 148 genome-wide statistically-significant independent loci (p < 5 × 10−8) associated with general cognitive function. Within the novel genetic loci are variants associated with neurodegenerative and neurodevelopmental disorders, physical and psychiatric illnesses, and brain structure. Gene-based analyses find 709 genes associated with general cognitive function. Expression levels across the cortex are associated with general cognitive function. Using polygenic scores, up to 4.3% of variance in general cognitive function is predicted in independent samples. We detect genetic overlap between general cognitive function, reaction time, and many health variables including eyesight, hypertension, and longevity. In conclusion we identify novel genetic loci and pathways contributing to the heritability of general cognitive function.
General cognitive function is a prominent human trait associated with many important life outcomes1,2, including longevity3. The substantial heritability of general cognitive function is known to be polygenic, but it has had little explication in terms of the contributing genetic variants4,5,6. Here, we combined cognitive and genetic data from the CHARGE and COGENT consortia, and UK Biobank (total n = 280,360; age range = 16 to 102). We found 9,714 genome-wide SNPs (p < 5 × 10−8) in 99 independent loci. Most showed clear evidence of functional importance. Among many novel genes associated with general cognitive function were SGCZ, ATXN1, MAPT, AUTS2, and P2RY6. Within the novel genetic loci were variants associated with neurodegenerative disorders, neurodevelopmental disorders, physical and psychiatric illnesses, brain structure, and BMI. Gene-based analyses found 536 genes statistically-significantly associated with general cognitive function; many were highly expressed in the brain, and associated with neurogenesis and dendrite gene sets. Genetic association results predicted up to 4% of general cognitive function variance in independent samples. There was significant genetic overlap between general cognitive function and information processing speed, as well as many health variables including longevity.
A genome-wide association study (GWAS) of educational attainment was conducted in a discovery sample of 101,069 individuals and a replication sample of 25,490. Three independent single-nucleotide polymorphisms (SNPs) are genome-wide statistically-significant (rs9320913, rs11584700, rs4851266), and all three replicate. Estimated effect sizes are small (coefficient of determination R² ≈ 0.02%), approximately 1 month of schooling per allele. A linear polygenic score from all measured SNPs accounts for ≈2% of the variance in both educational attainment and cognitive function. Genes in the region of the loci have previously been associated with health, cognitive, and central nervous system phenotypes, and bioinformatics analyses suggest the involvement of the anterior caudate nucleus. These findings provide promising candidate SNPs for follow-up work, and our estimates can anchor power analyses in social-science genetics.
There are thousands of rare human disorders caused by a single deleterious, protein-coding genetic variant 1. However, patients with the same genetic defect can have different clinical presentation 2–4, and some individuals carrying known disease-causing variants can appear unaffected 5. What explains these differences? Here, we show in a cohort of 6,987 children with heterogeneous severe neurodevelopmental disorders expected to be almost entirely monogenic that 7.7% of variance in risk is attributable to inherited common genetic variation. We replicated this genome-wide common variant burden by showing that it is over-transmitted from parents to children in an independent sample of 728 trios from the same cohort. Our common variant signal is significantly positively correlated with genetic predisposition to fewer years of schooling, decreased intelligence, and risk of schizophrenia. We found that common variant risk was not significantly different between individuals with and without a known protein-coding diagnostic variant, suggesting that common variant risk is not confined to patients without a monogenic diagnosis. In addition, previously published common variant scores for autism, height, birth weight, and intracranial volume were all correlated with those traits within our cohort, suggesting that phenotypic expression in individuals with monogenic disorders is affected by the same variants as the general population. Our results demonstrate that common genetic variation affects both overall risk and clinical presentation in disorders typically considered to be monogenic.
2018-corbett.pdf: “The transition to modernity and chronic disease: mismatch and natural selection”, Stephen Corbett, Alexandre Courtiol, Virpi Lummaa, Jacob Moorad, Stephen Stearns
Prior evolutionary theory provided reason to suspect that measures of development and reproduction would be correlated with antisocial behaviors in human and non-human species. Behavioral genetics has revealed that most quantitative traits are heritable, suggesting that these phenotypic correlations may share genetic etiologies. We used data to estimate the genetic correlations between various measures of reproductive development (n = 52,776–318,863) and antisocial behavior (n = 31,968). Our genetic correlation analyses demonstrate that alleles associated with higher reproductive output (number of children ever born, rg = 0.50, p = 0.0065) were positively correlated with alleles associated with antisocial behavior, whereas alleles associated with more delayed reproductive onset (age of first birth, rg = −0.64, p = 0.0008) were negatively associated with alleles linked to antisocial behavior. Ultimately, these findings coalesce with evolutionary theories suggesting that increased antisocial behaviors may partly represent a faster life history approach, which may be significantly calibrated by genes.
The pathophysiology of antisocial personality disorder (ASPD) remains unclear. Although the most consistent biological finding is reduced grey matter volume in the frontal cortex, about 50% of the total liability to developing ASPD has been attributed to genetic factors. The contributing genes remain largely unknown. Therefore, we sought to study the genetic background of ASPD. We conducted a genome-wide association study (GWAS) and a replication analysis of Finnish criminal offenders fulfilling DSM-IV criteria for ASPD (n = 370, n = 5850 for controls, GWAS; n = 173, n = 3766 for controls, replication sample). The GWAS resulted in suggestive associations of two clusters of single-nucleotide polymorphisms at 6p21.2 and at 6p21.32 at the human leukocyte antigen (HLA) region. Imputation of HLA alleles revealed an independent association with DRB1*01:01 (odds ratio (OR) = 2.19 (1.53–3.14), p = 1.9 × 10−5). Two polymorphisms at the 6p21.2 LINC00951–LRFN2 gene region were replicated in a separate data set, and rs4714329 reached genome-wide statistical-significance (OR = 1.59 (1.37–1.85), p = 1.6 × 10−9) in the meta-analysis. The risk allele was also associated with antisocial features in the general population conditioned for severe problems in childhood family (β = 0.68, p = 0.012). Functional analysis in brain tissue in the open-access GTEx and Braineac databases revealed eQTL associations of rs4714329 with LINC00951 and LRFN2 in cerebellum. In humans, LINC00951 and LRFN2 are both expressed in the brain, especially in the frontal cortex, which is intriguing considering the role of the frontal cortex in behavior and the neuroanatomical findings of reduced grey matter volume in ASPD. To our knowledge, this is the first study showing genome-wide statistically-significant and replicable findings on genetic variants associated with any personality disorder.
“AI and Compute”, (2018-05-26):
[Further reading: “Parameter Counts In Machine Learning” (2021-06-19), Akronomicon leaderboard.] We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000× (a 2-year doubling period would yield only a 7× increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.
Three factors drive the advance of AI: algorithmic innovation, data (which can be either supervised data or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, providing an opportunity to measure one input to AI progress. Of course, the use of massive compute sometimes just exposes the shortcomings of our current algorithms. But at least within many current domains, more compute seems to lead predictably to better performance, and is often complementary to algorithmic advances…The trend represents an increase by roughly a factor of 10 each year. It’s been partly driven by custom hardware that allows more operations to be performed per second for a given price (GPUs and TPUs), but it’s been primarily propelled by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.
Eras: Looking at the graph we can roughly see 4 distinct eras:
- Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
- 2012–2014: Infrastructure to train on many GPUs was uncommon, so most results used 1–8 GPUs rated at 1–2 TFLOPS for a total of 0.001–0.1 pfs-days.
- 2014–2016: Large-scale results used 10–100 GPUs rated at 5–10 TFLOPS, resulting in 0.1–10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
- 2016–2017: Approaches that allow greater algorithmic parallelism such as huge batch sizes, architecture search, and expert iteration, along with specialized hardware such as TPUs and faster interconnects, have greatly increased these limits, at least for some applications.
AlphaGoZero/AlphaZero is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible, and may already be happening in a production context.
…Addendum: Compute used in older headline results (2019-11-07)
We’ve updated our analysis with data that span 1959 to 2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute-usage: (a) a first era, from 1959 to 2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train models over the last half-century.
Starting from the perceptron in 1959, we see a ~2-year doubling time for the compute used in these historical results—with a 3.4-month doubling time starting in ~2012. It’s difficult to draw a strong conclusion from this data alone, but we believe that this trend is probably due to a combination of the limits on the amount of compute that was possible to use for those results and the willingness to spend on scaling up experiments. [For one vivid account of the history of computing in AI in this period, see the “False Start” section in Hans Moravec’s 1998 article.]
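The headline figures can be cross-checked with doubling-time arithmetic: 300,000× growth corresponds to log₂(300,000) ≈ 18 doublings, which at 3.4 months per doubling takes roughly 5 years, while a 2-year doubling time over the same span yields only a single-digit multiple. A quick check:

```python
import math

DOUBLING_MONTHS = 3.4   # the post's estimated doubling time for the largest runs
GROWTH = 300_000        # the post's quoted overall growth since 2012

doublings = math.log2(GROWTH)             # doublings implied by 300,000x
years = doublings * DOUBLING_MONTHS / 12  # elapsed time at the 3.4-month rate
moore = 2 ** (years / 2)                  # what a Moore's-law 2-year doubling gives

print(f"{GROWTH:,}x = {doublings:.1f} doublings = {years:.1f} years at 3.4-month doubling")
print(f"over the same span, a 2-year doubling time yields only ~{moore:.0f}x")
```

The small mismatch with the post's "7×" figure comes from rounding in the quoted numbers, not from the arithmetic.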
“Distilling the Knowledge in a Neural Network”, (2015-03-09):
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
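The core mechanism is to soften the teacher's output distribution with a temperature T > 1 and train the student against those soft targets, so that the teacher's near-misses (which classes it almost confuses) carry information the one-hot label discards. A minimal sketch, with illustrative logits and temperature:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; T > 1 flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Teacher logits for one example: class 0 is the answer, but the teacher's
# near-miss on class 1 is "dark knowledge" about class similarity.
teacher_logits = np.array([6.0, 4.0, -2.0])

hard = softmax(teacher_logits, T=1.0)   # nearly one-hot
soft = softmax(teacher_logits, T=4.0)   # raised temperature spreads probability

print(np.round(hard, 3))
print(np.round(soft, 3))

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Student loss at temperature T: cross-entropy against the soft targets,
    conventionally scaled by T^2 so gradient magnitudes match the hard-label term."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T**2 * -np.sum(p * np.log(q))
```

In practice this term is combined with an ordinary cross-entropy on the true labels at T = 1.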
Training convolutional networks (CNNs) that fit on a single GPU with minibatch stochastic gradient descent has become effective in practice. However, there is still no effective method for training large CNNs that do not fit in the memory of a few GPU cards, or for parallelizing CNN training. In this work we show that a simple hard mixture of experts model can be efficiently trained to good effect on large-scale hashtag (multilabel) prediction tasks. Mixture of experts models are not new (Jacobs et al. 1991, Collobert et al. 2003), but in the past, researchers have had to devise sophisticated methods to deal with data fragmentation. We show empirically that modern weakly supervised datasets are large enough to support naive partitioning schemes where each data point is assigned to a single expert. Because the experts are independent, training them in parallel is easy, and evaluation is cheap for the size of the model. Furthermore, we show that we can use a single decoding layer for all the experts, allowing a unified feature embedding space. We demonstrate that it is feasible (and in fact relatively painless) to train far larger models than could be practically trained with standard CNN architectures, and that the extra capacity can be well used on current datasets.
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between ‘enormous data’ and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically with the volume of training data. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires the vision community to not undervalue the data and to develop collective efforts in building larger datasets.
Fine-grained image labels are desirable for many computer vision applications, such as visual search or mobile AI assistant. These applications rely on image classification models that can produce hundreds of thousands (e.g. 100K) of diversified fine-grained image labels on input images. However, training a network at this vocabulary scale is challenging, and suffers from intolerable large model size and slow training speed, which leads to unsatisfying classification performance. A straightforward solution would be training separate expert networks (specialists), with each specialist focusing on learning one specific vertical (e.g. cars, birds…). However, deploying dozens of expert networks in a practical system would significantly increase system complexity and inference latency, and consumes large amounts of computational resources. To address these challenges, we propose a Knowledge Concentration method, which effectively transfers the knowledge from dozens of specialists (multiple teacher networks) into one single model (one student network) to classify 100K object categories. There are three salient aspects in our method: (1) a multi-teacher single-student knowledge distillation framework; (2) a self-paced learning mechanism to allow the student to learn from different teachers at various paces; (3) structurally connected layers to expand the student network capacity with limited extra parameters. We validate our method on OpenImage and a newly collected dataset, Entity-Foto-Tree (EFT), with 100K categories, and show that the proposed model performs significantly better than the baseline generalist model.
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000× improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
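The gating mechanism can be sketched compactly: score all experts, keep only the top-k scores, renormalize their weights with a softmax, and evaluate just those experts. A toy NumPy version (random weights stand in for trained ones, and the paper's noise term and load-balancing losses are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_gate(x, W_g, k=2):
    """Sparse gating: softmax over only the top-k expert scores; the rest get 0.
    W_g is trainable in the paper; here it is random for illustration."""
    scores = x @ W_g
    top = np.argsort(scores)[-k:]            # indices of the k largest scores
    gates = np.zeros_like(scores)
    e = np.exp(scores[top] - scores[top].max())
    gates[top] = e / e.sum()                 # renormalized softmax over the top-k
    return gates

n_experts, d = 8, 16
W_g = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear experts

x = rng.normal(size=d)
g = top_k_gate(x, W_g, k=2)
# Only the selected experts are evaluated -- the source of the compute savings:
y = sum(g[i] * (x @ experts[i]) for i in np.nonzero(g)[0])
print(np.count_nonzero(g), "experts active out of", n_experts)
```

With thousands of experts and k of 2–4, capacity scales with the expert count while per-example compute stays nearly flat.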
“Exploring the Limits of Weakly Supervised Pretraining”, (2018-05-02):
State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards “small”. Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are obvious: such datasets are difficult to collect and annotate. In this paper, we present a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. Our experiments demonstrate that training for large-scale hashtag prediction leads to excellent results. We show improvements on several image classification and object detection tasks, and report the highest ImageNet-1k single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform extensive experiments that provide novel empirical data on the relationship between large-scale pretraining and transfer learning performance.
This paper presents a study of semi-supervised learning with large convolutional networks. We propose a pipeline, based on a teacher/student paradigm, that leverages a large collection of unlabelled images (up to 1 billion). Our main goal is to improve the performance for a given target architecture, like ResNet-50 or ResNeXt. We provide an extensive analysis of the success factors of our approach, which leads us to formulate some recommendations to produce high-accuracy models for image classification with semi-supervised learning. As a result, our approach brings important gains to standard architectures for image, video and fine-grained classification. For instance, by leveraging one billion unlabelled images, our learned vanilla ResNet-50 achieves 81.2% top-1 accuracy on the ImageNet benchmark.
2001-collins.pdf: “Tacit Knowledge, Trust and the Q of Sapphire”, (2001-01-01; ):
Russian measurements of the quality factor (Q) of sapphire, made 20 years ago, have only just been repeated in the West. Shortfalls in tacit knowledge have been partly responsible for this delay. The idea of ‘tacit knowledge’, first put forward by the physical chemist Michael Polanyi, has been studied and analysed over the last two decades. A new classification of tacit knowledge (broadly construed) is offered here and applied to the case of sapphire. The importance of personal contact between scientists is brought out and the sources of trust described. It is suggested that the reproduction of scientific findings could be aided by a small addition to the information contained in experimental reports. The analysis is done in the context of fieldwork conducted in the USA and observations of experimental work at Glasgow University.
2009-mytkowicz.pdf: “Producing Wrong Data Without Doing Anything Obviously Wrong!”, (2009-03-07; ):
This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a substantial bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences.
Our results demonstrate that measurement bias is substantial and commonplace in computer system evaluation. By substantial we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel’s C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias.
Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.
[Keywords: experimentation, measurement, performance, bias]
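The logic of setup randomization can be illustrated with a toy simulation: if a fixed nuisance factor (standing in for, say, UNIX environment size shifting stack alignment) biases measured times, a single fixed setup can reverse the true ordering of two programs, while averaging over randomized setups recovers it. All numbers below are invented for illustration:

```python
import random
import statistics

def run_benchmark(true_cost, env_offset):
    """Toy model: measured time = true cost + bias from a nuisance factor."""
    return true_cost + env_offset

true_cost_A, true_cost_B = 10.0, 10.5   # B really is 5% slower than A

# Biased protocol: one fixed setup whose nuisance factor happens to favor B.
biased_A = run_benchmark(true_cost_A, 1.0)
biased_B = run_benchmark(true_cost_B, 0.0)
print("fixed setup says A is slower:", biased_A > biased_B)   # wrong conclusion

# Setup randomization: draw the nuisance factor fresh for each trial and average.
rng = random.Random(0)
trials_A = [run_benchmark(true_cost_A, rng.uniform(0, 1)) for _ in range(1000)]
trials_B = [run_benchmark(true_cost_B, rng.uniform(0, 1)) for _ in range(1000)]
print("randomized setups say A is faster:",
      statistics.mean(trials_A) < statistics.mean(trials_B))  # correct ordering
```

Randomization does not remove the nuisance effect; it converts a systematic bias into noise that averaging can defeat, which is exactly the paper's prescription.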
2018-langbert.pdf: “Homogenous: The Political Affiliations of Elite Liberal Arts College Faculty”, Mitchell Langbert
The number of individuals in a random sample with close relatives in the sample is a quantity of interest when designing genome-wide association studies (GWASs) and other cohort-based genetic, and non-genetic, studies. In this paper, we develop expressions for the distribution and expectation of the number of p-th cousins in a sample from a population of size N under two diploid Wright-Fisher models. We also develop simple asymptotic expressions for large values of N. For example, the expected proportion of individuals with at least one p-th cousin in a sample of K individuals, for a diploid dioecious Wright-Fisher model, is approximately 1 − e^(−2^(2p−1) · K/N). Our results show that a substantial fraction of individuals in the sample will have at least a second cousin if the sampling fraction (K/N) is on the order of 10^−2. This confirms that, for large cohort samples, relatedness among individuals cannot easily be ignored.
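The asymptotic expression can be evaluated directly for concrete sampling fractions; here a quick numerical check for second cousins (p = 2), reading the exponent as 2^(2p−1) · K/N (this reading of the formula is an assumption about the paper's notation):

```python
import math

def prop_with_pth_cousin(p, K, N):
    """Approximate proportion of sampled individuals with at least one p-th
    cousin in the sample: 1 - exp(-(2^(2p-1)) * K / N), per the asymptotic
    result quoted above (exponent interpretation assumed)."""
    return 1 - math.exp(-(2 ** (2 * p - 1)) * K / N)

N = 1_000_000
for frac in (1e-3, 1e-2, 1e-1):
    K = int(frac * N)
    print(f"K/N = {frac:g}: P(>=1 second cousin) = {prop_with_pth_cousin(2, K, N):.3f}")
```

Even at a 1% sampling fraction the probability is non-negligible, and it grows quickly with p since each additional degree of cousinship multiplies the exponent by 4.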
2018-watts.pdf: “Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes”, (2018-01-01; ):
We replicated and extended Shoda, Mischel, and Peake’s (1990) famous marshmallow study, which showed strong bivariate correlations between a child’s ability to delay gratification just before entering school and both adolescent achievement and socioemotional behaviors. Concentrating on children whose mothers had not completed college, we found that an additional minute waited at age 4 predicted a gain of approximately one tenth of a standard deviation in achievement at age 15. But this bivariate correlation was only half the size of those reported in the original studies and was reduced by two thirds in the presence of controls for family background, early cognitive ability, and the home environment. Most of the variation in adolescent achievement came from being able to wait at least 20 s. Associations between delay time and measures of behavioral outcomes at age 15 were much smaller and rarely statistically-significant.
A randomized experiment with almost 35 million Pandora listeners enables us to measure the sensitivity of consumers to advertising, an important topic of study in the era of ad-supported digital content provision. The experiment randomized listeners into 9 treatment groups, each of which received a different level of audio advertising interrupting their music listening, with the highest treatment group receiving more than twice as many ads as the lowest treatment group. By keeping consistent treatment assignment for 21 months, we are able to measure long-run demand effects, with three times as much ad-load sensitivity as we would have obtained if we had run a month-long experiment. We estimate a demand curve that is strikingly linear, with the number of hours listened decreasing linearly in the number of ads per hour (also known as the price of ad-supported listening). We also show the negative impact on the number of days listened and on the probability of listening at all in the final month. Using an experimental design that separately varies the number of commercial interruptions per hour and the number of ads per commercial interruption, we find that neither makes much difference to listeners beyond their impact on the total number of ads per hour. Lastly, we find that increased ad load causes a substantial increase in the number of paid ad-free subscriptions to Pandora, particularly among older listeners.
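The reported "strikingly linear" demand curve amounts to an OLS fit of hours listened on ads per hour across the treatment arms. A sketch on simulated data, with made-up demand parameters rather than the study's estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical treatment arms: ads per hour (the "price" of ad-supported listening).
ads_per_hour = np.linspace(2.0, 5.0, 9)   # 9 treatment groups, as in the study
true_intercept, true_slope = 25.0, -2.0   # invented demand parameters

# Simulated mean monthly listening hours per arm, with a little sampling noise.
hours = true_intercept + true_slope * ads_per_hour + rng.normal(0, 0.1, size=9)

# OLS fit of the linear demand curve: hours = intercept + slope * ads/hour.
slope, intercept = np.polyfit(ads_per_hour, hours, 1)
print(f"hours ≈ {intercept:.1f} + ({slope:.2f}) × ads/hour")
```

With long-run randomized treatment, the fitted slope is a clean causal estimate of ad-load sensitivity rather than a correlational one.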