genetics directory


“An RNA-based Theory of Natural Universal Computation”, Akhlaghpour 2022

2022-akhlaghpour.pdf: “An RNA-based theory of natural universal computation”, Hessameddin Akhlaghpour (2022-03-21)

“EBERT: Epigenomic Language Models Powered by Cerebras”, Trotter et al 2021

“EBERT: Epigenomic language models powered by Cerebras”, Meredith V. Trotter, Cuong Q. Nguyen, Stephen Young, Rob T. Woodruff, Kim M. Branson (2021-12-14):

Large scale self-supervised pre-training of Transformer language models has advanced the field of Natural Language Processing and shown promise in cross-application to the biological ‘languages’ of proteins and DNA. Learning effective representations of DNA sequences using large genomic sequence corpuses may accelerate the development of models of gene regulation and function through transfer learning. However, to accurately model cell type-specific gene regulation and function, it is necessary to consider not only the information contained in DNA nucleotide sequences, which is mostly invariant between cell types, but also how the local chemical and structural ‘epigenetic state’ of chromosomes varies between cell types.

Here, we introduce a Bidirectional Encoder Representations from Transformers (BERT) model that learns representations based on both DNA sequence and paired epigenetic state inputs, which we call Epigenomic BERT (or EBERT). We pre-train EBERT with a masked language model objective across the entire human genome and across 127 cell types. Training this complex model with a previously prohibitively large dataset was made possible for the first time by a partnership with Cerebras Systems, whose CS-1 system powered all pre-training experiments. We show EBERT’s transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task. Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.

We explore how the inclusion of epigenetic data and task specific feature augmentation impact transfer learning performance.

“Does Mouse Utopia Exist?”, Branwen 2019

Mouse-Utopia: “Does Mouse Utopia Exist?”, Gwern Branwen (2019-08-12):

Did John Calhoun’s 1960s Mouse Utopia really show that animal (and human) populations will expand to arbitrary densities, creating socially-driven pathology and collapse? Reasons for doubt.

Did John Calhoun’s 1960s Mouse Utopia really show that animal (and human) populations will expand to arbitrary densities, creating socially-driven pathology and collapse? I give reasons for doubt about its replicability, interpretation, and meaningfulness.

One of the most famous experiments in psychology & sociology was John Calhoun’s Mouse Utopia experiments in the 1960s–1970s. In the usual telling, Mouse Utopia created ideal mouse environments in which the mouse population was permitted to increase as much as possible; however, the overcrowding inevitably resulted in extreme levels of physical & social dysfunctionality, and eventually population collapse & even extinction. Looking more closely into it, there are reasons to doubt the replicability of the growth & pathological behavior & collapse of this utopia (“no-place”), and if it does happen, whether it is driven by the social pressures as claimed by Calhoun or by other causal mechanisms at least as consistent with the evidence like disease or mutational meltdown.

“Origins of Innovation: Bakewell & Breeding”, Branwen 2018

Bakewell: “Origins of Innovation: Bakewell & Breeding”, Gwern Branwen (2018-10-28):

A review of Russell 1986’s Like Engend’ring Like: Heredity and Animal Breeding in Early Modern England, describing development of selective breeding and discussing models of the psychology and sociology of innovation.

Like anything else, the idea of “breeding” had to be invented. That traits are genetically-influenced broadly equally by both parents subject to considerable randomness and can be selected for over many generations to create large average population-wide increases had to be discovered the hard way, with many wildly wrong theories discarded along the way. Animal breeding is a case in point, as reviewed by an intellectual history of animal breeding, Like Engend’ring Like, which covers mistaken theories of conception & inheritance from the ancient Greeks to perhaps the first truly successful modern animal breeder, Robert Bakewell (1725–1795).

Why did it take thousands of years to begin developing useful animal breeding techniques, a topic of interest to almost all farmers everywhere, a field which has no prerequisites such as advanced mathematics or special chemicals or mechanical tools, and seemingly requires only close observation and patience? This question can be asked of many innovations early in the Industrial Revolution, such as the flying shuttle.

Some veins in economics history and sociology suggest that at least one ingredient is an improving attitude: a detached outsider’s attitude which asks whether there is any way to optimize something, in defiance of ‘the wisdom of tradition’, and looks for improvements. A relevant English example is the English Royal Society of Arts, founded not too distant in time from Bakewell⁠, specifically to spur competition and imitation and new inventions. Psychological barriers may be as important as anything like per capita wealth or peace in innovation.

“Open Questions”, Branwen 2018

Questions: “Open Questions”, Gwern Branwen (2018-10-17):

Some anomalies/​questions which are not necessarily important, but do puzzle me or where I find existing explanations to be unsatisfying.

A list of some questions which are not necessarily important, but do puzzle me or where I find existing ‘answers’ to be unsatisfying, categorized by subject (along the lines of Patrick Collison’s list & Alex Guzey⁠; see also my list of project ideas).

“Race in My Little Pony”, Branwen 2018

MLP-genetics: “Race in My Little Pony”, Gwern Branwen (2018-06-04):

In MLP:FiM, the 3 pony races sometimes bear offspring of other pony races; I review 4 complicated Mendelian models attempting to explain this, and note that a standard polygenic liability-threshold model can fit it parsimoniously.

(For background on My Little Pony: Friendship is Magic, see my review of My Little Pony⁠.)

Another fictional universe with genetic mechanisms is My Little Pony: Friendship Is Magic, where there are 3 pony races which are heritable. One outlier family which has all 3 races represented challenges simple Mendelian interpretations of MLP races. I review 4 attempts to reconcile the outlier with Mendelian mechanisms, and propose another interpretation, drawing on polygenic mechanisms, treating race as a polytomous liability threshold trait, which is flexible enough to explain all observations in-universe (at least for the first few seasons of MLP).

“Amusing Ourselves to Death?”, Branwen 2018

Amuse: “Amusing Ourselves to Death?”, Gwern Branwen (2018-05-12):

A suggested x-risk/Great Filter is the possibility of advanced entertainment technology leading to wireheading/mass sterility/population collapse and extinction. As media consumption patterns are highly heritable, any such effect would trigger rapid human adaptation, implying extinction is almost impossible unless the collapse is immediate or addictiveness accelerates exponentially.

To demonstrate the point that there are pervasive genetic influences on all aspects of media consumption or leisure time activities/​preferences/​attitudes, I compile >580 heritability estimates from the behavioral genetics literature (drawing particularly on Loehlin & Nichols 1976’s A Study of 850 Sets of Twins), roughly divided in ~13 categories.

“Genetics and Eugenics in Frank Herbert’s Dune-verse”, Branwen 2018

Dune-genetics: “Genetics and Eugenics in Frank Herbert’s Dune-verse”, Gwern Branwen (2018-05-05):

Discussion of fictional eugenics program in the SF Dune-verse and how it contradicts contemporary known human genetics but suggests heavy agricultural science and Mendelian inspiration to Frank Herbert’s worldview.

Frank Herbert’s SF Dune series features as a central mechanic a multi-millennium human eugenics breeding program by the Bene Gesserit, which produces the main character, Paul Atreides⁠, with precognitive powers. The breeding program is described as oddly slow and ineffective and requiring roles for incest and inbreeding at some points, which contradict most proposed human eugenics methods. I describe the two main historical paradigms of complex trait genetics, the Fisherian infinitesimal model and the Mendelian monogenic model, the former of which is heavily used in human behavioral genetics and the latter of which is heavily used in agricultural breeding for novel traits, and argue that Herbert (incorrectly but understandably) believed the latter applied to most human traits, perhaps related to his lifelong autodidactic interest in plants & insects & farming, and this unstated but implicit intellectual background shaped Dune and resolves the anomalies.

“‘Genius Revisited’ Revisited”, Branwen 2016

Hunter: “‘Genius Revisited’ Revisited”, Gwern Branwen (2016-06-19):

A book study of surveys of the high-IQ elementary school HCES concludes that high IQ is not predictive of accomplishment; I point out that results are consistent with regression to the mean from extremely early IQ tests and small total sample size.

Genius Revisited documents the longitudinal results of a high-IQ/​gifted-and-talented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the general high education & income levels, but absence of great accomplishment on a national or global scale (eg. a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.

I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making 2 statistical errors: base-rate neglect⁠, and regression to the mean⁠.

First, their standards fall prey to a base-rate fallacy and even extreme predictive value of IQ would not predict 1 or more Nobel prizes because Nobel prize odds are measured at 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.
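The base-rate argument can be checked with a short calculation. This is a minimal sketch: the cohort size (600) and the per-person odds are illustrative assumptions, since the text above says only "a few hundred" and "1 in millions", and `prob_no_winner` is a hypothetical helper name.

```python
def prob_no_winner(n_people: int, p_each: float) -> float:
    """Probability that none of n independent people wins, each with probability p_each."""
    return (1.0 - p_each) ** n_people

# Assumed inputs (not from the book): 600 alumni, baseline odds of 1 in 1,000,000,
# and a generous 100-fold increase in those odds attributed to high IQ.
baseline = prob_no_winner(600, 1e-6)
boosted = prob_no_winner(600, 100e-6)

print(f"baseline odds: P(no Nobel) = {baseline:.4f}")
print(f"100x odds:     P(no Nobel) = {boosted:.4f}")
```

Under these assumed numbers, even a 100-fold boost in individual odds leaves "no Nobel prizes at all" as by far the most likely outcome for a cohort of this size.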

Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable as it is driven by the regression to the mean caused by the relatively low correlation of early childhood with adult IQs—which means their sample is far less elite as adults than they believe. Using early-childhood/​adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES’s associated high-IQ/​gifted-and-talented high school, Hunter High, which has access to the adolescents’ more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
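The size of the regression can be sketched with the standard linear formula, in which a selected group's expected later score shrinks toward the population mean in proportion to the test-retest correlation r. The sketch assumes a population mean of 100 and equal variances on both tests; the function names are hypothetical, and the correlation is backed out from the 157 and 133 figures above rather than taken from the essay.

```python
def regressed_mean(selected_mean: float, r: float, pop_mean: float = 100.0) -> float:
    """Expected later mean of a group selected on an earlier score, given correlation r."""
    return pop_mean + r * (selected_mean - pop_mean)

def implied_correlation(selected_mean: float, later_mean: float,
                        pop_mean: float = 100.0) -> float:
    """Correlation that would produce the given shrinkage toward the population mean."""
    return (later_mean - pop_mean) / (selected_mean - pop_mean)

# Figures quoted in the text: mean 157 at selection, around 133 as adults.
r = implied_correlation(157, 133)
print(f"implied childhood/adult correlation: {r:.3f}")
print(f"expected adult mean at that r: {regressed_mean(157, r):.1f}")

# A more predictive (later) test regresses less, as with the high school comparison.
print(f"expected adult mean at r = 0.8: {regressed_mean(157, 0.8):.1f}")
```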

This unavoidable statistical fact undermines the main rationale of HCES: extremely high-IQ adults cannot be accurately selected as kindergartners on the basis of a simple test. This greater-regression problem can be lessened by the use of additional variables in admissions, such as parental IQs or high-quality genetic polygenic scores⁠; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and HCES students should not be assigned scarce magnet high school slots.

“Embryo Selection For Intelligence”, Branwen 2016

Embryo-selection: “Embryo Selection For Intelligence”, Gwern Branwen (2016-01-22):

A cost-benefit analysis of the marginal cost of IVF-based embryo selection for intelligence and other traits with 2016-2017 state-of-the-art.

With genetic predictors of a phenotypic trait, it is possible to select embryos during an in vitro fertilization process to increase or decrease that trait. Extending the work of Shulman & Bostrom 2014⁠/​Hsu 2014⁠, I consider the case of human intelligence using SNP-based genetic prediction, finding:

  • a meta-analysis of GCTA results indicates that SNPs can explain >33% of variance in current intelligence scores, and >44% with better-quality phenotype testing
  • this sets an upper bound on the effectiveness of SNP-based selection: a gain of 9 IQ points when selecting the top embryo out of 10
  • the best 2016 polygenic score could achieve a gain of ~3 IQ points when selecting out of 10
  • the marginal cost of embryo selection (assuming IVF is already being done) is modest, at $1,822.7 ($1,500.0 in 2016 dollars) + $243.0 ($200.0 in 2016 dollars) per embryo, with the sequencing cost projected to drop rapidly
  • a model of the IVF process, incorporating number of extracted eggs, losses to abnormalities & vitrification & failed implantation & miscarriages from 2 real IVF patient populations, estimates feasible gains of 0.39 & 0.68 IQ points
  • embryo selection is currently unprofitable (mean: -$435.0, or -$358.0 in 2016 dollars) in the USA under the lowest estimate of the value of an IQ point, but profitable under the highest (mean: $7,570.3, or $6,230.0 in 2016 dollars). The main constraint on selection profitability is the polygenic score; under the highest value, the NPV EVPI of a perfect SNP predictor is $29.2b ($24.0b in 2016 dollars) and the EVSI per education/SNP sample is $86.3k ($71.0k in 2016 dollars)
  • under the worst-case estimate, selection can be made profitable with a better polygenic score, which would require n > 237,300 using education phenotype data (and much less using fluid intelligence measures)
  • selection can be made more effective by selecting on multiple phenotype traits: considering an example using 7 traits (IQ/​height/​BMI/​diabetes/​ADHD⁠/​bipolar/​schizophrenia), there is a factor gain over IQ alone; the outperformance of multiple selection remains after adjusting for genetic correlations & polygenic scores and using a broader set of 16 traits.
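The upper-bound figure in the list can be roughly reproduced with an order-statistics sketch: the expected gain from picking the best of n embryos is the expected maximum of n standard normal draws, scaled by the predictor's standard deviation among siblings. Two assumptions are made here and are not stated in the summary above: that sibling embryos vary over only half of the population-level predictable variance, and that IQ has a standard deviation of 15. The function names are hypothetical.

```python
from statistics import NormalDist

def expected_max_std_normal(n: int, lo: float = -8.0, hi: float = 8.0,
                            steps: int = 20000) -> float:
    """E[max of n iid N(0,1) draws], by midpoint integration of
    x * n * pdf(x) * cdf(x)**(n-1), the density of the maximum."""
    nd = NormalDist()
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += x * n * nd.pdf(x) * nd.cdf(x) ** (n - 1)
    return total * dx

def selection_gain_iq(n_embryos: int, variance_explained: float,
                      sd_iq: float = 15.0) -> float:
    """Expected IQ gain from choosing the top-scoring embryo out of n_embryos."""
    # Assumption: only half of the explained variance differs between siblings.
    within_family_sd = (variance_explained / 2.0) ** 0.5
    return expected_max_std_normal(n_embryos) * within_family_sd * sd_iq

# 33% of variance explained, top embryo out of 10 (figures from the list above).
print(f"{selection_gain_iq(10, 0.33):.1f} IQ points")
```

Under these assumptions the result comes out a little over 9 points, in line with the upper bound quoted above. The losses modeled in the IVF pipeline (abnormalities, failed implantation, miscarriage) are not represented here, which is why realistic estimates are far lower.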

“Catnip Immunity and Alternatives”, Branwen 2015

Catnip: “Catnip immunity and alternatives”, Gwern Branwen (2015-11-07):

Estimation of catnip immunity rates by country with meta-analysis and surveys, and discussion of catnip alternatives.

Not all cats respond to the catnip stimulant; the rate of responders is generally estimated at ~70% of cats. A meta-analysis of catnip response experiments since the 1940s indicates the true value is ~62%. The low quality of studies and the reporting of their data makes examination of possible moderators like age, sex, and country difficult. Catnip responses have been recorded for a number of species both inside and outside the Felidae family; of them, there is evidence for a catnip response in the Felidae, and, more uncertainly, the Paradoxurinae, and Herpestinae.

To extend the analysis, I run large-scale online surveys measuring catnip response rates globally in domestic cats, finding high heterogeneity but considerable rates of catnip immunity worldwide.

As a piece of practical advice for cat-hallucinogen sommeliers, I treat catnip response & finding catnip substitutes as a decision problem, modeling it as a Markov decision process where one wishes to find a working psychoactive at minimum cost. Bol et al 2017 measured multiple psychoactives simultaneously in a large sample of cats, permitting prediction of responses conditional on not responding to others. (The solution to the specific problem is to test in the sequence catnip → honeysuckle → silvervine → Valerian⁠.)

For discussion of cat psychology in general, see my Cat Sense review.

“Physical Activity, Fitness, Glucose Homeostasis, and Brain Morphology in Twins”, Rottensteiner et al 2015

2015-rottensteiner.pdf: “Physical activity, fitness, glucose homeostasis, and brain morphology in twins”, Mirva Rottensteiner, Tuija Leskinen, Eini Niskanen, Sari Aaltonen, Sara Mutikainen, Jan Wikgren, Kauko Heikkilä et al (2015):

Purpose: The main aim of the present study (FITFATTWIN) was to investigate how physical activity level is associated with body composition, glucose homeostasis, and brain morphology in young adult male monozygotic twin pairs discordant for physical activity.

Methods: From a population-based twin cohort, we systematically selected 10 young adult male monozygotic twin pairs (age range, 32–36 yr) discordant for leisure time physical activity during the past 3 yr. On the basis of interviews, we calculated a mean sum index for leisure time and commuting activity during the past 3 yr (3-yr LTMET index expressed as MET-hours per day). We conducted extensive measurements on body composition (including fat percentage measured by dual-energy x-ray absorptiometry), glucose homeostasis including homeostatic model assessment index and insulin sensitivity index (Matsuda index, calculated from glucose and insulin values from an oral glucose tolerance test), and whole brain magnetic resonance imaging for regional volumetric analyses.

Results: According to pairwise analysis, the active twins had lower body fat percentage (p = 0.029) and homeostatic model assessment index (p = 0.031) and higher Matsuda index (p = 0.021) compared with their inactive co-twins. Striatal and prefrontal cortex (subgyral and inferior frontal gyrus) brain gray matter volumes were larger in the nondominant hemisphere in active twins compared with those in inactive co-twins, with a statistical threshold of p < 0.001.

Conclusions: Among healthy adult male twins in their mid-30s, a greater level of physical activity is associated with improved glucose homeostasis and modulation of striatum and prefrontal cortex gray matter volume, independent of genetic background. The findings may contribute to later reduced risk of type 2 diabetes and mobility limitations.

“Everything Is Correlated”, Branwen 2014

Everything: “Everything Is Correlated”, Gwern Branwen (2014-09-12):

Anthology of sociology, statistical, or psychological papers discussing the observation that all real-world variables have non-zero correlations and the implications for statistical theory such as ‘null hypothesis testing’.

Statistical folklore asserts that “everything is correlated”: in any real-world dataset, most or all measured variables will have non-zero correlations, even between variables which appear to be completely independent of each other, and that these correlations are not merely sampling error flukes but will appear in large-scale datasets to arbitrarily designated levels of statistical-significance or posterior probability.

This raises serious questions for null-hypothesis statistical-significance testing, as it implies the null hypothesis of 0 will always be rejected with sufficient data, meaning that a failure to reject only implies insufficient data, and provides no actual test or confirmation of a theory. Even a directional prediction is minimally confirmatory since there is a 50% chance of picking the right direction at random.
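The claim that the null will always be rejected with sufficient data can be illustrated with a standard power calculation. This sketch uses the Fisher z approximation for the sample size needed to detect a true correlation r in a two-sided test; the 5% significance level, the 80% power, and the function name are illustrative assumptions rather than anything from the anthology.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_to_detect(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size to reject rho = 0 when the true correlation is r,
    using the Fisher z-transformation (standard error 1 / sqrt(n - 3))."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

# Any nonzero correlation becomes "significant" once n is large enough.
for r in (0.1, 0.01, 0.001):
    print(f"r = {r}: n of about {n_to_detect(r):,}")
```

The required n grows roughly as 1/r², so a tiny but nonzero correlation needs only a large enough dataset to cross the significance threshold.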

It also has implications for conceptualizations of theories & causal models, interpretations of structural models, and other statistical principles such as the “sparsity principle”.

“Statistical Notes”, Branwen 2014

Statistical-notes: “Statistical Notes”, Gwern Branwen (2014-07-17):

Miscellaneous statistical stuff

Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.

“The Consequences of Political Dictatorship for Russian Science”, Soyfer 2001

2001-soyfer.pdf: “The consequences of political dictatorship for Russian science”, Valery N. Soyfer (2001-09-01):

The Soviet communist regime had devastating consequences on the state of Russian 20th century science. The country’s Communist leaders promoted Trofim Lysenko—an agronomist and keen supporter of the inheritance of acquired characters—and the Soviet government imposed a complete ban on the practice and teaching of genetics, which it condemned as a “bourgeois perversion”. Russian science, which had previously flourished, rapidly declined, and many valuable scientific discoveries made by leading Russian geneticists were forgotten.

Totalitarian political pressure: The Soviet communist regime eliminated many of its best scientists, crushed societal morals and brought irreparable harm to the country (for a discussion see REFS 8,10). During 1919–1922, Lenin exiled thousands of philosophers, sociologists, historians and economists whose ideas contradicted his views. Stalin and the Communist Party Politburo took the next step: they decided that certain scientific fields must be forbidden as “bourgeois perversion”. It is possible to argue that science is intrinsically political, and many scientists might be seen as excellent politicians when it comes to seeking financial support for their work, but, in my opinion, this behaviour cannot be compared with the hysterical appeals to the country’s leaders to ban certain disciplines and calls for the arrests of ‘anti-Soviet’ scientists that took place in the USSR.

The intervention of the Communist leaders into science in the USSR was a particular phenomenon in the history of science in the 20th century, comparable only with the events that took place in Nazi Germany. It is qualitatively different from the sort of everyday ‘politics’ in which all scientists, everywhere, engage. The most tragic consequence of totalitarian rule was the persecution of those scientists who were unable to unconditionally agree with the Party’s decrees or tried to dispute its decisions. These personal tragedies of many outstanding scientists in the USSR led to much deeper and wider effects. The progress of science was slowed or stopped, and millions of university and high school students received a distorted education. A comparable example of the devastating influence of politicization of society was the Nazis’ destruction of science in fascist Germany after 1933. Thousands of scientists, especially those of Jewish origins, were forced to leave Germany. Nevertheless, the mass arrests of scientists in the Soviet Union had much worse consequences for science. In my opinion, it was the most tragic event in the history of science. It demonstrated the terrible effects of a political dictatorship, and showed that science should develop in free and open competition between scientists, without political intervention.

“Recent Human Influenza A (H1N1) Viruses Are Closely Related Genetically to Strains Isolated in 1950”, Nakajima et al 1978

1978-nakajima.pdf: “Recent human influenza A (H1N1) viruses are closely related genetically to strains isolated in 1950”, Katsuhisa Nakajima, Ulrich Desselberger, Peter Palese (1978-07-27):

Comparison of the oligonucleotide maps of the RNAs of current human influenza (H1N1) virus isolates shows these strains to be much more closely related to viruses isolated in 1950 than to strains which circulated before or after that period.