> Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints?
> I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll's estimate but with wide credible intervals to cover the mode based on the small reliable poll's estimate.

A question was asked of me: what should one infer if one is given what would be definitive data if one could take it at face value---but one suspects this data might be totally 100% incorrect?
An example would be if one wanted to know what fraction of people would answer 'yes' to a particular question, and one had a very small poll (_n_ = 10) suggesting 90% say yes, but then one was also given the results from a much larger poll (_n_ = 1000) saying 75% responded yes---but this poll was run by untrustworthy people, people that, for whatever reason, you believe might make something up half the time.
You should be able to learn *something* from this unreliable poll, but you can't learn *everything* from it because you would be burned half the time.
If not for this issue of unreliability, this would be an easy binomial problem: specify a uniform or Jeffreys prior on what percentage of people will say yes, add in the binomial data of 9⁄10, and look at the posterior.
But what do we do with the unreliability joker?
## Binomial
First let's try the simple case, just updating on a small poll of 9⁄10. We would expect the posterior to be unimodal, peaked around 80-90% but broad (due to the small sample size), and falling off sharply as it approaches 100%, since a rate that high is a priori unlikely.
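This simple case also has an exact answer to check the simulations against: the uniform prior is a Beta(1,1), which is conjugate to the binomial, so updating on 9⁄10 successes yields a Beta(1+9, 1+1) = Beta(10, 2) posterior:

~~~{.R}
## exact conjugate posterior for 9/10 successes under a uniform Beta(1,1) prior:
a1 <- 1 + 9; b1 <- 1 + (10-9)
a1 / (a1 + b1) # posterior mean
# [1] 0.8333333
qbeta(c(0.025, 0.975), a1, b1) # equal-tailed 95% credible interval
~~~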
MCMC using [Bayesian First Aid](https://github.com/rasmusab/bayesian_first_aid):
~~~{.R}
## install.packages("devtools")
## devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
oldData <- data.frame(Yes=9, N=10)
b <- bayes.binom.test(oldData$Yes, oldData$N); b
# ...number of successes = 9, number of trials = 10
# Estimated relative frequency of success:
# 0.85
# 95% credible interval:
# 0.63 0.99
# The relative frequency of success is more than 0.5 by a probability of 0.994
# and less than 0.5 by a probability of 0.006
~~~
This is itself a wrapper which calls out to JAGS, doing something like this:
~~~{.R}
library(runjags)
model_string <- "model {
x ~ dbinom(theta, n)
theta ~ dbeta(1, 1) }"
model <- autorun.jags(model_string, monitor="theta", data=list(x=oldData$Yes, n=oldData$N)); model
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# theta 0.63669 0.85254 0.9944 0.83357 0.10329 -- 0.0007304 0.7 20000 0.011014 1.0004
~~~
Here is a simulation-based version of Bayesian inference using [ABC](!Wikipedia "Approximate Bayesian computation"):
~~~{.R}
oldData <- data.frame(Yes=9, N=10)
simulatePoll <- function(n, pr) { rbinom(1, size=n, p=pr); }
poll_abc <- replicate(100000, {
# draw from our uniform prior
p <- runif(1,min=0,max=1)
# simulate a hypothetical poll dataset the same size as our original
newData <- data.frame(Yes=simulatePoll(oldData$N, p), N=oldData$N)
# were they equal? if so, save sample as part of posterior
if (all(oldData == newData)) { return(p) }
}
)
resultsABC <- unlist(Filter(function(x) {!is.null(x)}, poll_abc))
summary(resultsABC)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.3260816 0.7750520 0.8508855 0.8336383 0.9117471 0.9991691
hist(resultsABC)
# https://i.imgur.com/fn3XYQW.png
~~~
They look identical, as they should.
### Binomial with binary unreliability
To implement our more complicated version: the original poll remains the same but we add in the complication of a very large poll which 50% of the time is a true measure of the poll response and 50% of the time is drawn uniformly at random.
(So if the true poll response is 90%, then half the time the large poll will yield accurate data like 905⁄1000 or 890⁄1000, and the rest of the time it will yield 10⁄1000 or 400⁄1000 or 700⁄1000.)
This is different from the more common kinds of measurement-error models where it's generally assumed that the noisy measurements still have *some* informativeness to them; here there is none.
Specifically, this faux poll has yielded the data not 9⁄10, but 750⁄1000.
#### ABC
Using ABC again: we generate the reliable small poll as before, and add a faux poll where we flip a coin to decide whether to return a 'yes' count based on the population parameter or just a random number; we then combine the two datasets and check whether the result is identical to the actual data, saving the population probability if it is.
~~~{.R}
oldData2 <- data.frame(Yes=c(9,750), N=c(10,1000)); oldData2
# Yes N
# 1 9 10
# 2 750 1000
simulateHonestPoll <- function(n, pr) { rbinom(1, size=n, p=pr); }
simulateFauxPoll <- function(n, pr, switchp) { if(sample(c(TRUE, FALSE), 1, prob=c(switchp, 1-switchp))) { rbinom(1, size=n, p=pr); } else { round(runif(1, min=0, max=n)); }}
poll_abc <- replicate(1000000, {
priorp <- runif(1,min=0,max=1)
switch <- 0.5
n1 <- 10
n2 <- 1000
data1 <- data.frame(Yes=simulateHonestPoll(n1, priorp), N=n1)
data2 <- data.frame(Yes=simulateFauxPoll(n2, priorp, switch), N=n2)
newData <- rbind(data1, data2)
if (all(oldData2 == newData)) { return(priorp) }
}
)
resultsABC <- unlist(Filter(function(x) {!is.null(x)}, poll_abc))
summary(resultsABC)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.5256471 0.7427098 0.7584650 0.7860109 0.8133581 0.9765648
hist(resultsABC)
# https://i.imgur.com/atMz0jg.png
~~~
The results are interesting, and in this case the summary statistics are misleading: the median is indeed around 75% (as we would expect, since that's the result of the highly precise poll which has a 50% chance of being the truth), but the mean is being pulled away towards the original 90% estimate, and plotting the histogram, bimodality emerges.
The posterior reports that there's still a lot of credibility to the 90% point estimate; but between the original diffuseness of that posterior (which left a lot of probability on lower response rates, including 75%) and the high certainty that, if the large poll is accurate, the rate is definitely close to 75%, it winds up peaked a little above 75% (since even if the larger poll is honest, the earlier poll did still find 9⁄10).
So it's not that we think the best estimate of the true population rate really is 79% (the mode is more like 75%, though the rate could easily be far away, up in the 90%s); rather, we would need to think about what we want to do with this posterior before deciding how to summarize it.
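(Because the noise component contributes a constant likelihood, this posterior can also be computed exactly by brute force on a grid, as a check on the simulations; a sketch, approximating the noise model as uniform over the 1001 possible counts:)

~~~{.R}
## exact grid posterior: the small poll contributes a Binomial(10, p) likelihood;
## the large poll contributes a 50/50 mixture of Binomial(1000, p) and a
## constant ~1/1001 (any count equally likely under the noise model)
p <- seq(0, 1, by=0.0001)
likelihood <- dbinom(9, 10, p) * (0.5 * dbinom(750, 1000, p) + 0.5 * (1/1001))
posterior <- likelihood / sum(likelihood)
sum(p * posterior) # posterior mean: ~0.78, agreeing with the ABC estimate
~~~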
#### Mixture
ABC is slow and would not scale to more hypothetical polls unless we abandoned exact ABC inference and began using approximate ABC (entirely possible in this case; instead of strict equality between the original and simulated data, we'd instead accept a sample of _p_ if the simulated dataset's fractions were within, say, 1% of the originals); and the simulation would need to be rewritten anyway.
MCMC can handle this if we think of our problem as a [mixture model](!Wikipedia): we have poll data drawn from two clusters/distributions---one cluster is the true population distribution of opinion, and the other cluster just spits out noise. We have one observation which we know comes from the first, reliable distribution (the 9⁄10 poll result) and one observation which could have come from either (750⁄1000), but we do know that the indexing probability for mixing the two distributions is 50%.
In JAGS, we write down a model in which `dcat` flips between 1 and 2 if the cluster is not known, specifying which distribution a sample came from and its theta probability, and then we infer the thetas for both distributions.
Of course, we only care about the first distribution's theta since the second one is noise.
~~~{.R}
library(runjags)
model1 <- "model {
for (i in 1:N) {
y[i] ~ dbinom(theta[i], n[i])
theta[i] <- thetaOfClust[ clust[i] ]
clust[i] ~ dcat(pi[])
}
pi[1] <- switch[1]
pi[2] <- switch[2]
thetaOfClust[1] ~ dbeta(1,1)
thetaOfClust[2] ~ dunif(0,1)
}"
j1 <- autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldData2), y=oldData2$Yes, n=oldData2$N, switch=c(0.5, 0.5), clust=c(1,NA))); j1
# ... Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# theta[1] 0.70582 0.75651 0.97263 0.77926 0.07178 -- 0.001442 2 2478 0.12978 1.0011
# theta[2] 0.72446 0.75078 0.77814 0.75054 0.013646 -- 0.00009649 0.7 20000 0.009458 1
plot(j1)
# https://i.imgur.com/EaqR0dD.png
~~~
Sure enough, we get a good match with the ABC estimate: a mean estimate for the population distribution of 78% with a very wide 95% CI and a clearly bimodal distribution with a huge spike at 75%.
Since the MCMC mixture model looks completely different from the imperative simulation-based model, the consistency in estimates & distributions gives me some confidence in the results being right.
So we can see how we should update our beliefs---by a perhaps surprising amount towards the unreliable datapoint.
The original data was too weak to strongly resist the allure of that highly precise poll.
#### Weakening heuristic?
We might try to think of it this way: half the time, the large poll means nothing whatsoever, containing no information about the population at all; the other half of the time, it is exactly what it seems to be and 100% informative; so doesn't that mean that on average we should treat it as containing half the information we thought it did? And information is directly based on sample size: a sample 5x as big contains 5x as much information.
So perhaps in this case of all-or-nothing accuracy, we could solve it easily by simply weakening the weight put on the unreliable information and shrinking the claimed sample size---instead of treating it as 750 of 1000, treat it as 375⁄500; and if it had been 75,000 of 100,000, convert it to 37,500 of 50,000.
This is a simple and intuitive shortcut, but if we think about what the binomial will return as the unreliable poll increases in size or if we look at the results...
~~~{.R}
switch <- 0.5
oldData3 <- data.frame(Yes=c(9,(750*switch)), N=c(10,(1000*switch)))
b2 <- bayes.binom.test(sum(oldData3$Yes), sum(oldData3$N)); b2
#
# Bayesian First Aid binomial test
#
# data: sum(oldData3$Yes) and sum(oldData3$N)
# number of successes = 384, number of trials = 510
# Estimated relative frequency of success:
# 0.75
# 95% credible interval:
# 0.71 0.79
# The relative frequency of success is more than 0.5 by a probability of >0.999
# and less than 0.5 by a probability of <0.001
~~~
Unfortunately, this doesn't work because it doesn't preserve the bimodal aspect of the posterior, and we get a unimodal distribution ever concentrating on its mean, wiping out the existence of the 0.90 peak. If our untrustworthy poll had instead, say, reported 750,000 out of 1 million, that should only make the peak at 0.75 look like a needle---it should be unable to affect the mass around 0.9, because it doesn't matter if the data is 100 or 1 million or 1 billion, it still only has a 50% chance of being true.
It's a little hard to see this since the mean frequency of 0.75 is fairly close to the mean of 0.78 from the ABC and we might write this off as approximation error in either the ABC estimate or BFA's MCMC, but if we look at the 95% CI and note that 0.9 is not inside it or if we plot the posterior (`plot(b2)`), then the absence of bimodality jumps out.
So this trick doesn't work.
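One way to see this directly is to evaluate the exact posterior on a grid (a sketch, approximating the noise model as uniform over the possible counts) and check how much probability mass stays above 0.85 as the unreliable poll grows: scaling it up to a million respondents sharpens the 0.75 spike but leaves the mass near 0.9 almost untouched, whereas the halving heuristic would wipe it out.

~~~{.R}
## posterior mass above 0.85 as the unreliable poll grows, by grid evaluation
p <- seq(0, 1, by=0.0001)
post <- function(yes, n) {
    lik <- dbinom(9, 10, p) * (0.5 * dbinom(yes, n, p) + 0.5 * (1/(n+1)))
    lik / sum(lik) }
sum(post(750, 1000)[p > 0.85])       # with n=1000
sum(post(750000, 1000000)[p > 0.85]) # with n=1,000,000: nearly unchanged
~~~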
# Dysgenics power analysis {.collapse}
> Current dysgenic estimates predict that genotypic IQ in the West is falling at a substantial rate, amounting to around half a standard deviation or more over the past century, by 1. reducing the frequency at which intelligence-increasing genetic variants occur (through natural selection against such variants) and 2. increasing the number of new and potentially harmful genetic mutations (increasing mutation load).
> Estimates are produced indirectly by surveying reproductive rates or by trying to show decreases in phenotypic traits associated with intelligence; it would obviously be preferable to examine dysgenic effects directly, by observing decreases in frequencies or increases in mutation load in a large sample of Western genetic information such as SNP arrays or whole-genomes (respectively).
> Such direct testing of dysgenics hypotheses is becoming increasingly feasible due to the exponential decrease in SNP & whole-genome sequencing costs creating large datasets (some publicly available) and the recent identification of some intelligence genes.
> It remains unclear how large these datasets must be to overcome sampling error and yield informative estimates of changes in frequencies or mutation load, however; datasets like PGP or SSGAC may still be too small to investigate dysgenics.
> I consider the effect size estimates and, under some simple models, derive power calculations & power simulations of how large a dataset would be required to have an 80% chance of detecting a dysgenic effect: to detect the decrease in intelligence SNPs using SNP data, _n_≥30,000; to detect the increase in mutation load in whole genomes, _n_≥160.
> I then compare to available datasets: the effect on SNPs can be detected by a large number of existing proprietary databases, but there are no public databases which will be large enough in the foreseeable future; the effect on mutation load, on the other hand, can be detected using solely the currently publicly available dataset from PGP.
> So I conclude that while only the proprietary databases can directly test dysgenic theories of selection for the foreseeable future, there *is* an opportunity to analyze PGP genomes to directly test the dysgenic theory of mutation load.

The dysgenics hypothesis argues that, due to observed reproductive patterns in which the highly educated or intelligent tend to have fewer offspring, genotypic IQ (the upper bound on phenotypic IQs set by genes, and the sort of thing measured by a polygenic score) is declining with each generation.
If dysgenics is true, then it is an extremely important phenomenon, as important as many things that get far more attention like lead remediation; but to paraphrase [Richard Hamming](!Wikipedia)[^Hamming-importance], just because a problem is important does not mean it is worth working on or researching or discussing if there is no chance of making progress---if the data is hopelessly compromised by many systematic biases which would cause false positives or if the data is too scanty to overcome random error or analyses so flexible that they could deliver any answer the partisan wishes.
Phenotypic data will, in all probability, never allow for a clear & decisive answer to the question of whether dysgenics exists or matters, as long-term comparisons are roughly as credible as noting that global piracy rates have declined while global warming increases, or paracetamol consumption rates have increased in tandem with Alzheimer's rates; only direct examination of genetics will deliver the decisive answer.
It would be nice to have an idea of *how much* genetic data we would need to overcome random error (and hence, whether it's possible to make progress in the near future), which we can answer by doing some statistical power analyses.
[^Hamming-importance]: Richard Hamming, ["You and Your Research"](https://www.cs.virginia.edu/~robins/YouAndYourResearch.html):
> The three outstanding problems in physics, in a certain sense, were never worked on while I was at Bell Labs. By important I mean guaranteed a Nobel Prize and any sum of money you want to mention. We didn't work on (1) time travel, (2) teleportation, and (3) antigravity. They are not important problems because we do not have an attack. It's not the consequence that makes a problem important, it is that you have a reasonable attack. That is what makes a problem important.
Changes over time in genetics could be due to changes within a particular race or population (for example, in all white Englishmen), or could be due to population movements like one group replacing or migrating or merging into another (population genetics has revealed innumerable complex examples historically).
The latter is possible thanks to the increasing availability of ancient DNA, often [made public for researchers](https://www.oagr.org.au/source/ "Online Ancient Genome Repository"); so one could observe very long-term trends with cumulatively large effects (implying that small samples may suffice), but this approach has serious issues in interpretation and questions about how comparable intelligence variants may be across groups or throughout human evolution.
With the former, there is less concern about interpretation due to greater temporal and ethnic homogeneity---if a GWAS on white northern Europeans in 2013 turns up intelligence variants and produces a useful polygenic score, it will almost certainly work on samples of white northern Europeans in 1900 too---but because the time-scale is so short the effect will be subtler and harder to detect.
Nevertheless, a result within a modern population would be much more credible, so we'll focus on that.
How subtle and hard to detect an effect are we talking about here?
[Woodley 2012](/docs/algernon/2012-woodley.pdf "The social and scientific temporal correlates of genotypic intelligence and the Flynn effect") summarizes a number of estimates:
> Early in the 20th century, negative correlations were observed between intelligence and fertility, which were taken to indicate a dysgenic fertility trend (e.g. Cattell, 1936; Lentz, 1927; Maller, 1933; Sutherland, 1929). Early predictions of the rate of dysgenesis were as high as between 1 and 1.5 IQ points per decade (Cattell, 1937, 1936)...In their study of the relationship between intelligence and both completed and partially completed fertility, van Court and Bean (1985) reported that the relationships were predominantly negative in cohorts born between the years 1912 and 1982...Vining (1982) was the first to have attempted an estimation of the rate of genotypic IQ decline due to dysgenesis with reference to a large national probability cohort of US women aged between 24 and 34 years in 1978. He identified significant negative correlations between fertility and IQ ranging from −.104 to −.221 across categories of sex, age and race, with an estimated genotypic IQ decline of one point a generation. In a 10year follow-up study using the same cohort, Vining (1995) re-examined the relationship between IQ and fertility, now that fertility was complete, finding evidence for a genotypic IQ decline of .5 points per generation. Retherford and Sewell (1988) examined the association between fertility and IQ amongst a sample of 9000 Wisconsin high-school graduates (graduated 1957). They found a selection differential that would have reduced the phenotypic IQ by .81 points per generation under the assumption of equal IQs for parents and children. With an estimate of .4 for the additive heritability of IQ, they calculated a more modest genotypic decline of approximately .33 points. The study of Ree and Earles (1991), which employed the NLSY suggests that once the differential fertility of immigrant groups is taken into consideration, the phenotypic IQ loss amongst the American population may be greater than .8 of a point per generation. 
> Similarly, in summarizing various studies, Herrnstein & Murray (1994) suggest that "it would be nearly impossible to make the total [phenotypic IQ decline] come out to less than one point per generation. It might be twice that." (p. 364). Loehlin (1997) found a negative relationship between the fertility of American women aged 35-44 in 1992 and their educational level. By assigning IQ scores to each of six educational levels, Loehlin estimated a dysgenesis rate of .8 points in one generation. Significant contributions to the study of dysgenesis have been made by Lynn, 1996 (see also: 2011) whose book Dysgenics: Genetic deterioration in modern populations provided the first estimates of the magnitude of dysgenesis in Britain over a 90 year period, putting the phenotypic loss at .069 points per year (about 1.7 points a generation assuming a generational length of 25 years). In the same study, Lynn estimated that the genotypic IQ loss was 1.64 points per generation between 1920 and 1940, which reduced to .66 points between 1950 and the present. Subsequent work by Lynn has investigated dysgenesis in other populations. For example Lynn (1999) found evidence for dysgenic fertility amongst those surveyed in the 1994 National Opinion Research Center survey, which encompassed a representative sample of American adults, in the form of negative correlations between the intelligence of adults aged 40+ and the number of children and siblings. Lynn estimates the rate of dysgenesis amongst this cohort at .48 points per generation. In a more recent study, Lynn and van Court (2004) estimated that amongst the most recent US cohort for which fertility can be considered complete (i.e. those born in the years 1940-1949), IQ has declined by .9 points per generation. At the country level, Lynn and Harvey (2008) have found evidence of a global dysgenesis of around .86 points between 1950 and 2000, which is projected to increase to 1.28 points in the period from 2000 to 2050.
> This projection includes the assumption that 35% of the variance in cross-country IQ differences is due to the influence of genetic factors. A subsequent study by Meisenberg (2009), found that the fertility differential between developed and developing nations has the potential to reduce the phenotypic world population IQ mean by 1.34 points per decade (amounting to a genotypic decline of .47 points per decade assuming Lynn & Harvey's 35% estimate). This assumes present rates of fertility and pre-reproductive mortality within countries. Meisenberg (2010) and Meisenberg and Kaul (2010) have examined the factors through which intelligence influences reproductive outcomes. They found that amongst the NLSY79 cohort in the United States, the negative correlation between intelligence and fertility is primarily associated with g and is mediated in part by education and income, and to a lesser extent by more "liberal" gender attitudes. From this Meisenberg has suggested that in the absence of migration and with a constant environment, selection has the potential to reduce the average genotypic IQ of the US population by between .4, .8 and 1.2 points per generation.
All of these estimates are genetic selection estimates: indirect estimates inferred from IQ being a heritable trait and then treating it as a natural selection/breeding process, where a trait is selected against based on phenotype and how fast the trait decreases in each succeeding generation depends on how genetic the trait is and how harsh the selection is.
So variation in these estimates (quoted estimates per generation range from .3 to 3+) is due to sampling error, differences in populations or time periods, expressing the effect by year or generation, the estimate used for heritability, reliability of IQ estimates, and whether additional genetic effects are taken into account---for example, [Woodley et al 2015](http://www.sciencedirect.com/science/article/pii/S0191886915003712) finds -.262 points per decade from selection, but [in another paper](http://www.sciencedirect.com/science/article/pii/S0191886914006278) argues that paternal mutation load must be affecting intelligence by ~-0.84 in the general population, giving a total of -1 per decade.
Dysgenics effects should be observable by looking at genomes & SNP data with known ages/birth-years and looking for increases in total mutations or decreases in intelligence-causing SNPs, respectively.
## Selection on SNPs
Without formally meta-analyzing all dysgenics studies, a good starting point on the selection effect seems like a genetic selection of 1 point per decade or 0.1 points per year or 0.007 standard deviations per year (or 0.7 standard deviations per century).
The most common available genetic data is SNP data, which sequences only the variants most common in the general population; SNP data can measure the effects of genetic selection but not of new mutations (since a new mutation would not be common enough to be worth putting onto a SNP chip).
Given a large sample of SNP data, a birth year (or age), and a set of binary SNP variables which cause intelligence (coded as 1 for the good variant, 0 for the others), we could formulate this as a multivariate regression: `glm(cbind(SNP1, SNP2, ... SNP_N) ~ Year, family=binomial)` and see if the year variable has a negative sign (increasing passage of time predicts lower levels of the good genes); if it does, this is evidence for dysgenics.
Better yet, given information about the effect size of the SNPs, we could sum the net effects of each person's SNPs and then regress on that single variable, gaining precision over testing for independent effects on each SNP: `lm(Polygenic_score ~ Year)`. Again, a negative sign on the year variable is evidence for dysgenics.
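As a toy sketch of that polygenic-score regression (entirely simulated data: the 0.007SD/year slope is the dysgenic rate assumed in this section, and the score is arbitrarily given a mean of 5000 & SD of 50):

~~~{.R}
## simulate a polygenic score declining by 0.007 SDs/year (SD=50, so
## -0.35 points/year) over 1950-2015, and recover the trend by regression:
set.seed(2015)
year  <- sample(1950:2015, 5000, replace=TRUE)
score <- 5000 - (0.007*50)*(year-1950) + rnorm(5000, sd=50)
summary(lm(score ~ year))$coefficients["year",]
## the 'year' coefficient should come out negative & near -0.35,
## the simulated decline per year
~~~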
Directional predictions are weak, and in this case we have quantitative predictions of how big the effects should be.
Most of the public genomes I looked at seem to have the earliest birthdates in the 1950s or so; genomes can come from any age person (parents can give permission, and sequencing has been done prenatally) so the maximum effect is the difference between 1950 and 2015, which is `65*0.007=0.455` standard deviations (but most genomes will come from intermediate birth-dates, which are less informative about the temporal trend---in the optimal experimental design for measuring a linear trend, half the samples would be from 1950 and the other half from 2015).
If the genetic total is going down by 0.455SDs, how much do the frequencies of all the good genes go down?
One simple model of genotypic IQ would be to treat it as a large number of alleles of equal binary effect: a binomial sum of _n_ = 10,000 1/0 variables with _P_ = 50% (population frequency) is reasonable. (For example, GIANT has found a large number of variants for height, and the [GCTA](!Wikipedia)s indicate that SNPs explain much more of variance than the top Rietveld hits currently account for; this specific model is loosely inspired by [Hsu 2014](https://arxiv.org/abs/1408.3421 "On the genetic architecture of intelligence and other quantitative traits").)
In such a model, the average value of the sum is of course `n*p=5000`, and the SD is `sqrt(n*p*(1-p))` or `sqrt(10000*0.5*0.5)` or 50.
Applying our estimate of dysgenic effect, we would expect the sum to fall by `0.455*50=22.75`, so we would be comparing two populations, one with a mean of 5000 and a dysgenic mean of 4977.25.
If we were given access to all alleles from a sample of 1950 and 2015 genomes and so could construct the sum, how hard would it be to tell the difference? In this case, the sum is normally distributed as there are more than enough alleles to create normality, so we can just treat this as a two-sample normally-distributed comparison of means (a _t_-test), and we already have a directional effect size in mind, -0.455SDs, so:
~~~{.R}
power.t.test(delta=0.455, power=0.8, alternative="one.sided")
# Two-sample t test power calculation
#
# n = 60.4155602
# ...
~~~
A total _n_ = 120 is doable, but it is unlikely that we will know all intelligence genes anytime soon; instead, we know a few.
A new mean of 4977 implies that, since the total number of alleles is the same but the mean has fallen, the frequencies must also fall: the average frequency drops from 0.5 to `4977.25/10000=0.497725`.
To go to the other extreme, if we know only a single gene and we want to test a fall from a frequency of 0.50 to 0.4977, we need infeasibly many samples:
~~~{.R}
power.prop.test(p1=0.5, p2=0.497725, power=0.8, alternative="one.sided")
# Two-sample comparison of proportions power calculation
#
# n = 597,272.2524
# ...
~~~
1.2m datapoints would be difficult to get, and so a single gene test would be unhelpful; further, a single gene could change frequencies solely through genetic drift without the change being due to dysgenic pressures.
We know a number of genes, though: Rietveld gives 4 good hits, so we can look at a polygenic score from that.
They are all of similar effect size and frequency, so we'll continue under the same assumptions of 1/0 and _P_ = 50%.
The non-dysgenic average score is `4*0.5=2`, sd=`sqrt(4*0.5*0.5)=1`. (Naturally, the SD is *much* larger relative to the mean than before, because with so few random variables there is far less opportunity for fluctuations to cancel out.)
The predicted shift is from frequencies of 0.5 to 0.497, so the dysgenic scores should be `4*0.497=1.988`, sd=`sqrt(4*0.497*0.503)=0.999`.
The difference of 0.012 on the reduced polygenic score is _d_=`((2-1.988) / 0.999)=0.012`, giving a necessary power of:
~~~{.R}
power.t.test(delta=0.012006003, power=0.8)
# Two-sample t test power calculation
#
# n = 108904.194
# ...
~~~
So the 4 hits do reduce the necessary sample size, but it's still not feasible to require 218k SNP datasets (unless you are 23andMe or SSGAC or an entity like that).
In the current GWAS literature, there are ~9 hits we could use, but the [upcoming SSGAC paper promises: "We identified 86 independent SNPs associated with EA (_p_ < 5E-8)."](http://drjamesthompson.blogspot.com/2015/09/scholar-in-86-snps.html "'86 genomic sites associated with educational attainment provide insight into the biology of cognitive performance', James J Lee, (at least 200 co-authors)").
So how much would 86 improve over 4?
- mean old: `86*0.5=43`
- sd old: `sqrt(86*0.5*0.5)=4.6368`
- mean new: `86*0.497=42.742`
- sd new: `sqrt(86*0.497*(1-0.497))=4.6367`
- so _d_=`(43-42.742)/4.63675=0.0556`
~~~{.R}
power.t.test(delta=((43-42.742)/4.63675), power=0.8)
# Two-sample t test power calculation
#
# n = 5071.166739
# ...
~~~
So with 86 hits, the necessary total sample drops from ~218k to ~10.1k.
To work backwards: we know with 1 hit, we need over a million SNP datasets (infeasible for any but the largest proprietary databases, who have no interest in studying this hypothesis), and with all hits we need more like 120 genomes (entirely doable with just publicly available datasets like PGP), but how many hits do we need to work with an in-between amount of data like the ~2k genomes with ages I guess may be publicly available now or in the near future?
~~~{.R}
power.t.test(n=1000, power=0.8)
# Two-sample t test power calculation
#
# n = 1000
# delta = 0.1253508704
hits=437;
mean1=hits*0.5; sd1=sqrt(hits*0.5*0.5);
mean2=hits*0.497; sd2=sqrt(hits*0.497*(1-0.497));
d=(mean1-mean2)/mean(c(sd1,sd2)); d
# [1] 0.1254283986
~~~
With a polygenic score drawing on 437 hits, a sample of 2k suffices to detect the maximum decrease.
This is pessimistic because the 10k alleles are not all the same effect size and GWAS studies inherently will tend to find the largest effects first. So the first 4 (or 86) hits are worth the most.
The distribution of effects is probably something like an inverse exponential distribution: many small near-zero effects and a few large ones.
Rietveld 2013 released the betas for all SNPs, and [the beta estimates can be plotted](http://emilkirkegaard.dk/en/?p=5574 "Polygenic traits and the distribution of effect sizes: years of education from Rietveld et al (2013)"); each estimate is imprecise and there are artifacts in the beta sizes (SSGAC confirms that they were rounded to 3 decimals), but the distribution looks like a radioactive half-life graph, an inverse exponential distribution.
With a mean of 1, we can simulate drawing a set of 10k exponentially-distributed effect sizes, yielding genome scores with a mean of 5000 and an SD close to (but larger than) 50, closely mimicking the binomial model:
~~~{.R}
effects <- sort(rexp(10000)/1, decreasing=TRUE)
genomeOld <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), 10000, replace = TRUE), 0, effects) }
mean(replicate(10000, sum(genomeOld())))
# [1] 5000.270218
sd(replicate(10000, sum(genomeOld())))
# [1] 69.82652816
genomeNew <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), 10000, replace = TRUE), 0, effects) }
~~~
With a dysgenic effect of -0.445SDs, that implies a fall in the sum of the random exponentials of ~31 (0.445 × 69.8), which agrees closely with the simulated difference in polygenic genome scores:
~~~{.R}
mean(replicate(10000, sum(genomeOld() - genomeNew())))
# [1] 29.75354558
~~~
For each draw from the old and new populations, we can take the first 4 alleles, which were the ones assigned the largest effects, and build a weak polygenic score and compare means.
For example:
~~~{.R}
polyNew <- replicate(1000, sum(genomeNew()[1:4]))
polyOld <- replicate(1000, sum(genomeOld()[1:4]))
t.test(polyOld, polyNew)
# Welch Two Sample t-test
#
# data: polyOld and polyNew
# t = 0.12808985, df = 1995.8371, p-value = 0.8980908
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.7044731204 0.8029267301
# sample estimates:
# mean of x mean of y
# 17.72741040 17.67818359
~~~
Or to mimic 86 hits:
~~~{.R}
t.test(replicate(1000, sum(genomeOld()[1:86])), replicate(1000, sum(genomeNew()[1:86])))
#
# Welch Two Sample t-test
#
# t = 1.2268929, df = 1997.6307, p-value = 0.2200074
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.8642674547 3.7525210076
# sample estimates:
# mean of x mean of y
# 244.5471658 243.1030390
~~~
Using the exponential simulation, we can do a parallelized power analysis: simulate draws (_i_ = 300) & tests for a variety of sample sizes to get an idea of what sample size we need to get decent power with 86 hits.
~~~{.R}
library(ggplot2)
library(parallel) # warning: mclapply's forking does not parallelize on Windows
library(plyr)
genomeOld <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), length(efft), replace = TRUE), 0, efft) }
genomeNew <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), length(efft), replace = TRUE), 0, efft) }
simulateStudy <- function(n, hits) {
    effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
    polyOld <- replicate(n, sum(genomeOld(effects)))
    polyNew <- replicate(n, sum(genomeNew(effects)))
    t <- t.test(polyOld, polyNew, alternative="greater")
    return(data.frame(N=n, P=t$p.value, PO.mean=mean(polyOld), PO.sd=sd(polyOld),
                      PN.mean=mean(polyNew), PN.sd=sd(polyNew))) }
hits <- 86
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
sampleSizes <- seq(500, 5000, by=100)
iters <- 300
powerExponential <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
# N P PO.mean PO.sd PN.mean
# Min. : 500 Min. :0.000000000 Min. :222.5525 Min. :23.84966 Min. :221.2894
# 1st Qu.:1600 1st Qu.:0.002991554 1st Qu.:242.8170 1st Qu.:26.46606 1st Qu.:241.3242
# Median :2750 Median :0.023639517 Median :247.2059 Median :27.04467 Median :245.7044
# Mean :2750 Mean :0.093184735 Mean :247.3352 Mean :27.06300 Mean :245.8298
# 3rd Qu.:3900 3rd Qu.:0.107997575 3rd Qu.:251.7787 3rd Qu.:27.64103 3rd Qu.:250.2157
# Max. :5000 Max. :0.997322161 Max. :276.2614 Max. :30.67000 Max. :275.7741
# PN.sd
# Min. :23.04527
# 1st Qu.:26.45508
# Median :27.04299
# Mean :27.05750
# 3rd Qu.:27.63241
# Max. :30.85065
powerExponential$Power <- powerExponential$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponential); powers
# N Power
# 1 500 0.2133333333
# 2 600 0.2833333333
# 3 700 0.2833333333
# 4 800 0.3133333333
# 5 900 0.3033333333
# 6 1000 0.3400000000
# 7 1100 0.4066666667
# 8 1200 0.3833333333
# 9 1300 0.4133333333
# 10 1400 0.4166666667
# 11 1500 0.4700000000
# 12 1600 0.4600000000
# 13 1700 0.4666666667
# 14 1800 0.4733333333
# 15 1900 0.5233333333
# 16 2000 0.5366666667
# 17 2100 0.6000000000
# 18 2200 0.5900000000
# 19 2300 0.5600000000
# 20 2400 0.6066666667
# 21 2500 0.6066666667
# 22 2600 0.6700000000
# 23 2700 0.6566666667
# 24 2800 0.7133333333
# 25 2900 0.7200000000
# 26 3000 0.7300000000
# 27 3100 0.7300000000
# 28 3200 0.7066666667
# 29 3300 0.7433333333
# 30 3400 0.7133333333
# 31 3500 0.7233333333
# 32 3600 0.7200000000
# 33 3700 0.7766666667
# 34 3800 0.7933333333
# 35 3900 0.7700000000
# 36 4000 0.8100000000
# 37 4100 0.7766666667
# 38 4200 0.8000000000
# 39 4300 0.8333333333
# 40 4400 0.8466666667
# 41 4500 0.8700000000
# 42 4600 0.8633333333
# 43 4700 0.8166666667
# 44 4800 0.8366666667
# 45 4900 0.8666666667
# 46 5000 0.8800000000
qplot(N, Power, data=powers) + stat_smooth()
~~~
![Power for a two-group comparison of old and new SNP datasets for testing a hypothesis of dysgenics](/images/genetics/selection/dysgenics-snps-exponential-optimal.png)
So for a well-powered two-group comparison of 1950 & 2015 SNP datasets using 86 SNPs, we would want ~4,000 in each group for a total _n_ = 8,000. We do have nontrivial power even at a total _n_ = 1,000 (500 in each group yields 21% power), but a non-statistically-significant result would be difficult to interpret; if one wanted to do that, reporting a Bayes factor from a Bayesian hypothesis test would express much more clearly whether the (non-definitive) data is evidence for or against dysgenics.
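As an aside, such a Bayes factor report might look like the following sketch (assuming the `BayesFactor` package; the group means/SDs are rough values borrowed from the simulation summary above rather than re-run):

~~~{.R}
## install.packages("BayesFactor")
library(BayesFactor)
set.seed(2015)
old <- rnorm(500, mean=247.3, sd=27) # hypothetical old-cohort polygenic scores
new <- rnorm(500, mean=245.8, sd=27) # hypothetical new-cohort polygenic scores
ttestBF(x=old, y=new) # BF > 1: evidence for a difference; BF < 1: for the null
~~~

A Bayes factor near 1 would honestly report that the sample is simply too small to say much either way.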
This is still too optimistic since we assumed the optimal scenario of only very old and very new genomes, while available genomes are more likely to be distributed fairly uniformly between 1950 and 2015.
Per ["Optimal design in psychological research"](http://www2.psych.ubc.ca/~schaller/528Readings/McClelland1997.pdf), McClelland 1997, we expect a penalty of ~2x in sample size efficiency in going from the optimal two-group extreme endpoints design to samples being uniformly distributed (due to much of our sample size being wasted on estimating small effects) and so we would expect our sample size requirement to at least double to around _n_ = 16000, but we can do a power simulation here as well.
To get the effect size for each year, we simply split the frequency decrease over each year and generate hypothetical genomes with less of a frequency decrease uniformly distributed 1950-2015, and do a linear regression to get a _p_-value for the year predictor:
~~~{.R}
hits <- 86
sampleSizes <- seq(8000, 30000, by=1000)
iters <- 100
genome <- function(effects) {
    t <- sample(c(1:(2015-1950)), 1)
    decreasedFrequency <- 0.5 - (((0.5-0.497)/(2015-1950)) * t)
    geneFlips <- sample(c(FALSE,TRUE), prob=c(decreasedFrequency, 1-decreasedFrequency), replace = TRUE, length(effects))
    geneValues <- ifelse(geneFlips, effects, 0)
    return(data.frame(Year=1950+t,
                      PolygenicScore=sum(geneValues)))
}
simulateStudy <- function(n, hits) {
    effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
    d <- ldply(replicate(n, genome(effects), simplify=FALSE))
    l <- lm(PolygenicScore ~ Year, data=d)
    p <- anova(l)$`Pr(>F)`[1]
    return(data.frame(N=n, P=p, PO.mean=predict(l, newdata=data.frame(Year=1950)),
                      PN.mean=predict(l, newdata=data.frame(Year=2015)))) }
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
powerExponentialDistributed <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponentialDistributed)
powerExponentialDistributed$Power <- powerExponentialDistributed$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponentialDistributed); powers
# N Power
# 1 8000 0.27
# 2 9000 0.32
# 3 10000 0.35
# 4 11000 0.33
# 5 12000 0.41
# 6 13000 0.34
# 7 14000 0.41
# 8 15000 0.48
# 9 16000 0.55
# 10 17000 0.62
# 11 18000 0.55
# 12 19000 0.60
# 13 20000 0.69
# 14 21000 0.61
# 15 22000 0.65
# 16 23000 0.63
# 17 24000 0.71
# 18 25000 0.67
# 19 26000 0.71
# 20 27000 0.74
# 21 28000 0.70
# 22 29000 0.79
# 23 30000 0.83
qplot(N, Power, data=powers) + stat_smooth()
~~~
![Power to detect dysgenics effect with SNP samples spread over time](/images/genetics/selection/dysgenics-snps-exponential-realistic.png)
In this case, the power simulation suggests the need for triple rather than double the data, and so a total of _n_ = 30,000 to be well-powered.
## Mutation load
The paternal mutation load should show up as an increase (~70 new mutations per generation, ~35 years per generation, so ~2 per year on average) over the past century, while the genetic selection will operate by reducing the frequency of variants which increase intelligence.
If there are ~70 new mutations per generation (of which ~2 are harmful), and there is no longer any purifying selection so that all 70 will tend to remain present, how does that compare to existing mutation load averages and, more importantly, standard deviations?
A [mutation load review](/docs/genetics/selection/2015-henn.pdf "'Estimating the mutation load in human genomes', Henn et al 2015") leads me to some hard figures from [Simons et al 2014](/docs/genetics/selection/2014-simons.pdf "The deleterious mutation load is insensitive to recent population history") ([supplement](/docs/genetics/selection/2014-simons-supplementary.pdf)) using data from [Fu et al 2012](/docs/genetics/selection/2012-fu.pdf "Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants"); particularly relevant is figure 3, the number of single-nucleotide variants per person over the European-American sample, split by estimates of harm from least to most likely: `21345 + 15231 + 5338 + 1682 + 1969 = 45565`.
The supplementary tables give a count of all observed SNVs by category, which sum to `300209 + 8355 + 220391 + 7001 + 351265 + 10293 = 897514`, so the average frequency must be `45565/897514 ≈ 0.05`, and then the binomial SD will be `sqrt(897514*0.05*(1-0.05))=206.47`.
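The arithmetic can be double-checked in R; note that using the exact frequency rather than the rounded 0.05 shifts the SD slightly:

~~~{.R}
perPerson <- c(21345, 15231, 5338, 1682, 1969) # per-person SNV counts, Fu et al 2012
sum(perPerson)
# [1] 45565
totalSNVs <- 300209 + 8355 + 220391 + 7001 + 351265 + 10293 # all observed SNVs
p <- sum(perPerson) / totalSNVs # average frequency, ~0.0508
sqrt(totalSNVs * p * (1-p)) # binomial SD, ~208 (vs ~206 with p rounded to 0.05)
~~~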
Considering the two-sample case of 1950 vs 2015, that's an increase of 130 total SNVs (65 years × ~2 SNVs/year), which is 0.63SDs (130/206), hence:
~~~{.R}
power.t.test(delta=(130/206), power=0.8)
# Two-sample t test power calculation
#
# n = 40.40035398
# ...
~~~
A total of _n_ = 80.
This particular setup for the two-sample test can be seen as a linear model with the optimum design of allocating half the sample to each extreme (see again McClelland 1997); but more realistically, there is an even distribution across years, in which case the penalty is 2x and _n_ = 160.
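That back-of-the-envelope penalty can be checked with a simulation analogous to the earlier SNP one (a sketch: genomes uniform over 1950-2015, SNV counts rising ~2/year around an SD of ~206, regressing count on year; note the SNP simulation above found the realized penalty closer to 3x than 2x):

~~~{.R}
simulateLoadStudy <- function(n) {
    year <- sample(1950:2015, n, replace=TRUE)
    snvCount <- rnorm(n, mean=(year-1950)*2, sd=206) # ~2 extra SNVs/year, SD from above
    anova(lm(snvCount ~ year))$`Pr(>F)`[1] }
## estimated power at the n=160 suggested by the 2x penalty:
mean(replicate(1000, simulateLoadStudy(160)) < 0.05)
~~~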
## Weaknesses
There are some potential problems:
1. Range restriction: in many IQ-related studies, failure to account for selection effects yielding a limited range of IQs may [seriously understate the true correlation](https://en.wikipedia.org/wiki/Statistical_conclusion_validity#Restriction_of_range); this is true in general but particularly common in IQ studies because selection on IQ (eg samples of convenience using only college students) is so universal in human society
This may not be such a large issue when dealing with polygenic scores; even severe IQ selection effects will increase polygenic scores only somewhat because the polygenic scores explain so little of IQ variance in the first place.
2. Self-selection by age: if people providing genetic data are not random samples, then there may be pseudo-trends which can mask a real dysgenic trend or create a pseudo-dysgenic trend where there is none. For example, if young people buying genome or SNP data tend to be above-average in intelligence and scientific interest (which anecdotally they certainly do seem to be), while old people tend to get genomes or SNP data due to health problems (and otherwise have average levels of intelligence and thus polygenic score), then in comparing young vs old, one might find not a dysgenic but a pseudo-eugenic trend instead! Conversely, it could be the other way around, if far fewer elderly get genetic data and younger people are more concerned about future health or are going along with a fad, producing a pseudo-dysgenic effect instead (eg in the PGP genome data, there seem to be disproportionately more PhDs who are quite elderly, while younger participants are a more scattershot sample from the general population; probably relating to the circumstances of PGP's founding & Harvard home).
This is probably an issue with databases that rely on voluntary individual contributions, such as PGP, where selection effects have free play. It would be much less of an issue with longitudinal studies where motivations and participation rates will not differ much by age. Since most dysgenic theories accept that recorded IQ scores have remained stable over the 20th century and the decreases in genetic potential either have not manifested yet or have been masked by the Flynn effect & greater familiarity with tests & loss of some g-loading, one might reason that proxies like educational achievement should be increasing throughout one's sample (since they are known to have increased), and a lack of such a trend indicates selection bias.
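As a toy illustration of the self-selection concern (all numbers hypothetical): suppose young volunteers are self-selected for a +0.3 SD polygenic-score advantage while old volunteers are unselected, with no true cohort trend at all; a young-vs-old comparison then 'detects' a spurious eugenic trend:

~~~{.R}
set.seed(2015)
n <- 2000
young <- rnorm(n, mean=0.3, sd=1) # hypothetical self-selection advantage
old <- rnorm(n, mean=0, sd=1) # unselected; no true genetic trend exists
t.test(young, old) # a highly 'significant' pseudo-eugenic difference appears
~~~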
## Genetic data availability
### Proprietary
The known proprietary databases have long been large enough to carry out either analysis, as well as countless other analyses (but have failed to do so, representing a [tragedy of the anticommons](!Wikipedia)):
1. The mutation load analysis requires a whole-genome sample size small enough to have been carried out by innumerable groups post-2009.
2. For SNPs, an incomplete list of examples of publications based on large samples:
- 23andMe reached [1 million customers in July 2015](http://blog.23andme.com/news/one-in-a-million/), of whom >=80% opt-in to research (>2m as of [August 2017](https://www.fastcompany.com/40438376/after-a-comeback-23andme-faces-its-next-test "After A Comeback, 23andMe Faces Its Next Test: Can the pioneering DNA-testing company satisfy the FDA while also staying true to its founding mission: putting people in control of their healthcare?"), still 80% opt-in, and [>6m as of November 2017](https://www.cnbc.com/2017/11/21/amazon-has-suddenly-become-a-big-marketplace-for-selling-genetic-tests.html "Amazon has suddenly become a big marketplace for selling genetic tests") with sales increasing); the first questions 23andMe asks all customers are age and education, so they likely have at least 700,000 usable SNP datasets for both discovering educational associations & dysgenic tests. In [June 2010](http://blog.23andme.com/23andme-research/23andme-and-a-new-paradigm-for-research/), they claimed to have 29k opt-ins out of 50k customers, implying they were well-powered for a dysgenic test in 2010 if they had access to a polygenic score, and that in the absence of the score, they could have found the 85 SSGAC hits (using _n_ = 305k) themselves somewhere around mid-2011 or 2012 and then done a dysgenic test.
- the SSGAC collaboration has _n_ = 305k as of late 2015
- [GIANT: height](http://www.nature.com/ng/journal/v46/n11/full/ng.3097.html "'Defining the role of common variation in the genomic and biological architecture of adult human height', Wood et al 2015"): _n_ = 253,288
- [cholesterol](http://www.nature.com/nm/journal/vaop/ncurrent/full/nm.3980.html "'Genome-wide identification of microRNAs regulating cholesterol and triglyceride homeostasis', Wagschal et al 2015"): _n_ = 188,000
- [UK Biobank](!Wikipedia): _n_ = 152,729 (sequenced/published on as of June 2015; 500k were enrolled and will be covered eventually)
- [diabetes](http://www.nature.com/ng/journal/v44/n9/full/ng.2383.html "'Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes', Morris et al 2012"): _n_ = 114,981
- [Psychiatric Genomics Consortium](http://www.med.unc.edu/pgc/results): has run multiple studies of varying sizes, the second-largest (a bipolar study) peaking at a total control group of _n_ = 51672, and the largest (schizophrenia) at a control group of _n_ = 113,075
- [Parkinson's](http://www.nature.com/ng/journal/v46/n9/full/ng.3043.html "'Large-scale meta-analysis of genome-wide association data identifies six new risk loci for Parkinson's disease', Nalls et al 2014"): _n_ = 100,833 (may overlap with 23andMe and some others)
- [eczema](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3424.html "'Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis', Paternoster et al 2015"): _n_ = 95,464
- [Genetics of Personality Consortium](http://www.tweelingenregister.org/GPC/): eg [Neuroticism](http://palmerlab.org/neuroticism-and-depression-gwas-consortium-paper-accepted-for-publication-in-jama-psychiatry-abraham-palmer-harriet-de-wit-and-amy-hart-are-co-authors/ "Genome-wide association study identifies novel locus for neuroticism and shows polygenic association with Major Depressive Disorder"): _n_ = 63k
- Dutch LifeLines Biobank and Cohort: _n_ > 13,000
- Health and Retirement Survey: _n_ = 12,500
- [Swedish TwinGene project](http://openarchive.ki.se/xmlui/bitstream/handle/10616/41868/Swedish_Twin_Registry_2013.pdf?sequence=3 "'The Swedish Twin Registry: establishment of a biobank and other recent developments', Magnusson et al 2013"): _n_ = 10,682
- TwinsUK registry: _n_ = 4,905
- [Generation Scotland: the Scottish Family Health Study](http://www.generationscotland.org/): _n_ = 24,000
The existing private groups do not seem to have any interest in testing dysgenics, with the possible exception of future GWAS studies examining fertility, one of which is mentioned by [Mills & Tropf 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4577548/ "The Biodemography of Fertility: A Review and Future Research Frontiers"):
> At the time of writing this review, Mills and her research team at the University of Oxford are currently leading a large consortium to engage in the first ever genome-wide association search (GWAS) and meta-analysis of reproductive choice (age at first birth; number of children), conducted in both men and women in over 50 data sets, with the results replicated in additional datasets in a large sample.
The hits in such a GWAS might overlap with intelligence hits, and if the multiple hits increase intelligence but decrease fertility or vice versa (as compared to decreasing or increasing both), that would be evidence for dysgenics.
Or, assuming the betas are reported, polygenic scores for fertility and intelligence could be estimated in independent samples and checked for an inverse correlation.
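A sketch of that check on simulated data (the loadings and the shared factor here are purely hypothetical; real scores would be built from the reported betas):

~~~{.R}
set.seed(2015)
n <- 10000
g <- rnorm(n) # hypothetical shared genetic factor
iqScore <- 0.3*g + rnorm(n) # assumed positive loading on the IQ score
fertilityScore <- -0.2*g + rnorm(n) # assumed negative loading on the fertility score
cor.test(iqScore, fertilityScore) # expect a small but detectable negative correlation
~~~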
### Public
There are a few sources of data, primarily SNP data, which are freely available to all users:
1. [1000 Genomes](!Wikipedia): unusable due to a deliberate policy decision by 1000 Genomes to delete all phenotype data, including age; similar is [69 Genomes](http://www.completegenomics.com/public-data/69-Genomes/). Both likely would be unusable due to the diversity of the global sample (there is no reason to think that dysgenics pressures are operating in every population at the same strength)
2. [OpenSNP](https://opensnp.org): hosting for user-provided SNP & phenotype data with [dumps available](https://opensnp.org/genotypes); hosts ~2k SNP datasets, but only [270 users have birth-years](https://opensnp.org/phenotypes/148)
3. [SNPedia](http://www.snpedia.com/index.php/Genomes) likewise hosts SNP data (overlapping with OpenSNP) and genome data, but a very small number
4. [Genomes unzipped](http://genomesunzipped.org/data) provides a small amount of data
5. [DNA.LAND](https://dna.land/): claims _n_ = 8k based on public participation & input (_n_ = 43k as of [May 2017](https://www.biorxiv.org/content/early/2017/05/09/135715 "'DNA.Land: A Digital Biobank Using A Massive Crowdsourcing Approach', Yuan et al 2017")), but seems to then restrict access to a small set of researchers
6. [Exome Aggregation Consortium](http://exac.broadinstitute.org/about): _n_ = 61,486 exomes; [phenotype data](http://exac.broadinstitute.org/faq) is unavailable
7. [Personal Genome Project](!Wikipedia) (PGP): probably the single largest source of open SNP & genome data. ~1252 participants have registered birthdates according to `demographics.tsv`, and their [statistics page](https://my.pgp-hms.org/public_genetic_data/statistics)'s graphs indicate <300 whole genomes and <1k SNP datasets. Phenotype data [has recently been released as a SQLite database](http://blog.personalgenomes.org/2015/11/19/exploring-the-harvard-pgp-dataset-with-untap/ "Exploring the Harvard PGP Dataset with Untap"), making it easier to work with.
- Genomes: browsing the [user lists for 'Whole genome datasets'](https://my.pgp-hms.org/users), I estimate a total of ~222; looking at the first and last 22 entries, 34 had ages/birth-years, so ~75% of the whole genomes come with the necessary birth-year data, indicating ~166 usable genomes for the purpose of testing dysgenics. With the most recent one uploaded on 2015-10-12, and the earliest recorded being 2011-09-16, that suggests the available genome number increases by ~0.25/day. 166 is uncomfortably close to the requirement for a well-powered test, and there may not be enough data to account for glitches in the data or allow for more complicated statistical testing, but if we wanted to double the available data, we'd only need to wait around 885 days or 2.5 years (or less, depending on whether the collapse in genome sequencing prices continues and prices drop below even the current \$1k genomes).
- SNPs: PGP has [~656 23andMe SNP datasets](https://my.pgp-hms.org/public_genetic_data?utf8=%E2%9C%93&data_type=23andMe&commit=Search) (the number of SNP datasets sourced from other providers is quite small so I didn't include them), dated 2011-01-06--2015-10-21, so, assuming the same birth-date percentage, ~0.37 per day. Unfortunately, to get 30k SNP datasets through PGP, we would have to wait (linearly extrapolating) 291 years. (Making matters worse, in October 2015, 23andMe doubled its price and reduced the quality of SNP coverage, which will discourage many users and push other users to purchase whole-genome sequencing instead.)
# Power analysis for racial admixture studies of continuous variables {.collapse}
> I consider power analysis of a genomic racial admixture study for detecting genetic group differences affecting a continuous trait such as IQ in US African-Americans, where ancestry is directly measured by genome sequencing and the comparisons are all within-family to eliminate confounding by population structure or racism/colorism/discrimination. The necessary sample size for well-powered studies is closely related to the average size of differences in ancestry percentage between siblings, as the upper bound on IQ effect per percentage is small, requiring large differences in ancestry to detect easily. A within-family comparison of siblings, due to the relatively small differences in ancestry between siblings estimated from IBD measurements of siblings, might require _n_ > 50,000 pairs of siblings to detect possible effects on IQ, an infeasible sample size. An alternative design focuses on increasing the available ancestry differences within a family unit by comparing *adoptees* with siblings; the larger within-population standard deviation of ancestry creates larger & more easily-detected IQ differences. A random-effects meta-analysis of past admixture & ancestry studies suggests the SD in heterogeneous samples may range from 2% to 20% with a mean of 11% (95% predictive interval), yielding sample sizes of _n_ > 20,000, _n_ = 1100, and _n_ = 500. Hence, an adoption study is probably in the feasible range, with required sample sizes comparable to annual adoption rates among US African-Americans.

[Admixture studies](!Wikipedia "Genetic admixture#Mapping") examine racial phenotypic differences in traits such as blood pressure by comparing people with ancestry from multiple groups, and correlating differences in ancestry percentage with differences in the phenotype.
So, for example, African-Americans have higher blood-pressure than white Americans, and most African-Americans have an average white ancestry of something like 20-25% (see later); if having 26% white ancestry predicts slightly lower blood pressure while 24% predicts higher, that suggests the difference is (as is currently believed) genetic; and this logic can be used to narrow down to specific chromosome regions, and has contributed to study of [racial differences in disease](!Wikipedia "Race and health").
One application would be to thorny questions like potential group differences in non-medical traits like intelligence.
The standard admixture design, requiring a few thousand subjects spanning the full range, might not necessarily work here because of the claimed environmental effects.
[A proposed resolution](http://humanvarieties.org/2013/03/29/cryptic-admixture-mixed-race-siblings-social-outcomes/ "Cryptic Admixture, Mixed-Race Siblings, & Social Outcomes") to the question is to do an admixture study comparing African-American siblings.
Siblings are highly genetically related on average (50%) but in a randomized fashion due to recombination; so two siblings, including fraternal twins, born to the same parents in the same family in the same neighborhood going to the same schools, will nevertheless have many different variants, and will differ in how related they are---the average is 50% but it could be as low as 45% or as high as 55%.
So given two siblings, they will differ slightly in their white ancestry, and if indeed white ancestry brings with it more intelligence variants, then the sibling with the higher white-ancestry percentage ought to be slightly more intelligent on average, and this effect will have to be causal, as the inheritance is randomized and all other factors are equal by design.
(A result using ancestry percentages measured in the general population, outside families, would be able to make far more powerful comparisons by comparing people with ~0% white ancestry to those with anywhere up to 100%, and require small sample sizes, and such analyses have been done with the expected result, but are ambiguous & totally unconvincing, as the correlation of greater whiteness with intelligence could easily be due to greater SES or greater blackness could be a marker for recent immigration or any of a number of confounds that exist.)
This has historically been difficult or impossible since how does one measure the actual ancestry in siblings?
But with the rise of cheap genotyping, precise measurement of actual (rather than average) ancestry can be done for <\$100, so that is no longer an obstacle.
## Sibling power analysis
How many sibling pairs would this require?
- you are trying to regress `IQ_difference ~ Ancestry_difference`
- the SD of the IQ difference of siblings is known---it's ~13 IQ points (nonshared environment + differences in genetics)
- of this, a small fraction will be explained by the small difference in ancestry percentage
- the power will be determined by the ratio of the sibling SD to the IQ-difference-due-to-ancestry-difference SD, giving an effect size, which combined with the usual alpha=0.05 and beta=0.80, uniquely determines the sample size
- IQ-difference-due-to-ancestry-difference SD will be the advantage of better ancestry times how much ancestry differs
- if you knew the number of relevant alleles, you could calculate through the binomial the expected SD of sibling ancestry differences. As there are so many alleles, it will be almost exactly normal. So it's not surprising that siblings overall, for all variants, are 50% IBD with an SD of 4%.
If we treated it as simply as possible, [Visscher 2006](http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0020041 "Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings") for an analogous height analysis says they measured 588 markers. So a binomial with 588 draws and _p_ = 0.5 implies that 294 markers are expected to be shared, with a variance of 147:
~~~{.R}
588 * 0.5*(1-0.5)
# [1] 147
~~~
so the SD around the expected 294 is ~12, which is ~4% of the mean:
~~~{.R}
sqrt((588 * 0.5*(1-0.5)))
# [1] 12.12435565
12/294
# [1] 0.04081632653
~~~
Visscher does a more complicated analysis taking into account closeness of the markers and gets an SD of 3.9%: equation 7; variance = `1/(16*L) - 1/(3*L^2)`, where L = 35, so
~~~{.R}
L=35; sqrt(1/(16*L) - (1/(3*L^2)))
# [1] 0.03890508247
~~~
And [Hill & Weir 2011's](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3070763/ "Variation in actual relationship as a consequence of Mendelian sampling and linkage") theoretical modeling gives expected sibling SDs of 3.92%/3.84% (Table 2), which are nearly identical.
So whatever the mean admixture is, I suppose it'll have a similar SD of ~4% of itself.
IIRC, African-Americans are ~25% admixed, so with a mean admixture of 25%, we would expect sibling ancestry differences to have an SD of $0.04 \cdot 0.25 = 0.01$, ie. ~1 percentage point.
If that 75% missing white ancestry accounts for 9 IQ points or 0.6 SDs, then each percentage of white ancestry would be 0.6/75 = 0.008 SDs.
So that SD of 1% more white ancestry yields an SD of 0.008 IQ SDs, which is superimposed on the full sibling-difference SD of 0.866 (13/15), giving a standardized effect size/_d_ of 0.008 / 0.866 = 0.0092.
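Before simulating, a closed-form check (assuming the `pwr` package, and treating the tiny _d_ as approximately a correlation) gives the required number of sibling pairs:

~~~{.R}
## install.packages("pwr")
library(pwr)
pwr.r.test(r=0.0092, power=0.8, sig.level=0.05) # roughly 93,000 pairs needed
~~~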
Let me try a power simulation:
~~~{.R}
n <- 10000
siblings <- data.frame(
    sibling1AncestryPercentage = rnorm(n, mean=25, sd=1),
    sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),
    sibling2AncestryPercentage = rnorm(n, mean=25, sd=1),
    sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))
siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(0.008*15))
siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(0.008*15))
siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )
summary(siblings)
# ...
# siblingAncestryDifference siblingIQDifference
# Min. :-5.370128122 Min. :-68.2971343
# 1st Qu.:-0.932086950 1st Qu.:-11.7903864
# Median : 0.002384529 Median : -0.2501536
# Mean : 0.007831583 Mean : -0.4166863
# 3rd Qu.: 0.938513265 3rd Qu.: 11.0720667
# Max. : 5.271052675 Max. : 67.5569825
summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# ...Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.4192761 0.1705125 -2.45892 0.0139525
# siblingAncestryDifference 0.3306871 0.1220813 2.70874 0.0067653
#
# Residual standard error: 17.05098 on 9998 degrees of freedom
# Multiple R-squared: 0.000733338, Adjusted R-squared: 0.0006333913
# F-statistic: 7.337294 on 1 and 9998 DF, p-value: 0.006765343
confint(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# 2.5 % 97.5 %
# (Intercept) -0.75351500523 -0.08503724643
# siblingAncestryDifference 0.09138308561 0.56999105507
admixtureTest <- function(n, alpha=0.05, ancestryEffect=0.008) {
    siblings <- data.frame(
        sibling1AncestryPercentage = pmax(0, rnorm(n, mean=25, sd=1)),
        sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),
        sibling2AncestryPercentage = pmax(0, rnorm(n, mean=25, sd=1)),
        sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))
    siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
    siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(ancestryEffect*15))
    siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
    siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ)
    p <- summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]
    return(p < alpha) }
~~~
> Samples taken from the extremes of mixtures of distributions can have very different properties than random samples, such as the tail effect of wildly disproportionate representation of one distribution due to order statistics/threshold selection. This can be used to infer differing means. I demonstrate working backwards from the racial composition of TIP/SMPY samples of extremely (1-in-10,000) gifted youth to estimate the overall racial means, which is consistent with the known racial means and hence an unbiased selection process, using ABC to infer Bayesian credible intervals on the estimated means.

The properties of statistical distributions can be very different from the properties of *specific subsets* of those distributions in counterintuitive ways.
A point drawn from an extreme will exhibit "regression to the mean", a phenomenon which routinely trips people up.
Another common example is that a small difference in means for many distributions can lead to large differences in [extreme](!Wikipedia "Extreme value theory") subsets.
For example, male and female average heights differ by a relatively small amount, inches at most.
So in a random sample, plenty of women will be taller than men, and vice versa.
However, if we instead ask the sex of the tallest person in the sample, it will usually be male, and the larger the sample, the more certain we can be that the tallest person will be male, and that the top X% by height will be male.
Likewise, if we wanted to start a basketball league and recruited the tallest 100 people in the country, this small mean difference will show up as our entire basketball league turning out to be male.
(And since height is highly heritable, we may find out that [many of them are related](http://www.wsj.com/articles/nba-basketball-runs-in-the-family-1464130236 "Why Basketball Runs in the Family: A new WSJ study finds 48.8% of players are related to an elite athlete-that number is 17.5% for the NFL and 14.5% for MLB")!)
What seemed like a small difference becomes a large one; we could have worked it out in advance if we had thought about it.
Reasoning from the general to the particular turned out to be tricky in this case because we were dealing with extreme values rather than random samples---1 basketball player *chosen by height* from thousands of people.
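The height example can be checked with a quick simulation; the means/SDs below are rough illustrative assumptions rather than measured values:

~~~{.R}
## Sketch: a small mean difference dominates the extremes (parameters assumed:
## male heights ~ N(70, 3) inches, female ~ N(64.5, 2.5)):
set.seed(2016)
n <- 100000
male   <- rnorm(n, mean=70,   sd=3)
female <- rnorm(n, mean=64.5, sd=2.5)
## in random pairs, women are regularly taller than men:
mean(female > male)
## but the tallest 100 people out of all 200,000 are essentially all male:
height <- c(male, female)
sex    <- rep(c("M", "F"), each=n)
table(sex[order(height, decreasing=TRUE)[1:100]])
~~~

The pairwise comparison comes out at a non-trivial fraction, while the extreme subset is nearly homogeneous.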
Many things of great interest turn out to be like that: we are interested in the extremes much more than the expectation.
Running a 2-hour marathon is an extreme on athleticism; winning the Nobel is an extreme on scientific accomplishment; playing in the NBA is an extreme on height; being admitted to MIT/Stanford/Harvard is an extreme on intelligence; murdering someone is an extreme on violence; winning an Academy Award is an extreme on acting success.
When we ask questions like, "why does the world record in this sport keep being shattered" or "why are so many NBA players related" or "how good can we expect the best chess player to be in 10 years" or "does this racial composition prove bias" or "how much more important are the best authors in literature than obscurer figures" or "why do so few women win the Fields Medal", we're asking extreme value questions whose answers may be counterintuitive---and the answer may be as simple as the shape of distributions, and a slightly lower mean here or a slightly higher standard deviation there.
(Working backwards from a sample selected for passing a threshold to a mean can be called "the method of limits" or "the method of thresholds".)
The study ["When Lightning Strikes Twice: Profoundly Gifted, Profoundly Accomplished", Makel et al 2016](/docs/iq/smpy/2016-makel.pdf) describes the accomplishments of the [Duke TIP](!Wikipedia "Talent Identification Program") sample, 259 children selected for their intelligence by taking the highest-scorers out of 425,000 adolescents taking the SAT (usually <13yo) starting in 1981, representing the top 0.01% of the test-takers.
The TIP sample parallels the better-known SMPY sample, which also selected extremely intelligent adolescents, who were included in a longitudinal sample.
It's frequently suggested, based on anecdotal evidence or biased convenience samples, that more intelligence may not be better: extremely intelligent people may be unhealthy, neurotic, insane, isolated, lonely, discriminated against by society and their peers, and doomed to failure; or, if things are not quite that dire (studies consistently show outcomes improving up to ~130), then past that point greater intelligence may stop making any difference, with little difference between someone with an IQ of 130 and one of 160.
This is difficult to study cross-sectionally, because once you start talking about as extreme as 0.01%, it is difficult to recruit any subjects at all, and your sample will be biased in unknown ways; if you only look at successful people, you are missing the hypothetical homeless bum living out of a trash can who is a troubled and misunderstood genius.
To solve these problems, you want to filter through hundreds of thousands of people so you can select the very brightest possible, and you want to find them as early as possible in life, before they have had any chance to fail or succeed, and track them longitudinally as they grow up.
This is what the SMPY & TIP studies do, and the results are that the subjects are spectacularly successful in life; great intelligence is not harmful and the returns to greater intelligence are not zero even as high as 1 in 10,000.
Makel et al 2016 also reports the ethnic breakdown of the TIP and SMPY samples: 72% white, 22% Asian, 6% not reported or other.
This distribution might seem remarkable given that subjects taking the SAT in 1981 were born ~1970, when the USA was ~77% white, ~11% black, and ~0.7% Asian; so whites are slightly under-represented, blacks are severely under-represented (even if we assume the entire 6% 'other' category is black, that is still barely half their population share), and Asians are 31x (!) over-represented.
~~~{.R}
## TIP/SMPY sample size & ethnic percentages: https://pbs.twimg.com/media/Cj9DXwxWEAEaQYk.jpg
tip <- 259; smpy <- 320 ## total: 579
white <- ((0.65*tip) + (0.78*smpy)) / (tip+smpy)
asian <- ((0.24*tip) + (0.20*smpy)) / (tip+smpy)
white; asian
# [1] 0.7218480138
# [1] 0.2178929188
# http://drjamesthompson.blogspot.com/2016/06/some-characteristics-of-eminent-persons.html
# > The data on ethnicity are rather sparse, but we can do a little bit of work on them by looking at US Census
# > figures for the 1970s when most of these children were born: White 178,119,221...Asia 1,526,401...So, in the
# > absence of more detailed particulars about the Other category, Asians win the race by a country mile. If we
# > simplify things by considering only Whites, Blacks and Asians the US in 1970 then the country at that time was
# > 88% White, 11% Black, and less than 1% Asian. The actual results of eminent students are 77% White, 0% Black,
# > 22% Asian. No need for a Chi square.
#
# Asian is 0.7%: 1526401 / (178119221 / 0.80)
whiteRR <- white / 0.77; asianRR <- asian / 0.007
whiteRR; asianRR
# [1] 0.937464953
# [1] 31.12755983
~~~
Of course, races in the USA have long differed by mean intelligence, with the rule of thumb being Asians ~105 IQ, whites ~100, and blacks ~90.
So the order is expected---but still, 31x!
Are the results being driven by some sort of pro-Asian bias or otherwise bizarre?
But this is an extreme sample. 1-in-10,000 is far out on the tails: 3.71SDs.
~~~{.R}
-qnorm(1/10000)
# [1] 3.719016485
~~~
Maybe this is normal. Can we work backwards from the overrepresentations to what differences would have generated them?
Yes, we can, even with this small sample which is so extreme and unrepresentative of the general population.
This is because it is an [order statistics](!Wikipedia) problem: we know the order represented by the sample and so can work back to parameters of the distribution the order statistics are being generated by.
Since IQ is normally distributed, and we know both the overrepresentation RR and the exact cutoff/limit used in the sample, we can convert the limit to _a_ standard deviations, and then find the normal distribution $\mathcal{N}(100+x,15)$ whose tail beyond _a_ standard deviations is RR (31) times larger than that of $\mathcal{N}(100,15)$.
We can compare using two `pnorm`s, shifting the second by the hypothesized gain; for example, a shift of 15 IQ points or 1 SD would lead to 84x overrepresentation:
~~~{.R}
pnorm(qnorm(1/10000)) / pnorm(qnorm(1/10000) - (15/15))
# [1] 84.39259519
~~~
We would like to solve for the shift which yields an exact overrepresentation like 31.127; an optimization routine like R's `optim` function can do that, but it requires an error to minimize. Minimizing `pnorm()/pnorm(x)` doesn't work, since that just heads off to negative infinity, nor will `RR == pnorm()/pnorm(x)`, because it evaluates to 0 for all values of _x_ except the exact right one. Instead, we minimize the squared error between the ratio predicted by a particular _x_ and our observed RR.
This works:
~~~{.R}
## An optimization routine which automatically finds for us the IQ increase which most closely matches the RR:
solver <- function(RR, cutoff=10000) {
    optim(1,
          function(IQ_gain) { (RR - (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff) - (IQ_gain/15))))^2 })$par }
100 + solver(whiteRR)
# [1] 99.75488281
100 + solver(asianRR)
# [1] 111.8929688
~~~
So our inferred white & Asian population means are 99.8 and 111.9 respectively; these are relatively close to the expected values.
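As a sanity check (a sketch reusing the same tail-ratio formula; `ratioAt` is my own name for it), plugging the solved gains back in should approximately recover the observed RRs:

~~~{.R}
## Round-trip check: the overrepresentation implied by the solver's estimated
## gains should reproduce the observed RRs (~0.94 white, ~31.1 Asian):
ratioAt <- function(gain, cutoff=10000) {
    pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff) - (gain/15)) }
ratioAt(-0.25)  ## ~0.94, the white RR
ratioAt(11.89)  ## ~31, the Asian RR
~~~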
This approach can be used to infer other things as well.
For example, the TIP/SMPY papers have not, as far as I've seen, mentioned what fraction of the white subjects were ethnic Jewish; since they are so over-represented in areas like Nobel prizes, we would expect many of the TIP/SMPY white students to have been Jewish.
Using an estimate of the Jewish population in 1970 and estimates of their mean IQ, we can work forward to what fraction of TIP/SMPY subjects might be Jewish.
The [1970-1971 National Jewish Population Study](http://www.jewishdatabank.org/studies/details.cfm?StudyID=304) estimated "5,800,000 persons (of whom 5,370,000 were Jews) living in Jewish households" out of a total US population of 205 million, or 2.8% of the total population or ~3.6% of the white population. So of the ~418 white subjects, ~15 would be expected to be Jewish under the null hypothesis of no difference.
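The back-of-the-envelope arithmetic, with the ~79% white population share being my own assumption for illustration:

~~~{.R}
## ~5,800,000 persons in Jewish households out of ~205m total; assuming whites
## were ~79% of the 1970 population, that is ~3.6% of whites, and so ~15 of the
## ~418 white subjects would be expected to be Jewish under the null:
5800000 / 205000000                                  ## fraction of total population
jewishFractionOfWhite <- 5800000 / (205000000 * 0.79)
jewishFractionOfWhite * 418                          ## expected Jewish white subjects
~~~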
The majority of American Jews are of Ashkenazi descent^[This renders the population estimate a bit off, but I couldn't find any sources on the breakdown of Sephardic vs Ashkenazi in the USA in 1970 other than a comment that the latter were a "vast majority". Since the Jewish Population Study was probably an undercount in not including all the people of Jewish descent, I'm hopeful those two biases cancel out.], for whom intelligence estimates are [debated](!Wikipedia "Ashkenazi Jewish intelligence#Evidence for a group difference in intelligence") but tend to range 105-115 (with occasional samples suggesting even higher values, like [Levinson 1957](/docs/iq/1957-levinson.pdf "The Intelligence of Applicants for Admission to Jewish Day Schools")).
In the [Barbe 1964](http://files.eric.ed.gov/fulltext/ED013518.pdf "One In A Thousand: A Comparative Study of Highly and Moderately Gifted Elementary School Children") Ohio sample (IQ ~143), 8% were Jewish^[The high IQ sample in Barbe 1964 would have been >8% Jewish, but the paper only reports the overall Jewishness, without specifying whether it's 4% vs 12% or something like that.]; in [Terman's](!Wikipedia "Genetic Studies of Genius") (ratio IQ >140) 1920s sample in SF/LA, 10% were Jewish; Hollingworth's 1930s sample (>180) turned up 51⁄55 or 90% Jewish[^Hollingworth]; [Byrns 1936's](/docs/iq/1936-byrns.pdf "Intelligence and Nationality of Wisconsin School Children") 1931 Wisconsin state sample found 18% of the Jewish sample to be in the top decile vs 10% American; in the [Hunter College Elementary School](!Wikipedia) sample 1948-1960 (>140, mean 157) in New York City, 62% were Jewish ([Subotnik et al 1989](/docs/iq/1989-subotnik.pdf "High IQ children at midlife: An investigation into the generalizability of Terman's genetic studies of genius"), [Subotnik et al 1993](/docs/iq/1993-subotnik-geniusrevisited.pdf "Genius Revisited: High IQ Children Grown Up")[^Subotnik1993]).
Given estimates of the Jewish population of children in those specific times and places, one could work backwards to estimate a Jewish mean.
[^Subotnik1993]: Subotnik et al 1993, pg3-4:
> The mean IQ of the Hunter sample was 157, or approximately 3.5 standard deviations above the mean, with a range of 122 to 196 on the L-M form. [Stanford-Binet Intelligence Scale, Form L-M (SBL-M)]
>
> ...Each class at Hunter College Elementary School from the years 1948 to 1960 contained about 50 students, yielding a total possible population of 600 graduates...35% of the total population of 1948-1960 HCES students (_n_ = 210) completed and returned study questionnaires
>
> ...*Religious Affiliation*: The Hunter group is approximately 62% Jewish, although they describe themselves as Jews more in terms of ethnic identity than religious practice. The group, as a whole, is not religious.
>
> *Educational Attainments*: Over 80% of the study participants held at least a Master's degree. Furthermore, 40% of the women and 68% of the men held either a Ph.D, LL.B., J.D., or M.D. degree. Occupation and Income: Only two of the HCES women identified themselves primarily as homemakers. 53% were professionals, working as a teacher at the college or pre-college level, writer (journalist, author, editor), or psychologist. The same proportion of HCES men were professionals, serving as lawyers, medical doctors, or college teachers. The median income for men in 1988 was \$75,000 (range = \$500,000) and for women \$40,000 (range = \$169,000). Income levels were significantly different for men and women, even when matched by profession. For example, the median income for male college teachers or psychologists was \$50,000 and for females, \$30,000
We can calculate the fraction of the white sample being Jewish for each possible mean IQ:
[^Hollingworth]: [Hollingworth & Rust 1937](/docs/iq/1937-hollingworth.pdf "Application of the Bernreuter Inventory of Personality to Highly Intelligence Adolescents"): "The data of the present study were obtained early in 1933, the subjects being 36 boys and 19 girls, of the average age of 18 years 6 months. The IQ's (S-B) of all had been taken in early childhood (9). The group ranged from 135-190 IQ (S-B), with a median at about 153 IQ (S-B). All but four of these young persons were Jewish, a factor which must be considered as of possible consequence (8, 14)..."
~~~{.R}
proportion <- function (gain, cutoff=10000) {
    pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff) - (gain/15)) }
possibleIQs <- seq(5, 15, by=0.5)
data.frame(Advantage=possibleIQs, Fraction.of.white=(sapply(possibleIQs, proportion) * 15) / 418)
#    Advantage Fraction.of.white
# 1        5.0      0.1415427303
# 2        5.5      0.1633099334
# 3        6.0      0.1886246225
# 4        6.5      0.2180947374
# 5        7.0      0.2524371552
# 6        7.5      0.2924980125
# 7        8.0      0.3392769622
# 8        8.5      0.3939561508
# 9        9.0      0.4579348680
# 10       9.5      0.5328710150
# 11      10.0      0.6207307813
# 12      10.5      0.7238482059
# 13      11.0      0.8449966589
# 14      11.5      0.9874747049
# 15      12.0      1.1552093388
# 16      12.5      1.3528802227
# 17      13.0      1.5860693342
# 18      13.5      1.8614413902
# 19      14.0      2.1869615788
# 20      14.5      2.5721585555
# 21      15.0      3.0284424112
~~~
Judging from earlier samples with very high cutoffs, I'd guess TIP/SMPY has at least a majority Jewish, giving a mean IQ of ~110; this is pleasantly similar to estimates based on regular samples & estimation. This result is also similar to [La Griffe du Lion's 2003 threshold analysis](http://www.lagriffedulion.f2s.com/ashkenaz.htm "Assessing the Ashkenazic IQ") estimating a mean IQ of 112 based on Ashkenazi overrepresentation among USSR championship chess players, 111 based on Western [Fields Medal](!Wikipedia) awards, and 110 based on the USA/Canada [Putnam competition](!Wikipedia "William Lowell Putnam Mathematical Competition").
But if the mean IQ was as high as 112, then almost every single white subject would be Jewish in every sampling, which seems implausible and like something so striking that anyone writing or involved with TIP/SMPY would have to have mentioned at *some* point---right?
For the same reason, the original estimate of 112 for the Asians strikes me as on the high side.
This could be due to problems in the data like underestimating the Asian population at the time---perhaps the Southeast/Midwest states that TIP samples from were more than 0.7% Asian---or it could be due to sampling error (only _n_ = 579, after all).
Working backwards like this doesn't immediately provide any measure of precision such as confidence intervals. Presumably someone has worked out analytic formulas which come with standard errors and confidence intervals, but I don't know of any.
Instead, since the selection process which generated our data is straightforward (population mean -> millions of samples -> take top 1-in-10000s -> calculate overrepresentation), I can again use [Approximate Bayesian computation](!Wikipedia) (ABC) to turn a simulation of the data generating process into a method of Bayesian inference on the unknown parameters (population means) and get credible intervals.
What sort of confidence do we have in these estimates, given that the RRs are based on a sample of only _n_ = 579?
We can simulate TIP/SMPY-like selection by taking the hypothetical means of the two groups, generating ~3 million simulates (`579 * 10000`) each, selecting the top 1⁄10000th^[In retrospect, I could have sped this up considerably by using the [beta transform trick of order statistics](/Order-statistics) to sample the tail directly rather than generating the entire sample.], taking the RRs and then solving for the mean IQ.
If we provide a prior on the means and we hold onto only the means which successfully generate TIP/SMPY-like fractions of 72% & 21%, this becomes ABC with the saved means forming the posterior distribution of means.
(It would likely be faster to use MCMC like JAGS, but while JAGS provides truncated normal distributions which one could sample from quickly, as well as the necessary `pnorm`/`qnorm` functions, it's not clear to me how one could go about estimating the overrepresentation ratio and the binomial.[^JAGS-TIP] There's likely some way to use [order statistics](/Order-statistics) more directly than simulating cutoffs, in which case there is a transformation to a beta distribution over 0-1, which is well-supported by MCMC software and might allow exact solution as well.)
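As a sketch of the speedup alluded to above: rather than generating millions of draws and discarding nearly all of them, one can sample the extreme tail directly, drawing uniforms from the top sliver of the CDF and inverting (the inverse-CDF counterpart of the order-statistics trick; `sampleTail` is my own name, not a library function):

~~~{.R}
## Draw n points from the top `p` tail of N(mean, sd) via the probability-integral
## transform; every draw automatically clears the cutoff, so no rejection is needed:
sampleTail <- function(n, mean, sd=15, p=1/10000) {
    qnorm(runif(n, min=1-p, max=1), mean=mean, sd=sd) }
min(sampleTail(1000, mean=100))  ## always above 100 + 3.719*15 ~ 155.8
~~~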
For my priors, I believe that the rules of thumb of 100 & 105 are accurate and highly unlikely to be more than a few points off, so I use very weak priors of $\mathcal{N}(100, 4)$ & $\mathcal{N}(105, 4)$ on the respective population means.
[^JAGS-TIP]: My first attempt at it in JAGS went like this:
~~~{.R}
model_string <- '
model {
    cutoffIQ <- 100 + 3.719016485*15

    mu_asian ~ dnorm(105, 4^-2)
    X_asian ~ dnorm(mu_asian, 15^-2) # T(cutoffIQ,)
    X_frac_asian <- X_asian > cutoffIQ
    P_asian <- 0.07 * (X_frac_asian / length(X_asian))
    Y_asian ~ dbinom(P_asian, total)

    # mu_white ~ dnorm(100, 4^-2)
    # X_white ~ dnorm(mu_white, 15^-2) # T(cutoffIQ,)
    # X_frac_white <- X_white > cutoffIQ
    # P_white <- (1-0.07) * (X_frac_white / length(X_white))
    # Y_white ~ dbinom(P_white, total)
}
'
library(runjags)
Y_asian=126
Y_white=418
total=579
model <- run.jags(model_string, data = list(Y_asian=Y_asian, Y_white=Y_white, total=total),
monitor=c("mu_asian", "mu_white"),
n.chains = getOption("mc.cores"), method="rjparallel")
summary(model)
~~~
But then I realized that `X_frac_asian <- X_asian > cutoffIQ` didn't do what I thought it did and I needed to somehow draw a large number of samples, just like in the ABC simulation, and compare to the number after the truncation... or something.
In exact ABC, we would keep only data which exactly matched 72%/22%, but that would require rejecting an extremely large number of samples. Here we'll loosen it to ±2% tolerance:
~~~{.R}
simulateTIPSMPY <- function() {
    ## informative priors: IQs are somewhere close to where we would estimate based on other datasets
    whiteMean <- round(rnorm(1, mean=100, sd=5), digits=2)
    asianMean <- round(rnorm(1, mean=105, sd=5), digits=2)

    iqCutoff <- 100 + -qnorm(1/10000) * 15

    whites <- rnorm(0.770 * 579 * 10000, mean=whiteMean, sd=15)
    whiteSample <- max(1, sum(ifelse(whites>iqCutoff, 1, 0)))
    asians <- rnorm(0.007 * 579 * 10000, mean=asianMean, sd=15)
    asianSample <- max(1, sum(ifelse(asians>iqCutoff, 1, 0)))

    ## white+Asian = 92% of original total sample, so inflate by that much to preserve proportions: 1.08
    totalSample <- (whiteSample+asianSample) * (1 + (1-(white+asian)))
    whiteFraction <- round(whiteSample / totalSample, digits=2)
    asianFraction <- round(asianSample / totalSample, digits=2)
    # print(paste("samples: ", c(whiteSample, asianSample), "fractions: ", c(whiteFraction, asianFraction)))

    tolerance <- 0.02
    if ((abs(whiteFraction - 0.7218480138) < tolerance) && (abs(asianFraction - 0.2178929188) < tolerance)) {
        return(data.frame(White=whiteMean, Asian=asianMean))
    }
}
library(parallel); library(plyr)
simulateSamples <- function(n.sample=10000, iters=getOption("mc.cores")) {
    ## because of rejection sampling, no run is guaranteed to produce a sample, so we loop:
    results <- data.frame()
    while (nrow(results) < n.sample) {
        simResults <- ldply(mclapply(1:iters, function(i) { simulateTIPSMPY() }))
        results <- rbind(results, simResults)
        # print(paste("Samples so far: ", nrow(results)))
    }
    return(results) }
posteriorSamples <- simulateSamples()
mean(posteriorSamples$White < posteriorSamples$Asian)
# [1] 1
## we have relatively few samples, so get a better posterior estimate by shuffling the posterior samples & comparing many times:
mean(replicate(1000, mean(c(sample(posteriorSamples$White) < sample(posteriorSamples$Asian)))))
# [1] 0.9968822
quantile(probs=c(0.025, 0.975), posteriorSamples$White, na.rm=TRUE)
# 2.5% 97.5%
# 89.49975 101.38050
quantile(probs=c(0.025, 0.975), posteriorSamples$Asian, na.rm=TRUE)
# 2.5% 97.5%
# 101.37000 116.74075
par(mfrow=c(2,1))
hist(posteriorSamples$White, main="Posterior white mean IQ estimated from TIP/SMPY cutoff & ratio", xlab="IQ")
hist(posteriorSamples$Asian, main="Posterior Asian mean", xlab="IQ")
~~~
![Histograms of the posterior estimate of white & Asian mean IQs ~1970 as estimated from fraction of TIP/SMPY sample using ABC](/images/iq/smpy/iq-smpytip.png){.full-width}
So sampling error does turn out to be substantial: our 95% credible intervals are white 90-101, Asian 101-116.
Still, the overlap is minimal, with _P_ = 99.7% that the Asian mean is higher than the white.
We are able to conclude that the rank ordering is highly likely to be correct, and the results are consistent with the conventional wisdom, so there is no prima facie case for bias in the results: the ethnic composition is in line with what one would calculate from the design of TIP/SMPY and population means.
# _Genius Revisited_: On the Value of High IQ Elementary Schools
> _Genius Revisited_ documents the longitudinal results of a high-IQ/gifted-and-talented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the general high education & income levels, but absence of great accomplishment on a national or global scale (eg a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.
>
> I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making two statistical errors: base-rate neglect, and regression to the mean.
>
> First, their standards fall prey to a base-rate fallacy and even extreme predictive value of IQ would not predict 1 or more Nobel prizes because Nobel prize odds are measured at 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.
>
> Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable as it is driven by the [regression to the mean](!Wikipedia) caused by the relatively low correlation of early childhood with adult IQs---which means their sample is far less elite as adults than they believe. Using early-childhood/adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES's associated high-IQ/gifted-and-talented high school, Hunter High, which has access to the adolescents' more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
>
> This unavoidable statistical fact undermines the main rationale of HCES: extremely high-IQ adults cannot be very accurately selected as kindergarteners on the basis of a simple test. This greater-regression problem can be lessened by the use of additional variables in admissions, such as parental IQs or high-quality genetic polygenic scores; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and HCES students should not be assigned scarce magnet high school slots.
Split out to [separate article](/Hunter).
# Great Scott! Personal Name Collisions and the Birthday Paradox {.collapse}
> "How large does can a social circle be before first names no longer suffice for identification? Scott, I'm looking at you."
>
> [MakerOfDecisions](https://nitter.net/MakerOfDecision/status/759086911084490752), 2016-07-29

Scott here refers to any of [Scott Alexander](https://slatestarcodex.com/about/), [Scott Adams](!Wikipedia), [Scott Aaronson](!Wikipedia), [Scott Sumner](!Wikipedia) (and to a much lesser extent, Scott Garrabrant, [Orson Scott Card](!Wikipedia), and [Scott H. Young](https://www.scotthyoung.com/blog/)); a reference to a 'Scott' on a site like Less Wrong is increasingly ambiguous.
When a large number of samples draw from a common pool of identifiers, collisions are common, leading to the [birthday paradox](!Wikipedia): despite there being 365.25 days in the year, a classroom of just 23 people (who can cover at most 6% of the days in a year) is ~50% likely to have at least two people who share the same birthday and so birthdays cease being unique unambiguous identifiers.
(Intuitively, you might expect the number to be much larger and closer to 180 than 23.)
We can verify this by simulation:
~~~{.R}
dupes <- function(a) { length(a) != length(unique(a)) }
identifiers <- function(n, ids, probabilities) { sample(1:ids, n, prob=probabilities, replace=TRUE) }
simulate <- function(n, ids, probabilities=rep(1/ids, ids), iters=10000) {
    sims <- replicate(iters, { id <- identifiers(n, ids, probabilities)
                               return(dupes(id)) })
    return(mean(sims)) }
simulate(23, 365)
# [1] 0.488
sapply(1:50, function(n) { simulate(n, 365) } )
# [1] 0.0000 0.0029 0.0059 0.0148 0.0253 0.0400 0.0585 0.0753 0.0909 0.1196 0.1431 0.1689 0.1891
# [14] 0.2310 0.2560 0.2779 0.3142 0.3500 0.3787 0.4206 0.4383 0.4681 0.5165 0.5455 0.5722 0.5935
# [27] 0.6227 0.6491 0.6766 0.7107 0.7305 0.7536 0.7818 0.7934 0.8206 0.8302 0.8465 0.8603 0.8746
# [40] 0.8919 0.9040 0.9134 0.9248 0.9356 0.9408 0.9490 0.9535 0.9595 0.9623 0.9732
~~~
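As a check on the simulation, the exact closed-form probability for _n_ = 23 is the complement of all 23 birthdays being distinct:

~~~{.R}
## P(at least one shared birthday among 23 people), exactly:
1 - prod((365 - 0:22) / 365)
# [1] 0.5072972343
~~~

The `simulate(23, 365)` estimates above bounce around this value due to Monte Carlo noise.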
Similarly, in a group of people, it will be common for first names to overlap.
(Overlaps of both first names & surnames are much more unlikely: [Charpentier & Coulmont 2017](https://arxiv.org/abs/1707.07607 "We are not alone! (at least, most of us). Homonymy in large scale social groups") estimate from French & Ohioan data that while almost everyone has a non-unique full name, even groups of thousands of people will have only a few duplicates.)
How common? There are far more than 365.25 first names, especially as some first names are made up by parents.
Names have a highly skewed (often said to be a [power law](!Wikipedia)) distribution: the first few baby names make up an enormous fraction of all names, hence all the Ethan/Lucas/Mason baby boys in 2016.
(One would think that parents would go out of their way to avoid too-popular names, but apparently not.)
Since there are only "10,000 things under heaven", one might think that the top 10000 personal names would give a good guess.
At what _n_ can we expect a collision?
~~~{.R}
findN <- function(ids, targetP=0.5, startingN=1, probabilities=rep(1/ids, ids)) {
    n <- startingN
    collisionProbability <- 0
    while (collisionProbability < targetP) {
        collisionProbability <- simulate(n, ids, probabilities)
        n <- n+1
    }
    return(n) }
findN(10000)
# [1] 118
simulate(118, 10000)
# [1] 0.5031
~~~
We could also use [an approximation](!Wikipedia "Birthday problem#Approximations") such as the square approximation, $n \approx \sqrt{2m \cdot p(n)}$: `sqrt(2 * 10000 * 0.5) → 100`
Or the similar upper bound: `ceiling(sqrt(2*10000*log(2))) → 118`.
So the collision point is smaller than [Dunbar's number](!Wikipedia).
But all of these are themselves upper bounds because the case in which birthdays/names are uniformly distributed is the worst case.
If there is any difference in the probabilities, a collision will happen much earlier.
This makes sense: if 1 birthday happens with, say, $P=0.99$, then it's almost impossible to go more than 3 or 4 birthdays without a collision; likewise if one birthday has $P=0.50$, and so on down to $P=\frac{1}{365.25}$:
~~~{.R}
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.99, rep(0.01/364, 364)))})
# [1] 0.0000 0.9789 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
# [14] 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.5, rep(0.5/364, 364)))})
# [1] 0.0000 0.2531 0.5031 0.6915 0.8182 0.8896 0.9402 0.9666 0.9808 0.9914 0.9951 0.9973 0.9988
# [14] 0.9993 0.9991 0.9999 1.0000 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000
~~~
How skewed are real names?
[Given Names Frequency Project](http://www.galbithink.org/names/agnames.htm) provides ["Popular Given Names US, 1801-1999"](http://www.galbithink.org/names/us200.htm) (1990-1999, 909288 names) based on Social Security data.
After deleting the first 4 lines of `s1990m.txt`, it can be loaded into R and the fractions used as probabilities to find the 50% collision point for US names:
~~~{.R}
names <- read.csv("s1990m.txt", header=FALSE)
summary(names)
# V1 V2
# Aaron : 1 Min. : 55.0000
# Abdiel : 1 1st Qu.: 86.0000
# Abdullah: 1 Median : 183.0000
# Abel : 1 Mean : 914.1923
# Abraham : 1 3rd Qu.: 535.5000
# Adam : 1 Max. :24435.0000
# (Other) :852
sum(names$V2)
# [1] 784377
## "Scott" as fraction of all names:
2279 / 784377
# [1] 0.0029054906
## presumably male names:
2279 / (784377*0.5)
# [1] 0.005810981199
simulate(118, nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 1
findN(nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 15
~~~
So a more realistic analysis suggests _n_ = 15 is where unique first names will probably break down.
This only covers the 853 most common personal names, and the more names there are, the higher _n_ must be to trigger a collision (making 15 an underestimate); to extend the estimate to 10,000 names, we need to fit a distribution and extrapolate below the observed data.
The [log normal distribution](!Wikipedia) fits reasonably well and is easy to work with:
~~~{.R}
library(fitdistrplus)
fitdist(names$V2, "lnorm")
# Fitting of the distribution ' lnorm ' by maximum likelihood
# Parameters:
# estimate Std. Error
# meanlog 5.550448321 0.04640182299
# sdlog 1.359185357 0.03281096378
simulateLN <- replicate(100, {
    names <- rlnorm(10000, meanlog=5.550448321, sdlog=1.359185357)
    hit <- findN(length(names), startingN=46, probabilities=names/sum(names))
    return(hit)
})
median(simulateLN)
# [1] 51
~~~
Since first names will cluster by age group, location, profession, and whatnot, arguably even 51 is a bit of an upper bound.
Finally, one might ask the probability of a group with a great Scott, or to put it another way, the probability of it unfortunately getting away scot-free.
This is easy to answer: the probability of having 1 or more Scotts in a group is 1 minus the probability of everyone having a name other than Scott.
We saw that the probability of being named Scott was _P_ = 0.0029054906 in the name dataset.
So the probability of one person not being named Scott is $1 - 0.0029 = 0.997$. So the probability of _n_ people all being named not-Scott is 0.997^_n_^.
The crossover point is ~239.
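The crossover can be checked directly, using the Scott fraction from the name dataset above:

~~~{.R}
p <- 0.0029054906   # P(a given person is named Scott), from the name dataset
## smallest n at which P(no Scott in the group) = (1-p)^n falls below 50%:
ceiling(log(0.5) / log(1 - p))
# [1] 239
~~~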
So an American social group cannot exceed _n_ = 51 before first names begin to break down, and it is all Scott's fault at _n_ = 239.
# Detecting fake (human) Markov chain bots {.collapse}
Some popular Twitter and Tumblr accounts use [Markov chains](!Wikipedia) trained on a corpus of writing by a single author, such as [Markov James Mickens](https://nitter.net/MarkovMickens), or on two unrelated corpuses to create amusing mashups: programming documentation and [H.P. Lovecraft's horror/SF fiction](http://thedoomthatcametopuppet.tumblr.com/ "The Doom That Came to Puppet") or the [King James Bible](http://kingjamesprogramming.tumblr.com/ "King James Programming") or [the works of Karl Marx](https://marxistprogramming.tumblr.com/ "Marxist Programming"), [Kim Kardashian and Kierkegaard](https://nitter.net/kimkierkegaard "Kim Kierkegaardashian"), or [Silicon Valley recruiting emails and Erowid drug use reports](https://nitter.net/erowidrecruiter "Erowid Recruiter").
The humor comes from the fact that the Markov chains have no understanding and are merely programs producing gibberish that occasionally present striking juxtapositions or insights.
Much of their appeal derives in large part from the fact that while humans *curate* them, humans don't *write* them.
They depend on this authenticity to be striking.
Of course, there's always the temptation to edit them or write them wholesale, perhaps because the Markov chains aren't cooperating in producing any comedy gold to tweet that day, which deceives the reader.
This poses an inverse Turing test: how would you detect a fake Markov chain account, that is, one where a human is pretending to be a computer and writing some of the text?
Markov chains are trained on a specific corpus and are a probabilistic generative model which encode the probability that a word _X_ follows another word _Y_ for all the words in that corpus (and similarly if they are operating on letters or on [_n_-grams](!Wikipedia "n-gram")); there is no state or memory or 'look back' or ability to model recursion.
To generate text, one simply picks a random word _Y_, looks up the probabilities of all the words _A_..._Z_ from _Y_, and picks a word at random weighted by those probabilities; then repeat indefinitely.
Conversely, one could also use it to calculate the [likelihood](!Wikipedia) of a given text by multiplying the probability of each word in the text conditional on the previous one.
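As a concrete sketch (using a toy placeholder corpus, not any of the actual accounts' corpora), a word-level chain, its sampler, and its likelihood function fit in a few lines of R:

~~~{.R}
## toy corpus; a real analysis would tokenize the full source texts
corpus <- c("the", "daemon", "runs", "the", "daemon", "sleeps",
            "the", "old", "daemon", "runs")
pairs <- data.frame(from=head(corpus, -1), to=tail(corpus, -1))
probs <- prop.table(table(pairs$from, pairs$to), margin=1)  # P(next word | current word)

## generate: repeatedly sample the next word weighted by the transition probabilities
generateChain <- function(start, n) {
    out <- start
    for (i in 2:n) { out[i] <- sample(colnames(probs), 1, prob=probs[out[i-1],]) }
    out }

## likelihood of a text: product of each word's probability given the previous word;
## log(0) = -Inf flags a transition the chain could never make
logLik <- function(words) {
    p <- mapply(function(a, b)
                    if (a %in% rownames(probs) && b %in% colnames(probs)) probs[a, b] else 0,
                head(words, -1), tail(words, -1))
    sum(log(p)) }

generateChain("the", 6)
logLik(c("the", "daemon", "runs"))      # finite: possible under the chain
logLik(c("the", "cromulent", "daemon")) # -Inf: impossible vocabulary
~~~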
One difficulty is the potential for double-use of data: the first pass through a Markov chain account is already applying to the data a highly flexible Bayesian neural network with billions of parameters (one's brain).
If one spots an 'anomalous' dataset and subsequent analysis confirms it, what does this mean?
I am reminded of one past incident: someone had lost a great deal of money on a Bitcoin gambling website, and suspected the site had defrauded him.
But he had contacted me only *because* he had had unusual losses. What does an analysis mean? Imagine that the top 1% of losers get angry and start looking into whether they were cheated; they go to a statistician who duly computes that based on the number of games played, there is a _p_ = 0.01 that they would lose as much or more as they did...
If one had *all* the gambling records, one could look at the overall patterns and see if there are more losers than there should be given the rules of the game and a supposedly fair random number generator, but what does one do with 1 self-selected player? The data generation process is certainly neither random nor 'ignorable' nor modelable without dubious assumptions.
A few possible attacks come to mind:
- observation of malformed syntax or lack of long-range dependencies
- vocabulary or output outside an independently trained Markov chain's domain
- unusually low likelihood for an independently trained Markov chain to generate known samples
- unusually low likelihood for an independently trained Markov chain to generate known samples compared to newly generated samples filtered at a 1-in-100s quality level
- unusually high quality of known samples compared to newly generated samples from independently trained Markov chain filtered at a 1-in-100s quality level, tested nonparametrically or parametrically as a mixture model
Markov chains produce realistic-looking output and are efficient to create & run, but, compared to RNNs, they notoriously model recursive syntax poorly, such as nested parentheses (since they have no way of remembering whether a parenthetical comment had been started), and cannot extrapolate---for example, a word-level Markov chain can't create new words, and would require _n_-grams to have available fragments of words which could be recombined.
The memory-less nature of Markov chains also means that, lacking any memory which could model the 'long-range correlations' found in natural English text (like systematic use of particular names/topics/vocabulary), larger samples quickly veer off-topic, degenerate into gibberish, and lack coherency, possibly even within a single sentence.
(RNNs also have this problem, but somewhat less.)
With the limits of a Markov chain in mind, it would be easy to detect faked Markov chain output with large samples: it is just difficult for a human to deliberately generate long text which is as nonsensical and syntactically invalid as a Markov chain creates, for the same reason an unpracticed human is a remarkably bad random number generator.
However, for this same reason the selected Markov samples tend to be very short, usually no more than a sentence.
It *might* be possible to measure this on the samples as a whole and observe higher entropy or memoryless-ness (eg by measuring compression performance or efficiency of a Markov chain in modeling the samples), but I would guess that usually the samples are not long enough or large enough for this to have reasonable [statistical power](!Wikipedia) as a test.
This eliminates the easiest test.
Since the corpus is known in many of these cases, we can assume access to a Markov chain model which is similar (if not identical) to the one which supposedly wrote all the tweets.
This gives us several possibilities.
We could exploit the lack of creativity of Markov chains and look for anything in the tweets which is not present in the original corpus.
For example, if a word like "[cromulent](!Wikipedia)" appears neither in the Puppet documentation nor (having been coined in 1996, 59 years after he died) in H.P. Lovecraft's fiction, then it would have a probability of 0 of being generated by any Puppet/Lovecraft Markov chain (as no word will have any transition probability to it).
Such extra-corporal vocabulary immediately proves human authorship.
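The vocabulary check itself needs nothing more than a set difference (toy data; `corpusWords` and `tweetWords` are hypothetical stand-ins for the tokenized corpus and a suspect tweet):

~~~{.R}
corpusWords <- unique(c("the", "shoggoth", "lurks", "beneath", "the", "manifest"))
tweetWords  <- c("the", "cromulent", "shoggoth", "lurks")
setdiff(tweetWords, corpusWords)
# [1] "cromulent"
## any non-empty result proves a word-level chain could not have written the tweet
~~~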
Continuing this same logic, we could take the corpus, train our own Markov chain (which will at least be similar), and use it to calculate the likelihood of all the tweets.
A human-written tweet may be *possible* for the Markov chain to have written, but it will be far more unlikely than most of the ones the Markov chain actually wrote & were selected.
So we would see that most of the tweets have reasonable log likelihoods, but that our suspicious ones will be far more extreme.
(If the Markov chains are word-level, this test subsumes the impossible-word test: any tweet with a word not in the corpus, and hence not represented in the Markov chain, will have a meaningless likelihood.)
This likelihood test might not help if they are all equally extreme, in which case one could use our Markov chain in another manner, as a generative model, to try to estimate the likelihood of getting as great a tweet.
For this, one samples several thousand samples from our Markov chain, and screens them for good ones.
This creates an empirical distribution of the likelihoods of good tweets conditional on the null hypothesis of a Markov chain author; in this case, the null hypothesis is known to be true by construction.
Then to test, one compares the known-Markov-chain tweets with the likelihoods of the suspect tweets (perhaps with a [permutation test](!Wikipedia)).
They should be similar.
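A minimal version of that permutation test, on simulated per-tweet log-likelihood scores (the means and sample sizes here are arbitrary placeholders, not measurements of any real account):

~~~{.R}
set.seed(2016)
chainScores   <- rnorm(500, mean=-30, sd=5)  # curated tweets known to be chain-written
suspectScores <- rnorm(100, mean=-20, sd=5)  # suspect tweets: implausibly high likelihoods
observedDiff <- mean(suspectScores) - mean(chainScores)
pooled <- c(chainScores, suspectScores)
## null distribution: shuffle the 'suspect' labels and recompute the mean difference
permDiffs <- replicate(10000, {
    i <- sample(length(pooled), length(suspectScores))
    mean(pooled[i]) - mean(pooled[-i]) })
mean(permDiffs >= observedDiff)  # one-sided p-value; ~0 here, flagging the suspects
~~~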
Alternately, if one doesn't want to use likelihoods as a measure of improbability, one could instead use some human measure of funniness, like having raters score the originals and the samples on a 1-5 scale, and comparing them.
The original poster is probably not screening more than a few hundred generated tweets for each selected tweet, so given a similar level of stringency, one's generated tweets should be equally good; if the originals turn out to be extremely better than yours, to a level where you would have to screen thousands of random samples, that is highly suspicious and suggests the originals were 'too good to be true'.
With ratings or likelihoods, one could try to assume a decreasing distribution like an exponential: most samples will be incoherent and totally unfunny, many will be slightly funny, a few will be funny, and a very few will be very funny.
The ratings on samples generated from our Markov chain will probably follow a smooth distribution.
However, if a human is authoring some in an attempt to spice things up, they will be above the average of the Markov chain (otherwise why bother with cheating?), and if there is a substantial number of them, this will create an anomaly in the ratings of the originals---a 'bump' indicating that the tweets are coming from two different populations.
In this case, it can be modeled as a [mixture model](!Wikipedia) with either _k_ = 1 or _k_ = 2, and the _p_-value or Bayesian posterior probability calculated for 1 vs 2.
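One could sketch that comparison with a tiny hand-rolled EM fit, comparing BIC for one vs two normal components (the ratings here are simulated placeholders; a packaged mixture-model fitter such as `mclust` would be the more robust choice in practice):

~~~{.R}
set.seed(1)
ratings <- c(rnorm(300, 1.5, 0.5), rnorm(60, 4, 0.3))  # mostly meh + a 'too good' bump

## k=1: plain normal MLE
ll1  <- sum(dnorm(ratings, mean(ratings), sd(ratings), log=TRUE))
bic1 <- -2*ll1 + 2*log(length(ratings))

## k=2: EM for a two-component normal mixture
mu <- as.numeric(quantile(ratings, c(0.25, 0.75))); s <- rep(sd(ratings), 2); p1 <- 0.5
for (i in 1:200) {
    d1 <- p1*dnorm(ratings, mu[1], s[1]); d2 <- (1-p1)*dnorm(ratings, mu[2], s[2])
    r  <- d1 / (d1 + d2)                             # E-step: responsibilities
    p1 <- mean(r)                                    # M-step: update parameters
    mu <- c(weighted.mean(ratings, r), weighted.mean(ratings, 1-r))
    s  <- c(sqrt(sum(r*(ratings-mu[1])^2)/sum(r)),
            sqrt(sum((1-r)*(ratings-mu[2])^2)/sum(1-r)))
}
ll2  <- sum(log(p1*dnorm(ratings, mu[1], s[1]) + (1-p1)*dnorm(ratings, mu[2], s[2])))
bic2 <- -2*ll2 + 5*log(length(ratings))
bic2 < bic1   # if TRUE, two populations (ie some human-written tweets) are preferred
~~~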
# Optimal Existential Risk Reduction Investment {.collapse}
An existential risk is any risk which destroys or permanently cripples human civilization, such as an asteroid strike or pandemic.
Since humanity might otherwise continue for millions of years, creating untold trillions of humans and colonizing the galaxy, human extinction represents the loss of literally [astronomical amounts of utility](http://www.nickbostrom.com/astronomical/waste.html "'Astronomical Waste: The Opportunity Cost of Delayed Technological Development', Bostrom 2003").
The loss is greater than any disaster up to extinction levels, as humanity can always recover from lesser disasters; but there is no recovery from a total destruction.
Thus, the expected value of even a slight reduction in an exotic risk ought to itself be astronomical, or at least extremely large; under plausible values for well-characterized x-risks like asteroid strikes or nuclear war or pandemic, preventing them may be the charitable spending with the highest expected value and they should be receiving all charitable expenditures.
This strikes people as odd and dangerous reasoning.
Is it really true that we should be spending almost unlimited amounts of money on these things and not otherwise extremely compelling charities like distributing malaria nets in Africa to save millions of lives or vaccine distribution or funding research into ending aging?
And if we should, how do we choose what fraction to spend on global warming rather than artificial intelligence?
What if someone discovers an entirely new x-risk not previously considered, like nearby supernovas or vacuum collapses or nanotechnology 'grey goo'?
Thinking historically, it's clear in retrospect that someone concerned about x-risk would be better off not going after the terminal goal of x-risk reduction but instead spending their money on instrumental goals such as science/technology or economic growth.
Imagine someone in England in 1500 who reasons the same way about x-risk: humanity might be destroyed, so preventing that is the most important task possible.
He then spends the rest of his life researching the Devil and the Apocalypse.
Such research is, unfortunately, of no value whatsoever unless it produces arguments for atheism demonstrating that that entire line of enquiry is useless and should not be pursued further.
But as the Industrial and Scientific Revolutions were just beginning, with exponential increases in global wealth and science and technology and population, ultimately leading to vaccine technology, rockets and space programs, and enough wealth to fund all manner of investments in x-risk reduction, he could instead have made a perhaps small but real contribution by contributing to economic growth through work & investment or by making scientific discoveries.
For example, Isaac Newton's discoveries in astronomy and the laws of motion helped inaugurate threads of work that led directly to space satellites which can watch for asteroids with Earth-crossing orbits.
[Isaac Newton](/Newton) himself was concerned with x-risk, as he feared that the [Great Comet of 1680](!Wikipedia) would, centuries hence, plunge into the Sun and cause expansion destroying the Earth and humanity.
What could Newton have done to directly reduce this x-risk at the time?
Absolutely nothing.
There were no feasible counter-measures nor any foreseeable technologies which could forestall a comet or protect humanity from the Sun engulfing the Earth; there was not and still is not a mine or bomb shelter deep enough for that.
What he *could* have done is close to what he did do: make fundamental advances in science which posterity could build on and one day be rich and wise enough to do something about the x-risk.
As it happens, Newton was not quite right about the Great Comet (comets are not a meaningful fraction of the Sun's mass) but there was a similar x-risk he was unaware of: giant asteroid impacts.
And the solutions to a giant comet---observe all comets carefully to project their future orbits, destroy it, redirect its orbit, evacuate human colonists to unaffected planets (Newton suggested the satellites of the gas giants)---are much the same as for a giant asteroid impact, and all benefit from economic growth & greater science/technology (someone has to pay for, and design, those satellites and spacecraft, after all).
Economic wealth & science/technology are all-purpose goods: they are useful for compound growth, and can also be spent on x-risk reduction.
They are the ultimate instrumental goods.
If one is badly ignorant, or poor, or unable to meaningfully reduce an x-risk, one is better off accepting the x-risk and instead spending resources on fixing the former problems.
One would prefer to get rid of the x-risk as soon as possible, of course, but given one's starting position, there may simply be no better strategy and the risk must be accepted.
This raises the question: what is the optimal distribution of resources to economic growth vs x-risk reduction over time which maximizes expected utility?
Intuitively, we might expect something like early on investing nothing at all in x-risk reduction as there's not much money available to be spent, and money spent now costs a lot of money down the line in lost compound growth; and then as the economy reaches modern levels and the opportunity cost of x-risk becomes dire, money is increasingly diverted to x-risk reduction.
One might analogize it to insurance---poor people skimp on insurance because they need the money for other things which hopefully will pay off later like education or starting a business, while rich people want to buy lots of insurance because they already have enough and they fear the risks.
If this were an investment question, a good strategy would be something like the [Kelly criterion](!Wikipedia) or [probability matching](!Wikipedia) strategies like [Thompson sampling](!Wikipedia): even if the expected value of x-risk reduction is higher than other investments, it only pays off very rarely and so receives a very small fraction of one's investments.
However, it's not clear that the Kelly criterion or Thompson sampling are optimal or even relevant: the Kelly criterion avoids bankruptcy in the form of [gambler's ruin](!Wikipedia), but only by making arbitrarily small bets & refusing to ever risk one's entire wealth; with x-risks, the 'bankruptcy' (extinction) can't be avoided so easily, as the risk is there whether you like it or not, and one cannot reduce it to 0. (This comes up often in discussion of why the Kelly criterion is relevant to decision-making under risk; see also [Peters 2011](http://rsta.royalsocietypublishing.org/content/369/1956/4913 "The time resolution of the St Petersburg paradox") and the niche area of "evolutionary finance" like [Evstigneev et al 2008](http://www.evstigneev.net/EF.pdf "Evolutionary finance")/[Lensberg & Schenk-Hoppé 2006](/docs/statistics/decision/2007-lensberg.pdf "On the Evolution of Investment Strategies and the Kelly Rule---A Darwinian Approach"), which draws connections between the Kelly criterion, probability matching, long-term survival & evolutionary fitness.)
In economics, similar questions are often dealt with in terms of the [life-cycle hypothesis](!Wikipedia) in which economic agents strive to maximize their utility over a career/lifetime while [avoiding inefficient intertemporal allocation of wealth](!Wikipedia "Consumption smoothing") (as Mark Twain put it, "when in youth a dollar would bring a hundred pleasures, you can't have it. When you are old, you get it & there is nothing worth buying with it then. It's an epitome of life. The first half of it consists of the capacity to enjoy without the chance; the last half consists of the chance without the capacity."); in the life-cycle, one tries to build wealth as quickly as possible while young, even going into debt for investments like a college education, then begins saving up, consuming some, until retirement, at which point one consumes it all until one dies.
But as far as I've seen, life-cycle models tend not to include any mechanism for spending in order to reduce mortality/aging, and accept the risk of death as a given.
We could create a simple Markov decision process model.
An agent (humanity), each time period (year), has a certain amount of wealth and an x-risk probability _P_.
In this period, it can allocate that wealth between economic growth, receiving the investment back plus a return, and buying a permanent percentage reduction in the x-risk for a fixed sum.
For the reward, the x-risk is binary sampled with probability _P_; if the sample is true, then the reward is 0 and the decision process terminates, else the reward is the wealth and the process continues.
Let's imagine that this process can run up to 10,000 time periods, with a starting wealth of \$248 billion ([Angus Deaton's estimate of PPP world GDP in 1500](https://en.wikipedia.org/wiki/List_of_regions_by_past_GDP_%28PPP%29)), an economic growth rate of 2% (the long-run real growth rate of the global economy), an existential risk probability of 0.1% per year (arbitrarily chosen), and the option to buy a reduction of 1% for a billion dollars. (We'll work in units of trillions to help numeric stability.)
What strategy maximizes the cumulative rewards?
A few simple ones come to mind:
1. the agent could simply ignore the x-risk and reinvest all wealth, which, to a first approximation, is the strategy which has been followed throughout human history and is primarily followed now (lumping together NASA's Spaceguard program, biowarfare and pandemic research, AI risk research etc probably doesn't come to more than \$1-2b a year in 2016). This maximizes the economic growth rate but may backfire as the x-risk never gets reduced.
2. the agent could spend the full gain in its wealth from economic growth (2%) on x-risk reduction. The wealth doesn't grow and the returns from x-risk reduction do diminish, but the x-risk is at least reduced greatly over time.
3. the agent could implement a sort of probability matching: it spends on x-risk reduction a fraction of its wealth equal to the current _P_. This reduces how much is spent on extremely small x-risk reductions, but it might be suboptimal because it'll pay the largest fraction of its economy in the first time period, then second-largest in the second time period and so on, losing out on the potential compounding.
4. a more complicated hybrid strategy might work: it maximizes wealth like #1 for the first _n_ time periods (eg _n_ = 516), and then it switches to #2 for the remaining time periods.
5. like #4, but switching from #1 to #3 for the remaining time periods.
~~~{.R}
constantInvestmentAgent <- function (t, w, xrp) { return(c(w, 0)) }
constantReductionAgent <- function (t, w, xrp) { drawdown <- 0.9803921573; return(c(drawdown*w, (1-drawdown)*w)) }
probabilityMatchAgent <- function (t, w, xrp) { return(c(w*(1-xrp), w*xrp)) }
investThenReduceAgent <- function (t, w, xrp, n=516) { if (t < n) { constantInvestmentAgent(t, w, xrp) } else { constantReductionAgent(t, w, xrp) } }
investThenMatchAgent  <- function (t, w, xrp, n=516) { if (t < n) { constantInvestmentAgent(t, w, xrp) } else { probabilityMatchAgent(t, w, xrp) } }
~~~
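A driver for these strategies can be sketched as a simple loop over time periods (a minimal version under the parameters stated above; the reduction mechanics here are my reading of the setup, not the original implementation, and the agents are any functions returning `c(invested, spentOnReduction)`):

~~~{.R}
## One run of the decision process: each period, sample extinction, let the agent
## split wealth between reinvestment and x-risk reduction, then grow the economy.
## Wealth in trillions; buying 0.001 ($1b) multiplies the x-risk by 0.99 permanently.
simulateAgent <- function(agent, periods=10000, w=0.248, xrp=0.001,
                          growth=1.02, reductionCost=0.001, reductionFactor=0.99) {
    reward <- 0
    for (t in 1:periods) {
        if (runif(1) < xrp) { return(reward) }   # extinction: reward stops accumulating
        alloc <- agent(t, w, xrp)                # c(invested, spent on reduction)
        xrp <- xrp * reductionFactor^(alloc[2] / reductionCost)
        w <- alloc[1] * growth
        reward <- reward + w }
    reward }

## eg the ignore-the-risk strategy (#1):
set.seed(1)
simulateAgent(function(t, w, xrp) c(w, 0))
~~~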
                 simple                        complex
---------------  ----------------------------  -------------------------
simple model     fit indexes+linear model      statistician+linear model
complex model    fit indexes+ML                ML+ML (eg GANs)

1. Simple/simple is useful for cases like linear regression where classic methods like examining residuals or R^2s or Cook indexes can often flag problems with the model.
2. Simple/complex is also useful, as the human statistician can spot additional problems.
3. Complex/simple is probably useless, as the NNs may easily have severe problems but will have fit any simple linear structure and fool regular diagnostics.
4. Complex/complex can be very useful in machine learning, but in different ways from a good simple model.
Fast, simple, general---a good statistical method lets you choose one; a great method lets you choose two. (Consider linear models, decision trees, NNs, MCMC, ABC, discrete Bayesian networks, and exponential family vs nonparametric methods as examples of the tradeoffs here.)
So is quadrant 2 fully populated by human statisticians?
We wouldn't necessarily want to use GANs for everything we use statisticians for now, because neural networks can be *too* powerful and what we want from our models is often some sort of clear answer like "does X predict Y?" and simplicity.
But we could replace the statistician with some other powerful critic from machine learning---like a NN, SVM, random forest, or other ensemble.
So instead of having two NNs fighting each other as in a GAN, we simply have one specified model, and a NN which tries to find flaws in it, which can then be reported to the user.
The loop then becomes:

1. write down and fit a model to the real data
2. generate posterior predictive samples from the distribution
3. train a small NN to classify real data vs predictive data; the classification performance measures the plausibility of the predictive samples (perhaps something like a KL divergence), giving a measure of model quality, and flags data points which are particularly easily distinguished as real
4. the human statistician now knows exactly which data points are not captured by the model and can modify the model
5. repeat until the NN's performance declines to chance.
Let's try an example.
We'll set up a simple linear model regression `Y ~ A + B + C` with a few problems in it:
1. the trend is not linear but slightly quadratic
2. the outcome variable is also right-censored at a certain point
3. and finally, the measured covariates have been rounded
~~~{.R}
set.seed(2016-11-23)
n <- 10000
ceiling <- 3
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 0 + 0.5*a + 0.5*b + 0.5*c^2 + rnorm(n)
y_censored <- ifelse(y >= ceiling, ceiling, y)
df <- data.frame(Y=y_censored, A=round(a, digits=1), B=round(b, digits=1), C=round(c, digits=1))
l <- lm(Y ~ A + B + C, data=df)
summary(l)
plot(l)
plot(df$Y, predict(l, df))
l2 <- lm(Y ~ A + B + I(C^2), data=df)
summary(l2)
plot(df$Y, predict(l2, df))
~~~
The censoring shows up immediately on the diagnostics as an excess of actual points at 3, but the quadraticity is subtler, and I'm not sure I can see the rounding at all.
~~~{.R}
library(randomForest)
## First, random forest performance under the null hypothesis
modelNull <- data.frame(Y=c(df$Y, df$Y), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_n <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelNull); r_n
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of  error rate: 100%
# Confusion matrix:
#       0     1 class.error
# 0     0 10000           1
# 1 10000     0           1
modelPredictions <- data.frame(Y=c(df$Y, predict(l, df)), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions); r
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of  error rate: 6.59%
# Confusion matrix:
#      0    1 class.error
# 0 9883  117      0.0117
# 1 1200 8800      0.1200
## many of the LM predictions are identical, but the RF is not simply memorizing them as we can jitter predictions and still get the same classification performance:
modelPredictions$Y2 <- jitter(modelPredictions$Y)
randomForest(as.ordered(Real) ~ Y2 + A + B + C, modelPredictions)
#...            Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#
#         OOB estimate of  error rate: 6.57%
# Confusion matrix:
#      0    1 class.error
# 0 9887  113      0.0113
# 1 1200 8800      0.1200
~~~
Note we need to be careful about collecting the posterior predictive samples: if we collect 10,000 posterior samples for each of the 10,000 datapoints, we'll store 10000^2^ numbers, which may cause problems; a single sample per datapoint should be enough.
~~~{.R}
library(runjags)
model <- 'model {
  for (i in 1:n) {
    mean[i] <- mu + betaA*A[i] + betaB*B[i] + betaC*C[i]
    Y[i] ~ dnorm(mean[i], tau)
  }
  sd ~ dgamma(0.01, 0.01)
  tau <- 1/sqrt(sd)
  mu ~ dnorm(0, 100)
  betaA ~ dnorm(0, 100)
  betaB ~ dnorm(0, 100)
  betaC ~ dnorm(0, 100)
}'
model <- run.jags(model, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)
posterior_predictive <- tail(n=10000, model$mcmc[[1]][1,])
plot(df$Y, posterior_predictive)
modelPredictions_r <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r
#         OOB estimate of  error rate: 49.11%
# Confusion matrix:
#      0    1 class.error
# 0 4953 5047      0.5047
# 1 4776 5224      0.4776
~~~
~~~{.R}
model_rounded <- 'model {
  for (i in 1:n) {
    roundA[i] ~ dround(A[i], 3)
    roundB[i] ~ dround(B[i], 3)
    roundC[i] ~ dround(C[i], 3)
    mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
    Y[i] ~ dnorm(mean[i], tau)
  }
  sd ~ dgamma(0.01, 0.01)
  tau <- 1/sqrt(sd)
  mu ~ dnorm(0, 100)
  betaA ~ dnorm(0, 100)
  betaB ~ dnorm(0, 100)
  betaC ~ dnorm(0, 100)
}'
model_r <- run.jags(model_rounded, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)
posterior_samples <- tail(n=10000, model_r$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)
plot(df$Y, posterior_predictive)
modelPredictions_r <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r_r
#         OOB estimate of  error rate: 50.48%
# Confusion matrix:
#      0    1 class.error
# 0 4814 5186      0.5186
# 1 4909 5091      0.4909
~~~
~~~{.R}
model_rounded_censor <- 'model {
  for (i in 1:n) {
    roundA[i] ~ dround(A[i], 3)
    roundB[i] ~ dround(B[i], 3)
    roundC[i] ~ dround(C[i], 3)
    mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
    Y[i] ~ dnorm(mean[i], tau)
    is.censored[i] ~ dinterval(Y[i], c)
  }
  sd ~ dgamma(0.01, 0.01)
  tau <- 1/sqrt(sd)
  mu ~ dnorm(0, 100)
  betaA ~ dnorm(0, 100)
  betaB ~ dnorm(0, 100)
  betaC ~ dnorm(0, 100)
}'
model_r_c <- run.jags(model_rounded_censor, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)
posterior_samples <- tail(n=10000, model_r_c$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)
modelPredictions_r_c <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c); r_r_c
#         OOB estimate of  error rate: 53.67%
# Confusion matrix:
#      0    1 class.error
# 0 4490 5510      0.5510
# 1 5224 4776      0.5224
~~~
~~~{.R}
model_rounded_censor_quadratic <- 'model {
  for (i in 1:n) {
    roundA[i] ~ dround(A[i], 3)
    roundB[i] ~ dround(B[i], 3)
    roundC[i] ~ dround(C[i], 3)
    mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]^2
    Y[i] ~ dnorm(mean[i], tau)
    is.censored[i] ~ dinterval(Y[i], c)
  }
  sd ~ dgamma(0.01, 0.01)
  tau <- 1/sqrt(sd)
  mu ~ dnorm(0, 100)
  betaA ~ dnorm(0, 100)
  betaB ~ dnorm(0, 100)
  betaC ~ dnorm(0, 100)
}'
model_r_c_q <- run.jags(model_rounded_censor_quadratic, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)
posterior_samples <- tail(n=10000, model_r_c_q$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)
modelPredictions_r_c_q <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c_q <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c_q); r_r_c_q
# OOB estimate of error rate: 61.02%
# Confusion matrix:
# 0 1 class.error
# 0 3924 6076 0.6076
# 1 6127 3873 0.6127
trueNegatives <- modelPredictions_r_c_q[predict(r_r_c_q) == 0 & modelPredictions_r_c_q$Real == 0,]
~~~
Where can we go with this?
The ML techniques can be used to rank existing Bayesian models in an effective if unprincipled way.
Techniques which quantify uncertainty like Bayesian neural networks could give more effective feedback by highlighting the points the Bayesian NN is most certain are fake, guiding the analyst towards the worst-modeled datapoints and providing hints for improvement.
More inspiration could be borrowed from the GAN literature, such as "minibatch discrimination"---as demonstrated above, the random forests only see one data point at a time, but in training GANs, it has proven useful to instead examine multiple datapoints at a time to encourage the generator to learn how to generate a wide variety of datapoints rather than modeling a few datapoints extremely well; a ML model which can predict multiple outputs simultaneously based on multiple inputs would be analogous (that is, instead of `X ~ A + B + C`, it would look more like `X1 + X2 + X3 ... ~ A1 + B1 + C1 + A2 + B2 + C2 + ...`, with the independent & dependent variables from multiple data points all fed in simultaneously as a single sample) and might be an even more effective model critic.
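A sketch of that reshaping (with a toy data frame; `k`, the minibatch size, is arbitrary):

~~~{.R}
## pack k datapoints per row so the critic sees several samples at once
set.seed(1)
df <- data.frame(A=rnorm(9), B=rnorm(9), C=rnorm(9))  # toy data; nrow divisible by k
k <- 3
idx <- matrix(sample(nrow(df)), ncol=k)
wide <- do.call(cbind, lapply(1:k, function(j)
    setNames(df[idx[, j], ], paste0(names(df), j))))
names(wide)
# [1] "A1" "B1" "C1" "A2" "B2" "C2" "A3" "B3" "C3"
~~~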
# Proportion of Important Thinkers by Global Region Over Time in Charles Murray's _Human Accomplishment_ {.collapse}
[_Human Accomplishment_](/docs/sociology/2003-murray-humanaccomplishment.pdf) is a 2003 book by [Charles Murray](!Wikipedia "Charles Murray (political scientist)") reporting a large-scale citation analysis of biographical dictionaries & reference books on art/literature/science/mathematics/philosophy/science throughout history, quantifying the relative importance of "significant individuals" such as Isaac Newton or Immanuel Kant or Confucius and the temporal & geographical patterns; in particular, it demonstrates large European contributions throughout history and increasingly dramatically post-1400 AD.
[The dataset has been released.](https://osf.io/z9cnk/)
Emil Kirkegaard created [a visualization](https://nitter.net/KirkegaardEmil/status/804483340900376576) of the proportion by rough geographic region (European/Asian/other) in R using ggplot2 and LOESS smoothing.
Perhaps the most striking aspect of it is the Dark Ages showing up as a spike in Asian proportion, followed by the [Great Divergence](!Wikipedia).
This visualization has been criticized as Eurocentrically-misleading and driven by artifacts in the analysis/graphing:
- [LOESS](!Wikipedia) ignores the constraint that proportions must be 0-1 and naively extrapolates beyond the boundaries, producing negative estimates for some regions/times
- no visualization of uncertainty is provided, either in the form of graphing the raw data points by superimposing a scatterplot or by providing standard errors or credible intervals. It is possible that the overall shapes or specific periods are no more than chance scatters in a time-series based on few datapoints.
- LOESS can provide estimates of local standard errors & confidence intervals, but they are of questionable meaning in the absence of the underlying counts
- alternatively, the distribution of significant figures may not be treated correctly parametrically
- proportions may reflect a time-series with trends and so precision is exaggerated
None of these objections hold any water as the dataset and its embedded differences are sufficiently large that the method of analysis will make little difference; I will demonstrate this below by re-analyzing it to address the quibbles and show that all patterns remain intact or are sharpened.
The above criticisms can be addressed by:
1. switching from a LOESS plot to splines or local binomial regressions
2. plotting the raw proportions grouped by decade or century
3. using a nonparametric bootstrap to calculate confidence intervals, a procedure which lends itself to visualization as an animation of plots of all the resamples, giving an intuitive sense of how important sampling error is to the overall pattern of curves and specific parts of history
4. alternately, instead of attempting to fit the proportion, one can fit the original count of significant figures in a binomial or log-normal Bayesian time-series model and sample from the posterior estimates of each region for each decade/century, and calculate posterior proportions, gaining full quantification of uncertainty, incorporation of any autocorrelation, and smoothing; no additional algorithms or theorems are required, demonstrating the elegance of Bayesian approaches
I didn't realize [Kirkegaard's R code was available](http://rpubs.com/EmilOWK/232493 "Human Accomplishment visualizations: Exploring a large dataset of human creative achievement") so I wound up redoing it myself (and getting the same results):
~~~{.R}
## export CSV from spreadsheet in https://osf.io/z9cnk/
h <- read.csv("HA.csv", header=TRUE)
summary(h)
# Serial Name Fl Birth Death Inventory ScienceField
# Min. : 11.00 Descartes, René : 4 Min. :-700.000 Min. :-640.000 Min. :-559.00 Science :1442 :2560
# 1st Qu.: 6144.50 Hooke, Robert : 4 1st Qu.:1557.250 1st Qu.:1580.000 1st Qu.:1638.00 Lit.West : 835 Tech : 239
# Median :12534.50 Leonardo da Vinci: 4 Median :1804.000 Median :1782.000 Median :1844.00 Music.West: 522 Phys : 218
# Mean :15994.27 Archimedes : 3 Mean :1585.638 Mean :1616.174 Mean :1682.81 Art.West : 479 Chem : 204
# 3rd Qu.:21999.75 Bacon, Francis : 3 3rd Qu.:1900.000 3rd Qu.:1863.000 3rd Qu.:1930.00 Phil.West : 155 Biol : 193
# Max. :43134.00 d'Alembert, Jean : 3 Max. :1949.000 Max. :1910.000 Max. :1997.00 Art.China : 111 Math : 191
# (Other) :3981 NA's :304 NA's :351 (Other) : 458 (Other): 397
# Index Duplicate BirthCountry WorkCountry Ethnicity Woman No..of.Inventories
# Min. : 0.60000 Min. :0.00000000 France : 564 France : 605 Germanic: 592 Min. :0.00000000 Min. :2.000000
# 1st Qu.: 3.54000 1st Qu.:0.00000000 Germany: 556 Britain: 574 French : 565 1st Qu.:0.00000000 1st Qu.:2.000000
# Median : 7.60000 Median :0.00000000 Britain: 554 Germany: 525 English : 441 Median :0.00000000 Median :2.000000
# Mean : 12.95713 Mean :0.06221889 Italy : 400 Italy : 406 Italian : 397 Mean :0.02198901 Mean :2.228916
# 3rd Qu.: 15.89000 3rd Qu.:0.00000000 USA : 306 USA : 375 USA : 276 3rd Qu.:0.00000000 3rd Qu.:2.000000
# Max. :100.00000 Max. :1.00000000 China : 239 China : 239 Chinese : 240 Max. :1.00000000 Max. :4.000000
# NA's :115 (Other):1383 (Other):1278 (Other) :1491 NA's :3753
levels(h$Ethnicity)
# [1] "Ancient Greek" "Ancient Roman" "Arabic" "Australian" "Basque" "Black" "Bulgarian" "Canadian"
# [9] "Chinese" "Croatian" "Czech" "Danish" "Dutch" "English" "Estonian" "Finnish"
# [17] "Flemish" "French" "Germanic" "Greek" "Hungarian" "Icelandic" "Indian" "Irish"
# [25] "Italian" "Japanese" "Jewish" "Latino" "New Zealand" "Norwegian" "Polish" "Portuguese"
# [33] "Romanian" "Scots" "Slavic" "Slovenian" "Spanish" "Swedish" "Swiss" "USA"
european <- c("Ancient Greek", "Ancient Roman", "Australian", "Basque", "Bulgarian", "Canadian", "Croatian", "Czech", "Danish",
"Dutch", "English", "Estonian", "Finnish", "Flemish", "French", "Germanic", "Greek", "Hungarian", "Icelandic", "Irish",
"Italian", "Jewish", "New Zealand", "Norwegian", "Polish", "Portuguese", "Romanian", "Scots", "Slavic", "Slovenian",
"Spanish", "Swedish", "Swiss", "USA")
asian <- c("Chinese", "Indian", "Japanese")
other <- c("Arabic", "Black", "Latino")
groupMembership <- function(e) { if (e %in% european) { "European" } else { if (e %in% asian) { "Asian" } else { "Other" } } }
h$Group <- as.factor(sapply(h$Ethnicity, groupMembership))
summary(h$Group)
# Asian European Other
# 507 3379 116
## We use 'Fl' (floruit/flourished), when a person is believed to have done their most important work,
## since birth/death is often unavailable.
## group to decades by rounding:
h$Fl.decade <- round(h$Fl, digits=-1)
hd <- subset(select=c(Fl.decade, Group), h)
hdcount <- aggregate(cbind(Group) ~ Fl.decade+Group, length, data=hd)
colnames(hdcount)[3] <- "Count"
## sort by time:
hdcount <- hdcount[order(hdcount$Fl.decade),]
nrow(h); sum(hdcount$Count)
# [1] 4002
# [1] 4002
head(hdcount, n=20)
# Fl.decade Group Count
# 178 -700 European 3
# 179 -680 European 1
# 180 -650 European 1
# 1 -600 Asian 2
# 181 -600 European 2
# 182 -580 European 2
# 183 -570 European 2
# 2 -550 Asian 1
# 184 -550 European 1
# 185 -540 European 5
# 3 -520 Asian 1
# 186 -520 European 3
# 4 -510 Asian 1
# 187 -510 European 2
# 188 -500 European 2
# 189 -480 European 6
# 190 -460 European 3
# 191 -450 European 7
# 5 -440 Asian 1
# 192 -440 European 11
## One issue with the count data: decades with zero significant figures from a group
## (which happens frequently) get suppressed. Some tools can handle the omission
## automatically but many cannot, so we need to manually insert any missing decades with '0'
decades <- seq(-700, 1950, by=10)
for (i in 1:length(decades)) {
d <- decades[i]
if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="European",])==0) {
hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Asian",])==0) {
hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Other",])==0) {
hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
}
hdcount <- hdcount[order(hdcount$Fl.decade),]
library(ggplot2); library(gridExtra)
c1 <- with(hdcount, qplot(Fl.decade, Count, color=Group) + stat_smooth())
c2 <- with(hdcount, qplot(Fl.decade, log1p(Count), color=Group) + stat_smooth())
grid.arrange(c1, c2, ncol=1)
~~~
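Incidentally, the zero-insertion loop can be written more compactly by merging against the complete decade×group grid (a standalone sketch, with toy counts standing in for the real `hdcount`):

~~~{.R}
## Build the full decade x group grid, merge the observed counts onto it, and fill
## the gaps (decades where a group produced no significant figures) with 0:
hdcount <- data.frame(Fl.decade = c(-700, -680, -600, -600),
                      Group     = c("European", "European", "Asian", "European"),
                      Count     = c(3, 1, 2, 2))
grid <- expand.grid(Fl.decade = seq(-700, 1950, by=10),
                    Group     = c("European", "Asian", "Other"))
hdcountFull <- merge(grid, hdcount, all.x=TRUE)
hdcountFull$Count[is.na(hdcountFull$Count)] <- 0
hdcountFull <- hdcountFull[order(hdcountFull$Fl.decade),]
nrow(hdcountFull) # 266 decades x 3 groups = 798 rows
~~~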
The absolute growth in human population and hence accomplishment post-1400 is so dramatic that it obscures earlier temporal variations:
![Counts of "significant figures" in _Human Accomplishment_ (Murray 2003) by geographic region, raw and log-transformed](/images/statistics/2003-murray-humanaccomplishment-region-counts.png){.full-width}
Log-transformed, we can still see the inverted-V shape of European counts, but it's somewhat subtle because it's still being squashed by post-1400 increases and does leave room for doubt about sampling error.
Moving on to reproducing the proportions plot:
~~~{.R}
## Create proportions by summing per decade, then looping over each group & dividing by total for that decade:
decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=hdcount)
for (i in 1:nrow(hdcount)) {
total <- decadeTotals[decadeTotals$Fl.decade == hdcount[i,]$Fl.decade,]$Count
p <- hdcount[i,]$Count / total
hdcount$Proportion[i] <- if(is.nan(p)) { 0 } else { p }
hdcount$Total[i] <- total
}
with(hdcount, qplot(Fl.decade, Proportion, color=Group) + stat_smooth() + coord_cartesian(ylim = c(0, 1)))
~~~
![Relative proportions of "significant figures" in _Human Accomplishment_ (Murray 2003) by geographic region, LOESS-smoothed](/images/statistics/2003-murray-humanaccomplishment-region-proportions.png){.full-width}
We successfully reproduce it, modulo the LOESS standard errors (which can be disabled by adding `se=FALSE` to `stat_smooth()`), including the unwanted nonsensical extrapolations.
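(As an aside, the per-row proportion loop above can also be vectorized by merging the decade totals back in; a standalone sketch, with toy data standing in for `hdcount`:)

~~~{.R}
## Vectorized alternative to the proportion loop: sum counts per decade,
## merge the totals back onto each row, then divide.
hdcount <- data.frame(Fl.decade = c(-700, -600, -600),
                      Group     = c("European", "Asian", "European"),
                      Count     = c(3, 2, 2))
decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=hdcount)
colnames(decadeTotals)[2] <- "Total"
hdcount <- merge(hdcount, decadeTotals, by="Fl.decade")
hdcount$Proportion <- ifelse(hdcount$Total == 0, 0, hdcount$Count / hdcount$Total)
~~~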
It is possible with some tricky ggplot2 functionality to add in binomial smoothing (along with some jitter to unbunch the datapoints at the modal 0).
~~~{.R}
## roughly equivalent to:
# glm(cbind(Count,Total) ~ splines::ns(Fl.decade,3), family="binomial", data=hdcount, subset=Group=="European")
binomial_smooth <- function(...) { geom_smooth(se=FALSE, method = "glm", method.args = list(family = "binomial"), ...) }
with(hdcount, qplot(Fl.decade, Proportion, color=Group) +
binomial_smooth(formula = y ~ splines::ns(x, 3)) +
geom_jitter(aes(color=Group), width=0.013, height=0.013))
~~~
![Relative proportions of "significant figures" in _Human Accomplishment_ (Murray 2003) by geographic region, binomial-spline-smoothed for sensible extrapolation](/images/statistics/2003-murray-humanaccomplishment-region-proportions-binomialspline.png){.full-width}
This still doesn't provide any indication of sampling error uncertainty, however.
Kirkegaard's version provides CIs derived from bootstrapping, so I will provide something a little different: visualizing the uncertainty *dynamically* by graphing the smoothed proportions for each resample in an animation of hundreds of bootstrap samples.
So to do this bootstrap, we package up the various transformations from before, so we can sample-with-replacement the original dataset^[Strictly speaking we could probably do a parametric bootstrap by counting per decade, and treating each one as a Poisson estimation, and generating random deviates, but that runs into issues with decades with 0 figures (the MLE might estimate a rate of 0 which we know is wrong but not what rate might be right, which could easily imply average rates >1 based on surrounding temporal trends).], transform, and plot repeatedly:
~~~{.R}
transformAndProportion <- function(df) {
df$Fl.decade <- round(df$Fl, digits=-1)
dfd <- subset(select=c(Fl.decade, Group), df)
dfdcount <- aggregate(cbind(Group) ~ Fl.decade+Group, length, data=dfd)
colnames(dfdcount)[3] <- "Count"
decades <- seq(-700, 1950, by=10)
for (i in 1:length(decades)) {
d <- decades[i]
if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="European",])==0) {
dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Asian",])==0) {
dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Other",])==0) {
dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
}
dfdcount <- dfdcount[order(dfdcount$Fl.decade),]
decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=dfdcount)
for (i in 1:nrow(dfdcount)) {
p <- dfdcount[i,]$Count / decadeTotals[decadeTotals$Fl.decade == dfdcount[i,]$Fl.decade,]$Count
dfdcount$Proportion[i] <- if(is.nan(p)) { 0 } else { p }
}
return(dfdcount)
}
bootPlot <- function(df) {
n <- nrow(df)
bootDf <- df[sample(1:n, n, replace=TRUE),]
bootDfdcount <- transformAndProportion(bootDf)
## WARNING: can't just call qplot due to old 'animation'/ggplot2 bug; have to assign & 'print'
p <- with(bootDfdcount, qplot(Fl.decade, Proportion, color=Group) +
binomial_smooth(formula = y ~ splines::ns(x, 3)) +
geom_jitter(aes(color=Group), width=0.013, height=0.013))
print(p)
}
library(animation)
saveGIF({for (i in 1:200) { bootPlot(h) }}, interval=0.15, ani.width=1300, ani.height=700,
movie.name="2003-murray-humanaccomplishment-region-proportions-bootstrap.gif", clean=FALSE)
~~~
The bootstrap animation suggests to me that while the very earliest time-periods are opaque and the Dark Ages difference between Europe & Asia may be somewhat higher or lower, the overall shape doesn't change meaningfully.
The time-series aspect of the data on visual inspection appears to be a simple [auto-regressive model with a drift](!Wikipedia "Autoregressive integrated moving average") upwards, low-order models like `ARIMA(1,1,0)`, `ARIMA(1,1,2)`, or `ARIMA(0,1,2)`; this is probably due to the world population steadily increasing while the per capita rates remain stable.
~~~{.R}
library(forecast)
efit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="European")$Count)
afit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Asian")$Count)
ofit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Other")$Count)
par(mfrow=c(3,1))
plot(forecast(efit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(afit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(ofit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
~~~
![Simple ARIMA time-series fits & forecasts to 3 global regions of "significant figures" in _Human Accomplishment_](/images/statistics/2003-murray-humanaccomplishment-region-arima-forecast.png)
We can combine the sampling error quantification of full Bayesian posteriors, Poisson distribution of counts, and time-series aspects into a single Bayesian model using [`brms`](https://github.com/paul-buerkner/brms) as a convenient interface to [Stan](http://mc-stan.org/) (rather than writing out the full model by hand), with uninformative priors, and then visualize the posterior distribution of the proportions (which itself is simply a transformation of the posterior):
~~~{.R}
library(brms)
b <- brm(Count ~ (1|Group), autocor = cor_bsts(~ Fl.decade | Group), family="zero_inflated_poisson", data = hdcount)
## Rather than use `fitted` to get the 95% CI & compute proportion, it would also be possible to draw samples from
## the posterior for each group/decade, total, calculate per-group proportion, and then summarize into quantiles; but
## that is much slower and requires more finicky code:
posterior <- fitted(b)
hdcount$B.low.prop <- posterior[,3] / hdcount$Total
hdcount$B.mean.prop <- posterior[,1] / hdcount$Total
hdcount$B.high.prop <- pmin(1, posterior[,4] / hdcount$Total)
predframe <- subset(hdcount, select=c("B.low.prop", "B.high.prop"))
with(hdcount, ggplot(hdcount, aes(Fl.decade, Proportion, color=Group)) +
geom_point() +
geom_line(data=predframe) +
geom_ribbon(aes(ymin=B.low.prop, ymax=B.high.prop), alpha=0.05, data=predframe))
~~~
![Bayesian multi-level time-series of "significant figures"; shaded region indicates 95% credible interval around group mean in that decade](/images/statistics/2003-murray-humanaccomplishment-region-bayes.png){.full-width}
The smoothed time-series looks about the same, and the CIs suggest, like the bootstrap, that there is great uncertainty early on when populations are small & surviving figures are rare, but that the dark ages dip looks real and the European increases in proportion since then are also highly probable.
So overall, correcting the infelicities in Kirkegaard's original graph makes it somewhat cleaner and helpfully quantifies uncertainty, but none of the problems drove the overall appearance of the curve in the slightest.
If the graph is wrong, the issues will lie in systematic biases in the data itself---not statistical quibbling over sampling error or LOESS curves crossing an axis. (Comparison with graphs drawn from other cliometric datasets such as Wikipedia or particularly [Seshat](!Wikipedia "Seshat (project)") would be informative.)
# Program for non-spaced-repetition review of past written materials for serendipity & rediscovery: Archive Revisiter {.collapse}
> ["Spaced repetition"](/Spaced-repetition) helps one remember facts by creating discrete flashcards which one tests oneself on at increasingly distant 'spaced' time periods, repeating the fact just before one probably would have forgotten it; using software to track & automate tests & review scheduling, spaced repetition can scale to hundreds of thousands of discrete items.
>
> If spacing out facts can help one remember by repeating items just *before* they are forgotten, is there any use for an "anti-spaced repetition" with the opposite method of repeating items only *after* they are probably forgotten?
>
> I can think of two: first, it could be used to plan [consumption of media such as movies](/Media-RL) by eg tracking one's favorite movies of all time and scheduling a rewatch whenever one is predicted to have forgotten enough to make them novel & highly enjoyable again. Second, and more interestingly, it could be used as a *serendipity generator* by allowing efficient processing of notes or excerpts or old writings.
>
> In rereading such materials many years later, one often gains a new perspective or learns something useful because one forgot something: one didn't understand something about it at the time, or new material has radically changed one's interpretation, and since it'd been forgotten, no use could be made of it. Unfortunately, using spaced repetition to memorize such material, while ensuring any serendipitous connections get made as soon as possible, would be radically infeasible for bulky items (a single lengthy text excerpt might correspond to hundreds of discrete items, quickly overloading even SRS systems) and for almost all items, useless. One can justify rereading old material once or perhaps twice, but not many rereads nor full memorization. But rereading haphazardly is likely to inefficiently cover some material many times while neglecting others, and such rereads will often be far too early in time (or---a lesser concern here---too late).
>
> Instead of spaced repetition, one would instead use *anti-spaced repetition*: each item would be tracked and reviewed and its expected forgetting time predicted, as in spaced repetition, but instead of scheduling a review *before* forgetting, a review is scheduled for some time (probably long afterwards) *after* forgetting. The total number of reviews of each item per user lifetime would be set to a small number, perhaps 1--4, bounding the time consumption at a feasible amount.
>
> Such an anti-spaced repetition system could be used with hundreds of thousands of notes or clippings which a person might accumulate over a lifetime, and enable them to invest a few minutes a day into reading old notes, occasionally coming up with new insights, while ensuring they don't waste time reading notes too many times or reading notes they likely already remember & have exhausted.

One reason to take notes/clippings and leave comments in stimulating discussions is to later benefit by having references & citations at hand, and gradually build up an idea from disparate threads and make new connections between them.
For this purpose, I make extensive excerpts from web pages & documents I read into my [Evernote](!Wikipedia) clippings (functioning as a [commonplace book](!Wikipedia)), and I comment constantly on Reddit/LessWrong/HN etc.
While expensive in time & effort, I often go back, months or years later, and search for a particular thing and expand & integrate it into another writing or expand it out to an entire essay of its own.
(I also value highly not being in the situation where I believe something but I do not know *why* I believe it other than the conviction "I read it somewhere, once".)
This sort of [personal information management](!Wikipedia) using simple [personal information manager](!Wikipedia)s like Evernote works well enough when I have a clear memory of what the citation/factoid was, perhaps because it was so memorable, or when the citations or comments are in a nice cluster (perhaps because there was a key phrase in them or I kept going back & expanding a comment), but it loses out on key benefits to this procedure: serendipity and perspective.
As time passes, one may realize the importance of an odd tidbit, or have utterly forgotten something, or events may have considerably changed its meaning; in such cases, you would benefit from revisiting & rereading that old bit & experiencing an "aha!" moment, but you don't realize it.
So one thing you could do is reread all your old clippings & comments, appraising them for reuse.
But how often? And it's a pain to do so. And how do you keep track of which you've already read?
One thing I do for my emails is semi-annually I (try to) read through my previous 6 months of email to see what might need to be followed up on^[I've been experimenting with [Boomerang](http://www.boomeranggmail.com/) to reduce the problem of non-followups by setting 'ping me if no reply within 1 month' alerts on my sent emails.] or mined for inclusion in an article.
(For example, an ignored request for data, or a discussion of darknet markets with a journalist I could excerpt into one of my DNM articles so I can point future journalists at that instead.)
This is already difficult, and it would be even harder to expand. I have read through my LessWrong comment history... once. Years ago. It would be more difficult now. (And it would be impossible to read through my Reddit comments as the interface only goes back ~1000 comments.)
Simply re-reading periodically in big blocks may work but is suboptimal: there is no interface easily set up to reread them in small chunks over time, no constraints which avoid far too many reads, nor is there any way to remove individual items which you are certain need never be reviewed again.
Reviewing is useful but can be an indefinite timesink.
(My sent emails are not too hard to review in 6-month chunks, but my IRC logs are bad---7,182,361 words in one channel alone---and my >38k Evernote clippings are worse; any [lifestreaming](!Wikipedia) will exacerbate the problem by orders of magnitude.)
This is probably one reason that people who keep journals or diaries don't reread them.
Nor can it be crowdsourced or done by simply ranking comments by public upvotes (in the case of Reddit/LW/HN comments), because the most popular comments are ones you likely remember well & have already used up, and the oddities & serendipities you are hoping for are likely unrecognizable to outsiders.
This suggests some sort of reviewing framework where one systematically reviews old items (sent emails, comments, IRC logs by oneself), putting in a constant amount of time regularly and using some sort of ever expanding interval between re-reads as an item becomes exhausted & ever more likely to not be helpful.
Similar to the logarithmically-bounded number of backups required for indefinite survival of data ([Sandberg & Armstrong 2012](https://www.fhi.ox.ac.uk/reports/2012-1.pdf "Indefinite Survival through Backup Copies")), Mike Perry's 2013 essay ["Deconstructing Deathism---Answering Objections to Immortality"](http://immortallife.info/articles/entry/deconstructing-deathism-answering-objections-to-immortality) discusses something like what I have in mind in terms of an immortal agent trying to review its memories & maintain a sense of continuity (note: this is an entirely different kind of problem than those considered for Freeman Dyson's immortal intelligences in _Infinite in All Directions_, which are more fundamental), pointing out that if time is allocated correctly, review need not consume 100% of the agent's time but can be bounded at a fixed fraction:
> It seems reasonable that past versions of the self would "survive" as we remember the events of times past, that is to say, our episodic memories, and this would have importance in our continuing to persist as what could be considered the "same" albeit also a changing, developing person. But in addition to this mnemonic reinforcement I imagine there would be a more general feeling of being a particular individual, an "ambiance" derived from but not referring to any specific past experiences. Ambiance alone would not be sufficient, I think, to make us who we are; episodic memories would also be necessary, yet it could considerably lessen the need for frequent recall and thus alleviate the problem of dilution.
>
> Another interesting thought is that certain items might consistently be consulted more frequently than others. (Indeed, would this not be expected?) In this way it would actually be possible to bypass the dilution effect and instead allow a fixed fraction of time for perusal of any given item, even as more items were added indefinitely. A simple way of doing this could be first to allow some fixed fraction of the time for day-to-day affairs and other non-archival work ("prime time"), and spend the rest of the time on perusal of personal archives ("archive time"). The exact apportioning of prime versus archive time is not important here, but it will be instructive to consider how the archive time itself might be subdivided. A simple, if overly simplistic, strategy would be to have half this time devoted to the first century's records, half the remainder to the second century, and so on. (Since there would only be a finite number of centuries, there would be some unused archive time at the end, which could be spent as desired. Note, however, that in the limit of infinite total time covering infinitely many centuries, the usage of archive time would approach but not exceed 100%.) In this way, then, there would be a fixed fraction of archive time, $2^{-n}$, spent on the _n_th century's records, regardless of how many centuries beyond the _n_th were lived or how many records accumulated. True, this way of apportioning time might not be much good beyond a few centuries; only about one trillionth the total time would be spent on the 40th century, for instance, around 1⁄300 sec per 100 years. (Possibly a lot could be covered even in this brief interval of about 3 million nanoseconds, however.) But the apportionment scheme could be adjusted.
>
> A more interesting and plausible, if slightly harder-to-describe scheme would be to choose a constant $c>0$ and allow the fraction $c \cdot (\frac{1}{n+c-1} - \frac{1}{n+c})$ to the _n_th-century records. It is easy to show that the time for all centuries will add up to 100% as before, whatever positive value of c we start with. Starting with $c=10$ will get 10% of the total time spent on the first century, with subsequent centuries receiving a diminishing share as before, but the rate of falloff will be much slower, so that the 40th century will still receive 0.4%, or about 5 months per 100 years, that is to say, 240 million nanoseconds per minute. If we suppose that our immortal settles eventually into a routine in which 10% of the time overall is archive time, there would be 24 million nanoseconds available each minute of life for the 40th century's memories alone, if desired, with many other centuries getting more or less comparable or greater amounts of attention, and none omitted entirely. This, I think, makes at least a plausible case that a reasonable sense of one's personal identity could be sustained indefinitely.
>
> In the above examples the greatest proportion of archive time falls to the earlier records, which might be fitting since these should be the most important as formative years for the prospective immortal, thus the most important for identity maintenance. (Memory recall would also naturally occur during prime time; the emphasis here could be on recent events, to maintain a balance overall.) In summary, then, we have considered ways that the problem of dilution might be successfully managed. Relatively infrequent perusal of memories might still suffice to maintain the necessary continuity with past versions of the self, or proper scheduling could stabilize the frequency of recall and bypass the dilution effect, or both. We see in any case that the problem is not what it may seem at first sight. We have no guarantee, of course, that it would not get out of bounds, but certainly some grounds for hope.
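Perry's apportionment scheme is easy to check numerically (my own sketch):

~~~{.R}
## Fraction of archive time allotted to the nth century's records, per Perry:
archiveFraction <- function(n, c) { c * (1/(n + c - 1) - 1/(n + c)) }
## the series telescopes, so all centuries together use exactly 100% of archive time:
sum(archiveFraction(1:1e6, c=10)) # ~1
## with c=10, the 40th century still gets 10/2450, ~0.4%, as Perry says:
archiveFraction(40, c=10)
~~~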
So you could imagine some sort of software along the lines of [spaced repetition systems](/Spaced-repetition) like Anki/Mnemosyne/Supermemo which you spend, say, 10 minutes a day at, simply rereading a selection of old emails you sent, lines from IRC with _n_ lines of surrounding context, Reddit & LW comments etc; with an appropriate backoff & time-curve, you would reread each item maybe 3 times in your lifetime (eg first after a delay of a month, then a year or two, then decades).
Each item could come with a rating function where the user rates it as important or odd-seeming or incomplete, to be exposed again in a few years, or as totally irrelevant and never to be shown again---which, for many bits of idle chit-chat, mundane emails, or intemperate comments, would not be an instant too soon!
(More positively, anything already incorporated into an essay or otherwise reused likely doesn't need to be resurfaced.)
This wouldn't be the same as a spaced repetition system which is designed to recall an item as many times as necessary, at the brink of forgetting, to ensure you memorize it; in this case, the forgetting curve & memorization are irrelevant and indeed, the priority here is to try to *eliminate* as many irrelevant or useless items as possible from showing up again so that the review doesn't waste time.
More specifically, you could imagine an interface somewhat like Mutt which reads in a list of email files (my local POP email archives downloaded from Gmail with `getmail4`, filename IDs), chunks of IRC dialogue (a `grep` of my IRC logs producing lines written by me ±10 lines for context, hashes for ID), LW/Reddit comments downloaded by either scraping or API via [the BigQuery copy up to 2015](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05), and stores IDs, review dates, and scores in a database.
One would use it much like a SRS system, reading individual items for 10 or 20 minutes, and rating them, say, upvote ('this could be useful someday, show me this ahead of schedule in the future') / downvote (push this far off into the future) / delete (never show again).
Items would appear on an expanding schedule.
For example if one wanted to review items 4 times over the next 50 years (roughly my life expectancy), a schedule might be:
~~~{.R}
round({t=0:4; t^6.981})
# [1] 0 1 126 2142 15958
~~~
So in 1 day, then a third of a year, then after 5.8 years, then after 43 years.
Alternately, a geometric series might be a bit kinder and not too front-loaded:
~~~{.R}
review <- function(n, r, a) { a * (1 - r^n) / (1 - r) }
reviews <- function(n, r, a) { sapply(1:n, function(nn) { review(nn, r, a) }) }
findR <- function (firstReview=31, n_total=3, years=50) { optimize(interval=c(0, 1000),
f = function(r) { abs(sum(sapply(1:n_total,
function(n){review(n, a=firstReview, r=r)})) - (365*years)) })$minimum }
findR(firstReview=30, n_total=4, years=50)
# [1] 7.728823216
round(reviews(4, 7.728823216, 30))
# [1] 30 262 2054 15904
~~~
The geometric series allows for easy incorporation of rating modifications: a downvote penalty might multiply _r_ by 1.5, vs 0.5 for upvotes.
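A minimal sketch of how such rating modifications would reshape an item's schedule (restating the `review` function from above for self-containment; the 1.5×/0.5× multipliers are just the illustrative values suggested):

~~~{.R}
## Day offsets of the first n reviews under a geometric schedule:
review  <- function(n, r, a) { a * (1 - r^n) / (1 - r) }
reviews <- function(n, r, a) { sapply(1:n, function(nn) { review(nn, r, a) }) }
## a rating multiplies the item's ratio r, stretching or compressing its future reviews:
rate <- function(r, rating) { switch(rating, downvote = r * 1.5, upvote = r * 0.5, r) }
r0 <- 7.728823216
round(reviews(4, r0, 30))                   # baseline: 30 262 2054 15904
round(reviews(4, rate(r0, "downvote"), 30)) # pushed out much further
round(reviews(4, rate(r0, "upvote"), 30))   # pulled in much sooner
~~~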
This would also allow some input from statistical algorithms which predict upvote/downvote/delete and advances/delays items based on that, which would hopefully quickly learn to avoid idle chit-chat and short performative utterances and start to prioritize more interesting & unusual items.
(For example, a good start might be a SVM on a bag-of-words version of each item's text, and then as the dataset ratings expand, more complicated algorithms could be plugged in.)
As far as I know, some to-do/self-help systems have something like a periodic review of past stuff, and as I mentioned, spaced repetition systems do something somewhat similar to this idea of exponential revisits, but there's nothing like this at the moment.
# On the value of new statistical methods {.collapse}
Genetic correlation research is a hot area in 2016-2017: [my WP article](!Wikipedia "Genetic correlation") passed 400 references in May 2017.
What is particularly interesting reference-wise is that publications from 2015-2017 make up around half of the results: more genetic correlations have been calculated in the past 3 years than in the previous 80, since the first estimates were made somewhere in the 1930s.
For calculating them, there are 3 main methods.
1. twin registry studies require only twin phenotypic measurements, which can usually be collected by mailed surveys; to analyze them, one computes a Pearson's _r_ or uses a standard SEM with additional covariance paths (doable by hand with Wright's path analysis back in the 1930s). Computation scales roughly linearly with sample size, statistical power is excellent at a few hundred twin pairs, and the full heritabilities are captured
2. GCTA requires full raw SNP data on 5000+ unrelated individuals at \$100+ a sample, along with simultaneous phenotypic measurements of both traits, and complicated custom software whose computation scales exponentially; it can examine only a narrow subset of heritability
3. LDSC requires only public summary statistics, which can come from separate GWASes calculated on each trait individually, and its computational complexity is closer to linear than exponential; the penalty for not needing raw SNP data from twice-measured individuals is that SNP costs double or more (since multiple GWASes are used), and LDSC is even less statistically efficient than GCTA, so you'll need >10,000 individuals behind each polygenic score, and still need custom software.
In other words, the twin method is old, simple, requires small sample sizes, and uses easily-obtained phenotypic measurements; while GCTA/LDSC are new, complicated, and require expensive novel genetic sequencing data in huge sample sizes on top of the phenotypic measurements.
So naturally LDSC gets used an order of magnitude more!
Looking at the bibliography, we can guesstimate the rates at twin: 1 paper/year; GCTA (requiring raw data), 10/year; LDSC (public summary stats), 100/year.
Amazing the difference methods can make. It's all about data access.
For all its disadvantages, LDSC statistically works around the lack of individual-level raw data and makes do with the data that gets publicly released because it is not seen to violate 'privacy' or 'bioethics', so any researcher can make use of the method on their particular dataset, while twin and GCTA require individual-level data which is jealously guarded by the owners.
Methodologists and statisticians are probably seriously undervalued: a good new method can cause a revolution.
# Bayesian power analysis: probability of exact replication {.collapse}
Psychologist [Michael Kane](https://nitter.net/Kane_WMC_Lab) [mentions](https://nitter.net/Kane_WMC_Lab/status/890949994919886848):
> TFW a correlation of interest in a new study (_n_ = 355) replicates that from a prior study (_n_ = 182) to the second decimal (_r_ = 0.23). Winning!
Turning up the same correlation twice is somewhat surprising, because random sampling error will cause the empirical correlation $\hat r$ to vary substantially from sample to sample, as reflected by the wide credible intervals around _r_ at _n_ = 182-355.
How surprising is it? Is it too good to be true?
One approach would be to ask, if we generated bivariate samples of size _n_ = 355 with a fixed relationship of _r_ = 0.23, how often would the samples generate a rounded estimate of $\hat r$=0.23?
~~~{.R}
set.seed(2017-07-28)
library('MASS')
powerSim <- function (r_gen, n, r_test=NA) {
    data <- mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r_gen, r_gen, 1), nrow=2))
    r_est <- cor.test(data[,1], data[,2])$estimate
    if (is.na(r_test)) { r_test <- r_gen }
    return(round(r_test, digits=2) == round(r_est, digits=2))
}
powerSims <- function(r, n, r_test=NA, iters=100000) {
    sim <- replicate(iters, powerSim(r, n, r_test=r_test))
    return(sum(sim) / length(sim))
}
powerSims(0.23, 355)
# [1] 0.07798
~~~
So around 8% of the samples.
This treats $\hat r$ = _r_ = 0.23 as a parameter known with infinite precision, rather than as an estimate $\hat r$ (based on about half the data) of the unknown parameter _r_. There would be considerable posterior uncertainty about what _r_ is, and this affects how often two samples would yield the same estimate: if the true _r_ were, say, 0.10 (as is entirely possible), it would be highly unlikely for the second sample to yield $\hat r$=0.23 again, because the same overestimation fluke would have to repeat itself.
To incorporate the uncertainty, we can feed in a simulated dataset exactly matching the description of _n_ = 182/_r_ = 0.23 to an uninformative Bayesian model, calculate a posterior distribution over _r_ (which gives CIs of 0.09-0.37), and then draw from the posterior possible _r_s and run the original simulation asking how often we recover $\hat r$=0.23.
~~~{.R}
library(brms)
n1 = 182
n2 = 355
r1 = 0.23
data1 = mvrnorm(n=n1, mu=c(0, 0), Sigma=matrix(c(1, r1, r1, 1), nrow=2), empirical=TRUE)
colnames(data1) <- c("x", "y")
b1 <- brm(y ~ x, iter=20000, data=data1); summary(b1)
# ...Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept 0.00 0.07 -0.14 0.14 40000 1
# x 0.23 0.07 0.09 0.37 40000 1
posteriorX <- fixef(b1, summary=FALSE)[,2]; summary(posteriorX)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.08060884 0.18061570 0.23010870 0.22999820 0.27916800 0.55868700
replicates <- sapply(posteriorX, function(r_post) { powerSim(r_post, n2, r_test=r1) })
summary(replicates); mean(replicates)
# Mode FALSE TRUE NA's
# logical 38262 1738 0
# [1] 0.04345
~~~
Around 4% of the time, reflecting the improbability of true values like _r_ = 0.09 or _r_ = 0.37 producing the specific sample estimate of $\hat r$=0.23.
Thus, observing the same summary statistics in even relatively large samples is somewhat suspicious and might be a good reason to double-check other aspects of the code & data.
# Expectations are not expected deviations and large numbers of variables are not large samples {.collapse}
If one has a large number of variables each with a certain expectation, it is tempting to interpret the [expectation](!Wikipedia "Expected value"), the [central limit theorem](!Wikipedia), or the [law of large numbers](!Wikipedia) as implying that the sum of a large number of variables, or the value after a large number of timesteps, will be close or identical to the expected value.
So for coin-flipping, one knows that flipping 10 coins could easily yield a large deviation like a sum of 9 heads instead of the expected 5 heads, but one then thinks that after a million coin flips, the sum of heads will probably be 500,000.
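We can check this intuition directly with the exact binomial density: after a million flips, the sum is almost never *exactly* 500,000, even though the *proportion* is almost exactly one half.

```r
## Chance the sum of 1,000,000 fair coin flips is *exactly* 500,000:
dbinom(500000, size=1e6, prob=0.5) # ~0.0008, ie. ~1-in-1250
## The sum's SD is 500 heads, so typical absolute deviations are in
## the hundreds, while the SD of the *proportion* is only 0.0005:
sqrt(1e6 * 0.5 * 0.5)
# [1] 500
```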
Another example of this mistake might be to make arguments about scientific research or charities: "charitable intervention _X_ is affected by hundreds or thousands of different variables and the benefits or costs unfold over long time periods like decades or centuries; our best estimate of the mean value of interventions like _X_ is that it is some small value _Y_; thus, by CLT etc, we can be sure that _X_'s ultimate value will be neither much bigger nor much smaller than _Y_ but very close to _Y_, and, particularly, we can be sure that there are no interventions like _X_ which could possibly turn out to have ultimate values which are orders of magnitude larger or smaller than _Y_, so we can rule out any such claims and we know the Value of Information is small."
This is not wrong so much as misunderstood: one might call it a confusion of the variable's distribution with the sampling distribution.
The value only becomes closer in a *relative* sense; in an absolute sense, as more variables are added---*without* the absolute magnitude of each shrinking linearly---the actual deviation from the expectation simply becomes larger and larger.
(Like [martingales](!Wikipedia "Martingale (probability theory)"): the expectation is the same as the current value, but the variance increases with time.)
As [Wikipedia puts it](!Wikipedia "Diversification (finance)#Effect of diversification on variance") in discussing how "diversification" works, it is a mistake to think that one 'diversifies' one's investments by adding additional investments of the same size; for any variance reduction, the total investment must instead be split up among ever more, ever smaller investments:
> In general, the presence of more assets in a portfolio leads to greater diversification benefits, as can be seen by considering portfolio variance as a function of _n_, the number of assets. For example, if all assets' returns are mutually uncorrelated and have identical variances $\sigma^{2}_x$, portfolio variance is minimized by holding all assets in the equal proportions $\frac{1}{n}$.[Samuelson, Paul, ["General Proof that Diversification Pays"](/docs/statistics/decision/1967-samuelson.pdf), _Journal of Financial and Quantitative Analysis_ 2, March 1967, 1-13.] Then the portfolio return's variance equals $\text{var}[(\frac{1}{n})x_{1}+(\frac{1}{n})x_{2}+...+(\frac{1}{n})x_{n}]$ = $n(\frac{1}{n^{2}})\sigma^{2}_{x}$ = $\frac{\sigma^{2}_{x}}{n}$, which is monotonically decreasing in _n_.
>
> The latter analysis can be adapted to show why _adding_ uncorrelated volatile assets to a portfolio, [see Samuelson, Paul, ["Risk and uncertainty: A fallacy of large numbers"](https://www.casact.org/pubs/forum/94sforum/94sf049.pdf), _Scientia_ 98, 1963, 108-113.] [Ross, Stephen, ["Adding risks: Samuelson's fallacy of large numbers revisited"](/docs/statistics/decision/1999-ross.pdf), _Journal of Financial and Quantitative Analysis_ 34, September 1999, 323-339.] thereby increasing the portfolio's size, is not diversification, which involves subdividing the portfolio among many smaller investments. In the case of adding investments, the portfolio's return is $x_1+x_2+ \dots +x_n$ instead of $(\frac{1}{n})x_{1}+(\frac{1}{n})x_{2}+...+(\frac{1}{n})x_{n},$ and the variance of the portfolio return if the assets are uncorrelated is $\text{var}[x_1+x_2+\dots +x_n] = \sigma^{2}_{x} + \sigma^{2}_{x}+ \dots + \sigma^{2}_{x} = n\sigma^{2}_{x},$ which is _increasing_ in _n_ rather than decreasing. Thus, for example, when an insurance company adds more and more uncorrelated policies to its portfolio, this expansion does not itself represent diversification---the diversification occurs in the spreading of the insurance company's risks over a large number of part-owners of the company.
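The contrast the quotation draws is easy to verify by simulation; a minimal sketch with 100 uncorrelated unit-variance assets:

```r
## Subdividing vs adding, per the quoted formulas: the variance of the
## equal-weighted portfolio (1/n)x_1+...+(1/n)x_n shrinks as sigma^2/n,
## while the variance of the plain sum x_1+...+x_n grows as n*sigma^2.
set.seed(1)
n <- 100
assets <- matrix(rnorm(n * 10000), nrow=n) # 10,000 draws of n uncorrelated assets
subdivided <- colMeans(assets) # equal proportions 1/n
added      <- colSums(assets)  # adding whole assets
var(subdivided) # ~ 1/100
var(added)      # ~ 100
```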
Samuelson's 1963 "Risk and uncertainty: A fallacy of large numbers" opens by recounting an anecdote:
> Is it true that an insurance company *reduces* its risk by *doubling* the number of ships it insures?
>
> ...a few years ago I offered some lunch colleagues to bet each \$200 to \$100 that the side of a coin *they* specified would not appear at the first toss. One distinguished scholar---who lays no claim to advanced mathematical skills---gave the following answer:
>
>> I won't bet because I would feel the \$100 loss more than the \$200 gain. But I'll take you on if you promise to let me make 100 such bets.
>
> What was behind this interesting answer? He, and many others, have given something like the following explanation. "One toss is not enough to make it reasonably sure that the law of averages will turn out in my favor. But in a hundred tosses of a coin, the law of large numbers will make it a darn good bet. I am, so to speak, virtually sure to come out ahead in such a sequence, and that is why I accept the sequence while rejecting the single toss."
>
> 4\. *Maximum Loss And Probable Loss*.---What are we to think about this answer?
>
> ...Firstly, when an insurance company doubles the number of ships it insures, it does also double the range of its possible losses or gains. (This does not deny that it reduces the probability of its losses.) If at the same time that it doubles the pool of its risks, it doubles the number of its owners, it has indeed left the maximum possible loss per owner unchanged; but---and this is the germ of truth in the expression "there is safety in numbers"---the insurance company has now succeeded in reducing the probability of each loss; the gain to each owner now becomes a more certain one.
>
> In short, it is not so much by *adding* new risks as by *subdividing* risks among more people that insurance companies reduce the risk of each. To see this, do not double or change at all the original number of ships insured by the company: but let each owner sell half his shares to each new owner. Then the risk of loss to each owner per dollar now in the company will have indeed been reduced.
>
> Undoubtedly this is what my colleague really had in mind. In refusing a bet of \$100 against \$200, he should not then have specified a sequence of 100 such bets. That is adding risks. He should have asked to subdivide the risk and asked for a sequence of 100 bets, each of which was 100th as big (or \$1 against \$2).
In the insurance example, ships do not change their insured value (your liability) simply because you insure more of them, and they certainly do not halve in value simply because you have decided to take more insurance.
If a ship is worth \$1m with an expected profit of \$10k and a SD of profits of \$5k, then when you insure 1000 ships, your total profit is distributed as $\mathcal{N}(1000 \cdot 10000, \sqrt{1000} \cdot 5000) = \mathcal{N}(1 \times 10^7, \sim 158000)$, and when you double it, the distribution becomes $\mathcal{N}(2000 \cdot 10000, \sqrt{2000} \cdot 5000) = \mathcal{N}(2 \times 10^7, \sim 224000)$---the absolute size of your fluctuations has increased, not decreased. As a percentage of the total (~1.6% vs ~1.1%), it has indeed gone down, but the absolute size has still gone up.
Similarly, by claiming to prefer a sequence of 100 bets, the colleague has put not \$100 but \$10,000 at risk and raised the stakes greatly: if the prospect of fluctuations around \$100 unsettled his stomach before, he will enjoy even less the SD of ~\$1500 from the proposed 100 coinflips (95% quantiles of ~\$2000 to ~\$8000), and there remains a small (~0.05%) chance of an outright loss such as -\$400. He has reduced the probability of any loss, yes, and most of the outcomes are indeed relatively closer to the expectation than with just 1 coinflip, but he has inadvertently replaced a small SD of \$150 with the much larger one of \$1500, and of course his worst-case scenario has gotten much worse---with just one coin flip he could never have lost \$400 or more, but now he can and occasionally would.
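A quick Monte Carlo makes the comparison concrete (a sketch with hypothetical helper functions): the colleague's 100 full-size bets versus Samuelson's suggested subdivision into 100 bets of \$2-against-\$1.

```r
## The colleague's sequence of 100 bets (win $200 / lose $100, p=0.5 each)
## vs Samuelson's subdivision (100 bets of win $2 / lose $1):
set.seed(1963)
bets <- function(win, lose, k) { sum(sample(c(win, -lose), k, replace=TRUE)) }
full <- replicate(100000, bets(200, 100, 100))
sub  <- replicate(100000, bets(2,   1,   100))
c(mean(full), sd(full)) # expectation ~$5,000, but SD ~$1,500
c(mean(sub),  sd(sub))  # expectation ~$50, with SD only ~$15
min(full) # worst cases far below anything a single $100-loss flip allows
```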
Another interesting example comes from Cavalli-Sforza's 1971 [_The Genetics of Human Populations_](/docs/genetics/selection/1971-cavallisforza-thegeneticsofhumanpopulations.pdf), where he argues that _a priori_, differences between human races on complex traits (intelligence in this case), in the absence of selection, cannot exist solely based on the large number of genes contributing to them, by the law of large numbers:
> In this context, it is worth mentioning that Jensen states that because the gene pools of whites and blacks are known to differ and "these genetic differences are manifested in virtually every anatomical, physiological, and biochemical comparison one can make between representative samples of identifiable racial groups" therefore "there is no reason to suppose that the brain should be exempt from this generalization." There is, however, no reason why genes affecting IQ which differ in frequency in the gene pools of blacks and whites, should be such that, on the average, whites would have significantly higher frequencies of genes increasing IQ than would blacks. On the contrary, one should expect, assuming no tendency for high IQ genes to accumulate by selection in one or other race, that the more polymorphic genes there are that affect IQ and that differ in frequency in blacks and whites, the less likely it is that there is an average genetic difference in IQ between the races. This follows from that most basic law of statistics, the law of large numbers, which predicts increasing accuracy of a mean based on increasing numbers of observations (See Appendix I).
Cavalli-Sforza is correct to note that, unless we wish to make (still) very controversial claims about differing selection, there is no apparent reason for any intelligence-related allele to be systematically rarer in one population than another (although of course frequencies will in practice differ slightly due to random chance: [genetic drift](!Wikipedia), demographic bottlenecks, etc), and that, as this applies to all intelligence-related alleles, we would predict the expected values of all populations to be identical.
He, however, commits the fallacy of large numbers when he then interprets the law of large numbers as guaranteeing that all populations *will* be identical, while in fact, they will be different, and the absolute size of the differences will increase---not decrease---"the more polymorphic genes there are that affect IQ".
Like with diversification or insurance or coinflipping, the desired disappearance of the variance only happens if each random variable (investment, insured vessel, coinflip) decreases proportional to the total number of random variables; for Cavalli-Sforza's argument to go through, it would need to be the case that every new IQ gene divvied up a fixed pie---but why would that be the case and how could that be known a priori?
More pointedly, we could note that Cavalli-Sforza's argument proves too much because it is equally applicable *within* races too, and implies that there could be no differences of important magnitude between humans of the same race on highly polygenic traits (and if someone wanted to try to rescue the argument by claiming we should expect mean differences or some sort of selection, then amend "humans of the same race" to "siblings in the same family"!).
Cavalli-Sforza's error is particularly striking since ch8 of the same book devotes extensive discussion, with many graphs, to how genetic drift will greatly differentiate populations over time, and takes pains to point out that the expectation of gene frequencies _p_ is merely the expectation, and the actual frequency will diverge arbitrarily far in a random walk and over a long enough time (connected to the population size) will eventually reach fixation at either _p_ = 1 or _p_ = 0, and certainly not converge exactly on the original _p_.^[In this context, it's interesting to note Cavalli-Sforza's role in [Lewontin's fallacy](!Wikipedia).]
Indeed, given the _a posteriori_ measured extent of average [differences in allele frequency](!Wikipedia "Fixation index") of ~0.12 and the assumption of no selection, [Chuck](https://humanvarieties.org/2016/01/20/the-evolutionary-default-hypothesis-and-negative-hbd/ "The Evolutionary Default Hypothesis and Negative HBD") flips Cavalli-Sforza's argument on its head and points out that the standard genetic drift framework ([Leinonen et al 2013](/docs/genetics/selection/2013-leinonen.pdf "Q~ST~-F~ST~ comparisons: evolutionary and ecological insights from genomic heterogeneity")) dating back to Wright implies that a difference of 0.12 yields large racial differences in polygenic traits!
Clearly, Cavalli-Sforza's argument does not prove what one would like it to.
The sum is not the mean: there is a difference between the sum of a thousand or a million coin flips and the mean of a thousand or a million such sums; the distribution of the mean does indeed converge tightly, but the distribution of the sum just gets broader and broader.
The expectation or mean is just the best estimate one can make over a large number of samples under a particular loss.
But the long-run mean of many samples is not the same thing as the expected deviation of a single sample.
One might be dealing with something like a [Cauchy distribution](!Wikipedia) where there is not a mean in the first place. But more importantly, just because the *expectation* is a certain number like 0, doesn't mean any specific realization will be 0, and indeed the expectation may actually be an impossible value.
(Imagine a variable _X_ which is the sum of an odd number of -1 or +1 variables; the expectation of this _X_ is, of course, 0, however, one will *never* actually observe a sample of _X_ to be 0, because the parity means there will always be a -1 or +1 'left over' inside the sum and so _X_ will always be either -1 or +1. To reach the expectation of 0, one would have to create many _X_s and average them, and the more _X_s one draws and averages, the closer the average will be to 0.)
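This parity argument is easy to check by simulation (a minimal sketch; the choice of 101 variables is arbitrary):

```r
## The sum of an odd number (101) of +/-1 variables has expectation 0
## but can never *equal* 0: the sum of 101 odd numbers is always odd.
set.seed(1)
draws <- replicate(10000, sum(sample(c(-1, 1), 101, replace=TRUE)))
any(draws == 0) # FALSE: no single draw ever hits the expectation
mean(draws)     # but the average of many draws approaches 0
```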
For Gaussians, the single-sample [*sum*](!Wikipedia "Sum of normally distributed random variables") of a set of Gaussian variables is itself Gaussian, with mean equal to the sum of their means and variance equal to the sum of their variances ($\mathcal{N}(\mu_{x_1} + \mu_{x_2} + ... + \mu_{x_n}, \sigma^2_{x_{1}} + \sigma^2_{x_{2}} + ... + \sigma^2_{x_{n}})$); if we sampled repeatedly and averaged, then we would indeed converge on the expected mean, but the expected deviation from the mean of a single sample is governed by the variance, which can be extremely large. The increasing spread means that it would be extremely surprising to get exactly, or even near, the expectation. An example Monte Carlo of the sum of increasing numbers of $\mathcal{N}(0,1)$ deviates demonstrates that the absolute deviation increases as we go from a sum of 100 variables to 10,000 variables:
~~~{.R}
round(sapply(seq(100,10000,by=100), function(x) { mean(replicate(10000, abs(sum(rnorm(x))))) } ))
# [1] 8 11 14 16 18 20 21 23 24 25 26 27 29 30 31 32 33 34 35 35 37 38 38 39 40 41 41 42 43 44 45 45 46 46 47 48 48 49 50 50 51 52 52
# [44] 53 53 54 55 56 56 56 57 58 58 58 59 59 61 61 62 62 62 62 64 63 64 65 65 65 67 66 68 67 68 69 70 70 70 70 71 71 72 72 73 73 74 75
# [87] 74 75 75 75 76 76 77 78 78 78 78 80 80 80
~~~
Or consider a 1D [random walk](!Wikipedia).
The best estimate one can make, under total ignorance of where the walker is, is to guess that the walker is at the origin, 0, regardless of how many steps _n_ it has made.
If we run many random walkers for _n_ steps each and ask what the best prediction of the mean is, we would be right in saying it's 0.
We would not be right in saying that the walkers have not moved far or that we would expect them all to 'converge' and be at or at least very near 0.
However, if one asks what the [expected distance from the origin after _n_ steps](http://mathworld.wolfram.com/RandomWalk1-Dimensional.html) is, the answer turns out to be $\sqrt{\frac{2n}{\pi}}$: the more steps taken, the further we expect to find the walker, even if we cannot predict in which direction it has gone on average.
Similarly, for a Gaussian random walk, after _n_ steps the walker will typically be found a distance of about $\sigma \cdot \sqrt{n}$ from the origin (the root-mean-square distance; and possibly much further).
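The $\sqrt{\frac{2n}{\pi}}$ closed form for the simple ±1 walk is easy to spot-check by Monte Carlo (a sketch; the step count of 10,000 is arbitrary):

```r
## Monte Carlo check of the expected distance from the origin of a
## simple +/-1 random walk after n steps: sqrt(2*n/pi).
set.seed(1)
n <- 10000
walkers <- replicate(10000, sum(sample(c(-1, 1), n, replace=TRUE)))
mean(abs(walkers)) # observed mean distance
sqrt(2 * n / pi)   # theoretical: ~79.8
```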
Further, should there be any absorbing states, we may find that our walkers will not be at their expectation but at the absorber---an example being [Gambler's ruin](!Wikipedia) where repeatedly taking +EV bets can guarantee eventual \$0/bankruptcy, or the infamous [martingale](!Wikipedia "Martingale (betting system)") where the expectation of each bet may be ~0 but the martingale player eventually is either bankrupt or profitable, never exactly even.
Another analogy might be stock-picking: the expected value of each stock is about the same due to efficient markets, and if one indexes and waits a long time, one will likely get the fabled average return of ~7%; but if one buys a single individual stock and waits a long time, the return will likely be either -100% or >>7%.
In any process or scenario in which we are dealing with large numbers of variables which sum to produce a final result, even if each of those variables is neutral, a single sample will potentially be arbitrarily far, in absolute terms, from the expectation, in a way which will surprise someone who believes that the law of large numbers ensures the result must be extremely close to the expectation both relatively and absolutely.
# Oh Deer: Could Deer Evolve to Avoid Car Accidents? {.collapse}
> Driving over the years, I've noticed many deer corpses. Cars seem like they could be a major source of deer mortality. If they are, deer might be evolving behavior to avoid cars. But deer/car accident rates appear stable or increasing (perhaps due to human population growth & construction). How fast would we expect to see any deer adaptation?
>
> Looking at some of the mortality statistics, I model it as a liability threshold trait being selected on via truncation selection, and calculate some hypotheticals about whether and how fast they could adapt.
>
> Teal deer: of course, but it'd be slow.

While driving to NYC recently I passed 3 [roadkill](!Wikipedia "Deer-vehicle collisions") [deer](!Wikipedia "White-tailed deer"), a few of many I have seen over the years, and a thought re-occurred to me: "if all these deer are being killed by cars, shouldn't they be evolving to avoid cars?"
I've seen many dead deer and narrowly avoided a few myself while driving, and deer/car mortality is, if anything, much higher in states like Pennsylvania.
Accident rates would not necessarily show a steep decline thanks to past selection, because the 'environment' is not static here: as cars get faster, accidents become more crippling or lethal to deer; the American population has expanded several-fold in population count, per-capita vehicle-miles, suburban living, and territory fragmentation; and the deer population too has expanded many-fold (from ~0.5m a century ago to <30m now).
So if there was highly effective ongoing selection reducing deer accident risk, we would still observe large absolute and proportional increases in accidents/deaths.
But I am still curious as to what sort of selection we could expect, which is a hint as to long-term trends---the American population is now relatively stabilized in terms of growth and vehicle-miles, and deer appear to have also reached a population equilibrium, so a gradual long-term decline in accident rates might be possible to see in the future if there is substantial response to selection.
Deer accidents seem to be fairly fatal: wild animals are always on the edge, and small injuries can compound into death, so looking at mortality will if anything underestimate the strength of selection.
And of course there will be no single Mendelian gene for it; being a complex behavioral trait, it is almost surely highly polygenic and additive.
So we can treat it as [truncation selection](!Wikipedia) on a binary trait ("killed by a car") in the [liability threshold model](!Wikipedia).
For truncation selection, the two key parameters are the heritability of a 'trait', and the fraction of the population expressing the 'trait'.
The heritability can only be guessed at.
Car accidents ought to be heritable to *some* degree, because everything is, and many behavioral traits like risk aversion or diurnality or reaction-time or startle reflex or wanderlust would affect being hit by a car (a deer can reduce risk by avoiding roads entirely, not traveling far, waiting to cross until very late at night when there is no traffic or during the day when they can be easily seen).
Response to selection need not cause some hardwired behavioral change like aversion to travel: it might yield something like the [Baldwin effect](!Wikipedia), where the response is for behavioral flexibility, and more fit deer are better able to learn how to navigate traffic gaps by watching other deer or imitating their mother.
The "anthropocene" has led to many animals evolving or otherwise learning how to adapt, urban environments included[^urban-change], so why would deer be any exception?
[^urban-change]: Some relevant links:
- ["Global urban signatures of phenotypic change in animal and plant populations"](http://www.pnas.org/content/early/2017/01/01/1606034114.abstract), Alberti et al 2017
- ["Signatures of positive selection and local adaptation to urbanization in white-footed mice (_Peromyscus leucopus_)"](https://www.biorxiv.org/content/early/2017/09/26/038141), Harris & Munshi-South 2017 (city mice adapting to eat human food)
- ["Contrasting the effects of natural selection, genetic drift and gene flow on urban evolution in white clover (_Trifolium repens_)"](http://rspb.royalsocietypublishing.org/content/285/1883/20181019), Johnson et al 2018
- ["What Makes a City Ant? Maybe Just 100 Years of Evolution"](https://www.nytimes.com/2017/04/03/science/acorn-ants-evolution-cleveland.html)
- ["Urban Evolution: How Species Adapt, or Don't, to City Living; A Conversation With Jonathan B. Losos"](https://www.edge.org/conversation/jonathan_b_losos-urban-evolution)
- [Moscow subway dogs](!Wikipedia "Street dogs in Moscow#Subway dwelling dogs") (not necessarily genetic except perhaps in a Baldwin effect way, but a demonstration of behavioral flexibility/learning)
- Ridley: ["Cities Are the New Galapagos"](http://www.rationaloptimist.com/blog/the-wealth-of-urban-biodiversity/)/["The Sixth Genesis: a man-made, mass-speciation event"](http://www.rationaloptimist.com/blog/mass-speciation/)
- [_Darwin Comes to Town: How the Urban Jungle Drives Evolution_](https://www.amazon.com/gp/product/1250127823/), Schilthuizen 2018
Complicating things is the possibility that the heritability is high but actual responses to selection are lower than expected when estimated in a univariate single-trait fashion, because there might be a [genetic correlation](!Wikipedia) with another fitness-influencing trait, where better car avoidance means worse values of that other trait---perhaps it would be easy to avoid cars by avoiding roads or traveling far, but this has the drawback of preventing relocation to avoid starvation or hunters, in which case response to selection will be small or even opposite of expected (this is plausibly one of the main reasons why wild populations may not evolve as fast as predicted: ["The Missing Response to Selection in the Wild"](http://www.cell.com/trends/ecology-evolution/abstract/S0169-5347\(18\)30039-9), Pujol et al 2018).
I can only guess at the heritability, as I doubt heritabilities have been calculated for much in deer, but I would predict it's <50% simply because wild animals are under constant selection and car-based selection would've started almost a century ago.
It might seem impossible to calculate the heritability of a 'trait' like being hit by a car in wild animals, but I think it's doable. Wild animals have no twin studies or family pedigrees, of course, and methods like common-garden rearing likewise seem questionable, but one could use tracking devices to follow families of deer until they all die, constructing a pedigree with cause of death; more feasibly, one could collect DNA samples from dead car-accident deer and dead non-accident deer, and compute genomic similarity with a procedure like [GCTA](!Wikipedia) (for SNP heritability) or whole-genome sequencing data (upcoming methods for recovering full heritability).
But this hasn't been done as far as I know. Oh well. We can look at a range of heritabilities 0-50%.
The fraction of deer hit is a little easier. Wild deer live ~[3-4y on average](https://www.adirondackalmanack.com/2016/05/understanding-life-spans-whitetail-deer.html "'Understanding The Life Span Of Whitetail Deer', Hetzler 2016") (eg [Lopez et al 2004](https://www.jstor.org/stable/pdf/3803390.pdf "Population Density of the Endangered Florida Key Deer") on Key deer implies ~3.7y), sexually maturing ~1.5y, and of the ~25m deer in the USA, around 1.5m are killed by cars annually ~2012 (according to [_The Atlantic_ with no citation](https://www.theatlantic.com/magazine/archive/2012/11/the-deer-paradox/309104/ "The Deer Paradox: It's never been easier to shoot a buck. So why are hunters spending billions on high-tech gear?")) so perhaps ~5% annual mortality from cars ([McShea et al 2008](https://www.jstor.org/stable/pdf/24875111.pdf "'Factors affecting autumn deer/vehicle collisions in a rural Virginia county', McShea et al 2008") estimates 2% in a sampled Virginia county, while [McShea 2012](https://repository.si.edu/bitstream/handle/10088/18355/nzp_changing_world_mcshea.pdf "Ecology and management of white-tailed deer in a changing world") suggests 1.2m deaths out of 25m deer); if deer live 4 years and have a 5% annual risk of being killed by a car, then their lifetime risk should be `1 - 0.95^4 = 18%` or perhaps 8%, which sounds like a reasonable range---a substantial source of mortality but probably less than hunting or starvation or disease.
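The lifetime-risk arithmetic is simple enough to spell out explicitly (a minimal sketch: the 5% & 2% annual-mortality figures are the estimates quoted above, and a constant annual hazard over a fixed ~4-year lifespan is assumed):

~~~{.R}
## lifetime risk of being killed by a car, given a constant annual hazard over a fixed lifespan:
lifetimeRisk <- function(annualRisk, years) { 1 - (1 - annualRisk)^years }
lifetimeRisk(0.05, 4) ## the _Atlantic_-derived ~5% annual estimate
# [1] 0.18549375
lifetimeRisk(0.02, 4) ## McShea et al 2008's ~2% Virginia estimate
# [1] 0.07763184
~~~

These two endpoints give the ~8-18% lifetime range used in the scenarios below.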
The effect of a generation of truncation selection on a binary trait following the liability-threshold model is more complicated but follows a similar spirit.
R implementation of pg6 of ["Chapter 14: Short-term Changes in the Mean: 2. Truncation and Threshold Selection"](/docs/genetics/selection/2013-walsh-book2-ch14-draft.pdf), Lynch & Walsh:
~~~{.R}
threshold_select <- function(fraction_0, heritability, verbose=FALSE) {
    fraction_probit_0 = qnorm(fraction_0)
    ## selection intensity from truncating everyone past the liability threshold:
    s_0 = dnorm(fraction_probit_0) / fraction_0
    ## new rate of the trait after one round of selection in which 100% of past-the-threshold individuals never reproduce:
    fraction_probit_1 = fraction_probit_0 + heritability * s_0
    fraction_1 = pnorm(fraction_probit_1)
    ## how much did we reduce the trait in percentage terms?
    if (verbose) {
        print(paste0("Start: population fraction: ", fraction_0, "; liability threshold: ", fraction_probit_0, "; Selection intensity: ", s_0))
        print(paste0("End: liability threshold: ", fraction_probit_1, "; population fraction: ", fraction_1, "; Total population reduction: ",
            fraction_0 - fraction_1, "; Percentage reduction: ", (1-((1-fraction_1) / (1-fraction_0)))*100)) }
    return(c(fraction_probit_1, fraction_1, fraction_0 - fraction_1))
}
threshold_select(1-0.18, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.82; liability threshold: 0.915365087842814; Selection intensity: 0.320000021339773"
# [1] "End: liability threshold: 1.0753650985127; population fraction: 0.858894349391959; Total population reduction: -0.0388943493919587; Percentage reduction: 21.6079718844215"
# [1] 1.07536509851 0.85889434939 -0.03889434939
threshold_select(1-0.08, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.92; liability threshold: 1.40507156030963; Selection intensity: 0.161593724197447"
# [1] "End: liability threshold: 1.48586842240836; population fraction: 0.931343036233605; Total population reduction: -0.0113430362336053; Percentage reduction: 14.1787952920067"
# [1] 1.48586842241 0.93134303623 -0.01134303623
~~~
We can look at a range of scenarios for population prevalences 8-18%, and heritabilities 5%-50%.
Aside from the per-generation increase in car-avoiding deer/decrease in deer-car accidents, it might also be interesting to calculate a hypothetical like "how many generations/years would it take for natural selection, in a static environment, to reduce lifetime mortality to 1%?"
~~~{.R}
thresholdModeling <- function(fraction_0, heritability, targetPercentage) {
    firstGeneration <- threshold_select(fraction_0, heritability)
    ## estimate how many generations of truncation selection until the target percentage is reached:
    i <- 1; fraction_i <- firstGeneration[2]
    while (fraction_i < targetPercentage) {
        i <- i+1
        nextGeneration <- threshold_select(fraction_i, heritability)
        fraction_i <- nextGeneration[2]
    }
    return(c(firstGeneration, i)) }
thresholdModeling(1-0.20, 0.5, 1-0.01)
df <- expand.grid(Fraction=(1-seq(0.08, 0.18, by=0.02)), Heritability=seq(0.05, 0.50, by=0.05))
df <- cbind(df, round(digits=3, do.call(rbind, Map(thresholdModeling, df$Fraction, df$Heritability, 0.99))))
colnames(df)[3:6] <- c("Threshold.latent", "Fraction.new", "Fraction.reduction", "Generations.to.onepercent")
df$Years <- round(df$Generations.to.onepercent * 3.5)
df
~~~
Fraction Heritability Latent threshold Fraction 2nd Fraction reduction Generations to 1% Years
------- ------------- ---------------- ------------ ------------------ ------------------------- ------
0.92 0.05 1.413 0.921 -0.001 300 1050
0.90 0.05 1.291 0.902 -0.002 314 1099
0.88 0.05 1.186 0.882 -0.002 324 1134
0.86 0.05 1.093 0.863 -0.003 332 1162
0.84 0.05 1.009 0.843 -0.003 338 1183
0.82 0.05 0.931 0.824 -0.004 343 1200
0.92 0.10 1.421 0.922 -0.002 150 525
0.90 0.10 1.301 0.903 -0.003 157 550
0.88 0.10 1.198 0.884 -0.004 162 567
0.86 0.10 1.106 0.866 -0.006 166 581
0.84 0.10 1.023 0.847 -0.007 169 592
0.82 0.10 0.947 0.828 -0.008 171 598
0.92 0.15 1.429 0.924 -0.004 100 350
0.90 0.15 1.311 0.905 -0.005 104 364
0.88 0.15 1.209 0.887 -0.007 108 378
0.86 0.15 1.119 0.868 -0.008 110 385
0.84 0.15 1.038 0.850 -0.010 112 392
0.82 0.15 0.963 0.832 -0.012 114 399
0.92 0.20 1.437 0.925 -0.005 75 262
0.90 0.20 1.321 0.907 -0.007 78 273
0.88 0.20 1.220 0.889 -0.009 81 284
0.86 0.20 1.132 0.871 -0.011 82 287
0.84 0.20 1.052 0.854 -0.014 84 294
0.82 0.20 0.979 0.836 -0.016 85 298
0.92 0.25 1.445 0.926 -0.006 60 210
0.90 0.25 1.330 0.908 -0.008 62 217
0.88 0.25 1.232 0.891 -0.011 64 224
0.86 0.25 1.145 0.874 -0.014 66 231
0.84 0.25 1.067 0.857 -0.017 67 234
0.82 0.25 0.995 0.840 -0.020 68 238
0.92 0.30 1.454 0.927 -0.007 50 175
0.90 0.30 1.340 0.910 -0.010 52 182
0.88 0.30 1.243 0.893 -0.013 54 189
0.86 0.30 1.158 0.877 -0.017 55 192
0.84 0.30 1.081 0.860 -0.020 56 196
0.82 0.30 1.011 0.844 -0.024 57 200
0.92 0.35 1.462 0.928 -0.008 43 150
0.90 0.35 1.350 0.911 -0.011 44 154
0.88 0.35 1.255 0.895 -0.015 46 161
0.86 0.35 1.171 0.879 -0.019 47 164
0.84 0.35 1.096 0.863 -0.023 48 168
0.82 0.35 1.027 0.848 -0.028 48 168
0.92 0.40 1.470 0.929 -0.009 37 130
0.90 0.40 1.360 0.913 -0.013 39 136
0.88 0.40 1.266 0.897 -0.017 40 140
0.86 0.40 1.184 0.882 -0.022 41 144
0.84 0.40 1.110 0.867 -0.027 42 147
0.82 0.40 1.043 0.852 -0.032 42 147
0.92 0.45 1.478 0.930 -0.010 33 116
0.90 0.45 1.369 0.915 -0.015 34 119
0.88 0.45 1.277 0.899 -0.019 35 122
0.86 0.45 1.197 0.884 -0.024 36 126
0.84 0.45 1.125 0.870 -0.030 37 130
0.82 0.45 1.059 0.855 -0.035 37 130
0.92 0.50 1.486 0.931 -0.011 30 105
0.90 0.50 1.379 0.916 -0.016 31 108
0.88 0.50 1.289 0.901 -0.021 32 112
0.86 0.50 1.210 0.887 -0.027 33 116
0.84 0.50 1.139 0.873 -0.033 33 116
0.82 0.50 1.075 0.859 -0.039 34 119
Table: Truncation selection results for various scenarios of deer-car lifetime mortality rates & heritabilities.
As expected, scenarios where deer accidents are rare achieve 1% faster (less work to do), and higher heritabilities produce faster responses (more useful genes).
In most of the scenarios, the per-generation reduction is small, often less than a percentage point, so it would be easy to miss, and the near-complete elimination of deer-car accidents would take centuries of adaptation even with a fixed environment.
So while deer might be evolving to reduce accident mortality, it would be hard to see and won't help much anytime soon.
(For the same reasons, I would expect [squirrel-based power outages](https://www.washingtonpost.com/local/for-electric-utilities-sharp-toothed-daredevil-squirrels-are-enemy-number-one/2019/04/15/938f3bd8-5d53-11e9-9625-01d48d50ef75_story.html "For electric utilities, sharp-toothed, daredevil squirrels are Enemy Number One") to continue indefinitely.)
So if it's a problem, one'd better hurry up with safety measures like road fencing or self-driving cars.
# Evolution as Backstop for Reinforcement Learning
> One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase's paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems *are* solved. Expanding on some Cosma Shalizi comments, I suggest a group selection perspective: one reason for free market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, and reinforcement learning & meta-learning in AI. This view suggests that there are inherent balances between market/non-market mechanisms, reflecting the relative advantages of a slow unbiased method versus faster but potentially arbitrarily biased methods.
*Split out to ["Evolution as Backstop for Reinforcement Learning"](/Backstop).*
# Acne: a good Quantified Self topic {.collapse}
> I suggest that teenagers interested in experimentation, statistics, or Quantified Self experiment with interventions to reduce acne, listing the advantages of the topic. To suggest specific interventions, I re-analyze & rank the April 2016 CureTogether crowdsourced ratings of ~113 acne interventions; dietary interventions rate particularly highly after standard retinoids (like Accutane) & benzoyl peroxide and might be worth closer investigation.

[Quantified Self](!Wikipedia) [acne](!Wikipedia) self-experiments are under-done and potentially valuable.
In retrospect, one kind of self-experiment occurs to me which would have been particularly valuable---I regret getting into QS and statistics only long after my (not a little painful) acne problems had largely subsided.
The [conventional wisdom on acne](https://www.nytimes.com/2019/01/07/well/live/managing-teenage-acne.html) seems poorly supported and neglectful of individual differences, and I've long suspected that some rigorous testing might turn up some interesting results which could be useful to Western teenagers.
(My acne wasn't *so* bad, in retrospect, but it still frustrated me a great deal; in extreme cases, it can contribute to suicide.)
Looking back, I can see how easy it would be to test the various theories. For example, facial washes & light boxes could be tested efficiently by blocking---randomizing which half of your face they are applied to.
Diet too would be useful to do as a factorial experiment---for 3 or 4 weeks, cut out all the carbs while loading up on dairy products, then vice versa.
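The split-face design above can be sketched in a few lines of R; the pimple counts here are simulated under an invented modest treatment effect, purely to illustrate the randomization & paired analysis:

~~~{.R}
## minimal sketch of a split-face randomized trial: each day the treated side of the
## face is randomized, both sides are photographed & pimple-counted (ideally blind),
## and a paired test compares treated vs untreated counts.
set.seed(1)
days <- 28
treatedSide <- sample(c("left", "right"), days, replace=TRUE)
## simulate hypothetical daily pimple counts with a modest benefit of treatment:
untreated <- rpois(days, lambda=5)
treated   <- rpois(days, lambda=4)
t.test(treated, untreated, paired=TRUE)
~~~

Because each day's pair of face-halves shares diet, sleep, stress, etc, the within-subject blocking removes most of that noise and considerably increases power over a between-person comparison.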
As a topic for self-experimentation, acne has many advantages:
- many highly-motivated subjects with ample free time
- objective visible effects where the data can even be easily recorded for blind ratings by third parties to reduce bias (ie facial photographs)
- factorial experiments or blocking within-subject is possible for some acne interventions, which would considerably increase power
- in severe acne cases, changes would be easy to see within days/weeks, and large effects are plausible
- a large number of already-known possible interventions none of which are particularly strongly supported (so high [VoI](!Wikipedia "Value of Information")); a good starting point would be the [CureTogether crowdsourced rankings](https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/) (see [next section for specifics](#curetogether-acne-interventions)).
- most of the interventions are safe & easy to implement: it's not hard to use a face wash twice a day, apply some [benzoyl peroxide](!Wikipedia) cream, cut out all dairy products, etc
- new commercial products allow for interesting hobbyist projects---what would microbiome sampling show about the microbial communities of teens with high acne vs low acne? About their microbial communities *before* the high acne group developed acne? After treatment with benzoyl peroxide or antibiotics or retinol? etc. With a large enough community dataset, interesting techniques could be applied for measuring heterogeneity in individual response to treatments (eg GCTA-style variance components on acne response could be enlightening, as genes would be expected to be the source of much of the individual differences, acne itself being heritable)
- a launching pad into many areas of science & statistics---relevant topics include human & microbial genetics, methodological biases, regression to the mean, optimal experimental design, sequential testing, meta-analysis, time-series, deep learning... ("When making an axe handle with an axe, the model is near at hand.")
One could spin off lots of projects to increase rigor---instead of counting by hand, you could take smartphone photos and feed them into a pimple-counting CNN. With all the frameworks like PyTorch now, that's well within the ability of a bright computer-inclined high schooler. (And then put it up as a web service so other teens can run their own self-experiments & score their pimple photographs, why not...)
Indeed, [Seth Roberts](!Wikipedia), one of the early Quantified Self proponents, credits his interest in self-experiments to acne ([Roberts 2001](https://escholarship.org/uc/item/5bv8c7p3 "Surprises from self-experimentation: Sleep, mood, and weight")/[Roberts 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2964443/ "The Unreasonable Effectiveness of My Self-Experimentation")), when he found [tetracycline](!Wikipedia) didn't help his acne but benzoyl peroxide did:
> My interest in self-experimentation began when I read [an article about teaching mathematics by Paul Halmos](https://www.sethroberts.net/wp-content/uploads/2013/01/halmos.pdf "'The Problem of Learning to Teach', Halmos 1975"), a professor at Indiana University. Halmos emphasized that "the best way to learn is to do." I was trying to learn how to do experiments; I took this advice to mean I should do as many as possible. I could do more experiments, I realized, if I not only did rat experiments but also did experiments with myself as the subject. So I started doing small self-experiments. Most of them were trivial and led nowhere (e.g., experiments about juggling). At the time I had acne. My dermatologist had prescribed both pills (tetracycline, a wide-spectrum antibiotic) and a cream (active ingredient benzoyl peroxide). Simply for the sake of doing experiments, any experiments, I did simple tests to measure the effectiveness of these treatments. I believed the pills were powerful and the cream had little effect. To my great surprise, the tests showed the opposite: The cream was powerful and the pills had little effect. It was very useful information. Many years later, an article in the British Journal of Dermatology reported that antibiotic-resistant acne is common.
Acne came up several times on Roberts's blog:
- Roberts has suggested that teens could run acne self-experiments as a group project: ["Acne Clubs"](https://sethroberts.net/2013/05/01/the-acne-club/ "Acne Club: A New Way to Fight Acne")
- A possible general cause: [glycemic index/insulin signaling](https://sethroberts.net/2013/02/24/progress-in-reducing-acne/)
- Anecdotes:
- [oolong/black tea allergy as acne trigger](https://sethroberts.net/2014/02/10/acne-sometimes-due-to-allergy/)
- pasteurized dairy trigger: [1](https://sethroberts.net/2012/10/30/acne-caused-by-pasteurized-dairy-how-one-person-figured-it-out/), [2](https://boingboing.net/2012/08/09/make-yourself-healthy-searchi.html "Make yourself healthy: Searching for the cause of acne --Martha Rotter")
- sugar/carb triggers: [1](https://sethroberts.net/2008/10/03/diet-and-acne-continued/), [2](https://sethroberts.net/2008/04/26/more-about-acne-continued/)
- soap trigger: [1](https://sethroberts.net/2008/05/05/more-acne-self-experimentation/); lack of soap trigger: [2](https://sethroberts.net/2008/04/23/more-about-acne/)
- [acne reduction due to milk thistle](https://sethroberts.net/2014/02/02/adult-acne-water-and-milk-thistle/)
## CureTogether acne interventions
CureTogether (2008-2012) was a social network/crowdsourced website for aggregating & ranking treatments of various problems, typically diseases, and had an acne page.
They were bought by [23andMe](!Wikipedia) [in 2012](https://blog.23andme.com/news/curetogether-with-23andme/), and [operated for a while](https://blog.23andme.com/tag/curetogether/) until the website seems to've vanished around mid-2016 & the data appears to no longer be available anywhere.
The last Internet Archive snapshot of their [acne page](https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/) is 2016-04-01.
Their presentation of uncertainty & average ranking of the interventions is useless; but they do provide the totals for each of the rating levels, so I decided to fix it by extracting & re-analyzing the ratings in a multi-level Bayesian ordinal regression with a weakly informative prior to get more useful posterior estimates of ratings.
(While I was at it, I fixed some spelling errors and merged a few interventions which were the same, like "Accutane" and "Isotretinoin (Roaccutane, Accutane)".)
~~~{.R}
## https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/
## "much worse / slightly worse / no effect / moderate improvement / major improvement" → 1-5
acne <- read.csv("https://www.gwern.net/docs/biology/2016-04-01-curetogether-acne.csv",
    header=TRUE, colClasses=c("integer", "integer", "factor"))
library(skimr)
skim(acne)
# Skim summary statistics
# n obs: 439
# n variables: 3
#
# ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n n_unique top_counts ordered
# Intervention 0 439 439 113 Acn: 5, Agi: 5, Amo: 5, Ant: 5 FALSE
#
# ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# Effect 0 439 439 3.16 1.33 1 2 3 4 5 ▅▆▁▇▁▇▁▆
# N 0 439 439 18.38 34.74 1 2 6 19 324 ▇▁▁▁▁▁▁▁
library(brms)
b <- brm(Effect | weights(N) ~ (1|Intervention), prior=c(prior(student_t(3, 0, 1), "sd")),
    family=cumulative(), iter=5000, chains=30, cores=30, data=acne)
b
# Group-Level Effects:
# ~Intervention (Number of levels: 113)
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sd(Intercept) 0.64 0.06 0.53 0.76 16503 1.00
#
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept[1] -3.44 0.09 -3.62 -3.26 20848 1.00
# Intercept[2] -2.17 0.08 -2.32 -2.02 17179 1.00
# Intercept[3] 0.49 0.07 0.35 0.64 15767 1.00
# Intercept[4] 2.75 0.08 2.58 2.91 19558 1.00
red <- as.data.frame(ranef(b)$Intervention)
round(red[order(red$Estimate.Intercept, decreasing=TRUE),], digits=2)
## the ordinal model forest plot doesn't make much intuitive sense, so redo as a normal-distribution
## for easier forest plotting (the rankings are more or less identical, perhaps a little less precise):
bg <- brm(Effect | weights(N) ~ (1|Intervention), prior=c(prior(student_t(3, 0, 1), "sd")),
    family=gaussian(), iter=5000, chains=30, cores=30, data=acne)
## Extract & sort estimates:
coefs <- as.data.frame(coef(bg))
coefs[order(coefs$Intervention.Estimate.Intercept, decreasing=TRUE),]
## Plot estimates:
library(brmstools)
forest(bg)
~~~
![Forest plot: posterior rankings of acne interventions by CureTogether users as of April 2016 (computed using a Gaussian Bayesian multilevel model).](/images/nootropics/qs-acne-curetogether-gaussian-forest.png){.full-width}
| Intervention | Estimate | SE | 2.5% | 97.5% |
|---------------------------------------------------------|----------|------|------|-------|
| Isotretinoin (Roaccutane, Accutane) | 4.04 | 0.06 | 3.91 | 4.16 |
| Paleo Diet | 3.75 | 0.08 | 3.58 | 3.92 |
| No gluten | 3.67 | 0.07 | 3.51 | 3.82 |
| Azelaic acid (Azelex) | 3.65 | 0.16 | 3.32 | 3.99 |
| Bactrim (Trimethoprim/sulfamethoxazole) | 3.62 | 0.12 | 3.36 | 3.87 |
| Ketogenic Diet | 3.59 | 0.18 | 3.23 | 3.95 |
| Clindamyacin phosphate gel | 3.58 | 0.13 | 3.31 | 3.85 |
| Glycolic acid | 3.56 | 0.21 | 3.14 | 3.99 |
| Ziana (Clindamycin/tretinoin) | 3.56 | 0.21 | 3.14 | 3.99 |
| Amoxicillin | 3.56 | 0.14 | 3.27 | 3.85 |
| No sugar | 3.55 | 0.07 | 3.41 | 3.70 |
| No dairy | 3.52 | 0.06 | 3.40 | 3.64 |
| Diane-35 (cyproterone/ethinyl estradiol) | 3.52 | 0.12 | 3.27 | 3.77 |
| Epiduo Gel | 3.51 | 0.15 | 3.20 | 3.82 |
| having a good skincare routine | 3.49 | 0.17 | 3.16 | 3.83 |
| eliminating comedogenic ingredients... | 3.49 | 0.21 | 3.07 | 3.93 |
| Sit out in the sun...at least 15 minutes | 3.49 | 0.05 | 3.38 | 3.60 |
| Avoid touching face | 3.49 | 0.03 | 3.41 | 3.57 |
| Benzoyl peroxide 10% | 3.49 | 0.09 | 3.30 | 3.67 |
| Ole Henriksen products | 3.47 | 0.18 | 3.10 | 3.84 |
| No chocolate | 3.45 | 0.22 | 3.02 | 3.89 |
| Doryx (Doxycycline) | 3.45 | 0.07 | 3.30 | 3.59 |
| foot bath with baking soda | 3.45 | 0.24 | 2.98 | 3.93 |
| Bikram Yoga | 3.45 | 0.24 | 2.97 | 3.93 |
| Tretinoin | 3.45 | 0.24 | 2.97 | 3.93 |
| Retin A | 3.45 | 0.05 | 3.33 | 3.56 |
| Birth control pill / Balance hormones | 3.44 | 0.05 | 3.33 | 3.55 |
| Zinc soap | 3.43 | 0.20 | 3.04 | 3.84 |
| Washing face | 3.43 | 0.03 | 3.36 | 3.49 |
| No fast food | 3.42 | 0.06 | 3.30 | 3.55 |
| PhotoDynamic Therapy | 3.42 | 0.15 | 3.11 | 3.73 |
| Exuviance Vespera Serum | 3.41 | 0.23 | 2.96 | 3.87 |
| Helminthic therapy | 3.41 | 0.23 | 2.96 | 3.87 |
| Clindamycin 1% dabber | 3.41 | 0.07 | 3.25 | 3.56 |
| White vinegar | 3.40 | 0.17 | 3.06 | 3.75 |
| Washing pillowcases regularly | 3.40 | 0.05 | 3.29 | 3.50 |
| AcZone | 3.39 | 0.20 | 2.99 | 3.81 |
| Tazorac | 3.39 | 0.16 | 3.07 | 3.71 |
| Tetracycline | 3.39 | 0.05 | 3.29 | 3.49 |
| Zinc cream | 3.39 | 0.15 | 3.09 | 3.69 |
| Benzaclin (Benzoyl peroxide/clindamycin) | 3.39 | 0.09 | 3.19 | 3.58 |
| Benzoyl peroxide 2.5% | 3.38 | 0.03 | 3.30 | 3.46 |
| Aveeno Skin Brightening Daily Scrub | 3.38 | 0.22 | 2.94 | 3.82 |
| Metrogel | 3.38 | 0.17 | 3.03 | 3.72 |
| Retinol Oral Supplementation | 3.37 | 0.20 | 2.97 | 3.77 |
| Duac Gel | 3.37 | 0.10 | 3.16 | 3.57 |
| avoid spicy food | 3.35 | 0.19 | 2.97 | 3.74 |
| Placing clean towel on pillow each night | 3.35 | 0.11 | 3.13 | 3.56 |
| Pantothenic acid | 3.35 | 0.13 | 3.09 | 3.61 |
| Antibiotic cream | 3.33 | 0.08 | 3.18 | 3.49 |
| Rosac | 3.33 | 0.19 | 2.96 | 3.70 |
| Sulfur Powder | 3.33 | 0.14 | 3.05 | 3.61 |
| Mychelle clear skin serum | 3.33 | 0.17 | 2.98 | 3.68 |
| Mario Badescu Drying Lotion | 3.32 | 0.14 | 3.03 | 3.61 |
| Drink a lot of water | 3.31 | 0.13 | 3.05 | 3.57 |
| Acnetrex | 3.31 | 0.22 | 2.87 | 3.74 |
| Spironolactone | 3.30 | 0.16 | 2.97 | 3.63 |
| Blemish potion | 3.30 | 0.13 | 3.03 | 3.57 |
| Dr. Hauschka Natural Skin Care | 3.30 | 0.14 | 3.02 | 3.58 |
| Erythromycin topical solution | 3.30 | 0.10 | 3.10 | 3.50 |
| Using fresh aloe vera leaves on skin | 3.30 | 0.11 | 3.07 | 3.52 |
| Evening Primrose Oil | 3.29 | 0.21 | 2.86 | 3.71 |
| Salicylic acid | 3.29 | 0.04 | 3.20 | 3.37 |
| Vaccination with individually developed vaccine[^auto-v] | 3.28 | 0.21 | 2.86 | 3.70 |
| Stievia-A Cream 0.025% | 3.28 | 0.17 | 2.94 | 3.63 |
| Vicco Turmeric Skin Cream | 3.27 | 0.24 | 2.80 | 3.75 |
| Dr. Andrew Weil for Origins | 3.27 | 0.17 | 2.93 | 3.61 |
| Cleanance K by Avene | 3.27 | 0.14 | 2.99 | 3.55 |
| OBAGI | 3.27 | 0.15 | 2.96 | 3.58 |
| N-acetylcysteine | 3.27 | 0.20 | 2.86 | 3.67 |
| Differin Gel | 3.26 | 0.12 | 3.01 | 3.50 |
| Avoiding pork | 3.25 | 0.16 | 2.92 | 3.57 |
| Chemical-free, vegan cosmetics / personal care products | 3.25 | 0.07 | 3.09 | 3.40 |
| Washing face with baking soda | 3.24 | 0.19 | 2.85 | 3.62 |
| Gamma Linolenic Acid | 3.23 | 0.22 | 2.79 | 3.67 |
| Oil Cleansing Method | 3.23 | 0.10 | 3.02 | 3.44 |
| Acne Free | 3.22 | 0.10 | 3.01 | 3.43 |
| African Black Soap | 3.22 | 0.14 | 2.94 | 3.50 |
| Aloe Vera Soap | 3.21 | 0.15 | 2.91 | 3.51 |
| Aromatherapy | 3.21 | 0.16 | 2.89 | 3.53 |
| Urine therapy | 3.20 | 0.20 | 2.79 | 3.61 |
| Hydrocortisone cream | 3.20 | 0.14 | 2.91 | 3.49 |
| No eggs | 3.20 | 0.12 | 2.95 | 3.45 |
| Air Purifier | 3.19 | 0.14 | 2.91 | 3.47 |
| Aveda Enbrightenment Wash | 3.18 | 0.19 | 2.79 | 3.56 |
| Vegan diet | 3.18 | 0.08 | 3.01 | 3.35 |
| Cetaphil for sensitive skin | 3.18 | 0.06 | 3.05 | 3.30 |
| Vitamin D | 3.17 | 0.06 | 3.05 | 3.29 |
| Zinc | 3.16 | 0.07 | 3.02 | 3.30 |
| Burt's Bees blemish stick | 3.15 | 0.18 | 2.80 | 3.51 |
| Jojoba Oil | 3.13 | 0.20 | 2.73 | 3.52 |
| Purpose Gentle Cleansing Bar | 3.12 | 0.11 | 2.89 | 3.34 |
| Honey | 3.11 | 0.08 | 2.94 | 3.29 |
| Burt's Bees Cleansing Milk | 3.10 | 0.17 | 2.75 | 3.45 |
| Tea tree oil | 3.10 | 0.05 | 2.99 | 3.20 |
| Aging (improvement after around age 19) | 3.08 | 0.05 | 2.98 | 3.18 |
| Alpha Hydroxy Acid | 3.05 | 0.12 | 2.80 | 3.30 |
| Flax oil | 3.05 | 0.10 | 2.84 | 3.25 |
| Exercise | 3.04 | 0.05 | 2.93 | 3.15 |
| Pregnancy | 3.04 | 0.15 | 2.73 | 3.35 |
| St. Ives Apricot Scrub | 3.04 | 0.06 | 2.92 | 3.16 |
| Apple cider vinegar tonic | 3.03 | 0.08 | 2.86 | 3.21 |
| Washing face with water only | 3.03 | 0.06 | 2.90 | 3.16 |
| Proactiv | 3.03 | 0.05 | 2.92 | 3.13 |
| Toothpaste on acne | 3.01 | 0.06 | 2.89 | 3.13 |
| Rubbing Alcohol | 2.99 | 0.07 | 2.85 | 3.13 |
| Hydrogen peroxide | 2.98 | 0.07 | 2.83 | 3.13 |
| Bar soap | 2.96 | 0.05 | 2.85 | 3.07 |
| Sauna | 2.96 | 0.11 | 2.73 | 3.19 |
| Murad AcneComplex | 2.93 | 0.10 | 2.72 | 3.14 |
| Clearasil Stayclear | 2.91 | 0.06 | 2.79 | 3.02 |
| Emu oil | 2.81 | 0.18 | 2.44 | 3.17 |
| Not washing face at all | 2.75 | 0.16 | 2.43 | 3.06 |
Table: Posterior estimates of average CureTogether crowd-sourced ratings for acne treatments, in descending order (higher=better).
[^auto-v]: This sounds bizarre but appears to refer to a real (albeit obscure) medical treatment more commonly named "autovaccines" or "autovaccination" (more widely used before the introduction of antibiotics), where part of the localized infection is cultured & then injected to try to trigger a more powerful, systemic, immune response.
One thing which would be nice to do with the CureTogether ratings is examine the clustering of interventions.
Some of these interventions are doubtless the same thing and should be merged together; others are different but may act through the same mechanisms and so could be considered about the same thing too.
Doing clustering or factor analysis might pop up a handful of major 'approaches', like 'antibiotics' vs 'cleaning' vs 'dietary interventions', and that would make the available options much easier to understand.
More interestingly, interventions almost certainly correlate with each other & predict success of each other, and this could be used to provide flowcharts of advice along the lines of "if X didn't work for you, then Y might and Z probably won't".
Acne treatment could be considered as a Partially Observable Markov Decision Problem where the effectiveness of each treatment is unknown but has a prior probability based on the global ratings, and one treatment's result informs the posterior probability of success of the untried treatments; a POMDP can be solved to give an optimal sequence of treatments to try.
(I've done a simple example of this for [cats and cat stimulants like catnip](/Catnip#optimal-catnip-alternative-selection-solving-the-mdp).)
And this approach also immediately gives a principled way for users to collectively experiment by posterior sampling (like [Thompson sampling](!Wikipedia), but one solves the MDP which is defined after sampling a possible parameter value from each parameter's posterior & treating it as the true value).
Unfortunately, that requires each CureTogether rater's individual data in order to examine within-individual correlations of ratings; the web interface doesn't provide that, and since CureTogether has been acquired & disappeared, there's no one to ask for the raw data now.
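The posterior-sampling idea can still be sketched with invented numbers. In this minimal R example, each treatment's unknown personal success probability is modeled as a Beta distribution; the treatment names come from the table above but the Alpha/Beta pseudo-counts are made up for illustration, standing in for priors derived from the global ratings:

~~~{.R}
## Thompson sampling over acne treatments: draw one plausible success rate per
## treatment from its Beta posterior, then try whichever treatment drew highest.
## (The pseudo-counts below are invented for illustration, not fit to the data.)
set.seed(2016)
treatments <- data.frame(Treatment=c("Isotretinoin", "No dairy", "Benzoyl peroxide", "Tea tree oil"),
    Alpha=c(8, 6, 5, 3), Beta=c(2, 4, 5, 7), stringsAsFactors=FALSE)
thompsonChoose <- function(d) { d$Treatment[which.max(rbeta(nrow(d), d$Alpha, d$Beta))] }
choice <- thompsonChoose(treatments)
## after trying `choice`, increment its Alpha on success (or Beta on failure) & repeat;
## this automatically balances exploring doubtful treatments against exploiting good ones.
~~~

The full POMDP version differs only in that, instead of greedily trying the highest draw, one solves the sampled MDP for an optimal treatment *sequence*.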
# Fermi calculations {.collapse}
> A short discussion of "Fermi calculations": quick-and-dirty approximate answers to quantitative questions, which prize cleverness in exploiting the implications of common knowledge or basic principles to give reasonable answers to apparently unanswerable questions. Links to discussions of Fermi estimates, and a list of some Fermi estimates I've done.

I really like [Fermi problems](!Wikipedia) ([LessWrong](http://lesswrong.com/lw/h5e/fermi_estimates/))---it's like [dimensional analysis](!Wikipedia) for everything outside of physics^[This is a little misleading; dimensional analysis is much more like [type-checking](!Wikipedia) a program in a language with a good type system like Haskell. Given certain data types as *inputs* and certain allowed transformations on those data types, what data types *must* be the resulting output? But the analogy is still useful.].
Not only are they fun to think about, they can be amazingly accurate, and are extremely cheap to do---because they are so easy, you can do them in all sorts of situations you wouldn't do a 'real' estimate for, and they are a fun part of a [physics education](https://arxiv.org/abs/1109.1165). The common distaste for them baffles me; even if you never work through Hubbard's [_How to Measure Anything_](http://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/1118539273/) ([some strategies](http://80000hours.org/blog/170-estimation-part-i-how-to-do-it "Estimation---Part I: How to do it?")) or [_Street-Fighting Mathematics_](http://www.amazon.com/Street-Fighting-Mathematics-Educated-Guessing-Opportunistic/dp/026251429X) or read [Douglas Hofstadter's](!Wikipedia "Douglas Hofstadter") 1982 essay ["On Number Numbness"](/docs/math/1982-hofstadter.pdf) (collected in _[Metamagical Themas](https://archive.org/details/MetamagicalThemas)_), it's something you can teach yourself by asking: what information is publicly available, what can I compare this to, how can I put various boundaries around the true answer?^[eg. if someone asks you how many piano tuners there are in Chicago, don't look blank, start thinking! 'Well, there must be fewer than 7 billion, because the human race isn't made of piano tuners, and likewise fewer than 300 million (the population of the United States), and heck, Wikipedia says Chicago has only 2.6 million people and piano tuners are rare, so there must be many fewer than *that*...'] You especially want to do Fermi calculations in areas where the data is unavailable; I wind up pondering such areas frequently:
- [is a lip-reading website a good idea?](/Notes#lip-reading-website)
- how many women [dye their hair blonde](/Notes#somatic-genetic-engineering)
- how many [people does Folding@home kill](/Charity-is-not-about-helping)
- what's [the entropy of natural language](/Notes#efficient-natural-language)
- or [how big a computer](/Simulation-inferences) is needed to compute the universe
- do [men have shorter *real* lives than women](/Notes#who-lives-longer-men-or-women)
- how much do Girl Scouts cookies [cost and earn them](/Girl-Scouts-and-good-governance-cookie-prices-and-inflation)
- how many people have [used modafinil](/Modafinil#side-effects), and what is the cheapest we can expect to find [modafinil for](/Modafinil#margins)
- quickly assessing the rough probability of some event as I made one of my [thousands of predictions](/Prediction-markets)
- how many sellers are there on [Silk Road 1](/Silk-Road)
- checking whether posthumous organ donation [justifies the Chinese justice system](https://news.ycombinator.com/item?id=3389084) (no)
- [how many tourists](http://lesswrong.com/lw/bi7/smbc_comic_poorly_programmed/69fw#body_t1_69gl) to the Egyptian pyramids there have been compared to the workers who built them
- [what we can infer from adultery & false paternity rates](/Notes#politicians-are-not-unethical)
- [how many times have video gamers killed Mario](https://old.reddit.com/r/estimation/comments/105okb/request_how_many_times_has_mario_died/)
- is tiger skin [more profitable](https://old.reddit.com/r/WTF/comments/188zui/this_is_beyond_saddening_tigers_starved_to_make/c8cql4x) than tiger bone wine?
- [plating Versailles with gold](https://slatestarcodex.com/2013/03/03/reactionary-philosophy-in-an-enormous-planet-sized-nutshell/#comment-544)
- the [number of people killed by falling pianos](http://lesswrong.com/lw/h5e/fermi_estimates/8pts)
- [the plausibility of](https://old.reddit.com/r/anime/comments/1cqtae/in_honor_of_kuronekos_birthday_my_favorite_scene/c9ja9a5) _[Oreimo](!Wikipedia "Oreimo")_'s Kirino Kosaka character
- [how many integers one to a million do you see over a lifetime?](https://old.reddit.com/r/estimation/comments/1f765d/request_in_an_average_lifetime_how_many_of_the/ca85zcl) (R simulation + [Benford's law](!Wikipedia))
- [how many anecdotes of CrossFit causing rhabdomyolysis does it take to justify "CrossFit causes rhabdomyolysis"?](http://lesswrong.com/r/discussion/lw/irm/rationality_competitiveness_and_akrasia/9uoq)
- [how long would it take to read every book in existence?](https://old.reddit.com/r/estimation/comments/27i4wq/request_how_long_would_it_take_to_read_every_book/ci1jfno)
- [what would it cost to replace the US nuclear arsenal?](https://old.reddit.com/r/estimation/comments/2chtft/if_all_the_uss_atomic_bombs_disappeared_assuming/cjfyses)
- [What are the lifetime odds of being pooped on by a bird?](https://old.reddit.com/r/estimation/comments/4f9bxl/what_are_the_lifetime_odds_of_being_pooped_on_by/)
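The piano-tuner bounding in the footnote above can be pushed one step further into a full Fermi estimate; here is a minimal R sketch, where every input is a rough guess of mine rather than a researched figure:

```r
## Classic Fermi estimate: how many piano tuners does Chicago support?
population    <- 2.6e6   # Chicago population (per Wikipedia)
pianosPerCap  <- 1/100   # guess: ~1 piano per 100 people
tuningsPerYr  <- 1       # guess: each piano tuned about once a year
tuningsPerDay <- 4       # guess: tunings one tuner can do per working day
workDaysPerYr <- 250
tunersNeeded  <- (population * pianosPerCap * tuningsPerYr) / (tuningsPerDay * workDaysPerYr)
tunersNeeded
# [1] 26
```

The point is not the exact answer but that a few multiplications pin the true value to within an order of magnitude.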
An entire ["estimation" subreddit](https://old.reddit.com/r/estimation/) is devoted to working through questions like these (it can be quite fun), and of course, there are the memorable ["what if?"](http://what-if.xkcd.com/) _xkcd_ columns.
[Timothy Gowers](!Wikipedia) suggests [a number of problems](https://gowers.wordpress.com/2012/06/08/how-should-mathematics-be-taught-to-non-mathematicians/ "How should mathematics be taught to non-mathematicians?") which might help children really learn how to think with & apply the math they learn.
To look further afield, here's a quick and nifty application by investor John Hempton to the [Sino Forestry fraud](!Wikipedia "Sino-Forest Corporation#Fraud allegations and share suspension"): ["Risk management and sounding crazy"](http://brontecapital.blogspot.com/2011/09/risk-management-and-sounding-crazy.html). What I personally found most interesting about this post was not the overall theme that the whistleblowers were discounted before and after they were proven right (we see this in many bubbles, for example, the housing bubble), but how one could use a sort of [Outside View](http://wiki.lesswrong.com/wiki/Outside_view)/Fermi calculation to sanity-check the claims. If Sino Forestry was really causing 17m cubic meters of wood to be processed a year, where was all the processing? This simple question tells us a *lot*. With medicine, there is one simple question one can always ask too---where is the increased longevity? (This is an important question to ask of studies, such as a [recent caloric restriction study](http://aging.wisc.edu/pdfs/2297.pdf).)
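Hempton's sanity-check can be made concrete with a back-of-the-envelope conversion; the truckload and work-day figures below are my rough assumptions, not his:

```r
## If Sino Forestry really processed 17m cubic meters of wood a year,
## how much visible logistics would that imply?
harvest     <- 17e6    # claimed m^3 of wood processed per year
truckLoad   <- 30      # guess: m^3 of logs per truckload
workDays    <- 300     # guess: hauling days per year
loadsPerDay <- harvest / truckLoad / workDays
loadsPerDay            # ~1,900 truckloads every working day
```

Nearly two thousand truckloads a day is the kind of physical footprint that is hard to hide---so where was it?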
Simple questions and reasoning can tell us a lot.
# Selective Emigration and Personality Trait Change {.collapse}
> [Knudsen 2019](https://annesofiebeckknudsen.com/wp-content/uploads/2019/01/thosewhostayed.pdf "Those Who Stayed: Individualism, Self-Selection and Cultural Change during the Age of Mass Migration") finds that the emigration of 25% of the Scandinavian population to the USA 1850--1920 was driven in part by more 'individualistic' personality factors among emigrants, leading to permanent decreases in mean 'individualism' in the home countries. This is attributed to cultural factors, rather than genetics. I model the overall migration as a simple truncation selection scenario, and find that in a simple model under reasonable assumptions, the entire effect could be genetic.

In ["Those Who Stayed: Individualism, Self-Selection and Cultural Change during the Age of Mass Migration"](https://annesofiebeckknudsen.com/wp-content/uploads/2019/01/thosewhostayed.pdf), Knudsen 2019, Knudsen examines one of the largest emigrations from Europe to the USA, from Scandinavian countries like [Sweden](!Wikipedia "Swedish emigration to the United States"): over about two generations, 25% of the entire population left for the USA.
(The effects of this emigration will be familiar to anyone who has visited the Midwest.)
*Who* left is not random at all, and would be influenced by personality traits; Knudsen uses noisy proxies to measure personality traits like 'individualism' on a population level, and finds that because of the emigration of Scandinavian individuals high on these traits, the remaining population is noticeably lower on them, by "approximately 3.9%-points in Denmark, 10.1%-points in Sweden, and 13.1%-points in Norway", and more interestingly, these reductions appear to be permanent:
> Several circumstances make the Age of Mass Migration an ideal case for the empirical objective of this paper. During the period, millions of people left Europe to settle in New World countries such as the United States. Sweden, Norway, and Denmark experienced some of the highest emigration rates in Europe during this period, involving the departure of approximately 25% of their populations. Besides representing the largest migration event in Scandinavian history, global regulatory policies on migration were particularly loose at this point in time, which enables the identification of self-selective processes under limited governmental influence. In addition, the historical context allows me to study long-run cultural implications of migration in sending locations.
>
> For use in analyses of aggregate district effects, I quantify the cultural shock of selective emigration as the percentage point drop in the prevalence of individualists in the migrant-sending population that occurs due to emigration. This is feasible because I have information on rates of emigration and the gap between emigrant and population individualism. Accumulating these shocks over the entire period of the Age of Mass Migration reveals an overall reduction in individualism from emigration of approximately 3.9%-points in Denmark, 10.1%-points in Sweden, and 13.1%-points in Norway...Fixed-effects estimations document that the cultural shocks of emigration pushed migrant-sending district culture in a collectivist direction and this is robust to the inclusion of control variables that capture alternative district, cohort, and emigrant characteristics
>
> The shock measures are available from 1860 onwards, since these are the decades covered by the emigrant passenger lists. The cultural shocks of emigration can be calculated for any time period and aggregation level. I produce baseline measures at the district and decade level. The accumulated shock over the entire Age of Mass Migration equals a loss of individualists of 3.9%-points in Denmark, 10.1%-points in Sweden, and 13.1%-points in Norway. Although the measure most precisely reflects shocks to the distribution of inherited individualism in the home population, they also approximate shocks to the distribution of actual individualism if emigrants and non-emigrants are assumed to abandon their cultural heritage at similar rates. The rate and cultural shock of emigration correlate strongly around a ratio of 0.4 (see scatter plot and histogram in Appendix Figures A.1 and A.3).^23^
([Karadja & Prawitz 2019](https://osf.io/preprints/socarxiv/y4wgm/ "Exit, Voice and Political Change: Evidence from Swedish Mass Migration to the United States") also show emigration effects in Sweden related to communalism, leading to greater voting/labor union/strike participation.)
Why are they permanent?
Knudsen only mentions 'cultural transmission', but of course, they have to also be partially genetic: personality traits are heritable, and so selecting on personality will cause genetic changes, and we can see population stratification caused by selection on complex traits (like intelligence/education in the UK: ["Genetic Consequences of Social Stratification in Great Britain"](https://www.biorxiv.org/content/early/2018/10/30/457515), Abdellaoui et al 2018).
A process like emigration has no way of selecting *only* people who are 'individualistic' for purely environmental reasons, after all; such processes see only phenotypes, which are the outcome of both genes and environments.
Is it plausible that genetics might explain a nontrivial amount of the reduction?
The selection effect here is relatively small: only 25% of the total population left, once, and while 'individualism' might be a big part of decisions to live, there are surely many other genetic and environmental factors and sheer randomness in any individual's decision to leave.
It feels like the genetic effect might be much too small to be particularly relevant.
We can quickly estimate expected effects by treating this as [truncation selection](!Wikipedia) in a [liability threshold model](!Wikipedia) (["Chapter 14: Short-term Changes in the Mean: 2. Truncation and Threshold Selection"](/docs/genetics/selection/2013-walsh-book2-ch14-draft.pdf), Lynch & Walsh): the bottom X% are 'truncated' from the population (because they physically left for the USA and did not return, a binary variable) in a single selective step, and we want to know the effect on personality traits of a certain heritability (which can be extrapolated from research on more-carefully measured personality traits like the Big Five) given that they are _r_ < 1 with the emigration variable (we can only guess at the values here---surely _r_ < 0.5?).
Knudsen says the overall selection is 25%; this took place over perhaps 2 generations, but since the period is short and the effects small, it makes little difference if we treat it as a single truncation event on one big population.
What is the overall effect on personality?
Knudsen includes some graphs showing emigration rates by country over time, but I don't see any population-weighted average overall effect reported in her paper.
(Bringing up the question of how much of the effect is genetic with Knudsen on Twitter, [she mentioned](https://nitter.net/ASBeckKnudsen/status/1169612398954631169) that she was working on revisions which would include more information on effect sizes and what other individual differences might cause the emigration decision.)
The observed effect is 3.9%/10.1%/13.1% (Denmark/Sweden/Norway), which has a geometric mean of ~8%:
~~~{.R}
library(psych); geometric.mean(c(3.9, 10.1, 13.1))
# [1] 8.02082595
~~~
Sweden is presumably by far the biggest country of the 3 at the time and would influence the average the most, so 8% is probably too small, but serves as a benchmark.
It would be more convenient to have this in standard deviations, but I don't see anywhere that Knudsen gives it in SDs rather than percentages.
I *think* the population is being treated as a standardized variable, with percentiles converted to SDs; that suggests an "8% decrease" would be equivalent to a ~0.19SD change:
~~~{.R}
qnorm(0.99) / 99
# [1] 0.0234984634
(qnorm(0.99) / 99) * 8.02
# [1] 0.188457676
~~~
0.19SD isn't so big, and looks somewhat plausible.
Perhaps truncation selection can explain this change after all.
To estimate the change, we convert the percentage to an SD threshold: how low a score does it take to be in the bottom 25%^[I find it a bit easier to avoid phrasing it as "decreasing individualism" and think of the opposite way as "increasing communalism", because most discussions of threshold selection are---unsurprisingly given the major use is in breeding---framed as about increasing levels of some trait.] and thus emigrate?
~~~{.R}
qnorm(0.25)
# [1] -0.67448975
~~~
So we can imagine everyone with a 'communalism' score of <-0.67 emigrated.
In a population of $\mathcal{N}(0,1)$, if we delete everything <-0.67, giving a [truncated normal distribution](!Wikipedia), what does the remainder average?
~~~{.R}
truncNormMean <- function(a, mu=0, sigma=1, b=Inf) {
    ## mean of a Normal(mu, sigma) truncated to [a, b]:
    phi <- dnorm; Phi <- pnorm
    alpha <- (a-mu)/sigma; beta <- (b-mu)/sigma
    Z <- Phi(beta) - Phi(alpha)
    return( mu + sigma * (phi(alpha) - phi(beta)) / Z ) }
truncNormMean(qnorm(0.25))
# [1] 0.423702097
~~~
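As a sanity-check on the closed-form value, the same truncated mean can be estimated by brute-force simulation (a quick cross-check I'm adding, not part of the original analysis):

```r
## Monte Carlo check: simulate a standard normal and average
## only the draws above the 25th-percentile cutoff.
set.seed(2019)
x <- rnorm(1e6)
mean(x[x > qnorm(0.25)])
# ≈ 0.424
```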
So the survivors, as it were, have shifted from an average of 0SD to +0.42SD on 'communalism' (anti-'individualism').
This, however, is not the *genetic* increase.
It includes all the variables which affect personality, not just genes (such as rearing or random noise), and then all of the variables which affect emigration risk above and beyond personality.
So it must be decreased twice to compensate for personality being only partly genetic, and then emigration being only partly personality (ie. genes → personality → emigration has 2 arrows which are _r_ < 1).
Personality factors like the [Big Five personality traits](!Wikipedia) are well-established as having heritabilities of ~50% ([eg Bouchard & McGue 2003](http://emilkirkegaard.dk/en/wp-content/uploads/Genetic-and-environmental-influences-on-human-psychological-differences.pdf#page=20 "'Genetic and Environmental Influences on Human Psychological Differences', Bouchard & McGue 2003: 'Table 5: Broad Heritabilities of Self-Report Measures of the Big Five Factors Based on Four Recent Twin Studies,a Comprehensive Review of Twin, Adoption, and Biological Kinships (Loehlin, 1992), and a Summary of the Earlier Twin Literature (Bouchard, 1997)'")), so genes would correlate $\sqrt{0.50} = 0.71$ with personality.
'Individualism'/'communalism' is of course not *exactly* the Big Five, and in Knudsen correlates with many other traits associated with social & economic success, and so it may reflect much more than something like [Openness](!Wikipedia "Openness to experience"); intelligence is probably part of it, and intelligence has a substantially higher heritability than the Big Five, so I might be underestimating 'individualism' here.
How much of emigration is personality?
I can't give an exact value here: Knudsen's analysis probably implies a value somewhere, but she is using imperfect proxies with a lot of measurement error, so I wouldn't be sure how to interpret them anyway.
I'd be surprised if it was _r_ > 0.5, simply because few variables exceed that in psychology or sociology, and one would expect emigration to be influenced by proximity to ports, local economic & weather conditions, [chain migration](!Wikipedia), individual & familial wealth, etc---so consider an example with _r_ = 0.5 as a best-case scenario.
We deflate the truncated normal mean by the 2 variables:
~~~{.R}
truncNormMean(qnorm(0.25)) * sqrt(0.5) * 0.5
# [1] 0.149801313
(truncNormMean(qnorm(0.25)) * sqrt(0.5) * 0.5) / (qnorm(0.99) / 99)
# [1] 6.3749408
~~~
So in this truncation selection scenario, we'd predict a +0.15SD or ~6% increase.
This is not too far off the Knudsen estimates of 0.19SD or 8%, and quite small changes could equalize them (eg. _r_ ≈ 0.63 for personality→emigration would be enough).
So it seems that the truncation selection is stringent enough, and the observed effects small enough, that genetics *could* indeed explain most or all of it.
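That break-even correlation can be computed directly by rearranging the same quantities (this back-calculation is my addition, using the figures already derived above):

```r
## Back-solve: what personality→emigration correlation would make
## truncation selection exactly reproduce the observed ~0.19SD shift?
shiftSD   <- (qnorm(0.99) / 99) * 8.02   # observed shift in SDs (~0.19)
truncMean <- dnorm(qnorm(0.25)) / 0.75   # mean of N(0,1) above its 25th percentile
rNeeded   <- shiftSD / (truncMean * sqrt(0.5))  # deflate for ~50% heritability
rNeeded
# ≈ 0.63
```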
## See Also
- ["Dog Cloning For Special Forces: Breed All You Can Breed"](/Clone "Decision analysis of whether cloning the most elite Special Forces dogs is a profitable improvement over standard selection procedures. Unless training is extremely cheap or heritability is extremely low, dog cloning is hypothetically profitable.")
- ["Oh Deer: Could Deer Evolve to Avoid Car Accidents?"](#oh-deer-could-deer-evolve-to-avoid-car-accidents "Can deer evolve under the selection pressure of car accidents to learn to avoid roads? Probably, but it'll take a long time.")
- ["Embryo selection for intelligence"](/Embryo-selection "A cost-benefit analysis of the marginal cost of IVF-based embryo selection for intelligence and other traits with 2016-2017 state-of-the-art; for other binary traits, truncation selection in the liability threshold model is used.")
# The Most Abandoned Books on GoodReads
> What books are hardest for a reader who starts them to finish, and most likely to be abandoned? I scrape a crowdsourced tag, `abandoned`, from the GoodReads book social network to estimate conditional probability of being abandoned.
>
> The default GoodReads tag interface presents only raw counts of tags, not counts divided by total ratings (=reads). This conflates popularity with probability of being abandoned: a popular but rarely-abandoned book may have more `abandoned` tags than a less popular but often-abandoned book. There is also residual error from the winner's curse where books with fewer ratings are more mis-estimated than popular books.
>
> Correcting for both changes the top-5 ranking completely, from ([raw counts](/GoodReads#data)):
>
> 1. _[The Casual Vacancy](!Wikipedia)_, [J.K. Rowling](!Wikipedia)
> 2. _[Catch-22](!Wikipedia)_, [Joseph Heller](!Wikipedia)
> 3. _[American Gods](!Wikipedia)_, [Neil Gaiman](!Wikipedia)
> 4. _[A Game of Thrones](!Wikipedia)_, [George R.R. Martin](!Wikipedia)
> 5. _[The Book Thief](!Wikipedia)_, [Markus Zusak](!Wikipedia)
>
> to ([shrunken posterior proportions](/GoodReads#bayesian-modeling)):
>
> 1. _[Black Leopard, Red Wolf](!Wikipedia)_, [Marlon James](!Wikipedia "Marlon James (novelist)")
> 2. [_Space Opera_](!Wikipedia "Space Opera (Valente novel)"), [Catherynne M. Valente](!Wikipedia)
> 3. _[Little, Big](!Wikipedia)_, [John Crowley](!Wikipedia "John Crowley (author)")
> 4. [_The Witches: Salem, 1692_](https://www.amazon.com/Witches-Suspicion-Betrayal-Hysteria-Salem/dp/031620059X), [Stacy Schiff](!Wikipedia)
> 5. _[Tender Morsels](!Wikipedia)_, [Margo Lanagan](!Wikipedia)
>
> I also consider [a model adjusting for covariates](/GoodReads#modeling-with-covariates) (author/average-rating/year), to see what books are most surprisingly often-abandoned given their pedigrees & rating etc. Abandon rates increase the newer a book is, and the lower the average rating.
>
> Adjusting for those, the top-5 are:
>
> 1. _The Casual Vacancy_, J.K. Rowling
> 2. [_The Chemist_](https://www.amazon.com/Chemist-Stephenie-Meyer/dp/0316387843), [Stephenie Meyer](!Wikipedia)
> 3. _[Infinite Jest](!Wikipedia)_, [David Foster Wallace](!Wikipedia)
> 4. _[The Glass Bead Game](!Wikipedia)_, [Hermann Hesse](!Wikipedia)
> 5. _[Theft by Finding: Diaries (1977--2002)](!Wikipedia)_, [David Sedaris](!Wikipedia)
>
> Books at the top of the adjusted list appear to reflect a mix of highly-popular authors changing genres, and 'prestige' books which are highly-rated but a slog to read.
>
> These results are interesting in how they highlight how people read books for many different reasons (such as marketing campaigns, literary prestige, or following a popular author), and this is reflected in their decision whether to continue reading or to abandon a book.

**Split out to ["The Most Abandoned Books on GoodReads"](/GoodReads "Which books on GoodReads are most difficult to finish? Estimating proportion gives an entirely different result than absolute counts").**
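The shrinkage step summarized in the abstract can be illustrated with a toy empirical-Bayes calculation; the counts below are invented for illustration and this is only a sketch of the idea, not the split-out page's actual model:

```r
## Toy empirical-Bayes shrinkage of abandonment proportions (invented counts):
abandoned <- c(5, 50, 2, 300)
ratings   <- c(20, 10000, 4, 50000)
p <- abandoned / ratings
m <- mean(p); v <- var(p)
## Fit a Beta(a, b) prior to the raw proportions by method of moments:
k <- m * (1 - m) / v - 1
a <- m * k; b <- (1 - m) * k
## Posterior mean abandonment probability per book:
shrunken <- (abandoned + a) / (ratings + a + b)
shrunken   # poorly-sampled books are pulled furthest toward the prior mean
```

Ranking by `shrunken` rather than by raw `abandoned` counts is what separates "often-abandoned" from merely "often-read".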