> Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints?
> I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll's estimate but with wide credible intervals to cover the mode based on the small reliable poll's estimate.

A question was asked of me: what should one infer if one is given what would be definitive data if one could take it at face value---but one suspects this data might be totally incorrect?
An example would be if one wanted to know what fraction of people would answer 'yes' to a particular question, and one had a very small poll (_n_=10) suggesting 90% say yes, but then one was also given the results from a much larger poll (_n_=1000) saying 75% responded yes---but this poll was run by untrustworthy people, people that, for whatever reason, you believe might make something up half the time.
You should be able to learn *something* from this unreliable poll, but you can't learn *everything* from it because you would be burned half the time.
If not for this issue of unreliability, this would be an easy binomial problem: specify a uniform or Jeffreys prior on what percentage of people will say yes, add in the binomial data of 9/10, and look at the posterior.
But what do we do with the unreliability joker?
## Binomial
First let's try the simple case, just updating on a small poll of 9/10. We would expect the posterior to be unimodally peaked around 80-90%, but broad (due to the small sample size), falling off sharply near 100% since a rate that high is a priori unlikely.
MCMC using [Bayesian First Aid](https://github.com/rasmusab/bayesian_first_aid):
~~~{.R}
## install.packages("devtools")
## devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
oldData <- data.frame(Yes=9, N=10)
b <- bayes.binom.test(oldData$Yes, oldData$N); b
# ...number of successes = 9, number of trials = 10
# Estimated relative frequency of success:
# 0.85
# 95% credible interval:
# 0.63 0.99
# The relative frequency of success is more than 0.5 by a probability of 0.994
# and less than 0.5 by a probability of 0.006
~~~
Which itself is a wrapper around calling out to JAGS doing something like this:
~~~{.R}
library(runjags)
model_string <- "model {
x ~ dbinom(theta, n)
theta ~ dbeta(1, 1) }"
model <- autorun.jags(model_string, monitor="theta", data=list(x=oldData$Yes, n=oldData$N)); model
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# theta 0.63669 0.85254 0.9944 0.83357 0.10329 -- 0.0007304 0.7 20000 0.011014 1.0004
~~~
Here is a simulation-based version of Bayesian inference using [ABC](!Wikipedia "Approximate Bayesian computation"):
~~~{.R}
oldData <- data.frame(Yes=9, N=10)
simulatePoll <- function(n, pr) { rbinom(1, size=n, p=pr); }
poll_abc <- replicate(100000, {
# draw from our uniform prior
p <- runif(1,min=0,max=1)
# simulate a hypothetical poll dataset the same size as our original
newData <- data.frame(Yes=simulatePoll(oldData$N, p), N=oldData$N)
# were they equal? if so, save sample as part of posterior
if (all(oldData == newData)) { return(p) }
}
)
resultsABC <- unlist(Filter(function(x) {!is.null(x)}, poll_abc))
summary(resultsABC)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.3260816 0.7750520 0.8508855 0.8336383 0.9117471 0.9991691
hist(resultsABC)
# https://i.imgur.com/fn3XYQW.png
~~~
They look identical, as they should.
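(As a cross-check on both: this simple case also has an exact conjugate answer. With a uniform Beta(1,1) prior and 9/10 successes, the posterior is Beta(10,2), whose mean and quantiles line up with the MCMC & ABC output above.)

~~~{.R}
## exact conjugate posterior: Beta(1+9, 1+1) = Beta(10,2)
10/12                              # posterior mean, ~0.833
qbeta(c(0.025, 0.5, 0.975), 10, 2) # ~0.59, ~0.85, ~0.98 (equal-tailed interval; the JAGS Lower95/Upper95 are HPD bounds)
~~~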
### Binomial with binary unreliability
To implement our more complicated version: the original poll remains the same but we add in the complication of a very large poll which 50% of the time is a true measure of the poll response and 50% of the time is drawn uniformly at random.
(So if the true poll response is 90%, then half the time the large poll will yield accurate data like 905/1000 or 890/1000, and the rest it will yield 10/1000 or 400/1000 or 700/1000.)
This is different from the more common kinds of measurement-error models where it's generally assumed that the noisy measurements still have *some* informativeness to them; here there is none.
Specifically, this faux poll has yielded the data not 9/10, but 750/1000.
#### ABC
Using ABC again: we generate the reliable small poll as before, and we add in a faux poll where we flip a coin to decide whether to return a 'yes' count based on the population parameter or just a random number; then we combine the two datasets and check whether the combination is identical to the actual data, saving the population probability if it is.
~~~{.R}
oldData2 <- data.frame(Yes=c(9,750), N=c(10,1000)); oldData2
# Yes N
# 1 9 10
# 2 750 1000
simulateHonestPoll <- function(n, pr) { rbinom(1, size=n, p=pr); }
simulateFauxPoll <- function(n, pr, switchp) { if(sample(c(TRUE, FALSE), 1, prob=c(switchp, 1-switchp))) { rbinom(1, size=n, p=pr); } else { round(runif(1, min=0, max=n)); }}
poll_abc <- replicate(1000000, {
priorp <- runif(1,min=0,max=1)
switch <- 0.5
n1 <- 10
n2 <- 1000
data1 <- data.frame(Yes=simulateHonestPoll(n1, priorp), N=n1)
data2 <- data.frame(Yes=simulateFauxPoll(n2, priorp, switch), N=n2)
newData <- rbind(data1, data2)
if (all(oldData2 == newData)) { return(priorp) }
}
)
resultsABC <- unlist(Filter(function(x) {!is.null(x)}, poll_abc))
summary(resultsABC)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.5256471 0.7427098 0.7584650 0.7860109 0.8133581 0.9765648
hist(resultsABC)
# https://i.imgur.com/atMz0jg.png
~~~
The results are interesting and in this case the summary statistics are misleading: the median is indeed around 75% (as we would expect! since that's the result of the highly precise poll which has a 50% chance of being the truth) but we see the mean is being pulled away towards the original 90% estimate, and plotting the histogram, bimodality emerges.
The posterior reports that there's still a lot of credibility to the 90% point estimate, but between the original diffuseness of that posterior (leaving a lot of probability to lower responses including, say, 75%) and the high certainty that if accurate the responses will definitely be close to 75%, it winds up peaked at a little higher than 75% (since even if the larger poll is honest, the earlier poll did still find 9/10).
So it's not so much that we think the best estimate of the true population rate really is ~79% (indeed, the mode is more like 75%, though the truth could easily be far from 75% and up in the 90%s); rather, we would need to think more about what we want to do with this posterior before we decide how to summarize it.
#### Mixture
ABC is slow and would not scale to more hypothetical polls unless we abandoned exact ABC inference and began using approximate ABC (entirely possible in this case; instead of strict equality between the original and simulated data, we'd instead accept a sample of _p_ if the simulated dataset's fractions were within, say, 1% of the originals); and the simulation would need to be rewritten anyway.
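(For concreteness, a minimal sketch of what such an approximate-ABC acceptance rule might look like, reusing `simulateHonestPoll`, `simulateFauxPoll`, & `oldData2` from above; the 1% tolerance and variable names are arbitrary choices here, not anything canonical:)

~~~{.R}
poll_abc_approx <- replicate(100000, {
priorp <- runif(1,min=0,max=1)
data1 <- data.frame(Yes=simulateHonestPoll(10, priorp), N=10)
data2 <- data.frame(Yes=simulateFauxPoll(1000, priorp, 0.5), N=1000)
newData <- rbind(data1, data2)
# accept if every simulated 'yes' fraction is within 1 percentage point of the observed fraction
if (all(abs(newData$Yes/newData$N - oldData2$Yes/oldData2$N) <= 0.01)) { return(priorp) }
}
)
resultsApprox <- unlist(Filter(function(x) {!is.null(x)}, poll_abc_approx))
summary(resultsApprox); hist(resultsApprox)
~~~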
MCMC can handle this if we think of our problem as a [mixture model](!Wikipedia): we have poll data drawn from two clusters/distributions---one cluster is the true population distribution of opinion, and the other cluster just spits out noise. We have one observation which we know is from the first (reliable) distribution (the 9/10 poll result), and one observation which could have come from either (750/1000), but we do know that the indexing probability for mixing the two distributions is 50%.
In JAGS, we write down a model in which `dcat` flips between 1 and 2 if the cluster is not known, specifying which distribution a sample came from and its theta probability, and then we infer the thetas for both distributions.
Of course, we only care about the first distribution's theta since the second one is noise.
~~~{.R}
library(runjags)
model1 <- "model {
for (i in 1:N) {
y[i] ~ dbinom(theta[i], n[i])
theta[i] <- thetaOfClust[ clust[i] ]
clust[i] ~ dcat(pi[])
}
pi[1] <- switch[1]
pi[2] <- switch[2]
thetaOfClust[1] ~ dbeta(1,1)
thetaOfClust[2] ~ dunif(0,1)
}"
j1 <- autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldData2), y=oldData2$Yes, n=oldData2$N, switch=c(0.5, 0.5), clust=c(1,NA))); j1
# ... Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# theta[1] 0.70582 0.75651 0.97263 0.77926 0.07178 -- 0.001442 2 2478 0.12978 1.0011
# theta[2] 0.72446 0.75078 0.77814 0.75054 0.013646 -- 0.00009649 0.7 20000 0.009458 1
plot(j1)
# https://i.imgur.com/EaqR0dD.png
~~~
Sure enough, we get a good match with the ABC estimate: a mean estimate for the population distribution of 78% with a very wide 95% CI and a clearly bimodal distribution with a huge spike at 75%.
Since the MCMC mixture model looks completely different from the imperative simulation-based model, the consistency in estimates & distributions gives me some confidence in the results being right.
So we can see how we should update our beliefs---by a perhaps surprising amount towards the unreliable datapoint.
The original data was too weak to strongly resist the allure of that highly precise poll.
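(To illustrate that weakness, here is a hypothetical variant reusing the same JAGS mixture model but with a larger reliable poll of 90/100 instead of 9/10; with the trustworthy data that much stronger, we would expect far more of the posterior mass to stay up near 0.9:)

~~~{.R}
oldDataStrong <- data.frame(Yes=c(90,750), N=c(100,1000))
j2 <- autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldDataStrong), y=oldDataStrong$Yes, n=oldDataStrong$N, switch=c(0.5, 0.5), clust=c(1,NA))); j2
## theta[1]'s posterior should now be bimodal with the bulk of its mass near 0.9 rather than 0.75
~~~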
#### Weakening heuristic?
We might try to think of it this way: half the time, the large poll means nothing whatsoever---it contains no information about the population at all---while the other half of the time, it is exactly what it seems to be and 100% informative; so doesn't that mean that on average we should treat it as containing half the information we thought it did? And the information is directly based on the sample size: a sample 5x as big contains 5x as much information.
So perhaps in this case of all-or-nothing accuracy, we could solve it easily by simply weakening the weight put on the unreliable information and shrinking the claimed sample size---instead of treating it as 750 of 1000, treat it as 375/500; and if it had been 75,000 of 100,000, convert it to 37,500 of 50,000.
This is a simple and intuitive shortcut, but if we think about what the binomial will return as the unreliable poll increases in size or if we look at the results...
~~~{.R}
switch <- 0.5
oldData3 <- data.frame(Yes=c(9,(750*switch)), N=c(10,(1000*switch)))
b2 <- bayes.binom.test(sum(oldData3$Yes), sum(oldData3$N)); b2
#
# Bayesian First Aid binomial test
#
# data: sum(oldData3$Yes) and sum(oldData3$N)
# number of successes = 384, number of trials = 510
# Estimated relative frequency of success:
# 0.75
# 95% credible interval:
# 0.71 0.79
# The relative frequency of success is more than 0.5 by a probability of >0.999
# and less than 0.5 by a probability of <0.001
~~~
Unfortunately, this doesn't work because it doesn't preserve the bimodal aspect of the posterior, and we get a unimodal distribution ever concentrating on its mean, wiping out the existence of the 0.90 peak. If our untrustworthy poll had instead, say, reported 750,000 out of 1 million, that should only make the peak at 0.75 look like a needle---it should be unable to affect the mass around 0.9, because it doesn't matter if the data is 100 or 1 million or 1 billion, it still only has a 50% chance of being true.
It's a little hard to see this since the mean frequency of 0.75 is fairly close to the mean of 0.78 from the ABC and we might write this off as approximation error in either the ABC estimate or BFA's MCMC, but if we look at the 95% CI and note that 0.9 is not inside it or if we plot the posterior (`plot(b2)`), then the absence of bimodality jumps out.
So this trick doesn't work.
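(To see the failure more starkly, a hypothetical check: scale the unreliable poll up 1000x to 750,000/1,000,000 and apply the same halving shortcut; the pooled binomial posterior just concentrates ever more tightly on 0.75, whereas the correct mixture posterior would still leave roughly half the probability up in a mode near 0.9:)

~~~{.R}
oldDataScaled <- data.frame(Yes=c(9, 750000*0.5), N=c(10, 1000000*0.5))
b3 <- bayes.binom.test(sum(oldDataScaled$Yes), sum(oldDataScaled$N)); b3
## the credible interval collapses onto ~0.75; plot(b3) shows no second mode anywhere near 0.9
~~~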
# Dysgenics power analysis {.collapse}
> Current dysgenic estimates predict that genotypic IQ in the West is falling at a substantial rate, amounting to around half a standard deviation or more over the past century, by 1. reducing the frequency at which intelligence-increasing genetic variants occur (through natural selection against such variants) and 2. increasing the number of new and potentially harmful genetic mutations (increasing mutation load).
> Estimates are produced indirectly by surveying reproductive rates or by trying to show decreases in phenotypic traits associated with intelligence; it would obviously be preferable to examine dysgenic effects directly, by observing decreases in frequencies or increases in mutation load in a large sample of Western genetic information such as SNP arrays or whole-genomes (respectively).
> Such direct testing of dysgenics hypotheses is becoming increasingly feasible due to the exponential decrease in SNP & whole-genome sequencing costs creating large datasets (some publicly available) and the recent identification of some intelligence genes.
> It remains unclear how large these datasets must be to overcome sampling error and yield informative estimates of changes in frequencies or mutation load, however; datasets like PGP or SSGAC may still be too small to investigate dysgenics.
> I considered the effect size estimates and under some simple models derived power calculations & power simulations of how large a dataset would be required to have an 80% chance of detecting a dysgenic effect: to detect the decrease in intelligence SNPs using SNP data, _n_≥30,000; to detect the increase in mutation load in whole genomes, _n_≥160.
> I then compare to available datasets: the effect on SNPs can be detected by a large number of existing proprietary databases, but there are no public databases which will be large enough in the foreseeable future; the effect on mutation load, on the other hand, can be detected using solely the currently publicly available dataset from PGP.
> So I conclude that while only the proprietary databases can directly test dysgenic theories of selection for the foreseeable future, there *is* an opportunity to analyze PGP genomes to directly test the dysgenic theory of mutation load.

The dysgenics hypothesis argues that, due to observed reproductive patterns in which the highly educated or intelligent tend to have fewer offspring, genotypic IQ (the upper bound on phenotypic IQ set by genes, the sort of thing measured by a polygenic score) is declining over time.
If dysgenics is true, then it is an extremely important phenomenon, as important as many things that get far more attention like lead remediation; but to paraphrase [Richard Hamming](!Wikipedia)[^Hamming-importance], just because a problem is important does not mean it is worth working on or researching or discussing if there is no chance of making progress---if the data is hopelessly compromised by many systematic biases which would cause false positives or if the data is too scanty to overcome random error or analyses so flexible that they could deliver any answer the partisan wishes.
Phenotypic data will, in all probability, never allow for a clear & decisive answer to the question of whether dysgenics exists or matters, as long-term comparisons are roughly as credible as noting that global piracy rates have declined while global warming increases, or paracetamol consumption rates have increased in tandem with Alzheimer's rates; only direct examination of genetics will deliver the decisive answer.
It would be nice to have an idea of *how much* genetic data we would need to overcome random error (and hence, whether it's possible to make progress in the near future), which we can answer by doing some statistical power analyses.
[^Hamming-importance]: Richard Hamming, ["You and Your Research"](http://www.cs.virginia.edu/~robins/YouAndYourResearch.html):
> The three outstanding problems in physics, in a certain sense, were never worked on while I was at Bell Labs. By important I mean guaranteed a Nobel Prize and any sum of money you want to mention. We didn't work on (1) time travel, (2) teleportation, and (3) antigravity. They are not important problems because we do not have an attack. It's not the consequence that makes a problem important, it is that you have a reasonable attack. That is what makes a problem important.
Changes over time in genetics could be due to changes within a particular race or population (for example, in all white Englishmen), or could be due to population movements like one group replacing or migrating or merging into another (population genetics has revealed innumerable complex examples historically).
The latter is possible thanks to the increasing availability of ancient DNA, often [made public for researchers](https://www.oagr.org.au/source/ "Online Ancient Genome Repository"); so one could observe very long-term trends with cumulatively large effects (implying that small samples may suffice), but this approach has serious issues in interpretation and questions about how comparable intelligence variants may be across groups or throughout human evolution.
With the former, there is less concern about interpretation due to greater temporal and ethnic homogeneity---if a GWAS on white northern Europeans in 2013 turns up intelligence variants and produces a useful polygenic score, it will almost certainly work on samples of white northern Europeans in 1900 too---but because the time-scale is so short the effect will be subtler and harder to detect.
Nevertheless, a result within a modern population would be much more credible, so we'll focus on that.
How subtle and hard to detect an effect are we talking about here?
[Woodley 2012](/docs/algernon/2012-woodley.pdf "The social and scientific temporal correlates of genotypic intelligence and the Flynn effect") summarizes a number of estimates:
> Early in the 20th century, negative correlations were observed between intelligence and fertility, which were taken to indicate a dysgenic fertility trend (e.g. Cattell, 1936; Lentz, 1927; Maller, 1933; Sutherland, 1929). Early predictions of the rate of dysgenesis were as high as between 1 and 1.5 IQ points per decade (Cattell, 1937, 1936)...In their study of the relationship between intelligence and both completed and partially completed fertility, van Court and Bean (1985) reported that the relationships were predominantly negative in cohorts born between the years 1912 and 1982...Vining (1982) was the first to have attempted an estimation of the rate of genotypic IQ decline due to dysgenesis with reference to a large national probability cohort of US women aged between 24 and 34 years in 1978. He identified significant negative correlations between fertility and IQ ranging from −.104 to −.221 across categories of sex, age and race, with an estimated genotypic IQ decline of one point a generation. In a 10year follow-up study using the same cohort, Vining (1995) re-examined the relationship between IQ and fertility, now that fertility was complete, finding evidence for a genotypic IQ decline of .5 points per generation. Retherford and Sewell (1988) examined the association between fertility and IQ amongst a sample of 9000 Wisconsin high-school graduates (graduated 1957). They found a selection differential that would have reduced the phenotypic IQ by .81 points per generation under the assumption of equal IQs for parents and children. With an estimate of .4 for the additive heritability of IQ, they calculated a more modest genotypic decline of approximately .33 points. The study of Ree and Earles (1991), which employed the NLSY suggests that once the differential fertility of immigrant groups is taken into consideration, the phenotypic IQ loss amongst the American population may be greater than .8 of a point per generation. Similarly, in summarizing various studies, Herrnstein & Murray (1994) suggest that "it would be nearly impossible to make the total [phenotypic IQ decline] come out to less than one point per generation. It might be twice that." (p. 364). Loehlin (1997) found a negative relationship between the fertility of American women aged 35-44 in 1992 and their educational level. By assigning IQ scores to each of six educational levels, Loehlin estimated a dysgenesis rate of .8 points in one generation. Significant contributions to the study of dysgenesis have been made by Lynn, 1996 (see also: 2011) whose book Dysgenics: Genetic deterioration in modern populations provided the first estimates of the magnitude of dysgenesis in Britain over a 90 year period, putting the phenotypic loss at .069 points per year (about 1.7 points a generation assuming a generational length of 25 years). In the same study, Lynn estimated that the genotypic IQ loss was 1.64 points per generation between 1920 and 1940, which reduced to .66 points between 1950 and the present. Subsequent work by Lynn has investigated dysgenesis in other populations. For example Lynn (1999) found evidence for dysgenic fertility amongst those surveyed in the 1994 National Opinion Research Center survey, which encompassed a representative sample of American adults, in the form of negative correlations between the intelligence of adults aged 40+ and the number of children and siblings. Lynn estimates the rate of dysgenesis amongst this cohort at .48 points per generation. 
In a more recent study, Lynn and van Court (2004) estimated that amongst the most recent US cohort for which fertility can be considered complete (i.e. those born in the years 1940-1949), IQ has declined by .9 points per generation. At the country level, Lynn and Harvey (2008) have found evidence of a global dysgenesis of around .86 points between 1950 and 2000, which is projected to increase to 1.28 points in the period from 2000 to 2050. This projection includes the assumption that 35% of the variance in cross-country IQ differences is due to the influence of genetic factors. A subsequent study by Meisenberg (2009), found that the fertility differential between developed and developing nations has the potential to reduce the phenotypic world population IQ mean by 1.34 points per decade (amounting to a genotypic decline of .47 points per decade assuming Lynn & Harvey's 35% estimate). This assumes present rates of fertility and pre-reproductive mortality within countries. Meisenberg (2010) and Meisenberg and Kaul (2010) have examined the factors through which intelligence influences reproductive outcomes. They found that amongst the NLSY79 cohort in the United States, the negative correlation between intelligence and fertility is primarily associated with g and is mediated in part by education and income, and to a lesser extent by more "liberal" gender attitudes. From this Meisenberg has suggested that in the absence of migration and with a constant environment, selection has the potential to reduce the average genotypic IQ of the US population by between .4, .8 and 1.2 points per generation.
All of these estimates are genetic selection estimates: indirect estimates inferred from IQ being a heritable trait and then treating it as a natural selection/breeding process, where a trait is selected against based on phenotype and how fast the trait decreases in each succeeding generation depends on how genetic the trait is and how harsh the selection is.
So variation in these estimates (quoted estimates per generation range from .3 to 3+) is due to sampling error, differences in populations or time periods, expressing the effect by year or generation, the estimate used for heritability, reliability of IQ estimates, and whether additional genetic effects are taken into account---for example, [Woodley et al 2015](http://www.sciencedirect.com/science/article/pii/S0191886915003712) finds -.262 points per decade from selection, but [in another paper](http://www.sciencedirect.com/science/article/pii/S0191886914006278) argues that paternal mutation load must be affecting intelligence by ~-0.84 points per decade in the general population, giving a total of ~-1 point per decade.
Dysgenics effects should be observable by looking at genomes & SNP data with known ages/birth-years and looking for increases in total mutations or decreases in intelligence-causing SNPs, respectively.
## Selection on SNPs
Without formally meta-analyzing all dysgenics studies, a good starting point on the selection effect seems like a genetic selection of 1 point per decade or 0.1 points per year or 0.007 standard deviations per year (or 0.7 standard deviations per century).
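(To spell out the unit conversion, which assumes the conventional SD of 15 IQ points:)

~~~{.R}
0.1 / 15         # ~0.0067 SD per year
(0.1 / 15) * 100 # ~0.67 SD per century
~~~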
The most common available genetic data is SNP data, which sequence only the variants most common in the general population; SNP data can look at the effects of genetic selection but will not look at new mutations (since a new mutation would not be common enough to be worth putting onto a SNP chip).
Given a large sample of SNP data, a birth year (or age), and a set of binary SNP variables which cause intelligence (coded as 1 for the good variant, 0 for the others), we could formulate this as a multivariate regression: `glm(cbind(SNP1, SNP2, ... SNP_N) ~ Year, family=binomial)` and see if the year variable has a negative sign (increasing passage of time predicts lower levels of the good genes); if it does, this is evidence for dysgenics.
Better yet, given information about the effect size of the SNPs, we could for each person's SNP sum the net effects and then regress on a single variable, giving more precision rather than looking for independent effects on each SNP: `lm(Polygenic_score ~ Year)`. Again a negative sign on the year variable is evidence for dysgenics.
Directional predictions are weak, and in this case we have quantitative predictions of how big the effects should be.
Most of the public genomes I looked at seem to have the earliest birthdates in the 1950s or so; genomes can come from any age person (parents can give permission, and sequencing has been done prenatally) so the maximum effect is the difference between 1950 and 2015, which is `65*0.007=0.455` standard deviations (but most genomes will come from intermediate birth-dates, which are less informative about the temporal trend---in the optimal experimental design for measuring a linear trend, half the samples would be from 1950 and the other half from 2015).
If the genetic total is going down by 0.455SDs, how much do the frequencies of all the good genes go down?
One simple model of genotypic IQ would be to treat it as a large number of alleles of equal binary effect: a binomial sum of _n_=10,000 1/0 variables with _P_=50% (population frequency) is reasonable. (For example, GIANT has found a large number of variants for height, and the [GCTA](!Wikipedia)s indicate that SNPs explain much more of variance than the top Rietveld hits currently account for; this specific model is loosely inspired by [Hsu 2014](http://arxiv.org/abs/1408.3421 "On the genetic architecture of intelligence and other quantitative traits").)
In such a model, the average value of the sum is of course `n*p=5000`, and the SD is `sqrt(n*p*(1-p))` or `sqrt(10000*0.5*0.5)` or 50.
Applying our estimate of dysgenic effect, we would expect the sum to fall by `0.455*50=22.75`, so we would be comparing two populations, one with a mean of 5000 and a dysgenic mean of 4977.25.
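(A quick arithmetic check of this toy binomial model, just restating the numbers above:)

~~~{.R}
n <- 10000; p <- 0.5
n*p                               # old mean: 5000
sqrt(n*p*(1-p))                   # SD: 50
n*p - 0.455*sqrt(n*p*(1-p))       # dysgenic mean: 4977.25
(n*p - 0.455*sqrt(n*p*(1-p))) / n # implied new average frequency: 0.497725
~~~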
If we were given access to all alleles from a sample of 1950 and 2015 genomes and so could construct the sum, how hard would it be to tell the difference? In this case, the sum is normally distributed as there are more than enough alleles to create normality, so we can just treat this as a two-sample normally-distributed comparison of means (a _t_-test), and we already have a directional effect size in mind, -0.455 SDs, so:
~~~{.R}
power.t.test(delta=0.455, power=0.8, alternative="one.sided")
# Two-sample t test power calculation
#
# n = 60.4155602
# ...
~~~
A total _n_=120 is doable, but it is unlikely that we will know all intelligence genes anytime soon; instead, we know a few.
A new mean of 4977.25 implies that, since the total number of alleles is the same but the mean has fallen, the average frequency must also fall: from 0.5 to `4977.25/10000=0.497725`.
To go to the other extreme, if we know only a single gene and we want to test a fall from a frequency of 0.50 to 0.4977, we need infeasibly more samples:
~~~{.R}
power.prop.test(p1=0.5, p2=0.497725, power=0.8, alternative="one.sided")
# Two-sample comparison of proportions power calculation
#
# n = 597,272.2524
# ...
~~~
1.2m datapoints would be difficult to get, and so a single gene test would be unhelpful; further, a single gene could change frequencies solely through genetic drift without the change being due to dysgenic pressures.
We know a number of genes, though: Rietveld gives 4 good hits, so we can look at a polygenic score from that.
They are all of similar effect size and frequency, so we'll continue under the same assumptions of 1/0 and _P_=50%.
The non-dysgenic average score is `4*0.5=2`, sd=`sqrt(4*0.5*0.5)=1`. (Naturally, the SD is *much* larger relative to the mean than before, because with so few random variables there is far less averaging-out.)
The predicted shift is from frequencies of 0.5 to 0.497, so the dysgenic scores should be `4*0.497=1.988`, sd=`sqrt(4*0.497*0.503)=0.999`.
The difference of 0.012 on the reduced polygenic score is _d_=`((2-1.988) / 0.999)=0.012`, giving a necessary power of:
~~~{.R}
power.t.test(delta=0.012006003, power=0.8)
# Two-sample t test power calculation
#
# n = 108904.194
# ...
~~~
So the 4 hits do reduce the necessary sample size, but it's still not feasible to require 218k SNP datasets (unless you are 23andMe or SSGAC or an entity like that).
In the current GWAS literature, there are ~9 hits we could use, but the [upcoming SSGAC paper promises: "We identified 86 independent SNPs associated with EA (_p_<5E-8)."](http://drjamesthompson.blogspot.com/2015/09/scholar-in-86-snps.html "'86 genomic sites associated with educational attainment provide insight into the biology of cognitive performance', James J Lee, (at least 200 co-authors)").
So how much would 86 improve over 4?
- mean old: `86*0.5=43`
- sd old: `sqrt(86*0.5*0.5)=4.6368`
- mean new: `86*0.497=42.742`
- sd new: `sqrt(86*0.497*(1-0.497))=4.6367`
- so _d_=`(43-42.742)/4.63675=0.0556`
~~~{.R}
power.t.test(delta=((43-42.742)/4.63675), power=0.8)
# Two-sample t test power calculation
#
# n = 5071.166739
# ...
~~~
So with 86 hits, the necessary total sample drops from ~218k to ~10.1k.
To work backwards: we know that with 1 hit we need over a million SNP datasets (infeasible for any but the largest proprietary databases, who have no interest in studying this hypothesis), and with all hits we need more like 120 genomes (entirely doable with just publicly available datasets like PGP), but how many hits do we need to work with an in-between amount of data, like the ~2k genomes with ages that I guess may be publicly available now or in the near future?
~~~{.R}
power.t.test(n=1000, power=0.8)
# Two-sample t test power calculation
#
# n = 1000
# delta = 0.1253508704
hits=437;
mean1=hits*0.5; sd1=sqrt(hits*0.5*0.5);
mean2=hits*0.497; sd2=sqrt(hits*0.497*(1-0.497));
d=(mean1-mean2)/mean(c(sd1,sd2)); d
# [1] 0.1254283986
~~~
With a polygenic score drawing on 437 hits, a sample of 2k (1,000 in each group) suffices to detect the maximum decrease.
This is pessimistic because the 10k alleles are not all the same effect size and GWAS studies inherently will tend to find the largest effects first. So the first 4 (or 86) hits are worth the most.
The distribution of effects is probably something like an inverse exponential distribution: many small near-zero effects and a few large ones.
Rietveld 2013 released the betas for all SNPs, and [the beta estimates can be plotted](http://emilkirkegaard.dk/en/?p=5574 "Polygenic traits and the distribution of effect sizes: years of education from Rietveld et al (2013)"); each estimate is imprecise and there are artifacts in the beta sizes (SSGAC confirms that they were rounded to 3 decimals), but the distribution looks like a radioactive half-life graph, an inverse exponential distribution.
With a mean effect of 1, we can simulate a set of 10k exponentially-distributed effect sizes whose genome-wide sum has a mean of 5000 and an SD close to (but larger than) 50, closely mimicking the binomial model:
~~~{.R}
effects <- sort(rexp(10000)/1, decreasing=TRUE)
genomeOld <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), 10000, replace = TRUE), 0, effects) }
mean(replicate(10000, sum(genomeOld())))
# [1] 5000.270218
sd(replicate(10000, sum(genomeOld())))
# [1] 69.82652816
genomeNew <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), 10000, replace = TRUE), 0, effects) }
~~~
With a dysgenic effect of -0.455 SDs, that's a fall in the sum of random exponentials of ~31, which agrees closely with the simulated difference in polygenic genome scores:
~~~{.R}
mean(replicate(10000, sum(genomeOld() - genomeNew())))
# [1] 29.75354558
~~~
For each draw from the old and new populations, we can take the first 4 alleles, which were the ones assigned the largest effects, and build a weak polygenic score and compare means.
For example:
~~~{.R}
polyNew <- replicate(1000, sum(genomeNew()[1:4]))
polyOld <- replicate(1000, sum(genomeOld()[1:4]))
t.test(polyOld, polyNew)
# Welch Two Sample t-test
#
# data: polyOld and polyNew
# t = 0.12808985, df = 1995.8371, p-value = 0.8980908
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -0.7044731204 0.8029267301
# sample estimates:
# mean of x mean of y
# 17.72741040 17.67818359
~~~
Or to mimic 86 hits:
~~~{.R}
t.test(replicate(1000, sum(genomeOld()[1:86])), replicate(1000, sum(genomeNew()[1:86])))
#
# Welch Two Sample t-test
#
# t = 1.2268929, df = 1997.6307, p-value = 0.2200074
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
# -0.8642674547 3.7525210076
# sample estimates:
# mean of x mean of y
# 244.5471658 243.1030390
~~~
Using the exponential simulation, we can do a parallelized power analysis: simulate draws (_i_=300) & tests for a variety of sample sizes to get an idea of what sample size we need to get decent power with 86 hits.
~~~{.R}
library(ggplot2)
library(parallel) # warning, Windows users: mclapply relies on forking & will not parallelize on Windows
library(plyr)
genomeOld <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), length(efft), replace = TRUE), 0, efft) }
genomeNew <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), length(efft), replace = TRUE), 0, efft) }
simulateStudy <- function(n, hits) {
effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
polyOld <- replicate(n, sum(genomeOld(effects)))
polyNew <- replicate(n, sum(genomeNew(effects)))
t <- t.test(polyOld, polyNew, alternative="greater")
return(data.frame(N=n, P=t$p.value, PO.mean=mean(polyOld), PO.sd=sd(polyOld), PN.mean=mean(polyNew), PN.sd=sd(polyNew))) }
hits <- 86
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
sampleSizes <- seq(500, 5000, by=100)
iters <- 300
powerExponential <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
# N P PO.mean PO.sd PN.mean
# Min. : 500 Min. :0.000000000 Min. :222.5525 Min. :23.84966 Min. :221.2894
# 1st Qu.:1600 1st Qu.:0.002991554 1st Qu.:242.8170 1st Qu.:26.46606 1st Qu.:241.3242
# Median :2750 Median :0.023639517 Median :247.2059 Median :27.04467 Median :245.7044
# Mean :2750 Mean :0.093184735 Mean :247.3352 Mean :27.06300 Mean :245.8298
# 3rd Qu.:3900 3rd Qu.:0.107997575 3rd Qu.:251.7787 3rd Qu.:27.64103 3rd Qu.:250.2157
# Max. :5000 Max. :0.997322161 Max. :276.2614 Max. :30.67000 Max. :275.7741
# PN.sd
# Min. :23.04527
# 1st Qu.:26.45508
# Median :27.04299
# Mean :27.05750
# 3rd Qu.:27.63241
# Max. :30.85065
powerExponential$Power <- powerExponential$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponential); powers
# 1 500 0.2133333333
# 2 600 0.2833333333
# 3 700 0.2833333333
# 4 800 0.3133333333
# 5 900 0.3033333333
# 6 1000 0.3400000000
# 7 1100 0.4066666667
# 8 1200 0.3833333333
# 9 1300 0.4133333333
# 10 1400 0.4166666667
# 11 1500 0.4700000000
# 12 1600 0.4600000000
# 13 1700 0.4666666667
# 14 1800 0.4733333333
# 15 1900 0.5233333333
# 16 2000 0.5366666667
# 17 2100 0.6000000000
# 18 2200 0.5900000000
# 19 2300 0.5600000000
# 20 2400 0.6066666667
# 21 2500 0.6066666667
# 22 2600 0.6700000000
# 23 2700 0.6566666667
# 24 2800 0.7133333333
# 25 2900 0.7200000000
# 26 3000 0.7300000000
# 27 3100 0.7300000000
# 28 3200 0.7066666667
# 29 3300 0.7433333333
# 30 3400 0.7133333333
# 31 3500 0.7233333333
# 32 3600 0.7200000000
# 33 3700 0.7766666667
# 34 3800 0.7933333333
# 35 3900 0.7700000000
# 36 4000 0.8100000000
# 37 4100 0.7766666667
# 38 4200 0.8000000000
# 39 4300 0.8333333333
# 40 4400 0.8466666667
# 41 4500 0.8700000000
# 42 4600 0.8633333333
# 43 4700 0.8166666667
# 44 4800 0.8366666667
# 45 4900 0.8666666667
# 46 5000 0.8800000000
qplot(N, Power, data=powers) + stat_smooth()
~~~
![Power for a two-group comparison of old and new SNP datasets for testing a hypothesis of dysgenics](/images/genetics/dysgenics-snps-exponential-optimal.png)
So for a well-powered two-group comparison of 1950 & 2015 SNP datasets using 86 SNPs, we would want ~4000 in each group for a total _n_=8000; we do have nontrivial power even at a total _n_=1000 (500 in each group means 21% power) but a non-statistically-significant result will be difficult to interpret and if one wanted to do that, reporting a Bayes factor from a Bayesian hypothesis test would make much more sense to express clearly whether the (non-definitive) data is evidence for or against dysgenics.
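(A minimal sketch of what such a Bayesian report might look like, using the `BayesFactor` package---my choice here, not something used elsewhere on this page---on a pair of simulated groups like the ones above:)

~~~{.R}
## hypothetical: report a Bayes factor for the two-group comparison instead of a p-value
library(BayesFactor)
effects86 <- sort(rexp(10000), decreasing=TRUE)[1:86]
polyOld <- replicate(500, sum(genomeOld(effects86)))
polyNew <- replicate(500, sum(genomeNew(effects86)))
ttestBF(x=polyOld, y=polyNew) # BF>1 is evidence for a difference in means (dysgenics), BF<1 evidence for the null
~~~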
This is still too optimistic since we assumed the optimal scenario of only very old and very new genomes, while available genomes are more likely to be distributed fairly uniformly between 1950 and 2015.
Per ["Optimal design in psychological research"](http://www2.psych.ubc.ca/~schaller/528Readings/McClelland1997.pdf), McClelland 1997, we expect a penalty of ~2x in sample size efficiency in going from the optimal two-group extreme endpoints design to samples being uniformly distributed (due to much of our sample size being wasted on estimating small effects) and so we would expect our sample size requirement to at least double to around _n_=16000, but we can do a power simulation here as well.
To get the effect size for each year, we simply spread the frequency decrease evenly over the years and generate hypothetical genomes with birth-years uniformly distributed over 1950-2015 (each with the proportionally smaller frequency decrease for its year), then do a linear regression to get a _p_-value for the year predictor:
~~~{.R}
hits <- 86
sampleSizes <- seq(8000, 30000, by=1000)
iters <- 100
genome <- function(effects) {
t <- sample(c(1:(2015-1950)), 1)
decreasedFrequency <- 0.5 - (((0.5-0.497)/(2015-1950)) * t)
geneFlips <- sample(c(FALSE,TRUE), prob=c(decreasedFrequency, 1-decreasedFrequency), replace = TRUE, length(effects))
geneValues <- ifelse(geneFlips, effects, 0)
return(data.frame(Year=1950+t,
PolygenicScore=sum(geneValues)))
}
simulateStudy <- function(n, hits) {
effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
d <- ldply(replicate(n, genome(effects), simplify=FALSE))
l <- lm(PolygenicScore ~ Year, data=d)
p <- anova(l)$`Pr(>F)`[1]
return(data.frame(N=n, P=p, PO.mean=predict(l, newdata=data.frame(Year=1950)),
PN.mean=predict(l, newdata=data.frame(Year=2015)))) }
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
powerExponentialDistributed <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponentialDistributed)
powerExponentialDistributed$Power <- powerExponentialDistributed$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponentialDistributed); powers
# N Power
# 1 8000 0.27
# 2 9000 0.32
# 3 10000 0.35
# 4 11000 0.33
# 5 12000 0.41
# 6 13000 0.34
# 7 14000 0.41
# 8 15000 0.48
# 9 16000 0.55
# 10 17000 0.62
# 11 18000 0.55
# 12 19000 0.60
# 13 20000 0.69
# 14 21000 0.61
# 15 22000 0.65
# 16 23000 0.63
# 17 24000 0.71
# 18 25000 0.67
# 19 26000 0.71
# 20 27000 0.74
# 21 28000 0.70
# 22 29000 0.79
# 23 30000 0.83
qplot(N, Power, data=powers) + stat_smooth()
~~~
![Power to detect dysgenics effect with SNP samples spread over time](/images/genetics/dysgenics-snps-exponential-realistic.png)
In this case, the power simulation suggests the need for triple rather than double the data, and so a total of _n_=30,000 to be well-powered.
## Mutation load
The paternal mutation load should show up as an increase in mutation counts (~70 new mutations per generation, 35 years per generation, so ~2 per year on average) over the past century, while the genetic selection will operate by reducing the frequency of variants which increase intelligence.
If there are ~70 new mutations per generation (of which perhaps ~2 are harmful), and there is no longer any purifying selection so that all 70 will tend to remain present, how does that compare to existing mutation load averages and, more importantly, standard deviations?
A [mutation load review](/docs/2015-henn.pdf "'Estimating the mutation load in human genomes', Henn et al 2015") leads me to some hard figures from [Simons et al 2014](/docs/2014-simons.pdf "The deleterious mutation load is insensitive to recent population history") ([supplement](/docs/2014-simons-supplementary.pdf)) using data from [Fu et al 2012](/docs/2012-fu.pdf "Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants"); particularly relevant is figure 3, the number of single-nucleotide variants per person over the European-American sample, split by estimates of harm from least to most likely: `21345 + 15231 + 5338 + 1682 + 1969 = 45565`.
The supplementary tables gives a count of all observed SNVs by category, which sum to `300209 + 8355 + 220391 + 7001 + 351265 + 10293 = 897514`, so the average frequency must be `45565/897514=0.05`, and then the binomial SD will be `sqrt(897514*0.05*(1-0.05))=206.47`.
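(Spelling out that arithmetic as a check of the quoted figures:)

~~~{.R}
snvPerPerson <- 21345 + 15231 + 5338 + 1682 + 1969; snvPerPerson       # 45565
totalSNVs <- 300209 + 8355 + 220391 + 7001 + 351265 + 10293; totalSNVs # 897514
avgFreq <- snvPerPerson / totalSNVs; avgFreq                            # ~0.051
sqrt(totalSNVs * avgFreq * (1-avgFreq))                                 # ~208 (~206.5 if the frequency is rounded to 0.05)
~~~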
Considering the two-sample case of 1950 vs 2015, that's an increase of 130 total SNVs (`65*2`), which is 0.63SDs, hence:
~~~{.R}
power.t.test(d=(130/206), power=0.8)
# Two-sample t test power calculation
#
# n = 40.40035398
# ...
~~~
A total of _n_=80.
This particular setup for the two-sample test can be seen as a linear model with the optimum design of allocating half the sample to each extreme (see again McClelland 1997); but more realistically, there is an even distribution across years, in which case the penalty is 2x and _n_=160.
## Weaknesses
There are some potential problems:
1. Range restriction: in many IQ-related studies, failure to account for selection effects yielding a limited range of IQs may [seriously understate the true correlation](https://en.wikipedia.org/wiki/Statistical_conclusion_validity#Restriction_of_range); this is true in general but particularly common in IQ studies because selection on IQ (eg samples of convenience using only college students) is so universal in human society
This may not be such a large issue when dealing with polygenic scores; even severe IQ selection effects will increase polygenic scores only somewhat because the polygenic scores explain so little of IQ variance in the first place.
2. Self-selection by age: if people providing genetic data are not random samples, then there may be pseudo-trends which can mask a real dysgenic trend or create a pseudo-dysgenic trend where there is none. For example, if young people buying genome or SNP data tend to be above-average in intelligence and scientific interest (which anecdotally they certainly do seem to be), while old people tend to get genomes or SNP data due to health problems (and otherwise have average levels of intelligence and thus polygenic score), then in comparing young vs old, one might find not a dysgenic but a pseudo-eugenic trend instead! Conversely, it could be the other way around, if much fewer elderly get genetic data and younger people are more concerned about future health or are going along with a fad, producing a pseudo-dysgenic effect instead (eg in the PGP genome data, there seem to be disproportionately more PhDs who are quite elderly, while younger participants are a more scattershot sample from the general population; probably relating to the circumstances of PGP's founding & Harvard home).
This is probably an issue with databases that rely on voluntary individual contributions, such as PGP, where selection effects have free play. It would be much less of an issue with longitudinal studies where motivations and participation rates will not differ much by age. Since most dysgenic theories accept that recorded IQ scores have remained stable over the 20th century and the decreases in genetic potential either have not manifested yet or have been masked by the Flynn effect & greater familiarity with tests & loss of some g-loading, one might reason that proxies like educational achievement should be increasing throughout one's sample (since they are known to have increased), and a lack of such a trend indicates selection bias.
## Genetic data availability
### Proprietary
The known proprietary databases have long been large enough to carry out either analysis, as well as countless other analyses (but have failed to and represent a [tragedy of the anticommons](!Wikipedia)):
1. The mutation load analysis requires a whole-genome sample size small enough to have been carried out by innumerable groups post-2009.
2. For SNPs, an incomplete list of examples of publications based on large samples:
- 23andMe reached [1 million customers in July 2015](http://blog.23andme.com/news/one-in-a-million/), of whom >=80% opt-in to research (>2m as of [August 2017](https://www.fastcompany.com/40438376/after-a-comeback-23andme-faces-its-next-test "After A Comeback, 23andMe Faces Its Next Test: Can the pioneering DNA-testing company satisfy the FDA while also staying true to its founding mission: putting people in control of their healthcare?"), still 80% opt-in, and [>6m as of November 2017](https://www.cnbc.com/2017/11/21/amazon-has-suddenly-become-a-big-marketplace-for-selling-genetic-tests.html "Amazon has suddenly become a big marketplace for selling genetic tests") with sales increasing); the first questions 23andMe asks all customers are age and education, so they likely have at least 700,000 usable SNP datasets for both discovering educational associations & dysgenic tests. In [June 2010](http://blog.23andme.com/23andme-research/23andme-and-a-new-paradigm-for-research/), they claimed to have 29k opt-ins out of 50k customers, implying they were well-powered for a dysgenic test in 2010 if they had access to a polygenic score, and that in the absence of the score, they could have found the 86 SSGAC hits (using _n_=305k) themselves somewhere around mid-2011 or 2012 and then done a dysgenic test.
- the SSGAC collaboration has _n_=305k as of late 2015
- [GIANT: height](http://www.nature.com/ng/journal/v46/n11/full/ng.3097.html "'Defining the role of common variation in the genomic and biological architecture of adult human height', Wood et al 2015"): _n_=253,288
- [cholesterol](http://www.nature.com/nm/journal/vaop/ncurrent/full/nm.3980.html "'Genome-wide identification of microRNAs regulating cholesterol and triglyceride homeostasis', Wagschal et al 2015"): _n_=188,000
- [UK Biobank](!Wikipedia): _n_=152,729 (sequenced/published on as of June 2015; 500k were enrolled and will be covered eventually)
- [diabetes](http://www.nature.com/ng/journal/v44/n9/full/ng.2383.html "'Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes', Morris et al 2012"): _n_=114,981
- [Psychiatric Genomics Consortium](http://www.med.unc.edu/pgc/results): has run multiple studies of varying sizes, the second-largest (a bipolar study) peaking at a total control group of _n_=51672, and the largest (schizophrenia) at a control group of _n_=113,075
- [Parkinson's](http://www.nature.com/ng/journal/v46/n9/full/ng.3043.html "'Large-scale meta-analysis of genome-wide association data identifies six new risk loci for Parkinson's disease', Nalls et al 2014"): _n_=100,833 (may overlap with 23andMe and some others)
- [eczema](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3424.html "'Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis', Paternoster et al 2015"): _n_=95,464
- [Genetics of Personality Consortium](http://www.tweelingenregister.org/GPC/): eg [Neuroticism](http://palmerlab.org/neuroticism-and-depression-gwas-consortium-paper-accepted-for-publication-in-jama-psychiatry-abraham-palmer-harriet-de-wit-and-amy-hart-are-co-authors/ "Genome-wide association study identifies novel locus for neuroticism and shows polygenic association with Major Depressive Disorder"): _n_=63k
- Dutch LifeLines Biobank and Cohort: _n_>13,000
- Health and Retirement Survey: _n_=12,500
- [Swedish TwinGene project](http://openarchive.ki.se/xmlui/bitstream/handle/10616/41868/Swedish_Twin_Registry_2013.pdf?sequence=3 "'The Swedish Twin Registry: establishment of a biobank and other recent developments', Magnusson et al 2013"): _n_=10,682
- TwinsUK registry: _n_=4,905
- [Generation Scotland: the Scottish Family Health Study](http://www.generationscotland.org/): _n_=24,000
The existing private groups do not seem to have any interest in testing dysgenics, with the possible exception of future GWAS studies examining fertility, one of which is mentioned by [Mills & Tropf 2015](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4577548/ "The Biodemography of Fertility: A Review and Future Research Frontiers"):
> At the time of writing this review, Mills and her research team at the University of Oxford are currently leading a large consortium to engage in the first ever genome-wide association search (GWAS) and meta-analysis of reproductive choice (age at first birth; number of children), conducted in both men and women in over 50 data sets, with the results replicated in additional datasets in a large sample.
The hits in such a GWAS might overlap with intelligence hits, and if the multiple hits increase intelligence but decrease fertility or vice versa (as compared to decreasing or increasing both), that would be evidence for dysgenics.
Or, assuming the betas are reported, polygenic scores for fertility and intelligence could be estimated in independent samples and checked for an inverse correlation.
### Public
There are a few sources of data, primarily SNP data, which are freely available to all users:
1. [1000 Genomes](!Wikipedia): unusable due to a deliberate policy decision by 1000 Genomes to delete all phenotype data, including age; similar is [69 Genomes](http://www.completegenomics.com/public-data/69-Genomes/). Both likely would be unusable due to the diversity of the global sample (there is no reason to think that dysgenics pressures are operating in every population at the same strength)
2. [OpenSNP](https://opensnp.org): hosting for user-provided SNP & phenotype data with [dumps available](https://opensnp.org/genotypes); hosts ~2k SNP datasets, but only [270 users have birth-years](https://opensnp.org/phenotypes/148)
3. [SNPedia](http://www.snpedia.com/index.php/Genomes) likewise hosts SNP data (overlapping with OpenSNP) and genome data, but a very small number
4. [Genomes unzipped](http://genomesunzipped.org/data) provides a small amount of data
5. [DNA.LAND](https://dna.land/): claims _n_=8k based on public participation & input (_n_=43k as of [May 2017](http://biorxiv.org/content/early/2017/05/09/135715 "'DNA.Land: A Digital Biobank Using A Massive Crowdsourcing Approach', Yuan et al 2017")), but seems to then restrict access to a small set of researchers
6. [Exome Aggregation Consortium](http://exac.broadinstitute.org/about): _n_=61,486 exomes; [phenotype data](http://exac.broadinstitute.org/faq) is unavailable
7. [Personal Genome Project](!Wikipedia) (PGP): probably the single largest source of open SNP & genome data. ~1252 participants have registered birthdates according to `demographics.tsv`, and their [statistics page](https://my.pgp-hms.org/public_genetic_data/statistics)'s graphs indicate <300 whole genomes and <1k SNP datasets. Phenotype data [has been recently released as a SQLite database](http://blog.personalgenomes.org/2015/11/19/exploring-the-harvard-pgp-dataset-with-untap/ "Exploring the Harvard PGP Dataset with Untap"), making it easier to work with.
- Genomes: browsing the [user lists for 'Whole genome datasets'](https://my.pgp-hms.org/users), I estimate a total of ~222; looking at the first and last 22 entries, 34 had ages/birth-years, so ~75% of the whole genomes come with the necessary birth-year data, indicating ~166 usable genomes for the purpose of testing dysgenics. With the most recent one uploaded on 2015-10-12, and the earliest recorded being 2011-09-16, that suggests the available genome number increases by ~0.25/day. 166 is uncomfortably close to the requirement for a well-powered test, and there may not be enough data to account for glitches in the data or allow for more complicated statistical testing, but if we wanted to double the available data, we'd only need to wait around 885 days or 2.5 years (or less, depending on whether the collapse in genome sequencing prices continues and prices drop below even the current \$1k genomes).
- SNPs: PGP has [~656 23andMe SNP datasets](https://my.pgp-hms.org/public_genetic_data?utf8=%E2%9C%93&data_type=23andMe&commit=Search) (the number of SNP datasets sourced from other providers is quite small so I didn't include them), dated 2011-01-06 to 2015-10-21, so assuming the same birth-date percentage, ~0.37 per day. Unfortunately, to get 30k SNP datasets through PGP, we would have to wait (linearly extrapolating) 291 years. (Making matters worse, in October 2015, 23andMe doubled its price and reduced the quality of SNP coverage, which will discourage many users and push other users to purchase whole-genome sequencing instead.)
# Power analysis for racial admixture studies of continuous variables {.collapse}
> I consider power analysis of a genomic racial admixture study for detecting genetic group differences affecting a continuous trait such as IQ in US African-Americans, where ancestry is directly measured by genome sequencing and the comparisons are all within-family to eliminate confounding by population structure or racism/colorism/discrimination. The necessary sample size for well-powered studies is closely related to the average size of differences in ancestry percentage between siblings, as the upper bound on IQ effect per percentage is small, requiring large differences in ancestry to detect easily. A within-family comparison of siblings, due to the relatively small differences in ancestry between siblings estimated from IBD measurements of siblings, might require _n_>50,000 pairs of siblings to detect possible effects on IQ, an infeasible sample size. An alternative design focuses on increasing the available ancestry differences within a family unit by comparing *adoptees* with siblings; the larger within-population standard deviation of ancestry creates larger & more easily-detected IQ differences. A random-effects meta-analysis of past admixture & ancestry studies suggests the SD in heterogeneous samples may range from 2% to 20% with a mean of 11% (95% predictive interval), yielding sample sizes of _n_>20,000, _n_=1100, and _n_=500. Hence, an adoption study is probably in the feasible range, with required sample sizes comparable to annual adoption rates among US African-Americans.

[Admixture studies](!Wikipedia "Genetic admixture#Mapping") examine racial phenotypic differences in traits such as blood pressure by comparing people with ancestry from multiple groups, and correlating differences in ancestry percentage with differences in the phenotype.
So, for example, African-Americans have higher blood-pressure than white Americans, and most African-Americans have an average white ancestry of something like 20-25% (see later); if having 26% white ancestry predicts slightly lower blood pressure while 24% predicts higher, that suggests the difference is (as is currently believed) genetic; and this logic can be used to narrow down to specific chromosome regions, and has contributed to study of [racial differences in disease](!Wikipedia "Race and health").
One application would be to thorny questions like potential group differences in non-medical traits like intelligence.
The standard admixture design, requiring a few thousand subjects spanning the full range, might not necessarily work here because of the claimed environmental effects.
[A proposed resolution](http://humanvarieties.org/2013/03/29/cryptic-admixture-mixed-race-siblings-social-outcomes/ "Cryptic Admixture, Mixed-Race Siblings, & Social Outcomes") to the question is to do an admixture study comparing African-American siblings.
Siblings are highly genetically related on average (50%) but in a randomized fashion due to recombination; so two siblings, including fraternal twins, born to the same parents in the same family in the same neighborhood going to the same schools, will nevertheless have many different variants, and will differ in how related they are---the average is 50% but it could be as low as 45% or as high as 55%.
So given two siblings, they will differ slightly in their white ancestry, and if indeed white ancestry brings with it more intelligence variants, then the sibling with a higher white percentage ought to be slightly more intelligent on average, and this effect will have to be causal, as the inheritance is randomized and all other factors are equal by design.
(A design using ancestry percentages measured in the general population, outside families, could make far more powerful comparisons by comparing people with ~0% white ancestry to those with anywhere up to 100%, and would require small sample sizes; such analyses have been done with the expected result, but they are ambiguous & totally unconvincing, as the correlation of greater whiteness with intelligence could easily be due to greater SES, or greater blackness could be a marker for recent immigration, or any of a number of other confounds.)
This has historically been difficult or impossible, since how would one measure the actual ancestry of each sibling?
But with the rise of cheap genotyping, precise measurement of actual (rather than average) ancestry can be done for <\$100, so that is no longer an obstacle.
## Sibling power analysis
How many sibling pairs would this require?
- you are trying to regress `IQ_difference ~ Ancestry_difference`
- the SD of the IQ difference of siblings is known---it's ~13 IQ points (nonshared environment + differences in genetics)
- of this, a small fraction will be explained by the small difference in ancestry percentage
- the power will be determined by the ratio of the IQ-difference-due-to-ancestry-difference SD to the total sibling SD, giving an effect size, which combined with the usual alpha=0.05 and power=0.80, uniquely determines the sample size
- IQ-difference-due-to-ancestry-difference SD will be the advantage of better ancestry times how much ancestry differs
- if you knew the number of relevant alleles, you could calculate through the binomial the expected SD of sibling ancestry differences. As there are so many alleles, it will be almost exactly normal. So it's not surprising that siblings overall, for all variants, are 50% IBD with a SD of 4%.
If we treat it as simply as possible, [Visscher 2006](http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0020041 "Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings") for an analogous height analysis says they measured 588 markers. So, treating sharing as a binomial with 588 draws and _p_=0.5, we expect 294 markers to be shared, with a variance of 147:
~~~{.R}
588 * 0.5
# [1] 294
588 * 0.5*(1-0.5)
# [1] 147
~~~
and the SD around the expected 294 is ~12, which is ~4% of the mean:
~~~{.R}
sqrt((588 * 0.5*(1-0.5)))
# [1] 12.12435565
12/294
# [1] 0.04081632653
~~~
Visscher does a more complicated analysis taking into account the linkage between markers and gets an SD of 3.9%: equation 7; variance = `1/(16*L) - 1/(3*L^2)`, where L = 35 (the genome map length in Morgans), so
~~~{.R}
L=35; sqrt(1/(16*L) - (1/(3*L^2)))
# [1] 0.03890508247
~~~
And [Hill & Weir 2011's](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3070763/ "Variation in actual relationship as a consequence of Mendelian sampling and linkage") theoretical modeling gives an expected sibling SD of 3.92%/3.84% (Table 2), which is nearly identical.
So whatever the mean admixture is, I suppose it'll have a similar relative SD of ~4% of itself.
IIRC, African-Americans are ~25% admixed, so with a mean admixture of 25%, we would expect a sibling's ancestry to vary by roughly $25\% \pm 0.04 \cdot 25\% = 25\% \pm 1\%$, ie. an SD of 1 percentage point in sibling ancestry.
If that 75% missing white ancestry accounts for 9 IQ points or 0.6 SDs, then each percentage point of white ancestry would be worth 0.6/75 = 0.008 SDs.
So that 1-percentage-point ancestry SD yields an IQ SD of 0.008 SDs, which is superimposed on the full sibling IQ difference of ~0.866 SDs (13/15), giving a standardized effect size/_d_ of 0.008 / 0.866 = 0.0092.
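As a rough analytic cross-check before simulating (treating that _d_ as the expected ancestry/IQ-difference correlation, which is an approximation), the `pwr` package gives the required number of sibling pairs:

~~~{.R}
## rough analytic cross-check: pairs needed to detect a correlation of ~0.009 at alpha=0.05, power=0.80
library(pwr)
pwr.r.test(r=0.0092, sig.level=0.05, power=0.80)
## this comes out on the order of ~90,000 pairs
~~~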
Let me try a power simulation:
~~~{.R}
n <- 10000
siblings <- data.frame(
sibling1AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),
sibling2AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))
siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(0.008*15))
siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(0.008*15))
siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )
summary(siblings)
# ...
# siblingAncestryDifference siblingIQDifference
# Min. :-5.370128122 Min. :-68.2971343
# 1st Qu.:-0.932086950 1st Qu.:-11.7903864
# Median : 0.002384529 Median : -0.2501536
# Mean : 0.007831583 Mean : -0.4166863
# 3rd Qu.: 0.938513265 3rd Qu.: 11.0720667
# Max. : 5.271052675 Max. : 67.5569825
summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# ...Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.4192761 0.1705125 -2.45892 0.0139525
# siblingAncestryDifference 0.3306871 0.1220813 2.70874 0.0067653
#
# Residual standard error: 17.05098 on 9998 degrees of freedom
# Multiple R-squared: 0.000733338, Adjusted R-squared: 0.0006333913
# F-statistic: 7.337294 on 1 and 9998 DF, p-value: 0.006765343
confint(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# 2.5 % 97.5 %
# (Intercept) -0.75351500523 -0.08503724643
# siblingAncestryDifference 0.09138308561 0.56999105507
admixtureTest <- function(n, alpha=0.05, ancestryEffect=0.008) {
siblings <- data.frame(
sibling1AncestryPercentage =pmax(0, rnorm(n, mean=25, sd=1)),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),
sibling2AncestryPercentage = pmax(0,rnorm(n, mean=25, sd=1)),
sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))
siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(ancestryEffect*15))
siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )
p <- summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]
return(p < alpha) }
~~~
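With `admixtureTest` defined, power at a hypothetical sample size can be estimated by running the simulated test repeatedly, eg.:

~~~{.R}
## power estimate at a hypothetical n=50,000 sibling pairs (slow: each call simulates 2n siblings)
mean(replicate(1000, admixtureTest(50000)))
~~~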
> Samples taken from the extremes of mixtures of distributions can have very different properties than random samples, such as wildly disproportionate representation of one distribution. This can be used to infer differing means. I demonstrate working backwards from the racial composition of TIP/SMPY samples of extremely (1-in-10,000) gifted youth to estimate the overall racial means, which is consistent with the known racial means and hence an unbiased selection process, using ABC to infer Bayesian credible intervals on the estimated means.

The general properties of statistical distributions can be very different from the properties of specific subsets in counterintuitive ways.
One common example is that a small difference in means for many distributions can lead to large differences in [extreme](!Wikipedia "Extreme value theory") subsets.
For example, male and female average heights differ by a relatively small amount, inches at most. So in a random sample, plenty of women will be taller than men, and vice versa.
However, if we instead ask the sex of the tallest person in the sample, it will often be male, and the larger the sample, the more certain we can be that it will be male, and that the top X% by height will be male.
Likewise, if we wanted to start a basketball league and recruited the tallest 100 people in the country, this small mean difference will show up as our entire basketball league turning out to be male.
(And since height is highly heritable, we may find out that [many of them are related](http://www.wsj.com/articles/nba-basketball-runs-in-the-family-1464130236 "Why Basketball Runs in the Family: A new WSJ study finds 48.8% of players are related to an elite athlete-that number is 17.5% for the NFL and 14.5% for MLB")!)
What seemed like a small difference becomes a large one; we could have worked it out in advance if we had thought about it.
Reasoning from the general to the particular turned out to be tricky in this case because we were dealing with extreme values rather than random samples---1 basketball player *chosen by height* from thousands of people.
Many things of great interest turn out to be like that: we are interested in the extremes much more than the expectation.
Running a 2-hour marathon is an extreme on athleticism; winning the Nobel is an extreme on scientific accomplishment; being enlisted in the NBA is an extreme on height; being admitted to MIT/Stanford/Harvard is an extreme on intelligence; murdering someone is an extreme on violence; winning an Academy Award is an extreme on acting success.
When we ask questions like, "why does the world record in this sport keep being shattered" or "why are so many NBA players related" or "how good can we expect the best chess player to be in 10 years" or "does this racial composition prove bias" or "how much more important are the best authors in literature than obscurer figures" or "why do so few women win the Fields Medal", we're asking extreme value questions whose answers may be counterintuitive---and the answer may be as simple as the shape of distributions, and a slightly lower mean here or a slightly higher standard deviation there.
(Working backwards from a sample selected for passing a threshold to a mean can be called "the method of limits" or "the method of thresholds".)
The study ["When Lightning Strikes Twice: Profoundly Gifted, Profoundly Accomplished", Makel et al 2016](/docs/iq/smpy/2016-makel.pdf) describes the accomplishments of the [Duke TIP](!Wikipedia "Talent Identification Program") sample, 259 children selected for their intelligence by taking the highest-scorers out of 425,000 adolescents taking the SAT (usually <13yo) starting in 1981, representing the top 0.01% of the test-takers.
The TIP sample parallels the better-known SMPY sample, which also selected extremely intelligent adolescents, who were included in a longitudinal sample.
It's frequently suggested, based on anecdotal evidence or biased convenience samples, that more intelligence may not be better: extremely intelligent people may be unhealthy, neurotic, insane, isolated, lonely, discriminated against by society and their peers, and doomed to failure; or, if things are not quite that dire (all studies show outcomes improving up to ~130), then at around that point greater intelligence may stop making any difference, with little difference between someone with an IQ of 130 and one of 160.
This is difficult to study cross-sectionally, because once you start talking about as extreme as 0.01%, it is difficult to recruit any subjects at all, and your sample will be biased in unknown ways; if you only look at successful people, you are missing the hypothetical homeless bum living out of a trash can who is a troubled and misunderstood genius.
To solve these problems, you want to filter through hundreds of thousands of people so you can select the very brightest possible, and you want to find them as early as possible in life, before they have had any chance to fail or succeed, and track them longitudinally as they grow up.
This is what the SMPY & TIP studies do, and the results are that the subjects are spectacularly successful in life; great intelligence is not harmful and the returns to greater intelligence are not zero even as high as 1 in 10,000.
Makel et al 2016 also reports the ethnic breakdown of the TIP and SMPY samples: 72% white, 22% Asian, 6% not reported or other.
This distribution might seem remarkable given that subjects taking the SAT in 1981 were born ~1970, when the USA was ~77% white, ~11% black, and ~0.7% Asian; so whites are slightly under-represented, blacks are severely under-represented (even if we assume the entire 6% 'not reported or other' is black, that is still only about half the expected rate), and Asians are 31x (!) over-represented.
~~~{.R}
## TIP/SMPY sample size & ethnic percentages: https://pbs.twimg.com/media/Cj9DXwxWEAEaQYk.jpg
tip <- 259; smpy <- 320 ## total: 579
white <- ((0.65*tip) + (0.78*smpy)) / (tip+smpy)
asian <- ((0.24*tip) + (0.20*smpy)) / (tip+smpy)
white; asian
# [1] 0.7218480138
# [1] 0.2178929188
# http://drjamesthompson.blogspot.com/2016/06/some-characteristics-of-eminent-persons.html
# > The data on ethnicity are rather sparse, but we can do a little bit of work on them by looking at US Census
# > figures for the 1970s when most of these children were born: White 178,119,221...Asia 1,526,401...So, in the
# > absence of more detailed particulars about the Other category, Asians win the race by a country mile. If we
# > simplify things by considering only Whites, Blacks and Asians the US in 1970 then the country at that time was
# > 88% White, 11% Black, and less than 1% Asian. The actual results of eminent students are 77% White, 0% Black,
# > 22% Asian. No need for a Chi square.
#
# Asian is 0.7%: R> 1526401 / (178119221 / 0.80)
whiteRR <- white / 0.77; asianRR <- asian / 0.007
whiteRR; asianRR
# [1] 0.937464953
# [1] 31.12755983
~~~
Of course, races in the USA have long differed by mean intelligence, with the rule of thumb being Asians ~105 IQ, whites ~100, and blacks ~90.
So the order is expected---but still, 31x!
Are the results being driven by some sort of pro-Asian bias or otherwise bizarre?
But this is an extreme sample: 1-in-10,000 is far out on the tails, at 3.71 SDs.
~~~{.R}
-qnorm(1/10000)
# [1] 3.719016485
~~~
Maybe this is normal. Can we work backwards from the overrepresentations to what differences would have generated them?
Yes, we can, even with this small sample which is so extreme and unrepresentative of the general population.
This is because it is an [order statistics](!Wikipedia) problem: we know the order represented by the sample and so can work back to parameters of the distribution the order statistics are being generated by.
Since IQ is normally distributed and we know both the overrepresentation RR and the exact cutoff/limit used in the sample, we can convert the limit to _a_ standard deviations and then find the normal distribution $\mathcal{N}(100+x,15)$ whose tail beyond _a_ SDs is RR (31x) larger than that of $\mathcal{N}(100,15)$.
We can compare the two tails using two `pnorm`s, shifting the second by the hypothesized gain. So, for example, a shift of 15 IQ points or 1 SD would lead to 84x overrepresentation:
~~~{.R}
pnorm(qnorm(1/10000)) / pnorm(qnorm(1/10000) - (15/15))
# [1] 84.39259519
~~~
We would like to solve for the shift which leads to an exact overrepresentation like 31.127; an optimization routine like R's `optim` function can do that, but it requires an error to minimize, so minimizing `pnorm()/pnorm(x)` doesn't work since it just leads to negative infinity, nor will `RR == pnorm()/pnorm(x)` work, because it evaluates to 0 (FALSE) for all values of _x_ except exactly the right one. Instead, we minimize the squared error between the ratio predicted by a particular _x_ and our observed RR.
This works:
~~~{.R}
## An optimization routine which automatically finds for us the IQ increase which most closely matches the RR:
solver <- function(RR, cutoff=10000) {
optim(1,
      function(IQ_gain) { (RR - (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)-(IQ_gain/15))))^2 })$par }
100 + solver(whiteRR)
# [1] 99.75488281
100 + solver(asianRR)
# [1] 111.8929688
~~~
So our inferred white & Asian population means are: 99.8 and 111.9. These are relatively close to the expected values.
This approach can be used to infer other things as well.
For example, the TIP/SMPY papers have not, as far as I've seen, mentioned what fraction of the white subjects were ethnic Jewish; since they are so over-represented in areas like Nobel prizes, we would expect many of the TIP/SMPY white students to have been Jewish.
Using an estimate of the Jewish population in 1970 and estimates of their mean IQ, we can work forward to what fraction of TIP/SMPY subjects might be Jewish.
The [1970-1971 National Jewish Population Study](http://www.jewishdatabank.org/studies/details.cfm?StudyID=304) estimated "5,800,000 persons (of whom 5,370,000 were Jews) living in Jewish households" out of a total US population of 205 million, or 2.8% of the total population or ~3.6% of the white population. So of the ~418 white subjects, ~15 would be expected to be Jewish under the null hypothesis of no difference.
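That expected count is simple arithmetic (a quick check, using the 5.8 million 'Jewish households' figure and the ~77% white population share):

~~~{.R}
## expected Jewish subjects among the ~418 white TIP/SMPY subjects under the null of no difference:
jewishFractionOfWhites <- (5800000 / 205000000) / 0.77
418 * jewishFractionOfWhites
# [1] 15.35889
~~~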
The majority of American Jews are of Ashkenazi descent^[This renders the population estimate a bit off, but I couldn't find any sources on the breakdown of Sephardic vs Ashkenazi in the USA in 1970 other than a comment that the latter were a "vast majority". Since the Jewish Population Study was probably an undercount in not including all the people of Jewish descent, I'm hopeful those two biases cancel out.], for whom intelligence estimates are [debated](!Wikipedia "Ashkenazi Jewish intelligence#Evidence for a group difference in intelligence") but tend to range 105-115 (with occasional samples suggesting even higher values, like [Levinson 1957](/docs/iq/1957-levinson.pdf "The Intelligence of Applicants for Admission to Jewish Day Schools")).
In the [Barbe 1964](http://files.eric.ed.gov/fulltext/ED013518.pdf "One In A Thousand: A Comparative Study of Highly and Moderately Gifted Elementary School Children") Ohio sample (IQ ~143), 8% were Jewish^[The high IQ sample in Barbe 1964 would have been >8% Jewish, but the paper only reports the overall Jewishness, without specifying whether it's 4% vs 12% or something like that.]; in [Terman's](!Wikipedia "Genetic Studies of Genius") (ratio IQ >140) 1920s sample in SF/LA, 10% were Jewish; Hollingworth's 1930s sample (>180) turned up 51/55 or 90% Jewish[^Hollingworth]; [Byrns 1936's](/docs/iq/1936-byrns.pdf "Intelligence and Nationality of Wisconsin School Children") 1931 Wisconsin state sample found 18% of the Jewish sample to be in the top decile vs 10% American; in the [Hunter College Elementary School](!Wikipedia) sample 1948-1960 (>140, mean 157) in New York City, 62% were Jewish ([Subotnik et al 1989](/docs/iq/1989-subotnik.pdf "High IQ children at midlife: An investigation into the generalizability of Terman's genetic studies of genius"), [Subotnik et al 1993](/docs/iq/1993-subotnik-geniusrevisited.pdf "Genius Revisited: High IQ Children Grown Up")[^Subotnik1993]).
Given estimates of the Jewish population of children in those specific times and places, one could work backwards to estimate a Jewish mean.
[^Subotnik1993]: Subotnik et al 1993, pg3-4:
> The mean IQ of the Hunter sample was 157, or approximately 3.5 standard deviations above the mean, with a range of 122 to 196 on the L-M form. [Stanford-Binet Intelligence Scale, Form L-M (SBL-M)]
>
> ...Each class at Hunter College Elementary School from the years 1948 to 1960 contained about 50 students, yielding a total possible population of 600 graduates...35% of the total population of 1948-1960 HCES students (_n_=210) completed and returned study questionnaires
>
> ...*Religious Affiliation*: The Hunter group is approximately 62% Jewish, although they describe themselves as Jews more in terms of ethnic identity than religious practice. The group, as a whole, is not religious.
>
> *Educational Attainments*: Over 80% of the study participants held at least a Master's degree. Furthermore, 40% of the women and 68% of the men held either a Ph.D, LL.B., J.D., or M.D. degree. Occupation and Income: Only two of the HCES women identified themselves primarily as homemakers. 53% were professionals, working as a teacher at the college or pre-college level, writer (journalist, author, editor), or psychologist. The same proportion of HCES men were professionals, serving as lawyers, medical doctors, or college teachers. The median income for men in 1988 was \$75,000 (range = \$500,000) and for women \$40,000 (range = \$169,000). Income levels were significantly different for men and women, even when matched by profession. For example, the median income for male college teachers or psychologists was \$50,000 and for females, \$30,000
We can calculate the fraction of the white sample being Jewish for each possible mean IQ:
[^Hollingworth]: [Hollingworth & Rust 1937](/docs/iq/1937-hollingworth.pdf "Application of the Bernreuter Inventory of Personality to Highly Intelligence Adolescents"): "The data of the present study were obtained early in 1933, the subjects being 36 boys and 19 girls, of the average age of 18 years 6 months. The IQ's (S-B) of all had been taken in early childhood (9). The group ranged from 135-190 IQ (S-B), with a median at about 153 IQ (S-B). All but four of these young persons were Jewish, a factor which must be considered as of possible consequence (8, 14)..."
~~~{.R}
proportion <- function (gain, cutoff=10000) {
(pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)-(gain/15))) }
possibleIQs <- seq(5, 15, by=0.5)
data.frame(Advantage=possibleIQs, Fraction.of.white=(sapply(possibleIQs, proportion) * 15) / 418)
# Advantage Fraction.of.white
# 1 5.0 0.1415427303
# 2 5.5 0.1633099334
# 3 6.0 0.1886246225
# 4 6.5 0.2180947374
# 5 7.0 0.2524371552
# 6 7.5 0.2924980125
# 7 8.0 0.3392769622
# 8 8.5 0.3939561508
# 9 9.0 0.4579348680
# 10 9.5 0.5328710150
# 11 10.0 0.6207307813
# 12 10.5 0.7238482059
# 13 11.0 0.8449966589
# 14 11.5 0.9874747049
# 15 12.0 1.1552093388
# 16 12.5 1.3528802227
# 17 13.0 1.5860693342
# 18 13.5 1.8614413902
# 19 14.0 2.1869615788
# 20 14.5 2.5721585555
# 21 15.0 3.0284424112
~~~
Judging from earlier samples with very high cutoffs, I'd guess the TIP/SMPY white sample is at least majority Jewish, implying a mean IQ of ~110; this is pleasantly similar to estimates based on regular samples & estimation. This result is also similar to [La Griffe du Lion's 2003 threshold analysis](http://www.lagriffedulion.f2s.com/ashkenaz.htm "Assessing the Ashkenazic IQ") estimating a mean IQ of 112 based on Ashkenazi overrepresentation among USSR championship chess players, 111 based on Western [Fields Medal](!Wikipedia) awards, and 110 based on the USA/Canada [Putnam competition](!Wikipedia "William Lowell Putnam Mathematical Competition").
But if the mean IQ was as high as 112, then almost every single white subject would be Jewish in every sampling, which seems implausible and like something so striking that anyone writing or involved with TIP/SMPY would have to have mentioned at *some* point---right?
For the same reason, the original estimate of 112 for the Asians strikes me as on the high side.
This could be due to problems in the data like underestimating the Asian population at the time---perhaps the Southeast/Midwest states that TIP samples from were more than 0.7% Asian---or it could be due to sampling error (only _n_=579, after all).
Working backwards doesn't immediately provide any measure of precision or confidence intervals. Presumably someone has worked out analytic formulas with standard errors and confidence intervals, but I don't know of them.
Instead, since the selection process which generated our data is straightforward (population mean -> millions of samples -> take top 1-in-10000s -> calculate overrepresentation), I can again use [Approximate Bayesian computation](!Wikipedia) (ABC) to turn a simulation of the data generating process into a method of Bayesian inference on the unknown parameters (population means) and get credible intervals.
What sort of confidence do we have in these estimates, given that these RRs are based on a sample of only _n_=579?
We can simulate TIP/SMPY-like selection by drawing hypothetical means for the two groups, generating each group's share of `579 * 10000` (~5.8 million) simulated test-takers according to its 1970 population fraction, selecting those above the 1-in-10,000 cutoff, and computing the resulting group fractions.
If we put a prior on the means and keep only those means which successfully generate TIP/SMPY-like fractions of ~72% & ~22%, this becomes ABC, with the retained means forming the posterior distribution of the means.
(It would likely be faster to use MCMC like JAGS, but while JAGS provides truncated normal distributions which one could sample from quickly, and the necessary `pnorm`/`qnorm` functions, it's not clear to me how one could go about estimating the overperformance ratio and the binomial.[^JAGS-TIP] There's likely some way to use [order statistics](/Order-statistics) more directly than simulating cutoffs, in which case there is a transformation to a beta distribution over 0-1, which is a distribution well-supported by MCMC software and might allow an exact solution as well.)
For my priors, I believe that the rules of thumb of 100/105 are accurate and highly unlikely to be more than a few points off, so I use weak priors on the population means of $\mathcal{N}(100, 5)$ & $\mathcal{N}(105, 5)$.
[^JAGS-TIP]: My first attempt at it in JAGS went like this:
~~~{.R}
model_string <- '
model {
cutoffIQ <- 100 + 3.719016485*15
mu_asian ~ dnorm(105, 4^-2)
X_asian ~ dnorm(mu_asian, 15^-2) # T(cutoffIQ,)
X_frac_asian <- X_asian > cutoffIQ
P_asian <- 0.07 * (X_frac_asian / length(X_asian))
Y_asian ~ dbinom(P_asian, total)
# mu_white ~ dnorm(100, 4^-2)
# X_white ~ dnorm(mu_white, 15^-2) # T(cutoffIQ,)
# X_frac_white <- X_white > cutoffIQ
# P_white <- (1-0.07) * (X_frac_white / length(X_white))
# Y_white ~ dbinom(P_white, total)
}
'
library(runjags)
Y_asian=126
Y_white=418
total=579
model <- run.jags(model_string, data = list(Y_asian=Y_asian, Y_white=Y_white, total=total),
monitor=c("mu_asian", "mu_white"),
n.chains = getOption("mc.cores"), method="rjparallel")
summary(model)
~~~
But then I realized that `X_frac_asian <- X_asian > cutoffIQ` didn't do what I thought it did and I needed to somehow draw a large number of samples, just like in the ABC simulation, and compare to the number after the truncation... or something.
In exact ABC, we would keep only data which exactly matched 72%/22%, but that would require rejecting an extremely large number of samples. Here we'll loosen it to ±2% tolerance:
~~~{.R}
simulateTIPSMPY <- function() {
## informative priors: IQs are somewhere close to where we would estimate based on other datasets
whiteMean <- round(rnorm(1, mean=100, sd=5), digits=2)
asianMean <- round(rnorm(1, mean=105, sd=5), digits=2)
iqCutoff <- 100 + -qnorm(1/10000) * 15
whites <- rnorm(0.770 * 579 * 10000, mean=whiteMean, sd=15)
whiteSample <- max(1, sum(ifelse(whites>iqCutoff, 1, 0)))
asians <- rnorm(0.007 * 579 * 10000, mean=asianMean, sd=15)
asianSample <- max(1, sum(ifelse(asians>iqCutoff, 1, 0)))
## white+Asian ~= 94% of original total sample, so inflate by that much to preserve proportions: ~1.06
totalSample <- (whiteSample+asianSample) * (1 + (1-(white+asian)))
whiteFraction <- round(whiteSample / totalSample, digits=2)
asianFraction <- round(asianSample / totalSample, digits=2)
# print(paste("samples: ", c(whiteSample, asianSample), "fractions: ", c(whiteFraction, asianFraction)))
tolerance <- 0.02
if ((abs(whiteFraction - 0.7218480138) < tolerance) && (abs(asianFraction - 0.2178929188) < tolerance)) {
return(data.frame(White=whiteMean, Asian=asianMean))
}
}
library(parallel); library(plyr)
simulateSamples <- function(n.sample=10000, iters=getOption("mc.cores")) {
## because of rejection sampling, no run is guaranteed to produce a sample so we loop:
results <- data.frame()
while (nrow(results) < n.sample) {
simResults <- ldply(mclapply(1:iters, function(i) { simulateTIPSMPY() } ))
results <- rbind(results, simResults)
# print(paste("Samples so far: ", nrow(results)))
}
return(results) }
posteriorSamples <- simulateSamples()
mean(posteriorSamples$White < posteriorSamples$Asian)
# [1] 1
## we have relatively few samples, so get a better posterior estimate by shuffling the posterior samples & comparing many times:
mean(replicate(1000, mean(c(sample(posteriorSamples$White) < sample(posteriorSamples$Asian)))))
# [1] 0.9968822
quantile(probs=c(0.025, 0.975), posteriorSamples$White, na.rm=TRUE)
# 2.5% 97.5%
# 89.49975 101.38050
quantile(probs=c(0.025, 0.975), posteriorSamples$Asian, na.rm=TRUE)
# 2.5% 97.5%
# 101.37000 116.74075
par(mfrow=c(2,1))
hist(posteriorSamples$White, main="Posterior white mean IQ estimated from TIP/SMPY cutoff & ratio", xlab="IQ")
hist(posteriorSamples$Asian, main="Posterior Asian mean", xlab="IQ")
~~~
![Histograms of the posterior estimate of white & Asian mean IQs ~1970 as estimated from fraction of TIP/SMPY sample using ABC](/images/iq/smpy/iq-smpytip.png)
So sampling error does turn out to be substantial: our 95% credible intervals are white 90-101, Asian 101-116.
Still, the overlap is minimal, with _P_=99.7% that the Asian mean is higher than the white.
We are able to conclude that the rank ordering is highly likely to be correct, and the results are consistent with the conventional wisdom, so there is no prima facie case for bias in the results: the ethnic composition is in line with what one would calculate from the design of TIP/SMPY and population means.
# _Genius Revisited_: On the Value of High IQ Elementary Schools
> _Genius Revisited_ documents the longitudinal results of a high-IQ/gifted-and-talented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the general high education & income levels, but absence of great accomplishment on a national or global scale (eg a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.
>
> I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making two statistical errors: base-rate neglect, and regression to the mean.
>
> First, their standards fall prey to a base-rate fallacy: even an extremely high predictive validity of IQ would not predict 1 or more Nobel prizes, because Nobel prize odds are on the order of 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.
>
> Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable as it is driven by the [regression to the mean](!Wikipedia) caused by the relatively low correlation of early childhood with adult IQs---which means their sample is far less elite as adults than they believe. Using early-childhood/adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES's associated high-IQ/gifted-and-talented high school, Hunter High, which has access to the adolescents' more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
>
> This unavoidable statistical fact undermines the main rationale of HCES: extremely high-IQ adults cannot be very accurately selected as kindergarteners on the basis of a simple test. This greater-regression problem can be lessened by the use of additional variables in admissions, such as parental IQs or high-quality genetic polygenic scores; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and HCES students should not be assigned scarce magnet high school slots.
Split out to [separate article](/Hunter).
# Great Scott! Personal Name Collisions and the Birthday Paradox {.collapse}
> "How large does can a social circle be before first names no longer suffice for identification? Scott, I'm looking at you."
>
> [MakerOfDecisions](https://twitter.com/MakerOfDecision/status/759086911084490752), 29 July 2016

Scott here refers to any of [Scott Alexander](http://slatestarcodex.com/about/), [Scott Adams](!Wikipedia), [Scott Aaronson](!Wikipedia), [Scott Sumner](!Wikipedia) (and to a much lesser extent, Scott Garrabrant, [Orson Scott Card](!Wikipedia), and [Scott H. Young](https://www.scotthyoung.com/blog/)); a reference to a 'Scott' on a site like Less Wrong is increasingly ambiguous.
When a large number of samples draw from a common pool of identifiers, collisions are common, leading to the [birthday paradox](!Wikipedia): despite there being 365.25 days in the year, a classroom of just 23 people (who can cover at most 6% of the days in a year) is ~50% likely to have at least two people who share the same birthday and so birthdays cease being unique unambiguous identifiers.
(Intuitively, you might expect the number to be much larger and closer to 180 than 23.)
We can verify this by simulation:
~~~{.R}
dupes <- function(a) { length(a) != length(unique(a)) }
identifiers <- function(n, ids, probabilities) { sample(1:ids, n, prob=probabilities, replace=TRUE) }
simulate <- function(n, ids, probabilities=rep(1/ids, ids), iters=10000) {
sims <- replicate(iters, { id <- identifiers(n, ids, probabilities)
return(dupes(id)) })
return(mean(sims)) }
simulate(23, 365)
# [1] 0.488
sapply(1:50, function(n) { simulate(n, 365) } )
# [1] 0.0000 0.0029 0.0059 0.0148 0.0253 0.0400 0.0585 0.0753 0.0909 0.1196 0.1431 0.1689 0.1891
# 0.2310 0.2560 0.2779 0.3142 0.3500 0.3787 0.4206 0.4383 0.4681 0.5165 0.5455 0.5722 0.5935
# [27] 0.6227 0.6491 0.6766 0.7107 0.7305 0.7536 0.7818 0.7934 0.8206 0.8302 0.8465 0.8603 0.8746
# 0.8919 0.9040 0.9134 0.9248 0.9356 0.9408 0.9490 0.9535 0.9595 0.9623 0.9732
~~~
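The simulation matches the exact calculation, the complement of the probability that all 23 birthdays are distinct:

~~~{.R}
## exact probability of at least one shared birthday among 23 people, 365 equiprobable days:
1 - prod((365:343)/365)
# [1] 0.5072972
~~~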
Similarly, in a group of people, it will be common for first names to overlap.
(Overlaps of both first names & surnames are much more unlikely: [Charpentier & Coulmont 2017](https://arxiv.org/abs/1707.07607 "We are not alone! (at least, most of us). Homonymy in large scale social groups") estimate from French & Ohioan data that while almost everyone has a non-unique full name, even groups of thousands of people will have only a few duplicates.)
How common? There are far more than 365.25 first names, especially as some first names are made up by parents.
Names have a highly skewed (often said to be a [power law](!Wikipedia)) distribution: the first few baby names make up an enormous fraction of all names, hence all the Ethan/Lucas/Mason baby boys in 2016.
(One would think that parents would go out of their way to avoid too-popular names, but apparently not.)
Since there are only "10,000 things under heaven", one might think that the top 10000 personal names would give a good guess.
At what _n_ can we expect a collision?
~~~{.R}
findN <- function(ids, targetP=0.5, startingN=1, probabilities=rep(1/ids, ids)) {
n <- startingN
collisionProbability <- 0
while (collisionProbability < targetP) {
collisionProbability <- simulate(n, ids, probabilities)
n <- n+1
}
return(n) }
findN(10000)
# [1] 118
simulate(118, 10000)
# [1] 0.5031
~~~
We could also use [an approximation](!Wikipedia "Birthday problem#Approximations") such as the square approximation: $n \approx \sqrt { 2m \times p(n)}$: `sqrt(2 * 10000 * 0.5) ~> 100`
Or the similar upper bound: `ceiling(sqrt(2*10000*log(2))) ~> 118`.
So the collision point is smaller than [Dunbar's number](!Wikipedia).
But all of these are themselves upper bounds because the case in which birthdays/names are uniformly distributed is the worst case.
If there is any difference in the probabilities, a collision will happen much earlier.
This makes sense since if 1 birthday happens with, say, P=0.99, then it's almost impossible to go more than 3 or 4 birthdays without a collision. Likewise, if one birthday has P=0.50, and so on down to $P=\frac{1}{365.25}$:
~~~{.R}
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.99, rep(0.01/364, 364)))})
# [1] 0.0000 0.9789 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
# 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.5, rep(0.5/364, 364)))})
# [1] 0.0000 0.2531 0.5031 0.6915 0.8182 0.8896 0.9402 0.9666 0.9808 0.9914 0.9951 0.9973 0.9988
# 0.9993 0.9991 0.9999 1.0000 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000
~~~
How skewed are real names?
[Given Names Frequency Project](http://www.galbithink.org/names/agnames.htm) provides ["Popular Given Names US, 1801-1999"](http://www.galbithink.org/names/us200.htm) (1990-1999, 909288 names) based on Social Security data.
After deleting the first 4 lines of `s1990m.txt`, it can be loaded into R and the fractions used as probabilities to find the 50% collision point for US names:
~~~{.R}
names <- read.csv("s1990m.txt", header=FALSE)
summary(names)
# V1 V2
# Aaron : 1 Min. : 55.0000
# Abdiel : 1 1st Qu.: 86.0000
# Abdullah: 1 Median : 183.0000
# Abel : 1 Mean : 914.1923
# Abraham : 1 3rd Qu.: 535.5000
# Adam : 1 Max. :24435.0000
# (Other) :852
sum(names$V2)
# [1] 784377
## "Scott" as fraction of all names:
2279 / 784377
# [1] 0.0029054906
## presumably male names:
2279 / (784377*0.5)
# [1] 0.005810981199
simulate(118, nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 1
findN(nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 15
~~~
So a more realistic analysis suggests _n_=15 is where unique first names will probably break down.
This only covers the ~850 most common personal names, and the more names there are, the higher the _n_ has to be to trigger a collision (making 15 something of a lower bound); to extrapolate to a full 10,000 names, we need to fit a distribution to the observed frequencies.
The [log normal distribution](!Wikipedia) fits reasonably well and is easy to work with:
~~~{.R}
library(fitdistrplus)
fitdist(names$V2, "lnorm")
# Fitting of the distribution ' lnorm ' by maximum likelihood
# Parameters:
# estimate Std. Error
# meanlog 5.550448321 0.04640182299
# sdlog 1.359185357 0.03281096378
simulateLN <- replicate(100, {
names <- rlnorm(10000, meanlog=5.550448321, sdlog=1.359185357)
hit <- findN(length(names), startingN=46, probabilities=names/sum(names))
return(hit)
})
median(simulateLN)
# [1] 51
~~~
Since first names will cluster by age group, location, profession, and whatnot, arguably even 51 is a bit of an upper bound.
Finally, one might ask the probability of a group with a great Scott, or to put it another way, the probability of it unfortunately getting away scot-free.
This is easy to answer: the probability of having 1 or more Scotts in a group is 1 minus the probability of everyone having a name other than Scott.
We saw that the probability of being named Scott was _P_=0.0029054906 in the name dataset.
So the probability of one person not being named Scott is $1 - 0.0029 = 0.997$. So the probability of _n_ people all being named not-Scott is 0.997^_n_^.
The crossover point is ~239.
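A quick check of that crossover point, using the 'Scott' frequency from the name dataset above:

~~~{.R}
## group size at which the probability of containing at least one Scott reaches 50%:
pScott <- 2279 / 784377
log(0.5) / log(1 - pScott)   # ~238, ie. a crossover at ~239 people
~~~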
So an American social group cannot exceed _n_=51 before first names begin to break down, and it is all Scott's fault at _n_=239.
# Detecting fake (human) Markov chain bots {.collapse}
Some popular Twitter and Tumblr accounts use [Markov chains](!Wikipedia) trained on a corpus of writing such as [Markov James Mickens](https://twitter.com/MarkovMickens) or two unrelated corpuses to create amusing mashups: programming documentation and [H.P. Lovecraft's horror/SF fiction](http://thedoomthatcametopuppet.tumblr.com/ "The Doom That Came to Puppet") or the [King James Bible](http://kingjamesprogramming.tumblr.com/ "King James Programming") or [the works of Karl Marx](https://marxistprogramming.tumblr.com/ "Marxist Programming"), [Kim Kardashian and Kierkegaard](https://twitter.com/kimkierkegaard "Kim Kierkegaardashian"), or [Silicon Valley recruiting emails and Erowid drug use reports](https://twitter.com/erowidrecruiter "Erowid Recruiter").
The humor comes from the fact that the Markov chains have no understanding and are merely programs producing gibberish that occasionally present striking juxtapositions or insights.
Their appeal derives in large part from the fact that while humans *curate* them, humans don't *write* them.
They depend on this authenticity to be striking.
Of course, there's always the temptation to edit them or write them wholesale (perhaps because the Markov chains aren't cooperating in producing any comedy gold to tweet that day), which deceives the reader.
This poses an inverse Turing test: how would you detect a fake Markov chain account, that is, one where a human is pretending to be a computer and writing some of the text?
Markov chains are trained on a specific corpus and are a probabilistic generative model which encode the probability that a word _X_ follows another word _Y_ for all the words in that corpus (and similarly if they are operating on letters or on [_n_-grams](!Wikipedia "n-gram")); there is no state or memory or 'look back' or ability to model recursion.
To generate text, one simply picks a random word _Y_, looks up the probabilities of all the words _A_..._Z_ from _Y_, and picks a word at random weighted by those probabilities; then repeat indefinitely.
Conversely, one could also use it to calculate the [likelihood](!Wikipedia) of a given text by multiplying the probability of each word in the text conditional on the previous one.
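To make the mechanics concrete, here is a minimal sketch of a word-level Markov chain in R (using a toy corpus of my own, not any of the accounts above): estimating the transition probabilities, generating text, and scoring a text's log-likelihood:

~~~{.R}
## A minimal word-level Markov chain (toy corpus; illustration only):
corpus <- c("the", "deep", "ones", "call", "the", "function", "the", "deep", "call")
bigrams <- table(head(corpus, -1), tail(corpus, -1))   # counts of word -> next-word
transitionP <- prop.table(bigrams, margin=1)           # P(next word | current word)

## generate n words by repeatedly sampling the next word from the transition probabilities:
generateText <- function(n, start="the") {
    out <- start
    for (i in 2:n) { out[i] <- sample(colnames(transitionP), 1, prob=transitionP[out[i-1],]) }
    paste(out, collapse=" ") }

## score a text by summing the log transition probabilities of its consecutive word pairs:
logLikelihood <- function(text) {
    words <- strsplit(text, " ")[[1]]
    sum(log(mapply(function(a,b) { transitionP[a,b] }, head(words,-1), tail(words,-1)))) }

generateText(6)
logLikelihood("the deep ones call the function")
~~~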
One difficulty is the potential for double-use of data: the first pass through a Markov chain account is already applying to the data a highly flexible Bayesian neural network with billions of parameters (one's brain).
If one spots an 'anomalous' dataset and subsequent analysis confirms it, what does this mean?
I am reminded of one past incident: someone had lost a great deal of money on a Bitcoin gambling website, and suspected the site had defrauded him.
But he had contacted me only *because* he had had unusual losses. What does an analysis mean? Imagine that the top 1% of losers get angry and start looking into whether they were cheated; they go to a statistician who duly computes that based on the number of games played, there is a _p_=0.01 that they would lose as much or more as they did...
If one had *all* the gambling records, one could look at the overall patterns and see if there are more losers than there should be given the rules of the game and a supposedly fair random number generator, but what does one do with 1 self-selected player? The data generation process is certainly neither random nor 'ignorable' nor modelable without dubious assumptions.
A few possible attacks come to mind:
- observation of malformed syntax or lack of long-range dependencies
- vocabulary or output outside an independently trained Markov chain's domain
- unusually low likelihood for an independently trained Markov chain to generate known samples
- unusually low likelihood for an independently trained Markov chain to generate known samples compared to newly generated samples filtered at a 1-in-100s quality level
- unusually high quality of known samples compared to newly generated samples from independently trained Markov chain filtered at a 1-in-100s quality level, tested nonparametrically or parametrically as a mixture model
Markov chains produce realistic-looking output and are efficient to create & run, but, compared to RNNs, they notoriously model recursive syntax poorly, such as nested parentheses (since they have no way of remembering whether a parenthetical comment had been started), and cannot extrapolate---for example, a word-level Markov chain can't create new words, and would require _n_-grams to have available fragments of words which could be recombined.
The memory-less nature of Markov chains also means that, lacking any memory which could model the 'long-range correlations' found in natural English text (like systematic use of particular names/topics/vocabulary), larger samples quickly veer off-topic, become gibberish, and lack coherency, possibly even within a single sentence.
(RNNs also have this problem, but somewhat less.)
With the limits of a Markov chain in mind, it would be easy to detect faked Markov chain output with large samples: it is just difficult for a human to deliberately generate long text which is as nonsensical and syntactically invalid as a Markov chain creates, for the same reason an unpracticed human is a remarkably bad random number generator.
However, for this same reason the selected Markov samples tend to be very short, usually no more than a sentence.
It *might* be possible to measure this on the samples as a whole and observe higher entropy or memoryless-ness (eg by measuring compression performance or efficiency of a Markov chain in modeling the samples), but I would guess that usually the samples are not long enough or large enough for this to have reasonable [statistical power](!Wikipedia) as a test.
This eliminates the easiest test.
Since the corpus is known in many of these cases, we can assume access to a Markov chain model which is similar (if not identical) to the one which supposedly wrote all the tweets.
This gives us several possibilities.
We could exploit the lack of creativity of Markov chains and look for anything in the tweets which is not present in the original corpus.
For example, if a word like "[cromulent](!Wikipedia)" appears neither in the Puppet documentation nor (having been coined in 1996, 59 years after he died) in H.P. Lovecraft's fiction, then it would have a probability of 0 of being generated by any Puppet/Lovecraft Markov chain (as no word will have any transition probability to it).
Such extra-corporal vocabulary immediately proves human authorship.
Continuing this same logic, we could take the corpus, train our own Markov chain (which will at least be similar), and use it to calculate the likelihood of all the tweets.
A human-written tweet may be *possible* for the Markov chain to have written, but it will be far more unlikely than most of the ones the Markov chain actually wrote & were selected.
So we would see that most of the tweets have reasonable log likelihoods, but that our suspicious ones will be far more extreme.
(If the Markov chains are word-level, this test subsumes the impossible-word test: any tweet with a word not in the corpus, and hence not represented in the Markov chain, will have a meaningless likelihood.)
This likelihood test might not help if they are all equally extreme, in which case one could use our Markov chain in another manner, as a generative model, to try to estimate the likelihood of getting as great a tweet.
For this, one samples several thousand samples from our Markov chain, and screens them for good ones.
This creates an empirical distribution of the likelihoods of good tweets conditional on the null hypothesis of a Markov chain author; in this case, the null hypothesis is known to be true by construction.
Then to test, one compares the known-Markov-chain tweets with the likelihoods of the suspect tweets (perhaps with a [permutation test](!Wikipedia)).
They should be similar.
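As a sketch (with made-up log-likelihood vectors standing in for ones computed with the trained chain), such a permutation test might look like:

~~~{.R}
## Hypothetical log-likelihood vectors (stand-ins for real values):
generatedLL <- rnorm(200, mean=-40, sd=8)   # our own generated-&-screened tweets
suspectLL   <- rnorm(50,  mean=-32, sd=8)   # the account's published tweets

observedDiff <- mean(suspectLL) - mean(generatedLL)
permutedDiffs <- replicate(10000, {
    pooled <- sample(c(generatedLL, suspectLL))   # shuffle the group labels
    mean(pooled[1:50]) - mean(pooled[-(1:50)]) })
mean(abs(permutedDiffs) >= abs(observedDiff))     # two-sided permutation p-value
~~~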
Alternately, if one doesn't want to use likelihoods as a measure of improbability, one could instead use some human measure of funniness, like rating the originals and the samples on a 1-5 scale, and comparing them.
The original poster is probably not screening more than a few hundred generated tweets for each selected tweet, so given a similar level of stringency, one's generated tweets should be equally good; if the originals turn out to be extremely better than yours, to a level where you would have to screen thousands of random samples, that is highly suspicious and suggests the originals were 'too good to be true'.
With ratings or likelihoods, one could try to assume a decreasing distribution like an exponential: most samples will be incoherent and totally unfunny, many will be slightly funny, a few will be funny, and a very few will be very funny.
The ratings on samples generated from our Markov chain will probably follow a smooth distribution.
However, if a human is authoring some in an attempt to spice things up, they will be above the average of the Markov chain (otherwise why bother with cheating?), and if there is a substantial number of them, this will create an anomaly in the ratings of the originals---a 'bump' indicating that the tweets are coming from two different populations.
In this case, it can be modeled as a [mixture model](!Wikipedia) with either _k_=1 or _k_=2, and the _p_-value or Bayesian posterior probability calculated for 1 vs 2.
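For instance, using the `mclust` package (one option among several; the ratings here are simulated stand-ins), one could let BIC choose between 1 and 2 clusters:

~~~{.R}
## Simulated stand-in ratings: 150 ordinary Markov tweets plus 40 suspiciously funny ones:
library(mclust)
ratings <- c(rnorm(150, mean=2, sd=0.7), rnorm(40, mean=4, sd=0.5))
fit <- Mclust(ratings, G=1:2)
summary(fit)   # BIC picks the number of clusters; a 2-cluster preference suggests two 'authors'
~~~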
# Optimal Existential Risk Reduction Investment {.collapse}
An existential risk is any risk which destroys or permanently cripples human civilization, such as an asteroid strike or pandemic.
Since humanity might otherwise continue for millions of years, creating untold trillions of humans and colonizing the galaxy, human extinction represents the loss of literally [astronomical amounts of utility](http://www.nickbostrom.com/astronomical/waste.html "'Astronomical Waste: The Opportunity Cost of Delayed Technological Development', Bostrom 2003").
The loss is greater than any disaster up to extinction levels, as humanity can always recover from lesser disasters; but there is no recovery from a total destruction.
Thus, the expected value of even a slight reduction in an exotic risk ought to itself be astronomical, or at least extremely large; under plausible values for well-characterized x-risks like asteroid strikes or nuclear war or pandemic, preventing them may be the charitable spending with the highest expected value and they should be receiving all charitable expenditures.
This strikes people as odd and dangerous reasoning.
Is it really true that we should be spending almost unlimited amounts of money on these things and not otherwise extremely compelling charities like distributing malaria nets in Africa to save millions of lives or vaccine distribution or funding research into ending aging?
And if we should, how do we choose what fraction to spend on global warming rather than artificial intelligence?
What if someone discovers an entirely new x-risk not previously considered, like nearby supernovas or vacuum collapses or nanotechnology 'grey goo'?
Thinking historically, it's clear in retrospect that someone concerned about x-risk would be better off not going after the terminal goal of x-risk reduction but instead spending their money on instrumental goals such as science/technology or economic growth.
Imagine someone in England in 1500 who reasons the same way about x-risk: humanity might be destroyed, so preventing that is the most important task possible.
He then spends the rest of his life researching the Devil and the Apocalypse.
Such research is, unfortunately, of no value whatsoever unless it produces arguments for atheism demonstrating that that entire line of enquiry is useless and should not be pursued further.
But as the Industrial and Scientific Revolutions were just beginning, with exponential increases in global wealth and science and technology and population, ultimately leading to vaccine technology, rockets and space programs, and enough wealth to fund all manner of investments in x-risk reduction, he could instead have made a perhaps small but real contribution by contributing to economic growth through work & investment or by making scientific discoveries.
For example, Isaac Newton's discoveries in astronomy and the laws of motion helped inaugurate threads of work that led directly to space satellites which can watch for asteroids with Earth-crossing orbits.
[Isaac Newton](/Newton) himself was concerned with x-risk, as he feared that the [Great Comet of 1680](!Wikipedia) would, centuries hence, plunge into the Sun and cause expansion destroying the Earth and humanity.
What could Newton have done to directly reduce this x-risk at the time?
Absolutely nothing.
There were no feasible counter-measures nor any foreseeable technologies which could forestall a comet or protect humanity from the Sun engulfing the Earth; there was not and still is not a mine or bomb shelter deep enough for that.
What he *could* have done is close to what he did do: make fundamental advances in science which posterity could build on and one day be rich and wise enough to do something about the x-risk.
As it happens, Newton was not quite right about the Great Comet (comets are not a meaningful fraction of the Sun's mass) but there was a similar x-risk he was unaware of: giant asteroid impacts.
And the solutions to a giant comet---observe all comets carefully to project their future orbits, destroy it, redirect its orbit, evacuate human colonists to unaffected planets (Newton suggested the satellites of the gas giants)---are much the same as for a giant asteroid impact, and all benefit from economic growth & greater science/technology (someone has to pay for, and design, those satellites and spacecraft, after all).
Economic wealth & science/technology are all-purpose goods: they are useful for compound growth, and can also be spent on x-risk reduction.
They are the ultimate instrumental goods.
If one is badly ignorant, or poor, or unable to meaningfully reduce an x-risk, one is better off accepting the x-risk and instead spending resources on fixing the former problems.
One would prefer to get rid of the x-risk as soon as possible, of course, but given one's starting position, there may simply be no better strategy and the risk must be accepted.
This raises the question: what is the optimal distribution of resources to economic growth vs x-risk reduction over time which maximizes expected utility?
Intuitively, we might expect something like early on investing nothing at all in x-risk reduction as there's not much money available to be spent, and money spent now costs a lot of money down the line in lost compound growth; and then as the economy reaches modern levels and the opportunity cost of x-risk becomes dire, money is increasingly diverted to x-risk reduction.
One might analogize it to insurance---poor people skimp on insurance because they need the money for other things which hopefully will pay off later like education or starting a business, while rich people want to buy lots of insurance because they already have enough and they fear the risks.
If this were an investment question, a good strategy would be something like the [Kelly criterion](!Wikipedia) or [probability matching](!Wikipedia) strategies like [Thompson sampling](!Wikipedia): even if the expected value of x-risk reduction is higher than other investments, it only pays off very rarely and so receives a very small fraction of one's investments.
However, it's not clear that the Kelly criterion or Thompson sampling are optimal or even relevant here: while Kelly avoids bankruptcy in the form of [gambler's ruin](!Wikipedia), it does so only by making arbitrarily small bets & refusing to ever risk one's entire wealth; with x-risks, the 'bankruptcy' (extinction) can't be avoided so easily, as the risk is there whether you like it or not, and one cannot reduce it to 0. (This comes up often in discussion of why the Kelly criterion is relevant to decision-making under risk; see also [Peters 2011](http://rsta.royalsocietypublishing.org/content/369/1956/4913 "The time resolution of the St Petersburg paradox") and the niche area of "evolutionary finance" like [Evstigneev et al 2008](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1155014 "Evolutionary finance")/[Lensberg & Schenk-Hoppé 2006](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=951242 "On the Evolution of Investment Strategies and the Kelly Rule---A Darwinian Approach") which draws connections between the Kelly criterion, probability matching, long-term survival & evolutionary fitness.)
In economics, similar questions are often dealt with in terms of the [life-cycle hypothesis](!Wikipedia) in which economic agents strive to maximize their utility over a career/lifetime while [avoiding inefficient intertemporal allocation of wealth](!Wikipedia "Consumption smoothing") (as Mark Twain put it, "when in youth a dollar would bring a hundred pleasures, you can't have it. When you are old, you get it & there is nothing worth buying with it then. It's an epitome of life. The first half of it consists of the capacity to enjoy without the chance; the last half consists of the chance without the capacity."); in the life-cycle, one tries to build wealth as quickly as possible while young, even going into debt for investments like a college education, then begins saving up, consuming some, until retirement, at which point one consumes it all until one dies.
But as far as I have seen, life-cycle models tend not to include any mechanism for spending in order to reduce mortality/aging, and simply accept the risk of death as a given.
We could create a simple Markov decision process model.
An agent (humanity), each time period (year), has a certain amount of wealth and an x-risk probability _P_.
In this period, it can allocate that wealth between economic growth (in which case it receives the investment back plus a return) and x-risk reduction (buying a permanent percentage reduction in the x-risk for a fixed sum).
For the reward, the x-risk is binary sampled with probability _P_; if the sample is true, then the reward is 0 and the decision process terminates, else the reward is the wealth and the process continues.
Let's imagine that this process can run up to 10,000 time periods, with a starting wealth of \$248 billion (Angus Deaton's estimate of PPP world GDP in 1500, per [List of regions by past GDP (PPP)](https://en.wikipedia.org/wiki/List_of_regions_by_past_GDP_%28PPP%29)), an economic growth rate of 2% (the long-run real growth rate of the global economy), an existential risk probability of 0.1% per year (arbitrarily chosen), and the ability to buy a 1% reduction in the x-risk for a billion dollars. (We'll work in units of trillions to help numeric stability.)
What strategy maximizes the cumulative rewards?
A few simple ones come to mind:
1. the agent could simply ignore the x-risk and reinvest all wealth, which, to a first approximation, is the strategy which has been followed throughout human history and is primarily followed now (lumping together NASA's Spaceguard program, biowarfare & pandemic research, AI risk research etc, total spending probably doesn't come to more than \$1-2b a year as of 2016). This maximizes the economic growth rate but may backfire, as the x-risk never gets reduced.
2. the agent could spend the full gain in its wealth from economic growth (2%) on x-risk reduction. The wealth doesn't grow and the returns from x-risk reduction do diminish, but the x-risk is at least reduced greatly over time.
3. the agent could implement a sort of probability matching: it spends on x-risk reduction a fraction of its wealth equal to the current _P_. This reduces how much is spent on extremely small x-risk reductions, but it might be suboptimal because it'll pay the largest fraction of its economy in the first time period, then second-largest in the second time period and so on, losing out on the potential compounding.
4. a more complicated hybrid strategy might work: the agent maximizes wealth like #1 for the first _n_ time periods (eg _n_=516), and then switches to #2 for the remaining time periods.
5. like #4, but switching from #1 to #3 for the remaining time periods.
~~~{.R}
constantInvestmentAgent <- function (t, w, xrp) { return(c(w, 0)) }
constantReductionAgent <- function (t, w, xrp) { drawdown <- 0.9803921573; return(c(drawdown*w, (1-drawdown)*w)) } # drawdown = 1/1.02: invest just enough that 2% growth keeps wealth constant, spending the rest on x-risk
probabilityMatchAgent <- function (t, w, xrp) { return(c(w*(1-xrp), w*xrp)) }
## strategies #4 & #5: invest everything for the first n periods, then switch to constant reduction or probability matching
investThenReduceAgent <- function (t, w, xrp, n=516) { if (t < n) { constantInvestmentAgent(t, w, xrp) } else { constantReductionAgent(t, w, xrp) } }
investThenMatchAgent  <- function (t, w, xrp, n=516) { if (t < n) { constantInvestmentAgent(t, w, xrp) } else { probabilityMatchAgent(t, w, xrp) } }
~~~
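The decision process itself can be simulated directly from the description above. The following is a minimal sketch rather than the original analysis code: the function name `simulateAgent`, the multiplicative form of the risk-reduction purchase, and the allowance of fractional purchases are my assumptions; the parameters follow the text (wealth in trillions, 2% growth, 0.1% annual x-risk, a 1% permanent reduction per \$0.001t spent):

~~~{.R}
## Hypothetical simulator (a sketch under the stated assumptions, not the original code):
simulateAgent <- function(agent, periods=10000, w=0.248, xrp=0.001, growth=0.02,
                          costPerCut=0.001, cut=0.01) {
    reward <- 0
    for (t in 1:periods) {
        if (rbinom(1, 1, xrp) == 1) { return(reward) } # extinction: process terminates, no further reward
        a <- agent(t, w, xrp)                          # a[1] = amount invested, a[2] = spent on x-risk
        xrp <- xrp * (1 - cut)^(a[2] / costPerCut)     # permanent (multiplicative) risk reduction
        w <- a[1] * (1 + growth)                       # invested wealth compounds
        reward <- reward + w                           # per-period reward = current wealth
    }
    return(reward)
}
set.seed(2016)
agents <- list(constantInvestmentAgent, constantReductionAgent, probabilityMatchAgent,
               investThenReduceAgent, investThenMatchAgent)
## mean cumulative reward over simulated histories for each strategy:
sapply(agents, function(a) { mean(replicate(100, simulateAgent(a))) })
~~~

The reward definition follows the text (the sum of per-period wealth while civilization survives), so other formalizations such as discounting, log-wealth, or a terminal bonus for surviving all 10,000 periods would rank these strategies differently.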
# Could Deer Evolve To Avoid Cars? {.collapse}
> While driving over the years, I've noticed many deer corpses. Cars seem like they could be a major source of deer mortality. If they are, deer might be evolving behavior to avoid cars. But deer/car accident rates appear stable or increasing (perhaps due to human population growth & construction). How fast would we expect to see any deer adaptation?
>
> Looking at some of the mortality statistics, I model it as a liability threshold trait being selected on via truncation selection, and calculate some hypotheticals about whether and how fast they could adapt.
>
> Teal deer: of course, but it'd be slow.

While driving to NYC recently I passed 3 [roadkill](!Wikipedia "Deer-vehicle collisions") [deer](!Wikipedia "White-tailed deer"), a few of many I have seen over the years, and a thought re-occurred to me: "if all these deer are being killed by cars, shouldn't they be evolving to avoid cars?"
I've seen many dead deer and narrowly avoided a few myself while driving, and deer/car mortality is, if anything, much higher in states like Pennsylvania.
Accident rates would not necessarily show a steep decline thanks to past selection, because the 'environment' is not static here: as cars get faster, accidents become more crippling or lethal to deer; the American population has expanded several-fold in population count, per-capita vehicle-miles, suburban living, & territory fragmentation; and the deer population too has expanded many-fold (from ~0.5m a century ago to nearly 30m now).
So even if there were highly effective ongoing selection reducing deer accident risk, we could still observe large absolute and proportional increases in accidents/deaths.
But I am still curious as to what sort of selection we could expect, which is a hint as to long-term trends---the American population is now relatively stabilized in terms of growth and vehicle-miles, and deer appear to have also reached a population equilibrium, so a gradual long-term decline in accident rates might be possible to see in the future if there is substantial response to selection.
Deer accidents seem to be fairly fatal: wild animals are always on the edge, and small injuries can compound into death, so looking at mortality will if anything underestimate the strength of selection.
And of course there will be no single Mendelian gene for car avoidance; being a complex behavioral trait, it is almost surely highly polygenic & additive.
So we can treat it as [truncation selection](!Wikipedia) on a binary trait ("killed by a car") in the [liability threshold model](!Wikipedia).
For truncation selection, the two key parameters are the heritability of a 'trait', and the fraction of the population expressing the 'trait'.
The heritability can only be guessed at.
Car accidents ought to be heritable to *some* degree, because everything is, and many behavioral traits like risk aversion or diurnality or reaction-time or startle reflex or wanderlust would affect being hit by a car (a deer can reduce risk by avoiding roads entirely, not traveling far, waiting to cross until very late at night when there is no traffic or during the day when they can be easily seen).
Response to selection need not cause some hardwired behavioral change like aversion to travel: it might yield something like the [Baldwin effect](!Wikipedia), where the response is for behavioral flexibility, and more fit deer are better able to learn how to navigate traffic gaps by watching other deer or imitating their mother.
The "anthropocene" has led to many animals evolving or otherwise learning how to adapt, with urban environments no exception[^urban-change], so why would deer be any exception?
[^urban-change]: Some relevant links:
- ["Global urban signatures of phenotypic change in animal and plant populations"](http://www.pnas.org/content/early/2017/01/01/1606034114.abstract), Alberti et al 2017
- ["Signatures of positive selection and local adaptation to urbanization in white-footed mice (_Peromyscus leucopus_)"](https://www.biorxiv.org/content/early/2017/09/26/038141), Harris & Munshi-South 2017 (city mice adapting to eat human food)
- ["Contrasting the effects of natural selection, genetic drift and gene flow on urban evolution in white clover (_Trifolium repens_)"](http://rspb.royalsocietypublishing.org/content/285/1883/20181019), Johnson et al 2018
- ["What Makes a City Ant? Maybe Just 100 Years of Evolution"](https://www.nytimes.com/2017/04/03/science/acorn-ants-evolution-cleveland.html)
- ["Urban Evolution: How Species Adapt, or Don't, to City Living; A Conversation With Jonathan B. Losos"](https://www.edge.org/conversation/jonathan_b_losos-urban-evolution)
- [Moscow subway dogs](!Wikipedia "Street dogs in Moscow#Subway dwelling dogs") (not necessarily genetic except perhaps in a Baldwin effect way, but a demonstration of behavioral flexibility/learning)
- Ridley: ["Cities Are the New Galapagos"](http://www.rationaloptimist.com/blog/the-wealth-of-urban-biodiversity/)/["The Sixth Genesis: a man-made, mass-speciation event"](http://www.rationaloptimist.com/blog/mass-speciation/)
- [_Darwin Comes to Town: How the Urban Jungle Drives Evolution_](https://www.amazon.com/gp/product/1250127823/), Schilthuizen 2018
Complicating things is the possibility that the heritability is high but actual responses to selection are lower than expected when estimated in a univariate single-trait fashion, because there might be a [genetic correlation](!Wikipedia) with another fitness-influencing trait, where better car avoidance means worse values of that other trait---perhaps it would be easy to avoid cars by avoiding roads or traveling far, but this has the drawback of preventing relocation to avoid starvation or hunters, in which case response to selection will be small or even opposite of expected (this is plausibly one of the main reasons why wild populations may not evolve as fast as predicted: ["The Missing Response to Selection in the Wild"](http://www.cell.com/trends/ecology-evolution/abstract/S0169-5347\(18\)30039-9), Pujol et al 2018).
I can only guess at the heritability, as I doubt heritabilities have been calculated for much in deer, but I would predict it's <50% simply because wild animals are under constant selection and car-based selection would've started almost a century ago.
It might seem impossible to calculate heritability for wild animals of a 'trait' like being hit by a car, but I think it's doable. Wild animals have no twin studies or family pedigrees, of course, and methods like common-garden rearing likewise seem questionable, but one could use tracking devices to follow families of deer until they all die, constructing a pedigree annotated with cause of death; more feasibly, one could collect DNA samples from dead car-accident deer and dead non-accident deer, and compute genomic similarity with a procedure like [GCTA](!Wikipedia) (for SNP heritability) or whole-genome sequencing data (upcoming methods for recovering full heritability).
But this hasn't been done as far as I know. Oh well. We can look at a range of heritabilities 0-50%.
The fraction of deer hit is a little easier. Wild deer live ~[3-4y on average](https://www.adirondackalmanack.com/2016/05/understanding-life-spans-whitetail-deer.html "'Understanding The Life Span Of Whitetail Deer', Hetzler 2016") (eg [Lopez et al 2004](https://www.jstor.org/stable/pdf/3803390.pdf "Population Density of the Endangered Florida Key Deer") on Key deer implies ~3.7y), sexually maturing at ~1.5y, and of the ~25m deer in the USA, around 1.5m are killed by cars annually as of ~2012 (according to [_The Atlantic_ with no citation](https://www.theatlantic.com/magazine/archive/2012/11/the-deer-paradox/309104/ "The Deer Paradox: It's never been easier to shoot a buck. So why are hunters spending billions on high-tech gear?")) so perhaps ~5% annual mortality from cars ([McShea et al 2008](https://www.jstor.org/stable/pdf/24875111.pdf "'Factors affecting autumn deer/vehicle collisions in a rural Virginia county', McShea et al 2008") estimates 2% in a sampled Virginia county, while [McShea 2012](https://repository.si.edu/bitstream/handle/10088/18355/nzp_changing_world_mcshea.pdf "Ecology and management of white-tailed deer in a changing world") suggests 1.2m deaths out of 25m deer); if deer live 4 years and have a 5% annual risk of being killed by a car, then their lifetime risk should be `1 - 0.95^4 = 18%`, or, using the lower ~2% annual estimate, `1 - 0.98^4 = 8%`: a reasonable range---a substantial source of mortality but probably less than hunting or starvation or disease.
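That back-of-the-envelope range can be checked directly over the plausible parameter values (assuming independent annual risks):

~~~{.R}
## lifetime probability of being killed by a car, over the annual-risk & lifespan ranges above:
annualRisk <- c(0.02, 0.05); lifespanYears <- c(3, 4)
outer(annualRisk, lifespanYears, function(p, t) { 1 - (1-p)^t })
## ~6-8% at 2%/year, ~14-19% at 5%/year
~~~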
The effect of a generation of truncation selection on a binary trait following the liability-threshold model is more complicated but follows a similar spirit.
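Concretely (my notation): writing $p_0$ for the fraction of deer *not* hit by cars, $h^2$ for the liability-scale heritability, and $\phi$/$\Phi$ for the standard normal density/CDF, one generation of culling everyone past the threshold gives

$$T_0 = \Phi^{-1}(p_0), \qquad \bar{\imath} = \frac{\phi(T_0)}{p_0}, \qquad T_1 = T_0 + h^2 \, \bar{\imath}, \qquad p_1 = \Phi(T_1),$$

so the fraction killed per generation falls from $1 - p_0$ to $1 - p_1$; this is what the R function below computes.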
R implementation of pg6 of ["Chapter 14: Short-term Changes in the Mean: 2. Truncation and Threshold Selection"](/docs/genetics/selection/2013-walsh-book2-ch14-draft.pdf), Lynch & Walsh:
~~~{.R}
threshold_select <- function(fraction_0, heritability, verbose=FALSE) {
    ## liability threshold for *not* being hit by a car:
    fraction_probit_0 = qnorm(fraction_0)
    ## selection intensity from culling everyone past the threshold:
    s_0 = dnorm(fraction_probit_0) / fraction_0
    ## new rate of trait after one selection where 100% of past-the-threshold never reproduce:
    fraction_probit_1 = fraction_probit_0 + heritability * s_0
    fraction_1 = pnorm(fraction_probit_1)
    ## how much did we reduce trait in percentage terms?
    if (verbose) {
        print(paste0("Start: population fraction: ", fraction_0, "; liability threshold: ", fraction_probit_0, "; Selection intensity: ", s_0))
        print(paste0("End: liability threshold: ", fraction_probit_1, "; population fraction: ", fraction_1, "; Total population reduction: ",
            fraction_0 - fraction_1, "; Percentage reduction: ", (1-((1-fraction_1) / (1-fraction_0)))*100)) }
    return(c(fraction_probit_1, fraction_1, fraction_0 - fraction_1))
}
threshold_select(1-0.18, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.82; liability threshold: 0.915365087842814; Selection intensity: 0.320000021339773"
# [1] "End: liability threshold: 1.0753650985127; population fraction: 0.858894349391959; Total population reduction: -0.0388943493919587; Percentage reduction: 21.6079718844215"
# [1] 1.07536509851 0.85889434939 -0.03889434939
threshold_select(1-0.08, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.92; liability threshold: 1.40507156030963; Selection intensity: 0.161593724197447"
# [1] "End: liability threshold: 1.48586842240836; population fraction: 0.931343036233605; Total population reduction: -0.0113430362336053; Percentage reduction: 14.1787952920067"
# [1] 1.48586842241 0.93134303623 -0.01134303623
~~~
We can look at a range of scenarios for population prevalences 8-18%, and heritabilities 5%-50%.
Aside from the per-generation increase in car-avoiding deer/decrease in deer-car accidents, it might also be interesting to calculate a hypothetical like "how many generations/years would it take for natural selection, in a static environment, to reduce lifetime mortality to 1%?"
~~~{.R}
thresholdModeling <- function(fraction_0, heritability, targetPercentage) {
    firstGeneration <- threshold_select(fraction_0, heritability)
    ## estimate how many generations of truncation selection until a target percentage is reached:
    i <- 1; fraction_i <- firstGeneration[2]
    while (fraction_i < targetPercentage) {
        i <- i+1
        nextGeneration <- threshold_select(fraction_i, heritability)
        fraction_i <- nextGeneration[2]
    }
    return(c(firstGeneration, i)) }
thresholdModeling(1-0.20, 0.5, 1-0.01)
df <- expand.grid(Fraction=(1-seq(0.08, 0.18, by=0.02)), Heritability=seq(0.05, 0.50, by=0.05))
df <- cbind(df, round(digits=3, do.call(rbind, Map(thresholdModeling, df$Fraction, df$Heritability, 0.99))))
colnames(df)[3:6] <- c("Threshold.latent", "Fraction.new", "Fraction.reduction", "Generations.to.onepercent")
df$Years <- round(df$Generations.to.onepercent * 3.5) # convert generations to years, assuming a ~3.5-year deer generation time
df
~~~
| Starting fraction | Heritability | New latent threshold | New fraction | Fraction reduction | Generations to 1% | Years |
|-------------------|--------------|----------------------|--------------|--------------------|-------------------|-------|
| 0.92 | 0.05 | 1.413 | 0.921 | -0.001 | 300 | 1050 |
| 0.90 | 0.05 | 1.291 | 0.902 | -0.002 | 314 | 1099 |
| 0.88 | 0.05 | 1.186 | 0.882 | -0.002 | 324 | 1134 |
| 0.86 | 0.05 | 1.093 | 0.863 | -0.003 | 332 | 1162 |
| 0.84 | 0.05 | 1.009 | 0.843 | -0.003 | 338 | 1183 |
| 0.82 | 0.05 | 0.931 | 0.824 | -0.004 | 343 | 1200 |
| 0.92 | 0.10 | 1.421 | 0.922 | -0.002 | 150 | 525 |
| 0.90 | 0.10 | 1.301 | 0.903 | -0.003 | 157 | 550 |
| 0.88 | 0.10 | 1.198 | 0.884 | -0.004 | 162 | 567 |
| 0.86 | 0.10 | 1.106 | 0.866 | -0.006 | 166 | 581 |
| 0.84 | 0.10 | 1.023 | 0.847 | -0.007 | 169 | 592 |
| 0.82 | 0.10 | 0.947 | 0.828 | -0.008 | 171 | 598 |
| 0.92 | 0.15 | 1.429 | 0.924 | -0.004 | 100 | 350 |
| 0.90 | 0.15 | 1.311 | 0.905 | -0.005 | 104 | 364 |
| 0.88 | 0.15 | 1.209 | 0.887 | -0.007 | 108 | 378 |
| 0.86 | 0.15 | 1.119 | 0.868 | -0.008 | 110 | 385 |
| 0.84 | 0.15 | 1.038 | 0.850 | -0.010 | 112 | 392 |
| 0.82 | 0.15 | 0.963 | 0.832 | -0.012 | 114 | 399 |
| 0.92 | 0.20 | 1.437 | 0.925 | -0.005 | 75 | 262 |
| 0.90 | 0.20 | 1.321 | 0.907 | -0.007 | 78 | 273 |
| 0.88 | 0.20 | 1.220 | 0.889 | -0.009 | 81 | 284 |
| 0.86 | 0.20 | 1.132 | 0.871 | -0.011 | 82 | 287 |
| 0.84 | 0.20 | 1.052 | 0.854 | -0.014 | 84 | 294 |
| 0.82 | 0.20 | 0.979 | 0.836 | -0.016 | 85 | 298 |
| 0.92 | 0.25 | 1.445 | 0.926 | -0.006 | 60 | 210 |
| 0.90 | 0.25 | 1.330 | 0.908 | -0.008 | 62 | 217 |
| 0.88 | 0.25 | 1.232 | 0.891 | -0.011 | 64 | 224 |
| 0.86 | 0.25 | 1.145 | 0.874 | -0.014 | 66 | 231 |
| 0.84 | 0.25 | 1.067 | 0.857 | -0.017 | 67 | 234 |
| 0.82 | 0.25 | 0.995 | 0.840 | -0.020 | 68 | 238 |
| 0.92 | 0.30 | 1.454 | 0.927 | -0.007 | 50 | 175 |
| 0.90 | 0.30 | 1.340 | 0.910 | -0.010 | 52 | 182 |
| 0.88 | 0.30 | 1.243 | 0.893 | -0.013 | 54 | 189 |
| 0.86 | 0.30 | 1.158 | 0.877 | -0.017 | 55 | 192 |
| 0.84 | 0.30 | 1.081 | 0.860 | -0.020 | 56 | 196 |
| 0.82 | 0.30 | 1.011 | 0.844 | -0.024 | 57 | 200 |
| 0.92 | 0.35 | 1.462 | 0.928 | -0.008 | 43 | 150 |
| 0.90 | 0.35 | 1.350 | 0.911 | -0.011 | 44 | 154 |
| 0.88 | 0.35 | 1.255 | 0.895 | -0.015 | 46 | 161 |
| 0.86 | 0.35 | 1.171 | 0.879 | -0.019 | 47 | 164 |
| 0.84 | 0.35 | 1.096 | 0.863 | -0.023 | 48 | 168 |
| 0.82 | 0.35 | 1.027 | 0.848 | -0.028 | 48 | 168 |
| 0.92 | 0.40 | 1.470 | 0.929 | -0.009 | 37 | 130 |
| 0.90 | 0.40 | 1.360 | 0.913 | -0.013 | 39 | 136 |
| 0.88 | 0.40 | 1.266 | 0.897 | -0.017 | 40 | 140 |
| 0.86 | 0.40 | 1.184 | 0.882 | -0.022 | 41 | 144 |
| 0.84 | 0.40 | 1.110 | 0.867 | -0.027 | 42 | 147 |
| 0.82 | 0.40 | 1.043 | 0.852 | -0.032 | 42 | 147 |
| 0.92 | 0.45 | 1.478 | 0.930 | -0.010 | 33 | 116 |
| 0.90 | 0.45 | 1.369 | 0.915 | -0.015 | 34 | 119 |
| 0.88 | 0.45 | 1.277 | 0.899 | -0.019 | 35 | 122 |
| 0.86 | 0.45 | 1.197 | 0.884 | -0.024 | 36 | 126 |
| 0.84 | 0.45 | 1.125 | 0.870 | -0.030 | 37 | 130 |
| 0.82 | 0.45 | 1.059 | 0.855 | -0.035 | 37 | 130 |
| 0.92 | 0.50 | 1.486 | 0.931 | -0.011 | 30 | 105 |
| 0.90 | 0.50 | 1.379 | 0.916 | -0.016 | 31 | 108 |
| 0.88 | 0.50 | 1.289 | 0.901 | -0.021 | 32 | 112 |
| 0.86 | 0.50 | 1.210 | 0.887 | -0.027 | 33 | 116 |
| 0.84 | 0.50 | 1.139 | 0.873 | -0.033 | 33 | 116 |
| 0.82 | 0.50 | 1.075 | 0.859 | -0.039 | 34 | 119 |
Table: Truncation selection results for various scenarios of deer-car lifetime mortality rates & heritabilities.
As expected, scenarios where deer accidents are rarer reach the 1% target faster (less work to do), and higher heritabilities produce faster responses (more of the variation is available for selection to act on).
In most of the scenarios, the per-generation reduction is small, often less than a percentage point, so it would be easy to miss, and the nearly complete elimination of deer-car accidents would take centuries of adaptation even with a fixed environment.
So while deer might be evolving to reduce accident mortality, the change would be hard to see and won't help much anytime soon; if deer-car collisions are a problem, we had better hurry up with safety measures like road fencing or self-driving cars.
# Evolution as Backstop for Reinforcement Learning
> One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase's paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems *are* solved. Expanding on some Cosma Shalizi comments, I suggest a group selection perspective: one reason for free market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, and reinforcement learning & meta-learning in AI. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages of a slow unbiased method versus faster but potentially arbitrarily biased methods.
*Split out to ["Evolution as Backstop for Reinforcement Learning"](/Backstop).*
# Acne: a good Quantified Self topic {.collapse}
> I suggest that teenagers interested in experimentation, statistics, or Quantified Self experiment with interventions to reduce acne, listing the advantages of the topic. To suggest specific interventions, I re-analyze & rank the April 2016 CureTogether crowdsourced ratings of ~113 acne interventions; dietary interventions rate particularly highly after standard retinoids (like Accutane) & benzoyl peroxide and might be worth closer investigation.

[Quantified Self](!Wikipedia) [acne](!Wikipedia) self-experiments are under-done and potentially valuable.
I kind of regret getting into QS and statistics only long after my acne problems largely subsided. The [conventional wisdom on acne](https://www.nytimes.com/2019/01/07/well/live/managing-teenage-acne.html) seems poorly supported and neglectful of individual differences, and I've long suspected that some rigorous testing might turn up some interesting results which could be useful to Western teenagers. (My acne wasn't *so* bad, in retrospect, but it still frustrated me a great deal; in extreme cases, it can contribute to suicide.)
Looking back, I can see how easy it would be to test the various theories. For example, facial washes & light boxes could be tested efficiently by blocking---randomizing which half of your face they are applied to.
Diet too would be useful to do as a factorial experiment---for 3 or 4 weeks, cut out all the carbs while loading up on dairy products, then vice versa.
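For instance, a split-face blocked design can be power-checked before starting; the numbers here (a true difference of ~1 pimple/day between face halves, day-to-day SD of ~2) are guesses for illustration, not data:

~~~{.R}
## Within-face (paired) comparison over a 4-week block: each day, count pimples on the
## treated vs untreated half of the face, and test the daily differences.
power.t.test(n=28, delta=1, sd=2, sig.level=0.05, type="paired")
~~~

The blocking is what buys the power: pairing the two halves of the same face on the same day cancels out whatever (diet, stress, sleep) affects both halves at once.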
As a topic for self-experimentation, acne has many advantages:
- many highly-motivated subjects with ample free time
- objective visible effects where the data can even be easily recorded for blind ratings by third parties to reduce bias (ie facial photographs)
- factorial experiments or blocking within-subject is possible for some acne interventions, which would considerably increase power
- in severe acne cases, changes would be easy to see within days/weeks, and large effects are plausible
- a large number of already-known possible interventions, none of which are particularly strongly supported (so high [VoI](!Wikipedia "Value of Information")); a good starting point would be the [CureTogether crowdsourced rankings](https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/) (see [next section for specifics](#curetogether-acne-interventions)).
- most of the interventions are safe & easy to implement: it's not hard to use a face wash twice a day, apply some [benzoyl peroxide](!Wikipedia) cream, cut out all dairy products, etc
- new commercial products allow for interesting hobbyist projects---what would microbiome sampling show about the microbial communities of teens with high acne vs low acne? About their microbial communities *before* the high acne group developed acne? After treatment with benzoyl peroxide or antibiotics or retinol? etc. With a large enough community dataset, interesting techniques could be applied for measuring heterogeneity in individual response to treatments (eg GCTA-style variance components on acne response could be enlightening, as genes would be expected to be the source of much of the individual differences, acne itself being heritable)
- a launching pad into many areas of science & statistics---relevant topics include human & microbial genetics, methodological biases, regression to the mean, optimal experimental design, sequential testing, meta-analysis, time-series, deep learning... ("When making an axe handle with an axe, the model is near at hand.")
One could spin off lots of projects to increase rigor---instead of counting by hand, you could take smartphone photos and feed them into a pimple-counting CNN. With all the frameworks like PyTorch now, that's well within the ability of a bright computer-inclined high schooler. (And then put it up as a web service so other teens can run their own self-experiments & score their pimple photographs, why not...)
Indeed, [Seth Roberts](!Wikipedia), one of the early Quantified Self proponents, credits his interest in self-experiments to acne ([Roberts 2001](https://escholarship.org/uc/item/5bv8c7p3 "Surprises from self-experimentation: Sleep, mood, and weight")/[Roberts 2010](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2964443/ "The Unreasonable Effectiveness of My Self-Experimentation")), when he found [tetracycline](!Wikipedia) didn't help his acne but benzoyl peroxide did:
> My interest in self-experimentation began when I read [an article about teaching mathematics by Paul Halmos](https://www.sethroberts.net/wp-content/uploads/2013/01/halmos.pdf "'The Problem of Learning to Teach', Halmos 1975"), a professor at Indiana University. Halmos emphasized that "the best way to learn is to do." I was trying to learn how to do experiments; I took this advice to mean I should do as many as possible. I could do more experiments, I realized, if I not only did rat experiments but also did experiments with myself as the subject. So I started doing small self-experiments. Most of them were trivial and led nowhere (e.g., experiments about juggling). At the time I had acne. My dermatologist had prescribed both pills (tetracycline, a wide-spectrum antibiotic) and a cream (active ingredient benzoyl peroxide). Simply for the sake of doing experiments, any experiments, I did simple tests to measure the effectiveness of these treatments. I believed the pills were powerful and the cream had little effect. To my great surprise, the tests showed the opposite: The cream was powerful and the pills had little effect. It was very useful information. Many years later, an article in the British Journal of Dermatology reported that antibiotic-resistant acne is common.
Acne came up several times on Roberts's blog:
- Roberts has suggested that teens could run acne self-experiments as a group project: ["Acne Clubs"](https://sethroberts.net/2013/05/01/the-acne-club/ "Acne Club: A New Way to Fight Acne")
- A possible general cause: [glycemic index/insulin signaling](https://sethroberts.net/2013/02/24/progress-in-reducing-acne/)
- Anecdotes:
- [oolong/black tea allergy as acne trigger](https://sethroberts.net/2014/02/10/acne-sometimes-due-to-allergy/)
- pasteurized dairy trigger: [1](https://sethroberts.net/2012/10/30/acne-caused-by-pasteurized-dairy-how-one-person-figured-it-out/), [2](https://boingboing.net/2012/08/09/make-yourself-healthy-searchi.html "Make yourself healthy: Searching for the cause of acne --Martha Rotter")
- sugar/carb triggers: [1](https://sethroberts.net/2008/10/03/diet-and-acne-continued/), [2](https://sethroberts.net/2008/04/26/more-about-acne-continued/)
- soap trigger: [1](https://sethroberts.net/2008/05/05/more-acne-self-experimentation/); lack of soap trigger: [2](https://sethroberts.net/2008/04/23/more-about-acne/)
- [acne reduction due to milk thistle](https://sethroberts.net/2014/02/02/adult-acne-water-and-milk-thistle/)
## CureTogether acne interventions
CureTogether (2008-2012) was a social network/crowdsourced website for aggregating & ranking treatments of various problems, typically diseases, and had an acne page.
They were bought by [23andMe](!Wikipedia) [in 2012](https://blog.23andme.com/news/curetogether-with-23andme/), and [operated for a while](https://blog.23andme.com/tag/curetogether/) until the website vanished around mid-2016; the data appears to no longer be available anywhere.
The last Internet Archive snapshot of their [acne page](https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/) is 2016-04-01.
Their presentation of uncertainty & average ranking of the interventions is useless; but they do provide the totals for each of the rating levels, so I decided to fix it by extracting & re-analyzing the ratings in a multi-level Bayesian ordinal regression with a weakly informative prior to get more useful posterior estimates of ratings.
(While I was at it, I fixed some spelling errors and merged a few interventions which were the same, like "Accutane" and "Isotretinoin (Roaccutane, Accutane)".)
~~~{.R}
## https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/
## "much worse / slightly worse / no effect / moderate improvement / major improvement" ~> 1-5
acne <- read.csv("https://www.gwern.net/docs/biology/2016-04-01-curetogether-acne.csv",
header=TRUE, colClasses=c("integer", "integer", "factor"))
library(skimr)
skim(acne)
# Skim summary statistics
# n obs: 439
# n variables: 3
#
# ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n n_unique top_counts ordered
# Intervention 0 439 439 113 Acn: 5, Agi: 5, Amo: 5, Ant: 5 FALSE
#
# ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# Effect 0 439 439 3.16 1.33 1 2 3 4 5 ▅▆▁▇▁▇▁▆
# N 0 439 439 18.38 34.74 1 2 6 19 324 ▇▁▁▁▁▁▁▁
library(brms)
b <- brm(Effect | weights(N) ~ (1|Intervention), prior=c(prior(student_t(3, 0, 1), "sd")), family=cumulative(),
iter=5000, chains=30, cores=30, data=acne)
b
# Group-Level Effects:
# ~Intervention (Number of levels: 113)
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sd(Intercept) 0.64 0.06 0.53 0.76 16503 1.00
#
# Population-Level Effects:
# Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept[1] -3.44 0.09 -3.62 -3.26 20848 1.00
# Intercept[2] -2.17 0.08 -2.32 -2.02 17179 1.00
# Intercept[3] 0.49 0.07 0.35 0.64 15767 1.00
# Intercept[4] 2.75 0.08 2.58 2.91 19558 1.00
red <- as.data.frame(ranef(b)$Intervention)
round(red[order(red$Estimate.Intercept, decreasing=TRUE),], digits=2)
## the ordinal model forest plot doesn't make much intuitive sense, so redo as a normal-distribution
## for easier forest plotting (the rankings are more or less identical, perhaps a little less precise):
bg <- brm(Effect | weights(N) ~ (1|Intervention), prior=c(prior(student_t(3, 0, 1), "sd")), family=gaussian(),
iter=5000, chains=30, cores=30, data=acne)
## Extract & sort estimates:
coefs <- as.data.frame(coef(bg))
coefs[order(coefs$Intervention.Estimate.Intercept, decreasing=TRUE),]
## Plot estimates:
library(brmstools)
forest(bg)
~~~
![Forest plot: posterior rankings of acne interventions by CureTogether users as of April 2016 (computed using a Gaussian Bayesian multilevel model).](/images/qs-acne-curetogether-gaussian-forest.png)
| Intervention | Estimate | SE | 2.5% | 97.5% |
|---------------------------------------------------------|----------|------|------|-------|
| Isotretinoin (Roaccutane, Accutane) | 4.04 | 0.06 | 3.91 | 4.16 |
| Paleo Diet | 3.75 | 0.08 | 3.58 | 3.92 |
| No gluten | 3.67 | 0.07 | 3.51 | 3.82 |
| Azelaic acid (Azelex) | 3.65 | 0.16 | 3.32 | 3.99 |
| Bactrim (Trimethoprim/sulfamethoxazole) | 3.62 | 0.12 | 3.36 | 3.87 |
| Ketogenic Diet | 3.59 | 0.18 | 3.23 | 3.95 |
| Clindamyacin phosphate gel | 3.58 | 0.13 | 3.31 | 3.85 |
| Glycolic acid | 3.56 | 0.21 | 3.14 | 3.99 |
| Ziana (Clindamycin/tretinoin) | 3.56 | 0.21 | 3.14 | 3.99 |
| Amoxicillin | 3.56 | 0.14 | 3.27 | 3.85 |
| No sugar | 3.55 | 0.07 | 3.41 | 3.70 |
| No dairy | 3.52 | 0.06 | 3.40 | 3.64 |
| Diane-35 (cyproterone/ethinyl estradiol) | 3.52 | 0.12 | 3.27 | 3.77 |
| Epiduo Gel | 3.51 | 0.15 | 3.20 | 3.82 |
| having a good skincare routine | 3.49 | 0.17 | 3.16 | 3.83 |
| eliminating comedogenic ingredients... | 3.49 | 0.21 | 3.07 | 3.93 |
| Sit out in the sun...at least 15 minutes | 3.49 | 0.05 | 3.38 | 3.60 |
| Avoid touching face | 3.49 | 0.03 | 3.41 | 3.57 |
| Benzoyl peroxide 10% | 3.49 | 0.09 | 3.30 | 3.67 |
| Ole Henriksen products | 3.47 | 0.18 | 3.10 | 3.84 |
| No chocolate | 3.45 | 0.22 | 3.02 | 3.89 |
| Doryx (Doxycycline) | 3.45 | 0.07 | 3.30 | 3.59 |
| foot bath with baking soda | 3.45 | 0.24 | 2.98 | 3.93 |
| Bikram Yoga | 3.45 | 0.24 | 2.97 | 3.93 |
| Tretinoin | 3.45 | 0.24 | 2.97 | 3.93 |
| Retin A | 3.45 | 0.05 | 3.33 | 3.56 |
| Birth control pill / Balance hormones | 3.44 | 0.05 | 3.33 | 3.55 |
| Zinc soap | 3.43 | 0.20 | 3.04 | 3.84 |
| Washing face | 3.43 | 0.03 | 3.36 | 3.49 |
| No fast food | 3.42 | 0.06 | 3.30 | 3.55 |
| PhotoDynamic Therapy | 3.42 | 0.15 | 3.11 | 3.73 |
| Exuviance Vespera Serum | 3.41 | 0.23 | 2.96 | 3.87 |
| Helminthic therapy | 3.41 | 0.23 | 2.96 | 3.87 |
| Clindamycin 1% dabber | 3.41 | 0.07 | 3.25 | 3.56 |
| White vinegar | 3.40 | 0.17 | 3.06 | 3.75 |
| Washing pillowcases regularly | 3.40 | 0.05 | 3.29 | 3.50 |
| AcZone | 3.39 | 0.20 | 2.99 | 3.81 |
| Tazorac | 3.39 | 0.16 | 3.07 | 3.71 |
| Tetracycline | 3.39 | 0.05 | 3.29 | 3.49 |
| Zinc cream | 3.39 | 0.15 | 3.09 | 3.69 |
| Benzaclin (Benzoyl peroxide/clindamycin) | 3.39 | 0.09 | 3.19 | 3.58 |
| Benzoyl peroxide 2.5% | 3.38 | 0.03 | 3.30 | 3.46 |
| Aveeno Skin Brightening Daily Scrub | 3.38 | 0.22 | 2.94 | 3.82 |
| Metrogel | 3.38 | 0.17 | 3.03 | 3.72 |
| Retinol Oral Supplementation | 3.37 | 0.20 | 2.97 | 3.77 |
| Duac Gel | 3.37 | 0.10 | 3.16 | 3.57 |
| avoid spicy food | 3.35 | 0.19 | 2.97 | 3.74 |
| Placing clean towel on pillow each night | 3.35 | 0.11 | 3.13 | 3.56 |
| Pantothenic acid | 3.35 | 0.13 | 3.09 | 3.61 |
| Antibiotic cream | 3.33 | 0.08 | 3.18 | 3.49 |
| Rosac | 3.33 | 0.19 | 2.96 | 3.70 |
| Sulfur Powder | 3.33 | 0.14 | 3.05 | 3.61 |
| Mychelle clear skin serum | 3.33 | 0.17 | 2.98 | 3.68 |
| Mario Bedescu Drying Lotion | 3.32 | 0.14 | 3.03 | 3.61 |
| Drink a lot of water | 3.31 | 0.13 | 3.05 | 3.57 |
| Acnetrex | 3.31 | 0.22 | 2.87 | 3.74 |
| Spironolactone | 3.30 | 0.16 | 2.97 | 3.63 |
| Blemish potion | 3.30 | 0.13 | 3.03 | 3.57 |
| Dr. Hauschka Natural Skin Care | 3.30 | 0.14 | 3.02 | 3.58 |
| Erythromycin topical solution | 3.30 | 0.10 | 3.10 | 3.50 |
| Using fresh aloe vera leaves on skin | 3.30 | 0.11 | 3.07 | 3.52 |
| Evening Primrose Oil | 3.29 | 0.21 | 2.86 | 3.71 |
| Salicylic acid | 3.29 | 0.04 | 3.20 | 3.37 |
| Vaccination with individually developed vaccine[^auto-v] | 3.28 | 0.21 | 2.86 | 3.70 |
| Stievia-A Cream 0.025% | 3.28 | 0.17 | 2.94 | 3.63 |
| Vicco Turmeric Skin Cream | 3.27 | 0.24 | 2.80 | 3.75 |
| Dr. Andrew Weil for Origins | 3.27 | 0.17 | 2.93 | 3.61 |
| Cleanance K by Avene | 3.27 | 0.14 | 2.99 | 3.55 |
| OBAGI | 3.27 | 0.15 | 2.96 | 3.58 |
| N-acetylcysteine | 3.27 | 0.20 | 2.86 | 3.67 |
| Differin Gel | 3.26 | 0.12 | 3.01 | 3.50 |
| Avoiding pork | 3.25 | 0.16 | 2.92 | 3.57 |
| Chemical-free, vegan cosmetics / personal care products | 3.25 | 0.07 | 3.09 | 3.40 |
| Washing face with baking soda | 3.24 | 0.19 | 2.85 | 3.62 |
| Gamma Linolenic Acid | 3.23 | 0.22 | 2.79 | 3.67 |
| Oil Cleansing Method | 3.23 | 0.10 | 3.02 | 3.44 |
| Acne Free | 3.22 | 0.10 | 3.01 | 3.43 |
| African Black Soap | 3.22 | 0.14 | 2.94 | 3.50 |
| Aloe Vera Soap | 3.21 | 0.15 | 2.91 | 3.51 |
| Aromatherapy | 3.21 | 0.16 | 2.89 | 3.53 |
| Urine therapy | 3.20 | 0.20 | 2.79 | 3.61 |
| Hydrocortisone cream | 3.20 | 0.14 | 2.91 | 3.49 |
| No eggs | 3.20 | 0.12 | 2.95 | 3.45 |
| Air Purifier | 3.19 | 0.14 | 2.91 | 3.47 |
| Aveda Enbrightenment Wash | 3.18 | 0.19 | 2.79 | 3.56 |
| Vegan diet | 3.18 | 0.08 | 3.01 | 3.35 |
| Cetaphil for sensitive skin | 3.18 | 0.06 | 3.05 | 3.30 |
| Vitamin D | 3.17 | 0.06 | 3.05 | 3.29 |
| Zinc | 3.16 | 0.07 | 3.02 | 3.30 |
| Bert's Bees blemish stick | 3.15 | 0.18 | 2.80 | 3.51 |
| Jojoba Oil | 3.13 | 0.20 | 2.73 | 3.52 |
| Purpose Gentle Cleansing Bar | 3.12 | 0.11 | 2.89 | 3.34 |
| Honey | 3.11 | 0.08 | 2.94 | 3.29 |
| Bert's Bees Cleansing Milk | 3.10 | 0.17 | 2.75 | 3.45 |
| Tea tree oil | 3.10 | 0.05 | 2.99 | 3.20 |
| Aging (improvement after around age 19) | 3.08 | 0.05 | 2.98 | 3.18 |
| Alpha Hydroxy Acid | 3.05 | 0.12 | 2.80 | 3.30 |
| Flax oil | 3.05 | 0.10 | 2.84 | 3.25 |
| Exercise | 3.04 | 0.05 | 2.93 | 3.15 |
| Pregnancy | 3.04 | 0.15 | 2.73 | 3.35 |
| St. Ives Apricot Scrub | 3.04 | 0.06 | 2.92 | 3.16 |
| Apple cider vinegar tonic | 3.03 | 0.08 | 2.86 | 3.21 |
| Washing face with water only | 3.03 | 0.06 | 2.90 | 3.16 |
| Proactiv | 3.03 | 0.05 | 2.92 | 3.13 |
| Toothpaste on acne | 3.01 | 0.06 | 2.89 | 3.13 |
| Rubbing Alcohol | 2.99 | 0.07 | 2.85 | 3.13 |
| Hydrogen peroxide | 2.98 | 0.07 | 2.83 | 3.13 |
| Bar soap | 2.96 | 0.05 | 2.85 | 3.07 |
| Sauna | 2.96 | 0.11 | 2.73 | 3.19 |
| Murad AcneComplex | 2.93 | 0.10 | 2.72 | 3.14 |
| Clearasil Stayclear | 2.91 | 0.06 | 2.79 | 3.02 |
| Emu oil | 2.81 | 0.18 | 2.44 | 3.17 |
| Not washing face at all | 2.75 | 0.16 | 2.43 | 3.06 |
Table: Posterior estimates of average CureTogether crowd-sourced ratings for acne treatments, in descending order (higher=better).
[^auto-v]: This sounds bizarre but appears to refer to a real (albeit obscure) medical treatment more commonly named "autovaccines" or "autovaccination" (more widely used before the introduction of antibiotics), where part of the localized infection is cultured & then injected to try to trigger a more powerful, systemic, immune response.
One thing which would be nice to do with the CureTogether ratings is examine the clustering of interventions.
Some of these interventions are doubtless the same thing and should be merged together; others are different but may act through the same mechanisms and so could be considered about the same thing too.
Doing clustering or factor analysis might pop up a handful of major 'approaches', like 'antibiotics' vs 'cleaning' vs 'dietary interventions', and that would make the available options much easier to understand.
More interestingly, interventions almost certainly correlate with each other & predict success of each other, and this could be used to provide flowcharts of advice along the lines of "if X didn't work for you, then Y might and Z probably won't".
Acne treatment could be considered as a Partially Observable Markov Decision Problem where the effectiveness of each treatment is unknown but has a prior probability based on the global ratings, and one treatment's result informs the posterior probability of success of the untried treatments; a POMDP can be solved to give an optimal sequence of treatments to try.
(I've done a simple example of this for [cats and cat stimulants like catnip](/Catnip#optimal-catnip-alternative-selection-solving-the-mdp).)
And this approach also immediately gives a principled way for users to collectively experiment by posterior sampling (like [Thompson sampling](!Wikipedia), but one solves the MDP which is defined after sampling a possible parameter value from each parameter's posterior & treating it as the true value).
Unfortunately, that requires each CureTogether rater's individual data, to examine correlations within-individual of ratings, and the web interface doesn't provide that, and since CureTogether has been acquired & disappeared, there's no one to ask for the raw data now.
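Still, the core idea can be illustrated with a toy Thompson-sampling sketch; all the numbers below are made up, and each treatment is treated as independent, which throws away exactly the cross-treatment correlations that would make the full POMDP interesting:

~~~{.R}
## Toy Thompson sampling over a few acne treatments: Beta priors loosely seeded from the
## global ratings above, updated by one user's own (hypothetical) monthly outcomes.
set.seed(2019)
treatments <- c("Isotretinoin", "No dairy", "Benzoyl peroxide", "Tea tree oil")
priorMean  <- c(0.70, 0.55, 0.50, 0.40)             # rough prior success probabilities (made up)
alpha <- priorMean * 10; beta <- (1-priorMean) * 10 # ~10 pseudo-observations of prior weight
trueP <- c(0.20, 0.70, 0.50, 0.30)                  # this user's actual (unknown) response rates
for (month in 1:24) {
    draws   <- rbeta(length(treatments), alpha, beta) # sample one plausible effectiveness per treatment
    pick    <- which.max(draws)                       # try whichever treatment looks best this month
    outcome <- rbinom(1, 1, trueP[pick])              # did it help?
    alpha[pick] <- alpha[pick] + outcome
    beta[pick]  <- beta[pick]  + (1 - outcome)
}
data.frame(Treatment=treatments, PosteriorMean=round(alpha / (alpha+beta), 2))
~~~

Within a group like Roberts's proposed acne clubs, pooling such individual outcomes is what would let the priors themselves, and the correlations between treatments, be estimated.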
# Fermi calculations {.collapse}
> A short discussion of "Fermi calculations": quick-and-dirty approximate answers to quantitative questions which prize cleverness in exploiting implications of common knowledge or basic principles to give reasonable answers to apparently unanswerable questions. Links to discussions of Fermi estimates, and a list of some Fermi estimates I've done.

I really like [Fermi problems](!Wikipedia) ([LessWrong](http://lesswrong.com/lw/h5e/fermi_estimates/))---it's like [dimensional analysis](!Wikipedia) for everything outside of physics^[This is a little misleading; dimensional analysis is much more like [type-checking](!Wikipedia) a program in a language with a good type system like Haskell. Given certain data types as *inputs* and certain allowed transformations on those data types, what data types *must* be the resulting output? But the analogy is still useful.].
Not only are they fun to think about, they can be amazingly accurate, and are extremely cheap to do---because they are so easy, you do them in all sorts of situations you wouldn't do a 'real' estimate for, and they are a fun part of a [physics education](http://arxiv.org/abs/1109.1165). The common distaste for them baffles me; even if you never work through Hubbard's [_How to Measure Anything_](http://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/1118539273/) ([some strategies](http://80000hours.org/blog/170-estimation-part-i-how-to-do-it "Estimation---Part I: How to do it?")) or [_Street-Fighting Mathematics_](http://www.amazon.com/Street-Fighting-Mathematics-Educated-Guessing-Opportunistic/dp/026251429X) or read [Douglas Hofstadter](!Wikipedia)'s 1982 essay ["On Number Numbness"](/docs/math/1982-hofstadter.pdf) (collected in _[Metamagical Themas](https://archive.org/details/MetamagicalThemas)_), it's something you can teach yourself by asking: what information is publicly available, what can I compare this to, how can I put various boundaries around the true answer?^[eg. if someone asks you how many piano tuners there are in Chicago, don't look blank, start thinking! 'Well, there must be fewer than 7 billion, because the human race isn't made of piano tuners, and likewise fewer than 300 million (the population of the United States), and heck, Wikipedia says Chicago has only 2.6 million people and piano tuners are rare, so there must be many fewer than *that*...'] (A worked version of that piano-tuner estimate appears after the list below.) You especially want to do Fermi calculations in areas where the data is unavailable; I wind up pondering such areas frequently:
- [is a lip-reading website a good idea?](Notes#lip-reading-website)
- how many women [dye their hair blonde](Notes#somatic-genetic-engineering)
- how many [people does Folding@home kill](/Charity-is-not-about-helping)
- what's [the entropy of natural language](Notes#efficient-natural-language)
- or [how big a computer](/Simulation-inferences) is needed to compute the universe
- do [men have shorter *real* lives than women](Notes#who-lives-longer-men-or-women)
- how much do Girl Scouts cookies [cost and earn them](/Girl-Scouts-and-good-governance-cookie-prices-and-inflation)
- how many people have [used modafinil](/Modafinil#side-effects), and what is the cheapest we can expect to find [modafinil for](/Modafinil#margins)
- quickly assessing the rough probability of some event as I made one of my [thousands of predictions](/Prediction-markets)
- how many sellers are there on [Silk Road 1](/Silk-Road)
- checking whether posthumous organ donation [justifies the Chinese justice system](https://news.ycombinator.com/item?id=3389084) (no)
- [how many tourists](http://lesswrong.com/lw/bi7/smbc_comic_poorly_programmed/69fw#body_t1_69gl) to the Egyptian pyramids there have been compared to the workers who build them
- [what we can infer from adultery & false paternity rates](Notes#politicians-are-not-unethical)
- [how many times have video gamers killed Mario](http://www.reddit.com/r/estimation/comments/105okb/request_how_many_times_has_mario_died/)
- is tiger skin [more profitable](http://www.reddit.com/r/WTF/comments/188zui/this_is_beyond_saddening_tigers_starved_to_make/c8cql4x) than tiger bone wine?
- [plating Versailles with gold](http://slatestarcodex.com/2013/03/03/reactionary-philosophy-in-an-enormous-planet-sized-nutshell/#comment-544)
- the [number of people killed by falling pianos](http://lesswrong.com/lw/h5e/fermi_estimates/8pts)
- [the plausibility of](http://www.reddit.com/r/anime/comments/1cqtae/in_honor_of_kuronekos_birthday_my_favorite_scene/c9ja9a5) _[Oreimo](!Wikipedia)_'s Kirino Kosaka character
- [how many integers one to a million do you see over a lifetime?](http://www.reddit.com/r/estimation/comments/1f765d/request_in_an_average_lifetime_how_many_of_the/ca85zcl) (R simulation + [Benford's law](!Wikipedia))
- [how many anecdotes of CrossFit causing rhabdomyolysis does it take to justify "CrossFit causes rhabdomyolysis"?](http://lesswrong.com/r/discussion/lw/irm/rationality_competitiveness_and_akrasia/9uoq)
- [how long would it take to read every book in existence?](http://www.reddit.com/r/estimation/comments/27i4wq/request_how_long_would_it_take_to_read_every_book/ci1jfno)
- [what would it cost to replace the US nuclear arsenal?](http://www.reddit.com/r/estimation/comments/2chtft/if_all_the_uss_atomic_bombs_disappeared_assuming/cjfyses)
- [What are the lifetime odds of being pooped on by a bird?](https://www.reddit.com/r/estimation/comments/4f9bxl/what_are_the_lifetime_odds_of_being_pooped_on_by/)
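To make the piano-tuner footnote concrete, the classic bounding-and-guessing chain can be written out explicitly; every input below is a guess, not data:

~~~{.R}
## Classic Fermi estimate: how many piano tuners does Chicago support?
population      <- 2.6e6  # Chicago residents (from Wikipedia, as in the footnote)
personsPerHouse <- 2.5    # guessed household size
pianoFraction   <- 1/20   # guessed fraction of households with a piano
tuningsPerYear  <- 1      # guessed tunings per piano per year
tuningsPerDay   <- 4      # guessed pianos a tuner can service per working day
workDaysPerYear <- 250
(population / personsPerHouse) * pianoFraction * tuningsPerYear / (tuningsPerDay * workDaysPerYear)
# [1] 52
~~~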
An entire ["estimation" subreddit](http://www.reddit.com/r/estimation/) is devoted to working through questions like these (it can be quite fun), and of course, there are the memorable ["what if?"](http://what-if.xkcd.com/) _xkcd_ columns.
[Timothy Gowers](!Wikipedia) suggests [a number of problems](https://gowers.wordpress.com/2012/06/08/how-should-mathematics-be-taught-to-non-mathematicians/ "How should mathematics be taught to non-mathematicians?") which might help children really learn how to think with & apply the math they learn.
To look further afield, here's a quick and nifty application by investor John Hempton to the [Sino-Forest fraud](!Wikipedia "Sino-Forest Corporation#Fraud allegations and share suspension"): ["Risk management and sounding crazy"](http://brontecapital.blogspot.com/2011/09/risk-management-and-sounding-crazy.html). What I personally found most interesting about this post was not the overall theme that the whistleblowers were discounted before and after they were proven right (we see this in many bubbles, for example, the housing bubble), but how one could use a sort of [Outside View](http://wiki.lesswrong.com/wiki/Outside_view)/Fermi calculation to sanity-check the claims. If Sino-Forest was really causing 17m cubic meters of wood to be processed a year, where was all the processing? This simple question tells us a *lot*. With medicine, there is one simple question one can always ask too---where is the increased longevity? (This is an important question to ask of studies, such as a [recent caloric restriction study](http://aging.wisc.edu/pdfs/2297.pdf).)
Simple questions and reasoning can tell us a lot.