Miscellaneous statistical stuff
topics:
created: 17 July 2014; modified: 16 May 2019; status: in progress; confidence: possible; importance: 5
 Critiques
 “Someone Should Do Something”: Wishlist of Miscellaneous Project Ideas
 Estimating censored test scores
 The Traveling Gerontologist problem
 Bayes nets
 Genome sequencing costs
 Proposal: handcounting mobile app for more fluid group discussions
 Air conditioner replacement
 Some ways of dealing with measurement error
 Value of Information: clinical prediction instruments for suicide
 Bayesian Model Averaging
 Dealing with allornothing unreliability of data
 Dysgenics power analysis
 Power analysis for racial admixture studies of continuous variables
 Operating on an aneurysm
 The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example
 RNN metadata for mimicking individual author style
 MCTS
 Candy Japan A/B test
 DeFriesFulker power analysis
 Inferring mean IQs from SMPY/TIP elite samples
 Genius Revisited: On the Value of High IQ Elementary Schools
 Great Scott! Personal Name Collisions and the Birthday Paradox
 Detecting fake (human) Markov chain bots
 Optimal Existential Risk Reduction Investment
 Model Criticism via Machine Learning
 Proportion of Important Thinkers by Global Region Over Time in Charles Murray’s Human Accomplishment
 Program for nonspacedrepetition review of past written materials for serendipity & rediscovery: Archive Revisiter
 On the value of new statistical methods
 Bayesian power analysis: probability of exact replication
 Expectations are not expected deviations and large number of variables are not large samples
 Oh Deer: Could Deer Evolve to Avoid Car Accidents?
 Evolution as Backstop for Reinforcement Learning
 Acne: a good Quantified Self topic
 Fermi calculations
Critiques
moxibustion mouse study https://plus.google.com/103530621949492999968/posts/TisYM64ckLM
criticism of teethremoval experiment in rats http://lesswrong.com/r/discussion/lw/kfb/open_thread_30_june_2014_6_july_2014/b1u3
criticism of small Noopept selfexperiment http://www.bluelight.org/vb/threads/689936MyPaperquotNoopeptampThePlaceboEffectquot?p=11910708&viewfull=1#post11910708
why Soylent is not a good idea http://lesswrong.com/lw/hht/link_soylent_crowdfunding/90y7
misinterpretation of fluoridation metaanalysis and ignorance of VoI http://theness.com/neurologicablog/index.php/antifluoridepropagandaasnews/#comment76400
http://lesswrong.com/lw/1lt/case_study_melatonin/8mgf
Fulltext: https://dl.dropboxusercontent.com/u/280585369/2014dubal.pdf is this possible? http://nextbigfuture.com/2014/05/klvsgenemakesupsixiqpointsof.html#comment1376748788 http://www.reddit.com/r/Nootropics/comments/25233r/boost_your_iq_by_6_points/chddd7f
Facebook emotion study: http://www.reddit.com/r/psychology/comments/29vg9j/no_emotions_arent_really_contagious_over_facebook/cip7ln5 https://plus.google.com/103530621949492999968/posts/1PqPdLyzXhn
tACS causes lucid dreaming: http://www.reddit.com/r/LucidDreaming/comments/27y7n6/no_brain_stimulation_will_not_get_you_lucid/ck6isgo
Herbalife growth patterns: http://www.reddit.com/r/business/comments/24aoo2/what_unsustainable_growth_looks_like_herbalife/ch5hwtv
Plausible correlate of Fairtrade: http://www.reddit.com/r/Economics/comments/26jb2d/surprise_fairtrade_doesnt_benefit_the_poor/chrx9s4
slave whippings vs cotton production http://lesswrong.com/r/discussion/lw/kwc/open_thread_sept_17_2014/bajv
whether a study on mental illness & violence shows schizophrenics are not more likely to murder but rather be murdered: http://www.reddit.com/r/psychology/comments/2fwjs8/people_with_mental_illness_are_more_likely_to_be/ckdq50k / http://www.nationalelfservice.net/publicationtypes/observationalstudy/peoplewithmentalillnessaremorelikelytobevictimsofhomicidethanperpetratorsofhomicide/#comment95507 (see also http://slatestarscratchpad.tumblr.com/post/120950150581/psycholargiraffepoliceforcemuseicetc https://www.reddit.com/r/slatestarcodex/comments/744rqn/violence_is_not_a_product_of_mental_illness/dnwb1kj/ )
Fortune analysis of higher female CEO returns http://lesswrong.com/r/discussion/lw/l3b/contrarian_lw_views_and_their_economic/bftw
failed attempt at estimating P(causationcorrelation) https://plus.google.com/103530621949492999968/posts/UzMMmPgyyaV
algae/IQ: http://lesswrong.com/r/discussion/lw/l9v/open_thread_nov_17_nov_23_2014/bm7o
synaesthesia/IQ: https://www.reddit.com/r/psychology/comments/2mryte/surprising_iq_boost_12_in_average_by_a_training/cm760v8
misinterpretation: http://slatestarcodex.com/2014/12/08/links1214comeyetobethlinkhem/#comment165197
underpowered/multiplecorrection jobs program: http://slatestarcodex.com/2014/12/08/links1214comeyetobethlinkhem/#comment165197
vitamin D/caffeine claim based on weak in vitro claims, inconsistent with more relevant in vivo results: https://plus.google.com/u/0/103530621949492999968/posts/AUg3udezXMS [already included in Nootropics.page]
claimed fall in digit span backwards minuscule and nonstatisticallysignificant, no evidence of heterogeneity beyond variability due to sample size http://drjamesthompson.blogspot.com/2015/04/digitspanbombshell.html?showComment=1428096775425#c4097303932864318518
Claimed randomized experiment of whether sushi tastes worse after freezing is not actually a randomized experiment https://www.reddit.com/r/science/comments/324xmf/randomized_doubleblind_study_shows_the_quality_of/cq8dmsb
aerobic vs weightlifting exercise, multiple problems but primarily phacking, differenceinstatisticalsignificanceisnotasignificantdifference, and controlling for intermediate variable: https://plus.google.com/103530621949492999968/posts/aeZqB8JWUiQ
 another controlling for intermediate variable: https://twitter.com/gwern/status/538763784048095232
sexual openness result undermined by ceiling effect http://mindhacks.com/2015/04/28/whensocietyisntjudgingwomenssexdriverivalsmens/#comment362749
music study claiming WM interaction: possible ceiling effect? see FB PM
attempt to measure effect of Nazi antischizophrenia eugenics program failed to use breeder’s equation to estimate possible size of effect, which is too small to detect with available data and hence attempt is foredoomed: https://www.reddit.com/r/eugenics/comments/3hqdll/between_73_and_100_of_all_individuals_with/cul2nzw
claim high IQ types almost 100% failure rates due to inappropriate model assumption of normal distribution + narrow standard deviation: http://polymatharchives.blogspot.com/2015/01/theinappropriatelyexcluded.html?showComment=1441741719623#c1407914596750199739
implausible claims about success rate of facial recognition applied to St Petersburg population: https://news.ycombinator.com/item?id=11491264 (see also “Facial recognition systems stumble when confronted with millionface database”)
human Toxoplasma gondii study is not wellpowered as authors claim due to incorrect power analysis, and results are evidence for harm: http://blogs.discovermagazine.com/neuroskeptic/2016/02/20/mythmindalteringparasitetoxoplasmagondii/#comment2755778490 ; https://www.reddit.com/r/slatestarcodex/comments/5vjrmo/toxoplasma_doesnt_cause_adolescent_psychosis/de2x4kh/
attempt at attributing Bitcoin price increases to technology improvements: https://www.reddit.com/r/Bitcoin/comments/5tbt8f/buzz_factor_or_innovation_potential_what_explains/ddlmzrz/
analysis of designer drug/research chemical activity on Wikipedia is driven almost entirely by editing patterns of just 2 Wikipedia editors particularly interested in the topic: http://calib.ro/chemicalwiki/explorations/20160912emcddawatchlistandwikipediatimeline#comment3277669328
failure to use mediation SEM, differenceinstatisticalsignificanceisnotasignificantdifference: https://www.reddit.com/r/slatestarcodex/comments/6qwb0q/critical_thinking_skills_are_more_important_than/dl51ubw/
Neanderthal ancestry percentage & autism: https://www.reddit.com/r/slatestarcodex/comments/74fevz/findings_suggest_that_high_levels_of_neanderthal/dny3sh9/
“Someone Should Do Something”: Wishlist of Miscellaneous Project Ideas
Statistics
Erowid: datamine the trip reports to create clusters or a statespace of drug effects/results and using the clusters & common terms, create a general inventory of descriptions; add this to the trip report form so Erowid users can provide some more structured information about their experience.
Dark net markets: use a longitudinal crawl of DNM sellers to estimate survival curves, outstanding escrow + orders, and listed product prices / type / language to try to predict exit scams.
Dark net markets: parse hyperlinks in all forums & DNMs and outlinked domains like Reddit’s
/r/darknetmarkets
in order to create a proper network topology of Tor hidden services—existing .onion crawls tend to find Tor hidden services are a very fragmented and unconnected network despite the Hidden Wikis, but this may reflect the instability of hidden services and the use of clearnet communities & indexes (like Reddit or gwern.net)read through the Extropy and SL4 mailing list archives: have any important ideas been forgotten or dropped? Are there any meta lessons about forecasting or thinking about existential risk to be gleaned?
what are the possibilities in general for predicting human traits from faces? Instead of bickering about how much you can predict homosexuality etc from faces and whether a specific dataset/analysis works, apply variance component analysis using distances in a facial recognition CNN’s faceembedding as a similarity metric (which is highly robust to all sorts of realworld transformations like angle or lighting or hair style); then calculate ‘face heritability’ on many traits. If the average is near zero, that implies that faces don’t carry any important signals and that, aside from occasional exceptions, nothing beyond the expected things like basic demographic data can be predicted from faces. On the other hand, if ‘face heritability’ of many traits turns out to be substantially above zero (perhaps 20%), this means that faces carry many signals and these signals may be commercially or legally exploitable and earlier findings about face prediction may have been right after all. We may not like the answers, but it’s better to know the truth than go along blithely assuring everyone that it’s impossible to do such things and things like homosexuality/criminality are merely junk statistics.
Quantified Self experiments for acne, especially for teenagers: one could work with some to run some preliminary experiments, perhaps design some canned experiments+analysis code?
Deep learning/RL
Markov chain/charRNN bot for Twitter trained on just English proverbs, idioms, and expressions
userfriendly charRNN implementation just for classifying text, taking in CSV data of text/category
hyperparameter optimization for algorithms in problems without available loss functions but humanjudgeable quality, using a human for making choices in a paired or forcedchoice comparison, then using a BradleyTerry or latent variable model to infer rankings of hyperparameter settings and optimizing based on the latent scores. This would be particularly useful in GAN comparisons, where most comparisons attempt to force comparisons into a cardinal framework.
GAN improvements: provide supervision via adding additional losses by requiring the Discriminator (D) to output an array of perpixel losses of sample images, as opposed to a single scalar loss across the whole image, thereby training the generator more effectively.
In looking at GAN samples, I notice that bad Generators (G) often generate decent overall samples but there will be small regions where the quality is glaringly bad. It is not the case that the “whole image just looks bad somehow”—often there’s a specific point like the eyes or the lips where it looks horrifyingly creepy (especially for dog or human images). If D produces a large loss (because it’s so easy to notice the flaw), this seems odd from a backpropagation sense since most of the image is fine, it’s just a few spots which contribute to the loss. GANs, as have been often noted, are closely related to reinforcement learning, and considered as RL, the G is getting a single reward at the end of long sequence of generated pixels, and does not know which pixels are responsible for low or high rewards; akin to REINFORCE, it has little choice but to reward/punish neurons and hope that on average it is approximating the correct gradient for each parameter. Actorcritic methods make the reward more informative by trying to assign blame to specific actions, and AlphaGo Zero’s expert iteration appears to exhibit such dramatic learning speed because the use of MCTS means that AG Z receives not a single reward 0/1 attenuated over an entire game of moves, but precise immediate feedback on the value of moves it took & also on all the moves it didn’t take. In general, providing more losses is good for learning—additional examples would include auxiliary losses in RL like UNREAL or “dark knowledge” in image classification. In GANs, everything is differentiable and synthetic, so we don’t need to accept RLlike impoverished losses, but it seems like for the most part, the losses are very simple and lowinformation. Further, in GANs, the largest improvements in image quality in StackGAN and ProGAN come from adding GAN global losses at multiple layers of the generator: a D specialized for 32x32px images, then another D specialized for 64x64px, then another D for 128x128px etc. This can be seen as stacking up losses “depth”wise, providing feedback about plausibility at multiple stages. So why not add losses “width”wise, by criticizing each pixel in the final upscaled image? If it’s good one way, why not the other? This is in large part how the strongest competitor to GANs for image generation, PixelCNN, works: generating 1 pixel at a time conditioned on previous generated pixels. (mere_mortise suggests that this scheme would be equivalent to a regular GAN loss but computed on many shifted versions of an image, although that would presumably be much slower.)
Given a D which outputs the 2D array of perpixel losses, the training of G is just backpropagation as usual, but how does one train D to provide perpixel losses? Given a real image, by definition the fakeness of each pixel is 0, after all. The simplest approach would be to train the D with real and Ged/fake images, and label all the pixels in the real image with 0 and all the pixels in the fake image with 1, and hope it works out and the D will learn that way over enough minibatches. Another approach might be to introduce kinds of noises or corruption or shuffles in the real images, label the original pixels with 0 and then label the new pixels with 1; for example, replace a random 50% of pixels with white noise. (This might sound crazy but then, so does an image augmentation technique like MixUp which nevertheless works in CNNs & GANs.) A more interesting approach might be to refashion G into not a singleshot image generator, but a region infiller/inpainter/completion; this lets one generate images which genuinely are a mix of real and fake pixels, by cropping out a random region in a real image, having G fill it back in, and labeling real/fake appropriately. Something like MixUp might be employed: an image could be 40% generated/60% real, and then the target for D is 60%. If MixUp on random pairs of images doesn’t work, a conditional GAN’s conditioning could be used as a kind of MixUp: combine a real image with a fake image based on the real’s conditioning, and since the conditioning should describe most of the image, the pair should constitute a good mashup for the D.
This essentially turns GANs into a “semantic segmentation” problem. For a similar but not identical use, see Pix2pix and a use of semantic segmentation in CycleGAN; what I propose may have been done, but simpler, in “Improving Shape Deformation in Unsupervised ImagetoImage Translation”, Gokaslan et al 2018.
can GANs be turned into a supervised learning problem? For example, use model distillation/compression to train a smaller generator—it would seem that you could train a smaller generator simply to turn the latent zvector into the pixel output of the original large generator… but can you? If not, why not?
GWAS via 1D (possibly dilated) CNNs on SNP sequences a la WaveNet or malware detection (Raff et al 2017):
Linear regressions are notoriously sampleinefficient and weak methods of implementing GWAS as they typically use unrealistic flat priors, do not exploit the ‘clumping’ of hits in groups of SNPs (requiring postprocessing to ‘prune’ SNP hits which are physically too close to each other and likely in linkage disequilibrium to reveal the ‘real’ hit) , expect linear effects, and additive effects. Linear regressions can easily produce polygenic scores explaining half or less of variance compared to a more optimal statistical method (eg compare Hsu’s lasso or MTAG use to the previous GWASes on height/intelligence). A CNN could benefit from the hit clusters, can flexibly model distributions of effects and subsuming the “Bayesian alphabet”, and can pool information both locally and globally while modeling potentially arbitrarily complex interactions and hierarchies of effects. A SNP sequence of, say, 500k highquality SNP calls may seem infeasible for a NN, and would be totally infeasible for a standard RNN processing the sequence 1 SNP at a time, as it would be unable to preserve enough information in its hidden state or learn effectively due to vanishing gradients; but WaveNet and 1D convolutions for text classification have demonstrated the ability for dilated convolutions to handle enormous sequences highly effectively while modeling both local & global aspects. It is possible that a 1D CNN could be a highly effective GWAS method as well.
The primary challenge, as discovered by Raff et al 2017 in experimenting with CNNs ingesting sequences of millions of byte, is that the first layer is inherently extremely memoryhungry, as each of the thousands or millions of variables must be connected to the NN simultaneously. Raff et al 2017 used a DGX1 with 4 GPUs and ~16GB VRAM for a month for convergence, and found almost all their memory was going to the first layer and the higher layers contributed minimal demand. If the additional layers prove problematic, dilated convolutions can be used instead, which increase memory use only logarithmically, especially with high dilation factors like 15 or 20. (Raff et al 2017 also found that dilated convolutions were unhelpful in their malware executable classification problem and that they needed a very shallow architecture, suggesting that malware byte sequences just don’t have that much local structure for convolutions to exploit and that they were having training/convergence issues despite considerable investment—but I expect genomes to have much more local structure due to the genome inherently being sequenced into genes (which do not all affect traits of interest to equal degree), coding regions of various sorts, and the previously mentioned SNPclumping empirically observed in many GWASes.) A GWAS CNN might require dataparallel training over multiple 1080ti GPUs, splitting the minibatch to fit into the 11GB VRAM, and at least a month. However, should it deliver predictive power much superior to existing SOTA techniques like lasso GWAS, these computational requirements would probably be considered acceptable—several GPUmonths may be expensive, but collecting twice or thrice as many human genomes is more expensive still.
Preference learning
“learning from preferences” quasiGAN RL for highquality music generation with global coherency: based on “Deep reinforcement learning from human preferences”, Christiano et al 2017 (see also “SPIRAL: Synthesizing Programs for Images using Reinforced Adversarial Learning”), adapt the framework to instead generate music in a simple textual format such as ABC notation and querying the human oracle for pairwise comparisons of music snippets (~10s).
While used by Christiano for generating sequences of actions in ALElike environments, it seems like it could also be used to generate sequences such as text (eg dialogue agents) or music. These are domains where existing generative approaches like charRNN’s 1step prediction or GANs either have no global coherency or hardly work at all, but are natural fits for pairwise comparison data and rankings where “I know it when I see it”.
Existing RNN music generation using charRNN has the exact same structure and same problems of exploration as ALE or Gym deep RL agents have in doing deep exploration & global coherency of actions rather than local greedy, it perfectly exploits the human ability to easily do pairwise comparisons of sequences (of musical passages rather than actions), and the sample efficiency in learning should make it feasible for music (even if music is much more difficult than controlling a robot)—if a single person can, doing pairwise ratings, guide the RL NN to victory in an Atari game in a few hours, a few days may be enough to provide proof of concept, and then one could crowdsource a large corpus of ratings. The pairwise comparisons solve the problem of having no good reward function for music, since GANs aren’t available and humans can’t provide meaningful scalar feedback on random gibberish.
To kickstart the learning process, you could preseed the preference classifier by generating lots of music from lowquality generative sources and then marking them all as the loser in a set of comparisons with real music. For example, one could define a few music generators (random characters, ngrams, charRNN at various temperatures) to generate a million fake music sequences, take the real music from the ABC Irish music corpus and create comparisons with the real music always the winner. If there is popularity data on the real music, then this too can be used to pregenerate a large number of comparisons (just have the more popular of two pieces win each comparison). This immediately provides an enormous corpus for the preference classifier, and then when it’s finished training on that, one can bring the human into the loop and start generating/comparing/retraining as in the preference learning. The generated music would be in ABC (ASCII text format) for Irish music for simplicity, following the earlier RNN example of ‘Infinite Irish Trad Session’. (So the agent generates a string of ASCII text which is hopefully valid ABC, and this and the comparison string get fed into a MIDI synthesizer for the human to listen to.)
More generally, you can see the pretraining+preference learning as a form of semisupervised learning:
 use unsupervised learning methods to create generative models based on some large corpus
 run the generative models to create large fake n
 create n comparisons with by pairing randomly fake n and real n
 train the preference learning classification NN to convergence
 begin finetuning on new data with human or other oracle
the pretraining comparisons can reflect as much additional information as you think you can get away with. Along with popularity rating to make distinctions between comparisons of the real music, why not order the comparisons as well by data quality source? random < ngram < charRNN < RL. There might be mistaken comparisons (perhaps sometimes the ngrams really do beat the charRNNs), but this is amenable to fixing by active learning on the persistently misclassified comparisons should it be an issue.
aside from GANs, there’s possibly a connection with model distillation like Hinton’s dark knowledge: the comparison predictor might in a sense be ‘distilling’ the knowledge gained by the unsupervised generators by receiving labels from comparisons of their outputs to ground truths.
it might be better to coldstart the NNs instead of using extensive pretraining
or use “curriculum learning” (eg Press et al 2017): there’s at least two straightforward ways of providing easier subtasks than generating a full music piece. First, the required length can be gradually expanded over training  once it learns to generate 5s of music that the critic can’t distinguish, require it to generate 10s, etc. Second, real music can be used as a crutch by providing the generator with a decreasing prefix from real music as a ‘seed’: once it can append 1 note successfully, require it to append 2 notes, then 3 notes, and so on, until the prefix is 0length it is generating music sequences from scratch. (This can be done with or without using a supervised loglikelihood loss for training the RNN to generate the prefix.)
Another possibility would be to try to use beam search (although it has produced bad results in charRNN, perhaps due to the log likelihood training encouraging repetition) or the expert iteration/MCTS training from AlphaGo Zero. MCTS was originally introduced for planning in general MDPs, it isn’t inherently limited to twoplayer games, the “rules” of generating sequence data is trivial (anything ASCII, in this case), and the discriminator provides a welldefined reward. So instead of an RNN which directly generates a next character, it could instead (given a particular prefix/history) output values for the 128 ASCII values, run MCTS search for a while, produce a refined value for each character, and retrain the RNN towards the refined values; every minibatch of the generator, one generates a bunch of examples for the human to judge and provide a new minibatch for the discriminator. Hence, tree iteration learningfrompreferences deep RL.
With music we don’t necessarily need the stable selfplay that tree iteration provides since I’m not too clear conceptually what one would expect selfplay to deliver (it is inherently a humandefined problem, as opposed to Go where it’s external and human preferences are not the criteria), but given the Zero & Anthony’s Hex results, this could be considerably more computationefficient by providing much more supervision at each timestep instead of providing just a little bit of supervision from the end result of win/lose with REINFORCE. Possibly also more humansampleefficient?
deep RL for neural network design but focusing on generating a distribution of random weights for initializing a NN; better initializations have proven to be extremely important in stably training NN and simply tweaking initialization can train NNs with hundreds of layers (previously impossible, then only possible with a major architectural innovation like residual networks) eg Balduzzi et al 2017. Better initializations are hard to design by hand as they apparently work by breaking various symmetries inside the NN, so this is a problem that is well suited for brute force and trialanderror. See further Saxe et al 2013, Krähenbühl et al 2015, Daniely et al 2016, Schoenholz et al 2016, Kumar 2017, Aghajanyan 2017. This may wind up being essentially the same thing as HyperNetworks/fastweights eg “SMASH: OneShot Model Architecture Search through HyperNetworks”, Brock et al 2017.
Playing with GPT2 for various things (mostly poetry: https://www.gwern.net/GPT2 ), I’ve been thinking about the potential for preference learning and I think the original architecture can be simplified & improved.
The motivation for the doublecritic architecture is that the data being collected from humans is pairwise, and so one trains the critic to predict comparisons. This outside training loop then has an inner G/agent training loop etc. The double training loop is necessary to collect ratings from brand new areas of statespace that the G/agent can newly access, but also, GANstyle, to avoid the D/critic from being too powerful and saturating loss, I get the impression?
But, just because the input is pairwise doesn’t mean that the output must also be pairwise. It could instead be a scalar, with the D/critic performing regression.
A BradleyTerry model is extremely simple and easy to estimate on even large samples, and can easily produce cardinal rankings (eg my simple interactive R tool for ranking, https://www.gwern.net/Resorter , just reestimates the entire BT model each interaction because it takes like a tenth of a second at most). Each datapoint gets an estimated cardinal ranking in standard deviations of a hypothetical latent Gaussian. The D/critic then is trained to do regression from a single input to the estimated latent variable of quality.
So the new loop would look like this:
 run offtheshelf BT over a dataset of comparisons of datapoints
 extract the estimated latent variables for each datapoint
 until convergence, supervised training of a D/critic NN to predict the latent for each datapoint
 until convergence, RL training of a G/agent NN with the D/critic NN
 extract n new datapoints from the trained G/agent NN and add to the dataset
 run BT over the augmented dataset
 ask the oracle for a rating of the m datapoints with the largest posterior uncertainty or some proxy thereof like standard error (which will usually be the new datapoints)
This could have a lot of advantages:
the D/critic NN is simplified considerably: instead of 3way classification on a double input, it is just a single input for regression
more memoryefficient: before, a double input takes up memory, even with tied weights, only to yield a single comparison; in the same space, 2 regression models could be run, each with a different input and target quality rating
This could be particularly useful if one tries to use a large Transformer model like GPT2345M where memory consumption becomes a serious barrier to running it at all… At 345M, we’re down to n=1 minibatches, and I’m not sure how it would be possible to run 2 models simultaneously so as to backpropagate the classification error into both ‘halves’.
more D dataefficient: many comparisons will be useless, or for a given pair, they will quickly cease to be informative; a quality rating is informative regardless of what might’ve been used as a comparison, providing richer feedback on each input (analogous to AlphaZero switching to a regression target)
 possibly better ‘offpolicy’ learning: related to saturating, a D/critic trained from a corpus (eg initializing a D/critic by taking a dataset of real and GPT2generated poems, and labeling all comparisons as victory for the human poem) might destroy G/agent training if it provides only comparison feedback, but
 better value function/reward signal for any other approach leveraging the NNs (like MCTS over a tree of sequences), too
 humans or other datasets can supply cardinal ratings directly when those are available
possibly more D trainingefficient: by training comparisons, the D/critic must, implicitly, be learning an equivalent to a quality rating, in order to provide accurate predictions of a human comparison of all possible pairs  but it does so in a very indirect and obfuscated fashion, which will take training on top of the actual difficulty of evaluation
G more data & training efficient: a richer reward signal for each sample will of course be quite useful for the G
the quality variable provides an objective loss for understanding training progress (useful for tasks which don’t have them, like poetry generation!), which is also interpretable and could be useful outside of the task (eg ranking poems for recommendation or datacleaning)
enables active learning via BT posterior uncertainty without any need to extract uncertainty estimates of any kind from the D/critic NN; human ratings can be acquired more efficiently, or datapoints selectively pulled from a large dataset (eg imagine a huge dump of poems from Project Gutenberg or elsewhere, of wildly varying quality  with a regression style D/critic NN, you can do a single pass over it with the D/critic NN to select the k% highest poems, use the estimate as a pseudodatapoint, insert into BT, and ask humans for the most informative comparisons; with a comparison D/critic NN, how to import usefully a large unlabeled corpus is harder to see)
The main downside I can see:
 the latent variables are not necessarily 100% stable, as the whole distribution can drift. The BT estimates a distribution arbitrarily defined as N(0,1); if the BT sees only very selected datapoints at the beginning, it might be that after G/agent trains enough, the BT step would be looking at datapoints which are much better than a mean of 0, so there might be new datapoints all the way out at (what used to be) +100SDs, say. This then leads to the BT estimate the next cycle shifting the mean/SD to restore the conventional N(0,1). So the regression target for the D/critic’s predictions of old datapoints may gradually shift over time, precisely because the richer latent variables don’t saturate the way simple pairwise comparisons would. I believe this would be a minor problem easily solved by training the D/critic NN each iteration, which is necessary just to handle novel datapoints anyway; since improvements will be small each iteration, the retraining should be easily able to keep up.
 (frequentist) BT might require more comparisons in order to infer any total order: a datapoint has to be compared with other datapoints which themselves have comparisons if it is to be globally ranked at all, while a comparison D/critic can work with two entirely disjoint sets of comparisons which don’t overlap. So it might require either a few more comparisons (in order to connect all datapoints via a chain of comparisons), or switching to ad hoc imputation like imputing a mean of 0 for uncompared datapoints, or fully Bayesian BT (which uses priors to provide meaningful estimates for all datapoints). This seems like probably a minor drawback—at worst a Bayesian BT would take a few seconds to run, which is bad for interactive use but minor compared to the slowness of DRL in general.
All in all, I think this version of preference could be simpler, easier to implement, and train faster.
Technology
writing tools:
 dialect/period writing tool, perhaps exploiting
word2vec
: identify words in a text which are of the wrong dialect or are characteristic of different time periods; for example, identifying Americanisms in an ostensibly British work (to ‘Britpick’), or identify anachronisms in a historical fiction (words which did not exist in that time period or would be highly unusual), and suggest replacements  character generator: generate random populationweighted samples of people by demographics, political & religious attitudes, ideology, drawing on realistic datasets such as US censuses (for demographics/names) or the General Social Survey (GSS)^{1}; this can be useful in reducing bias in characters, exploring possibilities, and increasing realism. Naive attempts to debias writings often wind up making the characters far more unrepresentative, such as by including too many homosexual or transsexual characters or including rare ethnicities like Jews while failing to include common types of people such as fundamentalist Christians or Republicans, and existing fake name or character generators do not help because they typically take the easy way out by merely sampling randomly from a list of unique values, skewing selection to bizarre & exotic—trying out one such generator, I get strange names like “Cynthia P. Teal” or “Cody A. Nguyen” or “Marshall T. Blanco”. Using real data & proportional sampling ensures realism and eliminates blind spots an author may not realize they have. (Of course, this is not to say that an author will be happy with the suggestions, particularly with what the GSS may reveal about the beliefs and knowledge of Americans in general. But if an author ensures that all of their characters are aware that chocolate milk doesn’t come from brown cows or graduated high school, at least it will then be a deliberate choice on their part.)
 dialect/period writing tool, perhaps exploiting
smartglasses w/NNs for lipreading+transcription+voicegeneration for deaf/hearingimpaired: https://plus.google.com/103530621949492999968/posts/gAPfia5s
a VR application for viewing stereoscopic images & video and for 3D environments with extremely large parallax such as for viewing clouds with true depth perception (discussion)
properly transcribe & annotate Douglas Hofstader’s Le Ton Beau de Marot, one of his best but also most obscure books
Genetics
 provide “polygenic scores as a service”, a website/API where one can upload a SNP data file like the 23andMe export and get back PGSes for everything in LD Hub, and utility weights
 expand/rewrite Wikipedia’s heritability of intelligence—grossly outdated, almost totally omitting all the GCTAs and GWASes that have definitively settled the answer in the strongly affirmative
 nominative determinism: do first names affect how people are perceived or their appearance? Some studies indicate that one can guess first names based on appearance… but I haven’t seen one which does a withinfamily comparison eg swapping at random the photographs of two samesex siblings, provide their first names, and asking people to guess which is which. Names are canonical examples of things which vary systematically between families.
Estimating censored test scores
An acquaintance asks the following question: he is applying for a university course which requires a certain minimum score on a test for admittance, and wonders about his chances and a possible trend of increasing minimum scores over time. (He hasn’t received his test results yet.) The university doesn’t provide a distribution of admittee scores, but it does provide the minimum scores for 20052013, unless all applicants were admitted because they all scored above an unknown cutoff—in which case it provides no minimum score. This leads to the dataset:
A quick eyeball tells us that we can’t conclude much: only 4 actual datapoints, with 5 hidden from us. We can’t hope to conclude anything about time trends, other than there doesn’t seem to be much of one: the last score, 417, is not much higher than 410, and the last two scores are low enough to be hidden. We might be able to estimate a mean, though.
We can’t simply average the 4 scores and conclude the mean minimum is 410 because of those NAs: a number of scores have been ‘censored’ because they were too low, and while we don’t know what they were, we do know they were <398 (the smallest score) and so a bunch of <398s will pull down the uncensored mean of 410.
On approach is to treat it as a Tobit model and estimate using something like the censReg
library (overview).
But if we try a quick call to censReg
, we are confounded: a Tobit model expects you to provide the cutoff below which the observations were censored, but that is something we don’t know. All we know is that it must be below 398, we weren’t told it was exactly 395, 394, etc. Fortunately, this is a solved problem. For example: “The Tobit model with a nonzero threshold”, Carson & Sun 2007 tells us:
In this paper, we consider estimating the unknown censoring threshold by the minimum of the uncensored ’s. We show that the estimator of is superconsistent and asymptotically exponentially distributed. Carson (1988, 1989) also suggests estimating the unknown censoring threshold by the minimum of the uncensored ’s. In a recent paper, Zuehlke (2003) rediscovers these unpublished results and demonstrates via simulations that the asymptotic distribution of the maximum likelihood estimator does not seem to be affected by the estimation of the censoring threshold.
That seems to be almost too simple and easy, but it makes sense and reminds me a little of the German tank problem: the minimum might not be that accurate a guess (it’s unlikely you just happened to draw a sample right on the censoring threshold) and it definitely can’t be wrong in the sense of being too low. (A Bayesian method might be able to do better with a prior like a exponential.)
With that settled, the analysis is straightforward: load the data, figure out the minimum score, set the NAs to 0, regress, and extract the model estimates for each year:
scores < data.frame(Year=2005:2013,
MinimumScore=c(NA,410,NA,NA,398,407,417,NA,NA));
censorThreshold < min(scores$MinimumScore, na.rm=T)
scores[is.na(scores)] < 0
library(censReg)
## 'censorThreshold1' because censReg seems to treat threshold as < and not <=
summary(censReg(MinimumScore ~ Year, left=censorThreshold1, data=scores))
# Warning message:
# In censReg(MinimumScore ~ Year, left = censorThreshold  1, data = scores) :
# at least one value of the endogenous variable is smaller than the left limit
#
# Call:
# censReg(formula = MinimumScore ~ Year, left = censorThreshold 
# 1, data = scores)
#
# Observations:
# Total Leftcensored Uncensored Rightcensored
# 9 5 4 0
#
# Coefficients:
# Estimate Std. error t value Pr(> t)
# (Intercept) 139.9711 Inf 0 1
# Year 0.2666 Inf 0 1
# logSigma 2.6020 Inf 0 1
#
# NewtonRaphson maximisation, 37 iterations
# Return code 1: gradient close to zero
# Loglikelihood: 19.35 on 3 Df
139.9711 + (0.2666 * scores$Year)
# [1] 394.6 394.8 395.1 395.4 395.6 395.9 396.2 396.4 396.7
With so little data the results aren’t very reliable, but there is one observation we can make.
The fact that half the dataset is censored tells us that the uncensored mean may be a huge overestimate (since we’re only looking at the ‘top half’ of the underlying data), and indeed it is. The original mean of the uncensored scores was 410; however, the estimate including the censored data is much lower, 397 (13 less)!
This demonstrates the danger of ignoring systematic biases in your data.
So, trying to calculate a mean or time effect is not helpful. What might be better is to instead exploit the censoring directly: if the censoring happened because everyone got in, then if you showed up in a censored year, you have 100% chance of getting in; while in a noncensored year you have an unknown but <100% chance of getting in; so the probability of a censored year sets a lower bound on one’s chances, and this is easy to calculate as a simple binomial problem—5 out of 9 years were censored years, so:
binom.test(c(5,4))
#
# Exact binomial test
#
# data: c(5, 4)
# number of successes = 5, number of trials = 9, pvalue = 1
# alternative hypothesis: true probability of success is not equal to 0.5
# 95% confidence interval:
# 0.212 0.863
# sample estimates:
# probability of success
# 0.5556
So we can tell him that he may have a >55% chance of getting in.
The Traveling Gerontologist problem
A quick probability exercise: Wikipedia mentions Finland has 566 centenarians as of 2010.
That’s few enough you could imagine visiting them all to research them and their longevity, in a sort of traveling salesman problem but with gerontologists instead. Except, because of the exponential increase in mortality, centenarians have high annual mortality rates; it depends on the exact age but you could call it >30% (eg Finnish 99yos in 2012 had a death toll of 326.54/1000). So you might well try to visit a centenarian and discover they’d died before you got there.
How bad a risk is this? Well, if the risk per year is 30%, then one has a 70% chance of surviving a year. To survive a year, you must survive all 365 days; by the multiplication rule, the risk is where or ; solving, .
It takes time to visit a centenarian—it wouldn’t do to be abrupt and see them for only a few minutes, you ought to listen to their stories, and you need to get to a hotel or airport, so let’s assume you visit 1 centenarian per day.
If you visit centenarian A on day 1, and you want to visit centenarian B on day 2, then you can count on a 99.9% chance B is still alive. So far so good. And if you wanted to visit 566 centenarians (let’s imagine you have a regularlyupdated master list of centenarians from the Finnish population registry), then you only have to beat the odds 566 times in a row, which is not that hard: .
But that’s coldblooded of you to objectify those Finnish centenarians! “Any centenarian will do, I don’t care.” What if you picked the current set of 566 centenarians and wanted to visit just them, specifically—with no new centenarians introduced to the list to replace any dead ones.
That’s a little more complicated. When you visit the first centenarian, it’s the same probability: 0.999024. When you visit the second centenarian the odds change since now she (and it’s more often ‘she’ than ‘he’, since remember the exponential and males having shorter mean lifetimes) has to survive 2 days, so it’s or ; for the third, it’s , and so on to #566 who has been patiently waiting and trying to survive a risk of , and then you need to multiply to get your odds of beating every single risk of death and the centenarian not leaving for a more permanent rendezvous: , which would be , or in Haskell:
(A little surprisingly, Wolfram Alpha can solve the TeX expression too.)
Given the use of floating point in that function (567 floating point exponentiations followed by as many multiplications) and the horror stories about floating point, one might worry the answer is wrong & the real probability is much larger. We can retry with an implementation of computable reals, CReal
, which can be very slow but should give more precise answers:
:module + Data.Number.CReal
showCReal 100 (product (map (\x > 0.999024**x) [1..566]))
~> 0.0000000000000000000000000000000000000000000000000000000000000000000089527433401308585720915431195262
Looks good—agrees with the floating point version up to the 11th digit:
8.9527433401 64081e69
8.9527433401 308585720915431195262
We can also check by rewriting the product equation to avoid all the exponentiation and multiplication (which might cause issues) in favor of a single exponential:
 (as before)
 = (since )
 = (by arithmetic progression/Gauss’s famous classroom trick since )
 = (start substituting in specific values)
 =
 =
So:
Or to go back to the longer version:
Also close. All probabilities of success are minute.
How fast would you have to be if you wanted to at least try to accomplish the tour with, say, a 5050 chance?
Well, that’s easy: you can consider the probability of all of them surviving one day and as we saw earlier, that’s , and two days would be So you can only take a little over a day before you’ve probabilistically lost & one of them has died; if you hit all 566 centenarians in 24 hours, that’s ~24 centenarians per hour or ~2 minutes to chat with each one and travel to the next. If you’re trying to collect DNA samples, better hope they’re all awake and able to give consent!
So safe to say, you will probably not be able to manage the Traveling Gerontologist’s tour.
Bayes nets
Daily weight data graph
As the datasets I’m interested in grow in number of variables, it becomes harder to justify doing analysis by simply writing down a simple linear model with a single dependent variable and throwing in the independent variables and maybe a few transformations chosen by hand. I can instead write down some simultaneousequations/structuralequationmodels, but while it’s usually obvious what to do for k<4 and if it’s not I can compare the possible variants, 4 variables is questionable what the right SEM is, and >5, it’s hopeless. Factor analysis to extract some latent variables is a possibility, but the more general solution here seems to be probabilistic graphical models such as Bayesian networks.
I thought I’d try out some Bayes net inference on some of my datasets. In this case, I have ~150 daily measurements from my Omron body composition scale, measuring total weight, body fat percentage, and some other things (see an Omron manual):
 Total weight
 BMI
 Body fat percentage
 Muscle percentage
 Resting metabolism in calories
 “Body age”
 Visceral fat index
The 7 variables are interrelated, so this is definitely a case where a simple lm
is not going to do the trick. It’s also not 100% clear how to set up a SEM; some definitions are obvious (the muchcriticized BMI is going to be determined solely by total weight, muscle and fat percentage might be inversely related) but others are not (how does “visceral fat” relate to body fat?). And it’s not a hopelessly small amount of data.
The Bayes net R library I’m trying out is bnlearn
(paper).
library(bnlearn)
# https://www.dropbox.com/s/4nsrszm85m47272/20150322gwernweight.csv
weight < read.csv("selfexperiment/weight.csv")
weight$Date < NULL; weight$Weight.scale < NULL
# remove missing data
weightC < na.omit(weight)
# bnlearn can't handle integers, oddly enough
weightC < as.data.frame(sapply(weightC, as.numeric))
summary(weightC)
# Weight.Omron Weight.BMI Weight.body.fat Weight.muscle
# Min. :193.0000 Min. : 26.90000 Min. :27.00000 Min. :32.60000
# 1st Qu.:195.2000 1st Qu.: 27.20000 1st Qu.:28.40000 1st Qu.:34.20000
# Median :196.4000 Median : 27.40000 Median :28.70000 Median :34.50000
# Mean :196.4931 Mean : 28.95409 Mean :28.70314 Mean :34.47296
# 3rd Qu.:197.8000 3rd Qu.: 27.60000 3rd Qu.:29.10000 3rd Qu.:34.70000
# Max. :200.6000 Max. : 28.00000 Max. :31.70000 Max. :35.50000
# Weight.resting.metabolism Weight.body.age Weight.visceral.fat
# Min. :1857.000 Min. :52.00000 Min. : 9.000000
# 1st Qu.:1877.000 1st Qu.:53.00000 1st Qu.:10.000000
# Median :1885.000 Median :53.00000 Median :10.000000
# Mean :1885.138 Mean :53.32704 Mean : 9.949686
# 3rd Qu.:1893.000 3rd Qu.:54.00000 3rd Qu.:10.000000
# Max. :1914.000 Max. :56.00000 Max. :11.000000
cor(weightC)
# Weight.Omron Weight.BMI Weight.body.fat Weight.muscle
# Weight.Omron 1.00000000000 0.98858376919 0.1610643221 0.06976934825
# Weight.BMI 0.98858376919 1.00000000000 0.1521872557 0.06231142104
# Weight.body.fat 0.16106432213 0.15218725566 1.0000000000 0.98704369855
# Weight.muscle 0.06976934825 0.06231142104 0.9870436985 1.00000000000
# Weight.resting.metabolism 0.96693236051 0.95959140245 0.0665001241 0.15621294274
# Weight.body.age 0.82581939626 0.81286141659 0.5500409365 0.47408608681
# Weight.visceral.fat 0.41542744168 0.43260100665 0.2798756916 0.25076619829
# Weight.resting.metabolism Weight.body.age Weight.visceral.fat
# Weight.Omron 0.9669323605 0.8258193963 0.4154274417
# Weight.BMI 0.9595914024 0.8128614166 0.4326010067
# Weight.body.fat 0.0665001241 0.5500409365 0.2798756916
# Weight.muscle 0.1562129427 0.4740860868 0.2507661983
# Weight.resting.metabolism 1.0000000000 0.7008354776 0.3557229425
# Weight.body.age 0.7008354776 1.0000000000 0.4840752389
# Weight.visceral.fat 0.3557229425 0.4840752389 1.0000000000
## create alternate dataset expressing the two percentage variables as pounds, since this might fit better
weightC2 < weightC
weightC2$Weight.body.fat < weightC2$Weight.Omron * (weightC2$Weight.body.fat / 100)
weightC2$Weight.muscle < weightC2$Weight.Omron * (weightC2$Weight.muscle / 100)
Begin analysis:
pdap < hc(weightC)
pdapc2 < hc(weightC2)
## bigger is better:
score(pdap, weightC)
# [1] 224.2563072
score(pdapc2, weightC2)
# [1] 439.7811072
## stick with the original, then
pdap
# Bayesian network learned via Scorebased methods
#
# model:
# [Weight.Omron][Weight.body.fat][Weight.BMIWeight.Omron]
# [Weight.resting.metabolismWeight.Omron:Weight.body.fat]
# [Weight.body.ageWeight.Omron:Weight.body.fat]
# [Weight.muscleWeight.body.fat:Weight.resting.metabolism][Weight.visceral.fatWeight.body.age]
# nodes: 7
# arcs: 8
# undirected arcs: 0
# directed arcs: 8
# average markov blanket size: 2.57
# average neighbourhood size: 2.29
# average branching factor: 1.14
#
# learning algorithm: HillClimbing
# score: BIC (Gauss.)
# penalization coefficient: 2.534452101
# tests used in the learning procedure: 69
# optimized: TRUE
plot(pdap)
## https://i.imgur.com/nipmqta.png
This inferred graph is obviously wrong in several respects, violating prior knowledge about some of the relationships.
More specifically, my prior knowledge:
Weight.Omron
== total weight; should be influenced byWeight.body.fat
(%),Weight.muscle
(%), &Weight.visceral.fat
Weight.visceral.fat
: ordinal variable, <=9 = normal; 1014 = high; 15+ = very high; from the Omron manual:Visceral fat area (0—approx. 300 cm , 1 inch=2.54 cm) distribution with 30 levels. NOTE: Visceral fat levels are relative and not absolute values.
Weight.BMI
: BMI is a simple function of total weight & height (specificallyBMI = round(weight / height^2)
), so it should be influenced only byWeight.Omron
, and influence nothing elseWeight.body.age
: should be influenced byWeight.Omron
,Weight.body.fat
, andWeight.muscle
, based on the description in the manual:Body age is based on your resting metabolism. Body age is calculated by using your weight, body fat percentage and skeletal muscle percentage to produce a guide to whether your body age is above or below the average for your actual age.
Weight.resting.metabolism
: a function of the others, but I’m not sure which exactly; manual talks about what resting metabolism is generically and specifies it has the range “385 to 3999 kcal with 1 kcal increments”; https://en.wikipedia.org/wiki/Basal_metabolic_rate suggests the Omron may be using one of several approximation equations based on age/sex/height/weight, but it might also be using lean body mass as well.
Unfortunately, bnlearn doesn’t seem to support any easy way of encoding the prior knowledge—for example, you can’t say ‘no outgoing arrows from node X’—so I iterate, adding bad arrows to the blacklist.
Which arrows violate prior knowledge?
[Weight.visceral.fatWeight.body.age]
(read backwards, asWeight.body.age ~> Weight.visceral.fat
)[Weight.muscleWeight.resting.metabolism]
Retry, blacklisting those 2 arrows:
New violations:
[Weight.visceral.fatWeight.BMI]
[Weight.muscleWeight.Omron]
New violations:
[Weight.visceral.fatWeight.Omron]
[Weight.muscleWeight.BMI]
One violation:
[Weight.muscleWeight.body.age]
pdap5 < hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron", "Weight.Omron", "Weight.BMI", "Weight.body.age"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.muscle")))
# Bayesian network learned via Scorebased methods
#
# model:
# [Weight.body.fat][Weight.muscleWeight.body.fat][Weight.visceral.fatWeight.body.fat]
# [Weight.OmronWeight.visceral.fat][Weight.BMIWeight.Omron]
# [Weight.resting.metabolismWeight.Omron:Weight.body.fat]
# [Weight.body.ageWeight.Omron:Weight.body.fat]
# nodes: 7
# arcs: 8
# undirected arcs: 0
# directed arcs: 8
# average markov blanket size: 2.57
# average neighbourhood size: 2.29
# average branching factor: 1.14
#
# learning algorithm: HillClimbing
# score: BIC (Gauss.)
# penalization coefficient: 2.534452101
# tests used in the learning procedure: 62
# optimized: TRUE
plot(pdap5)
## https://i.imgur.com/nxCfmYf.png
## implementing all the prior knowledge cost ~30:
score(pdap5, weightC)
# [1] 254.6061724
No violations, so let’s use the network and estimate the specific parameters:
fit < bn.fit(pdap5, weightC); fit
# Bayesian network parameters
#
# Parameters of node Weight.Omron (Gaussian distribution)
#
# Conditional density: Weight.Omron  Weight.visceral.fat
# Coefficients:
# (Intercept) Weight.visceral.fat
# 169.181651376 2.744954128
# Standard deviation of the residuals: 1.486044472
#
# Parameters of node Weight.BMI (Gaussian distribution)
#
# Conditional density: Weight.BMI  Weight.Omron
# Coefficients:
# (Intercept) Weight.Omron
# 0.3115772322 0.1411044216
# Standard deviation of the residuals: 0.03513413381
#
# Parameters of node Weight.body.fat (Gaussian distribution)
#
# Conditional density: Weight.body.fat
# Coefficients:
# (Intercept)
# 28.70314465
# Standard deviation of the residuals: 0.644590085
#
# Parameters of node Weight.muscle (Gaussian distribution)
#
# Conditional density: Weight.muscle  Weight.body.fat
# Coefficients:
# (Intercept) Weight.body.fat
# 52.1003347352 0.6141270921
# Standard deviation of the residuals: 0.06455478599
#
# Parameters of node Weight.resting.metabolism (Gaussian distribution)
#
# Conditional density: Weight.resting.metabolism  Weight.Omron + Weight.body.fat
# Coefficients:
# (Intercept) Weight.Omron Weight.body.fat
# 666.910582196 6.767607964 3.886694779
# Standard deviation of the residuals: 1.323176507
#
# Parameters of node Weight.body.age (Gaussian distribution)
#
# Conditional density: Weight.body.age  Weight.Omron + Weight.body.fat
# Coefficients:
# (Intercept) Weight.Omron Weight.body.fat
# 32.2651379176 0.3603672788 0.5150134225
# Standard deviation of the residuals: 0.2914301529
#
# Parameters of node Weight.visceral.fat (Gaussian distribution)
#
# Conditional density: Weight.visceral.fat  Weight.body.fat
# Coefficients:
# (Intercept) Weight.body.fat
# 6.8781100009 0.1070118125
# Standard deviation of the residuals: 0.2373649058
## residuals look fairly good, except for Weight.resting.metabolism, where there are some extreme residuals in what looks a bit like a sigmoid sort of pattern, suggesting nonlinearities in the Omron scale's formula?
bn.fit.qqplot(fit)
## https://i.imgur.com/mSallOv.png
We can doublecheck the estimates here by turning the Bayes net model into a SEM and seeing how the estimates compare, and also seeing if the pvalues suggest we’ve found a good model:
library(lavaan)
Weight.model1 < '
Weight.visceral.fat ~ Weight.body.fat
Weight.Omron ~ Weight.visceral.fat
Weight.BMI ~ Weight.Omron
Weight.body.age ~ Weight.Omron + Weight.body.fat
Weight.muscle ~ Weight.body.fat
Weight.resting.metabolism ~ Weight.Omron + Weight.body.fat
'
Weight.fit1 < sem(model = Weight.model1, data = weightC)
summary(Weight.fit1)
# lavaan (0.516) converged normally after 139 iterations
#
# Number of observations 159
#
# Estimator ML
# Minimum Function Test Statistic 71.342
# Degrees of freedom 7
# Pvalue (Chisquare) 0.000
#
# Parameter estimates:
#
# Information Expected
# Standard Errors Standard
#
# Estimate Std.err Zvalue P(>z)
# Regressions:
# Weight.visceral.fat ~
# Weight.bdy.ft 0.107 0.029 3.676 0.000
# Weight.Omron ~
# Wght.vscrl.ft 2.745 0.477 5.759 0.000
# Weight.BMI ~
# Weight.Omron 0.141 0.002 82.862 0.000
# Weight.body.age ~
# Weight.Omron 0.357 0.014 25.162 0.000
# Weight.bdy.ft 0.516 0.036 14.387 0.000
# Weight.muscle ~
# Weight.bdy.ft 0.614 0.008 77.591 0.000
# Weight.resting.metabolism ~
# Weight.Omron 6.730 0.064 104.631 0.000
# Weight.bdy.ft 3.860 0.162 23.837 0.000
#
# Covariances:
# Weight.BMI ~~
# Weight.body.g 0.000 0.001 0.116 0.907
# Weight.muscle 0.000 0.000 0.216 0.829
# Wght.rstng.mt 0.005 0.004 1.453 0.146
# Weight.body.age ~~
# Weight.muscle 0.001 0.001 0.403 0.687
# Wght.rstng.mt 0.021 0.030 0.700 0.484
# Weight.muscle ~~
# Wght.rstng.mt 0.007 0.007 1.003 0.316
#
# Variances:
# Wght.vscrl.ft 0.056 0.006
# Weight.Omron 2.181 0.245
# Weight.BMI 0.001 0.000
# Weight.body.g 0.083 0.009
# Weight.muscle 0.004 0.000
# Wght.rstng.mt 1.721 0.193
Comparing the coefficients by eye, they tend to be quite close (usually within 0.1) and the pvalues are all statisticallysignificant.
The network itself looks right, although some of the edges are surprises: I didn’t know visceral fat was predictable from body fat (I thought they were measuring separate things), and the relative independence of muscle suggests that in any exercise plan I might be better off focusing on the body fat percentage rather than the muscle percentage since the former may be effectively determining the latter.
So what did I learn here?
learning network structure and direction of arrows is hard; even with only 7 variables and n=159 (accurate clean data), the hillclimbing algorithm will learn at least 7 wrong arcs.
 and the derived graphs depend disturbingly heavily on choice of algorithm; I used the
hc
hillclimbing algorithm (since I’m lazy and didn’t want to specify arrow directions), but when I try out the alternatives likeiamb
on the same data & blacklist, the found graph looks rather different
 and the derived graphs depend disturbingly heavily on choice of algorithm; I used the
Gaussians are, as always, sensitive to outliers: I was surprised the first graph didn’t show BMI connected to anything, so I took a closer look and found I had miscoded a BMI of 28 as 280!
bnlearn
, while not as hard to use as I expected, could still use usability improvements: I should not need to coerce integer data into exactly equivalent numeric types just becausebnlearn
doesn’t recognize integers; and blacklisting/whitelisting needs to be more powerful—iteratively generating graphs and manually inspecting and manually blacklisting is tedious and does not scale hence, it may make more sense to find a graph using
bnlearn
and then convert it into simultaneousequations and manipulate it using more mature SEM libraries
 hence, it may make more sense to find a graph using
Zeo sleep data
Here I look at my Zeo sleep data; more variables, more complex relations, and more unknown ones, but on the positive side, ~12x more data to work with.
zeo < read.csv("~/wiki/docs/zeo/gwernzeodata.csv")
zeo$Sleep.Date < as.Date(zeo$Sleep.Date, format="%m/%d/%Y")
## convert "05/12/2014 06:45" to "06:45"
zeo$Start.of.Night < sapply(strsplit(as.character(zeo$Start.of.Night), " "), function(x) { x[2] })
## convert "06:45" to 24300
interval < function(x) { if (!is.na(x)) { if (grepl(" s",x)) as.integer(sub(" s","",x))
else { y < unlist(strsplit(x, ":")); as.integer(y[[1]])*60 + as.integer(y[[2]]); }
}
else NA
}
zeo$Start.of.Night < sapply(zeo$Start.of.Night, interval)
## correct for the switch to new unencrypted firmware in March 2013;
## I don't know why the new firmware subtracts 15 hours
zeo[(zeo$Sleep.Date >= as.Date("20130311")),]$Start.of.Night < (zeo[(zeo$Sleep.Date >= as.Date("20130311")),]$Start.of.Night + 900) %% (24*60)
## after midnight (24*60=1440), Start.of.Night wraps around to 0, which obscures any trends,
## so we'll map anything before 7AM to time+1440
zeo[zeo$Start.of.Night<420 & !is.na(zeo$Start.of.Night),]$Start.of.Night < (zeo[zeo$Start.of.Night<420 & !is.na(zeo$Start.of.Night),]$Start.of.Night + (24*60))
zeoSmall < subset(zeo, select=c(ZQ,Total.Z,Time.to.Z,Time.in.Wake,Time.in.REM,Time.in.Light,Time.in.Deep,Awakenings,Start.of.Night,Morning.Feel))
zeoClean < na.omit(zeoSmall)
# bnlearn doesn't like the 'integer' class that most of the dataframe is in
zeoClean < as.data.frame(sapply(zeoClean, as.numeric))
Prior knowledge:
Start.of.Night
is temporally first, and cannot be causedTime.to.Z
is temporally second, and can be influenced byStart.of.Night
(likely a connection between how late I go to bed and how fast I fall asleep) &Time.in.Wake
(since if it takes 10 minutes to fall asleep, I must spend ≥10 minutes in wake) but not othersMorning.Feel
is temporally last, and cannot cause anythingZQ
is a synthetic variable invented by Zeo according to an opaque formula, which cannot cause anything but is determined by othersTotal.Z
should be the sum ofTime.in.Light
,Time.in.REM
, andTime.in.Deep
Awakenings
should have an arrow withTime.in.Wake
but it’s not clear which way it should run
library(bnlearn)
## after a bunch of iteration, blacklisting arrows which violate the prior knowledge
bl < data.frame(from=c("Morning.Feel", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "Time.in.REM", "Time.in.Light", "Time.in.Deep", "Morning.Feel", "Awakenings", "Time.in.Light", "Morning.Feel", "Morning.Feel","Total.Z", "Time.in.Wake", "Time.to.Z", "Total.Z", "Total.Z", "Total.Z"),
to=c("Start.of.Night", "Total.Z", "Time.in.Wake", "Time.in.REM", "Time.in.Deep", "Morning.Feel","Start.of.Night", "Start.of.Night","Start.of.Night","Start.of.Night", "Time.to.Z", "Time.to.Z", "Time.to.Z", "Total.Z", "Time.in.Wake","Time.to.Z","Time.to.Z", "Start.of.Night", "Time.in.Deep", "Time.in.REM", "Time.in.Light"))
zeo.hc < hc(zeoClean, blacklist=bl)
zeo.iamb < iamb(zeoClean, blacklist=bl)
## problem: undirected arc: Time.in.Deep/Time.in.REM; since hc inferred [Time.in.DeepTime.in.REM], I'll copy that for iamb:
zeo.iamb < set.arc(zeo.iamb, from = "Time.in.REM", to = "Time.in.Deep")
zeo.gs < gs(zeoClean, blacklist=bl)
## same undirected arc:
zeo.gs < set.arc(zeo.gs, from = "Time.in.REM", to = "Time.in.Deep")
## Bigger is better:
score(zeo.iamb, data=zeoClean)
# [1] 44776.79185
score(zeo.gs, data=zeoClean)
# [1] 44776.79185
score(zeo.hc, data=zeoClean)
# [1] 44557.6952
## hc scores best, so let's look at it:
zeo.hc
# Bayesian network learned via Scorebased methods
#
# model:
# [Start.of.Night][Time.to.ZStart.of.Night][Time.in.LightTime.to.Z:Start.of.Night]
# [Time.in.REMTime.in.Light:Start.of.Night][Time.in.DeepTime.in.REM:Time.in.Light:Start.of.Night]
# [Total.ZTime.in.REM:Time.in.Light:Time.in.Deep][Time.in.WakeTotal.Z:Time.to.Z]
# [AwakeningsTime.to.Z:Time.in.Wake:Time.in.REM:Time.in.Light:Start.of.Night]
# [Morning.FeelTotal.Z:Time.to.Z:Time.in.Wake:Time.in.Light:Start.of.Night]
# [ZQTotal.Z:Time.in.Wake:Time.in.REM:Time.in.Deep:Awakenings]
# nodes: 10
# arcs: 28
# undirected arcs: 0
# directed arcs: 28
# average markov blanket size: 7.40
# average neighbourhood size: 5.60
# average branching factor: 2.80
#
# learning algorithm: HillClimbing
# score: BIC (Gauss.)
# penalization coefficient: 3.614556939
# tests used in the learning procedure: 281
# optimized: TRUE
plot(zeo.hc)
## https://i.imgur.com/nD3LXND.png
fit < bn.fit(zeo.hc, zeoClean); fit
#
# Bayesian network parameters
#
# Parameters of node ZQ (Gaussian distribution)
#
# Conditional density: ZQ  Total.Z + Time.in.Wake + Time.in.REM + Time.in.Deep + Awakenings
# Coefficients:
# (Intercept) Total.Z Time.in.Wake Time.in.REM Time.in.Deep Awakenings
# 0.12468522173 0.14197043518 0.07103211437 0.07053271816 0.21121000076 0.56476256303
# Standard deviation of the residuals: 0.3000223604
#
# Parameters of node Total.Z (Gaussian distribution)
#
# Conditional density: Total.Z  Time.in.Wake + Start.of.Night
# Coefficients:
# (Intercept) Time.in.Wake Start.of.Night
# 907.6406157850 0.4479377278 0.2680771514
# Standard deviation of the residuals: 68.90853885
#
# Parameters of node Time.to.Z (Gaussian distribution)
#
# Conditional density: Time.to.Z  Start.of.Night
# Coefficients:
# (Intercept) Start.of.Night
# 1.02898431407 0.01568450832
# Standard deviation of the residuals: 13.51606719
#
# Parameters of node Time.in.Wake (Gaussian distribution)
#
# Conditional density: Time.in.Wake  Time.to.Z
# Coefficients:
# (Intercept) Time.to.Z
# 14.7433880499 0.3289378711
# Standard deviation of the residuals: 19.0906685
#
# Parameters of node Time.in.REM (Gaussian distribution)
#
# Conditional density: Time.in.REM  Total.Z + Start.of.Night
# Coefficients:
# (Intercept) Total.Z Start.of.Night
# 120.62442964234 0.37864195651 0.06275760841
# Standard deviation of the residuals: 19.32560757
#
# Parameters of node Time.in.Light (Gaussian distribution)
#
# Conditional density: Time.in.Light  Total.Z + Time.in.REM + Time.in.Deep
# Coefficients:
# (Intercept) Total.Z Time.in.REM Time.in.Deep
# 0.6424267863 0.9997862624 1.0000587988 1.0001805537
# Standard deviation of the residuals: 0.5002896274
#
# Parameters of node Time.in.Deep (Gaussian distribution)
#
# Conditional density: Time.in.Deep  Total.Z + Time.in.REM
# Coefficients:
# (Intercept) Total.Z Time.in.REM
# 15.4961459056 0.1283622577 0.1187382535
# Standard deviation of the residuals: 11.90756843
#
# Parameters of node Awakenings (Gaussian distribution)
#
# Conditional density: Awakenings  Time.to.Z + Time.in.Wake + Time.in.REM + Time.in.Light + Start.of.Night
# Coefficients:
# (Intercept) Time.to.Z Time.in.Wake Time.in.REM Time.in.Light
# 18.41014329148 0.02605164827 0.05736596152 0.02291139969 0.01060661963
# Start.of.Night
# 0.01129521977
# Standard deviation of the residuals: 2.427868657
#
# Parameters of node Start.of.Night (Gaussian distribution)
#
# Conditional density: Start.of.Night
# Coefficients:
# (Intercept)
# 1413.382886
# Standard deviation of the residuals: 64.43144125
#
# Parameters of node Morning.Feel (Gaussian distribution)
#
# Conditional density: Morning.Feel  Total.Z + Time.to.Z + Time.in.Wake + Time.in.Light + Start.of.Night
# Coefficients:
# (Intercept) Total.Z Time.to.Z Time.in.Wake Time.in.Light
# 0.924662971061 0.004808652252 0.010127269154 0.008636841492 0.002766602019
# Start.of.Night
# 0.001672816480
# Standard deviation of the residuals: 0.7104115719
## some issues with big residuals at the extremes in the variables Time.in.Light, Time.in.Wake, and Time.to.Z;
## not sure how to fix those
bn.fit.qqplot(fit)
# https://i.imgur.com/fmP1ca0.png
library(lavaan)
Zeo.model1 < '
Time.to.Z ~ Start.of.Night
Time.in.Wake ~ Total.Z + Time.to.Z
Awakenings ~ Time.to.Z + Time.in.Wake + Time.in.REM + Time.in.Light + Start.of.Night
Time.in.Light ~ Time.to.Z + Start.of.Night
Time.in.REM ~ Time.in.Light + Start.of.Night
Time.in.Deep ~ Time.in.REM + Time.in.Light + Start.of.Night
Total.Z ~ Time.in.REM + Time.in.Light + Time.in.Deep
ZQ ~ Total.Z + Time.in.Wake + Time.in.REM + Time.in.Deep + Awakenings
Morning.Feel ~ Total.Z + Time.to.Z + Time.in.Wake + Time.in.Light + Start.of.Night
'
Zeo.fit1 < sem(model = Zeo.model1, data = zeoClean)
summary(Zeo.fit1)
# lavaan (0.516) converged normally after 183 iterations
#
# Number of observations 1379
#
# Estimator ML
# Minimum Function Test Statistic 22.737
# Degrees of freedom 16
# Pvalue (Chisquare) 0.121
#
# Parameter estimates:
#
# Information Expected
# Standard Errors Standard
#
# Estimate Std.err Zvalue P(>z)
# Regressions:
# Time.to.Z ~
# Start.of.Nght 0.016 0.006 2.778 0.005
# Time.in.Wake ~
# Total.Z 0.026 0.007 3.592 0.000
# Time.to.Z 0.314 0.038 8.277 0.000
# Awakenings ~
# Time.to.Z 0.026 0.005 5.233 0.000
# Time.in.Wake 0.057 0.003 16.700 0.000
# Time.in.REM 0.023 0.002 10.107 0.000
# Time.in.Light 0.011 0.002 6.088 0.000
# Start.of.Nght 0.011 0.001 10.635 0.000
# Time.in.Light ~
# Time.to.Z 0.348 0.085 4.121 0.000
# Start.of.Nght 0.195 0.018 10.988 0.000
# Time.in.REM ~
# Time.in.Light 0.358 0.018 19.695 0.000
# Start.of.Nght 0.034 0.013 2.725 0.006
# Time.in.Deep ~
# Time.in.REM 0.081 0.012 6.657 0.000
# Time.in.Light 0.034 0.009 3.713 0.000
# Start.of.Nght 0.017 0.006 3.014 0.003
# Total.Z ~
# Time.in.REM 1.000 0.000 2115.859 0.000
# Time.in.Light 1.000 0.000 2902.045 0.000
# Time.in.Deep 1.000 0.001 967.322 0.000
# ZQ ~
# Total.Z 0.142 0.000 683.980 0.000
# Time.in.Wake 0.071 0.000 155.121 0.000
# Time.in.REM 0.071 0.000 167.090 0.000
# Time.in.Deep 0.211 0.001 311.454 0.000
# Awakenings 0.565 0.003 178.407 0.000
# Morning.Feel ~
# Total.Z 0.005 0.001 8.488 0.000
# Time.to.Z 0.010 0.001 6.948 0.000
# Time.in.Wake 0.009 0.001 8.592 0.000
# Time.in.Light 0.003 0.001 2.996 0.003
# Start.of.Nght 0.002 0.000 5.414 0.000
Again no major surprises, but one thing I notice is that ZQ
does not seem to connect to Time.in.Light
, though Time.in.Light
does connect to Morning.Feel
; I’ve long suspected that ZQ
is a flawed summary and thought it was insufficiently taking into account wakes or something else, so it looks like it’s Time.in.Light
specifically which is missing. Start.of.night
also is more highly connected than I had expected.
Comparing graphs from the 3 algorithms, they don’t seem to differ as badly as the weight ones did. Is this thanks to the much greater data or the constraints?
Genome sequencing costs
# http://www.genome.gov/sequencingcosts/
# http://www.genome.gov/pages/der/sequencing_costs_apr2014.xls
# converted to CSV & deleted cost per base (less precision); CSV looks like:
# https://dl.dropboxusercontent.com/u/182368464/sequencing_costs_apr2014.csv
## Date, Cost per Genome
## Sep01,"$95,263,072"
## ...
sequencing < read.csv("sequencing_costs_apr2014.csv")
sequencing$Cost.per.Genome < as.integer(gsub(",", "", sub("\\$", "", as.character(sequencing$Cost.per.Genome))))
# interpret monthyears as first of month:
sequencing$Date < as.Date(paste0("01", as.character(sequencing$Date)), format="%d%b%y")
head(sequencing)
## Date Cost.per.Genome
## 1 20010901 95263072
## 2 20020301 70175437
## 3 20020901 61448422
## 4 20030301 53751684
## 5 20031001 40157554
## 6 20040101 28780376
l < lm(log(Cost.per.Genome) ~ Date, data=sequencing); summary(l)
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 50.969823683 1.433567932 35.5545 < 2.22e16
## Date 0.002689621 0.000101692 26.4486 < 2.22e16
##
## Residual standard error: 0.889707 on 45 degrees of freedom
## Multiple Rsquared: 0.939559, Adjusted Rsquared: 0.938216
## Fstatistic: 699.528 on 1 and 45 DF, pvalue: < 2.22e16
plot(log(Cost.per.Genome) ~ Date, data=sequencing)
## https://i.imgur.com/3XK8i0h.png
# as expected: linear in log (Moore's law) 20022008, sudden drop, return to Moore's lawish ~December 2011?
# but on the other hand, maybe the postDecember 2011 behavior is a continuation of the curve
library(segmented)
# 2 breakpoints / 3 segments:
piecewise < segmented(l, seg.Z=~Date, psi=list(Date=c(13970, 16071)))
summary(piecewise)
## Estimated BreakPoint(s):
## Est. St.Err
## psi1.Date 12680 1067.0
## psi2.Date 13200 279.8
##
## t value for the gapvariable(s) V: 0 0 2
##
## Meaningful coefficients of the linear terms:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 35.841699121 8.975628264 3.99322 0.00026387
## Date 0.001504431 0.000738358 2.03754 0.04808491
## U1.Date 0.000679538 0.002057940 0.33020 NA
## U2.Date 0.002366688 0.001926528 1.22847 NA
##
## Residual standard error: 0.733558 on 41 degrees of freedom
## Multiple RSquared: 0.962565, Adjusted Rsquared: 0.958
with(sequencing, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewise, add=T)
## https://i.imgur.com/HSRqkJO.png
# The first two segments look fine, but the residuals are clearly bad for the third linesegment:
# it undershoots (damaging the second segment's fit), overshoots, then undershoots again. Let's try again with more breakpoints:
lots < segmented(l, seg.Z=~Date, psi=list(Date=NA), control=seg.control(stop.if.error=FALSE, n.boot=0))
summary(segmented(l, seg.Z=~Date, psi=list(Date=as.Date(c(12310, 12500, 13600, 13750, 14140, 14680, 15010, 15220), origin = "19700101", tz = "EST"))))
# delete every breakpoint below tvalue of ~2.3, for 3 breakpoints / 4 segments:
piecewise2 < segmented(l, seg.Z=~Date, psi=list(Date=as.Date(c("20070825","20080918","20100312"))))
with(sequencing, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewise2, add=T)
# the additional breakpoint is used up on a better fit in the curve. It looks like an exponential decay/asymptote,
# so let's work on fitting that part of the graph, the post2007 curve:
sequencingRecent < sequencing[sequencing$Date>as.Date("20071001"),]
lR < lm(log(Cost.per.Genome) ~ Date, data=sequencingRecent); summary(lR)
piecewiseRecent < segmented(lR, seg.Z=~Date, psi=list(Date=c(14061, 16071))); summary(piecewiseRecent)
## Estimated BreakPoint(s):
## Est. St.Err
## psi1.Date 14290 36.31
## psi2.Date 15290 48.35
##
## t value for the gapvariable(s) V: 0 0
##
## Meaningful coefficients of the linear terms:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 1.13831e+02 6.65609e+00 17.10182 2.0951e13
## Date 7.13247e03 4.73332e04 15.06865 2.2121e12
## U1.Date 4.11492e03 4.94486e04 8.32161 NA
## U2.Date 2.48613e03 2.18528e04 11.37668 NA
##
## Residual standard error: 0.136958 on 20 degrees of freedom
## Multiple RSquared: 0.995976, Adjusted Rsquared: 0.994971
with(sequencingRecent, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewiseRecent, add=T)
lastPiece < lm(log(Cost.per.Genome) ~ Date, data=sequencingRecent[as.Date(15290, origin = "19700101", tz = "EST")<sequencingRecent$Date,]); summary(lastPiece)
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 17.012409648 1.875482507 9.07095 1.7491e05
## Date 0.000531621 0.000119056 4.46528 0.0020963
##
## Residual standard error: 0.0987207 on 8 degrees of freedom
## Multiple Rsquared: 0.71366, Adjusted Rsquared: 0.677867
with(sequencingRecent[as.Date(15290, origin = "19700101", tz = "EST") < sequencingRecent$Date,],
plot(Date, log(Cost.per.Genome), pch=16)); abline(lastPiece)
predictDays < seq(from=sequencing$Date[1], to=as.Date("20301201"), by="month")
lastPiecePredict < data.frame(Date = predictDays, Cost.per.Genome=c(sequencing$Cost.per.Genome, rep(NA, 305)), Cost.per.Genome.predicted = exp(predict(lastPiece, newdata = data.frame(Date = predictDays))))
nlmR < nls(log(Cost.per.Genome) ~ SSasymp(as.integer(Date), Asym, r0, lrc), data=sequencingRecent); summary(nlmR)
##
## Parameters:
## Estimate Std. Error t value Pr(>t)
## Asym 7.88908e+00 1.19616e01 65.95328 <2e16
## r0 1.27644e+08 1.07082e+08 1.19203 0.2454
## lrc 6.72151e+00 5.05221e02 133.04110 <2e16
##
## Residual standard error: 0.150547 on 23 degrees of freedom
with(sequencingRecent, plot(Date, log(Cost.per.Genome))); lines(sequencingRecent$Date, predict(nlmR), col=2)
# side by side:
with(sequencingRecent, plot(Date, log(Cost.per.Genome), pch=16))
plot(piecewiseRecent, add=TRUE, col=2)
lines(sequencingRecent$Date, predict(nlmR), col=3)
# as we can see, the 3piece linear fit and the exponential decay fit identically;
# but exponential decay is more parsimonious, IMO, so I prefer that.
predictDays < seq(from=sequencingRecent$Date[1], to=as.Date("20201201"), by="month")
data.frame(Date = predictDays, Cost.per.Genome.predicted = exp(predict(nlmR, newdata = data.frame(Date = predictDays))))
http://www.unz.com/gnxp/theintelofsequencing/#comment677904 https://biomickwatson.wordpress.com/2015/03/25/thecostofsequencingisstillgoingdown/
Genome sequencing historically has dropped in price ~18% per year. Consider this simple scenario: if we have a fixed amount of money to spend buying genomes, and we can afford to buy 1 genome in the first year, then the next year we can buy 1.21 genomes, then 1.48 genomes and so on and in 30 years we can afford to buy 385 genomes each year. The number we can afford in year x is:
sapply(0:30, function(x) 1/(0.82^x))
# [1] 1.000000000 1.219512195 1.487209994 1.813670724 2.211793566 2.697309227 3.289401497 4.011465240 4.892030780
# [10] 5.965891196 7.275477068 8.872533010 10.820162207 13.195319764 16.091853371 19.624211428 23.931965156 29.185323361
# [19] 35.591857758 43.404704583 52.932566564 64.551910444 78.721842005 96.002246348 117.075910180 142.775500220 174.116463682
# [28] 212.337150832 258.947744917 315.789932826 385.109674178
Genomes are unlike computation, though, as they are data rather than an ephemeral service. Each genome is still useful and accumulates in a database. How many genomes total do we have each year? Quite a lot:
cumsum(sapply(0:30, function(x) 1/(0.82^x)))
# [1] 1.000000000 2.219512195 3.706722189 5.520392914 7.732186480 10.429495707 13.718897204 17.730362444
# [9] 22.622393224 28.588284420 35.863761488 44.736294497 55.556456704 68.751776468 84.843629839 104.467841268
# [17] 128.399806424 157.585129785 193.176987543 236.581692126 289.514258690 354.066169134 432.788011139 528.790257487
# [25] 645.866167667 788.641667886 962.758131569 1175.095282401 1434.043027318 1749.832960144 2134.942634322
While initially there’s not much of a pile to concern ourselves with, eventually we have 2000+ genomes while still only producing <400 genomes that year, a factor of 5 difference. (As it happens, if you consider UKBB at n=500k produced as a single investment 20122017, 23andMe in 2017 is reportedly n=22.5m, so this 5x multiplier is about right.)
23andMe started back in 2007 or so offering $1000 SNP panels to a few thousand people, growing to ~1m by 8 years later in July 2015. To reproduce that in this model of constant investment we start with a base of 56k SNPs purchased per year, growing according to the cost decrease:
cumsum(sapply(0:7, function(x) (56000*1)/(0.82^x)))
# [1] 56000.0000 124292.6829 207576.4426 309142.0032 433002.4429 584051.7596 768258.2434 992900.2969
What does that yield by 10 years later (2017) or 20 years later (2027)? It yields: 1.6m (1,600,943) and 16.2m (16,212,798) respectively.
Even if we assumed that annual genomes/SNPs leveled off in 2017, the linear increase pushes us into the millions range rapidly:
annualStagnation < sapply(0:30, function(x) min(334089, (56000*1)/(0.82^x)))
cumsum(annualStagnation)
# [1] 56000.0000 124292.6829 207576.4426 309142.0032 433002.4429 584051.7596 768258.2434 992900.2969 1266854.0206 1600943.0206
# [11] 1935032.0206 2269121.0206 2603210.0206 2937299.0206 3271388.0206 3605477.0206 3939566.0206 4273655.0206 4607744.0206 4941833.0206
# [21] 5275922.0206 5610011.0206 5944100.0206 6278189.0206 6612278.0206 6946367.0206 7280456.0206 7614545.0206 7948634.0206 8282723.0206
# [31] 8616812.0206
data.frame(Year=2007:2037, total=round(totalStagnation))
# Year total
# 2007 56000
# 2008 124293
# 2009 207576
# 2010 309142
# 2011 433002
# 2012 584052
# 2013 768258
# 2014 992900
# 2015 1266854
# 2016 1600943
# 2017 1935032
# 2018 2269121
# 2019 2603210
# 2020 2937299
# 2021 3271388
# 2022 3605477
# 2023 3939566
# 2024 4273655
# 2025 4607744
# 2026 4941833
# 2027 5275922
# 2028 5610011
# 2029 5944100
# 2030 6278189
# 2031 6612278
# 2032 6946367
# 2033 7280456
# 2034 7614545
# 2035 7948634
# 2036 8282723
# 2037 8616812
So even if no additional funds per year start getting spent on genomics despite the increasing utility and the cost curve remains the same, the cumulative number of SNPs or wholegenomes will increase drastically over the next 30 years. Genomes on their own have many uses, such as detecting human evolution, allowing better imputation panels, inferring population structure, counting variants, detecting particularly lethal mutations etc, but of course their main use is trait prediction. Given the increases, we would expect large enough n for Hsu’s lasso to undergo phase transition and recover nearly the full SNP heritability (see pointestimates for various traits); the bottleneck increasingly will not be genomes but phenotypic measurements.
Proposal: handcounting mobile app for more fluid group discussions
Groups use voting for decisionmaking, but existing vote systems are cumbersome. Handraising is faster, but does not scale because handcounting hands is slow. Advances in machine vision may make it possible for AI to count hands in photos accurately. Combined with a smartphone’s camera, this could yield an app for fast voting in even large groups.
Mediumlarge (>10 people) groups face a problem in reaching consensus: ballot or penandpaper voting is sufficiently slow and clunky that it is too costly to use for anything but the most important discussions. A group is forced to adopt other discussion norms and save a formal vote for only the final decision, and even then the long delay kills a lot of enthusiasm and interest. Voting could be used for many more decisions if it could be faster, and of course all existing group votes would benefit from increased speed. (I am reminded of anime conventions and film festivals where, particularly for short films such as AMVs, one seems to spend more time filling out a ballot & passing them along aisles & the staff painfully counting through each ballot by hand than one actually spends watching the media in question!)
It would be better if voting could be as fluent and easy as simply raising your hand like in a small group such as a classroom—a mechanism which makes it so easy to vote that votes can be held as fast as the alternatives can be spoken aloud and a glance suffices to count (an alert group could vote on 2 or 3 topics in the time it takes to read this sentence). But handraising, as great as it is, suffers from the flaw that it does not scale due to the counting problem: a group of 500 people can raise their hands as easily as a group of 50 or 5, but it takes far too long to count ~250 hands: the person counting will quickly tire of the tedium, they will make mistakes counting, and this puts a serious lag on each vote, a lag which increases linearly with the number of voters. (Hands can be easy to approximate if almost everyone votes for or against something, but if consensus is so overwhelming, one doesn’t need to vote in the first place! The hard case of almostbalanced votes is the most important case.)
One might suggest using an entirely different strategy: a website with HTML polls or little clicker gizmos like used in some college lectures to administer quick quizzes. This have the downsides that they require potentially expensive equipment (I used a clicker in one class and I think it cost at least $20, so if a convention wanted to use that in an audience of hundreds, that’s a major upfront cost & my experience was that clickers were unintuitive, did not always work, and slowed things down if anything; a website would only work if you assume everyone has smartphones and is willing to pull them out to use at an instance’s notice and of course that there’s working WiFi in the room, which cannot be taken for granted) and considerable overhead in explaining to everyone how it works and getting them on the same page and making sure every person who wanders in also gets the message. (If anyone is going to be burdened with understanding or using a new system, it should be the handful of conference/festival/group organizers, not the entire audience!) A simpler approach than hands would be speciallyprinted paper using, for example, QR codes like piCards, which can then be recognized by standard simple computer vision techniques; this is much cheaper than clickers but still requires considerable setup & inconvenience. It’s hard to imagine a film festival running using any system, and difficult to see these systems improving on penandpaper ballots which at least are cheap, relatively straightforward, and wellknown.
Handcounting really does seem like the best solution, if only the counting could be fixed. Counting is something computers do fast, so that is the germ of an idea. What if a smartphone could count the votes? You don’t want a smartphone app on the entire audiences’ phones, of course, since that’s even worse than having everyone go to a website to vote; but machine vision has made enormous strides in the 2000s2010s, reaching humanequivalent performance on challenging image recognition contests like ImageNet. (Machine vision is complicated, but the important thing is that it’s the kind of complicated which can be outsourced to someone else and turned into a deadeasytouse app, and the burden does not fall on the primary users—the audience.) What if the organizer had an app which took a photo of the entire audience with lifted arms and counted hands & faces and returned a vote count in a second?
Such an app would be ideal for any cultural, political, or organizational meeting. Now the flow for, eg, a film festival could go: [no explanation given to audience, one just starts] “OK, how many people liked the first short, ‘Vampire Deli’ by Ms Houston?” [everyone raises hand, smartphone flashes, 1s passes] “OK, 140 votes. How many liked the second short, ‘Cthulicious’ by Mr Iouston?” [raises hands, smartphone flashes, 1s passes] “OK… 130 people. Congratulations Ms Houston!” And so on.
Such an app might be considered an infeasible machine vision task, but I believe it could be feasible: facial localization is an old and wellstudied image recognition task (and effective algorithms are built into every consumer camera), hands/fingers have very distinct shapes, and both tasks seem easier than the subtle discriminations between, say, various dog breeds demanded of ImageNet contestants.
Specifically, one could implement the machine vision core as follows:
multilayer neural networks trained for one task can be repurposed to similar tasks by removing the highest layer and retraining on the new task, potentially reaping great performance gains as the hybrid network has already learned much of what it needs for the second task (“transfer learning”). So one could take a publicly available NN trained for ImageNet (such as AlexNet, available in caffe), remove the top two layers, and retrain on a dataset of audiences; this will perform better since the original NN has already learned how to detect edges, recognize faces, etc
The simpler task of counting crowds has already shown itself susceptible to deep learning: eg “Crossscene Crowd Counting via Deep Convolutional Neural Networks”.
raid Flickr and Google Images for CClicensed photos of audiences raising their arms; then one can manually count how many arms are raised (or outsource to Amazon Mechanical Turk). With the boost from a transferred convolutional deep network, one might get good performance with just a few thousand photos to train with. If each photo takes a minute to obtain and count, then one can create a useful corpus in a week or two of work.
train the NN, applying the usual data augmentation tricks to increase one’s meager corpus, trying out random hyperparameters, tweaking the architecture, etc
(Note that while NNs are very slow and computationally intensive to train, they are typically quite fast to run; the smartphone app would not be training a NN, which is indeed completely infeasible from a CPU and battery life standpoint—it is merely running the NN created by the original developer.)
with an accurate NN, one can wrap it in a mobile app framework. The UI, at the simplest, is simply a big button to press to take a photo, feed it into the NN, and display the count. Some additional features come to mind:
“headcount mode”: one may not be interested in a vote, but in how many people are in an audience (to estimate how popular a guest is, whether an event needs to move to a new bigger space, etc). If the NN can count faces and hands to estimate a vote count, it can simply report the count of faces instead.
the app should save every photo & count, both as an audit trail and also to support postvote recounts in case of disputes or a desire for a more definitive count
the reported count should come with an indication of the NN’s uncertainty/errorrate, so users are not misled by their little handheld oracle and so they can redo a vote if the choice is borderline; Bayesian methods, in which previous votes are drawn upon, might be relevant here.
 if the original photo could be annotated with graphical notes for each recognized/counted hand & face, this would let the user ‘see’ what the NN is thinking and would help build confidence a great deal
it should support manually entering in a votecount; if the manual count differs, then this indicates the NN made an error and the photo & count should be uploaded to the original developer so it can be added to the corpus and the NN’s performance fixed in future releases of the app
smartphone cameras may not be highresolution or have a sufficiently wide fieldofview to capture the entire audience at once; some sort of “montage mode” should exist so the user can swing the phone across the audience, bursts of shots taken, and the overlapping photos stitched together into a single audience photo which can be then fed into the NN as usual
a burst of photos might be superior to a single photo due to smartphone & hand movement blur; I don’t know if it’s best to try to combine the photos, run the NN multiple times and take the median, or feed multiple photos into the NN (perhaps by moving to a RNN architecture?)
the fullstrength NN might still be too slow and energyhungry to run pleasantly on a smartphone; there are model compression techniques for simplifying deep NNs to reduce the number of nodes or have fewer layers without losing much performance, which might be useful in this context (and indeed, were originally motivated by wanting to make speechrecognition run better on smartphones)
Given this breakdown, one might estimate building such an app as requiring, assuming one is already reasonably familiar with deep networks & writing mobile apps:
 1 week to find an ImageNet NN, learn how to modify it, and set it up to train on a fresh corpus
 3 weeks to create a corpus of <5000 photos with manuallylabeled hand counts
 5 weeks to train the NN (NNs as large as ImageNet NNs take weeks to train; depending on the GPU hardware one has access to and how many tweaks and hyperparameters one tries, 5 weeks could be drastically optimistic; but on the plus side, it’s mostly waiting as the GPUs suck electricity like crazy)
 5 weeks to make an intuitive simple app, submitting to an app store, etc
 These estimates are loose and probably too optimistic (although I would be surprised if it took a good developer more than 6 months to develop this app), but that would suggest >14 weeks or 784 hours of work for a developer, start to finish. (Even at minimum wage, this represents a substantial development cost of >$6k; at more plausible developer salaries, easily >$60k of salary.)
How large is the market for such an app? Groups such as anime conventions or anything on a college campus are cheapskates and would balk at a price higher than $4.99 (even if only 5 or 10 staffers need to buy it and it makes the experience much smoother). There are probably several hundred anime or video game conventions which might use this to vote, so that might be 1000 sales there. There’s easily 13,000 business conventions or conferences in the USA, which might not need voting so much, but would be attracted by a headcount mode to help optimize their event. This suggests perhaps $70k in sales with much less profit after the app store cut & taxes, much of which sales would probably be oneoffs as the user reuses it for each conference. So even a wild success, in which most events adopt use of such voting software, would barely recoup the development costs; as a product, it seems this is just too much of a niche unless one could develop it much faster (such as by finding an existing corpus of hands/photos, or be certain of banging out the mobile app in much less than I estimated), find a larger market (theaters for audience participation?), or increase price substantially (10x the price and aim at only businesses?).
Air conditioner replacement
Is my old air conditioner inefficient enough to replace? After calculating electricity consumption for it and a new air conditioner, with discounting, and with uncertainty in parameters evaluated by a Monte Carlo method, I conclude that the savings are too small by an order of magnitude to pay for a new replacement air conditioner.
I have an old Whirlpool air conditioner (AC) in my apartment, and as part of insulating and cooling my apartment, I’ve wondered if the AC should be replaced on energy efficiency grounds. Would a new AC save more than it costs upfront? What is the optimal decision here?
Initially I was balked in analysis because I couldn’t figure out what model it was, and thus anything about it like its energy efficiency. (No model number or name appears anywhere visible on it, and I’m not going to rip it out of the wall just to look at hidden parts.)
Parameters
So I began looking at all the old Whirlpool AC photographs in Google, and eventually I found one whose appearance exactly matches mine and which was released around when I think the AC was installed. The old AC is the “Whirlpool ACQ189XS” (official) (cost: $0, sunk cost), which is claimed to have an EER of 10.7.
For comparison, I browsed Amazon looking for highlyrated Energy Star AC models with at least 5000 BTU cooling power and costing $250$300, picking out the Sunpentown WA8022S 8000 BTU Window Air Conditioner ($271) with 11.3 EER. (Checking some other entries on Amazon, this is fairly representative on both cost & EER.)
Question: what is the electrical savings and hence the payback period of a new AC?
The efficiency unit here is the EER or energy efficiency ratio, defined as BTUs (amount of heat being moved by the AC) divided by watts consumed. Here we have ACs with 10.7 EER vs 11.2 EER; I need ~10k BTUs to keep the apartment cool (after fixing a lot of cracks, installing an attic fan and two box fans, putting tin foil over some windows, insulation under a floor etc), so the ACs will use up , and then x = 898 watts and 934 watts respectively.
(EER is a lot like miles per gallon/MPG as a measure of efficiency, and shares the same drawbacks: from a costperspective, EER/MPG don’t necessarily tell you what you want to know and can be misleading and harder to work with than if efficiency were reported as, say, gallons per mile. As watts per BTU or gallons per mile, it is easy to see that after a certain point, the cost differences have become absolutely small enough that improvements are not worth paying for. Going from 30 gallons of gas to 15 gallons of gas is worth more than going from 3 gallons to 1.5 gallons, even if the relative improvement is the same.)
So while operating, the two ACs will use 898 watts vs 934 watts or 0.89kWh vs 0.934kWh to cool; a difference of 36 watts or 0.036kWh.
Each kWh costs around $0.09 so the costdifference is $0.00324 per hour.
AC is on MaySeptember (5 months), and on almost all day although it only runs intermittently, so say a third of the day or 8 hours, for a total of 1200 hours of operation.
Costbenefit
Thus, then the annual benefit from switching to the new AC with 11.2 EER is or $3.9.
The cost is $271 amortized over n years. At $3.9 a year, it will take annually = 68 years to payback (ignoring breakage and discounting/interest/opportunitycost). This is not good.
Decision: do not replace.
Discounting
To bring in discounting/interest: For what annual payment (costsavings) would we be willing to pay the price of a new AC? More specifically, if it costs $271 and has an average payout period of 7 years, then at my usual annual discount rate of 5%, how much must each payout be?
r turns out to be ≥$46.83, which sounds about right. (Discounting penalizes future savings, so r should be greater than or $39, which it is.)
$47 is 12x larger than the estimated savings of $3.9, so the conclusion remains the same.
We could also work backward to figure out what EEC would justify an upgrade by treating it as an unknown e and solving for it; let’s say it must payback in 7 years (I doubt average AC lifetime is much longer) at least $271, with the same kWh & usage as before, what must the rival EEC be? as an equation:
and solving,
I am pretty sure there are no ACs with EER>20!
Another way to look at it: if a new good AC costs ~$300 and I expect it to last ~7 years, then that’s an annual cost of $43. The current AC’s total annual cost to run is or . So it’s immediately clear that the energy savings must be huge—half!—before it can hope to justify a new purchase.
Sensitivity analysis
The above analyses were done with pointestimates. It’s only fair to note that there’s a lot of uncertainty lurking in those estimates: $0.09 was just the median of the estimates I found for my state’s electricity rates, the AC might be on 4 or 6 months, the hours per day might be considerably higher (or lower) than my guess of 8 hours, 10.7 & 11.2 EERs are probably bestcase estimates and the real efficiencies lower (they’re always lower than nominal), the discount rate may be a percent lower or higher and so minimum savings would be off by as much as $4 in either direction, and so on. It would be good to do a bit of a sensitivity analysis to make sure that this is not being driven by any particular number. (Based on the definition, since it’s using mostly multiplication, the final value should be robust to considerable error in estimating each parameter, but you never know.) Throwing together my intuition for how much uncertainty is in each parameter and modeling most as normals, I can simulate my prior distribution of savings:
set.seed(20150726)
simulate < function() {
BTUs < rnorm(1, 10000, 100)
EER_old < 10.7  abs(rnorm(1, 0, 0.5)) # halfnormals because efficiencies only get worse, not better
EER_new < 11.2  abs(rnorm(1, 0, 0.5))
kWh < rnorm(1, 0.09, 0.01)
dailyUsage < rnorm(1, 8, 2)
months < sample (4:6, 1)
minimumSavings < rnorm(1, 47, 4)
annualNetSavings < ((((BTUs / EER_old )  (BTUs / EER_new)) / 1000) * kWh * dailyUsage * 30 * months)  minimumSavings
return(annualNetSavings)
}
sims < replicate(100000, simulate())
summary(sims)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 70.3666500 46.2051500 42.3764100 42.1133700 38.3134600 0.7334517
quantile(sims, p=c(0.025, 0.975))
## 2.5% 97.5%
## 53.59989114 29.13999204
Under every simulation, a new AC is a net loss. (Since we have no observed data to update our priors with, this is an exercise in probability, not Bayesian inference, and so there is no need to bring in JAGS.)
There are two choices: replace or not. The expectedvalue of a replacement is or $42, and the expectedvalue of not replacing is or $0; the latter is larger than the former, so we should choose the latter and not replace the old AC.
Hence we can be confident that not getting a new AC really is the right decision.
Some ways of dealing with measurement error
Prompted by a question on LessWrong, some examples of how to analyze noisy measurements in R:
## Create a simulated dataset with known parameters, and then run a ML multilevel model, a ML SEM,
## and a Bayesian multilevel model; with the last, calculate Expected Value of Sample Information (EVSI):
## SIMULATE
set.seed(20150811)
## "There is a variable X, x belongs to [0, 100]."
toplevel < rnorm(n=1, 50, 25)
## "There are n ways of measuring it, among them A and B are widely used."
## "For any given measurer, the difference between x(A) and x(B) can be up to 20 points."
A < toplevel + runif(1, min=10, max=10)
B < toplevel + runif(1, min=10, max=10)
c(toplevel, A, B)
# [1] 63.85938385 55.43608379 59.42333264
### the true level of X we wish to recover is '63.85'
## "Between two any measurers, x(A)1 and x(A)2 can differ on average 10 points, likewise with B."
### let's imagine 10 hypothetical points are sample using method A and method B
### assume 'differ on average 10 points' here means something like 'the standard deviation is 10'
A_1 < rnorm(n=10, mean=A, sd=10)
B_1 < rnorm(n=10, mean=B, sd=10)
data < rbind(data.frame(Measurement="A", Y=A_1), data.frame(Measurement="B", Y=B_1)); data
# Measurement Y
# 1 A 56.33870025
# 2 A 69.07267213
# 3 A 40.36889573
# 4 A 48.67289213
# 5 A 79.92622603
# 6 A 62.86919410
# 7 A 53.12953462
# 8 A 66.58894990
# 9 A 47.86296948
# 10 A 60.72416003
# 11 B 68.60203507
# 12 B 58.24702007
# 13 B 45.47895879
# 14 B 63.45308935
# 15 B 52.27724328
# 16 B 56.89783535
# 17 B 55.93598486
# 18 B 59.28162022
# 19 B 70.92341777
# 20 B 49.51360373
## MLM
## multilevel model approach:
library(lme4)
mlm < lmer(Y ~ (1Measurement), data=data); summary(mlm)
# Random effects:
# Groups Name Variance Std.Dev.
# Measurement (Intercept) 0.0000 0.000000
# Residual 95.3333 9.763877
# Number of obs: 20, groups: Measurement, 2
#
# Fixed effects:
# Estimate Std. Error t value
# (Intercept) 58.308250 2.183269 26.70685
confint(mlm)
# 2.5 % 97.5 %
# .sig01 0.000000000 7.446867736
# .sigma 7.185811525 13.444112087
# (Intercept) 53.402531768 63.213970887
## So we estimate X at 58.3 but it's not inside our confidence interval with such little data. Bad luck?
## SEM
library(lavaan)
X.model < ' X =~ B + A
A =~ a
B =~ b'
X.fit < sem(model = X.model, meanstructure = TRUE, data = data2)
summary(X.fit)
# ... Estimate Std.err Zvalue P(>z)
# Latent variables:
# X =~
# B 1.000
# A 7619.504
# A =~
# a 1.000
# B =~
# b 1.000
#
# Intercepts:
# a 58.555
# b 58.061
# X 0.000
# A 0.000
# B 0.000
## Well, that didn't work well  explodes, unfortunately. Probably still not enough data.
## MLM (Bayesian)
library(R2jags)
## rough attempt at writing down an explicit multilevel model which
## respects the mentioned priors about errors being reasonably small:
model < function() {
grand.mean ~ dunif(0,100)
delta.between.group ~ dunif(0, 10)
sigma.between.group ~ dunif(0, 100)
tau.between.group < pow(sigma.between.group, 2)
for(j in 1:K){
# let's say the grouplevel differences are also normallydistributed:
group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
# and each group also has its own standarddeviation, potentially different from the others':
group.within.sigma[j] ~ dunif(0, 20)
group.within.tau[j] < pow(group.within.sigma[j], 2)
# save the net combo for convenience & interpretability:
group.mean[j] < grand.mean + group.delta[j]
}
for (i in 1:N) {
# each individual observation is from the grandmean + groupoffset, then normally distributed:
Y[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
}
}
jagsData < list(N=nrow(data), Y=data$Y, K=length(levels(data$Measurement)),
Group=data$Measurement)
params < c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
"group.within.sigma")
k1 < jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model); k1
# ... mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
# delta.between.group 4.971 2.945 0.221 2.353 4.967 7.594 9.791 1.008 260
# grand.mean 52.477 11.321 23.453 47.914 53.280 58.246 74.080 1.220 20
# group.delta[1] 6.017 11.391 16.095 0.448 5.316 10.059 34.792 1.152 21
# group.delta[2] 5.662 11.318 15.836 0.054 5.009 10.107 33.548 1.139 21
# group.mean[1] 58.494 3.765 50.973 56.188 58.459 60.838 66.072 1.001 3000
# group.mean[2] 58.139 2.857 52.687 56.366 58.098 59.851 63.999 1.003 920
# group.within.sigma[1] 12.801 2.766 8.241 10.700 12.446 14.641 18.707 1.002 1100
# group.within.sigma[2] 9.274 2.500 5.688 7.475 8.834 10.539 15.700 1.002 1600
# sigma.between.group 18.031 21.159 0.553 3.793 9.359 23.972 82.604 1.006 1700
# deviance 149.684 2.877 145.953 147.527 149.081 151.213 156.933 1.001 3000
## VOI
posteriorXs < k1$BUGSoutput$sims.list[["grand.mean"]]
MSE < function(x1, x2) { (x2  x1)^2 }
lossFunction < function(x, predictions) { mean(sapply(predictions, function(x2) { MSE(x, x2)}))}
## our hypothetical meansquared loss if we predicted, say, X=60:
lossFunction(60, posteriorXs)
# [1] 184.7087612
## of the possible values for X, 1100, what value of X minimizes our squared error loss?
losses < sapply(c(1:100), function (n) { lossFunction(n, posteriorXs);})
which.min(losses)
# [1] 52
## 52 also equals the mean estimate of X, which is good since it's well known that the mean is what minimizes
## the loss when the loss is squarederror so it suggests that I have not screwed up the definitions
losses[52]
[1] 128.3478462
## to calculate EVSI, we repeatedly simulate a few hundred times the existence of a hypothetical 'C' measurement
## and draw n samples from it;
## then we add the C data to our existing A & B data; run our Bayesian multilevel model again on the bigger dataset;,
## calculate what the new loss is, and compare it to the old loss to see how much the new data
## reduced the loss/meansquarederror.
## Done for each possible n (here, 130) and averaged out, this tells us how much 1 additional datapoint is worth,
## 2 additional datapoints, 3 additional datapoints, etc.
sampleValues < NULL
for (i in seq(from=1, to=30)) {
evsis < replicate(500, {
n < i
C < toplevel + runif(1, min=10, max=10)
C_1 < rnorm(n=n, mean=C, sd=10)
## all as before, more or less:
newData < rbind(data, data.frame(Measurement="C", Y=C_1))
jagsData < list(N=nrow(newData), Y=newData$Y, K=length(levels(newData$Measurement)),
Group=newData$Measurement)
params < c("grand.mean","delta.between.group", "sigma.between.group", "group.delta", "group.mean",
"group.within.sigma")
jEVSI < jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model)
posteriorTimesEVSI < jEVSI$BUGSoutput$sims.list[["grand.mean"]]
lossesEVSI < sapply(c(1:100), function (n) { lossFunction(n, posteriorTimesEVSI);})
oldOptimum < 128.3478462 # losses[52]
newOptimum < losses[which.min(lossesEVSI)]
EVSI < newOptimum  oldOptimum
return(EVSI)
}
)
print(i)
print(mean(evsis))
sampleValues[i] < mean(evsis)
}
sampleValues
# [1] 13.86568780 11.07101087 14.15645538 13.05296681 11.98902668 13.86866619 13.65059093 14.05991443
# [9] 14.80018511 16.36944874 15.47624541 15.64710237 15.74060632 14.79901214 13.36776390 15.35179426
# [17] 14.31603459 13.70914727 17.20433606 15.89925289 16.35350861 15.09886204 16.30680175 16.27032067
# [25] 16.30418553 18.84776433 17.86881713 16.65973397 17.04451609 19.17173439
## As expected, the gain in reducing MSE continues increasing as data comes in but with diminishing returns;
## this is probably because in a multilevel model like this, you aren't using the _n_ datapoints to estimate X
## directly so much as you are using them to estimate a much smaller number of latent variables, which are then
## the _n_ used to estimate X. So instead of getting hyperprecise estimates of A/B/C, you need to sample from additional
## groups D/E/F/... Trying to improve your estimate of X by measuring A/B/C many times is like trying to estimate
## IQ precisely by administering a WM test a hundred times.
## If we wanted to compare with alternatives like instead sampling n data points from C and a D, it's easy to modify
## the EVSI loop to do so: generate `D < toplevel + runif(1, min=10, max=10); D_1 < rnorm(n=n, mean=D, sd=10)`
## and now `rbind` D_1 in as well. At a guess, after 510 samples from the current group, estimates of X will be improved more
## by then sampling from a new group.
## Or the loss function could be made more realistic. It's unlikely one is paid by MSE, and if one adds in how much
## money each sample costs, with a realistic loss function, one could decide exactly how much data is optimal to collect.
## To very precisely estimate X, when our measurements are needed to measure at least 3 latent variables,
## requires much more data than usual.
## In general, we can see the drawbacks and benefits of each approach. A canned MLM
## is very fast to write but doesn't let us include prior information or easily run
## additional analyses like how much additional samples are worth. SEM works poorly
## on small samples but is still easy to write in if we have more complicated
## models of measurement error. A fullblown modeling language like JAGS is quite
## difficult to write in and MCMC is slower than other approaches but handles small
## samples without any errors or problems and offers maximal flexibility in using
## the known prior information and then doing decisiontheoretic stuff. Overall for
## this problem, I think JAGS worked out best, but possibly I wasn't using LAVAAN
## right and that's why SEM didn't seem to work well.
Value of Information: clinical prediction instruments for suicide
http://slatestarcodex.com/2015/08/31/magicmarkers/#comment232970
I agree. When criticizing the study for claiming the blood levels added predictive power and it’s not clear they did, this is solely a statistical claim and can be done in a vacuum. But when one then goes on to pan the predictive power of the underlying clinical prediction instruments as useless in all circumstances, based on just the prediction stats:
So when people say “We have a blood test to diagnose suicidality with 92% accuracy!”, even if it’s true, what they mean is that they have a blood test which, if it comes back positive, there’s still less than 5050 odds the person involved is suicidal. Okay. Say you’re a psychiatrist. There’s a 48% chance your patient is going to be suicidal in the next year. What are you going to do? Commit her to the hospital? I sure hope not. Ask her some questions, make sure she’s doing okay, watch her kind of closely? You’re a psychiatrist and she’s your depressed patient, you would have been doing that anyway. This blood test is not really actionable. And then remember that this isn’t the blood test we have. We have some clinical prediction instruments that do this…But having “a blood test for suicide” won’t be very useful, even if it works.
One is implicitly making some strong costbenefit claims here and stepping from statistics (‘what are the probabilities?’) to decision theory (‘given these probabilities, how should I act?’). They are not identical: no AUC graph will ever tell you if a model’s predictions are useful or not, and there is no universal threshold where 92% specificity/sensitivity is totally useless but 95% would make a difference—these clinical prediction instruments might be useless indeed, but that will depend on costs, base rates, and available actions. (I tried to make this point to Coyne on Twitter earlier but I don’t think he understood what I was getting at & he blew me off.)
Discontinuities come from our actions; our inferences are incremental. There are some contexts where a tiny 1% improvement in AUC might be worth a lot (Wall Street) and there are some contexts where sensitivity or specificity of 99% is still useless because it won’t change your actions at all (I’m currently comparing my riding lawn mower to a robotic lawn mower, and thus far, it doesn’t matter how precise my parameters are, the robotic lawn mowers are, to my disappointment, just too expensive right now). I think pvalues have shown us how well arbitrary thresholds work out in practice (and remember where they came from in the first place! decision rules set per problem—Gosset, in optimizing a brewery, did not have the pathologies we have with p<0.05 fetishism.) I also don’t believe your choices are really that restricted: you mean if you were absolutely convinced that your patient was about to commit suicide, there is absolutely nothing you could do besides treat them like any other depressive? That seems unlikely. But whatever, even if commitment is the only alternative, there is still a value to the information provided by a clinical prediction instrument, and we can calculate it, and you should if you want to rule it out as having any value, in the same way that in criticizing a study as weak, it’s better to ignore the pvalues and just work out the right posterior and demonstrate directly how little evidence it contains.
Let’s try this as an example, it’s not hard or terribly complex (just tedious). So we have a ward of 100 depressive patients where we are interested in preventing suicide; our prior probability is that 7.5% or ~7 of them will commit suicide. The value of a life has been given a lot of different valuations, but $10 million is a good starting point.
Action 1:
What are our costs or losses? We could say that we expect a loss of 7.5*$10m or $75m, and if we stand by and do no treatment or intervention whatsoever, we spend no more money and so the total loss is
0 + 0.075 * 100 * 10,000,000 = $75,000,000
Action 2:
Let’s say they all stay by default for one week and this costs a net $1000 a day; let’s say further that, since commitment is the mentioned alternative, while committed a suicide attempt will fail. And since we know that suicides are so often spontaneous and major depression comes and goes, a frustrated suicide attempt doesn’t simply mean that they will immediately kill themselves as soon as they get out. This 7% comes from a followup period of a year, so the probability any will attempt suicide in the next week might be 0.075/52 or 0.001442307692. So this gives us our default setup: we have 100 patients staying for 7 days at a net cost of $1000 a day or $700,000 total, and by having them stay, we stop an expected average of 0.14 suicides and thus we prevent an expected loss of 0.14 * $10m = $1,440,000, for a total loss of treatmentcost minus treatmentgain plus remainingloss:
$700,000  (0.14 * $10m)  $10m * 100 * (0.075(0.075/52)) = $74,257,692.
Note that this loss is smaller than in the scenario in which we don’t do any commitment at all; since one week of suicidewatch reduced the suicide loss more than it cost, this is not surprising.
Specifically, the benefit is:
action1  action2 = gain to switching 75000000  74257692 = $742,308
Not fantastic, but it’s in the right order of magnitude (you can’t expect more from a low baserate event and a treatment with such a low probability of making a difference, after all) so it looks plausible, and it’s still more than zero. We can reject the action of not committing them at all as being inferior to committing them for one week.
Action 3:
What if we were instead choosing between one week and committing them for a full year—thus catching the full 7.5% of suicides during the 1year followup? Does that work? First, the loss from this course of action:
((100*365.2*1000)  (0 * 10000000)  (10000000 * 100 * (0.075(0.075/1)))) = $36,520,000
Since there are no suicides, we avoid the default loss of $75m, but we still have to spend $36,520,000 to pay for the longterm commitment. However, the benefit to the patients has increased dramatically since we stop so many more suicides:
action 2  action 3 = $35,637,692.31
(We go from a loss of $74m to a loss of $36m.) So we see action 3 is even better than action 2 for the patients. Of course, we can’t extrapolate out any further than 1 year, because that’s what our followup number is, and we don’t know how the suicide risk falls after the 1 year point—if it drops to ~0, then further commitment is a terrible idea. So I’m not going to calculate out any further. (Since this is all linear stuff, the predicted benefit will increase smoothly over the year and so there’s no point in calculating out alternatives like 1 month, 3 months, 6 months, 9 months, etc.) What’s that, action 3 is totally infeasible and no one would ever agree to this—the patients would scream their heads off and the health insurance companies would never go for it—even if we could show that long commitments do reduce the suicide rate enough to justify the costs? And, among other things, I’ve oversimplified in assuming the 7% risk is evenly distributed over the year rather than a more plausible distribution like exponentially decreasing from Day 1, so likely commitment stops being a good idea more like month 3 or something? Yeah, you’re probably right, so let’s go back to using action 2’s loss as our current best alternative.
Now, having set out some of the choices available, we can find out how much better information is worth. First, let’s ask what the Expected Value of Perfect Information is: if we were able to take our 100 patients and exactly predict which 7 were depressive and would commit suicide this year in the absence of any intervention, where our choice is between committing them for one week or not at all. Given such information we can eject the 93 who we now know were never a suicide risk, and we hold onto the 7 endangered patients, and we have a new loss of the commitment cost of 7 people for a week vs the prevented loss of the chance they will try to commit suicide that week of this year:
((771000)  (0.14 * 10000000)  (10000000 * 7 * (1(1/52)))) = $70,004,846
How much did we gain from our perfect information? About $4m:
74257692  70004846 = $4,252,846
(This passes our sanity checks: additional information should never hurt us, so the amount should be >=$0, but we are limited by the intervention to doing very little, so the ceiling should be a low amount compared to the total loss, which this is.)
So as long as the perfect information did not cost us more than $4m or so, we would have net gained from it: we would have been able to focus commitment on the patients at maximal risk. So suppose we had a perfect test which cost $1000 a patient to run, and we wanted to know if the gained information was valuable enough to bother with using this expensive test; the answer in this case is definitely yes: with 100 patients, it’ll cost $100,000 to run the test but it’ll save $4.25m for a net profit of $4.15m. In fact, we would be willing to pay perpatient costs up to $42k, at which point we hit breakeven (4252846 / 100).
OK, so that’s perfect information. What about imperfect information? Well, imperfect is a lot like perfect information, just, y’know—less so. Let’s consider this test: with the same prior, a negative on it means the patient now has P=0.007 to commit suicide that year, and a positive means P=0.48, and the sensitivity/specificity at 92%. (Just copying that from OP & ButYouDisagree, since those sound plausible.) So when we run the test on our patients, we find of the 4 possible outcomes:
 85.1 patients are nonsuicidal and the test will not flag them
 7.4 are nonsuicidal but the test will flag them
 6.9 are suicidal and the test will flag them
 0.6 are suicidal but the test will not flag them
So if we decide whether to commit or not commit solely based on this test, we will send home 85.1 + 0.6 = 85.7 patients (and indeed 0.6/85.7=0.007), and we will retain the remaining 7.4 + 6.9 = 14.3 patients (and indeed, 6.9/14.3=0.48). So our loss is the wrongly ejected patient of 0.6 suicides plus the cost of committing 14.3 patients (both safe and atrisk) for a week in exchange for the gain of a small chance of stopping the suicide of the 6.9 actually at risk:
(10000000*85.7*0.007) + (14.3*7*1000) + (10000000 * (0.4814.3) (1(1/52))) = $73,419,100
How much did we gain from our imperfect information? About $0.8m:
74257692  73419100 = $838,592
or $8,385.92 per patient. (This passes our sanity check: greater than $0, but much less than the perfect information. The exact amount may seem lame, but as a fraction of the value of perfect information, it’s not too bad: the test gets us 20%  838592 / 4252846  of the way to perfection.)
And that’s our answer: the test is not worth $0—it’s worth $8k. And once you know what the cost of administering the test is, you simply subtract it and now you have the Net Expected Value of Information for this test. (I can’t imagine it costs $8k to administer what this sounds like, so at least in this model, the value is highly likely >$0.)
By taking the posterior of the test and integrating all the estimated costs and benefits into a single framework, we can nail down exactly how much value these clinical instruments could deliver if used to guide decisionmaking. And if you object to some particular parameter or assumption, just build another decisiontheory model and estimate the new cost. For example, maybe commitment actually costs, once you take into account all the disruption to lives and other such sideeffects, not $1000 but net of $5000 per day, what then? Then the gain halves to $438,192, etc. And if it costs $10000 then the test is worth nothing because you won’t commit anyone ever because it’s just way too expensive, and now you know it’s worth $0; or if commitment is so cheap that it’s more like $100 a day, then the test is also worth $0 because you would just commit everyone (since breakeven is then a suicide probability way below 7%, all the way at ~0.4% which is still below the 0.7% which the test can deliver, so the test result doesn’t matter for deciding whether to commit, so it’s worth $0), or if you adopt a more reasonable value of life like $20m, the value of perfect information shoots up (obviously, since the avoided loss doubles) but the value of imperfect information drops like a stone (since now that one suicidal patient sent home blows away your savings from less committing) and the test becomes worthless; and playing with the formulas, you can figure out the various ranges of assumptions in which the test has positive value and estimate how much it has under particular parameters, and of course if parameters are uncertain, you can cope with that uncertainty by embedding this in a Bayesian model to get posterior distributions of particular parameters incorporating all the uncertainty.
So to sum up: there are no hard thresholds in decisionmaking and imposing them can cost us better decisionmaking, so to claim additional information is worthless, more analysis needed, and this analysis must be done with respect to the available actions & their consequences, which even under the somewhat extreme conditions here of very weak interventions & low baserates, suggests that the value of this information is positive.
Bayesian Model Averaging
## original: "Bayesian model choice via Markov chain Monte Carlo methods" Carlin & Chib 1995 http://stats.ma.ic.ac.uk/~das01/MyWeb/SCBI/Papers/CarlinChib.pdf
## Kobe example & data from: "A tutorial on Bayes factor estimation with the product space method", Lodewyckx et al 2011 http://ejwagenmakers.com/2011/LodewyckxEtAl2011.pdf
## Lodewyckx code can be downloaded after registration & email from http://ppw.kuleuven.be/okp/software/scripts_tut_bfepsm/
## "Table 2: Observed field goals (y) and attempts (n) by Kobe Bryant during the NBA seasons of 1999 to 2006."
kobe < read.csv(stdin(),header=TRUE)
Year, y, n, y.n
1999, 554, 1183, 0.47
2000, 701, 1510, 0.46
2001, 749, 1597, 0.47
2002, 868, 1924, 0.45
2003, 516, 1178, 0.44
2004, 573, 1324, 0.43
2005, 978, 2173, 0.45
2006, 399, 845, 0.47
library(runjags)
model1<"model{
# 1) MODEL INDEX
# Model index is 1 or 2.
# Prior probabilities based on argument prior1.
# Posterior probabilities obtained by averaging
# over postr1 and postr2.
M ~ dcat(p[])
p[1] < prior1
p[2] < 1prior1
postr1 < 2M
postr2 < 1postr1
# 2) MODEL LIKELIHOOD
# For each year, successes are Binomially distributed.
# In M1, the success rate is fixed over years.
# In M2, the success rate is yearspecific.
for (i in 1:n.years){
successes[i] ~ dbin(pi[M,i], attempts[i])
pi[1,i] < pi.fixed
pi[2,i] < pi.free[i]
}
# 3) MODEL 1 (one single rate)
# The fixed success rate is given a Beta prior and pseudoprior.
# Whether it is a prior or pseudoprior depends on the Model index.
pi.fixed ~ dbeta(alpha.fixed[M],beta.fixed[M])
alpha.fixed[1] < alpha1.prior
beta.fixed[1] < beta1.prior
alpha.fixed[2] < alpha1.pseudo
beta.fixed[2] < beta1.pseudo
# 4) MODEL 2 (multiple independent rates)
# The yearspecific success rate is given a Beta prior and pseudoprior.
# Whether it is a prior or pseudoprior depends on the Model index.
for (i in 1:n.years){
pi.free[i] ~ dbeta(alpha.free[M,i],beta.free[M,i])
alpha.free[2,i] < alpha2.prior
beta.free[2,i] < beta2.prior
alpha.free[1,i] < alpha2.pseudo[i]
beta.free[1,i] < beta2.pseudo[i]
}
# predictive interval for hypothetical 2007 data in which Kobe makes 1000 shots:
successes.new.1 ~ dbin(pi.fixed, 1000)
successes.new.2 ~ dbin(pi.free[n.years], 1000)
# success.new.weighted ~ dcat(M
}"
# 'prior1' value from paper
data < list("prior1"=0.000000007451, "n.years"= length(kobe$Year), "successes"=kobe$y, "attempts"=kobe$n,
"alpha1.prior"=1, "beta1.prior"=1, "alpha2.prior"=1, "beta2.prior"=1,
"alpha1.pseudo"=1, "beta1.pseudo"=1, "alpha2.pseudo"=rep(1,8), "beta2.pseudo"=rep(1,8) )
# inits < function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mailSim,1,mean))) }
params < c("pi.free", "pi.fixed", "postr1", "postr2", "M", "successes.new.1", "successes.new.2")
j1 < run.jags(model=model1, monitor=params, data=data, n.chains=getOption("mc.cores"), method="rjparallel", sample=500000); j1
# JAGS model summary statistics from 4000000 samples (chains = 8; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff
# pi.free[1] 0.3145 0.46864 0.98709 0.47383 0.11553  0.00041958 0.4 75810
# pi.free[2] 0.10099 0.46447 0.77535 0.47005 0.1154  0.00042169 0.4 74887
# pi.free[3] 0.19415 0.4692 0.86566 0.4741 0.11457  0.00040171 0.4 81342
# pi.free[4] 0.020377 0.45146 0.69697 0.45867 0.11616  0.00042696 0.4 74023
# pi.free[5] 0.024472 0.43846 0.7036 0.44749 0.11757  0.00043352 0.4 73548
# pi.free[6] 0.076795 0.43325 0.74944 0.44318 0.11684  0.00043892 0.4 70863
# pi.free[7] 0.06405 0.45033 0.73614 0.45748 0.11541  0.00041715 0.4 76543
# pi.free[8] 0.30293 0.47267 0.97338 0.47708 0.11506  0.00040938 0.4 79000
# pi.fixed 0.039931 0.45756 0.97903 0.49256 0.26498  0.00099537 0.4 70868
# postr1 0 0 1 0.15601 0.36287 0 0.15113 41.6 6
# postr2 0 1 1 0.84399 0.36287 1 0.15113 41.6 6
# M 1 2 2 1.844 0.36287 2 0.15113 41.6 6
# successes.new.1 0 463 940 492.57 265.28 454 0.99543 0.4 71019
# successes.new.2 300 473 971 477.05 116.03 473 0.4152 0.4 78094
getLogBF < function(prior0, postr0) { log((postr0/(1postr0)) / (prior0/(1prior0))) }
getLogBF(0.000000007451, 0.15601)
# [1] 17.02669704
## analytic BF: 18.79; paper's MCMC estimate: 18.80; not sure where I lost 1.8 of the BF.
Dealing with allornothing unreliability of data
Given two disagreeing polls, one small & imprecise but taken at facevalue, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal with a mean estimate close to the large unreliable poll’s estimate but with wide credible intervals to cover the mode based on the small reliable poll’s estimate.
A question was asked of me: what should one infer if one is given what would be definitive data if one could take it at face value—but one suspects this data might be totally incorrect? An example would be if one wanted to know what fraction of people would answer ‘yes’ to a particular question, and one had a very small poll (n=10) suggesting 90% say yes, but then one was also given the results from a much larger poll (n=1000) saying 75% responded yes—but this poll was run by untrustworthy people, people that, for whatever reason, you believe might make something up half the time. You should be able to learn something from this unreliable poll, but you can’t learn everything from it because you would be burned half the time.
If not for this issue of unreliability, this would be an easy binomial problem: specify a uniform or Jeffreys prior on what percentage of people will say yes, add in the binomial data of 9/10, and look at the posterior. But what do we do with the unreliability joker?
Binomial
First let’s try the simple case, just updating on a small poll of 9/10. We would expect it to be unimodally peaked around 8090%, but broad (due to the small sample size) and falling sharply until 100% since being that high is a priori unlikely.
MCMC using Bayesian First Aid:
## install.packages("devtools")
## devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
b < bayes.binom.test(oldData$Yes, oldData$N); b
# ...number of successes = 9, number of trials = 10
# Estimated relative frequency of success:
# 0.85
# 95% credible interval:
# 0.63 0.99
# The relative frequency of success is more than 0.5 by a probability of 0.994
# and less than 0.5 by a probability of 0.006
Which itself is a wrapper around calling out to JAGS doing something like this:
library(runjags)
model_string < "model {
x ~ dbinom(theta, n)
theta ~ dbeta(1, 1) }"
model < autorun.jags(model_string, monitor="theta", data=list(x=oldData$Yes, n=oldData$N)); model
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# theta 0.63669 0.85254 0.9944 0.83357 0.10329  0.0007304 0.7 20000 0.011014 1.0004
Here is a simulationbased version of Bayesian inference using ABC:
oldData < data.frame(Yes=9, N=10)
simulatePoll < function(n, pr) { rbinom(1, size=n, p=pr); }
poll_abc < replicate(100000, {
# draw from our uniform prior
p < runif(1,min=0,max=1)
# simulate a hypothetical poll dataset the same size as our original
newData < data.frame(Yes=simulatePoll(oldData$N, p), N=oldData$N)
# were they equal? if so, save sample as part of posterior
if (all(oldData == newData)) { return(p) }
}
)
resultsABC < unlist(Filter(function(x) {!is.null(x)}, poll_abc))
summary(resultsABC)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.3260816 0.7750520 0.8508855 0.8336383 0.9117471 0.9991691
hist(resultsABC)
# https://i.imgur.com/fn3XYQW.png
They look identical, as they should.
Binomial with binary unreliability
To implement our more complicated version: the original poll remains the same but we add in the complication of a very large poll which 50% of the time is a true measure of the poll response and 50% of the time is drawn uniformly at random. (So if the true poll response is 90%, then half the time the large poll will yield accurate data like 905/1000 or 890/1000, and the rest it will yield 10/1000 or 400/1000 or 700/1000.) This is different from the more common kinds of measurementerror models where it’s generally assumed that the noisy measurements still have some informativeness to them; here there is none.
Specifically, this faux poll has yielded the data not 9/10, but 750/1000.
ABC
Using ABC again: we generate the reliable small poll as before, and we add in an faux poll where we flip a coin to decide if we are going to return a ‘yes’ count based on the population parameters or just a random number, then we combine the two datasets and check that it’s identical to the actual data, saving the population probability if it is.
oldData2 < data.frame(Yes=c(9,750), N=c(10,1000)); oldData2
# Yes N
# 1 9 10
# 2 750 1000
simulateHonestPoll < function(n, pr) { rbinom(1, size=n, p=pr); }
simulateFauxPoll < function(n, pr, switchp) { if(sample(c(TRUE, FALSE), 1, prob=c(switchp, 1switchp))) { rbinom(1, size=n, p=pr); } else { round(runif(1, min=0, max=n)); }}
poll_abc < replicate(1000000, {
priorp < runif(1,min=0,max=1)
switch < 0.5
n1 < 10
n2 < 1000
data1 < data.frame(Yes=simulateHonestPoll(n1, priorp), N=n1)
data2 < data.frame(Yes=simulateFauxPoll(n2, priorp, switch), N=n2)
newData < rbind(data1, data2)
if (all(oldData2 == newData)) { return(priorp) }
}
)
resultsABC < unlist(Filter(function(x) {!is.null(x)}, poll_abc))
summary(resultsABC)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.5256471 0.7427098 0.7584650 0.7860109 0.8133581 0.9765648
hist(resultsABC)
# https://i.imgur.com/atMz0jg.png
The results are interesting and in this case the summary statistics are misleading: the median is indeed around 75% (as we would expect! since that’s the result of the highly precise poll which has a 50% chance of being the truth) but we see the mean is being pulled away towards the original 90% estimate, and plotting the histogram, bimodality emerges. The posterior reports that there’s still a lot of credibility to the 90% point estimate, but between the original diffuseness of that posterior (leaving a lot of probability to lower responses including, say, 75%) and the high certainty that if accurate the responses will definitely be close to 75%, it winds up peaked at a little higher than 75% (since even if the larger poll is honest, the earlier poll did still find 9/10). So it’s not so much that we think the best estimate of true population rate really is 79% (indeed, the mode is more like 75%, but it could easily be far away from 75% and in the 90%s) as we would need to think more about what we want to do with this posterior before we decide how to summarize it.
Mixture
ABC is slow and would not scale to more hypothetical polls unless we abandoned exact ABC inference and began using approximate ABC (entirely possible in this case; instead of strict equality between the original and simulated data, we’d instead accept a sample of p if the simulated dataset’s fractions were within, say, 1% of the originals); and the simulation would need to be rewritten anyway.
MCMC can handle this if we think of our problem as a mixture model: our problem is that we have poll data drawn from two clusters/distributions—one cluster is the true population distribution of opinion, and the other cluster just spits out noise. We have one observation which we know is from first reliable distribution (the 9/10 poll result), and one observation which we’re not sure which of the two it came from (750/1000), but we do know that the indexing probability for mixing the two distributions is 50%.
In JAGS, we write down a model in which dcat
flips between 1 and 2 if the cluster is not known, specifying which distribution a sample came from and its theta probability, and then we infer the thetas for both distributions. Of course, we only care about the first distribution’s theta since the second one is noise.
library(runjags)
model1 < "model {
for (i in 1:N) {
y[i] ~ dbinom(theta[i], n[i])
theta[i] < thetaOfClust[ clust[i] ]
clust[i] ~ dcat(pi[])
}
pi[1] < switch[1]
pi[2] < switch[2]
thetaOfClust[1] ~ dbeta(1,1)
thetaOfClust[2] ~ dunif(0,1)
}"
j1 < autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldData2), y=oldData2$Yes, n=oldData2$N, switch=c(0.5, 0.5), clust=c(1,NA))); j1
# ... Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# theta[1] 0.70582 0.75651 0.97263 0.77926 0.07178  0.001442 2 2478 0.12978 1.0011
# theta[2] 0.72446 0.75078 0.77814 0.75054 0.013646  0.00009649 0.7 20000 0.009458 1
plot(j1)
# https://i.imgur.com/EaqR0dD.png
Sure enough, we get a good match with the ABC estimate: a mean estimate for the population distribution of 78% with a very wide 95% CI and a clearly bimodal distribution with a huge spike at 75%. Since the MCMC mixture model looks completely different from the imperative simulationbased model, the consistency in estimates & distributions gives me some confidence in the results being right.
So we can see how we should update our beliefs—by a perhaps surprising amount towards the unreliable datapoint. The original data was too weak to strongly resist the allure of that highly precise poll.
Weakening heuristic?
We might try to think of it this way: half the time, the large poll means nothing whatsoever, it contains 0% or no information about the population at all; While the other half of the time, it is exactly what it seems to be and 100% informative; so doesn’t that mean that on average we should treat it as containing half the information we thought it did? And the information is directly based on the sample size: a sample 5x as big contains 5x as much information. So perhaps in this case of allornothing accuracy, we could solve it easily by simply weakening the weight put the unreliable information and shrinking the claimed sample size—instead of treating it as 750 of 1000, treat it as 375/500; and if it had been 75,000 of 100,000, convert it to 37,500 of 50,000. This is a simple and intuitive shortcut, but if we think about what the binomial will return as the unreliable poll increases in size or if we look at the results…
switch < 0.5
oldData3 < data.frame(Yes=c(9,(750*switch)), N=c(10,(1000*switch)))
b2 < bayes.binom.test(sum(oldData3$Yes), sum(oldData3$N)); b2
#
# Bayesian First Aid binomial test
#
# data: sum(oldData3$Yes) and sum(oldData3$N)
# number of successes = 384, number of trials = 510
# Estimated relative frequency of success:
# 0.75
# 95% credible interval:
# 0.71 0.79
# The relative frequency of success is more than 0.5 by a probability of >0.999
# and less than 0.5 by a probability of <0.001
Unfortunately, this doesn’t work because it doesn’t preserve the bimodal aspect of the posterior, and we get a unimodal distribution ever concentrating on its mean, wiping out the existence of the 0.90 peak. If our untrustworthy poll had instead, say, reported 750,000 out of 1 million, that should only make the peak at 0.75 look like a needle—it should be unable to affect the mass around 0.9, because it doesn’t matter if the data is 100 or 1 million or 1 billion, it still only has a 50% chance of being true. It’s a little hard to see this since the mean frequency of 0.75 is fairly close to the mean of 0.78 from the ABC and we might write this off as approximation error in either the ABC estimate or BFA’s MCMC, but if we look at the 95% CI and note that 0.9 is not inside it or if we plot the posterior (plot(b2)
), then the absence of bimodality jumps out. So this trick doesn’t work.
Dysgenics power analysis
Current dysgenic estimates predict that genotypic IQ in the West are falling at a substantial rate, amounting to around half a standard deviation or more over the past century, by 1. reducing the frequency at which intelligenceincreasing genetic variants occur (through natural selection against such variants) and 2. by increasing the number of new and potentially harmful genetic mutations (increasing mutation load). Estimates are produced indirectly by surveying reproductive rates or by trying to show decreases in phenotypic traits associated with intelligence; it would obviously be preferable to examine dysgenic effects directly, by observing decreases in frequencies or increases in mutation load in a large sample of Western genetic information such as SNP arrays or wholegenomes (respectively). Such direct testing of dysgenics hypotheses are becoming increasingly feasible due to the exponential decrease in SNP & wholegenome sequencing costs creating large datasets (some publicly available) and the recent identification of some intelligence genes. It remains unclear how large these datasets must be to overcome sampling error and yield informative estimates of changes in frequencies or mutation load, however; datasets like PGP or SSGAC may still be too small to investigate dysgenics. I considered the effect size estimates and under some simple models derive power calculations & power simulations of how large a dataset would be required to have an 80% chance of detecting a dysgenic effect: to detect the decrease in intelligence SNPs using SNP data, n≥30,000; to detect the increase in mutation load in whole genomes, n≥160 I then compare to available datasets: the effect on SNPs can be detected by a large number of existing proprietary databases, but there are no public databases which will be large enough in the foreseeable future; the effect on mutation load, on the other hand, can be detected using solely the currently publicly available dataset from PGP. So I conclude that while only the proprietary databases can directly test dysgenic theories of selection for the foreseeable future, there is an opportunity to analyze PGP genomes to directly test the dysgenic theory of mutation load.
The dysgenics hypothesis argues that due to observed reproductive patterns where the highly educated or intelligent tend to have fewer offspring, genotypic IQ (the upper bound on phenotypic IQs set by genes and the sort of thing measured by a polygenic score).
If dysgenics is true, then it is an extremely important phenomenon, as important as many things that get far more attention like lead remediation; but to paraphrase Richard Hamming^{2}, just because a problem is important does not mean it is worth working on or researching or discussing if there is no chance of making progress—if the data is hopelessly compromised by many systematic biases which would cause false positives or if the data is too scanty to overcome random error or analyses so flexible that they could deliver any answer the partisan wishes. Phenotypic data will, in all probability, never allow for a clear & decisive answer to the question of whether dysgenics exists or matters, as longterm comparisons are roughly as credible as noting that global piracy rates have declined while global warming increases, or paracetamol consumption rates have increased in tandem with Alzheimer’s rates; only direct examination of genetics will deliver the decisive answer. It would be nice to have an idea of how much genetic data we would need to overcome random error (and hence, whether it’s possible to make progress in the near future), which we can answer by doing some statistical power analyses.
Changes over time in genetics could be due to changes within a particular race or population (for example, in all white Englishmen), or could be due to population movements like one group replacing or migrating or merging into another (population genetics has revealed innumerable complex examples historically). The latter is possible thanks to the increasing availability of ancient DNA, often made public for researchers; so one could observe very longterm trends with cumulatively large effects (implying that small samples may suffice), but this approach has serious issues in interpretation and questions about how comparable intelligence variants may be across groups or throughout human evolution. With the former, there is less concern about interpretation due to greater temporal and ethnic homogeneity—if a GWAS on white northern Europeans in 2013 turns up intelligence variants and produces a useful polygenic score, it will almost certainly work on samples of white northern Europeans in 1900 too—but because the timescale is so short the effect will be subtler and harder to detect. Nevertheless, a result within a modern population would be much more credible, so we’ll focus on that.
How subtle and hard to detect an effect are we talking about here? Woodley 2012 summarizes a number of estimates:
Early in the 20th century, negative correlations were observed between intelligence and fertility, which were taken to indicate a dysgenic fertility trend (e.g. Cattell, 1936; Lentz, 1927; Maller, 1933; Sutherland, 1929). Early predictions of the rate of dysgenesis were as high as between 1 and 1.5 IQ points per decade (Cattell, 1937, 1936)…In their study of the relationship between intelligence and both completed and partially completed fertility, van Court and Bean (1985) reported that the relationships were predominantly negative in cohorts born between the years 1912 and 1982…Vining (1982) was the first to have attempted an estimation of the rate of genotypic IQ decline due to dysgenesis with reference to a large national probability cohort of US women aged between 24 and 34 years in 1978. He identified significant negative correlations between fertility and IQ ranging from −.104 to −.221 across categories of sex, age and race, with an estimated genotypic IQ decline of one point a generation. In a 10year followup study using the same cohort, Vining (1995) reexamined the relationship between IQ and fertility, now that fertility was complete, finding evidence for a genotypic IQ decline of .5 points per generation. Retherford and Sewell (1988) examined the association between fertility and IQ amongst a sample of 9000 Wisconsin highschool graduates (graduated 1957). They found a selection differential that would have reduced the phenotypic IQ by .81 points per generation under the assumption of equal IQs for parents and children. With an estimate of .4 for the additive heritability of IQ, they calculated a more modest genotypic decline of approximately .33 points. The study of Ree and Earles (1991), which employed the NLSY suggests that once the differential fertility of immigrant groups is taken into consideration, the phenotypic IQ loss amongst the American population may be greater than .8 of a point per generation. Similarly, in summarizing various studies, Herrnstein & Murray (1994) suggest that “it would be nearly impossible to make the total [phenotypic IQ decline] come out to less than one point per generation. It might be twice that.” (p. 364). Loehlin (1997) found a negative relationship between the fertility of American women aged 3544 in 1992 and their educational level. By assigning IQ scores to each of six educational levels, Loehlin estimated a dysgenesis rate of .8 points in one generation. Significant contributions to the study of dysgenesis have been made by Lynn, 1996 (see also: 2011) whose book Dysgenics: Genetic deterioration in modern populations provided the first estimates of the magnitude of dysgenesis in Britain over a 90 year period, putting the phenotypic loss at .069 points per year (about 1.7 points a generation assuming a generational length of 25 years). In the same study, Lynn estimated that the genotypic IQ loss was 1.64 points per generation between 1920 and 1940, which reduced to .66 points between 1950 and the present. Subsequent work by Lynn has investigated dysgenesis in other populations. For example Lynn (1999) found evidence for dysgenic fertility amongst those surveyed in the 1994 National Opinion Research Center survey, which encompassed a representative sample of American adults, in the form of negative correlations between the intelligence of adults aged 40+ and the number of children and siblings. Lynn estimates the rate of dysgenesis amongst this cohort at .48 points per generation. In a more recent study, Lynn and van Court (2004) estimated that amongst the most recent US cohort for which fertility can be considered complete (i.e. those born in the years 19401949), IQ has declined by .9 points per generation. At the country level, Lynn and Harvey (2008) have found evidence of a global dysgenesis of around .86 points between 1950 and 2000, which is projected to increase to 1.28 points in the period from 2000 to 2050. This projection includes the assumption that 35% of the variance in crosscountry IQ differences is due to the influence of genetic factors. A subsequent study by Meisenberg (2009), found that the fertility differential between developed and developing nations has the potential to reduce the phenotypic world population IQ mean by 1.34 points per decade (amounting to a genotypic decline of .47 points per decade assuming Lynn & Harvey’s 35% estimate). This assumes present rates of fertility and prereproductive mortality within countries. Meisenberg (2010) and Meisenberg and Kaul (2010) have examined the factors through which intelligence influences reproductive outcomes. They found that amongst the NLSY79 cohort in the United States, the negative correlation between intelligence and fertility is primarily associated with g and is mediated in part by education and income, and to a lesser extent by more “liberal” gender attitudes. From this Meisenberg has suggested that in the absence of migration and with a constant environment, selection has the potential to reduce the average genotypic IQ of the US population by between .4, .8 and 1.2 points per generation.
All of these estimates are genetic selection estimates: indirect estimates inferred from IQ being a heritable trait and then treating it as a natural selection/breeding process, where a trait is selected against based on phenotype and how fast the trait decreases in each succeeding generation depends on how genetic the trait is and how harsh the selection is. So variation in these estimates (quoted estimates per generation range from .3 to 3+) is due to sampling error, differences in populations or time periods, expressing the effect by year or generation, the estimate used for heritability, reliability of IQ estimates, and whether additional genetic effects are taken into account—for example, Woodley et al 2015 finds .262 points per decade from selection, but in another paper argues that paternal mutation load must be affecting intelligence by ~0.84 in the general population, giving a total of 1 per decade.
Dysgenics effects should be observable by looking at genomes & SNP data with known ages/birthyears and looking for increases in total mutations or decreases in intelligencecausing SNPs, respectively.
Selection on SNPs
Without formally metaanalyzing all dysgenics studies, a good starting point on the selection effect seems like a genetic selection of 1 point per decade or 0.1 points per year or 0.007 standard deviations per year (or 0.7 standard deviations per century).
The most common available genetic data is SNP data, which sequence only the variants most common in the general population; SNP data can look at the effects of genetic selection but will not look at new mutations (since a new mutation would not be common enough to be worth putting onto a SNP chip).
Given a large sample of SNP data, a birth year (or age), and a set of binary SNP variables which cause intelligence (coded as 1 for the good variant, 0 for the others), we could formulate this as a multivariate regression: glm(cbind(SNP1, SNP2, ... SNP_N) ~ Year, family=binomial)
and see if the year variable has a negative sign (increasing passage of time predicts lower levels of the good genes); if it does, this is evidence for dysgenics. Better yet, given information about the effect size of the SNPs, we could for each person’s SNP sum the net effects and then regress on a single variable, giving more precision rather than looking for independent effects on each SNP: lm(Polygenic_score ~ Year)
. Again a negative sign on the year variable is evidence for dysgenics.
Directional predictions are weak, and in this case we have quantitative predictions of how big the effects should be. Most of the public genomes I looked at seem to have the earliest birthdates in the 1950s or so; genomes can come from any age person (parents can give permission, and sequencing has been done prenatally) so the maximum effect is the difference between 1950 and 2015, which is 65*0.007=0.455
standard deviations (but most genomes will come from intermediate birthdates, which are less informative about the temporal trend—in the optimal experimental design for measuring a linear trend, half the samples would be from 1950 and the other half from 2015). If the genetic total is going down by 0.455SDs, how much do the frequencies of all the good genes go down?
One simple model of genotypic IQ would be to treat it as a large number of alleles of equal binary effect: a binomial sum of n=10,000 1/0 variables with P=50% (population frequency) is reasonable. (For example, GIANT has found a large number of variants for height, and the GCTAs indicate that SNPs explain much more of variance than the top Rietveld hits currently account for; this specific model is loosely inspired by Hsu 2014.) In such a model, the average value of the sum is of course n*p=5000
, and the SD is sqrt(n*p*(1p))
or sqrt(10000*0.5*0.5)
or 50. Applying our estimate of dysgenic effect, we would expect the sum to fall by 0.455*50=22.75
, so we would be comparing two populations, one with a mean of 5000 and a dysgenic mean of 4977.25. If we were given access to all alleles from a sample of 1950 and 2015 genomes and so we could construct the sum, how hard would it be able to tell the difference? In this case, the sum is normally distributed as there are more than enough alleles to create normality, so we can just treat this as a twosample normallydistributed comparison of means (a ttest), and we already have a directional effect size in mind, 0.445SDs, so:
power.t.test(delta=0.455, power=0.8, alternative="one.sided")
# Twosample t test power calculation
#
# n = 60.4155602
# ...
A total n=120 is doable, but it is unlikely that we will know all intelligence genes anytime soon; instead, we know a few. A new mean of 4977 implies that since total number of alleles is the same but the mean has fallen, the frequencies must also fall and the average frequency falls from 0.5 to 4977.25/10000=0.497725
. To go to the other extreme, if we know only a single gene and we want to test a fall from a frequency of 0.50 to 0.4977, we need infeasibly more samples:
power.prop.test(p1=0.5, p2=0.497725, power=0.8, alternative="one.sided")
# Twosample comparison of proportions power calculation
#
# n = 597,272.2524
# ...
1.2m datapoints would be difficult to get, and so a single gene test would be unhelpful; further, a single gene could change frequencies solely through genetic drift without the change being due to dysgenic pressures.
We know a number of genes, though: Rietveld gives 4 good hits, so we can look at a polygenic score from that. They are all of similar effect size and frequency, so we’ll continue under the same assumptions of 1/0 and P=50%. The nondysgenic average score is 4*0.5=2
, sd=sqrt(4*0.5*0.5)=1
. (Naturally, the SD is much larger than before because with so few random variables…) The predicted shift is from frequencies of 0.5 to 0.497, so the dysgenic scores should be 4*0.497=1.988
, sd=sqrt(4*0.497*0.503)=0.999
. The difference of 0.012 on the reduced polygenic score is d=((21.988) / 0.999)=0.012
, giving a necessary power of:
power.t.test(delta=0.012006003, power=0.8)
# Twosample t test power calculation
#
# n = 108904.194
# ...
So the 4 hits do reduce the necessary sample size, but it’s still not feasible to require 218k SNP datasets (unless you are 23andMe or SSGAC or an entity like that).
In the current GWAS literature, there are ~9 hits we could use, but the upcoming SSGAC paper promises: “We identified 86 independent SNPs associated with EA (p<5E8).”. So how much would 86 improve over 4?
 mean old:
86*0.5=43
 sd old:
sqrt(86*0.5*0.5)=4.6368
 mean new:
86*0.497=42.742
 sd new:
sqrt(86*0.497*(10.497))=4.6367
 so d=
(4342.742)/4.63675=0.0556
power.t.test(delta=((4342.742)/4.63675), power=0.8)
# Twosample t test power calculation
#
# n = 5071.166739
# ...
So with 75, it drops from 200k to 10.1k.
To work backwards: we know with 1 hit, we need a million SNP datasets (infeasible for any but the largest proprietary databases, who have no interest in studying this hypothesis), and with all hits we need more like 200 genomes (entirely doable with just publicly available datasets like PGP), but how many hits do we need to work with an inbetween amount of data like the ~2k genomes with ages I guess may be publicly available now or in the near future?
power.t.test(n=1000, power=0.8)
# Twosample t test power calculation
#
# n = 1000
# delta = 0.1253508704
hits=437;
mean1=hits*0.5; sd1=sqrt(hits*0.5*0.5);
mean2=hits*0.497; sd2=sqrt(hits*0.497*(10.497));
d=(mean1mean2)/mean(c(sd1,sd2)); d
# [1] 0.1254283986
With a polygenic score drawing on 437 hits, then a sample of 2k suffices to detect the maximum decrease.
This is pessimistic because the 10k alleles are not all the same effect size and GWAS studies inherently will tend to find the largest effects first. So the first 4 (or 86) hits are worth the most. The distribution of effects is probably something like an inverse exponential distribution: many small nearzero effects and a few large ones. Rietveld 2013 released the betas for all SNPs, and the beta estimates can be plotted; each estimate is imprecise and there are artifacts in the beta sizes (SSGAC confirms that they were rounded to 3 decimals), but the distribution looks like a radioactive halflife graph, an inverse exponential distribution. With a mean of 1, we can simulate creating a set of 10k effect sizes which are exponentially distributed and have mean 5000 and SD close to (but larger than) 50 and mimics closely the binomial model:
effects < sort(rexp(10000)/1, decreasing=TRUE)
genomeOld < function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), 10000, replace = TRUE), 0, effects) }
mean(replicate(10000, sum(genomeOld())))
# [1] 5000.270218
sd(replicate(10000, sum(genomeOld())))
# [1] 69.82652816
genomeNew < function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 10.497), 10000, replace = TRUE), 0, effects) }
With a dysgenic effect of 0.445SDs, that’s a fall of the sum of random exponentials of ~31, which agrees closely with the difference in polygenic genome scores:
For each draw from the old and new populations, we can take the first 4 alleles, which were the ones assigned the largest effects, and build a weak polygenic score and compare means. For example:
polyNew < replicate(1000, sum(genomeNew()[1:4]))
polyOld < replicate(1000, sum(genomeOld()[1:4]))
t.test(polyOld, polyNew, alternative="greater")
# Welch Two Sample ttest
#
# data: polyOld and polyNew
# t = 0.12808985, df = 1995.8371, pvalue = 0.8980908
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# 0.7044731204 0.8029267301
# sample estimates:
# mean of x mean of y
# 17.72741040 17.67818359
Or to mimic 86 hits:
t.test(replicate(1000, sum(genomeOld()[1:86])), replicate(1000, sum(genomeNew()[1:86])))
#
# Welch Two Sample ttest
#
# t = 1.2268929, df = 1997.6307, pvalue = 0.2200074
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
# 0.8642674547 3.7525210076
# sample estimates:
# mean of x mean of y
# 244.5471658 243.1030390
Using the exponential simulation, we can do a parallelized power analysis: simulate draws (i=300) & tests for a variety of sample sizes to get an idea of what sample size we need to get decent power with 86 hits.
library(ggplot2)
library(parallel) # warning, Windows users
library(plyr)
genomeOld < function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), length(efft), replace = TRUE), 0, efft) }
genomeNew < function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 10.497), length(efft), replace = TRUE), 0, efft) }
simulateStudy < function(n, hits) {
effects < sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
polyOld < replicate(n, sum(genomeOld(effects)))
polyNew < replicate(n, sum(genomeNew(effects)))
t < t.test(polyOld, polyNew, alternative="greater")
return(data.frame(N=n, P=t$p.value, PO.mean=mean(polyOld), PO.sd=sd(polyOld), PN.mean=mean(polyNew), PN.sd=sd(polyNew))) }
hits < 86
parallelStudies < function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
sampleSizes < seq(500, 5000, by=100)
iters < 300
powerExponential < ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
# N P PO.mean PO.sd PN.mean
# Min. : 500 Min. :0.000000000 Min. :222.5525 Min. :23.84966 Min. :221.2894
# 1st Qu.:1600 1st Qu.:0.002991554 1st Qu.:242.8170 1st Qu.:26.46606 1st Qu.:241.3242
# Median :2750 Median :0.023639517 Median :247.2059 Median :27.04467 Median :245.7044
# Mean :2750 Mean :0.093184735 Mean :247.3352 Mean :27.06300 Mean :245.8298
# 3rd Qu.:3900 3rd Qu.:0.107997575 3rd Qu.:251.7787 3rd Qu.:27.64103 3rd Qu.:250.2157
# Max. :5000 Max. :0.997322161 Max. :276.2614 Max. :30.67000 Max. :275.7741
# PN.sd
# Min. :23.04527
# 1st Qu.:26.45508
# Median :27.04299
# Mean :27.05750
# 3rd Qu.:27.63241
# Max. :30.85065
powerExponential$Power < powerExponential$P<0.05
powers < aggregate(Power ~ N, mean, data=powerExponential); powers
# 1 500 0.2133333333
# 2 600 0.2833333333
# 3 700 0.2833333333
# 4 800 0.3133333333
# 5 900 0.3033333333
# 6 1000 0.3400000000
# 7 1100 0.4066666667
# 8 1200 0.3833333333
# 9 1300 0.4133333333
# 10 1400 0.4166666667
# 11 1500 0.4700000000
# 12 1600 0.4600000000
# 13 1700 0.4666666667
# 14 1800 0.4733333333
# 15 1900 0.5233333333
# 16 2000 0.5366666667
# 17 2100 0.6000000000
# 18 2200 0.5900000000
# 19 2300 0.5600000000
# 20 2400 0.6066666667
# 21 2500 0.6066666667
# 22 2600 0.6700000000
# 23 2700 0.6566666667
# 24 2800 0.7133333333
# 25 2900 0.7200000000
# 26 3000 0.7300000000
# 27 3100 0.7300000000
# 28 3200 0.7066666667
# 29 3300 0.7433333333
# 30 3400 0.7133333333
# 31 3500 0.7233333333
# 32 3600 0.7200000000
# 33 3700 0.7766666667
# 34 3800 0.7933333333
# 35 3900 0.7700000000
# 36 4000 0.8100000000
# 37 4100 0.7766666667
# 38 4200 0.8000000000
# 39 4300 0.8333333333
# 40 4400 0.8466666667
# 41 4500 0.8700000000
# 42 4600 0.8633333333
# 43 4700 0.8166666667
# 44 4800 0.8366666667
# 45 4900 0.8666666667
# 46 5000 0.8800000000
qplot(N, Power, data=powers) + stat_smooth()
So for a wellpowered twogroup comparison of 1950 & 2015 SNP datasets using 86 SNPs, we would want ~4000 in each group for a total n=8000; we do have nontrivial power even at a total n=1000 (500 in each group means 21% power) but a nonstatisticallysignificant result will be difficult to interpret and if one wanted to do that, reporting a Bayes factor from a Bayesian hypothesis test would make much more sense to express clearly whether the (nondefinitive) data is evidence for or against dysgenics.
This is still too optimistic since we assumed the optimal scenario of only very old and very new genomes, while available genomes are more likely to be distributed fairly uniformly between 1950 and 2015. Per “Optimal design in psychological research”, McClelland 1997, we expect a penalty of ~2x in sample size efficiency in going from the optimal twogroup extreme endpoints design to samples being uniformly distributed (due to much of our sample size being wasted on estimating small effects) and so we would expect our sample size requirement to at least double to around n=16000, but we can do a power simulation here as well. To get the effect size for each year, we simply split the frequency decrease over each year and generate hypothetical genomes with less of a frequency decrease uniformly distributed 19502015, and do a linear regression to get a pvalue for the year predictor:
hits < 86
sampleSizes < seq(8000, 30000, by=1000)
iters < 100
genome < function(effects) {
t < sample(c(1:(20151950)), 1)
decreasedFrequency < 0.5  (((0.50.497)/(20151950)) * t)
geneFlips < sample(c(FALSE,TRUE), prob=c(decreasedFrequency, 1decreasedFrequency), replace = TRUE, length(effects))
geneValues < ifelse(geneFlips, effects, 0)
return(data.frame(Year=1950+t,
PolygenicScore=sum(geneValues)))
}
simulateStudy < function(n, hits) {
effects < sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
d < ldply(replicate(n, genome(effects), simplify=FALSE))
l < lm(PolygenicScore ~ Year, data=d)
p < anova(l)$`Pr(>F)`[1]
return(data.frame(N=n, P=p, PO.mean=predict(l, newdata=data.frame(Year=1950)),
PN.mean=predict(l, newdata=data.frame(Year=2015)))) }
parallelStudies < function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
powerExponentialDistributed < ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
powerExponentialDistributed$Power < powerExponentialDistributed$P<0.05
powers < aggregate(Power ~ N, mean, data=powerExponentialDistributed); powers
# N Power
# 1 8000 0.27
# 2 9000 0.32
# 3 10000 0.35
# 4 11000 0.33
# 5 12000 0.41
# 6 13000 0.34
# 7 14000 0.41
# 8 15000 0.48
# 9 16000 0.55
# 10 17000 0.62
# 11 18000 0.55
# 12 19000 0.60
# 13 20000 0.69
# 14 21000 0.61
# 15 22000 0.65
# 16 23000 0.63
# 17 24000 0.71
# 18 25000 0.67
# 19 26000 0.71
# 20 27000 0.74
# 21 28000 0.70
# 22 29000 0.79
# 23 30000 0.83
qplot(N, Power, data=powers) + stat_smooth()
In this case, the power simulation suggestions the need for triple rather than double the data, and so a total of n=30,000 to be wellpowered.
Mutation load
The paternal mutation load should show up as a increase (70 new mutations per generation, 35 years per generation, so ~2 per year on average) over the past century, while the genetic selection will operate by reducing the frequency of variants which increase intelligence. If there are ~70 new mutations per generation and 2 harmful, and there is no longer any purifying selection so that all 70 will tend to remain present, how much does that compare to existing mutation load averages and, more importantly, standard deviations?
A mutation load review leads me to some hard figures from Simons et al 2014 (supplement) using data from Fu et al 2012; particularly relevant is figure 3, the number of singlenucleotide variants per person over the EuropeanAmerican sample, split by estimates of harm from least to most likely: 21345 + 15231 + 5338 + 1682 + 1969 = 45565
. The supplementary tables gives a count of all observed SNVs by category, which sum to 300209 + 8355 + 220391 + 7001 + 351265 + 10293 = 897514
, so the average frequency must be 45565/897514=0.05
, and then the binomial SD will be sqrt(897514*0.05*(10.05))=206.47
. Considering the twosample case of 1950 vs 2015, that’s an increase of 130 total SNVs (65*2
), which is 0.63SDs, hence:
power.t.test(d=(130/206), power=0.8)
# Twosample t test power calculation
#
# n = 40.40035398
# ...
A total of n=80.
This particular set up for the twosample test can be seen as a linear model with the optimum design of allocating half the sample to each extreme (see again McClelland 1997); but more realistically, there is an even distribution across years, in which case the penalty is 2x and n=160.
Weaknesses
There are some potential problems:
Range restriction: in many IQrelated studies, failure to account for selection effects yielding a limited range of IQs may seriously understate the true correlation; this is true in general but particularly common in IQ studies because selection on IQ (eg samples of convenience using only college students) is so universal in human society
This may not be such a large issue when dealing with polygenic scores; even severe IQ selection effects will increase polygenic scores only somewhat because the polygenic scores explain so little of IQ variance in the first place.
Selfselection by age: if people providing genetic data are not random samples, then there may be pseudotrends which can mask a real dysgenic trend or create a pseudodysgenic trend where there is none. For example, if young people buying genome or SNP data tend to be aboveaverage in intelligence and scientific interest (which anecdotally they certainly do seem to be), while old people tend to get genomes or SNP data due to health problems (and otherwise have average levels of intelligence and thus polygenic score), then in comparing young vs old, one might find not a dysgenic but a pseudoeugenic trend instead! Conversely, it could be the other way around, if much fewer elderly get genetic data and younger people are more concerned about future health or are going along with a fad, producing a pseudodysgenic effect instead (eg in the PGP genome data, there seem to be disproportionately more PhDs who are quite elderly, while younger participants are a more scattershot sample from the general population; probably relating to the circumstances of PGP’s founding & Harvard home).
This is probably an issue with databases that rely on voluntary individual contributions, such as PGP, where selection effects have free play. It would be much less of an issue with longitudinal studies where motivations and participation rates will not differ much by age. Since most dysgenic theories accept that recorded IQ scores have remained stable over the 20th century and the decreases in genetic potential either have not manifested yet or have been masked by the Flynn effect & greater familiarity with tests & loss of some gloading, one might reason that proxies like educational achievement should be increasing throughout one’s sample (since they are known to have increased), and a lack of such a trend indicates selection bias.
Genetic data availability
Proprietary
The known proprietary databases have long been large enough to carry out either analysis, as well as countless other analyses (but have failed to and represent a tragedy of the anticommons):
The mutation load analysis requires a wholegenome sample size small enough to have been carried out by innumerable groups post2009.
For SNPs, an incomplete list of examples of publications based on large samples:
 23andMe reached 1 million customers in July 2015, of whom >=80% optin to research (>2m as of August 2017, still 80% optin, and >6m as of November 2017 with sales increasing); the first questions 23andMe asks all customers are age and education, so they likely have at least 700,000 usable SNPs for both discovering educational associations & dysgenic tests. In June 2010, they claimed to have 29k optins out of 50k customers, implying they were wellpowered for a dysgenic test in 2010 if they had access to a polygenic score, and that in the absence of the score, they could have found the 85 SSGAC hits (using n=305k) themselves somewhere around mid2011 or 2012 and then done a dysgenic test.
 the SSGAC collaboration has n=305k as of late 2015
 GIANT: height: n=253,288
 cholesterol: n=188,000
 UK Biobank: n=152,729 (sequenced/published on as of June 2015; 500k were enrolled and will be covered eventually)
 diabetes: n=114,981
 Psychiatric Genomics Consortium: has run multiple studies of varying sizes, the secondlargest (a bipolar study) peaking at a total control group of n=51672, and the largest (schizophrenia) at a control group of n=113,075
 Parkinson’s: n=100,833 (may overlap with 23andMe and some others)
 eczema: n=95,464
 Genetics of Personality Consortium: eg Neuroticism: n=63k
 Dutch LifeLines Biobank and Cohort: n>13,000
 Health and Retirement Survey: n=12,500
 Swedish TwinGene project: n=10,682
 TwinsUK registry: n=4,905
 Generation Scotland: the Scottish Family Health Study: n=24,000
The existing private groups do not seem to have any interest in testing dysgenics, with the possible exception of future GWAS studies examining fertility, one of which is mentioned by Mills & Tropf 2015:
At the time of writing this review, Mills and her research team at the University of Oxford are currently leading a large consortium to engage in the first ever genomewide association search (GWAS) and metaanalysis of reproductive choice (age at first birth; number of children), conducted in both men and women in over 50 data sets, with the results replicated in additional datasets in a large sample.
The hits in such a GWAS might overlap with intelligence hits, and if the multiple hits increase intelligence but decrease fertility or vice versa (as compared to decreasing or increasing both), that would be evidence for dysgenics. Or, assuming the betas are reported, polygenic scores for fertility and intelligence could be estimated in independent samples and checked for an inverse correlation.
Public
There are a few sources of data, primarily SNP data, which are freely available to all users:
1000 Genomes: unusable due to a deliberate policy decision by 1000 Genomes to delete all phenotype data, including age; similar is 69 Genomes. Both likely would be unusable due to the diversity of the global sample (there is no reason to think that dysgenics pressures are operating in every population at the same strength)
OpenSNP: hosting for userprovided SNP & phenotype data with dumps available; hosts ~2k SNP datasets, but only 270 users have birthyears
SNPedia likewise hosts SNP data (overlapping with OpenSNP) and genome data, but a very small number
Genomes unzipped provides a small amount of data
DNA.LAND: claims n=8k based on public participation & input (n=43k as of May 2017), but seems to then restrict access to a small set of researchers
Exome Aggregation Consortium: n=61,486 exomes; phenotype data is unavailable
Personal Genome Project (PGP): probably the single largest source of open SNP & genome data. ~1252 participants have registered birthdates according to
demographics.tsv
, and their statistics page’s graphs indicates <300 whole genomes and <1k SNPs. Phenotype data has been recently released as a SQLite database, making it easier to work with. Genomes: browsing the user lists for ‘Whole genome datasets’, I estimate a total of ~222; looking at the first and last 22 entries, 34 had ages/birthyears, so ~75% of the whole genomes come with the necessary birthyear data, indicating ~166 usable genomes for the purpose of testing dysgenics. With the most recent one uploaded on 20151012, and the earliest recorded being 20110916, that suggests the available genome number increases by ~0.25/day. 166 is uncomfortably close to the requirement for a wellpowered test, and there may not be enough data to account for glitches in the data or allow for more complicated statistical testing, but if we wanted to double the available data, we’d only need to wait around 885 days or 2.5 years (or less, depending on whether the collapse in genome sequencing prices continue and prices drop below even the current $1k genomes).
 SNPs: PGP has ~656 23andMe SNP datasets (the number of SNP datasets sourced from other providers is quite small so I didn’t include them), dated 2015102120110106, so assuming same birthdate percentage, 0.37 per day. Unfortunately, to get 30k SNP datasets through PGP, we would have to wait (linearly extrapolating) 291 years. (Making matters worse, in October 2015, 23andMe doubled its price and reduced the quality of SNP coverage, which will discourage many users and push other users to purchase wholegenome sequencing instead.)
Power analysis for racial admixture studies of continuous variables
I consider power analysis of a genomic racial admixture study for detecting genetic group differences affecting a continuous trait such as IQ in US AfricanAmericans, where ancestry is directly measured by genome sequencing and the comparisons are all withinfamily to eliminate confounding by population structure or racism/colorism/discrimination. The necessary sample size for wellpowered studies is closely related to the average size of differences in ancestry percentage between siblings, as the upper bound on IQ effect per percentage is small, requiring large differences in ancestry to detect easily. A withinfamily comparison of siblings, due to the relatively small differences in ancestry between siblings estimated from IBD measurements of siblings, might require n>50,000 pairs of siblings to detect possible effects on IQ, an infeasible sample size. An alternative design focuses on increasing the available ancestry differences within a family unit by comparing adoptees with siblings; the larger withinpopulation standard deviation of ancestry creates larger & more easilydetected IQ differences. A randomeffects metaanalysis of past admixture & ancestry studies suggests the SD in heterogeneous samples may range from 2% to 20% with a mean of 11% (95% predictive interval), yielding sample sizes of n>20,000, n=1100, and n=500. Hence, an adoption study is probably in the feasible range, with required sample sizes comparable to annual adoption rates among US AfricanAmericans.
Admixture studies examine racial phenotypic differences in traits such as blood pressure by comparing people with ancestry from multiple groups, and correlating differences in ancestry percentage with differences in the phenotype. So, for example, AfricanAmericans have higher bloodpressure than white Americans, and most AfricanAmericans have an average white ancestry of something like 2025% (see later); if having 26% white ancestry predicts slightly lower blood pressure while 24% predicts higher, that suggests the difference is (as is currently believed) genetic; and this logic can be used to narrow down to specific chromosome regions, and has contributed to study of racial differences in disease.
One application would be to thorny questions like potential group differences in nonmedical traits like intelligence. The standard admixture design, requiring a few thousand subjects spanning the full range, might not necessarily work here here because of the claimed environmental effects. A proposed resolution to the question is to do an admixture study comparing AfricanAmerican siblings. Siblings are highly genetically related on average (50%) but in a randomized fashion due to recombination; so two siblings, including fraternal twins, born to the same parents in the same family in the same neighborhood going to the same schools, will nevertheless have many different variants, and will differ in how related they are—the average is 50% but it could be as low as 45% or high as 55%. So given two siblings, they will differ slightly in their white ancestry, and if indeed white ancestry brings with it more intelligence variants, then the sibling with a higher whiter percentage ought to be slightly more intelligent on average, and this effect will have to be causal, as the inheritance is randomized and all other factors are equal by design. (A result using ancestry percentages measured in the general population, outside families, would be able to make far more powerful comparisons by comparing people with ~0% white ancestry to those with anywhere up to 100%, and require small sample sizes, and such analyses have been done with the expected result, but are ambiguous & totally unconvincing, as the correlation of greater whiteness with intelligence could easily be due to greater SES or greater blackness could be a marker for recent immigration or any of a number of confounds that exist.) This has historically been difficult or impossible since how does one measure the actual ancestry in siblings? But with the rise of cheap genotyping, precise measure of actual (rather than average) ancestry can be done for <$100, so that is no longer an obstacle.
Sibling power analysis
How many sibling pairs would this require?
 you are trying to regress
IQ_difference ~ Ancestry_difference
 the SD of the IQ difference of siblings is known—it’s ~13 IQ points (nonshared environment + differences in genetics)
 of this, a small fraction will be explained by the small difference in ancestry percentage
 the power will be determined by the ratio of the sibling SD to the IQdifferenceduetoancestrydifference SD, giving an effect size, which combined with the usual alpha=0.05 and beta=0.80, uniquely determines the sample size
 IQdifferenceduetoancestrydifference SD will be the advantage of better ancestry times how much ancestry differs
 if you knew the number of relevant alleles, you could calculate through the binomial the expected SD of sibling ancestor differences. As there are so many alleles, it will be almost exactly normal. So it’s not surprising that siblings overall, for all variants, are 50% IBD with a SD of 4%.
If we treated it as simply as possible, Visscher 2006 for an analogous height analysis says they measured 588 markers. So a binomial with 588 draws and p=0.5 implies that 147 markers are expected to be the same:
and the distribution around 147 is 12, which is ~8%:
Visscher does a more complicated analysis taking into account closeness of the markers and gets a SD of 3.9%: equation 7; variance = 1/(16*L)  (1/3*L^2)
, where L = 35, so
And Hill & Weir 2011’s theoretical modeling gives an expected sibling SD of SD of 3.92%/3.84% (Table 2), which are nearly identical. So whatever the mean admixture is, I suppose it’ll have a similar SD of 48% of itself.
IIRC, AfricanAmericans are ~25% admixed, so with a mean admixture of 25%, we would expect siblings differences to be or 1% difference.
If that 75% missing white ancestry accounts for 9 IQ points or 0.6SDs, then each percentage of white ancestry would be 0.6/75 =0.008 SDs.
So that SD of 1% more white ancestry yields an SD of 0.008 IQ, which is superimposed on the full sibling difference of 0.866, giving a standardized effect size/d of 0.008 / 0.866 = 0.0092
Let me try a power simulation:
n < 10000
siblings < data.frame(
sibling1AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),
sibling2AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))
siblings$sibling1TotalIQ < with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(0.008*15))
siblings$sibling2TotalIQ < with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(0.008*15))
siblings$siblingAncestryDifference < with(siblings, sibling1AncestryPercentage  sibling2AncestryPercentage)
siblings$siblingIQDifference < with(siblings, sibling1TotalIQ  sibling2TotalIQ )
summary(siblings)
# ...
# siblingAncestryDifference siblingIQDifference
# Min. :5.370128122 Min. :68.2971343
# 1st Qu.:0.932086950 1st Qu.:11.7903864
# Median : 0.002384529 Median : 0.2501536
# Mean : 0.007831583 Mean : 0.4166863
# 3rd Qu.: 0.938513265 3rd Qu.: 11.0720667
# Max. : 5.271052675 Max. : 67.5569825
summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# ...Coefficients:
# Estimate Std. Error t value Pr(>t)
# (Intercept) 0.4192761 0.1705125 2.45892 0.0139525
# siblingAncestryDifference 0.3306871 0.1220813 2.70874 0.0067653
#
# Residual standard error: 17.05098 on 9998 degrees of freedom
# Multiple Rsquared: 0.000733338, Adjusted Rsquared: 0.0006333913
# Fstatistic: 7.337294 on 1 and 9998 DF, pvalue: 0.006765343
confint(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# 2.5 % 97.5 %
# (Intercept) 0.75351500523 0.08503724643
# siblingAncestryDifference 0.09138308561 0.56999105507
admixtureTest < function(n, alpha=0.05, ancestryEffect=0.008) {
siblings < data.frame(
sibling1AncestryPercentage =pmax(0, rnorm(n, mean=25, sd=1)),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),
sibling2AncestryPercentage = pmax(0,rnorm(n, mean=25, sd=1)),
sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))
siblings$sibling1TotalIQ < with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
siblings$sibling2TotalIQ < with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(ancestryEffect*15))
siblings$siblingAncestryDifference < with(siblings, sibling1AncestryPercentage  sibling2AncestryPercentage)
siblings$siblingIQDifference < with(siblings, sibling1TotalIQ  sibling2TotalIQ )
p < summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]
return(p<alpha)
}
power < function(n, iters=10000, n.parallel=8) {
library(parallel)
library(plyr)
mean(unlist(mclapply(1:n.parallel, function(i) {
replicate(iters/n.parallel, admixtureTest(n)) }))) }
# powers < sapply(seq(100, 10000, by=10), power)
power(100)
# [1] 0.0502
power(500)
# [1] 0.0535
power(1500)
# [1] 0.0642
power(15000)
# [1] 0.2251
power(50000)
# [1] 0.6077
So the estimated sample size is extremely large, well into the scores of thousands. This is large enough that it will be some time before biobanks or population samples are wellpowered: participants may not have siblings, those siblings may be included, only ~15% of the American population is AA, all participants must be sequenced, and so on. If it requires n=70,000, half of participants have a matching sibling, and it’s drawn proportionally from the general population, that would imply that a total sample size of almost 1m. (For comparison, that’s about twice the size of the UK Biobank, and the US Precision Medicine Initiative cohort program aims for 1m total participants by 2020.)
Simplifications aside, it is difficult to see any way to bring this method down into the low thousands range, as that would require siblings to vary tremendously more in ancestry, have much more homogeneous IQs than they do, or ancestry to be vastly more potent than it could possibly be.
Adoption power analysis
One possibility would be to examine a different population, perhaps one with more African ancestry and thus larger betweensibling ancestry differences and effects, such as Brazil; but that would raise questions about relevance to the USA. So another possibility is to drop the idea of using only biological siblings. Is there any way to have ancestry differences as large as in the general population, but within a family? Halfsiblings come to mind but those more typically tend to join the household as older kids or teenagers, so aren’t so good. One possibility is adoptees: there are a substantial number of AfricanAmerican children adopted into other AfricanAmerican households (white parents adopting black children is controversial and rarer, described as “decreasing dramatically”, but still substantial in total numbers, at least 20,000), and even a noticeable number of African children adopted abroad (14,800 from Ethiopia just 19992014, with more adoption from Nigeria & the Congo). The same logic of the withinfamily study should apply but to unrelated siblings who will have far greater differences in ancestry now (possibly anywhere up to 50% if an African child is adopted into an AfricanAmerican family with considerable white ancestry & some luck). This would increase power dramatically, perhaps enough to bring the study within the realm of nearfuture feasibility.
Examining adoptions of African children would not be a convincing way of establishing group differences, particularly for IQ, as there are many known environmental harms (eg prenatal lack of iodine is known to cause large impairments in cognitive which cannot be repaired later in life, and poor iodization is frequent in Africa), so while examining African adoptees would doubtless require a very small sample size, the results would be uninterpretable. So the more interesting case is instead examining AA adoptees/siblings, all of whom are raised in a wealthy (and iodinesufficient) industrialized country.
In this case, we’re considering a pair of an AA sibling with the same IQ & ancestry distributions, as compared with adoptees who are either African (100% African ancestry) or likewise have the same IQ/ancestry distributions. Since the sibling & adoptee are unrelated, they effectively vary & differ as much as two random people from the general population would in IQ & African ancestry, except for sharedenvironment effects on IQ; sharedenvironment for adult IQ is relatively low, maybe 10% of variance. So instead of an SD of 15, they would vary moderately less, like 14.23 points (sqrt(15^2 * 0.9)
).
One assumption here is a shared mean: one would actually expect, given the association of lighter skin with higher wealth/SES and darker with lower wealth/SES, that the adopting parents (and hence their biological children) would be relatively high on European ancestry, and conversely, the mothers giving up children for adoption would be relatively low, so the expected difference in ancestry is higher than simulated. Assuming equal means, however, is a conservative assumption since if such a correlation holds, the differences will be larger, hence the ancestry effect sizes larger, hence smaller sample sizes required. In the extreme version of this, the adoptive family is white and so the ancestry difference is maximal (~99% vs ~20%), requiring even smaller sample sizes, but at the cost of introducing complications like whether there are interactions with the white adoptive family not present in an AA adoptive family; in any case, such transracial adoption is apparently unpopular now, so it may not come up much.
Mean population European ancestry & population standard deviation
Ancestry effects remain as before; the mean ancestry is not too important as long as it’s not near 0, but since adoptees are drawn from the general population, the ancestry SD must be adjusted but it’s unclear what the right SD here is—cited studies range from 4% up to 11%, and this is a key parameter for power (with 4%, then sibling and adoptee will tend to be quite similar on ancestry percentage & much more data will be required, but with 11% they will differ a good deal and make results stronger).
Reported figures from the genetics literature for European ancestry in US AfricanAmerican range from 14% to 24%, reflecting both sampling error and various biases & selfselection & geographic/regional effects in the datasets:
Halder et al 2008, “A Panel of Ancestry Informative Markers for Estimating Individual Biogeographical Ancestry and Admixture From Four Continents: Utility and Applications”: 14.3%, SD 13.3% (n=136)
Ducci et al 2009, “Association of Substance Use Disorders With Childhood Trauma but not African Genetic Heritage in an African American Cohort”: 7%, SD 9% (n=864); this is a little odd since they allocate the 21% nonAfrican ancestry (African ancestry: 079%, SD 14%) to Europe, Middle East, Central Asia, Native American, East Asia, and Oceania, but it’s hard to imagine that really corresponds to the ancestry of New Jersey AfricanAmericans; if we ignore that attempt at splitting the nonAfrican ancestry, 21% & SD 14% might be more consistent.
Gravlee et al 2009, “Genetic ancestry, social classification, and racial inequalities in blood pressure in Southeastern Puerto Rico”: European ancestry isn’t reported and Puerto Ricans have too much other ancestry to simply subtract the reported African ancestries
Signorello et al 2010, “Blood vitamin D levels in relation to genetic estimation of African ancestry”, Table 1: mean African ancestry of 92.9% implies European ancestry ~7.1%, ~IQR of African ancestry implies SD ~8%, (n=379)
Bryc et al 2010, “Genomewide patterns of population structure and admixture in West Africans and African Americans”: median 18.5%, 25th75th percentiles: 11.627.7%; no mean or SD reported in the paper or supplemental materials I could find, but the median/percentiles suggest an SD of ~4.65% (
qnorm(0.25) * (0.116  0.185)
; n=365)Nassir et al 2012, “Relationship between Adiposity and Admixture in African American and Hispanic American Women”: 22.5%, SD 14.7% (n=11712)
Henry Louis Gates Jr summarizes results available to him in February 2013: 29%, 22%, 19%, 19% (unspecified sample sizes & population SDs)
 According to Ancestry.com, the average African American is 65% subSaharan African, 29% European and 2% Native American.
 According to 23andme.com, the average African American is 75% subSaharan African, 22% European and only 0.6% Native American.
 According to Family Tree DNA.com, the average African American is 72.95% subSaharan African, 22.83% European and 1.7% Native American.
 According to National Geographic’s Genographic Project, the average African American is 80% subSaharan African, 19% European and 1% Native American.
 According to AfricanDNA, in which I am a partner with Family Tree DNA, the average African American is 79% subSaharan African, 19% European and 2% Native American.
Bryc et al 2014, “The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States”: mean 24%, SD not reported, and Bryc declined to provide any information about the study & population SD when contacted but pixelcounting Figure S1/Figure S18 yields the mean of 24% and an SD of ~17% (n=5269)
Baharian et al 2016, “The Great Migration and AfricanAmerican Genomic Diversity”: reports 3 cohorts with European ancestry at:
 SCCS: 14(13.6514.43)% (n=2128)
 HRS: 16.7(16.1617.27)% (n=1501)
 ASW: 21.3(19.5023.20)% (n=97)
but no population SDs, with statisticallysignificant differences likely due to recruiting:
The overall proportion of African ancestry is substantially higher in the SCCS and HRS than in the ASW and the recently published 23andMe cohort [12] (Table 1). The HRS cohort can be thought of as representative of the entire AfricanAmerican population, while the SCCS focuses primarily on individuals attending community health centers in rural, underserved locations in the South. By contrast, the sampling for the ASW and 23andMe did not aim for specific representativeness, and the ascertainment in the 23andMe cohort might have enriched for individuals with elevated European ancestry (see Materials and Methods and discussion in [12]). In the HRS, average African ancestry proportion is 83% in the South and lower in the North (80%, bootstrap p= 6 × 10−6) and West (79%, p= 10−4) (Fig 1). Within the SCCS, African ancestry proportion is highest in Florida (89%) and South Carolina (88%) and lowest in Louisiana (75%) with all three significantly different from the mean (Florida p=0.006, South Carolina p=4 × 10−4, and Louisiana p< 10−5; bootstrap). The elevated African ancestry proportion in Florida and South Carolina is also observed in the HRS and in the 23andMe study [12], but Louisiana is more variable across cohorts (Fig 1E). As expected, European ancestry proportions largely complement those of African ancestry across the US.
Kirkegaard et al 2016, “Biogeographic ancestry, cognitive ability, and socioeconomic outcomes”’s supplementary information: 17%, SD 11% (n=140)
Considering just studies with usable ancestry percentages, population SD, and n, and using inferred SDs from Signorello:
admixture < read.csv(stdin(), header=TRUE, colClasses=c("factor", "numeric", "numeric", "integer"))
Study,Mean,SD,N
"Halder et al 2008",0.143,0.133,136
"Ducci et al 2009",0.07,0.09,864
"Signorello et al 2010",0.071,0.08,379
"Bryc et al 2010",0.185,0.0465,365
"Nassir et al 2012",0.225,0.147,11712
"Bryc et al 2014",0.24,0.17,5269
"Kirkegaard et al 2016",0.17,0.11,140
# what is the standard error/precision of a population SD? http://davidmlane.com/hyperstat/A19196.html
admixture$SD.SE < (0.71*admixture$SD) / sqrt(admixture$N)
summary(admixture)
# Study Mean SD N SD.SE
# Bryc et al 2010 :1 Min. :0.0700000 Min. :0.0465000 Min. : 136.0 Min. :0.0009644066
# Bryc et al 2014 :1 1st Qu.:0.1070000 1st Qu.:0.0850000 1st Qu.: 252.5 1st Qu.:0.0016954481
# Ducci et al 2009 :1 Median :0.1700000 Median :0.1100000 Median : 379.0 Median :0.0021739221
# Halder et al 2008 :1 Mean :0.1577143 Mean :0.1109286 Mean : 2695.0 Mean :0.0034492579
# Kirkegaard et al 2016:1 3rd Qu.:0.2050000 3rd Qu.:0.1400000 3rd Qu.: 3066.5 3rd Qu.:0.0047591374
# Nassir et al 2012 :1 Max. :0.2400000 Max. :0.1700000 Max. :11712.0 Max. :0.0080973057
# Signorello et al 2010:1
library(metafor)
r.mean < rma(yi=Mean, sei=SD/sqrt(N), measure="SMD", ni=N, data=admixture); r.mean
# RandomEffects Model (k = 7; tau^2 estimator: REML)
#
# tau^2 (estimated amount of total heterogeneity): 0.0046 (SE = 0.0027)
# tau (square root of estimated tau^2 value): 0.0680
# I^2 (total heterogeneity / total variability): 99.82%
# H^2 (total variability / sampling variability): 566.51
#
# Test for Heterogeneity:
# Q(df = 6) = 3477.2614, pval < .0001
#
# Model Results:
#
# estimate se zval pval ci.lb ci.ub
# 0.1578 0.0258 6.1187 <.0001 0.1072 0.2083
predict(r.mean)
# pred se ci.lb ci.ub cr.lb cr.ub
# 0.1578 0.0258 0.1072 0.2083 0.0153 0.3003
r.sd < rma(yi=SD, sei=SD.SE, measure="SMD", ni=N, data=admixture); r.sd
# RandomEffects Model (k = 7; tau^2 estimator: REML)
#
# tau^2 (estimated amount of total heterogeneity): 0.0018 (SE = 0.0011)
# tau (square root of estimated tau^2 value): 0.0425
# I^2 (total heterogeneity / total variability): 99.77%
# H^2 (total variability / sampling variability): 440.67
#
# Test for Heterogeneity:
# Q(df = 6) = 3819.2793, pval < .0001
#
# Model Results:
#
# estimate se zval pval ci.lb ci.ub
# 0.1108 0.0162 6.8587 <.0001 0.0792 0.1425
predict(r.sd)
# pred se ci.lb ci.ub cr.lb cr.ub
# 0.1108 0.0162 0.0792 0.1425 0.0216 0.2001
par(mfrow=c(2,1))
forest(r.mean, slab=admixture$Study)
forest(r.sd, slab=admixture$Study)
There is high heterogeneity, as expected, and the metaanalytic summary are consistent with simply taking the mean, so metaanalysis was not really necessary.
The issue of heterogeneity depends on how one wants to interpret these numbers: as the true latent AfricanAmerican population mean/SD of European ancestry, or as a way to estimate the possible spread of sampling? In the former, the heterogeneity is a serious issue because it suggests the estimate may be badly biased or at least is highly imprecise; in the latter, it is both a curse and a benefit, since it implies that it is possible to recruit for genetics studies samples with a wide range of ancestry (thereby greatly increasing statistical power) but also that one might get unlucky & wind up with a very ancestryhomogeneous sample (if the sample turns out to have an SD as high as 20%, excellent; if it’s as low as 7.9%, one is in trouble).
So for power analysis one might check the metaanalytic mean case, as well as the prediction interval (a 95% CI around the SD/mean does not mean that 95% of the true effects, including the inherent heterogeneity, will fall in that interval): SDs of 2%, 11%, and 20%. (For any costbenefit analysis or trying to optimize expenditures, one would want to work with the posterior distributions to average over everything, but for just general informative purposes, those 3 are good parameters to check.)
Power simulation
Code:
adopteeTest < function(n, alpha=0.05, ancestryEffect=0.008, populationAncestryMean=0.1440, populationAncestrySD=0.1008, c=0.1) {
unrelatedSiblingSD < sqrt(15^2 * (1c)) # subtract 10% for same sharedenvironment
siblings < data.frame(
sibling1AncestryPercentage = pmax(0, rnorm(n, mean=populationAncestryMean*100, sd=populationAncestrySD*100)),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=unrelatedSiblingSD),
adopteeAncestryPercentage = pmax(0, rnorm(n, mean=populationAncestryMean*100, sd=populationAncestrySD*100)),
adopteeNonancestryIQ = rnorm(n, mean=0, sd=unrelatedSiblingSD))
siblings$sibling1TotalIQ < with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
siblings$adopteeTotalIQ < with(siblings, adopteeNonancestryIQ + adopteeAncestryPercentage*(ancestryEffect*15))
siblings$siblingAncestryDifference < with(siblings, sibling1AncestryPercentage  adopteeAncestryPercentage)
siblings$siblingIQDifference < with(siblings, sibling1TotalIQ  adopteeTotalIQ )
p < summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]
return(p<alpha)
}
power < function(n, sd, iters=10000, n.parallel=8) {
library(parallel)
library(plyr)
mean(unlist(mclapply(1:n.parallel, function(i) {
replicate(iters/n.parallel, adopteeTest(n, populationAncestrySD=sd)) }))) }
ns < seq(100, 10000, by=100)
powerLow < sapply(ns, function(n) { power(n, sd=0.0216)})
powerMean < sapply(ns, function(n) { power(n, sd=0.1108)})
powerHigh < sapply(ns, function(n) { power(n, sd=0.2001)})
library(ggplot2); library(gridExtra)
pl < qplot(ns, powerLow) + coord_cartesian(ylim = c(0,1))
pm < qplot(ns, powerMean) + coord_cartesian(ylim = c(0,1))
ph < qplot(ns, powerHigh) + coord_cartesian(ylim = c(0,1))
grid.arrange(pl, pm, ph, ncol=1)
So for the worstcase SD, sample size is unclear but n>20,000 pairs; mean SD, n=1100 pairs; high SD, n=500 pairs. The latter two are feasible amounts for population registries or adoptionfocused cohort studies. Thus genome adoption studies, combined with the much less powerful but more common withinsibling comparisons, are capable of delivering precise answers to longstanding questions about the origins of group differences with moderate sample sizes.
Operating on an aneurysm
In the excellent neurosurgery memoir Do No Harm: Stories of Life, Death, and Brain Surgery (Henry Marsh 2014), chapter 2 “Aneurysm”, there is a passage on weighing the costs of action and inaction:
“A thirtytwoyearold woman,” he said tersely. “For surgery today. Had some headaches and had a brain scan.” As he talked a brain scan flashed up on the wall.
…“It’s an unruptured aneurysm, seven millimetres in size,” Fiona—the most experienced of the registrars—said. “So there’s a point zero five per cent risk of rupture per year according to the international study published in 1998.” “And if it ruptures?” “Fifteen per cent of people die immediately and another thirty per cent die within the next few weeks, usually from a further bleed and then there’s a compound interest rate of four per cent per year.”
…If we did nothing the patient might eventually suffer a haemorrhage which would probably cause a catastrophic stroke or kill her. But then she might die years away from something else without the aneurysm ever having burst. She was perfectly well at the moment, the headaches for which she had had the scan were irrelevant and had got better. The aneurysm had been discovered by chance. If I operated I could cause a stroke and wreck her—the risk of that would probably be about four or five per cent. So the acute risk of operating was roughly similar to the lifetime risk of doing nothing. Yet if we did nothing she would have to live with the knowledge that the aneurysm was sitting there in her brain and might kill her any moment.
Reading this, I was a little surprised by Marsh’s evaluation given those specific numbers. Intuitively, it did not seem to me that a single risk of 5% was anywhere near as bad as a lifelong risk of 0.5%, for a 32 year old woman who would probably live another 50 years—the one number is 10x bigger than the other, but the other number is 50x bigger, and a quick heuristic for the total probability of many independent small probabilities is to just sum them up, suggesting that the risk of the untreated aneurysm was much worse (50*0.005=0.25
, and 0.25>0.05). So I thought after I finished reading the book, I would work it out a little more accurately.
Risk
Specifically, this is a 32yo woman and the UK female life expectancy is ~80yo in 2015, so she had ~48 years left. The consequences of the aneurysm bursting is a large chance of instant death or else severe disability with death to soon follow; the consequence of surgery going wrong is also instant death or severe disability, presumably with a high chance of death soon following, so it looks like we can assume that the bad outcome in either case is the same. what is the probability of the aneurysm never bursting in all 48 years? (10.005)^48
= 0.786, or a probability of bursting of 21%. 21% is 4x larger than 5%. Since 21% is 4x larger and the consequences are similar, this would suggest that the risks are not “roughly similar” and it looks much worse to not operate.
Expected loss
But that’s just the risk of an event, not the expected loss:
In the case of doing surgery immediately, the expected loss, with years treated equally and a 5% instant risk from operation, is simply
48 * 0.005 = 0.24
years of life; all 48 years are risked on a single throw of the surgical dice, but after that she is safe.In the case of doing nothing and letting the aneurysm stay with a 0.5% annual risk from nonoperation, it’s not as simple as
48 * 0.21 = 10.1
years, because you cannot die of an aneurysm if you died in a previous year. The risk will instead follow a negative binomial distribution (number of years until 1 failure), and then the loss is the 48 years minus however many she actually got. That’s not the same as the expectation of the negative binomial, which in this case is 200 years (the expectation of a negative binomial with 1 failure and a success rate of 10.005 is1/(1(10.005))=200
) and she will die of other causes before then, in which case the aneurysm turned out to be harmless.We can simulate many draws from the negative binomial, ignore as 0 any time where the aneurysm struck after her life expectancy of 48 more years is past, hold onto the losses, and calculate the mean loss:
mean(sapply(rnbinom(10e4, 1, 0.005), function(years) { if(years>48) { 0; } else { 48years; }}))
~> 5.43.
So the expected loss from surgery looks even better than the risk did, as it is 22.6x smaller.
QALY/DALY adjustment
What about adjusting for older years being less valuable? We might say that the surgery look unfairly good because we are ignoring how its losses are frontloaded in the 30s, some of the best years of one’s life, and treating a loss of her 33rd year as being as bad as a loss of her 48th year. In terms of age weighting, DALYs usually use a 3% annual discounting; DALYs and QALYs differ in some ways but for this analysis I think we can treat them as equivalent and use the DALY agediscounting to calculate our QALYs. So we can redo the two expected losses including the discounting to get:
 Surgery:
0.05 * sum((10.03)^(0:48))
~> 1.291  No surgery:
mean(unlist(sapply(sapply(rnbinom(10e4, 1, 0.005), function(years) { if(years>48) { 0; } else { 48years; }}), function(yr) { sum((10.03)^(0:yr)); })))
~> 4.415
By appropriately penalizing the surgery’s loss of highquality early years as compared to the aneurysm’s loss of just some elderly years, the surgery’s superiority falls to 3.4x, and the gain is 3.124. (And if we include the mental wellbeing of the woman as a final touch, the surgery looks even better.)
How sensitive is the surgical superiority to the parameters?
 Surgical risk: a 4x increase in risk to 20% would create parity
 Aneurysm risk: if the annual risk of aneurysm were as low as 0.04% rather than 0.5%, then there would be parity
 Life expectancy & discount rate: no change will reverse the ordering
It seems extremely unlikely that Marsh could be as wrong about the surgical risk as to mistake 5% for 20%, especially for an operation he says he used to do routinely, and it also seems unlikely that the study on the annual risk of an aneurysm bursting could be as far off as 10x, so the difference is solid.
Costbenefit
Finally, having a surgery is much more expensive than not having it. Surgery is always expensive, and neurosurgery undoubtedly so—elsewhere in the book, Marsh quotes an American neurosurgeon’s estimate of $100,000 for a particularly complex case. Clipping an aneurysm surely cannot cost that much (being both much simpler and also being done in a more efficient healthcare system), but it’s still not going to be trivial. Does the cost of aneurysm surgery outweigh the benefit?
To convert the DALY loss to a dollar loss, we could note that UK PPP per capita is ~$38,160 (2013) so the gain from surgery would be (4.415  1.291) * 38169
=$119k, well above the $100k worstcase. Or more directly, the UK NHS prefers to pay <£20,000 per QALY and will generally reject treatments which cost >£30,000 per QALY as of 2007^{3} (implying QALYs are worth somewhat less than £30,000); the median US 2008 hospital cost for clipping an aneurysm is $36,188 or ~£23,500; and the gain is 3.124 QALYs for ~£7500/QALY—so clipping the aneurysm in this case definitely clears the costbenefit threshold (as we could have guessed from the fact that in the anecdote, the NHS allows her to have the surgery).
After calculating the loss of years, differing values of years, and cost of surgery, the surgery still comes out as substantially better than not operating.
The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example
Randomized experiments require more subjects the more variable each datapoint is to overcome the noise which obscures any effects of the intervention. Reducing noise enables better inferences with the same data, or less data to be collected, which can be done by balancing observed characteristics between control and experimental datapoints. A particularly dramatic example of this approach is running experiments on identical twins rather than regular people, because twins vary far less from each other than random people do. In 1931, the great statistician Student noted problems with an extremely large (n=20,000) Scottish experiment in feeding children milk (to see if they grew more in height or weight), and claimed that the experiment could have been done far more costeffectively with an extraordinary reduction of >95% fewer children if it had been conducted using twins. He, however, did not provide any calculations or data demonstrating this. I revisit the issue and run a power calculation on height indicating that Student’s claims were correct and that the experiment would have required ~97% fewer children if run with twins. This reduction is not unique to the Scottish experiment and in general, one can expect a reduction of 89% using twins rather than regular people.
Due to length, this has been split out to a separate page.
MCTS
An implementation in R of a simple Monte Carlo tree search algorithm (using Thompson sampling rather than a UCT) implemented with data.tree
. This MCTS assumes binary win/loss (1/0) terminal rewards with no intermediate rewards/costs so it cannot be used to solve general MDPs, and does not expand leaf nodes in the move tree passed to it. (I also suspect parts of it are implemented wrong though it reaches the right answer in a simple Blockworld problem and seems OK in a TicTacToe problem. I have since understood how memoization works with backwards induction and would probably probably drop the painful explicit tree manipulation in favor of the indirect recursive aproach.)
library(data.tree)
## MCTS helper functions:
playOutMoves < function(move, state, actions) {
for (i in 1:length(actions)) {
state < move(state, actions[i])$State
}
return(state)
}
playOutRandom < function(move, state, actions, timeout=1000, verbose=FALSE) {
action < sample(actions, 1)
turn < move(state, action)
if(verbose) { print(turn); };
if (turn$End  timeout==0) { return(turn$Reward) } else {
playOutRandom(move, turn$State, actions, timeout=timeout1, verbose) }
}
createTree < function(plys, move, moves, initialState, tree=NULL) {
if (is.null(tree)) { tree < Node$new("MCTS", win=0, loss=0) }
if (plys != 0) {
for(i in 1:length(moves)) {
x < tree$AddChild(moves[i], win=0, loss=0)
createTree(plys1, move, moves, initialState, tree=x)
}
}
# cache the state at each leaf node so we don't have to recompute each move as we later walk the tree to do a rollout
tree$Do(function(node) { p < node$path; node$state < playOutMoves(move, initialState, p[2:length(p)]); }, filterFun = isLeaf)
return(tree)
}
mcts < function (tree, randomSimulation, rollouts=1000) {
replicate(rollouts, {
# Update posterior sample for each node based on current statistics and use Thompson sampling.
# With a beta uniform prior (Beta(1,1)), update on binomial (win/loss) is conjugate with simple closed form posterior: Beta(1+win, 1+nwin).
# So we sample directly from that posterior distribution for Thompson sampling
tree$Do(function(node) { node$Thompson < rbeta(1, 1+node$win, 1+(node$win+node$loss)node$win) })
# find & run 1 sample:
node < treeWalk(tree)
rollout < randomSimulation(node$state)
if(rollout==1) { node$win < node$win+1; } else { node$loss < node$loss+1; }
# propagate the new leaf results back up tree towards root:
tree$Do(function(x) { x$win < Aggregate(x, "win", sum); x$loss < Aggregate(x, "loss", sum) }, traversal = "postorder")
})
}
## walk the game tree by picking the branch with highest Thompson sample down to the leaves
## and return the leaf for a rollout
treeWalk < function(node) {
if(length(node$children)==0) { return(node); } else {
children < node$children
best < which.max(sapply(children, function(n) { n$Thompson; } ))
treeWalk(children[[best]]) } }
mctsDisplayTree < function(tree) {
tree$Do(function(node) { node$P < node$win / (node$win + node$loss) } )
tree$Sort("P", decreasing=TRUE)
print(tree, "win", "loss", "P", "Thompson")
}
## Blockworld simulation
## 0=empty space, 1=agent, 2=block, 3=goal point
blockActions < c("up", "down", "left", "right")
blockInitialState < matrix(ncol=5, nrow=5, byrow=TRUE,
data=c(0,0,0,0,1,
0,2,0,0,2,
0,0,0,2,0,
0,2,0,0,0,
0,0,0,0,3))
blockMove < function(state, direction) {
if(state[5,5] == 2) { return(list(State=state, Reward=1, End=TRUE)) }
position < which(state == 1, arr.ind=TRUE)
row < position[1]; col < position[2]
rowNew < 0; colNew < 0
switch(direction,
# if we are at an edge, no change
up = if(row == 1) { rowNew<row; colNew<col; } else { rowNew < row1; colNew < col; },
down = if(row == 5) { rowNew<row; colNew<col; } else { rowNew < row+1; colNew < col; },
left = if(col == 1) { rowNew<row; colNew<col; } else { rowNew < row; colNew < col1; },
right = if(col == 5) { rowNew<row; colNew<col; } else { rowNew < row; colNew < col+1; }
)
# if there is not a block at the new position, make the move
if (state[rowNew,colNew] != 2) {
state[row,col] < 0
state[rowNew,colNew] < 1
return(list(State=state, Reward=0, End=FALSE))
} else {
state[rowNew,colNew] < 1
state[row,col] < 0
switch(direction,
# if the block is at the edge it can't move
up = if(rowNew == 1) { } else { state[rowNew1,colNew] < 2 },
down = if(rowNew == 5) { } else { state[rowNew+1,colNew] < 2 },
left = if(colNew == 1) { } else { state[rowNew,colNew1] < 2 },
right = if(colNew == 5) { } else { state[rowNew,colNew+1] < 2 } )
# a block on the magic 5,5 point means a reward and reset of the playing field
if(state[5,5] == 2) { return(list(State=state, Reward=1, End=TRUE)) } else { return(list(State=state, Reward=0, End=FALSE)) }
}
}
## Blockworld examples:
# blockMove(blockInitialState, "left")
# blockMove(blockInitialState, "down")
# blockMove(blockInitialState, "right")$State
# blockMove(blockMove(blockInitialState, "right")$State, "down")
# blockMove(blockMove(blockMove(blockInitialState, "down")$State, "down")$State, "down")
# playOutMoves(blockMove, blockInitialState, c("down", "down", "down"))
# playOutRandom(blockMove, blockInitialState, blockActions)
tree < createTree(2, blockMove, blockActions, blockInitialState)
mcts(tree, function(state) { playOutRandom(blockMove, state, blockActions) })
mctsDisplayTree(tree)
tree2 < createTree(3, blockMove, blockActions, blockInitialState)
mcts(tree2, function(state) { playOutRandom(blockMove, state, blockActions) })
mctsDisplayTree(tree2)
## TicTacToe
tttActions < 1:9
tttInitialState < matrix(ncol=3, nrow=3, byrow=TRUE, data=0)
tttMove < function(state, move) {
move < as.integer(move)
# whose move is this? Player 1 moves first, so if the number of pieces are equal, it must be 1's turn:
player < 0; if(sum(state == 1) == sum(state == 2)) { player < 1 } else { player < 2}
# check move is valid:
if(state[move] == 0) { state[move] < player }
## enumerate all possible endstates (rows, columns, diagonals): victory, or the board is full and it's a tie
victory < any(c(
all(state[,1] == player),
all(state[1,] == player),
all(state[,2] == player),
all(state[2,] == player),
all(state[,3] == player),
all(state[3,] == player),
all(as.logical(c(state[1,1], state[2,2], state[3,3]) == player)),
all(as.logical(c(state[1,3], state[2,3], state[3,1]) == player))
))
tie < all(state != 0)
# if someone has won and the winner is player 1, then a reward of 1
if(victory) { return(list(State=state, Reward=as.integer(player==1), End=TRUE)) } else {
if(tie) { return(list(State=state, Reward=0, End=TRUE)) } else {
return(list(State=state, Reward=0, End=FALSE)) }
}
}
## TicTacToe examples:
# tttMove(tttMove(tttMove(tttInitialState, 5)$State, 9)$State, 2)
# playOutMoves(tttMove, tttInitialState, c(5, 9, 2))
# playOutRandom(tttMove, tttInitialState, tttActions, verbose=TRUE)
treeTTT < createTree(2, tttMove, tttActions, tttInitialState)
mcts(treeTTT, function(state) { playOutRandom(tttMove, state, tttActions) })
mctsDisplayTree(treeTTT)
## hypothetical: if opponent plays center (5), what should be the reply?
treeTTT2 < createTree(2, tttMove, tttActions, tttMove(tttInitialState, 5)$State)
mcts(treeTTT2, function(state) { playOutRandom(tttMove, state, tttActions) })
mctsDisplayTree(treeTTT2)
Candy Japan A/B test
Due to length, has been split out to Candy Japan’s new box A/B test.
DeFriesFulker power analysis
DeFriesFulker (DF) extremes analysis
 DeFries & Fulker 1985
 DeFries et al 1987, “Evidence for a genetic aetiology in reading disability of twins” /docs/genetics/heritable/1987defries.pdf
 DeFries & Fulker 1988, “Multiple regression analysis of twin data: Etiology of deviant scores versus individual differences”
 “A ModelFitting Implementation of the DeFriesFulker Model for Selected Twin Data” Purcell & Sham 2003 /docs/genetics/heritable/2003purcell.pdf
 LaBuda et al 1986
 DeFries et al 1991, “Colorado Reading Project: An update”
 Gillespie & Neale 2006 “A Finite Mixture Model for Genotype and Environment Interactions: Detecting Latent Population Heterogeneity” http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.534.6298&rep=rep1&type=pdf
 Purcell et al 2001, “Comorbidity between verbal and nonverbal cognitive delays in 2yearolds: A bivariate twin analysis”
 http://www.sciencedirect.com/science/article/pii/S0160289614001676 “Thinking positively: The genetics of high intelligence”, Shakeshaft et al 2015
 Schwartz & Connolly 2017, “An Investigation of Genetic and Environmental Influences Across The Distribution of SelfControl”
generateSiblingPair < function(ID=TRUE) {
## Population mean 100, SD 15; let's make family means distributed normally too;
## heritability 0.8, shared environment 0.1, siblings share half of genes on average + shared environment
## so a pair of siblings has 1  (0.8*0.5+0.1) = 0.5 of the variance of the general population.
parental < mean(rnorm(1,mean=100,sd=15*0.8), rnorm(1,mean=100,sd=15*0.8))
siblings < rnorm(2, mean=parental, sd=15*(1  (0.8*0.5+0.1)))
## Siblings will tend to vary this much, unless they are, lamentably, one of the, say,
## 5% struck by mutational lightning and reduced to an IQ of, let's say, 80
if(ID) { siblings < ifelse(rbinom(2,1,prob=0.05), siblings,rnorm(2, mean=80, sd=15)) }
return(c(max(siblings), min(siblings)))
}
generateSiblingPairs < function(n,ID=TRUE) { as.data.frame(t(replicate(n, generateSiblingPair(ID=ID)))) }
## dataset with lightning:
df < round(rescale(generateSiblingPairs(1000000, ID=TRUE), mean=5, sd=2))
## floor/ceiling at 0/9 for everyone:
df[df$V1>9,]$V1 < 9
df[df$V1<1,]$V1 < 1
df[df$V2>9,]$V2 < 9
df[df$V2<1,]$V2 < 1
## dataset without:
df2 < round(rescale(generateSiblingPairs(1000000, ID=FALSE), mean=5, sd=2))
df2[df2$V1>9,]$V1 < 9
df2[df2$V1<1,]$V1 < 1
df2[df2$V2>9,]$V2 < 9
df2[df2$V2<1,]$V2 < 1
par(mfrow=c(2,1))
hist(df$V1  df$V2)
hist(df2$V1  df2$V2)
## mixture modeling:
library(flexmix)
## check k=1 vs k=2 on df1, where k=2 is ground truth:
g1.1 < flexmix(I(V1V2) ~ 1, k=1, data=df)
g1.2 < flexmix(I(V1V2) ~ 1, k=2, data=df)
summary(g1.1); summary(g1.2)
## check k=1 vs k=2 on df2, where k=1 is ground truth:
g2.1 < flexmix(I(V1V2) ~ 1, k=1, data=df2)
g2.2 < flexmix(I(V1V2) ~ 1, k=2, data=df2)
summary(g2.1); summary(g2.2)
Inferring mean IQs from SMPY/TIP elite samples
Samples taken from the extremes of mixtures of distributions can have very different properties than random samples, such as wildly disproportionate representation of one distribution. This can be used to infer differing means. I demonstrate working backwards from the racial composition of TIP/SMPY samples of extremely (1in10,000) gifted youth to estimate the overall racial means, which is consistent with the known racial means and hence an unbiased selection process, using ABC to infer Bayesian credible intervals on the estimated means.
The general properties of statistical distributions can be very different from the properties of specific subsets in counterintuitive ways. One common example is that a small difference in means for many distributions can lead to large differences in extreme subsets.
For example, male and female average heights differ by a relatively small amount, inches at most. So in a random sample, plenty of women will be taller than men, and vice versa. However, if instead ask the sex of the tallest person in the sample, it will often be male, and the larger the sample, the more certain we can be that it will be male, and that the top X% by height will be male. Likewise, if we wanted to start a basketball league and recruited the tallest 100 people in the country, this small mean difference will show up as our entire basketball league turning out to be male. (And since height is highly heritable, we may find out that many of them are related!) What seemed like a small difference become a large one; we could have worked it out in advance if we had thought about it.
Reasoning from the general to the particular turned out to be tricky in this case because we were dealing with extreme values rather than random samples—1 basketball player chosen by height from thousands of people. Many things of great interest turn out to be like that: we are interested in the extremes much more than the expectation. Running a 2hour marathon is an extreme on athleticism; winning the Nobel is an extreme on scientific accomplishment; being enlisted in the NBA is an extreme on height; being admitted to MIT/Stanford/Harvard is an extreme on intelligence; murdering someone is an extreme on violence; winning an Academy Award is an extreme on acting success. When we ask questions like, “why does the world record in this sport keep being shattered” or “why are so many NBA players related” or “how good can we expect the best chess player to be in 10 years” or “does this racial composition prove bias” or “how much more important are the best authors in literature than obscurer figures” or “why do so few women win the Field Medal”, we’re asking extreme value questions whose answers may be counterintuitive—and the answer may be as simple as the shape of distributions, and a slightly lower mean here or a slightly higher standard deviation there. (Working backwards from a sample selected for passing a threshold to a mean can be called “the method of limits” or “the method of thresholds”.)
The study “When Lightning Strikes Twice: Profoundly Gifted, Profoundly Accomplished”, Makel et al 2016 describes the accomplishments of the Duke TIP sample, 259 children selected for their intelligence by taking the highestscorers out of 425,000 adolescents taking the SAT (usually <13yo) starting in 1981, representing the top 0.01% of the testtakers. The TIP sample parallels the betterknown SMPY sample, which also selected extremely intelligent adolescents, who were included in a longitudinal sample. It’s frequently suggested, based on anecdotal evidence or some biased convenience samples, that more intelligence may not be better; extremely intelligent people may be unhealthy, neurotic, insane, isolated, lonely, discriminated against by society and their peers, and doomed to failure; or if things are not quite that dire, as all studies show things improving up to 130, then at around that point greater intelligence may stop making any difference, and there be little difference between someone with an IQ of 130 and 160. This is difficult to study crosssectionally, because once you start talking about as extreme as 0.01%, it is difficult to recruit any subjects at all, and your sample will be biased in unknown ways; if you only look at successful people, you are missing the hypothetical homeless bum living out of a trash can who is a troubled and misunderstood genius. To solve these problems, you want to filter through hundreds of thousands of people so you can select the very brightest possible, and you want to find them as early as possible in life, before they have had any chance to fail or succeed, and track them longitudinally as they grow up. This is what the SMPY & TIP studies do, and the results are that the subjects are spectacularly successful in life; great intelligence is not harmful and the returns to greater intelligence are not zero even as high as 1 in 10,000.
Makel et al 2016 also reports the ethnic breakdown of the TIP and SMPY samples: 72% white, 22% Asian, 6% not reported or other. This distribution might seem remarkable given that subjects taking the SAT in 1981 were born ~1970, when the USA was ~77% white, ~11% black, and ~0.7% Asian, so white are slightly underrepresented, blacks are very underrepresented (even if we assume all 6% are black, then that’s still half), and Asians are 31x (!) overrepresented.
## TIP/SMPY sample size & ethnic percentages: https://pbs.twimg.com/media/Cj9DXwxWEAEaQYk.jpg
tip < 259; smpy < 320 ## total: 579
white < ((0.65*tip) + (0.78*smpy)) / (tip+smpy)
asian < ((0.24*tip) + (0.20*smpy)) / (tip+smpy)
white; asian
# [1] 0.7218480138
# [1] 0.2178929188
# http://drjamesthompson.blogspot.com/2016/06/somecharacteristicsofeminentpersons.html
# > The data on ethnicity are rather sparse, but we can do a little bit of work on them by looking at US Census
# > figures for the 1970s when most of these children were born: White 178,119,221...Asia 1,526,401...So, in the
# > absence of more detailed particulars about the Other category, Asians win the race by a country mile. If we
# > simplify things by considering only Whites, Blacks and Asians the US in 1970 then the country at that time was
# > 88% White, 11% Black, and less than 1% Asian. The actual results of eminent students are 77% White, 0% Black,
# > 22% Asian. No need for a Chi square.
#
# Asian is 0.7%: R> 1526401 / (178119221 / 0.80)
whiteRR < white / 0.77; asianRR < asian / 0.007
whiteRR; asianRR
# [1] 0.937464953
# [1] 31.12755983
Of course, races in the USA have long differed by mean intelligence, with the rule of thumb being Asians ~105 IQ, whites ~100, and blacks ~90. So the order is expected—but still, 31x! Are the results being driven by some sort of proAsian bias or otherwise bizarre?
But this is an extreme sample. 1in10,000 is far out on the tails: 3.71SDs.
Maybe this is normal. Can we work backwards from the overrepresentations to what differences would have generated them?
Yes, we can, even with this small sample which is so extreme and unrepresentative of the general population. This is because it is an order statistics problem: we know the order represented by the sample and so can work back to parameters of the distribution the order statistics are being generated by. Since IQ is a normal distribution, we know the overrepresentation RR, and the exact cutoff/limit used in the sample, we can convert the limit to a standard deviations, and then find the normal distribution which is RR (31) times the normal distribution at a standard deviations.
We can compare using two pnorm
s and shifting the second by a SDs. So for example, shifting by 15 IQ points or 1 SD would lead to 84x overrepresentation
We would like to solve for the shift which leads to an exact overrepresentation like 31.127; an optimization routine like R’s optim
function can do that, but it requires an error to minimize, so minimizing pnorm()/pnorm(x)
doesn’t work since it just leads to negative infinity, nor will RR == pnorm()/pnorm(x)
work, because it evaluates to 0 for all values of x except the exact right one . Instead, we minimize the squared error between the ratio predicted by a particular x and our observed RR. This works:
## An optimization routine which automatically finds for us the IQ increase which most closely matches the RR:
solver < function(RR, cutoff=10000) {
optim(1,
function(IQ_gain) { (RR  (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)(IQ_gain/15))))^2 },
)$par }
100 + solver(whiteRR)
# [1] 99.75488281
100 + solver(asianRR)
# [1] 111.8929688
So our inferred white & Asian populations means are: 99.8 and 111.9. These are relatively close to the expected values.
This approach can be used to infer other things as well. For example, the TIP/SMPY papers have not, as far as I’ve seen, mentioned what fraction of the white subjects were ethnic Jewish; since they are so overrepresented in areas like Nobel prizes, we would expect many of the TIP/SMPY white students to have been Jewish. Using an estimate of the Jewish population in 1970 and estimates of their mean IQ, we can work forward to what fraction of TIP/SMPY subjects might be Jewish. The 19701971 National Jewish Population Study estimated “5,800,000 persons (of whom 5,370,000 were Jews) living in Jewish households” out of a total US population of 205 million, or 2.8% of the total population or ~3.6% of the white population. So of the ~418 white subjects, ~15 would be expected to be Jewish under the null hypothesis of no difference. The majority of American Jews are of Ashkenazi descent^{4}, for whom intelligence estimates are debated but tend to range 105115 (with occasional samples suggesting even higher values, like Levinson 1957). In the Barbe 1964 Ohio sample (IQ ~143), 8% were Jewish^{5}; in Terman’s (ratio IQ >140) 1920s sample in SF/LA, 10% were Jewish; Hollingworth’s 1930s sample (>180) turned up 51/55 or 90% Jewish^{6}; Byrns 1936’s 1931 Wisconsin state sample found 18% of the Jewish sample to be in the top decile vs 10% American; in the Hunter College Elementary School sample 19481960 (>140, mean 157) in New York City, 62% were Jewish (Subotnik et al 1989, Subotnik et al 1993^{7}). Given estimates of the Jewish population of children in those specific times and places, one could work backwards to estimate a Jewish mean.
We can calculate the fraction of the white sample being Jewish for each possible mean IQ:
proportion < function (gain, cutoff=10000) {
(pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)(gain/15))) }
possibleIQs < seq(5, 15, by=0.5)
data.frame(Advantage=possibleIQs, Fraction.of.white=(sapply(possibleIQs, proportion) * 15) / 418)
Advantage Fraction.of.white
1 5.0 0.1415427303
2 5.5 0.1633099334
3 6.0 0.1886246225
4 6.5 0.2180947374
5 7.0 0.2524371552
6 7.5 0.2924980125
7 8.0 0.3392769622
8 8.5 0.3939561508
9 9.0 0.4579348680
10 9.5 0.5328710150
11 10.0 0.6207307813
12 10.5 0.7238482059
13 11.0 0.8449966589
14 11.5 0.9874747049
15 12.0 1.1552093388
16 12.5 1.3528802227
17 13.0 1.5860693342
18 13.5 1.8614413902
19 14.0 2.1869615788
20 14.5 2.5721585555
21 15.0 3.0284424112
Judging from earlier samples with very high cutoffs, I’d guess TIP/SMPY has at least a majority Jewish, giving a mean IQ of ~110; this is pleasantly similar to estimates based on regular samples & estimation. This result is also similar to La Griffe du Lion’s 2003 threshold analysis estimating a mean IQ of 112 based on Ashkenazi overrepresentation among USSR championship chess players, 111 based on Western Fields Medal awards, and 110 based on the USA/Canada Putnam competition. But if the mean IQ was as high as 112, then almost every single white subject would be Jewish in every sampling, which seems implausible and like something so striking that anyone writing or involved with TIP/SMPY would have to have mentioned at some point—right?
For the same reason, the original estimate of 112 for the Asians strikes me as on the high side. This could be due to problems in the data like underestimating the Asian population at the time—perhaps the Southeast/Midwest states that TIP samples from were more than 0.7% Asian—or it could be due to sampling error (only n=579, after all).
Working backwards doesn’t immediately provide any measurement of precision or confidence intervals. Presumably someone has worked out analytic formulas which come with standard errors and confidence intervals, but I don’t know it. Instead, since the selection process which generated our data is straightforward (population mean > millions of samples > take top 1in10000s > calculate overrepresentation), I can again use Approximate Bayesian computation (ABC) to turn a simulation of the data generating process into a method of Bayesian inference on the unknown parameters (population means) and get credible intervals.
What sort of confidence do we have in these estimates given that these RRs are based only on? We can simulate TIP/SMPYlike selection by taking the hypothetical means of the two groups, generating ~3 million simulates (579 * 10000
) each, selecting the top 1/10000th, taking the RRs and then solving for the mean IQ. If we provide a prior on the means and we hold onto only the means which successfully generate TIP/SMPYlike fractions of 72% & 21%, this becomes ABC with the saved means forming the posterior distribution of means. (It would likely be faster to use MCMC like JAGS, but while JAGS provides truncated normal distributions which one could sample from quickly, and the necessary pnorm
/qnorm
functions, but it’s not clear to me how one could go about estimating the overperformance ratio and the binomial.^{8} There’s likely some way to use order statistics more directly than simulating cutoffs, in which case there is a transformation to a beta distribution over 01, which is a wellsupported distribution by MCM software and might allow exact solution as well.) For my priors, I believe that the rule of thumbs of 100/105 are accurate and highly unlikely to be more than a few points off, so I use a very weak prior of populations means being .
In exact ABC, we would keep only data which exactly matched 72%/22%, but that would require rejecting an extremely large number of samples. Here we’ll loosen it to ±2% tolerance:
simulateTIPSMPY < function() {
## informative priors: IQs are somewhere close to where we would estimate based on other datasets
whiteMean < round(rnorm(1, mean=100, sd=5), digits=2)
asianMean < round(rnorm(1, mean=105, sd=5), digits=2)
iqCutoff < 100 + qnorm(1/10000) * 15
whites < rnorm(0.770 * 579 * 10000, mean=whiteMean, sd=15)
whiteSample < max(1, sum(ifelse(whites>iqCutoff, 1, 0)))
asians < rnorm(0.007 * 579 * 10000, mean=asianMean, sd=15)
asianSample < max(1, sum(ifelse(asians>iqCutoff, 1, 0)))
## white+Asian = 92% of original total sample, so inflate by that much to preserve proportions: 1.08
totalSample < (whiteSample+asianSample) * (1 + (1(white+asian)))
whiteFraction < round(whiteSample / totalSample, digits=2)
asianFraction < round(asianSample / totalSample, digits=2)
# print(paste("samples: ", c(whiteSample, asianSample), "fractions: ", c(whiteFraction, asianFraction)))
tolerance < 0.02
if ((abs(whiteFraction  0.7218480138) < tolerance) && (abs(asianFraction  0.2178929188) < tolerance)) {
return(data.frame(White=whiteMean, Asian=asianMean))
}
}
library(parallel); library(plyr)
simulateSamples < function(n.sample=10000, iters=getOption("mc.cores")) {
## because of rejection sampling, no run is guaranteed to produce a sample so we loop:
results < data.frame()
while (nrow(results) < n.sample) {
simResults < ldply(mclapply(1:iters, function(i) { simulateTIPSMPY() } ))
results < rbind(results, simResults)
# print(paste("Samples so far: ", nrow(results)))
}
return(results) }
posteriorSamples < simulateSamples()
mean(posteriorSamples$White < posteriorSamples$Asian)
# [1] 1
## we have relatively few samples, so get a better posterior estimate by shuffling the posterior samples & comparing many times:
mean(replicate(1000, mean(c(sample(posteriorSamples$White) < sample(posteriorSamples$Asian)))))
# [1] 0.9968822
quantile(probs=c(0.025, 0.975), posteriorSamples$White, na.rm=TRUE)
# 2.5% 97.5%
# 89.49975 101.38050
quantile(probs=c(0.025, 0.975), posteriorSamples$Asian, na.rm=TRUE)
# 2.5% 97.5%
# 101.37000 116.74075
par(mfrow=c(2,1))
hist(posteriorSamples$White, main="Posterior white mean IQ estimated from TIP/SMPY cutoff & ratio", xlab="IQ")
hist(posteriorSamples$Asian, main="Posterior Asian mean", xlab="IQ")
So sampling error does turn out to be substantial: our 95% credible intervals are white 90101, Asian 101116. Still, the overlap is minimal, with P=99.7% that the Asian mean is higher than the white.
We are able to conclude that the rank ordering is highly likely to be correct, and the results are consistent with the conventional wisdom, so there is no prima facie case for bias in the results: the ethnic composition is in line with what one would calculate from the design of TIP/SMPY and population means.
Genius Revisited: On the Value of High IQ Elementary Schools
Genius Revisited documents the longitudinal results of a highIQ/giftedandtalented elementary school, Hunter College Elementary School (HCES); one of the most striking results is the general high education & income levels, but absence of great accomplishment on a national or global scale (eg a Nobel prize). The authors suggest that this may reflect harmful educational practices at their elementary school or the low predictive value of IQ.
I suggest that there is no puzzle to this absence nor anything for HCES to be blamed for, as the absence is fully explainable by their making two statistical errors: baserate neglect, and regression to the mean.
First, their standards fall prey to a baserate fallacy and even extreme predictive value of IQ would not predict 1 or more Nobel prizes because Nobel prize odds are measured at 1 in millions, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.
Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable as it is driven by the regression to the mean caused by the relatively low correlation of early childhood with adult IQs—which means their sample is far less elite as adults than they believe. Using earlychildhood/adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES’s associated highIQ/giftedandtalented high school, Hunter High, which has access to the adolescents’ more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
This unavoidable statistical fact undermines the main rationale of HCES: extremely highIQ adults cannot be very accurately selected as kindergarteners on the basis of a simple test. This greaterregression problem can be lessened by the use of additional variables in admissions, such as parental IQs or highquality genetic polygenic scores; unfortunately, these are either politically unacceptable or dependent on future scientific advances. This suggests that such elementary schools may not be a good use of resources and HCES students should not be assigned scarce magnet high school slots.
Split out to separate article.
Great Scott! Personal Name Collisions and the Birthday Paradox
“How large does can a social circle be before first names no longer suffice for identification? Scott, I’m looking at you.”
MakerOfDecisions, 29 July 2016
Scott here refers to any of Scott Alexander, Scott Adams, Scott Aaronson, Scott Sumner (and to a much lesser extent, Scott Garrabrant, Orson Scott Card, and Scott H. Young); a reference to a ‘Scott’ on a site like Less Wrong is increasingly ambiguous.
When a large number of samples draw from a common pool of identifiers, collisions are common, leading to the birthday paradox: despite there being 365.25 days in the year, a classroom of just 23 people (who can cover at most 6% of the days in a year) is ~50% likely to have at least two people who share the same birthday and so birthdays cease being unique unambiguous identifiers. (Intuitively, you might expect the number to be much larger and closer to 180 than 23.)
We can verify this by simulation:
dupes < function(a) { length(a) != length(unique(a)) }
identifiers < function(n, ids, probabilities) { sample(1:ids, n, prob=probabilities, replace=TRUE) }
simulate < function(n, ids, probabilities=rep(1/ids, ids), iters=10000) {
sims < replicate(iters, { id < identifiers(n, ids, probabilities)
return(dupes(id)) })
return(mean(sims)) }
simulate(23, 365)
# [1] 0.488
sapply(1:50, function(n) { simulate(n, 365) } )
# [1] 0.0000 0.0029 0.0059 0.0148 0.0253 0.0400 0.0585 0.0753 0.0909 0.1196 0.1431 0.1689 0.1891
# 0.2310 0.2560 0.2779 0.3142 0.3500 0.3787 0.4206 0.4383 0.4681 0.5165 0.5455 0.5722 0.5935
# [27] 0.6227 0.6491 0.6766 0.7107 0.7305 0.7536 0.7818 0.7934 0.8206 0.8302 0.8465 0.8603 0.8746
# 0.8919 0.9040 0.9134 0.9248 0.9356 0.9408 0.9490 0.9535 0.9595 0.9623 0.9732
Similarly, in a group of people, it will be common for first names to overlap. (Overlaps of both first names & surnames are much more unlikely: Charpentier & Coulmont 2017 estimate from French & Ohioan data that while almost everyone has a nonunique full name, even groups of thousands of people will have only a few duplicates.) How common? There are far more than 365.25 first names, especially as some first names are made up by parents.
Names have a highly skewed (often said to be a power law) distribution: the first few baby names make up an enormous fraction of all names, hence all the Ethan/Lucas/Mason baby boys in 2016. (One would think that parents would go out of their way to avoid toopopular names, but apparently not.)
Since there are only “10,000 things under heaven”, one might think that the top 10000 personal names would give a good guess. At what n can we expect a collision?
findN < function(ids, targetP=0.5, startingN=1, probabilities=rep(1/ids, ids)) {
n < startingN
collisionProbability < 0
while (collisionProbability < targetP) {
collisionProbability < simulate(n, ids, probabilities)
n < n+1
}
return(n) }
findN(10000)
# [1] 118
simulate(118, 10000)
# [1] 0.5031
We could also use an approximation such as the square approximation: : sqrt(2 * 10000 * 0.5) ~> 100
Or the similar upper bound: ceiling(sqrt(2*10000*log(2))) ~> 118
.
So the collision point is smaller than Dunbar’s number.
But all of these are themselves upper bounds because the case in which birthdays/names are uniformly distributed is the worst case. If there is any difference in the probabilities, a collision will happen much earlier. This makes sense since if 1 birthday happens with, say, P=0.99, then it’s almost impossible to go more than 3 or 4 birthdays without a collision. Likewise, if one birthday has P=0.50, and so on down to P=$:
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.99, rep(0.01/364, 364)))})
# [1] 0.0000 0.9789 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
# 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.5, rep(0.5/364, 364)))})
# [1] 0.0000 0.2531 0.5031 0.6915 0.8182 0.8896 0.9402 0.9666 0.9808 0.9914 0.9951 0.9973 0.9988
# 0.9993 0.9991 0.9999 1.0000 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000
How skewed are real names? Given Names Frequency Project provides “Popular Given Names US, 18011999” (19901999, 909288 names) based on Social Security data. After deleting the first 4 lines of s1990m.txt
, it can be loaded into R and the fractions used as probabilities to find the 50% collision point for US names:
names < read.csv("s1990m.txt", header=FALSE)
summary(names)
# V1 V2
# Aaron : 1 Min. : 55.0000
# Abdiel : 1 1st Qu.: 86.0000
# Abdullah: 1 Median : 183.0000
# Abel : 1 Mean : 914.1923
# Abraham : 1 3rd Qu.: 535.5000
# Adam : 1 Max. :24435.0000
# (Other) :852
sum(names$V2)
# [1] 784377
## "Scott" as fraction of all names:
2279 / 784377
# [1] 0.0029054906
## presumably male names:
2279 / (784377*0.5)
# [1] 0.005810981199
simulate(118, nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 1
findN(nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 15
So a more realistic analysis suggests n=15 is where unique first names will probably break down.
This only covers the 853 most common personal names, and the more names, the higher the n has to be to trigger a collision (making 15 something of a lower upper bound); to estimate 10000, we need to fit a distribution to extrapolate below that. The log normal distribution fits reasonably well and is easy to work with:
library(fitdistrplus)
fitdist(names$V2, "lnorm")
# Fitting of the distribution ' lnorm ' by maximum likelihood
# Parameters:
# estimate Std. Error
# meanlog 5.550448321 0.04640182299
# sdlog 1.359185357 0.03281096378
simulateLN < replicate(100, {
names < rlnorm(10000, meanlog=5.550448321, sdlog=1.359185357)
hit < findN(length(names), startingN=46, probabilities=names/sum(names))
return(hit)
})
median(simulateLN)
# [1] 51
Since first names will cluster by age group, location, profession, and whatnot, arguably even 51 is a bit of an upper bound.
Finally, one might ask the probability of a group with a great Scott, or to put it another way, the probability of it unfortunately getting away scotfree.
This is easy to answer; the probability of having 1 or more Scotts in a group is the probability of everyone having a name other than Scott. We saw that the probability of being named Scott was P=0.0029054906 in the name dataset. So the probability of one person not being named Scott is . So the probability of n people all being named notScott is 0.997^{n}. The crossover point is ~239.
So an American social group cannot exceed n=51 before first names begin to break down, and it is all Scott’s fault at n=239.
Detecting fake (human) Markov chain bots
Some popular Twitter and Tumblr accounts use Markov chains trained on a corpus of writing such as Markov James Mitchens or two unrelated corpuses to create amusing mashups: programming documentation and H.P. Lovecraft’s horror/SF fiction or the King James Bible or the works of Karl Marx, Kim Kardashian and Kierkegaard, or Silicon Valley recruiting emails and Erowid drug use reports. The humor comes from the fact that the Markov chains have no understanding and are merely programs producing gibberish that occasionally present striking juxtapositions or insights. Much of their appeal derives in large part from the fact that while humans curate them, humans don’t write them. They depend on this authenticity to be striking.
Of course, there’s always the temptation to edit them or write them wholesale, perhaps because the Markov chains aren’t cooperating in producing any comedy gold to tweet that day, which deceives the reader. This poses an inverse Turing test: how would you detect a fake Markov chain account, that is, one where a human is pretending to be a computer and writing some of the text?
Markov chains are trained on a specific corpus and are a probabilistic generative model which encode the probability that a word X follows another word Y for all the words in that corpus (and similarly if they are operating on letters or on ngrams); there is no state or memory or ‘look back’ or ability to model recursion. To generate text, one simply picks a random word Y, looks up the probabilities of all the words A…Z from Y, and picks a word at random weighted by those probabilities; then repeat indefinitely. Conversely, one could also use it to calculate the likelihood of a given text by multiplying the probability of each word in the text conditional on the previous one.
One difficulty is the potential for doubleuse of data: the first pass through a Markov chain account is already applying to the data a highly flexible Bayesian neural network with billions of parameters (one’s brain). If one spots an ‘anomalous’ dataset and subsequent analysis confirms it, what does this mean? I am reminded of one past incident: someone had lost a great deal of money on a Bitcoin gambling website, and suspected the site had defrauded him. But he had contacted me only because he had had unusual losses. What does an analysis mean? Imagine that the top 1% of losers get angry and start looking into whether they were cheated; they go to a statistician who duly computes that based on the number of games played, there is a p=0.01 that they would lose as much or more as they did… If one had all the gambling records, one could look at the overall patterns and see if there are more losers than there should be given the rules of the game and a supposedly fair random number generator, but what does one do with 1 selfselected player? The data generation process is certainly neither random nor ‘ignorable’ nor modelable without dubious assumptions.
A few possible attacks come to mind:
 observation of malformed syntax or lack of longrange dependencies
 vocabulary or output outside an independently trained Markov chain’s domain
 unusually low likelihood for an independently trained Markov chain to generate known samples
 unusually low likelihood for an independently trained Markov chain to generate known samples compared to newly generated samples filtered at a 1in100s quality level
 unusually high quality of known samples compared to newly generated samples from independently trained Markov chain filtered at a 1in100s quality level, tested nonparametrically or parametrically as a mixture model
Markov chains produce realisticlooking output and are efficient to create & run, but, compared to RNNs, they notoriously model recursive syntax poorly, such as nested parentheses (since they have no way of remembering whether a parenthetical comment had been started), and cannot extrapolate—for example, a wordlevel Markov chain can’t create new words, and would require ngrams to have available fragments of words which could be recombined. The memoryless nature of Markov chains also means that, lacking any memory which could model the ‘longrange correlations’ found in natural English text like systematic use of particular names/topics/vocabulary, larger samples quickly veer offtopic and become gibberish and lack any coherency possibly even inside a single sentence. (RNNs also have this problem, but somewhat less.)
With the limits of a Markov chain in mind, it would be easy to detect faked Markov chain output with large samples: it is just difficult for a human to deliberately generate long text which is as nonsensical and syntactically invalid as a Markov chain creates, for the same reason an unpracticed human is a remarkably bad random number generator. However, for this same reason the selected Markov samples tend to be very short, usually no more than a sentence. It might be possible to measure this on the samples as a whole and observe higher entropy or memorylessness (eg by measuring compression performance or efficiency of a Markov chain in modeling the samples), but I would guess that usually the samples are not long enough or large enough for this to have reasonable statistical power as a test. This eliminates the easiest test.
Since the corpus is known in many of these cases, we can assume access to a Markov chain model which is similar (if not identical) to the one which supposedly wrote all the tweets. This gives us several possibilities.
We could exploit the lack of creativity of Markov chains and look for anything in the tweets which is not present in the original corpus. For example, if a word like “cromulent” appears neither in the Puppet documentation nor (having been coined in 1996, 59 years after he died) in H.P. Lovecraft’s fiction, then it would have a probability of 0 of being generated by any Puppet/Lovecraft Markov chain (as no word will have any transition probability to it). Such extracorporal vocabulary immediately proves human authorship.
Continuing this same logic, we could take the corpus, train our own Markov chain (which will at least be similar), and use it to calculate the likelihood of all the tweets. A humanwritten tweet may be possible for the Markov chain to have written, but it will be far more unlikely than most of the ones the Markov chain actually wrote & were selected. So we would see that most of the tweets have reasonable log likelihoods, but that our suspicious ones will be far more extreme. (If the Markov chains are wordlevel, this test subsumes the impossibleword test: any tweet with a word not in the corpus, and hence not represented in the Markov chain, will have a meaningless likelihood.)
This likelihood test might not help if they are all equally extreme, in which case one could use our Markov chain in another manner, as a generative model, to try to estimate the likelihood of getting as great a tweet. For this, one samples several thousand samples from our Markov chain, and screens them for good ones. This creates an empirical distribution of the likelihoods of good tweets conditional on the null hypothesis of a Markov chain author; in this case, the null hypothesis is known to be true by construction. Then to test, one compares the knownMarkovchain tweets with the likelihoods of the suspect tweets (perhaps with a permutation test). They should be similar.
Alternately, if one doesn’t want to use likelihoods as a measure of improbability, one could instead use some human measure of funniness like having rating the originals and the samples on a scale 15, and comparing them. The original poster is probably not screening more than a few hundred generated tweets for each selected tweet, so given a similar level of stringency, one’s generated tweets should be equally good; if the originals turn out to be extremely better than yours, to a level where you would have to screen thousands of random samples, that is highly suspicious and suggests the originals were ‘too good to be true’.
With ratings or likelihoods, one could try to assume a decreasing distribution like an exponential: most samples will be incoherent and totally unfunny, many will be slightly funny, a few will be funny, and a very few will be very funny. The ratings on samples generated from our Markov chain will probably follow a smooth distribution. However, if a human is authoring some in an attempt to spice things up, they will be above the average of the Markov chain (otherwise why bother with cheating?), and if there is a substantial number of them, this will create an anomaly in the ratings of the originals—a ‘bump’ indicating that the tweets are coming from two different populations. In this case, it can be modeled as a mixture model with either k=1 or k=2, and the pvalue or Bayesian posterior probability calculated for 1 vs 2.
Optimal Existential Risk Reduction Investment
An existential risk is any risk which destroys or permanently cripples human civilization, such as an asteroid strike or pandemic. Since humanity might otherwise continue for millions of years, creating untold trillions of humans and colonizing the galaxy, human extinction represents the loss of literally astronomical amounts of utility. The loss is greater than any disaster up to extinction levels, as humanity can always recover from lesser disasters; but there is no recovery from a total destruction. Thus, the expected value of even a slight reduction in an exotic risk ought to itself be astronomical, or at least extremely large; under plausible values for wellcharacterized xrisks like asteroid strikes or nuclear war or pandemic, preventing them may be the charitable spending with the highest expected value and they should be receiving all charitable expenditures.
This strikes people as odd and dangerous reasoning. Is it really true that we should be spending almost unlimited amounts of money on these things and not otherwise extremely compelling charities like distributing malaria nets in Africa to save millions of lives or vaccine distribution or funding research into ending aging? And if we should, how do we choose what fraction to spend on global warming rather than artificial intelligence? What if someone discovers an entirely new xrisk not previously considered, like nearby supernovas or vacuum collapses or nanotechnology ‘grey goo’?
Thinking historically, it’s clear in retrospect that someone concerned about xrisk would be better off not going after the terminal goal of xrisk reduction but instead spending their money on instrumental goals such as science/technology or economic growth.
Imagine someone in England in 1500 who reasons the same way about xrisk: humanity might be destroyed, so preventing that is the most important task possible. He then spends the rest of his life researching the Devil and the Apocalypse. Such research is, unfortunately, of no value whatsoever unless it produces arguments for atheism demonstrating that that entire line of enquiry is useless and should not be pursued further. But as the Industrial and Scientific Revolutions were just beginning, with exponential increases in global wealth and science and technology and population, ultimately leading to vaccine technology, rockets and space programs, and enough wealth to fund all manner of investments in xrisk reduction, he could instead had made a perhaps small but real contribution by contributing to economic growth by work & investment or making scientific discoveries.
For example, Isaac Newton’s discoveries in astronomy and the laws of motion helped inaugurate threads of work that led directly to space satellites which can watch for asteroids with Earthcrossing orbits. Isaac Newton himself was concerned with xrisk, as he feared that the Great Comet of 1680 would, centuries hence, plunge into the Sun and cause expansion destroying the Earth and humanity. What could Newton have done to directly reduce this xrisk at the time? Absolutely nothing. There were no feasible countermeasures nor any foreseeable technologies which could forestall a comet or protect humanity from the Sun engulfing the Earth; there was not and still is not a mine or bomb shelter deep enough for that. What he could have done is close to what he did do: make fundamental advances in science which posterity could build on and one day be rich and wise enough to do something about the xrisk. As it happens, Newton was not quite right about the Great Comet (comets are not a meaningful fraction of the Sun’s mass) but there was a similar xrisk he was unaware of: giant asteroid impacts. And the solutions to a giant comet—observe all comets carefully to project their future orbits, destroy it, redirect its orbit, evacuate human colonists to safety to unaffected planets (Newton suggested the satellites of the gas giants)—are much the same as for a giant asteroid impact, and all benefit from economic growth & greater science/technology (someone has to pay for, and design those satellites and spacecraft, after all).
Economic wealth & science/technology are allpurpose goods: they are useful for compound growth, and can also be spent on xrisk reduction. They are the ultimate instrumental goods. If one is badly ignorant, or poor, or unable to meaningfully reduce an xrisk, one is better off accepting the xrisk and instead spending resources on fixing the former problems. One would prefer to get rid of the xrisk as soon as possible, of course, but given one’s starting position, there may simply be no better strategy and the risk must be accepted.
This raises the question: what is the optimal distribution of resources to economic growth vs xrisk reduction over time which maximizes expected utility?
Intuitively, we might expect something like early on investing nothing at all in xrisk reduction as there’s not much money available to be spent, and money spent now costs a lot of money down the line in lost compound growth; and then as the economy reaches modern levels and the opportunity cost of xrisk becomes dire, money is increasingly diverted to xrisk reduction. One might analogize it to insurance—poor people skimp on insurance because they need the money for other things which hopefully will pay off later like education or starting a business, while rich people want to buy lots of insurance because they already have enough and they fear the risks. If this were an investment question, a good strategy would be something like the Kelly criterion or probability matching strategies like Thompson sampling: even if the expected value of xrisk reduction is higher than other investments, it only pays off very rarely and so receives a very small fraction of one’s investments. However, it’s not clear that the Kelly criterion or Thompson sampling are optimal or even relevant: because while Kelly avoids bankruptcy in the form of gambler’s ruin but does so only by making arbitrarily small bets to avoid going bankrupt & refusing to ever risk one’s entire wealth; with xrisks, the ‘bankruptcy’ (extinction) can’t be avoided so easily, as the risk is there whether you like it or not, and one cannot turn it to 0. (This comes up often in discussion of why the Kelly criterion is relevant to decisionmaking under risk; see also Peters 2011 and the niche area of “evolutionary finance” like Evstigneev et al 2008/Lensberg & SchenkHoppé 2006 which draws connections between the Kelly criterion, probability matching, longterm survival & evolutionary fitness.) In economics, similar questions are often dealt with in terms of the lifecycle hypothesis in which economic agents strive to maximize their utility over a career/lifetime while avoiding inefficient intertemporal allocation of wealth (as Mark Twain put it, “when in youth a dollar would bring a hundred pleasures, you can’t have it. When you are old, you get it & there is nothing worth buying with it then. It’s an epitome of life. The first half of it consists of the capacity to enjoy without the chance; the last half consists of the chance without the capacity.”); in the lifecycle, one tries to build wealth as quickly as possible while young, even going into debt for investments like a college education, then begins saving up, consuming some, until retirement, at which point one consumes it all until one dies. But as far as I’ve seen any results, lifecycle models tend to not include any mechanism for spending in order to reduce mortality/aging and accept the risk of death as a given.
We could create a simple Markov decision process model. An agent (humanity), each time period (year), has a certain amount of wealth and an xrisk probability P. In this period, it can choose to allocate that wealth between economic growth, in which case it receives that investment plus a return, and it can buy a permanent percentage reduction in the xrisk for a fixed sum. For the reward, the xrisk is binary sampled with probability P; if the sample is true, then the reward is 0 and the decision process terminates, else the reward is the wealth and the process continues. Let’s imagine that this process can run up to 10,000 time periods, with a starting wealth of $248 billion (Angus Deaton’s estimate of PPP world GDP in 1500 https://en.wikipedia.org/wiki/List_of_regions_by_past_GDP_%28PPP%29 ), the economic growth rate is 2% (the longrun real growth rate of the global economy), the existential risk probability is 0.1% per year (arbitrarily chosen), and one can buy a reduction of 1% for a billion dollars. (We’ll work in trillions units to help numeric stability.) What strategy maximizes the cumulative rewards? A few simple ones come to mind:
 the agent could simply ignore the xrisk and reinvests all wealth, which to a first approximation, is the strategy which has been followed throughout human history and is primarily followed now (lumping together NASA’s Spaceguard program, biowarfare and pandemic research, AI risk research etc probably doesn’t come to more than $12b a year in 2016). This maximizes economic growth rate but may backfire as the xrisk never gets reduced.
 the agent could spend the full gain in its wealth from economic growth (2%) on xrisk reduction. The wealth doesn’t grow and the returns from xrisk reduction do diminish, but the xrisk is at least reduced greatly over time.
 the agent could implement a sort of probability matching: it spends on xrisk reduction a fraction of its wealth equal to the current P. This reduces how much is spent on extremely small xrisk reductions, but it might be suboptimal because it’ll pay the largest fraction of its economy in the first time period, then secondlargest in the second time period and so on, losing out on the potential compounding.
 a more complicated hybrid strategy might work: it maximizes wealth like #1 for the first n time periods (eg n=516), and then it switches to #2 for the remaining time period
 like #4, but switching from #1 to #3 for the remaining time periods.
constantInvestmentAgent < function (t, w, xrp) { return(c(w, 0)) }
constantReductionAgent < function (t, w, xrp) { drawdown < 0.9803921573; return(c(drawdown*w, (1drawdown)*w)) }
probabilityMatchAgent < function (t, w, xrp) { return(c(w*(1xrp), w*xrp)) }
investThenReduceAgent < function (t, w, xrp, n=516) { if (t<n) { return(constantInvestmentAgent(t, w, xrp)) } else { return(constantReductionAgent(t, w, xrp)) } }
investThenMatchAgent < function (t, w, xrp, n=516) { if (t<n) { return(constantInvestmentAgent(t, w, xrp)) } else { return(probabilityMatchAgent(t, w, xrp)) } }
simulateWorld < function(agent, t=10000) {
initialW < 0.248
initialP < 0.001
df < data.frame(T=0, Wealth=initialW, XriskP=initialP)
for (i in 1:t) {
last < tail(df, n=1)
xrisk < rbinom(1,1, p=last$XriskP)
if (xrisk) { break; } else {
choices < agent(last$T, last$Wealth, last$XriskP)
newXriskP < last$XriskP * (1  0.01)^(choices[2] / 0.001)
newWealth < choices[1] * 1.02
df < rbind(df, data.frame(T=i, Wealth=newWealth, XriskP=newXriskP))
}
}
df$Reward < cumsum(df$Wealth)
return(df)
}
library(parallel); library(plyr)
simulateWorlds < function(agent, iters=1000) {
mean(ldply(mclapply(1:iters, function(i) { tail(simulateWorld(agent), n=1)$Reward }))$V1) }
simulateWorlds(constantReductionAgent)
# [1] 2423.308636
simulateWorlds(investThenReduceAgent)
# [1] 10127204.73
simulateWorlds(constantInvestmentAgent)
# [1] 1.154991741e+76
simulateWorlds(investThenMatchAgent)
# [1] 7.53514145e+86
## Optimize the switch point:
which.max(sapply(seq(1, 10000, by=100), function(N) { simulateWorlds(function(t,w,xrp) { investThenMatchAgent(t, w, xrp, n=N) }, iters=100)}))
# [1] 3
simulateWorlds(function(t,w,xrp) { investThenMatchAgent(t, w, xrp, n=300) })
# [1] 9.331170221e+86
simulateWorlds(probabilityMatchAgent)
# [1] 1.006834082e+87
So of our 5 strategies, the constant reduction agent performs the worst (probably because with economic growth choked off, it can only buy small xrisk reductions), followed by the investthenreduce agent; then the ‘get rich before you get old’ constant investment agent manages to often attain very high growth rates when it’s lucky enough that xrisks don’t strike early on; but far better than any of them, by orders of magnitude, are the partial and full probability matching agents. The partial probability matching agent turns out to have a suboptimal switch point t=516, and a more careful search of switch points finds that t~=300 is the best switch point and it exceeds the pure probability matcher which matches from the start.
What’s going on there? I suspect it’s something similar to the difference in multiarmed bandit problems between the asymptotically optimal solution and the optimal solution for a fixed horizon found using dynamic programming: in the former scenario, there’s an indefinite amount of time to do any exploration or investment in information, but in the latter, there’s only a finite time left and exploration/growth must be done up front and then the optimal decision increasingly shifts to exploitation rather than growth.
Why does probability matching in general work so well? It may simply be because it’s the only baseline strategy which adjusts its xrisk investment over time.
This doesn’t demonstrate that probability matching is optimal, just that it beats the other baseline strategies. Other strategies could be used to decrease xrisk investment over time—instead of being proportional to xrisk P, it could shrink linearly over time, or by square root, or logarithmically, or…
What reinforcement learning techniques might we use to solve this?
This problem represents a large Markov Decision Process with 1 discrete state variable (time, t=010000), 2 continuous state variables (wealth, and risk probability), and 1 continuous action (fraction of growth to allocate to the economy vs existential risk reduction). The continuous action can be discretized into 11 actions without probably losing anything (allocate 100%/90%..10%/0%), but the 2 state variables can’t be discretized easily because they can span many orders of magnitude.
dynamic programming a decision tree with backwards induction: optimal, but requires discrete actions and state variables, and even if discretized, 10000 time steps would be infeasibly large.
standard tabular learning: Qlearning, SARSA, temporal differences: requires discrete actions and state variables
 Deep QNetworks: requires discrete actions, but not state variables
MDP solvers: value iteration etc: optimal, but requires discrete actions and state variables
hybrid MDP solvers: optimal, and can handle a limited amount of continuous state variables (but not continuous actions), which would work here; but high quality software implementations are rarely available.
One such hybrid MDP solver is
hmpd
, which solves problems specified in the PDDL Lisplike DSL (judging from the examples, a version with probabilistic effects, so PPDDL 1.0?). After trying to write down a PPDDL model corresponding to this scenario, it seems that PPDDL is unable to represent probabilities or rewards which change with time and so cannot represent the increase in wealth or decrease in xrisk probability.policy gradients: can handle continuous state variables & actions but are highly complex and unstable; high quality software implementations are unavailable
Of the possible options, a DQN agent seems like the best choice: a small neural network should be able to handle the problem and DQN only requires the actions to be discretized. reinforce.js
provides a DQN implementation in JS which I’ve used before, so I start there by rewriting the problem in JS
var script = document.createElement("script");
script.src = "https://www.gwern.net/docs/rl/armstrongcontrolproblem/20160202karpathyrl.js";
document.body.appendChild(script);
// environment: t, w, xrp
function simulate(environment, w_weight, xrp_weight) {
var xrisk = Math.random() < environment.xrp
if (xrisk) {
return {reward: 100, alive: false, t: environment.t, w: environment.w, xrp: environment.xrp};
} else {
return {reward: Math.log(environment.w), alive: true, t: environment.t+1,
w: environment.w*w_weight*1.02, xrp: environment.xrp * (Math.pow((1  0.01), (xrp_weight / 0.001))) }
}
}
var defaultState = {t: 0, w: 0.248, xrp: 0.01}
// simulate(defaultState, 0.99, 0.01)
// simulate(defaultState, 0.99, 0.01)
var env = {};
env.getNumStates = function() { return 3; }; // there are only 3 state variables: t/w/xrp
env.getMaxNumActions = function() { return 11; }; // we'll specify 10 possible allocations: 1/0, 0.998/0.002 .. 0.98/0.02
var spec = {
num_hidden_units: 200,
experience_add_every: 20,
learning_steps_per_iteration: 1,
experience_size: 1000000,
alpha: 0.01,
epsilon: 1.0,
gamma: 0.99 // minimal discounting
};
var agent = new RL.DQNAgent(env, spec);
var total_reward = 0;
state = defaultState;
spec.epsilon = 1.0; // reset epsilon if we've been running the loop multiple times
for(var i=0; i < 10000*3000; i++) {
var action = agent.act(state)
state = simulate(state, 1(action/500), 0+(action/500) );
agent.learn(state.reward);
total_reward = total_reward + state.reward;
if (Number.isInteger(Math.log(i) / Math.log(10)) ) { spec.epsilon = spec.epsilon / 1.5; } // decrease exploration
if (!state.alive  state.t >= 10000) { // if killed by xrisk or horizon reached
console.log(state.t, state.w, state.xrp, total_reward);
total_reward = 0;
state = defaultState;
}
}
//exercise the trained agent to see how it thinks
total_reward=0
state=defaultState;
spec.epsilon = 0;
for (var t=0; t < 10000; t++) {
action = agent.act(state)
state = simulate(state, 1(action/500), 0+(action/500) );
total_reward = total_reward + state.reward
console.log(action, state, total_reward);
}
After a day of training, the DQN agent had learned to get up to 5e41, which was disappointingly inferior to the constant investment & probability matching agents (1e87). The NN looks big enough for this problem and the experience replay buffer was more than adequate; NNs in RL are known to have issues with the reward, though, and typically ‘clamp’ the reward to a narrow range, so I suspected that rewards going up to 5e41 (interpreting wealth on each turn as the reward) might be playing havoc with convergence, and switched the reward to log wealth instead. This did not make a noticeable difference overnight (aside from the DQN agent now achieving 9.5e41). I wondered if the risk was too rare for easy learning and 100 neurons was not enough to approximate the curve over time, so I fixed a bug I noticed where the simulation did not terminate at t=10000, doubled led the neuron count, increased the initial xrisk to 1%, and began a fresh run. After 1 day, it reached 9.4e41 total reward (unlogged).
Cumulative log score for DQN after tweaks and ~2h of training: regularly reaches ~470k when it doesn’t die immediately (which happens ~1/20 of the time). In comparison, probabilitymatching agent averages a cumulative log score of 866k. After 2 days of training, the DQN had improved only slightly; the onpolicy strategy appears mostly random aside from having driven the xrisk probability down to what appears to be the smallest float JS supports, so it still had not learned a meaningful compromise between xrisk reduction and investment.
TODO: revisit with MCTS at some point?
Model Criticism via Machine Learning
In “Deep learning, model checking, AI, the nohomunculus principle, and the unitary nature of consciousness”, Andrew Gelman writes
Here’s how we put it on the very first page of our book:
The process of Bayesian data analysis can be idealized by dividing it into the following three steps:
 Setting up a full probability model  a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.
 Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.
 Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.
How does this fit in with goals of performing statistical analysis using artificial intelligence?
3. The third step—identifying model misfit and, in response, figuring out how to improve the model—seems like the toughest part to automate. We often learn of model problems through openended exploratory data analysis, where we look at data to find unexpected patterns and compare inferences to our vast stores of statistical experience and subjectmatter knowledge. Indeed, one of my main pieces of advice to statisticians is to integrate that knowledge into statistical analysis, both in the form of formal prior distributions and in a willingness to carefully interrogate the implications of fitted models.
One way of looking at step #3 is to treat the human statistician as another model: specifically, he is a large neural network with trillions of parameters, who has been trained to look for anomalies & model misspecification, and to fix them when he finds them, retraining the model, until he can no longer easily distinguish the original data from the model’s predictions or samples. As he is such a large model with the ability to represent and infer a large class of nonlinearities, he can usually easily spot flaws where the current model’s distribution differs from the true distribution.
This bears a considerable resemblance to the increasing popularity of “generative adversarial networks” (GANs): using pairs of neural networks, one of which tries to generate realistic data, and a second which tries to classify or discriminate between real and realistic data. As the second learns ways in which the current realistic data is unrealistic, the first gets feedback on what it’s doing wrong and fixes it. So the loop is very similar, but fully automated. (A third set of approaches this resembles is actorcritic reinforcement learning algorithms.)
If we consider the kinds of models which are being critiqued, and what is critiquing, this gives us 4 possible combinations:
simple  complex  

simple  model fit indexes+linear model  statistician+linear model 
complex  model fit indexes+ML  ML+ML (eg GANs) 
Simple/simple is useful for cases like linear regression where classic methods like examining residuals or R^2s or Cook indexes can often flag problems with the model.
Simple/complex is also useful, as the human statistician can spot additional problems.
Complex/simple is probably useless, as the NNs may easily have severe problems but will have fit any simple linear structure and fool regular diagnostics.
Complex/complex can be very useful in machine learning, but in different ways from a good simple model.
Fast, simple, general—a good statistical method lets you choose one; a great method lets you choose two. (Consider linear models, decision trees, NNs, MCMC, ABC, discrete Bayesian networks, and exponential family vs nonparametric methods as examples of the tradeoffs here.)
So is quadrant 2 fully populated by human statisticians? We wouldn’t necessarily want to use GANs for everything we use statisticians for now, because neural networks can be too powerful and what we want from our models is often some sort of clear answer like “does X predict Y?” and simplicity. But we could replace the statistician with some other powerful critic from machine learning—like a NN, SVM, random forest, or other ensemble. So instead of having two NNs fighting each other as in a GAN, we simply have one specified model, and a NN which tries to find flaws in it, which can then be reported to the user. The loop then becomes: write down and fit a model to the real data; generative posterior predictive samples from the distribution; train a small NN on real data vs predictive data; the classification performance measures the plausibility of the predictive samples (perhaps something like a KL divergence), giving a measure of the model quality, and flags data points which are particularly easily distinguished as real; the human statistician now knows exactly which data points are not captured by the model and can modify the model; repeat until the NN’s performance declines to chance.
Let’s try an example. We’ll set up a simple linear model regression Y ~ A + B + C
with a few problems in it:
 the trend is not linear but slightly quadratic
 the outcome variable is also rightcensored at a certain point
 and finally, the measured covariates have been rounded
set.seed(20161123)
n < 10000
ceiling < 1
a < rnorm(n)
b < rnorm(n)
c < rnorm(n)
y < 0 + 0.5*a + 0.5*b + 0.5*c^2 + rnorm(n)
y_censored < ifelse(y>=3, 3, y)
df < data.frame(Y=y_censored, A=round(a, digits=1), B=round(b, digits=1), C=round(c, digits=1))
l < lm(Y ~ A + B + C, data=df)
summary(l)
plot(l)
plot(df$Y, predict(l, df))
l2 < lm(Y ~ A + B + I(C^2), data=df)
summary(l2)
plot(df$Y, predict(l2, df))
The censoring shows up immediately on the diagnostics as an excess of actual points at 3, but the quadraticity is subtler, and I’m not sure I can see the rounding at all.
library(randomForest)
## First, random forest performance under the null hypothesis
modelNull < data.frame(Y=c(df$Y, df$Y), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_n < randomForest(as.ordered(Real) ~ Y + A + B + C, modelNull); r_n
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 100%
# Confusion matrix:
# 0 1 class.error
# 0 0 10000 1
# 1 10000 0 1
modelPredictions < data.frame(Y=c(df$Y, predict(l, df)), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r < randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions); r
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 6.59%
# Confusion matrix:
# 0 1 class.error
# 0 9883 117 0.0117
# 1 1200 8800 0.1200
## many of the LM predictions are identical, but the RF is not simply memorizing them as we can jitter predictions and still get the same classification performance:
modelPredictions$Y2 < jitter(modelPredictions$Y)
randomForest(as.ordered(Real) ~ Y2 + A + B + C, modelPredictions)
#... Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of error rate: 6.57%
# Confusion matrix:
# 0 1 class.error
# 0 9887 113 0.0113
# 1 1200 8800 0.1200
Note we need to be careful about collecting the posterior predictive samples: if we collect 10000 posterior samples for each of the 10000 datapoints, we’ll store 10000^{2} numbers which may cause problems. 1 should be enough.
library(runjags)
model < 'model {
for (i in 1:n) {
mean[i] < mu + betaA*A[i] + betaB*B[i] + betaC*C[i]
Y[i] ~ dnorm(mean[i], tau)
}
sd ~ dgamma(0.01, 0.01)
tau < 1/sqrt(sd)
mu ~ dnorm(0, 100)
betaA ~ dnorm(0, 100)
betaB ~ dnorm(0, 100)
betaC ~ dnorm(0, 100)
}'
model < run.jags(model, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)
posterior_predictive < tail(n=10000, model$mcmc[[1]][1,])
plot(df$Y, posterior_predictive)
modelPredictions_r < data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r < randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r
# OOB estimate of error rate: 49.11%
# Confusion matrix:
# 0 1 class.error
# 0 4953 5047 0.5047
# 1 4776 5224 0.4776
model_rounded < 'model {
for (i in 1:n) {
roundA[i] ~ dround(A[i], 3)
roundB[i] ~ dround(B[i], 3)
roundC[i] ~ dround(C[i], 3)
mean[i] < mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
Y[i] ~ dnorm(mean[i], tau)
}
sd ~ dgamma(0.01, 0.01)
tau < 1/sqrt(sd)
mu ~ dnorm(0, 100)
betaA ~ dnorm(0, 100)
betaB ~ dnorm(0, 100)
betaC ~ dnorm(0, 100)
}'
model_r < run.jags(model_rounded, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)
posterior_samples < tail(n=10000, model_r$mcmc[[1]][1,])
posterior_predictive < ifelse(posterior_samples>=3, 3, posterior_samples)
plot(df$Y, posterior_predictive)
modelPredictions_r < data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r < randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r_r
# OOB estimate of error rate: 50.48%
# Confusion matrix:
# 0 1 class.error
# 0 4814 5186 0.5186
# 1 4909 5091 0.4909
model_rounded_censor < 'model {
for (i in 1:n) {
roundA[i] ~ dround(A[i], 3)
roundB[i] ~ dround(B[i], 3)
roundC[i] ~ dround(C[i], 3)
mean[i] < mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
Y[i] ~ dnorm(mean[i], tau)
is.censored[i] ~ dinterval(Y[i], c)
}
sd ~ dgamma(0.01, 0.01)
tau < 1/sqrt(sd)
mu ~ dnorm(0, 100)
betaA ~ dnorm(0, 100)
betaB ~ dnorm(0, 100)
betaC ~ dnorm(0, 100)
}'
model_r_c < run.jags(model_rounded_censor, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)
posterior_samples < tail(n=10000, model_r_c$mcmc[[1]][1,])
posterior_predictive < ifelse(posterior_samples>=3, 3, posterior_samples)
modelPredictions_r_c < data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c < randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c); r_r_c
# OOB estimate of error rate: 53.67%
# Confusion matrix:
# 0 1 class.error
# 0 4490 5510 0.5510
# 1 5224 4776 0.5224
model_rounded_censor_quadratic < 'model {
for (i in 1:n) {
roundA[i] ~ dround(A[i], 3)
roundB[i] ~ dround(B[i], 3)
roundC[i] ~ dround(C[i], 3)
mean[i] < mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]^2
Y[i] ~ dnorm(mean[i], tau)
is.censored[i] ~ dinterval(Y[i], c)
}
sd ~ dgamma(0.01, 0.01)
tau < 1/sqrt(sd)
mu ~ dnorm(0, 100)
betaA ~ dnorm(0, 100)
betaB ~ dnorm(0, 100)
betaC ~ dnorm(0, 100)
}'
model_r_c_q < run.jags(model_rounded_censor_quadratic, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)
posterior_samples < tail(n=10000, model_r_c_q$mcmc[[1]][1,])
posterior_predictive < ifelse(posterior_samples>=3, 3, posterior_samples)
modelPredictions_r_c_q < data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c_q < randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c_q); r_r_c_q
# OOB estimate of error rate: 61.02%
# Confusion matrix:
# 0 1 class.error
# 0 3924 6076 0.6076
# 1 6127 3873 0.6127
trueNegatives < modelPredictions_r_c_q[predict(r_r_c_q) == 0 & modelPredictions_r_c_q$Real == 0,]
Where can we go with this? The ML techniques can be used to rank existing Bayesian models in an effective if unprincipled way. Techniques which quantify uncertainty like Bayesian neural networks could give more effective feedback by highlighting the points the Bayesian NN is most certain are fake, guiding the analyst towards the worstmodeled datapoints and providing hints for improvement. More inspiration could be borrowed from the GAN literature, such as “minibatch discrimination”—as demonstrated above, the random forests only see one data point at a time, but in training GANs, it has proven useful to instead examine multiple datapoints at a time to encourage the generator to learn how to generate a wide variety of datapoints rather than modeling a few datapoints extremely well; a ML model which can predict multiple outputs simultaneously based on multiple inputs would be analogous (that is, instead of X ~ A + B + C
, it would look more like X1 + X2 + X3 ... ~ A1 + B1 + C1 + A2 + B2 + C2 + ...
, with the independent & dependent variables from multiple data points all fed in simultaneously as a single sample) and might be an even more effective model critic.
Proportion of Important Thinkers by Global Region Over Time in Charles Murray’s Human Accomplishment
Human Accomplishment is a 2003 book by Charles Murray reporting a largescale citation analysis of biographical dictionaries & reference books on art/literature/science/mathematics/philosophy/science throughout history, quantifying the relative importance of “significant individuals” such as Isaac Newton or Immanuel Kant or Confucius and the temporal & geographical patterns; in particular, it demonstrates large European contributions throughout history and increasingly dramatically post1400 AD. The dataset has been released.
Emil Kirkegaard created a visualization of of the proportion by rough geographic region (European/Asian/other) in R using ggplot2 and LOESS smoothing. Perhaps the most striking aspect of it is the Dark Ages showing up as a spike in Asian proportion, followed by the Great Divergence.
This visualization has been criticized as Eurocentricallymisleading and driven by artifacts in the analysis/graphing:
LOESS ignores the constraint that proportions must be 01 and naively extrapolates beyond the boundaries, producing negative estimates for some regions/times
no visualization of uncertainty is provided, either in the form of graphing the raw data points by superimposing a scatterplot or by providing standard errors or credible intervals. It is possible that the overall shapes or specific periods are no more than chance scatters in a timeseries based on few datapoints.
 LOESS can provide estimate local standard errors & confidence intervals but they are of questionable meaning in the absence of the underlying counts
alternatively, the distribution of significant figures may not be treated correctly parametrically
proportions may reflect a timeseries with trends and so precision is exaggerated
None of these objections hold any water as the dataset and its embedded differences are sufficiently large that the method of analysis will make little difference; I will demonstrate this below by reanalyzing it to address the quibbles and show that all patterns remain intact or are sharpened. The above criticisms can be addressed by:
 switching from a LOESS plot to splines or local binomial regressions
 plotting the raw proportions grouped by decade or century
 using a nonparametric bootstrap to calculate confidence intervals, a procedure which lends itself to visualization as an animation of plots of all the resamples, giving an intuitive sense of how important sampling error is to the overall pattern of curves and specific parts of history
 alternately, instead of attempting to fit the proportion, one can fit the original count of significant figures in a binomial or lognormal Bayesian timeseries model and sample from the posterior estimates of each region for each decade/century, and calculate posterior proportions, gaining full quantification of uncertainty, incorporation of any autocorrelation, and smoothing; no additional algorithms or theorems are required, demonstrating the elegance of Bayesian approaches
I didn’t realize Kirkegaard’s R code was available so I wound up redoing it myself (and getting the same results):
## export CSV from spreadsheet in https://osf.io/z9cnk/
h < read.csv("HA.csv", header=TRUE)
summary(h)
# Serial Name Fl Birth Death Inventory ScienceField
# Min. : 11.00 Descartes, René : 4 Min. :700.000 Min. :640.000 Min. :559.00 Science :1442 :2560
# 1st Qu.: 6144.50 Hooke, Robert : 4 1st Qu.:1557.250 1st Qu.:1580.000 1st Qu.:1638.00 Lit.West : 835 Tech : 239
# Median :12534.50 Leonardo da Vinci: 4 Median :1804.000 Median :1782.000 Median :1844.00 Music.West: 522 Phys : 218
# Mean :15994.27 Archimedes : 3 Mean :1585.638 Mean :1616.174 Mean :1682.81 Art.West : 479 Chem : 204
# 3rd Qu.:21999.75 Bacon, Francis : 3 3rd Qu.:1900.000 3rd Qu.:1863.000 3rd Qu.:1930.00 Phil.West : 155 Biol : 193
# Max. :43134.00 d'Alembert, Jean : 3 Max. :1949.000 Max. :1910.000 Max. :1997.00 Art.China : 111 Math : 191
# (Other) :3981 NA's :304 NA's :351 (Other) : 458 (Other): 397
# Index Duplicate BirthCountry WorkCountry Ethnicity Woman No..of.Inventories
# Min. : 0.60000 Min. :0.00000000 France : 564 France : 605 Germanic: 592 Min. :0.00000000 Min. :2.000000
# 1st Qu.: 3.54000 1st Qu.:0.00000000 Germany: 556 Britain: 574 French : 565 1st Qu.:0.00000000 1st Qu.:2.000000
# Median : 7.60000 Median :0.00000000 Britain: 554 Germany: 525 English : 441 Median :0.00000000 Median :2.000000
# Mean : 12.95713 Mean :0.06221889 Italy : 400 Italy : 406 Italian : 397 Mean :0.02198901 Mean :2.228916
# 3rd Qu.: 15.89000 3rd Qu.:0.00000000 USA : 306 USA : 375 USA : 276 3rd Qu.:0.00000000 3rd Qu.:2.000000
# Max. :100.00000 Max. :1.00000000 China : 239 China : 239 Chinese : 240 Max. :1.00000000 Max. :4.000000
# NA's :115 (Other):1383 (Other):1278 (Other) :1491 NA's :3753
levels(h$Ethnicity)
# [1] "Ancient Greek" "Ancient Roman" "Arabic" "Australian" "Basque" "Black" "Bulgarian" "Canadian"
# [9] "Chinese" "Croatian" "Czech" "Danish" "Dutch" "English" "Estonian" "Finnish"
# [17] "Flemish" "French" "Germanic" "Greek" "Hungarian" "Icelandic" "Indian" "Irish"
# [25] "Italian" "Japanese" "Jewish" "Latino" "New Zealand" "Norwegian" "Polish" "Portuguese"
# [33] "Romanian" "Scots" "Slavic" "Slovenian" "Spanish" "Swedish" "Swiss" "USA"
european < c("Ancient Greek", "Ancient Roman", "Australian", "Basque", "Bulgarian", "Canadian", "Croatian", "Czech", "Danish",
"Dutch", "English", "Estonian", "Finnish", "Flemish", "French", "Germanic", "Greek", "Hungarian", "Icelandic", "Irish",
"Italian", "Jewish", "New Zealand", "Norwegian", "Polish", "Portuguese", "Romanian", "Scots", "Slavic", "Slovenian",
"Spanish", "Swedish", "Swiss", "USA")
asian < c("Chinese", "Indian", "Japanese")
other < c("Arabic", "Black", "Latino")
groupMembership < function(e) { if (e %in% european) { "European" } else { if (e %in% asian) { "Asian" } else { "Other" } } }
h$Group < as.factor(sapply(h$Ethnicity, groupMembership))
summary(h$Group)
# Asian European Other
# 507 3379 116
## We use 'Fl' (floruit/flourished), when a person is believed to have done their most important work,
## since birth/death is often unavailable.
## group to decades by rounding:
h$Fl.decade < round(h$Fl, digits=1)
hd < subset(select=c(Fl.decade, Group), h)
hdcount < aggregate(cbind(Group) ~ Fl.decade+Group, length, data=hd)
colnames(hdcount)[3] < "Count"
## sort by time:
hdcount < hdcount[order(hdcount$Fl.decade),]
nrow(h); sum(hdcount$Count)
# [1] 4002
# [1] 4002
head(hdcount, n=20)
# Fl.decade Group Count
# 178 700 European 3
# 179 680 European 1
# 180 650 European 1
# 1 600 Asian 2
# 181 600 European 2
# 182 580 European 2
# 183 570 European 2
# 2 550 Asian 1
# 184 550 European 1
# 185 540 European 5
# 3 520 Asian 1
# 186 520 European 3
# 4 510 Asian 1
# 187 510 European 2
# 188 500 European 2
# 189 480 European 6
# 190 460 European 3
# 191 450 European 7
# 5 440 Asian 1
# 192 440 European 11
## One issue with the count data: decades with zero significant figures from a group
## (which happens frequently) get suppressed. Some tools can handle the omission
## automatically but many cannot, so we need to manually insert any missing decades with '0'
decades < seq(700, 1950, by=10)
for (i in 1:length(decades)) {
d < decades[i]
if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="European",])==0) {
hdcount < rbind(hdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Asian",])==0) {
hdcount < rbind(hdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Other",])==0) {
hdcount < rbind(hdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
}
hdcount < hdcount[order(hdcount$Fl.decade),]
library(ggplot2); library(gridExtra)
c1 < with(hdcount, qplot(Fl.decade, Count, color=Group) + stat_smooth())
c2 < with(hdcount, qplot(Fl.decade, log1p(Count), color=Group) + stat_smooth())
grid.arrange(c1, c2, ncol=1)
The absolute growth in human population and and hence accomplishment post1400 is so dramatic that it obscures earlier temporal variations:
Logtransformed, we can still see the invertedV shape of European counts, but it’s somewhat subtle because it’s still being squashed by post1400 increases and does leave room for doubt about sampling error. Moving on to reproducing the proportions plot:
## Create proportions by summing per decade, then looping over each group & dividing by total for that decade:
decadeTotals < aggregate(Count ~ Fl.decade, sum, data=hdcount)
for (i in 1:nrow(hdcount)) {
total < decadeTotals[decadeTotals$Fl.decade == hdcount[i,]$Fl.decade,]$Count
p < hdcount[i,]$Count / total
hdcount$Proportion[i] < if(is.nan(p)) { 0 } else { p }
hdcount$Total[i] < total
}
with(hdcount, qplot(Fl.decade, Proportion, color=Group) + stat_smooth() + coord_cartesian(ylim = c(0, 1)))
We successfully reproduce it, modulo the LOESS standard errors (which can be disabled by adding se=FALSE
to stat_smooth()
), including the unwanted nonsensical extrapolations. It is possible with some tricky ggplot2 functionality to add in binomial smoothing (along with some jitter to unbunch the datapoints at the modal 0).
## roughly equivalent to:
# glm(cbind(Count,Total) ~ splines::ns(Fl.decade,3), family="binomial", data=hdcount, subset=Group=="European")
binomial_smooth < function(...) { geom_smooth(se=FALSE, method = "glm", method.args = list(family = "binomial"), ...) }
with(hdcount, qplot(Fl.decade, Proportion, color=Group) +
binomial_smooth(formula = y ~ splines::ns(x, 3)) +
geom_jitter(aes(color=Group), width=0.013,, height=0.013))
This still doesn’t provide any indication of sampling error uncertainty, however. Kirkegaard provides one with CIs derived from bootstrapping, so I will provide something a little different: visualizing the uncertainty dynamically by graphing the smoothed proportions for each resample in an animation of hundreds of bootstrap samples.
So to do this bootstrap, we package up the various transformations from before, so we can samplewithreplacement the original dataset^{9}, transform, and plot repeatedly:
transformAndProportion < function(df) {
df$Fl.decade < round(df$Fl, digits=1)
dfd < subset(select=c(Fl.decade, Group), df)
dfdcount < aggregate(cbind(Group) ~ Fl.decade+Group, length, data=dfd)
colnames(dfdcount)[3] < "Count"
decades < seq(700, 1950, by=10)
for (i in 1:length(decades)) {
d < decades[i]
if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="European",])==0) {
dfdcount < rbind(dfdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Asian",])==0) {
dfdcount < rbind(dfdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Other",])==0) {
dfdcount < rbind(dfdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
}
dfdcount < dfdcount[order(dfdcount$Fl.decade),]
decadeTotals < aggregate(Count ~ Fl.decade, sum, data=dfdcount)
for (i in 1:nrow(dfdcount)) {
p < dfdcount[i,]$Count / decadeTotals[decadeTotals$Fl.decade == dfdcount[i,]$Fl.decade,]$Count
dfdcount$Proportion[i] < if(is.nan(p)) { 0 } else { p }
}
return(dfdcount)
}
bootPlot < function(df) {
n < nrow(df)
bootDf < df[sample(1:n, n, replace=TRUE),]
bootDfdcount < transformAndProportion(bootDf)
## WARNING: can't just call qplot due to old 'animation'/ggplot2 bug; have to assign & 'print'
p < with(bootDfdcount, qplot(Fl.decade, Proportion, color=Group) +
binomial_smooth(formula = y ~ splines::ns(x, 3)) +
geom_jitter(aes(color=Group), width=0.013,, height=0.013))
print(p)
}
library(animation)
saveGIF({for (i in 1:200) { bootPlot(h) }}, interval=0.15, ani.width=1300, ani.height=700,
movie.name="2003murrayhumanaccomplishmentregionproportionsbootstrap.gif", clean=FALSE)
The bootstrap animation suggests to me that while the very earliest timeperiods are opaque and the Dark Ages difference between Europe & Asia may be somewhat higher or lower, the overall shape doesn’t change meaningfully.
The timeseries aspect of the data on visual inspection appears to be a simple autoregressive model with a drift upwards, loworder models like ARIMA(1,1,0)
, ARIMA(1,1,2)
, or ARIMA(0,1,2)
; this is probably due to the world population steadily increasing while the per capita rates remain stable.
library(forecast)
efit < auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="European")$Count)
afit < auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Asian")$Count)
ofit < auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Other")$Count)
par(mfrow=c(3,1))
plot(forecast(efit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(afit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(ofit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
We can combine the sampling error quantification of full Bayesian posteriors, Poisson distribution of counts, and timeseries aspects into a single Bayesian model using brms
as a convenient interface to Stan (rather than writing out the full model by hand), with uninformative priors, and then visualize the posterior distribution of the proportions (which itself is simply a transformation of the posterior):
library(brms)
b < brm(Count ~ (1Group), autocor = cor_bsts(~ Fl.decade  Group), family="zero_inflated_poisson", data = hdcount)
## Rather than use `fitted` to get the 95% CI & compute proportion, it would also be possible to draw samples from
## the posterior for each group/decade, total, calculate pergroup proportion, and then summarize into quantiles; but
## that is much slower and requires more finicky code:
posterior < fitted(b)
hdcount$B.low.prop < posterior[,3] / hdcount$Total
hdcount$B.mean.prop < posterior[,1] / hdcount$Total
hdcount$B.high.prop < pmin(1, posterior[,4] / hdcount$Total)
predframe < subset(hdcount, select=c("B.low.prop", "B.high.prop"))
with(hdcount, ggplot(hdcount, aes(Fl.decade, Proportion, color=Group)) +
geom_point() +
geom_line(data=predframe) +
geom_ribbon(aes(ymin=B.low.prop, ymax=B.high.prop), alpha=0.05, data=predframe))
The smoothed timeseries looks about the same, and the CIs suggest, like the bootstrap, that there is great uncertainty early on when populations are small & surviving figures are rare, but that the dark ages dip looks real and the European increases in proportion since then are also highly probable.
So overall, correcting for the infelicities in Kirkegaard’s original graph makes the graph somewhat cleaner and is helpful in providing quantification of uncertainty, but none of the problems drove the overall appearance of the curve in the slightest bit. If the graph is wrong, the issues will lie in systematic biases in the data itself—not statistical quibbling over sampling error or LOESS curves crossing an axis. (Comparison with graphs drawn from other cliometric datasets such as Wikipedia or particularly Seshat would be informative.)
Program for nonspacedrepetition review of past written materials for serendipity & rediscovery: Archive Revisiter
One reason to take notes/clippings and leave comments in stimulating discussions is to later benefit by having references & citations at hand, and gradually build up an idea from disparate threads and make new connections between them. For this purpose, I make extensive excerpts from web pages & documents I read into my Evernote clippings (functioning as a commonplace book), and I comment constantly on Reddit/LessWrong/HN etc. While expensive in time & effort, I often go back, months or years later, and search for a particular thing and expand & integrate it into another writing or expand it out to an entire essay of its own. (I also value highly not being in the situation where I believe something but I do not know why I believe it other than the conviction “I read it somewhere, once”.)
This sort of personal information management using simple personal information managers like Evernote works well enough when I have a clear memory of what the citation/factoid was, perhaps because it was so memorable, or when the citations or comments are in a nice cluster (perhaps because there was a key phrase in them or I kept going back & expanding a comment), but it loses out on key benefits to this procedure: serendipity and perspective.
As time passes, one may realize the importance of an odd tidbit or have utterly forgotten something or events considerably changed its meaning; in this case, you would benefit from revisiting & rereading that old bit & experiencing an “aha!” moment, but you don’t realize it. So one thing you could do is reread all your old clippings & comments, appraising them for reuse.
But how often? And it’s a pain to do so. And how do you keep track of which you’ve already read? One thing I do for my emails is semiannually I (try to) read through my previous 6 months of email to see what might need to be followed up on^{10} or mined for inclusion in an article. (For example, an ignored request for data, or a discussion of darknet markets with a journalist I could excerpt into one of my DNM articles so I can point future journalists at that instead.) This is already difficult, and it would be even harder to expand. I have read through my LessWrong comment history… once. Years ago. It would be more difficult now. (And it would be impossible to read through my Reddit comments as the interface only goes back ~1000 comments.)
Simply rereading periodically in big blocks may work but is suboptimal: there is no interface easily set up to reread them in small chunks over time, no constraints which avoid far too many reads, nor is there any way to remove individual items which you are certain need never be reviewed again. Reviewing is useful but can be an indefinite timesink. (My sent emails are not too hard to review in 6month chunks, but my IRC logs are bad—7,182,361 words in one channel alone—and my >38k Evernote clippings are worse; any lifestreaming will exacerbate the problem by orders of magnitude.) This is probably one reason that people who keep journals or diaries don’t reread Nor can it be crowdsourced or done by simply ranking comments by public upvotes (in the case of Reddit/LW/HN comments), because the most popular comments are ones you likely remember well & have already used up, and the oddities & serendipities you are hoping for are likely unrecognizable to outsiders.
This suggests some sort of reviewing framework where one systematically reviews old items (sent emails, comments, IRC logs by oneself), putting in a constant amount of time regularly and using some sort of ever expanding interval between rereads as an item becomes exhausted & ever more likely to not be helpful. Similar to the logarithmicallybounded number of backups required for indefinite survival of data (Sandberg & Armstrong 2012), “Deconstructing Deathism—Answering Objections to Immortality”, Mike Perry 2013 (note: this is an entirely different kind of problem than those considered in Freeman Dyson’s immortal intelligences in Infinite in All Directions, which are more fundamental), discusses something like what I have in mind in terms of an immortal agent trying to review its memories & maintain a sense of continuity, pointing out that if time is allocated correctly, it will not consume 100% of the agent’s time but can be set to consume some bounded fraction:
It seems reasonable that past versions of the self would “survive” as we remember the events of times past, that is to say, our episodic memories, and this would have importance in our continuing to persist as what could be considered the “same” albeit also a changing, developing person. But in addition to this mnemonic reinforcement I imagine there would be a more general feeling of being a particular individual, an “ambiance” derived from but not referring to any specific past experiences. Ambiance alone would not be sufficient, I think, to make us who we are; episodic memories would also be necessary, yet it could considerably lessen the need for frequent recall and thus alleviate the problem of dilution.
Another interesting thought is that certain items might consistently be consulted more frequently than others. (Indeed, would this not be expected?) In this way it would actually be possible to bypass the dilution effect and instead allow a fixed fraction of time for perusal of any given item, even as more items were added indefinitely. A simple way of doing this could be first to allow some fixed fraction of the time for daytoday affairs and other nonarchival work (“prime time”), and spend the rest of the time on perusal of personal archives (“archive time”). The exact apportioning of prime versus archive time is not important here, but it will be instructive to consider how the archive time itself might be subdivided. A simple, if overly simplistic, strategy would be to have half this time devoted to the first century’s records, half the remainder to the second century, and so on. (Since there would only be a finite number of centuries, there would be some unused archive time at the end, which could be spent as desired. Note, however, that in the limit of infinite total time covering infinitely many centuries, the usage of archive time would approach but not exceed 100%.) In this way, then, there would be a fixed fraction of archive time, , spent on the _n_th century’s records, regardless of how many centuries beyond the nth were lived or how many records accumulated. True, this way of apportioning time might not be much good beyond a few centuries; only about one trillionth the total time would be spent on the 40th century, for instance, around sec per 100 years. (Possibly a lot could be covered even in this brief interval of about 3 million nanoseconds, however.) But the apportionment scheme could be adjusted.
A more interesting and plausible, if slightly hardertodescribe scheme would be to choose a constant and allow the fraction to the _n_thcentury records. It is easy to show that the time for all centuries will add up to 100% as before, whatever positive value of c we start with. Starting with will get 10% of the total time spent on the first century, with subsequent centuries receiving a diminishing share as before, but the rate of falloff will be much slower, so that the 40th century will still receive 0.4%, or about 5 months per 100 years, that is to say, 240 million nanoseconds per minute. If we suppose that our immortal settles eventually into a routine in which 10% of the time overall is archive time, there would be 24 million nanoseconds available each minute of life for the 40th century’s memories alone, if desired, with many other centuries getting more or less comparable or greater amounts of attention, and none omitted entirely. This, I think, makes at least a plausible case that a reasonable sense of one’s personal identity could be sustained indefinitely.
In the above examples the greatest proportion of archive time falls to the earlier records, which might be fitting since these should be the most important as formative years for the prospective immortal, thus the most important for identity maintenance. (Memory recall would also naturally occur during prime time; the emphasis here could be on recent events, to maintain a balance overall.) In summary, then, we have considered ways that the problem of dilution might be successfully managed. Relatively infrequent perusal of memories might still suffice to maintain the necessary continuity with past versions of the self, or proper scheduling could stabilize the frequency of recall and bypass the dilution effect, or both. We see in any case that the problem is not what it may seem at first sight. We have no guarantee, of course, that it would not get out of bounds, but certainly some grounds for hope.
So you could imagine some sort of software along the lines of spaced repetition systems like Anki/Mnemosyne/Supermemo which you spend, say, 10 minutes a day at, simply rereading a selection of old emails you sent, lines from IRC with n lines of surrounding context, Reddit & LW comments etc; with an appropriate backoff & timecurve, you would reread each item maybe 3 times in your lifetime (eg first after a delay of a month, then a year or two, then decades). Each item could come with a rating function where the user rates it as an important or oddseeming or incomplete item and to be exposed again in a few years, or as totally irrelevant and not to be shown again—as for many bits of idle chitchat, mundane emails, or intemperate comments is not an instant too soon! (More positively, anything already incorporated into an essay or otherwise reused likely doesn’t need to be resurfaced.)
This wouldn’t be the same as a spaced repetition system which is designed to recall an item as many times as necessary, at the brink of forgetting, to ensure you memorize it; in this case, the forgetting curve & memorization are irrelevant and indeed, the priority here is to try to eliminate as many irrelevant or useless items as possible from showing up again so that the review doesn’t waste time.
More specifically, you could imagine an interface somewhat like Mutt which reads in a list of email files (my local POP email archives downloaded from Gmail with getmail4
, filename IDs), chunks of IRC dialogue (a grep
of my IRC logs producing lines written by me ±10 lines for context, hashes for ID), LW/Reddit comments downloaded by either scraping or API via the BigQuery copy up to 2015, and stores IDs, review dates, and scores in a database. One would use it much like a SRS system, reading individual items for 10 or 20 minutes, and rating them, say, upvote (‘this could be useful someday, show me this ahead of schedule in the future’) / downvote (push this far off into the future) / delete (never show again). Items would appear on an expanding schedule. For example if one wanted to review items 4 times over the next 50 years (roughly my life expectancy), a schedule might be:
So in 1 day, then a third of a year, then after 5.8 years, then after 43 years. Alternately, a geometric series might be a bit kinder and not too frontloaded:
review < function(n, r, a) { a * (1  r^n) / (1  r) }
reviews < function(n, r, a) { sapply(1:n, function(nn) { review(nn, r, a) }) }
findR < function (firstReview=31, n_total=3, years=50) { optimize(interval=c(0, 1000),
f = function(r) { abs(sum(sapply(1:n_total,
function(n){review(n, a=firstReview, r=r)}))  (365*years)) })$minimum }
findR(firstReview=30, n_total=4, years=50)
# [1] 7.728823216
round(reviews(4, 7.728823216, 30))
# [1] 30 262 2054 15904
The geometric series allows for easy incorporation of rating modifications: a downvote penalty might multiply r by 1.5, vs 0.5 for upvotes. This would also allow some input from statistical algorithms which predict upvote/downvote/delete and advances/delays items based on that, which would hopefully quickly learn to avoid idle chitchat and short performative utterances and start to prioritize more interesting & unusual items. (For example, a good start might be a SVM on a bagofwords version of each item’s text, and then as the dataset ratings expand, more complicated algorithms could be plugged in.)
As far as I know, some todo/selfhelp systems have something like a periodic review of past stuff, and as I mentioned, spaced repetition systems do something somewhat similar to this idea of exponential revisits, but there’s nothing like this at the moment.
On the value of new statistical methods
Genetic correlation research is a hot area in 20162017: my WP article passed 400 references in May 2017. What is particularly interesting referencewise is that publications 20152017 make up around half of the results: so more genetic correlations calculated in the past 3 years than in the previous 80 years since first estimates were made somewhere in the 1930s or so.
For calculating them, there are 3 main methods.
 twin registry studies require twin phenotypic measurements which can usually be collected by mailed surveys and to analyze them one computes some Pearson’s r or uses a standard SEM with additional covariance paths (doable with Wright’s path analysis back in the 1930s by hand), scaling roughly linearly with sample size, having excellent statistical power at a few hundred twin pairs and capturing full heritabilities
 for GCTA, one requires full raw SNP data on 5000+ unrelated individuals at $100+ a sample, along with simultaneous phenotypic measurements of both traits and must use complicated custom software whose computation scales exponentially and can only examine a narrow subset of heritability
 for LDSC, one requires public summary polygenic scores but they can be from separate GWASes and calculated on traits individually, and the computational complexity is closer to linear than exponential; the penalty for not needing raw SNP data from twicemeasured individuals is that SNP costs double or more since multiple GWASes are used, and LDSC even more inefficient than GCTA, so you’ll need >10,000 individuals used in each polygenic score, and still need custom software.
In other words, the twin method is old, simple, requires small sample sizes, and easily obtained phenotypic measurements; while GCTA/LDSC is new, complicated, and requires expensive novel genetic sequencing data in huge sample sizes as well as the phenotypic measurements. So naturally LDSC gets used an order of magnitude more! Looking at the bibliography, we can guesstimate the rates at twin: 1 paper/year; GCTA (requiring raw data), 10/year; LDSC (public summary stats), 100/year.
Amazing the difference methods can make. It’s all about data access. For all its disadvantages, LDSC statistically works around the lack of individuallevel raw data and makes do with the data that gets publicly released because it is not seen to violate ‘privacy’ or ‘bioethics’, so any researcher can make use of the method on their particular dataset, while twin and GCTA require individuallevel data which is jealously guarded by the owners.
Methodologists and statisticians are probably seriously undervalued: a good new method can cause a revolution.
Bayesian power analysis: probability of exact replication
Psychologist Michael Kane mentions:
TFW a correlation of interest in a new study (n=355) replicates that from a prior study (n=182) to the second decimal (r=0.23). Winning!
Turning up the same correlation twice is somewhat surprising because random sampling error will vary substantially the empirical correlation from sample to sample, as reflected by the wide credible intervals around r with n=182355. How surprising is it? Is it too good to be true?
One approach would be to ask, if we generated bivariate samples of size n=355 with a fixed relationship of r=0.23, how often would the samples generate a rounded estimate of =0.23?
set.seed(20170728)
library('MASS')
powerSim < function (r_gen, n, r_test=NA) {
data < mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r_gen, r_gen, 1), nrow=2))
r_est = cor.test(data[,1], data[,2])$estimate
if (is.na(r_test)) { r_test < r_gen }
return(round(r_test, digits=2) == round(r_est, digits=2))
}
powerSims < function(r, n, r_test=NA, iters=100000) {
sim < replicate(iters, powerSim(r,n, r_test=r_test))
return(sum(sim) / length(sim))
}
powerSims(0.23, 355)
# [1] 0.07798
So around 8% of the samples.
This treats =r=0.23 as a parameter known with infinite precision, rather than an estimate (using around half the data) of the unknown parameter r; there would be considerable posterior uncertainty about what r is, and this will affect how often two samples would yield the same estimate—if the true r was, say, 0.10 (as is entirely possible), it would be highly unlikely for the second sample to yield =0.23 again, because the overestimation fluke would have to repeat itself twice to yield both =0.23.
To incorporate the uncertainty, we can feed in a simulated dataset exactly matching the description of n=182/r=0.23 to an uninformative Bayesian model, calculate a posterior distribution over r (which gives CIs of 0.090.37), and then draw from the posterior possible _r_s and run the original simulation asking how often we recover =0.23.
library(brms)
n1 = 182
n2 = 355
r1 = 0.23
data1 = mvrnorm(n=n1, mu=c(0, 0), Sigma=matrix(c(1, r1, r1, 1), nrow=2), empirical=TRUE)
colnames(data1) < c("x", "y")
b1 < brm(y ~ x, iter=20000, data=data1); summary(b1)
# ...PopulationLevel Effects:
# Estimate Est.Error l95% CI u95% CI Eff.Sample Rhat
# Intercept 0.00 0.07 0.14 0.14 40000 1
# x 0.23 0.07 0.09 0.37 40000 1
posteriorX < fixef(b1, summary=FALSE)[,2]; summary(posteriorX)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.08060884 0.18061570 0.23010870 0.22999820 0.27916800 0.55868700
replicates < sapply(posteriorX, function(r_post) { powerSim(r_post, n2, r_test=r1) })
summary(replicates); mean(replicates)
# Mode FALSE TRUE NA's
# logical 38262 1738 0
# [1] 0.04345
Around 4% of the time, reflecting the increased improbability of true values like r=0.09 or r=0.37 producing the specific sample estimate of =0.23
Thus, observing the same summary statistics in even relatively large samples is somewhat suspicious and might be a good reason to doublecheck other aspects of the code & data.
Expectations are not expected deviations and large number of variables are not large samples
If one has a large number of variables with a certain expectation, it is tempting to interpret the expectation or central limit theorem or law of large numbers as implying that the sum of a large number of variables or after a large number of timesteps, the observed sample value will be close or identical to the expected value. So for coinflipping, one knows that flipping 10 coins could easily yield a large deviation like a sum of 9 heads instead of the expected 5 heads, but one then thinks that after a million coin flips, the sum of heads will probably be 500,000. Another example of this mistake might be to make arguments about scientific research or charities: “charitable intervention X is affected by hundreds or thousands of different variables and the benefits or costs unfold over long time periods like decades or centuries; our best estimate of the mean value of interventions like X is that it is some small value Y; thus, by CLT etc, we can be sure that X’s ultimate value will be neither much bigger nor much smaller than Y but very close to Y, and, particularly, we can be sure that there are no interventions like X which could possibly turn out to have ultimate values which are orders of magnitude larger or smaller than Y, so we can rule out any such claims and we know the Value of Information is small.”
This is not wrong so much as misunderstood: one might call it a confusion of the variable’s distribution with the sampling distribution. The value only becomes closer in a relative sense; in an absolute sense, as more variables are added—without the absolute magnitude of each shrinking linearly—the actual deviation from the expectation simply becomes larger and larger.
As Wikipedia puts it in discussing how “diversification” works, it is a mistake to think that one ‘diversifies’ one’s investments by adding additional investments of the same size; for any variance reduction, the total investment must instead be split up among ever more different investments as many small investments:
In general, the presence of more assets in a portfolio leads to greater diversification benefits, as can be seen by considering portfolio variance as a function of , the number of assets. For example, if all assets’ returns are mutually uncorrelated and have identical variances , portfolio variance is minimized by holding all assets in the equal proportions .[Samuelson, Paul, “General Proof that Diversification Pays”, Journal of Financial and Quantitative Analysis 2, March 1967, 113.] Then the portfolio return’s variance equals = = , which is monotonically decreasing in .
The latter analysis can be adapted to show why adding uncorrelated volatile assets to a portfolio, [see Samuelson, Paul, “Risk and uncertainty: A fallacy of large numbers”, Scientia 98, 1963, 108113.] [Ross, Stephen, “Adding risks: Samuelson’s fallacy of large numbers revisited”, Journal of Financial and Quantitative Analysis 34, September 1999, 323339.] thereby increasing the portfolio’s size, is not diversification, which involves subdividing the portfolio among many smaller investments. In the case of adding investments, the portfolio’s return is instead of and the variance of the portfolio return if the assets are uncorrelated is which is increasing in n rather than decreasing. Thus, for example, when an insurance company adds more and more uncorrelated policies to its portfolio, this expansion does not itself represent diversification—the diversification occurs in the spreading of the insurance company’s risks over a large number of partowners of the company.
Samuelson’s 1963 “Risk and uncertainty: A fallacy of large numbers” opens by recounting an anecdote:
Is it true that an insurance company reduces its risk by doubling the number of ships it insures?
…a few years ago I offered some lunch colleagues to bet each $200 to $100 that the side of a coin they specified would not appear at the first toss. One distinguished scholar—who lays no claim to advanced mathematical skills—gave the following answer:
I won’t bet because I would feel the $100 loss more than the $200 gain. But I’ll take you on if you promise to let me make 100 such bets.
What was behind this interesting answer? He, and many others, have given something like the following explanation. “One toss is not enough to make it reasonably sure that the law of averages will turn out in my favor. But in a hundred tosses of a coin, the law of large numbers will make it a darn good bet. I am, so to speak, virtually sure to come out ahead in such a sequence, and that is why I accept the sequence while rejecting the single toss.”
4. Maximum Loss And Probable Loss.—What are we to think about this answer?
…Firstly, when an insurance company doubles the number of ships it insures, it does also double the range of its possible losses or gains. (This does not deny that it reduces the probability of its losses.) If at the same time that it doubles the pool of its risks, it doubles the number of its owners, it has indeed left the maximum possible loss per owner unchanged; but—and this is the germ of truth in the expression “there is safety in numbers”—the insurance company has now succeeded in reducing the probability of each loss; the gain to each owner now becomes a more certain one.
In short, it is not so much by adding new risks as by subdividing risks among more people that insurance companies reduce the risk of each. To see this, do not double or change at all the original number of ships insured by the company: but let each owner sell half his shares to each new owner. Then the risk of loss to each owner per dollar now in the company will have indeed been reduced.
Undoubtedly this is what my colleague really had in mind. In refusing a bet of $100 against $200, he should not then have specified a sequence of 100 such bets. That is adding risks. He should have asked to subdivide the risk and asked for a sequence of 100 bets, each of which was 100th as big (or $1 against $2).
In the insurance example, ships do not change their insured value (your liability) simply because you insure more of them, and they certainly do not halve in value simply because you have decided to take more insurance. If a ship is worth $1m with an expected profit of $10k and a SD of profits of $5k, when you insure 1000 ships, your profit is distributed as N(100010k, sqrt(10005) = N(1e+07, 70.7), and when you double it, now the distribution is N(200010k, sqrt(20005)) = N(2e+07, 100)—absolute size of your fluctuations has increased, not decreased. As a percentage, it has gone down indeed, but the absolute size has still gone up. Similarly, by claiming to prefer an investment of 100 bets and putting not $200 but $20,000 at risk, the colleague has raised the stakes greatly, and if the prospect of fluctuations of $100 unsettled his stomach before, he will enjoy less the SD of ~$1500 from the proposed 100 coinflips (95% quantiles of $2000 to $8000) and even a ~1% chance of a loss such as $400; he has reduced the risk of any loss, yes, and most of the outcomes are indeed relatively closer to the expectation than with just 1 coinflip, but he has inadvertently replaced a small SD of $150 with the much larger one of $1500, and of course, his worst case scenario has gotten much worse—with just one coin flip he could never have lost $400 or more, but now he can and occasionally would.
Another interesting example comes from CavalliSforza’s 1971 The Genetics of Human Populations, where he argues that a priori, differences between human races on complex traits (intelligence in this case), in the absence of selection, cannot exist solely based on the large number of genes contributing to them, by the law of large numbers:
In this context, it is worth mentioning that Jensen states that because the gene pools of whites and blacks are known to differ and “these genetic differences are manifested in virtually every anatomical, physiological, and biochemical comparison one can make between representative samples of identifiable racial groups” therefore “there is no reason to suppose that the brain should be exempt from this generalization.” There is, however, no reason why genes affecting IQ which differ in frequency in the gene pools of blacks and whites, should be such that, on the average, whites would have significantly higher frequencies of genes increasing IQ than would blacks. On the contrary, one should expect, assuming no tendency for high IQ genes to accumulate by selection in one or other race, that the more polymorphic genes there are that affect IQ and that differ in frequency in blacks and whites, the less likely it is that there is an average genetic difference in LQ between the races. This follows from that most basic law of statistics, the law of large numbers, which predicts increasing accuracy of a mean based on increasing numbers of observations (See Appendix I).
CavalliSforza is correct to note that, unless we wish to make (still) very controversial claims about differing selection, there is no apparent reason for any intelligencerelated allele to be systematically rarer in one population than another, although of course they will in practice differ slightly due to random chance (genetic drift, demographic bottlenecks etc), and that as this applies to all intelligencerelated alleles, we would predict that the expectation of all populations to be identical. He, however, commits the fallacy of large numbers when he then interprets the law of large numbers as guaranteeing that all populations will be identical, while in fact, they will be different, and the absolute size of the differences will increase—not decrease—“the more polymorphic genes there are that affect IQ”. Like with diversification or insurance or coinflipping, the desired disappearance of the variance only happens if each random variable (investment, insured vessel, coinflip) decreases proportional to the total number of random variables; for CavalliSforza’s argument to go through, it would need to be the case that every new IQ gene divvied up a fixed pie—but why would that be the case and how could that be known a priori? More pointedly, we could note that CavalliSforza’s argument proves too much because it is equally applicable within races too, and implies that there could be no differences of important magnitude between humans of the same race on highly polygenic traits (and if someone wanted to try to rescue the argument by claiming we should expect mean differences or some sort of selection, then amend “humans of the same race” to “siblings in the same family”!). CavalliSforza’s error is particularly striking since ch8 of the same book devotes extensive discussion, with many graphs, to how genetic drift will greatly differentiate populations over time, and takes pains to point out that the expectation of gene frequencies p is merely the expectation, and the actual frequency will diverge arbitrarily far in a random walk and over a long enough time (connected to the population size) will eventually reach fixation at either p=1 or p=0, and certainly not converge exactly on the original p.^{11} Indeed, given the a posteriori measured extent of average differences in allele frequency of ~0.12 and the assumption of no selection, Chuck flips CavalliSforza’s argument on its head and points out that the standard genetic drift framework (Leinonen et al 2013) dating back to Wright implies that a difference of 0.12 yields large racial differences in polygenic traits! Clearly, CavalliSforza’s argument does not prove what one would like it to.
The sum is not the mean: there is a difference between flipping 1 coin a thousand or a million times, the sum of a thousand or a million coins flipped 1 time, and the mean of a thousand or a million sums of a million coins; the distribution of the mean does indeed converge tightly, but the distribution of the sum just gets broader and broader. The expectation or mean is just the best estimate one can make over a large number of samples under a particular loss. But the longrun mean of many samples is not the same thing as the expected deviation of a single sample.
One might be dealing with something like a Cauchy distribution where there is not a mean in the first place. But more importantly, just because the expectation is a certain number like 0, doesn’t mean any specific realization will be 0, and indeed the expectation may actually be an impossible value. (Imagine a variable X which is the sum of an odd number of 1 or +1 variables; the expectation of this X is, of course, 0, however, one will never actually observe a sample of X to be 0, because the parity means there will always be a 1 or +1 ‘left over’ inside the sum and so X will always be either 1 or +1. To reach the expectation of 0, one would have to create many _X_s and average them, and the more _X_s one draws and averages, the closer the average will be to 0.)
For Gaussians, the singlesample sum of a set of Gaussian variables is the sum of their means with a variance equal to the sum of their variances (); if we sampled repeatedly and averaged, then we would indeed converge on the expected mean, but the expected deviation from the mean of a single sample is governed by the variance which can be extremely large. The increasing spread means that it would be extremely surprising to get exactly, or even near, the expectation. An example Monte Carlo of the sum of increasing numbers of deviates demonstrates the absolute deviation increases as we go from sum of 100 variables to 10,000 variables:
round(sapply(seq(100,10000,by=100), function(x) { mean(replicate(10000, abs(sum(rnorm(x))))) } ))
# [1] 8 11 14 16 18 20 21 23 24 25 26 27 29 30 31 32 33 34 35 35 37 38 38 39 40 41 41 42 43 44 45 45 46 46 47 48 48 49 50 50 51 52 52
# [44] 53 53 54 55 56 56 56 57 58 58 58 59 59 61 61 62 62 62 62 64 63 64 65 65 65 67 66 68 67 68 69 70 70 70 70 71 71 72 72 73 73 74 75
# [87] 74 75 75 75 76 76 77 78 78 78 78 80 80 80
Or consider a 1D random walk. The best estimate one can make under total ignorance of where the walker is, is to guess, regardless of how many n steps it has made, that the walker is at the origin: 0. If we run many random walkers for n steps each and ask what the best prediction of the mean is, we would be right in saying it’s 0. We would not be right in saying that the walkers have not moved far or that we would expect them all to ‘converge’ and be at or at least very near 0. However, if one asks, what is the expected distance from the origin after n steps, the answer turns out to be ie. the more steps taken, the further we expect to find the walker, even if we cannot predict in what direction it has gone on average. Similarly, for a Gaussian random walk, we find that after n steps the walker will average a distance of from the origin (and possibly much further).
Further, should there be any absorbing states, we may find that our walkers will not be at their expectation but at the absorber—an example being Gambler’s ruin where repeatedly taking +EV bets can guarantee eventual $0/bankruptcy, or the infamous martingale where the expectation of each bet may be ~0 but the martingale player eventually is either bankrupt or profitable, never exactly even. Another analogy might be stockpicking: the expected value of each stock is about the same due to efficient markets, and if one indexes and waits a long time, one will likely get the fabled average return of ~7%; but if one buys a single individual stock and waits a long time, the return will likely be either 100% or >>7%.
In any process or scenario in which we are dealing with large numbers of variables which sum to produce a final result, even if each of those variables is neutral, a single sample will be potentially arbitrarily absolutely far from the expectation in a way which will surprise someone who believes that the law of large numbers ensures that the result must be extremely close to the expectation both relatively and absolutely.
Oh Deer: Could Deer Evolve to Avoid Car Accidents?
I’ve noticed while driving many deer corpses over the years. Cars seem like they could be a major source of deer mortality. If they are, deer might be evolving behavior to avoid cars. But deer/car accident rates appear stable or increasing (perhaps due to human population growth & construction). How fast would we expect to see any deer adaptation?
Looking at some of the mortality statistics, I model it as a liability threshold trait being selected on via truncation selection, and calculate some hypotheticals about whether and how fast they could adapt.
Teal deer: of course, but it’d be slow.
While driving to NYC recently I passed 3 roadkill deer, a few of many I have seen over the years, and a thought reoccurred to me: “if all these deer are being killed by cars, shouldn’t they be evolving to avoid cars?” I’ve seen many dead deer and narrowly avoided a few myself while driving, and deer/car mortality is, if anything, much higher in states like Pennsylvania.
Accident rates would not necessarily show a steep decline thanks to past selection, because the ‘environment’ is not static here: as cars get faster, accidents become more crippling or lethal to deer; the American population has expanded severalfold both in population count, percapita vehicle miles, suburban living, territory fragmentation, and the deer population too has expanded manyfold (from ~0.5m a century ago to <30m now). So if there was highly effective ongoing selection reducing deer accident risk, we would still observe large absolute and proportional increases in accidents/deaths.
But I am still curious as to what sort of selection we could expect, which is a hint as to longterm trends—the American population is now relatively stabilized in terms of growth and vehiclemiles, and deer appear to have also reached a population equilibrium, so a gradual longterm decline in accident rates might be possible to see in the future if there is substantial response to selection.
Deer accidents seem to be fairly fatal: wild animals are always on the edge, and small injuries can compound into death, so looking at mortality will if anything underestimate the strength of selection. And of course there will be no single Mendelian genes, but being a complex behavioral trait, it is almost surely a highly polygenic additive trait. So we can treat it as truncation selection on a binary trait (“killed by a car”) in the liability threshold model.
For truncation selection, the two key parameters are the heritability of a ‘trait’, and the fraction of the population expressing the ‘trait’.
The heritability can only be guessed at. Car accidents ought to be heritable to some degree, because everything is, and many behavioral traits like risk aversion or diurnality or reactiontime or startle reflex or wanderlust would affect being hit by a car (a deer can reduce risk by avoiding roads entirely, not traveling far, waiting to cross until very late at night when there is no traffic or during the day when they can be easily seen). Response to selection need not cause some hardwired behavioral change like aversion to travel: it might yield something like the Baldwin effect, where the response is for behavioral flexibility, and more fit deer are better able to learn how to navigate traffic gaps by watching other deer or imitating their mother. The “anthropocene” has led to many animals evolving or otherwise learning how to adapt, with urban environments no exception^{12}, so why would deer be any exception?
Complicating things is the possibility that the heritability is high but actual responses to selection are lower than expected when estimated in a univariate singletrait fashion, because there might be a genetic correlation with another fitnessinfluencing trait, where better car avoidance means worse values of that other trait—perhaps it would be easy to avoid cars by avoiding roads or traveling far, but this has the drawback of preventing relocation to avoid starvation or hunters, in which case response to selection will be small or even opposite of expected (this is plausibly one of the main reasons why wild populations may not evolve as fast as predicted: “The Missing Response to Selection in the Wild”, Pujol et al 2018).
I can only guess at the heritability, as I doubt heritabilities have been calculated for much in deer, but I would predict it’s <50% simply because wild animals are under constant selection and carbased selection would’ve started almost a century ago. It might seem impossible to calculate heritability for wild animals of a ‘trait’ like being hit by a car, but I think it’s doable. Wild animals have no twin studies or family pedigrees, of course, and methods like commongarden rearing likewise seem questionable, but one use tracking devices to follow families of deer until they all die to construct a pedigree with outcome of death; more feasibly, one could collect DNA samples from dead caraccident deer and dead nonaccident deer, and compute genomic similarity with a procedure like GCTA (for SNP heritability) or wholegenome sequencing data (upcoming methods for recovering full heritability). But this hasn’t been done as far as I know. Oh well. We can look at a range of heritabilities 050%.
The fraction of deer hit is a little easier. Wild deer live ~34y on average (eg Lopez et al 2004 on Key deer implies ~3.7y), sexually maturing ~1.5y, and of the ~25m deer in the USA, around 1.5m are killed by cars annually ~2012 (according to The Atlantic with no citation) so perhaps ~5% annual mortality from cars (McShea et al 2008 estimates 2% in a sampled Virginia county, while McShea 2012 suggests 1.2m deaths out of 25m deer); if deer live 4 years and have a 5% annual risk of being killed by a car, then their lifetime risk should be 1  0.95^4 = 18%
or perhaps 8%, which sounds like a reasonable range—a substantial source of mortality but probably less than hunting or starvation or disease.
The effect of a generation of truncation selection on a binary trait following the liabilitythreshold model is more complicated but follows a similar spirit. R implementation of pg6 of “Chapter 14: Shortterm Changes in the Mean: 2. Truncation and Threshold Selection”, Lynch & Walsh:
threshold_select < function(fraction_0, heritability, verbose=FALSE) {
fraction_probit_0 = qnorm(fraction_0)
## threshold for not manifesting schizophrenia:
s_0 = dnorm(fraction_probit_0) / fraction_0
## new rate of trait after one selection where 100% of pastthethreshold never reproduce:
fraction_probit_1 = fraction_probit_0 + heritability * s_0
fraction_1 = pnorm(fraction_probit_1)
## how much did we reduce trait in percentage terms?
if (verbose) {
print(paste0("Start: population fraction: ", fraction_0, "; liability threshold: ", fraction_probit_0, "; Selection intensity: ", s_0))
print(paste0("End: liability threshold: ", fraction_probit_1, "; population fraction: ", fraction_1, "; Total population reduction: ",
fraction_0  fraction_1, "; Percentage reduction: ", (1((1fraction_1) / (1fraction_0)))*100)) }
return(c(fraction_probit_1, fraction_1, fraction_0  fraction_1))
}
threshold_select(10.18, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.82; liability threshold: 0.915365087842814; Selection intensity: 0.320000021339773"
# [1] "End: liability threshold: 1.0753650985127; population fraction: 0.858894349391959; Total population reduction: 0.0388943493919587; Percentage reduction: 21.6079718844215"
# [1] 1.07536509851 0.85889434939 0.03889434939
threshold_select(10.08, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.92; liability threshold: 1.40507156030963; Selection intensity: 0.161593724197447"
# [1] "End: liability threshold: 1.48586842240836; population fraction: 0.931343036233605; Total population reduction: 0.0113430362336053; Percentage reduction: 14.1787952920067"
# [1] 1.48586842241 0.93134303623 0.01134303623
We can look at a range of scenarios for population prevalences 818%, and heritabilities 5%50%. Aside from the pergeneration increase in caravoiding deer/decrease in deercar accidents, it might also be interesting to calculate a hypothetical like “how many generations/years would it take for natural selection, in a static environment, to reduce lifetime mortality to 1%?”
thresholdModeling < function(fraction_0, heritability, targetPercentage) {
firstGeneration < threshold_select(fraction_0, heritability)
## estimate how many generations of truncation selection until a target percentage is reached:
i < 1; fraction_i < firstGeneration[2]
while (fraction_i < targetPercentage) {
i < i+1
nextGeneration < threshold_select(fraction_i, heritability)
fraction_i < nextGeneration[2]
}
return(c(firstGeneration, i)) }
thresholdModeling(10.20, 0.5, 10.01)
df < expand.grid(Fraction=(1seq(0.08, 0.18, by=0.02)), Heritability=seq(0.05, 0.50, by=0.05))
df < cbind(df, round(digits=3, do.call(rbind, Map(thresholdModeling, df$Fraction, df$Heritability, 0.99))))
colnames(df)[3:6] < c("Threshold.latent", "Fraction.new", "Fraction.reduction", "Generations.to.onepercent")
df$Years < round(df$Generations.to.onepercent * 3.5)
df
Fraction  Heritability  Latent threshold  Fraction 2nd  Fraction reduction  Generations to 1%  Years 

0.92  0.05  1.413  0.921  0.001  300  1050 
0.90  0.05  1.291  0.902  0.002  314  1099 
0.88  0.05  1.186  0.882  0.002  324  1134 
0.86  0.05  1.093  0.863  0.003  332  1162 
0.84  0.05  1.009  0.843  0.003  338  1183 
0.82  0.05  0.931  0.824  0.004  343  1200 
0.92  0.10  1.421  0.922  0.002  150  525 
0.90  0.10  1.301  0.903  0.003  157  550 
0.88  0.10  1.198  0.884  0.004  162  567 
0.86  0.10  1.106  0.866  0.006  166  581 
0.84  0.10  1.023  0.847  0.007  169  592 
0.82  0.10  0.947  0.828  0.008  171  598 
0.92  0.15  1.429  0.924  0.004  100  350 
0.90  0.15  1.311  0.905  0.005  104  364 
0.88  0.15  1.209  0.887  0.007  108  378 
0.86  0.15  1.119  0.868  0.008  110  385 
0.84  0.15  1.038  0.850  0.010  112  392 
0.82  0.15  0.963  0.832  0.012  114  399 
0.92  0.20  1.437  0.925  0.005  75  262 
0.90  0.20  1.321  0.907  0.007  78  273 
0.88  0.20  1.220  0.889  0.009  81  284 
0.86  0.20  1.132  0.871  0.011  82  287 
0.84  0.20  1.052  0.854  0.014  84  294 
0.82  0.20  0.979  0.836  0.016  85  298 
0.92  0.25  1.445  0.926  0.006  60  210 
0.90  0.25  1.330  0.908  0.008  62  217 
0.88  0.25  1.232  0.891  0.011  64  224 
0.86  0.25  1.145  0.874  0.014  66  231 
0.84  0.25  1.067  0.857  0.017  67  234 
0.82  0.25  0.995  0.840  0.020  68  238 
0.92  0.30  1.454  0.927  0.007  50  175 
0.90  0.30  1.340  0.910  0.010  52  182 
0.88  0.30  1.243  0.893  0.013  54  189 
0.86  0.30  1.158  0.877  0.017  55  192 
0.84  0.30  1.081  0.860  0.020  56  196 
0.82  0.30  1.011  0.844  0.024  57  200 
0.92  0.35  1.462  0.928  0.008  43  150 
0.90  0.35  1.350  0.911  0.011  44  154 
0.88  0.35  1.255  0.895  0.015  46  161 
0.86  0.35  1.171  0.879  0.019  47  164 
0.84  0.35  1.096  0.863  0.023  48  168 
0.82  0.35  1.027  0.848  0.028  48  168 
0.92  0.40  1.470  0.929  0.009  37  130 
0.90  0.40  1.360  0.913  0.013  39  136 
0.88  0.40  1.266  0.897  0.017  40  140 
0.86  0.40  1.184  0.882  0.022  41  144 
0.84  0.40  1.110  0.867  0.027  42  147 
0.82  0.40  1.043  0.852  0.032  42  147 
0.92  0.45  1.478  0.930  0.010  33  116 
0.90  0.45  1.369  0.915  0.015  34  119 
0.88  0.45  1.277  0.899  0.019  35  122 
0.86  0.45  1.197  0.884  0.024  36  126 
0.84  0.45  1.125  0.870  0.030  37  130 
0.82  0.45  1.059  0.855  0.035  37  130 
0.92  0.50  1.486  0.931  0.011  30  105 
0.90  0.50  1.379  0.916  0.016  31  108 
0.88  0.50  1.289  0.901  0.021  32  112 
0.86  0.50  1.210  0.887  0.027  33  116 
0.84  0.50  1.139  0.873  0.033  33  116 
0.82  0.50  1.075  0.859  0.039  34  119 
As expected, scenarios where deer accidents are rare achieve 1% faster (less work to do), and higher heritabilities produce faster responses (more useful genes). In most of the scenarios, the annual percent reduction is small, no often less than a percentage point, so would be easy to miss, and the almost complete elimination of deercar accidents would take centuries of adaptation even with a fixed environment.
So while deer might be evolving to reduce accident mortality, it would be hard to see and won’t help much anytime soon. (For the same reasons, I would expect squirrelbased power outages to continue indefinitely.) So if it’s a problem, one’d better hurry up with safety measures like road fencing or selfdriving cars.
Evolution as Backstop for Reinforcement Learning
One defense of free markets notes the inability of nonmarket mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest a group selection perspective: one reason for free market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on groundtruth values, and serve as a constraint on the more sophisticated nonmarket mechanisms. I illustrate by discussing corporations, multicellular life, and reinforcement learning & metalearning in AI. This view suggests that are inherent balances between market/nonmarket mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.
Split out to “Evolution as Backstop for Reinforcement Learning”.
Acne: a good Quantified Self topic
I suggest that teenagers interested in experimentation, statistics, or Quantified Self experiment with interventions to reduce acne, listing the advantages of the topic. To suggest specific interventions, I reanalyze & rank the April 2016 CureTogether crowdsourced ratings of ~113 acne interventions; dietary interventions rate particularly highly after standard retinoids (like Accutane) & benzoyl peroxide and might be worth closer investigation.
Quantified Self acne selfexperiments are underdone and potentially valuable.
I kind of regret getting into QS and statistics only long after my acne problems largely subsided. The conventional wisdom on acne seems poorly supported and neglectful of individual differences, and I’ve long suspected that some rigorous testing might turn up some interesting results which could be useful to Western teenagers. (My acne wasn’t so bad, in retrospect, but it still frustrated me a great deal; in extreme cases, it can contribute to suicide.)
Looking back, I can see how easy it would be to test the various theories. For example, facial washes & light boxes could be tested efficiently by blocking—randomizing which half of your face they are applied to. Diet too would be useful to do as a factorial experiment—for 3 or 4 weeks, cut out all the carbs while loading up on dairy products, then vice versa. As a topic for selfexperimentation, acne has many advantages:
many highlymotivated subjects with ample free time
objective visible effects where the data can even be easily recorded for blind ratings by third parties to reduce bias (ie facial photographs)
factorial experiments or blocking withinsubject is possible for some acne interventions, which would considerably increase power
in severe acne cases, changes would be easy to see within days/weeks, and large effects are plausible
a large number of alreadyknown possible interventions none of which are particularly strongly supported (so high VoI); a good starting point would be the CureTogether crowdsourced rankings (see next section for specifics).
 most of the interventions are safe & easy to implement: it’s not hard to use a face wash twice a day, apply some benzoyl peroxide cream, cut out all dairy products, etc
new commercial products allow for interesting hobbyist projects—what would microbiome sampling show about the microbial communities of teens with high acne vs low acne? About their microbial communities before the high acne group developed acne? After treatment with benzoyl peroxide or antibiotics or retinol? etc. With a large enough community dataset, interesting techniques could be applied for measuring heterogeneity in individual response to treatments (eg GCTAstyle variance components on acne response could be enlightening, as genes would be expected to be the source of much of the individual differences, acne itself being heritable)
a launching pad into many areas of science & statistics—relevant topics include human & microbial genetics, methodological biases, regression to the mean, optimal experimental design, sequential testing, metaanalysis, timeseries, deep learning… (“When making an axe handle with an axe, the model is near at hand.”)
One could spin off lots of projects to increase rigor—instead of counting by hand, you could take smartphone photos and feed them into a pimplecounting CNN. With all the frameworks like PyTorch now, that’s well within the ability of a bright computerinclined high schooler. (And then put it up as a web service so other teens can run their own selfexperiments & score their pimple photographs, why not…)
Indeed, Seth Roberts, one of the early Quantified Self proponents, credits his interest in selfexperiments to acne (Roberts 2001/Roberts 2010), when he found tetracycline didn’t help his acne but benzoyl peroxide did:
My interest in selfexperimentation began when I read an article about teaching mathematics by Paul Halmos, a professor at Indiana University. Halmos emphasized that “the best way to learn is to do.” I was trying to learn how to do experiments; I took this advice to mean I should do as many as possible. I could do more experiments, I realized, if I not only did rat experiments but also did experiments with myself as the subject. So I started doing small selfexperiments. Most of them were trivial and led nowhere (e.g., experiments about juggling). At the time I had acne. My dermatologist had prescribed both pills (tetracycline, a widespectrum antibiotic) and a cream (active ingredient benzoyl peroxide). Simply for the sake of doing experiments, any experiments, I did simple tests to measure the effectiveness of these treatments. I believed the pills were powerful and the cream had little effect. To my great surprise, the tests showed the opposite: The cream was powerful and the pills had little effect. It was very useful information. Many years later, an article in the British Journal of Dermatology reported that antibioticresistant acne is common.
Acne came up several times on Roberts’s blog:
Roberts has suggested that teens could run acne selfexperiments as a group project: “Acne Clubs”
A possible general cause: glycemic index/insulin signaling
Anecdotes:
 oolong/black tea allergy as acne trigger
 pasteurized dairy trigger: 1, 2
 sugar/carb triggers: 1, 2
 soap trigger: 1; lack of soap trigger: 2
 acne reduction due to milk thistle
CureTogether acne interventions
CureTogether (20082012) was a social network/crowdsourced website for aggregating & ranking treatments of various problems, typically diseases, and had an acne page. They were bought by 23andMe in 2012, and operated for a while until the website seems to’ve vanished around mid2016 & the data appears to no longer be available anywhere. The last Internet Archive snapshot of their acne page is 20160401.
Their presentation of uncertainty & average ranking of the interventions is useless; but they do provide the totals for each of the rating levels, so I decided to fix it by extracting & reanalyzing the ratings in a multilevel Bayesian ordinal regression with a weakly informative prior to get more useful posterior estimates of ratings. (While I was at it, I fixed some spelling errors and merged a few interventions which were the same, like “Accutane” and “Isotretinoin (Roaccutane, Accutane)”.)
## https://web.archive.org/web/20160401173234/http://curetogether.com:80/acne/treatments/
## "much worse / slightly worse / no effect / moderate improvement / major improvement" ~> 15
acne < read.csv("https://www.gwern.net/docs/biology/20160401curetogetheracne.csv",
header=TRUE, colClasses=c("integer", "integer", "factor"))
library(skimr)
skim(acne)
# Skim summary statistics
# n obs: 439
# n variables: 3
#
# ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n n_unique top_counts ordered
# Intervention 0 439 439 113 Acn: 5, Agi: 5, Amo: 5, Ant: 5 FALSE
#
# ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# variable missing complete n mean sd p0 p25 p50 p75 p100 hist
# Effect 0 439 439 3.16 1.33 1 2 3 4 5 ▅▆▁▇▁▇▁▆
# N 0 439 439 18.38 34.74 1 2 6 19 324 ▇▁▁▁▁▁▁▁
library(brms)
b < brm(Effect  weights(N) ~ (1Intervention), prior=c(prior(student_t(3, 0, 1), "sd")), family=cumulative(),
iter=5000, chains=30, cores=30, data=acne)
b
# GroupLevel Effects:
# ~Intervention (Number of levels: 113)
# Estimate Est.Error l95% CI u95% CI Eff.Sample Rhat
# sd(Intercept) 0.64 0.06 0.53 0.76 16503 1.00
#
# PopulationLevel Effects:
# Estimate Est.Error l95% CI u95% CI Eff.Sample Rhat
# Intercept[1] 3.44 0.09 3.62 3.26 20848 1.00
# Intercept[2] 2.17 0.08 2.32 2.02 17179 1.00
# Intercept[3] 0.49 0.07 0.35 0.64 15767 1.00
# Intercept[4] 2.75 0.08 2.58 2.91 19558 1.00
red < as.data.frame(ranef(b)$Intervention)
round(red[order(red$Estimate.Intercept, decreasing=TRUE),], digits=2)
## the ordinal model forest plot doesn't make much intuitive sense, so redo as a normaldistribution
## for easier forest plotting (the rankings are more or less identical, perhaps a little less precise):
bg < brm(Effect  weights(N) ~ (1Intervention), prior=c(prior(student_t(3, 0, 1), "sd")), family=gaussian(),
iter=5000, chains=30, cores=30, data=acne)
## Extract & sort estimates:
coefs < as.data.frame(coef(bg))
coefs[order(coefs$Intervention.Estimate.Intercept, decreasing=TRUE),]
## Plot estimates:
library(brmstools)
forest(bg)
Intervention  Estimate  SE  2.5%  97.5% 

Isotretinoin (Roaccutane, Accutane)  4.04  0.06  3.91  4.16 
Paleo Diet  3.75  0.08  3.58  3.92 
No gluten  3.67  0.07  3.51  3.82 
Azelaic acid (Azelex)  3.65  0.16  3.32  3.99 
Bactrim (Trimethoprim/sulfamethoxazole)  3.62  0.12  3.36  3.87 
Ketogenic Diet  3.59  0.18  3.23  3.95 
Clindamyacin phosphate gel  3.58  0.13  3.31  3.85 
Glycolic acid  3.56  0.21  3.14  3.99 
Ziana (Clindamycin/tretinoin)  3.56  0.21  3.14  3.99 
Amoxicillin  3.56  0.14  3.27  3.85 
No sugar  3.55  0.07  3.41  3.70 
No dairy  3.52  0.06  3.40  3.64 
Diane35 (cyproterone/ethinyl estradiol)  3.52  0.12  3.27  3.77 
Epiduo Gel  3.51  0.15  3.20  3.82 
having a good skincare routine  3.49  0.17  3.16  3.83 
eliminating comedogenic ingredients…  3.49  0.21  3.07  3.93 
Sit out in the sun…at least 15 minutes  3.49  0.05  3.38  3.60 
Avoid touching face  3.49  0.03  3.41  3.57 
Benzoyl peroxide 10%  3.49  0.09  3.30  3.67 
Ole Henriksen products  3.47  0.18  3.10  3.84 
No chocolate  3.45  0.22  3.02  3.89 
Doryx (Doxycycline)  3.45  0.07  3.30  3.59 
foot bath with baking soda  3.45  0.24  2.98  3.93 
Bikram Yoga  3.45  0.24  2.97  3.93 
Tretinoin  3.45  0.24  2.97  3.93 
Retin A  3.45  0.05  3.33  3.56 
Birth control pill / Balance hormones  3.44  0.05  3.33  3.55 
Zinc soap  3.43  0.20  3.04  3.84 
Washing face  3.43  0.03  3.36  3.49 
No fast food  3.42  0.06  3.30  3.55 
PhotoDynamic Therapy  3.42  0.15  3.11  3.73 
Exuviance Vespera Serum  3.41  0.23  2.96  3.87 
Helminthic therapy  3.41  0.23  2.96  3.87 
Clindamycin 1% dabber  3.41  0.07  3.25  3.56 
White vinegar  3.40  0.17  3.06  3.75 
Washing pillowcases regularly  3.40  0.05  3.29  3.50 
AcZone  3.39  0.20  2.99  3.81 
Tazorac  3.39  0.16  3.07  3.71 
Tetracycline  3.39  0.05  3.29  3.49 
Zinc cream  3.39  0.15  3.09  3.69 
Benzaclin (Benzoyl peroxide/clindamycin)  3.39  0.09  3.19  3.58 
Benzoyl peroxide 2.5%  3.38  0.03  3.30  3.46 
Aveeno Skin Brightening Daily Scrub  3.38  0.22  2.94  3.82 
Metrogel  3.38  0.17  3.03  3.72 
Retinol Oral Supplementation  3.37  0.20  2.97  3.77 
Duac Gel  3.37  0.10  3.16  3.57 
avoid spicy food  3.35  0.19  2.97  3.74 
Placing clean towel on pillow each night  3.35  0.11  3.13  3.56 
Pantothenic acid  3.35  0.13  3.09  3.61 
Antibiotic cream  3.33  0.08  3.18  3.49 
Rosac  3.33  0.19  2.96  3.70 
Sulfur Powder  3.33  0.14  3.05  3.61 
Mychelle clear skin serum  3.33  0.17  2.98  3.68 
Mario Bedescu Drying Lotion  3.32  0.14  3.03  3.61 
Drink a lot of water  3.31  0.13  3.05  3.57 
Acnetrex  3.31  0.22  2.87  3.74 
Spironolactone  3.30  0.16  2.97  3.63 
Blemish potion  3.30  0.13  3.03  3.57 
Dr. Hauschka Natural Skin Care  3.30  0.14  3.02  3.58 
Erythromycin topical solution  3.30  0.10  3.10  3.50 
Using fresh aloe vera leaves on skin  3.30  0.11  3.07  3.52 
Evening Primrose Oil  3.29  0.21  2.86  3.71 
Salicylic acid  3.29  0.04  3.20  3.37 
Vaccination with individually developed vaccine^{13}  3.28  0.21  2.86  3.70 
StieviaA Cream 0.025%  3.28  0.17  2.94  3.63 
Vicco Turmeric Skin Cream  3.27  0.24  2.80  3.75 
Dr. Andrew Weil for Origins  3.27  0.17  2.93  3.61 
Cleanance K by Avene  3.27  0.14  2.99  3.55 
OBAGI  3.27  0.15  2.96  3.58 
Nacetylcysteine  3.27  0.20  2.86  3.67 
Differin Gel  3.26  0.12  3.01  3.50 
Avoiding pork  3.25  0.16  2.92  3.57 
Chemicalfree, vegan cosmetics / personal care products  3.25  0.07  3.09  3.40 
Washing face with baking soda  3.24  0.19  2.85  3.62 
Gamma Linolenic Acid  3.23  0.22  2.79  3.67 
Oil Cleansing Method  3.23  0.10  3.02  3.44 
Acne Free  3.22  0.10  3.01  3.43 
African Black Soap  3.22  0.14  2.94  3.50 
Aloe Vera Soap  3.21  0.15  2.91  3.51 
Aromatherapy  3.21  0.16  2.89  3.53 
Urine therapy  3.20  0.20  2.79  3.61 
Hydrocortisone cream  3.20  0.14  2.91  3.49 
No eggs  3.20  0.12  2.95  3.45 
Air Purifier  3.19  0.14  2.91  3.47 
Aveda Enbrightenment Wash  3.18  0.19  2.79  3.56 
Vegan diet  3.18  0.08  3.01  3.35 
Cetaphil for sensitive skin  3.18  0.06  3.05  3.30 
Vitamin D  3.17  0.06  3.05  3.29 
Zinc  3.16  0.07  3.02  3.30 
Bert’s Bees blemish stick  3.15  0.18  2.80  3.51 
Jojoba Oil  3.13  0.20  2.73  3.52 
Purpose Gentle Cleansing Bar  3.12  0.11  2.89  3.34 
Honey  3.11  0.08  2.94  3.29 
Bert’s Bees Cleansing Milk  3.10  0.17  2.75  3.45 
Tea tree oil  3.10  0.05  2.99  3.20 
Aging (improvement after around age 19)  3.08  0.05  2.98  3.18 
Alpha Hydroxy Acid  3.05  0.12  2.80  3.30 
Flax oil  3.05  0.10  2.84  3.25 
Exercise  3.04  0.05  2.93  3.15 
Pregnancy  3.04  0.15  2.73  3.35 
St. Ives Apricot Scrub  3.04  0.06  2.92  3.16 
Apple cider vinegar tonic  3.03  0.08  2.86  3.21 
Washing face with water only  3.03  0.06  2.90  3.16 
Proactiv  3.03  0.05  2.92  3.13 
Toothpaste on acne  3.01  0.06  2.89  3.13 
Rubbing Alcohol  2.99  0.07  2.85  3.13 
Hydrogen peroxide  2.98  0.07  2.83  3.13 
Bar soap  2.96  0.05  2.85  3.07 
Sauna  2.96  0.11  2.73  3.19 
Murad AcneComplex  2.93  0.10  2.72  3.14 
Clearasil Stayclear  2.91  0.06  2.79  3.02 
Emu oil  2.81  0.18  2.44  3.17 
Not washing face at all  2.75  0.16  2.43  3.06 
One thing which would be nice to do with the CureTogether ratings is examine the clustering of interventions. Some of these interventions are doubtless the same thing and should be merged together; others are different but may act through the same mechanisms and so could be considered about the same thing too. Doing clustering or factor analysis might pop up a handful of major ‘approaches’, like ‘antibiotics’ vs ‘cleaning’ vs ‘dietary interventions’, and that would make the available options much easier to understand.
More interestingly, interventions almost certainly correlate with each other & predict success of each other, and this could be used to provide flowcharts of advice along the lines of “if X didn’t work for you, then Y might and Z probably won’t”. Acne treatment could be considered as a Partially Observable Markov Decision Problem where the effectiveness of each treatment is unknown but has a prior probability based on the global ratings, and one treatment’s result informs the posterior probability of success of the untried treatments; a POMDP can be solved to give an optimal sequence of treatments to try. (I’ve done a simple example of this for cats and cat stimulants like catnip.) And this approach also immediately gives a principled way for users to collectively experiment by posterior sampling (like Thompson sampling, but one solves the MDP which is defined after sampling a possible parameter value from each parameter’s posterior & treating it as the true value). Unfortunately, that requires each CureTogether rater’s individual data, to examine correlations withinindividual of ratings, and the web interface doesn’t provide that, and since CureTogether has been acquired & disappeared, there’s no one to ask for the raw data now.
Fermi calculations
A short discussion of “Fermi calculations”: quickanddirty approximate answers to quantitative questions which prize cleverness in exploiting implications of common knowledge or basic principles in given reasonable answers to apparently unanswerable questions. Links to discussions of Fermi estimates, and a list of some Fermi estimates I’ve done.
I really like Fermi problems (LessWrong)—it’s like dimensional analysis for everything outside of physics^{14}.
Not only are they fun to think about, they can be amazingly accurate, and are extremely cheap to do—because they are so easy, you do them in all sorts of situations you wouldn’t do a ‘real’ estimate for, and are a fun part of a physics education. The common distaste for them baffles me; even if you never work through Hubbard’s How to Measure Anything (some strategies) or StreetFighting Mathematics or read Douglas Hofstadter’s 1982 essay “On Number Numbness” (collected in Metamagical Themas), it’s something you can teach yourself by asking, what information is public available, what can I compare this too, how can I put various boundaries around the true answers^{15} You especially want to do Fermi calculations in areas where the data is unavailable; I wind up pondering such areas frequently:
 is a lipreading website a good idea?
 how many women dye their hair blonde
 how many people does Folding@home kill
 what’s the entropy of natural language
 or how big a computer is needed to compute the universe
 do men have shorter real lives than women
 how much do Girl Scouts cookies cost and earn them
 how many people have used modafinil, and what is the cheapest we can expect to find modafinil for
 quickly assessing the rough probability of some event as I made one of my thousands of predictions
 how many sellers are there on Silk Road 1
 checking whether posthumous organ donation justifies the Chinese justice system (no)
 how many tourists to the Egyptian pyramids there have been compared to the workers who build them
 what we can infer from adultery & false paternity rates
 how many times have video gamers killed Mario
 is tiger skin more profitable than tiger bone wine?
 plating Versailles with gold
 the number of people killed by falling pianos
 the plausibility of _Oreimo’s Kirino Kosaka character
 how many integers one to a million do you see over a lifetime? (R simulation + Benford’s law)
 how many anecdotes of CrossFit causing rhabdomyolysis does it take to justify “CrossFit causes rhabdomyolysis”?
 how long would it take to read every book in existence?
 what would it cost to replace the US nuclear arsenal?
 What are the lifetime odds of being pooped on by a bird?
An entire “estimation” subreddit is devoted to working through questions like these (it can be quite fun), and of course, there are the memorable “what if?” xkcd columns.
Timothy Gowers suggests a number of problems which might help children really learn how to think with & apply the math they learn.
To look further afield, here’s a quick and nifty application by investor John Hempton to the Sino Forestry fraud: “Risk management and sounding crazy”. What I personally found most interesting about this post was not the overall theme that the whistleblowers were discounted before and after they were proven right (we see this in many bubbles, for example, the housing bubble), but how one could use a sort of Outside View/Fermi calculation to sanitycheck the claims. If Sino Forestry was really causing 17m cubic meters of wood to be processed a year, where was all the processing? This simple question tells us a lot. With medicine, there is one simple question one can always ask too—where is the increased longevity? (This is an important question to ask of studies, such as a recent caloric restriction study.)
Simple questions and reasoning can tell us a lot.
The GSS provides downloads of the full n=62k survey dataset as of 2016, and it is a proportional population sample, so a “character generator” can be implemented as simply as sampling 1 random row from the dataframe and mapping it back onto a natural language description. Should this be inadequate, a generative model such as an autoencoder or GAN could be trained on the dataset to generate further realistic examples which respect all the complicated correlations & patterns between responses. For privacy reasons, the GSS does not provide each respondent’s location or name (except for one question I found asking about the coarse region they grew up in such as “Atlantic region” or “New England”), which are key variables for characters, so they could perhaps be generated using the GSS and then the age/sex/region sampled from US Census data.↩︎
Richard Hamming, “You and Your Research”:
The three outstanding problems in physics, in a certain sense, were never worked on while I was at Bell Labs. By important I mean guaranteed a Nobel Prize and any sum of money you want to mention. We didn’t work on (1) time travel, (2) teleportation, and (3) antigravity. They are not important problems because we do not have an attack. It’s not the consequence that makes a problem important, it is that you have a reasonable attack. That is what makes a problem important.

The consensus among NICE’s economic advisers is that NICE should, generally, accept as cost effective those interventions with an incremental costeffectiveness ratio of less than £20,000 per QALY and that there should be increasingly strong reasons for accepting as cost effective interventions with an incremental costeffectiveness ratio of over £30,000 per QALY.
This renders the population estimate a bit off, but I couldn’t find any sources on the breakdown of Sephardic vs Ashkenazi in the USA in 1970 other than a comment that the latter were a “vast majority”. Since the Jewish Population Study was probably an undercount in not including all the people of Jewish descent, I’m hopeful those two biases cancel out.↩︎
The high IQ sample in Barbe 1964 would have been >8% Jewish, but the paper only reports the overall Jewishness, without specifying whether it’s 4% vs 12% or something like that.↩︎
Hollingworth & Rust 1937: “The data of the present study were obtained early in 1933, the subjects being 36 boys and 19 girls, of the average age of 18 years 6 months. The IQ’s (SB) of all had been taken in early childhood (9). The group ranged from 135190 IQ (SB), with a median at about 153 IQ (SB). All but four of these young persons were Jewish, a factor which must be considered as of possible consequence (8, 14)…”↩︎
Subotnik et al 1993, pg34:
The mean IQ of the Hunter sample was 157, or approximately 3.5 standard deviations above the mean, with a range of 122 to 196 on the LM form. [StanfordBinet Intelligence Scale, Form LM (SBLM)]
…Each class at Hunter College Elementary School from the years 1948 to 1960 contained about 50 students, yielding a total possible population of 600 graduates…35% of the total population of 19481960 HCES students (n=210) completed and returned study questionnaires
…Religious Affiliation: The Hunter group is approximately 62% Jewish, although they describe themselves as Jews more in terms of ethnic identity than religious practice. The group, as a whole, is not religious.
Educational Attainments: Over 80% of the study participants held at least a Master’s degree. Furthermore, 40% of the women and 68% of the men held either a Ph.D, LL.B., J.D., or M.D. degree. Occupation and Income: Only two of the HCES women identified themselves primarily as homemakers. 53% were professionals, working as a teacher at the college or precollege level, writer (journalist, author, editor), or psychologist. The same proportion of HCES men were professionals, serving as lawyers, medical doctors, or college teachers. The median income for men in 1988 was $75,000 (range = $500,000) and for women $40,000 (range = $169,000). Income levels were significantly different for men and women, even when matched by profession. For example, the median income for male college teachers or psychologists was $50,000 and for females, $30,000
My first attempt at it in JAGS went like this:
model_string < ' model { cutoffIQ < 100 + 3.719016485*15 mu_asian ~ dnorm(105, 4^2) X_asian ~ dnorm(mu_asian, 15^2) # T(cutoffIQ,) X_frac_asian < X_asian > cutoffIQ P_asian < 0.07 * (X_frac_asian / length(X_asian)) Y_asian ~ dbinom(P_asian, total) # mu_white ~ dnorm(100, 4^2) # X_white ~ dnorm(mu_white, 15^2) # T(cutoffIQ,) # X_frac_white < X_white > cutoffIQ # P_white < (10.07) * (X_frac_white / length(X_white)) # Y_white ~ dbinom(P_white, total) } ' library(runjags) Y_asian=126 Y_white=418 total=579 model < run.jags(model_string, data = list(Y_asian=Y_asian, Y_white=Y_white, total=total), monitor=c("mu_asian", "mu_white"), n.chains = getOption("mc.cores"), method="rjparallel") summary(model)
But then I realized that
X_frac_asian < X_asian > cutoffIQ
didn’t do what I thought it did and I needed to somehow draw a large number of samples, just like in the ABC simulation, and compare to the number after the truncation… or something.↩︎Strictly speaking we could probably do a parametric bootstrap by counting per decade, and treating each one as a Poisson estimation, and generating random deviates, but that runs into issues with decades with 0 figures (the MLE might estimate a rate of 0 which we know is wrong but not what rate might be right, which could easily imply average rates >1 based on surrounding temporal trends).↩︎
I’ve been experimenting with Boomerang to reduce the problem of nonfollowups by setting ‘ping me if no reply within 1 month’ alerts on my sent emails.↩︎
In this context, it’s interesting to note CavalliSforza’s role in Lewontin’s fallacy.↩︎
Some relevant links:
 “Global urban signatures of phenotypic change in animal and plant populations”, Alberti et al 2017
 “Signatures of positive selection and local adaptation to urbanization in whitefooted mice (Peromyscus leucopus)”, Harris & MunshiSouth 2017 (city mice adapting to eat human food)
 “Contrasting the effects of natural selection, genetic drift and gene flow on urban evolution in white clover (Trifolium repens)”, Johnson et al 2018
 “What Makes a City Ant? Maybe Just 100 Years of Evolution”
 “Urban Evolution: How Species Adapt, or Don’t, to City Living; A Conversation With Jonathan B. Losos”
 Moscow subway dogs (not necessarily genetic except perhaps in a Baldwin effect way, but a demonstration of behavioral flexibility/learning)
 Ridley: “Cities Are the New Galapagos”/“The Sixth Genesis: a manmade, massspeciation event”
 Darwin Comes to Town: How the Urban Jungle Drives Evolution, Schilthuizen 2018
This sounds bizarre but appears to refer to a real (albeit obscure) medical treatment more commonly named “autovaccines” or “autovaccination” (more widely used before the introduction of antibiotics), where part of the localized infection is cultured & then injected to try to trigger a more powerful, systemic, immune response.↩︎
This is a little misleading; dimensional analysis is much more like typechecking a program in a language with a good type system like Haskell. Given certain data types as inputs and certain allowed transformations on those data types, what data types must be the resulting output? But the analogy is still useful.↩︎
eg. if someone asks you how many piano tuners there are in Chicago, don’t look blank, start thinking! ‘Well, there must be fewer than 7 billion, because the human race isn’t made of piano tuners, and likewise fewer than 300 million (the population of the United States), and heck, Wikipedia says Chicago has only 2.6 million people and piano tuners are rare, so there must be many fewer than that…’↩︎