Statistical Notes

Miscellaneous statistical stuff
statistics, decision-theory, Haskell, R, JS, power-analysis, survey, Bayes, genetics, IQ, order-statistics
2014-07-17–2020-04-28 · in progress · certainty: possible · importance: 5


  • criticism of teeth-removal experiment in rats
  • criticism of small Noopept self-experiment
  • why Soylent is not a good idea
  • misinterpretation of fluoridation meta-analysis and ignorance of VoI
  • Fulltext: is this possible?
  • tACS causes lucid dreaming:
  • Herbalife growth patterns:
  • Plausible correlate of Fairtrade:
  • slave whippings vs cotton production
  • whether a study on mental illness & violence shows schizophrenics are not more likely to murder but rather be murdered: / (see also )
  • Fortune analysis of higher female CEO returns
  • algae/IQ:
  • synaesthesia/IQ:
  • misinterpretation:
  • underpowered/multiple-correction jobs program:
  • claimed fall in digit span backwards minuscule and non-statistically-significant, no evidence of heterogeneity beyond variability due to sample size
  • Claimed randomized experiment of whether sushi tastes worse after freezing is not actually a randomized experiment
  • sexual openness result undermined by ceiling effect
  • music study claiming WM interaction: possible ceiling effect? see FB PM
  • attempt to measure effect of Nazi anti-schizophrenia eugenics program failed to use breeder’s equation to estimate possible size of effect, which is too small to detect with available data and hence attempt is foredoomed:
  • claim high IQ types have almost 100% failure rates due to inappropriate model assumption of normal distribution + narrow standard deviation:
  • implausible claims about success rate of facial recognition applied to St Petersburg population: (see also “Facial recognition systems stumble when confronted with million-face database”)
  • human Toxoplasma gondii study is not well-powered as authors claim due to incorrect power analysis, and results are evidence for harm: ;
  • attempt at attributing Bitcoin price increases to technology improvements:
  • analysis of designer drug/research chemical activity on Wikipedia is driven almost entirely by editing patterns of just 2 Wikipedia editors particularly interested in the topic:
  • failure to use mediation SEM, difference-in-statistical-significance-is-not-a-significant-difference:
  • Neanderthal ancestry percentage & autism:
  • Anime image classification project likely undone by using non-i.i.d. images
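On the breeder’s-equation point (the Nazi eugenics bullet above): a back-of-the-envelope version of the calculation the study should have done. Under a liability-threshold model, even one generation of completely removing all schizophrenics from the breeding population shifts prevalence only slightly; the 1% prevalence, 0.8 liability heritability, and total-truncation assumptions below are illustrative, not taken from the study:

```python
from math import erf, exp, pi, sqrt

def phi_pdf(z):
    """Standard normal density."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def phi_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

prevalence = 0.01   # assumed population prevalence of schizophrenia
h2 = 0.8            # assumed narrow-sense heritability of liability

threshold = 2.326   # liability z-score with ~1% of the population above it
# Mean liability of the affected tail (truncated-normal mean):
tail_mean = phi_pdf(threshold) / prevalence
# Selection differential from removing the entire affected tail:
S = -prevalence * tail_mean / (1 - prevalence)
R = h2 * S          # breeder's equation: one-generation response
# Next generation, the fixed threshold sits |R| further from the new mean:
new_prevalence = 1 - phi_cdf(threshold - R)
print(new_prevalence)  # ~0.0094: only a ~6% relative drop - far too small to detect
```

Even a maximally severe selection program barely moves a 1%-prevalence threshold trait in one generation, which is why the attempted detection is foredoomed.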

Failed Facebook Critiques

  • Facebook emotion study:

A reply to (too long for inclusion in )

91 points and no comments? OK, I guess it falls to me to jump on this grenade.

So why is the Facebook study bad science? After 5 screens of meandering anecdotes, insinuations, insults, etc we finally get to a real criticism:

Did people feel betrayed about the lack of informed consent? You know, in psychology research, when people find out they’ve been an unwitting experimental subject, it’s not uncommon for them to feel duped. They’re at least surprised. The only distinction is that academics who experiment on subjects without getting their consent first usually tell people about it immediately afterward. They debrief the subjects and answer questions. They unruffle ruffled feathers. They may allow a subject to remove his or her data from the experiment. In some cases, they even offer follow-up services. Given that Facebook did nothing to inform subjects or make them feel whole again, it’s hard to blame folks for feeling unduly violated.

So? As was pointed out, these experiments are run all the time by all sorts of entities, and by making this criticism you are implicitly arguing that it would be better for Facebook to keep the results secret (like companies usually do) instead of informing us about very relevant results in the brave new world of the Internet. Far from arguing for good science, OP is arguing for bad science as somehow ‘ethical’. (This is quite aside from the issue that informed consent makes no sense and was a kneejerk reaction to abuses that didn’t need the invention of scholastic concepts like ‘informed consent’.)

The experiment also forced many people to contemplate, for the first time, the kind of persuasive power Facebook might surreptitiously wield around the world given its size and scale.

Also not a reason for it being ‘bad science’.

On the other side of the firestorm were people who couldn’t see how the experiment was any different from your run-of-the-mill psychology experiment. Or, alternatively, how it was different from the widespread Internet practice of A/B testing, where you experiment with different variations of a website to see which is most effective at persuading visitors to buy, or download, or whatever the site’s goal is. Some of these experiments feel blatantly manipulative, like the headlines that are constantly tested and retested on visitors to see which ones will get them to click. We have a word for headlines like this: “click-bait.” But nobody ever hands out consent forms.

Oh good, so the author isn’t a complete idiot.

The every-which-way quality of the reaction, I think, comes in part from the fact that the study crossed academic and corporate boundaries, two areas with different ethical standards. It was unclear which to hold the company to.

Wait, what? What happened to all the bloviating earlier about the lack of consent? Now the problem is it ‘crosses boundaries’? WTF. Also: still nothing about how this was ‘bad science’.

If you were a researcher at Facebook, probably one of the things that would provide you with the greatest source of tension about your job would be evidence that the product you’re pushing to half the world’s population actually causes them to feel “negative or left out.” That would be a pretty epic fail for a company that wants to “make the world more open and connected.”

I believe that Kramer is concerned with addressing the popular worry that Facebook makes us unhappy. Not just because I’ve met him but because, in the study, he seems adamant about refuting it. In discussing his findings, Kramer asserts that the study “stands in contrast to theories that suggest viewing positive posts by friends on Facebook may somehow affect us negatively, for example, via social comparison.”

…[long description of ‘social comparison’, which I’m not sure why is in there since the experiment in question strongly suggests it’s not relevant]

Yes, it would suck if that were true and would undermine Facebook’s value, so kudos to Facebook for not hiding its head under a rock and experimenting to find out the truth. Kudos… wait, I forgot, this is ‘lousy social science’ we’re supposed to be booing and hissing about.

In fact, social comparison is often posited as the solution to what’s known as the “Easterlin Paradox,” which finds that, while our happiness increases with our income, societies that get richer do not tend to get happier.

Actually, if you look at the graphs, they do tend to get happier; it’s just that there are severe diminishing returns, and the graph looks logarithmic rather than linear. Minor point, but it annoys me to think that being wealthier doesn’t help. It does.
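The ‘logarithmic, not flat’ point can be illustrated with a toy calculation: under a log model of the income–happiness curve (an illustrative assumption here, not any paper’s fitted model), every doubling of income buys the same fixed increment of happiness, so richer societies do get happier, just at a sharply diminishing rate:

```python
import math

def happiness(income, a=0.0, b=1.0):
    """Toy log-linear model: happiness = a + b*log(income).
    The coefficients a and b are arbitrary illustrative choices."""
    return a + b * math.log(income)

# Going from $10k to $20k buys exactly as much as $100k to $200k:
gain_poor = happiness(20_000) - happiness(10_000)
gain_rich = happiness(200_000) - happiness(100_000)
print(gain_poor, gain_rich)  # both equal log(2) ~= 0.693
```

Diminishing returns, but never zero returns: the curve keeps rising.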

[another 5 screens of meandering somewhat related speculation]

In fact, she finds that greater passive consumption over time, controlling for individual predispositions, is associated with lower perceived social support, lower bridging social capital (feeling part of a broader community), and marginally lower positive affect, higher depression, and higher stress.

Gee, I wonder why that might be… No, let’s jump to the insinuation that Facebook causes the higher depression etc. Yeah, that’s plausible.

The first question about the study is whether anything notable happened. This was a common criticism. Although Facebook has tremendous scale, it doesn’t mean the scientific community should care about every effect the company can demonstrate. Neither should the company itself work on small stuff that barely moves the needle. Though Kramer said he removed a lot of emotion from users’ News Feeds (between 10–90% of positive or negative posts), he saw very little change in the emotions users subsequently expressed. All of the changes were 0.1% or less. That’s not 10% or 1% — that’s 0.1%….Still, the small effects raise important questions. Why were they so small?

Bzzt. First hard scientific criticism, and they failed. The reasons the effects were small are, as the paper explicitly discusses (OP did read the paper, right? The whole thing? Not just blogs and media coverage?): the intervention was designed to be small (that’s real ethics for you, not bullshit about informed consent), the intervention only affects one of several news sources each user is exposed to (decreasing the intervention still more), and the measure of mood in subsequent items is itself a very noisy measure (measurement error biases the effect downwards). The results are exactly as one would expect, and this criticism is invalid. makes the same mistake. The description of how much was removed is also wrong; here’s a quote from the paper:

Two parallel experiments were conducted for positive and negative emotion: One in which exposure to friends’ positive emotional content in their News Feed was reduced, and one in which exposure to negative emotional content in their News Feed was reduced. In these conditions, when a person loaded their News Feed, posts that contained emotional content of the relevant emotional valence, each emotional post had between a 10% and 90% chance (based on their User ID) of being omitted from their News Feed for that specific viewing. It is important to note that this content was always available by viewing a friend’s content directly by going to that friend’s “wall” or “timeline,” rather than via the News Feed. Further, the omitted content may have appeared on prior or subsequent views of the News Feed. Finally, the experiment did not affect any direct messages sent from one user to another…Both experiments had a control condition, in which a similar proportion of posts in their News Feed were omitted entirely at random (i.e., without respect to emotional content). Separate control conditions were necessary as 22.4% of posts contained negative words, whereas 46.8% of posts contained positive words. So for a person for whom 10% of posts containing positive content were omitted, an appropriate control would withhold 10% of 46.8% (i.e., 4.68%) of posts at random, compared with omitting only 2.24% of the News Feed in the negativity-reduced control.

Note the difference between withholding ‘4.68% of posts’ or ‘2.24% of the News Feed’ and OP’s description of removing ‘between 10–90% of positive or negative posts’.
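The measurement-error point generalizes: with classical measurement error, an observed standardized effect shrinks by a factor of sd(signal)/sd(signal+noise), so a noisy word-count proxy for mood will make even a real effect look tiny. A minimal simulation (all numbers below are made up for illustration, not taken from the Facebook data):

```python
import random
import statistics

random.seed(0)
n = 100_000
true_d = 0.5    # assumed true standardized effect on latent mood
noise_sd = 3.0  # assumed noise of the word-count proxy, a la LIWC

# Latent mood: the treatment group is genuinely shifted by true_d SDs.
control = [random.gauss(0, 1) for _ in range(n)]
treatment = [random.gauss(true_d, 1) for _ in range(n)]

# What we actually measure is mood plus a lot of instrument noise.
obs_c = [x + random.gauss(0, noise_sd) for x in control]
obs_t = [x + random.gauss(0, noise_sd) for x in treatment]

sd = statistics.pstdev(obs_c + obs_t)
observed_d = (statistics.mean(obs_t) - statistics.mean(obs_c)) / sd
attenuation = (1 + noise_sd**2) ** -0.5  # predicted shrinkage factor
print(observed_d, true_d * attenuation)  # both ~0.16: the true 0.5 effect looks small
```

The observed effect is biased toward zero, never away from it, which is why a noisy instrument is not an argument against a statistically-significant finding.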

Words were determined to be positive or negative using a dictionary provided by the Linguistic Inquiry and Word Count software, known as LIWC, last updated in 2007. About 47% of posts in the experiment contained positive words while about 22% of posts contained negative words, leaving 31% of posts with no emotional words at all, as defined by LIWC. Everything but the text of the posts was discarded for this analysis, including photos.

The third study, which looks at the contagion of negative emotion in instant messaging, finds that LIWC actually cannot tell the difference between groups sharing negative vs. neutral emotions.

Protip: don’t cite broken links like without any other citation data. I can’t figure out what this study is supposed to be, but given that the link is broken despite the blog post being written barely a month or two ago, I suspect OP is misrepresenting it.

Looking more broadly, one study compares a number of similar techniques and finds that LIWC is a middling performer, at best. It is consistently too positive in its ratings, even labeling the conversation in social media around the H1N1 disease outbreak as positive overall. Another study that looks at emotional contagion in instant messaging finds that, even when participants have been induced to feel sad, LIWC still thinks they’re positive.

Good thing the experiment tested multiple conditions and found similar results.

Used in raw form as in the Facebook experiment, however, it appears to be substantially inferior to machine learning.

Sure. But should we let the perfect be the enemy of better?

Further, we know next to nothing about how well LIWC performs in social media when it comes to emotions under the big headings of positive and negative emotion. If it detects some negative emotions, like anger, better than others like sadness, this too may bias what we learn from the Facebook experiment.

Yes, maybe the LIWC works well in these circumstances, maybe it doesn’t. Who knows? One could write this of any instrument or analysis being applied in a new situation. I hear the philosophers call this ‘the problem of induction’; maybe they have a solution.

In a word: no. Facebook posts are likely to be a highly biased representation of how Facebook makes people feel because Facebook posts are a highly biased representation of how we feel in general…Looking at social situations in general, we know for example that there are powerful pressures to conform to the attitudes, feelings and beliefs of others. And so if we look at Facebook from this standpoint, it’s easy to see how the effects reported in the Facebook experiment might be due to conformity rather than genuine emotional contagion. Consciously or unconsciously, we may sense a certain emotional tone to our News Feeds and therefore adapt what we post, ever so slightly, so that we don’t stick out too much.

Oh for heaven’s sake. So if they had found a ‘social comparison’ effect, then that’s proof of social comparison; and if they didn’t, well, that’s OK because ‘Facebook posts are likely to be highly biased’ and it’s all due to conformity! Way to explain every possible outcome there, OP. Just being biased doesn’t mean you can’t randomize interventions and learn.

Experience sampling involves randomly interrupting people as they go about their lives to ask how they’re feeling in the moment. It’s private, so it’s less subject to social biases. It does not rely on recollections, which can be off. And it solicits experiences evenly across time, rather than relying on only the moments or feelings people think to share.

But wait! I thought we ‘consciously or unconsciously’ self-censored, and “If we censor fully a third of what we want to express at the last minute, how much are we censoring before we even reach for the keyboard? [to report an experience sample]”? So the question is which source of bias do we prefer: people knowing they’re in an experiment and responding to perceived experimenter demands, or people not knowing and going about life as normal? I know which I prefer, especially since the research has actually been done…

Oh my god, it just keeps going on and on, doesn’t it? Dude really likes experience sampling, but I’m thinking he needs to write more concisely. OK, I’m going to wrap up here because I’d like to read something else today. Let’s summarize his complaints and my counter-objections:

  1. no consent: irrelevant to whether this was good science or ‘lousy social science’
  2. crossed boundaries between corporations and academia: likewise irrelevant; also, welcome to the modern Internet
  3. small effect size: misunderstood the statistical design of the study and why it was designed & expected to have small effects
  4. used LIWC with high error rate for measuring emotionality of posts: if random error, biases effect to zero and so is not an argument against statistically-significant findings
  5. and LIWC may have systematic error towards positivity: apparently not an issue as negative & positive conditions agreed, and the studies he cites in support of this claim are mixed or unavailable
  6. also, other methods are better than LIWC: sure. But that doesn’t mean the results are wrong
  7. maybe LIWC has large unknown biases applied to short social media texts: possible, but it’s not like you have any real evidence for that claim
  8. Facebook news posts are a biased source of mood anyway: maybe, but they still changed after random manipulation
  9. experience sampling is sooooooo awesome: and also brings up its own issues of biases, and I don’t see how this would render the Facebook study useless anyway even if we granted it (like complaints #1, 2, 6, 7)

Now, I don’t want to overstate my criticisms here. The author has failed to show the Facebook study is worthless (I’d wager much more money on the Facebook results replicating than 95% of the social science research I’ve read) and it would be outright harmful for Facebook to aim for large effect sizes in future studies, but he does at least raise some good points about improving the followup work: Facebook certainly should be providing some of its cutting-edge deep networks for sentiment analysis for research like this after validating them if it wants to get more reliable results, and it would be worthwhile to run experience sampling approaches to see what happens there, in addition to easier website tests (in addition, not instead of).

Correlation=Causation in Cancer Research

Failed attempt at estimating P(causation|correlation):

How often does correlation=causality? While I’m at it, here’s an example of how not to do it… “A weight of evidence approach to causal inference”, Swaen & van Amelsvoort 2009:

Objective: The Bradford Hill criteria are the best available criteria for causal inference. However, there is no information on how the criteria should be weighed and they cannot be combined into one probability estimate for causality. Our objective is to provide an empirical basis for weighing the Bradford Hill criteria and to develop a transparent method to estimate the probability for causality. Study Design and Setting: All 159 agents classified by International Agency for Research of Cancer as category 1 or 2A carcinogens were evaluated by applying the nine Bradford Hill criteria. Discriminant analysis was used to estimate the weights for each of the nine Bradford Hill criteria.

Results: The discriminant analysis yielded weights for the nine causality criteria. These weights were used to combine the nine criteria into one overall assessment of the probability that an association is causal. The criteria strength, consistency of the association and experimental evidence were the three criteria with the largest impact. The model correctly predicted 130 of the 159 (81.8%) agents. Conclusion: The proposed approach enables using the Bradford Hill criteria in a quantitative manner resulting in a probability estimate of the probability that an association is causal.

Sounds reasonable, right? Take this IARC database, presumably of carcinogens known to be such by randomized experiment, and see how well the correlational studies predict after training - though you might not want to build a regular linear model, because those tend to be weak and not too great at prediction rather than inference. It’s not clear what they did to prevent overfitting, but reading through, something else strikes me:

The IARC has evaluated the carcinogenicity of a substantial number of chemicals, mixtures, and exposure circumstances. These evaluations have been carried out by expert interdisciplinary panels of scientists and have resulted in classification of these agents or exposure conditions into human carcinogens (category 1), probable human carcinogens (category 2A), possible human carcinogens (category 2B), not classifiable agents (category 3), and chemicals that are probably not carcinogenic to humans (category 4) (IARC, 2006). Although the IARC Working Groups do not formally use the Bradford Hill criteria to draw causal inferences, many of the criteria are mentioned in the individual reports. For instance, the preamble specifically mentions that the presence of a dose-response is an important consideration for causal inference. In this analysis, the IARC database serves as the reference database although we recognize that it may contain some disputable classifications. However, to our knowledge there is no other database containing causal inferences that were compiled by such a systematic process involving leading experts in the areas of toxicology and epidemiology.


These evaluations have been carried out by expert interdisciplinary panels of scientists and have resulted in classification of these agents or exposure conditions into human carcinogens

evaluations have been carried out by expert interdisciplinary panels

IARC Working Groups do not formally use the Bradford Hill criteria to draw causal inferences many of the criteria are mentioned

Wait. So their database with causality/non-causality classifications is… based on… opinion. They got some experts together and asked them.

And the experts use the same criteria which they are using to predict the classifications.

What. So it’s circular. Worse than circular, randomization and causality never even enter the picture. They’re not doing ‘causal inference’, nor are they giving an ‘overall assessment of the probability that an association is causal’. And their conclusion (“The proposed approach enables using the Bradford Hill criteria in a quantitative manner resulting in a probability estimate of the probability that an association is causal.”) certainly is not correct - at best, they are predicting expert opinion (and maybe not even that well); they have no idea how well they’re predicting causality.
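The circularity can be made concrete with a minimal sketch (simulated data; a least-squares two-class discriminant stands in for their discriminant analysis, since the paper’s actual 159 ratings aren’t reproduced here): the model is trained on the expert labels themselves, so its hit rate only measures agreement with expert opinion, not causality - and without held-out validation, even that number is optimistic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 159, 9  # 159 IARC agents, 9 Bradford Hill criteria (simulated)

# Simulated expert labels (category 1 vs 2A) and criteria ratings that
# merely correlate with those labels - nothing about causality enters.
y = rng.integers(0, 2, n)
X = (rng.random((n, k)) < (0.35 + 0.3 * y[:, None])).astype(float)

A = np.column_stack([np.ones(n), X])  # intercept + the 9 criteria

def fit_predict(A_tr, y_tr, A_te):
    # Least-squares discriminant: for two classes, equivalent in
    # direction to Fisher's linear discriminant.
    beta, *_ = np.linalg.lstsq(A_tr, y_tr.astype(float), rcond=None)
    return (A_te @ beta > 0.5).astype(int)

# In-sample accuracy (the analogue of the paper's 81.8% figure):
acc_in = float((fit_predict(A, y, A) == y).mean())
# Leave-one-out cross-validated accuracy:
acc_loo = float(np.mean([
    fit_predict(np.delete(A, i, 0), np.delete(y, i), A[i:i+1])[0] == y[i]
    for i in range(n)
]))
print(acc_in, acc_loo)  # both measure agreement with expert labels, nothing more
```

Whatever the numbers come out to, they are accuracy at reproducing the panel’s opinions; randomization never enters anywhere in the pipeline.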

But wait, maybe the authors aren’t cretins or con artists, and have a good justification for this approach, so let’s check out the Discussion section where they discuss RCTs:

Using the results from randomized controlled clinical trials as the gold standard instead of the IARC database could have been an alternative approach for our analysis. However, this alternative approach has several disadvantages. First, only a selection of risk factors reported in the literature have been investigated by means of trials, certainly not the occupational and environmental chemicals. Second, there are instances in which randomized trials have yielded contradictory results, for instance, in case of several vitamin supplements and cancer outcomes.

You see, randomized trials are bad because sometimes we haven’t done them but we still really really want to make causal inferences so we’ll just pretend we can do that; and sometimes they disagree with each other and contradict what we epidemiologists have already proven, while the experts & IARC database never disagree with themselves! Thank goodness we have official IARC doctrine to guide us in our confusion…

This must be one of the most brazen “it’s not a bug, it’s a feature!” moves I’ve ever seen. Mon chapeau, Gerard, Ludovic; mon chapeau.

Incidentally, Google Scholar says this paper has been cited at least 40 times; looking at some, it seems the citations are generally all positive. These are the sort of people deciding what’s a healthy diet and what substances are dangerous and what should be permitted or banned.

Enjoy your dinners.

Aerobic vs Weightlifting

Aerobic vs weightlifting exercise claims: multiple problems, but primarily p-hacking, difference-in-statistical-significance-is-not-a-significant-difference, and controlling for intermediate variables.

…For example, weightlifting enhances brain function, reverses sarcopenia, and lowers the death rate in cancer survivors. Take this last item, lowering death rate in cancer survivors: garden-variety aerobic exercise had no effect on survival, while resistance training lowered death rates by one third… –

[paper in question: “The Effect of Resistance Exercise on All-Cause Mortality in Cancer Survivors”, Hardee et al 2014; fulltext: / ]

This is a bad study, but sadly the problems are common to the field. Claiming that this study shows ‘weight lifting lowered death rates and aerobic exercise did not change survival’ is making at least 5 errors:

  1. correlation!=causation; this is simply your usual correlation study (you know, of the sort which is always wrong in diet studies?), where you look at some health records and crank out some p-values. There should be no expectation that this will prove to be causally valid; in particular, reverse confounding is pretty obvious here and should remind people of the debate about weight and mortality. (Ah, but you say that the difference they found between aerobic and resistance shows that it’s not confounding because health bias should operate equally? Well, read on…)
  2. power: with only 121 total deaths (~4% of the sample), this is inadequate to detect any but comically large correlates of health, as the estimate of predicting a third less mortality indicates
  3. p-hacking/multiplicity, type S errors, exaggeration factor: take a look at that 95% confidence interval for resistance exercise (which is the only result they report in the abstract), which is an HR of 0.45-0.99. In other words, if the correlate were even the tiniest bit bigger, it would no longer have the magical ‘statistical significance at p<0.05’. There are at least 16 covariates, 2 stratifications, and 3 full models tested (that they report). By the statistical-significance filter, a HR of 0.67 will be a serious exaggeration (because only exaggerated estimates would - just barely - reach p=0.05 on this small dataset with only 121 deaths). We can rule out a HR of 0.67 as credible simply on a priori grounds: no exercise RCT has ever shown reductions in all-cause mortality remotely like that, and that’s the sort of reduction you just don’t see outside of miracle drugs for lethal diseases (for example, aspirin and vitamin D have RRs of >0.95).
  4. “The Difference Between ‘Significant’ and ‘Not Significant’ is Not Itself Statistically Significant”: the difference between aerobic exercise and resistance exercise is not statistically-significant in this study. The HR in model 1 for aerobic exercise is (0.63-1.32), and for resistance exercise, (0.46-0.99). That is, the confidence intervals overlap. (Specifically, comparing the proportion of aerobic exercisers who died with the resistance exercisers who died, I get prop.test(c(39,75), c(1251,1746)) = p=0.12; to compute a survival curve I would need more data, I think.) The study itself does not anywhere seem to directly compare aerobic with resistance but always works in a stratified setting; I don’t know if they don’t realize this point about the null hypotheses they’re testing, or if they did do the logrank test and it came out non-significant and they quietly dropped it from the paper.
  5. the fallacy of controlling for intermediate variables: in the models they fit, they include as covariates “body mass index, current smoking (yes or no), heavy drinking (yes or no), hypertension (present or not), diabetes (present or not), hypercholesterolemia (yes or no), and parental history of cancer (yes or no).” This makes no sense. Both resistance exercise and aerobic exercise will themselves influence BMI, smoking status, hypertension, diabetes, and hypercholesterolemia. What does it mean to estimate the correlation of exercise with health which excludes all impact it has on your health through BMI, blood pressure, etc? You might as well say, ‘controlling for muscle percentage and body fat, we find weight lifting has no estimated benefits’, or ‘controlling for education, we find no benefits to IQ’, or ‘controlling for local infection rates, we find no mortality benefits to public vaccination’. This makes the results particularly nonsensical for the aerobic estimates if you want to interpret them as direct causal estimates - at most, the HR estimates here are an estimate of weird indirect effects (‘the remaining effect of exercise after removing all effects mediated by the covariates’). Unfortunately, structural equation models and Bayesian networks are a lot harder to use and justify than just dumping a list of covariates into your survival analysis package, so expect to see a lot more controlling for intermediate variables in the future.
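Points #2 and #4 can be checked numerically. Below is a sketch using the Schoenfeld approximation for survival-analysis power (death count and group sizes taken from the paper as quoted above; the rest is the standard formula) plus a stdlib re-implementation of the prop.test comparison with Yates continuity correction:

```python
from math import erf, exp, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# --- #2: minimum detectable hazard ratio (Schoenfeld approximation) ---
deaths = 121                         # total deaths observed
p_aer = 1251 / (1251 + 1746)         # share of the sample in the aerobic group
z_alpha, z_beta = 1.959964, 0.841621 # alpha=0.05 two-sided, 80% power
min_hr = exp(-(z_alpha + z_beta) / sqrt(deaths * p_aer * (1 - p_aer)))
print(round(min_hr, 2))  # ~0.60: only comically large effects are detectable

# --- #4: two-proportion test with continuity correction, like R's prop.test ---
x1, n1, x2, n2 = 39, 1251, 75, 1746
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
cc = 0.5 * (1 / n1 + 1 / n2)         # Yates continuity correction
z = (abs(p2 - p1) - cc) / se
p_value = 2 * (1 - phi(z))
print(round(p_value, 2))  # ~0.12: aerobic vs resistance deaths not significantly different
```

With 80% power reaching only down to HR≈0.60, the reported HR of 0.67 sits right at the edge of detectability, which is exactly the regime where the statistical-significance filter exaggerates.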

The first three are sufficient to show you should not draw any strong conclusions; the latter two are nasty and could be problematic but can be avoided. These concerns are roughly ranked by importance: #1 puts a low ceiling on how much confidence in causality we could ever derive, a ceiling I informally put at ~33%; #2 is important because it shows that very little of the sampling error has been overcome; #3 means we know the estimate is exaggerated; #4 is not important, because while that misinterpretation is tempting and the authors do nothing to stop the reader from making it, there’s still enough data in the paper that you can correct for it easily by doing your own proportion test; #5 could be an important criticism if anyone was relying heavily on the estimate contaminated by the covariates, but in this case the raw proportions of deaths is what yields the headlines, so I bring this up to explain why we should ignore model 3’s estimate of aerobic exercise’s RR=1. This sort of problem is why one should put more weight on meta-analyses of RCTs - for example, “Progressive resistance strength training for improving physical function in older adults”

So to summarize: this study collected the wrong kind of data for comparing mortality reduction from aerobics vs weightlifting, insufficient mortality data to result in strong evidence, exaggerates the result through p-hacking, did not actually compare aerobics and weightlifting head to head, and the analysis’s implicit assumptions would ignore much of any causal effects of aerobics/weightlifting!

Moxibustion Mouse Study

… “Pre-moxibustion and moxibustion prevent Alzheimer’s disease” …

I don’t believe this for a second. But actually, this would be a nice followup to my previous email about the problems in animal research: this paper exhibits all the problems mentioned, and more. Let’s do a little critique here.

  1. This paper is Chinese research performed in China by an all-Chinese team. The current state of Chinese research is bad. It's really bad. Some reading on the topic:

I will note that there have been statistical anomalies in some of the Chinese papers on dual n-back training I have used in my meta-analysis, so I have some personal experience in the topic.

  2. 'Traditional medicine' research is really bad no matter where you go. They mention acupuncture as justification? That's just fantastic. punctures some of the hyperbolic claims, the PLOS link above deals with the poor quality of the Chinese reviews & meta-analyses in general, and Cochrane is not too kind to acupuncture: And in many of the reviews/meta-analyses there are stark geographic differences where the East Asian studies turn in tons of positive results while the Western studies somehow… don't.

  3. The lead author is not an ordinary neuroscientist or doctor, but works at the "College of Acupuncture and Moxibustion". Is he really going to publish a study concluding "moxibustion does not affect Alzheimer's"⸮ Really⸮

  4. Does this claim even make sense? Moxibustion, really⸮ For those not familiar, entails

> Suppliers usually age the mugwort and grind it up to a fluff; practitioners burn the fluff or process it further into a cigar-shaped stick. They can use it indirectly, with acupuncture needles, or burn it on the patient's skin.

How on earth is this supposed to help AD? How does burning a plant on your skin affect plaques in your brain? Or if they use acupuncture needles, how plausible is it that a few milligrams at most of mugwort inserted into the skin would do anything? While Wikipedia is not Cochrane or anything, it is troubling that this entry lists no useful application of moxibustion. And then it goes and links to "Does moxibustion work? An overview of systematic reviews" which finds that

> Ten SRs met our inclusion criteria, which related to the following conditions: cancer, ulcerative colitis, stroke rehabilitation, constipation, hypertension, pain conditions and breech presentation. Their conclusions were contradictory in several instances. Relatively clear evidence emerged to suggest that moxibustion is effective for breech presentation.

That review also mentions, incidentally, that

> Many of the primary moxibustion trials originate from China (data not shown); Vickers et al. demonstrated that virtually 100% of Chinese acupuncture trials are positive [], which seems to be equally applied to moxibustion, an acupuncture-like intervention. This casts considerable doubt on the reliability of these studies.

Alright, so let's take stock here. Without ever looking beyond the title and authorship, we have found that this is a paper from a country with infamously bad research, in a field with infamously bad research quality, led by a researcher with a considerable inherent conflict of interest, using a technique/substance which has already been linked with biased research, on a hypothesis that is grossly implausible. Based on all these base rates, we can say that there is basically zero chance this result will ever replicate, much less generalize to other rat strains or to humans.

It seems unfair to reject the paper out of hand, though, so let's look at the actual paper a little.

> Forty healthy rats were randomly divided into four groups: control group, model group, moxibustion group and pre-moxibustion group. The latter three groups were treated with intracerebral injection of Aβ1–42 to establish an AD-like pathology. The moxibustion group received suspended moxibustion on Baihui and Shenshu acupoints for 14 days after Aβ1–42 injection. The pre-moxibustion group was treated with moxibustion for eight courses (each course lasting for 6 days) prior to the exposure and 14 days after Aβ1–42 exposure. The final analysis incorporated all rats.

From the materials and methods:

> Male Wistar rats (12 months old; 500 ± 20 g), of specific pathogen free grade, were obtained from the Experimental Animal Center of Huazhong University of Science and Technology (Wuhan, China), with license No. SCXK (E) 2008-0005.

> After the hair around the acupoints was shaved, an ignited moxa-stick (diameter 6 mm; Nanyang Shennong Aaicao Appliance Company, Nanyang, Henan, China; a round long stick made of moxa floss, also called moxa roll) was suspended perpendicularly 2 cm above the acupoints. Baihui (located in the middle of the parietal bone[50]) and Shenshu (located under the second lumbar on both sides[50]) acupoints were simultaneously given suspended moxibustion. Each treatment consisted of a 15-minute moxibustion, keeping the spot warm and red but not burnt. Generally, the skin temperature was kept at 43 ± 1° during the moxibustion procedure.

Right away we can spot 3 of the usual animal research methodological problems:

  1. the sample size is too small: at n=10 rats in each group, you are not going to detect anything without large effect sizes. It is implausible that suspended moxa has any effects, and it is especially implausible that the effect sizes would be large.
  2. there is no mention of blinding. The technicians or research assistants or whoever clearly know which rats they are dealing with.
  3. there is mention of randomization, but it's not specified how the randomization was done, which means it probably was done by the 'stick your hand in and grab' method, and probably does not balance by litter or other variables. This massively worsens the power problem; see "Design, power, and interpretation of studies in the standard murine model of ALS".
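To put the sample-size problem in numbers, here is a quick base-R sketch (my own calculation, not from the paper) of the smallest standardized effect a two-sided two-sample t-test can reliably detect with n=10 per group:

```r
# Power of a two-sided two-sample t-test, via the noncentral t distribution.
power.t2 <- function(d, n1, n2, alpha = 0.05) {
    df   <- n1 + n2 - 2
    ncp  <- d * sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    crit <- qt(1 - alpha/2, df)
    pt(-crit, df, ncp) + (1 - pt(crit, df, ncp))
}
# Smallest d detectable at 80% power with n=10 per group:
d.min <- uniroot(function(d) power.t2(d, 10, 10) - 0.80, c(0.1, 3))$root
round(d.min, 2)   # ~1.32: a huge effect by the standards of behavioral research
```

So with 10 rats per group, anything less than a "huge" effect (d ≈ 1.3) will usually be missed.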

(I'm a little curious about whether they really started with 10 rats in each group: the rats spent at least 60 days in the experiment, and I wonder how many, out of 40, you would expect to die in that time period, especially after you've done your level best to give 3⁄4 of them Alzheimer's disease during that time.)

I also note that the moxibustion situation is even worse than I thought: they did not use acupuncture needles to get some mugwort into the rats, they did not put any moxa/mugwort in physical contact with them; instead, the burning was 2cm away from the rats! The mechanism was bad, but it just got worse.

There's no mention of the data being provided anywhere at all, either on their website or the publisher's; there's some evidence that providing access to a paper's data correlates with higher-quality research, so I mention this absence. It also makes it harder for me to do anything more complex like a post hoc power analysis.

Moving on, they list as dependent variables:

  • Morris water maze navigation test
  • Morris water maze spatial probe test
  • apoptosis rate of hippocampal neurons

Let’s look at the stats a bit.

  1. Significance: The paper lists no fewer than 14 p-values (4 < 0.05, the rest < 0.01), and for all of them uses an alpha of 0.05. The smallest reported constraint is p<0.01. A Bonferroni correction on this would use 0.05/14 (since they must have done at least 14 tests to report 14 p-values), which means an alpha of 0.003571. But 0.01 > 0.05/14, so the 4 p<0.05 values disappear under multiple correction, and probably most of the p<0.01s would too.

    • this is a lower bound, since the Morris diagrams report something like 20 p-values themselves, but I didn't feel like carefully parsing the text to figure out exactly how many p-values are being reported
  2. Effect sizes: no tables are provided, but figure 2 (the second Morris maze test) is illustrative. The control rats have no problem remembering where the platform used to be, and so spend almost half a minute (~24s) in the right area searching for it. Makes sense; they don't have AD. The AD rats have terrible memory, and so only spend ~6s in the right area and most of their time in the wrong place. Also makes sense. Now, what about the AD rats who had some moxa burnt 2cm away from their skin? They spend 14-16s, more than twice and almost three times as long as the non-moxa AD rats! And the claimed standard error on all 4 groups of rats' times is tiny, maybe 1s eyeballing the graph. So they are claiming, at this point, to have an effect size on memory of something like d = (15-6)/1 = 9. Insane! From burning some mugwort 2cm away from the rats' skin‽

  3. Power: actually, that result shows an example of what I mean by the result being absurd. Let's calculate what that effect size implies for the power of their t-test comparing the model AD rats with the moxa AD rats. So let's say the 2 moxa groups together equate to n=20 with mean 15s (SD 1.5), and the AD controls were then n=10 with mean 5s (SD 0.5). The pooled standard deviation of the non-moxa and moxa rats is sqrt(((20-1)*(1.5^2) + (10-1)*(0.5^2)) / (20 + 10 - 2)) = 1.267, so the effect size was actually d = (15-5)/1.267 = 7.89. With 20 rats in 1 group and 10 rats in the other, an alpha of 0.05, then our power turns out to be…

      pwr.t2n.test(n1 = 20, n2 = 10, d = 7.89, sig.level = 0.05)
           t test power calculation
                   n1 = 20
                   n2 = 10
                    d = 7.89
            sig.level = 0.05
                power = 1

A power of 100%. Absurd. Have you ever seen animal research (or Alzheimer's research…) with such high power? Real effects of real treatments, in large clinical trials or in meta-analyses, hardly ever reach power that high.
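For those without the pwr library, the same power calculation can be reproduced in base R via the noncentral t distribution, using the group sizes, means, and SDs eyeballed above:

```r
# Reproduce the claimed moxa-effect power analysis with base R only.
n1 <- 20; n2 <- 10; alpha <- 0.05
sd.pooled <- sqrt(((n1-1)*1.5^2 + (n2-1)*0.5^2) / (n1 + n2 - 2))  # ~1.27
d    <- (15 - 5) / sd.pooled                                      # ~7.89
df   <- n1 + n2 - 2
ncp  <- d * sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter, ~20
crit <- qt(1 - alpha/2, df)            # two-sided critical value
power <- pt(-crit, df, ncp) + (1 - pt(crit, df, ncp))
power                                  # effectively 1, as pwr reports
```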

So. I don't know how they got the results they got. Did they administer dozens of tests until they got the results they wanted? Did they simply make up the data, like so many Chinese academics have? Or did they start with 30 rats in each group and cherrypick the best/worst 10? Did they abuse the model AD rats to make the AD+moxa rats look good?

In conclusion: this paper is complete bullshit and will not replicate.

“Someone Should Do Something”: Wishlist of Miscellaneous Project Ideas


  • Erowid: data-mine the trip reports to create clusters or a state-space of drug effects/results and, using the clusters & common terms, create a general inventory of descriptions; add this to the trip report form so Erowid users can provide some more structured information about their experience.

  • Dark net markets: use a longitudinal crawl of DNM sellers to estimate survival curves, outstanding escrow + orders, and listed product prices / type / language to try to predict exit scams.

  • what are the possibilities in general for predicting human traits from faces? If eg , then perhaps faces can predict many things. Instead of bickering about how much you can predict homosexuality etc from faces and whether a specific dataset/analysis works, apply variance component analysis using distances in a facial recognition CNN's face-embedding as a similarity metric (which is highly robust to all sorts of real-world transformations like angle or lighting or hair style); then calculate 'face heritability' on many traits (the OKCupid scrape dataset should support this). If the average is near zero, that implies that faces don't carry any important signals and that, aside from occasional exceptions, nothing beyond the expected things like basic demographic data can be predicted from faces. On the other hand, if 'face heritability' of many traits turns out to be substantially above zero (perhaps 20%), this means that faces carry many signals and these signals may be commercially or legally exploitable, and earlier findings about face prediction may have been right after all. We may not like the answers, but it's better to know the truth than go along blithely assuring everyone that it's impossible to do such things and that things like homosexuality/criminality are merely junk statistics.

    This has been done using a standard face-recognition NN's embedding for Big Five personality factors: , Kachur et al 2020.

  • Quantified Self experiments for acne, especially for teenagers: one could work with some to run some preliminary experiments, perhaps design some canned experiments+analysis code?

  • Finding the best movie adaptations of books: movie adaptations of books typically disappoint readers, but a movie which is better than the book is quite interesting and unusual. We can't easily find that by simply looking at average ratings on IMDb & GoodReads, because we want to know pairs of movies/books where the movie has a higher (standardized) rating.

    Can we create a list automatically by scraping Wikipedia/WikiData's categories of books & movies and creating pairs where a book article links to a movie article & vice-versa? (Presumably, all movie adaptations link to the original book's article, and all books which have a movie adaptation will link to the movie's article, so reciprocal linking indicates an adaptation. Obscure or bad works may not have high-quality WP articles or thorough links, but those are also the ones least likely to be great adaptations, so the bias is fine.) Given qualifying pairs, the articles will also probably include ISBNs or Rotten Tomatoes or IMDb links which can be used to retrieve ratings, and then it's merely a matter of standardizing overall ratings and listing pairs with the largest difference.
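The last step might look like this in R, with made-up ratings standing in for scraped IMDb/GoodReads data (the titles & numbers here are purely illustrative):

```r
# Hypothetical ratings for book/movie pairs; a real version would scrape these.
books  <- c(TheGodfather=4.4, Jaws=3.9, FightClub=4.2)  # GoodReads, 1-5 scale
movies <- c(TheGodfather=9.2, Jaws=8.1, FightClub=8.8)  # IMDb, 1-10 scale
z <- function(x) (x - mean(x)) / sd(x)  # standardize within each site
delta <- z(movies) - z(books)           # movie minus book, in SD units
sort(delta, decreasing=TRUE)            # top entries: adaptations beating their books
```

Standardizing within each site makes the 1-5 and 1-10 scales comparable before differencing.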

Deep learning/RL

  • Markov chain/char-RNN bot for Twitter trained on just English proverbs, idioms, and expressions

  • user-friendly char-RNN implementation just for classifying text, taking in CSV data of text/category

  • RL agent using MCTS + GAN/PixelCNN model of environment

  • hyperparameter optimization for algorithms in problems without available loss functions but with human-judgeable quality, using a human for making choices in a paired or forced-choice comparison, then using a Bradley-Terry or latent variable model to infer rankings of hyperparameter settings and optimizing based on the latent scores. This would be particularly useful in GAN comparisons, where most evaluations attempt to force judgments into a cardinal framework.

  • GAN improvements: provide supervision via adding additional losses by requiring the Discriminator (D) to output an array of per-pixel losses of sample images, as opposed to a single scalar loss across the whole image, thereby training the generator more effectively. Shift from a single scalar variable as feedback per image to a U-net (the logical extreme of a multi-scale approach like )

    In looking at GAN samples, I notice that bad Generators (G) often generate decent overall samples but there will be small regions where the quality is glaringly bad. It is not the case that the "whole image just looks bad somehow"; often there's a specific point like the eyes or the lips where it looks horrifyingly creepy (especially for dog or human images). If D produces a large loss (because it's so easy to notice the flaw), this seems odd from a backpropagation sense, since most of the image is fine, it's just a few spots which contribute to the loss. GANs, as has often been noted, are closely related to reinforcement learning, and considered as RL, the G is getting a single reward at the end of a long sequence of generated pixels, and does not know which pixels are responsible for low or high rewards; akin to REINFORCE, it has little choice but to reward/punish neurons and hope that on average it is approximating the correct gradient for each parameter. Actor-critic methods make the reward more informative by trying to assign blame to specific actions, and AlphaGo Zero's expert iteration appears to exhibit such dramatic learning speed because the use of MCTS means that AG Z receives not a single reward 0/1 attenuated over an entire game of moves, but precise immediate feedback on the value of moves it took & also on all the moves it didn't take. In general, providing more losses is good for learning; additional examples would include auxiliary losses in RL like UNREAL or "dark knowledge" in image classification. In GANs, everything is differentiable and synthetic, so we don't need to accept RL-like impoverished losses, but it seems like for the most part, the losses are very simple and low-information.
Further, in GANs, the largest improvements in image quality in StackGAN and ProGAN come from adding GAN global losses at multiple layers of the generator: a D specialized for 32x32px images, then another D specialized for 64x64px, then another D for 128x128px, etc. This can be seen as stacking up losses "depth"-wise, providing feedback about plausibility at multiple stages. So why not add losses "width"-wise, by criticizing each pixel in the final upscaled image? If it's good one way, why not the other? This is in large part how the strongest competitor to GANs for image generation, PixelCNN, works: generating 1 pixel at a time conditioned on previously generated pixels. (mere_mortise suggests that this scheme would be equivalent to a regular GAN loss but computed on many shifted versions of an image, although that would presumably be much slower.)

    Given a D which outputs the 2D array of per-pixel losses, the training of G is just backpropagation as usual, but how does one train D to provide per-pixel losses? Given a real image, by definition the fakeness of each pixel is 0, after all. The simplest approach would be to train the D with real and G-ed/fake images, label all the pixels in the real image with 0 and all the pixels in the fake image with 1, and hope it works out and the D will learn that way over enough minibatches. Another approach might be to introduce kinds of noise or corruption or shuffling into the real images, label the original pixels with 0 and the new pixels with 1; for example, replace a random 50% of pixels with white noise. (This might sound crazy but then, so does an image augmentation technique like which nevertheless works in CNNs & GANs.) A more interesting approach might be to refashion G into not a single-shot image generator, but a region in-filler/inpainter/completer; this lets one generate images which genuinely are a mix of real and fake pixels, by cropping out a random region in a real image, having G fill it back in, and labeling real/fake appropriately. Something like MixUp might be employed: an image could be 40% generated/60% real, and then the target for D is 60%. If MixUp on random pairs of images doesn't work, a conditional GAN's conditioning could be used as a kind of MixUp: combine a real image with a fake image based on the real's conditioning, and since the conditioning should describe most of the image, the pair should constitute a good mashup for the D.

    This essentially turns GANs into a "semantic segmentation" problem. For a similar but not identical use, see and ; what I propose may have been done, but simpler, in , Gokaslan et al 2018.

    Something like this was done 2 years later by : scoring quality of individual pixels by training on mashed-up images. The specific noise was to add pixel-level noise to a real image using a weighted average with a fake image, and occasionally copy-paste circular regions from the fake into the real. This did not lead to any particular improvement in the WGAN/StyleGAN/BigGAN models they trained, but the Discriminators were able to rank images usefully by quality. A much closer implementation is , Schönfeld et al 2020, which does precisely what I suggest but using instead of MixUp; the augmented images look strange because they are just square blocks from different images copy-pasted on top of another image, but they report improvements on top of regular BigGAN for FFHQ/CelebA/COCO-Animals (albeit harder datasets like ImageNet/Danbooru2019/JFT-300M are not attempted).

  • GWAS via 1D (possibly dilated) CNNs on SNP sequences a la WaveNet or malware detection ():

    Linear regressions are notoriously sample-inefficient and weak methods of implementing GWAS, as they typically use unrealistic flat priors, do not exploit the 'clumping' of hits in groups of SNPs (requiring post-processing to 'prune' SNP hits which are physically too close to each other and likely in linkage disequilibrium, to reveal the 'real' hit), and expect linear, additive effects. Linear regressions can easily produce polygenic scores explaining half or less of the variance compared to a more optimal statistical method (eg compare Hsu's lasso or MTAG use to the previous GWASes on height/intelligence). A CNN could benefit from the hit clusters, can flexibly model distributions of effects (subsuming the "Bayesian alphabet"), and can pool information both locally and globally while modeling potentially arbitrarily complex interactions and hierarchies of effects. A SNP sequence of, say, 500k high-quality SNP calls may seem infeasible for a NN, and would be totally infeasible for a standard RNN processing the sequence 1 SNP at a time, as it would be unable to preserve enough information in its hidden state or learn effectively due to vanishing gradients; but WaveNet and 1D convolutions for text classification have demonstrated the ability of dilated convolutions to handle enormous sequences highly effectively while modeling both local & global aspects. It is possible that a 1D CNN could be a highly effective GWAS method as well.

    The primary challenge, as discovered by Raff et al 2017 in experimenting with CNNs ingesting sequences of millions of bytes, is that the first layer is inherently extremely memory-hungry, as each of the thousands or millions of variables must be connected to the NN simultaneously. Raff et al 2017 used a DGX-1 with 4 GPUs and ~16GB VRAM for a month for convergence, and found almost all their memory was going to the first layer, while the higher layers contributed minimal demand. If the additional layers prove problematic, dilated convolutions can be used instead, which increase memory use only logarithmically, especially with high dilation factors like 15 or 20. (Raff et al 2017 also found that dilated convolutions were unhelpful in their malware executable classification problem and that they needed a very shallow architecture, suggesting that malware byte sequences just don't have that much local structure for convolutions to exploit and that they were having training/convergence issues despite considerable investment; but I expect genomes to have much more local structure, due to the genome inherently being sequenced into genes (which do not all affect traits of interest to equal degree), coding regions of various sorts, and the previously mentioned SNP-clumping empirically observed in many GWASes.) A GWAS CNN might require data-parallel training over multiple 1080ti GPUs, splitting the minibatch to fit into the 11GB VRAM, and at least a month. However, should it deliver predictive power much superior to existing SOTA techniques like lasso GWAS, these computational requirements would probably be considered acceptable: several GPU-months may be expensive, but collecting twice or thrice as many human genomes is more expensive still.
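A back-of-the-envelope check (my arithmetic, not from Raff et al 2017) that dilated convolutions scale to GWAS-length sequences: with kernel size 3 and dilation doubling each layer, the receptive field roughly doubles per layer, so covering 500k SNPs takes only ~18 convolutional layers:

```r
# Receptive field of a stack of dilated 1D convolutions:
# each layer with kernel size k and dilation d widens the field by (k-1)*d.
k <- 3
dilations <- 2^(0:17)               # 18 layers: dilation 1, 2, 4, ..., 131072
rf <- 1 + sum((k - 1) * dilations)  # 1 + 2*(2^18 - 1) = 524287 positions
rf >= 5e5                           # TRUE: spans a 500k-SNP genotype array
```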

  • deep RL for neural network design, but focusing on generating a distribution of random weights for initializing a NN; better initializations have proven to be extremely important in stably training NNs, and simply tweaking initialization can train NNs with hundreds of layers (previously impossible, then only possible with a major architectural innovation like residual networks) eg . Better initializations are hard to design by hand as they apparently work by breaking various symmetries inside the NN, so this is a problem that is well suited for brute force and trial-and-error. See further , , , , , . This may wind up being essentially the same thing as HyperNetworks/fast-weights eg .

Preference learning

See & .


  • writing tools:

    • dialect/period writing tool, perhaps exploiting word2vec: identify words in a text which are of the wrong dialect or are characteristic of different time periods; for example, identifying Americanisms in an ostensibly British work (to 'Brit-pick'), or anachronisms in a historical fiction (words which did not exist in that time period or would be highly unusual), and suggesting replacements

    • character generator: generate random population-weighted samples of people by demographics, political & religious attitudes, and ideology, drawing on realistic datasets such as US censuses (for demographics/names) or the (GSS)1; this can be useful in reducing bias in characters, exploring possibilities, and increasing realism. Naive attempts to debias writings often wind up making the characters far more unrepresentative, such as by including too many homosexual or transsexual characters or including rare ethnicities like Jews while failing to include common types of people such as fundamentalist Christians or Republicans, and existing fake name or character generators do not help, because they typically take the easy way out by merely sampling randomly from a list of unique values, skewing selection to the bizarre & exotic; trying out one such generator, I get strange names like "Cynthia P. Teal" or "Cody A. Nguyen" or "Marshall T. Blanco". Using real data & proportional sampling ensures realism and eliminates blind spots an author may not realize they have. (Of course, this is not to say that an author will be happy with the suggestions, particularly with what the GSS may reveal about the beliefs and knowledge of Americans in general. But if an author ensures that all of their characters know that chocolate milk doesn't come from brown cows, or have graduated high school, at least it will then be a deliberate choice on their part.)

    • -inspired English font: is it possible to write English in syllable blocks akin to how Korean is written in hangul, using a large set of ? (For example, a word like 'the' could be easily written as a block with a 'th' ligature and placing the 'e' over the 'h'.)

      enthusiasts do not seem to've tried this; the closest I've found is Russian 'elm' calligraphy (esthetic but unreadable), (pedagogical tool explaining how Chinese characters work), an experiment in setting entire words as blocks (which mostly demonstrates the need to do it with syllables instead), and a handful of "interlock" such as "Ed Interlock" (meant less for readability than to convey a '60s hippy or a Tahitian ).

  • smart-glasses w/NNs for lipreading+transcription+voice-generation for the deaf/hearing-impaired:

    Inspired by , Assael et al 2016 (video)

    > Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.

    Coming on the heels of human-level speech transcription, I am very much looking forward to smart glasses with real-time captioning. That is going to be a game-changer for hard-of-hearing and deaf people.

    The output solution for deaf people has existed for a long time: a little chalkboard, or typing, since many people can type almost as fast as they speak normally, and steno keyboards are much faster than that. But this was never relevant (offline) because deaf people couldn't hear: there's no point in being able to reply if you don't know what you're replying to. So we had to teach deaf people both written English and ASL for interactions. WaveNet may offer human-level voice synthesis, but it didn't matter. However, with LipNet, doesn't that change? If you can get realtime transcription with lipreading+transcription RNNs which is human-equivalent or better, you've closed the loop. Why not just have deaf people use smart glasses for captioning and a glove for a steno-like keyboard + voice synthesis? You have to teach them written English and typing anyway, so what's ASL now adding aside from esthetics and community? (People are happy to be dependent on smartphones, so that's not a serious minus.)

  • a VR application for viewing images & video and for 3D environments with extremely large parallax, such as viewing clouds with true depth perception (discussion)

  • properly transcribe & annotate Douglas Hofstadter's Le Ton Beau de Marot, one of his best but also most obscure books


  • provide "polygenic scores as a service", a website/API where one can upload a SNP data file like the 23andMe export and get back PGSes for everything in LD Hub, and utility weights
  • expand/rewrite Wikipedia's —grossly outdated, almost totally omitting all the GCTAs and GWASes that have definitively settled the answer in the strongly affirmative
  • nominative determinism: do first names affect how people are perceived or their appearance? Some studies indicate that one can guess first names based on appearance… but I haven't seen one which does a within-family comparison, eg swapping at random the photographs of two same-sex siblings, providing their first names, and asking people to guess which is which. Names are canonical examples of things which vary systematically between families.

Estimating censored test scores

An acquaintance asks the following question: he is applying for a university course which requires a certain minimum score on a test for admittance, and wonders about his chances and a possible trend of increasing minimum scores over time. (He hasn't received his test results yet.) The university doesn't provide a distribution of admittee scores, but it does provide the minimum scores for 2005-2013, unless all applicants were admitted because they all scored above an unknown cutoff, in which case it provides no minimum score. This leads to the dataset:


A quick eyeball tells us that we can't conclude much: only 4 actual datapoints, with 5 hidden from us. We can't hope to conclude anything about time trends, other than that there doesn't seem to be much of one: the last observed score, 417, is not much higher than the average of 410, and the two most recent years are censored. We might be able to estimate a mean, though.

We can't simply average the 4 scores and conclude the mean minimum is 410, because of those NAs: a number of scores have been 'censored' because they were too low, and while we don't know what they were, we do know they were <398 (the smallest observed score), so a bunch of <398s would pull down the uncensored mean of 410.
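To see how much damage silently dropping the censored years can do, here is a quick simulation (in Python for illustration; the Normal(400, 8) distribution and the 397 cutoff are hypothetical, chosen merely to resemble this dataset):

```python
import random, statistics

random.seed(0)
# hypothetical: true yearly minimum scores ~ Normal(400, 8), and any year whose
# minimum fell below a reporting cutoff of 397 is published as NA (left-censored)
draws = [random.gauss(400, 8) for _ in range(100_000)]
uncensored = [x for x in draws if x >= 397]

print(round(statistics.mean(draws), 1))       # the true mean: ~400
print(round(statistics.mean(uncensored), 1))  # mean of surviving values: biased up by several points
```

Averaging only the published values overstates the mean by ~4-5 points here: the same direction of bias as the 410-vs-397 gap in the real data.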

One approach is to treat it as a censored-regression (Tobit) model and estimate using something like the censReg library (overview).

But if we try a quick call to censReg, we are confounded: a Tobit model expects you to provide the cutoff below which the observations were censored, but that is something we don't know. All we know is that it must be below 398; we weren't told it was exactly 395, or 394, etc. Fortunately, this is a solved problem. For example, "The Tobit model with a non-zero threshold", Carson & Sun 2007 tells us:

In this paper, we consider estimating the unknown censoring threshold by the minimum of the uncensored yi's. We show that the estimator γ' of γ is superconsistent and asymptotically exponentially distributed. Carson (1988, 1989) also suggests estimating the unknown censoring threshold by the minimum of the uncensored yi's. In a recent paper, Zuehlke (2003) rediscovers these unpublished results and demonstrates via simulations that the asymptotic distribution of the maximum likelihood estimator does not seem to be affected by the estimation of the censoring threshold.

That seems to be almost too simple and easy, but it makes sense, and reminds me a little of the German tank problem: the minimum might not be that accurate a guess (it's unlikely you just happened to draw a sample right at the censoring threshold), but it definitely can't be wrong in the sense of being too low. (A Bayesian method might be able to do better with a prior like an exponential.)
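A small simulation (Python, with made-up parameters) shows why the minimum of the uncensored observations works as a threshold estimator: it can only err on the high side, and it converges on the true threshold quickly:

```python
import random

random.seed(1)
gamma = 397                      # true censoring threshold (unknown to the analyst)
estimates = []
for _ in range(1000):
    ys = [random.gauss(400, 8) for _ in range(500)]
    uncensored = [y for y in ys if y >= gamma]   # censored values are simply missing
    estimates.append(min(uncensored))            # Carson & Sun's estimator

print(min(estimates) >= gamma)                   # True: it never undershoots
print(round(sum(estimates) / len(estimates), 2)) # on average, just barely above 397
```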

With that settled, the analysis is straightforward: load the data, figure out the minimum score, set the NAs to 0, regress, and extract the model estimates for each year:

scores <- data.frame(Year=2005:2013,
censorThreshold <- min(scores$MinimumScore, na.rm=T)
scores$MinimumScore[$MinimumScore)] <- 0

## 'censorThreshold-1' because censReg seems to treat threshold as < and not <=
summary(censReg(MinimumScore ~ Year, left=censorThreshold-1, data=scores))
# Warning message:
# In censReg(MinimumScore ~ Year, left = censorThreshold - 1, data = scores) :
#   at least one value of the endogenous variable is smaller than the left limit
# Call:
# censReg(formula = MinimumScore ~ Year, left = censorThreshold -
#     1, data = scores)
# Observations:
#          Total  Left-censored     Uncensored Right-censored
#              9              5              4              0
# Coefficients:
#              Estimate Std. error t value Pr(> t)
# (Intercept) -139.9711        Inf       0       1
# Year           0.2666        Inf       0       1
# logSigma       2.6020        Inf       0       1
# Newton-Raphson maximisation, 37 iterations
# Return code 1: gradient close to zero
# Log-likelihood: -19.35 on 3 Df
-139.9711 + (0.2666 * scores$Year)
# [1] 394.6 394.8 395.1 395.4 395.6 395.9 396.2 396.4 396.7

With so little data the results aren't very reliable, but there is one observation we can make.

The fact that half the dataset is censored tells us that the uncensored mean may be a huge overestimate (since we're only looking at the 'top half' of the underlying data), and indeed it is: the original mean of the uncensored scores was 410, while the estimate including the censored data is much lower at 397 (13 less)!

This demonstrates the danger of ignoring systematic biases in your data.

So trying to calculate a mean or time effect is not helpful. What might be better is to instead exploit the censoring directly: if the censoring happened because everyone got in, then showing up in a censored year meant a 100% chance of getting in, while a non-censored year meant an unknown but <100% chance; so the probability of a censored year sets a lower bound on one's chances. This is easy to calculate as a simple binomial problem: 5 out of 9 years were censored, so:

binom.test(c(5, 4))
#   Exact binomial test
# data:  c(5, 4)
# number of successes = 5, number of trials = 9, p-value = 1
# alternative hypothesis: true probability of success is not equal to 0.5
# 95% confidence interval:
#  0.212 0.863
# sample estimates:
# probability of success
#                 0.5556

So we can tell him that he may have a >55% chance of getting in.
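The exact interval R reports can be reproduced from first principles; here is a Clopper-Pearson sketch in Python, finding each bound by bisection on the binomial tail probabilities:

```python
from math import comb

def binom_tail_ge(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson(successes, n, alpha=0.05):
    """Exact binomial CI: find p where each tail probability equals alpha/2."""
    def solve(f):                       # bisection, assuming f is increasing in p
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > 0: hi = mid
            else: lo = mid
        return lo
    lower = solve(lambda p: binom_tail_ge(n, successes, p) - alpha / 2)
    upper = solve(lambda p: alpha / 2 - (1 - binom_tail_ge(n, successes + 1, p)))
    return lower, upper

print(clopper_pearson(5, 9))   # ~(0.212, 0.863), matching binom.test's 95% CI
```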

The Traveling Gerontologist problem

A quick probability exercise: Finland had 566 centenarians as of 2010.

That's few enough that you could imagine visiting them all to research them and their longevity, in a sort of traveling salesman problem but with gerontologists instead. Except, because of the Gompertz curve, centenarians have high annual mortality rates; it depends on the exact age, but you could call it >30% (eg. Finnish 99-year-olds in 2012 had a death toll of 326.54/1000). So you might well try to visit a centenarian and discover they'd died before you got there.

How bad a risk is this? Well, if the risk per year is 30%, then one has a 70% chance of surviving a year. To survive a year, you must survive every day of it; by the multiplication rule, the daily survival probability x satisfies x^365.25 = 0.7; solving, x = 0.7^(1/365.25) = 0.999024.
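The arithmetic, as a one-liner (Python):

```python
# a 70% chance of surviving the year implies a daily survival probability x
# with x**365.25 == 0.70, i.e. x = 0.70**(1/365.25):
x = 0.70 ** (1 / 365.25)
print(round(x, 6))             # 0.999024
print(round(x ** 365.25, 2))   # 0.7, recovering the annual rate
```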

It takes time to visit a centenarian: it wouldn't do to be abrupt and see them for only a few minutes, you ought to listen to their stories, and you need to get to a hotel or airport; so let's assume you visit 1 centenarian per day.

If you visit centenarian A on day 1 and want to visit centenarian B on day 2, then you can count on a 99.9% chance B is still alive. So far so good. And if you wanted to visit 566 centenarians (let's imagine you have a regularly-updated master list of centenarians from the Finnish population registry), then you only have to beat the odds 566 times in a row, which is not that hard: 0.999024^566 ≈ 0.575.

But that's coldblooded of you, objectifying those Finnish centenarians! "Any centenarian will do, I don't care." What if you picked the current set of 566 centenarians and wanted to visit just them, specifically, with no new centenarians introduced to the list to replace any dead ones?

That's a little more complicated. When you visit the first centenarian, it's the same probability: 0.999024. When you visit the second centenarian, the odds change, since now she (and it's more often 'she' than 'he': remember the exponential mortality acceleration, and males having shorter mean lifetimes) has had to survive 2 days, so it's 0.999024^2; for the third, it's 0.999024^3; and so on to #566, who has been patiently waiting and trying to survive a risk of 0.999024^566. Then you need to multiply to get your odds of beating every single risk of death before the centenarian leaves for a more permanent rendezvous: 0.999024^1 × 0.999024^2 × … × 0.999024^566, or in Haskell:

product (map (\x -> 0.999024**x) [1..566])
-- 8.952743340164081e-69

(A little surprisingly, Wolfram Alpha can solve the TeX expression too.)

Given the use of floating point in that computation (566 floating-point exponentiations followed by as many multiplications) and the horror stories about floating point, one might worry the answer is wrong & the real probability is much larger. We can retry with an implementation of computable reals, CReal, which can be very slow but should give more precise answers:

:module + Data.Number.CReal
showCReal 100 (product (map (\x -> 0.999024**x) [1..566]))

Looks good: the computable-real answer agrees with the floating-point version up to the 11th digit:

8.9527433401 64081e-69
8.9527433401 308585720915431195262

We can also check by rewriting the product to avoid all the exponentiation and multiplication (which might cause issues) in favor of a single exponentiation:

  1. ∏(n=1…566) 0.999024^n (as before)
  2. = 0.999024^(1+2+…+566) (since x^a × x^b = x^(a+b))
  3. = 0.999024^((566 × (1 + 566)) / 2) (by the arithmetic series/Gauss's famous classroom trick, since 1+2+…+k = (k × (k+1)) / 2)
  4. = 0.999024^((566 × 567) / 2) (start substituting in specific values)
  5. = 0.999024^(320922 / 2)
  6. = 0.999024^160461



Or, to go back to the longer version:

0.999024**((566*(1 + 566)) / 2)

Also close. All probabilities of success are minute.
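The two routes can also be checked against each other directly; in Python (same constants), the brute-force product and the single Gauss-sum exponentiation agree to well past the 9th digit:

```python
p = 0.999024

prod = 1.0
for n in range(1, 567):        # p^1 * p^2 * ... * p^566
    prod *= p ** n

closed = p ** (566 * 567 / 2)  # = p^160461, via 1+2+...+566 = (566*567)/2

print(prod)                            # ~8.95e-69
print(abs(prod / closed - 1) < 1e-9)   # True: the shortcut matches the product
```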

How fast would you have to be if you wanted to at least try to accomplish the tour with, say, a 50-50 chance?

Well, that's easy: you can consider the probability of all of them surviving one day, and as we saw earlier, that's 0.999024^566 ≈ 0.58; two days would be 0.58^2 ≈ 0.33. So you can only take a little over a day before you've probabilistically lost & one of them has died; if you hit all 566 centenarians in 24 hours, that's ~24 centenarians per hour, or ~2.5 minutes to chat with each one and travel to the next. If you're trying to collect DNA samples, better hope they're all awake and able to give consent!
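The 50-50 time budget can be computed directly (Python): solve (0.999024^566)^d = 0.5 for d days, then divide that budget across all 566 visits:

```python
import math

p_day = 0.999024 ** 566                # chance all 566 survive one more day, ~0.58
d = math.log(0.5) / math.log(p_day)    # days until the tour is a coin-flip
print(round(d, 2))                     # ~1.25 days
print(round(d * 24 * 60 / 566, 1))     # ~3.2 minutes per centenarian, travel included
```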

So it is safe to say you will probably not be able to manage the Traveling Gerontologist's tour.

Bayes nets

Daily weight data graph

As the datasets I'm interested in grow in number of variables, it becomes harder to justify doing analysis by simply writing down a simple linear model with a single dependent variable, throwing in the independent variables, and maybe a few transformations chosen by hand. I can instead write down some simultaneous-equations/structural-equation models (SEMs), but while it's usually obvious what to do for k < 4 variables (and if it's not, I can compare the possible variants), at 4 variables the right SEM becomes questionable, and at >5 it's hopeless. Factor analysis to extract some latent variables is a possibility, but the more general solution here seems to be probabilistic graphical models such as Bayesian networks.

I thought I'd try out some Bayes net inference on some of my datasets. In this case, I have ~150 daily measurements from my Omron body composition scale, measuring total weight, body fat percentage, and some other things (see an Omron manual):

  1. Total weight
  2. BMI
  3. Body fat percentage
  4. Muscle percentage
  5. Resting metabolism in calories
  6. "Body age"
  7. Visceral fat index

The 7 variables are interrelated, so this is definitely a case where a simple lm is not going to do the trick. It's also not 100% clear how to set up a SEM: some definitions are obvious (the much-criticized BMI is going to be determined solely by total weight; muscle and fat percentage might be inversely related) but others are not (how does "visceral fat" relate to body fat?). And it's not a hopelessly small amount of data.

The Bayes net R library I'm trying out is bnlearn (paper).

weight <- read.csv("selfexperiment/weight.csv")
weight$Date <- NULL; weight$Weight.scale <- NULL
# remove missing data
weightC <- na.omit(weight)
# bnlearn can't handle integers, oddly enough
weightC <-, as.numeric))
summary(weightC)
#   Weight.Omron        Weight.BMI        Weight.body.fat    Weight.muscle
#  Min.   :193.0000   Min.   : 26.90000   Min.   :27.00000   Min.   :32.60000
#  1st Qu.:195.2000   1st Qu.: 27.20000   1st Qu.:28.40000   1st Qu.:34.20000
#  Median :196.4000   Median : 27.40000   Median :28.70000   Median :34.50000
#  Mean   :196.4931   Mean   : 28.95409   Mean   :28.70314   Mean   :34.47296
#  3rd Qu.:197.8000   3rd Qu.: 27.60000   3rd Qu.:29.10000   3rd Qu.:34.70000
#  Max.   :200.6000   Max.   : 28.00000   Max.   :31.70000   Max.   :35.50000
#  Weight.resting.metabolism Weight.body.age    Weight.visceral.fat
#  Min.   :1857.000          Min.   :52.00000   Min.   : 9.000000
#  1st Qu.:1877.000          1st Qu.:53.00000   1st Qu.:10.000000
#  Median :1885.000          Median :53.00000   Median :10.000000
#  Mean   :1885.138          Mean   :53.32704   Mean   : 9.949686
#  3rd Qu.:1893.000          3rd Qu.:54.00000   3rd Qu.:10.000000
#  Max.   :1914.000          Max.   :56.00000   Max.   :11.000000
cor(weightC)
#                             Weight.Omron     Weight.BMI Weight.body.fat  Weight.muscle
# Weight.Omron               1.00000000000  0.98858376919    0.1610643221 -0.06976934825
# Weight.BMI                 0.98858376919  1.00000000000    0.1521872557 -0.06231142104
# Weight.body.fat            0.16106432213  0.15218725566    1.0000000000 -0.98704369855
# Weight.muscle             -0.06976934825 -0.06231142104   -0.9870436985  1.00000000000
# Weight.resting.metabolism  0.96693236051  0.95959140245   -0.0665001241  0.15621294274
# Weight.body.age            0.82581939626  0.81286141659    0.5500409365 -0.47408608681
# Weight.visceral.fat        0.41542744168  0.43260100665    0.2798756916 -0.25076619829
#                           Weight.resting.metabolism Weight.body.age Weight.visceral.fat
# Weight.Omron                           0.9669323605    0.8258193963        0.4154274417
# Weight.BMI                             0.9595914024    0.8128614166        0.4326010067
# Weight.body.fat                       -0.0665001241    0.5500409365        0.2798756916
# Weight.muscle                          0.1562129427   -0.4740860868       -0.2507661983
# Weight.resting.metabolism              1.0000000000    0.7008354776        0.3557229425
# Weight.body.age                        0.7008354776    1.0000000000        0.4840752389
# Weight.visceral.fat                    0.3557229425    0.4840752389        1.0000000000

## create alternate dataset expressing the two percentage variables as pounds, since this might fit better
weightC2 <- weightC
weightC2$Weight.body.fat <- weightC2$Weight.Omron * (weightC2$Weight.body.fat / 100)
weightC2$Weight.muscle   <- weightC2$Weight.Omron * (weightC2$Weight.muscle / 100)

Begin analysis:

pdap <- hc(weightC)
pdapc2 <- hc(weightC2)
## bigger is better:
score(pdap, weightC)
# [1] -224.2563072
score(pdapc2, weightC2)
# [1] -439.7811072
## stick with the original, then
pdap
#   Bayesian network learned via Score-based methods
#   model:
#    [Weight.Omron][Weight.body.fat][Weight.BMI|Weight.Omron]
#    [Weight.resting.metabolism|Weight.Omron:Weight.body.fat]
#    [Weight.body.age|Weight.Omron:Weight.body.fat]
#    [Weight.muscle|Weight.body.fat:Weight.resting.metabolism][Weight.visceral.fat|Weight.body.age]
#   nodes:                                 7
#   arcs:                                  8
#     undirected arcs:                     0
#     directed arcs:                       8
#   average markov blanket size:           2.57
#   average neighbourhood size:            2.29
#   average branching factor:              1.14
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              2.534452101
#   tests used in the learning procedure:  69
#   optimized:                             TRUE

This inferred graph is obviously wrong in several respects, violating prior knowledge about some of the relationships.

More specifically, my prior knowledge:

  • Weight.Omron == total weight; should be influenced by Weight.body.fat (%), Weight.muscle (%), & Weight.visceral.fat

  • Weight.visceral.fat: ordinal variable, <=9 = normal; 10-14 = high; 15+ = very high; from the Omron manual:

    Visceral fat area (0-approx. 300 cm², 1 inch = 2.54 cm) distribution with 30 levels. NOTE: Visceral fat levels are relative and not absolute values.

  • Weight.BMI: BMI is a simple function of total weight & height (specifically BMI = round(weight / height^2)), so it should be influenced only by Weight.Omron, and influence nothing else

  • Weight.body.age: should be influenced by Weight.Omron, Weight.body.fat, and Weight.muscle, based on the description in the manual:

    Body age is based on your resting metabolism. Body age is calculated by using your weight, body fat percentage and skeletal muscle percentage to produce a guide to whether your body age is above or below the average for your actual age.

  • Weight.resting.metabolism: a function of the others, but I'm not sure which exactly; the manual talks about what resting metabolism is generically, and specifies it has the range "385 to 3999 kcal with 1 kcal increments"; this suggests the Omron may be using one of several approximation equations based on age/sex/height/weight, but it might also be using lean body mass as well.
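As a sanity check on the BMI prior: if the Omron computes BMI with the standard imperial formula (703 × pounds / inches²; an assumption, as is the constant height), then the near-deterministic slope the network fits for Weight.BMI on Weight.Omron (0.1411 in the output below) should encode a single fixed height:

```python
import math

# hypothetical back-of-envelope: BMI = 703 * weight_lb / height_in^2, so the fitted
# slope of BMI on weight implies height_in = sqrt(703 / slope)
slope = 0.1411044216         # fitted coefficient of Weight.BMI | Weight.Omron
height_in = math.sqrt(703 / slope)
print(round(height_in, 1))   # ~70.6 inches: one constant, plausible height, as expected
```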

Unfortunately, bnlearn doesn't seem to support any easy way of encoding this sort of prior knowledge (for example, you can't say 'no outgoing arrows from node X'), so I iterate, adding bad arrows to the blacklist.
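The bookkeeping itself is mechanical, though: a blacklist is just a from/to table of forbidden arcs, so a constraint like 'no outgoing arrows from node X' can be expanded into pairs programmatically rather than discovered one violation at a time (a Python sketch of the idea; in R one would build the same two-column data frame, and bnlearn's tiers2blacklist helper covers the special case of a temporal ordering):

```python
def no_outgoing(node, all_nodes):
    """Forbid every arc node -> other, i.e. 'node is a sink'."""
    return [(node, other) for other in all_nodes if other != node]

def no_incoming(node, all_nodes):
    """Forbid every arc other -> node, i.e. 'node is a source'."""
    return [(other, node) for other in all_nodes if other != node]

nodes = ["Weight.Omron", "Weight.BMI", "Weight.body.fat", "Weight.muscle",
         "Weight.resting.metabolism", "Weight.body.age", "Weight.visceral.fat"]

# eg. BMI and body age are derived quantities, so they should cause nothing:
blacklist = no_outgoing("Weight.BMI", nodes) + no_outgoing("Weight.body.age", nodes)
print(len(blacklist))   # 12 forbidden arcs from just two sink constraints
```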

Which arrows violate prior knowledge?

  • [Weight.visceral.fat|Weight.body.age] (read backwards, as Weight.body.age → Weight.visceral.fat)
  • [Weight.muscle|Weight.resting.metabolism]

Retry, blacklisting those 2 arrows:

pdap2 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism"), to=c("Weight.visceral.fat","Weight.muscle")))

New violations:

  • [Weight.visceral.fat|Weight.BMI]
  • [Weight.muscle|Weight.Omron]
pdap3 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle")))

New violations:

  • [Weight.visceral.fat|Weight.Omron]
  • [Weight.muscle|Weight.BMI]
pdap4 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron", "Weight.Omron", "Weight.BMI"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.visceral.fat", "Weight.muscle")))

One violation:

  • [Weight.muscle|Weight.body.age]
pdap5 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron", "Weight.Omron", "Weight.BMI", "Weight.body.age"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.muscle")))
#   Bayesian network learned via Score-based methods
#   model:
#    [Weight.body.fat][Weight.muscle|Weight.body.fat][Weight.visceral.fat|Weight.body.fat]
#    [Weight.Omron|Weight.visceral.fat][Weight.BMI|Weight.Omron]
#    [Weight.resting.metabolism|Weight.Omron:Weight.body.fat]
#    [Weight.body.age|Weight.Omron:Weight.body.fat]
#   nodes:                                 7
#   arcs:                                  8
#     undirected arcs:                     0
#     directed arcs:                       8
#   average markov blanket size:           2.57
#   average neighbourhood size:            2.29
#   average branching factor:              1.14
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              2.534452101
#   tests used in the learning procedure:  62
#   optimized:                             TRUE

## implementing all the prior knowledge cost ~30:
score(pdap5, weightC)
# [1] -254.6061724

No violations, so let's use the network and estimate the specific parameters:

fit <-, weightC); fit
#   Bayesian network parameters
#   Parameters of node Weight.Omron (Gaussian distribution)
# Conditional density: Weight.Omron | Weight.visceral.fat
# Coefficients:
#         (Intercept)  Weight.visceral.fat
#       169.181651376          2.744954128
# Standard deviation of the residuals: 1.486044472
#   Parameters of node Weight.BMI (Gaussian distribution)
# Conditional density: Weight.BMI | Weight.Omron
# Coefficients:
#   (Intercept)   Weight.Omron
# -0.3115772322   0.1411044216
# Standard deviation of the residuals: 0.03513413381
#   Parameters of node Weight.body.fat (Gaussian distribution)
# Conditional density: Weight.body.fat
# Coefficients:
# (Intercept)
# 28.70314465
# Standard deviation of the residuals: 0.644590085
#   Parameters of node Weight.muscle (Gaussian distribution)
# Conditional density: Weight.muscle | Weight.body.fat
# Coefficients:
#     (Intercept)  Weight.body.fat
#   52.1003347352    -0.6141270921
# Standard deviation of the residuals: 0.06455478599
#   Parameters of node Weight.resting.metabolism (Gaussian distribution)
# Conditional density: Weight.resting.metabolism | Weight.Omron + Weight.body.fat
# Coefficients:
#     (Intercept)     Weight.Omron  Weight.body.fat
#   666.910582196      6.767607964     -3.886694779
# Standard deviation of the residuals: 1.323176507
#   Parameters of node Weight.body.age (Gaussian distribution)
# Conditional density: Weight.body.age | Weight.Omron + Weight.body.fat
# Coefficients:
#     (Intercept)     Weight.Omron  Weight.body.fat
#  -32.2651379176     0.3603672788     0.5150134225
# Standard deviation of the residuals: 0.2914301529
#   Parameters of node Weight.visceral.fat (Gaussian distribution)
# Conditional density: Weight.visceral.fat | Weight.body.fat
# Coefficients:
#     (Intercept)  Weight.body.fat
#    6.8781100009     0.1070118125
# Standard deviation of the residuals: 0.2373649058
## residuals look fairly good, except for Weight.resting.metabolism, where there are some extreme residuals in what looks a bit like a sigmoid sort of pattern, suggesting nonlinearities in the Omron scale's formula?

We can double-check the estimates here by turning the Bayes net model into a SEM and seeing how the estimates compare, and also seeing whether the p-values suggest we've found a good model:

Weight.model1 <- '
    Weight.visceral.fat ~ Weight.body.fat
    Weight.Omron ~ Weight.visceral.fat
    Weight.BMI ~ Weight.Omron
    Weight.body.age ~ Weight.Omron + Weight.body.fat
    Weight.muscle ~ Weight.body.fat
    Weight.resting.metabolism ~ Weight.Omron + Weight.body.fat
Weight.fit1 <- sem(model = Weight.model1,  data = weightC)
summary(Weight.fit1)
# lavaan (0.5-16) converged normally after 139 iterations
#   Number of observations                           159
#   Estimator                                         ML
#   Minimum Function Test Statistic               71.342
#   Degrees of freedom                                 7
#   P-value (Chi-square)                           0.000
# Parameter estimates:
#   Information                                 Expected
#   Standard Errors                             Standard
#                    Estimate  Std.err  Z-value  P(>|z|)
# Regressions:
#   Weight.visceral.fat ~
#     Weight.bdy.ft     0.107    0.029    3.676    0.000
#   Weight.Omron ~
#     Wght.vscrl.ft     2.745    0.477    5.759    0.000
#   Weight.BMI ~
#     Weight.Omron      0.141    0.002   82.862    0.000
#   Weight.body.age ~
#     Weight.Omron      0.357    0.014   25.162    0.000
#     Weight.bdy.ft     0.516    0.036   14.387    0.000
#   Weight.muscle ~
#     Weight.bdy.ft    -0.614    0.008  -77.591    0.000
#   Weight.resting.metabolism ~
#     Weight.Omron      6.730    0.064  104.631    0.000
#     Weight.bdy.ft    -3.860    0.162  -23.837    0.000
# Covariances:
#   Weight.BMI ~~
#     Weight.body.g    -0.000    0.001   -0.116    0.907
#     Weight.muscle    -0.000    0.000   -0.216    0.829
#     Wght.rstng.mt     0.005    0.004    1.453    0.146
#   Weight.body.age ~~
#     Weight.muscle     0.001    0.001    0.403    0.687
#     Wght.rstng.mt    -0.021    0.030   -0.700    0.484
#   Weight.muscle ~~
#     Wght.rstng.mt     0.007    0.007    1.003    0.316
# Variances:
#     Wght.vscrl.ft     0.056    0.006
#     Weight.Omron      2.181    0.245
#     Weight.BMI        0.001    0.000
#     Weight.body.g     0.083    0.009
#     Weight.muscle     0.004    0.000
#     Wght.rstng.mt     1.721    0.193

Comparing the coefficients by eye, they tend to be quite close (usually within 0.1), and the p-values are all statistically-significant.

The network itself looks right, although some of the edges are surprises: I didn't know visceral fat was predictable from body fat (I thought they were measuring separate things), and the relative independence of muscle suggests that in any exercise plan I might be better off focusing on the body fat percentage rather than the muscle percentage, since the former may be effectively determining the latter.

So what did I learn here?

  • learning network structure and direction of arrows is hard; even with only 7 variables and n = 159 (accurate, clean data), the hill-climbing algorithm will learn at least 7 wrong arcs.

    • and the derived graphs depend disturbingly heavily on the choice of algorithm; I used the hc hill-climbing algorithm (since I'm lazy and didn't want to specify arrow directions), but when I try out alternatives like iamb on the same data & blacklist, the found graph looks rather different
  • Gaussians are, as always, sensitive to outliers: I was surprised the first graph didn't show BMI connected to anything, so I took a closer look and found I had miscoded a BMI of 28 as 280!

  • bnlearn, while not as hard to use as I expected, could still use usability improvements: I should not need to coerce integer data into exactly equivalent numeric types just because bnlearn doesn't recognize integers; and blacklisting/whitelisting needs to be more powerful (iteratively generating graphs, manually inspecting them, and manually blacklisting is tedious and does not scale)

    • hence, it may make more sense to find a graph using bnlearn and then convert it into simultaneous equations and manipulate it using more mature SEM libraries

Zeo sleep data

Here I look at my Zeo sleep data: more variables, more complex relations, and more unknown ones, but on the positive side, ~12× more data to work with.

zeo <- read.csv("~/wiki/docs/zeo/gwern-zeodata.csv")
zeo$Sleep.Date <- as.Date(zeo$Sleep.Date, format="%m/%d/%Y")

## convert "05/12/2014 06:45" to "06:45"
zeo$Start.of.Night <- sapply(strsplit(as.character(zeo$Start.of.Night), " "), function(x) { x[2] })
## convert "06:45" to minutes-since-midnight (6*60+45 = 405); entries recorded as raw seconds ("NNN s") are taken as integers
interval <- function(x) { if ( { NA }
                          else { if (grepl(" s",x)) { as.integer(sub(" s","",x)) }
                                 else { y <- unlist(strsplit(x, ":")); as.integer(y[[1]])*60 + as.integer(y[[2]]) } } }
zeo$Start.of.Night <- sapply(zeo$Start.of.Night, interval)
## correct for the switch to new unencrypted firmware in March 2013;
## I don't know why the new firmware subtracts 15 hours
zeo[(zeo$Sleep.Date >= as.Date("2013-03-11")),]$Start.of.Night <- (zeo[(zeo$Sleep.Date >= as.Date("2013-03-11")),]$Start.of.Night + 900) %% (24*60)

## after midnight (24*60=1440), Start.of.Night wraps around to 0, which obscures any trends,
## so we'll map anything before 7AM to time+1440
zeo[zeo$Start.of.Night<420 & !$Start.of.Night),]$Start.of.Night <- (zeo[zeo$Start.of.Night<420 & !$Start.of.Night),]$Start.of.Night + (24*60))

zeoSmall <- subset(zeo, select=c(ZQ,Total.Z,Time.to.Z,Time.in.Wake,Time.in.REM,Time.in.Light,Time.in.Deep,Awakenings,Start.of.Night,Morning.Feel))
zeoClean <- na.omit(zeoSmall)
# bnlearn doesn't like the 'integer' class that most of the data-frame is in
zeoClean <-, as.numeric))

Prior knowledge:

  • Start.of.Night is temporally first, and cannot be caused
  • Time.to.Z is temporally second; it can be influenced by Start.of.Night (likely a connection between how late I go to bed and how fast I fall asleep) and can influence Time.in.Wake (since if it takes 10 minutes to fall asleep, I must spend ≥10 minutes in wake), but not others
  • Morning.Feel is temporally last, and cannot cause anything
  • ZQ is a synthetic variable invented by Zeo according to an opaque formula, which cannot cause anything but is determined by the others
  • Total.Z should be the sum of Time.in.REM, Time.in.Light, and Time.in.Deep
  • Awakenings should have an arrow with Time.in.Wake, but it's not clear which way it should run
## after a bunch of iteration, blacklisting arrows which violate the prior knowledge
bl <- data.frame(from=c("Morning.Feel", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "", "", "", "Morning.Feel", "Awakenings", "", "Morning.Feel", "Morning.Feel","Total.Z", "", "", "Total.Z", "Total.Z", "Total.Z"),
                 to=c("Start.of.Night", "Total.Z", "", "", "", "Morning.Feel","Start.of.Night", "Start.of.Night","Start.of.Night","Start.of.Night", "", "", "", "Total.Z", "","","", "Start.of.Night", "", "", ""))

zeo.hc <- hc(zeoClean, blacklist=bl)
zeo.iamb         <- iamb(zeoClean, blacklist=bl)
## problem: iamb leaves one arc undirected; hc inferred a direction for it, which I'll copy for iamb:
zeo.iamb <- set.arc(zeo.iamb, from = "", to = "") <- gs(zeoClean, blacklist=bl)
## same undirected arc, so direct it the same way: <- set.arc(, from = "", to = "")

## Bigger is better:
score(zeo.iamb, data=zeoClean)
# [1] -44776.79185
score(, data=zeoClean)
# [1] -44776.79185
score(zeo.hc, data=zeoClean)
# [1] -44557.6952
## hc scores best, so let's look at it:
#   Bayesian network learned via Score-based methods
#   model:
#    [Start.of.Night][|Start.of.Night][|]
#    [|][|]
#    [Total.Z|][|]
#    [Awakenings|]
#    [Morning.Feel|]
#    [ZQ|]
#   nodes:                                 10
#   arcs:                                  28
#     undirected arcs:                     0
#     directed arcs:                       28
#   average markov blanket size:           7.40
#   average neighbourhood size:            5.60
#   average branching factor:              2.80
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              3.614556939
#   tests used in the learning procedure:  281
#   optimized:                             TRUE


fit <-, zeoClean); fit
#   Bayesian network parameters
#   Parameters of node ZQ (Gaussian distribution)
# Conditional density: ZQ | Total.Z + + + + Awakenings
# Coefficients:
#    (Intercept)         Total.Z      Awakenings
# -0.12468522173   0.14197043518  -0.07103211437   0.07053271816   0.21121000076  -0.56476256303
# Standard deviation of the residuals: 0.3000223604
#   Parameters of node Total.Z (Gaussian distribution)
# Conditional density: Total.Z | + Start.of.Night
# Coefficients:
#    (Intercept)  Start.of.Night
# 907.6406157850   -0.4479377278   -0.2680771514
# Standard deviation of the residuals: 68.90853885
#   Parameters of node Time.to.Z (Gaussian distribution)
# Conditional density: Time.to.Z | Start.of.Night
# Coefficients:
#    (Intercept)  Start.of.Night
# -1.02898431407   0.01568450832
# Standard deviation of the residuals: 13.51606719
#   Parameters of node Time.in.Wake (Gaussian distribution)
# Conditional density: Time.in.Wake | Time.to.Z
# Coefficients:
#   (Intercept)      Time.to.Z
# 14.7433880499   0.3289378711
# Standard deviation of the residuals: 19.0906685
#   Parameters of node Time.in.REM (Gaussian distribution)
# Conditional density: Time.in.REM | Total.Z + Start.of.Night
# Coefficients:
#      (Intercept)           Total.Z    Start.of.Night
# -120.62442964234     0.37864195651     0.06275760841
# Standard deviation of the residuals: 19.32560757
#   Parameters of node Time.in.Light (Gaussian distribution)
# Conditional density: Time.in.Light | Total.Z + Time.in.REM + Time.in.Deep
# Coefficients:
#   (Intercept)        Total.Z    Time.in.REM   Time.in.Deep
#  0.6424267863   0.9997862624  -1.0000587988  -1.0001805537
# Standard deviation of the residuals: 0.5002896274
#   Parameters of node Time.in.Deep (Gaussian distribution)
# Conditional density: | Total.Z +
# Coefficients:
#   (Intercept)        Total.Z
# 15.4961459056   0.1283622577  -0.1187382535
# Standard deviation of the residuals: 11.90756843
#   Parameters of node Awakenings (Gaussian distribution)
# Conditional density: Awakenings | + + + + Start.of.Night
# Coefficients:
#     (Intercept)
# -18.41014329148    0.02605164827    0.05736596152    0.02291139969    0.01060661963
#  Start.of.Night
#   0.01129521977
# Standard deviation of the residuals: 2.427868657
#   Parameters of node Start.of.Night (Gaussian distribution)
# Conditional density: Start.of.Night
# Coefficients:
# (Intercept)
# 1413.382886
# Standard deviation of the residuals: 64.43144125
#   Parameters of node Morning.Feel (Gaussian distribution)
# Conditional density: Morning.Feel | Total.Z + + + + Start.of.Night
# Coefficients:
#     (Intercept)          Total.Z
# -0.924662971061   0.004808652252  -0.010127269154  -0.008636841492  -0.002766602019
#  Start.of.Night
#  0.001672816480
# Standard deviation of the residuals: 0.7104115719

## some issues with big residuals at the extremes in a few of the variables;
## not sure how to fix those

library(lavaan)
Zeo.model1 <- ' ~ Start.of.Night ~ Total.Z +
    Awakenings ~ + + + + Start.of.Night ~ + Start.of.Night ~ + Start.of.Night ~ + + Start.of.Night
    Total.Z ~ + +
    ZQ ~ Total.Z + + + + Awakenings
    Morning.Feel ~ Total.Z + + + + Start.of.Night'
Zeo.fit1 <- sem(model = Zeo.model1, data = zeoClean)
# lavaan (0.5-16) converged normally after 183 iterations
#   Number of observations                          1379
#   Estimator                                         ML
#   Minimum Function Test Statistic               22.737
#   Degrees of freedom                                16
#   P-value (Chi-square)                           0.121
# Parameter estimates:
#   Information                                 Expected
#   Standard Errors                             Standard
#                    Estimate  Std.err  Z-value  P(>|z|)
# Regressions:
# ~
#     Start.of.Nght     0.016    0.006    2.778    0.005
# ~
#     Total.Z          -0.026    0.007   -3.592    0.000
#         0.314    0.038    8.277    0.000
#   Awakenings ~
#         0.026    0.005    5.233    0.000
#      0.057    0.003   16.700    0.000
#       0.023    0.002   10.107    0.000
#     0.011    0.002    6.088    0.000
#     Start.of.Nght     0.011    0.001   10.635    0.000
# ~
#        -0.348    0.085   -4.121    0.000
#     Start.of.Nght    -0.195    0.018  -10.988    0.000
# ~
#     0.358    0.018   19.695    0.000
#     Start.of.Nght     0.034    0.013    2.725    0.006
# ~
#       0.081    0.012    6.657    0.000
#     0.034    0.009    3.713    0.000
#     Start.of.Nght    -0.017    0.006   -3.014    0.003
#   Total.Z ~
#       1.000    0.000 2115.859    0.000
#     1.000    0.000 2902.045    0.000
#      1.000    0.001  967.322    0.000
#   ZQ ~
#     Total.Z           0.142    0.000  683.980    0.000
#     -0.071    0.000 -155.121    0.000
#       0.071    0.000  167.090    0.000
#      0.211    0.001  311.454    0.000
#     Awakenings       -0.565    0.003 -178.407    0.000
#   Morning.Feel ~
#     Total.Z           0.005    0.001    8.488    0.000
#        -0.010    0.001   -6.948    0.000
#     -0.009    0.001   -8.592    0.000
#    -0.003    0.001   -2.996    0.003
#     Start.of.Nght     0.002    0.000    5.414    0.000

Again no major surprises, but one thing I notice is that ZQ does not seem to connect to one of the variables which does connect to Morning.Feel; I've long suspected that ZQ is a flawed summary and thought it was insufficiently taking into account wakes or something else, so it looks like that variable is specifically what is missing. Start.of.Night is also more highly connected than I had expected.

Comparing graphs from the 3 algorithms, they don't seem to differ as badly as the weight ones did. Is this thanks to the much greater data or the constraints?

Genome sequencing costs

# converted to CSV & deleted cost per base (less precision); CSV looks like:
## Date, Cost per Genome
## Sep-01,"$95,263,072"
## ...
sequencing <- read.csv("sequencing_costs_apr2014.csv")
sequencing$Cost.per.Genome <- as.integer(gsub(",", "", sub("\\$", "", as.character(sequencing$Cost.per.Genome))))
# interpret month-years as first of month:
sequencing$Date <- as.Date(paste0("01-", as.character(sequencing$Date)), format="%d-%b-%y")
##         Date Cost.per.Genome
## 1 2001-09-01        95263072
## 2 2002-03-01        70175437
## 3 2002-09-01        61448422
## 4 2003-03-01        53751684
## 5 2003-10-01        40157554
## 6 2004-01-01        28780376

l <- lm(log(Cost.per.Genome) ~ Date, data=sequencing); summary(l)
## Coefficients:
##                 Estimate   Std. Error  t value   Pr(>|t|)
## (Intercept) 50.969823683  1.433567932  35.5545 < 2.22e-16
## Date        -0.002689621  0.000101692 -26.4486 < 2.22e-16
## Residual standard error: 0.889707 on 45 degrees of freedom
## Multiple R-squared:  0.939559,   Adjusted R-squared:  0.938216
## F-statistic: 699.528 on 1 and 45 DF,  p-value: < 2.22e-16
plot(log(Cost.per.Genome) ~ Date, data=sequencing)
# as expected: linear in log (Moore's law) 2002-2008, sudden drop, return to Moore's law-ish ~December 2011?
# but on the other hand, maybe the post-December 2011 behavior is a continuation of the curve
# 2 break-points / 3 segments:
piecewise <- segmented(l, seg.Z=~Date, psi=list(Date=c(13970, 16071)))
## Estimated Break-Point(s):
##             Est. St.Err
## psi1.Date 12680 1067.0
## psi2.Date 13200  279.8
## t value for the gap-variable(s) V:  0 0 2
## Meaningful coefficients of the linear terms:
##                 Estimate   Std. Error  t value   Pr(>|t|)
## (Intercept) 35.841699121  8.975628264  3.99322 0.00026387
## Date        -0.001504431  0.000738358 -2.03754 0.04808491
## U1.Date      0.000679538  0.002057940  0.33020         NA
## U2.Date     -0.002366688  0.001926528 -1.22847         NA
## Residual standard error: 0.733558 on 41 degrees of freedom
## Multiple R-Squared: 0.962565,  Adjusted R-squared:   0.958
with(sequencing, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewise, add=T)
# The first two segments look fine, but the residuals are clearly bad for the third line-segment:
# it undershoots (damaging the second segment's fit), overshoots, then undershoots again. Let's try again with more breakpoints:

lots <- segmented(l, seg.Z=~Date, psi=list(Date=NA), control=seg.control(stop.if.error=FALSE, n.boot=0))
summary(segmented(l, seg.Z=~Date, psi=list(Date=as.Date(c(12310, 12500, 13600, 13750,  14140,  14680,  15010, 15220), origin = "1970-01-01", tz = "EST"))))
# delete every breakpoint below t-value of ~|2.3|, for 3 breakpoints / 4 segments:
piecewise2 <- segmented(l, seg.Z=~Date, psi=list(Date=as.Date(c("2007-08-25","2008-09-18","2010-03-12"))))
with(sequencing, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewise2, add=T)

# the additional break-point is used up on a better fit in the curve. It looks like an exponential decay/asymptote,
# so let's work on fitting that part of the graph, the post-2007 curve:
sequencingRecent <- sequencing[sequencing$Date>as.Date("2007-10-01"),]
lR <- lm(log(Cost.per.Genome) ~ Date, data=sequencingRecent); summary(lR)
piecewiseRecent <- segmented(lR, seg.Z=~Date, psi=list(Date=c(14061, 16071))); summary(piecewiseRecent)
## Estimated Break-Point(s):
##             Est. St.Err
## psi1.Date 14290  36.31
## psi2.Date 15290  48.35
## t value for the gap-variable(s) V:  0 0
## Meaningful coefficients of the linear terms:
##                 Estimate   Std. Error   t value   Pr(>|t|)
## (Intercept)  1.13831e+02  6.65609e+00  17.10182 2.0951e-13
## Date        -7.13247e-03  4.73332e-04 -15.06865 2.2121e-12
## U1.Date      4.11492e-03  4.94486e-04   8.32161         NA
## U2.Date      2.48613e-03  2.18528e-04  11.37668         NA
## Residual standard error: 0.136958 on 20 degrees of freedom
## Multiple R-Squared: 0.995976,  Adjusted R-squared: 0.994971

with(sequencingRecent, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewiseRecent, add=T)

lastPiece <- lm(log(Cost.per.Genome) ~ Date, data=sequencingRecent[as.Date(15290, origin = "1970-01-01", tz = "EST")<sequencingRecent$Date,]); summary(lastPiece)
## Coefficients:
##                 Estimate   Std. Error  t value   Pr(>|t|)
## (Intercept) 17.012409648  1.875482507  9.07095 1.7491e-05
## Date        -0.000531621  0.000119056 -4.46528  0.0020963
## Residual standard error: 0.0987207 on 8 degrees of freedom
## Multiple R-squared:  0.71366,    Adjusted R-squared:  0.677867
with(sequencingRecent[as.Date(15290, origin = "1970-01-01", tz = "EST") < sequencingRecent$Date,],
    plot(Date, log(Cost.per.Genome), pch=16)); abline(lastPiece)

predictDays <- seq(from=sequencing$Date[1], to=as.Date("2030-12-01"), by="month")
lastPiecePredict <- data.frame(Date = predictDays, Cost.per.Genome=c(sequencing$Cost.per.Genome, rep(NA, 305)), Cost.per.Genome.predicted = exp(predict(lastPiece, newdata = data.frame(Date = predictDays))))

nlmR <- nls(log(Cost.per.Genome) ~ SSasymp(as.integer(Date), Asym, r0, lrc), data=sequencingRecent); summary(nlmR)
## Parameters:
##          Estimate   Std. Error    t value Pr(>|t|)
## Asym  7.88908e+00  1.19616e-01   65.95328   <2e-16
## r0    1.27644e+08  1.07082e+08    1.19203   0.2454
## lrc  -6.72151e+00  5.05221e-02 -133.04110   <2e-16
## Residual standard error: 0.150547 on 23 degrees of freedom
with(sequencingRecent, plot(Date, log(Cost.per.Genome))); lines(sequencingRecent$Date, predict(nlmR), col=2)

# side by side:
with(sequencingRecent, plot(Date, log(Cost.per.Genome), pch=16))
plot(piecewiseRecent, add=TRUE, col=2)
lines(sequencingRecent$Date, predict(nlmR), col=3)
# as we can see, the 3-piece linear fit and the exponential decay fit identically;
# but exponential decay is more parsimonious, IMO, so I prefer that.

predictDays <- seq(from=sequencingRecent$Date[1], to=as.Date("2020-12-01"), by="month")
data.frame(Date = predictDays, Cost.per.Genome.predicted = exp(predict(nlmR, newdata = data.frame(Date = predictDays))))
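As an aside, the `SSasymp` coefficients above imply both a long-run price floor and a rate of approach to it (the `Date` covariate is in days, so `exp(lrc)` is a per-day rate):

```r
# using the fitted coefficients from the summary above:
asym <- 7.88908; lrc <- -6.72151
exp(asym)         # implied long-run price floor: ~$2,670 per genome
log(2) / exp(lrc) # half-life of the decay toward that floor: ~575 days
```

If the asymptotic model is right, prices approach ~$2,670 rather than falling indefinitely.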

Genome sequencing historically has dropped in price ~18% per year. Consider this simple scenario: if we have a fixed amount of money to spend buying genomes, and we can afford to buy 1 genome in the first year, then the next year we can buy 1.22 genomes, then 1.49 genomes, and so on, until in 30 years we can afford to buy 385 genomes each year. The number we can afford in year x is:

sapply(0:30, function(x) 1/(0.82^x))
#  [1]   1.000000000   1.219512195   1.487209994   1.813670724   2.211793566   2.697309227   3.289401497   4.011465240   4.892030780
# [10]   5.965891196   7.275477068   8.872533010  10.820162207  13.195319764  16.091853371  19.624211428  23.931965156  29.185323361
# [19]  35.591857758  43.404704583  52.932566564  64.551910444  78.721842005  96.002246348 117.075910180 142.775500220 174.116463682
# [28] 212.337150832 258.947744917 315.789932826 385.109674178

Genomes are unlike computation, though, as they are data rather than an ephemeral service. Each genome is still useful and accumulates in a database. How many genomes total do we have each year? Quite a lot:

cumsum(sapply(0:30, function(x) 1/(0.82^x)))
#  [1]    1.000000000    2.219512195    3.706722189    5.520392914    7.732186480   10.429495707   13.718897204   17.730362444
#  [9]   22.622393224   28.588284420   35.863761488   44.736294497   55.556456704   68.751776468   84.843629839  104.467841268
# [17]  128.399806424  157.585129785  193.176987543  236.581692126  289.514258690  354.066169134  432.788011139  528.790257487
# [25]  645.866167667  788.641667886  962.758131569 1175.095282401 1434.043027318 1749.832960144 2134.942634322

While initially there's not much of a pile to concern ourselves with, eventually we have 2000+ genomes while still only producing <400 genomes that year, a factor of ~5 difference. (As it happens, if you consider UKBB at n = 500k, produced as a single investment 2012-2017, vs 23andMe in 2017 at a reported n = 2-2.5m, this 5x multiplier is about right.)
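The ~5x multiplier is not a coincidence: when annual production grows by a factor of 1⁄0.82 per year, the ratio of the cumulative total to the current year's production converges to the geometric-series limit 1⁄(1 − 0.82) ≈ 5.56:

```r
annual <- 1/0.82^(0:30)
cumsum(annual)[31] / annual[31] # ~5.54, near the limit of 1/(1-0.82) = 5.56
```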

23andMe started back in 2007 or so offering $1,322 SNP panels to a few thousand people, growing to ~1m customers 8 years later by July 2015. To reproduce that in this model of constant investment, we start with a base of 56k SNPs purchased per year, growing according to the cost decrease:

cumsum(sapply(0:7, function(x) (56000*1)/(0.82^x)))
# [1]  56000.0000 124292.6829 207576.4426 309142.0032 433002.4429 584051.7596 768258.2434 992900.2969

What does that yield by 10 years later (2017) or 20 years later (2027)? It yields 1.6m (1,600,943) and 16.2m (16,212,798) respectively.
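These totals can be reproduced directly; the 2017 figure corresponds to summing years 0-9 of the growth process and the 2027 figure to years 0-20:

```r
sum(56000 / 0.82^(0:9))  # ~1.6m cumulative SNP purchases
sum(56000 / 0.82^(0:20)) # ~16.2m
```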

Even if we assumed that annual genomes/SNPs leveled off in 2017, the linear increase pushes us into the millions range rapidly:

totalStagnation <- cumsum(sapply(0:30, function(x) min(334089, (56000*1)/(0.82^x))))
#  [1]   56000.0000  124292.6829  207576.4426  309142.0032  433002.4429  584051.7596  768258.2434  992900.2969 1266854.0206 1600943.0206
# [11] 1935032.0206 2269121.0206 2603210.0206 2937299.0206 3271388.0206 3605477.0206 3939566.0206 4273655.0206 4607744.0206 4941833.0206
# [21] 5275922.0206 5610011.0206 5944100.0206 6278189.0206 6612278.0206 6946367.0206 7280456.0206 7614545.0206 7948634.0206 8282723.0206
# [31] 8616812.0206
data.frame(Year=2007:2037, total=round(totalStagnation))
# Year   total
# 2007   56000
# 2008  124293
# 2009  207576
# 2010  309142
# 2011  433002
# 2012  584052
# 2013  768258
# 2014  992900
# 2015 1266854
# 2016 1600943
# 2017 1935032
# 2018 2269121
# 2019 2603210
# 2020 2937299
# 2021 3271388
# 2022 3605477
# 2023 3939566
# 2024 4273655
# 2025 4607744
# 2026 4941833
# 2027 5275922
# 2028 5610011
# 2029 5944100
# 2030 6278189
# 2031 6612278
# 2032 6946367
# 2033 7280456
# 2034 7614545
# 2035 7948634
# 2036 8282723
# 2037 8616812

So even if no additional funds per year get spent on genomics, despite the increasing utility, and the cost curve remains the same, the cumulative number of SNPs or whole genomes will increase drastically over the next 30 years. Genomes on their own have many uses, such as detecting human evolution, allowing better imputation panels, inferring population structure, counting variants, detecting particularly lethal mutations, etc., but of course their main use is trait prediction. Given the increases, we would expect large enough n for Hsu's lasso to undergo phase transition and recover nearly the full SNP heritability; the bottleneck increasingly will not be genomes but phenotypic measurements.

Proposal: hand-counting mobile app for more fluid group discussions

Groups use voting for decision-making, but existing vote systems are cumbersome. Hand-raising is faster, but does not scale because counting hands is slow. Advances in machine vision may make it possible for AI to count hands in photos accurately. Combined with a smartphone's camera, this could yield an app for fast voting in even large groups.

Medium-large (>10 people) groups face a problem in reaching consensus: ballot or pen-and-paper voting is sufficiently slow and clunky that it is too costly to use for anything but the most important discussions. A group is forced to adopt other discussion norms and save a formal vote for only the final decision, and even then the long delay kills a lot of enthusiasm and interest. Voting could be used for many more decisions if it were faster, and of course all existing group votes would benefit from increased speed. (I am reminded of anime conventions and film festivals where, particularly for short films such as AMVs, one seems to spend more time filling out a ballot & passing them along aisles & the staff painfully counting through each ballot by hand than one actually spends watching the media in question!)

It would be better if voting could be as fluent and easy as simply raising your hand like in a small group such as a classroom: a mechanism which makes it so easy to vote that votes can be held as fast as the alternatives can be spoken aloud, and a glance suffices to count (an alert group could vote on 2 or 3 topics in the time it takes to read this sentence). But hand-raising, as great as it is, suffers from the flaw that it does not scale, due to the counting problem: a group of 500 people can raise their hands as easily as a group of 50 or 5, but it takes far too long to count ~250 hands: the person counting will quickly tire of the tedium, they will make mistakes counting, and this puts a serious lag on each vote, a lag which increases linearly with the number of voters. (Hands can be easy to approximate if almost everyone votes for or against something, but if consensus is so overwhelming, one doesn't need to vote in the first place! The hard case of almost-balanced votes is the most important one.)

One might suggest using an entirely different strategy: a website with HTML polls, or little clicker gizmos like those used in some college lectures to administer quick quizzes. These have the downside that they require potentially expensive equipment (I used a clicker in one class and I think it cost at least $20, so if a convention wanted to use that in an audience of hundreds, that's a major upfront cost; my experience was that clickers were unintuitive, did not always work, and slowed things down if anything; a website would only work if you assume everyone has a smartphone and is willing to pull it out at an instant's notice, and of course that there's working WiFi in the room, which cannot be taken for granted) and considerable overhead in explaining to everyone how it works, getting them on the same page, and making sure every person who wanders in also gets the message. (If anyone is going to be burdened with understanding or using a new system, it should be the handful of conference/festival/group organizers, not the entire audience!) A simpler approach than hands would be specially-printed paper using, for example, QR codes like piCards, which can then be recognized by standard simple computer-vision techniques; this is much cheaper than clickers but still requires considerable setup & inconvenience. It's hard to imagine a film festival running on any such system, and difficult to see these systems improving on pen-and-paper ballots, which at least are cheap, relatively straightforward, and well-known.

Hand-counting really does seem like the best solution, if only the counting could be fixed. Counting is something computers do fast, so that is the germ of an idea. What if a smartphone could count the votes? You don't want a smartphone app on the entire audience's phones, of course, since that's even worse than having everyone go to a website to vote; but machine vision has made enormous strides in the 2000s-2010s, reaching human-equivalent performance on challenging image-recognition contests like ImageNet. (Machine vision is complicated, but the important thing is that it's the kind of complicated which can be outsourced to someone else and turned into a dead-easy-to-use app, where the burden does not fall on the primary users, the audience.) What if the organizer had an app which took a photo of the entire audience with lifted arms and counted hands & faces and returned a vote count in a second?

Such an app would be ideal for any cultural, political, or organizational meeting. Now the flow for, e.g., a film festival could go: [no explanation given to audience, one just starts] “OK, how many people liked the first short, ‘Vampire Deli’ by Ms Houston?” [everyone raises hand, smartphone flashes, 1s passes] “OK, 140 votes. How many liked the second short, ‘Cthulicious’ by Mr Iouston?” [hands raised, smartphone flashes, 1s passes] “OK… 130 people. Congratulations, Ms Houston!” And so on.

Such an app might be considered an infeasible machine-vision task, but I believe it could be feasible: facial localization is an old and well-studied image-recognition task (and effective algorithms are built into every consumer camera), hands/fingers have very distinct shapes, and both tasks seem easier than the subtle discriminations between, say, various dog breeds demanded of ImageNet contestants.

Specifically, one could implement the machine-vision core as follows:

  1. multilayer neural networks trained for one task can be repurposed to similar tasks by removing the highest layer and retraining on the new task, potentially reaping great performance gains as the hybrid network has already learned much of what it needs for the second task (“transfer learning”). So one could take a publicly available NN trained for ImageNet (such as AlexNet, available in Caffe), remove the top two layers, and retrain on a dataset of audiences; this will perform better since the original NN has already learned how to detect edges, recognize faces, etc.

     The simpler task of counting crowds has already shown itself susceptible to deep learning: e.g. “Cross-scene Crowd Counting via Deep Convolutional Neural Networks”.

  2. raid Flickr and Google Images for CC-licensed photos of audiences raising their arms; then one can manually count how many arms are raised (or outsource the counting to Amazon Mechanical Turk). With the boost from a transferred convolutional deep network, one might get good performance with just a few thousand photos to train with. If each photo takes a minute to obtain and count, then one can create a useful corpus in a week or two of work.

  3. train the NN, applying the usual data-augmentation tricks to increase one's meager corpus, trying out random hyperparameters, tweaking the architecture, etc.

     (Note that while NNs are very slow and computationally intensive to train, they are typically quite fast to run; the smartphone app would not be training a NN, which is indeed completely infeasible from a CPU and battery-life standpoint: it is merely running the NN created by the original developer.)

  4. with an accurate NN, one can wrap it in a mobile-app framework. The UI, at the simplest, is simply a big button to press to take a photo, feed it into the NN, and display the count. Some additional features come to mind:

    • “headcount mode”: one may not be interested in a vote, but in how many people are in an audience (to estimate how popular a guest is, whether an event needs to move to a bigger space, etc.). If the NN can count faces and hands to estimate a vote count, it can simply report the count of faces instead.

    • the app should save every photo & count, both as an audit trail and also to support post-vote recounts in case of disputes or a desire for a more definitive count

    • the reported count should come with an indication of the NN's uncertainty/error-rate, so users are not misled by their little handheld oracle and so they can redo a vote if the choice is borderline; Bayesian methods, in which previous votes are drawn upon, might be relevant here.

      • if the original photo could be annotated with graphical notes for each recognized/counted hand & face, this would let the user ‘see’ what the NN is thinking and would help build confidence a great deal

    • it should support manually entering a vote-count; if the manual count differs, then this indicates the NN made an error, and the photo & count should be uploaded to the original developer so it can be added to the corpus and the NN's performance fixed in future releases of the app

    • smartphone cameras may not be high-resolution or have a sufficiently wide field-of-view to capture the entire audience at once; some sort of “montage mode” should exist so the user can swing the phone across the audience, bursts of shots taken, and the overlapping photos stitched together into a single audience photo which can then be fed into the NN as usual

    • a burst of photos might be superior to a single photo due to smartphone & hand movement blur; I don't know if it's best to try to combine the photos, run the NN multiple times and take the median, or feed multiple photos into the NN (perhaps by moving to a RNN architecture?)

    • the full-strength NN might still be too slow and energy-hungry to run pleasantly on a smartphone; there are model-compression techniques to reduce the number of nodes or weights without losing much performance, which might be useful in this context (and indeed, were originally motivated by wanting to make speech recognition run better on smartphones)

Given this breakdown, one might estimate building such an app as requiring, assuming one is already reasonably familiar with deep networks & writing mobile apps:

  1. 1 week to find an ImageNet NN, learn how to modify it, and set it up to train on a fresh corpus
  2. 3 weeks to create a corpus of <5000 photos with manually-labeled hand counts
  3. 5 weeks to train the NN (NNs as large as ImageNet NNs take weeks to train; depending on the GPU hardware one has access to and how many tweaks and hyperparameters one tries, 5 weeks could be drastically optimistic; but on the plus side, it's mostly waiting as the GPUs suck electricity like crazy)
  4. 5 weeks to make an intuitive simple app, submit to an app store, etc.

These estimates are loose and probably too optimistic (although I would be surprised if it took a good developer more than 6 months to develop this app), but they suggest >14 weeks or 784 hours of work for a developer, start to finish. (Even at minimum wage, this represents a substantial development cost of ~$6k; at more plausible developer salaries, easily >$60k of salary.)

How large is the market for such an app? Groups such as anime conventions or anything on a college campus are cheapskates and would balk at a price higher than $4.99 (even if only 5 or 10 staffers need to buy it and it makes the experience much smoother). There are probably several hundred anime or video-game conventions which might use this to vote, so that might be 1,000 sales there. There are easily 13,000 business conventions or conferences in the USA, which might not need voting so much, but would be attracted by a headcount mode to help optimize their event. This suggests perhaps $70k in sales, with much less profit after the app-store cut & taxes, and much of it would probably be one-off sales as the user reuses the app for each conference. So even a wild success, in which most events adopt such voting software, would barely recoup the development costs; as a product, it seems this is just too much of a niche unless one could develop it much faster (such as by finding an existing corpus of hands/photos, or being certain of banging out the mobile app in much less time than I estimated), find a larger market (theaters for audience participation?), or increase the price substantially (10x the price and aim at only businesses?).

Air conditioner replacement

Is my old air conditioner inefficient enough to replace? After calculating electricity consumption for it and a new air conditioner, with discounting, and with uncertainty in parameters evaluated by a Monte Carlo method, I conclude that the savings are too small by an order of magnitude to pay for a replacement air conditioner.

I have an old Whirlpool air conditioner (AC) in my apartment, and as part of insulating and cooling my apartment, I've wondered if the AC should be replaced on energy-efficiency grounds. Would a new AC save more than it costs upfront? What is the optimal decision here?

Initially my analysis was balked because I couldn't figure out what model it was, and thus anything about it, like its energy efficiency. (No model number or name appears anywhere visible on it, and I'm not going to rip it out of the wall just to look at hidden parts.)


So I began looking at all the old Whirlpool AC photographs in Google, and eventually I found one whose appearance exactly matches mine and which was released around when I think the AC was installed. The old AC is the “Whirlpool ACQ189XS” (official) (cost: $0, sunk cost), which is claimed to have an EER of 10.7.

For comparison, I browsed Amazon looking for highly-rated AC models with at least 5000 BTU cooling power and costing $250-$300, picking out the Sunpentown WA-8022S 8000 BTU Window Air Conditioner ($271) with 11.2 EER. (Checking some other entries on Amazon, this is fairly representative of both cost & EER.)

Question: what are the electrical savings, and hence the payback period, of a new AC?

The efficiency unit here is the EER or energy-efficiency ratio, defined as BTUs (amount of heat being moved by the AC) divided by watts consumed. Here we have ACs with 10.7 EER vs 11.2 EER; I need ~10k BTUs to keep the apartment cool (after fixing a lot of cracks, installing an attic fan and two box fans, putting tin foil over some windows, insulation under a floor, etc.), so each AC will draw 10000⁄EER watts: ~934 watts and ~898 watts respectively.

(EER is a lot like miles-per-gallon (MPG) as a measure of efficiency, and shares the same drawbacks: from a cost perspective, EER/MPG don't necessarily tell you what you want to know, and can be misleading and harder to work with than if efficiency were reported as, say, gallons per mile. As watts per BTU or gallons per mile, it is easy to see that after a certain point the cost differences have become absolutely small enough that improvements are not worth paying for. Going from 30 gallons of gas to 15 gallons is worth more than going from 3 gallons to 1.5 gallons, even though the relative improvement is the same.)
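To make the diminishing-returns point concrete, here are the watts drawn at a fixed 10k BTU load for increasing EERs; each successive +2 EER saves fewer absolute watts than the last:

```r
eer   <- c(8, 10, 12, 14, 16)
watts <- 10000 / eer
round(watts)        # 1250 1000 833 714 625
round(-diff(watts)) # savings per +2 EER step: 250 167 119 89
```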

So while operating, the two ACs will use 934 watts vs 898 watts, or 0.934 kWh vs 0.898 kWh per hour, to cool; a difference of 36 watts or 0.036 kWh.

Each kWh costs around $0.09, so the cost-difference is 0.036 × $0.09 = $0.00324 per hour.

The AC is on May-September (5 months), and on almost all day although it only runs intermittently, so say a third of the day or 8 hours, for a total of 8 × 30 × 5 = 1200 hours of operation.


Thus the annual benefit from switching to the new AC with 11.2 EER is $0.00324 × 1200 ≈ $3.9.

The cost is $271, amortized over n years. At $3.9 a year, it will take $271 / $3.9 ≈ 70 years to pay back (ignoring breakage and discounting/interest/opportunity-cost). This is not good.
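The payback arithmetic, using the rounded 0.036 kWh/hour gap from above:

```r
hourlySavings <- 0.036 * 0.09       # $0.00324 saved per hour of operation
annualSavings <- hourlySavings * 1200
annualSavings                       # ~$3.9/year
271 / annualSavings                 # ~70-year payback period
```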

Decision: do not replace.


To bring in discounting/interest: for what annual payment (cost-savings) would we be willing to pay the price of a new AC? More specifically, if it costs $271 and has an average payout period of 7 years, then at my usual annual discount rate of 5%, how much must each payout be?

r turns out to be ≥$46.83, which sounds about right. (Discounting penalizes future savings, so r should be greater than the undiscounted $271 / 7 ≈ $39, which it is.)
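The $46.83 figure is the standard annuity payment whose present value at 5% over 7 years equals the $271 price:

```r
price <- 271; years <- 7; rate <- 0.05
annuityFactor <- (1 - (1 + rate)^-years) / rate # present value of $1/year for 7 years
price / annuityFactor                           # ~$46.83/year required
```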

$47 is 12x larger than the estimated savings of $3.9, so the conclusion remains the same.

We could also work backward to figure out what EER would justify an upgrade by treating it as an unknown e and solving for it; let's say the new AC must pay back within 7 years (I doubt average AC lifetime is much longer) the discounted equivalent of its $271 price, i.e. $46.83 per year, with the same kWh & usage as before. What must the rival EER be? As an equation:

((10000⁄10.7 − 10000⁄e) / 1000) × $0.09 × 1200 = $46.83

and solving, e ≈ 20.

I am pretty sure there are no ACs with EER > 20!
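The break-even EER can also be found numerically with `uniroot`, confirming the algebra:

```r
# annual electricity savings of a rival AC with EER e, minus the required $46.83/year:
f <- function(e) ((10000/10.7 - 10000/e)/1000) * 0.09 * 1200 - 46.83
uniroot(f, interval=c(11.2, 100))$root # ~20
```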

Another way to look at it: if a new good AC costs ~$300 and I expect it to last ~7 years, then that's an annual cost of ~$43. The current AC's total annual cost to run is 0.934 × $0.09 × 1200 ≈ $101. So it's immediately clear that the energy savings must be huge (roughly half!) before a new purchase can hope to be justified.

Sensitivity analysis

The above analyses were done with point-estimates. It's only fair to note that there's a lot of uncertainty lurking in those estimates: $0.09 was just the median of the estimates I found for my state's electricity rates; the AC might be on 4 or 6 months; the hours per day might be considerably higher (or lower) than my guess of 8 hours; 10.7 & 11.2 EERs are probably best-case estimates and the real efficiencies lower (they're always lower than nominal); the discount rate may be a percent lower or higher, so the minimum savings would be off by as much as $4 in either direction; and so on. It would be good to do a bit of a sensitivity analysis to make sure that the conclusion is not being driven by any particular number. (Based on the definition, since it's using mostly multiplication, the final value should be robust to considerable error in estimating each parameter, but you never know.) Throwing together my intuition for how much uncertainty is in each parameter and modeling most as normals, I can simulate my prior distribution of savings:

simulate <- function() {
    BTUs <- rnorm(1, 10000, 100)
    EER_old <- 10.7 - abs(rnorm(1, 0, 0.5)) # half-normals because efficiencies only get worse, not better
    EER_new <- 11.2 - abs(rnorm(1, 0, 0.5))
    kWh <- rnorm(1, 0.09, 0.01)
    dailyUsage <- rnorm(1, 8, 2)
    months <- sample(4:6, 1)
    minimumSavings <- rnorm(1, 47, 4)

    annualNetSavings <- ((((BTUs / EER_old) - (BTUs / EER_new)) / 1000) * kWh * dailyUsage * 30 * months) - minimumSavings
    return(annualNetSavings) }
sims <- replicate(100000, simulate())
summary(sims)
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
## -70.3666500 -46.2051500 -42.3764100 -42.1133700 -38.3134600  -0.7334517
quantile(sims, p=c(0.025, 0.975))
##        2.5%        97.5%
## -53.59989114 -29.13999204

Under every simulation, a new AC is a net loss. (Since we have no observed data to update our priors with, this is an exercise in probability, not Bayesian inference, and so there is no need to bring in JAGS.)

There are two choices: replace or not. The expected value of a replacement is the mean of the simulations, or -$42, and the expected value of not replacing is $0; the latter is larger than the former, so we should choose the latter and not replace the old AC.

Hence we can be confident that not getting a new AC really is the right decision.

Some ways of dealing with measurement error

Prompted by a question on LessWrong, some examples of how to analyze noisy measurements in R:

## Create a simulated dataset with known parameters, and then run a ML multilevel model, a ML SEM,
## and a Bayesian multilevel model; with the last, calculate Expected Value of Sample Information (EVSI):

## "There is a variable X, x belongs to [0, 100]."
toplevel <- rnorm(n=1, 50, 25)
## "There are n ways of measuring it, among them A and B are widely used."
## "For any given measurer, the difference between x(A) and x(B) can be up to 20 points."
A <- toplevel + runif(1, min=-10, max=10)
B <- toplevel + runif(1, min=-10, max=10)
c(toplevel, A, B)
# [1] 63.85938385 55.43608379 59.42333264
### the true level of X we wish to recover is '63.85'

## "Between two any measurers, x(A)1 and x(A)2 can differ on average 10 points, likewise with B."
### let's imagine 10 hypothetical points are sampled using method A and method B
### assume 'differ on average 10 points' here means something like 'the standard deviation is 10'
A_1 <- rnorm(n=10, mean=A, sd=10)
B_1 <- rnorm(n=10, mean=B, sd=10)

data <- rbind(data.frame(Measurement="A", Y=A_1), data.frame(Measurement="B", Y=B_1)); data
#    Measurement           Y
# 1            A 56.33870025
# 2            A 69.07267213
# 3            A 40.36889573
# 4            A 48.67289213
# 5            A 79.92622603
# 6            A 62.86919410
# 7            A 53.12953462
# 8            A 66.58894990
# 9            A 47.86296948
# 10           A 60.72416003
# 11           B 68.60203507
# 12           B 58.24702007
# 13           B 45.47895879
# 14           B 63.45308935
# 15           B 52.27724328
# 16           B 56.89783535
# 17           B 55.93598486
# 18           B 59.28162022
# 19           B 70.92341777
# 20           B 49.51360373

## MLM

## multi-level model approach:
library(lme4)
mlm <- lmer(Y ~ (1|Measurement), data=data); summary(mlm)
# Random effects:
#  Groups      Name        Variance Std.Dev.
#  Measurement (Intercept)  0.0000  0.000000
#  Residual                95.3333  9.763877
# Number of obs: 20, groups:  Measurement, 2
# Fixed effects:
#              Estimate Std. Error  t value
# (Intercept) 58.308250   2.183269 26.70685
confint(mlm)
#                    2.5 %       97.5 %
# .sig01       0.000000000  7.446867736
# .sigma       7.185811525 13.444112087
# (Intercept) 53.402531768 63.213970887

## So we estimate X at 58.3, but the true 63.85 is not inside our confidence interval, with such little data. Bad luck?

## SEM

library(lavaan)
## lavaan wants one column per indicator, so reshape the samples into columns 'a' & 'b':
data2 <- data.frame(a=A_1, b=B_1)
X.model <- '        X =~ B + A
                    A =~ a
                    B =~ b' <- sem(model = X.model, meanstructure = TRUE, data = data2)
summary(
# ...                   Estimate  Std.err  Z-value  P(>|z|)
# Latent variables:
#   X =~
#     B                 1.000
#     A              7619.504
#   A =~
#     a                 1.000
#   B =~
#     b                 1.000
# Intercepts:
#     a                58.555
#     b                58.061
#     X                 0.000
#     A                 0.000
#     B                 0.000
## Well, that didn't work well - explodes, unfortunately. Probably still not enough data.

## MLM (Bayesian)

## rough attempt at writing down an explicit multilevel model which
## respects the mentioned priors about errors being reasonably small:
model <- function() {
  grand.mean ~ dunif(0, 100)
  delta.between.group ~ dunif(0, 10)
  sigma.between.group ~ dunif(0, 100)
  tau.between.group <- pow(sigma.between.group, -2)

  for(j in 1:K){
   # let's say the group-level differences are also normally-distributed:
   group.delta[j] ~ dnorm(delta.between.group, tau.between.group)
   # and each group also has its own standard-deviation, potentially different from the others':
   group.within.sigma[j] ~ dunif(0, 20)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)

   # save the net combo for convenience & interpretability:
   group.mean[j] <- grand.mean + group.delta[j]
  }

  for (i in 1:N) {
   # each individual observation is from the grand-mean + group-offset, then normally distributed:
   Y[i] ~ dnorm(grand.mean + group.delta[Group[i]], group.within.tau[Group[i]])
  }
 }

library(R2jags)
jagsData <- list(N=nrow(data), Y=data$Y, K=length(levels(data$Measurement)),
                 Group=as.integer(data$Measurement))
params <- c("grand.mean", "delta.between.group", "sigma.between.group", "group.delta",
            "group.mean", "group.within.sigma")
k1 <- jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model); k1
# ...                      mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
# delta.between.group      4.971   2.945   0.221   2.353   4.967   7.594   9.791 1.008   260
# grand.mean              52.477  11.321  23.453  47.914  53.280  58.246  74.080 1.220    20
# group.delta[1]           6.017  11.391 -16.095   0.448   5.316  10.059  34.792 1.152    21
# group.delta[2]           5.662  11.318 -15.836   0.054   5.009  10.107  33.548 1.139    21
# group.mean[1]           58.494   3.765  50.973  56.188  58.459  60.838  66.072 1.001  3000
# group.mean[2]           58.139   2.857  52.687  56.366  58.098  59.851  63.999 1.003   920
# group.within.sigma[1]   12.801   2.766   8.241  10.700  12.446  14.641  18.707 1.002  1100
# group.within.sigma[2]    9.274   2.500   5.688   7.475   8.834  10.539  15.700 1.002  1600
# sigma.between.group     18.031  21.159   0.553   3.793   9.359  23.972  82.604 1.006  1700
# deviance               149.684   2.877 145.953 147.527 149.081 151.213 156.933 1.001  3000

## VOI

posteriorXs <- k1$BUGSoutput$sims.list[["grand.mean"]]
MSE <- function(x1, x2) { (x2 - x1)^2 }
lossFunction <- function(x, predictions) { mean(sapply(predictions, function(x2) { MSE(x, x2)}))}
## our hypothetical mean-squared loss if we predicted, say, X=60:
lossFunction(60, posteriorXs)
# [1] 184.7087612
## of the possible values for X, 1-100, what value of X minimizes our squared error loss?
losses <- sapply(c(1:100), function (n) { lossFunction(n, posteriorXs) })
which.min(losses)
# [1] 52
## 52 also equals the mean estimate of X, which is good since it's well known that the mean is what minimizes
## the loss when the loss is squared-error, so it suggests that I have not screwed up the definitions
losses[52]
# [1] 128.3478462

## to calculate EVSI, we repeatedly simulate a few hundred times the existence of a hypothetical 'C' measurement
## and draw n samples from it;
## then we add the C data to our existing A & B data; run our Bayesian multilevel model again on the bigger dataset;
## calculate what the new loss is, and compare it to the old loss to see how much the new data
## reduced the loss/mean-squared-error.
## Done for each possible n (here, 1-30) and averaged out, this tells us how much 1 additional datapoint is worth,
## 2 additional datapoints, 3 additional datapoints, etc.
sampleValues <- NULL
for (i in seq(from=1, to=30)) {

    evsis <- replicate(500, {
        n <- i

        C <- toplevel + runif(1, min=-10, max=10)
        C_1 <- rnorm(n=n, mean=C, sd=10)
        ## all as before, more or less:
        newData <- rbind(data, data.frame(Measurement="C", Y=C_1))

        jagsData <- list(N=nrow(newData), Y=newData$Y, K=length(levels(newData$Measurement)),
                         Group=as.integer(newData$Measurement))
        jEVSI <- jags(data=jagsData, parameters.to.save=params, inits=NULL, model.file=model)

        posteriorTimesEVSI <- jEVSI$BUGSoutput$sims.list[["grand.mean"]]
        lossesEVSI <- sapply(c(1:100), function (n) { lossFunction(n, posteriorTimesEVSI);})

        oldOptimum <- 128.3478462 # losses[52]
        newOptimum <- losses[which.min(lossesEVSI)]
        EVSI <- newOptimum - oldOptimum
        EVSI })

    sampleValues[i] <- mean(evsis)
}
sampleValues
#  [1] 13.86568780 11.07101087 14.15645538 13.05296681 11.98902668 13.86866619 13.65059093 14.05991443
#  [9] 14.80018511 16.36944874 15.47624541 15.64710237 15.74060632 14.79901214 13.36776390 15.35179426
# [17] 14.31603459 13.70914727 17.20433606 15.89925289 16.35350861 15.09886204 16.30680175 16.27032067
# [25] 16.30418553 18.84776433 17.86881713 16.65973397 17.04451609 19.17173439

## As expected, the gain in reducing MSE continues increasing as data comes in, but with diminishing returns;
## this is probably because in a multilevel model like this, you aren't using the _n_ datapoints to estimate X
## directly so much as you are using them to estimate a much smaller number of latent variables, which are then
## what is used to estimate X. So instead of getting hyperprecise estimates of A/B/C, you need to sample from
## additional groups D/E/F/... Trying to improve your estimate of X by measuring A/B/C many times is like trying
## to estimate IQ precisely by administering a WM test a hundred times.

## If we wanted to compare with alternatives like instead sampling n data points from C and a D, it's easy to modify
## the EVSI loop to do so: generate `D <- toplevel + runif(1, min=-10, max=10); D_1 <- rnorm(n=n, mean=D, sd=10)`
## and now `rbind` D_1 in as well. At a guess, after 5-10 samples from the current group, estimates of X will be improved more
## by then sampling from a new group.

## Or the loss function could be made more realistic. It's unlikely one is paid by MSE, and if one adds in how much
## money each sample costs, with a realistic loss function, one could decide exactly how much data is optimal to collect.

## Estimating X very precisely, when our measurements must first pin down at least 3 latent variables,
## requires much more data than usual.

## In general, we can see the drawbacks and benefits of each approach. A canned MLM
## is very fast to write but doesn't let us include prior information or easily run
## additional analyses like how much additional samples are worth. SEM works poorly
## on small samples but is still easy to write in if we have more complicated
## models of measurement error. A full-blown modeling language like JAGS is quite
## difficult to write in and MCMC is slower than other approaches but handles small
## samples without any errors or problems and offers maximal flexibility in using
## the known prior information and then doing decision-theoretic stuff. Overall for
## this problem, I think JAGS worked out best, but possibly I wasn't using lavaan
## right and that's why SEM didn't seem to work well.

Value of Information: clinical prediction instruments for suicide

I agree. When criticizing the study for claiming the blood levels added predictive power when it's not clear they did, that is solely a statistical claim and can be made in a vacuum. But when one then goes on to pan the predictive power of the underlying clinical prediction instruments as useless in all circumstances, based on just the prediction stats:

So when people say "We have a blood test to diagnose suicidality with 92% accuracy!", even if it's true, what they mean is that they have a blood test which, if it comes back positive, there's still less than 50-50 odds the person involved is suicidal. Okay. Say you're a psychiatrist. There's a 48% chance your patient is going to be suicidal in the next year. What are you going to do? Commit her to the hospital? I sure hope not. Ask her some questions, make sure she's doing okay, watch her kind of closely? You're a psychiatrist and she's your depressed patient, you would have been doing that anyway. This blood test is not really actionable. And then remember that this isn't the blood test we have. We have some clinical prediction instruments that do this…But having "a blood test for suicide" won't be very useful, even if it works.

One is implicitly making some strong cost-benefit claims here and stepping from statistics ('what are the probabilities?') to decision theory ('given these probabilities, how should I act?'). They are not identical: no AUC graph will ever tell you if a model's predictions are useful or not, and there is no universal threshold where 92% specificity/sensitivity is totally useless but 95% would make a difference. These clinical prediction instruments might be useless indeed, but that will depend on costs, base rates, and available actions. (I tried to make this point to Coyne on Twitter earlier but I don't think he understood what I was getting at & he blew me off.)

Discontinuities come from our actions; our inferences are incremental. There are some contexts where a tiny 1% improvement in AUC might be worth a lot (Wall Street) and there are some contexts where sensitivity or specificity of 99% is still useless because it won't change your actions at all (I'm currently comparing my riding lawn mower to a robotic lawn mower, and thus far, it doesn't matter how precise my parameters are; the robotic lawn mowers are, to my disappointment, just too expensive right now). I think p-values have shown us how well arbitrary thresholds work out in practice (and remember where they came from in the first place! decision rules set per problem: Gosset, in optimizing a brewery, did not have the pathologies we have with p<0.05 fetishism). I also don't believe your choices are really that restricted: you mean if you were absolutely convinced that your patient was about to commit suicide, there is absolutely nothing you could do besides treat them like any other depressive? That seems unlikely. But whatever, even if commitment is the only alternative, there is still a value to the information provided by a clinical prediction instrument, and we can calculate it, and you should if you want to rule it out as having any value, in the same way that in criticizing a study as weak, it's better to ignore the p-values and just work out the right posterior and demonstrate directly how little evidence it contains.

Let's try this as an example; it's not hard or terribly complex (just tedious). So we have a ward of 100 depressive patients where we are interested in preventing suicide; our prior probability is that 7.5%, or ~7 of them, will commit suicide. The value of a life has been given a lot of different valuations, but $10 million is a good starting point.

Action 1:

What are our costs or losses? We could say that we expect a loss of 7.5 * $10m or -$75m, and if we stand by and do no treatment or intervention whatsoever, we spend no more money and so the total loss is

0 - (0.075 * 100 * $10,000,000) = -$75,000,000

Action 2:

Let's say they all stay by default for one week and this costs a net $1000 a day; let's say further that, since commitment is the mentioned alternative, a suicide attempt while committed will fail. And since we know that suicides are so often spontaneous and major depression comes and goes, a frustrated suicide attempt doesn't simply mean that they will immediately kill themselves as soon as they get out. This 7.5% comes from a followup period of a year, so the probability any will attempt suicide in the next week might be 0.075/52 or 0.001442307692. So this gives us our default setup: we have 100 patients staying for 7 days at a net cost of $1000 a day or $700,000 total, and by having them stay, we stop an expected average of ~0.14 suicides and thus we prevent an expected loss of 0.14 * $10m = $1.4m, for a total loss of treatment-cost minus treatment-gain plus remaining-loss:

$700,000 - (0.14 * $10m) - $10m * 100 * (0.075-(0.075/52)) = -$74,257,692.

Note that this loss is smaller than in the scenario in which we don't do any commitment at all; since one week of suicide-watch reduced the suicide loss more than it cost, this is not surprising.

Specifically, the benefit is:

action 1 - action 2 = gain from switching: $75,000,000 - $74,257,692 = $742,308

Not fantastic, but it's in the right order of magnitude (you can't expect more from a low base-rate event and a treatment with such a low probability of making a difference, after all) so it looks plausible, and it's still more than zero. We can reject the action of not committing them at all as being inferior to committing them for one week.
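These action comparisons are simple enough to check in R; a sketch reproducing the text's loss convention (treatment-cost minus treatment-gain plus remaining loss):

```r
## losses for action 1 (do nothing) & action 2 (1-week commitment), numbers as in the text:
lifeValue <- 10000000; n <- 100; pYear <- 0.075; pWeek <- pYear/52
action1 <- 0 - (pYear * n * lifeValue)
action2 <- (n*7*1000) - (0.14 * lifeValue) - (lifeValue * n * (pYear - pWeek))
round(c(action1, action2))
# [1] -75000000 -74257692
round(action2 - action1) # the gain from switching to 1-week commitment
# [1] 742308
```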

Action 3:

What if we were instead choosing between one week and committing them for a full year, thus catching the full 7.5% of suicides during the 1-year followup? Does that work? First, the loss from this course of action:

((100*365.2*1000) - (0 * 10000000) - (10000000 * 100 * (0.075-(0.075/1)))) = -$36,520,000

Since there are no suicides, we avoid the default loss of -$75m, but we still have to spend $36,520,000 to pay for the long-term commitment. However, the benefit to the patients has increased dramatically since we stop so many more suicides:

action 2 - action 3 = $74,257,692.31 - $36,520,000 = $37,737,692.31

(We go from a loss of -$74m to a loss of -$36m.) So we see action 3 is even better than action 2 for the patients. Of course, we can't extrapolate out any further than 1 year, because that's what our followup number is, and we don't know how the suicide risk falls after the 1-year point; if it drops to ~0, then further commitment is a terrible idea. So I'm not going to calculate out any further. (Since this is all linear stuff, the predicted benefit will increase smoothly over the year and so there's no point in calculating out alternatives like 1 month, 3 months, 6 months, 9 months, etc.) What's that, action 3 is totally infeasible and no one would ever agree to this (the patients would scream their heads off and the health insurance companies would never go for it) even if we could show that long commitments do reduce the suicide rate enough to justify the costs? And, among other things, I've oversimplified in assuming the 7.5% risk is evenly distributed over the year rather than a more plausible distribution like exponentially decreasing from Day 1, so likely commitment stops being a good idea more like month 3 or something? Yeah, you're probably right, so let's go back to using action 2's loss as our current best alternative.
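The same check for the year-long commitment; a sketch following the text's numbers, where a full year of commitment stops every suicide and so leaves only its own cost as the loss:

```r
## action 3: commit all 100 patients for 365.2 days at $1000/day (no suicides succeed):
lifeValue <- 10000000; n <- 100; pYear <- 0.075
action2 <- (n*7*1000) - (0.14 * lifeValue) - (lifeValue * n * (pYear - pYear/52))
action3 <- -(n * 365.2 * 1000)
round(c(action2, action3))
# [1] -74257692 -36520000
round(action3 - action2) # the improvement over the 1-week policy
# [1] 37737692
```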

Now, having set out some of the choices available, we can find out how much better information is worth. First, let's ask what the Expected Value of Perfect Information is: suppose we were able to take our 100 patients and exactly predict which ~7 would commit suicide this year in the absence of any intervention, where our choice is between committing them for one week or not at all. Given such information we can eject the 93 who we now know were never a suicide risk, and we hold onto the 7 endangered patients, and we have a new loss of the commitment cost of 7 people for a week vs the prevented loss of the chance they will try to commit suicide during that week of the year:

((7*7*1000) - (0.14 * 10000000) - (10000000 * 7 * (1-(1/52)))) = -$70,004,846

How much did we gain from our perfect information? About $4m:

$74,257,692 - $70,004,846 = $4,252,846

(This passes our sanity checks: additional information should never hurt us, so the amount should be ≥$0, but we are limited by the intervention to doing very little, so the ceiling should be a low amount compared to the total loss, which this is.)

So as long as the perfect information did not cost us more than $4m or so, we would have net gained from it: we would have been able to focus commitment on the patients at maximal risk. So suppose we had a perfect test which cost $1000 a patient to run, and we wanted to know if the gained information was valuable enough to bother with using this expensive test; the answer in this case is definitely yes: with 100 patients, it'll cost $100,000 to run the test but it'll save $4.25m, for a net profit of $4.15m. In fact, we would be willing to pay per-patient costs up to ~$42k, at which point we hit break-even (4252846 / 100).
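The EVPI computation itself, in R (same conventions and numbers as above):

```r
## perfect information: commit only the 7 true positives for the week:
lifeValue <- 10000000
action2 <- (100*7*1000) - (0.14*lifeValue) - (lifeValue*100*(0.075 - 0.075/52))
perfect <- (7*7*1000)   - (0.14*lifeValue) - (lifeValue*7*(1 - 1/52))
round(perfect)
# [1] -70004846
evpi <- perfect - action2
round(evpi)       # the Expected Value of Perfect Information
# [1] 4252846
round(evpi / 100) # breakeven per-patient cost of a perfect test
# [1] 42528
```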

OK, so that's perfect information. What about imperfect information? Well, imperfect is a lot like perfect information, just, y'know, less so. Let's consider this test: with the same prior, a negative on it means the patient now has P=0.007 of committing suicide that year, a positive means P=0.48, and the sensitivity/specificity are 92%. (Just copying that from OP & ButYouDisagree, since those sound plausible.) So when we run the test on our patients, we find of the 4 possible outcomes:

  • 85.1 patients are non-suicidal and the test will not flag them
  • 7.4 are non-suicidal but the test will flag them
  • 6.9 are suicidal and the test will flag them
  • 0.6 are suicidal but the test will not flag them

So if we decide whether to commit or not commit solely based on this test, we will send home 85.1 + 0.6 = 85.7 patients (and indeed 0.6/85.7 = 0.007), and we will retain the remaining 7.4 + 6.9 = 14.3 patients (and indeed, 6.9/14.3 = 0.48). So our loss is the 0.6 suicides among the wrongly ejected patients, plus the cost of committing 14.3 patients (both safe and at-risk) for a week, in exchange for the gain of a small chance of stopping the suicide of the 6.9 actually at risk:

-(10000000*85.7*0.007) - (14.3*7*1000) - (10000000 * (0.48*14.3) * (1-(1/52))) = -$73,419,100

How much did we gain from our imperfect information? About $0.8m:

$74,257,692 - $73,419,100 = $838,592

or $8,385.92 per patient. (This passes our sanity check: greater than $0, but much less than the perfect information. The exact amount may seem lame, but as a fraction of the value of perfect information, it's not too bad: the test gets us 20% (838,592 / 4,252,846) of the way to perfection.)
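The whole imperfect-information calculation can likewise be verified in a few lines of R (following the rounded 0.007/0.48/92% figures used in the text):

```r
## imperfect information: 92% sensitivity/specificity on a 7.5% base rate among 100 patients:
lifeValue <- 10000000
sens <- 0.92; spec <- 0.92; suicidal <- 7.5; safe <- 92.5
flagged  <- sens*suicidal + (1-spec)*safe  # 6.9 + 7.4 = 14.3 retained
sentHome <- (1-sens)*suicidal + spec*safe  # 0.6 + 85.1 = 85.7 ejected
## loss, following the text: suicides among the sent-home, plus commitment costs,
## plus the mostly-unpreventable remaining-year risk among the retained:
testLoss <- -(lifeValue * 85.7 * 0.007) - (14.3*7*1000) - (lifeValue * (0.48*14.3) * (1 - 1/52))
action2  <- (100*7*1000) - (0.14*lifeValue) - (lifeValue*100*(0.075 - 0.075/52))
round(testLoss)
# [1] -73419100
round(testLoss - action2) # the Expected Value of (imperfect) Sample Information
# [1] 838592
```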

And that's our answer: the test is not worth $0; it's worth $8k. And once you know what the cost of administering the test is, you simply subtract it and now you have the Net Expected Value of Information for this test. (I can't imagine a test like this costs $8k per patient to administer, so at least in this model, the value is highly likely >$0.)

By taking the posterior of the test and integrating all the estimated costs and benefits into a single framework, we can nail down exactly how much value these clinical instruments could deliver if used to guide decision-making. And if you object to some particular parameter or assumption, just build another decision-theory model and estimate the new cost. For example, maybe commitment actually costs, once you take into account all the disruption to lives and other such side-effects, not $1000 but a net $5000 per day; what then? Then the gain halves to $438,192, etc. And if it costs $10,000, then the test is worth nothing, because you won't commit anyone ever (it's just way too expensive), and now you know it's worth $0. Or if commitment is so cheap that it's more like $100 a day, then the test is also worth $0, because you would just commit everyone (since breakeven is then a suicide probability way below 7.5%, all the way at ~0.4%, which is still below the 0.7% which the test can deliver, so the test result doesn't matter for deciding whether to commit, so it's worth $0). Or if you adopt a more reasonable value of life like $20m, the value of perfect information shoots up (obviously, since the avoided loss doubles) but the value of imperfect information drops like a stone (since now that one suicidal patient sent home blows away your savings from less committing) and the test becomes worthless. Playing with the formulas, you can figure out the various ranges of assumptions in which the test has positive value and estimate how much it has under particular parameters; and if parameters are uncertain, you can cope with that uncertainty by embedding this in a Bayesian model to get posterior distributions of particular parameters incorporating all the uncertainty.

So to sum up: there are no hard thresholds in decision-making, and imposing them can cost us better decision-making; to claim additional information is worthless, more analysis is needed, and this analysis must be done with respect to the available actions & their consequences, which even under the somewhat extreme conditions here of very weak interventions & low base-rates suggests that the value of this information is positive.

Bayesian Model Averaging

## original: "Bayesian model choice via Markov chain Monte Carlo methods" Carlin & Chib 1995
## Kobe example & data from: "A tutorial on Bayes factor estimation with the product space method", Lodewyckx et al 2011
## Lodewyckx code can be downloaded after registration & email from

## "Table 2: Observed field goals (y) and attempts (n) by Kobe Bryant during the NBA seasons of 1999 to 2006."
kobe <- read.csv(stdin(),header=TRUE)
Year, y,   n,    y.n
1999, 554, 1183, 0.47
2000, 701, 1510, 0.46
2001, 749, 1597, 0.47
2002, 868, 1924, 0.45
2003, 516, 1178, 0.44
2004, 573, 1324, 0.43
2005, 978, 2173, 0.45
2006, 399,  845, 0.47

model1 <- "model {
      # 1) MODEL INDEX
      # Model index is 1 or 2.
      # Prior probabilities based on argument prior1.
      # Posterior probabilities obtained by averaging
      # over postr1 and postr2.
      M ~ dcat(p[])
      p[1] <- prior1
      p[2] <- 1-prior1
      postr1 <- 2-M
      postr2 <- 1-postr1

      # 2) DATA
      # For each year, successes are Binomially distributed.
      # In M1, the success rate is fixed over years.
      # In M2, the success rate is year-specific.
      for (i in 1:n.years){
       successes[i] ~ dbin(pi[M,i], attempts[i])

       pi[1,i] <- pi.fixed
       pi[2,i] <- pi.ind[i]
      }

      # 3) MODEL 1 (one single rate)
      # The fixed success rate is given a Beta prior and pseudoprior.
      # Whether it is a prior or pseudoprior depends on the Model index.
      pi.fixed ~ dbeta(alpha.fixed[M],beta.fixed[M])
      alpha.fixed[1] <- alpha1.prior
      beta.fixed[1] <- beta1.prior
      alpha.fixed[2] <- alpha1.pseudo
      beta.fixed[2] <- beta1.pseudo

      # 4) MODEL 2 (multiple independent rates)
      # The year-specific success rate is given a Beta prior and pseudoprior.
      # Whether it is a prior or pseudoprior depends on the Model index.
      for (i in 1:n.years){
       pi.ind[i] ~ dbeta(alpha.ind[M,i], beta.ind[M,i])
       alpha.ind[2,i] <- alpha2.prior
       beta.ind[2,i] <- beta2.prior
       alpha.ind[1,i] <- alpha2.pseudo[i]
       beta.ind[1,i] <- beta2.pseudo[i]
      }

      # predictive interval for hypothetical 2007 data in which Kobe makes 1000 shots:
      y.pred.fixed ~ dbin(pi.fixed, 1000)
      y.pred.ind ~ dbin(pi.ind[n.years], 1000)
}"
# 'prior1' value from paper
data <- list("prior1"=0.000000007451, "n.years"= length(kobe$Year), "successes"=kobe$y, "attempts"=kobe$n,
             "alpha1.prior"=1, "beta1.prior"=1, "alpha2.prior"=1, "beta2.prior"=1,
             "alpha1.pseudo"=1, "beta1.pseudo"=1, "alpha2.pseudo"=rep(1,8), "beta2.pseudo"=rep(1,8) )
# inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mailSim,1,mean))) }
params <- c("pi.ind", "pi.fixed", "postr1", "postr2", "M", "y.pred.fixed", "y.pred.ind")
j1 <- run.jags(model=model1, monitor=params, data=data, n.chains=getOption("mc.cores"), method="rjparallel", sample=500000); j1
# JAGS model summary statistics from 4000000 samples (chains = 8; adapt+burnin = 5000):
#                  Lower95  Median Upper95    Mean      SD Mode      MCerr MC%ofSD SSeff
# pi.ind[1]       0.3145 0.46864 0.98709 0.47383 0.11553   -- 0.00041958     0.4 75810
# pi.ind[2]      0.10099 0.46447 0.77535 0.47005  0.1154   -- 0.00042169     0.4 74887
# pi.ind[3]      0.19415  0.4692 0.86566  0.4741 0.11457   -- 0.00040171     0.4 81342
# pi.ind[4]     0.020377 0.45146 0.69697 0.45867 0.11616   -- 0.00042696     0.4 74023
# pi.ind[5]     0.024472 0.43846  0.7036 0.44749 0.11757   -- 0.00043352     0.4 73548
# pi.ind[6]     0.076795 0.43325 0.74944 0.44318 0.11684   -- 0.00043892     0.4 70863
# pi.ind[7]      0.06405 0.45033 0.73614 0.45748 0.11541   -- 0.00041715     0.4 76543
# pi.ind[8]      0.30293 0.47267 0.97338 0.47708 0.11506   -- 0.00040938     0.4 79000
# pi.fixed      0.039931 0.45756 0.97903 0.49256 0.26498   -- 0.00099537     0.4 70868
# postr1               0       0       1 0.15601 0.36287    0    0.15113    41.6     6
# postr2               0       1       1 0.84399 0.36287    1    0.15113    41.6     6
# M                    1       2       2   1.844 0.36287    2    0.15113    41.6     6
# y.pred.fixed         0     463     940  492.57  265.28  454    0.99543     0.4 71019
# y.pred.ind         300     473     971  477.05  116.03  473     0.4152     0.4 78094
getLogBF <- function(prior0, postr0) { log((postr0/(1-postr0)) / (prior0/(1-prior0))) }
getLogBF(0.000000007451, 0.15601)
# [1] 17.02669704
## analytic BF: 18.79; paper's MCMC estimate: 18.80; not sure where I lost 1.8 of the BF.
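As a check on the MCMC, the analytic Bayes factor is available in closed form here: under Beta(1,1) priors, a binomial y/n has marginal likelihood choose(n,y) * beta(y+1, n-y+1), and the binomial coefficients cancel in the ratio of the two models, so (data as in the table above):

```r
y <- c(554, 701, 749, 868, 516, 573, 978, 399)
n <- c(1183, 1510, 1597, 1924, 1178, 1324, 2173, 845)
## log marginal likelihood of M1 (single pooled rate) minus that of M2 (8 independent rates):
logBF12 <- lbeta(sum(y)+1, sum(n)-sum(y)+1) - sum(lbeta(y+1, n-y+1))
logBF12
# ~18.8 in favor of the single-rate model, agreeing with the paper's analytic 18.79
```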

Dealing with all-or-nothing unreliability of data

Given two disagreeing polls, one small & imprecise but taken at face-value, and the other large & precise but with a high chance of being totally mistaken, what is the right Bayesian model to update on these two datapoints? I give ABC and MCMC implementations of Bayesian inference on this problem and find that the posterior is bimodal, with a mean estimate close to the large unreliable poll's estimate but with wide credible intervals to cover the mode based on the small reliable poll's estimate.

A question was asked of me: what should one infer if one is given what would be definitive data if one could take it at face value, but one suspects this data might be totally 100% incorrect? An example would be if one wanted to know what fraction of people would answer 'yes' to a particular question, and one had a very small poll (n = 10) suggesting 90% say yes, but then one was also given the results from a much larger poll (n = 1000) saying 75% responded yes, except that this poll was run by untrustworthy people, people that, for whatever reason, you believe might make something up half the time. You should be able to learn something from this unreliable poll, but you can't learn everything from it, because you would be burned half the time.

If not for this issue of unreliability, this would be an easy binomial problem: specify a uniform or Jeffreys prior on what percentage of people will say yes, add in the binomial data of 9⁄10, and look at the posterior. But what do we do with the unreliability joker?


First let's try the simple case, just updating on the small poll of 9⁄10. We would expect the posterior to be unimodal, peaked around 80-90%, but broad (due to the small sample size) and falling sharply near 100%, since being that high is a priori unlikely.
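(For this simple case, conjugacy means no simulation is even needed: a uniform Beta(1,1) prior plus 9 yeses out of 10 gives a Beta(10,2) posterior, which the MCMC & ABC runs below can be checked against; note that qbeta gives an equal-tailed interval, so it will differ slightly from an HDI.)

```r
## closed-form check for the reliable-poll-only case: Beta(1+9, 1+1) posterior
posteriorMean <- (1+9) / ((1+9) + (1+1))
posteriorMean
# [1] 0.8333333
qbeta(c(0.025, 0.975), 1+9, 1+1) # equal-tailed 95% credible interval, roughly 0.59-0.98
```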

MCMC using Bayesian First Aid:

## install.packages("devtools")
## devtools::install_github("rasmusab/bayesian_first_aid")
library(BayesianFirstAid)
oldData <- data.frame(Yes=9, N=10)
b <- bayes.binom.test(oldData$Yes, oldData$N); b
# ...number of successes = 9, number of trials = 10
# Estimated relative frequency of success:
#   0.85
# 95% credible interval:
#   0.63 0.99
# The relative frequency of success is more than 0.5 by a probability of 0.994
# and less than 0.5 by a probability of 0.006

Which itself is a wrapper around calling out to JAGS, doing something like this:

model_string <- "model {
  x ~ dbinom(theta, n)
  theta ~ dbeta(1, 1) }"
library(runjags)
model <- autorun.jags(model_string, monitor="theta", data=list(x=oldData$Yes, n=oldData$N)); model
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#       Lower95  Median Upper95    Mean      SD Mode     MCerr MC%ofSD SSeff    AC.10   psrf
# theta 0.63669 0.85254  0.9944 0.83357 0.10329   --  0.0007304     0.7 20000 0.011014 1.0004

Here is a simulation-based version of Bayesian inference using ABC (Approximate Bayesian Computation):

oldData <- data.frame(Yes=9, N=10)
simulatePoll <- function(n, pr) { rbinom(1, size=n, prob=pr) }
poll_abc <- replicate(100000, {
    # draw from our uniform prior
    p <- runif(1, min=0, max=1)
    # simulate a hypothetical poll dataset the same size as our original
    newData <- data.frame(Yes=simulatePoll(oldData$N, p), N=oldData$N)
    # were they equal? if so, save sample as part of posterior
    if (all(oldData == newData)) { return(p) }
    })
resultsABC <- unlist(Filter(function(x) { !is.null(x) }, poll_abc))
summary(resultsABC)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 0.3260816 0.7750520 0.8508855 0.8336383 0.9117471 0.9991691

They look identical, as they should.
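As a final cross-check on both runs: with a uniform prior, the posterior here is just Beta(10,2) in closed form, with mean 10⁄12 ≈ 0.83, which we can confirm by brute-force grid evaluation (a purely illustrative sketch, in Python rather than R):

```python
# With a uniform prior + binomial data 9/10, the posterior is Beta(10,2);
# evaluate the unnormalized density theta^9 * (1-theta) on a fine grid:
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]
w = [t**9 * (1 - t) for t in grid]          # unnormalized Beta(10,2) density
Z = sum(w)
mean = sum(t * x for t, x in zip(grid, w)) / Z
print(round(mean, 3))   # 0.833 = 10/12, matching the JAGS mean of 0.83357
```

No MCMC or ABC is needed for a one-parameter conjugate model like this; the point is that all three methods agree.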

Binomial with binary unreliability

To implement our more complicated version: the original poll remains the same, but we add in the complication of a very large poll which 50% of the time is a true measure of the poll response and 50% of the time is drawn uniformly at random. (So if the true poll response is 90%, then half the time the large poll will yield accurate data like 905⁄1000 or 890⁄1000, and the rest of the time it will yield 10⁄1000 or 400⁄1000 or 700⁄1000.) This is different from the more common kinds of measurement-error models, where it’s generally assumed that the noisy measurements still have some informativeness to them; here there is none.

Specifically, this faux poll has yielded the data not 9⁄10, but 750⁄1000.


Using ABC again: we generate the reliable small poll as before, and we add in a faux poll where we flip a coin to decide whether we are going to return a ‘yes’ count based on the population parameters or just a random number; then we combine the two datasets and check that the result is identical to the actual data, saving the population probability if it is.

oldData2 <- data.frame(Yes=c(9,750), N=c(10,1000)); oldData2
#   Yes    N
# 1   9   10
# 2 750 1000
simulateHonestPoll <- function(n, pr) { rbinom(1, size=n, prob=pr) }
simulateFauxPoll <- function(n, pr, switchp) {
    if (sample(c(TRUE, FALSE), 1, prob=c(switchp, 1-switchp))) {
        rbinom(1, size=n, prob=pr) } else { round(runif(1, min=0, max=n)) } }
poll_abc <- replicate(1000000, {
 priorp <- runif(1, min=0, max=1)
 switch <- 0.5
 n1 <- 10
 n2 <- 1000
 data1 <- data.frame(Yes=simulateHonestPoll(n1, priorp), N=n1)
 data2 <- data.frame(Yes=simulateFauxPoll(n2, priorp, switch), N=n2)
 newData <- rbind(data1, data2)
 if (all(oldData2 == newData)) { return(priorp) }
 })
resultsABC <- unlist(Filter(function(x) { !is.null(x) }, poll_abc))
summary(resultsABC)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 0.5256471 0.7427098 0.7584650 0.7860109 0.8133581 0.9765648

The results are interesting, and in this case the summary statistics are misleading: the median is indeed around 75% (as we would expect, since that’s the result of the highly precise poll which has a 50% chance of being the truth), but the mean is pulled away towards the original 90% estimate, and plotting the histogram, bimodality emerges. The posterior reports that there’s still a lot of credibility to the 90% point estimate, but between the original diffuseness of that posterior (leaving a lot of probability to lower responses including, say, 75%) and the high certainty that, if accurate, the responses will definitely be close to 75%, it winds up peaked a little higher than 75% (since even if the larger poll is honest, the earlier poll did still find 9⁄10). So it’s not so much that we think the best estimate of the true population rate really is 79% (indeed, the mode is more like 75%, but it could easily be far away from 75% and in the 90%s); rather, we would need to think more about what we want to do with this posterior before we decide how to summarize it.


ABC is slow, and would not scale to more hypothetical polls unless we abandoned exact ABC inference and began using approximate ABC (entirely possible in this case: instead of strict equality between the original and simulated data, we’d instead accept a sample of p if the simulated dataset’s fractions were within, say, 1% of the originals); and the simulation would need to be rewritten anyway.

MCMC can handle this if we think of our problem as a mixture model: our problem is that we have poll data drawn from two clusters/distributions—one cluster is the true population distribution of opinion, and the other cluster just spits out noise. We have one observation which we know is from the first, reliable distribution (the 9⁄10 poll result), and one observation for which we’re not sure which of the two it came from (750⁄1000), but we do know that the indexing probability for mixing the two distributions is 50%.

In JAGS, we write down a model in which dcat flips between 1 and 2 if the cluster is not known, specifying which distribution a sample came from and its theta probability, and then we infer the thetas for both distributions. Of course, we only care about the first distribution’s theta, since the second one is noise.

model1 <- "model {
  for (i in 1:N) {
   y[i] ~ dbinom(theta[i], n[i])
   theta[i] <- thetaOfClust[ clust[i] ]
   clust[i] ~ dcat(pi[])
  }
  pi[1]  <- switch[1]
  pi[2]  <- switch[2]
  thetaOfClust[1] ~ dbeta(1,1)
  thetaOfClust[2] ~ dunif(0,1)
  }"
j1 <- autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldData2), y=oldData2$Yes, n=oldData2$N, switch=c(0.5, 0.5), clust=c(1,NA))); j1
# ...      Lower95  Median Upper95    Mean       SD Mode      MCerr MC%ofSD SSeff    AC.10   psrf
# theta[1] 0.70582 0.75651 0.97263 0.77926  0.07178   --   0.001442       2  2478  0.12978 1.0011
# theta[2] 0.72446 0.75078 0.77814 0.75054 0.013646   --  0.00009649     0.7 20000 0.009458      1

Sure enough, we get a good match with the ABC estimate: a mean estimate for the population distribution of 78%, with a very wide 95% CI and a clearly bimodal distribution with a huge spike at 75%. Since the MCMC mixture model looks completely different from the imperative simulation-based model, the consistency in estimates & distributions gives me some confidence in the results being right.

So we can see how we should update our beliefs—by a perhaps surprising amount towards the unreliable datapoint. The original data was too weak to strongly resist the allure of that highly precise poll.
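Since there is only a single unknown parameter, the mixture posterior can also be checked by brute force, evaluating the unnormalized posterior on a grid: the likelihood is Binom(9; 10, θ) × [0.5·Binom(750; 1000, θ) + 0.5·1⁄1001], the noise component being uniform over the 1001 possible counts (an illustrative Python sketch):

```python
from math import comb

# Brute-force grid check of the ABC & JAGS estimates. If the big poll is honest,
# its likelihood is Binomial(1000, theta); if fake, its count is uniform over 0..1000.
C_SMALL, C_BIG = comb(10, 9), comb(1000, 750)

def big_honest(t):                      # P(750/1000 | theta, poll honest)
    return C_BIG * t**750 * (1 - t)**250

N = 20_000
grid = [(i + 0.5) / N for i in range(N)]
small = [C_SMALL * t**9 * (1 - t) for t in grid]        # P(9/10 | theta)
w = [s * (0.5 * big_honest(t) + 0.5 / 1001) for t, s in zip(grid, small)]
Z = sum(w)
mean = sum(t * x for t, x in zip(grid, w)) / Z
p_honest = sum(s * 0.5 * big_honest(t) for t, s in zip(grid, small)) / Z
print(round(mean, 2), round(p_honest, 2))   # ~0.78; ~0.67 probability the big poll was honest
```

The grid mean matches the ABC/JAGS means, and as a bonus the same sums give the posterior probability that the big poll was honest (~2⁄3), which is exactly the weight of the 75% spike in the bimodal posterior.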

Weakening heuristic?

We might try to think of it this way: half the time, the large poll means nothing whatsoever, it contains 0% or no information about the population at all; while the other half of the time, it is exactly what it seems to be and 100% informative; so doesn’t that mean that on average we should treat it as containing half the information we thought it did? And the information is directly based on the sample size: a sample 5x as big contains 5x as much information. So perhaps in this case of all-or-nothing accuracy, we could solve it easily by simply weakening the weight put on the unreliable information and shrinking the claimed sample size—instead of treating it as 750 of 1000, treat it as 375⁄500; and if it had been 75,000 of 100,000, convert it to 37,500 of 50,000. This is a simple and intuitive shortcut, but if we think about what the binomial will return as the unreliable poll increases in size, or if we look at the results…

switch <- 0.5
oldData3 <- data.frame(Yes=c(9,(750*switch)), N=c(10,(1000*switch)))
b2 <- bayes.binom.test(sum(oldData3$Yes), sum(oldData3$N)); b2
#   Bayesian First Aid binomial test
# data: sum(oldData3$Yes) and sum(oldData3$N)
# number of successes = 384, number of trials = 510
# Estimated relative frequency of success:
#   0.75
# 95% credible interval:
#   0.71 0.79
# The relative frequency of success is more than 0.5 by a probability of >0.999
# and less than 0.5 by a probability of <0.001

Unfortunately, this doesn’t work, because it doesn’t preserve the bimodal aspect of the posterior, and we get a unimodal distribution ever concentrating on its mean, wiping out the existence of the 0.90 peak. If our untrustworthy poll had instead, say, reported 750,000 out of 1 million, that should only make the peak at 0.75 look like a needle—it should be unable to affect the mass around 0.9, because it doesn’t matter if the data is 100 or 1 million or 1 billion, it still only has a 50% chance of being true. It’s a little hard to see this since the mean frequency of 0.75 is fairly close to the mean of 0.78 from the ABC, and we might write this off as approximation error in either the ABC estimate or BFA’s MCMC, but if we look at the 95% CI and note that 0.9 is not inside it, or if we plot the posterior (plot(b2)), then the absence of bimodality jumps out. So this trick doesn’t work.

Dysgenics power analysis

Current dysgenic estimates predict that genotypic IQ in the West is falling at a substantial rate, amounting to around half a standard deviation or more over the past century, by 1. reducing the frequency at which intelligence-increasing genetic variants occur (through natural selection against such variants) and 2. increasing the number of new and potentially harmful genetic mutations (increasing mutation load). Estimates are produced indirectly by surveying reproductive rates or by trying to show decreases in phenotypic traits associated with intelligence; it would obviously be preferable to examine dysgenic effects directly, by observing decreases in frequencies or increases in mutation load in a large sample of Western genetic information such as SNP arrays or whole-genomes (respectively). Such direct testing of dysgenics hypotheses is becoming increasingly feasible due to the exponential decrease in SNP & whole-genome sequencing costs creating large datasets (some publicly available) and the recent identification of some intelligence genes. It remains unclear, however, how large these datasets must be to overcome sampling error and yield informative estimates of changes in frequencies or mutation load; datasets like PGP or SSGAC may still be too small to investigate dysgenics. I consider the effect size estimates and, under some simple models, derive power calculations & power simulations of how large a dataset would be required to have an 80% chance of detecting a dysgenic effect: to detect the decrease in intelligence SNPs using SNP data, n≥30,000; to detect the increase in mutation load in whole genomes, n≥160.
I then compare to available datasets: the effect on SNPs can be detected by a large number of existing proprietary databases, but there are no public databases which will be large enough in the foreseeable future; the effect on mutation load, on the other hand, can be detected using solely the currently publicly available dataset from PGP. So I conclude that while only the proprietary databases can directly test dysgenic theories of selection for the foreseeable future, there is an opportunity to analyze PGP genomes to directly test the dysgenic theory of mutation load.

The dysgenics hypothesis argues that due to observed reproductive patterns where the highly educated or intelligent tend to have fewer offspring, genotypic IQ (the upper bound on phenotypic IQs set by genes, and the sort of thing measured by a polygenic score) will gradually decline. If dysgenics is true, then it is an extremely important phenomenon, as important as many things that get far more attention like lead remediation; but, to paraphrase, just because a problem is important does not mean it is worth working on or researching or discussing if there is no chance of making progress—if the data is hopelessly compromised by many systematic biases which would cause false positives, or if the data is too scanty to overcome random error, or if analyses are so flexible that they could deliver any answer the partisan wishes.

Phenotypic data will, in all probability, never allow for a clear & decisive answer to the question of whether dysgenics exists or matters, as long-term comparisons are roughly as credible as noting that global piracy rates have declined while global warming increases, or that paracetamol consumption rates have increased in tandem with Alzheimer’s rates; only direct examination of genetics will deliver the decisive answer. It would be nice to have an idea of how much genetic data we would need to overcome random error (and hence, whether it’s possible to make progress in the near future), which we can answer by doing some statistical power analyses.

Changes over time in genetics could be due to changes within a particular race or population (for example, in all white Englishmen), or could be due to population movements like one group replacing or migrating or merging into another (population genetics has revealed innumerable complex examples historically). The latter is possible thanks to the increasing availability of ancient DNA, often made public for researchers; so one could observe very long-term trends with cumulatively large effects (implying that small samples may suffice), but this approach has serious issues in interpretation and questions about how comparable intelligence variants may be across groups or throughout human evolution. With the former, there is less concern about interpretation due to greater temporal and ethnic homogeneity—if a GWAS on white northern Europeans in 2013 turns up intelligence variants and produces a useful polygenic score, it will almost certainly work on samples of white northern Europeans in 1900 too—but because the time-scale is so short, the effect will be subtler and harder to detect. Nevertheless, a result within a modern population would be much more credible, so we’ll focus on that.

How subtle and hard to detect an effect are we talking about here? Woodley 2012 summarizes a number of estimates:

Early in the 20th century, negative correlations were observed between intelligence and fertility, which were taken to indicate a dysgenic fertility trend (e.g. Cattell, 1936; Lentz, 1927; Maller, 1933; Sutherland, 1929). Early predictions of the rate of dysgenesis were as high as between 1 and 1.5 IQ points per decade (Cattell, 1937, 1936)…In their study of the relationship between intelligence and both completed and partially completed fertility, van Court and Bean (1985) reported that the relationships were predominantly negative in cohorts born between the years 1912 and 1982…Vining (1982) was the first to have attempted an estimation of the rate of genotypic IQ decline due to dysgenesis with reference to a large national probability cohort of US women aged between 24 and 34 years in 1978. He identified significant negative correlations between fertility and IQ ranging from −.104 to −.221 across categories of sex, age and race, with an estimated genotypic IQ decline of one point a generation. In a 10-year follow-up study using the same cohort, Vining (1995) re-examined the relationship between IQ and fertility, now that fertility was complete, finding evidence for a genotypic IQ decline of .5 points per generation. Retherford and Sewell (1988) examined the association between fertility and IQ amongst a sample of 9000 Wisconsin high-school graduates (graduated 1957). They found a selection differential that would have reduced the phenotypic IQ by .81 points per generation under the assumption of equal IQs for parents and children. With an estimate of .4 for the additive heritability of IQ, they calculated a more modest genotypic decline of approximately .33 points.
The study of Ree and Earles (1991), which employed the NLSY, suggests that once the differential fertility of immigrant groups is taken into consideration, the phenotypic IQ loss amongst the American population may be greater than .8 of a point per generation. Similarly, in summarizing various studies, Herrnstein & Murray (1994) suggest that “it would be nearly impossible to make the total [phenotypic IQ decline] come out to less than one point per generation. It might be twice that.” (p. 364). Loehlin (1997) found a negative relationship between the fertility of American women aged 35-44 in 1992 and their educational level. By assigning IQ scores to each of six educational levels, Loehlin estimated a dysgenesis rate of .8 points in one generation. Significant contributions to the study of dysgenesis have been made by Lynn, 1996 (see also: 2011), whose book Dysgenics: Genetic deterioration in modern populations provided the first estimates of the magnitude of dysgenesis in Britain over a 90 year period, putting the phenotypic loss at .069 points per year (about 1.7 points a generation assuming a generational length of 25 years). In the same study, Lynn estimated that the genotypic IQ loss was 1.64 points per generation between 1920 and 1940, which reduced to .66 points between 1950 and the present. Subsequent work by Lynn has investigated dysgenesis in other populations. For example, Lynn (1999) found evidence for dysgenic fertility amongst those surveyed in the 1994 National Opinion Research Center survey, which encompassed a representative sample of American adults, in the form of negative correlations between the intelligence of adults aged 40+ and the number of children and siblings. Lynn estimates the rate of dysgenesis amongst this cohort at .48 points per generation.
In a more recent study, Lynn and van Court (2004) estimated that amongst the most recent US cohort for which fertility can be considered complete (i.e. those born in the years 1940-1949), IQ has declined by .9 points per generation. At the country level, Lynn and Harvey (2008) have found evidence of a global dysgenesis of around .86 points between 1950 and 2000, which is projected to increase to 1.28 points in the period from 2000 to 2050. This projection includes the assumption that 35% of the variance in cross-country IQ differences is due to the influence of genetic factors. A subsequent study by Meisenberg (2009) found that the fertility differential between developed and developing nations has the potential to reduce the phenotypic world population IQ mean by 1.34 points per decade (amounting to a genotypic decline of .47 points per decade assuming Lynn & Harvey’s 35% estimate). This assumes present rates of fertility and pre-reproductive mortality within countries. Meisenberg (2010) and Meisenberg and Kaul (2010) have examined the factors through which intelligence influences reproductive outcomes. They found that amongst the NLSY79 cohort in the United States, the negative correlation between intelligence and fertility is primarily associated with g and is mediated in part by education and income, and to a lesser extent by more “liberal” gender attitudes. From this Meisenberg has suggested that in the absence of migration and with a constant environment, selection has the potential to reduce the average genotypic IQ of the US population by between .4, .8 and 1.2 points per generation.

All of these estimates are genetic selection estimates: indirect estimates inferred from IQ being a heritable trait and then treating it as a natural selection/breeding process, where a trait is selected against based on phenotype, and how fast the trait decreases in each succeeding generation depends on how genetic the trait is and how harsh the selection is. So variation in these estimates (quoted estimates per generation range from .3 to 3+) is due to sampling error, differences in populations or time periods, expressing the effect by year or generation, the estimate used for heritability, reliability of IQ estimates, and whether additional genetic effects are taken into account—for example, Woodley et al 2015 finds -.262 points per decade from selection, but in another paper argues that paternal mutation load must be affecting intelligence by ~-0.84 in the general population, giving a total of ~-1 per decade.
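The breeder’s-equation arithmetic underlying all of these selection estimates is simple: the genotypic response per generation is R = h²·S (heritability times selection differential). A trivial sketch using the Retherford & Sewell figures quoted above:

```python
# Breeder's equation: genotypic response per generation R = h^2 * S.
# Retherford & Sewell 1988: phenotypic selection differential S = -0.81 IQ points,
# additive heritability h^2 = 0.4:
h2 = 0.4
S = -0.81
R = h2 * S
print(round(R, 2))   # -0.32, i.e. the ~.33-point genotypic decline they report
```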

Dysgenics effects should be observable by looking at genomes & SNP data with known ages/birth-years and looking for increases in total mutations or decreases in intelligence-causing SNPs, respectively.

Selection on SNPs

Without formally meta-analyzing all dysgenics studies, a good starting point for the selection effect seems to be a genotypic decline of 1 IQ point per decade, or 0.1 points per year, or 0.007 standard deviations per year (0.7 standard deviations per century).
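The unit conversions, assuming the usual IQ norm of SD = 15 (the 0.007 and 0.7 figures are rounded versions):

```python
# Converting the assumed dysgenic selection rate between units (IQ: SD = 15 points):
points_per_decade = 1.0
points_per_year = points_per_decade / 10      # 0.1 points/year
sd_per_year = points_per_year / 15            # ~0.0067 SDs/year, rounded to 0.007
sd_per_century = sd_per_year * 100            # ~0.67 SDs/century, i.e. ~0.7
print(round(sd_per_year, 4), round(sd_per_century, 2))
```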

The most common available genetic data is SNP data, which sequences only the variants most common in the general population; SNP data can look at the effects of genetic selection, but will not look at new mutations (since a new mutation would not be common enough to be worth putting onto a SNP chip).

Given a large sample of SNP data, a birth year (or age), and a set of binary SNP variables which cause intelligence (coded as 1 for the good variant, 0 for the others), we could formulate this as a multivariate regression: glm(cbind(SNP1, SNP2, ... SNP_N) ~ Year, family=binomial), and see if the year variable has a negative sign (increasing passage of time predicts lower levels of the good genes); if it does, this is evidence for dysgenics. Better yet, given information about the effect size of the SNPs, we could for each person’s SNPs sum the net effects and then regress on a single variable, giving more precision rather than looking for independent effects on each SNP: lm(Polygenic_score ~ Year). Again, a negative sign on the year variable is evidence for dysgenics.

Directional predictions are weak, and in this case we have quantitative predictions of how big the effects should be. Most of the public genomes I looked at seem to have the earliest birthdates in the 1950s or so; genomes can come from any age person (parents can give permission, and sequencing has been done prenatally), so the maximum effect is the difference between 1950 and 2015, which is 65*0.007=0.455 standard deviations (but most genomes will come from intermediate birth-dates, which are less informative about the temporal trend—in the optimal experimental design for measuring a linear trend, half the samples would be from 1950 and the other half from 2015). If the genetic total is going down by 0.455 SDs, how much do the frequencies of all the good genes go down?

One simple model of genotypic IQ would be to treat it as a large number of alleles of equal binary effect: a binomial sum of n = 10,000 1/0 variables with P = 50% (population frequency) is reasonable. (For example, GIANT has found a large number of variants for height, and heritability estimates indicate that SNPs explain much more of the variance than the top Rietveld hits currently account for; this specific model is loosely inspired by .) In such a model, the average value of the sum is of course n*p=5000, and the SD is sqrt(n*p*(1-p)) or sqrt(10000*0.5*0.5) or 50. Applying our estimate of the dysgenic effect, we would expect the sum to fall by 0.455*50=22.75, so we would be comparing two populations, one with a mean of 5000 and a dysgenic mean of 4977.25. If we were given access to all alleles from a sample of 1950 and 2015 genomes, and so could construct the sum, how hard would it be to tell the difference? In this case, the sum is normally distributed, as there are more than enough alleles to create normality, so we can just treat this as a two-sample normally-distributed comparison of means (a t-test), and we already have a directional effect size in mind, -0.455 SDs, so:

power.t.test(delta=0.455, power=0.8, alternative="one.sided")
#      Two-sample t test power calculation
#               n = 60.4155602
# ...

A total n = 120 is doable, but it is unlikely that we will know all intelligence genes anytime soon; instead, we know a few. A new mean of 4977 implies that since the total number of alleles is the same but the mean has fallen, the frequencies must also fall, and the average frequency falls from 0.5 to 4977.25/10000=0.497725. To go to the other extreme, if we know only a single gene and we want to test a fall in frequency from 0.50 to 0.4977, we need infeasibly many more samples:

power.prop.test(p1=0.5, p2=0.497725, power=0.8, alternative="one.sided")
#      Two-sample comparison of proportions power calculation
#               n = 597,272.2524
# ...

1.2m datapoints would be difficult to get, and so a single-gene test would be unhelpful; further, a single gene could change frequencies solely through genetic drift, without the change being due to dysgenic pressures.
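The power.prop.test figure can be double-checked with the textbook normal-approximation sample-size formula for a one-sided two-sample comparison of proportions (an illustrative Python sketch):

```python
from statistics import NormalDist

# Per-group n for a one-sided two-sample test of proportions, normal approximation:
#   n = (z_{1-alpha} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
z = NormalDist().inv_cdf
alpha, power = 0.05, 0.80
p1, p2 = 0.5, 0.497725
n = (z(1 - alpha) + z(power))**2 * (p1*(1 - p1) + p2*(1 - p2)) / (p1 - p2)**2
print(round(n))   # ~597,000 per group (power.prop.test: 597,272), hence ~1.2m total
```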

We know a number of genes, though: Rietveld gives 4 good hits, so we can look at a polygenic score from that. They are all of similar effect size and frequency, so we’ll continue under the same assumptions of 1/0 and P = 50%. The non-dysgenic average score is 4*0.5=2, sd=sqrt(4*0.5*0.5)=1. (Naturally, the SD is much larger relative to the mean than before, because with so few random variables there is much less averaging-out.) The predicted shift is from frequencies of 0.5 to 0.497, so the dysgenic scores should be 4*0.497=1.988, sd=sqrt(4*0.497*0.503)=0.999. The difference of 0.012 on the reduced polygenic score is d=((2-1.988) / 0.999)=0.012, giving a necessary sample size of:

power.t.test(delta=0.012006003, power=0.8)
#      Two-sample t test power calculation
#               n = 108904.194
# ...

So the 4 hits do reduce the necessary sample size, but it’s still not feasible to require 218k SNP datasets (unless you are 23andMe or SSGAC or an entity like that).

In the current GWAS literature, there are ~9 hits we could use, but the upcoming SSGAC paper promises: “We identified 86 independent SNPs associated with EA (p < 5E-8).” So how much would 86 improve over 4?

  • mean old: 86*0.5=43
  • sd old: sqrt(86*0.5*0.5)=4.6368
  • mean new: 86*0.497=42.742
  • sd new: sqrt(86*0.497*(1-0.497))=4.6367
  • so d=(43-42.742)/4.63675=0.0556
power.t.test(delta=((43-42.742)/4.63675), power=0.8)
#      Two-sample t test power calculation
#               n = 5071.166739
# ...

So with 86 hits, the necessary total sample drops from ~218k to ~10.1k.
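These power.t.test figures can likewise be approximated in closed form, per-group n ≈ 2·(z_{1-α/2} + z_power)²⁄d²; a sketch using the 86-hit effect size:

```python
from statistics import NormalDist

# Closed-form normal approximation to power.t.test (two-sided alpha = 0.05, 80% power):
z = NormalDist().inv_cdf
d = (43 - 42.742) / 4.63675          # Cohen's d for the 86-hit polygenic score, ~0.0556
n = 2 * (z(0.975) + z(0.80))**2 / d**2
print(round(n))   # ~5070 per group (power.t.test: 5071), i.e. ~10.1k total
```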

To work backwards: we know that with 1 hit, we need a million SNP datasets (infeasible for any but the largest proprietary databases, who have no interest in studying this hypothesis), and with all hits we need more like 200 genomes (entirely doable with just publicly available datasets like PGP), but how many hits do we need to work with an in-between amount of data, like the ~2k genomes with ages I guess may be publicly available now or in the near future?

power.t.test(n=1000, power=0.8)
#     Two-sample t test power calculation
#              n = 1000
#          delta = 0.1253508704
## search for the number of hits whose effect size matches that detectable delta:
hits <- 437
mean1=hits*0.5; sd1=sqrt(hits*0.5*0.5);
mean2=hits*0.497; sd2=sqrt(hits*0.497*(1-0.497));
d=(mean1-mean2)/mean(c(sd1,sd2)); d
# [1] 0.1254283986

With a polygenic score drawing on 437 hits, a sample of 2k (1000 per group) suffices to detect the maximum decrease.
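The ~437-hit figure also follows in closed form under this equal-effect model: a k-hit score falls by 0.003·k with SD ≈ 0.5·√k, so Cohen’s d = 0.003·k/(0.5·√k) = 0.006·√k, which we can invert (a sketch):

```python
from math import sqrt

# Invert d = 0.006*sqrt(k) for the d = 0.1254 detectable with n = 1000/group at 80% power:
d_needed = 0.1253508704            # from power.t.test(n=1000, power=0.8)
hits = (d_needed / 0.006)**2
print(round(hits))   # ~436, agreeing with the 437 found by search
```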

This is pessimistic, because the 10k alleles are not all of the same effect size, and GWAS studies inherently will tend to find the largest effects first. So the first 4 (or 86) hits are worth the most. The distribution of effects is probably something like an exponential distribution: many small near-zero effects and a few large ones. Rietveld 2013 released the betas for all SNPs, and the beta estimates can be plotted; each estimate is imprecise and there are artifacts in the beta sizes (SSGAC confirms that they were rounded to 3 decimals), but the distribution looks like a radioactive half-life graph, an exponential distribution. With a mean of 1, we can simulate creating a set of 10k effect sizes which are exponentially distributed, whose sum has mean 5000 and SD close to (but larger than) 50, and which closely mimics the binomial model:

effects <- sort(rexp(10000)/1, decreasing=TRUE)
genomeOld <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), 10000, replace = TRUE), 0, effects) }
mean(replicate(10000, sum(genomeOld())))
# [1] 5000.270218
sd(replicate(10000, sum(genomeOld())))
# [1] 69.82652816
genomeNew <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), 10000, replace = TRUE), 0, effects) }

With a dysgenic effect of -0.455 SDs, that’s a fall in the sum of random exponentials of ~31, which agrees closely with the difference in polygenic genome scores:

mean(replicate(10000, sum(genomeOld() - genomeNew())))
# [1] 29.75354558

For each draw from the old and new populations, we can take the first 4 alleles, which were the ones assigned the largest effects, and build a weak polygenic score and compare means. For example:

polyNew <- replicate(1000, sum(genomeNew()[1:4]))
polyOld <- replicate(1000, sum(genomeOld()[1:4]))
t.test(polyOld, polyNew, alternative="greater")
#   Welch Two Sample t-test
# data:  polyOld and polyNew
# t = 0.12808985, df = 1995.8371, p-value = 0.8980908
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.7044731204  0.8029267301
# sample estimates:
#   mean of x   mean of y
# 17.72741040 17.67818359

Or to mimic 86 hits:

t.test(replicate(1000, sum(genomeOld()[1:86])), replicate(1000, sum(genomeNew()[1:86])))
#     Welch Two Sample t-test
# t = 1.2268929, df = 1997.6307, p-value = 0.2200074
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -0.8642674547  3.7525210076
# sample estimates:
#   mean of x   mean of y
# 244.5471658 243.1030390

Using the exponential simulation, we can do a parallelized power analysis: simulate draws (i = 300) & tests for a variety of sample sizes, to get an idea of what sample size we need for decent power with 86 hits.

library(parallel) # warning: mclapply() forks, so this will not parallelize on Windows
library(plyr)     # for ldply()
library(ggplot2)  # for qplot() & stat_smooth()

genomeOld <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), length(efft), replace = TRUE), 0, efft) }
genomeNew <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), length(efft), replace = TRUE), 0, efft) }

simulateStudy <- function(n, hits) {
        effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
        polyOld <- replicate(n, sum(genomeOld(effects)))
        polyNew <- replicate(n, sum(genomeNew(effects)))
        t <- t.test(polyOld, polyNew, alternative="greater")
        return(data.frame(N=n, P=t$p.value, PO.mean=mean(polyOld), PO.SD=sd(polyOld),
                          PN.mean=mean(polyNew), PN.SD=sd(polyNew))) }

hits <- 86
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }

sampleSizes <- seq(500, 5000, by=100)
iters <- 300
powerExponential <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
#        N              P                  PO.mean               PO.SD              PN.mean               PN.SD
#  Min.   : 500   Min.   :0.000000000   Min.   :222.5525   Min.   :23.84966   Min.   :221.2894   Min.   :23.04527
#  1st Qu.:1600   1st Qu.:0.002991554   1st Qu.:242.8170   1st Qu.:26.46606   1st Qu.:241.3242   1st Qu.:26.45508
#  Median :2750   Median :0.023639517   Median :247.2059   Median :27.04467   Median :245.7044   Median :27.04299
#  Mean   :2750   Mean   :0.093184735   Mean   :247.3352   Mean   :27.06300   Mean   :245.8298   Mean   :27.05750
#  3rd Qu.:3900   3rd Qu.:0.107997575   3rd Qu.:251.7787   3rd Qu.:27.64103   3rd Qu.:250.2157   3rd Qu.:27.63241
#  Max.   :5000   Max.   :0.997322161   Max.   :276.2614   Max.   :30.67000   Max.   :275.7741   Max.   :30.85065
powerExponential$Power <- powerExponential$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponential); powers
# 1   500 0.2133333333
# 2   600 0.2833333333
# 3   700 0.2833333333
# 4   800 0.3133333333
# 5   900 0.3033333333
# 6  1000 0.3400000000
# 7  1100 0.4066666667
# 8  1200 0.3833333333
# 9  1300 0.4133333333
# 10 1400 0.4166666667
# 11 1500 0.4700000000
# 12 1600 0.4600000000
# 13 1700 0.4666666667
# 14 1800 0.4733333333
# 15 1900 0.5233333333
# 16 2000 0.5366666667
# 17 2100 0.6000000000
# 18 2200 0.5900000000
# 19 2300 0.5600000000
# 20 2400 0.6066666667
# 21 2500 0.6066666667
# 22 2600 0.6700000000
# 23 2700 0.6566666667
# 24 2800 0.7133333333
# 25 2900 0.7200000000
# 26 3000 0.7300000000
# 27 3100 0.7300000000
# 28 3200 0.7066666667
# 29 3300 0.7433333333
# 30 3400 0.7133333333
# 31 3500 0.7233333333
# 32 3600 0.7200000000
# 33 3700 0.7766666667
# 34 3800 0.7933333333
# 35 3900 0.7700000000
# 36 4000 0.8100000000
# 37 4100 0.7766666667
# 38 4200 0.8000000000
# 39 4300 0.8333333333
# 40 4400 0.8466666667
# 41 4500 0.8700000000
# 42 4600 0.8633333333
# 43 4700 0.8166666667
# 44 4800 0.8366666667
# 45 4900 0.8666666667
# 46 5000 0.8800000000
qplot(N, Power, data=powers)  + stat_smooth()
Power for a two-group com­par­i­son of old and new SNP datasets for test­ing a hypoth­e­sis of dys­gen­ics

So for a well-powered two-group comparison of 1950 & 2015 SNP datasets using 86 SNPs, we would want ~4000 in each group for a total n = 8000. We do have nontrivial power even at a total n = 1000 (500 in each group yields 21% power), but a non-statistically-significant result will be difficult to interpret; if one wanted to do that, reporting a Bayes factor from a Bayesian hypothesis test would make much more sense, to express clearly whether the (non-definitive) data is evidence for or against dysgenics.
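A minimal sketch of such a Bayesian test (my addition: it uses the BIC-approximation Bayes factor of Wagenmakers 2007 on simulated stand-in scores, to avoid extra dependencies; the BayesFactor package's default JZS t-test would be the more standard choice):

```r
# Hypothetical stand-ins for the 1950/2015 polygenic-score sums simulated above
set.seed(2015)
old <- rnorm(500, mean=244.5, sd=27)
new <- rnorm(500, mean=243.1, sd=27)
y <- c(old, new); group <- rep(0:1, each=500)
# BF10 ~= exp((BIC_null - BIC_alternative)/2); BF10 < 1 is evidence *for* the null
bf10 <- exp((BIC(lm(y ~ 1)) - BIC(lm(y ~ group))) / 2)
bf10
```

With a true difference this small relative to the noise, the Bayes factor will typically favor the null, which is exactly the kind of graded statement a bare p-value cannot make.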

This is still too opti­mistic since we assumed the opti­mal sce­nario of only very old and very new genomes, while avail­able genomes are more likely to be dis­trib­uted fairly uni­formly between 1950 and 2015. Per “Opti­mal design in psy­cho­log­i­cal research”, McClel­land 1997, we expect a penalty of ~2x in sam­ple size effi­ciency in going from the opti­mal two-group extreme end­points design to sam­ples being uni­formly dis­trib­uted (due to much of our sam­ple size being wasted on esti­mat­ing small effects) and so we would expect our sam­ple size require­ment to at least dou­ble to around n = 16000, but we can do a power sim­u­la­tion here as well. To get the effect size for each year, we sim­ply split the fre­quency decrease over each year and gen­er­ate hypo­thet­i­cal genomes with less of a fre­quency decrease uni­formly dis­trib­uted 1950-2015, and do a lin­ear regres­sion to get a p-value for the year pre­dic­tor:

hits <- 86
sampleSizes <- seq(8000, 30000, by=1000)
iters <- 100
genome <- function(effects) {
  t <- sample(c(1:(2015-1950)), 1)
  decreasedFrequency <- 0.5 - (((0.5-0.497)/(2015-1950)) * t)
  geneFlips <- sample(c(FALSE,TRUE), prob=c(decreasedFrequency, 1-decreasedFrequency), replace = TRUE, length(effects))
  geneValues <- ifelse(geneFlips, 0, effects) # as in genomeOld/genomeNew: P(variant present) = decreasedFrequency
  return(data.frame(Year=1950+t, PolygenicScore=sum(geneValues))) }
simulateStudy <- function(n, hits) {
        effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
        d <- ldply(replicate(n, genome(effects), simplify=FALSE))
        l <- lm(PolygenicScore ~ Year, data=d)
        p <- anova(l)$`Pr(>F)`[1]
        return(data.frame(N=n, P=p, PO.mean=predict(l, newdata=data.frame(Year=1950)),
                                           PN.mean=predict(l, newdata=data.frame(Year=2015)))) }
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
powerExponentialDistributed <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponentialDistributed)
powerExponentialDistributed$Power <- powerExponentialDistributed$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponentialDistributed); powers
#        N Power
# 1   8000  0.27
# 2   9000  0.32
# 3  10000  0.35
# 4  11000  0.33
# 5  12000  0.41
# 6  13000  0.34
# 7  14000  0.41
# 8  15000  0.48
# 9  16000  0.55
# 10 17000  0.62
# 11 18000  0.55
# 12 19000  0.60
# 13 20000  0.69
# 14 21000  0.61
# 15 22000  0.65
# 16 23000  0.63
# 17 24000  0.71
# 18 25000  0.67
# 19 26000  0.71
# 20 27000  0.74
# 21 28000  0.70
# 22 29000  0.79
# 23 30000  0.83
qplot(N, Power, data=powers)  + stat_smooth()
Power to detect dys­gen­ics effect with SNP sam­ples spread over time

In this case, the power simulation suggests the need for triple rather than double the data, and so a total of n = 30,000 to be well-powered.

Mutation load

The paternal mutation load should show up as an increase (~70 new mutations per generation, 35 years per generation, so ~2 per year on average) over the past century, while genetic selection will operate by reducing the frequency of variants which increase intelligence. If there are ~70 new mutations per generation (of which ~2 are harmful), and there is no longer any purifying selection so that all 70 will tend to remain present, how does that compare to existing mutation-load averages and, more importantly, standard deviations?

A muta­tion load review leads me to some hard fig­ures from Simons et al 2014 (sup­ple­ment) using data from Fu et al 2012; par­tic­u­larly rel­e­vant is fig­ure 3, the num­ber of sin­gle-nu­cleotide vari­ants per per­son over the Euro­pean-Amer­i­can sam­ple, split by esti­mates of harm from least to most like­ly: 21345 + 15231 + 5338 + 1682 + 1969 = 45565. The sup­ple­men­tary tables gives a count of all observed SNVs by cat­e­go­ry, which sum to 300209 + 8355 + 220391 + 7001 + 351265 + 10293 = 897514, so the aver­age fre­quency must be 45565/897514=0.05, and then the bino­mial SD will be sqrt(897514*0.05*(1-0.05))=206.47. Con­sid­er­ing the two-sam­ple case of 1950 vs 2015, that’s an increase of 130 total SNVs (65*2), which is 0.63S­Ds, hence:
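Transcribing that arithmetic:

```r
# Figures from Simons et al 2014 supplement (Fu et al 2012 data):
snvsPerPerson <- 21345 + 15231 + 5338 + 1682 + 1969              # = 45565 SNVs per person
totalSNVs     <- 300209 + 8355 + 220391 + 7001 + 351265 + 10293  # = 897514 observed SNVs
snvsPerPerson / totalSNVs                    # average frequency, ~0.05
binomSD <- sqrt(totalSNVs * 0.05 * (1-0.05)) # binomial SD, ~206.47
(65*2) / binomSD                             # 1950-vs-2015 increase of 130 SNVs = ~0.63 SDs
```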

power.t.test(d=(130/206), power=0.8)
#      Two-sample t test power calculation
#               n = 40.40035398
# ...

A total of n = 80.

This particular setup for the two-sample test can be seen as a linear model with the optimal design of allocating half the sample to each extreme (see again McClelland 1997); but more realistically, there is an even distribution across years, in which case the penalty is 2x and n = 160.
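Checking both totals (my arithmetic, following McClelland's variance argument: a uniform spread over years halves the predictor variance, shrinking the standardized effect by sqrt(2)):

```r
# Optimal two-extreme (1950 vs 2015) design: total n ~= 80
nOptimal <- 2 * power.t.test(d=130/206, power=0.8)$n; nOptimal
# Uniform spread over the years: d shrinks by sqrt(2), so total n ~= 160
nUniform <- 2 * power.t.test(d=(130/206)/sqrt(2), power=0.8)$n; nUniform
```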


There are some poten­tial prob­lems:

  1. Range restric­tion: in many IQ-re­lated stud­ies, fail­ure to account for selec­tion effects yield­ing a lim­ited range of IQs may ; this is true in gen­eral but par­tic­u­larly com­mon in IQ stud­ies because selec­tion on IQ (eg sam­ples of con­ve­nience using only col­lege stu­dents) is so uni­ver­sal in human soci­ety

    This may not be such a large issue when deal­ing with poly­genic scores; even severe IQ selec­tion effects will increase poly­genic scores only some­what because the poly­genic scores explain so lit­tle of IQ vari­ance in the first place.

  2. Self­-s­e­lec­tion by age: if peo­ple pro­vid­ing genetic data are not ran­dom sam­ples, then there may be pseudo-trends which can mask a real dys­genic trend or cre­ate a pseudo-dys­genic trend where there is none. For exam­ple, if young peo­ple buy­ing genome or SNP data tend to be above-av­er­age in intel­li­gence and sci­en­tific inter­est (which anec­do­tally they cer­tainly do seem to be), while old peo­ple tend to get genomes or SNP data due to health prob­lems (and oth­er­wise have aver­age lev­els of intel­li­gence and thus poly­genic score), then in com­par­ing young vs old, one might find not a dys­genic but a pseudo-eu­genic trend instead! Con­verse­ly, it could be the other way around, if much fewer elderly get genetic data and younger peo­ple are more con­cerned about future health or are going along with a fad, pro­duc­ing a pseudo-dys­genic effect instead (eg in the PGP genome data, there seem to be dis­pro­por­tion­ately more PhDs who are quite elder­ly, while younger par­tic­i­pants are a more scat­ter­shot sam­ple from the gen­eral pop­u­la­tion; prob­a­bly relat­ing to the cir­cum­stances of PGP’s found­ing & Har­vard home).

    This is prob­a­bly an issue with data­bases that rely on vol­un­tary indi­vid­ual con­tri­bu­tions, such as PGP, where selec­tion effects have free play. It would be much less of an issue with lon­gi­tu­di­nal stud­ies where moti­va­tions and par­tic­i­pa­tion rates will not differ much by age. Since most dys­genic the­o­ries accept that recorded IQ scores have remained sta­ble over the 20th cen­tury and the decreases in genetic poten­tial either have not man­i­fested yet or have been masked by the Flynn effect & greater famil­iar­ity with tests & loss of some g-load­ing, one might rea­son that prox­ies like edu­ca­tional achieve­ment should be increas­ing through­out one’s sam­ple (since they are known to have increased), and a lack of such a trend indi­cates selec­tion bias.

Genetic data availability


The known pro­pri­etary data­bases have long been large enough to carry out either analy­sis, as well as count­less other analy­ses (but have failed to and rep­re­sent a ):

  1. The muta­tion load analy­sis requires a whole-genome sam­ple size small enough to have been car­ried out by innu­mer­able groups post-2009.

  2. For SNPs, an incom­plete list of exam­ples of pub­li­ca­tions based on large sam­ples:

The exist­ing pri­vate groups do not seem to have any inter­est in test­ing dys­gen­ics, with the pos­si­ble excep­tion of future GWAS stud­ies exam­in­ing fer­til­i­ty, one of which is men­tioned by :

At the time of writ­ing this review, Mills and her research team at the Uni­ver­sity of Oxford are cur­rently lead­ing a large con­sor­tium to engage in the first ever genome-wide asso­ci­a­tion search (GWAS) and meta-analy­sis of repro­duc­tive choice (age at first birth; num­ber of chil­dren), con­ducted in both men and women in over 50 data sets, with the results repli­cated in addi­tional datasets in a large sam­ple.

The hits in such a GWAS might over­lap with intel­li­gence hits, and if the mul­ti­ple hits increase intel­li­gence but decrease fer­til­ity or vice versa (as com­pared to decreas­ing or increas­ing both), that would be evi­dence for dys­gen­ics. Or, assum­ing the betas are report­ed, poly­genic scores for fer­til­ity and intel­li­gence could be esti­mated in inde­pen­dent sam­ples and checked for an inverse cor­re­la­tion.


There are a few sources of data, pri­mar­ily SNP data, which are freely avail­able to all users:

  1. : unus­able due to a delib­er­ate pol­icy deci­sion by 1000 Genomes to delete all phe­no­type data, includ­ing age; sim­i­lar is 69 Genomes. Both likely would be unus­able due to the diver­sity of the global sam­ple (there is no rea­son to think that dys­gen­ics pres­sures are oper­at­ing in every pop­u­la­tion at the same strength)

  2. OpenSNP: host­ing for user-pro­vided SNP & phe­no­type data with dumps avail­able; hosts ~2k SNP datasets, but only 270 users have birth-years

  3. SNPedia like­wise hosts SNP data (over­lap­ping with OpenSNP) and genome data, but a very small num­ber

  4. Genomes unzipped pro­vides a small amount of data

  5. DNA.LAND: claims n = 8k based on pub­lic par­tic­i­pa­tion & input (n = 43k as of ), but seems to then restrict access to a small set of researchers

  6. Exome Aggre­ga­tion Con­sor­tium: n = 61,486 exomes; phe­no­type data is unavail­able

  7. (PGP): prob­a­bly the sin­gle largest source of open SNP & genome data. ~1252 par­tic­i­pants have reg­is­tered birth­dates accord­ing to demographics.tsv, and their sta­tis­tics page’s graphs indi­cates <300 whole genomes and <1k SNPs. Phe­no­type data has been recently released as a SQLite data­base, mak­ing it eas­ier to work with.

    • Genomes: browsing the user lists for ‘Whole genome datasets’, I estimate a total of ~222; looking at the first and last 22 entries, 34 had ages/birth-years, so ~75% of the whole genomes come with the necessary birth-year data, indicating ~166 usable genomes for the purpose of testing dysgenics. With the most recent one uploaded on 2015-10-12, and the earliest recorded being 2011-09-16, that suggests the available genome number increases by ~0.25/day. 166 is uncomfortably close to the requirement for a well-powered test, and there may not be enough data to account for glitches in the data or allow for more complicated statistical testing, but if we wanted to double the available data, we’d only need to wait around 885 days or 2.5 years (or less, depending on whether the collapse in genome sequencing prices continues and prices drop below even the current $1k genomes).
    • SNPs: PGP has ~656 23andMe SNP datasets (the number of SNP datasets sourced from other providers is quite small so I didn’t include them), dated 2011-01-06–2015-10-21, so assuming the same birth-year percentage, 0.37 per day. Unfortunately, to get 30k SNP datasets through PGP, we would have to wait (linearly extrapolating) 291 years. (Making matters worse, in October 2015, 23andMe doubled its price and reduced the quality of SNP coverage, which will discourage many users and push other users to purchase whole-genome sequencing instead.)
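As a back-of-envelope check of that extrapolation (my arithmetic, assuming the ~75% birth-year coverage estimated for the whole genomes also applies to the SNP datasets):

```r
days <- as.numeric(as.Date("2015-10-21") - as.Date("2011-01-06")) # ~1749 days of uploads
usablePerDay <- (656 / days) * 0.75                               # ~0.28 usable datasets/day
(30000 / usablePerDay) / 365.25                                   # ~290 years to reach n = 30,000
```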

Power analysis for racial admixture studies of continuous variables

I con­sider power analy­sis of a genomic racial admix­ture study for detect­ing genetic group differ­ences affect­ing a con­tin­u­ous trait such as IQ in US African-Amer­i­cans, where ances­try is directly mea­sured by genome sequenc­ing and the com­par­isons are all with­in-fam­ily to elim­i­nate con­found­ing by pop­u­la­tion struc­ture or racism/colorism/discrimination. The nec­es­sary sam­ple size for well-pow­ered stud­ies is closely related to the aver­age size of differ­ences in ances­try per­cent­age between sib­lings, as the upper bound on IQ effect per per­cent­age is small, requir­ing large differ­ences in ances­try to detect eas­i­ly. A with­in-fam­ily com­par­i­son of sib­lings, due to the rel­a­tively small differ­ences in ances­try between sib­lings esti­mated from IBD mea­sure­ments of sib­lings, might require n > 50,000 pairs of sib­lings to detect pos­si­ble effects on IQ, an infea­si­ble sam­ple size. An alter­na­tive design focuses on increas­ing the avail­able ances­try differ­ences within a fam­ily unit by com­par­ing adoptees with sib­lings; the larger with­in-pop­u­la­tion stan­dard devi­a­tion of ances­try cre­ates larger & more eas­i­ly-de­tected IQ differ­ences. A ran­dom-effects meta-analy­sis of past admix­ture & ances­try stud­ies sug­gests the SD in het­ero­ge­neous sam­ples may range from 2% to 20% with a mean of 11% (95% pre­dic­tive inter­val), yield­ing sam­ple sizes of n > 20,000, n = 1100, and n = 500. Hence, an adop­tion study is prob­a­bly in the fea­si­ble range, with required sam­ple sizes com­pa­ra­ble to annual adop­tion rates among US African-Amer­i­cans.

Admixture studies examine racial phenotypic differences in traits such as blood pressure by comparing people with ancestry from multiple groups, and correlating differences in ancestry percentage with differences in the phenotype. So, for example, African-Americans have higher blood-pressure than white Americans, and most African-Americans have an average white ancestry of something like 20-25% (see later); if having 26% white ancestry predicts slightly lower blood pressure while 24% predicts higher, that suggests the difference is (as is currently believed) genetic; and this logic can be used to narrow down to specific chromosome regions, and has contributed to study of .

One application would be to thorny questions like potential group differences in non-medical traits like intelligence. The standard admixture design, requiring a few thousand subjects spanning the full range, might not necessarily work here because of the claimed environmental effects. A proposed resolution to the question is to do an admixture study comparing African-American siblings. Siblings are highly genetically related on average (50%) but in a randomized fashion due to recombination; so two siblings, including fraternal twins, born to the same parents in the same family in the same neighborhood going to the same schools, will nevertheless have many different variants, and will differ in how related they are: the average is 50% but it could be as low as 45% or as high as 55%. So given two siblings, they will differ slightly in their white ancestry, and if indeed white ancestry brings with it more intelligence variants, then the sibling with a higher white ancestry percentage ought to be slightly more intelligent on average, and this effect will have to be causal, as the inheritance is randomized and all other factors are equal by design. (A result using ancestry percentages measured in the general population, outside families, would be able to make far more powerful comparisons by comparing people with ~0% white ancestry to those with anywhere up to 100%, and would require small sample sizes, and such analyses have been done with the expected result, but they are ambiguous & totally unconvincing, as the correlation of greater whiteness with intelligence could easily be due to greater SES, or greater blackness could be a marker for recent immigration, or any of a number of confounds that exist.) This has historically been difficult or impossible, since how does one measure the actual ancestry of siblings?
But with the rise of cheap geno­typ­ing, pre­cise mea­sure of actual (rather than aver­age) ances­try can be done for <$100, so that is no longer an obsta­cle.

Sibling power analysis

How many sib­ling pairs would this require?

  • you are try­ing to regress IQ_difference ~ Ancestry_difference
  • the SD of the IQ differ­ence of sib­lings is known—it’s ~13 IQ points (non­shared envi­ron­ment + differ­ences in genet­ics)
  • of this, a small frac­tion will be explained by the small differ­ence in ances­try per­cent­age
  • the power will be deter­mined by the ratio of the sib­ling SD to the IQ-d­iffer­ence-due-to-ances­try-d­iffer­ence SD, giv­ing an effect size, which com­bined with the usual alpha=0.05 and beta=0.80, uniquely deter­mines the sam­ple size
  • IQ-d­iffer­ence-due-to-ances­try-d­iffer­ence SD will be the advan­tage of bet­ter ances­try times how much ances­try differs
  • if you knew the number of relevant alleles, you could calculate through the binomial the expected SD of sibling ancestry differences. As there are so many alleles, it will be almost exactly normal. So it’s not surprising that siblings overall, for all variants, are 50% IBD with a SD of ~4%.

If we treat it as simply as possible: an analogous height analysis (Visscher et al) says they measured 588 markers. So a binomial with 588 draws and p = 0.5 implies a variance of 147:

588 * 0.5*(1-0.5)
# [1] 147

and the SD around that is 12, which is ~8% of 147:

sqrt((588 * 0.5*(1-0.5)))
# [1] 12.12435565
12/147
# [1] 0.08163265306

Visscher does a more complicated analysis taking into account the closeness of the markers and gets a SD of 3.9% (equation 7: variance = 1/(16*L) - 1/(3*L^2), where L = 35):

L=35; sqrt(1/(16*L) - (1/(3*L^2)))
# [1] 0.03890508247

And theoretical modeling gives an expected sibling SD of 3.92%/3.84% (Table 2), which is nearly identical. So whatever the mean admixture is, I suppose it’ll have a similar SD of 4-8% of itself.

IIRC, African-Americans are ~25% admixed, so with a mean admixture of 25%, we would expect sibling differences to have an SD of ~4% of that, or a 1% difference.

If that 75% missing white ancestry accounts for 9 IQ points or 0.6 SDs, then each percentage of white ancestry would be worth 0.6/75 = 0.008 SDs.

So that SD of 1% more white ancestry yields an SD of 0.008 IQ SDs, which is superimposed on the full sibling difference of 0.866 SDs (13/15), giving a standardized effect size d of 0.008 / 0.866 = 0.0092.
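Collecting that chain of assumed values in one place:

```r
admixMean      <- 25   # mean European ancestry, in percentage points
sibSDfraction  <- 0.04 # sibling ancestry SD as a fraction of the mean (Visscher)
ancestryDiffSD <- admixMean * sibSDfraction        # ~1 percentage point
iqPerPercent   <- (9/15) / 75                      # 0.008 SDs per % of white ancestry
d <- (ancestryDiffSD * iqPerPercent) / (13/15); d  # ~0.0092
```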

Let me try a power sim­u­la­tion:

n <- 10000
siblings <- data.frame(
sibling1AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),

sibling2AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))

siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(0.008*15))
siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(0.008*15))
siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )

summary(siblings)
# ...
# siblingAncestryDifference siblingIQDifference
# Min.   :-5.370128122      Min.   :-68.2971343
# 1st Qu.:-0.932086950      1st Qu.:-11.7903864
# Median : 0.002384529      Median : -0.2501536
# Mean   : 0.007831583      Mean   : -0.4166863
# 3rd Qu.: 0.938513265      3rd Qu.: 11.0720667
# Max.   : 5.271052675      Max.   : 67.5569825
summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# ...Coefficients:
#                             Estimate Std. Error  t value  Pr(>|t|)
# (Intercept)               -0.4192761  0.1705125 -2.45892 0.0139525
# siblingAncestryDifference  0.3306871  0.1220813  2.70874 0.0067653
# Residual standard error: 17.05098 on 9998 degrees of freedom
# Multiple R-squared:  0.000733338,    Adjusted R-squared:  0.0006333913
# F-statistic: 7.337294 on 1 and 9998 DF,  p-value: 0.006765343
confint(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
#                                    2.5 %         97.5 %
# (Intercept)               -0.75351500523 -0.08503724643
# siblingAncestryDifference  0.09138308561  0.56999105507

admixtureTest <- function(n, alpha=0.05, ancestryEffect=0.008) {
 siblings <- data.frame(
     sibling1AncestryPercentage =pmax(0, rnorm(n, mean=25, sd=1)),
     sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),

     sibling2AncestryPercentage = pmax(0,rnorm(n, mean=25, sd=1)),
     sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))

 siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
 siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(ancestryEffect*15))
 siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
 siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )

 p <- summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]
 return(p < alpha) }

power <- function(n, iters=10000, n.parallel=8) {
    mean(unlist(mclapply(1:n.parallel, function(i) {
        replicate(iters/n.parallel, admixtureTest(n)) }))) }

# powers <- sapply(seq(100, 10000, by=10), power)
# [1] 0.0502
# [1] 0.0535
# [1] 0.0642
# [1] 0.2251
# [1] 0.6077

So the estimated sample size is extremely large, well into the scores of thousands of sibling pairs. This is large enough that it will be some time before biobanks or population samples are well-powered: participants may not have siblings, those siblings may not be included, only ~15% of the American population is AA, all participants must be sequenced, imperfect reliability of measurements can greatly increase the necessary sample size, and so on. If it requires n = 70,000 pairs, half of participants have a matching sibling, and the sample is drawn proportionally from the general population, that would imply a total sample size of almost 1m. (For comparison, that’s about twice the size of the , and the US cohort program aims for 1m total participants by 2020.)
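As a rough analytic cross-check (my approximation, not in the original: treat the slope test as a test of a correlation r ~= d ~= 0.0092 and apply the standard Fisher-z sample-size formula):

```r
r <- 0.0092
# pairs needed for 80% power at alpha = 0.05 to detect a correlation of r:
nPairs <- ((qnorm(0.975) + qnorm(0.80)) / atanh(r))^2 + 3
nPairs  # on the order of 90,000 sibling pairs: "scores of thousands"
```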

Sim­pli­fi­ca­tions aside, it is diffi­cult to see any way to bring this method down into the low thou­sands range, as that would require sib­lings to vary tremen­dously more in ances­try, have much more homo­ge­neous IQs than they do, or ances­try to be vastly more potent than it could pos­si­bly be.

Adoption power analysis

One pos­si­bil­ity would be to exam­ine a differ­ent pop­u­la­tion, per­haps one with more African ances­try and thus larger between-si­b­ling ances­try differ­ences and effects, such as ; but that would raise ques­tions about rel­e­vance to the USA. So another pos­si­bil­ity is to drop the idea of using only bio­log­i­cal sib­lings. Is there any way to have ances­try differ­ences as large as in the gen­eral pop­u­la­tion, but within a fam­i­ly? Half-si­b­lings come to mind but those more typ­i­cally tend to join the house­hold as older kids or teenagers, so aren’t so good. One pos­si­bil­ity is adoptees: there are a sub­stan­tial num­ber of African-Amer­i­can chil­dren adopted into other African-Amer­i­can house­holds (white par­ents adopt­ing black chil­dren is , described as “decreas­ing dra­mat­i­cally”, but still sub­stan­tial in total num­bers, at least 20,000), and even a notice­able num­ber of African chil­dren adopted abroad (14,800 from Ethiopia just 1999-2014, with more adop­tion from Nige­ria & the Con­go). The same logic of the with­in-fam­ily study should apply but to unre­lated sib­lings who will have far greater differ­ences in ances­try now (pos­si­bly any­where up to 50% if an African child is adopted into an African-Amer­i­can fam­ily with con­sid­er­able white ances­try & some luck). This would increase power dra­mat­i­cal­ly, per­haps enough to bring the study within the realm of near-fu­ture fea­si­bil­i­ty.

Examining adoptions of African children would not be a convincing way of establishing group differences, particularly for IQ, as there are many known environmental harms (eg pre-natal lack of iodine is known to cause large impairments in cognition which cannot be repaired later in life, and poor iodization is frequent in Africa), so while examining African adoptees would doubtless require a very small sample size, the results would be uninterpretable. So the more interesting case is instead examining AA adoptees/siblings, all of whom are raised in a wealthy (and iodine-sufficient) industrialized country.

In this case, we’re con­sid­er­ing a pair of an AA sib­ling with the same IQ & ances­try dis­tri­b­u­tions, as com­pared with adoptees who are either African (100% African ances­try) or like­wise have the same IQ/ancestry dis­tri­b­u­tions. Since the sib­ling & adoptee are unre­lat­ed, they effec­tively vary & differ as much as two ran­dom peo­ple from the gen­eral pop­u­la­tion would in IQ & African ances­try, except for shared-en­vi­ron­ment effects on IQ; shared-en­vi­ron­ment for adult IQ is rel­a­tively low, maybe 10% of vari­ance. So instead of an SD of 15, they would vary mod­er­ately less, like 14.23 points (sqrt(15^2 * 0.9)).

One assump­tion here is a shared mean: one would actu­ally expect, given the asso­ci­a­tion of lighter skin with higher wealth/SES and darker with lower wealth/SES, that the adopt­ing par­ents (and hence their bio­log­i­cal chil­dren) would be rel­a­tively high on Euro­pean ances­try, and con­verse­ly, the moth­ers giv­ing up chil­dren for adop­tion would be rel­a­tively low, so the expected differ­ence in ances­try is higher than sim­u­lat­ed. Assum­ing equal means, how­ev­er, is a con­ser­v­a­tive assump­tion since if such a cor­re­la­tion holds, the differ­ences will be larg­er, hence the ances­try effect sizes larg­er, hence smaller sam­ple sizes required. In the extreme ver­sion of this, the adop­tive fam­ily is white and so the ances­try differ­ence is max­i­mal (~99% vs ~20%), requir­ing even smaller sam­ple sizes, but at the cost of intro­duc­ing com­pli­ca­tions like whether there are inter­ac­tions with the white adop­tive fam­ily not present in an AA adop­tive fam­i­ly; in any case, such tran­s-ra­cial adop­tion is appar­ently unpop­u­lar now, so it may not come up much.

Mean population European ancestry & population standard deviation

Ances­try effects remain as before; the mean ances­try is not too impor­tant as long as it’s not near 0, but since adoptees are drawn from the gen­eral pop­u­la­tion, the ances­try SD must be adjusted but it’s unclear what the right SD here is—cited stud­ies range from 4% up to 11%, and this is a key para­me­ter for power (with 4%, then sib­ling and adoptee will tend to be quite sim­i­lar on ances­try per­cent­age & much more data will be required, but with 11% they will differ a good deal and make results stronger).

Reported figures from the genetics literature for European ancestry in US African-Americans range from 14% to 24%, reflecting both sampling error and various biases & self-selection & geographic/regional effects in the datasets:

Con­sid­er­ing just stud­ies with usable ances­try per­cent­ages, pop­u­la­tion SD, and n, and using inferred SDs from Sig­norel­lo:

admixture <- read.csv(stdin(), header=TRUE, colClasses=c("factor", "numeric", "numeric", "integer"))
"Study","Mean","SD","N"
"Halder et al 2008",0.143,0.133,136
"Ducci et al 2009",0.07,0.09,864
"Signorello et al 2010",0.071,0.08,379
"Bryc et al 2010",0.185,0.0465,365
"Nassir et al 2012",0.225,0.147,11712
"Bryc et al 2014",0.24,0.17,5269
"Kirkegaard et al 2016",0.17,0.11,140

# what is the standard error/precision of a population SD?
admixture$SD.SE <- (0.71*admixture$SD) / sqrt(admixture$N)
#                    Study        Mean                 SD                  N               SD.SE
#  Bryc et al 2010      :1   Min.   :0.0700000   Min.   :0.0465000   Min.   :  136.0   Min.   :0.0009644066
#  Bryc et al 2014      :1   1st Qu.:0.1070000   1st Qu.:0.0850000   1st Qu.:  252.5   1st Qu.:0.0016954481
#  Ducci et al 2009     :1   Median :0.1700000   Median :0.1100000   Median :  379.0   Median :0.0021739221
#  Halder et al 2008    :1   Mean   :0.1577143   Mean   :0.1109286   Mean   : 2695.0   Mean   :0.0034492579
#  Kirkegaard et al 2016:1   3rd Qu.:0.2050000   3rd Qu.:0.1400000   3rd Qu.: 3066.5   3rd Qu.:0.0047591374
#  Nassir et al 2012    :1   Max.   :0.2400000   Max.   :0.1700000   Max.   :11712.0   Max.   :0.0080973057
#  Signorello et al 2010:1

library(metafor)
r.mean <- rma(yi=Mean, sei=SD/sqrt(N), measure="SMD", ni=N, data=admixture); r.mean
# Random-Effects Model (k = 7; tau^2 estimator: REML)
# tau^2 (estimated amount of total heterogeneity): 0.0046 (SE = 0.0027)
# tau (square root of estimated tau^2 value):      0.0680
# I^2 (total heterogeneity / total variability):   99.82%
# H^2 (total variability / sampling variability):  566.51
# Test for Heterogeneity:
# Q(df = 6) = 3477.2614, p-val < .0001
# Model Results:
# estimate       se     zval     pval    ci.lb    ci.ub
#   0.1578   0.0258   6.1187   <.0001   0.1072   0.2083
predict(r.mean)
#    pred     se  ci.lb  ci.ub  cr.lb  cr.ub
#  0.1578 0.0258 0.1072 0.2083 0.0153 0.3003
r.sd <- rma(yi=SD, sei=SD.SE, measure="SMD", ni=N, data=admixture); r.sd
# Random-Effects Model (k = 7; tau^2 estimator: REML)
# tau^2 (estimated amount of total heterogeneity): 0.0018 (SE = 0.0011)
# tau (square root of estimated tau^2 value):      0.0425
# I^2 (total heterogeneity / total variability):   99.77%
# H^2 (total variability / sampling variability):  440.67
# Test for Heterogeneity:
# Q(df = 6) = 3819.2793, p-val < .0001
# Model Results:
# estimate       se     zval     pval    ci.lb    ci.ub
#   0.1108   0.0162   6.8587   <.0001   0.0792   0.1425
predict(r.sd)
#    pred     se  ci.lb  ci.ub  cr.lb  cr.ub
#  0.1108 0.0162 0.0792 0.1425 0.0216 0.2001

forest(r.mean, slab=admixture$Study)
forest(rma(yi=SD, sei=SD.SE, measure="SMD", ni=N, data=admixture), slab=admixture$Study)
Meta-analytic summary of US African-Americans' mean European ancestry percentage & population SD of that percentage

There is high heterogeneity, as expected, and the meta-analytic summaries are consistent with simply taking the means, so meta-analysis was not really necessary.

The issue of heterogeneity depends on how one wants to interpret these numbers: as the true latent African-American population mean/SD of European ancestry, or as a way to estimate the possible spread of sampling? In the former, the heterogeneity is a serious issue because it suggests the estimate may be badly biased or at least is highly imprecise; in the latter, it is both a curse and a benefit, since it implies that it is possible to recruit for genetics studies samples with a wide range of ancestry (thereby greatly increasing statistical power) but also that one might get unlucky & wind up with a very ancestry-homogeneous sample (if the sample turns out to have an SD as high as 20%, excellent; if it's as low as 7.9%, one is in trouble).

So for power analysis one might check the meta-analytic mean case, as well as the extremes of the prediction interval (a 95% CI around the SD/mean does not mean that 95% of the true effects, including the inherent heterogeneity, will fall in that interval): SDs of 2%, 11%, and 20%. (For any cost-benefit analysis or trying to optimize expenditures, one would want to work with the posterior distributions to average over everything, but for general informative purposes, those 3 are good parameters to check.)

Power simulation


library(parallel)

adopteeTest <- function(n, alpha=0.05, ancestryEffect=0.008, populationAncestryMean=0.1440, populationAncestrySD=0.1008, c=0.1) {
 unrelatedSiblingSD <- sqrt(15^2 * (1-c)) # subtract 10% of variance for same shared-environment
 siblings <- data.frame(
     sibling1AncestryPercentage = pmax(0, rnorm(n, mean=populationAncestryMean*100, sd=populationAncestrySD*100)),
     sibling1NonancestryIQ = rnorm(n, mean=0, sd=unrelatedSiblingSD),

     adopteeAncestryPercentage = pmax(0, rnorm(n, mean=populationAncestryMean*100, sd=populationAncestrySD*100)),
     adopteeNonancestryIQ = rnorm(n, mean=0, sd=unrelatedSiblingSD))

 siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
 siblings$adopteeTotalIQ  <- with(siblings, adopteeNonancestryIQ + adopteeAncestryPercentage*(ancestryEffect*15))
 siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - adopteeAncestryPercentage)
 siblings$siblingIQDifference       <- with(siblings, sibling1TotalIQ - adopteeTotalIQ)

 # p-value on the ancestry-difference slope; return whether it reached significance:
 p <- summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]
 return(p < alpha) }

power <- function(n, sd, iters=10000, n.parallel=8) {
    mean(unlist(mclapply(1:n.parallel, function(i) {
    replicate(iters/n.parallel, adopteeTest(n, populationAncestrySD=sd)) }))) }

ns <- seq(100, 10000, by=100)
powerLow  <- sapply(ns, function(n) { power(n, sd=0.0216)})
powerMean <- sapply(ns, function(n) { power(n, sd=0.1108)})
powerHigh <- sapply(ns, function(n) { power(n, sd=0.2001)})

library(ggplot2); library(gridExtra)
pl <- qplot(ns, powerLow)  + coord_cartesian(ylim = c(0,1))
pm <- qplot(ns, powerMean) + coord_cartesian(ylim = c(0,1))
ph <- qplot(ns, powerHigh) + coord_cartesian(ylim = c(0,1))
grid.arrange(pl, pm, ph, ncol=1)
Power analysis for detecting the effect of European ancestry on IQ in an adoptee family study with predicted low/mean/high population variance in ancestry percentage (higher variance = larger statistical power = fewer samples required)

So for the worst-case SD, the required sample size is unclear but n > 20,000 pairs; for the mean SD, n = 1100 pairs; for the high SD, n = 500 pairs. The latter two are feasible amounts for population registries or adoption-focused cohort studies. Thus genome adoption studies, combined with the much less powerful but more common within-sibling comparisons, are capable of delivering precise answers to long-standing questions about the origins of group differences with moderate sample sizes.

Operating on an aneurysm

In the excellent neurosurgery memoir Do No Harm: Stories of Life, Death, and Brain Surgery (Henry Marsh, 2014), chapter 2, "Aneurysm", there is a passage on weighing the costs of action and inaction:

"A thirty-two-year-old woman," he said tersely. "For surgery today. Had some headaches and had a brain scan." As he talked a brain scan flashed up on the wall.

…"It's an unruptured aneurysm, seven millimetres in size," Fiona—the most experienced of the registrars—said. "So there's a point zero five per cent risk of rupture per year according to the international study published in 1998." "And if it ruptures?" "Fifteen per cent of people die immediately and another thirty per cent die within the next few weeks, usually from a further bleed and then there's a compound interest rate of four per cent per year."

…If we did nothing the patient might eventually suffer a haemorrhage which would probably cause a catastrophic stroke or kill her. But then she might die years away from something else without the aneurysm ever having burst. She was perfectly well at the moment, the headaches for which she had had the scan were irrelevant and had got better. The aneurysm had been discovered by chance. If I operated I could cause a stroke and wreck her—the risk of that would probably be about four or five per cent. So the acute risk of operating was roughly similar to the life-time risk of doing nothing. Yet if we did nothing she would have to live with the knowledge that the aneurysm was sitting there in her brain and might kill her any moment.

Reading this, I was a little surprised by Marsh's evaluation given those specific numbers. Intuitively, it did not seem to me that a single risk of 5% was anywhere near as bad as a lifelong risk of 0.5% for a 32-year-old woman who would probably live another 50 years—the one number is 10x bigger than the other, but the other applies 50x over, and a quick heuristic for the total probability of many independent small probabilities is to just sum them up, suggesting that the risk of the untreated aneurysm was much worse (50*0.005=0.25, and 0.25>0.05). So I thought after I finished reading the book, I would work it out a little more accurately.


Specifically, this is a 32yo woman and the UK female life expectancy was ~80yo in 2015, so she had ~48 years left. The consequence of the aneurysm bursting is a large chance of instant death or else severe disability with death soon to follow; the consequence of surgery going wrong is also instant death or severe disability, presumably with a high chance of death soon following, so it looks like we can assume that the bad outcome in either case is the same. What is the probability of the aneurysm never bursting in all 48 years? (1-0.005)^48 = 0.786, or a probability of bursting of 21%. Since 21% is 4x larger than 5% and the consequences are similar, this suggests that the risks are not "roughly similar" and that it looks much worse to not operate.
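The cumulative-risk arithmetic can be double-checked in a couple of lines of R:

```r
## Cumulative lifetime risk of rupture at 0.5%/year over 48 remaining years,
## compared against the one-time ~5% surgical risk:
annual <- 0.005; years <- 48
p.burst <- 1 - (1 - annual)^years
round(p.burst, 3)
# [1] 0.214
round(p.burst / 0.05, 1)
# [1] 4.3
```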

Expected loss

But that’s just the risk of an event, not the expected loss:

  1. In the case of doing surgery immediately, the expected loss, with years treated equally and a 5% instant risk from the operation, is simply 48 * 0.05 = 2.4 years of life; all 48 years are risked on a single throw of the surgical dice, but after that she is safe.

  2. In the case of doing nothing and letting the aneurysm stay with a 0.5% annual risk from non-operation, it's not as simple as 48 * 0.21 = 10.1 years, because you cannot die of an aneurysm if you died in a previous year. The risk will instead follow a negative binomial distribution (number of years until 1 failure), and the loss is then the 48 years minus however many she actually got. That's not the same as the expectation of the negative binomial, which in this case is 200 years (the expectation of a negative binomial with 1 failure and a success rate of 1-0.005 is 1/(1-(1-0.005)) = 200), and she will die of other causes before then, in which case the aneurysm turned out to be harmless.

    We can simulate many draws from the negative binomial, ignore as 0 any time where the aneurysm struck after her life expectancy of 48 more years is past, hold onto the losses, and calculate the mean loss: mean(sapply(rnbinom(10e4, 1, 0.005), function(years) { if(years>48) { 0; } else { 48-years; }})) → 5.43.

So in terms of expected loss, surgery still wins, although by a smaller margin than the raw risk comparison suggested: the expected loss from surgery is ~2.3x smaller (2.4 vs 5.43 years).

QALY/DALY adjustment

What about adjusting for older years being less valuable? We might say that the surgery looks unfairly good because we are ignoring how its losses are front-loaded in her 30s, some of the best years of one's life, and treating a loss of her 33rd year as being as bad as a loss of her 48th year. In terms of age weighting, DALYs usually use a 3% annual discounting; QALYs & DALYs differ in some ways, but for this analysis I think we can treat them as equivalent and use the DALY age-discounting to calculate our QALYs. So we can redo the two expected losses including the discounting to get:

  1. Surgery: 0.05 * sum((1-0.03)^(0:48)) → 1.291
  2. No surgery: mean(unlist(sapply(sapply(rnbinom(10e4, 1, 0.005), function(years) { if(years>48) { 0; } else { 48-years; }}), function(yr) { sum((1-0.03)^(0:yr)); }))) → 4.415

After weighing the surgery's loss of high-quality early years against the aneurysm's loss of mostly elderly years, the surgery's superiority comes out at ~3.4x (4.415 / 1.291), for a gain of 3.124 discounted years. (And if we include the mental wellbeing of the woman, no longer living under the shadow of the aneurysm, the surgery looks better still.)

How sensitive is the surgical superiority to the parameters?

  • Surgical risk: a 4x increase in risk to 20% would create parity
  • Aneurysm risk: if the annual risk of rupture were as low as 0.04% rather than 0.5%, then there would be parity
  • Life expectancy & discount rate: no change will reverse the ordering

It seems extremely unlikely that Marsh could be as wrong about the surgical risk as to mistake 5% for 20%, especially for an operation he says he used to do routinely, and it also seems unlikely that the study on the annual risk of an aneurysm bursting could be off by as much as 10x, so the difference is solid.


Finally, having surgery is much more expensive than not having it. Surgery is always expensive, and neurosurgery undoubtedly so—elsewhere in the book, Marsh quotes an American neurosurgeon's estimate of $100,000 for a particularly complex case. Clipping an aneurysm surely cannot cost that much (being both much simpler and done in a more efficient healthcare system), but it's still not going to be trivial. Does the cost of aneurysm surgery outweigh the benefit?

To convert the DALY loss to a dollar loss, we could note that UK GDP per capita (PPP) was ~$38,169 in 2013, so the gain from surgery would be (4.415 - 1.291) * 38169 = $119k, well above the $100k worst-case cost. Or more directly: the UK NHS prefers to pay <£20,000 per QALY and will generally reject treatments which cost >£30,000 per QALY as of 2007 (implying QALYs are valued at somewhat less than £30,000); the median US 2008 hospital cost for clipping an aneurysm was $36,188 or ~£23,500; and the gain is 3.124 QALYs, for ~£7500/QALY—so clipping the aneurysm in this case definitely clears the cost-benefit threshold (as we could have guessed from the fact that, in the anecdote, the NHS allows her to have the surgery).
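The cost-benefit arithmetic, in R:

```r
## QALY gain from surgery, valued at UK GDP per capita (PPP), and the cost per
## QALY of a ~£23,500 clipping operation (all figures from the text):
qalyGain <- 4.415 - 1.291
round(qalyGain * 38169)  # dollar value of the discounted years gained
# [1] 119240
round(23500 / qalyGain)  # pounds per QALY
# [1] 7522
```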

After accounting for the loss of years, the differing values of years, and the cost of surgery, the surgery still comes out as substantially better than not operating.

The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example

Randomized experiments require more subjects the more variable each datapoint is, in order to overcome the noise which obscures any effects of the intervention. Reducing noise enables better inferences with the same data, or less data to be collected, which can be done by balancing observed characteristics between control and experimental datapoints. A particularly dramatic example of this approach is running experiments on identical twins rather than regular people, because twins vary far less from each other than random people do. In 1931, the great statistician Student noted problems with an extremely large (n = 20,000) Scottish experiment in feeding children milk (to see if they grew more in height or weight), and claimed that the experiment could have been done far more cost-effectively, with an extraordinary >95% fewer children, if it had been conducted using twins. He, however, did not provide any calculations or data demonstrating this. I revisit the issue and run a power calculation on height indicating that Student's claims were correct and that the experiment would have required ~97% fewer children if run with twins. This reduction is not unique to the Scottish experiment: in general, one can expect a reduction of ~89% using twins rather than regular people.
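A sketch of where such a reduction comes from (the within-pair correlation r = 0.9 here is an illustrative assumption, not a figure from the milk experiment): a twin design analyzes within-pair differences, whose variance is 2(1-r) times the population variance, so the required sample size shrinks by roughly a factor of (1-r):

```r
## Twin-design power sketch; r = 0.9 is an assumed illustrative within-pair correlation.
r <- 0.9; d <- 0.2 # small standardized effect of milk on growth
## two independent groups of unrelated children:
n.independent <- 2 * power.t.test(delta=d, sd=1, power=0.8)$n
## twin pairs, one twin treated per pair: equivalent to a one-sample t-test on
## within-pair differences, which have SD sqrt(2*(1-r)):
n.twins <- 2 * power.t.test(delta=d, sd=sqrt(2*(1-r)), power=0.8, type="one.sample")$n
saving <- 1 - (n.twins / n.independent)
round(saving, 2) # fraction of children saved: ~0.9
```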

Due to length, this has been split out.

RNN metadata for mimicking individual author style

Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. An experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.

Due to length, this has been split out.


Monte Carlo tree search in R

An implementation in R, using data.tree, of a simple Monte Carlo tree search algorithm (using Thompson sampling rather than UCT). This MCTS assumes binary win/loss (1/0) terminal rewards with no intermediate rewards/costs, so it cannot be used to solve general MDPs, and it does not expand leaf nodes in the move tree passed to it. (I also suspect parts of it are implemented wrong, though it reaches the right answer in a simple Blockworld problem and seems OK in a Tic-Tac-Toe problem. I have since come to understand the approach better and would probably drop the painful explicit tree manipulation in favor of an indirect recursive approach.)

## MCTS helper functions:
library(data.tree)

playOutMoves <- function(move, state, actions) {
  for (i in 1:length(actions)) {
     state <- move(state, actions[i])$State }
  return(state) }

playOutRandom <- function(move, state, actions, timeout=1000, verbose=FALSE) {
 action <- sample(actions, 1)
 turn <- move(state, action)
 if(verbose) { print(turn); }
 if (turn$End || timeout==0) { return(turn$Reward) } else {
                               playOutRandom(move, turn$State, actions, timeout=timeout-1, verbose) } }

createTree <- function(plys, move, moves, initialState, tree=NULL) {
 root <- is.null(tree)
 if (root) { tree <- Node$new("MCTS", win=0, loss=0) }
 if (plys != 0) {
  for(i in 1:length(moves)) {
    x <- tree$AddChild(moves[i], win=0, loss=0)
    createTree(plys-1, move, moves, initialState, tree=x) } }
 # cache the state at each leaf node so we don't have to recompute each move as we later walk the tree to do a rollout
 if (root) { tree$Do(function(node) { p <- node$path; node$state <- playOutMoves(move, initialState, p[2:length(p)]); }, filterFun = isLeaf) }
 return(tree) }

mcts <- function (tree, randomSimulation, rollouts=1000) {
 replicate(rollouts, {
   # Update the posterior sample for each node based on current statistics and use Thompson sampling.
   # With a uniform beta prior (Beta(1,1)), updating on binomial (win/loss) data is conjugate,
   # with a simple closed-form posterior: Beta(1+win, 1+loss).
   # So we sample directly from that posterior distribution for Thompson sampling:
   tree$Do(function(node) { node$Thompson <- rbeta(1, 1+node$win, 1+node$loss) })
   # find & run 1 sample:
   node <- treeWalk(tree)
   rollout <- randomSimulation(node$state)
   if(rollout==1) { node$win <- node$win+1; } else { node$loss <- node$loss+1; }

   # propagate the new leaf results back up the tree towards the root:
   tree$Do(function(x) { x$win  <- Aggregate(x, "win",  sum); x$loss <- Aggregate(x, "loss", sum) }, traversal = "post-order")
   })
 invisible(tree) }

## walk the game tree by picking the branch with highest Thompson sample down to the leaves
## and return the leaf for a rollout
treeWalk <- function(node) {
    if(length(node$children)==0) { return(node); } else {
        children <- node$children
         best <- which.max(sapply(children, function(n) { n$Thompson; } ))
        treeWalk(children[[best]]) } }

mctsDisplayTree <- function(tree) {
    tree$Do(function(node) { node$P <- node$win / (node$win + node$loss) } )
    tree$Sort("P", decreasing=TRUE)
    print(tree, "win", "loss", "P", "Thompson") }

## Blockworld simulation
## 0=empty space, 1=agent, 2=block, 3=goal point
blockActions <- c("up", "down", "left", "right")
## initial layout reconstructed for illustration: agent (1) in the top-right corner,
## a block (2) directly below it, and the goal point (3) at [5,5]:
blockInitialState <- matrix(ncol=5, nrow=5, byrow=TRUE,
   data=c(0,0,0,0,1,
          0,0,0,0,2,
          0,0,0,0,0,
          0,0,0,0,0,
          0,0,0,0,3))
blockMove <- function(state, direction) {
   if(state[5,5] == 2) { return(list(State=state, Reward=1, End=TRUE)) }
   position <- which(state == 1, arr.ind=TRUE)
   row <- position[1]; col <- position[2]
   rowNew <- 0; colNew <- 0
   switch(direction,
     # if we are at an edge, no change
     up    = if(row == 1) { rowNew<-row; colNew<-col; } else { rowNew <- row-1; colNew <- col; },
     down  = if(row == 5) { rowNew<-row; colNew<-col; } else { rowNew <- row+1; colNew <- col; },
     left  = if(col == 1) { rowNew<-row; colNew<-col; } else { rowNew <- row;   colNew <- col-1; },
     right = if(col == 5) { rowNew<-row; colNew<-col; } else { rowNew <- row;   colNew <- col+1; })
   # if there is not a block at the new position, make the move
   if (state[rowNew,colNew] != 2) {
      state[row,col] <- 0
      state[rowNew,colNew] <- 1
      return(list(State=state, Reward=0, End=FALSE))
       } else {
               state[rowNew,colNew] <- 1
               state[row,col] <- 0
               switch(direction,
                 # if the block is at the edge it can't move
                 up    = if(rowNew == 1) { } else { state[rowNew-1,colNew] <- 2 },
                 down  = if(rowNew == 5) { } else { state[rowNew+1,colNew] <- 2 },
                 left  = if(colNew == 1) { } else { state[rowNew,colNew-1] <- 2 },
                 right = if(colNew == 5) { } else { state[rowNew,colNew+1] <- 2 } )
                 # a block on the magic 5,5 point means a reward and reset of the playing field
                 if(state[5,5] == 2) { return(list(State=state, Reward=1, End=TRUE)) } else { return(list(State=state, Reward=0, End=FALSE)) } } }

## Blockworld examples:
# blockMove(blockInitialState, "left")
# blockMove(blockInitialState, "down")
# blockMove(blockInitialState, "right")$State
# blockMove(blockMove(blockInitialState, "right")$State, "down")
# blockMove(blockMove(blockMove(blockInitialState, "down")$State, "down")$State, "down")
# playOutMoves(blockMove, blockInitialState, c("down", "down", "down"))
# playOutRandom(blockMove, blockInitialState, blockActions)

tree <- createTree(2, blockMove, blockActions, blockInitialState)
mcts(tree, function(state) { playOutRandom(blockMove, state, blockActions) })

tree2 <- createTree(3, blockMove, blockActions, blockInitialState)
mcts(tree2, function(state) { playOutRandom(blockMove, state, blockActions) })

## Tic-Tac-Toe
tttActions <- 1:9
tttInitialState <- matrix(ncol=3, nrow=3, byrow=TRUE, data=0)
tttMove <- function(state, move) {
   move <- as.integer(move)
   # whose move is this? Player 1 moves first, so if the number of pieces are equal, it must be 1's turn:
   player <- 0;  if(sum(state == 1) == sum(state == 2)) { player <- 1 } else { player <- 2}

   # check move is valid:
   if(state[move] == 0) { state[move] <- player }

   ## enumerate all possible end-states (rows, columns, diagonals): victory, or the board is full and it's a tie
   victory <- any(c(
       all(state[,1] == player),
       all(state[1,] == player),
       all(state[,2] == player),
       all(state[2,] == player),
       all(state[,3] == player),
       all(state[3,] == player),
       all(as.logical(c(state[1,1], state[2,2], state[3,3]) == player)),
       all(as.logical(c(state[1,3], state[2,2], state[3,1]) == player)))) # anti-diagonal: [1,3],[2,2],[3,1]
   tie <- all(state != 0)

   # if someone has won and the winner is player 1, then a reward of 1
   if(victory) { return(list(State=state, Reward=as.integer(player==1), End=TRUE)) } else {
    if(tie) { return(list(State=state, Reward=0, End=TRUE)) } else {
      return(list(State=state, Reward=0, End=FALSE)) } } }

## Tic-Tac-Toe examples:
# tttMove(tttMove(tttMove(tttInitialState, 5)$State, 9)$State, 2)
# playOutMoves(tttMove, tttInitialState, c(5, 9, 2))
# playOutRandom(tttMove, tttInitialState, tttActions, verbose=TRUE)

treeTTT <- createTree(2, tttMove, tttActions, tttInitialState)
mcts(treeTTT, function(state) { playOutRandom(tttMove, state, tttActions) })
## hypothetical: if opponent plays center (5), what should be the reply?
treeTTT2 <- createTree(2, tttMove, tttActions, tttMove(tttInitialState, 5)$State)
mcts(treeTTT2, function(state) { playOutRandom(tttMove, state, tttActions) })

Candy Japan A/B test

Due to length, this has been split out.

DeFries-Fulker power analysis

DeFries-Fulker (DF) extremes analysis

generateSiblingPair <- function(ID=TRUE) {
   ## Population mean 100, SD 15; let's make family means distributed normally too;
   ## heritability 0.8, shared environment 0.1, siblings share half of genes on average + shared environment,
   ## so a pair of siblings has 1 - (0.8*0.5+0.1) = 0.5 of the variability of the general population.
   parental <- mean(c(rnorm(1,mean=100,sd=15*0.8), rnorm(1,mean=100,sd=15*0.8)))
   siblings <- rnorm(2, mean=parental, sd=15*(1 - (0.8*0.5+0.1)))
   ## Siblings will tend to vary this much, unless they are, lamentably, one of the, say,
   ## 5% struck by mutational lightning and reduced to an IQ of, let's say, 80
   if(ID) { siblings <- ifelse(rbinom(2,1,prob=0.05)==1, rnorm(2, mean=80, sd=15), siblings) }
   return(c(max(siblings), min(siblings))) }
generateSiblingPairs <- function(n,ID=TRUE) { as.data.frame(t(replicate(n, generateSiblingPair(ID=ID)))) }
## dataset with lightning:
## (`rescale` is assumed here to be a small helper standardizing columns to a given mean/SD:)
rescale <- function(d, mean=0, sd=1) { as.data.frame(scale(d) * sd + mean) }
df <- round(rescale(generateSiblingPairs(1000000, ID=TRUE), mean=5, sd=2))
## floor/ceiling at 1/9 for everyone:
df[df$V1>9,]$V1 <- 9
df[df$V1<1,]$V1 <- 1
df[df$V2>9,]$V2 <- 9
df[df$V2<1,]$V2 <- 1

## dataset without:
df2 <- round(rescale(generateSiblingPairs(1000000, ID=FALSE), mean=5, sd=2))
df2[df2$V1>9,]$V1 <- 9
df2[df2$V1<1,]$V1 <- 1
df2[df2$V2>9,]$V2 <- 9
df2[df2$V2<1,]$V2 <- 1

hist(df$V1 - df$V2)
hist(df2$V1 - df2$V2)

## mixture modeling:
library(flexmix)
## check k=1 vs k=2 on df, where k=2 is ground truth:
g1.1 <- flexmix(I(V1-V2) ~ 1, k=1, data=df)
g1.2 <- flexmix(I(V1-V2) ~ 1, k=2, data=df)
summary(g1.1); summary(g1.2)

## check k=1 vs k=2 on df2, where k=1 is ground truth:
g2.1 <- flexmix(I(V1-V2) ~ 1, k=1, data=df2)
g2.2 <- flexmix(I(V1-V2) ~ 1, k=2, data=df2)
summary(g2.1); summary(g2.2)

Inferring mean IQs from SMPY/TIP elite samples

Samples taken from the extremes of mixtures of distributions can have very different properties than random samples, such as the tail effect of wildly disproportionate representation of one distribution due to order statistics/threshold selection. This can be used to infer differing means. I demonstrate working backwards from the racial composition of TIP/SMPY samples of extremely (1-in-10,000) gifted youth to estimate the overall racial means, which are consistent with the known racial means and hence an unbiased selection process, using ABC to infer Bayesian credible intervals on the estimated means.

The properties of statistical distributions can be very different from the properties of specific subsets of those distributions in counterintuitive ways. A point drawn from an extreme will exhibit "regression to the mean", a phenomenon which routinely trips people up. Another common example is that a small difference in means for many distributions can lead to large differences in subsets.

For example, male and female average heights differ by a relatively small amount, inches at most. So in a random sample, plenty of women will be taller than men, and vice versa. However, if we instead ask the sex of the tallest person in the sample, it will often be male, and the larger the sample, the more certain we can be that it will be male, and that the top X% by height will be male. Likewise, if we wanted to start a basketball league and recruited the tallest 100 people in the country, this small mean difference would show up as our entire basketball league turning out to be male. (And since height is highly heritable, we may find out that many of them are related!) What seemed like a small difference becomes a large one; we could have worked it out in advance if we had thought about it.
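This tail effect is easy to simulate directly (the height means/SDs below are merely illustrative):

```r
## Small mean difference, extreme subset: sex of the tallest people in a big sample.
set.seed(2014)
n <- 100000
sex    <- rep(c("M","F"), each=n)
height <- c(rnorm(n, mean=175, sd=7),  # males (illustrative cm values)
            rnorm(n, mean=162, sd=7))  # females
sex[which.max(height)]   # the single tallest person: essentially always "M"
top100 <- order(height, decreasing=TRUE)[1:100]
mean(sex[top100] == "M") # fraction of the 'basketball league' which is male: ~1
```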

Reasoning from the general to the particular turned out to be tricky in this case because we were dealing with extreme values rather than random samples—1 basketball player chosen by height from thousands of people. Many things of great interest turn out to be like that: we are interested in the extremes much more than the expectation. Running a 2-hour marathon is an extreme on athleticism; winning the Nobel is an extreme on scientific accomplishment; being drafted into the NBA is an extreme on height; being admitted to MIT/Stanford/Harvard is an extreme on intelligence; murdering someone is an extreme on violence; winning an Academy Award is an extreme on acting success. When we ask questions like, "why does the world record in this sport keep being shattered" or "why are so many NBA players related" or "how good can we expect the best chess player to be in 10 years" or "does this racial composition prove bias" or "how much more important are the best authors in literature than obscurer figures" or "why do so few women win the Fields Medal", we're asking extreme value questions whose answers may be counterintuitive—and the answer may be as simple as the shape of distributions, and a slightly lower mean here or a slightly higher standard deviation there. (Working backwards from a sample selected for passing a threshold to a mean can be called "the method of limits" or "the method of thresholds".)

The Makel et al 2016 study describes the accomplishments of the TIP sample: 259 children selected for their intelligence by taking the highest-scorers out of 425,000 adolescents (usually <13yo) taking the SAT starting in 1981, representing the top 0.01% of the test-takers. The TIP sample parallels the better-known SMPY sample, which also selected extremely intelligent adolescents, who were included in a longitudinal sample. It's frequently suggested, based on anecdotal evidence or some biased convenience samples, that more intelligence may not be better; extremely intelligent people may be unhealthy, neurotic, insane, isolated, lonely, discriminated against by society and their peers, and doomed to failure; or, if things are not quite that dire (as all studies show things improving up to 130), then at around that point greater intelligence may stop making any difference, and there may be little difference between someone with an IQ of 130 and 160. This is difficult to study cross-sectionally, because once you start talking about extremes like 0.01%, it is difficult to recruit any subjects at all, and your sample will be biased in unknown ways; if you only look at successful people, you are missing the hypothetical homeless bum living out of a trash can who is a troubled and misunderstood genius. To solve these problems, you want to filter through hundreds of thousands of people so you can select the very brightest possible, and you want to find them as early as possible in life, before they have had any chance to fail or succeed, and track them longitudinally as they grow up. This is what the SMPY & TIP studies do, and the results are that the subjects are spectacularly successful in life; great intelligence is not harmful, and the returns to greater intelligence are not zero even as high as 1 in 10,000.

Makel et al 2016 also reports the ethnic breakdown of the TIP and SMPY samples: 72% white, 22% Asian, 6% not reported or other. This distribution might seem remarkable given that subjects taking the SAT in 1981 were born ~1970, when the USA was ~77% white, ~11% black, and ~0.7% Asian: whites are slightly under-represented, blacks are very under-represented (even if we assume all 6% are black, that's still half), and Asians are 31x (!) overrepresented.

## TIP/SMPY sample size & ethnic percentages:
tip <- 259; smpy <- 320 ## total: 579
white <- ((0.65*tip) + (0.78*smpy)) / (tip+smpy)
asian <- ((0.24*tip) + (0.20*smpy)) / (tip+smpy)
white; asian
# [1] 0.7218480138
# [1] 0.2178929188

# > The data on ethnicity are rather sparse, but we can do a little bit of work on them by looking at US Census
# > figures for the 1970s when most of these children were born: White 178,119,221...Asia 1,526,401...So, in the
# > absence of more detailed particulars about the Other category, Asians win the race by a country mile. If we
# > simplify things by considering only Whites, Blacks and Asians the US in 1970 then the country at that time was
# > 88% White, 11% Black, and less than 1% Asian. The actual results of eminent students are 77% White, 0% Black,
# > 22% Asian. No need for a Chi square.
# Asian is 0.7%: 1526401  / (178119221 / 0.80)
whiteRR <- white / 0.77; asianRR <- asian / 0.007
whiteRR; asianRR
# [1] 0.937464953
# [1] 31.12755983

Of course, races in the USA have long differed by mean intelligence, with the rule of thumb being Asians ~105 IQ, whites ~100, and blacks ~90. So the order is expected—but still, 31x! Are the results being driven by some sort of pro-Asian bias, or something otherwise bizarre?

But this is an extreme sample: 1-in-10,000 is far out on the tails, at 3.71 SDs:

qnorm(1 - 1/10000)
# [1] 3.719016485

Maybe this is normal. Can we work backwards from the overrepresentations to the differences which would have generated them?

Yes, we can, even with this small sample which is so extreme and unrepresentative of the general population. This is because it is an order-statistics problem: we know the order represented by the sample and so can work back to the parameters of the distribution the order statistics are being generated by. Since IQ is a normal distribution, and we know the overrepresentation RR and the exact cutoff/limit used in the sample, we can convert the limit to a standard-deviation cutoff, and then find the normal distribution whose tail past that cutoff is RR (31) times larger than the reference normal distribution's.

We can compare using two pnorms, shifting the second by some number of SDs. So for example, a shift of 15 IQ points or 1 SD would lead to 84x overrepresentation:

pnorm(qnorm(1/10000)) / pnorm(qnorm(1/10000) - (15/15))
# [1] 84.39259519

We would like to solve for the shift which leads to an exact overrepresentation like 31.127; an optimization routine like R's optim function can do that, but it requires an error to minimize, so minimizing pnorm()/pnorm(x) doesn't work since it just leads to negative infinity, nor will RR == pnorm()/pnorm(x) work, because it evaluates to 0 for all values of x except the exact right one. Instead, we minimize the squared error between the ratio predicted by a particular x and our observed RR. This works:

## An optimization routine which automatically finds for us the IQ increase which most closely matches the RR:
solver <- function(RR, cutoff=10000) {
    optim(1,
        function(IQ_gain) { (RR - (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)-(IQ_gain/15))))^2 },
        method="Brent", lower=-100, upper=100)$par }

100 + solver(whiteRR)
# [1] 99.75488281
100 + solver(asianRR)
# [1] 111.8929688

So our inferred white & Asian pop­u­la­tions means are: 99.8 and 111.9. These are rel­a­tively close to the expected val­ues.

This approach can be used to infer other things as well. For example, the TIP/SMPY papers have not, as far as I've seen, mentioned what fraction of the white subjects were ethnic Jewish; since they are so over-represented in areas like Nobel prizes, we would expect many of the TIP/SMPY white students to have been Jewish. Using an estimate of the Jewish population in 1970 and estimates of their mean IQ, we can work forward to what fraction of TIP/SMPY subjects might be Jewish. The 1970-1971 National Jewish Population Study estimated "5,800,000 persons (of whom 5,370,000 were Jews) living in Jewish households" out of a total US population of 205 million, or 2.8% of the total population or ~3.6% of the white population. So of the ~418 white subjects, ~15 would be expected to be Jewish under the null hypothesis of no difference. The majority of American Jews are of Ashkenazi descent4, for whom intelligence estimates are debated but tend to range 105-115 (with occasional samples suggesting even higher values, like Levinson 1957). In the Barbe 1964 Ohio sample (IQ ~143), 8% were Jewish5; in a 1920s sample (ratio IQ >140) in SF/LA, 10% were Jewish; Hollingworth's 1930s sample (>180) turned up 51⁄55 or ~93% Jewish6; Byrns 1936's 1931 Wisconsin state sample found 18% of the Jewish sample to be in the top decile vs 10% of Americans; in a 1948-1960 sample (>140, mean 157) in New York City, 62% were Jewish (Subotnik et al 1989, Subotnik et al 19937). Given estimates of the Jewish population of children in those specific times and places, one could work backwards to estimate a Jewish mean.

We can cal­cu­late the frac­tion of the white sam­ple being Jew­ish for each pos­si­ble mean IQ:

proportion <- function (gain, cutoff=10000) {
   (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)-(gain/15))) }
possibleIQs <- seq(5, 15, by=0.5)
data.frame(Advantage=possibleIQs, Fraction.of.white=(sapply(possibleIQs, proportion) * 15) / 418)
   Advantage Fraction.of.white
1        5.0      0.1415427303
2        5.5      0.1633099334
3        6.0      0.1886246225
4        6.5      0.2180947374
5        7.0      0.2524371552
6        7.5      0.2924980125
7        8.0      0.3392769622
8        8.5      0.3939561508
9        9.0      0.4579348680
10       9.5      0.5328710150
11      10.0      0.6207307813
12      10.5      0.7238482059
13      11.0      0.8449966589
14      11.5      0.9874747049
15      12.0      1.1552093388
16      12.5      1.3528802227
17      13.0      1.5860693342
18      13.5      1.8614413902
19      14.0      2.1869615788
20      14.5      2.5721585555
21      15.0      3.0284424112

Judging from earlier samples with very high cutoffs, I'd guess TIP/SMPY was at least majority Jewish, giving a mean IQ of ~110; this is pleasantly similar to estimates based on regular samples & estimation. This result is also similar to La Griffe du Lion's 2003 threshold analysis estimating a mean IQ of 112 based on Ashkenazi overrepresentation among USSR championship chess players, 111 based on Western awards, and 110 based on USA/Canada data. But if the mean IQ were as high as 112, then almost every single white subject would be Jewish in every sampling, which seems implausible and like something so striking that anyone writing about or involved with TIP/SMPY would have to have mentioned it at some point—right?

For the same rea­son, the orig­i­nal esti­mate of 112 for the Asians strikes me as on the high side. This could be due to prob­lems in the data like under­es­ti­mat­ing the Asian pop­u­la­tion at the time—per­haps the Southeast/Midwest states that TIP sam­ples from were more than 0.7% Asian—or it could be due to sam­pling error (only n = 579, after all).

Working backwards doesn't immediately provide any measurement of precision or confidence intervals. Presumably someone has worked out analytic formulas which come with standard errors and confidence intervals, but I don't know of them. Instead, since the selection process which generated our data is straightforward (population mean -> millions of samples -> take top 1-in-10,000s -> calculate overrepresentation), I can again use approximate Bayesian computation (ABC) to turn a simulation of the data-generating process into a method of Bayesian inference on the unknown parameters (population means) and get credible intervals.

What sort of confidence do we have in these estimates, given that these RRs are based on only a few hundred subjects? We can simulate TIP/SMPY-like selection by taking the hypothetical means of the two groups, generating ~6 million simulates (579 * 10000) each, selecting the top 1⁄10000th8, taking the RRs, and then solving for the mean IQ. If we provide a prior on the means and we hold onto only the means which successfully generate TIP/SMPY-like fractions of 72% & 22%, this becomes ABC, with the saved means forming the posterior distribution of means. (It would likely be faster to use MCMC like JAGS; JAGS provides truncated normal distributions which one could sample from quickly, and the necessary pnorm/qnorm functions, but it's not clear to me how one could go about estimating the overperformance ratio and the binomial.9 There's likely some way to use the order statistics more directly than simulating cutoffs, in which case there is a transformation to a beta distribution over 0-1, which is a well-supported distribution in MCMC software and might allow exact solution as well.) For my priors, I believe that the rules of thumb of 100⁄105 are accurate and highly unlikely to be more than a few points off, so I use a weak prior of the population means being distributed as Normal(100, 5) & Normal(105, 5).

In exact ABC, we would keep only data which exactly matched 72%/22%, but that would require reject­ing an extremely large num­ber of sam­ples. Here we’ll loosen it to ±2% tol­er­ance:

simulateTIPSMPY <- function() {
    ## informative priors: IQs are somewhere close to where we would estimate based on other datasets
    whiteMean <- round(rnorm(1, mean=100, sd=5), digits=2)
    asianMean <- round(rnorm(1, mean=105, sd=5), digits=2)

    iqCutoff <- 100 + -qnorm(1/10000) * 15

    whites <- rnorm(0.770 * 579 * 10000, mean=whiteMean, sd=15)
    whiteSample <- max(1, sum(ifelse(whites>iqCutoff, 1, 0)))

    asians <- rnorm(0.007 * 579 * 10000, mean=asianMean, sd=15)
    asianSample <- max(1, sum(ifelse(asians>iqCutoff, 1, 0)))

    ## white+Asian = 92% of original total sample, so inflate by that much to preserve proportions: 1.08
    totalSample <- (whiteSample+asianSample) * (1 + (1-(white+asian)))

    whiteFraction <- round(whiteSample / totalSample, digits=2)
    asianFraction <- round(asianSample / totalSample, digits=2)
    # print(paste("samples: ", c(whiteSample, asianSample), "fractions: ", c(whiteFraction, asianFraction)))

    tolerance <- 0.02
    if ((abs(whiteFraction - 0.7218480138) < tolerance) && (abs(asianFraction - 0.2178929188) < tolerance)) {
      return(data.frame(White=whiteMean, Asian=asianMean))
    } else { return(data.frame()) } }

library(parallel); library(plyr)
simulateSamples <- function(n.sample=10000, iters=getOption("mc.cores")) {
    ## because of rejection sampling, no run is guaranteed to produce a sample so we loop:
    results <- data.frame()
    while (nrow(results) < n.sample) {
        simResults <- ldply(mclapply(1:iters, function(i) { simulateTIPSMPY() }))
        results <- rbind(results, simResults)
        # print(paste("Samples so far: ", nrow(results)))
    }
    return(results) }
posteriorSamples <- simulateSamples()

mean(posteriorSamples$White < posteriorSamples$Asian)
# [1] 1
## we have relatively few samples, so get a better posterior estimate by shuffling the posterior samples & comparing many times:
mean(replicate(1000, mean(c(sample(posteriorSamples$White) < sample(posteriorSamples$Asian)))))
# [1] 0.9968822
quantile(probs=c(0.025, 0.975), posteriorSamples$White, na.rm=TRUE)
#     2.5%     97.5%
# 89.49975 101.38050
quantile(probs=c(0.025, 0.975), posteriorSamples$Asian, na.rm=TRUE)
#      2.5%     97.5%
# 101.37000 116.74075
hist(posteriorSamples$White, main="Posterior white mean IQ estimated from TIP/SMPY cutoff & ratio", xlab="IQ")
hist(posteriorSamples$Asian, main="Posterior Asian mean", xlab="IQ")
Histograms of the posterior estimates of white & Asian mean IQs ~1970, as estimated from fractions of the TIP/SMPY sample using ABC

So sam­pling error does turn out to be sub­stan­tial: our 95% cred­i­ble inter­vals are white 90-101, Asian 101-116. Still, the over­lap is min­i­mal, with P = 99.7% that the Asian mean is higher than the white.

We are able to con­clude that the rank order­ing is highly likely to be cor­rect, and the results are con­sis­tent with the con­ven­tional wis­dom, so there is no prima facie case for bias in the results: the eth­nic com­po­si­tion is in line with what one would cal­cu­late from the design of TIP/SMPY and pop­u­la­tion means.

Genius Revisited: On the Value of High IQ Elementary Schools

Genius Revis­ited doc­u­ments the lon­gi­tu­di­nal results of a high-IQ/gifted-and-talented ele­men­tary school, Hunter Col­lege Ele­men­tary School (HCES); one of the most strik­ing results is the gen­eral high edu­ca­tion & income lev­els, but absence of great accom­plish­ment on a national or global scale (eg a Nobel prize). The authors sug­gest that this may reflect harm­ful edu­ca­tional prac­tices at their ele­men­tary school or the low pre­dic­tive value of IQ.

I sug­gest that there is no puz­zle to this absence nor any­thing for HCES to be blamed for, as the absence is fully explain­able by their mak­ing two sta­tis­ti­cal errors: base-rate neglect, and regres­sion to the mean.

First, their standards fall prey to a base-rate fallacy: even an extreme predictive value of IQ would not predict 1 or more Nobel prizes, because Nobel prize odds are measured in the millionths, and with a small total sample size of a few hundred, it is highly likely that there would simply be no Nobels.
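A back-of-the-envelope calculation illustrates the base-rate problem; the cohort size and per-person odds below are illustrative assumptions, with the odds already set far above any realistic base rate:

```python
# Base-rate check: even very generous Nobel odds imply near-zero expected
# laureates from a few hundred students. All numbers are illustrative assumptions.
n_students = 600          # rough elementary-school cohort size (assumption)
p_nobel    = 1 / 10000    # per-person odds, deliberately set very generously (assumption)

expected = n_students * p_nobel          # expected number of laureates
p_none   = (1 - p_nobel) ** n_students   # chance of zero Nobels in the whole cohort
print(expected)           # 0.06 expected laureates
print(round(p_none, 3))   # ~0.94: no Nobels is the most likely outcome
```

Even stacking the deck this heavily, observing zero Nobels is the expected result, so their absence tells us almost nothing.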

Secondly, and more seriously, the lack of accomplishment is inherent and unavoidable, as it is driven by the regression to the mean caused by the relatively low correlation of early-childhood with adult IQs—which means their sample is far less elite as adults than they believe. Using early-childhood/adult IQ correlations, regression to the mean implies that HCES students will fall from a mean of 157 IQ in kindergarten (when selected) to somewhere around 133 as adults (and possibly lower). Further demonstrating the role of regression to the mean, in contrast, HCES's associated high-IQ/gifted-and-talented high school, Hunter High, which has access to the adolescents' more predictive IQ scores, has much higher achievement in proportion to its lesser regression to the mean (despite dilution by Hunter elementary students being grandfathered in).
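The shrinkage here is just the regression line through the population mean; the r ≈ 0.58 below is the early-childhood/adult correlation implied by the 157 → 133 figure (an assumption for illustration, not a measured value):

```python
def adult_iq(child_iq, r, mean=100):
    """Expected adult IQ after regression to the mean: shrink the
    childhood deviation from the population mean by the correlation r."""
    return mean + r * (child_iq - mean)

# r ~0.58 is an assumed early-childhood/adult IQ correlation
print(round(adult_iq(157, 0.58), 1))  # ~133: far less elite than 157
```

Any selection on a noisily measured trait regresses like this; the lower the test-retest correlation across ages, the larger the fall.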

This unavoid­able sta­tis­ti­cal fact under­mines the main ratio­nale of HCES: extremely high­-IQ adults can­not be very accu­rately selected as kinder­garten­ers on the basis of a sim­ple test. This greater-re­gres­sion prob­lem can be less­ened by the use of addi­tional vari­ables in admis­sions, such as parental IQs or high­-qual­ity genetic poly­genic scores; unfor­tu­nate­ly, these are either polit­i­cally unac­cept­able or depen­dent on future sci­en­tific advances. This sug­gests that such ele­men­tary schools may not be a good use of resources and HCES stu­dents should not be assigned scarce mag­net high school slots.

Split out to .

Great Scott! Personal Name Collisions and the Birthday Paradox

“How large can a social circle be before first names no longer suffice for identification? Scott, I’m looking at you.”

MakerOfDe­ci­sions, 2016-07-29

Scott here refers to any of Scott Alexan­der, , , (and to a much lesser extent, Scott Garrabrant, , and Scott H. Young); a ref­er­ence to a ‘Scott’ on a site like Less Wrong is increas­ingly ambigu­ous.

When a large number of samples draw from a common pool of identifiers, collisions are common, leading to the birthday paradox: despite there being 365.25 days in the year, a classroom of just 23 people (who can cover at most 6% of the days in a year) is ~50% likely to have at least two people who share the same birthday, and so birthdays cease being unique unambiguous identifiers. (Intuitively, you might expect the number to be much larger, closer to 180 than 23.)

We can ver­ify this by sim­u­la­tion:

dupes <- function(a) { length(a) != length(unique(a)) }

identifiers <- function(n, ids, probabilities) { sample(1:ids, n, prob=probabilities, replace=TRUE) }

simulate <- function(n, ids, probabilities=rep(1/ids, ids), iters=10000) {
    sims <- replicate(iters, { id <- identifiers(n, ids, probabilities)
                               return(dupes(id)) })
    return(mean(sims)) }

simulate(23, 365)
# [1] 0.488
sapply(1:50, function(n) { simulate(n, 365) } )
#  [1] 0.0000 0.0029 0.0059 0.0148 0.0253 0.0400 0.0585 0.0753 0.0909 0.1196 0.1431 0.1689 0.1891
#      0.2310 0.2560 0.2779 0.3142 0.3500 0.3787 0.4206 0.4383 0.4681 0.5165 0.5455 0.5722 0.5935
# [27] 0.6227 0.6491 0.6766 0.7107 0.7305 0.7536 0.7818 0.7934 0.8206 0.8302 0.8465 0.8603 0.8746
#      0.8919 0.9040 0.9134 0.9248 0.9356 0.9408 0.9490 0.9535 0.9595 0.9623 0.9732

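As a cross-check on the simulation, the collision probability also has an exact closed form (shown here in Python, with 365.25 simplified to 365 as in the R code above):

```python
import math

def birthday_collision(n, days=365):
    """Exact probability of at least one shared birthday among n people,
    assuming uniform birthdays: 1 - (365/365)(364/365)...((365-n+1)/365)."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (days - i) / days
    return 1 - p_unique

print(round(birthday_collision(23), 4))   # ~0.5073, matching the simulation
# square-root upper bound on the 50% point: ceil(sqrt(2 * 365 * ln 2)) = 23
print(math.ceil(math.sqrt(2 * 365 * math.log(2))))
```

The exact value 0.5073 agrees with the simulated 0.488 up to Monte Carlo error, and the square-root approximation recovers n = 23 directly.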
Similarly, in a group of people, it will be common for first names to overlap. (Overlaps of both first names & surnames are much more unlikely: estimates from French & Ohioan data suggest that while almost everyone has a non-unique full name, even groups of thousands of people will have only a few duplicates.) How common? There are far more than 365.25 first names, especially as some first names are made up by parents.

Names have a highly skewed (often said to be a power-law) distribution: the first few baby names make up an enormous fraction of all names, hence all the Ethan/Lucas/Mason baby boys in 2016. (One would think that parents would go out of their way to avoid too-popular names, but apparently not.)

Since there are only “10,000 things under heaven”, one might think that the top 10000 per­sonal names would give a good guess. At what n can we expect a col­li­sion?

findN <- function(ids, targetP=0.5, startingN=1, probabilities=rep(1/ids, ids)) {
    n <- startingN
    collisionProbability <- 0

    while (collisionProbability < targetP) {
        collisionProbability <- simulate(n, ids, probabilities)
        n <- n+1
    }
    return(n) }
findN(10000)
# [1] 118
simulate(118, 10000)
# [1] 0.5031

We could also use known analytic approximations, such as the square approximation sqrt(2 * 10000 * 0.5) → 100, or the similar upper bound ceiling(sqrt(2 * 10000 * log(2))) → 118.

So the collision point is far smaller than the number of available names.

But all of these are themselves upper bounds, because the case in which birthdays/names are uniformly distributed is the worst case. If there is any difference in the probabilities, a collision will happen much earlier. This makes sense, since if 1 birthday happens with, say, P = 0.99, then it's almost impossible to go more than 3 or 4 birthdays without a collision; likewise, less extremely, if one birthday has P = 0.50:

sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.99, rep(0.01/364, 364)))})
# [1] 0.0000 0.9789 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
#     1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.5, rep(0.5/364, 364)))})
# [1] 0.0000 0.2531 0.5031 0.6915 0.8182 0.8896 0.9402 0.9666 0.9808 0.9914 0.9951 0.9973 0.9988
#     0.9993 0.9991 0.9999 1.0000 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000

How skewed are real names? The Given Names Frequency Project provides "Popular Given Names US, 1801-1999" (1990-1999, 909,288 names) based on Social Security data. After deleting the first 4 lines of s1990m.txt, it can be loaded into R and the fractions used as probabilities to find the 50% collision point for US names:

names <- read.csv("s1990m.txt", header=FALSE)
summary(names)
#         V1            V2
#  Aaron   :  1   Min.   :   55.0000
#  Abdiel  :  1   1st Qu.:   86.0000
#  Abdullah:  1   Median :  183.0000
#  Abel    :  1   Mean   :  914.1923
#  Abraham :  1   3rd Qu.:  535.5000
#  Adam    :  1   Max.   :24435.0000
#  (Other) :852
sum(names$V2)
# [1] 784377
## "Scott" as fraction of all names:
2279 / 784377
# [1] 0.0029054906
## presumably male names:
2279 / (784377*0.5)
# [1] 0.005810981199

simulate(118, nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 1
findN(nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 15

So a more real­is­tic analy­sis sug­gests n = 15 is where unique first names will prob­a­bly break down.

This only covers the 853 most common personal names, and the more names, the higher the n has to be to trigger a collision (making 15 something of a lower bound); to estimate 10,000 names, we need to fit a distribution to extrapolate below that. The log-normal distribution fits reasonably well and is easy to work with:

library(fitdistrplus)
fitdist(names$V2, "lnorm")
# Fitting of the distribution ' lnorm ' by maximum likelihood
# Parameters:
#            estimate    Std. Error
# meanlog 5.550448321 0.04640182299
# sdlog   1.359185357 0.03281096378

simulateLN <- replicate(100, {
    names <- rlnorm(10000, meanlog=5.550448321, sdlog=1.359185357)
    findN(length(names), startingN=46, probabilities=names/sum(names)) })
median(simulateLN)
# [1] 51

Since first names will clus­ter by age group, loca­tion, pro­fes­sion, and what­not, arguably even 51 is a bit of an upper bound.

Final­ly, one might ask the prob­a­bil­ity of a group with a great Scott, or to put it another way, the prob­a­bil­ity of it unfor­tu­nately get­ting away scot-free.

This is easy to answer; the probability of having 1 or more Scotts in a group is the complement of the probability of everyone having a name other than Scott. We saw that the probability of being named Scott was P = 0.0029054906 in the name dataset, so the probability of one person not being named Scott is 1 − 0.0029 = 0.997, and the probability of n people all being named not-Scott is 0.997^n. The crossover point is ~239.
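The crossover arithmetic can be checked directly (in Python, using the Scott frequency computed from the name dataset above):

```python
import math

p_scott = 0.0029054906      # fraction of names that are "Scott", from the dataset above
p_not   = 1 - p_scott

# probability a group of n people contains no Scott is p_not**n;
# solve p_not**n = 0.5 for n to find the 50% crossover point
crossover = math.log(0.5) / math.log(p_not)
print(math.ceil(crossover))     # ~239 people
print(round(p_not ** 239, 3))   # ~0.5: just past the crossover
```

So past roughly 239 people, a group is more likely than not to contain at least one Scott.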

So an Amer­i­can social group can­not exceed n = 51 before first names begin to break down, and it is all Scot­t’s fault at n = 239.

Detecting fake (human) Markov chain bots

Some popular Twitter and Tumblr accounts use Markov chains trained on a corpus of writing, such as Markov James Mitchens, or on two unrelated corpora to create amusing mashups: programming documentation and H.P. Lovecraft's horror/SF fiction, or the King James Bible or the works of Karl Marx, Kim Kardashian and Kierkegaard, or Silicon Valley recruiting emails and Erowid drug-use reports. The humor comes from the fact that the Markov chains have no understanding and are merely programs producing gibberish which occasionally presents striking juxtapositions or insights. Much of their appeal derives from the fact that while humans curate them, humans don't write them. They depend on this authenticity to be striking.

Of course, there’s always the temp­ta­tion to edit them or write them whole­sale, per­haps because the Markov chains aren’t coop­er­at­ing in pro­duc­ing any com­edy gold to tweet that day, which deceives the read­er. This poses an inverse Tur­ing test: how would you detect a fake Markov chain account, that is, one where a human is pre­tend­ing to be a com­puter and writ­ing some of the text?

Markov chains are trained on a specific corpus and are a probabilistic generative model which encodes the probability that a word X follows another word Y, for all the words in that corpus (and similarly if they are operating on letters or n-grams); there is no state or memory or 'look back' or ability to model recursion. To generate text, one simply picks a random word Y, looks up the probabilities of all the words A-Z following Y, and picks a word at random weighted by those probabilities; then repeat indefinitely. Conversely, one could also use it to calculate the likelihood of a given text by multiplying the probability of each word in the text conditional on the previous one.
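A minimal sketch of this generative model (word-level, toy corpus, in Python; real bots use much larger corpora and often higher-order chains):

```python
import math
import random
from collections import defaultdict

def train(corpus):
    """Word-level bigram model: for each word, the list of observed successors
    (duplicates encode the transition probabilities)."""
    chain = defaultdict(list)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length=10):
    """Random walk over the transition table; no memory beyond the last word."""
    out = [start]
    for _ in range(length - 1):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

def log_likelihood(chain, text):
    """Sum of log transition probabilities; -inf if any transition is unseen."""
    words = text.split()
    total = 0.0
    for a, b in zip(words, words[1:]):
        succ = chain.get(a, [])
        if b not in succ:
            return float("-inf")
        total += math.log(succ.count(b) / len(succ))
    return total

chain = train("the cat sat on the mat and the cat ran")
print(generate(chain, "the"))
print(log_likelihood(chain, "the cat sat"))   # finite: both transitions observed
print(log_likelihood(chain, "the dog sat"))   # -inf: 'dog' never follows 'the'
```

The `-inf` case is exactly the impossible-word test discussed below: text containing an unseen transition has zero probability under the chain.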

One diffi­culty is the poten­tial for dou­ble-use of data: the first pass through a Markov chain account is already apply­ing to the data a highly flex­i­ble Bayesian neural net­work with bil­lions of para­me­ters (one’s brain). If one spots an ‘anom­alous’ dataset and sub­se­quent analy­sis con­firms it, what does this mean? I am reminded of one past inci­dent: some­one had lost a great deal of money on a Bit­coin gam­bling web­site, and sus­pected the site had defrauded him. But he had con­tacted me only because he had had unusual loss­es. What does an analy­sis mean? Imag­ine that the top 1% of losers get angry and start look­ing into whether they were cheat­ed; they go to a sta­tis­ti­cian who duly com­putes that based on the num­ber of games played, there is a p = 0.01 that they would lose as much or more as they did… If one had all the gam­bling records, one could look at the over­all pat­terns and see if there are more losers than there should be given the rules of the game and a sup­pos­edly fair ran­dom num­ber gen­er­a­tor, but what does one do with 1 self­-s­e­lected play­er? The data gen­er­a­tion process is cer­tainly nei­ther ran­dom nor ‘ignor­able’ nor mod­e­lable with­out dubi­ous assump­tions.

A few pos­si­ble attacks come to mind:

  • obser­va­tion of mal­formed syn­tax or lack of long-range depen­den­cies
  • vocab­u­lary or out­put out­side an inde­pen­dently trained Markov chain’s domain
  • unusu­ally low like­li­hood for an inde­pen­dently trained Markov chain to gen­er­ate known sam­ples
  • unusu­ally low like­li­hood for an inde­pen­dently trained Markov chain to gen­er­ate known sam­ples com­pared to newly gen­er­ated sam­ples fil­tered at a 1-in-100s qual­ity level
  • unusu­ally high qual­ity of known sam­ples com­pared to newly gen­er­ated sam­ples from inde­pen­dently trained Markov chain fil­tered at a 1-in-100s qual­ity lev­el, tested non­para­met­ri­cally or para­met­ri­cally as a mix­ture model

Markov chains produce realistic-looking output and are efficient to create & run, but, compared to RNNs, they notoriously model recursive syntax poorly, such as nested parentheses (since they have no way of remembering whether a parenthetical comment had been started), and cannot extrapolate: for example, a word-level Markov chain can't create new words, and would require n-grams to have available fragments of words which could be recombined. The memory-less nature of Markov chains also means that, lacking any memory which could model the 'long-range correlations' found in natural English text, like systematic use of particular names/topics/vocabulary, larger samples quickly veer off-topic, become gibberish, and lack any coherency, possibly even inside a single sentence. (RNNs also have this problem, but somewhat less.)

With the limits of a Markov chain in mind, it would be easy to detect faked Markov chain output with large samples: it is just difficult for a human to deliberately generate long text which is as nonsensical and syntactically invalid as a Markov chain creates, for the same reason an unpracticed human is a remarkably bad random number generator. However, for this same reason the selected Markov samples tend to be very short, usually no more than a sentence. It might be possible to measure this on the samples as a whole and observe higher entropy or memoryless-ness (eg by measuring compression performance or efficiency of a Markov chain in modeling the samples), but I would guess that usually the samples are not long enough or numerous enough for this to have reasonable statistical power as a test. This eliminates the easiest test.
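To illustrate the compression idea (a rough entropy proxy, not a calibrated test; the texts and lengths below are purely illustrative):

```python
import random
import string
import zlib

def compression_ratio(text):
    """Compressed size / raw size: closer to 1 means closer to random
    (less exploitable structure)."""
    raw = text.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

# highly structured text compresses well...
english = "the quick brown fox jumps over the lazy dog " * 50
# ...while memoryless random characters barely compress at all
random.seed(0)
gibberish = "".join(random.choice(string.ascii_lowercase + " ")
                    for _ in range(len(english)))

print(round(compression_ratio(english), 2))    # small: redundant, structured
print(round(compression_ratio(gibberish), 2))  # large: near its entropy limit
```

Genuine Markov output sits toward the gibberish end of this scale, human-written text toward the structured end, but with tweet-length samples the gap is too noisy to be decisive.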

Since the cor­pus is known in many of these cas­es, we can assume access to a Markov chain model which is sim­i­lar (if not iden­ti­cal) to the one which sup­pos­edly wrote all the tweets. This gives us sev­eral pos­si­bil­i­ties.

We could exploit the lack of creativity of Markov chains and look for anything in the tweets which is not present in the original corpus. For example, if a word appears neither in the Puppet documentation nor (having been coined in 1996, 59 years after he died) in H.P. Lovecraft's fiction, then it would have a probability of 0 of being generated by any Puppet/Lovecraft Markov chain (as no word will have any transition probability to it). Such out-of-corpus vocabulary immediately proves human authorship.

Con­tin­u­ing this same log­ic, we could take the cor­pus, train our own Markov chain (which will at least be sim­i­lar), and use it to cal­cu­late the like­li­hood of all the tweets. A human-writ­ten tweet may be pos­si­ble for the Markov chain to have writ­ten, but it will be far more unlikely than most of the ones the Markov chain actu­ally wrote & were select­ed. So we would see that most of the tweets have rea­son­able log like­li­hoods, but that our sus­pi­cious ones will be far more extreme. (If the Markov chains are word-level, this test sub­sumes the impos­si­ble-word test: any tweet with a word not in the cor­pus, and hence not rep­re­sented in the Markov chain, will have a mean­ing­less like­li­hood.)

This likelihood test might not help if they are all equally extreme, in which case one could use our Markov chain in another manner, as a generative model, to try to estimate the likelihood of getting as great a tweet. For this, one samples several thousand outputs from our Markov chain and screens them for good ones. This creates an empirical distribution of the likelihoods of good tweets conditional on the null hypothesis of a Markov chain author; in this case, the null hypothesis is known to be true by construction. Then to test, one compares the known-Markov-chain tweets with the likelihoods of the suspect tweets (perhaps with a two-sample test). They should be similar.

Alternately, if one doesn't want to use likelihoods as a measure of improbability, one could instead use some human measure of funniness, like having raters score the originals and the samples on a 1-5 scale, and comparing them. The original poster is probably not screening more than a few hundred generated tweets for each selected tweet, so given a similar level of stringency, one's generated tweets should be equally good; if the originals turn out to be extremely better than yours, to a level where you would have to screen thousands of random samples, that is highly suspicious and suggests the originals were 'too good to be true'.

With ratings or likelihoods, one could try to assume a decreasing distribution like an exponential: most samples will be incoherent and totally unfunny, many will be slightly funny, a few will be funny, and a very few will be very funny. The ratings on samples generated from our Markov chain will probably follow a smooth distribution. However, if a human is authoring some in an attempt to spice things up, they will be above the average of the Markov chain (otherwise why bother with cheating?), and if there is a substantial number of them, this will create an anomaly in the ratings of the originals: a 'bump' indicating that the tweets are coming from two different populations. In this case, it can be modeled as a mixture model with either k = 1 or k = 2 components, and the p-value or Bayesian posterior probability calculated for 1 vs 2.

Optimal Existential Risk Reduction Investment

An exis­ten­tial risk is any risk which destroys or per­ma­nently crip­ples human civ­i­liza­tion, such as an aster­oid strike or pan­dem­ic. Since human­ity might oth­er­wise con­tinue for mil­lions of years, cre­at­ing untold tril­lions of humans and col­o­niz­ing the galaxy, human extinc­tion rep­re­sents the loss of lit­er­ally astro­nom­i­cal amounts of util­ity. The loss is greater than any dis­as­ter up to extinc­tion lev­els, as human­ity can always recover from lesser dis­as­ters; but there is no recov­ery from a total destruc­tion. Thus, the expected value of even a slight reduc­tion in an exotic risk ought to itself be astro­nom­i­cal, or at least extremely large; under plau­si­ble val­ues for well-char­ac­ter­ized x-risks like aster­oid strikes or nuclear war or pan­demic, pre­vent­ing them may be the char­i­ta­ble spend­ing with the high­est expected value and they should be receiv­ing all char­i­ta­ble expen­di­tures.

This strikes peo­ple as odd and dan­ger­ous rea­son­ing. Is it really true that we should be spend­ing almost unlim­ited amounts of money on these things and not oth­er­wise extremely com­pelling char­i­ties like dis­trib­ut­ing malaria nets in Africa to save mil­lions of lives or vac­cine dis­tri­b­u­tion or fund­ing research into end­ing aging? And if we should, how do we choose what frac­tion to spend on global warm­ing rather than arti­fi­cial intel­li­gence? What if some­one dis­cov­ers an entirely new x-risk not pre­vi­ously con­sid­ered, like nearby super­novas or vac­uum col­lapses or nan­otech­nol­ogy ‘grey goo’?

Think­ing his­tor­i­cal­ly, it’s clear in ret­ro­spect that some­one con­cerned about x-risk would be bet­ter off not going after the ter­mi­nal goal of x-risk reduc­tion but instead spend­ing their money on instru­men­tal goals such as science/technology or eco­nomic growth.

Imagine someone in England in 1500 who reasons the same way about x-risk: humanity might be destroyed, so preventing that is the most important task possible. He then spends the rest of his life researching the Devil and the Apocalypse. Such research is, unfortunately, of no value whatsoever unless it produces arguments for atheism demonstrating that that entire line of enquiry is useless and should not be pursued further. But as the Industrial and Scientific Revolutions were just beginning, with exponential increases in global wealth and science and technology and population, ultimately leading to vaccine technology, rockets and space programs, and enough wealth to fund all manner of investments in x-risk reduction, he could instead have made a perhaps small but real contribution by contributing to economic growth through work & investment or by making scientific discoveries.

For example, Isaac Newton's discoveries in astronomy and the laws of motion helped inaugurate threads of work that led directly to space satellites which can watch for asteroids with Earth-crossing orbits. Newton himself was concerned with x-risk, as he feared that the Great Comet of 1680 would, centuries hence, plunge into the Sun and cause a solar expansion destroying the Earth and humanity. What could Newton have done to directly reduce this x-risk at the time? Absolutely nothing. There were no feasible counter-measures nor any foreseeable technologies which could forestall a comet or protect humanity from the Sun engulfing the Earth; there was not and still is not a mine or bomb shelter deep enough for that. What he could have done is close to what he did do: make fundamental advances in science which posterity could build on and one day be rich and wise enough to do something about the x-risk. As it happens, Newton was not quite right about the Great Comet (comets are not a meaningful fraction of the Sun's mass), but there was a similar x-risk he was unaware of: giant asteroid impacts. And the solutions to a giant comet—observe all comets carefully to project their future orbits, destroy it, redirect its orbit, evacuate human colonists to unaffected planets (Newton suggested the satellites of the gas giants)—are much the same as for a giant asteroid impact, and all benefit from economic growth & greater science/technology (someone has to pay for, and design, those satellites and spacecraft, after all).

Economic wealth & science/technology are all-purpose goods: they are useful for compound growth, and can also be spent on x-risk reduction. They are the ultimate instrumental goods. If one is badly ignorant, or poor, or unable to meaningfully reduce an x-risk, one is better off accepting the x-risk and instead spending resources on fixing the former problems. One would prefer to get rid of the x-risk as soon as possible, of course, but given one's starting position, there may simply be no better strategy, and the risk must be accepted.
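The compounding logic can be made concrete with a toy calculation (my own illustration, not from the text above): if wealth grows at rate g per year but humanity only survives each year with probability (1 − p), then expected cumulative wealth is a sum of terms w0·((1+g)(1−p))^t, so over long horizons a small boost to the growth rate can be worth more than a large cut to the hazard rate.

```javascript
// Toy expected cumulative wealth under a constant extinction hazard p:
// each year's wealth only counts if humanity survives to that year.
function expectedCumulativeWealth(w0, g, p, T) {
  let total = 0;
  for (let t = 1; t <= T; t++) {
    total += w0 * Math.pow((1 + g) * (1 - p), t);
  }
  return total;
}

// Over 500 years: halving the hazard helps, but a 0.5% growth boost helps more.
const base   = expectedCumulativeWealth(0.248, 0.02,  0.001,  500);
const safer  = expectedCumulativeWealth(0.248, 0.02,  0.0005, 500);
const richer = expectedCumulativeWealth(0.248, 0.025, 0.001,  500);
// richer > safer > base
```

The numbers (starting wealth 0.248, growth 2%, hazard 0.1%) mirror the simulation parameters used below, but the closed-form sum ignores the ability to *change* the hazard, which is the whole point of the later model.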

This raises the question: what is the optimal distribution of resources to economic growth vs x-risk reduction over time which maximizes expected utility?

Intuitively, we might expect something like this: early on, invest nothing at all in x-risk reduction, as there's not much money available to be spent and money spent now costs a lot of money down the line in lost compound growth; then, as the economy reaches modern levels and the opportunity cost of x-risk becomes dire, money is increasingly diverted to x-risk reduction. One might analogize it to insurance—poor people skimp on insurance because they need the money for other things which hopefully will pay off later, like education or starting a business, while rich people want to buy lots of insurance because they already have enough and they fear the risks. If this were an investment question, a good strategy would be something like the Kelly criterion or strategies like Thompson sampling: even if the expected value of x-risk reduction is higher than other investments, it only pays off very rarely and so receives a very small fraction of one's investments. However, it's not clear that the Kelly criterion or Thompson sampling are optimal or even relevant: Kelly avoids bankruptcy only by making arbitrarily small bets and refusing to ever risk one's entire wealth; with x-risks, the 'bankruptcy' (extinction) can't be avoided so easily, as the risk is there whether you like it or not, and one cannot reduce it to 0. (This comes up often in discussion of why the Kelly criterion is relevant to decision-making under risk; see also Peters 2011 and the niche area of "evolutionary finance", like Evstigneev et al 2008/Lensberg & Schenk-Hoppé 2006, which draws connections between the Kelly criterion, probability matching, long-term survival & evolutionary fitness.)
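To see why Kelly never risks everything, here is the standard Kelly formula with hypothetical numbers of my choosing: for a bet won with probability p at b-to-1 odds, the log-wealth-optimal stake is f* = p − (1−p)/b, which is strictly below 1 whenever p < 1. Extinction, by contrast, is a forced bet of the entire 'bankroll', which is exactly the move Kelly forbids.

```javascript
// Kelly criterion: optimal fraction of wealth to stake on a bet
// won with probability p, paying b-to-1 on a win.
function kellyFraction(p, b) {
  return p - (1 - p) / b;
}

const f = kellyFraction(0.6, 1); // even-money bet won 60% of the time
// f is approximately 0.2: stake only 20% of wealth despite the positive edge;
// even at p = 0.99 the stake stays below 100%.
```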
In economics, similar questions are often dealt with in terms of the life-cycle hypothesis, in which economic agents strive to maximize their utility over a career/lifetime despite the mismatch between income and the capacity to enjoy it (as Mark Twain put it, "when in youth a dollar would bring a hundred pleasures, you can't have it. When you are old, you get it & there is nothing worth buying with it then. It's an epitome of life. The first half of it consists of the capacity to enjoy without the chance; the last half consists of the chance without the capacity."); in the life-cycle, one tries to build wealth as quickly as possible while young, even going into debt for investments like a college education, then begins saving up, consuming some, until retirement, at which point one consumes it all until one dies. But as far as I have seen, life-cycle models tend not to include any mechanism for spending in order to reduce mortality/aging, and accept the risk of death as a given.

We could create a simple Markov decision process model. An agent (humanity), each time period (year), has a certain amount of wealth and an x-risk probability P. In this period, it can allocate that wealth between economic growth, in which case it receives that investment back plus a return, and x-risk reduction, in which case it buys a permanent percentage reduction in the x-risk for a fixed sum. For the reward, the x-risk is binary sampled with probability P; if the sample is true, then the reward is 0 and the decision process terminates, else the reward is the wealth and the process continues. Let's imagine that this process can run up to 10,000 time periods, with a starting wealth of $248 billion (Angus Deaton's estimate of PPP world GDP in 1500), an economic growth rate of 2% (the long-run real growth rate of the global economy), an existential risk probability of 0.1% per year (arbitrarily chosen), and the ability to buy a 1% reduction of the risk for a billion dollars. (We'll work in units of trillions to help numeric stability.) What strategy maximizes the cumulative rewards? A few simple ones come to mind:

  1. the agent could simply ignore the x-risk and reinvest all wealth, which to a first approximation is the strategy which has been followed throughout human history and is primarily followed now (lumping together NASA's Spaceguard program, biowarfare and pandemic research, AI risk research etc probably doesn't come to more than $1-2b a year in 2016). This maximizes the economic growth rate but may backfire as the x-risk never gets reduced.
  2. the agent could spend the full gain in its wealth from economic growth (2%) on x-risk reduction. The wealth doesn't grow and the returns from x-risk reduction diminish, but the x-risk is at least reduced greatly over time.
  3. the agent could implement a sort of probability matching: it spends on x-risk reduction a fraction of its wealth equal to the current P. This reduces how much is spent on extremely small x-risk reductions, but it might be suboptimal because it'll pay the largest fraction of its economy in the first time period, then the second-largest in the second time period and so on, losing out on the potential compounding.
  4. a more complicated hybrid strategy might work: maximize wealth like #1 for the first n time periods (eg n = 516), then switch to #2 for the remaining time periods
  5. like #4, but switching from #1 to #3 for the remaining time periods.
## Each agent maps (time, wealth, x-risk probability) to an allocation
## c(wealth invested in growth, wealth spent on x-risk reduction):
constantInvestmentAgent <- function (t, w, xrp) { return(c(w, 0)) }
constantReductionAgent  <- function (t, w, xrp) { drawdown <- 0.9803921573; return(c(drawdown*w, (1-drawdown)*w)) } ## 1/1.02: spend exactly the 2% growth
probabilityMatchAgent   <- function (t, w, xrp) { return(c(w*(1-xrp), w*xrp)) }
investThenReduceAgent   <- function (t, w, xrp, n=516) { if (t<n) { return(constantInvestmentAgent(t, w, xrp)) } else { return(constantReductionAgent(t, w, xrp)) } }
investThenMatchAgent    <- function (t, w, xrp, n=516) { if (t<n) { return(constantInvestmentAgent(t, w, xrp)) } else { return(probabilityMatchAgent(t, w, xrp)) } }

simulateWorld <- function(agent, t=10000) {
    initialW <- 0.248
    initialP <- 0.001
    df <- data.frame(T=0, Wealth=initialW, XriskP=initialP)

    for (i in 1:t) {
        last <- tail(df, n=1)
        xrisk <- rbinom(1, 1, p=last$XriskP)
        if (xrisk) { break } else {
          choices <- agent(last$T, last$Wealth, last$XriskP)
          ## each $1b (0.001) spent buys a permanent 1% multiplicative risk reduction:
          newXriskP <- last$XriskP * (1 - 0.01)^(choices[2] / 0.001)
          newWealth <- choices[1] * 1.02
          df <- rbind(df, data.frame(T=i, Wealth=newWealth, XriskP=newXriskP))
        }
    }
    df$Reward <- cumsum(df$Wealth)
    return(df)
}

library(parallel); library(plyr)
simulateWorlds <- function(agent, iters=1000) {
    mean(ldply(mclapply(1:iters, function(i) { tail(simulateWorld(agent), n=1)$Reward }))$V1)  }

simulateWorlds(constantReductionAgent)
# [1] 2423.308636
simulateWorlds(investThenReduceAgent)
# [1] 10127204.73
simulateWorlds(constantInvestmentAgent)
# [1] 1.154991741e+76
simulateWorlds(probabilityMatchAgent)
# [1] 7.53514145e+86
## Optimize the switch point:
which.max(sapply(seq(1, 10000, by=100), function(N) { simulateWorlds(function(t,w,xrp) { investThenMatchAgent(t, w, xrp, n=N) }, iters=100)}))
# [1] 3
simulateWorlds(function(t,w,xrp) { investThenMatchAgent(t, w, xrp, n=300) })
# [1] 9.331170221e+86
# [1] 1.006834082e+87

So of our 5 strategies, the constant reduction agent performs the worst (probably because, with economic growth choked off, it can only buy small x-risk reductions), followed by the invest-then-reduce agent; then the 'get rich before you get old' constant investment agent manages to often attain very high growth rates when it's lucky enough that x-risks don't strike early on; but far better than any of them, by orders of magnitude, are the partial and full probability matching agents. The partial probability matching agent turns out to have a suboptimal default switch point of t=516, and a more careful search of switch points finds that t≈300 is the best switch point, at which it exceeds the pure probability matcher which matches from the start.

What's going on there? I suspect it's something similar to the difference in multi-armed bandit problems between the asymptotically optimal solution and the optimal solution for a fixed horizon found using dynamic programming: in the former scenario, there's an indefinite amount of time to do any exploration or investment in information, but in the latter, there's only a finite time left, and exploration/growth must be done up front, after which the optimal decision increasingly shifts to exploitation rather than growth.
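The finite-horizon effect can be shown with a miniature of the problem (my toy version: horizon 3, wealth/risk dynamics as in the simulation above, only two possible allocations), solved exactly by backward induction; it is precisely the exponential blowup of this recursion at 10,000 steps that makes exact dynamic programming infeasible for the real problem.

```javascript
// Exact expected-reward value function by backward induction over a tiny horizon.
// Action: spendShare = 0 (invest everything) or 0.02 (spend 2% on x-risk reduction).
function optimalValue(t, w, p, horizon) {
  if (t === horizon) return 0;
  let best = -Infinity;
  for (const spendShare of [0, 0.02]) {
    const wNext = w * (1 - spendShare) * 1.02;
    const pNext = p * Math.pow(0.99, (spendShare * w) / 0.001);
    // survive this period with probability (1-p), collect wealth, continue
    const v = (1 - p) * (wNext + optimalValue(t + 1, wNext, pNext, horizon));
    best = Math.max(best, v);
  }
  return best;
}

const v0 = optimalValue(0, 0.248, 0.001, 3);
```

With 2 actions and horizon T, this explores 2^T branches; at T = 10,000 that is hopeless without discretizing and tabulating states, which is the point made in the list below.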

Why does probability matching in general work so well? It may simply be because it's the only baseline strategy which adjusts its x-risk investment over time.

This doesn't demonstrate that probability matching is optimal, just that it beats the other baseline strategies. Other strategies could be used to decrease x-risk investment over time—instead of being proportional to x-risk P, it could shrink linearly over time, or by square root, or logarithmically, or…

What reinforcement learning techniques might we use to solve this?

This problem represents a large Markov Decision Process with 1 discrete state variable (time, t = 0-10000), 2 continuous state variables (wealth, and risk probability), and 1 continuous action (fraction of growth to allocate to the economy vs existential risk reduction). The continuous action can be discretized into 11 actions probably without losing anything (allocate 100%/90%..10%/0%), but the 2 state variables can't be discretized easily because they can span many orders of magnitude.
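One standard workaround for the wide-ranging state variables (my suggestion here, not something the solvers below require) is to hand function approximators log-scaled features, so wealth spanning dozens of orders of magnitude maps to a modest numeric range:

```javascript
// Hypothetical state encoder: normalize time and log-compress wealth & risk
// before feeding them to a neural network or other function approximator.
function encodeState(t, w, xrp) {
  return [t / 10000, Math.log10(w), Math.log10(xrp)];
}

const s = encodeState(5000, 1000, 0.0001);
// time becomes 0.5; wealth of 1000 becomes 3; risk of 1e-4 becomes -4
```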

  • dynamic programming a decision tree with backwards induction: optimal, but requires discrete actions and state variables, and even if discretized, 10,000 time steps would be infeasibly large.

  • standard tabular learning (Q-learning, SARSA, temporal differences): requires discrete actions and state variables

    • Deep Q-Networks: require discrete actions, but not discrete state variables
  • MDP solvers (value iteration etc): optimal, but require discrete actions and state variables

  • hybrid MDP solvers: optimal, and can handle a limited number of continuous state variables (but not continuous actions), which would work here; but high-quality software implementations are rarely available.

    One such hybrid MDP solver is hmpd, which solves problems specified in the Lisp-like PDDL DSL (judging from the examples, a version with probabilistic effects, so PPDDL 1.0?). After trying to write down a PPDDL model corresponding to this scenario, it seems that PPDDL is unable to represent probabilities or rewards which change with time, and so cannot represent the increase in wealth or decrease in x-risk probability.

  • policy gradients: can handle continuous state variables & actions but are highly complex and unstable; high-quality software implementations are unavailable

Of the possible options, a DQN agent seems like the best choice: a small neural network should be able to handle the problem, and DQN only requires the actions to be discretized. reinforce.js provides a DQN implementation in JS which I've used before, so I start there by rewriting the problem in JS:

var script = document.createElement("script");
script.src = ""; // reinforce.js URL
document.head.appendChild(script);

// environment: t, w, xrp
function simulate(environment, w_weight, xrp_weight) {
    var xrisk = Math.random() < environment.xrp;
    if (xrisk) {
    return {reward: -100, alive: false, t: environment.t, w: environment.w, xrp: environment.xrp};
    } else {
    return {reward: Math.log(environment.w), alive: true, t: environment.t+1,
            w: environment.w*w_weight*1.02, xrp: environment.xrp * (Math.pow((1 - 0.01), (xrp_weight / 0.001))) };
    }
}
var defaultState = {t: 0, w: 0.248, xrp: 0.01}
// simulate(defaultState, 0.99, 0.01)
// simulate(defaultState, 0.99, 0.01)

var env = {};
env.getNumStates = function() { return 3; }; // there are only 3 state variables: t/w/xrp
env.getMaxNumActions = function() { return 11; }; // we'll specify 11 possible allocations: 1/0, 0.998/0.002 .. 0.98/0.02
var spec = {
  num_hidden_units: 200,
  experience_add_every: 20,
  learning_steps_per_iteration: 1,
  experience_size: 1000000,
  alpha: 0.01,
  epsilon: 1.0,
  gamma: 0.99 // minimal discounting
};
var agent = new RL.DQNAgent(env, spec);

var total_reward = 0;
state = defaultState;
spec.epsilon = 1.0; // reset epsilon if we've been running the loop multiple times

for(var i=0; i < 10000*3000; i++) {
   var action = agent.act([state.t, state.w, state.xrp]); // reinforce.js expects an array of floats
   state = simulate(state, 1-(action/500), 0+(action/500) );

   total_reward = total_reward + state.reward;
   agent.learn(state.reward);
   if (Number.isInteger(Math.log(i) / Math.log(10)) ) { spec.epsilon = spec.epsilon / 1.5; } // decrease exploration

   if (!state.alive || state.t >= 10000) { // if killed by x-risk or horizon reached
     console.log(state.t, state.w, state.xrp, total_reward);
     total_reward = 0;
     state = defaultState;
   }
}

// exercise the trained agent to see how it thinks
spec.epsilon = 0;
state = defaultState;
total_reward = 0;
for (var t=0; t < 10000; t++) {
    action = agent.act([state.t, state.w, state.xrp]);
    state = simulate(state, 1-(action/500), 0+(action/500) );
    total_reward = total_reward + state.reward;
    console.log(action, state, total_reward);
    if (!state.alive) { break; }
}

After a day of training, the DQN agent had learned to get up to 5e41, which was disappointingly inferior to the constant investment & probability matching agents (1e87). The NN looks big enough for this problem, and the experience replay buffer was more than adequate; NNs in RL are known to have issues with the reward, though, and typically 'clamp' the reward to a narrow range, so I suspected that rewards going up to 5e41 (interpreting wealth on each turn as the reward) might be playing havoc with convergence, and switched the reward to log wealth instead. This did not make a noticeable difference overnight (aside from the DQN agent now achieving 9.5e41). I wondered if the risk was too rare for easy learning and 100 neurons was not enough to approximate the curve over time, so I fixed a bug I noticed where the simulation did not terminate at t=10000, doubled the neuron count, increased the initial x-risk to 1%, and began a fresh run. After 1 day, it reached 9.4e41 total reward (unlogged).

Cumulative log score for DQN after tweaks and ~2h of training: regularly reaches ~470k when it doesn't die immediately (which happens ~1/20 of the time). In comparison, the probability-matching agent averages a cumulative log score of 866k. After 2 days of training, the DQN had improved only slightly; the on-policy strategy appears mostly random aside from having driven the x-risk probability down to what appears to be the smallest float JS supports, so it still had not learned a meaningful compromise between x-risk reduction and investment.

TODO: revisit with MCTS at some point?

Model Criticism via Machine Learning

In “Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness”, Andrew Gelman writes:

Here’s how we put it on the very first page of our book:

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

  1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.
  2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.
  3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

How does this fit in with the goals of performing statistical analysis using artificial intelligence?

3. The third step—identifying model misfit and, in response, figuring out how to improve the model—seems like the toughest part to automate. We often learn of model problems through open-ended exploratory data analysis, where we look at data to find unexpected patterns and compare inferences to our vast stores of statistical experience and subject-matter knowledge. Indeed, one of my main pieces of advice to statisticians is to integrate that knowledge into statistical analysis, both in the form of formal prior distributions and in a willingness to carefully interrogate the implications of fitted models.

One way of looking at step #3 is to treat the human statistician as another model: specifically, he is a large neural network with trillions of parameters, who has been trained to look for anomalies & model misspecification, and to fix them when he finds them, retraining the model, until he can no longer easily distinguish the original data from the model's predictions or samples. As he is such a large model with the ability to represent and infer a large class of nonlinearities, he can usually easily spot flaws where the current model's distribution differs from the true distribution.

This bears a considerable resemblance to the increasingly popular "generative adversarial networks" (GANs): using pairs of neural networks, one of which tries to generate realistic data, and a second which tries to classify or discriminate between real and realistic data. As the second learns ways in which the current realistic data is unrealistic, the first gets feedback on what it's doing wrong and fixes it. So the loop is very similar, but fully automated. (A third set of approaches this resembles is actor-critic reinforcement learning algorithms.)

If we consider the kinds of models which are being critiqued, and what is critiquing, this gives us 4 possible combinations:

|               | simple critic              | complex critic              |
|---------------|----------------------------|-----------------------------|
| simple model  | fit indexes + linear model | statistician + linear model |
| complex model | fit indexes + ML           | ML + ML (eg GANs)           |

  1. Simple/simple is useful for cases like linear regression, where classic methods like examining residuals or R^2s or Cook's distances can often flag problems with the model.

  2. Simple/complex is also useful, as the human statistician can spot additional problems.

  3. Complex/simple is probably useless, as the NNs may easily have severe problems but will have fit any simple linear structure and fool regular diagnostics.

  4. Complex/complex can be very useful in machine learning, but in different ways from a good simple model.

    Fast, simple, general—a good statistical method lets you choose one; a great method lets you choose two. (Consider linear models, decision trees, NNs, MCMC, ABC, discrete Bayesian networks, and exponential family vs nonparametric methods as examples of the tradeoffs here.)

So is quadrant 2 fully populated by human statisticians? We wouldn't necessarily want to use GANs for everything we use statisticians for now, because neural networks can be too powerful, and what we want from our models is often some sort of clear answer like "does X predict Y?" and simplicity. But we could replace the statistician with some other powerful critic from machine learning—like a NN, SVM, random forest, or other ensemble. So instead of having two NNs fighting each other as in a GAN, we simply have one specified model, and a NN which tries to find flaws in it, which can then be reported to the user. The loop then becomes: write down and fit a model to the real data; generate posterior predictive samples from the distribution; train a small NN on real data vs predictive data; the classification performance measures the plausibility of the predictive samples (perhaps something like a KL divergence), giving a measure of the model quality, and flags datapoints which are particularly easily distinguished as real; the human statistician now knows exactly which datapoints are not captured by the model and can modify the model; repeat until the NN's performance declines to chance.

Let's try an example. We'll set up a simple linear regression model Y ~ A + B + C with a few problems in it:

  1. the trend is not linear but slightly quadratic
  2. the outcome variable is also right-censored at a certain point
  3. and finally, the measured covariates have been rounded
n <- 10000
censorPoint <- 3
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 0 + 0.5*a + 0.5*b + 0.5*c^2 + rnorm(n)
y_censored <- ifelse(y >= censorPoint, censorPoint, y)
df <- data.frame(Y=y_censored, A=round(a, digits=1), B=round(b, digits=1), C=round(c, digits=1))

l <- lm(Y ~ A + B + C, data=df)
plot(df$Y, predict(l, df))

l2 <- lm(Y ~ A + B + I(C^2), data=df)
plot(df$Y, predict(l2, df))

The censoring shows up immediately on the diagnostics as an excess of actual points at 3, but the quadraticity is subtler, and I'm not sure I can see the rounding at all.


library(randomForest)
## First, random forest performance under the null hypothesis

modelNull <- data.frame(Y=c(df$Y, df$Y), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_n <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelNull); r_n
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#         OOB estimate of  error rate: 100%
# Confusion matrix:
#       0     1 class.error
# 0     0 10000           1
# 1 10000     0           1

modelPredictions <- data.frame(Y=c(df$Y, predict(l, df)), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions); r
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#         OOB estimate of  error rate: 6.59%
# Confusion matrix:
#      0    1 class.error
# 0 9883  117      0.0117
# 1 1200 8800      0.1200

## many of the LM predictions are identical, but the RF is not simply memorizing them as we can jitter predictions and still get the same classification performance:
modelPredictions$Y2 <- jitter(modelPredictions$Y)
randomForest(as.ordered(Real) ~ Y2 + A + B + C, modelPredictions)
#...                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#         OOB estimate of  error rate: 6.57%
# Confusion matrix:
#      0    1 class.error
# 0 9887  113      0.0113
# 1 1200 8800      0.1200

Note we need to be careful about collecting the posterior predictive samples: if we collect 10,000 posterior samples for each of the 10,000 datapoints, we'll store 10,000^2 = 10^8 numbers, which may cause problems. 1 sample per datapoint should be enough.

library(runjags)
model <- 'model {
 for (i in 1:n) {
     mean[i] <- mu + betaA*A[i] + betaB*B[i] + betaC*C[i]
     Y[i] ~ dnorm(mean[i], tau)
 }

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
}'
model <- run.jags(model, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)

posterior_predictive <- tail(n=10000, model$mcmc[[1]][1,])
plot(df$Y, posterior_predictive)

modelPredictions_r <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r
#         OOB estimate of  error rate: 49.11%
# Confusion matrix:
#      0    1 class.error
# 0 4953 5047      0.5047
# 1 4776 5224      0.4776
model_rounded <- 'model {
 for (i in 1:n) {
     roundA[i] ~ dround(A[i], 3)
     roundB[i] ~ dround(B[i], 3)
     roundC[i] ~ dround(C[i], 3)
     mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
     Y[i] ~ dnorm(mean[i], tau)
 }

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
}'
model_r <- run.jags(model_rounded, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)

posterior_samples <- tail(n=10000, model_r$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)
plot(df$Y, posterior_predictive)

modelPredictions_r <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r_r
#         OOB estimate of  error rate: 50.48%
# Confusion matrix:
#      0    1 class.error
# 0 4814 5186      0.5186
# 1 4909 5091      0.4909
model_rounded_censor <- 'model {
 for (i in 1:n) {
     roundA[i] ~ dround(A[i], 3)
     roundB[i] ~ dround(B[i], 3)
     roundC[i] ~ dround(C[i], 3)
     mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
     Y[i] ~ dnorm(mean[i], tau)
     is.censored[i] ~ dinterval(Y[i], c)
 }

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
}'
model_r_c <- run.jags(model_rounded_censor, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)

posterior_samples <- tail(n=10000, model_r_c$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)

modelPredictions_r_c <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c); r_r_c
#         OOB estimate of  error rate: 53.67%
# Confusion matrix:
#      0    1 class.error
# 0 4490 5510      0.5510
# 1 5224 4776      0.5224
model_rounded_censor_quadratic <- 'model {
 for (i in 1:n) {
     roundA[i] ~ dround(A[i], 3)
     roundB[i] ~ dround(B[i], 3)
     roundC[i] ~ dround(C[i], 3)
     mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]^2
     Y[i] ~ dnorm(mean[i], tau)
     is.censored[i] ~ dinterval(Y[i], c)
 }

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
}'

model_r_c_q <- run.jags(model_rounded_censor_quadratic, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)

posterior_samples <- tail(n=10000, model_r_c_q$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)

modelPredictions_r_c_q <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c_q <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c_q); r_r_c_q
#         OOB estimate of  error rate: 61.02%
# Confusion matrix:
#      0    1 class.error
# 0 3924 6076      0.6076
# 1 6127 3873      0.6127

trueNegatives <- modelPredictions_r_c_q[predict(r_r_c_q) == 0 & modelPredictions_r_c_q$Real == 0,]

Where can we go with this? The ML techniques can be used to rank existing Bayesian models in an effective if unprincipled way. Techniques which quantify uncertainty, like Bayesian neural networks, could give more effective feedback by highlighting the points the Bayesian NN is most certain are fake, guiding the analyst towards the worst-modeled datapoints and providing hints for improvement. More inspiration could be borrowed from the GAN literature, such as "minibatch discrimination": as demonstrated above, the random forests only see one datapoint at a time, but in training GANs it has proven useful to instead examine multiple datapoints at a time, to encourage the generator to learn how to generate a wide variety of datapoints rather than modeling a few datapoints extremely well. A ML model which can predict multiple outputs simultaneously based on multiple inputs would be analogous (that is, instead of X ~ A + B + C, it would look more like X1 + X2 + X3 ... ~ A1 + B1 + C1 + A2 + B2 + C2 + ..., with the independent & dependent variables from multiple datapoints all fed in simultaneously as a single sample) and might be an even more effective model critic.
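The minibatch featurization just described can be sketched as follows (my construction, not the GAN literature's exact algorithm): concatenate k datapoints into a single sample, so the critic judges the joint distribution rather than points in isolation.

```javascript
// Group rows k at a time into concatenated feature vectors for a critic.
function minibatchSamples(rows, k) {
  const out = [];
  for (let i = 0; i + k <= rows.length; i += k) {
    out.push([].concat(...rows.slice(i, i + k)));
  }
  return out;
}

// e.g. rows of [A, Y] grouped 2 at a time become 4-element joint samples:
const grouped = minibatchSamples([[1, 2], [3, 4], [5, 6], [7, 8]], 2);
// grouped: [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The same grouping would be applied to both real and posterior-predictive rows before training the critic, so that over- or under-dispersion across datapoints becomes visible to it.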

Proportion of Important Thinkers by Global Region Over Time in Charles Murray’s Human Accomplishment

Human Accomplishment is a 2003 book by Charles Murray reporting a large-scale citation analysis of biographical dictionaries & reference books on art/literature/music/philosophy/science/mathematics throughout history, quantifying the relative importance of "significant individuals" such as Isaac Newton or Immanuel Kant or Confucius and the temporal & geographical patterns; in particular, it demonstrates large European contributions throughout history, increasing dramatically post-1400 AD. The dataset has been released.

Emil Kirkegaard created a visualization of the proportion by rough geographic region (European/Asian/other) in R using ggplot2 and LOESS smoothing. Perhaps the most striking aspect of it is the Dark Ages showing up as a spike in the Asian proportion, followed by the European recovery.

This visualization has been criticized as Eurocentrically-misleading and driven by artifacts in the analysis/graphing:

  • the LOESS ignores the constraint that proportions must be 0-1 and naively extrapolates beyond the boundaries, producing negative estimates for some regions/times

  • no visualization of uncertainty is provided, either in the form of graphing the raw datapoints by superimposing a scatterplot or by providing standard errors or credible intervals. It is possible that the overall shapes or specific periods are no more than chance scatters in a time-series based on few datapoints.

    • LOESS can provide estimates of local standard errors & confidence intervals, but they are of questionable meaning in the absence of the underlying counts
  • alternatively, the distribution of significant figures may not be treated correctly parametrically

  • proportions may reflect a time-series with trends, and so precision is exaggerated

None of these objections hold any water: the dataset and its embedded differences are sufficiently large that the method of analysis will make little difference. I will demonstrate this below by re-analyzing it to address the quibbles, showing that all patterns remain intact or are sharpened. The above criticisms can be addressed by:

  1. switch­ing from a LOESS plot to splines or local bino­mial regres­sions
  2. plot­ting the raw pro­por­tions grouped by decade or cen­tury
  3. using a non­para­met­ric boot­strap to cal­cu­late con­fi­dence inter­vals, a pro­ce­dure which lends itself to visu­al­iza­tion as an ani­ma­tion of plots of all the resam­ples, giv­ing an intu­itive sense of how impor­tant sam­pling error is to the over­all pat­tern of curves and spe­cific parts of his­tory
  4. alter­nate­ly, instead of attempt­ing to fit the pro­por­tion, one can fit the orig­i­nal count of sig­nifi­cant fig­ures in a bino­mial or log-nor­mal Bayesian time-series model and sam­ple from the pos­te­rior esti­mates of each region for each decade/century, and cal­cu­late pos­te­rior pro­por­tions, gain­ing full quan­tifi­ca­tion of uncer­tain­ty, incor­po­ra­tion of any auto­cor­re­la­tion, and smooth­ing; no addi­tional algo­rithms or the­o­rems are required, demon­strat­ing the ele­gance of Bayesian approaches

I did­n’t real­ize Kirkegaard’s R code was avail­able so I wound up redo­ing it myself (and get­ting the same result­s):

## export CSV from spreadsheet in
h <- read.csv("HA.csv", header=TRUE)
#      Serial                        Name            Fl               Birth              Death              Inventory     ScienceField
#  Min.   :   11.00   Descartes, René  :   4   Min.   :-700.000   Min.   :-640.000   Min.   :-559.00   Science   :1442          :2560
#  1st Qu.: 6144.50   Hooke, Robert    :   4   1st Qu.:1557.250   1st Qu.:1580.000   1st Qu.:1638.00   Lit.West  : 835   Tech   : 239
#  Median :12534.50   Leonardo da Vinci:   4   Median :1804.000   Median :1782.000   Median :1844.00   Music.West: 522   Phys   : 218
#  Mean   :15994.27   Archimedes       :   3   Mean   :1585.638   Mean   :1616.174   Mean   :1682.81   Art.West  : 479   Chem   : 204
#  3rd Qu.:21999.75   Bacon, Francis   :   3   3rd Qu.:1900.000   3rd Qu.:1863.000   3rd Qu.:1930.00   Phil.West : 155   Biol   : 193
#  Max.   :43134.00   d'Alembert, Jean :   3   Max.   :1949.000   Max.   :1910.000   Max.   :1997.00   Art.China : 111   Math   : 191
#                     (Other)          :3981                      NA's   :304        NA's   :351       (Other)   : 458   (Other): 397
#      Index             Duplicate           BirthCountry   WorkCountry      Ethnicity        Woman            No..of.Inventories
#  Min.   :  0.60000   Min.   :0.00000000   France : 564   France : 605   Germanic: 592   Min.   :0.00000000   Min.   :2.000000
#  1st Qu.:  3.54000   1st Qu.:0.00000000   Germany: 556   Britain: 574   French  : 565   1st Qu.:0.00000000   1st Qu.:2.000000
#  Median :  7.60000   Median :0.00000000   Britain: 554   Germany: 525   English : 441   Median :0.00000000   Median :2.000000
#  Mean   : 12.95713   Mean   :0.06221889   Italy  : 400   Italy  : 406   Italian : 397   Mean   :0.02198901   Mean   :2.228916
#  3rd Qu.: 15.89000   3rd Qu.:0.00000000   USA    : 306   USA    : 375   USA     : 276   3rd Qu.:0.00000000   3rd Qu.:2.000000
#  Max.   :100.00000   Max.   :1.00000000   China  : 239   China  : 239   Chinese : 240   Max.   :1.00000000   Max.   :4.000000
#  NA's   :115                              (Other):1383   (Other):1278   (Other) :1491                        NA's   :3753
#  [1] "Ancient Greek" "Ancient Roman" "Arabic"        "Australian"    "Basque"        "Black"         "Bulgarian"     "Canadian"
#  [9] "Chinese"       "Croatian"      "Czech"         "Danish"        "Dutch"         "English"       "Estonian"      "Finnish"
# [17] "Flemish"       "French"        "Germanic"      "Greek"         "Hungarian"     "Icelandic"     "Indian"        "Irish"
# [25] "Italian"       "Japanese"      "Jewish"        "Latino"        "New Zealand"   "Norwegian"     "Polish"        "Portuguese"
# [33] "Romanian"      "Scots"         "Slavic"        "Slovenian"     "Spanish"       "Swedish"       "Swiss"         "USA"

european <- c("Ancient Greek", "Ancient Roman", "Australian", "Basque", "Bulgarian", "Canadian", "Croatian", "Czech", "Danish",
    "Dutch", "English", "Estonian", "Finnish", "Flemish", "French", "Germanic", "Greek", "Hungarian", "Icelandic", "Irish",
    "Italian", "Jewish", "New Zealand", "Norwegian", "Polish", "Portuguese", "Romanian", "Scots", "Slavic", "Slovenian",
    "Spanish", "Swedish", "Swiss", "USA")
asian    <- c("Chinese", "Indian", "Japanese")
other    <- c("Arabic", "Black", "Latino")
groupMembership <- function(e) { if (e %in% european) { "European" } else { if (e %in% asian) { "Asian" } else { "Other" } } }
h$Group <- as.factor(sapply(h$Ethnicity, groupMembership))
#   Asian European    Other
#     507     3379      116

## We use 'Fl' (floruit/flourished), when a person is believed to have done their most important work,
## since birth/death is often unavailable.
## group to decades by rounding:
h$Fl.decade <- round(h$Fl, digits=-1)
hd <- subset(select=c(Fl.decade, Group), h)

hdcount <- aggregate(cbind(Group) ~ Fl.decade+Group, length, data=hd)
colnames(hdcount)[3] <- "Count"
## sort by time:
hdcount <- hdcount[order(hdcount$Fl.decade),]
nrow(h); sum(hdcount$Count)
# [1] 4002
# [1] 4002
head(hdcount, n=20)
#     Fl.decade    Group Count
# 178      -700 European     3
# 179      -680 European     1
# 180      -650 European     1
# 1        -600    Asian     2
# 181      -600 European     2
# 182      -580 European     2
# 183      -570 European     2
# 2        -550    Asian     1
# 184      -550 European     1
# 185      -540 European     5
# 3        -520    Asian     1
# 186      -520 European     3
# 4        -510    Asian     1
# 187      -510 European     2
# 188      -500 European     2
# 189      -480 European     6
# 190      -460 European     3
# 191      -450 European     7
# 5        -440    Asian     1
# 192      -440 European    11

## One issue with the count data: decades with zero significant figures from a group
## (which happens frequently) get suppressed. Some tools can handle the omission
## automatically but many cannot, so we need to manually insert any missing decades with '0'
decades <- seq(-700, 1950, by=10)
for (i in 1:length(decades)) {
    d <- decades[i]
    if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="European",])==0) {
        hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
    if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Asian",])==0) {
        hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
    if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Other",])==0) {
        hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
}
hdcount <- hdcount[order(hdcount$Fl.decade),]

library(ggplot2); library(gridExtra)
c1 <- with(hdcount, qplot(Fl.decade, Count, color=Group) + stat_smooth())
c2 <- with(hdcount, qplot(Fl.decade, log1p(Count), color=Group) + stat_smooth())
grid.arrange(c1, c2, ncol=1)

The absolute growth in human population, and hence accomplishment, post-1400 is so dramatic that it obscures earlier temporal variations:

Counts of “sig­nifi­cant fig­ures” in Human Accom­plish­ment (Mur­ray 2003) by geo­graphic region, raw and log-trans­formed

Log-trans­formed, we can still see the invert­ed-V shape of Euro­pean counts, but it’s some­what sub­tle because it’s still being squashed by post-1400 increases and does leave room for doubt about sam­pling error. Mov­ing on to repro­duc­ing the pro­por­tions plot:

## Create proportions by summing per decade, then looping over each group & dividing by total for that decade:
decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=hdcount)
for (i in 1:nrow(hdcount)) {
        total <- decadeTotals[decadeTotals$Fl.decade == hdcount[i,]$Fl.decade,]$Count
        p <- hdcount[i,]$Count / total
        hdcount$Proportion[i] <- if(is.nan(p)) { 0 } else { p }
        hdcount$Total[i] <- total
}
with(hdcount, qplot(Fl.decade, Proportion, color=Group) + stat_smooth() + coord_cartesian(ylim = c(0, 1)))
Rel­a­tive pro­por­tions of “sig­nifi­cant fig­ures” in Human Accom­plish­ment (Mur­ray 2003) by geo­graphic region, LOESS-smoothed

We suc­cess­fully repro­duce it, mod­ulo the LOESS stan­dard errors (which can be dis­abled by adding se=FALSE to stat_smooth()), includ­ing the unwanted non­sen­si­cal extrap­o­la­tions. It is pos­si­ble with some tricky ggplot2 func­tion­al­ity to add in bino­mial smooth­ing (along with some jit­ter to unbunch the dat­a­points at the modal 0).

## roughly equivalent to:
# glm(cbind(Count, Total-Count) ~ splines::ns(Fl.decade,3), family="binomial", data=hdcount, subset=Group=="European")
binomial_smooth <- function(...) { geom_smooth(se=FALSE, method = "glm", method.args = list(family = "binomial"), ...) }
with(hdcount, qplot(Fl.decade, Proportion, color=Group) +
    binomial_smooth(formula = y ~ splines::ns(x, 3)) +
    geom_jitter(aes(color=Group), width=0.013, height=0.013))
Rel­a­tive pro­por­tions of “sig­nifi­cant fig­ures” in Human Accom­plish­ment (Mur­ray 2003) by geo­graphic region, bino­mi­al-s­pline-s­moothed for sen­si­ble extrap­o­la­tion

This still does­n’t pro­vide any indi­ca­tion of sam­pling error uncer­tain­ty, how­ev­er. Kirkegaard pro­vides one with CIs derived from boot­strap­ping, so I will pro­vide some­thing a lit­tle differ­ent: visu­al­iz­ing the uncer­tainty dynam­i­cally by graph­ing the smoothed pro­por­tions for each resam­ple in an ani­ma­tion of hun­dreds of boot­strap sam­ples.

So to do this bootstrap, we package up the various transformations from before, so we can sample-with-replacement the original dataset, transform, and plot repeatedly:

transformAndProportion <- function(df) {
    df$Fl.decade <- round(df$Fl, digits=-1)
    dfd <- subset(select=c(Fl.decade, Group), df)
    dfdcount <- aggregate(cbind(Group) ~ Fl.decade+Group, length, data=dfd)
    colnames(dfdcount)[3] <- "Count"
    decades <- seq(-700, 1950, by=10)
    for (i in 1:length(decades)) {
        d <- decades[i]
        if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="European",])==0) {
            dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
        if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Asian",])==0) {
            dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
        if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Other",])==0) {
            dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
    }
    dfdcount <- dfdcount[order(dfdcount$Fl.decade),]
    decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=dfdcount)
    for (i in 1:nrow(dfdcount)) {
        p <- dfdcount[i,]$Count / decadeTotals[decadeTotals$Fl.decade == dfdcount[i,]$Fl.decade,]$Count
        dfdcount$Proportion[i] <- if(is.nan(p)) { 0 } else { p }
    }
    return(dfdcount)
}

bootPlot <- function(df) {
    n <- nrow(df)
    bootDf <- df[sample(1:n, n, replace=TRUE),]
    bootDfdcount <- transformAndProportion(bootDf)
    ## WARNING: can't just call qplot due to old 'animation'/ggplot2 bug; have to assign & 'print'
    p <- with(bootDfdcount, qplot(Fl.decade, Proportion, color=Group) +
        binomial_smooth(formula = y ~ splines::ns(x, 3)) +
        geom_jitter(aes(color=Group), width=0.013, height=0.013))
    print(p)
}
library(animation)
saveGIF({for (i in 1:200) { bootPlot(h) }}, interval=0.15, ani.width=1300, ani.height=700,
    movie.name="2003-murray-humanaccomplishment-region-proportions-bootstrap.gif", clean=FALSE)
Ani­ma­tion of repeat­edly resam­pling & plot­ting rel­a­tive pro­por­tions of “sig­nifi­cant fig­ures” in Human Accom­plish­ment (Mur­ray 2003) by geo­graphic region, demon­strat­ing effects of sam­pling error on pro­por­tions & his­tor­i­cal curves

The boot­strap ani­ma­tion sug­gests to me that while the very ear­li­est time-pe­ri­ods are opaque and the Dark Ages differ­ence between Europe & Asia may be some­what higher or low­er, the over­all shape does­n’t change mean­ing­ful­ly.

The time-series aspect of the data on visual inspection appears to be a simple upwards trend, and is well-fit by low-order models like ARIMA(1,1,0), ARIMA(1,1,2), or ARIMA(0,1,2); this is probably due to the world population steadily increasing while the per capita rates remain stable.

library(forecast)
efit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="European")$Count)
afit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Asian")$Count)
ofit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Other")$Count)
plot(forecast(efit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(afit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(ofit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
Sim­ple ARIMA time-series fits & fore­casts to 3 global regions of “sig­nifi­cant fig­ures” in Human Accom­plish­ment

We can combine the sampling error quantification of full Bayesian posteriors, Poisson distribution of counts, and time-series aspects into a single Bayesian model, using brms as a convenient interface to Stan (rather than writing out the full model by hand), with uninformative priors, and then visualize the posterior distribution of the proportions (which itself is simply a transformation of the posterior):

library(brms)
b <- brm(Count ~ (1|Group), autocor = cor_bsts(~ Fl.decade | Group), family="zero_inflated_poisson", data = hdcount)

## Rather than use `fitted` to get the 95% CI & compute proportion, it would also be possible to draw samples from
## the posterior for each group/decade, total, calculate per-group proportion, and then summarize into quantiles; but
## that is much slower and requires more finicky code:
posterior <- fitted(b)
hdcount$B.low.prop  <-         posterior[,3] / hdcount$Total
hdcount$B.mean.prop <-         posterior[,1] / hdcount$Total
hdcount$B.high.prop <- pmin(1, posterior[,4] / hdcount$Total)

predframe <- subset(hdcount, select=c("Fl.decade", "Group", "B.mean.prop", "B.low.prop", "B.high.prop"))
ggplot(hdcount, aes(Fl.decade, Proportion, color=Group)) +
    geom_point() +
    geom_line(aes(y=B.mean.prop), data=predframe) +
    geom_ribbon(aes(y=B.mean.prop, ymin=B.low.prop, ymax=B.high.prop), alpha=0.05, data=predframe)
Bayesian mul­ti­-level time-series of “sig­nifi­cant fig­ures”; shaded region indi­cates 95% cred­i­ble inter­val around group mean in that decade

The smoothed time-series looks about the same, and the CIs suggest, like the bootstrap, that there is great uncertainty early on when populations are small & surviving figures are rare, but that the Dark Ages dip looks real and the European increases in proportion since then are also highly probable.

So over­all, cor­rect­ing for the infe­lic­i­ties in Kirkegaard’s orig­i­nal graph makes the graph some­what cleaner and is help­ful in pro­vid­ing quan­tifi­ca­tion of uncer­tain­ty, but none of the prob­lems drove the over­all appear­ance of the curve in the slight­est bit. If the graph is wrong, the issues will lie in sys­tem­atic biases in the data itself—not sta­tis­ti­cal quib­bling over sam­pling error or LOESS curves cross­ing an axis. (Com­par­i­son with graphs drawn from other clio­met­ric datasets such as Wikipedia or par­tic­u­larly would be infor­ma­tive.)

Program for non-spaced-repetition review of past written materials for serendipity & rediscovery: Archive Revisiter

Spaced repetition helps one remember facts by creating discrete flashcards which one tests oneself on at increasingly distant ‘spaced’ time periods, repeating the fact just before one probably would have forgotten it; using software to track & automate tests & review scheduling, spaced repetition can scale to hundreds of thousands of discrete items.

If spac­ing out facts can help one remem­ber by repeat­ing items just before they are for­got­ten, is there any use for an “anti-spaced rep­e­ti­tion” with the oppo­site method of repeat­ing items only after they are prob­a­bly for­got­ten?

I can think of two: first, it could be used to plan by eg track­ing one’s favorite movies of all time and sched­ul­ing a rewatch when­ever one is pre­dicted to have for­got­ten enough to make them novel & highly enjoy­able again. Sec­ond, and more inter­est­ing­ly, it could be used as a serendip­ity gen­er­a­tor by allow­ing effi­cient pro­cess­ing of notes or excerpts or old writ­ings.

In reread­ing such mate­ri­als many years lat­er, one often gains a new per­spec­tive or learns some­thing use­ful because one for­got some­thing: one did­n’t under­stand some­thing about it at the time, or new mate­r­ial has rad­i­cally changed one’s inter­pre­ta­tion, and since it’d been for­got­ten, no use could be made of it. Unfor­tu­nate­ly, using spaced rep­e­ti­tion to mem­o­rize such mate­ri­al, while ensur­ing any serendip­i­tous con­nec­tions get made as soon as pos­si­ble, would be rad­i­cally infea­si­ble for bulky items (a sin­gle lengthy text excerpt might cor­re­spond to hun­dreds of dis­crete items, quickly over­load­ing even SRS sys­tems) and for almost all items, use­less. One can jus­tify reread­ing old mate­r­ial once or per­haps twice, but not many rereads nor full mem­o­riza­tion. But reread­ing hap­haz­ardly is likely to ineffi­ciently cover some mate­r­ial many times while neglect­ing oth­ers, and such rereads will often be far too early in time (or—a lesser con­cern here—­too late).

Instead of spaced rep­e­ti­tion, one would instead use anti-spaced rep­e­ti­tion: each item would be tracked and reviewed and its expected for­get­ting time pre­dict­ed, as in spaced rep­e­ti­tion, but instead of sched­ul­ing a review before for­get­ting, a review is sched­uled for some time (prob­a­bly long after­wards) after for­get­ting. The total num­ber of reviews of each item per user life­time would be set to a small num­ber, per­haps 1–4, bound­ing the time con­sump­tion at a fea­si­ble amount.

Such an anti-spaced rep­e­ti­tion sys­tem could be used with hun­dreds of thou­sands of notes or clip­pings which a per­son might accu­mu­late over a life­time, and enable them to invest a few min­utes a day into read­ing old notes, occa­sion­ally com­ing up with new insights, while ensur­ing they don’t waste time read­ing notes too many times or read­ing notes they likely already remem­ber & have exhaust­ed.

One rea­son to take notes/clippings and leave com­ments in stim­u­lat­ing dis­cus­sions is to later ben­e­fit by hav­ing ref­er­ences & cita­tions at hand, and grad­u­ally build up an idea from dis­parate threads and make new con­nec­tions between them. For this pur­pose, I make exten­sive excerpts from web pages & doc­u­ments I read into my clip­pings (func­tion­ing as a ), and I com­ment con­stantly on Reddit/LessWrong/HN etc. While expen­sive in time & effort, I often go back, months or years lat­er, and search for a par­tic­u­lar thing and expand & inte­grate it into another writ­ing or expand it out to an entire essay of its own. (I also value highly not being in the sit­u­a­tion where I believe some­thing but I do not know why I believe it other than the con­vic­tion “I read it some­where, once”.)

This sort of workflow, using simple note-taking tools like Evernote, works well enough when I have a clear memory of what the citation/factoid was, perhaps because it was so memorable, or when the citations or comments are in a nice cluster (perhaps because there was a key phrase in them or I kept going back & expanding a comment), but it loses out on key benefits to this procedure: serendipity and perspective.

As time pass­es, one may real­ize the impor­tance of an odd tid­bit or have utterly for­got­ten some­thing or events con­sid­er­ably changed its mean­ing; in this case, you would ben­e­fit from revis­it­ing & reread­ing that old bit & expe­ri­enc­ing an “aha!” moment, but you don’t real­ize it. So one thing you could do is reread all your old clip­pings & com­ments, apprais­ing them for reuse.

But how often? And it’s a pain to do so. And how do you keep track of which you’ve already read? One thing I do for my emails is semi-annually I (try to) read through my previous 6 months of email to see what might need to be followed up on or mined for inclusion in an article. (For example, an ignored request for data, or a discussion of darknet markets with a journalist I could excerpt into one of my DNM articles so I can point future journalists at that instead.) This is already difficult, and it would be even harder to expand. I have read through my LessWrong comment history… once. Years ago. It would be more difficult now. (And it would be impossible to read through my Reddit comments as the interface only goes back ~1000 comments.)

Simply re-reading periodically in big blocks may work but is suboptimal: there is no interface easily set up to reread them in small chunks over time, no constraints which avoid far too many reads, nor is there any way to remove individual items which you are certain need never be reviewed again. Reviewing is useful but can be an indefinite timesink. (My sent emails are not too hard to review in 6-month chunks, but my IRC logs are bad—7,182,361 words in one channel alone—and my >38k Evernote clippings are worse; any lifelogging will exacerbate the problem by orders of magnitude.) This is probably one reason that people who keep journals or diaries don’t reread them. Nor can it be crowdsourced or done by simply ranking comments by public upvotes (in the case of Reddit/LW/HN comments), because the most popular comments are ones you likely remember well & have already used up, and the oddities & serendipities you are hoping for are likely unrecognizable to outsiders.

This sug­gests some sort of review­ing frame­work where one sys­tem­at­i­cally reviews old items (sent emails, com­ments, IRC logs by one­self), putting in a con­stant amount of time reg­u­larly and using some sort of ever expand­ing inter­val between re-reads as an item becomes exhausted & ever more likely to not be help­ful. Sim­i­lar to the log­a­rith­mi­cal­ly-bounded num­ber of back­ups required for indefi­nite sur­vival of data (), “Decon­struct­ing Death­is­m—An­swer­ing Objec­tions to Immor­tal­ity”, Mike Perry 2013 (note: this is an entirely differ­ent kind of prob­lem than those con­sid­ered in Free­man Dyson’s immor­tal intel­li­gences in Infi­nite in All Direc­tions, which are more fun­da­men­tal), dis­cusses some­thing like what I have in mind in terms of an immor­tal agent try­ing to review its mem­o­ries & main­tain a sense of con­ti­nu­ity, point­ing out that if time is allo­cated cor­rect­ly, it will not con­sume 100% of the agen­t’s time but can be set to con­sume some bounded frac­tion:

It seems rea­son­able that past ver­sions of the self would “sur­vive” as we remem­ber the events of times past, that is to say, our episodic mem­o­ries, and this would have impor­tance in our con­tin­u­ing to per­sist as what could be con­sid­ered the “same” albeit also a chang­ing, devel­op­ing per­son. But in addi­tion to this mnemonic rein­force­ment I imag­ine there would be a more gen­eral feel­ing of being a par­tic­u­lar indi­vid­u­al, an “ambiance” derived from but not refer­ring to any spe­cific past expe­ri­ences. Ambiance alone would not be suffi­cient, I think, to make us who we are; episodic mem­o­ries would also be nec­es­sary, yet it could con­sid­er­ably lessen the need for fre­quent recall and thus alle­vi­ate the prob­lem of dilu­tion.

Another interesting thought is that certain items might consistently be consulted more frequently than others. (Indeed, would this not be expected?) In this way it would actually be possible to bypass the dilution effect and instead allow a fixed fraction of time for perusal of any given item, even as more items were added indefinitely. A simple way of doing this could be first to allow some fixed fraction of the time for day-to-day affairs and other non-archival work (“prime time”), and spend the rest of the time on perusal of personal archives (“archive time”). The exact apportioning of prime versus archive time is not important here, but it will be instructive to consider how the archive time itself might be subdivided. A simple, if overly simplistic, strategy would be to have half this time devoted to the first century’s records, half the remainder to the second century, and so on. (Since there would only be a finite number of centuries, there would be some unused archive time at the end, which could be spent as desired. Note, however, that in the limit of infinite total time covering infinitely many centuries, the usage of archive time would approach but not exceed 100%.) In this way, then, there would be a fixed fraction of archive time, 1⁄2^n, spent on the _n_th century’s records, regardless of how many centuries beyond the nth were lived or how many records accumulated. True, this way of apportioning time might not be much good beyond a few centuries; only about one trillionth the total time would be spent on the 40th century, for instance, around 1⁄300 sec per 100 years. (Possibly a lot could be covered even in this brief interval of about 3 million nanoseconds, however.) But the apportionment scheme could be adjusted.

A more interesting and plausible, if slightly harder-to-describe scheme would be to choose a constant c and allow the fraction c⁄((n + c − 1)(n + c)) to the _n_th-century records. It is easy to show that the time for all centuries will add up to 100% as before, whatever positive value of c we start with. Starting with c = 9 will get 10% of the total time spent on the first century, with subsequent centuries receiving a diminishing share as before, but the rate of falloff will be much slower, so that the 40th century will still receive 0.4%, or about 5 months per 100 years, that is to say, 240 million nanoseconds per minute. If we suppose that our immortal settles eventually into a routine in which 10% of the time overall is archive time, there would be 24 million nanoseconds available each minute of life for the 40th century’s memories alone, if desired, with many other centuries getting more or less comparable or greater amounts of attention, and none omitted entirely. This, I think, makes at least a plausible case that a reasonable sense of one’s personal identity could be sustained indefinitely.

In the above exam­ples the great­est pro­por­tion of archive time falls to the ear­lier records, which might be fit­ting since these should be the most impor­tant as for­ma­tive years for the prospec­tive immor­tal, thus the most impor­tant for iden­tity main­te­nance. (Mem­ory recall would also nat­u­rally occur dur­ing prime time; the empha­sis here could be on recent events, to main­tain a bal­ance over­al­l.) In sum­ma­ry, then, we have con­sid­ered ways that the prob­lem of dilu­tion might be suc­cess­fully man­aged. Rel­a­tively infre­quent perusal of mem­o­ries might still suffice to main­tain the nec­es­sary con­ti­nu­ity with past ver­sions of the self, or proper sched­ul­ing could sta­bi­lize the fre­quency of recall and bypass the dilu­tion effect, or both. We see in any case that the prob­lem is not what it may seem at first sight. We have no guar­an­tee, of course, that it would not get out of bounds, but cer­tainly some grounds for hope.

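Perry’s first scheme (half the archive time to the first century’s records, half the remainder to the second, and so on—ie. a fraction of (1⁄2)^n to the _n_th century) is easy to check numerically in base R:

```r
## Archive-time fraction allotted to the n-th century under the halving scheme: (1/2)^n.
archiveFraction <- function(n) { 0.5^n }

sum(archiveFraction(1:40))   # total allocated so far; approaches but never exceeds 100%
archiveFraction(40)          # ~9e-13: the "about one trillionth" for the 40th century
## Sanity check of the "~1/300 sec per 100 years" figure for the 40th century:
archiveFraction(40) * 100 * 365.25 * 24 * 60 * 60   # ~0.0029 seconds
```

This confirms the quoted arithmetic: the halving scheme sums to 100% in the limit, and allots the 40th century roughly a trillionth of the time, or about 3ms per century of life.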
So you could imagine some sort of software along the lines of Anki/Mnemosyne/Supermemo which you spend, say, 10 minutes a day at, simply rereading a selection of old emails you sent, lines from IRC with n lines of surrounding context, Reddit & LW comments etc; with an appropriate backoff & time-curve, you would reread each item maybe 3 times in your lifetime (eg first after a delay of a month, then a year or two, then decades). Each item could come with a rating function where the user rates it as an important or odd-seeming or incomplete item to be exposed again in a few years, or as totally irrelevant and never to be shown again—and for many bits of idle chit-chat, mundane emails, or intemperate comments, never is not an instant too soon! (More positively, anything already incorporated into an essay or otherwise reused likely doesn’t need to be resurfaced.)

This would­n’t be the same as a spaced rep­e­ti­tion sys­tem which is designed to recall an item as many times as nec­es­sary, at the brink of for­get­ting, to ensure you mem­o­rize it; in this case, the for­get­ting curve & mem­o­riza­tion are irrel­e­vant and indeed, the pri­or­ity here is to try to elim­i­nate as many irrel­e­vant or use­less items as pos­si­ble from show­ing up again so that the review does­n’t waste time.

More specifi­cal­ly, you could imag­ine an inter­face some­what like Mutt which reads in a list of email files (my local POP email archives down­loaded from Gmail with getmail4, file­name IDs), chunks of IRC dia­logue (a grep of my IRC logs pro­duc­ing lines writ­ten by me ±10 lines for con­text, hashes for ID), LW/Reddit com­ments down­loaded by either scrap­ing or API via the Big­Query copy up to 2015, and stores IDs, review dates, and scores in a data­base. One would use it much like a SRS sys­tem, read­ing indi­vid­ual items for 10 or 20 min­utes, and rat­ing them, say, upvote (‘this could be use­ful some­day, show me this ahead of sched­ule in the future’) / down­vote (push this far off into the future) / delete (never show again). Items would appear on an expand­ing sched­ule. For exam­ple if one wanted to review items 4 times over the next 50 years (roughly my life expectan­cy), a sched­ule might be:

round({t=0:4; t^6.981})
# [1]     0     1   126  2142 15958

So in 1 day, then a third of a year, then after 5.8 years, then after 43 years. Alter­nate­ly, a geo­met­ric series might be a bit kinder and not too fron­t-load­ed:

review <- function(n, r, a) { a * (1 - r^n) / (1 - r) }
reviews <- function(n, r, a) { sapply(1:n, function(nn) { review(nn, r, a) }) }
findR <- function (firstReview=31, n_total=3, years=50) {  optimize(interval=c(0, 1000),
    f = function(r) { abs(sum(sapply(1:n_total,
         function(n){review(n, a=firstReview, r=r)})) - (365*years)) })$minimum }
findR(firstReview=30, n_total=4, years=50)
# [1] 7.728823216
round(reviews(4, 7.728823216, 30))
# [1]    30   262  2054 15904

The geometric series allows for easy incorporation of rating modifications: a downvote penalty might multiply r by 1.5, vs 0.5 for upvotes. This would also allow some input from statistical algorithms which predict upvote/downvote/delete and advance/delay items based on that prediction, which would hopefully quickly learn to avoid idle chit-chat and short performative utterances and start to prioritize more interesting & unusual items. (For example, a good start might be an SVM on a bag-of-words version of each item’s text, and then as the dataset of ratings expands, more complicated algorithms could be plugged in.)
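A minimal sketch of how such rating adjustments might work, reusing the geometric gaps implied by the schedule above (the 1.5×/0.5× multipliers are the hypothetical ones just mentioned, not a tuned design):

```r
## Gap (in days) before the n-th review of an item: a * r^(n-1), the geometric series as above.
nextGap <- function(a, r, n) { a * r^(n-1) }
## Hypothetical rating adjustment: downvotes push items further out, upvotes pull them in.
updateR <- function(r, rating) {
    switch(rating, upvote = r * 0.5, downvote = r * 1.5, neutral = r) }

r0 <- 7.728823216                                # the fitted ratio from findR() above
round(nextGap(30, r0, 2))                        # unmodified second gap: 232 days
round(nextGap(30, updateR(r0, "downvote"), 2))   # downvoted: pushed out to 348 days
round(nextGap(30, updateR(r0, "upvote"), 2))     # upvoted: pulled in to 116 days
```

Each item carries its own r, so a handful of ratings quickly separates exhausted items (ever-longer gaps) from promising ones.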

As far as I know, some to-do/self-help sys­tems have some­thing like a peri­odic review of past stuff, and as I men­tioned, spaced rep­e­ti­tion sys­tems do some­thing some­what sim­i­lar to this idea of expo­nen­tial revis­its, but there’s noth­ing like this at the moment.

On the value of new statistical methods

Genetic correlation research is a hot area in 2016–2017: my genetic correlation bibliography passed 400 references in May 2017. What is particularly interesting reference-wise is that publications from 2015–2017 make up around half of the results: more genetic correlations have been calculated in the past 3 years than in the previous 80 years, since the first estimates were made somewhere in the 1930s or so.

For cal­cu­lat­ing them, there are 3 main meth­ods.

  1. twin registry studies require twin phenotypic measurements which can usually be collected by mailed surveys; to analyze them, one computes some Pearson’s r or uses a standard SEM with additional covariance paths (doable with Wright’s path analysis back in the 1930s by hand), scaling roughly linearly with sample size, having excellent statistical power at a few hundred twin pairs and capturing full heritabilities
  2. for GCTA, one requires full raw SNP data on 5000+ unre­lated indi­vid­u­als at $100+ a sam­ple, along with simul­ta­ne­ous phe­no­typic mea­sure­ments of both traits and must use com­pli­cated cus­tom soft­ware whose com­pu­ta­tion scales expo­nen­tially and can only exam­ine a nar­row sub­set of her­i­tabil­ity
  3. for LDSC, one requires pub­lic sum­mary poly­genic scores but they can be from sep­a­rate GWASes and cal­cu­lated on traits indi­vid­u­al­ly, and the com­pu­ta­tional com­plex­ity is closer to lin­ear than expo­nen­tial; the penalty for not need­ing raw SNP data from twice-mea­sured indi­vid­u­als is that SNP costs dou­ble or more since mul­ti­ple GWASes are used, and LDSC even more ineffi­cient than GCTA, so you’ll need >10,000 indi­vid­u­als used in each poly­genic score, and still need cus­tom soft­ware.

In other words, the twin method is old, simple, requires small sample sizes, and uses easily obtained phenotypic measurements; while GCTA/LDSC are new, complicated, and require expensive novel genetic sequencing data in huge sample sizes as well as the phenotypic measurements. So naturally LDSC gets used an order of magnitude more! Looking at the bibliography, we can guesstimate the rates at: twin, 1 paper/year; GCTA (requiring raw data), 10/year; LDSC (public summary stats), 100/year.

Amazing the difference methods can make. It’s all about data access. For all its disadvantages, LDSC statistically works around the lack of individual-level raw data and makes do with the data that gets publicly released because it is not seen to violate ‘privacy’ or ‘bioethics’, so any researcher can make use of the method on their particular dataset, while twin and GCTA require individual-level data which is jealously guarded by its owners.

Methodologists and statisticians are probably seriously undervalued: a good new method can cause a revolution.

Bayesian power analysis: probability of exact replication

Psychologist Michael Kane mentions:

TFW a correlation of interest in a new study (n = 355) replicates that from a prior study (n = 182) to the second decimal (r = 0.23). Winning!

Turning up the same correlation twice is somewhat surprising, because random sampling error will substantially vary the empirical correlation from sample to sample, as reflected by the wide credible intervals around r with n = 182–355. How surprising is it? Is it too good to be true?

One approach would be to ask: if we generated bivariate samples of size n = 355 with a fixed relationship of r = 0.23, how often would the samples generate a rounded estimate of r̂ = 0.23?

library(MASS) # for mvrnorm
## Simulate one bivariate sample with true correlation r_gen and check whether
## the estimated correlation rounds to the target r_test (default: r_gen itself):
powerSim <- function (r_gen, n, r_test=NA) {
    data <- mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r_gen, r_gen, 1), nrow=2))
    r_est <- cor.test(data[,1], data[,2])$estimate
    if (is.na(r_test)) { r_test <- r_gen }
    return(round(r_test, digits=2) == round(r_est, digits=2)) }
powerSims <- function(r, n, r_test=NA, iters=100000) {
    sim <- replicate(iters, powerSim(r, n, r_test=r_test))
    return(sum(sim) / length(sim)) }
powerSims(0.23, 355)
# [1] 0.07798

So around 8% of the samples.

This treats r = 0.23 as a parameter known with infinite precision, rather than an estimate (using around half the data) of the unknown parameter r; there would be considerable posterior uncertainty about what r is, and this will affect how often two samples would yield the same estimate. If the true r were, say, 0.10 (as is entirely possible), it would be highly unlikely for the second sample to yield r̂ = 0.23 again, because the overestimation fluke would have to repeat itself to yield r̂ = 0.23 both times.

To incorporate the uncertainty, we can feed a simulated dataset exactly matching the description of n = 182/r = 0.23 to an uninformative Bayesian model, calculate a posterior distribution over r (which gives a 95% credible interval of 0.09–0.37), and then draw possible _r_s from the posterior and run the original simulation, asking how often we recover r̂ = 0.23.

library(MASS); library(brms)
n1 = 182
n2 = 355
r1 = 0.23
## empirical=TRUE forces the sample correlation to be exactly r1:
data1 = mvrnorm(n=n1, mu=c(0, 0), Sigma=matrix(c(1, r1, r1, 1), nrow=2), empirical=TRUE)
colnames(data1) <- c("x", "y")
b1 <- brm(y ~ x, iter=20000, data=data1); summary(b1)
# ...Population-Level Effects:
#           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept     0.00      0.07    -0.14     0.14      40000    1
# x             0.23      0.07     0.09     0.37      40000    1
posteriorX <- fixef(b1, summary=FALSE)[,2]; summary(posteriorX)
#        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
# -0.08060884  0.18061570  0.23010870  0.22999820  0.27916800  0.55868700
replicates <- sapply(posteriorX, function(r_post) { powerSim(r_post, n2, r_test=r1) })
summary(replicates); mean(replicates)
#    Mode   FALSE    TRUE    NA's
# logical   38262    1738       0
# [1] 0.04345

Around 4% of the time, reflecting the increased improbability of true values like r = 0.09 or r = 0.37 producing the specific sample estimate of r̂ = 0.23.

Thus, observing the same summary statistics in even relatively large samples is somewhat suspicious, and might be a good reason to double-check other aspects of the code & data.

Expectations are not expected deviations, and large numbers of variables are not large samples

If one has a large number of variables with a certain expectation, it is tempting to interpret the law of large numbers or the central limit theorem as implying that the sum of a large number of variables, or the observed sample value after a large number of timesteps, will be close or identical to the expected value. So for coin-flipping, one knows that flipping 10 coins could easily yield a large deviation like a sum of 9 heads instead of the expected 5 heads, but one then thinks that after a million coin flips, the sum of heads will probably be 500,000. Another example of this mistake might be to make arguments about scientific research or charities: “charitable intervention X is affected by hundreds or thousands of different variables and the benefits or costs unfold over long time periods like decades or centuries; our best estimate of the mean value of interventions like X is that it is some small value Y; thus, by the CLT etc, we can be sure that X’s ultimate value will be neither much bigger nor much smaller than Y but very close to Y, and, particularly, we can be sure that there are no interventions like X which could possibly turn out to have ultimate values which are orders of magnitude larger or smaller than Y, so we can rule out any such claims and we know the Value of Information is small.”

This is not wrong so much as misunderstood: one might call it a confusion of the variable’s distribution with the sampling distribution. The value only becomes closer in a relative sense; in an absolute sense, as more variables are added (without the absolute magnitude of each shrinking linearly), the actual deviation from the expectation simply becomes larger and larger. (Like a random walk: the expectation is the same as the current value, but the variance increases with time.)
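A quick simulation illustrates the distinction for the coin-flipping example: as n grows, the absolute deviation of the heads-count from n⁄2 keeps growing (on the order of √n), even as the relative deviation n shrinks toward 0 (`madDeviation` is a hypothetical helper name):

```r
## Mean absolute deviation of the heads-count from its expectation n/2,
## estimated by simulation; it grows ~sqrt(n) in absolute terms while
## shrinking ~1/sqrt(n) relative to n.
set.seed(2014)
madDeviation <- function(n, iters=2000) { mean(abs(rbinom(iters, n, 0.5) - n/2)) }
ns <- c(10, 100, 10000, 1000000)
absolute <- sapply(ns, madDeviation)
data.frame(n=ns, absolute=absolute, relative=absolute/ns)
```

The `absolute` column increases roughly 10× for every 100× increase in n (≈0.4·√n), while the `relative` column collapses toward 0: the million-flip sum is typically hundreds of heads away from 500,000.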

As in discussing how “diversification” works, it is a mistake to think that one ‘diversifies’ one’s investments by adding additional investments of the same size; for any variance reduction, the fixed total investment must instead be split up among ever more different investments, as many small investments:

In general, the presence of more assets in a portfolio leads to greater diversification benefits, as can be seen by considering portfolio variance as a function of n, the number of assets. For example, if all assets’ returns are mutually uncorrelated and have identical variances σ², portfolio variance is minimized by holding all assets in the equal proportions 1/n.[Samuelson, Paul, “General Proof that Diversification Pays”, Journal of Financial and Quantitative Analysis 2, March 1967, 1-13.] Then the portfolio return’s variance equals var[(1/n)·(x₁ + … + x_n)] = n·(σ²/n²) = σ²/n, which is monotonically decreasing in n.

The latter analysis can be adapted to show why adding uncorrelated volatile assets to a portfolio,[see Samuelson, Paul, “Risk and uncertainty: A fallacy of large numbers”, Scientia 98, 1963, 108-113.][Ross, Stephen, “Adding risks: Samuelson’s fallacy of large numbers revisited”, Journal of Financial and Quantitative Analysis 34, September 1999, 323-339.] thereby increasing the portfolio’s size, is not diversification, which involves subdividing the portfolio among many smaller investments. In the case of adding investments, the portfolio’s return is x₁ + … + x_n instead of (1/n)·(x₁ + … + x_n), and the variance of the portfolio return, if the assets are uncorrelated, is n·σ², which is increasing in n.
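The contrast can be checked by Monte Carlo (a sketch assuming n uncorrelated unit-variance assets; `portfolioVar` is a hypothetical helper name): subdividing a fixed budget over n assets drives portfolio variance down as σ²/n, while stacking n full-size assets drives it up as n·σ².

```r
## Portfolio variance by simulation: split=TRUE subdivides a fixed budget
## (portfolio return = mean of n assets), split=FALSE adds n full-size assets
## (portfolio return = their sum). Assets are uncorrelated N(0,1) returns.
set.seed(1)
portfolioVar <- function(n, split=TRUE, iters=100000) {
    returns   <- matrix(rnorm(iters * n), ncol=n)
    portfolio <- if (split) { rowMeans(returns) } else { rowSums(returns) }
    var(portfolio) }
portfolioVar(10, split=TRUE)  # ~ sigma^2/n = 0.1: diversification
portfolioVar(10, split=FALSE) # ~ n*sigma^2 = 10: "fallacy of large numbers"
```

Only the subdivided portfolio becomes safer with more assets; merely adding more risks makes the total outcome ever more volatile in absolute terms.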