Statistical Notes

Miscellaneous statistical stuff
statistics, decision-theory, Haskell, R, JS, power-analysis, survey, Bayes, genetics, IQ, order-statistics
2014-07-172020-04-28 in progress certainty: possible importance: 5


  • crit­i­cism of teeth-re­moval ex­per­i­ment in rats http://­less­wrong.­com/r/dis­cus­sion/l­w/kf­b/open_thread­_30_june_2014_6_ju­ly_2014/b1u3
  • crit­i­cism of small Noopept self­-ex­per­i­ment­­s/689936-My-Pa­per-quot-Noopep­t-am­p-The-Place­bo-Effec­t-quot?p=11910708&view­ful­l=1#­post11910708
  • why Soy­lent is not a good idea http://­less­wrong.­com/l­w/h­ht/link_­soy­len­t_crowd­fund­ing/90y7
  • mis­in­ter­pre­ta­tion of flu­o­ri­da­tion meta-analy­sis and ig­no­rance of VoI http://theness.­com/neu­ro­log­i­ca­blog/in­­p/an­ti-flu­o­ride-pro­pa­gan­da-as-news/#­com­men­t-76400
  • http://­less­wrong.­com/l­w/1lt/­case_s­tudy_me­la­ton­in/8mgf
  • Full­text: http­s://dl.­drop­boxuser­con­tent.­com/u/280585369/2014-dubal.pdf is this pos­si­ble? http://nextbig­fu­ture.­com/2014/05/k­l-vs-gene-makes-up-six-iq-points-of.htm­l#­com­men­t-1376748788 http­s://­dit.­com/r/Nootrop­ic­s/­com­ments/25233r/­boost­_y­our_iq_by_6_­points/chd­dd7f
  • tACS causes lu­cid dream­ing: http­s://­dit.­com/r/Lu­cid­Dream­ing/­com­ments/27y7n6/no_brain_s­tim­u­la­tion_will_not_get_y­ou_lu­cid/ck­6isgo
  • Herbal­ife growth pat­terns: http­s://­dit.­com/r/busi­ness/­com­ments/24aoo2/what_un­sus­tain­able_­growth_look­s_­like_herbal­ife/ch5hwtv
  • Plau­si­ble cor­re­late of Fair­trade: http­s://­dit.­com/r/E­co­nom­ic­s/­com­ments/26jb2d/­sur­prise_­fair­trade_­does­nt_ben­e­fit_the_poor/chrx9s4
  • slave whip­pings vs cot­ton pro­duc­tion http://­less­wrong.­com/r/dis­cus­sion/l­w/k­w­c/open_thread­_sep­t_17_2014/bajv
  • whether a study on men­tal ill­ness & vi­o­lence shows schiz­o­phren­ics are not more likely to mur­der but rather be mur­dered: http­s://­dit.­com/r/psy­chol­o­gy/­com­ments/2fwjs8/peo­ple_with­_­men­tal_ill­ness_are_­more_­like­ly_­to_be/ck­dq50k / http://www.­na­tionalelf­ser­­li­ca­tion-type­s/ob­ser­va­tion­al-s­tudy/peo­ple-with­-men­tal-ill­ness-are-more-like­ly-to-be-vic­tim­s-of-homi­cide-than-per­pe­tra­tors-of-homi­cide/#­com­men­t-95507 (see also http://s­lat­es­tarscratch­pad.­tum­blr.­com/­post/120950150581/psy­chol­ar-gi­raffe­po­lice­force-mu­se­icetc http­s://­dit.­com/r/s­lat­estar­codex/­com­ments/744rqn/vi­o­lence_is_not_a_pro­duc­t_of_­men­tal_ill­ness/d­nwb1kj/ )
  • For­tune analy­sis of higher fe­male CEO re­turns http://­less­wrong.­com/r/dis­cus­sion/l­w/l3b/­con­trar­i­an_l­w_views_and_their_e­co­nom­ic/bftw
  • al­gae/IQ: http://­less­wrong.­com/r/dis­cus­sion/l­w/l9v/open_thread­_nov_17_nov_23_2014/b­m7o
  • synaes­the­si­a/IQ: http­s://­dit.­com/r/psy­chol­o­gy/­com­ments/2m­ry­te/­sur­pris­ing_iq_­boost­_12_in_av­er­age_by_a_­train­ing/cm760v8
  • mis­in­ter­pre­ta­tion: http­s://s­lat­estar­codex.­com/2014/12/08/links-1214-come-ye-to-beth­linkhem/#­com­men­t-165197
  • un­der­pow­ered/­mul­ti­ple-cor­rec­tion jobs pro­gram: http­s://s­lat­estar­codex.­com/2014/12/08/links-1214-come-ye-to-beth­linkhem/#­com­men­t-165197
  • claimed fall in digit span back­wards mi­nus­cule and non-s­ta­tis­ti­cal­ly-sig­nifi­cant, no ev­i­dence of het­ero­gene­ity be­yond vari­abil­ity due to sam­ple size http://­dr­jamesthomp­son.blogspot.­com/2015/04/dig­it-s­pan-bomb­shel­l.htm­l?show­Com­men­t=1428096775425#c4097303932864318518
  • Claimed ran­dom­ized ex­per­i­ment of whether sushi tastes worse after freez­ing is not ac­tu­ally a ran­dom­ized ex­per­i­ment http­s://­dit.­com/r/­science/­com­ments/324xm­f/ran­dom­ized_­dou­ble­blind­_s­tudy_shows_the_qual­i­ty_of/c­q8dmsb
  • sex­ual open­ness re­sult un­der­mined by ceil­ing effect http://­mind­hack­s.­com/2015/04/28/when-so­ci­ety-is­nt-judg­ing-wom­en­s-sex-drive-ri­val­s-men­s/#­com­men­t-362749
  • mu­sic study claim­ing WM in­ter­ac­tion: pos­si­ble ceil­ing effect? see FB PM
  • at­tempt to mea­sure effect of Nazi an­ti-schiz­o­phre­nia eu­gen­ics pro­gram failed to use breed­er’s equa­tion to es­ti­mate pos­si­ble size of effect, which is too small to de­tect with avail­able data and hence at­tempt is fore­doomed: http­s://­dit.­com/r/eu­gen­ic­s/­com­ments/3hqdl­l/­be­tween_73_and_100_of_al­l_in­di­vid­u­al­s_with­/cul2nzw
  • claim high IQ types al­most 100% fail­ure rates due to in­ap­pro­pri­ate model as­sump­tion of nor­mal dis­tri­b­u­tion + nar­row stan­dard de­vi­a­tion: http://poly­math­archives.blogspot.­com/2015/01/the-i­nap­pro­pri­ate­ly-ex­clud­ed.htm­l?show­Com­men­t=1441741719623#c1407914596750199739
  • im­plau­si­ble claims about suc­cess rate of fa­cial recog­ni­tion ap­plied to St Pe­ters­burg pop­u­la­tion: http­s://news.y­combi­na­tor.­com/item?id=11491264 (see also “Fa­cial recog­ni­tion sys­tems stum­ble when con­fronted with mil­lion-face data­base”)
  • hu­man Tox­o­plasma gondii study is not well-pow­ered as au­thors claim due to in­cor­rect power analy­sis, and re­sults are ev­i­dence for harm: http://blogs.dis­cov­er­magazine.­com/neu­roskep­tic/2016/02/20/myth-mind-al­ter­ing-par­a­site-tox­o­plas­ma-gondi­i/#­com­men­t-2755778490 ; http­s://­dit.­com/r/s­lat­estar­codex/­com­ments/5vjr­mo/­tox­o­plas­ma_­does­nt_­cause_ado­les­cen­t_psy­chosis/de2x4kh/
  • at­tempt at at­tribut­ing Bit­coin price in­creases to tech­nol­ogy im­prove­ments: http­s://­dit.­com/r/Bit­coin/­com­ments/5tbt8f/buz­z_­fac­tor_or_in­no­va­tion_po­ten­tial_what_­ex­plain­s/d­dlmzrz/
  • analy­sis of de­signer drug/re­search chem­i­cal ac­tiv­ity on Wikipedia is dri­ven al­most en­tirely by edit­ing pat­terns of just 2 Wikipedia ed­i­tors par­tic­u­larly in­ter­ested in the top­ic: http://­cal­­chem­i­cal-wik­i/­ex­plo­rations/2016-09-12-em­cd­da-watch­list-and-wikipedi­a-time­line#­com­men­t-3277669328
  • fail­ure to use me­di­a­tion SEM, differ­ence-in-s­ta­tis­ti­cal-sig­nifi­cance-is-not-a-sig­nifi­can­t-d­iffer­ence: http­s://­dit.­com/r/s­lat­estar­codex/­com­ments/6qw­b0q/­crit­i­cal_­think­ing_skill­s_are_­more_im­por­tan­t_than/dl51ubw/
  • Ne­an­derthal an­ces­try per­cent­age & autism: http­s://­dit.­com/r/s­lat­estar­codex/­com­ments/74fevz/find­ings_­sug­gest_that_high­_level­s_of_ne­an­derthal/d­ny3sh9/
  • Anime im­age clas­si­fi­ca­tion project likely un­done by us­ing non-i.i.d. im­ages

Failed Facebook Critiques

  • Face­book emo­tion study: http­s://­dit.­com/r/psy­chol­o­gy/­com­ments/29vg9j/no_e­mo­tion­s_ar­en­t_re­al­ly_­con­ta­gious_over_­face­book/­cip7ln5

A re­ply to http://www.ischool.berke­ley.e­du/newsande­vents/news/20140828­face­book­ex­per­i­ment (too long for in­clu­sion in http­s://news.y­combi­na­tor.­com/item?id=8378743 )

91 points and no com­ments? OK, I guess it falls to me to jump on this grenade.

So why is the Face­book study bad sci­ence? After 5 screens of me­an­der­ing anec­dotes, in­sin­u­a­tions, in­sults, etc we fi­nally get to a real crit­i­cism:

Did peo­ple feel be­trayed about the lack of in­formed con­sent? You know, in psy­chol­ogy re­search, when peo­ple find out they’ve been an un­wit­ting ex­per­i­men­tal sub­ject, it’s not un­com­mon for them to feel duped. They’re at least sur­prised. The only dis­tinc­tion is that aca­d­e­mics who ex­per­i­ment on sub­jects with­out get­ting their con­sent first usu­ally tell peo­ple about it im­me­di­ately after­ward. They de­brief the sub­jects and an­swer ques­tions. They un­ruffle ruffled feath­ers. They may al­low a sub­ject to re­move his or her data from the ex­per­i­ment. In some cas­es, they even offer fol­low-up ser­vices. Given that Face­book did noth­ing to in­form sub­jects or make them feel whole again, it’s hard to blame folks for feel­ing un­duly vi­o­lat­ed.

So? As was pointed out, these ex­per­i­ments are run all the time by all sorts of en­ti­ties, and by mak­ing this crit­i­cism you are im­plic­itly ar­gu­ing that it would be bet­ter for Face­book to keep the re­sults se­cret (like com­pa­nies usu­ally do) in­stead of in­form­ing us about very rel­e­vant re­sults in the brave new world of the In­ter­net. Far from ar­gu­ing for good sci­ence, OP is ar­gu­ing for bad sci­ence as some­how ‘eth­i­cal’. (This is quite aside from the is­sue that in­formed con­sent makes no sense and was a knee­jerk re­ac­tion to abuses that did­n’t need the in­ven­tion of scholas­tic con­cepts like ‘in­formed con­sent’.)

The ex­per­i­ment also forced many peo­ple to con­tem­plate, for the first time, the kind of per­sua­sive power Face­book might sur­rep­ti­tiously wield around the world given its size and scale.

Also not a rea­son for it be­ing ‘bad sci­ence’.

On the other side of the firestorm were peo­ple who could­n’t see how the ex­per­i­ment was any differ­ent from your run-of-the-mill psy­chol­ogy ex­per­i­ment. Or, al­ter­na­tive­ly, how it was differ­ent from the wide­spread In­ter­net prac­tice of A/B test­ing, where you ex­per­i­ment with differ­ent vari­a­tions of a web­site to see which is most effec­tive at per­suad­ing vis­i­tors to buy, or down­load, or what­ever the site’s goal is. Some of these ex­per­i­ments feel bla­tantly ma­nip­u­la­tive, like the head­lines that are con­stantly tested and retested on vis­i­tors to see which ones will get them to click. We have a word for head­lines like this: “click­-bait.” But no­body ever hands out con­sent forms.

Oh good, so the au­thor is­n’t a com­plete id­iot.

The every-which-way qual­ity of the re­ac­tion, I think, comes in part from the fact that the study crossed aca­d­e­mic and cor­po­rate bound­aries, two ar­eas with differ­ent eth­i­cal stan­dards. It was un­clear which to hold the com­pany to.

Wait, what? What hap­pened to all the blovi­at­ing ear­lier about the lack of con­sent? Now the prob­lem is it ‘crosses bound­aries’? WTF. Al­so: still noth­ing about how this was ‘bad sci­ence’.

If you were a re­searcher at Face­book, prob­a­bly one of the things that would pro­vide you with the great­est source of ten­sion about your job would be ev­i­dence that the prod­uct you’re push­ing to half the world’s pop­u­la­tion ac­tu­ally causes them to feel “neg­a­tive or left out.” That would be a pretty epic fail for a com­pany that wants to “make the world more open and con­nect­ed.”

I be­lieve that Kramer is con­cerned with ad­dress­ing the pop­u­lar worry that Face­book makes us un­hap­py. Not just be­cause I’ve met him but be­cause, in the study, he seems adamant about re­fut­ing it. In dis­cussing his find­ings, Kramer as­serts that the study “stands in con­trast to the­o­ries that sug­gest view­ing pos­i­tive posts by friends on Face­book may some­how affect us neg­a­tive­ly, for ex­am­ple, via so­cial com­par­i­son.”

…[­long de­scrip­tion of ‘so­cial com­par­i­son’ which I’m not sure why is in there since the ex­per­i­ment in ques­tion strongly sug­gests it’s not rel­e­vant]

Yes, it would suck if that were true and would un­der­mine Face­book’s val­ue, so ku­dos to Face­book for not hid­ing its head un­der a rock and ex­per­i­ment­ing to find out the truth. Ku­dos… wait, I for­got, this is ‘lousy so­cial sci­ence’ we’re sup­posed to be boo­ing and hiss­ing about.

In fact, so­cial com­par­i­son is often posited as the so­lu­tion to what’s known as the “East­er­lin Para­dox,” which finds that, while our hap­pi­ness in­creases with our in­come, so­ci­eties that get richer do not tend to get hap­pi­er.

Ac­tu­al­ly, if you look at the graphs, they do tend to get hap­pier it’s just there’s se­vere di­min­ish­ing re­turns and the graph looks log­a­rith­mic rather than lin­ear. Mi­nor point, but it an­noys me to think that be­ing wealth­ier does­n’t help. It does.

[an­other 5 screens of me­an­der­ing some­what re­lated spec­u­la­tion]

In fact, she finds that greater pas­sive con­sump­tion over time, con­trol­ling for in­di­vid­ual pre­dis­po­si­tions, is as­so­ci­ated with lower per­ceived so­cial sup­port, lower bridg­ing so­cial cap­i­tal (feel­ing part of a broader com­mu­ni­ty), and mar­gin­ally lower pos­i­tive affect, higher de­pres­sion, and higher stress.

Gee, I won­der why that might be… No, let’s jump to the in­sin­u­a­tion that Face­book causes the higher de­pres­sion etc. Yeah, that’s plau­si­ble.

The first ques­tion about the study is whether any­thing no­table hap­pened. This was a com­mon crit­i­cism. Al­though Face­book has tremen­dous scale, it does­n’t mean the sci­en­tific com­mu­nity should care about every effect the com­pany can demon­strate. Nei­ther should the com­pany it­self work on small stuff that barely moves the nee­dle. Though Kramer said he re­moved a lot of emo­tion from users’ News Feeds (be­tween 10–90% of pos­i­tive or neg­a­tive post­s), he saw very lit­tle change in the emo­tions users sub­se­quently ex­pressed. All of the changes were 0.1% or less. That’s not 10% or 1% — that’s 0.1%….Still, the small effects raise im­por­tant ques­tions. Why were they so small?

Bz­zt. First hard sci­en­tific crit­i­cism, and they failed. The rea­son the effects were small were, as the pa­per ex­plic­itly dis­cusses (OP did read the pa­per, right? The whole thing? Not just blogs and me­dia cov­er­age?), the in­ter­ven­tion was de­signed to be small (that’s real ethics for you, not bull­shit about in­formed con­sen­t), the in­ter­ven­tion only affects one of sev­eral news sources each user is ex­posed to (de­creas­ing the in­ter­ven­tion still more), and the mea­sure of mood in sub­se­quent items is it­self a very noisy mea­sure (mea­sure­ment er­ror bi­ases the effect down­ward­s). The re­sults are ex­actly as one would ex­pect and this is an in­valid ex­per­i­ment. http://psy­ch­cen­tral.­com/blog/archives/2014/06/23/e­mo­tion­al-con­ta­gion-on-face­book-more-like-bad-re­search-meth­ods/ makes the same mis­take. The de­scrip­tion of how much was re­moved is also wrong; here’s a quote from the pa­per:

Two par­al­lel ex­per­i­ments were con­ducted for pos­i­tive and neg­a­tive emo­tion: One in which ex­po­sure to friends’ pos­i­tive emo­tional con­tent in their News Feed was re­duced, and one in which ex­po­sure to neg­a­tive emo­tional con­tent in their News Feed was re­duced. In these con­di­tions, when a per­son loaded their News Feed, posts that con­tained emo­tional con­tent of the rel­e­vant emo­tional va­lence, each emo­tional post had be­tween a 10% and 90% chance (based on their User ID) of be­ing omit­ted from their News Feed for that spe­cific view­ing. It is im­por­tant to note that this con­tent was al­ways avail­able by view­ing a friend’s con­tent di­rectly by go­ing to that friend’s “wall” or “time­line,” rather than via the News Feed. Fur­ther, the omit­ted con­tent may have ap­peared on prior or sub­se­quent views of the News Feed. Fi­nal­ly, the ex­per­i­ment did not affect any di­rect mes­sages sent from one user to an­oth­er…Both ex­per­i­ments had a con­trol con­di­tion, in which a sim­i­lar pro­por­tion of posts in their News Feed were omit­ted en­tirely at ran­dom (i.e., with­out re­spect to emo­tional con­tent). Sep­a­rate con­trol con­di­tions were nec­es­sary as 22.4% of posts con­tained neg­a­tive words, whereas 46.8% of posts con­tained pos­i­tive words. So for a per­son for whom 10% of posts con­tain­ing pos­i­tive con­tent were omit­ted, an ap­pro­pri­ate con­trol would with­hold 10% of 46.8% (i.e., 4.68%) of posts at ran­dom, com­pared with omit­ting only 2.24% of the News Feed in the neg­a­tiv­i­ty-re­duced con­trol.

Note the differ­ence be­tween writhold­ing ‘4.68% of posts’ or ‘2.24% of the News Feed’ and OP’s de­sc­crip­tion as re­mov­ing ‘be­tween 10-90% of pos­i­tive or neg­a­tive posts’.

Words were de­ter­mined to be pos­i­tive or neg­a­tive us­ing a dic­tio­nary pro­vided by the Lin­guis­tic In­quiry and Word Count soft­ware, known as LIWC, last up­dated in 2007. About 47% of posts in the ex­per­i­ment con­tained pos­i­tive words while about 22% of posts con­tained neg­a­tive words, leav­ing 31% of posts with no emo­tional words at all, as de­fined by LIWC. Every­thing but the text of the posts was dis­carded for this analy­sis, in­clud­ing pho­tos.

The third study, which looks at the con­ta­gion of neg­a­tive emo­tion in in­stant mes­sag­ing, finds that LIWC ac­tu­ally can­not tell the differ­ence be­tween groups shar­ing neg­a­tive vs. neu­tral emo­tions.

Pro­tip: don’t cite bro­ken links like with­out any other ci­ta­tion da­ta. I can’t fig­ure out what this study is sup­posed to be but given that the link is bro­ken de­spite the blog post be­ing writ­ten barely a month or two ago, I sus­pect OP is mis­rep­re­sent­ing it.

Look­ing more broad­ly, one study com­pares a num­ber of sim­i­lar tech­niques and finds that LIWC is a mid­dling per­former, at best. It is con­sis­tently too pos­i­tive in its rat­ings, even la­bel­ing the con­ver­sa­tion in so­cial me­dia around the H1N1 dis­ease out­break as pos­i­tive over­all. An­other study that looks at emo­tional con­ta­gion in in­stant mes­sag­ing finds that, even when par­tic­i­pants have been in­duced to feel sad, LIWC still thinks they’re pos­i­tive.

Good thing the ex­per­i­ment tested mul­ti­ple con­di­tions and found sim­i­lar re­sults.

Used in raw form as in the Face­book ex­per­i­ment, how­ev­er, it ap­pears to be sub­stan­tially in­fe­rior to ma­chine learn­ing.

Sure. But should we let the per­fect be the en­emy of bet­ter?

Fur­ther, we know next to noth­ing about how well LIWC per­forms in so­cial me­dia when it comes to emo­tions un­der the big head­ings of pos­i­tive and neg­a­tive emo­tion. If it de­tects some neg­a­tive emo­tions, like anger, bet­ter than oth­ers like sad­ness this too may bias what we learn from the Face­book ex­per­i­ment.

Yes, maybe the LIWC works well in these cir­cum­stances, maybe it does­n’t. Who knows? One could write this of any in­stru­ment or analy­sis be­ing ap­plied in a new sit­u­a­tion. I hear the philoso­phers call this ‘the prob­lem of in­duc­tion’; maybe they have a so­lu­tion.

In a word: no. Face­book posts are likely to be a highly bi­ased rep­re­sen­ta­tion of how Face­book makes peo­ple feel be­cause Face­book posts are a highly bi­ased rep­re­sen­ta­tion of how we feel in gen­er­al…Look­ing at so­cial sit­u­a­tions in gen­er­al, we know for ex­am­ple that there are pow­er­ful pres­sures to con­form to the at­ti­tudes, feel­ings and be­liefs of oth­ers. And so if we look at Face­book from this stand­point, it’s easy to see how the effects re­ported in the Face­book ex­per­i­ment might be due to con­for­mity rather than gen­uine emo­tional con­ta­gion. Con­sciously or un­con­scious­ly, we may sense a cer­tain emo­tional tone to our News Feeds and there­fore adapt what we post, ever so slight­ly, so that we don’t stick out too much.

Oh for heav­en’s sake. So if they had found a ‘so­cial com­par­i­son’ effect, then that’s proof of so­cial com­par­ison; and if they did­n’t, well, that’s OK be­cause ‘Face­book posts are likely to be highly bi­ased’ and it’s all due to con­for­mi­ty! Way to ex­plain every pos­si­ble out­come there, OP. Just be­ing bi­ased does­n’t mean you can’t ran­dom­ize in­ter­ven­tions and learn.

Ex­pe­ri­ence sam­pling in­volves ran­domly in­ter­rupt­ing peo­ple as they go about their lives to ask how they’re feel­ing in the mo­ment. It’s pri­vate, so it’s less sub­ject to so­cial bi­as­es. It does not rely on rec­ol­lec­tions, which can be off. And it so­lic­its ex­pe­ri­ences evenly across time, rather than re­ly­ing on only the mo­ments or feel­ings peo­ple think to share.

But wait! I thought we ‘con­sciously or un­con­sciously’ self­-cen­sored, and “If we cen­sor fully a third of what we want to ex­press at the last min­ute, how much are we cen­sor­ing be­fore we even reach for the key­board? [to re­port an ex­pe­ri­ence sam­ple]”? So the ques­tion is which source of bias do we prefer: peo­ple know­ing they’re in an ex­per­i­ment and re­spond­ing to per­ceived ex­per­i­menter de­mands, or peo­ple not know­ing and go­ing about life as nor­mal? I know which I prefer, es­pe­cially since the re­search has ac­tu­ally been done…

Oh my god, it just keeps go­ing on and on does­n’t it? Dude re­ally likes ex­pe­ri­ence sam­pling, but I’m think­ing he needs to write more con­cise­ly. OK, I’m go­ing to wrap up here be­cause I’d like to read some­thing else to­day. Let’s sum­ma­rize his com­plaints and my coun­ter-ob­jec­tions:

  1. no con­sent: ir­rel­e­vant to whether this was good sci­ence or ‘lousy so­cial sci­ence’
  2. crossed bound­aries be­tween cor­po­ra­tions and acad­e­mia: like­wise ir­rel­e­vant; al­so, wel­come to the mod­ern In­ter­net
  3. small effect size: mis­un­der­stood the sta­tis­ti­cal de­sign of study and why it was de­signed & ex­pected to have small effects
  4. used LIWC with high er­ror rate for mea­sur­ing emo­tion­al­ity of posts: if ran­dom er­ror, bi­ases effect to zero and so is not an ar­gu­ment against sta­tis­ti­cal­ly-sig­nifi­cant find­ings
  5. and LIWC may have sys­tem­atic er­ror to­wards pos­i­tiv­i­ty: ap­par­ently not an is­sue as neg­a­tive & pos­i­tive con­di­tions agreed, and the stud­ies he cites in sup­port of this claim are mixed or un­avail­able
  6. al­so, other meth­ods are bet­ter than LIWC: sure. But that does­n’t mean the re­sults are wrong
  7. maybe LIWC has large un­known bi­ases ap­plied to short so­cial me­dia texts: pos­si­ble, but it’s not like you have any real ev­i­dence for that claim
  8. Face­book news posts are a bi­ased source of mood any­way: may­be, but they still changed after ran­dom ma­nip­u­la­tion
  9. ex­pe­ri­ence sam­pling is sooooooo awe­some: and also brings up its own is­sues of bi­ases and I don’t see how this would ren­der the Face­book study use­less any­way even if we granted it (like com­plaints #1, 2, 6, 7)

Now, I don’t want to over­state my crit­i­cisms here. The au­thor has failed to show the Face­book study is worth­less (I’d wa­ger much more money on the Face­book re­sults repli­cat­ing than 95% of the so­cial sci­ence re­search I’ve read) and it would be out­right harm­ful for Face­book to aim for large effect sizes in fu­ture stud­ies, but he does at least raise some good points about im­prov­ing the fol­lowup work: Face­book cer­tainly should be pro­vid­ing some of its cut­ting-edge deep net­works for sen­ti­ment analy­sis for re­search like this after val­i­dat­ing them if it wants to get more re­li­able re­sults, and it would be worth­while to run ex­pe­ri­ence sam­pling ap­proaches to see what hap­pens there, in ad­di­tion to eas­ier web­site tests (in ad­di­tion, not in­stead of).

Correlation=Causation in Cancer Research

Failed at­tempt at es­ti­mat­ing P(­cau­sa­tion|­cor­re­la­tion):

How often does cor­re­la­tion=­causal­i­ty? While I’m at it, here’s an ex­am­ple of how not to do it… “A weight of ev­i­dence ap­proach to causal in­fer­ence”, Swaen & van Amelsvoort 2009:

Ob­jec­tive: The Brad­ford Hill cri­te­ria are the best avail­able cri­te­ria for causal in­fer­ence. How­ev­er, there is no in­for­ma­tion on how the cri­te­ria should be weighed and they can­not be com­bined into one prob­a­bil­ity es­ti­mate for causal­i­ty. Our ob­jec­tive is to pro­vide an em­pir­i­cal ba­sis for weigh­ing the Brad­ford Hill cri­te­ria and to de­velop a trans­par­ent method to es­ti­mate the prob­a­bil­ity for causal­i­ty. Study De­sign and Set­ting: All 159 agents clas­si­fied by In­ter­na­tional Agency for Re­search of Can­cer as cat­e­gory 1 or 2A car­cino­gens were eval­u­ated by ap­ply­ing the nine Brad­ford Hill cri­te­ria. Dis­crim­i­nant analy­sis was used to es­ti­mate the weights for each of the nine Brad­ford Hill cri­te­ria.

Re­sults: The dis­crim­i­nant analy­sis yielded weights for the nine causal­ity cri­te­ria. These weights were used to com­bine the nine cri­te­ria into one over­all as­sess­ment of the prob­a­bil­ity that an as­so­ci­a­tion is causal. The cri­te­ria strength, con­sis­tency of the as­so­ci­a­tion and ex­per­i­men­tal ev­i­dence were the three cri­te­ria with the largest im­pact. The model cor­rectly pre­dicted 130 of the 159 (81.8%) agents. Con­clu­sion: The pro­posed ap­proach en­ables us­ing the Brad­ford Hill cri­te­ria in a quan­ti­ta­tive man­ner re­sult­ing in a prob­a­bil­ity es­ti­mate of the prob­a­bil­ity that an as­so­ci­a­tion is causal.

Sounds rea­son­able, right? Take this IARC data­base, pre­sum­ably of car­cino­gens known to be such by ran­dom­ized ex­per­i­ment, and see how well the cor­re­late stud­ies pre­dict after train­ing with - you might not want to build a reg­u­lar lin­ear model be­cause those tend to be weak and not too great at pre­dic­tion rather than in­fer­ence. It’s not clear what they did to pre­vent over­fit­ting, but read­ing through, some­thing else strikes me:

The IARC has eval­u­ated the car­cino­genic­ity of a sub­stan­tial num­ber of chem­i­cals, mix­tures, and ex­po­sure cir­cum­stances. These eval­u­a­tions have been car­ried out by ex­pert in­ter­dis­ci­pli­nary pan­els of sci­en­tists and have re­sulted in clas­si­fi­ca­tion of these agents or ex­po­sure con­di­tions into hu­man car­cino­gens (cat­e­gory 1) prob­a­ble hu­man car­cino­gens (cat­e­gory 2A), pos­si­ble hu­man car­cino­gens (cat­e­gory 2B), not clas­si­fi­able agents (cat­e­gory 3), and chem­i­cals that are prob­a­bly not car­cino­genic to hu­mans (cat­e­gory 4) (IARC, 2006). Al­though the IARC Work­ing Groups do not for­mally use the Brad­ford Hill cri­te­ria to draw causal in­fer­ences many of the cri­te­ria are men­tioned in the in­di­vid­ual re­ports. For in­stance, the pre­am­ble specifi­cally men­tions that the pres­ence of a dose-ere­sponse is an im­por­tant con­sid­er­a­tion for causal in­fer­ence. In this analy­sis, the IARC data­base serves as the ref­er­ence data­base al­though we rec­og­nize that it may con­tain some dis­putable clas­si­fi­ca­tions. How­ev­er, to our knowl­edge there is no other data­base con­tain­ing causal in­fer­ences that were com­piled by such a sys­tem­atic process in­volv­ing lead­ing ex­perts in the ar­eas of tox­i­col­ogy and epi­demi­ol­o­gy.


These eval­u­a­tions have been car­ried out by ex­pert in­ter­dis­ci­pli­nary pan­els of sci­en­tists and have re­sulted in clas­si­fi­ca­tion of these agents or ex­po­sure con­di­tions into hu­man car­cino­gens

eval­u­a­tions have been car­ried out by ex­pert in­ter­dis­ci­pli­nary pan­els

IARC Work­ing Groups do not for­mally use the Brad­ford Hill cri­te­ria to draw causal in­fer­ences many of the cri­te­ria are men­tioned

Wait. So their data­base with causal­i­ty/non-causal­ity clas­si­fi­ca­tions is… based on… opin­ion. They got some ex­perts to­gether and asked them.

And the ex­perts use the same cri­te­rion which they are us­ing to pre­dict the clas­si­fi­ca­tions.

What. So it’s cir­cu­lar. Worse than cir­cu­lar, ran­dom­iza­tion and causal­ity never even en­ter the pic­ture. They’re not do­ing ‘causal in­fer­ence’, nor are they giv­ing an ‘over­all as­sess­ment of the prob­a­bil­ity that an as­so­ci­a­tion is causal’. And their con­clu­sion (“The pro­posed ap­proach en­ables us­ing the Brad­ford Hill cri­te­ria in a quan­ti­ta­tive man­ner re­sult­ing in a prob­a­bil­ity es­ti­mate of the prob­a­bil­ity that an as­so­ci­a­tion is causal.”) cer­tainly is not cor­rect - at best, they are pre­dict­ing ex­pert opin­ion (and maybe not even that well), they have no idea how well they’re pre­dict­ing causal­i­ty.

But wait, maybe the au­thors aren’t cretins or con artists, and have a good jus­ti­fi­ca­tion for this ap­proach, so let’s check out the Dis­cus­sion sec­tion where they dis­cuss RCTs:

Us­ing the re­sults from ran­dom­ized con­trolled clin­i­cal tri­als as the gold stan­dard in­stead of the IARC data­base could have been an al­ter­na­tive ap­proach for our analy­sis. How­ev­er, this al­ter­na­tive ap­proach has sev­eral dis­ad­van­tages. First, only a se­lec­tion of risk fac­tors re­ported in the lit­er­a­ture have been in­ves­ti­gated by means of tri­als, cer­tainly not the oc­cu­pa­tional and en­vi­ron­men­tal chem­i­cals. Sec­ond, there are in­stances in which ran­dom­ized tri­als have yielded con­tra­dic­tory re­sults, for in­stance, in case of sev­eral vi­t­a­min sup­ple­ments and can­cer out­comes.

You see, ran­dom­ized tri­als are bad be­cause some­times we haven’t done them but we still re­ally re­ally want to make causal in­fer­ences so we’ll just pre­tend we can do that; and some­times they dis­agree with each other and con­tra­dict what we epi­demi­ol­o­gists have al­ready proven, while the ex­perts & IARC data­base never dis­agrees with them­selves! Thank good­ness we have offi­cial IARC doc­trine to guide us in our con­fu­sion…

This must be one of the most brazen “it’s not a bug, it’s a fea­ture!” moves I’ve ever seen. Mon cha­peau, Ger­ard, Lu­dovic; mon cha­peau.

In­ci­den­tal­ly, Google Scholar says this pa­per has been cited at least 40 times; look­ing at some, it seem the ci­ta­tions are gen­er­ally all pos­i­tive. These are the sort of peo­ple de­cid­ing what’s a healthy diet and what sub­stances are dan­ger­ous and what should be per­mit­ted or banned.

En­joy your din­ners.

Aerobic vs Weightlifting

Aer­o­bic vs weightlift­ing ex­er­cise claims: mul­ti­ple prob­lems but pri­mar­ily p-hack­ing, differ­ence-in-s­ta­tis­ti­cal-sig­nifi­cance-is-not-a-sig­nifi­can­t-d­iffer­ence, and con­trol­ling for in­ter­me­di­ate vari­able.

…For ex­am­ple, weightlift­ing en­hances brain func­tion, re­verses sar­cope­nia, and low­ers the death rate in can­cer sur­vivors. Take this last item, low­er­ing death rate in can­cer sur­vivors: gar­den-va­ri­ety aer­o­bic ex­er­cise had no effect on sur­vival, while re­sis­tance train­ing low­ered death rates by one third… –http://rogue­healthand­fit­ness.­com/­case-for-weightlift­ing-as-an­ti-ag­ing/

[pa­per in ques­tion: “The Effect of Re­sis­tance Ex­er­cise on Al­l-Cause Mor­tal­ity in Can­cer Sur­vivors”, Hardee et al 2014; full­text: http­s://www.­drop­box.­com/s/vku­vr­pyfft­m4on­m/2014-hard­ee.pdf / http://lib­­p?­doi=10.1016%2Fj.­may­ocp.2014.03.018 ]

This is a bad study, but sadly the prob­lems are com­mon to the field. Claim­ing that this study shows ‘weight lift­ing low­ered death rates and aer­o­bic ex­er­cise did not change sur­vival’ is mak­ing at least 4 er­rors:

  1. cor­re­la­tion!=­cau­sa­tion; this is sim­ply your usual cor­re­la­tion study (you know, of the sort which is al­ways wrong in diet stud­ies?), where you look at some health records and crank out some p-val­ues. There should be no ex­pec­ta­tion that this will prove to be causally valid; in par­tic­u­lar, re­verse con­found­ing is pretty ob­vi­ous here and should re­mind peo­ple of the de­bate about weight and mor­tal­i­ty. (Ah, but you say that the differ­ence they found be­tween aer­o­bic and re­sis­tance shows that it’s not con­found­ing be­cause health bias should op­er­ate equal­ly? Well, read on…)
  2. pow­er: with only 121 to­tal deaths (~4% of the sam­ple), this is in­ad­e­quate to de­tect any differ­ences but com­i­cally large cor­re­lates of health, as the es­ti­mate of pre­dict­ing a third less mor­tal­ity in­di­cates
  3. p-hack­ing/­mul­ti­plic­i­ty, type S er­rors, ex­ag­ger­a­tion fac­tor: take a look at that 95% con­fi­dence in­ter­val for re­sis­tance ex­er­cise (which is the only re­sult they re­port in the ab­strac­t), which is an HR of 0.45-0.99. In other words, if the cor­re­late were even the tini­est bit big­ger, it would no longer have the mag­i­cal ‘sta­tis­ti­cal sig­nifi­cance at p<0.05’. There’s at least 16 co­vari­ates, 2 strat­i­fi­ca­tions, and 3 full mod­els tested (that they re­port). By the sta­tis­ti­cal sig­nifi­cance fil­ter, a HR of 0.67 will be a se­ri­ous ex­ag­ger­a­tion (be­cause only ex­ag­ger­ated es­ti­mates would - just barely - reach p=0.05 on this small dataset with only 121 death­s). We can rule out a HR of 0.67 as cred­i­ble sim­ply on a pri­ori grounds: no ex­er­cise RCT has ever shown re­duc­tions in al­l-cause mor­tal­ity re­motely like that, and that’s the sort of re­duc­tion you just don’t see out­side of mir­a­cle drugs for lethal dis­eases (for ex­am­ple, as­pirin and vi­t­a­min D have RRs of >0.95).
  4. “The Differ­ence Be­tween ‘Sig­nifi­cant’ and ‘Not Sig­nifi­cant’ is Not It­self Sta­tis­ti­cally Sig­nifi­cant” (http://www.­s­tat.­co­lum­bi­a.e­du/~gel­man/re­search/pub­lished/sig­nif4.pdf): the differ­ence be­tween aer­o­bic ex­er­cise and re­sis­tance ex­er­cise is not sta­tis­ti­cal­ly-sig­nifi­cant in this study. The HR in model 1 for aer­o­bic ex­er­cise is (0.63-1.32), and for aer­o­bic ex­er­cise, (0.46-0.99). That is, the con­fi­dence in­ter­vals over­lap. (Specifi­cal­ly, com­par­ing the pro­por­tion of aer­o­bic ex­er­cis­ers who died with the re­sis­tance ex­er­cis­ers who died, I get prop.test(c(39,75), c(1251,1746)) = p=0.12; to com­pute a sur­vival curve I would need more data, I think.) The study it­self does not any­where seem to di­rectly com­pare aer­o­bic with re­sis­tance but al­ways works in a strat­i­fied set­ting; I don’t know if they don’t re­al­ize this point about the null hy­pothe­ses they’re test­ing, or if they did do the lo­grank test and it came out non-sig­nifi­cant and they qui­etly dropped it from the pa­per.
  5. the fal­lacy of con­trol­ling for in­ter­me­di­ate vari­ables: in the mod­els they fit, they in­clude as co­vari­ates “body mass in­dex, cur­rent smok­ing (yes or no), heavy drink­ing (yes or no), hy­per­ten­sion (p­re­sent or not), di­a­betes (p­re­sent or not), hy­per­c­ho­les­terolemia (yes or no), and parental his­tory of can­cer (yes or no).” This makes no sense. Both re­sis­tance ex­er­cise and aer­o­bic ex­er­cise will them­selves in­flu­ence BMI, smok­ing sta­tus, hy­per­ten­sion, di­a­betes, and hy­per­c­ho­les­terolemia. What does it mean to es­ti­mate the cor­re­la­tion of ex­er­cise with health which ex­cludes all im­pact it has on your health through BMI, blood pres­sure, etc? You might as well say, ‘con­trol­ling for mus­cle per­cent­age and body fat, we find weight lift­ing has no es­ti­mated ben­e­fits’, or ‘con­trol­ling for ed­u­ca­tion, we find no ben­e­fits to IQ’ or ‘con­trol­ling for lo­cal in­fec­tion rates, we find no mor­tal­ity ben­e­fits to pub­lic vac­ci­na­tion’. This makes the re­sults par­tic­u­larly non­sen­si­cal for the aer­o­bic es­ti­mates if you want to in­ter­pret them as di­rect causal es­ti­mates - at most, the HR es­ti­mates here are an es­ti­mate of weird in­di­rect effects (‘the re­main­ing effect of ex­er­cise after re­mov­ing all effects me­di­ated by the co­vari­ates’). Un­for­tu­nate­ly, struc­tural equa­tion mod­els and Bayesian net­works are a lot harder to use and jus­tify than just dump­ing a list of co­vari­ates into your sur­vival analy­sis pack­age, so ex­pect to see a lot more con­trol­ling for in­ter­me­di­ate vari­ables in the fu­ture.

The first three are suffi­cient to show you should not draw any strong con­clu­sions, the lat­ter two are nasty and could be prob­lem­atic but can be avoid­ed. These con­cerns are roughly ranked by im­por­tance: #1 puts a low ceil­ing on how much con­fi­dence in causal­ity we could ever de­rive, a ceil­ing I in­for­mally put at ~33%; #2 is im­por­tant be­cause it shows that very lit­tle of the sam­pling er­ror has been over­com­ing; #3 means we know the es­ti­mate is ex­ag­ger­at­ed; #4 is not im­por­tant, be­cause while that mis­in­ter­pre­ta­tion is tempt­ing and the au­thors do noth­ing to stop the reader from mak­ing it, there’s still enough data in the pa­per that you can cor­rect for it eas­ily by do­ing your own pro­por­tion test; #5 could be an im­por­tant crit­i­cism if any­one was re­ly­ing heav­ily on the es­ti­mate con­t­a­m­i­nated by the co­vari­ates but in this case the raw pro­por­tions of deaths is what yields the head­li­nes, so I bring this up to ex­plain why we should ig­nore model 3’s es­ti­mate of aer­o­bic ex­er­cise’s RR=1. This sort of prob­lem is why one should put more weight on meta-analy­ses of RCTs - for ex­am­ple, “Pro­gres­sive re­sis­tance strength train­ing for im­prov­ing phys­i­cal func­tion in older adults” http://on­lineli­brary.wi­ley.­com/en­hanced/­doi/10.1002/14651858.CD002759.pub2

So to sum­ma­rize: this study col­lected the wrong kind of data for com­par­ing mor­tal­ity re­duc­tion from aer­o­bics vs weightlift­ing, in­suffi­cient mor­tal­ity data to re­sult in strong ev­i­dence, ex­ag­ger­ates the re­sult through p-hack­ing, did not ac­tu­ally com­pare aer­o­bics and weightlift­ing head to head, and the analy­sis’s im­plicit as­sump­tions would ig­nore much of any causal effects of aer­o­bic­s/weightlift­ing!

Moxibustion Mouse Study­­_re­leas­es/2013-12/n­r­r-pam120513.php … “Pre-mox­i­bus­tion and mox­i­bus­tion pre­vent Alzheimer’s dis­ease” …

I don’t be­lieve this for a sec­ond. But ac­tu­al­ly, this would be a nice fol­lowup to my pre­vi­ous email about the prob­lems in an­i­mal re­search: this pa­per ex­hibits all the prob­lems men­tioned, and more. Let’s do a lit­tle cri­tique here.

  1. This pa­per is Chi­nese re­search per­formed in China by an al­l-Chi­nese team. The cur­rent state of Chi­nese re­search is bad. It’s re­ally bad. Some read­ing on the top­ic:
  • http://www.wired.­­na-a­ca­d­e­mic-s­can­dal
  • http://newhu­man­­ne­se-science
  • http://news.b­bc.­
  • http://www.g­w­­doc­s/dnb/2010-zhang.pdf
  • http://www.ny­times.­com/2010/10/07/­world/asi­a/07fraud.html
  • http://news.b­bc.­­a-paci­fic/4755861.stm
  • http://news.b­bc.­­a-paci­fic/8442147.stm
  • http://www.­na­ture.­com/news/2010/100112/­ful­l/463142a.html
  • http­s://www.­sci­ence­­gener­ic/id/330930/ti­tle/­Tra­di­tion­al_Chi­ne­se_med­i­cine_Big_ques­tions
  • http://www.­­ti­cle/in­fo%3Adoi%2F10.1371%2Fjour­nal.pone.0020185
  • http://www.n­­pla­gia­ris­m-plague-hin­der­s-chi­nas-sci­en­tific-am­bi­tion

I will note that there have been sta­tis­ti­cal anom­alies in some of the Chi­nese pa­pers on dual n-back train­ing I have used in my meta-analy­sis, so I have some per­sonal ex­pe­ri­ence in the top­ic. 2. ‘tra­di­tional med­i­cine’ re­search is re­ally bad no mat­ter where you go. They men­tion acupunc­ture as jus­ti­fi­ca­tion? That’s just fan­tas­tic. http­s://en.wikipedi­­i/Acupunc­ture punc­tures some of the hy­per­bolic claims, the PLOS link above deals with the poor-qual­ity of the Chi­nese re­views & meta-analy­ses in gen­er­al, and Cochrane is not too kind to acupunc­ture: http://www.the­cochraneli­brary.­com/de­tail­s/­col­lec­tion/691705/Acupunc­ture-an­cien­t-tra­di­tion-meet­s-mod­ern-science.html And in many of the re­views/meta-analy­ses there are stark ge­o­graphic differ­ences where the East Asian stud­ies turn in tons of pos­i­tive re­sults while the West­ern stud­ies some­how… don’t. 3. The lead au­thor is not an or­di­nary neu­ro­sci­en­tist or doc­tor, but works at the “Col­lege of Acupunc­ture and Mox­i­bus­tion”. Is he re­ally go­ing to pub­lish a study con­clud­ing “mox­i­bus­tion does not affect Alzheimer’s”⸮ Re­ally⸮ 4. Does this claim even make sense? Mox­i­bus­tion, re­al­ly⸮ For those not fa­mil­iar, http­s://en.wikipedi­­i/­Mox­i­bus­tion en­tails

> Suppliers usually age the mugwort and grind it up to a fluff;

prac­ti­tion­ers burn the fluff or process it fur­ther into a cig­a­r-shaped stick. They can use it in­di­rect­ly, with acupunc­ture needles, or burn it on the pa­tien­t’s skin.

How on earth is this supposed to help AD? How does burning a plant

on your skin affect plaques in your brain? Or if they use acupunc­ture needles, how plau­si­ble is it that a few mil­ligrams at most of mug­wort in­serted into the skin would do any­thing? While Wikipedia is not Cochrane or any­thing, it is trou­bling that this en­try lists no use­ful ap­pli­ca­tion of mox­i­bus­tion. And then it goes and links to “Does mox­i­bus­tion work? An overview of sys­tem­atic re­views” http­s://­med­cen­tral.­com/1756-0500/3/284 which finds that

> Ten SRs met our inclusion criteria, which related to the

fol­low­ing con­di­tions: can­cer, ul­cer­a­tive col­i­tis, stroke re­ha­bil­i­ta­tion, con­sti­pa­tion, hy­per­ten­sion, pain con­di­tions and breech pre­sen­ta­tion. Their con­clu­sions were con­tra­dic­tory in sev­eral in­stances. Rel­a­tively clear ev­i­dence emerged to sug­gest that mox­i­bus­tion is effec­tive for breech pre­sen­ta­tion.

That review also mentions, incidentally, that

> Many of the primary moxibustion trials originate from China

(data not shown); Vick­ers et al. demon­strated that vir­tu­ally 100% of Chi­nese acupunc­ture tri­als are pos­i­tive [ http://www.d­c­­er­s_1998_­Con­trolled-Clin­i­cal-Tri­al­s.pdf], which seems to be equally ap­plied to mox­i­bus­tion, an acupunc­ture-like in­ter­ven­tion. This casts con­sid­er­able doubt on the re­li­a­bil­ity of these stud­ies.

Al­right, so let’s take stock here. With­out ever look­ing be­yond the ti­tle and au­thor­ship, we have found that this is a pa­per from a coun­try with in­fa­mously bad re­search, in a field with in­fa­mously bad re­search qual­i­ty, led by a re­searcher with con­sid­er­able in­her­ent con­flict of in­ter­est, us­ing a tech­nique/­sub­stance which has al­ready been linked with bi­ased re­search, on a hy­poth­e­sis that is grossly im­plau­si­ble. Based on all these base rates, we can say that there is ba­si­cally zero chance this re­sult will ever repli­cate, much less to other mice strains or even to hu­mans.

It seems un­fair to re­ject the pa­per out of hand, though, so let’s look at the ac­tual pa­per a lit­tle.

Forty healthy rats were ran­domly di­vided into four groups: con­trol group, model group, mox­i­bus­tion group and pre-mox­i­bus­tion group. The lat­ter three groups were treated with in­trac­ere­bral in­jec­tion of Aβ1–42 to es­tab­lish an AD-like pathol­o­gy. The mox­i­bus­tion group re­ceived sus­pended mox­i­bus­tion on Bai­hui and Shen­shu acu­points for 14 days after Aβ1–42 in­jec­tion. The pre-mox­i­bus­tion group was treated with mox­i­bus­tion for eight courses (each course last­ing for 6 days) prior to the ex­po­sure and 14 days after Aβ1–42 ex­po­sure. The fi­nal analy­sis in­cor­po­rated all rats.

From the ma­te­ri­als and meth­ods:

Male Wis­tar rats (12 months old; 500 ± 20 g), of spe­cific pathogen free grade, were ob­tained from the Ex­per­i­men­tal An­i­mal Cen­ter of Huazhong Uni­ver­sity of Sci­ence and Tech­nol­ogy (Wuhan, Chi­na), with li­cense No. SCXK (E) 2008-0005.

After the hair around the acu­points was shaved, an ig­nited mox­a-stick (di­am­e­ter 6 mm; Nanyang Shen­nong Aaicao Ap­pli­ance Com­pa­ny, Nanyang, Henan Chi­na; a round long stick made of moxa floss, also called moxa rol­l), was sus­pended per­pen­dic­u­larly 2 cm above the acu­points. Bai­hui (lo­cated in the mid­dle of the pari­etal bone[50]) and Shen­shu (lo­cated un­der the sec­ond lum­bar on both sides [50]) acu­points were si­mul­ta­ne­ously given sus­pended mox­i­bus­tion. Each treat­ment con­sisted of a 15-minute mox­i­bus­tion, keep­ing the spot warm and red but not burnt. Gen­er­al­ly, the skin tem­per­a­ture was kept at 43 ± 1° dur­ing the mox­i­bus­tion pro­ce­dure.

Right away we can spot 3 of the usual an­i­mal re­search method­olog­i­cal prob­lems:

  1. the sam­ple size is too small - at n=10 rats in each group, you are not go­ing to de­tect any­thing with­out large effect sizes. It is im­plau­si­ble that sus­pended moxa has any effects, and it is es­pe­cially im­plau­si­ble that the effect sizes would be large.
  2. there is no men­tion of blind­ing. The tech­ni­cians or re­search as­sis­tants or whomever clearly know which mice they are deal­ing with.
  3. there is men­tion of ran­dom­iza­tion, but it’s not spec­i­fied how the ran­dom­iza­tion was done, which means it prob­a­bly was done by the ‘stick your hand in and grab’ method, and prob­a­bly does not bal­ance by lit­ter or other vari­ables. This mas­sively wors­ens the power prob­lem, see “De­sign, pow­er, and in­ter­pre­ta­tion of stud­ies in the stan­dard murine model of ALS”

(I’m a lit­tle cu­ri­ous about whether they re­ally started with 10 mice in each group: the mice spent at least 60 days in the ex­per­i­ment and I won­der how many, out of 40, you would ex­pect to die in that time pe­ri­od, es­pe­cially after you’ve done your level best to give 3⁄4s of them Alzheimer’s dis­ease dur­ing that time.)

I also note that the mox­i­bus­tion sit­u­a­tion is even worse than I thought: they did not use acupunc­ture nee­dles to get some mug­wort into the mice, they did not put any mox­a/­mug­wort in phys­i­cal con­tact, but in­stead the burn­ing was 2cm away from the mice! The mech­a­nism was bad, but it just got worse.

There’s no men­tion of the data be­ing pro­vided any­where at all, ei­ther their web­site or the pub­lish­er; there’s some ev­i­dence that pro­vid­ing ac­cess to a pa­per’s data cor­re­lates with high­er-qual­ity re­search, so I men­tion this ab­sence. It also makes it harder for me to do any­thing more com­plex like a post hoc power analy­sis.

Mov­ing on, they list as de­pen­dent vari­ables:

  • Mor­ris wa­ter maze nav­i­ga­tion test
  • Mor­ris wa­ter maze spa­tial probe test
  • apop­to­sis rate of hip­pocam­pal neu­rons

Let’s look at the stats a bit.

  1. Sig­nifi­cance: The pa­per lists no less* than 14 p-val­ues (4 < 0.05, the rest < 0.01), and for all of them uses an al­pha of 0.05. The small­est given con­straint is p<0.01. A Bon­fer­roni cor­rec­tion on this would be 0.05/14 (s­ince they must have done at least 14 tests to re­port 14 p-val­ues), which means an al­pha of 0.003571. But 0.01 > 0.05/14, so the 4 0.05 p-val­ues dis­ap­pear un­der mul­ti­ple cor­rec­tion and prob­a­bly most of the 0.01s would too.

    • this is a lower bound since the Mor­ris di­a­grams re­port some­thing like 20 p-val­ues them­selves, but I did­n’t feel like care­fully pars­ing the text to fig­ure out ex­actly how many p-val­ues are be­ing re­ported
  2. Effect sizes: no ta­bles are pro­vid­ed, but fig­ure 2 (the sec­ond Mor­ris maze test) is il­lus­tra­tive. The con­trol mice have no prob­lem re­mem­ber where the plat­form used to be, and so spend al­most half a minute (~24s) in the right area search­ing for it. Makes sense, they don’t have AD. The AD mice have ter­ri­ble mem­o­ry, and so only spend ~6s in the right area and most of their time in the wrong place. Also makes sense. Now, what about the AD mice who had some moxa burnt 2cm away from their skin? They spend 14-16s or more than twice and al­most 3 times as much as the non-moxa AD mice! And the claimed stan­dard er­ror on all 4 group of mice’s time is tiny, maybe 1s eye­balling the graph. So they are claim­ing, in this point, to have an effect size on mem­ory of some­thing like d = (15-6)/1 = 9. In­sane! From burn­ing some mug­wort 2cm away from the mice’s skin‽

  3. Pow­er: ac­tu­al­ly, that re­sult shows an ex­am­ple of what I mean by the re­sult be­ing ab­surd. Let’s cal­cu­late what that effect size im­plies for the power of their t-test com­par­ing the model AD mice with the moxa AD mice. So let’s say the 2 moxa groups equate to n=20 15(1.5), and the AD con­trols were then n=10 5(0.5). The pooled stan­dard de­vi­a­tion of the non-moxa and moxa mice is sqrt(((20-1)(1.5^2) + (10-1)(0.5^2)) / (20 + 10 - 2)) = 1.267, so the effect size was ac­tu­ally d=(15-5)/1.267 = 7.89. With 20 mice in 1 group and 10 mice in the oth­er, an al­pha of 0.05, then our power turns out to be…

      pwr.t2n.test(n1 = 20, n2 = 10, d = 7.89, sig.level = 0.05)
           t test power calculation
                   n1 = 20
                   n2 = 10
                    d = 7.89
            sig.level = 0.05
                power = 1

A power of 100%. Ab­surd. Have you ever seen an­i­mal re­search (or Alzheimer’s re­search…) with such high pow­er? Real effects, real treat­ments, in large clin­i­cal tri­als or in meta-analy­ses, are hardly ever that high.

So. I don’t know how they got the re­sults they got. Did they ad­min­is­ter dozens of tests un­til they got the re­sults they want­ed? Did they sim­ply make up the data like so many Chi­nese aca­d­e­mics have? Or did they start with 30 mice in each group and cher­ryp­ick the best/­worst 10? Did they abuse the model AD mice to make the AD+­moxa mice look good?

In con­clu­sion: this pa­per is com­plete bull­shit, will not repli­cate.

“Someone Should Do Something”: Wishlist of Miscellaneous Project Ideas


  • Erowid: data-mine the trip re­ports to cre­ate clus­ters or a state-space of drug effect­s/re­sults and us­ing the clus­ters & com­mon terms, cre­ate a gen­eral in­ven­tory of de­scrip­tions; add this to the trip re­port form so Erowid users can pro­vide some more struc­tured in­for­ma­tion about their ex­pe­ri­ence.

  • Dark net mar­kets: use a lon­gi­tu­di­nal crawl of DNM sell­ers to es­ti­mate sur­vival curves, out­stand­ing es­crow + or­ders, and listed prod­uct prices / type / lan­guage to try to pre­dict exit scams.

  • what are the pos­si­bil­i­ties in gen­eral for pre­dict­ing hu­man traits from faces? If eg , then per­haps faces can pre­dict many things. In­stead of bick­er­ing about how much you can pre­dict ho­mo­sex­u­al­ity etc from faces and whether a spe­cific dataset/­analy­sis works, ap­ply vari­ance com­po­nent analy­sis us­ing dis­tances in a fa­cial recog­ni­tion CNN’s face-em­bed­ding as a sim­i­lar­ity met­ric (which is highly ro­bust to all sorts of re­al-world trans­for­ma­tions like an­gle or light­ing or hair style); then cal­cu­late ‘face her­i­tabil­ity’ on many traits (the OKCupid scrape dataset should sup­port this). If the av­er­age is near ze­ro, that im­plies that faces don’t carry any im­por­tant sig­nals and that, aside from oc­ca­sional ex­cep­tions, noth­ing be­yond the ex­pected things like ba­sic de­mo­graphic data can be pre­dicted from faces. On the other hand, if ‘face her­i­tabil­ity’ of many traits turns out to be sub­stan­tially above zero (per­haps 20%), this means that faces carry many sig­nals and these sig­nals may be com­mer­cially or legally ex­ploitable and ear­lier find­ings about face pre­dic­tion may have been right after all. We may not like the an­swers, but it’s bet­ter to know the truth than go along blithely as­sur­ing every­one that it’s im­pos­si­ble to do such things and things like ho­mo­sex­u­al­i­ty/crim­i­nal­ity are merely junk sta­tis­tics.

    This has been done us­ing a stan­dard face-recog­ni­tion NN’s em­bed­ding for Big Five per­son­al­ity fac­tors: , Kachur et al 2020.

  • Quan­ti­fied Self ex­per­i­ments for ac­ne, es­pe­cially for teenagers: one could work with some to run some pre­lim­i­nary ex­per­i­ments, per­haps de­sign some canned ex­per­i­ments+­analy­sis code?

  • Find­ing the best movie adap­ta­tions of books: movie adap­ta­tions of books typ­i­cally dis­ap­point read­ers, but a movie which is bet­ter than the book is quite in­ter­est­ing and un­usu­al. We can’t eas­ily find that by sim­ply look­ing at av­er­age rat­ings on IMDb & GoodReads, be­cause we want to know pairs of movies/­books where the movie has a higher (s­tan­dard­ized) rat­ing.

    Can we cre­ate a list au­to­mat­i­cally by scrap­ing Wikipedi­a/Wiki­Data’s cat­e­gories of books & movies and cre­at­ing pairs where a book ar­ti­cle links to a movie ar­ti­cle & vice-ver­sa? (Pre­sum­ably, all movie adap­ta­tions link to the orig­i­nal book’s ar­ti­cle, and all books which have a movie adap­ta­tion will link to the movie’s ar­ti­cle, so rec­i­p­ro­cal link­ing in­di­cates an adap­ta­tion. Ob­scure or bad works may not have high­-qual­ity WP ar­ti­cles or thor­ough links—but those are also the ones least likely to be great adap­ta­tions, so the bias is fine.) Given qual­i­fy­ing pairs, the ar­ti­cles will also prob­a­bly in­clude ISBNs or Rot­ten Toma­toes or IMDb links which can be used to re­trieve rat­ings, and then it’s merely a mat­ter of stan­dard­iz­ing over­all rat­ings and list­ing pairs with the largest differ­ence.

Deep learning/RL

  • Markov chain/char-RNN bot for Twit­ter trained on just Eng­lish proverbs, id­ioms, and ex­pres­sions

  • user-friendly char-RNN im­ple­men­ta­tion just for clas­si­fy­ing text, tak­ing in CSV data of tex­t/­cat­e­gory

  • RL agent us­ing MCTS + GAN/PixelCNN model of en­vi­ron­ment

  • hy­per­pa­ra­me­ter op­ti­miza­tion for al­go­rithms in prob­lems with­out avail­able loss func­tions but hu­man-judge­able qual­i­ty, us­ing a hu­man for mak­ing choices in a paired or forced-choice com­par­ison, then us­ing a Bradley-Terry or la­tent vari­able model to in­fer rank­ings of hy­per­pa­ra­me­ter set­tings and op­ti­miz­ing based on the la­tent scores. This would be par­tic­u­larly use­ful in GAN com­par­isons, where most com­par­isons at­tempt to force com­par­isons into a car­di­nal frame­work.

  • GAN im­prove­ments: pro­vide su­per­vi­sion via adding ad­di­tional losses by re­quir­ing the Dis­crim­i­na­tor (D) to out­put an ar­ray of per-pixel losses of sam­ple im­ages, as op­posed to a sin­gle scalar loss across the whole im­age, thereby train­ing the gen­er­a­tor more effec­tive­ly. Shift from a sin­gle scalar vari­able as feed­back per im­age to a (the log­i­cal ex­treme of a mul­ti­-s­cale ap­proach like )

    In look­ing at GAN sam­ples, I no­tice that bad Gen­er­a­tors (G) often gen­er­ate de­cent over­all sam­ples but there will be small re­gions where the qual­ity is glar­ingly bad. It is not the case that the “whole im­age just looks bad some­how”—often there’s a spe­cific point like the eyes or the lips where it looks hor­ri­fy­ingly creepy (e­spe­cially for dog or hu­man im­ages). If D pro­duces a large loss (be­cause it’s so easy to no­tice the flaw), this seems odd from a back­prop­a­ga­tion sense since most of the im­age is fine, it’s just a few spots which con­tribute to the loss. GANs, as have been often not­ed, are closely re­lated to re­in­force­ment learn­ing, and con­sid­ered as RL, the G is get­ting a sin­gle re­ward at the end of long se­quence of gen­er­ated pix­els, and does not know which pix­els are re­spon­si­ble for low or high re­wards; akin to REINFORCE, it has lit­tle choice but to re­ward/pun­ish neu­rons and hope that on av­er­age it is ap­prox­i­mat­ing the cor­rect gra­di­ent for each pa­ra­me­ter. Ac­tor-critic meth­ods make the re­ward more in­for­ma­tive by try­ing to as­sign blame to spe­cific ac­tions, and Al­phaGo Ze­ro’s ex­pert it­er­a­tion ap­pears to ex­hibit such dra­matic learn­ing speed be­cause the use of MCTS means that AG Z re­ceives not a sin­gle re­ward 0/1 at­ten­u­ated over an en­tire game of moves, but pre­cise im­me­di­ate feed­back on the value of moves it took & also on all the moves it did­n’t take. In gen­er­al, pro­vid­ing more losses is good for learn­ing—ad­di­tional ex­am­ples would in­clude aux­il­iary losses in RL like UNREAL or “dark knowl­edge” in im­age clas­si­fi­ca­tion. In GANs, every­thing is differ­en­tiable and syn­thet­ic, so we don’t need to ac­cept RL-like im­pov­er­ished loss­es, but it seems like for the most part, the losses are very sim­ple and low-in­for­ma­tion. Fur­ther, in GANs, the largest im­prove­ments in im­age qual­ity in StackGAN and ProGAN come from adding GAN global losses at mul­ti­ple lay­ers of the gen­er­a­tor: a D spe­cial­ized for 32x32px im­ages, then an­other D spe­cial­ized for 64x64px, then an­other D for 128x128px etc. This can be seen as stack­ing up losses “depth”-wise, pro­vid­ing feed­back about plau­si­bil­ity at mul­ti­ple stages. So why not add losses “width”-wise, by crit­i­ciz­ing each pixel in the fi­nal up­scaled im­age? If it’s good one way, why not the oth­er? This is in large part how the strongest com­peti­tor to GANs for im­age gen­er­a­tion, PixelCNN, works: gen­er­at­ing 1 pixel at a time con­di­tioned on pre­vi­ous gen­er­ated pix­els. (mere_­mor­tise sug­gests that this scheme would be equiv­a­lent to a reg­u­lar GAN loss but com­puted on many shifted ver­sions of an im­age, al­though that would pre­sum­ably be much slow­er.)

    Given a D which out­puts the 2D ar­ray of per-pixel loss­es, the train­ing of G is just back­prop­a­ga­tion as usu­al, but how does one train D to pro­vide per-pixel loss­es? Given a real im­age, by de­fi­n­i­tion the fak­e­ness of each pixel is 0, after all. The sim­plest ap­proach would be to train the D with real and G-ed/­fake im­ages, and la­bel all the pix­els in the real im­age with 0 and all the pix­els in the fake im­age with 1, and hope it works out and the D will learn that way over enough mini­batch­es. An­other ap­proach might be to in­tro­duce kinds of noises or cor­rup­tion or shuffles in the real im­ages, la­bel the orig­i­nal pix­els with 0 and then la­bel the new pix­els with 1; for ex­am­ple, re­place a ran­dom 50% of pix­els with white noise. (This might sound crazy but then, so does an im­age aug­men­ta­tion tech­nique like which nev­er­the­less works in CNNs & GANs.) A more in­ter­est­ing ap­proach might be to re­fash­ion G into not a sin­gle-shot im­age gen­er­a­tor, but a re­gion in­-filler/in­painter/­com­ple­tion; this lets one gen­er­ate im­ages which gen­uinely are a mix of real and fake pix­els, by crop­ping out a ran­dom re­gion in a real im­age, hav­ing G fill it back in, and la­bel­ing re­al/­fake ap­pro­pri­ate­ly. Some­thing like MixUp might be em­ployed: an im­age could be 40% gen­er­at­ed/60% re­al, and then the tar­get for D is 60%. If MixUp on ran­dom pairs of im­ages does­n’t work, a con­di­tional GAN’s con­di­tion­ing could be used as a kind of Mix­Up: com­bine a real im­age with a fake im­age based on the re­al’s con­di­tion­ing, and since the con­di­tion­ing should de­scribe most of the im­age, the pair should con­sti­tute a good mashup for the D.

    This es­sen­tially turns GANs into a “se­man­tic seg­men­ta­tion” prob­lem. For a sim­i­lar but not iden­ti­cal use, see and ; what I pro­pose may have been done, but sim­pler, in , Gokaslan et al 2018.

    Some­thing like this was done 2 years later by : scor­ing qual­ity of in­di­vid­ual pix­els by train­ing on mashed-up im­ages. The spe­cific noise was add pix­el-level noise to a real im­age us­ing a weight­ed-av­er­age with a fake im­age, and oc­ca­sion­ally copy­-paste cir­cu­lar re­gions from the fake into the re­al. This did not lead to any par­tic­u­lar im­prove­ment in the WGAN/StyleGAN/BigGAN mod­els they trained, but the Dis­crim­i­na­tors were able to rank im­ages use­fully by qual­i­ty. A much more close im­ple­men­ta­tion is , Schön­feld et al 2020, which does pre­cisely what I sug­gest but us­ing in­stead of Mix­Up—the aug­mented im­ages look strange be­cause they are just square blocks from differ­ent im­ages copy­-pasted on top of an­other im­age, but they re­port im­prove­ments on top of reg­u­lar BigGAN for FFHQ/CelebA/COCO-Animals (al­beit harder datasets like ImageNet/Danbooru2019/JFT-300M are not at­tempt­ed).

  • GWAS via 1D (pos­si­bly di­lat­ed) CNNs on SNP se­quences a la WaveNet or mal­ware de­tec­tion ():

    Lin­ear re­gres­sions are no­to­ri­ously sam­ple-in­effi­cient and weak meth­ods of im­ple­ment­ing GWAS as they typ­i­cally use un­re­al­is­tic flat pri­ors, do not ex­ploit the ‘clump­ing’ of hits in groups of SNPs (re­quir­ing post-pro­cess­ing to ‘prune’ SNP hits which are phys­i­cally too close to each other and likely in to re­veal the ‘real’ hit) , ex­pect lin­ear effects, and ad­di­tive effects. Lin­ear re­gres­sions can eas­ily pro­duce poly­genic scores ex­plain­ing half or less of vari­ance com­pared to a more op­ti­mal sta­tis­ti­cal method (eg com­pare Hsu’s lasso or MTAG use to the pre­vi­ous GWASes on height/in­tel­li­gence). A CNN could ben­e­fit from the hit clus­ters, can flex­i­bly model dis­tri­b­u­tions of effects and sub­sum­ing the “Bayesian al­pha­bet”, and can pool in­for­ma­tion both lo­cally and glob­ally while mod­el­ing po­ten­tially ar­bi­trar­ily com­plex in­ter­ac­tions and hi­er­ar­chies of effects. A SNP se­quence of, say, 500k high­-qual­ity SNP calls may seem in­fea­si­ble for a NN, and would be to­tally in­fea­si­ble for a stan­dard RNN pro­cess­ing the se­quence 1 SNP at a time, as it would be un­able to pre­serve enough in­for­ma­tion in its hid­den state or learn effec­tively due to van­ish­ing gra­di­ents; but WaveNet and 1D con­vo­lu­tions for text clas­si­fi­ca­tion have demon­strated the abil­ity for di­lated con­vo­lu­tions to han­dle enor­mous se­quences highly effec­tively while mod­el­ing both lo­cal & global as­pects. It is pos­si­ble that a 1D CNN could be a highly effec­tive GWAS method as well.

    The pri­mary chal­lenge, as dis­cov­ered by Raff et al 2017 in ex­per­i­ment­ing with CNNs in­gest­ing se­quences of mil­lions of byte, is that the first layer is in­her­ently ex­tremely mem­o­ry-hun­gry, as each of the thou­sands or mil­lions of vari­ables must be con­nected to the NN si­mul­ta­ne­ous­ly. Raff et al 2017 used a DGX-1 with 4 GPUs and ~16GB VRAM for a month for con­ver­gence, and found al­most all their mem­ory was go­ing to the first layer and the higher lay­ers con­tributed min­i­mal de­mand. If the ad­di­tional lay­ers prove prob­lem­at­ic, di­lated con­vo­lu­tions can be used in­stead, which in­crease mem­ory use only log­a­rith­mi­cal­ly, es­pe­cially with high di­la­tion fac­tors like 15 or 20. (Raff et al 2017 also found that di­lated con­vo­lu­tions were un­help­ful in their mal­ware ex­e­cutable clas­si­fi­ca­tion prob­lem and that they needed a very shal­low ar­chi­tec­ture, sug­gest­ing that mal­ware byte se­quences just don’t have that much lo­cal struc­ture for con­vo­lu­tions to ex­ploit and that they were hav­ing train­ing/­con­ver­gence is­sues de­spite con­sid­er­able in­vest­men­t—but I ex­pect genomes to have much more lo­cal struc­ture due to the genome in­her­ently be­ing se­quenced into genes (which do not all affect traits of in­ter­est to equal de­gree), cod­ing re­gions of var­i­ous sorts, and the pre­vi­ously men­tioned SNP-clumping em­pir­i­cally ob­served in many GWASes.) A GWAS CNN might re­quire data-par­al­lel train­ing over mul­ti­ple 1080ti GPUs, split­ting the mini­batch to fit into the 11GB VRAM, and at least a month. How­ev­er, should it de­liver pre­dic­tive power much su­pe­rior to ex­ist­ing SOTA tech­niques like lasso GWAS, these com­pu­ta­tional re­quire­ments would prob­a­bly be con­sid­ered ac­cept­able—­sev­eral GPU-months may be ex­pen­sive, but col­lect­ing twice or thrice as many hu­man genomes is more ex­pen­sive still.

  • deep RL for neural net­work de­sign but fo­cus­ing on gen­er­at­ing a dis­tri­b­u­tion of ran­dom weights for ini­tial­iz­ing a NN; bet­ter ini­tial­iza­tions have proven to be ex­tremely im­por­tant in sta­bly train­ing NN and sim­ply tweak­ing ini­tial­iza­tion can train NNs with hun­dreds of lay­ers (pre­vi­ously im­pos­si­ble, then only pos­si­ble with a ma­jor ar­chi­tec­tural in­no­va­tion like resid­ual net­works) eg . Bet­ter ini­tial­iza­tions are hard to de­sign by hand as they ap­par­ently work by break­ing var­i­ous sym­me­tries in­side the NN, so this is a prob­lem that is well suited for brute force and tri­al-and-er­ror. See fur­ther , , , , , . This may wind up be­ing es­sen­tially the same thing as Hy­per­Net­work­s/­fast-weights eg .

  • CNN for font . By far most te­dious, labor-in­ten­sive/­time-money-ex­pen­sive, rote part of high­-qual­ity font cre­ation; but also most I-know-it-when-I-see-it, non-se­man­tic, and easy to cre­ate mil­lions of train­ing sam­ples by jit­ter­ing ran­dom text set in ex­ist­ing kerned fonts (k > 50,000 is eas­ily doable by down­load­ing In­ter­net fonts) and train­ing to re­cover the hu­man-ex­pert-de­fined kern­ings.

Preference learning

See & .


  • writ­ing tools:

    • di­alec­t/pe­riod writ­ing tool, per­haps ex­ploit­ing word2vec: iden­tify words in a text which are of the wrong di­alect or are char­ac­ter­is­tic of differ­ent time pe­ri­ods; for ex­am­ple, iden­ti­fy­ing Amer­i­can­isms in an os­ten­si­bly British work (to ‘Brit-pick’), or iden­tify anachro­nisms in a his­tor­i­cal fic­tion (words which did not ex­ist in that time pe­riod or would be highly un­usu­al), and sug­gest re­place­ments

    • char­ac­ter gen­er­a­tor: gen­er­ate ran­dom pop­u­la­tion-weighted sam­ples of peo­ple by de­mo­graph­ics, po­lit­i­cal & re­li­gious at­ti­tudes, ide­ol­o­gy, draw­ing on re­al­is­tic datasets such as US cen­suses (for de­mo­graph­ic­s/­names) or the (GSS)1; this can be use­ful in re­duc­ing bias in char­ac­ters, ex­plor­ing pos­si­bil­i­ties, and in­creas­ing re­al­ism. Naive at­tempts to de­bias writ­ings often wind up mak­ing the char­ac­ters far more un­rep­re­sen­ta­tive, such as by in­clud­ing too many ho­mo­sex­ual or trans­sex­ual char­ac­ters or in­clud­ing rare eth­nic­i­ties like Jews while fail­ing to in­clude com­mon types of peo­ple such as fun­da­men­tal­ist Chris­tians or Re­pub­li­cans, and ex­ist­ing fake name or char­ac­ter gen­er­a­tors do not help be­cause they typ­i­cally take the easy way out by merely sam­pling ran­domly from a list of unique val­ues, skew­ing se­lec­tion to bizarre & ex­otic—try­ing out one such gen­er­a­tor, I get strange names like “Cyn­thia P. Teal” or “Cody A. Nguyen” or “Mar­shall T. Blanco”. Us­ing real data & pro­por­tional sam­pling en­sures re­al­ism and elim­i­nates blind spots an au­thor may not re­al­ize they have. (Of course, this is not to say that an au­thor will be happy with the sug­ges­tions, par­tic­u­larly with what the GSS may re­veal about the be­liefs and knowl­edge of Amer­i­cans in gen­er­al. But if an au­thor en­sures that all of their char­ac­ters are aware that choco­late milk does­n’t come from brown cows or grad­u­ated high school, at least it will then be a de­lib­er­ate choice on their part.)

    • -in­spired Eng­lish font: is it pos­si­ble to write Eng­lish in syl­la­ble blocks akin to how Ko­rean is writ­ten in hangul, us­ing a large set of ? (For ex­am­ple, a word like ‘the’ could be eas­ily writ­ten as a block with a ‘th’ lig­a­ture and plac­ing the ‘e’ over the ‘h’.)

      en­thu­si­asts do not seem to’ve tried this; the clos­est I’ve found is Russ­ian ‘elm’ cal­lig­ra­phy (es­thetic but un­read­able), (ped­a­gog­i­cal tool ex­plain­ing how Chi­nese char­ac­ters work), an ex­per­i­ment in set­ting en­tire words as blocks (which mostly demon­strates the need to do it with syl­la­bles in­stead), and a hand­ful of “in­ter­lock” such as “Ed In­ter­lock” (meant less for read­abil­ity than to con­vey a ’60s hippy or a Tahit­ian ).

  • smart-glasses w/NNs for lipread­ing+­tran­scrip­tion+voice-gen­er­a­tion for deaf­/­hear­ing-im­paired:

    In­spired by , As­sael et al 2016 (video)

    Lipread­ing is the task of de­cod­ing text from the move­ment of a speak­er’s mouth. Tra­di­tional ap­proaches sep­a­rated the prob­lem into two stages: de­sign­ing or learn­ing vi­sual fea­tures, and pre­dic­tion. More re­cent deep lipread­ing ap­proaches are end-to-end train­able (Wand et al., 2016; Chung & Zis­ser­man, 2016a). All ex­ist­ing works, how­ev­er, per­form only word clas­si­fi­ca­tion, not sen­tence-level se­quence pre­dic­tion. Stud­ies have shown that hu­man lipread­ing per­for­mance in­creases for longer words (Eas­ton & Basala, 1982), in­di­cat­ing the im­por­tance of fea­tures cap­tur­ing tem­po­ral con­text in an am­bigu­ous com­mu­ni­ca­tion chan­nel. Mo­ti­vated by this ob­ser­va­tion, we present Lip­Net, a model that maps a vari­able-length se­quence of video frames to text, mak­ing use of spa­tiotem­po­ral con­vo­lu­tions, an LSTM re­cur­rent net­work, and the con­nec­tion­ist tem­po­ral clas­si­fi­ca­tion loss, trained en­tirely end-to-end. To the best of our knowl­edge, Lip­Net is the first lipread­ing model to op­er­ate at sen­tence-level, us­ing a sin­gle end-to-end speak­er-in­de­pen­dent deep model to si­mul­ta­ne­ously learn spa­tiotem­po­ral vi­sual fea­tures and a se­quence mod­el. On the GRID cor­pus, Lip­Net achieves 93.4% ac­cu­ra­cy, out­per­form­ing ex­pe­ri­enced hu­man lipread­ers and the pre­vi­ous 79.6% state-of-the-art ac­cu­ra­cy.

    Com­ing on the heels of hu­man-level speech tran­scrip­tion, I am very much look­ing for­ward to smart glasses with re­al-time cap­tion­ing. That is go­ing to be a game-changer for hard of hear­ing and deaf peo­ple.

    The out­put so­lu­tion for deaf peo­ple has ex­isted for a long time, like a lit­tle chalk­board, or, many peo­ple can type al­most as fast as they speak nor­mal­ly, and steno key­boards are much faster than that. But this was never rel­e­vant (offline) be­cause deaf peo­ple could­n’t hear: there’s no point in be­ing able to re­ply if you don’t know what you’re re­ply­ing to. So we had to teach deaf peo­ple both Eng­lish writ­ten and ASL for in­ter­ac­tions. Wavenet may offer hu­man-level voice syn­the­sis, but it did­n’t mat­ter. How­ev­er, with Lip­net, does­n’t that change? If you can get re­al­time tran­scrip­tion with lipread­ing+­tran­scrip­tion RNNs which is hu­man-e­quiv­a­lent or bet­ter, you’ve closed the loop. Why not just have deaf peo­ple use a smart glass for cap­tion­ing and a glove for a steno-like key­board + voice syn­the­sis? You have to teach them writ­ten eng­lish and typ­ing any­way, so what’s ASL now adding aside from es­thet­ics and com­mu­ni­ty? (Peo­ple are happy to be de­pen­dent on smart­phones, so that’s not a se­ri­ous mi­nus.)

  • a VR ap­pli­ca­tion for view­ing im­ages & video and for 3D en­vi­ron­ments with ex­tremely large par­al­lax such as for view­ing clouds with true depth per­cep­tion (dis­cus­sion)

  • prop­erly tran­scribe & an­no­tate Dou­glas Hof­s­tader’s Le Ton Beau de Marot, one of his best but also most ob­scure books


  • pro­vide “poly­genic scores as a ser­vice”, a website/API where one can up­load a SNP data file like the 23andMe ex­port and get back PGSes for every­thing in LD Hub, and util­ity weights
  • ex­pand/rewrite Wikipedi­a’s —grossly out­dat­ed, al­most to­tally omit­ting all the GCTAs and GWASes that have de­fin­i­tively set­tled the an­swer in the strongly affir­ma­tive
  • nom­i­na­tive de­ter­min­ism: do first names affect how peo­ple are per­ceived or their ap­pear­ance? Some stud­ies in­di­cate that one can guess first names based on ap­pear­ance… but I haven’t seen one which does a with­in-fam­ily com­par­i­son eg swap­ping at ran­dom the pho­tographs of two same-sex sib­lings, pro­vide their first names, and ask­ing peo­ple to guess which is which. Names are canon­i­cal ex­am­ples of things which vary sys­tem­at­i­cally be­tween fam­i­lies.

Estimating censored test scores

An ac­quain­tance asks the fol­low­ing ques­tion: he is ap­ply­ing for a uni­ver­sity course which re­quires a cer­tain min­i­mum score on a test for ad­mit­tance, and won­ders about his chances and a pos­si­ble trend of in­creas­ing min­i­mum scores over time. (He has­n’t re­ceived his test re­sults yet.) The uni­ver­sity does­n’t pro­vide a dis­tri­b­u­tion of ad­mit­tee scores, but it does pro­vide the min­i­mum scores for 2005-2013, un­less all ap­pli­cants were ad­mit­ted be­cause they all scored above an un­known cut­off—in which case it pro­vides no min­i­mum score. This leads to the dataset:


A quick eye­ball tells us that we can’t con­clude much: only 4 ac­tual dat­a­points, with 5 hid­den from us. We can’t hope to con­clude any­thing about time trends, other than there does­n’t seem to be much of one: the last score, 417, is not much higher than 410, and the last two scores are low enough to be hid­den. We might be able to es­ti­mate a mean, though.

We can’t sim­ply av­er­age the 4 scores and con­clude the mean min­i­mum is 410 be­cause of those NAs: a num­ber of scores have been ‘cen­sored’ be­cause they were too low, and while we don’t know what they were, we do know they were <398 (the small­est score) and so a bunch of <398s will pull down the un­cen­sored mean of 410.

On ap­proach is to treat it as a and es­ti­mate us­ing some­thing like the censReg li­brary (overview).

But if we try a quick call to censReg, we are con­found­ed: a To­bit model ex­pects you to pro­vide the cut­off be­low which the ob­ser­va­tions were cen­sored, but that is some­thing we don’t know. All we know is that it must be be­low 398, we weren’t told it was ex­actly 395, 394, etc. For­tu­nate­ly, this is a solved prob­lem. For ex­am­ple: “The To­bit model with a non-zero thresh­old”, Car­son & Sun 2007 tells us:

In this pa­per, we con­sider es­ti­mat­ing the un­known cen­sor­ing thresh­old by the min­i­mum of the un­cen­sored yi’s. We show that the es­ti­ma­tor γ’ of γ is su­per­con­sis­tent and as­ymp­tot­i­cally ex­po­nen­tially dis­trib­uted. Car­son (1988, 1989) also sug­gests es­ti­mat­ing the un­known cen­sor­ing thresh­old by the min­i­mum of the un­cen­sored yi’s. In a re­cent pa­per, Zuehlke (2003) re­dis­cov­ers these un­pub­lished re­sults and demon­strates via sim­u­la­tions that the as­ymp­totic dis­tri­b­u­tion of the max­i­mum like­li­hood es­ti­ma­tor does not seem to be affected by the es­ti­ma­tion of the cen­sor­ing thresh­old.

That seems to be al­most too sim­ple and easy, but it makes sense and re­minds me a lit­tle of the : the min­i­mum might not be that ac­cu­rate a guess (it’s un­likely you just hap­pened to draw a sam­ple right on the cen­sor­ing thresh­old) and it defi­nitely can’t be wrong in the sense of be­ing too low. (A Bayesian method might be able to do bet­ter with a prior like a ex­po­nen­tial.)

With that set­tled, the analy­sis is straight­for­ward: load the data, fig­ure out the min­i­mum score, set the NAs to 0, re­gress, and ex­tract the model es­ti­mates for each year:

scores <- data.frame(Year=2005:2013,
censorThreshold <- min(scores$MinimumScore, na.rm=T)
scores[] <- 0

## 'censorThreshold-1' because censReg seems to treat threshold as < and not <=
summary(censReg(MinimumScore ~ Year, left=censorThreshold-1, data=scores))
# Warning message:
# In censReg(MinimumScore ~ Year, left = censorThreshold - 1, data = scores) :
#   at least one value of the endogenous variable is smaller than the left limit
# Call:
# censReg(formula = MinimumScore ~ Year, left = censorThreshold -
#     1, data = scores)
# Observations:
#          Total  Left-censored     Uncensored Right-censored
#              9              5              4              0
# Coefficients:
#              Estimate Std. error t value Pr(> t)
# (Intercept) -139.9711        Inf       0       1
# Year           0.2666        Inf       0       1
# logSigma       2.6020        Inf       0       1
# Newton-Raphson maximisation, 37 iterations
# Return code 1: gradient close to zero
# Log-likelihood: -19.35 on 3 Df
-139.9711 + (0.2666 * scores$Year)
# [1] 394.6 394.8 395.1 395.4 395.6 395.9 396.2 396.4 396.7

With so lit­tle data the re­sults aren’t very re­li­able, but there is one ob­ser­va­tion we can make.

The fact that half the dataset is cen­sored tells us that the un­cen­sored mean may be a huge over­es­ti­mate (s­ince we’re only look­ing at the ‘top half’ of the un­der­ly­ing data), and in­deed it is. The orig­i­nal mean of the un­cen­sored scores was 410; how­ev­er, the es­ti­mate in­clud­ing the cen­sored data is much low­er, 397 (13 less)!

This demon­strates the dan­ger of ig­nor­ing sys­tem­atic bi­ases in your da­ta.

So, try­ing to cal­cu­late a mean or time effect is not help­ful. What might be bet­ter is to in­stead ex­ploit the cen­sor­ing di­rect­ly: if the cen­sor­ing hap­pened be­cause every­one got in, then if you showed up in a cen­sored year, you have 100% chance of get­ting in; while in a non-cen­sored year you have an un­known but <100% chance of get­ting in; so the prob­a­bil­ity of a cen­sored year sets a lower bound on one’s chances, and this is easy to cal­cu­late as a sim­ple bi­no­mial prob­lem—5 out of 9 years were cen­sored years, so:

#   Exact binomial test
# data:  c(5, 4)
# number of successes = 5, number of trials = 9, p-value = 1
# alternative hypothesis: true probability of success is not equal to 0.5
# 95% confidence interval:
#  0.212 0.863
# sample estimates:
# probability of success
#                 0.5556

So we can tell him that he may have a >55% chance of get­ting in.

The Traveling Gerontologist problem

A quick prob­a­bil­ity ex­er­cise: men­tions Fin­land has 566 cen­te­nar­i­ans as of 2010.

That’s few enough you could imag­ine vis­it­ing them all to re­search them and their longevi­ty, in a sort of trav­el­ing sales­man prob­lem but with geron­tol­o­gists in­stead. Ex­cept, be­cause of the , cen­te­nar­i­ans have high an­nual mor­tal­ity rates; it de­pends on the ex­act age but you could call it >30% (eg Finnish 99yos in 2012 had a death toll of 326.54/1000). So you might well try to visit a cen­te­nar­ian and dis­cover they’d died be­fore you got there.

How bad a risk is this? Well, if the risk per year is 30%, then one has a 70% chance of sur­viv­ing a year. To sur­vive a year, you must sur­vive all 365 days; by the mul­ti­pli­ca­tion rule, the risk is x where or 0.7 = x365.25; solv­ing, x = 0.999024.

It takes time to visit a cen­te­nar­i­an—it would­n’t do to be abrupt and see them for only a few min­utes, you ought to lis­ten to their sto­ries, and you need to get to a ho­tel or air­port, so let’s as­sume you visit 1 cen­te­nar­ian per day.

If you visit cen­te­nar­ian A on day 1, and you want to visit cen­te­nar­ian B on day 2, then you can count on a 99.9% chance B is still alive. So far so good. And if you wanted to visit 566 cen­te­nar­i­ans (let’s imag­ine you have a reg­u­lar­ly-up­dated mas­ter list of cen­te­nar­i­ans from the Finnish pop­u­la­tion reg­istry), then you only have to beat the odds 566 times in a row, which is not that hard: 0.999024566 = 0.5754023437943274.

But that’s cold­blooded of you to ob­jec­tify those Finnish cen­te­nar­i­ans! “Any cen­te­nar­ian will do, I don’t care.” What if you picked the cur­rent set of 566 cen­te­nar­i­ans and wanted to visit just them, specifi­cal­ly—with no new cen­te­nar­i­ans in­tro­duced to the list to re­place any dead ones.

That’s a lit­tle more com­pli­cat­ed. When you visit the first cen­te­nar­i­an, it’s the same prob­a­bil­i­ty: 0.999024. When you visit the sec­ond cen­te­nar­ian the odds change since now she (and it’s more often ‘she’ than ‘he’, since re­mem­ber the ex­po­nen­tial and males hav­ing shorter mean life­times) has to sur­vive 2 days, so it’s or 0.9990242; for the third, it’s 0.9990243, and so on to #566 who has been pa­tiently wait­ing and try­ing to sur­vive a risk of 0.999024566, and then you need to mul­ti­ply to get your odds of beat­ing every sin­gle risk of death and the cen­te­nar­ian not leav­ing for a more per­ma­nent ren­dezvous: , which would be , or in Haskell:

product (map (\x -> 0.999024**x) [1..566])

(A lit­tle sur­pris­ing­ly, Wol­fram Al­pha can solve the TeX ex­pres­sion too.)

Given the use of float­ing point in that func­tion (567 float­ing point ex­po­nen­ti­a­tions fol­lowed by as many mul­ti­pli­ca­tions) and the hor­ror sto­ries about float­ing point, one might worry the an­swer is wrong & the real prob­a­bil­ity is much larg­er. We can retry with an im­ple­men­ta­tion of com­putable re­als, CReal, which can be very slow but should give more pre­cise an­swers:

:module + Data.Number.CReal
showCReal 100 (product (map (\x -> 0.999024**x) [1..566]))

Looks good—a­grees with the float­ing point ver­sion up to the 11th dig­it:

8.9527433401 64081e-69
8.9527433401 308585720915431195262

We can also check by rewrit­ing the prod­uct equa­tion to avoid all the ex­po­nen­ti­a­tion and mul­ti­pli­ca­tion (which might cause is­sues) in fa­vor of a sin­gle ex­po­nen­tial:

  1. (as be­fore)
  2. = (s­ince )
  3. = (by /Gauss’s fa­mous class­room trick since )
  4. = (s­tart sub­sti­tut­ing in spe­cific val­ues)
  5. =
  6. = 0.999024160461



Or to go back to the longer ver­sion:

0.999024**((566*(1 + 566)) / 2)

Also close. All prob­a­bil­i­ties of suc­cess are minute.

How fast would you have to be if you wanted to at least try to ac­com­plish the tour with, say, a 50-50 chance?

Well, that’s easy: you can con­sider the prob­a­bil­ity of all of them sur­viv­ing one day and as we saw ear­lier, that’s 0.999024566 = 0.58, and two days would be So you can only take a lit­tle over a day be­fore you’ve prob­a­bilis­ti­cally lost & one of them has died; if you hit all 566 cen­te­nar­i­ans in 24 hours, that’s ~24 cen­te­nar­i­ans per hour or ~2 min­utes to chat with each one and travel to the next. If you’re try­ing to col­lect DNA sam­ples, bet­ter hope they’re all awake and able to give con­sent!

So safe to say, you will prob­a­bly not be able to man­age the Trav­el­ing Geron­tol­o­gist’s tour.

Bayes nets

Daily weight data graph

As the datasets I’m in­ter­ested in grow in num­ber of vari­ables, it be­comes harder to jus­tify do­ing analy­sis by sim­ply writ­ing down a sim­ple lin­ear model with a sin­gle de­pen­dent vari­able and throw­ing in the in­de­pen­dent vari­ables and maybe a few trans­for­ma­tions cho­sen by hand. I can in­stead write down some si­mul­ta­ne­ous-e­qua­tion­s/struc­tural-e­qua­tion-mod­els, but while it’s usu­ally ob­vi­ous what to do for k < 4 and if it’s not I can com­pare the pos­si­ble vari­ants, 4 vari­ables is ques­tion­able what the right SEM is, and >5, it’s hope­less. Fac­tor analy­sis to ex­tract some la­tent vari­ables is a pos­si­bil­i­ty, but the more gen­eral so­lu­tion here seems to be prob­a­bilis­tic graph­i­cal mod­els such as Bayesian net­works.

I thought I’d try out some Bayes net in­fer­ence on some of my datasets. In this case, I have ~150 daily mea­sure­ments from my Om­ron body com­po­si­tion scale, mea­sur­ing to­tal weight, body fat per­cent­age, and some other things (see an Om­ron man­ual):

  1. To­tal weight
  2. BMI
  3. Body fat per­cent­age
  4. Mus­cle per­cent­age
  5. Rest­ing me­tab­o­lism in calo­ries
  6. “Body age”
  7. Vis­ceral fat in­dex

The 7 vari­ables are in­ter­re­lat­ed, so this is defi­nitely a case where a sim­ple lm is not go­ing to do the trick. It’s also not 100% clear how to set up a SEM; some de­fi­n­i­tions are ob­vi­ous (the much-crit­i­cized BMI is go­ing to be de­ter­mined solely by to­tal weight, mus­cle and fat per­cent­age might be in­versely re­lat­ed) but oth­ers are not (how does “vis­ceral fat” re­late to body fat?). And it’s not a hope­lessly small amount of da­ta.

The Bayes net R li­brary I’m try­ing out is bnlearn (pa­per).

weight <- read.csv("selfexperiment/weight.csv")
weight$Date <- NULL; weight$Weight.scale <- NULL
# remove missing data
weightC <- na.omit(weight)
# bnlearn can't handle integers, oddly enough
weightC <-, as.numeric))
#   Weight.Omron        Weight.BMI        Weight.body.fat    Weight.muscle
#  Min.   :193.0000   Min.   : 26.90000   Min.   :27.00000   Min.   :32.60000
#  1st Qu.:195.2000   1st Qu.: 27.20000   1st Qu.:28.40000   1st Qu.:34.20000
#  Median :196.4000   Median : 27.40000   Median :28.70000   Median :34.50000
#  Mean   :196.4931   Mean   : 28.95409   Mean   :28.70314   Mean   :34.47296
#  3rd Qu.:197.8000   3rd Qu.: 27.60000   3rd Qu.:29.10000   3rd Qu.:34.70000
#  Max.   :200.6000   Max.   : 28.00000   Max.   :31.70000   Max.   :35.50000
#  Weight.resting.metabolism Weight.body.age    Weight.visceral.fat
#  Min.   :1857.000          Min.   :52.00000   Min.   : 9.000000
#  1st Qu.:1877.000          1st Qu.:53.00000   1st Qu.:10.000000
#  Median :1885.000          Median :53.00000   Median :10.000000
#  Mean   :1885.138          Mean   :53.32704   Mean   : 9.949686
#  3rd Qu.:1893.000          3rd Qu.:54.00000   3rd Qu.:10.000000
#  Max.   :1914.000          Max.   :56.00000   Max.   :11.000000
#                             Weight.Omron     Weight.BMI Weight.body.fat  Weight.muscle
# Weight.Omron               1.00000000000  0.98858376919    0.1610643221 -0.06976934825
# Weight.BMI                 0.98858376919  1.00000000000    0.1521872557 -0.06231142104
# Weight.body.fat            0.16106432213  0.15218725566    1.0000000000 -0.98704369855
# Weight.muscle             -0.06976934825 -0.06231142104   -0.9870436985  1.00000000000
# Weight.resting.metabolism  0.96693236051  0.95959140245   -0.0665001241  0.15621294274
# Weight.body.age            0.82581939626  0.81286141659    0.5500409365 -0.47408608681
# Weight.visceral.fat        0.41542744168  0.43260100665    0.2798756916 -0.25076619829
#                           Weight.resting.metabolism Weight.body.age Weight.visceral.fat
# Weight.Omron                           0.9669323605    0.8258193963        0.4154274417
# Weight.BMI                             0.9595914024    0.8128614166        0.4326010067
# Weight.body.fat                       -0.0665001241    0.5500409365        0.2798756916
# Weight.muscle                          0.1562129427   -0.4740860868       -0.2507661983
# Weight.resting.metabolism              1.0000000000    0.7008354776        0.3557229425
# Weight.body.age                        0.7008354776    1.0000000000        0.4840752389
# Weight.visceral.fat                    0.3557229425    0.4840752389        1.0000000000

## create alternate dataset expressing the two percentage variables as pounds, since this might fit better
weightC2 <- weightC
weightC2$Weight.body.fat <- weightC2$Weight.Omron * (weightC2$Weight.body.fat / 100)
weightC2$Weight.muscle   <- weightC2$Weight.Omron * (weightC2$Weight.muscle / 100)

Be­gin analy­sis:

pdap <- hc(weightC)
pdapc2 <- hc(weightC2)
## bigger is better:
score(pdap, weightC)
# [1] -224.2563072
score(pdapc2, weightC2)
# [1] -439.7811072
## stick with the original, then
#   Bayesian network learned via Score-based methods
#   model:
#    [Weight.Omron][Weight.body.fat][Weight.BMI|Weight.Omron]
#    [Weight.resting.metabolism|Weight.Omron:Weight.body.fat]
#    [Weight.body.age|Weight.Omron:Weight.body.fat]
#    [Weight.muscle|Weight.body.fat:Weight.resting.metabolism][Weight.visceral.fat|Weight.body.age]
#   nodes:                                 7
#   arcs:                                  8
#     undirected arcs:                     0
#     directed arcs:                       8
#   average markov blanket size:           2.57
#   average neighbourhood size:            2.29
#   average branching factor:              1.14
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              2.534452101
#   tests used in the learning procedure:  69
#   optimized:                             TRUE

This in­ferred graph is ob­vi­ously wrong in sev­eral re­spects, vi­o­lat­ing prior knowl­edge about some of the re­la­tion­ships.

More specifi­cal­ly, my prior knowl­edge:

  • Weight.Omron == to­tal weight; should be in­flu­enced by Weight.body.fat (%), Weight.muscle (%), & Weight.visceral.fat

  • Weight.visceral.fat: or­di­nal vari­able, <=9 = nor­mal; 10-14 = high; 15+ = very high; from the Om­ron man­u­al:

    Vis­ceral fat area (0—ap­prox. 300 cm , 1 inch=2.54 cm) dis­tri­b­u­tion with 30 lev­els. NOTE: Vis­ceral fat lev­els are rel­a­tive and not ab­solute val­ues.

  • Weight.BMI: BMI is a sim­ple func­tion of to­tal weight & height (specifi­cally BMI = round(weight / height^2)), so it should be in­flu­enced only by Weight.Omron, and in­flu­ence noth­ing else

  • Weight.body.age: should be in­flu­enced by Weight.Omron, Weight.body.fat, and Weight.muscle, based on the de­scrip­tion in the man­u­al:

    Body age is based on your rest­ing me­tab­o­lism. Body age is cal­cu­lated by us­ing your weight, body fat per­cent­age and skele­tal mus­cle per­cent­age to pro­duce a guide to whether your body age is above or be­low the av­er­age for your ac­tual age.

  • Weight.resting.metabolism: a func­tion of the oth­ers, but I’m not sure which ex­act­ly; man­ual talks about what rest­ing me­tab­o­lism is gener­i­cally and spec­i­fies it has the range “385 to 3999 kcal with 1 kcal in­cre­ments”; http­s://en.wikipedi­­i/Basal_meta­bol­ic_rate sug­gests the Om­ron may be us­ing one of sev­eral ap­prox­i­ma­tion equa­tions based on age/­sex/height/weight, but it might also be us­ing lean body mass as well.

Un­for­tu­nate­ly, bn­learn does­n’t seem to sup­port any easy way of en­cod­ing the prior knowl­edge—­for ex­am­ple, you can’t say ‘no out­go­ing ar­rows from node X’—so I it­er­ate, adding bad ar­rows to the black­list.

Which ar­rows vi­o­late prior knowl­edge?

  • [Weight.visceral.fat|Weight.body.age] (read back­wards, as Weight.body.age → Weight.visceral.fat)
  • [Weight.muscle|Weight.resting.metabolism]

Retry, black­list­ing those 2 ar­rows:

pdap2 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism"), to=c("Weight.visceral.fat","Weight.muscle")))

New vi­o­la­tions:

  • [Weight.visceral.fat|Weight.BMI]
  • [Weight.muscle|Weight.Omron]
pdap3 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle")))

New vi­o­la­tions:

  • [Weight.visceral.fat|Weight.Omron]
  • [Weight.muscle|Weight.BMI]
pdap4 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron", "Weight.Omron", "Weight.BMI"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.visceral.fat", "Weight.muscle")))

One vi­o­la­tion:

  • [Weight.muscle|Weight.body.age]
pdap5 <- hc(weightC, blacklist=data.frame(from=c("Weight.body.age", "Weight.resting.metabolism", "Weight.BMI", "Weight.Omron", "Weight.Omron", "Weight.BMI", "Weight.body.age"), to=c("Weight.visceral.fat","Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.visceral.fat", "Weight.muscle", "Weight.muscle")))
#   Bayesian network learned via Score-based methods
#   model:
#    [Weight.body.fat][Weight.muscle|Weight.body.fat][Weight.visceral.fat|Weight.body.fat]
#    [Weight.Omron|Weight.visceral.fat][Weight.BMI|Weight.Omron]
#    [Weight.resting.metabolism|Weight.Omron:Weight.body.fat]
#    [Weight.body.age|Weight.Omron:Weight.body.fat]
#   nodes:                                 7
#   arcs:                                  8
#     undirected arcs:                     0
#     directed arcs:                       8
#   average markov blanket size:           2.57
#   average neighbourhood size:            2.29
#   average branching factor:              1.14
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              2.534452101
#   tests used in the learning procedure:  62
#   optimized:                             TRUE

## implementing all the prior knowledge cost ~30:
score(pdap5, weightC)
# [1] -254.6061724

No vi­o­la­tions, so let’s use the net­work and es­ti­mate the spe­cific pa­ra­me­ters:

fit <-, weightC); fit
#   Bayesian network parameters
#   Parameters of node Weight.Omron (Gaussian distribution)
# Conditional density: Weight.Omron | Weight.visceral.fat
# Coefficients:
#         (Intercept)  Weight.visceral.fat
#       169.181651376          2.744954128
# Standard deviation of the residuals: 1.486044472
#   Parameters of node Weight.BMI (Gaussian distribution)
# Conditional density: Weight.BMI | Weight.Omron
# Coefficients:
#   (Intercept)   Weight.Omron
# -0.3115772322   0.1411044216
# Standard deviation of the residuals: 0.03513413381
#   Parameters of node Weight.body.fat (Gaussian distribution)
# Conditional density: Weight.body.fat
# Coefficients:
# (Intercept)
# 28.70314465
# Standard deviation of the residuals: 0.644590085
#   Parameters of node Weight.muscle (Gaussian distribution)
# Conditional density: Weight.muscle | Weight.body.fat
# Coefficients:
#     (Intercept)  Weight.body.fat
#   52.1003347352    -0.6141270921
# Standard deviation of the residuals: 0.06455478599
#   Parameters of node Weight.resting.metabolism (Gaussian distribution)
# Conditional density: Weight.resting.metabolism | Weight.Omron + Weight.body.fat
# Coefficients:
#     (Intercept)     Weight.Omron  Weight.body.fat
#   666.910582196      6.767607964     -3.886694779
# Standard deviation of the residuals: 1.323176507
#   Parameters of node Weight.body.age (Gaussian distribution)
# Conditional density: Weight.body.age | Weight.Omron + Weight.body.fat
# Coefficients:
#     (Intercept)     Weight.Omron  Weight.body.fat
#  -32.2651379176     0.3603672788     0.5150134225
# Standard deviation of the residuals: 0.2914301529
#   Parameters of node Weight.visceral.fat (Gaussian distribution)
# Conditional density: Weight.visceral.fat | Weight.body.fat
# Coefficients:
#     (Intercept)  Weight.body.fat
#    6.8781100009     0.1070118125
# Standard deviation of the residuals: 0.2373649058
## residuals look fairly good, except for Weight.resting.metabolism, where there are some extreme residuals in what looks a bit like a sigmoid sort of pattern, suggesting nonlinearities in the Omron scale's formula?

We can dou­ble-check the es­ti­mates here by turn­ing the Bayes net model into a SEM and see­ing how the es­ti­mates com­pare, and also see­ing if the p-val­ues sug­gest we’ve found a good mod­el:

Weight.model1 <- '
    Weight.visceral.fat ~ Weight.body.fat
    Weight.Omron ~ Weight.visceral.fat
    Weight.BMI ~ Weight.Omron
    Weight.body.age ~ Weight.Omron + Weight.body.fat
    Weight.muscle ~ Weight.body.fat
    Weight.resting.metabolism ~ Weight.Omron + Weight.body.fat
Weight.fit1 <- sem(model = Weight.model1,  data = weightC)
# lavaan (0.5-16) converged normally after 139 iterations
#   Number of observations                           159
#   Estimator                                         ML
#   Minimum Function Test Statistic               71.342
#   Degrees of freedom                                 7
#   P-value (Chi-square)                           0.000
# Parameter estimates:
#   Information                                 Expected
#   Standard Errors                             Standard
#                    Estimate  Std.err  Z-value  P(>|z|)
# Regressions:
#   Weight.visceral.fat ~
#     Weight.bdy.ft     0.107    0.029    3.676    0.000
#   Weight.Omron ~
#     Wght.vscrl.ft     2.745    0.477    5.759    0.000
#   Weight.BMI ~
#     Weight.Omron      0.141    0.002   82.862    0.000
#   Weight.body.age ~
#     Weight.Omron      0.357    0.014   25.162    0.000
#     Weight.bdy.ft     0.516    0.036   14.387    0.000
#   Weight.muscle ~
#     Weight.bdy.ft    -0.614    0.008  -77.591    0.000
#   Weight.resting.metabolism ~
#     Weight.Omron      6.730    0.064  104.631    0.000
#     Weight.bdy.ft    -3.860    0.162  -23.837    0.000
# Covariances:
#   Weight.BMI ~~
#     Weight.body.g    -0.000    0.001   -0.116    0.907
#     Weight.muscle    -0.000    0.000   -0.216    0.829
#     0.005    0.004    1.453    0.146
#   Weight.body.age ~~
#     Weight.muscle     0.001    0.001    0.403    0.687
#    -0.021    0.030   -0.700    0.484
#   Weight.muscle ~~
#     0.007    0.007    1.003    0.316
# Variances:
#     Wght.vscrl.ft     0.056    0.006
#     Weight.Omron      2.181    0.245
#     Weight.BMI        0.001    0.000
#     Weight.body.g     0.083    0.009
#     Weight.muscle     0.004    0.000
#     1.721    0.193

Com­par­ing the co­effi­cients by eye, they tend to be quite close (usu­ally within 0.1) and the p-val­ues are all sta­tis­ti­cal­ly-sig­nifi­cant.

The net­work it­self looks right, al­though some of the edges are sur­pris­es: I did­n’t know vis­ceral fat was pre­dictable from body fat (I thought they were mea­sur­ing sep­a­rate things), and the rel­a­tive in­de­pen­dence of mus­cle sug­gests that in any ex­er­cise plan I might be bet­ter off fo­cus­ing on the body fat per­cent­age rather than the mus­cle per­cent­age since the for­mer may be effec­tively de­ter­min­ing the lat­ter.

So what did I learn here?

  • learn­ing net­work struc­ture and di­rec­tion of ar­rows is hard; even with only 7 vari­ables and n = 159 (ac­cu­rate clean data), the hill-climb­ing al­go­rithm will learn at least 7 wrong arcs.

    • and the de­rived graphs de­pend dis­turbingly heav­ily on choice of al­go­rithm; I used the hc hill-climb­ing al­go­rithm (s­ince I’m lazy and did­n’t want to spec­ify ar­row di­rec­tion­s), but when I try out the al­ter­na­tives like iamb on the same data & black­list, the found graph looks rather differ­ent
  • Gaus­sians are, as al­ways, sen­si­tive to out­liers: I was sur­prised the first graph did­n’t show BMI con­nected to any­thing, so I took a closer look and found I had mis­coded a BMI of 28 as 280!

  • bnlearn, while not as hard to use as I ex­pect­ed, could still use us­abil­ity im­prove­ments: I should not need to co­erce in­te­ger data into ex­actly equiv­a­lent nu­meric types just be­cause bnlearn does­n’t rec­og­nize in­te­gers; and black­list­ing/whitelist­ing needs to be more pow­er­ful—it­er­a­tively gen­er­at­ing graphs and man­u­ally in­spect­ing and man­u­ally black­list­ing is te­dious and does not scale

    • hence, it may make more sense to find a graph us­ing bnlearn and then con­vert it into si­mul­ta­ne­ous-e­qua­tions and ma­nip­u­late it us­ing more ma­ture SEM li­braries

Zeo sleep data

Here I look at my Zeo sleep data; more vari­ables, more com­plex re­la­tions, and more un­known ones, but on the pos­i­tive side, ~12x more data to work with.

zeo <- read.csv("~/wiki/docs/zeo/gwern-zeodata.csv")
zeo$Sleep.Date <- as.Date(zeo$Sleep.Date, format="%m/%d/%Y")

## convert "05/12/2014 06:45" to "06:45"
zeo$Start.of.Night <- sapply(strsplit(as.character(zeo$Start.of.Night), " "), function(x) { x[2] })
## convert "06:45" to 24300
interval <- function(x) { if (! { if (grepl(" s",x)) as.integer(sub(" s","",x))
                                           else { y <- unlist(strsplit(x, ":")); as.integer(y[[1]])*60 + as.integer(y[[2]]); }
                          else NA
zeo$Start.of.Night <- sapply(zeo$Start.of.Night, interval)
## correct for the switch to new unencrypted firmware in March 2013;
## I don't know why the new firmware subtracts 15 hours
zeo[(zeo$Sleep.Date >= as.Date("2013-03-11")),]$Start.of.Night <- (zeo[(zeo$Sleep.Date >= as.Date("2013-03-11")),]$Start.of.Night + 900) %% (24*60)

## after midnight (24*60=1440), Start.of.Night wraps around to 0, which obscures any trends,
## so we'll map anything before 7AM to time+1440
zeo[zeo$Start.of.Night<420 & !$Start.of.Night),]$Start.of.Night <- (zeo[zeo$Start.of.Night<420 & !$Start.of.Night),]$Start.of.Night + (24*60))

zeoSmall <- subset(zeo, select=c(ZQ,Total.Z,,,,,,Awakenings,Start.of.Night,Morning.Feel))
zeoClean <- na.omit(zeoSmall)
# bnlearn doesn't like the 'integer' class that most of the data-frame is in
zeoClean <-, as.numeric))

Prior knowl­edge:

  • Start.of.Night is tem­po­rally first, and can­not be caused
  • is tem­po­rally sec­ond, and can be in­flu­enced by Start.of.Night (likely a con­nec­tion be­tween how late I go to bed and how fast I fall asleep) & (s­ince if it takes 10 min­utes to fall asleep, I must spend ≥10 min­utes in wake) but not oth­ers
  • Morning.Feel is tem­po­rally last, and can­not cause any­thing
  • ZQ is a syn­thetic vari­able in­vented by Zeo ac­cord­ing to an opaque for­mu­la, which can­not cause any­thing but is de­ter­mined by oth­ers
  • Total.Z should be the sum of,, and
  • Awakenings should have an ar­row with but it’s not clear which way it should run
## after a bunch of iteration, blacklisting arrows which violate the prior knowledge
bl <- data.frame(from=c("Morning.Feel", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "ZQ", "", "", "", "Morning.Feel", "Awakenings", "", "Morning.Feel", "Morning.Feel","Total.Z", "", "", "Total.Z", "Total.Z", "Total.Z"),
                 to=c("Start.of.Night", "Total.Z", "", "", "", "Morning.Feel","Start.of.Night", "Start.of.Night","Start.of.Night","Start.of.Night", "", "", "", "Total.Z", "","","", "Start.of.Night", "", "", ""))

zeo.hc <- hc(zeoClean, blacklist=bl)
zeo.iamb         <- iamb(zeoClean, blacklist=bl)
## problem: undirected arc:; since hc inferred [|], I'll copy that for iamb:
zeo.iamb <- set.arc(zeo.iamb, from = "", to = "") <- gs(zeoClean, blacklist=bl)
## same undirected arc: <- set.arc(, from = "", to = "")

## Bigger is better:
score(zeo.iamb, data=zeoClean)
# [1] -44776.79185
score(, data=zeoClean)
# [1] -44776.79185
score(zeo.hc, data=zeoClean)
# [1] -44557.6952
## hc scores best, so let's look at it:
#   Bayesian network learned via Score-based methods
#   model:
#    [Start.of.Night][|Start.of.Night][|]
#    [|][|]
#    [Total.Z|][|]
#    [Awakenings|]
#    [Morning.Feel|]
#    [ZQ|]
#   nodes:                                 10
#   arcs:                                  28
#     undirected arcs:                     0
#     directed arcs:                       28
#   average markov blanket size:           7.40
#   average neighbourhood size:            5.60
#   average branching factor:              2.80
#   learning algorithm:                    Hill-Climbing
#   score:                                 BIC (Gauss.)
#   penalization coefficient:              3.614556939
#   tests used in the learning procedure:  281
#   optimized:                             TRUE


fit <-, zeoClean); fit
#   Bayesian network parameters
#   Parameters of node ZQ (Gaussian distribution)
# Conditional density: ZQ | Total.Z + + + + Awakenings
# Coefficients:
#    (Intercept)         Total.Z      Awakenings
# -0.12468522173   0.14197043518  -0.07103211437   0.07053271816   0.21121000076  -0.56476256303
# Standard deviation of the residuals: 0.3000223604
#   Parameters of node Total.Z (Gaussian distribution)
# Conditional density: Total.Z | + Start.of.Night
# Coefficients:
#    (Intercept)  Start.of.Night
# 907.6406157850   -0.4479377278   -0.2680771514
# Standard deviation of the residuals: 68.90853885
#   Parameters of node (Gaussian distribution)
# Conditional density: | Start.of.Night
# Coefficients:
#    (Intercept)  Start.of.Night
# -1.02898431407   0.01568450832
# Standard deviation of the residuals: 13.51606719
#   Parameters of node (Gaussian distribution)
# Conditional density: |
# Coefficients:
#   (Intercept)
# 14.7433880499   0.3289378711
# Standard deviation of the residuals: 19.0906685
#   Parameters of node (Gaussian distribution)
# Conditional density: | Total.Z + Start.of.Night
# Coefficients:
#      (Intercept)           Total.Z    Start.of.Night
# -120.62442964234     0.37864195651     0.06275760841
# Standard deviation of the residuals: 19.32560757
#   Parameters of node (Gaussian distribution)
# Conditional density: | Total.Z + +
# Coefficients:
#   (Intercept)        Total.Z
#  0.6424267863   0.9997862624  -1.0000587988  -1.0001805537
# Standard deviation of the residuals: 0.5002896274
#   Parameters of node (Gaussian distribution)
# Conditional density: | Total.Z +
# Coefficients:
#   (Intercept)        Total.Z
# 15.4961459056   0.1283622577  -0.1187382535
# Standard deviation of the residuals: 11.90756843
#   Parameters of node Awakenings (Gaussian distribution)
# Conditional density: Awakenings | + + + + Start.of.Night
# Coefficients:
#     (Intercept)
# -18.41014329148    0.02605164827    0.05736596152    0.02291139969    0.01060661963
#  Start.of.Night
#   0.01129521977
# Standard deviation of the residuals: 2.427868657
#   Parameters of node Start.of.Night (Gaussian distribution)
# Conditional density: Start.of.Night
# Coefficients:
# (Intercept)
# 1413.382886
# Standard deviation of the residuals: 64.43144125
#   Parameters of node Morning.Feel (Gaussian distribution)
# Conditional density: Morning.Feel | Total.Z + + + + Start.of.Night
# Coefficients:
#     (Intercept)          Total.Z
# -0.924662971061   0.004808652252  -0.010127269154  -0.008636841492  -0.002766602019
#  Start.of.Night
#  0.001672816480
# Standard deviation of the residuals: 0.7104115719

## some issues with big residuals at the extremes in the variables,, and;
## not sure how to fix those

Zeo.model1 <- ' ~ Start.of.Night ~ Total.Z +
    Awakenings ~ + + + + Start.of.Night ~ + Start.of.Night ~ + Start.of.Night ~ + + Start.of.Night
    Total.Z ~ + +
    ZQ ~ Total.Z + + + + Awakenings
    Morning.Feel ~ Total.Z + + + + Start.of.Night
Zeo.fit1 <- sem(model = Zeo.model1,  data = zeoClean)
# lavaan (0.5-16) converged normally after 183 iterations
#   Number of observations                          1379
#   Estimator                                         ML
#   Minimum Function Test Statistic               22.737
#   Degrees of freedom                                16
#   P-value (Chi-square)                           0.121
# Parameter estimates:
#   Information                                 Expected
#   Standard Errors                             Standard
#                    Estimate  Std.err  Z-value  P(>|z|)
# Regressions:
# ~
#     Start.of.Nght     0.016    0.006    2.778    0.005
# ~
#     Total.Z          -0.026    0.007   -3.592    0.000
#         0.314    0.038    8.277    0.000
#   Awakenings ~
#         0.026    0.005    5.233    0.000
#      0.057    0.003   16.700    0.000
#       0.023    0.002   10.107    0.000
#     0.011    0.002    6.088    0.000
#     Start.of.Nght     0.011    0.001   10.635    0.000
# ~
#        -0.348    0.085   -4.121    0.000
#     Start.of.Nght    -0.195    0.018  -10.988    0.000
# ~
#     0.358    0.018   19.695    0.000
#     Start.of.Nght     0.034    0.013    2.725    0.006
# ~
#       0.081    0.012    6.657    0.000
#     0.034    0.009    3.713    0.000
#     Start.of.Nght    -0.017    0.006   -3.014    0.003
#   Total.Z ~
#       1.000    0.000 2115.859    0.000
#     1.000    0.000 2902.045    0.000
#      1.000    0.001  967.322    0.000
#   ZQ ~
#     Total.Z           0.142    0.000  683.980    0.000
#     -0.071    0.000 -155.121    0.000
#       0.071    0.000  167.090    0.000
#      0.211    0.001  311.454    0.000
#     Awakenings       -0.565    0.003 -178.407    0.000
#   Morning.Feel ~
#     Total.Z           0.005    0.001    8.488    0.000
#        -0.010    0.001   -6.948    0.000
#     -0.009    0.001   -8.592    0.000
#    -0.003    0.001   -2.996    0.003
#     Start.of.Nght     0.002    0.000    5.414    0.000

Again no ma­jor sur­pris­es, but one thing I no­tice is that ZQ does not seem to con­nect to, though does con­nect to Morning.Feel; I’ve long sus­pected that ZQ is a flawed sum­mary and thought it was in­suffi­ciently tak­ing into ac­count wakes or some­thing else, so it looks like it’s specifi­cally which is miss­ing. Start.of.night also is more highly con­nected than I had ex­pect­ed.

Com­par­ing graphs from the 3 al­go­rithms, they don’t seem to differ as badly as the weight ones did. Is this thanks to the much greater data or the con­straints?

Genome sequencing costs

# converted to CSV & deleted cost per base (less precision); CSV looks like:
## Date, Cost per Genome
## Sep-01,"$95,263,072"
## ...
sequencing <- read.csv("sequencing_costs_apr2014.csv")
sequencing$Cost.per.Genome <- as.integer(gsub(",", "", sub("\\$", "", as.character(sequencing$Cost.per.Genome))))
# interpret month-years as first of month:
sequencing$Date <- as.Date(paste0("01-", as.character(sequencing$Date)), format="%d-%b-%y")
##         Date Cost.per.Genome
## 1 2001-09-01        95263072
## 2 2002-03-01        70175437
## 3 2002-09-01        61448422
## 4 2003-03-01        53751684
## 5 2003-10-01        40157554
## 6 2004-01-01        28780376

l <- lm(log(Cost.per.Genome) ~ Date, data=sequencing); summary(l)
## Coefficients:
##                 Estimate   Std. Error  t value   Pr(>|t|)
## (Intercept) 50.969823683  1.433567932  35.5545 < 2.22e-16
## Date        -0.002689621  0.000101692 -26.4486 < 2.22e-16
## Residual standard error: 0.889707 on 45 degrees of freedom
## Multiple R-squared:  0.939559,   Adjusted R-squared:  0.938216
## F-statistic: 699.528 on 1 and 45 DF,  p-value: < 2.22e-16
plot(log(Cost.per.Genome) ~ Date, data=sequencing)
# as expected: linear in log (Moore's law) 2002-2008, sudden drop, return to Moore's law-ish ~December 2011?
# but on the other hand, maybe the post-December 2011 behavior is a continuation of the curve
# 2 break-points / 3 segments:
piecewise <- segmented(l, seg.Z=~Date, psi=list(Date=c(13970, 16071)))
## Estimated Break-Point(s):
##             Est. St.Err
## psi1.Date 12680 1067.0
## psi2.Date 13200  279.8
## t value for the gap-variable(s) V:  0 0 2
## Meaningful coefficients of the linear terms:
##                 Estimate   Std. Error  t value   Pr(>|t|)
## (Intercept) 35.841699121  8.975628264  3.99322 0.00026387
## Date        -0.001504431  0.000738358 -2.03754 0.04808491
## U1.Date      0.000679538  0.002057940  0.33020         NA
## U2.Date     -0.002366688  0.001926528 -1.22847         NA
## Residual standard error: 0.733558 on 41 degrees of freedom
## Multiple R-Squared: 0.962565,  Adjusted R-squared:   0.958
with(sequencing, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewise, add=T)
# The first two segments look fine, but the residuals are clearly bad for the third line-segment:
# it undershoots (damaging the second segment's fit), overshoots, then undershoots again. Let's try again with more breakpoints:

lots <- segmented(l, seg.Z=~Date, psi=list(Date=NA), control=seg.control(stop.if.error=FALSE, n.boot=0))
summary(segmented(l, seg.Z=~Date, psi=list(Date=as.Date(c(12310, 12500, 13600, 13750,  14140,  14680,  15010, 15220), origin = "1970-01-01", tz = "EST"))))
# delete every breakpoint below t-value of ~|2.3|, for 3 breakpoints / 4 segments:
piecewise2 <- segmented(l, seg.Z=~Date, psi=list(Date=as.Date(c("2007-08-25","2008-09-18","2010-03-12"))))
with(sequencing, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewise2, add=T)

# the additional break-point is used up on a better fit in the curve. It looks like an exponential decay/asymptote,
# so let's work on fitting that part of the graph, the post-2007 curve:
sequencingRecent <- sequencing[sequencing$Date>as.Date("2007-10-01"),]
lR <- lm(log(Cost.per.Genome) ~ Date, data=sequencingRecent); summary(lR)
piecewiseRecent <- segmented(lR, seg.Z=~Date, psi=list(Date=c(14061, 16071))); summary(piecewiseRecent)
## Estimated Break-Point(s):
##             Est. St.Err
## psi1.Date 14290  36.31
## psi2.Date 15290  48.35
## t value for the gap-variable(s) V:  0 0
## Meaningful coefficients of the linear terms:
##                 Estimate   Std. Error   t value   Pr(>|t|)
## (Intercept)  1.13831e+02  6.65609e+00  17.10182 2.0951e-13
## Date        -7.13247e-03  4.73332e-04 -15.06865 2.2121e-12
## U1.Date      4.11492e-03  4.94486e-04   8.32161         NA
## U2.Date      2.48613e-03  2.18528e-04  11.37668         NA
## Residual standard error: 0.136958 on 20 degrees of freedom
## Multiple R-Squared: 0.995976,  Adjusted R-squared: 0.994971

with(sequencingRecent, plot(Date, log(Cost.per.Genome), pch=16)); plot(piecewiseRecent, add=T)

lastPiece <- lm(log(Cost.per.Genome) ~ Date, data=sequencingRecent[as.Date(15290, origin = "1970-01-01", tz = "EST")<sequencingRecent$Date,]); summary(lastPiece)
## Coefficients:
##                 Estimate   Std. Error  t value   Pr(>|t|)
## (Intercept) 17.012409648  1.875482507  9.07095 1.7491e-05
## Date        -0.000531621  0.000119056 -4.46528  0.0020963
## Residual standard error: 0.0987207 on 8 degrees of freedom
## Multiple R-squared:  0.71366,    Adjusted R-squared:  0.677867
with(sequencingRecent[as.Date(15290, origin = "1970-01-01", tz = "EST") < sequencingRecent$Date,],
    plot(Date, log(Cost.per.Genome), pch=16)); abline(lastPiece)

predictDays <- seq(from=sequencing$Date[1], to=as.Date("2030-12-01"), by="month")
lastPiecePredict <- data.frame(Date = predictDays, Cost.per.Genome=c(sequencing$Cost.per.Genome, rep(NA, 305)), Cost.per.Genome.predicted = exp(predict(lastPiece, newdata = data.frame(Date = predictDays))))

nlmR <- nls(log(Cost.per.Genome) ~ SSasymp(as.integer(Date), Asym, r0, lrc), data=sequencingRecent); summary(nlmR)
## Parameters:
##          Estimate   Std. Error    t value Pr(>|t|)
## Asym  7.88908e+00  1.19616e-01   65.95328   <2e-16
## r0    1.27644e+08  1.07082e+08    1.19203   0.2454
## lrc  -6.72151e+00  5.05221e-02 -133.04110   <2e-16
## Residual standard error: 0.150547 on 23 degrees of freedom
with(sequencingRecent, plot(Date, log(Cost.per.Genome))); lines(sequencingRecent$Date, predict(nlmR), col=2)

# side by side:
with(sequencingRecent, plot(Date, log(Cost.per.Genome), pch=16))
plot(piecewiseRecent, add=TRUE, col=2)
lines(sequencingRecent$Date, predict(nlmR), col=3)
# as we can see, the 3-piece linear fit and the exponential decay fit identically;
# but exponential decay is more parsimonious, IMO, so I prefer that.

predictDays <- seq(from=sequencingRecent$Date[1], to=as.Date("2020-12-01"), by="month")
data.frame(Date = predictDays, Cost.per.Genome.predicted = exp(predict(nlmR, newdata = data.frame(Date = predictDays))))

http://www.un­z.­com/gnx­p/the-in­tel-of-se­quenc­ing/#­com­men­t-677904 http­s://bio­mick­wat­son.­word­press.­com/2015/03/25/the-cost-of-se­quenc­ing-is-stil­l-go­ing-down/

Genome se­quenc­ing his­tor­i­cally has dropped in price ~18% per year. Con­sider this sim­ple sce­nar­io: if we have a fixed amount of money to spend buy­ing genomes, and we can afford to buy 1 genome in the first year, then the next year we can buy 1.21 genomes, then 1.48 genomes and so on and in 30 years we can afford to buy 385 genomes each year. The num­ber we can afford in year x is:

sapply(0:30, function(x) 1/(0.82^x))
#  [1]   1.000000000   1.219512195   1.487209994   1.813670724   2.211793566   2.697309227   3.289401497   4.011465240   4.892030780
# [10]   5.965891196   7.275477068   8.872533010  10.820162207  13.195319764  16.091853371  19.624211428  23.931965156  29.185323361
# [19]  35.591857758  43.404704583  52.932566564  64.551910444  78.721842005  96.002246348 117.075910180 142.775500220 174.116463682
# [28] 212.337150832 258.947744917 315.789932826 385.109674178

Genomes are un­like com­pu­ta­tion, though, as they are data rather than an ephemeral ser­vice. Each genome is still use­ful and ac­cu­mu­lates in a data­base. How many genomes to­tal do we have each year? Quite a lot:

cumsum(sapply(0:30, function(x) 1/(0.82^x)))
#  [1]    1.000000000    2.219512195    3.706722189    5.520392914    7.732186480   10.429495707   13.718897204   17.730362444
#  [9]   22.622393224   28.588284420   35.863761488   44.736294497   55.556456704   68.751776468   84.843629839  104.467841268
# [17]  128.399806424  157.585129785  193.176987543  236.581692126  289.514258690  354.066169134  432.788011139  528.790257487
# [25]  645.866167667  788.641667886  962.758131569 1175.095282401 1434.043027318 1749.832960144 2134.942634322

While ini­tially there’s not much of a pile to con­cern our­selves with, even­tu­ally we have 2000+ genomes while still only pro­duc­ing <400 genomes that year, a fac­tor of 5 differ­ence. (As it hap­pens, if you con­sider UKBB at n = 500k pro­duced as a sin­gle in­vest­ment 2012-2017, 23andMe in 2017 is re­port­edly n = 2-2.5m, so this 5x mul­ti­plier is about right.)

23andMe started back in 2007 or so offer­ing $1,322$10002007 SNP pan­els to a few thou­sand peo­ple, grow­ing to ~1m by 8 years later in July 2015. To re­pro­duce that in this model of con­stant in­vest­ment we start with a base of 56k SNPs pur­chased per year, grow­ing ac­cord­ing to the cost de­crease:

cumsum(sapply(0:7, function(x) (56000*1)/(0.82^x)))
# [1]  56000.0000 124292.6829 207576.4426 309142.0032 433002.4429 584051.7596 768258.2434 992900.2969

What does that yield by 10 years later (2017) or 20 years later (2027)? It yields: 1.6m (1,600,943) and 16.2m (16,212,798) re­spec­tive­ly.

Even if we as­sumed that an­nual genomes/SNPs lev­eled off in 2017, the lin­ear in­crease pushes us into the mil­lions range rapid­ly:

annualStagnation <- sapply(0:30, function(x) min(334089, (56000*1)/(0.82^x)))
#  [1]   56000.0000  124292.6829  207576.4426  309142.0032  433002.4429  584051.7596  768258.2434  992900.2969 1266854.0206 1600943.0206
# [11] 1935032.0206 2269121.0206 2603210.0206 2937299.0206 3271388.0206 3605477.0206 3939566.0206 4273655.0206 4607744.0206 4941833.0206
# [21] 5275922.0206 5610011.0206 5944100.0206 6278189.0206 6612278.0206 6946367.0206 7280456.0206 7614545.0206 7948634.0206 8282723.0206
# [31] 8616812.0206
data.frame(Year=2007:2037, total=round(totalStagnation))
# Year   total
# 2007   56000
# 2008  124293
# 2009  207576
# 2010  309142
# 2011  433002
# 2012  584052
# 2013  768258
# 2014  992900
# 2015 1266854
# 2016 1600943
# 2017 1935032
# 2018 2269121
# 2019 2603210
# 2020 2937299
# 2021 3271388
# 2022 3605477
# 2023 3939566
# 2024 4273655
# 2025 4607744
# 2026 4941833
# 2027 5275922
# 2028 5610011
# 2029 5944100
# 2030 6278189
# 2031 6612278
# 2032 6946367
# 2033 7280456
# 2034 7614545
# 2035 7948634
# 2036 8282723
# 2037 8616812

So even if no ad­di­tional funds per year start get­ting spent on ge­nomics de­spite the in­creas­ing util­ity and the cost curve re­mains the same, the cu­mu­la­tive num­ber of SNPs or whole-genomes will in­crease dras­ti­cally over the next 30 years. Genomes on their own have many us­es, such as de­tect­ing hu­man evo­lu­tion, al­low­ing bet­ter im­pu­ta­tion pan­els, in­fer­ring pop­u­la­tion struc­ture, count­ing vari­ants, de­tect­ing par­tic­u­larly lethal mu­ta­tions etc, but of course their main use is trait pre­dic­tion. Given the in­creas­es, we would ex­pect large enough n for Hsu’s lasso to un­dergo phase tran­si­tion and re­cover nearly the full SNP her­i­tabil­ity (see ); the bot­tle­neck in­creas­ingly will not be genomes but phe­no­typic mea­sure­ments.

Proposal: hand-counting mobile app for more fluid group discussions

Groups use vot­ing for de­ci­sion-mak­ing, but ex­ist­ing vote sys­tems are cum­ber­some. Hand-rais­ing is faster, but does not scale be­cause hand-count­ing hands is slow. Ad­vances in ma­chine vi­sion may make it pos­si­ble for AI to count hands in pho­tos ac­cu­rate­ly. Com­bined with a smart­phone’s cam­era, this could yield an app for fast vot­ing in even large groups.

Medi­um-large (>10 peo­ple) groups face a prob­lem in reach­ing con­sen­sus: bal­lot or pen-and-pa­per vot­ing is suffi­ciently slow and clunky that it is too costly to use for any­thing but the most im­por­tant dis­cus­sions. A group is forced to adopt other dis­cus­sion norms and save a for­mal vote for only the fi­nal de­ci­sion, and even then the long de­lay kills a lot of en­thu­si­asm and in­ter­est. Vot­ing could be used for many more de­ci­sions if it could be faster, and of course all ex­ist­ing group votes would ben­e­fit from in­creased speed. (I am re­minded of anime con­ven­tions and film fes­ti­vals where, par­tic­u­larly for short films such as AMVs, one seems to spend more time fill­ing out a bal­lot & pass­ing them along aisles & the staff painfully count­ing through each bal­lot by hand than one ac­tu­ally spends watch­ing the me­dia in ques­tion!)

It would be bet­ter if vot­ing could be as flu­ent and easy as sim­ply rais­ing your hand like in a small group such as a class­room—a mech­a­nism which makes it so easy to vote that votes can be held as fast as the al­ter­na­tives can be spo­ken aloud and a glance suffices to count (an alert group could vote on 2 or 3 top­ics in the time it takes to read this sen­tence). But hand-rais­ing, as great as it is, suffers from the flaw that it does not scale due to the count­ing prob­lem: a group of 500 peo­ple can raise their hands as eas­ily as a group of 50 or 5, but it takes far too long to count ~250 hands: the per­son count­ing will quickly tire of the te­di­um, they will make mis­takes count­ing, and this puts a se­ri­ous lag on each vote, a lag which in­creases lin­early with the num­ber of vot­ers. (Hands can be easy to ap­prox­i­mate if al­most every­one votes for or against some­thing, but if con­sen­sus is so over­whelm­ing, one does­n’t need to vote in the first place! The hard case of al­most-bal­anced votes is the most im­por­tant case.)

One might sug­gest us­ing an en­tirely differ­ent strat­e­gy: a web­site with HTML polls or lit­tle clicker giz­mos like used in some col­lege lec­tures to ad­min­is­ter quick quizzes. This have the down­sides that they re­quire po­ten­tially ex­pen­sive equip­ment (I used a clicker in one class and I think it cost at least $20, so if a con­ven­tion wanted to use that in an au­di­ence of hun­dreds, that’s a ma­jor up­front cost & my ex­pe­ri­ence was that click­ers were un­in­tu­itive, did not al­ways work, and slowed things down if any­thing; a web­site would only work if you as­sume every­one has smart­phones and is will­ing to pull them out to use at an in­stance’s no­tice and of course that there’s work­ing WiFi in the room, which can­not be taken for grant­ed) and con­sid­er­able over­head in ex­plain­ing to every­one how it works and get­ting them on the same page and mak­ing sure every per­son who wan­ders in also gets the mes­sage. (If any­one is go­ing to be bur­dened with un­der­stand­ing or us­ing a new sys­tem, it should be the hand­ful of con­fer­ence/fes­ti­val/­group or­ga­niz­ers, not the en­tire au­di­ence!) A sim­pler ap­proach than hands would be spe­cial­ly-printed pa­per us­ing, for ex­am­ple, QR codes like pi­Cards, which can then be rec­og­nized by stan­dard sim­ple com­puter vi­sion tech­niques; this is much cheaper than click­ers but still re­quires con­sid­er­able setup & in­con­ve­nience. It’s hard to imag­ine a film fes­ti­val run­ning us­ing any sys­tem, and diffi­cult to see these sys­tems im­prov­ing on pen-and-pa­per bal­lots which at least are cheap, rel­a­tively straight­for­ward, and well-known.

Hand-count­ing re­ally does seem like the best so­lu­tion, if only the count­ing could be fixed. Count­ing is some­thing com­put­ers do fast, so that is the germ of an idea. What if a smart­phone could count the votes? You don’t want a smart­phone app on the en­tire au­di­ences’ phones, of course, since that’s even worse than hav­ing every­one go to a web­site to vote; but ma­chine vi­sion has made enor­mous strides in the 2000s-2010s, reach­ing hu­man-e­quiv­a­lent per­for­mance on chal­leng­ing im­age recog­ni­tion con­tests like Im­a­geNet. (Ma­chine vi­sion is com­pli­cat­ed, but the im­por­tant thing is that it’s the kind of com­pli­cated which can be out­sourced to some­one else and turned into a dead­-easy-to-use app, and the bur­den does not fall on the pri­mary user­s—the au­di­ence.) What if the or­ga­nizer had an app which took a photo of the en­tire au­di­ence with lifted arms and counted hands & faces and re­turned a vote count in a sec­ond?

Such an app would be ideal for any cul­tur­al, po­lit­i­cal, or or­ga­ni­za­tional meet­ing. Now the flow for, eg, a film fes­ti­val could go: [no ex­pla­na­tion given to au­di­ence, one just starts] “OK, how many peo­ple liked the first short, ‘Vam­pire Deli’ by Ms Hous­ton?” [ev­ery­one raises hand, smart­phone flash­es, 1s pass­es] “OK, 140 votes. How many liked the sec­ond short, ‘Cthu­li­cious’ by Mr Ious­ton?” [raises hands, smart­phone flash­es, 1s pass­es] “OK… 130 peo­ple. Con­grat­u­la­tions Ms Hous­ton!” And so on.

Such an app might be con­sid­ered an in­fea­si­ble ma­chine vi­sion task, but I be­lieve it could be fea­si­ble: fa­cial lo­cal­iza­tion is an old and well-s­tud­ied im­age recog­ni­tion task (and effec­tive al­go­rithms are built into every con­sumer cam­er­a), hand­s/fin­gers have very dis­tinct shapes, and both tasks seem eas­ier than the sub­tle dis­crim­i­na­tions be­tween, say, var­i­ous dog breeds de­manded of Im­a­geNet con­tes­tants.

Specifi­cal­ly, one could im­ple­ment the ma­chine vi­sion core as fol­lows:

  1. mul­ti­layer neural net­works trained for one task can be re­pur­posed to sim­i­lar tasks by re­mov­ing the high­est layer and re­train­ing on the new task, po­ten­tially reap­ing great per­for­mance gains as the hy­brid net­work has al­ready learned much of what it needs for the sec­ond task (“trans­fer learn­ing”). So one could take a pub­licly avail­able NN trained for Im­a­geNet (such as AlexNet, avail­able in caffe), re­move the top two lay­ers, and re­train on a dataset of au­di­ences; this will per­form bet­ter since the orig­i­nal NN has al­ready learned how to de­tect edges, rec­og­nize faces, etc

    The sim­pler task of count­ing crowds has al­ready shown it­self sus­cep­ti­ble to deep learn­ing: eg “Cross-scene Crowd Count­ing via Deep Con­vo­lu­tional Neural Net­works”.

  2. raid Flickr and Google Im­ages for CC-li­censed pho­tos of au­di­ences rais­ing their arms; then one can man­u­ally count how many arms are raised (or out­source to Ama­zon Me­chan­i­cal Turk). With the boost from a trans­ferred con­vo­lu­tional deep net­work, one might get good per­for­mance with just a few thou­sand pho­tos to train with. If each photo takes a minute to ob­tain and count, then one can cre­ate a use­ful cor­pus in a week or two of work.

  3. train the NN, ap­ply­ing the usual data aug­men­ta­tion tricks to in­crease one’s mea­ger cor­pus, try­ing out ran­dom hy­per­pa­ra­me­ters, tweak­ing the ar­chi­tec­ture, etc

    (Note that while NNs are very slow and com­pu­ta­tion­ally in­ten­sive to train, they are typ­i­cally quite fast to run; the smart­phone app would not be train­ing a NN, which is in­deed com­pletely in­fea­si­ble from a CPU and bat­tery life stand­point—it is merely run­ning the NN cre­ated by the orig­i­nal de­vel­op­er.)

  4. with an ac­cu­rate NN, one can wrap it in a mo­bile app frame­work. The UI, at the sim­plest, is sim­ply a big but­ton to press to take a pho­to, feed it into the NN, and dis­play the count. Some ad­di­tional fea­tures come to mind:

    • “head­count mode”: one may not be in­ter­ested in a vote, but in how many peo­ple are in an au­di­ence (to es­ti­mate how pop­u­lar a guest is, whether an event needs to move to a new big­ger space, etc). If the NN can count faces and hands to es­ti­mate a vote count, it can sim­ply re­port the count of faces in­stead.

    • the app should save every photo & count, both as an au­dit trail and also to sup­port post-vote re­counts in case of dis­putes or a de­sire for a more de­fin­i­tive count

    • the re­ported count should come with an in­di­ca­tion of the NN’s un­cer­tain­ty/er­ror-rate, so users are not mis­led by their lit­tle hand­held or­a­cle and so they can redo a vote if the choice is bor­der­line; Bayesian meth­ods, in which pre­vi­ous votes are drawn up­on, might be rel­e­vant here.

      • if the orig­i­nal photo could be an­no­tated with graph­i­cal notes for each rec­og­nized/­counted hand & face, this would let the user ‘see’ what the NN is think­ing and would help build con­fi­dence a great deal
    • it should sup­port man­u­ally en­ter­ing in a vote-count; if the man­ual count differs, then this in­di­cates the NN made an er­ror and the photo & count should be up­loaded to the orig­i­nal de­vel­oper so it can be added to the cor­pus and the NN’s per­for­mance fixed in fu­ture re­leases of the app

    • smart­phone cam­eras may not be high­-res­o­lu­tion or have a suffi­ciently wide field­-of-view to cap­ture the en­tire au­di­ence at on­ce; some sort of “mon­tage mode” should ex­ist so the user can swing the phone across the au­di­ence, bursts of shots tak­en, and the over­lap­ping pho­tos stitched to­gether into a sin­gle au­di­ence photo which can be then fed into the NN as usual

    • a burst of pho­tos might be su­pe­rior to a sin­gle photo due to smart­phone & hand move­ment blur; I don’t know if it’s best to try to com­bine the pho­tos, run the NN mul­ti­ple times and take the me­di­an, or feed mul­ti­ple pho­tos into the NN (per­haps by mov­ing to a RNN ar­chi­tec­ture?)

    • the ful­l-strength NN might still be too slow and en­er­gy-hun­gry to run pleas­antly on a smart­phone; there are model com­pres­sion to re­duce the num­ber of nodes or with­out los­ing much per­for­mance, which might be use­ful in this con­text (and in­deed, were orig­i­nally mo­ti­vated by want­ing to make speech-recog­ni­tion run bet­ter on smart­phones)

Given this break­down, one might es­ti­mate build­ing such an app as re­quir­ing, as­sum­ing one is al­ready rea­son­ably fa­mil­iar with deep net­works & writ­ing mo­bile apps:

  1. 1 week to find an Im­a­geNet NN, learn how to mod­ify it, and set it up to train on a fresh cor­pus
  2. 3 weeks to cre­ate a cor­pus of <5000 pho­tos with man­u­al­ly-la­beled hand counts
  3. 5 weeks to train the NN (NNs as large as Im­a­geNet NNs take weeks to train; de­pend­ing on the GPU hard­ware one has ac­cess to and how many tweaks and hy­per­pa­ra­me­ters one tries, 5 weeks could be dras­ti­cally op­ti­mistic; but on the plus side, it’s mostly wait­ing as the GPUs suck elec­tric­ity like crazy)
  4. 5 weeks to make an in­tu­itive sim­ple app, sub­mit­ting to an app store, etc
  5. These es­ti­mates are loose and prob­a­bly too op­ti­mistic (although I would be sur­prised if it took a good de­vel­oper more than 6 months to de­velop this ap­p), but that would sug­gest >14 weeks or 784 hours of work for a de­vel­op­er, start to fin­ish. (Even at min­i­mum wage, this rep­re­sents a sub­stan­tial de­vel­op­ment cost of >$6k; at more plau­si­ble de­vel­oper salaries, eas­ily >$60k of salary.)

How large is the mar­ket for such an app? Groups such as anime con­ven­tions or any­thing on a col­lege cam­pus are cheap­skates and would balk at a price higher than $4.99 (even if only 5 or 10 staffers need to buy it and it makes the ex­pe­ri­ence much smoother). There are prob­a­bly sev­eral hun­dred anime or video game con­ven­tions which might use this to vote, so that might be 1000 sales there. There’s eas­ily 13,000 busi­ness con­ven­tions or con­fer­ences in the USA, which might not need vot­ing so much, but would be at­tracted by a head­count mode to help op­ti­mize their event. This sug­gests per­haps $70k in sales with much less profit after the app store cut & tax­es, much of which sales would prob­a­bly be one-offs as the user reuses it for each con­fer­ence. So even a wild suc­cess, in which most events adopt use of such vot­ing soft­ware, would barely re­coup the de­vel­op­ment costs; as a pro­duct, it seems this is just too much of a niche un­less one could de­velop it much faster (such as by find­ing an ex­ist­ing cor­pus of hand­s/pho­tos, or be cer­tain of bang­ing out the mo­bile app in much less than I es­ti­mat­ed), find a larger mar­ket (the­aters for au­di­ence par­tic­i­pa­tion?), or in­crease price sub­stan­tially (10x the price and aim at only busi­ness­es?).

Air conditioner replacement

Is my old air con­di­tioner in­effi­cient enough to re­place? After cal­cu­lat­ing elec­tric­ity con­sump­tion for it and a new air con­di­tion­er, with dis­count­ing, and with un­cer­tainty in pa­ra­me­ters eval­u­ated by a Monte Carlo method, I con­clude that the sav­ings are too small by an or­der of mag­ni­tude to pay for a new re­place­ment air con­di­tion­er.

I have an old Whirlpool air con­di­tioner (AC) in my apart­ment, and as part of in­su­lat­ing and cool­ing my apart­ment, I’ve won­dered if the AC should be re­placed on en­ergy effi­ciency grounds. Would a new AC save more than it costs up­front? What is the op­ti­mal de­ci­sion here?

Ini­tially I was balked in analy­sis be­cause I could­n’t fig­ure out what model it was, and thus any­thing about it like its en­ergy effi­cien­cy. (No model num­ber or name ap­pears any­where vis­i­ble on it, and I’m not go­ing to rip it out of the wall just to look at hid­den part­s.)


So I be­gan look­ing at all the old Whirlpool AC pho­tographs in Google, and even­tu­ally I found one whose ap­pear­ance ex­actly matches mine and which was re­leased around when I think the AC was in­stalled. The old AC is the “Whirlpool ACQ189XS” (offi­cial) (cost: $0, sunk cost), which is claimed to have an EER of 10.7.

For com­par­ison, I browsed Ama­zon look­ing for high­ly-rated AC mod­els with at least 5000 BTU cool­ing power and cost­ing $250-$300, pick­ing out the Sun­pen­town WA-8022S 8000 BTU Win­dow Air Con­di­tioner ($271) with 11.3 EER. (Check­ing some other en­tries on Ama­zon, this is fairly rep­re­sen­ta­tive on both cost & EER.)

Ques­tion: what is the elec­tri­cal sav­ings and hence the pay­back pe­riod of a new AC?

The effi­ciency unit here is the EER or en­ergy effi­ciency ra­tio, de­fined as BTUs (amount of heat be­ing moved by the AC) di­vided by watts con­sumed. Here we have ACs with 10.7 EER vs 11.2 EER; I need ~10k BTUs to keep the apart­ment cool (after fix­ing a lot of cracks, in­stalling an at­tic fan and two box fans, putting tin foil over some win­dows, in­su­la­tion un­der a floor etc), so the ACs will use up , and then x = 898 watts and 934 watts re­spec­tive­ly.

(EER is a lot like /MPG as a mea­sure of effi­cien­cy, and shares the same draw­backs: from a cost-per­spec­tive, EER/MPG don’t nec­es­sar­ily tell you what you want to know and can be mis­lead­ing and harder to work with than if effi­ciency were re­ported as, say, gal­lons per mile. As watts per BTU or gal­lons per mile, it is easy to see that after a cer­tain point, the cost differ­ences have be­come ab­solutely small enough that im­prove­ments are not worth pay­ing for. Go­ing from 30 gal­lons of gas to 15 gal­lons of gas is worth more than go­ing from 3 gal­lons to 1.5 gal­lons, even if the rel­a­tive im­prove­ment is the same.)

So while op­er­at­ing, the two ACs will use 898 watts vs 934 watts or 0.89kWh vs 0.934kWh to cool; a differ­ence of 36 watts or 0.036k­Wh.

Each kWh costs around $0.09 so the cost-d­iffer­ence is $0.00324 per hour.

AC is on May-Sep­tem­ber (5 month­s), and on al­most all day al­though it only runs in­ter­mit­tent­ly, so say a third of the day or 8 hours, for a to­tal of 1200 hours of op­er­a­tion.


Thus, then the an­nual ben­e­fit from switch­ing to the new AC with 11.2 EER is or $3.9.

The cost is $271 amor­tized over n years. At $3.9 a year, it will take an­nu­ally = 68 years to pay­back (ig­nor­ing break­age and dis­count­ing/in­ter­est/op­por­tu­ni­ty-cost). This is not good.

De­ci­sion: do not re­place.


To bring in dis­count­ing/in­ter­est: For what an­nual pay­ment (cost-sav­ings) would we be will­ing to pay the price of a new AC? More specifi­cal­ly, if it costs $271 and has an av­er­age pay­out pe­riod of 7 years, then at my usual an­nual dis­count rate of 5%, how much must each pay­out be?

r turns out to be ≥$46.83, which sounds about right. (Dis­count­ing pe­nal­izes fu­ture sav­ings, so r should be greater than or $39, which it is.)

$47 is 12x larger than the es­ti­mated sav­ings of $3.9, so the con­clu­sion re­mains the same.

We could also work back­ward to fig­ure out what EEC would jus­tify an up­grade by treat­ing it as an un­known e and solv­ing for it; let’s say it must pay­back in 7 years (I doubt av­er­age AC life­time is much longer) at least $271, with the same kWh & us­age as be­fore, what must the ri­val EEC be? as an equa­tion:

and solv­ing,

I am pretty sure there are no ACs with EER>20!

An­other way to look at it: if a new good AC costs ~$300 and I ex­pect it to last ~7 years, then that’s an an­nual cost of $43. The cur­rent AC’s to­tal an­nual cost to run is or . So it’s im­me­di­ately clear that the en­ergy sav­ings must be huge—half!—be­fore it can hope to jus­tify a new pur­chase.

Sensitivity analysis

The above analy­ses were done with point-es­ti­mates. It’s only fair to note that there’s a lot of un­cer­tainty lurk­ing in those es­ti­mates: $0.09 was just the me­dian of the es­ti­mates I found for my state’s elec­tric­ity rates, the AC might be on 4 or 6 months, the hours per day might be con­sid­er­ably higher (or low­er) than my guess of 8 hours, 10.7 & 11.2 EERs are prob­a­bly best-case es­ti­mates and the real effi­cien­cies lower (they’re al­ways lower than nom­i­nal), the dis­count rate may be a per­cent lower or higher and so min­i­mum sav­ings would be off by as much as $4 in ei­ther di­rec­tion, and so on. It would be good to do a bit of a sen­si­tiv­ity analy­sis to make sure that this is not be­ing dri­ven by any par­tic­u­lar num­ber. (Based on the de­fi­n­i­tion, since it’s us­ing mostly mul­ti­pli­ca­tion, the fi­nal value should be ro­bust to con­sid­er­able er­ror in es­ti­mat­ing each pa­ra­me­ter, but you never know.) Throw­ing to­gether my in­tu­ition for how much un­cer­tainty is in each pa­ra­me­ter and mod­el­ing most as nor­mals, I can sim­u­late my prior dis­tri­b­u­tion of sav­ings:

simulate <- function() {
    BTUs <- rnorm(1, 10000, 100)
    EER_old <- 10.7 - abs(rnorm(1, 0, 0.5)) # half-normals because efficiencies only get worse, not better
    EER_new <- 11.2 - abs(rnorm(1, 0, 0.5))
    kWh <- rnorm(1, 0.09, 0.01)
    dailyUsage <- rnorm(1, 8, 2)
    months <- sample (4:6, 1)
    minimumSavings <- rnorm(1, 47, 4)

    annualNetSavings <- ((((BTUs / EER_old ) - (BTUs / EER_new)) / 1000) * kWh * dailyUsage * 30 * months) - minimumSavings
sims <- replicate(100000, simulate())
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
## -70.3666500 -46.2051500 -42.3764100 -42.1133700 -38.3134600  -0.7334517
quantile(sims, p=c(0.025, 0.975))
##        2.5%        97.5%
## -53.59989114 -29.13999204

Un­der every sim­u­la­tion, a new AC is a net loss. (S­ince we have no ob­served data to up­date our pri­ors with, this is an ex­er­cise in prob­a­bil­i­ty, not Bayesian in­fer­ence, and so there is no need to bring in JAGS.)

There are two choic­es: re­place or not. The ex­pect­ed-value of a re­place­ment is or -$42, and the ex­pect­ed-value of not re­plac­ing is or $0; the lat­ter is larger than the for­mer, so we should choose the lat­ter and not re­place the old AC.

Hence we can be con­fi­dent that not get­ting a new AC re­ally is the right de­ci­sion.

Some ways of dealing with measurement error

Prompted by a ques­tion on Less­Wrong, some ex­am­ples of how to an­a­lyze noisy mea­sure­ments in R:

## Create a simulated dataset with known parameters, and then run a ML multilevel model, a ML SEM,
## and a Bayesian multilevel model; with the last, calculate Expected Value of Sample Information (EVSI):

## "There is a variable X, x belongs to [0, 100]."
toplevel <- rnorm(n=1, 50, 25)
## "There are n ways of measuring it, among them A and B are widely used."
## "For any given measurer, the difference between x(A) and x(B) can be up to 20 points."
A <- toplevel + runif(1, min=-10, max=10)
B <- toplevel + runif(1, min=-10, max=10)
c(toplevel, A, B)
# [1] 63.85938385 55.43608379 59.42333264
### the true level of X we wish to recover is '63.85'

## "Between two any measurers, x(A)1 and x(A)2 can differ on average 10 points, likewise with B."
### let's imagine 10 hypothetical points are sample using method A and method B
### assume 'differ on average 10 points' here means something like 'the standard deviation is 10'
A_1 <- rnorm(n=10, mean=A, sd=10)
B_1 <- rnorm(n=10, mean=B, sd=10)

data <- rbind(data.frame(Measurement="A", Y=A_1), data.frame(Measurement="B", Y=B_1)); data
#    Measurement           Y
# 1            A 56.33870025
# 2            A 69.07267213
# 3            A 40.36889573
# 4            A 48.67289213
# 5            A 79.92622603
# 6            A 62.86919410
# 7            A 53.12953462
# 8            A 66.58894990
# 9            A 47.86296948
# 10           A 60.72416003
# 11           B 68.60203507
# 12           B 58.24702007
# 13           B 45.47895879
# 14           B 63.45308935
# 15           B 52.27724328
# 16           B 56.89783535
# 17           B 55.93598486
# 18           B 59.28162022
# 19           B 70.92341777
# 20           B 49.51360373

## MLM

## multi-level model approach:
mlm <- lmer(Y ~ (1|Measurement), data=data); summary(mlm)
# Random effects:
#  Groups      Name        Variance Std.Dev.
#  Measurement (Intercept)  0.0000  0.000000
#  Residual                95.3333  9.763877
# Number of obs: 20, groups:  Measurement, 2
# Fixed effects:
#              Estimate Std. Error  t value
# (Intercept) 58.308250   2.183269 26.70685
#                    2.5 %       97.5 %
# .sig01       0.000000000  7.446867736
# .sigma       7.185811525 13.444112087
# (Intercept) 53.402531768 63.213970887

## So we estimate X at 58.3 but it's not inside our confidence interval with such little data. Bad luck?

## SEM

X.model <- '        X =~ B + A
                    A =~ a
                    B =~ b' <- sem(model = X.model, meanstructure = TRUE, data = data2)
# ...                   Estimate  Std.err  Z-value  P(>|z|)
# Latent variables:
#   X =~
#     B                 1.000
#     A              7619.504
#   A =~
#     a                 1.000
#   B =~
#     b                 1.000
# Intercepts:
#     a                58.555
#     b                58.061
#     X                 0.000
#     A                 0.000
#     B                 0.000
## Well, that didn't work well - explodes, unfortunately. Probably still not enough data.

## MLM (Bayesian)

## rough attempt at writing down an explicit multilevel model which
## respects the mentioned priors about errors being reasonably small:
model <- function() {
  grand.mean ~ dunif(0,100) ~ dunif(0, 10) ~ dunif(0, 100) <- pow(, -2)

  for(j in 1:K){
   # let's say the group-level differences are also normally-distributed:[j] ~ dnorm(,
   # and each group also has its own standard-deviation, potentially different from the others':
   group.within.sigma[j] ~ dunif(0, 20)
   group.within.tau[j] <- pow(group.within.sigma[j], -2)

   # save the net combo for convenience & interpretability:
   group.mean[j] <- grand.mean +[j]

  for (i in 1:N) {
   # each individual observation is from the grand-mean + group-offset, then normally distributed:
   Y[i] ~ dnorm(grand.mean +[Group[i]], group.within.tau[Group[i]])

jagsData <- list(N=nrow(data), Y=data$Y, K=length(levels(data$Measurement)),
params <- c("grand.mean","", "", "", "group.mean",
k1 <- jags(data=jagsData,, inits=NULL, model.file=model); k1
# ...                      mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
#     4.971   2.945   0.221   2.353   4.967   7.594   9.791 1.008   260
# grand.mean             52.477  11.321  23.453  47.914  53.280  58.246  74.080 1.220    20
#[1]          6.017  11.391 -16.095   0.448   5.316  10.059  34.792 1.152    21
#[2]          5.662  11.318 -15.836   0.054   5.009  10.107  33.548 1.139    21
# group.mean[1]          58.494   3.765  50.973  56.188  58.459  60.838  66.072 1.001  3000
# group.mean[2]          58.139   2.857  52.687  56.366  58.098  59.851  63.999 1.003   920
# group.within.sigma[1]  12.801   2.766   8.241  10.700  12.446  14.641  18.707 1.002  1100
# group.within.sigma[2]   9.274   2.500   5.688   7.475   8.834  10.539  15.700 1.002  1600
#    18.031  21.159   0.553   3.793   9.359  23.972  82.604 1.006  1700
# deviance              149.684   2.877 145.953 147.527 149.081 151.213 156.933 1.001  3000

## VOI

posteriorXs <- k1$BUGSoutput$sims.list[["grand.mean"]]
MSE <- function(x1, x2) { (x2 - x1)^2 }
lossFunction <- function(x, predictions) { mean(sapply(predictions, function(x2) { MSE(x, x2)}))}
## our hypothetical mean-squared loss if we predicted, say, X=60:
lossFunction(60, posteriorXs)
# [1] 184.7087612
## of the possible values for X, 1-100, what value of X minimizes our squared error loss?
losses <- sapply(c(1:100), function (n) { lossFunction(n, posteriorXs);})
# [1] 52
## 52 also equals the mean estimate of X, which is good since it's well known that the mean is what minimizes
## the loss when the loss is squared-error so it suggests that I have not screwed up the definitions
[1] 128.3478462

## to calculate EVSI, we repeatedly simulate a few hundred times the existence of a hypothetical 'C' measurement
## and draw n samples from it;
## then we add the C data to our existing A & B data; run our Bayesian multilevel model again on the bigger dataset;,
## calculate what the new loss is, and compare it to the old loss to see how much the new data
## reduced the loss/mean-squared-error.
## Done for each possible n (here, 1-30) and averaged out, this tells us how much 1 additional datapoint is worth,
## 2 additional datapoints, 3 additional datapoints, etc.
sampleValues <- NULL
for (i in seq(from=1, to=30)) {

    evsis <- replicate(500, {
        n <- i

        C <- toplevel + runif(1, min=-10, max=10)
        C_1 <- rnorm(n=n, mean=C, sd=10)
        ## all as before, more or less:
        newData <- rbind(data, data.frame(Measurement="C", Y=C_1))

        jagsData <- list(N=nrow(newData), Y=newData$Y, K=length(levels(newData$Measurement)),
        params <- c("grand.mean","", "", "", "group.mean",
        jEVSI <- jags(data=jagsData,, inits=NULL, model.file=model)

        posteriorTimesEVSI <- jEVSI$BUGSoutput$sims.list[["grand.mean"]]
        lossesEVSI <- sapply(c(1:100), function (n) { lossFunction(n, posteriorTimesEVSI);})

        oldOptimum <- 128.3478462 # losses[52]
        newOptimum <- losses[which.min(lossesEVSI)]
        EVSI <- newOptimum - oldOptimum


    sampleValues[i] <- mean(evsis)
#  [1] 13.86568780 11.07101087 14.15645538 13.05296681 11.98902668 13.86866619 13.65059093 14.05991443
#  [9] 14.80018511 16.36944874 15.47624541 15.64710237 15.74060632 14.79901214 13.36776390 15.35179426
# [17] 14.31603459 13.70914727 17.20433606 15.89925289 16.35350861 15.09886204 16.30680175 16.27032067
# [25] 16.30418553 18.84776433 17.86881713 16.65973397 17.04451609 19.17173439

## As expected, the gain in reducing MSE continues increasing as data comes in but with diminishing returns;
## this is probably because in a multilevel model like this, you aren't using the _n_ datapoints to estimate X
## directly so much as you are using them to estimate a much smaller number of latent variables, which are then
## the _n_ used to estimate X. So instead of getting hyperprecise estimates of A/B/C, you need to sample from additional
## groups D/E/F/... Trying to improve your estimate of X by measuring A/B/C many times is like trying to estimate
## IQ precisely by administering a WM test a hundred times.

## If we wanted to compare with alternatives like instead sampling n data points from C and a D, it's easy to modify
## the EVSI loop to do so: generate `D <- toplevel + runif(1, min=-10, max=10); D_1 <- rnorm(n=n, mean=D, sd=10)`
## and now `rbind` D_1 in as well. At a guess, after 5-10 samples from the current group, estimates of X will be improved more
## by then sampling from a new group.

## Or the loss function could be made more realistic. It's unlikely one is paid by MSE, and if one adds in how much
## money each sample costs, with a realistic loss function, one could decide exactly how much data is optimal to collect.

## To very precisely estimate X, when our measurements are needed to measure at least 3 latent variables,
## requires much more data than usual.

## In general, we can see the drawbacks and benefits of each approach. A canned MLM
## is very fast to write but doesn't let us include prior information or easily run
## additional analyses like how much additional samples are worth. SEM works poorly
## on small samples but is still easy to write in if we have more complicated
## models of measurement error. A full-blown modeling language like JAGS is quite
## difficult to write in and MCMC is slower than other approaches but handles small
## samples without any errors or problems and offers maximal flexibility in using
## the known prior information and then doing decision-theoretic stuff. Overall for
## this problem, I think JAGS worked out best, but possibly I wasn't using LAVAAN
## right and that's why SEM didn't seem to work well.

Value of Information: clinical prediction instruments for suicide


I agree. When crit­i­ciz­ing the study for claim­ing the blood lev­els added pre­dic­tive power and it’s not clear they did, this is solely a sta­tis­ti­cal claim and can be done in a vac­u­um. But when one then goes on to pan the pre­dic­tive power of the un­der­ly­ing clin­i­cal pre­dic­tion in­stru­ments as use­less in all cir­cum­stances, based on just the pre­dic­tion stats:

So when peo­ple say “We have a blood test to di­ag­nose sui­ci­dal­ity with 92% ac­cu­ra­cy!”, even if it’s true, what they mean is that they have a blood test which, if it comes back pos­i­tive, there’s still less than 50-50 odds the per­son in­volved is sui­ci­dal. Okay. Say you’re a psy­chi­a­trist. There’s a 48% chance your pa­tient is go­ing to be sui­ci­dal in the next year. What are you go­ing to do? Com­mit her to the hos­pi­tal? I sure hope not. Ask her some ques­tions, make sure she’s do­ing okay, watch her kind of close­ly? You’re a psy­chi­a­trist and she’s your de­pressed pa­tient, you would have been do­ing that any­way. This blood test is not re­ally ac­tion­able. And then re­mem­ber that this is­n’t the blood test we have. We have some clin­i­cal pre­dic­tion in­stru­ments that do this…But hav­ing “a blood test for sui­cide” won’t be very use­ful, even if it works.

One is im­plic­itly mak­ing some strong cost-ben­e­fit claims here and step­ping from sta­tis­tics (‘what are the prob­a­bil­i­ties?’) to de­ci­sion the­ory (‘given these prob­a­bil­i­ties, how should I act?’). They are not iden­ti­cal: no AUC graph will ever tell you if a mod­el’s pre­dic­tions are use­ful or not, and there is no uni­ver­sal thresh­old where 92% speci­fici­ty/sen­si­tiv­ity is to­tally use­less but 95% would make a differ­ence—these clin­i­cal pre­dic­tion in­stru­ments might be use­less in­deed, but that will de­pend on costs, base rates, and avail­able ac­tions. (I tried to make this point to Coyne on Twit­ter ear­lier but I don’t think he un­der­stood what I was get­ting at & he blew me off.)

Dis­con­ti­nu­ities come from our ac­tions; our in­fer­ences are in­cre­men­tal. There are some con­texts where a tiny 1% im­prove­ment in AUC might be worth a lot (Wall Street) and there are some con­texts where sen­si­tiv­ity or speci­ficity of 99% is still use­less be­cause it won’t change your ac­tions at all (I’m cur­rently com­par­ing my rid­ing lawn mower to a ro­botic lawn mow­er, and thus far, it does­n’t mat­ter how pre­cise my pa­ra­me­ters are, the ro­botic lawn mow­ers are, to my dis­ap­point­ment, just too ex­pen­sive right now). I think p-val­ues have shown us how well ar­bi­trary thresh­olds work out in prac­tice (and re­mem­ber where they came from in the first place! de­ci­sion rules set per prob­lem—­Gos­set, in op­ti­miz­ing a brew­ery, did not have the patholo­gies we have with p<0.05 fetishis­m.) I also don’t be­lieve your choices are re­ally that re­strict­ed: you mean if you were ab­solutely con­vinced that your pa­tient was about to com­mit sui­cide, there is ab­solutely noth­ing you could do be­sides treat them like any other de­pres­sive? That seems un­like­ly. But what­ev­er, even if com­mit­ment is the only al­ter­na­tive, there is still a value to the in­for­ma­tion pro­vided by a clin­i­cal pre­dic­tion in­stru­ment, and we can cal­cu­late it, and you should if you want to rule it out as hav­ing any val­ue, in the same way that in crit­i­ciz­ing a study as weak, it’s bet­ter to ig­nore the p-val­ues and just work out the right pos­te­rior and demon­strate di­rectly how lit­tle ev­i­dence it con­tains.

Let’s try this as an ex­am­ple, it’s not hard or ter­ri­bly com­plex (just te­dious). So we have a ward of 100 de­pres­sive pa­tients where we are in­ter­ested in pre­vent­ing sui­cide; our prior prob­a­bil­ity is that 7.5% or ~7 of them will com­mit sui­cide. The value of a life has been given a lot of differ­ent val­u­a­tions, but $10 mil­lion is a good start­ing point.

Ac­tion 1:

What are our costs or loss­es? We could say that we ex­pect a loss of 7.5*$10m or -$75m, and if we stand by and do no treat­ment or in­ter­ven­tion what­so­ev­er, we spend no more money and so the to­tal loss is

0 + 0.075 * 100 * 10,000,000 = -$75,000,000

Ac­tion 2:

Let’s say they all stay by de­fault for one week and this costs a net $1000 a day; let’s say fur­ther that, since com­mit­ment is the men­tioned al­ter­na­tive, while com­mit­ted a sui­cide at­tempt will fail. And since we know that sui­cides are so often spon­ta­neous and ma­jor de­pres­sion comes and goes, a frus­trated sui­cide at­tempt does­n’t sim­ply mean that they will im­me­di­ately kill them­selves as soon as they get out. This 7% comes from a fol­lowup pe­riod of a year, so the prob­a­bil­ity any will at­tempt sui­cide in the next week might be 0.075/52 or 0.001442307692. So this gives us our de­fault se­tup: we have 100 pa­tients stay­ing for 7 days at a net cost of $1000 a day or $700,000 to­tal, and by hav­ing them stay, we stop an ex­pected av­er­age of 0.14 sui­cides and thus we pre­vent an ex­pected loss of 0.14 * $10m = $1,440,000, for a to­tal loss of treat­men­t-cost mi­nus treat­men­t-gain plus re­main­ing-loss:

$700,000 - (0.14 * $10m) - $10m * 100 * (0.075-(0.075/52)) = -$74,257,692.

Note that this loss is smaller than in the sce­nario in which we don’t do any com­mit­ment at all; since one week of sui­cide-watch re­duced the sui­cide loss more than it cost, this is not sur­pris­ing.

Specifi­cal­ly, the ben­e­fit is:

ac­tion1 - ac­tion2 = gain to switch­ing 75000000 - 74257692 = $742,308

Not fan­tas­tic, but it’s in the right or­der of mag­ni­tude (you can’t ex­pect more from a low base-rate event and a treat­ment with such a low prob­a­bil­ity of mak­ing a differ­ence, after all) so it looks plau­si­ble, and it’s still more than ze­ro. We can re­ject the ac­tion of not com­mit­ting them at all as be­ing in­fe­rior to com­mit­ting them for one week.

Ac­tion 3:

What if we were in­stead choos­ing be­tween one week and com­mit­ting them for a full year—thus catch­ing the full 7.5% of sui­cides dur­ing the 1-year fol­lowup? Does that work? First, the loss from this course of ac­tion:

((100*365.2*1000) - (0 * 10000000) - (10000000 * 100 * (0.075-(0.075/1)))) = -$36,520,000

Since there are no sui­cides, we avoid the de­fault loss of -$75m, but we still have to spend $36,520,000 to pay for the long-term com­mit­ment. How­ev­er, the ben­e­fit to the pa­tients has in­creased dra­mat­i­cally since we stop so many more sui­cides:

ac­tion 2 - ac­tion 3 = $35,637,692.31

(We go from a loss of -$74m to a loss of -$36m.) So we see ac­tion 3 is even bet­ter than ac­tion 2 for the pa­tients. Of course, we can’t ex­trap­o­late out any fur­ther than 1 year, be­cause that’s what our fol­lowup num­ber is, and we don’t know how the sui­cide risk falls after the 1 year point—if it drops to ~0, then fur­ther com­mit­ment is a ter­ri­ble idea. So I’m not go­ing to cal­cu­late out any fur­ther. (S­ince this is all lin­ear stuff, the pre­dicted ben­e­fit will in­crease smoothly over the year and so there’s no point in cal­cu­lat­ing out al­ter­na­tives like 1 mon­th, 3 months, 6 months, 9 months, etc.) What’s that, ac­tion 3 is to­tally in­fea­si­ble and no one would ever agree to this—the pa­tients would scream their heads off and the health in­sur­ance com­pa­nies would never go for it—even if we could show that long com­mit­ments do re­duce the sui­cide rate enough to jus­tify the costs? And, among other things, I’ve over­sim­pli­fied in as­sum­ing the 7% risk is evenly dis­trib­uted over the year rather than a more plau­si­ble dis­tri­b­u­tion like ex­po­nen­tially de­creas­ing from Day 1, so likely com­mit­ment stops be­ing a good idea more like month 3 or some­thing? Yeah, you’re prob­a­bly right, so let’s go back to us­ing ac­tion 2’s loss as our cur­rent best al­ter­na­tive.

Now, hav­ing set out some of the choices avail­able, we can find out how much bet­ter in­for­ma­tion is worth. First, let’s ask what the Ex­pected Value of Per­fect In­for­ma­tion is: if we were able to take our 100 pa­tients and ex­actly pre­dict which 7 were de­pres­sive and would com­mit sui­cide this year in the ab­sence of any in­ter­ven­tion, where our choice is be­tween com­mit­ting them for one week or not at all. Given such in­for­ma­tion we can eject the 93 who we now know were never a sui­cide risk, and we hold onto the 7 en­dan­gered pa­tients, and we have a new loss of the com­mit­ment cost of 7 peo­ple for a week vs the pre­vented loss of the chance they will try to com­mit sui­cide that week of this year:

((771000) - (0.14 * 10000000) - (10000000 * 7 * (1-(1/52)))) = -$70,004,846

How much did we gain from our per­fect in­for­ma­tion? About $4m:

74257692 - 70004846 = $4,252,846

(This passes our san­ity checks: ad­di­tional in­for­ma­tion should never hurt us, so the amount should be >=$0, but we are lim­ited by the in­ter­ven­tion to do­ing very lit­tle, so the ceil­ing should be a low amount com­pared to the to­tal loss, which this is.)

So as long as the per­fect in­for­ma­tion did not cost us more than $4m or so, we would have net gained from it: we would have been able to fo­cus com­mit­ment on the pa­tients at max­i­mal risk. So sup­pose we had a per­fect test which cost $1000 a pa­tient to run, and we wanted to know if the gained in­for­ma­tion was valu­able enough to bother with us­ing this ex­pen­sive test; the an­swer in this case is defi­nitely yes: with 100 pa­tients, it’ll cost $100,000 to run the test but it’ll save $4.25m for a net profit of $4.15m. In fact, we would be will­ing to pay per-pa­tient costs up to $42k, at which point we hit break-even (4252846 / 100).

OK, so that’s per­fect in­for­ma­tion. What about imper­fect in­for­ma­tion? Well, im­per­fect is a lot like per­fect in­for­ma­tion, just, y’­know—­less so. Let’s con­sider this test: with the same pri­or, a neg­a­tive on it means the pa­tient now has P=0.007 to com­mit sui­cide that year, and a pos­i­tive means P=0.48, and the sen­si­tiv­i­ty/speci­ficity at 92%. (Just copy­ing that from OP & Bu­tY­ouD­is­agree, since those sound plau­si­ble.) So when we run the test on our pa­tients, we find of the 4 pos­si­ble out­comes:

  • 85.1 pa­tients are non-sui­ci­dal and the test will not flag them
  • 7.4 are non-sui­ci­dal but the test will flag them
  • 6.9 are sui­ci­dal and the test will flag them
  • 0.6 are sui­ci­dal but the test will not flag them

So if we de­cide whether to com­mit or not com­mit solely based on this test, we will send home 85.1 + 0.6 = 85.7 pa­tients (and in­deed 0.6/85.7=0.007), and we will re­tain the re­main­ing 7.4 + 6.9 = 14.3 pa­tients (and in­deed, 6.9/14.3=0.48). So our loss is the wrongly ejected pa­tient of 0.6 sui­cides plus the cost of com­mit­ting 14.3 pa­tients (both safe and at-risk) for a week in ex­change for the gain of a small chance of stop­ping the sui­cide of the 6.9 ac­tu­ally at risk:

(10000000*85.7*0.007) + (14.3*7*1000) + (10000000 * (0.4814.3) (1-(1/52))) = -$73,419,100

How much did we gain from our im­per­fect in­for­ma­tion? About $0.8m:

74257692 - 73419100 = $838,592

or $8,385.92 per pa­tient. (This passes our san­ity check: greater than $0, but much less than the per­fect in­for­ma­tion. The ex­act amount may seem lame, but as a frac­tion of the value of per­fect in­for­ma­tion, it’s not too bad: the test gets us 20% - 838592 / 4252846 - of the way to per­fec­tion.)

And that’s our an­swer: the test is not worth $0—it’s worth $8k. And once you know what the cost of ad­min­is­ter­ing the test is, you sim­ply sub­tract it and now you have the Net Ex­pected Value of In­for­ma­tion for this test. (I can’t imag­ine it costs $8k to ad­min­is­ter what this sounds like, so at least in this mod­el, the value is highly likely >$0.)

By tak­ing the pos­te­rior of the test and in­te­grat­ing all the es­ti­mated costs and ben­e­fits into a sin­gle frame­work, we can nail down ex­actly how much value these clin­i­cal in­stru­ments could de­liver if used to guide de­ci­sion-mak­ing. And if you ob­ject to some par­tic­u­lar pa­ra­me­ter or as­sump­tion, just build an­other de­ci­sion-the­ory model and es­ti­mate the new cost. For ex­am­ple, maybe com­mit­ment ac­tu­ally costs, once you take into ac­count all the dis­rup­tion to lives and other such side-effects, not $1000 but net of $5000 per day, what then? Then the gain halves to $438,192, etc. And if it costs $10000 then the test is worth noth­ing be­cause you won’t com­mit any­one ever be­cause it’s just way too ex­pen­sive, and now you know it’s worth $0; or if com­mit­ment is so cheap that it’s more like $100 a day, then the test is also worth $0 be­cause you would just com­mit every­one (s­ince breakeven is then a sui­cide prob­a­bil­ity way be­low 7%, all the way at ~0.4% which is still be­low the 0.7% which the test can de­liv­er, so the test re­sult does­n’t mat­ter for de­cid­ing whether to com­mit, so it’s worth $0), or if you adopt a more rea­son­able value of life like $20m, the value of per­fect in­for­ma­tion shoots up (ob­vi­ous­ly, since the avoided loss dou­bles) but the value of im­per­fect in­for­ma­tion drops like a stone (s­ince now that one sui­ci­dal pa­tient sent home blows away your sav­ings from less com­mit­ting) and the test be­comes worth­less; and play­ing with the for­mu­las, you can fig­ure out the var­i­ous ranges of as­sump­tions in which the test has pos­i­tive value and es­ti­mate how much it has un­der par­tic­u­lar pa­ra­me­ters, and of course if pa­ra­me­ters are un­cer­tain, you can cope with that un­cer­tainty by em­bed­ding this in a Bayesian model to get pos­te­rior dis­tri­b­u­tions of par­tic­u­lar pa­ra­me­ters in­cor­po­rat­ing all the un­cer­tain­ty.

So to sum up: there are no hard thresh­olds in de­ci­sion-mak­ing and im­pos­ing them can cost us bet­ter de­ci­sion-mak­ing, so to claim ad­di­tional in­for­ma­tion is worth­less, more analy­sis need­ed, and this analy­sis must be done with re­spect to the avail­able ac­tions & their con­se­quences, which even un­der the some­what ex­treme con­di­tions here of very weak in­ter­ven­tions & low base-rates, sug­gests that the value of this in­for­ma­tion is pos­i­tive.

Bayesian Model Averaging

## original: "Bayesian model choice via Markov chain Monte Carlo methods" Carlin & Chib 1995
## Kobe example & data from: "A tutorial on Bayes factor estimation with the product space method", Lodewyckx et al 2011
## Lodewyckx code can be downloaded after registration & email from

## "Table 2: Observed field goals (y) and attempts (n) by Kobe Bryant during the NBA seasons of 1999 to 2006."
kobe <- read.csv(stdin(),header=TRUE)
Year, y,   n,    y.n
1999, 554, 1183, 0.47
2000, 701, 1510, 0.46
2001, 749, 1597, 0.47
2002, 868, 1924, 0.45
2003, 516, 1178, 0.44
2004, 573, 1324, 0.43
2005, 978, 2173, 0.45
2006, 399,  845, 0.47

      # 1) MODEL INDEX
      # Model index is 1 or 2.
      # Prior probabilities based on argument prior1.
      # Posterior probabilities obtained by averaging
      # over postr1 and postr2.
      M ~ dcat(p[])
      p[1] <- prior1
      p[2] <- 1-prior1
      postr1 <- 2-M
      postr2 <- 1-postr1

      # For each year, successes are Binomially distributed.
      # In M1, the success rate is fixed over years.
      # In M2, the success rate is year-specific.
      for (i in 1:n.years){
       successes[i] ~ dbin(pi[M,i], attempts[i])

       pi[1,i] <- pi.fixed
       pi[2,i] <-[i]

      # 3) MODEL 1 (one single rate)
      # The fixed success rate is given a Beta prior and pseudoprior.
      # Whether it is a prior or pseudoprior depends on the Model index.
      pi.fixed ~ dbeta(alpha.fixed[M],beta.fixed[M])
      alpha.fixed[1] <- alpha1.prior
      beta.fixed[1] <- beta1.prior
      alpha.fixed[2] <- alpha1.pseudo
      beta.fixed[2] <- beta1.pseudo

      # 4) MODEL 2 (multiple independent rates)
      # The year-specific success rate is given a Beta prior and pseudoprior.
      # Whether it is a prior or pseudoprior depends on the Model index.
      for (i in 1:n.years){[i] ~ dbeta([M,i],[M,i])[2,i] <- alpha2.prior[2,i] <- beta2.prior[1,i] <- alpha2.pseudo[i][1,i] <- beta2.pseudo[i]
      # predictive interval for hypothetical 2007 data in which Kobe makes 1000 shots: ~ dbin(pi.fixed, 1000) ~ dbin([n.years], 1000)

# ~ dcat(M
# 'prior1' value from paper
data <- list("prior1"=0.000000007451, "n.years"= length(kobe$Year), "successes"=kobe$y, "attempts"=kobe$n,
             "alpha1.prior"=1, "beta1.prior"=1, "alpha2.prior"=1, "beta2.prior"=1,
             "alpha1.pseudo"=1, "beta1.pseudo"=1, "alpha2.pseudo"=rep(1,8), "beta2.pseudo"=rep(1,8) )
# inits <- function() { list(mu=rnorm(1),sd=30,t=as.vector(apply(mailSim,1,mean))) }
params <- c("", "pi.fixed", "postr1", "postr2", "M", "", "")
j1 <- run.jags(model=model1, monitor=params, data=data, n.chains=getOption("mc.cores"), method="rjparallel", sample=500000); j1
# JAGS model summary statistics from 4000000 samples (chains = 8; adapt+burnin = 5000):
#                  Lower95  Median Upper95    Mean      SD Mode      MCerr MC%ofSD SSeff
#[1]        0.3145 0.46864 0.98709 0.47383 0.11553 ---0.00041958     0.4 75810
#[2]       0.10099 0.46447 0.77535 0.47005  0.1154 ---0.00042169     0.4 74887
#[3]       0.19415  0.4692 0.86566  0.4741 0.11457 ---0.00040171     0.4 81342
#[4]      0.020377 0.45146 0.69697 0.45867 0.11616 ---0.00042696     0.4 74023
#[5]      0.024472 0.43846  0.7036 0.44749 0.11757 ---0.00043352     0.4 73548
#[6]      0.076795 0.43325 0.74944 0.44318 0.11684 ---0.00043892     0.4 70863
#[7]       0.06405 0.45033 0.73614 0.45748 0.11541 ---0.00041715     0.4 76543
#[8]       0.30293 0.47267 0.97338 0.47708 0.11506 ---0.00040938     0.4 79000
# pi.fixed        0.039931 0.45756 0.97903 0.49256 0.26498 ---0.00099537     0.4 70868
# postr1                 0       0       1 0.15601 0.36287    0    0.15113    41.6     6
# postr2                 0       1       1 0.84399 0.36287    1    0.15113    41.6     6
# M                      1       2       2   1.844 0.36287    2    0.15113    41.6     6
#        0     463     940  492.57  265.28  454    0.99543     0.4 71019
#      300     473     971  477.05  116.03  473     0.4152     0.4 78094
getLogBF <- function(prior0, postr0) { log((postr0/(1-postr0)) / (prior0/(1-prior0))) }
getLogBF(0.000000007451, 0.15601)
# [1] 17.02669704
## analytic BF: 18.79; paper's MCMC estimate: 18.80; not sure where I lost 1.8 of the BF.

Dealing with all-or-nothing unreliability of data

Given two dis­agree­ing polls, one small & im­pre­cise but taken at face-val­ue, and the other large & pre­cise but with a high chance of be­ing to­tally mis­tak­en, what is the right Bayesian model to up­date on these two dat­a­points? I give ABC and MCMC im­ple­men­ta­tions of Bayesian in­fer­ence on this prob­lem and find that the pos­te­rior is bi­modal with a mean es­ti­mate close to the large un­re­li­able pol­l’s es­ti­mate but with wide cred­i­ble in­ter­vals to cover the mode based on the small re­li­able pol­l’s es­ti­mate.

A ques­tion was asked of me: what should one in­fer if one is given what would be de­fin­i­tive data if one could take it at face val­ue—but one sus­pects this data might be to­tally 100% in­cor­rect? An ex­am­ple would be if one wanted to know what frac­tion of peo­ple would an­swer ‘yes’ to a par­tic­u­lar ques­tion, and one had a very small poll (n = 10) sug­gest­ing 90% say yes, but then one was also given the re­sults from a much larger poll (n = 1000) say­ing 75% re­sponded yes—but this poll was run by un­trust­wor­thy peo­ple, peo­ple that, for what­ever rea­son, you be­lieve might make some­thing up half the time. You should be able to learn some­thing from this un­re­li­able poll, but you can’t learn every­thing from it be­cause you would be burned half the time.

If not for this is­sue of un­re­li­a­bil­i­ty, this would be an easy bi­no­mial prob­lem: spec­ify a uni­form or Jeffreys prior on what per­cent­age of peo­ple will say yes, add in the bi­no­mial data of 9⁄10, and look at the pos­te­ri­or. But what do we do with the un­re­li­a­bil­ity jok­er?


First let’s try the sim­ple case, just up­dat­ing on a small poll of 9⁄10. We would ex­pect it to be uni­modally peaked around 80-90%, but broad (due to the small sam­ple size) and falling sharply un­til 100% since be­ing that high is a pri­ori un­like­ly.

MCMC us­ing Bayesian First Aid:

## install.packages("devtools")
## devtools::install_github("rasmusab/bayesian_first_aid")
b <- bayes.binom.test(oldData$Yes, oldData$N); b
# ...number of successes = 9, number of trials = 10
# Estimated relative frequency of success:
#   0.85
# 95% credible interval:
#   0.63 0.99
# The relative frequency of success is more than 0.5 by a probability of 0.994
# and less than 0.5 by a probability of 0.006

Which it­self is a wrap­per around call­ing out to JAGS do­ing some­thing like this:

model_string <- "model {
  x ~ dbinom(theta, n)
  theta ~ dbeta(1, 1) }"
model <- autorun.jags(model_string, monitor="theta", data=list(x=oldData$Yes, n=oldData$N)); model
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#       Lower95  Median Upper95    Mean      SD Mode     MCerr MC%ofSD SSeff    AC.10   psrf
# theta 0.63669 0.85254  0.9944 0.83357 0.10329 ---0.0007304     0.7 20000 0.011014 1.0004

Here is a sim­u­la­tion-based ver­sion of Bayesian in­fer­ence us­ing :

oldData <- data.frame(Yes=9, N=10)
simulatePoll <- function(n, pr)  { rbinom(1, size=n, p=pr); }
poll_abc <- replicate(100000, {
    # draw from our uniform prior
    p <- runif(1,min=0,max=1)
    # simulate a hypothetical poll dataset the same size as our original
    newData <- data.frame(Yes=simulatePoll(oldData$N, p), N=oldData$N)
    # were they equal? if so, save sample as part of posterior
    if (all(oldData == newData)) { return(p) }
resultsABC <- unlist(Filter(function(x) {!is.null(x)}, poll_abc))
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 0.3260816 0.7750520 0.8508855 0.8336383 0.9117471 0.9991691

They look iden­ti­cal, as they should.

Binomial with binary unreliability

To im­ple­ment our more com­pli­cated ver­sion: the orig­i­nal poll re­mains the same but we add in the com­pli­ca­tion of a very large poll which 50% of the time is a true mea­sure of the poll re­sponse and 50% of the time is drawn uni­formly at ran­dom. (So if the true poll re­sponse is 90%, then half the time the large poll will yield ac­cu­rate data like 905⁄1000 or 890⁄1000, and the rest it will yield 10⁄1000 or 400⁄1000 or 700⁄1000.) This is differ­ent from the more com­mon kinds of mea­sure­men­t-er­ror mod­els where it’s gen­er­ally as­sumed that the noisy mea­sure­ments still have some in­for­ma­tive­ness to them; here there is none.

Specifi­cal­ly, this faux poll has yielded the data not 9⁄10, but 750⁄1000.


Us­ing ABC again: we gen­er­ate the re­li­able small poll as be­fore, and we add in an faux poll where we flip a coin to de­cide if we are go­ing to re­turn a ‘yes’ count based on the pop­u­la­tion pa­ra­me­ters or just a ran­dom num­ber, then we com­bine the two datasets and check that it’s iden­ti­cal to the ac­tual data, sav­ing the pop­u­la­tion prob­a­bil­ity if it is.

oldData2 <- data.frame(Yes=c(9,750), N=c(10,1000)); oldData2
#   Yes    N
# 1   9   10
# 2 750 1000
simulateHonestPoll <- function(n, pr)  { rbinom(1, size=n, p=pr); }
simulateFauxPoll <- function(n, pr, switchp) { if(sample(c(TRUE, FALSE), 1, prob=c(switchp, 1-switchp))) { rbinom(1, size=n, p=pr); } else { round(runif(1, min=0, max=n)); }}
poll_abc <- replicate(1000000, {
 priorp <- runif(1,min=0,max=1)
 switch <- 0.5
 n1 <- 10
 n2 <- 1000
 data1 <- data.frame(Yes=simulateHonestPoll(n1, priorp), N=n1)
 data2 <- data.frame(Yes=simulateFauxPoll(n2, priorp, switch), N=n2)
 newData <- rbind(data1, data2)
 if (all(oldData2 == newData)) { return(priorp) }
resultsABC <- unlist(Filter(function(x) {!is.null(x)}, poll_abc))
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# 0.5256471 0.7427098 0.7584650 0.7860109 0.8133581 0.9765648

The re­sults are in­ter­est­ing and in this case the sum­mary sta­tis­tics are mis­lead­ing: the me­dian is in­deed around 75% (as we would ex­pect! since that’s the re­sult of the highly pre­cise poll which has a 50% chance of be­ing the truth) but we see the mean is be­ing pulled away to­wards the orig­i­nal 90% es­ti­mate, and plot­ting the his­togram, bi­modal­ity emerges. The pos­te­rior re­ports that there’s still a lot of cred­i­bil­ity to the 90% point es­ti­mate, but be­tween the orig­i­nal diffuse­ness of that pos­te­rior (leav­ing a lot of prob­a­bil­ity to lower re­sponses in­clud­ing, say, 75%) and the high cer­tainty that if ac­cu­rate the re­sponses will defi­nitely be close to 75%, it winds up peaked at a lit­tle higher than 75% (s­ince even if the larger poll is hon­est, the ear­lier poll did still find 9⁄10). So it’s not so much that we think the best es­ti­mate of true pop­u­la­tion rate re­ally is 79% (in­deed, the mode is more like 75%, but it could eas­ily be far away from 75% and in the 90%s) as we would need to think more about what we want to do with this pos­te­rior be­fore we de­cide how to sum­ma­rize it.


ABC is slow and would not scale to more hy­po­thet­i­cal polls un­less we aban­doned ex­act ABC in­fer­ence and be­gan us­ing ap­prox­i­mate ABC (en­tirely pos­si­ble in this case; in­stead of strict equal­ity be­tween the orig­i­nal and sim­u­lated data, we’d in­stead ac­cept a sam­ple of p if the sim­u­lated dataset’s frac­tions were with­in, say, 1% of the orig­i­nal­s); and the sim­u­la­tion would need to be rewrit­ten any­way.

MCMC can han­dle this if we think of our prob­lem as a : our prob­lem is that we have poll data drawn from two clus­ter­s/dis­tri­b­u­tion­s—one clus­ter is the true pop­u­la­tion dis­tri­b­u­tion of opin­ion, and the other clus­ter just spits out noise. We have one ob­ser­va­tion which we know is from first re­li­able dis­tri­b­u­tion (the 9⁄10 poll re­sult), and one ob­ser­va­tion which we’re not sure which of the two it came from (750/1000), but we do know that the in­dex­ing prob­a­bil­ity for mix­ing the two dis­tri­b­u­tions is 50%.

In JAGS, we write down a model in which dcat flips be­tween 1 and 2 if the clus­ter is not known, spec­i­fy­ing which dis­tri­b­u­tion a sam­ple came from and its theta prob­a­bil­i­ty, and then we in­fer the thetas for both dis­tri­b­u­tions. Of course, we only care about the first dis­tri­b­u­tion’s theta since the sec­ond one is noise.

model1 <- "model {
  for (i in 1:N) {
   y[i] ~ dbinom(theta[i], n[i])
   theta[i] <- thetaOfClust[ clust[i] ]
   clust[i] ~ dcat(pi[])
  pi[1]  <- switch[1]
  pi[2]  <- switch[2]
  thetaOfClust[1] ~ dbeta(1,1)
  thetaOfClust[2] ~ dunif(0,1)
j1 <- autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldData2), y=oldData2$Yes, n=oldData2$N, switch=c(0.5, 0.5), clust=c(1,NA))); j1
# ...      Lower95  Median Upper95    Mean       SD Mode      MCerr MC%ofSD SSeff    AC.10   psrf
# theta[1] 0.70582 0.75651 0.97263 0.77926  0.07178  ---  0.001442       2  2478  0.12978 1.0011
# theta[2] 0.72446 0.75078 0.77814 0.75054 0.013646  ---0.00009649     0.7 20000 0.009458      1

Sure enough, we get a good match with the ABC es­ti­mate: a mean es­ti­mate for the pop­u­la­tion dis­tri­b­u­tion of 78% with a very wide 95% CI and a clearly bi­modal dis­tri­b­u­tion with a huge spike at 75%. Since the MCMC mix­ture model looks com­pletely differ­ent from the im­per­a­tive sim­u­la­tion-based mod­el, the con­sis­tency in es­ti­mates & dis­tri­b­u­tions gives me some con­fi­dence in the re­sults be­ing right.

So we can see how we should up­date our be­lief­s—by a per­haps sur­pris­ing amount to­wards the un­re­li­able dat­a­point. The orig­i­nal data was too weak to strongly re­sist the al­lure of that highly pre­cise poll.

Weakening heuristic?

We might try to think of it this way: half the time, the large poll means noth­ing what­so­ev­er, it con­tains 0% or no in­for­ma­tion about the pop­u­la­tion at all; While the other half of the time, it is ex­actly what it seems to be and 100% in­for­ma­tive; so does­n’t that mean that on av­er­age we should treat it as con­tain­ing half the in­for­ma­tion we thought it did? And the in­for­ma­tion is di­rectly based on the sam­ple size: a sam­ple 5x as big con­tains 5x as much in­for­ma­tion. So per­haps in this case of al­l-or-noth­ing ac­cu­ra­cy, we could solve it eas­ily by sim­ply weak­en­ing the weight put the un­re­li­able in­for­ma­tion and shrink­ing the claimed sam­ple size—in­stead of treat­ing it as 750 of 1000, treat it as 375⁄500; and if it had been 75,000 of 100,000, con­vert it to 37,500 of 50,000. This is a sim­ple and in­tu­itive short­cut, but if we think about what the bi­no­mial will re­turn as the un­re­li­able poll in­creases in size or if we look at the re­sults…

switch <- 0.5
oldData3 <- data.frame(Yes=c(9,(750*switch)), N=c(10,(1000*switch)))
b2 <- bayes.binom.test(sum(oldData3$Yes), sum(oldData3$N)); b2
#   Bayesian First Aid binomial test
# data: sum(oldData3$Yes) and sum(oldData3$N)
# number of successes = 384, number of trials = 510
# Estimated relative frequency of success:
#   0.75
# 95% credible interval:
#   0.71 0.79
# The relative frequency of success is more than 0.5 by a probability of >0.999
# and less than 0.5 by a probability of <0.001

Un­for­tu­nate­ly, this does­n’t work be­cause it does­n’t pre­serve the bi­modal as­pect of the pos­te­ri­or, and we get a uni­modal dis­tri­b­u­tion ever con­cen­trat­ing on its mean, wip­ing out the ex­is­tence of the 0.90 peak. If our un­trust­wor­thy poll had in­stead, say, re­ported 750,000 out of 1 mil­lion, that should only make the peak at 0.75 look like a needle—it should be un­able to affect the mass around 0.9, be­cause it does­n’t mat­ter if the data is 100 or 1 mil­lion or 1 bil­lion, it still only has a 50% chance of be­ing true. It’s a lit­tle hard to see this since the mean fre­quency of 0.75 is fairly close to the mean of 0.78 from the ABC and we might write this off as ap­prox­i­ma­tion er­ror in ei­ther the ABC es­ti­mate or BFA’s MCMC, but if we look at the 95% CI and note that 0.9 is not in­side it or if we plot the pos­te­rior (plot(b2)), then the ab­sence of bi­modal­ity jumps out. So this trick does­n’t work.

Dysgenics power analysis

Cur­rent dys­genic es­ti­mates pre­dict that geno­typic IQ in the West are falling at a sub­stan­tial rate, amount­ing to around half a stan­dard de­vi­a­tion or more over the past cen­tu­ry, by 1. re­duc­ing the fre­quency at which in­tel­li­gence-in­creas­ing ge­netic vari­ants oc­cur (through nat­ural se­lec­tion against such vari­ants) and 2. by in­creas­ing the num­ber of new and po­ten­tially harm­ful ge­netic mu­ta­tions (in­creas­ing mu­ta­tion load). Es­ti­mates are pro­duced in­di­rectly by sur­vey­ing re­pro­duc­tive rates or by try­ing to show de­creases in phe­no­typic traits as­so­ci­ated with in­tel­li­gence; it would ob­vi­ously be prefer­able to ex­am­ine dys­genic effects di­rect­ly, by ob­serv­ing de­creases in fre­quen­cies or in­creases in mu­ta­tion load in a large sam­ple of West­ern ge­netic in­for­ma­tion such as SNP ar­rays or whole-genomes (re­spec­tive­ly). Such di­rect test­ing of dys­gen­ics hy­pothe­ses are be­com­ing in­creas­ingly fea­si­ble due to the ex­po­nen­tial de­crease in SNP & whole-genome se­quenc­ing costs cre­at­ing large datasets (some pub­licly avail­able) and the re­cent iden­ti­fi­ca­tion of some in­tel­li­gence genes. It re­mains un­clear how large these datasets must be to over­come sam­pling er­ror and yield in­for­ma­tive es­ti­mates of changes in fre­quen­cies or mu­ta­tion load, how­ev­er; datasets like PGP or SSGAC may still be too small to in­ves­ti­gate dys­gen­ics. I con­sid­ered the effect size es­ti­mates and un­der some sim­ple mod­els de­rive power cal­cu­la­tions & power sim­u­la­tions of how large a dataset would be re­quired to have an 80% chance of de­tect­ing a dys­genic effect: to de­tect the de­crease in in­tel­li­gence SNPs us­ing SNP data, n≥30,000; to de­tect the in­crease in mu­ta­tion load in whole genomes, n≥160. I then com­pare to avail­able datasets: the effect on SNPs can be de­tected by a large num­ber of ex­ist­ing pro­pri­etary data­bas­es, but there are no pub­lic data­bases which will be large enough in the fore­see­able fu­ture; the effect on mu­ta­tion load, on the other hand, can be de­tected us­ing solely the cur­rently pub­licly avail­able dataset from PGP. So I con­clude that while only the pro­pri­etary data­bases can di­rectly test dys­genic the­o­ries of se­lec­tion for the fore­see­able fu­ture, there is an op­por­tu­nity to an­a­lyze PGP genomes to di­rectly test the dys­genic the­ory of mu­ta­tion load.

The dys­gen­ics hy­poth­e­sis ar­gues that due to ob­served re­pro­duc­tive pat­terns where the highly ed­u­cated or in­tel­li­gent tend to have fewer off­spring, geno­typic IQ (the up­per bound on phe­no­typic IQs set by genes and the sort of thing mea­sured by a poly­genic score). If dys­gen­ics is true, then it is an ex­tremely im­por­tant phe­nom­e­non, as im­por­tant as many things that get far more at­ten­tion like lead re­me­di­a­tion; but to para­phrase 2, just be­cause a prob­lem is im­por­tant does not mean it is worth work­ing on or re­search­ing or dis­cussing if there is no chance of mak­ing pro­gress—if the data is hope­lessly com­pro­mised by many sys­tem­atic bi­ases which would cause false pos­i­tives or if the data is too scanty to over­come ran­dom er­ror or analy­ses so flex­i­ble that they could de­liver any an­swer the par­ti­san wish­es.

Phe­no­typic data will, in all prob­a­bil­i­ty, never al­low for a clear & de­ci­sive an­swer to the ques­tion of whether dys­gen­ics ex­ists or mat­ters, as long-term com­par­isons are roughly as cred­i­ble as not­ing that global piracy rates have de­clined while global warm­ing in­creas­es, or parac­eta­mol con­sump­tion rates have in­creased in tan­dem with Alzheimer’s rates; only di­rect ex­am­i­na­tion of ge­net­ics will de­liver the de­ci­sive an­swer. It would be nice to have an idea of how much ge­netic data we would need to over­come ran­dom er­ror (and hence, whether it’s pos­si­ble to make progress in the near fu­ture), which we can an­swer by do­ing some sta­tis­ti­cal power analy­ses.

Changes over time in ge­net­ics could be due to changes within a par­tic­u­lar race or pop­u­la­tion (for ex­am­ple, in all white Eng­lish­men), or could be due to pop­u­la­tion move­ments like one group re­plac­ing or mi­grat­ing or merg­ing into an­other (pop­u­la­tion ge­net­ics has re­vealed in­nu­mer­able com­plex ex­am­ples his­tor­i­cal­ly). The lat­ter is pos­si­ble thanks to the in­creas­ing avail­abil­ity of an­cient DNA, often made pub­lic for re­searchers; so one could ob­serve very long-term trends with cu­mu­la­tively large effects (im­ply­ing that small sam­ples may suffice), but this ap­proach has se­ri­ous is­sues in in­ter­pre­ta­tion and ques­tions about how com­pa­ra­ble in­tel­li­gence vari­ants may be across groups or through­out hu­man evo­lu­tion. With the for­mer, there is less con­cern about in­ter­pre­ta­tion due to greater tem­po­ral and eth­nic ho­mo­gene­ity—if a GWAS on white north­ern Eu­ro­peans in 2013 turns up in­tel­li­gence vari­ants and pro­duces a use­ful poly­genic score, it will al­most cer­tainly work on sam­ples of white north­ern Eu­ro­peans in 1900 too—but be­cause the time-s­cale is so short the effect will be sub­tler and harder to de­tect. Nev­er­the­less, a re­sult within a mod­ern pop­u­la­tion would be much more cred­i­ble, so we’ll fo­cus on that.

How sub­tle and hard to de­tect an effect are we talk­ing about here? Wood­ley 2012 sum­ma­rizes a num­ber of es­ti­mates:

Early in the 20th cen­tu­ry, neg­a­tive cor­re­la­tions were ob­served be­tween in­tel­li­gence and fer­til­i­ty, which were taken to in­di­cate a dys­genic fer­til­ity trend (e.g. Cat­tell, 1936; Lentz, 1927; Maller, 1933; Suther­land, 1929). Early pre­dic­tions of the rate of dys­ge­n­e­sis were as high as be­tween 1 and 1.5 IQ points per decade (Cat­tell, 1937, 1936)…In their study of the re­la­tion­ship be­tween in­tel­li­gence and both com­pleted and par­tially com­pleted fer­til­i­ty, van Court and Bean (1985) re­ported that the re­la­tion­ships were pre­dom­i­nantly neg­a­tive in co­horts born be­tween the years 1912 and 1982…Vin­ing (1982) was the first to have at­tempted an es­ti­ma­tion of the rate of geno­typic IQ de­cline due to dys­ge­n­e­sis with ref­er­ence to a large na­tional prob­a­bil­ity co­hort of US women aged be­tween 24 and 34 years in 1978. He iden­ti­fied sig­nifi­cant neg­a­tive cor­re­la­tions be­tween fer­til­ity and IQ rang­ing from −.104 to −.221 across cat­e­gories of sex, age and race, with an es­ti­mated geno­typic IQ de­cline of one point a gen­er­a­tion. In a 10year fol­low-up study us­ing the same co­hort, Vin­ing (1995) re-ex­am­ined the re­la­tion­ship be­tween IQ and fer­til­i­ty, now that fer­til­ity was com­plete, find­ing ev­i­dence for a geno­typic IQ de­cline of .5 points per gen­er­a­tion. Rether­ford and Sewell (1988) ex­am­ined the as­so­ci­a­tion be­tween fer­til­ity and IQ amongst a sam­ple of 9000 Wis­con­sin high­-school grad­u­ates (grad­u­ated 1957). They found a se­lec­tion differ­en­tial that would have re­duced the phe­no­typic IQ by .81 points per gen­er­a­tion un­der the as­sump­tion of equal IQs for par­ents and chil­dren. With an es­ti­mate of .4 for the ad­di­tive her­i­tabil­ity of IQ, they cal­cu­lated a more mod­est geno­typic de­cline of ap­prox­i­mately .33 points. The study of Ree and Ear­les (1991), which em­ployed the NLSY sug­gests that once the differ­en­tial fer­til­ity of im­mi­grant groups is taken into con­sid­er­a­tion, the phe­no­typic IQ loss amongst the Amer­i­can pop­u­la­tion may be greater than .8 of a point per gen­er­a­tion. Sim­i­lar­ly, in sum­ma­riz­ing var­i­ous stud­ies, Her­rn­stein & Mur­ray (1994) sug­gest that “it would be nearly im­pos­si­ble to make the to­tal [phe­no­typic IQ de­cline] come out to less than one point per gen­er­a­tion. It might be twice that.” (p. 364). Loehlin (1997) found a neg­a­tive re­la­tion­ship be­tween the fer­til­ity of Amer­i­can women aged 35-44 in 1992 and their ed­u­ca­tional lev­el. By as­sign­ing IQ scores to each of six ed­u­ca­tional lev­els, Loehlin es­ti­mated a dys­ge­n­e­sis rate of .8 points in one gen­er­a­tion. Sig­nifi­cant con­tri­bu­tions to the study of dys­ge­n­e­sis have been made by Lynn, 1996 (see al­so: 2011) whose book Dys­gen­ics: Ge­netic de­te­ri­o­ra­tion in mod­ern pop­u­la­tions pro­vided the first es­ti­mates of the mag­ni­tude of dys­ge­n­e­sis in Britain over a 90 year pe­ri­od, putting the phe­no­typic loss at .069 points per year (about 1.7 points a gen­er­a­tion as­sum­ing a gen­er­a­tional length of 25 years). In the same study, Lynn es­ti­mated that the geno­typic IQ loss was 1.64 points per gen­er­a­tion be­tween 1920 and 1940, which re­duced to .66 points be­tween 1950 and the pre­sent. Sub­se­quent work by Lynn has in­ves­ti­gated dys­ge­n­e­sis in other pop­u­la­tions. For ex­am­ple Lynn (1999) found ev­i­dence for dys­genic fer­til­ity amongst those sur­veyed in the 1994 Na­tional Opin­ion Re­search Cen­ter sur­vey, which en­com­passed a rep­re­sen­ta­tive sam­ple of Amer­i­can adults, in the form of neg­a­tive cor­re­la­tions be­tween the in­tel­li­gence of adults aged 40+ and the num­ber of chil­dren and sib­lings. Lynn es­ti­mates the rate of dys­ge­n­e­sis amongst this co­hort at .48 points per gen­er­a­tion. In a more re­cent study, Lynn and van Court (2004) es­ti­mated that amongst the most re­cent US co­hort for which fer­til­ity can be con­sid­ered com­plete (i.e. those born in the years 1940-1949), IQ has de­clined by .9 points per gen­er­a­tion. At the coun­try lev­el, Lynn and Har­vey (2008) have found ev­i­dence of a global dys­ge­n­e­sis of around .86 points be­tween 1950 and 2000, which is pro­jected to in­crease to 1.28 points in the pe­riod from 2000 to 2050. This pro­jec­tion in­cludes the as­sump­tion that 35% of the vari­ance in cross-coun­try IQ differ­ences is due to the in­flu­ence of ge­netic fac­tors. A sub­se­quent study by Meisen­berg (2009), found that the fer­til­ity differ­en­tial be­tween de­vel­oped and de­vel­op­ing na­tions has the po­ten­tial to re­duce the phe­no­typic world pop­u­la­tion IQ mean by 1.34 points per decade (amount­ing to a geno­typic de­cline of .47 points per decade as­sum­ing Lynn & Har­vey’s 35% es­ti­mate). This as­sumes present rates of fer­til­ity and pre-re­pro­duc­tive mor­tal­ity within coun­tries. Meisen­berg (2010) and Meisen­berg and Kaul (2010) have ex­am­ined the fac­tors through which in­tel­li­gence in­flu­ences re­pro­duc­tive out­comes. They found that amongst the NLSY79 co­hort in the United States, the neg­a­tive cor­re­la­tion be­tween in­tel­li­gence and fer­til­ity is pri­mar­ily as­so­ci­ated with g and is me­di­ated in part by ed­u­ca­tion and in­come, and to a lesser ex­tent by more “lib­eral” gen­der at­ti­tudes. From this Meisen­berg has sug­gested that in the ab­sence of mi­gra­tion and with a con­stant en­vi­ron­ment, se­lec­tion has the po­ten­tial to re­duce the av­er­age geno­typic IQ of the US pop­u­la­tion by be­tween .4, .8 and 1.2 points per gen­er­a­tion.

All of these es­ti­mates are ge­netic se­lec­tion es­ti­mates: in­di­rect es­ti­mates in­ferred from IQ be­ing a her­i­ta­ble trait and then treat­ing it as a nat­ural se­lec­tion/breed­ing process, where a trait is se­lected against based on phe­no­type and how fast the trait de­creases in each suc­ceed­ing gen­er­a­tion de­pends on how ge­netic the trait is and how harsh the se­lec­tion is. So vari­a­tion in these es­ti­mates (quoted es­ti­mates per gen­er­a­tion range from .3 to 3+) is due to sam­pling er­ror, differ­ences in pop­u­la­tions or time pe­ri­ods, ex­press­ing the effect by year or gen­er­a­tion, the es­ti­mate used for her­i­tabil­i­ty, re­li­a­bil­ity of IQ es­ti­mates, and whether ad­di­tional ge­netic effects are taken into ac­coun­t—­for ex­am­ple, Wood­ley et al 2015 finds -.262 points per decade from se­lec­tion, but in an­other pa­per ar­gues that pa­ter­nal mu­ta­tion load must be affect­ing in­tel­li­gence by ~-0.84 in the gen­eral pop­u­la­tion, giv­ing a to­tal of -1 per decade.

Dys­gen­ics effects should be ob­serv­able by look­ing at genomes & SNP data with known ages/birth-years and look­ing for in­creases in to­tal mu­ta­tions or de­creases in in­tel­li­gence-caus­ing SNPs, re­spec­tive­ly.

Selection on SNPs

With­out for­mally meta-an­a­lyz­ing all dys­gen­ics stud­ies, a good start­ing point on the se­lec­tion effect seems like a ge­netic se­lec­tion of 1 point per decade or 0.1 points per year or 0.007 stan­dard de­vi­a­tions per year (or 0.7 stan­dard de­vi­a­tions per cen­tu­ry).

The most com­mon avail­able ge­netic data is SNP data, which se­quence only the vari­ants most com­mon in the gen­eral pop­u­la­tion; SNP data can look at the effects of ge­netic se­lec­tion but will not look at new mu­ta­tions (s­ince a new mu­ta­tion would not be com­mon enough to be worth putting onto a SNP chip).

Given a large sam­ple of SNP data, a birth year (or age), and a set of bi­nary SNP vari­ables which cause in­tel­li­gence (coded as 1 for the good vari­ant, 0 for the oth­er­s), we could for­mu­late this as a mul­ti­vari­ate re­gres­sion: glm(cbind(SNP1, SNP2, ... SNP_N) ~ Year, family=binomial) and see if the year vari­able has a neg­a­tive sign (in­creas­ing pas­sage of time pre­dicts lower lev­els of the good genes); if it does, this is ev­i­dence for dys­gen­ics. Bet­ter yet, given in­for­ma­tion about the effect size of the SNPs, we could for each per­son’s SNP sum the net effects and then regress on a sin­gle vari­able, giv­ing more pre­ci­sion rather than look­ing for in­de­pen­dent effects on each SNP: lm(Polygenic_score ~ Year). Again a neg­a­tive sign on the year vari­able is ev­i­dence for dys­gen­ics.

Di­rec­tional pre­dic­tions are weak, and in this case we have quan­ti­ta­tive pre­dic­tions of how big the effects should be. Most of the pub­lic genomes I looked at seem to have the ear­li­est birth­dates in the 1950s or so; genomes can come from any age per­son (par­ents can give per­mis­sion, and se­quenc­ing has been done pre­na­tal­ly) so the max­i­mum effect is the differ­ence be­tween 1950 and 2015, which is 65*0.007=0.455 stan­dard de­vi­a­tions (but most genomes will come from in­ter­me­di­ate birth-dates, which are less in­for­ma­tive about the tem­po­ral trend—in the op­ti­mal ex­per­i­men­tal de­sign for mea­sur­ing a lin­ear trend, half the sam­ples would be from 1950 and the other half from 2015). If the ge­netic to­tal is go­ing down by 0.455S­Ds, how much do the fre­quen­cies of all the good genes go down?

One sim­ple model of geno­typic IQ would be to treat it as a large num­ber of al­le­les of equal bi­nary effect: a bi­no­mial sum of n = 10,000 1/0 vari­ables with P = 50% (pop­u­la­tion fre­quen­cy) is rea­son­able. (For ex­am­ple, GIANT has found a large num­ber of vari­ants for height, and the s in­di­cate that SNPs ex­plain much more of vari­ance than the top Ri­etveld hits cur­rently ac­count for; this spe­cific model is loosely in­spired by .) In such a mod­el, the av­er­age value of the sum is of course n*p=5000, and the SD is sqrt(n*p*(1-p)) or sqrt(10000*0.5*0.5) or 50. Ap­ply­ing our es­ti­mate of dys­genic effect, we would ex­pect the sum to fall by 0.455*50=22.75, so we would be com­par­ing two pop­u­la­tions, one with a mean of 5000 and a dys­genic mean of 4977.25. If we were given ac­cess to all al­le­les from a sam­ple of 1950 and 2015 genomes and so we could con­struct the sum, how hard would it be able to tell the differ­ence? In this case, the sum is nor­mally dis­trib­uted as there are more than enough al­le­les to cre­ate nor­mal­i­ty, so we can just treat this as a two-sam­ple nor­mal­ly-dis­trib­uted com­par­i­son of means (a t-test), and we al­ready have a di­rec­tional effect size in mind, -0.445S­Ds, so:

power.t.test(delta=0.455, power=0.8, alternative="one.sided")
#      Two-sample t test power calculation
#               n = 60.4155602
# ...

A to­tal n = 120 is doable, but it is un­likely that we will know all in­tel­li­gence genes any­time soon; in­stead, we know a few. A new mean of 4977 im­plies that since to­tal num­ber of al­le­les is the same but the mean has fal­l­en, the fre­quen­cies must also fall and the av­er­age fre­quency falls from 0.5 to 4977.25/10000=0.497725. To go to the other ex­treme, if we know only a sin­gle gene and we want to test a fall from a fre­quency of 0.50 to 0.4977, we need in­fea­si­bly more sam­ples:

power.prop.test(p1=0.5, p2=0.497725, power=0.8, alternative="one.sided")
#      Two-sample comparison of proportions power calculation
#               n = 597,272.2524
# ...

1.2m dat­a­points would be diffi­cult to get, and so a sin­gle gene test would be un­help­ful; fur­ther, a sin­gle gene could change fre­quen­cies solely through ge­netic drift with­out the change be­ing due to dys­genic pres­sures.

We know a num­ber of genes, though: Ri­etveld gives 4 good hits, so we can look at a poly­genic score from that. They are all of sim­i­lar effect size and fre­quen­cy, so we’ll con­tinue un­der the same as­sump­tions of 1/0 and P = 50%. The non-dys­genic av­er­age score is 4*0.5=2, sd=sqrt(4*0.5*0.5)=1. (Nat­u­ral­ly, the SD is much larger than be­fore be­cause with so few ran­dom vari­ables…) The pre­dicted shift is from fre­quen­cies of 0.5 to 0.497, so the dys­genic scores should be 4*0.497=1.988, sd=sqrt(4*0.497*0.503)=0.999. The differ­ence of 0.012 on the re­duced poly­genic score is d=((2-1.988) / 0.999)=0.012, giv­ing a nec­es­sary power of:

power.t.test(delta=0.012006003, power=0.8)
#      Two-sample t test power calculation
#               n = 108904.194
# ...

So the 4 hits do re­duce the nec­es­sary sam­ple size, but it’s still not fea­si­ble to re­quire 218k SNP datasets (un­less you are 23andMe or SSGAC or an en­tity like that).

In the cur­rent GWAS lit­er­a­ture, there are ~9 hits we could use, but the up­com­ing SSGAC pa­per promis­es: “We iden­ti­fied 86 in­de­pen­dent SNPs as­so­ci­ated with EA (p < 5E-8).”. So how much would 86 im­prove over 4?

  • mean old: 86*0.5=43
  • sd old: sqrt(86*0.5*0.5)=4.6368
  • mean new: 86*0.497=42.742
  • sd new: sqrt(86*0.497*(1-0.497))=4.6367
  • so d=(43-42.742)/4.63675=0.0556
power.t.test(delta=((43-42.742)/4.63675), power=0.8)
#      Two-sample t test power calculation
#               n = 5071.166739
# ...

So with 75, it drops from 200k to 10.1k.

To work back­wards: we know with 1 hit, we need a mil­lion SNP datasets (in­fea­si­ble for any but the largest pro­pri­etary data­bas­es, who have no in­ter­est in study­ing this hy­poth­e­sis), and with all hits we need more like 200 genomes (en­tirely doable with just pub­licly avail­able datasets like PGP), but how many hits do we need to work with an in­-be­tween amount of data like the ~2k genomes with ages I guess may be pub­licly avail­able now or in the near fu­ture?

power.t.test(n=1000, power=0.8)
#     Two-sample t test power calculation
#              n = 1000
#          delta = 0.1253508704
mean1=hits*0.5; sd1=sqrt(hits*0.5*0.5);
mean2=hits*0.497; sd2=sqrt(hits*0.497*(1-0.497));
d=(mean1-mean2)/mean(c(sd1,sd2)); d
# [1] 0.1254283986

With a poly­genic score draw­ing on 437 hits, then a sam­ple of 2k suffices to de­tect the max­i­mum de­crease.

This is pes­simistic be­cause the 10k al­le­les are not all the same effect size and GWAS stud­ies in­her­ently will tend to find the largest effects first. So the first 4 (or 86) hits are worth the most. The dis­tri­b­u­tion of effects is prob­a­bly some­thing like an in­verse ex­po­nen­tial dis­tri­b­u­tion: many small near-zero effects and a few large ones. Ri­etveld 2013 re­leased the be­tas for all SNPs, and the beta es­ti­mates can be plot­ted; each es­ti­mate is im­pre­cise and there are ar­ti­facts in the beta sizes (SSGAC con­firms that they were rounded to 3 dec­i­mal­s), but the dis­tri­b­u­tion looks like a ra­dioac­tive half-life graph, an in­verse ex­po­nen­tial dis­tri­b­u­tion. With a mean of 1, we can sim­u­late cre­at­ing a set of 10k effect sizes which are ex­po­nen­tially dis­trib­uted and have mean 5000 and SD close to (but larger than) 50 and mim­ics closely the bi­no­mial mod­el:

effects <- sort(rexp(10000)/1, decreasing=TRUE)
genomeOld <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), 10000, replace = TRUE), 0, effects) }
mean(replicate(10000, sum(genomeOld())))
# [1] 5000.270218
sd(replicate(10000, sum(genomeOld())))
# [1] 69.82652816
genomeNew <- function() { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), 10000, replace = TRUE), 0, effects) }

With a dys­genic effect of -0.445S­Ds, that’s a fall of the sum of ran­dom ex­po­nen­tials of ~31, which agrees closely with the differ­ence in poly­genic genome scores:

mean(replicate(10000, sum(genomeOld() - genomeNew())))
# [1] 29.75354558

For each draw from the old and new pop­u­la­tions, we can take the first 4 al­le­les, which were the ones as­signed the largest effects, and build a weak poly­genic score and com­pare means. For ex­am­ple:

polyNew <- replicate(1000, sum(genomeNew()[1:4]))
polyOld <- replicate(1000, sum(genomeOld()[1:4]))
t.test(polyOld, polyNew, alternative="greater")
#   Welch Two Sample t-test
# data:  polyOld and polyNew
# t = 0.12808985, df = 1995.8371, p-value = 0.8980908
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -0.7044731204  0.8029267301
# sample estimates:
#   mean of x   mean of y
# 17.72741040 17.67818359

Or to mimic 86 hits:

t.test(replicate(1000, sum(genomeOld()[1:86])), replicate(1000, sum(genomeNew()[1:86])))
#     Welch Two Sample t-test
# t = 1.2268929, df = 1997.6307, p-value = 0.2200074
# alternative hypothesis: true difference in means is not equal to 0
# 95% confidence interval:
#  -0.8642674547  3.7525210076
# sample estimates:
#   mean of x   mean of y
# 244.5471658 243.1030390

Us­ing the ex­po­nen­tial sim­u­la­tion, we can do a par­al­lelized power analy­sis: sim­u­late draws (i = 300) & tests for a va­ri­ety of sam­ple sizes to get an idea of what sam­ple size we need to get de­cent power with 86 hits.

library(parallel) # warning, Windows users

genomeOld <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.5, 0.5), length(efft), replace = TRUE), 0, efft) }
genomeNew <- function(efft) { ifelse(sample(c(FALSE,TRUE), prob=c(0.497, 1-0.497), length(efft), replace = TRUE), 0, efft) }

simulateStudy <- function(n, hits) {
        effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
        polyOld <- replicate(n, sum(genomeOld(effects)))
        polyNew <- replicate(n, sum(genomeNew(effects)))
        t <- t.test(polyOld, polyNew, alternative="greater")
        return(data.frame(N=n, P=t$p.value, PO.mean=mean(polyOld),, PN.mean=mean(polyNew), }

hits <- 86
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }

sampleSizes <- seq(500, 5000, by=100)
iters <- 300
powerExponential <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
#        N              P                  PO.mean                PN.mean
#  Min.   : 500   Min.   :0.000000000   Min.   :222.5525   Min.   :23.84966   Min.   :221.2894
#  1st Qu.:1600   1st Qu.:0.002991554   1st Qu.:242.8170   1st Qu.:26.46606   1st Qu.:241.3242
#  Median :2750   Median :0.023639517   Median :247.2059   Median :27.04467   Median :245.7044
#  Mean   :2750   Mean   :0.093184735   Mean   :247.3352   Mean   :27.06300   Mean   :245.8298
#  3rd Qu.:3900   3rd Qu.:0.107997575   3rd Qu.:251.7787   3rd Qu.:27.64103   3rd Qu.:250.2157
#  Max.   :5000   Max.   :0.997322161   Max.   :276.2614   Max.   :30.67000   Max.   :275.7741
#  Min.   :23.04527
#  1st Qu.:26.45508
#  Median :27.04299
#  Mean   :27.05750
#  3rd Qu.:27.63241
#  Max.   :30.85065
powerExponential$Power <- powerExponential$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponential); powers
# 1   500 0.2133333333
# 2   600 0.2833333333
# 3   700 0.2833333333
# 4   800 0.3133333333
# 5   900 0.3033333333
# 6  1000 0.3400000000
# 7  1100 0.4066666667
# 8  1200 0.3833333333
# 9  1300 0.4133333333
# 10 1400 0.4166666667
# 11 1500 0.4700000000
# 12 1600 0.4600000000
# 13 1700 0.4666666667
# 14 1800 0.4733333333
# 15 1900 0.5233333333
# 16 2000 0.5366666667
# 17 2100 0.6000000000
# 18 2200 0.5900000000
# 19 2300 0.5600000000
# 20 2400 0.6066666667
# 21 2500 0.6066666667
# 22 2600 0.6700000000
# 23 2700 0.6566666667
# 24 2800 0.7133333333
# 25 2900 0.7200000000
# 26 3000 0.7300000000
# 27 3100 0.7300000000
# 28 3200 0.7066666667
# 29 3300 0.7433333333
# 30 3400 0.7133333333
# 31 3500 0.7233333333
# 32 3600 0.7200000000
# 33 3700 0.7766666667
# 34 3800 0.7933333333
# 35 3900 0.7700000000
# 36 4000 0.8100000000
# 37 4100 0.7766666667
# 38 4200 0.8000000000
# 39 4300 0.8333333333
# 40 4400 0.8466666667
# 41 4500 0.8700000000
# 42 4600 0.8633333333
# 43 4700 0.8166666667
# 44 4800 0.8366666667
# 45 4900 0.8666666667
# 46 5000 0.8800000000
qplot(N, Power, data=powers)  + stat_smooth()
Power for a two-group com­par­i­son of old and new SNP datasets for test­ing a hy­poth­e­sis of dys­gen­ics

So for a well-pow­ered two-group com­par­i­son of 1950 & 2015 SNP datasets us­ing 86 SNPs, we would want ~4000 in each group for a to­tal n = 8000; we do have non­triv­ial power even at a to­tal n = 1000 (500 in each group means 21% pow­er) but a non-s­ta­tis­ti­cal­ly-sig­nifi­cant re­sult will be diffi­cult to in­ter­pret and if one wanted to do that, re­port­ing a Bayes fac­tor from a Bayesian hy­poth­e­sis test would make much more sense to ex­press clearly whether the (non-de­fin­i­tive) data is ev­i­dence for or against dys­gen­ics.

This is still too op­ti­mistic since we as­sumed the op­ti­mal sce­nario of only very old and very new genomes, while avail­able genomes are more likely to be dis­trib­uted fairly uni­formly be­tween 1950 and 2015. Per “Op­ti­mal de­sign in psy­cho­log­i­cal re­search”, Mc­Clel­land 1997, we ex­pect a penalty of ~2x in sam­ple size effi­ciency in go­ing from the op­ti­mal two-group ex­treme end­points de­sign to sam­ples be­ing uni­formly dis­trib­uted (due to much of our sam­ple size be­ing wasted on es­ti­mat­ing small effects) and so we would ex­pect our sam­ple size re­quire­ment to at least dou­ble to around n = 16000, but we can do a power sim­u­la­tion here as well. To get the effect size for each year, we sim­ply split the fre­quency de­crease over each year and gen­er­ate hy­po­thet­i­cal genomes with less of a fre­quency de­crease uni­formly dis­trib­uted 1950-2015, and do a lin­ear re­gres­sion to get a p-value for the year pre­dic­tor:

hits <- 86
sampleSizes <- seq(8000, 30000, by=1000)
iters <- 100
genome <- function(effects) {
  t <- sample(c(1:(2015-1950)), 1)
  decreasedFrequency <- 0.5 - (((0.5-0.497)/(2015-1950)) * t)
  geneFlips <- sample(c(FALSE,TRUE), prob=c(decreasedFrequency, 1-decreasedFrequency), replace = TRUE, length(effects))
  geneValues <- ifelse(geneFlips, effects, 0)
simulateStudy <- function(n, hits) {
        effects <- sort(rexp(10000)/1, decreasing=TRUE)[1:hits]
        d <- ldply(replicate(n, genome(effects), simplify=FALSE))
        l <- lm(PolygenicScore ~ Year, data=d)
        p <- anova(l)$`Pr(>F)`[1]
        return(data.frame(N=n, P=p, PO.mean=predict(l, newdata=data.frame(Year=1950)),
                                           PN.mean=predict(l, newdata=data.frame(Year=2015)))) }
parallelStudies <- function(n, itr) { ldply(mclapply(1:itr, function(x) { simulateStudy(n, hits); })); }
powerExponentialDistributed <- ldply(lapply(sampleSizes, function(n) { parallelStudies(n, iters) })); summary(powerExponential)
powerExponentialDistributed$Power <- powerExponentialDistributed$P<0.05
powers <- aggregate(Power ~ N, mean, data=powerExponentialDistributed); powers
#        N Power
# 1   8000  0.27
# 2   9000  0.32
# 3  10000  0.35
# 4  11000  0.33
# 5  12000  0.41
# 6  13000  0.34
# 7  14000  0.41
# 8  15000  0.48
# 9  16000  0.55
# 10 17000  0.62
# 11 18000  0.55
# 12 19000  0.60
# 13 20000  0.69
# 14 21000  0.61
# 15 22000  0.65
# 16 23000  0.63
# 17 24000  0.71
# 18 25000  0.67
# 19 26000  0.71
# 20 27000  0.74
# 21 28000  0.70
# 22 29000  0.79
# 23 30000  0.83
qplot(N, Power, data=powers)  + stat_smooth()
Power to de­tect dys­gen­ics effect with SNP sam­ples spread over time

In this case, the power sim­u­la­tion sug­ges­tions the need for triple rather than dou­ble the data, and so a to­tal of n = 30,000 to be well-pow­ered.

Mutation load

The pa­ter­nal mu­ta­tion load should show up as a in­crease (70 new mu­ta­tions per gen­er­a­tion, 35 years per gen­er­a­tion, so ~2 per year on av­er­age) over the past cen­tu­ry, while the ge­netic se­lec­tion will op­er­ate by re­duc­ing the fre­quency of vari­ants which in­crease in­tel­li­gence. If there are ~70 new mu­ta­tions per gen­er­a­tion and 2 harm­ful, and there is no longer any pu­ri­fy­ing se­lec­tion so that all 70 will tend to re­main pre­sent, how much does that com­pare to ex­ist­ing mu­ta­tion load av­er­ages and, more im­por­tant­ly, stan­dard de­vi­a­tions?

A mu­ta­tion load re­view leads me to some hard fig­ures from Si­mons et al 2014 (sup­ple­ment) us­ing data from Fu et al 2012; par­tic­u­larly rel­e­vant is fig­ure 3, the num­ber of sin­gle-nu­cleotide vari­ants per per­son over the Eu­ro­pean-Amer­i­can sam­ple, split by es­ti­mates of harm from least to most like­ly: 21345 + 15231 + 5338 + 1682 + 1969 = 45565. The sup­ple­men­tary ta­bles gives a count of all ob­served SNVs by cat­e­go­ry, which sum to 300209 + 8355 + 220391 + 7001 + 351265 + 10293 = 897514, so the av­er­age fre­quency must be 45565/897514=0.05, and then the bi­no­mial SD will be sqrt(897514*0.05*(1-0.05))=206.47. Con­sid­er­ing the two-sam­ple case of 1950 vs 2015, that’s an in­crease of 130 to­tal SNVs (65*2), which is 0.63S­Ds, hence:

power.t.test(d=(130/206), power=0.8)
#      Two-sample t test power calculation
#               n = 40.40035398
# ...

A to­tal of n = 80.

This par­tic­u­lar set up for the two-sam­ple test can be seen as a lin­ear model with the op­ti­mum de­sign of al­lo­cat­ing half the sam­ple to each ex­treme (see again Mc­Clel­land 1997); but more re­al­is­ti­cal­ly, there is an even dis­tri­b­u­tion across years, in which case the penalty is 2x and n = 160.


There are some po­ten­tial prob­lems:

  1. Range re­stric­tion: in many IQ-re­lated stud­ies, fail­ure to ac­count for se­lec­tion effects yield­ing a lim­ited range of IQs may ; this is true in gen­eral but par­tic­u­larly com­mon in IQ stud­ies be­cause se­lec­tion on IQ (eg sam­ples of con­ve­nience us­ing only col­lege stu­dents) is so uni­ver­sal in hu­man so­ci­ety

    This may not be such a large is­sue when deal­ing with poly­genic scores; even se­vere IQ se­lec­tion effects will in­crease poly­genic scores only some­what be­cause the poly­genic scores ex­plain so lit­tle of IQ vari­ance in the first place.

  2. Self­-s­e­lec­tion by age: if peo­ple pro­vid­ing ge­netic data are not ran­dom sam­ples, then there may be pseudo-trends which can mask a real dys­genic trend or cre­ate a pseudo-dys­genic trend where there is none. For ex­am­ple, if young peo­ple buy­ing genome or SNP data tend to be above-av­er­age in in­tel­li­gence and sci­en­tific in­ter­est (which anec­do­tally they cer­tainly do seem to be), while old peo­ple tend to get genomes or SNP data due to health prob­lems (and oth­er­wise have av­er­age lev­els of in­tel­li­gence and thus poly­genic score), then in com­par­ing young vs old, one might find not a dys­genic but a pseudo-eu­genic trend in­stead! Con­verse­ly, it could be the other way around, if much fewer el­derly get ge­netic data and younger peo­ple are more con­cerned about fu­ture health or are go­ing along with a fad, pro­duc­ing a pseudo-dys­genic effect in­stead (eg in the PGP genome data, there seem to be dis­pro­por­tion­ately more PhDs who are quite el­der­ly, while younger par­tic­i­pants are a more scat­ter­shot sam­ple from the gen­eral pop­u­la­tion; prob­a­bly re­lat­ing to the cir­cum­stances of PGP’s found­ing & Har­vard home).

    This is prob­a­bly an is­sue with data­bases that rely on vol­un­tary in­di­vid­ual con­tri­bu­tions, such as PGP, where se­lec­tion effects have free play. It would be much less of an is­sue with lon­gi­tu­di­nal stud­ies where mo­ti­va­tions and par­tic­i­pa­tion rates will not differ much by age. Since most dys­genic the­o­ries ac­cept that recorded IQ scores have re­mained sta­ble over the 20th cen­tury and the de­creases in ge­netic po­ten­tial ei­ther have not man­i­fested yet or have been masked by the Flynn effect & greater fa­mil­iar­ity with tests & loss of some g-load­ing, one might rea­son that prox­ies like ed­u­ca­tional achieve­ment should be in­creas­ing through­out one’s sam­ple (s­ince they are known to have in­creased), and a lack of such a trend in­di­cates se­lec­tion bias.

Genetic data availability


The known pro­pri­etary data­bases have long been large enough to carry out ei­ther analy­sis, as well as count­less other analy­ses (but have failed to and rep­re­sent a ):

  1. The mu­ta­tion load analy­sis re­quires a whole-genome sam­ple size small enough to have been car­ried out by in­nu­mer­able groups post-2009.

  2. For SNPs, an in­com­plete list of ex­am­ples of pub­li­ca­tions based on large sam­ples:

The ex­ist­ing pri­vate groups do not seem to have any in­ter­est in test­ing dys­gen­ics, with the pos­si­ble ex­cep­tion of fu­ture GWAS stud­ies ex­am­in­ing fer­til­i­ty, one of which is men­tioned by :

At the time of writ­ing this re­view, Mills and her re­search team at the Uni­ver­sity of Ox­ford are cur­rently lead­ing a large con­sor­tium to en­gage in the first ever genome-wide as­so­ci­a­tion search (GWAS) and meta-analy­sis of re­pro­duc­tive choice (age at first birth; num­ber of chil­dren), con­ducted in both men and women in over 50 data sets, with the re­sults repli­cated in ad­di­tional datasets in a large sam­ple.

The hits in such a GWAS might over­lap with in­tel­li­gence hits, and if the mul­ti­ple hits in­crease in­tel­li­gence but de­crease fer­til­ity or vice versa (as com­pared to de­creas­ing or in­creas­ing both), that would be ev­i­dence for dys­gen­ics. Or, as­sum­ing the be­tas are re­port­ed, poly­genic scores for fer­til­ity and in­tel­li­gence could be es­ti­mated in in­de­pen­dent sam­ples and checked for an in­verse cor­re­la­tion.


There are a few sources of data, pri­mar­ily SNP data, which are freely avail­able to all users:

  1. : un­us­able due to a de­lib­er­ate pol­icy de­ci­sion by 1000 Genomes to delete all phe­no­type data, in­clud­ing age; sim­i­lar is 69 Genomes. Both likely would be un­us­able due to the di­ver­sity of the global sam­ple (there is no rea­son to think that dys­gen­ics pres­sures are op­er­at­ing in every pop­u­la­tion at the same strength)

  2. OpenSNP: host­ing for user-pro­vided SNP & phe­no­type data with dumps avail­able; hosts ~2k SNP datasets, but only 270 users have birth-years

  3. SNPedia like­wise hosts SNP data (over­lap­ping with OpenSNP) and genome data, but a very small num­ber

  4. Genomes un­zipped pro­vides a small amount of data

  5. DNA.LAND: claims n = 8k based on pub­lic par­tic­i­pa­tion & in­put (n = 43k as of ), but seems to then re­strict ac­cess to a small set of re­searchers

  6. Ex­ome Ag­gre­ga­tion Con­sor­tium: n = 61,486 ex­omes; phe­no­type data is un­avail­able

  7. (PGP): prob­a­bly the sin­gle largest source of open SNP & genome da­ta. ~1252 par­tic­i­pants have reg­is­tered birth­dates ac­cord­ing to demographics.tsv, and their sta­tis­tics page’s graphs in­di­cates <300 whole genomes and <1k SNPs. Phe­no­type data has been re­cently re­leased as a SQLite data­base, mak­ing it eas­ier to work with.

    • Genomes: brows­ing the user lists for ‘Whole genome datasets’, I es­ti­mate a to­tal of ~222; look­ing at the first and last 22 en­tries, 34 had ages/birth-years, so ~75% of the whole genomes come with the nec­es­sary birth-year data, in­di­cat­ing ~166 us­able genomes for the pur­pose of test­ing dys­gen­ics. With the most re­cent one up­loaded on 2015-10-12, and the ear­li­est recorded be­ing 2011-09-16, that sug­gests the avail­able genome num­ber in­creases by ~0.25/­day. 166 is un­com­fort­ably close to the re­quire­ment for a well-pow­ered test, and there may not be enough data to ac­count for glitches in the data or al­low for more com­pli­cated sta­tis­ti­cal test­ing, but if we wanted to dou­ble the avail­able data, we’d only need to wait around 885 days or 2.5 years (or less, de­pend­ing on whether the col­lapse in genome se­quenc­ing prices con­tinue and prices drop be­low even the cur­rent $1k genomes).
    • SNPs: PGP has ~656 23andMe SNP datasets (the num­ber of SNP datasets sourced from other providers is quite small so I did­n’t in­clude them), dated 2015-10-21–2011-01-06, so as­sum­ing same birth-date per­cent­age, 0.37 per day. Un­for­tu­nate­ly, to get 30k SNP datasets through PGP, we would have to wait (lin­early ex­trap­o­lat­ing) 291 years. (Mak­ing mat­ters worse, in Oc­to­ber 2015, 23andMe dou­bled its price and re­duced the qual­ity of SNP cov­er­age, which will dis­cour­age many users and push other users to pur­chase whole-genome se­quenc­ing in­stead.)

Power analysis for racial admixture studies of continuous variables

I con­sider power analy­sis of a ge­nomic racial ad­mix­ture study for de­tect­ing ge­netic group differ­ences affect­ing a con­tin­u­ous trait such as IQ in US African-Amer­i­cans, where an­ces­try is di­rectly mea­sured by genome se­quenc­ing and the com­par­isons are all with­in-fam­ily to elim­i­nate con­found­ing by pop­u­la­tion struc­ture or racis­m/­col­oris­m/dis­crim­i­na­tion. The nec­es­sary sam­ple size for well-pow­ered stud­ies is closely re­lated to the av­er­age size of differ­ences in an­ces­try per­cent­age be­tween sib­lings, as the up­per bound on IQ effect per per­cent­age is small, re­quir­ing large differ­ences in an­ces­try to de­tect eas­i­ly. A with­in-fam­ily com­par­i­son of sib­lings, due to the rel­a­tively small differ­ences in an­ces­try be­tween sib­lings es­ti­mated from IBD mea­sure­ments of sib­lings, might re­quire n > 50,000 pairs of sib­lings to de­tect pos­si­ble effects on IQ, an in­fea­si­ble sam­ple size. An al­ter­na­tive de­sign fo­cuses on in­creas­ing the avail­able an­ces­try differ­ences within a fam­ily unit by com­par­ing adoptees with sib­lings; the larger with­in-pop­u­la­tion stan­dard de­vi­a­tion of an­ces­try cre­ates larger & more eas­i­ly-de­tected IQ differ­ences. A ran­dom-effects meta-analy­sis of past ad­mix­ture & an­ces­try stud­ies sug­gests the SD in het­ero­ge­neous sam­ples may range from 2% to 20% with a mean of 11% (95% pre­dic­tive in­ter­val), yield­ing sam­ple sizes of n > 20,000, n = 1100, and n = 500. Hence, an adop­tion study is prob­a­bly in the fea­si­ble range, with re­quired sam­ple sizes com­pa­ra­ble to an­nual adop­tion rates among US African-Amer­i­cans.

ex­am­ine racial phe­no­typic differ­ences in traits such as blood pres­sure by com­par­ing peo­ple with an­ces­try from mul­ti­ple groups, and cor­re­lat­ing differ­ences in an­ces­try per­cent­age with differ­ences in the phe­no­type. So, for ex­am­ple, African-Amer­i­cans have higher blood­-pres­sure than white Amer­i­cans, and most African-Amer­i­cans have an av­er­age white an­ces­try of some­thing like 20-25% (see lat­er); if hav­ing 26% white an­ces­try pre­dicts slightly lower blood pres­sure while 24% pre­dicts high­er, that sug­gests the differ­ence is (as is cur­rently be­lieved) ge­net­ic; and this logic can be used to nar­row down to spe­cific chro­mo­some re­gions, and has con­tributed to study of .

One ap­pli­ca­tion would be to thorny ques­tions like po­ten­tial group differ­ences in non-med­ical traits like in­tel­li­gence. The stan­dard ad­mix­ture de­sign, re­quir­ing a few thou­sand sub­jects span­ning the full range, might not nec­es­sar­ily work here here be­cause of the claimed en­vi­ron­men­tal effects. A pro­posed res­o­lu­tion to the ques­tion is to do an ad­mix­ture study com­par­ing African-Amer­i­can sib­lings. Sib­lings are highly ge­net­i­cally re­lated on av­er­age (50%) but in a ran­dom­ized fash­ion due to re­com­bi­na­tion; so two sib­lings, in­clud­ing fra­ter­nal twins, born to the same par­ents in the same fam­ily in the same neigh­bor­hood go­ing to the same schools, will nev­er­the­less have many differ­ent vari­ants, and will differ in how re­lated they are—the av­er­age is 50% but it could be as low as 45% or high as 55%. So given two sib­lings, they will differ slightly in their white an­ces­try, and if in­deed white an­ces­try brings with it more in­tel­li­gence vari­ants, then the sib­ling with a higher whiter per­cent­age ought to be slightly more in­tel­li­gent on av­er­age, and this effect will have to be causal, as the in­her­i­tance is ran­dom­ized and all other fac­tors are equal by de­sign. (A re­sult us­ing an­ces­try per­cent­ages mea­sured in the gen­eral pop­u­la­tion, out­side fam­i­lies, would be able to make far more pow­er­ful com­par­isons by com­par­ing peo­ple with ~0% white an­ces­try to those with any­where up to 100%, and re­quire small sam­ple sizes, and such analy­ses have been done with the ex­pected re­sult, but are am­bigu­ous & to­tally un­con­vinc­ing, as the cor­re­la­tion of greater white­ness with in­tel­li­gence could eas­ily be due to greater SES or greater black­ness could be a marker for re­cent im­mi­gra­tion or any of a num­ber of con­founds that ex­ist.) This has his­tor­i­cally been diffi­cult or im­pos­si­ble since how does one mea­sure the ac­tual an­ces­try in sib­lings? But with the rise of cheap geno­typ­ing, pre­cise mea­sure of ac­tual (rather than av­er­age) an­ces­try can be done for <$100, so that is no longer an ob­sta­cle.

Sibling power analysis

How many sib­ling pairs would this re­quire?

  • you are try­ing to regress IQ_difference ~ Ancestry_difference
  • the SD of the IQ differ­ence of sib­lings is known—it’s ~13 IQ points (non­shared en­vi­ron­ment + differ­ences in ge­net­ics)
  • of this, a small frac­tion will be ex­plained by the small differ­ence in an­ces­try per­cent­age
  • the power will be de­ter­mined by the ra­tio of the sib­ling SD to the IQ-d­iffer­ence-due-to-ances­try-d­iffer­ence SD, giv­ing an effect size, which com­bined with the usual al­pha=0.05 and be­ta=0.80, uniquely de­ter­mines the sam­ple size
  • IQ-d­iffer­ence-due-to-ances­try-d­iffer­ence SD will be the ad­van­tage of bet­ter an­ces­try times how much an­ces­try differs
  • if you knew the num­ber of rel­e­vant al­le­les, you could cal­cu­late through the bi­no­mial the ex­pected SD of sib­ling an­ces­tor differ­ences. As there are so many al­le­les, it will be al­most ex­actly nor­mal. So it’s not sur­pris­ing that sib­lings over­all, for all vari­ants, are 50% IBD with a SD of 4%.

If we treated it as sim­ply as pos­si­ble, for an anal­o­gous height analy­sis says they mea­sured 588 mark­ers. So a bi­no­mial with 588 draws and p = 0.5 im­plies that 147 mark­ers are ex­pected to be the same:

588 * 0.5*(1-0.5)
# [1] 147

and the dis­tri­b­u­tion around 147 is 12, which is ~8%:

sqrt((588 * 0.5*(1-0.5)))
# [1] 12.12435565
# [1] 0.08163265306

Viss­cher does a more com­pli­cated analy­sis tak­ing into ac­count close­ness of the mark­ers and gets a SD of 3.9%: equa­tion 7; vari­ance = 1/(16*L) - (1/3*L^2), where L = 35, so

L=35; sqrt(1/(16*L) - (1/(3*L^2)))
# [1] 0.03890508247

And the­o­ret­i­cal mod­el­ing gives an ex­pected sib­ling SD of SD of 3.92%/3.84% (Table 2), which are nearly iden­ti­cal. So what­ever the mean ad­mix­ture is, I sup­pose it’ll have a sim­i­lar SD of 4-8% of it­self.

IIRC, African-Amer­i­cans are ~25% ad­mixed, so with a mean ad­mix­ture of 25%, we would ex­pect sib­lings differ­ences to be or 1% differ­ence.

If that 75% miss­ing white an­ces­try ac­counts for 9 IQ points or 0.6S­Ds, then each per­cent­age of white an­ces­try would be 0.6/75 =0.008 SDs.

So that SD of 1% more white an­ces­try yields an SD of 0.008 IQ, which is su­per­im­posed on the full sib­ling differ­ence of 0.866, giv­ing a stan­dard­ized effect size/d of 0.008 / 0.866 = 0.0092

Let me try a power sim­u­la­tion:

n <- 10000
siblings <- data.frame(
sibling1AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),

sibling2AncestryPercentage = rnorm(n, mean=25, sd=1),
sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))

siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(0.008*15))
siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(0.008*15))
siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )

# ...
# siblingAncestryDifference siblingIQDifference
# Min.   :-5.370128122      Min.   :-68.2971343
# 1st Qu.:-0.932086950      1st Qu.:-11.7903864
# Median : 0.002384529      Median : -0.2501536
# Mean   : 0.007831583      Mean   : -0.4166863
# 3rd Qu.: 0.938513265      3rd Qu.: 11.0720667
# Max.   : 5.271052675      Max.   : 67.5569825
summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
# ...Coefficients:
#                             Estimate Std. Error  t value  Pr(>|t|)
# (Intercept)               -0.4192761  0.1705125 -2.45892 0.0139525
# siblingAncestryDifference  0.3306871  0.1220813  2.70874 0.0067653
# Residual standard error: 17.05098 on 9998 degrees of freedom
# Multiple R-squared:  0.000733338,    Adjusted R-squared:  0.0006333913
# F-statistic: 7.337294 on 1 and 9998 DF,  p-value: 0.006765343
confint(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))
#                                    2.5 %         97.5 %
# (Intercept)               -0.75351500523 -0.08503724643
# siblingAncestryDifference  0.09138308561  0.56999105507

admixtureTest <- function(n, alpha=0.05, ancestryEffect=0.008) {
 siblings <- data.frame(
     sibling1AncestryPercentage =pmax(0, rnorm(n, mean=25, sd=1)),
     sibling1NonancestryIQ = rnorm(n, mean=0, sd=12),

     sibling2AncestryPercentage = pmax(0,rnorm(n, mean=25, sd=1)),
     sibling2NonancestryIQ = rnorm(n, mean=0, sd=12))

 siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
 siblings$sibling2TotalIQ <- with(siblings, sibling2NonancestryIQ + sibling2AncestryPercentage*(ancestryEffect*15))
 siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - sibling2AncestryPercentage)
 siblings$siblingIQDifference <- with(siblings, sibling1TotalIQ - sibling2TotalIQ )

 p <- summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]

power <- function(n, iters=10000, n.parallel=8) {
    mean(unlist(mclapply(1:n.parallel, function(i) {
        replicate(iters/n.parallel, admixtureTest(n)) }))) }

# powers <- sapply(seq(100, 10000, by=10), power)
# [1] 0.0502
# [1] 0.0535
# [1] 0.0642
# [1] 0.2251
# [1] 0.6077

So the es­ti­mated sam­ple size is ex­tremely large, well into the scores of thou­sands. This is large enough that it will be some time be­fore biobanks or pop­u­la­tion sam­ples are well-pow­ered: par­tic­i­pants may not have sib­lings, those sib­lings may be in­clud­ed, only ~15% of the Amer­i­can pop­u­la­tion is AA, all par­tic­i­pants must be se­quenced, im­per­fect re­li­a­bil­ity of mea­sure­ments can greatly in­crease the nec­es­sary sam­ple size, and so on. If it re­quires n = 70,000, half of par­tic­i­pants have a match­ing sib­ling, and it’s drawn pro­por­tion­ally from the gen­eral pop­u­la­tion, that would im­ply that a to­tal sam­ple size of al­most 1m. (For com­par­ison, that’s about twice the size of the , and the US co­hort pro­gram aims for 1m to­tal par­tic­i­pants by 2020.)

Sim­pli­fi­ca­tions aside, it is diffi­cult to see any way to bring this method down into the low thou­sands range, as that would re­quire sib­lings to vary tremen­dously more in an­ces­try, have much more ho­mo­ge­neous IQs than they do, or an­ces­try to be vastly more po­tent than it could pos­si­bly be.

Adoption power analysis

One pos­si­bil­ity would be to ex­am­ine a differ­ent pop­u­la­tion, per­haps one with more African an­ces­try and thus larger be­tween-si­b­ling an­ces­try differ­ences and effects, such as ; but that would raise ques­tions about rel­e­vance to the USA. So an­other pos­si­bil­ity is to drop the idea of us­ing only bi­o­log­i­cal sib­lings. Is there any way to have an­ces­try differ­ences as large as in the gen­eral pop­u­la­tion, but within a fam­i­ly? Half-si­b­lings come to mind but those more typ­i­cally tend to join the house­hold as older kids or teenagers, so aren’t so good. One pos­si­bil­ity is adoptees: there are a sub­stan­tial num­ber of African-Amer­i­can chil­dren adopted into other African-Amer­i­can house­holds (white par­ents adopt­ing black chil­dren is , de­scribed as “de­creas­ing dra­mat­i­cally”, but still sub­stan­tial in to­tal num­bers, at least 20,000), and even a no­tice­able num­ber of African chil­dren adopted abroad (14,800 from Ethiopia just 1999-2014, with more adop­tion from Nige­ria & the Con­go). The same logic of the with­in-fam­ily study should ap­ply but to un­re­lated sib­lings who will have far greater differ­ences in an­ces­try now (pos­si­bly any­where up to 50% if an African child is adopted into an African-Amer­i­can fam­ily with con­sid­er­able white an­ces­try & some luck). This would in­crease power dra­mat­i­cal­ly, per­haps enough to bring the study within the realm of near-fu­ture fea­si­bil­i­ty.

Ex­am­in­ing adop­tions of African chil­dren would not be a con­vinc­ing way of es­tab­lish­ing group differ­ences, par­tic­u­larly for IQ, as there are many known en­vi­ron­men­tal harms (eg pre-na­tal lack of is known to cause large im­pair­ments in cog­ni­tive which can­not be re­paired later in life, and poor iodiza­tion is fre­quent in Africa), so while ex­am­in­ing African adoptees would doubt­less re­quire a very small sam­ple size, the re­sults would be un­in­ter­pretable. So the more in­ter­est­ing case is in­stead ex­am­in­ing AA adoptees/si­b­lings, all of whom are raised in a wealthy (and iodine-suffi­cient) in­dus­tri­al­ized coun­try.

In this case, we’re con­sid­er­ing a pair of an AA sib­ling with the same IQ & an­ces­try dis­tri­b­u­tions, as com­pared with adoptees who are ei­ther African (100% African an­ces­try) or like­wise have the same IQ/ances­try dis­tri­b­u­tions. Since the sib­ling & adoptee are un­re­lat­ed, they effec­tively vary & differ as much as two ran­dom peo­ple from the gen­eral pop­u­la­tion would in IQ & African an­ces­try, ex­cept for shared-en­vi­ron­ment effects on IQ; shared-en­vi­ron­ment for adult IQ is rel­a­tively low, maybe 10% of vari­ance. So in­stead of an SD of 15, they would vary mod­er­ately less, like 14.23 points (sqrt(15^2 * 0.9)).

One as­sump­tion here is a shared mean: one would ac­tu­ally ex­pect, given the as­so­ci­a­tion of lighter skin with higher wealth/SES and darker with lower wealth/SES, that the adopt­ing par­ents (and hence their bi­o­log­i­cal chil­dren) would be rel­a­tively high on Eu­ro­pean an­ces­try, and con­verse­ly, the moth­ers giv­ing up chil­dren for adop­tion would be rel­a­tively low, so the ex­pected differ­ence in an­ces­try is higher than sim­u­lat­ed. As­sum­ing equal means, how­ev­er, is a con­ser­v­a­tive as­sump­tion since if such a cor­re­la­tion holds, the differ­ences will be larg­er, hence the an­ces­try effect sizes larg­er, hence smaller sam­ple sizes re­quired. In the ex­treme ver­sion of this, the adop­tive fam­ily is white and so the an­ces­try differ­ence is max­i­mal (~99% vs ~20%), re­quir­ing even smaller sam­ple sizes, but at the cost of in­tro­duc­ing com­pli­ca­tions like whether there are in­ter­ac­tions with the white adop­tive fam­ily not present in an AA adop­tive fam­i­ly; in any case, such tran­s-ra­cial adop­tion is ap­par­ently un­pop­u­lar now, so it may not come up much.

Mean population European ancestry & population standard deviation

An­ces­try effects re­main as be­fore; the mean an­ces­try is not too im­por­tant as long as it’s not near 0, but since adoptees are drawn from the gen­eral pop­u­la­tion, the an­ces­try SD must be ad­justed but it’s un­clear what the right SD here is—cited stud­ies range from 4% up to 11%, and this is a key pa­ra­me­ter for power (with 4%, then sib­ling and adoptee will tend to be quite sim­i­lar on an­ces­try per­cent­age & much more data will be re­quired, but with 11% they will differ a good deal and make re­sults stronger).

Re­ported fig­ures from the ge­net­ics lit­er­a­ture for Eu­ro­pean an­ces­try in US African-Amer­i­can range from 14% to 24%, re­flect­ing both sam­pling er­ror and var­i­ous bi­ases & self­-s­e­lec­tion & ge­o­graph­ic/re­gional effects in the datasets:

Con­sid­er­ing just stud­ies with us­able an­ces­try per­cent­ages, pop­u­la­tion SD, and n, and us­ing in­ferred SDs from Sig­norel­lo:

admixture <- read.csv(stdin(), header=TRUE, colClasses=c("factor", "numeric", "numeric", "integer"))
"Halder et al 2008",0.143,0.133,136
"Ducci et al 2009",0.07,0.09,864
"Signorello et al 2010",0.071,0.08,379
"Bryc et al 2010",0.185,0.0465,365
"Nassir et al 2012",0.225,0.147,11712
"Bryc et al 2014",0.24,0.17,5269
"Kirkegaard et al 2016",0.17,0.11,140

# what is the standard error/precision of a population SD?
admixture$SD.SE <- (0.71*admixture$SD) / sqrt(admixture$N)
#                    Study        Mean                 SD                  N               SD.SE
#  Bryc et al 2010      :1   Min.   :0.0700000   Min.   :0.0465000   Min.   :  136.0   Min.   :0.0009644066
#  Bryc et al 2014      :1   1st Qu.:0.1070000   1st Qu.:0.0850000   1st Qu.:  252.5   1st Qu.:0.0016954481
#  Ducci et al 2009     :1   Median :0.1700000   Median :0.1100000   Median :  379.0   Median :0.0021739221
#  Halder et al 2008    :1   Mean   :0.1577143   Mean   :0.1109286   Mean   : 2695.0   Mean   :0.0034492579
#  Kirkegaard et al 2016:1   3rd Qu.:0.2050000   3rd Qu.:0.1400000   3rd Qu.: 3066.5   3rd Qu.:0.0047591374
#  Nassir et al 2012    :1   Max.   :0.2400000   Max.   :0.1700000   Max.   :11712.0   Max.   :0.0080973057
#  Signorello et al 2010:1

r.mean <- rma(yi=Mean, sei=SD/sqrt(N), measure="SMD", ni=N, data=admixture); r.mean
# Random-Effects Model (k = 7; tau^2 estimator: REML)
# tau^2 (estimated amount of total heterogeneity): 0.0046 (SE = 0.0027)
# tau (square root of estimated tau^2 value):      0.0680
# I^2 (total heterogeneity / total variability):   99.82%
# H^2 (total variability / sampling variability):  566.51
# Test for Heterogeneity:
# Q(df = 6) = 3477.2614, p-val < .0001
# Model Results:
# estimate       se     zval     pval    ci.ub
#   0.1578   0.0258   6.1187   <.0001   0.1072   0.2083
#   pred     se  ci.ub  cr.ub
# 0.1578 0.0258 0.1072 0.2083 0.0153 0.3003 <- rma(yi=SD, sei=SD.SE, measure="SMD", ni=N, data=admixture);
# Random-Effects Model (k = 7; tau^2 estimator: REML)
# tau^2 (estimated amount of total heterogeneity): 0.0018 (SE = 0.0011)
# tau (square root of estimated tau^2 value):      0.0425
# I^2 (total heterogeneity / total variability):   99.77%
# H^2 (total variability / sampling variability):  440.67
# Test for Heterogeneity:
# Q(df = 6) = 3819.2793, p-val < .0001
# Model Results:
# estimate       se     zval     pval    ci.ub
#   0.1108   0.0162   6.8587   <.0001   0.0792   0.1425
#    pred     se  ci.ub  cr.ub
#  0.1108 0.0162 0.0792 0.1425 0.0216 0.2001

forest(r.mean, slab=admixture$Study)
forest(, slab=admixture$Study)
Meta-an­a­lytic sum­mary of US African-Amer­i­can’s mean Eu­ro­pean an­ces­try per­cent­age & pop­u­la­tion SD of that per­cent­age

There is high het­ero­gene­ity, as ex­pect­ed, and the meta-an­a­lytic sum­mary are con­sis­tent with sim­ply tak­ing the mean, so meta-analy­sis was not re­ally nec­es­sary.

The is­sue of het­ero­gene­ity de­pends on how one wants to in­ter­pret these num­bers: as the true la­tent African-Amer­i­can pop­u­la­tion mean/SD of Eu­ro­pean an­ces­try, or as a way to es­ti­mate the pos­si­ble spread of sam­pling? In the for­mer, the het­ero­gene­ity is a se­ri­ous is­sue be­cause it sug­gests the es­ti­mate may be badly bi­ased or at least is highly im­pre­cise; in the lat­ter, it is both a curse and a ben­e­fit, since it im­plies that it is pos­si­ble to re­cruit for ge­net­ics stud­ies sam­ples with a wide range of an­ces­try (thereby greatly in­creas­ing sta­tis­ti­cal pow­er) but also that one might get un­lucky & wind up with a very an­ces­try-ho­mo­ge­neous sam­ple (if the sam­ple turns out to have an SD as high as 20%, ex­cel­lent; if it’s as low as 7.9%, one is in trou­ble).

So for power analy­sis one might check the meta-an­a­lytic mean case, as well as the (a 95% CI around the SD/mean does not mean that 95% of the true effects, in­clud­ing the in­her­ent het­ero­gene­ity, will fall in that in­ter­val): SDs of 2%, 11%, and 20%. (For any cost-ben­e­fit analy­sis or try­ing to op­ti­mize ex­pen­di­tures, one would want to work with the pos­te­rior dis­tri­b­u­tions to av­er­age over every­thing, but for just gen­eral in­for­ma­tive pur­pos­es, those 3 are good pa­ra­me­ters to check.)

Power simulation


adopteeTest <- function(n, alpha=0.05, ancestryEffect=0.008, populationAncestryMean=0.1440, populationAncestrySD=0.1008, c=0.1) {
 unrelatedSiblingSD <- sqrt(15^2 * (1-c)) # subtract 10% for same shared-environment
 siblings <- data.frame(
     sibling1AncestryPercentage = pmax(0, rnorm(n, mean=populationAncestryMean*100, sd=populationAncestrySD*100)),
     sibling1NonancestryIQ = rnorm(n, mean=0, sd=unrelatedSiblingSD),

     adopteeAncestryPercentage = pmax(0, rnorm(n, mean=populationAncestryMean*100, sd=populationAncestrySD*100)),
     adopteeNonancestryIQ = rnorm(n, mean=0, sd=unrelatedSiblingSD))

 siblings$sibling1TotalIQ <- with(siblings, sibling1NonancestryIQ + sibling1AncestryPercentage*(ancestryEffect*15))
 siblings$adopteeTotalIQ  <- with(siblings, adopteeNonancestryIQ + adopteeAncestryPercentage*(ancestryEffect*15))
 siblings$siblingAncestryDifference <- with(siblings, sibling1AncestryPercentage - adopteeAncestryPercentage)
 siblings$siblingIQDifference       <- with(siblings, sibling1TotalIQ - adopteeTotalIQ )

 p <- summary(lm(siblingIQDifference ~ siblingAncestryDifference, data=siblings))$coefficients[8]

power <- function(n, sd, iters=10000, n.parallel=8) {
    mean(unlist(mclapply(1:n.parallel, function(i) {
    replicate(iters/n.parallel, adopteeTest(n, populationAncestrySD=sd)) }))) }

ns <- seq(100, 10000, by=100)
powerLow  <- sapply(ns, function(n) { power(n, sd=0.0216)})
powerMean <- sapply(ns, function(n) { power(n, sd=0.1108)})
powerHigh <- sapply(ns, function(n) { power(n, sd=0.2001)})

library(ggplot2); library(gridExtra)
pl <- qplot(ns, powerLow)  + coord_cartesian(ylim = c(0,1))
pm <- qplot(ns, powerMean) + coord_cartesian(ylim = c(0,1))
ph <- qplot(ns, powerHigh) + coord_cartesian(ylim = c(0,1))
grid.arrange(pl, pm, ph, ncol=1)
Power analy­sis for de­tect­ing Eu­ro­pean an­ces­try on IQ in an adoptee fam­ily study with pre­dicted low/mean/high pop­u­la­tion vari­ance in an­ces­try per­cent­age (higher vari­ance=larger sta­tis­ti­cal pow­er=fewer sam­ples re­quired)

So for the worst-case SD, sam­ple size is un­clear but n > 20,000 pairs; mean SD, n = 1100 pairs; high SD, n = 500 pairs. The lat­ter two are fea­si­ble amounts for pop­u­la­tion reg­istries or adop­tion-fo­cused co­hort stud­ies. Thus genome adop­tion stud­ies, com­bined with the much less pow­er­ful but more com­mon with­in-si­b­ling com­par­isons, are ca­pa­ble of de­liv­er­ing pre­cise an­swers to long-s­tand­ing ques­tions about the ori­gins of group differ­ences with mod­er­ate sam­ple sizes.

Operating on an aneurysm

In the ex­cel­lent neu­ro­surgery mem­oir Do No Harm: Sto­ries of Life, Death, and Brain Surgery ( 2014), chap­ter 2 “”, there is a pas­sage on weigh­ing the costs of ac­tion and in­ac­tion:

“A thir­ty-t­wo-year-old wom­an,” he said terse­ly. “For surgery to­day. Had some headaches and had a brain scan.” As he talked a brain scan flashed up on the wall.

…“It’s an un­rup­tured aneurysm, seven mil­lime­tres in size,” Fion­a—the most ex­pe­ri­enced of the reg­is­trars—­said. “So there’s a point zero five per cent risk of rup­ture per year ac­cord­ing to the in­ter­na­tional study pub­lished in 1998.” “And if it rup­tures?” “Fifteen per cent of peo­ple die im­me­di­ately and an­other thirty per cent die within the next few weeks, usu­ally from a fur­ther bleed and then there’s a com­pound in­ter­est rate of four per cent per year.”

…If we did noth­ing the pa­tient might even­tu­ally suffer a haem­or­rhage which would prob­a­bly cause a cat­a­strophic stroke or kill her. But then she might die years away from some­thing else with­out the aneurysm ever hav­ing burst. She was per­fectly well at the mo­ment, the headaches for which she had had the scan were ir­rel­e­vant and had got bet­ter. The aneurysm had been dis­cov­ered by chance. If I op­er­ated I could cause a stroke and wreck her—the risk of that would prob­a­bly be about four or five per cent. So the acute risk of op­er­at­ing was roughly sim­i­lar to the life-time risk of do­ing noth­ing. Yet if we did noth­ing she would have to live with the knowl­edge that the aneurysm was sit­ting there in her brain and might kill her any mo­ment.

Read­ing this, I was a lit­tle sur­prised by Marsh’s eval­u­a­tion given those spe­cific num­bers. In­tu­itive­ly, it did not seem to me that a sin­gle risk of 5% was any­where near as bad as a life­long risk of 0.5%, for a 32 year old woman who would prob­a­bly live an­other 50 years—the one num­ber is 10x big­ger than the oth­er, but the other num­ber is 50x big­ger, and a quick heuris­tic for the to­tal prob­a­bil­ity of many in­de­pen­dent small prob­a­bil­i­ties is to just sum them up, sug­gest­ing that the risk of the un­treated aneurysm was much worse (50*0.005=0.25, and 0.25>0.05). So I thought after I fin­ished read­ing the book, I would work it out a lit­tle more ac­cu­rate­ly.


Specifi­cal­ly, this is a 32yo woman and the UK fe­male life ex­pectancy is ~80yo in 2015, so she had ~48 years left. The con­se­quences of the aneurysm burst­ing is a large chance of in­stant death or else se­vere dis­abil­ity with death to soon fol­low; the con­se­quence of surgery go­ing wrong is also in­stant death or se­vere dis­abil­i­ty, pre­sum­ably with a high chance of death soon fol­low­ing, so it looks like we can as­sume that the bad out­come in ei­ther case is the same. what is the prob­a­bil­ity of the aneurysm never burst­ing in all 48 years? (1-0.005)^48 = 0.786, or a prob­a­bil­ity of burst­ing of 21%. 21% is 4x larger than 5%. Since 21% is 4x larger and the con­se­quences are sim­i­lar, this would sug­gest that the risks are not “roughly sim­i­lar” and it looks much worse to not op­er­ate.

Expected loss

But that’s just the risk of an event, not the ex­pected loss:

  1. In the case of do­ing surgery im­me­di­ate­ly, the ex­pected loss, with years treated equally and a 5% in­stant risk from op­er­a­tion, is sim­ply 48 * 0.005 = 0.24 years of life; all 48 years are risked on a sin­gle throw of the sur­gi­cal dice, but after that she is safe.

  2. In the case of do­ing noth­ing and let­ting the aneurysm stay with a 0.5% an­nual risk from non-op­er­a­tion, it’s not as sim­ple as 48 * 0.21 = 10.1 years, be­cause you can­not die of an aneurysm if you died in a pre­vi­ous year. The risk will in­stead fol­low a (num­ber of years un­til 1 fail­ure), and then the loss is the 48 years mi­nus how­ever many she ac­tu­ally got. That’s not the same as the ex­pec­ta­tion of the neg­a­tive bi­no­mi­al, which in this case is 200 years (the ex­pec­ta­tion of a neg­a­tive bi­no­mial with 1 fail­ure and a suc­cess rate of 1-0.005 is 1/(1-(1-0.005))=200) and she will die of other causes be­fore then, in which case the aneurysm turned out to be harm­less.

    We can sim­u­late many draws from the neg­a­tive bi­no­mi­al, ig­nore as 0 any time where the aneurysm struck after her life ex­pectancy of 48 more years is past, hold onto the loss­es, and cal­cu­late the mean loss: mean(sapply(rnbinom(10e4, 1, 0.005), function(years) { if(years>48) { 0; } else { 48-years; }})) → 5.43.

So the ex­pected loss from surgery looks even bet­ter than the risk did, as it is 22.6x small­er.

QALY/DALY adjustment

What about ad­just­ing for older years be­ing less valu­able? We might say that the surgery look un­fairly good be­cause we are ig­nor­ing how its losses are fron­t-loaded in the 30s, some of the best years of one’s life, and treat­ing a loss of her 33rd year as be­ing as bad as a loss of her 48th year. In terms of age weight­ing, DALYs usu­ally use a 3% an­nual dis­count­ing; and differ in some ways but for this analy­sis I think we can treat them as equiv­a­lent and use the DALY age-dis­count­ing to cal­cu­late our QALYs. So we can redo the two ex­pected losses in­clud­ing the dis­count­ing to get:

  1. Surgery: 0.05 * sum((1-0.03)^(0:48)) → 1.291
  2. No surgery: mean(unlist(sapply(sapply(rnbinom(10e4, 1, 0.005), function(years) { if(years>48) { 0; } else { 48-years; }}), function(yr) { sum((1-0.03)^(0:yr)); }))) → 4.415

By ap­pro­pri­ately pe­nal­iz­ing the surgery’s loss of high­-qual­ity early years as com­pared to the aneurys­m’s loss of just some el­derly years, the surgery’s su­pe­ri­or­ity falls to 3.4x, and the gain is 3.124. (And if we in­clude the men­tal well­be­ing of the woman as a fi­nal touch, the surgery looks even bet­ter.)

How sen­si­tive is the sur­gi­cal su­pe­ri­or­ity to the pa­ra­me­ters?

  • Sur­gi­cal risk: a 4x in­crease in risk to 20% would cre­ate par­ity
  • Aneurysm risk: if the an­nual risk of aneurysm were as low as 0.04% rather than 0.5%, then there would be par­ity
  • Life ex­pectancy & dis­count rate: no change will re­verse the or­der­ing

It seems ex­tremely un­likely that Marsh could be as wrong about the sur­gi­cal risk as to mis­take 5% for 20%, es­pe­cially for an op­er­a­tion he says he used to do rou­tine­ly, and it also seems un­likely that the study on the an­nual risk of an aneurysm burst­ing could be as far off as 10x, so the differ­ence is sol­id.


Fi­nal­ly, hav­ing a surgery is much more ex­pen­sive than not hav­ing it. Surgery is al­ways ex­pen­sive, and neu­ro­surgery un­doubt­edly so—else­where in the book, Marsh quotes an Amer­i­can neu­ro­sur­geon’s es­ti­mate of $100,000 for a par­tic­u­larly com­plex case. Clip­ping an aneurysm surely can­not cost that much (be­ing both much sim­pler and also be­ing done in a more effi­cient health­care sys­tem), but it’s still not go­ing to be triv­ial. Does the cost of aneurysm surgery out­weigh the ben­e­fit?

To con­vert the DALY loss to a dol­lar loss, we could note that UK PPP per capita is ~$38,160 (2013) so the gain from surgery would be (4.415 - 1.291) * 38169=$119k, well above the $100k worst-case. Or more di­rect­ly, the UK NHS prefers to pay <£20,000 per QALY and will gen­er­ally re­ject treat­ments which cost >£30,000 per QALY as of 20073 (im­ply­ing QALYs are worth some­what less than £30,000); the me­dian US 2008 hos­pi­tal cost for clip­ping an aneurysm is $36,188 or ~£23,500; and the gain is 3.124 QALYs for ~£7500/QALY—so clip­ping the aneurysm in this case defi­nitely clears the cost-ben­e­fit thresh­old (as we could have guessed from the fact that in the anec­dote, the NHS al­lows her to have the surgery).

After cal­cu­lat­ing the loss of years, differ­ing val­ues of years, and cost of surgery, the surgery still comes out as sub­stan­tially bet­ter than not op­er­at­ing.

The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example

Ran­dom­ized ex­per­i­ments re­quire more sub­jects the more vari­able each dat­a­point is to over­come the noise which ob­scures any effects of the in­ter­ven­tion. Re­duc­ing noise en­ables bet­ter in­fer­ences with the same data, or less data to be col­lect­ed, which can be done by bal­anc­ing ob­served char­ac­ter­is­tics be­tween con­trol and ex­per­i­men­tal dat­a­points. A par­tic­u­larly dra­matic ex­am­ple of this ap­proach is run­ning ex­per­i­ments on iden­ti­cal twins rather than reg­u­lar peo­ple, be­cause twins vary far less from each other than ran­dom peo­ple do. In 1931, the great sta­tis­ti­cian Stu­dent noted prob­lems with an ex­tremely large (n = 20,000) Scot­tish ex­per­i­ment in feed­ing chil­dren milk (to see if they grew more in height or weight), and claimed that the ex­per­i­ment could have been done far more cost-effec­tively with an ex­tra­or­di­nary re­duc­tion of >95% fewer chil­dren if it had been con­ducted us­ing twins. He, how­ev­er, did not pro­vide any cal­cu­la­tions or data demon­strat­ing this. I re­visit the is­sue and run a power cal­cu­la­tion on height in­di­cat­ing that Stu­den­t’s claims were cor­rect and that the ex­per­i­ment would have re­quired ~97% fewer chil­dren if run with twins. This re­duc­tion is not unique to the Scot­tish ex­per­i­ment and in gen­er­al, one can ex­pect a re­duc­tion of 89% us­ing twins rather than reg­u­lar peo­ple.

Due to length, this has been .

RNN metadata for mimicking individual author style

Char-RNNs are un­su­per­vised gen­er­a­tive mod­els which learn to mimic text se­quences. I sug­gest ex­tend­ing char-RNNs with in­line meta­data such as genre or au­thor pre­fixed to each line of in­put, al­low­ing for bet­ter & more effi­cient meta­data, and more con­trol­lable sam­pling of gen­er­ated out­put by feed­ing in de­sired meta­da­ta. An ex­per­i­ment us­ing torch-rnn on a set of ~30 Project Guten­berg e-books (1 per au­thor) to train a large char-RNN shows that a char-RNN can learn to re­mem­ber meta­data such as au­thors, learn as­so­ci­ated prose styles, and often gen­er­ate text vis­i­bly sim­i­lar to that of a spec­i­fied au­thor.

Due to length, this has been split out to .


An im­ple­men­ta­tion in R of a sim­ple al­go­rithm (us­ing rather than a UCT) im­ple­mented with data.tree. This MCTS as­sumes bi­nary win/loss (1/0) ter­mi­nal re­wards with no in­ter­me­di­ate re­ward­s/­costs so it can­not be used to solve gen­eral , and does not ex­pand leaf nodes in the move tree passed to it. (I also sus­pect parts of it are im­ple­mented wrong though it reaches the right an­swer in a sim­ple Block­world prob­lem and seems OK in a Tic-Tac-Toe prob­lem. I have since un­der­stood and would prob­a­bly prob­a­bly drop the painful ex­plicit tree ma­nip­u­la­tion in fa­vor of the in­di­rect re­cur­sive aproach.)

## MCTS helper functions:
playOutMoves <- function(move, state, actions) {
  for (i in 1:length(actions)) {
     state <- move(state, actions[i])$State
playOutRandom <- function(move, state, actions, timeout=1000, verbose=FALSE) {
 action <- sample(actions, 1)
 turn <- move(state, action)
 if(verbose) { print(turn); };
 if (turn$End || timeout==0) { return(turn$Reward) } else {
                               playOutRandom(move, turn$State, actions, timeout=timeout-1, verbose) }

createTree <- function(plys, move, moves, initialState, tree=NULL) {
 if (is.null(tree)) { tree <- Node$new("MCTS", win=0, loss=0) }
 if (plys != 0) {
  for(i in 1:length(moves)) {
    x <- tree$AddChild(moves[i], win=0, loss=0)
    createTree(plys-1, move, moves, initialState, tree=x)
 # cache the state at each leaf node so we don't have to recompute each move as we later walk the tree to do a rollout
 tree$Do(function(node) { p <- node$path; node$state <- playOutMoves(move, initialState, p[2:length(p)]); }, filterFun = isLeaf)

mcts <- function (tree, randomSimulation, rollouts=1000) {
 replicate(rollouts, {
   # Update posterior sample for each node based on current statistics and use Thompson sampling.
   # With a beta uniform prior (Beta(1,1)), update on binomial (win/loss) is conjugate with simple closed form posterior: Beta(1+win, 1+n-win).
   # So we sample directly from that posterior distribution for Thompson sampling
   tree$Do(function(node) { node$Thompson <- rbeta(1, 1+node$win, 1+(node$win+node$loss)-node$win) })
   # find & run 1 sample:
   node <- treeWalk(tree)
   rollout <- randomSimulation(node$state)
   if(rollout==1) { node$win <- node$win+1; } else { node$loss <- node$loss+1; }

   # propagate the new leaf results back up tree towards root:
   tree$Do(function(x) { x$win  <- Aggregate(x, "win",  sum); x$loss <- Aggregate(x, "loss", sum) }, traversal = "post-order")

## walk the game tree by picking the branch with highest Thompson sample down to the leaves
## and return the leaf for a rollout
treeWalk <- function(node) {
    if(length(node$children)==0) { return(node); } else {
        children <- node$children
         best <- which.max(sapply(children, function(n) { n$Thompson; } ))
        treeWalk(children[[best]]) } }

mctsDisplayTree <- function(tree) {
    tree$Do(function(node) { node$P <- node$win / (node$win + node$loss) } )
    tree$Sort("P", decreasing=TRUE)
    print(tree, "win", "loss", "P", "Thompson")

## Blockworld simulation
## 0=empty space, 1=agent, 2=block, 3=goal point
blockActions <- c("up", "down", "left", "right")
blockInitialState <- matrix(ncol=5, nrow=5, byrow=TRUE,
blockMove <- function(state, direction) {
   if(state[5,5] == 2) { return(list(State=state, Reward=1, End=TRUE)) }
   position <- which(state == 1, arr.ind=TRUE)
   row <- position[1]; col <- position[2]
   rowNew <- 0; colNew <- 0
     # if we are at an edge, no change
     up    = if(row == 1) { rowNew<-row; colNew<-col; } else { rowNew <- row-1; colNew <- col; },
     down  = if(row == 5) { rowNew<-row; colNew<-col; } else { rowNew <- row+1; colNew <- col; },
     left  = if(col == 1) { rowNew<-row; colNew<-col; } else { rowNew <- row;   colNew <- col-1; },
     right = if(col == 5) { rowNew<-row; colNew<-col; } else { rowNew <- row;   colNew <- col+1; }
   # if there is not a block at the new position, make the move
   if (state[rowNew,colNew] != 2) {
      state[row,col] <- 0
      state[rowNew,colNew] <- 1
      return(list(State=state, Reward=0, End=FALSE))
       } else {
               state[rowNew,colNew] <- 1
               state[row,col] <- 0
                # if the block is at the edge it can't move
                up    = if(rowNew == 1) { } else { state[rowNew-1,colNew] <- 2 },
                down  = if(rowNew == 5) { } else { state[rowNew+1,colNew] <- 2 },
                left  = if(colNew == 1) { } else { state[rowNew,colNew-1] <- 2 },
                right = if(colNew == 5) { } else { state[rowNew,colNew+1] <- 2 } )
                # a block on the magic 5,5 point means a reward and reset of the playing field
                if(state[5,5] == 2) { return(list(State=state, Reward=1, End=TRUE)) } else { return(list(State=state, Reward=0, End=FALSE)) }

## Blockworld examples:
# blockMove(blockInitialState, "left")
# blockMove(blockInitialState, "down")
# blockMove(blockInitialState, "right")$State
# blockMove(blockMove(blockInitialState, "right")$State, "down")
# blockMove(blockMove(blockMove(blockInitialState, "down")$State, "down")$State, "down")
# playOutMoves(blockMove, blockInitialState, c("down", "down", "down"))
# playOutRandom(blockMove, blockInitialState, blockActions)

tree <- createTree(2, blockMove, blockActions, blockInitialState)
mcts(tree, function(state) { playOutRandom(blockMove, state, blockActions) })

tree2 <- createTree(3, blockMove, blockActions, blockInitialState)
mcts(tree2, function(state) { playOutRandom(blockMove, state, blockActions) })

## Tic-Tac-Toe
tttActions <- 1:9
tttInitialState <- matrix(ncol=3, nrow=3, byrow=TRUE, data=0)
tttMove <- function(state, move) {
   move <- as.integer(move)
   # whose move is this? Player 1 moves first, so if the number of pieces are equal, it must be 1's turn:
   player <- 0;  if(sum(state == 1) == sum(state == 2)) { player <- 1 } else { player <- 2}

   # check move is valid:
   if(state[move] == 0) { state[move] <- player }

   ## enumerate all possible end-states (rows, columns, diagonals): victory, or the board is full and it's a tie
   victory <- any(c(
       all(state[,1] == player),
       all(state[1,] == player),
       all(state[,2] == player),
       all(state[2,] == player),
       all(state[,3] == player),
       all(state[3,] == player),
       all(as.logical(c(state[1,1], state[2,2], state[3,3]) == player)),
       all(as.logical(c(state[1,3], state[2,3], state[3,1]) == player))
   tie <- all(state != 0)

   # if someone has won and the winner is player 1, then a reward of 1
   if(victory) { return(list(State=state, Reward=as.integer(player==1), End=TRUE)) } else {
    if(tie) { return(list(State=state, Reward=0, End=TRUE)) } else {
      return(list(State=state, Reward=0, End=FALSE)) }

## Tic-Tac-Toe examples:
# tttMove(tttMove(tttMove(tttInitialState, 5)$State, 9)$State, 2)
# playOutMoves(tttMove, tttInitialState, c(5, 9, 2))
# playOutRandom(tttMove, tttInitialState, tttActions, verbose=TRUE)

treeTTT <- createTree(2, tttMove, tttActions, tttInitialState)
mcts(treeTTT, function(state) { playOutRandom(tttMove, state, tttActions) })
## hypothetical: if opponent plays center (5), what should be the reply?
treeTTT2 <- createTree(2, tttMove, tttActions, tttMove(tttInitialState, 5)$State)
mcts(treeTTT2, function(state) { playOutRandom(tttMove, state, tttActions) })

Candy Japan A/B test

Due to length, has been split out to .

DeFries-Fulker power analysis

De­Fries-Fulker (DF) ex­tremes analy­sis

generateSiblingPair <- function(ID=TRUE) {
   ## Population mean 100, SD 15; let's make family means distributed normally too;
   ## heritability 0.8, shared environment 0.1, siblings share half of genes on average + shared environment
   ## so a pair of siblings has 1 - (0.8*0.5+0.1) = 0.5 of the variance of the general population.
   parental <- mean(rnorm(1,mean=100,sd=15*0.8), rnorm(1,mean=100,sd=15*0.8))
   siblings <- rnorm(2, mean=parental, sd=15*(1 - (0.8*0.5+0.1)))
   ## Siblings will tend to vary this much, unless they are, lamentably, one of the, say,
   ## 5% struck by mutational lightning and reduced to an IQ of, let's say, 80
   if(ID) { siblings <- ifelse(rbinom(2,1,prob=0.05), siblings,rnorm(2, mean=80, sd=15)) }
   return(c(max(siblings), min(siblings)))
generateSiblingPairs <- function(n,ID=TRUE) {, generateSiblingPair(ID=ID)))) }
## dataset with lightning:
df <- round(rescale(generateSiblingPairs(1000000, ID=TRUE), mean=5, sd=2))
## floor/ceiling at 0/9 for everyone:
df[df$V1>9,]$V1 <- 9
df[df$V1<1,]$V1 <- 1
df[df$V2>9,]$V2 <- 9
df[df$V2<1,]$V2 <- 1

## dataset without:
df2 <- round(rescale(generateSiblingPairs(1000000, ID=FALSE), mean=5, sd=2))
df2[df2$V1>9,]$V1 <- 9
df2[df2$V1<1,]$V1 <- 1
df2[df2$V2>9,]$V2 <- 9
df2[df2$V2<1,]$V2 <- 1

hist(df$V1 - df$V2)
hist(df2$V1 - df2$V2)

## mixture modeling:
## check k=1 vs k=2 on df1, where k=2 is ground truth:
g1.1 <- flexmix(I(V1-V2) ~ 1, k=1, data=df)
g1.2 <- flexmix(I(V1-V2) ~ 1, k=2, data=df)
summary(g1.1); summary(g1.2)

## check k=1 vs k=2 on df2, where k=1 is ground truth:
g2.1 <- flexmix(I(V1-V2) ~ 1, k=1, data=df2)
g2.2 <- flexmix(I(V1-V2) ~ 1, k=2, data=df2)
summary(g2.1); summary(g2.2)

Inferring mean IQs from SMPY/TIP elite samples

Sam­ples taken from the ex­tremes of mix­tures of dis­tri­b­u­tions can have very differ­ent prop­er­ties than ran­dom sam­ples, such as the tail effect of wildly dis­pro­por­tion­ate rep­re­sen­ta­tion of one dis­tri­b­u­tion due to or­der sta­tis­tic­s/thresh­old se­lec­tion. This can be used to in­fer differ­ing means. I demon­strate work­ing back­wards from the racial com­po­si­tion of TIP/SMPY sam­ples of ex­tremely (1-in-10,000) gifted youth to es­ti­mate the over­all racial means, which is con­sis­tent with the known racial means and hence an un­bi­ased se­lec­tion process, us­ing ABC to in­fer Bayesian cred­i­ble in­ter­vals on the es­ti­mated means.

The prop­er­ties of sta­tis­ti­cal dis­tri­b­u­tions can be very differ­ent from the prop­er­ties of spe­cific sub­sets of those dis­tri­b­u­tions in coun­ter­in­tu­itive ways. A point drawn from an ex­treme will ex­hibit “re­gres­sion to the mean”, a phe­nom­e­non which rou­tinely trips peo­ple up. An­other com­mon ex­am­ple is that a small differ­ence in means for many dis­tri­b­u­tions can lead to large differ­ences in sub­sets.

For ex­am­ple, male and fe­male av­er­age heights differ by a rel­a­tively small amount, inches at most. So in a ran­dom sam­ple, plenty of women will be taller than men, and vice ver­sa. How­ev­er, if in­stead ask the sex of the tallest per­son in the sam­ple, it will often be male, and the larger the sam­ple, the more cer­tain we can be that it will be male, and that the top X% by height will be male. Like­wise, if we wanted to start a bas­ket­ball league and re­cruited the tallest 100 peo­ple in the coun­try, this small mean differ­ence will show up as our en­tire bas­ket­ball league turn­ing out to be male. (And since height is highly her­i­ta­ble, we may find out that many of them are re­lated!) What seemed like a small differ­ence be­come a large one; we could have worked it out in ad­vance if we had thought about it.

Rea­son­ing from the gen­eral to the par­tic­u­lar turned out to be tricky in this case be­cause we were deal­ing with ex­treme val­ues rather than ran­dom sam­ples—1 bas­ket­ball player cho­sen by height from thou­sands of peo­ple. Many things of great in­ter­est turn out to be like that: we are in­ter­ested in the ex­tremes much more than the ex­pec­ta­tion. Run­ning a 2-hour marathon is an ex­treme on ath­leti­cism; win­ning the No­bel is an ex­treme on sci­en­tific ac­com­plish­ment; be­ing en­listed in the NBA is an ex­treme on height; be­ing ad­mit­ted to MIT/Stanford/Harvard is an ex­treme on in­tel­li­gence; mur­der­ing some­one is an ex­treme on vi­o­lence; win­ning an Acad­emy Award is an ex­treme on act­ing suc­cess. When we ask ques­tions like, “why does the world record in this sport keep be­ing shat­tered” or “why are so many NBA play­ers re­lated” or “how good can we ex­pect the best chess player to be in 10 years” or “does this racial com­po­si­tion prove bias” or “how much more im­por­tant are the best au­thors in lit­er­a­ture than ob­scurer fig­ures” or “why do so few women win the Field Medal”, we’re ask­ing ex­treme value ques­tions whose an­swers may be coun­ter­in­tu­itive—and the an­swer may be as sim­ple as the shape of dis­tri­b­u­tions, and a slightly lower mean here or a slightly higher stan­dard de­vi­a­tion there. (Work­ing back­wards from a sam­ple se­lected for pass­ing a thresh­old to a mean can be called “the method of lim­its” or “the method of thresh­olds”.)

The study de­scribes the ac­com­plish­ments of the sam­ple, 259 chil­dren se­lected for their in­tel­li­gence by tak­ing the high­est-s­cor­ers out of 425,000 ado­les­cents tak­ing the SAT (usu­ally <13yo) start­ing in 1981, rep­re­sent­ing the top 0.01% of the test-tak­ers. The TIP sam­ple par­al­lels the bet­ter-known SMPY sam­ple, which also se­lected ex­tremely in­tel­li­gent ado­les­cents, who were in­cluded in a lon­gi­tu­di­nal sam­ple. It’s fre­quently sug­gest­ed, based on anec­do­tal ev­i­dence or some bi­ased con­ve­nience sam­ples, that more in­tel­li­gence may not be bet­ter; ex­tremely in­tel­li­gent peo­ple may be un­healthy, neu­rotic, in­sane, iso­lat­ed, lone­ly, dis­crim­i­nated against by so­ci­ety and their peers, and doomed to fail­ure; or if things are not quite that di­re, as all stud­ies show things im­prov­ing up to 130, then at around that point greater in­tel­li­gence may stop mak­ing any differ­ence, and there be lit­tle differ­ence be­tween some­one with an IQ of 130 and 160. This is diffi­cult to study cross-sec­tion­al­ly, be­cause once you start talk­ing about as ex­treme as 0.01%, it is diffi­cult to re­cruit any sub­jects at all, and your sam­ple will be bi­ased in un­known ways; if you only look at suc­cess­ful peo­ple, you are miss­ing the hy­po­thet­i­cal home­less bum liv­ing out of a trash can who is a trou­bled and mis­un­der­stood ge­nius. To solve these prob­lems, you want to fil­ter through hun­dreds of thou­sands of peo­ple so you can se­lect the very bright­est pos­si­ble, and you want to find them as early as pos­si­ble in life, be­fore they have had any chance to fail or suc­ceed, and track them lon­gi­tu­di­nally as they grow up. This is what the SMPY & TIP stud­ies do, and the re­sults are that the sub­jects are spec­tac­u­larly suc­cess­ful in life; great in­tel­li­gence is not harm­ful and the re­turns to greater in­tel­li­gence are not zero even as high as 1 in 10,000.

Makel et al 2016 also re­ports the eth­nic break­down of the TIP and SMPY sam­ples: 72% white, 22% Asian, 6% not re­ported or oth­er. This dis­tri­b­u­tion might seem re­mark­able given that sub­jects tak­ing the SAT in 1981 were born ~1970, when the USA was ~77% white, ~11% black, and ~0.7% Asian, so white are slightly un­der­-rep­re­sent­ed, blacks are very un­der­-rep­re­sented (even if we as­sume all 6% are black, then that’s still half), and Asians are 31x (!) over­rep­re­sent­ed.

## TIP/SMPY sample size & ethnic percentages:
tip <- 259; smpy <- 320 ## total: 579
white <- ((0.65*tip) + (0.78*smpy)) / (tip+smpy)
asian <- ((0.24*tip) + (0.20*smpy)) / (tip+smpy)
white; asian
# [1] 0.7218480138
# [1] 0.2178929188

# > The data on ethnicity are rather sparse, but we can do a little bit of work on them by looking at US Census
# > figures for the 1970s when most of these children were born: White 178,119,221...Asia 1,526,401...So, in the
# > absence of more detailed particulars about the Other category, Asians win the race by a country mile. If we
# > simplify things by considering only Whites, Blacks and Asians the US in 1970 then the country at that time was
# > 88% White, 11% Black, and less than 1% Asian. The actual results of eminent students are 77% White, 0% Black,
# > 22% Asian. No need for a Chi square.
# Asian is 0.7%: 1526401  / (178119221 / 0.80)
whiteRR <- white / 0.77; asianRR <- asian / 0.007
whiteRR; asianRR
# [1] 0.937464953
# [1] 31.12755983

Of course, races in the USA have long differed by mean in­tel­li­gence, with the rule of thumb be­ing Asians ~105 IQ, whites ~100, and blacks ~90. So the or­der is ex­pect­ed—but still, 31x! Are the re­sults be­ing dri­ven by some sort of pro-Asian bias or oth­er­wise bizarre?

But this is an ex­treme sam­ple. 1-in-10,000 is far out on the tails: 3.71S­Ds.

# [1] 3.719016485

Maybe this is nor­mal. Can we work back­wards from the over­rep­re­sen­ta­tions to what differ­ences would have gen­er­ated them?

Yes, we can, even with this small sam­ple which is so ex­treme and un­rep­re­sen­ta­tive of the gen­eral pop­u­la­tion. This is be­cause it is an prob­lem: we know the or­der rep­re­sented by the sam­ple and so can work back to pa­ra­me­ters of the dis­tri­b­u­tion the or­der sta­tis­tics are be­ing gen­er­ated by. Since IQ is a nor­mal dis­tri­b­u­tion, we know the over­rep­re­sen­ta­tion RR, and the ex­act cut­off/limit used in the sam­ple, we can con­vert the limit to a stan­dard de­vi­a­tions, and then find the nor­mal dis­tri­b­u­tion which is RR (31) times the nor­mal dis­tri­b­u­tion at a stan­dard de­vi­a­tions.

We can com­pare us­ing two pnorms and shift­ing the sec­ond by a SDs. So for ex­am­ple, shift­ing by 15 IQ points or 1 SD would lead to 84x over­rep­re­sen­ta­tion

pnorm(qnorm(1/10000)) / pnorm(qnorm(1/10000) - (15/15))
# [1] 84.39259519

We would like to solve for the shift which leads to an ex­act over­rep­re­sen­ta­tion like 31.127; an op­ti­miza­tion rou­tine like R’s optim func­tion can do that, but it re­quires an er­ror to min­i­mize, so min­i­miz­ing pnorm()/pnorm(x) does­n’t work since it just leads to neg­a­tive in­fin­i­ty, nor will RR == pnorm()/pnorm(x) work, be­cause it eval­u­ates to 0 for all val­ues of x ex­cept the ex­act right one . In­stead, we min­i­mize the squared er­ror be­tween the ra­tio pre­dicted by a par­tic­u­lar x and our ob­served RR. This works:

## An optimization routine which automatically finds for us the IQ increase which most closely matches the RR:
solver <- function(RR, cutoff=10000) {
        function(IQ_gain) { (RR - (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)-(IQ_gain/15))))^2 },
        )$par }

100 + solver(whiteRR)
# [1] 99.75488281
100 + solver(asianRR)
# [1] 111.8929688

So our in­ferred white & Asian pop­u­la­tions means are: 99.8 and 111.9. These are rel­a­tively close to the ex­pected val­ues.

This ap­proach can be used to in­fer other things as well. For ex­am­ple, the TIP/SMPY pa­pers have not, as far as I’ve seen, men­tioned what frac­tion of the white sub­jects were eth­nic Jew­ish; since they are so over-rep­re­sented in ar­eas like No­bel prizes, we would ex­pect many of the TIP/SMPY white stu­dents to have been Jew­ish. Us­ing an es­ti­mate of the Jew­ish pop­u­la­tion in 1970 and es­ti­mates of their mean IQ, we can work for­ward to what frac­tion of TIP/SMPY sub­jects might be Jew­ish. The 1970-1971 Na­tional Jew­ish Pop­u­la­tion Study es­ti­mated “5,800,000 per­sons (of whom 5,370,000 were Jews) liv­ing in Jew­ish house­holds” out of a to­tal US pop­u­la­tion of 205 mil­lion, or 2.8% of the to­tal pop­u­la­tion or ~3.6% of the white pop­u­la­tion. So of the ~418 white sub­jects, ~15 would be ex­pected to be Jew­ish un­der the null hy­poth­e­sis of no differ­ence. The ma­jor­ity of Amer­i­can Jews are of Ashke­nazi de­scent4, for whom in­tel­li­gence es­ti­mates are de­bated but tend to range 105-115 (with oc­ca­sional sam­ples sug­gest­ing even higher val­ues, like Levin­son 1957). In the Barbe 1964 Ohio sam­ple (IQ ~143), 8% were Jew­ish5; in (ra­tio IQ >140) 1920s sam­ple in SF/LA, 10% were Jew­ish; Holling­worth’s 1930s sam­ple (>180) turned up 51⁄55 or 90% Jew­ish6; Byrns 1936’s 1931 Wis­con­sin state sam­ple found 18% of the Jew­ish sam­ple to be in the top decile vs 10% Amer­i­can; in the sam­ple 1948-1960 (>140, mean 157) in New York City, 62% were Jew­ish (Sub­ot­nik et al 1989, Sub­ot­nik et al 19937). Given es­ti­mates of the Jew­ish pop­u­la­tion of chil­dren in those spe­cific times and places, one could work back­wards to es­ti­mate a Jew­ish mean.

We can cal­cu­late the frac­tion of the white sam­ple be­ing Jew­ish for each pos­si­ble mean IQ:

proportion <- function (gain, cutoff=10000) {
   (pnorm(qnorm(1/cutoff)) / pnorm(qnorm(1/cutoff)-(gain/15))) }
possibleIQs <- seq(5, 15, by=0.5)
data.frame(Advantage=possibleIQs, Fraction.of.white=(sapply(possibleIQs, proportion) * 15) / 418)
   Advantage Fraction.of.white
1        5.0      0.1415427303
2        5.5      0.1633099334
3        6.0      0.1886246225
4        6.5      0.2180947374
5        7.0      0.2524371552
6        7.5      0.2924980125
7        8.0      0.3392769622
8        8.5      0.3939561508
9        9.0      0.4579348680
10       9.5      0.5328710150
11      10.0      0.6207307813
12      10.5      0.7238482059
13      11.0      0.8449966589
14      11.5      0.9874747049
15      12.0      1.1552093388
16      12.5      1.3528802227
17      13.0      1.5860693342
18      13.5      1.8614413902
19      14.0      2.1869615788
20      14.5      2.5721585555
21      15.0      3.0284424112

Judg­ing from ear­lier sam­ples with very high cut­offs, I’d guess TIP/SMPY has at least a ma­jor­ity Jew­ish, giv­ing a mean IQ of ~110; this is pleas­antly sim­i­lar to es­ti­mates based on reg­u­lar sam­ples & es­ti­ma­tion. This re­sult is also sim­i­lar to La Griffe du Li­on’s 2003 thresh­old analy­sis es­ti­mat­ing a mean IQ of 112 based on Ashke­nazi over­rep­re­sen­ta­tion among USSR cham­pi­onship chess play­ers, 111 based on West­ern awards, and 110 based on the USA/Canada . But if the mean IQ was as high as 112, then al­most every sin­gle white sub­ject would be Jew­ish in every sam­pling, which seems im­plau­si­ble and like some­thing so strik­ing that any­one writ­ing or in­volved with TIP/SMPY would have to have men­tioned at some point—right?

For the same rea­son, the orig­i­nal es­ti­mate of 112 for the Asians strikes me as on the high side. This could be due to prob­lems in the data like un­der­es­ti­mat­ing the Asian pop­u­la­tion at the time—per­haps the South­east­/Mid­west states that TIP sam­ples from were more than 0.7% Asian—or it could be due to sam­pling er­ror (only n = 579, after al­l).

Work­ing back­wards does­n’t im­me­di­ately pro­vide any mea­sure­ment of pre­ci­sion or con­fi­dence in­ter­vals. Pre­sum­ably some­one has worked out an­a­lytic for­mu­las which come with stan­dard er­rors and con­fi­dence in­ter­vals, but I don’t know it. In­stead, since the se­lec­tion process which gen­er­ated our data is straight­for­ward (pop­u­la­tion mean -> mil­lions of sam­ples -> take top 1-in-10000s -> cal­cu­late over­rep­re­sen­ta­tion), I can again use (ABC) to turn a sim­u­la­tion of the data gen­er­at­ing process into a method of Bayesian in­fer­ence on the un­known pa­ra­me­ters (pop­u­la­tion means) and get cred­i­ble in­ter­vals.

What sort of con­fi­dence do we have in these es­ti­mates given that these RRs are based only on? We can sim­u­late TIP/SMPY-like se­lec­tion by tak­ing the hy­po­thet­i­cal means of the two groups, gen­er­at­ing ~3 mil­lion sim­u­lates (579 * 10000) each, se­lect­ing the top 1⁄10000th8, tak­ing the RRs and then solv­ing for the mean IQ. If we pro­vide a prior on the means and we hold onto only the means which suc­cess­fully gen­er­ate TIP/SMPY-like frac­tions of 72% & 21%, this be­comes ABC with the saved means form­ing the pos­te­rior dis­tri­b­u­tion of means. (It would likely be faster to use MCMC like JAGS, but while JAGS pro­vides trun­cated nor­mal dis­tri­b­u­tions which one could sam­ple from quick­ly, and the nec­es­sary pnorm/qnorm func­tions, but it’s not clear to me how one could go about es­ti­mat­ing the over­per­for­mance ra­tio and the bi­no­mi­al.9 There’s likely some way to use more di­rectly than sim­u­lat­ing cut­offs, in which case there is a trans­for­ma­tion to a beta dis­tri­b­u­tion over 0-1, which is a well-sup­ported dis­tri­b­u­tion by MCM soft­ware and might al­low ex­act so­lu­tion as well.) For my pri­ors, I be­lieve that the rule of thumbs of 100⁄105 are ac­cu­rate and highly un­likely to be more than a few points off, so I use a very weak prior of pop­u­la­tions means be­ing .

In ex­act ABC, we would keep only data which ex­actly matched 72%/22%, but that would re­quire re­ject­ing an ex­tremely large num­ber of sam­ples. Here we’ll loosen it to ±2% tol­er­ance:

simulateTIPSMPY <- function() {
    ## informative priors: IQs are somewhere close to where we would estimate based on other datasets
    whiteMean <- round(rnorm(1, mean=100, sd=5), digits=2)
    asianMean <- round(rnorm(1, mean=105, sd=5), digits=2)

    iqCutoff <- 100 + -qnorm(1/10000) * 15

    whites <- rnorm(0.770 * 579 * 10000, mean=whiteMean, sd=15)
    whiteSample <- max(1, sum(ifelse(whites>iqCutoff, 1, 0)))

    asians <- rnorm(0.007 * 579 * 10000, mean=asianMean, sd=15)
    asianSample <- max(1, sum(ifelse(asians>iqCutoff, 1, 0)))

    ## white+Asian = 92% of original total sample, so inflate by that much to preserve proportions: 1.08
    totalSample <- (whiteSample+asianSample) * (1 + (1-(white+asian)))

    whiteFraction <- round(whiteSample / totalSample, digits=2)
    asianFraction <- round(asianSample / totalSample, digits=2)
    # print(paste("samples: ", c(whiteSample, asianSample), "fractions: ", c(whiteFraction, asianFraction)))

    tolerance <- 0.02
    if ((abs(whiteFraction - 0.7218480138) < tolerance) && (abs(asianFraction - 0.2178929188) < tolerance)) {
      return(data.frame(White=whiteMean, Asian=asianMean))
library(parallel); library(plyr)
simulateSamples <- function(n.sample=10000, iters=getOption("mc.cores")) {
    ## because of rejection sampling, no run is guaranteed to produce a sample so we loop:
    results <- data.frame()
    while (nrow(results) < n.sample) {
        simResults <- ldply(mclapply(1:iters, function(i) { simulateTIPSMPY()  } ))
        results <- rbind(results, simResults)
        # print(paste("Samples so far: ", nrow(results)))
    return(results) }
posteriorSamples <- simulateSamples()

mean(posteriorSamples$White < posteriorSamples$Asian)
# [1] 1
## we have relatively few samples, so get a better posterior estimate by shuffling the posterior samples & comparing many times:
mean(replicate(1000, mean(c(sample(posteriorSamples$White) < sample(posteriorSamples$Asian)))))
# [1] 0.9968822
quantile(probs=c(0.025, 0.975), posteriorSamples$White, na.rm=TRUE)
#     2.5%     97.5%
# 89.49975 101.38050
quantile(probs=c(0.025, 0.975), posteriorSamples$Asian, na.rm=TRUE)
#      2.5%     97.5%
# 101.37000 116.74075
hist(posteriorSamples$White, main="Posterior white mean IQ estimated from TIP/SMPY cutoff & ratio", xlab="IQ")
hist(posteriorSamples$Asian, main="Posterior Asian mean", xlab="IQ")
His­tograms of the pos­te­rior es­ti­mate of white & Asian mean IQs ~1970 as es­ti­mated from frac­tion of TIP/SMPY sam­ple us­ing ABC

So sam­pling er­ror does turn out to be sub­stan­tial: our 95% cred­i­ble in­ter­vals are white 90-101, Asian 101-116. Still, the over­lap is min­i­mal, with P = 99.7% that the Asian mean is higher than the white.

We are able to con­clude that the rank or­der­ing is highly likely to be cor­rect, and the re­sults are con­sis­tent with the con­ven­tional wis­dom, so there is no prima fa­cie case for bias in the re­sults: the eth­nic com­po­si­tion is in line with what one would cal­cu­late from the de­sign of TIP/SMPY and pop­u­la­tion means.

Genius Revisited: On the Value of High IQ Elementary Schools

Ge­nius Re­vis­ited doc­u­ments the lon­gi­tu­di­nal re­sults of a high­-IQ/gift­ed-and-tal­ented el­e­men­tary school, Hunter Col­lege El­e­men­tary School (HCES); one of the most strik­ing re­sults is the gen­eral high ed­u­ca­tion & in­come lev­els, but ab­sence of great ac­com­plish­ment on a na­tional or global scale (eg a No­bel prize). The au­thors sug­gest that this may re­flect harm­ful ed­u­ca­tional prac­tices at their el­e­men­tary school or the low pre­dic­tive value of IQ.

I sug­gest that there is no puz­zle to this ab­sence nor any­thing for HCES to be blamed for, as the ab­sence is fully ex­plain­able by their mak­ing two sta­tis­ti­cal er­rors: base-rate ne­glect, and re­gres­sion to the mean.

First, their stan­dards fall prey to a base-rate fal­lacy and even ex­treme pre­dic­tive value of IQ would not pre­dict 1 or more No­bel prizes be­cause No­bel prize odds are mea­sured at 1 in mil­lions, and with a small to­tal sam­ple size of a few hun­dred, it is highly likely that there would sim­ply be no No­bels.

Sec­ond­ly, and more se­ri­ous­ly, the lack of ac­com­plish­ment is in­her­ent and un­avoid­able as it is dri­ven by the caused by the rel­a­tively low cor­re­la­tion of early child­hood with adult IQs—which means their sam­ple is far less elite as adults than they be­lieve. Us­ing ear­ly-child­hood/adult IQ cor­re­la­tions, re­gres­sion to the mean im­plies that HCES stu­dents will fall from a mean of 157 IQ in kinder­garten (when se­lect­ed) to some­where around 133 as adults (and pos­si­bly low­er). Fur­ther demon­strat­ing the role of re­gres­sion to the mean, in con­trast, HCES’s as­so­ci­ated high­-IQ/gift­ed-and-tal­ented high school, Hunter High, which has ac­cess to the ado­les­cents’ more pre­dic­tive IQ scores, has much higher achieve­ment in pro­por­tion to its lesser re­gres­sion to the mean (de­spite di­lu­tion by Hunter el­e­men­tary stu­dents be­ing grand­fa­thered in).

This un­avoid­able sta­tis­ti­cal fact un­der­mines the main ra­tio­nale of HCES: ex­tremely high­-IQ adults can­not be very ac­cu­rately se­lected as kinder­garten­ers on the ba­sis of a sim­ple test. This greater-re­gres­sion prob­lem can be less­ened by the use of ad­di­tional vari­ables in ad­mis­sions, such as parental IQs or high­-qual­ity ge­netic poly­genic scores; un­for­tu­nate­ly, these are ei­ther po­lit­i­cally un­ac­cept­able or de­pen­dent on fu­ture sci­en­tific ad­vances. This sug­gests that such el­e­men­tary schools may not be a good use of re­sources and HCES stu­dents should not be as­signed scarce mag­net high school slots.

Split out to .

Great Scott! Personal Name Collisions and the Birthday Paradox

“How large does can a so­cial cir­cle be be­fore first names no longer suffice for iden­ti­fi­ca­tion? Scott, I’m look­ing at you.”

MakerOfDe­ci­sions, 2016-07-29

Scott here refers to any of Scott Alexan­der, , , (and to a much lesser ex­tent, Scott Garrabrant, , and Scott H. Young); a ref­er­ence to a ‘Scott’ on a site like Less Wrong is in­creas­ingly am­bigu­ous.

When a large num­ber of sam­ples draw from a com­mon pool of iden­ti­fiers, col­li­sions are com­mon, lead­ing to the : de­spite there be­ing 365.25 days in the year, a class­room of just 23 peo­ple (who can cover at most 6% of the days in a year) is ~50% likely to have at least two peo­ple who share the same birth­day and so birth­days cease be­ing unique un­am­bigu­ous iden­ti­fiers. (In­tu­itive­ly, you might ex­pect the num­ber to be much larger and closer to 180 than 23.)

We can ver­ify this by sim­u­la­tion:

dupes <- function(a) { length(a) != length(unique(a)) }

identifiers <- function(n, ids, probabilities) { sample(1:ids, n, prob=probabilities, replace=TRUE) }

simulate <- function(n, ids, probabilities=rep(1/ids, ids), iters=10000) {
    sims <- replicate(iters, { id <- identifiers(n, ids, probabilities)
                               return(dupes(id)) })
    return(mean(sims)) }

simulate(23, 365)
# [1] 0.488
sapply(1:50, function(n) { simulate(n, 365) } )
#  [1] 0.0000 0.0029 0.0059 0.0148 0.0253 0.0400 0.0585 0.0753 0.0909 0.1196 0.1431 0.1689 0.1891
#      0.2310 0.2560 0.2779 0.3142 0.3500 0.3787 0.4206 0.4383 0.4681 0.5165 0.5455 0.5722 0.5935
# [27] 0.6227 0.6491 0.6766 0.7107 0.7305 0.7536 0.7818 0.7934 0.8206 0.8302 0.8465 0.8603 0.8746
#      0.8919 0.9040 0.9134 0.9248 0.9356 0.9408 0.9490 0.9535 0.9595 0.9623 0.9732

Sim­i­lar­ly, in a group of peo­ple, it will be com­mon for first names to over­lap. (Over­laps of both first names & sur­names are much more un­like­ly: es­ti­mate from French & Ohioan data that while al­most every­one has a non-u­nique full name, even groups of thou­sands of peo­ple will have only a few du­pli­cates.) How com­mon? There are far more than 365.25 first names, es­pe­cially as some first names are made up by par­ents.

Names have a highly skewed (often said to be a ) dis­tri­b­u­tion: the first few baby names make up an enor­mous frac­tion of all names, hence all the Ethan/Lu­cas/­Ma­son baby boys in 2016. (One would think that par­ents would go out of their way to avoid too-pop­u­lar names, but ap­par­ently not.)

Since there are only “10,000 things un­der heaven”, one might think that the top 10000 per­sonal names would give a good guess. At what n can we ex­pect a col­li­sion?

findN <- function(ids, targetP=0.5, startingN=1, probabilities=rep(1/ids, ids)) {
    n <- startingN
    collisionProbability <- 0

    while (collisionProbability < targetP) {
        collisionProbability <- simulate(n, ids, probabilities)
        n <- n+1
    return(n) }
# [1] 118
simulate(118, 10000)
# [1] 0.5031

We could also use such as the square ap­prox­i­ma­tion: : sqrt(2 * 10000 * 0.5) → 100 Or the sim­i­lar up­per bound: ceiling(sqrt(2*10000*log(2))) → 118.

So the col­li­sion point is smaller than .

But all of these are them­selves up­per bounds be­cause the case in which birth­days/­names are uni­formly dis­trib­uted is the worst case. If there is any differ­ence in the prob­a­bil­i­ties, a col­li­sion will hap­pen much ear­li­er. This makes sense since if 1 birth­day hap­pens with, say, P=0.99, then it’s al­most im­pos­si­ble to go more than 3 or 4 birth­days with­out a col­li­sion. Like­wise, if one birth­day has P=0.50, and so on down to P=$:

sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.99, rep(0.01/364, 364)))})
# [1] 0.0000 0.9789 0.9995 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
#     1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
sapply(1:23, function(n){ simulate(n, 365, probabilities=c(0.5, rep(0.5/364, 364)))})
# [1] 0.0000 0.2531 0.5031 0.6915 0.8182 0.8896 0.9402 0.9666 0.9808 0.9914 0.9951 0.9973 0.9988
#     0.9993 0.9991 0.9999 1.0000 1.0000 0.9999 1.0000 1.0000 1.0000 1.0000

How skewed are real names? Given Names Fre­quency Project pro­vides “Pop­u­lar Given Names US, 1801-1999” (1990-1999, 909288 names) based on So­cial Se­cu­rity da­ta. After delet­ing the first 4 lines of s1990m.txt, it can be loaded into R and the frac­tions used as prob­a­bil­i­ties to find the 50% col­li­sion point for US names:

names <- read.csv("s1990m.txt", header=FALSE)
#         V1            V2
#  Aaron   :  1   Min.   :   55.0000
#  Abdiel  :  1   1st Qu.:   86.0000
#  Abdullah:  1   Median :  183.0000
#  Abel    :  1   Mean   :  914.1923
#  Abraham :  1   3rd Qu.:  535.5000
#  Adam    :  1   Max.   :24435.0000
#  (Other) :852
# [1] 784377
## "Scott" as fraction of all names:
2279 / 784377
# [1] 0.0029054906
## presumably male names:
2279 / (784377*0.5)
# [1] 0.005810981199

simulate(118, nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 1
findN(nrow(names), probabilities=names$V2/sum(names$V2))
# [1] 15

So a more re­al­is­tic analy­sis sug­gests n = 15 is where unique first names will prob­a­bly break down.

This only cov­ers the 853 most com­mon per­sonal names, and the more names, the higher the n has to be to trig­ger a col­li­sion (mak­ing 15 some­thing of a lower up­per bound); to es­ti­mate 10000, we need to fit a dis­tri­b­u­tion to ex­trap­o­late be­low that. The fits rea­son­ably well and is easy to work with:

fitdist(names$V2, "lnorm")
# Fitting of the distribution ' lnorm ' by maximum likelihood
# Parameters:
#            estimate    Std. Error
# meanlog 5.550448321 0.04640182299
# sdlog   1.359185357 0.03281096378

simulateLN <- replicate(100, {
    names <- rlnorm(10000, meanlog=5.550448321, sdlog=1.359185357)
    hit <- findN(length(names), startingN=46, probabilities=names/sum(names))
# [1] 51

Since first names will clus­ter by age group, lo­ca­tion, pro­fes­sion, and what­not, ar­guably even 51 is a bit of an up­per bound.

Fi­nal­ly, one might ask the prob­a­bil­ity of a group with a great Scott, or to put it an­other way, the prob­a­bil­ity of it un­for­tu­nately get­ting away scot-free.

This is easy to an­swer; the prob­a­bil­ity of hav­ing 1 or more Scotts in a group is the prob­a­bil­ity of every­one hav­ing a name other than Scott. We saw that the prob­a­bil­ity of be­ing named Scott was P = 0.0029054906 in the name dataset. So the prob­a­bil­ity of one per­son not be­ing named Scott is . So the prob­a­bil­ity of n peo­ple all be­ing named not-S­cott is 0.997n. The crossover point is ~239.

So an Amer­i­can so­cial group can­not ex­ceed n = 51 be­fore first names be­gin to break down, and it is all Scot­t’s fault at n = 239.

Detecting fake (human) Markov chain bots

Some pop­u­lar Twit­ter and Tum­blr ac­counts use trained on a cor­pus of writ­ing such as Markov James Mitchens or two un­re­lated cor­puses to cre­ate amus­ing mashups: pro­gram­ming doc­u­men­ta­tion and H.P. Love­craft’s hor­ror/SF fic­tion or the King James Bible or the works of Karl Marx, Kim Kar­dashian and Kierkegaard, or Sil­i­con Val­ley re­cruit­ing emails and Erowid drug use re­ports. The hu­mor comes from the fact that the Markov chains have no un­der­stand­ing and are merely pro­grams pro­duc­ing gib­ber­ish that oc­ca­sion­ally present strik­ing jux­ta­po­si­tions or in­sights. Much of their ap­peal de­rives in large part from the fact that while hu­mans cu­rate them, hu­mans don’t write them. They de­pend on this au­then­tic­ity to be strik­ing.

Of course, there’s al­ways the temp­ta­tion to edit them or write them whole­sale, per­haps be­cause the Markov chains aren’t co­op­er­at­ing in pro­duc­ing any com­edy gold to tweet that day, which de­ceives the read­er. This poses an in­verse Tur­ing test: how would you de­tect a fake Markov chain ac­count, that is, one where a hu­man is pre­tend­ing to be a com­puter and writ­ing some of the text?

Markov chains are trained on a spe­cific cor­pus and are a prob­a­bilis­tic gen­er­a­tive model which en­code the prob­a­bil­ity that a word X fol­lows an­other word Y for all the words in that cor­pus (and sim­i­larly if they are op­er­at­ing on let­ters or on ); there is no state or mem­ory or ‘look back’ or abil­ity to model re­cur­sion. To gen­er­ate text, one sim­ply picks a ran­dom word Y, looks up the prob­a­bil­i­ties of all the words AZ from Y, and picks a word at ran­dom weighted by those prob­a­bil­i­ties; then re­peat in­defi­nite­ly. Con­verse­ly, one could also use it to cal­cu­late the of a given text by mul­ti­ply­ing the prob­a­bil­ity of each word in the text con­di­tional on the pre­vi­ous one.

One diffi­culty is the po­ten­tial for dou­ble-use of data: the first pass through a Markov chain ac­count is al­ready ap­ply­ing to the data a highly flex­i­ble Bayesian neural net­work with bil­lions of pa­ra­me­ters (one’s brain). If one spots an ‘anom­alous’ dataset and sub­se­quent analy­sis con­firms it, what does this mean? I am re­minded of one past in­ci­dent: some­one had lost a great deal of money on a Bit­coin gam­bling web­site, and sus­pected the site had de­frauded him. But he had con­tacted me only be­cause he had had un­usual loss­es. What does an analy­sis mean? Imag­ine that the top 1% of losers get an­gry and start look­ing into whether they were cheat­ed; they go to a sta­tis­ti­cian who duly com­putes that based on the num­ber of games played, there is a p = 0.01 that they would lose as much or more as they did… If one had all the gam­bling records, one could look at the over­all pat­terns and see if there are more losers than there should be given the rules of the game and a sup­pos­edly fair ran­dom num­ber gen­er­a­tor, but what does one do with 1 self­-s­e­lected play­er? The data gen­er­a­tion process is cer­tainly nei­ther ran­dom nor ‘ig­nor­able’ nor mod­e­lable with­out du­bi­ous as­sump­tions.

A few pos­si­ble at­tacks come to mind:

  • ob­ser­va­tion of mal­formed syn­tax or lack of long-range de­pen­den­cies
  • vo­cab­u­lary or out­put out­side an in­de­pen­dently trained Markov chain’s do­main
  • un­usu­ally low like­li­hood for an in­de­pen­dently trained Markov chain to gen­er­ate known sam­ples
  • un­usu­ally low like­li­hood for an in­de­pen­dently trained Markov chain to gen­er­ate known sam­ples com­pared to newly gen­er­ated sam­ples fil­tered at a 1-in-100s qual­ity level
  • un­usu­ally high qual­ity of known sam­ples com­pared to newly gen­er­ated sam­ples from in­de­pen­dently trained Markov chain fil­tered at a 1-in-100s qual­ity lev­el, tested non­para­met­ri­cally or para­met­ri­cally as a mix­ture model

Markov chains pro­duce re­al­is­tic-look­ing out­put and are effi­cient to cre­ate & run, but, com­pared to RNNs, they no­to­ri­ously model re­cur­sive syn­tax poor­ly, such as nested paren­the­ses (s­ince they have no way of re­mem­ber­ing whether a par­en­thet­i­cal com­ment had been start­ed), and can­not ex­trap­o­late—­for ex­am­ple, a word-level Markov chain can’t cre­ate new words, and would re­quire n-grams to have avail­able frag­ments of words which could be re­com­bined. The mem­o­ry-less na­ture of Markov chains also means that, lack­ing any mem­ory which could model the ‘long-range cor­re­la­tions’ found in nat­ural Eng­lish text like sys­tem­atic use of par­tic­u­lar names/­topic­s/vo­cab­u­lary, larger sam­ples quickly veer off-topic and be­come gib­ber­ish and lack any co­herency pos­si­bly even in­side a sin­gle sen­tence. (RNNs also have this prob­lem, but some­what less.)

With the lim­its of a Markov chain in mind, it would be easy to de­tect faked Markov chain out­put with large sam­ples: it is just diffi­cult for a hu­man to de­lib­er­ately gen­er­ate long text which is as non­sen­si­cal and syn­tac­ti­cally in­valid as a Markov chain cre­ates, for the same rea­son an un­prac­ticed hu­man is a re­mark­ably bad ran­dom num­ber gen­er­a­tor. How­ev­er, for this same rea­son the se­lected Markov sam­ples tend to be very short, usu­ally no more than a sen­tence. It might be pos­si­ble to mea­sure this on the sam­ples as a whole and ob­serve higher en­tropy or mem­o­ry­less-ness (eg by mea­sur­ing com­pres­sion per­for­mance or effi­ciency of a Markov chain in mod­el­ing the sam­ples), but I would guess that usu­ally the sam­ples are not long enough or large enough for this to have rea­son­able as a test. This elim­i­nates the eas­i­est test.

Since the cor­pus is known in many of these cas­es, we can as­sume ac­cess to a Markov chain model which is sim­i­lar (if not iden­ti­cal) to the one which sup­pos­edly wrote all the tweets. This gives us sev­eral pos­si­bil­i­ties.

We could ex­ploit the lack of cre­ativ­ity of Markov chains and look for any­thing in the tweets which is not present in the orig­i­nal cor­pus. For ex­am­ple, if a word like “” ap­pears nei­ther in the Pup­pet doc­u­men­ta­tion nor (hav­ing been coined in 1996, 59 years after he died) in H.P. Love­craft’s fic­tion, then it would have a prob­a­bil­ity of 0 of be­ing gen­er­ated by any Pup­pet/Love­craft Markov chain (as no word will have any tran­si­tion prob­a­bil­ity to it). Such ex­tra-cor­po­ral vo­cab­u­lary im­me­di­ately proves hu­man au­thor­ship.

Con­tin­u­ing this same log­ic, we could take the cor­pus, train our own Markov chain (which will at least be sim­i­lar), and use it to cal­cu­late the like­li­hood of all the tweets. A hu­man-writ­ten tweet may be pos­si­ble for the Markov chain to have writ­ten, but it will be far more un­likely than most of the ones the Markov chain ac­tu­ally wrote & were se­lect­ed. So we would see that most of the tweets have rea­son­able log like­li­hoods, but that our sus­pi­cious ones will be far more ex­treme. (If the Markov chains are word-level, this test sub­sumes the im­pos­si­ble-word test: any tweet with a word not in the cor­pus, and hence not rep­re­sented in the Markov chain, will have a mean­ing­less like­li­hood.)

This like­li­hood test might not help if they are all equally ex­treme, in which case one could use our Markov chain in an­other man­ner, as a gen­er­a­tive mod­el, to try to es­ti­mate the like­li­hood of get­ting as great a tweet. For this, one sam­ples sev­eral thou­sand sam­ples from our Markov chain, and screens them for good ones. This cre­ates an em­pir­i­cal dis­tri­b­u­tion of the like­li­hoods of good tweets con­di­tional on the null hy­poth­e­sis of a Markov chain au­thor; in this case, the null hy­poth­e­sis is known to be true by con­struc­tion. Then to test, one com­pares the known-Markov-chain tweets with the like­li­hoods of the sus­pect tweets (per­haps with a ). They should be sim­i­lar.

Al­ter­nate­ly, if one does­n’t want to use like­li­hoods as a mea­sure of im­prob­a­bil­i­ty, one could in­stead use some hu­man mea­sure of fun­ni­ness like hav­ing rat­ing the orig­i­nals and the sam­ples on a scale 1-5, and com­par­ing them. The orig­i­nal poster is prob­a­bly not screen­ing more than a few hun­dred gen­er­ated tweets for each se­lected tweet, so given a sim­i­lar level of strin­gen­cy, one’s gen­er­ated tweets should be equally good; if the orig­i­nals turn out to be ex­tremely bet­ter than yours, to a level where you would have to screen thou­sands of ran­dom sam­ples, that is highly sus­pi­cious and sug­gests the orig­i­nals were ‘too good to be true’.

With rat­ings or like­li­hoods, one could try to as­sume a de­creas­ing dis­tri­b­u­tion like an ex­po­nen­tial: most sam­ples will be in­co­her­ent and to­tally un­fun­ny, many will be slightly fun­ny, a few will be fun­ny, and a very few will be very fun­ny. The rat­ings on sam­ples gen­er­ated from our Markov chain will prob­a­bly fol­low a smooth dis­tri­b­u­tion. How­ev­er, if a hu­man is au­thor­ing some in an at­tempt to spice things up, they will be above the av­er­age of the Markov chain (other­wise why bother with cheat­ing?), and if there is a sub­stan­tial num­ber of them, this will cre­ate an anom­aly in the rat­ings of the orig­i­nal­s—a ‘bump’ in­di­cat­ing that the tweets are com­ing from two differ­ent pop­u­la­tions. In this case, it can be mod­eled as a with ei­ther k = 1 or k = 2, and the p-value or Bayesian pos­te­rior prob­a­bil­ity cal­cu­lated for 1 vs 2.

Optimal Existential Risk Reduction Investment

An ex­is­ten­tial risk is any risk which de­stroys or per­ma­nently crip­ples hu­man civ­i­liza­tion, such as an as­ter­oid strike or pan­dem­ic. Since hu­man­ity might oth­er­wise con­tinue for mil­lions of years, cre­at­ing un­told tril­lions of hu­mans and col­o­niz­ing the galaxy, hu­man ex­tinc­tion rep­re­sents the loss of lit­er­ally as­tro­nom­i­cal amounts of util­ity. The loss is greater than any dis­as­ter up to ex­tinc­tion lev­els, as hu­man­ity can al­ways re­cover from lesser dis­as­ters; but there is no re­cov­ery from a to­tal de­struc­tion. Thus, the ex­pected value of even a slight re­duc­tion in an ex­otic risk ought to it­self be as­tro­nom­i­cal, or at least ex­tremely large; un­der plau­si­ble val­ues for well-char­ac­ter­ized x-risks like as­ter­oid strikes or nu­clear war or pan­demic, pre­vent­ing them may be the char­i­ta­ble spend­ing with the high­est ex­pected value and they should be re­ceiv­ing all char­i­ta­ble ex­pen­di­tures.

This strikes peo­ple as odd and dan­ger­ous rea­son­ing. Is it re­ally true that we should be spend­ing al­most un­lim­ited amounts of money on these things and not oth­er­wise ex­tremely com­pelling char­i­ties like dis­trib­ut­ing malaria nets in Africa to save mil­lions of lives or vac­cine dis­tri­b­u­tion or fund­ing re­search into end­ing ag­ing? And if we should, how do we choose what frac­tion to spend on global warm­ing rather than ar­ti­fi­cial in­tel­li­gence? What if some­one dis­cov­ers an en­tirely new x-risk not pre­vi­ously con­sid­ered, like nearby su­per­novas or vac­uum col­lapses or nan­otech­nol­ogy ‘grey goo’?

Think­ing his­tor­i­cal­ly, it’s clear in ret­ro­spect that some­one con­cerned about x-risk would be bet­ter off not go­ing after the ter­mi­nal goal of x-risk re­duc­tion but in­stead spend­ing their money on in­stru­men­tal goals such as sci­ence/tech­nol­ogy or eco­nomic growth.

Imag­ine some­one in Eng­land in 1500 who rea­sons the same way about x-risk: hu­man­ity might be de­stroyed, so pre­vent­ing that is the most im­por­tant task pos­si­ble. He then spends the rest of his life re­search­ing the Devil and the Apoc­a­lypse. Such re­search is, un­for­tu­nate­ly, of no value what­so­ever un­less it pro­duces ar­gu­ments for athe­ism demon­strat­ing that that en­tire line of en­quiry is use­less and should not be pur­sued fur­ther. But as the In­dus­trial and Sci­en­tific Rev­o­lu­tions were just be­gin­ning, with ex­po­nen­tial in­creases in global wealth and sci­ence and tech­nol­ogy and pop­u­la­tion, ul­ti­mately lead­ing to vac­cine tech­nol­o­gy, rock­ets and space pro­grams, and enough wealth to fund all man­ner of in­vest­ments in x-risk re­duc­tion, he could in­stead had made a per­haps small but real con­tri­bu­tion by con­tribut­ing to eco­nomic growth by work & in­vest­ment or mak­ing sci­en­tific dis­cov­er­ies.

For ex­am­ple, Isaac New­ton’s dis­cov­er­ies in as­tron­omy and the laws of mo­tion helped in­au­gu­rate threads of work that led di­rectly to space satel­lites which can watch for as­ter­oids with Earth­-cross­ing or­bits. him­self was con­cerned with x-risk, as he feared that the would, cen­turies hence, plunge into the Sun and cause ex­pan­sion de­stroy­ing the Earth and hu­man­i­ty. What could New­ton have done to di­rectly re­duce this x-risk at the time? Ab­solutely noth­ing. There were no fea­si­ble coun­ter-mea­sures nor any fore­see­able tech­nolo­gies which could fore­stall a comet or pro­tect hu­man­ity from the Sun en­gulfing the Earth; there was not and still is not a mine or bomb shel­ter deep enough for that. What he could have done is close to what he did do: make fun­da­men­tal ad­vances in sci­ence which pos­ter­ity could build on and one day be rich and wise enough to do some­thing about the x-risk. As it hap­pens, New­ton was not quite right about the Great Comet (comets are not a mean­ing­ful frac­tion of the Sun’s mass) but there was a sim­i­lar x-risk he was un­aware of: gi­ant as­ter­oid im­pacts. And the so­lu­tions to a gi­ant comet—ob­serve all comets care­fully to project their fu­ture or­bits, de­stroy it, redi­rect its or­bit, evac­u­ate hu­man colonists to safety to un­affected plan­ets (New­ton sug­gested the satel­lites of the gas gi­ants)—are much the same as for a gi­ant as­ter­oid im­pact, and all ben­e­fit from eco­nomic growth & greater sci­ence/tech­nol­ogy (some­one has to pay for, and de­sign those satel­lites and space­craft, after al­l).

Eco­nomic wealth & sci­ence/tech­nol­ogy are al­l-pur­pose goods: they are use­ful for com­pound growth, and can also be spent on x-risk re­duc­tion. They are the ul­ti­mate in­stru­men­tal goods. If one is badly ig­no­rant, or poor, or un­able to mean­ing­fully re­duce an x-risk, one is bet­ter off ac­cept­ing the x-risk and in­stead spend­ing re­sources on fix­ing the for­mer prob­lems. One would pre­fer to get rid of the x-risk as soon as pos­si­ble, of course, but given one’s start­ing po­si­tion, there may sim­ply be no bet­ter strat­egy and the risk must be ac­cept­ed.

This raises the ques­tion: what is the op­ti­mal dis­tri­b­u­tion of re­sources to eco­nomic growth vs x-risk re­duc­tion over time which max­i­mizes ex­pected util­i­ty?

In­tu­itive­ly, we might ex­pect some­thing like early on in­vest­ing noth­ing at all in x-risk re­duc­tion as there’s not much money avail­able to be spent, and money spent now costs a lot of money down the line in lost com­pound growth; and then as the econ­omy reaches mod­ern lev­els and the op­por­tu­nity cost of x-risk be­comes di­re, money is in­creas­ingly di­verted to x-risk re­duc­tion. One might analo­gize it to in­sur­ance—poor peo­ple skimp on in­sur­ance be­cause they need the money for other things which hope­fully will pay off later like ed­u­ca­tion or start­ing a busi­ness, while rich peo­ple want to buy lots of in­sur­ance be­cause they al­ready have enough and they fear the risks. If this were an in­vest­ment ques­tion, a good strat­egy would be some­thing like the or strate­gies like : even if the ex­pected value of x-risk re­duc­tion is higher than other in­vest­ments, it only pays off very rarely and so re­ceives a very small frac­tion of one’s in­vest­ments. How­ev­er, it’s not clear that the Kelly cri­te­rion or Thomp­son sam­pling are op­ti­mal or even rel­e­vant: be­cause while Kelly avoids bank­ruptcy in the form of but does so only by mak­ing ar­bi­trar­ily small bets to avoid go­ing bank­rupt & re­fus­ing to ever risk one’s en­tire wealth; with x-risks, the ‘bank­ruptcy’ (ex­tinc­tion) can’t be avoided so eas­i­ly, as the risk is there whether you like it or not, and one can­not turn it to 0. (This comes up often in dis­cus­sion of why the Kelly cri­te­rion is rel­e­vant to de­ci­sion-mak­ing un­der risk; see also Pe­ters 2011 and the niche area of “evo­lu­tion­ary fi­nance” like Evstigneev et al 2008/Lens­berg & Schenk-Hoppé 2006 which draws con­nec­tions be­tween the Kelly cri­te­ri­on, prob­a­bil­ity match­ing, long-term sur­vival & evo­lu­tion­ary fit­ness.) In eco­nom­ics, sim­i­lar ques­tions are often dealt with in terms of the life-cy­cle hy­poth­e­sis in which eco­nomic agents strive to max­i­mize their util­ity over a ca­reer/life­time while (as Mark Twain put it, “when in youth a dol­lar would bring a hun­dred plea­sures, you can’t have it. When you are old, you get it & there is noth­ing worth buy­ing with it then. It’s an epit­ome of life. The first half of it con­sists of the ca­pac­ity to en­joy with­out the chance; the last half con­sists of the chance with­out the ca­pac­i­ty.”); in the life-cy­cle, one tries to build wealth as quickly as pos­si­ble while young, even go­ing into debt for in­vest­ments like a col­lege ed­u­ca­tion, then be­gins sav­ing up, con­sum­ing some, un­til re­tire­ment, at which point one con­sumes it all un­til one dies. But as far as I’ve seen any re­sults, life-cy­cle mod­els tend to not in­clude any mech­a­nism for spend­ing in or­der to re­duce mor­tal­i­ty/ag­ing and ac­cept the risk of death as a giv­en.

We could cre­ate a sim­ple Markov de­ci­sion process mod­el. An agent (hu­man­i­ty), each time pe­riod (year), has a cer­tain amount of wealth and an x-risk prob­a­bil­ity P. In this pe­ri­od, it can choose to al­lo­cate that wealth be­tween eco­nomic growth, in which case it re­ceives that in­vest­ment plus a re­turn, and it can buy a per­ma­nent per­cent­age re­duc­tion in the x-risk for a fixed sum. For the re­ward, the x-risk is bi­nary sam­pled with prob­a­bil­ity P; if the sam­ple is true, then the re­ward is 0 and the de­ci­sion process ter­mi­nates, else the re­ward is the wealth and the process con­tin­ues. Let’s imag­ine that this process can run up to 10,000 time pe­ri­ods, with a start­ing wealth of $248 bil­lion (An­gus Deaton’s es­ti­mate of PPP world GDP in 1500 ), the eco­nomic growth rate is 2% (the long-run real growth rate of the global econ­o­my), the ex­is­ten­tial risk prob­a­bil­ity is 0.1% per year (ar­bi­trar­ily cho­sen), and one can buy a re­duc­tion of 1% for a bil­lion dol­lars. (We’ll work in tril­lions units to help nu­meric sta­bil­i­ty.) What strat­egy max­i­mizes the cu­mu­la­tive re­wards? A few sim­ple ones come to mind:

  1. the agent could sim­ply ig­nore the x-risk and rein­vests all wealth, which to a first ap­prox­i­ma­tion, is the strat­egy which has been fol­lowed through­out hu­man his­tory and is pri­mar­ily fol­lowed now (lump­ing to­gether NASA’s Space­guard pro­gram, biowar­fare and pan­demic re­search, AI risk re­search etc prob­a­bly does­n’t come to more than $1-2b a year in 2016). This max­i­mizes eco­nomic growth rate but may back­fire as the x-risk never gets re­duced.
  2. the agent could spend the full gain in its wealth from eco­nomic growth (2%) on x-risk re­duc­tion. The wealth does­n’t grow and the re­turns from x-risk re­duc­tion do di­min­ish, but the x-risk is at least re­duced greatly over time.
  3. the agent could im­ple­ment a sort of prob­a­bil­ity match­ing: it spends on x-risk re­duc­tion a frac­tion of its wealth equal to the cur­rent P. This re­duces how much is spent on ex­tremely small x-risk re­duc­tions, but it might be sub­op­ti­mal be­cause it’ll pay the largest frac­tion of its econ­omy in the first time pe­ri­od, then sec­ond-largest in the sec­ond time pe­riod and so on, los­ing out on the po­ten­tial com­pound­ing.
  4. a more com­pli­cated hy­brid strat­egy might work: it max­i­mizes wealth like #1 for the first n time pe­ri­ods (eg n = 516), and then it switches to #2 for the re­main­ing time pe­riod
  5. like #4, but switch­ing from #1 to #3 for the re­main­ing time pe­ri­ods.
constantInvestmentAgent <- function (t, w, xrp) { return(c(w, 0)) }
constantReductionAgent  <- function (t, w, xrp) { drawdown <- 0.9803921573; return(c(drawdown*w, (1-drawdown)*w)) }
probabilityMatchAgent   <- function (t, w, xrp) { return(c(w*(1-xrp), w*xrp)) }
investThenReduceAgent   <- function (t, w, xrp, n=516) { if (t<n) { return(constantInvestmentAgent(t, w, xrp)) } else { return(constantReductionAgent(t, w, xrp)) } }
investThenMatchAgent    <- function (t, w, xrp, n=516) { if (t<n) { return(constantInvestmentAgent(t, w, xrp)) } else { return(probabilityMatchAgent(t, w, xrp)) } }

simulateWorld <- function(agent, t=10000) {
    initialW <- 0.248
    initialP <- 0.001
    df <- data.frame(T=0, Wealth=initialW, XriskP=initialP)

    for (i in 1:t) {
        last <- tail(df, n=1)
        xrisk <- rbinom(1,1, p=last$XriskP)
        if (xrisk) { break; } else {
          choices <- agent(last$T, last$Wealth, last$XriskP)
          newXriskP <- last$XriskP * (1 - 0.01)^(choices[2] / 0.001)
          newWealth <- choices[1] * 1.02
          df <- rbind(df, data.frame(T=i, Wealth=newWealth, XriskP=newXriskP))
   df$Reward <- cumsum(df$Wealth)

library(parallel); library(plyr)
simulateWorlds <- function(agent, iters=1000) {
    mean(ldply(mclapply(1:iters, function(i) { tail(simulateWorld(agent), n=1)$Reward }))$V1)  }

# [1] 2423.308636
# [1] 10127204.73
# [1] 1.154991741e+76
# [1] 7.53514145e+86
## Optimize the switch point:
which.max(sapply(seq(1, 10000, by=100), function(N) { simulateWorlds(function(t,w,xrp) { investThenMatchAgent(t, w, xrp, n=N) }, iters=100)}))
# [1] 3
simulateWorlds(function(t,w,xrp) { investThenMatchAgent(t, w, xrp, n=300) })
# [1] 9.331170221e+86
# [1] 1.006834082e+87

So of our 5 strate­gies, the con­stant re­duc­tion agent per­forms the worst (prob­a­bly be­cause with eco­nomic growth choked off, it can only buy small x-risk re­duc­tion­s), fol­lowed by the in­vest-then-re­duce agent; then the ‘get rich be­fore you get old’ con­stant in­vest­ment agent man­ages to often at­tain very high growth rates when it’s lucky enough that x-risks don’t strike early on; but far bet­ter than any of them, by or­ders of mag­ni­tude, are the par­tial and full prob­a­bil­ity match­ing agents. The par­tial prob­a­bil­ity match­ing agent turns out to have a sub­op­ti­mal switch point t = 516, and a more care­ful search of switch points finds that t~=300 is the best switch point and it ex­ceeds the pure prob­a­bil­ity matcher which matches from the start.

What’s go­ing on there? I sus­pect it’s some­thing sim­i­lar to the differ­ence in mul­ti­-armed ban­dit prob­lems be­tween the as­ymp­tot­i­cally op­ti­mal so­lu­tion and the op­ti­mal so­lu­tion for a fixed hori­zon found us­ing dy­namic pro­gram­ming: in the for­mer sce­nar­io, there’s an in­defi­nite amount of time to do any ex­plo­ration or in­vest­ment in in­for­ma­tion, but in the lat­ter, there’s only a fi­nite time left and ex­plo­ration/­growth must be done up front and then the op­ti­mal de­ci­sion in­creas­ingly shifts to ex­ploita­tion rather than growth.

Why does prob­a­bil­ity match­ing in gen­eral work so well? It may sim­ply be be­cause it’s the only base­line strat­egy which ad­justs its xrisk in­vest­ment over time.

This does­n’t demon­strate that prob­a­bil­ity match­ing is op­ti­mal, just that it beats the other base­line strate­gies. Other strate­gies could be used to de­crease xrisk in­vest­ment over time—in­stead of be­ing pro­por­tional to xrisk P, it could shrink lin­early over time, or by square root, or log­a­rith­mi­cal­ly, or…

What re­in­force­ment learn­ing tech­niques might we use to solve this?

This prob­lem rep­re­sents a large Markov De­ci­sion Process with 1 dis­crete state vari­able (time, t = 0-10000), 2 con­tin­u­ous state vari­ables (wealth, and risk prob­a­bil­i­ty), and 1 con­tin­u­ous ac­tion (frac­tion of growth to al­lo­cate to the econ­omy vs ex­is­ten­tial risk re­duc­tion). The con­tin­u­ous ac­tion can be dis­cretized into 11 ac­tions with­out prob­a­bly los­ing any­thing (al­lo­cate 100%/90%..10%/0%), but the 2 state vari­ables can’t be dis­cretized eas­ily be­cause they can span many or­ders of mag­ni­tude.

  • dy­namic pro­gram­ming a de­ci­sion tree with back­wards in­duc­tion: op­ti­mal, but re­quires dis­crete ac­tions and state vari­ables, and even if dis­cretized, 10000 time steps would be in­fea­si­bly large.

  • stan­dard tab­u­lar learn­ing: Q-learn­ing, SARSA, tem­po­ral differ­ences: re­quires dis­crete ac­tions and state vari­ables

    • Deep Q-Net­works: re­quires dis­crete ac­tions, but not state vari­ables
  • MDP solvers: value it­er­a­tion etc: op­ti­mal, but re­quires dis­crete ac­tions and state vari­ables

  • hy­brid MDP solvers: op­ti­mal, and can han­dle a lim­ited amount of con­tin­u­ous state vari­ables (but not con­tin­u­ous ac­tion­s), which would work here; but high qual­ity soft­ware im­ple­men­ta­tions are rarely avail­able.

    One such hy­brid MDP solver is hmpd, which solves prob­lems spec­i­fied in the Lisp-like DSL (judg­ing from the ex­am­ples, a ver­sion with prob­a­bilis­tic effects, so PPDDL 1.0?). After try­ing to write down a PPDDL model cor­re­spond­ing to this sce­nar­io, it seems that PPDDL is un­able to rep­re­sent prob­a­bil­i­ties or re­wards which change with time and so can­not rep­re­sent the in­crease in wealth or de­crease in x-risk prob­a­bil­i­ty.

  • pol­icy gra­di­ents: can han­dle con­tin­u­ous state vari­ables & ac­tions but are highly com­plex and un­sta­ble; high qual­ity soft­ware im­ple­men­ta­tions are un­avail­able

Of the pos­si­ble op­tions, a DQN agent seems like the best choice: a small neural net­work should be able to han­dle the prob­lem and DQN only re­quires the ac­tions to be dis­cretized. reinforce.js pro­vides a DQN im­ple­men­ta­tion in JS which I’ve used be­fore, so I start there by rewrit­ing the prob­lem in JS

var script = document.createElement("script");
script.src = "";

// environment: t, w, xrp
function simulate(environment, w_weight, xrp_weight) {
    var xrisk = Math.random() < environment.xrp
    if (xrisk) {
    return {reward: -100, alive: false, t: environment.t, w: environment.w, xrp: environment.xrp};
    } else {
    return {reward: Math.log(environment.w), alive: true, t: environment.t+1,
            w: environment.w*w_weight*1.02, xrp: environment.xrp * (Math.pow((1 - 0.01), (xrp_weight / 0.001))) }
var defaultState = {t: 0, w: 0.248, xrp: 0.01}
// simulate(defaultState, 0.99, 0.01)
// simulate(defaultState, 0.99, 0.01)

var env = {};
env.getNumStates = function() { return 3; }; // there are only 3 state variables: t/w/xrp
env.getMaxNumActions = function() { return 11; }; // we'll specify 10 possible allocations: 1/0, 0.998/0.002 .. 0.98/0.02
var spec = {
  num_hidden_units: 200,
  experience_add_every: 20,
  learning_steps_per_iteration: 1,
  experience_size: 1000000,
  alpha: 0.01,
  epsilon: 1.0,
  gamma: 0.99 // minimal discounting
var agent = new RL.DQNAgent(env, spec);

var total_reward = 0;
state = defaultState;
spec.epsilon = 1.0; // reset epsilon if we've been running the loop multiple times

for(var i=0; i < 10000*3000; i++) {
   var action = agent.act(state)
   state = simulate(state, 1-(action/500), 0+(action/500) );

   total_reward = total_reward + state.reward;
   if (Number.isInteger(Math.log(i) / Math.log(10)) ) { spec.epsilon = spec.epsilon / 1.5; } // decrease exploration

   if (!state.alive || state.t >= 10000) { // if killed by x-risk or horizon reached
     console.log(state.t, state.w, state.xrp, total_reward);
     total_reward = 0;
     state = defaultState;

//exercise the trained agent to see how it thinks
spec.epsilon = 0;
for (var t=0; t < 10000; t++) {
    action = agent.act(state)
    state = simulate(state, 1-(action/500), 0+(action/500) );
    total_reward = total_reward + state.reward
    console.log(action, state, total_reward);

After a day of train­ing, the DQN agent had learned to get up to 5e41, which was dis­ap­point­ingly in­fe­rior to the con­stant in­vest­ment & prob­a­bil­ity match­ing agents (1e87). The NN looks big enough for this prob­lem and the ex­pe­ri­ence re­play buffer was more than ad­e­quate; NNs in RL are known to have is­sues with the re­ward, though, and typ­i­cally ‘clamp’ the re­ward to a nar­row range, so I sus­pected that re­wards go­ing up to 5e41 (in­ter­pret­ing wealth on each turn as the re­ward) might be play­ing havoc with con­ver­gence, and switched the re­ward to log wealth in­stead. This did not make a no­tice­able differ­ence overnight (a­side from the DQN agent now achiev­ing 9.5e41). I won­dered if the risk was too rare for easy learn­ing and 100 neu­rons was not enough to ap­prox­i­mate the curve over time, so I fixed a bug I no­ticed where the sim­u­la­tion did not ter­mi­nate at t=10000, dou­bled led the neu­ron count, in­creased the ini­tial x-risk to 1%, and be­gan a fresh run. After 1 day, it reached 9.4e41 to­tal re­ward (un­logged).

Cu­mu­la­tive log score for DQN after tweaks and ~2h of train­ing: reg­u­larly reaches ~470k when it does­n’t die im­me­di­ately (which hap­pens ~1/20 of the time). In com­par­ison, prob­a­bil­i­ty-match­ing agent av­er­ages a cu­mu­la­tive log score of 866k. After 2 days of train­ing, the DQN had im­proved only slight­ly; the on-pol­icy strat­egy ap­pears mostly ran­dom aside from hav­ing dri­ven the xrisk prob­a­bil­ity down to what ap­pears to be the small­est float JS sup­ports, so it still had not learned a mean­ing­ful com­pro­mise be­tween xrisk re­duc­tion and in­vest­ment.

TODO: re­visit with MCTS at some point?

Model Criticism via Machine Learning

In “Deep learn­ing, model check­ing, AI, the no-ho­muncu­lus prin­ci­ple, and the uni­tary na­ture of con­scious­ness”, An­drew Gel­man writes

Here’s how we put it on the very first page of our book:

The process of Bayesian data analy­sis can be ide­al­ized by di­vid­ing it into the fol­low­ing three steps:

  1. Set­ting up a full prob­a­bil­ity model - a joint prob­a­bil­ity dis­tri­b­u­tion for all ob­serv­able and un­ob­serv­able quan­ti­ties in a prob­lem. The model should be con­sis­tent with knowl­edge about the un­der­ly­ing sci­en­tific prob­lem and the data col­lec­tion process.
  2. Con­di­tion­ing on ob­served data: cal­cu­lat­ing and in­ter­pret­ing the ap­pro­pri­ate pos­te­rior dis­tri­b­u­tion—the con­di­tional prob­a­bil­ity dis­tri­b­u­tion of the un­ob­served quan­ti­ties of ul­ti­mate in­ter­est, given the ob­served da­ta.
  3. Eval­u­at­ing the fit of the model and the im­pli­ca­tions of the re­sult­ing pos­te­rior dis­tri­b­u­tion: how well does the model fit the data, are the sub­stan­tive con­clu­sions rea­son­able, and how sen­si­tive are the re­sults to the mod­el­ing as­sump­tions in step 1? In re­spon­se, one can al­ter or ex­pand the model and re­peat the three steps.

How does this fit in with goals of per­form­ing sta­tis­ti­cal analy­sis us­ing ar­ti­fi­cial in­tel­li­gence?

3. The third step—i­den­ti­fy­ing model mis­fit and, in re­spon­se, fig­ur­ing out how to im­prove the mod­el—seems like the tough­est part to au­to­mate. We often learn of model prob­lems through open-ended ex­ploratory data analy­sis, where we look at data to find un­ex­pected pat­terns and com­pare in­fer­ences to our vast stores of sta­tis­ti­cal ex­pe­ri­ence and sub­jec­t-mat­ter knowl­edge. In­deed, one of my main pieces of ad­vice to sta­tis­ti­cians is to in­te­grate that knowl­edge into sta­tis­ti­cal analy­sis, both in the form of for­mal prior dis­tri­b­u­tions and in a will­ing­ness to care­fully in­ter­ro­gate the im­pli­ca­tions of fit­ted mod­els.

One way of look­ing at step #3 is to treat the hu­man sta­tis­ti­cian as an­other mod­el: specifi­cal­ly, he is a large neural net­work with tril­lions of pa­ra­me­ters, who has been trained to look for anom­alies & model mis­spec­i­fi­ca­tion, and to fix them when he finds them, re­train­ing the mod­el, un­til he can no longer eas­ily dis­tin­guish the orig­i­nal data from the mod­el’s pre­dic­tions or sam­ples. As he is such a large model with the abil­ity to rep­re­sent and in­fer a large class of non­lin­ear­i­ties, he can usu­ally eas­ily spot flaws where the cur­rent mod­el’s dis­tri­b­u­tion differs from the true dis­tri­b­u­tion.

This bears a con­sid­er­able re­sem­blance to the in­creas­ing pop­u­lar­ity of “gen­er­a­tive ad­ver­sar­ial net­works” (GANs): us­ing pairs of neural net­works, one of which tries to gen­er­ate re­al­is­tic data, and a sec­ond which tries to clas­sify or dis­crim­i­nate be­tween real and re­al­is­tic da­ta. As the sec­ond learns ways in which the cur­rent re­al­is­tic data is un­re­al­is­tic, the first gets feed­back on what it’s do­ing wrong and fixes it. So the loop is very sim­i­lar, but fully au­to­mat­ed. (A third set of ap­proaches this re­sem­bles is ac­tor-critic re­in­force­ment learn­ing al­go­rithm­s.)

If we con­sider the kinds of mod­els which are be­ing cri­tiqued, and what is cri­tiquing, this gives us 4 pos­si­ble com­bi­na­tions:

sim­ple com­plex
sim­ple model fit in­dex­es+­lin­ear model sta­tis­ti­cian+­lin­ear model
com­plex model fit in­dex­es+ML ML+ML (eg GANs)
  1. Sim­ple/sim­ple is use­ful for cases like lin­ear re­gres­sion where clas­sic meth­ods like ex­am­in­ing resid­u­als or R^2s or Cook in­dexes can often flag prob­lems with the mod­el.

  2. Sim­ple/­com­plex is also use­ful, as the hu­man sta­tis­ti­cian can spot ad­di­tional prob­lems.

  3. Com­plex/sim­ple is prob­a­bly use­less, as the NNs may eas­ily have se­vere prob­lems but will have fit any sim­ple lin­ear struc­ture and fool reg­u­lar di­ag­nos­tics.

  4. Com­plex/­com­plex can be very use­ful in ma­chine learn­ing, but in differ­ent ways from a good sim­ple mod­el.

    Fast, sim­ple, gen­er­al—a good sta­tis­ti­cal method lets you choose one; a great method lets you choose two. (Con­sider lin­ear mod­els, de­ci­sion trees, NNs, MCMC, ABC, dis­crete Bayesian net­works, and ex­po­nen­tial fam­ily vs non­para­met­ric meth­ods as ex­am­ples of the trade­offs here.)

So is quad­rant 2 fully pop­u­lated by hu­man sta­tis­ti­cians? We would­n’t nec­es­sar­ily want to use GANs for every­thing we use sta­tis­ti­cians for now, be­cause neural net­works can be too pow­er­ful and what we want from our mod­els is often some sort of clear an­swer like “does X pre­dict Y?” and sim­plic­i­ty. But we could re­place the sta­tis­ti­cian with some other pow­er­ful critic from ma­chine learn­ing—­like a NN, SVM, ran­dom forest, or other en­sem­ble. So in­stead of hav­ing two NNs fight­ing each other as in a GAN, we sim­ply have one spec­i­fied mod­el, and a NN which tries to find flaws in it, which can then be re­ported to the user. The loop then be­comes: write down and fit a model to the real data; gen­er­a­tive pos­te­rior pre­dic­tive sam­ples from the dis­tri­b­u­tion; train a small NN on real data vs pre­dic­tive data; the clas­si­fi­ca­tion per­for­mance mea­sures the plau­si­bil­ity of the pre­dic­tive sam­ples (per­haps some­thing like a KL di­ver­gence), giv­ing a mea­sure of the model qual­i­ty, and flags data points which are par­tic­u­larly eas­ily dis­tin­guished as re­al; the hu­man sta­tis­ti­cian now knows ex­actly which data points are not cap­tured by the model and can mod­ify the mod­el; re­peat un­til the NN’s per­for­mance de­clines to chance.

Let’s try an ex­am­ple. We’ll set up a sim­ple lin­ear model re­gres­sion Y ~ A + B + C with a few prob­lems in it:

  1. the trend is not lin­ear but slightly qua­dratic
  2. the out­come vari­able is also right-cen­sored at a cer­tain point
  3. and fi­nal­ly, the mea­sured co­vari­ates have been rounded
n <- 10000
ceiling <- 1
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 0 + 0.5*a + 0.5*b + 0.5*c^2 + rnorm(n)
y_censored <- ifelse(y>=3, 3, y)
df <- data.frame(Y=y_censored, A=round(a, digits=1), B=round(b, digits=1), C=round(c, digits=1))

l <- lm(Y ~ A + B + C, data=df)
plot(df$Y, predict(l, df))

l2 <- lm(Y ~ A + B + I(C^2), data=df)
plot(df$Y, predict(l2, df))

The cen­sor­ing shows up im­me­di­ately on the di­ag­nos­tics as an ex­cess of ac­tual points at 3, but the qua­dratic­ity is sub­tler, and I’m not sure I can see the round­ing at all.


## First, random forest performance under the null hypothesis

modelNull <- data.frame(Y=c(df$Y, df$Y), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_n <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelNull); r_n
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#         OOB estimate of  error rate: 100%
# Confusion matrix:
#       0     1 class.error
# 0     0 10000           1
# 1 10000     0           1

modelPredictions <- data.frame(Y=c(df$Y, predict(l, df)), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions); r
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#         OOB estimate of  error rate: 6.59%
# Confusion matrix:
#      0    1 class.error
# 0 9883  117      0.0117
# 1 1200 8800      0.1200

## many of the LM predictions are identical, but the RF is not simply memorizing them as we can jitter predictions and still get the same classification performance:
modelPredictions$Y2 <- jitter(modelPredictions$Y)
randomForest(as.ordered(Real) ~ Y2 + A + B + C, modelPredictions)
#...                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 2
#         OOB estimate of  error rate: 6.57%
# Confusion matrix:
#      0    1 class.error
# 0 9887  113      0.0113
# 1 1200 8800      0.1200

Note we need to be care­ful about col­lect­ing the pos­te­rior pre­dic­tive sam­ples: if we col­lect 10000 pos­te­rior sam­ples for each of the 10000 dat­a­points, we’ll store 100002 num­bers which may cause prob­lems. 1 should be enough.

model <- 'model {
 for (i in 1:n) {
     mean[i] <- mu + betaA*A[i] + betaB*B[i] + betaC*C[i]
     Y[i] ~ dnorm(mean[i], tau)

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
model <- run.jags(model, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)

posterior_predictive <- tail(n=10000, model$mcmc[[1]][1,])
plot(df$Y, posterior_predictive)

modelPredictions_r <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r
#         OOB estimate of  error rate: 49.11%
# Confusion matrix:
#      0    1 class.error
# 0 4953 5047      0.5047
# 1 4776 5224      0.4776
model_rounded <- 'model {
 for (i in 1:n) {
     roundA[i] ~ dround(A[i], 3)
     roundB[i] ~ dround(B[i], 3)
     roundC[i] ~ dround(C[i], 3)
     mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
     Y[i] ~ dnorm(mean[i], tau)

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
model_r <- run.jags(model_rounded, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df))), inits=list(mu=0.45, sd=0.94, betaA=0.47, betaB=0.46, betaC=0), monitor=c("Y"), n.chains = 1, sample=1)

posterior_samples <- tail(n=10000, model_r$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)
plot(df$Y, posterior_predictive)

modelPredictions_r <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r); r_r
#         OOB estimate of  error rate: 50.48%
# Confusion matrix:
#      0    1 class.error
# 0 4814 5186      0.5186
# 1 4909 5091      0.4909
model_rounded_censor <- 'model {
 for (i in 1:n) {
     roundA[i] ~ dround(A[i], 3)
     roundB[i] ~ dround(B[i], 3)
     roundC[i] ~ dround(C[i], 3)
     mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]
     Y[i] ~ dnorm(mean[i], tau)
     is.censored[i] ~ dinterval(Y[i], c)

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)
model_r_c <- run.jags(model_rounded_censor, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)

posterior_samples <- tail(n=10000, model_r_c$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)

modelPredictions_r_c <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c); r_r_c
#         OOB estimate of  error rate: 53.67%
# Confusion matrix:
#      0    1 class.error
# 0 4490 5510      0.5510
# 1 5224 4776      0.5224
model_rounded_censor_quadratic <- 'model {
 for (i in 1:n) {
     roundA[i] ~ dround(A[i], 3)
     roundB[i] ~ dround(B[i], 3)
     roundC[i] ~ dround(C[i], 3)
     mean[i] <- mu + betaA*roundA[i] + betaB*roundB[i] + betaC*roundC[i]^2
     Y[i] ~ dnorm(mean[i], tau)
     is.censored[i] ~ dinterval(Y[i], c)

 sd   ~ dgamma(0.01, 0.01)
 tau  <- 1/sqrt(sd)

 mu ~ dnorm(0, 100)
 betaA ~ dnorm(0, 100)
 betaB ~ dnorm(0, 100)
 betaC ~ dnorm(0, 100)

model_r_c_q <- run.jags(model_rounded_censor_quadratic, data = with(df, list(Y=c(Y, rep(NA, nrow(df))), A=c(A,A), B=c(B,B), C=c(C,C), n=2*nrow(df), is.censored=c(as.integer(Y==3), as.integer(Y==3)), c=3)), inits=list(mu=0.37, sd=1, betaA=0.42, betaB=0.40, betaC=0), monitor=c("Y"), n.chains = 1, adapt=0, burnin=500, sample=1)

posterior_samples <- tail(n=10000, model_r_c_q$mcmc[[1]][1,])
posterior_predictive <- ifelse(posterior_samples>=3, 3, posterior_samples)

modelPredictions_r_c_q <- data.frame(Y=c(df$Y, posterior_predictive), Real=c(rep(1, n), rep(0, n)), A=c(df$A, df$A), B=c(df$B, df$B), C=c(df$C, df$C))
r_r_c_q <- randomForest(as.ordered(Real) ~ Y + A + B + C, modelPredictions_r_c_q); r_r_c_q
#         OOB estimate of  error rate: 61.02%
# Confusion matrix:
#      0    1 class.error
# 0 3924 6076      0.6076
# 1 6127 3873      0.6127

trueNegatives <- modelPredictions_r_c_q[predict(r_r_c_q) == 0 & modelPredictions_r_c_q$Real == 0,]

Where can we go with this? The ML tech­niques can be used to rank ex­ist­ing Bayesian mod­els in an effec­tive if un­prin­ci­pled way. Tech­niques which quan­tify un­cer­tainty like Bayesian neural net­works could give more effec­tive feed­back by high­light­ing the points the Bayesian NN is most cer­tain are fake, guid­ing the an­a­lyst to­wards the worst-mod­eled dat­a­points and pro­vid­ing hints for im­prove­ment. More in­spi­ra­tion could be bor­rowed from the GAN lit­er­a­ture, such as “mini­batch dis­crim­i­na­tion”—as demon­strated above, the ran­dom forests only see one data point at a time, but in train­ing GANs, it has proven use­ful to in­stead ex­am­ine mul­ti­ple dat­a­points at a time to en­cour­age the gen­er­a­tor to learn how to gen­er­ate a wide va­ri­ety of dat­a­points rather than mod­el­ing a few dat­a­points ex­tremely well; a ML model which can pre­dict mul­ti­ple out­puts si­mul­ta­ne­ously based on mul­ti­ple in­puts would be anal­o­gous (that is, in­stead of X ~ A + B + C, it would look more like X1 + X2 + X3 ... ~ A1 + B1 + C1 + A2 + B2 + C2 + ..., with the in­de­pen­dent & de­pen­dent vari­ables from mul­ti­ple data points all fed in si­mul­ta­ne­ously as a sin­gle sam­ple) and might be an even more effec­tive model crit­ic.

Proportion of Important Thinkers by Global Region Over Time in Charles Murray’s Human Accomplishment

Hu­man Ac­com­plish­ment is a 2003 book by re­port­ing a large-s­cale ci­ta­tion analy­sis of bi­o­graph­i­cal dic­tio­nar­ies & ref­er­ence books on art/lit­er­a­ture/­science/­math­e­mat­ic­s/phi­los­o­phy/­science through­out his­to­ry, quan­ti­fy­ing the rel­a­tive im­por­tance of “sig­nifi­cant in­di­vid­u­als” such as Isaac New­ton or Im­manuel Kant or Con­fu­cius and the tem­po­ral & ge­o­graph­i­cal pat­terns; in par­tic­u­lar, it demon­strates large Eu­ro­pean con­tri­bu­tions through­out his­tory and in­creas­ingly dra­mat­i­cally post-1400 AD. The dataset has been re­leased.

Emil Kirkegaard cre­ated a vi­su­al­iza­tion of of the pro­por­tion by rough ge­o­graphic re­gion (Eu­ro­pean/Asian/other) in R us­ing gg­plot2 and LOESS smooth­ing. Per­haps the most strik­ing as­pect of it is the Dark Ages show­ing up as a spike in Asian pro­por­tion, fol­lowed by the .

This vi­su­al­iza­tion has been crit­i­cized as Eu­ro­cen­tri­cal­ly-mis­lead­ing and dri­ven by ar­ti­facts in the analy­sis/­graph­ing:

  • ig­nores the con­straint that pro­por­tions must be 0-1 and naively ex­trap­o­lates be­yond the bound­aries, pro­duc­ing neg­a­tive es­ti­mates for some re­gion­s/­times

  • no vi­su­al­iza­tion of un­cer­tainty is pro­vid­ed, ei­ther in the form of graph­ing the raw data points by su­per­im­pos­ing a scat­ter­plot or by pro­vid­ing stan­dard er­rors or cred­i­ble in­ter­vals. It is pos­si­ble that the over­all shapes or spe­cific pe­ri­ods are no more than chance scat­ters in a time-series based on few dat­a­points.

    • LOESS can pro­vide es­ti­mate lo­cal stan­dard er­rors & con­fi­dence in­ter­vals but they are of ques­tion­able mean­ing in the ab­sence of the un­der­ly­ing counts
  • al­ter­na­tive­ly, the dis­tri­b­u­tion of sig­nifi­cant fig­ures may not be treated cor­rectly para­met­ri­cally

  • pro­por­tions may re­flect a time-series with trends and so pre­ci­sion is ex­ag­ger­ated

None of these ob­jec­tions hold any wa­ter as the dataset and its em­bed­ded differ­ences are suffi­ciently large that the method of analy­sis will make lit­tle differ­ence; I will demon­strate this be­low by re-an­a­lyz­ing it to ad­dress the quib­bles and show that all pat­terns re­main in­tact or are sharp­ened. The above crit­i­cisms can be ad­dressed by:

  1. switch­ing from a LOESS plot to splines or lo­cal bi­no­mial re­gres­sions
  2. plot­ting the raw pro­por­tions grouped by decade or cen­tury
  3. us­ing a non­para­met­ric boot­strap to cal­cu­late con­fi­dence in­ter­vals, a pro­ce­dure which lends it­self to vi­su­al­iza­tion as an an­i­ma­tion of plots of all the re­sam­ples, giv­ing an in­tu­itive sense of how im­por­tant sam­pling er­ror is to the over­all pat­tern of curves and spe­cific parts of his­tory
  4. al­ter­nate­ly, in­stead of at­tempt­ing to fit the pro­por­tion, one can fit the orig­i­nal count of sig­nifi­cant fig­ures in a bi­no­mial or log-nor­mal Bayesian time-series model and sam­ple from the pos­te­rior es­ti­mates of each re­gion for each decade/­cen­tu­ry, and cal­cu­late pos­te­rior pro­por­tions, gain­ing full quan­tifi­ca­tion of un­cer­tain­ty, in­cor­po­ra­tion of any au­to­cor­re­la­tion, and smooth­ing; no ad­di­tional al­go­rithms or the­o­rems are re­quired, demon­strat­ing the el­e­gance of Bayesian ap­proaches

I did­n’t re­al­ize Kirkegaard’s R code was avail­able so I wound up re­do­ing it my­self (and get­ting the same re­sult­s):

## export CSV from spreadsheet in
h <- read.csv("HA.csv", header=TRUE)
#      Serial                        Name            Fl               Birth              Death              Inventory     ScienceField
#  Min.   :   11.00   Descartes, René  :   4   Min.   :-700.000   Min.   :-640.000   Min.   :-559.00   Science   :1442          :2560
#  1st Qu.: 6144.50   Hooke, Robert    :   4   1st Qu.:1557.250   1st Qu.:1580.000   1st Qu.:1638.00   Lit.West  : 835   Tech   : 239
#  Median :12534.50   Leonardo da Vinci:   4   Median :1804.000   Median :1782.000   Median :1844.00   Music.West: 522   Phys   : 218
#  Mean   :15994.27   Archimedes       :   3   Mean   :1585.638   Mean   :1616.174   Mean   :1682.81   Art.West  : 479   Chem   : 204
#  3rd Qu.:21999.75   Bacon, Francis   :   3   3rd Qu.:1900.000   3rd Qu.:1863.000   3rd Qu.:1930.00   Phil.West : 155   Biol   : 193
#  Max.   :43134.00   d'Alembert, Jean :   3   Max.   :1949.000   Max.   :1910.000   Max.   :1997.00   Art.China : 111   Math   : 191
#                     (Other)          :3981                      NA's   :304        NA's   :351       (Other)   : 458   (Other): 397
#      Index             Duplicate           BirthCountry   WorkCountry      Ethnicity        Woman            No..of.Inventories
#  Min.   :  0.60000   Min.   :0.00000000   France : 564   France : 605   Germanic: 592   Min.   :0.00000000   Min.   :2.000000
#  1st Qu.:  3.54000   1st Qu.:0.00000000   Germany: 556   Britain: 574   French  : 565   1st Qu.:0.00000000   1st Qu.:2.000000
#  Median :  7.60000   Median :0.00000000   Britain: 554   Germany: 525   English : 441   Median :0.00000000   Median :2.000000
#  Mean   : 12.95713   Mean   :0.06221889   Italy  : 400   Italy  : 406   Italian : 397   Mean   :0.02198901   Mean   :2.228916
#  3rd Qu.: 15.89000   3rd Qu.:0.00000000   USA    : 306   USA    : 375   USA     : 276   3rd Qu.:0.00000000   3rd Qu.:2.000000
#  Max.   :100.00000   Max.   :1.00000000   China  : 239   China  : 239   Chinese : 240   Max.   :1.00000000   Max.   :4.000000
#  NA's   :115                              (Other):1383   (Other):1278   (Other) :1491                        NA's   :3753
#  [1] "Ancient Greek" "Ancient Roman" "Arabic"        "Australian"    "Basque"        "Black"         "Bulgarian"     "Canadian"
#  [9] "Chinese"       "Croatian"      "Czech"         "Danish"        "Dutch"         "English"       "Estonian"      "Finnish"
# [17] "Flemish"       "French"        "Germanic"      "Greek"         "Hungarian"     "Icelandic"     "Indian"        "Irish"
# [25] "Italian"       "Japanese"      "Jewish"        "Latino"        "New Zealand"   "Norwegian"     "Polish"        "Portuguese"
# [33] "Romanian"      "Scots"         "Slavic"        "Slovenian"     "Spanish"       "Swedish"       "Swiss"         "USA"

european <- c("Ancient Greek", "Ancient Roman", "Australian", "Basque", "Bulgarian", "Canadian", "Croatian", "Czech", "Danish",
    "Dutch", "English", "Estonian", "Finnish", "Flemish", "French", "Germanic", "Greek", "Hungarian", "Icelandic", "Irish",
    "Italian", "Jewish", "New Zealand", "Norwegian", "Polish", "Portuguese", "Romanian", "Scots", "Slavic", "Slovenian",
    "Spanish", "Swedish", "Swiss", "USA")
asian    <- c("Chinese", "Indian", "Japanese")
other    <- c("Arabic", "Black", "Latino")
groupMembership <- function(e) { if (e %in% european) { "European" } else { if (e %in% asian) { "Asian" } else { "Other" } } }
h$Group <- as.factor(sapply(h$Ethnicity, groupMembership))
#   Asian European    Other
#     507     3379      116

## We use 'Fl' (floruit/flourished), when a person is believed to have done their most important work,
## since birth/death is often unavailable.
## group to decades by rounding:
h$Fl.decade <- round(h$Fl, digits=-1)
hd <- subset(select=c(Fl.decade, Group), h)

hdcount <- aggregate(cbind(Group) ~ Fl.decade+Group, length, data=hd)
colnames(hdcount)[3] <- "Count"
## sort by time:
hdcount <- hdcount[order(hdcount$Fl.decade),]
nrow(h); sum(hdcount$Count)
# [1] 4002
# [1] 4002
head(hdcount, n=20)
#     Fl.decade    Group Count
# 178      -700 European     3
# 179      -680 European     1
# 180      -650 European     1
# 1        -600    Asian     2
# 181      -600 European     2
# 182      -580 European     2
# 183      -570 European     2
# 2        -550    Asian     1
# 184      -550 European     1
# 185      -540 European     5
# 3        -520    Asian     1
# 186      -520 European     3
# 4        -510    Asian     1
# 187      -510 European     2
# 188      -500 European     2
# 189      -480 European     6
# 190      -460 European     3
# 191      -450 European     7
# 5        -440    Asian     1
# 192      -440 European    11

## One issue with the count data: decades with zero significant figures from a group
## (which happens frequently) get suppressed. Some tools can handle the omission
## automatically but many cannot, so we need to manually insert any missing decades with '0'
decades <- seq(-700, 1950, by=10)
for (i in 1:length(decades)) {
    d <- decades[i]
    if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="European",])==0) {
        hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
    if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Asian",])==0) {
        hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
    if (nrow(hdcount[hdcount$Fl.decade==d & hdcount$Group=="Other",])==0) {
        hdcount <- rbind(hdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
hdcount <- hdcount[order(hdcount$Fl.decade),]

library(ggplot2); library(gridExtra)
c1 <- with(hdcount, qplot(Fl.decade, Count, color=Group) + stat_smooth())
c2 <- with(hdcount, qplot(Fl.decade, log1p(Count), color=Group) + stat_smooth())
grid.arrange(c1, c2, ncol=1)

The ab­solute growth in hu­man pop­u­la­tion and and hence ac­com­plish­ment post-1400 is so dra­matic that it ob­scures ear­lier tem­po­ral vari­a­tions:

Counts of “sig­nifi­cant fig­ures” in Hu­man Ac­com­plish­ment (Mur­ray 2003) by ge­o­graphic re­gion, raw and log-trans­formed

Log-trans­formed, we can still see the in­vert­ed-V shape of Eu­ro­pean counts, but it’s some­what sub­tle be­cause it’s still be­ing squashed by post-1400 in­creases and does leave room for doubt about sam­pling er­ror. Mov­ing on to re­pro­duc­ing the pro­por­tions plot:

## Create proportions by summing per decade, then looping over each group & dividing by total for that decade:
decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=hdcount)
for (i in 1:nrow(hdcount)) {
        total <- decadeTotals[decadeTotals$Fl.decade == hdcount[i,]$Fl.decade,]$Count
        p <- hdcount[i,]$Count / total
        hdcount$Proportion[i] <- if(is.nan(p)) { 0 } else { p }
        hdcount$Total[i] <- total
with(hdcount, qplot(Fl.decade, Proportion, color=Group) + stat_smooth() + coord_cartesian(ylim = c(0, 1)))
Rel­a­tive pro­por­tions of “sig­nifi­cant fig­ures” in Hu­man Ac­com­plish­ment (Mur­ray 2003) by ge­o­graphic re­gion, LOESS-smoothed

We suc­cess­fully re­pro­duce it, mod­ulo the LOESS stan­dard er­rors (which can be dis­abled by adding se=FALSE to stat_smooth()), in­clud­ing the un­wanted non­sen­si­cal ex­trap­o­la­tions. It is pos­si­ble with some tricky gg­plot2 func­tion­al­ity to add in bi­no­mial smooth­ing (a­long with some jit­ter to un­bunch the dat­a­points at the modal 0).

## roughly equivalent to:
# glm(cbind(Count,Total) ~ splines::ns(Fl.decade,3), family="binomial", data=hdcount, subset=Group=="European")
binomial_smooth <- function(...) { geom_smooth(se=FALSE, method = "glm", method.args = list(family = "binomial"), ...) }
with(hdcount, qplot(Fl.decade, Proportion, color=Group) +
    binomial_smooth(formula = y ~ splines::ns(x, 3)) +
    geom_jitter(aes(color=Group), width=0.013,, height=0.013))
Rel­a­tive pro­por­tions of “sig­nifi­cant fig­ures” in Hu­man Ac­com­plish­ment (Mur­ray 2003) by ge­o­graphic re­gion, bi­no­mi­al-s­pline-s­moothed for sen­si­ble ex­trap­o­la­tion

This still does­n’t pro­vide any in­di­ca­tion of sam­pling er­ror un­cer­tain­ty, how­ev­er. Kirkegaard pro­vides one with CIs de­rived from boot­strap­ping, so I will pro­vide some­thing a lit­tle differ­ent: vi­su­al­iz­ing the un­cer­tainty dy­nam­i­cally by graph­ing the smoothed pro­por­tions for each re­sam­ple in an an­i­ma­tion of hun­dreds of boot­strap sam­ples.

So to do this boot­strap, we pack­age up the var­i­ous trans­for­ma­tions from be­fore, so we can sam­ple-with­-re­place­ment the orig­i­nal dataset10, trans­form, and plot re­peat­ed­ly:

transformAndProportion <- function(df) {
    df$Fl.decade <- round(df$Fl, digits=-1)
    dfd <- subset(select=c(Fl.decade, Group), df)
    dfdcount <- aggregate(cbind(Group) ~ Fl.decade+Group, length, data=dfd)
    colnames(dfdcount)[3] <- "Count"
    decades <- seq(-700, 1950, by=10)
    for (i in 1:length(decades)) {
        d <- decades[i]
        if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="European",])==0) {
    dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="European", Count=0))}
        if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Asian",])==0) {
    dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="Asian", Count=0))}
        if (nrow(dfdcount[dfdcount$Fl.decade==d & dfdcount$Group=="Other",])==0) {
    dfdcount <- rbind(dfdcount, data.frame(Fl.decade=d, Group="Other", Count=0))}
    dfdcount <- dfdcount[order(dfdcount$Fl.decade),]
    decadeTotals <- aggregate(Count ~ Fl.decade, sum, data=dfdcount)
    for (i in 1:nrow(dfdcount)) {
        p <- dfdcount[i,]$Count / decadeTotals[decadeTotals$Fl.decade == dfdcount[i,]$Fl.decade,]$Count
        dfdcount$Proportion[i] <- if(is.nan(p)) { 0 } else { p }

bootPlot <- function(df) {
    n <- nrow(df)
    bootDf <- df[sample(1:n, n, replace=TRUE),]
    bootDfdcount <- transformAndProportion(bootDf)
    ## WARNING: can't just call qplot due to old 'animation'/ggplot2 bug; have to assign & 'print'
    p <- with(bootDfdcount, qplot(Fl.decade, Proportion, color=Group) +
        binomial_smooth(formula = y ~ splines::ns(x, 3)) +
        geom_jitter(aes(color=Group), width=0.013,, height=0.013))
saveGIF({for (i in 1:200) { bootPlot(h) }}, interval=0.15, ani.width=1300, ani.height=700,"2003-murray-humanaccomplishment-region-proportions-bootstrap.gif", clean=FALSE)
An­i­ma­tion of re­peat­edly re­sam­pling & plot­ting rel­a­tive pro­por­tions of “sig­nifi­cant fig­ures” in Hu­man Ac­com­plish­ment (Mur­ray 2003) by ge­o­graphic re­gion, demon­strat­ing effects of sam­pling er­ror on pro­por­tions & his­tor­i­cal curves

The boot­strap an­i­ma­tion sug­gests to me that while the very ear­li­est time-pe­ri­ods are opaque and the Dark Ages differ­ence be­tween Eu­rope & Asia may be some­what higher or low­er, the over­all shape does­n’t change mean­ing­ful­ly.

The time-series as­pect of the data on vi­sual in­spec­tion ap­pears to be a sim­ple up­wards, low-order mod­els like ARIMA(1,1,0), ARIMA(1,1,2), or ARIMA(0,1,2); this is prob­a­bly due to the world pop­u­la­tion steadily in­creas­ing while the per capita rates re­main sta­ble.

efit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="European")$Count)
afit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Asian")$Count)
ofit <- auto.arima(subset(hdcount, select=c("Fl.decade", "Count"), Group=="Other")$Count)
plot(forecast(efit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(afit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
plot(forecast(ofit), ylim=c(0,200)); axis(side=1, labels=decades, at=seq(1, length(decades)))
Sim­ple ARIMA time-series fits & fore­casts to 3 global re­gions of “sig­nifi­cant fig­ures” in Hu­man Ac­com­plish­ment

We can com­bine the sam­pling er­ror quan­tifi­ca­tion of full Bayesian pos­te­ri­ors, Pois­son dis­tri­b­u­tion of counts, and time-series as­pects into a sin­gle Bayesian model us­ing as a con­ve­nient in­ter­face to Stan (rather than writ­ing out the full model by hand), with un­in­for­ma­tive pri­ors, and then vi­su­al­ize the pos­te­rior dis­tri­b­u­tion of the pro­por­tions (which it­self is sim­ply a trans­for­ma­tion of the pos­te­ri­or):

b <- brm(Count ~ (1|Group), autocor = cor_bsts(~ Fl.decade | Group), family="zero_inflated_poisson", data = hdcount)

## Rather than use `fitted` to get the 95% CI & compute proportion, it would also be possible to draw samples from
## the posterior for each group/decade, total, calculate per-group proportion, and then summarize into quantiles; but
## that is much slower and requires more finicky code:
posterior <- fitted(b)
hdcount$B.low.prop  <-         posterior[,3] / hdcount$Total
hdcount$B.mean.prop <-         posterior[,1] / hdcount$Total
hdcount$B.high.prop <- pmin(1, posterior[,4] / hdcount$Total)

predframe <- subset(hdcount, select=c("B.low.prop", "B.high.prop"))
with(hdcount, ggplot(hdcount, aes(Fl.decade, Proportion, color=Group)) +
    geom_point() +
    geom_line(data=predframe) +
    geom_ribbon(aes(ymin=B.low.prop, ymax=B.high.prop), alpha=0.05, data=predframe))
Bayesian mul­ti­-level time-series of “sig­nifi­cant fig­ures”; shaded re­gion in­di­cates 95% cred­i­ble in­ter­val around group mean in that decade

The smoothed time-series looks about the same, and the CIs sug­gest, like the boot­strap, that there is great un­cer­tainty early on when pop­u­la­tions are small & sur­viv­ing fig­ures are rare, but that the dark ages dip looks real and the Eu­ro­pean in­creases in pro­por­tion since then are also highly prob­a­ble.

So over­all, cor­rect­ing for the in­fe­lic­i­ties in Kirkegaard’s orig­i­nal graph makes the graph some­what cleaner and is help­ful in pro­vid­ing quan­tifi­ca­tion of un­cer­tain­ty, but none of the prob­lems drove the over­all ap­pear­ance of the curve in the slight­est bit. If the graph is wrong, the is­sues will lie in sys­tem­atic bi­ases in the data it­self—not sta­tis­ti­cal quib­bling over sam­pling er­ror or LOESS curves cross­ing an ax­is. (Com­par­i­son with graphs drawn from other clio­met­ric datasets such as Wikipedia or par­tic­u­larly would be in­for­ma­tive.)

Program for non-spaced-repetition review of past written materials for serendipity & rediscovery: Archive Revisiter

helps one re­mem­ber facts by cre­at­ing dis­crete flash­cards which one tests one­self on at in­creas­ingly dis­tant ‘spaced’ time pe­ri­ods, re­peat­ing the fact just be­fore one prob­a­bly would have for­got­ten it; us­ing soft­ware to track & au­to­mate tests & re­view sched­ul­ing, spaced rep­e­ti­tion can scale to hun­dreds of thou­sands of dis­crete items.

If spac­ing out facts can help one re­mem­ber by re­peat­ing items just be­fore they are for­got­ten, is there any use for an “an­ti-spaced rep­e­ti­tion” with the op­po­site method of re­peat­ing items only after they are prob­a­bly for­got­ten?

I can think of two: first, it could be used to plan by eg track­ing one’s fa­vorite movies of all time and sched­ul­ing a re­watch when­ever one is pre­dicted to have for­got­ten enough to make them novel & highly en­joy­able again. Sec­ond, and more in­ter­est­ing­ly, it could be used as a serendip­ity gen­er­a­tor by al­low­ing effi­cient pro­cess­ing of notes or ex­cerpts or old writ­ings.

In reread­ing such ma­te­ri­als many years lat­er, one often gains a new per­spec­tive or learns some­thing use­ful be­cause one for­got some­thing: one did­n’t un­der­stand some­thing about it at the time, or new ma­te­r­ial has rad­i­cally changed one’s in­ter­pre­ta­tion, and since it’d been for­got­ten, no use could be made of it. Un­for­tu­nate­ly, us­ing spaced rep­e­ti­tion to mem­o­rize such ma­te­ri­al, while en­sur­ing any serendip­i­tous con­nec­tions get made as soon as pos­si­ble, would be rad­i­cally in­fea­si­ble for bulky items (a sin­gle lengthy text ex­cerpt might cor­re­spond to hun­dreds of dis­crete items, quickly over­load­ing even SRS sys­tems) and for al­most all items, use­less. One can jus­tify reread­ing old ma­te­r­ial once or per­haps twice, but not many rereads nor full mem­o­riza­tion. But reread­ing hap­haz­ardly is likely to in­effi­ciently cover some ma­te­r­ial many times while ne­glect­ing oth­ers, and such rereads will often be far too early in time (or—a lesser con­cern here—­too late).

In­stead of spaced rep­e­ti­tion, one would in­stead use an­ti-spaced rep­e­ti­tion: each item would be tracked and re­viewed and its ex­pected for­get­ting time pre­dict­ed, as in spaced rep­e­ti­tion, but in­stead of sched­ul­ing a re­view be­fore for­get­ting, a re­view is sched­uled for some time (prob­a­bly long after­wards) after for­get­ting. The to­tal num­ber of re­views of each item per user life­time would be set to a small num­ber, per­haps 1–4, bound­ing the time con­sump­tion at a fea­si­ble amount.

Such an an­ti-spaced rep­e­ti­tion sys­tem could be used with hun­dreds of thou­sands of notes or clip­pings which a per­son might ac­cu­mu­late over a life­time, and en­able them to in­vest a few min­utes a day into read­ing old notes, oc­ca­sion­ally com­ing up with new in­sights, while en­sur­ing they don’t waste time read­ing notes too many times or read­ing notes they likely al­ready re­mem­ber & have ex­haust­ed.

One rea­son to take notes/­clip­pings and leave com­ments in stim­u­lat­ing dis­cus­sions is to later ben­e­fit by hav­ing ref­er­ences & ci­ta­tions at hand, and grad­u­ally build up an idea from dis­parate threads and make new con­nec­tions be­tween them. For this pur­pose, I make ex­ten­sive ex­cerpts from web pages & doc­u­ments I read into my clip­pings (func­tion­ing as a ), and I com­ment con­stantly on Red­dit/­Less­Wrong/HN etc. While ex­pen­sive in time & effort, I often go back, months or years lat­er, and search for a par­tic­u­lar thing and ex­pand & in­te­grate it into an­other writ­ing or ex­pand it out to an en­tire es­say of its own. (I also value highly not be­ing in the sit­u­a­tion where I be­lieve some­thing but I do not know why I be­lieve it other than the con­vic­tion “I read it some­where, once”.)

This sort of us­ing sim­ple s like Ever­note works well enough when I have a clear mem­ory of what the ci­ta­tion/­fac­toid was, per­haps be­cause it was so mem­o­rable, or when the ci­ta­tions or com­ments are in a nice clus­ter (per­haps be­cause there was a key phrase in them or I kept go­ing back & ex­pand­ing a com­men­t), but it loses out on key ben­e­fits to this pro­ce­dure: serendip­ity and per­spec­tive.

As time pass­es, one may re­al­ize the im­por­tance of an odd tid­bit or have ut­terly for­got­ten some­thing or events con­sid­er­ably changed its mean­ing; in this case, you would ben­e­fit from re­vis­it­ing & reread­ing that old bit & ex­pe­ri­enc­ing an “aha!” mo­ment, but you don’t re­al­ize it. So one thing you could do is reread all your old clip­pings & com­ments, ap­prais­ing them for reuse.

But how often? And it’s a pain to do so. And how do you keep track of which you’ve al­ready read? One thing I do for my emails is semi­-an­nu­ally I (try to) read through my pre­vi­ous 6 months of email to see what might need to be fol­lowed up on11 or mined for in­clu­sion in an ar­ti­cle. (For ex­am­ple, an ig­nored re­quest for data, or a dis­cus­sion of dark­net mar­kets with a jour­nal­ist I could ex­cerpt into one of my DNM ar­ti­cles so I can point fu­ture jour­nal­ists at that in­stead.) This is al­ready diffi­cult, and it would be even harder to ex­pand. I have read through my Less­Wrong com­ment his­to­ry… once. Years ago. It would be more diffi­cult now. (And it would be im­pos­si­ble to read through my Red­dit com­ments as the in­ter­face only goes back ~1000 com­ments.)

Sim­ply re-read­ing pe­ri­od­i­cally in big blocks may work but is sub­op­ti­mal: there is no in­ter­face eas­ily set up to reread them in small chunks over time, no con­straints which avoid far too many reads, nor is there any way to re­move in­di­vid­ual items which you are cer­tain need never be re­viewed again. Re­view­ing is use­ful but can be an in­defi­nite timesink. (My sent emails are not too hard to re­view in 6-month chunks, but my IRC logs are bad—7,182,361 words in one chan­nel alone—and my >38k Ever­note clip­pings are worse; any will ex­ac­er­bate the prob­lem by or­ders of mag­ni­tude.) This is prob­a­bly one rea­son that peo­ple who keep jour­nals or di­aries don’t reread Nor can it be crowd­sourced or done by sim­ply rank­ing com­ments by pub­lic up­votes (in the case of Red­dit/L­W/HN com­ments), be­cause the most pop­u­lar com­ments are ones you likely re­mem­ber well & have al­ready used up, and the odd­i­ties & serendip­i­ties you are hop­ing for are likely un­rec­og­niz­able to out­siders.

This sug­gests some sort of re­view­ing frame­work where one sys­tem­at­i­cally re­views old items (sent emails, com­ments, IRC logs by one­self), putting in a con­stant amount of time reg­u­larly and us­ing some sort of ever ex­pand­ing in­ter­val be­tween re-reads as an item be­comes ex­hausted & ever more likely to not be help­ful. Sim­i­lar to the log­a­rith­mi­cal­ly-bounded num­ber of back­ups re­quired for in­defi­nite sur­vival of data (), “De­con­struct­ing Death­is­m—An­swer­ing Ob­jec­tions to Im­mor­tal­ity”, Mike Perry 2013 (note: this is an en­tirely differ­ent kind of prob­lem than those con­sid­ered in Free­man Dyson’s im­mor­tal in­tel­li­gences in In­fi­nite in All Di­rec­tions, which are more fun­da­men­tal), dis­cusses some­thing like what I have in mind in terms of an im­mor­tal agent try­ing to re­view its mem­o­ries & main­tain a sense of con­ti­nu­ity, point­ing out that if time is al­lo­cated cor­rect­ly, it will not con­sume 100% of the agen­t’s time but can be set to con­sume some bounded frac­tion:

It seems rea­son­able that past ver­sions of the self would “sur­vive” as we re­mem­ber the events of times past, that is to say, our episodic mem­o­ries, and this would have im­por­tance in our con­tin­u­ing to per­sist as what could be con­sid­ered the “same” al­beit also a chang­ing, de­vel­op­ing per­son. But in ad­di­tion to this mnemonic re­in­force­ment I imag­ine there would be a more gen­eral feel­ing of be­ing a par­tic­u­lar in­di­vid­u­al, an “am­biance” de­rived from but not re­fer­ring to any spe­cific past ex­pe­ri­ences. Am­biance alone would not be suffi­cient, I think, to make us who we are; episodic mem­o­ries would also be nec­es­sary, yet it could con­sid­er­ably lessen the need for fre­quent re­call and thus al­le­vi­ate the prob­lem of di­lu­tion.

An­other in­ter­est­ing thought is that cer­tain items might con­sis­tently be con­sulted more fre­quently than oth­ers. (In­deed, would this not be ex­pect­ed?) In this way it would ac­tu­ally be pos­si­ble to by­pass the di­lu­tion effect and in­stead al­low a fixed frac­tion of time for pe­rusal of any given item, even as more items were added in­defi­nite­ly. A sim­ple way of do­ing this could be first to al­low some fixed frac­tion of the time for day-to-day affairs and other non-archival work (“prime time”), and spend the rest of the time on pe­rusal of per­sonal archives (“archive time”). The ex­act ap­por­tion­ing of prime ver­sus archive time is not im­por­tant here, but it will be in­struc­tive to con­sider how the archive time it­self might be sub­di­vid­ed. A sim­ple, if overly sim­plis­tic, strat­egy would be to have half this time de­voted to the first cen­tu­ry’s records, half the re­main­der to the sec­ond cen­tu­ry, and so on. (S­ince there would only be a fi­nite num­ber of cen­turies, there would be some un­used archive time at the end, which could be spent as de­sired. Note, how­ev­er, that in the limit of in­fi­nite to­tal time cov­er­ing in­fi­nitely many cen­turies, the us­age of archive time would ap­proach but not ex­ceed 100%.) In this way, then, there would be a fixed frac­tion of archive time, , spent on the _n_th cen­tu­ry’s records, re­gard­less of how many cen­turies be­yond the nth were lived or how many records ac­cu­mu­lat­ed. True, this way of ap­por­tion­ing time might not be much good be­yond a few cen­turies; only about one tril­lionth the to­tal time would be spent on the 40th cen­tu­ry, for in­stance, around 1⁄300 sec per 100 years. (Pos­si­bly a lot could be cov­ered even in this brief in­ter­val of about 3 mil­lion nanosec­onds, how­ev­er.) But the ap­por­tion­ment scheme could be ad­just­ed.

A more in­ter­est­ing and plau­si­ble, if slightly hard­er-to-de­scribe scheme would be to choose a con­stant and al­low the frac­tion to the _n_th-cen­tury records. It is easy to show that the time for all cen­turies will add up to 100% as be­fore, what­ever pos­i­tive value of c we start with. Start­ing with will get 10% of the to­tal time spent on the first cen­tu­ry, with sub­se­quent cen­turies re­ceiv­ing a di­min­ish­ing share as be­fore, but the rate of falloff will be much slow­er, so that the 40th cen­tury will still re­ceive 0.4%, or about 5 months per 100 years, that is to say, 240 mil­lion nanosec­onds per minute. If we sup­pose that our im­mor­tal set­tles even­tu­ally into a rou­tine in which 10% of the time over­all is archive time, there would be 24 mil­lion nanosec­onds avail­able each minute of life for the 40th cen­tu­ry’s mem­o­ries alone, if de­sired, with many other cen­turies get­ting more or less com­pa­ra­ble or greater amounts of at­ten­tion, and none omit­ted en­tire­ly. This, I think, makes at least a plau­si­ble case that a rea­son­able sense of one’s per­sonal iden­tity could be sus­tained in­defi­nite­ly.

In the above ex­am­ples the great­est pro­por­tion of archive time falls to the ear­lier records, which might be fit­ting since these should be the most im­por­tant as for­ma­tive years for the prospec­tive im­mor­tal, thus the most im­por­tant for iden­tity main­te­nance. (Mem­ory re­call would also nat­u­rally oc­cur dur­ing prime time; the em­pha­sis here could be on re­cent events, to main­tain a bal­ance over­al­l.) In sum­ma­ry, then, we have con­sid­ered ways that the prob­lem of di­lu­tion might be suc­cess­fully man­aged. Rel­a­tively in­fre­quent pe­rusal of mem­o­ries might still suffice to main­tain the nec­es­sary con­ti­nu­ity with past ver­sions of the self, or proper sched­ul­ing could sta­bi­lize the fre­quency of re­call and by­pass the di­lu­tion effect, or both. We see in any case that the prob­lem is not what it may seem at first sight. We have no guar­an­tee, of course, that it would not get out of bounds, but cer­tainly some grounds for hope.

So you could imag­ine some sort of soft­ware along the lines of like Anki/M­nemosyne/­Su­per­memo which you spend, say, 10 min­utes a day at, sim­ply reread­ing a se­lec­tion of old emails you sent, lines from IRC with n lines of sur­round­ing con­text, Red­dit & LW com­ments etc; with an ap­pro­pri­ate back­off & time-curve, you would reread each item maybe 3 times in your life­time (eg first after a de­lay of a mon­th, then a year or two, then decades). Each item could come with a rat­ing func­tion where the user rates it as an im­por­tant or odd­-seem­ing or in­com­plete item and to be ex­posed again in a few years, or as to­tally ir­rel­e­vant and not to be shown again—as for many bits of idle chit-chat, mun­dane emails, or in­tem­per­ate com­ments is not an in­stant too soon! (More pos­i­tive­ly, any­thing al­ready in­cor­po­rated into an es­say or oth­er­wise reused likely does­n’t need to be resur­faced.)

This would­n’t be the same as a spaced rep­e­ti­tion sys­tem which is de­signed to re­call an item as many times as nec­es­sary, at the brink of for­get­ting, to en­sure you mem­o­rize it; in this case, the for­get­ting curve & mem­o­riza­tion are ir­rel­e­vant and in­deed, the pri­or­ity here is to try to elim­i­nate as many ir­rel­e­vant or use­less items as pos­si­ble from show­ing up again so that the re­view does­n’t waste time.

More specifi­cal­ly, you could imag­ine an in­ter­face some­what like Mutt which reads in a list of email files (my lo­cal POP email archives down­loaded from Gmail with getmail4, file­name ID­s), chunks of IRC di­a­logue (a grep of my IRC logs pro­duc­ing lines writ­ten by me ±10 lines for con­text, hashes for ID), LW/Red­dit com­ments down­loaded by ei­ther scrap­ing or API via the Big­Query copy up to 2015, and stores IDs, re­view dates, and scores in a data­base. One would use it much like a SRS sys­tem, read­ing in­di­vid­ual items for 10 or 20 min­utes, and rat­ing them, say, up­vote (‘this could be use­ful some­day, show me this ahead of sched­ule in the fu­ture’) / down­vote (push this far off into the fu­ture) / delete (n­ever show again). Items would ap­pear on an ex­pand­ing sched­ule. For ex­am­ple if one wanted to re­view items 4 times over the next 50 years (roughly my life ex­pectan­cy), a sched­ule might be:

round({t=0:4; t^6.981})
# [1]     0     1   126  2142 15958

So in 1 day, then a third of a year, then after 5.8 years, then after 43 years. Al­ter­nate­ly, a geo­met­ric se­ries might be a bit kinder and not too fron­t-load­ed:

review <- function(n, r, a) { a * (1 - r^n) / (1 - r) }
reviews <- function(n, r, a) { sapply(1:n, function(nn) { review(nn, r, a) }) }
findR <- function (firstReview=31, n_total=3, years=50) {  optimize(interval=c(0, 1000),
    f = function(r) { abs(sum(sapply(1:n_total,
         function(n){review(n, a=firstReview, r=r)})) - (365*years)) })$minimum }
findR(firstReview=30, n_total=4, years=50)
# [1] 7.728823216
round(reviews(4, 7.728823216, 30))
# [1]    30   262  2054 15904

The geo­met­ric se­ries al­lows for easy in­cor­po­ra­tion of rat­ing mod­i­fi­ca­tions: a down­vote penalty might mul­ti­ply r by 1.5, vs 0.5 for up­votes. This would also al­low some in­put from sta­tis­ti­cal al­go­rithms which pre­dict up­vote/­down­vote/delete and ad­vances/de­lays items based on that, which would hope­fully quickly learn to avoid idle chit-chat and short per­for­ma­tive ut­ter­ances and start to pri­or­i­tize more in­ter­est­ing & un­usual items. (For ex­am­ple, a good start might be a SVM on a bag-of-words ver­sion of each item’s text, and then as the dataset rat­ings ex­pand, more com­pli­cated al­go­rithms could be plugged in.)

As far as I know, some to-do/­self-help sys­tems have some­thing like a pe­ri­odic re­view of past stuff, and as I men­tioned, spaced rep­e­ti­tion sys­tems do some­thing some­what sim­i­lar to this idea of ex­po­nen­tial re­vis­its, but there’s noth­ing like this at the mo­ment.

On the value of new statistical methods

Ge­netic cor­re­la­tion re­search is a hot area in 2016-2017: passed 400 ref­er­ences in May 2017. What is par­tic­u­larly in­ter­est­ing ref­er­ence-wise is that pub­li­ca­tions 2015-2017 make up around half of the re­sults: so more ge­netic cor­re­la­tions cal­cu­lated in the past 3 years than in the pre­vi­ous 80 years since first es­ti­mates were made some­where in the 1930s or so.

For cal­cu­lat­ing them, there are 3 main meth­ods.

  1. twin reg­istry stud­ies re­quire twin phe­no­typic mea­sure­ments which can usu­ally be col­lected by mailed sur­veys and to an­a­lyze them one com­putes some Pear­son’s r or uses a stan­dard SEM with ad­di­tional co­vari­ance paths (doable with Wright’s path analy­sis back in the 1930s by hand), scal­ing roughly lin­early with sam­ple size, hav­ing ex­cel­lent sta­tis­ti­cal power at a few hun­dred twin pairs and cap­tur­ing full her­i­tabil­i­ties
  2. for GCTA, one re­quires full raw SNP data on 5000+ un­re­lated in­di­vid­u­als at $100+ a sam­ple, along with si­mul­ta­ne­ous phe­no­typic mea­sure­ments of both traits and must use com­pli­cated cus­tom soft­ware whose com­pu­ta­tion scales ex­po­nen­tially and can only ex­am­ine a nar­row sub­set of her­i­tabil­ity
  3. for LDSC, one re­quires pub­lic sum­mary poly­genic scores but they can be from sep­a­rate GWASes and cal­cu­lated on traits in­di­vid­u­al­ly, and the com­pu­ta­tional com­plex­ity is closer to lin­ear than ex­po­nen­tial; the penalty for not need­ing raw SNP data from twice-mea­sured in­di­vid­u­als is that SNP costs dou­ble or more since mul­ti­ple GWASes are used, and LDSC even more in­effi­cient than GCTA, so you’ll need >10,000 in­di­vid­u­als used in each poly­genic score, and still need cus­tom soft­ware.

In other words, the twin method is old, sim­ple, re­quires small sam­ple sizes, and eas­ily ob­tained phe­no­typic mea­sure­ments; while GCTA/LDSC is new, com­pli­cat­ed, and re­quires ex­pen­sive novel ge­netic se­quenc­ing data in huge sam­ple sizes as well as the phe­no­typic mea­sure­ments. So nat­u­rally LDSC gets used an or­der of mag­ni­tude more! Look­ing at the bib­li­og­ra­phy, we can guessti­mate the rates at twin: 1 pa­per/year; GCTA (re­quir­ing raw data), 10/year; LDSC (pub­lic sum­mary stat­s), 100/year.

Amaz­ing the differ­ence meth­ods can make. It’s all about data ac­cess. For all its dis­ad­van­tages, LDSC sta­tis­ti­cally works around the lack of in­di­vid­u­al-level raw data and makes do with the data that gets pub­licly re­leased be­cause it is not seen to vi­o­late ‘pri­vacy’ or ‘bioethics’, so any re­searcher can make use of the method on their par­tic­u­lar dataset, while twin and GCTA re­quire in­di­vid­u­al-level data which is jeal­ously guarded by the own­ers.

Method­ol­o­gists and sta­tis­ti­cians are prob­a­bly se­ri­ously un­der­val­ued: a good new method can cause a rev­o­lu­tion.

Bayesian power analysis: probability of exact replication

Psy­chol­o­gist Michael Kane men­tions:

TFW a cor­re­la­tion of in­ter­est in a new study (n = 355) repli­cates that from a prior study (n = 182) to the sec­ond dec­i­mal (r = 0.23). Win­ning!

Turn­ing up the same cor­re­la­tion twice is some­what sur­pris­ing be­cause ran­dom sam­pling er­ror will vary sub­stan­tially the em­pir­i­cal cor­re­la­tion from sam­ple to sam­ple, as re­flected by the wide cred­i­ble in­ter­vals around r with n = 182-355. How sur­pris­ing is it? Is it too good to be true?

One ap­proach would be to ask, if we gen­er­ated bi­vari­ate sam­ples of size n = 355 with a fixed re­la­tion­ship of r = 0.23, how often would the sam­ples gen­er­ate a rounded es­ti­mate of =0.23?

powerSim <- function (r_gen, n, r_test=NA) {
    data <- mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r_gen, r_gen, 1), nrow=2))
    r_est = cor.test(data[,1], data[,2])$estimate
    if ( { r_test <- r_gen }
    return(round(r_test, digits=2) == round(r_est, digits=2))
powerSims <- function(r, n, r_test=NA, iters=100000) {
 sim <- replicate(iters, powerSim(r,n, r_test=r_test))
 return(sum(sim) / length(sim))
powerSims(0.23, 355)
# [1] 0.07798

So around 8% of the sam­ples.

This treats =r = 0.23 as a pa­ra­me­ter known with in­fi­nite pre­ci­sion, rather than an es­ti­mate (us­ing around half the data) of the un­known pa­ra­me­ter r; there would be con­sid­er­able pos­te­rior un­cer­tainty about what r is, and this will affect how often two sam­ples would yield the same es­ti­mate—if the true r was, say, 0.10 (as is en­tirely pos­si­ble), it would be highly un­likely for the sec­ond sam­ple to yield =0.23 again, be­cause the over­es­ti­ma­tion fluke would have to re­peat it­self twice to yield both =0.23.

To in­cor­po­rate the un­cer­tain­ty, we can feed in a sim­u­lated dataset ex­actly match­ing the de­scrip­tion of n = 182/r = 0.23 to an un­in­for­ma­tive Bayesian mod­el, cal­cu­late a pos­te­rior dis­tri­b­u­tion over r (which gives CIs of 0.09-0.37), and then draw from the pos­te­rior pos­si­ble _r_s and run the orig­i­nal sim­u­la­tion ask­ing how often we re­cover =0.23.

n1 = 182
n2 = 355
r1 = 0.23
data1 = mvrnorm(n=n1, mu=c(0, 0), Sigma=matrix(c(1, r1, r1, 1), nrow=2), empirical=TRUE)
colnames(data1) <- c("x", "y")
b1 <- brm(y ~ x, iter=20000, data=data1); summary(b1)
# ...Population-Level Effects:
#           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept     0.00      0.07    -0.14     0.14      40000    1
# x             0.23      0.07     0.09     0.37      40000    1
posteriorX <- fixef(b1, summary=FALSE)[,2]; summary(posteriorX)
#        Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
# -0.08060884  0.18061570  0.23010870  0.22999820  0.27916800  0.55868700
replicates <- sapply(posteriorX, function(r_post) { powerSim(r_post, n2, r_test=r1) })
summary(replicates); mean(replicates)
#    Mode   FALSE    TRUE    NA's
# logical   38262    1738       0
# [1] 0.04345

Around 4% of the time, re­flect­ing the in­creased im­prob­a­bil­ity of true val­ues like r = 0.09 or r = 0.37 pro­duc­ing the spe­cific sam­ple es­ti­mate of =0.23

Thus, ob­serv­ing the same sum­mary sta­tis­tics in even rel­a­tively large sam­ples is some­what sus­pi­cious and might be a good rea­son to dou­ble-check other as­pects of the code & da­ta.

Expectations are not expected deviations and large number of variables are not large samples

If one has a large num­ber of vari­ables with a cer­tain ex­pec­ta­tion, it is tempt­ing to in­ter­pret the or or as im­ply­ing that the sum of a large num­ber of vari­ables or after a large num­ber of timesteps, the ob­served sam­ple value will be close or iden­ti­cal to the ex­pected val­ue. So for coin-flip­ping, one knows that flip­ping 10 coins could eas­ily yield a large de­vi­a­tion like a sum of 9 heads in­stead of the ex­pected 5 heads, but one then thinks that after a mil­lion coin flips, the sum of heads will prob­a­bly be 500,000. An­other ex­am­ple of this mis­take might be to make ar­gu­ments about sci­en­tific re­search or char­i­ties: “char­i­ta­ble in­ter­ven­tion X is affected by hun­dreds or thou­sands of differ­ent vari­ables and the ben­e­fits or costs un­fold over long time pe­ri­ods like decades or cen­turies; our best es­ti­mate of the mean value of in­ter­ven­tions like X is that it is some small value Y; thus, by CLT etc, we can be sure that X’s ul­ti­mate value will be nei­ther much big­ger nor much smaller than Y but very close to Y, and, par­tic­u­lar­ly, we can be sure that there are no in­ter­ven­tions like X which could pos­si­bly turn out to have ul­ti­mate val­ues which are or­ders of mag­ni­tude larger or smaller than Y, so we can rule out any such claims and we know the Value of In­for­ma­tion is small.”

This is not wrong so much as mis­un­der­stood: one might call it a con­fu­sion of the vari­able’s dis­tri­b­u­tion with the sam­pling dis­tri­b­u­tion. The value only be­comes closer in a rel­a­tive sense; in an ab­solute sense, as more vari­ables are added—with­out the ab­solute mag­ni­tude of each shrink­ing lin­ear­ly—the ac­tual de­vi­a­tion from the ex­pec­ta­tion sim­ply be­comes larger and larg­er. (Like : the ex­pec­ta­tion is the same as the cur­rent val­ue, but the vari­ance in­creases with time.)

As in dis­cussing how “di­ver­si­fi­ca­tion” works, it is a mis­take to think that one ‘di­ver­si­fies’ one’s in­vest­ments by adding ad­di­tional in­vest­ments of the same size; for any vari­ance re­duc­tion, the to­tal in­vest­ment must in­stead be split up among ever more differ­ent in­vest­ments as many small in­vest­ments:

In gen­er­al, the pres­ence of more as­sets in a port­fo­lio leads to greater di­ver­si­fi­ca­tion ben­e­fits, as can be seen by con­sid­er­ing port­fo­lio vari­ance as a func­tion of n, the num­ber of as­sets. For ex­am­ple, if all as­sets’ re­turns are mu­tu­ally un­cor­re­lated and have iden­ti­cal vari­ances , port­fo­lio vari­ance is min­i­mized by hold­ing all as­sets in the equal pro­por­tions .[Samuel­son, Paul, “Gen­eral Proof that Di­ver­si­fi­ca­tion Pays”, Jour­nal of Fi­nan­cial and Quan­ti­ta­tive Analy­sis 2, March 1967, 1-13.] Then the port­fo­lio re­turn’s vari­ance equals = = , which is mo­not­o­n­i­cally de­creas­ing in n.

The lat­ter analy­sis can be adapted to show why adding un­cor­re­lated volatile as­sets to a port­fo­lio, [see Samuel­son, Paul, “Risk and un­cer­tain­ty: A fal­lacy of large num­bers”, Sci­en­tia 98, 1963, 108-113.] [Ross, Stephen, “Adding risks: Samuel­son’s fal­lacy of large num­bers re­vis­ited”, Jour­nal of Fi­nan­cial and Quan­ti­ta­tive Analy­sis 34, Sep­tem­ber 1999, 323-339.] thereby in­creas­ing the port­fo­lio’s size, is not di­ver­si­fi­ca­tion, which in­volves sub­di­vid­ing the port­fo­lio among many smaller in­vest­ments. In the case of adding in­vest­ments, the port­fo­lio’s re­turn is in­stead of and the vari­ance of the port­fo­lio re­turn if the as­sets are un­cor­re­lated is which is in­creas­ing in n rather than de­creas­ing. Thus, for ex­am­ple, when an in­sur­ance com­pany adds more and more un­cor­re­lated poli­cies to its port­fo­lio, this ex­pan­sion does not it­self rep­re­sent di­ver­si­fi­ca­tion—the di­ver­si­fi­ca­tion oc­curs in the spread­ing of the in­sur­ance com­pa­ny’s risks over a large num­ber of part-own­ers of the com­pa­ny.

Samuel­son’s 1963 “Risk and un­cer­tain­ty: A fal­lacy of large num­bers” opens by re­count­ing an anec­dote:

Is it true that an in­sur­ance com­pany re­duces its risk by dou­bling the num­ber of ships it in­sures?

…a few years ago I offered some lunch col­leagues to bet each $200 to $100 that the side of a coin they spec­i­fied would not ap­pear at the first toss. One dis­tin­guished schol­ar—who lays no claim to ad­vanced math­e­mat­i­cal skill­s—­gave the fol­low­ing an­swer:

I won’t bet be­cause I would feel the $100 loss more than the $200 gain. But I’ll take you on if you promise to let me make 100 such bets.

What was be­hind this in­ter­est­ing an­swer? He, and many oth­ers, have given some­thing like the fol­low­ing ex­pla­na­tion. “One toss is not enough to make it rea­son­ably sure that the law of av­er­ages will turn out in my fa­vor. But in a hun­dred tosses of a coin, the law of large num­bers will make it a darn good bet. I am, so to speak, vir­tu­ally sure to come out ahead in such a se­quence, and that is why I ac­cept the se­quence while re­ject­ing the sin­gle toss.”

4. Max­i­mum Loss And Prob­a­ble Loss.—What are we to think about this an­swer?

…First­ly, when an in­sur­ance com­pany dou­bles the num­ber of ships it in­sures, it does also dou­ble the range of its pos­si­ble losses or gains. (This does not deny that it re­duces the prob­a­bil­ity of its loss­es.) If at the same time that it dou­bles the pool of its risks, it dou­bles the num­ber of its own­ers, it has in­deed left the max­i­mum pos­si­ble loss per owner un­changed; but—and this is the germ of truth in the ex­pres­sion “there is safety in num­bers”—the in­sur­ance com­pany has now suc­ceeded in re­duc­ing the prob­a­bil­ity of each loss; the gain to each owner now be­comes a more cer­tain one.

In short, it is not so much by adding new risks as by sub­di­vid­ing risks among more peo­ple that in­sur­ance com­pa­nies re­duce the risk of each. To see this, do not dou­ble or change at all the orig­i­nal num­ber of ships in­sured by the com­pa­ny: but let each owner sell half his shares to each new own­er. Then the risk of loss to each owner per dol­lar now in the com­pany will have in­deed been re­duced.

Un­doubt­edly this is what my col­league re­ally had in mind. In re­fus­ing a bet of $100 against $200, he should not then have spec­i­fied a se­quence of 100 such bets. That is adding risks. He should have asked to sub­di­vide the risk and asked for a se­quence of 100 bets, each of which was 100th as big (or $1 against $2).

In the in­sur­ance ex­am­ple, ships do not change their in­sured value (y­our li­a­bil­i­ty) sim­ply be­cause you in­sure more of them, and they cer­tainly do not halve in value sim­ply be­cause you have de­cided to take more in­sur­ance. If a ship is worth $1m with an ex­pected profit of $10k and a SD of profits of $5k, when you in­sure 1000 ships, your profit is dis­trib­uted as , and when you dou­ble it, now the dis­tri­b­u­tion is —ab­solute size of your fluc­tu­a­tions has in­creased, not de­creased. As a per­cent­age, it has gone down in­deed, but the ab­solute size has still gone up. Sim­i­lar­ly, by claim­ing to pre­fer an in­vest­ment of 100 bets and putting not $200 but $20,000 at risk, the col­league has raised the stakes great­ly, and if the prospect of fluc­tu­a­tions of $100 un­set­tled his stom­ach be­fore, he will en­joy less the SD of ~$1500 from the pro­posed 100 coin­flips (95% quan­tiles of $2000 to $8000) and even a ~1% chance of a loss such as -$400; he has re­duced the risk of any loss, yes, and most of the out­comes are in­deed rel­a­tively closer to the ex­pec­ta­tion than with just 1 coin­flip, but he has in­ad­ver­tently re­placed a small SD of $150 with the much larger one of $1500, and of course, his worst case sce­nario has got­ten much worse—with just one coin flip he could never have lost $400 or more, but now he can and oc­ca­sion­ally would.

An­other in­ter­est­ing ex­am­ple comes from Cav­al­li-S­forza’s 1971 The Ge­net­ics of Hu­man Pop­u­la­tions, where he ar­gues that a pri­ori, differ­ences be­tween hu­man races on com­plex traits (in­tel­li­gence in this case), in the ab­sence of se­lec­tion, can­not ex­ist solely based on the large num­ber of genes con­tribut­ing to them, by the law of large num­bers:

In this con­text, it is worth men­tion­ing that Jensen states that be­cause the gene pools of whites and blacks are known to differ and “these ge­netic differ­ences are man­i­fested in vir­tu­ally every anatom­i­cal, phys­i­o­log­i­cal, and bio­chem­i­cal com­par­i­son one can make be­tween rep­re­sen­ta­tive sam­ples of iden­ti­fi­able racial groups” there­fore “there is no rea­son to sup­pose that the brain should be ex­empt from this gen­er­al­iza­tion.” There is, how­ev­er, no rea­son why genes affect­ing IQ which differ in fre­quency in the gene pools of blacks and whites, should be such that, on the av­er­age, whites would have sig­nifi­cantly higher fre­quen­cies of genes in­creas­ing IQ than would blacks. On the con­trary, one should ex­pect, as­sum­ing no ten­dency for high IQ genes to ac­cu­mu­late by se­lec­tion in one or other race, that the more poly­mor­phic genes there are that affect IQ and that differ in fre­quency in blacks and whites, the less likely it is that there is an av­er­age ge­netic differ­ence in LQ be­tween the races. This fol­lows from that most ba­sic law of sta­tis­tics, the law of large num­bers, which pre­dicts in­creas­ing ac­cu­racy of a mean based on in­creas­ing num­bers of ob­ser­va­tions (See Ap­pen­dix I).

Cav­al­li-S­forza is cor­rect to note that, un­less we wish to make (still) very con­tro­ver­sial claims about differ­ing se­lec­tion, there is no ap­par­ent rea­son for any in­tel­li­gence-re­lated al­lele to be sys­tem­at­i­cally rarer in one pop­u­la­tion than an­oth­er, al­though of course they will in prac­tice differ slightly due to ran­dom chance (, de­mo­graphic bot­tle­necks etc), and that as this ap­plies to all in­tel­li­gence-re­lated al­le­les, we would pre­dict that the ex­pec­ta­tion of all pop­u­la­tions to be iden­ti­cal. He, how­ev­er, com­mits the fal­lacy of large num­bers when he then in­ter­prets the law of large num­bers as guar­an­tee­ing that all pop­u­la­tions will be iden­ti­cal, while in fact, they will be differ­ent, and the ab­solute size of the differ­ences will in­crease—not de­crease—“the more poly­mor­phic genes there are that affect IQ”. Like with di­ver­si­fi­ca­tion or in­sur­ance or coin­flip­ping, the de­sired dis­ap­pear­ance of the vari­ance only hap­pens if each ran­dom vari­able (in­vest­ment, in­sured ves­sel, coin­flip) de­creases pro­por­tional to the to­tal num­ber of ran­dom vari­ables; for Cav­al­li-S­forza’s ar­gu­ment to go through, it would need to be the case that every new IQ gene divvied up a fixed pie—but why would that be the case and how could that be known a pri­ori? More point­ed­ly, we could note that Cav­al­li-S­forza’s ar­gu­ment proves too much be­cause it is equally ap­plic­a­ble within races too, and im­plies that there could be no differ­ences of im­por­tant mag­ni­tude be­tween hu­mans of the same race on highly poly­genic traits (and if some­one wanted to try to res­cue the ar­gu­ment by claim­ing we should ex­pect mean differ­ences or some sort of se­lec­tion, then amend “hu­mans of the same race” to “sib­lings in the same fam­ily”!). Cav­al­li-S­forza’s er­ror is par­tic­u­larly strik­ing since ch8 of the same book de­votes ex­ten­sive dis­cus­sion, with many graphs, to how ge­netic drift will greatly differ­en­ti­ate pop­u­la­tions over time, and takes pains to point out that the ex­pec­ta­tion of gene fre­quen­cies p is merely the ex­pec­ta­tion, and the ac­tual fre­quency will di­verge ar­bi­trar­ily far in a ran­dom walk and over a long enough time (con­nected to the pop­u­la­tion size) will even­tu­ally reach fix­a­tion at ei­ther p = 1 or p = 0, and cer­tainly not con­verge ex­actly on the orig­i­nal p.12 In­deed, given the a pos­te­ri­ori mea­sured ex­tent of av­er­age of ~0.12 and the as­sump­tion of no se­lec­tion, Chuck flips Cav­al­li-S­forza’s ar­gu­ment on its head and points out that the stan­dard ge­netic drift frame­work (Leinonen et al 2013) dat­ing back to Wright im­plies that a differ­ence of 0.12 yields large racial differ­ences in poly­genic traits! Clear­ly, Cav­al­li-S­forza’s ar­gu­ment does not prove what one would like it to.

The sum is not the mean: there is a differ­ence be­tween flip­ping 1 coin a thou­sand or a mil­lion times, the sum of a thou­sand or a mil­lion coins flipped 1 time, and the mean of a thou­sand or a mil­lion sums of a mil­lion coins; the dis­tri­b­u­tion of the mean does in­deed con­verge tight­ly, but the dis­tri­b­u­tion of the sum just gets broader and broad­er. The ex­pec­ta­tion or mean is just the best es­ti­mate one can make over a large num­ber of sam­ples un­der a par­tic­u­lar loss. But the long-run mean of many sam­ples is not the same thing as the ex­pected de­vi­a­tion of a sin­gle sam­ple.

One might be deal­ing with some­thing like a where there is not a mean in the first place. But more im­por­tant­ly, just be­cause the ex­pec­ta­tion is a cer­tain num­ber like 0, does­n’t mean any spe­cific re­al­iza­tion will be 0, and in­deed the ex­pec­ta­tion may ac­tu­ally be an im­pos­si­ble val­ue. (Imag­ine a vari­able X which is the sum of an odd num­ber of -1 or +1 vari­ables; the ex­pec­ta­tion of this X is, of course, 0, how­ev­er, one will never ac­tu­ally ob­serve a sam­ple of X to be 0, be­cause the par­ity means there will al­ways be a -1 or +1 ‘left over’ in­side the sum and so X will al­ways be ei­ther -1 or +1. To reach the ex­pec­ta­tion of 0, one would have to cre­ate many _X_s and av­er­age them, and the more _X_s one draws and av­er­ages, the closer the av­er­age will be to 0.)

For Gaus­sians, the sin­gle-sam­ple of a set of Gauss­ian vari­ables is the sum of their means with a vari­ance equal to the sum of their vari­ances (); if we sam­pled re­peat­edly and av­er­aged, then we would in­deed con­verge on the ex­pected mean, but the ex­pected de­vi­a­tion from the mean of a sin­gle sam­ple is gov­erned by the vari­ance which can be ex­tremely large. The in­creas­ing spread means that it would be ex­tremely sur­pris­ing to get ex­act­ly, or even near, the ex­pec­ta­tion. An ex­am­ple Monte Carlo of the sum of in­creas­ing num­bers of 𝒩(0,1) de­vi­ates demon­strates the ab­solute de­vi­a­tion in­creases as we go from sum of 100 vari­ables to 10,000 vari­ables:

round(sapply(seq(100,10000,by=100), function(x) { mean(replicate(10000, abs(sum(rnorm(x))))) } ))
#  [1]  8 11 14 16 18 20 21 23 24 25 26 27 29 30 31 32 33 34 35 35 37 38 38 39 40 41 41 42 43 44 45 45 46 46 47 48 48 49 50 50 51 52 52
# [44] 53 53 54 55 56 56 56 57 58 58 58 59 59 61 61 62 62 62 62 64 63 64 65 65 65 67 66 68 67 68 69 70 70 70 70 71 71 72 72 73 73 74 75
# [87] 74 75 75 75 76 76 77 78 78 78 78 80 80 80

Or con­sider a 1D . The best es­ti­mate one can make un­der to­tal ig­no­rance of where the walker is, is to guess, re­gard­less of how many n steps it has made, that the walker is at the orig­in: 0. If we run many ran­dom walk­ers for n steps each and ask what the best pre­dic­tion of the mean is, we would be right in say­ing it’s 0. We would not be right in say­ing that the walk­ers have not moved far or that we would ex­pect them all to ‘con­verge’ and be at or at least very near 0. How­ev­er, if one asks, what is the ex­pected dis­tance from the ori­gin after n steps, the an­swer turns out to be ie. the more steps tak­en, the fur­ther we ex­pect to find the walk­er, even if we can­not pre­dict in what di­rec­tion it has gone on av­er­age. Sim­i­lar­ly, for a Gauss­ian ran­dom walk, we find that after n steps the walker will av­er­age a dis­tance of from the ori­gin (and pos­si­bly much fur­ther).

Fur­ther, should there be any ab­sorb­ing states, we may find that our walk­ers will not be at their ex­pec­ta­tion but at the ab­sorber—an ex­am­ple be­ing where re­peat­edly tak­ing +EV bets can guar­an­tee even­tual $0/bank­rupt­cy, or the in­fa­mous where the ex­pec­ta­tion of each bet may be ~0 but the mar­tin­gale player even­tu­ally is ei­ther bank­rupt or profitable, never ex­actly even. An­other anal­ogy might be stock­-pick­ing: the ex­pected value of each stock is about the same due to effi­cient mar­kets, and if one in­dexes and waits a long time, one will likely get the fa­bled av­er­age re­turn of ~7%; but if one buys a sin­gle in­di­vid­ual stock and waits a long time, the re­turn will likely be ei­ther -100% or >>7%.

In any process or sce­nario in which we are deal­ing with large num­bers of vari­ables which sum to pro­duce a fi­nal re­sult, even if each of those vari­ables is neu­tral, a sin­gle sam­ple will be po­ten­tially ar­bi­trar­ily ab­solutely far from the ex­pec­ta­tion in a way which will sur­prise some­one who be­lieves that the law of large num­bers en­sures that the re­sult must be ex­tremely close to the ex­pec­ta­tion both rel­a­tively and ab­solute­ly.

Oh Deer: Could Deer Evolve to Avoid Car Accidents?

I’ve no­ticed while dri­ving many deer corpses over the years. Cars seem like they could be a ma­jor source of deer mor­tal­i­ty. If they are, deer might be evolv­ing be­hav­ior to avoid cars. But deer/­car ac­ci­dent rates ap­pear sta­ble or in­creas­ing (per­haps due to hu­man pop­u­la­tion growth & con­struc­tion). How fast would we ex­pect to see any deer adap­ta­tion?

Look­ing at some of the mor­tal­ity sta­tis­tics, I model it as a li­a­bil­ity thresh­old trait be­ing se­lected on via trun­ca­tion se­lec­tion, and cal­cu­late some hy­po­thet­i­cals about whether and how fast they could adapt.

Teal deer: of course, but it’d be slow.

While dri­ving to NYC re­cently I passed 3 , a few of many I have seen over the years, and a thought re-oc­curred to me: “if all these deer are be­ing killed by cars, should­n’t they be evolv­ing to avoid cars?” I’ve seen many dead deer and nar­rowly avoided a few my­self while dri­ving, and deer/­car mor­tal­ity is, if any­thing, much higher in states like Penn­syl­va­nia.

Ac­ci­dent rates would not nec­es­sar­ily show a steep de­cline thanks to past se­lec­tion, be­cause the ‘en­vi­ron­ment’ is not sta­tic here: as cars get faster, ac­ci­dents be­come more crip­pling or lethal to deer; the Amer­i­can pop­u­la­tion has ex­panded sev­er­al-fold both in pop­u­la­tion count, per-capita ve­hi­cle miles, sub­ur­ban liv­ing, ter­ri­tory frag­men­ta­tion, and the deer pop­u­la­tion too has ex­panded many-fold (from ~0.5m a cen­tury ago to <30m now). So if there was highly effec­tive on­go­ing se­lec­tion re­duc­ing deer ac­ci­dent risk, we would still ob­serve large ab­solute and pro­por­tional in­creases in ac­ci­dents/deaths.

But I am still cu­ri­ous as to what sort of se­lec­tion we could ex­pect, which is a hint as to long-term trend­s—the Amer­i­can pop­u­la­tion is now rel­a­tively sta­bi­lized in terms of growth and ve­hi­cle-miles, and deer ap­pear to have also reached a pop­u­la­tion equi­lib­ri­um, so a grad­ual long-term de­cline in ac­ci­dent rates might be pos­si­ble to see in the fu­ture if there is sub­stan­tial re­sponse to se­lec­tion.

Deer ac­ci­dents seem to be fairly fa­tal: wild an­i­mals are al­ways on the edge, and small in­juries can com­pound into death, so look­ing at mor­tal­ity will if any­thing un­der­es­ti­mate the strength of se­lec­tion. And of course there will be no sin­gle Mendelian genes, but be­ing a com­plex be­hav­ioral trait, it is al­most surely a highly poly­genic ad­di­tive trait. So we can treat it as on a bi­nary trait (“killed by a car”) in the .

For trun­ca­tion se­lec­tion, the two key pa­ra­me­ters are the her­i­tabil­ity of a ‘trait’, and the frac­tion of the pop­u­la­tion ex­press­ing the ‘trait’.

The her­i­tabil­ity can only be guessed at. Car ac­ci­dents ought to be her­i­ta­ble to some de­gree, be­cause every­thing is, and many be­hav­ioral traits like risk aver­sion or di­ur­nal­ity or re­ac­tion-time or star­tle re­flex or wan­der­lust would affect be­ing hit by a car (a deer can re­duce risk by avoid­ing roads en­tire­ly, not trav­el­ing far, wait­ing to cross un­til very late at night when there is no traffic or dur­ing the day when they can be eas­ily seen). Re­sponse to se­lec­tion need not cause some hard­wired be­hav­ioral change like aver­sion to trav­el: it might yield some­thing like the , where the re­sponse is for be­hav­ioral flex­i­bil­i­ty, and more fit deer are bet­ter able to learn how to nav­i­gate traffic gaps by watch­ing other deer or im­i­tat­ing their moth­er. The “an­thro­pocene” has led to many an­i­mals evolv­ing or oth­er­wise learn­ing how to adapt, with ur­ban en­vi­ron­ments no ex­cep­tion13, so why would deer be any ex­cep­tion?

Com­pli­cat­ing things is the pos­si­bil­ity that the her­i­tabil­ity is high but ac­tual re­sponses to se­lec­tion are lower than ex­pected when es­ti­mated in a uni­vari­ate sin­gle-trait fash­ion, be­cause there might be a with an­other fit­ness-in­flu­enc­ing trait, where bet­ter car avoid­ance means worse val­ues of that other trait—per­haps it would be easy to avoid cars by avoid­ing roads or trav­el­ing far, but this has the draw­back of pre­vent­ing re­lo­ca­tion to avoid star­va­tion or hunters, in which case re­sponse to se­lec­tion will be small or even op­po­site of ex­pected (this is plau­si­bly one of the main rea­sons why wild pop­u­la­tions may not evolve as fast as pre­dict­ed: “The Miss­ing Re­sponse to Se­lec­tion in the Wild”, Pu­jol et al 2018).

I can only guess at the her­i­tabil­i­ty, as I doubt her­i­tabil­i­ties have been cal­cu­lated for much in deer, but I would pre­dict it’s <50% sim­ply be­cause wild an­i­mals are un­der con­stant se­lec­tion and car-based se­lec­tion would’ve started al­most a cen­tury ago. It might seem im­pos­si­ble to cal­cu­late her­i­tabil­ity for wild an­i­mals of a ‘trait’ like be­ing hit by a car, but I think it’s doable. Wild an­i­mals have no twin stud­ies or fam­ily pedi­grees, of course, and meth­ods like com­mon-gar­den rear­ing like­wise seem ques­tion­able, but one use track­ing de­vices to fol­low fam­i­lies of deer un­til they all die to con­struct a pedi­gree with out­come of death; more fea­si­bly, one could col­lect DNA sam­ples from dead car-ac­ci­dent deer and dead non-ac­ci­dent deer, and com­pute ge­nomic sim­i­lar­ity with a pro­ce­dure like (for SNP her­i­tabil­i­ty) or whole-genome se­quenc­ing data (up­com­ing meth­ods for re­cov­er­ing full her­i­tabil­i­ty). But this has­n’t been done as far as I know. Oh well. We can look at a range of her­i­tabil­i­ties 0-50%.

The frac­tion of deer hit is a lit­tle eas­i­er. Wild deer live ~3-4y on av­er­age (eg Lopez et al 2004 on Key deer im­plies ~3.7y), sex­u­ally ma­tur­ing ~1.5y, and of the ~25m deer in the USA, around 1.5m are killed by cars an­nu­ally ~2012 (ac­cord­ing to The At­lantic with no ci­ta­tion) so per­haps ~5% an­nual mor­tal­ity from cars (Mc­Shea et al 2008 es­ti­mates 2% in a sam­pled Vir­ginia coun­ty, while Mc­Shea 2012 sug­gests 1.2m deaths out of 25m deer); if deer live 4 years and have a 5% an­nual risk of be­ing killed by a car, then their life­time risk should be 1 - 0.95^4 = 18% or per­haps 8%, which sounds like a rea­son­able range—a sub­stan­tial source of mor­tal­ity but prob­a­bly less than hunt­ing or star­va­tion or dis­ease.

The effect of a gen­er­a­tion of trun­ca­tion se­lec­tion on a bi­nary trait fol­low­ing the li­a­bil­i­ty-thresh­old model is more com­pli­cated but fol­lows a sim­i­lar spir­it. R im­ple­men­ta­tion of pg6 of , Lynch & Wal­sh:

threshold_select <- function(fraction_0, heritability, verbose=FALSE) {
    fraction_probit_0 = qnorm(fraction_0)
    ## threshold for not manifesting schizophrenia:
    s_0 = dnorm(fraction_probit_0) / fraction_0
    ## new rate of trait after one selection where 100% of past-the-threshold never reproduce:
    fraction_probit_1 = fraction_probit_0 + heritability * s_0
    fraction_1 = pnorm(fraction_probit_1)
    ## how much did we reduce trait in percentage terms?
    if (verbose) {
        print(paste0("Start: population fraction: ", fraction_0, "; liability threshold: ", fraction_probit_0, "; Selection intensity: ", s_0))
        print(paste0("End: liability threshold: ", fraction_probit_1, "; population fraction: ", fraction_1, "; Total population reduction: ",
                     fraction_0 - fraction_1, "; Percentage reduction: ", (1-((1-fraction_1) / (1-fraction_0)))*100)) }
    return(c(fraction_probit_1, fraction_1, fraction_0 - fraction_1))

threshold_select(1-0.18, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.82; liability threshold: 0.915365087842814; Selection intensity: 0.320000021339773"
# [1] "End: liability threshold: 1.0753650985127; population fraction: 0.858894349391959; Total population reduction: -0.0388943493919587; Percentage reduction: 21.6079718844215"
# [1]  1.07536509851  0.85889434939 -0.03889434939
threshold_select(1-0.08, 0.50, verbose=TRUE)
# [1] "Start: population fraction: 0.92; liability threshold: 1.40507156030963; Selection intensity: 0.161593724197447"
# [1] "End: liability threshold: 1.48586842240836; population fraction: 0.931343036233605; Total population reduction: -0.0113430362336053; Percentage reduction: 14.1787952920067"
# [1]  1.48586842241  0.93134303623 -0.01134303623

We can look at a range of sce­nar­ios for pop­u­la­tion preva­lences 8-18%, and her­i­tabil­i­ties 5%-50%. Aside from the per-gen­er­a­tion in­crease in car-avoid­ing deer/de­crease in deer-car ac­ci­dents, it might also be in­ter­est­ing to cal­cu­late a hy­po­thet­i­cal like “how many gen­er­a­tions/years would it take for nat­ural se­lec­tion, in a sta­tic en­vi­ron­ment, to re­duce life­time mor­tal­ity to 1%?”

thresholdModeling <- function(fraction_0, heritability, targetPercentage) {
    firstGeneration <- threshold_select(fraction_0, heritability)
    ## estimate how many generations of truncation selection until a target percentage is reached:
    i <- 1; fraction_i <- firstGeneration[2]
    while (fraction_i < targetPercentage) {
        i <- i+1
        nextGeneration <- threshold_select(fraction_i, heritability)
        fraction_i <- nextGeneration[2]
    return(c(firstGeneration, i)) }
thresholdModeling(1-0.20, 0.5, 1-0.01)

df <- expand.grid(Fraction=(1-seq(0.08, 0.18, by=0.02)), Heritability=seq(0.05, 0.50, by=0.05))
df <- cbind(df, round(digits=3,, Map(thresholdModeling, df$Fraction, df$Heritability, 0.99))))
colnames(df)[3:6] <- c("Threshold.latent", "", "Fraction.reduction", "")
df$Years <- round(df$ * 3.5)
Trun­ca­tion se­lec­tion re­sults for var­i­ous sce­nar­ios of deer-car life­time mor­tal­ity rates & her­i­tabil­i­ties.
Frac­tion Her­i­tabil­ity La­tent thresh­old Frac­tion 2nd Frac­tion re­duc­tion Gen­er­a­tions to 1% Years
0.92 0.05 1.413 0.921 -0.001 300 1050
0.90 0.05 1.291 0.902 -0.002 314 1099
0.88 0.05 1.186 0.882 -0.002 324 1134
0.86 0.05 1.093 0.863 -0.003 332 1162
0.84 0.05 1.009 0.843 -0.003 338 1183
0.82 0.05 0.931 0.824 -0.004 343 1200
0.92 0.10 1.421 0.922 -0.002 150 525
0.90 0.10 1.301 0.903 -0.003 157 550
0.88 0.10 1.198 0.884 -0.004 162 567
0.86 0.10 1.106 0.866 -0.006 166 581
0.84 0.10 1.023 0.847 -0.007 169 592
0.82 0.10 0.947 0.828 -0.008 171 598
0.92 0.15 1.429 0.924 -0.004 100 350
0.90 0.15 1.311 0.905 -0.005 104 364
0.88 0.15 1.209 0.887 -0.007 108 378
0.86 0.15 1.119 0.868 -0.008 110 385
0.84 0.15 1.038 0.850 -0.010 112 392
0.82 0.15 0.963 0.832 -0.012 114 399
0.92 0.20 1.437 0.925 -0.005 75 262
0.90 0.20 1.321 0.907 -0.007 78 273
0.88 0.20 1.220 0.889 -0.009 81 284
0.86 0.20 1.132 0.871 -0.011 82 287
0.84 0.20 1.052 0.854 -0.014 84 294
0.82 0.20 0.979 0.836 -0.016 85 298
0.92 0.25 1.445 0.926 -0.006 60 210
0.90 0.25 1.330 0.908 -0.008 62 217
0.88 0.25 1.232 0.891 -0.011 64 224
0.86 0.25 1.145 0.874 -0.014 66 231
0.84 0.25 1.067 0.857 -0.017 67 234
0.82 0.25 0.995 0.840 -0.020 68 238
0.92 0.30 1.454 0.927 -0.007 50 175
0.90 0.30 1.340 0.910 -0.010 52 182
0.88 0.30 1.243 0.893 -0.013 54 189
0.86 0.30 1.158 0.877 -0.017 55 192
0.84 0.30 1.081 0.860 -0.020 56 196
0.82 0.30 1.011 0.844 -0.024 57 200
0.92 0.35 1.462 0.928 -0.008 43 150
0.90 0.35 1.350 0.911 -0.011 44 154
0.88 0.35 1.255 0.895 -0.015 46 161
0.86 0.35 1.171 0.879 -0.019 47 164
0.84 0.35 1.096 0.863 -0.023 48 168
0.82 0.35 1.027 0.848 -0.028 48 168
0.92 0.40 1.470 0.929 -0.009 37 130
0.90 0.40 1.360 0.913 -0.013 39 136
0.88 0.40 1.266 0.897 -0.017 40 140
0.86 0.40 1.184 0.882 -0.022 41 144
0.84 0.40 1.110 0.867 -0.027 42 147
0.82 0.40 1.043 0.852 -0.032 42 147
0.92 0.45 1.478 0.930 -0.010 33 116
0.90 0.45 1.369 0.915 -0.015 34 119
0.88 0.45 1.277 0.899 -0.019 35 122
0.86 0.45 1.197 0.884 -0.024 36 126
0.84 0.45 1.125 0.870 -0.030 37 130
0.82 0.45 1.059 0.855 -0.035 37 130
0.92 0.50 1.486 0.931 -0.011 30 105
0.90 0.50 1.379 0.916 -0.016 31 108
0.88 0.50 1.289 0.901 -0.021 32 112
0.86 0.50 1.210 0.887 -0.027 33 116
0.84 0.50 1.139 0.873 -0.033 33 116
0.82 0.50 1.075 0.859 -0.039 34 119

As ex­pect­ed, sce­nar­ios where deer ac­ci­dents are rare achieve 1% faster (less work to do), and higher her­i­tabil­i­ties pro­duce faster re­sponses (more use­ful genes). In most of the sce­nar­ios, the an­nual per­cent re­duc­tion is small, no often less than a per­cent­age point, so would be easy to miss, and the al­most com­plete elim­i­na­tion of deer-car ac­ci­dents would take cen­turies of adap­ta­tion even with a fixed en­vi­ron­ment.

So while deer might be evolv­ing to re­duce ac­ci­dent mor­tal­i­ty, it would be hard to see and won’t help much any­time soon. (For the same rea­sons, I would ex­pect squir­rel-based power out­ages to con­tinue in­defi­nite­ly.) So if it’s a prob­lem, one’d bet­ter hurry up with safety mea­sures like road fenc­ing or self­-driv­ing cars.

Evolution as Backstop for Reinforcement Learning

One de­fense of free mar­kets notes the in­abil­ity of non-mar­ket mech­a­nisms to solve plan­ning & op­ti­miza­tion prob­lems. This has diffi­culty with Coase’s para­dox of the firm, and I note that the diffi­culty is in­creased by the fact that with im­prove­ments in com­put­ers, al­go­rithms, and data, ever larger plan­ning prob­lems are solved. Ex­pand­ing on some Cosma Shal­izi com­ments, I sug­gest a group se­lec­tion per­spec­tive: one rea­son for free mar­ket or evo­lu­tion­ary or Bayesian meth­ods in gen­eral is that while poorer at plan­ning/op­ti­miza­tion in the short run, they have the ad­van­tage of sim­plic­ity and op­er­at­ing on ground-truth val­ues, and serve as a con­straint on the more so­phis­ti­cated non-mar­ket mech­a­nisms. I il­lus­trate by dis­cussing cor­po­ra­tions, mul­ti­cel­lu­lar life, and re­in­force­ment learn­ing & meta-learn­ing in AI. This view sug­gests that are in­her­ent bal­ances be­tween mar­ket/non-mar­ket mech­a­nisms which re­flect the rel­a­tive ad­van­tages be­tween a slow un­bi­ased method and faster but po­ten­tially ar­bi­trar­ily bi­ased meth­ods.

Split out to .

Acne: a good Quantified Self topic

I sug­gest that teenagers in­ter­ested in ex­per­i­men­ta­tion, sta­tis­tics, or Quan­ti­fied Self ex­per­i­ment with in­ter­ven­tions to re­duce ac­ne, list­ing the ad­van­tages of the top­ic. To sug­gest spe­cific in­ter­ven­tions, I re-an­a­lyze & rank the April 2016 Cure­To­gether crowd­sourced rat­ings of ~113 acne in­ter­ven­tions; di­etary in­ter­ven­tions rate par­tic­u­larly highly after stan­dard retinoids (like Ac­cu­tane) & ben­zoyl per­ox­ide and might be worth closer in­ves­ti­ga­tion.

self­-ex­per­i­ments are un­der­-done and po­ten­tially valu­able. One kind of self­-ex­per­i­ment oc­curs to me in ret­ro­spect which would have been par­tic­u­larly valu­able—I re­gret get­ting into QS and sta­tis­tics only long after my (not a lit­tle painful) acne prob­lems largely sub­sided. The con­ven­tional wis­dom on acne seems poorly sup­ported and ne­glect­ful of in­di­vid­ual differ­ences, and I’ve long sus­pected that some rig­or­ous test­ing might turn up some in­ter­est­ing re­sults which could be use­ful to West­ern teenagers. (My acne was­n’t so bad, in ret­ro­spect, but it still frus­trated me a great deal; in ex­treme cas­es, it can con­tribute to sui­cide.)

Look­ing back, I can see how easy it would be to test the var­i­ous the­o­ries. For ex­am­ple, fa­cial washes & light boxes could be tested effi­ciently by block­ing—ran­dom­iz­ing which half of your face they are ap­plied to. Diet too would be use­ful to do as a fac­to­r­ial ex­per­i­men­t—­for 3 or 4 weeks, cut out all the carbs while load­ing up on dairy prod­ucts, then vice ver­sa. As a topic for self­-ex­per­i­men­ta­tion, acne has many ad­van­tages:

  • many high­ly-mo­ti­vated sub­jects with am­ple free time

  • ob­jec­tive vis­i­ble effects where the data can even be eas­ily recorded for blind rat­ings by third par­ties to re­duce bias (ie fa­cial pho­tographs)

  • fac­to­r­ial ex­per­i­ments or block­ing with­in-sub­ject is pos­si­ble for some acne in­ter­ven­tions, which would con­sid­er­ably in­crease power

  • in se­vere acne cas­es, changes would be easy to see within days/weeks, and large effects are plau­si­ble

  • a large num­ber of al­ready-known pos­si­ble in­ter­ven­tions none of which are par­tic­u­larly strongly sup­ported (so high VoI); a good start­ing point would be the Cure­To­gether crowd­sourced rank­ings (see next sec­tion for specifics).

    • most of the in­ter­ven­tions are safe & easy to im­ple­ment: it’s not hard to use a face wash twice a day, ap­ply some cream, cut out all dairy prod­ucts, etc
  • new com­mer­cial prod­ucts al­low for in­ter­est­ing hob­by­ist pro­ject­s—what would mi­cro­biome sam­pling show about the mi­cro­bial com­mu­ni­ties of teens with high acne vs low ac­ne? About their mi­cro­bial com­mu­ni­ties be­fore the high acne group de­vel­oped ac­ne? After treat­ment with ben­zoyl per­ox­ide or an­tibi­otics or retinol? etc. With a large enough com­mu­nity dataset, in­ter­est­ing tech­niques could be ap­plied for mea­sur­ing het­ero­gene­ity in in­di­vid­ual re­sponse to treat­ments (eg GCTA-style vari­ance com­po­nents on acne re­sponse could be en­light­en­ing, as genes would be ex­pected to be the source of much of the in­di­vid­ual differ­ences, acne it­self be­ing her­i­ta­ble)

  • a launch­ing pad into many ar­eas of sci­ence & sta­tis­tic­s—rel­e­vant top­ics in­clude hu­man & mi­cro­bial ge­net­ics, method­olog­i­cal bi­as­es, re­gres­sion to the mean, op­ti­mal ex­per­i­men­tal de­sign, se­quen­tial test­ing, meta-analy­sis, time-series, deep learn­ing… (“When mak­ing an axe han­dle with an axe, the model is near at hand.”)

    One could spin off lots of projects to in­crease rig­or—in­stead of count­ing by hand, you could take smart­phone pho­tos and feed them into a pim­ple-count­ing CNN. With all the frame­works like Py­Torch now, that’s well within the abil­ity of a bright com­put­er-in­clined high school­er. (And then put it up as a web ser­vice so other teens can run their own self­-ex­per­i­ments & score their pim­ple pho­tographs, why not…)

In­deed, , one of the early Quan­ti­fied Self pro­po­nents, cred­its his in­ter­est in self­-ex­per­i­ments to acne (Roberts 2001/), when he found did­n’t help his acne but ben­zoyl per­ox­ide did:

My in­ter­est in self­-ex­per­i­men­ta­tion be­gan when I read an ar­ti­cle about teach­ing math­e­mat­ics by Paul Hal­mos, a pro­fes­sor at In­di­ana Uni­ver­si­ty. Hal­mos em­pha­sized that “the best way to learn is to do.” I was try­ing to learn how to do ex­per­i­ments; I took this ad­vice to mean I should do as many as pos­si­ble. I could do more ex­per­i­ments, I re­al­ized, if I not only did rat ex­per­i­ments but also did ex­per­i­ments with my­self as the sub­ject. So I started do­ing small self­-ex­per­i­ments. Most of them were triv­ial and led nowhere (e.g., ex­per­i­ments about jug­gling). At the time I had ac­ne. My der­ma­tol­o­gist had pre­scribed both pills (te­tra­cy­cline, a wide-spec­trum an­tibi­otic) and a cream (ac­tive in­gre­di­ent ben­zoyl per­ox­ide). Sim­ply for the sake of do­ing ex­per­i­ments, any ex­per­i­ments, I did sim­ple tests to mea­sure the effec­tive­ness of these treat­ments. I be­lieved the pills were pow­er­ful and the cream had lit­tle effect. To my great sur­prise, the tests showed the op­po­site: The cream was pow­er­ful and the pills had lit­tle effect. It was very use­ful in­for­ma­tion. Many years lat­er, an ar­ti­cle in the British Jour­nal of Der­ma­tol­ogy re­ported that an­tibi­otic-re­sis­tant acne is com­mon.

Acne came up sev­eral times on Robert­s’s blog:

CureTogether acne interventions

Cure­To­gether (2008-2012) was a so­cial net­work/crowd­sourced web­site for ag­gre­gat­ing & rank­ing treat­ments of var­i­ous prob­lems, typ­i­cally dis­eases, and had an acne page. They were bought by in 2012, and op­er­ated for a while un­til the web­site seems to’ve van­ished around mid-2016 & the data ap­pears to no longer be avail­able any­where. The last In­ter­net Archive snap­shot of their acne page is 2016-04-01.

Their pre­sen­ta­tion of un­cer­tainty & av­er­age rank­ing of the in­ter­ven­tions is use­less; but they do pro­vide the to­tals for each of the rat­ing lev­els, so I de­cided to fix it by ex­tract­ing & re-an­a­lyz­ing the rat­ings in a mul­ti­-level Bayesian or­di­nal re­gres­sion with a weakly in­for­ma­tive prior to get more use­ful pos­te­rior es­ti­mates of rat­ings. (While I was at it, I fixed some spelling er­rors and merged a few in­ter­ven­tions which were the same, like “Ac­cu­tane” and “Isotretinoin (Roac­cu­tane, Ac­cu­tane)”.)

## "much worse / slightly worse / no effect / moderate improvement / major improvement" → 1-5
acne <- read.csv("",
                 header=TRUE, colClasses=c("integer", "integer", "factor"))
# Skim summary statistics
#  n obs: 439
#  n variables: 3
# ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#      variable missing complete   n n_unique                     top_counts ordered
#  Intervention       0      439 439      113 Acn: 5, Agi: 5, Amo: 5, Ant: 5   FALSE
# ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#  variable missing complete   n  mean    sd p0 p25 p50 p75 p100     hist
#    Effect       0      439 439  3.16  1.33  1   2   3   4    5 ▅▆▁▇▁▇▁▆
#         N       0      439 439 18.38 34.74  1   2   6  19  324 ▇▁▁▁▁▁▁▁
b <- brm(Effect | weights(N) ~ (1|Intervention), prior=c(prior(student_t(3, 0, 1), "sd")), family=cumulative(),
    iter=5000, chains=30, cores=30, data=acne)
# Group-Level Effects:
# ~Intervention (Number of levels: 113)
#               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# sd(Intercept)     0.64      0.06     0.53     0.76      16503 1.00
# Population-Level Effects:
#              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
# Intercept[1]    -3.44      0.09    -3.62    -3.26      20848 1.00
# Intercept[2]    -2.17      0.08    -2.32    -2.02      17179 1.00
# Intercept[3]     0.49      0.07     0.35     0.64      15767 1.00
# Intercept[4]     2.75      0.08     2.58     2.91      19558 1.00
red <-$Intervention)
round(red[order(red$Estimate.Intercept, decreasing=TRUE),], digits=2)

## the ordinal model forest plot doesn't make much intuitive sense, so redo as a normal-distribution
## for easier forest plotting (the rankings are more or less identical, perhaps a little less precise):
bg <- brm(Effect | weights(N) ~ (1|Intervention), prior=c(prior(student_t(3, 0, 1), "sd")), family=gaussian(),
    iter=5000, chains=30, cores=30, data=acne)
## Extract & sort estimates:
coefs <-
coefs[order(coefs$Intervention.Estimate.Intercept, decreasing=TRUE),]
## Plot estimates:
For­est plot: pos­te­rior rank­ings of acne in­ter­ven­tions by Cure­To­gether users as of April 2016 (com­puted us­ing a Gauss­ian Bayesian mul­ti­level mod­el).
Pos­te­rior es­ti­mates of av­er­age Cure­To­gether crowd-sourced rat­ings for acne treat­ments, in de­scend­ing or­der (high­er=­bet­ter).
In­ter­ven­tion Es­ti­mate SE 2.5% 97.5%
Isotretinoin (Roac­cu­tane, Ac­cu­tane) 4.04 0.06 3.91 4.16
Pa­leo Diet 3.75 0.08 3.58 3.92
No gluten 3.67 0.07 3.51 3.82
Aze­laic acid (Azelex) 3.65 0.16 3.32 3.99
Bactrim (Trimetho­prim/­sul­famethox­a­zole) 3.62 0.12 3.36 3.87
Ke­to­genic Diet 3.59 0.18 3.23 3.95
Clin­damy­acin phos­phate gel 3.58 0.13 3.31 3.85
Gly­colic acid 3.56 0.21 3.14 3.99
Ziana (Clin­damycin/tretinoin) 3.56 0.21 3.14 3.99
Amox­i­cillin 3.56 0.14 3.27 3.85
No sugar 3.55 0.07 3.41 3.70
No dairy 3.52 0.06 3.40 3.64
Di­ane-35 (cypro­terone/ethinyl estra­di­ol) 3.52 0.12 3.27 3.77
Epiduo Gel 3.51 0.15 3.20 3.82
hav­ing a good skin­care rou­tine 3.49 0.17 3.16 3.83
elim­i­nat­ing come­do­genic in­gre­di­ents… 3.49 0.21 3.07 3.93
Sit out in the sun…at least 15 min­utes 3.49 0.05 3.38 3.60
Avoid touch­ing face 3.49 0.03 3.41 3.57
Ben­zoyl per­ox­ide 10% 3.49 0.09 3.30 3.67
Ole Hen­rik­sen prod­ucts 3.47 0.18 3.10 3.84
No choco­late 3.45 0.22 3.02 3.89
Do­ryx (Doxy­cy­cline) 3.45 0.07 3.30 3.59
foot bath with bak­ing soda 3.45 0.24 2.98 3.93
Bikram Yoga 3.45 0.24 2.97 3.93
Tretinoin 3.45 0.24 2.97 3.93
Retin A 3.45 0.05 3.33 3.56
Birth con­trol pill / Bal­ance hor­mones 3.44 0.05 3.33 3.55
Zinc soap 3.43 0.20 3.04 3.84
Wash­ing face 3.43 0.03 3.36 3.49
No fast food 3.42 0.06 3.30 3.55
Pho­to­Dy­namic Ther­apy 3.42 0.15 3.11 3.73
Ex­u­viance Ves­pera Serum 3.41 0.23 2.96 3.87
Helminthic ther­apy 3.41 0.23 2.96 3.87
Clin­damycin 1% dab­ber 3.41 0.07 3.25 3.56
White vine­gar 3.40 0.17 3.06 3.75
Wash­ing pil­low­cases reg­u­larly 3.40 0.05 3.29 3.50
Ac­Zone 3.39 0.20 2.99 3.81
Tazo­rac 3.39 0.16 3.07 3.71
Tetra­cy­cline 3.39 0.05 3.29 3.49
Zinc cream 3.39 0.15 3.09 3.69
Ben­za­clin (Ben­zoyl per­ox­ide/­clin­damycin) 3.39 0.09 3.19 3.58
Ben­zoyl per­ox­ide 2.5% 3.38 0.03 3.30 3.46
Aveeno Skin Bright­en­ing Daily Scrub 3.38 0.22 2.94 3.82
Met­ro­gel 3.38 0.17 3.03 3.72
Retinol Oral Sup­ple­men­ta­tion 3.37 0.20 2.97 3.77
Duac Gel 3.37 0.10 3.16 3.57
avoid spicy food 3.35 0.19 2.97 3.74
Plac­ing clean towel on pil­low each night 3.35 0.11 3.13 3.56
Pan­tothenic acid 3.35 0.13 3.09 3.61
An­tibi­otic cream 3.33 0.08 3.18 3.49
Rosac 3.33 0.19 2.96 3.70
Sul­fur Pow­der 3.33 0.14 3.05 3.61
My­chelle clear skin serum 3.33 0.17 2.98 3.68
Mario Be­descu Dry­ing Lo­tion 3.32 0.14 3.03 3.61
Drink a lot of wa­ter 3.31 0.13 3.05 3.57
Ac­ne­trex 3.31 0.22 2.87 3.74
Spirono­lac­tone 3.30 0.16 2.97 3.63
Blem­ish po­tion 3.30 0.13 3.03 3.57
Dr. Hauschka Nat­ural Skin Care 3.30 0.14 3.02 3.58
Ery­thromycin top­i­cal so­lu­tion 3.30 0.10 3.10 3.50
Us­ing fresh aloe vera leaves on skin 3.30 0.11 3.07 3.52
Evening Prim­rose Oil 3.29 0.21 2.86 3.71
Sal­i­cylic acid 3.29 0.04 3.20 3.37
Vac­ci­na­tion with in­di­vid­u­ally de­vel­oped vac­cine14 3.28 0.21 2.86 3.70
Stievi­a-A Cream 0.025% 3.28 0.17 2.94 3.63
Vicco Turmeric Skin Cream 3.27 0.24 2.80 3.75
Dr. An­drew Weil for Ori­gins 3.27 0.17 2.93 3.61
Cleanance K by Avene 3.27 0.14 2.99 3.55
OBAGI 3.27 0.15 2.96 3.58
N-acetyl­cys­teine 3.27 0.20 2.86 3.67
Differin Gel 3.26 0.12 3.01 3.50
Avoid­ing pork 3.25 0.16 2.92 3.57
Chem­i­cal-free, ve­gan cos­met­ics / per­sonal care prod­ucts 3.25 0.07 3.09 3.40
Wash­ing face with bak­ing soda 3.24 0.19 2.85 3.62
Gamma Linolenic Acid 3.23 0.22 2.79 3.67
Oil Cleans­ing Method 3.23 0.10 3.02 3.44
Acne Free 3.22 0.10 3.01 3.43
African Black Soap 3.22 0.14 2.94 3.50
Aloe Vera Soap 3.21 0.15 2.91 3.51
Aro­mather­apy 3.21 0.16 2.89 3.53
Urine ther­apy 3.20 0.20 2.79 3.61
Hy­dro­cor­ti­sone cream 3.20 0.14 2.91 3.49
No eggs 3.20 0.12 2.95 3.45
Air Pu­ri­fier 3.19 0.14 2.91 3.47
Aveda En­bright­en­ment Wash 3.18 0.19 2.79 3.56
Ve­gan diet 3.18 0.08 3.01 3.35
Cetaphil for sen­si­tive skin 3.18 0.06 3.05 3.30
Vi­t­a­min D 3.17 0.06 3.05 3.29
Zinc 3.16 0.07 3.02 3.30
Bert’s Bees blem­ish stick 3.15 0.18 2.80 3.51
Jo­joba Oil 3.13 0.20 2.73 3.52
Pur­pose Gen­tle Cleans­ing Bar 3.12 0.11 2.89 3.34
Honey 3.11 0.08 2.94 3.29
Bert’s Bees Cleans­ing Milk 3.10 0.17 2.75 3.45
Tea tree oil 3.10 0.05 2.99 3.20
Ag­ing (im­prove­ment after around age 19) 3.08 0.05 2.98 3.18
Al­pha Hy­droxy Acid 3.05 0.12 2.80 3.30
Flax oil 3.05 0.10 2.84 3.25
Ex­er­cise 3.04 0.05 2.93 3.15
Preg­nancy 3.04 0.15 2.73 3.35
St. Ives Apri­cot Scrub 3.04 0.06 2.92 3.16
Ap­ple cider vine­gar tonic 3.03 0.08 2.86 3.21
Wash­ing face with wa­ter only 3.03 0.06 2.90 3.16
Proac­tiv 3.03 0.05 2.92 3.13
Tooth­paste on acne 3.01 0.06 2.89 3.13
Rub­bing Al­co­hol 2.99 0.07 2.85 3.13
Hy­dro­gen per­ox­ide 2.98 0.07 2.83 3.13
Bar soap 2.96 0.05 2.85 3.07
Sauna 2.96 0.11 2.73 3.19
Mu­rad Ac­neCom­plex 2.93 0.10 2.72 3.14
Clear­asil Stay­clear 2.91 0.06 2.79 3.02
Emu oil 2.81 0.18 2.44 3.17
Not wash­ing face at all 2.75 0.16 2.43 3.06

One thing which would be nice to do with the Cure­To­gether rat­ings is ex­am­ine the clus­ter­ing of in­ter­ven­tions. Some of these in­ter­ven­tions are doubt­less the same thing and should be merged to­geth­er; oth­ers are differ­ent but may act through the same mech­a­nisms and so could be con­sid­ered about the same thing too. Do­ing clus­ter­ing or fac­tor analy­sis might pop up a hand­ful of ma­jor ‘ap­proaches’, like ‘an­tibi­otics’ vs ‘clean­ing’ vs ‘di­etary in­ter­ven­tions’, and that would make the avail­able op­tions much eas­ier to un­der­stand.

More in­ter­est­ing­ly, in­ter­ven­tions al­most cer­tainly cor­re­late with each other & pre­dict suc­cess of each oth­er, and this could be used to pro­vide flow­charts of ad­vice along the lines of “if X did­n’t work for you, then Y might and Z prob­a­bly won’t”. Acne treat­ment could be con­sid­ered as a Par­tially Ob­serv­able Markov De­ci­sion Prob­lem where the effec­tive­ness of each treat­ment is un­known but has a prior prob­a­bil­ity based on the global rat­ings, and one treat­men­t’s re­sult in­forms the pos­te­rior prob­a­bil­ity of suc­cess of the un­tried treat­ments; a POMDP can be solved to give an op­ti­mal se­quence of treat­ments to try. (I’ve done a sim­ple ex­am­ple of this for cats and cat stim­u­lants like cat­nip.) And this ap­proach also im­me­di­ately gives a prin­ci­pled way for users to col­lec­tively ex­per­i­ment by pos­te­rior sam­pling (like , but one solves the MDP which is de­fined after sam­pling a pos­si­ble pa­ra­me­ter value from each pa­ra­me­ter’s pos­te­rior & treat­ing it as the true val­ue). Un­for­tu­nate­ly, that re­quires each Cure­To­gether rater’s in­di­vid­ual data, to ex­am­ine cor­re­la­tions with­in-in­di­vid­ual of rat­ings, and the web in­ter­face does­n’t pro­vide that, and since Cure­To­gether has been ac­quired & dis­ap­peared, there’s no one to ask for the raw data now.

Fermi calculations

A short dis­cus­sion of “Fermi cal­cu­la­tions”: quick­-and-dirty ap­prox­i­mate an­swers to quan­ti­ta­tive ques­tions which prize clev­er­ness in ex­ploit­ing im­pli­ca­tions of com­mon knowl­edge or ba­sic prin­ci­ples in given rea­son­able an­swers to ap­par­ently unan­swer­able ques­tions. Links to dis­cus­sions of Fermi es­ti­mates, and a list of some Fermi es­ti­mates I’ve done.

I re­ally like (Less­Wrong)—it’s like for every­thing out­side of physics15.

Not only are they fun to think about, they can be amaz­ingly ac­cu­rate, and are ex­tremely cheap to do—be­cause they are so easy, you do them in all sorts of sit­u­a­tions you would­n’t do a ‘real’ es­ti­mate for, and are a fun part of a . The com­mon dis­taste for them baffles me; even if you never work through Hub­bard’s How to Mea­sure Any­thing (some strate­gies) or Street-Fight­ing Math­e­mat­ics or read 1982 es­say “On Num­ber Numb­ness” (col­lected in Meta­m­ag­i­cal The­mas), it’s some­thing you can teach your­self by ask­ing, what in­for­ma­tion is pub­lic avail­able, what can I com­pare this too, how can I put var­i­ous bound­aries around the true an­swers16 You es­pe­cially want to do Fermi cal­cu­la­tions in ar­eas where the data is un­avail­able; I wind up pon­der­ing such ar­eas fre­quent­ly:

An en­tire “es­ti­ma­tion” sub­red­dit is de­voted to work­ing through ques­tions like these (it can be quite fun), and of course, there are the mem­o­rable “what if?” xkcd columns.

sug­gests a num­ber of prob­lems which might help chil­dren re­ally learn how to think with & ap­ply the math they learn.

To look fur­ther afield, here’s a quick and nifty ap­pli­ca­tion by in­vestor John Hemp­ton to the : “Risk man­age­ment and sound­ing crazy”. What I per­son­ally found most in­ter­est­ing about this post was not the over­all theme that the whistle­blow­ers were dis­counted be­fore and after they were proven right (we see this in many bub­bles, for ex­am­ple, the hous­ing bub­ble), but how one could use a sort of Out­side View/Fermi cal­cu­la­tion to san­i­ty-check the claims. If Sino Forestry was re­ally caus­ing 17m cu­bic me­ters of wood to be processed a year, where was all the pro­cess­ing? This sim­ple ques­tion tells us a lot. With med­i­cine, there is one sim­ple ques­tion one can al­ways ask too—where is the in­creased longevi­ty? (This is an im­por­tant ques­tion to ask of stud­ies, such as a re­cent caloric re­stric­tion study.)

Sim­ple ques­tions and rea­son­ing can tell us a lot.

Selective Emigration and Personality Trait Change

finds that the em­i­gra­tion of 25% of the Scan­di­na­vian pop­u­la­tion to the USA 1850–1920 was dri­ven in part by more ‘in­di­vid­u­al­is­tic’ per­son­al­ity fac­tors among em­i­grants, lead­ing to per­ma­nent de­creases in mean ‘in­di­vid­u­al­ism’ in the home coun­tries. This is at­trib­uted to cul­tural fac­tors, rather than ge­net­ics. I model the over­all mi­gra­tion as a sim­ple trun­ca­tion se­lec­tion sce­nar­io, and find that in a sim­ple model un­der rea­son­able as­sump­tions, the en­tire effect could be ge­net­ic.

In , Knud­sen 2019, Knud­sen ex­am­ines one of the largest em­i­gra­tions from Eu­rope to the USA, from Scan­di­na­vian coun­tries like : over about two gen­er­a­tions, 25% of the en­tire pop­u­la­tion left for the USA. (The effects of this em­i­gra­tion will be fa­mil­iar to any­one who has vis­ited the Mid­west.) Who left is not ran­dom at all, and would be in­flu­enced by per­son­al­ity traits; Knud­sen uses noisy prox­ies to mea­sure per­son­al­ity traits like ‘in­di­vid­u­al­ism’ on a pop­u­la­tion lev­el, and finds that be­cause of the em­i­gra­tion of Scan­di­na­vian in­di­vid­u­als high on these traits, the re­main­ing pop­u­la­tion is no­tice­ably lower on them, by “ap­prox­i­mately 3.9%-points in Den­mark, 10.1%-points in Swe­den, and 13.1%-points in Nor­way”, and more in­ter­est­ing­ly, these re­duc­tions ap­pear to be per­ma­nent:

Sev­eral cir­cum­stances make the Age of Mass Mi­gra­tion an ideal case for the em­pir­i­cal ob­jec­tive of this pa­per. Dur­ing the pe­ri­od, mil­lions of peo­ple left Eu­rope to set­tle in New World coun­tries such as the United States. Swe­den, Nor­way, and Den­mark ex­pe­ri­enced some of the high­est em­i­gra­tion rates in Eu­rope dur­ing this pe­ri­od, in­volv­ing the de­par­ture of ap­prox­i­mately 25% of their pop­u­la­tions. Be­sides rep­re­sent­ing the largest mi­gra­tion event in Scan­di­na­vian his­to­ry, global reg­u­la­tory poli­cies on mi­gra­tion were par­tic­u­larly loose at this point in time, which en­ables the iden­ti­fi­ca­tion of self­-s­e­lec­tive processes un­der lim­ited gov­ern­men­tal in­flu­ence. In ad­di­tion, the his­tor­i­cal con­text al­lows me to study long-run cul­tural im­pli­ca­tions of mi­gra­tion in send­ing lo­ca­tions.

For use in analy­ses of ag­gre­gate dis­trict effects, I quan­tify the cul­tural shock of se­lec­tive em­i­gra­tion as the per­cent­age point drop in the preva­lence of in­di­vid­u­al­ists in the mi­grant-send­ing pop­u­la­tion that oc­curs due to em­i­gra­tion. This is fea­si­ble be­cause I have in­for­ma­tion on rates of em­i­gra­tion and the gap be­tween em­i­grant and pop­u­la­tion in­di­vid­u­al­ism. Ac­cu­mu­lat­ing these shocks over the en­tire pe­riod of the Age of Mass Mi­gra­tion re­veals an over­all re­duc­tion in in­di­vid­u­al­ism from em­i­gra­tion of ap­prox­i­mately 3.9%-points in Den­mark, 10.1%-points in Swe­den, and 13.1%-points in Nor­way…­Fixed-effects es­ti­ma­tions doc­u­ment that the cul­tural shocks of em­i­gra­tion pushed mi­grant-send­ing dis­trict cul­ture in a col­lec­tivist di­rec­tion and this is ro­bust to the in­clu­sion of con­trol vari­ables that cap­ture al­ter­na­tive dis­trict, co­hort, and em­i­grant char­ac­ter­is­tics

The shock mea­sures are avail­able from 1860 on­wards, since these are the decades cov­ered by the em­i­grant pas­sen­ger lists. The cul­tural shocks of em­i­gra­tion can be cal­cu­lated for any time pe­riod and ag­gre­ga­tion lev­el. I pro­duce base­line mea­sures at the dis­trict and decade lev­el. The ac­cu­mu­lated shock over the en­tire Age of Mass Mi­gra­tion equals a loss of in­di­vid­u­al­ists of 3.9%-points in Den­mark, 10.1%-points in Swe­den, and 13.1%-points in Nor­way. Al­though the mea­sure most pre­cisely re­flects shocks to the dis­tri­b­u­tion of in­her­ited in­di­vid­u­al­ism in the home pop­u­la­tion, they also ap­prox­i­mate shocks to the dis­tri­b­u­tion of ac­tual in­di­vid­u­al­ism if em­i­grants and non-em­i­grants are as­sumed to aban­don their cul­tural her­itage at sim­i­lar rates. The rate and cul­tural shock of em­i­gra­tion cor­re­late strongly around a ra­tio of 0.4 (see scat­ter plot and his­togram in Ap­pen­dix Fig­ures A.1 and A.3).23

( also show em­i­gra­tion effects in Swe­den re­lated to com­mu­nal­ism, lead­ing to greater vot­ing/la­bor union/strike par­tic­i­pa­tion.)

Why are they per­ma­nent? Knud­sen only men­tions ‘cul­tural trans­mis­sion’, but of course, they have to also be par­tially ge­net­ic: per­son­al­ity traits are her­i­ta­ble, and so se­lect­ing on per­son­al­ity will cause ge­netic changes, and we can see pop­u­la­tion strat­i­fi­ca­tion caused by se­lec­tion on com­plex traits (like in­tel­li­gence/e­d­u­ca­tion in the UK: , Ab­del­laoui et al 2018). A process like em­i­gra­tion has no way of some­how let­ting only peo­ple em­i­grate who are ‘in­di­vid­u­al­is­tic’ for purely en­vi­ron­men­tal rea­sons, after all; such processes see only phe­no­types, which are the out­come of both genes and en­vi­ron­ments.

Is it plau­si­ble that ge­net­ics might ex­plain a non­triv­ial amount of the re­duc­tion? The se­lec­tion effect here is rel­a­tively small: only 25% of the to­tal pop­u­la­tion left, on­ce, and while ‘in­di­vid­u­al­ism’ might be a big part of de­ci­sions to live, there are surely many other ge­netic and en­vi­ron­men­tal fac­tors and sheer ran­dom­ness in any in­di­vid­u­al’s de­ci­sion to leave. It feels like the ge­netic effect might be much too small to be par­tic­u­larly rel­e­vant.

We can quickly es­ti­mate ex­pected effects by treat­ing this as in a (, Lynch & Wal­sh): the bot­tom X% are ‘trun­cated’ from the pop­u­la­tion (be­cause they phys­i­cally left for the USA and did not re­turn, a bi­nary vari­able) in a sin­gle se­lec­tive step, and we want to know the effect on per­son­al­ity traits of a cer­tain her­i­tabil­ity (which can be ex­trap­o­lated from re­search on more-care­fully mea­sured per­son­al­ity traits like the Big Five) given that they are r < 1 with the em­i­gra­tion vari­able (we can only guess at the val­ues here—­surely r < 0.5?).

Knud­sen says the over­all se­lec­tion is 25%; this would be over per­haps 2 gen­er­a­tions, but since it’s such a short pe­riod and small effects, it does­n’t make much of a differ­ence if we treat it as a sin­gle big pop­u­la­tion.

What is the over­all effect on per­son­al­i­ty? Knud­sen in­cludes some graphs show­ing em­i­gra­tion rates by coun­try over time, but I don’t see any pop­u­la­tion-weighted av­er­age over­all effect re­ported in her pa­per. (Bring­ing up the ques­tion of how much of the effect is ge­netic with Knud­sen on Twit­ter, she men­tioned that she was work­ing on re­vi­sions which would in­clude more in­for­ma­tion on effect sizes and what other in­di­vid­ual differ­ences might cause the em­i­gra­tion de­ci­sion.) The ob­served effect is 3.9%/10.1%/13.1% (Nor­way/Swe­den/­Den­mark), which av­er­ages to 8%:

library(psych); geometric.mean(c(3.9, 10.1, 13.1))
# [1] 8.02082595

Swe­den is pre­sum­ably by far the biggest coun­try of the 3 at the time and would in­flu­ence the av­er­age the most, so 8% is prob­a­bly too small, but serves as a bench­mark. It would be more con­ve­nient to have this in stan­dard de­vi­a­tions but I don’t see any­where Knud­sen gives it in SDs rather than per­cent­ages. I think what is done there is treat the pop­u­la­tion as a stan­dard­ized vari­able and then the con­ver­sion was to con­vert per­centiles to SDs; then that sug­gests an “8% de­crease” would be equiv­a­lent to a 0.19SD change:

qnorm(0.99) / 99
# [1] 0.0234984634
(qnorm(0.99) / 99) * 8.02
# [1] 0.188457676

0.19SD is­n’t so big, and looks some­what plau­si­ble. Per­haps trun­ca­tion se­lec­tion can ex­plain this change after all.

To es­ti­mate the change, we con­vert the per­cent­age to an SD thresh­old: how low a score does it take to be in the bot­tom 25%17 and thus em­i­grate?

# [1] -0.67448975

So we can imag­ine every­one with a ‘com­mu­nal­ism’ score of <-0.67 em­i­grat­ed. In a pop­u­la­tion of 𝒩(0,1), if we delete every­thing <-0.67, giv­ing a what does the re­main­der av­er­age?

truncNormMean <- function(a, mu=0, sigma=1, b=Inf) {
        phi <- dnorm
        erf <- function(x) 2 * pnorm(x * sqrt(2)) - 1
        Phi <- function(x) { 0.5 * (1 + erf(x/sqrt(2))) }
        Z <- function(beta, alpha) { Phi(beta) - Phi(alpha) }

        alpha = (a-mu)/sigma; beta = (b-mu)/sigma

        return( (phi(alpha) - phi(beta)) / Z(beta, alpha) ) }
# [1] 0.423702097

So the sur­vivors, as it were, have shifted from an av­er­age of 0SD to +0.42SD on ‘com­mu­nal­ism’ (an­ti-‘in­di­vid­u­al­ism’).

This, how­ev­er, is not the ge­netic in­crease. It in­cludes all the vari­ables, not just genes, like rear­ing or ran­dom noise, which affect per­son­al­i­ty, and then all of the vari­ables which affect em­i­gra­tion risk above and be­yond per­son­al­i­ty. So it must be de­creased twice to com­pen­sate for per­son­al­ity be­ing only partly ge­net­ic, and then em­i­gra­tion be­ing only partly per­son­al­ity (ie. genes → per­son­al­ity → em­i­gra­tion has 2 ar­rows which are r < 1).

Per­son­al­ity fac­tors like the well-estab­lished as hav­ing her­i­tabil­i­ties ~50% (eg Bouchard & McGue 2003), so would cor­re­late . ‘In­di­vid­u­al­ism’/‘com­mu­nal­ism’ is of course not ex­actly the Big Five, and in Knud­sen cor­re­lates with many other traits as­so­ci­ated with so­cial & eco­nomic suc­cess, and so it may re­flect much more than some­thing like ; in­tel­li­gence is prob­a­bly part of it, and in­tel­li­gence has a sub­stan­tially higher her­i­tabil­ity than the Big Five, so I might be un­der­es­ti­mat­ing ‘in­di­vid­u­al­ism’ here.

How much of em­i­gra­tion is per­son­al­i­ty? I can’t give an ex­act value here: Knud­sen’s analy­sis prob­a­bly im­plies a value some­where, but she is us­ing im­per­fect prox­ies with a lot of mea­sure­ment er­ror, so I would­n’t be sure how to in­ter­pret them any­way. I’d be sur­prised if it was r > 0.5, sim­ply be­cause few vari­ables ex­ceed that in psy­chol­ogy or so­ci­ol­o­gy, and one would ex­pect em­i­gra­tion to be in­flu­enced by prox­im­ity to ports, lo­cal eco­nomic & weather con­di­tions, , in­di­vid­ual & fa­mil­ial wealth, etc—so con­sider an ex­am­ple with r = 0.5 as a best-case sce­nario.

We de­flate the trun­cated nor­mal mean by the 2 vari­ables:

truncNormMean(qnorm(0.25)) * sqrt(0.5) * 0.5
# [1] 0.149801313
(truncNormMean(qnorm(0.25)) * sqrt(0.5) * 0.5) / (qnorm(0.99) / 99)
# [1] 6.3749408

So in this trun­ca­tion se­lec­tion sce­nar­io, we’d pre­dict a +0.15SD or ~6% in­crease. This is not too far off the Knud­sen es­ti­mates of 0.19SD or 8%, and quite small changes could equal­ize them (eg r = 0.62 for per­son­al­ity → em­i­gra­tion would be enough).

So it seems that the trun­ca­tion se­lec­tion is strin­gent enough, and the ob­served effects small enough, that ge­net­ics could in­deed ex­plain most or all of it.

See Also

The Most Abandoned Books on GoodReads

What books are hard­est for a reader who starts them to fin­ish, and most likely to be aban­doned? I scrape a crowd­sourced tag, abandoned, from the GoodReads book so­cial net­work to es­ti­mate con­di­tional prob­a­bil­ity of be­ing aban­doned.

The de­fault GoodReads tag in­ter­face presents only raw counts of tags, not counts di­vided by to­tal rat­ings (=read­s). This con­flates pop­u­lar­ity with prob­a­bil­ity of be­ing aban­doned: a pop­u­lar but rarely-a­ban­doned book may have more abandoned tags than a less pop­u­lar but often-a­ban­doned book. There is also resid­ual er­ror from the win­ner’s curse where books with fewer rat­ings are more mis­-es­ti­mated than pop­u­lar books.

Cor­rect­ing for both changes the top-5 rank­ing com­plete­ly, from (raw counts):

  1. ,
  2. ,
  3. ,
  4. ,
  5. ,

to (shrunken pos­te­rior pro­por­tions):

  1. ,
  2. ,
  3. ,
  4. The Witch­es: Salem, 1692,
  5. ,

I also con­sider a model ad­just­ing for co­vari­ates (au­thor/av­er­age-rat­ing/year), to see what books are most sur­pris­ingly often-a­ban­doned given their pedi­grees & rat­ing etc. Aban­don rates in­crease the newer a book is, and the lower the av­er­age rat­ing.

Ad­just­ing for those, the top-5 are:

  1. The Ca­sual Va­cancy, J.K. Rowl­ing
  2. The Chemist,
  3. ,
  4. ,
  5. ,

Books at the top of the ad­justed list ap­pear to re­flect a mix of high­ly-pop­u­lar au­thors chang­ing gen­res, and ‘pres­tige’ books which are high­ly-rated but a slog to read.

These re­sults are in­ter­est­ing in how they high­light how peo­ple read books for many differ­ent rea­sons (such as mar­ket­ing cam­paigns, lit­er­ary pres­tige, or fol­low­ing a pop­u­lar au­thor), and this is re­flected in their de­ci­sion whether to con­tinue read­ing or to aban­don a book.

Split out to .

  1. The GSS pro­vides down­loads of the full n = 62k sur­vey dataset as of 2016, and it is a pro­por­tional pop­u­la­tion sam­ple, so a “char­ac­ter gen­er­a­tor” can be im­ple­mented as sim­ply as sam­pling 1 ran­dom row from the dataframe and map­ping it back onto a nat­ural lan­guage de­scrip­tion. Should this be in­ad­e­quate, a gen­er­a­tive model such as an au­toen­coder or GAN could be trained on the dataset to gen­er­ate fur­ther re­al­is­tic ex­am­ples which re­spect all the com­pli­cated cor­re­la­tions & pat­terns be­tween re­spons­es. For pri­vacy rea­sons, the GSS does not pro­vide each re­spon­den­t’s lo­ca­tion or name (ex­cept for one ques­tion I found ask­ing about the coarse re­gion they grew up in such as “At­lantic re­gion” or “New Eng­land”), which are key vari­ables for char­ac­ters, so they could per­haps be gen­er­ated us­ing the GSS and then the age/­sex/re­gion sam­pled from US Cen­sus da­ta.↩︎

  2. Richard Ham­ming, :

    The three out­stand­ing prob­lems in physics, in a cer­tain sense, were never worked on while I was at Bell Labs. By im­por­tant I mean guar­an­teed a No­bel Prize and any sum of money you want to men­tion. We did­n’t work on (1) time trav­el, (2) tele­por­ta­tion, and (3) anti­grav­i­ty. They are not im­por­tant prob­lems be­cause we do not have an at­tack. It’s not the con­se­quence that makes a prob­lem im­por­tant, it is that you have a rea­son­able at­tack. That is what makes a prob­lem im­por­tant.

  3. “The guide­lines man­u­al, ch8: In­cor­po­rat­ing health eco­nom­ics in guide­lines and as­sess­ing re­source im­pact”:

    The con­sen­sus among NICE’s eco­nomic ad­vis­ers is that NICE should, gen­er­al­ly, ac­cept as cost effec­tive those in­ter­ven­tions with an in­cre­men­tal cost-effec­tive­ness ra­tio of less than £20,000 per QALY and that there should be in­creas­ingly strong rea­sons for ac­cept­ing as cost effec­tive in­ter­ven­tions with an in­cre­men­tal cost-effec­tive­ness ra­tio of over £30,000 per QALY.

  4. This ren­ders the pop­u­la­tion es­ti­mate a bit off, but I could­n’t find any sources on the break­down of Sephardic vs Ashke­nazi in the USA in 1970 other than a com­ment that the lat­ter were a “vast ma­jor­ity”. Since the Jew­ish Pop­u­la­tion Study was prob­a­bly an un­der­count in not in­clud­ing all the peo­ple of Jew­ish de­scent, I’m hope­ful those two bi­ases can­cel out.↩︎

  5. The high IQ sam­ple in Barbe 1964 would have been >8% Jew­ish, but the pa­per only re­ports the over­all Jew­ish­ness, with­out spec­i­fy­ing whether it’s 4% vs 12% or some­thing like that.↩︎

  6. Holling­worth & Rust 1937: “The data of the present study were ob­tained early in 1933, the sub­jects be­ing 36 boys and 19 girls, of the av­er­age age of 18 years 6 months. The IQ’s (S-B) of all had been taken in early child­hood (9). The group ranged from 135-190 IQ (S-B), with a me­dian at about 153 IQ (S-B). All but four of these young per­sons were Jew­ish, a fac­tor which must be con­sid­ered as of pos­si­ble con­se­quence (8, 14)…”↩︎

  7. Sub­ot­nik et al 1993, pg3-4:

    The mean IQ of the Hunter sam­ple was 157, or ap­prox­i­mately 3.5 stan­dard de­vi­a­tions above the mean, with a range of 122 to 196 on the L-M form. [S­tan­ford-Bi­net In­tel­li­gence Scale, Form L-M (SBL-M)]

    …Each class at Hunter Col­lege El­e­men­tary School from the years 1948 to 1960 con­tained about 50 stu­dents, yield­ing a to­tal pos­si­ble pop­u­la­tion of 600 grad­u­ates…35% of the to­tal pop­u­la­tion of 1948-1960 HCES stu­dents (n = 210) com­pleted and re­turned study ques­tion­naires

    Re­li­gious Affil­i­a­tion: The Hunter group is ap­prox­i­mately 62% Jew­ish, al­though they de­scribe them­selves as Jews more in terms of eth­nic iden­tity than re­li­gious prac­tice. The group, as a whole, is not re­li­gious.

    Ed­u­ca­tional At­tain­ments: Over 80% of the study par­tic­i­pants held at least a Mas­ter’s de­gree. Fur­ther­more, 40% of the women and 68% of the men held ei­ther a Ph.D, LL.B., J.D., or M.D. de­gree. Oc­cu­pa­tion and In­come: Only two of the HCES women iden­ti­fied them­selves pri­mar­ily as home­mak­ers. 53% were pro­fes­sion­als, work­ing as a teacher at the col­lege or pre-col­lege lev­el, writer (jour­nal­ist, au­thor, ed­i­tor), or psy­chol­o­gist. The same pro­por­tion of HCES men were pro­fes­sion­als, serv­ing as lawyers, med­ical doc­tors, or col­lege teach­ers. The me­dian in­come for men in 1988 was $75,000 (range = $500,000) and for women $40,000 (range = $169,000). In­come lev­els were sig­nifi­cantly differ­ent for men and wom­en, even when matched by pro­fes­sion. For ex­am­ple, the me­dian in­come for male col­lege teach­ers or psy­chol­o­gists was $50,000 and for fe­males, $30,000

  8. In ret­ro­spect, I could have sped this up con­sid­er­ably by us­ing the to sam­ple the tail di­rectly rather than gen­er­at­ing the en­tire sam­ple.↩︎

  9. My first at­tempt at it in JAGS went like this:

    model_string <- '
      model {
        cutoffIQ <- 100 + 3.719016485*15
        mu_asian ~ dnorm(105, 4^-2)
        X_asian ~ dnorm(mu_asian, 15^-2) # T(cutoffIQ,)
        X_frac_asian <- X_asian > cutoffIQ
        P_asian <- 0.07 * (X_frac_asian / length(X_asian))
        Y_asian ~ dbinom(P_asian, total)
        # mu_white ~ dnorm(100, 4^-2)
        # X_white ~ dnorm(mu_white, 15^-2) # T(cutoffIQ,)
        # X_frac_white <- X_white > cutoffIQ
        # P_white <- (1-0.07) * (X_frac_white / length(X_white))
        # Y_white ~ dbinom(P_white, total)
    model <- run.jags(model_string, data = list(Y_asian=Y_asian, Y_white=Y_white, total=total),
                                    monitor=c("mu_asian", "mu_white"),
                        n.chains = getOption("mc.cores"), method="rjparallel")

    But then I re­al­ized that X_frac_asian <- X_asian > cutoffIQ did­n’t do what I thought it did and I needed to some­how draw a large num­ber of sam­ples, just like in the ABC sim­u­la­tion, and com­pare to the num­ber after the trun­ca­tion… or some­thing.↩︎

  10. Strictly speak­ing we could prob­a­bly do a para­met­ric boot­strap by count­ing per decade, and treat­ing each one as a Pois­son es­ti­ma­tion, and gen­er­at­ing ran­dom de­vi­ates, but that runs into is­sues with decades with 0 fig­ures (the MLE might es­ti­mate a rate of 0 which we know is wrong but not what rate might be right, which could eas­ily im­ply av­er­age rates >1 based on sur­round­ing tem­po­ral trend­s).↩︎

  11. I’ve been ex­per­i­ment­ing with Boomerang to re­duce the prob­lem of non-fol­lowups by set­ting ‘ping me if no re­ply within 1 month’ alerts on my sent emails.↩︎

  12. In this con­text, it’s in­ter­est­ing to note Cav­al­li-S­forza’s role in .↩︎

  13. Some rel­e­vant links: