Embryo Selection For Intelligence

A cost-benefit analysis of the marginal cost of IVF-based embryo selection for intelligence and other traits with 2016-2017 state-of-the-art
decision-theory, biology, psychology, statistics, transhumanism, R, power-analysis, survey, IQ, SMPY, order-statistics, genetics, bibliography
2016-01-22–2020-01-18 finished certainty: likely importance: 10

With genetic predictors of a phenotypic trait, it is possible to select embryos during an in vitro fertilization process to increase or decrease that trait. Extending the work of Shulman & Bostrom 2014, I consider the case of human intelligence using SNP-based genetic prediction, finding:

  • a meta-analysis of results indicates that SNPs can explain >33% of variance in current intelligence scores, and >44% with better-quality phenotype testing
  • this sets an upper bound on the effectiveness of SNP-based selection: a gain of 9 IQ points when selecting the top embryo out of 10
  • the best 2016 polygenic score could achieve a gain of ~3 IQ points when selecting out of 10
  • the marginal cost of embryo selection (assuming IVF is already being done) is modest, at $1500 + $200 per embryo, with the sequencing cost projected to drop rapidly
  • a model of the IVF process, incorporating number of extracted eggs, losses to abnormalities & vitrification & failed implantation & miscarriages from 2 real IVF patient populations, estimates feasible gains of 0.39 & 0.68 IQ points
  • embryo selection is currently unprofitable (mean: -$358) in the USA under the lowest estimate of the value of an IQ point, but profitable under the highest (mean: $6230). The main constraint on selection profitability is the polygenic score; under the highest value, the NPV EVPI of a perfect SNP predictor is $24b and the EVSI per education/SNP sample is $71k
  • under the worst-case estimate, selection can be made profitable with a better polygenic score, which would require n > 237,300 using education phenotype data (and much less using fluid intelligence measures)
  • selection can be made more effective by selecting on multiple phenotype traits: considering an example using 7 traits (IQ/height/BMI/diabetes/ADHD/bipolar/schizophrenia), there is a substantially larger gain than from selection on IQ alone; the outperformance of multiple selection remains after adjusting for genetic correlations & polygenic scores and using a broader set of 16 traits.

Overview of Major Approaches

Before going into a detailed cost-benefit analysis of embryo selection, I’ll give a rough overview of the various developing approaches for genetic engineering of complex traits in humans, compare them, and briefly discuss possible timelines and outcomes. (References/analyses/code for particular claims are generally provided in the rest of the text, or in some cases buried in my notes, and omitted here for clarity.)

The past 2 decades have seen a revolution in molecular genetics: the sequencing of the human genome kicked off an exponential reduction in genetic sequencing costs which has dropped the cost of genome sequencing from millions of dollars to $20 (SNP genotyping)–$500 (whole genomes). This has enabled the accumulation of datasets of millions of individuals’ genomes which allow a range of genetic analyses to be conducted, ranging from SNP heritabilities to detection of recent evolution to GWASes of traits to estimation of the genetic overlap of traits.

The simple summary of the results to date is: behavioral genetics was right. Almost all human traits, simple and complex, are caused by a joint combination of environment, stochastic randomness, and genes. These patterns can be studied by methods such as family, twin, adoption, or sibling studies, but ideally are studied directly by reading the genes of hundreds of thousands of unrelated people, which yields estimates of the effects of specific genes and predictions of phenotype values from entire genomes. Across all traits examined, genes cause ~50% of differences between people in the same environment, factors like randomness & measurement-error explain much of the rest, and whatever is left over is the effect of nurture. Evolution is true, and genes are discrete physical patterns encoded in chromosomes which can be read and edited, with simple traits such as many diseases being determined by a handful of genes, yielding complicated but discrete behavior, while complex traits are instead governed by hundreds or thousands of genes whose effects sum together and produce a normal distribution, such as IQ or risk of developing a complicated disease like schizophrenia. This allows direct estimation of individuals’ genetic contribution to a phenotype, as well as that of their children.

These genetic traits contribute to many observed societal patterns, such as the children of the rich also being richer and smarter and healthier, why poorer neighborhoods have sicker people, why relatives of schizophrenics are less intelligent, etc; these traits are substantially heritable, and traits are also interconnected in an intricate web of correlations where one trait causes another and both are caused by the same genetic variants. For example, intelligence-related variants are uniformly inversely correlated with disease-related variants, and positively correlated with desirable traits. These results have been validated by many different approaches, and the existence of widespread large heritabilities, genetic correlations, and valid PGSes is now academic consensus.

Because of this pervasive genetic influence on outcomes, genetic engineering is one of the great open questions in transhumanism: how much is possible, with what, and when?

Suggested interventions can be broken down into a few categories:

  • cloning (copying)
  • selection (variation with ranking)
  • editing (rewriting)
  • synthesis (writing)

Each of these has different potentials, costs, and advantages & disadvantages:

An opinionated comparison of possible interventions, focusing on potential for improvements, power, and cost.
Intervention, Description, Time, Cost, Limits, Advantages, Disadvantages
Cloning. Description: Somatic cells are harvested from a human and their DNA transferred into an embryo, replacing the original DNA. The embryo is implanted. The result is equivalent to an identical twin of the donor, and if the donor is selected for high trait-values, will also have high trait-values but will regress to the mean depending on the heritability of said traits. Time: ? Cost: $100k? Limits: cannot exceed trait-values of donor, limited by best donor availability. Advantages: does not require any knowledge of PGSes or causal variants, is likely doable relatively soon as modest extension of existing mammalian cloning, immediate gains of 3-4SD (maximum possible global donor after regression to mean). Disadvantages: may trigger taboos & is illegal in many jurisdictions, human cloning has been minimally researched, hard to find parents as clone will be genetically related to one parent at most & possibly neither, can’t be used to get rare or new genetic variants, inherently limited to regressed maximum selected donor, does not scale in any way with more inputs.
Simple (Single-Trait) Embryo Selection. Description: A few eggs are extracted from a woman and fertilized; each resulting sibling embryo is biopsied for a few cells which are sequenced. A single polygenic score is used to rank the embryos by predicted future trait-value, and surviving embryos are implanted one by one until a healthy live birth happens or there are no more embryos. By starting with the top-ranked embryo, an average gain is realized. Time: 0 years. Cost: $1k-$5k. Limits: egg count, IVF yield, PGS power. Advantages: offspring fully related to parents, doable & profitable now, doesn’t require knowledge of causal variants, doesn’t risk off-target mutations, inherently safe gains, PGSes steadily improving. Disadvantages: permanently limited to <1SD increases on trait, requires IVF so I am doubtful it could ever exceed ~10% US population usage, fails to benefit from using good genetic correlations to boost overlapping traits & avoid harm from negative correlations (where a good thing increases a bad thing), biopsy-sequencing imposes fixed per-embryo costs, fast diminishing returns to improvements, can only select on relatively common variants currently well-estimated by PGSes & cannot do anything about variants which neither or both parents carry (fixed within the family).
Simple Multiple (Trait) Embryo Selection. Description: as above, but the PGS used for ranking is a weighted sum of multiple (possibly scores or hundreds) of PGSes of individual traits, weighted by utility. Time/Cost/Limits: as above. Advantages: as above, but several times larger gains from selection on multiple traits. Disadvantages: as above, but avoids harms from bad genetic correlations.
Massive Multiple Embryo Selection. Description: A set of eggs is extracted from a woman, or alternately, some somatic cells like skin cells. If immature eggs from an ovary biopsy, they are matured in vitro to eggs; if somatic cells, they are regressed to stem cells, possibly replicated hundreds of times, and then turned into egg-generating-cells and finally eggs, yielding hundreds or thousands of eggs (all still identical to her own eggs). Either way, the resulting large number of eggs are then fertilized (up to a few hundred will likely be economically optimal), and then selection & implantation proceeds as in simple multiple embryo selection. Time: >5 years. Cost: $5k-$100k. Limits: sequencing+biopsy fixed costs, PGS power. Advantages: offspring fully related to parents, lifts the main binding limitation on simple multiple embryo selection, allowing potentially 1-5SD gains depending on budget, highly likely to be at least theoretically possible in the next decade. Disadvantages: cost of biopsy+sequencing scales linearly with number of embryos while encountering even further diminishing returns than experienced in simple multiple embryo selection, may be difficult to prove new eggs are as long-term healthy.
Gamete selection/Optimal Chromosome Selection (OCS). Description: Donor sperm and eggs are (somehow) sequenced; the ones with the highest-ranked chromosomes are selected to fertilize each other; this can then be combined with simple or massive embryo selection. It may be possible to fuse or split chromosomes for more variance & thus selection gains. Time: ? years. Cost: $1k?-$5k? Limits: ability to non-destructively sequence or infer PGSes of gametes rather than embryos, PGS power. Advantages: immediate large boost of ~2SD possible by selecting earlier in the process before variance has been canceled out, does not require any new technology other than the gamete sequencing part. Disadvantages: how do you sequence sperm/eggs non-destructively?
Iterated Embryo Selection (IES). (Also called “whizzogenetics”, “in vitro eugenics”, or “in vitro breeding”/IVB.) Description: A large set of cells, perhaps from a diverse set of donors, is regressed to stem cells, turned into both sperm/egg cells, fertilizing each other, and then the top-ranked embryos are selected, yielding a moderate gain; those embryos are not implanted but regressed back to stem cells, and the cycle repeats. Each “generation” the increases accumulate; after perhaps a dozen generations, the trait-values have increased many SDs, and the final embryos are then implanted. Time: >10 years. Cost: $1m?-$100m? Limits: full gametogenesis control, total budget, PGS power. Advantages: can attain maximum total possible gains, lessened IVF requirement (implantation but not the egg extraction), current PGSes adequate. Disadvantages: full & reliable control of the gamete⟺stem-cell⟺embryo pipeline is difficult & requires fundamental biology breakthroughs, running multiple generations may be extremely expensive and gains limited in practice, still restricted to common variants & variants present in original donors, unclear effects of going many SDs up in trait-values, so expensive that embryos may have to be unrelated to future parents as IES cannot be done custom for every pair of prospective parents, may not be feasible for decades.
Editing (eg CRISPR). Description: A set of embryos are injected with gene editing agents (eg CRISPR delivered via viruses or micro-pellets), which directly modify DNA base-pairs in some desired fashion. The embryos are then implanted. Similar approaches might be to instead try to edit the mother’s ovaries or the father’s testicles using a viral agent. Time: 0 years. Cost: <$10k. Limits: causal variant problem, number of safe edits, edit error rate. Advantages: offspring fully related to parents, gains independent of embryo number (assuming no deep sequencing to check for mutations), potentially arbitrarily cheap, potentially unbounded gains, doesn’t require biopsy-sequencing, unknown upper bound on how many possible total edits, can add rare or unique genes. Disadvantages: each edit adds little, edits inherently risky and may damage cells through off-target mutations or the delivery mechanism itself, requires identification of the generally-unknown causal genes rather than predictive ones from PGSes, currently doesn’t scale to more than a few (unique) edits, most approaches would require IVF, parental editing inherently halves the possible gain.
Genome Synthesis. Description: Chemical reactions are used to build up a strand of custom DNA literally base-pair by base-pair, which then becomes a chromosome. This process can be repeated for each chromosome necessary for a human cell. Once one or more of the chromosomes are synthesized, they can replace the original chromosomes in a human cell. The synthesized DNA can be anything, so it can be based on a polygenic score in which every SNP or genetic variant is set to the estimated best version. Time: >10 years (single chromosomes) to >15 years (whole genome?). Cost: $30m-$1b. Limits: cost per base-pair, overall reliability of synthesis. Advantages: achieves maximum total possible gains across all possible traits, is not limited to common variants & can implement any desired change, cost scales with genome replacement percentage (with an upper bound at replacing the whole genome), cost per base-pair falling exponentially for decades and HGP-Write may accelerate the cost decrease, many possible approaches for genome synthesis & countless valuable research or commercial applications driving development, current PGSes adequate. Disadvantages: full genome synthesis would cost ~$1b, error rate in synthesized genomes may be unacceptably high, embryos may be unrelated to parents due to cost like IES, likely not feasible for decades.

Overall I would summarize the state of the field as:

  • cloning: is unlikely to be used at any scale for the foreseeable future despite its power, and so can be ignored (except inasmuch as it might be useful in another technology like IES or genome synthesis)

  • simple single-trait embryo selection: is strictly inferior to simple multiple embryo selection; there is no reason to prefer it other than to save a tiny bit of statistical effort, and much reason to prefer multiple selection (larger and safer gains), so it need not be discussed except as a strawman.

  • simple multiple-trait embryo selection: available & profitable now, but too limited in possible gains, requires a far too onerous process (IVF) for more than a small percentage of the population to use it, and is more or less trivial. As median embryo count in IVF hovers around 5, the total gain from selection is small, and much of the gain is wasted by losses in the IVF process (the best embryo doesn’t survive storage, the second-best fails to implant, and so on). One of the key problems is that polygenic scores are the sum of many individual small genes’ effects and form a normal distribution, which is tightly clustered around a mean. A polygenic score is attempting to predict the net effect of thousands of genes which almost all cancel out, so even accurate identification of many relevant genes still yields an apparently unimpressive predictive power. The fact that traits are normally distributed also creates difficulties for selection: the further into the tail one wants to go, the larger the sample required to reach the next step. To put it another way, if you have 10 samples, it’s easy (a 1 in 10 probability) for your next random sample to be the largest sample yet, but if you have 100 samples, the probability of an improvement is the much harder 1 in 100, and if you have 1000, it’s only 1 in 1000; and worse, even if you luck out and there’s an improvement, the improvement is ever tinier. After taking into account existing PGSes, previously reported IVF process losses, costs, and so on, the implication is that it is moderately profitable and can increase traits perhaps 0.1SD, rising somewhat over the next decade as PGSes continue to improve, but never exceeding, say, 0.5SD.
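
    The diminishing-returns point can be made concrete by estimating the expected maximum of n standard normal draws (the order statistic driving all selection gains). A minimal Monte Carlo sketch, illustrative only and not from the original analysis:

```python
import random

random.seed(0)

def expected_max(n, trials=20_000):
    """Monte Carlo estimate of the expected maximum of n standard normal draws."""
    return sum(max(random.gauss(0, 1) for _ in range(n))
               for _ in range(trials)) / trials

# Each doubling of the sample buys less: ~0.56 SD going from 1 to 2 draws,
# but only ~0.33 SD going from 10 to 20.
for n in (1, 2, 5, 10, 20):
    print(n, round(expected_max(n), 2))
```

    Multiplying these order-statistic means by the (much smaller) SD of the between-sibling polygenic score gives the expected selection gain.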

    Embryo selection could have substantial societal impacts in the long run, especially over multiple generations, but this would require both IVF to become more common and no other technology to supersede it (as they certainly shall). When IVF began, many pundits proclaimed it would “forever change what it means to be human” and other similar fatuities; it did no such thing, and has since productively helped countless parents & children, and I fully expect embryo selection to go the same way. I would consider embryo selection to have been considerably overhyped (by those hyperventilating about “Gattaca being around the corner”), and, ironically, also underhyped (by those making arguments like “trait X is so polygenic, therefore embryo selection can’t work”, which is statistically illiterate, or “traits are complex interactions between genes and environment most of which we will never understand”, which is obfuscating irrelevancy and FUD).

    Embryo selection does have the advantage of being the easiest to analyze & discuss, and the most immediately relevant.

  • massive multiple embryo selection: the single most binding constraint on simple embryo selection (single or multiple trait) is the number of embryos to work with, which, since paternal sperm is effectively infinite, means the number of eggs.

    For selection, the key question is what is the most extreme or maximum item in the sample; a small sample will not spread wide, but a large sample will have a bigger extreme. The more lottery tickets you buy, the better the chance of getting 1 ticket which wins a lot. Whereas the PGS, to people’s general surprise, doesn’t make all that much of a difference after a little while. If you have 3 embryos, even going from a noisy to a perfect predictor doesn’t make much of a difference, because no matter how flawless your prediction, embryo #1 (whichever it is) out of 3 just isn’t going to be all that much better than average; if you have 300 embryos, then a perfect predictor becomes more useful.
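
    This interaction between embryo count and predictor quality can be checked with a toy simulation: true trait values are standard normal, and the predictor is a noisy version correlating r = √(variance explained) with the truth. (All numbers are illustrative.)

```python
import math
import random

random.seed(1)

def selection_gain(n_embryos, var_explained, trials=5_000):
    """Mean true value (in SDs) of the embryo ranked top by a noisy predictor."""
    r = math.sqrt(var_explained)
    noise = math.sqrt(1 - var_explained)
    total = 0.0
    for _ in range(trials):
        true_vals = [random.gauss(0, 1) for _ in range(n_embryos)]
        scores = [r * t + noise * random.gauss(0, 1) for t in true_vals]
        total += true_vals[scores.index(max(scores))]
    return total / trials

# With 3 embryos, even a perfect predictor gains <1 SD; with 100 embryos,
# predictor quality matters far more.
for n in (3, 100):
    print(n, round(selection_gain(n, 0.1), 2), round(selection_gain(n, 1.0), 2))
```

    With 3 embryos the gap between a 10%-variance predictor and a perfect one is roughly half an SD; with 100 embryos it is several times larger.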

    There is no foreseeable way to safely extract more eggs from a donor: standard IVF cycle approaches appear to have largely reached their limit, and stimulating the release of more eggs in a harvesting cycle is dangerous. A different approach is required, and it seems the only option may be to make more eggs. One possibility is to not stimulate release of a few eggs and collect them, but instead biopsy samples of proto-eggs and then hurry them in vitro to maturity as full eggs, and get many eggs that way; biopsies might be compelling even without selection at all: the painful, protracted, failure-prone, and expensive egg harvesting process to get ~5 embryos, which then might yield a failed cycle anyway, could be replaced by a single quick biopsy under anesthesia yielding hundreds of embryos, effectively ensuring a successful cycle. Less invasively, laboratory results in inducing regression to stem cell states and then oogenesis have made steady progress over the past decades in primarily rat/mice but also human cells, and researchers have begun to speak of the possibility in another 5 or 10 years of enabling infertile or homosexual couples to conceive fully genetically-related children through somatic ↔︎ gametic cell conversions. This would also likely allow generating scores or hundreds of embryos by turning easier-to-acquire cells like skin cells or extracted eggs into stem cells which can replicate and then be converted into egg cells & fertilized. While it is still fighting the normal distribution with brute force, having 500 embryos works a lot better than having just 5 embryos to choose from.
The downside is that one still needs to biopsy and sequence each embryo in order to compute their particular PGS; since one is still fighting the thin tail, at some point the cost of creating & testing another embryo exceeds the expected gain (probably somewhere in the hundreds of embryos).
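
    Where that break-even point falls can be sketched with Blom’s approximation to the expected maximum of n standard normals. The $200 per-embryo sequencing cost is from the cost analysis above; the effective sibling-PGS SD in IQ points and the dollar values of an IQ point are illustrative assumptions, not the text’s fitted estimates:

```python
from statistics import NormalDist

def expected_max(n):
    """Blom's approximation to E[max] of n standard normal draws."""
    return NormalDist().inv_cdf((n - 0.375) / (n + 0.25))

def optimal_embryo_count(value_per_iq_point, cost_per_embryo=200,
                         effective_sd=6.1, max_n=1000):
    """Largest n at which the marginal embryo still pays for itself.
    effective_sd: assumed IQ-point SD of the between-sibling PGS,
    roughly 15 * sqrt(0.33 / 2) under the meta-analytic SNP heritability."""
    for n in range(2, max_n):
        marginal_iq = (expected_max(n) - expected_max(n - 1)) * effective_sd
        if marginal_iq * value_per_iq_point < cost_per_embryo:
            return n - 1
    return max_n

# Hypothetical low vs high valuations of an IQ point:
print(optimal_embryo_count(3_000), optimal_embryo_count(16_000))
```

    Under the low valuation the optimum is a few dozen embryos; under the high one it moves into the hundreds, consistent with the rough guess above.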

    Unlike simple embryo selection, this could yield immediately important gains like +2SD. IVF yield ceases to be much of a problem (the second/third/fourth-best embryos are now almost exactly as good as the first-best was, and they probably won’t all fail), and enough brute force has been applied to reach potentially 1-2SD in practice. If taken up by only the current IVF users and applied to intelligence alone, it would immediately lead to the next generation’s elite positions being dominated by their kids; if taken up by more and done properly on multiple traits, the advantage would be greater.

  • Gamete selection/Optimal Chromosome Selection: only a theoretical possibility at the moment, as there is no direct way to sequence individual sperm/eggs or manipulate chromosome choice. GS/OCS are interesting more for the points they make about variance & order statistics & the CLT: it results in a much larger gain than one would expect simply by switching perspectives and focusing on how to select earlier in the ‘pipeline’, so to speak, where variance is greater because sets of genes haven’t yet been combined in one package & canceled each other out. If someone did something clever to allow inference on gametes’ PGSes or select individual chromosomes, then it could yield an immediate discontinuously large boost in trait-value of +2SD in conjunction with whatever embryo selection is available at that point.
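
    A toy simulation makes the variance point concrete. Spread a unit-variance polygenic score evenly over the 44 parental copies of the 22 autosomes (ignoring linkage within chromosomes and every other real-world complication), then compare best-of-10 embryo selection against simply taking the better parental copy of every chromosome:

```python
import random

random.seed(2)

CHROM = 22                 # autosome pairs (toy model)
COPY_SD = (1 / 44) ** 0.5  # 44 parental copies summing to unit PGS variance

def make_parents():
    """Each parent carries 2 score contributions per chromosome."""
    return [[[random.gauss(0, COPY_SD) for _ in range(2)]
             for _ in range(CHROM)] for _ in range(2)]

def random_child(par):
    """A child inherits one randomly-chosen copy per chromosome from each parent."""
    return sum(random.choice(par[p][c]) for p in range(2) for c in range(CHROM))

def best_embryo(par, n=10):
    return max(random_child(par) for _ in range(n))

def chromosome_selected(par):
    """Optimal chromosome selection: the better copy of every chromosome."""
    return sum(max(par[p][c]) for p in range(2) for c in range(CHROM))

trials = 3_000
emb = sum(best_embryo(make_parents()) for _ in range(trials)) / trials
ocs = sum(chromosome_selected(make_parents()) for _ in range(trials)) / trials
print(round(emb, 2), round(ocs, 2))  # chromosome selection wins severalfold
```

    The toy model is not calibrated to real PGSes, but it shows why selecting before gametes combine (and cancel out) is so much more powerful than selecting among finished embryos.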

  • Iterated Embryo Selection: If IES were to happen, it would allow for almost arbitrarily large increases in trait-values across the board in a short period of time, perhaps a year. While IES has major disadvantages (extremely costly to produce the first optimized embryos, depending on how many generations of selection are involved; selection has some inherent speed limits trading off between accidentally losing possibly useful variants & getting as large a gain each generation as possible; embryos are unlikely to resemble the original donors at all without an additional generation ‘backcrossed’ with the original donor cells, undoing most of the work), the extreme increases may justify use of IES and create demand from parents. This could then start a tsunami. Depending on how far IES is pushed, the first release of IES-optimized embryos may become one of the most important events in human history.

    IES is still distant and depends on a large number of wet lab breakthroughs and finetuned human-cell protocols. Coaxing scores or hundreds of cells through all the stages of development and fertilization, for multiple generations, is no easy task. When will IES be possible? The relevant literature is highly technical and only an expert can make sense of it, and one should have hands-on expertise to even try to make forecasts. There are no clear cost curves or laws governing progress in stem cell/gamete research which can be used to extrapolate. Perhaps no one will ever put all the money and consistent research effort into developing it into something which could be used clinically. Just because something is theoretically possible and has lots of lab prototypes doesn’t mean that the transition will happen. (Look at human cloning; everyone assumed it’d happen long ago, but as far as anyone knows, it never has.) On the other hand, perhaps someone will.

    IES is one of the scariest possibilities on the list, and the hardest to evaluate; it seems clear, at least, that it will certainly not happen in the next decade, but after that…? IES has been badly under-discussed to date.

  • Gene Editing: the development of CRISPR has led to more hype than embryo selection itself. However, the current family of CRISPR techniques & previous alternatives & future improvements can be largely dismissed on statistical grounds alone. Even if we hypothesized some super-CRISPR which could make a handful of arbitrary SNP edits with zero risk of mutation or other forms of harm, it would not be especially useful and would struggle to be competitive with embryo selection, let alone IES/OCS/genome synthesis. The unfixable root cause is the polygenicity of the most important polygenic traits (which is a blessing for selection or synthesis approaches, as it creates a vast reservoir of potential improvements, but a curse for editing), and to a lesser extent, the asymmetry of effect sizes (harmful variants are more harmful than beneficial ones are beneficial).

    The benefit of gene editing a SNP is the number of edits times the SNP effect of each edit times the probability the effect is causal. Probability it’s causal? Can’t we assume that the top hits from large GWASes these days have a posterior probability ~100% of having a non-zero effect? No. This is because of a technical detail which is largely irrelevant to selection processes but is vitally important to editing: the hits identified in a PGS are not necessarily the exact causal base-pair(s). Often they are, but more often they are not. They are instead proxies for a neighboring causal variant which happens to usually be inherited with it, as genomes are inherited in a chunky fashion, in big blocks, and do not split & recombine at every single base-pair. This is no problem for selection—it predicts great and is cheaper & easier to find a correlated SNP than the true causal variant. But it is fatal to editing: if you edit a proxy, it’ll do nothing (or maybe it’ll do the opposite).

    How fatal is this? From attempts at “fine-mapping” (using large datasets to distinguish which of a few SNPs is the real culprit), or from seeing how PGSes’ performance shrinks when going from the original GWAS population to a deeply genetically different population like Subsaharan Africans who have totally different proxy patterns (if there is non-zero prediction power, it must be thanks to the causal hits, which act the same way in both populations), we can estimate that the causal probability may be as low as 10%. Combine this with the few edits safely permitted, perhaps 5, and the small effect size of each genetic variant, like 0.2 IQ points for intelligence, and the effect becomes dismal. A tenth of a point? Not much. Even if we had all causal variants, the small average effect size, combined with few possible edits, is no good. Fix the causal variant problem, and it’s still only 5 edits at 0.2 points each. Nor is IQ at all unique in this respect—it’s somewhat unusually polygenic, but a cleaner trait like height still implies small gains such as half an inch.
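
    The arithmetic is small enough to sketch in a line (the 5-edit, 0.2-point, and 10%-causal figures are the illustrative numbers used in the text):

```python
def editing_gain(n_edits, points_per_edit, p_causal):
    """Expected trait gain from editing GWAS top hits, simple additive model."""
    return n_edits * points_per_edit * p_causal

print(editing_gain(5, 0.2, 0.1))  # 0.1 IQ points: the dismal current case
print(editing_gain(5, 0.2, 1.0))  # 1.0 point even if every hit were causal
```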

    What about rare variants? The problem with rare variants is that they are rare, and also not of especially large beneficial effect. Being rare makes them hard to find in the first place, and the lack of benefit (as compared to a baseline human without said variant) means that they are not useful for editing. We might find many variants which damage a trait by a large amount, say, increasing abdominal fat mass by a kilogram or lowering IQ by a dozen points, but of course, we don’t want to edit those in! (They also aren’t that important for any embryo selection method, because they are rare, not usually present, and thus there is usually no selection to be done.) We could hope to find some variant which increases IQ by several points—but none have been found, if they were at all common they would’ve been found a long time ago, and indirect methods like DeFries-Fulker regression suggest that there are few or no such rare variants. Nor is measuring other traits a panacea: if there were some variant which increased IQ by a medium amount by increasing a specific trait like working memory which has not been studied in large GWASes or DeFries-Fulker regressions to date, then such a WM-boosting variant should’ve been detected through its mediated effect, and to the extent that it has no effect on hard endpoints like IQ or education or income, it must be questioned how useful it is in the first place. The situation may be somewhat better with other traits (there’s still hope for finding large beneficial effects1, and in the other direction, disease traits tend to have more rare variants of larger effects which might be worth fixing in relatively many individual cases, like BRCA or APOE) but I just don’t see any realistic way to reach gains like +1SD on anything with gene editing methods in the foreseeable future using existing variants.

    What about non-existing variants, ie brand-new variants based on extrapolation from human genetic history or animal models? These hypothetical mutations/edits could have large effects even if we have failed to find any in the wild. But the track record of animal models in predicting complex human systems such as the brain is not good at all, and such large novel mutations would have zero safety record, and how would you prove any were safe without dozens of live births and long-term followup—which would never be permitted? Given the poor prior probability of both safety & efficacy, such mutations would simply remain untried indefinitely.

    It is difficult to see how to remedy this in any useful way. The causal probability will creep up as datasets expand & cross-racial GWASes become more common, but even increasing the gain by a factor of 10 that way doesn’t resolve the issue. The limit is still the edit count: the unique edit limit of ~5 is not enough to work with. Can this be combined usefully with IES to do edits per generation? Likely, but you still need IES first! Can the edit limit be lifted? …Maybe. Genetic editing follows no predictable improvement curve, or learning curve, and doesn’t benefit directly from any exponentials. It is hard to forecast what improvements may happen. 2019 saw a breakthrough from a repeated-edit SOTA of ~60 edits in a cell to ~2,600, which no one forecast, but it’s unclear when if ever that would transfer to useful per-SNP edits; nevertheless, the possibility of mass editing cannot be ruled out.

    So, CRISPR-style edit­ing may be rev­o­lu­tion­ary in rare genetic dis­eases, agri­cul­ture, & research, but as far as we are con­cerned, it has been grossly over­hyped: there is a chance it will live up to the most extreme claims, but not a large one.

  • Genome syn­the­sis: the sim­ple answer to gene edit­ing’s fail­ure is to observe that if you have to make pos­si­bly thou­sands of edits to fix up a genome to the level you want it, why not go out and make your own genome? (with black­jack and hook­er­s…) That is the auda­cious pro­posal of genome syn­the­sis. It sounds crazy, since genome syn­the­sis has his­tor­i­cally been mostly used to make short seg­ments for research, or per­haps the odd pan­demic virus, but unno­ticed by most, the cost per base-pair has been crash­ing for decades, allow­ing the cre­ation of entire yeast genomes and lead­ing to the recent HGP-Write pro­posal from George Church & oth­ers to invest in genome syn­the­sis research with the aim of invent­ing meth­ods which can cre­ate cus­tom genomes at rea­son­able prices. Such an abil­ity would be stag­ger­ingly use­ful: cus­tom organ­isms designed to pro­duce arbi­trary sub­stances, genomes with the fun­da­men­tal encod­ing all swapped around ren­der­ing them immune to all viruses ever, organ­isms with a sin­gle giant genome or with all muta­tions replaced with the modal gene, among other crazy things. One could also, inci­den­tal­ly, use cheap genome syn­the­sis for bulk stor­age of data in a dense, durable, room-tem­per­a­ture for­mat (ex­plain­ing both Microsoft & IARPA’s inter­est in fund­ing genome syn­the­sis research).

    Of course, if you can syn­the­size an entire genome—a sin­gle chro­mo­some would be almost as good to some exten­t—you can take a base­line genome and make as many ‘edits’ as you please. Set all the top vari­ants for all the rel­e­vant traits to the esti­mated best set­ting. The pos­si­ble gains are greater than IES (since you are not lim­ited by the ini­tial gene pool of start­ing vari­ants nor by the selec­tion process itself), and one can increase traits by hun­dreds of SDs (what­ever that mean­s).

    Figure: Genome Sequencing/Synthesis Cost Curve, 1980–2015
    Figure: Genome Sequencing/Synthesis Cost Curve, 1990–2017 (updated)

    Genome synthesis, unlike IES, has historically proceeded on a smooth cost-curve, has many possible implementations, and has many research groups & startups involved due to its commercial applications. A large-scale “HGP-Write” (appendix) has been proposed to scale genome synthesis up to yeast-sized organisms and eventually human-sized genomes. The cost curve suggests that around 2035, whole human genomes reach well-resourced research-project ranges of $10-30m; some individuals in genome synthesis tell me they are optimistic that new methods can greatly accelerate the cost-curve. (Unlike IES, genome synthesis is not committed to a particular workflow, but can use any method which yields, in the end, the desired genome, representing a major advantage.) Genome synthesis has many challenges before one could realistically implant an embryo, such as ensuring all the relevant structural features like methylation are correct (which may not have been necessary for earlier, more primitive/robust organisms like yeast), and so on, but whatever the challenges for genome synthesis, the ones for IES appear greater. It is entirely possible that IES will develop too slowly and will be obsoleted by genome synthesis in 10-20 years. The consequences of genome synthesis would be, if anything, larger than IES, because the synthesis technology will be distributed in bulk, will probably continue decreasing in cost due to the commercial applications regardless of human use, and doesn't require rare specialized wet-lab expertise: like genome sequencing, it will almost certainly become highly automated & ‘push-button’.

    If IES has been under-discussed and is underrated, genome synthesis has not been discussed at all and is vastly more underrated.

To sum up the time­line: CRISPR & cloning are already avail­able but will remain unim­por­tant indefi­nitely for var­i­ous fun­da­men­tal rea­sons; mul­ti­ple embryo selec­tion is use­ful now but will always be minor; mas­sive mul­ti­ple embryo selec­tion is some ways off but increas­ingly inevitable and the gains are large enough on both indi­vid­ual & soci­etal lev­els to result in a shock; IES will come some­time after mas­sive mul­ti­ple embryo selec­tion but it’s impos­si­ble to say when, although the con­se­quences are poten­tially glob­al; genome syn­the­sis is a sim­i­lar level of seri­ous­ness, but is much more pre­dictable and can be looked for, very loose­ly, 2030-2040 (and pos­si­bly soon­er).

FAQ: Frequently Asked Questions

Read­ers already famil­iar with the idea of embryo selec­tion may have some com­mon mis­con­cep­tions which would be good to address up front:

  1. IVF Costs: IVF is expen­sive, some­what dan­ger­ous, and may have worse health out­comes than nat­ural child­birth

    I agree, but we can con­sider the case where these issues are irrel­e­vant. It is unclear what the long-run effects of IVF on chil­dren may be, other than the harm prob­a­bly isn’t too great; the lit­er­a­ture on IVF sug­gests that the harms are prob­a­bly very small and smaller than, for exam­ple, pater­nal age effects, but it’s hard to be sure given that IVF usage is hardly exoge­nous and good com­par­i­son groups for even just cor­re­la­tional analy­sis are hard to come by. (Nat­u­ral-born chil­dren are clearly not com­pa­ra­ble, but nei­ther are nat­u­ral-born sib­lings of IVF chil­dren—why was their mother able to have one child nat­u­rally but needed IVF for the nex­t?) I would not rec­om­mend any­one do IVF solely to ben­e­fit from embryo selec­tion (as opposed to doing PGD to avoid pass­ing a hor­ri­ble genetic dis­ease like Hunt­ing­ton’s, where it is impos­si­ble for the hypo­thet­i­cal harms of IVF to out­weigh the very real harm of that genetic dis­ease). Here I con­sider the case where par­ents are already doing IVF, for what­ever rea­son, and so the poten­tial harms are a “sunk cost”: they will hap­pen regard­less of the choice to do embryo selec­tion, and can be ignored. This restricts any results to that small sub­set (~1% of par­ents in the USA as of 2016), of course, but that sub­set is the most rel­e­vant one at pre­sent, is going to grow over time, and could still have impor­tant soci­etal effects.

    An interesting question would be: at what point does embryo selection become so compelling that would-be parents with a family history of disease (such as schizophrenia) would want to do it? (Because of the nonlinear nature of liability-threshold polygenic traits and relatively rare diseases like schizophrenia, someone with a family history benefits far more than someone with average risk; see the truncation-selection/multiple-trait selection sections on why this implies that selection against diseases is not as useful as it seems.) What about would-be parents with no particular history? How good does embryo selection need to be for would-be parents who could conceive naturally to be willing to undergo the cost (~$10k even at the cheapest fertility clinics) and health risks (for both mother & child) to benefit from embryo selection? I don't know, but I suspect “simple embryo selection” is too weak and it will require “massive embryo selection” (see the overview for definitions & comparisons).
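    The liability-threshold arithmetic behind this can be sketched numerically (a toy example, not the model used later: it assumes a 1%-prevalence disease, an illustrative high-risk family whose mean genetic liability is +1.5 SD, and a made-up -0.3 SD liability shift from selection):

    ```python
    from statistics import NormalDist

    nd = NormalDist()
    threshold = nd.inv_cdf(0.99)  # liability cutoff for a 1%-prevalence disease
    shift = -0.3  # hypothetical liability reduction from selecting against the disease

    def risk(mean_liability):
        """P(disease) when liability ~ N(mean_liability, 1) against a fixed threshold."""
        return 1 - nd.cdf(threshold - mean_liability)

    for family, mu in [("average-risk", 0.0), ("high-risk", 1.5)]:
        before, after = risk(mu), risk(mu + shift)
        print(f"{family}: {before:.1%} -> {after:.1%} (absolute drop {before - after:.1%})")
    ```

    Because the tail probability is so nonlinear, the same liability shift cuts the high-risk family's absolute risk by an order of magnitude more than the average family's, which is why family history matters so much for the value of selecting against disease.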

  2. PGSes Don’t Work: GWASes merely pro­duce false pos­i­tives and can’t do any­thing use­ful for embryo selec­tion because they are false positives/population structure/publication bias/etc…

    Some readers overgeneralize the debacle of the candidate-gene literature, which is almost 100% false-positive garbage, to GWASes; but GWASes were designed in response to the failure of candidate-genes, with much more stringent thresholds & large datasets & more population-structure correction, and have performed well as datasets reached the necessary sizes. Their PGSes predict increasingly large amounts of variance out-of-sample, the PGSes have high correlations between cohorts/countries/times/measurement methods, and they work within-family between siblings, who by definition have identical ancestries/family backgrounds/SES/etc but randomized inheritance from their parents. For a more detailed discussion, see the section “Why Trust GWASes?”. (While GWASes are indeed highly flawed, those flaws typically work in the direction of inefficiency, reducing their predictive power, not inflating it.)

  3. The Pre­dic­tion Is Non­causal: GWASes may be pre­dic­tive but this is irrel­e­vant because the SNPs in a PGS are merely non-causal vari­ants which proxy for causal vari­ants

    Back­ground: in a GWAS, the mea­sured SNPs may cause the out­come or they may merely be located on a genome nearby a genetic vari­ant which has the causal effect; because genomes are inher­ited in a ‘chunky’ fash­ion, a mea­sured SNP may almost always be found along­side the causal genetic vari­ant within a par­tic­u­lar pop­u­la­tion. (Over a long enough time­frame, as organ­isms repro­duce, that part of the genome will be bro­ken up, but this may take cen­turies or mil­len­ni­a.) Such a SNP is in “link­age dis­e­qui­lib­rium” or just LD. Such a sce­nario is quite com­mon, and may in fact be the case for the over­whelm­ing major­ity of SNPs in human GWASes. This is both a bless­ing and a curse for GWASes: it means that easy cheap­ly-mea­sured SNPs can probe hard­er-to-find genetic vari­ants, but it also means that the SNPs are not causal them­selves. So for exam­ple, if one took a list of SNPs from a GWAS, and used CRISPR to edit them, most of the edits would do noth­ing. This is a seri­ous con­cern for genetic engi­neer­ing approach­es—just because you have a suc­cess­ful GWAS does­n’t mean you know what to edit!

    But is this a prob­lem for embryo selec­tion? No. Because you are not engaged in any edit­ing or causal manip­u­la­tion. You are pas­sively observ­ing and pre­dict­ing what is the best embryo in a sam­ple. This does not dis­turb the LD pat­terns or break any cor­re­la­tions, and the pre­dic­tions remain valid. Selec­tion does­n’t care what the causal vari­ants are, it cares only that, what­ever they are or wher­ever they are on the genome, the cho­sen embryo has more of them than the not-cho­sen embryos. Any proxy will do, as long as it pre­dicts well. In the long run, changes in LD will grad­u­ally reduce the PGS’s pre­dic­tive power as the SNPs become better/worse prox­ies, but this is unim­por­tant since there will be many GWASes in between now and then, and one would be upgrad­ing PGSes for other rea­sons (like their steadily improv­ing pre­dic­tive power regard­less of LD pat­tern­s).

  4. PGSes Pre­dict Too Lit­tle: Embryo selec­tion can’t be use­ful with PGSes pre­dict­ing only X% [where X% > state of the art] of indi­vid­ual vari­ance

    The mistake here is confusing a statistical measure of error with the goal. Any default summary statistic like R2 or RMSE is merely a crutch with tenuous connections to optimal decisions. In embryo selection, the goal is to choose better-than-average embryos to implant, rather than implanting random embryos, to get a gain which pays for the costs involved. A PGS only needs to be accurate enough to select a better embryo out of a (typically small) batch. It doesn't need to be able to predict future IQ, say, within a point. Estimating the precise future trait value of an embryo may be quite difficult, but it's much easier to predict which of two embryos will have the higher trait value. (It's the difference between predicting the winner of a soccer game and predicting the exact final score; the latter does let one do the former, but the former is what one needs, and it is much easier.) Once your PGS is good enough to pick the best or near-best embryo, even a far better PGS makes little difference; after all, one can't do any better than picking the best embryo out of a batch. And due to diminishing returns/tail effects, the larger the batch, the smaller the difference between the best and the 4th-best etc, reducing the regret. (In a batch of 2, there's not too much difference between a poor and a perfect predictor; and in a batch of 1, there's none.)

    To decide whether a PGS of X% is ade­quate can­not be done in a vac­u­um; the nec­es­sary per­for­mance will depend crit­i­cally on the value of a trait, the cost of embryo selec­tion, the losses in the IVF pipeline, and most impor­tantly of all, the num­ber of embryos in each batch. (The final gain depends the most on the embryo coun­t—a fact lost on most peo­ple dis­cussing this top­ic.) As embryo selec­tion is cheap at the mar­gin, and rank­ing is eas­ier than regres­sion, this can be done with sur­pris­ingly poor PGSes, and the bar of profitabil­ity is easy to meet, and for embryo selec­tion, has been met for some years now (see the rest of this page for an analy­sis of the spe­cific case of IQ).

    • The genome-wide sta­tis­ti­cal­ly-sig­nifi­cant hits explain <X% of indi­vid­ual vari­ance:

      Statistical-significance thresholds are essentially arbitrary. There is no need to fetishize them: they do not correspond to any posterior probability of a hit being “real”; they introduce many serious difficulties of interpretation due to power (if one GWAS has a hit on a SNP with an estimated effect size of X, and a second GWAS also estimates it at X but, due to a slightly higher standard error, it is no longer “statistically-significant”, what does that mean, exactly?); and even if they did, the number of false positives has little relationship to the predictive power, much less the selection gain of a PGS, much less the final profit of embryo selection. The relevant question is: what are the best predictions which can be made? For human complex traits, the most accurate predictions typically use a PGS based on most of or all measured variants. Anything less is less.
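    The point that ranking requires much less accuracy than point prediction can be checked with a small simulation (a sketch under simplified assumptions: each embryo's true trait value and its PGS are jointly normal, with the correlation r being the square root of the variance explained):

    ```python
    import math, random

    rng = random.Random(0)

    def gain_from_selection(r, n_embryos, sims=50_000):
        """Average true score (in SDs) of the embryo ranked highest by a PGS
        correlating r with the truth; r=1 recovers the full order-statistic gain."""
        total = 0.0
        for _ in range(sims):
            best_pgs = best_true = None
            for _ in range(n_embryos):
                true = rng.gauss(0, 1)
                pgs = r * true + math.sqrt(1 - r * r) * rng.gauss(0, 1)
                if best_pgs is None or pgs > best_pgs:
                    best_pgs, best_true = pgs, true
            total += best_true
        return total / sims

    # Expected gain scales with r, not r^2: a PGS explaining just 10% of variance
    # (r ~ 0.32) already captures ~32% of the maximum possible gain.
    print(gain_from_selection(1.0, 10), gain_from_selection(math.sqrt(0.10), 10))
    ```

    The linear scaling follows because the expected true value of the top-ranked embryo is r times the expected maximum of the PGS draws.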

  5. Unin­tended Con­se­quences: Selec­tion on traits, espe­cially intel­li­gence, will back­fire hor­ri­bly

    It is hypothetically possible for selection on one trait which happens to be genetically inversely correlated with another important trait to backfire: increasing the first trait while doing much more damage by decreasing the second. This occurs occasionally in long-term or intense breeding programs, and has been demonstrated by very carefully-designed experiments such as the famous chicken-crate experiment.

    However, for humans, such genetic correlations are highly unlikely a priori, as we can simply observe broad patterns like the global correlations of SES/wealth/intelligence/health with all desirable outcomes (“Cheverud's conjecture”); countless genetic correlations have already been calculated by various methods and are now routinely reported in GWASes, and invariably diseases positively correlate with diseases and good things correlate with other good things. Whatever harmful backfire effects there may be are far outweighed by the beneficial backfire effects, so selection on a single trait, especially intelligence, is not going to incur these speculative hypothetical harms.

    If there are any such harms, they can be reduced or elim­i­nated by sim­ply tak­ing into account mul­ti­ple traits while select­ing, and doing mul­ti­-trait selec­tion. This is easy to do with the present avail­abil­ity of PGSes on hun­dreds of trait­s—­given that all the hard work is in the geno­typ­ing step, why would one ignore all traits but one and throw away all that data? In fact, even if there were no pos­si­bil­ity of back­fire effects, embryo selec­tion would be done with mul­ti­-trait selec­tion any­way, sim­ply because it is so easy and the ben­e­fits are so com­pelling: using mul­ti­ple traits allows for much greater over­all gains because two embryos sim­i­lar or iden­ti­cal on one trait may differ a great deal on another trait, and when traits are genet­i­cally cor­re­lat­ed, they can serve as prox­ies for each oth­er, pro­duc­ing effec­tive boosts in pre­dic­tive pow­er. For all these rea­sons, most breed­ing pro­grams use mul­ti­-trait selec­tion. For more details and an exam­ple of the ben­e­fits in embryo selec­tion, see the mul­ti­ple-s­e­lec­tion sec­tion.
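    Mechanically, multi-trait selection can be as simple as ranking embryos by a weighted sum of their PGSes (a minimal sketch; the traits, weights, and scores below are invented for illustration):

    ```python
    def select_best(embryos, weights):
        """Return the embryo maximizing a linear index over several trait PGSes."""
        return max(embryos, key=lambda scores: sum(w * s for w, s in zip(weights, scores)))

    # Toy (IQ-PGS, disease-resistance-PGS) pairs in SD units: the first two embryos
    # are identical on the first trait but differ a great deal on the second.
    embryos = [(1.0, -0.5), (1.0, 0.8), (0.2, 0.1)]
    print(select_best(embryos, weights=(1.0, 1.0)))  # -> (1.0, 0.8)
    ```

    The weights would in practice encode the relative value placed on each trait; with equal weights, the second trait breaks the tie on the first.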

Embryo selection cost-effectiveness

“Forty years ago, I could say in the Whole Earth Cat­a­log, ‘we are as gods, we might as well get good at it’…What I’m say­ing now is we are as gods and have to get good at it.”

Stew­art Brand

In vitro fertilization (IVF) is a medical procedure for infertile women in which eggs are extracted, fertilized with sperm, allowed to develop into embryos, and an embryo injected into the womb to induce pregnancy. The choice of embryo to implant is usually arbitrary, with some simple screening for gross abnormalities like missing chromosomes or other cellular defects, which would either be fatal to the embryo's development (so useless & wasteful to implant) or cause birth defects (so much preferable to implant a healthier embryo).

However, various tests can be run on embryos, including genome sequencing after extracting a few cells from the embryo; when genetic information is measured and used to choose which embryo to implant, this is called preimplantation genetic diagnosis (PGD). PGD has historically been used primarily to detect and select against a few rare single-gene genetic diseases like the fatal Huntington's disease: if both parents are carriers of a recessive, an embryo without the recessive allele can be chosen, or at least an embryo which is heterozygous and won't develop the disease. This is useful for those unlucky enough to have a family history or be known carriers, and while initially controversial, PGD is now merely an obscure & useful part of fertility medicine.
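The selection arithmetic for the recessive case is simple (a sketch: with two carrier parents, Mendelian inheritance makes each embryo independently homozygous-affected with probability 1/4 and entirely free of the allele with probability 1/4):

```python
def p_at_least_one(n_embryos, p_per_embryo):
    """P(at least one embryo of the desired genotype among n independent embryos)."""
    return 1 - (1 - p_per_embryo) ** n_embryos

for n in [1, 3, 5, 10]:
    print(n,
          round(p_at_least_one(n, 3/4), 4),   # at least one unaffected embryo
          round(p_at_least_one(n, 1/4), 4))   # at least one carrying neither copy
```

Even a handful of embryos virtually guarantees at least one unaffected candidate, which is why PGD works so well for single-gene diseases.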

However, with ever-cheaper SNP arrays and the advent of large GWASes in the 2010s, large amounts of subtler genetic information become available, and one could check for abnormalities and also start making useful predictions about adult phenotypes: one could choose embryos with higher/lower probability of traits with many known genetic hits, such as intelligence or alcoholism or schizophrenia; thus, in effect, creating “designer babies” with proven technology no more exotic than IVF and 23andMe. Since such a practice is different in so many ways from traditional PGD, I'll call it “embryo selection”.

Embryo selec­tion has already begun to be used by the most sophis­ti­cated cat­tle breed­ing pro­grams (Mul­laart & Wells 2018) as an adjunct to their highly suc­cess­ful genomic selec­tion & embryo trans­fer pro­grams.

What traits might one want to select on? For example, increases in height have long been linked to increased career success & life satisfaction, with estimates like +$800 per inch per year of income, which, combined with polygenic scores predicting a decent fraction of variance, could be valuable.2 But height, or hair color, or other such traits are in general zero-sum, often easily modified (eg hair dye or contact lenses), and far less important to life outcomes than personality or intelligence, which profoundly influence an enormous range of outcomes ranging from academic success to income to longevity to violence to happiness to altruism (and so increases in which are far from “frivolous”, as some commenters have labeled them); since the personality GWASes have had difficulties (probably due to non-additivity of the relevant genes connected to predicted frequency-dependent selection), that leaves intelligence as the most important case.

Dis­cus­sions of this pos­si­bil­ity have often led to both over­heated prophe­cies of “genius babies” or “super-ba­bies”, and to dis­mis­sive scoffing that such meth­ods are either impos­si­ble or of triv­ial val­ue; unfor­tu­nate­ly, spe­cific num­bers and cal­cu­la­tions back­ing up either view tend to be lack­ing, even in cases where the effect can be pre­dicted eas­ily from behav­ioral genet­ics and shown to be not as large as lay­men might expect & con­sis­tent with the results (for exam­ple, the “genius sperm bank”3).

In “Embryo Selec­tion for Cog­ni­tive Enhance­ment: Curios­ity or Game-chang­er?”, Shul­man & Bostrom 2014 con­sider the poten­tial of embryo selec­tion for greater intel­li­gence in a lit­tle detail, ulti­mately con­clud­ing that in the most applic­a­ble cur­rent sce­nario of min­i­mal uptake (re­stricted largely to those forced into IVF use) and gains of a few IQ points, embryo selec­tion is more of “curios­ity” than “game-changer” as it will be “Socially neg­li­gi­ble over one gen­er­a­tion. Effects of social con­tro­versy more impor­tant than direct impacts.” Some things are left out of their analy­sis which I’m inter­ested in:

  1. they give the upper bound on the IQ gain that can be expected from a given level of selec­tion & then-cur­rent impre­cise GCTA her­i­tabil­ity esti­mates, but not the gain that could be expected with updated fig­ures: is it a large or small frac­tion of that max­i­mum? And they give a gen­eral descrip­tion of what soci­etal effects might be expected from com­bi­na­tions of IQ gains and preva­lence, but can we say some­thing more rig­or­ously about that?
  2. their level of selec­tion may bear lit­tle resem­blance to what can be prac­ti­cally obtained given the real­i­ties of IVF and high embryo attri­tion rates (se­lect­ing from 1 in 10 embryos may yield x IQ points, but how many real embryos would we need to imple­ment that, since if we extract 10 embryos, 3 might be abnor­mal, the best can­di­date might fail to implant, the sec­ond-best might result in a mis­car­riage, etc?)
  3. there is no attempt to esti­mate costs nor whether embryo selec­tion right now is worth the costs, or how much bet­ter our selec­tion abil­ity would need to be to make it worth­while. Are the advan­tages com­pelling enough that ordi­nary par­ents, who are already using IVF and could use embryo selec­tion at min­i­mal mar­ginal cost, would pay for it and take the prac­tice out of the lab? Under what assump­tions could embryo selec­tion be so valu­able as to moti­vate par­ents with­out fer­til­ity prob­lems into using IVF solely to ben­e­fit from embryo selec­tion?
  4. if it is not worth­while because the genetic infor­ma­tion is too weakly pre­dic­tive of adult phe­no­type, how much addi­tional data would it take to make the pre­dic­tions good enough to make selec­tion worth­while?
  5. What are the prospects for embryo edit­ing instead of selec­tion, in the­ory and right now?
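Question 2 can be given a flavor with a toy attrition model (all rates here are made-up placeholders, not the estimates derived later on this page; the embryos' predicted scores are drawn with Shulman & Bostrom's 7.5-point SD):

```python
import random

rng = random.Random(42)

def expected_gain(n_eggs=10, p_normal=0.7, p_live_birth=0.3, sd=7.5, sims=20_000):
    """Screen each embryo for abnormality (p_normal), then transfer survivors
    best-first until one results in a live birth (p_live_birth);
    return the mean predicted-IQ gain among the achieved births."""
    total, births = 0.0, 0
    for _ in range(sims):
        scores = sorted((rng.gauss(0, sd) for _ in range(n_eggs)), reverse=True)
        for s in scores:
            if rng.random() < p_normal and rng.random() < p_live_birth:
                total += s
                births += 1
                break
    return total / births

print(round(expected_gain(), 2))
```

With these placeholder rates, the embryo that actually gets born is often not the top-ranked one, dragging the realized mean gain down to roughly a third of the ~11.5-point 1-in-10 ceiling; pipeline losses, not the predictor, dominate.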

I start with Shulman & Bostrom 2014's basic framework, replicate it, and extend it to include realistic parameters for practical obstacles & inefficiencies, full cost-benefits, and extensions & possible improvements to the naive univariate embryo-selection approach, among other things. (A subsequent 2019 analysis, while concluding that the glass is half-empty, reaches similar results within its self-imposed analytical limits; these largely recapitulate the expected results from the many sibling PGS comparison studies discussed later.)


Value of IQ

Shul­man & Bostrom 2014 note that

Studies in labor economics typically find that one IQ point corresponds to an increase in wages on the order of 1 per cent, other things equal, though higher estimates are obtained when effects of IQ on educational attainment are included (Neal and Johnson, 1996; Cawley et al., 1997; Behrman et al., 2004; Bowles et al., 2002).2 The individual increase in earnings from a genetic intervention can be assessed in the same fashion as prenatal care and similar environmental interventions. One study of efforts to avert low birth weight estimated the value of a 1 per cent increase in earnings for a newborn in the US to be between $2,783 and $13,744, depending on discount rate and future wage growth (Brooks-Gunn et al., 2009)4

The given low/high range is based on 2006 data; inflation-adjusted to 2016 dollars (as appropriate, since it is being compared to 2015/2016 costs), that would be $3,270 and $16,151. There is much more that can be said on this topic, starting with various measurements of individuals, from income to wealth to correlations with occupational prestige; looking at longitudinal & cross-sectional national wealth data & psychological differences (such as increasing cooperativeness, patience, free-market and moderate politics); verification of causality from longitudinal predictiveness, genetic overlap, within-family comparisons, & exogenous shocks, positive (iodization & iron) or negative (lead), etc; an incomplete bibliography is provided as an appendix. As polygenic scores & genetically-informed designs are slowly adopted by the social sciences, we can expect more known correlations to be confirmed as causally downstream of genetic intelligence. These downstream effects likely include not just income and education but behavioral measures as well: one analysis of the data notes that a 3 point IQ increase predicts 28% less risk of high-school dropout, 25% less risk of poverty or of being jailed (men), 20% less risk of parentless children, 18% less risk of going on welfare, and 15% less risk of out-of-wedlock births. Anders Sandberg provides a descriptive table (expanded from Gottfredson 2003, itself adapted from Gottfredson 1997):
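The inflation adjustment is simple arithmetic (a sketch; the ~1.175 CPI multiplier for 2006→2016 is my approximation, chosen to reproduce the figures above):

```python
def adjust_for_inflation(dollars_2006, cpi_factor=1.175):
    """Convert 2006 dollars to ~2016 dollars with an approximate CPI-U multiplier."""
    return round(dollars_2006 * cpi_factor)

low, high = adjust_for_inflation(2783), adjust_for_inflation(13744)
print(low, high)  # ~3270 and ~16150, close to the $3270/$16151 figures above
```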

Figure: Population distribution of IQ by intellectual capacity, common jobs, and social dysfunctionality
Figure 4: “The Big Footprint of Multiple-High-Cost-Users”

Estimating the value of an additional IQ point is difficult, as there are many perspectives one could take: zero-sum, including only personal earnings or wealth and neglecting all the wealth produced for society (eg through research), often based on correlating income with intelligence scores or education; positive-sum, attempting to include the positive externalities, perhaps through cross-sectional or longitudinal global comparisons, as intelligence predicts later wealth and the wealth of a country is closely linked to the average intelligence of its population, which captures many (but not all) of the positive externalities; measures which include the greater longevity & happiness of more intelligent people, etc. Further, intelligence has intrinsic value of its own, and the genetic hits appear to be pleiotropic and improve other desirable traits (consistent with the mutation-selection balance evolutionary theory of persistent intelligence differences); the intelligence/longevity correlation has been found to be due to common genetics, and Krapohl et al 2015 examines the correlation of polygenic scores with 50 diverse traits, finding that the college/IQ polygenic scores correlate with 10+ of them in generally desirable directions5,6, indicating both causation for those correlations & benefits beyond income. (For a more detailed discussion of embryo selection on multiple traits, and whether they increase or decrease selection gains, see later.)
There are also pitfalls, like the fallacy of controlling for an intermediate variable, exemplified by studies which attempt to correlate intelligence with income after “controlling for” education, despite knowing that educational attainment is partially caused by intelligence, so their estimates are actually something like ‘the gain from greater intelligence for reasons other than through its effect on education’. Estimates have come from a variety of sources, such as iodine and lead studies, using a variety of methodologies, from cross-sectional surveys or administrative data up to natural experiments. Given the difficulty of coming up with reliable estimates for ‘the’ value of an IQ point, which would be a substantial research project in its own right (but worth doing, as it would be highly useful in a wide range of analyses from lead remediation to iodization), I will just reuse the $3,270-$16,151 range.

Polygenic scores for IQ


Shul­man & Bostrom’s upper bound works as fol­lows:

Standard practice today involves the creation of fewer than ten embryos. Selection among greater numbers than that would require multiple IVF cycles, which is expensive and burdensome. Therefore 1-in-10 selection may represent an upper limit of what would currently be practically feasible …The standard deviation of IQ in the population is about 15. Davies et al. (2011) estimates that common additive variation can account for half of variance in adult fluid intelligence in its sample. Siblings share half their genetic material on average (ignoring the known assortative mating for intelligence, which will reduce the visible variance among embryos). Thus, in a crude estimate, variance is cut by 75 per cent and standard deviation by 50 per cent. Adjustments for assortative mating, deviation from the Gaussian distribution, and other factors would adjust this estimate, but not drastically. These figures were generated by simulating 10 million couples producing the listed number of embryos and selecting the one with the highest predicted IQ based on the additive variation.

Table 1. How the maximum amount of IQ gain (assuming a Gaussian distribution of predicted IQs among the embryos, with a standard deviation of 7.5 points) might depend on the number of embryos used in selection.

Selection   Average IQ gain
---------   ---------------
1 in 2      4.2
1 in 10     11.5
1 in 100    18.8
1 in 1000   24.3

That is, the full heritability of adult intelligence is ~0.8; a SNP chip records the few hundred thousand most common genetic variants in the population, and treating each variant as having a simple additive increase-or-decrease effect on intelligence, Davies et al 2011's GCTA estimates that those SNPs are responsible for 0.51 of variance; since siblings descend from the same two parents, they will share half the variants (just like dizygotic twins) and differ on the rest, so the SNPs can only predict up to ~0.25 of variance between siblings, and siblings are analogous to multiple embryos being considered for implantation in IVF (but not sperm or eggs7); simulate n embryos by drawing from a normal distribution with a SD of √0.25 = 0.5, or 7.5 IQ points, select the highest, and with various n, you get something like the table.
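Table 1 can also be replicated without simulating millions of couples by integrating the density of the maximum of n i.i.d. normals directly (a sketch using the standard order-statistic formula E[max] = ∫ x·n·φ(x)·Φ(x)^(n-1) dx, scaled by the 7.5-point SD):

```python
import math

def phi(x):  # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):  # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def expected_max(n, sd=7.5, lo=-10.0, hi=10.0, steps=20_000):
    """E[max of n iid N(0, sd)] by trapezoid integration of x*n*phi(x)*Phi(x)^(n-1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * n * phi(x) * Phi(x) ** (n - 1)
    return sd * total * h

for n in [2, 10, 100, 1000]:
    print(f"1 in {n}: {expected_max(n):.1f}")  # reproduces 4.2 / 11.5 / 18.8 / 24.3
```

Note the severely diminishing returns: going from 1-in-10 to 1-in-1000 embryos only roughly doubles the gain, because the expected maximum of a normal sample grows sublinearly in n.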

GCTA is a method of estimating the heritability due to measured SNPs (typically several hundred thousand SNPs which are relatively frequent, >1%, in the population); GCTA uses unrelated individuals, estimates how genetically and phenotypically similar they are by chance, and compares the similarities: the more genetic similarity predicts phenotypic similarity, the more heritable the trait is. GCTA and other SNP heritability estimates (like the now more common LDSC) are useful because, by using unrelated individuals, they avoid most of the criticisms of twin or family studies, and definitively establish the presence of substantial heritability for most traits. GCTA SNP heritability estimates are analogous to heritability estimates in that they tell us how much the set of SNPs would explain if we knew all their effects exactly. This represents both an upper bound and a lower bound. It is a lower bound on heritability because:

  • only SNPs are used, which are a sub­set of all genetic vari­a­tion exclud­ing vari­ants found in <1% of the pop­u­la­tion, copy­-num­ber vari­a­tions, extremely rare or de novo muta­tions, etc; fre­quent­ly, the SNP sub­set is reduced fur­ther by drop­ping X/Y chro­mo­some data entirely & con­sid­er­ing only auto­so­mal DNA.

    Using techniques which boost genomic coverage, like imputation based on whole-genomes, could substantially increase the GCTA estimate. One study demonstrated that using better imputation to make measured SNPs tag more causal variants drastically increased the GCTA estimate for height; another applied GCTA to both common variants (23%) and also to relatives to pick up rarer variants shared within families (31%), and found that, combined, most/all of the estimated genetic variance was accounted for (23+31=54% vs 54% heritability in that dataset and a traditional heritability estimate of 50-80%).

  • the SNPs are statistically treated in an additive fashion, ignoring any contribution they may make through epistasis and dominance8

  • GCTA estimates typically include no correction for measurement error in the phenotype data, which has the usual statistical effect of biasing parameter estimates towards zero, reducing SNP heritability or GWAS estimates substantially (as often noted): a short IQ test, or a proxy like years of education, will correlate imperfectly with intelligence. This can be adjusted by psychometric formulas using test-retest reliability to get a true estimate (eg a GCTA estimate of 0.33 based on a short quiz with r = 0.5 reliability might actually imply a true GCTA estimate more like 0.5, implying one could find much more of the genetic variants responsible for intelligence by running a GWAS with better—but probably slower & more expensive—IQ testing methods).

So GCTA is a lower bound on the total genetic con­tri­bu­tion to any trait; use of whole-genome data and more sophis­ti­cated analy­sis will allow pre­dic­tions beyond the GCTA. But the GCTA rep­re­sents an upper bound on the state of the art approach­es:

  • there are many SNPs (likely into the thou­sands) affect­ing intel­li­gence
  • only a few are known to a high level of con­fi­dence, and the rest will take much larger sam­ples to pin down
  • only additive modeling of relatively small SNP datasets is feasible in terms of computing power and implementations

So the current approaches of getting increasingly large SNP samples will not pass the GCTA ceiling. Polygenic scores based on large SNP samples modeled additively are what is available in 2015, and in practice are nowhere near the GCTA ceiling; hence, the state of the art is well below the outlined maximum IQ gains. Probably at some point whole-genomes will become cost-effective compared to SNPs, improvements will be made in modeling interactions, and potentially much better polygenic scores will become available approaching the 0.8 of heritability; but not yet.

GCTA meta-analysis

Davies et al 2011's 0.5 (50%) SNP heritability is outdated & small, based on n = 3511 with correspondingly large imprecision in the GCTA estimates. We can do better by bringing it up to date, incorporating the additional GCTAs which have been published from 2011 through 2018.

Com­pil­ing 12 GCTAs, I find a meta-an­a­lytic esti­mate of SNPs can explain >33% of vari­ance in cur­rent intel­li­gence scores, and, adjust­ing for mea­sure­ment error (as we care about the latent trait, not any indi­vid­ual mea­sure­men­t), >44% with bet­ter-qual­ity phe­no­type test­ing.

Intelligence GCTA literature

I was able to find in total the fol­low­ing GCTA esti­mates:

  1. , Davies et al 2011 (sup­ple­men­tary)

    0.51(0.11); but Sup­ple­men­tary Table 1 (pg1) actu­ally reports in the com­bined sam­ple, the “no cut-off gf h^2” equals 0.53(0.10). The 0.51 esti­mate is drawn from a cryp­tic relat­ed­ness cut­off of <0.025. The sam­ples are also reported aggre­gated into Scot­tish & Eng­lish sam­ples: 0.17 (0.20) & 0.99 (0.22) respec­tive­ly. Sam­ple ages:

    1. Loth­ian Birth Cohort 1921 (Scot­tish): n = 550, 79.1 years aver­age
    2. Loth­ian Birth Cohort 1936 (Scot­tish): n = 1091, 69.5 years aver­age
    3. Aberdeen Birth Cohort 1936 (Scot­tish): n = 498, 64.6 years aver­age
    4. Man­ches­ter and New­cas­tle lon­gi­tu­di­nal stud­ies of cog­ni­tive aging cohorts (Eng­lish): n = 6063, 65 years median

    GCTA is not reported for the Nor­we­gian, and not reported for the 4 sam­ples indi­vid­u­al­ly, so I code Davies et al 2011 as 2 sam­ples with weight­ed-av­er­ages for ages (70.82 and 65 respec­tive­ly)

  2. , Chabris et al 2012

    0.47; no mea­sure of pre­ci­sion reported in paper or sup­ple­men­tary infor­ma­tion but the rel­e­vant sam­ple seems to be n = 2,441 and so the stan­dard error will be high. (Chabris et al 2012 does not attempt a poly­genic score beyond the can­di­date-gene SNP hits con­sid­ered.)

  3. “Genetic con­tri­bu­tions to sta­bil­ity and change in intel­li­gence from child­hood to old age”, Deary et al 2012

    The bivari­ate analy­sis resulted in esti­mates of the pro­por­tion of phe­no­typic vari­a­tion explained by all SNPs for cog­ni­tion, as fol­lows: 0.48 (stan­dard error 0.18) at age 11; and 0.28 (stan­dard error 0.18) at age 65, 70 or 79.

    This re-re­ports the Aberdeen & Loth­ian Birth Cohorts from Davies et al 2011.

  4. , Plomin et al 2013

    England/Wales TEDS cohort. Table 1: “.35 [.12, .58]” (95% CI, so presumably a standard error of ~0.117); 12-year-old twins

  5. , Benyamin et al 2014 (sup­ple­men­tary infor­ma­tion)

    Cohorts from Eng­land, USA, Aus­tralia, Nether­lands, & Scot­land. pg4: TEDS (mean age 12yo, twin­s): 0.22(0.10), UMN (14yo, mostly twins9): 0.40(0.21), ALSPAC (9y­o): 0.46(0.06)

  6. , Rietveld et al 2013 (sup­ple­men­tary infor­ma­tion)

    Edu­ca­tion years phe­no­type. pg2: 0.224(0.042); mean age ~57 (us­ing the sup­ple­men­tary infor­ma­tion’s Table S4 on pg92 & equal-weight­ing all reported mean ages; major­ity of sub­jects are non-twin)

  7. “Mol­e­c­u­lar genetic con­tri­bu­tions to socioe­co­nomic sta­tus and intel­li­gence”, Mar­i­oni et al 2014

    Gen­er­a­tion Scot­land cohort. Table 3: 0.29(0.05), median age 57.

  8. , Kirk­patrick et al 2014

    Two Minnesota family & twin cohorts. 0.35(0.11), 11.78 & 17.48yos (average: 14.63)

  9. DNA evi­dence for strong genetic sta­bil­ity and increas­ing her­i­tabil­ity of intel­li­gence from age 7 to 12”, Trza­skowski et al 2014a

    Rere­ports the TEDS cohort. pg4: age 7: 0.26(0.17); age 12: 0.45(0.14); used unre­lated twins for the GCTA.

  10. , Trza­skowski et al 2014b

    Table 2: 0.32(0.14); appears to be a fol­lowup to Trza­skowski et al 2014a & report on same dataset

  11. “Genomic archi­tec­ture of human neu­roanatom­i­cal diver­sity”, Toro et al 2014 (sup­ple­ment)

    0.56(0.25)/0.52(0.25) (verbal IQ vs performance IQ; mean: 0.54(0.25)); IMAGEN cohort (Ireland, England, Scotland, France, Germany, Norway), mean age 14.5

  12. “Genetic con­tri­bu­tions to vari­a­tion in gen­eral cog­ni­tive func­tion: a meta-analy­sis of genome-wide asso­ci­a­tion stud­ies in the CHARGE con­sor­tium (n = 53949)”, Davies et al 2015

    ARIC (57.2yo, USA, n = 6617): 0.29(0.05), HRS (70yo, USA, n = 5976): 0.28(0.07); ages from Sup­ple­men­tary Infor­ma­tion 2.

    The arti­cle reports doing GCTAs only on the ARIC & HRS sam­ples, but Fig­ure 4 shows a for­est plot which includes GCTA esti­mates from two other groups, CAGES (“Cog­ni­tive Age­ing Genet­ics in Eng­land and Scot­land Con­sor­tium”) at ~0.5 & GS (“Gen­er­a­tion Scot­land”) at ~0.25. The CAGES dat­a­point is cited to Davies et al 2011, which did report 0.51, and the GS cita­tion is incor­rect; so pre­sum­ably those two dat­a­points were pre­vi­ously reported GCTA esti­mates which Davies et al 2015 was meta-an­a­lyz­ing together with their 2 new ARIC/HS esti­mates, and they sim­ply did­n’t men­tion that.

  13. “A genome-wide analy­sis of puta­tive func­tional and exonic vari­a­tion asso­ci­ated with extremely high intel­li­gence”, Spain et al 2015

    0.174(0.017); but on the liability scale for extremely high intelligence, so of unclear relevance to normal variation, and I don't know how it can be converted to a SNP heritability equivalent to the others.

  14. “Epi­ge­netic age of the pre-frontal cor­tex is asso­ci­ated with neu­ritic plaques, amy­loid load, and Alzheimer’s dis­ease related cog­ni­tive func­tion­ing”, Levine et al 2015

    As mea­sures of cog­ni­tive func­tion & aging, some sort of IQ test was done, with the GCTAs reported as 0/0, but no stan­dard errors or other mea­sures of pre­ci­sion were included and so it can­not be meta-an­a­lyzed. (Although with only n = 700, orders of mag­ni­tude smaller than some other dat­a­points, the pre­ci­sion would be extremely poor and it is not much of a loss.)

  15. , Davies et al 2016

    n = 30801, 0.31(0.018) for ver­bal-nu­mer­i­cal rea­son­ing (13-item mul­ti­ple choice, test-retest 0.65) in UK Biobank, mean age 56.91 (Sup­ple­men­tary Table S1)

  16. , Robin­son et al 2015

    n = 3689, 0.360(0.108) for the prin­ci­pal fac­tor extracted from their bat­tery of tests, non-twins mean age 13.7

  17. , Tram­push et al 2017:

    n = 35298, 0.215(0.0001); not GCTA but LD score regression, with overlap with CHARGE (cohorts: CHS, FHS, HBCS, LBC1936 and NCNG); non-twin, mean age of 45.6

  18. , Zabaneh et al 2017

    n = 1238/8172, 0.33(0.22); but esti­mated on the lia­bil­ity scale (nor­mal intel­li­gence vs “extremely high intel­li­gence” as defined by being accepted into TIP) so unclear if directly com­pa­ra­ble to other GCTAs.

  19. Davies et al 2018, :

    We esti­mated the pro­por­tion of vari­ance explained by all com­mon SNPs using GCTA-GREML in four of the largest indi­vid­ual sam­ples: Eng­lish Lon­gi­tu­di­nal Study of Age­ing (ELSA: n = 6661, h2 = 0.12, SE = 0.06), Under­stand­ing Soci­ety (n = 7841, h2 = 0.17, SE = 0.04), UK Biobank Assess­ment Cen­tre (n = 86,010, h2 = 0.25, SE = 0.006), and Gen­er­a­tion Scot­land (n = 6,507, h2 = 0.20, SE = 0.0523) (Table 2). Genetic cor­re­la­tions for gen­eral cog­ni­tive func­tion amongst these cohorts, esti­mated using bivari­ate GCTA-GREML, ranged from rg = 0.88 to 1.0 (Table 2).

The earlier estimates tend to come from smaller samples and to be higher; and since heritability increases with age, one would expect the GCTA estimates of SNP contribution to increase with age as well.


Jian Yang says that GCTA esti­mates can be meta-an­a­lyt­i­cally com­bined straight­for­wardly in the usual way. Exclud­ing Chabris et al 2012 (no pre­ci­sion report­ed) and Spain et al 2015 and the dupli­cate Trza­skowski and doing a ran­dom-effects meta-analy­sis with mean age as a covari­ate:

gcta <- read.csv(stdin(), header=TRUE)
Study, N, HSNP, SE, Age.mean, Twin, Country
Davies et al 2011, 2139, 0.17, 0.2, 70.82, FALSE, Scotland
Davies et al 2011, 6063, 0.99, 0.22, 65, FALSE, England
Plomin et al 2013, 3154, 0.35, 0.117, 12, TRUE, England
Benyamin et al 2014, 3376, 0.40, 0.21, 14, TRUE, USA
Benyamin et al 2014, 5517, 0.46, 0.06, 9, FALSE, England
Rietveld et al 2013, 7959, 0.224, 0.042, 57.47, FALSE, international
Marioni et al 2014, 6609, 0.29, 0.05, 57, FALSE, Scotland
Kirkpatrick et al 2014, 3322, 0.35, 0.11, 14.63, FALSE, USA
Toro et al 2014, 1765, 0.54, 0.25, 14.5, FALSE, international
Davies et al 2015, 6617, 0.29, 0.05, 57.2, FALSE, USA
Davies et al 2015, 5976, 0.28, 0.07, 70, FALSE, USA
Davies et al 2016, 30801, 0.31, 0.018, 56.91, FALSE, England
Robinson et al 2015, 3689, 0.36, 0.108, 13.7, FALSE, USA

## Model as continuous normal variable; heritabilities are ratios 0-1,
## but metafor doesn't support heritability ratios, or correlations with
## standard errors rather than _n_s (which grossly overstates precision)
## so, as is common and safe when the estimates are not near 0/1, we treat it
## as a standardized mean difference
library(metafor)
rem <- rma(measure="SMD", yi=HSNP, sei=SE, data=gcta); rem
# ...estimate       se     zval     pval    ci.lb    ci.ub
#  0.3207   0.0253  12.6586   <.0001   0.2711   0.3704
remAge <- rma(yi=HSNP, sei=SE, mods = Age.mean, data=gcta); remAge
# Mixed-Effects Model (k = 13; tau^2 estimator: REML)
# tau^2 (estimated amount of residual heterogeneity):     0.0001 (SE = 0.0010)
# tau (square root of estimated tau^2 value):             0.0100
# I^2 (residual heterogeneity / unaccounted variability): 2.64%
# H^2 (unaccounted variability / sampling variability):   1.03
# R^2 (amount of heterogeneity accounted for):            96.04%
# Test for Residual Heterogeneity:
# QE(df = 11) = 15.6885, p-val = 0.1531
# Test of Moderators (coefficient(s) 2):
# QM(df = 1) = 6.6593, p-val = 0.0099
# Model Results:
#          estimate      se     zval    pval    ci.lb    ci.ub
# intrcpt    0.4393  0.0523   8.3953  <.0001   0.3368   0.5419
# mods      -0.0025  0.0010  -2.5806  0.0099  -0.0044  -0.0006
remAgeT <- rma(yi=HSNP, sei=SE, mods = ~ Age.mean + Twin, data=gcta); remAgeT
# intrcpt      0.4505  0.0571   7.8929  <.0001   0.3387   0.5624
# Age.mean    -0.0027  0.0010  -2.5757  0.0100  -0.0047  -0.0006
# Twin TRUE   -0.0552  0.1119  -0.4939  0.6214  -0.2745   0.1640
gcta <- gcta[order(gcta$Age.mean),] # sort by age, young to old
forest(rma(yi=HSNP, sei=SE, data=gcta), slab=gcta$Study)
## so estimated heritability at 30yo:
0.4505 + 30*-0.0027
# [1] 0.3695
## Take a look at the possible existence of a quadratic trend as suggested
## by conventional IQ heritability results:
remAgeTQ <- rma(yi=HSNP, sei=SE, mods = ~ I(Age.mean^2) + Twin, data=gcta); remAgeTQ
# Mixed-Effects Model (k = 13; tau^2 estimator: REML)
# tau^2 (estimated amount of residual heterogeneity):     0.0000 (SE = 0.0009)
# tau (square root of estimated tau^2 value):             0.0053
# I^2 (residual heterogeneity / unaccounted variability): 0.83%
# H^2 (unaccounted variability / sampling variability):   1.01
# R^2 (amount of heterogeneity accounted for):            98.87%
# Test for Residual Heterogeneity:
# QE(df = 10) = 16.1588, p-val = 0.0952
# Test of Moderators (coefficient(s) 2,3):
# QM(df = 2) = 6.2797, p-val = 0.0433
# Model Results:
#                estimate      se     zval    pval    ci.lb    ci.ub
# intrcpt          0.4150  0.0457   9.0879  <.0001   0.3255   0.5045
# I(Age.mean^2)   -0.0000  0.0000  -2.4524  0.0142  -0.0001  -0.0000
# Twin TRUE       -0.0476  0.1112  -0.4285  0.6683  -0.2656   0.1703
## does fit better but enough?
For­est plot for meta-analy­sis of GCTA esti­mates of total addi­tive SNPs’ effect on intelligence/cognitive-ability

The regres­sion results, resid­u­als, and fun­nel plots are gen­er­ally sen­si­ble.

The overall estimate of ~0.30 is about what one would have predicted based on prior research. A meta-analysis of thousands of twin studies on hundreds of measurements finds wide dispersal among traits but an overall grand mean of 0.49, of which most is additive genetic effects; combined with the usually greater measurement error of GCTA studies compared to twin registries (which can do detailed testing over many years) and the limitation of SNP arrays to measuring a subset of genetic variants, one would guess at a GCTA grand mean of about half that, or ~0.25. More directly, a GCTA-like SNP heritability algorithm run on 551 traits available in the UK Biobank yields a grand mean of 16% (supplementary ‘All Tables’, worksheet 3 ‘Supp Table 1’), with education/fluid-intelligence/numeric-memory/pairs-matching/prospective-memory/reaction-time at 29%/23%/15%/6%/11%/7% respectively.10 This result was extended to 717 UKBB traits, finding similar grand mean SNP heritabilities of 16% & 11% (continuous & binary traits); Watanabe et al 2018’s SumHer SNP heritability across 551 traits (Supplementary Table 22) has a grand mean of 17%. Hence, ~0.30 is a plausible result for any trait and for intelligence specifically.

There are two issues with some of the details:

  1. Davies et al 2011’s sec­ond sam­ple, with a GCTA esti­mate of 0.99(0.22), is 3 stan­dard errors away from the over­all esti­mate.

    Noth­ing about the sam­ple or pro­ce­dures seem sus­pi­cious, so why is the esti­mate so high? The GCTA paper/manual do warn about the pos­si­bil­ity of unsta­ble esti­ma­tion where para­me­ter val­ues escape to a bound­ary (a com­mon flaw in fre­quen­tist pro­ce­dures with­out reg­u­lar­iza­tion), and it is sus­pi­cious that this out­lier is right at a bound­ary (1.0), so I sus­pect that that might be what hap­pened in this pro­ce­dure and if the Davies et al 2011 data were rerun, a more sen­si­ble value like 0.12 would be esti­mat­ed.

  2. the esti­mates decrease with age rather than increase.

    I thought this might be dri­ven by the sam­ples using twins, which have been accused in the past of deliv­er­ing higher her­i­tabil­ity esti­mates due to higher SES of par­ents and cor­re­spond­ingly less envi­ron­men­tal influ­ence, but when added as a pre­dic­tor, twin sam­ples are non-s­ta­tis­ti­cal­ly-sig­nifi­cantly low­er. My best guess so far is that the appar­ent trend is due to a lack of mid­dle-aged sam­ples: the stud­ies jump all the way from 14yo to 57yo, so the usual qua­dratic curve of increas­ing her­i­tabil­ity could be hid­den and look flat, since the high esti­mates will all be miss­ing from the mid­dle.

    Testing this, I tried fitting a quadratic model instead, and as expected, it does fit somewhat better, but without using Bayesian methods it is hard to say how much better. This question awaits publication of further GCTA intelligence samples with middle-aged subjects.

Correcting for measurement error

This meta-an­a­lytic sum­mary is an under­es­ti­mate of the true genetic effect for sev­eral rea­sons, includ­ing as men­tioned, mea­sure­ment error. Using Spear­man’s for­mu­la, we can cor­rect it.

Davies et al 2016 is the most convenient and precise GCTA estimate to work with, and reports a test-retest reliability of 0.65 for its 13-question verbal-numerical reasoning test. Its h2SNP=0.31 is a squared correlation, so it must be square-rooted to get an r: √0.31 = 0.556. We assume the SNP measurement reliability is ~1, as genotyping is highly accurate due to repeated passes.

The correction for attenuation is r′ = r_xy / √(r_xx × r_yy).

x/y are IQ/SNPs, so: r′ = 0.556 / √(0.65 × 1.0) = 0.691.

So the rSNP is 0.691, and to convert it back to h2SNP, 0.691² = 0.477481 = 0.48, which is substantially larger than the measurement-error-contaminated underestimate of 0.31.

0.48 represents the true underlying genetic contribution, given unlimited amounts of exact data; but all IQ tests are imperfect, and one may ask what the practical limit is with the best current IQ tests.

One of the best current IQ tests is the WAIS-IV full-scale IQ test, with a 32-day test-retest reliability of 0.93 (Table 2). Knowing the true GCTA estimate, we can work backwards assuming ryy=0.93:

  1. r = 0.691 × √(0.93 × 1.0) = 0.666
  2. h2SNP = 0.666² = 0.444

The better IQ test delivers a gain of 0.444-0.31=0.134, or 43% more variance explicable, with only ~4% still left over compared to a perfect test.
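The two corrections can be sketched in a few lines of R (`disattenuate` is my name for Spearman's formula, assuming SNP measurement reliability ~1):

```r
## Spearman's correction for attenuation: r' = r_xy / sqrt(r_xx * r_yy)
disattenuate <- function(r_xy, r_xx, r_yy=1) { r_xy / sqrt(r_xx * r_yy) }

## observed h2_SNP = 0.31 with a test-retest reliability of 0.65:
r_snp <- disattenuate(sqrt(0.31), r_xx=0.65)
round(r_snp^2, 2)                 # true SNP heritability: ~0.48
round((r_snp * sqrt(0.93))^2, 3)  # recoverable with a 0.93-reliability test: ~0.444
```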

Mea­sure­ment error has con­sid­er­able impli­ca­tions for how GWASes will be run in years to come. As SNP costs decline from their 2016 cost of ~$50 and whole genomes from ~$1000, sam­ple sizes >500,000 and into the mil­lions will become rou­tine, espe­cially as whole-genome sequenc­ing becomes a rou­tine prac­tice for all babies and for any patients with a seri­ous dis­ease (if noth­ing else, for rea­son­s). Sam­ple sizes in the mil­lions will recover almost the full mea­sured GCTA her­i­tabil­ity of ~0.33 (eg Hsu’s argu­ment that spar­sity pri­ors will recover all of IQ at ~n = 1m); but at that point, addi­tional sam­ples become worth­less as they will not be able to pass the mea­sured ceil­ing of 0.33 and explain the full 0.48. Only bet­ter mea­sure­ments will allow any fur­ther progress. Con­sid­er­ing that a well-run IQ test will cost <$100, the crossover point may well have been passed with cur­rent n = 400k datasets, where resources would be bet­ter put into fewer but bet­ter mea­sured IQ/SNP dat­a­points rather than more low qual­ity IQ/SNP dat­a­points.

GCTA-based upper bound on selection gains

Since half of the additive variance will be shared within a family, we get 0.33/2 = 0.165 within-family variance, which gives √0.165 = 0.406 SD or 6.1 IQ points. (Occasionally within-family differences are cited in a format like “siblings have an average difference of 12 IQ points”, which comes from an SD of ~0.7/0.8, since the expected absolute difference of two independent normals is 2σ/√π; but you could also check what SD yields an average difference of 12 via simulation: eg mean(abs(rnorm(n=1000000, mean=0, sd=0.71) - rnorm(n=1000000, mean=0, sd=0.71))) * 15 → 12.018.) We don’t care about means since we’re only looking at gains, so the mean of the within-family normal distribution can be set to 0.
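Both numbers check out numerically (a quick sketch assuming the h2SNP = 0.33 meta-analytic estimate):

```r
## within-family SD implied by a SNP heritability of 0.33 (siblings share half):
sqrt(0.33/2) * 15          # ~6.1 IQ points
## expected absolute sibling difference with SD ~0.71, via 2*sigma/sqrt(pi):
2 * 0.71 / sqrt(pi) * 15   # ~12 points, agreeing with the simulation
```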

With that, we can write a simulation like Shulman & Bostrom where we generate n samples from a normal distribution with mean 0 & SD 0.406, take the max, and return the difference of the max and mean. There are more efficient ways to compute the expected maximum, however, and so we’ll use a lookup table computed using the lmomco library for small n & an approximation for large n for speed & accuracy. Qualitatively, the max looks like a logarithmic curve: a log curve fit to n = 2-300 fits closely (R²=0.98); to adjust for the PGS variance-explained, we convert to SD and adjust by relatedness, so an approximation of the gain from sibling embryo selection would be gain ≈ E[max of n] × √(variance × 1/2) SDs. (The logarithm immediately indicates that we must worry about diminishing returns and suggests that to optimize embryo selection, we should look for ways around the log term, like multiple stages which avoid going too far into the log’s tail.)

For gen­er­al­ity to other con­tin­u­ous nor­mally dis­trib­uted com­plex traits, we’ll work in stan­dard­ized units rather than the IQ scale (SD=15), but con­vert back to points for eas­ier read­ing:

exactMax <- Vectorize(function (n, mean=0, sd=1) {
    if (n>2000) { ## avoid lmomco bugs at higher _n_, where the approximations are near-exact anyway
        chen1999 <- function(n,mean=0,sd=1){ mean + qnorm(0.5264^(1/n), sd=sd) }
        chen1999(n,mean=mean,sd=sd) } else {
        library(lmomco)
        ## exact expected maximum of n Gaussian order statistics via numerical integration
        ## (memoizing or a precomputed lookup table speeds this up for repeated small-n calls):
        expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) } })
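As a standalone sanity check (not requiring lmomco), the large-n `chen1999` approximation can be compared against known exact values:

```r
## closed-form approximation to the expected maximum of n standard normals:
chen1999 <- function(n, mean=0, sd=1) { mean + qnorm(0.5264^(1/n)) * sd }
chen1999(10)   # ~1.537, vs the exact expected maximum 1.5388
## the exact value for n=2 is 1/sqrt(pi) = 0.5642; the approximation is worst at small n:
chen1999(2)
```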

One important thing to note here: embryo count > PGS. While much discussion of embryo selection obsessively focuses on the PGS—is it more or less than X%? does it pick out the maximum within pairs of siblings more than Y% of the time? (where X & Y are moving goalposts)—for realistic scenarios, the embryo count determines the output much more than the PGS. For example, would you rather select from between a pair of embryos using a PGS with a within-family variance of 10%, or would you rather select from twice as many embryos using a weak PGS with half that predictive power, or are they roughly equivalent? The second! It’s around a third better:

exactMax(4) * sqrt(0.05)
# [1] 0.230175331
exactMax(2) * sqrt(0.10)
# [1] 0.178412412
0.230175331 / 0.178412412
# [1] 1.29013071

Only as n increases far beyond what we see used in human IVF does the rela­tion­ship switch. This is because the nor­mal curve has thin tails and so our ini­tial large gains in the max­i­mum dimin­ish rapid­ly:

## show the locations of expected maxima/minima, demonstrating diminishing returns/thin tails:
x <- seq(-3, 3, length=1000)
y <- dnorm(x, mean=0, sd=1)
extremes <- unlist(Map(exactMax, 1:100))
plot(x, y, type="l", lwd=2,
    xlab="SDs", ylab="Normal density", main="Expected maximum/minimums for Gaussian samples of size n=1-100")
abline(v=c(extremes, extremes*-1), col=rep(c("black","gray"), 200))
Visu­al­iz­ing dimin­ish­ing returns in order sta­tis­tics with increas­ing n in each sam­ple.

It is worth not­ing that the max­i­mum is sen­si­tive to vari­ance, as it increases mul­ti­plica­tively with the square root of variance/the stan­dard devi­a­tion, while on the other hand, the mean is only addi­tive. So an increase of 20% in the stan­dard devi­a­tion means an increase of 20% in the max­i­mum, but an increase of +1SD in the mean is merely a fixed addi­tive increase, with the differ­ence grow­ing with total n. For exam­ple, in max­i­miz­ing the max­i­mum of even just n = 10, it would be much bet­ter (by +0.5SD) to dou­ble the SD from 1SD to 2SD than to increase the mean by +1SD:

exactMax(10, mean=0, sd=1)
# [1] 1.53875273
exactMax(10, mean=1, sd=1)
# [1] 2.53875273
exactMax(10, mean=0, sd=2)
# [1] 3.07750546

One way to visu­al­ize it is to ask how large a mean increase is required to have the same expected max­i­mum as that of var­i­ous increases in vari­ance:


library(ggplot2) ## for qplot
compareDistributions <- function(n=10, varianceMultiplier=2) {
    baselineMax <- exactMax(n, mean=0, sd=1)
    increasedVarianceMax <- exactMax(n, mean=0, sd=varianceMultiplier)
    baselineAdjusted <- increasedVarianceMax - baselineMax

    width <- increasedVarianceMax*1.2
    x1 <- seq(-width, width, length=1000)
    y1 <- dnorm(x1, mean=baselineAdjusted, sd=1)

    x2 <- seq(-width, width, length=1000)
    y2 <- dnorm(x2, mean=0, sd=varianceMultiplier)

    df <- data.frame(X=c(x1, x2), Y=c(y1, y2), Distribution=c(rep("baseline", 1000), rep("variable", 1000)))

    return(qplot(X, Y, color=Distribution, data=df) +
        geom_vline(xintercept=increasedVarianceMax, color="blue") +
        ggtitle(paste0("Variance Increase: ", varianceMultiplier, "x (Difference: +",
            round(digits=2, baselineAdjusted), "SD)")) +
        geom_text(aes(x=increasedVarianceMax*1.01, label=paste0("expected maximum (n=", n, ")"),
            y=0.3), colour="blue", angle=270)) }
library(gridExtra) ## for grid.arrange
p0 <- compareDistributions(varianceMultiplier=1.25) +
 ggtitle("Mean increase required to have equal expected maximum as a more\nvariable distribution\nVariance increase: 1.25x (Difference: +0.38SD)")
p1 <- compareDistributions(varianceMultiplier=1.50)
p2 <- compareDistributions(varianceMultiplier=1.75)
p3 <- compareDistributions(varianceMultiplier=2.00)
p4 <- compareDistributions(varianceMultiplier=3.00)
p5 <- compareDistributions(varianceMultiplier=4.00)
p6 <- compareDistributions(varianceMultiplier=5.00)
grid.arrange(p0, p1, p2, p3, p4, p5, p6, ncol=1)
Illus­trat­ing the increases in expected max­i­mums of nor­mal dis­tri­b­u­tions (for n = 10) due to increases in vari­ance but not mean of the dis­tri­b­u­tion.

Note the vis­i­ble differ­ence in tail den­si­ties implies that the advan­tage of increased vari­ance increases the fur­ther out on the tail one is select­ing from (higher n); I’ve made addi­tional graphs for more extreme sce­nar­ios (n = 100, n = 1000, n = 10000), and cre­ated an inter­ac­tive Shiny app for fid­dling with the n/variance mul­ti­plier.

Apply­ing the order sta­tis­tics code to the spe­cific case of embryo selec­tion on full sib­lings:

## select 1 out of N embryos (default: siblings, who are half-related)
embryoSelection <- function(n, variance=1/3, relatedness=1/2) {
    exactMax(n, mean=0, sd=sqrt(variance*relatedness)); }
embryoSelection(n=10) * 15
# [1] 9.422897577
embryoSelection(n=10, variance=0.444) * 15
# [1] 10.87518323
embryoSelection(n=5, variance=0.444) * 15
# [1] 8.219287927

So 1 out of 10 gives a max­i­mal aver­age gain of ~9 IQ points, less than Shul­man & Bostrom’s 11.5 because of my lower GCTA esti­mate, but using bet­ter IQ tests like the WAIS, we could go as high as ~11 points. With a more real­is­tic num­ber of embryos, we might get 8 points.

For comparison, the full genetic heritability of accurately-measured adult IQ (going far beyond just SNPs or additive effects to include mutation load & de novo mutations, copy-number variation, modeling of interactions etc) is generally estimated at ~0.8, in which case the upper bound on selection out of 10 embryos would be ~14.5 IQ points:

embryoSelection(n=10, variance=0.8) * 15
# [1] 14.59789016
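The same upper-bound arithmetic can be done standalone using the constant E[max of 10] ≈ 1.5388 (this `gain` helper is my restatement of `embryoSelection` for n = 10, without the lmomco dependency):

```r
## gain in IQ points from best-of-10 sibling selection, as a function of PGS variance;
## the sqrt shows the diminishing returns to better polygenic scores:
gain <- function(variance, emax=1.5388, relatedness=1/2) {
    emax * sqrt(variance * relatedness) * 15 }
round(sapply(c(0.10, 0.33, 0.444, 0.8), gain), 1)
## ~5.2, 9.4, 10.9, 14.6 points respectively
```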

For intu­ition, an ani­ma­tion:

library(MASS)      ## for mvrnorm
library(animation) ## for saveGIF
plotSelection <- function(n, variance, relatedness=1/2) {
    r = sqrt(variance*relatedness)

    data = mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
    df <- data.frame(Trait=data[,1], PGS=data[,2], Selected=max(data[,2]) == data[,2])

    trueMax <- max(df$Trait)
    selected <- df[df$Selected,]$Trait
    regret <- trueMax - selected

    return(qplot(PGS, Trait, color=Selected, size=I(9), data=df) +
        coord_cartesian(ylim = c(-2.5,2.5), xlim=c(-2.5,2.5), expand=FALSE) +
        geom_hline(yintercept=0, color="red") +
        labs(title=paste0("Selection hypothetical (higher=better): with n=", n, " samples & PGS variance=", round(variance,digits=2),
            ". Performance: true max: ", round(trueMax, digits=2), "; selected: ", round(selected, digits=2),
            "; regret: ", round(regret, digits=2)))) }
saveGIF({
    for (i in 1:100) {
      n   <- max(3, round(rnorm(1, mean=6, sd=3)))
      pgs <- runif(1, min=0, max=0.5)
      p <- plotSelection(n, pgs)
      print(p) } },
    interval=0.8, ani.width = 1000, ani.height=800,
    movie.name = "embryo-selection.gif")
Sim­u­la­tion of true trait value vs poly­genic score in an embryo selec­tion sce­nario for var­i­ous pos­si­ble n and poly­genic score pre­dic­tive pow­er.

It is often claimed that a ‘small’ r correlation or predictive power is, a priori, of no use for any practical purposes; this is incorrect, as the value of any particular r is inherently context & decision-specific—a small r can be highly valuable for one decision problem, and a large r could be useless for another, depending on the use, the costs, and the benefits. Ranking is easier than prediction; accurate prediction implies accurate ranking, but not vice-versa—one can have an accurate comparison of two datapoints while the estimate of each one’s absolute value is highly noisy. One way to think of it is to note that Pearson’s r correlation can be converted to a rank correlation (Spearman’s ρ), and for normal variables like this, they are near-identical; so a PGS of 10% variance or r = 0.31 means that every SD increase in PGS is equivalent to a 0.31 SD increase in rank.
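A concrete illustration of ranking being easier than prediction: for bivariate normals, the probability that the higher-scored of a random pair also has the higher true value is 1/2 + arcsin(r)/π (the concordance probability underlying Kendall's τ; this example is mine, not from the original text):

```r
## chance the better-scored of 2 datapoints is truly better, given correlation r:
pConcordant <- function(r) { 1/2 + asin(r)/pi }
round(pConcordant(0.31), 2)  # ~0.60: well above the 0.50 of chance, despite 'only' 10% variance
round(pConcordant(sqrt(0.444)), 2)
```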

In particular, it has long been noted in industrial psychology & psychometrics that a tiny r²/r bivariate correlation between a test and a latent variable can considerably enhance the probability of selecting datapoints passing a given threshold (eg Taylor & Russell 1939), and this is increasingly true the more stringent the threshold (tail effects again!); this also applies to embryo selection, since we can define a threshold as being set at the best of n embryos.

This helps explain why the PGS’s power is not as over­whelm­ingly impor­tant to embryo selec­tion as one might ini­tially expect; cer­tain­ly, you do need a decent PGS, but it is only a start­ing point & one of sev­eral vari­ables, and expe­ri­ences dimin­ish­ing returns, ren­der­ing it not nec­es­sar­ily as impor­tant a para­me­ter as the more obscure “num­ber of embryos” para­me­ter. A metaphor here might be that of bias­ing some dice to try to roll a high score: while ini­tially mak­ing the dice more loaded does help increase your total score, the gain quickly shrinks com­pared to being able to add a few more dice to be rolled.

The main metric we are interested in is average gain. Other metrics, like ‘the probability of selecting the maximum’, are interesting but not necessarily important or informative. Selecting the maximum is irrelevant because most screening problems are not like the Olympics, where the difference between #1 & #2 is the difference between glory & obscurity; missing the true maximum may mean only a slight difference on some trait, and #2 was almost as good. As n increases, our ‘regret’ from not selecting the true maximum grows only slowly. And counterintuitively: as we increase n, the probability of selecting the maximum becomes ever smaller, simply because more n means more chances to make an error, and it asymptotically converges on 0. Yet, we would greatly prefer to select the max out of a million n rather than 1!

We have already seen how expected gain increases with n, so some further order-statistics plots can help visualize the three-way relationship between probability of optimal selection/regret, number of embryos, and PGS variance:

## consider first column as true latent genetic scores, & the second column as noisy measurements correlated _r_:
library(MASS) # for mvrnorm
generateCorrelatedNormals <- function(n, r) {
    mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2)) }

## consider plausible scenarios for IQ-related non-massive simple embryo selection, so 2-50 embryos;
## and PGS variance well below the ~80% heritability ceiling, so 1-49%:
scenarios <- expand.grid(Embryo.n=2:50, PGS.variance=seq(0.01, 0.50, by=0.02), Rank.mean=NA, P.max=NA, P.min=NA, P.below.mean=NA, P.minus.two=NA, Regret.SD=NA)
for (i in 1:nrow(scenarios)) {
 n = scenarios[i,]$Embryo.n
 r = sqrt(scenarios[i,]$PGS.variance * 0.5) # relatedness deflation for the ES context
 iters = 500000
 sampleStatistics <- function(n,r) {
     sim <- generateCorrelatedNormals(n, r=r)
     max1_l  <- max(sim[,1])    # best true latent score
     max1i_l <- which.max(sim[,1])
     max2i_m <- which.max(sim[,2])
     gain <- sim[,1][max2i_m]   # true score of the embryo selected on its noisy measurement
     rank <- which(sim[,2][max1i_l] == sort(sim[,2]))

     ## P(max): if the max of the noisy measurements is a different index than the max or min of the true latents,
     ## then embryo selection fails to select the best/maximum or selects the worst.
     ## If n=1, trivially P.max/P.min=1 & Regret=0; if r=0, P.max/P.min = 1/n;
     ## if r=1, P.max=1 & P.min=0; r=0-1 can be estimated by simulation:
     P.max <- max2i_m == max1i_l
     ## P(min): if our noisy measurement led us to select the worst point rather than best:
     P.min <- which.min(sim[,1]) == max2i_m
     ## P(<avg): whether we managed to at least boost above mean of 0
     P.below.mean <- gain < 0
     ## P(IQ(70)): whether the point falls below -2SDs
     P.minus.two <- gain <= -2

     ## Regret is the difference between the true latent's maximum, and the true score
     ## for the index with the maximum of the noisy measurements, which if a different index,
     ## means a loss and thus non-zero regret.
     ## If r=0, expected regret = the expected max order statistic; if r=1, regret=0; in between, simulation:
     Regret.SD <- max1_l - gain
     return(c(P.max, P.min, P.below.mean, P.minus.two, Regret.SD, rank)) }
 sampleAverages <- colMeans(t(replicate(iters, sampleStatistics(n,r))))
 # print(c(n,r,sampleAverages))
 scenarios[i,]$P.max        <- sampleAverages[1]
 scenarios[i,]$P.min        <- sampleAverages[2]
 scenarios[i,]$P.below.mean <- sampleAverages[3]
 scenarios[i,]$P.minus.two  <- sampleAverages[4]
 scenarios[i,]$Regret.SD    <- sampleAverages[5]
 scenarios[i,]$Rank.mean    <- sampleAverages[6]
}

library(ggplot2); library(gridExtra)
p0 <- qplot(Embryo.n, Rank.mean, color=as.ordered(PGS.variance), data=scenarios) +
    theme(legend.title=element_blank()) + geom_abline(slope=1, intercept=0) +
    ggtitle("Expected true rank after selecting best out of N embryos based on PGS score (idealized, excluding IVF losses)")
p1 <- qplot(Embryo.n, P.max, color=as.ordered(PGS.variance), data=scenarios) +
    coord_cartesian(ylim = c(0,0.84)) + theme(legend.title=element_blank()) +
    ggtitle("Probability of selecting best out of N embryos as function of PGS score (*)")
p2 <- qplot(Embryo.n, P.min, color=as.ordered(PGS.variance), data=scenarios) +
    coord_cartesian(ylim = c(0,0.48)) + theme(legend.title=element_blank()) +
    ggtitle("Probability of mistakenly selecting worst out of N embryos (*)")
p3 <- qplot(Embryo.n, P.below.mean, color=as.ordered(PGS.variance), data=scenarios) +
    coord_cartesian(ylim = c(0,0.5)) + theme(legend.title=element_blank()) +
    ggtitle("Probability of mistakenly selecting below-average out of N embryos (*)")
p4 <- qplot(Embryo.n, P.minus.two, color=as.ordered(PGS.variance), data=scenarios) +
    coord_cartesian(ylim = c(0,0.02)) + theme(legend.title=element_blank()) +
    ggtitle("Probability of selecting below -2SDs out of N embryos (*)")
p5 <- qplot(Embryo.n, Regret.SD, color=as.ordered(PGS.variance), data=scenarios) +
    theme(legend.title=element_blank()) +
    ggtitle("Loss from non-omniscient selection from N embryos in SDs (*)")
grid.arrange(p0, p1, p2, p3, p4, p5, ncol=1)
Graphs of the expected rank of top select­ed, prob­a­bil­ity of mak­ing an ideal selec­tion, mak­ing a pes­si­mal selec­tion, and expected regret, for a sim­ple ide­al­ized order-s­ta­tis­tics sce­nario where the set of sam­ples is mea­sured with a noisy vari­able cor­re­lated r with the latent vari­ables (such as a PGS pre­dict­ing adult IQ).

Some obser­va­tions:

  • PGSes can eas­ily mat­ter less than n
  • expected rank steadily increases in n almost regard­less of PGS
  • probability of selecting the minimum, or an extremely low value like −2SD, rapidly approaches 0, implying that substantial tail-risk reduction is easy
  • prob­a­bil­ity of mak­ing a below aver­age selec­tion, imply­ing no or neg­a­tive gain and a point­less selec­tion, decreases rel­a­tively slow­ly; the aver­age gain goes up, but not every selec­tion will work out­—they merely work out increas­ingly well on aver­age. Like many med­ical treat­ments, the ‘num­ber needed to treat’ will likely always be >1.

Polygenic scores

“‘Should we trust mod­els or obser­va­tions?’ In reply we note that if we had obser­va­tions of the future, we obvi­ously would trust them more than mod­els, but unfor­tu­nately obser­va­tions of the future are not avail­able at this time.”

Knut­son & Tuleya 2005, “Reply”

A SNP-based poly­genic score works much the same way: it explains a cer­tain frac­tion or per­cent­age of the vari­ance, halved due to sib­lings, and can be plugged in once we know how much less than 0.33 it is. An exam­ple of using SNP poly­genic scores to iden­tify genetic influ­ences and ver­ify they work with­in-fam­ily and are not con­founded would be Domingue et al 2015’s “Poly­genic Influ­ence on Edu­ca­tional Attain­ment”.
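
To make the ‘plugged in’ step concrete, here is a minimal sketch of the selection-gain calculation (the same order-statistic logic as the `embryoSelection` calls below: gain = between-sibling PGS correlation × expected maximum of n standard normals):

```r
## expected maximum of n iid standard normals, by numerical integration:
expectedMax <- function(n)
    integrate(function(z) z * n * dnorm(z) * pnorm(z)^(n-1), -Inf, Inf)$value
## expected gain (in SDs) from picking the best of n embryos using a PGS explaining
## `variance` of trait variance, halved between siblings:
selectionGain <- function(n, variance) { sqrt(variance * 0.5) * expectedMax(n) }
selectionGain(n=10, variance=0.33)  * 15  # upper bound from SNP heritability: ~9.4 IQ points
selectionGain(n=10, variance=0.035) * 15  # Selzam et al 2016 PGS: ~3.05 IQ points
```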

Past poly­genic scores for intel­li­gence:

  1. Davies et al 2011:

    0.5776% of fluid intel­li­gence in the NCNG repli­ca­tion sam­ple, if I’ve under­stood their analy­sis cor­rect­ly.

  2. Rietveld et al 2013:

    This land­mark study pro­vid­ing the first GWAS hits on intel­li­gence also esti­mated mul­ti­ple poly­genic scores: the full poly­genic scores pre­dicted 2% of vari­ance in edu­ca­tion, and 2.58% of vari­ance in cog­ni­tive func­tion (Swedish enlist­ment cog­ni­tive test bat­tery) in a Swedish repli­ca­tion sam­ple, and also per­formed well in with­in-fam­ily set­tings (0.31% & 0.19% & 0.41/0.76% of vari­ance in attend­ing col­lege & years of edu­ca­tion & test bat­tery, respec­tive­ly, in Table S25).

    • Rietveld et al 2014a:

      Repli­ca­tion of the 3 Rietveld et al 2013 SNP hits, fol­lowed by repli­ca­tion of the PGS in STR & QIMR (non-fam­i­ly-based), then a with­in-fam­ily sib­ling com­par­i­son using Fram­ing­ham Heart Study (FHS). The 3 hits repli­cat­ed; the EDU PGS (in the 20 PC model which most closely cor­re­sponds to Rietveld et al 2013’s GWAS) pre­dicted 0.0265/0.0069, the col­lege PGS 0.0278/0.0186; and the over­all FHS EDU PGS was 0.0140 and the with­in-fam­ily sib­ling com­par­i­son was 0.0036.

  3. Benyamin et al 2014 (supplement):

    0.5%, 1.2%, 3.5% (3 cohorts; no with­in-fam­ily sib­ling test).

  4. Kirkpatrick et al 2014:

    0.55% (max­i­mum in sub­-sam­ples: 0.7%)

  5. “Common genetic variants associated with cognitive performance identified using the proxy-phenotype method”, Rietveld et al 2014b:

    Pre­dicts “0.2% to 0.4%” of vari­ance in cog­ni­tive per­for­mance & edu­ca­tion using a small poly­genic score of 69 SNPs; no full PGS is report­ed. (For other very small poly­genic score uses, see also Domingue et al 2015 & Zhu et al 2015.) Also tests the small PGS in both across-fam­ily and with­in-fam­ily between-si­b­ling set­tings, reported in the sup­ple­ment; no pooled result, but by cohort (GS/MCTFR/QIMR/STR): 0.0023/0.0022/0.0041/0.0044 vs 0.0007/0.0007/0.0002/0.0015.

  6. Ward et al 2014:

    English/mathematics grades: 0.7%/0.16%. Based on Rietveld et al 2013.

  7. “Poly­genic scores asso­ci­ated with edu­ca­tional attain­ment in adults pre­dict edu­ca­tional achieve­ment and ADHD symp­toms in chil­dren”, de Zeeuw et al 2014:

    Education/school grades in NTR; based on Rietveld et al 2013. Edu­ca­tional achieve­ment, Arith­metic: 0.012/0.021; Lan­guage: 0.021/0.028; Study Skills: 0.016/0.017; Science/Social Stud­ies: 0.006/0.013; Total Score: 0.024/0.022. School grades, Arith­metic: 0.025/0.027; Lan­guage: 0.033/0.025; Read­ing: 0.031/0.042. de Zeeuw et al 2014 appears to report a with­in-fam­ily com­par­isons using fra­ter­nal twins/siblings, rather than a gen­eral pop­u­la­tion PGS per­for­mance. (“For each analy­sis, the pre­dic­tor and the out­come mea­sure were stan­dard­ized within each sub­set of chil­dren with data avail­able on both. To cor­rect for depen­dency of the obser­va­tions due to fam­ily clus­ter­ing an addi­tive genetic vari­ance com­po­nent was included as a ran­dom effect based on the fam­ily pedi­gree and depen­dent on zygos­i­ty.”)

  8. Conley et al 2015:

    Edu­ca­tion in FHS/HRS, gen­eral pop­u­la­tion: 0.02/0.03 (Table 4, col­umn 2). With­in-fam­ily in FHS, 0.0124 (Table 6, col­umn 1/3).

  9. “Poly­genic influ­ence on edu­ca­tional attain­ment: New evi­dence from the national lon­gi­tu­di­nal study of ado­les­cent to adult health”, Domingue et al 2015 (sup­ple­ment):

    Edu­ca­tion & ver­bal intel­li­gence in National Lon­gi­tu­di­nal Study of Ado­les­cent to Adult Health (ADD Health). Edu­ca­tion, gen­eral pop­u­la­tion: 0.06/0.02; with­in-fam­ily between-si­b­ling, 0.06? (Table 3). Ver­bal intel­li­gence, gen­eral pop­u­la­tion: 0.0225/0.0196 (Table 2); with­in-fam­ily between-si­b­ling, 0.0049?

  10. “Genetic con­tri­bu­tions to vari­a­tion in gen­eral cog­ni­tive func­tion: a meta-analy­sis of genome-wide asso­ci­a­tion stud­ies in the CHARGE con­sor­tium (N=53 949)”, Davies et al 2015:

    1.2%, intel­li­gence.

  11. Lencz et al 2013/2014:

    Meta-analy­sis of n = 5000 COGENT cohorts, using extracted fac­tor; 0.40-0.45% PGS for intel­li­gence in the MGS/GAIN cohort.

  12. Hagenaars et al 2016 (supplementary data):

    Sup­ple­men­tary Table 4d reports pre­dic­tive valid­ity of the edu­ca­tional attain­ment poly­genic score for childhood-cognitive-ability/college-degree/years-of-education in its other sam­ples, yield­ing R2=0.0042/0.0214/0.0223 or 0.42%/2.14%/2.23% respec­tive­ly. Par­tic­u­larly intrigu­ing given its inves­ti­ga­tion of pleiotropy is Sup­ple­men­tary Table 5, which uses poly­genic scores con­structed for all the dis­eases in its data (eg type 2 dia­betes, ADHD, schiz­o­phre­nia, coro­nary artery dis­ease), where all the dis­ease scores & covari­ates are entered into the model and then the cog­ni­tive poly­genic scores are able to pre­dict even high­er, as high as R2=0.063/0.046/0.064.

  13. Davies et al 2016:

    The Biobank poly­genic score con­structed for “ver­bal-nu­mer­i­cal rea­son­ing” pre­dicted 0.98%/1.32% of g/gf scores in Gen­er­a­tion Scot­land, and 2.79% in Loth­ian Birth Cohort 1936 (Fig­ure 2).

  14. Ibrahim-Verbaas et al 2016

    Does not report a poly­genic score.

  15. In 2016, a consortium combined the SSGAC dataset with UK Biobank, expanding the combined dataset to n > 300,000 and yielding a total of 162 education hits; the results were reported in two papers, the latter giving the polygenic scores:

    The polygenic score predicts 3.5% of intelligence, 7% of family SES, and 9% of education in a heldout sample. Education is predicted in a within-family between-sibling setting as well, with betas of 0.215 vs 0.625, R2s not provided (“Extended Data Figure 3” in Okbay paper; section “2.6. Significance of the Polygenic Scores in a WF regression” in first Okbay supplement; “Supplementary Table 2.2” in second Okbay supplement).

    The Okbay et al 2016 PGS has been used in a number of studies, including one reporting r = 0.18, or 3.24% of variance, across 4 samples (UKBB / Dunedin Study / Brain Genomics Superstruct Project (GSP) / Duke Neurogenetics Study (DNS)).

  16. Kong et al 2017b (supplement; published version):

    Edu­ca­tion: gen­eral pop­u­la­tion, 4.98%. See also “Par­ent and off­spring poly­genic scores as pre­dic­tors of off­spring years of edu­ca­tion”, Willoughby & Lee 2017 abstract.

  17. Trampush et al 2017: no polygenic score reported

  18. Sniekers et al 2017

    336 SNPs, and ~3% on aver­age in the held-out sam­ples, peak­ing at 4.8%. (Thus it likely does­n’t out­per­form Okbay/Selzam et al 2016, but does demon­strate the sam­ple-effi­ciency of good IQ mea­sure­ments.)

  19. Bates et al 2018:

    Edu­ca­tion using Okbay et al 2016 in the Bris­bane Ado­les­cent Twin Study on Queens­land Core Skills Test (QCST), with­in-fam­ily between-si­b­ling com­par­ison: beta=0.15 (so 0.0225?).

  20. “Epi­ge­netic vari­ance in dopamine D2 recep­tor: a marker of IQ mal­leabil­i­ty?”, Kamin­ski et al 2018:

    Odd ana­lytic choices aside (why inter­ac­tions rather than a medi­a­tion mod­el?), they pro­vide repli­ca­tions of Benyamin et al 2014 and Sniek­ers et al 2017 in IMAGEN; both are highly sim­i­lar: 0.33% and 3.2% (1.64-5.43%).

  21. Zabaneh et al 2017

    1.6%/2.4% of intel­li­gence. Like Spain et al 2016, this uses the TIP high­-IQ sam­ple in a liability-threshold/dichotomous/case-control approach, but the poly­genic score is com­puted on the held­out nor­mal IQ scores from the TEDS twin sam­ple so it is equiv­a­lent to the other poly­genic scores in pre­dict­ing pop­u­la­tion intel­li­gence; they esti­mated it on a 4-test IQ score and a 16-test IQ score (the lat­ter being more reli­able), respec­tive­ly. Despite the sam­ple-effi­ciency gains from using high­-qual­ity IQ tests in TIP/TEDS and the high­-IQ enrich­ment, the TIP sam­ple size (n = 1238) is not enough to sur­pass the Selzam et al 2016 poly­genic score (based on edu­ca­tion prox­ies from 242x more peo­ple).

  22. Krapohl et al 2017

    10.9% education / 4.8% intelligence; this is methodologically interesting because it exploits polygenic scores from genetically-correlated traits (chosen using informative priors, although unfortunately only on the PGS level and not the SNP level) to increase the original PGS by ~1.2%, from 3.6% to 4.8%.

  23. Hill et al 2017

    Like Krapohl et al 2017, the use of multiple genetic correlations to overcome measurement error greatly boosts the efficiency of IQ GWASes, and provides the best public polygenic score to date: 7% of variance in a held-out Generation Scotland sample. This illustrates a good way to work around the shortage of high-quality IQ test scores by exploiting multiple more easily-measured phenotypes.

  24. Hill et al 2018

    An extension of Hill et al 2017 increasing the effective sample size considerably; the UKBB sample for testing the polygenic score, using the short multiple-choice test, gives ~6% variance explained. (The lower PGS despite the larger sample & hits may be due to the use of a different sample with a worse IQ measure.)

  25. Savage et al 2017


  26. Lello et al 2017

    Demon­strates Hsu’s lasso on height, heel bone den­si­ty, and years of edu­ca­tion in UKBB, recov­er­ing 40% (ie almost the entire SNP her­i­tabil­i­ty), 20%, and 9% respec­tive­ly; given the rg with intel­li­gence and Krapohl et al 2017’s 10.9% edu­ca­tion PGS con­vert­ing to 4.8% intel­li­gence, Lello et al 2017’s edu­ca­tion PGS pre­sum­ably also per­forms ~4.5% on intel­li­gence. This is worse than Hill et al 2017, but it is impor­tant in prov­ing Hsu’s claims about the effi­cacy of the las­so: the impli­ca­tion is that around n > 1m (de­pend­ing on mea­sure­ment qual­i­ty), the intel­li­gence PGS will undergo a sim­i­lar jump in pow­er. Given the rapidly expand­ing datasets avail­able to UKBB and SSGAC, and com­bined with MTAG and other refine­ments, it is likely that the best intel­li­gence PGS will jump from Hill’s 7% to 25-30% some­time 2018-2019.

  27. Maier et al 2018

    Another genetic-correlation boosting paper; the boosted fluid intelligence PGS appears to remain minor, at ~3% variance.

  28. Davies et al 2018

    4.3%. (Followup/expansion of the preprint version.)

  29. Lee et al 2018 (supplement; summary statistics)

    The long-awaited SSGAC EA3 paper, which constructs a PGS predicting 11-13% of variance in education & 7-10% in IQ, along with extensive additional analyses including 4 within-family tests of the causal power of the education PGS (“we estimate that within-family effect sizes are roughly 40% smaller than GWAS effect sizes and that our assortative-mating adjustment explains at most one third of this deflation. (For comparison, when we apply the same method to height, we found that the assortative-mating adjustment fully explains the deflation of the within-family effects.)…The source of bias conjectured here operates by amplifying a true underlying genetic effect and hence would not lead to false discoveries. However, the environmental amplification implies that we should usually expect GWAS coefficients to provide exaggerated estimates of the magnitude of causal effects.”)

  30. Barth et al 2018

    Replication of Lee et al 2018: in their heldout HRS sample, the PGS predicted 10.6% of variance in EDU (after removing HRS from the Lee et al 2018 PGS); the PGS has since seen further use in HRS.

  31. Rustichini et al 2018

    Repli­cates Lee et al 2018 PGS between-par­ents & between-si­b­lings in the Min­nesota Twin Fam­ily Study (MTFS), pre­dict­ing 9% vari­ance IQ in both sam­ples.

  32. Allegrini et al 2018:

    11% IQ / 16% EDU. Lee et al 2018’s PGS was used in the TEDS cohort, and the PGS’s power was boosted by use of MTAG/GSEM & by looking at scores from older ages (possibly benefiting from the Wilson effect).

  33. de la Fuente et al 2019:

    UKBB reanaly­sis (n= 11,263–331,679), 3.96% genetic g (not IQ), plus PGSes for indi­vid­ual tests (Sup­ple­ment table S7, g vs PGS sub­sec­tion); the focus here is using GSEM to pre­dict not some lump-sum proxy for intel­li­gence like EDU or total score, but fac­tor model the avail­able tests as being influ­enced by the latent g intel­li­gence fac­tor and also test-spe­cific sub­fac­tors. This is the true struc­ture of the data, and ben­e­fits from the genetic cor­re­la­tions with­out set­tling for the low­est com­mon denom­i­na­tor. This makes sub­tests much more pre­dictable:

    Con­sis­tent with the Genomic SEM find­ings that indi­vid­ual cog­ni­tive out­comes are asso­ci­ated with a com­bi­na­tion of genetic g and spe­cific genetic fac­tors, we observed a pat­tern in which many of the regres­sion mod­els that included both the poly­genic score (PGS) from genetic g and test-spe­cific PGSs were con­sid­er­ably more pre­dic­tive of the cog­ni­tive phe­no­types in Gen­er­a­tion Scot­land than regres­sion mod­els that included only either a genetic g PGS or a PGS for a sin­gle test. A par­tic­u­larly rel­e­vant excep­tion involved the Digit Sym­bol Sub­sti­tu­tion test in Gen­er­a­tion Scot­land, which is a sim­i­lar test to the Sym­bol Digit Sub­sti­tu­tion test in UK Biobank, for which we derived a PGS. We found that the pro­por­tional increase in R2 in Digit Sym­bol by the Sym­bol Digit PGS beyond the genetic g PGS was <1%, whereas the genetic g PGS improved poly­genic pre­dic­tion beyond the Sym­bol Digit PGS by over 100%, reflect­ing the power advan­tage obtained from inte­grat­ing GWAS data from mul­ti­ple genet­i­cally cor­re­lated cog­ni­tive traits using a genetic g mod­el. An inter­est­ing coun­ter­point is the PGS for the VNR test, which is unique in the UK Biobank cog­ni­tive test bat­tery in index­ing ver­bal knowl­edge (24,31). High­light­ing the role of domain-spe­cific fac­tors, a regres­sion model that included this PGS and the genetic g PGS pro­vided sub­stan­tial incre­men­tal pre­dic­tion rel­a­tive to the genetic g PGS alone for those Gen­er­a­tion Scot­land phe­no­types most directly related to ver­bal knowl­edge: Mill Hill Vocab­u­lary (62.45% increase) and Edu­ca­tional Attain­ment (72.59%).

  34. “Genetic influ­ence on social out­comes dur­ing and after the Soviet era in Esto­nia”, Rim­feld et al 2018;

    Like Okbay, these papers repli­cate EDU/IQ PGSes in cohorts far removed from the dis­cov­ery cohorts, and inves­ti­gate PGS valid­ity & SNP her­i­tabil­ity changes over time; both increase greatly post-Com­mu­nism, reflect­ing bet­ter oppor­tu­ni­ties and mer­i­toc­ra­cy.

GWAS improvements

These results only scratch the sur­face of what is pos­si­ble.

In some ways, cur­rent GWASes for intel­li­gence are the worst meth­ods that could work, as their many flaws in pop­u­la­tion, data mea­sure­ment, analy­sis, and inter­pre­ta­tion reduce their pow­er; some of the most rel­e­vant flaws for intel­li­gence GWASes would be:

  • pop­u­la­tion: cohorts are designed for eth­nic homo­gene­ity to avoid ques­tions about con­founds, though cross-eth­nic GWASes (par­tic­u­larly ones includ­ing admixed sub­jects) would be bet­ter able to locate causal SNPs by inter­sect­ing hits between differ­ent LD pat­terns

  • data:

    • mis­guided legal & “med­ical ethics” & pri­vacy con­sid­er­a­tions impede shar­ing of indi­vid­u­al-level data, lead­ing to low­er-pow­ered tech­niques (such as LD score regres­sion or ran­dom-effects meta-analy­sis) being nec­es­sary to pool results across cohorts, which meth­ods them­selves often bring in addi­tional losses (such as not using mul­ti­level mod­els to pool/shrink the meta-an­a­lytic esti­mates)
    • exist­ing GWASes sequence lim­ited amounts of SNPs rather than whole genomes
    • impu­ta­tion is often not used or is done based on rel­a­tively small & old datasets like 1000 Genomes, though it would assist the SNP data in cap­tur­ing rarer vari­ants
    • ful­l-s­cale IQ tests taken over mul­ti­ple days by a pro­fes­sional are typ­i­cally not used, and the hier­ar­chi­cal nature of intel­li­gence & cog­ni­tive abil­ity is entirely ignored, mak­ing for SNP effects reflect­ing a mish-mash aver­age effect
    • genetic cor­re­la­tions are not employed to cor­rect for the large amounts of trait mea­sure­ment error or tap into shared causal path­ways
    • edu­ca­tion is the usual mea­sured phe­no­type despite not being that great a mea­sure of intel­li­gence, and even the edu­ca­tion mea­sure­ments are rife with mea­sure­ment error (eg using “years of edu­ca­tion”, as if every year of edu­ca­tion were equally diffi­cult, every school equally chal­leng­ing, every major equally g-load­ed, or every degree equal)
    • func­tional data, such as gene expres­sion, is not used to boost the prior prob­a­bil­ity of rel­e­vant vari­ants
  • analy­sis:

    • principal components & LDSC & other methods employed to control for population structure may be highly conservative/biased & potentially reduce GWAS hits by as much as 20% (eg Yengo et al 2018)
    • for com­pu­ta­tional effi­cien­cy, SNPs are often regressed one at a time rather than simul­ta­ne­ous­ly, increas­ing vari­ance entirely unnec­es­sar­ily as even the vari­ance explained by already-found SNPs remains (see eg Loh et al 2018)
    • no attempts are made at includ­ing covari­ates like age or child­hood envi­ron­ment which will affect intel­li­gence scores
    • inter­ac­tions are not included in the lin­ear mod­els
    • genetic correlations/covariances and factorial structure are typically not modeled even when the traits in question are best treated as structural equation models, limiting both power and possible inferences (but see the recently-introduced Genomic SEM, demonstrated on factor analysis of genetic g)
    • the linear models are also highly unrealistic & weak in using flat priors on SNP effect sizes while not using informative priors/multilevel pooling/shrinkage/variable-selection techniques, which could dramatically boost power by ignoring noise & focusing on the most relevant SNPs while inferring realistic distributions of effect sizes (eg Vattikuti et al 2014 / Ho & Hsu 2015 / Loh et al 2018 / Chung et al 2019 / …)
    • NHST think­ing leads to strin­gent mul­ti­ple-cor­rec­tion & focus on the arbi­trary thresh­old of genome-wide sta­tis­ti­cal-sig­nifi­cance while down­play­ing full poly­genic scores, allow­ing only the few hits with the high­est pos­te­rior prob­a­bil­ity to be con­sid­ered in any sub­se­quent analy­ses or dis­cus­sions (en­sur­ing few false pos­i­tives at the cost of reduc­ing power even fur­ther in the orig­i­nal GWAS & all down­stream uses)
    • no hyper­pa­ra­me­ter tun­ing of the GWAS is done: pre­pro­cess­ing val­ues for qual­ity con­trol, impu­ta­tion, p-value thresh­old­ing, and ‘clump­ing’ of vari­ants in close LD are set by con­ven­tion and are not in any way opti­mal ()
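
The one-SNP-at-a-time point can be illustrated with a toy simulation (hypothetical numbers: 10 equicorrelated ‘SNPs’, only one causal), showing how marginal regressions smear the causal signal across correlated neighbors while a joint regression does not:

```r
library(MASS) # for mvrnorm
set.seed(2016)
n <- 5000; p <- 10
Sigma <- matrix(0.5, p, p); diag(Sigma) <- 1  # predictors in 'LD', r=0.5
X <- mvrnorm(n, mu=rep(0, p), Sigma=Sigma)
beta <- c(0.3, rep(0, p-1))                   # only the first variant is causal
y <- drop(X %*% beta) + rnorm(n)
marginal <- sapply(1:p, function(j) coef(lm(y ~ X[,j]))[2])
joint    <- coef(lm(y ~ X))[-1]
round(rbind(marginal, joint), digits=2)
## the marginal regressions assign ~0.15 to every null variant correlated with the
## causal one; the joint regression recovers ~0.3 for it and ~0 for the rest.
```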

This is not to criticize the authors of those GWASes—they are generally doing the best that they can with existing datasets in a hostile intellectual & funding climate, using the standard methods rather than taking risks on better but more exotic & unfamiliar methods, and their results nevertheless are intellectually important, reliable, & useful; the point is rather that better results will inevitably arrive as data & computation become more plentiful and the older results slowly trickle out & change minds.

Since these scores over­lap and are not, like GCTA esti­mates, inde­pen­dent mea­sure­ments of a vari­able, there is lit­tle point in meta-an­a­lyz­ing them other than to esti­mate growth over time (even using them as an ensem­ble would­n’t be worth the com­plex­i­ty, and in any case, most stud­ies do not pro­vide the full list of beta val­ues mak­ing up the poly­genic score); for our pur­pose, the largest poly­genic score is the impor­tant num­ber. (Emil Kirkegaard notes that the poly­genic scores are also ineffi­cient: poly­genic scores are not always pub­lished, not always based on indi­vid­ual patient data, and gen­er­ally use max­i­mum-like­li­hood esti­ma­tion neglect­ing our strong pri­ors on the num­ber of hits & dis­tri­b­u­tion of effect sizes. But these pub­lished scores are what we have as of Jan­u­ary 2016, so we must make do.)

Selzam et al 2016’s reported poly­genic score for cog­ni­tive per­for­mance was 3.5%. Thus:

selzam2016 <- 0.035
embryoSelection(n=10, variance=selzam2016) * 15
# [1] 3.053367791

Incidentally, one might wonder: why not use the EDU/EA PGSes, given that their variance-explained is so much higher & education is a large part of how intelligence causes benefits? It would be reasonable to use them, alone or in conjunction, but I have several reasons for not preferring them:

  1. the greater performance thus far is not because ‘years of education’ is inherently more important or more heritable or less polygenic or anything like that; on an n-for-n basis, GWASes with good IQ measurements work much better. The greater performance is driven mostly by the fact that education is a basic demographic variable which is routinely recorded in datasets and easily asked if not, allowing for far larger combined sample sizes. If there were any dataset of 1.1m individuals with high-quality IQ scores, the IQ PGS from it would surely be far better than the IQ PGS created by Lee et al 2018 on 1.1m EDU. Unfortunately, there is no such dataset and likely will not be for a while.

  2. ‘years of education’ is a crude measurement which captures neither selectivity nor gains from schooling: it lumps all ‘schooling’ together, and it’s unclear to what extent it captures the desirable benefits of formal education, like learning, as opposed to more undesirable behavior like procrastinating on life by going to grad school, or going to community college or a less selective college and dropping out (even though that may be harmful and incur a lifetime of student debt); valuing “years of education” is like valuing a car by how many kilograms of metal it takes to manufacture—it treats a cost as a benefit. The causal nature of benefits from more years of formal education is likewise less clear than from IQ. ‘Years of education’ is not even particularly meaningful in an absolute sense, as governments can simply mandate that children go to school longer, though this appears to have few benefits and simply fuels arms races for more educational credentials, increasing the higher-education premium rather than reducing it; Okbay/Selzam et al 2016 include a nice graph showing how Swedish school changes mandating more attendance reduced the PGS predictive performance (or ‘penetrance’) of EDU, as would be expected, although it seems doubtful such a mandate had any of the many consequences which were hoped for… On the other hand, the relationship between IQ and good outcomes like income has been stable over the 20th century (Strenze et al 2007), and given the absence of any selection for intelligence now (or outright dysgenics), and near-universal forecasts among economists that future economies will draw at least as much or more on intelligence as past economies, it is highly unlikely that intelligence will become of less value.

    In gen­er­al, intel­li­gence appears much more con­vinc­ingly causal, more likely to have pos­i­tive exter­nal­i­ties and cause gains in pos­i­tive-sum effect games rather than neg­a­tive-sum positional/signaling games, so I am more com­fort­able using esti­mates for intel­li­gence as I believe they are much more likely to be under­es­ti­mates of the true long-term soci­etal all-in­clu­sive effects, while edu­ca­tion could eas­ily be over­es­ti­mat­ed.

  3. the genetic cor­re­la­tions of EDU/EA are not as uni­formly pos­i­tive as they are for IQ (de­spite the high genetic cor­re­la­tion between the two, illus­trat­ing the non-tran­si­tiv­ity of cor­re­la­tion­s); eg bipo­lar disorder/education but not bipo­lar disorder/IQ (Bansal et al 2018). While genetic cor­re­la­tions can be dealt with by a gen­er­al­iza­tion of the sin­gle-trait case (see the mul­ti­ple selec­tion sec­tion) to make opti­mal trade­offs, such harm­ful genetic cor­re­la­tions are trou­bling & com­pli­cate things.

  4. EDU/EA PGSes are approach­ing their SNP her­i­tabil­ity ceil­ings, and as they mea­sure their crude con­struct fairly well (most peo­ple can recall how many years of for­mal school­ing they had), there’s not as much to gain as with IQ from fix­ing mea­sure­ment error. Con­sid­er­ing the twin/family stud­ies, the high­est her­i­tabil­ity for edu­ca­tion, var­i­ously mea­sured (typ­i­cally bet­ter than ‘years of edu­ca­tion’), tends to peak at 50%, while with IQ, the most refined meth­ods peak at 80%. Thus, at some point the pure-IQ or mul­ti­-trait GWASes will exceed the EDU PGSes for the pur­pose of pre­dict­ing intel­li­gence (although this may take some time or require upgrades like use of WGS or much bet­ter mea­sure­ments).

Measurement Error in Polygenic Scores

Like GCTA estimates, polygenic scores are affected by measurement error, which both reduces discovery power and yields a downwardly-biased estimate of how good the PGS is. The GCTAs give a substantially lower estimate than the one we care about if we forget to correct for measurement error; is this true for the PGSes above as well?

Checking some of the GWASes in question where possible, it seems there is an unspoken general practice of using the smallest, highest-quality-phenotyped cohorts as the heldout validation sets, so the measurement error turns out to not be too serious, and we don’t need to take it much into consideration.

Measurement error harms polygenic scores in two major ways: first, poor-quality measurements considerably reduce statistical power, and thus the ability to find genome-wide statistically-significant hits or create predictive PGSes; second, after the hit to power has been taken (GIGO), measurement error in a separate validation/replication dataset will bias the estimate towards zero, because the true accuracy is hidden by the noise in the new dataset. (If the “IQ score” only correlates r = 0.5 with intelligence because it is just that noisy and unstable, no PGS will ever exceed r = 0.5 predictive power in that dataset, because by definition you can’t predict noise, even though the true latent intelligence variable is much more heritable than that.) The UK Biobank’s cognitive ability measures are particularly low quality, with test-retest reliability alone averaging only r = 0.55. From a psychometric perspective, it’s worth noting that power will also be reduced, and the PGS biased towards 0, by range restriction, especially by attrition of very unintelligent people (due to things like excess mortality), which can be expected to reduce predictive power by another ~5% (going by the Generation Scotland estimate of the range restriction bias).

There’s not much that can be done about the first problem after the GWAS has been conducted, but the second problem can be quantified and corrected for, just as with GCTA—the polygenic score/replication dataset is just another correlation (even if we usually write it as ‘variance explained’ rather than r), and if we know how much noise is in the replication dataset’s IQ measurements, we can correct for that and see how much of true IQ was predicted. The raw replication performance is meaningful for some purposes, like if one were trying to use the PGS as a covariate or to just predict that cohort, but not for others; in the case of embryo selection, we do not care about increasing measured IQ but latent or true IQ. If our PGS actually predicts 11% of variance but the measurements are so bad in the replication cohort that our PGS can only predict 7% of the noisy measurements, it is the 11% that matters, as it is what defines how much selected embryos will increase by.

Most GWASes do not men­tion the issue, few men­tion any­thing about the expected reli­a­bil­ity of the used IQ scores, and none cor­rect for mea­sure­ment error in report­ing PGS pre­dic­tions, so I’ve gone through the above list of PGSes and made an attempt to roughly cal­cu­late cor­rected PGSes. For UKBB, test-retest cor­re­la­tions have been reported and can be con­sid­ered loose upper bounds on the reli­a­bil­ity (since a test which can’t pre­dict itself can’t mea­sure intel­li­gence a for­tiori); for IQ mea­sures which are a prin­ci­pal com­po­nent extracted from mul­ti­ple tests, I assume they are at least r = 0.8 and accept­able qual­i­ty.
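The correction used in the table below is the standard psychometric attenuation formula: the observed variance explained equals the true variance explained multiplied by the validation test’s reliability, so dividing the raw PGS by the reliability recovers the variance explained in latent intelligence. A minimal R sketch (the helper name is mine):

```r
## Disattenuate an observed PGS (variance explained) for measurement error
## in the validation cohort: R^2_observed = R^2_true * reliability,
## so R^2_true = R^2_observed / reliability.
correctPGS <- function(pgs.observed, reliability) { pgs.observed / reliability }

## Savage et al 2017, S4S cohort (SAT, reliability ~0.5):
correctPGS(0.054, 0.50)
# [1] 0.108
## Savage et al 2017, UKBB VNR subtest (test-retest reliability <0.65):
correctPGS(0.050, 0.65)
# [1] 0.07692308
```

For reliabilities known only as bounds (eg >0.8), the corrected PGS is likewise only a bound (<raw/0.8).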

| Year | Study | n | PGS | Replication cohort | Test type | Replication N | Reliability | Corrected PGS |
|------|-------|---|-----|--------------------|-----------|---------------|-------------|---------------|
| 2011 | Davies et al 2011 | 3511 | 0.0058 | Norwegian Cognitive NeuroGenetics (NCNG) | custom battery | 670 | >0.8 | <0.007 |
| 2013 | Rietveld et al 2013 | 126559 | 0.0258 | Swedish Twin Registry (STR) | SEB80 | 9553 | 0.84-0.95 | 0.031 |
| 2014 | Benyamin et al 2014 | 12441 | 0.035 | Netherlands Twin Registry (NTR) | RAKIT, WISC-R, WISC-R-III, WAIS-III | 739 | >0.90 | <0.039 |
| 2014 | Benyamin et al 2014 | 12441 | 0.005 | University of Minnesota study (UMN) | WISC-R, WAIS-R | 3367 | 0.90 | 0.006 |
| 2014 | Benyamin et al 2014 | 12441 | 0.012 | Generation Rotterdam study (Generation R) | SON-R 2,5-7 | 1442 | 0.62 | 0.02 |
| 2015 | Davies et al 2015 | 53949 | 0.0127 | Generation Scotland (GS) | custom battery | 5487 | >0.8 | <0.016 |
| 2016 | Davies et al 2016 | 112151 | 0.0231 | Generation Scotland (GS) | custom battery | 19994 | >0.8 | <0.029 |
| 2016 | Davies et al 2016 | 112151 | 0.031 | Lothian Birth Cohort of 1936 (LBC1936/1947) | custom battery, Moray House Test No. 12 | 1005 | >0.8 | <0.039 |
| 2016 | Selzam et al 2016 | 329000 | 0.0361 | Twins Early Development Study (TEDS) | custom battery | 5825 | >0.8 | <0.045 |
| 2017 | Sniekers et al 2017 | 78308 | 0.032 | Twins Early Development Study (TEDS) | custom battery | 1173 | >0.8 | <0.04 |
| 2017 | Sniekers et al 2017 | 78308 | 0.048 | Manchester & Newcastle Longitudinal Studies of Cognitive Ageing Cohorts (ACPRC) | custom battery | 1558 | >0.8 | <0.06 |
| 2017 | Sniekers et al 2017 | 78308 | 0.025 | Rotterdam Study | custom battery | 2015 | ? | ? |
| 2017 | Zabaneh et al 2017 | 9410 | 0.016 | Twins Early Development Study (TEDS) | custom battery | 3414 | >0.8 | <0.02 |
| 2017 | Zabaneh et al 2017 | 9410 | 0.024 | Twins Early Development Study (TEDS) | custom battery | 4731 | >0.8 | <0.03 |
| 2017 | Krapohl et al 2017 | 82493 | 0.048 | Twins Early Development Study (TEDS) | custom battery | 6710 | >0.8 | <0.06 |
| 2017 | Hill et al 2017 | 147194 | 0.0686 | Generation Scotland (GS) | custom battery | 6884 | >0.8 | <0.086 |
| 2017 | Savage et al 2017 | 279930 | 0.041 | Generation Rotterdam study (Generation R) | SON-R 2,5-7 | 1929 | 0.62 | 0.066 |
| 2017 | Savage et al 2017 | 279930 | 0.054 | Spit 4 Science (S4S) | SAT | 2818 | 0.5 | 0.108 |
| 2017 | Savage et al 2017 | 279930 | 0.021 | Rotterdam Study | custom battery | 6182 | ? | ? |
| 2017 | Savage et al 2017 | 279930 | 0.05 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 53576 | <0.65 | >0.077 |
| 2018 | Hill et al 2018 | 248482 | 0.065 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 9050 | <0.65 | >0.10 |
| 2018 | Hill et al 2018 | 248482 | 0.0683 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 2431 | <0.65 | >0.11 |
| 2018 | Hill et al 2018 | 248482 | 0.0464 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 33065 | <0.65 | >0.07 |
A plot of poly­genic score pre­dic­tive power 2011-2018, raw vs cor­rected for mea­sure­ment error, demon­strat­ing the large gap in some cases

Over­all, it seems that most GWASes use the noisy mea­sure­ments for dis­cov­ery in the main GWAS and then reserve their small but rel­a­tively high­-qual­ity cohorts for the test­ing, which is the best approach, and so the cor­rected PGSes are sim­i­lar enough to the raw PGS that it is not a big issue—ex­cept in a few cases where the mea­sure­ment error is severe enough that it dra­mat­i­cally changes the inter­pre­ta­tion, like the use of UKBB or S4S cohorts, whose r < 0.65 reli­a­bil­i­ties (pos­si­bly much worse than that) seri­ously under­state the pre­dic­tive power of the PGS. Hill et al 2018, for exam­ple, appears to turn in a mediocre result which does­n’t exceed Hill et al 2017’s SOTA despite a much larger sam­ple size, but this is entirely an arti­fact of uncor­rected mea­sure­ment error, and the cor­rected PGSes are ~8.6% vs ~10%, imply­ing Hill et al 2018 actu­ally became the SOTA on pub­li­ca­tion. (The cor­rected PGSes also seem to show more of the expected expo­nen­tial growth with time, which has been some­what hid­den by increas­ing use of poor­ly-mea­sured val­i­da­tion dataset­s.)
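The Hill et al 2017 vs 2018 reversal mentioned above is just the attenuation correction applied with each validation cohort’s reliability:

```r
## Divide each raw PGS (variance explained) by its validation cohort's reliability:
0.0686 / 0.80 # Hill et al 2017: Generation Scotland battery, reliability >0.8
# [1] 0.08575
0.0650 / 0.65 # Hill et al 2018: UKBB VNR subtest, test-retest reliability <0.65
# [1] 0.1
```

So the corrected PGSes are ~8.6% vs ~10%, making Hill et al 2018 the better predictor of latent intelligence despite its lower raw number.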

Why Trust GWASes?

Before mov­ing on to the cost, it’s worth dis­cussing a ques­tion I see a lot: why trust any of these poly­genic scores or GWAS results like genetic cor­re­la­tions, and assume they will work in an embryo selec­tion any­where near the reported pre­dic­tive per­for­mance, when they are, after all, just a bunch of com­plex cor­re­la­tional stud­ies and not proper ran­dom­ized exper­i­ments?

Prag­ma­tism vs selec­tive skep­ti­cism.

During the late 2000s, there were great amounts of criticism made of the “missing heritability” after GWASes exposed the bankruptcy of candidate-gene studies (specifically, Chabris et al 2012 for IQ hits), and predictions that, contrary to the behavioral geneticists’ predictions that increasing sample sizes would overcome polygenicity, GWASes would never amount to anything, in part because the genetic basis of many traits (especially intelligence) simply did not exist. So now, in 2016 and later, why should we trust GWASes & polygenic scores, and the intelligence/education ones in general, and believe they measure meaningful genetic causation—rather than some sort of complicated hidden “cryptic population structure” which just happens to create spurious correlations between ancestry and, say, socioeconomic status? We should because, to the extent that the criticisms are true, they are unlikely to change our decision-making in embryo selection, which makes sense even using highly conservative estimates at every step; and the evidence from many converging directions is strongly consistent with ‘naive’ interpretations being more true than the extreme nurture claims:

  1. causal pri­ors: pre­dic­tions from genetic mark­ers are inher­ently a lon­gi­tu­di­nal design & thus more likely to be causal than a ran­dom pub­lished cor­re­la­tion, because genes are fixed at con­cep­tion, thereby rul­ing out 1 of the 3 main causal pat­terns: either the genes do cause the cor­re­lated phe­no­types, or they are con­found­ed, but the cor­re­la­tion can­not be reverse cau­sa­tion.

  2. con­silience among all genetic meth­ods: GWAS results show­ing non-zero SNP her­i­tabil­ity and highly poly­genic addi­tive genetic archi­tec­tures are con­sis­tent with the past cen­tury of adop­tion, twin, sib­ling, and fam­ily stud­ies. For exam­ple, poly­genic scores always explain less than SNP her­i­tabil­i­ty, and SNP her­i­tabil­ity is always less than twin her­i­tabil­i­ty, as expect­ed; but if the sig­nal were purely ances­try, a few thou­sand SNPs is more than enough to infer ances­try with high accu­racy and the results could be any­thing.

    • The same holds for genetic cor­re­la­tions: genetic cor­re­la­tions com­puted using mol­e­c­u­lar genet­ics are typ­i­cally con­sis­tent with those cal­cu­lated using twins.
  3. strong precautions: GWASes typically include stringent measures to reduce cryptic relatedness, removing too-related datapoints as measured directly on molecular genetic markers, and including many principal components as controls—possibly too many, as these measures come at costs in sample size and ability to detect rare variants, to reduce a risk which has not much materialized in practice. (Statistical methods like LD score regression generally indicate that, after these measures, most of the predictive signal comes from genuine polygenicity and not residual population stratification.)

  4. high replication rates: GWAS polygenic scores are predictive out of sample (multiple cohorts of the same study), across social classes11, across closely-related but separate countries (eg UK GWASes closely agree with USA GWASes). GWASes also have high replication rates within countries/studies12, and across times (while heritabilities may change, the PGS remains predictive and does not revert quickly to 0% variance; similarly, there is consilience with selection/dysgenics, rather than small random walks).

    Suggestions that GWASes merely measure social stratification predict many things we simply do not see, like extreme SES interactions and gradients in predictive validity, or PGSes becoming useless given even the slightest bit of range restriction (if anything, range restriction is common in GWASes, and they still work), or very low genetic correlations between cohorts in different times or places or measurements or recruitment methods (rather than the usual high rg > 0.8). The critics have yet to explain just how much relatedness is too much, or how far the cryptic relatedness goes, and why current practices of eliminating relatedness even as close as 2.5% (fourth cousins) are inadequate (unless the argument is circular—“we know it’s not enough because the GWASes & GCTAs continue to work”!). A decade on, with datasets that have grown 10-50x larger than initial GWASes like Chabris et al 2012, there has been no replication crisis for GWASes. This is despite the usual practice of GWAS involving consortia with repeated GWAS+meta-analysis across accumulating datasets, which would quickly expose any serious replication issues (practices adopted, in part, as a reaction to the candidate-gene debacle).

    Further, while GWAS polygenic scores decrease in predictive validity when used in increasingly distant ethnicities (eg IQ PGSes predict best in Caucasians, somewhat well in Han Chinese, worse in African-Americans, and hardly at all in Africans), they do so gradually, as predicted by decreasing ethnic relatedness leading to linkage-disequilibrium decay of SNP markers tagging identical causal variants—and not abruptly based on national borders or economies. What sort of population stratification or residual confounding could possibly be identical between both London and Beijing?

  5. with­in-fam­ily com­par­isons show causal­i­ty: GWASes pass one of the most strin­gent checks, with­in-fam­ily com­par­isons of sib­lings. As notes, sib­lings inherit ran­dom genes from their par­ents and are born equal in every respect like socioe­co­nomic sta­tus, ances­try, neigh­bor­hood etc (yet sib­lings within a fam­i­ly, includ­ing fra­ter­nal twins, differ a great deal on aver­age, a puz­zle for envi­ron­men­tal deter­min­ists but pre­dicted by the large genetic differ­ences between sib­lings & CLT), and so all genetic differ­ences between sib­lings are them­selves ran­dom­ized exper­i­ments show­ing causal­i­ty:

    Genet­ics is indeed in a pecu­liarly favoured con­di­tion in that Prov­i­dence has shielded the geneti­cist from many of the diffi­cul­ties of a reli­ably con­trolled com­par­i­son. The differ­ent geno­types pos­si­ble from the same mat­ing have been beau­ti­fully ran­domised by the mei­otic process. A more per­fect con­trol of con­di­tions is scarcely pos­si­ble, than that of differ­ent geno­types appear­ing in the same lit­ter

    Indeed, the first suc­cess­ful IQ/education GWAS, Rietveld et al 2013, checked the PGS in an avail­able sam­ple of sib­lings, and found in pairs of sib­lings, the sib­ling with the higher PGS tended to also have a higher edu­ca­tion. Hence, the PGS must mea­sure cau­sa­tion.

    Other methods aside from sibling comparison, like parental PGS controls, pedigrees, or transmission disequilibrium, can be expected to reduce or eliminate any hypothetical confounding from residual population stratification; GWASes typically survive those as well. (See also: Rietveld et al 2014b, de Zeeuw et al 2014, Domingue et al 2015, Willoughby et al 2019, among others.)
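The logic of the sibling design can be illustrated with a toy R simulation (all parameter values here are hypothetical): under a purely additive model, siblings draw random alleles from the same two parents, so within-family PGS differences are randomized by meiosis, and if the PGS is causal, the higher-PGS sibling should tend to have the higher phenotype despite an identical family environment:

```r
set.seed(2016)
n.snps <- 1000; n.pairs <- 2000
beta <- rnorm(n.snps, 0, 0.05) # small additive effects per SNP (hypothetical)

## Simulate one sibling pair from the same two parents:
simPair <- function() {
    mother <- matrix(rbinom(n.snps*2, 1, 0.5), ncol=2) # 2 alleles per SNP
    father <- matrix(rbinom(n.snps*2, 1, 0.5), ncol=2)
    child <- function() { # inherit 1 random allele per parent per SNP
        mother[cbind(1:n.snps, sample(2, n.snps, replace=TRUE))] +
        father[cbind(1:n.snps, sample(2, n.snps, replace=TRUE))] }
    g1 <- child(); g2 <- child()
    pgs   <- c(sum(g1*beta), sum(g2*beta))
    pheno <- pgs + rnorm(2) # additive genetics plus environmental noise
    (pgs[1] > pgs[2]) == (pheno[1] > pheno[2]) }

## Fraction of pairs where the higher-PGS sibling has the higher phenotype:
mean(replicate(n.pairs, simPair()))
## substantially above chance (0.5), as in Rietveld et al 2013's sibling check
```

No population stratification exists in this simulation by construction, so any within-pair predictiveness reflects causal alleles only.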

  6. GWASes are biased towards nulls: the major sta­tis­ti­cal flaws in GWASes are typ­i­cally in the direc­tion of min­i­miz­ing genetic effects: using small num­bers of SNPs, highly unre­al­is­tic flat pri­ors, one-SNP-at-a-time regres­sion, no incor­po­ra­tion of mea­sure­ment error, too many prin­ci­pal com­po­nents, addi­tive-only mod­els, arbi­trary genome-wide sig­nifi­cance thresh­olds, PGSes of only genome-wide sta­tis­ti­cal­ly-sig­nifi­cant hits rather than full PGSes etc. (See sec­tion on how cur­rent PGSes rep­re­sent lower bounds and will become much bet­ter.)

  7. con­silience with bio­log­i­cal & neu­ro­log­i­cal evi­dence: if GWASes and PGSes were merely con­founded by some­thing like ances­try, the attempt to parse their tea leaves into some­thing bio­log­i­cally mean­ing­ful would fail. They would be exploit­ing chance vari­ants asso­ci­ated with ances­try and muta­tions, spread scat­ter­shot over the genome. But instead, we observe con­sid­er­able struc­ture of iden­ti­fied vari­ants within the genome in a way that looks as if they are doing some­thing.

    On a high lev­el, as pre­vi­ously men­tioned, the genetic cor­re­la­tions are con­sis­tent with those observed in twins, but also gen­er­ally with phe­no­typic cor­re­la­tions. In terms of gen­eral loca­tion in the genome, the iden­ti­fied vari­ants are where they are expected if they have func­tional causal con­se­quences: mostly in pro­tein-cod­ing & reg­u­la­tory regions (rather than the over­whelm­ing major­ity of junk DNA region­s), and located far above chance near rare patho­log­i­cal vari­ants—IQ vari­ants are enriched in loca­tions very near the rare muta­tions which cause many cases of intel­lec­tual dis­abil­i­ties, and sim­i­larly dis­ease-re­lated com­mon vari­ants are very near rare path­o­genic muta­tions (eg for heart defect­s). On a more fine-grained lev­el, the genes host­ing iden­ti­fied genetic vari­ants can be assigned to spe­cific organs and stages of life based on when they are typ­i­cally expressed using meth­ods like DNA microar­rays; IQ/EDU hits are, unsur­pris­ing­ly, heav­ily clus­tered in genes asso­ci­ated with the ner­vous sys­tem or with known psy­choac­tive drug tar­gets (and not skin color or appear­ance genes), and express most heav­ily early in life pre­na­tally & infan­cy—ex­actly when the human brain is grow­ing & learn­ing most rapid­ly. (See for exam­ple Okbay et al 2016 or Lam et al 2017.) While the bio­log­i­cal insights have not been too impres­sive for com­plex behav­ioral traits like education/intelligence, GWASes have given con­sid­er­able insight into dis­eases like Crohn’s or dia­betes or schiz­o­phre­nia, which is diffi­cult to rec­on­cile with the idea that GWASes are sys­tem­at­i­cally wrong or pick­ing up on pop­u­la­tion strat­i­fi­ca­tion con­found­ing. 
Or should we engage in post hoc spe­cial plead­ing and say that the GWAS method­ol­ogy works fine for dis­eases but some­how, invis­i­bly, fails when it comes to traits which are not con­ven­tion­ally defined as dis­eases (even when they are parts of con­tin­u­ums where the extremes are con­sid­ered dis­eases or dis­or­der­s)?

  8. the critics were wrong: none of this was predicted by critics of “missing heritability”. The prediction was that GWASes were a fools’ errand—for example, from 2010, “If common alleles influenced common diseases, many would have been found by now.” or “The most likely explanation for why genes for common diseases have not been found is that, with few exceptions, they do not exist.” (Quotes from critics.) Few (none?) of the critics predicted that GWASes would succeed—as predicted by the power analyses—in finding hundreds or thousands of genome-wide statistically-significant hits when sample sizes increased appropriately with datasets like 23andMe & UK Biobank becoming available, but that these hits would simply be illusory; this was considered too absurd and implausible to rate serious mention compared to hypotheses like “they do not exist”. It behooves us to take the critics at their word. As it happened, proponents of additive polygenic architectures and of taking results like GCTA at face value made specific predictions that hits would materialize at appropriate sample sizes like n = 100k (eg Visscher or Hsu). Their predictions were right; the critics’ were wrong. Everything else is post hoc.

Those are the main reasons I take GWASes & PGSes at largely face-value. While population stratification certainly exists and would inflate naive estimates, and individual SNPs of course may turn out to be errors, and there are serious issues in trying to apply results from one population to another, and there is far to go in creating useful PGSes for most traits, and there remain many unknowns about the nature of the causal effects (are genetic correlations horizontal or vertical pleiotropy? is a particular SNP causal or just a tag for a rarer variant? what biological difference, exactly, does it make, and does it run outside the body as well?), many misinterpretations of what specific methods like GCTA deliver and many suboptimal practices (like polygenic scores using a p-value threshold), and so on—but it is not credible now to claim that genes do not matter, or that GWASes are untrustworthy. The fact is: most human traits are under considerable genetic influence, and GWASes are a highly successful method for quantifying and pinning down that influence.

As when E. O. Wilson famously defended evolution: each point may seem minor or narrow, hedged about with caveats and technical assumptions, but the consilience of the total weight of the evidence is unanswerable.

Can we seriously entertain the hypothesis that all the twin studies, adoption studies, family studies, GCTAs, LD score regressions, GWAS PGSes, within-family comparisons or pedigrees or parental covariate-controls or transmission disequilibrium tests, the trans-cohort & population & ethnicity & country & time-period replications, the intellectual-disability and other Mendelian disorder overlaps, the developmental & gene-expression evidence, all of these and more reported from so many countries by so many researchers on so many people (millions of twins alone have been studied, see Polderman et al 2015), are all just some incredible fluke of stratification or a SNP chip error, all of whose errors and assumptions just happen to go in exactly the same direction and just happen to act in exactly the way one would expect of genuine causal genetic influences?

Surely the Devil did not plant dinosaur bones to fool the pale­on­tol­o­gists, or SNPs (“Satanic Nucleotide Poly­mor­phisms”?) to fool the medical/behavioral geneti­cist—the uni­verse is hard to under­stand, and ran­dom­ness and bias are vex­ing foes, but it is not actively mali­cious. At this point, we can safely trust in the major­ity of large GWAS results to be largely accu­rate and act as we expect them to.

Cost of embryo selection

In considering the cost of embryo selection, I am looking at the marginal cost of embryo selection and not the total cost of IVF: assuming that, for better or worse, a pair of parents have decided to use IVF to have a child, and are incurring whatever costs there may be, from the $8k13-$20k cost of each IVF cycle to any possible side effects for mother/child of the IVF process, and merely asking, what are the costs and benefits of doing embryo selection as part of that IVF? The counterfactual is IVF vs IVF+embryo-selection, not having a child normally or adopting.

PGD is currently legal, so there are no criminal or legal costs; even if there were, clinics in other countries will continue to offer it, and the cost of using a Chinese fertility clinic may not be particularly noticeable financially14 and their quality may eventually be higher15.

Cost of polygenic scores

An upper bound is the cost of whole-genome sequencing, which has continuously fallen. My impression is that historically, a whole-genome has cost ~6x a comprehensive SNP array (500k+ SNPs). The NHGRI Genome Sequencing Program’s DNA Sequencing Cost dataset most recently records an October 2015 whole-genome cost of $1245. Illumina has boasted about a $1000 whole-genome starting around 2014 (under an unspecified cost model); around December 2015, Veritas Genetics started taking orders for a consumer 20x whole-genome priced at $1000; in January 2018, Dante Labs began offering 30x whole-genomes at ~$740 (down from May-Sep 2017 at ~$950-$1000, apparently dependent on euro exchange rate), dropping to $500 by June 2018. So if a comprehensive SNP array cost >$1000, it would be cheaper to do a whole-genome, and historically at that price, we would expect a SNP cost of ~$170.

The date & cost of get­ting a large selec­tion of SNPs is not col­lected in any dataset I know of, so here are a few 2010-2016 price quotes. Tur-Kaspa et al 2010 esti­mates “Genetic analy­ses of oocytes by polar bod­ies biopsy and embryos by blas­tomere biopsy” at $3000. Hsu 2014 esti­mates an SNP costs “~$100 USD” and “At the time of this writ­ing SNP geno­typ­ing costs are below $50 USD per indi­vid­ual”, with­out spec­i­fy­ing a source; given the lat­ter is below any 23andMe price offered, it is prob­a­bly an inter­nal Bei­jing Genomics Insti­tute cost esti­mate. The Cen­ter for Applied Genomics price list (un­spec­i­fied date but pre­sum­ably 2015) lists Affymetrix SNP 6.0 at $355 & the Human Omni Express-24 at $170. 23andMe famously offered its ser­vices for $108.95 for >600k SNPs as of Octo­ber 2014, but that price appar­ently was sub­stan­tially sub­si­dized by research & sales as they raised the price to $200 & low­ered com­pre­hen­sive­ness in Octo­ber 2015. NIH CIDR’s price list quotes a full cost of $150-$210 for 1 use of a 821K SNP Axiom Array (capa­bil­i­ties) as of 2015-12-10. (The NIH CIDR price list also says $40 for 96 SNPs, sug­gest­ing that it would be a false econ­omy to try to get only the top few SNP hits rather than a com­pre­hen­sive poly­genic score.) Rock­e­feller Uni­ver­si­ty’s 2016 price list quotes a range of $260-$520 for one sam­ple from an Affymetrix GeneChip. Tan et al 2014 note that for PGD pur­pos­es, “the esti­mated reagent cost of sequenc­ing for the detec­tion of chro­mo­so­mal abnor­mal­i­ties is cur­rently less than $100.” The price of the array & geno­typ­ing can be dri­ven far below this by economies of scale: Hugh Watkin­s’s talk at the June 2014 UK Biobank con­fer­ence says that they had reached a cost of ~$45 per SNP16 (The UK Biobank over­all has spent ~$110m 2003-2015, so geno­typ­ing 500,000 peo­ple at ~$45 each rep­re­sents a large frac­tion of its total bud­get. 
Somewhat similarly, 23andMe has raised 2006-2017 ~$491m in capital along with charging ~2m customers perhaps an average of ~$150, along with unknown pharmacorp licensing revenue, so total 23andMe spending could be estimated at somewhere ~$800m. For comparison, the US program in 2018 had an annual budget of $9,168m, or highly likely >9x more annually than has ever been spent on UKBB/23andMe/SSGAC combined.) The Genes for Good project, begun in 2015, reported that their small-scale (n = 27k) social-media-based sequencing program cost “about $80, which includes postage, DNA extraction, and genotyping” per participant. Razib Khan reports in May 2017 that people at the October 2016 ASHG were discussing SNP chips in the “range of the low tens of dollars”.

Overall, SNPing an embryo in 2016 should cost ~$100-400, most likely towards the low end (~$200), and we can expect the SNP cost to fall further, with fixed costs probably pushing a climb up the quality ladder to exome and then whole-genome sequencing (which will increase the ceiling on possible PGSes by covering rare & causal variants, and allow selection on other metrics like avoiding unhealthy-looking de novo mutations or decreasing estimated mutation load).

SNP cost forecast

How much will SNP costs drop in the future?

We can extrapolate from the NHGRI Genome Sequencing Program’s DNA Sequencing Cost dataset, but it’s tricky: eyeballing their graph, we can see that historical prices have not followed any single pattern. At first, costs closely track a simple halving every 18 months, then there is an abrupt trend-break to super-exponential drops from mid-2007 to mid-2011, and then an equally abrupt reversion to a flat cost trajectory with occasional price increases, and then another abrupt fall in early 2015 (accentuated when one adds in the Veritas Genetics $1k as a datapoint).

Drop­ping pre-2007 data and fit­ting an expo­nen­tial shows a bad fit since 2012 (if it fol­lows the pre-2015 curve, it has large pre­dic­tion errors on 2015-2016 and vice-ver­sa). It’s prob­a­bly bet­ter to take the last 3 dat­a­points (the cur­rent trend) and fit the curve to them, cov­er­ing just the past 6 months since July 2015, and then apply­ing the 6x rule of thumb we can pre­dict SNP costs out 20 months to Octo­ber 2017:

# http://www.genome.gov/pages/der/seqcost2015_4.xlsx
genome <- c(9408739,9047003,8927342,7147571,3063820,1352982,752080,342502,232735,154714,108065,70333,46774,
    ...) # vector truncated in source; the remaining quarterly datapoints are in seqcost2015_4.xlsx
l <- lm(log(I(tail(genome, 3))) ~ I(1:3)); l
# Coefficients:
# (Intercept)       I(1:3)
#  7.3937180   -0.1548441
exp(sapply(1:10, function(t) { 7.3937180 + -0.1548441*t } )) / 6
# [1] 232.08749421 198.79424215 170.27695028 145.85050092 124.92805739 107.00696553  91.65667754
# [8]  78.50840827  67.24627528  57.59970987

(Even if SNP prices stag­nate due to lack of com­pe­ti­tion or fixed-costs/overhead/small-scales, whole-genomes will sim­ply eat their lunch: at the cur­rent trend, whole-genomes will reach $200 ~2019 and $100 ~2020.)
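Those dates can be checked by inverting the fitted whole-genome cost curve (assuming, as the fit above does, that the 3 fitted datapoints are quarterly, so t is in quarters from late 2015):

```r
## Solve exp(7.3937180 - 0.1548441*t) = price for t, converting quarters to years:
yearsUntil <- function(price) { ((7.3937180 - log(price)) / 0.1548441) / 4 }
yearsUntil(200)
# ~3.4 years, ie. ~2019
yearsUntil(100)
# ~4.5 years, ie. ~2020
```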

PGD net costs

An IVF cycle involv­ing PGD will need ~4-5 SNP geno­typ­ings (given a median egg count of 9 and half being abnor­mal), so I esti­mate the genetic part costs ~$800-1000. The net cost of PGD will include the cell har­vest­ing part (one needs to extract cells from embryos to sequence) and inter­pre­ta­tion (although scor­ing and check­ing the genetic data for abnor­mal­ity should be automat­able), so we can com­pare with cur­rent PGD price quotes.
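That estimate is simple arithmetic on the figures above (median 9 eggs retrieved, ~half abnormal, ~$200 per SNP genotyping from the previous section):

```r
eggs <- 9; snp.cost <- 200
usable <- eggs / 2   # ~half of embryos are abnormal & not worth genotyping
usable * snp.cost    # genotyping cost per IVF cycle
# [1] 900  (within the ~$800-1000 estimate for the genetic part)
```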

  • “The Fer­til­ity Insti­tutes” say “Aver­age costs for the med­ical and genetic por­tions of the ser­vice pro­vided by the Fer­til­ity Insti­tutes approach $27,000 U.S” (un­spec­i­fied date) with­out break­ing out the PGD part.

  • Tur-Kaspa et al 2010, using 2000-2005 data from the Repro­duc­tive Genet­ics Insti­tute (RGI) in Illi­nois esti­mates the first PGD cycle at $6k, and sub­se­quent at $4.5k, giv­ing a full table of costs:

    Table 2: Esti­mated cost of IVF-preimplantation genetic diag­no­sis (PGD) treat­ment for cys­tic fibro­sis (CF) car­ri­ers
    | Procedure | Subprocedure | Cost (US$) | Notes |
    |-----------|--------------|------------|-------|
    | IVF | Pre-IVF laboratory screening | 1000 | Range $600 to $2000; needs to be performed only once each year |
    | | Medications | 3000 | Range $1500 to $5000 |
    | | Cost of IVF treatment cycle | 12000 | Range $6000 to $18000 |
    | | Total cost, first IVF cycle | 16000 | |
    | | Total cost, each additional IVF cycle | 15000 | |
    | PGD | Genetic system set-up for PGD of a specific couple | 1500 | Range $1000 to $2000; performed once for a specific couple, with or without analysis of second generation, if applicable |
    | | Biopsy of oocytes and embryos | 1500 | |
    | | Genetic analyses of oocytes by polar bodies biopsy and embryos by blastomere biopsy | 3000 | Variable; upper end presented; depends on number of mutations anticipated |
    | | Subtotal: cost of PGD, first cycle | 6000 | |
    | | Subtotal: cost of PGD, each repeated cycle | 4500 | |
    | IVF-PGD | Total cost, first IVF-PGD cycle | 22000 | |
    | | Total cost, each additional IVF-PGD cycle | 19500 | |

    …Over­all, 35.6% of the IVF-PGD cycles yielded a life birth with one or more healthy babies. If IVF-PGD is not suc­cess­ful, the cou­ple must decide whether to attempt another cycle of IVF-PGD (Fig­ure 1) know­ing that their prob­a­bil­ity of hav­ing a baby approaches 75% after only three treat­ment cycles and is pre­dicted to exceed 93% after six treat­ment cycles (Table 3). If 4000 cou­ples undergo one cycle of IVF-PGD, 1424 deliv­er­ies with non-affected chil­dren are expected (Table 3). Assum­ing a sim­i­lar suc­cess rate of 35.6% in sub­se­quent treat­ment cycles and that cou­ples could elect to undergo between four and six attempts per year yields a cumu­la­tive suc­cess rate approach­ing 93%. IVF as per­formed in the USA typ­i­cally involves the trans­fer of two or three embryos. The series yielded 1.3 non-affected babies per preg­nancy with an aver­age of about two embryos per trans­fer (Table 1). Thus, the num­ber of result­ing chil­dren would be higher than the num­ber of deliv­er­ies, per­haps by as much as 30% (Table 3). Nonethe­less, to avoid mul­ti­ple births, which have both med­ical com­pli­ca­tions and an addi­tional cost, the out­come was cal­cu­lated as if each deliv­ery results in the birth of one non-affected child. IVF-PGD cycles can be per­formed at an expe­ri­enced cen­tre. The esti­mated cost of per­form­ing the ini­tial IVF cycle with intra­cy­to­plas­mic sperm injec­tion (ICSI) with­out PGD was $16,000 includ­ing lab­o­ra­tory and imag­ing screen­ing, cost of med­ica­tions, mon­i­tor­ing dur­ing ovar­ian stim­u­la­tion and the IVF pro­ce­dure per se (Table 2). The cost of sub­se­quent IVF cycles was lower because the ini­tial screen­ing does not need to be repeated until a year lat­er. Esti­mated PGD costs were $6000 for the ini­tial cycle and $4500 for sub­se­quent cycles. 
The cost for sub­se­quent PGD cycles would be lower because the ini­tial genetic set-up for cou­ples (par­ents) and sib­lings for linked genetic mark­ers and probes needs to be per­formed only once. These con­di­tions yield an esti­mated cost of $22,000 for the ini­tial cycle of IVF/ICSI-PGD and $19,500 for each sub­se­quent treat­ment cycle.

  • Genetic Alliance UK claims (in 2012, based on PDF cre­ation date) that “The cost of PGD is typ­i­cally split into two parts: pro­ce­dural costs (con­sul­ta­tions, lab­o­ra­tory test­ing, egg col­lec­tion, embryo trans­fer, ultra­sound scans, and blood tests) and drug costs (for ovar­ian stim­u­la­tion and embryo trans­fer). PGD com­bined with IVF will cost £6,000 [$8.5k]–£9,000 [$12.8k] per treat­ment cycle.” but does­n’t spec­ify the mar­ginal cost of the PGD rather than IVF part.

  • Repro­duc­tive Health Tech­nolo­gies Project (2013?): “One round of IVF typ­i­cally costs around $9,000. PGD adds another $4,000 to $7,500 to the cost of each IVF attempt. A stan­dard round of IVF results in a suc­cess­ful preg­nancy only 10-35% of the time (de­pend­ing on the age and health of the wom­an), and a woman may need to undergo sub­se­quent attempts to achieve a viable preg­nan­cy.”

  • Alz­fo­rum (July 2014): “In Madis­on, Wis­con­sin, genetic coun­selor Margo Grady at Gen­er­a­tions Fer­til­ity Care esti­mated the out­-of-pocket price of one IVF cycle at about $12,000, and PGD adds another $3,000.”

  • SDFC (2015?): “PGD typ­i­cally costs between $4,000-$10,000 depend­ing on the cost of cre­at­ing the spe­cific probe used to detect the pres­ence of a sin­gle gene.”

  • Muru­gap­pan et al May 2015: “The aver­age cost of PGS was $4,268 (range $3,155-$12,626)”, cit­ing another study which esti­mated “Aver­age addi­tional cost of PGD pro­ce­dure: $3,550; Median Cost: $3,200”

  • the Advanced Fer­til­ity Cen­ter of Chicago (“cur­rent” pric­ing, so 2015?) says IVF costs ~$12k and of that, “Ane­u­ploidy test­ing (for chro­mo­some nor­mal­i­ty) with PGD is $1800 to $5000…PGD costs in the US vary from about $4000-$8000”. AFC use­fully breaks down the costs fur­ther in a table of “Aver­age PGS IVF Costs in USA”, say­ing that:

    • Embryo biopsy charges are about $1000 to $2500 (av­er­age: $1500)
    • Embryo freez­ing costs are usu­ally between $500 to $1000 (av­er­age: $750)
    • Ane­u­ploidy test­ing (for chro­mo­some nor­mal­i­ty) with PGD is $1800 to $5000
    • For sin­gle gene defects (such as cys­tic fibro­sis), there are addi­tional costs involved.
    • PGS test cost aver­age: $3500

    (The word­ing is unclear about whether these are costs per embryo or per batch of embryos; but the rest of the page implies that it’s per batch, and per embryo would imply that the other PGS cost esti­mates are either far too low or are being done on only one embryo & likely would fail.)

  • the startup Genomic Prediction in September/October 2018 announced a full embryo selection service for complex traits at a fixed cost of $1000 + $400/embryo (eg 5 embryos would be $3000 total):

    300+ com­mon sin­gle-gene dis­or­ders, such as Cys­tic Fibro­sis, Tha­lassemia, BRCA, Sickle Cell Ane­mia, and Gaucher Dis­ease.

    Poly­genic Dis­ease Risk, such as risk for Type 1 and Type 2 dia­betes, Dwarfism, Hypothy­roidism, Men­tal Dis­abil­i­ty, Atrial Fib­ril­la­tion and other Car­dio­vas­cu­lar Dis­eases like CAD, Inflam­ma­tory Bowel Dis­ease, and Breast Can­cer.

    $1000/case, $400/embryo

    This may not reflect their true costs as they are a star­tup, but as a com­mer­cial ser­vice gives a hard dat­a­point: $1000 for overhead/biopsies, $400/embryo mar­ginal cost for sequenc­ing+­analy­sis.

From the final AFC costs, we can see that the genetic testing makes up a large fraction of the total. Since custom markers are not necessary and we are only looking at standard SNPs, the $1.8-5k genetic-testing cost is a huge overestimate compared to the ~$1k the SNPs should cost now or soon. Their breakdown also implies that the embryo freezing/vitrification cost is counted as part of the PGS cost, but I don't think this is right, since one will need to store embryos regardless of whether one is doing PGS/selection (even if an embryo is going to be implanted right away in a live transfer, the other embryos need to be stored, since the first one will probably fail). So the critical number here is that the embryo biopsy step costs $1000-$1500; there is probably little prospect of large price decreases here comparable to those for sequencing, and we can take it as fixed.

Hence we can treat the cost of embryo selec­tion as a fixed $1.5k cost plus num­ber of embryos times SNP cost.
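This marginal cost is trivial to parameterize; a minimal sketch (`selectionCost` is a hypothetical helper; the $1500 fixed & $200/embryo figures are the estimates argued for above, with the SNP cost expected to keep falling):

```r
## marginal cost of selection: fixed biopsy/overhead, plus per-embryo SNP genotyping:
selectionCost <- function(embryos, fixed=1500, snpCost=200) { fixed + embryos * snpCost }
selectionCost(5)
# [1] 2500
```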

Modeling embryo selection

IVF is a sequential probabilistic process:

  1. har­vest x eggs
  2. fer­til­ize them and cre­ate x embryos
  3. cul­ture the embryos to either cleav­age (2-4 days) or blas­to­cyst (5-6 days) stage; of them, y will still be alive & not grossly abnor­mal
  4. freeze the embryos
  5. option­al: embryo selec­tion using qual­ity and PGS
  6. unfreeze & implant 1 embryo; if no embryos left, return to #1 or give up
  7. if no live birth, go to #6

Each step is necessary and determines input into the next step; it is a 'leaky pipeline' (also related to "multiple hurdle selection"), whose total yield depends heavily on the least efficient step, so outcomes might be much smaller than one intuitively expects. This has implications for cost-effectiveness and optimization, discussed later.
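The leaky-pipeline arithmetic is a product of per-stage retention rates, so the expected size of the selection pool is dominated by the worst stage; a minimal sketch (`pipelineYield` is a hypothetical helper; the 0.5 normality & 0.96 vitrification figures are estimates used later in this section):

```r
## expected number of normal, vitrification-surviving embryos per cycle:
pipelineYield <- function(eggs, normalityP=0.5, vitrificationP=0.96) {
    eggs * normalityP * vitrificationP }
pipelineYield(9)
# [1] 4.32
```

Halving any single stage's retention halves the whole pool, no matter how good the other stages are.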

A sim­u­la­tion of this process:

## simulate a single IVF cycle (which may not yield any live birth, in which case there is no gain returnable):
simulateIVF <- function (eggMean, eggSD, polygenicScoreVariance, normalityP=0.5, vitrificationP, liveBirth) {
  ## 1. harvest a random number of eggs:
  eggsExtracted <- max(0, round(rnorm(n=1, mean=eggMean, sd=eggSD)))

  ## 2-3. how many embryos are still alive & not grossly abnormal:
  normal        <- rbinom(1, eggsExtracted, prob=normalityP)

  ## 5. polygenic scores of the normal embryos; embryos, being siblings, vary with half the PGS variance:
  scores        <- rnorm(n=normal, mean=0, sd=sqrt(polygenicScoreVariance*0.5))

  ## 4. which embryos survive vitrification:
  survived      <- Filter(function(x){rbinom(1, 1, prob=vitrificationP) == 1}, scores)

  ## 6-7. implant in descending score order until a live birth or the embryos run out:
  selection <- sort(survived, decreasing=TRUE)

  if (length(selection) > 0) {
   for (embryo in 1:length(selection)) {
    if (rbinom(1, 1, prob=liveBirth) == 1) {
      return(selection[embryo]) } } }
  ## no live birth this cycle:
  return(NULL) }

simulateIVFs <- function(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth, iters=100000) {
  return(unlist(replicate(iters, simulateIVF(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth)))); }

Mathematically, one could model the expectation of the first implantation as the expected maximum of the n surviving embryos' scores, each distributed N(0, σ²) (where σ² is half the variance explained by the polygenic score):

E[gain] = E[max(X_1, …, X_n)], X_i ~ N(0, σ²)

or using order statistics:

E[gain] = E[Z_(n:n)] × σ

where Z_(n:n) is the maximum order statistic of n standard normals. (The order statistic can be estimated by numeric integration or by standard approximations such as Blom's.)
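Both routes are easy to compute; a sketch (`exactMax` & `blomMax` are hypothetical helper names; the 0.375/0.25 constants are the standard Blom approximation, not from this page):

```r
## E[max] of n iid standard normals, integrating x against the density of the
## maximum, n * pnorm(x)^(n-1) * dnorm(x):
exactMax <- function(n) {
    integrate(function(x) { x * n * pnorm(x)^(n-1) * dnorm(x) }, -Inf, Inf)$value }
## Blom's approximation to the same expectation:
blomMax <- function(n) { qnorm((n - 0.375) / (n + 0.25)) }

exactMax(10)
# [1] 1.538753
blomMax(10)
# ≈ 1.547
```

Multiplying by the embryo-score SD (sqrt(polygenicScoreVariance × 0.5)) and then by 15 converts this to IQ points.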

This is a lower bound on the value, though—treating this mathematically is made challenging by the sequential nature of the procedure: implanting the maximum-scoring embryo may fail, forcing a fallback to the second-highest embryo, and so on, until a success or running out of embryos (triggering a second IVF cycle, or possibly not, depending on finances & number of previous failed cycles indicating futility). Given, say, 3 embryos, the expected value of the procedure would be the sum of: the expected value of the 1st-ranked embryo times the probability it yields a live birth, plus the expected value of the 2nd-ranked embryo times the probability of the 1st failing to yield a birth (since if the 1st succeeded, one would stop there and not use the 2nd), plus the expected value of the 3rd times the probability of both the 1st & 2nd failing, plus the expected value of no live births (zero) times the probability of all 3 failing, and so on. So it is easier to simulate.
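For a fixed set of embryo scores that sum does have a simple closed form; a sketch (`expectedSequential` is a hypothetical helper, assuming a constant per-transfer live-birth probability p and counting the all-fail outcome as 0 gain):

```r
## expected score of the embryo yielding a live birth, implanting in rank
## order and stopping at the first success; all-fail contributes 0:
expectedSequential <- function(scores, p) {
    s <- sort(scores, decreasing=TRUE)
    sum(s * p * (1-p)^(seq_along(s) - 1)) }

expectedSequential(c(1.0, 0.5, 0.2), 0.4)
# [1] 0.5488
```

For example, 3 embryos scoring 1, 0.5, & 0.2 with p = 0.4 yield 0.4×1 + 0.24×0.5 + 0.144×0.2 = 0.5488; the difficulty is averaging this over the random pool sizes & scores, which the simulation handles directly.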

(Be­ing able to write it as an equa­tion would be use­ful if we needed to do com­plex opti­miza­tion on it, such as if we were try­ing to allo­cate an R&D bud­get opti­mal­ly, but real­is­ti­cal­ly, there are only two vari­ables which can be mean­ing­fully improved—the poly­genic score or scores, and the num­ber of eggs—and it’s impos­si­ble to esti­mate how much R&D expen­di­ture would increase egg count, leav­ing just the poly­genic scores, which is eas­ily opti­mized by hand or a black­box opti­miz­er.)

The tran­si­tion prob­a­bil­i­ties can be esti­mated from the flows reported in papers deal­ing with IVF and PGD. I have used:

  1. Tan et al December 2014:

    395 wom­en, 1512 eggs suc­cess­fully extracted & fer­til­ized into blas­to­cysts (~3.8 per wom­an); after genetic test­ing, 256+590=846 or 55% were abnor­mal & could not be used, leav­ing 666 good ones; all were vit­ri­fied for stor­age dur­ing analy­sis and 421 of the nor­mal ones rethawed, leav­ing 406 use­ful sur­vivors or ~1.4 per wom­an; the 406 were implanted into 252 wom­en, yield­ing 24+75=99 healthy live births or 24% implant­ed-em­bry­o->birth rate. Excerpts:

    A total of 395 cou­ples par­tic­i­pat­ed. They were car­ri­ers of either translo­ca­tion or inver­sion muta­tions, or were patients with recur­rent mis­car­riage and/or advanced mater­nal age. A total of 1,512 blas­to­cysts were biop­sied on D5 after fer­til­iza­tion, with 1,058 blas­to­cysts set aside for SNP array test­ing and 454 blas­to­cysts for NGS test­ing. In the NGS cycles group, the implan­ta­tion, clin­i­cal preg­nancy and mis­car­riage rates were 52.6% (60/114), 61.3% (49/80) and 14.3% (7/49), respec­tive­ly. In the SNP array cycles group, the implan­ta­tion, clin­i­cal preg­nancy and mis­car­riage rates were 47.6% (139/292), 56.7% (115/203) and 14.8% (17/115), respec­tive­ly. The out­come mea­sures of both the NGS and SNP array cycles were the same with insignifi­cant differ­ences. There were 150 blas­to­cysts that under­went both NGS and SNP array analy­sis, of which seven blas­to­cysts were found with incon­sis­tent sig­nals. All other sig­nals obtained from NGS analy­sis were con­firmed to be accu­rate by val­i­da­tion with qPCR. The rel­a­tive copy num­ber of mito­chon­dr­ial DNA (mtDNA) for each blas­to­cyst that under­went NGS test­ing was eval­u­at­ed, and a sig­nifi­cant differ­ence was found between the copy num­ber of mtDNA for the euploid and the chro­mo­so­ma­lly abnor­mal blas­to­cysts. So far, out of 42 ongo­ing preg­nan­cies, 24 babies were born in NGS cycles; all of these babies are healthy and free of any devel­op­men­tal prob­lems.

    …The median num­ber of normal/ bal­anced embryos per cou­ple was 1.76 (range from 0 to 8)…A­mong the 129 cou­ples in the NGS cycles group, 33 cou­ples had no euploid embryos suit­able for trans­fer; 75 cou­ples under­went embryo trans­fer and the remain­ing 21 cou­ples are cur­rently still wait­ing for trans­fer. In the SNP array cycles group, 177 cou­ples under­went embryo trans­fer, 66 cou­ples had no suit­able embryos for trans­fer, and 23 cou­ples are cur­rently still wait­ing. Of the 666 normal/balanced blas­to­cysts, 421 blas­to­cysts were warmed after vit­ri­fi­ca­tion, 406 sur­vived (96.4% of sur­vival rate) and were trans­ferred in 283 cycles. The num­bers of blas­to­cysts trans­ferred per cycle were 1.425 (114/80) and 1.438 (292/203) for NGS and SNP array, respec­tive­ly. The pro­por­tion of trans­ferred embryos that suc­cess­fully implanted was eval­u­ated by ultra­sound 6-7 weeks after embryo trans­fer, indi­cat­ing that 60 and 139 embryos resulted in a fetal sac, giv­ing implan­ta­tion rates of 52.6% (60/114) and 47.6% (139/292) for NGS and SNP array, respec­tive­ly. Pre­na­tal diag­no­sis with kary­otyp­ing of amnio­cen­te­sis fluid sam­ples did not find any fetus with chro­mo­so­mal abnor­mal­i­ties. A total of 164 preg­nan­cies were detect­ed, with 129 sin­gle­tons and 35 twins. The clin­i­cal preg­nancy rate per trans­fer cycle was 61.3% (49/80) and 56.7% (115/203) for NGS and SNP array, respec­tively (Table 3). A total of 24 mis­car­riages were detect­ed, giv­ing rates of 14.3% (7/49) and 14.8% (17/115) in NGS and SNP array cycles, respec­tively

    …The ongo­ing preg­nancy rates were 52.5% (42/80) and 48.3% (98/203) in NGS and SNP array cycles, respec­tive­ly. Out of these preg­nan­cies, 24 babies were deliv­ered in 20 NGS cycles; so far, all the babies are healthy and chro­mo­so­ma­lly nor­mal accord­ing to kary­otype analy­sis. In the SNP array cycles group the out­come of all preg­nan­cies went to full term and 75 healthy babies were deliv­ered (Table 3)…NGS is with a bright prospect. A case report described the use of NGS for PGD recently [33]. Sev­eral com­ments for the appli­ca­tion of NGS/MPS in PGD/PGS were pub­lished [34,35]. The cost and time of sequenc­ing is already com­pet­i­tive with array tests, and the esti­mated reagent cost of sequenc­ing for the detec­tion of chro­mo­so­mal abnor­mal­i­ties is cur­rently less than $100.

  2. “Cost-effec­tive­ness analy­sis of preim­plan­ta­tion genetic screen­ing and in vitro fer­til­iza­tion ver­sus expec­tant man­age­ment in patients with unex­plained recur­rent preg­nancy loss”, Muru­gap­pan et al May 2015:

    Probabilities for clinical outcomes with IVF and PGS in RPL patients were obtained from a 2012 study by Hodes-Wertz et al. (10). This is the single largest study to date of outcomes using 24-chromosome screening by array comparative genomic hybridization in a well-defined RPL population…The Hodes-Wertz study reported on outcomes of 287 cycles of IVF with 24-chromosome PGS with a total of 2,282 embryos followed by fresh day-5 embryo transfer in RPL patients. Of the PGS cycles, 67% were biopsied on day 3, and 33% were biopsied on day 5. The average maternal age was 36.7 years (range: 21-45 years), and the mean number of prior miscarriages was 3.3 (range: 2-7). From 287 PGS cycles, 181 cycles had at least one euploid embryo and proceeded to fresh embryo transfer. There were 52 cycles with no euploid embryos for transfer, four cycles where an embryo transfer had not taken place at the time of analysis, and 51 cycles that were lost to follow-up observation. All patients with a euploid embryo proceeded to embryo transfer, with an average of 1.65 ± 0.65 (range: 1-4) embryos per transfer. Excluding the cycles lost to follow-up evaluation and the cycles without a transfer at the time of analysis, the clinical pregnancy rate per attempt was 44% (n = 102). One attempt at conception was defined as an IVF cycle and oocyte retrieval ± embryo transfer. The live-birth rate per attempt was 40% (n = 94), and the miscarriage rate per pregnancy was 7% (n = 7). Of these seven miscarriages, 57% (n = 4) occurred after detection of fetal cardiac activity (10). Information on the percentage of cycles with surplus embryos was not provided in the Hodes-Wertz study, so we drew from their database of 240 RPL patients with 118 attempts at IVF and PGS (12). The clinical pregnancy, live-birth, and clinical miscarriage rates did not statistically-significantly differ between the outcomes published in the Hodes-Wertz study (P = .89, P = .66, P = .61, respectively). We reported that 62% of IVF cycles had at least one surplus embryo (12).

    …The aver­age cost of pre­con­cep­tion coun­sel­ing and base­line RPL workup, includ­ing parental kary­otyp­ing, mater­nal antiphos­pho­lipid anti­body test­ing, and uter­ine cav­ity eval­u­a­tion, was $4,377 (range: $4,000-$5,000) (16). Because this was incurred by both groups before their entry into the deci­sion tree, it was not included as a cost input in the study. The aver­age cost of IVF was $18,227 (range: $6,920-$27,685) (16) and includes cycle med­ica­tions, oocyte retrieval, and one embryo trans­fer. The aver­age cost of PGS was $4,268 (range $3,155-$12,626) (17), and the aver­age cost of a frozen embryo trans­fer was $6,395 (range: $3,155-$12,626) (13, 16). The aver­age cost of man­ag­ing a clin­i­cal mis­car­riage with dila­tion and curet­tage (D&C) was $1,304 (range: $517-$2,058) (18). Costs incurred in the IVF-PGS strat­egy include the cost of IVF, PGS, fresh embryo trans­fer, frozen embryo trans­fer, and D&C. Costs incurred in the expec­tant man­age­ment strat­egy include only the cost of D&C.

    17: National Infer­til­ity Asso­ci­a­tion. “The costs of infer­til­ity treat­ment: the Resolve Study”. Accessed on May 26, 2014: “Aver­age addi­tional cost of PGD pro­ce­dure: $3,550; Median Cost: $3,200 (Note: Med­ica­tions for IVF are $3,000–$5,000 per fresh cycle on aver­age.)”

  3. Dahdouh et al 2015:

    The num­ber of dis­eases cur­rently diag­nosed via PGD-PCR is approx­i­mately 200 and includes some forms of inher­ited can­cers such as retinoblas­toma and the breast can­cer sus­cep­ti­bil­ity gene (BRCA2). 52 PGD has also been used in new appli­ca­tions such as HLA match­ing. 53,54 The ESHRE PGD con­sor­tium data analy­sis of the past 10 years’ expe­ri­ence demon­strated a clin­i­cal preg­nancy rate of 22% per oocyte retrieval and 29% per embryo trans­fer. 55 Table 4 shows a sam­ple of the differ­ent mono­genetic dis­eases for which PGD was car­ried out between Jan­u­ary and Decem­ber 2009, accord­ing to the ESHRE data. 22 In these reports a total of 6160 cycles of IVF cycles with PGD or PGS, includ­ing PGS-SS, are pre­sent­ed. Of the­se, 2580 (41.8%) were car­ried out for PGD pur­pos­es, in which 1597 cycles were per­formed for sin­gle-gene dis­or­ders, includ­ing HLA typ­ing. An addi­tional 3551 (57.6%) cycles were car­ried out for PGS pur­poses and 29 (0.5%) for PGS-SS. 22 Although the ESHRE data rep­re­sent only a par­tial record of the PGD cases con­ducted world­wide, it is indica­tive of gen­eral trends in the field of PGD.

    …At least 40% to 60% of human embryos are abnor­mal, and that num­ber increases to 80% in women 40 years or old­er. These abnor­mal­i­ties result in low implan­ta­tion rates in embryos trans­ferred dur­ing IVF pro­ce­dures, from 30% in women < 35 years to 6% in women ≥ 40 years. 33 In a recent ret­ro­spec­tive review of tro­phec­to­derm biop­sies, ane­u­ploidy risk was evi­dent with increas­ing female age. A slightly increased preva­lence was noted at younger ages, with > 40% ane­u­ploidy in women ≤ 23 years. The risk of hav­ing no chro­mo­so­ma­lly nor­mal blas­to­cyst for trans­fer (the no-e­u­ploid embryo rate) was low­est (2-6%) in women aged 26 to 37, then rose to 33% at age 42 and reached 53% at age 44. 11

  4. :

    Age: <35yo 35-37 38-40 41-42 >42
    Live birth rate (%): 40.7 31.3 22.2 11.8 3.9

    …It is com­mon to remove between ten and thirty eggs.

    These rates are for cycles using non-donor eggs. (Though donor eggs are better quality & more likely to yield a birth, and hence better for selection purposes.)

  5. “Asso­ci­a­tion between the num­ber of eggs and live birth in IVF treat­ment: an analy­sis of 400 135 treat­ment cycles”, Sunkara et al 2011

    The median num­ber of eggs retrieved was 9 [in­ter-quar­tile range (IQR) 6-13; Fig. 2a] and the median num­ber of embryos cre­ated was 5 (IQR 3-8; Fig. 2b). The over­all LBR in the entire cohort was 21.3% [95% con­fi­dence inter­val (CI): 21.2-21.4%], with a grad­ual rise over the four time peri­ods in this study (14.9% in 1991-1995, 19.8% in 1996-2000, 23.2% in 2001-2005 and 25.6% in 2006-2008).

    Egg retrieval appears normally distributed in Sunkara et al 2011's graph. The SD is not given anywhere in the paper, but an SD of ~4-5 visually fits the graph and is compatible with a 6-13 IQR, and AFC reports SDs for eggs for two groups of 4.5 & 4.7 with averages of 10.5 & 9.4—closely matching the median of 9.

  6. The most nation­ally rep­re­sen­ta­tive sam­ple for the USA is the data that fer­til­ity clin­ics are legally required to report to the CDC. The most recent one is the “2013 Assisted Repro­duc­tive Tech­nol­ogy National Sum­mary Report”, which breaks down num­bers by age and egg source:

    Total number of cycles: 190,773 (includes 2,655 cycle[s] using frozen eggs)…Donor eggs: 9718 fresh cycles, 10270 frozen

    …Of the 190,773 ART cycles per­formed in 2013 at these report­ing clin­ics, 163,209 cycles (86%) were started with the intent to trans­fer at least one embryo. These 163,209 cycles resulted in 54,323 live births (de­liv­er­ies of one or more liv­ing infants) and 67,996 infants.

    Fresh eggs <35yo 35-37 38-40 41-42 43-44 >44
    cycles: 40,083 19,853 18,06 19,588 4,823 1,379
    P(birth|­cy­cle) 23.8 19.6 13.7 7.8 3.9 1.2
    P(birth|­trans­fer) 28.2 24.4 18.4 11.4 6.0 2.1
    Frozen eggs <35 35-37 38-40 41-42 43-44 >44
    cycles: 21,627 11,140 8,354 3,344 1,503 811
    P(birth|­trans­fer) 28.6 27.2 24.4 21.2 15.8 8.7

    …The largest group of women using ART ser­vices were women younger than age 35, rep­re­sent­ing approx­i­mately 38% of all ART cycles per­formed in 2013. About 20% of ART cycles were per­formed among women aged 35-37, 19% among women aged 38-40, 11% among women aged 41-42, 7% among women aged 43-44, and 5% among women older than age 44. Fig­ure 4 shows that, in 2013, the type of ART cycles var­ied by the wom­an’s age. The vast major­ity (97%) of women younger than age 35 used their own eggs (non-donor), and about 4% used donor eggs. In con­trast, 38% of women aged 43-44 and 73% of women older than age 44 used donor eggs.

    …Out­comes of ART Cycles Using Fresh Non-donor Eggs or Embryos, by Stage, 2013:

    1. 93,787 cycles started
    2. 84,868 retrievals
    3. 73,571 trans­fers
    4. 33,425 preg­nan­cies
    5. 27,406 live-birth deliv­er­ies

    The CDC report doesn't specify how many eggs on average are retrieved or the abnormality rate by age, although we can note that ~10% of retrievals didn't lead to any transfers (since there were 85k retrievals but only 74k transfers), which looks consistent with an overall mean & SD of 9 (4.6) and a 50% abnormality rate. We could also try to back out estimates from the figures on average number of embryos per transfer, number of transfers, and number of cycles (eg 1.8 for <35yos across 33,750 transfers, so 60,750 transferred embryos, as part of the 40,083 cycles, indicating each cycle must have yielded at least 1.5 embryos), but that only gives a loose lower bound, since there may be many leftover embryos and the abnormality rate is unknown.

    So for an Amer­i­can model of <35yos (the chance of IVF suc­cess declines so dras­ti­cally with age that it’s not worth con­sid­er­ing older age brack­et­s), we could go with a set of para­me­ters like {9, 4.6, 0.5, 0.96, 0.28}, but it’s unclear how accu­rate a guess that would be.

  7. Tur-Kaspa et al 2010 reports results from an Illi­nois fer­til­ity clinic treat­ing cys­tic fibro­sis car­ri­ers who were using PGD:

    No. of patients (age ≤42 years): 74
    No. of cycles for PGD for CF: 104
    Mean no. of IVF-PGD cycles/couple: 1.4 (104/74)
    No. of cycles with embryo transfer (%): 94 (90.4)
    No. of embryos transferred: 184
    Mean no. of embryos transferred: 1.96 (184/94)
    Total number of pregnancies: 44
    No. of miscarriages (%): 7 (15.9)
    No. of deliveries: 37
    No. of healthy babies born: 49
    No. of babies per delivery: 1.3
    No. of cycles resulting in pregnancy (%): 44/104 (42.3)
    No. of transfer cycles resulting in a pregnancy (%): 44/94 (46.8)
    Take-home baby rate per IVF-PGD cycle (%): 37/104 (35.6)

    Table: Table 1: Outcomes of IVF-preimplantation genetic diagnosis (PGD) cycles for cystic fibrosis (CF) (2000-2005).

    For the Tur-Kaspa et al 2010 cost-ben­e­fit analy­sis, the num­ber of eggs and sur­vival rates are not given in the paper, so it can’t be used for sim­u­la­tion, but the over­all con­di­tional prob­a­bil­i­ties look sim­i­lar to Hodes-W­ertz.
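Sunkara et al 2011's egg-count distribution (#5 above) can be sanity-checked directly: a normal distribution with the guessed mean 9 & SD 4.6 reproduces the reported inter-quartile range:

```r
## quartiles of Normal(9, 4.6), to compare with the reported egg IQR of 6-13:
round(qnorm(c(0.25, 0.75), mean=9, sd=4.6), 1)
# [1]  5.9 12.1
```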

With these sets of data, we can fill in para­me­ter val­ues for the sim­u­la­tion and esti­mate gains.

Using the Tan et al 2014 data:

  1. eggs extracted per per­son: nor­mal dis­tri­b­u­tion, mean=3, SD=4.6 (dis­cretized into whole num­bers)
  2. using pre­vi­ous sim­u­la­tion, ‘SNP test’ all eggs extracted for poly­genic score
  3. P=0.5 that an egg is nor­mal
  4. P=0.96 that it sur­vives vit­ri­fi­ca­tion
  5. P=0.24 that an implanted egg yields a birth
simulateTan <- function() { return(simulateIVFs(3, 4.6, selzam2016, 0.5, 0.96, 0.24)); }
iqTan <- mean(simulateTan()) * 15; iqTan
# [1] 0.3808377013

That is, the couples in Tan et al 2014 would have seen a ~0.4 IQ point increase.

The Murugappan et al 2015 cost-benefit analysis uses data from American fertility clinics reported in Hodes-Wertz 2012's "Idiopathic recurrent miscarriage is caused mostly by aneuploid embryos": 278 cycles yielding 2282 blastocysts or ~8.2 on average; 35% normal; there is no mention of losses to cryostorage, so I borrow 0.96 from Tan et al 2014; 1.65 implanted on average in 181 transfers, yielding 40% live-births. So:

simulateHodesWertz <- function() { return(simulateIVFs(8.2, 4.6, selzam2016, 0.35, 0.96, 0.40)) }
iqHW <- mean(simulateHodesWertz()) * 15; iqHW
# [1] 0.684226242

Societal effects

One cat­e­gory of effects con­sid­ered by Shul­man & Bostrom is the non-fi­nan­cial social & soci­etal effects men­tioned in their Table 3, where embryo selec­tion can “per­cep­ti­bly advan­tage a minor­ity” or in an extreme case, “Selected dom­i­nate ranks of elite sci­en­tists, attor­neys, physi­cians, engi­neers. Intel­lec­tual Renais­sance?”

This is another point which is worth going into a little more; no specific calculations are mentioned by Shulman & Bostrom, and the thin-tail effects of normal distributions are notoriously counterintuitive, with surprisingly large effects out on the tails from small-seeming changes in means or standard deviations—for example, the legendary levels of Western Jewish overperformance despite their tiny population sizes.

The effects of selection also compound over generations; for example, in the famous Tryon experiment selectively breeding rats for maze-running ability, a large gap in mean performance had opened up by the 2nd generation, and by the 7th, the distributions almost ceased to overlap (see figure 4 in Tryon 1940). Or consider the long-term Illinois corn/maize selection experiment, where the 2 selected lines have continued to diverge over 100+ generations of selection.

Con­sid­er­ing the order/tail effects for cutoffs/thresholds cor­re­spond­ing to admis­sion to elite uni­ver­si­ties, for many pos­si­ble com­bi­na­tions of embryo selec­tion boosts/IVF uptakes/generation accu­mu­la­tions, embryo selec­tion accounts for a major­ity or almost all of future elites.

As a general rule of thumb, 'elite' groups like scientists, attorneys, physicians, Ivy League students etc are highly selected for intelligence—one can comfortably estimate averages >=130 IQ (+2SD) from past IQ samples & average SAT scores & the ever-increasingly stringent admissions; and elite performance continues to increase with increasing intelligence as high as can reasonably be measured, as indicated by available data like estimates of eminent historical figures (eg Cox 1926; see also Simonton in general) and the SMPY & TIP longitudinal studies, where we might define the cutoff as 160 IQ based on studies of the most eminent available scientists (mean ~150-160). So to estimate an impact, one could consider a question like: given an average boost of x IQ points through embryo-selection, how much would the odds of being elite (>=130) or extremely elite (>=160) increase for the selected? If a certain fraction of IVFers were selected, what fraction of all people above the cutoff would they make up?
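To make the thin-tail sensitivity concrete, assuming normality: a +9.36 point shift in the mean multiplies the fraction clearing a 130 cutoff by nearly 4x, even though the shift itself is under two-thirds of an SD:

```r
## relative frequency above IQ 130, with vs without a +9.36 point mean shift:
(1 - pnorm(130, mean=100+9.36, sd=15)) / (1 - pnorm(130, mean=100, sd=15))
# ≈ 3.7
```

The multiplier grows rapidly as the cutoff moves further out into the tail.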

If there are 320 million people in the USA, then about 17m are at +2SD and 43k at +4SD:

dnorm((130-100)/15) * 320000000
# [1] 17277109.28
dnorm((160-100)/15) * 320000000
# [1] 42825.67224

Similarly, in 2013, the CDC reports 3,932,181 children born in the USA; and the 2013 CDC annual IVF report says that 67,996 (1.73%) were IVF. (This 1-2% population rate of IVF will highly likely increase substantially in the future, as many countries have recorded higher use of IVF or ART in general: Europe-wide rates increased 1.3%-2.4% 1997-2011; in 2013 European countries reported percentages of 4.6% (Belgium)/5.7% (Czech Republic), 6.2% (Denmark), 4% (Estonia), 5.8% (Finland), 4.4% (Greece), 6% (Slovenia), & 4.2% (Spain); Australia reached ~4% & NZ 3% in 2018; Japan reportedly had 5% in 2015; and Denmark reached 8% in 2016. And presumably US rates will go up as the population ages & education credentialism continues.) This implies that IVFers also make up a small number of highly gifted children:

size <- function(mean, cutoff, populationSize, useFraction=1) {
    if (cutoff > mean) { dnorm(cutoff - mean) * populationSize * useFraction }
    else               { (1 - dnorm(cutoff - mean)) * populationSize * useFraction } }
size(0, (60/15), 67996)
# [1] 9.099920031

So assuming IVF parents average 100 IQ, then we can take the embryo selection theoretical upper bound of +9.36 IQ (+0.624SD) corresponding to the "aggressive IVF" set of scenarios in Table 3 of Shulman & Bostrom, and ask: if 100% of IVF children were selected, how many additional people over 160 would that create?

eliteGain <- function(ivfMean, ivfGain, ivfFraction, generation, cutoff, ivfPop, genMean, genPop) {

              ivfers      <- size(ivfMean,                      cutoff, ivfPop, 1)
              selected    <- size(ivfMean+(ivfGain*generation), cutoff, ivfPop, ivfFraction)
              nonSelected <- size(ivfMean,                      cutoff, ivfPop, 1-ivfFraction)
              gain        <- (selected+nonSelected) - ivfers

              population <- size(genMean, cutoff, genPop)
              multiplier <- gain / population
              return(multiplier) }
eliteGain(0, (9.36/15), 1, 1, (60/15), 67996, 0, 3932181)
# [1] 0.1554096565

In this exam­ple, the +0.624SD boosts the absolute num­ber by 82 peo­ple, rep­re­sent­ing 15.5% of chil­dren pass­ing the cut­off; this would mean that IVF over­rep­re­sen­ta­tion would be notice­able if any­one went look­ing for it, but would not be a major issue nor even as notice­able as Jew­ish achieve­ment. We would indeed see “Sub­stan­tial growth in edu­ca­tional attain­ment, income”, but we would not see much effect beyond that.
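The density-based `size` can be cross-checked by redoing the same scenario with normal tail areas (a hypothetical recomputation, using the same +9.36 boost, +4SD cutoff, & 2013 birth counts):

```r
boost <- 9.36/15; cutoff <- 60/15
## extra IVF children clearing the cutoff thanks to selection:
gained   <- (pnorm(cutoff - boost, lower.tail=FALSE) - pnorm(cutoff, lower.tail=FALSE)) * 67996
## all children clearing the cutoff among the 3,932,181 births:
baseline <- pnorm(cutoff, lower.tail=FALSE) * 3932181
gained / baseline
```

This yields a fraction of the same order as the 15.5% above, so the qualitative conclusion does not hinge on the choice of approximation.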

Is it real­is­tic to assume that IVF chil­dren will be dis­trib­uted around a mean of 100 sans any inter­ven­tion? That seems unlike­ly, if only due to the sub­stan­tial finan­cial cost of using IVF; how­ev­er, the exist­ing lit­er­a­ture is incon­sis­tent, show­ing both higher & lower edu­ca­tion or IQ scores (Hart & Nor­man 2013), so per­haps the start­ing point really is 100. The thin-tail effects make the start­ing mean extremely impor­tant; Shul­man & Bostrom say, “Sec­ond gen­er­a­tion many­fold increase at right tail.” Let’s con­sider the sec­ond gen­er­a­tion; with their post-s­e­lec­tion mean IQ of 109.36, what sec­ond-gen­er­a­tion is pro­duced in the absence of out­breed­ing when they use IVF selec­tion?

eliteGain(0, (9.36/15), 1, 2, (60/15), 67996, 0, 3932181)
# [1] 1.151238772
eliteGain(0, (9.36/15), 1, 5, (60/15), 67996, 0, 3932181)
# [1] 34.98100356

Now the IVF chil­dren rep­re­sent a major­i­ty. With the third gen­er­a­tion, they reach 5x; at the fourth, 17x; at the fifth, 35x; and so on.

In practice, of course, we currently would get much less: 0.138 IQ points in the USA model, which would yield a trivial percentage increase of 0.06% (assuming an IVF-parent mean of 100) or 1.6% (assuming a mean of 115):

eliteGain(0, (0.13808892057/15), 1, 1, (60/15), 67996, 0, 3932181)
# [1] 0.0006478714323
eliteGain((15/15), (0.13808892057/15), 1, 1, (60/15), 67996, 0, 3932181)
# [1] 0.01601047464

Table 3 considers 12 scenarios: 3 adoption fractions of the general population (100% of IVFers/~2.5% of the general population, 10%, >90%) vs 4 average gains (4, 12, 19, 100+ IQ points). The descriptions add 2 additional variables: first vs second generation, and elite (130) vs eminent (160), giving 48 relevant estimates total.

scenarios <- expand.grid(c(0.025, 0.1, 0.9), c(4/15, 12/15, 19/15, 100/15), c(1,2), c(30/15, 60/15))
colnames(scenarios) <- c("Adoption.fraction", "IQ.gain", "Generation", "Eliteness")
scenarios$Gain.fraction <- round(do.call(mapply, c(function(adoptionRate, gain, generation, selectiveness) {
                                  eliteGain(0, gain, adoptionRate, generation, selectiveness, 3932181, 0, 3932181) }, unname(scenarios[,1:4]))), digits=2)
Adoption fraction IQ gain Generation Eliteness Gain fraction
0.025 4 1 130 0.02
0.100 4 1 130 0.06
0.900 4 1 130 0.58
0.025 12 1 130 0.06
0.100 12 1 130 0.26
0.900 12 1 130 2.34
0.025 19 1 130 0.12
0.100 19 1 130 0.46
0.900 19 1 130 4.18
0.025 100 1 130 0.44
0.100 100 1 130 1.75
0.900 100 1 130 15.77
0.025 4 2 130 0.04
0.100 4 2 130 0.15
0.900 4 2 130 1.37
0.025 12 2 130 0.15
0.100 12 2 130 0.58
0.900 12 2 130 5.24
0.025 19 2 130 0.28
0.100 19 2 130 1.11
0.900 19 2 130 10.00
0.025 100 2 130 0.44
0.100 100 2 130 1.75
0.900 100 2 130 15.77
0.025 4 1 160 0.05
0.100 4 1 160 0.18
0.900 4 1 160 1.62
0.025 12 1 160 0.42
0.100 12 1 160 1.68
0.900 12 1 160 15.13
0.025 19 1 160 1.75
0.100 19 1 160 7.01
0.900 19 1 160 63.11
0.025 100 1 160 184.65
0.100 100 1 160 738.60
0.900 100 1 160 6647.40
0.025 4 2 160 0.16
0.100 4 2 160 0.63
0.900 4 2 160 5.69
0.025 12 2 160 4.16
0.100 12 2 160 16.63
0.900 12 2 160 149.70
0.025 19 2 160 25.40
0.100 19 2 160 101.58
0.900 19 2 160 914.25
0.025 100 2 160 186.78
0.100 100 2 160 747.12
0.900 100 2 160 6724.04

To help capture what might be considered important or disruptive, let’s filter down the scenarios to ones where the embryo-selected now make up an absolute majority of any elite group (a fraction >0.5):

Adoption fraction IQ gain Generation Eliteness Gain fraction
0.900 4 1 130 0.58
0.900 12 1 130 2.34
0.900 19 1 130 4.18
0.100 100 1 130 1.75
0.900 100 1 130 15.77
0.900 4 2 130 1.37
0.100 12 2 130 0.58
0.900 12 2 130 5.24
0.100 19 2 130 1.11
0.900 19 2 130 10.00
0.100 100 2 130 1.75
0.900 100 2 130 15.77
0.900 4 1 160 1.62
0.100 12 1 160 1.68
0.900 12 1 160 15.13
0.025 19 1 160 1.75
0.100 19 1 160 7.01
0.900 19 1 160 63.11
0.025 100 1 160 184.65
0.100 100 1 160 738.60
0.900 100 1 160 6647.40
0.100 4 2 160 0.63
0.900 4 2 160 5.69
0.025 12 2 160 4.16
0.100 12 2 160 16.63
0.900 12 2 160 149.70
0.025 19 2 160 25.40
0.100 19 2 160 101.58
0.900 19 2 160 914.25
0.025 100 2 160 186.78
0.100 100 2 160 747.12
0.900 100 2 160 6724.04

For many of the scenarios, the impact is not blatant until a second generation builds on the first, but the cumulative effect is real: one of the weakest scenarios, +4 IQ at 10% adoption, can still be seen at the second generation, because effects are easier to spot at the most elite levels; in another example, a boost of 12 points is noticeable in a single generation with as little as 10% of the general population adopting. A boost of 19 points is visible in a fair number of scenarios, and a boost of 100 is visible at almost any adoption rate/generation/elite level. (Indeed, a boost of 100 results in almost meaninglessly large numbers under many scenarios; it is difficult to imagine a society with 100x as many geniuses running around, so it is even more difficult to imagine what it would mean for there to be 6,724x as many, other than that many things would start changing extremely rapidly in unpredictable ways.)

The tables do not attempt to give specific deadlines in years for when some of the effects will manifest, but we could try to extrapolate based on the ages at which eminent figures have historically made their first marks.

Chess prodigies have become grandmasters at very early ages, with a record of 12.6 years old and (as of 2016) 24 other chess prodigies reaching grandmaster level before age 15; the record age has dropped rapidly over time, which is often credited to computers & the Internet unlocking chess databases & engines to intensively train against, providing a global pool of opponents 24/7, and to intensive tutoring and training programs. William James Sidis is probably the most famous child prodigy, credited with feats such as reading by age 2 and writing mathematical papers by age 12, but he abandoned academia and never produced any major accomplishments; his acquaintance and fellow child prodigy, on the other hand, produced his first major work at age 17 and another at age 19; physicists in the early quantum era were noted for youth, with Bragg/Heisenberg/Pauli/Dirac producing their Nobel-prize-winning results at ages 22/23/25/26 (respectively). In mathematics, one prodigy made major breakthroughs around age 18, another published his first modal logic result at age 17, another likely began making major findings around age 16 and continued up to his youthful death at age 32, and another began publishing at age 15; young students making findings is such a trope that the Fields Medal has an age-limit of 39yo for awardees (who thus must have made their discoveries much earlier). Cliometrics and the ages of scientists and their life-cycles of productivity across time and fields have been studied by Simonton & Jones, and in Murray’s Human Accomplishment; we can also compare to the SMPY/TIP samples, where most took normal schooling paths. The peak age for productivity, and the average age for work that wins major prizes, differs a great deal by field: physics and mathematics are generally younger than fields like medicine or biology.
This suggests that different fields place different demands on Gf vs Gc (fluid vs crystallized intelligence): a field like mathematics, dealing in pure abstractions, will stress deep thought & fluid intelligence (peaking in the early 20s), while a field like medicine will require a wide variety of experiences and factual knowledge and less raw intelligence, and so may require decades before one can make a major contribution. (In literature, it has often been noted that lyric poets seem to peak young while novelists may continue improving throughout their lifetimes.)

So if we consider scenarios of intelligence enhancement up to 2 or 3 SDs (IQ 130-145), then we can expect that there may be a few initial results within 15 years, heavily biased towards STEM fields with strong Internet presences and traditions of openness in papers/software/data (such as machine learning), followed by a gradual increase in the number of results as the cohort begins reaching their 20s and 30s and their adult careers, and a broadening across fields such as medicine and the humanities. While math and technology results can have outsized impact these days, in a 2-3SD scenario the total number of 2-3SD researchers will not increase by a large factor, and so the expected impact will be similar to what we already experience in the pace of technological development: quick, but not unmanageable.

In the case of >=4SDs, things are a little different. The most comparable case is Sidis, who as mentioned was writing papers by age 12 after 10 years of reading; in an IES (iterated embryo selection) scenario, each member of the cohort might be far beyond Sidis, and so the entire cohort will likely reach the research frontier and begin making contributions before age 12. Although there must be limits on how fast a human child can develop mentally, for raw thermodynamic reasons like calories consumed if nothing else, there is no good reason to think that Sidis’s bound of 12 years is tight, especially given the modern context and the possibilities for accelerated education programs. (With such advantages, there may also be much larger cohorts, as parents decide the advantages are so compelling that they want them for their children and are willing to undergo the costs.)

If genetic differences and inequality exist, then perhaps they need to be engineered away.


As written, the IVF simulator cannot deliver a cost-benefit estimate, because the costs depend on the internal state (like how many good embryos were created, or the fact that a cycle ending in no live births still incurs costs), and it should report the marginal gain now that we’re going case by case. So it must be augmented:

library(plyr) # for ldply()
simulateIVFCB <- function (eggMean, eggSD, polygenicScoreVariance, normalityP=0.5, vitrificationP, liveBirth, fixedCost, embryoCost, traitValue) {
  eggsExtracted <- max(0, round(rnorm(n=1, mean=eggMean, sd=eggSD)))
  ## how many eggs become normal, testable embryos; these are what get SNPed & billed:
  normal        <- rbinom(1, eggsExtracted, prob=normalityP)
  totalCost     <- fixedCost + normal * embryoCost
  ## polygenic scores of the embryos, in SDs (siblings get half the predictable variance):
  scores        <- rnorm(n=normal, mean=0, sd=sqrt(polygenicScoreVariance*0.5))
  ## losses to freezing/thawing:
  survived      <- Filter(function(x){rbinom(1, 1, prob=vitrificationP)}, scores)
  ## implant the best-scoring embryos first, stopping at the first live birth:
  selection <- sort(survived, decreasing=TRUE)
  live <- 0
  gain <- 0
  if (length(selection)>0) {
   for (embryo in 1:length(selection)) {
    if (rbinom(1, 1, prob=liveBirth) == 1) {
      live <- selection[embryo]
      gain <- max(0, live - mean(selection))
      break } } }
  return(data.frame(Trait.SD=gain, Cost=totalCost, Net=(traitValue*gain - totalCost))) }
simulateIVFCBs <- function(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth, fixedCost, embryoCost, traitValue, iters=20000) {
  ldply(replicate(simplify=FALSE, iters, simulateIVFCB(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth, fixedCost, embryoCost, traitValue))) }
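For cross-checking the R results, here is an illustrative Python mirror of `simulateIVFCB()`: the same pipeline (eggs → normal embryos → scoring → freeze/thaw loss → best-first implantation attempts) and the same cost model. It is a sketch for verification, not a replacement:

```python
import random
from statistics import mean

def simulate_ivf_cb(egg_mean, egg_sd, pgs_variance, normality_p, vitrification_p,
                    live_birth_p, fixed_cost, embryo_cost, trait_value, rng=random):
    eggs = max(0, round(rng.gauss(egg_mean, egg_sd)))
    # eggs surviving to normal, testable embryos; these are what get SNPed & billed:
    normal = sum(rng.random() < normality_p for _ in range(eggs))
    total_cost = fixed_cost + normal * embryo_cost
    # embryo polygenic scores in SDs; siblings get half the predictable variance:
    scores = [rng.gauss(0, (pgs_variance * 0.5) ** 0.5) for _ in range(normal)]
    survived = [s for s in scores if rng.random() < vitrification_p]
    gain = 0.0
    for s in sorted(survived, reverse=True):  # implant best-scoring first
        if rng.random() < live_birth_p:
            gain = max(0.0, s - mean(survived))
            break
    return {"trait_sd": gain, "cost": total_cost,
            "net": trait_value * gain - total_cost}
```

Averaging many runs with the same parameters approximates the `summary()` means reported below (up to Monte Carlo noise).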

Now we have all our parameters set:

  1. IQ’s value per point, or per SD (multiply by 15)
  2. the fixed cost of selection is $1500
  3. the per-embryo cost of selection is $200
  4. and the relevant probabilities have been defined already

iqLow <- 3270*15; iqHigh <- 16151*15
## Tan:
summary(simulateIVFCBs(3, 4.6, selzam2016, 0.5, 0.96, 0.24, 1500, 200, iqLow))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   :-3900.0000
# 1st Qu.:0.00000000   1st Qu.:1500.00   1st Qu.:-1700.0000
# Median :0.00000000   Median :1700.00   Median :-1500.0000
# Mean   :0.02854686   Mean   :1873.05   Mean   : -472.8266
# 3rd Qu.:0.03149430   3rd Qu.:2100.00   3rd Qu.: -579.1553
# Max.   :0.42872383   Max.   :4300.00   Max.   :19076.2182
summary(simulateIVFCBs(3, 4.6, selzam2016, 0.5, 0.96, 0.24, 1500, 200, iqHigh))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   : -4100.000
# 1st Qu.:0.00000000   1st Qu.:1500.00   1st Qu.: -1700.000
# Median :0.00000000   Median :1700.00   Median : -1500.000
# Mean   :0.02847819   Mean   :1873.08   Mean   :  5026.188
# 3rd Qu.:0.03005473   3rd Qu.:2100.00   3rd Qu.:  5143.879
# Max.   :0.48532430   Max.   :4100.00   Max.   :115677.092

## Hodes-Wertz:
summary(simulateIVFCBs(8.2, 4.6, selzam2016, 0.35, 0.96, 0.40, 1500, 200, iqLow))
#    Trait.SD                Cost              Net
# Min.   :0.000000000   Min.   :1500.00   Min.   :-4100.0000
# 1st Qu.:0.000000000   1st Qu.:1700.00   1st Qu.:-1900.0000
# Median :0.007840085   Median :2100.00   Median :-1500.0000
# Mean   :0.051678465   Mean   :2079.25   Mean   :  455.5787
# 3rd Qu.:0.090090594   3rd Qu.:2300.00   3rd Qu.: 2168.2666
# Max.   :0.463198015   Max.   :4100.00   Max.   :21019.8626
summary(simulateIVFCBs(8.2, 4.6, selzam2016, 0.35, 0.96, 0.40, 1500, 200, iqHigh))
#    Trait.SD                Cost              Net
# Min.   :0.000000000   Min.   :1500.00   Min.   : -3700.0000
# 1st Qu.:0.000000000   1st Qu.:1700.00   1st Qu.: -1700.0000
# Median :0.006228574   Median :2100.00   Median :  -650.2792
# Mean   :0.050884913   Mean   :2083.41   Mean   : 10244.2234
# 3rd Qu.:0.088152844   3rd Qu.:2300.00   3rd Qu.: 19048.4272
# Max.   :0.486235107   Max.   :4100.00   Max.   :114497.7483
## USA, youngest:
summary(simulateIVFCBs(9, 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqLow))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   :-3900.0000
# 1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.:-2045.5047
# Median :0.00000000   Median :1900.00   Median :-1500.0000
# Mean   :0.03360950   Mean   :2037.22   Mean   : -388.6739
# 3rd Qu.:0.05023528   3rd Qu.:2300.00   3rd Qu.:  287.3619
# Max.   :0.52294123   Max.   :3900.00   Max.   :23950.2672
summary(simulateIVFCBs(9, 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   : -3900.000
# 1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.: -1900.000
# Median :0.00000000   Median :1900.00   Median : -1500.000
# Mean   :0.03389909   Mean   :2044.75   Mean   :  6167.812
# 3rd Qu.:0.05115755   3rd Qu.:2300.00   3rd Qu.: 10224.781
# Max.   :0.45364794   Max.   :4100.00   Max.   :108203.019

In general, embryo selection as of January 2016 is just barely profitable or somewhat unprofitable in each group using the lowest estimate of IQ’s value; it is always profitable on average with the highest estimate.

Value of Information

To get an idea of the value of further research into improving the polygenic score or optimizing other parts of the procedure, we can look at the overall population gains in the USA if it were adopted by all potential users.

Public interest in selection

How many people can we expect to use embryo selection as it becomes available?

My belief is that total uptake will be fairly modest as a fraction of the population. A large fraction of the population expresses hostility towards any new fertility-related technology whatsoever, and the people open to the possibility will be deterred by the necessity of advanced family planning, the large financial cost of IVF, and the fact that the IVF process is lengthy and painful. I think that prospective mothers will not undergo it unless the gains are enormous: the difference between having kids or never having kids, or having a normal kid or one who will die young of a genetic disease. A fraction of an IQ point, or even a few points, is not going to cut it. (Perhaps boosts around 20 IQ points, a level with dramatic and visible effects on educational outcomes, would be enough?)

We can see this unwillingness partially expressed in long-standing trends against the wide use of sperm & egg donation. As Matt Ridley points out (“Why Eugenics Won’t Come Back”), a prospective mother could easily increase traits of her children by eugenic selection of sperm donors, such as eminent scientists, above and beyond the relatively unstringent screening done by current sperm banks and the selectivity of sperm buyers:

…we now know from 40 years of experience that without coercion there is little or no demand for genetic enhancement. People generally don’t want paragon babies; they want healthy ones that are like them. At the time test-tube babies were first conceived in the 1970s, many people feared in-vitro fertilization would lead to people buying sperm and eggs off celebrities, geniuses, models and athletes. In fact, the demand for such things is negligible; people wanted to use the new technology to cure infertility—to have their own babies, not other people’s. It is a persistent misconception shared among clever people to assume that everybody wants clever children.

Ignoring that celebrities, models, and athletes are often highly successful sexually (which can be seen as a ‘donation’ of sorts), this sort of thing was in fact done by the Repository for Germinal Choice; but despite apparently working as intended (as expected from selecting for highly intelligent donors), it had a troubled 29-year run (primarily due to a severe donor shortage18) and has no explicit successors.19

So that largely limits the market for embryo selection to those who would already use it: those who must use it.

Will they use it? Ridley’s argument doesn’t prove that they won’t, because the use of sperm/egg donors comes at the cost of reduced relatedness. Non-use of “celebrities, geniuses, models, and athletes” merely shows that the perceived benefits do not outweigh the costs; it doesn’t tell us what the benefits or costs are. And the cost of reduced relatedness is a severe one: a normal fertile pair of parents will no more be inclined to use a sperm or egg donor (and which one, exactly? who chooses?) than they would be to adopt, and some would be willing to extract sperm from a dead man just for the relatedness.20 A more relevant situation would be how parents act in the infertility situation, where avoiding reduced relatedness is impossible.

In that situation, parents are notoriously eugenic in their preferences, demanding of sperm or egg banks that the donor be healthy, well-educated (at the Ivy League, of course, where egg donation is regularly advertised), have particular hair & eye colors (using sperm/eggs exported from Scandinavia, if necessary), be tall (men) and young (Whyte et al 2016), and free of any mental illnesses. This pervasive selection works: one study, drawing on a donor sibling registry, documented selection in favor of taller sperm donors and, as predicted by the breeder’s equation, offspring who were taller by 1.23 inches.21 Should parents discover that a sperm donor was actually autistic or schizophrenic, allegations of fraud & “wrongful birth” lawsuits will immediately begin flying, regardless of whether those parents would explicitly acknowledge that most human traits are highly heritable and embryo selection is possible. The practical willingness of parents to make eugenic choices based on donor profiles suggests that, advertised correctly, embryo selection could become standard. (For example, given the pervasive Puritanical bias in health towards preventing illness instead of increasing health, embryo selection for intelligence or height can be framed as reducing the risk of developmental delays or shortness; which it would.) Reportedly, as of 2016, PGD for hair and eye color is already quietly being offered to parents and accepted, and mentions are made of the potential for selection on other traits.

More drastically, in cases of screening for severe genetic disorders by testing potential carrier parents and fetuses, parents in practice are willing to make use of screening (if they know about it) and use PGD or selective abortion in anywhere up to 95-100% of cases (depending on disease & sample), across diseases (eg Choi et al 2012; Kaback 2000; Liao et al 2005, Scotet et al 2008; Ioannou et al 2015, Sawyer et al 2006, Hale et al 2008, Massie et al 2009) and in general (eg Franasiak et al 2016). This willingness is enough to noticeably affect population levels of these disorders (particularly Down syndrome, which has dropped dramatically in the USA despite an aging population that should be increasing it). The willingness to use PGD or abort rises with the severity of the disorder, true, but here again there are extenuating factors: parents considerably underestimate their willingness to use PGD/abortion before diagnosis compared to after they are actually diagnosed, and using IVF just for PGD, or aborting a pregnancy, are expensive & highly undesirable steps to take; that the rates are so high regardless suggests that in other scenarios (like a couple using IVF for fertility reasons), willingness may be high (and higher than people think before being offered the option). Still, we should not underestimate the strength of the desire for a child genetically related to oneself: willingness to use techniques like PGD is limited and far from absolute.
The number of people who are carriers of a terminal dominant genetic disease like Huntington’s disease (which has a reliable, cheap, universally-available test) who will deliberately not test a fetus or use PGD, or will choose to bear a fetus which has already tested positive, is strikingly high: Bouchghoul et al 2016 reports that carriers had only limited patience for prenatal testing: if the first pregnancy was successful, 20% did not bother testing their second pregnancy, and if not, 13% did not test their second, and of those who tested twice with carriers, 3 of 5 did no further testing; a followup study found that of 13 couples who decided in advance that they would abort a carrier fetus, 0 went through with it.

Time will tell whether embryo selection becomes anything more than an exotic novelty, but it looks as though, when relatedness is not a cost, parents will tend to accept it. This suggests that Ridley’s argument is incorrect when extended to embryo selection/editing; people simply want to both have and eat their cake, and as embryo selection/editing entails little or no loss of relatedness, it is not comparable to sperm/egg donation.

Hence, I suggest the most appropriate target market is simply the total number of IVF users, and not the much smaller number of egg/sperm donation users.

VoI for USA IVF population

Using the high estimate of an average gain of $6,230, and noting that there were 67,996 IVF babies in 2013, that suggests an annual gain of up to $423m. What is the net present value of that annual flow? Discounted at 5%, it’d be ~$8.7b. (Why a 5% discount rate? This is the highest discount rate I’ve seen used in health economics; more typical are discount rates like NICE’s 3.5%, which would yield a much larger NPV.)
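The perpetuity arithmetic can be verified directly (using the continuous-discounting convention of dividing the annual flow by log(1+r), as the R snippets here do):

```python
from math import log

annual_gain = 6230 * 67996          # high-estimate net gain x 2013 IVF births: ~$424m/year
npv = annual_gain / log(1 + 0.05)   # perpetuity at a 5% discount rate: ~$8.7b
print(round(annual_gain / 1e6), round(npv / 1e9, 1))
```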

We might also ask: as an upper bound, in the realistic USA IVF model, how much would a perfect SNP polygenic score be worth?

summary(simulateIVFCBs(9, 4.6, 0.33, 0.3, 0.90, 10.8/100, 1500, 200, iqLow))
#     Trait.SD              Cost              Net
#  Min.   :0.0000000   Min.   :1500.00   Min.   :-3700.000
#  1st Qu.:0.0000000   1st Qu.:1700.00   1st Qu.:-2100.000
#  Median :0.0000000   Median :1900.00   Median :-1500.000
#  Mean   :0.1037614   Mean   :2042.24   Mean   : 3047.259
#  3rd Qu.:0.1562492   3rd Qu.:2300.00   3rd Qu.: 5516.869
#  Max.   :1.4293926   Max.   :3900.00   Max.   :68411.709
summary(simulateIVFCBs(9, 4.6, 0.33, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#     Trait.SD              Cost             Net
#  Min.   :0.0000000   Min.   :1500.0   Min.   : -4100.00
#  1st Qu.:0.0000000   1st Qu.:1700.0   1st Qu.: -1900.00
#  Median :0.0000000   Median :1900.0   Median : -1500.00
#  Mean   :0.1030492   Mean   :2037.6   Mean   : 22927.61
#  3rd Qu.:0.1530295   3rd Qu.:2300.0   3rd Qu.: 34652.62
#  Max.   :1.3798166   Max.   :4100.0   Max.   :331981.26
ivfBirths <- 67996; discount <- 0.05
current <- 6230; perfect <- 23650
(ivfBirths * perfect)/(log(1+discount)) - (ivfBirths * current)/(log(1+discount))
# [1] 24277235795

Increasing the polygenic score to its maximum of 33% increases the profit by ~5x. This increase, over the number of annual IVF births, gives a net present expected value of perfect information (EVPI) for a perfect score of something like $24b. How much would it cost to gain perfect information? Hsu 2014 argues that a sample around 1 million would suffice to reach the GCTA upper bound using a particular algorithm; the largest usable22 sample I know of, SSGAC, is around n = 300k, leaving 700k to go; with SNPs costing ~$200, that implies that it would cost ~$0.14b for perfect SNP information. Hence, the expected value of information net of data costs would then be ~$24.1b and safely profitable. From that, we could also estimate the expected value of sample information (EVSI): if the 700k SNPs would be worth that much, then on average23 each additional datapoint is worth ~$34k. Aside from the Hsu 2014 estimate, we can use a formula from a model in the Rietveld et al 2013 supplementary materials (pg22-23), where they offer a population-genetics-based approximation of how much variance a given sample size & heritability will explain:
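Spelling out that arithmetic (the $200/SNP and 700k-remaining-samples figures are from the text above):

```python
from math import log

ivf_births, discount = 67996, 0.05
evpi = ivf_births * (23650 - 6230) / log(1 + discount)  # ~$24.3b NPV of a perfect score
data_cost = 700_000 * 200                               # ~$0.14b to collect the remaining samples
per_sample = (evpi - data_cost) / 700_000               # ~$34k net value per additional sample
print(round(evpi / 1e9, 1), round(per_sample / 1e3))
```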

  1. M is the effective number of independent SNP loci; they state that M = 67865, so sample size enters the approximation as N/M.
  2. For education (the phenotype variable targeted by the main GWAS, serving as a proxy for intelligence), they estimate h2=0.2, or h = 0.447 (h2 here being the heritability capturable by their SNP arrays, so equivalent to the GCTA estimate); so for their sample size of 100,000, they would expect to explain ((N/M) × h2^2) / ((N/M) × h2 + 1) = ((100000/67865) × 0.2^2) / ((100000/67865) × 0.2 + 1) ≈ 0.045 or 4.5% of variance, while they got 2-3%, suggesting over-estimation.
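As a numerical check on the approximation, plugging in the constants from the list above:

```python
# Rietveld et al 2013 approximation: variance explained ~ ((N/M) h2^2) / ((N/M) h2 + 1)
M, h2, N = 67865, 0.2, 100000
r2 = ((N / M) * h2**2) / ((N / M) * h2 + 1)
print(round(r2, 4))   # -> 0.0455, i.e. the ~4.5% expected vs the 2-3% realized
```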

Using this equation we can work out changes in variance explained with changes in sample sizes, and thus the value of an additional datapoint. For intelligence, the GCTA estimate is h2 = 0.33; Rietveld et al 2013 realized a variance explained of 0.025, implying it’s equivalent to n = 17,000 (the N which yields 0.025 under h2 = 0.33), and so we need ~6x more education-phenotype samples to reach the same efficacy in predicting intelligence. We can then ask how much variance is explained by a larger sample and how much that is worth over the annual IVF headcount. Since selection is not profitable under the low IQ estimate and 1 more datapoint will not make it profitable, the EVSI of another education datapoint must be negative and is not worth estimating, so we use the high estimate instead, asking how much an increase of, say, 1000 datapoints is worth on average:

gwasSizeToVariance <- function(N, h2) { ((N / 67865) * h2^2) / ((N/67865) * h2 + 1) }
sampleIncrease <- 1000
original     <- gwasSizeToVariance(17000, 0.33)
originalplus <- gwasSizeToVariance(17000+sampleIncrease, 0.33)
originalGain     <- mean(simulateIVFCBs(9, 4.6, original, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh)$Net)
originalplusGain <- mean(simulateIVFCBs(9, 4.6, originalplus, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh)$Net)
originalGain; originalplusGain
## NPV of the extra profit over annual IVF births, per datapoint; divided by 6
## because education samples are ~6x less efficient than intelligence samples:
((((originalplusGain - originalGain) * ivfBirths) / log(1+discount)) / sampleIncrease) / 6
# [1] 71716.90116

$71k is within an order of magnitude of the Hsu 2014 extrapolation, so reasonable given all the approximations here.

Going back to the lowest IQ value estimate, in the US population estimate, embryo selection only reaches break-even once the variance explained increases by a factor of 2.1, to 5.25%. Boosting it to 0.0525 turns out to require n = 40,000 (2.35x); since education samples are ~6x less efficient, another Rietveld et al 2013-style education GWAS would be adequate once it reached n ≈ 240,000. After that sample size has been exceeded, EVSI will then be closer to $10k.
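The break-even sample size can also be solved in closed form by inverting the Rietveld approximation (a sketch; the n = 40,000 above is this figure rounded up):

```python
M = 67865

def required_n(target_r2, h2):
    """Invert target = ((N/M) h2^2) / ((N/M) h2 + 1) for N."""
    x = target_r2 / (h2**2 - target_r2 * h2)
    return x * M

n_iq = required_n(0.0525, 0.33)      # ~39k intelligence-phenotype samples
print(round(n_iq), round(n_iq * 6))  # x6 for the education-phenotype equivalent
```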


Overview of Selection Improvements

There are many possible ways to improve selection. As selection boils down to simply taking the maximum of samples from a normal distribution, at a high level there are only 3 parameters: the number of samples drawn from the normal distribution, the variance of that distribution, and its mean. Many things affect each of those variables, and each parameter influences the final gain, but that’s the ultimate abstraction. To help keep them straight, we can break up possible improvements into those 3 categories, asking: what variables are varying, how much are they varying, and how can we increase the mean?

  1. what variables vary?

    • multiple selection: selecting on the weighted sum of many variables simultaneously; the more variables, the closer the index approaches the true global latent value of a sample

    • variable measurement: binary/dichotomous variables throw away information, while continuous variables are more informative and reflect outcomes better.

      Schizophrenia, for example, may typically be described as a binary variable to be modeled by a liability threshold model, which has the implication that returns diminish especially fast in reducing schizophrenia genetic burden; but there is measurement error/disagreement about whether a person should be diagnosed as schizophrenic, someone who doesn’t have it yet may develop it later, and there is evidence that schizophrenia genetic burden has effects in non-cases as well, like increased disordered thinking or lowered IQ. This affects both the initial construction of the SNP heritability/PGS, and the estimate of the value of changing the PGS.

    • rare vs common variants: omitting rare variants will naturally restrict how useful selection can be; you can’t select on variance in what you can’t see. (SNPs are only a temporary phase.) The rare variants don’t necessarily need to be known with high confidence; selection could be for fewer or less-harmful-looking rare variants, as most rare variants are either neutral or harmful.

  2. how much do they vary?

    • better PGSes:

      • more data: larger n in GWASes, whole genomes rather than only SNPs, more accurate & detailed phenotype data to predict
      • better analysis: better regression methods, better priors (based on biological data or just using informative distributions), more imputation, more correlated traits & latent traits hierarchically related, more exploitation of population structure to estimate away environmental effects & detect rare variants which may be unique to families/lineages & indirect genetic effects, rather than over-controlling population structure/indirect effects away along with part of the signal
    • larger effective n to select from:

      • safer egg harvesting methods which can increase the yields
      • reducing loss in the IVF pipeline by improvements to the implantation/live-birth rate
      • massive embryo selection: replacing standard IVF egg harvesting (intrinsically limited) with egg manufacturing via immature eggs harvested from ovarian biopsies, or gametogenesis (somatic/stem cells→egg)
    • more variance:

      • directed mutagenesis
      • increasing the chromosome recombination rate?
      • splitting up or combining chromosomes
      • create only male embryos (to exploit the greater variance in outcomes from the X/Y chromosome pair)
  3. how to increase the mean?

    • multi-stage selection:

      • parental selection
      • chromosome selection
      • gametic selection
      • iterated embryo selection
    • gene editing, chromosome or genome synthesis
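The three knobs can be made concrete with the standard order-statistic approximation for the expected maximum of n normal draws (Blom’s approximation; the ~3.5%-of-variance figure for the 2016 polygenic score is an assumption here):

```python
from statistics import NormalDist

def expected_best(n, sd, mean_shift=0.0):
    """Blom's approximation to E[max of n iid Normal(mean_shift, sd) draws]."""
    z = NormalDist().inv_cdf((n - 0.375) / (n + 0.25))
    return mean_shift + sd * z

# knob 1 (n): diminishing returns; knob 2 (sd): linear; knob 3 (mean): linear & compoundable.
sib_sd = (0.035 * 0.5) ** 0.5 * 15   # ~2 IQ points of sibling PGS spread (assumed 3.5% of variance)
print(round(expected_best(10, sib_sd), 1))   # best of 10: ~3 IQ points, as in the introduction
```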

Limiting step: eggs or scores?

Embryo selection gains can be optimized in a number of ways: harvesting more eggs, having more eggs be normal & successfully fertilized, reducing the cost of SNPing or increasing the predictive power of the polygenic scores, and better implantation success. However, the “leaky pipeline” nature of embryo selection means that optimization may be counterintuitive (akin to similar problems in drug development).

There’s no clear way to improve egg quality or implant better, and the cost of SNPs is already dropping as fast as anyone could wish for, which leaves just improving the polygenic scores and harvesting more eggs. Improving the polygenic scores is addressed in the previous Value of Information section and turns out to be doable and profitable, but requires a large investment by institutions which may not be interested in researching the matter further. Further, better polygenic scores make relatively little difference when the number of embryos to select from is small, as it currently is in IVF due to the small number of harvested eggs & continuous losses in the IVF pipeline: it is not helpful to increase the probability of selecting the best embryo out of 3 by just a few percentage points when that embryo will probably not successfully be born, and when it is only a few IQ points above average in the first place.

That leaves egg harvesting; this is limited by each woman’s idiosyncratic biology, and also by safety issues, and we can’t expect much beyond the median 9 eggs. There is, however, one oft-mentioned possibility for getting many more eggs: coaxing stem cells into using their pluripotency to develop into eggs, possibly hundreds or thousands of viable eggs. (There is another possible alternative, “ovarian tissue extraction”: surgically extracting ovarian tissue, vitrifying it, and at a (potentially much) later date, rewarming & extracting eggs directly from the follicles. It’s a much more serious procedure, and it’s unclear how many eggs it could yield.) This stem cell method is reportedly being developed24 and, if successful, would enable powerful embryo selection and also be a major step towards “iterated embryo selection” (see that section). We can call an embryo selection process which uses not harvested eggs but grown eggs in large quantities “massive embryo selection”, to keep in mind the major difference: quantity is a quality all its own.

How much would getting scores or hundreds of eggs help, and how does the gain scale? Since returns diminish, and we already know that under the low value of IQ embryo selection is not profitable, it follows that no larger number of eggs will be profitable either; so like with EVSI, we look at the high value’s upper bound if we could choose an arbitrary number of eggs:

gainByEggcount <- sapply(1:300, function(egg) { mean(simulateIVFCBs(egg, 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh)$Net) })
max(gainByEggcount); which.max(gainByEggcount)
# [1] 26657.1117
# [1] 281
plot(1:300, gainByEggcount, xlab="Average number of eggs available", ylab="Profit")
summary(simulateIVFCBs(which.max(gainByEggcount), 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#     Trait.SD              Cost              Net
#  Min.   :0.0000000   Min.   :12300.0   Min.   :-21900.00
#  1st Qu.:0.1284192   1st Qu.:17300.0   1st Qu.: 12711.92
#  Median :0.1817688   Median :18300.0   Median : 25630.74
#  Mean   :0.1845060   Mean   :18369.1   Mean   : 26330.25
#  3rd Qu.:0.2372748   3rd Qu.:19500.0   3rd Qu.: 39162.75
#  Max.   :0.5661427   Max.   :25300.0   Max.   :117856.55
max(gainByEggcount) / which.max(gainByEggcount)
# [1] 94.86516619
Net profit vs average number of eggs

The maximum is at ~281 eggs, yielding 0.18SD/~2.7 points & a net profit of ~$26k, indicating that with that many eggs, the cost of the additional SNPing exceeds the marginal IQ gain from having 1 more egg available which could turn into an embryo & be selected amongst. With $26k profit vs 281 eggs, we could say that the gain from unlimited eggs compared to the normal yield of ~9 eggs is ~$20k ($26k vs the best current scenario of $6k), and that the average profit from adding each egg was ~$73, giving an idea of the sort of per-egg costs one would need from an egg stem cell technology (small). The optimal number of eggs decreases as per-egg costs increase; if it costs another $200 per embryo, then the optimal number of eggs is around half, and so on.
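The ~$73/egg figure can be reproduced directly from the simulation summaries (assuming, as above, a mean net profit of ~$26,330 at the 281-egg optimum vs ~$6,230 in the earlier 9-egg high-value scenario):

```r
## marginal profit per additional egg: extra profit over the normal yield,
## divided by the extra eggs (mean net profits taken from the simulations above)
(26330 - 6230) / (281 - 9)
# [1] 73.89706
```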

So with present polygenic-scores & SNP costs, an unlimited number of eggs would only increase profit by 4x, as we are then still constrained by the polygenic score. This would be valuable, of course, but it is not a huge change.

Inducing eggs from stem cells does have the potentially valuable feature that it is probably money-constrained rather than egg- or PGS-constrained: you want to stop at a few hundred eggs but only because IQ and other selected traits are being valued at a low rate. If one values them higher, the limit will be pushed out further—a thousand eggs would deliver gains like +20 IQ points, and a wealthy actor might go even further to 10,000 eggs (+24), although even the wealthiest actors must stop at some point due to the thin tails/diminishing returns.
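Those figures can be sanity-checked with Blom’s approximation to the expected maximum of n normal deviates, deflating by an assumed predictor SD of √(0.33/2) ≈ 0.41 (the halved SNP-heritability upper bound; pipeline losses are ignored here for simplicity):

```r
## Blom approximation to E[max] of n standard normals, deflated by the
## assumed predictor SD and converted to IQ points (15 points per SD)
approxGainIQ <- function(n, predictorSD=sqrt(0.33/2)) {
    qnorm((n - 0.375) / (n + 0.25)) * predictorSD * 15 }
round(approxGainIQ(c(1000, 10000)))
# [1] 20 23
```

Blom’s formula slightly underestimates the exact expected maximum, so this is consistent with the ~+20/+24 figures above.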

Optimal stopping/search

I model embryo selection with many embryos as an optimal stopping/search problem and give an example algorithm for when to halt that results in substantial savings over the brute force approach of testing all available embryos. This shows that with a little thought, “too many embryos” need not be any problem.

A general principle in statistics & decision theory is that it is as good or better to have more options or actions or information than fewer (computational issues aside). Embryo selection is no exception: it is better to have many embryos than few, many PGSes available for each embryo than one, and it is better to adaptively choose how many to sequence/test than to test them all blindly.25 This point becomes especially critical when we begin speculating about hundreds or thousands of embryos, as the cost of testing them all may far exceed any gain.

But we can easily do better.

The secretary problem is a famous example of an optimal stopping problem: in sequentially searching through n candidates, permanently choosing/rejecting at each step, with only relative rankings known & no distribution, it turns out that, remarkably, one can select the best candidate ~37% of the time independent of n, and one can select a candidate with an expected rank of ~3.9. Given that we know the PGSes are normal, have utilities for them, and do not need to irrevocably choose, we should be able to do even better.

This can be solved by the usual Bayesian search decision theory approach: at each step, calculate the expected Value of Information from another search (upper bounded by the expected Value of Perfect Information), and when the marginal VoI <= marginal cost, halt, and return the best candidate. If we do not know parental genomes or have trait values, we must update our distribution of possible outcomes from another sample: for example, if we sequence the first embryo and find a high PGS compared to the population mean, then that implies a high parental mean, which means that the future embryos might be even higher than we expected, and thus we will want to continue sampling longer than we did before. (In practice, this probably has little effect, as it turns out we already want to sample so many embryos on average that the uncertainty in the mean is near-zero by the time we near the stopping point.) In the case where parental genomes are available or we have phenotypes, we can assume we are sampling from a known normal distribution, and so we don’t even need to do any Bayesian updates based on our previous observations; we can simply calculate the expected increase from another sample.

Consider sequentially searching a sample of n normal deviates for the maximum deviate, with a certain utility cost per sample & utility of each +SD.

Given diminishing returns of order statistics, there may be an n at which it on average does not pay to search all of the n but only a few of them. There is also optionality to search: if a large value is found early in the search, given normality it is unlikely that a better candidate will be found afterwards, so one should stop the search immediately to avoid paying futile search costs; so while having not yet reached that average n, a sample may have been found so good that one should stop early.

The expected Value of Perfect Information is when we can search the whole sample for free; so here it is simply the expected max of the full n times the utility.
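(The expected maximum of n deviates can be computed exactly by integrating over the density of the sample maximum; a self-contained sketch, presumably equivalent to the exactMax helper defined earlier & used below:)

```r
## expected maximum of n iid normals (scaled by sd), integrating
## x * n * dnorm(x) * pnorm(x)^(n-1), the density of the sample maximum
expectedMaxN <- function(n, sd=1) {
    sd * integrate(function(x) { x * n * dnorm(x) * pnorm(x)^(n-1) },
                   -Inf, Inf)$value }
expectedMaxN(5)
# [1] 1.162964
```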

So our n might be the usual 5 embryos, our utility cost is $200 per step (the cost to sequence each embryo), and the utility of each +SD can be the low value of IQ ($3270 per IQ point or 15x for +1 SD). Compared with zero embryos tested, since 5 yields a gain of +1.16SD, the EVPI in that scenario is $57k. However, if we already have 3 embryos tested (+0.84SD), the EVPI diminishes—2 more embryos sampled on average will only increase by +0.31SD or $15k. And by the same logic, the one-step case follows: sampling 1 embryo given 3 already has an EVPI of +0.18SD or $8k. Given that the cost to sample one-step is so low ($200), it is immediately clear we probably should continue sampling—after all, we gain $8k but only spend $0.2k to do so.

So the sequential search in embryo selection borders on trivial: given the low cost and high returns, for all reasonable sizes of n, we will on average want to search the entire sample. At what n would we halt on average? In other words, for what n is (exactMax(n) - exactMax(n-1)) * $49,050 < $200? Or to put it another way, when is the order statistic difference <0.004 SDs ($200 / $49,050 ≈ 0.004)? In this case, we only hit diminishing returns strongly enough around n = 88.

allegrini2018 <- sqrt(0.11*0.5)
iqLow         <- 3270*15
testCost      <- 200

exactMax(5)
# [1] 1.162964474
exactMax(5) * iqLow
# [1] 57043.40743
(exactMax(5) - exactMax(3))
# [1] 0.3166800983
(exactMax(5) - exactMax(3)) * iqLow
# [1] 15533.15882

round(sapply(seq(2, 300, by=10), function(n) { (exactMax(n) - exactMax(n-1)) * iqLow }))
#  [1] 27673  2099  1007   648   473   370   303   255   220   194   172   155   141   129   119
# [16]   110   103    96    90    85    81    76    73    69    66    63    61    58    56    54

That assumes a perfect predictor, of course, and we do not have that. Deflating by the halved Allegrini et al 2018 PGS, the crossover is closer to n = 24:

round(sapply(2:26, function(n) { (exactMax(n, sd=allegrini2018) - exactMax(n-1, sd=allegrini2018)) * iqLow }))
# [1] 6490 3245 2106 1537 1199  977  822  706  618  549  492  446  407  374  346  322  300  281  265  250  236  224  213  203  194
exactMax(24, sd=allegrini2018)
# [1] 0.4567700586
exactMax(25, sd=allegrini2018)
# [1] 0.4609071309
0.4609071309 - 0.4567700586
# [1] 0.0041370723

stoppingRule <- function(predictorSD, utilityCost, utilityGain) {
 n <- 1
 while(((exactMax(n+1, sd=predictorSD) - exactMax(n, sd=predictorSD)) * utilityGain) > utilityCost) { n <- n+1 }
 return(c(n, exactMax(n), exactMax(n, sd=predictorSD))) }

round(digits=2, stoppingRule(allegrini2018, testCost, iqLow))
# [1] 25.00 1.97 0.46
round(digits=2, stoppingRule(allegrini2018, 100, iqLow))
# [1] 45.00  2.21  0.52

Another way of putting it would be that we’ve derived a stopping rule: once we have a candidate of >=0.4567SD, we should halt, as all future samples are expected to cost too much. (If the candidate embryo is nonviable or fails to yield a live birth, testing can simply resume with the rest of the stored embryos until the stopping rule fires again or one has tested the entire sample.) Compared to blind batch sampling without regard to marginal costs, the expected benefit of this stopping rule is the number of searches past n = 24 times the cost minus the marginal benefit, so if we were instead going to blindly test an entire sample of n = 48, we’d incur a loss of $1516:

marginalGain <- (exactMax(48, sd=allegrini2018) - exactMax(24, sd=allegrini2018)) * iqLow
marginalCost <- (48-24) * testCost
marginalGain; marginalCost
# [1] 3283.564451
# [1] 4800
marginalGain - marginalCost
# [1] -1516.435549

The loss would continue to increase the further past the stopping point we go. This demonstrates the benefits of sequential testing and gives a formula & code for deciding when to stop based on cost/benefit/normal distribution parameters.

To go into further detail, in any particular run, we would see different random samples at each step. We also might not have derived a stopping rule in advance. Does the stopping rule actually work? What does it look like to simulate out stepping through embryos one at a time, calculating the expected value of testing another sample (estimated via Monte Carlo, since it is not a simple threshold Gaussian but the expectation of the improvement over the current maximum26), and after stopping, comparing to what if we had instead tested them all?

It looks as expected: typically we test up to 24 embryos, get an SD increase of <=0.45SD (if we don’t have >24 embryos, unsurprisingly we won’t get that high), and by stopping early, we do in fact save a modest amount each run, enough to outweigh the occasional scenario where the remaining embryos hid a really high score. And since we do usually stop around 24, the batch testing becomes increasingly worse the larger the total n becomes—by 500 embryos, the loss averages ~$85k:

library(parallel) # warning, Windows users
library(memoise); library(plyr) # for memoise() & ldply() below

## Memoise the Monte Carlo evaluation to save time - it's almost exact w/100k & simpler:
expectedPastThreshold <- memoise(function(maximum, predictorSD) {
    mean({ x <- rnorm(100000, sd=predictorSD); ifelse(x>maximum, x-maximum, 0) }) })

optimalSearch <- function(maxN, predictorSD, utilityCost, utilityBenefit) {

    samples <- rnorm(maxN, sd=predictorSD)

    i <- 1; maximum <- samples[1]; cost <- utilityCost; profit <- 0; gain <- max(maximum,0);
    while (i < maxN) {

        marginalGain <- expectedPastThreshold(maximum, predictorSD)

        if (marginalGain*utilityBenefit > utilityCost) {
          i <- i+1
          cost <- cost+utilityCost
          nth <- samples[i]
          maximum <- max(maximum, nth); } else { break; } }

    gain <- maximum * utilityBenefit; profit <- gain-cost;
    searchAllProfit <- max(samples)*utilityBenefit - maxN*utilityCost

    return(c(i, maximum, cost, gain, profit, searchAllProfit, searchAllProfit - (gain-cost))) }

optimalSearch(100, allegrini2018, testCost, iqLow)
# [1]    48     0  9600 22475 12875  9462 -3413

## Parallelize simulations:
optimalSearchs <- function(a,b,c,d, iters=10000) { df <- ldply(mclapply(1:iters, function(x) { optimalSearch(a,b,c,d); }));
  colnames(df) <- c("N", "Maximum.SD", "Cost.total", "Gain.total", "Profit", "Nonadaptive.profit", "Nonadaptivity.regret"); return(df) }

summary(digits=2, optimalSearchs(5,   allegrini2018, testCost, iqLow))
#       N         Maximum.SD      Cost.total     Gain.total         Profit       Nonadaptive.profit Nonadaptivity.regret
# Min.   :1.0   Min.   :-0.27   Min.   : 200   Min.   :-13039   Min.   :-14039   Min.   :-14039     Min.   : -800
# 1st Qu.:5.0   1st Qu.: 0.16   1st Qu.:1000   1st Qu.:  7978   1st Qu.:  6978   1st Qu.:  6978     1st Qu.:    0
# Median :5.0   Median : 0.26   Median :1000   Median : 12902   Median : 11902   Median : 11902     Median :    0
# Mean   :4.6   Mean   : 0.27   Mean   : 921   Mean   : 13267   Mean   : 12346   Mean   : 12306     Mean   :  -40
# 3rd Qu.:5.0   3rd Qu.: 0.37   3rd Qu.:1000   3rd Qu.: 18199   3rd Qu.: 17199   3rd Qu.: 17199     3rd Qu.:    0
# Max.   :5.0   Max.   : 1.05   Max.   :1000   Max.   : 51405   Max.   : 51205   Max.   : 50405     Max.   :14789
summary(digits=2, optimalSearchs(10,  allegrini2018, testCost, iqLow))
#       N          Maximum.SD      Cost.total     Gain.total        Profit      Nonadaptive.profit Nonadaptivity.regret
# Min.   : 1.0   Min.   :-0.06   Min.   : 200   Min.   :-2934   Min.   :-4934   Min.   :-4934      Min.   :-1800
# 1st Qu.: 7.0   1st Qu.: 0.27   1st Qu.:1400   1st Qu.:13047   1st Qu.:11047   1st Qu.:11047      1st Qu.: -400
# Median :10.0   Median : 0.35   Median :2000   Median :17275   Median :15275   Median :15275      Median :    0
# Mean   : 8.2   Mean   : 0.36   Mean   :1649   Mean   :17594   Mean   :15945   Mean   :15754      Mean   : -190
# 3rd Qu.:10.0   3rd Qu.: 0.44   3rd Qu.:2000   3rd Qu.:21718   3rd Qu.:20742   3rd Qu.:20109      3rd Qu.:    0
# Max.   :10.0   Max.   : 0.97   Max.   :2000   Max.   :47618   Max.   :46218   Max.   :45618      Max.   :20883
summary(digits=2, optimalSearchs(24,  allegrini2018, testCost, iqLow))
#       N        Maximum.SD     Cost.total     Gain.total        Profit      Nonadaptive.profit Nonadaptivity.regret
# Min.   : 1   Min.   :0.12   Min.   : 200   Min.   : 5719   Min.   :  919   Min.   :  919      Min.   :-4600
# 1st Qu.: 7   1st Qu.:0.37   1st Qu.:1400   1st Qu.:18238   1st Qu.:13438   1st Qu.:13438      1st Qu.:-2800
# Median :16   Median :0.43   Median :3200   Median :21201   Median :19223   Median :17145      Median : -600
# Mean   :15   Mean   :0.44   Mean   :3032   Mean   :21689   Mean   :18656   Mean   :17648      Mean   :-1008
# 3rd Qu.:24   3rd Qu.:0.50   3rd Qu.:4800   3rd Qu.:24527   3rd Qu.:22636   3rd Qu.:21217      3rd Qu.:    0
# Max.   :24   Max.   :1.13   Max.   :4800   Max.   :55507   Max.   :52107   Max.   :50707      Max.   :25705
summary(digits=2, optimalSearchs(100, allegrini2018, testCost, iqLow))
#       N         Maximum.SD     Cost.total      Gain.total        Profit      Nonadaptive.profit Nonadaptivity.regret
# Min.   :  1   Min.   :0.31   Min.   :  200   Min.   :15218   Min.   :-4782   Min.   :-4782      Min.   :-19800
# 1st Qu.:  7   1st Qu.:0.43   1st Qu.: 1400   1st Qu.:21223   1st Qu.:16696   1st Qu.: 5342      1st Qu.:-15507
# Median : 16   Median :0.47   Median : 3200   Median :23239   Median :19919   Median : 8266      Median :-11772
# Mean   : 23   Mean   :0.50   Mean   : 4654   Mean   :24398   Mean   :19744   Mean   : 8762      Mean   :-10983
# 3rd Qu.: 33   3rd Qu.:0.54   3rd Qu.: 6600   3rd Qu.:26504   3rd Qu.:23076   3rd Qu.:11651      3rd Qu.: -7293
# Max.   :100   Max.   :1.10   Max.   :20000   Max.   :53952   Max.   :52352   Max.   :33952      Max.   : 18226
summary(digits=2, optimalSearchs(500, allegrini2018, testCost, iqLow))
#       N         Maximum.SD     Cost.total      Gain.total        Profit       Nonadaptive.profit Nonadaptivity.regret
# Min.   :  1   Min.   :0.40   Min.   :  200   Min.   :19607   Min.   :-25265   Min.   :-76428     Min.   :-99800
# 1st Qu.:  7   1st Qu.:0.43   1st Qu.: 1400   1st Qu.:21289   1st Qu.: 16559   1st Qu.:-67982     1st Qu.:-89569
# Median : 17   Median :0.48   Median : 3400   Median :23349   Median : 19779   Median :-65471     Median :-85154
# Mean   : 24   Mean   :0.50   Mean   : 4772   Mean   :24498   Mean   : 19726   Mean   :-64955     Mean   :-84681
# 3rd Qu.: 33   3rd Qu.:0.54   3rd Qu.: 6600   3rd Qu.:26500   3rd Qu.: 23232   3rd Qu.:-62591     3rd Qu.:-80393
# Max.   :234   Max.   :1.09   Max.   :46800   Max.   :53390   Max.   : 50453   Max.   :-44268     Max.   :-37431

Thus, the approach using the order statistics and the approach using Monte Carlo statistics agree; the threshold can be calculated in advance and the problem reduced to the simple algorithm “sample while best < threshold until running out”.
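(Incidentally, the expected gain past a threshold used in expectedPastThreshold also has a simple closed form for a mean-zero normal, the “expected improvement” E[max(X − t, 0)] = σφ(t/σ) − t(1 − Φ(t/σ)), which agrees with the Monte Carlo estimate:)

```r
## closed-form expected improvement of a N(0, sigma) deviate over threshold t;
## an exact alternative to the Monte Carlo estimate in expectedPastThreshold
expectedImprovement <- function(t, sigma) {
    sigma * dnorm(t/sigma) - t * (1 - pnorm(t/sigma)) }
sigma <- sqrt(0.11*0.5)
c(expectedImprovement(0.2, sigma), mean(pmax(rnorm(1000000, sd=sigma) - 0.2, 0)))
# both ~0.0257
```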

24 might seem like a low number, and it is, but it can be driven much higher: better PGSes which predict more variance, use of multiple-selection to synthesize an index trait which both varies more and has far greater value, and the expected long-term decreases in sequencing costs. For example, if we look at a later section where a few dozen traits are combined into a single “index” utility score, the SNP heritability’s index utility scores have an SD worth ~$72,000 & the 2016 PGSes give an SD worth ~$6,876, so our stopping rules look different:

## SNP heritability upper bound:
round(digits=2, stoppingRule(1, testCost, 72000))
# [1] 125.00   2.59   2.59
## 2016 multiple-selection:
round(digits=2, stoppingRule(1, testCost, 6876))
# [1] 16.00  1.77  1.77

Multiple selection

Intelligence is one of the most valuable traits to select on, and one of the easiest to analyze, but we should remember that it is neither necessary nor desirable to select only on a single trait. For example, in cattle embryo selection, selection is done not on a single trait but a weighted sum of 48 traits (Mullaart & Wells 2018).

Selecting only on one trait means that almost all of the available genotype information is being ignored; at best, this is a lost opportunity, and at worst, in some cases it is harmful—in the long run (dozens of generations), selection only on one trait, particularly in a very small breeding population as is often used in agriculture (albeit irrelevant to humans), will have “unintended consequences” like greater disease rates, shorter lifespans, etc (see Falconer 1960’s Introduction to Quantitative Genetics, Ch. 19 “Correlated Characters”, & Lynch & Walsh 1998’s Ch. 21 “Correlations Between Characters”). When breeding is done out of ignorance or with regard only to a few traits or on tiny founding populations, one may wind up with problematic breeds like some purebred dog breeds which have serious health issues due to inbreeding, small founding populations, no selection against negative mutations popping up, and variants which increase the selected trait at the expense of another trait.27 (This is not an immediate concern for humans as we have an enormous population, only weak selection methods, low levels of historical selection, and high heritabilities & much standing variance, but it is a concern for very long-term programs or hypothetical future selection methods like iterated embryo selection.)

This is why animal breeders do not select purely on a single valuable trait like egg-laying rate but on an index of many traits, from maturity speed to disease resistance to lifespan. An index is simply the sum of a large number of measured variables, implicitly equally weighted or explicitly weighted by their contribution towards some desired goal—the more included variables, the more effective selection becomes as it captures more of the latent differences in utility. For background on the theory and construction of indexes in selection, see Lynch & Walsh 2018.

In our case, a weak polygenic score can be strengthened by better GWASes, but it can also be combined with other polygenic scores to do selection on multiple traits by summing the scores per embryo and taking the maximum. For example, as of 2018-08-01, the UK Biobank makes public GWASes on 4,203 traits; many of these traits might be of no importance or the PGS too weak to make much of a difference, but the rest may be valuable. Once an index has been constructed from several PGSes, it functions identically to embryo selection on a single PGS and the previous discussion applies to it, so the interesting questions are: how expensive an index is to construct; what PGSes are used and how they are weighted; and what is the advantage of multiple embryo selection over simple embryo selection.

This can be done almost for free, since if one did sequencing on a comprehensive SNP array chip to compute 1 polygenic score, one probably has all the information needed. (Indeed, you could see selection on a single trait as an index selection where all traits’ values are implausibly set to 0 except for 1 trait.) In reality, while some traits are of much more value than others, there are few traits with no value at all; an embryo which scores mediocrely on our primary trait may still have many other advantages which more than compensate, so why not check? (It is a general principle that more information is better than less.) Intelligence is valuable, but it’s also valuable to live a long time, have less risk of schizophrenia, lower BMI, be happier, and so on.

A quick demonstration of the possible gain is to imagine taking the total of 1 normal deviate vs picking the most extreme out of several normal deviates. With 1 deviate, our average extreme is 0, and most of the time it will be within ±1SD. But if we can pick out of batches of 10, we can generally get +1.53SD:

mean(replicate(100000, max(rnorm(10, mean = 0))))
# [1] 1.537378753

What if we have 4 different scores (with three downweighted substantially to reflect that they are less valuable)? We get 0.23SD for free:

mean(replicate(100000, max(   1*rnorm(10, mean = 0) +
                           0.33*rnorm(10, mean = 0) +
                           0.33*rnorm(10, mean = 0) +
                           0.33*rnorm(10, mean = 0))))
# [1] 1.769910562

This is like selecting among multiple embryos: the more we have to pick from, the better the chance the best one will be particularly good. So in selecting embryos, we want to compute multiple polygenic scores for each embryo, weight them by the overall value of that trait, sum them to get a total score for each embryo, then select the best embryo for implantation.

The advantage of multiple polygenic scores follows from the variance sum law: for 2 independent variables X & Y, Var(X+Y) = Var(X) + Var(Y); that is, the variances are added, so the standard deviation will increase, so our expected maximum sample will increase. Recalling the diminishing returns of the expected maximum, increasing the SD beyond 1 will initially yield larger returns than increasing n past 9 (it looks linear rather than logarithmic, but embryo selection is zero-sum—the gain is shrunk by the weighting of the multiple variables), and so multiple selection should not be neglected. Using such a total score on n uncorrelated traits, as compared to alternative methods like selecting for 1 trait in each generation, is considerably more efficient, ~√n times as efficient (Hazel & Lush 1943, “The efficiency of three methods of selection”28/Lush 1943).
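A quick simulation illustrates the √n efficiency claim under illustrative assumptions (4 equally-valued independent traits, best of 10 embryos): index selection roughly doubles (√4) the gain in total merit over single-trait selection.

```r
## mean gain in total merit (sum of 4 equally-valued independent traits)
## from picking the best of 10 embryos by the summed index vs by 1 trait alone
simulateMeritGain <- function(useIndex, iters=20000) {
    mean(replicate(iters, {
        traits <- matrix(rnorm(10*4), nrow=10) # 10 embryos x 4 trait PGSes
        merit  <- rowSums(traits)
        if (useIndex) { max(merit) } else { merit[which.max(traits[,1])] } })) }
simulateMeritGain(FALSE); simulateMeritGain(TRUE)
# ~1.54 vs ~3.08
```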

We could rewrite simulateIVFCB to accept as parameters a series of polygenic score functions and simulate out each polygenic score and their sums; but we could also use the sum of random variables to create a single composite polygenic score—since the variances simply sum up (Var(X+Y) = Var(X) + Var(Y)), we can take the polygenic scores, weight them, and sum them.

combineScores <- function(polygenicScores, weights) {
    weights <- weights / sum(weights) # normalize to sum to 1
    # add variances, to get variance explained of total polygenic score
    sum(weights*polygenicScores) }

Let’s imagine a US example but with 3 traits now: IQ, and 2 traits we consider to be roughly half as valuable as IQ, but which have better polygenic scores available, of 60% and 5%. What sort of gain can we expect above our starting point?

weights <- c(1, 0.5, 0.5)
polygenicScores <- c(selzam2016, 0.6, 0.05)
summary(simulateIVFCBs(9, 4.6, combineScores(polygenicScores, weights), 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#     Trait.SD               Cost              Net
#  Min.   :0.00000000   Min.   :1500.00   Min.   : -3900.00
#  1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.: -1900.00
#  Median :0.00000000   Median :1900.00   Median : -1500.00
#  Mean   :0.07524308   Mean   :2039.25   Mean   : 16189.51
#  3rd Qu.:0.11491090   3rd Qu.:2300.00   3rd Qu.: 25638.72
#  Max.   :1.00232683   Max.   :4100.00   Max.   :241128.71

So we double our gains by considering 3 traits instead of 1.

Multiple selection on independent traits

A more realistic example would be to use some of the existing polygenic scores for complex traits, many of which are available for analysis from sources like LD Hub. Perhaps a little counterintuitively, to maximize the gains, we want to focus on universal traits such as IQ, or common diseases with high prevalence; the more horrifying genetic diseases are rare precisely because they are horrifying (natural selection keeps them rare), so focusing on them will only occasionally pay off.29
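A toy liability-threshold calculation makes the point concrete (all parameters hypothetical: a PGS explaining 10% of liability variance, choosing the lowest-liability embryo out of 10): the absolute risk reduction for a common 20%-prevalence disease is more than 10× that for a rare 1%-prevalence one.

```r
## expected disease risk after selecting the lowest-PGS embryo of n, under a
## liability-threshold model where the PGS explains `pgsVar` of liability
riskAfterSelection <- function(prevalence, pgsVar=0.10, n=10, iters=20000) {
    threshold <- qnorm(1 - prevalence)
    mean(replicate(iters, {
        bestPGS <- min(rnorm(n, sd=sqrt(pgsVar))) # lowest genetic liability
        pnorm(threshold, mean=bestPGS, sd=sqrt(1-pgsVar), lower.tail=FALSE) })) }
## absolute risk reductions, rare vs common:
0.01 - riskAfterSelection(0.01); 0.20 - riskAfterSelection(0.20)
# rare: under 1 percentage point; common: over 10 percentage points
```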

Here are 7 traits I looked up and was able to convert to relatively reasonable gains/losses:

  1. IQ (using the previously given value and the Selzam et al 2016 polygenic score, and excluding any valuation of the 7% of family SES & 9% of education that the IQ polygenic score comes with for free)

  2. height

    The literature is unclear what the best polygenic score for height is at the moment; let’s assume that it can predict most but not all, like ~60%, of variance, with a population standard deviation of ~4 inches; the economics estimate is $800 of annual income per inch or a NPV of $16k per inch or $65k per SD, so we would weight it as a quarter as valuable as the high IQ estimate (((800/log(1.05))*4) / iqHigh → 0.27). The causal link is not fully known, but a Mendelian randomization study of height & BMI supports causal estimates of $300/$1616 per SD respectively, which shows the correlations are not solely due to confounding.

  3. /

    Polygenic scores: