# Embryo selection for intelligence

A cost-benefit analysis of the marginal cost of IVF-based embryo selection for intelligence and other traits with 2016-2017 state-of-the-art
topics: decision theory, biology, psychology, statistics, transhumanism, R, power analysis, survey, IQ, SMPY, order statistics
created: 22 Jan 2016; modified: 21 Sep 2019; status: finished; confidence: likely;

With genetic predictors of a phenotypic trait, it is possible to select embryos during an in vitro fertilization process to increase or decrease that trait. Extending the work of /, I consider the case of human intelligence using SNP-based genetic prediction, finding:

• a meta-analysis of results indicates that SNPs can explain >33% of variance in current intelligence scores, and >44% with better-quality phenotype testing
• this sets an upper bound on the effectiveness of selection: a gain of 9 IQ points when selecting the top embryo out of 10
• the best 2016 polygenic score could achieve a gain of ~3 IQ points when selecting out of 10
• the marginal cost of embryo selection (assuming IVF is already being done) is modest, at $1500 + $200 per embryo, with the sequencing cost projected to drop rapidly
• a model of the IVF process, incorporating number of extracted eggs, losses to abnormalities & vitrification & failed implantation & miscarriages from 2 real IVF patient populations, estimates feasible gains of 0.39 & 0.68 IQ points
• embryo selection is currently unprofitable (mean: -$358) in the USA under the lowest estimate of the value of an IQ point, but profitable under the highest (mean: $6230). The main constraint on selection profitability is the polygenic score; under the highest value, the NPV EVPI of a perfect SNP predictor is $24b and the EVSI per education/SNP sample is $71k
• under the worst-case estimate, selection can be made profitable with a better polygenic score, which would require n>237,300 using education phenotype data (and much less using fluid intelligence measures)
• selection can be made more effective by selecting on multiple phenotype traits: considering an example using 7 traits (IQ/height/BMI/diabetes/ADHD/bipolar/schizophrenia), there is a factor gain over IQ alone; the outperformance of multiple selection remains after adjusting for genetic correlations & polygenic scores and using a broader set of 16 traits.
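
The order-statistics logic behind the "9 IQ points from the best of 10" bound can be sketched in a few lines. This is an illustration, not the analysis code used in the rest of the text. It assumes an IQ SD of 15 and that sibling embryos differ on only half of the population-wide additive variance (hence the division by 2). The names `expected_max` and `selection_gain_iq` are invented for this sketch.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1


def expected_max(n, lo=-10.0, hi=10.0, steps=20000):
    """Expected value of the largest of n independent standard-normal draws.

    Integrates x * n * pdf(x) * cdf(x)**(n-1) with the trapezoid rule.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * n * STD_NORMAL.pdf(x) * STD_NORMAL.cdf(x) ** (n - 1)
    return total * h


def selection_gain_iq(n_embryos, variance_explained, iq_sd=15.0):
    """Expected IQ gain from implanting the top-scoring embryo out of n."""
    # Assumption: embryos of the same parents vary on only half of the
    # variance that the predictor explains in the general population.
    within_family_sd = math.sqrt(variance_explained / 2.0)
    return expected_max(n_embryos) * within_family_sd * iq_sd


upper_bound = selection_gain_iq(10, 0.33)
print(f"best of 10, 33% of variance explained: {upper_bound:.1f} IQ points")
```

Under these assumptions the 33% figure gives a bit over 9 points for 10 embryos, in line with the bound quoted above. Passing a smaller `variance_explained` shows how quickly the gain shrinks for a weaker polygenic score.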

# Overview of Major Approaches

Before going into a detailed cost-benefit analysis of embryo selection, I’ll give a rough overview of the various developing approaches for genetic engineering of complex traits in humans, compare them, and briefly discuss possible timelines and outcomes. (References/analyses/code for particular claims are generally provided in the rest of the text, or in some cases, buried in my , and omitted here for clarity.)

The past 2 decades have seen a revolution in molecular genetics: the sequencing of the human genome kicked off an exponential reduction in genetic sequencing costs, which has dropped the cost of genome sequencing from millions of dollars to $20 (SNP genotyping) to $500 (whole genomes). This has enabled the accumulation of datasets of millions of individuals’ genomes, which allow a range of genetic analyses to be conducted, ranging from SNP heritabilities to detection of recent evolution to GWASes of traits to estimation of the genetic overlap of traits.

The simple summary of the results to date is: behavioral genetics was right. Almost all human traits, simple and complex, are caused by a joint combination of environment, stochastic randomness, and genes. These patterns can be studied by methods such as family, twin, adoption, or sibling studies, but ideally are studied directly by reading the genes of hundreds of thousands of unrelated people, which yields estimates of the effects of specific genes and predictions of phenotype values from entire genomes. Across all traits examined, genes cause ~50% of differences between people in the same environment, factors like randomness & measurement-error explain much of the rest, and whatever is left over is the effect of nurture. Evolution is true, and genes are discrete physical patterns encoded in chromosomes which can be read and edited. Simple traits, such as many diseases, are determined by a handful of genes, yielding complicated but discrete behavior, while complex traits are instead governed by hundreds or thousands of genes whose effects sum together and produce a normal distribution, such as IQ or risk of developing a complicated disease like schizophrenia. This allows direct estimation of a person’s genetic contribution to a phenotype, as well as that of their children. These genetic traits contribute to many observed societal patterns, such as the children of the rich also being richer and smarter and healthier, poorer neighborhoods having sicker people, relatives of schizophrenics being less intelligent, etc; these traits are substantially heritable, and traits are also interconnected in an intricate web of correlations where one trait causes another or both are caused by the same genetic variants. For example, intelligence-related variants are uniformly inversely correlated with disease-related variants, and positively correlated with desirable traits. These results have been validated by many different approaches, and the existence of widespread large heritabilities, genetic correlations, and valid PGSes is now academic consensus.
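
The claim that many small additive effects sum to a normal distribution is just the central limit theorem, and a toy simulation makes it concrete. The sketch below is illustrative only: the allele frequencies, effect sizes, and sample sizes are invented, variants are treated as independent, and `simulate_trait` is a hypothetical helper, not anything taken from a real GWAS.

```python
import random
import statistics


def simulate_trait(n_people=2000, n_variants=500, seed=1):
    """Simulate an additive polygenic score for n_people individuals."""
    rng = random.Random(seed)
    # Hypothetical allele frequencies and small additive effects per variant.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_variants)]
    effects = [rng.gauss(0.0, 1.0) for _ in range(n_variants)]
    scores = []
    for _ in range(n_people):
        score = 0.0
        for p, beta in zip(freqs, effects):
            # Two allele copies per variant, each present with probability p.
            copies = (rng.random() < p) + (rng.random() < p)
            score += copies * beta
        scores.append(score)
    return scores


scores = simulate_trait()
mean = statistics.mean(scores)
sd = statistics.pstdev(scores)
within_1sd = sum(abs(s - mean) <= sd for s in scores) / len(scores)
within_2sd = sum(abs(s - mean) <= 2 * sd for s in scores) / len(scores)
print(f"within 1 SD: {within_1sd:.2f}, within 2 SD: {within_2sd:.2f}")
```

If the sum behaves like a normal distribution, the two fractions should land near the familiar 68% and 95%, even though each individual variant only takes the values 0, 1, or 2.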

Because of this pervasive genetic influence on outcomes, genetic engineering is one of the great open questions in transhumanism: how much is possible, with what, and when?

Suggested interventions can be broken down into a few categories:

• cloning (copying)
• selection (variation with ranking)
• editing (rewriting)
• synthesis (writing)

An opinionated comparison of possible interventions, focusing on potential for improvements, power, and cost.

| Intervention | Description | Time | Cost | Limits | Advantages | Disadvantages |
|---|---|---|---|---|---|---|
| Cloning | Somatic cells are harvested from a human and their DNA transferred into an embryo, replacing the original DNA. The embryo is implanted. The result is equivalent to an identical twin of the donor, and if the donor is selected for high trait-values, will also have high trait-values but will regress to the mean depending on the heritability of said traits. | ? | $100k? | cannot exceed trait-values of donor, limited by best donor availability | does not require any knowledge of PGSes or causal variants, is likely doable relatively soon as modest extension of existing mammalian cloning, immediate gains of 3-4SD (maximum possible global donor after regression to mean) | may trigger taboos & is illegal in many jurisdictions, human cloning has been minimally researched, hard to find parents as clone will be genetically related to one parent at most & possibly neither, can’t be used to get rare or new genetic variants, inherently limited to regressed maximum selected donor, does not scale in any way with more inputs |
| Simple (Single-Trait) Embryo Selection | A few eggs are extracted from a woman and fertilized; each resulting sibling embryo is biopsied for a few cells which are sequenced. A single polygenic score is used to rank the embryos by predicted future trait-value, and surviving embryos are implanted one by one until a healthy live birth happens or there are no more embryos. By starting with the top-ranked embryo, an average gain is realized. | 0 years | $1k-$5k | egg count, IVF yield, PGS power | offspring fully related to parents, doable & profitable now, doesn’t require knowledge of causal variants, doesn’t risk off-target mutations, inherently safe gains, PGSes steadily improving | permanently limited to <1SD increases on trait, requires IVF so I am doubtful it could ever exceed ~10% US population usage, fails to benefit from using good genetic correlations to boost overlapping traits & avoid harm from negative correlations (where a good thing increases a bad thing), biopsy-sequencing imposes fixed per-embryo costs, fast diminishing returns to improvements, can only select on relatively common variants currently well-estimated by PGSes & cannot do anything about fixed variants which neither or both parents carry |
| Simple Multiple (Trait) Embryo Selection | As single-trait selection, but the PGS used for ranking is a weighted sum of multiple (possibly scores or hundreds of) PGSes of individual traits, weighted by utility. | As above | As above | As above | As above, but several times larger gains from selection on multiple traits | As above, but avoids harms from bad genetic correlations |
| Massive Multiple Embryo Selection | A set of eggs is extracted from a woman, or alternately, some somatic cells like skin cells. If immature eggs are taken in an ovary biopsy, they are matured in vitro to eggs; if somatic cells, they are regressed to stem cells, possibly replicated hundreds of times, and then turned into egg-generating-cells and finally eggs, yielding hundreds or thousands of eggs (all still identical to her own eggs). Either way, the resulting large number of eggs are then fertilized (up to a few hundred will likely be economically optimal), and then selection & implantation proceed as in simple multiple embryo selection. | >5 years | $5k to >$100k | sequencing+biopsy fixed costs, PGS power | offspring fully related to parents, lifts main binding limitation on simple multiple embryo selection, allowing potentially 1-5SD gains depending on budget, highly likely to be at least theoretically possible in next decade | cost of biopsy+sequencing scales linearly with number of embryos while encountering further diminishing returns than experienced in simple multiple embryo selection, may be difficult to prove new eggs are as long-term healthy |
| Gamete selection/Optimal Chromosome Selection (OCS) | Donor sperm and eggs are (somehow) sequenced; the ones with the highest-ranked chromosomes are selected to fertilize each other; this can then be combined with simple or massive embryo selection. It may be possible to fuse or split chromosomes for more variance & thus selection gains. | ? years | $1?-$5k? | ability to non-destructively sequence or infer PGSes of gametes rather than embryos, PGS power | immediate large boost of ~2SD possible by selecting earlier in the process before variance has been canceled out, does not require any new technology other than the gamete sequencing part | how do you sequence sperm/eggs non-destructively? |
| Iterated embryo selection (IES) | (Also called “whizzogenetics”, “in vitro eugenics”, or “in vitro breeding”/IVB.) A large set of cells, perhaps from a diverse set of donors, is regressed to stem cells and turned into both sperm & egg cells, which fertilize each other; the top-ranked embryos are then selected, yielding a moderate gain. Those embryos are not implanted but regressed back to stem cells, and the cycle repeats. Each “generation” the increases accumulate; after perhaps a dozen generations, the trait-values have increased many SDs, and the final embryos are then implanted. | >10 years | $1m?-$100m? | full gametogenesis control, total budget, PGS power | can attain maximum total possible gains, lessened IVF requirement (implantation but not the egg extraction), current PGSes adequate | full & reliable control of gamete⟺stem-cell⟺embryo pipeline difficult & requires fundamental biology breakthroughs, running multiple generations may be extremely expensive and gains limited in practice, still restricted to common variants & variants present in original donors, unclear effects of going many SDs up in trait-values, so expensive that embryos may have to be unrelated to future parents as IES cannot be done custom for every pair of prospective parents, may not be feasible for decades |
| Editing (eg CRISPR) | A set of embryos is injected with gene editing agents (eg CRISPR delivered via viruses or micro-pellets), which directly modify DNA base-pairs in some desired fashion. The embryos are then implanted. Similar approaches might be to instead try to edit the mother’s ovaries or the father’s testicles using a viral agent. | 0 years | <$10k | causal variant problem, number of safe edits, edit error rate | offspring fully related to parents, gains independent of embryo number (assuming no deep sequencing to check for mutations), potentially arbitrarily cheap, potentially unbounded gains, doesn’t require biopsy-sequencing, unknown upper bound on how many possible total edits, can add rare or unique genes | each edit adds little, edits inherently risky and may damage cells through off-target mutations or the delivery mechanism itself, requires identification of the generally-unknown causal genes rather than predictive ones from PGSes, currently doesn’t scale to more than a few (unique) edits, most approaches would require IVF, parental editing inherently halves the possible gain |
| Genome Synthesis | Chemical reactions are used to build up a strand of custom DNA literally base-pair by base-pair, which then becomes a chromosome. This process can be repeated for each chromosome necessary for a human cell. Once one or more of the chromosomes are synthesized, they can replace the original chromosomes in a human cell. The synthesized DNA can be anything, so it can be based on a polygenic score in which every SNP or genetic variant is set to the estimated best version. | >10 years (single chromosomes) to >15 years (whole genome)? | $30m-$1b | cost per base-pair, overall reliability of synthesis | achieves maximum total possible gains across all possible traits, is not limited to common variants & can implement any desired change, cost scales with genome replacement percentage (with an upper bound at replacing the whole genome), cost per base-pair falling exponentially for decades and HGP-Write may accelerate cost decrease, many possible approaches for genome synthesis & countless valuable research or commercial applications driving development, current PGSes adequate | full genome synthesis would cost ~$1b, error rate in synthesized genomes may be unacceptably high, embryos may be unrelated to parents due to cost like IES, likely not feasible for decades |

Overall I would summarize the state of the field as:

• cloning: is unlikely to be used at any scale for the foreseeable future despite its power, and so can be ignored (except inasmuch as it might be useful in another technology like IES or genome synthesis)
• simple single-trait embryo selection: is strictly inferior to simple multiple embryo selection; there is no reason to use it other than the desire to save a tiny bit of statistical effort, and much reason to use the multiple-trait version instead (larger and safer gains), so it need not be discussed except as a strawman.
• simple multiple-trait embryo selection: available & profitable now, but too limited in possible gains, requires a far too onerous process (IVF) for more than a small percentage of the population to use it, and is more or less trivial. As median embryo count in IVF hovers around 5, the total gain from selection is small, and much of the gain is wasted by losses in the IVF process (the best embryo doesn’t survive storage, the second-best fails to implant, and so on). One of the key problems is that polygenic scores are the sum of many individual small genes’ effects and form a normal distribution, which is tightly distributed around a mean; the further into the tail one goes, the larger a sample it requires to realize a gain. To put it another way, if you have 10 samples, it’s easy (a 1 in 10 probability) for your next random sample to be the largest sample yet, but if you have 100 samples, now the probability of an improvement is the much harder 1 in 100, and if you have 1000, it’s only 1 in 1000; worse, if you luck out and there is an improvement, the improvement is ever tinier. After taking into account existing PGSes, previously reported IVF process losses, costs, and so on, the implication is that it is moderately profitable and can increase traits perhaps 0.1SD, rising somewhat over the next decade as PGSes continue to improve, but never exceeding, say, 0.5SD. Embryo selection could have substantial societal impacts in the long run, especially over multiple generations, but this would require both that IVF become more common and that no other technology supersede it (as other technologies certainly shall). When IVF began, many pundits proclaimed it would “forever change what it means to be human” and other similar fatuosities; it did no such thing, and has since productively helped countless parents & children, and I fully expect embryo selection to go the same way.
I would consider embryo selection to have been considerably overhyped (by those hyperventilating about “Gattaca being around the corner”), and, ironically, also underhyped (by those making arguments like “trait X is so polygenic, therefore embryo selection can’t work”, which is statistically illiterate, or “traits are complex interactions between genes and environment most of which we will never understand”, which is obfuscating irrelevancy and FUD). Embryo selection does have the advantage of being the easiest to analyze & discuss, and the most immediately relevant.
• massive multiple embryo selection: the single most binding constraint on simple embryo selection (single or multiple trait) is the number of embryos to work with, which, since paternal sperm is effectively infinite, means the number of eggs. For selection, the key question is what is the most extreme or maximum item in the sample; a small sample will not spread wide, but a large sample will have a bigger extreme. The more lottery tickets you buy, the better the chance of getting 1 ticket which wins a lot. The PGS, by contrast and to people’s general surprise, doesn’t make all that much of a difference after a little while. If you have 3 embryos, even going from a noisy to a perfect predictor doesn’t make much of a difference, because no matter how flawless your prediction, embryo #1 (whichever it is) out of 3 just isn’t going to be all that much better than average; if you have 300 embryos, then a perfect predictor becomes more useful. There is no foreseeable way to safely extract more eggs from a donor, so the only option is to make more eggs.
One possibility is to not try to stimulate release of a few eggs and collect them, but instead to biopsy samples of proto-eggs and then hurry them in vitro to maturity as full eggs, and get many eggs that way. Biopsies might be compelling even without selection: the painful, protracted, failure-prone, and expensive egg harvesting process to get ~5 embryos, which then might yield a failed cycle anyway, could be replaced by a single quick biopsy under anesthesia yielding hundreds of embryos, effectively ensuring a successful cycle. Less invasively, laboratory results in inducing regression to stem cell states and then oogenesis have made steady progress over the past decades, primarily in rat/mouse but also in human cells, and researchers have begun to speak of the possibility in another 5 or 10 years of enabling infertile or homosexual couples to conceive fully genetically-related children through somatic-to-gametic cell conversions. This would also likely allow generating scores or hundreds of embryos by turning easier-to-acquire cells like skin cells or extracted eggs into stem cells which can replicate and then be converted into egg cells & fertilized. While it is still fighting the normal distribution with brute force, having 500 embryos works a lot better than having just 5 embryos to choose from. The downside is that one still needs to biopsy and sequence each embryo in order to compute its particular PGS; since one is still fighting the thin tail, at some point the cost of creating & testing another embryo exceeds the expected gain (probably somewhere in the hundreds of embryos). Unlike simple embryo selection, this could yield immediately important gains like +2SD. IVF yield ceases to be much of a problem (the second/third/fourth-best embryos are now almost exactly as good as the first-best was and they probably won’t all fail), and enough brute force has been applied to reach potentially 1-2SD in practice. If taken up by only the current IVF users and applied to intelligence alone, it would immediately lead to the next generation’s elite positions being dominated by their kids; if taken up by more and done properly on multiple traits, the advantage would be greater.
• Gamete selection/Optimal Chromosome Selection: only a theoretical possibility at the moment, as there is no direct way to sequence individual sperm/eggs or manipulate chromosome choice. GS/OCS are interesting more for the points they make about variance & order statistics & the CLT: they yield a much larger gain than one would expect, simply by switching perspectives and focusing on how to select earlier in the ‘pipeline’, so to speak, where variance is greater because sets of genes haven’t yet been combined in one package & canceled each other out. If someone did something clever to allow inference on gametes’ PGSes or to select individual chromosomes, then it could yield an immediate discontinuously large boost in trait-value of +2SD in conjunction with whatever embryo selection is available at that point.
• Iterated Embryo Selection: if IES were to happen, it would allow for almost arbitrarily large increases in trait-values across the board in a short period of time, perhaps a year. While IES has major disadvantages (extremely costly to produce the first optimized embryos, depending on how many generations of selection are involved; selection has some inherent speed limits trading off between accidentally losing possibly useful variants & getting as large a gain each generation as possible; embryos are unlikely to resemble the original donors at all without an additional generation ‘backcrossed’ with the original donor cells, undoing most of the work), the extreme increases may justify use of IES and create demand from parents. This could then start a tsunami. Depending on how far IES is pushed, the first release of IES-optimized embryos may become one of the most important events in human history.
IES is still distant and depends on a large number of wet lab breakthroughs and finetuned human-cell protocols. Coaxing scores or hundreds of cells through all the stages of development and fertilization, for multiple generations, is no easy task. When will IES be possible? The relevant literature is highly technical and only an expert can make sense of it, and one should have hands-on expertise to even try to make forecasts. There are no clear cost curves or laws governing progress in stem cell/gamete research which can be used to extrapolate. Perhaps no one will ever put all the money and consistent research effort into developing it into something which could be used clinically. Just because something is theoretically possible and has lots of lab prototypes doesn’t mean that the transition will happen. (Look at human cloning; everyone assumed it’d happen long ago, but as far as anyone knows, it never has.) On the other hand, perhaps someone will. IES is one of the scariest possibilities on the list, and the hardest to evaluate; it seems clear, at least, that it will not happen in the next decade, but after that…? IES has been badly under-discussed to date.
• Gene Editing: the development of CRISPR has led to more hype than embryo selection itself. However, the current family of CRISPR techniques, previous alternatives, & future improvements can be largely dismissed on statistical grounds alone. Even if we hypothesized some super-CRISPR which could make a handful of arbitrary SNP edits with zero risk of mutation or other forms of harm, it would not be especially useful and would struggle to be competitive with embryo selection, let alone IES/OCS/genome synthesis.
The unfixable root cause is the polygenicity of the most important polygenic traits (which is a blessing for selection or synthesis approaches, as it creates a vast reservoir of potential improvements, but a curse for editing), and to a lesser extent, the asymmetry of effect sizes (harmful variants are more harmful than beneficial ones are beneficial). The benefit of gene editing is the number of edits times the SNP effect of each edit times the probability that the edited SNP is causal. Probability it’s causal? Can’t we assume that the top hits from large GWASes these days have a posterior probability ~100% of having a non-zero effect? No. This is because of a technical detail which is largely irrelevant to selection processes but is vitally important to editing: the hits identified in a PGS are not necessarily the exact causal base-pair(s). Often they are, but more often they are not. They are instead proxies for a neighboring causal variant which happens to usually be inherited with them, as genomes are inherited in a chunky fashion, in big blocks, and do not split & recombine at every single base-pair. This is no problem for selection: a proxy predicts great, and a correlated SNP is cheaper & easier to find than the true causal variant. But it is fatal to editing: if you edit a proxy, it’ll do nothing (or maybe it’ll do the opposite). How fatal is this? From attempts at “fine-mapping” (using large datasets to distinguish which of a few SNPs is the real culprit) or from seeing how PGSes’ performance shrinks when going from the original GWAS population to a deeply genetically different population like Sub-Saharan Africans, who have totally different proxy patterns (if there is non-zero prediction power, it must be thanks to the causal hits, which act the same way in both populations), we can estimate that the causal probability may be as low as 10%.
Combine this with the few edits safely permitted, perhaps 5, and the small effect size of each genetic variant, like 0.2 IQ points for intelligence, and the effect becomes dismal. A tenth of a point? Not much. Even if we had all causal variants, the small average effect size, combined with few possible edits, is no good. Fix the causal variant problem, and it’s still only 5 edits at 0.2 points each. Nor is IQ at all unique in this respect: it’s somewhat unusually polygenic, but a cleaner trait like height still implies small gains such as half an inch. What about rare variants? The problem with rare variants is that they are rare, and also not of especially large beneficial effect. Being rare makes them hard to find in the first place, and the lack of benefit (as compared to a baseline human without said variant) means that they are not useful for editing. We might find many variants which damage a trait by a large amount, say, increasing abdominal fat mass by a kilogram or lowering IQ by a dozen points, but of course, we don’t want to edit those in! (They also aren’t that important for any embryo selection method, because they are rare, not usually present, and thus there is usually no selection to be done.) We could hope to find some variant which increases IQ by several points, but none have been found; if they were at all common they would’ve been found a long time ago, and indirect methods like DeFries-Fulker regression suggest that there are few or no such rare variants.
Nor is measuring other traits a panacea: if there were some variant which increased IQ by a medium amount by increasing a specific trait like working memory which has not been studied in large GWASes or DeFries-Fulker regressions to date, then such a WM-boosting variant should’ve been detected through its mediated effect, and to the extent that it has no effect on hard endpoints like IQ or education or income, it then must be questioned how useful it is in the first place… The situation may be somewhat better with other traits (there’s still hope for finding large beneficial effects, and in the other direction, disease traits tend to have more rare variants of larger effects which might be worth fixing in relatively many individual cases, like BRCA or APOE), but I just don’t see any realistic way to reach gains like +1SD on anything with gene editing methods in the foreseeable future using existing variants. What about non-existing variants, ie. brand-new variants based on extrapolation from human genetic history or animal models? These hypothetical mutations/edits could have large effects even if we have failed to find any in the wild. But the track record of animal models in predicting complex human systems such as the brain is not good at all, such large novel mutations would have zero safety record, and how would you prove any were safe without dozens of live births and long-term followup, which would never be permitted? Given the poor prior probability of both safety & efficacy, such mutations would simply remain untried indefinitely. It is difficult to see how to remedy this in any useful way. The causal probability will creep up as datasets expand & cross-racial GWASes become more common, but even increasing the gain by a factor of 10 that way doesn’t resolve the issue. The limit is still the edit count: the unique edit limit of ~5 is not enough to work with. Can this be combined usefully with IES to do edits per generation?
Likely, but you still need IES first! Can the edit limit be lifted? …Maybe. Genetic editing follows no predictable improvement curve, or learning curve, and doesn’t benefit directly from any exponentials. It is hard to forecast what improvements may happen. 2019 saw a breakthrough from a repeated-edit SOTA of ~60 edits in a cell to ~2,600, which no one forecast; it’s unclear when, if ever, that would transfer to useful per-SNP edits, but nevertheless the possibility of mass editing cannot be ruled out. So, CRISPR-style editing may be revolutionary in rare genetic diseases, agriculture, & research, but as far as we are concerned, it has been grossly overhyped: there is a chance it will live up to the most extreme claims, but not a large one.
• Genome synthesis: the simple answer to gene editing’s failure is to observe that if you have to make possibly thousands of edits to fix up a genome to the level you want it, why not go out and make your own genome? (with blackjack and hookers…) That is the audacious proposal of genome synthesis. It sounds crazy, since genome synthesis has historically been mostly used to make short segments for research, or perhaps the odd pandemic virus, but unnoticed by most, the cost per base-pair has been crashing for decades, allowing the creation of entire yeast genomes and leading to the recent HGP-Write proposal from George Church & others to invest in genome synthesis research with the aim of inventing methods which can create custom genomes at reasonable prices. Such an ability would be staggeringly useful: custom organisms designed to produce arbitrary substances, genomes with the fundamental encoding all swapped around rendering them immune to all viruses ever, organisms with a single giant genome or with all mutations replaced with the modal gene, among other crazy things.
One could also, incidentally, use cheap genome synthesis for bulk storage of data in a dense, durable, room-temperature format (explaining both Microsoft & IARPA’s interest in funding genome synthesis research). Of course, if you can synthesize an entire genome—a single chromosome would be almost as good to some extent—you can take a baseline genome and make as many ‘edits’ as you please. Set all the top variants for all the relevant traits to the estimated best setting. The possible gains are greater than IES (since you are not limited by the initial gene pool of starting variants nor by the selection process itself), and one can increase traits by hundreds of SDs (whatever that means). Genome synthesis, unlike IES, has historically proceeded on a smooth cost-curve, has many possible implementations, and has many research groups & startups involved due to its commercial applications. A large-scale () has been proposed to scale genome synthesis up to yeast sized organisms and eventually human-sized genomes. The cost curve suggests that around 2035, whole human genomes reach well-resourced research project ranges of$10-30m; some individuals in genome synthesis tell me they are optimistic that new methods can greatly accelerate the cost-curve. (Unlike IES, genome synthesis is not committed to a particular workflow, but can use any method which yields, in the end, the desired genome; all of these methods can be , representing a major advantage.) Genome synthesis has many challenges before one could realistically implant an embryo, such as ensuring all the relevant structural features like methylation are correct (which may not have been necessary for earlier more primitive/robust organisms like yeast), and so on, but whatever the challenges for genome synthesis, the ones for IES appear greater. It is entirely possible that IES will develop too slowly and will be obsoleted by genome synthesis in 10-20 years. 
The consequences of genome synthesis would be, if anything, larger than IES because the synthesis technology will be distributed in bulk, will probably continue decreasing in cost due to the commercial applications regardless of human use, and won’t require rare specialized wet lab expertise but, like genome sequencing, will almost certainly become highly automated & ‘push button’.

If IES has been under-discussed and is underrated, genome synthesis has not been discussed at all & is vastly more underrated.

To sum up the timeline: CRISPR & cloning are already available but will remain unimportant indefinitely for various fundamental reasons; multiple embryo selection is useful now but will always be minor; massive multiple embryo selection is some ways off but increasingly inevitable and the gains are large enough on both individual & societal levels to result in a shock; IES will come sometime after massive multiple embryo selection but it’s impossible to say when, although the consequences are potentially global; genome synthesis is a similar level of seriousness, but is much more predictable and can be looked for, very loosely, 2030-2040 (and possibly sooner).

Readers already familiar with the idea of embryo selection may have some common misconceptions which would be good to address up front:

1. IVF is expensive, somewhat dangerous, and may have worse health outcomes than natural childbirth

I agree, but we can consider the case where these issues are irrelevant. It is unclear what the long-run effects of IVF on children may be, other than the harm probably isn’t too great; the literature on IVF suggests that the harms are probably very small and smaller than, for example, paternal age effects, but it’s hard to be sure given that IVF usage is hardly exogenous and good comparison groups for even just correlational analysis are hard to come by. (Natural-born children are clearly not comparable, but neither are natural-born siblings of IVF children—why was their mother able to have one child naturally but needed IVF for the next?) I would not recommend anyone do IVF solely to benefit from embryo selection (as opposed to doing PGD to avoid passing a horrible genetic disease like Huntington’s, where it is impossible for the hypothetical harms of IVF to outweigh the very real harm of that genetic disease). Here I consider the case where parents are already doing IVF, for whatever reason, and so the potential harms are a “sunk cost”: they will happen regardless of the choice to do embryo selection, and can be ignored. This restricts any results to that small subset (~1% of parents in the USA as of 2016), of course, but that subset is the most relevant one at present, is going to grow over time, and could still have important societal effects.

An interesting question would be, at what point does embryo selection become so compelling that would-be parents with a family history of disease (such as schizophrenia) would want to do it? (Because of the nonlinear nature of liability-threshold polygenic traits and relatively rare diseases like schizophrenia, someone with a family history benefits far more than someone with an average risk; see the truncation selection/multiple-trait selection on why this implies that selection against diseases is not as useful as it seems.) What about would-be parents with no particular history? How good does embryo selection need to be for would-be parents who could conceive naturally to be willing to undergo the cost (~$10k even at the cheapest fertility clinics) and health risks (for both mother & child) to benefit from embryo selection? I don’t know, but I suspect “simple embryo selection” is too weak and it will require “massive embryo selection” (see the overview for definitions & comparisons).

2. GWASes don’t work and are false positives and can’t do anything useful for embryo selection because they are false positives/population structure/publication bias/p-hacking/etc…

Some readers overgeneralize the debacle of the candidate-gene literature, which is almost 100% false-positive garbage, to GWASes; but GWASes were designed in response to the failure of candidate-genes, with much more stringent thresholds & large datasets & more population structure correction, and have performed well as datasets reached the necessary sizes. Their PGSes predict increasingly large amounts of variance out-of-sample, the PGSes have high genetic correlations between cohorts/countries/times/measurement methods, and they work within-family between siblings, who by definition have identical ancestries/family backgrounds/SES/etc but have randomized inheritance from their parents. For a more detailed discussion, see the section, “Why Trust GWASes?”.
(While GWASes are indeed highly flawed, those flaws typically work in the direction of inefficiency/reducing their predictive power, not inflating it.)

3. GWASes may be predictive but this is irrelevant because the SNPs in a PGS are merely non-causal variants which proxy for causal variants

Background: in a GWAS, the measured SNPs may cause the outcome or they may merely be located on the genome near a genetic variant which has the causal effect; because genomes are inherited in a ‘chunky’ fashion, a measured SNP may almost always be found alongside the causal genetic variant within a particular population. (Over a long enough timeframe, as organisms reproduce, that part of the genome will be broken up, but this may take centuries or millennia.) Such a SNP is in “linkage disequilibrium” or just LD. Such a scenario is quite common, and may in fact be the case for the overwhelming majority of SNPs in human GWASes. This is both a blessing and a curse for GWASes: it means that easy cheaply-measured SNPs can probe harder-to-find genetic variants, but it also means that the SNPs are not causal themselves. So for example, if one took a list of SNPs from a GWAS, and used CRISPR to edit them, most of the edits would do nothing. This is a serious concern for genetic engineering approaches: just because you have a successful GWAS doesn’t mean you know what to edit!

But is this a problem for embryo selection? No, because you are not engaged in any editing or causal manipulation. You are passively observing and predicting which is the best embryo in a sample. This does not disturb the LD patterns or break any correlations, and the predictions remain valid. Selection doesn’t care what the causal variants are, it cares only that, whatever they are or wherever they are on the genome, the chosen embryo has more of them than the not-chosen embryos. Any proxy will do, as long as it predicts well.
In the long run, changes in LD will gradually reduce the PGS’s predictive power as the SNPs become better/worse proxies, but this is unimportant since there will be many GWASes in between now and then, and one would be upgrading PGSes for other reasons (like their steadily improving predictive power regardless of LD patterns).

4. Embryo selection can’t be useful with PGSes predicting only X% [where X% > state of the art] of individual variance

The mistake here is confusing a statistical measure of error with the goal. Any default summary statistic like R² or RMSE is merely a crutch with tenuous connections to optimal decisions. In embryo selection, the goal is to choose better embryos than average to implant rather than implant random embryos, to get a gain which pays for the costs involved. A PGS only needs to be accurate enough to select a better embryo out of a (typically small) batch. It doesn’t need to be able to predict future, say, IQ, within a point. Estimating the precise future trait value of an embryo may be quite difficult, but it’s much easier to predict which of two embryos will have a higher trait value. (It’s the difference between predicting the winner of a soccer game and predicting the exact final score; the latter does let one do the former, but the former is what one needs and is much easier.)

Once your PGS is good enough to pick the best or near-best embryo, even a far better PGS makes little difference; after all, one can’t do any better than picking the best embryo out of a batch. And due to diminishing returns/tail effects, the larger the batch, the smaller the difference between the best and the 4th-best etc, reducing the regret. (In a batch of 2, there’s not too much difference between a poor and a perfect predictor; and in a batch of 1, there’s none.)
Whether a PGS of X% is adequate cannot be decided in a vacuum; the necessary performance will depend critically on the value of a trait, the cost of embryo selection, the losses in the IVF pipeline, and most importantly of all, the number of embryos in each batch. (The final gain depends the most on the embryo count, a fact lost on most people discussing this topic.) As embryo selection is cheap at the margin, and ranking is easier than regression, this can be done with surprisingly poor PGSes; the bar of profitability is easy to meet, and for embryo selection, has been met for some years now (see the rest of this page for an analysis of the specific case of IQ).

• The genome-wide statistically-significant hits explain far less than X% of individual variance

Statistical-significance thresholds are essentially arbitrary. There is no need to fetishize them: they do not correspond to any posterior probability of a hit being “real”, and they introduce many serious difficulties of interpretation due to power (if a GWAS has a hit on an SNP with an estimated effect size of X, and a second GWAS also estimates it at X but, due to a slightly higher standard error, it is no longer “statistically-significant”, what does that mean, exactly?); and even if they did, the number of false positives has little relationship to the predictive power, much less selection gain of a PGS, much less the final profit of embryo selection. The relevant question is: what are the best predictions which can be made? For human complex traits, the most accurate predictions typically use a PGS based on most of or all measured variants. Anything less is lesser.

5. Selection on traits, especially intelligence, will backfire horribly

It is hypothetically possible for selection on one trait, which happens to be inversely correlated on a genetic level with another important trait, to backfire by increasing the first trait but then doing much more damage by decreasing the second trait.
This occurs occasionally in long-term or intense breeding programs, and has been demonstrated by very carefully-designed experiments such as the famous chicken-crate experiment. However, for humans, such genetic correlations are highly unlikely a priori, as we can simply observe broad patterns like the global correlations of SES/wealth/intelligence/health with all desirable outcomes (“Cheverud’s conjecture”); countless genetic correlations have already been calculated by various methods and are now routinely reported in GWASes, and invariably diseases positively correlate with diseases and good things correlate with other good things. Whatever harmful backfire effects there may be are far outweighed by the beneficial backfire effects, so selection on a single trait, especially intelligence, is not going to incur these speculative hypothetical harms.

If there are any such harms, they can be reduced or eliminated by simply taking into account multiple traits while selecting, and doing multi-trait selection. This is easy to do with the present availability of PGSes on hundreds of traits; given that all the hard work is in the genotyping step, why would one ignore all traits but one and throw away all that data?

In fact, even if there were no possibility of backfire effects, embryo selection would be done with multi-trait selection anyway, simply because it is so easy and the benefits are so compelling: using multiple traits allows for much greater overall gains because two embryos similar or identical on one trait may differ a great deal on another trait, and when traits are genetically correlated, they can serve as proxies for each other, producing effective boosts in predictive power. For all these reasons, most breeding programs use multi-trait selection. For more details and an example of the benefits in embryo selection, see the multiple-selection section.

# Embryo selection cost-effectiveness

“Forty years ago, I could say in the Whole Earth Catalog, ‘we are as gods, we might as well get good at it’…What I’m saying now is we are as gods and have to get good at it.”

In vitro fertilization (IVF) is a medical procedure for infertile women in which eggs are extracted, fertilized with sperm, allowed to develop into an embryo, and the embryo injected into their womb to induce pregnancy. The choice of embryo to implant is usually arbitrary, with some simple screening for gross abnormalities like missing chromosomes or other cellular defects, which would either be fatal to the embryo’s development (so useless & wasteful to implant) or cause birth defects like (so much preferable to implant a healthier embryo).

However, various tests can be run on embryos, including genome sequencing after extracting a few cells from the embryo, which is called preimplantation genetic diagnosis (PGD): genetic information is measured and used to choose which embryo to implant. PGD has historically been used primarily to detect and select against a few rare genetic diseases with single-gene causes like the fatal Huntington’s disease; for a recessive disease, if both parents are carriers, an embryo without the recessive can be chosen, or at least, an embryo which is heterozygous and won’t develop the disease. This is useful for those unlucky enough to have a family history or be known carriers, but while initially controversial, is now merely an obscure & useful part of fertility medicine.

However, with ever cheaper SNP arrays and the advent of large GWASes in the 2010s, large amounts of subtler genetic information become available, and one could check for abnormalities and also start making useful predictions about adult phenotypes: one could choose embryos with higher/lower probability of traits with many known genetic hits such as or intelligence or alcoholism or schizophrenia; thus, in effect, creating with proven technology no more exotic than IVF and 23andMe.
Since such a practice is different in so many ways from traditional PGD, I’ll call it “embryo selection”. Embryo selection has already begun to be used by the most sophisticated cattle breeding programs as an adjunct to their highly successful genomic selection & embryo transfer programs.

What traits might one want to select on? For example, increases in height have been associated with higher income, with estimates like +$800 per inch per year, which combined with polygenic scores predicting a decent fraction of variance, could be valuable.2 But height, or hair color, or other traits are in general zero-sum traits, often easily modified (eg hair dye or contact lenses), and far less important to life outcomes than personality or intelligence, which profoundly influence an enormous range of outcomes ranging from academic success to income to longevity to violence to happiness to altruism (and so increases in which are far from “frivolous”, as some commenters have labeled them); since the personality GWASes have had difficulties (probably due to non-additivity of the relevant genes connected to predicted frequency-dependent selection, see /), that leaves intelligence as the most important case.

Discussions of this possibility have often led to both overheated prophecies of “genius babies” or “super-babies”, and to dismissive scoffing that such methods are either impossible or of trivial value; unfortunately, specific numbers and calculations backing up either view tend to be lacking, even in cases where the effect can be predicted easily from behavioral genetics and shown to be not as large as laymen might expect & consistent with the results (for example, the “genius sperm bank”3).

Shulman & Bostrom 2014 consider the potential of embryo selection for greater intelligence in a little detail, ultimately concluding that in the most applicable current scenario of minimal uptake (restricted largely to those forced into IVF use) and gains of a few IQ points, embryo selection is more of a “curiosity” than a “game-changer”, as it will be “Socially negligible over one generation. Effects of social controversy more important than direct impacts.” Some things are left out of their analysis which I’m interested in:

1. they give the upper bound on the IQ gain that can be expected from a given level of selection & then-current imprecise GCTA heritability estimates, but not the gain that could be expected with updated figures: is it a large or small fraction of that maximum? And they give a general description of what societal effects might be expected from combinations of IQ gains and prevalence, but can we say something more rigorously about that?
2. their level of selection may bear little resemblance to what can be practically obtained given the realities of IVF and high embryo attrition rates (selecting from 1 in 10 embryos may yield x IQ points, but how many real embryos would we need to implement that, since if we extract 10 embryos, 3 might be abnormal, the best candidate might fail to implant, the second-best might result in a miscarriage, etc?)
3. there is no attempt to estimate costs nor whether embryo selection right now is worth the costs, or how much better our selection ability would need to be to make it worthwhile. Are the advantages compelling enough that ordinary parents, who are already using IVF and could use embryo selection at minimal marginal cost, would pay for it and take the practice out of the lab? Under what assumptions could embryo selection be so valuable as to motivate parents without fertility problems into using IVF solely to benefit from embryo selection?
4. if it is not worthwhile because the genetic information is too weakly predictive of adult phenotype, how much additional data would it take to make the predictions good enough to make selection worthwhile?
5. What are the prospects for embryo editing instead of selection, in theory and right now?

## Benefit

### Value of IQ

Shulman & Bostrom 2014 note that

Studies in labor economics typically find that one IQ point corresponds to an increase in wages on the order of 1 per cent, other things equal, though higher estimates are obtained when effects of IQ on educational attainment are included.2 The individual increase in earnings from a genetic intervention can be assessed in the same fashion as prenatal care and similar environmental interventions. One study of efforts to avert low birth weight estimated the value of a 1 per cent increase in earnings for a newborn in the US to be between $2,783 and $13,744, depending on discount rate and future wage growth.4

The given low/high range is based on 2006 data; adjusted for inflation to 2016 dollars (as appropriate due to being compared to 2015/2016 costs), that would be $3270 and $16151. There is much more that can be said on this topic, starting with various measurements of individuals from income to wealth to correlations with occupational prestige, looking at longitudinal & cross-sectional national wealth data, positive externalities & psychological differences (such as increasing cooperativeness, patience, free-market and moderate politics), verification of causality from longitudinal predictiveness, genetic overlap, within-family comparisons, & exogenous shocks positive (iodization & iron) or negative (lead), etc; an incomplete bibliography is provided as an appendix. As polygenic scores & genetically-informed designs are slowly adopted by the social sciences, we can expect more known correlations to be confirmed as causally downstream of genetic intelligence. These downstream effects likely include not just income and education, but behavioral measures as well , notes in the data that a 3 point IQ increase predicts 28% less risk of highschool dropouts, 25% less risk of poverty or being jailed (men), 20% less risk of parentless children, 18% less risk of going on welfare, and 15% less risk of out-of-wedlock births. Anders Sandberg provides a descriptive table (expanded from , itself adapted from ):

##### GCTA-based upper bound on selection gains

Since half of the additive variance will be shared within a family, we get $0.33 / 2 = 0.165$ within-family variance, which gives $\sqrt{0.165} \approx 0.41$ SD or 6.1 IQ points. (Occasionally within-family differences are cited in a format like “siblings have an average difference of 12 IQ points”, which comes from an SD of ~0.7/0.8, since $\sqrt{0.5} \approx 0.71$, but you could also check what SD yields an average difference of 12 via simulation: eg `mean(abs(rnorm(n=1000000, mean=0, sd=0.71) - rnorm(n=1000000, mean=0, sd=0.71))) * 15` ~> 12.018.) We don’t care about means since we’re only looking at gains, so the mean of the within-family normal distribution can be set to 0.

With that, we can write a simulation like Shulman & Bostrom where we generate n samples from that within-family distribution, $\mathcal{N}(0, 0.33/2)$, take the max, and return the difference of the max and mean. There are more efficient ways to compute the expected maximum, however, and so we’ll use a lookup table computed using the lmomco library for small n & an approximation for large n for speed & accuracy; for a discussion of alternative approximations & implementations and why I use this specific combination, see . Qualitatively, the max looks like a logarithmic curve: if we fit a log curve to n=2-300, the curve is (R²=0.98); to adjust for the PGS variance-explained, we convert to SD and adjust by relatedness, so an approximation of the gain from sibling embryo selection would be or . (The logarithm immediately indicates that we must worry about diminishing returns and suggests that to optimize embryo selection, we should look for ways around the log term, like multiple stages which avoid going too far into the log’s tail.)
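
The expected maximum can also be computed directly, as a sanity check on whatever table or approximation is used. A minimal sketch, assuming base R only: `expectedMaxIntegral` is a hypothetical helper (not part of the code on this page) which numerically integrates $x \cdot n \cdot \phi(x) \cdot \Phi(x)^{n-1}$, the density-weighted value of the maximum of n standard normals, and then refits a log curve over the same n=2-300 range:

```r
## Hypothetical helper: expected maximum of n iid N(0,1) draws, by numerically
## integrating x times the density of the sample maximum, n*dnorm(x)*pnorm(x)^(n-1).
## The finite bounds (-10, 10) are an assumption; the mass outside them is negligible here.
expectedMaxIntegral <- function(n) {
    integrate(function(x) { x * n * dnorm(x) * pnorm(x)^(n-1) },
              lower=-10, upper=10, rel.tol=1e-8)$value }

ns   <- 2:300
maxs <- sapply(ns, expectedMaxIntegral)

## ordinary least-squares fit of the expected maximum against log(n):
logFit <- lm(maxs ~ log(ns))
coef(logFit)               # intercept & slope of the log curve
summary(logFit)$r.squared  # goodness of fit
```

The fitted intercept & slope can then be scaled by the within-family SD (and by 15 for IQ points) to get a quick closed-form approximation of the selection gain.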

For generality to other continuous normally distributed complex traits, we’ll work in standardized units rather than the IQ scale (SD=15), but convert back to points for easier reading:

exactMax <- Vectorize(function (n, mean=0, sd=1) {
    if (n>2000) { ## avoid lmomco bugs at higher _n_, where the approximations are near-exact anyway
        chen1999 <- function(n,mean=0,sd=1){ mean + qnorm(0.5264^(1/n), sd=sd) }
        chen1999(n,mean=mean,sd=sd) } else {
        if(n>200) { library(lmomco) ## n=201-2000: compute the order statistic numerically
            exactMax_unmemoized <- function(n, mean=0, sd=1) {
                expect.max.ostat(n, para=vec2par(c(mean, sd), type="nor"), cdf=cdfnor, pdf=pdfnor) }
            exactMax_unmemoized(n,mean=mean,sd=sd) } else { ## n=0-200: precomputed table, indexed by n+1

lookup <- c(0,0,0.5641895835,0.8462843753,1.0293753730,1.1629644736,1.2672063606,1.3521783756,1.4236003060,
1.4850131622,1.5387527308,1.5864363519,1.6292276399,1.6679901770,1.7033815541,1.7359134449,1.7659913931,
1.7939419809,1.8200318790,1.8444815116,1.8674750598,1.8891679149,1.9096923217,1.9291617116,1.9476740742,
1.9653146098,1.9821578398,1.9982693020,2.0137069241,2.0285221460,2.0427608442,2.0564640976,2.0696688279,
2.0824083360,2.0947127558,2.1066094396,2.1181232867,2.1292770254,2.1400914552,2.1505856577,2.1607771781,
2.1706821847,2.1803156075,2.1896912604,2.1988219487,2.2077195639,2.2163951679,2.2248590675,2.2331208808,
2.2411895970,2.2490736293,2.2567808626,2.2643186963,2.2716940833,2.2789135645,2.2859833005,2.2929091006,
2.2996964480,2.3063505243,2.3128762306,2.3192782072,2.3255608518,2.3317283357,2.3377846191,2.3437334651,
2.3495784520,2.3553229856,2.3609703096,2.3665235160,2.3719855541,2.3773592389,2.3826472594,2.3878521858,
2.3929764763,2.3980224835,2.4029924601,2.4078885649,2.4127128675,2.4174673530,2.4221539270,2.4267744193,
2.4313305880,2.4358241231,2.4402566500,2.4446297329,2.4489448774,2.4532035335,2.4574070986,2.4615569196,
2.4656542955,2.4697004768,2.4736966781,2.4776440650,2.4815437655,2.4853968699,2.4892044318,2.4929674704,
2.4966869713,2.5003638885,2.5039991455,2.5075936364,2.5111482275,2.5146637581,2.5181410417,2.5215808672,
2.5249839996,2.5283511812,2.5316831323,2.5349805521,2.5382441196,2.5414744943,2.5446723168,2.5478382097,
2.5509727783,2.5540766110,2.5571502801,2.5601943423,2.5632093392,2.5661957981,2.5691542321,2.5720851410,
2.5749890115,2.5778663175,2.5807175211,2.5835430725,2.5863434103,2.5891189625,2.5918701463,2.5945973686,
2.5973010263,2.5999815069,2.6026391883,2.6052744395,2.6078876209,2.6104790841,2.6130491728,2.6155982225,
2.6181265612,2.6206345093,2.6231223799,2.6255904791,2.6280391062,2.6304685538,2.6328791081,2.6352710490,
2.6376446504,2.6400001801,2.6423379005,2.6446580681,2.6469609341,2.6492467445,2.6515157401,2.6537681566,
2.6560042252,2.6582241720,2.6604282187,2.6626165826,2.6647894763,2.6669471086,2.6690896839,2.6712174028,
2.6733304616,2.6754290533,2.6775133667,2.6795835873,2.6816398969,2.6836824739,2.6857114935,2.6877271274,
2.6897295441,2.6917189092,2.6936953850,2.6956591311,2.6976103040,2.6995490574,2.7014755424,2.7033899072,
2.7052922974,2.7071828562,2.7090617242,2.7109290393,2.7127849375,2.7146295520,2.7164630139,2.7182854522,
2.7200969934,2.7218977622,2.7236878809,2.7254674700,2.7272366478,2.7289955308,2.7307442335,2.7324828686,
2.7342115470,2.7359303775,2.7376394676,2.7393389228,2.7410288469,2.7427093423,2.7443805094,2.7460424475)

return(mean + sd*lookup[n+1]) }}})
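
To see how close the large-n branch of such a function gets, one can compare the closed-form approximation against direct numerical integration. This is a sketch under stated assumptions: `expectedMaxIntegral` is a hypothetical helper defined here purely for the comparison, and the chosen n values are arbitrary illustrations.

```r
## Hypothetical helper: numerical expected maximum of n iid N(0,1) draws
## (finite integration bounds are an assumption; the omitted tails are negligible).
expectedMaxIntegral <- function(n) {
    integrate(function(x) { x * n * dnorm(x) * pnorm(x)^(n-1) },
              lower=-10, upper=10, rel.tol=1e-8)$value }

## The closed-form approximation used above for n>2000:
chen1999 <- function(n, mean=0, sd=1) { mean + qnorm(0.5264^(1/n), sd=sd) }

ns <- c(10, 100, 1000, 2000, 5000)
comparison <- data.frame(n=ns,
                         integral=sapply(ns, expectedMaxIntegral),
                         approximation=chen1999(ns))
## signed error of the approximation, in SDs:
comparison$error <- comparison$approximation - comparison$integral
comparison
```

Multiplying the `error` column by the within-family SD and by 15 shows how much of that discrepancy would survive into an IQ-point estimate.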

The normal curve has thin tails and so the increase in the maximum with each additional sample diminishes rapidly:

## show the locations of expected maxima/minima, demonstrating diminishing returns/thin tails:
x <- seq(-3, 3, length=1000)
y <- dnorm(x, mean=0, sd=1)
extremes <- unlist(Map(exactMax, 1:100))
plot(x, y, type="l", lwd=2,
xlab="SDs", ylab="Normal density", main="Expected maximum/minimums for Gaussian samples of size n=1-100")
abline(v=c(extremes, extremes*-1), col=rep(c("black","gray"), 200))

It is worth noting that the maximum is sensitive to variance, as it increases multiplicatively with the square root of variance/the standard deviation, while on the other hand, the mean is only additive. So an increase of 20% in the standard deviation means an increase of 20% in the maximum, but an increase of +1SD in the mean is merely a fixed additive increase, with the difference growing with total n. For example, in maximizing the maximum of even just n=10, it would be much better (by +0.5SD) to double the SD from 1SD to 2SD than to increase the mean by +1SD:

exactMax(10, mean=0, sd=1)
# [1] 1.53875273
exactMax(10, mean=1, sd=1)
# [1] 2.53875273
exactMax(10, mean=0, sd=2)
# [1] 3.07750546
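
The multiplicative-vs-additive point can be written out explicitly. As a sketch in notation introduced only for this note (the symbol $e_n$, for the expected maximum of n standard normal draws, and the multiplier $k$ on the SD are assumptions, not something defined elsewhere on this page):

$$
\begin{aligned}
E[\max{}_n] &= \mu + \sigma \, e_n \\
\mu + \Delta\mu + \sigma \, e_n &= \mu + k \sigma \, e_n
  \quad\Longrightarrow\quad \Delta\mu = (k - 1) \, \sigma \, e_n \\
n = 10,\ \sigma = 1,\ e_{10} \approx 1.539: \quad
  k = 2 &\Rightarrow \Delta\mu \approx 1.54, \qquad
  k = 1.25 \Rightarrow \Delta\mu \approx 0.38
\end{aligned}
$$

So doubling the SD at n=10 is worth a mean shift of ~1.54SD, which beats a +1SD mean increase by ~0.54SD, and since $e_n$ grows with n, the break-even mean shift grows with n as well.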

One way to visualize it is to ask how large a mean increase is required to have the same expected maximum as that of various increases in variance:

library(ggplot2)
library(gridExtra)

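compareDistributions <- function(n=10, varianceMultiplier=2) {
    ## NOTE: 'varianceMultiplier' is applied to the SD (sd=varianceMultiplier), matching the SD doubling above
    baselineMax <- exactMax(n, mean=0, sd=1)
    increasedVarianceMax <- exactMax(n, mean=0, sd=varianceMultiplier)

    width <- increasedVarianceMax*1.2
    ## baseline distribution, shifted right by the mean increase needed to equalize the expected maxima:
    x1 <- seq(-width, width, length=1000)
    y1 <- dnorm(x1, mean=increasedVarianceMax - baselineMax, sd=1)

    ## the more variable distribution, with unshifted mean:
    x2 <- seq(-width, width, length=1000)
    y2 <- dnorm(x2, mean=0, sd=varianceMultiplier)

    df <- data.frame(X=c(x1, x2), Y=c(y1, y2), Distribution=c(rep("baseline", 1000), rep("variable", 1000)))

    return(qplot(X, Y, color=Distribution, data=df) +
        geom_vline(xintercept=increasedVarianceMax, color="blue") +
        ggtitle(paste0("Variance Increase: ", varianceMultiplier, "x (Difference: +",
            round(increasedVarianceMax - baselineMax, digits=2), "SD)")) +
        geom_text(aes(x=increasedVarianceMax*1.01, label=paste0("expected maximum (n=", n, ")"),
            y=0.3), colour="blue", angle=270))
}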
p0 <- compareDistributions(varianceMultiplier=1.25) +
ggtitle("Mean increase required to have equal expected maximum as a more\
variable distribution\nVariance increase: 1.25x (Difference: +0.38SD)")
p1 <- compareDistributions(varianceMultiplier=1.50)
p2 <- compareDistributions(varianceMultiplier=1.75)
p3 <- compareDistributions(varianceMultiplier=2.00)
p4 <- compareDistributions(varianceMultiplier=3.00)
p5 <- compareDistributions(varianceMultiplier=4.00)
p6 <- compareDistributions(varianceMultiplier=5.00)
grid.arrange(p0, p1, p2, p3, p4, p5, p6, ncol=1)

Note the visible difference in tail densities implies that the advantage of increased variance increases the further out on the tail one is selecting from (higher n); I’ve made additional graphs for more extreme scenarios (n=100, n=1000, n=10000), and created an .

Applying the order statistics code to the specific case of embryo selection on full siblings:

## select 1 out of N embryos (default: siblings, who are half-related)
embryoSelection <- function(n, variance=1/3, relatedness=1/2) {
exactMax(n, mean=0, sd=sqrt(variance*relatedness)); }
embryoSelection(n=10) * 15
# [1] 9.422897577
embryoSelection(n=10, variance=0.444) * 15
# [1] 10.87518323
embryoSelection(n=5, variance=0.444) * 15
# [1] 8.219287927

So 1 out of 10 gives a maximal average gain of ~9 IQ points, less than Shulman & Bostrom’s 11.5 because of my lower GCTA estimate, but using better IQ tests like the WAIS, we could go as high as ~11 points. With a more realistic number of embryos, we might get 8 points.
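
The order-statistic figure can be cross-checked by brute force. A minimal Monte Carlo sketch, assuming the same defaults as above (PGS variance 1/3, sibling relatedness 1/2); `simulateGain` is a hypothetical name, and the seed & iteration count are arbitrary choices:

```r
set.seed(2016)
## Hypothetical helper: draw 'iters' batches of n sibling embryos from the
## within-family distribution, keep the best score in each batch, and
## average the gain over the batch mean of 0, converted to IQ points (SD=15).
simulateGain <- function(n, variance=1/3, relatedness=1/2, iters=100000) {
    sdWithin <- sqrt(variance * relatedness)
    scores   <- matrix(rnorm(n * iters, mean=0, sd=sdWithin), ncol=n)
    mean(apply(scores, 1, max)) * 15 }

simulatedGain <- simulateGain(10)
simulatedGain
```

Up to Monte Carlo error, the result should land close to the analytic ~9.4 points for 1 out of 10; rerunning with `variance=0.444` or `n=5` checks the other two figures the same way.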

For comparison, the full genetic heritability of accurately-measured adult IQ (going far beyond just SNPs or additive effects to include mutation load & de novo mutations, copy-number variation, modeling of interactions etc) is generally estimated at ~0.8, in which case the upper bound on selection out of 10 embryos would be ~14.6 IQ points:

embryoSelection(n=10, variance=0.8) * 15
# [1] 14.59789016

For intuition, an animation:

library(MASS)
library(ggplot2)
plotSelection <- function(n, variance, relatedness=1/2) {
    r = sqrt(variance*relatedness)

    ## column 1 = true trait, column 2 = PGS; empirical=TRUE forces the sample correlation to be exactly r
    data = mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
    df <- data.frame(Trait=data[,1], PGS=data[,2], Selected=max(data[,2]) == data[,2])

    trueMax <- max(df$Trait)
    selected <- df[df$Selected,]$Trait
    regret <- trueMax - selected

    return(qplot(PGS, Trait, color=Selected, size=I(9), data=df) +
        coord_cartesian(ylim = c(-2.5,2.5), xlim=c(-2.5,2.5), expand=FALSE) +
        geom_hline(yintercept=0, color="red") +
        labs(title=paste0("Selection hypothetical (higher=better): with n=", n, " samples & PGS variance=",
            round(variance,digits=2), ". Performance: true max: ", round(trueMax, digits=2),
            "; selected: ", round(selected, digits=2), "; regret: ", round(regret, digits=2))) )
}
library(animation)
saveGIF(
    for (i in 1:100) {
        n <- max(3, round(rnorm(1, mean=6, sd=3)))
        pgs <- runif(1, min=0, max=0.5)
        p <- plotSelection(n, pgs)
        print(p)
    },
    interval=0.8, ani.width = 1000, ani.height=800,
    movie.name = "embryo-selection.gif")

It is often claimed that a ‘small’ r correlation or predictive power is, a priori, of no use for any practical purposes; this is incorrect, as the value of any particular r is inherently context & decision-specific: a small r can be highly valuable for one decision problem, and a large r could be useless for another, depending on the use, the costs, and the benefits. Ranking is easier than prediction; accurate prediction implies accurate ranking, but not vice-versa: one can have an accurate comparison of two datapoints while the estimate of each one’s absolute value is highly noisy. One way to think of it is to note that Pearson’s r correlation can be converted to Spearman’s rank correlation, and for normal variables like this, they are near-identical; so a PGS of 10% variance or r=0.31 means that every SD increase in PGS is equivalent to 0.31 SD increases in rank.
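
The claim that ranking is easier than prediction can be made concrete for the simplest case of two datapoints. A minimal sketch, assuming a generic bivariate normal with r=√0.10 (it ignores the within-family halving used elsewhere on this page): for two independent draws, the probability that the noisy score orders them the same way as the true trait is 1/2 + asin(r)/π, and the Spearman rank correlation is (6/π)·asin(r/2); the simulation below checks both. The variable names, seed, & sample size are illustrative choices.

```r
library(MASS)
set.seed(2016)
r     <- sqrt(0.10)   ## PGS explaining 10% of variance
iters <- 200000
Sigma <- matrix(c(1, r, r, 1), nrow=2)

## two independent sets of (trait, PGS) pairs: column 1 = true trait, column 2 = PGS
a <- mvrnorm(n=iters, mu=c(0, 0), Sigma=Sigma)
b <- mvrnorm(n=iters, mu=c(0, 0), Sigma=Sigma)

## how often does the PGS rank a pair in the same order as the true trait?
pCorrect <- mean((a[,2] > b[,2]) == (a[,1] > b[,1]))
analytic <- 1/2 + asin(r)/pi

## Pearson vs Spearman on the same simulated sample:
pearson  <- cor(a[,1], a[,2])
spearman <- cor(a[,1], a[,2], method="spearman")

c(simulated=pCorrect, analytic=analytic, pearson=pearson, spearman=spearman)
```

Even with a PGS whose absolute predictions are very noisy, the pairwise ordering is right noticeably more often than the coin-flip 50%, and that edge compounds when choosing among more than two embryos.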
In particular, it has long been noted in industrial psychology & psychometrics that a tiny r²/r bivariate correlation between a test and a latent variable can considerably enhance the probability of selecting datapoints passing a given threshold, and this is increasingly true the more stringent the threshold (tail effects again!); this also applies to embryo selection, since we can define a threshold as being set at the best of n embryos.

This helps explain why the PGS’s power is not as overwhelmingly important to embryo selection as one might initially expect; certainly, you do need a decent PGS, but it is only a starting point & one of several variables, and experiences diminishing returns, rendering it not necessarily as important a parameter as the more obscure “number of embryos” parameter. A metaphor here might be that of biasing some dice to try to roll a high score: while initially making the dice more loaded does help increase your total score, the gain quickly shrinks compared to being able to add a few more dice to be rolled.

Some order statistics plots can help visualize the three-way relationship between probability of optimal selection/regret, number of embryos, and PGS variance:

    ## consider first column as true latent genetic scores, & the second column as noisy measurements correlated at _r_:
    library(MASS)
    generateCorrelatedNormals <- function(n, r) { mvrnorm(n=n, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2)) }
    ## consider plausible scenarios for IQ-related non-massive simple embryo selection, so 2-50 embryos;
    ## and PGSes must max out by 80%:
    scenarios <- expand.grid(Embryo.n=2:50, PGS.variance=seq(0.01, 0.80, by=0.04), Rank.mean=NA, P.max=NA,
                             P.min=NA, P.below.mean=NA, P.minus.two=NA, Regret.SD=NA)
    for (i in 1:nrow(scenarios)) {
      n = scenarios[i,]$Embryo.n
      r = sqrt(scenarios[i,]$PGS.variance)
      iters = 500000
      sampleStatistics <- function(n,r) {
        sim <- generateCorrelatedNormals(n, r=r)
        max1_l <- max(sim[,1])
        # max2_m <- max(sim[,2])
        max1i_l <- which.max(sim[,1])
        max2i_m <- which.max(sim[,2])
        gain <- sim[,1][max2i_m]
        ## rank (1=lowest, n=highest) of the true best's measurement among all measurements; by the symmetry of
        ## the bivariate normal, this has the same distribution as the true rank of the selected embryo:
        rank <- which(sim[,2][max1i_l] == sort(sim[,2]))
        ## P(max): if the max of the noisy measurements is a different index than the max or min of the true latents,
        ## then embryo selection fails to select the best/maximum or selects the worst.
        ## If n=1, trivially P.max/P.min=1 & Regret=0; if r=0, P.max/P.min = 1/n;
        ## if r=1, P.max=1 & P.min=0; r=0-1 can be estimated by simulation:
        P.max <- max2i_m == max1i_l
        ## P(min): if our noisy measurement led us to select the worst point rather than best:
        P.min <- which.min(sim[,1]) == max2i_m
        ## P(<avg): whether we managed to at least boost above mean of 0
        P.below.mean <- gain < 0
        ## P(IQ(70)): whether the point falls below -2SDs
        P.minus.two <- gain <= -2
        ## Regret is the difference between the true latent's maximum, and the true score
        ## for the index with the maximum of the noisy measurements, which if a different index,
        ## means a loss and thus non-zero regret.
        ## If r=0, regret = max/the n_k order statistic; r=1, regret=0; in between, simulation:
        Regret.SD <- max1_l - gain
        return(c(P.max, P.min, P.below.mean, P.minus.two, Regret.SD, rank))
      }
      sampleAverages <- colMeans(t(replicate(iters, sampleStatistics(n,r))))
      # print(c(n,r,sampleAverages))
      scenarios[i,]$P.max        <- sampleAverages[1]
      scenarios[i,]$P.min        <- sampleAverages[2]
      scenarios[i,]$P.below.mean <- sampleAverages[3]
      scenarios[i,]$P.minus.two  <- sampleAverages[4]
      scenarios[i,]$Regret.SD    <- sampleAverages[5]
      scenarios[i,]$Rank.mean    <- sampleAverages[6]
    }
    library(ggplot2); library(gridExtra)
    p0 <- qplot(Embryo.n, Rank.mean, color=as.ordered(PGS.variance), data=scenarios) +
        theme(legend.title=element_blank()) + geom_abline(slope=1, intercept=0) +
        ggtitle("Expected true rank after selecting best out of N embryos based on PGS score (idealized, excluding IVF losses)")
    p1 <- qplot(Embryo.n, P.max, color=as.ordered(PGS.variance), data=scenarios) +
        coord_cartesian(ylim = c(0,0.84)) + theme(legend.title=element_blank()) +
        ggtitle("Probability of selecting best out of N embryos as function of PGS score (*)")
    p2 <- qplot(Embryo.n, P.min, color=as.ordered(PGS.variance), data=scenarios) +
        coord_cartesian(ylim = c(0,0.48)) + theme(legend.title=element_blank()) +
        ggtitle("Probability of mistakenly selecting worst out of N embryos (*)")
    p3 <- qplot(Embryo.n, P.below.mean, color=as.ordered(PGS.variance), data=scenarios) +
        coord_cartesian(ylim = c(0,0.5)) + theme(legend.title=element_blank()) +
        ggtitle("Probability of mistakenly selecting below-average out of N embryos (*)")
    p4 <- qplot(Embryo.n, P.minus.two, color=as.ordered(PGS.variance), data=scenarios) +
        coord_cartesian(ylim = c(0,0.02)) + theme(legend.title=element_blank()) +
        ggtitle("Probability of selecting below -2SDs out of N embryos (*)")
    p5 <- qplot(Embryo.n, Regret.SD, color=as.ordered(PGS.variance), data=scenarios) +
        theme(legend.title=element_blank()) +
        ggtitle("Loss from non-omniscient selection from N embryos in SDs (*)")
    grid.arrange(p0, p1, p2, p3, p4, p5, ncol=1)

##### Polygenic scores

> “‘Should we trust models or observations?’ In reply we note that if we had observations of the future, we obviously would trust them more than models, but unfortunately observations of the future are not available at this time.”
>
> Knutson & Tuleya 2005, “Reply”

A SNP-based polygenic score works much the same way: it explains a certain fraction or percentage of the variance, halved due to siblings, and can be plugged in once we know how
much less than 0.33 it is. An example of using SNP polygenic scores to identify genetic influences and verify they work within-family and are not confounded would be Domingue et al 2015.

Past polygenic scores for intelligence:

1. 0.5776% of fluid intelligence in the NCNG replication sample, if I’ve understood their analysis correctly.
2. This landmark study providing the first GWAS hits on intelligence also estimated multiple polygenic scores: the full polygenic scores predicted 2% of variance in education, and 2.58% of variance in cognitive function (Swedish enlistment cognitive test battery) in a Swedish replication sample, and also performed well in within-family settings (0.31% & 0.19% & 0.41/0.76% of variance in attending college & years of education & test battery, respectively).
    • Rietveld et al 2014a: Replication of the 3 Rietveld et al 2013 SNP hits, followed by replication of the PGS in STR & QIMR (non-family-based), then a within-family sibling comparison using Framingham Heart Study (FHS). The 3 hits replicated; the EDU PGS (in the 20 PC model which most closely corresponds to Rietveld et al 2013’s GWAS) predicted 0.0265/0.0069, the college PGS 0.0278/0.0186; and the overall FHS EDU PGS was 0.0140 and the within-family sibling comparison was 0.0036.
3. Benyamin et al 2014: 0.5%, 1.2%, 3.5% (3 cohorts; no within-family sibling test).
4. 0.55% (maximum in sub-samples: 0.7%)
5. Predicts “0.2% to 0.4%” of variance in cognitive performance & education using a small polygenic score of 69 SNPs; no full PGS is reported. Also tests the small PGS in both across-family and within-family between-sibling settings, reported in the supplement; no pooled result, but by cohort (GS/MCTFR/QIMR/STR): 0.0023/0.0022/0.0041/0.0044 vs 0.0007/0.0007/0.0002/0.0015.
6. English/mathematics grades: 0.7%/0.16%. Based on Rietveld et al 2013.
7. Education/school grades in NTR; based on Rietveld et al 2013.
    Educational achievement, Arithmetic: 0.012/0.021; Language: 0.021/0.028; Study Skills: 0.016/0.017; Science/Social Studies: 0.006/0.013; Total Score: 0.024/0.022. School grades, Arithmetic: 0.025/0.027; Language: 0.033/0.025; Reading: 0.031/0.042. de Zeeuw et al 2014 appears to report within-family comparisons using fraternal twins/siblings, rather than a general population PGS performance. (“For each analysis, the predictor and the outcome measure were standardized within each subset of children with data available on both. To correct for dependency of the observations due to family clustering an additive genetic variance component was included as a random effect based on the family pedigree and dependent on zygosity.”)
8. Education in FHS/HRS, general population: 0.02/0.03 (Table 4, column 2). Within-family in FHS, 0.0124 (Table 6, column 1/3).
9. Education & verbal intelligence in National Longitudinal Study of Adolescent to Adult Health (ADD Health). Education, general population: 0.06/0.02; within-family between-sibling, 0.06? (Table 3). Verbal intelligence, general population: 0.0225/0.0196 (Table 2); within-family between-sibling, 0.0049?
10. 1.2%, intelligence.
11. Meta-analysis of n=5000 COGENT cohorts, using extracted factor; 0.40-0.45% PGS for intelligence in the MGS/GAIN cohort.
12. Supplementary Table 4d reports predictive validity of the educational attainment polygenic score for childhood-cognitive-ability/college-degree/years-of-education in its other samples, yielding R2=0.0042/0.0214/0.0223 or 0.42%/2.14%/2.23% respectively. Particularly intriguing given its investigation of pleiotropy is Supplementary Table 5, which uses polygenic scores constructed for all the diseases in its data (eg type 2 diabetes, ADHD, schizophrenia, coronary artery disease), where all the disease scores & covariates are entered into the model and then the cognitive polygenic scores are able to predict even higher, as high as R2=0.063/0.046/0.064.
13. The Biobank polygenic score constructed for “verbal-numerical reasoning” predicted 0.98%/1.32% of g/gf scores in Generation Scotland, and 2.79% in Lothian Birth Cohort 1936.
14. Ibrahim-Verbaas et al 2016: does not report a polygenic score.
15. In 2016, a consortium combined the SSGAC dataset with UK Biobank, expanding the combined dataset to n>300,000 and yielding a total of 162 education hits; the results were reported in two papers, the latter giving the polygenic scores. The polygenic score predicts 3.5% of intelligence, 7% of family SES, and 9% of education in a heldout sample. Education is predicted in a within-family between-sibling comparison as well, with betas of 0.215 vs 0.625, R2s not provided (“Extended Data Figure 3” in Okbay paper; section “2.6. Significance of the Polygenic Scores in a WF regression” in first Okbay supplement; “Supplementary Table 2.2” in second Okbay supplement). The Okbay et al 2016 PGS has been used in a number of studies, including one which reports r=0.18 or 3.24% variance in 4 samples (UKBB/Dunedin Study/Brain Genomics Superstruct Project (GSP)/Duke Neurogenetics Study (DNS)).
16. Kong et al 2017b (supplement): Education: general population, 4.98%. See also the Willoughby & Lee 2017 abstract.
17. Trampush et al 2017: no polygenic score reported.
18. 336 SNPs, and ~3% on average in the held-out samples, peaking at 4.8%. (Thus it likely doesn’t outperform Okbay/Selzam et al 2016, but does demonstrate the sample-efficiency of good IQ measurements.)
19. Education using Okbay et al 2016 in the Brisbane Adolescent Twin Study on Queensland Core Skills Test (QCST), within-family between-sibling comparison: beta=0.15 (so 0.0225?).
20. Kaminski et al 2018: Odd analytic choices aside (why interactions rather than a mediation model?), they provide replications of Benyamin et al 2014 and Sniekers et al 2017 in IMAGEN; both are highly similar: 0.33% and 3.2% (1.64-5.43%).
21. Zabaneh et al 2017: 1.6%/2.4% of intelligence.
    Like Spain et al 2016, this uses the TIP high-IQ sample in a liability-threshold/dichotomous/case-control approach, but the polygenic score is computed on the heldout normal IQ scores from the TEDS twin sample so it is equivalent to the other polygenic scores in predicting population intelligence; they estimated it on a 4-test IQ score and a 16-test IQ score (the latter being more reliable), respectively. Despite the sample-efficiency gains from using high-quality IQ tests in TIP/TEDS and the high-IQ enrichment, the TIP sample size (n=1238) is not enough to surpass the Selzam et al 2016 polygenic score (based on education proxies from 242x more people).
22. Krapohl et al 2017: 10.9% education / 4.8% intelligence; this is interesting methodologically because it exploits (chosen using informative priors in the form of , although unfortunately only on the PGS level and not the SNP level) to increase the original PGS by ~1.1% from 3.6% to 4.8%.
23. Like Krapohl et al 2017, use of multiple genetic correlations to overcome measurement error greatly boosts efficiency of IQ GWAS, and provides the best public polygenic score to date: 7% of variance in a held-out Generation Scotland sample. This illustrates a good way to work around the shortage of high-quality IQ test scores by exploiting multiple more easily-measured phenotypes.
24. An extension of Hill et al 2017 increasing the effective sample size considerably; the UKBB sample for testing the polygenic score, using the short multiple-choice test, gives ~6% variance explained. (The lower PGS despite the larger sample & hits may be due to the use of a different sample with a worse IQ measure.)
25. 5.4%.
26. Lello et al 2017: Demonstrates Hsu’s lasso on height, heel bone density, and years of education in UKBB, recovering 40% (ie almost the entire SNP heritability), 20%, and 9% respectively; given the rg with intelligence and Krapohl et al 2017’s 10.9% education PGS converting to 4.8% intelligence, Lello et al 2017’s education PGS presumably also performs ~4.5% on intelligence. This is worse than Hill et al 2017, but it is important in proving Hsu’s claims about the efficacy of the lasso: the implication is that around n>1m (depending on measurement quality), the intelligence PGS will undergo a similar jump in power. Given the rapidly expanding datasets available to UKBB and SSGAC, and combined with MTAG and other refinements, it is likely that the best intelligence PGS will jump from Hill’s 7% to 25-30% sometime 2018-2019.
27. Another genetic-correlation boosting paper; the fluid-intelligence-boosted PGS appears to be still minor, ~3% variance.
28. 4.3%. (Followup/expansion of the preprint version.)
29. The long-awaited SSGAC EA3 paper (mentioned in a review), which constructs a PGS predicting 11-13% variance education, 7-10% IQ, along with extensive additional analyses including 4 within-family tests of causal power of the education PGS (“we estimate that within-family effect sizes are roughly 40% smaller than GWAS effect sizes and that our assortative-mating adjustment explains at most one third of this deflation. (For comparison, when we apply the same method to height, we found that the assortative-mating adjustment fully explains the deflation of the within-family effects.)… The source of bias conjectured here operates by amplifying a true underlying genetic effect and hence would not lead to false discoveries. However, the environmental amplification implies that we should usually expect GWAS coefficients to provide exaggerated estimates of the magnitude of causal effects.”)
30. Barth et al 2018: Replication of Lee et al 2018: in their heldout HRS sample, the PGS predicted 10.6% variance in EDU (after removing HRS from the Lee et al 2018 PGS).
31. Rustichini et al 2018: Replicates Lee et al 2018 PGS between-parents & between-siblings in the Minnesota Twin Family Study (MTFS), predicting 9% variance IQ in both samples.
32. Allegrini et al 2018: 11% IQ/16% EDU. Lee et al 2018’s PGS was used in the TEDS cohort, and the PGS’s power was boosted by use of MTAG/GSEM and by looking at scores from older ages (possibly benefiting from the Wilson effect).
33. de la Fuente et al 2019: UKBB reanalysis (n=11,263–331,679), 3.96% genetic g (not IQ), plus PGSes for individual tests (Supplement table S7); the focus here is on using GSEM not to predict some lump-sum proxy for intelligence like EDU or total score, but to factor-model the available tests as being influenced by the latent g intelligence factor and also test-specific subfactors. This is the true structure of the data, and benefits from the genetic correlations without settling for the lowest common denominator. This makes subtests much more predictable:

    > Consistent with the Genomic SEM findings that individual cognitive outcomes are associated with a combination of genetic g and specific genetic factors, we observed a pattern in which many of the regression models that included both the polygenic score (PGS) from genetic g and test-specific PGSs were considerably more predictive of the cognitive phenotypes in Generation Scotland than regression models that included only either a genetic g PGS or a PGS for a single test. A particularly relevant exception involved the Digit Symbol Substitution test in Generation Scotland, which is a similar test to the Symbol Digit Substitution test in UK Biobank, for which we derived a PGS.
    > We found that the proportional increase in R2 in Digit Symbol by the Symbol Digit PGS beyond the genetic g PGS was <1%, whereas the genetic g PGS improved polygenic prediction beyond the Symbol Digit PGS by over 100%, reflecting the power advantage obtained from integrating GWAS data from multiple genetically correlated cognitive traits using a genetic g model. An interesting counterpoint is the PGS for the VNR test, which is unique in the UK Biobank cognitive test battery in indexing verbal knowledge (24,31). Highlighting the role of domain-specific factors, a regression model that included this PGS and the genetic g PGS provided substantial incremental prediction relative to the genetic g PGS alone for those Generation Scotland phenotypes most directly related to verbal knowledge: Mill Hill Vocabulary (62.45% increase) and Educational Attainment (72.59%).

###### GWAS improvements

These results only scratch the surface of what is possible. In some ways, current GWASes for intelligence are the worst methods that could work, as their many flaws in population, data measurement, analysis, and interpretation reduce their power; some of the most relevant flaws for intelligence GWASes would be:

• population: cohorts are designed for ethnic homogeneity to avoid questions about confounds, though cross-ethnic GWASes (particularly ones including admixed subjects) would be better able to locate causal SNPs by intersecting hits between different LD patterns
• data:
    • misguided legal & “medical ethics” & privacy considerations impede sharing of individual-level data, leading to lower-powered techniques (such as LD score regression or random-effects meta-analysis) being necessary to pool results across cohorts, which methods themselves often bring in additional losses (such as not using multilevel models to pool/shrink the meta-analytic estimates)
    • existing GWASes sequence limited amounts of SNPs rather than whole genomes
    • imputation is often not used or is done based on
      relatively small & old datasets like 1000 Genomes, though it would assist the SNP data in capturing rarer variants
    • full-scale IQ tests taken over multiple days by a professional are typically not used, and the hierarchical nature of intelligence & cognitive ability is entirely ignored, making for SNP effects reflecting a mish-mash average effect
    • genetic correlations are not employed to correct for the large amounts of trait measurement error or tap into shared causal pathways
    • education is the usual measured phenotype despite not being that great a measure of intelligence, and even the education measurements are rife with measurement error (eg using “years of education”, as if every year of education were equally difficult, every school equally challenging, every major equally g-loaded, or every degree equal)
    • functional data, such as gene expression, is not used to boost the prior probability of relevant variants
• analysis:
    • principal components & LDSC & other methods employed to control for population structure may be highly conservative/biased & potentially reduce GWAS hits as much as 20% (Yengo et al 2018)
    • for computational efficiency, SNPs are often regressed one at a time rather than simultaneously, increasing variance entirely unnecessarily as even the variance explained by already-found SNPs remains (see eg Loh et al 2018)
    • no attempts are made at including covariates like age or childhood environment which will affect intelligence scores
    • interactions are not included in the linear models
    • genetic correlations/covariances and factorial structure are typically not modeled even when the traits in question are best treated as structural equation models, limiting both power and possible inferences (but see the recently-introduced GSEM, demonstrated on factor analysis of genetic g)
    • the linear models are also highly unrealistic & weak by using flat priors on SNP effect sizes while not using informative priors/multilevel
      pooling/shrinkage/variable selection techniques which could dramatically boost power by ignoring noise & focusing on the most relevant SNPs while inferring realistic distributions of effect sizes (eg Loh et al 2018)
    • NHST thinking leads to stringent multiple-correction & focus on the arbitrary threshold of genome-wide statistical-significance while downplaying full polygenic scores, allowing only the few hits with the highest posterior probability to be considered in any subsequent analyses or discussions (ensuring few false positives at the cost of reducing power even further in the original GWAS & all downstream uses)
    • no hyperparameter tuning of the GWAS is done: preprocessing values for quality control, imputation, p-value thresholding, and ‘clumping’ of variants in close LD are set by convention and are not in any way optimal

This is not to criticize the authors of those GWASes (they are generally doing the best that they can with existing datasets in a hostile intellectual & funding climate, using the standard methods rather than taking risks on better but more exotic & unfamiliar methods, and their results nevertheless are intellectually important, reliable, & useful) but to point out that better results will inevitably arrive as data & computation become more plentiful and the older results slowly trickle out & change minds.

Since these scores overlap and are not, like GCTA estimates, independent measurements of a variable, there is little point in meta-analyzing them other than to estimate growth over time (even using them as an ensemble wouldn’t be worth the complexity, and in any case, most studies do not provide the full list of beta values making up the polygenic score); for our purpose, the largest polygenic score is the important number.
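To make concrete why the largest score is the number that matters, the expected gain can be sketched as the expected maximum of n normal draws scaled by the square root of the usable PGS variance. This is a minimal sketch under stated assumptions: scores are normally distributed, only half of the PGS variance is available between siblings, and 1 SD = 15 IQ points. The function names `expectedMaxStdNormal` and `expectedGainIQ` are hypothetical and are not the analysis's own functions.

```r
## Hypothetical sketch: expected gain from picking the top-scoring of n embryos.
expectedMaxStdNormal <- function(n) {
    ## E[max of n standard normals] = integral of x * n * pnorm(x)^(n-1) * dnorm(x) dx
    integrand <- function(x) { x * n * pnorm(x)^(n-1) * dnorm(x) }
    integrate(integrand, lower=-Inf, upper=Inf)$value
}
expectedGainIQ <- function(n, variance) {
    ## assumption: between-sibling variance is half the population PGS variance
    expectedMaxStdNormal(n) * sqrt(variance * 0.5) * 15
}
## gain grows only with the square root of PGS variance:
gains <- sapply(c(0.035, 0.07, 0.11), function(v) { expectedGainIQ(n=10, variance=v) })
print(round(gains, digits=2))
```

For a 3.5% PGS and 10 embryos this gives roughly 3 IQ points; because the gain scales with the square root of variance, doubling the PGS variance raises the gain by only about 41%.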
(The polygenic scores are also inefficient: they are not always published, not always based on individual patient data, and generally use maximum-likelihood estimation neglecting our strong priors on the number of hits & distribution of effect sizes. But these published scores are what we have as of January 2016, so we must make do.) Selzam et al 2016’s reported polygenic score for cognitive performance was 3.5%. Thus:

    selzam2016 <- 0.035
    embryoSelection(n=10, variance=selzam2016) * 15
    # [1] 3.053367791

Incidentally, one might wonder why not use the EDU/EA PGSes given that their variance-explained is so much higher & education is a large part of how intelligence causes benefits? It would be reasonable to use them, alone or in conjunction, but I have several reasons for not preferring them:

1. the greater performance thus far is not because ‘years of education’ is inherently more important or more heritable or less polygenic or anything like that; on a n for n basis, GWASes with good IQ measurements work much better. The greater performance is driven mostly by the fact that it is a basic demographic variable which is routinely recorded in datasets and easily asked if not, allowing for far larger combined sample sizes. If there were any dataset of 1.1m individuals with high quality IQ scores, the IQ PGS from that would surely be far better than the IQ PGS created by Lee et al 2018 on 1.1m EDU. Unfortunately, there is no such dataset and likely will not be for a while.
2. ‘years of education’ is a crude measurement which captures neither selectivity nor gains from schooling: it lumps all ‘schooling’ together and it’s unclear to what extent it captures the desirable benefits of formal education, like learning, as opposed to more undesirable behavior like procrastinating on life by going to grad school or going to community college or a less selective college and dropping out (even though that may be harmful and incur a lifetime of student debt); valuing “years of education” is like valuing a car by how many kilograms of metal it takes to manufacture: it treats a cost as a benefit. The causal nature of benefits from more years of formal education is likewise less clear than from IQ. ‘Years of education’ is not even particularly meaningful in an absolute sense, as governments can simply mandate that children go to school longer, though this appears to have few benefits and simply fuels arms races for more educational credentials, increasing the higher education premium rather than reducing it; Okbay/Selzam et al 2016 include a nice graph showing how Swedish school changes mandating more attendance reduced the PGS predictive performance (or ‘penetrance’) of EDU, as would be expected, although it seems doubtful such a mandate had any of the many consequences which were hoped for… On the other hand, the relationship between IQ and good outcomes like income has been stable over the 20th century (Strenze et al 2007), and given the absence of any selection for intelligence now (or outright dysgenics), and near-universal forecasts among economists that future economies will draw at least as much or more on intelligence as past economies, it is highly unlikely that intelligence will become of less value.
    In general, intelligence appears much more convincingly causal, more likely to have positive externalities and cause gains in positive-sum games rather than negative-sum positional/signaling games, so I am more comfortable using estimates for intelligence as I believe they are much more likely to be underestimates of the true long-term societal all-inclusive effects, while education could easily be overestimated.
3. the genetic correlations of EDU/EA are not as uniformly positive as they are for IQ (despite the high genetic correlation between the two, illustrating the non-transitivity of correlations); eg bipolar disorder/education but not bipolar disorder/IQ. While genetic correlations can be dealt with by a generalization of the single-trait case (see the multiple selection section) to make optimal tradeoffs, such harmful genetic correlations are troubling & complicate things.
4. EDU/EA PGSes are approaching their SNP heritability ceilings, and as they measure their crude construct fairly well (most people can recall how many years of formal schooling they had), there’s not as much to gain as with IQ from fixing measurement error. Considering the twin/family studies, the highest heritability for education, variously measured (typically better than ‘years of education’), tends to peak at 50%, while with IQ, the most refined methods peak at 80%. Thus, at some point the pure-IQ or multi-trait GWASes will exceed the EDU PGSes for the purpose of predicting intelligence (although this may take some time or require upgrades like use of WGS or much better measurements).

###### Measurement Error in Polygenic Scores

As with GCTA, measurement error affects polygenic scores, both reducing discovery power and yielding a downwardly-biased estimate of how good the PGS is. The GCTAs give a substantially lower estimate than the one we care about if we forget to correct for measurement error; is this true for the PGSes above as well?
Checking some of the GWASes in question where possible, it seems there is an unspoken general practice of using the smallest highest-quality-phenotyped cohorts as the heldout validation sets, so the measurement error turns out to not be too serious, and we don’t need to take it much into consideration.

Measurement error affects polygenic scores in two major ways: first, poor quality measurements reduce the statistical power considerably and thus the ability to find genome-wide statistically-significant hits or create predictive PGSes; second, after the hit to power has been taken (GIGO), measurement error in a separate validation/replication dataset will bias the estimate to zero because the true accuracy is being hidden by the noise in the new dataset. (If the “IQ score” only correlates r=0.5 with intelligence because it is just that noisy and unstable, no PGS will ever exceed r=0.5 predictive power in that dataset, because by definition you can’t predict noise, even though the true latent intelligence variable is much more heritable than that.) The UK Biobank’s cognitive ability measures are particularly low quality, with test-retest reliability alone averaging only r=0.55. From a psychometric perspective, it’s worth noting that the power will be reduced, and the PGS biased towards 0, by range restriction, especially by attrition of very unintelligent people (due to things like excess mortality), which can be expected to reduce the PGS by another ~5% (going by the Generation Scotland estimate of the range restriction bias).
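The second effect, attenuation in a noisy validation cohort, is easy to see in a toy simulation. This is an illustrative sketch only: the 11% true variance explained, the 0.65 reliability, and the sample size are assumed values chosen for demonstration, not estimates from any cohort, and all variable names are hypothetical.

```r
## Illustrative simulation; every parameter value here is an assumption.
set.seed(2016)
n <- 100000
truePGSvariance <- 0.11   # PGS explains 11% of variance in latent IQ
reliability     <- 0.65   # share of measured-score variance due to latent IQ
pgs      <- rnorm(n)
latentIQ <- sqrt(truePGSvariance) * pgs + sqrt(1 - truePGSvariance) * rnorm(n)
## measured score = latent IQ plus independent noise, scaled to unit variance:
measuredIQ <- sqrt(reliability) * latentIQ + sqrt(1 - reliability) * rnorm(n)
r2latent   <- cor(pgs, latentIQ)^2
r2measured <- cor(pgs, measuredIQ)^2
print(round(c(latent=r2latent, measured=r2measured,
              expected=truePGSvariance * reliability), digits=3))
```

The variance explained in the measured score shrinks toward the true value multiplied by the reliability, even though the PGS's relationship to latent intelligence is unchanged.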
There’s not much that can be done about the first problem after the GWAS has been conducted, but the second problem can be quantified and corrected for similar to with GCTA—the polygenic score/replication dataset is just another correlation (even if we usually write it as ‘variance explained’ rather than r), and if we know how much noise is in the replication dataset IQ measurements, we can correct for that and see how much of true IQ was predicted. The raw replication performance is meaningful for some purposes, like if one was trying to use the PGS as a covariate or to just predict that cohort, but not for others; in the case of embryo selection, we do not care about increasing measured IQ but latent or true IQ. If our PGS is actually predicting 11% of variance but the measurements are so bad in the replication cohort that our PGS can only predict 7% of the noisy measurements, it is the 11% that matters as it is what defines how much selected embryos will increase by. Most GWASes do not mention the issue, few mention anything about the expected reliability of the used IQ scores, and none correct for measurement error in reporting PGS predictions, so I’ve gone through the above list of PGSes and made an attempt to roughly calculate corrected PGSes. For UKBB, test-retest correlations have been reported and can be considered loose upper bounds on the reliability (since a test which can’t predict itself can’t measure intelligence a fortiori); for IQ measures which are a principal component extracted from multiple tests, I assume they are at least r=0.8 and acceptable quality. 
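The correction can be sketched as a simple disattenuation: since only the validation phenotype is treated as noisy, the observed variance explained is divided by the reliability. This appears to reproduce the corrected values reported in the table (eg 0.054/0.5 = 0.108), though the helper `correctPGS` is a hypothetical name and the original calculation is not shown in the text.

```r
## Hypothetical helper: corrected R^2 = observed R^2 / reliability of the
## validation phenotype (measurement error in the PGS itself is ignored).
correctPGS <- function(pgs, reliability) { pgs / reliability }

corrected <- c(
    Rietveld2013.STR = correctPGS(0.0258, 0.84),  # using the lower reliability bound
    Hill2017.GS      = correctPGS(0.0686, 0.80),  # assuming reliability of exactly 0.8
    Savage2017.S4S   = correctPGS(0.054,  0.50))
print(round(corrected, digits=3))
```

Where only a bound on reliability is known (">0.8" or "<0.65"), the same division yields a bound on the corrected PGS in the opposite direction.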
| Year | Study | n | PGS | Replication cohort | Test type | Replication N | Reliability | Corrected PGS |
|------|-------|---|-----|--------------------|-----------|---------------|-------------|---------------|
| 2011 | Davies et al 2011 | 3511 | 0.0058 | Norwegian Cognitive NeuroGenetics (NCNG) | custom battery | 670 | >0.8 | <0.007 |
| 2013 | Rietveld et al 2013 | 126559 | 0.0258 | Swedish Twin Registry (STR) | SEB80 | 9553 | 0.84-0.95 | 0.031 |
| 2014 | Benyamin et al 2014 | 12441 | 0.035 | Netherlands Twin Registry (NTR) | RAKIT, WISC-R, WISC-R-III, WAIS-III | 739 | >0.90 | <0.039 |
| 2014 | Benyamin et al 2014 | 12441 | 0.005 | University of Minnesota study (UMN) | WISC-R, WAIS-R | 3367 | 0.90 | 0.006 |
| 2014 | Benyamin et al 2014 | 12441 | 0.012 | Generation Rotterdam study (Generation R) | SON-R 2,5-7 | 1442 | 0.62 | 0.02 |
| 2015 | Davies et al 2015 | 53949 | 0.0127 | Generation Scotland (GS) | custom battery | 5487 | >0.8 | <0.016 |
| 2016 | Davies et al 2016 | 112151 | 0.0231 | Generation Scotland (GS) | custom battery | 19994 | >0.8 | <0.029 |
| 2016 | Davies et al 2016 | 112151 | 0.031 | Lothian Birth Cohort of 1936 (LBC1936/1947) | custom battery, Moray House Test No. 12 | 1005 | >0.8 | <0.039 |
| 2016 | Selzam et al 2016 | 329000 | 0.0361 | Twins Early Development Study (TEDS) | custom battery | 5825 | >0.8 | <0.045 |
| 2017 | Sniekers et al 2017 | 78308 | 0.032 | Twins Early Development Study (TEDS) | custom battery | 1173 | >0.8 | <0.04 |
| 2017 | Sniekers et al 2017 | 78308 | 0.048 | Manchester & Newcastle Longitudinal Studies of Cognitive Ageing Cohorts (ACPRC) | custom battery | 1558 | >0.8 | <0.06 |
| 2017 | Sniekers et al 2017 | 78308 | 0.025 | Rotterdam Study | custom battery | 2015 | ? | ? |
| 2017 | Zabaneh et al 2017 | 9410 | 0.016 | Twins Early Development Study (TEDS) | custom battery | 3414 | >0.8 | <0.02 |
| 2017 | Zabaneh et al 2017 | 9410 | 0.024 | Twins Early Development Study (TEDS) | custom battery | 4731 | >0.8 | <0.03 |
| 2017 | Krapohl et al 2017 | 82493 | 0.048 | Twins Early Development Study (TEDS) | custom battery | 6710 | >0.8 | <0.06 |
| 2017 | Hill et al 2017 | 147194 | 0.0686 | Generation Scotland (GS) | custom battery | 6884 | >0.8 | <0.086 |
| 2017 | Savage et al 2017 | 279930 | 0.041 | Generation Rotterdam study (Generation R) | SON-R 2,5-7 | 1929 | 0.62 | 0.066 |
| 2017 | Savage et al 2017 | 279930 | 0.054 | Spit 4 Science (S4S) | SAT | 2818 | 0.5 | 0.108 |
| 2017 | Savage et al 2017 | 279930 | 0.021 | Rotterdam Study | custom battery | 6182 | ? | ? |
| 2017 | Savage et al 2017 | 279930 | 0.05 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 53576 | <0.65 | >0.077 |
| 2018 | Hill et al 2018 | 248482 | 0.065 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 9050 | <0.65 | >0.10 |
| 2018 | Hill et al 2018 | 248482 | 0.0683 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 2431 | <0.65 | >0.11 |
| 2018 | Hill et al 2018 | 248482 | 0.0464 | UK Biobank (UKBB) | custom verbal-numerical reasoning subtest (VNR) | 33065 | <0.65 | >0.07 |

Overall, it seems that most GWASes use the noisy measurements for discovery in the main GWAS and then reserve their small but relatively high-quality cohorts for the testing, which is the best approach, and so the corrected PGSes are similar enough to the raw PGS that it is not a big issue, except in a few cases where the measurement error is severe enough that it dramatically changes the interpretation, like the use of UKBB or S4S cohorts, whose r<0.65 reliabilities (possibly much worse than that) seriously understate the predictive power of the PGS.
Hill et al 2018, for example, appears to turn in a mediocre result which doesn’t exceed Hill et al 2017’s SOTA despite a much larger sample size, but this is entirely an artifact of uncorrected measurement error, and the corrected PGSes are ~8.6% vs ~10%, implying Hill et al 2018 actually became the SOTA on publication. (The corrected PGSes also seem to show more of the expected exponential growth with time, which has been somewhat hidden by increasing use of poorly-measured validation datasets.)

###### Why Trust GWASes?

Before moving on to the cost, it’s worth discussing a question I see a lot: why trust any of these polygenic scores or GWAS results like genetic correlations, and assume they will work in an embryo selection anywhere near the reported predictive performance, when they are, after all, just a bunch of complex correlational studies and not proper randomized experiments?

During the late 2000s, there was a great deal of criticism of the “missing heritability” after GWASes exposed the bankruptcy of candidate-gene studies (specifically, Chabris et al 2012 for IQ hits), along with predictions that, contrary to the behavioral geneticists’ predictions that increasing sample sizes would overcome polygenicity, GWASes would never amount to anything, in part because the genetic basis of many traits (especially intelligence) simply did not exist. So now, in 2016 and later, why should we trust GWASes & polygenic scores, and the intelligence/education ones in particular, and believe they measure meaningful genetic causation—rather than some sort of complicated hidden “cryptic population structure” which just happens to create spurious correlations between ancestry and, say, socioeconomic status?

1. causal priors: predictions from genetic markers are inherently a longitudinal design & thus more likely to be causal than a random published correlation, because genes are fixed at conception, thereby ruling out 1 of the 3 main causal patterns: either the genes do cause the correlated phenotypes, or they are confounded, but the correlation cannot be reverse causation.
2. consilience among all genetic methods: GWAS results showing non-zero SNP heritability and highly polygenic additive genetic architectures are consistent with the past century of adoption, twin, sibling, and family studies. For example, polygenic scores always explain less than SNP heritability, and SNP heritability is always less than twin heritability, as expected; but if the signal were purely ancestry, a few thousand SNPs is more than enough to infer ancestry with high accuracy and the results could be anything.

    • The same holds for genetic correlations: genetic correlations computed using molecular genetics are typically consistent with those calculated using twins.
3. strong precautions: GWASes typically include stringent measures to reduce cryptic relatedness, removing too-related datapoints as measured directly on molecular genetic markers, and including many principal components as controls—possibly too many, as these measures come at costs in sample size and ability to detect rare variants, to reduce a risk which has not much materialized in practice. (Statistical methods like LD score regression (/) generally indicate that, after these measures, most of the predictive signal comes from genuine polygenicity and not residual population stratification.)
4. high replication rates: GWAS polygenic scores are predictive out of sample (multiple cohorts of the same study), across social classes11, across closely-related but separate countries (eg UK GWASes closely agree with USA GWASes; GWASes also have high replication rates within countries/studies)12, and across times (while heritabilities may change, the PGS remains predictive and does not revert quickly to 0% variance; similarly, there is consilience with selection/dysgenics, rather than small random walks).

    Suggestions that GWASes merely measure social stratification predict many things we simply do not see, like extreme SES interactions and gradients in predictive validity, or PGSes being pointless if there is even the slightest bit of range restriction (if anything, range restriction is common in GWASes, and they still work), or very low genetic correlations between cohorts in different times or places or measurements or recruitment methods (rather than the usual high rg>0.8). The critics have yet to explain just how much relatedness is too much, or how far the cryptic relatedness goes, and why current practices of eliminating even as close as 2.5% (fourth cousins) are inadequate (unless the argument is circular—“we know it’s not enough because the GWASes & GCTAs continue to work”!).

    A decade on, with datasets that have grown 10-50x larger than initial GWASes like Chabris et al 2012, there has been no replication crisis for GWASes. This is despite the usual practice of GWAS involving consortia with repeated GWAS+meta-analysis across accumulating datasets, which would quickly expose any serious replication issues (practices adopted, in part, as a reaction to the candidate-gene debacle).

    Further, while GWAS polygenic scores decrease in predictive validity when used in increasingly distant ethnicities (eg IQ PGSes predict best in Caucasians, somewhat well in Han Chinese, worse in African-Americans, and hardly at all in Africans), they do so gradually, as predicted by ethnic relatedness leading to linkage disequilibrium decay of SNP markers for identical causal variants—and not abruptly based on national borders or economies. What sort of population stratification or residual confounding could possibly be identical between both London and Beijing?
5. within-family comparisons show causality: GWASes pass one of the most stringent checks, within-family comparisons of siblings. As , siblings inherit random genes from their parents and are born equal in every respect like socioeconomic status, ancestry, neighborhood etc (yet siblings within a family, including fraternal twins, differ a great deal on average, a puzzle for environmental determinists but predicted by the large genetic differences between siblings & CLT), and so all genetic differences between siblings are themselves randomized experiments showing causality:

    > Genetics is indeed in a peculiarly favoured condition in that Providence has shielded the geneticist from many of the difficulties of a reliably controlled comparison. The different genotypes possible from the same mating have been beautifully randomised by the meiotic process. A more perfect control of conditions is scarcely possible, than that of different genotypes appearing in the same litter

    Indeed, the first successful IQ/education GWAS, Rietveld et al 2013, checked the PGS in an available sample of siblings, and found that, within pairs of siblings, the sibling with the higher PGS tended to also have a higher education. Hence, the PGS must measure causation. Other methods aside from sibling comparison like parental PGS controls, pedigrees, or transmission disequilibrium can be expected to reduce or eliminate any hypothetical confounding from residual population stratification; GWASes typically survive those as well. (See also Willoughby et al 2019.)
6. GWASes are biased towards nulls: the major statistical flaws in GWASes are typically in the direction of minimizing genetic effects: using small numbers of SNPs, highly unrealistic flat priors, one-SNP-at-a-time regression, no incorporation of measurement error, too many principal components, additive-only models, arbitrary genome-wide significance thresholds, PGSes of only genome-wide statistically-significant hits rather than full PGSes etc. (See section on how current PGSes represent lower bounds and will become much better.)
7. consilience with biological & neurological evidence: if GWASes and PGSes were merely confounded by something like ancestry, the attempt to parse their tea leaves into something biologically meaningful would fail. They would be exploiting chance variants associated with ancestry and mutations, spread scattershot over the genome. But instead, we observe considerable structure of identified variants within the genome in a way that looks as if they are doing something.

    On a high level, as previously mentioned, the genetic correlations are consistent with those observed in twins, but also generally with phenotypic correlations. In terms of general location in the genome, the identified variants are where they are expected if they have functional causal consequences: mostly in protein-coding & regulatory regions (rather than the overwhelming majority of junk DNA regions), and located far above chance near rare pathological variants—IQ variants are enriched in locations very near the rare mutations which cause many cases of intellectual disabilities, and similarly disease-related common variants are very near rare pathogenic mutations (eg for heart defects).

    On a more fine-grained level, the genes hosting identified genetic variants can be assigned to specific organs and stages of life based on when they are typically expressed using methods like DNA microarrays; IQ/EDU hits are, unsurprisingly, heavily clustered in genes associated with the nervous system or with known psychoactive drug targets (and not skin color or appearance genes), and express most heavily early in life prenatally & infancy—exactly when the human brain is growing & learning most rapidly. (See for example Okbay et al 2016 or .)

    While the biological insights have not been too impressive for complex behavioral traits like education/intelligence, GWASes have given considerable insight into diseases like Crohn’s or diabetes or schizophrenia, which is difficult to reconcile with the idea that GWASes are systematically wrong or picking up on population stratification confounding. Or should we engage in post hoc special pleading and say that the GWAS methodology works fine for diseases but somehow, invisibly, fails when it comes to traits which are not conventionally defined as diseases (even when they are parts of continuums where the extremes are considered diseases or disorders)?
8. the critics were wrong: none of this was predicted by critics of “missing heritability”. The prediction was that GWASes were a fool’s errand—for example, from 2010, “If common alleles influenced common diseases, many would have been found by now.” or “The most likely explanation for why genes for common diseases have not been found is that, with few exceptions, they do not exist.” (Quotes from critics cited in /.)

    Few (none?) of the critics predicted that GWASes would succeed—as predicted by the power analyses—in finding hundreds or thousands of genome-wide statistically-significant hits when sample sizes increased appropriately with datasets like 23andMe & UK Biobank becoming available but that these would simply be illusory; this was considered too absurd and implausible to rate serious mention compared to hypotheses like “they do not exist”. It behooves us to take the critics at their word—before . As it happened, proponents of additive polygenic architectures and taking results like GCTA at face value made specific predictions that hits would materialize at appropriate sample sizes like n=100k (eg Visscher or Hsu). Their predictions were right; the critics’ were wrong. Everything else is post hoc.

Those are the main reasons I take GWASes & PGSes at largely face-value. Population stratification certainly exists and would inflate naive estimates; individual SNPs of course may turn out to be errors; there are serious issues in trying to apply results from one population to another; there is far to go in creating useful PGSes for most traits; there remain many unknowns about the nature of the causal effects (are genetic correlations horizontal or vertical pleiotropy? is a particular SNP causal or just a tag for a rarer variant? what biological difference, exactly, does it make, and does it run outside the body as well?); and there are many misinterpretations of what specific methods like GCTA deliver and many suboptimal practices (like polygenic scores using a p-value threshold). But it is not credible now to claim that genes do not matter, or that GWASes are untrustworthy. The fact is: most human traits are under considerable genetic influence, and GWASes are a highly successful method for quantifying and pinning down that influence.
As in E.O. Wilson’s famous defense of evolution, each point may seem minor or narrow, hedged about with caveats and technical assumptions, but the consilience of the total weight of the evidence is unanswerable. Can we seriously entertain the hypothesis that all the twin studies, adoption studies, family studies, GCTAs, LD score regressions, GWAS PGSes, within-family comparisons or pedigrees or parental covariate-controls or transmission disequilibrium tests, the trans-cohort & population & ethnicity & country & time period replications, intellectual disability and other Mendelian disorder overlaps, the developmental & gene expression evidence, all of these and more reported from so many countries by so many researchers on so many people (millions of twins alone have been studied, see Polderman et al 2015), are all just some incredible fluke of stratification or a SNP chip error, all of whose errors and assumptions just happen to go in exactly the same direction and just happen to act in exactly the way one would expect of genuine causal genetic influences? Surely the Devil did not plant dinosaur bones to fool the paleontologists, or SNPs (“Satanic Nucleotide Polymorphisms”?) to fool the medical/behavioral geneticist—the universe is hard to understand, and randomness and bias are vexing foes, but it is not actively malicious. At this point, we can safely trust in the majority of large GWAS results to be largely accurate and act as we expect them to.

## Cost of embryo selection

In considering the cost of embryo selection, I am looking at the marginal cost of embryo selection and not the total cost of IVF: I assume that, for better or worse, a pair of parents have decided to use IVF to have a child and will incur whatever costs there may be, from the $8k13-$20k cost of each IVF cycle to any possible side effects for mother/child of the IVF process, and merely ask: what are the costs and benefits of doing embryo selection as part of that IVF?
The counterfactual is IVF vs IVF+embryo-selection, not having a child normally or adopting. PGD is currently , so there are no criminal or legal costs; even if there were, clinics in other countries will continue to offer it, and the cost of using a Chinese fertility clinic may not be particularly noticeable financially14 and their quality may eventually be higher15.

### Cost of polygenic scores

An upper bound is the cost of whole-genome sequencing, which has continuously fallen. My impression is that historically, a whole-genome has cost ~6x a comprehensive SNP (500k+). The most recently records an . Illumina has boasted about a $1000 whole-genome starting around 2014 (under an unspecified cost model); around December 2015, Veritas Genetics started taking orders for a consumer 20x whole-genome priced at $1000; in January 2018, 30x whole-genomes at ~$740 (down from May-Sep 2017 at ~$950-$1000, apparently dependent on euro exchange rate), dropping to $500 by June 2018. So if a comprehensive SNP cost >$1000, it would be cheaper to do a whole-genome, and historically at that price, we would expect a SNP cost of ~$170.

The date & cost of getting a large selection of SNPs is not collected in any dataset I know of, so here are a few 2010-2016 price quotes:

• Tur-Kaspa et al 2010 estimates “Genetic analyses of oocytes by polar bodies biopsy and embryos by blastomere biopsy” at $3000.
• Hsu 2014 estimates an SNP costs “~$100 USD” and “At the time of this writing SNP genotyping costs are below $50 USD per individual”, without specifying a source; given the latter is below any 23andMe price offered, it is probably an internal Beijing Genomics Institute cost estimate.
• The (unspecified date but presumably 2015) lists at $355 & the at $170.
• 23andMe famously offered its services for $108.95 for >600k SNPs as of October 2014, but that price apparently was substantially subsidized by research & sales as they raised the price to $200 & lowered comprehensiveness in October 2015.
• quotes a full cost of $150-$210 for 1 use of a 821K SNP Axiom Array () as of 10 December 2015. (The NIH CIDR price list also says $40 for 96 SNPs, suggesting that it would be a false economy to try to get only the top few SNP hits rather than a comprehensive polygenic score.)
• quotes a range of $260-$520 for one sample from an Affymetrix GeneChip.
• Tan et al 2014 note that for PGD purposes, “the estimated reagent cost of sequencing for the detection of chromosomal abnormalities is currently less than $100.”
• The price of the array & genotyping can be driven far below this by economies of scale: conference says that they had reached a cost of ~$45 per SNP16 (The UK Biobank overall has spent , so genotyping 500,000 people at ~$45 each represents a large fraction of its total budget. Somewhat similarly, 23andMe has raised along with charging ~2m customers perhaps an average of ~$150 along with unknown pharmacorp licensing revenue, so total 23andMe spending could be estimated at somewhere ~$800m. For comparison, the US program in 2018 had an annual budget of , or highly likely >9x more annually than has ever been spent on UKBB/23andMe/SSGAC combined.)
• The Genes for Good project, begun in 2015, reports that its small-scale (n=27k) social-media-based sequencing program cost “about $80, which includes postage, DNA extraction, and genotyping” per participant.
• Razib Khan reports in that people at the October 2016 ASHG were discussing SNP chips in the “range of the low tens of dollars”.

Overall, SNPing an embryo in 2016 should cost ~$100-400, more likely towards the low end (~$200), and we can expect the SNP cost to fall further, with fixed costs probably pushing a climb up the quality ladder to exome and then whole-genome sequencing (which will increase the ceiling on possible PGS by covering rare & causal variants, and allow selection on other metrics like avoiding unhealthy-looking de novo mutations or decreasing estimated mutation load).

#### SNP cost forecast

How much will SNP costs drop in the future? We can extrapolate from the NHGRI Genome Sequencing Program’s DNA Sequencing Cost dataset, but it’s tricky: eyeballing , we can see that historical prices have not followed any single pattern. At first, costs closely track a simple halving every 18 months, then there is an abrupt trend-break to super-exponential drops from mid-2007 to mid-2011 and then an equally abrupt reversion to a flat cost trajectory with occasional price increases and then another abrupt fall in early 2015 (accentuated when one adds in the Veritas Genetics $1k as a datapoint).

Dropping pre-2007 data and fitting an exponential shows a bad fit since 2012 (if it follows the pre-2015 curve, it has large prediction errors on 2015-2016 and vice-versa). It’s probably better to take the last 3 datapoints (the current trend) and fit the curve to them, covering just the past 6 months since July 2015, and then applying the 6x rule of thumb we can predict SNP costs out 20 months to October 2017:

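# NHGRI cost-per-genome series (USD), oldest first, with pre-2007 data dropped:
# http://www.genome.gov/pages/der/seqcost2015_4.xlsx
# (the final 1000 is the Veritas Genetics $1k datapoint mentioned above)
genome <- c(9408739,9047003,8927342,7147571,3063820,1352982,752080,342502,232735,154714,108065,70333,46774,
31512,31125,29092,20963,16712,10497,7743,7666,5901,5985,6618,5671,5826,5550,5096,4008,4920,4905,
5731,3970,4211,1363,1245,1000)
# log-linear (ie exponential) fit to only the last 3 datapoints, the current trend:
l <- lm(log(I(tail(genome, 3))) ~ I(1:3)); l
# Coefficients:
# (Intercept)       I(1:3)
#  7.3937180   -0.1548441
# extrapolate the whole-genome trend over 10 time-steps & divide by 6 (the SNP rule of thumb):
exp(sapply(1:10, function(t) { 7.3937180 + -0.1548441*t } )) / 6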
# [1] 232.08749421 198.79424215 170.27695028 145.85050092 124.92805739 107.00696553  91.65667754
# [8]  78.50840827  67.24627528  57.59970987

(Even if SNP prices stagnate due to lack of competition or fixed-costs/overhead/small-scales, whole-genomes will simply eat their lunch: at the current trend, whole-genomes will reach $200 ~2019 and $100 ~2020.)
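The crossing dates can be checked by inverting the fitted trend. This is only a sketch: it reuses the two coefficients printed above, and it assumes that one time-step is 3 months and that step 3 corresponds to the January 2016 datapoint (an inference from "the past 6 months since July 2015", not something the dataset states). `stepsToCost` and `yearReached` are hypothetical helper names.

```r
# Coefficients copied from the log-linear fit on the last 3 whole-genome datapoints
intercept <- 7.3937180
slope     <- -0.1548441

# Time-step at which the fitted whole-genome cost falls to `target` dollars:
# solve intercept + slope*t = log(target) for t
stepsToCost <- function(target) { (log(target) - intercept) / slope }

# ASSUMPTION: 1 step = 3 months, and step 3 = January 2016 (decimal year 2016.0)
yearReached <- function(target) { 2016 + (stepsToCost(target) - 3) * 0.25 }

# Decimal years at which whole-genomes would reach $200 and $100 on this trend
sapply(c(200, 100), yearReached)
```

Under these assumptions both results land within about half a year of the ~2019 and ~2020 dates given above; a different step length or start date would shift them accordingly.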

### PGD net costs

An IVF cycle involving PGD will need ~4-5 SNP genotypings (given a median egg count of 9 and half being abnormal), so I estimate the genetic part costs ~$800-1000. The net cost of PGD will include the cell harvesting part (one needs to extract cells from embryos to sequence) and interpretation (although scoring and checking the genetic data for abnormality should be automatable), so we can compare with current PGD price quotes.

• “The Fertility Institutes” say (unspecified date) without breaking out the PGD part.

• Tur-Kaspa et al 2010, using 2000-2005 data from the Reproductive Genetics Institute (RGI) in Illinois, estimates the first PGD cycle at $6k, and subsequent at $4.5k, giving a full table of costs:

Table 2: Estimated cost of IVF-preimplantation genetic diagnosis (PGD) treatment for cystic fibrosis (CF) carriers

| Procedure | Subprocedure | Cost (US$) | Notes |
|---|---|---|---|
| IVF | Pre-IVF laboratory screening | 1000 | Range $600 to $2000; needs to be performed only once each year |
| | Medications | 3000 | Range $1500 to $5000 |
| | Cost of IVF treatment cycle | 12000 | Range $6000 to $18000 |
| | Total cost, first IVF cycle | 16000 | |
| | Total cost, each additional IVF cycle | 15000 | |
| PGD | Genetic system set-up for PGD of a specific couple | 1500 | Range $1000 to $2000; performed once for a specific couple, with or without analysis of second generation, if applicable |
| | Biopsy of oocytes and embryos | 1500 | |
| | Genetic analyses of oocytes by polar bodies biopsy and embryos by blastomere biopsy | 3000 | Variable; upper end presented; depends on number of mutations anticipated |
| | Subtotal: cost of PGD, first cycle | 6000 | |
| | Subtotal: cost of PGD, each repeated cycle | 4500 | |
| IVF-PGD | Total cost, first IVF-PGD cycle | 22000 | |
| | Total cost, each additional IVF-PGD cycle | 19500 | |

…Overall, 35.6% of the IVF-PGD cycles yielded a life birth with one or more healthy babies. If IVF-PGD is not successful, the couple must decide whether to attempt another cycle of IVF-PGD (Figure 1) knowing that their probability of having a baby approaches 75% after only three treatment cycles and is predicted to exceed 93% after six treatment cycles (Table 3). If 4000 couples undergo one cycle of IVF-PGD, 1424 deliveries with non-affected children are expected (Table 3). Assuming a similar success rate of 35.6% in subsequent treatment cycles and that couples could elect to undergo between four and six attempts per year yields a cumulative success rate approaching 93%. IVF as performed in the USA typically involves the transfer of two or three embryos. The series yielded 1.3 non-affected babies per pregnancy with an average of about two embryos per transfer (Table 1). Thus, the number of resulting children would be higher than the number of deliveries, perhaps by as much as 30% (Table 3). Nonetheless, to avoid multiple births, which have both medical complications and an additional cost, the outcome was calculated as if each delivery results in the birth of one non-affected child. IVF-PGD cycles can be performed at an experienced centre. The estimated cost of performing the initial IVF cycle with intracytoplasmic sperm injection (ICSI) without PGD was $16,000 including laboratory and imaging screening, cost of medications, monitoring during ovarian stimulation and the IVF procedure per se (Table 2). The cost of subsequent IVF cycles was lower because the initial screening does not need to be repeated until a year later. Estimated PGD costs were $6000 for the initial cycle and $4500 for subsequent cycles. The cost for subsequent PGD cycles would be lower because the initial genetic set-up for couples (parents) and siblings for linked genetic markers and probes needs to be performed only once. These conditions yield an estimated cost of $22,000 for the initial cycle of IVF/ICSI-PGD and $19,500 for each subsequent treatment cycle.

• claims (in 2012, based on PDF creation date) that “The cost of PGD is typically split into two parts: procedural costs (consultations, laboratory testing, egg collection, embryo transfer, ultrasound scans, and blood tests) and drug costs (for ovarian stimulation and embryo transfer). PGD combined with IVF will cost £6,000 [$8.5k]–£9,000 [$12.8k] per treatment cycle.” but doesn’t specify the marginal cost of the PGD rather than IVF part.

• (2013?): “One round of IVF typically costs around $9,000. PGD adds another $4,000 to $7,500 to the cost of each IVF attempt. A standard round of IVF results in a successful pregnancy only 10-35% of the time (depending on the age and health of the woman), and a woman may need to undergo subsequent attempts to achieve a viable pregnancy.”

• (July 2014): “In Madison, Wisconsin, genetic counselor Margo Grady at Generations Fertility Care estimated the out-of-pocket price of one IVF cycle at about $12,000, and PGD adds another $3,000.”

• (2015?): “PGD typically costs between $4,000-$10,000 depending on the cost of creating the specific probe used to detect the presence of a single gene.”

• Murugappan et al May 2015: “The average cost of PGS was $4,268 (range $3,155-$12,626)”, citing another study which estimated “Average additional cost of PGD procedure: $3,550; Median Cost: $3,200”

• the (“current” pricing, so 2015?) says IVF costs ~$12k and of that, “Aneuploidy testing (for chromosome normality) with PGD is $1800 to $5000…PGD costs in the US vary from about $4000-$8000”. AFC usefully breaks down the costs further in a table of “Average PGS IVF Costs in USA”, saying that:

• Embryo biopsy charges are about $1000 to $2500 (average: $1500)
• Embryo freezing costs are usually between $500 to $1000 (average: $750)
• Aneuploidy testing (for chromosome normality) with PGD is $1800 to $5000
• For single gene defects (such as cystic fibrosis), there are additional costs involved.
• PGS test cost average: $3500

(The wording is unclear about whether these are costs per embryo or per batch of embryos; but the rest of the page implies that it’s per batch, and per embryo would imply that the other PGS cost estimates are either far too low or are being done on only one embryo & likely would fail.)

• the startup in September/October 2018 announced a full embryo selection service for complex traits at a fixed cost of $1000 + $400/embryo (eg so 5 embryos would be $3000 total):

300+ common single-gene disorders, such as Cystic Fibrosis, Thalassemia, BRCA, Sickle Cell Anemia, and Gaucher Disease.

Polygenic Disease Risk, such as risk for Type 1 and Type 2 diabetes, Dwarfism, Hypothyroidism, Mental Disability, Atrial Fibrillation and other Cardiovascular Diseases like CAD, Inflammatory Bowel Disease, and Breast Cancer.

$1000/case, $400/embryo

This may not reflect their true costs as they are a startup, but as a commercial service gives a hard datapoint: $1000 for overhead/biopsies, $400/embryo marginal cost for sequencing+analysis.

From the final AFC costs, we can see that the genetic testing makes up a large fraction of the cost. Since custom markers are not necessary and we are only looking at standard SNPs, the $1.8-5k genetic cost is a huge overestimate of the $1k the SNPs should cost now or soon. Their breakdown also implies that the embryo freezing/vitrification cost is counted as part of the PGS cost, but I don’t think this is right since one will need to store embryos regardless of whether one is doing PGS/selection (even if an embryo is going to be implanted right away in a live transfer, the other embryos need to be stored since the first one will probably fail). So the critical number here is that the embryo biopsy step costs $1000-$1500; there is probably little prospect of large price decreases here comparable to those for sequencing, and we can take it as fixed.
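To make the arithmetic concrete, here is a minimal sketch of a fixed-plus-per-embryo cost structure. The defaults are the rough figures discussed above ($1500 for the biopsy step, ~$200 per SNP genotyping); `selectionCost` is a hypothetical helper name, not part of the original analysis.

```r
# Marginal cost of embryo selection: a fixed biopsy cost plus per-embryo genotyping.
# ASSUMPTION: the defaults are rough figures ($1500 fixed, ~$200 per embryo), not quotes.
selectionCost <- function(nEmbryos, fixedCost=1500, perEmbryoCost=200) {
    fixedCost + nEmbryos * perEmbryoCost
}

# The ~4-5 genotypings expected in a typical cycle:
selectionCost(4:5)
# [1] 2300 2500

# The 2018 startup pricing quoted above ($1000/case + $400/embryo), for 5 embryos:
selectionCost(5, fixedCost=1000, perEmbryoCost=400)
# [1] 3000
```

Because the fixed component dominates at these embryo counts, halving the per-embryo genotyping price would change the total by only a few hundred dollars.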

Hence we can treat the cost of embryo selection as a fixed $1.5k cost plus number of embryos times SNP cost.

## Modeling embryo selection

IVF is a sequential probabilistic process:

1. harvest x eggs
2. fertilize them and create x embryos
3. culture the embryos to either cleavage (2-4 days) or blastocyst (5-6 days) stage; of them, y will still be alive & not grossly abnormal
4. freeze the embryos
5. optional: embryo selection using quality and PGS
6. unfreeze & implant 1 embryo; if no embryos left, return to #1 or give up
7. if no live birth, go to #6

Each step is necessary and determines input into the next step; it is a ‘leaky pipeline’ (also related to “multiple hurdle selection”), whose total yield depends heavily on the least efficient step, so outcomes might be . This has implications for cost-effectiveness and optimization, discussed later.

A simulation of this process:

## simulate a single IVF cycle (which may not yield any live birth, in which case there is no gain returnable):
simulateIVF <- function (eggMean, eggSD, polygenicScoreVariance, normalityP=0.5, vitrificationP, liveBirth) {
  # number of eggs harvested in this cycle (cannot be negative):
  eggsExtracted <- max(0, round(rnorm(n=1, mean=eggMean, sd=eggSD)))
  # embryos which are not grossly abnormal:
  normal        <- rbinom(1, eggsExtracted, prob=normalityP)
  # polygenic scores of the normal embryos (siblings have half the population variance):
  scores        <- rnorm(n=normal, mean=0, sd=sqrt(polygenicScoreVariance*0.5))
  # embryos surviving freezing & thawing:
  survived      <- Filter(function(x){rbinom(1, 1, prob=vitrificationP)}, scores)
  # implant in descending order of score until the first live birth:
  selection     <- sort(survived, decreasing=TRUE)
  if (length(selection)>0) {
    for (embryo in 1:length(selection)) {
      if (rbinom(1, 1, prob=liveBirth) == 1) {
        live <- selection[embryo]
        return(live)
      }
    }
  }
}
## simulate many cycles, keeping only those which yielded a live birth:
simulateIVFs <- function(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth, iters=100000) {
  return(unlist(replicate(iters, simulateIVF(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth)))); }

Mathematically, one could model the expectation of the first implantation with this formula: or using order statistics: (The order statistic can be estimated by numeric integration
or .) This is a lower bound on the value, though. Treating this mathematically is made challenging by the sequential nature of the procedure: implanting the maximum-scoring embryo may fail, forcing a fallback to the second-highest embryo, and so on, until a success or running out of embryos (triggering a second IVF cycle, or possibly not, depending on finances & whether the number of previous failed cycles indicates futility). Given, say, 3 embryos, the expected value of the procedure would be the sum of: the expected value of the first embryo; plus the expected value of the second embryo times the probability of the first failing to yield a birth (since if the first succeeded, one would stop there and not use the second); plus the expected value of the third times the probability of both the first & second failing to yield a live birth; plus the expected value of no live births times the probability of all failing; and so on. So it is easier to simulate. (Being able to write it as an equation would be useful if we needed to do complex optimization on it, such as if we were trying to allocate an R&D budget optimally, but realistically, there are only two variables which can be meaningfully improved—the polygenic score or scores, and the number of eggs—and it’s impossible to estimate how much R&D expenditure would increase egg count, leaving just the polygenic scores, which is easily optimized by hand or a blackbox optimizer.)

The transition probabilities can be estimated from the flows reported in papers dealing with IVF and PGD. I have used:

1. , Tan et al December 2014: 395 women, 1512 eggs successfully extracted & fertilized into blastocysts (~3.8 per woman); after genetic testing, 256+590=846 or ~56% were abnormal & could not be used, leaving 666 good ones; all were vitrified for storage during analysis and 421 of the normal ones rethawed, leaving 406 useful survivors or ~1.4 per woman; the 406 were implanted into 252 women, yielding 24+75=99 healthy live births or 24% implanted-embryo->birth rate.

Excerpts:

A total of 395 couples participated.
They were carriers of either translocation or inversion mutations, or were patients with recurrent miscarriage and/or advanced maternal age. A total of 1,512 blastocysts were biopsied on D5 after fertilization, with 1,058 blastocysts set aside for SNP array testing and 454 blastocysts for NGS testing. In the NGS cycles group, the implantation, clinical pregnancy and miscarriage rates were 52.6% (60/114), 61.3% (49/80) and 14.3% (7/49), respectively. In the SNP array cycles group, the implantation, clinical pregnancy and miscarriage rates were 47.6% (139/292), 56.7% (115/203) and 14.8% (17/115), respectively. The outcome measures of both the NGS and SNP array cycles were the same with insignificant differences. There were 150 blastocysts that underwent both NGS and SNP array analysis, of which seven blastocysts were found with inconsistent signals. All other signals obtained from NGS analysis were confirmed to be accurate by validation with qPCR. The relative copy number of mitochondrial DNA (mtDNA) for each blastocyst that underwent NGS testing was evaluated, and a significant difference was found between the copy number of mtDNA for the euploid and the chromosomally abnormal blastocysts. So far, out of 42 ongoing pregnancies, 24 babies were born in NGS cycles; all of these babies are healthy and free of any developmental problems. …The median number of normal/ balanced embryos per couple was 1.76 (range from 0 to 8)…Among the 129 couples in the NGS cycles group, 33 couples had no euploid embryos suitable for transfer; 75 couples underwent embryo transfer and the remaining 21 couples are currently still waiting for transfer. In the SNP array cycles group, 177 couples underwent embryo transfer, 66 couples had no suitable embryos for transfer, and 23 couples are currently still waiting. Of the 666 normal/balanced blastocysts, 421 blastocysts were warmed after vitrification, 406 survived (96.4% of survival rate) and were transferred in 283 cycles. 
The numbers of blastocysts transferred per cycle were 1.425 (114/80) and 1.438 (292/203) for NGS and SNP array, respectively. The proportion of transferred embryos that successfully implanted was evaluated by ultrasound 6-7 weeks after embryo transfer, indicating that 60 and 139 embryos resulted in a fetal sac, giving implantation rates of 52.6% (60/114) and 47.6% (139/292) for NGS and SNP array, respectively. Prenatal diagnosis with karyotyping of amniocentesis fluid samples did not find any fetus with chromosomal abnormalities. A total of 164 pregnancies were detected, with 129 singletons and 35 twins. The clinical pregnancy rate per transfer cycle was 61.3% (49/80) and 56.7% (115/203) for NGS and SNP array, respectively (Table 3). A total of 24 miscarriages were detected, giving rates of 14.3% (7/49) and 14.8% (17/115) in NGS and SNP array cycles, respectively …The ongoing pregnancy rates were 52.5% (42/80) and 48.3% (98/203) in NGS and SNP array cycles, respectively. Out of these pregnancies, 24 babies were delivered in 20 NGS cycles; so far, all the babies are healthy and chromosomally normal according to karyotype analysis. In the SNP array cycles group the outcome of all pregnancies went to full term and 75 healthy babies were delivered (Table 3)…NGS is with a bright prospect. A case report described the use of NGS for PGD recently [33]. Several comments for the application of NGS/MPS in PGD/PGS were published [34,35]. The cost and time of sequencing is already competitive with array tests, and the estimated reagent cost of sequencing for the detection of chromosomal abnormalities is currently less than$100.

2. Probabilities for clinical outcomes with IVF and PGS in RPL patients were obtained from a 2012 study by . This is the single largest study to date of outcomes using 24-chromosome screening by array comparative genomic hybridization in a well-defined RPL population…The Hodes-Wertz study reported on outcomes of 287 cycles of IVF with 24-chromosome PGS with a total of 2,282 embryos followed by fresh day-5 embryo transfer in RPL patients. Of the PGS cycles, 67% were biopsied on day 3, and 33% were biopsied on day 5. The average maternal age was 36.7 years (range: 21-45 years), and the mean number of prior miscarriages was 3.3 (range: 2-7). From 287 PGS cycles, 181 cycles had at least one euploid embryo and proceeded to fresh embryo transfer. There were 52 cycles with no euploid embryos for transfer, four cycles where an embryo transfer had not taken place at the time of analysis, and 51 cycles that were lost to follow-up observation. All patients with a euploid embryo proceeded to embryo transfer, with an average of 1.65 ± 0.65 (range: 1-4) embryos per transfer. Excluding the cycles lost to follow-up evaluation and the cycles without a transfer at the time of analysis, the clinical pregnancy rate per attempt was 44% (n = 102). One attempt at conception was defined as an IVF cycle and oocyte retrieval ± embryo transfer. The live-birth rate per attempt was 40% (n = 94), and the miscarriage rate per pregnancy was 7% (n = 7). Of these seven miscarriages, 57% (n = 4) occurred after detection of fetal cardiac activity (10). Information on the percentage of cycles with surplus embryos was not provided in the Hodes-Wertz study, so we drew from their database of 240 RPL patients with 118 attempts at IVF and PGS (12). The clinical pregnancy, live-birth, and clinical miscarriage rates did not statistically-significantly differ between the outcomes published in the Hodes-Wertz study (P = .89, P = .66, P = .61, respectively).
We reported that 62% of IVF cycles had at least one surplus embryo (12).

…The average cost of preconception counseling and baseline RPL workup, including parental karyotyping, maternal antiphospholipid antibody testing, and uterine cavity evaluation, was $4,377 (range: $4,000-$5,000) (16). Because this was incurred by both groups before their entry into the decision tree, it was not included as a cost input in the study. The average cost of IVF was $18,227 (range: $6,920-$27,685) (16) and includes cycle medications, oocyte retrieval, and one embryo transfer. The average cost of PGS was $4,268 (range: $3,155-$12,626) (17), and the average cost of a frozen embryo transfer was $6,395 (range: $3,155-$12,626) (13, 16). The average cost of managing a clinical miscarriage with dilation and curettage (D&C) was $1,304 (range: $517-$2,058) (18). Costs incurred in the IVF-PGS strategy include the cost of IVF, PGS, fresh embryo transfer, frozen embryo transfer, and D&C. Costs incurred in the expectant management strategy include only the cost of D&C.

17: National Infertility Association. . Accessed on May 26, 2014: “Average additional cost of PGD procedure: $3,550; Median Cost: $3,200 (Note: Medications for IVF are $3,000–$5,000 per fresh cycle on average.)”

3. “Technical Update: Preimplantation Genetic Diagnosis and Screening”, Dahdouh et al 2015: The number of diseases currently diagnosed via PGD-PCR is approximately 200 and includes some forms of inherited cancers such as retinoblastoma and the breast cancer susceptibility gene (BRCA2). 52 PGD has also been used in new applications such as HLA matching. 53,54 The ESHRE PGD consortium data analysis of the past 10 years’ experience demonstrated a clinical pregnancy rate of 22% per oocyte retrieval and 29% per embryo transfer. 55 Table 4 shows a sample of the different monogenetic diseases for which PGD was carried out between January and December 2009, according to the ESHRE data. 22 In these reports a total of 6160 cycles of IVF cycles with PGD or PGS, including PGS-SS, are presented.
Of these, 2580 (41.8%) were carried out for PGD purposes, in which 1597 cycles were performed for single-gene disorders, including HLA typing. An additional 3551 (57.6%) cycles were carried out for PGS purposes and 29 (0.5%) for PGS-SS. 22 Although the ESHRE data represent only a partial record of the PGD cases conducted worldwide, it is indicative of general trends in the field of PGD. …At least 40% to 60% of human embryos are abnormal, and that number increases to 80% in women 40 years or older. These abnormalities result in low implantation rates in embryos transferred during IVF procedures, from 30% in women < 35 years to 6% in women ≥ 40 years. 33 In a recent retrospective review of trophectoderm biopsies, aneuploidy risk was evident with increasing female age. A slightly increased prevalence was noted at younger ages, with > 40% aneuploidy in women ≤ 23 years. The risk of having no chromosomally normal blastocyst for transfer (the no-euploid embryo rate) was lowest (2-6%) in women aged 26 to 37, then rose to 33% at age 42 and reached 53% at age 44. 11 4. Age: <35yo 35-37 38-40 41-42 >42 Live birth rate 40.7 31.3 22.2 11.8 3.9 …It is common to remove between ten and thirty eggs. using non-donor eggs. (Though donor eggs are better quality and more likely to yield a birth and hence better for selection purposes) 5. The median number of eggs retrieved was 9 [inter-quartile range (IQR) 6-13; Fig. 2a] and the median number of embryos created was 5 (IQR 3-8; Fig. 2b). The overall LBR in the entire cohort was 21.3% [95% confidence interval (CI): 21.2-21.4%], with a gradual rise over the four time periods in this study (14.9% in 1991-1995, 19.8% in 1996-2000, 23.2% in 2001-2005 and 25.6% in 2006-2008). Egg retrieval appears . The SD is not given anywhere in the paper, but an SD of ~4-5 visually fits the graph and is compatible with a 6-13 IQR, and reports SDs for eggs for two groups of SDs 4.5 & 4.7 and averages of 10.5 & 9.4—closely matching the median of 9. 6. 
The most nationally representative sample for the USA is the data that fertility clinics are legally required to report to the CDC. The most recent one is the , which breaks down numbers by age and egg source: Total number of cycles : 190,773 (includes 2,655 cycle[s] using frozen eggs)…Donor eggs: 9718 fresh cycles, 10270 frozen [] …Of the 190,773 ART cycles performed in 2013 at these reporting clinics, 163,209 cycles (86%) were started with the intent to transfer at least one embryo. These 163,209 cycles resulted in 54,323 live births (deliveries of one or more living infants) and 67,996 infants. Fresh eggs <35yo 35-37 38-40 41-42 43-44 >44 cycles: 40,083 19,853 18,06 19,588 4,823 1,379 P(birth|cycle) 23.8 19.6 13.7 7.8 3.9 1.2 P(birth|transfer) 28.2 24.4 18.4 11.4 6.0 2.1 Frozen eggs <35 35-37 38-40 41-42 43-44 >44 cycles: 21,627 11,140 8,354 3,344 1,503 811 P(birth|transfer) 28.6 27.2 24.4 21.2 15.8 8.7 …The largest group of women using ART services were women younger than age 35, representing approximately 38% of all ART cycles performed in 2013. About 20% of ART cycles were performed among women aged 35-37, 19% among women aged 38-40, 11% among women aged 41-42, 7% among women aged 43-44, and 5% among women older than age 44. Figure 4 shows that, in 2013, the type of ART cycles varied by the woman’s age. The vast majority (97%) of women younger than age 35 used their own eggs (non-donor), and about 4% used donor eggs. In contrast, 38% of women aged 43-44 and 73% of women older than age 44 used donor eggs. …Outcomes of ART Cycles Using Fresh Non-donor Eggs or Embryos, by Stage, 2013: 1. 93,787 cycles started 2. 84,868 retrievals 3. 73,571 transfers 4. 33,425 pregnancies 5. 
27,406 live-birth deliveries CDC report doesn’t specify how many eggs on average are retrieved or abnormality rate by age, although we can note that ~10% of retrievals didn’t lead to any transfers (since there were 85k retrievals but only 74k transfers) which looks consistent with an overall mean & SD of 9(4.6) and 50% abnormality rate. We could also try to back out from the figures on average number of embryos per transfer, number of transfers, and number of cycles (eg 1.8 for <35yos, and 33750, so 60750 transferred embryos, as part of the 40083 cycles, indicating each cycle must have yielded at least 1.5 embryos), but that only gives a loose lower bound since there may be many left over embryos and the abnormality rate is unknown. So for an American model of <35yos (the chance of IVF success declines so drastically with age that it’s not worth considering older age brackets), we could go with a set of parameters like {9, 4.6, 0.5, 0.96, 0.28}, but it’s unclear how accurate a guess that would be. 7. Tur-Kaspa et al 2010 reports results from an Illinois fertility clinic treating cystic fibrosis carriers who were using PGD:  Parameter Value Count Percentage No. of patients (age 42 years) 74 No. of cycles for PGD for CF 104 Mean no. of IVF-PGD cycles/couple 1.4 (104/74) No. of cycles with embryo transfer (%) 94 (90.4) No. of embryos transferred 184 Mean no. of embryos transferred 1.96 (184/94) Total number of pregnancies 44 No. of miscarriages (%) 7 (15.9) No. of deliveries 37 No. of healthy babies born 49 No. of babies per delivery 1.3 No. of cycles resulting in pregnancy (%) 44/104 (42.3) No. of transfer cycles resulting in a pregnancy (%) 44/94 (46.8) Take-home baby rate per IVF-PGD cycle (%) 37/104 ——————————————————————————— Table: Table 1: Outcomes of IVF-preimplantation genetic diagnosis (PGD) cycles for cystic fibrosis (CF) (2000-2005). 
For the Tur-Kaspa et al 2010 cost-benefit analysis, the number of eggs and survival rates are not given in the paper, so it can’t be used for simulation, but the overall conditional probabilities look similar to Hodes-Wertz.

With these sets of data, we can fill in parameter values for the simulation and estimate gains. Using the Tan et al 2014 data:

1. eggs extracted per person: normal distribution, mean=3, SD=4.6 (discretized into whole numbers)
2. using previous simulation, ‘SNP test’ all eggs extracted for polygenic score
3. P=0.5 that an egg is normal
4. P=0.96 that it survives vitrification
5. P=0.24 that an implanted egg yields a birth

simulateTan <- function() { return(simulateIVFs(3, 4.6, selzam2016, 0.5, 0.96, 0.24)); }
iqTan <- mean(simulateTan()) * 15; iqTan
# [1] 0.3808377013

That is, the couples in Tan et al 2014 would have seen a ~0.4IQ increase.

The Murugappan et al 2015 cost-benefit analysis uses data from American fertility clinics reported in : 278 cycles yielding 2282 blastocysts or ~8.2 on average; 35% normal; there is no mention of losses to cryostorage, so I borrow 0.96 from Tan et al 2014; 1.65 implanted on average in 181 transfers, yielding 40% live-births. So:

simulateHodesWertz <- function() { return(simulateIVFs(8.2, 4.6, selzam2016, 0.35, 0.96, 0.40)) }
iqHW <- mean(simulateHodesWertz()) * 15; iqHW
# [1] 0.684226242

### Societal effects

One category of effects considered by Shulman & Bostrom is the non-financial social & societal effects mentioned in their Table 3, where embryo selection can “perceptibly advantage a minority” or in an extreme case, “Selected dominate ranks of elite scientists, attorneys, physicians, engineers.
Intellectual Renaissance?” This is another point which is worth going into a little more; no specific calculations are mentioned by Shulman & Bostrom, and the thin-tail-effects of normal distributions are notoriously counterintuitive, with surprisingly large effects out on the tails from small-seeming changes in means or standard deviations; for example, the legendary levels of Western Jewish overperformance despite their tiny population sizes. The effects of selection also compound over generations; for example, in the famous , a large gap in mean performance had opened up by the 2nd generation, and by the 7th, the distributions almost ceased to overlap (see figure 4 in ). Or consider the long-term Illinois corn/maize selection experiment (response to selection of the 2 lines, animated). Considering the order/tail effects for cutoffs/thresholds corresponding to admission to elite universities, for many possible combinations of embryo selection boosts/IVF uptakes/generation accumulations, embryo selection accounts for a majority or almost all of future elites.

As a general rule of thumb, ‘elite’ groups like scientists, attorneys, physicians, Ivy League students etc are highly selected for intelligence: one can comfortably estimate averages >=130 IQ (+2SD) from past IQ samples & average SAT scores & the ever-increasingly stringent admissions; and elite performance continues to increase with increasing intelligence as high as can reasonably be measured, as indicated by available data like estimates of eminent historical figures (eg ; see also Simonton in general), the SMPY longitudinal study and TIP longitudinal study (), where we might define the cut off as 160 IQ based on ’s studies of the most eminent available scientists (mean ~150-160). So to estimate an impact, one could consider a question like: given an average boost of x IQ points through embryo-selection, how much would the odds of being elite (>=130) or extremely elite (>=160) increase for the selected?
If a certain fraction of IVFers were selected, what fraction of all people above the cutoff would they make up? If there are 320 million people in the USA, then about 17m are +2SD and 43k are +4SD:

dnorm((130-100)/15) * 320000000
# [1] 17277109.28
dnorm((160-100)/15) * 320000000
# [1] 42825.67224

Similarly, in 2013, there were 3,932,181 children born in the USA, and the 2013 CDC annual IVF report says that 67,996 (1.73%) were IVF. (This 1-2% population rate of IVF will highly likely increase substantially in the future, as many countries have recorded higher use of IVF or ART in general: ; in reported percentages of 4.6% (Belgium)/5.7% (Czech Republic), 6.2% (Denmark), 4% (Estonia), 5.8% (Finland), 4.4% (Greece), 6% (Slovenia), & 4.2% (Spain); ; and . And presumably US rates will go up as the population ages & education credentialism continues.) This implies that IVFers also make up a small number of highly gifted children:

size <- function(mean, cutoff, populationSize, useFraction=1) { if(cutoff>mean) { dnorm(cutoff-mean) * populationSize * useFraction } else { (1 - dnorm(cutoff-mean)) * populationSize * useFraction }}
size(0, (60/15), 67996)
# [1] 9.099920031

So assuming IVF parents average 100IQ, then we can take the embryo selection theoretical upper bound of +9.36 (+0.624SD) corresponding to the “aggressive IVF” set of scenarios in Table 3 of Shulman & Bostrom, and ask, if 100% of IVF children were selected, how many additional people over 160 would that create?
eliteGain <- function(ivfMean, ivfGain, ivfFraction, generation, cutoff, ivfPop, genMean, genPop) {
  ivfers      <- size(ivfMean, cutoff, ivfPop, 1)
  selected    <- size(ivfMean+(ivfGain*generation), cutoff, ivfPop, ivfFraction)
  nonSelected <- size(ivfMean, cutoff, ivfPop, 1-ivfFraction)
  gain        <- (selected+nonSelected) - ivfers
  population  <- size(genMean, cutoff, genPop)
  multiplier  <- gain / population
  return(multiplier) }
eliteGain(0, (9.36/15), 1, 1, (60/15), 67996, 0, 3932181)
# [1] 0.1554096565

In this example, the +0.624SD boosts the absolute number by 82 people, representing 15.5% of children passing the cutoff; this would mean that IVF overrepresentation would be noticeable if anyone went looking for it, but would not be a major issue nor even as noticeable as Jewish achievement. We would indeed see “Substantial growth in educational attainment, income”, but we would not see much effect beyond that. Is it realistic to assume that IVF children will be distributed around a mean of 100 sans any intervention? That seems unlikely, if only due to the substantial financial cost of using IVF; however, the existing literature is inconsistent, showing both higher & lower education or IQ scores (), so perhaps the starting point really is 100. The thin-tail effects make the starting mean extremely important; Shulman & Bostrom say, “Second generation manyfold increase at right tail.” Let’s consider the second generation; with their post-selection mean IQ of 109.36, what second generation is produced in the absence of outbreeding when they use IVF selection?

eliteGain(0, (9.36/15), 1, 2, (60/15), 67996, 0, 3932181)
# [1] 1.151238772
eliteGain(0, (9.36/15), 1, 5, (60/15), 67996, 0, 3932181)
# [1] 34.98100356

Now the IVF children represent a majority. With the third generation, they reach 5x; at the fourth, 17x; at the fifth, 35x; and so on.
In practice, of course, we currently would get much less: 0.138 IQ points in the USA model, which would yield a trivial percentage increase of 0.06% or 1.6%:

eliteGain(0, (0.13808892057/15), 1, 1, (60/15), 67996, 0, 3932181)
# [1] 0.0006478714323
eliteGain((15/15), (0.13808892057/15), 1, 1, (60/15), 67996, 0, 3932181)
# [1] 0.01601047464

Table 3 considers 12 scenarios: 3 adoption fractions of the general population (100% IVFer/~0.25% general population, 10%, >90%) vs 4 average gains (4, 12, 19, 100+). The descriptions add 2 additional variables: first vs second generation, and elite vs eminent, giving 48 relevant estimates total.

scenarios <- expand.grid(c(0.025, 0.1, 0.9), c(4/15, 12/15, 19/15, 100/15), c(1,2), c(30/15, 60/15))
colnames(scenarios) <- c("Adoption.fraction", "IQ.gain", "Generation", "Eliteness")
scenarios$Gain.fraction <- round(do.call(mapply, c(function(adoptionRate, gain, generation, selectiveness) {
    eliteGain(0, gain, adoptionRate, generation, selectiveness, 3932181, 0, 3932181) }, unname(scenarios[,1:4]))),
  digits=2)
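| Adoption fraction | IQ gain | Generation | Eliteness | Gain fraction |
|---:|---:|---:|---:|---:|
| 0.025 | 4 | 1 | 130 | 0.02 |
| 0.100 | 4 | 1 | 130 | 0.06 |
| 0.900 | 4 | 1 | 130 | 0.58 |
| 0.025 | 12 | 1 | 130 | 0.06 |
| 0.100 | 12 | 1 | 130 | 0.26 |
| 0.900 | 12 | 1 | 130 | 2.34 |
| 0.025 | 19 | 1 | 130 | 0.12 |
| 0.100 | 19 | 1 | 130 | 0.46 |
| 0.900 | 19 | 1 | 130 | 4.18 |
| 0.025 | 100 | 1 | 130 | 0.44 |
| 0.100 | 100 | 1 | 130 | 1.75 |
| 0.900 | 100 | 1 | 130 | 15.77 |
| 0.025 | 4 | 2 | 130 | 0.04 |
| 0.100 | 4 | 2 | 130 | 0.15 |
| 0.900 | 4 | 2 | 130 | 1.37 |
| 0.025 | 12 | 2 | 130 | 0.15 |
| 0.100 | 12 | 2 | 130 | 0.58 |
| 0.900 | 12 | 2 | 130 | 5.24 |
| 0.025 | 19 | 2 | 130 | 0.28 |
| 0.100 | 19 | 2 | 130 | 1.11 |
| 0.900 | 19 | 2 | 130 | 10.00 |
| 0.025 | 100 | 2 | 130 | 0.44 |
| 0.100 | 100 | 2 | 130 | 1.75 |
| 0.900 | 100 | 2 | 130 | 15.77 |
| 0.025 | 4 | 1 | 160 | 0.05 |
| 0.100 | 4 | 1 | 160 | 0.18 |
| 0.900 | 4 | 1 | 160 | 1.62 |
| 0.025 | 12 | 1 | 160 | 0.42 |
| 0.100 | 12 | 1 | 160 | 1.68 |
| 0.900 | 12 | 1 | 160 | 15.13 |
| 0.025 | 19 | 1 | 160 | 1.75 |
| 0.100 | 19 | 1 | 160 | 7.01 |
| 0.900 | 19 | 1 | 160 | 63.11 |
| 0.025 | 100 | 1 | 160 | 184.65 |
| 0.100 | 100 | 1 | 160 | 738.60 |
| 0.900 | 100 | 1 | 160 | 6647.40 |
| 0.025 | 4 | 2 | 160 | 0.16 |
| 0.100 | 4 | 2 | 160 | 0.63 |
| 0.900 | 4 | 2 | 160 | 5.69 |
| 0.025 | 12 | 2 | 160 | 4.16 |
| 0.100 | 12 | 2 | 160 | 16.63 |
| 0.900 | 12 | 2 | 160 | 149.70 |
| 0.025 | 19 | 2 | 160 | 25.40 |
| 0.100 | 19 | 2 | 160 | 101.58 |
| 0.900 | 19 | 2 | 160 | 914.25 |
| 0.025 | 100 | 2 | 160 | 186.78 |
| 0.100 | 100 | 2 | 160 | 747.12 |
| 0.900 | 100 | 2 | 160 | 6724.04 |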

To help capture what might be considered important or disruptive, let’s filter down the scenarios to ones where the embryo-selected now make up an absolute majority of any elite group (a fraction >0.5):

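| Adoption fraction | IQ gain | Generation | Eliteness | Gain fraction |
|---:|---:|---:|---:|---:|
| 0.900 | 4 | 1 | 130 | 0.58 |
| 0.900 | 12 | 1 | 130 | 2.34 |
| 0.900 | 19 | 1 | 130 | 4.18 |
| 0.100 | 100 | 1 | 130 | 1.75 |
| 0.900 | 100 | 1 | 130 | 15.77 |
| 0.900 | 4 | 2 | 130 | 1.37 |
| 0.100 | 12 | 2 | 130 | 0.58 |
| 0.900 | 12 | 2 | 130 | 5.24 |
| 0.100 | 19 | 2 | 130 | 1.11 |
| 0.900 | 19 | 2 | 130 | 10.00 |
| 0.100 | 100 | 2 | 130 | 1.75 |
| 0.900 | 100 | 2 | 130 | 15.77 |
| 0.900 | 4 | 1 | 160 | 1.62 |
| 0.100 | 12 | 1 | 160 | 1.68 |
| 0.900 | 12 | 1 | 160 | 15.13 |
| 0.025 | 19 | 1 | 160 | 1.75 |
| 0.100 | 19 | 1 | 160 | 7.01 |
| 0.900 | 19 | 1 | 160 | 63.11 |
| 0.025 | 100 | 1 | 160 | 184.65 |
| 0.100 | 100 | 1 | 160 | 738.60 |
| 0.900 | 100 | 1 | 160 | 6647.40 |
| 0.100 | 4 | 2 | 160 | 0.63 |
| 0.900 | 4 | 2 | 160 | 5.69 |
| 0.025 | 12 | 2 | 160 | 4.16 |
| 0.100 | 12 | 2 | 160 | 16.63 |
| 0.900 | 12 | 2 | 160 | 149.70 |
| 0.025 | 19 | 2 | 160 | 25.40 |
| 0.100 | 19 | 2 | 160 | 101.58 |
| 0.900 | 19 | 2 | 160 | 914.25 |
| 0.025 | 100 | 2 | 160 | 186.78 |
| 0.100 | 100 | 2 | 160 | 747.12 |
| 0.900 | 100 | 2 | 160 | 6724.04 |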

For many of the scenarios, the impact is not blatant until a second generation builds on the first, but the cumulative effect does show up: one of the weakest scenarios, +4 IQ/10% adoption, can still be seen at the second generation because effects are easier to spot at the most elite levels; in another example, a boost of 12 points is noticeable in a single generation with as little as 10% adoption in the general population. A boost of 19 points is visible in a fair number of scenarios, and a boost of 100 is visible at almost any adoption rate/generation/elite-level. (Indeed, a boost of 100 results in almost meaninglessly large numbers under many scenarios; it’s difficult to imagine a society with 100x as many geniuses running around, so it’s even more difficult to imagine what it would mean for there to be 6,724x as many, other than that many things will start changing extremely rapidly in unpredictable ways.)
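The scenario grid and the >0.5 filter can be reproduced in a few lines. This is a minimal sketch, not necessarily the code used to build the tables: `size()` and `eliteGain()` are copied from the text above so that the block runs on its own, while the named-column `expand.grid()` call, the direct `mapply()` call, and the `majority` variable are assumptions made for readability. When the cutoff is above the shifted mean and the two population sizes are equal, the multiplier reduces to `f * (dnorm(cutoff - gain*generation) / dnorm(cutoff) - 1)`, which the last line checks for one row.

```r
# size() and eliteGain() as defined in the text (copied here so the sketch is self-contained)
size <- function(mean, cutoff, populationSize, useFraction=1) {
  if (cutoff > mean) { dnorm(cutoff-mean) * populationSize * useFraction }
  else               { (1 - dnorm(cutoff-mean)) * populationSize * useFraction } }
eliteGain <- function(ivfMean, ivfGain, ivfFraction, generation, cutoff, ivfPop, genMean, genPop) {
  ivfers      <- size(ivfMean, cutoff, ivfPop, 1)
  selected    <- size(ivfMean + ivfGain*generation, cutoff, ivfPop, ivfFraction)
  nonSelected <- size(ivfMean, cutoff, ivfPop, 1-ivfFraction)
  ((selected + nonSelected) - ivfers) / size(genMean, cutoff, genPop) }

# 3 adoption fractions x 4 gains x 2 generations x 2 cutoffs = 48 scenarios (all in SD units)
scenarios <- expand.grid(Adoption.fraction = c(0.025, 0.1, 0.9),
                         IQ.gain           = c(4, 12, 19, 100)/15,
                         Generation        = c(1, 2),
                         Eliteness         = c(30, 60)/15)
scenarios$Gain.fraction <- mapply(
  function(f, g, gen, cutoff) eliteGain(0, g, f, gen, cutoff, 3932181, 0, 3932181),
  scenarios$Adoption.fraction, scenarios$IQ.gain, scenarios$Generation, scenarios$Eliteness)

# keep only the scenarios where the gain fraction exceeds 0.5
majority <- scenarios[scenarios$Gain.fraction > 0.5, ]
nrow(majority)

# closed form for one row (adoption 0.9, +4 IQ, generation 1, cutoff 130)
closedForm <- 0.9 * (dnorm(2 - 4/15) / dnorm(2) - 1)
```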

The tables do not attempt to give specific deadlines in years for when some of the effects will manifest, but we could try to extrapolate based on when eminent figures and made their first marks.

have become grandmasters at very early ages, such as ’s 12.6yo record, with (as of 2016) 24 other chess prodigies reaching grandmaster levels before age 15; the record age has dropped rapidly over time which is often credited to computers & the Internet unlocking chess databases & engines to intensively train against, providing a global pool of opponents 24/7, and intensive tutoring and training programs. is probably the most famous child prodigy, credited with feats such as reading by age 2, writing mathematical papers by age 12 and so on, but he abandoned academia and never produced any major accomplishments; his acquaintance and fellow child prodigy , on the other hand, produced his first major work at age 17, at age 19; physicists in the early quantum era were noted for youth, with Bragg/Heisenberg/Pauli/Dirac producing their Nobel prize-winning results at ages 22/23/25/26 (respectively). In mathematics, made major breakthroughs around age 18, ’s first modal logic result was age 17, likely began making major findings around age 16 and continued up to his youthful death at age 32, and began publishing age 15; young students making findings is such a trope that the Fields Medal has an age-limit of 39yo for awardees (who thus must’ve made their discoveries much earlier). Cliometrics and the age of scientists and their life-cycles of productivity across time and fields have been studied by Simonton, Jones, & Murray’s ; we can also compare to the SMPY/TIP samples where most took normal schooling paths. The peak age for productivity, and average age for work that wins major prizes differs a great deal by field—physics and mathematics are generally younger than fields like medicine or biology. 
This suggests that different fields place different demands on Gf vs Gc: a field like mathematics dealing in pure abstractions will stress deep thought & fluid intelligence (peaking in the early 20s); while a field like medicine will require a wide variety of experiences and factual knowledge and less raw intelligence, and so may require decades before one can make a major contribution. (In literature, it’s often been noted that lyric poets seem to peak young while novelists may continue improving throughout their lifetimes.)

So if we consider scenarios of intelligence enhancement up to 2 or 3 SDs (145), then we can expect that there may be a few initial results within 15 years heavily biased towards STEM fields with strong Internet presences and traditions of openness in papers/software/data (such as machine learning), followed by a gradual increase in number of results as the cohort begins reaching their 20s and 30s and their adult careers and a broadening across fields such as medicine and the humanities. While math and technology results can have outsized impact these days, in a 2-3SD scenario, the total number of 2-3SD researchers will not increase by more than a factor, and so the expected impact will be similar to what we already experience in the pace of technological development—quick, but not unmanageable.

In the case of >=4SDs, things are a little different. The most comparable case is Sidis, who as mentioned was writing papers by age 12 after 10 years of reading; in an IES scenario, each member of the cohort might be far beyond Sidis, and so the entire cohort will likely reach the research frontier and begin making contributions before age 12—although there must be limits on how fast a human child can develop mentally, for raw thermodynamic reasons like calories consumed if nothing else, there is no good reason to think that Sidis’s bound of 12 years is tight, especially given the modern context and the possibilities for accelerated education programs. (With such advantages, there may also be much larger cohorts as parents decide the advantages are so compelling that they want them for their children and are willing to undergo the costs.)

## Cost-benefit

As written, the IVF simulator cannot deliver a cost-benefit analysis, because the costs depend on its internal state: how many good embryos were created, for example, or the fact that a cycle ending in no live births still incurs costs. It also needs to report the marginal gain for each case, now that we’re going case by case. So it must be augmented:

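simulateIVFCB <- function (eggMean, eggSD, polygenicScoreVariance, normalityP=0.5, vitrificationP, liveBirth, fixedCost, embryoCost, traitValue) {
  # number of eggs extracted: discretized normal, floored at 0
  eggsExtracted <- max(0, round(rnorm(n=1, mean=eggMean, sd=eggSD)))

  # embryos passing the chromosomal-abnormality screen
  normal        <- rbinom(1, eggsExtracted, prob=normalityP)

  # selection costs are incurred whether or not a live birth results
  totalCost     <- fixedCost + normal * embryoCost
  # polygenic scores of the sibling embryos (variance halved within a family)
  scores        <- rnorm(n=normal, mean=0, sd=sqrt(polygenicScoreVariance*0.5))

  # embryos surviving vitrification
  survived      <- Filter(function(x){rbinom(1, 1, prob=vitrificationP)}, scores)

  # ascending sort: the loop below overwrites `live` on every success, so the
  # final value is the highest-scoring embryo that yields a live birth
  selection <- sort(survived, decreasing=FALSE)
  live <- 0
  gain <- 0

  if (length(selection)>0) {
    for (embryo in 1:length(selection)) {
      if (rbinom(1, 1, prob=liveBirth) == 1) {
        live <- selection[embryo]
      }
    }
    # gain relative to the average embryo, floored at 0
    gain <- max(0, live - mean(selection))
  }
  return(data.frame(Trait.SD=gain, Cost=totalCost, Net=(traitValue*gain - totalCost)))  }

library(plyr)
# run `iters` independent simulated cycles and stack the results into one data frame
simulateIVFCBs <- function(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth, fixedCost, embryoCost, traitValue, iters=20000) {
  ldply(replicate(simplify=FALSE, iters, simulateIVFCB(eggMean, eggSD, polygenicScoreVariance, normalityP, vitrificationP, liveBirth, fixedCost, embryoCost, traitValue))) }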
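The implantation loop encodes the sequential fallback: because the scores are walked in ascending order and `live` is overwritten on each success, the embryo finally recorded is the highest-scoring one that implants, which is the same outcome as transferring from the top down and stopping at the first success. The sketch below illustrates this with fixed, made-up inputs and also gives the rank probabilities under the assumption of an independent, identical per-embryo birth probability `p`; `rankBirthProbs` and `pickLive` are hypothetical helper names, not functions from the original code.

```r
# Hypothetical helper: probability that the k-th best of n embryos is the one born,
# assuming transfers proceed from best to worst until the first success and each
# embryo independently yields a birth with probability p.
rankBirthProbs <- function(n, p) { p * (1 - p)^(seq_len(n) - 1) }

# Hypothetical helper mirroring the loop in simulateIVFCB, but with the random
# implantation outcomes passed in explicitly so the result is deterministic.
pickLive <- function(scores, successes) {
  live <- 0
  for (i in order(scores)) {               # ascending, as in sort(decreasing=FALSE)
    if (successes[i]) { live <- scores[i] }
  }
  live }

scores    <- c(0.3, -0.1, 0.8, 0.5)        # illustrative polygenic scores (SD units)
successes <- c(TRUE, FALSE, FALSE, TRUE)   # illustrative outcomes: the best embryo fails
pickLive(scores, successes)                # 0.5, the second-best embryo

probs <- rankBirthProbs(3, 0.24)           # P(1st born), P(2nd born), P(3rd born)
total <- sum(probs) + (1 - 0.24)^3         # adding P(no birth) should give 1
```

Written this way, the expected gain for a fixed number of embryos is the sum of the expected k-th order statistic weighted by `rankBirthProbs`, which is the sum the simulation approximates.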

Now we have all our parameters set:

1. IQ’s value per point or per SD (multiply by 15)
2. The fixed cost of selection is $1500
3. per-embryo cost of selection is $200
4. and the relevant probabilities have been defined already
iqLow <- 3270*15; iqHigh <- 16151*15
## Tan:
summary(simulateIVFCBs(3, 4.6, selzam2016, 0.5, 0.96, 0.24, 1500, 200, iqLow))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   :-3900.0000
# 1st Qu.:0.00000000   1st Qu.:1500.00   1st Qu.:-1700.0000
# Median :0.00000000   Median :1700.00   Median :-1500.0000
# Mean   :0.02854686   Mean   :1873.05   Mean   : -472.8266
# 3rd Qu.:0.03149430   3rd Qu.:2100.00   3rd Qu.: -579.1553
# Max.   :0.42872383   Max.   :4300.00   Max.   :19076.2182
summary(simulateIVFCBs(3, 4.6, selzam2016, 0.5, 0.96, 0.24, 1500, 200, iqHigh))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   : -4100.000
# 1st Qu.:0.00000000   1st Qu.:1500.00   1st Qu.: -1700.000
# Median :0.00000000   Median :1700.00   Median : -1500.000
# Mean   :0.02847819   Mean   :1873.08   Mean   :  5026.188
# 3rd Qu.:0.03005473   3rd Qu.:2100.00   3rd Qu.:  5143.879
# Max.   :0.48532430   Max.   :4100.00   Max.   :115677.092

## Hodes-Wertz:
summary(simulateIVFCBs(8.2, 4.6, selzam2016, 0.35, 0.96, 0.40, 1500, 200, iqLow))
#    Trait.SD                Cost              Net
# Min.   :0.000000000   Min.   :1500.00   Min.   :-4100.0000
# 1st Qu.:0.000000000   1st Qu.:1700.00   1st Qu.:-1900.0000
# Median :0.007840085   Median :2100.00   Median :-1500.0000
# Mean   :0.051678465   Mean   :2079.25   Mean   :  455.5787
# 3rd Qu.:0.090090594   3rd Qu.:2300.00   3rd Qu.: 2168.2666
# Max.   :0.463198015   Max.   :4100.00   Max.   :21019.8626
summary(simulateIVFCBs(8.2, 4.6, selzam2016, 0.35, 0.96, 0.40, 1500, 200, iqHigh))
#    Trait.SD                Cost              Net
# Min.   :0.000000000   Min.   :1500.00   Min.   : -3700.0000
# 1st Qu.:0.000000000   1st Qu.:1700.00   1st Qu.: -1700.0000
# Median :0.006228574   Median :2100.00   Median :  -650.2792
# Mean   :0.050884913   Mean   :2083.41   Mean   : 10244.2234
# 3rd Qu.:0.088152844   3rd Qu.:2300.00   3rd Qu.: 19048.4272
# Max.   :0.486235107   Max.   :4100.00   Max.   :114497.7483
## USA, youngest:
summary(simulateIVFCBs(9, 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqLow))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   :-3900.0000
# 1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.:-2045.5047
# Median :0.00000000   Median :1900.00   Median :-1500.0000
# Mean   :0.03360950   Mean   :2037.22   Mean   : -388.6739
# 3rd Qu.:0.05023528   3rd Qu.:2300.00   3rd Qu.:  287.3619
# Max.   :0.52294123   Max.   :3900.00   Max.   :23950.2672
summary(simulateIVFCBs(9, 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#    Trait.SD               Cost              Net
# Min.   :0.00000000   Min.   :1500.00   Min.   : -3900.000
# 1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.: -1900.000
# Median :0.00000000   Median :1900.00   Median : -1500.000
# Mean   :0.03389909   Mean   :2044.75   Mean   :  6167.812
# 3rd Qu.:0.05115755   3rd Qu.:2300.00   3rd Qu.: 10224.781
# Max.   :0.45364794   Max.   :4100.00   Max.   :108203.019

In general, embryo selection as of January 2016 is just barely profitable or somewhat unprofitable in each group using the lowest estimate of IQ’s value; it is always profitable on average with the highest estimate.
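A quick way to read these summaries is to ask what gain would be needed to break even. The sketch below assumes only the cost structure and the two valuations already given above ($1500 fixed, $200 per tested embryo, $3270 or $16151 per IQ point); `breakEvenIQ` is a hypothetical helper, not part of the original analysis. For example, at the Tan mean cost of about $1873, the low valuation requires roughly 1873/3270 ≈ 0.57 IQ points, against a simulated mean gain of about 0.43 points (0.0285 SD × 15), which matches the negative mean net value.

```r
# Hypothetical helper: IQ-point gain at which selection pays for itself,
# given a fixed cost plus a per-embryo testing cost and a dollar value per IQ point.
breakEvenIQ <- function(nEmbryos, valuePerPoint, fixedCost=1500, embryoCost=200) {
  (fixedCost + nEmbryos * embryoCost) / valuePerPoint }

lowBreakEven  <- breakEvenIQ(5, 3270)    # 5 tested embryos, lowest value of an IQ point
highBreakEven <- breakEvenIQ(5, 16151)   # 5 tested embryos, highest value of an IQ point

# break-even at the simulated mean cost of the Tan scenario (cost taken from the summary above)
tanBreakEven <- 1873.05 / 3270
```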

### Value of Information

To get an idea of the value of further research into improving the polygenic score or optimizing other parts of the procedure, we can look at the overall population gains in the USA if it was adopted by all potential users.

#### Public interest in selection

How many people can we expect to use embryo selection as it becomes available?

My belief is that total uptake will be fairly modest as a fraction of the population. A large fraction of the population expresses hostility towards any new fertility-related technology whatsoever, and the people open to the possibility will be deterred by the necessity of advanced family planning, the large financial cost of IVF, and the fact that the IVF process is lengthy and painful. I think that prospective mothers will not undergo it unless the gains are enormous: the difference between having kids or never having kids, or having a normal kid or one who will die young of a genetic disease. A fraction of an IQ point, or even a few points, is not going to cut it. (Perhaps boosts around 20 IQ points, a level with dramatic and visible effects on educational outcomes, would be enough?)

We can see this unwillingness partially expressed in long-standing trends against the wide use of sperm & egg donation. As Ridley points out, a prospective mother could easily increase traits of her children by eugenic selection of sperm donors, such as eminent scientists, above and beyond the relatively unstringent screening done by current sperm banks and the selectivity of sperm buyers:

…we now know from 40 years of experience that without coercion there is little or no demand for genetic enhancement. People generally don’t want paragon babies; they want healthy ones that are like them. At the time test-tube babies were first conceived in the 1970s, many people feared in-vitro fertilization would lead to people buying sperm and eggs off celebrities, geniuses, models and athletes. In fact, the demand for such things is negligible; people wanted to use the new technology to cure infertility—to have their own babies, not other people’s. It is a persistent misconception shared among clever people to assume that everybody wants clever children.

Ignoring that celebrities, models, and athletes are often highly successful sexually (which can be seen as a ‘donation’ of sorts), this sort of thing was in fact done by the ; but despite (as expected from selecting for highly intelligent donors), it had a troubled 29-year run (primarily due to a severe donor shortage17) and has no explicit successors.18

So that largely limits the market for embryo selection to those who would already use it: those who must use it.

Will they use it? Ridley’s argument doesn’t prove that they won’t, because the use of sperm/egg donors comes at the cost of reducing relatedness. Non-use of “celebrities, geniuses, models, and athletes” merely shows that the perceived benefits do not outweigh the costs; it doesn’t tell us what the benefits or costs are. And the cost of reducing relatedness is a severe one—a normal fertile pair of parents will no more be inclined to use a sperm or egg donor (and which one, exactly? who chooses?) than they would be to adopt, and they would be willing to extract sperm from a dead man just for the relatedness.19 A more relevant situation would be how parents act in the infertility situation where avoiding reduced relatedness is impossible.

In that situation, parents are notoriously eugenic in their preferences, demanding of sperm or egg banks that the donor be healthy, well-educated (at the Ivy League, of course, where egg donation is regularly advertised), have particular hair & eye colors (using sperm/eggs exported from Scandinavia, if necessary), be tall (men) and young (), and free of any mental illnesses. This pervasive selection works; draws on a donor sibling registry, documenting selection in favor of taller sperm donors, and, as predicted by the breeder’s equation, offspring were taller by 1.23 inches.20 Should parents discover that a sperm donor was actually autistic or schizophrenic, allegations of fraud & “” lawsuits will immediately begin flying, regardless of whether those parents would explicitly acknowledge that most human traits are highly heritable and embryo selection was possible. The practical willingness of parents to make eugenic choices based on donor profiles suggests that advertised correctly, embryo selection could become standard. (For example, given the pervasive Puritanical bias in health towards preventing illness instead of increasing health, embryo selection for intelligence or height can be framed as reducing the risk of developmental delays or shortness; which it would.) Reportedly as of 2016, PGD for hair and eye color is already quietly being offered to parents and accepted, and mentions are made of the potential for selection on other traits.

More drastically, in cases of screening for severe genetic disorders by testing potential carrier parents and fetuses, parents in practice are willing to make use of screening (if they know about it) and use PGD or selective abortions in anywhere up to 95-100% of cases (depending on disease & sample) in diseases such as (eg ), (eg ), (eg , ), (eg , , Hale et al 2008, , ), and in general (eg , ). This willingness is enough to noticeably affect population levels of these disorders (particularly Down’s syndrome, which has dropped dramatically in the USA despite an aging population that should be increasing it). The willingness to use PGD or abort rises with the severity of the disorder, true, but here again there are extenuating factors: parents considerably underestimate their willingness to use PGD/abortion before diagnosis compared to after they are actually diagnosed, and using IVF just for PGD or aborting a pregnancy are expensive & highly undesirable steps to take; so the rates being so high regardless suggest that in other scenarios (like a couple using IVF for fertility reasons), willingness may be high (and higher than people think before being offered the option). Still we can’t underestimate the strength of the desire for a child genetically related to oneself: willingness to use techniques like PGD is limited and far from absolute. 
The number of people who are carriers of a terminal dominant genetic disease like (which has a reliable cheap universally available test) who will deliberately not test a fetus or use PGD, or will choose to bear a fetus which has already tested positive, is strikingly high: reports that carriers had only limited patience for PNG testing and if the first fetus was successful, 20% did not bother testing their second pregnancy, and if not, 13% did not test their second, and of those who tested twice with carriers, 3 of 5 did no further testing; , a followup study finds that of 13 couples who decided in advance that they would abort a fetus who was a carrier, 0 went through with it.

Time will tell whether embryo selection becomes anything more than an exotic novelty, but it looks as though when relatedness is not a cost, parents will tend to accept it. This suggests that Ridley’s argument is incorrect when extended to embryo selection/editing; people simply want to both have and eat their cake, and as embryo selection/editing entail little or no loss of relatedness, they are not comparable to sperm/egg donation.

Hence, I suggest the most appropriate target market is simply the total number of IVF users, and not the much smaller number of egg/sperm donation users.

#### VoI for USA IVF population

Using the high estimate of an average gain of $6230, and noting that there were 67996 IVF babies in 2013, that suggests an annual gain of up to $423m. What is the net present value of that annually? Discounted at 5%, it’d be $8.6b. (Why a 5% discount rate? This is the highest discount rate I’ve seen used in health economics; more typical are discount rates like NICE’s 3.5%, which would yield a much larger NPV.)

We might also ask: as an upper bound, in the realistic USA IVF model, how much would a perfect SNP polygenic score be worth?

summary(simulateIVFCBs(9, 4.6, 0.33, 0.3, 0.90, 10.8/100, 1500, 200, iqLow))
#    Trait.SD              Cost              Net
# Min.   :0.0000000   Min.   :1500.00   Min.   :-3700.000
# 1st Qu.:0.0000000   1st Qu.:1700.00   1st Qu.:-2100.000
# Median :0.0000000   Median :1900.00   Median :-1500.000
# Mean   :0.1037614   Mean   :2042.24   Mean   : 3047.259
# 3rd Qu.:0.1562492   3rd Qu.:2300.00   3rd Qu.: 5516.869
# Max.   :1.4293926   Max.   :3900.00   Max.   :68411.709
summary(simulateIVFCBs(9, 4.6, 0.33, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#    Trait.SD              Cost             Net
# Min.   :0.0000000   Min.   :1500.0   Min.   : -4100.00
# 1st Qu.:0.0000000   1st Qu.:1700.0   1st Qu.: -1900.00
# Median :0.0000000   Median :1900.0   Median : -1500.00
# Mean   :0.1030492   Mean   :2037.6   Mean   : 22927.61
# 3rd Qu.:0.1530295   3rd Qu.:2300.0   3rd Qu.: 34652.62
# Max.   :1.3798166   Max.   :4100.0   Max.   :331981.26

ivfBirths <- 67996; discount <- 0.05
current <- 6230; perfect <- 23650
(ivfBirths * perfect)/(log(1+discount)) - (ivfBirths * current)/(log(1+discount))
# [1] 24277235795

Increasing the polygenic score to its maximum of 33% increases the profit by 5x. This increase, over the number of annual IVF births, gives a net present expected value of perfect information (EVPI) for a perfect score of something like $24b.

How much would it cost to gain perfect information? Hsu 2014 argues that a sample around 1 million would suffice to reach the GCTA upper bound using a particular algorithm; the largest usable21 sample I know of, SSGAC, is around n=300k, leaving 700k to go; with SNPs costing ~$200, that implies that it would cost $0.14b for perfect SNP information. Hence, the expected value of information would then be ~$26.15b and safely profitable. From that, we could also estimate the expected value of sample information (EVSI): if the 700k SNPs would be worth that much, then on average22 each additional datapoint is worth $37.6k.

Aside from the Hsu 2014 estimate, we can use a formula from a model in the Rietveld et al 2013 supplementary materials (pg22-23), where they offer a population genetics-based approximation of how much variance a given sample size & heritability will explain:

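1. $R^2 = \frac{\frac{N}{M} \cdot h^4}{\frac{N}{M} \cdot h^2 + 1}$; they state that $\frac{N}{M} \approx 1.47$, so $\frac{100000}{M} \approx 1.47$ or M = 67865.
2. For education (the phenotype variable targeted by the main GWAS, serving as a proxy for intelligence), they estimate h2=0.2, or h=0.447 (h2 here being the heritability capturable by their SNP arrays, so equivalent to $h^2_{SNP}$), so for their sample size of 100000, they would expect to explain $\frac{\frac{100000}{67865} \cdot 0.2^2}{\frac{100000}{67865} \cdot 0.2 + 1} = 0.045$ or 4.5% of variance while they got 2-3%, suggesting over-estimation.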
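As a quick numerical check of the two list items above, here is a minimal sketch in R. The function name `varianceExplained` is my own placeholder (it is the same expression as the `gwasSizeToVariance` function used in this section), and M is simply the constant from item 1.

```r
# Rietveld et al 2013 approximation as given in the list above:
#   R^2 = (N/M * h2^2) / (N/M * h2 + 1)
# 'varianceExplained' is a hypothetical name used only for this check.
varianceExplained <- function(N, h2, M = 67865) {
  ratio <- N / M
  (ratio * h2^2) / (ratio * h2 + 1)
}

ratioEducation <- 100000 / 67865                   # N/M for the n=100000 education GWAS
expectedR2     <- varianceExplained(100000, h2 = 0.2)
ratioEducation   # about 1.47
expectedR2       # about 0.045, ie. the 4.5% quoted in item 2
```

The gap between this ~4.5% prediction and the 2-3% actually realized is the over-estimation mentioned in item 2.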

Using this equation we can work out changes in variance explained with changes in sample sizes, and thus the value of an additional datapoint. For intelligence, the GCTA estimate is $h^2_{SNP} = 0.33$; Rietveld et al 2013 realized a variance explained of 0.025, which, when we look for an N which yields 0.025, is equivalent to n=17000 intelligence-phenotype samples, and so we need ~6x more education-phenotype samples (100000 vs 17000) to reach the same efficacy in predicting intelligence. We can then ask how much variance is explained by a larger sample and how much that is worth over the annual IVF headcount. Since selection is not profitable under the low IQ estimate and 1 more datapoint will not make it profitable, the EVSI of another education datapoint must be negative and is not worth estimating, so we use the high estimate instead, asking how much an increase of, say, 1000 datapoints is worth on average:

gwasSizeToVariance <- function(N, h2) { ((N / 67865) * h2^2) / ((N/67865) * h2 + 1) }
sampleIncrease <- 1000
original     <- gwasSizeToVariance(17000, 0.33)
originalplus <- gwasSizeToVariance(17000+sampleIncrease, 0.33)
originalGain     <- mean(simulateIVFCBs(9, 4.6, original, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh)$Net)
originalplusGain <- mean(simulateIVFCBs(9, 4.6, originalplus, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh)$Net)
originalGain; originalplusGain
((((originalplusGain - originalGain) * ivfBirths) / log(1+discount)) / sampleIncrease) / 6
# [1] 71716.90116

$71k is within an order of magnitude of the Hsu 2014 extrapolation, so reasonable given all the approximations here. Going back to the lowest IQ value estimate, in the US population estimate, embryo selection only reaches break-even once the variance explained increases by a factor of 2.1 to 5.25%. Boosting it by 2.1x (to 0.0525) turns out to require n=40000 or 2.35x, suggesting that another Rietveld et al 2013-style education GWAS would be adequate once it reached n>237,300. After that sample size has been exceeded, EVSI will then be closer to $10k.
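The n=17000 and n=40000 figures can be recovered without trial and error by inverting the variance formula algebraically. This is a sketch under the same approximation; `sampleSizeFor` is a hypothetical helper of mine, obtained by solving $R^2 = \frac{x h^4}{x h^2 + 1}$ for $x = N/M$, which gives $x = \frac{R^2}{h^2 (h^2 - R^2)}$.

```r
M <- 67865  # constant from the Rietveld et al 2013 approximation

# Forward formula (same expression as gwasSizeToVariance in this section):
varianceExplained <- function(N, h2) { ((N / M) * h2^2) / ((N / M) * h2 + 1) }

# Hypothetical inverse: the sample size N needed to explain R2 of variance,
# given SNP heritability h2 (only meaningful for R2 < h2).
sampleSizeFor <- function(R2, h2) { M * R2 / (h2 * (h2 - R2)) }

nCurrent   <- sampleSizeFor(0.025,  0.33)  # roughly 17,000
nBreakeven <- sampleSizeFor(0.0525, 0.33)  # roughly 39,000, near the n=40000 quoted
nCurrent; nBreakeven
nBreakeven / nCurrent                      # the required multiple, a bit above 2.3x
```

The small differences from the rounded figures in the text come from rounding 0.0252 to 0.025 before inverting.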

## Improvements

### Overview of Selection Improvements

There are many possible ways to improve selection. As selection boils down to simply taking the maximum of samples from a normal distribution, at a high level there are only 3 parameters: the number of samples from a normal distribution, the variance of that normal distribution, and its mean. There are many things which affect each of those variables and each of these parameters influences the final gain, but that’s the ultimate abstraction. To help keep them straight, one way I find helpful is to break up possible improvements into those 3 categories, which we could ask as: what variables are varying, how much are they varying, and how can we increase the mean?

1. what variables vary?

• multiple selection: selecting on the weighted sum of many variables simultaneously; the more variables, the closer the index approaches the true global latent value of a sample

• variable measurement: binary/dichotomous variables throw away information, while continuous variables are more informative and reflect outcomes better.

Schizophrenia, for example, may typically be described as a binary variable to be modeled by a liability threshold model, which has the implication that returns diminish especially fast in reducing schizophrenia genetic burden, but there is measurement error/disagreement about whether a person should be diagnosed as schizophrenic and someone who doesn’t have it yet may develop it later, and there is evidence that schizophrenia genetic burden has effects in non-cases as well like increased disordered thinking or lowered IQ. This affects both the initial construction of the SNP heritability/PGS, and the estimate of the value of changing the PGS.

• rare vs common variants: omitting rare variants will naturally restrict how useful selection can be; you can’t select on variance in what you can’t see. (SNPs are only a temporary phase.) The rare variants don’t necessarily need to be known with high confidence; selection could be for fewer or less-harmful-looking rare variants, as most rare variants are either neutral or harmful.

2. how much do they vary?

• better PGSes:

• more data: larger n in GWASes, whole genomes rather than only SNPs, more accurate detailed phenotype data to predict
• better analysis: better regression methods, better priors (based on biological data or just using informative distributions), more imputation, more correlated traits & latent traits hierarchically related, more exploitation of population structure to estimate away environmental effects & detect rare variants which may be unique to families/lineages & indirect genetic effects rather than over-controlling population structure/indirect effects away along with part of the signal
• larger effective n to select from:

• safer egg harvesting methods which can increase the yields
• reducing loss in the IVF pipeline by improvements to implantation/live-birth rate
• massive embryo selection: egg manufacturing via immature egg harvested from ovarian biopsies, or gametogenesis (somatic/stem cells→egg)
• more variance:

• directed mutagenesis
• increasing chromosome recombination rate?
• splitting up or recombining chromosomes or combining chromosomes
• create only male embryos (to exploit greater variance in outcomes from the X/Y chromosome pair)
3. how to increase the mean?

• multi-stage selection:

• parental selection
• chromosome selection
• gametic selection
• iterated embryo selection
• gene editing, chromosome or genome synthesis
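The three levers in the outline above (number of samples, variance, mean) can be illustrated with a small sketch. `expectedMax` is a hypothetical stand-in of mine for the `exactMax` helper used elsewhere in this essay: it computes the expected maximum of n independent normal draws by summing x times the density of the maximum, $n \phi(x) \Phi(x)^{n-1}$, over a fine grid.

```r
# Hypothetical helper (not the essay's exactMax): expected maximum of n
# independent Normal(mean, sd) draws, via a grid sum over the standard normal.
expectedMax <- function(n, mean = 0, sd = 1) {
  step <- 0.001
  z <- seq(-10, 10, by = step)
  mean + sd * sum(z * n * dnorm(z) * pnorm(z)^(n - 1)) * step
}

baseline    <- expectedMax(5)            # about 1.163 SD, matching exactMax(5)
moreSamples <- expectedMax(10)           # lever 1: more samples, diminishing returns
moreSpread  <- expectedMax(5, sd = 2)    # lever 2: more variance scales the gain linearly
higherMean  <- expectedMax(5, mean = 1)  # lever 3: shifting the mean adds directly
c(baseline, moreSamples, moreSpread, higherMean)
```

Doubling n from 5 to 10 adds much less than doubling the SD does, which is why so many of the listed improvements target variance or the mean rather than raw embryo count.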

### Limiting step: eggs or scores?

Embryo selection gains can be optimized in a number of ways: harvesting more eggs, having more eggs be normal & successfully fertilized, reducing the cost of SNPing or increasing the predictive power of the polygenic scores, and better implantation success. However, the “leaky pipeline” nature of embryo selection means that optimization may be counterintuitive (akin to similar problems in drug development).

There’s no clear way to improve egg quality or implant better, and the cost of SNPs is already dropping as fast as anyone could wish for, which leaves just improving the polygenic scores and harvesting more eggs. Improving the polygenic scores is addressed in the previous Value of Information section and turns out to be doable and profitable but requires a large investment by institutions which may not be interested in researching the matter further. Further, better polygenic scores make relatively little difference when the number of embryos to select from is small, as it currently is in IVF due to the small number of harvested eggs & continuous losses in the IVF pipeline: it is not helpful to increase the probability of selecting the best embryo out of 3 by just a few percentage points when that embryo will probably not successfully be born and when it is only a few IQ points above average in the first place.

That leaves egg harvesting; this is limited by each woman’s idiosyncratic biology, and also by safety issues, and we can’t expect much beyond the median 9 eggs. There is, however, one oft-mentioned possibility for getting many more eggs: coax stem cells into using their pluripotency to develop into eggs, possibly hundreds or thousands of viable eggs. (There is another possible alternative, : surgically extracting ovarian tissue, vitrifying, and at—a potentially much—later date, rewarming & extracting eggs directly from the follicles. It’s a much more serious procedure and it’s unclear how many eggs it could yield.) This stem cell method is reportedly being developed23 and if successful, would enable both powerful embryo selection and also be a major step towards “iterated embryo selection” (see that section). We can call an embryo selection process which uses not harvested eggs but grown eggs in large quantities “massive embryo selection” to keep in mind the major difference—quantity is a quality all its own.

How much would getting scores or hundreds of eggs help, and how does the gain scale? Since returns diminish, and we already know that under the low value of IQ embryo selection is not profitable, it follows that no larger number of eggs will be profitable either; so like with EVSI, we look at the high value’s upper bound if we could choose an arbitrary number of eggs:

gainByEggcount <- sapply(1:300, function(egg) { mean(simulateIVFCBs(egg, 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh)$Net) })
max(gainByEggcount); which.max(gainByEggcount)
# [1] 26657.1117
# [1] 281
plot(1:300, gainByEggcount, xlab="Average number of eggs available", ylab="Profit")
summary(simulateIVFCBs(which.max(gainByEggcount), 4.6, selzam2016, 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#    Trait.SD              Cost              Net
# Min.   :0.0000000   Min.   :12300.0   Min.   :-21900.00
# 1st Qu.:0.1284192   1st Qu.:17300.0   1st Qu.: 12711.92
# Median :0.1817688   Median :18300.0   Median : 25630.74
# Mean   :0.1845060   Mean   :18369.1   Mean   : 26330.25
# 3rd Qu.:0.2372748   3rd Qu.:19500.0   3rd Qu.: 39162.75
# Max.   :0.5661427   Max.   :25300.0   Max.   :117856.55
max(gainByEggcount) / which.max(gainByEggcount)
# [1] 94.86516619

The maximum is at ~281 eggs, yielding 0.18SD/~2.7 points & a net profit of ~$26k, indicating that beyond that many eggs, the cost of the additional SNPing exceeds the marginal IQ gain from having 1 more egg available which could turn into an embryo & be selected amongst. With $26k profit from 281 eggs, we could say that the gain from unlimited eggs compared to the normal yield of ~9 eggs is ~$20k ($26k vs the best current scenario of $6k), and that the average profit from adding each egg was $73, giving an idea of the sort of per-egg costs one would need from an egg stem cell technology (small). The optimal number of eggs will decrease with an increase in per-egg costs; if it costs another $200 per embryo, then the optimal number of eggs is around half, and so on.

So with present polygenic-scores & SNP costs, an unlimited number of eggs would only increase profit by 4x, as we are then still constrained by the polygenic score. This would be valuable, of course, but it is not a huge change.

Inducing eggs from stem cells does have the potentially valuable feature that it is probably money-constrained rather than egg or PGS constrained: you want to stop at a few hundred eggs but only because IQ and other selected traits are being valued at a low rate. If one values them higher, the limit will be pushed out further—a thousand eggs would deliver gains like +20 IQ points, and a wealthy actor might go even further to 10,000 eggs (+24), although even the wealthiest actors must stop at some point due to the thin tails/diminishing returns.
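As a rough check on the +20 and +24 IQ-point figures, here is a sketch under assumptions of mine: the paragraph does not say which predictor is meant, so I assume one at the SNP-heritability ceiling of 0.33, halved because selection is among siblings, with 15 IQ points per SD and no pipeline losses. `expectedMax` is a hypothetical stand-in for the `exactMax` helper used elsewhere.

```r
# Hypothetical helper: expected maximum of n Normal(0, sd) draws via a grid sum.
expectedMax <- function(n, sd = 1) {
  step <- 0.001
  z <- seq(-10, 10, by = step)
  sd * sum(z * n * dnorm(z) * pnorm(z)^(n - 1)) * step
}

predictorSD <- sqrt(0.33 * 0.5)   # ASSUMPTION: SNP-heritability ceiling, halved within-family
eggCounts   <- c(10, 100, 1000, 10000)
iqGains     <- sapply(eggCounts, function(n) { expectedMax(n, sd = predictorSD) * 15 })
round(iqGains, 1)   # gains in IQ points for 10, 100, 1000, 10000 eggs
```

Under these assumptions, each tenfold increase in eggs buys a smaller increment than the last, which is the thin-tails/diminishing-returns point: going from 1,000 to 10,000 eggs adds only around 4 points.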

#### Optimal stopping/search

I model embryo selection with many embryos as an optimal stopping/search problem and give an example algorithm for when to halt that results in substantial savings over the brute force approach of testing all available embryos. This shows that with a little thought, “too many embryos” need not be any problem.

In statistics, a general principle is that it is as good or better to have more options or actions or information than fewer (computational issues aside). Embryo selection is no exception: it is better to have many embryos than few, many PGSes available for each embryo than one, and it is better to adaptively choose how many to sequence/test than to test them all blindly.24 This point becomes especially critical when we begin speculating about hundreds or thousands of embryos, as the cost of testing them all may far exceed any gain.

But we can easily do better.

The secretary problem is a famous example of an optimal stopping problem where, in sequentially searching through n candidates, permanently choosing/rejecting at each step, with only relative rankings known & no distribution, it turns out that, remarkably, one can select the best candidate ~37% of the time independent of n, and that one can achieve an expected rank of ~3.9th-best candidate. Given that we know the PGSes are normally distributed, know the utilities thereof, and do not need to choose irrevocably, we should be able to do even better.
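For readers who have not seen the ~37% result, a minimal simulation of the classic rule follows: observe the first n/e candidates without choosing, then take the first later candidate who beats all of them. This sketch covers only the textbook setting (relative ranks, irrevocable choices), not the embryo problem; the function name `secretaryTrial` and the choices of n=100 and 20,000 replications are mine.

```r
set.seed(2016)

# One run of the classic 1/e rule; returns TRUE if the overall best was chosen.
secretaryTrial <- function(n) {
  ranks     <- sample(n)              # random arrival order; rank n is the best
  skip      <- floor(n / exp(1))      # observe-only phase
  benchmark <- max(ranks[1:skip])
  later     <- ranks[(skip + 1):n]
  better    <- which(later > benchmark)
  # Take the first later candidate beating the benchmark, else be stuck with the last:
  chosen    <- if (length(better) > 0) later[better[1]] else later[length(later)]
  chosen == n
}

successRate <- mean(replicate(20000, secretaryTrial(100)))
successRate   # should land close to 1/e, ie. ~0.37
```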

This can be solved by the usual Bayesian search decision theory approach: at each step, calculate the expected Value of Information from another search (upper bounded by the expected Value of Perfect Information), and when the marginal VoI <= marginal cost, halt, and return the best candidate. If we know neither parental genomes nor trait values, we must update our distribution of possible outcomes from another sample: for example, if we sequence the first embryo and find a high PGS compared to the population mean, then that implies a high parental mean which means that the future embryos might be even higher than we expected, and thus we will want to continue sampling longer than we did before. (In practice, this probably has little effect, as it turns out we already want to sample so many embryos on average that the uncertainty in the mean is near-zero by the time we near the stopping point.) In the case where parental genomes are available or we have phenotypes, we can assume we are sampling from a known normal distribution and so we don’t even need to do any Bayesian updates based on our previous observations; we can simply calculate the expected increase from another sample.

Consider sequentially searching a sample of n normal deviates for the maximum deviate, with a certain utility cost per sample & utility of each +SD.

Given diminishing returns of order statistics, there may be an n at which it on average does not pay to search all of the n but only a few of them. There is also optionality to search: if a large value is found early in the search, given normality it is unlikely to find a better candidate afterwards, so one should stop the search immediately to avoid paying futile search costs; so while having not yet reached that average n, a sample may have been found so good that one should stop early.

The expected Value of Perfect Information is when we can search the whole sample for free; so here it is simply the expected max of the full n times the utility.

So our n might be the usual 5 embryos, our utility cost is $200 per step (the cost to sequence each embryo), and the utility of each +SD can be the low value of IQ ($3270 per IQ point or 15x for +1 SD). Compared with zero embryos tested, since 5 yields a gain of +1.16SD, the EVPI in that scenario is $57k. However, if we already have 3 embryos tested (+0.84SD), the EVPI diminishes: 2 more embryos sampled on average will only increase the gain by +0.31SD or $15k. And by the same logic, the one-step case follows: sampling 1 embryo given 3 already has an EVPI of +0.18SD or $8k. Given that the cost to sample one step is so low ($200), it is immediately clear we probably should continue sampling: after all, we gain $8k but only spend $0.2k to do so.

So the sequential search in embryo selection borders on trivial: given the low cost and high returns, for all reasonable sizes of n, we will on average want to search the entire sample. At what n would we halt on average? In other words, for what n is $(\text{exactMax}(n) - \text{exactMax}(n-1)) \cdot 49050 < 200$? Or to put it another way, when is the order difference <0.004 SDs ($\frac{200}{49050} \approx 0.004$)? In this case, we only hit diminishing returns strongly enough around n=88.

allegrini2018 <- sqrt(0.11*0.5)
iqLow         <- 3270*15
testCost      <- 200

exactMax(5)
# [1] 1.162964474
exactMax(5) * iqLow
# [1] 57043.40743
(exactMax(5) - exactMax(3))
# [1] 0.3166800983
(exactMax(5) - exactMax(3)) * iqLow
# [1] 15533.15882

round(sapply(seq(2, 300, by=10), function(n) { (exactMax(n) - exactMax(n-1)) * iqLow }))
#  [1] 27673  2099  1007   648   473   370   303   255   220   194   172   155   141   129   119
#        110   103    96    90    85    81    76    73    69    66    63    61
# [28]    58    56    54
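Since the table above only samples every tenth n, the perfect-predictor crossover can also be located directly. A minimal sketch, again using a hypothetical `expectedMax` in place of `exactMax`, with the same $200 test cost and low IQ value:

```r
# Hypothetical helper: expected maximum of n standard-normal draws via a grid sum.
expectedMax <- function(n) {
  step <- 0.001
  z <- seq(-10, 10, by = step)
  sum(z * n * dnorm(z) * pnorm(z)^(n - 1)) * step
}

valuePerSD  <- 3270 * 15   # low value of IQ, per SD
costPerTest <- 200

# Expected dollar value of testing the n-th embryo, given n-1 already tested:
marginalValue <- function(n) { (expectedMax(n) - expectedMax(n - 1)) * valuePerSD }

firstUnprofitable <- 2
while (marginalValue(firstUnprofitable) > costPerTest) {
  firstUnprofitable <- firstUnprofitable + 1
}
firstUnprofitable   # first n whose marginal value falls below $200; near the n=88 quoted
```

The answer should fall between the n=82 and n=92 entries of the table (marginal values of 220 and 194), consistent with the "around n=88" figure.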

That assumes a perfect predictor, of course, and we do not have that. Deflating by the halved Allegrini et al 2018 PGS, the crossover is closer to n=24:

round(sapply(2:26, function(n) { (exactMax(n, sd=allegrini2018) - exactMax(n-1, sd=allegrini2018)) * iqLow }))
# [1] 6490 3245 2106 1537 1199  977  822  706  618  549  492  446  407  374  346  322  300  281  265  250  236  224  213  203  194
exactMax(24, sd=allegrini2018)
# [1] 0.4567700586
exactMax(25, sd=allegrini2018)
# [1] 0.4609071309
0.4609071309 - 0.4567700586
# [1] 0.0041370723

stoppingRule <- function(predictorSD, utilityCost, utilityGain) {
n <- 1
while(((exactMax(n+1, sd=predictorSD) - exactMax(n, sd=predictorSD)) * utilityGain) > utilityCost) { n <- n+1 }
return(c(n, exactMax(n), exactMax(n, sd=predictorSD))) }

round(digits=2, stoppingRule(allegrini2018, testCost, iqLow))
# [1] 25.00 1.97 0.46
round(digits=2, stoppingRule(allegrini2018, 100, iqLow))
# [1] 45.00  2.21  0.52

Another way of putting it would be that we’ve derived a stopping rule: once we have a candidate of >=0.4567SD, we should halt, as all future samples are expected to cost too much. (If the candidate embryo is nonviable or fails to yield a live birth, testing can simply resume with the rest of the stored embryos until the stopping rule fires again or one has tested the entire sample.) Compared to blind batch sampling without regard to marginal costs, the expected benefit of this stopping rule is the number of searches past n=24 times the cost minus the marginal benefit, so if we were instead going to blindly test an entire sample of n=48, we’d incur a loss of $1516:

marginalGain <- (exactMax(48, sd=allegrini2018) - exactMax(24, sd=allegrini2018)) * iqLow
marginalCost <- (48-24) * testCost
marginalGain; marginalCost
# [1] 3283.564451
# [1] 4800
marginalGain - marginalCost
# [1] -1516.435549

The loss would continue to increase the further past the stopping point we go.

This demonstrates the benefits of sequential testing and gives a formula & code for deciding when to stop based on cost/benefits/normal distribution parameters.

To go into further detail, in any particular run, we would see different random samples at each step. We also might not have derived a stopping rule in advance. Does the stopping rule actually work? What does it look like to simulate out stepping through embryos one at a time, calculating the expected value of testing another sample (estimated via Monte Carlo, since it’s not a threshold Gaussian but a ‘rectified Gaussian’ whose WP article has no formula for the expectation25), and after stopping, comparing to what if we had instead tested them all? It looks as expected above: typically we test up to 24 embryos, get a SD increase of <=0.45SD (if we don’t have >24 embryos, unsurprisingly we won’t get that high), and by stopping early, we do in fact save a modest amount each run, enough to outweigh the occasional scenario where the remaining embryos hid a really high score. And since we do usually stop ~24, the batch testing becomes increasingly worse the larger the total n becomes; by 500 embryos, the loss is up to $80k:

library(memoise)
library(parallel) # warning, Windows users
library(plyr)

## Memoise the Monte Carlo evaluation to save time - it's almost exact w/100k & simpler:
expectedPastThreshold <- memoise(function(maximum, predictorSD) {
mean({ x <- rnorm(100000, sd=predictorSD); ifelse(x>maximum, x-maximum, 0) }) })

optimalSearch <- function(maxN, predictorSD, utilityCost, utilityBenefit) {

samples <- rnorm(maxN, sd=predictorSD)

i <- 1; maximum <- samples[1]; cost <- utilityCost; profit <- 0; gain <- max(maximum,0);
while (i < maxN) {

marginalGain <- expectedPastThreshold(maximum, predictorSD)

if (marginalGain*utilityBenefit > utilityCost) {
i <- i+1
cost <- cost+utilityCost
nth <- samples[i]
maximum <- max(maximum, nth); } else { break; }
}

gain <- maximum * utilityBenefit; profit <- gain-cost;
searchAllProfit <- max(samples)*utilityBenefit - maxN*utilityCost

return(c(i, maximum, cost, gain, profit, searchAllProfit, searchAllProfit - (gain-cost)))
}

optimalSearch(100, allegrini2018, testCost, iqLow)
# [1]    48     0  9600 22475 12875  9462 -3413

## Parallelize simulations:
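## Columns returned, in order: embryos tested; best score (SD); total cost; gain; profit; profit if all were tested; test-all profit minus sequential profit
optimalSearchs <- function(a,b,c,d, iters=10000) { df <- ldply(mclapply(1:iters, function(x) { optimalSearch(a,b,c,d); })); return(df) }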

summary(digits=2, optimalSearchs(5,   allegrini2018, testCost, iqLow))
# Min.   :1.0   Min.   :-0.27   Min.   : 200   Min.   :-13039   Min.   :-14039   Min.   :-14039     Min.   : -800
# 1st Qu.:5.0   1st Qu.: 0.16   1st Qu.:1000   1st Qu.:  7978   1st Qu.:  6978   1st Qu.:  6978     1st Qu.:    0
# Median :5.0   Median : 0.26   Median :1000   Median : 12902   Median : 11902   Median : 11902     Median :    0
# Mean   :4.6   Mean   : 0.27   Mean   : 921   Mean   : 13267   Mean   : 12346   Mean   : 12306     Mean   :  -40
# 3rd Qu.:5.0   3rd Qu.: 0.37   3rd Qu.:1000   3rd Qu.: 18199   3rd Qu.: 17199   3rd Qu.: 17199     3rd Qu.:    0
# Max.   :5.0   Max.   : 1.05   Max.   :1000   Max.   : 51405   Max.   : 51205   Max.   : 50405     Max.   :14789
summary(digits=2, optimalSearchs(10,  allegrini2018, testCost, iqLow))
# Min.   : 1.0   Min.   :-0.06   Min.   : 200   Min.   :-2934   Min.   :-4934   Min.   :-4934      Min.   :-1800
# 1st Qu.: 7.0   1st Qu.: 0.27   1st Qu.:1400   1st Qu.:13047   1st Qu.:11047   1st Qu.:11047      1st Qu.: -400
# Median :10.0   Median : 0.35   Median :2000   Median :17275   Median :15275   Median :15275      Median :    0
# Mean   : 8.2   Mean   : 0.36   Mean   :1649   Mean   :17594   Mean   :15945   Mean   :15754      Mean   : -190
# 3rd Qu.:10.0   3rd Qu.: 0.44   3rd Qu.:2000   3rd Qu.:21718   3rd Qu.:20742   3rd Qu.:20109      3rd Qu.:    0
# Max.   :10.0   Max.   : 0.97   Max.   :2000   Max.   :47618   Max.   :46218   Max.   :45618      Max.   :20883
summary(digits=2, optimalSearchs(24,  allegrini2018, testCost, iqLow))
# Min.   : 1   Min.   :0.12   Min.   : 200   Min.   : 5719   Min.   :  919   Min.   :  919      Min.   :-4600
# 1st Qu.: 7   1st Qu.:0.37   1st Qu.:1400   1st Qu.:18238   1st Qu.:13438   1st Qu.:13438      1st Qu.:-2800
# Median :16   Median :0.43   Median :3200   Median :21201   Median :19223   Median :17145      Median : -600
# Mean   :15   Mean   :0.44   Mean   :3032   Mean   :21689   Mean   :18656   Mean   :17648      Mean   :-1008
# 3rd Qu.:24   3rd Qu.:0.50   3rd Qu.:4800   3rd Qu.:24527   3rd Qu.:22636   3rd Qu.:21217      3rd Qu.:    0
# Max.   :24   Max.   :1.13   Max.   :4800   Max.   :55507   Max.   :52107   Max.   :50707      Max.   :25705
summary(digits=2, optimalSearchs(100, allegrini2018, testCost, iqLow))
# Min.   :  1   Min.   :0.31   Min.   :  200   Min.   :15218   Min.   :-4782   Min.   :-4782      Min.   :-19800
# 1st Qu.:  7   1st Qu.:0.43   1st Qu.: 1400   1st Qu.:21223   1st Qu.:16696   1st Qu.: 5342      1st Qu.:-15507
# Median : 16   Median :0.47   Median : 3200   Median :23239   Median :19919   Median : 8266      Median :-11772
# Mean   : 23   Mean   :0.50   Mean   : 4654   Mean   :24398   Mean   :19744   Mean   : 8762      Mean   :-10983
# 3rd Qu.: 33   3rd Qu.:0.54   3rd Qu.: 6600   3rd Qu.:26504   3rd Qu.:23076   3rd Qu.:11651      3rd Qu.: -7293
# Max.   :100   Max.   :1.10   Max.   :20000   Max.   :53952   Max.   :52352   Max.   :33952      Max.   : 18226
summary(digits=2, optimalSearchs(500, allegrini2018, testCost, iqLow))
# Min.   :  1   Min.   :0.40   Min.   :  200   Min.   :19607   Min.   :-25265   Min.   :-76428     Min.   :-99800
# 1st Qu.:  7   1st Qu.:0.43   1st Qu.: 1400   1st Qu.:21289   1st Qu.: 16559   1st Qu.:-67982     1st Qu.:-89569
# Median : 17   Median :0.48   Median : 3400   Median :23349   Median : 19779   Median :-65471     Median :-85154
# Mean   : 24   Mean   :0.50   Mean   : 4772   Mean   :24498   Mean   : 19726   Mean   :-64955     Mean   :-84681
# 3rd Qu.: 33   3rd Qu.:0.54   3rd Qu.: 6600   3rd Qu.:26500   3rd Qu.: 23232   3rd Qu.:-62591     3rd Qu.:-80393
# Max.   :234   Max.   :1.09   Max.   :46800   Max.   :53390   Max.   : 50453   Max.   :-44268     Max.   :-37431

Thus, the approach using the order statistics and the approach using Monte Carlo statistics agree; the threshold can be calculated in advance and the problem reduced to the simple algorithm “sample while best < threshold until running out”.
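The threshold rule itself fits in a few lines. This is only a minimal sketch: `thresholdSearch` is a hypothetical helper (not the `stoppingRule` or `optimalSearchs` functions used above), the scores are standard-normal draws, and the threshold of 1.5SD is an arbitrary illustrative value rather than a computed optimum.

```r
# Hypothetical sketch of "sample while best < threshold until running out".
# scores:    embryo scores, in the order they would be tested
# threshold: stop as soon as the best score seen so far reaches this value
thresholdSearch <- function(scores, threshold) {
    best <- -Inf
    tested <- 0
    for (s in scores) {
        tested <- tested + 1
        best <- max(best, s)
        if (best >= threshold) break
    }
    c(tested = tested, best = best)
}

# Illustration only: 24 available embryos, arbitrary threshold of 1.5SD
set.seed(2016)
runs <- replicate(10000, thresholdSearch(rnorm(24), threshold = 1.5))
rowMeans(runs) # average number tested & average best score found
```

Raising the threshold trades more tests (higher cost) for a better expected best score, which is the tradeoff the stopping rule optimizes.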

24 might seem like a low number, and it is, but it can be driven much higher: better PGSes which predict more variance, use of multiple-selection to synthesize an index trait which both varies more and has far greater value, and the expected long-term decreases in sequencing costs. For example, if we look at a later section where a few dozen traits are combined into a single “index” utility score, the SNP heritability’s index utility scores are distributed ~ & the 2016 PGSes give a ~, then our stopping rules look different:

## SNP heritability upper bound:
round(digits=2, stoppingRule(1, testCost, 72000))
# [1] 125.00   2.59   2.59
## 2016 multiple-selection:
round(digits=2, stoppingRule(1, testCost, 6876))
# [1] 16.00  1.77  1.77

### Multiple selection

Intelligence is one of the most valuable traits to select on, and one of the easiest to analyze, but we should remember that it is neither necessary nor desirable to select only on a single trait. For example, in cattle embryo selection, selection is done not on a single trait but a weighted sum of 48 traits ().

Selecting only on one trait means that almost all of the available genotype information is being ignored; at best, this is a lost opportunity, and at worst, in some cases it is harmful—in the long run (dozens of generations), selection only on one trait, particularly in a very small breeding population like often used in agriculture (albeit irrelevant to humans), will have “unintended consequences” like greater disease rates, shorter lifespans, etc (see Falconer 1960’s Introduction to Quantitative Genetics, , & Lynch & Walsh 1998’s on ). When breeding is done out of ignorance or with regard only to a few traits or on tiny founding populations, one may wind up with problematic breeds like some purebred dog breeds which have serious health issues due to inbreeding, small founding populations, no selection against negative mutations popping up, and variants which increase the selected trait at the expense of another trait.26 (This is not an immediate concern for humans as we have an enormous population, only weak selection methods, low levels of historical selection, and high heritabilities & much standing variance, but it is a concern for very long-term programs or hypothetical future selection methods like iterated embryo selection.)

This is why animal breeders do not select purely on a single valuable trait like egg-laying rate but on an index of many traits, from maturity speed to disease resistance to lifespan. An index is simply the sum of a large number of measured variables, implicitly equally weighted or explicitly weighted by their contribution towards some desired goal—the more included variables, the more effective selection becomes as it captures more of the latent differences in utility. For background on the theory and construction of indexes in selection, see Lynch & Walsh 2018’s /.

In our case, a weak polygenic score can be strengthened by better GWASes, but it can also be combined with other polygenic scores to do selection on multiple traits by summing the scores per embryo and taking the maximum. For example, as of 1 August 2018, the UK Biobank makes public ; many of these traits might be of no importance or the PGS too weak to make much of a difference, but the rest may be valuable. Once an index has been constructed from several PGSes, it functions identically to embryo selection on a single PGS and the previous discussion applies to it, so the interesting questions are: how expensive an index is to construct; what PGSes are used and how they are weighted; and what is the advantage of multiple embryo selection over simple embryo selection.

This can be done almost for free, since if one did sequencing on a comprehensive SNP array chip to compute 1 polygenic score, one probably has all the information needed. (Indeed, you could see selection on a single trait as an index selection where all traits’ values are implausibly set to 0 except for 1 trait.) In reality, while some traits are of much more value than others, there are few traits with no value at all; an embryo which scores mediocrely on our primary trait may still have many other advantages which more than compensate, so why not check? (It is a general principle that more information is better than less.) Intelligence is valuable, but it’s also valuable to live a long time, have less risk for schizophrenia, lower BMI, be happier, and so on.

A quick demonstration of the possible gain is to imagine the total of 1 normal deviate () vs picking the most extreme out of several normal deviates. With 1 deviate, our average extreme is 0, and most of the time it will be within ±1SD. But if we can pick out of batches of 10, we can generally get +1.53SD:

mean(replicate(100000, max(rnorm(10, mean = 0))))
# [1] 1.537378753

What if we have 4 different scores (with three downweighted substantially to reflect that they are less valuable)? We get 0.23SD for free:

mean(replicate(100000, max(   1*rnorm(10, mean = 0) +
                           0.33*rnorm(10, mean = 0) +
                           0.33*rnorm(10, mean = 0) +
                           0.33*rnorm(10, mean = 0))))
# [1] 1.769910562

This is like selecting among multiple embryos: the more we have to pick from, the better the chance the best one will be particularly good. So in selecting embryos, we want to compute multiple polygenic scores for each embryo, weight them by the overall value of that trait, sum them to get a total score for each embryo, then select the best embryo for implantation.
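The procedure just described (score each embryo on every trait, weight, sum, take the maximum) can be sketched as follows. The trait names, weights, and random scores are placeholders for illustration, not real polygenic scores.

```r
set.seed(2016)
nEmbryos <- 10
# Hypothetical standardized polygenic scores: one row per embryo, one column per trait
scores <- matrix(rnorm(nEmbryos * 4), nrow = nEmbryos,
                 dimnames = list(NULL, c("trait1", "trait2", "trait3", "trait4")))
# Hypothetical relative values of each trait (same weights as the demonstration above)
weights <- c(trait1 = 1, trait2 = 0.33, trait3 = 0.33, trait4 = 0.33)

index <- as.vector(scores %*% weights)         # weighted total score per embryo
bestByIndex  <- which.max(index)               # embryo chosen by multiple selection
bestBySingle <- which.max(scores[, "trait1"])  # embryo chosen on trait1 alone

# By construction, the index pick never has a lower index than the single-trait pick:
index[bestByIndex] - index[bestBySingle]
```

When both rules pick the same embryo the difference is 0; otherwise the index pick gives up some of trait1 in exchange for a larger total.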

The advantage of multiple polygenic scores follows from the for 2 variables X & Y is ; that is, the variances are added, so the standard deviation will increase, so our expected maximum sample will increase. Recalling , increasing beyond 1 will initially yield larger returns than increasing n past 9 (it looks linear rather than logarithmic, but embryo selection is zero-sum—the gain is shrunk by the weighting of the multiple variables), and so multiple selection should not be neglected. Using such a total score on n uncorrelated traits, as compared to alternative methods like selecting for 1 trait in each generation, is considerably more efficient, ~ times as efficient (Hazel & Lush 1943, 27).
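As a check on the two simulations above, the expected maximum can also be computed by numerically integrating the order-statistic density. This is a sketch under the assumption that the component scores are independent normals; `expectedMax` is a name introduced here for illustration (the `exactMax` function used elsewhere in this analysis may be implemented differently).

```r
# Expected maximum of n independent N(0, sd^2) draws:
#   E[max] = sd * integral of x * n * dnorm(x) * pnorm(x)^(n-1) dx
expectedMax <- function(n, sd = 1) {
    integrand <- function(x) x * n * dnorm(x) * pnorm(x)^(n - 1)
    sd * integrate(integrand, lower = -Inf, upper = Inf)$value
}

single <- expectedMax(10) # one standard-normal score, best of 10

# Independent variances add, so the index 1*X1 + 0.33*(X2 + X3 + X4)
# has variance 1^2 + 3 * 0.33^2:
indexSD <- sqrt(1^2 + 3 * 0.33^2)
indexed <- expectedMax(10, sd = indexSD)

round(c(single = single, index = indexed, gain = indexed - single), 2)
```

The results should land close to the simulated ~1.54 and ~1.77: the whole gain comes from the larger standard deviation of the summed score.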

We could rewrite simulateIVFCB to accept as parameters a series of polygenic score functions and simulate out each polygenic score and their sums; but we could also use the sum of random variables to create a single composite polygenic score—since the variances simply sum up (), we can take the polygenic scores, weight them, and sum them.

combineScores <- function(polygenicScores, weights) {
    weights <- weights / sum(weights) # normalize to sum to 1
    # add variances, to get variance explained of total polygenic score
    sum(weights*polygenicScores) }

Let’s imagine a US example but with 3 traits now, IQ and 2 we consider to be roughly half as valuable as IQ, but which have better polygenic scores available of 60% and 5%. What sort of gain can we expect above our starting point?

weights <- c(1, 0.5, 0.5)
polygenicScores <- c(selzam2016, 0.6, 0.05)
summary(simulateIVFCBs(9, 4.6, combineScores(polygenicScores, weights), 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#     Trait.SD               Cost              Net
#  Min.   :0.00000000   Min.   :1500.00   Min.   : -3900.00
#  1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.: -1900.00
#  Median :0.00000000   Median :1900.00   Median : -1500.00
#  Mean   :0.07524308   Mean   :2039.25   Mean   : 16189.51
#  3rd Qu.:0.11491090   3rd Qu.:2300.00   3rd Qu.: 25638.72
#  Max.   :1.00232683   Max.   :4100.00   Max.   :241128.71

So we more than double our gains by considering 3 traits instead of 1 (a mean net of ~$16,190 vs $6,230 for IQ alone).
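To make the composite explicit, the weighted score passed into the simulation can be computed on its own. The value used for `selzam2016` below is an assumed placeholder of 0.035; the variable is defined elsewhere in this analysis and its exact value is not shown in this section.

```r
combineScores <- function(polygenicScores, weights) {
    weights <- weights / sum(weights) # normalize to sum to 1
    sum(weights * polygenicScores)
}

selzam2016 <- 0.035 # ASSUMPTION: placeholder value, for illustration only
weights <- c(1, 0.5, 0.5)
polygenicScores <- c(selzam2016, 0.6, 0.05)

composite <- combineScores(polygenicScores, weights)
composite              # weighted composite fed to the simulation
composite / selzam2016 # ratio to the single-trait score
```

Most of the composite comes from the second trait's 60% score, which is why even half-weighted traits can dominate a weak primary score.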

#### Multiple selection on independent traits

A more realistic example would be to use some of the existing polygenic scores for complex traits, many of which are available for analysis from sources like LD Hub. Perhaps a little counterintuitively, to maximize the gains, we want to focus on universal traits such as IQ, or common diseases with high prevalence; the more horrifying genetic diseases are rare precisely because they are horrifying (natural selection keeps them rare), so focusing on them will only occasionally pay off.28

Here are 7 I looked up and was able to convert to relatively reasonable gains/losses:

1. IQ (using the previously given value and Selzam et al 2016 polygenic score, and excluding any valuation of the 7% of family SES & 9% of education that the IQ polygenic score comes with for free)

2. height

The literature is unclear what the best polygenic score for height is at the moment; let’s assume that it can predict most but not all, like ~60%, of variance with a population standard deviation of ~4 inches; the economics estimate is $800 of annual income per inch or a NPV of $16k per inch or $65k per SD, so we would weight it as a quarter as valuable as the high IQ estimate (((800/log(1.05))*4) / iqHigh ~> 0.27). The causal link is not fully known, but a of height & BMI supports causal estimates of $300/$1616 per SD respectively, which shows the correlations are not solely due to confounding.

3. Polygenic scores: //, population SD . Cost is a little trickier (low BMI can be as bad as high BMI, lots of costs are not paid by individuals, etc) but one could say there’s Then we’d get a weight of 7% (((175/log(1.05))*4.67) / iqHigh ~> 0.069). More recently, finds a 1 SD increase in a polygenic score predicts >$1400 increase in healthcare costs.

4. ’s reports a polygenic score predicting 5.73% on the liability scale.

Diabetes is not a continuous trait like IQ/height/BMI, but generally treated as a binary disease: you either have good blood sugar control and will not go blind and suffer all the other morbidity caused by diabetes, or you don’t. The underlying genetics is still highly polygenic and mostly additive, though, and in some sense one’s risk is normally distributed.

The “” is the usual quantitative genetics model for dealing with discrete polygenic variables like this: one’s latent risk is considered a normal variable (which is the sum of many individual variables, both genetic and environmental/random), and when one is unlucky enough for this risk to be enough standard deviations out past a threshold, one has the disease. The ‘enough standard deviations’ is set empirically; if 1% of the population will develop schizophrenia, then one has to be +2.33SD (qnorm(0.01)) out to develop schizophrenia, and assuming a mean risk of 0, one can then calculate the effects of an increase or decrease of 1SD. For example, if some change results in decreasing one’s risk score -1SD such that it would now take another 3.33SD to develop schizophrenia, then one’s probability of developing schizophrenia has decreased from 1% to 0.04%, a fall of 23x (pnorm(qnorm(0.01)) / pnorm(qnorm(0.01)-1) ~> 22.73) and so whatever one estimated the expected loss of schizophrenia at, it has decreased 23x and the change of 1SD can be valued at that. And vice versa for an increase: an increase of 1 SD in latent risk will increase the probability of developing schizophrenia several-fold and the expected loss must be increased accordingly. So if we have a polygenic score for schizophrenia which can produce a reduction (out of, say, 10 embryos) of 0.10SDs, a population prevalence of 1%, and a lifetime cost of $1m, then the expected reduction would be from 1% to 0.762%, or from an expected loss of$10000 (1m * 1%) to $7625 (1m * 0.762%) and the value of that quarter reduction would be around a quarter of the original loss. One consequence of this is that as a disorder becomes rarer, selection becomes worth less; or to put it another way, people with high risk of passing on schizophrenia (such as a diagnosed schizophrenic) will benefit far more: the child of 1 schizophrenic parent (and no other relatives) has a , implying thresholds of 1.28SDs and 0.25SDs respectively. 
Because most diseases are developed by a minority of people, the gain from selecting against disease is not as great as one might intuitively expect, and the gains are the least for the healthiest people (which is an amusing twist on the old fears that embryo selection will “exacerbate inequality”).

Putting it together, we can compute the value like this:

liabilityThresholdValue <- function(populationFraction, gainSD, value) {
    reducedFraction <- pnorm(qnorm(populationFraction) + gainSD)
    difference      <- (populationFraction - reducedFraction) * value
    return(c(reducedFraction, difference)) }
liabilityThresholdValue(0.01, -0.1, 1000000)
# [1] 7.625821493e-03 2.374178507e+03
liabilityThresholdValue(0.10, -0.1, 1000000)
# [1] 8.355471719e-02 1.644528281e+04
liabilityThresholdValue(0.40, -0.1, 1000000)
# [1] 3.619141184e-01 3.808588159e+04
3.808588159e+04 / 2.374178507e+03
# [1] 16.04170937

Similarly for diabetes. We can estimate the NPV of not developing diabetes at as much as $124,600; the lifetime risk of diabetes in the USA is approaching ~40% and has probably exceeded it by now (implying, incidentally, that diabetes is one of the most costly diseases in the world), so the expected loss is $49840 and developing diabetes has a threshold of 0.25SD; a decrease of 1SD cuts one’s chance of developing diabetes to about a quarter (pnorm(qnorm(0.40)-1) / pnorm(qnorm(0.40)) ~> 0.26) for a savings of ~$36.9k ((124600 * 0.4) - (124600 * 0.4 * 0.26) ~> 36881); finally, $36.9k/SD, compared with IQ, gets a weight of 15%. (If this seems low, it’s a combination of prevalence and PGS benefits. Similar to the “Population Attributable Risk” (PAR) statistic in epidemiology.)

5. ADHD polygenic scores range from 0.098%//0.5%//. Prevalence rates differ based on country & diagnosis method, but most genetics studies were run using DSM diagnoses in the West, so ~7% of children affected. find large harmful correlations, estimating a -$8900 annual loss from ADHD or ~$182k NPV. So the best score is 1.5%; the liability threshold is 1.47SD; the starting expected loss is ~$12768; a 1SD reduction is then worth $11.5k (182000*pnorm(qnorm(0.07)) - 182000*pnorm(qnorm(0.07)-1) ~> 11530.4) and has a weight of 4.7%.

6. Scores: // ().

Frequency is ~3%. Ranking after schizophrenia & depression, BPD is likewise expensive, . estimates a total annual loss of $45 billion but doesn’t give a lifetime per capita estimate; so to estimate that: in 1991, there were ~253 million people in the USA, life expectancy ~75 years, quoted 1991 lifetime prevalence of 1.3%; if there are a few million people every year with BPD which results in a total loss of $45b in 1991 dollars, and each person lives ~75 years, then that suggests an average lifetime total loss of ~$1026147, which to 2016 dollars is $1784953, and this has a NPV at 5% of $87k ((45000000000 / (253000000 * 0.013)) * 75 ~> 1026147; 1784953 * log(1.05) ~> 87088.1499). With a relatively low base-rate, the savings is not huge and it gets a weight of 0.01 ((87088*pnorm(qnorm(0.03)) - 87088*pnorm(qnorm(0.03)-1)) / iqHigh ~> 0.01007).

7. Scores: //// & (if pooled, <12%?).

Frequency is ~1%. Schizophrenia is even more notoriously expensive worldwide than BPD, with 2002 USA costs estimated by at $15464 in direct & $22032 in indirect costs per patient, or total $49379 in 2016 dollars (which may well be a serious underestimate considering ) for a weight of 4% (49379 / log(1.05) ~> 1012068; (1012068*pnorm(qnorm(0.01)) - 1012068*pnorm(qnorm(0.01)-1)) ~> 9675.41; 9675/iqHigh ~> 0.039)
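The per-SD dollar values used for the binary traits above all come from the same liability-threshold calculation, which can be wrapped in a small helper. This is a sketch: `valuePerSD` is a name introduced here, the prevalences and costs are the figures quoted in the list, and the final division by `iqHigh` to get a weight is left out because `iqHigh` is defined elsewhere.

```r
# Expected dollar gain from shifting latent liability by shiftSD (default: -1SD)
# for a binary trait with the given lifetime prevalence & cost if affected.
valuePerSD <- function(prevalence, cost, shiftSD = -1) {
    newPrevalence <- pnorm(qnorm(prevalence) + shiftSD)
    (prevalence - newPrevalence) * cost
}

values <- c(diabetes      = valuePerSD(0.40, 124600),
            ADHD          = valuePerSD(0.07, 182000),
            bipolar       = valuePerSD(0.03, 87088),
            schizophrenia = valuePerSD(0.01, 1012068))
round(values)
```

The diabetes figure will differ slightly from the 36881 in the text, which rounds the risk ratio to 0.26 before multiplying. The ordering illustrates the prevalence effect: a common, moderately costly disease can be worth more per SD than a rare, very costly one.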

The low weights suggest we won’t see a 6x scaling from adding 6 more traits, but we still see a substantial gain from multiple selection, up to $14k/2.3x better than IQ alone:

polygenicScores <- c(selzam2016, 0.6, 0.153, 0.0573, 0.015, 0.0283, 0.07)
weights <- c(1, 0.27, 0.07, 0.15, 0.047, 0.01, 0.04)
summary(simulateIVFCBs(9, 4.6, combineScores(polygenicScores, weights), 0.3, 0.90, 10.8/100, 1500, 200, iqHigh))
#     Trait.SD               Cost              Net
#  Min.   :0.00000000   Min.   :1500.00   Min.   : -3900.00
#  1st Qu.:0.00000000   1st Qu.:1700.00   1st Qu.: -1900.00
#  Median :0.00000000   Median :2100.00   Median : -1500.00
#  Mean   :0.06839182   Mean   :2044.12   Mean   : 14524.82
#  3rd Qu.:0.10348042   3rd Qu.:2300.00   3rd Qu.: 22956.37
#  Max.   :0.98818115   Max.   :3900.00   Max.   :237701.71
14524 / 6230
# [1] 2.33

Note that this gain would be larger under lower values of IQ, as then more emphasis will be put on the other traits. Values may also be substantially underestimated because there are many more traits with polygenic scores than just the 7 used here, and for the mental health traits because they pervasively overlap genetically (indeed, in 1 case for ADHD, the schizophrenia/bipolar polygenic scores were better predictors of ADHD status than the ADHD polygenic score was!); counterbalancing this underestimation is that the long-noted correlation between schizophrenia & creativity is turning out to also be genetic, so the gain from reduced schizophrenia/bipolar/ADHD is a tradeoff coming at some cost to creativity. In any case, in theory and in practice, selection on multiple traits will be much more effective than selecting on one trait.

#### Multiple selection on genetically correlated traits

In single selection, the embryo selected is picked from the batch solely based on its polygenic score on 1 trait, even if the gain is small and some of the other embryos have large genetic advantages on other, almost as important, traits.
In multiple selection, we take the maximum from the embryos based on all the scores summed together, allowing for excellence on 1 trait or general high quality on a few other traits. What sort of advantage do we expect? It’s not as simple as generating some random numbers independently from a distribution and then summing them, because the actual genetic scores will turn out to be intercorrelated: a high polygenic score for intelligence will also tend to lower the BMI polygenic score, and a high BMI polygenic score will increase the childhood obesity polygenic score or the smoking polygenic score because they genetically overlap on the same SNPs. In fact, all traits will tend to be a little (or a lot) genetically correlated, because If we ignore this, we may badly over or underestimate the advantage of multiple selection: the advantage of selection on a good trait may be partially negated if it drags in a bad trait, or the advantage may be amplified if it comes with other good traits. Depending on whether the good variables are positively or negatively correlated with bad variables, the gains can be larger or smaller. 
But as long as the correlations are not perfect, exactly +1 or -1, there will always be some progress possible; this might be a little surprising, but the intuition is that one looks for points which are sufficiently high on the desirable traits to offset being higher on the undesirable ones, or, if not particularly high on the desirable trait, lower on the undesirable one; below is a bivariate example, where we have a good trait and a bad trait which are positively/negatively correlated (r=±0.3 in this case), each unit of the good trait is twice as good as the bad one is bad (giving a single weighted index), and the top 10% are selected; one can see that the inverse correlation boosts selection, producing a higher index, and the positive correlation, while worse, still allows gain from selection:

library(mvtnorm)
library(MBESS)
rgMatrixPos <- matrix(ncol=2, c(1, 0.3, 0.3, 1))
rgMatrixInverse <- matrix(ncol=2, c(1, -0.3, -0.3, 1))
generate <- function(mu=c(0,0), n=1000, rg) {
    rmvnorm(n, mean=mu, sigma=cor2cov(rg, sd=rep(1,2)), method="svd") }
plotBivariate <- function(rg) {
    df <- as.data.frame(generate(rg=rg))
    colnames(df) <- c("Good", "Bad")
    df$Index <- (df$Good * 1) - (df$Bad * 0.5)
    cutoff <- quantile(df$Index, probs=0.90)
    df$Selected <- df$Index > cutoff
    print(mean(df[df$Selected,]$Index))
    library(ggplot2)
    qplot(Good, Bad, color=Selected, data=df) + geom_point(size=5) }
plotBivariate(rgMatrixPos)
# [1] 1.66183561
plotBivariate(rgMatrixInverse)
# [1] 2.19804049

We need a dataset giving the pairwise genetic correlations of a lot of important traits, and then we can generate hypothetical multivariate sets of polygenic scores which follow what the real-world distribution of polygenic scores would look like, and then we can sum them up, maximize, and see what sort of gain we have.

A specific genetic correlation can be estimated from twin studies, or as part of GWAS studies using an algorithm like GCTA or LD score regression. LD score regression has the notable advantage of being usable on solely the polygenic scores for individual traits released by GWASes, without requiring the same subjects to be phenotyped or access to subject-level data, and of being computationally tractable; hence it is possible to collect various publicly released polygenic scores for any traits and calculate the correlations for all pairs of traits. This has been done by LD Hub (described in Zheng et al 2016), which provides a web interface to an implementation of LD score regression and >100 public polygenic scores which are now also available for estimating SNP heritability or genetic correlations. Zheng et al 2016 describes the initial correlation matrix for 49 traits, a number of which are of practical interest; the spreadsheet can be downloaded, saved as CSV, and the first lines edited to provide a usable file in R. (A later update provides a correlation matrix for >200 traits, and countless additional polygenic scores have been released since that, but adding those wouldn’t clarify anything.)
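The two simulated means in the bivariate example can be checked against a closed form. For an index a·Good − b·Bad of standard-normal traits with correlation r, the variance is a² + b² − 2abr, and the mean of the top fraction p of a normal variable is SD × dnorm(qnorm(1 − p)) / p. The sketch below assumes exactly the setup of that example (a = 1, b = 0.5, top 10% selected); `selectedIndexMean` is a name introduced here for illustration.

```r
# Mean index value among the top `topFraction` of candidates, for the index
# a*Good - b*Bad where Good & Bad are standard normal with correlation r.
selectedIndexMean <- function(r, a = 1, b = 0.5, topFraction = 0.10) {
    indexSD   <- sqrt(a^2 + b^2 - 2 * a * b * r)              # SD of the weighted index
    intensity <- dnorm(qnorm(1 - topFraction)) / topFraction  # mean of the top tail, in SDs
    indexSD * intensity
}

positive <- selectedIndexMean(r =  0.3)
inverse  <- selectedIndexMean(r = -0.3)
c(positive = positive, inverse = inverse)
```

With only 1000 points per run, the simulated 1.66 and 2.20 are noisy estimates and should land near these analytic values. The formula also shows why gain disappears only at a perfect correlation combined with weights that cancel exactly: otherwise the index variance stays positive.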
Several of the traits are redundant or overlapping: it is scientifically useful to know that height as measured in one study is the same thing as measured in a different study (implying that the relevant genetics in the two populations are the same, and that the phenotype data was collected in a similar manner, which is something you might take for granted for a trait like height, but would be in considerable doubt for mental illnesses), but we really don’t need 4 slightly different traits related to tobacco use or 9 traits about obesity. So before turning the data into a correlation matrix, we need to drop those, leaving 35 relevant traits:

rg <- read.csv("https://www.gwern.net/docs/genetics/correlation/2016-zheng-ldhub-49x49geneticcorrelation.csv")
# delete redundant/overlapping/obsolete ones:
dupes <- c("BMI 2010", "Childhood Obesity", "Extreme BMI", "Obesity Class 1", "Obesity Class 2", "Obesity Class 3", "Overweight", "Waist Circumference", "Waist-Hip Ratio", "Cigarettes per Day", "Ever/Never Smoked", "Age at Smoking", "Extreme Height", "Height 2010")
rgClean <- rg[!(rg$Trait1 %in% dupes | rg$Trait2 %in% dupes),]
rgClean <- subset(rgClean, select=c("Trait1", "Trait2", "rg"))
rgClean$rg[rgClean$rg>1] <- 1 # 3 of the values are >1 which is impossible
library(reshape2)
rgMatrix <- acast(rgClean, Trait2 ~ Trait1)
## convert from half-matrix to full symmetric matrix: TODO: this is a lot of work, is there any better way?
library(psych)
## add redundant top row and last column
rgMatrix <- rbind("ADHD" = rep(NA, 34), rgMatrix)
rgMatrix <- cbind(rgMatrix, "Years of Education" = rep(NA, 35))
## convert from half-matrix to full symmetric matrix
rgMatrix <- lowerUpper(t(rgMatrix), rgMatrix)
## set diagonals to 1
diag(rgMatrix) <- 1
rgMatrix
#                   ADHD Age at Menarche Alzheimer's Anorexia Autism Spectrum Bipolar Birth Length Birth Weight    BMI Childhood IQ College
# ADHD             1.000          -0.121      -0.170    0.174          -0.130  0.5280       -0.043        0.067  0.324       -0.115  -0.397
# Age at Menarche -0.121           1.000       0.061    0.007          -0.079  0.0570        0.014       -0.067 -0.321       -0.076   0.065
# Alzheimer's     -0.170           0.061       1.000    0.108           0.042 -0.0020       -0.135       -0.034 -0.028       -0.362  -0.364
# Anorexia         0.174           0.007       0.108    1.000           0.009  0.1580        0.027       -0.054 -0.140        0.062   0.162
# Autism Spectrum -0.130          -0.079       0.042    0.009           1.000  0.0630        0.195        0.044 -0.003        0.425   0.339
# ...
## genetically independent traits:
independent <- matrix(ncol=35, nrow=35, 0)
diag(independent) <- 1

For a baseline, let’s revisit the single selection case, in which we have 1 trait where higher=better with a heritability of 0.33 where we are choosing from 10 half-related embryos: we can get embryoSelection(10, variance=0.33) ~> 0.62 SD in that case. For a multiple selection version, we can consider a correlation matrix for 35 traits in which every trait is uncorrelated, with the same settings (higher=better, 0.33 heritability, 10 half-related siblings): with more traits to sum, the extremes become more extreme: for example, the ‘largest’ is on average +3.7SDs (likewise smallest). This fact of increased variance means that selection has more to work with. Finally, what if we populate the correlation matrix with genetic correlations like those in the LD Hub dataset (ignoring the issues of trait-specific heritabilities, direction of losses/gains, and available polygenic scores)?
Do we get less or more than 3.7SD because now the intercorrelations happen to (unfortunately for selection) make traits cancel out, reducing variance? No; we get more variance, +5.3SDs. Aside from simulation, the order statistic can be calculated directly: the variance of a sum of correlated variables is the sum of their covariances, so the SD is the square root of the sum of covariances, which then can be plugged into the order statistic function. And we can see how the order statistic grows as we consider more traits.

mean(replicate(100000, max(rnorm(10, mean=0, sd=sqrt(0.33*0.5)))))
# [1] 0.6250647743
library(mvtnorm)
library(MBESS)
## simulate:
mean(replicate(100000, max(rowSums(rmvnorm(10, sigma=cor2cov(independent, sd=rep(sqrt(0.33*0.5),35)), method="svd")))))
# [1] 3.700747303
## analytic:
sqrt(sum(cor2cov(independent, sd=rep(sqrt(0.33*0.5),35))))
# [1] 2.403122968
exactMax(10, sd=2.403122968)
# [1] 3.697812029

mean(replicate(100000, max(rowSums(rmvnorm(10, sigma=cor2cov(rgMatrix, sd=rep(sqrt(0.33*0.5),35)), method="svd")))))
# [1] 5.199043247
mean(replicate(100000, max(rowSums(rmvnorm(10, sigma=cor2cov(rgMatrix, sd=rep(sqrt(0.33*0.5),35)), method="svd")))))
# [1] 5.368492597
sqrt(sum(cor2cov(rgMatrix, sd=rep(sqrt(0.33*0.5),35))))
# [1] 3.468564833
exactMax(10, sd=3.468564833)
# [1] 5.337263609

round(digits=2, unlist(Map(function (n) { SD <- sqrt(sum(cor2cov(rgMatrix[1:n,1:n], sd=rep(sqrt(0.33*0.5), n)))); exactMax(10, sd=SD) }, 2:35)))
# [1] 0.83 1.00 1.27 1.37 1.70 1.82 2.06 2.13 2.24 2.45 2.45 2.56 2.83 2.87 2.91 2.99 3.08 3.14 3.10 3.27 3.56
# 3.66 3.98 4.14 4.13 4.26 4.46 4.40 4.60 4.84 4.90 5.08 5.23 5.34

Next we consider what happens when we include SNP heritabilities (which set upper bounds on the polygenic scores, but see earlier GCTA discussion on why they’re loose upper bounds in practice).
The heritabilities for 173 traits are provided by LD Hub in a different spreadsheet but the trait names don’t always match up with the names in the correlation spreadsheet, so I had to convert them manually. (The height heritability is also missing from the heritability page & spreadsheet so I borrowed a GCTA estimate from .) While we’re at it, I classified the traits by desirability to consistently set larger=better:

utilities <- read.csv(stdin(), header=TRUE, colClasses=c("factor", "factor", "numeric","integer"))
"Trait","Measurement.type","H2_snp","Sign"
"ADHD","d",0.2573,-1
"Age at Menarche","c",0.183,1
"Alzheimer's","d",0.0688,-1
"Anorexia","d",0.559,-1
"Autism Spectrum","d",0.559,-1
"Bipolar","d",0.432,-1
"Birth Length","c",0.1697,-1
"Birth Weight","c",0.1124,1
"BMI","c",0.1855,-1
"Childhood IQ","c",0.2735,1
"College","d",0.0563,1
"Coronary Artery Disease","d",0.0781,-1
"Crohn's Disease","d",0.4799,-1
"Depression","d",0.1745,-1
"Fasting Glucose","c",0.0984,-1
"Fasting Insulin","c",0.0695,-1
"Fasting Proinsulin","c",0.1443,1
"Former/Current Smoker","d",0.0645,-1
"HbA1C","c",0.0656,-1
"HDL","c",0.116,1
"Height","c",0.69,1
"Hip Circumference","c",0.1266,-1
"HOMA-B","c",0.0888,-1
"HOMA-IR","c",0.0686,-1
"Infant Head Circumference","c",0.2352,-1
"LDL","c",0.1347,1
"Lumbar Spine BMD","c",0.2684,1
"Neck BMD","c",0.2977,1
"Rheumatoid Arthritis","d",0.161,-1
"Schizophrenia","d",0.4541,-1
"T2D","d",0.0872,-1
"Total Cholesterol","c",0.1014,-1
"Triglycerides","c",0.1525,-1
"Ulcerative Colitis","d",0.2631,-1
"Years of Education","c",0.0842,1

## What is the distribution of the (univariate) index w/o weights? Up to N(0, 1.58), which is
## much bigger than any of the individual heritabilities:
s <- rmvnorm(10000, sigma=cor2cov(independent, sd=utilities$H2_snp), method="svd") %*% utilities$Sign
mean(s[,1]); sd(s[,1])
# [1] 0.009085073526
# [1] 1.588303057

## Order statistics of generic heritabilities & specific heritabilities:
mean(replicate(100000, max(rmvnorm(10, sigma=cor2cov(independent, sd=rep(sqrt(0.33*0.5),35)), method="svd") %*% utilities$Sign)))
# [1] 3.699494503
mean(replicate(100000, max(rmvnorm(10, sigma=cor2cov(independent, sd=sqrt(utilities$H2_snp * 0.5)), method="svd") %*% utilities$Sign)))
# [1] 2.950714449

mean(replicate(100000, max(rmvnorm(10, sigma=cor2cov(rgMatrix, sd=rep(sqrt(0.33*0.5),35)), method="svd") %*% utilities$Sign)))
# [1] 5.875711644
mean(replicate(100000, max(rmvnorm(10, sigma=cor2cov(rgMatrix, sd=sqrt(utilities$H2_snp * 0.5)), method="svd") %*% utilities$Sign)))
# [1] 4.186435301

Re-estimating with higher=better corrected, the original multiple selection turned out to be somewhat overestimated. Adding the real trait heritabilities, we see that the gains to multiple selection remain large compared to single selection (2.9 or 3.3SDs vs 0.6SDs), and that the genetic correlations do not substantially reduce gains to multiple selection but in fact benefit multiple selection by adding +0.35SD.

##### Multiple selection with utility weights

Continuing onward: if multiple selection is helpful, what sort of net benefit to selection would we get after assigning some reasonable costs to each trait & using current polygenic scores? (One intriguing possibility I won’t cover here: fertility is highly heritable, so one could select for greater fertility; combined with selection on other traits, could this eventually eliminate dysgenic trends at their source?) Coming up with that information for 35 traits in the same detail as I have for intelligence would be extremely challenging, so I will settle for some quicker and dirtier estimates; in cases where the causal impact is not clear or I cannot find reasonably reliable cost estimates, I will simply drop the trait (which will be conservative and underestimate possible gains from multiple selection).

The traits:

• Age at menarche: reports a polygenic score explaining 15.8% of variance. demonstrates a causal impact of early puberty on “earlier first sexual intercourse, earlier first birth and lower educational attainment”, consistent with the intercorrelations (strong negative correlation with childhood IQ); clearly the sign should be negative, and early puberty has been linked to all sorts of problems (: “greater risk for breast cancer, teen pregnancy, HPV, heart disease, diabetes, and all-cause mortality, which is the risk of dying from any cause. There are psychological risks as well. Girls who develop early are at greater risk for depression, are more likely to drink, smoke tobacco and marijuana, and tend to have sex earlier.”) but no costs are available

• : report a polygenic score of 0.021 for AD. The lifetime risk at age 65 is 9% men & 17% women; few people die before age 65 so I’ll take the average 13% as the lifetime risk at birth (since Alzheimer’s rates tend to increase, this should be conservative for future rates). Costs rise steeply before death as the dementia cripples the patient, imposing extraordinary costs for daily care & on families & caregivers. USA total costs have been estimated at >$200b; for dementia, the last 5 years of life can incur . Discounting Alzheimer treatment cost is a little tricky: unlike height/BMI/IQ or BPD which we could treat on an annual cost/gain basis and discount out indefinitely, that $287k of expenses will only be incurred 60+ years after birth on average. We can treat it as a single lump sum expense incurred 70 years in the future, discounted at 5% (as usual to be conservative): 287000 / (1+0.05)^70 ~> 9432. (In discounting late-life diseases, one might say that an ounce of prevention is worth less than a pound of cure.)

• : ~1% prevalence. does not report a polygenic score.

• : ~1.4% prevalence. report that the earlier PGS results found 17% of liability explained (though this does not seem to be reported in the cited original paper/appendix that I can find). ~$4m.

• Birth length, weight: skip as difficult to pin down the causal effects

• “Years of Education”: ~42% of younger Americans have . Selzam et al 2016 reports a polygenic score ~9% for ‘years of education’. A college degree is worth an estimated $250k+ for an American; given that ‘years of education’ is almost genetically identical to college attendance and differences are driven primarily by higher education (since relatively few people drop out), and estimates like that each year correlates with 10% additional income, which would be ~$5k/year and perhaps an SD of 2 years, we might guess somewhere around $50k.

• Coronary Artery Disease: ~40% lifetime risk. report a limited polygenic score explaining 10.6% of the estimated 40% additive heritability or 4.24% of variance. Cardiovascular diseases are some of the most common, expensive, and fatal diseases, and US costs range into the hundreds of billions of dollars. estimates annual costs ~$7k up to age 64 but then ~$31k annually afterwards for total lifetime costs of $599k. Around half of people will be diagnosed by ~age 60, so at a first cut, we might discount it at 599000 / (1+0.05)^60 ~> 32067 or $32k.

• Crohn’s Disease: 0.32% incidence. reports a polygenic score explaining 13.6% of variance. Crohn’s strikes young and lasts a lifetime; estimates $8330 annually or $374850 over the estimated 45 years after diagnosis around age 20, suggesting a discounting of (8330/log(1.05)) / ((1+0.05)^20) ~> 64346.

• Depression: . reports a polygenic score of 0.6%. (’s polygenic score used only the top 17 SNPs, and they don’t report the variance explained of MDD, just the secondary phenotypes.) Another major burden of disease, both common and crippling and frequently fatal, depression has large direct costs for treatment and larger indirect costs from wages, worse health etc. finds children with depression have $300k less lifetime income, which doesn’t take into account the medical treatment costs or suicide etc and is a lower bound.
I can’t find any lifetime costs so I will guesstimate that as the total cost for adults, starting at age 32, giving ~$63k as the cost.

• Fasting glucose/insulin/proinsulin, : skip as their effects should be covered by diabetes.

• Former/Current Smoker: has smoked >100 cigarettes (although by 2016 currently smoking adults were down to ~15% of the population). Supplemental material for reports a polygenic score for ever smoking of 6.7%. The lifetime cost of tobacco smoking includes the direct cost of tobacco, increased lung cancer risk, lower work output, fires, general worsened health, any second hand or fetal effects, and early mortality; the cost, from various perspectives (individual vs national healthcare systems etc) has been heavily debated, but I think it’s safe to put it at at least $100k over a lifetime or $27k discounted.

• Hip Circumference: should be covered by BMI

• HOMA-B/HOMA-IR/Lumbar Spine BMD/Neck BMD: I have no idea where to start with these, so skipping

• Infant Head Circumference: should be covered by IQ and education?

• Rheumatoid Arthritis: . hits explain ~12% and report a polygenic score providing another 5.5%. estimates total annual costs for RA at ~$11542/year and cites a Stone 1984 estimate of lifetime cost of $15,504 ($35,909 in 2016); with typical age of onset around 60, the total annual cost might be discounted to $13k.

• LDL, total cholesterol, triglycerides: harmful effects should be redundant with coronary artery disease

• Ulcerative Colitis: 0.3%. reports a polygenic score of 7.5%. & report $15k medical expenses annually & $5k employment loss. With mean age diagnosis of ~35 (), something like (20000/log(1.05)) / ((1+0.05)^35) ~> 74314.

• Longevity: The ultimate health trait to select for might be life expectancy.
It is inherently an index variable affected by all diseases proportional to their mortality & prevalence, health-related behaviors such as smoking, and to a lesser degree quality of life/overall health (as those provide insurance against death: a very frail person living a long time borders on a contradiction in terms). Life expectancy is a reliable measurement which would be available for almost all participants sooner or later, and which is intuitively valuable. On the downside, GWASes will have difficulty with this trait for a while: the living, who can easily “consent” to biobanks, won’t die for a long time (assuming there is any followup at all), and ‘sequence the graveyards’ is balked by the fact that while the dead cannot be harmed, they can’t give consent either; thus, the GWASes using odd ‘traits’ like “paternal age at death” or “maternal age at death”.

Another difficulty is that the heritability of life expectancy is not as large as one would expect given how heritable many key longevity factors like intelligence29 or BMI are: the by far largest analysis ever done () uses genealogical databases to estimate life expectancy heritabilities in various datasets & methods of 12-18% additive and an additional 1-4% dominance with near-zero epistasis; future PGSes of longevity are thus upper-bounded ~20%. A UKBB parental-lifespan GWAS () finds life expectancy hits enriched in expected places like APOE (Alzheimer’s), smoking/lung-cancer, cardiovascular disease, and type 2 diabetes; they unfortunately do not report SNP heritability or overall PGS variance explained, but do report (more or less equivalently30) that +1SD in their PGS predicts +1 year out of sample:

> When including all independent markers, we find an increase of one standard deviation in PRS increases lifespan by 0.8 to 1.1 years, after doubling observed parent effect sizes to compensate for the imputation of their genotypes (see Table S25 for a comparison of performance of different PRS thresholds). Correspondingly (a gain after doubling for parental imputation) we find a difference in median predicted survival for the top and bottom decile of 5.6/5.6 years for Scottish fathers/mothers, 6.4/4.8 for English & Welsh fathers/mothers and 3/2.8 for Estonian fathers/mothers. In the Estonian Biobank, where data is available for a wider range of subject ages (i.e. beyond median survival age) we find a contrast of 3.5/2.7 years in survival for male/female subjects, across the PRS tenth to first decile (Table 2, Fig. 8)…The magnitude of the distinctions our genetic lifespan score is able to make (5 years of life between top and bottom decile) is meaningful socially and actuarially: the implied distinction in price (14%; Methods) being greater than some recently reported annuity profit margins (8.9%) (41).

The 1SD=1 year should not be pushed too far here. Because life expectancy is not normally distributed, but instead follows the , life expectancy increases run into a ‘wall’ of exponentially increasing mortality (approaching nearly 50% annually for centenarians!), which leads to death-age distributions which look asymmetrically hump-shaped; essentially, the accelerating annual mortality rate means that as age increases, ever larger mortality reductions are necessary to squeeze out another year. Even with large improvements in health, there will be few or no 31 as the large improvements get almost immediately eaten by the acceleration. But it’s probably an acceptable approximation for a few SDs. (There are other issues in interpreting the PGS like what it means when it predicts higher risk of disease32, and life expectancy GWASes should probably move to an explicit competing-risks model.)

So if +1SD PGS = +1 year, how much can that be increased with an order statistic of, say, 5?
embryoSelection(n=5, variance=1)
# [1] 0.8223400656

0.8 years is nothing to sneeze at, and Timmers et al 2018’s PGS can be improved on, demonstrating that life expectancy is potentially an important trait to select on. Without a SNP heritability or exact PGS variance or fitting to a Gompertz-Makeham curve, an upper bound for the standard GWAS’s PGS power is difficult to establish, but assuming a fairly common SNP heritability fraction of ~50% of additive heritability, the maximum of 20% additive+dominance heritability from Kaplanis et al 2018, and ~1% variance from Timmers et al 2018 (with phenotypic SD of 10), then the upper bound is 10% variance (half the heritability), with a r=0.31 () of +3.1 years for each PGS +SD, giving an embryo selection with n=5 of not but . Years of life are typically valued >$50,000. Discounting out to 80 years where life expectancy gains kick in, +2.5 years would be worth >$2,500 now ((2.5*50000) / (1+0.05)^80).

A weakness of longevity PGSes is that they may be too much of an index trait: the effect of any contributing factor is washed out by the effects of all the other factors. Breaking it down into pieces may afford greater gains, if those pieces are more heritable and more predictable; for example, one could increase longevity by selecting on BMI, intelligence, and tobacco smoking simultaneously (all of which can be measured without requiring participants to have died first and have been the subject of extensive and often highly successful GWASes). These traits would also have additional benefits earlier in life by increasing average QALY. Calculating possible gains would require considerably more work, though, and requires a full table of genetic correlations with longevity, so I will omit it from the simulation.
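The discounting arithmetic used for the trait costs above can be collected into two small helpers. This is only a sketch, not code from the original analysis: the names `discountLumpSum` and `discountAnnuity` are hypothetical, and the `annual / log(1 + rate)` term simply mirrors the expressions quoted in the text for costs that recur every year after onset.

```r
## Hypothetical helpers (illustrative names, not part of the original analysis).
## Present value of a single cost incurred `years` from now:
discountLumpSum <- function(cost, years, rate=0.05) { cost / (1 + rate)^years }
## Present value of an ongoing annual cost that starts `years` from now,
## using the same annual/log(1+rate) form as the text:
discountAnnuity <- function(annual, years, rate=0.05) {
    (annual / log(1 + rate)) / (1 + rate)^years }

alzheimers <- discountLumpSum(287000, 70)       # lump sum 70 years after birth
crohns     <- discountAnnuity(8330, 20)         # annual cost from ~age 20
colitis    <- discountAnnuity(20000, 35)        # annual cost from ~age 35
longevity  <- discountLumpSum(2.5 * 50000, 80)  # +2.5 life-years valued at $50k each
round(c(alzheimers, crohns, colitis, longevity))
```

The four results should land within a few dollars of the figures in the text (roughly $9.4k, $64k, $74k, and $2.5k respectively), which makes it easy to see how strongly a 5% rate shrinks costs that arrive late in life.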
This gives us 16 usable traits:

liabilityThresholdValue <- function(populationFraction, gainSD, value) {
    if (value<0) { fraction <- pnorm(qnorm(populationFraction) - gainSD) } else {
                   fraction <- pnorm(qnorm(populationFraction) + gainSD) }
    gain <- (fraction - populationFraction) * value
    return(gain) }
## handle both continuous & dichotomous traits:
polygenicValue <- function(populationFraction, value, polygenicScore, n=10) {
    gainSD <- embryoSelection(n=n, variance=polygenicScore)
    if (populationFraction==1) {
        if (value<0) { gainSD <- -gainSD }; return(gainSD*value) } else {
        ## the value of increasing healthy fraction:
        liabilityThresholdValue(populationFraction, gainSD, value) }
}
## examples for single selection: BMI
polygenicValue(1, -16750, 0.153)
# [1] 7125.151193
## example: IQ
iqHigh <- 16151*15
selzam2016 <- 0.035
polygenicValue(1, iqHigh, selzam2016)
# [1] 49347.3885
## example: bipolar
polygenicValue(0.03, -87088, 0.0283)
# [1] 911.28581092779
## example: college
polygenicValue(0.42, 250000, 0.03)
# [1] 18670.466258862

utilitiesScores <- read.csv(stdin(), header=TRUE, colClasses=c("factor","factor","numeric", "numeric","numeric"))
Trait, Measurement.type, Prevalence, Cost, Polygenic.score
"ADHD","d", 0.07, -182000, 0.015
"Age at Menarche","c", 1,0,0.158
"Alzheimer's","d",0.13,-9432,0.021
"Anorexia","d",0.01,0,0
"Autism Spectrum","d",0.014,-4000000, 0.17
"Bipolar","d", 0.03, -87088, 0.0283
"Birth Length","c",1,0,0
"Birth Weight","c",1,0,0
"BMI","c",1, -16750, 0.153
"Childhood IQ","c", 1, 242265, 0.035
"College","d",0.42,250000,0.03
"Coronary Artery Disease","d",0.404,-32000,0.0424
"Crohn's Disease","d",0.0032,-64346,0.136
"Depression","d",0.17,-62959,0.006
"Fasting Glucose","c",1,0,0
"Fasting Insulin","c",1,0,0
"Fasting Proinsulin","c",1,0,0
"Former/Current Smoker","d",0.42,-27327,0.067
"HbA1C","c",1,0,0
"HDL","c",1,0,0
"Height","c",1, 1616, 0.60
"Hip Circumference","c",1,0,0
"HOMA-B","c",1,0,0
"HOMA-IR","c",1,0,0
"Infant Head Circumference","c",1,0,0
"LDL","c",1,0,0
"Lumbar Spine BMD","c",1,0,0
"Neck BMD","c",1,0,0
"Rheumatoid Arthritis","d",0.0265,-12664,0.175
"Schizophrenia","d",0.01, -49379, 0.184
"T2D","d", 0.40, -124600, 0.0573
"Total Cholesterol","c",1,0,0
"Triglycerides","c",1,0,0
"Ulcerative Colitis","d",0.003,-74314,0.075
"Years of Education","c",1,50000,0.09

utilitiesScores$Value <- with(utilitiesScores,
    ifelse((Measurement.type=="c"),
        Cost,
        unlist(Map(liabilityThresholdValue, Prevalence, 1, Cost))))
round(utilitiesScores$Value)
#  [1]  11530      0   1068      0  53225   2440      0      0 -16750 242265  91899   9506    200
# [14]   9108      0      0      0   8343      0      0   1616      0      0
# [24]      0      0      0      0      0    314    472  36752      0      0    216  50000

## What is the utility distribution of the index using the heritability upper bound? N(1k, 73k)
s <- rmvnorm(10000, sigma=cor2cov(independent, sd=utilities$H2_snp), method="svd") %*% utilitiesScores$Value
mean(s[,1]); sd(s[,1])
# [1] 861.8971719
# [1] 73530.39912
## And with the PGSes, N(81, 6.8k)
s <- rmvnorm(10000, sigma=cor2cov(independent, sd=0.000001+utilitiesScores$Polygenic.score * 0.5), method="svd") %*% utilitiesScores$Value
mean(s[,1]); sd(s[,1])
# [1] 81.5585318
# [1] 6876.864556

## Order statistics:
mean(replicate(10000, max(rmvnorm(10, sigma=cor2cov(independent, sd=sqrt(0.000001+utilitiesScores$Polygenic.score * 0.5)), method="svd") %*% utilitiesScores$Value) ))
# [1] 60583.37613
mean(replicate(10000, max(rmvnorm(10, sigma=cor2cov(rgMatrix, sd=sqrt(0.000001+utilitiesScores$Polygenic.score * 0.5)), method="svd") %*% utilitiesScores$Value) ))
# [1] 91093.1894
mean(replicate(10000, max(rmvnorm(10, sigma=cor2cov(independent, sd=sqrt(utilities$H2_snp * 0.5)), method="svd") %*% utilitiesScores$Value) ))
# [1] 148336.8512
mean(replicate(10000, max(rmvnorm(10, sigma=cor2cov(rgMatrix, sd=sqrt(utilities$H2_snp * 0.5)), method="svd") %*% utilitiesScores$Value) ))
# [1] 192998.9909

So with current polygenic scores, we could expect a gain of ~$91k out of 10 embryos (at least, before the inevitable losses of the IVF process), which is indeed more than expected for IQ on its own (which was $49k). We could also take a look at the expected gain if we could have perfect polygenic scores equal to the SNP heritabilities; then we would get as much as ~$193k.

###### Robustness of utility weights

As these utility weights are largely guesses, one might wonder how robust they are to errors or differences in preferences. As far as preferences go, I take the medical economics literature on QALYs & preference elicitation as suggesting that people agree to a great extent about how desirable various forms of health are (the occasional counterexample like deaf parents selecting for deafness being the exceptions that prove the rule), so differences in preferences may not be a big deal. But errors are worrisome, as it’s unclear how to estimate them (eg the example of valuing education & intelligence—most real-world estimates will hopelessly confound them…). However, decision theory has long noted that zero/one binary weights for decision-making (eg “improper linear models” or ) perform surprisingly well compared to the true weights on both decision-making & prediction (in what may be one of the rare ), and if zero-one weights can, perhaps noisy weights aren’t a big deal either.

Simulating scenarios out, it turns out that multiple selection appears fairly robust to noise in utility weights. This makes sense in retrospect: as ever, we are trying to rank rather than estimate, which is easier; and because we are using many traits rather than one, the variance of the index is greater, so the gap between each sample is greater and the less likely #2 is to really be #1. Even if it is, the regret (the difference between what we picked and what we would’ve picked if we had used the true utility weights) is probably not too great on average.

Specifically, I generate a multivariate sample of n embryos, value them with the given utility weights, then revalue them with the same utility weights corrupted by an error drawn uniformly from 50-150% (so over or underestimated by up to 50%). Then we see how often the erroneous max leads to the same decision, what the true rank was, and the ‘regret’ (the difference in value between the true best embryo and the selected embryo, which may be small even if a non-best embryo is picked). In practice, despite these large errors in utility weights, with both correlation matrices and the same parameters as before (SNP heritability ceiling, utility weights, n=10), the same decision is made >85% of the time, the ranks hardly change and only very rarely does the error go as low as fourth-best (a rank of 3, counting the true best as 0), and the regret is tiny compared to the general gains:

multivariateUtilityWeightError <- function(rgs, utilities, heritabilities, minError=0.5, maxError=1.5, n=10, iters=10000, verbose=FALSE) {
    m <- t(replicate(iters, {
        ## draw n sibling embryos' genetic scores on every trait:
        samples <- rmvnorm(n, sigma=cor2cov(rgs, sd=sqrt(heritabilities * 0.5)), method="svd")

        ## value each embryo under the true weights & under multiplicatively-corrupted weights:
        samplesTrue  <- samples %*% utilities
        samplesError <- samples %*% (utilities * runif(length(utilities), min=minError, max=maxError))

        trueMax    <- which.max(samplesTrue)
        falseMax   <- which.max(samplesError)
        correctMax <- trueMax == falseMax
        rank   <- n - rank(samplesTrue)[falseMax] # flip: 0=max,lower is better
        regret <- max(samplesTrue) - samplesTrue[falseMax]

        if (verbose) { print(samplesTrue); print(samplesError); print(trueMax);
                       print(falseMax); print(correctMax); print(regret); }
        return(c(correctMax, rank, regret)) } ))

    colnames(m) <- c("Max.P", "Rank", "Regret")
    return(m)
}

summary(multivariateUtilityWeightError(independent, utilitiesScores$Value, utilities$H2_snp))
#     Max.P            Rank           Regret
# Min.   :0.000   Min.   :0.000   Min.   :    0.000
# 1st Qu.:1.000   1st Qu.:0.000   1st Qu.:    0.000
# Median :1.000   Median :0.000   Median :    0.000
# Mean   :0.857   Mean   :0.181   Mean   : 2231.187
# 3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:    0.000
# Max.   :1.000   Max.   :3.000   Max.   :64713.835
summary(multivariateUtilityWeightError(rgMatrix,    utilitiesScores$Value, utilities$H2_snp))
#     Max.P            Rank           Regret
# Min.   :0.000   Min.   :0.000   Min.   :    0.0000
# 1st Qu.:1.000   1st Qu.:0.000   1st Qu.:    0.0000
# Median :1.000   Median :0.000   Median :    0.0000
# Mean   :0.936   Mean   :0.072   Mean   :  654.1218
# 3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:    0.0000
# Max.   :1.000   Max.   :3.000   Max.   :51062.2298

## Gamete selection

One alternative to selection on the embryo level is to instead select on gametes, eggs or sperm cells. (This is briefly mentioned in primarily as a way to work around ‘ethical’ concerns about discarding embryos, but they do not notice the considerable statistical advantages of gamete selection.) Hypothetically, during , after the final meiosis, there are 4 spermatids/spermatozoids, one of which could be destructively sequenced and allow scoring of the others; something similar might be doable with the polar bodies in . Such gamete selection is probably infeasible as it would likely require surgery & being able to grow gametes to maturity in a lab environment & would be very expensive. (Are there ways to do the inference on each gamete more easily? Perhaps some sort of DNA tagging with fluorescent markers could work?)

But if gamete selection were possible, it would increase gains from selection: by , since eggs and sperms are haploid & sum for additive genetic purposes, maximizing over them separately will yield a bigger increase than summing them at random (canceling out variance) and only then maximizing. If we are selecting on embryo, a good egg might be fertilized by a bad sperm or vice-versa, negating some of the benefits.

If we have embryos distributed as $\mathcal{N}(0, \sigma^2)$, such as our concrete example using the GCTA upper bound, then we can split it into the sum of two gamete distributions, which for two random normals is $\mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$, but we specified the means as 0 and we know a priori there should be no particular difference in additive SNP genetic variance between eggs and sperms, so the variances must also be equal, so we have $\mathcal{N}(0, \sigma^2)$ as the sum and $\mathcal{N}(0, \sigma^2/2) + \mathcal{N}(0, \sigma^2/2)$ as the factorized version which we can maximize on. Since we don’t know what the variance of gametes is, we work backwards from the given variance by halving it. With the derived normal distributions, we then sum their expected maximums.

For identical numbers of gametes, there is a noticeable gain from doing gamete selection rather than embryo selection:

gameteSelection <- function(n1, n2, variance=1/3, relatedness=0) {
exactMax(n1, sd=sqrt(variance*(1-relatedness) / 2)) + exactMax(n2, sd=sqrt(variance*(1-relatedness) / 2))  }

gameteSelection(5,5)
# [1] 0.949556516
embryoSelection(5)
# [1] 0.4747782582
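`exactMax` and `embryoSelection` are defined earlier in the document; for readers arriving at this section directly, a minimal stand-in can be sketched by numerically integrating the density of the maximum. The names `expectedMaxSketch`, `embryoSketch`, and `gameteSketch` are hypothetical, and the variance conventions (half the PGS variance between sibling embryos, and the same standard deviation for each gamete pool) are copied from the calls above rather than derived independently.

```r
## Expected maximum of n iid N(0, sd^2) draws:
## integrate x * n * f(x) * F(x)^(n-1) over the real line, then rescale by sd.
expectedMaxSketch <- function(n, sd=1) {
    integrand <- function(x) { x * n * dnorm(x) * pnorm(x)^(n-1) }
    sd * integrate(integrand, -Inf, Inf)$value }

## Stand-in for embryoSelection(): siblings share half the variance.
embryoSketch <- function(n, variance=1/3) {
    expectedMaxSketch(n, sd=sqrt(variance * 0.5)) }

## Stand-in for gameteSelection() with relatedness=0: maximize each pool
## separately, then sum the two expected maxima.
gameteSketch <- function(n1, n2, variance=1/3) {
    expectedMaxSketch(n1, sd=sqrt(variance / 2)) +
    expectedMaxSketch(n2, sd=sqrt(variance / 2)) }

embryoSketch(5)
gameteSketch(5, 5)
```

Under these conventions the two-pool figure is exactly twice the single-pool embryo figure for equal n, which is where the doubling in the output above comes from. Plain numerical integration is only suitable for modest n; for pools of thousands or more the integrand becomes too sharply peaked and a dedicated routine or an approximation is the safer choice.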

Naturally, there is no reason two-stage selection could not be done here: select on eggs/sperm, fertilize in rank order, and do a second stage of embryo selection. This would yield roughly additive gains.

Given unlimited funds (or some magical way of bulk non-destructively sequencing sperm), one could exploit the fact that any given sperm donation typically contains enormous numbers of viable sperm, and that donations can easily be collected in indefinitely large amounts, to benefit from extreme selection without embryo selection’s hard limit of egg count. For example, selection out of 10,000 sperm and 5 eggs would on its own represent a roughly 2SD gain (before a second stage of embryo selection):

gameteSelection(10000, 5)
# [1] 2.03821926

### Sperm Phenotype Selection

A possible adjunct to embryo selection is sperm selection. Non-destructive sequencing is not yet possible, but measuring phenotypic correlates of genetic quality (such as sperm speed/motility) is. These correlations of sperm quality/genetic quality are, however, small and confounded in current studies by between-individual variation. Optimistically, the gain from such sperm selection is probably small, <0.1SD, and there do not appear to be any easy ways to boost this effect. Sperm selection is probably cost-effective and a good enhancement of existing IVF practices, but not particularly notable.

One way towards gamete selection, while avoiding the need for non-destructive bulk sequencing or exotic approaches like chromosome transplantation, would be to find an easily-measured phenotype which correlates with genetic quality and which can be selected on. may offer one such family of phenotypes.

In the case of sperm selection, such a phenotype need be only slightly correlated for there to be benefits, because a male ejaculate sample typically contains millions of sperm (>15m/mL, >1mL/sample), and one could easily obtain dozens of ejaculates if necessary (unlike the difficulty of getting eggs or embryos). For example, adding one particular chemical to a sperm solution , allowing biasing fertilization towards either male (faster sperm) or female (slower sperm) embryos simply by selecting based on speed. A simple way to select on sperm might be to put them in a maze or ‘channel’ (), and then wait to see which ones reach the exit first; those will be the fastest, and exit in rank order.

Some studies have correlated measures of sperm quality with health/intelligence (/, , ).

There is reason to think that at least some of this is due to genetic rather than purely individual-level phenotypic health. Individual sperm vary widely in mutation count & aneuploidy & genetic abnormalities, and the is at least partially due to mosaicism in mutations in spermatogonia; to the extent that these are pleiotropic in affecting both sperm function (which is downstream of things like mitochondria) and future health, faster sperm will cause healthier people (see Pierce et al 2009). Haploid cells are exposed to more selection than diploid cells (), and are intrinsically more fragile; highly speculatively, one could imagine that sperm-relevant genes might be deliberately fragile, and extra pleiotropic, as a way to ensure only the best sperm have a chance at fertilization (such a mechanism would increase inclusive fitness).

In the usual IVF case of a father, rather than a sperm donor, the relevant measures of sperm quality must be sperm-specific; a measure like sperm density is useful for selecting among all sperm donors, but is irrelevant when you are starting with a single male. Sperm density is between-individual & between-ejaculate, not within-individual and between-sperm. Sperm motility can be measured on an individual sperm basis, however, and Arden et al 2008 provides a correlation of r=0.14 between intelligence & sperm motility; unfortunately, that correlation is still between-individual as it is average sperm motility of individuals correlated against individual IQs. Bulk sequencing of individual sperm cells has recently become possible (eg ), but has not yet been done to disentangle within-ejaculate from between-individual variation.

Given how general health definitely affects sperm quality, we can be sure that a correlation like Arden et al 2008’s is at least partially due to between-individual factors and is not purely within-ejaculate. I would speculate that at least half of it is between-individual, and the within-ejaculate correlation is much smaller. Further, is the relationship between sperm motility and a phenotype like intelligence even a bivariate normal to begin with? There could easily be a ceiling effect: perhaps sperm quality reflects occasional harmful de novo mutations and major errors in meiosis, but then once a baseline healthy sperm has been selected, there are no further gains and sperm motility merely reflects non-genetic factors of no value. Without individual sperm sequencing (particularly PGSes)/motility datasets, there’s no way to know.

So for illustration I’ll use r=0.07, and consider this as an upper bound.

Currently, sperm selection in IVF is done in an , often requiring a fertility specialist to visually examine a few thousand sperm to pick one; this likely doesn’t come close to stringently selecting from the entire sample. But, given, say, 1 billion sperm from an ejaculate, the expected maximum on some normally-distributed trait would be +6.06SD. This would then be deflated by the r of that sperm trait with a target phenotype like birth defects or health or intelligence. The final sperm then fertilizes an egg and contributes half the genes to the embryo; so since both gametes are haploid and only have half the possible genes, and variances sum, the variance of any genetic trait must be half that of an embryo/adult. So crudely, sperm selection could accomplish something like , or with r=0.07, <0.1SD.

Use of the maximum implies that a single sperm is being selected out and used to fertilize an egg. (There are probably multiple eggs, but one could do multiple selections from ejaculate as ranking motility is so easy.) Ensuring a single hand-chosen sperm fertilizes the egg is in fact feasible using (ICSI), and routine. However, if traditional IVF is used without ICSI, the selection must be relaxed to provide the top few tens of thousands of sperm in order for one sperm to fertilize an egg. This reduces the possible gain somewhat: if there are 1 billion sperm in an ejaculate and we want the top 50,000, that’s equivalent to max-of-20,000, giving a new maximum of ~+4SD & thus a <0.07SD possible gain. Either way, though, the gain is small. (This would explain the difficulty in correlating use of ICSI with improvements in pregnancy or birth defect rates (): current sperm selection is weak, and the maximum effect would be subtle at best, and so easy to miss with the usual small n.)

Sperm selection can’t be rescued by increasing the sample size because, while sperm are easy to obtain, it is already into steeply diminishing returns; increasing to 10 billion would yield <0.113SD, and to 100 billion would yield <0.118SD. Improving measurements also appears to not be an option: existing sperm measurements already pretty much exactly measure the trait of interest, and the near-zero correlations are intrinsic. (Fundamentally, while there may be overlap, a sperm is not a brain much less a fully-grown human, and there’s only so much you can learn by watching it wiggle around.) In the event that the egg bottleneck is broken and one has the luxury of potentially throwing away eggs, this will probably be even more true of eggs: eggs don’t do much, they just sit around for decades until they are ovulated or die (and, since they don’t divide, suffer from less of a ‘maternal age effect’ as well).
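The diminishing returns can be checked with a quick approximation. This sketch uses Blom's approximation to the expected maximum of n standard normals rather than the `exactMax` function used elsewhere, so its figures can differ from the text's in the second decimal place. `blomMax` and `deflation` are hypothetical names, and the deflation factor of r × 0.25 is an assumption: it is read back from the <0.113SD and <0.118SD figures quoted above, not taken from a formula stated in the text.

```r
## Blom's approximation to E[max] of n iid standard normals,
## qnorm((n - 0.375) / (n + 0.25)). The upper tail is used so that very
## large n does not lose floating-point precision near probability 1.
blomMax <- function(n) { qnorm(0.625 / (n + 0.25), lower.tail=FALSE) }

## ASSUMPTION: total deflation of r * 0.25 with r=0.07, inferred from the
## figures in the text rather than from a stated formula.
deflation <- 0.07 * 0.25

n <- c(1e9, 1e10, 1e11)
maxima <- blomMax(n)
round(maxima, 2)              # raw expected maxima in SDs of the sperm trait
round(maxima * deflation, 3)  # implied gain on the target phenotype
```

Each tenfold increase in sperm count adds only about a third of an SD to the raw maximum (roughly 6.1, 6.4, and 6.8SD), and after deflation the differences shrink to a few thousandths of an SD, which is the sense in which sample size cannot rescue sperm selection.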

On the bright side, sperm selection could potentially be as useful as embryo selection circa 2018, the true usefulness is easily researched with screening+single-cell-sequencing, can be made extremely inexpensive & done in bulk33, and given the manual procedures currently used could actually reduce total IVF costs (by eliminating the need for fertility specialists to squint through microscopes & chase down sperm for ICSI). So it may do something useful and be a meaningful improvement in IVF procedures, even if the individual-level effect is subtle at best.

### Chromosome selection

The logic can be extended further: embryo selection is weak because it operates only on final embryos where all the genetic variants have been randomized by meiosis of unselected sperm/eggs, fertilized in an unselected manner, and summed up inside a single embryo which we can take or leave; by that point, much of the genetic variance has been averaged out and the CLT gives us a narrow distribution of embryos around a mean with small order statistics.

We can (potentially) do better by going lower and selecting on sperm/eggs before they combine.

But we could do better than that by selecting on individual chromosomes before they are assembled into spermatocytes, rather than taking random unselected assortments inside sperm/eggs/embryos, what we might call . (As regular embryo selection doesn’t cleanly transfer to the chromosome level, the actual ‘selection’ might be accomplished by other methods like repeated chromosome transplantation: , Paulis et al 2015.) If one could select the best of each pair of chromosomes, and clone it to create a spermatocyte which has two copies of the best one, rendering it homozygous, then all of the sperm it created will still post-meiosis have the same assortment of chromosomes. By avoiding the usual randomization from crossover in the meiosis creating sperm, this necessarily reduces variance considerably, but one could take the top k such chromosome combinations or perhaps take the top k% of spermatocytes, in order to boost the mean while still having a random distribution around it.

There are 22 pairs of autosomal chromosomes and the sex chromosome pair (XX & XY) for the female and the male respectively; which chromosome in each pair is passed on is usually selected at random, but they could also be sequenced & the best chromosome selected, so one gets 22+22+1=45 binary choices, with approximately 4 million unique selections (2^22) for the 22 autosomal pairs of a single parent alone. (+1 because you can select which of 2 X chromosomes in a female cell, but you can’t select between the male’s XY.) It is vanishingly unlikely to randomly select the best out of all 22 pairs in one parent, much less both. We can take a total PGS for a human, like 33%, and break it down across a genome by chromosome length; then we take the 2nd order statistic of that fraction of variance, and sum over the 45 chromosomes, giving us a selection boost as high as +2SD (maxing out at +3.65SD, apparently).

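chromosomeSelection <- function(variance=1/3) {
    ## fraction of the genome on chromosomes 1-22, followed by X:
    chromosomeLengths <- c(0.0821,0.0799,0.0654,0.0628,0.0599,0.0564,0.0526,0.0479,0.0457,0.0441,
                           0.0446,0.0440,0.0377,0.0353,0.0336,0.0298,0.0275,0.0265,0.0193,0.0213,0.0154,0.0168,0.0515)
    x2 <- 0.5641895835 # expected maximum of 2 standard normals, 1/sqrt(pi)
    ## female: 22 autosomal pairs + XX; male: 22 autosomal pairs only (no choice within XY)
    f <- x2 * sqrt((chromosomeLengths[1:23] / 2) * variance)
    m <- x2 * sqrt((chromosomeLengths[1:22] / 2) * variance)
    sum(f, m) }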
chromosomeSelection()
# [1] 2.10490714
chromosomeSelection(variance=1)
# [1] 3.645806112

For comparison, an embryo selection approach with 1/3 PGS would require somewhere closer to n=5 million to reach +2.10SD in a single shot. As humans have relatively few chromosomes compared to many plants or insects, and thus reduced variance, this would presumably be even more effective in agricultural breeding.
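To see why a larger chromosome count would help, consider a simplified version of the calculation above. This is a sketch under an idealization that is not in the original: k equal-length chromosome pairs in a single parent (real chromosomes differ in length), with `equalChromosomeGain` as a hypothetical name. Each pair contributes the expected maximum of 2 draws, 1/sqrt(pi), times the SD of that chromosome's share of the variance, so the total grows with the square root of k.

```r
## Hypothetical simplification: k equal-length chromosome pairs, one parent.
equalChromosomeGain <- function(k, variance=1/3) {
    x2 <- 1 / sqrt(pi)                              # expected max of 2 standard normals
    perChromosomeSD <- sqrt((1/k) / 2 * variance)   # same form as in chromosomeSelection()
    k * x2 * perChromosomeSD }                      # = x2 * sqrt(k * variance / 2)

## Gains for a few chromosome counts, in SDs:
sapply(c(10, 23, 40, 100), equalChromosomeGain)
```

Because the sum of k terms each proportional to sqrt(1/k) scales as sqrt(k), quadrupling the number of chromosome pairs doubles the expected gain for a fixed total variance, consistent with the suggestion that species with many chromosomes would benefit more.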

An additional unusual angle which might or might not increase variance further, and thus increase selective efficacy, would be to modify the rate of meiotic crossover directly.

During meiosis, each parent’s paired chromosomes cross over, but typically in only a small number of places, like 2. The rate of crossover/recombination is affected by ambient chemicals, but is also under genetic control and can be increased as much as 3-8 fold (Mieulet et al 2018). In plant breeding, increases in meiotic crossover are useful for the purpose of “reverse breeding” (): a breeder might want to create a new organism which has a precise set of alleles which exist in a current line, but those alleles might be in the wrong linkage disequilibrium such that a desired allele always comes with an undesirable hitchhiker, or it is merely improbable for the right assortment to be inherited given just occasional crossovers, requiring extreme numbers of organisms to be raised in order to get the one desired one; increases in meiotic crossover can greatly increase the odds of getting that one desired set. Increases in recombination rates also assist long-term selection by breaking up haplotypes to expose additional combinations of otherwise-correlated alleles, some of which are good and some of which are bad (and of course new mutations are always happening); so if not selecting on phenotype alone, new polygenic scores must be re-estimated every few generations to account for the new mutations & changes in correlations.

The total gain over selection programs of reasonable length such as 10–40 generations appears to be on the order of 10–30% in simulation studies to date, with gains requiring “at least three to four generations” ( & studies reviewed in it). This is a relatively modest but still substantial possible gain.

How about in humans? The same arguments would apply to multi-generation uses of embryo selection, but likely much less so, since selection will be far less intense than in the simulated agricultural models. The benefit should also be roughly nil in a single application of embryo selection, since so little genetic variance will be used (thus there’s no particular benefit from breaking up haplotypes to expose new combinations), the per-generation gain is presumably small (a few percent), and an increase in recombination rate would, if anything, degrade the available PGSes’ predictive power by breaking the LD patterns they depend on.

But perhaps the order statistics perspective can rescue single-generation embryo selection: would increases in meiotic crossover in human embryos lead to greater variance (aside from the PGS problem)? It’s not clear to me; arguably, it wouldn’t help on average, merely smooth out the normal distribution by reducing the ‘chunkiness’ of maternal/paternal averaging. One way it might help is if there is hidden variance: many causal variants are on the same contemporary haplotypes and are canceling each other out, in which case increased meiotic crossover would break them up and expose them (eg a haplotype with +1/-1 alleles will net out to 0 and not be selected for or against; it could be broken up by recombination into two haplotypes, now +1 and -1, and begin to show up with phenotypic effects or be selected against).
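
The +1/-1 example can be made concrete with a toy calculation. This is only a sketch under strong assumptions: a single parent carries the +1/-1 haplotype paired with a neutral (0, 0) haplotype, and `r` is a hypothetical per-meiosis probability of a crossover falling between the two loci.

```r
## Variance in gamete value for a parent carrying a (+1, -1) haplotype
## paired with a neutral (0, 0) haplotype; r = crossover probability between the loci.
gameteVariance <- function(r) {
    ## non-recombinant gametes: (+1,-1) & (0,0), both net 0;
    ## recombinant gametes: (+1,0) & (0,-1), net +1 & -1
    values <- c(0, 0, +1, -1)
    probs  <- c((1-r)/2, (1-r)/2, r/2, r/2)
    mu <- sum(probs * values)
    sum(probs * (values - mu)^2)
}
sapply(c(0, 0.01, 0.1, 0.5), gameteVariance)
# [1] 0.00 0.01 0.10 0.50
```

In this toy setup the exposed variance is simply proportional to the crossover rate between the loci: with no crossover the canceling pair is invisible to selection, and a higher rate exposes more of it.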

## Embryo selection versus alternative breeding methods

Genomic selection is increasingly used in animal and plant breeding because it can be used before phenotypes are measurable for faster breeding, and polygenic scores can also correct phenotypic measurements for measurement error & environment. This mention of measurement error understates the value: in the case of a binary or dichotomous or threshold trait, there is only a weak population-wide measurable correlation between genetic liability and whether the trait actually manifests. And the rarer the trait, the worse this is. Returning to schizophrenia as an example, only 1% of the population will develop it, even though it is hugely influenced by genetics; this is because there is a large reservoir of bad variants lurking in the population, and only once in a blue moon do enough bad variants cluster in a single person who is also exposed to the wrong nonshared environment and so develops full-blown schizophrenia. Any sort of selection based on schizophrenia status will be slow, and will get slower as schizophrenia becomes rarer & cases appear less often. However, if one knew all the variants responsible, one could look directly at the whole population and rank by liability score and select based on that. What sort of gain might we expect?

First, we could consider the change in liability scores from simple embryo selection on schizophrenia with the Ripke et al 2014 polygenic score of 18.4%:

mean(simulateIVFCBs(3, 4.6, 0.184, 0.5, 0.96, 0.24, 0, 0, 0)$Trait.SD)
# [1] 0.06622272431

So if embryo selection on schizophrenia were applied to the whole population, we could expect to decrease the liability score by ~0.07SDs the first generation, which would take us from 1% to ~0.84% population prevalence, for a ~16% reduction:

0.01 + liabilityThresholdValue(0.01, -0.066, 1)
# [1] 0.008370483248
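
For readers without the helper function at hand, the same liability-threshold arithmetic can be sketched directly with the normal distribution. `prevalenceAfterShift` is a hypothetical stand-in written for this illustration, assuming liability is distributed N(0,1) and that anyone above a fixed threshold is affected.

```r
## Liability-threshold sketch: liability ~ N(0,1); everyone above the threshold is affected.
prevalenceAfterShift <- function(prevalence, shiftSD) {
    ## a 1% prevalence puts the threshold at ~2.33SD:
    threshold <- qnorm(prevalence, lower.tail=FALSE)
    ## shifting the mean liability by shiftSD is equivalent to moving the threshold the other way:
    pnorm(threshold - shiftSD, lower.tail=FALSE)
}
prevalenceAfterShift(0.01, -0.066)
## roughly 0.0084, ie. ~0.84% prevalence
```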

An alternative to embryo selection would be “truncation selection”: selecting all members of a population which pass a certain phenotypic threshold and breeding from them (eg letting only people over 110IQ reproduce, or in the other direction, not letting any schizophrenics reproduce). This is one of the most easily implemented breeding methods, and is reasonably efficient.

For a continuous trait, truncation selection’s effect is easy to calculate via the breeder’s equation: the increase is given by the selection intensity times the heritability, where the selection intensity of a particular truncation threshold t is given by dnorm(qnorm(t))/(1-t). So if, for example, only the upper third of a population by IQ was allowed to reproduce and using the most optimistic possible additive heritability of <0.8, this truncation selection would yield an increase of <13.1 IQ points:

t=2/3; (dnorm(qnorm(t))/(1-t)) * 0.8 * 15
# [1] 13.08959189

(A more plausible estimate for the additive heritability here, based on , would be 0.5, yielding 8.18 IQ points.)
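
The two figures above can be reproduced with a small wrapper around the breeder’s equation. `truncationGain` is a hypothetical helper introduced here for illustration, assuming a normally distributed trait with an SD of 15.

```r
## Breeder's equation for one generation of truncation selection on a normal trait:
## t = fraction of the population excluded from reproducing, h2 = additive heritability.
truncationGain <- function(t, h2, sd=15) {
    intensity <- dnorm(qnorm(t)) / (1 - t)
    intensity * h2 * sd
}
round(truncationGain(2/3, c(0.8, 0.5)), 2)
# [1] 13.09  8.18
```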

This is noticeably larger than we would get with current polygenic scores for education/intelligence, and shows that for highly heritable continuous traits, it’s hard to beat selection on phenotypes, and so polygenic scores would supplement rather than replace phenotype when reasonably high-quality continuous phenotype data is available.

The effect of a generation of truncation selection on a binary trait following the liability-threshold model is more complicated but follows a similar spirit. A discussion & formula is on pg6 of ; I’ve attempted to implement it in R:

threshold_select <- function(fraction_0, heritability) {
    ## liability threshold (in SDs) below which schizophrenia does not manifest:
    fraction_probit_0 = qnorm(fraction_0)
    ## selection intensity when only the unaffected fraction reproduces:
    s_0 = dnorm(fraction_probit_0) / fraction_0
    ## new unaffected fraction after one selection where 100% of schizophrenics never reproduce:
    fraction_probit_1 = fraction_probit_0 + heritability * s_0
    fraction_1 = pnorm(fraction_probit_1)
    ## how much did we reduce schizophrenia in percentage terms?
    print(paste0("Start: population fraction: ", fraction_0, "; liability threshold: ", fraction_probit_0, "; Selection intensity: ", s_0))
    print(paste0("End: liability threshold: ", fraction_probit_1, "; population fraction: ", fraction_1, "; Total population reduction: ",
                 fraction_0 - fraction_1, "; Percentage reduction: ", (1-((1-fraction_1) / (1-fraction_0)))*100))
}

Assuming 1% prevalence & 80% heritability, 1 generation of truncation selection would yield a ~5.6% decrease in schizophrenia (that is, from 1% to ~0.94%):

threshold_select(0.99, 0.80)
# [1] "Start: population fraction: 0.99; liability threshold: 2.32634787404084; Selection intensity: 0.0269213557610688"
# [1] "End: liability threshold: 2.3478849586497; population fraction: 0.99055982415415; Total population reduction: -0.000559824154150346; Percentage reduction: 5.59824154150346"

(This ignores that schizophrenics already reproduce less and there should be ongoing selection against schizophrenia, and in a sense, truncation selection is already being done, so the ~5% is a bit of an overestimate.)
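
To see how quickly phenotypic selection weakens as a binary trait gets rarer, the same arithmetic can be run across several prevalences. This is a sketch reusing the formula from `threshold_select` above; `relativeReduction` is a hypothetical helper which takes the prevalence directly and returns the percentage reduction rather than printing it.

```r
## Relative reduction (%) in prevalence after one generation in which no affected person reproduces.
relativeReduction <- function(prevalence, heritability) {
    unaffected <- 1 - prevalence
    ## selection intensity when only the unaffected reproduce:
    intensity <- dnorm(qnorm(unaffected)) / unaffected
    newPrevalence <- 1 - pnorm(qnorm(unaffected) + heritability * intensity)
    100 * (1 - newPrevalence / prevalence)
}
## from a common trait (10%) to a rare one (0.1%), all at 80% heritability:
sapply(c(0.10, 0.01, 0.001), relativeReduction, heritability=0.80)
```

The percentage reduction shrinks steadily as prevalence falls (roughly a quarter at 10% prevalence, ~5.6% at 1%, and under 1% at 0.1%), since ever fewer of the liability-increasing variants are carried by people who actually manifest the trait.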

Thus, for rare binary traits, genomic selection methods can do much better than phenotypic selection methods. Which one works better will depend on the details of how rare a trait is, the heritability, available polygenic scores, available embryos etc. Of course, there’s no reason that they can’t both be used, and even phenotype+genotype methods can be improved further by taking into account other information like family histories and environments.

## Multi-stage selection

For an interactive visualization of single-stage versus multi-stage selection, see my page.

As mentioned earlier, there are , and one of them is to draw on how it looks & acts like a logarithmic curve, with an approximation of (R²=0.98), which including the PGS, becomes . This visualizes the diminishing returns which mean that as we increase n, we eke out ever tinier gains and in practice, the optimal n will often be small. Improvements to other aspects, like PGSes, can help, but don’t change the tyranny of the log.

How can this be improved? One way is to attack the $\log(n)$ term directly: that’s only for a single stage of selection. If we have many stages of selection, a process we could call multi-stage selection (not to be confused with /group selection), each one can have a small n but because the mean ratchets upward each time, the gain may be enormous. (The concavity of the log suggests a proof by .) The smaller each stage, the smaller the per-stage gain, but the decrease is not proportional, so the total gain increases.

We might have a fixed n, which can be split up. What if instead of a single stage (yielding $\log(n)$), one instead had many stages, up to a limit of $\frac{n}{2}$ stages with a gain of $\log(2)$ each? Then $\frac{n}{2} \cdot \log(2) > \log(n)$ (for most positive integers), since dropping the constant factor $\frac{\log(2)}{2}$ leaves $n > \log(n)$. Intuitively, the more stages the better, since $n=2$ is the minimum necessary for any selection, and the larger n, the smaller each marginal gain is, so $n=2$ is ideal. Plotting the difference between the two curves as a function of total n:

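n <- 1:100
## expected gain (SD) from taking the best of all n candidates in a single stage:
singleStage <- exactMax(n)
## expected total gain from floor(n/2) successive stages of n=2 each
## (floor() rather than round(), since R rounds halves to even: round(1.5)=2 but round(2.5)=2):
multiStage  <- floor(n/2) * exactMax(2)

df <- data.frame(N.total=n, Total.gain=c(singleStage, multiStage),
                 Type=rep(c("single", "multi"), each=length(n)))
library(ggplot2)
## qplot() is deprecated in recent ggplot2 releases, so build the plot explicitly:
ggplot(df, aes(x=N.total, y=Total.gain, color=Type)) + geom_point() + geom_line()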
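
The snippet above relies on `exactMax`, which is defined elsewhere on this page. As a self-contained check of the same comparison, here is a minimal sketch which computes the expected maximum by numerical integration instead; `expectedMax` is a hypothetical stand-in, and the multi-stage total assumes that per-stage gains simply add up (a perfect predictor and no loss of variance between stages).

```r
## Expected maximum of n standard normal draws, by numerically integrating
## x against the density of the maximum: n * dnorm(x) * pnorm(x)^(n-1).
expectedMax <- function(n) {
    integrate(function(x) { x * n * dnorm(x) * pnorm(x)^(n-1) }, -Inf, Inf)$value
}

n <- c(2, 4, 10, 30, 100)
single <- sapply(n, expectedMax)
## ASSUMPTION: the gain of each n=2 stage is additive across stages:
multi  <- floor(n/2) * expectedMax(2)
data.frame(N.total=n, single=round(single, 2), multi=round(multi, 2))
```

At n=2 the two approaches coincide by construction; past that, the single-stage gain grows only logarithmically while the multi-stage total grows linearly in n.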