Embryo editing for intelligence

A cost-benefit analysis of CRISPR-based editing for intelligence with 2015-2016 state-of-the-art
decision-theory, biology, psychology, statistics, transhumanism, R, power-analysis, Bayes, IQ
2016-01-22–2019-04-04 · in progress · certainty: likely · importance: 10

Embryo editing

One approach not discussed in Shulman & Bostrom is embryo editing, which has compelling advantages over selection:

  1. Embryo selection must be done collectively for any meaningful gains: one must score all viable embryos. Editing can potentially be done singly, editing only the embryo being implanted, and no more if that embryo yields a birth.

  2. Further, a failed implantation is a disaster for embryo selection, since it means one must settle for a possibly much lower-scoring embryo with little or no gain, while still having paid upfront for selection; on the other hand, the editing gains will be largely uniform across all embryos, so the loss of the first embryo is unfortunate (since it means another set of edits and an implantation will be necessary) but not a major setback.

  3. Embryo selection suffers from the curse of thin tails and from being unable to start from a higher mean (except as part of a multi-generational scheme): one takes 2 steps back for every 3 steps forward, so for larger gains, one must brute-force them with a steeply escalating number of embryos.

  4. Embryo selection is constrained by the broad-sense heritability and by the number of available embryos.

    • No matter how many GWASes are done on whole genomes or how fancy the algorithms applied, embryo selection can never surpass the upper bound set by a broad-sense heritability of 0.8; the effects of editing, by contrast, have no known upper bound even in terms of existing population variation, as there appear to be thousands of relevant variants inclusive of all genetic changes, and evolutionarily novel changes could theoretically be made as well1 (a bacterium or a chimpanzee is, after all, not simply a small human who refuses to talk).
    • Editing is also largely independent of the number of embryos, since each embryo can be edited about as much as any other, and as long as one has enough embryos to expect a birth, more embryos make no difference.
  5. Editing scales linearly and smoothly by going to the root of the problem, rather than difficulty increasing exponentially in the gain. An example:

    As calculated above, the Rietveld et al 2013 polygenic score can yield a gain no greater than 2.58 IQ points with 1-in-10 selection, and in practice we can hardly expect more than 0.14-0.56 IQ points, due to fewer than 10 embryos usually being viable etc. But with editing, we can do better: the same study reported 3 SNP hits (rs9320913, rs11584700, rs4851266), each predicting +1 month of schooling, the largest effects translating to ~0.5 IQ points (the education proxy makes the exact value a little tricky, but we can cross-check with other GWASes which regressed directly on fluid intelligence, like Davies et al 2015, whose largest hit, rs10457441, does indeed have a reported beta of -0.0324SD or -0.486 IQ points), with fairly evenly balanced frequencies around 50% as well.2 So one can see immediately that if an embryo is sequenced & found to have the bad variant on any of those SNPs, a single edit has the same gain as the entire embryo-selection process! Unlike embryo selection, there is no inherent reason an editing process must stop at one edit - and with two edits, one doesn't even need to sequence in the first place, as the ~50% frequency means the expected value of editing each SNP blind is 0.25 points, so two blind edits would have the same gain. Continuing, the gains can stack: taking the 15 top hits from Davies et al 2015's Supplementary Table S3, 15 edits would yield 6.35 points, and combined with the next 15 after that would yield ~14 points, blowing past the SNP embryo-selection upper bounds. While on average any given embryo would not need half of those edits, that just means that an editing procedure will go down another entry on the list and edit that instead (given a budget of 15 edits, one might wind up editing, say, #30). Since the estimated effects do not decline too fast and frequencies are high, this is similar to skipping every other edit, and so the gains are still substantial:

    davies2015 <- data.frame(Beta=c(-0.0324, 0.0321, -0.0446, -0.032, 0.0329, 0.0315, 0.0312, 0.0312, -0.0311, -0.0315, -0.0314, 0.0305,
     0.0309, 0.0306, 0.0305, 0.0293, -0.0292, -0.0292, -0.0292, 0.0292, -0.0292, 0.0292, -0.0291, -0.0293, -0.0293, 0.0292, -0.0296,
     -0.0293, -0.0291, 0.0296, -0.0313, -0.047, -0.0295, 0.0295, -0.0292, -0.028, -0.0287, -0.029, 0.0289, 0.0302, -0.0289, 0.0289,
     -0.0281, -0.028, 0.028, -0.028, 0.0281, -0.028, 0.0281, 0.028, 0.028, 0.028, -0.029, 0.029, 0.028, -0.0279, -0.029, 0.0279,
     -0.0289, -0.027, 0.0289, -0.0282, -0.0286, -0.0278, -0.0279, 0.0289, -0.0288, 0.0278, 0.0314, -0.0324, -0.0288, 0.0278, 0.0287,
     0.0278, 0.0277, -0.0287, -0.0268, -0.0287, -0.0287, -0.0272, -0.0277, 0.0277, -0.0286, -0.0276, -0.0267, 0.0276, -0.0277, 0.0284,
     0.0277, -0.0276, 0.0337, 0.0276, 0.0286, -0.0279, 0.0282, 0.0275, -0.0269, -0.0277),
                  Frequency=c(0.4797, 0.5199, 0.2931, 0.4803, 0.5256, 0.4858, 0.484, 0.4858, 0.4791, 0.4802, 0.4805, 0.487, 0.528,
                  0.5018, 0.5196, 0.5191, 0.481, 0.481, 0.4807, 0.5191, 0.4808, 0.5221, 0.4924, 0.3898, 0.3897, 0.5196, 0.3901,
                  0.3897, 0.4755, 0.4861, 0.6679, 0.1534, 0.3653, 0.6351, 0.6266, 0.4772, 0.3747, 0.3714, 0.6292, 0.6885, 0.668,
                  0.3319, 0.3703, 0.3696, 0.6307, 0.3695, 0.6255, 0.3695, 0.3559, 0.6306, 0.6305, 0.6309, 0.316, 0.684, 0.631,
                  0.3692, 0.3143, 0.631, 0.316, 0.4493, 0.6856, 0.6491, 0.6681, 0.3694, 0.3686, 0.6845, 0.3155, 0.6314, 0.2421,
                  0.7459, 0.3142, 0.3606, 0.6859, 0.6315, 0.6305, 0.3157, 0.5364, 0.3144, 0.3141, 0.5876, 0.3686, 0.6314, 0.3227,
                  0.3695, 0.5359, 0.6305, 0.3728, 0.3318, 0.3551, 0.3695, 0.2244, 0.6304, 0.6856, 0.6482, 0.6304, 0.6304, 0.4498, 0.6469))
    davies2015$Beta <- abs(davies2015$Beta)
    # maximum possible gain from the top 30 edits, in IQ points (1 SD = 15):
    sum(head(davies2015$Beta, n=30) * 15)
    # [1] 13.851
    # simulate a finite edit budget: each SNP is editable only if the embryo
    # carries the non-optimal variant, which happens with probability equal to
    # its frequency; edit the first `editBudget` editable SNPs down the list:
    editSample <- function(editBudget) {
        editable <- rbinom(nrow(davies2015), 1, prob=davies2015$Frequency) == 1
        head(davies2015$Beta[editable], n=editBudget) }
    mean(replicate(1000, sum(editSample(30) * 15)))
    # [1] ~13.5
  6. Editing can be done on low-frequency or rare variants, whose effects are known but which will not be available in the embryos in most selection instances.

    For example, George Church lists 10 rare mutations of large effect that may be worth editing into people:

    1. G171V/+ Extra-strong bones3
    2. -/- Lean muscles
    3. -/- Insensitivity to pain
    4. -/- Low odor production
    5. -/- Virus resistance
    6. -/- Low coronary disease
    7. A673T/+ Low Alzheimer's
    8. GH -/- Low cancer
    9. -/+ Low T2 diabetes
    10. E627X/+ Low T1 diabetes

    To which I would add: sleep duration, quality, morningness-eveningness, and resistance to sleep deprivation are, like most traits, heritable. The extreme case is that of “short-sleepers”, the ~1% of the population who normally sleep 3-6h; they often mention a parent who was also a short-sleeper and short sleep starting in childhood, they do not fall asleep faster & don't sleep excessively more on weekends (indicating they are not merely chronically sleep-deprived), and they are anecdotally described as highly energetic multi-taskers, thin, with positive attitudes & high pain thresholds (Monk et al 2001), without any known health effects or downsides in humans4 or mice (aside from, presumably, greater caloric expenditure).

    Some instances of short-sleepers are due to mutations in DEC2 (BHLHE41), with one variant found in short-sleepers vs controls (6.25h vs 8.37h, -127m) and the effect confirmed in knockout mice; another short-sleep variant was identified in a discordant twin pair, with an effect of -64m. DEC2/BHLHE41 SNPs are also rare (for example, 3 such SNPs have frequencies of 0.08%, 3%, & 5%). Hence, selection would be almost entirely ineffective, but editing is easy.
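    To see why selection is nearly useless here, a quick sketch (my numbers: 5 viable embryos per IVF batch, and the population carrier rate approximated as 2p under Hardy-Weinberg for a rare allele - an upper bound, since if neither parent happens to be a carrier, as is usual, no embryo can carry the variant at all):

    ```r
    # probability that at least 1 of n sibling embryos carries a variant of
    # allele frequency p, treating each embryo as an independent draw with
    # carrier probability ~2*p (rare-allele Hardy-Weinberg approximation):
    pSelectable <- function(p, n=5) { 1 - (1 - 2*p)^n }
    pSelectable(0.0008) # the 0.08% DEC2 variant: ~0.008, <1% of 5-embryo batches
    pSelectable(0.03)   # even the 3% variant: only ~0.27
    ```

    So for the rarest DEC2 variant, fewer than 1 in 100 IVF batches would even contain an embryo to select, while an edit is available to every embryo.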

    As far as costs and benefits go, we can observe that being able to stay awake an additional 127 minutes a day is equivalent to being able to live an additional 7 years, and 191 minutes to 11 years; to negate that, any side-effect would have to be tantamount to lifelong smoking of tobacco.
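    The equivalence is simple arithmetic (a sketch, assuming an 80-year lifespan and counting the extra waking minutes as a fraction of all 24 hours):

    ```r
    # extra waking minutes per day as a fraction of a 24h day, over an 80-year life:
    extraYears <- function(minutes, lifespan=80) { (minutes / (24*60)) * lifespan }
    extraYears(127) # ~7 years
    extraYears(191) # ~10.6 years
    ```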

The largest disadvantage of editing, and the largest advantage of embryo selection, is that selection relies on proven, well-understood, known-priced PGD technology already in use for other purposes, while human embryo editing has until recently been science fiction, not fact.

Genome synthesis

The cost of CRISPR editing will scale roughly with the number of edits: 100 edits will cost 10x what 10 edits do. It may also scale superlinearly if each edit makes the next edit more difficult. This poses challenges to profitable editing, since the marginal gain of each edit will keep decreasing as the SNPs with the largest effect sizes are edited first - it's hard to see how 500 or 1000 edits would be profitable. Similar to the daunting cost of iterated embryo selection (IES), where the IES is done only once or a few times and then gametes are distributed en masse to prospective parents to amortize per-child costs down to small amounts, one could imagine doing the same thing for a heavily CRISPR-edited embryo.
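The diminishing marginal return is easy to see by sorting effect sizes: under linear per-edit pricing, cost grows linearly while cumulative gain flattens. A sketch with hypothetical numbers (a smoothly declining effect-size distribution and a made-up $100/edit price, not estimates from any source):

```r
# hypothetical smoothly-declining per-SNP effect sizes (IQ points):
effects <- 0.5 * exp(-0.01 * 0:999)
cumGain <- cumsum(effects)           # cumulative gain flattens out...
cumCost <- 100 * seq_along(effects)  # ...while cost climbs linearly at $100/edit
effects[500] / effects[1]            # ~0.007: the 500th edit buys ~1/150th of the 1st
```

Under any such declining distribution, the gain-per-dollar of the 500th edit is orders of magnitude worse than the first, which is why amortizing one heavily-edited embryo over many parents matters.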

But at some point, doing many edits raises the question of why you are bothering with the wild-type genome at all. Couldn't you just create a whole genome from scratch incorporating every possible edit? Synthesize a whole genome's DNA and incorporate all the edits one wishes: in the design phase, take the GWASes for a wide variety of traits and set each SNP, no matter how weakly estimated, to the positive direction; in copying in the data of regions with rare variants, since they are probably harmful, erase them with the modal human base-pairs at those positions, for systematic health benefits across most diseases; or, to imitate iterated embryo selection, which exploits the tagging power of SNPs to pull in the beneficial rare variants, copy over the haplotypes for beneficial SNPs which might be tagging a rare variant. Between the erasing of mutation load and the exploiting of all common variants simultaneously, the result could be a staggering phenotype.

DNA synthesis, synthesizing a strand of DNA base-pair by base-pair, has long been done, but generally limited to a few hundred BPs, which is much less than the 23 human chromosomes' collective ~3.3 billion BP. Past work in synthesizing genomes includes:

  • Craig Venter's minimal bacterium in 2008, with 582,970 BP (http://www.nature.com/news/2008/080124/full/news.2008.522.html); 1.1 million BP in 2010 (http://www.nature.com/news/2010/100520/full/news.2010.253.html); and 483,000 BP & 531,000 BP in 2016 (http://www.nature.com/news/minimal-cell-raises-stakes-in-race-to-harness-synthetic-life-1.19633), spending somewhere ~$40m on these projects
  • 272,871 BP in 2014 (1 yeast chromosome; 90k BP costing $50k at the time), with plans for synthesizing the whole yeast genome in 5 years (http://www.nature.com/news/first-synthetic-yeast-chromosome-revealed-1.14941; see also http://science.sciencemag.org/content/290/5498/1972 http://science.sciencemag.org/content/329/5987/52 http://science.sciencemag.org/content/342/6156/357 http://science.sciencemag.org/content/333/6040/348)
  • 2,750,000 BP in 2016 for E. coli (http://www.nature.com/news/radically-rewritten-bacterial-genome-unveiled-1.20451; “Design, synthesis, and testing toward a 57-codon genome”, Ostrov et al 2016, http://science.sciencemag.org/content/353/6301/819)


The biggest chromosome is #1, with 8.1% of the base-pairs, or 249,250,621; the smallest is #21, with 1.6%, or 48,129,895. DNA synthesis prices drop each year in an exponential decline (if not remotely as fast as the DNA sequencing cost curve), and 2016 synthesis costs have reached <$0.30/BP; let's say $0.25/BP.

Chromosome  Length in base-pairs  Fraction  Synthesis cost at $0.25/BP
1 249,250,621 0.081 $62,312,655
2 243,199,373 0.079 $60,799,843
3 198,022,430 0.064 $49,505,608
4 191,154,276 0.062 $47,788,569
5 180,915,260 0.058 $45,228,815
6 171,115,067 0.055 $42,778,767
7 159,138,663 0.051 $39,784,666
8 146,364,022 0.047 $36,591,006
9 141,213,431 0.046 $35,303,358
10 135,534,747 0.044 $33,883,687
11 135,006,516 0.044 $33,751,629
12 133,851,895 0.043 $33,462,974
13 115,169,878 0.037 $28,792,470
14 107,349,540 0.035 $26,837,385
15 102,531,392 0.033 $25,632,848
16 90,354,753 0.029 $22,588,688
17 81,195,210 0.026 $20,298,802
18 78,077,248 0.025 $19,519,312
19 59,128,983 0.019 $14,782,246
20 63,025,520 0.020 $15,756,380
21 48,129,895 0.016 $12,032,474
22 51,304,566 0.017 $12,826,142
X 155,270,560 0.050 $38,817,640
Y 59,373,566 0.019 $14,843,392
total 3,095,693,981 1.000 $773,923,495

So the synthesis of one genome in 2016, assuming no economies of scale or further improvement, would come in at ~$773m. This is a staggering but finite and even feasible amount: the original Human Genome Project cost ~$3b, and other large science projects like the LHC, Manhattan Project, ITER, Apollo Program, ISS, the National Children's Study etc have cost many times what 1 human genome would.
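The table above is simply chromosome length times the $0.25/BP price:

```r
# whole-genome synthesis cost at $0.25/BP:
totalBP <- 3095693981
totalBP * 0.25 # $773,923,495
# the largest & smallest chromosomes, matching the table rows:
c(chr1=249250621, chr21=48129895) * 0.25 # $62,312,655 & $12,032,474
```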

  • https://en.wikipedia.org/wiki/Human_Genome_Project_-_Write
  • http://www.nature.com/news/plan-to-synthesize-human-genome-triggers-mixed-response-1.20028
  • http://science.sciencemag.org/content/early/2016/06/01/science.aaf6850.full
  • http://science.sciencemag.org/content/sci/suppl/2016/06/01/science.aaf6850.DC1/Boeke.SM.pdf
  • http://www.nytimes.com/2016/06/03/science/human-genome-project-write-synthetic-dna.html
  • http://diyhpl.us/wiki/transcripts/2017-01-26-george-church/
  • https://www.wired.com/story/live-forever-synthetic-human-genome/
  • https://medium.com/neodotlife/andrew-hessel-human-genome-project-write-d15580dd0885
  • “Is the World Ready for Synthetic People?: Stanford bioengineer Drew Endy doesn't mind bringing dragons to life. What really scares him are humans.”, https://medium.com/neodotlife/q-a-with-drew-endy-bde0950fd038
  • https://www.chemistryworld.com/feature/step-by-step-synthesis-of-dna/3008753.article

Church is optimistic: maybe even $100k/genome by 2037 (“Humans 2.0: these geneticists want to create an artificial genome by synthesising our DNA; Scientists intend to have fully synthesised the genome in a living cell - which would make the material functional - within ten years, at a projected cost of $1 billion”, http://www.wired.co.uk/article/human-genome-synthesise-dna):

> But these are the “byproducts” of HGP-Write, in Hessel's view: the project's true purpose is to create the impetus for technological advances that will lead to these long-term benefits. “Since all these [synthesis] technologies are exponentially improving, we should keep pushing that improvement rather than just turning the crank blindly and expensively,” Church says. In 20 years, this could cut the cost of synthesising a human genome to $100,000, compared to the $12 billion estimated a decade ago.

The benefit of this investment would be to bypass the death-by-a-thousand-cuts of CRISPR editing and create a genome with an arbitrary number of edits on an arbitrary number of traits for a fixed upfront cost. Unlike multiple selection, one would not need to trade off multiple traits against each other (except for pleiotropy); unlike editing, one would not be limited to making only edits with a marginal expected value exceeding the cost of 1 edit. Doing individual genome syntheses will be out of the question for a long time to come, so genome synthesis is like IES in amortizing its cost over many prospective parents.

The “2013 Assisted Reproductive Technology National Summary Report” says ~10% of IVF cycles use donor eggs, with a total of 67,996 infants, implying >6.7k infants conceived with donor eggs, embryos, or sperm (sperm is not covered by that report) who were only half or less related to their parents. What immediate 1-year return over 6.7k infants would justify spending $773m? Considering just the low estimate of IQ at $3270/point and no other traits, that would translate to (773m/6.7k) / 3270 = 35 IQ points, or, at an average IQ gain of 0.1 points per edit, the equivalent of 350 causal edits. This is doable. If we allow amortization at a high discount rate of 5% and reuse the genome indefinitely for each year's crop of 6.7k infants, then we need at least x IQ points where ((x*3270*6700) / log(1.05)) - 773000000 >= 0; x >= 1.73, or, at ~0.1 points per edit, 18 edits. We could also synthesize only 1 chromosome and pay much less upfront (but at the cost of a lower upper bound, as GCTA heritability/length regressions and GWAS polygenic score results indicate that intelligence and many other complex traits are spread very evenly over the genome, so each chromosome will harbor variants proportional to its length).
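The break-even point can be checked directly (same numbers as above: $3270/IQ point, 6,700 infants/year, 5% discount rate, $773m upfront):

```r
# minimum IQ gain per child for the discounted perpetuity of annual benefits
# ($3270/point x 6,700 infants/year, discounted at 5%) to cover $773m upfront:
breakEven <- 773000000 * log(1.05) / (3270 * 6700)
breakEven                # ~1.7 IQ points
ceiling(breakEven / 0.1) # 18 edits at ~0.1 points/edit
```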

The causal-edit problem remains, but at a 12% causal rate, 18 causal edits can easily be made with 150 edits of SNP candidates, which is fewer than are already available. So at first glance, whole-genome synthesis can be profitable with optimization of only one trait using existing GWAS hits, and will be extremely profitable if dozens of traits are optimized and mutation load minimized.

How profitable? …see embryo selection calculations… At 70 SDs and 12% causal, the profit would be 70*15*3270*0.12*6700 - 773000000 = $1,987,534,000 the first year, or an NPV of $55,806,723,536. TODO: only half-relatedness
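Checking that arithmetic (70 SD gain × 15 points/SD × $3270/point × 12% causal × 6,700 infants/year, $773m upfront, 5% discount rate):

```r
upfront <- 773000000
gross   <- 70 * 15 * 3270 * 0.12 * 6700 # value of one year's infants: $2,760,534,000
gross - upfront             # first-year profit: $1,987,534,000
gross / log(1.05) - upfront # NPV of reusing the genome indefinitely: ~$55.8b
```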


  • model the DNA synthesis cost curve; when can we expect a whole human genome to be synthesizable with a single lab's resources, like $1m/$5m/$10m? when does synthesis begin to look better than IES?

    eyeballing http://science.sciencemag.org/content/sci/suppl/2016/06/01/science.aaf6850.DC1/Boeke.SM.pdf : ~0.01 BP/$ (~$100/BP) in 1990, ~3 BP/$ (~$0.33/BP) in 2015?

    # fit an exponential cost-decline to the two eyeballed points:
    synthesis <- data.frame(Year=c(1990, 2015), Cost=c(100, 0.33))
    summary(l <- lm(log(Cost) ~ Year, data=synthesis))
    # project per-BP cost forward & scale up to a whole genome (3,095,693,981 BP):
    prettyNum(big.mark=",", round(exp(predict(l, data.frame(Year=2016:2040))) * 3095693981))
    #             1             2             3             4             5             6             7             8             9            10            11            12            13
    # "812,853,950" "646,774,781" "514,628,265" "409,481,413" "325,817,758" "259,247,936" "206,279,403" "164,133,195" "130,598,137" "103,914,832"  "82,683,356"  "65,789,813"  "52,347,894"
    #            14            15            16            17            18            19            20            21            22            23            24            25
    #  "41,652,375"  "33,142,123"  "26,370,653"  "20,982,703"  "16,695,599"  "13,284,419"  "10,570,198"   "8,410,536"   "6,692,128"   "5,324,818"   "4,236,872"   "3,371,211"

    As the cost appears to be roughly linear in chromosome length, it would be possible to scale down synthesis projects if an entire genome cannot be afforded.

    For example, IQ is highly polygenic and the relevant SNPs & causal variants are spread fairly evenly over the entire genome (as indicated by the original GCTAs showing that SNP heritability per chromosome correlates with chromosome length, and by the locations of GWAS hits), so one could instead synthesize a chromosome accounting for ~1% of base-pairs, which will carry ~1% of the variants at ~1% of the total cost.

    So if a whole genome costs $1b, there are ~10,000 variants, with an average effect of ~0.1 IQ points and a frequency of 50%, then for $10m one could create a chromosome which would improve over a wild genome's chromosome by 10000 * 0.01 * 0.5 * 0.1 = 5 points; then, as resources allow and the synthesis price keeps dropping, create a second small chromosome for another 5 points, and so on with the bigger chromosomes for larger gains.
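    That is: ~1% of the ~10,000 variants land on the small chromosome, half of them (at 50% frequency) need upgrading, each worth ~0.1 points:

    ```r
    variants <- 10000; fractionOfGenome <- 0.01; freq <- 0.5; effect <- 0.1
    variants * fractionOfGenome * freq * effect # 5 IQ points from a 1%-length chromosome
    ```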

“Large-scale de novo DNA synthesis: technologies and applications”, Kosuri & Church 2014, http://arep.med.harvard.edu/pdf/Kosuri_Church_2014.pdf

Cost curve:

[Figure: historical cost curves of genome sequencing & synthesis, 1980-2015 (log scale)]

“Bricks and blueprints: methods and standards for DNA assembly”, Casini et al 2015, http://scienseed.com/clients/tomellis/wp-content/uploads/2015/08/BricksReview.pdf

“Leproust says that won't always be the case - not if her plans to improve the technology work out. ‘In a few years it won't be $100,000 to store that data’, she says. ‘It will be 10 cents.’” (https://www.technologyreview.com/s/610717/this-company-can-encode-your-favorite-song-in-dnafor-100000/, April 2018)


Questions: what is the best way to do genome synthesis? Lots of possibilities:

  • can do one or more chromosomes at a time (which would fit in small budgets)

  • optimize 1 trait to maximize the PGS SNP-wise (but the causal tagging/LD problem…)

  • optimize 1 trait to maximize a haplotype PGS

  • optimize multiple traits with genetic correlations and unit-weighting

  • multiple traits, utility-weighted

  • limited optimization:

    • partial factorial trial of haplotypes (eg take the maximal utility-weighted genome, then flip a random half of the haplotypes for the first genome; flip a different random half for the second genome; etc)

      • this could be used for response-surface estimation: to try to estimate where additivity breaks down and genetic correlations change substantially

    • constrained optimization of haplotypes: maximize the utility-weight subject to a soft constraint like total phenotype increases of >2SD on average (eg if there are 10 traits, allow +20SD of total phenotype change); or a hard constraint, like no trait past >5SD (so at least a few people to ever live could have had a similar PGS value on each trait)

      • because so many real-world outcomes are log-normally distributed and the component normals have thin tails, it will be more efficient to increase 20 traits by 1SD than 1 trait by 20SD

  • modalization: simply take the modal human genome, implicitly reaping gains from removing much mutation load

  • partial optimization of a prototype genome: select some exemplar genome as a baseline, like a modal genome or a very accomplished or very intelligent person, and optimize only a few SD up from that

  • “dose-ranging study”: multiple genomes optimized to various SDs or various hard/soft constraints, to estimate the safe extreme as quickly as possible (eg +5 vs 10 vs 15 SD)

  • exotic changes: adding very rare variants like the short-sleeper or myostatin mutations; increasing CNVs of genes differing between humans and chimpanzees; a genome with all codons recoded to make viral infection impossible
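The thin-tails point is easy to quantify (a sketch, treating trait values as independent standard normals and comparing the density of each profile):

```r
# relative plausibility of +1SD on each of 20 independent traits vs +20SD on 1 trait:
dnorm(1)^20             # ~5e-13
dnorm(20)               # ~5.5e-88
dnorm(1)^20 / dnorm(20) # ~1e75: the 20x1SD profile is astronomically more 'normal'
```

A genome at +1SD on 20 traits is a phenotype many living people approximate, while a +20SD trait value has never existed in human history - so the former is far safer territory for the same total gain.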


The past 5 years have seen major breakthroughs in cheap, reliable, general-purpose genetic engineering & editing using CRISPR, since the 2012 demonstration that it could edit cells, including human ones.5 Commentary in 2012 and 2013 regarded prospects for embryo editing as highly remote.6 Even as late as July 2014, coverage was highly circumspect, with vague musing that “Some experiments also hint that doctors may someday be able to use it to treat genetic disorders”; yet successful editing of zebrafish, monkey, and human embryos was already being demonstrated in the lab, and human trials would begin 2-3 years later. (Surprisingly, Shulman & Bostrom do not mention CRISPR, but that gives an idea of how shockingly fast CRISPR went from strange to news to standard, and how low expectations for genetic engineering were pre-2015 after decades of agonizingly slow progress & high-profile failures & deaths, such that the default assumption was that direct cheap safe genetic engineering of embryos would not be feasible for decades, and certainly not in the foreseeable future.)

The CRISPR/Cas system exploits a bacterial anti-viral immune system in which snippets of virus DNA are stored, and an enzyme system detects the presence of fresh viral DNA in the bacterial cell and then edits it, breaking it, and presumably stopping the infection in its latent phase. This requires the CRISPR enzymes to be highly precise (lest they attack legitimate DNA, damaging or killing the bacterium itself), repeatable (it's a poor immune system that can only fight off one infection ever), and programmable by short RNA sequences (because viruses constantly mutate).

This turns out to be usable in animal and human cells to delete/knock-out genes: create an RNA sequence matching a particular gene, inject the RNA sequence along with the key enzymes into a cell, and it will find the gene in question inside the nucleus and snip it; this can be augmented to edit rather than delete by providing another DNA template which the snip-repair mechanisms will unwittingly copy from. Compared to the major earlier approaches using ZFNs & TALENs, CRISPR is far faster and easier and cheaper to use, with large (at least halving) decreases in time & money cited, and use of it has exploded in research labs, drawing comparisons to the invention of PCR and Nobel Prize predictions. It had been used to edit at least 36 creatures as of June 2015.7

(I am con­vinced some of these were done pri­mar­ily for the lulz.)

It has appeared to be a very effective and promising genome-editing tool in mammalian cells:

  • Cho et al 2013, “Targeted genome engineering in human cells with the Cas9 RNA-guided endonuclease” (human cells), http://www.bmskorea.co.kr/bms_email/email2013/13-0802/paper.pdf
  • Cong et al 2013, “Multiplex genome engineering using CRISPR/Cas systems” (human cells), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3795411/
  • Jinek et al 2012, “A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity” (mouse and human cells)
  • Hwang et al 2013, “Efficient genome editing in zebrafish using a CRISPR-Cas system” (zebrafish somatic cells at the organismal level), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3686313/

Genomic editing has worked not only in cultured mammalian cells & pluripotent stem cells, but also in zebrafish embryos, with efficiencies comparable to those obtained using ZFNs and TALENs (Hwang et al 2013).

Jenko et al 2016, “Potential of promotion of alleles by genome editing to improve quantitative traits in livestock breeding programs”, https://gsejournal.biomedcentral.com/articles/10.1186/s12711-015-0135-3

Yang et al 2019 (“engineering 18 different loci using multiple genome engineering methods” in pigs; a followup)

Estimating the profit from CRISPR edits is, in some ways, less straightforward than from embryo selection:

  1. taking the highest n reported betas/coefficients from a GWAS's OLS regressions implies that they will be systematically biased upwards due to the winner's curse and statistical-significance filtering, thereby exaggerating the potential benefit from each SNP
  2. the causal tagging problem: GWAS estimates correlating SNPs with phenotype traits, while valid for prediction (and hence selection), are not necessarily causal (such that an edit at that SNP will have the estimated effect), and the probability of a SNP being non-causal must be accounted for
  3. each CRISPR edit has a chance of not making the correct edit, and of making wrong edits; an edit may not work in an embryo (the specificity, false negative), and there is a chance of an ‘off-target’ mutation (false positive). A non-edit is a waste of money, while an off-target mutation could be fatal.
  4. CRISPR has advanced at such a rapid rate that numbers on cost, specificity, & off-target mutation rate are generally either not available, or are arguably out of date before they were published19

The winner's curse can be dealt with in several ways; 4 methods are discussed in the literature (“inverting the conditional expectation of the OLS estimator, maximum likelihood estimation (MLE), Bayesian estimation, and empirical-Bayes estimation”). The best one for our purposes would be a Bayesian approach, yielding posterior distributions of effect sizes. Unfortunately, the raw data for GWAS studies like Rietveld et al 2013 is not available; but for our purposes we can, like in the Rietveld et al 2014 supplement, simply simulate a dataset and work with that to get posteriors to find the SNPs with the largest unbiased posterior means.
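A small simulation shows the bias itself (a sketch with made-up parameters: hypothetical true SNP effects ~N(0, 0.03) measured with sampling noise of SE=0.01, nothing estimated from Rietveld et al 2013):

```r
# winner's curse: the top-ranked *estimated* betas systematically overstate
# their own true effects, because ranking by estimate selects for lucky noise:
set.seed(2016)
n    <- 10000
true <- rnorm(n, 0, 0.03)        # hypothetical true SNP effects
est  <- true + rnorm(n, 0, 0.01) # noisy GWAS estimates of them
top  <- order(abs(est), decreasing=TRUE)[1:100]
mean(abs(est[top]) - abs(true[top])) # positive: the top hits are inflated
```

So naively planning edits off the top reported betas overestimates the gain; the Bayesian shrinkage described above corrects for exactly this.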

The causal tag­ging prob­lem is more diffi­cult. The back­ground for the causal tag­ging prob­lem is that genes are re­com­bined ran­domly at con­cep­tion, but they are not re­com­bined at the in­di­vid­ual gene lev­el; rather, genes tend to live in ‘blocks’ of con­tigu­ous genes called , lead­ing to (LD). Not every pos­si­ble vari­ant in a hap­lo­type is de­tected by a SNP ar­ray chip, so if you have genes A/B/C/D, it may be the case that a vari­ant of A will in­crease in­tel­li­gence but your SNP ar­ray chip only de­tects vari­ants of D; when you do a GWAS, you then dis­cover D pre­dicts in­creased in­tel­li­gence. (And if your SNP ar­ray chip tags B or C, you may get hits on those as well!) This is not a prob­lem for em­bryo se­lec­tion, be­cause if you can only see D’s vari­ants, and you see that an em­bryo has the ‘good’ D vari­ant, and you pick that em­bryo, it will in­deed grow up as you hoped be­cause that D vari­ant pulled the true causal A vari­ant along for the ride. In­deed, for se­lec­tion or pre­dic­tion, the causal tag­ging prob­lem can be seen as a good thing: your GWASes can pick up effects from parts of the genome you did­n’t even pay to se­quence - “buy one, get one free”. (The down­side can be un­der­es­ti­ma­tion due to im­per­fect prox­ies.) But this is a prob­lem for em­bryo edit­ing be­cause if a CRISPR en­zyme goes in and care­fully ed­its D while leav­ing every­thing un­touched, A by de­fi­n­i­tion re­mains the same and there will be no ben­e­fits. A SNP be­ing non-causal is not a se­ri­ous prob­lem for em­bryo edit­ing if we know which SNPs are non-causal; as il­lus­trated above, the dis­tri­b­u­tion of effects is smooth enough that dis­card­ing a top SNP for the next best al­ter­na­tive costs lit­tle. 
But it is a se­ri­ous prob­lem if we don’t know which ones are non-causal, be­cause then we waste pre­cious ed­its (imag­ine a bud­get of 30 ed­its and a non-causal rate of 50%; if we are ig­no­rant which are non-causal, we effec­tively get only 15 ed­its, but if we know which to drop, then it’s al­most as good as if all were causal). Causal­ity can be es­tab­lished in a few ways; for ex­am­ple, a hit can be reused in a mu­tant lab­o­ra­tory an­i­mal breed to see if it also pre­dicts there (This is be­hind some strik­ing break­throughs like Sekar et al 2016’s proof that neural prun­ing is a cause of schiz­o­phre­ni­a), ei­ther us­ing ex­ist­ing strains or cre­at­ing a fresh mod­i­fi­ca­tion (us­ing, say, CRISPR). This has not been done for the top in­tel­li­gence hits and given the ex­pense & diffi­culty of an­i­mal ex­per­i­men­ta­tion, we can’t ex­pect it any­time soon. One can also try to use prior in­for­ma­tion to boost the pos­te­rior prob­a­bil­ity of an effect: if a gene has al­ready been linked to the ner­vous sys­tem by pre­vi­ous stud­ies ex­plor­ing mu­ta­tions or gene ex­pres­sion data etc, or other as­pects of the phys­i­cal struc­ture point to­wards that SNP like be­ing clos­est to a gene, then that is ev­i­dence for the as­so­ci­a­tion be­ing causal. In­tel­li­gence hits are en­riched for ner­vous sys­tem con­nec­tions, but this method is in­her­ently weak. A bet­ter method is fine-map­ping or whole-genome se­quenc­ing: when all the vari­ants in a hap­lo­type are se­quenced, then the true causal vari­ant will tend to per­form sub­tly bet­ter and one can sin­gle it out of the whole SNP set us­ing var­i­ous sta­tis­ti­cal cri­te­ria (eg Farh et al 2015 us­ing their al­go­rithm on au­toim­mune dis­or­der es­ti­mate 5.5% of their 4.9k SNPs are causal). Use­ful, but whole-genomes are still ex­pen­sive enough that they are not cre­ated nearly as much as SNPs and there do not seem to be many com­par­isons to ground-truths or meta-analy­ses. 
Another approach is similar to the lab-animal approach: human groups differ genetically, and their haplotypes & LD patterns can differ greatly; if D looks associated with intelligence in one group, but is not associated in a genetically distant group with different haplotypes, that strongly suggests that in the first group, D was indeed just proxying for another variant. We can also expect that variants with high frequencies will not be population-specific but will be ancient & shared causal variants. Some of the intelligence hits have replicated in European-American samples, as expected but not helpfully; more importantly, they have replicated in African-American (Domingue et al 2015) and East Asian samples (Zhu et al 2015), and the top SNPs have some predictive power for mean IQs across ethnic groups (Piffer 2015). More generally, while GWASes usually are paranoid about ancestry, using as homogeneous a sample as possible and controlling away any detectable differences in ancestry, GWASes can use cross-ethnic samples and differing ancestries in “admixture mapping” to home in on causal variants, but this hasn’t been done for intelligence. We can note that some GWASes have compared how hits replicate across populations (eg Waters et al 2010, Zuo et al 2011, Chang et al 2011, Gong et al 2013, Xing et al 2014, Yin et al 2015, He et al 2015); despite interpretive difficulties such as statistical power, hits often replicate from European-descent samples to distant ethnicities, and the example of the schizophrenia GWASes (Candia et al 2013) also offers hope in showing a strong correlation of 0.66/0.61 between African & European schizophrenia SNP effects. (It also seems to me that there is a trend of the complex, highly polygenic traits having better cross-ethnic replicability than simpler diseases.)

  1. Ntzani et al 2012, “Consistency of genome-wide associations across major ancestral groups” /docs/genetics/correlation/2012-ntzani.pdf
  2. Marigorta & Navarro 2013, “High Trans-ethnic Replicability of GWAS Results Implies Common Causal Variants” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3681663/

“Ge­netic effects in­flu­enc­ing risk for ma­jor de­pres­sive dis­or­der in China and Eu­rope”, Bigdeli et al 2017 https://www.nature.com/tp/journal/v7/n3/full/tp2016292a.html

“Fa­ther Ab­sence and Ac­cel­er­ated Re­pro­duc­tive De­vel­op­ment”, Gay­dosh et al 2017 https://www.biorxiv.org/content/biorxiv/early/2017/04/04/123711.full.pdf

“Transeth­nic differ­ences in GWAS sig­nals: A sim­u­la­tion study”, Zanetti & Weale 2018

Brown et al 2016

Lencz et al 2013/2014: IQ PGS applied to schizophrenia case status:

  • EA: 0.41%
  • Japan­ese: 0.38%
  • Ashke­nazi Jew: 0.16%
  • MGS African-Amer­i­can: 0.00%

my­opia (re­frac­tive er­ror): European/East Asian (T­edja et al 2018, “Genome-wide as­so­ci­a­tion meta-analy­sis high­lights light-in­duced sig­nal­ing as a dri­ver for re­frac­tive er­ror”): r_g = 0.90/0.80

“Consistent Association of Type 2 Diabetes Risk Variants Found in Europeans in Diverse Racial and Ethnic Groups”, Waters et al 2010 http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001078

Ware et al 2017, “Heterogeneity in polygenic scores for common human traits” https://www.biorxiv.org/content/early/2017/02/05/106062 : Figure 3 - the height/education/etc polygenic scores are roughly 5x as predictive in white Americans as in African-Americans, ~0.05 R^2 vs ~0.01 R^2, and since African-Americans average about 25% white anyway, that’s consistent with 10% or less…

“Using Genetic Distance to Infer the Accuracy of Genomic Prediction”, Scutari et al 2016 http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006288

Height PGSes don’t work cross-racially: “Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations”, Martin et al 2017 (note that this paper had a later erratum):

The vast ma­jor­ity of genome-wide as­so­ci­a­tion stud­ies (GWASs) are per­formed in Eu­ro­peans, and their trans­fer­abil­ity to other pop­u­la­tions is de­pen­dent on many fac­tors (e.g., link­age dis­e­qui­lib­ri­um, al­lele fre­quen­cies, ge­netic ar­chi­tec­ture). As med­ical ge­nomics stud­ies be­come in­creas­ingly large and di­verse, gain­ing in­sights into pop­u­la­tion his­tory and con­se­quently the trans­fer­abil­ity of dis­ease risk mea­sure­ment is crit­i­cal. Here, we dis­en­tan­gle re­cent pop­u­la­tion his­tory in the widely used 1000 Genomes Project ref­er­ence pan­el, with an em­pha­sis on pop­u­la­tions un­der­rep­re­sented in med­ical stud­ies. To ex­am­ine the trans­fer­abil­ity of sin­gle-ances­try GWASs, we used pub­lished sum­mary sta­tis­tics to cal­cu­late poly­genic risk scores for eight well-s­tud­ied phe­no­types. We iden­tify di­rec­tional in­con­sis­ten­cies in all scores; for ex­am­ple, height is pre­dicted to de­crease with ge­netic dis­tance from Eu­ro­peans, de­spite ro­bust an­thro­po­log­i­cal ev­i­dence that West Africans are as tall as Eu­ro­peans on av­er­age. To gain deeper quan­ti­ta­tive in­sights into GWAS trans­fer­abil­i­ty, we de­vel­oped a com­plex trait co­a­les­cen­t-based sim­u­la­tion frame­work con­sid­er­ing effects of poly­genic­i­ty, causal al­lele fre­quency di­ver­gence, and her­i­tabil­i­ty. As ex­pect­ed, cor­re­la­tions be­tween true and in­ferred risk are typ­i­cally high­est in the pop­u­la­tion from which sum­mary sta­tis­tics were de­rived. We demon­strate that scores in­ferred from Eu­ro­pean GWASs are bi­ased by ge­netic drift in other pop­u­la­tions even when choos­ing the same causal vari­ants and that bi­ases in any di­rec­tion are pos­si­ble and un­pre­dictable. 
This work cau­tions that sum­ma­riz­ing find­ings from large-s­cale GWASs may have lim­ited porta­bil­ity to other pop­u­la­tions us­ing stan­dard ap­proaches and high­lights the need for gen­er­al­ized risk pre­dic­tion meth­ods and the in­clu­sion of more di­verse in­di­vid­u­als in med­ical ge­nomics.

“Tran­s-eth­nic genome-wide as­so­ci­a­tion stud­ies: ad­van­tages and chal­lenges of map­ping in di­verse pop­u­la­tions”, Li & Keat­ing 2014

“Ge­netic con­trib­u­tors to vari­a­tion in al­co­hol con­sump­tion vary by race/ethnicity in a large mul­ti­-eth­nic genome-wide as­so­ci­a­tion study”, Jor­gen­son et al 2017 /docs/genetics/correlation/2017-jorgenson.pdf

“High­-Res­o­lu­tion Ge­netic Maps Iden­tify Mul­ti­ple Type 2 Di­a­betes Loci at Reg­u­la­tory Hotspots in African Amer­i­cans and Eu­ro­peans”, Lau et al 2017 /docs/genetics/correlation/2017-lau.pdf

“The ge­nomic land­scape of African pop­u­la­tions in health and dis­ease”, Ro­timi et al 2017 /docs/genetics/selection/2017-rotimi.pdf

Wray et al (MDDWG PGC) 2017, “Genome-wide as­so­ci­a­tion analy­ses iden­tify 44 risk vari­ants and re­fine the ge­netic ar­chi­tec­ture of ma­jor de­pres­sion”:

  • European/Chinese de­pres­sion: rg = 0.31
  • European/Chinese schiz­o­phre­nia: rg = 0.34
  • European/Chinese bipo­lar dis­or­der: rg = 0.45

Wojcik et al 2017:

To demon­strate the ben­e­fit of study­ing un­der­rep­re­sented pop­u­la­tions, the Pop­u­la­tion Ar­chi­tec­ture us­ing Ge­nomics and Epi­demi­ol­ogy (PAGE) study con­ducted a GWAS of 26 clin­i­cal and be­hav­ioral phe­no­types in 49,839 non-Eu­ro­pean in­di­vid­u­als. Us­ing novel strate­gies for mul­ti­-eth­nic analy­sis of ad­mixed pop­u­la­tions, we con­firm 574 GWAS cat­a­log vari­ants across these traits, and find 28 novel loci and 42 resid­ual sig­nals in known loci. Our data show strong ev­i­dence of effec­t-size het­ero­gene­ity across an­ces­tries for pub­lished GWAS as­so­ci­a­tions, which sub­stan­tially re­stricts ge­net­i­cal­ly-guided pre­ci­sion med­i­cine. We ad­vo­cate for new, large genome-wide efforts in di­verse pop­u­la­tions to re­duce health dis­par­i­ties.

Akiyama et al 2017: r = 0.82 between top European/Japanese SNP hits

Grinde et al 2018:

…we com­pare var­i­ous ap­proaches for GRS con­struc­tion, us­ing GWAS re­sults from both large EA stud­ies and a smaller study in Hispanics/Latinos, the His­panic Com­mu­nity Health Study/Study of Lati­nos (HCHS/SOL, n=12,803). We con­sider mul­ti­ple ways to se­lect SNPs from as­so­ci­a­tion re­gions and to cal­cu­late the SNP weights. We study the per­for­mance of the re­sult­ing GRSs in an in­de­pen­dent study of Hispanics/Latinos from the Woman Health Ini­tia­tive (WHI, n=3,582). We sup­port our in­ves­ti­ga­tion with sim­u­la­tion stud­ies of po­ten­tial ge­netic ar­chi­tec­tures in a sin­gle lo­cus. We ob­served that se­lect­ing vari­ants based on EA GWASs gen­er­ally per­forms well, as long as SNP weights are cal­cu­lated us­ing Hispanics/Latinos GWASs, or us­ing the meta-analy­sis of EA and Hispanics/Latinos GWASs. The op­ti­mal ap­proach de­pends on the ge­netic ar­chi­tec­ture of the trait.

Mogil et al 2018: causal variants are mostly shared cross-racially despite different LD patterns, as indicated by similar gene expression effects?

Liu et al 2015:

Con­sis­tent with the con­cor­dant di­rec­tion of effect at as­so­ci­ated SNPs, there was high ge­netic cor­re­la­tion (rG) be­tween the Eu­ro­pean and East Asian co­hort when con­sid­er­ing the ad­di­tive effects of all SNPs geno­typed on the Im­munochip19 (Crohn’s dis­ease rG = 0.76, ul­cer­a­tive col­i­tis rG = 0.79) (Sup­ple­men­tary Ta­ble 11). Given that rare SNPs (mi­nor al­lele fre­quency (MAF) < 1%) are more likely to be pop­u­la­tion-speci­fic, these high rG val­ues also sup­port the no­tion that the ma­jor­ity of causal vari­ants are com­mon (MAF>5%).

Wang et al 2020:

Poly­genic scores (PGS) have been widely used to pre­dict com­plex traits and risk of dis­eases us­ing vari­ants iden­ti­fied from genome-wide as­so­ci­a­tion stud­ies (GWASs). To date, most GWASs have been con­ducted in pop­u­la­tions of Eu­ro­pean an­ces­try, which lim­its the use of GWAS-derived PGS in non-Eu­ro­pean pop­u­la­tions. Here, we de­velop a new the­ory to pre­dict the rel­a­tive ac­cu­racy (RA, rel­a­tive to the ac­cu­racy in pop­u­la­tions of the same an­ces­try as the dis­cov­ery pop­u­la­tion) of PGS across an­ces­tries. We used sim­u­la­tions and real data from the UK Biobank to eval­u­ate our re­sults. We found across var­i­ous sim­u­la­tion sce­nar­ios that the RA of PGS based on trait-as­so­ci­ated SNPs can be pre­dicted ac­cu­rately from mod­el­ling link­age dis­e­qui­lib­rium (LD), mi­nor al­lele fre­quen­cies (MAF), cross-pop­u­la­tion cor­re­la­tions of SNP effect sizes and her­i­tabil­i­ty. Al­to­geth­er, we find that LD and MAF differ­ences be­tween an­ces­tries ex­plain alone up to ~70% of the loss of RA us­ing Eu­ro­pean-based PGS in African an­ces­try for traits like body mass in­dex and height. Our re­sults sug­gest that causal vari­ants un­der­ly­ing com­mon ge­netic vari­a­tion iden­ti­fied in Eu­ro­pean an­ces­try GWASs are mostly shared across con­ti­nents.

“Poly­genic pre­dic­tion of the phe­nome, across an­ces­try, in emerg­ing adult­hood”, Docherty et al 2017 (sup­ple­ment): large loss of PGS power in many traits, but di­rect com­par­i­son seems to be un­avail­able for most of the PGSes other than stuff like height?

“Ge­net­ics of self­-re­ported risk-tak­ing be­hav­iour, tran­s-eth­nic con­sis­tency and rel­e­vance to brain gene ex­pres­sion”, Straw­bridge et al 2018:

There were strong pos­i­tive ge­netic cor­re­la­tions be­tween risk-tak­ing and at­ten­tion-d­eficit hy­per­ac­tiv­ity dis­or­der, bipo­lar dis­or­der and schiz­o­phre­nia. In­dex ge­netic vari­ants demon­strated effects gen­er­ally con­sis­tent with the dis­cov­ery analy­sis in in­di­vid­u­als of non-Bri­tish White, South Asian, African-Caribbean or mixed eth­nic­i­ty.

Reisberg et al 2017:

Poly­genic risk scores are gain­ing more and more at­ten­tion for es­ti­mat­ing ge­netic risks for li­a­bil­i­ties, es­pe­cially for non­com­mu­ni­ca­ble dis­eases. They are now cal­cu­lated us­ing thou­sands of DNA mark­ers. In this pa­per, we com­pare the score dis­tri­b­u­tions of two pre­vi­ously pub­lished very large risk score mod­els within differ­ent pop­u­la­tions. We show that the risk score model to­gether with its risk strat­i­fi­ca­tion thresh­olds, built upon the data of one pop­u­la­tion, can­not be ap­plied to an­other pop­u­la­tion with­out tak­ing into ac­count the tar­get pop­u­la­tion’s struc­ture. We also show that if an in­di­vid­ual is clas­si­fied to the wrong pop­u­la­tion, his/her dis­ease risk can be sys­tem­at­i­cally in­cor­rectly es­ti­mat­ed.

Duncan et al 2018:

We an­a­lyzed the first decade of poly­genic scor­ing stud­ies (2008-2017, in­clu­sive), and found that 67% of stud­ies in­cluded ex­clu­sively Eu­ro­pean an­ces­try par­tic­i­pants and an­other 19% in­cluded only East Asian an­ces­try par­tic­i­pants. Only 3.8% of stud­ies were car­ried out on sam­ples of African, His­pan­ic, or In­dige­nous peo­ples. We find that effect sizes for Eu­ro­pean an­ces­try-derived poly­genic scores are only 36% as large in African an­ces­try sam­ples, as in Eu­ro­pean an­ces­try sam­ples (t=-10.056, df=22, p=5.5x10-10). Poorer per­for­mance was also ob­served in other non-Eu­ro­pean an­ces­try sam­ples. Analy­sis of poly­genic scores in the 1000Genomes sam­ples re­vealed many strong cor­re­la­tions with global prin­ci­pal com­po­nents, and re­la­tion­ships be­tween height poly­genic scores and height phe­no­types that were highly vari­able de­pend­ing on method­olog­i­cal choices in poly­genic score con­struc­tion.

“Tran­sances­tral GWAS of al­co­hol de­pen­dence re­veals com­mon ge­netic un­der­pin­nings with psy­chi­atric dis­or­ders”, Wal­ters et al 2018:

PRS based on our meta-analy­sis of AD were sig­nifi­cantly pre­dic­tive of AD out­comes in all three tested ex­ter­nal co­horts. PRS de­rived from the un­re­lated EU GWAS pre­dicted up to 0.51% of the vari­ance in past month al­co­hol use dis­or­der in the Avon Lon­gi­tu­di­nal Study of Par­ents and Chil­dren (ALSPAC; P=0.0195; Sup­ple­men­tary Fig. 10a) and up to 0.3% of prob­lem drink­ing in Gen­er­a­tion Scot­land (P=7.9×10-6; Sup­ple­men­tary Fig. 10b) as in­dexed by the CAGE (Cut­ting down, An­noy­ance by crit­i­cism, Guilty feel­ings, and Eye­-open­ers) ques­tion­naire. PRS de­rived from the un­re­lated AA GWAS pre­dicted up to 1.7% of the vari­ance in al­co­hol de­pen­dence in the COGA AAfGWAS co­hort (P=1.92×10-7; Sup­ple­men­tary Fig. 10c). No­tably, PRS de­rived from the un­re­lated EU GWAS showed much weaker pre­dic­tion (max­i­mum r2=0.37%, P=0.01; Sup­ple­men­tary Fig. 10d) in the COGA AAfGWAS than the an­ces­trally matched AA GWAS-based PRS de­spite the much smaller dis­cov­ery sam­ple for AA. In ad­di­tion, the AA GWAS-based AD PRS also still yielded sig­nifi­cant vari­ance ex­plained after con­trol­ling for other ge­netic fac­tors (r2=1.16%, P=2.5×10-7). Pre­dic­tion of CAGE scores in Gen­er­a­tion Scot­land re­mained sig­nifi­cant and showed min­i­mal at­ten­u­a­tion (r2=0.29%, P=1.0×10-5) after con­di­tion­ing on PRS for al­co­hol con­sump­tion de­rived from UK Biobank re­sults17. In COGA AAfGWAS, the AA PRS de­rived from our study con­tin­ued to pre­dict 1.6% of the vari­ance in al­co­hol de­pen­dence after in­clu­sion of rs2066702 geno­type as a co­vari­ate, in­di­cat­ing in­de­pen­dent poly­genic effects be­yond the lead ADH1B vari­ant (Sup­ple­men­tary Meth­od­s).

So the EU->AA PGS retains 0.37/1.7 ≈ 22% of the ancestry-matched variance explained?

Telkar et al 2019:

A poly­genic score based on es­tab­lished LDL-cholesterol-associated loci from Eu­ro­pean dis­cov­ery sam­ples had con­sis­tent effects on serum lev­els in sam­ples from the UK, Uganda and Greek pop­u­la­tion iso­lates (cor­re­la­tion co­effi­cient r=0.23 to 0.28 per LDL stan­dard de­vi­a­tion, p<1.9x10-14). Tran­s-eth­nic ge­netic cor­re­la­tions be­tween Eu­ro­pean an­ces­try, Chi­nese and Japan­ese co­horts did not differ sig­nifi­cantly from 1 for HDL, LDL and triglyc­erides. In each study, >60% of ma­jor lipid loci dis­played ev­i­dence of repli­ca­tion with one ex­cep­tion. There was ev­i­dence for an effect on serum lev­els in the Ugan­dan sam­ples for only 10% of ma­jor triglyc­eride loci. The PRS was only weakly as­so­ci­ated in this group (r=0.06, SE=0.013).

“Iden­ti­fi­ca­tion of 28 new sus­cep­ti­bil­ity loci for type 2 di­a­betes in the Japan­ese pop­u­la­tion”, Suzuki et al 2019:

When we com­pared the effect sizes of 69 of the 88 lead vari­ants in Japan­ese and Eu­ro­peans that were avail­able in a pub­lished Eu­ro­pean GWAS2 (Sup­ple­men­tary Ta­ble 3 and Sup­ple­men­tary Fig. 5), we found a strong pos­i­tive cor­re­la­tion (Pear­son’s r= 0.87, P= 1.4 × 10−22) and di­rec­tional con­sis­tency (65 of 69 loci, 94%, sign-test P= 3.1 × 10−15). In ad­di­tion, when we com­pared the effect sizes of the 95 of 113 lead vari­ants re­ported in the Eu­ro­pean type 2 di­a­betes GWAS2 that were avail­able in both Japan­ese and Eu­ro­pean type 2 di­a­betes GWAS (Sup­ple­men­tary Ta­ble 2 and Sup­ple­men­tary Fig. 6a), we also found a strong pos­i­tive cor­re­la­tion (Pear­son’s r= 0.74, P= 5.9 × 10−18) and di­rec­tional con­sis­tency (83 of 95 loci, 87%, sign-test P= 3.2 × 10−14). After this man­u­script was sub­mit­ted, a larger type 2 di­a­betes GWAS of Eu­ro­pean an­ces­try was pub­lished17. When we re­peated the com­par­i­son at the lead vari­ants re­ported in this larger Eu­ro­pean GWAS, we found a more promi­nent cor­re­la­tion (Pear­son’s r= 0.83, P= 8.7 × 10−51) and di­rec­tional con­sis­tency (181 of 192 loci, 94%, sign-test P= 8.3 × 10−41) of the effect sizes (Sup­ple­men­tary Ta­ble 4 and Sup­ple­men­tary Fig. 6b). These re­sults in­di­cate that most of the type 2 di­a­betes sus­cep­ti­bil­ity loci iden­ti­fied in the Japan­ese or Eu­ro­pean pop­u­la­tion had com­pa­ra­ble effects on type 2 di­a­betes in the other pop­u­la­tion.

Spracklen et al 2019:

Over­all, the per-al­lele effect sizes be­tween the two an­ces­tries were mod­er­ately cor­re­lated (r=0.54; Fig­ure 2A). When the com­par­i­son was re­stricted to the 290 vari­ants that are com­mon (MAF≥5%) in both an­ces­tries, the effect size cor­re­la­tion in­creased to r=0.71 (Fig­ure 2B; Sup­ple­men­tary Fig­ure 8). This effect size cor­re­la­tion fur­ther in­creased to r=0.88 for 116 vari­ants sig­nifi­cantly as­so­ci­ated with T2D (P<5x10-8) in both an­ces­tries. While the over­all effect sizes for all 343 vari­ants ap­pear, on av­er­age, to be stronger in East Asian in­di­vid­u­als than Eu­ro­pean, this trend is re­duced when each lo­cus is rep­re­sented only by the lead vari­ant from one pop­u­la­tion (Sup­ple­men­tary Fig­ure 9). Specifi­cal­ly, many vari­ants iden­ti­fied with larger effect sizes in the Eu­ro­pean meta-analy­sis are miss­ing from the com­par­i­son be­cause they were rare/monomorphic or poorly im­puted in the East Asian meta-analy­sis, for which im­pu­ta­tion ref­er­ence pan­els are less com­pre­hen­sive com­pared to the Eu­ro­pean-cen­tric Hap­lo­type Ref­er­ence Con­sor­tium pan­el.

“Genetic analyses of diverse populations improves discovery for complex traits”, Wojcik et al 2019:

When strat­i­fied by self­-i­den­ti­fied race/ethnicity, the effect sizes for the Hispanic/Latino pop­u­la­tion re­mained sig­nifi­cantly at­ten­u­ated com­pared to the pre­vi­ously re­ported effect sizes (β= 0.86; 95% con­fi­dence in­ter­val = 0.83–0.90; Fig. 2a). Effect sizes for the African Amer­i­can pop­u­la­tion were even fur­ther di­min­ished at nearly half the strength (β= 0.54; 95% con­fi­dence in­ter­val = 0.50–0.58; Fig. 2a). This is sug­ges­tive of truly differ­en­tial effect sizes be­tween an­ces­tries at pre­vi­ously re­ported vari­ants, rather than these effect sizes be­ing up­wardly bi­ased in gen­eral (that is, ex­hibit­ing ‘win­ner’s curse’), which should affect all groups equal­ly.

“Con­tri­bu­tions of com­mon ge­netic vari­ants to risk of schiz­o­phre­nia among in­di­vid­u­als of African and Latino an­ces­try”, Bigdeli et al 2019:

Con­sis­tent with pre­vi­ous re­ports demon­strat­ing the gen­er­al­iz­abil­ity of poly­genic find­ings for schiz­o­phre­nia across di­verse pop­u­la­tions [14, 43, 44], in­di­vid­u­al-level scores con­structed from PGC-SCZ2 sum­mary sta­tis­tics were sig­nifi­cantly as­so­ci­ated with case−­con­trol sta­tus in ad­mixed African, Lati­no, and Eu­ro­pean co­horts in the cur­rent study (Fig. 2a). When con­sid­er­ing scores con­structed from ap­prox­i­mately in­de­pen­dent com­mon vari­ants (pair­wise r2 < 0.1), we ob­served the best over­all pre­dic­tion at a P value thresh­old (PT) of 0.05; these scores ex­plained ~3.5% of the vari­ance in schiz­o­phre­nia li­a­bil­ity among Eu­ro­peans (P = 4.03 × 10−110), ~1.7% among Latino in­di­vid­u­als (P = 9.02 × 10−52), and ~0.5% among ad­mixed African in­di­vid­u­als (P = 8.25 × 10−19) (Fig. 2a; Sup­ple­men­tal Ta­ble 6). Con­sis­tent with ex­pec­ta­tion, when com­par­ing re­sults for scores con­structed from larger num­bers of non­in­de­pen­dent SNPs, we gen­er­ally ob­served an im­prove­ment in pre­dic­tive value (Fig. 2b; Sup­ple­men­tal Ta­ble 7).

Poly­genic scores based on African an­ces­try GWAS re­sults were sig­nifi­cantly as­so­ci­ated with schiz­o­phre­nia among ad­mixed African in­di­vid­u­als, at­tain­ing the best over­all pre­dic­tive value when con­structed from ap­prox­i­mately in­de­pen­dent com­mon vari­ants (pair­wise r2 < 0.1) with PT ≤ 0.5 in the dis­cov­ery analy­sis (Fig. 2a and Sup­ple­men­tal Ta­ble 6); this score ex­plained ~1.3% of the vari­ance in schiz­o­phre­nia li­a­bil­ity (P = 3.47 × 10−41). Scores trained on African an­ces­try GWAS re­sults also sig­nifi­cantly pre­dicted case−­con­trol sta­tus across pop­u­la­tions; scores based on a PT ≤ 0.5 and pair­wise r2 < 0.8 ex­plained ~0.2% of the vari­abil­ity in li­a­bil­ity in Eu­ro­peans (P = 2.35 × 10−7) and ~0.1% among Latino in­di­vid­u­als (P = 0.000184) (Fig. 2b and Sup­ple­men­tal Ta­ble 7). Sim­i­lar­ly, scores con­structed from Latino GWAS re­sults (PT ≤ 0.5) were of great­est pre­dic­tive value among Lati­nos (li­a­bil­ity R2 = 2%; P = 3.11 × 10−19) and Eu­ro­peans (li­a­bil­ity R2 = 0.8%; P = 1.60 × 10−9); with scores based on PT ≤ 0.05 and pair­wise r2 < 0.1 show­ing nom­i­nally sig­nifi­cant as­so­ci­a­tion with case-con­trol sta­tus among African an­ces­try in­di­vid­u­als (li­a­bil­ity R2 = 0.2%; P = 0.00513).

We next con­sid­ered poly­genic scores con­structed from tran­s-ances­try meta-analy­sis of PGC-SCZ2 sum­mary sta­tis­tics and our African and Latino GWAS, which re­vealed in­creased sig­nifi­cance and im­proved pre­dic­tive value in all three an­ces­tries. Among African an­ces­try in­di­vid­u­als, meta-an­a­lytic scores based on PT ≤ 0.5 ex­plained ~1.7% of the vari­ance (P = 4.37 × 10−53); while scores based on PT ≤ 0.05 ac­counted for ~2.1% and ~3.7% of the vari­abil­ity in li­a­bil­ity among Latino (P = 1.10 × 10−59) and Eu­ro­pean in­di­vid­u­als (P = 1.73 × 10−114), re­spec­tive­ly.

Guo et al 2019:

Genome-wide as­so­ci­a­tion stud­ies (GWAS) in sam­ples of Eu­ro­pean an­ces­try have iden­ti­fied thou­sands of ge­netic vari­ants as­so­ci­ated with com­plex traits in hu­mans. How­ev­er, it re­mains largely un­clear whether these as­so­ci­a­tions can be used in non-Eu­ro­pean pop­u­la­tions. Here, we seek to quan­tify the pro­por­tion of ge­netic vari­a­tion for a com­plex trait shared be­tween con­ti­nen­tal pop­u­la­tions. We es­ti­mated the be­tween-pop­u­la­tion cor­re­la­tion of ge­netic effects at all SNPs (rg) or genome-wide sig­nifi­cant SNPs (rg(GWS)) for height and body mass in­dex (BMI) in sam­ples of Eu­ro­pean (EUR; n = 49,839) and African (AFR; n = 17,426) an­ces­try. The rg be­tween EUR and AFR was 0.75 (s. e. = 0.035) for height and 0.68 (s. e. = 0.062) for BMI, and the cor­re­spond­ing rg(g­was) was 0.82 (s. e. = 0.030) for height and 0.87 (s. e. = 0.064) for BMI, sug­gest­ing that a large pro­por­tion of GWAS find­ings dis­cov­ered in Eu­ro­peans are likely ap­plic­a­ble to non-Eu­ro­peans for height and BMI. There was no ev­i­dence that rg differs in SNP groups with differ­ent lev­els of be­tween-pop­u­la­tion differ­ence in al­lele fre­quency or link­age dis­e­qui­lib­ri­um, which, how­ev­er, can be due to the lack of pow­er.

Koyama et al 2019:

To elu­ci­date the ge­net­ics of coro­nary artery dis­ease (CAD) in the Japan­ese pop­u­la­tion, we con­ducted a large-s­cale genome-wide as­so­ci­a­tion study (GWAS) of 168,228 Japan­ese (25,892 cases and 142,336 con­trols) with geno­type im­pu­ta­tion us­ing a newly de­vel­oped ref­er­ence panel of Japan­ese hap­lo­types in­clud­ing 1,782 CAD cases and 3,148 con­trols. We de­tected 9 novel dis­ease-sus­cep­ti­bil­ity loci and Japan­ese-spe­cific rare vari­ants con­tribut­ing to dis­ease sever­ity and in­creased car­dio­vas­cu­lar mor­tal­i­ty. We then con­ducted a transeth­nic meta-analy­sis and dis­cov­ered 37 ad­di­tional novel loci. Us­ing the re­sult of the meta-analy­sis, we de­rived a poly­genic risk score (PRS) for CAD, which out­per­formed those de­rived from ei­ther Japan­ese or Eu­ro­pean GWAS. The PRS pri­or­i­tized risk fac­tors among var­i­ous clin­i­cal pa­ra­me­ters and seg­re­gated in­di­vid­u­als with in­creased risk of long-term car­dio­vas­cu­lar mor­tal­i­ty. Our data im­proves clin­i­cal char­ac­ter­i­za­tion of CAD ge­net­ics and sug­gests the util­ity of transeth­nic meta-analy­sis for PRS de­riva­tion in non-Eu­ro­pean pop­u­la­tions.

Ekoru et al 2020:

There is grow­ing sup­port for the use of ge­netic risk scores (GRS) in rou­tine clin­i­cal set­tings. Due to the lim­ited di­ver­sity of cur­rent ge­nomic dis­cov­ery sam­ples, there are con­cerns that the pre­dic­tive power of GRS will be lim­ited in non-Eu­ro­pean an­ces­try pop­u­la­tions. Here, we eval­u­ated the pre­dic­tive util­ity of GRS for 12 car­diometa­bolic traits in sub­-Sa­ha­ran Africans (AF; n=5200), African Amer­i­cans (AA; n=9139), and Eu­ro­pean Amer­i­cans (EA; n=9594). GRS were con­structed as weighted sums of the num­ber of risk al­le­les. Pre­dic­tive util­ity was as­sessed us­ing the ad­di­tional phe­no­typic vari­ance ex­plained and in­crease in dis­crim­i­na­tory abil­ity over tra­di­tional risk fac­tors (age, sex and BMI), with ad­just­ment for an­ces­try-derived prin­ci­pal com­po­nents. Across all traits, GRS showed upto a 5-fold and 20-fold greater pre­dic­tive util­ity in EA rel­a­tive to AA and AF, re­spec­tive­ly. Pre­dic­tive util­ity was most con­sis­tent for lipid traits, with per­cent in­crease in ex­plained vari­a­tion at­trib­ut­able to GRS rang­ing from 10.6% to 127.1% among EA, 26.6% to 65.8% among AA, and 2.4% to 37.5% among AF. These differ­ences were re­ca­pit­u­lated in the dis­crim­i­na­tory pow­er, whereby the pre­dic­tive util­ity of GRS was 4-fold greater in EA rel­a­tive to AA and up to 44-fold greater in EA rel­a­tive to AF. Obe­sity and blood pres­sure traits showed a sim­i­lar pat­tern of greater pre­dic­tive util­ity among EA.

Given the practice of embryo editing, the causal tagging problem can gradually solve itself: as edits are made, forcibly breaking the potential confounds, the causal nature of a SNP becomes clear. But how should edits be allocated across the top SNPs to determine each one’s causal nature as efficiently as possible, without spending too many edits investigating? A naive answer might be something along the lines of a power analysis: in a two-group t-test trying to detect a difference of ~0.03 SD for each variant (the rough size of the top few variants), with variance reduced by the known polygenic score, and the standard desired power of 80%, it follows that one would need to randomize a total sample of n = 34,842 to reject the null hypothesis of 0 effect20. Setting up a factorial design & randomizing several variants simultaneously may allow inferences on them as well, but clearly this is going to be a tough row to hoe. It is also unduly pessimistic, since we neither need nor desire 80% power, nor are we comparing to a null hypothesis, as our goal is more modest: since only a certain number of edits will be doable for any embryo, say 30, we merely want to accumulate enough evidence about each of the top 30 variants to either demote it to #31 (and so no longer spend any edits on it) or confirm that it belongs in the top 30 and should always be edited. This is immediately recognizable as a reinforcement learning problem: the multi-armed bandit (each possible edit being an independent arm which can be pulled or not); or more precisely, since early childhood (5yo) IQ scores are relatively poorly correlated with adult scores (r = 0.55) and many embryos may be edited before data on the first edits starts coming in, a multi-armed bandit with multiple plays and delayed feedback.
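The n ≈ 34,842 figure can be approximately reproduced with R’s built-in power calculator; this sketch assumes unit variance, whereas the text additionally assumes variance reduced by the known polygenic score, so the numbers differ slightly:

```r
# Two-sample t-test power analysis: detect a 0.03 SD difference with 80% power
# at alpha = 0.05. (Assumes sd = 1; the text's n = 34,842 reflects variance
# partly removed by the polygenic score, so this is only approximate.)
pt <- power.t.test(delta = 0.03, sd = 1, sig.level = 0.05, power = 0.80)
ceiling(pt$n) * 2   # total n across both groups: ~35,000
```

The per-group n of ~17,400 doubles to the total sample, matching the order of magnitude quoted above.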
(There is no way immediately upon birth to receive meaningful feedback about the effect of an edit, although there might be ways to get feedback faster, such as using short-sleeper gene edits to enhance education.) Thompson sampling is a randomized Bayesian approach which is simple and theoretically optimal, with excellent performance in practice as well; an upper-confidence-bound (UCB) algorithm is also optimal. Dealing with the delayed feedback is known to be difficult and multiple-play Thompson sampling may not be optimal, but in simulations it performs better with delayed feedback than other standard MABs. We can consider a simulation of the scenario in which every time-step is a day and 1 or more embryos must be edited that day; a noisy measure of IQ is then made available 9*31+5*365=2104 days later, which is fed into a GWAS in which the GWAS correlation for each SNP is considered as drawn with an unknown probability from a causal distribution or from a nuisance distribution, so that with additional data, the effect estimates of the SNPs are refined, the probability of being drawn from the causal distribution is refined, and the overall mixture probability is likewise refined, similar to a spike-and-slab prior. (So for the first 2104 time-steps, a Thompson sample would be performed to yield a new set of edits; then each subsequent time-step a datapoint would mature, the posteriors would be updated, and another set of edits created.) The relevant question is how much regret will fall and how many causal SNPs become the top picks after how many edits & days: hopefully high performance & low regret will be achieved within a few years after the initial 5-year delay.
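As a minimal sketch of the core loop (not the full delayed-feedback, multiple-play scheme described above), here is Thompson sampling over three Gaussian arms; the effect sizes are exaggerated for a quick demonstration and all parameters are invented for the example:

```r
# Minimal Thompson sampling: Gaussian rewards with unit noise, N(0,1) priors.
# Each step: draw from every arm's approximate posterior, play the argmax,
# then update that arm's running mean. (Real per-edit effects would be ~0.03 SD
# and feedback would arrive years later; this just shows the mechanism.)
set.seed(1)
trueMeans <- c(0, 0.5, 1)        # hypothetical per-arm mean rewards
k <- length(trueMeans)
postMean <- rep(0, k)            # posterior means (start at the prior mean, 0)
postN    <- rep(0, k)            # observation counts per arm
for (t in 1:2000) {
    draws <- rnorm(k, postMean, 1 / sqrt(postN + 1))  # posterior samples
    arm <- which.max(draws)                           # play the sampled best arm
    reward <- rnorm(1, trueMeans[arm], 1)             # noisy feedback
    postN[arm] <- postN[arm] + 1
    postMean[arm] <- postMean[arm] + (reward - postMean[arm]) / postN[arm]
}
postN   # pull counts concentrate on the truly best arm
```

Because exploration is driven by posterior width (1/sqrt(n+1)), under-sampled arms keep getting occasional tries while the best arm absorbs most pulls.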

A more concrete example: imagine we have a budget of 60 edits (based on the multiplex pig editing), a causal probability of 10%, and an exponential distribution (rate 70.87) of effect sizes over 500,000 candidate alleles, of which we consider the top 1000; each has a frequency of 50%, and we sequence before editing to avoid wasting edits. What are our best-case and worst-case IQ increases? In the worst case, the top 60 are all non-causal, so our improvement is 0 IQ points; in the best case, where all hits are causal, half of the hits are discarded after sequencing (the embryo already carries them), and then the remaining top 60 get us ~6.1 IQ points; the intermediate case of 10% causal gets us ~0.61 IQ points, and so our regret is 5.49 IQ points per embryo. Unsurprisingly, a 10% causal rate is horribly inefficient. In the 10% case, if we can infer the true causal SNPs, we only need to start with ~600 SNPs to saturate our editing budget on average, or ~900 to have a <1% chance of winding up with <60 causal SNPs, so 1000 SNPs seems like a good starting point. (Of course, we also want a larger window so that as our edit budget increases with future income growth & technological improvement, we can smoothly incorporate the additional SNPs.) So what order of sample size do we need here to reduce our regret of 5.49 to something more reasonable like <0.25 IQ points?

SNPs <- 500000
SNPlimit <- 1000
rate <- 70.87
editBudget <- 60
frequency <- 0.5
mean(replicate(1000, {
    # the top 1000 candidate effects out of 500k
    hits <- sort(rexp(SNPs, rate=rate), decreasing=TRUE)[1:SNPlimit]
    # sequencing discards the ~half of alleles already present (frequency 50%);
    # spend the edit budget on 60 of the remaining absent candidates
    sum(sample(hits, length(hits) * frequency)[1:editBudget])
}))
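The panel-size claims above (~600 SNPs to saturate the budget on average, ~900 for a <1% chance of falling short) can be checked with binomial tail probabilities, assuming each candidate is independently causal with probability 10%:

```r
# P(fewer than 60 causal SNPs among the top K candidates), causal rate 10%:
pbinom(59, size=600, prob=0.1)   # ~0.47: 600 candidates only saturate the budget on average
pbinom(59, size=900, prob=0.1)   # well under 1%: 900 candidates nearly always suffice
```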

http://jmlr.csail.mit.edu/proceedings/papers/v31/agrawal13a.pdf : regret of Thompson sampling with Gaussian priors & likelihood is O(sqrt(N * T * ln(N))), where N = number of different arms/actions and T = current timestep. Hence, if we have 1000 actions and we sample 1 time, our expected total regret is on the order of sqrt(1000 * 1 * ln(1000)) = 83; with 100 samples, our expected total regret has increased by an order of magnitude to 831, but at that point each additional timestep incurs an additional expected regret of only ~4-5% of the first timestep's regret: diff(sapply(1:10000, function(t) { sqrt(1000 * t * log(1000)) }))
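Plugging numbers into that bound (a sketch, using the N = 1000 arms assumed above):

```r
# expected total regret under the O(sqrt(N * T * ln N)) Thompson-sampling bound
regret <- function(t, N=1000) sqrt(N * t * log(N))
regret(1)     # ~83
regret(100)   # ~831: 100x the samples, but only ~10x the total regret
# marginal regret per additional time-step shrinks as ~1/sqrt(t):
marginal <- diff(sapply(1:200, function(t) regret(t)))
marginal[100] / regret(1)   # ~5% of the first step's regret
```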

Thompson sampling also achieves the lower bound in the multiple-play setting, but the asymptotic is more complex, and it does not take into account the long delay & noise in measuring IQ. TS empirically performs well, but it is hard to know what sort of sample size is required. At least we can say that the asymptotics don't imply tens of thousands of embryos.

prob­lem: what’s the prob­a­bil­ity of non-causal tag­ging due to LD? prob­a­bly low since they work cross-eth­ni­cally don’t they? on the other hand: http://emilkirkegaard.dk/en/?p=5415 "If the GWAS SNPs owe their pre­dic­tive power to be­ing ac­tual causal vari­ants, then LD is ir­rel­e­vant and they should pre­dict the rel­e­vant out­come in any racial group. If how­ever they owe wholly or partly their pre­dic­tive power to just be­ing sta­tis­ti­cally re­lated to causal vari­ants, they should be rel­a­tively worse pre­dic­tors in racial groups that are most dis­tantly re­lat­ed. One can in­ves­ti­gate this by com­par­ing the pre­dic­tive power of GWAS be­tas de­rived from one pop­u­la­tion on an­other pop­u­la­tion. Since there are by now 1000s of GWAS, meta-analy­ses have in fact made such com­par­isons, mostly for dis­ease traits. Two re­views found sub­stan­tial cross-va­lid­ity for the Eurasian pop­u­la­tion (Eu­ro­peans and East Asian­s), and less for Africans (usu­ally African Amer­i­cans) (23,24). The first re­view only re­lied on SNPs with p<α and found weaker re­sults. This is ex­pected be­cause us­ing only these is a thresh­old effect, as dis­cussed ear­li­er.

The sec­ond re­view (from 2013; 299 in­cluded GWAS) found much stronger re­sults, prob­a­bly be­cause it in­cluded more SNPs and be­cause they also ad­justed for sta­tis­ti­cal pow­er. Do­ing so, they found that: ~100% of SNPs repli­cate in other Eu­ro­pean sam­ples when ac­count­ing for sta­tis­ti­cal pow­er, ~80% in East Asian sam­ples but only ~10% in the African Amer­i­can sam­ple (not ad­justed for sta­tis­ti­cal pow­er, which was ~60% on av­er­age). There were fairly few GWAS for AAs how­ev­er, so some cau­tion is needed in in­ter­pret­ing the num­ber. Still, this throws some doubt on the use­ful­ness of GWAS re­sults from Eu­ro­peans or Asians used on African sam­ples (or re­verse­ly)." and http://emilkirkegaard.dk/en/?p=6415

“Iden­ti­fy­ing Causal Vari­ants at Loci with Mul­ti­ple Sig­nals of As­so­ci­a­tion”, Hor­moz­di­ari et al 2014 http://genetics.org/content/198/2/497.full “Where is the causal vari­ant? On the ad­van­tage of the fam­ily de­sign over the case-con­trol de­sign in ge­netic as­so­ci­a­tion stud­ies”, Dandine-Roul­land & Perdry 2015 http://www.nature.com/ejhg/journal/v23/n10/abs/ejhg2014284a.html worst-case, ~10% of SNPs are causal?

https://www.addgene.org/crispr/reference/ http://www.genome.gov/sequencingcosts/ https://crispr.bme.gatech.edu/ http://crispr.mit.edu/ low, near zero mu­ta­tion rates: “High­-fi­delity CRISPR-Cas9 nu­cle­ases with no de­tectable genome-wide off-tar­get effects” Kle­in­stiver et al 2016, /docs/genetics/editing/2016-kleinstiver.pdf ; “Ra­tio­nally en­gi­neered Cas9 nu­cle­ases with im­proved speci­ficity”, Slay­maker et al 2016 /docs/genetics/editing/2016-slaymaker.pdf Church, April 2016: “In­deed, the lat­est ver­sions of gene-edit­ing en­zymes have zero de­tectable off-tar­get ac­tiv­i­ties.” http://www.wsj.com/articles/should-heritable-gene-editing-be-used-on-humans-1460340173 Church, June 2016 “Church: In prac­tice, when we in­tro­duced our first CRISPR in 2013,19 it was about 5% off tar­get. In other words, CRISPR would edit five treated cells out of 100 in the wrong place in the genome. Now, we can get down to about one er­ror per 6 tril­lion cell­s…­Fahy: Just how effi­cient is CRISPR at edit­ing tar­geted genes? Church: With­out any par­tic­u­lar tricks, you can get any­where up to, on the high end, into the range of 50% to 80% or more of tar­geted genes ac­tu­ally get­ting edited in the in­tended way. Fahy: Why not 100%? Church: We don’t re­ally know, but over time, we’re get­ting closer and closer to 100%, and I sus­pect that some­day we will get to 100%. Fahy: Can you get a higher per­cent­age of suc­cess­ful gene ed­its by dos­ing with CRISPR more than on­ce? Church: Yes, but there are lim­its.” http://www.lifeextension.com/Lpages/2016/CRISPR/index “A per­son fa­mil­iar with the re­search says ‘many tens’ of hu­man IVF em­bryos were cre­ated for the ex­per­i­ment us­ing the do­nated sperm of men car­ry­ing in­her­ited dis­ease mu­ta­tions. Em­bryos at this stages are tiny clumps of cells in­vis­i­ble to the naked eye. ‘It is proof of prin­ci­ple that it can work. They sig­nifi­cantly re­duced mo­saicism. 
I don’t think it’s the start of clin­i­cal tri­als yet, but it does take it fur­ther than any­one has be­fore’, said a sci­en­tist fa­mil­iar with the pro­ject. Mi­tal­ipov’s group ap­pears to have over­come ear­lier diffi­cul­ties by ‘get­ting in early’ and in­ject­ing CRISPR into the eggs at the same time they were fer­til­ized with sperm.” https://www.technologyreview.com/s/608350/first-human-embryos-edited-in-us/ cost of the top vari­ants? want to edit all vari­ants such that: se­quenc­ing-based ed­it: pos­te­rior mean * value of IQ point > cost of 1 edit for blind ed­its: prob­a­bil­ity of the bad vari­ant * pos­te­rior mean * value of IQ point > cost of 1 edit

how to sim­u­late pos­te­rior prob­a­bil­i­ties? https://cran.r-project.org/web/packages/BGLR/BGLR.pdf https://cran.r-project.org/web/packages/BGLR/vignettes/BGLR-extdoc.pdf looks use­ful but won’t han­dle the mix­ture mod­el­ing

pre­vi­ous: Liang et al 2015 “CRISPR/Cas9-mediated gene edit­ing in hu­man tripronu­clear zy­gotes” http://link.springer.com/article/10.1007/s13238-015-0153-5%20/fulltext.html http://www.nature.com/news/chinese-scientists-genetically-modify-human-embryos-1.17378 Kang et al 2016, “In­tro­duc­ing pre­cise ge­netic mod­i­fi­ca­tions into hu­man 3PN em­bryos by CRISPR/Cas-mediated genome edit­ing” /docs/genetics/editing/2016-kang.pdf http://www.nature.com/news/second-chinese-team-reports-gene-editing-in-human-embryos-1.19718 Ko­mor et al 2016, “Pro­gram­ma­ble edit­ing of a tar­get base in ge­nomic DNA with­out dou­ble-s­tranded DNA cleav­age” https://ase.tufts.edu/chemistry/kumar/jc/pdf/Liu_2016.pdf http://www.nature.com/news/chinese-scientists-to-pioneer-first-human-crispr-trial-1.20302 “CRISPR/Cas9-mediated gene edit­ing in hu­man zy­gotes us­ing Cas9 pro­tein” Tang et al 2017 /docs/genetics/editing/2017-tang.pdf : no ob­served off-tar­get mu­ta­tions; effi­ciency of 20%, 50%, and 100% “Cor­rec­tion of a path­o­genic gene mu­ta­tion in hu­man em­bryos”, Ma et al 2017 https://www.nature.com/articles/nature23305 no ob­served off-tar­gets, 27.9% effi­ciency

le­gal in USA (no leg­is­la­tion but some in­ter­est­ing reg­u­la­tory wrin­kles: see ch7 of Hu­man Genome Edit­ing: Sci­ence, Ethics, and Gov­er­nance 2017 ), le­gal in China (only ‘un­en­force­able guide­lines’) as of 2014, ac­cord­ing to “In­ter­na­tional reg­u­la­tory land­scape and in­te­gra­tion of cor­rec­tive genome edit­ing into in vitro fer­til­iza­tion”, Araki & Ishii 2014 http://rbej.biomedcentral.com/articles/10.1186/1477-7827-12-108 as of 2015 too ac­cord­ing to http://www.nature.com/news/where-in-the-world-could-the-first-crispr-baby-be-born-1.18542 il­le­gal in the UK but they have given per­mis­sion to mod­ify hu­man em­bryos for re­search http://www.popsci.com/scientists-get-government-approval-to-edit-human-embryos? http://www.nytimes.com/2016/02/02/health/crispr-gene-editing-human-embryos-kathy-niakan-britain.html le­gal in Japan for re­search, but maybe not ap­pli­ca­tion? http://mainichi.jp/english/articles/20160423/p2g/00m/0dm/002000c le­gal in Swe­den for edit­ing, which has been done as of Sep­tem­ber 2016 by Fredrik Lan­ner http://www.npr.org/sections/health-shots/2016/09/22/494591738/breaking-taboo-swedish-scientist-seeks-to-edit-dna-of-healthy-human-embryos

Al­so, what about mo­saicism? When the CRISPR RNA is in­jected into an even sin­gle-celled zy­gote, it may al­ready have cre­ated some of the DNA for a split and so the edit cov­ers only a frac­tion of the cells of the fu­ture ful­l-grown or­gan­ism. “Ad­di­tion­al­ly, edit­ing may hap­pen after first em­bry­onic di­vi­sion, due to per­sis­tence of Cas9:gRNA com­plex­es, also caus­ing mo­saicism. We (un­pub­lished re­sults) and oth­ers (Yang et al. 2013a; Ma et al. 2014; Yen et al. 2014) have ob­served mo­saic an­i­mals car­ry­ing three or more al­le­les. A re­cent study re­ported sur­pris­ingly high per­cent­age of mo­saic mice (up to 80%) gen­er­ated by CRISPR tar­get­ing of the ty­rosi­nase gene (Tyr) (Yen et al. 2014). We have ob­served a vary­ing fre­quency of mo­saicism, 11-35%, de­pend­ing on the gene/locus (our un­pub­lished data)… The pronu­clear mi­croin­jec­tion of gRNA and Cas9, in a man­ner es­sen­tially iden­ti­cal to what is used for gen­er­at­ing trans­genic mice, can be eas­ily adapted by most trans­genic fa­cil­i­ties. Fa­cil­i­ties equipped with a Piezo-elec­tric mi­cro­ma­nip­u­la­tor can opt for cy­to­plas­mic in­jec­tions as re­ported (Wang et al. 2013; Yang et al. 2013a). Horii et al. (2014) per­formed an ex­ten­sive com­par­i­son study sug­gest­ing that cy­to­plas­mic in­jec­tion of a gRNA and Cas9 mRNA mix­ture as the best de­liv­ery method. Al­though the over­all edit­ing effi­ciency in born pups yielded by pronu­clear vs. cy­to­plas­mic RNA in­jec­tion seems to be com­pa­ra­ble (Table 1), the lat­ter method gen­er­ated two- to four­fold more live born pups. In­jec­tion of plas­mid DNA car­ry­ing Cas9 and gRNA to the pronu­cleus was the least effi­cient method in terms of sur­vival and tar­get­ing effi­ciency (Mashiko et al. 2013; Horii et al. 2014). In­jec­tion into pronu­clei seems to be more dam­ag­ing to em­bryos than in­jec­tion of the same vol­ume or con­cen­tra­tion of edit­ing reagents to the cy­to­plasm. 
It has been shown that cy­to­plas­mic in­jec­tion of Cas9 mRNA at con­cen­tra­tions up to 200 ng/μl is not toxic to em­bryos (Wang et al. 2013) and effi­cient edit­ing was achieved at con­cen­tra­tions as low as 1.5 ng/μl (Ran et al. 2013a). In our hands, in­ject­ing Cas9 mRNA at 50-150 ng/μl and gRNA at 50-75 ng/μl first into the pronu­cleus and also into the cy­to­plasm as the nee­dle is be­ing with­drawn, yields good sur­vival of em­bryos and effi­cient edit­ing by NHEJ in live born pups (our un­pub­lished ob­ser­va­tion­s).” http://genetics.org/content/199/1/1.full


if you’re cu­ri­ous how I cal­cu­lated that, (10*1000 + 10 * 98 * 500) > 500000 → [1] FALSE sum(sort((rexp(10000)/1)/18, decreasing=TRUE)[1:98] * 0.5) → [1] 15.07656556

hm. there are ~50k IVF ba­bies each year in the USA. my quick CRISPR sketch sug­gested that for a few mill you could get up to 150-170. dnorm((150-100)/15) * 320000000 → [1] 493,529.2788; dnorm((170-100)/15) * 320000000 → [1] 2382.734679. so de­pend­ing on how many IVFers used it, you could boost the to­tal ge­nius pop­u­la­tion by any­where from 1/10th to 9x

but if only 10% causal rate and so only 100 effec­tive ed­its from 1000, and a net gain of 15 IQ points (1SD) then in­creas­es: IVF <- (dnorm((115-100)/15) * 50000); gen­pop <- (dnorm((150-100)/15) * 320000000); (IVF+genpop)/genpop [1] 1.024514323 IVF <- (dnorm((115-100)/15) * 50000); gen­pop <- (dnorm((170-100)/15) * 320000000); (IVF+genpop)/genpop [1] 6.077584313 an in­crease of 1.02x (150) and 6x (170) re­spec­tively

“To con­firm these GUIDE-seq find­ings, we used tar­geted am­pli­con se­quenc­ing to more di­rectly mea­sure the fre­quen­cies of in­del mu­ta­tions in­duced by wild-type Sp­Cas9 and Sp­Cas9-H­F1. For these ex­per­i­ments, we trans­fected hu­man cells only with sgRNA- and Cas9en­cod­ing plas­mids (with­out the GUIDE-seq tag). We used nex­t-gen­er­a­tion se­quenc­ing to ex­am­ine the on-tar­get sites and 36 of the 40 off-tar­get sites that had been iden­ti­fied for six sgRNAs with wild-type Sp­Cas9 in our GUIDE-seq ex­per­i­ments (four of the 40 sites could not be specifi­cally am­pli­fied from ge­nomic DNA). These deep se­quenc­ing ex­per­i­ments showed that: (1) wild-type Sp­Cas9 and Sp­Cas9-HF1 in­duced com­pa­ra­ble fre­quen­cies of in­dels at each of the six sgRNA on-tar­get sites, in­di­cat­ing that the nu­cle­ases and sgRNAs were func­tional in all ex­per­i­men­tal repli­cates (Fig. 3a, b); (2) as ex­pect­ed, wild-type Sp­Cas9 showed sta­tis­ti­cally sig­nifi­cant ev­i­dence of in­del mu­ta­tions at 35 of the 36 off-tar­get sites (Fig. 3b) at fre­quen­cies that cor­re­lated well with GUIDE-seq read counts for these same sites (Fig. 3c); and (3) the fre­quen­cies of in­dels in­duced by Sp­Cas9-HF1 at 34 of the 36 off-tar­get sites were sta­tis­ti­cally in­dis­tin­guish­able from the back­ground level of in­dels ob­served in sam­ples from con­trol trans­fec­tions (Fig. 3b). For the two off-tar­get sites that ap­peared to have sta­tis­ti­cally sig­nifi­cant mu­ta­tion fre­quen­cies with Sp­Cas9-HF1 rel­a­tive to the neg­a­tive con­trol, the mean fre­quen­cies of in­dels were 0.049% and 0.037%, lev­els at which it is diffi­cult to de­ter­mine whether these are due to se­quenc­ing or PCR er­ror or are bona fide nu­cle­ase-in­duced in­dels. 
Based on these re­sults, we con­clude that Sp­Cas9-HF1 can com­pletely or nearly com­pletely re­duce off-tar­get mu­ta­tions that oc­cur across a range of differ­ent fre­quen­cies with wild-type Sp­Cas9 to lev­els gen­er­ally un­de­tectable by GUIDE-seq and tar­geted deep se­quenc­ing.”

So no de­tected off-tar­get mu­ta­tions down to the level of lab er­ror rate de­tectabil­i­ty. Amaz­ing. So you can do a CRISPR on a cell with a >75% chance of mak­ing the edit to a de­sired gene cor­rect­ly, and a <0.05% chance of a mis­taken (po­ten­tially harm­less) edit/mutation on a sim­i­lar gene. With an er­ror rate that low, you could do hun­dreds of CRISPR ed­its to a set of em­bryos with a low net risk of er­ror… The me­dian num­ber of eggs ex­tracted from a woman dur­ing IVF in Amer­ica is ~9; as­sume the worst case of 0.05% risk of off-tar­get mu­ta­tion and that one scraps any em­bryo found to have any mu­ta­tion at all even if it looks harm­less; then the prob­a­bil­ity of mak­ing 1000 ed­its with­out an off-tar­get mu­ta­tion could be (1-(0.05/100)) ^ 1000 = 60%, so you’re left with 5.4 good em­bryos, which is a de­cent yield. Mak­ing an edit of the top 1000 be­tas from the Ri­etveld 2013 poly­genic score and fig­ur­ing that it’s weak­ened by maybe 25% due to par­tic­u­lar cells not get­ting par­tic­u­lar ed­its and that is… a very large num­ber.
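The embryo-yield arithmetic in that worst case, collected in one place (assumptions as stated: 0.05% off-target risk per edit, 1000 edits, 9 eggs retrieved):

```r
offTarget <- 0.05 / 100   # worst-case per-edit off-target mutation probability
edits     <- 1000
eggs      <- 9            # median US IVF egg retrieval
pClean <- (1 - offTarget)^edits   # P(an embryo gets all 1000 edits mutation-free)
pClean          # ~0.61
eggs * pClean   # ~5.4 usable embryos
```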

When I was do­ing my dys­gen­ics analy­sis, I found that the Ri­etveld be­tas could be rea­son­ably ap­prox­i­mated by rexp, and we can an­chor it by as­sum­ing the biggest effect is 0.5 IQ points, so we di­vide by 18, in which case we might es­ti­mate the top 750 ed­its at a cu­mu­la­tive value of sum(sort((rexp(10000)/1)/18, decreasing=TRUE)[1:750] * 0.5) → [1] 73.97740467. (Caveats: as­sumes knowl­edge of true be­tas, needs to be weak­ened for ac­tual pos­te­rior prob­a­bil­i­ties, etc etc.)

How much did that <74 IQ points cost? Well, I hear that TALENS in bulk costs $500 so you could ball­park mar­ginal costs of par­tic­u­lar CRISPR ed­its at that much too (and hope­fully much less), and whole-genomes still cost $1k, and you need to do 1000 ed­its on each em­bryo and whole-genomes at the end to check for off-tar­get mu­ta­tions, so you could ball­park a full suite of ed­its to 10 em­bryos at ~$5m: 10*1000 + 10 * 1000 * 500 → [1] 5010000. Of them 40% will have an off-tar­get mu­ta­tion, so you get 6 em­bryos to im­plant at a suc­cess rate of ~20% each which gives you about even odds for a healthy live birth, so you need to dou­ble the $5m.

CRISPR: $30? vs $5k for TALENs http://www.nature.com/news/crispr-the-disruptor-1.17673 “Za­yner says the kits will con­tain every­thing a bud­ding sci­en­tist needs to carry out CRISPR ex­per­i­ments on yeast or bac­te­ria. For US$130, you can have a crack at re-engi­neer­ing bac­te­ria so that it can sur­vive on a food it nor­mally would­n’t be able to han­dle, or for $160, you can get your eu­kary­ote on and edit the ADE2 gene of yeast to give it a red pig­ment.” https://www.indiegogo.com/projects/diy-crispr-kits-learn-modern-science-by-doing#/ https://www.thermofisher.com/us/en/home/life-science/genome-editing/geneart-crispr/crispr-cas9-based-genome-editing.html http://www.sigmaaldrich.com/technical-documents/articles/biology/crispr-cas9-genome-editing.html http://www.genscript.com/CRISPR-genome-edited-mammalian-cell-lines.html http://ipscore.hsci.harvard.edu/genome-editing-services http://www.blueheronbio.com/Services/CRISPR-Cas9.aspx http://www.addgene.org/crispr/ ‘Jen­nifer Doud­na, one of the co-dis­cov­er­ers of CRISPR, told MIT Tech Re­view’s An­to­nio Re­gal­ado just how easy it was to work with the tool: “Any sci­en­tist with mol­e­c­u­lar bi­ol­ogy skills and knowl­edge of how to work with [em­bryos] is go­ing to be able to do this.” Har­vard ge­neti­cist George Church, whose lab is do­ing some of the pre­mier re­search on CRISPR, says: “You could con­ceiv­ably set up a CRISPR lab for $2000.”’ http://www.businessinsider.com/how-to-genetically-modify-human-embryos-2015-4 “De­moc­ra­tiz­ing ge­netic en­gi­neer­ing: This one should keep you up at night. CRISPR is so ac­ces­si­ble-you can or­der the com­po­nents on­line for $60-that it is putting the power of ge­netic en­gi­neer­ing into the hands of many more sci­en­tists. But the next wave of users could be at-home hob­by­ists. This year, de­vel­op­ers of a do-it-y­our­self ge­netic en­gi­neer­ing kit be­gan offer­ing it for $700, less than the price of some com­put­ers. 
The trend might lead to an ex­plo­sion of in­no­va­tion-or to dan­ger­ous, un­con­trolled ex­per­i­ments by new­bies. Watch out, world.” https://www.technologyreview.com/s/543941/everything-you-need-to-know-about-crispr-gene-editings-monster-year/ “In­di­vid­ual plas­mids can be or­dered at $65 per plas­mid, and will be shipped as bac­te­r­ial stabs” https://www.addgene.org/crispr/yamamoto/multiplex-crispr-kit/ “At least since 1953, when James Wat­son and Fran­cis Crick char­ac­ter­ized the he­li­cal struc­ture of DNA, the cen­tral project of bi­ol­ogy has been the effort to un­der­stand how the shift­ing arrange­ment of four com­pound­s-adenine, gua­nine, cy­tosine, and thymine-de­ter­mines the ways in which hu­mans differ from each other and from every­thing else alive. CRISPR is not the first sys­tem to help sci­en­tists pur­sue that goal, but it is the first that any­one with ba­sic skills and a few hun­dred dol­lars’ worth of equip­ment can use.”CRISPR is the Model T of ge­net­ic­s," Hank Greely told me when I vis­ited him re­cent­ly, at Stan­ford Law School, where he is a pro­fes­sor and the di­rec­tor of the Cen­ter for Law and the Bio­sciences. “The Model T was­n’t the first car, but it changed the way we dri­ve, work, and live. CRISPR has made a diffi­cult process cheap and re­li­able. It’s in­cred­i­bly pre­cise. But an im­por­tant part of the his­tory of mol­e­c­u­lar bi­ol­ogy is the his­tory of edit­ing genes.”…“In the past, this would have taken the field a decade, and would have re­quired a con­sor­tium,” Platt said. “With CRISPR, it took me four months to do it by my­self.” In Sep­tem­ber, Zhang pub­lished a re­port, in the jour­nal Cell, de­scrib­ing yet an­other CRISPR pro­tein, called Cpf1, that is smaller and eas­ier to pro­gram than Cas9." http://www.newyorker.com/magazine/2015/11/16/the-gene-hackers?mbid=rss

2017 7% http://predictionbook.com/predictions/177110 2018 30% http://predictionbook.com/predictions/177114 2019 55% http://predictionbook.com/predictions/177115 by 2020, 75% http://predictionbook.com/predictions/177111

Whither embryo editing? The problem with IVF is that it is expensive, painful, and requires patience & planning. Even with large gains from editing, can we expect more than the current 1% or so of parents to ever be willing to conceive via IVF for those gains? If not, then the benefits may take generations to mix into the general population as offspring reproduce normally with unedited people and people happen to need to use IVF for regular fertility reasons. A solution is suggested by the adult CRISPR trials: make the germline edits in advance. In females, it seems like it might be hard to reach all the eggs which could potentially result in conception, but in males, sperm turnover is constant (https://en.wikipedia.org/wiki/Spermatogenesis) and sperm are replenished by stem cells in the seminiferous tubules (https://en.wikipedia.org/wiki/Seminiferous_tubule). CRISPR edits to spermatogonia, with subsequent inheritance by offspring, have already been demonstrated: "Targeted Germline Modifications in Rats Using CRISPR/Cas9 and Spermatogonial Stem Cells", Chapman et al 2014 http://www.sciencedirect.com/science/article/pii/S2211124715001989 ; "Genome Editing in Mouse Spermatogonial Stem Cell Lines Using TALENS and Double-Nicking CRISPR/Cas9", Sato et al 2015 http://www.sciencedirect.com/science/article/pii/S221367111500154X . Extraction, modification, and re-transplanting is likely also a no-fly, but the edits demonstrate that CRISPR will not interfere with reproduction, and so better delivery methods can be developed; editing via two intravenous injections of a virus (carrying the editing template) and lipid-encapsulated Cas9 enzyme into the mice's tail vein has been done: Yin et al 2016, yielding 6% of cells edited in the liver.

https://en.wikipedia.org/wiki/Status_quo_bias "The Reversal Test: Eliminating Status Quo Bias in Applied Ethics", Bostrom & Ord 2006 https://ethicslab.georgetown.edu/phil553/wordpress/wp-content/uploads/2015/01/Ord-and-Bostrom-Eliminating-Status-Quo-Bias-in-Applied-Ethics-.pdf

A mutation load review leads me to some hard figures from Simons et al 2014 (supplement), using data from Fu et al 2012; particularly relevant is figure 3, the number of single-nucleotide variants per person over the European-American sample, split by estimates of harm from least to most likely: 21345 + 15231 + 5338 + 1682 + 1969 = 45565. The supplementary tables give a count of all observed SNVs by category, which sum to 300209 + 8355 + 220391 + 7001 + 351265 + 10293 = 897514, so the average frequency must be 45565/897514 = 0.05, and then the binomial SD will be sqrt(897514 * 0.05 * (1-0.05)) = 206.47.

https://en.wikipedia.org/wiki/Paternal_age_effect https://www.biorxiv.org/content/biorxiv/early/2016/03/08/042788.full.pdf "Older fathers' children have lower evolutionary fitness across four centuries and in four populations"; "Rate of de novo mutations and the importance of father's age to disease risk" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3548427/ ; "Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios" http://www.nature.com/ncomms/2015/150119/ncomms6969/full/ncomms6969.html ; http://www.genetics.org/content/202/3/869 "Thus, keeping in mind that some mutations in repetitive DNA likely go undetected owing to mapping difficulties in genome-sequencing projects, with a diploid genome size of ~6 billion bases, an average newborn contains ~100 de novo mutations….Numerous studies with model organisms indicate that such effects have a broad distribution (Lynch et al.
1999; Hal­li­gan and Keight­ley 2009)-most mu­ta­tions have mi­nor effects, very few have lethal con­se­quences, and even fewer are ben­e­fi­cial. In all or­gan­isms, the ma­jor­ity of mu­ta­tions with effects on fit­ness re­duce viability/fecundity by some­thing on the or­der of 1% per mu­ta­tion (Lynch et al. 1999; Yam­pol­sky et al. 2005; Eyre-Walker and Keight­ley 2007), and this class is thought to con­sti­tute 1-10% of all hu­man mu­ta­tions, the re­main­der be­ing es­sen­tially neu­tral (Lind­blad-Toh et al. 2011; Keight­ley 2012; Rands et al. 2014). Tak­ing the lower end of the lat­ter range sug­gests that the re­cur­rent load of mu­ta­tions im­posed on the hu­man pop­u­la­tion drags fit­ness down by 1% per gen­er­a­tion, more so if the frac­tion of dele­te­ri­ous mu­ta­tions ex­ceeds 0.01 or if the en­vi­ron­ment is mu­ta­genic, and less so if the av­er­age fit­ness effect of a mu­ta­tion were to be <1%”
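A quick check of the figures quoted above (the Simons et al 2014 SNV counts, and the ~1%-per-generation load implied by ~100 de novo mutations of which ~1% are deleterious at ~1% fitness cost each):

```r
# per-person SNV counts by predicted harm (Simons et al 2014, figure 3):
perPerson <- c(21345, 15231, 5338, 1682, 1969)
sum(perPerson)    # 45565
# all observed SNVs by category, from the supplementary tables:
total <- 300209 + 8355 + 220391 + 7001 + 351265 + 10293   # 897514
sum(perPerson) / total                # average frequency ~0.0508
sqrt(total * 0.05 * (1 - 0.05))       # binomial SD ~206.5 (p rounded to 0.05, as in the text)
# recurrent mutation load: ~100 de novo mutations x 1% deleterious x ~1% fitness cost
100 * 0.01 * 0.01                     # ~1% fitness decline per generation
```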

“Parental in­flu­ence on hu­man germline de novo mu­ta­tions in 1,548 trios from Ice­land”, Jóns­son et al 2017:

To un­der­stand how the age and sex of trans­mit­ting par­ents affect de novo mu­ta­tions, here we se­quence 1,548 Ice­landers, their par­ents, and, for a sub­set of 225, at least one child, to 35× genome-wide cov­er­age. We find 108,778 de novo mu­ta­tions, both sin­gle nu­cleotide poly­mor­phisms and in­dels, and de­ter­mine the par­ent of ori­gin of 42,961. The num­ber of de novo mu­ta­tions from moth­ers in­creases by 0.37 per year of age (95% CI 0.32-0.43), a quar­ter of the 1.51 per year from fa­thers (95% CI 1.45-1.57). The num­ber of clus­tered mu­ta­tions in­creases faster with the moth­er’s age than with the fa­ther’s, and the ge­nomic span of ma­ter­nal de novo mu­ta­tion clus­ters is greater than that of pa­ter­nal ones.

…To as­sess differ­ences in the rate and class of DNMs trans­mit­ted by moth­ers and fa­thers, we analysed whole-genome se­quenc­ing (WGS) data from 14,688 Ice­landers with an av­er­age of 35x cov­er­age (Data De­scrip­tor^19). This set con­tained 1,548 trios, used to iden­tify 108,778 high­-qual­ity DNMs (101,377 sin­gle nu­cleotide poly­mor­phisms (SNPs); Meth­ods and Fig. 1), re­sult­ing in an av­er­age of 70.3 DNMs per proband. The DNM call qual­ity was also as­sessed us­ing 91 monozy­gotic twins of probands (Meth­od­s). Of 6,034 DNMs ob­served in these probands, 97.1% were found in their twins. Sanger se­quenc­ing was used to val­i­date 38 dis­cor­dant calls in monozy­gotic twins, of which 57.9% were con­firmed to be present only in the proband, and there­fore postzy­gotic, with the rest deemed geno­typ­ing er­rors.

…Mu­ta­tion rates are key pa­ra­me­ters for cal­i­brat­ing the timescale of se­quence di­ver­gence. We es­ti­mate the mu­ta­tion rate as 1.29 × 10-8 per base pair per gen­er­a­tion and 4.27 × 10-10 per base pair per year (Meth­od­s). Our find­ings have a di­rect bear­ing on the dis­par­ity that has emerged be­tween mu­ta­tion rates es­ti­mated di­rectly from pedi­grees (~4 × 10-10 per base pair per year) and phy­lo­ge­netic rates (~10-9 per base pair per year)3, as they in­di­cate that the mol­e­c­u­lar clock is affected by life-his­tory traits in a sex-spe­cific man­ner23-25 and varies by ge­nomic re­gion within and across species. This al­lows us to pre­dict the long-term con­se­quences of a shift in gen­er­a­tion times (Meth­od­s)24. Thus, a 10 year in­crease in the av­er­age age of fa­thers would in­crease the mu­ta­tion rate by 4.7% per year. The same change for moth­ers would de­crease the mu­ta­tion rate by 9.6%, be­cause ex­tra mu­ta­tions at­trib­ut­able to older moth­ers are off­set by fewer gen­er­a­tions
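A side-calculation on the quoted rates: dividing the per-generation rate by the per-year rate recovers the average generation time the estimates imply.

```r
# Jónsson et al 2017 point estimates:
perGeneration <- 1.29e-8    # mutations per base pair per generation
perYear       <- 4.27e-10   # mutations per base pair per year
perGeneration / perYear     # ~30 years per generation
```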

, Paten et al 2017:

From short­-read based as­says, it is es­ti­mated that the av­er­age diploid hu­man has be­tween 4.1 and 5 mil­lion point mu­ta­tions, ei­ther sin­gle nu­cleotide vari­ants (SNVs), mul­ti­-nu­cleotide vari­ants (MNVs), or short in­dels, which is only around 1 point vari­ant every 1450 to 1200 bases of hap­loid se­quence (Au­ton et al. 2015). Such an av­er­age hu­man would also have about 20 mil­lion bases-about 0.3% of the genome-affected by around 2,100-2,500 larger struc­tural vari­ants (Au­ton et al. 2015). It should be noted that both these es­ti­mates are likely some­what con­ser­v­a­tive as some re­gions of the genome are not ac­cu­rately sur­veyed by the short read tech­nol­ogy used. In­deed, long read se­quenc­ing demon­strates an ex­cess of struc­tural vari­a­tion not found by ear­lier short read tech­nol­ogy (Chais­son et al. 2015; Seo et al. 2016).


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3155974/ “Ge­netic costs of do­mes­ti­ca­tion and im­prove­ment” Moy­ers et al 2017 https://www.biorxiv.org/content/early/2017/03/29/122093

“The de­mo­graphic his­tory and mu­ta­tional load of African hunter-gath­er­ers and farm­ers”, Lopez et al 2017 https://www.biorxiv.org/content/early/2017/04/26/131219

“(This dataset in­cludes 463 mil­lion vari­ants on 62784 in­di­vid­u­als. Click here to switch to Freeze3a on GRCh37/hg19.)” https://bravo.sph.umich.edu/freeze5/hg38/

, Gazal et al 2017:

Re­cent work has hinted at the link­age dis­e­qui­lib­rium (LD)-de­pen­dent ar­chi­tec­ture of hu­man com­plex traits, where SNPs with low lev­els of LD (LLD) have larger per-SNP her­i­tabil­i­ty. Here we an­a­lyzed sum­mary sta­tis­tics from 56 com­plex traits (av­er­age N = 101,401) by ex­tend­ing strat­i­fied LD score re­gres­sion to con­tin­u­ous an­no­ta­tions. We de­ter­mined that SNPs with low LLD have sig­nifi­cantly larger per-SNP her­i­tabil­ity and that roughly half of this effect can be ex­plained by func­tional an­no­ta­tions neg­a­tively cor­re­lated with LLD, such as DNase I hy­per­sen­si­tiv­ity sites (DHSs). The re­main­ing sig­nal is largely dri­ven by our find­ing that more re­cent com­mon vari­ants tend to have lower LLD and to ex­plain more her­i­tabil­ity (P = 2.38 × 10−104); the youngest 20% of com­mon SNPs ex­plain 3.9 times more her­i­tabil­ity than the old­est 20%, con­sis­tent with the ac­tion of neg­a­tive se­lec­tion. We also in­ferred jointly sig­nifi­cant effects of other LD-re­lated an­no­ta­tions and con­firmed via for­ward sim­u­la­tions that they jointly pre­dict dele­te­ri­ous effects.

, Schoech et al 2017:

Un­der­stand­ing the role of rare vari­ants is im­por­tant in elu­ci­dat­ing the ge­netic ba­sis of hu­man dis­eases and com­plex traits. It is widely be­lieved that neg­a­tive se­lec­tion can cause rare vari­ants to have larger per-al­lele effect sizes than com­mon vari­ants. Here, we de­velop a method to es­ti­mate the mi­nor al­lele fre­quency (MAF) de­pen­dence of SNP effect sizes. We use a model in which per-al­lele effect sizes have vari­ance pro­por­tional to [p(1-p)]α, where p is the MAF and neg­a­tive val­ues of α im­ply larger effect sizes for rare vari­ants. We es­ti­mate α by max­i­miz­ing its pro­file like­li­hood in a lin­ear mixed model frame­work us­ing im­puted geno­types, in­clud­ing rare vari­ants (MAF >0.07%). We ap­plied this method to 25 UK Biobank dis­eases and com­plex traits (N=113,851). All traits pro­duced neg­a­tive α es­ti­mates with 20 sig­nifi­cantly neg­a­tive, im­ply­ing larger rare vari­ant effect sizes. The in­ferred best-fit dis­tri­b­u­tion of true α val­ues across traits had mean -0.38 (s.e. 0.02) and stan­dard de­vi­a­tion 0.08 (s.e. 0.03), with sta­tis­ti­cally sig­nifi­cant het­ero­gene­ity across traits (P=0.0014). De­spite larger rare vari­ant effect sizes, we show that for most traits an­a­lyzed, rare vari­ants (MAF <1%) ex­plain less than 10% of to­tal SNP-heritability. Us­ing evo­lu­tion­ary mod­el­ing and for­ward sim­u­la­tions, we val­i­dated the α model of MAF-dependent trait effects and es­ti­mated the level of cou­pling be­tween fit­ness effects and trait effects. Based on this analy­sis an av­er­age genome-wide neg­a­tive se­lec­tion co­effi­cient on the or­der of 10-4 or stronger is nec­es­sary to ex­plain the α val­ues that we in­ferred.
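The α model quoted above can be sketched by drawing per-allele effect sizes with variance proportional to [p(1-p)]^α; with negative α, rare variants get larger typical effects. (The MAF distribution here is a hypothetical stand-in; α = -0.38 is the paper's mean estimate across traits.)

```r
set.seed(1)
alpha <- -0.38                                 # Schoech et al mean estimate; negative => rarer = bigger
maf   <- rbeta(100000, 0.2, 0.2)/2 + 1e-4      # hypothetical MAF distribution on (0, 0.5]
v     <- (maf * (1 - maf))^alpha               # per-allele effect-size variance, up to a constant
beta  <- rnorm(length(maf), mean=0, sd=sqrt(v))
# rare variants have larger typical effects than common ones:
mean(abs(beta[maf < 0.01])) > mean(abs(beta[maf > 0.4]))   # TRUE
```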

SNPs have two relevant properties for simulation purposes: a population frequency (how often the allele is present) and an effect size.

Benyamin polygenic score distribution: looks either exponential or log-normal, and a Cullen & Frey graph indicates it is closest to the exponential & log-normal distributions.

library(fitdistrplus) ## for descdist/fitdist
with(benyamin[benyamin$EFFECT_A1>0,], descdist(abs(EFFECT_A1)/15, discrete=FALSE))
fit.exp <- with(benyamin[benyamin$EFFECT_A1>0,], fitdist(abs(EFFECT_A1)/15, "exp")); fit.exp
# Fitting of the distribution ' exp ' by maximum likelihood
# Parameters:
#        estimate  Std. Error
# rate 1063.15695 1.281452665
fit.ln <- with(benyamin[benyamin$EFFECT_A1>0,], fitdist(abs(EFFECT_A1)/15, "lnorm")); fit.ln
# Fitting of the distribution ' lnorm ' by maximum likelihood
# Parameters:
#             estimate      Std. Error
# meanlog -7.402017853 0.0013198365617
# sdlog    1.095050030 0.0009332618808

Plotting residuals & diagnostics, the log-normal performs much worse due to overestimating the number of near-zero effects (the Q-Q plot is particularly bad), so we'll use the exponential for effect sizes.
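The same exponential-vs-log-normal comparison can be sketched outside fitdistrplus; here is a rough Python/SciPy version on effect sizes simulated from the exponential fit above (the rate ~1063 is taken from the `fitdist` output; everything else is illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)
## simulate |effect sizes| from the exponential fit above (rate ~1063):
effects = rng.exponential(scale=1/1063.157, size=50_000)

## maximum-likelihood fits of both candidate distributions (location pinned at 0):
_, exp_scale = stats.expon.fit(effects, floc=0)
rate_hat = 1 / exp_scale
ln_shape, _, ln_scale = stats.lognorm.fit(effects, floc=0)

## compare total log-likelihoods: the exponential should win here, because the
## log-normal density must vanish at 0 and so underweights near-zero effects,
## mirroring the Q-Q problem seen in the fitdistrplus diagnostics:
ll_exp = stats.expon.logpdf(effects, scale=exp_scale).sum()
ll_ln  = stats.lognorm.logpdf(effects, ln_shape, 0, ln_scale).sum()
```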

As pro­por­tions, the fre­quen­cies are prob­a­bly close to ei­ther uni­form or beta dis­tri­b­u­tions; in this case, they are very nearly uni­formly dis­trib­uted 0-1:

fitdist(benyamin$FREQ_A1, "beta", method="mme")
# Fitting of the distribution ' beta ' by matching moments
# Parameters:
#           estimate
# shape1 1.065412339
# shape2 1.126576711

Set up: 50k SNPs; true effect sizes are exponentially distributed as exponential(rate=1063.15695); frequencies distributed beta(1.065412339, 1.126576711). We can't do a full simulation of 500k SNPs with 100k datapoints: there's not enough RAM given the inefficient R format for booleans: (100000 * 100000 * 48 bytes) / 1000000000 = 4.8e+11 bytes or 480GB. (Can JAGS even handle that much data?)
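As a sanity check on that RAM estimate (using the 48-bytes-per-boolean figure assumed above, which reflects R data-frame overhead rather than raw logical storage):

```python
## 100k individuals x 100k SNPs at 48 bytes per cell:
n_people = 100_000
n_snps   = 100_000
bytes_per_cell = 48  # assumed per-boolean cost in an R data frame
total_bytes = n_people * n_snps * bytes_per_cell
total_gb = total_bytes / 1e9  # 480 GB, far beyond commodity RAM
```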

SNPs <- 500000
SNPlimit <- 50000
N <- 10000
genes <- data.frame(Effects = sort(rexp(SNPs, rate=1063.15695), decreasing=TRUE)[1:SNPlimit], Frequencies = rbeta(SNPlimit, shape1=1.065412339, shape2=1.126576711))
person <- function() { rbinom(SNPlimit, 1, prob=genes$Frequencies) }
test <- as.data.frame(t(replicate(N, person())))
format(object.size(test), units="GB")
# [1] "1.9 Gb"
test$Polygenic.score <- as.vector(as.matrix(test[,1:SNPlimit]) %*% genes$Effects)
test$IQ <- 100 + 15 * rnorm(N, mean=scale(test$Polygenic.score), sd=sqrt(1-0.33))

b <- BGLR(test$IQ, ETA=list(list(X=test[,1:SNPlimit], model="BL", lambda=202, shape=1.1, rate=2.8e-06)), R2=0.33)

BGLR prob­lem: I can’t scale it to n=100k/SNP=500k, BGLR RAM con­sump­tion maxes out at n=10k/SNP=50k. This would­n’t be a prob­lem for the Thomp­son sam­pling, but I need to use JAGS any­way for that. So BGLR can’t give me pos­te­ri­ors for Ri­etveld et al 2013 or Benyamin 2014 da­ta.

We can do a Bayesian GWAS from public summary statistics using RSS ("Bayesian large-scale multiple regression with summary statistics from genome-wide association studies" https://www.biorxiv.org/content/biorxiv/early/2016/03/04/042457.full.pdf ), a Matlab library:

- code: https://github.com/stephenslab/rss
- installation: https://github.com/stephenslab/rss/wiki/RSS-via-MCMC
- example 1: https://github.com/stephenslab/rss/wiki/Example-1 (code: https://github.com/stephenslab/rss/blob/master/examples/example1.m ; data: https://uchicago.app.box.com/example1 )

It doesn't run under Octave at the moment, although I got it close: https://github.com/stephenslab/rss/issues/1 What are my alternatives? RSS is the best-looking Bayesian one so far. I could try pirating Matlab, or apparently the Machine Learning Coursera gives one a temporary 120-day subscription.

groundTruth <- function(alleles, causal, genotype) {
    score <- rnorm(1, mean=100, sd=15)
    for (i in 1:length(genotype)) {
      if (causal[i] && genotype[i]) { score <- score + alleles[i] } }
    return(score) }

alleles <- sort(rexp(10000, rate=70), decreasing=TRUE)[1:10]
causal <- c(TRUE, rep(FALSE, 8), TRUE)

df <- data.frame()
for (i in 1:50) {
 genotype <- rbinom(n=10, size=1, prob=0.50)
 IQ <- groundTruth(alleles, causal, genotype)
 df <- rbind(df, c(IQ, genotype)) }
names(df) <- c("IQ", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df2 <- melt(df, "IQ") ## requires reshape2

## quick JAGS sketch (via R2jags): exponential priors on the per-SNP effects,
## regressing IQ on the genotype matrix from the wide dataframe:
model1 <- function() {
  for (j in 1:K) { b[j] ~ dexp(70) }
  for (i in 1:N) {
   mu[i] <- 100 + inprod(X[i,], b)
   y[i] ~ dnorm(mu[i], 0.06) } }
data1 <- list(N=nrow(df), K=10, y=df$IQ, X=as.matrix(df[,-1]))
jags(data=data1, parameters.to.save="b", model.file=model1)

model1 <- "model {
  for (i in 1:N) {
   y[i] ~ dnorm(mu[i], tau[ clust[i] ])
   mu[i] <- muOfClust[ clust[i] ]
   clust[i] ~ dcat(pi[])
  }
  for (k in 1:K) {
    pi[k] <- 0.05
    muOfClust[k] ~ dexp(70)
    tau[k] ~ dgamma(1.0E-3, 1.0E-3)
  }
}"
j1 <- autorun.jags(model1, monitor=c("theta"), data = list(N=nrow(oldData2), y=oldData2$Yes, n=oldData2$N, switch=c(0.5, 0.5), clust=c(1,NA))); j1
# ...      Lower95  Median Upper95    Mean       SD Mode      MCerr MC%ofSD SSeff    AC.10   psrf
# theta[1] 0.70582 0.75651 0.97263 0.77926  0.07178   --   0.001442       2  2478  0.12978 1.0011
# theta[2] 0.72446 0.75078 0.77814 0.75054 0.013646   -- 0.00009649     0.7 20000 0.009458      1

https://github.com/jeromyanglim/JAGS_by_Example/blob/master/03-multilevel/varying-intercept-long/varying-intercept-long.rmd http://doingbayesiandataanalysis.blogspot.com/2012/06/mixture-of-normal-distributions.html http://www.stats.ox.ac.uk/~bardenet/Material/mixture_with_jags_only_mu.html

“Genome-wide as­so­ci­a­tion study iden­ti­fies 74 [162] loci as­so­ci­ated with ed­u­ca­tional at­tain­ment”, Ok­bay et al 2016 https://www.dropbox.com/s/my9719yd8s5hplf/2016-okbay-2.pdf sup­ple­men­tary in­fo: http://www.nature.com/nature/journal/vaop/ncurrent/extref/nature17671-s1.pdf http://www.nature.com/nature/journal/vaop/ncurrent/extref/nature17671-s2.xlsx http://www.thessgac.org/#!data/kuzq8 http://ssgac.org/documents/FAQ_74_loci_educational_attainment.pdf

Ed­u­ca­tional at­tain­ment is strongly in­flu­enced by so­cial and other en­vi­ron­men­tal fac­tors, but ge­netic fac­tors are es­ti­mated to ac­count for at least 20% of the vari­a­tion across in­di­vid­u­al­s1. Here we re­port the re­sults of a genome-wide as­so­ci­a­tion study (GWAS) for ed­u­ca­tional at­tain­ment that ex­tends our ear­lier dis­cov­ery sam­ple1, 2 of 101,069 in­di­vid­u­als to 293,723 in­di­vid­u­als, and a repli­ca­tion study in an in­de­pen­dent sam­ple of 111,349 in­di­vid­u­als from the UK Biobank. We iden­tify 74 [162 to­tal] genome-wide sig­nifi­cant loci as­so­ci­ated with the num­ber of years of school­ing com­plet­ed. Sin­gle-nu­cleotide poly­mor­phisms as­so­ci­ated with ed­u­ca­tional at­tain­ment are dis­pro­por­tion­ately found in ge­nomic re­gions reg­u­lat­ing gene ex­pres­sion in the fe­tal brain. Can­di­date genes are pref­er­en­tially ex­pressed in neural tis­sue, es­pe­cially dur­ing the pre­na­tal pe­ri­od, and en­riched for bi­o­log­i­cal path­ways in­volved in neural de­vel­op­ment. Our find­ings demon­strate that, even for a be­hav­ioural phe­no­type that is mostly en­vi­ron­men­tally de­ter­mined, a well-pow­ered GWAS iden­ti­fies replic­a­ble as­so­ci­ated ge­netic vari­ants that sug­gest bi­o­log­i­cally rel­e­vant path­ways. Be­cause ed­u­ca­tional at­tain­ment is mea­sured in large num­bers of in­di­vid­u­als, it will con­tinue to be use­ful as a proxy phe­no­type in efforts to char­ac­ter­ize the ge­netic in­flu­ences of re­lated phe­no­types, in­clud­ing cog­ni­tion and neu­ropsy­chi­atric dis­eases.

Us­ing pro­ce­dures iden­ti­cal to those de­scribed in SI Sec­tion 1.6, we con­ducted a meta-analy­sis of the EduYears phe­no­type, com­bin­ing the re­sults from our dis­cov­ery co­horts (N = 293,723) and the re­sults from the UKB repli­ca­tion co­hort (N = 111,349). Ex­pand­ing the over­all sam­ple size to N = 405,072 in­creases the num­ber of ap­prox­i­mately in­de­pen­dent genome-wide sig­nifi­cant loci from 74 to 162.

…This back­ground suffices to mo­ti­vate the bi­o­log­i­cal ques­tions that arise in the in­ter­pre­ta­tion of GWAS re­sults and the means by which these ques­tions might be ten­ta­tively ad­dressed. For starters, since a GWAS lo­cus typ­i­cally con­tains many other SNPs in LD with the defin­ing lead SNP and with each oth­er, it is nat­ural to ask: which of these SNPs is the ac­tual causal site re­spon­si­ble for the down­stream phe­no­typic vari­a­tion? Many SNPs in the genome ap­pear to be bi­o­log­i­cally in­ert-nei­ther en­cod­ing differ­ences in pro­tein com­po­si­tion nor affect­ing gene reg­u­la­tion-and a lead GWAS SNP may fall into this cat­e­gory and nonethe­less show the strongest as­so­ci­a­tion sig­nal as a re­sult of sta­tis­ti­cal noise or hap­pen­stance LD with mul­ti­ple causal sites. For­tu­nate­ly, much is known from ex­ter­nal sources of data about whether vari­a­tion at a par­tic­u­lar site is likely to have bi­o­log­i­cal con­se­quences, and ex­ploit­ing these re­sources is our gen­eral strat­egy for fine-map­ping loci: nom­i­nat­ing in­di­vid­ual sites that may be causally re­spon­si­ble for the GWAS sig­nals. De­scrip­tions of ge­nomic sites or re­gions based on ex­ter­nal sources of data are known as an­no­ta­tions, and read­ers will not go far astray if they in­ter­pret this term rather lit­er­ally (as re­fer­ring to a note of ex­pla­na­tion or com­ment added to a text in one of the mar­gin­s). If we re­gard the type genome as the ba­sic text, then an­no­ta­tions are ad­di­tional com­ments de­scrib­ing the struc­tural or func­tional prop­er­ties of par­tic­u­lar sites or the re­gions in which they re­side. For ex­am­ple, all non­syn­ony­mous sites that in­flu­ence pro­tein struc­tures might be an­no­tated as such. An an­no­ta­tion can be far more spe­cific than this; for in­stance, all sites that fall in a reg­u­la­tory re­gion ac­tive in the fe­tal liver might bear an an­no­ta­tion to this effect. 
A given causal site will ex­ert its phe­no­typic effect through al­ter­ing the com­po­si­tion of a gene prod­uct or reg­u­lat­ing its ex­pres­sion. Con­cep­tu­al­ly, once a causal site has been iden­ti­fied or at least nom­i­nat­ed, the next ques­tion to pur­sue is the iden­tity of the me­di­at­ing gene. In prac­tice, be­cause only a hand­ful of genes at most will typ­i­cally over­lap a GWAS lo­cus, we can make some progress to­ward an­swer­ing this ques­tion with­out pre­cise knowl­edge of the causal site. The diffi­culty of the prob­lem, how­ev­er, should still not be un­der­es­ti­mat­ed. It is nat­ural to as­sume that a lead GWAS SNP ly­ing in­side the bound­aries of a par­tic­u­lar gene must re­flect a causal mech­a­nism in­volv­ing that gene it­self, but in cer­tain cases such a con­clu­sion would be pre­ma­ture. It is pos­si­ble for a causal SNP ly­ing in­side a cer­tain gene to ex­ert its phe­no­typic effect by reg­u­lat­ing the ex­pres­sion of a nearby gene or for sev­eral genes to in­ter­vene be­tween the SNP and its reg­u­la­tory tar­get. Sup­ple­men­tary Ta­ble 4.1 ranks each gene over­lap­ping a DEPICT-defined lo­cus by the num­ber of dis­crete ev­i­den­tiary items fa­vor­ing that gene (see Sup­ple­men­tary In­for­ma­tion sec­tion 4.5 for de­tails re­gard­ing DEPICT). These lines of ev­i­dence are taken from a num­ber of our analy­ses to be de­tailed in the fol­low­ing sub­sec­tions. Our pri­mary tool for gene pri­or­i­ti­za­tion is DEPICT, which can be used to cal­cu­late a P-value and as­so­ci­ated FDR for each gene. It is im­por­tant to keep in mind, how­ev­er, that a gene-level P-value re­turned by DEPICT refers to the tail prob­a­bil­ity un­der the null hy­poth­e­sis that ran­dom sam­pling of loci can ac­count for an­no­ta­tions and pat­terns of co-ex­pres­sion shared by the fo­cal gene with genes in all other GWAS-identified loci. 
Al­though it is very rea­son­able to ex­pect that genes in­volved in the same phe­no­type do in­deed share an­no­ta­tions and pat­terns of co-ex­pres­sion, it may be the case that cer­tain causal genes do not con­form to this ex­pec­ta­tion and thus fail to yield low DEPICT P-val­ues. This is why we do not rely on DEPICT alone but also the other lines of ev­i­dence de­scribed in the cap­tion of Sup­ple­men­tary Ta­ble 4.1.

How­ev­er, a pri­ori we know that some SNPs are more likely to be as­so­ci­ated with the phe­no­type than oth­ers; for ex­am­ple, it is often as­sumed that non­syn­ony­mous SNPs are more likely to in­flu­ence phe­no­types than sites that fall far from all known genes. So a P-value of 5×10 −7 , say, though not typ­i­cally con­sid­ered sig­nifi­cant at the genome-wide lev­el, might merit a sec­ond look if the SNP in ques­tion is non­syn­ony­mous. For­mal­iz­ing this in­tu­ition can be done with Bayesian sta­tis­tics, which com­bines the strength of ev­i­dence in fa­vor of a hy­poth­e­sis (in our case, that a ge­nomic site is as­so­ci­ated with a phe­no­type) with the prior prob­a­bil­ity of the hy­poth­e­sis. De­cid­ing how to set this prior is often sub­jec­tive. How­ev­er, if many hy­pothe­ses are be­ing tested (for ex­am­ple, if there are thou­sands of non­syn­ony­mous poly­mor­phisms in the genome), then the prior can be es­ti­mated from the data them­selves us­ing what is called “em­pir­i­cal Bayes” method­ol­o­gy. For ex­am­ple, if it turns out that SNPs with low P-val­ues tend to be non­syn­ony­mous sites rather than other types of sites, then the prior prob­a­bil­ity of true as­so­ci­a­tion is in­creased at all non­syn­ony­mous sites. In this way a non­syn­ony­mous site that oth­er­wise falls short of the con­ven­tional sig­nifi­cance thresh­old can be­come pri­or­i­tized once the em­pir­i­cally es­ti­mated prior prob­a­bil­ity of as­so­ci­a­tion is taken into ac­count. Note that such fa­vor­able reweight­ing of sites within a par­tic­u­lar class is not set a pri­ori, but is learned from the GWAS re­sults them­selves. In our case, we split the genome into ap­prox­i­mately in­de­pen­dent blocks and es­ti­mate the prior prob­a­bil­ity that each block con­tains a causal SNP that in­flu­ences the phe­no­type and (within each block) the con­di­tional prior prob­a­bil­ity that each in­di­vid­ual SNP is the causal one. 
Each such prob­a­bil­ity is al­lowed to de­pend on an­no­ta­tions de­scrib­ing struc­tural or func­tional prop­er­ties of the ge­nomic re­gion or the SNPs within it. We can then em­pir­i­cally es­ti­mate to ex­tent to each an­no­ta­tion pre­dicts as­so­ci­a­tion with the fo­cal phe­no­type. For a com­plete de­scrip­tion of the fg­was method, see ref. 1. 4.2.3 Meth­ods For ap­pli­ca­tion to the GWAS of EduYears, we used the same set of 450 an­no­ta­tions as ref. 1; these are avail­able at https://github.com/joepickrell/1000-genomes. …4.2.6 Reweighted GWAS and Fine Map­ping We reweighted the GWAS re­sults us­ing the func­tion­al-ge­nomic re­sults de­scribed above. Us­ing a re­gional pos­te­rior prob­a­bil­ity of as­so­ci­a­tion (PPA) greater than 0.90 as the cut­off, we iden­ti­fied 102 re­gions likely to har­bor a causal SNP with re­spect to EduYears (Ex­tended Data Fig. 7c and Sup­ple­men­tary Ta­ble 4.2.1). All but two of our 74 lead EduYears-as­so­ci­ated SNPs fall within one of these 102 re­gions. The ex­cep­tions are rs3101246 and rs2837992, which at­tained PPA > 0.80 (Ex­tended Data Fig. 7c). In pre­vi­ous ap­pli­ca­tions of fg­was, the ma­jor­ity of novel loci that at­tained the equiv­a­lent of genome-wide sig­nifi­cance only upon reweight­ing later at­tained the con­ven­tional P < 5×10 −8 in larger co­horts 1 . Within each re­gion at­tain­ing PPA > 0.90, each SNP re­ceived a con­di­tional pos­te­rior prob­a­bil­ity of be­ing the causal SNP (un­der the as­sump­tion that there is just one causal SNP in the re­gion). The method of as­sign­ing this lat­ter pos­te­rior prob­a­bil­ity is sim­i­lar to that of ref. 6, ex­cept that the in­put Bayes fac­tors are reweighted by an­no­ta­tion-de­pen­dent and hence SNP-varying prior prob­a­bil­i­ties. 
In essence, the like­li­hood of causal­ity at an in­di­vid­ual SNP de­rives from its Bayes fac­tor with re­spect to phe­no­typic as­so­ci­a­tion (which is mo­not­o­n­i­cally re­lated to the P-value un­der rea­son­able as­sump­tion­s), whereas the prior prob­a­bil­ity is de­rived from any em­pir­i­cal genome-wide ten­dency for the an­no­ta­tions borne by the SNP to pre­dict ev­i­dence of as­so­ci­a­tion. Thus, the SNP with the largest pos­te­rior prob­a­bil­i­ties of causal­ity tend to ex­hibit among the strongest P-val­ues within their loci and func­tional an­no­ta­tions that pre­dict as­so­ci­a­tion through­out the genome. Note that proper cal­i­bra­tion of this pos­te­rior prob­a­bil­ity re­quires that all po­ten­tial causal sites have been ei­ther geno­typed or im­put­ed, which may not be the case in our ap­pli­ca­tion; we did not in­clude diffi­cult-to-im­pute non-SNP sites such as insertions/deletions in the GWAS meta-analy­sis. With this caveat in mind, we iden­ti­fied 17 re­gions where fine map­ping amassed over 50 per­cent of the pos­te­rior prob­a­bil­ity on a sin­gle SNP (Sup­ple­men­tary Ta­ble 4.2.2). Of our 74 lead EduYears SNPs, 9 are good can­di­dates for be­ing the causal sites dri­ving their as­so­ci­a­tion sig­nals [12%]. One of our top SNPs, rs4500960, is in nearly per­fect LD with the causal can­di­date rs2268894 (and is in­deed the sec­ond most likely causal SNP in this re­gion ac­cord­ing to fg­was). The causal can­di­date rs6882046 is within 75kb of two lead SNPs on chro­mo­some 5 (rs324886 and rs10061788), but no two of these three SNPs show strong LD. In­ter­est­ing­ly, the re­main­ing 6 causal can­di­dates lie in ge­nomic re­gions that only at­tain the equiv­a­lent of genome-wide sig­nifi­cance upon Bayesian reweight­ing. Of the 17 causal can­di­dates, 9 lie in re­gions that are DNase I hy­per­sen­si­tive in the fe­tal brain.

Ta­ble 4.2.2:

Pos­te­rior prob­a­bil­ity of causal­ity

0.992035 0.766500 0.842271 0.567184 0.697862 0.524760 0.632536 0.885280 0.968627 0.781563 0.629610 0.837746 0.725158 0.755457 0.784373 0.682947 0.832675

[mean(c(0.524760, 0.567184, 0.629610, 0.632536, 0.682947, 0.697862, 0.725158, 0.755457, 0.766500, 0.781563, 0.784373, 0.832675, 0.837746, 0.842271, 0.885280, 0.968627, 0.992035)) = 0.76; 0.76 * 17 ≈ 13 of the 17 single-SNP candidates expected to be truly causal]
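A minimal sketch of the empirical-Bayes reweighting idea described in the quoted fgwas discussion above (with hypothetical numbers; fgwas itself estimates the annotation-dependent priors from the genome-wide data rather than taking them as given): the posterior odds of association are just the Bayes factor times the annotation-derived prior odds, so a sub-threshold SNP in an enriched annotation class can outrank a nominally stronger but annotation-poor hit.

```python
def posterior_association(bayes_factor, prior):
    """Posterior probability of association from a Bayes factor & prior probability."""
    prior_odds = prior / (1 - prior)
    post_odds = bayes_factor * prior_odds
    return post_odds / (1 + post_odds)

## hypothetical example: a weaker signal (smaller BF) in an enriched annotation
## class (e.g. nonsynonymous sites) beats a stronger annotation-poor signal:
weak_but_annotated = posterior_association(bayes_factor=500,  prior=0.01)
strong_unannotated = posterior_association(bayes_factor=1000, prior=0.001)
```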

The re­sults from both ap­proaches show that pre­dic­tion ac­cu­racy in­creases as more SNPs are used to con­struct the score, with the max­i­mum pre­dic­tive power achieved when us­ing all the geno­typed SNPs (with Ap­proach 1). In that case, the weighted av­er­age across the two co­horts of the in­cre­men­tal R 2 is ~3.85%.

[Ver­sus 2% from Ri­etveld’s n=100k; this is in line with the rough dou­bling of the main SSGAC sam­ple size. The ad­di­tional UK Biobank sam­ple of n=111k does not seem to have been used but if it was used, should boost the poly­genic score to ~5.3%?]

…The mag­ni­tude of pre­dic­tive power that we ob­serve is less than one might have ex­pected on the ba­sis of sta­tis­ti­cal ge­net­ics cal­cu­la­tions 6 and GCTA-GREML es­ti­mates of “SNP her­i­tabil­ity” from in­di­vid­ual co­horts. In­deed, Ri­etveld et al. (2013) 7 re­ported GCTA-GREML es­ti­mates of SNP her­i­tabil­ity for each of two co­horts (STR and QIMR), and the mean es­ti­mate was 22.4%. As­sum­ing that 22.4% is in fact the true SNP her­i­tabil­i­ty, the cal­cu­la­tions out­lined in the SOM of Ri­etveld et al. (pp. 22-23) gen­er­ate a pre­dic­tion of R 2 = 11.0% for a score con­structed from the GWAS es­ti­mates of this pa­per and of R 2 = 6.1% for a score con­structed from the com­bined (dis­cov­ery + repli­ca­tion co­horts, but ex­clud­ing the val­i­da­tion co­horts) GWAS sam­ple of N = ~117,000-119,000 in Ri­etveld et al.-sub­stan­tially higher than the 3.85% that we achieve here (with the score based on all geno­typed SNPs) and the 2.2% Ri­etveld et al. achieved, re­spec­tive­ly. These dis­crep­an­cies be­tween the scores’ pre­dicted and es­ti­mated R 2 may be due to the fail­ure of some of the as­sump­tions un­der­ly­ing the cal­cu­la­tion of the pre­dicted R 2 . An al­ter­na­tive (or ad­di­tion­al) ex­pla­na­tion is that the true SNP her­i­tabil­ity for the GWAS sam­ple pooled across co­horts is lower than 22.4%. That would be the case if the true GWAS co­effi­cients differ across co­horts, per­haps due to het­ero­gene­ity in phe­no­type mea­sure­ment or gene-by-en­vi­ron­ment in­ter­ac­tions. If so, then a poly­genic score con­structed from the pooled GWAS sam­ple would be ex­pected to have lower pre­dic­tive power in an in­di­vid­ual co­hort than im­plied by the cal­cu­la­tions above. Based on that rea­son­ing, the R 2 of 2.2% ob­served by Ri­etveld et al. (2013) could be ra­tio­nal­ized by as­sum­ing that the pro­por­tion of vari­ance ac­counted for by com­mon vari­ants across the pooled Ri­etveld co­horts is only 12.7% 6 . 
(We ob­tain a sim­i­lar es­ti­mate, 11.5% with a stan­dard er­ror of 0.45%, when we use LD Score re­gres­sion 5 to es­ti­mate the SNP her­i­tabil­ity us­ing our pooled-sam­ple meta-analy­sis re­sults from this pa­per, ex­clud­ing deCODE and with­out GC. While we be­lieve this es­ti­mate is based on co­hort re­sults with­out GC, it is bi­ased down­ward if any co­hort in fact ap­plied GC.) If we as­sume that the 12.7% is valid also for the co­horts con­sid­ered in this study, we would pre­dict an R 2 equal to 4.5%, some­what higher than we ob­serve in HRS and STR but much clos­er. How­ev­er, the de­gree of cor­re­la­tion in co­effi­cients across co­horts ap­pears to be rel­a­tively high (Sup­ple­men­tary Ta­ble 1.10 re­ports es­ti­mates of the ge­netic cor­re­la­tion be­tween se­lected co­horts and deCODE; al­though the cor­re­la­tion es­ti­mates vary a lot across co­horts, they tend to be large for the largest co­horts, and the weighted av­er­age is 0.76). We do not know whether a pooled-co­hort SNP her­i­tabil­ity of 12.7% or lower can be rec­on­ciled with the ob­served de­gree of cor­re­la­tion in co­effi­cients across co­horts.

The re­sults are re­ported in Sup­ple­men­tary Ta­bles 6.3 and 6.4. In both the STR and the HRS, cog­ni­tive per­for­mance sig­nifi­cantly me­di­ates the effect of PGS on EduYears; in the HRS, Open­ness to Ex­pe­ri­ence is also a sig­nifi­cant me­di­a­tor. The in­di­rect effects for the other me­di­at­ing vari­ables are not sig­nifi­cant s . The re­sults for cog­ni­tive per­for­mance are sim­i­lar across STR and HRS. In both datasets, a one-s­tan­dard de­vi­a­tion in­crease in PGS is as­so­ci­ated with ~0.6-0.7 more years of ed­u­ca­tion, and a one-s­tan­dard de­vi­a­tion in­crease in cog­ni­tive per­for­mance is as­so­ci­ated with ~0.15 more years of ed­u­ca­tion. In both datasets, the di­rect effect (θ 1 ) of PGS on EduYears is ~0.3-0.4 and the to­tal in­di­rect effect (β 1 θ 2 ) is ~0.19-0.31. This im­plies that a one-s­tan­dard­-de­vi­a­tion in­crease in PGS is as­so­ci­ated with ~0.3-0.4 more years of ed­u­ca­tion, keep­ing the me­di­at­ing vari­ables con­stant, and that chang­ing the me­di­at­ing vari­ables to the lev­els they would have at­tained had PGS in­creased by one stan­dard de­vi­a­tion (but keep­ing PGS fixed) in­creases years of ed­u­ca­tion by ~0.19-0.31 years. Last­ly, in both datasets, the par­tial in­di­rect effect (θ 21 β 11 ) of cog­ni­tive per­for­mance is large and very sig­nifi­cant: the es­ti­mates are equal to 0.29 and 0.14-or 42% and 23% of the to­tal effect (γ 1 )-in STR and HRS, re­spec­tive­ly. The re­sults also sug­gest that a one-s­tan­dard de­vi­a­tion in­crease in Open­ness to Ex­pe­ri­ence is as­so­ci­ated with ~0.06 more years of ed­u­ca­tion, and the es­ti­mated par­tial in­di­rect effect for Open­ness to Ex­pe­ri­ence is equal to 0.04-or 7% of the to­tal effect (γ 1 ).

Fine-mapping: "Association mapping of inflammatory bowel disease loci to single variant resolution", Huang et al 2017 https://www.biorxiv.org/content/early/2015/10/20/028688 ; Farh et al 2015; Gong et al 2013; van de Bunt et al 2015, "Evaluating the Performance of Fine-Mapping Strategies at Common Variant GWAS Loci" http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005535 ; "Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes", Mahajan et al 2018: fine-mapping suggests maybe a quarter of hits are causal?

Let's say we have 74 SNP candidates with 2 levels each (on/off) which collectively determine the dependent variable IQ with an effect of +0.2 IQ points or +0.0133 SDs each; we want power=0.80/alpha=0.05 to detect the main effect for each SNP to see if it has a non-zero effect. In factorial ANOVA terms, this would be a design with 2^74 possible combinations of conditions (since we are only interested in main effects and not interactions, I think it could be a partial factorial design where most of those are dropped). As a multiple regression, it'd be something like IQ ~ A + B + C + ... + ZZ. The effect of each variable is quite small, and so adding them all into a linear model will explain only a small amount of variance: (0.2/15)^2 * 74 = 0.0132.

G*Power 3.1 (http://www.gpower.hhu.de/ Win­dows ver­sion un­der Wine), test fam­ily ‘F test’, sta­tis­ti­cal test ‘Lin­ear mul­ti­ple re­gres­sion: Fixed mod­el, R in­crease’, a pri­ori power analy­sis:

f^2 = 0.0132/(1-0.0132) = 0.0133766; n=2711

Suppose we have a budget of 64 edits per embryo, and we want to be able to fill them all up with causal variants; we know that around half will already be the good variant, so we need to have at least 64 * 2 = 128 causal variants. And since we know that each SNP candidate has a ~10% probability of being causal, we need to be testing 10 * 128 = 1,280 SNPs to successfully winnow down to >128 causal SNPs which will yield >64 available edits.
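The winnowing arithmetic above as a small helper function (the 50%-already-good and ~10%-causal figures are the assumptions from the text, not estimates):

```python
import math

def candidates_needed(edit_budget, frac_already_good=0.5, causal_probability=0.1):
    """How many candidate SNPs must be tested to fill an edit budget with causal variants?"""
    ## half of causal variants will already be the good allele, so overshoot:
    causal_needed = math.ceil(edit_budget / (1 - frac_already_good))
    ## only ~10% of candidates are truly causal, so test 10x as many:
    return math.ceil(causal_needed / causal_probability)
```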

We’ll drop the mean effect to 0.1 points and con­sider sam­ple size re­quire­ments test­ing 1280 SNPs. (0.1/15)^2 * 1280 = 0.05688888889

f^2 = 0.05688889/(1-0.05688889) = 0.06032045; n=2954

What does 80% power and 5% alpha mean here? We know that ~10% are causal (our base rate for a true finding is 10%), so there are 128 causal SNPs and 1,152 noncausal SNPs here. There is a 5% false positive rate, so there will be 0.05 * 1152 ≈ 58 false-positive SNPs. There is 80% power, so we will detect 0.8 * 128 ≈ 102 causal SNPs. So we will have 58 + 102 = 160 apparently causal SNPs, but only 102/160 = 64% are actually causal. So we waste 36% of our edits. We need to do better.
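This positive-predictive-value arithmetic generalizes; a small helper (standard power/alpha/base-rate algebra, nothing specific to the SNP data):

```python
def edit_yield(power, alpha, n_causal, n_noncausal):
    """Expected true/false positives, and the fraction of 'hits' that are truly causal."""
    tp = power * n_causal      # causal SNPs detected
    fp = alpha * n_noncausal   # noncausal SNPs falsely flagged
    return tp, fp, tp / (tp + fp)

tp, fp, ppv = edit_yield(power=0.80, alpha=0.05, n_causal=128, n_noncausal=1152)
## ~102 true positives, ~58 false positives: only ~64% of edits would be causal
```

Plugging in power=0.90/alpha=0.01 instead reproduces the ~0.91 yield considered next.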

Let's try with power=0.90 and alpha=0.01:

f^2 = 0.06032045; n=3987

0.9 * 0.1 * 1280 = 115 true positives; 0.01 * (1280 - 128) ≈ 12 false positives; 115/(115+12) ≈ 0.91 of selected SNPs truly causal.

0.1 av­er­age effect might be op­ti­mistic even though effect sizes de­cline slow­ly, so one last try with 0.05 mean effect size over those 1280 SNPs: (0.05/15)^2 * 1280 = 0.0142

f^2 = 0.0142/(1-0.0142) = 0.0144045; n=14197
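The effect sizes fed to G*Power in these three scenarios are Cohen's f², computable directly from each scenario's R² as f² = R²/(1−R²); a quick check of all three:

```python
def f_squared(r2):
    """Cohen's f^2 for a multiple regression explaining a fraction r2 of variance."""
    return r2 / (1 - r2)

f2_74   = f_squared(0.0132)         # 74 SNPs at 0.2 points each
f2_1280 = f_squared(0.05688888889)  # 1280 SNPs at 0.1 points each
f2_weak = f_squared(0.0142)         # 1280 SNPs at 0.05 points each
```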

n <- 1000
alpha <- 0.005
SNP_candidates <- 162
SNP_mean_effect <- 0.2
SNP_mean_frequency <- 0.5
SNP_causal_probability <- 0.12
SNP_effects <- sample(c(SNP_mean_effect, 0), SNP_candidates, replace=TRUE,
                      prob=c(SNP_causal_probability, 1-SNP_causal_probability))
r2 <- 0.01

generateSample <- function(n, SNP_effects, SNP_candidates=162, SNP_mean_frequency=0.5) {
    dm <- matrix(nrow=n, ncol=SNP_candidates+1)
    for (i in 1:n) {
        SNP_draw <- rbinom(SNP_candidates, 1, p=SNP_mean_frequency)
        SNPs_genetic <- sum(ifelse(SNP_draw, SNP_effects, 0))
        IQ <- rnorm(1, mean=100, sd=15) + SNPs_genetic
        dm[i,] <- c(round(IQ), SNP_draw) }
    df <- as.data.frame(dm)
    colnames(df) <- c("IQ", paste("SNP.", as.character(1:SNP_candidates), sep=""))
    return(df) }
df <- generateSample(n, SNP_effects, SNP_candidates, SNP_mean_frequency)

power <- sapply(seq(1, n, by=500), function(i) {
    l <- summary(lm(IQ ~ ., data=df[1:i,]))
    pvalues <- l$coefficients[,4][-1]
    causalPvalues <- na.omit(ifelse(SNP_effects>0, pvalues, NA))
    noncausalPvalues <- na.omit(ifelse(SNP_effects==0, pvalues, NA))

    ## choose the alpha which maximizes the fraction of hits that are truly causal:
    alphas <- seq(from=0.001, to=0.05, by=0.001)
    alphaYields <- sapply(alphas, function(alpha) { sum(causalPvalues<alpha) / sum(pvalues<alpha) })
    alpha <- alphas[which.max(alphaYields)]

    positives <- sum(pvalues<alpha)
    falsePositives <- sum(noncausalPvalues<alpha)
    truePositives <- sum(causalPvalues<alpha)
    truePositiveFraction <- truePositives / length(causalPvalues)
    causalFraction <- truePositives / positives
    return(causalFraction) })

# n <- 1000
ns <- seq(100, 50000, by=400)
recovery <- sapply(ns, function(n) {
  mean(replicate(100, {
    ## bootstrap a new dataset
    d <- df[sample(nrow(df), n, replace=TRUE),]

    b <- BGLR(d$IQ, ETA=list(list(X=d[-1], model="BayesB", probIn=SNP_causal_probability)), R2=0.01, burnIn=0, verbose=FALSE)
    # plot((b$ETA[[1]]$b), SNP_effects!=0)
    estimate <- data.frame(Causal=SNP_effects!=0, Estimate=(b$ETA[[1]]$b))
    ## rank by estimated effect, largest first, and take the top 64 as edit targets:
    estimate <- estimate[order(estimate$Estimate, decreasing=TRUE),]
    truePositives <- sum(estimate$Causal[1:64])
    falseNegatives <- sum(estimate$Causal[-(1:64)])
    ## fraction of the causal SNPs captured by the 64-edit budget:
    causalFraction <- truePositives / (truePositives+falseNegatives)
    causalFraction
    })) })
qplot(ns, recovery) + geom_smooth(method=lm)

## glmnet with the default alpha=1 is the lasso (not ridge), which can zero out noncausal SNPs:
cv <- cv.glmnet(as.matrix(df[-1]), df$IQ)
g <- glmnet(as.matrix(df[-1]), df$IQ, lambda=cv$lambda.min)
causalSNPs <- which(SNP_effects>0)
lassoSNPs <- which(as.matrix(coef(g)[-1,])!=0)
intersect(causalSNPs, lassoSNPs)

## sensitivity (causal SNPs recovered) & precision (selected SNPs that are causal):
length(intersect(causalSNPs, lassoSNPs)) / length(causalSNPs)
length(intersect(causalSNPs, lassoSNPs)) / length(lassoSNPs)

Thompson sampling with a subset/edit budget:

SNP_edit_limit <- 100
SNP_candidates <- 500
verbose <- TRUE
SNP_mean_effect <- 0.2
SNP_mean_frequency <- 0.5
SNP_causal_probability <- 0.12
r2 <- 0.05
iqError <- 0.55 ## 5yo IQ scores correlate r=0.55 with adult
SNP_effects <- sample(c(SNP_mean_effect, 0), SNP_candidates, replace=TRUE,
                      prob=c(SNP_causal_probability, 1-SNP_causal_probability))

measurementError <- function(r, IQ) {
    IQ.true.std <- (IQ-100)/15
    IQ.measured.std <- r*IQ.true.std + rnorm(length(IQ), mean=0, sd=sqrt(1 - r^2))
    IQ.measured <- 100 + 15 * IQ.measured.std
    return(IQ.measured) }
generateExperimentalSample <- function(n, r, SNP_effects, SNP_candidates, SNP_mean_frequency, SNP_draw_generator) {
    dm <- matrix(nrow=n, ncol=SNP_candidates+1+1) ## allocate space for all SNPs, true IQ, and measured IQ
    for (i in 1:n) {
        SNP_draw <- SNP_draw_generator() # which SNPs to toggle
        SNPs <- ifelse(SNP_draw, SNP_effects, 0) ## convert to effects
        SNP_genetic_score <- sum(SNPs, na.rm=TRUE)
        ## generate a N(100,15) - minus however much our candidate genes explain
        IQ <- round(rnorm(1, mean=100, sd= sqrt((15^2) * (1-r2))) + SNP_genetic_score)
        IQ_measured <- round(measurementError(r, IQ))
        dm[i,] <- c(IQ, IQ_measured, SNP_draw) }
    df <- as.data.frame(dm)
    colnames(df) <- c("IQ", "IQ.measured", paste("SNP.", as.character(1:SNP_candidates), sep=""))
    return(df) }
generateRandomSample <- function(n, r=1, SNP_effects, SNP_candidates, SNP_mean_frequency) {
    generateExperimentalSample(n, r, SNP_effects, SNP_candidates, SNP_mean_frequency,
                               function() { rbinom(SNP_candidates, 1, p=SNP_mean_frequency) }) }

d <- generateRandomSample(1000, iqError, SNP_effects, SNP_candidates, SNP_mean_frequency)

maximumDays <- 10000
batch <- 10
lag <- 5*365
jumpForward <- lag*5
## initialize with a seed of n>=2; with less or an empty dataframe, BGLR will crash
d <- generateRandomSample(jumpForward, iqError, SNP_effects, SNP_candidates, SNP_mean_frequency)
d$Date <- 1:jumpForward
# d$Date <- -lag # available immediately
regretLog <- data.frame(N=integer(), N.eff=integer(), Regret=numeric(), Fraction.random=numeric(), Fraction.causals=numeric(), Causal.found=integer(), Causal.notfound=integer())
for (i in jumpForward:maximumDays) {

    ## i=date; day 1, day 2, etc. A datapoint is only available if it was created more than 'lag' days ago; ie if we do 1 a day, then on day 20 with lag 18
    ## we will have 20-18=2 datapoints available. With a 5 year lag, we will go 1825 time-steps before any new data starts becoming available.
    dAvailable <- d[i >= (d$Date+lag),]
    ## quick Bayesian model of our data up to now, using Lasso priors and our SNP setup
    ## BGLR will randomly crash with a gamma/lambda range error every ~5k calls, so catch & retry until it succeeds:
    while(TRUE) {
      fitted <- tryCatch({
        b <- BGLR(dAvailable$IQ.measured,
                  ETA=list(list(X=dAvailable[-c(1, 2, length(colnames(dAvailable)))],
                                model="BL", type="beta", probIn=0.12, counts=162, R2=r2, lambda=202, max=500)),
                  verbose=FALSE)
        means <- b$ETA[[1]]$b; sds <- b$ETA[[1]]$SD.b
        TRUE }, error = function(e) { print(e); FALSE })
      if (fitted) { break } }

    ## Thompson sampling: for Thompson sampling, we sample 1 possible value of each parameter from that parameter's
    ## posterior distribution. Normally, we would sample from the posterior samples returned by JAGS, but BGLR returns
    ## instead a mean+SD, so we sample from each parameter's normal distribution defined by the mean/sd. Then we see if it
    ## was positive (and hence worth editing); if it is positive, then we choose to edit it and toggle it to '1' (since
    ## we set the problem up as 1=increasing). This boolean is then passed into the simulation and a fresh datapoint generated according to
    ## that intervention. In the next loop, this new datapoint will be incorporated into the posterior and so on.
    ## For multiple-edit Thompson sampling where we can only edit _l_ out of _k_ possible SNPs, we sample 1 possible value
    ## as before, then we sort and take the top _l_ out of _k_ actions;
    ## then to decide whether to sample 1 or 0 in each arm, we do a second Thompson sample within each arm.
    ## To save computations while simulating, or to be more realistic, we might have batches: multiple datapoints in between
    ## updates. If batch=1, it is the conventional Thompson sampling (and the most data-efficient), but computationally demanding.
    ## In the delayed-updates setting, 'batch' is how many datapoints get created each day.
    for (j in 1:batch) {
     SNPs_Thompson_sampled <- function () {
         ## select the _l_ arms
         samplesArms <- rnorm(n=SNP_candidates, mean=means, sd=sds)
         cutoff <- sort(samplesArms, decreasing=TRUE)[SNP_edit_limit]
         l <- which(samplesArms>=cutoff)

         ## for each winner, do another Thompson sample, and record the losers as NA/not-played
         choices <- rep(NA, SNP_candidates)
         samplesActions <- rnorm(n=length(l), mean=means[l], sd=sds[l]) > 0
         choices[l] <- samplesActions
         return(choices) }
     newD <- generateExperimentalSample(1, iqError, SNP_effects, SNP_candidates, SNP_mean_frequency, SNPs_Thompson_sampled)
     newD$Date <- i
     d <- rbind(d, newD)
    }

    ## reports, evaluations, logging:
    estimate <- data.frame(Causal=SNP_effects!=0, Effect=SNP_effects, Estimate=(b$ETA[[1]]$b), SD=b$ETA[[1]]$SD.b)
    estimate <- estimate[order(estimate$Estimate, decreasing=TRUE),]
    regretLog[i,]$Causal.found <- sum(estimate$Causal[1:SNP_edit_limit])
    regretLog[i,]$Causal.notfound <- sum(estimate$Causal[-(1:SNP_edit_limit)])
    regretLog[i,]$Fraction.random  <- sum(estimate$Causal!=0) / SNP_candidates  ## base rate: random guessing
    regretLog[i,]$Fraction.causals <- round(regretLog[i,]$Causal.found / (regretLog[i,]$Causal.found+regretLog[i,]$Causal.notfound), digits=2)
    regretLog[i,]$Regret <- (sum(estimate$Effect) - sum(estimate$Effect[1:SNP_edit_limit])) * SNP_mean_frequency
    regretLog[i,]$N <- nrow(d)
    regretLog[i,]$N.eff <- nrow(dAvailable)

    if(verbose) { print(regretLog[i,]) }
}
qplot(regretLog$N, jitter(regretLog$Regret), xlab="Total N (not effective N)", ylab="Regret (expected lost IQ points compared to omniscience)", main=paste("Edits:", SNP_edit_limit, "; Candidates:", SNP_candidates)) + stat_smooth()

b <- BGLR(d$IQ.measured, ETA=list(list(X=d[-c(1, 2, length(colnames(d)))], model="BL", type="beta", probIn=0.12, counts=162, R2=r2, lambda=202, max=500)), verbose=FALSE)
means <- b$ETA[[1]]$b; sds <- b$ETA[[1]]$SD.b; qplot(SNP_effects, b$ETA[[1]]$b, color=b$ETA[[1]]$SD.b, main=paste("Edits:", SNP_edit_limit, "; Candidates:", SNP_candidates))
image(as.matrix(d[-c(1, 2, length(colnames(d)))]), col=heat.colors(2))
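Since the multi-edit scheme above layers several complications (delays, batches, edit limits) on top of basic Thompson sampling, a minimal self-contained Bernoulli-bandit version may help (my own sketch; the three success rates are invented). The Gaussian draws from the BGLR means/SDs above are the same idea with a different posterior:

```r
## Minimal Thompson sampling: 3-armed Bernoulli bandit with conjugate Beta posteriors
set.seed(2016)
trueP <- c(0.2, 0.5, 0.7)                       ## hypothetical success rates of the 3 arms
alpha.post <- rep(1, 3); beta.post <- rep(1, 3) ## Beta(1,1) uniform priors
for (t in 1:1000) {
    arm <- which.max(rbeta(3, alpha.post, beta.post)) ## 1 posterior sample per arm; play the argmax
    reward <- rbinom(1, 1, trueP[arm])
    alpha.post[arm] <- alpha.post[arm] + reward       ## conjugate Beta-Bernoulli update
    beta.post[arm]  <- beta.post[arm] + (1 - reward) }
alpha.post + beta.post - 2                      ## play counts; most plays concentrate on the 0.7 arm
```

As exploration pays off, the posterior samples for the inferior arms rarely exceed those of the best arm, so play (here, choosing which SNP to edit) concentrates on it without ever fixing a hard decision rule.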

  1. Although it is hard to imagine any parent authorizing the creation of never-before-seen mutations in their child, regardless of how compelling a neurobiological mechanistic case is made for it. In comparison, switching SNPs is knowably safe because they all exist in the human population at high frequencies and can be observed to correlate or not with health problems. Animal experiments can help but inherently are unrepresentative of effects in humans, so the first child will still be subject to an unknown and probably large risk.↩︎

  2. Indeed, most of the reported hits have balanced frequencies. I haven’t seen anyone specifically mention why this is so. My best guess is that it’s the same reason we expect to see SNPs with the largest effects detected first: it’s due to statistical power. Imbalanced frequencies reduce power, since the smaller group has larger standard errors; hence, if there are two SNPs with the same effect, but one has a frequency of 10% (with 90% of people having the bad variant) and another has an even 50-50 split, the second will be detected earlier by GWASes. This seems more sensible than expecting some force like balancing selection to be operating on most IQ-related SNPs.↩︎
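
    A back-of-the-envelope power sketch of my own (the sample size, effect size, and normal-approximation shortcut are all assumptions): the genotype variance of a biallelic SNP is 2p(1-p), so the standard error of the estimated per-allele effect, and hence power at genome-wide significance, worsens as the frequency departs from 50%:

    ```r
    ## Approximate power to detect a fixed per-allele effect (in phenotype SDs, variance 1)
    ## at genome-wide significance, as a function of allele frequency p:
    snpPower <- function(p, n, beta, alpha=5e-8) {
        se <- 1 / sqrt(2*p*(1-p)*n)              ## SE of the regression coefficient
        pnorm(abs(beta)/se - qnorm(1 - alpha/2)) }
    snpPower(0.50, n=50000, beta=0.05) ## balanced SNP: high power (~0.99)
    snpPower(0.10, n=50000, beta=0.05) ## same effect, imbalanced: far lower power (~0.24)
    ```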

  3. For , or simply preventing , a major cause of morbidity & increased mortality in the elderly.↩︎

  4. This may seem contrary to the many reported correlations of morbidity & mortality with shiftwork/less self-reported sleep, or less resistance to infection, but those results are always driven by the general population suffering from insomnia, stress, disease, etc, and not by the tiny minority of short-sleepers. The short-sleepers themselves do not report greater vulnerability to infection or early mortality, and I don’t believe any of the bad results focus on short-sleepers.↩︎

  5. Interestingly, this was not even the first application of CRISPR. That would appear to be DuPont in 2007, which began improved breeding of virus-resistant dairy cultures by introducing troublesome viral infections, and selecting for further use only those bacteria with changed CRISPR-related genomes, indicating they had acquired resistance to the virus.↩︎

  6. November 2013: “‘But it is far too early to contemplate using these methods to alter the human germline’…Professor Dagan Wells, an IVF researcher at Oxford University, said that although there is still a long way to go before CRISPR could even be considered for use on IVF embryos”↩︎

  7. Nature: “Jennifer Doudna, a CRISPR pioneer at the University of California, Berkeley, is keeping a list of CRISPR-altered creatures. So far, she has three dozen entries, including disease-causing parasites called trypanosomes and yeasts used to make biofuels.” As of May 2017, somewhere around 20 human trials (most in vivo) were reportedly underway. Note that BGI’s commercially-sold micropigs, BGI’s myostatin/muscly pigs, and muscly sheep & cattle (Proudfoot et al 2015), were done using TALENs, not CRISPR; but reportedly, BGI is redoing the muscly pigs with CRISPR.↩︎

  8. Nature comments on the Liang work: “A Chinese source familiar with developments in the field said that at least four groups in China are pursuing gene editing in human embryos.”↩︎

  9. Callaway 2015: “Similar work is already being carried out in the lab of George Church, a geneticist at Harvard Medical School in Boston. Using a technology known as CRISPR/Cas9 that allows genes to be easily edited, his team claims to have engineered elephant cells that contain the mammoth version of 14 genes potentially involved in cold tolerance - although the team has not yet tested how this affects the elephant cells. Church plans to do these experiments in ‘organoids’ created from elephant iPS cells.”↩︎

  10. Reardon 2016: “To address such concerns, fish geneticist Rex Dunham of Auburn University in Alabama has been using CRISPR to inactivate genes for three reproductive hormones - in this case, in catfish, the most intensively farmed fish in the United States. The changes should leave the fish sterile, so any fish that might escape from a farm, whether genetically modified or not, would stand little chance of polluting natural stocks. ‘If we’re able to achieve 100% sterility, there is no way that they can make a genetic impact,’ Dunham says. Administering hormones would allow the fish to reproduce for breeding purposes. And Dunham says that similar methods could be used in other fish species.”↩︎

  11. Reardon 2016: “CRISPR could also reduce the need for farmers to cull animals, an expensive and arguably inhumane practice. Biotechnologist Alison van Eenennaam at the University of California, Davis, is using the technique to ensure that beef cattle produce only male or male-like offspring, because females produce less meat and are often culled. She copies a Y-chromosome gene that is important for male sexual development onto the X chromosome in sperm. Offspring produced with the sperm would be either normal, XY males, or XX females with male traits such as more muscle.”↩︎

  12. Reardon 2016: “In the egg industry, male chicks from elite egg-laying chicken breeds have no use, and farmers generally cull them within a day of hatching. Tizard and his colleagues are adding a gene for green fluorescent protein to the chickens’ sex chromosomes so that male embryos will glow under ultraviolet light. Egg producers could remove the male eggs before they hatch and potentially use them for vaccine production.”↩︎

  13. Reardon 2016: “Molecular geneticist Scott Fahrenkrug, founder of Recombinetics in Saint Paul, Minnesota, is using gene-editing techniques to transfer the gene that eliminates horns into elite breeds. The company has produced only two polled calves so far - both male - which are being raised at the University of California, Davis, until they are old enough to breed.”↩︎

  14. Reardon 2016: “But until the arrival of CRISPR, virologists lacked the tools to easily alter ferret genes. Xiaoqun Wang and his colleagues at the Chinese Academy of Sciences in Beijing have used CRISPR to tweak genes involved in ferret brain development, and they are now using it to modify the animals’ susceptibility to the flu virus. He says that he will make the model available to infectious-disease researchers.”↩︎

  15. From “Welcome to the CRISPR zoo: Birds and bees are just the beginning for a burgeoning technology”, Reardon 2016: “The group expects to hatch its first generation of chicks with gene modifications later this year as a proof of concept. Doran realizes that it could be some time before regulators would approve gene-edited eggs, and he hopes that his daughter will have grown out of her allergy by then. ‘If not, I’ve got someone ready and waiting to try the first egg,’ he says.”↩︎

  16. Reardon 2016: “Gillis has been studying the genomes of ‘hygienic’ bees, which obsessively clean their hives and remove sick and infested bee larvae. Their colonies are less likely to succumb to mites, fungi and other pathogens than are those of other strains, and Gillis thinks that if he can identify genes associated with the behaviour, he might be able to edit them in other breeds to bolster hive health. But the trait could be difficult to engineer. No hygiene-associated genes have been definitively identified, and the roots of the behaviour may prove complex, says BartJan Fernhout, chairman of Arista Bee Research in Boxmeer, the Netherlands, which studies mite resistance. Moreover, if genes are identified, he says, conventional breeding may be sufficient to confer resistance to new populations, and that might be preferable given the widespread opposition to genetic engineering.”↩︎

  17. Reardon 2016: “Other researchers are making cattle that are resistant to the trypanosome parasites that are responsible for sleeping sickness.”↩︎

  18. Reardon 2016: “BGI is also using CRISPR to alter the size, colour and patterns of koi carp. Koi breeding is an ancient tradition in China, and Jian Wang, director of gene-editing platforms at BGI, says that even good breeders will usually produce only a few of the most beautifully coloured and proportioned, ‘champion quality’ fish out of millions of eggs. CRISPR, she says, will let them precisely control the fish’s patterns, and could also be used to make the fish more suitable for home aquariums rather than the large pools where they are usually kept. Wang says that the company will begin selling koi in 2017 or 2018 and plans to eventually add other types of pet fish to its repertoire.”↩︎

  19. For example, Liang et al 2015’s reported off-target mutation numbers in human embryos were greeted with a comment from George Church that “the researchers did not use the most up-to-date CRISPR/Cas9 methods and that many of the researchers’ problems could have been avoided or lessened if they had.”↩︎

  20. Power calculation using the Rietveld et al 2013 polygenic score:

    power.t.test(delta=0.03, sd=(1-0.025^2), power=0.80)
         Two-sample t test power calculation
                  n = 17421.11975
              delta = 0.03
                 sd = 0.999375
          sig.level = 0.05
              power = 0.8
        alternative = two.sided
    NOTE: n is number in *each* group
    17421 * 2
    # [1] 34842