The Power of Twins: The Scottish Milk Experiment

In discussing a large Scottish public health experiment, Student noted that it would’ve been vastly more efficient using a twin experiment design; I fill in the details with a power analysis.
statistics, R, power-analysis, genetics
2016-01-12–2019-11-29 finished certainty: highly likely importance: 5

Randomized experiments require more subjects the more variable each datapoint is, to overcome the noise which obscures any effects of the intervention. Reducing noise enables better inferences from the same data, or requires less data to be collected; one way to reduce it is to balance observed characteristics between control and experimental datapoints.

A particularly dramatic example of this approach is running experiments on identical twins rather than regular people, because twins vary far less from each other than random people do, due to shared genetics & family environment. In 1931, the great statistician Student (William Sealy Gosset) noted problems with an extremely large (n = 20,000) Scottish experiment in feeding children milk (to see if they grew more in height or weight), and claimed that the experiment could have been done far more cost-effectively, with >95% fewer children, if it had been conducted using twins: 100 identical twins (50 pairs) would have been more accurate than 20,000 children. He did not, however, provide any calculations or data demonstrating this.

I revisit the issue and run a power cal­cu­la­tion on height indi­cat­ing that Stu­den­t’s claims were cor­rect and that the exper­i­ment would have required ~97% fewer chil­dren if run with twins.

This reduction is not unique to the Scottish milk experiment on height/weight, and in general, one can expect a reduction of 89% in experiment sample sizes using twins rather than regular people, demonstrating the benefits of using behavioral genetics in experiment design/power analysis.

Randomized experiments enable causal inference by assuring that the experimental group is identical, on average, in all ways to the control group, so that subsequent differences are caused by the intervention. If a coin is flipped, then each group will have the same fraction of women, the same fraction of atheists, the same fraction of people with latent cancers, etc, on average. But since there are thousands upon thousands of ways in which subjects can differ which might affect the results, and there may only be a few dozen subjects, randomization cannot guarantee exact or even approximate similarity of groups on every single trait, and with small samples can generate grossly imbalanced groups (perhaps you have 2 groups of 10, and one group turns out to have 9 women and the other just 1). This doesn’t bias the results or defeat the point of randomization, but it can add a lot of noise, reducing our statistical power, thereby making our existing studies less meaningful, requiring more expensive studies to estimate effects to a useful level of precision, and blocking profitable decisions.
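
To get a feel for how often simple randomization produces lopsided groups, here is a minimal sketch in R. It assumes a hypothetical pool of 20 subjects (10 women, 10 men) split at random into two groups of 10; the pool composition is an illustrative assumption, not a figure from any study discussed here. The number of women landing in the first group then follows a hypergeometric distribution, available in base R as `dhyper`.

```r
# Hypothetical pool: 10 women + 10 men, randomly split into two groups of 10.
# (Illustrative numbers only; not taken from the Lanarkshire data.)
women      <- 10
men        <- 10
group_size <- 10

# P(first group contains exactly k women), for k = 0..10
p_women <- dhyper(0:10, m = women, n = men, k = group_size)
names(p_women) <- 0:10

# Probability of a 9-vs-1 split or worse, in either direction
p_extreme <- sum(p_women[c("0", "1", "9", "10")])

# Probability of a 7-vs-3 split or worse, in either direction
p_lopsided <- sum(p_women[as.character(c(0:3, 7:10))])

print(round(c(extreme = p_extreme, lopsided = p_lopsided), 4))
```

Under these assumptions, a 9-vs-1 split is rare (about 0.1% of randomizations), but a 7-vs-3 split or worse turns up in roughly 18% of them, which is plenty to add noise to a small study.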

Simple randomization vs blocking

When put this way, one might wonder how to improve the randomization: if you have a group with too many women and too few men, why not apply your coin flip to pairs of women and pairs of men instead? Instead of flipping a fair coin for each patient to pick whether that patient goes into the control or experimental group, why not take a pair of women and flip a coin to decide whether the one on the left goes into the experimental group, with the woman on the right going into the other group? And likewise for the men. Since so many things cluster within families, it would be better to try to balance families too, and ensure that if there are multiple siblings available, one sibling goes into the experiment and the other into the control, rather than risking lopsided allocation. (This turns out to be a serious issue in some animal experiments, if experimenters accidentally randomly allocate whole litters into one arm, which can happen often given the small n that animal research typically uses.)

In fact, why not try to extend this ‘matching up’ to as many things as you can measure? If we use identical twins (like in Gustav III of Sweden’s coffee experiment), we match them on not just nationality, gender, location, parents, age, food consumption or whatnot, we even match them on genetics too, which is part of why identical twins are so eerily identical even when raised apart. (Note that this point isn’t the same as the variance component estimation going on in twin studies or other family designs; we care that they are similar and thus comparisons are less noisy, not about how much of that eerie similarity is due to genetics or other sources.) Indeed, since a person is matched with themselves on just about everything and is even better than an identical twin, why not test the treatment on the same person at different times? But people do change over time and there might be lingering effects, so perhaps it would be even better if you can test it on the same person simultaneously: for example, test acne cream A on one side of each subject’s face and acne cream B on the other, flipping a coin to decide whether A/B or B/A.
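
The pairing idea can be sketched in a few lines of R. This is only an illustration of within-pair randomization, not a procedure taken from any of the cited papers; `assign_within_pairs` is a hypothetical helper, and the 50 pairs and the seed are arbitrary choices.

```r
# Paired (blocked) randomization: one coin flip per pair decides which member
# is treated; the other member automatically becomes the control.
# `assign_within_pairs` is a hypothetical helper written for this sketch.
assign_within_pairs <- function(n_pairs) {
  first_is_treated <- rbinom(n_pairs, size = 1, prob = 0.5) == 1
  data.frame(
    pair   = rep(seq_len(n_pairs), each = 2),
    member = rep(c("A", "B"), times = n_pairs),
    # rbind + as.vector interleaves the two members of each pair
    arm    = as.vector(rbind(
      ifelse(first_is_treated, "treatment", "control"),
      ifelse(first_is_treated, "control", "treatment")
    )),
    stringsAsFactors = FALSE
  )
}

set.seed(1931)  # arbitrary seed, only for reproducibility
allocation <- assign_within_pairs(50)

# Every pair contributes exactly one subject to each arm, so the arms are
# balanced on everything the pair shares.
print(table(allocation$arm))
```

Because each pair is split across the two arms, anything the pair has in common (sex, family, genetics) cannot end up lopsided between treatment and control.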

This is the concept of blocking, one of whose advocates was Student, the great early statistician. While Student is best known for his work on exact tests and experiment design, he had an interest in genetics (perhaps due to his friendships & professional work on evaluating plant breeding results), making an important contribution in Student 1933; in his comments on the Lanarkshire Milk Experiment, he would unite both interests, pointing out the usefulness of genetic considerations for optimal experimental design in humans as well as agriculture.

Efficiency of blocking for the Lanarkshire Milk Experiment

The history of blocking and its advantages compared to simple randomized experiments are discussed at length in “W.S. Gosset and Some Neglected Concepts in Experimental Statistics: Guinnessometrics II”, Ziliak 2011, and “Balanced versus Randomized Field Experiments in Economics: Why W. S. Gosset aka ‘Student’ Matters”, Ziliak 2014. A particularly early example of blocking for demonstrating benefits comes from the farm of noted systematic breeder Robert Bakewell (1725–1795), who experimented with many methods of improvement, particularly the benefits of thorough irrigation of his fields, proving it to his many domestic & international visitors with a quasi-random blocking (Observations on live stock: containing hints for choosing and improving the best breeds of the most useful kinds of domestic animals, George Culley & Robert Heaton 1804, pg75):

At Dish­ley, Mr Bakewell has improved a con­sid­er­able tract of poor cold land, beyond any­thing I ever saw, or could have con­ceived, by this same mode of improve­men­t;—and, ever ready to com­mu­ni­cate his knowl­edge to the pub­lic, he has left proof pieces in differ­ent parts of his mead­ows, in order to con­vince peo­ple of the great impor­tance and util­ity of this kind of improve­men­t:—­par­tic­u­lar­ly, in one part he has been at pains to divide a rood of ground into twenty equal divi­sions, viz. two perches in each piece. It is so con­trived that they can water the first, and leave the sec­ond unwa­tered; or miss the first, and water the sec­ond; and so on through all the 20 divi­sions; by which con­trivance, you have the fairest and most unequiv­o­cal proofs of the good effects of improv­ing ground by water­ing.

Hous­man 1894, pg7–9 men­tions the block­ing was used for fer­til­iza­tion as well:

Young says that his irri­ga­tion is “among the rarest instances of spir­ited hus­bandry”, much exceed­ing any­thing of the kind he had seen before, even in the hands of land­lord­s…He did not hastily either adopt or extend his sys­tem of irri­ga­tion, but felt his way as he advanced, try­ing var­i­ous exper­i­ments to sat­isfy him­self of the effi­cacy and econ­omy of the sys­tem before incur­ring fur­ther expense. Side by side he had plots of land: two plots, one watered, the other not watered; two again, one watered, the other manured; and again two, one watered from a spring, the other from the stream; so that he could form his esti­mates of the com­par­a­tive value of irri­ga­tion as against other fer­til­is­ing agen­cies, and of differ­ent modes of irri­ga­tion…We have seen in the instance of his irri­gated land how Mr. Bakewell tested the worth of his notions by fre­quent and var­ied exper­i­ment. He did the same in every depart­ment of the farm. This was the grand source of his pow­er. He did not try to make facts square with his opin­ions, but his opin­ions with facts.

Another early his­tor­i­cal exam­ple is the pre­vi­ous­ly-men­tioned Gus­tav III of Swe­den’s coffee exper­i­ment some­time in the late 1700s, which is pos­si­bly the first twin exper­i­ment.

Ziliak writes (pg22) about an interesting example where Student claims that a well-blocked sample of 100 could be better than 20,000 with simple randomization:

The intuition behind the higher power of ABBA and other balanced designs to detect a large and real treatment difference was given by Student in 1911. “Now if we are comparing two varieties it is clearly of advantage to arrange the plots in such a way that the yields of both varieties shall be affected as far as possible by the same causes to as nearly as possible an equal extent”. He used this “principle of maximum contiguity” often, for example when he illustrated the higher precision and lower costs that would be associated with a small-sample study of biological twins, to determine the growth trajectory of children fed with pasteurized milk, unpasteurized milk, and no milk at all, in “The Lanarkshire Milk Experiment” (Student, 1931a). Student (1931a, p. 405) estimated that “50 pairs of [identical twins] would give more reliable results than the 20,000” child sample, neither balanced nor random, actually studied in the experiment funded by the Scotland Department of Health. “[I]t would be possible to obtain much greater certainty” in the measured difference of growth in height and weight of children drinking raw versus pasteurized milk “at an expenditure of perhaps 1–2% of the money and less than 5% of the trouble.” Likewise, Karlan and List (2007, p. 1777) could have revealed more about the economics of charitable giving, for less, using a variant of Student’s method. Instead the AER article studied n = 50,083 primarily white, male, pro-Al Gore donors to public radio, neither random nor balanced.

I thought this was a great example, and topical inasmuch as raw milk remains contentious: gourmands prize it over pasteurized milk (leading to occasional illnesses or deaths and subsequent legal/political maneuvering to ban/preserve access to raw milk). I was curious about the details of how Student computed that the surprisingly low number of 100 twins (50 pairs) would suffice, so I looked for Student’s original paper.

Stu­dent begins by sum­ma­riz­ing the Lanark­shire exper­i­ment in a lit­tle more detail:

…In the spring of 1930, a nutri­tional exper­i­ment on a very large scale was car­ried out in the schools of Lanark­shire. For four months 10,000 school chil­dren received 0.75 pint of milk per day, 5000 of these got raw milk and 5,000 pas­teur­ized milk, in both cases Grade A (Tu­ber­culin tested1); another 10,000 chil­dren were selected as con­trols and the whole 20,000 chil­dren were weighed and their height was mea­sured at the begin­ning and end of the exper­i­men­t…The 20,000 chil­dren were cho­sen in 67 schools, not more than 400 nor less than 200 being cho­sen in any one school, and of these half were assigned as “feed­ers” and half as “con­trols”, some schools were pro­vided with raw milk and the oth­ers with pas­teur­ized milk, no school get­ting both­…Sec­ond­ly, the selec­tion of the chil­dren was left to the Head Teacher of the school and was made on the prin­ci­ple that both “con­trols” and “feed­ers” should be rep­re­sen­ta­tive of the aver­age chil­dren between 5 and 12 years of age: the actual method of selec­tion being impor­tant I quote from Drs Leighton and McKin­lay’s [1930] Report: “The teach­ers selected the two classes of pupils, those get­ting milk and those act­ing as ‘con­trols’, in two differ­ent ways. In cer­tain cases they selected them by bal­lot and in oth­ers on an alpha­bet­i­cal sys­tem.”

Unfor­tu­nate­ly, while ambi­tious, the ran­dom­iza­tion was heav­ily com­pro­mised (dis­cussed fur­ther in Kadane & Sei­den­feld 1996):

…after invok­ing the god­dess of chance they unfor­tu­nately wavered in their adher­ence to her for we read: “In any par­tic­u­lar school where there was any group to which these meth­ods had given an undue pro­por­tion of well fed or ill nour­ished chil­dren, oth­ers were sub­sti­tuted in order to obtain a more level selec­tion.” This is just the sort of after­thought that most of us have now and again and which is apt to spoil the best laid plans. In this case it was a fatal mis­take, for in con­se­quence the con­trols were, as pointed out in the Report, defi­nitely supe­rior both in weight and height to the “feed­ers” by an amount equiv­a­lent to about 3 months’ growth in weight and 4 months’ growth in height. Pre­sum­ably this dis­crim­i­na­tion in height and weight was not made delib­er­ate­ly, but it would seem prob­a­ble that the teach­ers, swayed by the very human feel­ing that the poorer chil­dren needed the milk more than the com­par­a­tively well to do, must have uncon­sciously made too large a sub­sti­tu­tion of the ill-nour­ished among the “feed­ers” and too few among the “con­trols” and that this uncon­scious selec­tion affect­ed, sec­on­dar­i­ly, both mea­sure­ments.

Student then observes that besides producing a baseline imbalance (which is clearly visible in the graphs on pg400 & pg402, where the controls are taller in every time period, although this advantage lessens with time), this favoring of the poor could produce a systematic bias in the recorded weights during winter, which were made with the children’s clothes on, as it is entirely possible that poorer children will have lighter or fewer heavy winter clothes. Another issue is allocating entire schools to using either pasteurized or raw milk as their comparison to the no-milk controls (leading to problems in accounting for the hierarchical nature of the data due to confounding of school-level effects with the pasteurized/raw effect). These three issues are reflected in anomalies in the data, reducing our confidence in the results despite it having been a randomized experiment (rather than something lamer like a survey noting that children who could afford milk were taller). R.A. Fisher likewise noted (Fisher & Bartlett 1931) that, taken at face value, the experiment produced the opposite of the expected result: raw milk being superior to pasteurized milk, despite the raw nutrient value of fat/protein/etc presumably being identical between milks and the greater safety of pasteurization, and that if anything, it implied that children should not be fed pasteurized milk.

Hav­ing per­formed the post-mortem on the Lanark­shire Milk Exper­i­ment, Stu­dent then pro­poses improve­ments to the design, cul­mi­nat­ing in the most rad­i­cal (pg405):

(2) If it be agreed that milk is an advan­ta­geous addi­tion to chil­dren’s diet—and I doubt whether any one will com­bat that view—and that the differ­ence between raw and pas­teur­ized milk is the mat­ter to be inves­ti­gat­ed, it would be pos­si­ble to obtain much greater cer­tainty at an expen­di­ture of per­haps 1–2% of the money [This is a seri­ous con­sid­er­a­tion: the Lanark­shire exper­i­ment cost about £7500. [Which is ~£374,800 in 2016 pounds ster­ling or ~$541100.]] and less than 5% of the trou­ble.

For among 20,000 chil­dren there will be numer­ous pairs of twins; exactly how many it is not easy to say owing to the differ­en­tial death rate, but, since there is about one pair of twins in 90 births, one might hope to get at least 160 pairs in 20,000 chil­dren. But as a mat­ter of fact the 20,000 chil­dren were not all the Lanark­shire schools pop­u­la­tion, and I feel pretty cer­tain that some 200–300 pairs of twins would be avail­able for the pur­pose of the exper­i­ment. Of 200 pairs some 50 would be “iden­ti­cals” and of course of the same sex, while half the remain­der would be non-i­den­ti­cal twins of the same sex.

Now iden­ti­cal twins are prob­a­bly bet­ter exper­i­men­tal mate­r­ial than is avail­able for feed­ing exper­i­ments car­ried out on any other mam­mals, and the error of the com­par­i­son between them may be relied upon to be so small that 50 pairs of these would give more reli­able results than the 20,000 with which we have been deal­ing. The pro­posal is then to exper­i­ment on all pairs of twins of the same sex avail­able, not­ing whether each pair is so sim­i­lar that they are prob­a­bly “iden­ti­cals” or whether they are dis­sim­i­lar.

“Feed” one of each pair on raw and the other on pas­teurised milk, decid­ing in each case which is to take raw milk by the toss of a coin. Take weekly mea­sure­ments and weigh with­out clothes.

Some way of dis­tin­guish­ing the chil­dren from each other is nec­es­sary or the mis­chie­vous ones will play tricks. The obvi­ous method is to take fin­ger-prints, but as this is iden­ti­fied with crime in some peo­ple’s minds, it may be nec­es­sary to make a differ­ent indeli­ble mark on a fin­ger­nail of each, which will grow off after the exper­i­ment is over. With such com­par­a­tively small num­bers fur­ther infor­ma­tion about the dietetic habits and social posi­tion of the chil­dren could be col­lected and would doubt­less prove invalu­able.

The com­par­a­tive vari­a­tion in the effect in “iden­ti­cal” twins and in “unlike” twins should fur­nish use­ful infor­ma­tion on the rel­a­tive impor­tance of “Nature and Nur­ture”. …[The twin exper­i­ment] is likely to pro­vide a much more accu­rate deter­mi­na­tion of the point at issue, owing to the pos­si­bil­ity of bal­anc­ing both nature and nur­ture in the mate­r­ial of the exper­i­ment.

This is a rea­son­able sug­ges­tion, but I was dis­ap­pointed to see that Stu­dent gives no cal­cu­la­tion or ref­er­ence to another work with a sim­i­lar cal­cu­la­tion so it’s unclear if Stu­dent cal­cu­lated out an exact answer and is not giv­ing the details for rea­sons of flow or space, or was giv­ing an off-the-cuff guess based on long expe­ri­ence with power analy­sis from his past sta­tis­ti­cal research & his brew­ery job. (If Zil­iak was going to cite it as an exam­ple, it would’ve been bet­ter if he had ver­i­fied that Stu­den­t’s spec­u­la­tion was right at least to within an order of mag­ni­tude rather than sim­ply quot­ing him as an author­i­ty.)

Power estimate of twins vs general population

The cal­cu­la­tion does­n’t seem hard. For this exam­ple, using height, it’d go some­thing like:

  1. take the estimated height gain in inches
  2. find the distribution of height differences for twins, and the distribution of height for the general population of Scottish kids of those ages
  3. convert the inch gain into standard deviations for twins and for the general population
  4. plug those two effect sizes into a power calculator asking for, say, 80% power
  5. compare how many twin pairs n1 you need against how many general-population pairs n2, and report the fraction n1/n2 and how close it is to 5%.
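
As a rough guide to what steps 3–5 should yield, the usual normal-approximation formula for a two-sided, two-sample comparison can be written out. This is a textbook approximation, not anything Student or Elderton state; here $\Delta$ is the raw gain in inches, $\sigma$ the relevant standard deviation, $\alpha$ the significance level, and $1-\beta$ the desired power:

```latex
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2},
\qquad d = \frac{\Delta}{\sigma}
\quad\Longrightarrow\quad
\frac{n_1}{n_2} \approx \left(\frac{\sigma_{\text{twin}}}{\sigma_{\text{general}}}\right)^2
```

Because the same $\Delta$ appears in both designs, it cancels, and the sample-size ratio depends only on the squared ratio of the two standard deviations.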

Milk’s effect on male height

To start, on pg403, a table reports “Gain in height in inches by Feed­ers over Con­trols”, for which the largest effect in boys is the 5–7yo boys group at +0.083(0.011) inch­es. So we are tar­get­ing an aver­age increase of a tenth of an inch.

What is the height vari­abil­ity or stan­dard devi­a­tions for twins and Lanark­shire chil­dren of a sim­i­lar age? The fol­lowup paper “The Lanark­shire Milk Exper­i­ment”, Elder­ton 1933 help­fully reports stan­dard devi­a­tions both for Lanark­shire chil­dren and cites some stan­dard devi­a­tion for twins’ heights at that time (pg2):

Dr Stocks in his study of twins [Percy Stocks 1930, assisted by Mary N. Karn: “A Biometric Investigation of Twins and their Brothers and Sisters”, Annals of Eugenics, Vol. v, pp. 46–50. Francis Galton Laboratory for National Eugenics.] found differences in weight as great as 2.8 hectograms (10 ounces) in those twins he regarded as monozygotic whose ages corresponded to the children in the milk experiment. The standard deviation of weight in pounds is roughly twice that of the standard deviation of height in inches, so that if 8 ounces difference in initial weight be permitted, 1⁄4 inch difference in height could be allowed. Judging also by Dr Stocks’ material in which monozygotic twins showed a modal difference of 1 cm in height it would have been justifiable to allow children to be paired who differed by two-eighths of an inch, but the labour of pairing would have been much heavier if a greater variation than that entered on the cards had been allowed for height as well as for weight…In Table I the standard deviations and coefficients of variation of the initial height and weight for each year of birth are given, and if these be compared with those for Glasgow boys and girls, it will be seen that they are distinctly less. The Glasgow figures were obtained by linear interpolation and are given in brackets after those for the selected Lanarkshire data.

Elderton’s Table 1 reports on the Lanarkshire children’s grouped data by “6 years 9 months”, “7 years 9 months”, and then “8 years 9 months” & higher; the first two presumably map best onto our 5–7yo group, and have values of n = 382 with standard deviation 1.483 (Glasgow: 2.58), and n = 337 with standard deviation 1.648 (Glasgow: 2.82); converting the SDs to variances & pooling them, we get a standard deviation of ~1.5625 for the general population.
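
The pooling step can be sketched as follows. The degrees-of-freedom weighting is an assumption on my part, chosen because it reproduces the value 1.56249311 used in the power analysis; the group sizes and SDs are the ones quoted from Elderton’s Table 1.

```r
# Lanarkshire initial heights from Elderton's Table 1 (inches)
n   <- c(382, 337)       # group sizes: "6 years 9 months", "7 years 9 months"
sds <- c(1.483, 1.648)   # corresponding standard deviations

# Pooled SD: variances weighted by degrees of freedom (n - 1)
pooled_sd <- sqrt(sum((n - 1) * sds^2) / (sum(n) - length(n)))
print(pooled_sd)
```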

Elderton’s summary of Stocks’s twin research is confusingly worded (partly because Stocks worked in centimeters and Elderton in inches), but she appears to be saying that the standard deviation of differences in Stocks’s twins is 0.25 inches, which compared with 1.483 is much smaller, at around one-sixth. Table II (pg11) in Stocks 1930 records variability of identical twins vs fraternal twins vs their non-twin siblings in “mean corrected Deviates”, mentioning that the root-mean-squared difference in the table is the standard deviation, so the σ0 of height for identical twins is 0.9497cm while that of the general population is 6.01cm (pg13); 0.9497/6.01 comes out to ~0.16, again about one-sixth, which is consistent with Elderton’s specific estimate of ~0.25 inches as the standard deviation for identical twins.

So the claimed effect is +0.083 inch­es, which rep­re­sents d = 0.05 (for the gen­eral pop­u­la­tion) and d = 0.332 (twin differ­ences).

Power analysis

We then ask how much data is required to con­duct a well-pow­ered exper­i­ment to detect the exis­tence of such an effect with a stan­dard t-test:

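# Effect sizes (Cohen's d): claimed gain in inches / standard deviation in inches
generalD <- 0.083 / 1.56249311   # general population (pooled Lanarkshire SD)
twinD    <- 0.083 / 0.25         # identical twins (Elderton's SD of twin differences)
# Sample size per group for 80% power, two-sided two-sample t-test at alpha = 0.05
generalP <- power.t.test(delta=generalD, power=0.8); generalP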
#      Two-sample t test power calculation
#               n = 5564.07129
#           delta = 0.0531202342
#              sd = 1
#       sig.level = 0.05
#           power = 0.8
#     alternative = two.sided
# NOTE: n is number in *each* group
twinP <- power.t.test(delta=twinD, power=0.8); twinP
#      Two-sample t test power calculation
#               n = 143.383601
#           delta = 0.332
#              sd = 1
#       sig.level = 0.05
#           power = 0.8
#     alternative = two.sided
# NOTE: n is number in *each* group
twinP$n / generalP$n
# [1] 0.0257695478

So with 143 twin-pairs or n = 286, we can match the power of a sam­ple drawn from the gen­eral pop­u­la­tion with 5564 in each group or n = 11128—sav­ings of ~97% of the sam­ple. (To put that in per­spec­tive, if costs scaled exactly per head and twins did­n’t entail any extra expenses2, then that esti­mated cost of ~$541,100 would have instead been $13,982, for a sav­ings of $527,117.) Stu­den­t’s guess of “1–2%” proves to be on the mon­ey, and the exper­i­ment is also fea­si­ble as 143 twin-pairs is well below the num­ber of twin-pairs that Stu­dent esti­mated to be avail­able (>160, and more likely “200–300”; 300 twin-pairs would yield a power of ~97%, exceed­ing the Lanark­shire 20,000’s <96% pow­er).

We can safely say that Stu­den­t’s Scot­tish milk exper­i­ment exam­ple does indeed demon­strate the power of twins.

All traits

We can go further and estimate the power of twins in general for experimentation. While height is somewhat unusually heritable, we can say with confidence that almost all traits studied are highly heritable, based on the previously mentioned mega-meta-analysis which compiles data on 17,804 traits estimated from ~14.5m pairs. The upshot is that averaging over all those traits, ~49% of variance is due to heritability and ~17% to shared-environment3; implying that compared to a sample drawn from the general population (who share neither genetics nor the shared-environment upbringing), identical twins will have ~34% (1 − (0.488 + 0.174)) of the variability.

For eas­ier com­par­i­son with the run­ning exam­ple, we can redo the height cal­cu­la­tion but assum­ing we are look­ing at a generic trait with a higher vari­abil­i­ty:

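# Effect sizes: same +0.083 inch gain, but the twin SD is now 33.8% of the general-population SD
twinAllD    <- 0.083 / (0.338*1.56249311)
generalAllD <- 0.083 / 1.56249311

generalAll <- power.t.test(delta=generalAllD, power=0.8)
twinAll    <- power.t.test(delta=twinAllD,    power=0.8)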

twinAll$n / generalAll$n
# [1] 0.114397136

In gen­er­al, an exper­i­ment run using twins will require a sam­ple 11% the size of the same exper­i­ment run using the gen­eral pop­u­la­tion.
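
As a quick sanity check (a sketch, not part of the original analysis): since required sample size scales roughly with the inverse square of the effect size, shrinking the standard deviation to 0.338 of its original value should shrink the sample to about 0.338² of the original. The constants below are copied from the calculation above; `ratio_exact` recomputes the ratio with `power.t.test`.

```r
sd_ratio  <- 0.338                  # twin SD as a fraction of the general SD
general_d <- 0.083 / 1.56249311     # effect size, general population
twin_d    <- general_d / sd_ratio   # effect size, generic twin trait

# Back-of-the-envelope ratio vs. the ratio from the full power calculation
ratio_approx <- sd_ratio^2
ratio_exact  <- power.t.test(delta = twin_d, power = 0.8)$n /
                power.t.test(delta = general_d, power = 0.8)$n

print(c(approx = ratio_approx, exact = ratio_exact))
```

The two figures agree to about three decimal places, so the 11% is almost entirely a consequence of the assumed 0.338 ratio rather than of any detail of the t-test.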

While it may seem like an esoteric statistical point, this is not a negligible savings, and demonstrates the value of considering both experiment design and behavioral genetics rather than ignoring them. Many twins are available via twin registries, and even if twins are infeasible, the point carries through to other populations that one could recruit: siblings, or close relatives. If nothing else, one could use the increasing prevalence of cheap genotyping (<$50) to make subject pairs as similar as possible on all covariates (including overall genetic similarity / relatedness or polygenic scores for key traits).
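
When twins are not available, the same logic can be approximated by pairing subjects on a measured covariate before randomizing. A minimal sketch, assuming a single hypothetical baseline score per subject (for example a polygenic score); the data here are simulated and `pair_by_score` is an illustrative helper, not an established function. It sorts subjects by the score, pairs neighbours, and flips a coin within each pair.

```r
# Pair subjects with adjacent baseline scores, then randomize within pairs.
# Assumes an even number of subjects and one numeric covariate.
pair_by_score <- function(score) {
  stopifnot(length(score) %% 2 == 0)
  n_pairs <- length(score) / 2
  ord  <- order(score)                       # subject indices sorted by score
  pair <- integer(length(score))
  pair[ord] <- rep(seq_len(n_pairs), each = 2)
  # One coin flip per pair: which of the two sorted members gets treated
  flip <- rbinom(n_pairs, size = 1, prob = 0.5)
  treated <- logical(length(score))
  treated[ord] <- as.vector(rbind(flip == 1, flip == 0))
  data.frame(id = seq_along(score), score = score, pair = pair, treated = treated)
}

set.seed(2016)           # arbitrary seed for reproducibility
scores  <- rnorm(40)     # simulated baseline covariate (hypothetical data)
matched <- pair_by_score(scores)

# Mean baseline score in each arm; pairing keeps these close together
print(tapply(matched$score, matched$treated, mean))
```

With more than one covariate, the sort would have to be replaced by some distance-based matching, which this sketch does not attempt.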

Further reading

  1. “milk from herd that has been attested free from tuberculosis.” Tuberculosis infection from milk was a serious problem in the UK in this period; as J.B.S. Haldane remarks in his discussion of “Vitamins” (Possible Worlds And Other Essays 1927): “I would sooner have my child run the risk of rickets or infantile scurvy from over-boiled milk than of tuberculosis from drinking it raw. I refer here to British milk; American is less tuberculous.” He later elaborates in the chapter “The Fight With Tuberculosis”:

    Tuber­cu­lo­sis does not stand first on the list of causes of death in Eng­land, but it is the most seri­ous, because it kills in infancy and prime of life…one quar­ter of the deaths of French chil­dren in their first two years are due to tuber­cle…the great­est sin­gle chan­nel of infec­tion is milk from tuber­cu­lous cows drunk in infancy or early child­hood. But the vast major­ity even of well-to-do par­ents do not take the trou­ble to obtain Grade A or Grade A cer­ti­fied milk for their younger chil­dren. In many places it is, of course, not avail­able, but it would be if an eco­nomic demand for it exist­ed. And with no pub­lic opin­ion behind it in the mat­ter the Gov­ern­ment can­not be expected to leg­is­late dras­ti­cally in favour of pure milk. If sci­ence has not dis­cov­ered a cure or an infal­li­ble pre­ven­tive for tuber­cu­lo­sis, it has at least shown how the mor­tal­ity could be greatly low­ered. For the price of a cigar or a cin­ema a week you can pro­tect your child against its most dan­ger­ous ene­my. Is it worth while?

    Thank­ful­ly, this is not a prob­lem par­ents need con­cern them­selves with any more. (Tu­ber­cu­lo­sis is devel­op­ing mul­ti­-drug resis­tance and so cows might again suffer reg­u­lar infec­tion, but test­ing and genetic engi­neer­ing will keep bovine tuber­cu­lo­sis at bay.)↩︎

  2. Because twin reg­istries already exist and reg­u­larly recruit twins for stud­ies, it’s pos­si­ble that twins might cost less to exper­i­ment on in a con­tem­po­rary set­ting (although I don’t believe twin reg­istries had been set up in Scot­land at this point).↩︎

  3. Note how much bet­ter iden­ti­cal twins are than sib­lings; sib­ling designs are valu­able when we are hypoth­e­siz­ing about some­thing in the shared-en­vi­ron­ment, and can be used for many pur­poses like show­ing that GWAS hits are not con­founded by pop­u­la­tion strat­i­fi­ca­tion or that the harm from things like mater­nal smok­ing is over­es­ti­mated by analy­ses ignor­ing genetic con­found­ing, but since sib­lings are not all that sim­i­lar, the gain is not nearly as large as with iden­ti­cal twins (which are far more sim­i­lar than DZ twins or sib­lings). For some dis­cus­sion of this topic from epi­demi­o­log­i­cal per­spec­tives, see & Smith 2011.↩︎