The Power of Twins: The Scottish Milk Experiment

In discussing a large Scottish public health experiment, Student noted that it would’ve been vastly more efficient using a twin experiment design; I fill in the details with a power analysis.
statistics, R, power-analysis, genetics
2016-01-12–2019-11-29; finished; certainty: highly likely; importance: 5


Randomized experiments require more subjects the more variable each datapoint is, in order to overcome the noise which obscures any effects of the intervention. Reducing noise enables better inferences from the same data, or lets less data be collected; one way to reduce it is to balance observed characteristics between control and experimental datapoints.

A particularly dramatic example of this approach is running experiments on identical twins rather than regular people, because twins vary far less from each other than random people do, due to shared genetics & family environment. In 1931, the great statistician Student (William Sealy Gosset) noted problems with an extremely large (n = 20,000) Scottish experiment in feeding children milk (to see if they grew more in height or weight), and claimed that it could have been done far more cost-effectively, with >95% fewer children, if it had been conducted using twins: 100 identical twins would have been more accurate than 20,000 children. He did not, however, provide any calculations or data demonstrating this.

I re­visit the is­sue and run a power cal­cu­la­tion on height in­di­cat­ing that Stu­den­t’s claims were cor­rect and that the ex­per­i­ment would have re­quired ~97% fewer chil­dren if run with twins.

This reduction is not unique to the Scottish milk experiment on height/weight, and in general, one can expect a reduction of 89% in experiment sample sizes using twins rather than regular people, demonstrating the benefits of using behavioral genetics in experiment design.

Randomized experiments enable causal inference by assuring that the experimental group is identical, on average, in all ways to the control group, so that subsequent differences are caused by the intervention. If a coin is flipped, then each group will have the same fraction of women, the same fraction of atheists, the same fraction of people with a particular on , the same fraction of people with latent cancers, etc., on average. But since there are thousands upon thousands of ways in which subjects can differ which might affect the results (just consider how ), and there may only be a few dozen subjects, randomization cannot guarantee exact or even approximate similarity of groups on every single trait, and with small samples it can generate grossly imbalanced groups (perhaps you have 2 groups of 10, and one group turns out to have 9 women and the other just 1). This doesn’t bias the results or defeat the point of randomization, but it can add a lot of noise, reducing our statistical power, thereby making our existing studies less meaningful, requiring more expensive studies to estimate effects to a useful level of precision, and blocking profitable decisions.
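How often chance alone produces such a lopsided split can be worked out directly. The following is a minimal sketch under an assumed setup that the text does not specify: 20 subjects, exactly 10 of them women, split at random into two groups of 10, so that the number of women in the first group is hypergeometric.

```r
# Hypothetical setup (not from the text): 20 subjects, 10 women and 10 men,
# randomly split into two groups of 10.
n_women <- 10; n_men <- 10; group_size <- 10

# The number of women landing in the first group follows a hypergeometric distribution.
women_in_first <- 0:group_size
prob <- dhyper(women_in_first, m = n_women, n = n_men, k = group_size)

# Chance of a split at least as lopsided as 9 vs 1 (in either direction).
p_lopsided <- sum(prob[women_in_first <= 1 | women_in_first >= 9])

# Chance of a milder but still noticeable imbalance: 7 vs 3 or worse.
p_mild <- sum(prob[women_in_first <= 3 | women_in_first >= 7])

print(c(lopsided = p_lopsided, mild = p_mild))
```

Under these assumptions the extreme 9-vs-1 split is rare (about 0.1% of allocations), but a 7-vs-3 or worse split turns up in roughly 18% of them, and that is for only one of the many traits on which groups can differ.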

Simple randomization vs blocking

When put this way, one might wonder how to improve the randomization: if you have a group with too many women and too few men, why not change your coin flip to instead randomize pairs of women and men? Instead of flipping a fair coin for each patient to pick whether that patient goes into the control or experimental group, why not instead take a pair of women and flip a coin to decide whether the one on the left goes into the experimental group, and then the woman on the right goes into the other group? And likewise for the men. Since so many things cluster within families, it would be better to try to balance families too, and ensure that if there are multiple siblings available, one sibling goes into the experiment and the other into the control, rather than risking lopsided allocation. (This turns out to be a serious issue in some animal experiments, if experimenters accidentally randomly allocate whole litters into one arm, which can happen often given the small n that animal research typically uses.) In fact, why not try to extend this ‘matching up’ to as many things as you can measure? If we use identical twins (like in Gustav III of Sweden’s coffee experiment), we match them on not just nationality, gender, location, parents, age, food consumption or whatnot, we even match them on genetics too, which is part of why identical twins are so eerily identical even when raised apart. (Note that this point isn’t the same as the variance component estimating going on in twin studies or other family designs; we care that they are similar and thus comparisons are less noisy, not the question of how much of that eerie similarity is due to genetics or other sources.) Indeed, since a person is matched with themselves on just about everything and is even better than an identical twin, why not test the treatment on the same person at different times? But people do change over time and there might be lingering effects, so perhaps it would be even better if you can test it on the same person simultaneously: for example, test an acne cream A on one side of the face and acne cream B on the other half of each subject’s face, flipping a coin to decide whether A/B or B/A.
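The gain from matching can be illustrated with a small simulation. This is only a sketch with made-up numbers (the effect size, the two standard deviations, and the 50 pairs are all assumptions chosen for illustration): each subject’s outcome is a component shared within a pair plus individual noise, and we compare how much the estimated treatment effect varies under simple randomization of unrelated individuals versus a coin flip within matched pairs.

```r
set.seed(1)
# All values below are illustrative assumptions, not estimates from any study.
n_pairs    <- 50    # matched pairs (or 100 unrelated individuals)
effect     <- 0.5   # true treatment effect
sd_between <- 1.0   # variation shared within a pair (family, genetics, ...)
sd_within  <- 0.3   # variation unique to each individual

# Simple randomization: 100 unrelated people, half assigned to treatment.
simple_estimate <- function() {
  y <- rnorm(2 * n_pairs, sd = sd_between) + rnorm(2 * n_pairs, sd = sd_within)
  treated <- sample(rep(c(TRUE, FALSE), n_pairs))
  y[treated] <- y[treated] + effect
  mean(y[treated]) - mean(y[!treated])
}

# Blocking: within each matched pair, a coin flip decides who is treated.
blocked_estimate <- function() {
  shared <- rnorm(n_pairs, sd = sd_between)
  a <- shared + rnorm(n_pairs, sd = sd_within)
  b <- shared + rnorm(n_pairs, sd = sd_within)
  first_treated <- runif(n_pairs) < 0.5
  pair_diff <- ifelse(first_treated, (a + effect) - b, (b + effect) - a)
  mean(pair_diff)
}

# Spread of the effect estimate across repeated experiments (smaller is better).
simple_se  <- sd(replicate(2000, simple_estimate()))
blocked_se <- sd(replicate(2000, blocked_estimate()))
print(c(simple = simple_se, blocked = blocked_se))
```

Because the shared component cancels in each within-pair difference, the blocked estimate is far less noisy for the same number of subjects; how much less depends entirely on how large the shared variation is relative to the individual variation.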

This is the concept of blocking, one of whose advocates was Student (William Sealy Gosset), the great early statistician. While Student is best known for his work on exact tests and experiment design, he had an interest in genetics (perhaps due to his friendship with & professional work on evaluating plant breeding results), making an important contribution in Student 1933; in his comments on the Lanarkshire Milk Experiment, he would unite both interests, pointing out the usefulness of genetic considerations for optimal experimental design in humans as well as agriculture.

Efficiency of blocking for the Lanarkshire Milk Experiment

The history of blocking and its advantages compared to simple randomized experiments are discussed at length in “W.S. Gosset and Some Neglected Concepts in Experimental Statistics: Guinnessometrics II”, Ziliak 2011, and “Balanced versus Randomized Field Experiments in Economics: Why W. S. Gosset aka ‘Student’ Matters”, Ziliak 2014. A particularly early example of blocking for demonstrating benefits comes from the farm of noted systematic breeder Robert Bakewell (1725–1795), who experimented with many methods of improvement, particularly the benefits of thorough irrigation of his fields, proving it to his many domestic & international visitors with a quasi-random blocking, as described in Observations on live stock: containing hints for choosing and improving the best breeds of the most useful kinds of domestic animals, George Culley & Robert Heaton 1804, pg75:

At Dish­ley, Mr Bakewell has im­proved a con­sid­er­able tract of poor cold land, be­yond any­thing I ever saw, or could have con­ceived, by this same mode of im­prove­men­t;—and, ever ready to com­mu­ni­cate his knowl­edge to the pub­lic, he has left proof pieces in differ­ent parts of his mead­ows, in or­der to con­vince peo­ple of the great im­por­tance and util­ity of this kind of im­prove­men­t:—­par­tic­u­lar­ly, in one part he has been at pains to di­vide a rood of ground into twenty equal di­vi­sions, viz. two perches in each piece. It is so con­trived that they can wa­ter the first, and leave the sec­ond un­wa­tered; or miss the first, and wa­ter the sec­ond; and so on through all the 20 di­vi­sions; by which con­trivance, you have the fairest and most un­equiv­o­cal proofs of the good effects of im­prov­ing ground by wa­ter­ing.

Hous­man 1894, pg7–9 men­tions the block­ing was used for fer­til­iza­tion as well:

Young says that his ir­ri­ga­tion is “among the rarest in­stances of spir­ited hus­bandry”, much ex­ceed­ing any­thing of the kind he had seen be­fore, even in the hands of land­lord­s…He did not hastily ei­ther adopt or ex­tend his sys­tem of ir­ri­ga­tion, but felt his way as he ad­vanced, try­ing var­i­ous ex­per­i­ments to sat­isfy him­self of the effi­cacy and econ­omy of the sys­tem be­fore in­cur­ring fur­ther ex­pense. Side by side he had plots of land: two plots, one wa­tered, the other not wa­tered; two again, one wa­tered, the other ma­nured; and again two, one wa­tered from a spring, the other from the stream; so that he could form his es­ti­mates of the com­par­a­tive value of ir­ri­ga­tion as against other fer­til­is­ing agen­cies, and of differ­ent modes of ir­ri­ga­tion…We have seen in the in­stance of his ir­ri­gated land how Mr. Bakewell tested the worth of his no­tions by fre­quent and var­ied ex­per­i­ment. He did the same in every de­part­ment of the farm. This was the grand source of his pow­er. He did not try to make facts square with his opin­ions, but his opin­ions with facts.

An­other early his­tor­i­cal ex­am­ple is the pre­vi­ous­ly-men­tioned Gus­tav III of Swe­den’s coffee ex­per­i­ment some­time in the late 1700s, which is pos­si­bly the first twin ex­per­i­ment.

Zil­iak writes (pg22) about an in­ter­est­ing ex­am­ple where Stu­dent cal­cu­lates that a well-blocked sam­ple of 100 could be bet­ter than 20,000 with sim­ple ran­dom­iza­tion:

The intuition behind the higher power of ABBA and other balanced designs to detect a large and real treatment difference was given by Student in 1911. “Now if we are comparing two varieties it is clearly of advantage to arrange the plots in such a way that the yields of both varieties shall be affected as far as possible by the same causes to as nearly as possible an equal extent”. He used this “principle of maximum contiguity” often, for example when he illustrated the higher precision and lower costs that would be associated with a small-sample study of biological twins, to determine the growth trajectory of children fed with pasteurized milk, unpasteurized milk, and no milk at all, in “The Lanarkshire Milk Experiment” (Student, 1931a). Student (1931a, p. 405) estimated that “50 pairs of [identical twins] would give more reliable results than the 20,000” child sample, neither balanced nor random, actually studied in the experiment funded by the Scotland Department of Health. “[I]t would be possible to obtain much greater certainty” in the measured difference of growth in height and weight of children drinking [raw] versus [pasteurized milk] “at an expenditure of perhaps 1–2% of the money and less than 5% of the trouble.” Likewise, Karlan and List (2007, p. 1777) could have revealed more about the economics of charitable giving – for less – using a variant of Student’s method. Instead the AER article studied n = 50,083 primarily white, male, pro-Al Gore donors to public radio, neither random nor balanced.

I thought this was a great example, and a topical one inasmuch as raw milk is still contested because gourmands prize it over pasteurized milk (leading to occasional illnesses or deaths and subsequent legal/political maneuvering to ban/preserve access to raw milk), and I was curious about the details of how Student computed that the surprisingly low number of 100 twins (50 pairs) would suffice. So I looked for Student’s original paper.

Stu­dent be­gins by sum­ma­riz­ing the La­nark­shire ex­per­i­ment in a lit­tle more de­tail:

…In the spring of 1930, a nu­tri­tional ex­per­i­ment on a very large scale was car­ried out in the schools of La­nark­shire. For four months 10,000 school chil­dren re­ceived 0.75 pint of milk per day, 5000 of these got raw milk and 5,000 pas­teur­ized milk, in both cases Grade A (Tu­ber­culin tested1); an­other 10,000 chil­dren were se­lected as con­trols and the whole 20,000 chil­dren were weighed and their height was mea­sured at the be­gin­ning and end of the ex­per­i­men­t…The 20,000 chil­dren were cho­sen in 67 schools, not more than 400 nor less than 200 be­ing cho­sen in any one school, and of these half were as­signed as “feed­ers” and half as “con­trols”, some schools were pro­vided with raw milk and the oth­ers with pas­teur­ized milk, no school get­ting both­…Sec­ond­ly, the se­lec­tion of the chil­dren was left to the Head Teacher of the school and was made on the prin­ci­ple that both “con­trols” and “feed­ers” should be rep­re­sen­ta­tive of the av­er­age chil­dren be­tween 5 and 12 years of age: the ac­tual method of se­lec­tion be­ing im­por­tant I quote from Drs Leighton and McKin­lay’s [1930] Re­port: “The teach­ers se­lected the two classes of pupils, those get­ting milk and those act­ing as ‘con­trols’, in two differ­ent ways. In cer­tain cases they se­lected them by bal­lot and in oth­ers on an al­pha­bet­i­cal sys­tem.”

Un­for­tu­nate­ly, while am­bi­tious, the ran­dom­iza­tion was heav­ily com­pro­mised (dis­cussed fur­ther in Kadane & Sei­den­feld 1996):

…after in­vok­ing the god­dess of chance they un­for­tu­nately wa­vered in their ad­her­ence to her for we read: “In any par­tic­u­lar school where there was any group to which these meth­ods had given an un­due pro­por­tion of well fed or ill nour­ished chil­dren, oth­ers were sub­sti­tuted in or­der to ob­tain a more level se­lec­tion.” This is just the sort of after­thought that most of us have now and again and which is apt to spoil the best laid plans. In this case it was a fa­tal mis­take, for in con­se­quence the con­trols were, as pointed out in the Re­port, defi­nitely su­pe­rior both in weight and height to the “feed­ers” by an amount equiv­a­lent to about 3 months’ growth in weight and 4 months’ growth in height. Pre­sum­ably this dis­crim­i­na­tion in height and weight was not made de­lib­er­ate­ly, but it would seem prob­a­ble that the teach­ers, swayed by the very hu­man feel­ing that the poorer chil­dren needed the milk more than the com­par­a­tively well to do, must have un­con­sciously made too large a sub­sti­tu­tion of the il­l-nour­ished among the “feed­ers” and too few among the “con­trols” and that this un­con­scious se­lec­tion affect­ed, sec­on­dar­i­ly, both mea­sure­ments.

Student then observes that besides producing a baseline imbalance (which is clearly visible in the graphs on pg400 & pg402, where the controls are taller in every time period, although this advantage lessens with time), this favoring of the poor could produce a systematic bias in the recorded weights during winter, which were made with the children’s clothes on, as it is entirely possible that poorer children will have lighter or fewer heavy winter clothes. Another issue is the allocation of entire schools to either pasteurized or raw milk as their comparison to the no-milk controls (leading to problems in accounting for the hierarchical nature of the data, since school-level effects are confounded with the pasteurized/raw effect). These three issues are reflected in anomalies in the data, reducing our confidence in the results despite it having been a randomized experiment (rather than something lamer like a survey noting that children who could afford milk were taller). R.A. Fisher likewise noted (Fisher & Bartlett 1931) that, taken at face value, the experiment produced the opposite of the expected result: raw milk was superior to pasteurized milk, despite the raw nutrient value of fat/protein/etc presumably being identical between milks and the greater safety of pasteurization, which, if anything, implied that children should not be fed pasteurized milk.

Hav­ing per­formed the post-mortem on the La­nark­shire Milk Ex­per­i­ment, Stu­dent then pro­poses im­prove­ments to the de­sign, cul­mi­nat­ing in the most rad­i­cal (pg405):

(2) If it be agreed that milk is an ad­van­ta­geous ad­di­tion to chil­dren’s di­et—and I doubt whether any one will com­bat that view—and that the differ­ence be­tween raw and pas­teur­ized milk is the mat­ter to be in­ves­ti­gat­ed, it would be pos­si­ble to ob­tain much greater cer­tainty at an ex­pen­di­ture of per­haps 1–2% of the money [This is a se­ri­ous con­sid­er­a­tion: the La­nark­shire ex­per­i­ment cost about £7500. [Which is ~£374,800 in 2016 pounds ster­ling or ~$541100.]] and less than 5% of the trou­ble.

For among 20,000 chil­dren there will be nu­mer­ous pairs of twins; ex­actly how many it is not easy to say ow­ing to the differ­en­tial death rate, but, since there is about one pair of twins in 90 births, one might hope to get at least 160 pairs in 20,000 chil­dren. But as a mat­ter of fact the 20,000 chil­dren were not all the La­nark­shire schools pop­u­la­tion, and I feel pretty cer­tain that some 200–300 pairs of twins would be avail­able for the pur­pose of the ex­per­i­ment. Of 200 pairs some 50 would be “iden­ti­cals” and of course of the same sex, while half the re­main­der would be non-i­den­ti­cal twins of the same sex.

Now iden­ti­cal twins are prob­a­bly bet­ter ex­per­i­men­tal ma­te­r­ial than is avail­able for feed­ing ex­per­i­ments car­ried out on any other mam­mals, and the er­ror of the com­par­i­son be­tween them may be re­lied upon to be so small that 50 pairs of these would give more re­li­able re­sults than the 20,000 with which we have been deal­ing. The pro­posal is then to ex­per­i­ment on all pairs of twins of the same sex avail­able, not­ing whether each pair is so sim­i­lar that they are prob­a­bly “iden­ti­cals” or whether they are dis­sim­i­lar.

“Feed” one of each pair on raw and the other on pas­teurised milk, de­cid­ing in each case which is to take raw milk by the toss of a coin. Take weekly mea­sure­ments and weigh with­out clothes.

Some way of dis­tin­guish­ing the chil­dren from each other is nec­es­sary or the mis­chie­vous ones will play tricks. The ob­vi­ous method is to take fin­ger-prints, but as this is iden­ti­fied with crime in some peo­ple’s minds, it may be nec­es­sary to make a differ­ent in­deli­ble mark on a fin­ger­nail of each, which will grow off after the ex­per­i­ment is over. With such com­par­a­tively small num­bers fur­ther in­for­ma­tion about the di­etetic habits and so­cial po­si­tion of the chil­dren could be col­lected and would doubt­less prove in­valu­able.

The com­par­a­tive vari­a­tion in the effect in “iden­ti­cal” twins and in “un­like” twins should fur­nish use­ful in­for­ma­tion on the rel­a­tive im­por­tance of “Na­ture and Nur­ture”. …[The twin ex­per­i­ment] is likely to pro­vide a much more ac­cu­rate de­ter­mi­na­tion of the point at is­sue, ow­ing to the pos­si­bil­ity of bal­anc­ing both na­ture and nur­ture in the ma­te­r­ial of the ex­per­i­ment.

This is a reasonable suggestion, but I was disappointed to see that Student gives neither a calculation nor a reference to another work with a similar calculation. So it’s unclear whether Student calculated an exact answer and omitted the details for reasons of flow or space, or was giving an off-the-cuff guess based on long experience with power analysis from his past statistical research & his brewery job. (If Ziliak was going to cite it as an example, it would’ve been better if he had verified that Student’s speculation was right at least to within an order of magnitude rather than simply quoting him as an authority.)

Power estimate of twins vs general population

The cal­cu­la­tion does­n’t seem hard. For this ex­am­ple, us­ing height, it’d go some­thing like:

  1. take the es­ti­mated height gain in inches
  2. find the dis­tri­b­u­tion of height differ­ences for twins and find the dis­tri­b­u­tion of height for the gen­eral pop­u­la­tion of Scot­tish kids those ages
  3. con­vert the inch gain into stan­dard de­vi­a­tions for twins and for gen­eral pop­u­la­tion
  4. plug those two into a power calculator asking for, say, 80% power
  5. compare how many twin pairs n1 you need with how many general-population pairs n2, and report the fraction n1/n2 and how close it is to 5%.
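The five steps can be wrapped into one small helper. This is a sketch only: the function name `twin_fraction` and its argument names are made up for illustration, and the inputs in the usage line are arbitrary placeholders rather than the Lanarkshire figures.

```r
# Sketch of steps 1-5; the function and argument names are illustrative.
twin_fraction <- function(gain, sd_general, sd_twin, power = 0.8) {
  d_general <- gain / sd_general   # step 3: gain in general-population SDs
  d_twin    <- gain / sd_twin      # step 3: gain in twin-difference SDs
  # step 4: required n per group for a two-sample t-test at the requested power
  n_general <- power.t.test(delta = d_general, power = power)$n
  n_twin    <- power.t.test(delta = d_twin,    power = power)$n
  n_twin / n_general               # step 5: the fraction n1/n2
}

# Made-up inputs purely for illustration: a 1-unit gain, SDs of 10 and 2.
example_fraction <- twin_fraction(gain = 1, sd_general = 10, sd_twin = 2)
print(example_fraction)
```

The fraction depends only on the ratio of the two standard deviations (approximately its square), so the size of the gain matters for the absolute sample sizes but barely for the relative savings.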

Milk’s effect on male height

To start, on pg403, a ta­ble re­ports “Gain in height in inches by Feed­ers over Con­trols”, for which the largest effect in boys is the 5–7yo boys group at +0.083(0.011) inch­es. So we are tar­get­ing an av­er­age in­crease of a tenth of an inch.

What is the height variability or standard deviation for twins and for Lanarkshire children of a similar age? The followup paper “The Lanarkshire Milk Experiment”, Elderton 1933, helpfully reports standard deviations for Lanarkshire children and also cites some standard deviations for twins’ heights at that time (pg2):

Dr Stocks in his study of twins [Percy Stocks 1930, as­sisted by Mary N. Karn: “A Bio­met­ric In­ves­ti­ga­tion of Twins and their Broth­ers and Sis­ters”, An­nals of Eu­gen­ics, Vol. v, pp. 46–50. Fran­cis Gal­ton Lab­o­ra­tory for Na­tional Eu­gen­ic­s.] found differ­ences in weight as great as 28 hec­tograms (10 ounces) in those twins he re­garded as monozy­gotic whose ages cor­re­sponded to the chil­dren in the milk ex­per­i­ment. The stan­dard de­vi­a­tion of weight in pounds is roughly twice that of the stan­dard de­vi­a­tion of height in inch­es, so that if 8 ounces differ­ence in ini­tial weight be per­mit­ted, 1⁄4 inch differ­ence in height could be al­lowed. Judg­ing also by Dr Stocks’ ma­te­r­ial in which monozy­gotic twins showed a modal differ­ence of 1 cm in height it would have been jus­ti­fi­able to al­low chil­dren to be paired who differed by two-eighths of an inch, but the labour of pair­ing would have been much heav­ier if a greater vari­a­tion than that en­tered on the cards had been al­lowed for height as well as for weight…In Ta­ble I the stan­dard de­vi­a­tions and co­effi­cients of vari­a­tion of the ini­tial height and weight for each year of birth are given, and if these be com­pared with those for Glas­gow boys and girls, it will be seen that they are dis­tinctly less. The Glas­gow fig­ures were ob­tained by lin­ear in­ter­po­la­tion and are given in brack­ets after those for the se­lected La­nark­shire da­ta.

Elderton’s Table I reports the Lanarkshire children’s grouped data by “6 years 9 months”, “7 years 9 months”, and then “8 years 9 months” & higher; the first two presumably map best onto our 5–7yo group, and have values of n = 382 with standard deviation 1.483(2.58), and n = 337 with standard deviation 1.648(2.82); converting the SDs to variances & pooling them, we get a standard deviation of $\sqrt{\frac{(382-1) \cdot 1.483^2 + (337-1) \cdot 1.648^2}{382 + 337 - 2}} \approx 1.5625$ for the general population.
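The pooling step can be sketched as follows, assuming a pooled variance weighted by degrees of freedom; that assumption reproduces the value 1.56249311 used in the power analysis. The variable names are illustrative.

```r
# Group sizes and SDs of initial height (inches) for the two youngest
# age groups, as quoted from Elderton's table.
n   <- c(382, 337)
sds <- c(1.483, 1.648)

# Pool the variances, weighting each by its degrees of freedom (n - 1).
pooled_var <- sum((n - 1) * sds^2) / (sum(n) - length(n))
pooled_sd  <- sqrt(pooled_var)
print(pooled_sd)
```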

Elderton’s summary of Stocks’s twin research is confusingly worded (partly because Stocks worked in centimeters and Elderton in inches), but she appears to be saying that the standard deviation of differences in Stocks’s twins is 0.25 inches, which is much smaller than 1.483, around one-sixth of it (0.25/1.483 ≈ 0.17). Table II (pg11) in Stocks 1930 records variability of identical twins vs fraternal vs their non-twin siblings in “mean corrected Deviates”, mentioning that the root-mean-squared difference in the table is the standard deviation, so the σ0 of height for identical twins is 0.9497 while the general population’s is 6.01cm (pg13); 0.9497/6.01 ≈ 0.16 likewise comes out to about one-sixth, confirming where Elderton got her specific estimate of ~0.25 inches as the standard deviation for identical twins.

So the claimed effect is +0.083 inch­es, which rep­re­sents d = 0.05 (for the gen­eral pop­u­la­tion) and d = 0.332 (twin differ­ences).

Power analysis

We then ask how much data is re­quired to con­duct a well-pow­ered ex­per­i­ment to de­tect the ex­is­tence of such an effect with a stan­dard t-test:

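# Standardized effect sizes: the 0.083-inch gain divided by each standard deviation
generalD <- 0.083 / 1.56249311   # pooled SD of Lanarkshire children's height (inches)
twinD    <- 0.083 / 0.25         # SD of identical-twin height differences (inches)
generalP <- power.t.test(delta=generalD, power=0.8); generalP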
#      Two-sample t test power calculation
#
#               n = 5564.07129
#           delta = 0.0531202342
#              sd = 1
#       sig.level = 0.05
#           power = 0.8
#     alternative = two.sided
#
# NOTE: n is number in *each* group
twinP <- power.t.test(delta=twinD, power=0.8); twinP
#      Two-sample t test power calculation
#
#               n = 143.383601
#           delta = 0.332
#              sd = 1
#       sig.level = 0.05
#           power = 0.8
#     alternative = two.sided
#
# NOTE: n is number in *each* group
twinP$n / generalP$n
# [1] 0.0257695478
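As a cross-check on the numbers above, the usual normal approximation to the two-sample power calculation gives nearly the same sample sizes and makes clear where the ratio comes from. This is a sketch using the same effect sizes; `approx_n` is a helper name introduced here for illustration.

```r
# Effect sizes as defined in the power analysis above.
generalD <- 0.083 / 1.56249311
twinD    <- 0.083 / 0.25

# Normal approximation: n per group ~= 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2
z <- qnorm(1 - 0.05 / 2) + qnorm(0.8)
approx_n <- function(d) 2 * (z / d)^2

print(c(general = approx_n(generalD), twin = approx_n(twinD)))

# Under this approximation the z terms cancel, so the ratio of sample sizes
# is just the squared ratio of the two standard deviations.
approx_ratio <- approx_n(twinD) / approx_n(generalD)
print(approx_ratio)
print((0.25 / 1.56249311)^2)
```

The approximate sample sizes land within about one subject per group of the exact `power.t.test` results, and the ratio reduces to (0.25/1.5625)², so the savings are driven entirely by how much smaller the twin standard deviation is.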

So with 143 twin-pairs or n = 286, we can match the power of a sample drawn from the general population with 5564 in each group or n = 11128, a savings of ~97% of the sample. (To put that in perspective, if costs scaled exactly per head and twins didn’t entail any extra expenses2, then that estimated cost of ~$541,100 would have instead been ~$13,944, for a savings of ~$527,156.) Student’s guess of “1–2%” proves to be on the money, and the experiment is also feasible, as 143 twin-pairs is well below the number of twin-pairs that Student estimated to be available (>160, and more likely “200–300”; 300 twin-pairs would yield a power of ~98%, exceeding the ~96% power of the Lanarkshire 20,000).
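The power comparison can be rerun with fixed sample sizes instead of a fixed power. This sketch keeps the same two-sample setup and effect sizes as above, and follows the convention used here of treating the per-group n as the number of twin-pairs; the exact figures depend on those assumptions.

```r
# Effect sizes as defined in the power analysis above.
generalD <- 0.083 / 1.56249311
twinD    <- 0.083 / 0.25

# Power with n = 300 per group at the twin effect size
# (the upper end of Student's estimate of available pairs).
twin300 <- power.t.test(n = 300, delta = twinD)$power

# Power with n = 10,000 per group at the general-population effect size
# (10,000 "feeders" vs 10,000 controls).
general10k <- power.t.test(n = 10000, delta = generalD)$power

print(round(c(twin300 = twin300, general10k = general10k), 3))
```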

We can safely say that Stu­den­t’s Scot­tish milk ex­per­i­ment ex­am­ple does in­deed demon­strate the power of twins.

All traits

We can go further and estimate the power of twins in general for experimentation. While height is somewhat unusually heritable, we can say with confidence that almost all traits studied are highly heritable based on the previously mentioned mega-meta-analysis which compiles data on 17,804 traits estimated from ~14.5m pairs. The upshot is that averaging over all those traits, ~49% of variance is due to heritability and ~17% to shared-environment3; implying that compared to a sample drawn from the general population (who share neither genetics nor the shared-environment upbringing), identical twins will have ~34% (1 − (0.488 + 0.174) = 0.338) of the variability.

For eas­ier com­par­i­son with the run­ning ex­am­ple, we can redo the height cal­cu­la­tion but as­sum­ing we are look­ing at a generic trait with a higher vari­abil­i­ty:

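# Effect sizes for a generic trait: twins keep ~33.8% of the general-population variability
generalAllD <- 0.083 / 1.56249311
twinAllD    <- 0.083 / (0.338 * 1.56249311)

generalAllP <- power.t.test(delta=generalAllD, power=0.8)
twinAllP    <- power.t.test(delta=twinAllD,    power=0.8)

twinAllP$n / generalAllP$n
# [1] 0.114397136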

In gen­er­al, an ex­per­i­ment run us­ing twins will re­quire a sam­ple 11% the size of the same ex­per­i­ment run us­ing the gen­eral pop­u­la­tion.
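How the 11% figure depends on what the 0.338 multiplies can be seen with the normal approximation, under which the ratio of required sample sizes equals the squared ratio of standard deviations. The calculation above applies 0.338 to the standard deviation (reading A below); reading B, where 0.338 is taken as the share of variance remaining, is purely an assumption of this sketch, included as a sensitivity check.

```r
# Share of variability left after removing genetics and shared environment.
remaining <- 1 - (0.488 + 0.174)   # 0.338

# Reading A (as in the calculation above): 0.338 multiplies the standard deviation,
# so the sample-size ratio is approximately 0.338^2.
ratio_sd_scaled <- remaining^2

# Reading B (an assumption of this sketch): 0.338 is the fraction of *variance* left,
# so the SD shrinks by sqrt(0.338) and the sample-size ratio is approximately 0.338.
ratio_var_scaled <- sqrt(remaining)^2

print(c(sd_scaled = ratio_sd_scaled, var_scaled = ratio_var_scaled))
```

Reading A recovers the ~11% figure; under reading B the required sample would be closer to a third of the general-population sample, still a large saving but a smaller one.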

While it may seem like an es­o­teric sta­tis­ti­cal point, this is not a neg­li­gi­ble sav­ings, and demon­strates the value of con­sid­er­ing both ex­per­i­ment de­sign and be­hav­ioral ge­net­ics rather than ig­nor­ing them. Many twins are avail­able via twin reg­istries, and even if twins are un­fea­si­ble, the point car­ries through to other pop­u­la­tions that one could re­cruit: sib­lings, or close rel­a­tives—if noth­ing else, one could use the in­creas­ing preva­lence of geno­typ­ing (<$50) in to make sub­ject pairs as sim­i­lar as pos­si­ble on all co­vari­ates (in­clud­ing over­all ge­netic sim­i­lar­ity / re­lat­ed­ness or poly­genic scores for key trait­s).


  1. “milk from herd that has been attested free from [tuberculosis].” Tuberculosis infection from milk was a serious problem in the UK in this period; as J. B. S. Haldane remarks in his discussion of “Vitamins” (Possible Worlds And Other Essays 1927): “I would sooner have my child run the risk of rickets or infantile scurvy from over-boiled milk than of tuberculosis from drinking it raw. I refer here to British milk – American is less tuberculous.” Later elaborating in the “The Fight With Tuberculosis” chapter:

    Tu­ber­cu­lo­sis does not stand first on the list of causes of death in Eng­land, but it is the most se­ri­ous, be­cause it kills in in­fancy and prime of life…one quar­ter of the deaths of French chil­dren in their first two years are due to tu­ber­cle…the great­est sin­gle chan­nel of in­fec­tion is milk from tu­ber­cu­lous cows drunk in in­fancy or early child­hood. But the vast ma­jor­ity even of well-to-do par­ents do not take the trou­ble to ob­tain Grade A or Grade A cer­ti­fied milk for their younger chil­dren. In many places it is, of course, not avail­able, but it would be if an eco­nomic de­mand for it ex­ist­ed. And with no pub­lic opin­ion be­hind it in the mat­ter the Gov­ern­ment can­not be ex­pected to leg­is­late dras­ti­cally in favour of pure milk. If sci­ence has not dis­cov­ered a cure or an in­fal­li­ble pre­ven­tive for tu­ber­cu­lo­sis, it has at least shown how the mor­tal­ity could be greatly low­ered. For the price of a cigar or a cin­ema a week you can pro­tect your child against its most dan­ger­ous en­e­my. Is it worth while?

    Thankfully, this is not a problem parents need concern themselves with any more. (Tuberculosis is developing multi-drug resistance and so cows might again suffer regular infection, but testing and genetic engineering will keep bovine tuberculosis at bay.)

  2. Because twin registries already exist and regularly recruit twins for studies, it’s possible that twins might cost less to experiment on in a contemporary setting (although I don’t believe twin registries had been set up in Scotland at this point).

  3. Note how much better identical twins are than siblings; sibling designs are valuable when we are hypothesizing about something in the shared-environment, and can be used for many purposes like showing that GWAS hits are not confounded by population stratification or that the harm from things like maternal smoking is overestimated by analyses ignoring genetic confounding, but since siblings are not all that similar, the gain is not nearly as large as with identical twins (who are far more similar than DZ twins or siblings). For some discussion of this topic from epidemiological perspectives, see & Smith 2011.