In discussing a large Scottish public health experiment, Student noted that it would’ve been vastly more efficient using a twin experiment design; I fill in the details with a power analysis.
Randomized experiments require more subjects the more variable each datapoint is to overcome the noise which obscures any effects of the intervention. Reducing noise enables better inferences with the same data, or less data to be collected, which can be done by balancing observed characteristics between control and experimental datapoints.
A particularly dramatic example of this approach is running experiments on identical twins rather than regular people, because twins vary far less from each other than random people due to shared genetics & family environment. In 1931, the great statistician Student (William Sealy Gosset) noted problems with an extremely large (n = 20,000) Scottish experiment in feeding children milk (to see if they grew more in height or weight), and claimed that the experiment could have been done far more cost-effectively with an extraordinary reduction of >95% fewer children if it had been conducted using twins, and claimed that 100 identical twins would have been more accurate than 20,000 children. He, however, did not provide any calculations or data demonstrating this.
I revisit the issue and run a power calculation on height indicating that Student’s claims were correct and that the experiment would have required ~97% fewer children if run with twins.
This reduction is not unique to the Scottish milk experiment on height/weight, and in general, one can expect a reduction of 89% in experiment sample sizes using twins rather than regular people, demonstrating the benefits of using behavioral genetics in experiment design/power analysis.
Randomized experiments enable causal inference by assuring that the experimental group is identical, on average, in all ways to the control group and the subsequent differences are caused by the intervention. If a coin is flipped, then each group will have the same fraction of women, the same fraction of atheists, the same fraction of people with a particular SNP on COMT, the same fraction of people with latent cancers, etc—on average. But since there are thousands upon thousands of ways in which subjects can (and do) differ which might affect the results (just consider how almost all human traits are genetically heritable), and there may only be a few dozen subjects, randomization cannot guarantee exact or even approximate similarity of groups on every single trait and with same samples can generate grossly imbalanced samples (perhaps you have 2 groups of 10, and one group turns out to have 9 women and the other just 1). This doesn’t bias the results or defeat the point of randomization, but it can add a lot of noise, reducing our efficiency, thereby making our existing studies less meaningful, requiring more expensive studies to estimate effects to a useful level of precision, and blocking profitable decisions.
When put this way, one might wonder how to improve the design of simple randomized experiments: if you have a group with too many women and too few men, why not change your coin flip to instead pairs of women and men? Instead of flipping a fair coin for each patient to pick whether that patient goes into the control or experimental group, why not instead take a pair of women and flip a coin to decide whether the one on the left goes into the experimental group, and then the woman on the right goes into the other group? And likewise for the men. Since so many things cluster within families, it would be better to try to balance families too, and ensure that if there are multiple siblings available, one sibling goes into the experiment and the other into the control, rather than risking lopsided allocation. (This turns out to be a serious issue in some animal experiments, if experimenters accidentally randomly allocate whole litters into one arm, which can happen often given the small n that animal research typically uses.) In fact, why not try to extend this ‘matching up’ to as many things as you can measure? If we use identical twins (like in Gustav III of Sweden’s coffee experiment), we match them on not just nationality, gender, location, parents, age, food consumption or whatnot, we even match them on genetics too, which is part of why identical twins are so eerily identical even when raised apart. (Note that this point isn’t the same as the variance component estimating going on in twin study or other family designs; we care that they are similar and thus comparisons are less noisy, not the question of how much of that eerie similarity is due to genetics or other sources.) Indeed, since a person is matched with themselves on just about everything and are even better than identical twins, why not test the treatment on the same person over multiple time periods? But people do change over time and there might be lingering effects, so perhaps it would be even better if you can test it on the same person simultaneously: for example, test an acne cream A on one side of the face and acne cream B on the other half of each subject’s face, flipping a coin to decide whether A/B or B/A.
This is the concept of blocking, one of whose advocates was Student (William Sealy Gosset), the great early statistician. While Student is best known for his work on exact tests and experiment design, he had an interest in genetics (perhaps due to his friendship with R. A. Fisher & professional work on evaluating plant breeding results), making an important contribution in Student 1933; in his comments on the Lanarkshire Milk Experiment, he would unite both interests, pointing out the usefulness of genetic considerations for optimal experimental design in humans as well as agriculture.
The history of blocking and its advantages compared to simple randomized experiments are discussed at length in “W.S. Gosset and Some Neglected Concepts in Experimental Statistics: Guinnessometrics II”, Ziliak 2011/“Balanced versus Randomized Field Experiments in Economics: Why W. S. Gosset aka ‘Student’ Matters”, Ziliak 2014 (see also Pearson 1939 & Ziliak 2008). A particularly early example of blocking for demonstrating benefits comes from the farm of noted systematic breeder Robert Bakewell (1725–1795), who experimented with many methods of improvement, particularly the benefits of thorough irrigation of his fields—proving it to his many domestic & international visitors with a quasi-random blocking: Observations on live stock: containing hints for choosing and improving the best breeds of the most useful kinds of domestic animals, George Culley & Robert Heaton 1804, pg75:
At Dishley, Mr Bakewell has improved a considerable tract of poor cold land, beyond anything I ever saw, or could have conceived, by this same mode of improvement;—and, ever ready to communicate his knowledge to the public, he has left proof pieces in different parts of his meadows, in order to convince people of the great importance and utility of this kind of improvement:—particularly, in one part he has been at pains to divide a rood of ground into twenty equal divisions, viz. two perches in each piece. It is so contrived that they can water the first, and leave the second unwatered; or miss the first, and water the second; and so on through all the 20 divisions; by which contrivance, you have the fairest and most unequivocal proofs of the good effects of improving ground by watering.
Housman 1894, pg7–9 mentions the blocking was used for fertilization as well:
Young says that his irrigation is “among the rarest instances of spirited husbandry”, much exceeding anything of the kind he had seen before, even in the hands of landlords…He did not hastily either adopt or extend his system of irrigation, but felt his way as he advanced, trying various experiments to satisfy himself of the efficacy and economy of the system before incurring further expense. Side by side he had plots of land: two plots, one watered, the other not watered; two again, one watered, the other manured; and again two, one watered from a spring, the other from the stream; so that he could form his estimates of the comparative value of irrigation as against other fertilising agencies, and of different modes of irrigation…We have seen in the instance of his irrigated land how Mr. Bakewell tested the worth of his notions by frequent and varied experiment. He did the same in every department of the farm. This was the grand source of his power. He did not try to make facts square with his opinions, but his opinions with facts.
Another early historical example is the previously-mentioned Gustav III of Sweden’s coffee experiment sometime in the late 1700s, which is possibly the first twin experiment.
Ziliak writes (pg22) about an interesting example where Student calculates that a well-blocked sample of 100 could be better than 20,000 with simple randomization:
The intuition behind the higher power of ABBA and other balanced designs to detect a large and real treatment difference was given by Student in 1911.68 “Now if we are comparing two varieties it is clearly of advantage to arrange the plots in such a way that the yields of both varieties shall be affected as far as possible by the same causes to as nearly as possible an equal extent”.69 He used this “principle of maximum contiguity” often, for example when he for example when he illustrated the higher precision and lower costs that would be associated with a small-sample study of biological twins, to determine the growth trajectory of children fed with pasteurized milk, unpasteurized milk, and no milk at all, in “The Lanarkshire Milk Experiment” (Student, 1931a).69 Student (1931a, p. 405) estimated that “50 pairs of [identical twins] would give more reliable results than the 20,000” child sample, neither balanced nor random, actually studied in the experiment funded by the Scotland Department of Health. “[I]t would be possible to obtain much greater certainty” in the measured difference of growth in height and weight of children drinking raw versus pasteurized milk “at an expenditure of perhaps 1–2% of the money and less than 5% of the trouble.” Likewise, Karlan and List (2007, p. 1777) could have revealed more about the economics of charitable giving—for less—using a variant of Student’s method. Instead the AER article studied n = 50,083 primarily white, male, pro-Al Gore donors to public radio, neither random nor balanced.
I thought this was a great example and topical inasmuch as raw milk is still an issue because gourmands prize it over pasteurized milk (leading to occasional illnesses or deaths and subsequent legal/political maneuvering to ban/preserve access to raw milk), and I was curious about the details of how Student computed that the surprisingly low number of 100 twins (50 pairs) would suffice. So I looked for Student’s original paper.
Student begins by summarizing the Lanarkshire experiment in a little more detail:
…In the spring of 1930, a nutritional experiment on a very large scale was carried out in the schools of Lanarkshire. For four months 10,000 school children received 0.75 pint of milk per day, 5000 of these got raw milk and 5,000 pasteurized milk, in both cases Grade A (Tuberculin tested1); another 10,000 children were selected as controls and the whole 20,000 children were weighed and their height was measured at the beginning and end of the experiment…The 20,000 children were chosen in 67 schools, not more than 400 nor less than 200 being chosen in any one school, and of these half were assigned as “feeders” and half as “controls”, some schools were provided with raw milk and the others with pasteurized milk, no school getting both…Secondly, the selection of the children was left to the Head Teacher of the school and was made on the principle that both “controls” and “feeders” should be representative of the average children between 5 and 12 years of age: the actual method of selection being important I quote from Drs Leighton and McKinlay’s  Report: “The teachers selected the two classes of pupils, those getting milk and those acting as ‘controls’, in two different ways. In certain cases they selected them by ballot and in others on an alphabetical system.”
Unfortunately, while ambitious, the randomization was heavily compromised (discussed further in Kadane & Seidenfeld 1996):
…after invoking the goddess of chance they unfortunately wavered in their adherence to her for we read: “In any particular school where there was any group to which these methods had given an undue proportion of well fed or ill nourished children, others were substituted in order to obtain a more level selection.” This is just the sort of afterthought that most of us have now and again and which is apt to spoil the best laid plans. In this case it was a fatal mistake, for in consequence the controls were, as pointed out in the Report, definitely superior both in weight and height to the “feeders” by an amount equivalent to about 3 months’ growth in weight and 4 months’ growth in height. Presumably this discrimination in height and weight was not made deliberately, but it would seem probable that the teachers, swayed by the very human feeling that the poorer children needed the milk more than the comparatively well to do, must have unconsciously made too large a substitution of the ill-nourished among the “feeders” and too few among the “controls” and that this unconscious selection affected, secondarily, both measurements.
Student then observes that besides producing a baseline imbalance (which is clearly visible in the graphs on pg400 & pg402, where the controls are taller in every time period, although this advantage lessens with time), this favoring of the poor could produce a systematic bias in the recorded weights during winter, which were made with the childrens’ clothes on, as it is entirely possible that poorer children will have lighter or fewer heavy winter clothes. Another issue is allocating entire schools to using either pasteurized or raw milk as their comparison to the no-milk controls (leading to problems in accounting for the hierarchical nature of the data due to confounding of school level effects with the pasteurized/raw effect). These three issues are reflected in anomalies in the data, reducing our confidence in the results despite it having been a randomized experiment (rather than something lamer like a survey noting that children who could afford milk were taller). R.A. Fisher as well noted that (Fisher & Barlett 1931) taken at face-value, the experiment produced the opposite result of expected: raw milk being superior to pasteurized milk, despite the raw nutrient value of fat/protein/etc presumably being identical between milks and the greater safety of pasteurization, and that if anything, it implied that children should not be fed pasteurized milk.
Having performed the post-mortem on the Lanarkshire Milk Experiment, Student then proposes improvements to the design, culminating in the most radical (pg405):
(2) If it be agreed that milk is an advantageous addition to children’s diet—and I doubt whether any one will combat that view—and that the difference between raw and pasteurized milk is the matter to be investigated, it would be possible to obtain much greater certainty at an expenditure of perhaps 1–2% of the money [This is a serious consideration: the Lanarkshire experiment cost about £7500. [Which is ~£374,800 in 2016 pounds sterling or ~$657,513.0$541,100.02016.]] and less than 5% of the trouble.
For among 20,000 children there will be numerous pairs of twins; exactly how many it is not easy to say owing to the differential death rate, but, since there is about one pair of twins in 90 births, one might hope to get at least 160 pairs in 20,000 children. But as a matter of fact the 20,000 children were not all the Lanarkshire schools population, and I feel pretty certain that some 200–300 pairs of twins would be available for the purpose of the experiment. Of 200 pairs some 50 would be “identicals” and of course of the same sex, while half the remainder would be non-identical twins of the same sex.
Now identical twins are probably better experimental material than is available for feeding experiments carried out on any other mammals, and the error of the comparison between them may be relied upon to be so small that 50 pairs of these would give more reliable results than the 20,000 with which we have been dealing. The proposal is then to experiment on all pairs of twins of the same sex available, noting whether each pair is so similar that they are probably “identicals” or whether they are dissimilar.
“Feed” one of each pair on raw and the other on pasteurised milk, deciding in each case which is to take raw milk by the toss of a coin. Take weekly measurements and weigh without clothes.
Some way of distinguishing the children from each other is necessary or the mischievous ones will play tricks. The obvious method is to take finger-prints, but as this is identified with crime in some people’s minds, it may be necessary to make a different indelible mark on a fingernail of each, which will grow off after the experiment is over. With such comparatively small numbers further information about the dietetic habits and social position of the children could be collected and would doubtless prove invaluable.
The comparative variation in the effect in “identical” twins and in “unlike” twins should furnish useful information on the relative importance of “Nature and Nurture”. …[The twin experiment] is likely to provide a much more accurate determination of the point at issue, owing to the possibility of balancing both nature and nurture in the material of the experiment.
This is a reasonable suggestion, but I was disappointed to see that Student gives no calculation or reference to another work with a similar calculation so it’s unclear if Student calculated out an exact answer and is not giving the details for reasons of flow or space, or was giving an off-the-cuff guess based on long experience with power analysis from his past statistical research & his brewery job. (If Ziliak was going to cite it as an example, it would’ve been better if he had verified that Student’s speculation was right at least to within an order of magnitude rather than simply quoting him as an authority.)
The calculation doesn’t seem hard. For this example, using height, it’d go something like:
- take the estimated height gain in inches
- find the distribution of height differences for twins and find the distribution of height for the general population of Scottish kids those ages
- convert the inch gain into standard deviations for twins and for general population
- plug those two into an effect size calculator asking for, say, 80% power
- compare how big n1 you need of twin pairs and how many n2 pairs of general population, and report the fraction n1⁄n2 and how close it is to 5%.
To start, on pg403, a table reports “Gain in height in inches by Feeders over Controls”, for which the largest effect in boys is the 5–7yo boys group at +0.083(0.011) inches. So we are targeting an average increase of a tenth of an inch.
What is the height variability or standard deviations for twins and Lanarkshire children of a similar age? The followup paper “The Lanarkshire Milk Experiment”, Elderton 1933 helpfully reports standard deviations both for Lanarkshire children and cites some standard deviation for twins’ heights at that time (pg2):
Dr Stocks in his study of twins [Percy Stocks 1930, assisted by Mary N. Karn: “A Biometric Investigation of Twins and their Brothers and Sisters”, Annals of Eugenics, Vol. v, pp. 46–50. Francis Galton Laboratory for National Eugenics.] found differences in weight as great as 28 hectograms (10 ounces) in those twins he regarded as monozygotic whose ages corresponded to the children in the milk experiment. The standard deviation of weight in pounds is roughly twice that of the standard deviation of height in inches, so that if 8 ounces difference in initial weight be permitted, 1⁄4 inch difference in height could be allowed. Judging also by Dr Stocks’ material in which monozygotic twins showed a modal difference of 1 cm in height it would have been justifiable to allow children to be paired who differed by two-eighths of an inch, but the labour of pairing would have been much heavier if a greater variation than that entered on the cards had been allowed for height as well as for weight…In Table I the standard deviations and coefficients of variation of the initial height and weight for each year of birth are given, and if these be compared with those for Glasgow boys and girls, it will be seen that they are distinctly less. The Glasgow figures were obtained by linear interpolation and are given in brackets after those for the selected Lanarkshire data.
Elderton’s Table 1 reports on the Lanarkshire children’s grouped data by “6 years 9 months”, “7 years 9 months”, and then “8 years 9 months” & higher; the first two presumably map best onto our 5–7yo group, and have values of n = 382 with standard deviation 1.483(2.58), and n = 337 with standard deviation 1.648(2.82); converting the SDs to variance & pooling the variances, we get a standard deviation of √[(1.4832 × (382−1) + 1.6482 × (337−1)) / (382+337−2)] = 1.562 for the general population.
Elderton’s summary of Stocks’s twin research is confusingly worded (partly because Stocks worked in centimeters and Elderton inches), but she appears to be saying that the standard deviation of differences in Stocks’s twins is 0.25 inches, which compared with 1.483 is much smaller and around one-fifth; Table II (pg11) in Stocks 1930 records variability of identical twins vs fraternal vs their non-twin siblings in “mean corrected Deviates”, mentioning that the root-mean-squared difference in the table is the standard deviation, so the σ0 of height for identical twins is 0.9497 while the general population is 6.01cm (pg13), and 0.9497/6.01 comes out to one-fifth, confirming where Elderton got her specific estimate of ~0.25 inches as the standard deviation for identical twins.
So the claimed effect is +0.083 inches, which represents d = 0.05 (for the general population) and d = 0.332 (twin differences).
We then ask how much data is required to conduct a well-powered experiment to detect the existence of such an effect with a standard t-test:
generalD <- 0.083 / 1.56249311 twinD <- 0.083 / 0.25 generalP <- power.t.test(d=generalD, power=0.8); generalP # Two-sample t test power calculation # # n = 5564.07129 # delta = 0.0531202342 # sd = 1 # sig.level = 0.05 # power = 0.8 # alternative = two.sided # # NOTE: n is number in *each* group twinP <- power.t.test(d=twinD, power=0.8); twinP # Two-sample t test power calculation # # n = 143.383601 # delta = 0.332 # sd = 1 # sig.level = 0.05 # power = 0.8 # alternative = two.sided # # NOTE: n is number in *each* group twinP$n / generalP$n #  0.0257695478
So with 143 twin-pairs or n = 286, we can match the power of a sample drawn from the general population with 5564 in each group or n = 11128—savings of ~97% of the sample. (To put that in perspective, if costs scaled exactly per head and twins didn’t entail any extra expenses2, then that estimated cost of ~$657,513.0$541,100.02016 would have instead been $16,990.1$13,982.02016, for a savings of $640,521.7$527,117.02016.) Student’s guess of “1–2%” proves to be on the money, and the experiment is also feasible as 143 twin-pairs is well below the number of twin-pairs that Student estimated to be available (>160, and more likely “200–300”; 300 twin-pairs would yield a power of ~97%, exceeding the Lanarkshire 20,000’s <96% power).
We can safely say that Student’s Scottish milk experiment example does indeed demonstrate the power of twins.
We can go further and estimate the power of twins in general for experimentation. While height is somewhat unusually heritable, we can say with confidence that almost all traits studied are highly heritable based on the previously mentioned mega-meta-analysis “Meta-analysis of the heritability of human traits based on fifty years of twin studies”, Polderman et al 2015 which compiles data on 17,804 traits estimated from ~14.5m pairs. The upshot is that averaging over all those traits, 48% of variance is due to heritability and 18% shared-environment3; implying that compared to a sample drawn from the general population (who share neither genetics nor the shared-environment upbringing), identical twins will have ~34% (1—(0.488+0.174)) of the variability.
For easier comparison with the running example, we can redo the height calculation but assuming we are looking at a generic trait with a higher variability:
twinAll <- 0.083 / (0.338*1.56249311) generalAll <- 0.083 / 1.56249311 generalAll <- power.t.test(d=generalD, power=0.8) twinAll <- power.t.test(d=twinAll, power=0.8) twinAll$n / generalAll$n #  0.114397136
In general, an experiment run using twins will require a sample 11% the size of the same experiment run using the general population.
While it may seem like an esoteric statistical point, this is not a negligible savings, and demonstrates the value of considering both experiment design and behavioral genetics rather than ignoring them. Many twins are available via twin registries, and even if twins are unfeasible, the point carries through to other populations that one could recruit: siblings, or close relatives—if nothing else, one could use the increasing prevalence of genotyping (<$50) in rerandomization procedures to make subject pairs as similar as possible on all covariates (including overall genetic similarity / relatedness or polygenic scores for key traits).
- Bingley et al 2015, “Signaling and Productivity in the Private Financial Returns to Schooling”,
- Briley et al 2018, “Behaviour Genetic Frameworks of Causal Reasoning for Personality Psychology”
- Eney et al 2017, “Cross-sectional association between soda consumption and BMI in a community-based sample of twins”
- Hjelmborg et al 2016, “Lung cancer, genetic predisposition and smoking: the Nordic Twin Study of Cancer”
- Lee 2012, “Correlation and Causation in the Study of Personality”
- Rottensteiner et al 2015, “Physical Activity, Fitness, Glucose Homeostasis, and Brain Morphology in Twins”
- van Dongen et al 2012, “The continuing value of twin studies in the -omics era”
The note that “milk from herd that has been attested free from bovine tuberculosis” is an important detail. Tuberculosis infection from milk was a serious problem in the West in this period; as J.B.S. Haldane remarks in his discussion of “Vitamins” (Possible Worlds And Other Essays 1927): “I would sooner have my child run the risk of rickets or infantile scurvy from over-boiled milk than of tuberculosis from drinking it raw. I refer here to British milk—American is less tuberculous.” Later elaborating in the “The Fight With Tuberculosis” chapter:
Tuberculosis does not stand first on the list of causes of death in England, but it is the most serious, because it kills in infancy and prime of life…one quarter of the deaths of French children in their first two years are due to tubercle…the greatest single channel of infection is milk from tuberculous cows drunk in infancy or early childhood. But the vast majority even of well-to-do parents do not take the trouble to obtain Grade A or Grade A certified milk for their younger children. In many places it is, of course, not available, but it would be if an economic demand for it existed. And with no public opinion behind it in the matter the Government cannot be expected to legislate drastically in favour of pure milk. If science has not discovered a cure or an infallible preventive for tuberculosis, it has at least shown how the mortality could be greatly lowered. For the price of a cigar or a cinema a week you can protect your child against its most dangerous enemy. Is it worth while?
Haldane’s statement about American milk should be taken as a relative statement—countless Americans died of milk, just not as many (eg. bovine tuberculosis killed several of my relatives). But thankfully, whether US or UK, this is no longer a dilemma for parents. (Tuberculosis is developing multi-drug resistance and so cows might again suffer regular infection, but testing and genetic engineering will keep bovine tuberculosis at bay.)↩︎
Because twin registries already exist and regularly recruit twins for studies, it’s possible that twins might cost less to experiment on in a contemporary setting (although I don’t believe twin registries had been set up in Scotland at this point).↩︎
Note how much better identical twins are than siblings; sibling designs are valuable when we are hypothesizing about something in the shared-environment, and can be used for many purposes like showing that GWAS hits are not confounded by population stratification or that the harm from things like maternal smoking is overestimated by analyses ignoring genetic confounding, but since siblings are not all that similar, the gain is not nearly as large as with identical twins (which are far more similar than DZ twins or siblings). For some discussion of this topic from epidemiological perspectives, see “Why are children in the same family so different from one another?” & Smith 2011.↩︎