Genomic selection—the prediction of breeding values using DNA polymorphisms—is a disruptive method that has been widely adopted by animal and plant breeders to increase crop, forest and livestock productivity and ultimately secure food and energy supplies. It improves breeding schemes in different ways, depending on the biology of the species and genotyping and phenotyping constraints. However, both genomic selection and classical phenotypic selection remain difficult to implement because of the high genotyping and phenotyping costs that typically occur when selecting large collections of individuals, particularly in early breeding generations. To specifically address these issues, we propose a new conceptual framework called phenomic selection, which consists of a prediction approach based on low-cost and high-throughput phenotypic descriptors rather than DNA polymorphisms. We applied phenomic selection to two species of economic interest (wheat and poplar) using near-infrared spectroscopy on various tissues. We showed that one could reach accurate predictions in independent environments for developmental and productivity traits and tolerance to disease. We also demonstrated that under realistic scenarios, one could expect much higher genetic gains with phenomic selection than with genomic selection. Our work constitutes a proof of concept and is the first attempt at phenomic selection; it clearly provides new perspectives for the breeding community, as this approach is theoretically applicable to any organism and does not require any genotypic information.
Collecting useful, interpretable, and biologically relevant phenotypes in a resource-efficient manner is a bottleneck to plant breeding, genetic mapping, and genomic prediction. Autonomous and affordable sub-canopy rovers are an efficient and scalable way to generate sensor-based datasets of in-field crop plants. Rovers equipped with light detection and ranging (LiDAR) can produce three-dimensional reconstructions of entire hybrid maize fields. In this study, we collected 2,103 LiDAR scans of hybrid maize field plots and extracted phenotypic data from them by Latent Space Phenotyping (LSP). We performed LSP by two methods, principal component analysis (PCA) and a convolutional autoencoder, to extract meaningful, quantitative Latent Space Phenotypes (LSPs) describing whole-plant architecture and biomass distribution. The LSPs had heritabilities of up to 0.44, similar to some manually measured traits, indicating they can be selected on or genetically mapped. Manually measured traits can be successfully predicted by using LSPs as explanatory variables in partial least squares regression, indicating the LSPs contain biologically relevant information about plant architecture. These techniques can be used to assess crop architecture at a reduced cost and in an automated fashion for breeding, research, or extension purposes, as well as to create or inform crop growth models.
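As a concrete illustration of the PCA route to latent space phenotyping, the sketch below extracts latent phenotypes from a feature matrix and then predicts a manually measured trait from them. All data here are synthetic stand-ins for LiDAR-derived features, and ordinary least squares on the PCA scores (principal component regression) stands in for the paper's partial least squares step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 plots x 500 LiDAR-derived point-cloud features
# (e.g. voxel occupancy counts); real inputs would come from field scans.
n_plots, n_feats = 200, 500
X = rng.normal(size=(n_plots, n_feats))
X += np.outer(rng.normal(size=n_plots), rng.normal(size=n_feats))  # shared structure

def latent_space_phenotypes(X, k=10):
    """PCA via SVD: return the first k latent scores per plot."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # n_plots x k latent phenotypes

Z = latent_space_phenotypes(X, k=10)

# Predict a manually measured trait (simulated here) from the latent
# phenotypes; OLS on the PCA scores replaces the paper's PLS step.
trait = 0.8 * Z[:, 0] + rng.normal(scale=0.5, size=n_plots)
A = np.column_stack([np.ones(n_plots), Z])
beta, *_ = np.linalg.lstsq(A, trait, rcond=None)
pred = A @ beta
r2 = 1 - np.sum((trait - pred) ** 2) / np.sum((trait - trait.mean()) ** 2)
```

The heritability estimates reported in the abstract would additionally require replicated plots or a kinship matrix; this sketch only covers the feature-extraction and trait-prediction steps.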
Association mapping studies have enabled researchers to identify candidate loci for many important environmental resistance factors, including agronomically relevant resistance traits in plants. However, traditional genome-by-environment studies such as these require a phenotyping pipeline which is capable of accurately and consistently measuring stress responses, typically in an automated high-throughput context using image processing. In this work, we present Latent Space Phenotyping (LSP), a novel phenotyping method which is able to automatically detect and quantify response-to-treatment directly from images. Using two synthetically generated image datasets, we first show that LSP is able to successfully recover the simulated QTL in both simple and complex synthetic imagery. We then demonstrate an example application on an interspecific cross of the model C4 grass Setaria. We propose LSP as an alternative to traditional image analysis methods for phenotyping, enabling association mapping studies without the need for engineering complex image processing pipelines.
[Grid-LMM (Runcie & Crawford 2019).] Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present MegaLMM, a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using 3 examples with real plant data, we show that MegaLMM can leverage thousands of traits at once to substantially improve genetic value prediction accuracy.
…Here, we describe MegaLMM (linear mixed models for millions of observations), a novel statistical method and computational algorithm for fitting massive-scale MvLMMs to large-scale phenotypic datasets. Although we focus on plant breeding applications for concreteness, our method can be broadly applied wherever multi-trait linear mixed models are used (e.g., human genetics, industrial experiments, psychology, linguistics, etc.). MegaLMM dramatically improves upon existing methods that fit low-rank MvLMMs, allowing multiple random effects and un-balanced study designs with large amounts of missing data. We achieve both scalability and statistical robustness by combining strong, but biologically motivated, Bayesian priors for statistical regularization—analogous to the p ≫ n approach of genomic prediction methods—with algorithmic innovations recently developed for LMMs. In the 3 examples below, we demonstrate that our algorithm maintains high predictive accuracy for tens of thousands of traits, and dramatically improves the prediction of genetic values over existing methods when applied to data from real breeding programs.
…Together, the set of parallel univariate LMMs and the set of factor loading vectors result in a novel and very general re-parameterization of the MvLMM framework as a mixed-effect factor model. This parameterization leads to dramatic computational performance gains by avoiding all large matrix inversions. It also serves as a scaffold for eliciting Bayesian priors that are intuitive and provide the powerful regularization necessary for robust performance with limited data. Our default prior distributions encourage: (1) shrinkage on the factor-trait correlations (λ_jk) to avoid over-fitting covariances, and (2) shrinkage on the factor sizes to avoid including too many factors. This 2-dimensional regularization helps the model focus only on the strongest, most relevant signals in the data.
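The computational payoff of a low-rank-plus-diagonal trait covariance is generic: with k factors and t traits, the Woodbury identity reduces a t × t solve to a k × k one. The numpy sketch below illustrates that trick in isolation; it is not MegaLMM's implementation, only the standard identity that a factor parameterization like this exploits:

```python
import numpy as np

rng = np.random.default_rng(1)
t, k = 2000, 5                       # many traits, few factors
Lam = rng.normal(size=(t, k))        # factor loadings (the λ_jk)
psi = rng.uniform(0.5, 2.0, size=t)  # trait-specific residual variances

def woodbury_solve(Lam, psi, b):
    """Solve (Lam @ Lam.T + diag(psi)) x = b using only a k x k system,
    O(t k^2) instead of the O(t^3) dense solve."""
    Dinv_b = b / psi
    Dinv_Lam = Lam / psi[:, None]
    small = np.eye(Lam.shape[1]) + Lam.T @ Dinv_Lam  # k x k matrix
    return Dinv_b - Dinv_Lam @ np.linalg.solve(small, Lam.T @ Dinv_b)

b = rng.normal(size=t)
x = woodbury_solve(Lam, psi, b)

# Agreement with the direct dense solve (feasible only at this small t):
Sigma = Lam @ Lam.T + np.diag(psi)
max_err = np.max(np.abs(x - np.linalg.solve(Sigma, b)))
```

At t in the tens of thousands the dense solve becomes infeasible while the factor-form solve stays cheap, which is the "avoiding all large matrix inversions" claim above.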
…Model limitations: While MegaLMM works well across a wide range of applications in breeding programs, our approach does have some limitations. Because MegaLMM is built on the Grid-LMM framework for efficient likelihood calculations [22], it does not scale well to large numbers of observations (in contrast to large numbers of traits), or to large numbers of random effects. As the number of observational units increases, MegaLMM's memory requirements increase quadratically because of the requirement to store sets of pre-calculated inverse matrices. Similarly, for each additional random effect term included in the model, memory requirements increase exponentially. Therefore, we generally limit models to fewer than 10,000 observations [n] and only 1–4 random effect terms per trait. There may be opportunities to reduce this memory burden if some of the random effects are low-rank; these random effects could then be updated on the fly using efficient routines for low-rank Cholesky updates. We also do not currently suggest including regressions directly on markers, and have used marker-based kinship matrices here instead for computational efficiency. Therefore, as a stand-alone prediction method, MegaLMM requires calculations involving the Schur complement of the joint kinship matrix of the testing and training individuals, which can be computationally costly.
MegaLMM is inherently a linear model and cannot effectively model trait relationships that are non-linear. Some non-linear relationships between predictor variables (like genotypes) and traits can be modeled through non-linear kernel matrices, as we demonstrated with the RKHS application to the Bread Wheat data. However, allowing non-linear relationships among traits is currently beyond the capacity of our software and modeling approach. Extending our mixed effect model on the low-dimensional factor space to a non-linear modeling structure like a neural network may be an exciting area for future research. Also, some sets of traits may not have low-rank correlation structures that are well-approximated by a factor model. For example, certain auto-regressive dependence structures are low-rank but cannot efficiently be decomposed into a discrete set of factors.
Nevertheless, we believe that in its current form, MegaLMM will be useful to a wide range of researchers in quantitative genetics and plant breeding.
Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns.
Results: We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias of each of the methods, and to compare it with that of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate the insight that can be obtained with the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number variants in data from breast cancer tumors.
The Supplementary data includes an R-implementation of each of the proposed methods as well as the scripts used in simulations and in the real-data analyses.
Supplementary data are available at Bioinformatics online.
In this paper we develop and test a method which uses high-throughput phenotypes to infer the genotypes of an individual. The inferred genotypes can then be used to perform genomic selection. Previous methods which used high-throughput phenotype data to increase the accuracy of selection assumed that the high-throughput phenotypes correlate with selection targets. When this is not the case, we show that the high-throughput phenotypes can be used to determine which haplotypes an individual inherited from their parents, and thereby infer the individual's genotypes. We tested this method in two simulations. In the first simulation, we explored how the accuracy of the inferred genotypes depended on the high-throughput phenotypes used and the genome of the species analysed. In the second simulation, we explored whether using this method could increase genetic gain in a plant breeding program by enabling genomic selection on non-genotyped individuals. In the first simulation, we found that genotype accuracy was higher if more high-throughput phenotypes were used and if those phenotypes had higher heritability. We also found that genotype accuracy decreased with increasing size of the species' genome. In the second simulation, we found that the inferred genotypes could be used to enable genomic selection on non-genotyped individuals and increase genetic gain compared to random selection, or in some scenarios phenotypic selection. This method presents a novel way of using high-throughput phenotype data in breeding programs. As the quality of high-throughput phenotypes increases and the cost decreases, this method may enable the use of genomic selection on large numbers of non-genotyped individuals.
Following years of epigenome-wide association studies (EWAS), traits analysed to date tend to yield few associations. Reinforcing this observation, we conducted EWAS on 400 traits, of which 16 yielded at least one association at the conventional significance threshold (p < 1×10^−7). To investigate why EWAS yield is low, we formally estimated the proportion of phenotypic variation captured by 421,693 blood-derived DNA methylation markers (h2EWAS) across all 400 traits. The mean h2EWAS was zero, with evidence for regular cigarette smoking exhibiting the largest association with all markers (h2EWAS = 0.42) and the only one surpassing a false discovery rate < 0.1. Though underpowered to determine the h2EWAS value for any one trait, h2EWAS was predictive of the number of EWAS hits across the traits analysed (AUC = 0.7). Modelling the contributions of the methylome on a per-site versus a per-region basis gave varied h2EWAS estimates (r = 0.47) but neither approach obtained substantially higher model fits across all traits. Our analysis indicates that most complex traits do not heavily associate with markers commonly measured in EWAS within blood. However, it is likely DNA methylation does capture variation in some traits, and h2EWAS may be a reasonable way to prioritise traits that are likely to yield associations.
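The h2EWAS quantity above is a variance-component estimate. The sketch below simulates a small methylation dataset and recovers the proportion of phenotypic variance captured by all markers via a one-dimensional likelihood grid search over h2 (the rotation trick that makes Grid-LMM-style scans cheap). The data sizes and the simple maximum-likelihood (rather than REML) criterion are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for an EWAS dataset: n individuals, p standardized
# methylation markers (real arrays have p ~ 450,000).
n, p = 300, 400
M = rng.normal(size=(n, p))
K = M @ M.T / p                        # methylation relationship matrix

# Simulate a phenotype with true h2_EWAS = 0.4.
g = rng.multivariate_normal(np.zeros(n), K)
y = np.sqrt(0.4) * g + np.sqrt(0.6) * rng.normal(size=n)
y = (y - y.mean()) / y.std()

def h2_grid_ml(y, K, grid=np.linspace(0.01, 0.99, 99)):
    """Maximum-likelihood h2 by grid search: one eigendecomposition of K
    makes every grid point cheap to evaluate."""
    vals, vecs = np.linalg.eigh(K)
    yt = vecs.T @ y                    # h2*K + (1-h2)*I is diagonal here
    best_h2, best_ll = None, -np.inf
    for h2 in grid:
        d = h2 * vals + (1 - h2)
        s2 = np.mean(yt ** 2 / d)      # profiled-out overall variance
        ll = -0.5 * (np.sum(np.log(d)) + len(y) * np.log(s2))
        if ll > best_ll:
            best_h2, best_ll = h2, ll
    return best_h2

h2_hat = h2_grid_ml(y, K)              # should land near the simulated 0.4
```

With only 300 individuals the standard error is large (roughly 0.1 here), mirroring the abstract's point that single-trait h2EWAS estimates are underpowered.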
Human gut microbiome composition is shaped by multiple host intrinsic and extrinsic factors, but the relative contribution of host genetic compared to environmental factors remains elusive. Here, we genotyped a cohort of 696 healthy individuals from several distinct ancestral origins and a relatively common environment, and demonstrate that there is no statistically-significant association between microbiome composition and ethnicity, single nucleotide polymorphisms (SNPs), or overall genetic similarity, and that only 5 of 211 (2.4%) previously reported microbiome-SNP associations replicate in our cohort. In contrast, we find similarities in the microbiome composition of genetically unrelated individuals who share a household. We define the term biome-explainability as the variance of a host phenotype explained by the microbiome after accounting for the contribution of human genetics. Consistent with our finding that the microbiome and host genetics are largely independent, we find significant biome-explainability levels of 16–33% for body mass index (BMI), fasting glucose, high-density lipoprotein (HDL) cholesterol, waist circumference, waist-hip ratio (WHR), and lactose consumption. We further show that several human phenotypes can be predicted substantially more accurately when adding microbiome data to host genetics data, and that the contribution of both data sources to prediction accuracy is largely additive. Overall, our results suggest that human microbiome composition is dominated by environmental factors rather than by host genetics.
Neuroimaging has largely focused on 2 goals: mapping associations between neuroanatomical features and phenotypes and building individual-level prediction models. This paper presents a complementary analytic strategy called morphometricity that aims to measure the neuroanatomical signatures of different phenotypes.
Inspired by work on [genetic] heritability, we define morphometricity as the proportion of phenotypic variation that can be explained by brain morphology (e.g., as captured by structural brain MRI). In the dawning era of large-scale datasets comprising traits across a broad phenotypic spectrum, morphometricity will be critical in prioritizing and characterizing behavioral, cognitive, and clinical phenotypes based on their neuroanatomical signatures. Furthermore, the proposed framework will be important in dissecting the functional, morphological, and molecular underpinnings of different traits.
…Complex physiological and behavioral traits, including neurological and psychiatric disorders, often associate with distributed anatomical variation. This paper introduces a global metric, called morphometricity, as a measure of the anatomical signature of different traits. Morphometricity is defined as the proportion of phenotypic variation that can be explained by macroscopic brain morphology.
We estimate morphometricity via a linear mixed-effects model that uses an anatomical similarity matrix computed based on measurements derived from structural brain MRI scans. We examined over 3,800 unique MRI scans from 9 large-scale studies to estimate the morphometricity of a range of phenotypes, including clinical diagnoses such as Alzheimer’s disease, and nonclinical traits such as measures of cognition.
Our results demonstrate that morphometricity can provide novel insights about the neuroanatomical correlates of a diverse set of traits, revealing associations that might not be detectable through traditional statistical techniques.
[Keywords: neuroimaging, brain morphology, statistical association]
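A hedged sketch of the quantity being estimated above: build an anatomical similarity matrix from standardized vertex-wise measures (by analogy with a genetic relationship matrix), then estimate the proportion of phenotypic variance it explains. The paper fits a linear mixed-effects model; the Haseman-Elston-style moment estimator below, run on synthetic data, is a simpler stand-in for that REML fit:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in: n subjects, v vertex-wise thickness measures.
n, v = 500, 500
W = rng.normal(size=(n, v))
Z = (W - W.mean(axis=0)) / W.std(axis=0)
K = Z @ Z.T / v        # anatomical similarity matrix (GRM analogue)

# Simulate a phenotype with true morphometricity 0.5.
g = rng.multivariate_normal(np.zeros(n), K)
y = np.sqrt(0.5) * g + np.sqrt(0.5) * rng.normal(size=n)
y = (y - y.mean()) / y.std()

# Haseman-Elston-style regression: for a standardized phenotype, the
# slope of y_i*y_j on K_ij over all subject pairs estimates the
# proportion of variance explained by the similarity matrix.
iu = np.triu_indices(n, k=1)
yy = np.outer(y, y)[iu]
kk = K[iu]
m2_hat = np.polyfit(kk, yy, 1)[0]   # moment estimate of morphometricity
```

The moment estimator is noisier than REML but makes the definition concrete: morphometricity is the regression of phenotypic resemblance on anatomical resemblance.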
2018-bessadok.pdf: “Intact Connectional Morphometricity Learning Using Multi-view Morphological Brain Networks with Application to Autism Spectrum Disorder”, Alaa Bessadok, Islem Rekik
The recent availability of large-scale neuroimaging cohorts (here the UK Biobank [UKB] and the Human Connectome Project [HCP]) facilitates deeper characterisation of the relationship between phenotypic and brain architecture variation in humans. We tested the association between 654,386 vertex-wise measures of cortical and subcortical morphology (from T1w and T2w MRI images) and behavioural, cognitive, psychiatric and lifestyle data. We found a statistically-significant association of grey-matter structure with 58 out of 167 UKB phenotypes spanning substance use, blood assay results, education or income level, diet, depression, being a twin as well as cognition domains (UKB discovery sample: n = 9,888). Twenty-three of the 58 associations replicated (UKB replication sample: n = 4,561; HCP, n = 1,110). In addition, differences in body size (height, weight, BMI, waist and hip circumference, body fat percentage) could account for a substantial proportion of the association, providing possible insight into previous MRI case-control studies for psychiatric disorders where case status is associated with body mass index. Using the same linear mixed model, we showed that most of the associated characteristics (e.g. age, sex, body size, diabetes, being a twin, maternal smoking) could be significantly predicted using all the brain measurements in out-of-sample prediction. Finally, we demonstrated other applications of our approach, including a Region Of Interest (ROI) analysis that retains the vertex-wise complexity, and a ranking of the information contained across MRI processing options.
Highlights: Our linear mixed model approach unifies association and prediction analyses for highly dimensional vertex-wise MRI data
Grey-matter structure is associated with measures of substance use, blood assay results, education or income level, diet, depression, being a twin as well as cognition domains
Body size (height, weight, BMI, waist and hip circumference) is an important source of covariation between the phenome and grey-matter structure
Grey-matter scores quantify grey-matter-based risk for the associated traits and allow the study of phenotypes that were not collected
The most general cortical processing (“fsaverage” mesh with no smoothing) maximises the brain-morphometricity for all UKB phenotypes
The recent availability of large-scale neuroimaging cohorts facilitates deeper characterisation of the relationship between phenotypic and brain architecture variation in humans. Here, we investigate the association (previously coined morphometricity) of a phenotype with all 652,283 vertex-wise measures of cortical and subcortical morphology in a large data set from the UK Biobank (UKB; n = 9,497 for discovery, n = 4,323 for replication) and the Human Connectome Project (n = 1,110).
We used a linear mixed model with the brain measures of individuals fitted as random effects with covariance relationships estimated from the imaging data. We tested 167 behavioural, cognitive, psychiatric or lifestyle phenotypes and found statistically-significant morphometricity for 58 phenotypes (spanning substance use, blood assay results, education or income level, diet, depression, and cognition domains), 23 of which replicated in the UKB replication set or the HCP. We then extended the model for a bivariate analysis to estimate grey-matter correlation between phenotypes, which revealed that body size (i.e., height, weight, BMI, waist and hip circumference, body fat percentage) could account for a substantial proportion of the morphometricity (confirmed using a conditional analysis), providing possible insight into previous MRI results for psychiatric disorders where case status is associated with body mass index. Our LMM framework also allowed us to predict some of the associated phenotypes from the vertex-wise measures, in two independent samples. Finally, we demonstrated additional new applications of our approach: (a) a region of interest (ROI) analysis that retains the vertex-wise complexity; (b) a comparison of the information retained by different MRI processing options.
2019-he.pdf: “Predicting human inhibitory control from brain structural MRI”, Ningning He, Edmund T. Rolls, Wei Zhao, Shuixia Guo
“Ensemble Learning of Convolutional Neural Network, Support Vector Machine, and Best Linear Unbiased Predictor for Brain Age Prediction: ARAMIS Contribution to the Predictive Analytics Competition 2019 Challenge”, (2020-12-15):
We ranked third in the Predictive Analytics Competition (PAC) 2019 challenge by achieving a mean absolute error (MAE) of 3.33 years in predicting age from T1-weighted MRI brain images. Our approach combined seven algorithms that allow generating predictions when the number of features exceeds the number of observations: in particular, two versions of the best linear unbiased predictor (BLUP), a support vector machine (SVM), two shallow convolutional neural networks (CNNs), and the well-known ResNet and Inception V1 architectures. Ensemble learning was derived from estimating weights via linear regression in a hold-out subset of the training sample. We further evaluated and identified factors that could influence prediction accuracy: choice of algorithm, ensemble learning, and the features used as input/MRI image processing. Our prediction error was correlated with age, and absolute error was greater for older participants, suggesting that the training sample should be enlarged for this subgroup. Our results may be used to guide researchers building age predictors on healthy individuals, which can be used in research and in the clinic as non-specific predictors of disease status.
[Keywords: brain age, MRI, machine learning, deep learning, statistical learning, ensemble learning]
…Morphometricity of Age as Upper Bound of Prediction Accuracy: From BLUP models, we estimated the total association between age and the brain features. Morphometricity is expressed as a proportion of the variance (R²) of age; thus, it quantifies how much of the differences in age in the sample may be attributed to/associated with variation in brain structure. With surface-based processing (~650,000 vertices), we estimated the morphometricity to be R² = 0.99 (SE = 0.052), while for volume-based processing (~480,000 voxels), it reached R² = 0.97 (SE = 0.015).
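The ensemble-weighting step described above (linear regression in a hold-out subset of the training sample) can be sketched in a few lines. The base-model predictions below are simulated stand-ins, not the competition models:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated stand-ins: true ages and predictions from 3 base models on a
# hold-out subset of the training sample (the paper combined 7 algorithms).
n_holdout = 150
age = rng.uniform(18, 90, size=n_holdout)
preds = np.column_stack([
    age + rng.normal(scale=6, size=n_holdout),  # e.g. a BLUP model
    age + rng.normal(scale=5, size=n_holdout),  # e.g. an SVM
    age + rng.normal(scale=4, size=n_holdout),  # e.g. a CNN
])

# Ensemble weights: linear regression of true age on the base-model
# predictions (with intercept), estimated on the hold-out set.
A = np.column_stack([np.ones(n_holdout), preds])
w, *_ = np.linalg.lstsq(A, age, rcond=None)

def ensemble(preds_new, w):
    """Apply the stacked weights to new base-model predictions."""
    return w[0] + preds_new @ w[1:]

mae_best = np.abs(preds[:, 2] - age).mean()        # best single model
mae_ens = np.abs(ensemble(preds, w) - age).mean()  # stacked ensemble
```

When base-model errors are roughly independent, the stacked combination has lower error than any single model; in practice the weights should be applied to a sample disjoint from the one used to estimate them.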
“A parsimonious model for mass-univariate vertex-wise analysis”, (2021-01-22):
Covariance between grey-matter measurements can reflect structural or functional brain networks, though it has also been shown to be influenced by confounding factors (e.g. age, head size, scanner), which could lead to lower mapping precision (increased size of associated clusters) and create distal false positive associations in mass-univariate vertex-wise analyses.
We evaluated this concern by performing state-of-the-art mass-univariate analyses (general linear model, GLM) on traits simulated from real vertex-wise grey matter data (including cortical and subcortical thickness and surface area). We contrasted the results with those from linear mixed models (LMMs), which have been shown to overcome similar issues in omics association studies.
We showed that when performed on a large sample (n = 8,662, UK Biobank), GLMs yielded large spatial clusters of statistically-significant vertices and greatly inflated false positive rate (Family Wise Error Rate: FWER = 1, cluster false discovery rate: FDR>0.6). We showed that LMMs resulted in more parsimonious results: smaller clusters and reduced false positive rate (yet FWER>5% after Bonferroni correction) but at a cost of increased computation. In practice, the parsimony of LMMs results from controlling for the joint effect of all vertices, which prevents local and distal redundant associations from reaching statistical-significance.
Next, we performed mass-univariate association analyses on 5 real UKB traits (age, sex, BMI, fluid intelligence and smoking status), and the LMM yielded fewer and more localised associations. We identified 19 statistically-significant clusters displaying small associations with age, sex and BMI, which suggests a complex architecture of at least dozens of areas associated with those phenotypes.
2021-gotz.pdf: “Small Effects: The Indispensable Foundation for a Cumulative Psychological Science”, (2021-07-02):
We draw on genetics research to argue that complex psychological phenomena are most likely determined by a multitude of causes and that any individual cause is likely to have only a small effect.
Building on this, we highlight the dangers of a publication culture that continues to demand large effects. First, it rewards inflated effects that are unlikely to be real and encourages practices likely to yield such effects. Second, it overlooks the small effects that are most likely to be real, hindering attempts to identify and understand the actual determinants of complex psychological phenomena.
We then explain the theoretical and practical relevance of small effects, which can have substantial consequences, especially when considered at scale and over time. Finally, we suggest ways in which scholars can harness these insights to advance research and practices in psychology (i.e., leveraging the power of big data, machine learning, and crowdsourcing science; promoting rigorous preregistration, including prespecifying the smallest effect size of interest; contextualizing effects; changing cultural norms to reward accurate and meaningful effects rather than exaggerated and unreliable effects).
Only once small effects are accepted as the norm, rather than the exception, can a reliable and reproducible cumulative psychological science be built.
[See variance-components for one route forward in quantifying small effects given the daunting statistical power challenges. Götz et al appear locked into the conventional framework of directly estimating effects, when what they really need to borrow from genetics is heritability… You can’t afford to gather n in the millions when you aren’t even sure your haystack contains a needle!]
2012-herculanohouzel.pdf: “The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost”, (2012-06-19):
[Herculano-Houzel 2009] Neuroscientists have become used to a number of “facts” about the human brain: It has 100 billion neurons and 10- to 50-fold more glial cells; it is larger than expected for its body size among primates and mammals in general, and therefore the most cognitively able; it consumes an outstanding 20% of the total body energy budget despite representing only 2% of body mass, because of an increased metabolic need of its neurons; and it is endowed with an overdeveloped cerebral cortex, the largest compared with brain size.
These facts led to the widespread notion that the human brain is literally extraordinary: an outlier among mammalian brains, defying evolutionary rules that apply to other species, with a uniqueness seemingly necessary to justify the superior cognitive abilities of humans over mammals with even larger brains. These facts, with deep implications for neurophysiology and evolutionary biology, are not grounded on solid evidence or sound assumptions, however.
Our recent development of a method that allows rapid and reliable quantification of the numbers of cells that compose the whole brain has provided a means to verify these facts. Here, I review this recent evidence and argue that, with 86 billion neurons and just as many nonneuronal cells, the human brain is a scaled-up primate brain in its cellular composition and metabolic cost, with a relatively enlarged cerebral cortex that does not have a relatively larger number of brain neurons, yet is remarkable in its cognitive abilities and metabolism simply because of its extremely large number of neurons.