We [Ultima Genomics] introduce a massively parallel novel sequencing platform that combines an open flow cell design on a circular wafer with a large surface area and mostly natural nucleotides that allow optical end-point detection without reversible terminators.
This platform enables sequencing billions of reads with longer read length (~300bp) and fast run times (<20hrs) with high base accuracy (Q30 > 85%), at a low cost of $1/Gb. We establish system performance by whole-genome sequencing of the Genome-In-A-Bottle reference samples HG001–7, demonstrating high accuracy for SNPs (99.6%) and Indels in homopolymers up to length 10 (96.4%) across the vast majority (>98%) of the defined high-confidence regions of these samples.
We demonstrate scalability of the whole-genome sequencing workflow by sequencing an additional 224 selected samples from the 1000 Genomes project, achieving high concordance with reference data.
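To put the quoted $1/Gb price in per-genome terms, the arithmetic can be sketched as below. The price and read length come from the abstract; the ~3.1 Gb genome size and 30× coverage are assumptions made here for illustration, not figures from the paper.

```python
# Back-of-envelope sequencing cost at a flat per-gigabase price.

COST_PER_GB_USD = 1.0   # from the abstract: $1/Gb
READ_LENGTH_BP = 300    # from the abstract: ~300bp reads
GENOME_SIZE_GB = 3.1    # assumption: approximate haploid human genome size
COVERAGE = 30           # assumption: a common whole-genome coverage target


def genome_cost(genome_gb, coverage, cost_per_gb):
    """Return (gigabases of raw sequence required, cost in USD)."""
    gigabases = genome_gb * coverage
    return gigabases, gigabases * cost_per_gb


def reads_needed(gigabases, read_length_bp):
    """Number of reads of the given length needed to yield that many bases."""
    return gigabases * 1e9 / read_length_bp


gigabases, cost = genome_cost(GENOME_SIZE_GB, COVERAGE, COST_PER_GB_USD)
n_reads = reads_needed(gigabases, READ_LENGTH_BP)
print(f"{gigabases:.0f} Gb, ~${cost:.0f}, ~{n_reads / 1e6:.0f} million reads")
```

Under these assumptions one genome needs roughly 93 Gb of sequence, so "billions of reads" per run would correspond to several such genomes.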
Hundreds of loci in human genomes have alleles that are methylated differentially according to their parent of origin. These imprinted loci generally show little variation across tissues, individuals, and populations.
We show that such loci can be used to distinguish the maternal and paternal homologs for all autosomes, without the need for the parental DNA. We integrate methylation-detecting nanopore sequencing with the long-range phase information in Strand-seq data to determine the parent of origin of chromosome-length haplotypes for both DNA sequence and DNA methylation in five trios with diverse genetic backgrounds.
The parent of origin was correctly inferred for all autosomes with an average mismatch error rate of 0.31% for SNVs and 1.89% for indels.
Because our method can determine whether an inherited disease allele originated from the mother or the father, we predict that it will improve the diagnosis and management of many genetic diseases.
Epigenetic estimators of age (known as “clocks”) allow one to identify interventions that slow or reverse aging. Previous epigenetic clocks only applied to one species at a time.
Here, we describe epigenetic clocks that apply to both dogs and humans.
These clocks, which measure methylation levels in highly conserved stretches of the DNA, promise to increase the likelihood that interventions that reverse epigenetic age in one species will have the same effect in the other.
DNA methylation profiles have been used to develop biomarkers of aging known as epigenetic clocks, which predict chronological age with remarkable accuracy and show promise for inferring health status as an indicator of biological age. Epigenetic clocks were first built to monitor human aging, but their underlying principles appear to be evolutionarily conserved, as they have now been successfully developed for many mammalian species.
Here, we describe reliable and highly accurate epigenetic clocks shown to apply to 93 domestic dog breeds. The methylation profiles were generated using the mammalian methylation array, which utilizes DNA sequences that are conserved across all mammalian species. Canine epigenetic clocks were constructed to estimate age and also average time to death.
We also present 2 highly accurate human-dog dual species epigenetic clocks (r = 0.97), which may facilitate the ready translation from canine to human use (or vice versa) of antiaging treatments being developed for longevity and preventive medicine. Finally, epigenome-wide association studies here reveal individual methylation sites that may underlie the inverse relationship between breed weight and lifespan.
Overall, we describe robust biomarkers to measure aging and, potentially, health status in canines.
The increase in accuracy obtained for traits under genomic selection (GS) by using genomic estimated breeding values instead of classical pedigree-based estimates is substantial in aquaculture species, ranging from 15% to 89% for growth traits and from 0% to 567% for disease resistance.
Although implementing GS in aquaculture requires little additional investment in breeding programs that already use sib testing on pedigree, its deployment remains sparse; it could be boosted by adaptation of cost-effective imputation from low-density panels. Moreover, GS could help to anticipate the effect of climate change by improving sustainability-related traits such as production yield (eg. carcass or fillet yields), feed efficiency or disease resistance, and by improving resistance to environmental variation (tolerance to temperature or salinity variation).
This chapter synthesizes the literature on applications of GS in finfish, crustacean, and mollusk aquaculture in present and future breeding programs.
Vertebrate eDNA carried in the air can be used to identify terrestrial animals
Environmental DNA can be detected in air several hundred meters from the source
Airborne environmental DNA can be detected from consumed prey following predation
The crisis of declining biodiversity exceeds our current ability to monitor changes in ecosystems. Rapid terrestrial biomonitoring approaches are essential to quantify the causes and consequences of global change. Environmental DNA has revolutionized aquatic ecology, permitting population monitoring and remote diversity assessments matching or outperforming conventional methods of community sampling. Despite this model, similar methods have not been widely adopted in terrestrial ecosystems.
Here, we demonstrate that DNA from terrestrial animals can be filtered, amplified, and then sequenced from air samples collected in natural settings, representing a powerful tool for terrestrial ecology. We collected air samples at a zoological park, where spatially confined non-native species allowed us to track DNA sources. We show that DNA can be collected from air and used to identify species and their ecological interactions.
Air samples contained DNA from 25 species of mammals and birds, including 17 known terrestrial resident zoo species. We also identified food items from air sampled in enclosures and detected taxa native to the local area, including the Eurasian hedgehog, endangered in the United Kingdom. Our data demonstrate that airborne eDNA concentrates around recently inhabited areas but disperses away from sources, suggesting an ecology to airborne eDNA and the potential for sampling at a distance.
Our findings demonstrate the profound potential of air as a source of DNA for global terrestrial biomonitoring.
[Keywords: conservation biology, community ecology, environmental DNA, eDNA, species interactions, wildlife management, terrestrial ecology]
[Using long read whole genome sequencing, we have broken the record for making the fastest genetic diagnosis—multiple times. Our fastest: 7hrs18min.
The new method published today has the potential to revolutionize diagnosing critically ill patients.
Our team…aimed to make fast/accurate genetic diagnoses using nanopore WGS: we optimized sample prep and loading of 48 Oxford Nanopore PromethION flow cells; created a pipeline to transfer data to the cloud, base call, and align in real time; optimized PEPPER-Margin-DeepVariant to quickly call variants; and the rest of the curation team customized a variant filtration schema that was not only fast, but reduced the list of variants for manual curation substantially, while still maintaining sensitivity…In some cases our average sequencing rate exceeded 1.8 Gb/min (a 1× genome in 1min45sec), an unprecedented speed! One case was sequenced so fast that we set a Guinness World Record for the fastest DNA sequencing technique.
We then recruited 12 critically ill patients and sequenced their genomes to ~50×. The patients ranged in age from 3 months to 57 years and had clinical presentations including neurological/seizure disorders, sudden cardiac arrests, and severe heart failure. In 5 cases we identified genetic variants (SNPs and INDELs) in genes such as RYR2, TNNT2, PCDH19, and CSNK2B that explained the patient’s clinical signs. These findings led to definitive genetic diagnoses.
As a result, these patients received precision care weeks earlier than they would have with standard genetic testing. Treatments included surgical interventions, a heart transplant, changes to their medicines, and family screening.]
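The quoted throughput can be sanity-checked with a short calculation. This is a rough sketch: the 1.8 Gb/min rate and the ~50× target come from the thread, while the ~3.1 Gb genome size is an assumption made here. The 50× figure is only what a sustained peak rate would imply; the thread does not claim the rate was held for a whole run.

```python
GENOME_SIZE_GB = 3.1  # assumption: approximate haploid human genome size


def minutes_for_coverage(rate_gb_per_min, coverage, genome_gb=GENOME_SIZE_GB):
    """Minutes needed to sequence `coverage` genome-equivalents at a fixed rate."""
    return coverage * genome_gb / rate_gb_per_min


def format_min_sec(minutes):
    """Render a duration in minutes as e.g. '1min43sec'."""
    total_seconds = round(minutes * 60)
    return f"{total_seconds // 60}min{total_seconds % 60:02d}sec"


one_x = minutes_for_coverage(1.8, 1)      # one genome-equivalent
fifty_x = minutes_for_coverage(1.8, 50)   # the ~50x depth used for patients
print(format_min_sec(one_x), f"{fifty_x:.0f} min for 50x")
```

With a 3.1 Gb genome, 1.8 Gb/min gives about 1min43sec per 1× genome, close to the quoted 1min45sec.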
We describe the analysis of whole genome sequencing (WGS) of 150,119 individuals from the UK Biobank (UKB). This yielded a set of high quality variants, including 585,040,410 SNPs, representing 7.0% of all possible human SNPs, and 58,707,036 indels. The large set of variants allows us to characterize selection based on sequence variation within a population through a Depletion Rank (DR) score for windows along the genome. DR analysis shows that coding exons represent a small fraction of regions in the genome subject to strong sequence conservation. We define three cohorts within the UKB, a large British Irish cohort (XBI) and smaller African (XAF) and South Asian (XSA) cohorts. A haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals. We identified 895,055 structural variants and 2,536,688 microsatellites, groups of variants typically excluded from large scale WGS studies. Using this formidable new resource, we provide several noteworthy examples of trait associations with rare variants with large effects not found previously through studies based on exome sequencing and/or imputation.
Methods: We conducted a pilot study involving 4,660 participants from 2,183 families, among whom 161 disorders covering a broad spectrum of rare diseases were present. We collected data on clinical features with the use of Human Phenotype Ontology terms, undertook genome sequencing, applied automated variant prioritization on the basis of applied virtual gene panels and phenotypes, and identified novel pathogenic variants through research analysis.
Results: Diagnostic yields varied among family structures and were highest in family trios (both parents and a proband) and families with larger pedigrees. Diagnostic yields were much higher for disorders likely to have a monogenic cause (35%) than for disorders likely to have a complex cause (11%). Diagnostic yields for intellectual disability, hearing disorders, and vision disorders ranged from 40 to 55%. We made genetic diagnoses in 25% of the probands. A total of 14% of the diagnoses were made by means of the combination of research and automated approaches, which was critical for cases in which we found etiologic noncoding, structural, and mitochondrial genome variants and coding variants poorly covered by exome sequencing. Cohort-wide burden testing across 57,000 genomes enabled the discovery of 3 new disease genes and 19 new associations. Of the genetic diagnoses that we made, 25% had immediate ramifications for clinical decision making for the patients or their relatives.
Conclusions: Our pilot study of genome sequencing in a national health care system showed an increase in diagnostic yield across a range of rare diseases.
…However, South Asian ancestry was statistically-significantly more common among pediatric probands than among adult probands (16% vs. 4%, p < 0.001); our results indicated potential consanguinity in 43% of the 93 pediatric South Asian probands and in 1% of the other 478 pediatric probands (Table 1).
…Health Care Outcomes after Diagnosis: The findings from our approach ended long diagnostic odysseys for some participants and their families (the median duration of such an odyssey was 75 months, and the median number of hospital visits was 68) (Table S1), and we speculate that they will mitigate NHS resource costs (the combined cost for 183,273 episodes of hospital care among the affected participants was £87 million [$122 million]) (Table S3). In addition, 134 of the 533 genetic diagnoses (25%) were reported by clinicians to be of immediate clinical actionability—only 11 (0.2%) were described as having no benefit. As of now, the remainder of the diagnoses are of unknown usefulness. The benefits in terms of health care included 4 diagnoses that led to a suggested change in medication, 26 that led to suggested additional surveillance of the proband or relatives, 13 that allowed for clinical trial eligibility, 59 that informed future reproductive choices, and 32 that had other benefits (Table S9).
In several specific probands, diagnoses have had important clinical actionability. In a 36-year-old man with suspected choroideremia, we detected a novel CHM promoter variant causing loss of gene expression [27], a diagnosis that enabled eligibility for a gene-replacement trial. A male neonate proband presented with severe infection and transient neurologic symptoms immediately after birth and died at 4 months of age with no diagnosis but with health care costs of ~£80,000 ($112,000) (Table S10). A diagnosis of transcobalamin II deficiency due to a homozygous frameshift in TCN2 was made from this study, which enabled predictive testing to be offered to the younger brother within 1 week after birth. The younger child, who received a positive result, received weekly hydroxocobalamin injections to prevent metabolic decompensation.
A 10-year-old girl was admitted to the intensive care unit with life-threatening chicken pox. She had undergone a diagnostic odyssey over a period of 7 years at a total cost of £356,571 ($499,199) across 307 secondary care episodes (Table S11). We were able to diagnose CTPS1 deficiency due to a homozygous, known pathogenic splice acceptor variant. A diagnosis enabled a curative bone marrow transplantation (cost of £70,000 [$98,000]), and predictive testing in her siblings showed no additional family members to be at risk.
One proband had waited until his 6th decade of life for a genomic diagnosis of an INF2 mutation causing focal segmental glomerulosclerosis. His father, brother, and uncle had all died from kidney failure. He had received 2 kidney transplants, had transmitted the condition to his daughter, and was concerned about whether his 15-year-old granddaughter, who was under surveillance, was at risk. After he received his genetic diagnosis, the granddaughter was tested, found to be negative, and discharged from regular medical surveillance.
…Discussion: Our findings show a substantial increase in yield of genomic diagnoses made in patients with the use of genome sequencing across a broad spectrum of rare disease. The enhanced diagnostic benefit was observed regardless of whether participants had undergone previous genetic testing (diagnostic yields were 31% among those who had undergone testing and 33% among those who had not). In 25% of those who received a genetic diagnosis, there was immediate clinical actionability…The findings from our pilot study support the case for genome sequencing in the diagnosis of certain specific rare diseases in the new NHS National Genomic Test Directory [37]. In patients with specific disorders, such as intellectual disability, genome sequencing is now the first-line test in the NHS (Table S12). With a new National Genomic Medicine Service, the NHS in England is in the process of sequencing 500,000 whole genomes in rare disease and cancer in health care. We hope that our findings will assist other health systems in considering the role of genome sequencing in the care of patients with rare diseases.
Over 100 million research participants around the world have had research array-based genotyping (GT) or genome sequencing (GS), but only a small fraction of these have been offered return of actionable genomic findings (gRoR).
Between 2017 and 2021, we analyzed genomic results from 36,417 participants in the Mass General Brigham Biobank and offered to confirm and return pathogenic and likely pathogenic variants (PLPVs) in 59 genes.
Variant verification prior to participant recontact revealed that GT falsely identified PLPVs in 44.9% of samples, and GT failed to identify 72.0% of PLPVs detected in a subset of samples that were also sequenced. GT and GS detected verified PLPVs in 1% and 2.5% of the cohort, respectively. Of 256 participants who were alerted that they carried actionable PLPVs, 37.5% actively or passively declined further disclosure. 76.3% of those carrying PLPVs were unaware that they were carrying the variant, and over half of those met published professional criteria for genetic testing but had never been tested.
This gRoR protocol cost ~$129,000 USD per year in laboratory testing and research staff support, representing $14 per participant whose DNA was analyzed or $3,224 per participant in whom a PLPV was confirmed and disclosed.
These data provide logistical details around gRoR that could help other investigators planning to return genomic results.
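The two per-participant costs follow from the annual figure only once a program length and a disclosure count are fixed. The sketch below reconstructs them under two assumptions that the abstract does not state: that 2017 to 2021 counts as four program years, and that "confirmed and disclosed" means the alerted participants who did not decline.

```python
ANNUAL_COST_USD = 129_000       # from the abstract (approximate)
PARTICIPANTS_ANALYZED = 36_417  # from the abstract
PARTICIPANTS_ALERTED = 256      # from the abstract
DECLINED_FRACTION = 0.375       # from the abstract: 37.5% declined disclosure
PROGRAM_YEARS = 4               # assumption: 2017-2021 treated as four years

total_cost = ANNUAL_COST_USD * PROGRAM_YEARS

# Assumption: everyone alerted who did not decline received a disclosure.
disclosed = round(PARTICIPANTS_ALERTED * (1 - DECLINED_FRACTION))

cost_per_analyzed = total_cost / PARTICIPANTS_ANALYZED
cost_per_disclosed = total_cost / disclosed

print(f"total ${total_cost:,}; {disclosed} disclosures")
print(f"${cost_per_analyzed:.2f} per analyzed participant")
print(f"${cost_per_disclosed:.0f} per disclosed PLPV")
```

Under these assumptions the results land within rounding of the reported $14 and $3,224, which suggests the assumptions are roughly how the figures were derived.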
Identical genetic variations can have different phenotypic effects depending on their parent of origin (PofO). Yet, studies focusing on PofO effects have been largely limited in terms of sample size due to the need for parental genomes or known genealogies.
Here, we used a novel probabilistic approach to infer PofO of individual alleles in the UK Biobank that requires neither parental genomes nor prior knowledge of genealogy. Our model uses Identity-By-Descent (IBD) sharing with second-degree and third-degree relatives to assign alleles to parental groups and leverages chromosome X data in males to distinguish maternal from paternal groups.
When combined with robust haplotype inference and haploid imputation, this allowed us to infer the PofO at 5.4 million variants genome-wide for 26,393 UK Biobank individuals. We used this large dataset to systematically screen 59 biomarkers and 38 anthropometric phenotypes for PofO effects and discovered 101 statistically-significant associations, demonstrating that this type of effect contributes to the genetics of complex traits.
Notably, we retrieved well known PofO effects, such as the MEG3/DLK1 locus on platelet count, and we discovered many new ones at loci often unsuspected of being imprinted and, in some cases, previously thought to harbour additive associations.
Epigenetic “clocks” based on DNA methylation (DNAme) are the most robust and widely employed aging biomarker. They have been built for numerous species and reflect gold-standard interventions that extend lifespan. However, conventional methods for measuring epigenetic clocks are expensive and low-throughput. Here, we describe Tagmentation-based Indexing for Methylation Sequencing (TIME-Seq) for ultra-cheap and scalable targeted methylation sequencing of epigenetic clocks and other DNAme biomarkers. Using TIME-Seq, we built and validated inexpensive epigenetic clocks based on genomic and ribosomal DNAme in hundreds of mouse and human samples. We also discover that it is possible to accurately predict age from extremely low-cost shallow sequencing (eg. 10,000 reads) of TIME-Seq libraries using scAge, a probabilistic age-prediction algorithm originally applied to single cells. Together, these methods reduce the cost of DNAme biomarker analysis by more than two orders of magnitude, thereby expanding and democratizing their use in aging research, clinical trials, and disease diagnosis.
During the Anthropocene, Earth has experienced unprecedented habitat loss, native species decline, and global climate change. Concurrently, greater globalisation is facilitating species movement, increasing the likelihood of alien species establishment and propagation. There is a great need to understand what influences a species’ ability to persist or perish within a new or changing environment. Examining genes that may be associated with a species’ invasion success or persistence informs invasive species management, assists with native species preservation, and sheds light on important evolutionary mechanisms that occur in novel environments. This approach can be aided by coupling spatial and temporal investigations of evolutionary processes.
Here we use the common starling, Sturnus vulgaris, to identify parallel and divergent evolutionary change between contemporary native and invasive range samples and their common ancestral population. To do this, we use reduced-representation sequencing of native samples collected recently in north-western Europe and invasive samples from Australia, together with museum specimens sampled in the UK during the mid-19th Century. We found evidence of parallel selection on both continents, possibly resulting from common global selective forces such as exposure to pollutants (eg. TCDD) and food carbohydrate content. We also identified divergent selection in these populations, which might be related to adaptive changes in response to the novel environment encountered in the introduced Australian range. Interestingly, signatures of selection are equally common within both invasive and native range contemporary samples. Our results demonstrate the value of including historical samples in genetic studies of invasion and highlight the ongoing and occasionally parallel role of adaptation in both native and invasive ranges.
The spread of pandemic viruses and invasive species can be catastrophic for human societies and natural ecosystems. SARS-CoV-2 demonstrated that the speed of our response is critical, as each day of delay permitted exponential growth and dispersion of the virus. Here we propose a global Nucleic Acid Observatory (NAO) to monitor the relative frequency of everything biological through comprehensive metagenomic sequencing of waterways and wastewater. By searching for divergences from historical baseline frequencies at sites throughout the world, NAO could detect any virus or invasive organism undergoing exponential growth whose nucleic acids end up in the water, even those previously unknown to science. Continuously monitoring nucleic acid diversity would provide us with universal early warning, obviate subtle bioweapons, and generate a wealth of sequence data sufficient to transform ecology, microbiology, and conservation.
We call for the immediate construction of a global NAO to defend and illuminate planetary health.
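The detection idea in the proposal, flagging sequences whose relative frequency departs from baseline in an exponential way, can be illustrated with a toy log-linear fit. This is only a sketch: the proposal does not specify an algorithm, and the function names, thresholds, and example frequencies below are all hypothetical.

```python
import math


def log_linear_fit(frequencies):
    """Least-squares fit of log(frequency) against sample index.

    Returns (slope per sampling interval, R^2 of the fit).
    """
    ys = [math.log(f) for f in frequencies]
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = 1.0 if syy == 0 else (sxy ** 2) / (sxx * syy)
    return slope, r_squared


def is_growing_exponentially(frequencies, min_slope=0.1, min_r_squared=0.9):
    """Flag a series whose log-frequency rises steadily (hypothetical thresholds)."""
    if len(frequencies) < 3 or min(frequencies) <= 0:
        return False
    slope, r_squared = log_linear_fit(frequencies)
    return slope >= min_slope and r_squared >= min_r_squared


# Hypothetical relative frequencies of two sequences across successive samples.
stable = [1.0e-6, 1.1e-6, 0.9e-6, 1.0e-6, 1.05e-6, 0.95e-6]
emerging = [1e-7 * 1.3 ** t for t in range(8)]  # 30% growth per interval

print(is_growing_exponentially(stable), is_growing_exponentially(emerging))
```

A real system would also have to handle zero counts, sequencing noise, and seasonal baselines, none of which this sketch attempts.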
Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 Mbp of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome to clinical and functional study. Here we demonstrate how the new reference universally improves read mapping and variant calling for 3,202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of novel variants per sample—a new frontier for evolutionary and biomedical discovery. Simultaneously, the new reference eliminates tens of thousands of spurious variants per sample, including up to 12× reduction of false positives in 269 medically relevant genes. The vast improvement in variant discovery coupled with population and functional genomic resources position T2T-CHM13 to replace GRCh38 as the prevailing reference for human genetics.
One Sentence Summary
The T2T-CHM13 reference genome universally improves the analysis of human genetic variation.
When the sequencing of the human genome was announced 2 decades ago by the Human Genome Project and biotech firm Celera Genomics, the sequence was not truly complete. About 15% was missing: technological limitations left researchers unable to work out how certain stretches of DNA fitted together, especially those where there were many repeating letters (or base pairs). Scientists solved some of the puzzle over time, but the most recent human genome, which geneticists have used as a reference since 2013, still lacks 8% of the full sequence.
Now, researchers in the Telomere-to-Telomere (T2T) Consortium, an international collaboration that comprises around 30 institutions, have filled in those gaps. In a 27 May preprint entitled ‘The complete sequence of a human genome’, genomics researcher Karen Miga at the University of California, Santa Cruz, and her colleagues report that they’ve sequenced the remainder, in the process discovering about 115 new genes that code for proteins, for a total of 19,969.
…New sequencing technology: The newly sequenced genome—dubbed T2T-CHM13—adds nearly 200 million base pairs to the 2013 version of the human genome sequence.
This time, instead of taking DNA from a living person, the researchers used a cell line derived from what’s known as a ‘complete hydatidiform mole’, a type of tissue that forms in humans when a sperm inseminates an egg with no nucleus. The resulting cell contains chromosomes only from the father, so the researchers don’t have to distinguish between 2 sets of chromosomes from different people.
Miga says the feat probably wouldn’t have been possible without new sequencing technology from Pacific Biosciences in Menlo Park, California, which uses lasers to scan long stretches of DNA isolated from cells—up to 20,000 base pairs at a time. Conventional sequencing methods read DNA in chunks of only a few hundred base pairs at a time, and researchers reassemble these stretches like puzzle pieces. The larger pieces are much easier to put together, because they are more likely to contain sequences that overlap.
T2T-CHM13 is not the last word on the human genome, however. The T2T team had trouble resolving a few regions on the chromosomes, and estimates that about 0.3% of the genome might contain errors. There are no gaps, but Miga says quality-control checks have proved difficult in those areas. And the sperm cell that formed the hydatidiform mole carried an X chromosome, so the researchers have not yet sequenced a Y chromosome, which typically triggers male biological development.
Approximately 30 years after the start of the Human Genome Project, we sequenced the genome of an infant with encephalopathy in just over 11 hours. The results led to a clinical diagnosis of thiamine metabolism dysfunction syndrome 2 (THMD2) 16.5 hours after a blood sample was obtained and 13 hours after we initiated sequencing, which informed treatment of the infant, thereby illustrating the fulfillment of the promise of the Human Genome Project to transform health care…Video electroencephalography showed numerous seizures occurring in the interim. Thiamine and biotin administration was started 37.5 hours after admission, and phenobarbital administration was started 2 hours later. One 15-second seizure was recorded thereafter. 6 hours later, the patient was alert, calm, and bottle feeding. Standard, trio genome sequencing confirmed the diagnosis. After a further 24 hours passed without seizures, the patient was discharged. He is now thriving at 7 months of age.
…Ten years earlier, his parents, who were first cousins, had had a child with a similar neurologic presentation that rapidly progressed to epileptic encephalopathy; the child died at 11 months of age without an etiologic diagnosis, despite extensive evaluation.
…This case illustrates the potential for decreased suffering and improved outcomes through the implementation of rapid genome sequencing in a multidisciplinary, integrated, precision medicine delivery system…Currently, rapid genome sequencing is being implemented in Australia, England, Germany, and Wales and in Medicaid pilot programs in California, Florida, and Michigan.
In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.
Pathogens and associated outbreaks of infectious disease exert selective pressure on human populations, and any changes in allele frequencies that result may be especially evident for genes involved in immunity. In this regard, the 1346–1353 Yersinia pestis-caused Black Death pandemic, with continued plague outbreaks spanning several hundred years, is one of the most devastating recorded in human history.
To investigate the potential impact of Y. pestis on human immunity genes we extracted DNA from 36 plague victims buried in a mass grave in Ellwangen, Germany in the 16th century. We targeted 488 immune-related genes, including HLA, using a novel in-solution hybridization capture approach.
In comparison with 50 modern native inhabitants of Ellwangen, we find differences in allele frequencies for variants of the innate immunity proteins Ficolin-2 and NLRP14 at sites involved in determining specificity. We also observed that HLA-DRB1✱13 is more than twice as frequent in the modern population, whereas HLA-B alleles encoding an isoleucine at position 80 (I-80+), HLA-C✱06:02 and HLA-DPB1 alleles encoding histidine at position 9 are half as frequent in the modern population. Simulations show that natural selection has likely driven these allele frequency changes.
Thus, our data suggest that allele frequencies of HLA genes involved in innate and adaptive immunity responsible for extracellular and intracellular responses to pathogenic bacteria, such as Y. pestis, could have been affected by the historical epidemics that occurred in Europe.
We report a methodology for the pooled construction of mutants bearing precise genomic sequence variations and multiplex phenotypic characterization of these mutants using next-generation sequencing (NGS). Unlike existing techniques depending on CRISPR-Cas-directed genomic breaks for genome editing, this strategy instead uses single-stranded DNA produced by a retron element for recombineering. This enables libraries of millions of elements to be constructed and offers relaxed design constraints which permit natural DNA or random variation to be used as inputs.
Creating and characterizing individual genetic variants remains limited in scale, compared to the tremendous variation both existing in nature and envisioned by genome engineers. Here we introduce retron library recombineering (RLR), a methodology for high-throughput functional screens that surpasses the scale and specificity of CRISPR-Cas methods.
We use the targeted reverse-transcription activity of retrons to produce single-stranded DNA (ssDNA) in vivo, incorporating edits at >90% efficiency and enabling multiplexed applications. RLR simultaneously introduces many genomic variants, producing pooled and barcoded variant libraries addressable by targeted deep sequencing.
We use RLR for pooled phenotyping of synthesized antibiotic resistance alleles, demonstrating quantitative measurement of relative growth rates. We also perform RLR using the sheared genomic DNA of an evolved bacterium, experimentally querying millions of sequences for causal variants, demonstrating that RLR is uniquely suited to utilize large pools of natural variation.
Using ssDNA produced in vivo for pooled experiments presents avenues for exploring variation across the genome.
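One common way to turn pooled barcode counts into relative growth rates is to compare each variant's change in frequency with that of a reference variant. The sketch below illustrates that general idea; it is not the analysis used in the paper, and the barcode names, counts, and pseudocount are hypothetical.

```python
import math


def relative_growth_rates(counts_before, counts_after, reference,
                          elapsed=1.0, pseudocount=0.5):
    """Per-barcode growth rate relative to a reference barcode.

    The rate is the log fold change in read frequency between two time
    points, minus the reference's log fold change, divided by elapsed time.
    """
    total_before = sum(counts_before.values())
    total_after = sum(counts_after.values())

    def log_fold_change(barcode):
        # The pseudocount keeps barcodes with zero reads from breaking the log.
        f0 = (counts_before.get(barcode, 0) + pseudocount) / total_before
        f1 = (counts_after.get(barcode, 0) + pseudocount) / total_after
        return math.log(f1 / f0)

    reference_change = log_fold_change(reference)
    return {
        barcode: (log_fold_change(barcode) - reference_change) / elapsed
        for barcode in counts_before
    }


# Hypothetical deep-sequencing read counts before and after antibiotic selection.
before = {"wt": 1000, "alleleA": 1000, "alleleB": 1000}
after = {"wt": 500, "alleleA": 4000, "alleleB": 500}

rates = relative_growth_rates(before, after, reference="wt")
for barcode, rate in rates.items():
    print(f"{barcode}: {rate:+.2f}")
```

Here `alleleA` outgrows the reference while `alleleB` tracks it; dividing by `elapsed` in generations or hours would give rates in those units.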
Evaluating the impact of genetic variants on transcriptional regulation is a central goal in biological science that has been constrained by reliance on a single reference genome.
To address this, we constructed phased, diploid genomes for 4 cadaveric donors (using long-read sequencing) and systematically charted noncoding regulatory elements and transcriptional activity across more than 25 tissues from these donors.
Integrative analysis revealed over a million variants with allele-specific activity, coordinated, locus-scale allelic imbalances, and structural variants impacting proximal chromatin structure. We relate the personal genome analysis to the ENCODE encyclopedia, annotating allele-specific and tissue-specific elements that are strongly enriched for variants impacting expression and disease phenotypes.
These experimental and statistical approaches, and the corresponding EN-TEx resource, provide a framework for personalized functional genomics.
Bones and teeth are important sources of Pleistocene hominin DNA, but are rarely recovered at archaeological sites. Mitochondrial DNA has been retrieved from cave sediments, but provides limited value for studying population relationships.
We therefore developed methods for the enrichment and analysis of nuclear DNA from sediments, and applied them to cave deposits in western Europe and southern Siberia dated to between ~200,000 and 50,000 years ago.
We detect a population replacement in northern Spain ~100,000 years ago, accompanied by a turnover of mitochondrial DNA. We also identify 2 radiation events in Neanderthal history during the early part of the Late Pleistocene.
Our work lays the ground for studying the population history of ancient hominins from trace amounts of nuclear DNA in sediments.
When it comes to the mammoth family tree, it has long been believed that the Columbian mammoth evolved earlier than the smaller, shaggier woolly mammoth. But now, using DNA that is more than a million years old—the oldest ever recovered from a fossil—researchers have turned that assumption on its head: They found that the Columbian mammoth is in fact a hybrid of the woolly mammoth and a previously unrecognized mammoth lineage.
…Fossilized remains of mammoths, particularly those preserved in exquisite detail, can shed light on how these animals lived and died. But analyzing an ancient creature’s genetic code—by recovering its DNA and reassembling it into a genome—opens up vast new research possibilities, said David Díez-del-Molino, another paleogeneticist at the Centre for Palaeogenetics. “You can track the origin of species.”
A team of researchers, including Dr. Dalen and Dr. Díez-del-Molino, recently set out to do just that using three mammoth molars unearthed in northeastern Siberia. These teeth are old—about 700,000 years, 1.1 million years and 1.2 million years—and they’re also impressive to look at, Dr. Dalen said. “They’re the size of a carton of milk.”…After removing the non-mammoth DNA, the team was left with between 49 million and 3.7 billion base pairs in each of their three samples. (The mammoth genome is roughly 3.2 billion base pairs, which is slightly larger than the human genome.) The researchers compared their data with African elephant DNA a second time, which allowed them to put all their DNA fragments in the correct order.
This mammoth DNA smashes the record for the oldest DNA ever sequenced, which was previously held by a roughly 700,000-year-old horse specimen, said Morten E. Allentoft, an evolutionary biologist at Curtin University in Perth, Australia, who was not involved in the research. “It’s the oldest DNA that’s ever been authentically identified”, he said.
When the researchers looked at the three genomes they reconstructed, the oldest stood out. “The genome looked weird”, Dr. Dalen said. “I think it’s likely this is a different species.” That was a shock: Researchers have long believed that there was only a single lineage of mammoths in Siberia that gave rise to woolly and Columbian mammoths. This discovery suggests that a previously undiscovered mammoth lineage existed as well.
Temporal genomic data hold great potential for studying evolutionary processes such as speciation. However, sampling across speciation events would, in many cases, require genomic time series that stretch well back into the Early Pleistocene sub-epoch. Although theoretical models suggest that DNA should survive on this timescale, the oldest genomic data recovered so far are from a horse specimen dated to 780–560 thousand years ago.
Here we report the recovery of genome-wide data from three mammoth specimens dating to the Early and Middle Pleistocene sub-epochs, two of which are more than one million years old. We find that two distinct mammoth lineages were present in eastern Siberia during the Early Pleistocene. One of these lineages gave rise to the woolly mammoth and the other represents a previously unrecognized lineage that was ancestral to the first mammoths to colonize North America. Our analyses reveal that the Columbian mammoth of North America traces its ancestry to a Middle Pleistocene hybridization between these two lineages, with roughly equal admixture proportions. Finally, we show that the majority of protein-coding changes associated with cold adaptation in woolly mammoths were already present one million years ago.
These findings highlight the potential of deep-time palaeogenomics to expand our understanding of speciation and long-term adaptive evolution.
Aging is often perceived as a degenerative process caused by random accrual of cellular damage over time. In spite of this, age can be accurately estimated by epigenetic clocks based on DNA methylation profiles from almost any tissue of the body. Since such pan-tissue epigenetic clocks have been successfully developed for several different species, it is difficult to ignore the likelihood that a defined and shared mechanism instead underlies the aging process.
To address this, we generated 10,000 methylation arrays, each profiling up to 37,000 cytosines in highly-conserved stretches of DNA, from over 59 tissue-types derived from 128 mammalian species. From these, we identified and characterized specific cytosines, whose methylation levels change with age across mammalian species. Genes associated with these cytosines are greatly enriched in mammalian developmental processes and implicated in age-associated diseases.
From the methylation profiles of these age-related cytosines, we successfully constructed 3 highly accurate universal mammalian clocks for eutherians, and 1 universal clock for marsupials. The universal clocks for eutherians are similarly accurate for estimating ages (r > 0.96) of any mammalian species and tissue with a single mathematical formula.
Collectively, these new observations support the notion that aging is indeed evolutionarily conserved and coupled to developmental processes across all mammalian species—a notion that was long-debated without the benefit of this new and compelling evidence.
Archaeological sediments have been shown to preserve ancient DNA, but so far have not yielded genome-scale information of the magnitude of skeletal remains. We retrieved and analysed human and mammalian low-coverage nuclear and high-coverage mitochondrial genomes from Upper Palaeolithic sediments from Satsurblia cave, western Georgia, dated to 25,000 years ago. First, a human female genome with substantial basal Eurasian ancestry, which was an ancestry component of the majority of post-Ice Age people in the Near East, North Africa, and parts of Europe. Second, a wolf genome that is basal to extant Eurasian wolves and dogs and represents a previously unknown, likely extinct, Caucasian lineage that diverged from the ancestors of modern wolves and dogs before these diversified. Third, a bison genome that is basal to present-day populations, suggesting that population structure has been substantially reshaped since the Last Glacial Maximum. Our results provide new insights into the late Pleistocene genetic histories of these three species, and demonstrate that sediment DNA can be used not only for species identification, but also as a source of genome-wide ancestry information and genetic history.
Highlights: We demonstrate for the first time that genome sequencing from sediments is comparable to that of skeletal remains
A single Pleistocene sediment sample from the Caucasus yielded three low-coverage mammalian ancient genomes
We show that sediment ancient DNA can reveal important aspects of the human and faunal past
Evidence of an uncharacterized human lineage from the Caucasus before the Last Glacial Maximum
~0.01× coverage wolf and bison genomes are both basal to present-day diversity, suggesting reshaping of population structure in both species
Long-read sequencing (LRS) promises to improve characterization of structural variants (SVs), a major source of genetic diversity. We generated LRS data on 3,622 Icelanders using Oxford Nanopore Technologies, and identified a median of 22,636 SVs per individual (a median of 13,353 insertions and 9,474 deletions), spanning a median of 10 Mb per haploid genome. We discovered a set of 133,886 reliably genotyped SV alleles and imputed them into 166,281 individuals to explore their effects on diseases and other traits. We discovered an association with a rare (AF = 0.037%) deletion of the first exon of PCSK9. Carriers of this deletion have 0.93 mmol/L (1.31 SD) lower LDL cholesterol levels than the population average (p-value = 7.0 × 10⁻²⁰). We also discovered an association with a multi-allelic SV inside a large repeat region, contained within single long reads, in an exon of ACAN. Within this repeat region we found 11 alleles that differ in the number of a 57 bp-motif repeat, and observed a linear relationship (0.016 SD per motif inserted, p = 6.2 × 10⁻¹⁸) between the number of repeats carried and height. These results show that SVs can be accurately characterized at population scale using long-read sequence data in a genome-wide non-targeted approach and demonstrate how SVs impact phenotypes.
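The effect sizes quoted above can be unpacked with a little arithmetic. The following is only a back-of-the-envelope sketch: the Hardy-Weinberg carrier estimate and the helper name `height_shift_sd` are assumptions added here for illustration, not figures or methods reported by the authors.

```python
# Figures quoted in the abstract.
ldl_effect_mmol = 0.93         # lower LDL cholesterol in deletion carriers, mmol/L
ldl_effect_sd = 1.31           # the same effect expressed in standard deviations
allele_freq = 0.037 / 100      # AF = 0.037%
n_imputed = 166_281            # individuals the SV alleles were imputed into
height_sd_per_motif = 0.016    # height effect per additional 57 bp motif, in SD

# Population SD of LDL cholesterol implied by the two ways of stating the effect.
implied_ldl_sd = ldl_effect_mmol / ldl_effect_sd

# Assumption: Hardy-Weinberg proportions, so heterozygous carriers ~ 2p(1 - p).
carrier_freq = 2 * allele_freq * (1 - allele_freq)
expected_carriers = carrier_freq * n_imputed

def height_shift_sd(extra_motifs):
    """Height difference (in SD) under the reported linear per-motif effect."""
    return height_sd_per_motif * extra_motifs

print(f"implied LDL SD: {implied_ldl_sd:.2f} mmol/L")
print(f"expected heterozygous carriers: {expected_carriers:.0f}")
print(f"height shift for 5 extra motifs: {height_shift_sd(5):.3f} SD")
```

Under these assumptions, only on the order of a hundred carriers of the PCSK9 deletion would be expected in the imputed set, which helps explain why such a rare variant needs a large effect to be detected.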
Cannabis is a diploid species (2n = 20); the haploid genome sizes of the female and male plants, estimated using flow cytometry, are 818 and 843 Mb respectively. Although the genome of Cannabis has been sequenced (from hemp, wild and high-THC strains), all assemblies have significant gaps. In addition, there are inconsistencies in the chromosome numbering, which limit their use. A new comprehensive draft genome sequence assembly (~900 Mb) has been generated using long-read sequencing from the medicinal cannabis strain Cannbio-2, which produces a balanced ratio of cannabidiol and delta-9-tetrahydrocannabinol. The assembly was subsequently analysed for completeness by ordering the contigs into chromosome-scale pseudomolecules using a reference genome assembly approach, annotated and compared to other existing reference genome assemblies. The Cannbio-2 genome sequence assembly was found to be the most complete genome sequence available based on nucleotides assembled and BUSCO evaluation in Cannabis sativa with a comprehensive genome annotation. The new draft genome sequence is an advancement in Cannabis genomics permitting pan-genome analysis, genomic selection as well as genome editing.
The speed, expense and throughput of genomic sequencing impose limitations on its use for time-sensitive acute cases, such as rare or antibiotic resistant infections, and large-scale testing that is necessary for containing COVID-19 outbreaks using source-tracing. The major bottleneck for increasing the bandwidth and decreasing operating costs of next-generation sequencers (NGS) is the flow cell that supplies reagents for the biochemical processes; this subsystem has not substantially improved since 2005.
Here we report a new method for sourcing reagents based on surface coating technology (SCT): the DNA adhered onto the biochip is directly contacted by a reagent-coated polymeric strip. Compared with flow cells the reagent layers are an order of magnitude thinner while both the reagent exchange rate and biochip area are orders of magnitude greater. These improvements drop the turn-around time from days to twelve hours and the cost for whole genome sequencing (WGS) from about $1000 to $15, as well as increase data production by several orders of magnitude.
This makes NGS more affordable than many blood tests while rapidly providing detailed genomic information about microbial and viral pathogens, cancers and genetic disorders for targeted treatments and personalized medicine. These data can be pooled in population-wide databases for accelerated research and development, as well as providing detailed real-time data for tracking and containing outbreaks, such as the current COVID-19 pandemic.
Background: Many human diseases are known to have a genetic contribution. While genome-wide studies have identified many disease-associated loci, it remains challenging to elucidate causal genes. In contrast, exome sequencing provides an opportunity to identify new disease genes and large-effect variants of clinical relevance. We therefore sought to determine the contribution of rare genetic variation in a curated set of human diseases and traits using a unique resource of 200,000 individuals with exome sequencing data from the UK Biobank.
Methods and Results: We included 199,832 participants with a mean age of 68 at follow-up. Exome-wide gene-based tests were performed for 64 diseases and 23 quantitative traits using a mixed-effects model, testing rare loss-of-function and damaging missense variants. We identified 51 known and 23 novel associations with 26 diseases and traits at a false-discovery-rate of 1%. There was a striking risk associated with many Mendelian disease genes including: MYBPC3 with over a 100-fold increased odds of hypertrophic cardiomyopathy, PKD1 with a greater than 25-fold increased odds of chronic kidney disease, and BRCA2, BRCA1, ATM and PALB2 with 3 to 10-fold increased odds of breast cancer. Notable novel findings included an association between GIGYF1 and type 2 diabetes (OR 5.6, p = 5.35×10⁻⁸), elevated blood glucose, and lower insulin-like-growth-factor-1 levels. Rare variants in CCAR2 were also associated with diabetes risk (OR 13, p = 8.5×10⁻⁸), while COL9A3 was associated with cataract (OR 3.4, p = 6.7×10⁻⁸). Notable associations for blood lipids and hypercholesterolemia included NR1H3, RRBP1, GIGYF1, SCGN, APH1A, PDE3B and ANGPTL8. A number of novel genes were associated with height, including DTL, PIEZO1, SCUBE3, PAPPA and ADAMTS6, while BSN was associated with body-mass-index. We further assessed putatively pathogenic variants in known Mendelian cardiovascular disease genes and found that between 1.3 and 2.3% of the population carried likely pathogenic variants in known cardiomyopathy, arrhythmia or hypercholesterolemia genes.
Conclusions: Large-scale population sequencing identifies known and novel genes harboring high-impact variation for human traits and diseases. A number of novel findings, including GIGYF1, represent interesting potential therapeutic targets. Exome sequencing at scale can identify a meaningful proportion of the population that carries a pathogenic variant underlying cardiovascular disease.
Complementary to the genome, the concept of exposome has been proposed to capture the totality of human environmental exposures. While there has been some recent progress on the construction of the exposome, few tools exist that can integrate the genome and exposome for complex trait analyses. Here we propose a linear mixed model approach to bridge this gap, which jointly models the random effects of the two omics layers on phenotypes of complex traits. We illustrate our approach using traits from the UK Biobank (eg. BMI & height for n ~ 40,000) with a small fraction of the exposome that comprises 28 lifestyle factors. The joint model of the genome and exposome explains substantially more phenotypic variance and significantly improves phenotypic prediction accuracy, compared to the model based on the genome alone. The additional phenotypic variance captured by the exposome includes its additive effects as well as non-additive effects such as genome-exposome (gxe) and exposome-exposome (exe) interactions. For example, 19% of variation in BMI is explained by additive effects of the genome, while an additional 7.2% is explained by additive effects of the exposome, 1.9% by exe interactions and 4.5% by gxe interactions. Correspondingly, the prediction accuracy for BMI, computed using Pearson’s correlation between the observed and predicted phenotypes, improves from 0.15 (based on the genome alone) to 0.35 (based on the genome & exposome). We also show, using established theories, that integrating genomic and exposomic data is essential to attaining a clinically meaningful level of prediction accuracy for disease traits. In conclusion, the genomic and exposomic effects can contribute to phenotypic variation via their latent relationships, i.e. genome-exposome correlation, and gxe and exe interactions, and modelling these effects has great potential to improve phenotypic prediction accuracy and thus holds great promise for future clinical practice.
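To make the BMI figures above concrete, the sketch below simply adds up the reported variance components and converts the reported prediction correlations into squared correlations. Treating the four components as non-overlapping shares of variance is a simplifying assumption made here for illustration; the abstract itself notes genome-exposome correlation, so the authors' model need not decompose this cleanly.

```python
# Variance components for BMI as quoted in the abstract (percent of phenotypic variance).
bmi_variance_components = {
    "genome (additive)": 19.0,
    "exposome (additive)": 7.2,
    "exe interactions": 1.9,
    "gxe interactions": 4.5,
}

# Assumption: the components are non-overlapping, so they can be summed.
total_explained = sum(bmi_variance_components.values())
exposome_related = total_explained - bmi_variance_components["genome (additive)"]

# Reported Pearson correlations between observed and predicted BMI.
r_genome_only = 0.15
r_genome_plus_exposome = 0.35

# The squared correlation is the share of variance the predictor accounts for.
r2_genome_only = r_genome_only ** 2
r2_joint = r_genome_plus_exposome ** 2

print(f"total variance explained: {total_explained:.1f}%")
print(f"added by exposome-related terms: {exposome_related:.1f}%")
print(f"prediction r^2: {r2_genome_only:.4f} -> {r2_joint:.4f} "
      f"({r2_joint / r2_genome_only:.1f}x)")
```

The gap between the variance explained in-sample (about a third) and the out-of-sample prediction r² (about an eighth) is expected, since estimated effects are noisy when applied to new individuals.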
The UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world. Here we describe the release of exome-sequence data for the first 49,960 study participants, revealing ~4 million coding variants (of which around 98.6% have a frequency of less than 1%). The data include 198,269 autosomal predicted loss-of-function (LOF) variants, a more than 14× increase compared to the imputed sequence. Nearly all genes (more than 97%) had at least one carrier with a LOF variant, and most genes (more than 69%) had at least 10 carriers with a LOF variant. We illustrate the power of characterizing LOF variants in this population through association analyses across 1,730 phenotypes. In addition to replicating established associations, we found novel LOF variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical importance, and show that 2% of this population has a medically actionable variant. Furthermore, we characterize the penetrance of cancer in carriers of pathogenic BRCA1 and BRCA2 variants. Exome sequences from the first 49,960 participants highlight the promise of genome sequencing in large population-based studies and are now accessible to the scientific community.
Exome association studies to date have generally been underpowered to systematically evaluate the phenotypic impact of very rare coding variants. We leveraged extensive haplotype sharing between 49,960 exome-sequenced UK Biobank participants and the remainder of the cohort (total N~500K) to impute exome-wide variants at high accuracy (R² > 0.5) down to minor allele frequency (MAF) ~0.00005. Association and fine-mapping analyses of 54 quantitative traits identified 1,189 statistically-significant associations (P < 5 × 10⁻⁸) involving 675 distinct rare protein-altering variants (MAF < 0.01) that passed stringent filters for likely causality; 600 of the 675 variants (89%) were not present in the NHGRI-EBI GWAS Catalog. We replicated the effect directions of 28 of 28 height-associated variants genotyped in previous exome array studies, including missense variants in newly-associated collagen genes COL16A1 and COL11A2. Across all traits, 49% of associations (578/1,189) occurred in genes with two or more hits; follow-up analyses of these genes identified long allelic series containing up to 45 distinct likely-causal variants within the same gene (on average exhibiting 93%-concordant effect directions). In particular, 24 rare coding variants in IFRD2 independently associated with reticulocyte indices, suggesting an important role of IFRD2 in red blood cell development, and 11 rare coding variants in NPR2 (a gene previously implicated in Mendelian skeletal disorders) exhibited intermediate-to-strong effects on height (0.18–1.09 s.d.). Our results demonstrate the utility of within-cohort imputation in population-scale GWAS cohorts, provide a catalog of likely-causal, large-effect coding variant associations, and foreshadow the insights that will be revealed as genetic biobank studies continue to grow.
Aging is characterized by degeneration in cellular and organismal functions leading to increased disease susceptibility and death. Although our understanding of aging biology in model systems has increased dramatically, large-scale sequencing studies to understand human aging are only now beginning. We applied exome sequencing and association analyses (ExWAS) to identify age-related variants in 58,470 participants of the DiscovEHR cohort. Linear Mixed Model regression analyses of age at last encounter revealed variants in genes known to be linked with clonal hematopoiesis of indeterminate potential, which are associated with myelodysplastic syndromes, as top signals in our analysis, suggestive of age-related somatic mutation accumulation in hematopoietic cells despite patients lacking clinical diagnoses. In addition to APOE, we identified rare DISP2 rs183775254 (p = 7.40×10⁻¹⁰) and ZYG11A rs74227999 (p = 2.50×10⁻⁸) variants that were negatively associated with age in both sexes combined and in females, respectively, which were replicated with directional consistency in two independent cohorts. Epigenetic mapping showed these variants are located within cell-type-specific enhancers, suggestive of important transcriptional regulatory functions. To discover variants associated with extreme age, we performed exome-sequencing on persons of Ashkenazi Jewish descent ascertained for extensive lifespans. Case-control analyses compared 525 Ashkenazi Jewish cases (males ≥ 92 years, females ≥ 95 years) with 482 controls. Our results showed that variants in APOE (rs429358, rs6857) and TMTC2 (rs7976168) passed the Bonferroni-adjusted p-value threshold, as did several nominally-associated population-specific variants. Collectively, our Age-ExWAS, the largest performed to date, confirmed and identified previously unreported candidate variants associated with human age.
How microbial metabolism is translated into cellular reproduction under energy-limited settings below the seafloor over long timescales is poorly understood. Here, we show that microbial abundance increases by an order of magnitude over a five million-year-long sequence in anoxic subseafloor clay of the abyssal North Atlantic Ocean. This increase in biomass correlated with an increased number of transcribed protein-encoding genes that included those involved in cytokinesis, demonstrating that active microbial reproduction outpaces cell death in these ancient sediments. Metagenomes, metatranscriptomes, and 16S rRNA gene sequencing all show that the actively reproducing community was dominated by the candidate Phylum “Candidatus Atribacteria”, which exhibited patterns of gene expression consistent with a fermentative, and potentially acetogenic, metabolism. “Ca. Atribacteria” dominated throughout the entire eight million-year-old cored sequence, despite the detection limit for gene expression being reached in five million-year-old sediments. The subseafloor reproducing “Ca. Atribacteria” also expressed genes encoding a bacterial micro-compartment that has potential to assist in secondary fermentation by recycling aldehydes and, thereby, harness additional power to reduce ferredoxin and NAD⁺. Expression of genes encoding the Rnf complex for chemiosmotic ATP synthesis was also detected from the subseafloor “Ca. Atribacteria”, as was expression of the Wood-Ljungdahl pathway, which could potentially have an anabolic or catabolic function. The correlation of this metabolism with cytokinesis gene expression and a net increase in biomass over the million-year-old sampled interval indicates that the “Ca. Atribacteria” can perform the catabolic and anabolic functions necessary for cellular reproduction, even under energy limitation in millions of years old anoxic sediments.
The deep subseafloor sedimentary biosphere is one of the largest ecosystems on Earth, where microbes subsist under energy-limited conditions over long timescales. It remains poorly understood how mechanisms of microbial metabolism promote increased fitness in these settings. We discovered that the candidate bacterial Phylum “Candidatus Atribacteria” dominated a deep-sea subseafloor ecosystem, where it exhibited increased transcription of genes associated with acetogenic fermentation and reproduction in million-year old sediment. We attribute its improved fitness after burial in the seabed to its capabilities to derive energy from increasingly oxidized metabolites via a bacterial micro-compartment and utilize a potentially reversible Wood-Ljungdahl pathway to help meet anabolic and catabolic requirements for growth. Our findings show that “Ca. Atribacteria” can perform all the catabolic and anabolic functions necessary for cellular reproduction, even under energy limitation in anoxic sediments that are millions of years old.
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data.
In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein-DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein-RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
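The idea of treating a sequence like an image can be sketched in a few lines: the DNA is one-hot encoded into four channels, a filter is slid along it as a one-dimensional convolution, and the result is rectified and max-pooled. This is a minimal pure-Python sketch in the spirit of convolutional models such as DeepBind, not a reproduction of any published architecture; the hand-set "TATA" filter, the weights, and the example sequence are hypothetical, whereas a real network learns its filter weights from data.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of four channels (A, C, G, T).

    Bases outside the alphabet (e.g. N) become all-zero rows.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]

def motif_filter(motif, match=1.0, mismatch=-1.0):
    """Build a (len(motif) x 4) filter by hand; a trained network would learn this."""
    return [[match if letter == base else mismatch for letter in ALPHABET]
            for base in motif]

def conv1d_scan(encoded, kernel):
    """Slide the filter along the sequence with no padding and return one score per offset."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(ALPHABET)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores

def relu(values):
    """Rectification: keep positive activations, zero out the rest."""
    return [max(0.0, v) for v in values]

seq = "GGCTATAAGC"
activations = relu(conv1d_scan(one_hot(seq), motif_filter("TATA")))
best = max(range(len(activations)), key=activations.__getitem__)
pooled = max(activations)  # global max pooling: strongest match anywhere in the sequence
print(f"strongest match at offset {best} with activation {pooled}")
```

Max pooling discards the position of the match, which is what lets one filter recognize the same motif wherever it occurs in the input.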
Purpose: Carrier status associates strongly with genetic ancestry, yet current carrier screening guidelines recommend testing for a limited set of conditions based on a patient’s self-reported ethnicity. Ethnicity, which can reflect both genetic ancestry and cultural factors (eg. religion), may be imperfectly known or communicated by patients. We sought to quantitatively assess the efficacy and equity with which ethnicity-based carrier screening captures recessive disease risk.
Methods: For 93,419 individuals undergoing a 96-gene expanded carrier screen (ECS), correspondence was assessed among carrier status, self-reported ethnicity, and a dual-component genetic ancestry (eg. 75% African/25% European) calculated from sequencing data.
Results: Self-reported ethnicity was an imperfect indicator of genetic ancestry, with 9% of individuals having >50% genetic ancestry from a lineage inconsistent with self-reported ethnicity. Limitations of self-reported ethnicity led to missed carriers in at-risk populations: for 10 ECS conditions, patients with intermediate genetic ancestry backgrounds—who did not self-report the associated ethnicity—had statistically-significantly elevated carrier risk. Finally, for 7 of the 16 conditions included in current screening guidelines, most carriers were not from the population the guideline aimed to serve.
Conclusion: Substantial and disproportionate risk for recessive disease is not detected when carrier screening is based on ethnicity, leading to inequitable reproductive care.
A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; p = 8.7 × 10⁻⁴). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, p = 1 × 10⁻⁵). We observed statistically-significant overlap between these transcription-based approaches (odds ratio (OR) = 2.5, 95% confidence interval (CI) 1.1–5.0, p = 5.4 × 10⁻³). CHD DNVs altered transcription levels in 5 of 31 enhancers assayed. Finally, we observed a DNV burden in RNA-binding-protein regulatory sites (OR = 1.13, 95% CI 1.1–1.2, p = 8.8 × 10⁻⁵). Our findings demonstrate an enrichment of potentially disruptive regulatory noncoding DNVs in a fraction of CHD at least as high as that observed for damaging coding DNVs.
The police in China are collecting blood samples from men and boys from across the country to build a genetic map of its roughly 700 million males, giving the authorities a powerful new tool for their emerging high-tech surveillance state.
They have swept across the country since late 2017 to collect enough samples to build a vast DNA database, according to a new study published on Wednesday by the Australian Strategic Policy Institute, a research organization, based on documents also reviewed by The New York Times. With this database, the authorities would be able to track down a man’s male relatives using only that man’s blood, saliva or other genetic material. An American company, Thermo Fisher, is helping: The Massachusetts company has sold testing kits to the Chinese police tailored to their specifications. American lawmakers have criticized Thermo Fisher for selling equipment to the Chinese authorities, but the company has defended its business.
…The campaign even involves schools. In one southern coastal town in China, young boys offered up their tiny fingers to a police officer with a needle. About 230 miles to the north, officers went from table to table taking blood from schoolboys while the girls watched quizzically. Jiang Haolin, 31, gave a blood sample, too. He had no choice. The authorities told Mr. Jiang, a computer engineer from a rural county in northern China, that “if blood wasn’t collected, we would be listed as a ‘black household’”, he said last year, and it would deprive him and his family of benefits like the right to travel and go to a hospital…It is unclear whether the people in those photos fully understood what the blood collection was for. Interviews and social media posts have suggested that the failure to give blood would result in punishment.
…China already holds the world’s largest trove of genetic material, totaling 80 million profiles, according to state media. But earlier DNA gathering efforts were often more focused. Officials targeted criminal suspects or groups they considered potentially destabilizing, like migrant workers in certain neighborhoods. The police have also gathered DNA from ethnic minority groups like the Uighurs as a way to tighten the Communist Party’s control over them. The effort to compile a national male database broadens those efforts, said Emile Dirks, an author of the report from the Australian institute and a Ph.D. candidate in the department of political science at the University of Toronto. “We are seeing the expansion of those models to the rest of China in an aggressive way that I don’t think we’ve seen before”, Mr. Dirks said.
In the report released by the Australian institute, it estimated that the authorities aimed to collect DNA samples from 35 million to 70 million men and boys, or roughly 5% to 10% of China’s male population. They do not need to sample every male, because one person’s DNA sample can unlock the genetic identity of male relatives…Local officials often publicly announce the results of their sampling. In Donglan County in the Guangxi region, the police said they had collected more than 10,800 samples, covering nearly 10% of the male population. In Yijun County in Shaanxi Province, the police said they had collected more than 11,700 samples, or one quarter.
Background: There is considerable interest in whether genetic data can be used to improve standard cardiovascular disease risk calculators, as the latter are routinely used in clinical practice to manage preventative treatment.
Methods: This research has been conducted using the UK Biobank (UKB) resource. We developed our own polygenic risk score (PRS) for coronary artery disease (CAD), using novel and established methods to combine published genome-wide association study (GWAS) data with data from 114,196 UK Biobank individuals, also leveraging a large resource of other GWAS datasets along with functional information, to aid in the identification of causal variants, and thence define weights for > 8M genetic variants. We utilised a further 60,000 UKB individuals to develop an integrated risk tool (IRT) that combined our PRS with established risk tools (either the American Heart Association/American College of Cardiology’s pooled cohort equations (PCE) or the UK’s QRISK3) which was then tested in an additional, independent, set of 212,563 UKB individuals. We evaluated prediction performance in individuals of European ancestry, both as a whole and stratified by age and sex.
Findings: The novel CAD PRS showed superior predictive power for CAD events, compared to other published PRSs. As an individual risk factor, it has similar predictive power to each of systolic blood pressure, HDL cholesterol, and LDL cholesterol, but is more predictive than total cholesterol and smoking history. Our novel CAD PRS is largely uncorrelated with PCE, QRISK3, and family history, and, when combined with PCE into an integrated risk tool, had superior predictive accuracy. In individuals reclassified as high risk, CAD event rates were markedly and statistically-significantly higher compared to those reclassified as low risk. Overall, 9.7% of incident CAD cases were misclassified as low risk by PCE and correctly classified as high risk by the IRT, in contrast to 3.7% misclassified by the IRT and correctly classified by PCE. The overall net reclassification improvement for the IRT was 5.7% (95% CI 4.4–7.0), but when individuals were stratified into four age-by-sex subgroups the improvement was larger for all subgroups (range 7.7%-17.3%), with best performance in younger middle-aged men aged 40–54yo (17.3%, 95% CI 13.0–21.5). Broadly similar results were found using a different risk tool (QRISK3), and also for cardiovascular disease events defined more broadly.
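To make the reclassification figures above concrete, here is a minimal Python sketch of the categorical net reclassification improvement (NRI) in its standard two-component form. The function name is hypothetical. Only the event component (9.7% versus 3.7% of incident cases) can be filled in from the abstract; the non-event proportions are not reported there, so they are set to zero in the example and the result is the event component alone, not the reported overall NRI.

```python
def net_reclassification_improvement(
    events_up: float,
    events_down: float,
    nonevents_up: float,
    nonevents_down: float,
) -> float:
    """Categorical NRI from reclassification proportions.

    NRI = [P(up | event) - P(down | event)]
        + [P(down | non-event) - P(up | non-event)]

    All arguments are proportions in [0, 1]. "Up" means the new tool moved
    the individual to a higher risk category than the old tool did.
    """
    for value in (events_up, events_down, nonevents_up, nonevents_down):
        if not 0.0 <= value <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    event_component = events_up - events_down
    nonevent_component = nonevents_down - nonevents_up
    return event_component + nonevent_component


if __name__ == "__main__":
    # Event component only, using the proportions quoted in the abstract.
    # The non-event inputs are placeholders (assumption), not reported values.
    event_only = net_reclassification_improvement(0.097, 0.037, 0.0, 0.0)
    print(f"event component of NRI: {event_only:.3f}")
```

Moving a case up counts in the new tool's favour and moving a non-case up counts against it, which is why the two components carry opposite signs.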
Interpretation: An integrated risk tool that includes polygenic risk outperforms current, clinical risk stratification tools, and offers greater opportunity for early interventions. Given the plummeting costs of genetic tests, future iterations of CAD risk tools would be enhanced with the addition of a person’s polygenic risk.
Low-coverage whole genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined as current imputation methods are computationally expensive and unable to leverage large reference panels.
Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. It achieves imputation of a full genome for less than $1, outperforming existing methods by orders of magnitude, with an increased accuracy of more than 20% at rare variants. We also show that 1× coverage enables effective association studies and is better suited than dense SNP arrays to assess the impact of rare variations. Overall, this study demonstrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.
The ability to generate genomic data from wild animal populations has the potential to give unprecedented insight into the population history and dynamics of species in their natural habitats. However, in the case of many species, it is impossible legally, ethically, or logistically to obtain tissue samples of the high quality necessary for genomic analyses. In this study we evaluate the success of multiple sources of genetic material (feces, urine, dentin, and dental calculus) and several capture methods (shotgun, whole-genome, exome) in generating genome-scale data in wild eastern chimpanzees (Pan troglodytes schweinfurthii) from Gombe National Park, Tanzania.
We found that urine harbors statistically-significantly more host DNA than other sources, leading to broader and deeper coverage across the genome. Urine also exhibited a lower rate of allelic dropout. We found exome sequencing to be far more successful than both shotgun sequencing and whole-genome capture at generating usable data from low-quality samples such as feces and dental calculus. These results highlight urine as a promising and untapped source of DNA that can be noninvasively collected from wild populations of many species.
Most patients with rare diseases do not receive a molecular diagnosis and the aetiological variants and mediating genes for more than half such disorders remain to be discovered. We implemented whole-genome sequencing (WGS) in a national healthcare system to streamline diagnosis and to discover unknown aetiological variants, in the coding and non-coding regions of the genome.
In a pilot study for the 100,000 Genomes Project, we generated WGS data for 13,037 participants, of whom 9,802 had a rare disease, and provided a genetic diagnosis to 1,138 of the 7,065 patients with detailed phenotypic data. We identified 95 Mendelian associations between genes and rare diseases, of which 11 have been discovered since 2015 and at least 79 are confirmed aetiological.
Using WGS of UK Biobank, we showed that rare alleles can explain the presence of some individuals in the tails of a quantitative red blood cell (RBC) trait. Finally, we reported 4 novel non-coding variants which cause disease through the disruption of transcription of ARPC1B, GATA1, LRBA and MPL. Our study demonstrates a synergy by using WGS for diagnosis and aetiological discovery in routine healthcare.
Cannabis is a diverse and polymorphic species. To better understand cannabinoid synthesis inheritance and its impact on pathogen resistance, we shotgun sequenced and assembled a Cannabis trio (sibling pair and their offspring) utilizing long read single molecule sequencing. This resulted in the most contiguous Cannabis sativa assemblies to date. These reference assemblies were further annotated with full-length male and female mRNA sequencing (Iso-Seq) to help inform isoform complexity, gene model predictions and identification of the Y chromosome. To further annotate the genetic diversity in the species, 40 male, female, and monoecious cannabis and hemp varietals were evaluated for copy number variation (CNV) and RNA expression. This identified multiple CNVs governing cannabinoid expression and 82 genes associated with resistance to Golovinomyces chicoracearum, the causal agent of powdery mildew in cannabis. Results indicated that breeding for plants with low tetrahydrocannabinolic acid (THCA) concentrations may result in deletion of pathogen resistance genes. Low THCA cultivars also have a polymorphism every 51 bases while dispensary grade high THCA cannabis exhibited a variant every 73 bases. A refined genetic map of the variation in cannabis can guide more stable and directed breeding efforts for desired chemotypes and pathogen-resistant cultivars.
The rise of ancient genomics has revolutionised our understanding of human prehistory but this work depends on the availability of suitable samples. Here we present a complete ancient human genome and oral microbiome sequenced from a 5700 year-old piece of chewed birch pitch from Denmark. We sequence the human genome to an average depth of 2.3× and find that the individual who chewed the pitch was female and that she was genetically more closely related to western hunter-gatherers from mainland Europe than hunter-gatherers from central Scandinavia. We also find that she likely had dark skin, dark brown hair and blue eyes. In addition, we identify DNA fragments from several bacterial and viral taxa, including Epstein-Barr virus, as well as animal and plant DNA, which may have derived from a recent meal. The results highlight the potential of chewed birch pitch as a source of ancient DNA.
The French revolutionary Jean-Paul Marat was assassinated in 1793 in his bathtub, where he was trying to find relief from the debilitating skin disease he was suffering from. At the time of his death, Marat was annotating newspapers, which got stained with his blood and were subsequently preserved by his sister. We extracted and sequenced DNA from the blood stain and also from another section of the newspaper, which we used for comparison. Analysis of human DNA sequences supported the heterogeneous ancestry of Marat, with his mother being of French origin and his father born in Sardinia, although bearing more affinities to mainland Italy or Spain. Metagenomic analyses of the non-human reads uncovered the presence of fungal, bacterial and low levels of viral DNA. Relying on the presence/absence of microbial species in the samples, we could confidently rule out several putative infectious agents that had been previously hypothesised as the cause of his condition. Conversely, some of the detected species are uncommon as environmental contaminants and may represent plausible infective agents. Based on all the available evidence, we hypothesize that Marat may have suffered from a primary fungal infection (seborrheic dermatitis), superinfected with bacterial opportunistic pathogens.
Significance: The advent of second-generation sequencing technologies allows for the retrieval of ancient genomes from long-dead people and, using non-human sequencing reads, of the pathogens that infected them. In this work we combined both approaches to gain insights into the ancestry and health of the controversial French revolutionary leader and physicist Jean-Paul Marat (1743–1793). Specifically, we investigate the pathogens, which may have been the cause of the debilitating skin condition that was affecting him, by analysing DNA obtained from a paper stained with his blood at the time of his death. This allowed us to confidently rule out several conditions that have been put forward. To our knowledge, this represents the oldest successful retrieval of genetic material from cellulose paper.
A £200 million investment from government, industry and charity cements UK Biobank’s reputation as a world-leading health resource to tackle the widest range of common and chronic diseases—including dementia, mental illness, cancer and heart disease. The investment provides for the whole genome sequencing of 450,000 UK Biobank participants. A Vanguard study, funded by the Medical Research Council to sequence the first 50,000 individuals, is already underway.
…The ambitious project is funded with:
£50 million by the UK Government’s research and innovation agency, UK Research and Innovation (UKRI) through the Industrial Strategy Challenge Fund;
£50 million from The Wellcome Trust charity;
£100 million in total from pharmaceutical companies Amgen, AstraZeneca, GlaxoSmithKline (GSK) and Johnson & Johnson (J&J).
…At the end of May 2020, the consortium of pharmaceutical companies will be provided independently with access for analysis to the first tranche of sequence data (anticipated to be for about 125,000 participants) linked to all of the other data in the UK Biobank resource. After an exclusive access period of 9 months, the whole genome sequence data will be made available to all other approved researchers around the world. A similar exclusive access period will also apply on the completion of the sequencing. The period of exclusive access mirrors the arrangements that UK Biobank had with the exome sequencing project which is being undertaken by Regeneron in the US and other industry partners. The first tranche of exome data on 50,000 participants is now being used in more than 100 research projects worldwide.
Centuries of zoological studies amassed billions of specimens in collections worldwide. Genomics of these specimens promises to rejuvenate biodiversity research. The obstacles stem from DNA degradation with specimen age. Overcoming this challenge, we set out to resolve a series of long-standing controversies involving a group of butterflies. We deduced geographical origins of several ancient specimens of uncertain provenance that are at the heart of these debates. Here, genomics tackles one of the greatest problems in zoology: countless old, poorly documented specimens that serve as irreplaceable embodiments of species concepts. The ability to figure out where they were collected will resolve many on-going disputes. More broadly, we show the utility of genomics applied to ancient museum specimens to delineate the boundaries of species and populations, and to hypothesize about genotypic determinants of phenotypic traits.
In most mammals, the male to female sex ratio of offspring is about 50% because half of the sperm contain either the Y chromosome or X chromosome. In mice, the Y chromosome encodes fewer than 700 genes, whereas the X chromosome encodes over 3,000 genes. Although overall gene expression is lower in sperm than in somatic cells, transcription is activated selectively in round spermatids. By regulating the expression of specific genes, we hypothesized that the X chromosome might exert functional differences in sperm that are usually masked during fertilization. In this study, we found that Toll-like receptors 7/8 (TLR7/8), encoded on the X chromosome, were expressed by ~50% of the round spermatids in testis and in ~50% of the epididymal sperm. Specifically, TLR7 was localized to the tail, and TLR8 was localized to the midpiece. Ligand activation of TLR7/8 selectively suppressed the mobility of the X chromosome-bearing sperm (X-sperm) but not the Y-sperm without altering sperm viability or acrosome formation. The difference in sperm motility allowed for the separation of Y-sperm from X-sperm. Following in vitro fertilization using the ligand-selected high-mobility sperm, 90% of the embryos were XY male. Likewise, 83% of the pups obtained following embryo transfer were XY males. Conversely, the TLR7/8-activated, slow mobility sperm produced embryos and pups that were 81% XX females. Therefore, the functional differences between Y-sperm and X-sperm motility were revealed and related to different gene expression patterns, specifically TLR7/8 on X-sperm.
It is a long-standing question as to which genes define the characteristic facial features among different ethnic groups. In this study, we use Uyghurs, an ancient admixed population, to query the genetic bases of why Europeans and Han Chinese look different. Facial traits were analyzed based on high-density 3D facial images; numerous biometric spaces were examined for divergent facial features between European and Han Chinese, ranging from inter-landmark distances to dense shape geometrics. Genome-wide association studies (GWAS) were conducted on a discovery panel of Uyghurs. Six statistically-significant loci were identified, four of which, rs1868752, rs118078182, rs60159418 at or near UBASH3B, COL23A1, PCDH7 and rs17868256 were replicated in independent cohorts of Uyghurs or Southern Han Chinese. A prospective model was also developed to predict 3D faces based on top GWAS signals and tested in hypothetical forensic scenarios.
[Keywords: genome-wide association study, dense 3D facial image, ancestry-divergent phenotypes, face prediction, forensic scenario]
Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a large computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.
Results: To address this, we show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows one to assess the amount of uncertainty inherent to sparse Strand-seq data on the level of individual reads.
Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Biosciences reads with 30.1× coverage.
To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.
Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement investigators have started to exploit some of these databases to find distant familial relatives, which can lead to a complete re-identification. Here, we leveraged genomic data of 600,000 individuals tested with consumer genomics to investigate the power of such long-range familial searches. We project that half of the searches with European-descent individuals will result in a third cousin or closer match and will provide a search space small enough to permit re-identification using common demographic identifiers. Moreover, in the near future, virtually any European-descent US person could be implicated by this technique. We propose a potential mitigation strategy based on cryptographic signatures that can resolve the issue and discuss policy implications to human subject research.
Cannabis sativa is listed as a Schedule I substance by the United States Drug Enforcement Agency and has been federally illegal in the United States since 1937. However, the majority of states in the United States, as well as several countries, now have various levels of legal Cannabis. Products are labeled with identifying strain names but there is no official mechanism to register Cannabis strains, therefore the potential exists for incorrect identification or labeling. This study uses genetic analyses to investigate strain reliability from the consumer point of view. Ten microsatellite regions were used to examine samples from strains obtained from dispensaries in three states. Samples were examined for genetic similarity within strains, and also a possible genetic distinction between Sativa, Indica, or Hybrid types. The analyses revealed genetic inconsistencies within strains. Additionally, although there was strong statistical support dividing the samples into two genetic groups, the groups did not correspond to commonly reported Sativa/Hybrid/Indica types. Genetic differences have the potential to lead to phenotypic differences and unexpected effects, which could be surprising for the recreational user, but have more serious implications for patients relying on strains that alleviate specific medical symptoms.
The number of individuals in a random sample with close relatives in the sample is a quantity of interest when designing Genome Wide Association Studies (GWAS) and other cohort based genetic, and non-genetic, studies. In this paper, we develop expressions for the distribution and expectation of the number of p-th cousins in a sample from a population of size N under two diploid Wright-Fisher models. We also develop simple asymptotic expressions for large values of N. For example, the expected proportion of individuals with at least one p-th cousin in a sample of K individuals, for a diploid dioecious Wright-Fisher model, is approximately $1 - e^{-(2^{2p-1})K/N}$. Our results show that a substantial fraction of individuals in the sample will have at least a second cousin if the sampling fraction (K/N) is on the order of $10^{-2}$. This confirms that, for large cohort samples, relatedness among individuals cannot easily be ignored.
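As a rough numerical illustration of the asymptotic expression quoted above, the following Python sketch evaluates $1 - e^{-(2^{2p-1})K/N}$ for a few cousin degrees at a sampling fraction of 0.01. The function name and the chosen inputs are illustrative assumptions, not taken from the paper.

```python
import math


def expected_cousin_fraction(p: int, sampling_fraction: float) -> float:
    """Approximate expected proportion of sampled individuals who have at
    least one p-th cousin in the sample, using the large-N asymptotic for
    the diploid dioecious Wright-Fisher model: 1 - exp(-2^(2p-1) * K/N).

    p                 -- cousin degree (1 = first cousin, 2 = second, ...)
    sampling_fraction -- K/N, the sampled share of the population
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    if not 0.0 <= sampling_fraction <= 1.0:
        raise ValueError("sampling_fraction must lie in [0, 1]")
    return 1.0 - math.exp(-(2 ** (2 * p - 1)) * sampling_fraction)


if __name__ == "__main__":
    # Sampling fraction on the order of 10^-2, as discussed in the abstract.
    for degree in (1, 2, 3):
        fraction = expected_cousin_fraction(degree, 0.01)
        print(f"p = {degree}: {fraction:.3f}")
```

Because the exponent grows as $2^{2p-1}$, each additional cousin degree multiplies it by four, so the expected proportion rises quickly with p at a fixed sampling fraction.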
Genealogies are likely the first, centuries-old “big data”, with their construction as old as human civilization itself. Globalization, and the identity crisis that ensued, turned many to online services, building family trees and investigating connections to historical records and other family trees. An explosion has been underway since the beginning of the century in the number and usage of websites offering such genealogical services. About 130 million users combine to have created almost four billion profiles for family members across the three most popular websites of genealogy enthusiasts, Ancestry.com, MyHeritage, and Geni. More recent years have witnessed a similar rapid increase of genetic-based services that address the same need to learn about familial relationships and ancestry. These vast amounts of crowdsourced, and often crowdfunded (as users often pay for these services), data offer ample scientific research opportunities that would otherwise require expansive collection. In a paper published today in Science, Kaplanis et al 2018 introduce a genealogical dataset based on processing 86 million public Geni profiles. Armed with this crowdsourced dataset, they address fundamental research questions.
Background: Genetic disorders are a leading cause of morbidity and mortality in infants. Rapid Whole Genome Sequencing (rWGS) can diagnose genetic disorders in time to change acute medical or surgical management (clinical utility) and improve outcomes in acutely ill infants.
Methods: Retrospective cohort study of acutely ill inpatient infants in a regional children’s hospital from July 2016–March 2017. 42 families received rWGS for etiologic diagnosis of genetic disorders. Probands received standard genetic testing as clinically indicated. Primary end-points were rate of diagnosis, clinical utility, and healthcare utilization. The latter was modelled in six infants by comparing actual utilization with matched historical controls and/or counterfactual utilization had rWGS been performed at different time points.
Findings: The diagnostic sensitivity was 43% (18 of 42 infants) for rWGS and 10% (4 of 42 infants) for standard of care (p = 0.0005). The rate of clinical utility for rWGS (31%, 13 of 42 infants) was statistically-significantly greater than for standard of care (2%, one of 42; p = 0.0015). 11 (26%) infants with diagnostic rWGS avoided morbidity, one had 43% reduction in likelihood of mortality, and one started palliative care. In 6 of the 11 infants, the changes in management reduced inpatient cost by $800,000 to $2,000,000.
Discussion: These findings replicate a prior study of the clinical utility of rWGS in acutely ill inpatient infants, and demonstrate improved outcomes and net healthcare savings. rWGS merits consideration as a first tier test in this setting.
We aggregated genome-wide genotyping data from 32 European-descent GWAS (74,124 T2D cases, 824,006 controls) imputed to high-density reference panels of >30,000 sequenced haplotypes. Analysis of ~27M variants (~21M with minor allele frequency [MAF]<5%), identified 243 genome-wide statistically-significant loci ($p < 5 \times 10^{-8}$; MAF 0.02%–50%; odds ratio [OR] 1.04–8.05), 135 not previously-implicated in T2D-predisposition. Conditional analyses revealed 160 additional distinct association signals ($p < 10^{-5}$) within the identified loci. The combined set of 403 T2D-risk signals includes 56 low-frequency (0.5%≤MAF<5%) and 24 rare (MAF<0.5%) index SNPs at 60 loci, including 14 with estimated allelic OR>2. Forty-one of the signals displayed effect-size heterogeneity between BMI-unadjusted and adjusted analyses. Increased sample size and improved imputation led to substantially more precise localisation of causal variants than previously attained: at 51 signals, the lead variant after fine-mapping accounted for >80% posterior probability of association (PPA) and at 18 of these, PPA exceeded 99%. Integration with islet regulatory annotations enriched for T2D association further reduced median credible set size (from 42 variants to 32) and extended the number of index variants with PPA>80% to 73. Although most signals mapped to regulatory sequence, we identified 18 genes as human validated therapeutic targets through coding variants that are causal for disease. Genome wide chip heritability accounted for 18% of T2D-risk, and individuals in the 2.5% extremes of a polygenic risk score generated from the GWAS data differed >9× in risk. Our observations highlight how increases in sample size and variant diversity deliver enhanced discovery and single-variant resolution of causal T2D-risk alleles, and the consequent impact on mechanistic insights and clinical translation.
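The credible sets mentioned above are conventionally built by ranking the variants at a signal by PPA and accumulating them until a chosen coverage is reached. The Python sketch below shows that generic construction. It is not the study's pipeline: the function name, the 99% default coverage, and the variant labels and PPA values in the example are all hypothetical.

```python
from typing import Dict, List


def credible_set(ppa: Dict[str, float], coverage: float = 0.99) -> List[str]:
    """Return the smallest list of variants, ranked by posterior probability
    of association (PPA), whose cumulative PPA reaches `coverage`.

    PPAs are renormalised to sum to 1 over the supplied variants, so the
    result is relative to this one signal only.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must lie in (0, 1]")
    total = sum(ppa.values())
    if total <= 0.0:
        raise ValueError("PPAs must sum to a positive value")

    # Highest-probability variants first.
    ranked = sorted(ppa.items(), key=lambda item: item[1], reverse=True)

    selected: List[str] = []
    cumulative = 0.0
    for variant, probability in ranked:
        selected.append(variant)
        cumulative += probability / total
        if cumulative >= coverage:
            break
    return selected


if __name__ == "__main__":
    # Hypothetical PPAs for four variants at a single association signal.
    example = {"var_a": 0.85, "var_b": 0.08, "var_c": 0.04, "var_d": 0.03}
    print(credible_set(example, coverage=0.95))
```

A lead variant with very high PPA yields a credible set of size one, which is the "single-variant resolution" the abstract refers to; better annotation priors shrink sets by concentrating probability on fewer variants.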
Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.
Genome-wide association studies have revealed many loci contributing to the variation of complex traits, yet the majority of loci that contribute to the heritability of complex traits remain elusive. Large study populations with sufficient statistical power are required to detect the small effect sizes of the yet unidentified genetic variants. However, the analysis of huge cohorts, like UK Biobank, is complicated by incidental structure present when collecting such large cohorts. For instance, UK Biobank comprises 107,162 third degree or closer related participants. Traditionally, GWAS have removed related individuals because they comprised an insignificant proportion of the overall sample size; however, removing related individuals in UK Biobank would entail a substantial loss of power. Furthermore, modelling such structure using linear mixed models is computationally expensive, which requires a computational infrastructure that may not be accessible to all researchers. Here we present an atlas of genetic associations for 118 non-binary and 599 binary traits of 408,455 related and unrelated UK Biobank participants of White-British descent. Results are compiled in a publicly accessible database that allows querying genome-wide association summary results for 623,944 genotyped and HapMap2 imputed SNPs, as well as downloading whole GWAS summary statistics for over 30 million imputed SNPs from the Haplotype Reference Consortium panel. Our atlas of associations (GeneAtlas, http://geneatlas.roslin.ed.ac.uk) will help researchers to query UK Biobank results in an easy way without incurring high computational costs.
The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged between 40–69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Here we describe the genome-wide genotype data (~805,000 markers) collected on all individuals in the cohort and its quality control procedures. Genotype data on this scale offers novel opportunities for assessing quality issues, although the wide range of ancestries of the individuals in the cohort also creates particular challenges. We also conducted a set of analyses that reveal properties of the genetic data—such as population structure and relatedness—that can be important for downstream analyses. In addition, we phased and imputed genotypes into the dataset, using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resource. This increases the number of testable variants by over 100× to ~96 million variants. We also imputed classical allelic variation at 11 human leukocyte antigen (HLA) genes, and as a quality control check of this imputation, we replicate signals of known associations between HLA alleles and many common diseases. We describe tools that allow efficient genome-wide association studies (GWAS) of multiple traits and fast phenome-wide association studies (PheWAS), which work together with a new compressed file format that has been used to distribute the dataset. As a further check of the genotyped and imputed datasets, we performed a test-case genome-wide association scan on a well-studied human trait, standing height.
Application of the experimental design of genome-wide association studies (GWASs) is now 10 years old (young), and here we review the remarkable range of discoveries it has facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics.
We predict the likely discoveries in the next 10 years, when GWASs will be based on millions of samples with array data imputed to a large fully sequenced reference panel and on hundreds of thousands of samples with whole-genome sequencing data.
Precision medicine necessitates large scale collections of genomes and phenomes. Despite decreases in the costs of genomic technologies, collecting these types of information at scale is still a daunting task that poses logistical challenges and requires consortium-scale resources. Here, we describe DNA.Land, a digital biobank to collect genome and phenomes with a fraction of the resources of traditional studies at the same scale. Our approach relies on crowd-sourcing data from the rapidly growing number of individuals that have access to their own genomic datasets through Direct-to-Consumer (DTC) companies. To recruit participants, we developed a series of automatic return-of-results features in DNA.Land that increase users’ engagement while stratifying human subject research protection. So far, DNA.Land has collected over 43,000 genomes in 20 months of operation, orders of magnitude higher than previous digital attempts by academic groups. We report lessons learned in running a digital biobank, our technical framework, and our approach regarding ethical, legal, and social implications.
The human reference genome is part of the foundation of modern human biology, and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph based models. Here, we survey various projects underway to build and apply these graph based structures—which we collectively refer to as genome graphs—and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
Heritability, $h^2$, is a foundational concept in genetics, critical to understanding the genetic basis of complex traits. Recently-developed methods that estimate heritability from genotyped SNPs, $h^2_{\mathrm{SNP}}$, explain substantially more genetic variance than genome-wide statistically-significant loci, but less than classical estimates from twins and families. However, $h^2_{\mathrm{SNP}}$ estimates have yet to be comprehensively compared under a range of genetic architectures, making it difficult to draw conclusions from sometimes conflicting published estimates.
Here, we used thousands of real whole genome sequences to simulate realistic phenotypes under a variety of genetic architectures, including those from very rare causal variants. We compared the performance of ten methods across different types of genotypic data (commercial SNP array positions, whole genome sequence variants, and imputed variants) and under differing causal variant frequencies, levels of stratification, and relatedness thresholds. These results provide guidance in interpreting past results and choosing optimal approaches for future studies.
We then applied the two methods (GREML-MS and GREML-LDMS) that best estimated overall h²_SNP and the causal variant frequency spectra to six phenotypes in the UK Biobank using imputed genome-wide variants. Our results suggest that as imputation reference panels become larger and more diverse, estimates of the frequency distribution of causal variants will become increasingly unbiased and the vast majority of trait narrow-sense heritability will be accounted for.
Woolly mammoths (Mammuthus primigenius) populated Siberia, Beringia, and North America during the Pleistocene and early Holocene. Recent breakthroughs in ancient DNA sequencing have allowed for complete genome sequencing for two specimens of woolly mammoths (Palkopoulou et al. 2015). One mammoth specimen is from a mainland population 45,000 years ago when mammoths were plentiful. The second, a 4,300-year-old specimen, is derived from an isolated population on Wrangel Island where mammoths subsisted with small effective population size more than 43-fold lower than previous populations. These extreme differences in effective population size offer a rare opportunity to test nearly neutral models of genome architecture evolution within a single species. Using these previously published mammoth sequences, we identify deletions, retrogenes, and non-functionalizing point mutations. In the Wrangel Island mammoth, we identify a greater number of deletions, a larger proportion of deletions affecting gene sequences, a greater number of candidate retrogenes, and an increased number of premature stop codons. This accumulation of detrimental mutations is consistent with genomic meltdown in response to low effective population sizes in the dwindling mammoth population on Wrangel Island. In addition, we observe high rates of loss of olfactory receptors and urinary proteins, either because these loci are non-essential or because they were favored by divergent selective pressures in island environments. Finally, at the locus of FOXQ1 we observe two independent loss-of-function mutations, which would confer a satin coat phenotype in this island woolly mammoth.
Author summary: We observe an excess of detrimental mutations, consistent with genomic meltdown in woolly mammoths on Wrangel Island just prior to extinction. We observe an excess of deletions, an increase in the proportion of deletions affecting gene sequences, and an excess of premature stop codons in response to evolution under low effective population sizes. Large numbers of olfactory receptors appear to have loss of function mutations in the island mammoth. These results offer genetic support within a single species for nearly-neutral theories of genome evolution. We also observe two independent loss of function mutations at the FOXQ1 locus, likely conferring a satin coat in this unusual woolly mammoth.
Humanity produces data at exponential rates, creating a growing demand for better storage devices. DNA molecules are an attractive medium to store digital information due to their durability and high information density. Recent studies have made large strides in developing DNA storage schemes by exploiting the advent of massively parallel synthesis of DNA oligos and the high throughput of sequencing platforms. However, most of these experiments reported small gaps and errors in the retrieved information. Here, we report a strategy to store and retrieve DNA information that is robust and approaches the theoretical maximum of information that can be stored per nucleotide. The success of our strategy lies in careful adaptation of recent developments in coding theory to the domain-specific constraints of DNA storage. To test our strategy, we stored an entire computer operating system, a movie, a gift card, and other computer files with a total of 2.14 × 10⁶ bytes in DNA oligos. We were able to fully retrieve the information without a single error even with a sequencing throughput on the scale of a single tile of an Illumina sequencing flow cell. To further stress our strategy, we created a deep copy of the data by PCR amplifying the oligo pool in a total of nine successive reactions, reflecting one complete path of an exponential process to copy the file 218 × 10¹² times. We perfectly retrieved the original data with only five million reads. Taken together, our approach opens the possibility of highly reliable DNA-based storage that approaches the information capacity of DNA molecules and enables virtually unlimited data retrieval.
Purpose: Expanded carrier screening (ECS) analyzes dozens or hundreds of recessive genes for determining reproductive risk. Data on clinical utility of screening conditions beyond professional guidelines is scarce.
Methods: Individuals underwent ECS for up to 110 genes. 537 at-risk couples (ARC), those in which both partners carry the same recessive disease, were invited to a retrospective IRB-approved survey of their reproductive decision making after receiving ECS results.
Results: 64 eligible ARC completed the survey. Of 45 respondents screened preconceptionally, 62% (n = 28) planned IVF with PGD or prenatal diagnosis (PNDx) in future pregnancies. 29% (n = 13) were not planning to alter reproductive decisions. The remaining 9% (n = 4) of responses were unclear.
Of 19 pregnant respondents, 42% (n = 8) elected PNDx, 11% (n = 2) planned amniocentesis but miscarried, and 47% (n = 9) considered the condition insufficiently severe to warrant invasive testing. Of the 8 pregnancies that underwent PNDx, 5 were unaffected and 3 were affected. 2 of 3 affected pregnancies were terminated.
Disease severity was found to have a statistically significant association (p = 0.000145) with changes in decision making, whereas guideline status of diseases, controlled for severity, was not (p = 0.284).
Conclusion: Most ARC altered reproductive planning, demonstrating the clinical utility of ECS. Severity of conditions factored into decision making.
We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing data. We demonstrate using simulations based on whole-genome sequencing data that ~97% and ~68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ~17 million imputed variants explain 56% (standard error (s.e.) = 2.3%) of variance for height and 27% (s.e. = 2.5%) of variance for body mass index (BMI), and we find evidence that height-associated and BMI-associated variants have been under natural selection. Considering the imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60–70% for height and 30–40% for BMI. Therefore, the missing heritability is small for both traits. For further discovery of genes associated with complex traits, a study design with SNP arrays followed by imputation is more cost-effective than whole-genome sequencing at current prices.
The processes leading up to species extinctions are typically characterized by prolonged declines in population size and geographic distribution, followed by a phase in which populations are very small and may be subject to intrinsic threats, including loss of genetic diversity and inbreeding. However, whether such genetic factors have had an impact on species prior to their extinction is unclear; examining this would require a detailed reconstruction of a species’ demographic history as well as changes in genome-wide diversity leading up to its extinction. Here, we present high-quality complete genome sequences from two woolly mammoths (Mammuthus primigenius). The first mammoth was sequenced at 17.1× coverage and dates to ~4,300 years before present, representing one of the last surviving individuals on Wrangel Island. The second mammoth, sequenced at 11.2× coverage, was obtained from an ~44,800-year-old specimen from the Late Pleistocene population in northeastern Siberia. The demographic trajectories inferred from the two genomes are qualitatively similar and reveal a population bottleneck during the Middle or Early Pleistocene, and a more recent severe decline in the ancestors of the Wrangel mammoth at the end of the last glaciation. A comparison of the two genomes shows that the Wrangel mammoth has a 20% reduction in heterozygosity as well as a 28× increase in the fraction of the genome that comprises runs of homozygosity. We conclude that the population on Wrangel Island, which was the last surviving woolly mammoth population, was subject to reduced genetic diversity shortly before it became extinct.
The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.
Background: Next generation sequencing (NGS) is now being used for detecting chromosomal abnormalities in blastocyst trophectoderm (TE) cells from in vitro fertilized embryos. However, few data are available regarding the clinical outcome, which provides vital reference for further application of the methodology. Here, we present a clinical evaluation of NGS-based preimplantation genetic diagnosis/screening (PGD/PGS) compared with single nucleotide polymorphism (SNP) array-based PGD/PGS as a control.
Results: A total of 395 couples participated. They were carriers of either translocation or inversion mutations, or were patients with recurrent miscarriage and/or advanced maternal age. A total of 1,512 blastocysts were biopsied on D5 after fertilization, with 1,058 blastocysts set aside for SNP array testing and 454 blastocysts for NGS testing. In the NGS cycles group, the implantation, clinical pregnancy and miscarriage rates were 52.6% (60/114), 61.3% (49/80) and 14.3% (7/49), respectively. In the SNP array cycles group, the implantation, clinical pregnancy and miscarriage rates were 47.6% (139/292), 56.7% (115/203) and 14.8% (17/115), respectively. The outcome measures of the NGS and SNP array cycles did not differ significantly. There were 150 blastocysts that underwent both NGS and SNP array analysis, of which seven blastocysts were found with inconsistent signals. All other signals obtained from NGS analysis were confirmed to be accurate by validation with qPCR. The relative copy number of mitochondrial DNA (mtDNA) for each blastocyst that underwent NGS testing was evaluated, and a statistically significant difference was found between the copy number of mtDNA for the euploid and the chromosomally abnormal blastocysts. So far, out of 42 ongoing pregnancies, 24 babies were born in NGS cycles; all of these babies are healthy and free of any developmental problems.
Conclusions: This study provides the first evaluation of the clinical outcomes of NGS-based pre-implantation genetic diagnosis/screening, and shows the reliability of this method in a clinical and array-based laboratory setting. NGS provides an accurate approach to detect embryonic imbalanced segmental rearrangements, to avoid the potential risks of false signals from SNP array in this study.
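The rates reported above can be recomputed from the counts given in the abstract. The following is a minimal sketch, not the authors' analysis: the abstract does not say which statistical test was used, so the pooled two-proportion z-test here is an assumption, and the function name `two_proportion_z_test` and the `OUTCOMES` table are illustrative.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test (an assumed test, not named in the abstract).

    Returns both proportions, the z statistic, and a two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value


# Counts copied from the abstract: (NGS events, NGS total), (SNP array events, SNP array total).
OUTCOMES = {
    "implantation": ((60, 114), (139, 292)),
    "clinical pregnancy": ((49, 80), (115, 203)),
    "miscarriage": ((7, 49), (17, 115)),
}

results = {}
for name, ((x1, n1), (x2, n2)) in OUTCOMES.items():
    results[name] = two_proportion_z_test(x1, n1, x2, n2)
    p1, p2, z, p_value = results[name]
    print(f"{name}: NGS {100 * p1:.1f}% vs SNP array {100 * p2:.1f}%, "
          f"z = {z:.2f}, p = {p_value:.2f}")
```

Under this assumed test, none of the three comparisons reaches p < 0.05, which agrees with the abstract's statement that the outcome measures did not differ significantly. A different test (for example Fisher's exact test) would give somewhat different p-values.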
Background: Cannabis sativa has been cultivated throughout human history as a source of fiber, oil and food, and for its medicinal and intoxicating properties. Selective breeding has produced cannabis plants for specific uses, including high-potency marijuana strains and hemp cultivars for fiber and seed production. The molecular biology underlying cannabinoid biosynthesis and other traits of interest is largely unexplored.
Results: We sequenced genomic DNA and RNA from the marijuana strain Purple Kush using short read approaches. We report a draft haploid genome sequence of 534 Mb and a transcriptome of 30,000 genes. Comparison of the transcriptome of Purple Kush with that of the hemp cultivar ‘Finola’ revealed that many genes encoding proteins involved in cannabinoid and precursor pathways are more highly expressed in Purple Kush than in ‘Finola’. The exclusive occurrence of Δ9-tetrahydrocannabinolic acid synthase in the Purple Kush transcriptome, and its replacement by cannabidiolic acid synthase in ‘Finola’, may explain why the psychoactive cannabinoid Δ9-tetrahydrocannabinol (THC) is produced in marijuana but not in hemp. Resequencing the hemp cultivars ‘Finola’ and ‘USO-31’ showed little difference in gene copy numbers of cannabinoid pathway enzymes. However, single nucleotide variant analysis uncovered a relatively high level of variation among four cannabis types, and supported a separation of marijuana and hemp.
Conclusions: The availability of the Cannabis sativa genome enables the study of a multifunctional plant that occupies a unique role in human culture. Its availability will aid the development of therapeutic marijuana strains with tailored cannabinoid profiles and provide a basis for the breeding of hemp with improved agronomic characteristics.
Although the expected relationship or proportion of genome shared by pairs of relatives can be obtained from their pedigrees, the actual quantities deviate as a consequence of Mendelian sampling and depend on the number of chromosomes and map length. Formulae have been published previously for the variance of actual relationship for a number of specific types of relatives but no general formula for non-inbred individuals is available. We provide here a unified framework that enables the variances for distant relatives to be easily computed, showing, for example, how the variance of sharing for great grandparent-great grandchild, great uncle-great nephew, half uncle-nephew and first cousins differ, even though they have the same expected relationship. Results are extended in order to include differences in map length between sexes, no recombination in males and sex linkage. We derive the magnitude of skew in the proportion shared, showing the skew becomes increasingly large the more distant the relationship. The results obtained for variation in actual relationship apply directly to the variation in actual inbreeding as both are functions of genomic coancestry, and we show how to partition the variation in actual inbreeding between and within families. Although the variance of actual relationship falls as individuals become more distant, its coefficient of variation rises, and so, exacerbated by the skewness, it becomes increasingly difficult to distinguish different pedigree relationships from the actual fraction of the genome shared.
To provide a resource for assessing continental ancestry in a wide variety of genetic studies, we identified, validated, and characterized a set of 128 ancestry informative markers (AIMs). The markers were chosen for informativeness, genome-wide distribution, and genotype reproducibility on two platforms (TaqMan assays and Illumina arrays). We analyzed genotyping data from 825 subjects with diverse ancestry, including European, East Asian, Amerindian, African, South Asian, Mexican, and Puerto Rican. A comprehensive set of 128 AIMs and subsets as small as 24 AIMs are shown to be useful tools for ascertaining the origin of subjects from particular continents, and to correct for population stratification in admixed population sample sets. Our findings provide general guidelines for the application of specific AIM subsets as a resource for wide application. We conclude that investigators can use TaqMan assays for the selected AIMs as a simple and cost efficient tool to control for differences in continental ancestry when conducting association studies in ethnically diverse populations.
The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east ~10,000 years ago, with a rate of westward spread of ~0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.
Author summary:
The demographic forces that have shaped the pattern of genetic variability in the plant species Arabidopsis thaliana provide an important backdrop for the use of this model organism in understanding the genetic determinants of plant natural variation. We investigated the demographic history of A. thaliana using novel population-genetic tools applied to a combination of molecular and geographic data. We infer that A. thaliana entered Europe from the east and spread westward at a rate of ~0.9 kilometers per year, and that its population size began increasing around 10,000 years ago. The “wave-of-advance” model suggested by these results is potentially consistent with the pattern expected if the species colonized Europe as the ice retreated at the end of the most recent glaciation. Alternatively, it is also compatible with the possibility that A. thaliana—a weedy species—may have spread into Europe with the diffusion of agriculture, providing an example of the phenomenon of “ecological imperialism” described by A. Crosby. In this framework, just as weeds from Europe invaded temperate regions worldwide during European human colonization, weeds originating from the source region of farming invaded Europe as a result of the disturbance caused by the spread of agriculture.
Our knowledge of Neanderthals is based on a limited number of remains and artifacts from which we must make inferences about their biology, behavior, and relationship to ourselves. Here, we describe the characterization of these extinct hominids from a new perspective, based on the development of a Neanderthal metagenomic library and its high-throughput sequencing and analysis. Several lines of evidence indicate that the 65,250 base pairs of hominid sequence so far identified in the library are of Neanderthal origin, the strongest being the ascertainment of sequence identities between Neanderthal and chimpanzee at sites where the human genomic sequence is different. These results enabled us to calculate the human-Neanderthal divergence time based on multiple randomly distributed autosomal loci. Our analyses suggest that on average the Neanderthal genomic sequence we obtained and the reference human genome sequence share a most recent common ancestor ~706,000 years ago, and that the human and Neanderthal ancestral populations split ~370,000 years ago, before the emergence of anatomically modern humans. Our finding that the Neanderthal and human genomes are at least 99.5% identical led us to develop and successfully implement a targeted method for recovering specific ancient DNA sequences from metagenomic libraries. This initial analysis of the Neanderthal genome advances our understanding of the evolutionary relationship of Homo sapiens and Homo neanderthalensis and signifies the dawn of Neanderthal genomics.
We propose a new method for approximate Bayesian statistical inference on the basis of summary statistics. The method is suited to complex problems that arise in population genetics, extending ideas developed in this setting by earlier authors. Properties of the posterior distribution of a parameter, such as its mean or density curve, are approximated without explicit likelihood calculations. This is achieved by fitting a local-linear regression of simulated parameter values on simulated summary statistics, and then substituting the observed summary statistics into the regression equation. The method combines many of the advantages of Bayesian statistical inference with the computational efficiency of methods based on summary statistics. A key advantage of the method is that the nuisance parameters are automatically integrated out in the simulation step, so that the large numbers of nuisance parameters that arise in population genetics problems can be handled without difficulty. Simulation results indicate computational and statistical efficiency that compares favorably with those of alternative methods previously proposed in the literature. We also compare the relative efficiency of inferences obtained using methods based on summary statistics with those obtained directly from the data using MCMC.
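The regression step described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy model (a normal mean with a uniform prior and the sample mean as the single summary statistic), the Epanechnikov-style weights, the tolerance `delta`, and all function names are choices made here for illustration only.

```python
import math
import random


def abc_local_linear(s_obs, prior_draw, simulate_summary,
                     n_sim=50_000, delta=0.5, seed=0):
    """Rejection ABC followed by a weighted local-linear regression adjustment.

    Returns the regression intercept (an estimate of the posterior mean at
    s_obs) and the adjusted parameter draws with their kernel weights.
    """
    rng = random.Random(seed)
    kept = []  # tuples of (theta, s - s_obs, weight)
    for _ in range(n_sim):
        theta = prior_draw(rng)
        s = simulate_summary(rng, theta)
        d = abs(s - s_obs)
        if d < delta:
            # Weight falls to zero as the simulated summary nears the tolerance.
            kept.append((theta, s - s_obs, 1.0 - (d / delta) ** 2))

    # Weighted least squares of theta on (s - s_obs), single predictor.
    sw = sum(w for _, _, w in kept)
    x_bar = sum(w * x for _, x, w in kept) / sw
    t_bar = sum(w * t for t, _, w in kept) / sw
    sxx = sum(w * (x - x_bar) ** 2 for _, x, w in kept)
    sxt = sum(w * (x - x_bar) * (t - t_bar) for t, x, w in kept)
    beta = sxt / sxx
    alpha = t_bar - beta * x_bar  # fitted value at s = s_obs

    # Substitute the observed summary: shift each accepted draw along the fit.
    adjusted = [(t - beta * x, w) for t, x, w in kept]
    return alpha, adjusted


N_OBS = 20  # hypothetical sample size of the toy data set


def prior_draw(rng):
    return rng.uniform(-10.0, 10.0)  # assumed flat prior on the mean


def simulate_summary(rng, theta):
    # The mean of N_OBS unit-variance normal draws is N(theta, 1 / N_OBS),
    # so the summary statistic is simulated directly.
    return rng.gauss(theta, 1.0 / math.sqrt(N_OBS))


post_mean, adjusted = abc_local_linear(1.5, prior_draw, simulate_summary)
total_w = sum(w for _, w in adjusted)
post_sd = math.sqrt(sum(w * (t - post_mean) ** 2 for t, w in adjusted) / total_w)
print(f"posterior mean ~ {post_mean:.3f}, posterior sd ~ {post_sd:.3f}")
```

In this toy model the posterior is approximately normal with mean 1.5 and standard deviation 1/sqrt(20), so the output can be checked against a known answer. No likelihood is evaluated anywhere in the sketch: only prior draws, simulated summaries, and the regression are used.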
A bacterial spore was revived, cultured, and identified from the abdominal contents of extinct bees preserved for 25 to 40 million years in buried Dominican amber. Rigorous surface decontamination of the amber and aseptic procedures were used during the recovery of the bacterium.
Several lines of evidence indicated that the isolated bacterium was of ancient origin and not an extant contaminant. The characteristic enzymatic, biochemical, and 16S ribosomal DNA profiles indicated that the ancient bacterium is most closely related to extant Bacillus sphaericus.