400M Rare Genetic Variants Discovered/Examined: Q&A with Michael Zody, PhD, on NYGC Contributions to TOPMed Nature Paper

NEW YORK, NY (February 10, 2021) —  Scientists at the New York Genome Center (NYGC), working in collaboration with researchers around the world, have identified 400 million rare genetic variants in a large-scale study of a diverse, multi-ethnic patient population. Through a deep analysis of genomic sequencing of more than 53,000 individuals from a wide range of ethnically diverse backgrounds, the research team discovered that 97% of the 400 million variants found were extremely rare, occurring in less than 1 percent of the population. The team’s findings, published today in Nature, provide a new understanding of the differences in disease across diverse ancestries and help to support the advancement of personalized disease prediction, prevention, diagnosis, and treatment.

The team’s analysis was conducted as part of the National Institutes of Health’s Trans-Omics for Precision Medicine (TOPMed) program and its broader Precision Medicine Initiative, which both aim to provide disease treatments tailored to an individual’s unique genes and environment. TOPMed is run by the National Institutes of Health’s National Heart, Lung, and Blood Institute (NHLBI), which has as its specific focus generating scientific resources that improve the understanding of genetic risk for heart, lung, blood, and sleep disorders. The University of Michigan and the University of Washington serve as the Informatics Research Center and Data Coordinating Center, respectively, for the TOPMed program, which includes many multi-institutional partners. The NYGC is a designated “Omics” Center in the program.

NYGC News asked Michael Zody, PhD, Scientific Director, Computational Biology, NYGC, to explain the significance of the Nature paper’s findings as well as detail the contributions of NYGC’s Comp Bio/Bioinformatics team. Dr. Zody is a co-author of the years-in-the-making study. André Corvelo, PhD, Lead Bioinformatics Scientist, NYGC, is a co-first author on the paper. Other NYGC co-authors include Soren Germer, PhD, Senior Vice President, Genome Technologies, and Bioinformatics Scientists Anne-Katrin Emde, PhD, and Wayne Clarke, PhD.

Q: What are the most exciting findings of this study?

Sequencing this many genomes has allowed us to discover and examine rare variation that hasn’t been seen before. We now have variants (mostly very rare) at 1 in 7 sites in the genome. TOPMed variants are now the majority of all human variation that has been discovered, as indexed in dbSNP [the Single Nucleotide Polymorphism Database, developed and hosted by the National Center for Biotechnology Information in collaboration with the National Human Genome Research Institute]. This has allowed the creation of an imputation panel that can impute variation down to 0.01% frequency in the population. Using this, we were able to impute rare variants in sparsely genotyped samples from UK Biobank and recover known associations between rare loss-of-function variation and diseases like breast cancer and glaucoma. Applying this to other large, well-phenotyped datasets could allow the extension of GWAS from common to rare variation association and reveal previously unknown gene-phenotype connections.

NYGC’s key contribution, in addition to sequencing a subset of the whole genomes, was looking at human sequences currently missing from the human reference genome and fully resolving and placing over 1,000 such sequences, 356 of which (spanning over 250,000 bases) had never been seen in previous studies that searched for such elements. This is largely attributable to the size of the TOPMed dataset. Although these events in general have higher frequency in the population than single nucleotide variants, we were also able to discover sequences present in only a single copy in the 53,000 genomes examined.

By leveraging the genomes of other great apes, scientists at the NYGC discovered that the human reference genome is likely missing a stretch of sequence affecting the annotation of the gene UBE2QL1. Aligning the sequence to the chimpanzee genome shows that the annotated gene in other species starts inside the “missing” human sequence. Their conclusion that the reference genome is likely missing this sequence, rather than the sequence merely representing variation across individuals, was confirmed when they found it in all of the study participants.

Q: What are the specific contributions from the NYGC Bioinformatics team?

In work led by [Lead Bioinformatics Scientist] André Corvelo, PhD, we developed new methods for using unmapped reads (those that do not cleanly line up with the reference human genome) to identify parts of the genome that are missing from the reference. About 10% of these appear to be errors in the reference, as they are present in 2 copies in every individual in the TOPMed cohort. One would expect the reference donors to have had 1 or 2 such variants by chance, but not 100. The remainder are loci at which individuals vary. It has been known for years that such sites exist, but until recently, it has been much easier to discover sites present in the reference and absent in other genomes than those absent in the reference that are present in others. Adding these sequences to our catalog of human variation will allow future studies to see if any of them associate with disease risk or other traits.
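As a rough illustration of the reasoning in this answer (not the NYGC team's actual pipeline), the distinction between a likely reference error and a true variable site can be sketched from per-individual copy counts. The function name, labels, and toy data below are hypothetical and chosen only for this example; the sketch assumes each non-reference sequence has already been genotyped as 0, 1, or 2 copies in every individual.

```python
from collections import Counter


def classify_sequence(copy_counts):
    """Label one non-reference sequence from per-individual copy counts (0, 1, or 2).

    Hypothetical helper for illustration only; real genotyping is far noisier.
    """
    if not copy_counts:
        raise ValueError("need at least one individual")
    total_copies = sum(copy_counts)
    # Present in 2 copies in every individual: the reference is probably wrong.
    if all(count == 2 for count in copy_counts):
        return "likely_reference_error"
    # Never observed in this cohort.
    if total_copies == 0:
        return "absent"
    # Exactly one copy across the whole cohort.
    if total_copies == 1:
        return "singleton"
    # Otherwise individuals genuinely differ at this locus.
    return "polymorphic"


# Toy cohort of four individuals (made-up data, not from the study).
cohort = {
    "seq_A": [2, 2, 2, 2],
    "seq_B": [0, 1, 2, 0],
    "seq_C": [0, 0, 1, 0],
}

labels = {name: classify_sequence(counts) for name, counts in cohort.items()}
summary = Counter(labels.values())
print(labels)
print(summary)
```

In this toy setup, a sequence seen in two copies in everyone is flagged as a probable gap in the reference, while the others are treated as sites where individuals vary.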

Q: What do you see as the future use of this data/next steps for this project?

This paper describes analysis of 53,000 samples (97,000 for the imputation panel), but the entire TOPMed program now has over 130,000 whole genome sequences. Although it is unlikely that another paper of this scale will be published on the final dataset, the work of discovering variation will go on, and the resources will continue to be released to the community. Importantly, the cohorts in TOPMed were selected for their relevance to heart, lung, blood, and sleep disorders, so much important work is still ongoing using these data to study these diseases. Further, TOPMed has always been a multi-modal project (hence the “Trans-Omics” in the name), so while the genome sequencing phase is largely complete, additional studies of RNA sequencing, metabolomics, and proteomics are going on for many of the samples whose sequence is being reported here and will enable multi-modal data analyses that build on the work presented here.

We have continued to develop our methods for detecting non-reference sequences in whole genome sequence. We are currently applying these to other large datasets including updated sequencing of the 1000 Genomes Project and the Centers for Common Disease Genomics program, which also has over 130,000 whole genomes. In addition to the variation discovery aspect, we are also working on applying our discoveries from these and other sequencing projects to the study of complex diseases such as Alzheimer’s disease.