Many mutations in DNA that contribute to disease do not occur in actual genes. Instead, they lie in the 99% of the genome once regarded as “junk.”
Genes predicted to be disrupted by regulatory mutations in people with autism tended to be involved in brain cell functioning and fell into two categories. One category relates to synapses, communication hubs between neurons, and the other relates to chromatin, the highly structured form of DNA and proteins required for proper gene expression from chromosomes. (Image credit: The Troyanskaya lab)
Although researchers have recently come to understand that these vast stretches of DNA play crucial roles, it has not been possible until now to decipher their effects on a large scale.
A research team led by Princeton University has used artificial intelligence to decode the functional effects of such mutations in people with autism. The team believes this robust method is generally applicable to discovering such genetic contributions to any disease.
The study, published in the journal Nature Genetics on May 27th, 2019, analyzed the genomes of 1790 families in which one child has autism spectrum disorder but other members do not. The technique sorted through 120,000 mutations to identify those that affect the behavior of genes in people with autism. The results do not reveal the precise causes of individual autism cases, but they point to thousands of probable contributors for researchers to study.
Most earlier studies have focused on finding mutations in genes themselves. Genes are essentially instructions for synthesizing the many proteins that build and control the body. Mutations in genes produce mutated proteins whose functions are disrupted. Other types of mutations, however, disrupt the way genes are regulated. Mutations in these regions affect not what the genes synthesize but when and how much they synthesize.
According to the researchers, it has not been feasible until now to search the entire genome for stretches of DNA that regulate genes and to predict how mutations in this regulatory DNA are likely to contribute to complex disease. This research provides the first evidence that mutations in regulatory DNA can lead to a complex disease.
“This method provides a framework for doing this analysis with any disease,” stated Olga Troyanskaya, professor of computer science and genomics and a senior author of the study. The method could be especially helpful for heart disease, cancer, neurological disorders, and many other conditions that have eluded efforts to determine their genetic causes.
This transforms the way we need to think about the possible causes of those diseases.
Olga Troyanskaya, Professor of Computer Science and Genomics, Princeton University
She is also the deputy director for genomics at the Simons Foundation’s Flatiron Institute in New York, where she headed a team of co-authors.
The team also included a group headed by neuroscientist Robert Darnell of The Rockefeller University. The first authors of the study are Jian Zhou and Christopher Park, who earned PhDs at Princeton and are now visiting collaborators at the Lewis-Sigler Institute for Integrative Genomics and researchers at the Flatiron Institute, and Chandra Theesfeld at Princeton’s Lewis-Sigler Institute for Integrative Genomics.
Most earlier studies of the genetic basis of disease have focused on the 20,000 known genes and the surrounding sections of DNA that regulate them. However, even this huge amount of genetic information makes up only slightly more than 1% of the 3.2 billion chemical pairs in the human genome. The other 99% has traditionally been regarded as “junk” or “dark,” though recent studies have begun to overturn that idea.
As part of their new discovery, the researchers provide a technique to make sense of this huge array of genomic data. The system employs an artificial intelligence method known as deep learning, where an algorithm carries out successive layers of analysis to understand patterns that would otherwise be impossible to discern.
Here, the algorithm teaches itself to identify biologically relevant sections of DNA and predicts whether those sections play a role in any of more than 2000 protein interactions that are believed to affect the regulation of genes. The system also predicts whether disrupting a single pair of DNA units would have a considerable effect on those protein interactions.
According to Troyanskaya, the algorithm “slides along the genome,” examining every single chemical pair in the context of the 1000 chemical pairs surrounding it, until it has scanned all mutations. In this way, the system can predict the effect of mutating each and every chemical unit in the entire genome. In the end, it produces a prioritized list of DNA sequences that are likely to regulate genes and of mutations that are likely to interfere with that regulation.
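To make the “sliding” idea concrete, here is a minimal Python sketch of scanning a sequence one position at a time and scoring every possible single-letter change within its surrounding window. It is only an illustration: the scoring function `toy_score` is a made-up stand-in (it simply measures G/C content) and is not the team's deep learning model, and the tiny `flank` value replaces the roughly 1000 surrounding chemical pairs mentioned in the article.

```python
from typing import Callable, List, Tuple

BASES = "ACGT"


def toy_score(window: str) -> float:
    """Hypothetical stand-in for a trained model: fraction of G/C in the window."""
    return sum(base in "GC" for base in window) / len(window)


def mutation_effects(
    sequence: str,
    flank: int,
    score: Callable[[str], float] = toy_score,
) -> List[Tuple[int, str, str, float]]:
    """Slide along the sequence and score every single-base substitution.

    Returns (position, reference base, alternative base, score change) tuples,
    sorted so that the largest absolute changes come first.
    """
    effects = []
    for pos, ref in enumerate(sequence):
        # Context window: the position itself plus up to `flank` bases per side.
        start = max(0, pos - flank)
        end = min(len(sequence), pos + flank + 1)
        window = sequence[start:end]
        ref_score = score(window)
        offset = pos - start
        for alt in BASES:
            if alt == ref:
                continue
            mutated = window[:offset] + alt + window[offset + 1:]
            effects.append((pos, ref, alt, score(mutated) - ref_score))
    # Prioritize: biggest predicted disruption first.
    effects.sort(key=lambda item: abs(item[3]), reverse=True)
    return effects


if __name__ == "__main__":
    for pos, ref, alt, delta in mutation_effects("ACGTTGCA", flank=2)[:5]:
        print(f"position {pos}: {ref}->{alt} changes score by {delta:+.2f}")
```

In a real system the scoring function would be a trained model producing many outputs per window; the loop structure, not the score, is the point of this sketch.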
Before this computational achievement, the conventional way to gather such information was painstaking laboratory experiments on each sequence and each possible mutation in that sequence. The number of possible functions and mutations is too large to test this way: an experimental approach would require testing each mutation against more than 2000 types of protein interactions and repeating those experiments across cell and tissue types, adding up to hundreds of millions of experiments.
Other research teams have tried to speed up this discovery by applying machine learning to targeted sections of DNA; however, they were not able to examine each DNA unit, each possible mutation, and the effects on each of more than 2000 regulatory interactions across the entire genome.
What our paper really allows you to do is take all those possibilities and rank them. That prioritization itself is very useful, because now you can also go ahead and do the experiments in just the highest priority cases.
Christopher Park, Visiting Collaborator, Lewis-Sigler Institute for Integrative Genomics
Finally, the system is calibrated against known disease-causing mutations to produce a “disease impact score,” an estimate of how likely a given mutation is to have an impact on disease.
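One way to picture this calibration and ranking step is the Python sketch below. It assumes each mutation has already been reduced to a single predicted-effect number, fits a simple one-feature logistic curve to a handful of labeled examples, and then ranks new candidates by the resulting score. Everything here is hypothetical: the article does not describe the calibration method, the numbers in `KNOWN_EFFECTS` and `KNOWN_LABELS` are invented for illustration, and the function names are not from the study.

```python
import math
from typing import Dict, List, Tuple

# Made-up training examples: predicted effect size and whether the mutation
# is a known disease-causing one (1) or presumed harmless (0).
KNOWN_EFFECTS = [0.9, 0.7, 0.6, 0.3, 0.1, 0.2, 0.4, 0.5]
KNOWN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_calibration(
    effects: List[float], labels: List[int], lr: float = 0.5, steps: int = 2000
) -> Tuple[float, float]:
    """Fit score = sigmoid(w * effect + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(effects)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(effects, labels):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def disease_impact_score(effect: float, w: float, b: float) -> float:
    """Calibrated score between 0 and 1 for one predicted effect."""
    return sigmoid(w * effect + b)


def rank_by_impact(
    candidates: Dict[str, float], w: float, b: float
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs, highest score first."""
    scored = [(name, disease_impact_score(e, w, b)) for name, e in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    w, b = fit_calibration(KNOWN_EFFECTS, KNOWN_LABELS)
    for name, score in rank_by_impact({"mut_a": 0.85, "mut_b": 0.15, "mut_c": 0.5}, w, b):
        print(f"{name}: disease impact score {score:.2f}")
```

The ranked output is what would let an experimenter start with the highest-priority cases rather than testing every mutation.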
In the case of autism, the team analyzed the genomes of 1790 families with “simplex” autism spectrum disorder, meaning the condition is evident in one child but not in other family members. (The data came from the Simons Simplex Collection of more than 2000 autism families.) Of this sample, fewer than 30% of the people with autism spectrum disorder had a previously identified genetic cause. According to the researchers, the newly discovered mutations are likely to increase that fraction considerably.
The ability to predict the functional impact of each mutation was the main novelty of the new research. Earlier studies had struggled to detect any difference in the number of regulatory mutations in people with autism compared with unaffected people. The new technique, however, focused on mutations predicted to have a high functional impact and found a considerably higher number of such mutations in affected people.
When the scientists then examined which genes these mutations affect, they turned out to be genes strongly linked with brain functions. The newly found mutations affect genes and functions similar to those affected by previously identified mutations.
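The kind of comparison described here can be sketched in a few lines of Python: keep only the mutations above an impact threshold, compare the fraction in affected and unaffected groups, and use a permutation test to ask how often a gap that large would appear by chance. This is a simplified illustration, not the study's statistical analysis; the scores in `AFFECTED` and `UNAFFECTED`, the threshold, and the function names are all invented.

```python
import random
from typing import List

# Made-up impact scores, one per mutation, for each group.
AFFECTED = [0.9, 0.8, 0.2, 0.7, 0.1, 0.6]
UNAFFECTED = [0.3, 0.2, 0.1, 0.6, 0.2, 0.1]


def high_impact_fraction(scores: List[float], threshold: float) -> float:
    """Fraction of mutations whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def permutation_p_value(
    affected: List[float],
    unaffected: List[float],
    threshold: float,
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """One-sided permutation test on the difference in high-impact fractions."""
    rng = random.Random(seed)
    observed = high_impact_fraction(affected, threshold) - high_impact_fraction(
        unaffected, threshold
    )
    pooled = list(affected) + list(unaffected)
    hits = 0
    for _ in range(n_perm):
        # Shuffle group labels and recompute the difference.
        rng.shuffle(pooled)
        shuffled_a = pooled[: len(affected)]
        shuffled_u = pooled[len(affected):]
        diff = high_impact_fraction(shuffled_a, threshold) - high_impact_fraction(
            shuffled_u, threshold
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    print("affected:", high_impact_fraction(AFFECTED, 0.5))
    print("unaffected:", high_impact_fraction(UNAFFECTED, 0.5))
    print("p-value:", permutation_p_value(AFFECTED, UNAFFECTED, 0.5))
```

Filtering by predicted impact before counting is the design choice that matters: counting all regulatory mutations, as earlier studies did, can bury a real difference under many mutations with little functional effect.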
“Now we open the field to understand all the factors that may be involved in autism,” stated Theesfeld.
This information also matters to families and their doctors, as it could support better diagnosis of the disorder and help avoid overly general assumptions about how a person’s autism might be grouped with that of others.
They say that when you meet one person with autism you have met one person with autism because no cases are alike. Genetically, it seems to be the same way.
Chandra Theesfeld, Lewis-Sigler Institute for Integrative Genomics, Princeton University
Using the new technique, the researchers are now investigating the genetic causes of different forms of cancer, heart disease, and other disorders.
The study was funded by the National Institutes of Health and the Simons Foundation. Other co-authors of the study include Aaron Wong, Julien Funk and Kevin Yao of Flatiron, and Yuan Yuan, Claudia Scheckel, John Fak, and Yoko Tajima of Rockefeller.