Researchers from Children's Hospital of Philadelphia (CHOP) and the Perelman School of Medicine at the University of Pennsylvania (Penn Medicine) have successfully employed an algorithm to identify potential mutations that increase disease risk in the noncoding regions of our DNA, which make up the vast majority of the human genome. The findings, published online today by the American Journal of Human Genetics, could serve as the basis for detecting disease-associated variants in a range of common diseases.
While certain sections of the human genome code for proteins to carry out a variety of essential biological functions, more than 98% of the genome does not code for proteins. However, disease-associated variants can also be found in these noncoding regions of the genome, which often control when proteins are made or "expressed." Since this "regulatory code" is not well understood, these noncoding variants have been more difficult to study, but prior genome-wide association studies (GWAS) have made great strides in understanding their clinical relevance.
One challenge is that while GWAS can identify broad regions as disease-associated, pinpointing which of several variants in a region is responsible for disease remains difficult. Many of these variants in noncoding regions are concentrated around transcription factor binding motifs, which are areas in the genome that specific proteins, called transcription factors, recognize and bind to in order to regulate gene expression. While these proteins bind at regions of the genome that are "open," they temporarily "close off" the immediate region of DNA that they bind to, leaving a "footprint" in experimental results that can be used to locate exactly where they are binding.
"This situation is comparable to a police lineup," said senior study author Struan F.A. Grant, PhD, Director of the Center for Spatial and Functional Genomics and the Daniel B. Burke Endowed Chair for Diabetes Research at CHOP. "You're looking at similar suspects together, so it can be challenging to know who the actual culprit is. With the approach we used in this study, we're able to pinpoint the disease-causing variant through identification of this so-called footprint."
In this study, researchers utilized ATAC-seq, an experimental genomic sequencing method that identifies "open" regions of the genome, and PRINT, a deep-learning-based method to detect these types of footprints of DNA-protein interactions. Using data from 170 human liver samples, the researchers observed 809 "footprint quantitative trait loci," or specific parts of the human genome associated with these footprints that indicate where DNA-protein interactions should be taking place. Using this method, the researchers could determine whether transcription factors were binding with varying strength to these sites depending on the variant.
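For readers who want a concrete picture of what a "footprint quantitative trait locus" test can look like, here is a minimal sketch. It is an illustration only and not the study's actual statistical model, which this release does not describe. It assumes each sample has an allele dosage (0, 1, or 2 copies of a variant) and a single footprint score at one site, then estimates how much the score changes per copy of the variant. The function name and the toy numbers are hypothetical.

```python
from statistics import mean


def footprint_qtl_slope(dosages, scores):
    """Least-squares slope of footprint score on allele dosage (0, 1, or 2).

    Hypothetical helper for illustration; not taken from the study or from PRINT.
    """
    if len(dosages) != len(scores) or len(dosages) < 3:
        raise ValueError("need matched dosage/score lists with at least 3 samples")
    d_bar, s_bar = mean(dosages), mean(scores)
    # Spread of the genotypes around their mean
    sxx = sum((d - d_bar) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("all samples share one genotype; no effect can be estimated")
    # Co-variation between genotype and footprint score
    sxy = sum((d - d_bar) * (s - s_bar) for d, s in zip(dosages, scores))
    return sxy / sxx


# Toy, made-up data: six samples, footprint score falls as dosage rises
dosages = [0, 0, 1, 1, 2, 2]
scores = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
slope = footprint_qtl_slope(dosages, scores)
print(f"estimated change in footprint score per allele copy: {slope:.2f}")
```

A slope far from zero in a sketch like this would suggest that the variant changes how strongly a transcription factor binds at that site. A real analysis would also need significance testing, covariates, and correction for testing many sites at once.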
With this useful foundational information, the authors of the study hope to apply these techniques to other organ and tissue samples and start identifying which of these variants are potentially driving a variety of common diseases.
"This approach helps resolve some fundamental issues we have encountered in the past when trying to determine which noncoding variants may be driving disease," said first study author Max Dudek, a PhD student in Grant and Almasy labs in the Department of Genetics at Penn Medicine and the Department of Pediatrics at Children's Hospital of Philadelphia. "With larger sample sizes, we believe that pinpointing these casual variants could ultimately inform the design of novel treatments for common diseases."
This study was supported by the National Science Foundation Graduate Research Fellowship Program; National Institutes of Health grants R01 HL133218, U10 AA008401, UM1 DK126194, U24 DK138512, and R01 HD056465; and the Daniel B. Burke Endowed Chair for Diabetes Research.