Deep Learning Identifies Disease Variants in DNA's Dark Matter

Children's Hospital of Philadelphia

Philadelphia, April 17, 2025 – Researchers from Children's Hospital of Philadelphia (CHOP) and the Perelman School of Medicine at the University of Pennsylvania (Penn Medicine) have used a deep learning algorithm to identify potential disease-risk mutations in the noncoding regions of our DNA, which make up the vast majority of the human genome. The findings, published online today by the American Journal of Human Genetics, could serve as the basis for detecting disease-associated variants in a range of common diseases.

While certain sections of the human genome code for proteins to carry out a variety of essential biological functions, more than 98% of the genome does not code for proteins. However, disease-associated variants can also be found in these noncoding regions of the genome, which often control when proteins are made or "expressed." Since this "regulatory code" is not well understood, these noncoding variants have been more difficult to study, but prior genome-wide association studies (GWAS) have made great strides in understanding their clinical relevance.

While GWAS can identify broad regions of the genome as disease-associated, pinpointing which of several variants in a region is responsible for disease remains a challenge. Many of these variants in noncoding regions are concentrated around transcription factor binding motifs, which are areas in the genome that specific proteins, called transcription factors, recognize and bind to in order to regulate gene expression. While these proteins bind at regions of the genome that are "open," they temporarily "close off" the immediate region of DNA that they bind to, leaving a "footprint" in experimental results that can be used to locate exactly where they are binding.

"This situation is comparable to a police lineup," said senior study author Struan F.A. Grant, PhD, Director of the Center for Spatial and Functional Genomics and the Daniel B. Burke Endowed Chair for Diabetes Research at CHOP. "You're looking at similar suspects together, so it can be challenging to know who the actual culprit is. With the approach we used in this study, we're able to pinpoint the disease-causing variant through identification of this so-called footprint."

In this study, researchers used ATAC-seq, an experimental genomic sequencing method that identifies "open" regions of the genome, and PRINT, a deep-learning-based method that detects these footprints of DNA-protein interactions. Using data from 170 human liver samples, the researchers observed 809 "footprint quantitative trait loci," or specific parts of the human genome associated with these footprints that indicate where DNA-protein interactions should be taking place. Using this method, the researchers could determine whether transcription factors were binding with varying strength to these sites depending on the variant.

With this useful foundational information, the authors of the study hope to apply these techniques to other organ and tissue samples and start identifying which of these variants are potentially driving a variety of common diseases.

"This approach helps resolve some fundamental issues we have encountered in the past when trying to determine which noncoding variants may be driving disease," said first study author Max Dudek, a PhD student in the Grant and Almasy labs in the Department of Genetics at Penn Medicine and the Department of Pediatrics at Children's Hospital of Philadelphia. "With larger sample sizes, we believe that pinpointing these causal variants could ultimately inform the design of novel treatments for common diseases."

This study was supported by the National Science Foundation Graduate Research Fellowship Program; National Institutes of Health grants R01 HL133218, U10 AA008401, UM1 DK126194, U24 DK138512, and R01 HD056465; and the Daniel B. Burke Endowed Chair for Diabetes Research.

Dudek et al, "Characterization of non-coding variants associated with transcription factor binding through ATAC-seq-defined footprint QTLs in liver." Am J Hum Genet. Online April 17, 2025. DOI: 10.1016/j.ajhg.2025.03.019.

About Children's Hospital of Philadelphia:

A non-profit, charitable organization, Children's Hospital of Philadelphia was founded in 1855 as the nation's first pediatric hospital. Through its long-standing commitment to providing exceptional patient care, training new generations of pediatric healthcare professionals, and pioneering major research initiatives, the hospital has fostered many discoveries that have benefited children worldwide. Its pediatric research program is among the largest in the country. The institution has a well-established history of providing advanced pediatric care close to home through its CHOP Care Network, which includes more than 50 primary care practices, specialty care and surgical centers, urgent care centers, and community hospital alliances throughout Pennsylvania and New Jersey. CHOP also operates the Middleman Family Pavilion and its dedicated pediatric emergency department in King of Prussia, the Behavioral Health and Crisis Center (including a 24/7 Crisis Response Center) and the Center for Advanced Behavioral Healthcare
