AI Cracks Long-Range DNA Codes for RNA Splicing

The Institute of Medical Science, The University of Tokyo

Accurate RNA splicing is essential for gene expression and human health, yet predicting how DNA sequence variations affect splicing remains a major challenge. Although recent artificial intelligence (AI) models have improved splice site prediction, many struggle to capture regulatory signals located thousands of DNA bases away from the sites they influence. This limitation restricts our ability to understand disease-causing mutations and the complex mechanisms governing RNA processing, particularly in disorders ranging from genetic diseases to cancer.

To address these challenges, Professor Kenta Nakai of the Human Genome Center, Institute of Medical Science, and Ms. Yuna Miyachi, a Ph.D. student in the Department of Computer Science, Graduate School of Information Science and Technology, both at The University of Tokyo, Japan, developed SpliceSelectNet (SSNet), a hierarchical Transformer-based deep learning framework for splice site prediction. Their study, published in Nucleic Acids Research on June 22, 2026, introduces a computational approach that can analyze DNA sequences spanning up to 100,000 base pairs while maintaining single-nucleotide resolution. By combining local and global attention mechanisms, SSNet efficiently captures both nearby and distant regulatory signals that contribute to RNA splicing.

Many existing computational tools struggle to model long-range genomic interactions because the computational cost increases rapidly with sequence length. To overcome this limitation, SSNet divides long DNA sequences into smaller blocks, analyzes local patterns within each block, and then integrates information across the entire sequence through a hierarchical attention process. This design allows the model to preserve dense attention while remaining computationally efficient. In addition, the researchers enabled visualization of attention scores, allowing them to identify which DNA regions the model considered important during prediction.
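The block-then-integrate idea described above can be illustrated with a small sketch. This is not SSNet's published architecture: the function names, the identity query/key/value projections, the mean-pooled block summaries, and the block size are all assumptions chosen to keep the example short. It only shows how attention within blocks, followed by attention across block summaries, can still return one output vector per input position.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(vectors):
    """Dense self-attention over a list of equal-length vectors.

    Assumption: queries, keys and values are the inputs themselves
    (no learned projections), which a real model would not do.
    """
    d = len(vectors[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors])
        for q in vectors
    ]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


def hierarchical_attention(x, block_size):
    """Hypothetical two-level attention: local within blocks, global across blocks."""
    if len(x) % block_size != 0:
        raise ValueError("sequence length must be a multiple of block_size")
    dim = len(x[0])
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
    # Local step: every position attends only to positions in its own block.
    local = [attention(block)[0] for block in blocks]
    # One summary vector per block (mean pooling is an assumption).
    summaries = [
        [sum(vec[j] for vec in block) / block_size for j in range(dim)]
        for block in local
    ]
    # Global step: block summaries attend to each other across the whole sequence.
    global_ctx, global_weights = attention(summaries)
    # Add each block's global context back to every position in that block,
    # so the output keeps one vector per input position.
    out = [
        [p + g for p, g in zip(vec, ctx)]
        for block, ctx in zip(local, global_ctx)
        for vec in block
    ]
    return out, global_weights


def score_counts(seq_len, block_size):
    """Count attention scores: one full pass vs. the two-level scheme above."""
    n_blocks = seq_len // block_size
    full = seq_len ** 2
    hierarchical = n_blocks * block_size ** 2 + n_blocks ** 2
    return full, hierarchical
```

Under these assumptions, a 100,000-position input split into blocks of 1,000 (an illustrative block size, not one reported in the study) needs 100,010,000 attention scores in the two-level scheme, compared with 10,000,000,000 for a single dense pass over the full sequence. The returned global weights give a block-by-block view of what the sketch attended to, loosely analogous to the attention visualization the researchers describe.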

The model was trained and evaluated using several large genomic datasets and benchmarked against leading splice prediction systems. Across multiple validation datasets, SSNet achieved state-of-the-art performance for splice site prediction and aberrant splicing detection. The researchers also showed that the model could capture the effects of distant regulatory sequences beyond the effective range of conventional convolutional neural network approaches. In simulations using the DMD gene and evaluations of pathogenic variants from ClinVar, SSNet maintained sensitivity to regulatory signals located many thousands of base pairs from the affected splice site.

"The key achievement of this work is that we successfully modeled ultra-long-range genomic interactions while preserving high computational efficiency and single-nucleotide resolution," says Prof. Nakai. "We also demonstrated that the regions highlighted by the model closely correspond to biologically meaningful regulatory elements, helping to bridge predictive accuracy and biological interpretability."

The study suggests that hierarchical Transformer architectures could become valuable tools beyond splice site prediction. The same framework may support future research into promoter-enhancer interactions, three-dimensional genome organization, and broader DNA language models. The team also expects opportunities to collaborate with specialists in clinical and genomic medicine, where the technology could help screen non-coding variants that are currently of uncertain significance. In pharmaceutical research, the approach could assist in designing oligonucleotide therapeutics that target abnormal splicing.

"Many existing AI models for DNA analysis were adapted from natural language processing, but DNA has fundamentally different properties," explains Ms. Miyachi. "By redesigning the architecture to account for long-range genomic interactions and strict sequence resolution, we aimed to create a system better suited to biological reality."

By enabling accurate and interpretable analysis of genomic regions spanning up to 100,000 base pairs, SSNet represents a significant advance in computational genomics. Its ability to capture long-range regulatory signals while maintaining single-nucleotide precision provides a powerful new framework for studying RNA splicing, interpreting disease-associated variants, and advancing the development of precision genomic medicine.
