RNA is the means of translating the genetic code embedded in DNA into proteins, which serve as enzymes, transporters, signaling molecules, receptors, structural components, regulators, and gene-expression controllers, among many other roles.
Yet one gene is not limited to producing one RNA variant. The process of RNA splicing, in which different coding RNA segments (exons) are joined together after noncoding regions (introns) are removed, allows for the generation of a large array of RNA transcript isoforms with distinct sequences and, consequently, distinct functions in tissue- and cell-type-specific patterns. In this way, alternative splicing of precursor mRNA (pre-mRNA) greatly expands the complexity of the human transcriptome. In turn, changes in transcript isoforms can sensitively reflect dynamic changes in cellular states, while dysregulation of isoform usage caused by aberrant splicing is closely associated with major diseases such as cancer.
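As a toy illustration of the splicing step described above, the following sketch removes introns from a made-up pre-mRNA and enumerates the isoforms that arise when internal exons are optionally skipped. The sequence, the segment layout, and the helper names `splice` and `exon_skipping_isoforms` are hypothetical, and exon skipping is only one of several kinds of alternative splicing events.

```python
from itertools import combinations

# Hypothetical pre-mRNA: exons in uppercase, introns in lowercase (made-up sequence)
SEGMENTS = [
    ("exon", "AUGGC"), ("intron", "guaaguag"),
    ("exon", "CCAUU"), ("intron", "guccag"),
    ("exon", "GGAUC"), ("intron", "guaagcag"),
    ("exon", "UAA"),
]

# Build the full pre-mRNA and record exon coordinates (0-based, end-exclusive)
pre_mrna = ""
exons = []
for kind, seq in SEGMENTS:
    if kind == "exon":
        exons.append((len(pre_mrna), len(pre_mrna) + len(seq)))
    pre_mrna += seq

def splice(pre_mrna, exons, keep):
    """Join the kept exons in genomic order, dropping everything else."""
    return "".join(
        pre_mrna[start:end]
        for i, (start, end) in enumerate(exons)
        if i in keep
    )

def exon_skipping_isoforms(pre_mrna, exons):
    """Keep the first and last exon; include every subset of internal exons."""
    internal = range(1, len(exons) - 1)
    isoforms = {}
    for r in range(len(internal) + 1):
        for subset in combinations(internal, r):
            keep = {0, len(exons) - 1, *subset}
            isoforms[tuple(sorted(keep))] = splice(pre_mrna, exons, keep)
    return isoforms

isoforms = exon_skipping_isoforms(pre_mrna, exons)
for kept, seq in sorted(isoforms.items()):
    print(kept, seq)
```

With two optional internal exons, this single toy gene already yields four distinct transcripts, which hints at how quickly the number of possible isoforms grows for real genes with many exons.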
However, because isoform usage is jointly regulated by multiple layers of control, including regulatory elements (e.g., splicing enhancers and silencers on exons and introns), RNA binding proteins (RBPs), and tissue microenvironments, scientists have been challenged to accurately characterize and predict RNA splicing and isoform usage across tissues, cell types, and disease states.
Now, researchers from the China National Center for Bioinformation, a research center affiliated with the Chinese Academy of Sciences, led by Professor GAO Yuan, have developed an AI-driven framework that enables highly accurate prediction of RNA splicing and isoform usage by integrating genomic sequence features with tissue-specific RBP expression profiles.
The study, which was published in Nature Computational Science on May 19, offers valuable insights into splicing regulatory patterns, pathogenic variant interpretation, and precision medicine research.
The framework—Hierarchical Explainable LSTM for Isoform eXpression (HELIX)—overcomes the limitations of conventional approaches via a two-layer deep-learning architecture.
It first integrates DNA sequence information with the expression profiles of 1,499 RBPs and then employs long short-term memory (LSTM) networks to effectively capture the complex dependencies and competitive relationships among multiple splice sites.
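The general idea can be sketched in a few lines of plain Python. This is not the HELIX implementation, whose details are not given here; it is a minimal, untrained illustration that assumes each splice site is represented by a one-hot encoded sequence window concatenated with the tissue's RBP expression vector, and that a single unidirectional LSTM cell then reads the sites in order so that each site's score depends on the sites before it. The class `TinyLSTM`, the helper names, the window sequences, and the three-value RBP vector (standing in for 1,499 RBPs) are all hypothetical.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Flatten a DNA window into a one-hot vector (4 values per base)."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def site_features(window, rbp_expression):
    """Concatenate sequence features with the tissue's RBP expression profile."""
    return one_hot(window) + list(rbp_expression)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """One untrained LSTM cell with random weights, for illustration only."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n = input_size + hidden_size
        self.hidden_size = hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate
        self.W = {
            g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                for _ in range(hidden_size)]
            for g in "ifoc"
        }
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

    def _gate(self, g, z, activation):
        return [
            activation(sum(w * v for w, v in zip(row, z)) + bias)
            for row, bias in zip(self.W[g], self.b[g])
        ]

    def forward(self, inputs):
        """Read splice sites in order; return one score in (0, 1) per site."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        scores = []
        for x in inputs:
            z = x + h  # current site features joined with the previous hidden state
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            o = self._gate("o", z, sigmoid)
            g = self._gate("c", z, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            scores.append(sigmoid(sum(w * hj for w, hj in zip(self.w_out, h))))
        return scores

# Made-up sequence windows around three candidate splice sites of one gene
windows = ["CAGGTAAGT", "TTTCAGGTC", "AAGGTGAGC"]
rbp_expression = [0.8, 0.1, 0.5]  # hypothetical stand-in for 1,499 RBP values

inputs = [site_features(w, rbp_expression) for w in windows]
model = TinyLSTM(input_size=len(inputs[0]), hidden_size=4)
scores = model.forward(inputs)
print(scores)
```

Because the weights are random, the printed scores carry no biological meaning. The point is only the data flow: the same sequence windows would produce different scores under a different tissue's RBP profile, and each site's score is conditioned on the neighboring sites already read.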
This innovative design enables precise, reliable prediction of RNA splicing and transcript isoform usage. The model was trained and optimized on large-scale short- and long-read RNA-seq datasets covering 30 distinct human tissues, allowing accurate quantification of complex transcript structures and isoform usage. Results show that HELIX substantially outperforms existing mainstream methods in both splicing strength prediction and overall isoform usage prediction.
In disease-related studies, HELIX has demonstrated a powerful capacity to decipher aberrant RNA splicing and transcript isoform alterations. For example, using large colorectal cancer cohorts, the researchers identified widespread splicing dysregulation and abnormal isoform usage in tumor cells.
The results reveal strong correlations between these alterations and genomic mutations, RBP dysregulation, and patient clinical profiles. These findings indicate that splicing abnormalities can serve as vital molecular signatures for understanding tumor progression and guiding patient stratification.
Based on the findings, the team also developed scHELIX, a single-cell extension of HELIX specifically tailored for single-cell RNA sequencing data. scHELIX supports high-resolution profiling of transcript isoform usage across different cell types and tumor subpopulations, offering a refined view of intratumoral heterogeneity.
The findings reveal distinct RNA splicing and isoform usage patterns among tumor subclones, providing new clues for tumor evolution research and potential therapeutic target discovery.
Overall, HELIX and its single-cell variant scHELIX constitute a robust AI toolkit for understanding RNA splicing regulation and transcript isoform dynamics under complex biological conditions.
This work not only deepens our understanding of tissue-specific and disease-related splicing mechanisms, but also provides valuable computational tools and theoretical support for cancer subtyping, pathogenic variant annotation, and precision medicine development.