
Researchers study the human immune system by examining its active components, namely the various genes and cells involved. But there is a broad range of these, and observations necessarily produce vast amounts of data. For the first time, researchers including those from the University of Tokyo have built a software tool that leverages artificial intelligence not only to offer a faster, more consistent analysis of these cells but also to categorize them, with the aim of spotting novel patterns people have not yet seen.
Our immune system is important; it's impossible to imagine complex life existing without it. This system, comprising different kinds of cells that each play a different role, helps to identify things that threaten our health and takes action to defend us. It is very effective but also far from perfect, hence the existence of diseases such as the notorious acquired immunodeficiency syndrome, or AIDS. And recent world-changing events, such as the coronavirus pandemic, highlight the importance of research into this intricate yet powerful system.
One key branch of research in immunology involves identifying immune system components and ascertaining their function. Doing this through manual observation alone would be impractical because of the time it would take, and the automated tools that exist have limitations in accuracy, consistency or flexibility. To address this, a team of researchers led by Professor Tatsuhiko Tsunoda from the University of Tokyo's Department of Biological Sciences developed a system to boost immunology research.
"We present scHDeepInsight, an AI-based framework for rapidly and consistently identifying immune cells from the RNA of cells. Instead of viewing all cell types as unrelated, the system reflects the natural hierarchy of the immune system," said lead researcher Shangru Jia. "By turning cellular genetic profiles into images and applying a hierarchy-aware AI, known as a convolutional neural network, or CNN, it can distinguish both broad immune cell types and finer subtypes, and it can do so more consistently than previous attempts. In our benchmark, labeling about 10,000 cells only took a few minutes, whereas manual marker-based annotation can take many hours to days. In comparison with other automated methods, run time is in a similar range. The main advantages are the consistency of predictions across the hierarchy and the improved accuracy gained from incorporating hierarchical labels, rather than raw speed alone."
There are three main aspects to scHDeepInsight. Hierarchical learning, whereby the model mirrors the immune system's "family tree," can distinguish both broad immune categories and finer subtypes. Image-based representation transforms gene data into 2D images so the CNN can capture subtle relationships between genes more effectively than by looking at tables of raw data. And analytics built into the system can highlight which genes contribute most to a behavior, and these can be checked against known markers to see how they align with past observations.
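To make the idea of hierarchy-aware labeling concrete, here is a minimal Python sketch. It is not the scHDeepInsight implementation: the two-level tree, the cell type names and the function `hierarchical_predict` are all hypothetical, chosen only for illustration. The sketch shows one simple way to guarantee that a predicted subtype always belongs to the predicted broad type, by summing subtype probabilities up the tree before choosing a label.

```python
import numpy as np

# Hypothetical two-level hierarchy: subtype -> broad lineage (illustrative names only).
PARENT = {
    "CD4 T": "T cell",
    "CD8 T": "T cell",
    "Naive B": "B cell",
    "Memory B": "B cell",
}
SUBTYPES = list(PARENT)
BROAD = sorted(set(PARENT.values()))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def hierarchical_predict(subtype_logits):
    """Return (broad, subtype) label pairs that always agree with the tree."""
    sub_p = softmax(np.asarray(subtype_logits, dtype=float))
    # Broad probability = sum of the probabilities of its child subtypes.
    broad_p = np.zeros((sub_p.shape[0], len(BROAD)))
    for j, name in enumerate(SUBTYPES):
        broad_p[:, BROAD.index(PARENT[name])] += sub_p[:, j]
    results = []
    for cell_sub, cell_broad in zip(sub_p, broad_p):
        broad = BROAD[int(cell_broad.argmax())]
        # Choose the best subtype among the children of the chosen broad type only.
        children = [j for j, s in enumerate(SUBTYPES) if PARENT[s] == broad]
        best = max(children, key=lambda j: cell_sub[j])
        results.append((broad, SUBTYPES[best]))
    return results


# One cell whose single highest subtype score is "Naive B",
# but whose T cell subtypes together outweigh the B cell subtypes.
labels = hierarchical_predict([[2.0, 1.9, 2.5, -3.0]])
print(labels)  # [('T cell', 'CD4 T')]
```

In this toy case a flat classifier would pick "Naive B" because it has the highest individual score, while the tree-aware version assigns the cell to the T cell lineage first and then picks a subtype inside it.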
"A spreadsheet of gene numbers misses how genes relate to each other. When we map genes to pixels in an image so that related genes are placed nearby, the result is an image with meaningful structure. Image-recognition models such as CNNs are very good at detecting such patterns, allowing them to capture complex relationships between genes that are hard to learn from raw tables," said Jia. "The main challenge was balancing performance across both broad cell types and detailed subtypes, especially for rare cell populations. We addressed this by adapting the training process, so the model paid more attention to the categories that were harder to distinguish, reducing the risk of overlooking small but important subtypes."
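The gene-to-pixel mapping Jia describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method used in scHDeepInsight: it places genes with a plain SVD projection, the function names `gene_pixel_map` and `cell_to_image` are hypothetical, and the expression matrix is random data. The point is only to show how genes with similar expression profiles can end up in neighboring pixels, giving a CNN an image with spatial structure.

```python
import numpy as np


def gene_pixel_map(expr, size=8):
    """Assign each gene a pixel so that genes with similar profiles land nearby.

    expr: array of shape (n_cells, n_genes).
    Returns integer (row, col) coordinates of shape (n_genes, 2).
    Assumption: a two-component SVD projection stands in for the embedding step.
    """
    genes = np.asarray(expr, dtype=float).T            # one row per gene
    genes = genes - genes.mean(axis=1, keepdims=True)  # center each gene profile
    u, s, _ = np.linalg.svd(genes, full_matrices=False)
    coords = u[:, :2] * s[:2]                          # 2D coordinate per gene
    # Rescale the coordinates onto the pixel grid [0, size - 1].
    low = coords.min(axis=0)
    span = coords.max(axis=0) - low
    span[span == 0] = 1.0
    return np.rint((coords - low) / span * (size - 1)).astype(int)


def cell_to_image(cell, pixels, size=8):
    """Render one cell's expression vector as a size x size image."""
    total = np.zeros((size, size))
    count = np.zeros((size, size))
    np.add.at(total, (pixels[:, 0], pixels[:, 1]), cell)
    np.add.at(count, (pixels[:, 0], pixels[:, 1]), 1)
    # Genes sharing a pixel are averaged; empty pixels stay at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)


# Synthetic data for illustration only: 50 cells, 30 genes.
rng = np.random.default_rng(0)
expr = rng.poisson(2.0, size=(50, 30)).astype(float)
pixels = gene_pixel_map(expr, size=8)
image = cell_to_image(expr[0], pixels, size=8)
```

The pixel layout is computed once from the whole dataset and then reused for every cell, so the same gene always occupies the same position in each image.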
scHDeepInsight is primarily a research tool rather than a full diagnostic system, partly because it is still new, but mainly because the model is trained only on healthy cells. By applying it to patients' samples, researchers can see where they deviate from a healthy baseline. Such deviations may provide clues for further study, but medical interpretation requires additional validation. So this development will aid fundamental research throughout the field of immunology, but it might take time before descendants of scHDeepInsight find their way into diagnostic systems.
"Studies where immune changes are important, including cancer immunology, infections and autoimmune conditions, can benefit from more reliable cell labels. Since our model is trained on healthy immune cells, its immediate value is in providing a consistent healthy baseline for comparison. Disease-related shifts can then be measured relative to this baseline, but clinical interpretation requires validation in each context," said Jia. "Generalization and validation are key. Clinical samples are diverse, so the model must be tested across varied trials and protocols. Integration into clinical workflows, regulatory requirements for transparency and reproducibility are also essential before routine use. For research use today, scHDeepInsight is already available as a downloadable package — researchers can readily apply it in their own analyses. Broader validation and clinical integration remain goals for the future."
Work on scHDeepInsight has not finished. The team aims to improve its abilities and features, taking it beyond immune system-related cellular identification and into other biological domains. Ultimately, they hope to validate the system for use as a tool for clinical research by using precise immune system profiling to support studies of disease. And there's also the matter of its capacity to spot novel cell types.
"For each cell, the model outputs probabilities at both the broad type and subtype levels. If confidence is high for the broad lineage but low for all known subtypes within that lineage, the cell may represent a potentially novel state. In test analyses of brain immune datasets, this probability pattern helped highlight regions that were rich in specialized microglia cells residing in the central nervous system," said Jia. "AI models reflect their training data. If a reference atlas is incomplete, some rare or context-specific populations can be misclassified or underrepresented. Predictions must therefore be interpreted with caution and validated experimentally. Our design emphasizes transparency to support careful, evidence-based use."
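The rule Jia describes, high confidence in the broad lineage combined with low confidence in every known subtype of that lineage, can be written as a small filter. The Python sketch below is illustrative only: the function name `flag_candidate_novel`, the threshold values and the example probabilities are assumptions, not figures from the researchers.

```python
import numpy as np


def flag_candidate_novel(broad_probs, subtype_probs, children,
                         broad_min=0.9, subtype_max=0.5):
    """Flag cells with a confident broad lineage but no confident subtype.

    broad_probs:   (n_cells, n_broad) probabilities over broad types.
    subtype_probs: (n_cells, n_subtypes) probabilities over subtypes.
    children:      children[b] lists the subtype columns under broad type b.
    The two thresholds are illustrative assumptions.
    """
    broad_probs = np.asarray(broad_probs, dtype=float)
    subtype_probs = np.asarray(subtype_probs, dtype=float)
    best_broad = broad_probs.argmax(axis=1)
    flags = np.zeros(len(broad_probs), dtype=bool)
    for i, b in enumerate(best_broad):
        confident_lineage = broad_probs[i, b] >= broad_min
        # Highest probability among the known subtypes of the chosen lineage.
        no_clear_subtype = subtype_probs[i, children[b]].max() < subtype_max
        flags[i] = confident_lineage and no_clear_subtype
    return flags


# Made-up probabilities: broad type 0 has three subtypes, broad type 1 has one.
children = [[0, 1, 2], [3]]
broad = [[0.97, 0.03], [0.95, 0.05], [0.55, 0.45]]
sub = [[0.90, 0.05, 0.02, 0.03],
       [0.34, 0.32, 0.30, 0.04],
       [0.30, 0.20, 0.10, 0.40]]
flags = flag_candidate_novel(broad, sub, children)
print(flags)  # [False  True False]
```

Only the second cell is flagged: its lineage is clear, but no known subtype stands out. A flag like this marks a cell for follow-up rather than confirming a new cell type, in line with the caution about experimental validation above.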