SAN FRANCISCO—Imagine looking at thousands of scattered puzzle pieces and trying to guess what picture they create. Without any reference point, it's nearly impossible.
In a similar way, biologists today face major challenges when trying to make sense of enormous datasets generated by experiments that profile thousands of individual cells. With so much genomic information at hand—but without much context—it can be difficult to understand how that data translates to human health and disease.
Researchers address this by defining cell types, akin to deciding how to sort puzzle pieces by color or pattern before assembling them. But just as it is hard to collaborate on a puzzle if your partner uses different criteria to sort pieces, scientists have been stymied by how to compare cell types across studies.
Now, a team at Gladstone Institutes has unveiled a powerful computational tool that solves this problem. The new method, called CellWalker2, allows scientists to determine how cell types are related and identify cell groupings that may impact health, leading to a better understanding of the cells they're studying.
"Previous methods did not leverage relationships between cell types," says Katie Pollard, PhD, the L.K. Whittier director of the Gladstone Institute of Data Science and Biotechnology, who led the new study published in Cell Genomics. "With our new tool, we use the fact that some cell types are siblings rather than distant cousins."
Revealing Relationships Between Cell Types
In the same way that a blue-sky puzzle piece could be mistaken for a water piece, an immature and a mature neuron that look somewhat similar might be misidentified as distinct cell types. CellWalker2 tackles this issue by using hierarchical relationships between cell types. The algorithm is similar to how you might first sort puzzle pieces into broad categories like "blue," and then separate this pile into smaller subtypes.
When presented with genomic data from a new study, CellWalker2 will identify the precise cell type if it can, and if not, it will assign a broader label.
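The fallback from a precise label to a broader one can be illustrated with a toy sketch. This is not CellWalker2's actual algorithm or API: the hierarchy, the per-type scores, the 0.7 threshold, and the function names are all hypothetical, chosen only to show how a hierarchy lets an uncertain call resolve to a parent category.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy (child -> parent); None marks the root.
PARENT: Dict[str, Optional[str]] = {
    "cell": None,
    "neuron": "cell",
    "immune cell": "cell",
    "immature excitatory neuron": "neuron",
    "mature excitatory neuron": "neuron",
    "T cell": "immune cell",
    "B cell": "immune cell",
}


def depth(node: str) -> int:
    """Number of steps from a node up to the root."""
    steps = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        steps += 1
    return steps


def subtree_scores(leaf_scores: Dict[str, float]) -> Dict[str, float]:
    """Add each leaf's score to the leaf itself and to every ancestor."""
    totals = {name: 0.0 for name in PARENT}
    for leaf, score in leaf_scores.items():
        node: Optional[str] = leaf
        while node is not None:
            totals[node] += score
            node = PARENT[node]
    return totals


def assign_label(leaf_scores: Dict[str, float], threshold: float = 0.7) -> str:
    """Return the most specific label whose share of the score meets the threshold."""
    total = sum(leaf_scores.values())
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    totals = subtree_scores(leaf_scores)
    confident = [n for n, s in totals.items() if s / total >= threshold]
    # The root always holds the full score, so `confident` is never empty.
    return max(confident, key=lambda n: (depth(n), totals[n]))


# Made-up scores for two cells: one clear-cut, one ambiguous between siblings.
clear = {"mature excitatory neuron": 0.85, "immature excitatory neuron": 0.10, "T cell": 0.05}
ambiguous = {"mature excitatory neuron": 0.50, "immature excitatory neuron": 0.45, "T cell": 0.05}

print(assign_label(clear))      # mature excitatory neuron
print(assign_label(ambiguous))  # neuron
```

In the second case neither neuron subtype is convincing on its own, but together they make "neuron" a confident call, which is the sibling-versus-distant-cousin idea in miniature.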
"Cell types aren't random categories," says Zhirui Hu, PhD, a bioinformatics fellow in Pollard's lab who helped design the new tool. "Some are very closely related—like two excitatory neurons at slightly different stages of life—while others are fundamentally different, like an immune cell and a muscle cell. CellWalker2 takes those relationships into account."
By revealing how cell types relate to each other and to the genes that control them, the new tool designed by Pollard's team has given scientists the ability to match cell types across different experiments, even from different labs or across species.
Seeing Beyond Cell Labels
Single-cell data analysis is a rapidly evolving field with profound implications for uncovering novel biological insights. Pollard and her team use data from a relatively new technique called "single-cell ATAC-seq," which tells scientists how open, or accessible, different areas of DNA are. This is important because accessible regions offer instructions that the cell might use. Indeed, many disease-associated mutations are believed to lie in accessible regions.
But ATAC-seq produces data that is notoriously hard to interpret.
Several years ago, when Pawel Przytycki, PhD, of Boston University was a postdoc at Gladstone, he and Pollard developed the original version of CellWalker to help analyze this data. The initial tool paired ATAC-seq data with RNA sequencing data from the same cell type; RNA sequencing shows which genes are active in a cell.
"By combining the two types of data, we could interpret it more easily," says Pollard, who is also a professor at UC San Francisco and an investigator at the Chan Zuckerberg Biohub San Francisco. "CellWalker helped us go from messy single-cell ATAC-seq data to getting clearer signals about which parts of the genome are active in which cells."
Now, CellWalker2 expands that vision by connecting multiple types of data from the same cell with meaningful biological insights. This additional context—like taking a peek at the puzzle box cover—shows which regulatory DNA is active in each cell type and how closely cells from different datasets align. The tool then quantifies the strength of these relationships, giving researchers a rigorous way to interpret complex biological patterns.
This ability to connect regulatory elements of DNA (which determine which genes are turned on or off) to their cell-specific functions could prove crucial in understanding conditions like autism, schizophrenia, and congenital heart disease.
Comparing Across Contexts, Species, and Diseases
To demonstrate the power and flexibility of CellWalker2, Pollard and her team first used it to compare complex data from different studies of human immune cells. The datasets were generated in different labs, using different analyses.
"In the past, comparing cell types across studies has been really difficult because labs use different methods and naming conventions," says Hu.
CellWalker2 overcame this challenge by building a statistical map between cell types, showing one-to-one matches as well as broader relationships, such as when a single cell type in one study is categorized as multiple subtypes in another.
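As a rough illustration of what such a map can express, the sketch below sorts cross-study links into one-to-one, one-to-many, and unmatched. It assumes similarity scores between cell types have already been computed by some method; the scores, the cell type names, the 0.5 cutoff, and the `classify_matches` function are invented for the example and are not CellWalker2's output or interface.

```python
from collections import Counter
from typing import Dict, List, Tuple


def classify_matches(
    scores: Dict[str, Dict[str, float]], threshold: float = 0.5
) -> Dict[str, Tuple[str, List[str]]]:
    """Label each study-A cell type by how it links to study-B cell types."""
    # Keep only links whose score meets the (hypothetical) cutoff.
    links = {
        a: sorted(b for b, s in row.items() if s >= threshold)
        for a, row in scores.items()
    }
    # Count how many study-A types point at each study-B type.
    b_counts = Counter(b for bs in links.values() for b in bs)

    result: Dict[str, Tuple[str, List[str]]] = {}
    for a, bs in links.items():
        if not bs:
            result[a] = ("unmatched", [])
        elif len(bs) > 1:
            result[a] = ("one-to-many", bs)
        elif b_counts[bs[0]] > 1:
            result[a] = ("many-to-one", bs)
        else:
            result[a] = ("one-to-one", bs)
    return result


# Made-up similarity scores: rows are study A labels, columns are study B labels.
example_scores = {
    "T cell":   {"CD4 T": 0.8, "CD8 T": 0.7, "B": 0.1, "NK": 0.3},
    "B cell":   {"CD4 T": 0.1, "CD8 T": 0.1, "B": 0.9, "NK": 0.1},
    "monocyte": {"CD4 T": 0.2, "CD8 T": 0.1, "B": 0.2, "NK": 0.3},
}

for a_type, (kind, partners) in classify_matches(example_scores).items():
    print(a_type, kind, partners)
```

Here the single "T cell" label in one study lines up with two subtypes in the other, the situation described above, while "B cell" has a clean one-to-one partner.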
Pollard and Hu also applied CellWalker2 to pinpoint regulatory regions of DNA used by specific immune cell types. The tool was able to go beyond simply listing which genes are turned on in different cells and instead infer which regulatory regions and transcription factors are likely orchestrating those changes.
"With this tool, we can do so much more than label cells," Hu says. "We're uncovering the logic behind how cell types are defined, how they differ from one another, and how they evolve or malfunction."
Finally, the team used CellWalker2 to compare brain cells from humans, marmosets, and mice. The tool revealed which cell types are shared across species and which are unique, offering insight into brain evolution and what makes the human brain distinct.
Open-Source and Ready to Use
CellWalker2 is already freely available online to other researchers; Przytycki and Hu have created documentation and example data to help novice users.
By offering new ways to understand how cells function and fail, tools like CellWalker2 lay essential groundwork for developing future diagnostics and therapies, Pollard says. Looking ahead, her team envisions using the new model to interpret the function of parts of the genome linked to heart and brain conditions where the underlying mechanisms remain murky.
"We're finally getting to a point where we can connect disease risk variants to the actual regulatory programs and cell types they affect," says Pollard. "CellWalker2 is a key step toward making those connections."