Single-cell analyses have emerged as powerful tools for studying cellular heterogeneity and gene regulation. Single-cell chromatin accessibility sequencing (scCAS) is a key technology that measures chromatin accessibility at the resolution of individual cells. However, three main challenges limit the use of scCAS data: (1) publicly available data generated from diverse species, tissues, and experimental conditions have not been systematically collected; (2) scCAS data with cell type, tissue, and other labels can be used to train machine learning methods for single-cell tasks such as cell type annotation, but these critically important annotated datasets have likewise not been systematically collected; (3) the diversity of data formats across studies complicates efforts toward format standardization.
To address these problems, a research team led by Shengquan Chen published their new research on 15 November 2025 in Frontiers of Computer Science, co-published by Higher Education Press and Springer Nature.
The team developed scCASdb, a user-friendly and well-annotated scCAS database that standardizes datasets in the h5ad format. By systematically collecting 80 well-annotated datasets from diverse species, tissues, and experimental conditions, scCASdb enables single-cell analyses that were previously hindered by the lack of comprehensive collections. Moreover, the adoption of the h5ad format ensures efficient data access and compatibility with both Python-based tools such as Scanpy and machine learning models.
All data in the database are stored in the h5ad format, which efficiently manages large-scale single-cell data and can be used directly with Python-based machine learning methods, enabling researchers to develop computational tools for single-cell analysis.
Each dataset in scCASdb contains three key components: (1) a cell-by-peak matrix, which records the chromatin accessibility of each single cell across genomic regions; (2) cell type labels for the cells in the cell-by-peak matrix, when available, which help researchers identify and classify cell populations and support the analysis of cellular heterogeneity; (3) metadata, such as species, genome, organs, diseases, sequencing technologies, and batch labels, which support researchers across diverse single-cell tasks.
Future work can focus on increasing the number of datasets and incorporating additional features to facilitate user access to the data.