Massive volumes of digital data are generated every day by AI training, big data analytics and smart devices. As conventional hard drives and cloud storage are increasingly constrained by high costs, limited capacity, high power consumption and short lifespans, molecular data storage has emerged as a breakthrough alternative. Researchers at The Hong Kong Polytechnic University (PolyU) have pioneered a method that uses engineered proteins to store digital data and, for the first time, completed the full process from data storage to data retrieval in de novo designed unnatural proteins. This demonstrates the potential of a sustainable protein-based storage framework with high storage capacity and high stability, offering a promising solution to the explosive global growth in data driven by AI.
Spanning the fields of protein engineering, synthetic biology, biochemistry, analytical chemistry and computer science, the interdisciplinary team is led by Prof. Zhongping YAO, Associate Head and Professor of the Department of Applied Biology and Chemical Technology. Other members include Dr Cheuk-chi NG, Research Assistant Professor of the same department, and Prof. Chung-Ming Francis LAU, Associate Dean (Global Engagement) of the Faculty of Engineering and Professor of the Department of Electrical and Electronic Engineering. The findings have been published in Nature Communications.
All digital files, including text, images and videos, are stored in computers as sequences of bits, that is, 0s and 1s. Molecular data storage typically works by assigning the different types of monomers in a large molecule to specific bit sequences, thereby "translating" the data into monomer sequences that can later be read and decoded. DNA, a commonly used medium (with nucleotides as monomers), consists of only four types of nucleotides, which results in relatively low storage capacity, and it is also prone to degradation. Prof. Yao's team previously developed peptides (with amino acids as monomers) as an alternative. Peptides can be made from the 20 types of natural amino acids, as well as many non-natural amino acids, offering much higher storage capacity. They can also be optimised to achieve very high stability. However, peptides have limited storage efficiency because of their short sequences, and they are produced mainly through chemical synthesis, which is costly.
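The mapping from bits to monomers can be illustrated with a minimal Python sketch. It is not the team's encoding scheme: the 16-letter alphabet `ALPHABET` and the functions `bits_per_monomer`, `encode` and `decode` are hypothetical names chosen for illustration. The sketch assumes 16 of the 20 natural amino acids are used, so that each residue carries exactly 4 bits.

```python
import math

# ASSUMPTION: an illustrative alphabet of 16 natural amino acids (one-letter
# codes), so each residue stores exactly 4 bits. The selection and order are
# arbitrary and are not taken from the published work.
ALPHABET = "ACDEFGHIKLMNPQRS"


def bits_per_monomer(n_types: int) -> float:
    """Upper bound on bits stored per monomer for an alphabet of n_types."""
    return math.log2(n_types)


def encode(data: bytes) -> str:
    """Translate each byte into two residues (high nibble, then low nibble)."""
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


def decode(seq: str) -> bytes:
    """Translate a residue sequence produced by encode() back into bytes."""
    if len(seq) % 2:
        raise ValueError("sequence length must be even")
    idx = [ALPHABET.index(c) for c in seq]
    return bytes((hi << 4) | lo for hi, lo in zip(idx[0::2], idx[1::2]))


if __name__ == "__main__":
    print(bits_per_monomer(4))    # DNA: 4 nucleotide types
    print(bits_per_monomer(20))   # 20 natural amino acids
    print(encode(b"Hi"))
    print(decode(encode(b"Hi")))
```

Under these assumptions, a four-letter DNA alphabet carries at most 2 bits per monomer, while 20 amino acids allow about 4.3 bits per monomer, which is the capacity gap the paragraph above describes.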
The research team has innovatively proposed using proteins as data carriers. Proteins have much longer amino acid sequences than peptides, delivering even higher storage efficiency and capacity. In addition, proteins can be readily expressed by biological systems like bacteria and animal cells—i.e., by injecting genetic information that prompts the cells to make designated proteins—enabling large-scale and low-cost generation of data-bearing proteins. Proteins can also be preserved with greater stability in powder or solution form in various environments.
However, protein-based data storage faces two major challenges. First, the amino acid sequences of data-bearing proteins appear highly random and variable, which can compromise their stability and solubility, making such proteins difficult to design and express. Second, protein sequencing is currently used primarily for protein identification, where only part of the sequence is needed to match against existing protein databases; to fully retrieve the encoded data, however, the entire sequence must be accurately reconstructed.
The research team devised innovative strategies to overcome these challenges. Inspired by the sequence pattern of collagen, a natural protein known for its long-term stability, they designed a protein template as the "backbone" to enhance structural stability and resistance to degradation. They embedded data-bearing amino acid sequences, which together encoded several files, into the collagen-like template and successfully expressed the resulting proteins in E. coli.
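The idea of embedding data residues in a collagen-like backbone can be sketched as follows. The article does not describe the actual template, so this is only an assumption-laden illustration: it relies on the well-known Gly-X-Y repeat of natural collagen, fixes glycine at every third position, and places payload residues in the X and Y positions. The function names `embed` and `extract` are hypothetical.

```python
def embed(payload: str) -> str:
    """Insert glycine ('G') before every pair of payload residues (G-X-Y).

    ASSUMPTION: a plain Gly-X-Y repeat stands in for the team's template,
    whose real design is not given in the article.
    """
    if len(payload) % 2:
        raise ValueError("payload length must be even to fill X and Y slots")
    return "".join("G" + payload[i:i + 2] for i in range(0, len(payload), 2))


def extract(protein: str) -> str:
    """Recover the payload by reading the X and Y positions of each triplet."""
    if len(protein) % 3:
        raise ValueError("protein length must be a multiple of 3")
    if any(protein[i] != "G" for i in range(0, len(protein), 3)):
        raise ValueError("template glycine missing at a triplet start")
    return "".join(protein[i + 1:i + 3] for i in range(0, len(protein), 3))


if __name__ == "__main__":
    protein = embed("FKHL")
    print(protein)
    print(extract(protein))
```

Because extraction is position-based, the payload may itself contain glycine without ambiguity. In this sketch the fixed backbone residues carry no data, so two of every three positions are available for the payload.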
For data retrieval, these proteins were digested and analysed by liquid chromatography–tandem mass spectrometry, which separated the resulting peptide fragments and identified their amino acid sequences one by one. The team then used software driven by algorithms developed in-house to reconstruct the full sequences and convert them back into bit strings. An error-correction scheme was also applied to correct minor sequence errors and recover missed sequences, achieving accurate and efficient data readout.
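The team's reconstruction software is not described in detail, so the following is only a minimal sketch of one generic way to rebuild a full sequence from sequenced fragments: greedy merging of the pair with the longest suffix-prefix overlap. It assumes error-free fragments that overlap by at least `min_len` residues and does not include any error correction. The names `overlap` and `assemble` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b, or 0 if the
    longest such overlap is shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments: list[str], min_len: int = 3) -> str:
    """Greedily merge fragments by their largest overlap until one remains."""
    if not fragments:
        raise ValueError("no fragments given")
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            raise ValueError("fragments do not overlap enough to assemble")
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


if __name__ == "__main__":
    print(assemble(["GACGDE", "GFKGHL", "GHLGAC"]))
```

Greedy merging can choose a wrong join when a sequence contains repeats longer than `min_len`, which is one reason a practical pipeline would pair reconstruction with an error-correction scheme such as the one the article mentions.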
In 2020, the team's previous work on peptide-based data storage demonstrated its stability and suitability for space exploration on China's next-generation manned spacecraft. The new approach delivers significant improvements in several respects. Prof. Yao said, "As data carriers, proteins have many advantages over DNA and peptides. The protein samples in our research achieved 30 times the storage density at only 10% of the cost of the peptide-based method. In addition, compared to the data-storing DNA that had been quickly degraded in solution form or in strong acid, the proteins remained readable for very long durations, demonstrating superior stability."
Beyond basic data storage, the research team further "functionalised" the proteins to enable random access and cryptographic protection. With non-functionalised proteins, specific segments of data cannot be retrieved without decoding the entire dataset. By attaching specific affinity tags to the proteins carrying the required data segments, the team used the corresponding antibodies to "capture" the target proteins during purification, achieving random access. The team also used these functionalised proteins to encode secret messages and showed that the messages could be retrieved only with the known affinity compound, showcasing the data encryption capabilities of proteins.
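As a loose software analogy for tag-based random access, the sketch below models a mixed pool of proteins in which each carries a tag label, and "capture" selects only those with a matching tag so that the rest never need to be decoded. The class `TaggedProtein`, the function `capture` and the tag labels are all hypothetical; the article does not name the tags or antibodies used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedProtein:
    tag: str      # hypothetical affinity-tag label, e.g. "tag-A"
    payload: str  # data-bearing amino acid sequence


def capture(pool: list[TaggedProtein], tag: str) -> list[TaggedProtein]:
    """Return only the proteins whose tag matches, mimicking selective
    capture during purification."""
    return [p for p in pool if p.tag == tag]


if __name__ == "__main__":
    pool = [
        TaggedProtein("tag-A", "GFKGHL"),
        TaggedProtein("tag-B", "GACGDE"),
        TaggedProtein("tag-A", "GMNGPQ"),
    ]
    for protein in capture(pool, "tag-A"):
        print(protein.payload)
```

In this analogy, a reader who does not know which tag marks the target segment cannot single it out from the pool, which mirrors the access-control idea described above.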
"The inherent stability, ease of preservation and high storage capacity of proteins make them excellent carriers for the long-term storage of large volumes of data. Their favourable biocompatibility even opens up the possibility of storing digital data in living organisms," Prof. Yao concluded. "Moving forward, we aim to achieve mass storage capabilities, faster data writing and reading speeds, and further reductions in protein production costs, while designing diverse protein templates to achieve new functionalities to protein-based data storage."
This research was supported by the Collaborative Research Fund and Research Impact Fund from the Hong Kong Research Grants Council.