URBANA, Ill. – Genes are the building blocks of life, and the genetic code provides the instructions for the complex processes that make organisms function. But how and why did it come to be the way it is? A recent study from the University of Illinois Urbana-Champaign sheds new light on the origin and evolution of the genetic code, providing valuable insights for genetic engineering and bioinformatics.
"We find the origin of the genetic code mysteriously linked to the dipeptide composition of a proteome, the collective of proteins in an organism," said corresponding author Gustavo Caetano-Anollés, professor in the Department of Crop Sciences, the Carl R. Woese Institute for Genomic Biology, and Biomedical and Translation Sciences of Carle Illinois College of Medicine at U. of I.
Caetano-Anollés' work focuses on phylogenomics, which is the study of evolutionary relationships between the genomes of organisms. His research team previously built phylogenetic trees mapping the evolutionary timelines of protein domains (structural units in proteins) and transfer RNA (tRNA), an RNA molecule that delivers amino acids to the ribosome during protein synthesis. In this study, they explored the evolution of dipeptide sequences (basic modules of two amino acids linked by a peptide bond), finding the histories of domains, tRNA, and dipeptides all match.
Life on Earth began 3.8 billion years ago, but genes and the genetic code did not emerge until 800 million years later, and there are competing theories about how it happened.
Some scientists believe RNA-based enzymatic activity came first, while others suggest proteins first started working together. The research of Caetano-Anollés and his colleagues over the past decades supports the latter view, showing that ribosomal proteins and tRNA interactions appeared later in the evolutionary timeline.
Life runs on two codes that work hand in hand, Caetano-Anollés explained. The genetic code stores instructions in nucleic acids (DNA and RNA), while the protein code tells enzymes and other molecules how to keep cells alive and running. Bridging the two is the ribosome, the cell's protein factory, which assembles amino acids carried by tRNA molecules into proteins. The enzymes that load the amino acids onto the tRNAs are called aminoacyl tRNA synthetases. These synthetase enzymes serve as guardians of the genetic code, monitoring that everything works properly.
"Why does life rely on two languages – one for genes and one for proteins?" Caetano-Anollés asked. "We still don't know why this dual system exists or what drives the connection between the two. The drivers couldn't be in RNA, which is functionally clumsy. Proteins, on the other hand, are experts in operating the sophisticated molecular machinery of the cell."
The proteome appeared to be a better fit to hold the early history of the genetic code, with dipeptides playing a particularly significant role as early structural modules of proteins. There are 400 possible dipeptide combinations whose abundances vary across different organisms.
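The figure of 400 follows from pairing each of the 20 standard amino acids with every one of the 20, order included (20 × 20). A minimal sketch of that enumeration, using the conventional one-letter amino acid codes; the function name `all_dipeptides` is introduced here for illustration and does not come from the study:

```python
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def all_dipeptides(alphabet=AMINO_ACIDS):
    """Return every ordered pair of residues as a two-letter string."""
    return ["".join(pair) for pair in product(alphabet, repeat=2)]

dipeptides = all_dipeptides()
print(len(dipeptides))  # 20 * 20 ordered pairs
```

Order matters in this count: alanine-leucine (AL) and leucine-alanine (LA) are listed as two distinct dipeptides.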
The research team analyzed a dataset of 4.3 billion dipeptide sequences across 1,561 proteomes representing organisms from the three superkingdoms of life: Archaea, Bacteria, and Eukarya. They used the information to construct a phylogenetic tree and a chronology of dipeptide evolution. They also mapped the dipeptides to a tree of protein structural domains to see if similar patterns arose.
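To give a rough sense of what a dipeptide abundance profile looks like, the sketch below tallies dipeptides in a toy proteome. It assumes dipeptides are counted as overlapping two-residue windows along each protein sequence; the article does not describe the study's actual counting procedure, and the sequences and function names here are hypothetical.

```python
from collections import Counter

def count_dipeptides(sequence):
    """Count overlapping two-residue windows in one protein sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

def proteome_dipeptide_counts(proteins):
    """Sum dipeptide counts over all protein sequences in a proteome."""
    total = Counter()
    for seq in proteins:
        total.update(count_dipeptides(seq))
    return total

# Hypothetical toy proteome, not real data.
toy_proteome = ["MALLA", "ALAL"]
counts = proteome_dipeptide_counts(toy_proteome)
print(counts.most_common())
```

Repeating such a tally for each proteome yields one abundance profile per organism, the kind of per-organism variation the article describes.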
In previous work, the researchers had built a phylogeny of tRNA that helped provide a timeline of the entry of amino acids into the genetic code, categorizing amino acids into three groups based on when they appeared. The oldest were Group 1, which included tyrosine, serine, and leucine, and Group 2, with 8 additional amino acids. These two groups were associated with the origin of editing in synthetase enzymes, which corrected inaccurate loading of amino acids, and an early operational code, which established the first rules of specificity, ensuring each codon corresponds to a single amino acid. Group 3 included amino acids that came later and were linked to derived functions related to the standard genetic code.
The team had already demonstrated the co-evolution of synthetases and tRNA in relation to the appearance of amino acids. Now, they could add dipeptides to the analysis.
"We found the results were congruent," Caetano-Anollés explained. "Congruence is a key concept in phylogenetic analysis. It means that a statement of evolution obtained with one type of data is confirmed by another. In this case, we examined three sources of information: protein domains, tRNAs, and dipeptide sequences. All three reveal the same progression of amino acids being added to the genetic code in a specific order."
Another novel finding was duality in the appearance of dipeptide pairs. Each dipeptide combines two amino acids, for example, alanine-leucine (AL), while a symmetrical one — an anti-dipeptide — has the opposite combination of leucine-alanine (LA). The two dipeptides in a pair are complementary; they can be considered mirror images of each other.
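The pairing described above can be illustrated by reversing the order of the two residues. The sketch below is illustrative only: the helper names are hypothetical, and treating a dipeptide of two identical residues (such as AA) as its own mirror image is an assumption made here, not a statement from the study.

```python
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def anti_dipeptide(dipeptide):
    """Return the dipeptide with its two residues in the opposite order."""
    return dipeptide[::-1]

def dipeptide_pairs(alphabet=AMINO_ACIDS):
    """Split all dipeptides into mirror pairs and self-symmetric cases."""
    pairs, self_symmetric = set(), set()
    for a, b in product(alphabet, repeat=2):
        d = a + b
        if a == b:
            self_symmetric.add(d)  # e.g. AA reads the same in both directions
        else:
            # Sort so that (AL, LA) and (LA, AL) collapse into one pair.
            pairs.add(tuple(sorted((d, anti_dipeptide(d)))))
    return pairs, self_symmetric

pairs, self_symmetric = dipeptide_pairs()
print(anti_dipeptide("AL"), len(pairs), len(self_symmetric))
```

Under these assumptions, the 400 dipeptides split into 190 dipeptide/anti-dipeptide pairs plus 20 dipeptides that are their own reverse.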
"We found something remarkable in the phylogenetic tree," Caetano-Anollés said. "Most dipeptide and anti-dipeptide pairs appeared very close to each other on the evolutionary timeline. This synchronicity was unanticipated. The duality reveals something fundamental about the genetic code with potentially transformative implications for biology. It suggests dipeptides were arising encoded in complementary strands of nucleic acid genomes, likely minimalistic tRNAs that interacted with primordial synthetase enzymes."
Dipeptides did not arise as arbitrary combinations but as critical structural elements that shaped protein folding and function. The study suggests that dipeptides represent a primordial protein code emerging in response to the structural demands of early proteins, alongside an early RNA-based operational code. This process was shaped by co-evolution, molecular editing, catalysis, and specificity, ultimately giving rise to the synthetase enzymes, the modern guardians of the genetic code.
Uncovering the evolutionary roots of the genetic code deepens our understanding of life's origin, and it informs modern fields such as genetic engineering, synthetic biology, and biomedical research.
"Synthetic biology is recognizing the value of an evolutionary perspective. It strengthens genetic engineering by letting nature guide the design. Understanding the antiquity of biological components and processes is important because it highlights their resilience and resistance to change. To make meaningful modifications, it is essential to understand the constraints and underlying logic of the genetic code," Caetano-Anollés said.
The paper, "Tracing the origin of the genetic code and thermostability to dipeptide sequences in proteomes," is published in the Journal of Molecular Biology [DOI: 10.1016/j.jmb.2025.169396]. Authors include Minglei Wang, M. Fayez Aziz and Gustavo Caetano-Anollés.
The study was supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791), the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) and Blue Waters supercomputer allocations from the National Center for Supercomputing Applications to Caetano-Anollés.