Human gene maps contain major blind spots because they were built largely from the DNA sequences of people with European ancestry, according to a study published today in Nature Communications.
Researchers uncovered thousands of missing transcripts (the RNA molecules that carry a gene's instructions) in people from populations in Africa, Asia and the Americas, possibly including products of entirely new genes that scientists have yet to discover.
Some of these transcripts also appear in genes already linked to conditions that differ between ancestries, including lupus, rheumatoid arthritis, asthma, and cholesterol-related traits.
The findings suggest that part of the reason some diseases occur more often, or behave differently, in certain populations may be that their genes produce different transcripts, and potentially different proteins, through processes such as splicing. These molecular variations have been effectively invisible in current gene maps, leaving potentially important insights into disease risk hidden from view.
"Gene maps are used by scientists every day, but we've been leaving out huge sections of the world's population. This study shows, for the first time, how much we've been missing," says first author Pau Clavell-Revelles of the Barcelona Supercomputing Center (BSC) and Centre for Genomic Regulation (CRG).
The legacy of Eurocentric genetics
The first draft of the human genome, published in 2001, was a landmark scientific achievement, but it had limitations. The sequence alone did not reveal where the genes were, how many existed, or how a single gene could produce multiple versions of a protein through splicing, the process by which cells cut and stitch together genetic instructions.
To solve this, gene annotation maps were built. These are detailed catalogues showing the position of every human gene and the full set of RNA transcripts produced from them. Projects such as GENCODE turned the three billion letters of the genome into something interpretable, helping scientists understand which regions drive disease and how genetic differences between people might matter.
But these maps inherited a blind spot. Although any two humans are 99.9% genetically identical, the remaining fraction reflects our evolutionary history. Some groups have lived apart for tens of thousands of years and accumulated distinct variants shaped by environment, chance and geography. Those differences are real but not well documented.
The human genome reference, and many of the gene annotations built on top of it, were derived mostly from individuals of European ancestry. As a result, population-specific biology from Africa, Asia, Oceania and the Americas was never fully represented in gene maps.
That means much of what scientists know about how cells use genes is based on a narrow slice of humanity, leaving important transcripts and potential clues to disease effectively invisible.
"Most gene sequencing so far has come from European individuals, so the reference catalogues we rely on may be missing genes or transcripts that exist only in non-European populations," says Dr. Roderic Guigó, senior co-author of the study and researcher at the Centre for Genomic Regulation and University Pompeu Fabra in Barcelona. "If a genetic variant falls in one of these missing genes, we assume it has no biological effect. In some cases, that assumption may simply be wrong," he adds.
Long-read RNA sequencing uncovers hidden biology
To uncover what was missing from existing gene maps, the researchers focused on transcripts, the RNA molecules that show how genes are used inside human cells. They used long-read sequencing, a technology that can read entire RNA molecules from end to end. Earlier methods captured only tiny fragments, making transcript reconstruction extremely difficult and leading to ambiguous outcomes, one of the key reasons this question couldn't be addressed until now.
The team analysed blood cells from 43 people across eight populations: Yoruba (Nigeria), Luhya (Kenya), Mbuti (Congo), Han Chinese, Indian Telugu, Peruvians in Lima, Ashkenazi Jewish and Utah Europeans. These groups are also part of the 1000 Genomes Project, meaning their DNA is already well mapped, allowing the new RNA data to be compared directly.
The researchers identified 41,000 potential transcripts missing from the official GENCODE gene maps. Out of the transcripts coming from known protein-coding genes, 41% are predicted to encode different versions of existing proteins. In other words, the study revealed thousands of protein variants that had never been catalogued before.
One example is the gene SUB1, involved in essential cellular processes such as DNA repair. The researchers found that individuals of Peruvian ancestry produce a different transcript of SUB1. This altered RNA molecule changes the resulting protein, yet it was absent from all existing gene annotations.
When the team grouped the data by ancestry, they found a clear pattern: non-European samples contained far more previously unseen transcripts than European ones. In total, the study found 2,267 population-specific transcripts, that is, RNA molecules present in one population but absent from all others. For European groups, most of these were already known. For non-European groups, most were entirely new.
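As a rough illustration of what "population-specific" means here, the sketch below takes a toy table of the populations in which each transcript was detected and keeps only those seen in exactly one. The transcript IDs and detections are invented for illustration, and the study's actual criteria (for example, expression thresholds or minimum numbers of individuals) are not described in this article.

```python
# Toy data: transcript ID -> set of populations where it was detected.
# All IDs and detections are hypothetical, not taken from the study.
detections = {
    "tx_A": {"Yoruba"},
    "tx_B": {"Yoruba", "Han Chinese", "Utah European"},
    "tx_C": {"Peruvian"},
    "tx_D": {"Utah European"},
}


def population_specific(detections):
    """Return {transcript: population} for transcripts seen in exactly one population."""
    return {
        tx: next(iter(pops))
        for tx, pops in detections.items()
        if len(pops) == 1
    }


specific = population_specific(detections)
for tx, pop in sorted(specific.items()):
    print(tx, pop)
```

In this toy table, tx_B is shared across populations and is dropped, while the other three are each tied to a single population.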
Of the newly identified transcripts, 773 appear to come from previously unrecognised gene loci, suggesting they may be the products of gene regions that scientists did not know existed.
The team also tested whether using each person's own DNA sequence as the reference could uncover even more missing transcripts. They found that switching from the standard reference genome to personalised ones revealed hundreds of additional transcripts per individual, with the biggest gains in people of African ancestry.
While confirming existing biases in gene maps, this part of the study also shows how relying on a single, universal reference genome can mask biologically meaningful variation in how people's genes are used.
Why the missing transcripts matter
To understand why these missing transcripts matter, the researchers next looked at something called allele-specific transcript usage. Each person carries two copies of most genes, one from each parent. Sometimes, these two copies produce different transcripts, and these differences can influence how the gene works.
However, these effects can only be detected if all the transcripts that actually exist are catalogued in the gene maps. If important transcripts are missing, the effects are invisible.
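The point can be sketched with a toy calculation. The read counts, transcript names and the simplification that reads from uncatalogued transcripts are simply dropped are all assumptions made for illustration; this is not the study's method, and real tools may handle such reads differently.

```python
# Hypothetical long-read counts for one gene, split by parental copy (allele).
# "tx_novel" stands in for a transcript missing from the reference gene map.
reads = {
    "allele_1": {"tx_known_1": 45, "tx_known_2": 45, "tx_novel": 10},
    "allele_2": {"tx_known_1": 20, "tx_known_2": 20, "tx_novel": 60},
}


def usage(counts, catalogue):
    """Fraction of an allele's reads going to each catalogued transcript.

    Assumption: reads from transcripts outside the catalogue are discarded.
    """
    kept = {tx: n for tx, n in counts.items() if tx in catalogue}
    total = sum(kept.values())
    return {tx: n / total for tx, n in kept.items()}


def max_usage_gap(reads, catalogue):
    """Largest difference in transcript usage between the two alleles."""
    u1 = usage(reads["allele_1"], catalogue)
    u2 = usage(reads["allele_2"], catalogue)
    return max(abs(u1.get(tx, 0.0) - u2.get(tx, 0.0)) for tx in catalogue)


incomplete = {"tx_known_1", "tx_known_2"}   # gene map without the novel transcript
complete = incomplete | {"tx_novel"}        # gene map with it added

gap_incomplete = max_usage_gap(reads, incomplete)
gap_complete = max_usage_gap(reads, complete)
print(gap_incomplete, gap_complete)
```

With the incomplete catalogue, both copies of the gene appear to split their reads evenly between the two known transcripts, so no difference is visible. Once the novel transcript is added, the two copies differ by about half of their reads.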
By adding the thousands of newly discovered transcripts to existing gene maps, the team were able to detect many more genetic effects that influence how genes behave, especially in people of non-European ancestry.
"We found that many novel ancestry-biased transcripts occur in genes already associated with autoimmune diseases, asthma and metabolic traits," says Dr. Marta Melé, senior co-author of the study and Group Leader at the BSC.
Dr. Melé explains that this doesn't mean the transcripts themselves cause the differences in disease but rather help scientists see genetic signals that were previously hidden. Without these transcripts in the reference maps, researchers would miss key information about why certain diseases are more common, or act differently, in some groups than others.
Towards a human 'pantranscriptome'
The researchers emphasise that their work is only a first step with important limitations. The study looked at just one cell type taken from one tissue, and from only 43 individuals. Many parts of the world are not represented at all, and none of the body's most complex organs were examined.
Yet despite the narrow window of human biology explored, the team still found tens of thousands of transcripts that had slipped through the cracks of official gene maps. For Dr. Fairlie Reese, the contrast between the small scope of the study and the size of what it uncovered is striking. "We firmly believe that any findings that we made here are really just the tip of the iceberg," says Dr. Reese, postdoctoral researcher at the BSC.
The authors of the study call for a rethink of how maps of human biology are built, so that they truly reflect humanity. In recent years, large international efforts such as the Human Pangenome Project have begun to expand the reference genome, capturing far more of the DNA diversity found around the world.
However, DNA is only the instruction manual. To understand how those instructions are used, the research community also need a human pantranscriptome: the complete catalogue of all RNA molecules used across all tissues, all life stages and all populations.
"The pangenome tells us about DNA diversity, essentially, it's a book of instructions. The pantranscriptome tells us which words are important in each cell of our body. Both are essential for fully understanding human diversity," says Dr. Melé.
Building such a resource is a mammoth task. The current study alone produced more than 10 terabytes of data and 800 million full-length RNA sequences, one of the largest datasets of its kind. Processing it required advanced machine-learning tools and the power of the BSC's MareNostrum 5 supercomputer. Scaling this up to hundreds of tissues and thousands of individuals would demand computational capacity and global coordination on an entirely different scale.
But the researchers say the ambition is worth it. "We hope our study serves as a foundation and an invitation for the global scientific community to contribute data, methods, and diverse populations. Only through a collective effort will we achieve a truly complete and inclusive map of human biology, which is essential for fair and accurate genomic medicine," concludes Dr. Melé.