AlphaFold Database Expands with Protein Complex Insights

Four-way collaboration brings together world-leading AI and biological expertise to make AI-predicted protein complex structures openly available to the global scientific community

Slime mold Dictyostelium discoideum protein complex Q55DI5 (AF-0000000066503175), annotated as a transcription elongation factor. The single chain looks disordered, but the homodimer reveals that two chains intertwine, each contributing half a domain to form a stable fold. This illustrates how predicting protein complexes reveals biology that single-protein models miss. Credit: AlphaFold Database, background by Karen Arnott/EMBL-EBI

A new collaboration between EMBL's European Bioinformatics Institute (EMBL-EBI), Google DeepMind, NVIDIA, and Seoul National University has made millions of AI-predicted protein complex structures openly available through the AlphaFold Database. To maximise global health impact, the dataset prioritises proteins important for understanding human health and disease. This is the largest dataset of protein complex predictions currently available.

Proteins are the building blocks of life. They interact to create protein complexes, which fulfil biological functions. By visualising protein interactions, scientists can uncover the molecular mechanisms that drive cell behaviour, identify what goes wrong when someone gets sick, and develop new drugs and therapies. Predicting the structure of protein complexes is extremely challenging because, in nature, proteins change shape and interact in many different ways.

"Science thrives on collaboration," said Jo McEntyre, Interim Director of EMBL-EBI. "By making this foundational protein complex dataset openly available to the world, we're inviting researchers to test, refine, and build on it to drive the next wave of biological discoveries."

Protein complexes for global health impact

The latest AlphaFold Database update spans millions of homodimers, which are protein complexes formed of two identical proteins. It focuses on 20 of the most studied species, including humans, as well as pathogens on the World Health Organization's priority pathogens list. This approach aims to bring significant and immediate value for global health challenges.

"By expanding the AlphaFold Database to include protein complexes, we are addressing a critical need expressed by the scientific community," said Anna Koivuniemi, Head of the Google DeepMind Impact Accelerator. "We hope that by lowering the barrier to these complex predictions, we can empower researchers everywhere to pursue the next wave of discoveries that could ultimately improve human health on a global scale."

Scientific expertise meets technical innovation

The collaboration builds on Google DeepMind's AI system AlphaFold, which, since 2021, has accurately predicted the structure of millions of proteins. To democratise access to AlphaFold predictions, Google DeepMind and EMBL-EBI developed the AlphaFold Database, an open resource that anyone can access. The database has over 3.4 million users from 190 countries.

Through ongoing dialogue with the scientific community, a clear need emerged to expand the AlphaFold Database to include protein complexes. In response, EMBL-EBI, Google DeepMind, NVIDIA, and Seoul National University teamed up, contributing specialist expertise and resources, to calculate and integrate millions of protein complexes into the AlphaFold Database.

The collaboration brought together deep biological expertise and technical innovations. NVIDIA and the Steinegger Lab at Seoul National University developed the methodology, based on Google DeepMind's AI system AlphaFold, including accelerations to multiple sequence alignment calculations and deep learning inference. NVIDIA provided cutting-edge AI infrastructure and scaled out inference pipelines to overcome limitations that historically made calculations at this scale challenging. EMBL-EBI enabled the collaboration by bringing the other parties together and contributing expertise in scientific and biodata management, as well as analysis. As a champion of open science, EMBL-EBI, together with Google DeepMind, integrated the new dataset into the AlphaFold Database.

"NVIDIA's ambition is to consistently contribute orders-of-magnitude accelerations for fundamental digital biology workloads, enabling what was not possible before," said Anthony Costa, NVIDIA Director of Digital Biology. "This release is a great example of how AI infrastructure and software can uniquely enable new scales of biological understanding."

"By making predicted protein complexes accessible at an unprecedented scale, we are illuminating an unseen landscape of molecular interactions across the tree of life," explained Martin Steinegger, Associate Professor at Seoul National University.

Open science at scale

It takes a blend of AI-scale infrastructure and deep technical knowledge in accelerating complex workflows to generate AI predictions for protein complexes at this scale. The collaboration is centrally hosting data that would otherwise require around 17 million hours of GPU (graphics processing unit) computing to recreate.
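To put the 17 million GPU-hour figure in perspective, the short Python sketch below converts it into GPU-years and into wall-clock time on a cluster. The cluster size of 1,000 GPUs is a purely hypothetical value chosen for illustration; the release does not state how many GPUs were used.

```python
# Rough scale conversion for the GPU-hour figure quoted in the release.
GPU_HOURS = 17_000_000        # "around 17 million hours", as stated
HOURS_PER_YEAR = 24 * 365     # non-leap year, for a simple estimate

def gpu_years(gpu_hours: float) -> float:
    """Years a single GPU would need if it ran without interruption."""
    return gpu_hours / HOURS_PER_YEAR

def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Days of wall-clock time, assuming perfect scaling across num_gpus."""
    return gpu_hours / num_gpus / 24

# ASSUMPTION: 1,000 GPUs is a hypothetical cluster size, not from the release.
HYPOTHETICAL_GPUS = 1_000

single_gpu_years = gpu_years(GPU_HOURS)
cluster_days = wall_clock_days(GPU_HOURS, HYPOTHETICAL_GPUS)

print(f"Single GPU: about {single_gpu_years:,.0f} years")
print(f"{HYPOTHETICAL_GPUS:,} GPUs: about {cluster_days:,.0f} days")
```

Perfect scaling is itself an idealisation, so real wall-clock time on such a cluster would likely be longer.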

By making these calculations once and adding the information into the AlphaFold Database, this collaboration aims to help democratise access to protein complex predictions. It enables scientists everywhere to investigate how proteins interact in the vast protein universe, and accelerate discoveries that could lead to new medicines, new products, and a deeper understanding of life itself.

This is the first step in an ambition to add a wide range of protein complex structure predictions to the AlphaFold Database. The partnership has already calculated predictions for 30 million complexes. Of these, 1.7 million high-confidence homodimer predictions have been added to the AlphaFold Database. Another 18 million are lower-confidence homodimers, which will be made available as a list and for bulk download from the EMBL-EBI FTP server in the coming days. The rest are heterodimers, currently being analysed and assessed. More protein complex predictions will be calculated and high-confidence predictions will be added to the AlphaFold Database in the coming months. The work is described in more detail in this preprint.
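The breakdown above implies a heterodimer count that the release does not state directly. The Python sketch below works it out from the rounded figures given, which puts the remainder at roughly 10.3 million. The inputs are approximations, so the result is an estimate and not an official number.

```python
# Figures as stated in the release, in millions of predicted complexes.
TOTAL_COMPLEXES = 30.0
HIGH_CONFIDENCE_HOMODIMERS = 1.7    # already in the AlphaFold Database
LOWER_CONFIDENCE_HOMODIMERS = 18.0  # planned for list and bulk download

def remaining_heterodimers(total: float, high_conf: float, low_conf: float) -> float:
    """Complexes left once both homodimer groups are subtracted from the total."""
    return total - high_conf - low_conf

heterodimers = remaining_heterodimers(
    TOTAL_COMPLEXES, HIGH_CONFIDENCE_HOMODIMERS, LOWER_CONFIDENCE_HOMODIMERS
)

# Derived estimate only: the inputs are rounded figures from the release.
breakdown = {
    "High-confidence homodimers": HIGH_CONFIDENCE_HOMODIMERS,
    "Lower-confidence homodimers": LOWER_CONFIDENCE_HOMODIMERS,
    "Heterodimers (estimated)": heterodimers,
}

for label, millions in breakdown.items():
    share = millions / TOTAL_COMPLEXES
    print(f"{label}: ~{millions:.1f} million ({share:.0%} of total)")
```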

"The human genome has just over 20,000 different proteins. Despite this relatively small genome, human beings display incredibly complex pathways, processes and regulation. Much of this complexity arises from the intermolecular interactions between proteins, and with small molecule ligands and DNA. Adding predicted protein-protein homodimeric interactions to the AlphaFold Database is a first step towards a comprehensive description of the human interactome, the basis by which human biology will be described and understood. This has relevance for the design of new therapeutics, understanding host-pathogen interactions, and more. Making these structures accessible to all allows every researcher around the world to build on these data, moving one step closer to predicting the biology of life," said Dame Janet Thornton, Director Emeritus of EMBL-EBI.
