New MatterChat Model Lets AI See Science Language

Berkeley Lab

From writing emails to generating computer code, much of the artificial intelligence prevalent in our daily lives has succeeded by mastering one domain: text. However, this leaves a major blind spot in the physical sciences, where models depend on the high-resolution, three-dimensional data of the physical world, like the intricate lattice of atoms in a crystal. Delivering on the promise of using AI for science requires teaching these data-driven text models to seamlessly "talk to" physics-based models.

Now, a new AI framework from Lawrence Berkeley National Laboratory (Berkeley Lab), called MatterChat, solves this problem by creating a specialized "bridge." It connects the conversational power of a Large Language Model (LLM) with a physics-based AI that models "interatomic potentials": the complex physical forces between atoms. The resulting system already significantly outperforms general-purpose AI tools like GPT-4 at predicting material properties, and the team hopes it can accelerate scientific discovery by serving as a robust research partner that provides grounded insights and generates step-by-step instructions for synthesizing novel materials.

A paper describing this work was recently published in Nature Machine Intelligence.

"Traditional simulations can provide the physical rigor required for materials science, yet their computational cost remains prohibitive for high-throughput screening. Conversely, while LLMs excel at rapid knowledge synthesis, they inherently lack the 'structural vision' to interpret materials directly from their underlying atomic coordinates," said Yingheng Tang, a postdoctoral researcher in Berkeley Lab's Applied Math and Computational Research Division (AMCR) and lead author on the paper. "MatterChat was built to solve this dilemma, empowering LLMs with a structural 'vision' that allows researchers to leverage their full potential for solving complex, real-world materials challenges."

Empowering language models to solve complex challenges in materials science

To build MatterChat, the Berkeley Lab team drew inspiration from technologies like Vision Question Answering (VQA) and Text-to-Image (T2I) generation. In these tasks, AI must translate high-level text concepts into visual images, or vice versa. Doing so requires developers to build tools that "bridge" two fundamentally different forms of data.

The researchers adapted this concept to the physical sciences. With MatterChat, they created a "bridge model" that successfully connects an LLM's general knowledge with the deep understanding of the atomic-scale world encoded in scientific interatomic potentials.

Until now, researchers using LLMs to solve materials problems usually had to feed them raw data files as if they were just strings of text. It's like asking an AI to understand a complex 3D engine based only on a parts list: the LLM can read the names, but it can't "see" how the atoms fit together in space. MatterChat solves this by training a specialized AI bridge model, pre-trained on millions of crystal structures and an LLM, to align the LLM's representation of the world with the interatomic potential's representation of the world.

The bridge model can translate physical insights into a format the LLM can actually understand. By giving the LLM these "scientific eyes" - a scientific "inductive bias" in the terminology of AI - the Berkeley Lab team has transformed it into a robust research tool capable of providing grounded scientific insights into complex materials challenges, such as predicting thermal stability or analyzing electronic band gaps.

"We think of atoms as living in a physical space, but from a machine learning perspective, they are just vectors living in some very non-trivially structured manifold in a high-dimensional Euclidean space; and, of course, the same is true for the sentences and paragraphs that we use to express our ideas about those atoms," said co-author Michael Mahoney, Berkeley Lab's Scientific Data Division (SDD) AI Initiative Research Lead. "The bridge model basically gets those two structures to 'talk with' each other."

As a proof-of-concept of this general approach, the team trained their bridge model on a dataset curated by pairing nearly 143,000 stable atomic structures from the Materials Project with their corresponding physical properties. This training data was automatically assembled using the Materials Project's API and deliberately enriched with properties fundamental to microelectronics design - like formation energy and bandgap - allowing MatterChat to learn the complex patterns connecting a material's atomic blueprint to its functional performance.
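
To make the pairing step concrete, here is a minimal sketch of how structure records might be turned into question-and-answer training examples. It is an illustration under stated assumptions, not the team's pipeline: the `MaterialRecord` fields, the question wording, the zero-band-gap test for metals, and the two records (placeholder values, not real Materials Project entries) are all hypothetical, and no Materials Project API call is made.

```python
from dataclasses import dataclass


@dataclass
class MaterialRecord:
    """Hypothetical record: one structure identifier plus two of its properties."""
    material_id: str
    formula: str
    formation_energy_ev_per_atom: float
    band_gap_ev: float


def to_qa_pairs(record):
    """Turn one record into (structure id, question, answer) triples."""
    # Assumption for this sketch: a band gap of exactly zero is labeled metallic.
    is_metal = record.band_gap_ev == 0.0
    return [
        (record.material_id, "What is the chemical formula?", record.formula),
        (record.material_id, "What is the formation energy per atom?",
         f"{record.formation_energy_ev_per_atom:.3f} eV/atom"),
        (record.material_id, "What is the band gap?",
         f"{record.band_gap_ev:.2f} eV"),
        (record.material_id, "Is the material metallic?",
         "yes" if is_metal else "no"),
    ]


# Placeholder records with made-up values, used only to exercise the function.
records = [
    MaterialRecord("example-1", "AB2", -1.25, 0.0),
    MaterialRecord("example-2", "XY", -0.5, 2.1),
]

# Flatten every record's triples into one training list.
dataset = [triple for record in records for triple in to_qa_pairs(record)]
```

Each structure contributes several examples, so the same atomic blueprint is seen alongside both its numerical properties and its classification label.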

To validate their model, the researchers benchmarked MatterChat against a suite of other AI systems, from general-purpose LLMs to other specialized scientific AI methods. The results show that MatterChat consistently outperformed its competitors across a range of tasks. The model was more accurate in classifying material types and demonstrated superior precision in predicting numerical properties. For example, it excelled at predicting a material's bandgap, a property critical for designing new electronics, from high-capacity energy storage to next-generation computer chips.

"Our design is significantly more efficient because we don't have to build a massive AI model from the ground up," said co-author Zhi (Jackie) Yao, a research scientist in Berkeley Lab's AMCR. "Instead, we take two powerful, pre-trained models - a structural encoder for materials physics and an open-source LLM - and use them off-the-shelf. The only component we actually train is the lightweight 'bridge model' that translates between them. It's the difference between building an entire car factory and simply designing a smart adapter that connects a world-class engine to a world-class navigation system. This approach is not only computationally efficient, but also makes the system modular, so we can easily upgrade components or adapt the bridge for other scientific domains in the future."

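The "train only the adapter" idea Yao describes can be sketched in a few lines. This is a toy under loud assumptions, not MatterChat's architecture: both frozen components are stand-in functions invented for the example, the bridge is a plain linear map swapped in for whatever bridge model the team actually uses, and the data are random vectors rather than crystal structures. The point is only that gradient updates touch the bridge weights and nothing else.

```python
import random

random.seed(0)
STRUCT_DIM, TEXT_DIM = 3, 2

# Hypothetical frozen "LLM side": a fixed map that defines where each
# structure's description sits in the text embedding space.
_FROZEN_TEXT_MAP = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]


def frozen_structure_encoder(structure):
    """Stand-in for a pre-trained structural encoder; it is never updated."""
    return [2.0 * x for x in structure]


def frozen_text_embedding(structure):
    """Stand-in for the frozen LLM's embedding of the matching description."""
    return [sum(w * x for w, x in zip(row, structure)) for row in _FROZEN_TEXT_MAP]


def bridge(weights, features):
    """The only trainable part: a linear map from structure to text space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def mean_loss(weights, data):
    """Mean squared distance between bridged features and text embeddings."""
    total = 0.0
    for structure in data:
        pred = bridge(weights, frozen_structure_encoder(structure))
        target = frozen_text_embedding(structure)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(data)


def train_bridge(data, epochs=200, lr=0.05):
    """Stochastic gradient descent on the bridge weights only."""
    weights = [[0.0] * STRUCT_DIM for _ in range(TEXT_DIM)]
    for _ in range(epochs):
        for structure in data:
            feats = frozen_structure_encoder(structure)
            target = frozen_text_embedding(structure)
            pred = bridge(weights, feats)
            for i in range(TEXT_DIM):
                err = pred[i] - target[i]
                for j in range(STRUCT_DIM):
                    weights[i][j] -= lr * 2.0 * err * feats[j]
    return weights


# Random vectors standing in for structures.
data = [[random.uniform(-1, 1) for _ in range(STRUCT_DIM)] for _ in range(20)]
initial_loss = mean_loss([[0.0] * STRUCT_DIM for _ in range(TEXT_DIM)], data)
trained = train_bridge(data)
final_loss = mean_loss(trained, data)
```

Because the two stand-in models are plain functions with no trainable state, swapping either one for a better version only requires retraining the small `weights` matrix, which mirrors the modularity argument in the quote.
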
Crucially, this modular design highlights exactly how institutions like Berkeley Lab and the Department of Energy are carving out a highly valuable niche in the booming AI landscape. Rather than competing with Silicon Valley tech giants to build ever-larger language models from scratch, the lab is focusing on the specialized connective tissue that makes commercial AI useful for hardcore science.

Because the bridge model approach underlying MatterChat is forward-compatible, it is perfectly positioned to leverage these parallel tracks of innovation. As Mahoney pointed out, "We expect that industry will continue to develop improved LLMs, and we expect domain scientists and facilities will continue to generate new data. An important part of scientific machine learning is not simply to solve problems on today's data, but instead to develop general methods that will be forward-compatible with orders of magnitude more data, whether from scientific domains or from LLMs."

According to Yao, the MatterChat project, which was initially developed and enhanced with funding from a Berkeley Lab Laboratory Directed Research & Development (LDRD) Program, will now expand its capabilities. In a collaboration with Fermilab, MatterChat is already contributing to a U.S. Department of Energy Genesis Mission project - called Accelerating eXtreme Environment Specs-to-Silicon (AXESS) - that aims to speed up the development of next-generation, high-speed, radiation-hardened detectors for challenging particle physics experiments by using advanced 3D integrated circuits (chiplets) and AI-driven data analysis.

In addition to the LDRD support, the team also credits supercomputing resources at the National Energy Research Scientific Computing Center (NERSC), located at Berkeley Lab, with MatterChat's success. "We are incredibly grateful to NERSC; this research simply would not have happened without access to the Perlmutter supercomputer through their AI for Science program," said Tang. Wenbin Xu, a NERSC postdoctoral fellow at the time, was also a major co-author of the work, as was Benjamin Erichson, a research scientist in Berkeley Lab's SDD, highlighting the benefits of AMCR-SDD collaboration on AI for science.
