U-M researchers build unique dataset of chemical reactions to streamline drug research, address supply chain challenges, power artificial intelligence
Study: A 50,688-Reaction Data Set Reveals General Ligands and Mechanistic Diversity in C-N Couplings (DOI: 10.1021/jacs.6c05959)
Developing new medicines can require thousands of chemistry experiments to identify the right recipe for a safe, effective and ideally affordable drug.
The process is slow and labor-intensive, and many of the reactions depend on hard-to-source metals that act as essential catalysts.
While artificial intelligence is helping speed up drug discovery, it can only learn from the data available. When it comes to chemical reactions, the large, high-quality datasets needed to train powerful AI tools aren't there.

That's where Tim Cernak and his team at the University of Michigan College of Pharmacy come in.
They created an open-access database of more than 50,000 carefully designed chemistry experiments, testing thousands of combinations of ingredients and conditions to better understand the reactions that form carbon-nitrogen bonds, essential building blocks of many medicines.
The database is the largest body of chemical reaction data to date and just the start of what could grow into a much larger library of chemical reaction conditions that will feed AI systems, Cernak said.
"Building the platform that could pull this off has taken over a decade, but it's still just scratching the surface," said Cernak, Associate Professor of Medicinal Chemistry at the College of Pharmacy.
The data is freely available to scientists through the Open Reaction Database, a site for sharing reactions.
"We are excited about the discoveries that other scientists can make within this new dataset," Cernak said. "There's so much data to mine."
Giving researchers and AI systems access to more reaction data can help identify promising ways to make medicines more quickly and efficiently. It can also help scientists find alternatives to expensive or hard-to-source catalysts based on precious metals used in drug manufacturing.
"The latest drugs in the pipeline are raising the bar of sophistication for chemical synthesis. At the same time, supply chains for precious metals and other critical reaction components are being exposed as risks," Cernak said. "Big data drops like this one are going to be needed to build the predictive models that can make better drugs faster."
In addition to sharing all the data, the study, published in the Journal of the American Chemical Society, compares how different catalysts, specifically palladium, nickel and copper, perform under similar conditions. This is important because palladium is the go-to catalyst for many reactions used in drug synthesis, but the supply of palladium is controlled by just a few countries.
However, the study found that certain reactions performed equally well with nickel, and some even with copper catalysts, which can be sourced all over the planet. The database allows researchers to more easily and quickly compare catalysts and reactions.
"One key takeaway was that large, systematically designed reaction datasets can uncover patterns that are difficult to see from traditional scope studies alone," Cernak said. "For example, I never would have predicted that the highly reactive intermediate molecules called arynes could form at such low temperatures, but it was hard to ignore when we saw it hundreds of times. This is exciting as a possibility to synthesize drugs without precious metal catalysts."