A team led by Guoyin Yin at Wuhan University and the Shanghai Artificial Intelligence Laboratory recently proposed a modular machine learning framework based on LoRA fine-tuning. The framework not only accurately predicts individual organic reactions but also enables "one model handling multiple reaction types." Even when reactions are described purely in natural language, its prediction accuracy is comparable to machine learning models built on expert-designed features. The article was published as an open access Research Article in CCS Chemistry, the flagship journal of the Chinese Chemical Society.
Background information:
Predicting the outcomes of organic chemical reactions has long relied heavily on the expertise of experimental chemists, yet accurate and efficient prediction remains challenging. The rise of artificial intelligence (AI) has opened new frontiers in this field, with machine learning (ML) algorithms showing transformative potential for building data-driven predictive frameworks that model complex organic transformations. Current ML methods for reaction prediction fall primarily into two strategies (Figure 1a): traditional feature-based methods and graph neural networks (GNNs). Traditional machine learning relies on manually designed features, such as descriptors computed with density functional theory (DFT) (quantum-mechanically accurate but computationally expensive) or molecular fingerprints (computationally efficient but chemically oversimplified). Graph neural networks overcome these limitations by directly processing molecular graph structures derived from simplified molecular-input line-entry system (SMILES) strings, preserving atomic-level chemical information while avoiding DFT-level computation. Because this architecture extracts features automatically and offers far higher chemical fidelity than descriptor- or fingerprint-based methods, it dominates current ML research in chemistry. However, existing GNN implementations remain bound to a task-specific modeling paradigm, in which a separate model must be built for each reaction type, fundamentally limiting their generalization across the chemical reaction space.
Highlights of this article:
LoRA-Chem transcends these limitations by building on Large Language Models (LLMs) and borrowing the "Low-Rank Adaptation (LoRA)" paradigm from AI image generation (Figure 1b) to create a "shared base model + replaceable LoRA modules" architecture (Figure 1c).
a. Flexible module switching: To predict different reactions, such as Buchwald-Hartwig coupling and Suzuki-Miyaura coupling, one simply swaps in the corresponding LoRA module without retraining the entire model; moreover, a single LoRA module can handle multiple reaction types simultaneously.
b. Natural language interaction: Given the IUPAC name or SMILES string of the reactants, the model first identifies the reaction type and then predicts the outcome via chain-of-thought-style prompting, mirroring how chemists reason through a problem;
c. Higher data efficiency: Only a few thousand training samples are needed, and training completes in half a day on a consumer-grade GPU (such as an RTX 4060 Ti), making the approach well suited to the data-scarce regime of organic reactions.
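The "shared base model + replaceable LoRA modules" idea behind these three points can be sketched in a few lines: a frozen weight matrix W is augmented with a trainable low-rank update B·A, and switching tasks simply means switching (A, B) pairs. The NumPy sketch below is an illustration of the general LoRA technique, not the paper's implementation; the adapter names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                    # hidden size and LoRA rank (r << d)
W = rng.normal(size=(d, d))    # frozen base weight, shared by every task

def make_lora(rank, dim, rng):
    """One adapter: a low-rank down-projection A and up-projection B."""
    A = rng.normal(size=(rank, dim)) * 0.01
    B = np.zeros((dim, rank))  # B starts at zero, so W + B @ A == W initially
    return A, B

# Hypothetical per-reaction adapters; swapping tasks swaps only these.
adapters = {
    "buchwald_hartwig": make_lora(r, d, rng),
    "suzuki_miyaura":   make_lora(r, d, rng),
}

def forward(x, task):
    """Base forward pass plus the task-specific low-rank update."""
    A, B = adapters[task]
    return x @ (W + B @ A).T

x = rng.normal(size=(1, d))
# With freshly initialized adapters (B = 0), every task reproduces the base model:
assert np.allclose(forward(x, "buchwald_hartwig"), x @ W.T)
```

During fine-tuning only A and B would receive gradients, which is why each adapter is small enough to train on a consumer GPU and cheap enough to distribute per reaction family.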
The team validated LoRA-Chem on multiple classic reaction datasets (Figure 2), with key metrics matching or exceeding current mainstream methods: for the sterically hindered meta-C–H functionalization of ortho-alkyl aryl ketones (SHC-oAK), the coefficient of determination (R²) of the predictions reached 0.748; on an out-of-distribution ligand test for the Suzuki-Miyaura coupling, R² reached 0.60. Its multi-task performance is even more noteworthy: after training on multiple reaction prediction tasks simultaneously, single-task performance dropped only slightly and remained on par with dedicated single-task models, breaking the traditional "one model per task" constraint.
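The R² values quoted above are the standard coefficient of determination, which compares the model's squared prediction error against the variance of the observed outcomes (e.g. reaction yields). A self-contained sketch of the metric, using toy numbers rather than the paper's data:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variance around the mean
    return 1.0 - ss_res / ss_tot

# Toy reaction yields (%) vs. hypothetical model predictions:
y_true = [10.0, 35.0, 50.0, 80.0]
y_pred = [12.0, 30.0, 55.0, 78.0]
print(round(r_squared(y_true, y_pred), 3))
```

R² = 1 indicates perfect prediction, while R² = 0 means the model does no better than always predicting the mean yield, which gives a sense of scale for values like 0.748 and 0.60.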
Furthermore, LoRA-Chem adapts readily to stronger backbones: when fine-tuned on the more capable Qwen2.5-7B-Instruct model, the predicted R² for the SHC-oAK reaction improves further to 0.77, suggesting that performance will continue to rise as base models improve.
LoRA-Chem retains the core capabilities of the underlying LLM (Figure 3), opening broad application scenarios. Unlike traditional fine-tuning, which often specializes a model in one area at the expense of others, LoRA-Chem fully preserves the original capabilities of the large language model while focusing on reaction prediction. Tests show that models equipped with LoRA-Chem perform almost identically to the original models on tasks such as mathematical reasoning (GSM8K), language modeling (LAMBADA), and multidisciplinary knowledge (MMLU). This means it can be integrated easily into existing LLM systems as an "intelligent assistant" for synthetic chemistry research.
Summary and Outlook:
Currently, the team has released the LoRA-Chem dataset, training code, and model files (GitHub: https://github.com/flyben97/LoRA-Chem ; Hugging Face: https://huggingface.co/Flyben/LoRA-Chem ) to facilitate further development by researchers worldwide.
In the future, this framework is expected to play a role in fields such as drug synthesis and new material development, helping scientists to quickly screen reaction pathways, reduce experimental trial and error costs, and promote the transformation of organic chemistry research towards a new "data-driven+AI-assisted" model.
---
About the journal: CCS Chemistry is the Chinese Chemical Society's flagship publication, established to serve as the preeminent international chemistry journal published in China. It is an English language journal that covers all areas of chemistry and the chemical sciences, including groundbreaking concepts, mechanisms, methods, materials, reactions, and applications. All articles are diamond open access, with no fees for authors or readers. More information can be found at https://www.chinesechemsoc.org/journal/ccschem .
About the Chinese Chemical Society: The Chinese Chemical Society (CCS) is an academic organization formed by Chinese chemists of their own accord with the purpose of uniting Chinese chemists at home and abroad to promote the development of chemistry in China. The CCS was founded during a meeting of preeminent chemists in Nanjing on August 4, 1932. It currently has more than 120,000 individual members and 184 organizational members. There are 7 Divisions covering the major areas of chemistry: physical, inorganic, organic, polymer, analytical, applied, and chemical education, as well as 31 Commissions, including catalysis, computational chemistry, photochemistry, electrochemistry, organic solid chemistry, environmental chemistry, and many other sub-fields of the chemical sciences. The CCS also has 10 committees, including the Women Chemists Committee and Young Chemists Committee. More information can be found at https://www.chinesechemsoc.org/ .