< (From bottom left) KAIST Ph.D. Candidate Yoonho Lee, Integrated M.S./Ph.D. Candidate Sein Kim, Ph.D. Candidate Sungwon Kim, Ph.D. Candidate Junseok Lee, Ph.D. Candidate Yunhak Oh, (From top right) Ph.D. Candidate Namkyeong Lee, UNC Chapel Hill Ph.D. Candidate Sukwon Yun, Emory University Professor Carl Yang, KAIST Professor Chanyoung Park >
Federated Learning was devised to address the difficulty of gathering sensitive personal data, such as patient medical records or financial records, in one place. However, when each institution adapts the collaboratively trained AI to its own environment, a limitation emerges: the model becomes overly specialized to that institution's data and handles new data poorly. Our university research team has presented a solution to this problem and confirmed its stable performance not only in security-critical fields like hospitals and banks but also in rapidly changing environments such as social media and online shopping.
KAIST announced on October 15th that the research team led by Professor Chanyoung Park of the Department of Industrial and Systems Engineering has developed a new learning method that fundamentally solves the chronic performance degradation problem of Federated Learning, significantly enhancing the generalization performance of AI models.
Federated Learning is a method that allows multiple institutions to jointly train an AI model without directly exchanging data. A problem arises, however, when each institution fine-tunes the resulting joint model on its local data: the broad knowledge acquired during collaboration is diluted, leading to a 'Local Overfitting' problem in which the AI becomes excessively adapted to the data characteristics of that specific institution.
For example, if several banks jointly build a 'Collaborative Loan Review AI,' and one specific bank performs fine-tuning focusing on corporate customer data, that bank's AI becomes strong in corporate reviews but suffers from local overfitting, leading to degraded performance in reviewing individual or startup customers.
Professor Park's team introduced synthetic data to solve this. They extracted only core, representative features from each institution's data to generate virtual data containing no personal information, and applied this data during the fine-tuning process. As a result, each institution's AI can strengthen its expertise on its own data without sharing personal information, while retaining the broad perspective (generalization performance) gained through collaborative learning.
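The generation step described above can be illustrated with a simple sketch. One common way to build synthetic data from "core, representative features" is to share only per-class summary statistics (means and deviations) and sample new points from them, so no raw record ever leaves an institution. The paper's actual generation method is more sophisticated; the function names and Gaussian sampling here are purely illustrative assumptions.

```python
import random
import statistics

def summarize(records):
    """Per-class mean and stdev of each feature: the only 'representative'
    statistics that would be shared, never the raw records themselves."""
    by_class = {}
    for features, label in records:
        by_class.setdefault(label, []).append(features)
    stats = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # transpose rows into per-feature columns
        stats[label] = [(statistics.mean(c), statistics.pstdev(c)) for c in columns]
    return stats

def sample_synthetic(stats, n_per_class, seed=0):
    """Draw virtual records from the shared class statistics."""
    rng = random.Random(seed)
    out = []
    for label, feat_stats in stats.items():
        for _ in range(n_per_class):
            out.append(([rng.gauss(m, s) for m, s in feat_stats], label))
    return out

# Toy example: 2-feature records from two classes
real = [([1.0, 2.0], 0), ([1.2, 1.8], 0), ([5.0, 5.5], 1), ([4.8, 5.7], 1)]
synthetic = sample_synthetic(summarize(real), n_per_class=3)  # 6 virtual records
```

The key property is that `synthetic` preserves class-level structure (useful for fine-tuning) while no individual row of `real` is reconstructable from it.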
< Figure 1. Federated Learning is a distributed learning method where multiple institutions collaboratively train a joint Artificial Intelligence model without directly sharing their data. Each institution trains its individual AI model using its local data (Institution 1, 2, 3 Data). Afterward, only the trained model information, not the original data, is securely aggregated to a central server to construct a high-performing 'Joint AI Model.' This method allows for the effect of training with diverse data while protecting the privacy of sensitive information >
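The server-side aggregation described in Figure 1 is commonly realized with FedAvg-style weighted averaging: each institution sends only its trained parameters, and the server averages them weighted by local data size. The article does not specify the aggregation rule used in the paper, so this is a generic sketch with illustrative names.

```python
def fedavg(client_weights, client_sizes):
    """Average each model parameter across clients, weighted by how much
    data each client trained on. Only parameters are exchanged, never data."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    global_weights = []
    for p in range(n_params):
        avg = sum(w[p] * (n / total) for w, n in zip(client_weights, client_sizes))
        global_weights.append(avg)
    return global_weights

# Three institutions with single-parameter "models" for illustration
weights = [[1.0], [2.0], [3.0]]
sizes = [100, 100, 200]        # institution 3 holds twice as much data
print(fedavg(weights, sizes))  # [2.25]
```

In a real system each entry of `client_weights` would be a full tensor of neural-network parameters rather than a scalar, but the weighted average is applied the same way.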
< Figure 2. The Local Overfitting problem occurs during the process of fine-tuning the 'Joint AI Model' built through Federated Learning with each institution's data. For example, Institution 3 can fine-tune the joint AI with its own data (Type 0, 2) to create an expert AI for those types, but in the process, it forgets the knowledge about data (Type 1) that other institutions had (Information Loss). In this way, each institution's AI becomes optimized only for its own data, gradually losing the ability (generalization performance) to solve other types of problems that were obtained through collaboration. >
The research results showed that this method is particularly effective in fields where data security is crucial, such as healthcare and finance, and that it also delivers stable performance in environments where new users and products are continuously added, like social media and e-commerce. The AI maintained stable performance even when a new institution joined the collaboration or data characteristics changed rapidly.
< Figure 3. The technology proposed by the research team solves the local overfitting problem by utilizing Synthetic Data. When each institution fine-tunes its AI with its own data, it simultaneously trains with 'Global Synthetic Data' created from the data of other institutions. This synthetic data acts as a kind of 'Vaccine' to prevent the AI from forgetting information not present in the local data (e.g., Type 2 in the image), helping the AI to gain expertise on specific data while retaining a broad view (generalization performance) to handle other types of data. >
Professor Chanyoung Park of the Department of Industrial and Systems Engineering said, "This research opens a new path to simultaneously ensure both expertise and versatility for each institution's AI while protecting data privacy," and "It will be a great help in fields where data collaboration is essential but security is important, such as medical AI and financial fraud detection AI."
The paper was first-authored by Sungwon Kim of the Graduate School of Data Science, with Professor Chanyoung Park as the corresponding author. It was recognized for its excellence with selection for an Oral Presentation, reserved for the top 1.8% of submissions, at the International Conference on Learning Representations (ICLR) 2025, a top-tier academic conference in the field of Artificial Intelligence held in Singapore last April.
※ Paper Title: Subgraph Federated Learning for Local Generalization, https://doi.org/10.48550/arXiv.2503.03995
Meanwhile, this research is a result of projects supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) — the 'Robust, Fair, and Scalable Data-Centric Continual Learning' project, the National Research Foundation of Korea (NRF) — the 'Graph Foundation Model: Graph-based Machine Learning Applicable to Various Modalities and Domains' project, and the 'Data Science Convergence Talent Fostering Program.'