Spotting a needle in a haystack is easy compared to Yuejie Chi's typical day.
As a leading researcher on the underpinnings of large language models (LLMs) and other machine learning systems, Chi - the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science in the Faculty of Arts and Sciences, and professor of computer science for Yale Engineering - sifts through multiple haystacks that talk to each other.
Her specialty is eliciting useful information from massive datasets. This involves separating signal from noise, understanding the intricacies of how data is collected, and seeking out the most efficient ways to put data to use. In doing so, she helps improve AI's ability to make predictions and decisions across a range of applications, from medical imaging to materials science.
The process is filled with surprises, she says. Although helpful information is "hidden" within a sea of data, it often has a shape or structure that is predictable once you find it.
"Structure is ubiquitous when dealing with efficiencies in the context of AI, where it can show up in various forms and various places, across data, model, and systems," Chi said.
Indeed, Chi's wide-ranging research has continued to grow since joining the Yale faculty in 2025 (she is also a member of the Yale Institute for Foundations of Data Science, the Wu Tsai Institute, and the Center for Algorithms, Data, and Market Design).
Her work has already led to improvements in imaging algorithms. For example, her research into super-resolution fluorescence microscopy - techniques that enable imaging at higher resolution than standard microscopy - has helped produce better, more detailed images for optical and biomedical research, while using fewer computational resources. She has also conducted important research in phase retrieval, an imaging technique used in crystallography and astronomy.
In an interview, Chi discussed the need for more efficient AI and the joys of blending theoretical research with practical outcomes. The interview has been edited and condensed.
What is the largest or most complicated database you've encountered?
Yuejie Chi: I helped curate a database for Nationwide Children's Hospital in Columbus, Ohio. That database includes sleep study data for more than 3,000 patients who have done multiple sleep study sessions. This was three years' worth of data.
The scale of our dataset was unprecedented at the time, but we envisioned that data of this scale would be of great value for training large machine learning models. Indeed, researchers have been using it to train large foundation models and drawing great insights from it.
How does theory help inform the performance of AI?
Chi: I'll give you a general example. My colleagues and I have been working to understand the mechanics of LLMs, and how the different training paradigms now in use are unlocking the reasoning capabilities of LLMs.
You can think of these different training paradigms and associated data as tuning knobs. What are the efficiencies of these different tuning knobs and how can they be used together to get the best results? Theory can be a key to understanding LLMs and developing more efficient models.
Once you've identified 'hidden structure,' how can it be put to good use?
Chi: One major example is imaging. If we better understand the hidden structure in data, we can often recover high-quality information from fewer or noisier measurements. That could make medical imaging faster, more comfortable, and less demanding for patients - for instance, by reducing the need to stay perfectly still or hold one's breath for long periods during MRI or CT scans. More broadly, identifying useful structure in data can help AI systems do more with less. That means lower computational cost, lower energy use, and better downstream capabilities.
What is something you're working on now that reflects this?
Chi: We are currently working on leveraging diffusion models for materials imaging in collaboration with researchers at the U.S. Air Force Research Lab. These are generative AI models that learn structure in data and can capture complex data patterns very effectively. This opens up possibilities to use diffusion models to dramatically accelerate materials imaging, which can be very time-consuming.
Is there another ongoing line of research you're particularly excited about?
Chi: Yes, there are several! But one that I am especially excited about is reinforcement learning [RL], the machine learning paradigm that learns through trial and error. It perhaps became widely known through game-playing systems, such as AlphaGo.
I have been interested in RL for some time. My focus has been on understanding the efficiency of RL algorithms in various contexts and on bridging the gap between theory and practice. One of our most recent studies examined how RL is used to train language models. In fact, I will be offering a new graduate-level course on RL next semester.