Yale Probes Chatbot Errors: Can We Trust AI Models?

Yale University

The rapid rise of artificial intelligence (AI) has inserted a new character into people's lives: the chatbot.

Individuals now engage with agentic AI chatbots to perform a growing number of tasks: they can help a person shop for a new laptop, manage email, or plan a vacation.

And while these interactions can save time and increase productivity, they also carry risk. Large language models (LLMs) - the AI systems trained on massive datasets to generate human-like text - are imperfect. They hallucinate. They misinterpret. They make mistakes.

Two multidisciplinary teams of researchers associated with the Center for Algorithms, Data, and Market Design at Yale are pursuing projects that aim to balance the capability and safety of AI systems and improve interactions between users and AI models.

Yale News recently spoke with members of both teams about their research projects.

One team is represented by Dirk Bergemann, the Douglass and Marion Campbell Professor of Economics, and Zhuoran Yang, assistant professor of statistics and data science. The other is represented by Yang Cai, professor of computer science and economics, and Nicole Immorlica, professor of computer science and economics. All are members of Yale's Faculty of Arts and Sciences. Cai and Immorlica are also part of the Yale School of Engineering & Applied Science.

In the conversation, the researchers discussed their efforts to pry open the "black box" of proprietary AI models and use aspects of game theory to determine whether the actions of those models align with the intentions of their users.

The conversation has been edited and condensed.

What questions do these research projects seek to answer?

Yang Cai: Our work concerns a simple but fundamental question: Why does an AI model give you bad information? Is it because the model is misinformed, meaning it lacks the necessary facts or knowledge to answer the question, or is it because it's misaligned with the user's intent?

Dirk Bergemann: In this context, "alignment" means making LLMs behave in the best interest of the human who is engaging with them to complete a task. Ultimately, LLMs are prediction engines that seek to minimize prediction errors.

A central problem is that if you perceive the LLM as being an equal partner - if you were to query it and it gives you an answer - you have to think about whether you want to adopt that answer. Is that a useful answer? Is it based on similar information that you have? Is it based on different information than you have? Why would I have wanted a different answer? Basically, we need to align both on the gain function - that is, what our preference is for this specific task - and on the information that the LLM can access to answer a query.

What's an example of misalignment between an AI model and its user?

Zhuoran Yang: There are AI products designed to act as a personalized assistant that lives on your phone and accesses your email and text messages. People are granting these AI systems access to their personal data. But the systems aren't harnessed in a way that ensures that data is protected. There was a story where an executive at Meta used one of these apps and, without prompting, it deleted all her emails. Obviously, that's an instance of the AI assistant being misaligned. The user didn't instruct the model to delete all her emails, but it perceived the instructions that way.

Dirk and Zhuoran, how does your project approach these issues?

Bergemann: In economics, we are often concerned with bringing together information from many individuals who have private preferences about the right course of social or public action. And in a way, bringing together a decision-maker and an LLM is like a game: the LLM has some private information about its own objective function and what it's been exposed to in terms of the underlying data, and the human needs to elicit that information.

Basically, we want to think about how we can elicit the information from the agents to bring the preferences of the LLM into alignment with the implicit or explicit objective function of the human decision-maker. We have strategies in economics to think about what the game's equilibrium is and which tools can attain high social welfare in this situation.

Yang: As Dirk mentioned, LLMs are prediction machines. They are highly complex, involving billions of parameters. We treat them as black boxes. We provide an input and receive an output. But we don't really know what's happening inside the box.

Our project has three components. The first is to dig inside the black box. There are highly capable open-source LLMs, which are much more transparent than proprietary models. We can look inside them and learn whether they truly understand the human's intent. We can see whether a model takes shortcuts to get the job done or follows its instructions verbatim.

Secondly, LLMs are agentic systems that can accomplish specific tasks with little supervision. We need to make sure that when the model says, "I understand what you are saying," it genuinely understands what you're saying and will follow the instructions. We also want to make it auditable so we can go back after it performs a task and see what it did right or wrong and how we can help it improve.

The last component looks at human-agent cooperation. The companies that develop LLMs can provide training services to help people use them more effectively. They also could write system prompts that are better at discerning human intent. There is a tradeoff between these two things from the suppliers' perspective. How do we balance that tradeoff in a principled way?

Nicole and Yang, what is your project studying?

Nicole Immorlica: As Dirk said, we can think of the interactions between LLMs and users as a kind of game. There's a sender, the LLM, that's trying to communicate something to a receiver, the human using the LLM. As the receiver, the user will consider the information the LLM provides and then decide what action to take. Part of our grant proposal suggested that by observing the distribution of actions of the receiver, an outside observer can understand whether the LLM was misinformed or misaligned.

How does that work?

Immorlica: Let's say you've asked an LLM to suggest a good price for a laptop being auctioned on eBay. In the case of misalignment, the LLM subtly biases its answer to influence your actions. Maybe you sense this bias and adjust your bid. Say, the LLM suggests bidding $1,500, but you decide to bid $1,250. Different users will correct for the bias in different ways. This creates a smooth distribution of actions, meaning people's decisions spread continuously across many possibilities as they "debias" the LLM's recommendation differently.

In cases of misinformation, the LLM is not being biased; it's just mistaken. If the receiver knows the LLM could be misinformed, they will consider the probability that its recommendation is a hallucination. There is some probability that the LLM's bid recommendation is exactly right. There's also some probability that it's complete garbage. This is going to induce a different distribution of bids in the eBay auction: some users will assume the recommendation is right and bid accordingly, while others will ignore it completely. As a result, the bid distribution will be choppy instead of smooth.
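
The contrast between a smooth and a choppy bid distribution can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the researchers' model: the function names, the uniform ranges for how users adjust or replace the suggested bid, and the 60% trust probability are all hypothetical values chosen only to make the two shapes visible. Only the $1,500 suggestion comes from the example in the interview.

```python
import random

RECOMMENDED_BID = 1500  # the LLM's suggestion in the laptop example


def misaligned_bids(n, rng):
    """Users suspect bias and each shave a different amount off the suggestion."""
    # Assumption: corrections are spread uniformly between $0 and $500.
    return [RECOMMENDED_BID - rng.uniform(0, 500) for _ in range(n)]


def misinformed_bids(n, rng, trust_prob=0.6):
    """Users either accept the suggestion as-is or ignore it entirely."""
    bids = []
    for _ in range(n):
        if rng.random() < trust_prob:
            bids.append(RECOMMENDED_BID)  # take the recommendation at face value
        else:
            # Assumption: users who ignore it bid their own estimate in this range.
            bids.append(rng.uniform(800, 1400))
    return bids


def share_at_recommendation(bids):
    """Fraction of bids that land exactly on the LLM's suggested price."""
    return sum(1 for bid in bids if bid == RECOMMENDED_BID) / len(bids)


rng = random.Random(0)  # fixed seed so the run is repeatable
smooth = misaligned_bids(10_000, rng)
choppy = misinformed_bids(10_000, rng)

print(f"misaligned:  {share_at_recommendation(smooth):.1%} of bids at ${RECOMMENDED_BID}")
print(f"misinformed: {share_at_recommendation(choppy):.1%} of bids at ${RECOMMENDED_BID}")
```

In this sketch, the misaligned case spreads bids continuously with essentially none piled on a single price, while the misinformed case shows a spike at the recommended price alongside a separate spread of independent bids. An outside observer who sees only the bids could, in principle, tell the two shapes apart.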

How could this work impact the AI industry?

Cai: I think AI companies will be interested in this work because they want to know how to tell whether their models are misinformed or misaligned. One way to address issues of misalignment and misinformation is by designing some better "loss" functions to optimize these models. This just means finding better ways to measure model performance by calculating the deviation of its predictions from the baseline correct predictions. That's the direction that we have in mind in terms of an intervention. It's something AI companies can do to improve their models.
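
To make the idea of a loss function concrete, here is a minimal sketch that scores predictions by their deviation from reference values. It uses mean squared error purely as a familiar example; the interview does not say which loss functions the researchers have in mind, and the prices below are made-up numbers for illustration.

```python
def mean_squared_error(predictions, targets):
    """Average squared deviation of predictions from the reference values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Hypothetical data: prices a model suggested vs. prices treated as correct.
predicted_prices = [1500.0, 1200.0, 900.0]
reference_prices = [1250.0, 1200.0, 1000.0]

loss = mean_squared_error(predicted_prices, reference_prices)
print(f"loss = {loss:.2f}")
```

A lower value means the predictions sit closer to the reference values, and a perfect match gives a loss of zero. Designing a "better" loss function, in the sense Cai describes, would mean choosing a measure of deviation that penalizes the kinds of errors that matter most.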

Immorlica: Our work could also inform government regulation. Regulators might want to take steps to prevent AI agents from being highly misaligned - something we can identify by observing the distribution of actions. If the government realizes that the distribution of purchases based on AI recommendations doesn't look right according to our theories, it could tell the AI company that developed the model to fix it.

Other researchers working on these projects include Nima Haghpanah, professor of economics at Yale School of Management; Elliot Lipnowski, associate professor of economics in Yale's Faculty of Arts and Sciences; Doron Ravid, associate professor of economics at the University of Michigan; and Stephen Morris, the Peter A. Diamond Professor of Economics at the Massachusetts Institute of Technology.
