Project To Verify AI Models With Confidence Awarded Major Grant

King’s College London

Dr Nicola Paoletti from the Department of Informatics will embark on the work following a competitive grant from Open Philanthropy. 

King's researchers are developing a pioneering new approach to check whether Large Language Models (LLMs) like ChatGPT are behaving in biased, deceptive or harmful ways, thanks to a £300,000 grant from Open Philanthropy.

Dr Nicola Paoletti and Professor Osvaldo Simeone, from the Departments of Informatics and Engineering, will develop ways of detecting misaligned LLM behaviours with a high degree of confidence, attaching a quantitative value to that level of trust. Their goal is to enable safer and more socially beneficial models and to empower policymakers to make better decisions about any necessary guardrails on the technology.

LLMs have rapidly become central to our society and, as such, have invited scrutiny over their use and the text they generate. Cases of chatbots encouraging young people to take their own lives have led to recent calls for LLMs to police their own content. However, training AI to a high degree of accuracy and having confidence that its output won't be harmful is difficult.

Current approaches often rely on an auxiliary model alongside the LLM to analyse its output, checking if any of the generated text is biased, deceptive, manipulative or harmful. However, working at the output/text level is a complex task, and so, to be effective, these auxiliary models need to become large and complex themselves. This makes it difficult to understand how they reached their decisions.

The way we interact with LLMs on a daily basis has the potential to unlock great benefits in productivity and creativity, but it also raises significant dangers - will this model try to manipulate me to achieve its goals? Is the model being truthful or just trying to please me? If you don't have a way to reliably detect these behaviours, how can policymakers legislate on them with confidence?

Dr Nicola Paoletti

The approach proposed by Dr Paoletti and team will deploy so-called 'latent probes', simple algorithms trained on the internal activations of an LLM to detect early signs of misaligned behaviour before it appears in the generated text. Instead of looking at textual output, latent probes look at how prompts are transformed through the hidden layers of the neural network which makes up the LLM, offering an interpretation of the model's internal reasoning.
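
In spirit, a latent probe can be as simple as a linear classifier fitted to a model's hidden-layer activations. The sketch below is purely illustrative and is not the project's code: the activations and labels are random placeholders standing in for activations extracted from a real LLM, paired with labels of observed deceptive behaviour.

```python
# Minimal illustrative sketch of a latent probe (not the project's code).
# Assumes we already have one hidden layer's activations for a set of prompts,
# together with labels recording whether the resulting completion was deceptive.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 1,000 prompts, 768-dimensional activations from one hidden layer.
activations = rng.normal(size=(1000, 768))   # stand-in for real LLM activations
labels = rng.integers(0, 2, size=1000)       # 1 = deceptive behaviour observed, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The probe itself is deliberately simple: a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Scores close to 1 flag likely misaligned behaviour before any text is generated.
deception_scores = probe.predict_proba(X_test)[:, 1]
print("mean deception score on held-out prompts:", round(deception_scores.mean(), 3))
```

Because such a probe is a small model rather than a second large LLM, its decisions are much easier to inspect.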

While such latent probes have already shown very promising performance, they can still make errors, such as failing to detect harmful LLM behaviour, and can be fooled by malicious prompts crafted by bad actors to evade them and produce unsafe outputs.

To tackle these challenges, the project will develop an enhanced and more reliable form of latent probe called Verifiably Robust Conformal Probes. These will offer rigorous estimates of their prediction error, even when faced with inputs designed to evade them.

Our method looks at the inner reasoning of AI to produce robust estimates of misaligned behaviour. So, you now have a trusted monitor not only to inform future decisions and prevent misbehaviour, but to build better AI.

Dr Nicola Paoletti

Where a 'vanilla' probe may predict a single score to rank the LLM's deception level, the proposed probes would predict a range of scores. These ranges provide an estimate of the probe's uncertainty - the wider the range, the more uncertain - and are guaranteed to include the model's true deception score with a given probability. By quantifying the probe's uncertainty in predicting unsafe behaviours, developers and AI companies can design more effective and principled policies for preventing harmful LLM outputs.
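
To give a flavour of how a probe's single score can become a calibrated range, the sketch below applies split conformal prediction, a generic recipe rather than the team's Verifiably Robust Conformal Probes; the calibration data, coverage level and scores are illustrative assumptions.

```python
# Illustrative sketch: turning a probe's point score into an interval with a
# coverage guarantee, via split conformal prediction (generic recipe only).
import numpy as np

def conformal_interval(cal_scores, cal_truth, new_score, alpha=0.1):
    """Interval around `new_score` that covers the true score with
    probability at least 1 - alpha, assuming exchangeable data."""
    residuals = np.abs(cal_scores - cal_truth)          # calibration errors of the probe
    n = len(residuals)
    # Finite-sample corrected quantile level, capped at 1.0.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(residuals, level, method="higher")
    return new_score - q, new_score + q

# Hypothetical calibration data: probe score vs. an independently labelled
# "true" deception score for 200 held-out prompts.
rng = np.random.default_rng(0)
cal_truth = rng.uniform(0.0, 1.0, size=200)
cal_scores = np.clip(cal_truth + rng.normal(0.0, 0.1, size=200), 0.0, 1.0)

low, high = conformal_interval(cal_scores, cal_truth, new_score=0.58, alpha=0.1)
print(f"true deception score in [{low:.2f}, {high:.2f}] with ~90% coverage")
```

The width of the returned interval is exactly the uncertainty signal described above: a tight interval means the probe is confident, while a wide one means its score should be treated with caution.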

Principal investigator Dr Paoletti explains, "The way we interact with LLMs on a daily basis has the potential to unlock great benefits in productivity and creativity, but it also raises significant dangers - will this model try to manipulate me to achieve its goals? Is the model being truthful or just trying to please me?

"But if you don't have a way to reliably detect these behaviours, how can policymakers legislate on them with confidence? Our method looks at the inner reasoning of AI to produce robust estimates of misaligned behaviour. So, you now have a trusted monitor not only to inform future decisions and prevent misbehaviour, but to build better AI." 

The team will test their approach on several standard benchmarks, checking whether it can detect when an LLM intends to behave unethically, produce deceptive information or break its guardrails, and whether it can stop a model from acting as a malicious 'sleeper agent' when a particular trigger appears.
