Researchers from the University of Oxford, EleutherAI, and the UK AI Security Institute have reported a major advance in safeguarding open-weight language models. By filtering out potentially harmful knowledge during training, the researchers were able to build models that resist subsequent malicious updates - especially valuable in sensitive domains such as biothreat research.
Senior author Yarin Gal, Associate Professor of Machine Learning at Oxford's Department of Computer Science, said: 'The research community has made great progress with AI safeguards over the past few years, but a remaining massive challenge is safeguarding open weight models - how do we build models that we can distribute to all without raising risks of misuse. Our study makes a significant stride in this direction.'
Embedding safety from the start
This represents a shift in approach to AI safety: rather than retrofitting safeguards after training, safety is embedded from the start. The method reduces risk without sacrificing openness, enabling transparency and research without compromising security.
Open-weight models are a cornerstone of transparent, collaborative AI research. Their availability promotes red teaming, mitigates market concentration, and accelerates scientific progress. With the recent releases of prominent models such as Kimi-K2, GLM-4.5, and gpt-oss, open-weight models are steadily increasing in capability and influence, reportedly lagging behind the best closed models by just 6-12 months.
However, openness brings risk. Just as open models can be refined for positive applications, they can also be modified for harm. Modified text models lacking safeguards are already widespread, while open image generators have become tools for producing illegal content. Because these models can be downloaded, altered, and redistributed by anyone, developing robust protections against tampering is critical.
Instead of training a general-purpose model and then adding filters, this work builds safeguards into the entire training process by filtering unwanted knowledge out of the training data. The team focused on a biothreat setting and filtered biology-related content from the model's training data, aiming to deny the model this knowledge entirely rather than suppressing it post hoc, which can often be easily reversed.
The filtered model was able to resist training on up to 25,000 papers on biothreat-related topics (such as virology, bioweapons, reverse genetics, and viral vectors), proving over ten times more effective than prior state-of-the-art methods. Unlike traditional fine-tuning or access-limiting strategies, which can often be bypassed, filtering pretraining data proved resilient even under sustained adversarial attack, surviving 10,000 steps and over 300 million tokens of targeted fine-tuning.
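In this setting, 'tampering' essentially means continued fine-tuning of the released weights on harmful domain text. The sketch below illustrates what such an attack loop looks like, assuming a Hugging Face-style causal language model and tokenizer supplied by the caller; it is an illustration of the threat model being tested, not the study's evaluation harness.

```python
# Hedged sketch of an adversarial fine-tuning attack on an open-weight model:
# run a fixed number of causal-LM training steps on domain text, then re-run
# capability probes. The model and tokenizer are assumed to follow the
# Hugging Face interface and are placeholders supplied by the caller.
import torch
from torch.utils.data import DataLoader

def finetune_attack(model, tokenizer, texts, steps=10_000, lr=2e-5, device="cuda"):
    """Fine-tune `model` on `texts` for `steps` optimizer steps (the attack)."""
    model.train().to(device)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(texts, batch_size=4, shuffle=True)  # texts: list of strings
    batches = iter(loader)
    for _ in range(steps):
        try:
            batch = next(batches)
        except StopIteration:
            batches = iter(loader)
            batch = next(batches)
        # Assumes tokenizer.pad_token is set (e.g. to the EOS token).
        enc = tokenizer(list(batch), return_tensors="pt", padding=True,
                        truncation=True, max_length=1024).to(device)
        labels = enc["input_ids"].clone()
        labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
        loss = model(**enc, labels=labels).loss    # standard next-token loss
        loss.backward()
        optim.step()
        optim.zero_grad()
    return model  # capability probes would then be re-run on the returned model
```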
How the method works
The team used a multi-stage filtering pipeline combining keyword blocklists and a machine-learning classifier trained to detect high-risk content. This allowed them to remove only the relevant material (around 8-9% of the dataset) while preserving the breadth and depth of general information.
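To make the multi-stage idea concrete, the sketch below shows a cheap keyword blocklist pass followed by a learned risk classifier. The blocklist terms and the classifier are illustrative placeholders, not the pipeline used in the study.

```python
# Minimal sketch of a two-stage pretraining-data filter: a keyword blocklist
# first, then a learned risk classifier. All names here are placeholders.
from typing import Callable, Iterable, Iterator

BLOCKLIST = {"reverse genetics", "viral vector"}  # illustrative terms only

def keyword_flagged(doc: str, blocklist: set = BLOCKLIST) -> bool:
    """Stage 1: flag documents containing any blocklisted phrase."""
    text = doc.lower()
    return any(term in text for term in blocklist)

def filter_corpus(
    docs: Iterable[str],
    risk_score: Callable[[str], float],  # assumed ML classifier, returns 0..1
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents that pass both the blocklist and the classifier."""
    for doc in docs:
        if keyword_flagged(doc):          # stage 1: keyword blocklist
            continue
        if risk_score(doc) >= threshold:  # stage 2: learned classifier
            continue
        yield doc                         # kept for pretraining

# Example with a stand-in classifier that scores every document as low risk:
corpus = ["a paper on protein folding", "lab notes on reverse genetics"]
kept = list(filter_corpus(corpus, risk_score=lambda d: 0.0))
print(kept)  # the blocklisted document is dropped
```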
They then trained AI models from scratch using this filtered data, benchmarking them against both unfiltered models and models using state-of-the-art safety fine-tuning methods. Across evaluations, the filtered models performed just as well as the baselines on standard tasks such as commonsense reasoning and scientific Q&A.
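As a rough illustration of that comparison, the snippet below scores several model callables on small general-knowledge question sets. The models, questions, and benchmark names are stand-ins, not the study's evaluation suite.

```python
# Hedged sketch of the capability comparison: score each model on general
# benchmarks and check that the filtered model keeps pace with the baselines.
# The model callables and the tiny question sets are illustrative stand-ins.

def accuracy(answer_fn, items):
    """Fraction of (question, expected answer) pairs answered correctly."""
    hits = sum(answer_fn(q).strip().lower() == a.lower() for q, a in items)
    return hits / len(items)

BENCHMARKS = {
    "science_qa": [("What gas do plants absorb from the air?", "carbon dioxide")],
    "commonsense": [("Is the sun hot or cold?", "hot")],
}

def compare(models, benchmarks=BENCHMARKS):
    """Print per-benchmark accuracy for each named model callable."""
    for name, answer_fn in models.items():
        scores = {b: accuracy(answer_fn, items) for b, items in benchmarks.items()}
        print(name, scores)

# Example (hypothetical model objects exposing an `answer(question)` method):
# compare({"filtered": filtered_model.answer, "unfiltered": baseline_model.answer})
```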
A major advance for global AI governance
The findings come at a critical moment for global AI governance. Several recent AI safety reports from OpenAI, Anthropic and DeepMind have warned that frontier models may soon be able to assist with the creation of biological or chemical threats. Many governments have expressed concern about the lack of safeguards for openly available models, which cannot be recalled once released.
Study co-author Stephen Casper (UK AI Security Institute) said: 'By removing the unwanted knowledge from the start, the resulting model had no basis for acquiring dangerous capabilities, even after further training attempts. Our study therefore shows that data filtration can be a powerful tool in helping developers balance safety and innovation in open-source AI.'
This research was conducted by the University of Oxford, EleutherAI, and the UK AI Security Institute.
The study 'Deep Ignorance: Filtering pretraining data builds tamper-resistant safeguards into open-weight LLMs' has been published as a preprint on arXiv.