A new machine learning tool has identified more than 250,000 cancer research papers that may have been produced by so-called "paper mills".
Developed by QUT researcher Professor Adrian Barnett, from the School of Public Health and Social Work and the Australian Centre for Health Services Innovation (AusHSI), and an international team of collaborators, the tool was used to analyse 2.6 million cancer studies published from 1999 to 2024. The study is published in The BMJ.
It found more than 250,000 papers with writing patterns similar to articles already retracted for suspected fabrication.
"Paper mills are companies that sell fake or low-quality scientific studies. They are producing 'research' on an industrial scale, and our findings suggest the problem in cancer research is far larger than most people realised," Professor Barnett said.
Selling authorships and entire ready-made research papers, paper mills often use recycled text, awkward phrasing or fabricated data and images.
"Most likely, they're relying on boilerplate templates which can be detected by large language models that analyse patterns in texts," Professor Barnett said.
He and his team trained a language model called BERT to recognise the subtle textual "fingerprints" that repeatedly appear across known paper-mill products.
When tested on verified examples, the model correctly identified suspicious papers 91 per cent of the time.
"We've essentially built a scientific spam filter," Professor Barnett said.
"Just like your email system can spot unwanted messages, our tool flags papers that match the writing style and structure we see in retracted, fraudulent work."
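The spam-filter comparison can be illustrated with a minimal Python sketch. This is not the team's BERT model: it swaps in a much simpler naive Bayes word-count filter, the classic technique behind early email spam filters, purely to show the idea of learning from labelled examples and then flagging new text that resembles them. All titles below are invented for illustration, the gene names are placeholders, and the class and label names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesTextFilter:
    """Toy spam-filter-style classifier (hypothetical, not the study's model)."""

    LABELS = ("flag", "ok")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        # Log prior: share of training documents carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_words = sum(self.word_counts[label].values())
        for token in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = self.word_counts[label][token] + 1
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def predict(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))

# Invented training titles: templated wording versus ordinary wording.
TOY_FLAGGED = [
    "mirna GENE1 promotes proliferation and invasion of gastric cancer cells by targeting GENE2",
    "lncrna GENE3 promotes proliferation and migration of liver cancer cells by targeting GENE4",
]
TOY_OK = [
    "randomised trial of screening intervals for cervical cancer in primary care",
    "cohort study of survival after surgery for early stage breast cancer",
]

toy_filter = NaiveBayesTextFilter()
for title in TOY_FLAGGED:
    toy_filter.train(title, "flag")
for title in TOY_OK:
    toy_filter.train(title, "ok")

new_title = "circrna GENE5 promotes proliferation and invasion of lung cancer cells by targeting GENE6"
print(toy_filter.predict(new_title))
```

As with the real tool, a "flag" here only means the text resembles the flagged training examples. It is a prompt for a human check, not a finding of fraud.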
Key findings from the large-scale analysis include:
- The share of flagged papers has increased dramatically over two decades, rising from around 1 per cent in the early 2000s to a peak of more than 16 per cent in 2022.
- The issue affects thousands of journals across major publishers, including high-impact titles.
- The problem is most concentrated in fields such as molecular cancer biology and early-stage laboratory research.
- Some cancer types, including gastric, liver, bone and lung cancer, show especially high rates of suspicious papers.
Three scientific journals are already piloting the tool as part of their editorial screening. It will allow editors to identify potentially fabricated manuscripts before they are sent for peer review.
The team plans to expand the tool to other fields of research and improve the model as more confirmed cases of paper-mill activity become available. They stress that flagged papers are not confirmed cases of research fraud and should be checked by human specialists.
"Cancer research influences clinical trials, drug development and patient care," Professor Barnett said.
"If fabricated studies make their way into the evidence base, they can mislead real scientists and ultimately slow progress for patients. That's why it's vital we get ahead of this problem."
Read the full paper, Machine Learning-Based Screening of Potential Paper Mill Publications in Cancer Research: Methodological and Cross-Sectional Study, in The BMJ, online.
Main photo: Professor Adrian Barnett