The research question
Combating papermill activity is critical to protecting the integrity of the scientific record. Responding effectively to papermills, and upholding research integrity more broadly, requires a multilayered approach:
AI screening to detect scalable, pattern-based risks at submission
Research integrity expertise to interrogate anomalies, link behaviors, and identify emerging tactics
Editorial oversight and peer review to assess scientific validity, coherence, and credibility in context
Human expertise and AI tools are both essential to this effort.
A range of commercial and proprietary tools have been developed to screen submissions for papermill activity. This study focuses on that first checkpoint: detecting suspected papermill submissions before they enter the review pipeline. Specifically, Frontiers analyzed the output of three papermill detection tools on more than 37,000 manuscript submissions across six journals, assessing how reliably these tools flag fraudulent behavior.
Methodology and selected statistics in brief
Three leading AI-powered detection systems were benchmarked against the same dataset of submissions. Each flagged markedly different proportions of manuscripts, ranging from roughly 10% to 27%. The spread underscores a fundamental issue for the sector: there is no shared threshold for what constitutes a suspicious submission.
The divergence is most evident in the overlap. Of the 8,649 submissions flagged by at least one tool, just 396 were flagged by all three, an agreement rate of only 4.5% on which submissions indicated papermill activity. In other words, the tools are largely identifying different manuscripts rather than corroborating the same risks.
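To make the overlap arithmetic concrete, the sketch below shows how per-tool flag rates and the three-way agreement figure can be derived from sets of flagged manuscript IDs. The tool names and toy data are illustrative placeholders, not Frontiers' actual analysis pipeline.

```python
# Illustrative sketch only: computing flag rates and three-way overlap
# from sets of flagged manuscript IDs. Tool names and IDs are hypothetical.

total_submissions = 37_000  # approximate size of the benchmarked dataset

# Each tool's output is modelled as a set of flagged manuscript IDs.
flags = {
    "tool_a": {"MS-001", "MS-002", "MS-003"},
    "tool_b": {"MS-002", "MS-003", "MS-004"},
    "tool_c": {"MS-003", "MS-005"},
}

# Per-tool flag rate: share of all submissions each tool marks as suspect.
for name, flagged in flags.items():
    print(f"{name}: {len(flagged) / total_submissions:.1%} flagged")

# Union: manuscripts flagged by at least one tool (8,649 in the study).
flagged_by_any = set.union(*flags.values())

# Intersection: manuscripts flagged by all three tools (396 in the study).
flagged_by_all = set.intersection(*flags.values())

# Agreement rate as reported in the study: 396 / 8,649 ≈ 4.5%.
agreement = len(flagged_by_all) / len(flagged_by_any)
print(f"three-way agreement: {agreement:.1%}")
```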
Why the detection gap?
The Frontiers team examined the tools' output in more detail to better understand the poor overlap among the sets of flagged articles.
The overall pattern was clear. The tools appear to emphasize different types of signals, with one relying more on author-related indicators and others placing greater weight on content or reference-based signals. This may help explain both the divergence in flagging rates and the low manuscript-level overlap, with each tool capturing a different aspect of papermill risk.
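A purely hypothetical sketch of this effect follows: when two tools weight the same underlying signals differently, they can reach different verdicts on the same manuscript. The signal names, weights, and threshold below are assumptions for illustration only and do not describe the scoring logic of any tool in the study.

```python
# Hypothetical sketch: two tools scoring the same manuscript with different
# signal weights can disagree. All names, weights, and the threshold are
# illustrative assumptions, not any vendor's actual model.

manuscript_signals = {
    "author_network_anomaly": 0.9,   # author-related indicator
    "tortured_phrases": 0.1,         # content-based indicator
    "citation_irregularity": 0.2,    # reference-based indicator
}

tool_weights = {
    "tool_a": {"author_network_anomaly": 0.8, "tortured_phrases": 0.1, "citation_irregularity": 0.1},
    "tool_b": {"author_network_anomaly": 0.1, "tortured_phrases": 0.5, "citation_irregularity": 0.4},
}

THRESHOLD = 0.5  # assumed decision cut-off

for tool, weights in tool_weights.items():
    score = sum(weights[s] * value for s, value in manuscript_signals.items())
    verdict = "flag" if score >= THRESHOLD else "pass"
    print(f"{tool}: score={score:.2f} -> {verdict}")
```

Under these assumed weights, the author-focused tool flags the manuscript while the content-focused tool passes it, mirroring how differently tuned detectors can capture different aspects of papermill risk.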
The full report will include additional findings and insights:
Comparison of AI-detected versus human-expert-detected papermill submissions
Data on different signals used by different detection tools
Insight into how and why detection tool sensitivity fluctuates
Impact analysis of both false negatives and false positives
Frontiers' cross-industry advice and calls to action