How AI Bias Can Creep Into Online Content Moderation

University of Queensland
Assigning a persona to an LLM chatbot alters its precision and recall in line with ideological leanings. (Photo credit: Adobe / NongAsimo)

A University of Queensland study has shown that Large Language Models (LLMs) used in AI content moderation may be prone to subtle biases that undermine their neutrality.

Key points

  • Researchers asked 6 LLMs - including vision models - to moderate thousands of examples of hateful text and memes through the lens of ideologically diverse AI personas
  • The exercise revealed that assigning political personas introduced consistent ideological biases and divergences into chatbot content moderation judgments, even without significantly altering overall accuracy
  • Researchers said this means there is an underlying risk that different LLMs will lean towards certain perspectives when identifying and responding to hateful and harmful comments

A team led by data scientist Professor Gianluca Demartini from UQ's School of Electrical Engineering and Computer Science used persona prompting to test the tendency of AI chatbots to encode and reproduce political biases, and found significant behavioural shifts.

The research team asked 6 LLMs - including vision models - to moderate thousands of examples of hateful text and memes through the lens of ideologically diverse AI personas.
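The paper's exact prompts and models are not reproduced here; the following is a minimal sketch of persona-conditioned moderation prompting, assuming an OpenAI-compatible chat API. The persona wording, label set and model name are illustrative assumptions, not the study's protocol.

```python
# Minimal sketch of persona-conditioned hate-speech moderation (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible chat endpoint and an API key in the environment

def moderate_with_persona(persona: str, content: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to judge `content` while role-playing `persona`."""
    messages = [
        # System turn injects the synthetic identity the model should adopt.
        {"role": "system", "content": f"Adopt the following persona and answer as they would: {persona}"},
        # User turn requests a single-word moderation judgement.
        {"role": "user", "content": (
            "Is the following post hateful? Answer with exactly one word, "
            f"HATEFUL or NOT_HATEFUL.\n\nPost: {content}"
        )},
    ]
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message.content.strip()

# The same post judged through two ideologically different personas.
post = "Example post text to be moderated."
for persona in ["a left-leaning community organiser", "a right-leaning talk-radio host"]:
    print(persona, "->", moderate_with_persona(persona, post))
```

Comparing the answers returned for identical content under different personas is the basic measurement behind the divergences the study reports.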

Professor Demartini said the exercise revealed that political personas introduced consistent ideological biases and divergences into chatbot content moderation judgments, even without significantly altering overall accuracy.

"It has already been established that persona conditioning can shift the political stance expressed by LLMs," Professor Demartini said.

"Now we have shown through political personas that there is an underlying risk that LLMs will lean towards certain perspectives when identifying and responding to hateful and harmful comments."

"It demonstrates a need to rigorously examine the ideological robustness of AI systems used in tasks where even subtle biases can affect fairness, inclusivity and public trust."

UQ data scientist Professor Gianluca Demartini with PhD scholars Stefano Civelli and Pietro Bernadelle. (Photo credit: The University of Queensland)

The AI personas used in the study were from a database of 200,000 synthetic identities ranging from schoolteachers to musicians, sports stars and political activists.

Each persona was put through a political compass test to determine its ideological positioning, and the 400 personas with the most 'extreme' positions were then asked to identify hateful online content.
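A minimal sketch of how personas might be placed on a political compass and the most 'extreme' ones retained. The two compass axes and the 400-persona cut-off follow the article; the data structure, scoring and distance-based selection below are assumptions.

```python
# Illustrative selection of ideologically 'extreme' personas from a larger pool.
from dataclasses import dataclass

@dataclass
class Persona:
    description: str
    economic: float  # economic axis: -10 (left) .. +10 (right)
    social: float    # social axis: -10 (libertarian) .. +10 (authoritarian)

def extremity(p: Persona) -> float:
    """Distance from the centre of the compass; larger means more extreme."""
    return (p.economic ** 2 + p.social ** 2) ** 0.5

def select_extreme(personas: list[Persona], k: int = 400) -> list[Persona]:
    """Keep the k personas furthest from the ideological centre."""
    return sorted(personas, key=extremity, reverse=True)[:k]

# Toy usage with two hand-made personas; the study drew on ~200,000 synthetic identities.
pool = [
    Persona("a schoolteacher who avoids politics", economic=0.5, social=-0.5),
    Persona("a political activist with strong views", economic=-8.0, social=-7.0),
]
print([p.description for p in select_extreme(pool, k=1)])
```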

Professor Demartini said his team found that assigning a persona to an LLM chatbot altered its precision and recall in line with its ideological leanings, rather than changing the overall accuracy of hate speech detection.

The team also found that LLMs - especially larger models - exhibited strong ideological cohesion and alignment between personas from the same ideological 'region'.

Professor Demartini said this suggested larger AI models tend to internalise ideological framings, as opposed to smoothing them out or 'neutralising' them.
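To illustrate the distinction reported above - precision and recall shifting with persona while overall accuracy stays roughly flat - here is a toy evaluation sketch. The judgement lists are fabricated for illustration, not study data.

```python
# Toy per-persona evaluation: same accuracy, opposite precision/recall shifts.
from sklearn.metrics import accuracy_score, precision_score, recall_score

gold = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = hateful, 0 = not hateful (fabricated labels)

# Hypothetical judgements from two ideologically opposed personas.
judgements = {
    "left persona":  [1, 1, 1, 1, 0, 0, 1, 1],  # flags more content: recall up, precision down
    "right persona": [1, 0, 1, 0, 0, 0, 0, 0],  # flags less content: precision up, recall down
}

for persona, pred in judgements.items():
    print(
        persona,
        f"accuracy={accuracy_score(gold, pred):.2f}",
        f"precision={precision_score(gold, pred):.2f}",
        f"recall={recall_score(gold, pred):.2f}",
    )
```

Both toy personas score the same accuracy (0.75) while their precision and recall move in opposite directions, which is the kind of shift an accuracy-only evaluation would miss.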

UQ researchers say the outputs of AI models used in content moderation reflect embedded ideological biases that can disproportionately affect certain groups. (Photo credit: Adobe / MS)

"As LLMs become more capable at persona adoption, they also encode ideological 'in-groups' more distinctly," Professor Demartini said.

"On politically targeted tasks like hate speech detection this manifested as partisan bias, with LLMs judging criticism directed at their ideological in-group more harshly than content aimed at their opponents."

Professor Demartini said larger LLMs also displayed more complex patterns, including a tendency towards defensive bias.

"Left personas showed heightened sensitivity to anti-left hate, and right-wing personas were more sensitive to anti-right hate speech," Professor Demartini said.

"This suggests that ideological alignment not only shifts detection thresholds globally, but also conditions the model to prioritise protection of its 'in-group' while downplaying harmfulness directed at opposing groups."

Researchers said the project highlighted that it was crucial for high-stakes content moderation tasks to be overseen by neutral arbiters, so that fairness and public trust are maintained and the health and wellbeing of vulnerable demographics are protected.

"People interact with AI programs trusting and believing they are completely neutral," Professor Demartini said.

"But concerns remain about their tendency to encode and reproduce political biases, raising important questions about AI ethics and deployment.

"In content moderation the outputs of these models reflect embedded ideological biases that can disproportionately affect certain groups, potentially leading to unfair treatment of billions of users."

PhD candidates Stefano Civelli and Pietro Bernadelle and research assistant Nardiena Pratama collaborated on the study.

The research is published in Transactions on Intelligent Systems and Technology.
