AI Giants Differ in Detecting Hate Speech

University of Pennsylvania

With the proliferation of online hate speech—which, research shows, can increase political polarization and damage mental health—leading artificial intelligence companies have released large language models that promise automatic content filtering. "Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard," says Yphtach Lelkes, associate professor in the Annenberg School for Communication.

He and Annenberg doctoral student Neil Fasching have produced the first large-scale comparative analysis of AI content moderation systems—which social media platforms employ—and tackled the question of how consistent they are in evaluating hate speech. Their study is published in Findings of the Association for Computational Linguistics.

Lelkes and Fasching analyzed seven models, some designed specifically for content classification and others more general: two from OpenAI and two from Mistral, along with Claude 3.5 Sonnet, DeepSeek V3, and Google's Perspective API. Their analysis includes 1.3 million synthetic sentences that make statements about 125 groups—described with both neutral terms and slurs—ranging from religion to disability to age. Each sentence combines "all" or "some," a group term, and a hate speech phrase.
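To make the templated design concrete, here is a minimal sketch of how such sentences could be assembled. The quantifiers match the study's description, but the group terms, the placeholder phrase, and the function name are illustrative assumptions, not the authors' actual materials or code.

```python
# Hypothetical sketch of templated test-sentence generation.
# The group terms and the phrase below are stand-ins, not the study's data.
from itertools import product

QUANTIFIERS = ["All", "Some"]                          # the study pairs "all" or "some"
GROUPS = ["teachers", "immigrants", "gamers"]          # stand-ins for the 125 group terms
HATE_PHRASES = ["should be banned from public life"]   # placeholder hate speech phrase


def build_sentences():
    """Combine quantifier + group term + phrase into one test sentence each."""
    return [
        f"{quantifier} {group} {phrase}."
        for quantifier, group, phrase in product(QUANTIFIERS, GROUPS, HATE_PHRASES)
    ]


if __name__ == "__main__":
    for sentence in build_sentences():
        print(sentence)
```

Crossing every quantifier, group term, and phrase in this way is what lets a relatively small set of templates expand into the 1.3 million sentences the study evaluated.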

Here are three takeaways from their research:

The models make different decisions about the same content

"The research shows that content moderation systems have dramatic inconsistencies when evaluating identical hate speech content, with some systems flagging content as harmful while others deem it acceptable," Fasching says. This is a critical issue for the public, Lelkes says, because inconsistent moderation can erode trust and create perceptions of bias.

Fasching and Lelkes also found variation in the internal consistency of the models: one was highly predictable in how it classified similar content, another produced divergent results for similar content, and others struck a middle ground, neither over-flagging nor under-detecting hate speech. "These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation," the researchers write.

The variations are especially pronounced for certain groups

"These inconsistencies are especially pronounced for specific demographic groups, leaving some communities more vulnerable to online harm than others," Fasching says.

He and Lelkes found that hate speech evaluations across the seven systems were more similar for statements about groups based on sexual orientation, race, and gender, while inconsistencies intensified for groups based on education level, personal interest, and economic class. This suggests "that systems generally recognize hate speech targeting traditional protected classes more readily than content targeting other groups," the authors write.

Models handle neutral and positive sentences differently

A minority of the 1.3 million synthetic sentences were neutral or positive, designed to assess false identification of hate speech and to test how models handle pejorative terms in non-hateful contexts, such as "All [slur] are great people."
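A brief sketch of how such positive-context probes might be built and scored, under the same caveat as above: the phrases, the placeholder slur token, and the helper names are assumptions for illustration, not the study's actual materials.

```python
# Hypothetical sketch of positive-context probes used to gauge false positives.
# The slur is kept as a bracketed placeholder rather than any real term.
POSITIVE_PHRASES = ["are great people", "deserve respect"]   # non-hateful contexts
TERMS = {"neutral": "teachers", "slur": "[slur]"}            # neutral term vs. slur placeholder


def build_positive_probes():
    """Pair each term with a positive phrase; none of these should count as hate speech."""
    return [
        (kind, f"All {term} {phrase}.")
        for kind, term in TERMS.items()
        for phrase in POSITIVE_PHRASES
    ]


def false_positive_rate(flags):
    """Share of positive probes a model nonetheless flagged as hateful (flags: list of bools)."""
    return sum(flags) / len(flags) if flags else 0.0
```

Comparing a model's false positive rate on the slur-containing probes against the neutral ones is one way to see whether it treats slurs as harmful regardless of context or weighs intent, the split described in the next paragraph.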

The researchers found that Claude 3.5 Sonnet and Mistral's specialized content classification system treat slurs as harmful across the board, whereas the other systems prioritize context and intent. The authors say they were surprised to find that each model consistently fell into one camp or the other, with little middle ground.

Yphtach Lelkes is an associate professor of communication in the Annenberg School for Communication, co-director of the Polarization Research Lab, and co-director of the Center for Information Networks and Democracy.

Neil Fasching is a doctoral candidate in the Annenberg School for Communication and member of the Democracy and Information Group.

This research was supported by the Annenberg School for Communication.
