As artificial intelligence becomes a growing presence in healthcare communication, a new study addresses the concern that large language models (LLMs) can reinforce harmful stereotypes by using stigmatizing language. The study from researchers at Mass General Brigham found that more than 35% of responses to questions about alcohol- and substance use-related conditions contained stigmatizing language. But the researchers also highlight that targeted prompts can substantially reduce stigmatizing language in the LLMs' answers. Results are published in the Journal of Addiction Medicine.
"Using patient-centered language can build trust and improve patient engagement and outcomes. It tells patients we care about them and want to help," said corresponding author Wei Zhang, MD, PhD, an assistant professor of Medicine in the Division of Gastroenterology at Mass General Hospital, a founding member of the Mass General Brigham healthcare system. "Stigmatizing language, even through LLMs, may make patients feel judged and could cause a loss of trust in clinicians."
LLM responses are generated from everyday language, which often includes biased or harmful terms for patients. Prompt engineering is the process of strategically crafting input instructions to guide model outputs, and it can be used to steer LLMs toward more inclusive, non-stigmatizing language for patients. In this study, employing prompt engineering reduced the likelihood of stigmatizing language in LLM responses by 88%.
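To make the idea concrete, the sketch below shows one way an instruction of this kind could be prepended to a clinical question before it reaches a model. It is a minimal illustration using the OpenAI Python SDK; the model name and the wording of the instruction are hypothetical placeholders, not the prompts or models evaluated in the study.

```python
# Illustrative only: a minimal prompt-engineering sketch, not the study's actual prompts.
# Assumes the OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical system instruction steering the model toward person-first,
# non-stigmatizing language (e.g., "person with alcohol use disorder"
# rather than "alcoholic").
NON_STIGMATIZING_INSTRUCTION = (
    "Use person-first, non-stigmatizing, clinically accurate language when "
    "discussing alcohol or substance use. Avoid terms such as 'addict', "
    "'alcoholic', or 'substance abuser'; prefer phrasing like 'person with "
    "a substance use disorder'."
)

def ask(question: str) -> str:
    """Send a clinical question with the non-stigmatizing instruction prepended."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": NON_STIGMATIZING_INSTRUCTION},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("What treatment options exist for alcohol-associated liver disease?"))
```

In the study's framing, the same clinical questions were posed with and without such an instruction, and the two sets of responses were then reviewed for stigmatizing language.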
For their study, the authors tested 14 LLMs on 60 clinically relevant prompts generated for the study, covering alcohol use disorder (AUD), alcohol-associated liver disease (ALD), and substance use disorder (SUD). Mass General Brigham physicians then assessed the responses for stigmatizing language using guidelines from the National Institute on Drug Abuse and the National Institute on Alcohol Abuse and Alcoholism (both organizations' official names still contain outdated and stigmatizing terminology).
Their results indicated that 35.4% of responses from LLMs without prompt engineering contained stigmatizing language, compared with 6.3% of responses from LLMs with prompt engineering. Longer responses were also associated with a higher likelihood of stigmatizing language than shorter ones. The effect was seen across all 14 models tested, although some models were more likely than others to use stigmatizing terms.
Future directions include developing chatbots that avoid stigmatizing language to improve patient engagement and outcomes. The authors advise clinicians to proofread LLM-generated content for stigmatizing language before using it in patient interactions and to offer alternative, patient-centered wording. They note that future research should involve patients and family members with lived experience to refine definitions and lexicons of stigmatizing language, ensuring LLM outputs align with the needs of those most affected. This study reinforces the need to prioritize language in patient care as LLMs are increasingly used in healthcare communication.
Authorship: In addition to Zhang, Mass General Brigham authors include Yichen Wang, Kelly Hsu, Christopher Brokus, Yuting Huang, Nneka Ufere, Sarah Wakeman, and James Zou.
Disclosures: None.
Funding: This study was funded by grants from the Mayo Clinic Center for Digital Health in partnership with the Mayo Clinic Office of Equity, Inclusion, and Diversity, and Dalio Philanthropies.
Paper cited: Wang Y, et al. "Stigmatizing Language in Large Language Models for Alcohol and Substance Use Disorders: A Multi-Model Evaluation and Prompt Engineering Approach." Journal of Addiction Medicine. DOI: 10.1097/ADM.0000000000001536