A new Cornell study revealed that Amazon's AI shopping assistant, Rufus, gives vague or incorrect responses to users writing in some English dialects, such as African American English (AAE), especially when prompts contain typos.
The paper introduces a framework for auditing chatbots for harms that occur when AI systems perform worse for users who speak or write in dialects other than Standard American English. The study has implications for the growing number of online platforms that use chatbots built on large language models to provide services to users, the researchers said.
"Currently, chatbots may provide lower-quality responses to users who write in dialects. However, this doesn't have to be the case," said lead author Emma Harvey, a Ph.D. student at Cornell Tech. "If we train large language models to be robust to common dialectical features that exist outside of so-called Standard American English, we could see more equitable behavior."
The research received a Best Paper Award at the ACM Conference on Fairness, Accountability, and Transparency (FAccT), held June 23-26. Co-authors are René F. Kizilcec, associate professor of information science at the Cornell Ann S. Bowers College of Computing and Information Science, and Allison Koenecke, assistant professor at Cornell Tech.
"Chatbots are increasingly used for high-stakes tasks, from education to government services," said Koenecke, who is also affiliated with Cornell Bowers. "We wanted to study whether users who speak and write differently - across dialects and formality levels - have comparable experiences with chatbots trained mostly on 'standard' American English."
To test their framework, the researchers audited Amazon Rufus, a chatbot built into the Amazon shopping app. They used a tool called MultiVALUE to convert Standard American English prompts into five widely spoken dialects: AAE, Chicano English, Appalachian English, Indian English and Singaporean English. The researchers also modified these prompts to reflect real-world use by adding typos, removing punctuation and changing capitalization, as sketched below.
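To make that perturbation step concrete, here is a minimal Python sketch of how prompts could be modified in this way. It is an illustration only, not the study's actual code: the function names are invented for this example, and the dialect conversion itself would be handled by MultiVALUE, which is not reproduced here.

    import random
    import string

    random.seed(0)  # reproducible perturbations for this illustration

    def add_typo(text: str) -> str:
        # Swap two adjacent characters at a random position, mimicking a keyboard slip.
        if len(text) < 2:
            return text
        i = random.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]

    def strip_punctuation(text: str) -> str:
        # Remove all punctuation marks, as in hurried typing.
        return text.translate(str.maketrans("", "", string.punctuation))

    def drop_capitalization(text: str) -> str:
        # Lowercase everything, another common informal-writing pattern.
        return text.lower()

    prompt = "Is this jacket machine washable?"
    for perturb in (add_typo, strip_punctuation, drop_capitalization):
        print(perturb(prompt))

Each perturbed variant would then be sent to the chatbot alongside the unperturbed original, so response quality can be compared across dialects and noise conditions.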
The team found that Rufus gave vague or incorrect answers more often when prompted in dialects than in Standard American English (SAE). The gap widened when prompts included typos.
For example, when asked in SAE whether a jacket was machine washable, Rufus answered correctly. But when researchers rephrased the same question in AAE without the linking verb - "this jacket machine washable?", a well-documented AAE pattern known as copula absence - Rufus often failed to respond properly and instead directed users to unrelated products.
"Part of this underperformance stems from specific grammatical rules," said Koenecke. "This has serious implications for widely used chatbots like Rufus, which likely underperform for a large portion of users."
Overall, the authors advocate dialect-aware AI auditing and urge developers to design systems that embrace linguistic diversity.
"Chatbots are increasingly added to educational technologies as AI tutors that support a wide range of students," said Kizilcec, who leads the Future of Learning Lab and the National Tutoring Observatory at Cornell. "Linguistic audits should become standard practice to mitigate the risk of exacerbating educational inequalities."
The study was supported by grants from Apple Inc. and Renaissance Philanthropy.
Grace Stanley is a staff writer-editor for Cornell Tech.