In a new study, Dartmouth researchers trained an AI language-identification model inspired by Google Translate to recognize Navajo with near-perfect accuracy.
The study suggests that services such as Google's LangID—which recognizes languages for various applications including Translate but does not currently support any Native American language—could be expanded to identify Navajo and related languages relatively easily.
The team presented their work May 1 at the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL) conference in Albuquerque.
"By building on the ideas behind LangID, we found that it's possible to develop a classifier to identify Indigenous languages," says the study's first author, Ivory Yang, a PhD candidate at Dartmouth. "From Google's perspective, adding a new language involves rigorous verification, which makes sense given the scale. What I hope to show is that even with limited resources, meaningful progress is still possible."
The researchers began developing the program after finding that LangID misidentified Navajo—the most widely spoken Indigenous language north of Mexico—as unrelated languages such as Icelandic. Using a data set of 10,000 Navajo sentences, they were able to create a model that correctly identified Navajo with 97-100% accuracy, the researchers report.
"Many Indigenous languages lack even the basic dignity of being recognized online, a reflection of systemic bias in language technology," says Soroush Vosoughi, the paper's senior author and an assistant professor of computer science at Dartmouth. "Revitalization begins with visibility, and visibility begins with identification."
The team also found that Navajo may serve as a bridge to "teaching" translation tools to recognize related languages that have less data available, Yang says. Navajo has the most speakers in the Athabaskan family, which includes Apache and several Native Alaskan languages.
The researchers trained their model on a sample of these languages, sometimes using data sets as small as 20 sentences. But when they entered sentences from these languages into the model, it identified them as Navajo.
"What we noticed is that they are so linguistically similar to Navajo that it could be used to eventually identify these related languages without needing the same amount of data," Yang says. "That could mean that higher resource languages can act as a bridge to lower resource languages in general."
The paper stems from a larger project in Vosoughi's Minds, Machines, and Society research group to use AI for revitalizing endangered languages, with a particular focus on Indigenous and underrepresented linguistic communities.
Yang also led the development of an AI framework called NüshuRescue that translates Chinese into Nüshu, an endangered centuries-old secret script traditionally used by women in southern Hunan province. The team published the framework in January.
The next step for the team's latest model is to translate original sentences into Navajo. "Basically, we want to switch from identification to translation," Yang says. "The end goal is translation, but that is way, way harder. Right now, we know we can do identification."