How some endangered language speakers get creative with AI for preservation efforts
Revitalizing endangered Indigenous languages that have little or no digital presence is challenging with artificial intelligence—but not impossible.
This story is from The Pulse, a weekly health and science podcast.
Find it on Apple Podcasts, Spotify, or wherever you get your podcasts.
There are about 7,000 languages in the world, and they do more than provide a means of communication. Language is a huge part of a group’s identity — more often than not, it is the thread that holds together history, culture, and legacy.
Yet, according to the Living Tongues Institute for Endangered Languages, nearly half of the world’s languages are endangered and could vanish by the end of the next century, taking with them many of the traditions they are tied to.
“Languages do sometimes come and go,” said Ross Perlin, co-director of the Endangered Language Alliance in New York City. “But over the last several centuries in particular, something has radically shifted, and those pressures to shift languages have become intense, ranging from outright bans on certain languages that have been instituted by governments.”
Perlin is referring to language bans that were implemented in countries like Canada, the U.S., and even Japan in the 1800s. In his book, “Language City,” Perlin argues that these language bans are a driving force for linguists and Indigenous groups to preserve mother tongues today.
“The most powerful argument is one of moral obligation,” Perlin said. “The languages are not … dying natural deaths but are being hounded out of existence.”
Perlin is not alone in this endeavor. Big tech companies like Google and IBM are working with endangered language organizations to revitalize languages with artificial intelligence and large language models.
But low-resource languages, those with little to no digital presence that are often left out of research databases, are much harder to revitalize with AI.
A different type of bot
Five years ago, Jared Coleman ran into this barrier when he found an old archive of his native language, Owens Valley Paiute.
Coleman is a member of the Big Pine Paiute Tribe of the Owens Valley in California.
The archive included an old dictionary and audio recordings of his elders, including his own great-great-great grandfather’s voice. Coleman recalls listening to the archive as an amazing experience — but there was one problem.
“I had no idea what he was saying,” he said. “Growing up, everyone knows some words, like the word for water, the word for sit … but I didn’t speak the language at all.”
With a rich archive at his fingertips, Coleman decided to learn the language. He even built an online dictionary after studying computer science in college. Then, with the explosion of large language models, like ChatGPT, he saw an opportunity to create a tool to revitalize Paiute, and to create something that would make learning the language easier for future generations.
But even though the archive Coleman found was enough for him to learn the language, it wasn’t enough training data for a large language model to work with. Google Translate, for example, uses large language models to do its translations.
“It’s trained on like millions and millions and millions of sentences until it learns how to translate,” he said. “You can’t do that for endangered languages, we just don’t have that many sentences.”
But Coleman wasn’t stumped for long.
In 2023, while completing his Ph.D. in computer science at the University of Southern California, Coleman, his advisor Bhaskar Krishnamachari, and Khalil Iskarous explored a different idea to get around this issue. They developed a new paradigm for machine translation.
It’s called a large language model-assisted rule-based translation tool.
“Maybe we don’t give it a bunch of examples and train it like a traditional AI approach. But we give it rules about the language,” he said.
It’s a system that helps translate text from one language to another by combining two methods.
First, the system is given a set of rules about the language’s unique grammar and vocabulary, and it learns to apply each rule until it can follow the basics. This is then combined with the processing capabilities of a large language model.
In an interesting twist, Coleman also proposed that native Paiute speakers teach the AI to speak the language correctly, since there isn’t a large enough archive to train the tool on.
“So I taught it. I said this is the word for water. This is the word for drink. This is how you say, ‘drink water.’ And then this is how you say, ‘he.’ Now, how would you say ‘he will drink water?’ I’m not giving it the answer, but I’m giving it enough information to where a human would be able to figure this out,” Coleman said.
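In code, that kind of rule-guided prompt might look something like the sketch below. It assumes an OpenAI-style chat API, and the capitalized “Paiute” words are hypothetical placeholders rather than real vocabulary; the point is the shape of the approach, with rules and examples going in, a translation coming out, and an instruction not to guess.

```python
# A minimal sketch of rule-guided prompting, assuming an OpenAI-style chat API.
# The "Paiute" tokens below are hypothetical placeholders, not real vocabulary;
# the real words live in Coleman's archive and dictionary.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RULES = """You are translating English into a low-resource language.
Vocabulary (hypothetical placeholders):
  water -> PAYA
  drink -> HIBI
  he    -> MAHU
Grammar rules:
  1. Word order is subject-object-verb.
  2. Mark future tense with the suffix -WEI on the verb.
Worked example: "drink water" -> PAYA HIBI
Use ONLY these rules and words. If a needed word or rule is missing,
answer "unknown" instead of guessing."""

def translate(english: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system", "content": RULES},
            {"role": "user", "content": f'Translate: "{english}"'},
        ],
    )
    return response.choices[0].message.content

print(translate("he will drink water"))  # expected shape: MAHU PAYA HIBI-WEI
```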
And it worked.
Coleman gathered native speakers and teachers to train the tool, and he soon found another advantage: unlike other large language models, it doesn’t spit out wrong translations when it doesn’t know the answer.
“When you’re talking with ChatGPT, you should never assume that it’s accurate because it just makes stuff up all the time,” he said. “ChatGPT can be creative and fun. And so you never kind of separate when it’s being creative and making things up.”
Coleman is now an assistant professor of computer science at Loyola Marymount University in Los Angeles. His tool is still in its early stages, but he hopes it can offer accurate translations and help future generations learn their native language. In the meantime, his online dictionary, Kubishi, meaning “brain” in Paiute, is readily available.
Setting the benchmark
In New Zealand, the Māori people ran into a different issue while using AI to revitalize their endangered language.
The Māori language, which is called Te Reo, started to decline in the 1800s as the European population in New Zealand grew. The native population also faced language discrimination: the Native Schools Act of 1867 barred children from speaking their language in schools, forcing them to speak English instead.
It wasn’t until the 1980s that the Māori Language Act established Te Reo as an official language in New Zealand, along with English. This was an effort to persuade communities to speak their native tongue. By then, there were far fewer people who could speak the language fluently.
But in 1990, a broadcasting company called Te Hiku Media was born, and it began to assemble recordings of native voices, a collection that would later become an archive.
“What we quickly learned was digitizing our language was not only a priority for archival purposes, it was a priority to ensure that our language had a place in the digital future of the world,” said Peter-Lucas Jones, the CEO of Te Hiku Media.
In 2016, Te Hiku began to build an automatic speech recognition model, a computer program that uses a combination of AI and machine learning algorithms to convert spoken language into text. And they had the perfect archive: 30 years’ worth of recorded audio from their broadcasts.
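The story doesn’t detail Te Hiku’s technical stack, but as a general illustration of what an automatic speech recognition model does, here is a minimal sketch using the open-source Hugging Face transformers library, with a multilingual Whisper checkpoint standing in for a custom model:

```python
# A general illustration of automatic speech recognition, not Te Hiku's actual
# system (their stack isn't described in this story). Uses the open-source
# Hugging Face transformers library with a multilingual Whisper checkpoint.
from transformers import pipeline

# Load a pretrained speech-to-text model; whisper-small is a stand-in here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe an audio file; the filename is a hypothetical placeholder.
result = asr("broadcast_clip.wav")
print(result["text"])
```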
But the process turned out to be more intricate than they had imagined.
Even with an archive in hand, low-resource languages still face problems with automatic speech recognition: a lack of high-quality training data can lead to poor accuracy. A system trained on older audio recordings may fail to accurately recognize pronunciations, accents, and linguistic nuances.
Jones said the only path forward was to correct the program manually, word by word, with the help of native speakers.
“And that’s what we did,” he said. “We taught our own people how to tag and label phonetical data, how to tag and label language data in a way that took into account the context, the cultural context.”
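Te Hiku’s actual annotation schema isn’t published in this story, but labeled training data of the kind Jones describes often looks something like the following hypothetical record, pairing audio with a transcript and contextual tags:

```python
# Hypothetical shape of one labeled training example. Te Hiku's actual
# annotation schema isn't published here, so every field is illustrative.
example = {
    "audio_file": "archive/broadcast_0001.wav",  # placeholder path
    "transcript": "...",                         # verbatim te reo Māori text
    "phonetic_tags": [],                         # phoneme-level labels
    "cultural_context": "formal oratory",        # context noted by the labeler
}
```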
Through that labeling effort, Te Hiku was able to create an automatic speech recognition model with 92 percent accuracy.
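Speech recognition quality is commonly scored by word error rate (WER), and a figure like 92 percent accuracy can be read as roughly one minus that rate. Here is a small, hypothetical example using the open-source jiwer library; the story doesn’t say how Te Hiku measures accuracy, and the sentences below are illustrative, not Te Hiku data.

```python
# Speech recognition is commonly scored by word error rate (WER); accuracy can
# be read as roughly 1 - WER. Uses the open-source jiwer library. Both
# sentences below are illustrative, not Te Hiku data.
from jiwer import wer

reference  = "ka haere te tamaiti ki te kura"    # ground-truth transcript
hypothesis = "ka haere te tamariki ki te kura"   # model output with one error

error_rate = wer(reference, hypothesis)          # 1 substitution / 7 words
print(f"WER = {error_rate:.2f}, accuracy ≈ {1 - error_rate:.0%}")
```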
Jones said this level of accuracy is essential if Indigenous people want their language represented correctly in the digital world, because inaccuracy can be harmful. As an example, he typed a Te Reo sentence meaning “Men and women hold on to the land” into a popular translation tool.
“It provided me with an English translation, and that English translation was, ‘white man, white woman, keep the land,’” he said. “So what I’m saying is, it’s important that as Indigenous people, we actually set the benchmark around what the quality is for translation.”
Te Hiku has also launched Rongo, an app that helps users improve their pronunciation of Te Reo; the name translates to “listen.”
Owning the future
Aside from building accurate tools for their communities, Indigenous groups stress the importance of owning the data that they collect and use.
Jones said it’s sort of like controlling your destiny: “Our tools are only to be used for good.”
Past efforts to preserve Indigenous languages have not always benefited the communities themselves.
For example, Thomas Jefferson ran initiatives to collect Native American languages from 1797 to 1814, partly for scholarly purposes, but also to better navigate political relations, especially during negotiations and interactions with Native Americans.
Linguists’ practices have shifted over time, but historically they focused on collecting languages to produce academic records, such as dictionaries, grammatical descriptions, and recordings. That work was often intended for scholars rather than for the speakers of the languages themselves.
Today, Indigenous groups hope that owning their data will not only protect them from being left vulnerable again but also keep the data an open resource for those who want to learn.
“We want to ensure that the tools that we create enhance the lives of the people that are members of our communities and do no harm,” Jones said. “That is why it’s important to own the data.”