How do you translate “dim sum”? Many English speakers would find the question strange, knowing the term refers to the large array of small dishes that accompanies a Cantonese-style brunch—and so doesn’t need translation.
But words like “dim sum” are a challenge for developers like Jacky Chan, who launched a Cantonese large language model last year through his startup Votee. It might be obvious to a human translator which words are loanwords and which need direct translation. Yet it’s less intuitive for machines.
“It’s not natural enough,” Chan says. “When you see it, you know it’s not something a human writes.”
Translation troubles are part of a growing list of issues that arise when today’s AI models, strongest in English and other major languages, try to work in an array of smaller tongues still spoken by tens of millions of people.
When AI “models encounter a word they don’t know or that doesn’t exist in another culture, they will simply make up a translation,” explains Aliya Bhatia, a senior policy analyst at the Center for Democracy & Technology, where she researches issues related to multilingual AI. “As a result, many machine-created datasets could feature mistranslations, words that no native speaker actually uses in a specific language.”
LLMs need data, and lots of it. Text from books, articles and websites is broken down into smaller word sequences to form a model’s training dataset. From this, LLMs learn how to predict the next word in a sequence, eventually generating text.
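To make that idea concrete, here is a toy sketch of the training objective. It uses naive whitespace “tokenization” and a simple word-counting model rather than the subword tokenizers and neural networks real LLMs use, but the core task is the same: predict the next word from the words before it, then generate text by repeating that prediction.

```python
# Toy illustration of next-word prediction: split text into word sequences,
# count which word tends to follow which, then generate by repeated prediction.
# Real LLMs use subword tokens and neural networks, but the objective is the same.
from collections import Counter, defaultdict

corpus = "dim sum is a style of cantonese cuisine . dim sum is served in small dishes ."

# "Tokenize" the text (here: naive whitespace splitting).
tokens = corpus.split()

# Count how often each word follows each preceding word (a bigram model).
next_word_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Return the word that most often followed `word` in the training text."""
    return next_word_counts[word].most_common(1)[0][0]

# Generate text by repeatedly predicting the next word.
word = "dim"
generated = [word]
for _ in range(5):
    word = predict_next(word)
    generated.append(word)

print(" ".join(generated))  # e.g. "dim sum is a style of"
```

The sketch also shows why data volume matters: a model can only ever reproduce patterns it has seen, so a language with little digitized text gives it very little to learn from.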
So-called low-resource languages are those with limited online data. Endangered languages, no longer being passed down to younger generations, clearly fall into this category. But widely spoken languages like Cantonese, Vietnamese and Bahasa Indonesia are also considered low-resource.
One reason could be limited internet access, which would prevent the creation of digital content. Another could be government regulation, which might limit what’s available online. Indonesia’s government, for example, can remove online content without offering a way to appeal its decisions. The resulting self-censorship may mean that the data available in some regional languages doesn’t represent authentic local culture.
This resource gap leads to a performance gap: Non-English LLMs are more likely to produce gibberish or inaccurate answers. LLMs also struggle with languages that don’t use Latin script, the set of letters used in English, as well as those with tonal features that are hard to represent in writing or code.
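Part of the script problem is tokenization: tokenizers trained mostly on English tend to split non-Latin text into many more tokens per word, which makes those languages costlier to process and harder for a model to learn. The small sketch below, using the open-source tiktoken library (my choice of tool, not one named in the article), shows how the same phrase can come out as very different token counts; the exact numbers depend on the tokenizer.

```python
# Rough comparison of token counts for Latin vs. non-Latin text.
# Requires the `tiktoken` package (pip install tiktoken); results vary by tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

# English loanword vs. the Chinese characters for "dim sum" and "yum cha".
for text in ["dim sum", "點心", "飲茶"]:
    token_ids = enc.encode(text)
    print(f"{text!r}: {len(token_ids)} tokens -> {token_ids}")
```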
Currently, the best-performing models work in English and, to a lesser extent, Mandarin Chinese. That reflects where the world’s biggest tech companies are based. But outside of San Francisco and Hangzhou, a legion of developers, large and small, are trying to make AI work for everyone.
Chan, from Votee, faces these struggles when dealing with Cantonese, spoken by 85 million people across southern China and Hong Kong. Cantonese uses different grammar for formal writing compared to informal writing and speech. Available digital data is scarce and often low-quality.
Training on digitized Cantonese texts is like “learning from a library with many books, but they have lots of typos, they are poorly translated, or they’re just plain wrong,” says Chan.
LLMs have no inherent sense of what is true, and so false or incomplete information will be reproduced as fact. A model trained solely on Vietnamese pop music might struggle to accurately answer questions on historical events, particularly those not related to Vietnam.
Turning English content into the target language is one way to supplement the otherwise-limited training data. As Chan explains, “we synthesize the data using AI so that we can have more data to do the training.”
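Below is a hedged sketch of one common way to do this kind of synthesis: ask an existing LLM to translate English sentences into the target language and keep the resulting pairs as extra training data. The OpenAI client, model name, and prompt are illustrative assumptions; the article does not describe Votee’s actual pipeline.

```python
# Sketch of synthetic-data generation by machine translation.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# the model name and prompt are hypothetical choices for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

english_sentences = [
    "Dim sum is usually shared among family and friends.",
    "The restaurant opens early on weekends.",
]

synthetic_pairs = []
for sentence in english_sentences:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical translation model
        messages=[
            {
                "role": "system",
                "content": "Translate into colloquial written Cantonese. Keep established loanwords as-is.",
            },
            {"role": "user", "content": sentence},
        ],
    )
    synthetic_pairs.append((sentence, response.choices[0].message.content))

# In practice, pairs like these would be filtered (ideally by native speakers)
# before being added to a training corpus; unreviewed synthetic text risks the
# mistranslations Bhatia describes.
print(synthetic_pairs)
```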
But synthetic data carries a risk: models trained on AI-generated text can end up recycling its errors, and that threat is even greater for low-resource languages. With less genuine content out there, AI-generated text could quickly make up a larger share of what’s online in a given language.
Large businesses are starting to recognize the opportunities in building non-English AI. But while these companies are key players in their respective tech sectors, they’re still much smaller than giants like Alibaba, OpenAI, and Microsoft.
Bhatia says more organizations—both for-profit and not-for-profit—need to invest in multilingual AI if this new technology is to be truly global.
“If LLMs are going to be used to equip people with access to economic opportunities, educational resources, and more, they should work in the languages people use,” she says.