
10 Challenges in Training Multilingual LLMs
As artificial intelligence becomes increasingly integrated into our daily lives, multilingual large language models (LLMs) that can function meaningfully across languages are becoming necessary.
These models are expected not only to translate accurately between languages, but also to engage in meaningful, culturally aware, and contextually appropriate language tasks in dozens, or even hundreds, of languages.
However, building truly multilingual LLMs requires navigating a series of interrelated challenges, many of which are not only computational and architectural but also linguistic and cultural, shaped by how language exists and varies across the world.
Let’s look at 10 such challenges.
1) Imbalance and Shortage of Data
A predominant challenge is the stark imbalance in the availability of language data. English and a handful of major world languages enjoy large, high-quality digital corpora; most others do not.
For low-resource languages there is little digital content in the first place, and what does exist is often noisy, unstructured, or machine translated.
Training on unreliable input yields unreliable output: models whose text is grammatically awkward, semantically incorrect, or culturally inappropriate.
This lack of parity systematically underserves the majority of the languages spoken in the world.
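To see how stark the skew can be, it helps to measure it. Below is a minimal sketch in pure Python, with a hypothetical hand-labeled mini-corpus standing in for a real web crawl, that tallies each language's share of documents and tokens; in practice, the labels would come from a language-identification model:

```python
from collections import Counter

# Hypothetical mini-corpus of (language code, document) pairs; in a real
# pipeline the labels would come from a language-identification model.
corpus = [
    ("en", "The quick brown fox jumps over the lazy dog."),
    ("en", "Large language models are trained mostly on web text."),
    ("en", "Yet another of the many English documents in the crawl."),
    ("de", "Das ist ein kurzes Beispieldokument auf Deutsch."),
    ("am", "ሰላም ለዓለም"),  # Amharic: a single short document
]

docs = Counter(lang for lang, _ in corpus)
tokens = Counter()
for lang, text in corpus:
    tokens[lang] += len(text.split())  # crude whitespace token count

total = sum(tokens.values())
for lang, count in tokens.most_common():
    print(f"{lang}: {docs[lang]} docs, {count} tokens ({100 * count / total:.1f}%)")
```

Run over a real crawl, a tally like this typically shows English dominating by orders of magnitude, which is exactly the imbalance the model then inherits.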
2) Quality and Representativeness of Data
Volume is important, but it is not enough.
The linguistic quality, cultural representativeness, and domain coverage of the training corpus all determine how well a model can understand and generate text in a language.
Web-scraped and user-generated content may contain informal language, slang, and toxic material, which can embed bias, misinformation, and stereotypes into model training.
Without at least moderately high-quality data, it is hard to produce trustworthy models, and low-quality data is especially damaging for languages with complex morphology or deep social and cultural nuance.
3) Cross-Language Interference
Training a single model on dozens of languages creates inherent conflict.
Language representations are fitted into a shared parameter space and can interfere with each other. This is referred to as negative transfer or cross-lingual interference, and it can degrade model performance even in high-resource languages.
Ex: The word "Gift" means a present in English, but in German the same spelling means poison; a shared representation must somehow serve both.
Models struggle more severely with structurally distant languages, where different grammar, syntax, and writing systems are difficult to merge.
Ex: A model trained primarily on English, which follows subject-verb-object (SVO) word order, will perform far worse on subject-object-verb (SOV) languages such as most regional Indian languages, Japanese, and Korean.
This trade-off between sharing parameters across languages and partitioning them per language is a central linguistic and technical problem.
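To make the interference concrete, here is a toy NumPy sketch, with made-up two-dimensional "meaning" vectors, of what happens when one shared embedding for the surface form "Gift" is trained against both its English and German senses: gradient descent settles on a compromise that represents neither meaning well.

```python
import numpy as np

# Made-up 2-D "meaning" vectors for the two senses of the form "Gift".
target_en = np.array([1.0, 0.0])   # English: a present
target_de = np.array([0.0, 1.0])   # German: poison

# One shared embedding must serve both languages.
emb = np.zeros(2)
lr = 0.1
for _ in range(200):
    # Squared-error loss against both targets; the gradient is the
    # sum of the two opposing pulls.
    grad = 2 * (emb - target_en) + 2 * (emb - target_de)
    emb -= lr * grad

print("shared embedding:", emb.round(3))  # ~[0.5, 0.5]
print("distance to English sense:", np.linalg.norm(emb - target_en).round(3))
print("distance to German sense: ", np.linalg.norm(emb - target_de).round(3))
# The optimum is the midpoint: equally far from both senses, so the
# shared vector represents neither meaning faithfully.
```

Real models have far more capacity than this toy, but the underlying tension is the same: shared parameters must average over conflicting signals from different languages.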
4) Incomplete Knowledge Transfer
While multilingual LLMs can generalize over patterns in related languages, the transfer is often superficial. A model may handle translation reasonably well, yet its path to summarization, question answering, or domain-specific reasoning is muddied by a lack of deeper cultural understanding.
In legal domains, for example, direct linguistic transfer may miss jurisdictional or conceptual differences. Even when vocabularies run parallel, semantic frameworks frequently diverge, creating knowledge gaps and degrading the model's overall reliability.
5) Script and Tokenization Issues
Tokenization, the act of breaking text down into learnable units, works well for many Latin-script languages. It offers much less for other script types, such as logographic scripts (like Chinese) or abugida scripts (like those of Hindi and Amharic), where standard tokenizers are inefficient in several ways.
A tokenizer may split words at the wrong boundaries, or, driven by character-combination frequencies in its training corpus, fragment a word into many rare tokens, inflating the input sequence and leaving meaning behind.
Successful training therefore depends on handling each script and its orthographic conventions correctly, and, not to be underemphasized, on organizing pre-processing on a language-specific basis.
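One way to observe the inefficiency is to count how many subword tokens a shared multilingual tokenizer spends on roughly equivalent sentences in different scripts. Here is a rough sketch using Hugging Face's transformers library and the xlm-roberta-base tokenizer (assuming the library is installed and the checkpoint can be downloaded):

```python
from transformers import AutoTokenizer

# A shared multilingual tokenizer; any multilingual checkpoint would do.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Roughly equivalent sentences in three scripts.
samples = {
    "English (Latin)":       "I am going to the market today.",
    "Hindi (Devanagari)":    "मैं आज बाज़ार जा रहा हूँ।",
    "Chinese (logographic)": "我今天要去市场。",
}

for label, text in samples.items():
    pieces = tok.tokenize(text)
    # "Fertility" = subword tokens per whitespace word. Note that
    # whitespace word counts are themselves crude for unsegmented
    # scripts like Chinese, which is part of the problem.
    words = max(len(text.split()), 1)
    print(f"{label}: {len(pieces)} tokens, fertility ~ {len(pieces) / words:.2f}")
```

Non-Latin scripts often show markedly higher fertility, which means longer sequences, larger context consumption, and more compute for the same content.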
6) Safety, Toxicity, and Cultural Alignment
How well a multilingual model aligns with safety norms depends on many factors.
Even when a multilingual model has been aligned with safety conventions, that alignment is often only as strong as in the languages it was fine-tuned on.
Most safety tuning is performed in English, where crowdsourced safety annotations are easiest to obtain. Safety behavior in most other languages is considerably weaker, which can lead to harmful or biased outputs when users interact in those languages.
Furthermore, moral norms, taboos, and cultural sensitivities differ greatly across communities, and what is acceptable in one language context may be offensive in another.
Without oversight from speakers of the local language, multilingual models can generate insensitive or, in some cases, even damaging content.
7) Synthetic Data and Its Challenges
When developers lack sufficient native-language data, they often turn to synthetic data generation, using pre-existing models to produce new training examples.
While synthetic data generation provides scale, it is not without cost.
Models trained on synthetic data tend to hallucinate, feed on their own outputs in feedback loops, and drift away from authentic usage.
Left unmanaged, this approach can entrench errors and erode linguistic authenticity.
When developing LLMs for low-resource languages, synthetic data must be used carefully and anchored to robust, verified sources that human reviewers have vetted.
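One common safeguard is round-trip filtering: keep a synthetic pair only if translating the output back reproduces the source closely. Here is a minimal sketch, with hypothetical translate_forward / translate_back stubs standing in for a real MT system and the standard library's difflib providing a crude similarity score:

```python
from difflib import SequenceMatcher

def translate_forward(text: str) -> str:
    # Hypothetical stub for an MT system (source -> target language).
    # Identity placeholder so the sketch runs; swap in a real model.
    return text

def translate_back(text: str) -> str:
    # Hypothetical stub for the reverse direction (target -> source).
    return text

def keep_synthetic_pair(source: str, threshold: float = 0.8) -> bool:
    """Keep a synthetic training pair only if back-translating the
    generated target lands close to the original source text."""
    target = translate_forward(source)
    round_trip = translate_back(target)
    similarity = SequenceMatcher(None, source, round_trip).ratio()
    return similarity >= threshold

# With real MT models plugged in, low-similarity pairs are dropped
# before human reviewers sample the remainder.
print(keep_synthetic_pair("A sentence we want a synthetic translation of."))
```

Filters like this reduce, but do not replace, the need for human review: a pair can round-trip cleanly and still be unidiomatic or culturally off.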
8) Evaluation Gaps and Hidden Failures
Multilingual LLMs are most often evaluated on a narrow set of standardized tasks, many of which are English-based or fail to reflect real-world complexity. As a result, many failures go unnoticed until after deployment.
For example, linguistic errors, factual errors, or outputs that lack cultural relevance may never show up in formal evaluations, yet they are quite apparent in user-facing applications.
Evaluations need to be grounded in each language, its culture, and actual use, rather than simply inherited from monolingual (typically English) assessments.
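At a minimum, results should be reported per language rather than as one aggregate that high-resource languages dominate. Here is a minimal sketch, assuming a hypothetical evaluate(model, example) scorer and a test set tagged with language codes:

```python
from collections import defaultdict

def per_language_report(model, test_set, evaluate):
    """test_set: iterable of (language_code, example) pairs.
    evaluate(model, example) -> True/False (hypothetical scorer).
    A single pooled score would hide weak languages, so each
    language is scored and reported separately."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, example in test_set:
        total[lang] += 1
        correct[lang] += evaluate(model, example)
    # Report weakest languages first, where the hidden failures live.
    for lang in sorted(total, key=lambda l: correct[l] / total[l]):
        acc = correct[lang] / total[lang]
        print(f"{lang}: {acc:.1%} ({total[lang]} examples)")
```

A per-language breakdown like this makes regressions in low-resource languages visible before deployment instead of after.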
9) Technological and Structural Complexity
Training and deploying multilingual models is capital intensive.
A new language requires more than additional vocabulary: the model must capture its grammatical rules, syntactic structures, and cultural contexts.
These increase the model's total parameter count, and therefore the total cost of training, by proportionally more than the new vocabulary alone would suggest.
Fine-tuning is needed for each language, or more realistically for each language family, to achieve satisfactory alignment, and alignment auditing and safety checks likewise have to be repeated across all of them.
Multilingual tokenization, memory management, and context window handling add yet more structural and operational complexity.
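Even the vocabulary alone is costly. A back-of-the-envelope sketch with illustrative numbers (a 4,096-dimensional embedding and 50,000 added tokens, not any specific model's figures) shows how new tokens become new parameters before any fine-tuning is counted:

```python
# Illustrative numbers only, not any specific model's figures.
hidden_dim = 4096      # embedding width of a mid-sized LLM
new_tokens = 50_000    # subword tokens added for new scripts/languages

# Each new token adds a row to the input embedding matrix and, with
# untied weights, another row to the output projection.
added = new_tokens * hidden_dim * 2
print(f"~{added / 1e6:.0f}M new parameters")  # ~410M
```

And that is only the embeddings; the per-language-family fine-tuning, alignment, and safety passes described above come on top.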
10) The Human Dimension of Multilingual AI
A multilingual LLM succeeds only on a foundation of human linguistic intelligence.
Native speakers, linguists, cultural experts, and domain specialists contribute fundamentally to constructing viable, safe training data and to shaping appropriate responses across languages.
Fully automated pipelines are never enough; human engagement is required to keep models connected to the lived, multifaceted reality of language.
Teams with linguistic expertise are extremely important, not only for curating data that is rich and equitable across languages, but also for bias detection, accuracy checks, and ongoing model tuning.
And this brings us to our conclusion.
Multilingual LLMs will open broader access, greater inclusion, and more value to users across the globe. Building them, however, requires far more than ramping up data collection or enlarging the vocabulary.
It demands a sustained, systemic commitment to the diversity of human experience, reflected across a range of languages and with due respect for cultural context. Data quality and ethical alignment practices are as much human challenges as technical ones.
The challenge is not only to engineer language but to understand it as more than a medium of communication.
Language is the expression of peoples, histories, and identities, and AI must learn to treat it with respect.