Multilingual Evals: Pitfalls And Playbook
Most LLM evaluation programs start in English and stay there far longer than they should. The assumption is that if a model performs well in English, it will perform acceptably in other languages, or that translating an English evaluation rubric is sufficient for multilingual coverage. Both assumptions are wrong. This playbook covers why multilingual evaluations are essential, where they go wrong, and how to run them effectively across dozens of languages.
Why Monolingual Evaluations Are Not Enough
Large language models are trained on multilingual corpora, but their capabilities are not uniform across languages. A model that produces fluent, accurate responses in English may hallucinate more frequently in Turkish, produce grammatically awkward outputs in Korean, or fail to respect register conventions in Japanese. These failures are invisible to English-only evaluation.
The stakes are practical: if you ship a product that serves users in 40 languages, your quality bar must be validated in those 40 languages. Users in São Paulo, Jakarta, and Cairo deserve the same evaluation rigor as users in San Francisco. Beyond user experience, regulatory frameworks in the EU, Brazil, and other jurisdictions increasingly require demonstrating model safety and fairness across the languages in which a product is offered.
There is also a technical argument. Multilingual evaluation surfaces model failure modes that English evaluation cannot: code-switching behavior (mixing languages within a response), translationese (outputs that read like machine-translated English rather than natural target-language text), and cultural misalignment (responses that are factually correct but culturally inappropriate for the target audience).
The Core Challenges
Cultural Context Is Not Translatable
The most fundamental challenge in multilingual evaluation is that "quality" is culturally situated. A response that is helpfully direct in American English may be rudely blunt in Japanese, where indirectness signals respect. A cooking explanation that references "a cup" of flour assumes a measurement system not used in much of the world. A historical summary that centers Western events may be irrelevant or offensive to users in other regions.
These are not edge cases. They pervade every evaluation dimension: helpfulness, tone, relevance, safety, and factual accuracy. A rubric designed for English evaluation will systematically mismeasure quality in languages and cultures where different norms apply.
Translation Artifacts
When prompts are translated from English into target languages for evaluation, the translations often carry artifacts that distort the evaluation. Translated prompts may use unnatural phrasing that a native speaker would never produce, leading models to generate responses to a prompt that real users would never write. This creates a mismatch between evaluation conditions and production conditions.
More subtly, some English prompts are untranslatable because they rely on English-specific ambiguities, idioms, or cultural references. "Explain the difference between 'affect' and 'effect'" has no meaningful equivalent in languages where these concepts are expressed differently. Evaluation prompt sets should include a significant proportion of natively authored prompts in each target language, not just translated English prompts.
Code-Switching and Multilingual Inputs
In many real-world deployments, users do not write in a single, clean language. Spanglish, Hinglish, and other code-switched varieties are the natural mode of communication for millions of users. A model that handles formal Hindi well but breaks down on Hindi-English code-switching is failing a significant user population. Your evaluation must include code-switched inputs if your users produce them.
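As a starting point for building such a sample, code-switched inputs can often be flagged by script mixing. The sketch below is a coarse heuristic assuming Devanagari occupies U+0900-U+097F; production detectors would use token-level language identification instead.

```python
# Sketch: flag Hindi-English code-switched inputs by script mixing.
# Coarse heuristic only; real systems use token-level language ID.

def scripts_used(text: str) -> set[str]:
    """Return the coarse scripts present in a string."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            if "\u0900" <= ch <= "\u097f":
                scripts.add("devanagari")
            elif ch.isascii():
                scripts.add("latin")
            else:
                scripts.add("other")
    return scripts

def is_code_switched(text: str) -> bool:
    """True when a single input mixes Devanagari and Latin words."""
    return {"devanagari", "latin"} <= scripts_used(text)

# A Hinglish prompt mixing scripts should be flagged so the evaluation
# sample includes it alongside monolingual prompts.
print(is_code_switched("मुझे weekend plans के बारे में बताओ"))  # True
```

Running a filter like this over production logs gives a rough estimate of how much of your traffic is code-switched, which in turn tells you how much of your evaluation set should be.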
Script and Tokenization Issues
Languages with non-Latin scripts (Arabic, Chinese, Thai, Devanagari) introduce tokenization challenges that affect model output quality. Thai lacks word boundaries in its writing system. Arabic is written right-to-left with complex morphology. Chinese uses characters that map poorly to subword tokenizers designed for alphabetic languages. These differences mean that model outputs in these languages may have fundamentally different error patterns than English outputs, requiring evaluators who understand the specific failure modes of LLMs for their language.
Recruiting Evaluators: Native Speakers With Domain Expertise
The non-negotiable requirement for multilingual evaluation is native-speaker evaluators. Near-native fluency is not sufficient. The distinction matters because native speakers have internalized cultural norms, register conventions, and naturalness judgments that non-native speakers cannot fully replicate, regardless of their proficiency level.
Finding the Intersection of Language and Domain
The challenge intensifies when you need domain expertise combined with language proficiency. Finding a native Tagalog speaker who is also a practicing physician, or a native Arabic speaker with a software engineering background, requires access to a large and diverse talent pool. Small, locally sourced teams typically cannot provide both the language coverage and domain depth that serious multilingual evaluation requires.
Avoiding Proxy Evaluators
A common failure mode is using bilingual evaluators who are native in one language to evaluate another language in which they are merely proficient. A native English speaker who learned Mandarin in university is not a reliable evaluator for Chinese-language model outputs. They will miss naturalness issues, over-accept translationese, and under-weight cultural context. Similarly, diaspora speakers who have not lived in the target country for years may have outdated cultural knowledge and vocabulary.
Scaling the Talent Search
For programs covering 20 or more languages, the recruiting challenge alone can dominate the project timeline. OpenTrain's network of over 100,000 AI trainers spanning 130 countries and 70+ languages provides a practical solution: teams can filter by native language, country of residence, and domain expertise, then run language-specific calibration before production annotation begins.
Per-Language Calibration
You cannot calibrate all your evaluators together across languages. Calibration must happen within each language group, for several reasons.
Language-Specific Rubric Addenda
Your core rubric should define universal evaluation dimensions (accuracy, completeness, safety), but each language needs addenda that address language-specific quality signals. For example:
- Japanese: Appropriate keigo (honorific language) usage, correct particle selection, natural sentence-ending forms.
- Arabic: Dialectal appropriateness (Modern Standard Arabic vs. regional dialects), correct gender agreement, appropriate formality level.
- German: Compound word formation, correct case usage, appropriate formal vs. informal address (Sie vs. du).
- Hindi: Script consistency (Devanagari vs. Romanized Hindi), handling of English loanwords, register matching.
These addenda should be written by native-speaker evaluation leads, not translated from English rubric supplements.
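One way to keep universal dimensions and per-language addenda from drifting apart is to represent them as a single data structure. The dimension names and addendum items below are illustrative, not a canonical rubric.

```python
# Sketch: structuring a core rubric with per-language addenda.
# Dimension names and addendum items are illustrative examples.
from dataclasses import dataclass, field

@dataclass
class Rubric:
    # Universal dimensions scored for every language.
    core_dimensions: list[str]
    # Language code -> extra criteria authored by native-speaker leads.
    addenda: dict[str, list[str]] = field(default_factory=dict)

    def dimensions_for(self, lang: str) -> list[str]:
        """Full checklist an evaluator for `lang` scores against."""
        return self.core_dimensions + self.addenda.get(lang, [])

rubric = Rubric(
    core_dimensions=["accuracy", "completeness", "safety"],
    addenda={
        "ja": ["keigo usage", "particle selection", "sentence endings"],
        "de": ["compound formation", "case usage", "Sie/du register"],
    },
)

print(rubric.dimensions_for("ja"))
```

A structure like this makes it explicit that a language with no addendum has not yet been reviewed by a native-speaker lead, rather than silently falling back to the English checklist.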
Language-Specific Gold Sets
Your calibration gold set must include language-specific examples that test evaluators on the particular challenges of their language. A gold set that is merely a translation of the English gold set will not exercise the right judgment muscles. Include examples that test for translationese detection, cultural appropriateness, script-specific errors, and natural phrasing.
Separate Calibration Sessions
Run calibration sessions within each language group, led by a native-speaker team lead. Cross-language calibration sessions (where speakers of different languages review the same rubric together) are useful for aligning on universal dimensions but cannot address language-specific concerns. Budget for at least two hours of language-specific calibration per evaluator before production work begins.
Coverage Planning Across Language Families
Not all languages need the same evaluation depth, and resource allocation should reflect the actual usage patterns of your model.
Tier Your Languages
A practical approach is to tier your target languages based on user volume and risk:
- Tier 1 (10-15 languages): Highest user volume. Full evaluation coverage: multiple evaluators per language, continuous gold monitoring, dedicated team leads. Typically includes English, Spanish, Mandarin, Hindi, Arabic, Portuguese, French, German, Japanese, Korean, and a few others based on your specific user base.
- Tier 2 (15-30 languages): Moderate user volume. Periodic evaluation with smaller teams. Sample-based coverage rather than full production monitoring.
- Tier 3 (30+ languages): Long-tail languages with lower volume. Spot-check evaluation on a quarterly cadence, focused on safety and basic quality rather than comprehensive assessment.
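The tiering policy above can be encoded as data so that coverage checks are automatable. The thresholds, cadences, and evaluator counts below are placeholders to be replaced with your own user data.

```python
# Sketch: encoding a language tiering policy as data.
# All thresholds and cadences are illustrative, not recommendations.

TIERS = {
    1: {"evaluators_per_language": 3, "cadence": "continuous",
        "gold_monitoring": True},
    2: {"evaluators_per_language": 2, "cadence": "monthly",
        "gold_monitoring": True},
    3: {"evaluators_per_language": 2, "cadence": "quarterly",
        "gold_monitoring": False},
}

def assign_tier(monthly_users: int, high_risk: bool) -> int:
    """Map a language's volume (and risk profile) to a tier."""
    if high_risk or monthly_users >= 1_000_000:
        return 1
    if monthly_users >= 50_000:
        return 2
    return 3

print(assign_tier(monthly_users=2_500_000, high_risk=False))  # 1
print(assign_tier(monthly_users=10_000, high_risk=True))      # 1 (risk promotes)
```

Treating risk as an independent promotion criterion matters: a low-volume language serving a safety-critical user population may still warrant Tier 1 coverage.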
Language Family Considerations
Languages in the same family, or with similar typological features, tend to share structure that affects model performance in similar ways. If your model struggles with agglutinative morphology in Turkish, it may also have trouble in Finnish, Hungarian, and Korean, which are unrelated to Turkish but similarly agglutinative. Use these relationships to prioritize evaluation: when you find problems in one language, proactively evaluate structurally similar languages before users report issues.
Right-to-Left and Non-Latin Scripts
Languages with right-to-left scripts (Arabic, Hebrew, Persian, Urdu) and logographic scripts (Chinese, Japanese kanji) require special attention in evaluation tooling. Ensure your annotation platform renders text correctly, supports bidirectional text mixing, and does not introduce display artifacts that affect evaluator judgment.
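One concrete tooling check: screen prompts and responses for Unicode bidirectional control characters, which can silently reorder mixed Arabic/Latin spans and distort what evaluators see. A minimal sketch using the standard library:

```python
# Sketch: screening annotation-platform text for Unicode bidi control
# characters that can silently reorder mixed-direction text.
import unicodedata

BIDI_CONTROLS = {
    "\u200e", "\u200f",                       # LRM, RLM
    "\u202a", "\u202b", "\u202c",             # LRE, RLE, PDF
    "\u202d", "\u202e",                       # LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",   # LRI, RLI, FSI, PDI
}

def bidi_issues(text: str) -> list[str]:
    """Return the names of any bidi control characters found."""
    return [unicodedata.name(ch) for ch in text if ch in BIDI_CONTROLS]

print(bidi_issues("normal text"))      # []
print(bidi_issues("evil\u202etext"))   # ['RIGHT-TO-LEFT OVERRIDE']
```

Flagged items can then be reviewed manually: some bidi marks are legitimate in mixed Arabic/Latin text, but an unexplained override in a prompt is exactly the kind of display artifact worth catching before it reaches evaluators.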
Quality Assurance Across Languages
Cross-language quality assurance is fundamentally harder than monolingual QA because you cannot directly compare evaluator judgments across languages. A "4 out of 5" quality rating in German and a "4 out of 5" in Thai may reflect very different absolute quality levels.
Within-Language QA
Apply the standard quality assurance toolkit within each language: gold task monitoring, inter-annotator agreement tracking, temporal consistency checks, and adjudication workflows. The target metrics may need to be language-specific. Some languages have inherently more ambiguous quality boundaries, which will produce lower agreement even with perfect calibration.
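For the agreement-tracking piece, Cohen's kappa is a common chance-corrected measure for two evaluators rating the same items. A minimal sketch, with illustrative pass/fail ratings:

```python
# Sketch: within-language inter-annotator agreement via Cohen's kappa
# for two evaluators assigning categorical ratings to the same items.
from collections import Counter

def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    if expected == 1.0:  # degenerate case: both raters are constant
        return 1.0
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Because kappa corrects for the chance agreement implied by each rater's label distribution, it is more comparable across language teams than raw percent agreement, though, as noted above, target thresholds should still be set per language.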
Cross-Language Comparability
To compare model quality across languages, you need anchor items: prompts that are functionally equivalent (not necessarily literally translated) across languages, with responses evaluated by calibrated native speakers in each language. This gives you a common baseline for cross-language quality comparison, though it should be supplemented with language-native prompts for full coverage.
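Aggregating anchor-item ratings into a per-language baseline can be as simple as the sketch below; the data shape and scores are illustrative.

```python
# Sketch: per-language baselines from anchor items (functionally
# equivalent prompts rated by calibrated native speakers). Data is
# illustrative.
from statistics import mean

# anchor_id -> language -> native-speaker ratings on a 1-5 scale
anchor_ratings = {
    "a1": {"de": [4, 5, 4], "th": [3, 3, 4]},
    "a2": {"de": [5, 4, 4], "th": [4, 3, 3]},
}

def language_baseline(ratings: dict, lang: str) -> float:
    """Mean rating for one language across all anchor items."""
    scores = [s for item in ratings.values() for s in item.get(lang, [])]
    return mean(scores)

print(round(language_baseline(anchor_ratings, "de"), 2))  # 4.33
print(round(language_baseline(anchor_ratings, "th"), 2))  # 3.33
```

A persistent gap between two languages on the anchor set is a signal worth investigating, though as the next section notes, it can reflect either model weakness or rater strictness.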
Detecting Systematic Bias
Watch for patterns where model quality is consistently rated lower in specific languages. This may reflect genuine model weakness in those languages, or it may reflect stricter evaluation standards by those language teams. Disentangle the two by having bilingual evaluators rate the same model outputs in both their native language and English, then comparing the distributions.
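The bilingual-evaluator comparison reduces to a paired analysis: each evaluator rates outputs in both languages, and the distribution of per-item differences is what you inspect. A minimal sketch with illustrative scores:

```python
# Sketch: disentangling model weakness from rater strictness using
# bilingual evaluators' paired ratings. A consistent negative gap
# points at the model; a gap near zero suggests stricter standards
# in that language team. Scores are illustrative.
from statistics import mean

def paired_gap(native: list[float], english: list[float]) -> float:
    """Mean (native - English) rating difference over paired items."""
    assert len(native) == len(english) and native
    return mean(n - e for n, e in zip(native, english))

native_scores  = [3.0, 2.5, 3.5, 3.0, 2.0]
english_scores = [4.0, 4.0, 4.5, 3.5, 3.5]

print(round(paired_gap(native_scores, english_scores), 2))  # -1.1
```

In practice you would also look at the spread of the differences, not just the mean, and apply a paired significance test before concluding the model genuinely underperforms in that language.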
Common Pitfalls and How to Avoid Them
- Translating the rubric word-for-word. Rubrics should be adapted for each language, not translated. Some rubric concepts do not have clean equivalents, and forcing a translation creates confusion.
- Assuming one evaluator per language is sufficient. You need at least two evaluators per language to compute agreement. For Tier 1 languages, you need substantially more.
- Ignoring dialectal variation. "Spanish" is not one language for evaluation purposes. Latin American Spanish and Peninsular Spanish have different conventions. Brazilian Portuguese and European Portuguese differ significantly. Simplified and Traditional Chinese reflect different communities. Match your evaluators to your users' actual linguistic context.
- Using English-only project managers. If your project manager cannot read the target language, they cannot effectively QA the evaluation work. Use language leads who are native speakers.
- Launching all languages simultaneously. Start with 3-5 Tier 1 languages, refine your process, then expand. Launching 40 languages at once guarantees quality problems that are hard to diagnose.
- Neglecting evaluator sourcing timelines. Finding native speakers with domain expertise in long-tail languages can take weeks. Start recruiting early, especially for specialized domains.
Building a Multilingual Evaluation Program
A practical launch sequence for multilingual evaluation looks like this:
- Weeks 1-2: Define language tiers based on user data. Identify domain expertise requirements per language. Begin evaluator recruiting for Tier 1 languages.
- Weeks 3-4: Develop core rubric with universal dimensions. Work with native-speaker leads to create language-specific addenda. Build language-specific gold sets.
- Weeks 5-6: Run per-language calibration sessions for Tier 1. Begin pilot evaluation with redundancy for agreement measurement.
- Weeks 7-8: Analyze pilot results. Revise rubric based on disagreement patterns. Scale Tier 1 to production volume. Begin recruiting for Tier 2.
- Weeks 9-12: Expand to Tier 2 languages. Establish cross-language comparability with anchor items. Begin quarterly Tier 3 spot checks.
This timeline assumes access to a ready pool of qualified evaluators. Building that pool from scratch adds 4-8 weeks for sourcing, screening, and onboarding, which is where working with established talent networks like OpenTrain can compress the critical path significantly.
The Payoff
Multilingual evaluation done well gives you something that no automated metric can: confidence that your model serves every user population with measured, validated quality. It surfaces failure modes that English-only evaluation cannot detect. It provides the data you need to prioritize language-specific improvements. And it demonstrates to users, regulators, and stakeholders that you take multilingual quality seriously.
The investment is real. Multilingual evaluation costs more, takes longer, and requires more organizational coordination than monolingual evaluation. But the alternative, shipping a model to millions of non-English speakers without validated quality, is a risk that responsible teams cannot afford to take.