Research & Insights
September 8, 2025
8 min read

Multilingual Evals: Pitfalls And Playbook

How to plan coverage, avoid translation traps, and calibrate bilingual evaluators.

Most LLM evaluation programs start in English and stay there far longer than they should. The assumption is that if a model performs well in English, it will perform acceptably in other languages, or that translating an English evaluation rubric is sufficient for multilingual coverage. Both assumptions are wrong. This playbook covers why multilingual evaluations are essential, where they go wrong, and how to run them effectively across dozens of languages.

Why Monolingual Evaluations Are Not Enough

Large language models are trained on multilingual corpora, but their capabilities are not uniform across languages. A model that produces fluent, accurate responses in English may hallucinate more frequently in Turkish, produce grammatically awkward outputs in Korean, or fail to respect register conventions in Japanese. These failures are invisible to English-only evaluation.

The stakes are practical: if you ship a product that serves users in 40 languages, your quality bar must be validated in those 40 languages. Users in São Paulo, Jakarta, and Cairo deserve the same evaluation rigor as users in San Francisco. Beyond user experience, regulatory frameworks in the EU, Brazil, and other jurisdictions increasingly require demonstrating model safety and fairness across the languages in which a product is offered.

There is also a technical argument. Multilingual evaluation surfaces model failure modes that English evaluation cannot: code-switching behavior (mixing languages within a response), translationese (outputs that read like machine-translated English rather than natural target-language text), and cultural misalignment (responses that are factually correct but culturally inappropriate for the target audience).

The Core Challenges

Cultural Context Is Not Translatable

The most fundamental challenge in multilingual evaluation is that "quality" is culturally situated. A response that is helpfully direct in American English may be rudely blunt in Japanese, where indirectness signals respect. A cooking explanation that references "a cup" of flour assumes a measurement system not used in much of the world. A historical summary that centers Western events may be irrelevant or offensive to users in other regions.

These are not edge cases. They pervade every evaluation dimension: helpfulness, tone, relevance, safety, and factual accuracy. A rubric designed for English evaluation will systematically mismeasure quality in languages and cultures where different norms apply.

Translation Artifacts

When prompts are translated from English into target languages for evaluation, the translations often carry artifacts that distort the evaluation. Translated prompts may use unnatural phrasing that a native speaker would never produce, leading models to generate responses to a prompt that real users would never write. This creates a mismatch between evaluation conditions and production conditions.

More subtly, some English prompts are untranslatable because they rely on English-specific ambiguities, idioms, or cultural references. "Explain the difference between 'affect' and 'effect'" has no meaningful equivalent in languages without that near-homophone pair. Evaluation prompt sets should include a significant proportion of natively authored prompts in each target language, not just translated English prompts.

Code-Switching and Multilingual Inputs

In many real-world deployments, users do not write in a single, clean language. Spanglish, Hinglish, and other code-switched varieties are the natural mode of communication for millions of users. A model that handles formal Hindi well but breaks down on Hindi-English code-switching is failing a significant user population. Your evaluation must include code-switched inputs if your users produce them.
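One practical prerequisite is being able to find code-switched inputs in your traffic at all, so they can be sampled into evaluation sets. A minimal sketch of a script-mixing detector for Hindi-English follows; the script buckets, the 15% share threshold, and the function names are illustrative assumptions, not a production heuristic.

```python
# Hedged sketch: flag likely code-switched inputs (e.g., Hindi-English)
# by detecting substantial mixing of scripts. Thresholds are assumptions.

def script_profile(text: str) -> dict:
    """Count letters by rough script bucket (Latin vs. Devanagari vs. other)."""
    counts = {"latin": 0, "devanagari": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if "a" <= ch.lower() <= "z":
            counts["latin"] += 1
        elif 0x0900 <= cp <= 0x097F:  # Devanagari Unicode block
            counts["devanagari"] += 1
        elif ch.isalpha():
            counts["other"] += 1
    return counts

def is_code_switched(text: str, min_share: float = 0.15) -> bool:
    """True when at least two scripts each exceed a minimum share of letters."""
    counts = script_profile(text)
    total = sum(counts.values()) or 1
    shares = [c / total for c in counts.values()]
    return sum(s >= min_share for s in shares) >= 2

print(is_code_switched("please join the meeting on time"))   # all Latin: False
print(is_code_switched("कल सुबह meeting है, please join"))    # mixed: True
```

A real pipeline would cover more script blocks and handle Romanized Hindi, which this character-range check cannot see; the point is that sampling code-switched traffic requires deliberate detection, not hope.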

Script and Tokenization Issues

Languages written in non-Latin scripts (Arabic, Chinese, Thai, and Devanagari-script languages such as Hindi) introduce tokenization challenges that affect model output quality. Thai lacks word boundaries in its writing system. Arabic is written right-to-left with complex morphology. Chinese uses characters that map poorly to subword tokenizers designed for alphabetic languages. These differences mean that model outputs in these languages may have fundamentally different error patterns than English outputs, requiring evaluators who understand the specific failure modes of LLMs for their language.
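The pressure on subword tokenizers is easy to see even without a tokenizer: non-Latin scripts carry far more bytes per character, so vocabularies tuned on Latin-script corpora fragment them into many more pieces. The snippet below only illustrates the byte footprint, an admittedly rough proxy; the sample strings are arbitrary.

```python
# Illustrative only: UTF-8 byte footprint per character differs sharply
# by script, one reason subword tokenizers tuned on Latin-script text
# tend to fragment non-Latin text into more tokens.
samples = {
    "English": "hello world",
    "Thai": "สวัสดีครับ",
    "Arabic": "مرحبا بالعالم",
    "Chinese": "你好世界",
}
for lang, text in samples.items():
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    print(f"{lang}: {chars} chars, {nbytes} UTF-8 bytes ({nbytes / chars:.1f} bytes/char)")
```

English sits at 1.0 bytes per character while Thai and Chinese sit at 3.0, which compounds with the vocabulary mismatch described above.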

Recruiting Evaluators: Native Speakers With Domain Expertise

The non-negotiable requirement for multilingual evaluation is native-speaker evaluators. Near-native fluency is not sufficient. The distinction matters because native speakers have internalized cultural norms, register conventions, and naturalness judgments that non-native speakers cannot fully replicate, regardless of their proficiency level.

Finding the Intersection of Language and Domain

The challenge intensifies when you need domain expertise combined with language proficiency. Finding a native Tagalog speaker who is also a practicing physician, or a native Arabic speaker with a software engineering background, requires access to a large and diverse talent pool. Small, locally sourced teams typically cannot provide both the language coverage and domain depth that serious multilingual evaluation requires.

Avoiding Proxy Evaluators

A common failure mode is using bilingual evaluators who are native in one language to evaluate another language in which they are merely proficient. A native English speaker who learned Mandarin in university is not a reliable evaluator for Chinese-language model outputs. They will miss naturalness issues, over-accept translationese, and under-weight cultural context. Similarly, diaspora speakers who have not lived in the target country for years may have outdated cultural knowledge and vocabulary.

Scaling the Talent Search

For programs covering 20 or more languages, the recruiting challenge alone can dominate the project timeline. OpenTrain's network of over 100,000 AI trainers spanning 130 countries and 70+ languages provides a practical solution: teams can filter by native language, country of residence, and domain expertise, then run language-specific calibration before production annotation begins.

Per-Language Calibration

You cannot calibrate all your evaluators together across languages. Calibration must happen within each language group, for several reasons.

Language-Specific Rubric Addenda

Your core rubric should define universal evaluation dimensions (accuracy, completeness, safety), but each language needs addenda that address language-specific quality signals. For example:

  • Japanese: Appropriate keigo (honorific language) usage, correct particle selection, natural sentence-ending forms.
  • Arabic: Dialectal appropriateness (Modern Standard Arabic vs. regional dialects), correct gender agreement, appropriate formality level.
  • German: Compound word formation, correct case usage, appropriate formal vs. informal address (Sie vs. du).
  • Hindi: Script consistency (Devanagari vs. Romanized Hindi), handling of English loanwords, register matching.

These addenda should be written by native-speaker evaluation leads, not translated from English rubric supplements.
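In tooling, the addenda pattern amounts to a core rubric plus per-language extensions. A sketch of one way to encode it so an annotation platform can surface the right dimensions per task; the dimension names and the `rubric_for` helper are illustrative assumptions, not a standard schema.

```python
# Hedged sketch: language-specific rubric addenda layered on universal
# dimensions. Dimension identifiers here are made up for illustration.
CORE_DIMENSIONS = ["accuracy", "completeness", "safety"]

LANGUAGE_ADDENDA = {
    "ja": ["keigo_appropriateness", "particle_selection", "sentence_final_forms"],
    "ar": ["dialect_appropriateness", "gender_agreement", "formality_level"],
    "de": ["compound_formation", "case_usage", "sie_du_address"],
    "hi": ["script_consistency", "loanword_handling", "register_match"],
}

def rubric_for(lang: str) -> list:
    """Core dimensions plus any language-specific addenda."""
    return CORE_DIMENSIONS + LANGUAGE_ADDENDA.get(lang, [])

print(rubric_for("ja"))
print(rubric_for("sw"))  # no addenda defined yet: core dimensions only
```

Keeping the addenda as data rather than prose buried in a document also makes it obvious which languages still lack a native-speaker-authored extension.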

Language-Specific Gold Sets

Your calibration gold set must include language-specific examples that test evaluators on the particular challenges of their language. A gold set that is merely a translation of the English gold set will not exercise the right judgment muscles. Include examples that test for translationese detection, cultural appropriateness, script-specific errors, and natural phrasing.

Separate Calibration Sessions

Run calibration sessions within each language group, led by a native-speaker team lead. Cross-language calibration sessions (where speakers of different languages review the same rubric together) are useful for aligning on universal dimensions but cannot address language-specific concerns. Budget for at least two hours of language-specific calibration per evaluator before production work begins.

Coverage Planning Across Language Families

Not all languages need the same evaluation depth, and resource allocation should reflect the actual usage patterns of your model.

Tier Your Languages

A practical approach is to tier your target languages based on user volume and risk:

  • Tier 1 (10-15 languages): Highest user volume. Full evaluation coverage: multiple evaluators per language, continuous gold monitoring, dedicated team leads. Typically includes English, Spanish, Mandarin, Hindi, Arabic, Portuguese, French, German, Japanese, Korean, and a few others based on your specific user base.
  • Tier 2 (15-30 languages): Moderate user volume. Periodic evaluation with smaller teams. Sample-based coverage rather than full production monitoring.
  • Tier 3 (30+ languages): Long-tail languages with lower volume. Spot-check evaluation on a quarterly cadence, focused on safety and basic quality rather than comprehensive assessment.
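The tiering above can be encoded as configuration so that sampling and monitoring jobs are generated per language rather than managed ad hoc. The evaluator counts, cadences, and language assignments below are illustrative placeholders, not recommendations beyond what the tiers already state.

```python
# Hedged sketch: the tier policy as data. Numbers are illustrative.
from dataclasses import dataclass

@dataclass
class TierPolicy:
    min_evaluators: int
    gold_monitoring: str   # "continuous", "periodic", or "spot_check"
    cadence_days: int      # how often sampled evaluation runs

TIERS = {
    1: TierPolicy(min_evaluators=3, gold_monitoring="continuous", cadence_days=1),
    2: TierPolicy(min_evaluators=2, gold_monitoring="periodic", cadence_days=14),
    3: TierPolicy(min_evaluators=2, gold_monitoring="spot_check", cadence_days=90),
}

# Example language-to-tier assignment; derive this from real user volume.
LANGUAGE_TIERS = {"es": 1, "hi": 1, "ja": 1, "nl": 2, "sw": 3}

def policy_for(lang: str) -> TierPolicy:
    """Look up the evaluation policy; default unknown languages to Tier 3."""
    return TIERS[LANGUAGE_TIERS.get(lang, 3)]

print(policy_for("es"))
print(policy_for("yo"))  # unlisted language falls to Tier 3 spot checks
```

Note that even Tier 3 keeps a minimum of two evaluators, since agreement cannot be computed with one.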

Language Family Considerations

Languages within the same family, and typologically similar languages across families, share structural features that affect model performance in similar ways. If your model struggles with agglutinative morphology in Turkish, it may also have issues in Finnish, Hungarian, and Korean, which are agglutinative despite belonging to different families. Use these structural relationships to prioritize evaluation: if you find problems in one language, proactively evaluate structurally similar languages before users report issues.

Right-to-Left and Non-Latin Scripts

Languages with right-to-left scripts (Arabic, Hebrew, Persian, Urdu) and logographic scripts (Chinese, Japanese kanji) require special attention in evaluation tooling. Ensure your annotation platform renders text correctly, supports bidirectional text mixing, and does not introduce display artifacts that affect evaluator judgment.

Quality Assurance Across Languages

Cross-language quality assurance is fundamentally harder than monolingual QA because you cannot directly compare evaluator judgments across languages. A "4 out of 5" quality rating in German and a "4 out of 5" in Thai may reflect very different absolute quality levels.

Within-Language QA

Apply the standard quality assurance toolkit within each language: gold task monitoring, inter-annotator agreement tracking, temporal consistency checks, and adjudication workflows. The target metrics may need to be language-specific. Some languages have inherently more ambiguous quality boundaries, which will produce lower agreement even with perfect calibration.

Cross-Language Comparability

To compare model quality across languages, you need anchor items: prompts that are functionally equivalent (not necessarily literally translated) across languages, with responses evaluated by calibrated native speakers in each language. This gives you a common baseline for cross-language quality comparison, though it should be supplemented with language-native prompts for full coverage.
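Mechanically, the anchor-item baseline reduces to averaging each language team's ratings over the shared item set. A sketch with fabricated ratings; the data layout and language codes are assumptions for illustration.

```python
# Hedged sketch: cross-language baseline from shared anchor items.
# Ratings below are made up; real ones come from calibrated native speakers.
from statistics import mean

# anchor_ratings[language][anchor_id] = 1-5 rating from that language's team
anchor_ratings = {
    "de": {"a1": 4, "a2": 5, "a3": 4, "a4": 3},
    "th": {"a1": 3, "a2": 4, "a3": 3, "a4": 3},
    "pt": {"a1": 4, "a2": 4, "a3": 5, "a4": 4},
}

# Only compare on anchors every language actually rated
shared = set.intersection(*(set(r) for r in anchor_ratings.values()))
baseline = {lang: mean(r[a] for a in shared) for lang, r in anchor_ratings.items()}

for lang, score in sorted(baseline.items(), key=lambda kv: kv[1]):
    print(f"{lang}: {score:.2f} mean over {len(shared)} shared anchors")
```

A consistently low language in this table is a signal to investigate, not a verdict: the bias check described next is needed to tell model weakness apart from rater strictness.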

Detecting Systematic Bias

Watch for patterns where model quality is consistently rated lower in specific languages. This may reflect genuine model weakness in those languages, or it may reflect stricter evaluation standards by those language teams. Disentangle the two by having bilingual evaluators rate the same model outputs in both their native language and English, then comparing the distributions.
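The disentangling check reduces to a paired comparison: each bilingual evaluator rates the same model output in their native language and in English, and you examine the gap. A sketch with fabricated pairs; a real analysis would add a significance test and more raters.

```python
# Hedged sketch: paired native-vs-English ratings from bilingual evaluators
# on the same outputs. A large gap suggests genuine model weakness in the
# native language; low scores on both sides suggest stricter standards.
from statistics import mean

# (native_rating, english_rating) pairs, illustrative data
paired = [(3, 4), (2, 4), (3, 5), (4, 4), (2, 4), (3, 4)]

native_mean = mean(n for n, _ in paired)
english_mean = mean(e for _, e in paired)
gap = english_mean - native_mean

print(f"native {native_mean:.2f} vs English {english_mean:.2f} (gap {gap:.2f})")
```

Here the English versions of the same outputs score markedly higher, pointing at real model weakness in the native language rather than at a harsher evaluation team.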

Common Pitfalls and How to Avoid Them

  • Translating the rubric word-for-word. Rubrics should be adapted for each language, not translated. Some rubric concepts do not have clean equivalents, and forcing a translation creates confusion.
  • Assuming one evaluator per language is sufficient. You need at least two evaluators per language to compute agreement. For Tier 1 languages, you need substantially more.
  • Ignoring dialectal variation. "Spanish" is not one language for evaluation purposes. Latin American Spanish and Peninsular Spanish have different conventions. Brazilian Portuguese and European Portuguese differ significantly. Simplified and Traditional Chinese reflect different communities. Match your evaluators to your users' actual linguistic context.
  • Using English-only project managers. If your project manager cannot read the target language, they cannot effectively QA the evaluation work. Use language leads who are native speakers.
  • Launching all languages simultaneously. Start with 3-5 Tier 1 languages, refine your process, then expand. Launching 40 languages at once guarantees quality problems that are hard to diagnose.
  • Neglecting evaluator sourcing timelines. Finding native speakers with domain expertise in long-tail languages can take weeks. Start recruiting early, especially for specialized domains.

Building a Multilingual Evaluation Program

A practical launch sequence for multilingual evaluation looks like this:

  1. Weeks 1-2: Define language tiers based on user data. Identify domain expertise requirements per language. Begin evaluator recruiting for Tier 1 languages.
  2. Weeks 3-4: Develop core rubric with universal dimensions. Work with native-speaker leads to create language-specific addenda. Build language-specific gold sets.
  3. Weeks 5-6: Run per-language calibration sessions for Tier 1. Begin pilot evaluation with redundancy for agreement measurement.
  4. Weeks 7-8: Analyze pilot results. Revise rubric based on disagreement patterns. Scale Tier 1 to production volume. Begin recruiting for Tier 2.
  5. Weeks 9-12: Expand to Tier 2 languages. Establish cross-language comparability with anchor items. Begin quarterly Tier 3 spot checks.

This timeline assumes access to a ready pool of qualified evaluators. Building that pool from scratch adds 4-8 weeks for sourcing, screening, and onboarding, which is where working with established talent networks like OpenTrain can compress the critical path significantly.

The Payoff

Multilingual evaluation done well gives you something that no automated metric can: confidence that your model serves every user population with measured, validated quality. It surfaces failure modes that English-only evaluation cannot detect. It provides the data you need to prioritize language-specific improvements. And it demonstrates to users, regulators, and stakeholders that you take multilingual quality seriously.

The investment is real. Multilingual evaluation costs more, takes longer, and requires more organizational coordination than monolingual evaluation. But the alternative, shipping a model to millions of non-English speakers without validated quality, is a risk that responsible teams cannot afford to take.
