Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su · Feb 12, 2026 · Citations: 0


How to use this page

Coverage: Stale

Use this page to decide whether the paper is strong enough to influence an eval design. If the signals below are thin, treat it as background context and compare it against the stronger hub pages before making protocol choices.

Paper metadata checked: Feb 25, 2026, 2:45 AM (Stale)

Protocol signals checked: Feb 25, 2026, 2:45 AM (Stale)

Signal strength: Low

Model confidence: 0.15

Abstract

Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro. Models are released at https://huggingface.co/collections/xiaomi-research/milmmt-46. Code is released at https://github.com/xiaomi-research/gemmax.

Use caution before copying this protocol

Use this page for context, then validate protocol choices against stronger HFEPX references before implementation decisions.

Extraction flags indicate low-signal or possible false-positive protocol mapping.
Extraction confidence is 0.15 (below strong-reference threshold).
No explicit evaluation mode was extracted from available metadata.
No benchmark/dataset or metric anchors were extracted.


HFEPX Relevance Assessment

This paper is adjacent to HFEPX scope and is best used for background context, not as a primary protocol reference.

Best use: Background context only

Use if you need: Background context only.

Main weakness: Extraction flags indicate low-signal or possible false-positive protocol mapping.

Trust level: Low

Eval-Fit Score: 0/100 • Low

Treat as adjacent context, not a core eval-method reference.

Human Feedback Signal: Not explicit in abstract metadata

Evaluation Signal: Weak / implicit signal

HFEPX Fit: Adjacent candidate

Extraction confidence: Low

If you are doing eval pipeline work, start here:

Human Eval Hub, LLM-as-Judge Hub, Pairwise Preference Hub, Tool-Use Eval Hub

What We Could Reliably Extract

Each protocol field below shows whether the signal looked explicit, partial, or missing in the available metadata. Use this to judge what is safe to trust directly and what still needs full-paper validation.

Human Feedback Types

missing

None explicit

Confidence: Low · Source: Persisted extraction missing

No explicit feedback protocol extracted.