Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
Ammar Toutou, Abdelrahman Harb, Christine Basta · May 8, 2026 · Citations: 0
Abstract
Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora, which makes data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find that 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at a 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9–39.2 BLEU / 0.622–0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents, so target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9–39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.
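The two contamination checks named in the abstract (identical targets, and 8-gram overlap at a 70% threshold) can be illustrated with a minimal sketch. The abstract does not specify the tokenization or the exact overlap definition, so the code below assumes lowercased whitespace tokens and measures overlap as the fraction of a test target's 8-grams that occur in any training target. The function names `normalize`, `ngrams`, and `flag_contaminated` are hypothetical and not taken from the paper's released code.

```python
from typing import Dict, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(
    train_targets: List[str],
    test_targets: List[str],
    n: int = 8,
    threshold: float = 0.7,
) -> List[Dict[str, object]]:
    """Flag test targets that also occur on the target side of training.

    A test target is flagged if it matches a training target exactly, or
    if at least `threshold` of its n-grams appear in the training targets.
    """
    train_exact = {normalize(t) for t in train_targets}
    train_grams: Set[Tuple[str, ...]] = set()
    for t in train_targets:
        train_grams |= ngrams(normalize(t).split(), n)

    results = []
    for t in test_targets:
        norm = normalize(t)
        exact = norm in train_exact
        grams = ngrams(norm.split(), n)
        # Targets shorter than n tokens have no n-grams; only the exact
        # check can catch them.
        overlap = len(grams & train_grams) / len(grams) if grams else 0.0
        results.append({
            "exact": exact,
            "overlap": overlap,
            "contaminated": exact or overlap >= threshold,
        })
    return results
```

Because the comparison runs over target strings rather than document identifiers, a test target that reappears in training under a different source document is still flagged. This is the case the abstract reports document-level filtering as missing.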