Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag · Oct 24, 2025 · Citations: 0
How to use this page
Moderate trustUse this for comparison and orientation, not as your only source.
Best use
Secondary protocol comparison source
What to verify
Read the full paper before copying any benchmark, metric, or protocol choices.
Evidence quality
Moderate
Derived from extracted protocol signals and abstract evidence.
Abstract
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking. However, the prevalence and impact of length bias in QE metrics have been underexplored. Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a systematic preference for shorter translations when multiple candidates of comparable quality are available for the same source text. These biases risk unfairly penalizing longer, correct translations and can propagate into downstream pipelines that rely on QE signals for data selection or system optimization. We trace the root cause of learned QE metrics to skewed supervision distributions, where longer error-free examples are underrepresented in training data. As a diagnostic intervention, we apply length normalization during training and show that this simple modification effectively decouples error prediction from sequence length, yielding more reliable QE signals across translations of varying length.