Removing Noise, not Finding Gold: Quality Filtering for Large-Scale Pretraining
Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin · Oct 1, 2025 · Citations: 0
Abstract
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality set. Importantly, we find that training on CQF-selected data can outperform training directly on the high-quality set, even when the latter is sufficiently large. This finding alone is particularly striking, given the substantial effort and cost recently devoted to augmenting high-quality data. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well as the low-quality one. Finally, we introduce an optimization-driven notion of data quality and demonstrate that it can be reliably estimated using small-scale proxy experiments. Altogether, our results both elucidate the mechanisms behind CQF and deepen our understanding of data selection methods widely used in practice.
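The CQF recipe described in the abstract (train a binary classifier, score each pretraining document, keep the top-scoring ones) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The bag-of-words featurizer, the logistic-regression classifier, the toy documents, and the names `featurize`, `train_classifier`, `cqf_filter`, and `keep_fraction` are all assumptions made for the example. It also assumes the classifier's negatives are a sample of the pretraining pool rather than the whole pool.

```python
import math
from collections import Counter


def featurize(doc):
    """Toy featurizer (assumption): lowercase whitespace tokens with counts."""
    return Counter(doc.lower().split())


def logit(feats, weights, bias):
    """Linear classifier score for one document."""
    return bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def sigmoid(z):
    """Logistic function, clamped so math.exp cannot overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(high_quality, negatives, epochs=100, lr=0.1):
    """Fit logistic regression by SGD: label 1 = high-quality, 0 = pretraining."""
    data = [(featurize(d), 1.0) for d in high_quality]
    data += [(featurize(d), 0.0) for d in negatives]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in data:
            err = sigmoid(logit(feats, weights, bias)) - label
            for tok, cnt in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * cnt
            bias -= lr * err
    return weights, bias


def cqf_filter(pool, weights, bias, keep_fraction):
    """Score every pool document and retain the top-scoring fraction."""
    def quality(doc):
        return sigmoid(logit(featurize(doc), weights, bias))

    ranked = sorted(pool, key=quality, reverse=True)
    n_keep = max(1, int(len(pool) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical toy data, for illustration only.
high_quality = [
    "the theorem follows from the lemma",
    "we prove the bound by induction",
]
negatives = [  # sample of the pretraining pool used to train the classifier
    "click here to win a free prize",
    "buy cheap pills online now",
]
pool = negatives + [
    "the proof of the theorem uses induction",
    "free prize click now",
]

weights, bias = train_classifier(high_quality, negatives)
kept = cqf_filter(pool, weights, bias, keep_fraction=0.25)
print(kept)
```

In this toy run the single retained document is the proof sentence. It shares tokens only with the high-quality set, so it is the only document in the pool that picks up positive weights. Note that the classifier is trained only to separate the two sets. Nothing in this objective directly optimizes language modeling on the high-quality set, which is consistent with the abstract's observation that CQF does not necessarily improve it.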