Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking

Solomon Messing

2026-04-13T14:58:15Z

Abstract

LLM evaluations drive which models get deployed, which safety standards get adopted, and which research conclusions get published. Yet these scores carry hidden uncertainty: rephrasing the prompt, switching the judge model, or changing the temperature can shift results enough to flip rankings and reverse conclusions. Standard confidence intervals ignore this variance, producing under-coverage that worsens with more data. The same unmeasured variance creates an exploitable surface for benchmarks: model developers can optimize against measurement noise rather than genuine performance (some have infamously done so; see \citep{boyeau2025leaderboard}). This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total error. Across ideology annotation, safety classification, MMLU benchmarking, and a human-validated propaganda audit, the decomposition reveals that the dominant variance source differs by domain and scoring method. On MMLU, optimized budget allocation halves estimation error at equivalent cost. On the propaganda task, the recommended pipeline outperforms 73% of single-configuration alternatives against a human baseline. A small-sample pilot is sufficient to derive confidence intervals that approach nominal coverage and to identify which design changes yield the largest precision gains.
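The claim that under-coverage worsens with more data can be illustrated with a small simulation. The sketch below is not the paper's method: it assumes a simple additive model in which each pipeline configuration (prompt, judge, temperature) shifts the measured score by a random offset, and item-level noise is Gaussian. The function name `naive_coverage` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def naive_coverage(n_items, sigma_config=0.03, sigma_item=0.4,
                   mu=0.7, n_reps=1000, seed=0):
    """Share of replications in which a naive 95% CI covers the true score mu.

    Assumed model (illustrative only): score = mu + config_shift + item_noise.
    Each replication uses ONE randomly drawn configuration, as a typical
    single-configuration evaluation would.
    """
    rng = np.random.default_rng(seed)
    # Configuration-level offset: does not shrink as n_items grows.
    shift = rng.normal(0.0, sigma_config, size=n_reps)
    # Item-level noise: averages out as n_items grows.
    noise = rng.normal(0.0, sigma_item, size=(n_reps, n_items))
    scores = mu + shift[:, None] + noise

    means = scores.mean(axis=1)
    # Naive standard error sees only item-level variance.
    se = scores.std(axis=1, ddof=1) / np.sqrt(n_items)
    covered = np.abs(means - mu) <= 1.96 * se
    return covered.mean()

if __name__ == "__main__":
    for n in (50, 500, 2000):
        print(n, naive_coverage(n))
```

Under these assumptions the naive interval narrows with the number of items while the configuration offset stays fixed, so coverage of the true score falls as the sample grows. An interval that also accounts for between-configuration variance would not degrade in this way.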
