
SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0

Data freshness

  • Extraction: Fresh (refreshed Apr 13, 2026, 6:33 AM)
  • Metadata: Stale (refreshed Feb 19, 2026, 2:41 PM)
  • Extraction source: Persisted extraction (confidence 0.90)

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Abstract

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
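The abstract describes two concrete mechanisms: BPE's order-invariant, entropy-based uncertainty score and SCOPE's calibrated acceptance threshold. The sketch below illustrates one plausible reading of both based on the abstract alone; the averaging aggregation, the smoothed error check, and the helper `judge_prob_first_wins` are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of BPE scoring and selective threshold calibration, assuming a
# helper `judge_prob_first_wins(prompt, first, second)` that returns the judge's
# probability that the first-position response wins. The aggregation rule and
# the finite-sample correction below are assumptions for illustration.
import math
from typing import Callable, List, Tuple


def bpe_uncertainty(
    prompt: str,
    resp_a: str,
    resp_b: str,
    judge_prob_first_wins: Callable[[str, str, str], float],
) -> Tuple[float, float]:
    """Bidirectional Preference Entropy: query both orders, aggregate, score."""
    # Forward order: probability that response A (shown first) is preferred.
    p_fwd = judge_prob_first_wins(prompt, resp_a, resp_b)
    # Reversed order: probability that response B (now shown first) is preferred,
    # so the implied probability that A is preferred equals 1 - p_bwd.
    p_bwd = judge_prob_first_wins(prompt, resp_b, resp_a)
    p_a = 0.5 * (p_fwd + (1.0 - p_bwd))  # invariant to response order by construction

    # Binary entropy of the aggregated preference probability: higher = less certain.
    eps = 1e-12
    h = -(p_a * math.log(p_a + eps) + (1.0 - p_a) * math.log(1.0 - p_a + eps))
    return p_a, h


def calibrate_acceptance_threshold(
    cal_uncertainty: List[float],  # BPE uncertainty per calibration comparison
    cal_wrong: List[bool],         # True where the judge disagreed with the label
    alpha: float,                  # target error rate among accepted judgments
) -> float:
    """Toy selective calibration: return the largest uncertainty threshold whose
    accepted calibration subset keeps a smoothed error rate at or below alpha."""
    order = sorted(range(len(cal_uncertainty)), key=lambda i: cal_uncertainty[i])
    tau = -math.inf  # abstain on everything if no threshold satisfies the bound
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += int(cal_wrong[i])
        # (errors + 1) / (k + 1) is a simple finite-sample correction on the
        # empirical error among the k most-confident calibration points.
        if (errors + 1) / (k + 1) <= alpha:
            tau = cal_uncertainty[i]
    return tau
```

At test time, a comparison whose BPE uncertainty falls at or below the calibrated threshold would be accepted and any other comparison abstained on, mirroring the selective behaviour and coverage numbers the abstract reports.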

HFEPX Relevance Assessment

This paper has strong direct human-feedback and evaluation protocol signal and is suitable as a primary eval pipeline reference.

Best use

Primary benchmark and eval reference

Use if you need

A concrete protocol example with enough signal to inform rater workflow design.

Main weakness

No major weakness surfaced.

Trust level

High

Eval-Fit Score

75/100 • High

Use this as a primary source when designing or comparing eval protocols.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.
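As a quick illustration, a reader could model each provenance row below as a small record and flag the ones worth re-checking against the full text; the field names in this sketch are assumptions about how such a record might look, not an actual HFEPX schema.

```python
# Hypothetical record mirroring one row of the provenance panel below; these
# field names are illustrative assumptions, not part of the HFEPX schema.
from dataclasses import dataclass


@dataclass
class FieldProvenance:
    field: str        # e.g. "Human Feedback Types"
    state: str        # "strong" or "missing"
    confidence: str   # confidence band, e.g. "High" or "Low"
    source: str       # e.g. "Persisted extraction"
    evidence: str     # supporting snippet from the abstract


def needs_validation(row: FieldProvenance) -> bool:
    """Flag rows that should be validated from the full text before use."""
    return row.state == "missing" or row.confidence != "High"
```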

Human Feedback Types

strong

Pairwise Preference

Confidence: High · Source: Persisted extraction (evidenced)

Directly usable for protocol triage.

Evidence snippet: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

Evaluation Modes

strong

Automatic Metrics

Confidence: High · Source: Persisted extraction (evidenced)

Includes extracted eval setup.

Evidence snippet: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

Quality Controls

strong

Calibration

Confidence: High · Source: Persisted extraction (evidenced)

Calibration/adjudication style controls detected.

Evidence snippet: Despite their practicality, LLM judges remain prone to miscalibration and systematic biases.

Benchmarks / Datasets

strong

MT-Bench, LMSYS Chatbot Arena, RewardBench

Confidence: High · Source: Persisted extraction (evidenced)

Useful for quick benchmark comparison.

Evidence snippet: Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales.

Reported Metrics

strong

Error rate

Confidence: High · Source: Persisted extraction (evidenced)

Useful for evaluation criteria comparison.

Evidence snippet: Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$.

Rater Population

missing

Unknown

Confidence: Low · Source: Persisted extraction (missing)

Rater source not explicitly reported.

Evidence snippet: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Pairwise Preference
  • Rater population: Unknown
  • Unit of annotation: Pairwise
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: None
  • Quality controls: Calibration
  • Confidence: 0.90
  • Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

MT-Bench · LMSYS Chatbot Arena · RewardBench

Reported Metrics

error rate

Research Brief

Deterministic synthesis

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. HFEPX signals include Pairwise Preference and Automatic Metrics, with extraction confidence 0.90. Updated from the current HFEPX corpus.

Generated Apr 13, 2026, 6:33 AM · Grounded in abstract + metadata only

Key Takeaways

  • Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
  • Despite their practicality, LLM judges remain prone to miscalibration and systematic biases.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: MT-Bench, LMSYS Chatbot Arena, RewardBench.
  • Validate metric comparability (error rate).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
  • Despite their practicality, LLM judges remain prone to miscalibration and systematic biases.
  • To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score.

Why It Matters For Eval

  • Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
  • To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Pairwise Preference

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Pass: Quality control reporting appears

    Detected: Calibration

  • Pass: Benchmark or dataset anchors are present

    Detected: MT-Bench, LMSYS Chatbot Arena, RewardBench

  • Pass: Metric reporting is present

    Detected: error rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
