Benchmark Hub

MMLU + Automatic Metrics Benchmark Papers

Updated from current HFEPX corpus (Feb 27, 2026). 16 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: MMLU. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 16. Last published: Feb 25, 2026.

Research Narrative

This page tracks the 16 papers in the MMLU + Automatic Metrics hub. The dominant protocol signal is automatic metrics. The most frequent benchmarks are MMLU and MMLU-Pro, and the most frequent metrics are accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Benchmark Interpretation

  • MMLU appears in 100% of hub papers (16/16); use this cohort for benchmark-matched comparisons.
  • MMLU-Pro appears in 18.8% of hub papers (3/16); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • accuracy is reported in 31.3% of hub papers (5/16); compare with a secondary metric before ranking methods.
  • cost is reported in 18.8% of hub papers (3/16); compare with a secondary metric before ranking methods.
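The percentages in the two lists above are shares of the 16 hub papers. A minimal sketch of that arithmetic, using only the counts reported above. The function name `coverage_pct` is hypothetical, and rounding half up to one decimal is an assumption chosen because it reproduces the displayed figures (for example, 5/16 = 31.25% shown as 31.3%); the page does not state its rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_PAPERS = 16  # hub size reported on this page


def coverage_pct(count: int, total: int = TOTAL_PAPERS) -> Decimal:
    """Share of hub papers as a percentage, rounded half up to one decimal.

    The rounding mode is an assumption; Python's built-in round() uses
    round-half-even and would turn 31.25 into 31.2.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = Decimal(count) * 100 / Decimal(total)
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Counts taken from the benchmark and metric lists above.
counts = {"MMLU": 16, "MMLU-Pro": 3, "accuracy": 5, "cost": 3}
for name, n in counts.items():
    print(f"{name}: {coverage_pct(n)}% ({n}/{TOTAL_PAPERS})")
```

With only 16 papers, a single paper moves any share by 6.25 percentage points, which is one reason to read these figures alongside the raw counts.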

Researcher Checklist

  • Close the gap on papers with explicit human feedback: coverage is a replication risk (6.3% vs 45% target).
  • Tighten coverage of papers reporting quality controls: coverage is usable but incomplete (25% vs 30% target).
  • Maintain strength on papers naming benchmarks/datasets: coverage is strong (100% vs 35% target).
  • Maintain strength on papers naming evaluation metrics: coverage is strong (56.3% vs 35% target).
  • Close the gap on papers with a known rater population: coverage is a replication risk (6.3% vs 35% target).
  • Tighten coverage of papers with a known annotation unit: coverage is usable but incomplete (25% vs 35% target).
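The checklist compares each coverage figure against a target and attaches one of three labels. The page does not say how the labels are assigned, so the following is only a sketch under an assumed rule: "strong" at or above the target, "usable but incomplete" at or above half the target, and "replication risk" below that. The function name `coverage_status` and the `risk_ratio` cutoff of 0.5 are hypothetical; the cutoff was picked because it is consistent with the six rows above, not because the source confirms it.

```python
def coverage_status(coverage: float, target: float, risk_ratio: float = 0.5) -> str:
    """Label a coverage percentage against its target.

    risk_ratio is an assumed cutoff (fraction of the target) below which
    coverage is treated as a replication risk.
    """
    if coverage >= target:
        return "strong"
    if coverage >= risk_ratio * target:
        return "usable but incomplete"
    return "replication risk"


# (checklist item, coverage %, target %) as reported in the checklist above.
checklist = [
    ("explicit human feedback", 6.3, 45.0),
    ("quality controls", 25.0, 30.0),
    ("benchmarks/datasets named", 100.0, 35.0),
    ("evaluation metrics named", 56.3, 35.0),
    ("known rater population", 6.3, 35.0),
    ("known annotation unit", 25.0, 35.0),
]

for label, cov, tgt in checklist:
    print(f"{label}: {coverage_status(cov, tgt)} ({cov}% vs {tgt}% target)")
```

Any cutoff between roughly 0.2 and 0.7 of the target would reproduce the same six labels, so the exact value should not be read as the hub's own rule.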

Suggested Reading Order

  1. Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  3. D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  4. Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration

    Adds automatic metrics for broader coverage within this hub.

  5. ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition

    Adds automatic metrics for broader coverage within this hub.

  6. KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

    Adds automatic metrics for broader coverage within this hub.

  7. Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

    Adds automatic metrics for broader coverage within this hub.

  8. RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Rater population is under-specified (6.3% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
  • Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.
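The last limitation, matching on both benchmark and metric before comparing, can be sketched as a simple filter. This is an illustration only: the `PaperRecord` type, the `matched_pairs` helper, and the three placeholder records are hypothetical and do not correspond to actual hub entries or to any schema the hub exposes.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PaperRecord:
    """Hypothetical per-paper metadata record."""
    paper_id: str
    benchmarks: frozenset
    metrics: frozenset


def matched_pairs(papers):
    """Yield pairs of papers that share at least one benchmark AND one metric."""
    for a, b in combinations(papers, 2):
        shared_benchmarks = a.benchmarks & b.benchmarks
        shared_metrics = a.metrics & b.metrics
        if shared_benchmarks and shared_metrics:
            yield a.paper_id, b.paper_id, shared_benchmarks, shared_metrics


# Placeholder records for illustration, not real hub data.
papers = [
    PaperRecord("paper_a", frozenset({"MMLU"}), frozenset({"accuracy"})),
    PaperRecord("paper_b", frozenset({"MMLU", "MMLU-Pro"}), frozenset({"accuracy", "cost"})),
    PaperRecord("paper_c", frozenset({"MMLU"}), frozenset({"cost"})),
]

for a, b, bench, metric in matched_pairs(papers):
    print(a, b, sorted(bench), sorted(metric))
```

In this toy set, `paper_a` and `paper_c` share a benchmark but no metric, so they are excluded: a shared benchmark alone does not make two results comparable.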

Metric Brief

calibration

Coverage: 2 papers (12.5%) mention calibration.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Humanity's Last Exam
