
HFEPX Metric Hub

Accuracy Metric Papers

Updated from current HFEPX corpus (2026-04-13). This page tracks 60 papers for Accuracy. Use it to compare how accuracy is measured across human feedback and evaluation studies.

Papers: 60 · Last published: Apr 9, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: High.

Metric Coverage

100.0%

60 sampled papers include metric names.

Benchmark Anchoring

16.7%

10 of 60 papers include explicit dataset/benchmark anchors for fair comparison.

Quality Controls

5.0%

3 of 60 papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.
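
These three headline figures are simple ratios over the 60-paper sample: 60/60 = 100.0%, 10/60 ≈ 16.7%, and 3/60 = 5.0%. A minimal sketch of the computation, assuming a flat record layout (the `Paper` fields below are illustrative, not the actual HFEPX extraction schema):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Paper:
    # Illustrative fields only; the real HFEPX schema is not shown on this page.
    has_metric_names: bool
    has_benchmark_anchor: bool
    has_quality_controls: bool

def coverage(papers: List[Paper], predicate: Callable[[Paper], bool]) -> float:
    """Percentage of papers for which the predicate holds."""
    return 100.0 * sum(predicate(p) for p in papers) / len(papers)

# Toy corpus matching the reported counts: 60 papers, all with metric names,
# 10 with benchmark anchors, 3 with quality controls.
corpus = [Paper(True, i < 10, i < 3) for i in range(60)]
print(round(coverage(corpus, lambda p: p.has_metric_names), 1))      # 100.0
print(round(coverage(corpus, lambda p: p.has_benchmark_anchor), 1))  # 16.7
print(round(coverage(corpus, lambda p: p.has_quality_controls), 1))  # 5.0
```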

  • None of the 60 sampled papers is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Use the top metric-reliable papers first, then compare benchmark context in the matrix before drawing conclusions.
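
To make "avoid comparing metrics across incompatible eval setups" concrete, the sketch below groups matrix rows by shared (benchmark, eval mode) pairs and keeps only setups with at least two papers. The `benchmarks` and `eval_modes` field names are assumptions for illustration, not the HFEPX API:

```python
from collections import defaultdict

def comparable_groups(rows):
    """Group rows by (benchmark, eval mode); accuracy numbers should only
    be compared between papers that share both."""
    groups = defaultdict(list)
    for row in rows:
        for bench in row["benchmarks"]:
            for mode in row["eval_modes"]:
                groups[(bench, mode)].append(row["paper"])
    # Keep only setups with at least two papers to compare.
    return {k: v for k, v in groups.items() if len(v) > 1}

# Two rows mirrored from the matrix below (the only GSM8K overlap):
rows = [
    {"paper": "Don't Overthink It", "benchmarks": ["GSM8K"],
     "eval_modes": ["automatic_metrics"]},
    {"paper": "Gemma 4, Phi-4, and Qwen3", "benchmarks": ["GSM8K", "TruthfulQA"],
     "eval_modes": ["automatic_metrics"]},
]
print(comparable_groups(rows))
# {('GSM8K', 'automatic_metrics'): ["Don't Overthink It", 'Gemma 4, Phi-4, and Qwen3']}
```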

Why This Matters for Eval Research

  • Use this page to compare how accuracy is operationalized across benchmarks and rater setups.

Metric Notes

Metric-Driven Protocol Takeaways

  • Accuracy is most often paired with automatic metrics (60 of 60 papers) and only occasionally with human eval (3 papers); a counting sketch follows below.
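
A minimal sketch of that pairing count, assuming list-of-dicts records with `metrics` and `eval_modes` fields (illustrative names, not the HFEPX schema):

```python
from collections import Counter

def eval_mode_pairings(papers, metric="accuracy"):
    """Count eval modes that co-occur with a given metric across papers."""
    modes = Counter()
    for paper in papers:
        if metric in paper["metrics"]:
            modes.update(paper["eval_modes"])
    return modes

# Tiny illustrative corpus; on the full 60-paper sample this would yield
# automatic_metrics: 60, human_eval: 3 (matching the snapshot below).
papers = [
    {"metrics": ["accuracy"], "eval_modes": ["automatic_metrics"]},
    {"metrics": ["accuracy", "helpfulness"],
     "eval_modes": ["human_eval", "automatic_metrics"]},
]
print(eval_mode_pairings(papers))
# Counter({'automatic_metrics': 2, 'human_eval': 1})
```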

Metric Interpretation

  • accuracy: 60 papers
  • cost: 6 papers
  • latency: 4 papers
  • coherence: 3 papers

Benchmark Context

  • GSM8K: 2 papers
  • aot-psyphybench: 1 paper
  • ARC-Challenge: 1 paper

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper | Date | Metrics | Benchmarks | Eval Modes | Quality Controls
--- | --- | --- | --- | --- | ---
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale | Apr 6, 2026 | Accuracy | OmniDocBench | Automatic Metrics | Adjudication
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization | Apr 8, 2026 | Accuracy, Helpfulness | RewardBench | Human Eval, Automatic Metrics | Not reported
TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories | Apr 8, 2026 | Accuracy | TraceSafe Bench | Automatic Metrics | Not reported
AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages | Apr 9, 2026 | Accuracy | Not reported | Automatic Metrics | Calibration
Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents | Apr 9, 2026 | Accuracy | GSM8K | Automatic Metrics | Not reported
Training Data Size Sensitivity in Unsupervised Rhyme Recognition | Apr 9, 2026 | Accuracy, Agreement | Not reported | Automatic Metrics | Inter-annotator agreement reported
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models | Apr 8, 2026 | Accuracy, Latency | GSM8K, TruthfulQA | Automatic Metrics | Not reported
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors | Apr 8, 2026 | Accuracy | MedDialBench | Automatic Metrics | Not reported
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents | Apr 8, 2026 | Accuracy | GAIA, HumanEval+ | Automatic Metrics | Not reported
SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation | Apr 8, 2026 | Accuracy | Spider, SQLStructEval | Automatic Metrics | Not reported
Researcher Workflow

Checklist

  • Gap: Human feedback (present in 7 of 60 papers)
  • Gap: Quality controls (present in 3 of 60 papers)
  • Gap: Benchmarks (present in 10 of 60 papers)
  • Strong: Metrics (present in 60 of 60 papers)
  • Gap: Known rater population (present in 9 of 60 papers)
  • Gap: Known annotation unit (present in 10 of 60 papers)

The Gap/Strong labels follow mechanically from these counts; one plausible labeling rule is sketched below.
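
A minimal sketch, assuming a simple presence-ratio cutoff (the page does not state the actual threshold HFEPX uses; the 0.5 here is a guess):

```python
def flag(present: int, total: int, threshold: float = 0.5) -> str:
    """Label a checklist dimension as Strong or Gap.
    The 0.5 cutoff is an assumption; the real rule is not documented here."""
    return "Strong" if present / total >= threshold else "Gap"

# Counts copied from the checklist above.
checklist = {
    "Human feedback": 7,
    "Quality controls": 3,
    "Benchmarks": 10,
    "Metrics": 60,
    "Known rater population": 9,
    "Known annotation unit": 10,
}
for name, present in checklist.items():
    print(f"{flag(present, 60)}: {name} ({present} of 60)")
```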

Strengths

  • Metrics are present in 60 of 60 papers.

Known Gaps

  • Human feedback is present in 7 of 60 papers.
  • Quality controls are present in 3 of 60 papers.
  • Benchmarks are present in 10 of 60 papers.

Suggested Next Analyses

  • Review the most recent accuracy papers first, then compare benchmark context before reusing the metric (a minimal ordering sketch follows below).
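
A minimal ordering sketch for that triage, assuming each record carries `metrics`, `benchmarks`, and a `published` date (illustrative fields, not the HFEPX schema):

```python
from datetime import date

def triage(papers, metric="accuracy"):
    """Newest-first list of papers reporting the metric, with benchmark
    context attached for side-by-side comparison."""
    hits = [p for p in papers if metric in p["metrics"]]
    return sorted(hits, key=lambda p: p["published"], reverse=True)

# Two records mirrored from the matrix above.
papers = [
    {"title": "Don't Overthink It", "metrics": ["accuracy"],
     "benchmarks": ["GSM8K"], "published": date(2026, 4, 9)},
    {"title": "MinerU2.5-Pro", "metrics": ["accuracy"],
     "benchmarks": ["OmniDocBench"], "published": date(2026, 4, 6)},
]
for p in triage(papers):
    print(p["published"], p["title"], p["benchmarks"])
```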

Known Limitations

  • This synthetic persisted page was generated from extraction data because the cached metric payload for accuracy was missing.

Research Utility Snapshot

Top Metrics

  • Accuracy (60)
  • Cost (6)
  • Latency (4)
  • Coherence (3)

Evaluation Modes

  • Automatic Metrics (60)
  • Human Eval (3)
  • Llm As Judge (2)
  • Simulation Env (1)

Top Benchmarks

  • GSM8K (2)
  • aot-psyphybench (1)
  • ARC-Challenge (1)
  • DROP (1)

Agentic Mix

  • None (52)
  • Long Horizon (5)
  • Tool Use (2)
  • Multi Agent (1)

