
HFEPX Hub

Human Eval Papers

Updated from the current HFEPX corpus (Feb 26, 2026). This hub groups 36 papers. The most common evaluation modes are Human Eval and Automatic Metrics; the most frequently cited benchmark is retrieval, and the most common metric signal is agreement. The newest paper in this set is from Feb 25, 2026.


Why This Matters For Eval Research

  • Common evaluation patterns here: Human Eval, Automatic Metrics.
  • Benchmark signals emphasize retrieval and AIME.
  • Top reported metrics include agreement and accuracy.

Research Utility Snapshot

Human Feedback Mix

  • Pairwise Preference (9); see the aggregation sketch after this list.
  • Rubric Rating (4)
  • Critique Edit (1)
  • Expert Verification (1)
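
Pairwise preference dominates the feedback mix above. As a minimal sketch of how such judgments are often aggregated into per-system scores, the snippet below fits a Bradley-Terry model with MM updates; the (winner, loser) record format, the function name, and the toy data are illustrative assumptions, not drawn from any paper in this hub.

```python
from collections import defaultdict

def bradley_terry(comparisons, n_iter=100, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = defaultdict(float)          # total wins per item
    pair_counts = defaultdict(float)   # comparison counts per unordered pair
    items = set()
    for w, l in comparisons:
        wins[w] += 1.0
        pair_counts[frozenset((w, l))] += 1.0
        items.update((w, l))
    p = dict.fromkeys(items, 1.0)      # uniform initial strengths
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            # Items that never win collapse to 0 under plain MM updates;
            # real pipelines typically add a small prior or smoothing count.
            new_p[i] = wins[i] / denom if denom else p[i]
        scale = len(items) / sum(new_p.values())   # renormalize each round
        new_p = {i: v * scale for i, v in new_p.items()}
        if max(abs(new_p[i] - p[i]) for i in items) < tol:
            p = new_p
            break
        p = new_p
    return p

# Toy usage: A beats B twice, B beats C once, A beats C once.
scores = bradley_terry([("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")])
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # A ranked first
```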

Evaluation Modes

  • Human Eval (36)
  • Automatic Metrics (9)
  • Simulation Env (3)
  • LLM-as-Judge (1)

Top Benchmarks

  • Retrieval (2)
  • AIME (1)
  • CorrectBench (1)
  • CRUXEval (1)

Top Metrics

  • Agreement (11); see the metric sketch after this list.
  • Accuracy (7)
  • F1 (5)
  • Cost (2)
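
Agreement leads the metric counts, usually reported as a chance-corrected statistic between annotators. A minimal sketch, assuming binary labels from two annotators: the snippet below computes Cohen's kappa alongside a positive-class F1. Individual papers may instead report Krippendorff's alpha, raw percent agreement, or multi-class F1; function names and toy data here are illustrative.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n            # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)  # chance agreement
    if expected == 1.0:   # degenerate: both annotators used a single label
        return 1.0
    return (observed - expected) / (1 - expected)

def binary_f1(pred, gold, positive=1):
    """F1 for the positive class, treating one sequence as the reference."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy usage: two annotators labeling the same six items.
ann1 = [1, 0, 1, 1, 0, 1]
ann2 = [1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.667: substantial agreement
print(round(binary_f1(ann1, ann2), 3))     # 0.857, ann2 taken as reference
```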
