
HFEPX Benchmark Hub

LiveCodeBench + Coding Benchmark Papers

Updated from the current HFEPX corpus (Mar 17, 2026). This benchmark page groups 4 papers. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: LiveCodeBench. Common metric signal: cost. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 18, 2026.

Papers: 4 · Last published: Feb 18, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: Developing.

High-Signal Coverage

100.0%

4 / 4 sampled papers are free of low-signal flags.

Replication-Ready Set

1

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

25.0%

1 paper reports calibration, adjudication, or inter-annotator agreement (IAA) controls.

  • 4 papers explicitly name benchmark datasets in the sampled set.
  • 2 papers report at least one metric term in metadata extraction.
  • Start with the ranked shortlist below before reading all papers.

Primary action: Use this page to map benchmark mentions first; wait for stronger metric/QC coverage before strict comparisons.
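
A minimal triage sketch of that primary action, assuming a hypothetical per-paper record with `benchmarks`, `metrics`, and `qc` fields (invented names, not this hub's actual export schema): map benchmark mentions first, then gate strict comparisons on metric and QC presence.

```python
# Hypothetical paper records; field names are assumptions for illustration.
papers = [
    {"title": "Paper A", "benchmarks": {"LiveCodeBench"},
     "metrics": {"Pass@1"}, "qc": set()},
    {"title": "Paper B", "benchmarks": {"LiveCodeBench", "AIME"},
     "metrics": {"cost", "latency"}, "qc": {"calibration"}},
]

def benchmark_matched(papers, benchmark):
    """First pass: map benchmark mentions only."""
    return [p for p in papers if benchmark in p["benchmarks"]]

def comparison_ready(papers):
    """Stricter gate: require at least one metric and one QC signal."""
    return [p for p in papers if p["metrics"] and p["qc"]]

cohort = benchmark_matched(papers, "LiveCodeBench")
print([p["title"] for p in cohort])                    # both toy papers match
print([p["title"] for p in comparison_ready(cohort)])  # only Paper B survives
```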

Why This Matters (Expanded)

Why This Matters For Eval Research

  • 100% of papers report explicit human-feedback signals, led by pairwise preferences.
  • Automatic-metrics evaluation appears in 25% of papers in this hub.
  • LiveCodeBench is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Notes (Expanded)

Protocol Takeaways

  • Most common quality-control signal is rater calibration (25% of papers).
  • Raters are mostly domain experts, and the annotation unit is commonly pairwise; use this to scope replication staffing.
  • Stratify by benchmark (LiveCodeBench vs AIME) before comparing methods (see the sketch after this list).
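
One way to operationalize the stratification bullet, sketched with invented method names and placeholder scores (none are results from the papers in this hub):

```python
from collections import defaultdict

# Placeholder (method, benchmark, score) rows; the scores are invented.
results = [
    ("method_a", "LiveCodeBench", 41.2),
    ("method_b", "LiveCodeBench", 38.7),
    ("method_a", "AIME", 55.0),
]

by_benchmark = defaultdict(list)
for method, benchmark, score in results:
    by_benchmark[benchmark].append((method, score))

# Rank methods only within a stratum, never across benchmarks.
for benchmark, rows in sorted(by_benchmark.items()):
    print(benchmark, sorted(rows, key=lambda r: r[1], reverse=True))
```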

Benchmark Interpretation

  • LiveCodeBench appears in 100% of hub papers (4/4); use this cohort for benchmark-matched comparisons.
  • AIME appears in 25% of hub papers (1/4); only one paper supports AIME-matched comparison, so treat it as single-paper evidence.

Metric Interpretation

  • cost is reported in 25% of hub papers (1/4); compare with a secondary metric before ranking methods.
  • latency is reported in 25% of hub papers (1/4); compare cost and latency jointly rather than ranking on either alone (see the sketch after this list).
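
A sketch of what "compare with a secondary metric" can mean in practice: rather than ranking on cost alone, screen systems on the cost/latency Pareto front. System names and numbers below are placeholders, not values from the papers.

```python
# (cost, latency_ms) per system; lower is better on both axes.
systems = {"sys_a": (0.8, 120.0), "sys_b": (1.1, 90.0), "sys_c": (1.2, 130.0)}

def pareto_front(systems):
    """Keep systems that no other system beats on both cost and latency."""
    front = {}
    for name, (cost, lat) in systems.items():
        dominated = any(
            c <= cost and l <= lat and (c, l) != (cost, lat)
            for c, l in systems.values()
        )
        if not dominated:
            front[name] = (cost, lat)
    return front

print(pareto_front(systems))  # sys_c is dominated; sys_a and sys_b both remain
```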

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

  • $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners (Mar 4, 2026)
    Eval modes: Automatic Metrics · Human feedback: Pairwise Preference · Metrics: Pass@1 · Quality controls: Not reported
  • Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling (Feb 18, 2026)
    Eval modes: Not reported · Human feedback: Expert Verification · Metrics: Not reported · Quality controls: Calibration
  • Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters (Feb 11, 2026)
    Eval modes: Not reported · Human feedback: Pairwise Preference · Metrics: Latency, Cost · Quality controls: Not reported
  • Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning (Sep 26, 2025)
    Eval modes: Not reported · Human feedback: Critique Edit · Metrics: Not reported · Quality controls: Not reported
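The "ranked by protocol completeness" ordering could plausibly be computed as below; the field names mirror the matrix columns, but the scoring rule itself is an assumption for illustration, not the hub's published formula.

```python
# Sketch: rank papers by how many protocol matrix fields are reported.
# Field names are assumptions mirroring the matrix columns above.
FIELDS = ("eval_modes", "human_feedback", "metrics", "quality_controls")

def completeness(paper):
    """Count matrix fields that carry a real value (not 'Not reported')."""
    return sum(1 for f in FIELDS if paper.get(f) not in (None, "Not reported"))

matrix = [
    {"title": "V_1", "eval_modes": "Automatic Metrics",
     "human_feedback": "Pairwise Preference", "metrics": "Pass@1",
     "quality_controls": "Not reported"},
    {"title": "Team of Thoughts", "eval_modes": "Not reported",
     "human_feedback": "Expert Verification", "metrics": "Not reported",
     "quality_controls": "Calibration"},
    {"title": "Step 3.5 Flash", "eval_modes": "Not reported",
     "human_feedback": "Pairwise Preference", "metrics": "Latency, Cost",
     "quality_controls": "Not reported"},
    {"title": "Critique-Coder", "eval_modes": "Not reported",
     "human_feedback": "Critique Edit", "metrics": "Not reported",
     "quality_controls": "Not reported"},
]

for paper in sorted(matrix, key=completeness, reverse=True):
    print(f"{completeness(paper)}/{len(FIELDS)}  {paper['title']}")
```

Under this rule $V_1$ scores 3/4 and the remaining papers score 2/4 or 1/4, consistent with the "Replication-Ready Set: 1" figure above.
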
Researcher Workflow (Detailed)

Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (100% vs 45% target).

  • Moderate: Papers reporting quality controls

    Coverage is usable but incomplete (25% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (100% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (50% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (50% vs 35% target).

  • Moderate: Papers with known annotation unit

    Coverage is usable but incomplete (25% vs 35% target).
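
The Strong/Moderate bands above look derivable from coverage versus target; the banding rule below is inferred from the labels on this page and is an assumption, not a documented spec.

```python
# Band a checklist item by comparing observed coverage to its target share.
def band(covered, total, target):
    coverage = covered / total
    return ("Strong" if coverage >= target else "Moderate"), coverage

checks = [  # (field, papers covered, papers sampled, target share)
    ("explicit human feedback", 4, 4, 0.45),
    ("quality controls", 1, 4, 0.30),
    ("named benchmarks/datasets", 4, 4, 0.35),
]
for name, covered, total, target in checks:
    label, cov = band(covered, total, target)
    print(f"{label}: {name} ({cov:.0%} vs {target:.0%} target)")
```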

Strengths

  • Strong human-feedback signal (100% of papers).
  • Most papers provide measurable evaluation context (100% benchmarks, 50% metrics).
  • Agentic evaluation appears in 50% of papers.

Known Gaps

  • No dominant metadata gap detected in current extraction coverage.

Suggested Next Analyses

  • Stratify by benchmark (LiveCodeBench vs AIME) before comparing methods, as sketched under Protocol Takeaways.
  • Track metric sensitivity by reporting both cost and latency.
  • Add inter-annotator agreement checks when reproducing these protocols (a minimal kappa sketch follows this list).
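
For the inter-annotator agreement bullet, a self-contained Cohen's kappa over two raters' labels is one concrete starting point; the labels below are toy pairwise-preference data, not annotations from these papers.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Agreement between two equal-length label sequences, chance-corrected."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined if expected == 1

rater_1 = ["A", "A", "B", "A", "B", "B"]  # toy pairwise-preference labels
rater_2 = ["A", "B", "B", "A", "B", "A"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.333: weak agreement
```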

Known Limitations

  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
  • Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (1)

Human Feedback Mix

  • Pairwise Preference (2)
  • Critique Edit (1)
  • Expert Verification (1)

Top Benchmarks

  • LiveCodeBench (4)
  • AIME (1)
  • BrowseComp (1)
  • CodeContests (1)

Top Metrics

  • Cost (1)
  • Latency (1)
  • Pass@1 (1)
